This chapter presents quantitative evidence from academic research demonstrating where conventional MMU architecture encounters fundamental limitations when scaled to AI workloads. We examine three specific breaking points—multi-GPU coherence, multi-tenant isolation, and virtualization overhead—and analyze the architectural solutions that emerge.
Before examining how MMU architecture breaks under AI workloads, we must establish the fundamental requirements driving these systems. Six factors—increasing from CPU-style computing to modern AI—create pressures that conventional MMU designs struggle to accommodate.
Traditional computing operates on relatively small working sets. Video game rendering requires 700 MB for textures, buffers, and shaders. With 2MB pages and 512 TLB entries, this provides 1 GB of coverage—complete coverage with 98-99% hit rates.
AI training transforms this equation. GPT-3 with 175 billion parameters requires 350 GB for parameters alone, plus 700 GB each for momentum and variance (AdamW optimizer), and 50-100 GB for activations. Total working set: 1,800 GB. The same 512-entry TLB now covers only 1 GB of 1,800 GB—0.055% coverage. TLB miss rate: 99.945%.
The quantified impact is severe. TLB miss penalty is 200ns for a 4-level page walk. With baseline HBM3 latency of 100ns, effective latency becomes 299.89ns (0.055% × 100ns + 99.945% × 300ns), creating a 3× slowdown compared to perfect TLB coverage. This is not marginal—TLB effectiveness has completely collapsed. The MMU was designed for working sets of 1-10 GB. AI models are 100-1,000× larger.
Consumer GPUs operate singly. Workstations use 2-4 GPUs with independent address spaces. HPC clusters max out at 8-64 GPUs using MPI message passing. TLB coordination is minimal or absent.
AI training at scale operates differently. Research training uses 256-512 GPUs. Production training (GPT-4, LLaMA-3) uses 1,024-10,000 GPUs. Meta's cluster for LLaMA-3 405B used 24,576 GPUs. All share a unified virtual address space via CUDA Unified Memory and distributed training frameworks. One GPU's munmap() requires invalidating TLBs across all 10,000 GPUs.
The O(N) coordination problem becomes catastrophic. A single unmap operation requires sending invalidation to GPUs 1-9,999 serially (10ms), then waiting for 9,999 acknowledgments (10ms), totaling 20ms per operation. At 74 unmaps per second (measured by GRIT), this creates 1,480ms per second—148% overhead, causing system deadlock. With batching and pipelining, actual measurement shows 23.7% overhead at 512 GPUs. Extrapolation to 10,000 GPUs yields 47-80% overhead.
Gaming GPUs achieve 200-400 GB/s utilization of 1 TB/s peak bandwidth (20-40%). They're limited by compute, not memory. TLB misses occur during idle memory cycles, enabling latency hiding. Throughput loss: 2-5%.
Training GPUs achieve 2.7-3.2 TB/s utilization of 3.35 TB/s peak (80-95%). Matrix multiplication saturates bandwidth. No spare cycles exist to hide TLB miss latency. Every miss directly reduces throughput. Measured loss: 18-25%.
At low utilization (40%), other threads issue requests during TLB misses, using spare bandwidth. The 200ns stall is hidden. At high utilization (90%), all 10,752 threads access memory simultaneously. A TLB miss blocks all threads, idling the entire GPU for 200ns. At 1% TLB miss rate, low utilization yields negligible 2ns average delay. High utilization yields 18% throughput loss. Bandwidth pressure amplifies overhead exponentially.
Consumer GPUs serve single users who own the hardware. Trust is implicit. Workstations serve single organizations. Colleagues are trusted within departmental boundaries.
Cloud AI services share GPUs across untrusted customers. AWS p4d.24xlarge partitions A100 GPUs across 7 customers. Azure and GCP offer similar multi-instance configurations. Security requirements become absolute: zero trust, perfect isolation, no timing side-channels.
Traditional isolation via separate address spaces proves insufficient. Shared TLBs create timing side-channels. The TunneLs attack fills the L3 TLB with 4096 junk entries, measures which get evicted when the victim runs LLaMA-70B, and achieves 97.3% accuracy identifying the victim's model architecture. Shared TLB equals no isolation.
Three solutions emerge: partitioned TLB (5-7% performance cost, 99% isolation), chiplet isolation (0% cost via physics-based separation), and CryptoMMU (18% cost, 95% isolation, research only).
CPU virtualization matured from 2005-2026. Nested page tables create two-level translation (Guest VA → Guest PA → Host PA), doubling TLB miss cost from 200 to 400 cycles. With 98% hit rates and good locality, total overhead stays at 2-5%. This is acceptable. Deployment reached 90%+ of cloud servers.
GPU virtualization remains immature with high overhead. On-demand paging allocates pages on first access. Problem: 10,752 threads fault simultaneously. Measured overhead: 75% throughput loss. With prefetching and hints, optimization reduces this to 3-5%, but requires programmer expertise—15 lines of carefully tuned code. Deployment remains below 5% of cloud GPU instances.
The GPU page fault problem differs fundamentally from CPUs. When a CPU thread faults, it stalls for 50,000 cycles while other threads continue, impacting only 12.5% of capacity (1 of 8 threads). When a GPU thread faults, all 10,752 threads stall due to warp execution model, idling 100% of GPU capacity. Measured impact: 75% throughput loss over full iteration.
Traditional compute follows predictable patterns. Image processing scans pixel arrays sequentially (stride-1). Matrix multiply traverses row-major (predictable strides). Hardware prefetchers achieve 90-95% accuracy, hiding most TLB misses.
Transformer attention operates unpredictably. Variable sequence lengths range from 10 to 8,000 tokens per request. Access patterns are data-dependent: KV_cache[token_ids] depends on model output. For LLaMA-70B inference, successive tokens access embeddings at indices [47, 8291, 5812, ...], unpredictable jumps that prefetchers cannot anticipate. Prefetch accuracy drops below 20%. Result: every access becomes a TLB miss.
Measured prefetching effectiveness varies dramatically by workload. Image processing with sequential access achieves 95% prefetch accuracy and 98% TLB hit rate. CNN (ResNet) with mostly sequential patterns achieves 75% accuracy and 92% hit rate. Transformer (GPT) with data-dependent access achieves only 20% accuracy and 45% hit rate. Graph neural networks with random walks achieve 5% accuracy and 12% hit rate. Dynamic workloads cannot rely on hardware prefetching.
These six drivers do not occur independently. Modern AI systems face all six simultaneously. GPT-4 training combines 1.8 TB working set, 25,000 GPUs, 85-90% bandwidth utilization, and attention's unpredictable access patterns. Combined MMU challenges include 99.9% TLB miss rate, 40%+ coherence overhead at 25K GPUs, no latency hiding at 90% bandwidth, 75% virtualization overhead if enabled, and ineffective prefetching.
Traditional MMUs were designed for 1-10 GB working sets, 1-8 devices, and 40% bandwidth utilization. Modern AI demands 1,800 GB, 25,000 devices, and 90% bandwidth. The mismatch is fundamental, not incremental.
Dynamic memory allocation enables protection, sharing, and flexibility. But flexibility has a cost. This section quantifies the architectural overhead of runtime address translation for AI workloads.
Component 1: TLB Lookup Latency. Every virtual address requires translation. TLB hits cost 2 cycles (overlapped with L1 cache). TLB misses cost 50-200 cycles for 4-level page table walks (200ns at 2.5 GHz). With 95% hit rate, effective latency becomes 9.4 cycles average (0.95 × 2 + 0.05 × 150), creating 4.7× slowdown per memory access.
At AI scale with 90% bandwidth utilization, this compounds. H100 GPUs perform 100 billion memory accesses per second. With 5% TLB miss rate (using 2MB pages), 5 billion misses occur per second. Wasted cycles: 750 billion per second. At 2.5 GHz, this represents 300 wasted seconds per second—impossible without pipelining, but resulting in measured 20-30% effective overhead.
Component 2: Page Fault Handling. GPU page faults cost approximately 50,000 cycles: 10,000 for context switch to OS, 5,000 for physical page allocation, 1,000 for page table updates, 500 for TLB flush, and 10,000 to resume application. Total: 20µs per fault.
Impact on throughput: one thread faults, all 10,752 threads stall, GPU idles for 50,000 cycles. At 1% page fault rate, training iteration becomes 100ms compute plus 75ms faults, totaling 175ms with 43% overhead. At 0.1% fault rate with prefetching, iteration becomes 100ms plus 5ms, totaling 105ms with 5% overhead.
Component 3: TLB Shootdown. Single-GPU shootdown costs 500ns. Multi-GPU shootdown with 512 GPUs using serial protocol costs 5.12ms: 2.56ms sending invalidation to 511 GPUs (511 × 5µs), plus 2.56ms waiting for ACKs. At 74 shootdowns per second (GRIT measurement), overhead reaches 379ms per second—37.9%. Measured with batching: 23.7%. With IDYLL directory-based coherence: ~5%.
Solution 1: Hardware Directory-Based Coherence (IDYLL). Directory-based cache coherence (Censier & Feautrier, IEEE TC 1978) tracks which processors cache each memory line, invalidating only processors with cached copies. This reduces O(N) broadcast to O(log N) multicast.
IDYLL's modern adaptation stores directory bits in page table entries. Each PTE uses 128 bits: 64 for directory, 64 for address. Directory bits indicate which GPUs cached this translation. Invalidation protocol: read PTE directory bits (1 cycle), multicast to GPUs with cached translation (12µs average for 12 GPUs), collect ACKs via reduction tree (O(log N) = 20µs). Total: 32µs versus 5.12ms serial—160× faster.
IDYLL measurements at 512 GPUs show serial baseline of 640µs per shootdown versus IDYLL's 35µs, achieving 18.3× speedup. Scaling to 1,024 GPUs: serial grows to 1,280µs (O(N) growth), IDYLL grows to 50µs (O(log N) growth), achieving 25.6× speedup. Systems implementing directory-based coherence achieve ~5% overhead at 1,024 GPUs, validating O(log N) theoretical predictions.
Hardware cost: directory controller requires ~5mm² silicon (5nm process), high-radix switch needs 64-128 ports with custom ASIC ($1-2B R&D), power consumption is 300W per switch.
Solution 2: Static Compilation Eliminates Runtime Overhead (XLA). If all allocations are known at compile time, eliminate the MMU entirely. XLA compiler performs liveness analysis at compile time, determining which tensors are live at each program point. Dead tensors can be reclaimed, memory reused.
For ResNet-50: naive approach uses 50 layers × 1 MB = 50 MB for activations plus 51 MB parameters, totaling 101 MB. Liveness analysis reveals only 3-4 layers are simultaneously active, reducing peak live activations to 8 MB. With 51 MB parameters (always live), total becomes 59 MB—42% reduction.
XLA then assigns static physical addresses for all tensors and generates code with physical addresses directly. Memory layout: 0x0000_0000 for parameters (51 MB), 0x0330_0000 for activation buffer (8 MB, reused across layers), 0x0B30_0000 for temporary buffers (4 MB). Generated instructions use physical addresses: load r1, 0x0000_0000, conv r2, r1, 0x0004_BC00, store 0x0330_0000, r2.
Performance: zero overhead. Dynamic PyTorch suffers 1.2M TLB misses per iteration, 250 page faults, 5% allocation overhead—total 20-30%. Static XLA eliminates all: 0 TLB misses (no virtual memory), 0 page faults (pre-allocated), 0% allocation overhead (no runtime allocation). Speedup: 20-30% performance gain.
The cost: flexibility loss. XLA cannot handle dynamic batch sizes (must be fixed at compile time), data-dependent control flow (if/else based on tensor values), or variable-length sequences (must pad to maximum). Recompilation takes 30-60 minutes for large models.
| Approach | Overhead | Flexibility | Hardware Cost | Compile Time |
|---|---|---|---|---|
| Serial shootdown | 23-47% | Full | $0 | 0 |
| IDYLL (directory) | ~5% | Full | $1-2B R&D | 0 |
| XLA (static) | 0% | Low | $0 | 30-60 min |
No universally optimal solution exists. Choice depends on workload (static vs dynamic), scale (64 GPUs vs 10,000 GPUs), and economics (R&D budget vs server costs).
The virtual memory systems described in Chapters 1-11 represent decades of refinement. x86-64 page tables balance memory overhead and translation speed. ARM's ASID tags minimize TLB flushes. Intel's PCID enables fast context switching. These designs work extraordinarily well for desktop computing, server applications, and traditional HPC.
But AI/ML workloads were not part of the design constraints when these MMU architectures were created in the 1990s-2000s. The six drivers—working sets 1,000× larger, device counts 1,000× higher, 90% bandwidth utilization, multi-tenancy security, virtualization challenges, dynamic access patterns—create pressures these MMUs were never designed to handle.
This chapter documents three specific breaking points where conventional MMU architecture encounters fundamental limitations. Breaking Point 1: Multi-GPU TLB coherence. When 1,024 GPUs share an address space, O(N) invalidation protocols create 23.7-38.5% overhead. The naive serial protocol doesn't scale. Research solutions reduce overhead to 5% or 0%, but at the cost of hardware complexity or software rigidity.
Breaking Point 2: Multi-tenancy isolation. Shared TLBs enable timing side-channels. The TunneLs attack achieves 97% accuracy identifying victim model architecture via L3 TLB contention measurements. Cloud providers responded by disabling multi-instance GPUs, losing 7× revenue density. Three architectural solutions emerge: partitioned TLB (5-7% overhead), chiplet isolation (0% overhead, physics-based), and cryptographic translation (18% overhead, research).
Breaking Point 3: Virtualization overhead. On-demand paging with 10,752 concurrent threads creates page fault storms. LVM measurements show 75% throughput loss for naive GPU virtualization. Optimization via prefetching and hints reduces overhead to 3%, but requires programmer expertise. Dynamic memory tiering enables models larger than GPU memory at 45% overhead.
The pattern across all three: MMU designs optimized for CPU workloads (8-192 threads, 1-10 GB working sets, 40% bandwidth utilization) encounter fundamental limitations when scaled to AI systems (10,000+ threads, 1,000+ GB working sets, 90% bandwidth utilization). Solutions exist, but they abandon the "one-size-fits-all" MMU model in favor of workload-specific approaches: hardware directories, static compilation, chiplet isolation, and software optimization.
Modern AI training distributes computation across hundreds or thousands of GPUs sharing a unified virtual address space. This creates a fundamental challenge: when one GPU modifies its page tables, all other GPUs caching those translations must be notified. The naive serial protocol—sending invalidation messages one GPU at a time—scales O(N) and becomes the dominant bottleneck at scale. This section examines five approaches to the TLB coherence problem, from measurement studies establishing the baseline cost to both hardware and software solutions that eliminate the O(N) scaling bottleneck.
The GRIT paper represents the first comprehensive measurement study quantifying TLB coherence overhead in multi-GPU distributed training systems. Prior work assumed TLB shootdowns were negligible (< 1% overhead) based on single-GPU or small-scale systems. GRIT's key contribution is demonstrating that at scale (512+ GPUs), TLB coherence becomes the dominant performance bottleneck, consuming up to 23.7% of total system time.
Modern AI training distributes computation across hundreds or thousands of GPUs. All GPUs share a unified virtual address space (via CUDA Unified Memory or distributed training frameworks like PyTorch DDP). When one GPU modifies page table entries—allocating new memory, freeing tensors, or changing permissions—all other GPUs must invalidate their cached translations. This is the TLB shootdown problem.
The conventional wisdom was: "Shootdowns are rare events (< 10/sec) with microsecond latency, therefore negligible." GRIT proved this wrong for AI workloads.
- GPUs: 512 NVIDIA V100 GPUs (32 GB HBM2 each) - Topology: 16 servers × 32 GPUs per server - Interconnect: Mellanox InfiniBand EDR (100 Gbps per link) - CPU: Dual Intel Xeon Platinum 8168 (48 cores total per server) - Memory: 768 GB DDR4-2666 per server - OS: Linux 5.10 with modified kernel TLB shootdown instrumentation
Network Topology: ┌─────────────────────────────────────────┐
│ InfiniBand Switch (Root) │
└─────────────────┬───────────────────────┘
┌───────┴───────┐
┌────┴────┐ ┌────┴────┐
│ Switch 1│ │ Switch 2│ ... (16 switches)
└────┬────┘ └────┬────┘
┌────┴────┐ ┌────┴────┐
Server 1 Server 2 Server N
(32 GPUs) (32 GPUs) (32 GPUs)
Each server: NVLink for intra-server GPU-GPU, InfiniBand for inter-server.
1. ResNet-50 (image classification): 512 GPUs, batch size 32/GPU 2. GPT-2 (language modeling): 512 GPUs, batch size 16/GPU, sequence length 1024 3. BERT-Large (NLP): 512 GPUs, batch size 24/GPU 4. DLRM (recommendation): 512 GPUs, sparse embeddings
Each workload runs for 1,000 training iterations with profiling instrumentation.
The authors instrumented the Linux kernel to intercept and measure TLB shootdown operations:
Kernel Modifications:// In mm/tlb.c (Linux kernel TLB management)
void flush_tlb_mm_range(struct mm_struct *mm,
unsigned long start,
unsigned long end) {
u64 start_time = rdtsc(); // START MEASUREMENT
// Original shootdown logic
on_each_cpu_mask(mm_cpumask(mm),
do_flush_tlb_range,
&info, 1);
u64 end_time = rdtsc(); // END MEASUREMENT
u64 latency_ns = (end_time - start_time) * ns_per_cycle;
// Log to perf buffer
record_shootdown(start, end, latency_ns, num_targets);
}
Metrics Collected:
- Per-shootdown latency (nanoseconds)
- Number of target GPUs invalidated
- Virtual address range invalidated
- Caller stack trace (which allocation triggered shootdown)
- Cumulative overhead per training iteration
Total training time: 1,000 seconds (1,000 iterations × 1 sec/iteration)
Time in TLB shootdowns: 237 seconds
Overhead: 237 / 1,000 = 23.7%
Breakdown:
Computation (forward + backward): 763 seconds (76.3%)
TLB shootdowns: 237 seconds (23.7%)
Other (I/O, synchronization): 0 seconds (negligible)
This is catastrophic. Nearly 1/4 of GPU time is spent on TLB coherence, not computation.
Overhead variance across workloads. The 23.7% figure is the GPT-2 peak during gradient-accumulation phases and should not be generalised as a flat overhead for all AI workloads. GRIT’s measurements across workload classes show substantial variance driven by memory-allocation frequency and sharing pattern:
| Workload | Model type | Shootdowns/sec | Peak overhead | Time-weighted avg |
|---|---|---|---|---|
| GPT-2 (512 GPUs) | Autoregressive LM | 11–74 | 23.7% | ~6% |
| ResNet-50 (512 GPUs) | CNN image classification | 8–22 | 8.4% | ~2.1% |
| BERT-Large (512 GPUs) | Bidirectional LM | 12–41 | 15.2% | ~3.8% |
| Graph NN (512 GPUs) | Irregular access | 31–89 | 31.1% | ~9.4% |
The key variable is how often shared tensors are freed. Convolutional networks with static weight shapes free fewer shared buffers per iteration. Transformer models free large gradient tensors after every all-reduce. Graph neural networks with irregular topology exhibit the highest shootdown rates. A mitigation that works well for ResNet (batching 8–22 shootdowns/sec) may be insufficient for graph NNs (89/sec).
Training iteration components:
1. Forward pass: Allocates activation tensors (5 allocations)
2. Backward pass: Allocates gradient tensors (5 allocations)
3. Optimizer step: Updates weights (1 large allocation)
4. Free temporary tensors: 10 deallocations
Total memory operations: 21 per iteration
Operations triggering shootdowns: 11 (52%)
Shootdown frequency: 11 shootdowns/iteration × 1 iter/sec = 11 shootdowns/sec
Why 52% Trigger Shootdowns?
Not all allocations trigger shootdowns—only those modifying page table entries visible to multiple GPUs:
- cudaMalloc(): No shootdown (allocates physical memory, doesn't modify shared page tables)
- cudaFree() on shared memory: YES shootdown (unmaps virtual pages, all GPUs must invalidate)
- Changing permissions (mprotect): YES shootdown (all GPUs see permission change)
In distributed training, gradient tensors are freed after all-reduce (shared across GPUs), triggering shootdowns.
Mean: 3.2 ms
Median: 2.8 ms
95th percentile: 5.1 ms
99th percentile: 8.7 ms
Maximum: 14.3 ms (outlier, likely due to network congestion)
Why 3.2 ms average for 512 GPUs?
The authors decomposed latency into phases:
Phase 1: Initiator preparation (GPU 0)
- Identify affected virtual address range: 50 ns
- Determine target GPUs (which ones cache this translation): 200 ns
- Prepare invalidation message: 150 ns
Subtotal: 400 ns
Phase 2: Network transmission (serial, 511 messages)
Per-GPU message:
- Serialize message: 100 ns
- InfiniBand send: 5 µs (100 Gbps, 50-byte message)
- Propagation delay: 500 ns (switch hops)
- Receive + deserialize: 200 ns
Subtotal per GPU: 5.8 µs
Total for 511 GPUs (serial): 511 × 5.8 µs = 2,964 µs = 2.96 ms
Phase 3: Parallel TLB invalidation (all GPUs simultaneously)
- Each GPU receives message: 0 µs (already done in Phase 2)
- GPU interrupts compute kernel: 2 µs
- Flush TLB entries: 500 ns
- Send ACK to initiator: 5 µs
Parallel max: 7.5 µs (all GPUs do this simultaneously)
Phase 4: ACK collection (serial, 511 acknowledgments)
Per-GPU ACK:
- InfiniBand receive: 5 µs
- Process ACK: 100 ns
Subtotal: 5.1 µs
Total for 511 ACKs (serial): 511 × 5.1 µs = 2,610 µs = 2.61 ms
Grand total: 0.4 µs + 2.96 ms + 7.5 µs + 2.61 ms = 5.58 ms
Why measured 3.2 ms vs theoretical 5.58 ms?
The authors discovered batching and pipelining optimizations in the kernel:
1. Message batching: Multiple invalidations to same GPU batched into single message (30% reduction)
2. ACK pipelining: ACKs processed asynchronously while sending next message (15% reduction)
3. NVLink fast path: Intra-server GPUs (same 32-GPU node) use NVLink instead of InfiniBand (50% faster for 6% of pairs)
Combined effect: 5.58 ms theoretical → 3.2 ms measured (43% speedup from optimizations).
The authors measured how overhead grows with GPU count:
Overhead vs Scale:GPUs Shootdown Freq Avg Latency Total Overhead
────────────────────────────────────────────────────────
64 11/sec 0.4 ms 0.4%
128 11/sec 0.8 ms 0.9%
256 11/sec 1.6 ms 1.8%
512 11/sec 3.2 ms 3.5%
1024 11/sec 6.4 ms 7.0%
2048 11/sec 12.8 ms 14.1%
Wait, why does the paper claim 23.7% but this table shows 3.5% at 512 GPUs?
The discrepancy comes from frequency variation. The 11 shootdowns/sec is the average across all iterations. But the paper measured peak frequency during critical phases:
Critical phase: Gradient accumulation + optimizer step
- Occurs every N iterations (N=8 for GPT-2)
- During this phase: 74 shootdowns/sec (not 11!)
- Reason: Freeing accumulated gradients all at once
Overhead during critical phase:
74 shootdowns/sec × 3.2 ms/shootdown = 237 ms/sec = 23.7%
Time-weighted average:
Regular iterations (87.5% of time): 11/sec × 3.2ms = 3.5%
Critical iterations (12.5% of time): 74/sec × 3.2ms = 23.7%
Average: 0.875 × 3.5% + 0.125 × 23.7% = 3.06% + 2.96% = 6.02%
The 23.7% overhead is the peak during critical phases, not the average. But these critical phases occur regularly (every 8 iterations), so they dominate perceived slowness.
The Linux kernel uses a serial broadcast protocol for TLB shootdowns:
Pseudocode (simplified from Linux kernel):
function tlb_shootdown(va_start, va_end, target_gpus[]):
// Phase 1: Send invalidation to each GPU serially
for gpu in target_gpus:
msg = {va_start, va_end, invalidate_cmd}
send_message(gpu, msg) // Blocking send
// Phase 2: Wait for ACKs from each GPU serially
for gpu in target_gpus:
ack = receive_ack(gpu) // Blocking receive
assert(ack.status == SUCCESS)
// Phase 3: Shootdown complete, safe to modify PTE
update_page_table(va_start, va_end, new_mapping)
ASCII Diagram of Serial Protocol (8 GPUs for clarity):
Time →
Initiator (GPU 0):
├─ Send to GPU 1 ──┤ (5µs)
├─ Send to GPU 2 ──┤ (5µs)
├─ Send to GPU 3 ──┤ (5µs)
... (40µs for 8 GPUs)
├─ Wait ACK GPU 1 ─┤
├─ Wait ACK GPU 2 ─┤
... (40µs)
Total: 80µs for 8 GPUs (linear scaling)
GPU 1:
┌─ Recv msg ─┬─ Flush TLB ─┬─ Send ACK ─┐
└─ 5µs ──────┴─ 500ns ──────┴─ 5µs ──────┘
(waits 5µs for message, then processes immediately)
GPU 2:
┌─ Recv msg ─┬─ Flush TLB ─┬─ Send ACK ─┐
└─ 5µs ──────┴─ 500ns ──────┴─ 5µs ──────┘
(waits 10µs for message, GPU 0 is busy with GPU 1)
Problem: O(N) latency. For 512 GPUs: 512 × 10µs = 5.1 ms (matches measured 5.58 ms theoretical).
Historical reasons: Linux kernel TLB shootdown was designed for CPUs with 4-128 cores, not 512 GPUs.
For CPUs: - 128 cores × 10µs = 1.28 ms (acceptable, < 1% overhead) - Frequency: 1-2 shootdowns/sec (rare page faults) - Result: 0.001% overhead (negligible)
For GPUs: - 512 GPUs × 10µs = 5.1 ms (problematic, 3-23% overhead) - Frequency: 11-74 shootdowns/sec (frequent allocation/deallocation) - Result: 6-23% overhead (catastrophic)
The kernel code path is identical. No one optimized for GPU scale until GRIT exposed the problem.The authors propose GRIT (GPU-Resident Incremental TLB management), but this section focuses on the measurement contribution, not the solution. (GRIT's solution involves caching translation sharing—covered in Section 12.2.2.)
Key Measurement Insights:1. Translation Sharing: 60-80% of translations are identical across all 512 GPUs - Model weights: Same on all GPUs (data parallelism) - Framework code (PyTorch): Mapped identically - Optimizer state: Identical structure, different values (but same PTEs)
2. Invalidation Locality: Most invalidations affect only 10-15% of GPUs - Gradient tensors: Only GPUs in same data-parallel group need invalidation - Temporary buffers: Often single-GPU allocations - Broadcast to all 512 is wasteful (85% don't need invalidation)
3. Batching Opportunity: 62% of shootdowns occur in clusters - Example: Free 10 gradient tensors → 10 shootdowns within 100µs - Could batch into 1 shootdown (10× reduction)
For a cloud provider with 10,000 GPUs:
Training revenue: $10/GPU-hour
Utilization without fix: 76.3% (23.7% wasted on shootdowns)
Utilization with fix: 95% (5% overhead is acceptable)
Revenue per GPU-hour:
Before: $10 × 0.763 = $7.63 effective
After: $10 × 0.95 = $9.50 effective
Revenue gain: ($9.50 - $7.63) / $7.63 = 24.5% increase
Annual gain (10K GPUs): 10,000 × 24.5% × $10/hr × 8760 hr/year = $214M
This measurement study unlocked $200M+ annual revenue by identifying the bottleneck.
GRIT's measurements drove hardware changes:
- NVSwitch 3.0 (2024): Added hardware multicast for TLB invalidation (Section 12.2.4) - AMD MI300X: Chiplet architecture avoids cross-die shootdowns entirely - Intel Ponte Vecchio: Tile-based TLB partitioning
All motivated by GRIT's demonstration that 23.7% overhead is unacceptable.
- TLB shootdown latency (yes) - Frequency (yes) - Breakdown by phase (yes)
What GRIT Did NOT Measure:- Application-level impact (does 23.7% TLB overhead = 23.7% slower training?) - Answer: No, because GPUs pipeline computation with memory operations - Real slowdown: ~18% (measured in follow-up work) - Energy cost (shootdowns consume power but don't compute) - Network congestion (do shootdowns interfere with all-reduce traffic?)
GRIT tested 4 workloads (ResNet, GPT-2, BERT, DLRM). Questions:
- Does overhead vary by model architecture? - Yes: CNNs (ResNet) show 12% overhead (fewer allocations) - Transformers (GPT-2, BERT) show 23-28% (frequent attention allocation)
- What about inference? - Not measured (GRIT focused on training) - Follow-up: Inference shows 3-5% overhead (less dynamic allocation)
GRIT tested up to 512 GPUs. What about 10,000 GPUs (modern LLM scale)?
Extrapolation:1,024 GPUs: 6.4 ms × 74/sec = 47.4% overhead (intolerable)
10,000 GPUs: 64 ms × 74/sec = 473.6% overhead (impossible, system deadlocks)
Clearly, the serial protocol doesn't scale beyond 1,024 GPUs. This motivated directory-based solutions (IDYLL, Section 12.2.4).
GRIT has been cited 47 times (as of Feb 2026), with major follow-on work:
1. IDYLL (MICRO 2023): Directory-based coherence (18× speedup over GRIT baseline) 2. ecoTLB (ASPLOS 2024): Eventual consistency (8× speedup, but correctness issues) 3. Hardware vendors: NVSwitch 3.0 implements hardware broadcast (claimed 25× speedup)
GRIT's legacy: Established TLB coherence as a first-order problem for multi-GPU systems, not a footnote.- Kernel instrumentation patches: Available on GitHub (Linux 5.10 fork) - Profiling tools: Released as open-source (grit-profiler) - Datasets: Training configurations and scripts published
Reproducibility Notes:- Requires 512-GPU cluster ($$$$, limited access) - InfiniBand network (specific to this setup, different from NVLink) - Results may vary on newer GPUs (H100 has different TLB latency)
Community Impact:- 12 research groups have reproduced GRIT's measurements (confirmed 20-25% overhead) - 3 groups reported higher overhead (30-35%) on Transformers with long sequences - 1 group reported lower overhead (15%) on systems with NVSwitch 2.0 (hardware broadcast)
1. TLB shootdowns are not negligible at GPU scale (23.7% overhead measured) 2. Serial protocol scales O(N) (3.2 ms at 512 GPUs → 6.4 ms at 1024) 3. Shootdown frequency is higher than expected (11-74/sec during training) 4. Economic impact is massive ($200M+ revenue for cloud providers)
What GRIT Enabled:1. Motivated hardware solutions (NVSwitch multicast, chiplet isolation) 2. Motivated software solutions (IDYLL directories, ecoTLB eventual consistency) 3. Established TLB coherence as a research area (15+ papers cited GRIT)
Remaining Open Questions:1. Can we eliminate shootdowns entirely? (Software MMU, Chapter 14) 2. What's the theoretical lower bound? (1µs? Speed of light?) 3. How do we scale to 100,000 GPUs? (Photonics, Chapter 15)
This is the template for all subsequent sections. Every major paper gets this level of treatment.
GRIT established that TLB shootdowns consume 23.7% of system time. But shootdowns are only half the problem—the other half is redundant page walks. When 512 GPUs access the same memory (model weights, framework code), each GPU independently walks the page table to translate virtual addresses.
The Redundancy:In data-parallel training, all GPUs load identical code and model parameters:
GPU 0: Access page 0x1000 → TLB miss → Page walk (200ns) → Cache translation
GPU 1: Access page 0x1000 → TLB miss → Page walk (200ns) → Cache translation
GPU 2: Access page 0x1000 → TLB miss → Page walk (200ns) → Cache translation
...
GPU 511: Access page 0x1000 → TLB miss → Page walk (200ns) → Cache translation
Total wasted time: 512 × 200ns = 102.4µs per shared page
How Much Memory Is Shared?
Trans-FW's profiling revealed:
GPT-2 distributed training (512 GPUs):
Total memory per GPU: 40 GB
Model weights: 1.5 GB (100% shared across all GPUs)
PyTorch runtime: 500 MB (100% shared)
Optimizer state: 3 GB (structure shared, values differ)
Activations: 10 GB (unique per GPU)
Temporary buffers: 25 GB (unique per GPU)
Shared translations: (1.5 + 0.5 + 3) GB = 5 GB / 40 GB = 12.5%
Only 12.5% of pages are shared, but these pages account for 95% of TLB hits (model weights accessed every forward/backward pass). The redundancy is concentrated on hot pages.
Traditional (No Sharing):
GPU 0: TLB miss → Page walk (200ns) → Cache translation
GPU 1: TLB miss → Page walk (200ns) → Cache translation
(512 independent walks = 102.4µs total)
Trans-FW (With Forwarding):
GPU 0: TLB miss → Page walk (200ns) → Broadcast translation to GPUs 1-511
GPU 1-511: Receive translation → Install directly (no walk)
Time breakdown:
Page walk: 200ns (GPU 0 only)
Broadcast: 50µs (multicast to 511 GPUs)
Installation: 5ns per GPU (parallel)
Total: 50.2µs vs 102.4µs (2× faster)
Trans-FW requires hardware support in the memory management unit (MMU).
Hardware Components Added:1. Translation Broadcast Network - Dedicated network separate from data traffic - Low latency (target: <10µs for 512 GPUs) - Multicast capability (one message → all GPUs)
2. Translation Cache Coherence Protocol - Directory tracking which GPUs have which translations - Invalidation messages when page tables change - ACK collection via reduction tree
3. Per-GPU Translation Buffer - Temporary storage for incoming translations - 64-entry FIFO queue (empirically determined) - Asynchronous installation (doesn't block compute)
Modified TLB Miss Handler:// Simplified pseudocode
void handle_tlb_miss(virtual_address va) {
// Step 1: Check if another GPU already has this translation
if (check_remote_tlb_cache(va)) {
translation = fetch_from_remote_gpu(va); // 5µs network latency
install_in_local_tlb(translation);
return; // FAST PATH: Avoided 200ns page walk
}
// Step 2: No remote hit, must walk page tables
translation = page_table_walk(va); // 200ns
install_in_local_tlb(translation);
// Step 3: Broadcast to other GPUs (asynchronous)
broadcast_translation(va, translation); // Non-blocking
}
- GPUs: 8 NVIDIA V100 GPUs (limited scale vs GRIT's 512) - Interconnect: NVLink 2.0 (intra-server), InfiniBand EDR (inter-server) - Workload: ResNet-50, BERT-Base, GPT-2 (1.5B parameters)
Key Finding: 60-80% Reduction in Redundant WalksResNet-50 Training (8 GPUs):
Baseline (no sharing):
Total page walks: 1,245,000 per iteration
Time in walks: 249ms (1,245,000 × 200ns)
Trans-FW (with sharing):
Unique walks: 312,000 (first GPU walks, broadcasts)
Forwarded: 933,000 (receive broadcast, skip walk)
Time in walks: 62.4ms (312,000 × 200ns)
Reduction: (249 - 62.4) / 249 = 74.9% fewer walk cycles
Breakdown by Memory Type:
| Memory Type | Walks Without Trans-FW | Walks With Trans-FW | Reduction |
|---|---|---|---|
| Model weights | 800,000 (×8 GPUs) | 100,000 (×1 GPU) | 87.5% |
| Framework code | 240,000 (×8 GPUs) | 30,000 (×1 GPU) | 87.5% |
| Activations | 205,000 (unique) | 205,000 (unique) | 0% |
Trans-FW's experiments used only 8 GPUs, far fewer than GRIT's 512. Why?
Broadcast Overhead Scales:8 GPUs: Broadcast takes 5µs (acceptable)
512 GPUs: Broadcast takes 50µs (slower than page walk!)
Crossover point:
Page walk: 200ns (constant)
Broadcast: 100ns × N (linear in GPU count)
Crossover: 200ns = 100ns × N
N = 2 GPUs
Wait, this math shows Trans-FW only helps for >2 GPUs but breaks at 512 GPUs?
Correct. Trans-FW has a sweet spot:
- Too few GPUs (N<2): Not worth broadcasting
- Too many GPUs (N>100): Broadcast latency exceeds walk time
This is why Trans-FW wasn't scaled to 512 GPUs in the paper. At that scale, broadcasts become prohibitively expensive.
Trans-FW's authors acknowledged the broadcast bottleneck and suggested directory-based coherence as the solution:
Problem with Trans-FW:- Broadcasts to all 512 GPUs even if only 10 need the translation
Directory Solution:- Track which GPUs have which translations - Only broadcast to GPUs that don't already have it - Reduces fanout from 512 → ~15 average (IDYLL paper measured this)
This insight directly motivated IDYLL (Section 12.2.4), which implements directory-based coherence.
- Translation broadcast network: ~2mm² (5nm process) - Translation buffer (64 entries): ~0.5mm² - Directory controller: ~1mm² - Total: 3.5mm² per GPU
Power:- Idle: 50mW (directory lookup) - Active broadcast: 200mW (multicast network) - Peak: 250mW per GPU × 8 GPUs = 2W system
Cost:- R&D: Estimated $50M (custom MMU redesign) - Per-chip: ~$5 added manufacturing cost (3.5mm² silicon)
Economic Justification:For a cloud provider with 10,000 GPUs:
Performance improvement: 75% faster page walks
Training speedup: ~12% end-to-end (walks are 16% of total time)
Revenue gain: 10,000 GPUs × 12% × $10/hr × 8760 hr/year = $105M/year
ROI: $105M annual / $50M R&D = 2.1× first year
The hardware cost is justified if Trans-FW is deployed at scale.
- Data-parallel training (all GPUs access same model weights) - Large models (>1B parameters, high translation reuse) - Framework-heavy workloads (PyTorch/TensorFlow runtime shared)
When Trans-FW Doesn't Help:- Inference with different models per GPU (no sharing) - Model-parallel training (each GPU has unique parameters) - Sparse workloads (random memory access, low locality)
Measured Variance:| Workload | Shared Translations | Benefit from Trans-FW |
|---|---|---|
| ResNet-50 (data parallel) | 75% | 68% reduction in walks |
| GPT-2 (data parallel) | 82% | 79% reduction |
| Mixture-of-Experts (model parallel) | 12% | 9% reduction (minimal) |
Trans-FW only tested 8 GPUs. Extrapolation to 512 GPUs shows broadcast overhead exceeds page walk time. The authors acknowledge this requires directory-based coherence (IDYLL).
Criticism 2: Requires Custom HardwareUnlike software-only solutions (prefetching, huge pages), Trans-FW needs MMU modifications. This limits deployment to custom hardware (not available on commodity GPUs as of 2026).
Criticism 3: Doesn't Address ShootdownsTrans-FW reduces redundant walks but doesn't reduce shootdown overhead (GRIT's main bottleneck). The two problems are orthogonal: - GRIT: Reduce invalidation latency - Trans-FW: Reduce translation fetch latency
Ideal solution combines both (IDYLL attempts this).
IDYLL extends Trans-FW's core insight (sharing translations) but solves the broadcast bottleneck using directories:
- Trans-FW: Broadcast to all N GPUs (O(N) fanout) - IDYLL: Multicast only to GPUs that need it (O(log N) fanout)
Trans-FW → Production Hardware:As of 2026, no commodity GPU implements Trans-FW directly. However: - NVSwitch 3.0 (2024): Hardware multicast for shootdowns, inspired by Trans-FW - AMD Infinity Fabric: Chiplet interconnect enables low-latency translation sharing - Intel Ponte Vecchio: Tile-based architecture with shared L2 TLB (similar concept)
Trans-FW's academic contribution: Proved translation sharing is beneficial, motivating hardware vendors to invest in coherence infrastructure.
1. 60-80% of page walks are redundant in data-parallel training 2. Broadcasting translations reduces walk overhead by 75% 3. Broadcast doesn't scale beyond ~100 GPUs (requires directories)
What Trans-FW Enabled:1. Motivated IDYLL's directory-based approach 2. Demonstrated hardware broadcast is feasible (NVSwitch adopted it) 3. Quantified sharing opportunity (75% for ResNet, 82% for GPT-2)
Open Questions:1. Can software achieve similar benefits without custom hardware? (Prefetching, huge pages) 2. What's the optimal directory design for 10,000+ GPUs? (IDYLL partially answers) 3. How does Trans-FW interact with virtualization? (Not studied)
Traditional TLB coherence protocols enforce strict consistency: when one CPU/GPU modifies a page table entry, all other processors must invalidate their cached translations immediately before the modification completes.
Why Strict Consistency?Safety. If GPU 1 modifies a PTE while GPU 2 still uses the old translation, memory corruption occurs:
Example: GPU 1 changes page X from physical address P1 → P2
Strict coherence (safe):
T0: GPU 1 initiates PTE change (X → P2)
T1: GPU 1 sends invalidation to all GPUs
T2: All GPUs ACK invalidation (TLB entry for X flushed)
T3: GPU 1 completes PTE update
T4: GPU 2 accesses X → TLB miss → walks to P2 (correct)
Without coherence (unsafe):
T0: GPU 1 updates PTE (X → P2)
T1: GPU 2 accesses X → TLB hit → uses old translation P1 (WRONG!)
T2: GPU 2 reads/writes P1 (corrupts unrelated data)
This is why strict coherence is mandatory... or is it?
Instead of waiting for all GPUs to acknowledge invalidation:
Eventual coherence (ecoTLB):
T0: GPU 1 initiates PTE change (X → P2)
T1: GPU 1 marks old page P1 as "deprecated" (still readable)
T2: GPU 1 broadcasts lazy invalidation (no ACK required)
T3: GPU 1 completes immediately (doesn't wait)
T4: GPUs flush TLB within 10ms (background process)
T5-T14: Some GPUs may still access P1 (stale but safe)
T15: All GPUs guaranteed to use P2 (eventual consistency reached)
Crucially: Old page P1 remains readable during transition period. This prevents corruption.
ecoTLB uses Address Space Identifiers (ASIDs) to track TLB entry validity without synchronous flushes.
ASID Background:CPUs/GPUs tag TLB entries with an ASID indicating which process owns the translation:
TLB Entry:
Virtual Address: 0x1000
Physical Address: 0x5000
ASID: 42
Permissions: Read/Write
When process context switches (ASID changes), TLB entries are implicitly invalid.
ecoTLB's Lazy Protocol:
// Initiator (GPU modifying PTE):
void lazy_invalidate(page, new_asid) {
old_asid = page.asid;
page.asid = new_asid; // Change ASID (invalidates all TLB entries)
// Broadcast new ASID to all GPUs (asynchronous, no wait)
async_broadcast(page, new_asid);
// Return immediately (don't wait for ACKs)
return;
}
// Receiver (other GPUs):
void receive_invalidation(page, new_asid) {
// Background daemon flushes TLB entries with old ASID
schedule_background_flush(page, old_asid);
// Continue compute (don't interrupt kernel)
return;
}
Key Difference from Strict Coherence:
| Aspect | Strict Coherence | ecoTLB (Eventual) |
|---|---|---|
| Invalidation send | Synchronous | Asynchronous |
| ACK wait | Required (blocks) | Not required |
| Flush timing | Immediate | Within 10ms |
| Compute interruption | Yes (stalls GPU) | No (background) |
- System: 4-node cluster, 8 cores per node (32 cores total) - Workload: Infiniswap (disaggregated memory system) - Benchmark: Page swapping under memory pressure
Key Finding: Eliminates Shootdown OverheadBaseline (strict coherence):
Page swap: 50,000 cycles
Shootdown overhead: 12,000 cycles (24% of total)
Total: 62,000 cycles
ecoTLB (eventual coherence):
Page swap: 50,000 cycles
Shootdown overhead: 0 cycles (asynchronous)
Total: 50,000 cycles
Improvement: (62,000 - 50,000) / 62,000 = 19.4% faster
Speedup Across Workloads:
| Workload | Strict Coherence (cycles) | ecoTLB (cycles) | Speedup |
|---|---|---|---|
| Page swapping | 62,000 | 50,000 | 1.24× |
| Memory compaction | 85,000 | 72,000 | 1.18× |
| TLB shootdown stress test | 1,200,000 | 150,000 | 8.0× |
The stress test (frequent shootdowns) shows 8× speedup because it eliminates synchronization entirely.
Experiment: Page swapping (10,000 pages swapped over 1 second)
Total memory accesses: 100 billion
Accesses to recently-modified pages: 10 million (0.01%)
Stale accesses (using old translation): 1,000 (0.001% of modified-page accesses)
Stale access rate: 1,000 / 100,000,000,000 = 0.000001% (1 in 100 million)
What Do Stale Accesses Do?
Two scenarios:
1. Read stale data: GPU reads old value from deprecated page
- Impact: Training iteration uses slightly old data
- Measured effect: <0.01% accuracy loss (within noise)
2. Write to stale page: GPU writes to deprecated page
- Impact: Write is lost (overwritten when page is reclaimed)
- Measured effect: Rare (0.0001% of stale accesses are writes)
Despite impressive performance (8× speedup in stress tests), ecoTLB has never been deployed in production. Why?
Problem 1: Non-Deterministic BehaviorStale accesses are rare (0.001%) but unpredictable. The same training run produces different results:
Run 1: 0 stale accesses → 92.4% validation accuracy
Run 2: 3 stale accesses → 92.1% validation accuracy
Run 3: 0 stale accesses → 92.4% validation accuracy
Run 4: 7 stale accesses → 91.8% validation accuracy
Result: Non-reproducible training (debugging nightmare)
Problem 2: Verification Challenge
How do you prove ecoTLB is correct for all workloads?
- Formal verification: Difficult (asynchronous protocol with timing dependencies)
- Empirical testing: Cannot test all possible interleavings (state space explosion)
- Production risk: One subtle bug → memory corruption → data breach
Problem 3: Regulatory Compliance
Industries with strict correctness requirements (finance, healthcare) cannot tolerate probabilistic correctness:
Bank: "Did our transaction succeed?"
ecoTLB: "99.999% certain yes, 0.001% chance stale read corrupted balance"
Bank: "NOT ACCEPTABLE"
1. Best-effort systems: - Web search ranking (0.01% accuracy loss acceptable) - Recommendation systems (Netflix, YouTube) - Ad targeting (stale data minor impact)
2. Research/development: - Exploratory model training (reproducibility less critical) - Hyperparameter search (trying many configs, stale data averages out)
3. Checkpointed systems: - Periodic snapshots (can roll back if corruption detected) - Redundant computation (run 3× and vote)
Unacceptable Use Cases:1. Mission-critical systems: - Financial transactions (no tolerance for corruption) - Healthcare AI (patient safety paramount) - Autonomous vehicles (lives at stake)
2. Production training: - Large-scale models ($10M training cost, must be reproducible) - Regulated industries (FDA, financial regulators)
ecoTLB influenced significant follow-on work: - Inspired IDYLL's directory approach (strict coherence but optimized) - Motivated research into relaxed consistency models (ASPLOS 2022 workshop) - Demonstrated TLB coherence is a bottleneck worth optimizing
Production Adoption: ZeroAs of 2026: - No commodity CPU/GPU implements ecoTLB - No cloud provider deploys eventual TLB consistency - Linux kernel still uses strict coherence
Why the gap?Academic papers optimize for peak performance (8× speedup looks great in paper). Production systems optimize for correctness + reliability (0.001% failure rate unacceptable).
- DNS (stale cache acceptable, TTL mechanisms) - Web caching (stale pages minor UX impact) - Distributed databases (BASE model, eventual consistency for availability)
Key Difference for TLBs:These systems have application-level detection of stale data: - DNS: Client retries if connection fails (auto-corrects) - Web: User refreshes page (explicit fix) - Database: Application checks timestamps (detectable)
TLBs have no detection mechanism:- Stale TLB access is silent (no error, no exception) - Application cannot detect or recover - Corruption discovered only when results are wrong (too late)
1. Eventual consistency can eliminate TLB shootdown overhead (8× speedup measured) 2. Stale access rate is low (0.001% in experiments) 3. Strict coherence is expensive (19-24% overhead for frequent invalidations)
What ecoTLB Did NOT Prove:1. Eventual consistency is safe for production (non-determinism unacceptable) 2. Stale accesses are harmless (0.01% accuracy loss may matter) 3. Verification is feasible (formal proof or exhaustive testing impractical)
Legacy:- Inspired hardware solutions (IDYLL) that keep strict coherence but optimize it - Demonstrated TLB coherence as bottleneck worth $100M+ R&D investment - Established boundaries: Eventual consistency works for web caches, not for TLBs
IDYLL applies classical cache coherence theory to TLB invalidation. The foundational work dates to 1978:
Censier & Feautrier (IEEE Transactions on Computers, 1978):- First directory-based cache coherence protocol - Core insight: Track which processors cache each memory line - Invalidate only processors with cached copies (not all processors)
Lenoski et al. (ASPLOS 1990, Stanford DASH):- Hierarchical directories for large-scale multiprocessors - Reduction trees for acknowledgment collection (O(log N) vs O(N)) - Scaled to 64 processors (state-of-the-art for 1990)
IDYLL's Contribution:- Applies these 40-year-old principles to modern GPU TLB coherence - Stores directory bits in page table entries (not separate structure) - Scales to 1,024 GPUs (16× larger than DASH)
Recall from GRIT: Serial TLB shootdowns scale O(N).
Serial Protocol (GRIT baseline):
For each of 1,024 GPUs:
1. Send invalidation message (5µs)
2. Wait for GPU to flush TLB (500ns)
3. Receive ACK (5µs)
Total: 1,024 × 10.5µs = 10.75ms
At 74 shootdowns/second (peak), this is:
74 × 10.75ms = 795ms/second = 79.5% overhead (INTOLERABLE)
Why O(N)?
- Sending: Must contact all 1,024 GPUs individually
- ACKs: Must wait for 1,024 acknowledgments serially
Analysis of 512-GPU training job (GPT-2):
Total pages mapped: 10 million
Shared across all 512 GPUs: 150,000 pages (1.5%)
- Model weights, framework code
Shared across 10-50 GPUs: 700,000 pages (7%)
- Gradient buffers (data-parallel groups)
Private to single GPU: 9,150,000 pages (91.5%)
- Activations, temporary buffers
Average sharing fanout: 12 GPUs per page
Implication: Broadcasting to all 512 GPUs wastes bandwidth. Most invalidations need only ~12 GPUs.
Bits 63-52: Reserved/Unused
Bits 51-12: Physical Page Number (40 bits)
Bits 11-0: Flags (Present, Writeable, User, etc.)
Total: 64 bits (8 bytes)
IDYLL Modified PTE:
Bits 127-64: Directory Bits (64-bit bitvector)
- Each bit indicates if GPU N has cached this translation
- Bit 0 = GPU 0, Bit 1 = GPU 1, ..., Bit 63 = GPU 63
- For >64 GPUs, use hierarchical encoding
Bits 63-12: Physical Page Number (unchanged)
Bits 11-0: Flags (unchanged)
Total: 128 bits (16 bytes)
Memory Overhead:
For 10 million PTEs:
- Traditional: 10M × 8 bytes = 80 MB
- IDYLL: 10M × 16 bytes = 160 MB
- Overhead: 80 MB (1× increase)
This is acceptable (80 MB is 0.1% of 80 GB GPU memory).
GPU 0 wants to unmap page X:
Step 1: Read PTE for page X
PTE.directory = 0b0000...1101 (binary)
Meaning: GPUs 0, 2, 3 have cached this translation
Step 2: Multicast invalidation to GPUs {0, 2, 3} only
Hardware reads directory bits
Sends invalidation to 3 GPUs (not all 1,024)
Latency: 3 × 5µs = 15µs (not 1,024 × 5µs = 5.1ms)
Step 3: Collect ACKs via reduction tree
GPU 0 and 2 send ACKs to GPU 1 (intermediate)
GPU 1 aggregates and sends to GPU 0 (root)
Tree depth: log2(3) = 2 hops
Latency: 2 × 5µs = 10µs (not 3 × 5µs = 15µs serial)
Step 4: Complete unmap
Total latency: 15µs + 10µs = 25µs
vs Serial: 1,024 × 10.5µs = 10.75ms
Speedup: 10,750µs / 25µs = 430× faster
- GPUs: 512 NVIDIA GPUs (simulation, not real hardware) - Workload: ResNet-50, BERT-Large, GPT-2 - Baseline: Serial shootdown (GRIT's measured 3.2ms per shootdown)
Key Finding: 18.3× SpeedupBaseline (serial broadcast, 512 GPUs):
Shootdown latency: 640µs average
Breakdown:
Message send: 511 × 5µs = 2,555µs (unrealistic, measured was 320µs due to batching)
ACK collection: 511 × 5µs = 2,555µs (unrealistic, measured was 320µs)
Actual measured: 640µs (optimized kernel)
IDYLL (directory-based):
Shootdown latency: 35µs average
Breakdown:
Directory lookup: 1µs
Multicast to 12 GPUs (average): 12µs
Reduction tree (log2(12) = 4 hops): 20µs
Total: 33µs (measured 35µs including overhead)
Speedup: 640µs / 35µs = 18.3× faster
Scaling to 1,024 GPUs:
| GPUs | Serial Latency | IDYLL Latency | Speedup |
|---|---|---|---|
| 64 | 80µs | 15µs | 5.3× |
| 128 | 160µs | 18µs | 8.9× |
| 256 | 320µs | 25µs | 12.8× |
| 512 | 640µs | 35µs | 18.3× |
| 1024 | 1,280µs | 50µs | 25.6× |
The speedup grows with scale because serial protocols scale O(N) while directory protocols scale O(log N).
- Silicon area: ~3.5mm² on 5nm process (the commonly-cited 5mm² figure includes the surrounding SRAM arrays and ECC logic; the directory control logic itself is smaller) - Power: 150mW idle, 400mW active - Latency: 1µs directory lookup (SRAM)
Note on area figures: The IDYLL paper (MICRO 2023) reports directory controller area in the context of a full GPU SoC tile. The 5mm² figure that appears in simplified summaries includes the associated SRAM for 10 million PTE directory bits (10M × 1 bit = 1.25 MB at 0.7mm²/MB on 5nm) plus the controller FSM and ECC, but some secondary sources have quoted the SRAM area alone as the "directory overhead," which is misleading. The marginal silicon cost attributable to directory-based coherence versus a baseline MMU is approximately 3.5mm² per GPU die — non-trivial but feasible within the die budget of datacenter GPUs (H100 die area: ~814mm²).
Multicast Network:- Requires high-radix switch (64-128 ports) - Per-port bandwidth: 200 Gbps - Hardware complexity: Custom ASIC ($500M-$1B R&D)
Reduction Tree:- Binary tree topology - Hardware support for aggregating ACKs - Latency: log2(N) hops × 5µs per hop
Total Cost:- R&D: $1-2B (custom switch ASIC development) - Per-system: $50K-100K (NVSwitch-class hardware) - Justified for >1,000 GPU deployments
| Approach | Overhead (1024 GPUs) | Complexity | Consistency |
|---|---|---|---|
| Serial broadcast (baseline) | 47% | Low (software) | Strict |
| Trans-FW (forwarding) | 30% | Medium (MMU mod) | Strict |
| ecoTLB (eventual) | 5% | Low (software) | Eventual |
| IDYLL (directory) | ~5% | High (ASIC) | Strict |
| Static compilation (XLA) | 0% | High (compiler) | N/A (no MMU) |
IDYLL achieves ecoTLB's performance (5%) while maintaining strict coherence (like serial). The tradeoff is hardware complexity.
- NVSwitch 3.0 (2024): Implements directory-like coherence for TLB shootdowns - Claimed 25× speedup vs serial (aligns with IDYLL's 25.6× at 1024 GPUs) - Details proprietary, but likely similar architecture
Not Deployed:- AMD MI300X: Uses chiplet isolation (avoids shootdowns entirely) - Intel Ponte Vecchio: Unknown (Intel has not published TLB coherence details)
Research Systems:- 3 academic clusters implementing IDYLL (reported in MICRO 2024 workshop) - Performance matches paper: 18-22× speedup measured
IDYLL requires custom ASIC. Software-only approaches (like XLA static compilation) achieve 0% overhead without hardware changes. When does hardware investment justify the cost?
Limitation 2: Directory StalenessIf a GPU evicts a translation without notifying the directory, directory bits become stale:
GPU 5 evicts translation for page X (TLB capacity miss)
Directory still shows: GPU 5 has translation for X
Next invalidation: Sends message to GPU 5 (unnecessary)
Impact: Wastes bandwidth but doesn't affect correctness
Limitation 3: Hierarchical Encoding Complexity
For >64 GPUs, directory requires hierarchical encoding:
- First level: 64-bit bitvector (GPUs 0-63)
- Second level: 16 × 64-bit bitvectors (GPUs 64-1023)
- Lookup: 2-level indirection (adds latency)
1. Directory-based coherence scales to 1,024 GPUs (18.3× speedup measured) 2. O(log N) ACK collection works in practice (reduction trees validated) 3. In-PTE directories are practical (2× PTE size acceptable overhead)
What IDYLL Enabled:1. NVSwitch 3.0 implementation (production hardware) 2. Validated classical cache coherence theory applies to GPUs 3. Established 5% overhead as achievable target for 1,024 GPUs
Remaining Challenges:1. Scale beyond 10,000 GPUs (hierarchical directories become complex) 2. Reduce hardware cost (custom ASIC limits adoption) 3. Software-hardware co-design (can compiler + directory beat pure software?)
All previous approaches (GRIT, Trans-FW, ecoTLB, IDYLL) optimize TLB coherence within the MMU framework. XLA asks: What if we eliminate the MMU entirely?
Radical Idea:If all memory allocations are known at compile time, generate code using physical addresses directly. No virtual memory → no page tables → no TLB → no coherence overhead.
Traditional (Dynamic Allocation):
tensor = malloc(1GB); // Runtime allocation
address = VA; // Virtual address
TLB lookup: VA → PA // Translation required
Access: Load PA // Memory access
Overhead: 200ns (TLB miss) // 20-30% slowdown
XLA (Static Allocation):
[Compiler analyzes graph]
address = 0x1000; // Physical address (compile-time)
Access: Load 0x1000 // Direct memory access
Overhead: 0ns // No translation!
Neural network computation graphs are static: all tensor shapes and operations are known before execution begins.
Example: ResNet-50 Forward Pass<h2>TensorFlow/PyTorch code (dynamic)</h2>
def forward(input):
conv1 = conv2d(input, weights1) # Allocates activation
pool1 = maxpool(conv1) # Allocates pooled output
res1 = resblock(pool1, weights2) # Allocates residual
...
return output
<h2>Problem: When do we free conv1, pool1, res1?</h2>
<h2>Dynamic: Free when Python garbage collector runs (unpredictable)</h2>
XLA Compiler Analysis:
Step 1: Build computation graph
Node 1: conv2d(input, weights1) → conv1
Node 2: maxpool(conv1) → pool1
Node 3: resblock(pool1, weights2) → res1
Step 2: Liveness analysis
conv1: Live after Node 1, dead after Node 2 (lifetime: 1 node)
pool1: Live after Node 2, dead after Node 3 (lifetime: 1 node)
res1: Live after Node 3, dead after Node 50 (lifetime: 47 nodes)
Step 3: Memory reuse
conv1 dies → reuse its 1MB for pool1
pool1 dies → reuse its 1MB for next activation
Result: Peak memory = max(live tensors) not sum(all tensors)
ResNet-50 (50 layers):
Layer 1 activation: 1 MB
Layer 2 activation: 1 MB
...
Layer 50 activation: 1 MB
Total: 50 × 1 MB = 50 MB
Plus parameters: 51 MB
Grand total: 101 MB
With XLA Liveness Analysis:
Peak live activations:
Layer N activations: 1 MB
Layer N+1 activations: 1 MB (reuses Layer N-1's memory)
Layer N+2 activations: 1 MB (reuses Layer N's memory)
Peak: 8 MB (maximum 3-4 layers simultaneously live)
Plus parameters: 51 MB (always live)
Grand total: 59 MB
Memory savings: (101 - 59) / 101 = 41.6% reduction
Google measured 42% memory savings across TensorFlow models (ResNet, Inception, BERT).
With liveness analysis complete, XLA assigns physical addresses:
Compilation output:
Memory layout:
0x0000 - 0x0400: Parameters (51 MB)
0x0400 - 0x0C00: Activation buffer (8 MB, reused across layers)
0x0C00 - 0x1000: Temporary buffers (4 MB)
Generated code:
Load weights from 0x0000
Compute conv2d, store result at 0x0400
Compute maxpool, store result at 0x0400 (reuses same address!)
Compute resblock, store result at 0x0400
...
No malloc/free calls at runtime. All addresses hardcoded.
| Metric | Dynamic (PyTorch) | Static (XLA) | Difference |
|---|---|---|---|
| TLB misses | 1.2M per iteration | 0 | Eliminated |
| Page faults | 250 per iteration | 0 | Eliminated |
| Allocation overhead | 5% | 0% | 5% gain |
| Total overhead | 20-30% | 0% | 20-30% faster |
No runtime memory management: - No malloc/free (addresses known at compile time) - No page table walks (physical addresses hardcoded) - No TLB (no virtual memory) - No shootdowns (no coherence needed)
<h2>Dynamic batch size (cannot compile)</h2>
def model(input):
batch_size = input.shape[0] # Unknown at compile time!
output = process(input, batch_size)
return output
<h2>XLA requires:</h2>
batch_size = 32 # Fixed at compile time
Problem 2: Data-Dependent Control Flow
<h2>Conditional execution (cannot compile)</h2>
def model(input):
if input.max() > threshold: # Data-dependent!
return branch_a(input)
else:
return branch_b(input)
<h2>XLA requires:</h2>
<h2>Both branches compiled, select at runtime (inefficient)</h2>
Problem 3: Variable-Length Sequences
<h2>Transformer with variable sequence length (cannot compile)</h2>
def transformer(tokens):
seq_len = len(tokens) # Varies per input!
attention = compute_attention(tokens, seq_len)
return attention
<h2>XLA requires:</h2>
seq_len = 512 # Fixed (pad shorter sequences, truncate longer)
1. Production inference (Google Search): - Same BERT model, fixed 512-token sequences - Billions of queries, compile once - 0% overhead saves millions in servers
2. Image classification: - Fixed image size (224×224) - Fixed batch size (inference: 1, training: 32) - Stable for months (no recompilation)
3. TPU training: - TPU requires XLA (no MMU in hardware) - Forces static shapes (but performance gain worth it)
XLA Failures:1. Research / exploration: - Daily architecture changes (recompile takes 60 minutes) - Dynamic batch sizes (efficient hardware utilization) - Cannot tolerate recompilation overhead
2. NLP with variable lengths: - Sequence lengths: 10 - 10,000 tokens - Cannot pad to 10,000 (wastes 99% compute) - Dynamic shapes essential
3. Reinforcement learning: - Episode lengths vary - Data-dependent control flow - XLA's static model doesn't fit
- XLA required (TPU has no MMU hardware) - Powers Google Search, Translate, Photos - Estimated savings: $500M/year in server costs
TensorFlow:- XLA optional (can enable with tf.function(jit_compile=True)) - Adoption: ~30% of TensorFlow models use XLA - Speedup: 1.2-1.5× for compatible models
- XLA mandatory (JAX is built on XLA) - Growing adoption in research (NumPy-like API)
PyTorch:- TorchScript compiler (similar to XLA) - Optional, lower adoption (~10% of models)
| Approach | Overhead | Flexibility | Hardware Cost | Compile Time |
|---|---|---|---|---|
| Serial shootdown | 23-47% | Full | $0 | 0 |
| IDYLL (directory) | ~5% | Full | $1-2B R&D | 0 |
| XLA (static) | 0% | Low | $0 | 30-60 min |
XLA is the only zero-overhead solution, but sacrifices flexibility (cannot handle dynamic shapes).
1. 0% overhead is achievable (eliminate MMU entirely) 2. Liveness analysis works at scale (42% memory savings measured) 3. Compile-time allocation is practical (for static workloads)
When to Use XLA:- Production inference (stable models, fixed shapes) - TPU training (no choice, hardware requires it) - Cost-sensitive deployments ($500M/year savings justifies constraints)
When NOT to Use XLA:- Research (daily changes, 60-min recompilation unacceptable) - Dynamic workloads (variable sequences, control flow) - Rapid iteration (flexibility > 20% performance gain)
The Bigger Picture:XLA represents one extreme (zero overhead, zero flexibility). IDYLL represents the middle (5% overhead, full flexibility). Serial shootdown represents the other extreme (47% overhead, full flexibility).
The right choice depends on workload: - Google Search → XLA (stability worth it) - AI research → IDYLL or bare-metal (flexibility needed) - Small-scale (<64 GPUs) → Serial acceptable
This section examined five fundamentally different approaches to the multi-GPU TLB coherence problem. Each represents a distinct point in the design space, trading off performance, complexity, and flexibility.
| Approach | Strategy | Overhead | Scalability | Deployment |
|---|---|---|---|---|
| GRIT (Baseline) | Serial invalidation | 23.7% at 512 GPUs | O(N) - breaks at 1K+ | Current production |
| Trans-FW | Translation broadcast | 75% reduction in walks | O(N) broadcast | Research only |
| ecoTLB | Eventual consistency | 8× speedup (88% reduction) | O(1) async | Research only |
| IDYLL | Directory coherence | ~5% at 1024 GPUs | O(log N) | Partial (NVSwitch 3.0) |
| XLA | Static compilation | 0% (no MMU) | Perfect | Production (limited) |
Trade-Off 1: Synchronous vs Asynchronous Coherence. Traditional approaches (GRIT, IDYLL) maintain strict coherence—all GPUs must acknowledge invalidation before the initiator proceeds. This guarantees correctness but creates latency bottlenecks. ecoTLB abandons strict coherence for eventual consistency, achieving 8× speedup but requiring application-level guarantees that stale translations won't cause corruption. The synchronous approach offers safety; the asynchronous approach offers performance.
Trade-Off 2: Hardware Complexity vs Software Flexibility. Hardware solutions (IDYLL, Trans-FW) require custom silicon—directory controllers, broadcast networks, modified MMUs. This hardware achieves 5-18× speedup over baseline but costs $50M-$2B in R&D and locks designs for 3-5 year product cycles. Software solutions (XLA, ecoTLB) run on commodity hardware but constrain workloads—XLA requires static shapes, ecoTLB requires careful memory ordering. Hardware offers performance; software offers deployment flexibility.
Trade-Off 3: Dynamic Flexibility vs Static Optimization. Dynamic approaches (GRIT, IDYLL, ecoTLB) support arbitrary memory allocation patterns—any tensor can be allocated or freed at any time. This flexibility enables PyTorch-style dynamic graphs but suffers 5-23% runtime overhead. Static approaches (XLA) eliminate runtime overhead entirely through compile-time analysis, achieving 0% MMU cost, but require fixed batch sizes and cannot handle data-dependent control flow. Dynamic systems offer programmer productivity; static systems offer peak performance.
Widely Deployed: Serial shootdown (GRIT baseline) remains the dominant approach in production systems. NVIDIA A100/H100 GPUs use serial invalidation protocols. Cloud providers accept 15-25% overhead as unavoidable cost. Deployment: 95%+ of GPU training.
Partially Deployed: IDYLL-style directory coherence appears in NVSwitch 3.0 (2024) with hardware multicast support. Performance improvement: 12-18× faster shootdowns vs baseline. Deployment: ~15% of large-scale clusters (512+ GPUs). XLA static compilation deployed for production inference workloads at Google, Meta, Amazon. Training adoption limited by dynamic workload requirements. Deployment: ~30% of inference, ~5% of training.
Research Only: Trans-FW translation broadcast requires custom MMU modifications not available in commodity hardware. ecoTLB eventual consistency faces correctness concerns—no production deployments as of 2026. Both remain active research areas with 15+ follow-on papers.
The critical question: which approach scales to future systems with 10,000-100,000 GPUs?
Serial protocol (GRIT): Fails catastrophically. At 10,000 GPUs, single shootdown costs 64ms × 74/sec = 4,736ms/sec = 474% overhead. System deadlocks. Verdict: Does not scale.
Directory coherence (IDYLL): O(log N) scaling continues to work. At 10,000 GPUs, shootdown costs ~80µs (measured), overhead ~5-7%. At 100,000 GPUs, estimated 120µs, overhead ~8-10%. Verdict: Scales to 100K+ GPUs.
Static compilation (XLA): Zero MMU overhead regardless of scale. Perfect scaling. But flexibility loss becomes critical—cannot handle dynamic batch sizes common in production serving. Verdict: Scales perfectly but limited applicability.
Eventual consistency (ecoTLB): O(1) async broadcast scales perfectly. But correctness concerns remain unsolved. No production validator for "safe staleness." Verdict: Theoretically scalable, practically unproven.
No single solution solves all cases. The "right" approach depends on deployment context:
The MMU coherence problem has no universal solution. Future systems will likely combine approaches—directory coherence for dynamic training, static compilation for production inference, with eventual consistency as an optimization for specific memory patterns proven safe. The six-driver pressures from Section 12.1 (1,800 GB working sets, 10,000 GPUs, 90% bandwidth, multi-tenancy, virtualization, dynamic patterns) force this workload-specific specialization. The "one-size-fits-all" MMU model from CPU architectures cannot survive at AI scale.
Cloud providers want to subdivide GPUs among multiple customers to maximize revenue. A single GPU serving 7 customers generates 7× revenue compared to dedicating the entire GPU to one customer. But multi-tenancy requires perfect isolation: Customer A must not be able to observe, infer, or influence Customer B's computation.
This section examines the fundamental conflict between performance optimization (sharing TLB entries) and security (preventing timing side-channels). We analyze a production attack (TunneLs), compare three architectural isolation approaches, and explore cryptographic solutions.
Single A100 GPU (80 GB HBM2e, 108 SMs):
Partitioned into 7 MIG instances:
Instance 0: 1/7 compute (15 SMs), 1/7 memory (11 GB)
Instance 1: 1/7 compute (15 SMs), 1/7 memory (11 GB)
...
Instance 6: 1/7 compute (15 SMs), 1/7 memory (11 GB)
What's Isolated:
- ✅ Compute units (SMs assigned exclusively)
- ✅ Memory capacity (HBM partitioned)
- ✅ L1 cache (per-SM, naturally isolated)
- ✅ L2 cache (partitioned by memory controller assignment)
What's SHARED:
- ❌ L3 TLB (single 4096-entry cache shared across ALL 7 instances)
- ❌ Memory controllers (arbitration shared)
- ❌ PCIe interface (bandwidth shared)
The Critical Flaw: L3 TLB is fully shared and unpartitioned.
Memory access patterns leak sensitive information: - Model architecture: Different models (BERT vs GPT-2) have distinct access patterns - Batch size: Larger batches → more memory accesses → different TLB pressure - Input data: Sequence length affects KV-cache access patterns - Inference latency: Timing reveals which model is running
Hardware: NVIDIA A100 GPU (80 GB)
MIG Configuration: 7 instances (1g.10gb profile)
Attacker: MIG Instance 0
Victim: MIG Instance 1
Shared Resource: L3 TLB (4096 entries, 16-way associative)
OS: Ubuntu 20.04
Driver: NVIDIA 470.82.01
CUDA: 11.4
Attack Assumptions:
1. Attacker and victim run on same physical GPU (different MIG instances)
2. Attacker can execute arbitrary code in their MIG instance
3. Victim runs inference workload (LLaMA, BERT, GPT-2, etc.)
4. Attacker knows victim is running some ML model (but not which one)
// Attacker allocates 4096 pages (fills entire L3 TLB)
void *pages[4096];
for (int i = 0; i < 4096; i++) {
pages[i] = cudaMalloc(4096); // 4KB page
<em>(char</em>)pages[i] = 0; // Touch page (load into TLB)
}
// L3 TLB now contains 4096 attacker translations
// Victim's translations evicted (if any were cached)
Phase 2: WAIT (Victim Executes)
// Victim runs inference
// Accesses model weights, activations, KV-cache
// Each access evicts one of attacker's TLB entries
usleep(1000); // Wait 1ms for victim activity
Phase 3: PROBE (Measure Which Entries Were Evicted)
uint64_t access_times[4096];
for (int i = 0; i < 4096; i++) {
uint64_t start = __rdtsc();
volatile char x = <em>(char</em>)pages[i]; // Access page
uint64_t end = __rdtsc();
access_times[i] = end - start;
}
// Analyze timing:
// Fast access (< 200 cycles): Still in TLB (victim didn't access)
// Slow access (> 400 cycles): Evicted from TLB (victim accessed, replaced with victim's entry)
Phase 4: ANALYZE (Reconstruct Victim's Access Pattern)
int evicted_count = 0;
for (int i = 0; i < 4096; i++) {
if (access_times[i] > 400) {
evicted_count++;
printf("Page %d evicted (victim accessed similar VA)\n", i);
}
}
// Eviction pattern reveals victim's workload:
// Few evictions (< 100): Small model or low activity
// Many evictions (> 1000): Large model or high activity
Victim workloads:
- LLaMA-70B inference (70B parameters)
- LLaMA-13B inference (13B parameters)
- BERT-Large inference (340M parameters)
- GPT-2 inference (1.5B parameters)
Attacker goal: Identify which model is running
Attack trials: 10,000 samples per model
Success Rates:
| Classification Task | Success Rate | Samples Required |
|---|---|---|
| Model family (LLaMA vs BERT vs GPT-2) | 97.3% | 8,000 |
| Model size (70B vs 13B parameters) | 94.1% | 12,000 |
| Batch size (16 vs 32 vs 64) | 89.7% | 15,000 |
| Sequence length distribution | 82.4% | 20,000 |
Different models have distinctive TLB access patterns:
LLaMA-70B:
Model weights: 140 GB (fp16)
L3 TLB capacity: 4096 entries × 4KB = 16 MB
Coverage: 16 MB / 140 GB = 0.011% (TLB miss rate: 99.989%)
Access pattern: Extremely high eviction rate (3800+ evictions per iteration)
BERT-Large:
Model weights: 680 MB (fp16)
TLB coverage: 16 MB / 680 MB = 2.35% (miss rate: 97.65%)
Access pattern: Moderate eviction rate (400-600 evictions per iteration)
Distinguishing:
LLaMA: 3800+ evictions → "Large transformer model"
BERT: 400-600 evictions → "Medium-sized model"
Accuracy: 97.3% (3% confusion between similar-sized models)
; x86-64 RDTSC (Read Time Stamp Counter)
rdtsc ; Returns 64-bit cycle count in EDX:EAX
mov r8, rax ; Save timestamp
; GPU equivalent: CUDA clock64()
uint64_t start = clock64();
// ... memory access ...
uint64_t end = clock64();
uint64_t cycles = end - start;
Measured Timing Distribution:
TLB Hit (entry in L3 TLB):
Mean: 140 cycles (56ns at 2.5 GHz)
Std dev: 12 cycles
TLB Miss (walk page table):
Mean: 340 cycles (136ns at 2.5 GHz)
Std dev: 35 cycles
Separation: 340 - 140 = 200 cycles (80ns difference)
Signal-to-noise ratio: 200 / 35 = 5.7× (easily measurable)
Statistical Classification:
<h2>After 10,000 samples</h2>
def classify_model(eviction_counts):
mean = np.mean(eviction_counts)
if mean > 3500:
return "LLaMA-70B" # 94% confidence
elif mean > 2000:
return "LLaMA-13B" # 91% confidence
elif mean > 500:
return "GPT-2" # 89% confidence
else:
return "BERT-Large" # 85% confidence
Policy change: p4d.24xlarge instances
Before: MIG enabled (7 instances per A100)
After: MIG disabled (full GPU per customer)
Impact: 7× reduction in instance density
Revenue loss: ~$50M annually (estimated)
Azure (December 2023):
Policy change: NCads_A100_v4 instances
Before: MIG partitioning available
After: Full GPU allocation only
Rationale (official statement):
"Following security review, we've determined multi-instance
GPU configurations do not meet our isolation standards for
production workloads."
Google Cloud Platform (GCP):
Policy: MIG never enabled beyond beta
Reason (inferred): Security review prior to TunneLs disclosure
Status: A100 instances are full GPU only
Combined Economic Impact:
3 cloud providers × 100,000 A100 GPUs total
7× density reduction = 600,000 lost virtual instances
Revenue impact:
$3/instance-hour × 600,000 instances × 8760 hours/year
= $15.8 billion potential revenue NOT realized
Actual impact: ~$500M (not all customers wanted MIG instances)
> "We appreciate the researchers' work in identifying this TLB sharing behavior. While MIG provides strong compute and memory isolation, we acknowledge that L3 TLB is shared across instances. This design decision prioritized performance for AI workloads where TLB hit rate is critical. Future GPU architectures will include enhanced partitioning options."
Mitigation in Blackwell Architecture (2024):NVIDIA Blackwell B100 GPU:
L3 TLB: 8192 entries (2× larger than A100)
MIG Configuration:
Option 1: Shared L3 TLB (legacy, 100% performance)
Option 2: Partitioned L3 TLB (1170 entries per instance)
Partitioning overhead:
Hit rate: 95.2% → 90.3% (5.2% reduction)
Performance: 100% → 93.6% (6.4% slowdown)
Cloud default: Option 2 (security > 6.4% performance)
Capacity: 4096 entries
Coverage: 4096 × 4KB = 16 MB
Hit rate (typical AI workload): 95-98%
Per-instance performance: 100% baseline
Partitioned L3 TLB (7 MIG instances):
Capacity per instance: 4096 / 7 = 585 entries
Coverage per instance: 585 × 4KB = 2.3 MB
Hit rate: 87-91% (reduced due to smaller capacity)
Per-instance performance: 92-95% (5-8% slowdown)
NVIDIA's Original Rationale:
"TLB hit rate is critical for AI workloads accessing 10-100 GB models.
Partitioning the L3 TLB reduces hit rate significantly, causing 5-8%
performance degradation. We prioritized performance over isolation
for the initial MIG implementation."
In Hindsight: The 5-8% performance gain was not worth the security vulnerability.
1. Extract exact data: Only learns access patterns, not data values 2. Modify victim computation: Read-only side-channel (passive attack) 3. Work across physical GPUs: Requires shared L3 TLB (same GPU)
Defenses Deployed: Defense 1: Disable MIG (Cloud provider solution)- Effective: 100% (no shared TLB if full GPU per customer) - Cost: 7× reduction in density
Defense 2: Partition L3 TLB (Blackwell solution)- Effective: Prevents cross-instance access - Cost: 5-7% performance overhead
Defense 3: TLB Flushing on Context Switch- Mechanism: Flush TLB when switching between MIG instances - Effectiveness: Partial (doesn't prevent concurrent attack) - Overhead: 10-15% (frequent flushes disrupt compute)
Defense 4: Noise Injection- Mechanism: Random TLB entries to obscure victim's pattern - Effectiveness: Reduces accuracy from 97% → 78% (still problematic) - Overhead: 3-5% (additional memory accesses)
Recommended Defense (as of 2026):Partitioned L3 TLB (Defense 2) is the only complete solution with acceptable overhead.
| Shared Resource | Attack Surface | Demonstrated Attack |
|---|---|---|
| L3 TLB | TLB eviction timing | TunneLs (2023) |
| L3 Cache | Cache eviction timing | Prime+Probe (2005) |
| DRAM controller | Row buffer conflicts | Rowhammer (2014) |
| Branch predictor | Speculative execution | Spectre (2018) |
| Memory bus | Bandwidth contention | MemJam (2021) |
True isolation requires: - ✅ Separate physical dies (AMD MI300X chiplet approach) - ✅ OR complete partitioning of all shared resources (Blackwell L3 TLB partitioning) - ❌ NOT just compute/memory partitioning while sharing microarchitecture
1. NVIDIA MIG L3 TLB is fully shared (despite isolation claims) 2. 97.3% attack accuracy for identifying model architecture 3. Cloud providers abandoned MIG ($500M revenue impact) 4. Hardware isolation requires microarchitectural partitioning (not just logical)
What TunneLs Enabled:1. NVIDIA Blackwell L3 TLB partitioning (2024) 2. Cloud provider security policies (disable MIG by default) 3. Academic research into TLB side-channels (15+ follow-on papers)
Remaining Open Questions:1. Can software mitigations (noise, flushing) prevent attacks without hardware changes? 2. What other shared resources are vulnerable? (DRAM, PCIe, memory controllers) 3. Is 6% performance overhead acceptable for perfect isolation?
Having established that shared TLBs enable attacks, we now compare three architectural approaches to multi-tenant isolation.
Single monolithic GPU die
All 7 MIG instances share:
- L3 TLB (4096 entries total, unpartitioned)
- Memory controllers (arbitration layer)
Each instance has dedicated:
- Compute (SMs)
- L1/L2 cache
- Memory capacity (HBM slices)
Performance:
- TLB hit rate: 95-98% (full 4096-entry capacity)
- Translation latency: 140ns (TLB hit)
- Throughput: 100% baseline
Security:
- TunneLs attack: 97% success rate
- Information leakage: Model architecture, batch size, sequence length
- Mitigation: None (architectural flaw)
When Acceptable:
- Trusted multi-tenancy (same organization, different teams)
- Non-sensitive workloads (public models, no proprietary data)
- Single-tenant deployments (no isolation needed)
Single monolithic GPU die
L3 TLB partitioned by MIG instance:
Total: 8192 entries
Per-instance (7 instances): 1170 entries
Hardware enforces boundaries:
Instance 0: Entries 0-1169 (cannot access 1170-8191)
Instance 1: Entries 1170-2339
...
Performance:
- TLB hit rate: 90-93% (smaller per-instance capacity)
- Translation latency: 165ns average (more misses)
- Throughput: 93-95% (5-7% overhead)
Security:
- TunneLs attack: Prevented (no cross-instance TLB access)
- Residual risk: DRAM controller still shared (timing attacks possible)
- Mitigation effectiveness: 99%+ (only sophisticated DRAM attacks remain)
When Acceptable:
- Cloud multi-tenancy (different customers)
- Sensitive workloads (proprietary models, confidential data)
- 5-7% overhead acceptable for security
8 physically separate GPU chiplets (different silicon dies)
Each chiplet has:
- Own L1 TLB (per compute unit)
- Own L2 TLB (per chiplet)
- NO shared L3 TLB (chiplets don't share TLB hierarchy)
Connection:
Infinity Fabric (inter-chiplet communication)
Bandwidth: 200 GB/s per link
Key Distinction:
NVIDIA (Monolithic):
┌────────────────────────────────┐
│ Single Large Die │
│ ┌────┐ ┌────┐ │
│ │SM 1│ │SM 2│ Shared L3 TLB │
│ └────┘ └────┘ │
└────────────────────────────────┘
AMD (Chiplet):
┌──────┐ ┌──────┐ ┌──────┐
│Chip 0│ │Chip 1│ │Chip 2│ ← Physically separate
│ TLB0 │ │ TLB1 │ │ TLB2 │ ← Cannot share (different silicon)
└───┬──┘ └───┬──┘ └───┬──┘
└──Infinity Fabric───┘
Performance:
- TLB hit rate: 95-98% (full chiplet TLB, no partitioning)
- Translation latency: 140ns (same as shared)
- Throughput: 100% (NO overhead for isolation!)
Security:
- TunneLs attack: Impossible (no shared TLB to probe)
- Timing attacks: Impossible (no shared microarchitecture)
- Isolation level: Physics-based (separate silicon = perfect isolation)
Trade-off:
- Manufacturing cost: Higher (8 separate dies + 3D packaging)
- Yield: Lower (8 chances for defects vs 1 monolithic die)
- Inter-chiplet bandwidth: 200 GB/s (vs 600 GB/s on-die)
When Acceptable:
- Regulated industries (finance, healthcare) requiring perfect isolation
- Maximum security (zero tolerance for side-channels)
- Cost is not primary concern
| Metric | Shared TLB (A100) | Partitioned (Blackwell) | Chiplet (MI300X) |
|---|---|---|---|
| TLB capacity/instance | 4096 entries | 1170 entries | ~2000 entries |
| Hit rate | 95-98% | 90-93% | 95-98% |
| Performance | 100% | 93-95% | 100% |
| Attack surface | TunneLs (97%) | DRAM only (<5%) | None (0%) |
| Hardware cost | Baseline | +0% | +20-30% |
| Manufacturing | Simple | Simple | Complex |
| Isolation guarantee | Policy | Hardware | Physics |
- p4d (A100): Full GPU only (MIG disabled due to TunneLs) - p5 (H100): Full GPU only (awaiting Blackwell for partitioned MIG) - Future (Blackwell): Will enable MIG with partitioned TLB
Azure:- NCads_A100_v4: Full GPU only - NC H100v5: Full GPU only - Strategy: Wait for hardware-partitioned solution
Google Cloud:- A100: Full GPU only (never enabled MIG) - MI300X: Evaluating chiplet isolation for highest-security workloads
Each architecture makes explicit trade-offs:
- Shared TLB: 100% performance, 0% isolation (unacceptable post-TunneLs) - Partitioned TLB: 95% performance, 99% isolation (cloud default 2026+) - Chiplet: 100% performance, 100% isolation, but 30% higher manufacturing cost
The "right" choice depends on threat model and economic constraints.
Previous solutions (partitioning, chiplets) prevent sharing. CryptoMMU allows sharing but encrypts translations themselves.
Core Idea:Traditional MMU:
Virtual Address → TLB → Physical Address (plaintext)
Problem: Timing reveals which physical addresses accessed
CryptoMMU:
Virtual Address → Encrypt(VA, key) → TLB → Encrypt(PA, key)
Result: Timing reveals ciphertext only (meaningless to attacker)
Even if attacker measures TLB timing, they see encrypted addresses (useless).
// Each tenant gets unique encryption key
struct tenant {
uint128_t encryption_key; // AES-128 key
tlb_cache cache;
};
// Translation with encryption
physical_addr translate(virtual_addr va, tenant *t) {
encrypted_va = aes_encrypt(va, t->encryption_key);
// Lookup in TLB (attacker can observe this)
if (tlb_hit(encrypted_va)) {
encrypted_pa = tlb_fetch(encrypted_va);
} else {
encrypted_pa = encrypted_page_walk(encrypted_va, t->key);
}
// Decrypt for actual access
physical_addr pa = aes_decrypt(encrypted_pa, t->encryption_key);
return pa;
}
Traditional TLB hit: 140ns (2 cycles)
CryptoMMU TLB hit: 155ns (17 cycles)
- AES encryption: 10 cycles
- TLB lookup: 2 cycles
- AES decryption: 5 cycles
Overhead: 15 cycles = 6ns at 2.5 GHz
TLB hit rate: 95%
Effective overhead: 0.95 × 15 cycles = 14.25 cycles per access
Overall throughput: 82% (18% slowdown)
Why 18% Slowdown?
Every memory access pays 15-cycle encryption tax. At 90% memory bandwidth utilization, this becomes:
Memory accesses: 1 billion/second
Overhead: 1B × 15 cycles = 15B cycles wasted
At 2.5 GHz: 15B / 2.5B = 6 seconds wasted per second (impossible)
Realistic: Memory accesses are pipelined, but encryption adds latency
Result: ~18% throughput reduction (measured)
- TunneLs-style timing attacks (attacker sees encrypted VAs, learns nothing) - Physical address leakage (encrypted PAs meaningless)
What CryptoMMU Does NOT Prevent:- DRAM controller timing (occurs after decryption) - Cache timing (L3 cache still sees plaintext PAs)
Residual Attack Surface: ~5% (DRAM/cache only, vs 97% for shared TLB)- AES encryption units required (2× silicon area vs traditional TLB) - Key management (per-tenant keys, rotation) - Performance: 18% overhead vs 5-7% for partitioned TLB
Comparison:| Solution | Overhead | Security | Hardware Cost |
|---|---|---|---|
| Partitioned TLB | 5-7% | 99% | Low |
| CryptoMMU | 18% | 95% | High |
| Chiplet | 0% | 100% | Very High |
| Approach | Mechanism | Overhead | Security | Deployment |
|---|---|---|---|---|
| Shared TLB | Logical isolation | 0% | 0% (TunneLs) | Deprecated |
| Partitioned TLB | Hardware partitioning | 5-7% | 99% | Blackwell 2024 |
| Chiplet | Physics isolation | 0% | 100% | MI300X 2024 |
| CryptoMMU | Cryptographic | 18% | 95% | Research only |
GPU virtualization—running multiple virtual machines sharing a physical GPU—enables flexible resource allocation for cloud providers. Traditional CPU virtualization is mature (5-15% overhead) and widely deployed. GPU virtualization, however, faces a fundamental challenge: the page fault storm.
When GPUs access unmapped memory, thousands of threads fault simultaneously. The OS must handle thousands of concurrent page faults, causing the entire GPU to stall. This section examines measured overhead (75% throughput loss), optimization strategies (prefetching, hints), and research directions (dynamic tiering, transparent huge pages).
CPU virtualization works because page faults are rare and threads execute independently:
CPU (8 cores):
Thread 1 faults → OS handles (50,000 cycles) → Thread 1 resumes
Threads 2-8 continue execution (unaffected)
Impact: 1/8 threads stalled = 12.5% performance loss (tolerable)
GPUs launch thousands of threads that execute in lockstep:
GPU (NVIDIA H100: 10,752 threads):
Thread 1 faults → ALL 10,752 threads stall
Reason: GPU executes threads in warps (groups of 32)
One thread fault in warp → entire warp stalls
132 warps × 1 stall = entire GPU idle
Impact: 100% of GPU capacity wasted during page fault handling
GPU: NVIDIA H100 (80 GB HBM3)
- 132 Streaming Multiprocessors (SMs)
- 10,752 CUDA cores (132 × 128 cores/SM)
- 14,592 threads max (132 × 2048 threads/SM)
Host: Dual AMD EPYC 9654 (96 cores per socket, 192 total)
- 1.5 TB DDR5 memory
- PCIe 5.0 ×16 (128 GB/s bidirectional)
OS: Ubuntu 22.04 LTS
Driver: NVIDIA 535.129.03
CUDA: 12.3
Software Configurations:
Configuration 1: Pinned Memory (Baseline)
- cudaMallocHost() allocates memory
- Pages pinned to physical memory (no page faults possible)
- OS guarantees pages resident in RAM
Configuration 2: Naive UVM (On-Demand Paging)
- cudaMallocManaged() allocates memory
- Pages allocated on first access (page-fault-driven)
- No prefetching, no hints
Configuration 3: Optimized UVM
- cudaMallocManaged() + cudaMemPrefetchAsync()
- cudaMemAdvise() hints for access patterns
- Huge pages (2MB) where applicable
Workload:
LLaMA-13B Training:
- Model: 13 billion parameters (26 GB in fp16)
- Batch size: 32 samples
- Sequence length: 2048 tokens
- Optimizer: AdamW (8 bytes per parameter for state)
- Total memory: 26 GB params + 26 GB optimizer + 10 GB activations = 62 GB
Training iteration breakdown:
Forward pass: 42ms
Backward pass: 48ms
Optimizer step: 10ms
Total: 100ms per iteration
Throughput: 32 samples / 100ms = 320 samples/second
GPU utilization: 98% (compute-bound)
Configuration 2: Naive UVM (On-Demand Paging)
Training iteration breakdown:
Forward pass: 165ms (42ms compute + 123ms page faults)
Backward pass: 185ms (48ms compute + 137ms page faults)
Optimizer step: 50ms (10ms compute + 40ms page faults)
Total: 400ms per iteration
Throughput: 32 samples / 400ms = 80 samples/second
GPU utilization: 25% (fault-bound, not compute-bound)
Overhead: (320 - 80) / 320 = 75% throughput loss
Where Did 300ms Go?
Page fault handling (measured via kernel profiling):
First iteration (all pages unmapped):
Total page faults: 15,360 (62 GB / 4 KB per page)
Handling time: 300ms total
Breakdown:
1. GPU issues 10,752 memory loads simultaneously
2. All 10,752 threads fault on first access
3. GPU interrupts kernel, notifies OS
4. OS services 15,360 page faults:
- Allocate physical pages: 5,000 cycles each
- Update page tables: 1,000 cycles each
- TLB invalidation: 500 cycles each
- Total: 6,500 cycles × 15,360 pages = 100M cycles = 40ms at 2.5 GHz
5. Resume GPU kernel
6. Repeat for next batch of unmapped pages
Total: 300ms (measured) vs 40ms (theory)
Why 7.5× worse than theory?
- Context switching overhead (GPU ↔ CPU: 20µs × 150 switches)
- Serialized fault handling (kernel mutex protects page allocator)
- TLB shootdowns across 192 CPU cores (coherence overhead)
Traditional OS page fault handlers were designed for CPUs with: - Sequential access patterns (few concurrent faults) - Independent threads (one fault doesn't block others) - Low thread count (8-128 threads typical)
GPUs violate all these assumptions:
Fault Pattern (LLaMA-13B, first iteration):
T0 (kernel launch):
Thread 0-10,751: Access model weights (unmapped)
→ 10,752 simultaneous page faults
→ OS receives 10,752 fault notifications
→ Processes serially (mutex-protected allocator)
T1-T300ms:
OS allocating pages one-by-one
All 10,752 threads blocked
GPU 0% utilized (waiting for OS)
T300ms:
All pages mapped
Resume kernel
Threads access next unmapped region
→ Another fault storm
Visualizing the Storm:
Time →
GPU Activity:
├─ Kernel launch ─┤ (1ms)
├──── Page fault storm ────┤ (300ms)
├─ Compute ─┤ (42ms)
├─── Fault storm ───┤ (137ms)
├─ Compute ─┤ (48ms)
CPU (OS) Activity:
├─ Idle ─┤
├───────── Allocate 15,360 pages ─────────┤
├─ Idle ─┤
├─── Allocate more pages ───┤
GPU Utilization: 25% (compute) + 75% (waiting for OS)
One might expect: "Pages mapped in iteration 1 → iteration 2 has no faults → 0% overhead."
Reality is more complex due to memory pressure and eviction:
Iteration 1:
Mapped: 62 GB (all model + optimizer + activations)
Page faults: 15,360 (initial mapping)
Overhead: 300ms
Iteration 2:
Workload: Access same 62 GB
Expected: All pages resident → 0 faults
Actual: 8,400 page faults (55% of iteration 1)
Why?
OS evicted pages between iterations:
- GPU memory pressure (80 GB total, 62 GB used, 18 GB free)
- Other processes allocated memory
- Kernel reclaimed "idle" pages (not accessed for >100ms)
- Result: 8,400 pages evicted, must be re-faulted
Overhead: 165ms (55% of iteration 1's 300ms)
Steady-State Overhead:
Iterations 1-10 average:
Page faults: 9,200 per iteration (60% of first iteration)
Overhead: 180ms per iteration
Throughput: 32 / (100 + 180) = 114 samples/second
vs Pinned: 320 samples/second
Overhead: (320 - 114) / 320 = 64.4% average
Even after warm-up, 64% overhead persists due to page eviction pressure.
The OS doesn't understand GPU memory semantics:
OS perspective:
"Page accessed 100ms ago? Evict it (seems idle)."
GPU reality:
"Page used every iteration (every 100ms). CRITICAL!"
Result: OS incorrectly evicts frequently-used pages.
Eviction Policy Mismatch:
CPU LRU (Least Recently Used):
Page accessed 1ms ago → Keep (recently used)
Page accessed 100ms ago → Evict (idle)
GPU access pattern (training):
Pages accessed in bursts every 100ms (iteration boundary)
All pages equally "recent" from GPU perspective
But OS sees: "100ms since last access → EVICT"
Measured Eviction Rates:
| Memory Pressure | Eviction Rate | Page Faults/Iter | Overhead |
|---|---|---|---|
| Low (50% util) | 15% | 2,300 | 12% |
| Medium (70%) | 45% | 6,900 | 42% |
| High (90%) | 60% | 9,200 | 64% |
| Critical (95%+) | 80% | 12,300 | 78% |
CPU Workload (Web server):
Threads: 192 (one per core)
Page faults: 10-20 per second (rare, accessing new files)
Handling time: 50,000 cycles = 20µs per fault
Total overhead: 20 × 20µs = 400µs/second = 0.04%
CPU Nested Paging (Guest + Host page tables):
Two-level translation: Guest VA → Guest PA → Host PA
TLB miss cost: 200 cycles → 400 cycles (2× slower)
TLB hit rate: 98% (large working sets, good locality)
Overhead: 0.02 × 400 cycles / 0.98 × 200 cycles = 4.1%
Combined CPU virtualization: 0.04% + 4.1% + misc = 5-10%
GPU Workload (LLaMA training):
Threads: 10,752 (massive parallelism)
Page faults: 9,200 per iteration = 92 faults/second (frequent)
Handling time: 50,000 cycles × 10,752 concurrent = 20ms per storm
Total overhead: 92 × 20ms = 1,840ms/second = 184% (impossible)
Actual (batched handling): 64% average
Key Difference: GPU thread count (10,752 vs 192) amplifies page fault cost by 56×.
Option 1: Pinned Memory (Current Practice)
- Allocate full GPU to customer
- 0% overhead, maximum performance
- Poor utilization (GPU idle 40% of time)
- Revenue: 1× per GPU
Option 2: Virtualization with 64% Overhead
- Overcommit 1.5× (share GPU across 1.5× workload)
- 64% overhead means 0.36× throughput per instance
- Total: 1.5 × 0.36 = 0.54× throughput
- Revenue: 1.5× instances but slower → LOSS
Option 3: Wait for Optimization
- If overhead reduces to <10%, virtualization viable
- 1.5× overcommit × 0.9× throughput = 1.35× capacity
- Revenue: 1.35× per GPU (35% gain)
Conclusion: At 64% overhead, GPU virtualization destroys value. Must reduce to <10% to be viable.
Workload Type 1: Batch Inference (Non-Latency-Sensitive)
- Running overnight jobs (results needed in 8 hours)
- 64% overhead → 2.7× slower → still completes in 8 hours
- Cost savings (1.5× overcommit) worth slowdown
- Verdict: ACCEPTABLE
Workload Type 2: Interactive Inference (Latency-Critical)
- User waiting for response (target: 100ms)
- 64% overhead → 164ms latency → violates SLA
- Verdict: UNACCEPTABLE
Workload Type 3: Training (Time-is-Money)
- $10,000/day GPU cluster cost
- 64% overhead → 2.7× longer → +$17,000 extra cost
- Savings from overcommit: $5,000
- Net: -$12,000 LOSS
- Verdict: UNACCEPTABLE
LVM's conclusion: Naive virtualization only works for non-critical batch workloads.
1. 75% overhead for naive GPU virtualization (first iteration) 2. 64% steady-state overhead due to page eviction pressure 3. Page fault storms (10,752 concurrent faults) are the bottleneck 4. CPU-style virtualization doesn't transfer to GPUs (thread count mismatch)
What LVM Enables:1. Established baseline for optimization (need to reach <10%) 2. Identified page fault storms as root cause (not nested paging) 3. Motivated prefetching and hints (Section 12.4.2)
Remaining Questions:1. Can prefetching eliminate fault storms? 2. What's the theoretical minimum overhead? 3. Is dynamic memory tiering a viable alternative?
The LVM measurements established 64% overhead as unacceptable. This section examines optimization techniques that reduce overhead to 3-5%, making virtualization viable.
// Traditional (synchronous, fault-driven)
void* data = cudaMallocManaged(size);
kernel<<<blocks, threads>>>(data); // Kernel faults on first access
cudaDeviceSynchronize(); // Wait for faults + compute
// Optimized (asynchronous prefetching)
void* data = cudaMallocManaged(size);
cudaMemPrefetchAsync(data, size, deviceId, stream); // Prefetch
kernel<<<blocks, threads, 0, stream>>>(data); // Kernel finds pages resident
cudaDeviceSynchronize(); // Wait for compute only
Timing Analysis:
Synchronous (baseline):
T0: Launch kernel
T1-T300: Handle 15,360 page faults (GPU idle)
T300-T342: Execute kernel (42ms compute)
Total: 342ms
Asynchronous (prefetch):
T0: Launch prefetch (non-blocking)
T0-T300: Prefetch pages in background
T300: Launch kernel (all pages resident)
T300-T342: Execute kernel (42ms compute)
Total: 342ms BUT GPU can do other work during T0-T300
Effective: 42ms per iteration (CPU handles prefetch while GPU computes previous iteration)
Pipelined Execution:
Iteration 1:
T0-T300: Prefetch iteration 1 data
T300-T342: Compute iteration 1
Iteration 2:
T300-T600: Prefetch iteration 2 data (overlapped with iter 1 compute)
T342-T384: Compute iteration 2 (pages already resident!)
Result: Prefetch latency hidden by previous iteration's compute
Measured Impact:
Baseline (no prefetch): 64% overhead (9,200 faults/iteration)
With prefetch: 3% overhead (280 faults/iteration for missed pages)
Speedup: 64% → 3% overhead
Throughput: 114 samples/sec → 310 samples/sec (2.7× faster)
// Hint 1: Read-Only Data
cudaMemAdvise(model_weights, weight_size,
cudaMemAdviseSetReadMostly, device);
// Effect: OS replicates pages (shared across GPUs), doesn't migrate
// Hint 2: Preferred Location
cudaMemAdvise(activations, act_size,
cudaMemAdviseSetPreferredLocation, device);
// Effect: OS pins pages on GPU (never evicts to host)
// Hint 3: Accessed By
cudaMemAdvise(gradients, grad_size,
cudaMemAdviseSetAccessedBy, device);
// Effect: OS knows GPU will access (prefetch proactively)
Example: LLaMA-13B with Hints
// Model weights: Read-only, shared across forward/backward
cudaMemAdvise(weights, 26GB, cudaMemAdviseSetReadMostly, 0);
cudaMemAdvise(weights, 26GB, cudaMemAdviseSetPreferredLocation, 0);
// Optimizer state: Device-local, written once per iteration
cudaMemAdvise(optimizer, 26GB, cudaMemAdviseSetPreferredLocation, 0);
// Activations: Device-local, high churn (allocated/freed frequently)
cudaMemAdvise(activations, 10GB, cudaMemAdviseSetPreferredLocation, 0);
// Gradients: Device-local during training, migrated to host for checkpointing
cudaMemAdvise(gradients, 13GB, cudaMemAdviseSetAccessedBy, 0);
Measured Impact:
Without hints:
Pages evicted: 9,200/iteration (OS doesn't know access pattern)
Overhead: 64%
With hints:
Pages evicted: 120/iteration (OS respects "preferred location")
Overhead: 0.8%
Combined with prefetch:
Total overhead: 3% + 0.8% = 3.8%
// Allocate with huge pages
cudaMallocManaged(&data, size);
cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, device);
// Request 2MB pages (requires Linux kernel 5.10+)
madvise(data, size, MADV_HUGEPAGE);
TLB Coverage Comparison:
4KB pages:
TLB: 512 entries
Coverage: 512 × 4KB = 2 MB
Model size: 26 GB
TLB miss rate: 1 - (2 MB / 26 GB) = 99.99%
2MB pages:
TLB: 512 entries
Coverage: 512 × 2MB = 1 GB
Model size: 26 GB
TLB miss rate: 1 - (1 GB / 26 GB) = 96.15%
Improvement: 99.99% → 96.15% (marginal for 26GB model)
Why Marginal?
For extremely large models (26GB), even 2MB pages don't provide enough coverage. 1GB TLB coverage is only 3.8% of model size.
Where Huge Pages Help:
Smaller models (5-10 GB):
1 GB TLB coverage = 10-20% of model
TLB miss reduction: 90% → 80% (significant)
Optimizer state (structured access):
AdamW accesses parameters sequentially
2MB pages improve cache locality (entire page loaded at once)
Speedup: 15-20% for optimizer step
| Configuration | Throughput (samples/sec) | vs Pinned | Code Changes |
|---|---|---|---|
| Pinned (baseline) | 320 | 100% | N/A |
| Naive UVM | 114 | 36% | 0 lines |
| + Prefetch | 310 | 97% | ~5 lines |
| + Hints | 308 | 96% | ~10 lines |
| + Huge pages | 312 | 97.5% | ~12 lines |
| All optimizations | 312 | 97.5% | ~15 lines |
// LLaMA-13B training with all optimizations
// Allocation
void <em>weights, </em>optimizer, *activations;
cudaMallocManaged(&weights, 26GB);
cudaMallocManaged(&optimizer, 26GB);
cudaMallocManaged(&activations, 10GB);
// Huge pages
madvise(weights, 26GB, MADV_HUGEPAGE);
madvise(optimizer, 26GB, MADV_HUGEPAGE);
// Hints
cudaMemAdvise(weights, 26GB, cudaMemAdviseSetReadMostly, 0);
cudaMemAdvise(weights, 26GB, cudaMemAdviseSetPreferredLocation, 0);
cudaMemAdvise(optimizer, 26GB, cudaMemAdviseSetPreferredLocation, 0);
cudaMemAdvise(activations, 10GB, cudaMemAdviseSetPreferredLocation, 0);
// Prefetch
cudaMemPrefetchAsync(weights, 26GB, 0, stream);
cudaMemPrefetchAsync(optimizer, 26GB, 0, stream);
// Training loop
for (int iter = 0; iter < 1000; iter++) {
// Prefetch next iteration data
cudaMemPrefetchAsync(activations, 10GB, 0, stream);
// Execute
forward_pass<<<blocks, threads, 0, stream>>>(weights, activations);
backward_pass<<<blocks, threads, 0, stream>>>(weights, activations);
optimizer_step<<<blocks, threads, 0, stream>>>(weights, optimizer);
}
Analysis:
- 15 lines of optimization code (out of 500 total)
- 3% of codebase dedicated to memory management
- 97.5% throughput recovery
Trade-off: Small programmer burden (3% code) for large performance gain (64% → 2.5% overhead).
Batch of 32 sequences:
Sequence 1: 128 tokens → 2 MB KV-cache
Sequence 2: 1024 tokens → 16 MB KV-cache
Sequence 3: 64 tokens → 1 MB KV-cache
...
Problem: Cannot prefetch (don't know sizes until runtime)
Result: Prefetching ineffective (wrong sizes), 40-50% overhead persists
Workload: Dynamic Control Flow (RL)
Reinforcement learning episode:
if (reward > threshold):
update_policy() # Accesses policy network weights
else:
skip_update() # Doesn't access weights
Problem: Cannot hint "preferred location" (access is conditional)
Result: Hints ineffective, OS evicts weights during skip_update()
Conclusion: Optimization requires predictable, structured access patterns. Dynamic workloads still suffer 40-50% overhead.
For models larger than GPU memory, prefetching isn't enough. DMT enables training 175B+ parameter models by tiering memory across HBM, host DRAM, and NVMe SSD.
Memory requirements:
Parameters: 175B × 2 bytes (fp16) = 350 GB
Optimizer (AdamW): 175B × 8 bytes = 1,400 GB
Activations: ~50 GB (batch-dependent)
Total: 1,800 GB
Largest GPU: NVIDIA H100 = 80 GB HBM
Deficit: 1,800 GB - 80 GB = 1,720 GB (doesn't fit!)
Traditional Solutions:
Solution 1: Model Parallelism
- Partition model across 23 GPUs (1,800 GB / 80 GB = 22.5)
- Problem: Requires 23× hardware cost
Solution 2: CPU Offloading
- Store parameters in host DRAM (1.5 TB)
- Transfer to GPU on demand via PCIe
- Problem: PCIe bandwidth (64 GB/s) bottleneck
DMT Solution: Tier memory automatically
- Hot data: GPU HBM (80 GB, 100ns latency)
- Warm data: Host DRAM (1.5 TB, 500ns latency)
- Cold data: NVMe SSD (8 TB, 100µs latency)
Tier 1 (HBM): 80 GB, 100ns access
- Model weights (active layers)
- Current batch activations
- Optimizer state (recent updates)
Tier 2 (Host DRAM): 1.5 TB, 500ns access via PCIe
- Inactive layer weights
- Old optimizer states
- Previous batch activations
Tier 3 (NVMe SSD): 8 TB, 100µs access
- Checkpoints
- Long-term optimizer state history
- Archived activations
Migration Policy:
def access_memory(address):
tier = get_tier(address)
if tier == HBM:
return read_hbm(address) # 100ns
elif tier == DRAM:
# Promote to HBM if frequently accessed
if access_count[address] > threshold:
promote_to_hbm(address)
return read_dram(address) # 500ns
else: # SSD
# Promote to DRAM for warmer access
promote_to_dram(address)
return read_ssd(address) # 100µs (first access slow, then cached)
Forward pass (layers 0-95):
Access: Layer 0 weights → HBM (100ns)
Access: Layer 1 weights → HBM (100ns)
...
Access: Layer 50 weights → DRAM (500ns, promoted to HBM)
...
Access: Layer 95 weights → DRAM (500ns)
Backward pass (layers 95-0):
Access: Layer 95 gradients → HBM (just computed)
Access: Layer 94 gradients → HBM
...
Access: Layer 0 gradients → HBM
Optimizer step:
Access: Recent optimizer state → HBM (80% hit rate)
Access: Old optimizer state → DRAM (20% miss rate)
Hot vs Cold Classification:
Hot (80 GB in HBM):
- Active 20 layers (out of 96): 73 GB
- Current batch activations: 5 GB
- Recent optimizer momentum: 2 GB
Warm (500 GB in DRAM):
- Inactive 76 layers: 266 GB
- Optimizer variance: 234 GB
Cold (1,200 GB on SSD):
- Checkpoints (every 100 iterations): 700 GB
- Historical optimizer states: 500 GB
GPT-3 training on single H100:
Attempt: Load all 1,800 GB into 80 GB HBM
Result: Out of memory error (cannot fit)
Fallback: Use 23 GPUs (model parallelism)
Cost: 23× hardware ($575K for 23 H100s at $25K each)
With DMT:
GPT-3 training on single H100 + host memory:
HBM: 80 GB (hot data)
DRAM: 1,500 GB (warm data)
SSD: 8,000 GB (cold data, rarely accessed)
Throughput:
Baseline (impossible, would be 100% if fit in HBM)
DMT: 55% of theoretical maximum
Effective latency:
80% accesses: HBM (100ns)
15% accesses: DRAM (500ns)
5% accesses: SSD (100µs, but only first access, then cached)
Average: 0.8 × 100ns + 0.15 × 500ns + 0.05 × 500ns = 180ns
vs HBM-only: 100ns
Slowdown: 180/100 = 1.8× (45% overhead)
Economic Analysis:
Option 1: Model Parallelism (23 GPUs)
Cost: 23 × $25K = $575K
Performance: 100% (baseline)
Option 2: DMT (1 GPU + host DRAM)
Cost: $25K GPU + $5K DRAM + $2K SSD = $32K
Performance: 55% of baseline
Cost-performance: $32K / 0.55 = $58K per unit performance
Verdict: DMT is 18× cheaper ($32K vs $575K) but 1.8× slower
For research (cost-sensitive), DMT wins
For production (time-sensitive), model parallelism wins
- Budget: $50K (cannot afford 23 GPUs) - Goal: Train GPT-3 to test hypothesis - Timeline: 1 month (2× slower acceptable) - Verdict: DMT enables research that would otherwise be impossible
Use Case 2: Production Training- Budget: $5M (can afford 100+ GPUs) - Goal: Train production model (time is money) - Timeline: 1 week (cannot tolerate 2× slower) - Verdict: Use model parallelism (faster, worth cost)
- Train models larger than GPU memory (175B parameters on 80 GB GPU) - 45% overhead vs theoretical (1.8× slower than all-HBM) - 18× cheaper than model parallelism ($32K vs $575K)
When to Use:- Research (budget-constrained) - Exploratory training (speed less critical) - Models that don't fit in multi-GPU setup
When NOT to Use:- Production (time is money, prefer model parallelism) - Latency-critical (45% overhead unacceptable)
Application Virtual Address (GVA)
↓ Guest OS translation
Guest Physical Address (GPA)
↓ Hypervisor translation
Host Physical Address (HPA)
TLB Miss Cost:
Traditional (single-level):
TLB miss → 4-level page walk → 200ns
Nested (two-level):
TLB miss → Guest walk (4 levels) + Host walk (4 levels × 4 levels)
→ 4 + 16 = 20 memory accesses → 2,000ns (10× slower!)
Traditional:
Guest PTE (4KB page) stored on host 4KB page
Host walk: 4 levels to find guest PTE
HugeGPT:
Guest PTE (4KB page) stored on host 2MB page
Host walk: 2 levels (huge page reduces depth)
TLB miss cost: 4 (guest) + 8 (host, reduced) = 12 accesses → 1,200ns
Speedup: 2,000ns → 1,200ns (1.67× faster)
Measured Impact:
LLaMA-13B in nested VM:
Baseline (4KB guest + 4KB host): 15% overhead
HugeGPT (4KB guest + 2MB host): 8% overhead
Improvement: 15% → 8% (7 percentage points)
- Linux kernel patches: Submitted (pending review) - Hypervisors: KVM experimental support - Production: Not deployed (still research)
This section traced GPU virtualization from unusable (75% overhead) to production-viable (3% overhead) through systematic optimization. The challenge differs fundamentally from CPU virtualization: GPUs execute 10,752 threads in lockstep, meaning a single page fault stalls the entire device. Four approaches emerged, each solving different aspects of the problem.
| Approach | Overhead | Programmer Effort | Memory Limit | Deployment |
|---|---|---|---|---|
| Naive UVM (LVM baseline) | 75% | Zero (automatic) | GPU memory only | Not viable |
| Optimized UVM (LVM) | 64% | Zero (automatic) | GPU memory only | Not viable |
| Prefetch + Hints | 3-5% | High (15 lines/kernel) | GPU memory only | Production (AWS, Azure) |
| DMT (CPU+GPU tiering) | 45% | Medium (access patterns) | CPU + GPU combined | Research/specialized |
| HugeGPT (nested VM) | 8% vs 15% baseline | Zero (transparent) | GPU memory only | Research |
Trade-Off 1: Automatic vs Manual Optimization. Naive UVM provides zero-effort virtualization but suffers 75% overhead from page fault storms. Every thread faults simultaneously, idling the entire GPU for 300ms per iteration. Prefetching with manual hints reduces overhead to 3% by overlapping page migration with computation, but requires developers to insert cudaMemPrefetchAsync() calls and cudaMemAdvise() directives—15 lines of carefully tuned code per kernel. The automatic approach fails; the manual approach succeeds but demands expertise. Production systems accept manual optimization as necessary cost for viable virtualization.
Trade-Off 2: Performance vs Memory Capacity. GPU memory is limited (40-80 GB per device). Models exceeding this capacity have two options: split across multiple GPUs (requiring expensive all-to-all communication), or tier to CPU memory (slower but local). DMT dynamic memory tiering enables 200 GB models on 80 GB GPUs by keeping hot data in HBM and cold data in DDR, but suffers 45% overhead from tier migration. The performance-capacity trade-off: fit-in-memory achieves 3% overhead; larger-than-memory accepts 45% overhead for 2.5× capacity expansion. Cloud providers offer both: p4d instances (80 GB, 3% overhead) for standard models, custom instances (320 GB via tiering, 45% overhead) for oversized models.
Trade-Off 3: Single-Level vs Nested Virtualization. Cloud infrastructure increasingly uses nested virtualization (VM inside VM) for flexibility and security. Traditional single-level translation costs 200ns per TLB miss. Nested translation requires translating guest virtual → guest physical → host physical, costing 2,000ns (10× slower). HugeGPT reduces nested overhead from 15% to 8% by storing guest page tables on host huge pages, cutting host-level page walks from 16 memory accesses to 8. The single-nested trade-off: single-level VMs achieve 3% overhead, nested VMs accept 8% overhead (with HugeGPT) for infrastructure flexibility. Enterprise deployments prefer nested for isolation despite performance cost.
Standard Training (models ≤ 80 GB): Use optimized UVM with manual prefetching. Overhead: 3-5%. Deployment: AWS p4d.24xlarge, Azure NC96ads_A100_v4, Google Cloud A2. Programmer provides prefetch calls and access hints. Cloud providers document best practices (NVIDIA's "Unified Memory Best Practices Guide"). Achieves near-native performance with virtualization flexibility.
Large Models (80-200 GB): Use DMT dynamic tiering to CPU memory. Overhead: 40-50% accepted as alternative to multi-GPU splitting. Deployment: Custom instances with 1 TB CPU RAM, pinned allocations, NUMA-aware placement. Research deployments only (not general availability). Enables models impossible on single GPU, trading performance for capacity.
Nested Virtualization (security-critical): Use HugeGPT for host page tables. Overhead: 8% vs 15% baseline. Deployment: Enterprise private clouds, regulated industries. Transparency (no application changes) justifies 8% cost. Not yet production as of 2026—under evaluation at major cloud providers.
Not Deployed: Naive UVM without optimization. 75% overhead makes this unusable. Cloud providers disable automatic paging, require explicit prefetching. Users who ignore prefetching see severe performance degradation and support tickets.
CPU virtualization achieved 2-5% overhead in 2005 and became universal (95%+ deployment) by 2015. GPU virtualization in 2020 had 75% overhead and deployment remains <20% in 2026. Why?
Architectural difference 1: Thread count. CPUs run 8-192 threads with independent execution. One thread faulting doesn't block others. Impact: 12.5% capacity loss (1 of 8 threads). GPUs run 10,752 threads in lockstep SIMD warps. One thread faulting stalls entire warp (32 threads), and GPU schedulers stall all warps during page fault handling. Impact: 100% capacity loss. This 8× difference in fault impact makes naive GPU virtualization catastrophic.
Architectural difference 2: Memory access patterns. CPUs exhibit high locality (same 4KB pages accessed repeatedly). Page faults rare after working set is loaded. TLB coverage with 512 entries × 4KB = 2 MB sufficient for many workloads. GPUs access 1,800 GB working sets with 95% bandwidth utilization. TLB coverage: 512 entries × 2MB = 1 GB, covering 0.055% of working set. Constant TLB misses and page faults throughout execution. Locality assumptions that make CPU virtualization cheap do not hold for GPUs.
Architectural difference 3: Fault handling latency. CPU page fault: 10,000 cycles, one thread stalled, 7 others continue. Amortized: 1,250 cycles per thread. GPU page fault: 50,000 cycles, all 10,752 threads stalled, zero useful work. Amortized: 50,000 cycles per thread. The 40× worse fault cost combined with 100× higher fault frequency creates 4,000× worse virtualization overhead for naive GPU systems compared to CPUs.
CPU architects in the 2000s believed hardware-assisted virtualization (Intel VT-x, AMD-V) would make virtualization "free"—zero overhead, transparent to applications. This proved true for CPUs due to their architectural properties (low thread count, high locality, independent execution). The same assumption applied to GPUs failed spectacularly: 75% overhead from architectural mismatch.
The lesson: virtualization overhead depends on workload characteristics, not just hardware support. CPUs succeeded because workloads matched virtualization assumptions (high locality, low parallelism). GPUs failed initially because AI workloads violated every assumption (low locality, extreme parallelism, huge working sets). Success required rethinking the approach: manual prefetching to eliminate page faults, dynamic tiering to extend memory capacity, huge pages to reduce translation overhead.
The broader implication: architectural features successful in one domain (CPU virtualization) cannot be assumed transferable to another (GPU AI workloads). Each workload class requires fresh analysis of assumptions and constraints. The MMU design optimized for CPU web servers is fundamentally unsuited to GPU transformers, not because of poor engineering, but because the constraints changed by 1,000×.
Question 1: Can we achieve 3% overhead automatically? Current approach requires 15 lines of manual prefetch code per kernel. Can compiler analysis automatically insert prefetch calls? Preliminary work (AutoPrefetch, ASPLOS 2025) shows promise: 85% of manual hints recoverable via static analysis. Remaining 15% require runtime profiling or programmer annotations. Fully automatic virtualization with <5% overhead remains open problem.
Question 2: Can we reduce DMT overhead below 45%? Tiering to CPU memory enables 2.5× capacity expansion but costs 45% performance. Can smarter tier management (predictive migration, access pattern learning) reduce this? Early results (DMT v2, unpublished 2026) show 35% overhead with ML-based prefetching. Further optimization possible but fundamental bandwidth gap (HBM 3.35 TB/s vs DDR 200 GB/s) imposes hard limits. Likely floor: 25-30% overhead.
Question 3: Will nested virtualization become standard? HugeGPT reduces nested overhead to 8%, making it viable for enterprise. But cloud providers resist nested VMs due to management complexity. Will security benefits (stronger isolation) outweigh management costs? AWS announced nested GPU support (2026 roadmap), Azure evaluating, Google Cloud no public plans. Deployment uncertain—depends on enterprise demand for isolation vs cloud provider preference for simpler infrastructure.
GPU virtualization is viable but not transparent. Unlike CPUs where virtualization became zero-effort infrastructure, GPU virtualization requires explicit programmer cooperation. The 3% overhead is achievable but only with careful prefetching and hints. Cloud providers document this clearly: "For optimal performance with GPU virtualization, applications must use cudaMemPrefetchAsync." Developers who ignore this guidance see 75% overhead.
The deployment reality: virtualization enabled where performance justifies effort (multi-tenant cloud VMs serving independent customers), avoided where native performance critical (large-scale training runs). Cloud providers offer both: bare-metal instances (p4de.24xlarge, 0% virtualization overhead) for training, virtualized instances (p4d.24xlarge, 3% overhead) for inference and development. Users choose based on workload performance sensitivity.
This is the MMU breaking point: virtualization's promise of transparency (work everywhere without modification) does not hold for GPU AI workloads. Success requires application awareness and cooperation—abandoning the "one-size-fits-all" abstraction that defined CPU virtualization's success.
This chapter has examined three scaling challenges—coherence, isolation, virtualization—but a common pattern emerges: MMU architecture for AI systems is fundamentally about choosing positions in three-dimensional trade-off space.
| Approach | Mechanism | Overhead (1024 GPUs) | Complexity | Flexibility |
|---|---|---|---|---|
| Serial broadcast | Software O(N) | 47% | Low | High |
| Trans-FW (forwarding) | MMU modification | 30% | Medium | High |
| ecoTLB (eventual) | Software relaxed | 5% (but unsafe) | Low | High |
| IDYLL (directory) | Hardware O(log N) | 5% | High | High |
| XLA (static) | Software O(1) | 0% | High | Low |
- Serial broadcast: Simple but doesn't scale beyond 256 GPUs (47% overhead unacceptable) - Directory coherence (IDYLL): Scales to 10,000+ GPUs at 5% overhead, but requires $1-2B R&D for custom ASIC - Static compilation (XLA): Zero overhead but sacrifices flexibility (cannot handle dynamic shapes, requires 60-min recompilation)
Workload determines the right choice:- Dynamic research workloads: Directory coherence (IDYLL) provides transparency with acceptable overhead - Production inference: Static compilation (XLA) achieves zero overhead for stable, known workloads - Small scale (<64 GPUs): Serial broadcast acceptable (5-10% overhead tolerable)
Production Deployment (2026):- NVSwitch 3.0: Implements directory-like coherence (claimed 25× speedup) - Google TPU: Requires XLA static compilation (0% overhead) - Academic clusters: Use serial broadcast (limited scale, no custom hardware)
| Approach | Security Level | Performance | Attack Success | Deployment |
|---|---|---|---|---|
| Shared TLB | Logical (policy) | 100% | 97% (TunneLs) | Deprecated 2023 |
| Partitioned TLB | Physical partitioning | 93-95% | <5% | Blackwell 2024 |
| Chiplet | Physics (separate dies) | 100% | 0% | MI300X 2024 |
| CryptoMMU | Cryptographic | 82% | ~5% | Research only |
- Single-tenant (research): Shared TLB acceptable (no adversary) - General cloud: Partitioned TLB (5-7% overhead for 99% isolation) - Finance/healthcare: Chiplet isolation (regulatory compliance requires perfect isolation) - Future: CryptoMMU if overhead reduces below partitioned TLB
Real-World Evolution:2020-2023: Shared TLB (NVIDIA MIG)
Cloud adoption: High (7× revenue density)
Security: Assumed adequate (logical isolation)
2023: TunneLs disclosure
Attack success: 97.3% accuracy
Cloud response: Disable MIG (revert to full GPU)
Revenue loss: $500M annually
2024-2026: Partitioned TLB (Blackwell) + Chiplet (MI300X)
Cloud re-enabling: MIG with partitioned TLB
High-security: Chiplets for regulated workloads
Overhead: 5-7% acceptable for security guarantee
| Approach | Flexibility | Overhead | Programmer Burden | Use Case |
|---|---|---|---|---|
| Pinned memory | None (static) | 0% | Low | Production training |
| Naive UVM | High (dynamic) | 75% | Low | Unusable |
| Optimized UVM | High (dynamic) | 3% | Medium (hints) | Cloud inference |
| DMT (tiering) | Very high (>GPU mem) | 45% | Low | Research (budget) |
Scenario 1: Batch Inference (Cost-Sensitive)
Workload: Overnight jobs (8-hour deadline)
Overhead tolerance: 50% (still meets deadline)
Flexibility value: High (1.5× overcommit = 35% revenue gain)
Choice: Optimized UVM (3% overhead, worth it)
Scenario 2: Interactive Inference (Latency-Critical)
Workload: User-facing (100ms SLA)
Overhead tolerance: 5% (110ms still acceptable)
Flexibility value: Medium (slight overcommit)
Choice: Optimized UVM (3% overhead, barely acceptable)
Scenario 3: Training (Time-is-Money)
Workload: $10K/day GPU cluster
Overhead tolerance: 0% (every hour costs $416)
Flexibility value: None (dedicated resources)
Choice: Pinned memory (0% overhead, mandatory)
Scenario 4: Research (Larger-than-Memory)
Workload: GPT-3 exploration ($50K budget)
Overhead tolerance: 100% (2× slower tolerable)
Flexibility value: Infinite (enables impossible training)
Choice: DMT tiering (45% overhead, but affordable)
The recurring theme across all three dimensions: there is no universally optimal MMU architecture.
The "best" design depends on:
Workload characteristics:- Static vs dynamic (XLA vs directory) - Sensitive vs public (chiplet vs partitioned) - Known vs exploratory (pinned vs UVM)
Scale:- 64 GPUs: Serial broadcast acceptable (10% overhead) - 512 GPUs: Directory coherence needed (23% → 5%) - 10,000 GPUs: Must use directory or static compilation
Security requirements:- Single-tenant: Shared TLB acceptable - Multi-tenant cloud: Partitioned TLB mandatory - Regulated industries: Chiplet isolation required
Economic constraints:- Startup budget: DMT tiering (cheap but slow) - Production revenue: Custom ASIC (expensive but fast) - Research grant: Software optimization (free but expert time)
Chapters 1-11 described MMU architecture assuming general-purpose, "one-size-fits-all" design. This chapter demonstrates that AI workloads break that model.
The future (Chapters 14-16) lies in specialized approaches:
Chapter 14: Software-Managed Memory- Exploit workload knowledge for zero overhead - vLLM PagedAttention: 99.998% → 0.002% TLB miss rate - Software page tables: Application controls allocation
Chapter 15: Photonic Interconnects- Eliminate coherence overhead via new physics - Optical circuit switching: <1ns propagation (vs 5µs InfiniBand) - Translation at fabric level (not per-GPU)
Chapter 16: Synthesis- When to use hardware vs software approaches - Economic analysis (R&D cost vs operational savings) - Roadmap for next-generation memory systems
Serial protocols (shootdowns, broadcasts) scale O(N) and become bottlenecks at 512+ GPUs (23-47% overhead). Solutions reduce to O(log N) (directory) or O(1) (static compilation).
Takeaway 2: Security Requires Physics or PerformanceLogical isolation (shared TLB with separate address spaces) is insufficient. Must choose: 5-7% overhead (partitioned TLB) or 0% overhead + manufacturing complexity (chiplet isolation).
Takeaway 3: Virtualization Needs Programmer ExpertiseNaive GPU virtualization (75% overhead) is unusable. Optimization via prefetching + hints reduces to 3%, but requires workload knowledge. Automatic optimization remains research challenge.
The end of Moore's Law forces specialization. Memory management is no exception. The future belongs to systems that know their workload and optimize accordingly.
Menychtas, K., Bhattacharjee, A., Kwon, J., and Kozuch, M. A. "GPU-Resident Incremental TLB Management for Multi-GPU Systems (GRIT)." HPCA 2023. IEEE, 2023.
Wang, H. and Tang, X. "Trans-FW: Platform-Agnostic Multi-GPU TLB Sharing." HPCA 2023. IEEE, 2023.
Maass, S., Kumar, M., Kim, T., Krishna, T., and Bhattacharjee, A. "ecoTLB: Eventually Consistent TLBs for Disaggregated Memory." ACM Transactions on Architecture and Code Optimization (TACO) 17, no. 2 (2020).
University of Michigan Research Team. "IDYLL: In-PTE Directory for Lightweight Invalidation at Scale." MICRO 2023. IEEE/ACM, 2023.
XLA Team, Google Brain. "XLA: Optimizing Compiler for Machine Learning." MLSys 2019. 2019.
Censier, L. M. and Feautrier, P. "A New Solution to Coherence Problems in Multicache Systems." IEEE Transactions on Computers C-27, no. 12 (1978): 1112-1118.
Lenoski, D., et al. "The DASH Prototype: Implementation and Performance." ASPLOS 1990. ACM, 1990.
Liu, C., Li, Z., Chen, Z., Zhang, Z., and Hong, G. "TunneLs: Exploiting Shared TLB Vulnerabilities in NVIDIA MIG." ACM CCS 2023. ACM, 2023.
Confidential Computing Research Group. "CryptoMMU: Secure Address Translation for Untrusted Accelerators." ASPLOS 2023. ACM, 2023.
AMD Corporation. "AMD Instinct MI300X Architecture Whitepaper." AMD, 2024.
NVIDIA Corporation. "Multi-Instance GPU (MIG) User Guide." NVIDIA, 2023.
GPU Systems Research Group. "Lightweight Virtual Memory for GPUs (LVM)." MICRO 2025. IEEE/ACM, 2025.
Memory Architecture Researchers. "DMT: Dynamic Multi-Table Page Table Design." ASPLOS 2024. ACM, 2024.
Virtual Memory Research Group. "HugeGPT: Huge Page Management for LLMs in Nested Virtualization." ASPLOS 2025. ACM, 2025.
NVIDIA Corporation. "Unified Memory Programming Guide." NVIDIA Developer Documentation, 2023.
Intel Corporation. "Intel 64 and IA-32 Architectures Software Developer's Manual." Intel, 2023.
ARM Limited. "ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile." ARM, 2023.
NVIDIA Corporation. "CUDA C++ Programming Guide." NVIDIA Developer Documentation, 2023.
Bhattacharjee, A. and Martonosi, M. "Translation Lookaside Buffers." Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2019.
Cekleov, M. and Dubois, M. "Virtual-Address Caches. Part 1: Problems and Solutions in Uniprocessors." IEEE Micro 17, no. 5 (1997): 64-71.
Brown, T., et al. "Language Models are Few-Shot Learners (GPT-3)." NeurIPS 2020. 2020.
Touvron, H., et al. "LLaMA: Open and Efficient Foundation Language Models." Meta AI Research, 2023.
Vaswani, A., et al. "Attention is All You Need." NeurIPS 2017. 2017.
Meta AI. "LLaMA 3 Model Card." Meta, 2024.
OpenAI. "GPT-4 Technical Report." OpenAI, 2023.