The preceding fourteen chapters have documented both the power and the limitations of conventional memory management units. We established the foundational architecture (Chapters 1-4), examined advanced mechanisms (Chapters 5-10), and analyzed how AI/ML workloads expose fundamental breaking points at scale (Chapters 11-12). Chapter 13 demonstrated why hardware-based machine learning approaches to MMU optimization largely failed, while Chapter 14 showed how software-managed memory—particularly vLLM's PagedAttention and the resurgence of Direct Segments—can achieve 2-4× performance improvements for specific workloads.
This chapter examines a fundamentally different question: What if translation doesn't happen in the CPU or GPU at all?
Traditional MMU architecture assumes that address translation occurs at the processor—whether through TLB lookups, hardware page walks, or software exception handlers. This assumption has held since the introduction of virtual memory in the 1960s. Even the innovations documented in previous chapters—huge pages (Chapter 3), TLB hierarchies (Chapter 4), IOMMU (Chapter 5), and Direct Segments (Chapter 14)—maintain this fundamental model: the processing element performs translation.
Three converging trends make alternative translation architectures increasingly relevant:
Trend 1: Disaggregated Memory
Modern AI clusters increasingly separate compute from memory. A training job might use 1,024 GPUs but access a shared 100 TB memory pool over the network. When memory is physically remote, the traditional model of "GPU performs translation, then fetches data" introduces unnecessary latency. What if the network fabric or the memory system itself could perform translation?
Trend 2: Processing-in-Memory (PIM)
As documented in Chapters 11-12, memory bandwidth has become the critical bottleneck for AI workloads. Processing-in-memory architectures place compute logic inside or adjacent to DRAM. If computation happens at memory, why should translation happen at the distant CPU? Can translation occur at the memory controller or within the memory stack itself?
Trend 3: Translation Overhead at Extreme Scale
Chapter 12 quantified the breaking points: at 10,000 GPUs with 1.8 TB working sets, traditional MMU overhead reaches 40-80%. Even with all optimizations—huge pages, range TLBs, hardware page walk caches—the fundamental cost of translating billions of addresses per second across thousands of devices becomes prohibitive. Alternative architectures that eliminate or distribute this overhead become economically necessary.
This chapter distinguishes between optimizations within the traditional MMU paradigm, which previous chapters covered, and true architectural alternatives, which are the subject of this chapter. The key distinction: alternative architectures change where or how translation fundamentally occurs, not just how efficiently the traditional model operates.
We examine three peer-reviewed approaches, each representing a different architectural alternative:
Section 15.2: Network-Level Translation
MIND (SOSP 2021) and pulse (ASPLOS 2025) move translation into programmable network switches and distributed memory controllers. Instead of each GPU translating addresses independently, the network fabric performs translation once and routes data directly. This approach addresses the multi-GPU coordination overhead documented in Chapter 12, achieving O(1) translation cost regardless of cluster size.
Section 15.3: Processing-in-Memory Translation (PIM-TLB)
vPIM (DAC 2023), IMPRINT, and recent work (NDPage, H2M2) place translation logic within or adjacent to High-Bandwidth Memory (HBM) stacks. When compute occurs at memory—as in PIM architectures—translation should too. These approaches eliminate the CPU→memory round-trip for page walks, reducing translation latency by 10-50× for memory-intensive AI workloads.
Section 15.4: Utopia—Hybrid Radix-Segments
While Chapter 14 examined Direct Segments in depth, Utopia (MICRO 2023) represents a different architectural choice: rather than choosing between radix page tables or segments, it combines both. Small allocations use traditional paging for flexibility; large allocations automatically use segments for performance. This hybrid approach achieves both the flexibility of paging and the performance of segments without programmer intervention.
This chapter builds on, but does not repeat, previous content.
This chapter focuses on approaches with peer-reviewed evidence and clear MMU relevance. Our inclusion criterion: the architecture must change where or how translation fundamentally occurs, with published evaluation demonstrating feasibility.
For each architecture, we examine the motivating problem, the translation mechanism, the published evaluation, and the prospects for deployment.
By the end of this chapter, readers will understand not just how these alternatives work, but when and why they represent fundamentally different architectural choices from the traditional processor-centric MMU model that has dominated for five decades.
Let us begin with the most radical departure: moving translation entirely out of the processor and into the network fabric itself.
Traditional memory management assumes a processor-centric model: the CPU or GPU that needs data must translate the virtual address to a physical address, then fetch the data. This model made sense when processors had local memory. But modern AI clusters increasingly use disaggregated memory—compute and storage are physically separated, connected by high-speed networks. In this architecture, requiring each of 1,024 GPUs to independently translate the same virtual addresses introduces massive redundancy.
Network-level translation inverts this model: translation happens once, in the network fabric, and data is routed directly to the requesting device. Two recent systems demonstrate this approach: MIND (SOSP 2021) and pulse (ASPLOS 2025).
Paper: "MIND: In-Network Memory Management for Disaggregated Data Centers" (SOSP 2021)
Authors: Abhishek Bhattacharjee, et al. (Yale University)
Key Innovation: Programmable network switches perform address translation
Consider a scenario from Chapter 12's analysis:
Training cluster: 512 GPUs
Shared memory pool: 10 TB (disaggregated across 64 memory servers)
Workload: GPT-3 training
Traditional approach (per-GPU translation):
- Each GPU maintains page tables for entire 10 TB space
- Access to virtual address 0x7fff_0000_1000:
* GPU 0: TLB miss → page walk → translate
* GPU 1: TLB miss → page walk → translate (same address!)
* ...
* GPU 511: TLB miss → page walk → translate (same address!)
- Result: 512 redundant translations for shared data
Cost per GPU:
- Page walk: 4 memory accesses × 200ns = 800ns
- Total cluster: 512 × 800ns = 409.6μs wasted on redundant work
MIND observes that in disaggregated memory architectures, the network already knows where data is physically located. The switch routes packets to memory servers. Why not have the switch also perform translation?
MIND leverages programmable network switches (P4-capable hardware like Tofino) to implement translation logic directly in the network data plane.
System Components:
1. GPU compute nodes (512 GPUs)
- Generate memory requests with virtual addresses
- No local page tables for remote memory
- Send requests to network switch
2. Network switch (Programmable switch with MIND logic)
- Translation cache: 32K-64K entries
- Page table cache: Stores frequently-accessed PTEs
- Translation engine: Performs VA→PA in switch ASIC
- Routing logic: Directs packets to correct memory server
3. Memory servers (64 servers, 10 TB total)
- Store data at physical addresses
- No translation logic needed
- Pure storage + retrieval
4. Central controller
- Manages page tables (software)
- Updates switch translation cache
- Handles page faults, allocation
Translation Process:
GPU issues memory request:
Packet: {SrcGPU=42, VirtualAddr=0x7fff00001000, Length=4KB, Type=READ}
↓
Arrives at network switch
↓
Switch translation cache lookup:
if (VirtualAddr in cache):
PhysicalAddr = cache[VirtualAddr]
MemoryServer = PhysicalAddr / ServerCapacity
Modify packet: {Dst=MemoryServer, PhysAddr=...}
Forward to memory server
else:
Send to controller for page walk
Controller updates cache
Retry
↓
Memory server receives:
Packet: {PhysAddr=0x20040000, Length=4KB, Type=READ}
Reads 4KB from PhysAddr
Returns data to SrcGPU=42
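The data-plane logic above can be sketched as a small behavioral model in Python (not P4; the class and field names here, such as `SwitchTranslator`, are illustrative, and the real switch fills its cache via a separate control plane):

```python
PAGE_SIZE = 4096

class SwitchTranslator:
    """Behavioral sketch of MIND's switch data plane: a translation
    cache keyed by virtual page number, populated by the controller."""

    def __init__(self):
        self.cache = {}  # VPN -> (PPN, memory-server ID)

    def install(self, vpn, ppn, server_id):
        # Control plane: the controller pushes an entry after a page walk.
        self.cache[vpn] = (ppn, server_id)

    def process(self, packet):
        vpn = packet["virt_addr"] // PAGE_SIZE
        if vpn not in self.cache:
            # Cache miss: punt to the controller for a software page walk.
            return {"action": "to_controller", "packet": packet}
        ppn, server = self.cache[vpn]
        phys = ppn * PAGE_SIZE + packet["virt_addr"] % PAGE_SIZE
        # Hit: rewrite the packet with the physical address and route it
        # straight to the owning memory server.
        return {"action": "forward", "dst_server": server,
                "phys_addr": phys, "type": packet["type"]}
```

A first access misses and goes to the controller; once the controller installs the entry, every subsequent GPU's request for that page is rewritten and forwarded in the switch itself.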
The power of MIND comes from amortization:
Scenario: 512 GPUs all access same virtual page
Traditional (each GPU translates):
- 512 page walks
- 512 × 800ns = 409.6μs
MIND (switch translates once):
- First GPU: Miss in switch cache → controller page walk
Cost: 2μs (includes controller communication)
- Subsequent 511 GPUs: Hit in switch cache
Cost: 511 × 50ns = 25.5μs (cache lookup only)
- Total: 2μs + 25.5μs = 27.5μs
Speedup: 409.6μs / 27.5μs = 14.9× faster
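The amortization arithmetic above can be captured in a few lines (using the text's cost assumptions: an 800ns local walk, a 50ns switch-cache lookup, and a 2μs controller miss):

```python
def cluster_translation_cost_us(n_gpus, walk_ns=800, switch_hit_ns=50,
                                controller_miss_us=2.0):
    """Total translation cost for one shared page, in microseconds:
    traditional (every GPU walks) vs MIND (one miss, then cache hits)."""
    traditional = n_gpus * walk_ns / 1000
    mind = controller_miss_us + (n_gpus - 1) * switch_hit_ns / 1000
    return traditional, mind

trad, mind = cluster_translation_cost_us(512)
# trad ≈ 409.6 μs, mind ≈ 27.6 μs, speedup ≈ 14.9×
```

The speedup grows roughly linearly with the number of GPUs sharing the page, since the single controller miss is amortized across all of them.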
The more GPUs share data, the greater the benefit. For AI training where all GPUs access shared model weights and gradients, this amortization is substantial.
Programmable switches impose strict limits on on-chip memory and on per-packet processing time. MIND's translation cache design accounts for these constraints:
Translation cache: 64K entries × 12 bytes = 768 KB
Entry format:
VirtualPageNum: 40 bits (assumes 48-bit VA, 4KB pages)
PhysicalPageNum: 40 bits
Permissions: 4 bits (R/W/X/Valid)
ServerID: 8 bits (up to 256 memory servers)
Padding: 4 bits
Total: 96 bits = 12 bytes per entry
Lookup: Hash-based (O(1))
Hash VPN → index into cache
Compare VPN (40 bits)
Return PPN + ServerID if match
Latency: 1-2 switch cycles = 10-20ns (at 100 Gbps line rate)
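The 96-bit entry layout above can be exercised directly; this sketch packs and unpacks entries with the stated field widths (the field order within the word is an assumption, since the text specifies only the widths):

```python
def pack_entry(vpn, ppn, perms, server_id):
    """Pack one 12-byte translation-cache entry:
    VPN (40b) | PPN (40b) | permissions (4b) | server ID (8b) | pad (4b)."""
    assert vpn < 1 << 40 and ppn < 1 << 40
    assert perms < 1 << 4 and server_id < 1 << 8
    word = vpn | (ppn << 40) | (perms << 80) | (server_id << 84)
    return word.to_bytes(12, "little")  # top 4 bits stay zero (padding)

def unpack_entry(raw):
    word = int.from_bytes(raw, "little")
    return (word & ((1 << 40) - 1),           # VPN
            (word >> 40) & ((1 << 40) - 1),   # PPN
            (word >> 80) & 0xF,               # permissions (R/W/X/Valid)
            (word >> 84) & 0xFF)              # server ID
```

The widths sum to 92 bits, leaving the 4 padding bits the text describes, for exactly 12 bytes per entry.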
When the switch cache misses:
1. Packet sent to central controller (fast path bypassed)
2. Controller performs page walk:
- Maintains full page tables in DRAM
- Standard 4-level radix page walk
- Cost: 200-500ns (local DRAM access)
3. Controller updates switch cache via control plane
4. Controller instructs switch to retry packet
5. Retry hits in cache, proceeds normally
Miss penalty: ~2-5μs (vs 800ns local page walk)
But: Amortized across many GPUs accessing same page
The MIND paper evaluated on a testbed:
Configuration:
- 32 compute servers (each with 1 GPU)
- 8 memory servers (256 GB total disaggregated memory)
- Barefoot Tofino programmable switch
- 100 Gbps Ethernet
Workloads:
- Graph analytics (PageRank, BFS)
- ML training (ResNet-50, BERT)
- In-memory databases (Redis, Memcached)
Results:
| Workload | Baseline (CPU translation) | MIND (switch translation) | Speedup |
|---|---|---|---|
| PageRank (64 GB graph) | 1.0× (baseline) | 1.8× faster | 1.8× |
| BFS (64 GB graph) | 1.0× | 1.6× faster | 1.6× |
| ResNet-50 training | 1.0× | 1.3× faster | 1.3× |
| BERT fine-tuning | 1.0× | 1.4× faster | 1.4× |
| Redis (50% remote) | 1.0× | 1.2× faster | 1.2× |
Analysis: graph analytics benefits most (1.6-1.8×), while ML training benefits moderately (1.3-1.4×). This is consistent with the amortization argument above: the more clients share the same pages, the larger the payoff from translating once in the switch.
The MIND paper projects performance at larger scale:
Scale: 512 GPUs, 10 TB disaggregated memory
Switch cache capacity: 64K entries
Coverage: 64K × 4KB = 256 MB of unique pages
Working set: 10 TB / 512 GPUs = 20 GB per GPU average
Cache hit rate estimation:
- If 80% of accesses go to shared data (model weights, etc.)
- Shared data = 2 GB = 512K pages, exceeding the 64K-entry cache
- But each cached entry serves all 512 GPUs, so hot shared pages dominate
- Effective hit rate: ~75-85%
Performance at 512 GPUs:
- Traditional: Each GPU translates independently
Overhead: 10-15% of execution time
- MIND: 75% switch hits, 25% controller lookups
Overhead: 2-4% of execution time
- Speedup: ~1.09-1.12× overall (overhead falls from 10-15% to 2-4% of runtime)
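The overhead-to-speedup arithmetic can be made explicit; this one-liner assumes both overheads are expressed as fractions of the baseline runtime, and it reproduces roughly the 1.09-1.12× range estimated above:

```python
def mmu_speedup(f_old, f_new):
    """Overall speedup when translation overhead shrinks from f_old to
    f_new (both fractions of baseline runtime): the useful work (1 - f_old)
    is unchanged, and the new overhead f_new is added back."""
    return 1.0 / (1.0 - f_old + f_new)

# mmu_speedup(0.10, 0.02) ≈ 1.09; mmu_speedup(0.15, 0.04) ≈ 1.12
```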
The benefit is most pronounced when many clients share the same pages and the working set lives in disaggregated memory: exactly the AI-training pattern in which every GPU reads the same model weights and gradients.
Paper: "pulse: Accelerating Distributed Page Table Walks with Programmable NICs" (ASPLOS 2025)
Authors: Hao Tang, et al. (University of Wisconsin-Madison)
Key Innovation: Distributed pointer-chasing for page walks across network
MIND addresses redundant translation but assumes a central controller performs page walks. The pulse paper identifies a different bottleneck: page walk latency when page tables themselves are disaggregated.
Scenario: 1 TB working set disaggregated across 64 memory servers
Page tables for 1 TB:
- With 4KB pages: 268 million pages
- 4-level page table: 268M PTEs at leaf level
- Total page table size: ~2 GB (with higher levels)
Problem: Where do page tables live?
Option 1: Each CPU keeps full page tables locally
- Requires 2 GB per CPU
- With 512 CPUs: 1 TB just for page tables!
- Memory explosion
Option 2: Page tables also disaggregated
- Distribute page tables across memory servers
- But now page walks require network access
- 4-level walk = 4 network round-trips
- Latency catastrophic!
Traditional page walk on local DRAM:
Walk 4 levels:
PML4 → PDPT → PD → PT
4 accesses × 100ns = 400ns total
Page walk with disaggregated page tables (naive):
Walk 4 levels over network:
CPU → Network → Memory server 1: Read PML4 entry (2μs RTT)
CPU → Network → Memory server 2: Read PDPT entry (2μs RTT)
CPU → Network → Memory server 3: Read PD entry (2μs RTT)
CPU → Network → Memory server 4: Read PT entry (2μs RTT)
Total: 8μs
20× slower than local page walk!
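The 20× gap follows directly from the per-level cost; a tiny model (using the text's 100ns local access and 2μs network round-trip) makes the comparison explicit:

```python
def radix_walk_us(levels=4, per_level_us=0.1, rtt_us=2.0, remote=False):
    """Latency of a radix page walk: each level is one dependent access,
    either a local DRAM read or a full network round-trip."""
    return levels * (rtt_us if remote else per_level_us)

# local: 0.4 μs; naive disaggregated: 8.0 μs, i.e. 20× slower
```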
The pulse insight: rather than having the CPU orchestrate each step of the page walk, let the network interface cards (NICs) chase pointers autonomously.
Key Components:
1. Programmable SmartNICs (NVIDIA BlueField or similar)
- On-board ARM cores
- DRAM for caching
- DMA engines
- Can execute custom logic
2. Distributed page walk agents (running on each SmartNIC)
- Receive page walk request from CPU
- Chase pointers across network autonomously
- Only return final result to CPU
3. Memory servers
- Store both data and page table fragments
- Respond to NIC requests
Operation:
CPU needs to translate VA 0x7fff_0000_1000:
Traditional approach (CPU-orchestrated):
1. CPU → NIC: "Read PML4[entry]"
2. NIC → Memory Server A → NIC: Returns PML4 entry (2μs)
3. CPU processes: "Next level at Memory Server B"
4. CPU → NIC: "Read PDPT[entry]"
5. NIC → Memory Server B → NIC: Returns PDPT entry (2μs)
6. CPU processes: "Next level at Memory Server C"
... (repeat for all 4 levels)
Total: 8μs (4 round-trips × 2μs)
pulse approach (NIC-autonomous):
1. CPU → NIC: "Walk page tables for VA 0x7fff_0000_1000"
2. NIC agent executes:
a. Fetch PML4 entry from Server A
b. Parse returned pointer to PDPT
c. Fetch PDPT entry from Server B (no CPU involvement!)
d. Parse returned pointer to PD
e. Fetch PD entry from Server C
f. Parse returned pointer to PT
g. Fetch PT entry from Server D
h. Extract physical address
3. NIC → CPU: "Translation complete: PA 0x2004_0000"
Latency: ~2.5μs (pipelined, overlapped network requests)
Speedup: 8μs / 2.5μs = 3.2× faster
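The NIC-autonomous walk is, at its core, a pointer chase that never returns to the CPU until the leaf is reached. A minimal sketch (hypothetical; the real agent runs on the SmartNIC's ARM cores, and `fetch` here stands in for one network read of a page-table entry on some memory server):

```python
def nic_walk(va, fetch, root):
    """Chase the 4-level x86-style chain autonomously.
    fetch(server_id, table_id, index) models one network PTE read;
    upper levels return (server_id, table_id) pointers, the leaf a frame."""
    node = root  # (server_id, table_id) of the top-level table
    for shift in (39, 30, 21):             # PML4, PDPT, PD indices
        node = fetch(node[0], node[1], (va >> shift) & 0x1FF)
    ppn = fetch(node[0], node[1], (va >> 12) & 0x1FF)  # leaf PTE
    return (ppn << 12) | (va & 0xFFF)
```

Only the final line's result crosses back to the CPU; the four fetches happen NIC-to-server, back to back, with no CPU round-trips in between.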
pulse achieves further improvements through pipelining:
Without pipelining:
Request 1: Level 1 (2μs) → Level 2 (2μs) → Level 3 (2μs) → Level 4 (2μs)
Request 2: Wait for Request 1 to complete...
Total for 2 requests: 16μs
With pipelining:
Request 1: L1 (2μs) → L2 (2μs) → L3 (2μs) → L4 (2μs)
Request 2: L1 (2μs) → L2 (2μs) → L3 (2μs) → L4 (2μs)
Timeline:
0-2μs: R1-L1 executing
2-4μs: R1-L2 executing, R2-L1 executing (parallel!)
4-6μs: R1-L3 executing, R2-L2 executing
6-8μs: R1-L4 executing, R2-L3 executing
8-10μs: R2-L4 executing
Total for 2 requests: 10μs (vs 16μs sequential)
Throughput: 2 translations / 10μs = 0.2 translations/μs (0.2M translations/sec per NIC) for this two-request burst; as the pipeline fills, steady state approaches one completion per 2μs stage, i.e. 0.5M translations/sec per NIC
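The timeline above is the standard pipeline formula: the first walk takes all four stages, and each additional in-flight walk drains one stage interval later. A worked sketch:

```python
def pipelined_completion_us(n_requests, levels=4, stage_us=2.0):
    """Completion time for n overlapped walks: fill the pipeline once
    (levels * stage_us), then one walk completes per stage interval."""
    return (levels + n_requests - 1) * stage_us

def sequential_completion_us(n_requests, levels=4, stage_us=2.0):
    """Baseline: each walk runs start-to-finish before the next begins."""
    return n_requests * levels * stage_us

# 2 requests: pipelined 10 μs vs sequential 16 μs, matching the timeline
```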
Prefetching based on spatial locality:
If CPU requests VA 0x1000:
- NIC walks page tables
- While walking, notices that PT covers 0x0000-0x20000
- Speculatively fetches PTEs for 0x1000-0x20000
- Caches in NIC DRAM
- Future requests in that range: instant hits
Benefit: Batch-processes contiguous translations
Hit rate for sequential access: 95%+
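The prefetch idea exploits the page-table layout itself: one leaf table holds 512 PTEs covering a contiguous 2 MB of virtual address space, so a single walk can pull the whole table into NIC DRAM. A hypothetical sketch (`fetch_pte`, modeling one network PTE read, is an illustrative stand-in):

```python
class NicPtePrefetcher:
    """Sketch of pulse-style spatial prefetching: on a miss, batch-fetch
    every PTE in the same leaf page table into the NIC's local cache."""

    def __init__(self, fetch_pte):
        self.fetch_pte = fetch_pte   # models a network read: VPN -> PPN
        self.cache = {}              # NIC DRAM cache of PTEs

    def translate(self, va):
        vpn = va >> 12
        if vpn not in self.cache:
            base = vpn & ~0x1FF      # first VPN covered by this leaf table
            for v in range(base, base + 512):
                self.cache[v] = self.fetch_pte(v)  # fetch whole leaf table
        return (self.cache[vpn] << 12) | (va & 0xFFF)
```

After the first access in a 2 MB region, every later translation in that region is an instant hit in NIC DRAM, which is where the high sequential-access hit rate comes from.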
The pulse paper evaluates on a testbed with disaggregated memory:
Configuration:
- 16 compute servers (each with NVIDIA BlueField-2 SmartNIC)
- 8 memory servers (512 GB total disaggregated memory)
- 100 Gbps network
- Page tables also disaggregated across memory servers
Workloads:
- Large-scale graph processing (graph500 benchmark)
- LLM inference (LLaMA-70B)
- In-memory key-value store (distributed hash table)
Results:
| Workload | CPU-orchestrated walks | pulse (NIC-autonomous) | Speedup |
|---|---|---|---|
| Graph500 (256 GB graph) | 1.0× (baseline) | 2.8× faster | 2.8× |
| LLaMA-70B inference | 1.0× | 1.9× faster | 1.9× |
| KV store (random access) | 1.0× | 2.2× faster | 2.2× |
Page walk latency reduction:
Average page walk latency:
- CPU-orchestrated: 7.2μs (4 round-trips × ~1.8μs each)
- pulse: 2.3μs (pipelined, autonomous)
- Reduction: 68% lower latency
| Aspect | MIND | pulse |
|---|---|---|
| Where translation occurs | Network switch | SmartNICs (distributed) |
| Primary benefit | Eliminates redundant translations | Reduces page walk latency |
| Best for | High sharing across many clients | Disaggregated page tables |
| Hardware | Programmable switch (P4) | SmartNICs (BlueField, etc.) |
| Deployment | Centralized (switch) | Distributed (per-node NIC) |
| Speedup | 1.2-1.8× | 1.9-2.8× |
| Scalability | Switch cache limited (64K entries) | Scales with NIC count |
The two approaches are complementary: MIND eliminates redundant translations at a central switch, while pulse accelerates the page walks themselves at each node's NIC, so a deployment could combine both.
Network-level translation represents a fundamental shift:
Traditional model (50+ years):
Processor performs translation → Processor issues physical address → Memory responds
Network-level model:
Processor issues virtual address → Network performs translation → Memory responds
This inversion has cascading effects on where translation state lives and who maintains it. Network translation wins when memory is disaggregated and heavily shared across many clients; it loses when accesses go predominantly to local, private memory, where a TLB hit costs about a nanosecond against the roughly 2μs of a network round-trip.
Deployment raises open questions of hardware requirements, software complexity, and security and isolation. As of 2026, both systems remain research prototypes evaluated on testbeds; significant barriers to adoption persist, and any path to production is speculative.
Network-level translation addresses problems that earlier chapters identified but did not solve. It does not, however, replace all local translation. The likely future is hybrid: local TLBs for local and private memory, network translation for shared and disaggregated memory.
(Continuing with Section 15.3: PIM-TLB...)