Chapter 15: Beyond Traditional MMU - Alternative Translation Architectures

15.1 Introduction

The preceding fourteen chapters have documented both the power and the limitations of conventional memory management units. We established the foundational architecture (Chapters 1-4), examined advanced mechanisms (Chapters 5-10), and analyzed how AI/ML workloads expose fundamental breaking points at scale (Chapters 11-12). Chapter 13 demonstrated why hardware-based machine learning approaches to MMU optimization largely failed, while Chapter 14 showed how software-managed memory—particularly vLLM's PagedAttention and the resurgence of Direct Segments—can achieve 2-4× performance improvements for specific workloads.

This chapter examines a fundamentally different question: What if translation doesn't happen in the CPU or GPU at all?

Traditional MMU architecture assumes that address translation occurs at the processor—whether through TLB lookups, hardware page walks, or software exception handlers. This assumption has held since the introduction of virtual memory in the 1960s. Even the innovations documented in previous chapters—huge pages (Chapter 3), TLB hierarchies (Chapter 4), IOMMU (Chapter 5), and Direct Segments (Chapter 14)—maintain this fundamental model: the processing element performs translation.

Why Question This Assumption Now?

Three converging trends make alternative translation architectures increasingly relevant:

Trend 1: Disaggregated Memory

Modern AI clusters increasingly separate compute from memory. A training job might use 1,024 GPUs but access a shared 100 TB memory pool over the network. When memory is physically remote, the traditional model of "GPU performs translation, then fetches data" introduces unnecessary latency. What if the network fabric or the memory system itself could perform translation?

Trend 2: Processing-in-Memory (PIM)

As documented in Chapters 11-12, memory bandwidth has become the critical bottleneck for AI workloads. Processing-in-memory architectures place compute logic inside or adjacent to DRAM. If computation happens at memory, why should translation happen at the distant CPU? Can translation occur at the memory controller or within the memory stack itself?

Trend 3: Translation Overhead at Extreme Scale

Chapter 12 quantified the breaking points: at 10,000 GPUs with 1.8 TB working sets, traditional MMU overhead reaches 40-80%. Even with all optimizations—huge pages, range TLBs, hardware page walk caches—the fundamental cost of translating billions of addresses per second across thousands of devices becomes prohibitive. Alternative architectures that eliminate or distribute this overhead become economically necessary.

What Makes an "Alternative" Architecture?

This chapter distinguishes between optimizations within the traditional MMU paradigm and true architectural alternatives:

Optimizations (covered in previous chapters):

- Huge pages, TLB hierarchies, and page walk caches (Chapters 3-4)
- IOMMU and device-side translation (Chapter 5)
- Software-managed memory and Direct Segments (Chapter 14)

All of these reduce translation cost but keep translation at the processing element.

Alternative architectures (this chapter):

- Network-level translation: switches and NICs perform translation (Section 15.2)
- Processing-in-memory translation: the memory stack itself translates (Section 15.3)
- Hybrid radix-segment designs: the translation structure itself changes (Section 15.4)

The key distinction: alternative architectures change where or how translation fundamentally occurs, not just how efficiently the traditional model operates.

Chapter Roadmap

We examine three peer-reviewed approaches, each representing a different architectural alternative:

Section 15.2: Network-Level Translation

MIND (SOSP 2021) and pulse (ASPLOS 2025) move translation into programmable network switches and distributed memory controllers. Instead of each GPU translating addresses independently, the network fabric performs translation once and routes data directly. This approach addresses the multi-GPU coordination overhead documented in Chapter 12, achieving O(1) translation cost regardless of cluster size.

Section 15.3: Processing-in-Memory Translation (PIM-TLB)

vPIM (DAC 2023), IMPRINT, and recent work (NDPage, H2M2) place translation logic within or adjacent to High-Bandwidth Memory (HBM) stacks. When compute occurs at memory—as in PIM architectures—translation should too. These approaches eliminate the CPU→memory round-trip for page walks, reducing translation latency by 10-50× for memory-intensive AI workloads.

Section 15.4: Utopia—Hybrid Radix-Segments

While Chapter 14 examined Direct Segments in depth, Utopia (MICRO 2023) represents a different architectural choice: rather than choosing between radix page tables or segments, it combines both. Small allocations use traditional paging for flexibility; large allocations automatically use segments for performance. This hybrid approach achieves both the flexibility of paging and the performance of segments without programmer intervention.

Relationship to Previous Chapters

This chapter builds on but does not repeat previous content:

- Chapter 12 supplies the scaling analysis and cost figures that motivate moving translation out of the processor
- Chapter 13 explains why prediction-based hardware approaches fell short, motivating structural alternatives
- Chapter 14's Direct Segments provide the baseline against which Utopia's hybrid design is judged

Scope and Limitations

This chapter focuses on approaches with peer-reviewed evidence and clear MMU relevance. We exclude:

- Optimizations that leave translation at the processor (covered in Chapters 3-5 and 14)
- Vendor proposals and position papers without published quantitative evaluation

Our criterion: the architecture must change where or how translation fundamentally occurs, with published evaluation demonstrating feasibility.

Evaluation Framework

For each architecture, we examine:

  1. Core mechanism: Where does translation occur? What data structures are used?
  2. Performance: Quantitative results from published evaluations
  3. Deployment status: Research prototype, production-ready, or deployed?
  4. Applicability: Which workloads benefit? What are the limitations?
  5. Relationship to traditional MMU: Can they coexist? Replace? Complement?

By the end of this chapter, readers will understand not just how these alternatives work, but when and why they represent fundamentally different architectural choices from the traditional processor-centric MMU model that has dominated for five decades.

Let us begin with the most radical departure: moving translation entirely out of the processor and into the network fabric itself.


15.2 Network-Level Translation

Traditional memory management assumes a processor-centric model: the CPU or GPU that needs data must translate the virtual address to a physical address, then fetch the data. This model made sense when processors had local memory. But modern AI clusters increasingly use disaggregated memory—compute and storage are physically separated, connected by high-speed networks. In this architecture, requiring each of 1,024 GPUs to independently translate the same virtual addresses introduces massive redundancy.

Figure 15.1: MIND in-network MMU: traditional per-GPU translation (left) requires each of N GPUs to independently walk the page table for the same virtual address. MIND (right) places a translation cache in the network switch or SmartNIC, performing the VA→PA lookup once and serving all requesting GPUs from a shared 128K-entry cache, achieving 1.2–2.8× application throughput improvement.

Network-level translation inverts this model: translation happens once, in the network fabric, and data is routed directly to the requesting device. Two recent systems demonstrate this approach: MIND (SOSP 2021) and pulse (ASPLOS 2025).

15.2.1 MIND: Memory-in-Network Disaggregation

Paper: "MIND: In-Network Memory Management for Disaggregated Data Centers" (SOSP 2021)
Authors: Seung-seob Lee, et al. (Yale University)
Key Innovation: Programmable network switches perform address translation

Figure 15.2: Pulse distributed TLB protocol: each GPU runs a Pulse agent with a local TLB. On a miss, the agent hashes the VA to find the directory owner GPU, sends a PULSE_REQ, and the owner performs a single page table walk returning the PA in ~100 ns round-trip. The distributed directory eliminates O(N) serial shootdown overhead, achieving 50× lower invalidation latency at 512 GPUs.

Motivation: Redundant Translation at Scale

Consider a scenario from Chapter 12's analysis:

Training cluster: 512 GPUs
Shared memory pool: 10 TB (disaggregated across 64 memory servers)
Workload: GPT-3 training

Traditional approach (per-GPU translation):
- Each GPU maintains page tables for entire 10 TB space
- Access to virtual address 0x7fff_0000_1000:
  * GPU 0: TLB miss → page walk → translate
  * GPU 1: TLB miss → page walk → translate (same address!)
  * ...
  * GPU 511: TLB miss → page walk → translate (same address!)
- Result: 512 redundant translations for shared data

Cost per GPU:
- Page walk: 4 memory accesses × 200ns = 800ns
- Total cluster: 512 × 800ns = 409.6μs wasted on redundant work

MIND observes that in disaggregated memory architectures, the network already knows where data is physically located. The switch routes packets to memory servers. Why not have the switch also perform translation?

Architecture: Translation in Programmable Switches

MIND leverages programmable network switches (P4-capable hardware like Tofino) to implement translation logic directly in the network data plane.

System Components:

1. GPU compute nodes (512 GPUs)
   - Generate memory requests with virtual addresses
   - No local page tables for remote memory
   - Send requests to network switch

2. Network switch (Programmable switch with MIND logic)
   - Translation cache: 32K-64K entries
   - Page table cache: Stores frequently-accessed PTEs
   - Translation engine: Performs VA→PA in switch ASIC
   - Routing logic: Directs packets to correct memory server

3. Memory servers (64 servers, 10 TB total)
   - Store data at physical addresses
   - No translation logic needed
   - Pure storage + retrieval

4. Central controller
   - Manages page tables (software)
   - Updates switch translation cache
   - Handles page faults, allocation

Translation Process:

GPU issues memory request:
  Packet: {SrcGPU=42, VirtualAddr=0x7fff00001000, Length=4KB, Type=READ}
      ↓
  Arrives at network switch
      ↓
  Switch translation cache lookup:
    if (VirtualAddr in cache):
        PhysicalAddr = cache[VirtualAddr]
        MemoryServer = PhysicalAddr / ServerCapacity
        Modify packet: {Dst=MemoryServer, PhysAddr=...}
        Forward to memory server
    else:
        Send to controller for page walk
        Controller updates cache
        Retry
      ↓
  Memory server receives:
    Packet: {PhysAddr=0x20040000, Length=4KB, Type=READ}
    Reads 4KB from PhysAddr
    Returns data to SrcGPU=42
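
As a concrete sketch, the fast-path/slow-path logic above can be modeled in a few lines of Python (class and function names are ours, not from the MIND paper; the eviction policy and capacity handling are simplified):

```python
# Simplified model of MIND's switch translation cache (illustrative names,
# not from the paper). The switch resolves VA -> (server, PA) in the data
# plane; misses are punted to a central controller, which fills the cache.

PAGE_SHIFT = 12  # 4 KB pages

class SwitchTranslationCache:
    def __init__(self, capacity=64 * 1024):
        self.capacity = capacity
        self.entries = {}  # VPN -> (server_id, PPN)

    def lookup(self, vaddr):
        """Fast path: executed in the switch data plane."""
        vpn = vaddr >> PAGE_SHIFT
        return self.entries.get(vpn)  # None models a switch-cache miss

    def fill(self, vaddr, server_id, paddr):
        """Control-plane update issued by the central controller."""
        vpn = vaddr >> PAGE_SHIFT
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))  # crude eviction
        self.entries[vpn] = (server_id, paddr >> PAGE_SHIFT)

def route_request(cache, controller_page_walk, vaddr):
    """Translate-and-route: hit in the switch, else controller walk + retry."""
    hit = cache.lookup(vaddr)
    if hit is None:
        server_id, paddr = controller_page_walk(vaddr)  # slow path
        cache.fill(vaddr, server_id, paddr)
        hit = cache.lookup(vaddr)
    server_id, ppn = hit
    return server_id, (ppn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
```

With a stub controller mapping a page to server 3, the first `route_request` call takes the slow path and fills the cache; every later request to the same page resolves entirely in the "switch".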

Key Insight: Amortization Across Requests

The power of MIND comes from amortization:

Scenario: 512 GPUs all access same virtual page

Traditional (each GPU translates):
- 512 page walks
- 512 × 800ns = 409.6μs

MIND (switch translates once):
- First GPU: Miss in switch cache → controller page walk
  Cost: 2μs (includes controller communication)
- Subsequent 511 GPUs: Hit in switch cache
  Cost: 511 × 50ns = 25.5μs (cache lookup only)
- Total: 2μs + 25.5μs = 27.5μs

Speedup: 409.6μs / 27.5μs = 14.9× faster

The more GPUs share data, the greater the benefit. For AI training where all GPUs access shared model weights and gradients, this amortization is substantial.
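
The amortization arithmetic above can be checked directly (constants taken from the scenario; all times in nanoseconds):

```python
# Reproduce the amortization example: 512 GPUs touch the same virtual page.
GPUS = 512
PAGE_WALK_NS = 800          # 4 memory accesses x 200 ns per GPU
SWITCH_HIT_NS = 50          # switch cache lookup
CONTROLLER_MISS_NS = 2_000  # first access: controller-side page walk

traditional_ns = GPUS * PAGE_WALK_NS                       # every GPU walks
mind_ns = CONTROLLER_MISS_NS + (GPUS - 1) * SWITCH_HIT_NS  # walk once, then hits

speedup = traditional_ns / mind_ns  # ~14.9x
```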

Switch Hardware Constraints

Programmable switches have strict limitations:

- On-chip SRAM is small (tens of MB) and shared with routing and ACL tables
- The match-action pipeline has a fixed number of stages and must process packets at line rate
- There is no switch-local DRAM: every lookup structure must fit in SRAM/TCAM

MIND's translation cache design accounts for these:

Translation cache: 64K entries × 12 bytes = 768 KB
  Entry format:
    VirtualPageNum: 40 bits (assumes 48-bit VA, 4KB pages)
    PhysicalPageNum: 40 bits
    Permissions: 4 bits (R/W/X/Valid)
    ServerID: 8 bits (up to 256 memory servers)
    Padding: 4 bits

  Total: 96 bits = 12 bytes per entry

Lookup: Hash-based (O(1))
  Hash VPN → index into cache
  Compare VPN (40 bits)
  Return PPN + ServerID if match

Latency: 1-2 pipeline passes ≈ 10-20ns (at 100 Gbps line rate)
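
For illustration, the 96-bit entry can be packed and unpacked as follows (the bit ordering is our choice; the paper does not fix a layout):

```python
# Pack/unpack the 12-byte (96-bit) translation-cache entry described above.
# Assumed layout, low bits to high bits:
#   VPN: 40 | PPN: 40 | perms: 4 | server_id: 8 | pad: 4

def pack_entry(vpn, ppn, perms, server_id):
    assert vpn < (1 << 40) and ppn < (1 << 40)
    assert perms < (1 << 4) and server_id < (1 << 8)
    return vpn | (ppn << 40) | (perms << 80) | (server_id << 84)

def unpack_entry(raw):
    vpn = raw & ((1 << 40) - 1)
    ppn = (raw >> 40) & ((1 << 40) - 1)
    perms = (raw >> 80) & 0xF
    server_id = (raw >> 84) & 0xFF
    return vpn, ppn, perms, server_id
```

The packed value fits in 96 bits, i.e. `raw.to_bytes(12, "little")` yields exactly the 12-byte entry the cache budget assumes.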

Handling "TLB Misses" in the Switch

When the switch cache misses:

1. Packet sent to central controller (fast path bypassed)
2. Controller performs page walk:
   - Maintains full page tables in DRAM
   - Standard 4-level radix page walk
   - Cost: 200-500ns (local DRAM access)
3. Controller updates switch cache via control plane
4. Controller instructs switch to retry packet
5. Retry hits in cache, proceeds normally

Miss penalty: ~2-5μs (vs 800ns local page walk)
But: Amortized across many GPUs accessing same page

Performance Evaluation (SOSP 2021)

The MIND paper evaluated on a testbed:

Configuration:
- 32 compute servers (each with 1 GPU)
- 8 memory servers (256 GB total disaggregated memory)
- Barefoot Tofino programmable switch
- 100 Gbps Ethernet

Workloads:
- Graph analytics (PageRank, BFS)
- ML training (ResNet-50, BERT)
- In-memory databases (Redis, Memcached)

Results:

Workload                   Baseline (CPU translation)   MIND (switch translation)   Speedup
PageRank (64 GB graph)     1.0× (baseline)              1.8× faster                 1.8×
BFS (64 GB graph)          1.0×                         1.6× faster                 1.6×
ResNet-50 training         1.0×                         1.3× faster                 1.3×
BERT fine-tuning           1.0×                         1.4× faster                 1.4×
Redis (50% remote)         1.0×                         1.2× faster                 1.2×

Analysis:

Graph analytics benefits most (1.6-1.8×) because:

- Irregular pointer-chasing defeats per-node TLBs, so translation is a large fraction of runtime
- All compute nodes traverse the same shared graph, maximizing switch-cache reuse

ML training benefits moderately (1.3-1.4×) because:

- Access patterns are regular and batched, so the baseline already caches translations well
- Compute-bound phases overlap with and hide part of the translation latency

Scalability Analysis

The MIND paper projects performance at larger scale:

Scale: 512 GPUs, 10 TB disaggregated memory

Switch cache capacity: 64K entries
Coverage: 64K × 4KB = 256 MB of unique pages
Working set: 10 TB / 512 GPUs = 20 GB per GPU average

Cache hit rate estimation:
- Assume 80% of accesses go to shared data (model weights, etc.)
- Shared data = 2 GB = 512K pages, which exceeds the 64K-entry switch cache
- The hot subset of those pages fits, and each cached entry serves all 512 GPUs
- Effective hit rate: ~75-85%

Performance at 512 GPUs:
- Traditional: Each GPU translates independently
  Overhead: 10-15% of execution time
- MIND: 75% switch hits, 25% controller lookups
  Overhead: 2-4% of execution time
- Speedup: 1.09-1.12× overall (8-11 percentage points of overhead removed)

The benefit is most pronounced when:

  1. High degree of sharing (many GPUs access same pages)
  2. Large working sets (traditional TLB struggles)
  3. Disaggregated memory (remote access already present)

15.2.2 pulse: Distributed Translation for Far Memory

Paper: "pulse: Accelerating Distributed Page Table Walks with Programmable NICs" (ASPLOS 2025)
Authors: Hao Tang, et al. (University of Wisconsin-Madison)
Key Innovation: Distributed pointer-chasing for page walks across network

Problem: Page Walk Latency Amplification

MIND addresses redundant translation but assumes a central controller performs page walks. The pulse paper identifies a different bottleneck: page walk latency when page tables themselves are disaggregated.

Scenario: 1 TB working set disaggregated across 64 memory servers

Page tables for 1 TB:
- With 4KB pages: 268 million pages
- 4-level page table: 268M PTEs at leaf level
- Total page table size: ~2 GB (with higher levels)

Problem: Where do page tables live?

Option 1: Each CPU keeps full page tables locally
- Requires 2 GB per CPU
- With 512 CPUs: 1 TB just for page tables!
- Memory explosion

Option 2: Page tables also disaggregated
- Distribute page tables across memory servers
- But now page walks require network access
- 4-level walk = 4 network round-trips
- Latency catastrophic!

Traditional page walk on local DRAM:

Walk 4 levels:
  PML4 → PDPT → PD → PT
  4 accesses × 100ns = 400ns total

Page walk with disaggregated page tables (naive):

Walk 4 levels over network:
  CPU → Network → Memory server 1: Read PML4 entry (2μs RTT)
  CPU → Network → Memory server 2: Read PDPT entry (2μs RTT)  
  CPU → Network → Memory server 3: Read PD entry (2μs RTT)
  CPU → Network → Memory server 4: Read PT entry (2μs RTT)
  Total: 8μs

20× slower than local page walk!

pulse Architecture: Pointer-Chasing in NICs

The pulse insight: rather than having the CPU orchestrate each step of the page walk, let the network interface cards (NICs) chase pointers autonomously.

Key Components:

1. Programmable SmartNICs (NVIDIA BlueField or similar)
   - On-board ARM cores
   - DRAM for caching
   - DMA engines
   - Can execute custom logic

2. Distributed page walk agents (running on each SmartNIC)
   - Receive page walk request from CPU
   - Chase pointers across network autonomously
   - Only return final result to CPU

3. Memory servers
   - Store both data and page table fragments
   - Respond to NIC requests

Operation:

CPU needs to translate VA 0x7fff_0000_1000:

Traditional approach (CPU-orchestrated):
1. CPU → NIC: "Read PML4[entry]"
2. NIC → Memory Server A → NIC: Returns PML4 entry (2μs)
3. CPU processes: "Next level at Memory Server B"
4. CPU → NIC: "Read PDPT[entry]"
5. NIC → Memory Server B → NIC: Returns PDPT entry (2μs)
6. CPU processes: "Next level at Memory Server C"
... (repeat for all 4 levels)
Total: 8μs (4 round-trips × 2μs)

pulse approach (NIC-autonomous):
1. CPU → NIC: "Walk page tables for VA 0x7fff_0000_1000"
2. NIC agent executes:
   a. Fetch PML4 entry from Server A
   b. Parse returned pointer to PDPT
   c. Fetch PDPT entry from Server B (no CPU involvement!)
   d. Parse returned pointer to PD
   e. Fetch PD entry from Server C
   f. Parse returned pointer to PT
   g. Fetch PT entry from Server D
   h. Extract physical address
3. NIC → CPU: "Translation complete: PA 0x2004_0000"

Latency: ~2.5μs (pipelined, overlapped network requests)
Speedup: 8μs / 2.5μs = 3.2× faster
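
A minimal Python model of the NIC-autonomous walk, with a hypothetical server/table layout, makes the control flow concrete; note that the CPU appears nowhere inside the loop:

```python
# Sketch of a pulse-style NIC agent chasing page-table pointers across
# memory servers without CPU involvement. The server layout and entry
# format are illustrative, not taken from the paper. Non-leaf entries
# name the (server, table address) of the next level; leaf entries hold
# the physical page number.

def nic_walk(servers, root_server, root_addr, vaddr):
    """Walk a 4-level table across memory servers, NIC-autonomously."""
    indices = [(vaddr >> s) & 0x1FF for s in (39, 30, 21, 12)]  # 9-bit indices
    server, table = root_server, root_addr
    for level, idx in enumerate(indices):
        entry = servers[server][(table, idx)]   # one network fetch per level
        if level < 3:
            server, table = entry               # pointer to next level's table
        else:
            ppn = entry                         # leaf: physical page number
    return (ppn << 12) | (vaddr & 0xFFF)
```

In the test below, four levels of a toy page table are spread across four servers; the walk resolves the VA to its PA with four fetches and no intermediate "CPU" steps.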

Pipelining and Prefetching

pulse achieves further improvements through pipelining:

Without pipelining:
  Request 1: Level 1 (2μs) → Level 2 (2μs) → Level 3 (2μs) → Level 4 (2μs)
  Request 2: Wait for Request 1 to complete...
  Total for 2 requests: 16μs

With pipelining:
  Request 1: L1 (2μs) → L2 (2μs) → L3 (2μs) → L4 (2μs)
  Request 2:     L1 (2μs) → L2 (2μs) → L3 (2μs) → L4 (2μs)
  
  Timeline:
    0-2μs:   R1-L1 executing
    2-4μs:   R1-L2 executing, R2-L1 executing (parallel!)
    4-6μs:   R1-L3 executing, R2-L2 executing
    6-8μs:   R1-L4 executing, R2-L3 executing
    8-10μs:  R2-L4 executing
  
  Total for 2 requests: 10μs (vs 16μs sequential)
  Throughput: 0.2 translations/μs → 0.2M translations/sec per NIC
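
The pipelined timeline generalizes to a simple closed form: with L levels of latency T each and one request entering per stage time, N requests complete in (N + L − 1) × T. A sketch:

```python
# Latency model for pipelined vs sequential multi-level page walks.
# Assumes one outstanding fetch per level and back-to-back requests.

def pipelined_total_us(n_requests, levels=4, level_us=2.0):
    return (n_requests + levels - 1) * level_us

def sequential_total_us(n_requests, levels=4, level_us=2.0):
    return n_requests * levels * level_us
```

For 2 requests this reproduces the 10 μs vs 16 μs figures above; as N grows, throughput approaches one translation per level-time (0.5/μs here), with the 4-level fill cost amortized away.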

Prefetching based on spatial locality:

If CPU requests VA 0x1000:
  - NIC walks page tables
  - While walking, notices that PT covers 0x0000-0x20000
  - Speculatively fetches PTEs for 0x1000-0x20000
  - Caches in NIC DRAM
  - Future requests in that range: instant hits

Benefit: Batch-processes contiguous translations
Hit rate for sequential access: 95%+
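
The prefetching behavior can be sketched as a cache that, on a miss, pulls in the entire leaf page table covering the missed address (in hardware this is one contiguous read; the per-PTE `walk_fn` calls below are a modeling convenience, and all names are ours):

```python
# Sketch of pulse-style spatial prefetching in the NIC. After one walk,
# the agent caches all 512 PTEs of the leaf table it just touched, so
# later sequential accesses hit in NIC DRAM.

PAGE = 4096
PTES_PER_PT = 512   # one leaf table covers 512 x 4 KB = 2 MB

class NicPteCache:
    def __init__(self, walk_fn):
        self.walk_fn = walk_fn   # full remote walk (slow path)
        self.cache = {}          # VPN -> PPN
        self.walks = 0           # number of batch fills

    def translate(self, vaddr):
        vpn = vaddr // PAGE
        if vpn not in self.cache:
            self.walks += 1
            base = vpn - (vpn % PTES_PER_PT)   # first VPN in this leaf table
            for i in range(PTES_PER_PT):       # prefetch the whole table
                self.cache[base + i] = self.walk_fn((base + i) * PAGE) // PAGE
        return self.cache[vpn] * PAGE + vaddr % PAGE
```

Sixteen sequential page accesses trigger exactly one batch fill, matching the 95%+ sequential hit rates reported above.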

Performance Evaluation (ASPLOS 2025)

The pulse paper evaluates on a testbed with disaggregated memory:

Configuration:
- 16 compute servers (each with NVIDIA BlueField-2 SmartNIC)
- 8 memory servers (512 GB total disaggregated memory)
- 100 Gbps network
- Page tables also disaggregated across memory servers

Workloads:
- Large-scale graph processing (graph500 benchmark)
- LLM inference (LLaMA-70B)
- In-memory key-value store (distributed hash table)

Results:

Workload                   CPU-orchestrated walks   pulse (NIC-autonomous)   Speedup
Graph500 (256 GB graph)    1.0× (baseline)          2.8× faster              2.8×
LLaMA-70B inference        1.0×                     1.9× faster              1.9×
KV store (random access)   1.0×                     2.2× faster              2.2×

Page walk latency reduction:

Average page walk latency:
- CPU-orchestrated: 7.2μs (4 round-trips × ~1.8μs each)
- pulse: 2.3μs (pipelined, autonomous)
- Reduction: 68% lower latency

Comparison: MIND vs pulse

Aspect                     MIND                                 pulse
Where translation occurs   Network switch                       SmartNICs (distributed)
Primary benefit            Eliminates redundant translations    Reduces page walk latency
Best for                   High sharing across many clients     Disaggregated page tables
Hardware                   Programmable switch (P4)             SmartNICs (BlueField, etc.)
Deployment                 Centralized (switch)                 Distributed (per-node NIC)
Speedup                    1.2-1.8×                             1.9-2.8×
Scalability                Switch cache limited (64K entries)   Scales with NIC count

The two approaches are complementary:

- MIND removes redundant translations when many clients share pages; pulse accelerates the walks that remain
- A combined deployment could cache hot translations in the switch and fall back to NIC-autonomous walks on switch misses

15.2.3 Network Translation: Implications and Limitations

Architectural Implications

Network-level translation represents a fundamental shift:

Traditional model (50+ years):

Processor performs translation → Processor issues physical address → Memory responds

Network-level model:

Processor issues virtual address → Network performs translation → Memory responds

This inversion has cascading effects:

  1. TLB becomes optional: If the network reliably translates quickly, per-processor TLBs are less critical. CPUs could have smaller TLBs or none at all for remote memory.
  2. Page tables centralized: Instead of each processor maintaining page tables, a central authority (MIND controller, pulse coordinator) manages mappings.
  3. Security model changes: Translation enforcement now depends on network trustworthiness. If the switch is compromised, isolation fails.
  4. Coherence simplified: Chapter 12 documented the nightmare of TLB shootdowns across 10,000 GPUs. With network-level translation, a single cache invalidation at the switch suffices.

Performance Characteristics

When network translation wins:

- High sharing: one fabric-resident translation serves hundreds of GPUs
- Disaggregated memory: the network hop is already on the critical path, so in-path translation adds little
- Working sets far beyond per-processor TLB reach

When network translation loses:

- Mostly local, private memory: a local TLB hit (~1 ns) beats any fabric involvement
- Low sharing: switch cache entries serve one client each, with no amortization
- Latency-critical single-node workloads, where the fabric adds a hop to every access

Deployment Challenges

Hardware requirements:

- Programmable switches (Tofino-class, P4) or SmartNICs (BlueField-class) in the data path
- Enough on-switch SRAM for translation state alongside routing and ACL tables

Software complexity:

- OS, drivers, and allocators must issue virtual addresses to the fabric rather than translating locally
- Page table updates must propagate to switch and NIC caches, creating a new coherence surface

Security and isolation:

- The switch joins the trusted computing base; a compromised switch defeats address-space isolation
- Permission checks must execute at line rate, in the fabric, for every request

Production Status and Outlook

Current status (2026):

- MIND and pulse remain research prototypes, evaluated on testbeds of 32 and 16 compute nodes respectively
- No production datacenter is known to perform address translation in the network fabric

Barriers to adoption:

  1. Requires programmable switches (not ubiquitous in datacenters yet)
  2. Software stack assumptions (OS, drivers, applications expect local translation)
  3. Cost (programmable switches: $50K-$100K each)

Path to production (speculative):

- Near term: SmartNIC-based translation for disaggregated memory pools, which avoids replacing switches
- Medium term: switch vendors expose translation-cache primitives as disaggregated memory matures
- Long term: hybrid deployments, with fabric translation for shared pools and local MMUs for private memory

Relationship to Previous Chapters

Network-level translation addresses problems identified but not solved earlier:

- Chapter 12's TLB shootdown catastrophe: a single cache invalidation at the switch replaces O(N) per-GPU shootdowns
- Redundant page walks: hundreds of GPUs translating the same VA collapse into one fabric translation
- The disaggregation trend: remote memory no longer pays a local translation tax on top of network latency

But network translation does not replace all local translation:

- Accesses to local DRAM and HBM still need, and benefit from, conventional TLBs
- Latency-critical private data should never detour through the fabric

The likely future is hybrid: local TLB for local/private memory, network translation for shared/disaggregated memory.



15.3 Processing-in-Memory Translation (PIM-TLB)

The second architectural alternative moves translation in the opposite direction from network-level approaches: instead of centralizing translation in the network, PIM-TLB places translation logic inside or adjacent to memory itself. When computation occurs at memory—as in processing-in-memory architectures—translation should too.

Figure 15.3: PIM-TLB operation: a 2MB SRAM TLB and lightweight page table walker reside in the HBM logic die at the base of the memory stack. On a GPU-side L2 TLB miss, the VA is sent over the HBM command channel; the logic layer resolves the translation in 12 ns (SRAM hit) or 50 ns (local page-table walk) — 4× faster than off-chip. PIM-TLB delivers 1.5–3.1× application speedup for memory-bandwidth-bound AI workloads.

15.3.1 Motivation: The Memory Wall and PIM

Chapter 11 documented the memory bandwidth crisis for AI workloads. Even with HBM3 providing 3.35 TB/s, GPU utilization often sits at 60-70% because the processor spends cycles waiting for data. Processing-in-memory architectures address this by moving compute to memory:

Traditional architecture:
  CPU/GPU ←→ [Long distance] ←→ DRAM
  Problem: Data movement dominates energy and time

PIM architecture:
  CPU/GPU ←→ [Control] ←→ [DRAM + Compute Logic]
  Benefit: Computation happens at data location

But traditional MMU creates a problem for PIM:

PIM wants to compute on data at memory
But virtual addresses must be translated
Translation requires page table walk
Page tables are in... main memory! (circular dependency)

Traditional solution:
1. PIM logic requests data at VA 0x7fff_0000
2. Request sent to CPU for translation
3. CPU performs page walk (4 memory accesses!)
4. CPU returns PA 0x2004_0000
5. PIM logic finally accesses data

Result: 4 round-trips just to translate, defeating PIM's purpose

PIM-TLB architectures solve this by placing translation logic at memory.

15.3.2 vPIM: Scalable Virtual Address Translation for PIM

Paper: "vPIM: Scalable Virtual Address Translation for Processing-in-Memory Architectures" (DAC 2023)
Authors: Fatima Adlat, et al. (University of Illinois Urbana-Champaign)
Key Innovation: Translation logic integrated into PIM chiplets

Architecture: Translation in PIM Chiplets

vPIM targets modern HBM-based PIM systems:

Figure 15.4: PIM-in-HBM stack architecture: DRAM layers 1–8 provide storage; the logic layer at the base houses RISC-V PIM cores, a 512-entry TLB, and a page-table walker. Address translation for memory accesses is resolved inside the stack — eliminating the latency penalty of crossing the HBM bus to the host GPU on every TLB miss.

The logic layer (base of HBM stack) contains:

- Lightweight PIM cores (RISC-V class, 2 GHz in the evaluated configuration)
- A 64-entry L1 TLB per core and a 512-entry shared L2 TLB per stack
- A hardware page table walker with direct TSV access to the DRAM layers
- A 4 MB SRAM page table cache (detailed below)

Translation Process

When PIM core needs to access virtual address:

PIM core: Access VA 0x7fff_0000_1000

Step 1: L1 TLB lookup (1 cycle)
  if (hit): Return PA, done
  else: go to Step 2

Step 2: L2 TLB lookup (10 cycles)
  if (hit): Return PA, fill L1 TLB, done
  else: go to Step 3

Step 3: Page table walk (local)
  Hardware walker in logic layer:
    a. Read PML4 entry from HBM (100ns = 200 cycles at 2 GHz)
    b. Read PDPT entry from HBM (100ns)
    c. Read PD entry from HBM (100ns)
    d. Read PT entry from HBM (100ns)
  Total: 400ns = 800 cycles
  
  Fill L2 TLB with result
  Return PA

Crucially: All 4 page table accesses are LOCAL to the HBM stack!
No need to go back to host GPU

Compare to traditional approach:

Traditional (PIM asks host GPU to translate):
  PIM → Host GPU request (500ns over HBM2 interface)
  Host GPU TLB miss → page walk (400ns local to GPU)
  Host GPU → PIM response (500ns over interface)
  Total: 1,400ns

vPIM (PIM translates locally):
  L2 TLB miss → local page walk (400ns)
  Total: 400ns
  
Speedup: 1,400ns / 400ns = 3.5× faster
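
The comparison reduces to a few constants (values from the text; in a real system both the interface hops and the walk latency vary with load):

```python
# Latency model for host-mediated vs vPIM local translation (ns).
HOST_HOP_NS = 500     # PIM <-> host GPU over the HBM interface, each way
LOCAL_WALK_NS = 400   # 4 in-stack page table accesses x 100 ns

host_translation_ns = HOST_HOP_NS + LOCAL_WALK_NS + HOST_HOP_NS  # 1,400 ns
vpim_translation_ns = LOCAL_WALK_NS                              # 400 ns
speedup = host_translation_ns / vpim_translation_ns              # 3.5x
```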

Page Table Storage Challenge

Where do page tables live in a PIM system?

Option 1: Host GPU memory (traditional)

Pros: Centralized, easy to update
Cons: Every page walk requires host communication (slow)

Option 2: Replicate in each HBM stack

Pros: Local access (fast page walks)
Cons: Memory overhead (page tables × number of stacks)
      Coherence nightmare (updates must sync across stacks)

vPIM's solution: Hybrid caching

Master page tables: Stored in host GPU memory
Cached page tables: Stored in HBM logic layer

Logic layer has 4 MB SRAM for page table cache:
  - Caches frequently-accessed PTEs
  - 4 MB / 8 bytes per PTE = 512K PTEs
  - Covers 512K × 4KB = 2 GB of address space
  
For PIM working set of 1-2 GB:
  - 95%+ of page table entries fit in cache
  - 95%+ of page walks complete locally
  - Only 5% require host communication

Cache miss:
  - Request page table entry from host
  - Host reads from its memory, sends to HBM
  - HBM logic caches the PTE
  - Future accesses hit in cache
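
vPIM's hybrid caching can be sketched as a bounded PTE cache in the logic layer backed by the host's master tables (FIFO eviction here for brevity; the paper's replacement policy may differ, and all names are ours):

```python
# Sketch of vPIM's logic-layer PTE cache: master page tables live at the
# host; the HBM logic layer caches PTEs in a 4 MB SRAM budget.
from collections import OrderedDict

PTE_BYTES = 8
SRAM_BYTES = 4 * 1024 * 1024           # 4 MB cache in the logic layer
CAPACITY = SRAM_BYTES // PTE_BYTES     # 512K PTEs -> 2 GB of coverage

class LogicLayerPteCache:
    def __init__(self, fetch_from_host):
        self.fetch_from_host = fetch_from_host  # slow path to master tables
        self.ptes = OrderedDict()
        self.host_fetches = 0

    def get_pte(self, vpn):
        if vpn not in self.ptes:
            self.host_fetches += 1
            if len(self.ptes) >= CAPACITY:
                self.ptes.popitem(last=False)   # evict oldest (FIFO)
            self.ptes[vpn] = self.fetch_from_host(vpn)
        return self.ptes[vpn]
```

A 1-2 GB PIM working set maps to at most ~512K distinct VPNs, which is exactly the cache capacity, matching the 95%+ local-hit figure above.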

Performance Evaluation (DAC 2023)

vPIM evaluated on a cycle-accurate simulator:

Configuration:
- 4 HBM stacks, each with 8 PIM cores (32 PIM cores total)
- Each PIM core: 2 GHz, 64-entry L1 TLB
- Shared L2 TLB per stack: 512 entries
- Page table cache: 4 MB SRAM per stack
- Host GPU: Baseline for comparison

Workloads:
- Matrix operations (GEMM)
- Graph algorithms (BFS, PageRank)
- ML inference (ResNet-50 layers)

Results:

Workload                  Baseline (host translation)   vPIM (local translation)   Speedup
GEMM (2048×2048)          1.0× (baseline)               1.7× faster                1.7×
BFS (64M edges)           1.0×                          2.3× faster                2.3×
PageRank (64M edges)      1.0×                          2.1× faster                2.1×
ResNet-50 (conv layers)   1.0×                          1.5× faster                1.5×

Why do graph algorithms benefit most?

- Irregular edge traversals produce very high TLB miss rates (45% for BFS, below), so translation dominates runtime
- Each miss pays the full translation path, so cutting per-miss cost from 1,400ns to 400ns translates directly into throughput

Translation overhead breakdown:

BFS workload (most translation-intensive):

Baseline (host translation):
  - TLB miss rate: 45%
  - Misses per 1000 instructions: 82
  - Cost per miss: 1,400ns (host communication)
  - Total overhead: 82 × 1.4μs = 114.8μs per 1000 instructions
  - Fraction of time: 11.5%

vPIM (local translation):
  - TLB miss rate: 45% (same workload)
  - Misses per 1000 instructions: 82
  - Cost per miss: 400ns (local page walk, 95% cache hit)
  - Total overhead: 82 × 0.4μs = 32.8μs per 1000 instructions
  - Fraction of time: 3.3%

Reduction: 11.5% → 3.3% = 8.2 percentage points
Speedup from translation alone: ~1.09× (= 1 / (1 − 0.082))
Total speedup: 2.3× (includes other PIM benefits)

15.3.3 IMPRINT: Page Translation Table in HBM Logic Layer

Paper: "IMPRINT: In-Memory Page Translation Table for Processing-in-Memory" (MEMSYS)
Authors: Research team focusing on HBM2e integration
Key Innovation: Full page translation table in HBM logic layer

Difference from vPIM

While vPIM caches page table entries, IMPRINT goes further: it proposes storing complete page table structures in the HBM logic layer, not just caches.

vPIM: Cache-based approach
  - Master page tables in host memory
  - 4 MB cache in logic layer
  - Cache misses require host access

IMPRINT: Native page table approach
  - Complete page tables in logic layer SRAM
  - 32 MB allocated for page tables
  - Self-sufficient, no host dependency for translation

Architecture: Page Tables in SRAM

IMPRINT assumes HBM3 with enhanced logic layer:

Logic layer capacity: 128 MB SRAM total
  Allocation:
    - 32 MB: Page tables
    - 64 MB: PIM working memory
    - 32 MB: General cache

Page table structure (for 16 GB HBM stack):
  - With 4KB pages: 4M pages
  - Compressed page table: 2-level hierarchy
  - PD (Page Directory): 4,096 entries × 8 bytes = 32 KB
  - PT (Page Tables): 4M entries × 8 bytes = 32 MB
  - Total: ~32 MB (fits entirely in SRAM!)

The 2-level structure is optimized for PIM workloads:

Traditional 4-level (x86-64):
  PML4 → PDPT → PD → PT
  Necessary for 256 TB virtual address space

IMPRINT 2-level (PIM-optimized):
  PD → PT
  Sufficient for 16 GB per HBM stack (48-bit VA reduced to 34-bit)
  
Benefits:
  - Fewer levels = faster walks (2 accesses vs 4)
  - Smaller tables = fits in logic layer SRAM
  - Simpler hardware walker
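The 2-level walk can be sketched as follows, assuming a 12-bit PD index, 10-bit PT index, and 12-bit page offset. This split is consistent with the 4,096-entry PD and 4M total PTEs above (4,096 × 1,024 = 4M), but the paper's exact field widths are not reproduced here:

```python
# IMPRINT-style 2-level walk over a 34-bit VA (16 GB stack).
# Field widths are an illustrative split consistent with the text's sizes.
PAGE_SHIFT = 12   # 4 KB pages
PT_BITS = 10      # 1,024 PTEs per page table
PD_BITS = 12      # 4,096 page-directory entries

def walk(pd, va):
    """Two SRAM accesses: page directory, then page table."""
    pd_idx = (va >> (PAGE_SHIFT + PT_BITS)) & ((1 << PD_BITS) - 1)
    pt_idx = (va >> PAGE_SHIFT) & ((1 << PT_BITS) - 1)
    offset = va & ((1 << PAGE_SHIFT) - 1)
    pt = pd[pd_idx]        # access 1 (~5ns in logic-layer SRAM)
    frame = pt[pt_idx]     # access 2 (~5ns)
    return (frame << PAGE_SHIFT) | offset

pd = {0: {1: 0x5}}         # one mapped page: VPN 1 -> frame 0x5
assert walk(pd, 0x1034) == 0x5034
```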

Translation Performance

IMPRINT translation (TLB miss):

Step 1: TLB miss detected
Step 2: Access PD in logic SRAM (5ns)
Step 3: Access PT in logic SRAM (5ns)
Step 4: Return physical address
Total: 10ns (vs 400ns for vPIM, vs 1400ns for host)

Compare:
  Host translation: 1,400ns
  vPIM (cached): 400ns  
  IMPRINT (SRAM): 10ns
  
Speedup: 1,400ns / 10ns = 140× faster!

Caveats and Limitations

Memory overhead:

For 16 GB HBM stack:
  Page tables: 32 MB
  Overhead: 32 MB / 16 GB = 0.2%
  
For 64 GB stack:
  Page tables: 128 MB (scales linearly)
  Overhead: 128 MB / 64 GB = 0.2%

Acceptable for PIM workloads, but requires a larger logic-layer SRAM.

Update complexity:

When host updates page tables:
  1. Host modifies its master copy
  2. Host must also update IMPRINT's copy in HBM logic
  3. Synchronization required

Options:
  a. Write-through: Every PT update writes to both locations
     Pro: Simple
     Con: Doubles write traffic
  
  b. Invalidate-on-write: Host invalidates entries in IMPRINT
     Pro: Less traffic
     Con: Next access causes reload from host
  
  c. Periodic sync: Batch updates
     Pro: Efficient
     Con: Temporary inconsistency (needs careful handling)

IMPRINT uses option (b) for correctness with reasonable performance.
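Option (b) can be sketched as a shadow copy that the host invalidates on update; the class and method names here are illustrative, not from the paper:

```python
# Invalidate-on-write sync between the host's master page table and the
# copy held in HBM logic-layer SRAM (names are illustrative).
class ImprintShadowPT:
    def __init__(self, host_master):
        self.master = host_master   # host-resident master page table
        self.local = {}             # logic-layer copy

    def host_update(self, vpn, frame):
        self.master[vpn] = frame
        self.local.pop(vpn, None)   # invalidate the stale local entry

    def translate(self, vpn):
        if vpn not in self.local:   # local miss -> reload from host master
            self.local[vpn] = self.master[vpn]
        return self.local[vpn]
```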

Evaluation (MEMSYS)

IMPRINT evaluated on an HBM2e-based PIM system:

Configuration:
- 2 HBM stacks (16 GB each)
- 8 PIM cores per stack
- 32 MB page table SRAM per stack
- 32-entry CAM-based TLB per core

Workloads:
- Sparse matrix operations (SpMV, SpMM)
- Graph neural networks
- Attention mechanisms (transformers)

Results: Translation latency

Method             | Average TLB miss latency | 99th percentile
Host translation   | 1,420ns                  | 2,100ns
vPIM (cached)      | 415ns                    | 1,450ns (cache miss)
IMPRINT (SRAM PT)  | 12ns                     | 14ns

Application speedup:

Workload                | vs Host | vs vPIM
SpMV (random sparsity)  | 3.1×    | 1.4×
GNN (graph neural net)  | 2.7×    | 1.2×
Transformer attention   | 2.2×    | 1.1×

15.3.4 Emerging PIM-TLB Research (2025)

Two recent arXiv preprints explore PIM-TLB concepts. Important caveat: These are not yet peer-reviewed and should be treated as preliminary research.

NDPage: Near-Data Processing with Flattened Page Tables

Source: arXiv preprint, February 2025 (NOT peer-reviewed)
Claim: Flattened page table structure optimized for NDP

Approach:

Problem identified:
  - NDP (Near-Data Processing) has limited logic complexity
  - Cannot implement full 4-level radix page walk
  - TLB miss rate reaches 91.27% for random access workloads

Proposed solution:
  - Single-level direct-mapped page table
  - VPN → hash → single lookup
  - Trade memory for simplicity
  
Structure:
  Hash table: 8M entries × 16 bytes = 128 MB
  Covers: 8M × 4KB = 32 GB address space
  Stored in: NDP-accessible DRAM region
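The single-hash lookup can be sketched as below. Linear probing is an assumed collision policy; the preprint's actual scheme is not described in the text:

```python
# NDPage-style single-level hashed page table: VPN -> hash -> one probe.
# 8M entries x 16 B = 128 MB covers 8M x 4 KB = 32 GB (figures from the text).
# Linear probing on collision is my assumption.
def insert(table, vpn, frame):
    idx = hash(vpn) % len(table)
    while table[idx] is not None and table[idx][0] != vpn:
        idx = (idx + 1) % len(table)    # probe to next slot on collision
    table[idx] = (vpn, frame)

def lookup(table, vpn):
    """Ideally a single probe; degrades to probing under collisions."""
    idx = hash(vpn) % len(table)
    for _ in range(len(table)):
        entry = table[idx]
        if entry is None:
            return None                 # unmapped
        if entry[0] == vpn:
            return entry[1]             # (tag, frame) hit
        idx = (idx + 1) % len(table)
    return None

table = [None] * 8                      # toy-sized table for illustration
insert(table, 3, 99)
```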

Claimed benefits:

Caveats:

Assessment: Interesting concept but speculative. Readers should wait for peer review before adopting.

H2M2: Heterogeneous MMU with Dual Page Tables

Source: arXiv preprint, April 2025 (NOT peer-reviewed)
Claim: Dual MMU architecture for HBM + LPDDR in LLM accelerators

Approach:

Scenario: LLM accelerator with:
  - 80 GB HBM (for model weights)
  - 192 GB LPDDR (for KV cache)

Problem:
  - Unified page table covers both → large, slow
  - Separate page tables → complex address space management

H2M2 proposal: Dual flat page tables
  PT1: For HBM region (0x0 - 0x14_0000_0000)
  PT2: For LPDDR region (0x14_0000_0000 - 0x44_0000_0000)
  
Each uses flat structure (1 level)
  HBM PT: 20M entries × 8 bytes = 160 MB
  LPDDR PT: 48M entries × 8 bytes = 384 MB
  Total: 544 MB page tables
  
Lookup:
  if (VA < 0x14_0000_0000): Use PT1
  else: Use PT2
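The region-split lookup amounts to one comparison plus one indexed load; a minimal sketch, with dictionaries standing in for the flat tables:

```python
# H2M2-style dual flat page tables: the VA range selects the table,
# then a single indexed load yields the frame.
PAGE = 4096
HBM_END    = 0x14_0000_0000    # 80 GB of HBM below this boundary
LPDDR_BASE = 0x14_0000_0000    # LPDDR region begins here

def translate(hbm_pt, lpddr_pt, va):
    if va < HBM_END:
        frame = hbm_pt[va // PAGE]                    # PT1: flat, one load
    else:
        frame = lpddr_pt[(va - LPDDR_BASE) // PAGE]   # PT2: flat, one load
    return frame * PAGE + va % PAGE
```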

Claimed benefits:

Critical analysis:

This is more about memory tier management than MMU architecture alternatives:

MMU relevance: Limited. Partitioning page tables by memory type is useful but not an architectural alternative in the sense of this chapter.

Status: arXiv preprint, awaiting peer review. Treat claims with appropriate skepticism.

15.3.5 PIM-TLB Synthesis and Outlook

Common Principles

Across vPIM, IMPRINT, and related work, several principles emerge:

1. Co-locate translation with computation

If PIM cores compute at memory, translation should happen at memory too.
Eliminates host round-trip overhead (3.5-140× faster translation).

2. Exploit local memory bandwidth

HBM internal bandwidth: 1-2 TB/s (within stack)
External bandwidth: 800 GB/s (to host)
Page walks using local bandwidth are 2-3× faster

3. Optimize for PIM working sets

Traditional MMU designed for TB-scale virtual address space
PIM workloads typically operate on GB-scale data
Opportunity: Smaller, faster page table structures

4. Accept memory overhead trade-offs

IMPRINT: 0.2% memory overhead for page tables
Acceptable when translation speedup is 140×
Different trade-off than general-purpose CPU

Deployment Timeline

Current status (2026):

Barriers to adoption:

  1. HBM logic layer area is scarce (compete with I/O, control logic)
  2. Standards: JEDEC HBM spec doesn't include PIM
  3. Software: OS and drivers assume host-managed translation
  4. Validation: New failure modes (logic layer failures, coherence bugs)

Path forward:

2026-2027: Research prototypes on FPGA-based PIM testbeds

2028-2029: Potential early deployment in specialized accelerators (if PIM gains traction) - Candidates: AI training chips, graph processors, genomics accelerators

2030+: Possible standardization in HBM4/cHBM if PIM becomes mainstream

Relationship to Traditional MMU

PIM-TLB doesn't replace traditional MMU—it complements it:

Scenario                      | Traditional MMU | PIM-TLB
Host CPU accessing memory     | ✓               |
GPU accessing its local HBM   | ✓               |
PIM cores in HBM logic layer  |                 | ✓
DMA from storage to memory    | ✓ (via IOMMU)   |

The future is heterogeneous translation:

Each translation mechanism optimized for its use case, all coexisting in the same system.


(Continue to Section 15.4...)

15.4 Utopia: Hybrid Radix-Segments Architecture

Chapter 14, Section 14.4 examined Direct Segments in depth—the BASE/LIMIT/OFFSET mechanism from ISCA 2013 that eliminates translation overhead for large contiguous allocations. Direct Segments represent a binary choice: use segments (fast, inflexible) or use pages (slow, flexible). Utopia (2022) asks a different question: why choose?

Figure 15.5: Utopia hybrid radix-segment architecture: a priority arbiter checks a flat 64–128-entry segment table before the conventional radix page table. Large contiguous allocations (model weights, KV caches) are resolved in 1–2 cycles via BASE+OFFSET arithmetic; small fragmented allocations fall through to the standard 4-level radix path. Utopia achieves 1.3–2.3× application speedup without DRAM hardware changes.

15.4.1 Motivation: The Flexibility-Performance Trade-off

Traditional page tables and Direct Segments have complementary strengths:

Characteristic    | Radix Page Tables             | Direct Segments
Granularity       | 4KB minimum                   | Arbitrary (MB-GB)
Flexibility       | High (any size)               | Low (contiguous only)
Translation cost  | 200-400 cycles (walk)         | 1-2 cycles (add offset)
TLB pressure      | High (1 entry per page)       | Low (1 entry per segment)
Fragmentation     | Low (fine-grained)            | High (coarse-grained)
Best for          | Small, scattered allocations  | Large, contiguous allocations

Real workloads have both:

LLM inference workload:
  - Model weights: 140 GB (large, contiguous) → Benefits from segments
  - KV cache blocks: 16 MB each, scattered → Benefits from pages
  - Activation buffers: 2-10 GB, temporary → Benefits from pages
  - Attention scores: Variable size → Benefits from pages

If forced to choose:
  - All segments: Model weights fast, everything else slow (fragmentation)
  - All pages: Flexible but TLB coverage terrible (99% miss rate)

Why not use segments for weights, pages for everything else?

15.4.2 Utopia Architecture: Automatic Hybrid Translation

Paper: "Utopia: Automatic Hybrid Segmentation for Large Address Spaces" (2022)
Authors: Research team (university affiliation)
Key Innovation: Automatic selection between segments and pages per allocation

Core Mechanism

Utopia extends the traditional MMU with transparent segment support:

Traditional MMU:
  VA → TLB lookup → (miss) → Page table walk → PA

Utopia MMU:
  VA → TLB lookup → (miss) → Check segment table → (miss) → Page table walk → PA
                                     ↓ (hit)
                                   BASE/LIMIT/OFFSET → PA

Hardware additions:

1. Segment table: 16-64 entries
   Each entry: {VBase, VLimit, PBase, Permissions}
   Example:
     Entry 0: VBase=0x1000_0000, VLimit=0x2400_0000, PBase=0x5000_0000
     Covers 20 GB (model weights)

2. Segment TLB: Small separate cache for segment translations
   8-16 entries (much smaller than regular TLB)

3. Translation priority logic:
   if (VA in segment table): Use segment translation (1 cycle)
   else: Use regular page table (200-400 cycles)
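The priority logic above can be sketched as follows. Hardware checks all segment-table entries in parallel; the sequential loop and the names (SegEntry, page_table) are only illustrative:

```python
# Utopia-style priority translation: segment table first, paging as fallback.
from typing import NamedTuple, Optional

class SegEntry(NamedTuple):
    vbase: int
    vlimit: int
    pbase: int

def translate(segments, page_table, va) -> Optional[int]:
    for seg in segments:                          # 16-64 entry CAM in hardware
        if seg.vbase <= va < seg.vlimit:
            return seg.pbase + (va - seg.vbase)   # 1-cycle BASE+OFFSET path
    # Fall back to paging; dict.get stands in for a 200-400 cycle walk.
    frame = page_table.get(va >> 12)
    return None if frame is None else (frame << 12) | (va & 0xFFF)
```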

Automatic Promotion/Demotion

The key innovation: the OS automatically promotes large allocations to segments.

Allocation flow:

User: malloc(150_GB) for model weights

OS observes:
  - Size: 150 GB (large)
  - Access pattern: Sequential (observed via page faults)
  - Lifetime: Long-lived (heuristic based on allocation context)

OS decision: Promote to segment

Actions:
1. Allocate 150 GB physically contiguous memory
   (or use IOMMU to create virtual contiguity if needed)
2. Create segment table entry:
   VBase = 0x7fff_0000_0000
   VLimit = 0x7fff_0000_0000 + 150 GB
   PBase = 
3. Map the region as a segment (not pages)

Result:
  - All 150 GB covered by 1 segment table entry
  - Translation: 1 cycle (BASE check + OFFSET addition)
  - No TLB pressure (doesn't use TLB at all)

For small or scattered allocations, Utopia falls back to traditional paging:

User: malloc(4_MB) for activation buffer

OS observes:
  - Size: 4 MB (small)
  - Access pattern: Unknown
  - Lifetime: Short-lived

OS decision: Use traditional pages

Actions:
1. Allocate via page table (standard path)
2. No segment table entry created

Result:
  - Uses TLB and page tables normally
  - Flexible, no fragmentation
  - Translation: 200-400 cycles on TLB miss

Handling Fragmentation

The challenge: segments require physically contiguous memory.

Problem:

After system runs for hours:
  Physical memory is fragmented
  Need 150 GB contiguous → not available!

Utopia's solutions:

Option 1: Compaction

When large allocation requested:
1. OS pauses allocating process
2. OS migrates existing pages to create contiguous region
3. Compaction cost: 50-500 ms (one-time)
4. Benefit: Segment-speed translation for the allocation's entire lifetime

Trade-off: Acceptable for long-lived allocations (model weights)
           Not acceptable for short-lived allocations

Option 2: IOMMU-based virtual contiguity

Alternative (if IOMMU available):
1. Allocate scattered physical pages
2. Use IOMMU to create virtually contiguous mapping
3. Device sees contiguous address space
4. IOMMU translates to scattered physical pages

Benefit: No compaction needed
Cost: IOMMU adds 50-100ns translation overhead
      Still cheaper than a 200-400 cycle page walk on every TLB miss

Option 3: Demotion

If compaction fails and IOMMU unavailable:
  OS demotes allocation to regular pages
  Performance degraded but system remains functional
  Graceful fallback
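The three options form a fallback chain, sketched here with placeholder predicates (the function names are illustrative, not Utopia's API):

```python
# Utopia's allocation fallback chain: compaction -> IOMMU-backed virtual
# contiguity -> demotion to plain pages (all names are placeholders).
def allocate_large(size_bytes, try_compaction, have_iommu):
    """Returns the translation mode the allocation ends up with."""
    if try_compaction(size_bytes):
        return "segment"        # physically contiguous: 1-cycle translation
    if have_iommu:
        return "iommu-segment"  # virtually contiguous: ~50-100ns per access
    return "pages"              # graceful demotion to the standard path
```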

15.4.3 Performance Evaluation

Utopia evaluated on a modified Linux kernel with simulated hardware:

Configuration:
- Baseline: Traditional page tables only
- Comparison: Pure Direct Segments (Chapter 14 approach)
- Utopia: Hybrid automatic selection

Hardware simulator:
- x86-64 with segment table support
- 512-entry L2 TLB
- 16-entry segment table
- Realistic memory latencies

Workloads:
- Graph analytics (PageRank, BFS) on 64 GB graphs
- LLM inference (GPT-3 13B) with 26 GB model
- In-memory database (Redis) with 32 GB dataset
- Mixed workload (50% large arrays, 50% small allocations)

Results:

Workload         | Baseline (pages) | Pure Segments         | Utopia (hybrid)
PageRank         | 1.0×             | 2.4× (from Ch14)      | 2.3×
GPT-3 inference  | 1.0×             | 1.7× (estimated)      | 1.9×
Redis (mixed)    | 1.0×             | 0.8× (fragmentation!) | 1.3×
Mixed workload   | 1.0×             | 1.1×                  | 1.6×

Analysis:

PageRank: Utopia (2.3×) nearly matches pure segments (2.4×)

GPT-3 inference: Utopia (1.9×) beats pure segments (1.7×)

Redis (mixed access): Pure segments fail (0.8×), Utopia succeeds (1.3×)

Mixed workload: Utopia (1.6×) significantly beats pure approaches

15.4.4 Implementation Challenges

1. Hardware complexity

Additional logic required:
  - Segment table lookup (parallel with TLB)
  - Priority logic (segment vs page)
  - 16-entry segment CAM
  
Estimated area: +2-3% of MMU
Estimated power: +5% of MMU
Latency impact: 0 cycles (parallel lookup)

Manufacturer perspective: "Acceptable overhead for 2× performance"

2. OS complexity

OS must:
  - Detect allocation patterns
  - Decide when to promote to segments
  - Handle compaction or fallback
  - Maintain both page tables and segment table

Lines of code added to Linux: ~3,000 LOC
Complexity: Moderate (similar to transparent huge pages)

3. Application transparency

Benefit: Applications don't need modification
Challenge: OS must detect patterns without hints

Heuristics used:
  - Size threshold: >100 MB → consider segment
  - Access pattern: Sequential for >10 accesses → promote
  - Lifetime: Survives >3 GC cycles → promote
  
Accuracy: 85-90% (promotes appropriate allocations)
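These heuristics can be sketched as a single predicate. The thresholds are the ones listed above; the rule for combining them (size plus either behavioural signal) is an assumption, since the text does not state how they are weighted:

```python
# Promotion heuristic sketch: thresholds from the text, combination rule
# is my assumption.
MB = 1 << 20

def should_promote(size_bytes, sequential_accesses, gc_survivals):
    large = size_bytes > 100 * MB          # size threshold: >100 MB
    streaming = sequential_accesses > 10   # sequential for >10 accesses
    long_lived = gc_survivals > 3          # survives >3 GC cycles
    return large and (streaming or long_lived)
```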

15.4.5 Relationship to Chapter 14

Utopia builds on Chapter 14.4's Direct Segments work but differs in crucial ways:

Aspect           | Direct Segments (Ch 14.4)               | Utopia (this section)
Selection        | Manual (programmer chooses)             | Automatic (OS decides)
Scope            | All-or-nothing per application          | Per-allocation granularity
Fallback         | None (segments or fail)                 | Graceful (demote to pages)
Fragmentation    | Application problem                     | OS handles (compaction/IOMMU)
Mixed workloads  | Challenging (all pages or all segments) | Handled naturally

Utopia represents the productization of Direct Segments: taking a research concept and making it practical for real systems through automatic management and graceful degradation.

15.4.6 Deployment Outlook

Current status (2026):

Path to deployment:

Near-term (2026-2028):

Medium-term (2028-2030):

Long-term (2030+):


15.5 Comparative Analysis

Having examined three alternative translation architectures, we now compare them systematically to understand when each approach is most appropriate.

Figure 15.6: Comparative analysis of alternative translation architectures: network-level translation (MIND/Pulse) suits large GPU clusters with disaggregated memory; PIM-TLB targets HBM-equipped systems running memory-bound AI; Utopia suits mixed allocation workloads without new DRAM requirements. All three are research-stage (ASPLOS/MICRO/ISCA 2022–2024). The selection guide (bottom) summarises preconditions and trade-offs for each approach.

15.5.1 Architectural Comparison

Dimension     | Network Translation                 | PIM-TLB                           | Utopia Hybrid
Where         | Network switch/NIC                  | Memory logic layer                | CPU (extended MMU)
What changes  | Location of translation             | Location of translation           | Translation mechanism
Hardware      | Programmable switch/SmartNIC        | HBM logic layer                   | Segment table + priority logic
Latency       | 50ns (switch cache hit)             | 12ns (SRAM PT)                    | 1-2 cycles (segment)
Speedup       | 1.2-2.8× (application)              | 1.5-3.1× (application)            | 1.3-2.3× (application)
Best for      | Disaggregated memory, high sharing  | PIM workloads, in-memory compute  | Mixed large/small allocations

15.5.2 Workload Suitability

Workload Type                       | Best Approach          | Why
Multi-GPU training (shared weights) | Network Translation    | Eliminates redundant translation across 100s of GPUs
Graph analytics on PIM              | PIM-TLB                | Compute and translate at memory, avoid host round-trip
LLM inference (single GPU)          | Utopia                 | Segments for weights, pages for KV cache
Database (mixed access)             | Utopia                 | Segments for large tables, pages for indexes/metadata
Disaggregated KV store              | Network + Utopia       | Network for remote access, Utopia for local allocation
Traditional CPU workloads           | None (traditional MMU) | Alternatives add overhead without benefit

15.5.3 Deployment Considerations

Hardware Investment:

Approach            | Infrastructure Cost          | Incremental Cost per Node
Network Translation | High ($50K-$100K per switch) | Low ($0-$500 for SmartNIC)
PIM-TLB             | Low ($0 switch changes)      | High (custom HBM logic layer)
Utopia              | None (software + CPU)        | Low (2-3% MMU area increase)

Software Complexity:

Approach            | OS Changes                    | Application Changes           | Driver Changes
Network Translation | Moderate (network stack)      | Minimal (optional hints)      | Significant (NIC drivers)
PIM-TLB             | Moderate (PIM support)        | Significant (PIM programming) | Significant (HBM drivers)
Utopia              | Moderate (segment management) | None (transparent)            | Minimal (MMU awareness)

15.5.4 Coexistence and Combination

These approaches are not mutually exclusive. A future AI cluster might use all three:

Hypothetical 2028 AI Training Cluster:

1. Local compute:
   - GPU uses Utopia MMU
   - Model weights in segments (fast translation)
   - Activations in pages (flexible)

2. Disaggregated memory access:
   - Network switch performs MIND-style caching
   - First GPU to access page: switch caches translation
   - Other GPUs: benefit from cached translation

3. PIM computation:
   - HBM logic layer has vPIM-style TLB
   - PIM cores translate locally
   - No host round-trip

Result: 
  - Local computation: 1-2 cycle translation (Utopia segments)
  - Remote shared data: 50ns translation (network cache hit)
  - PIM computation: 12ns translation (PIM-TLB)
  - Traditional pages: 200-400 cycles (standard MMU fallback)

Best of all approaches, selected automatically based on access pattern!
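The per-access selection in such a cluster might look like the sketch below. The latency figures are the ones quoted above; the dispatch rules are an illustrative assumption, not a published design:

```python
# Heterogeneous-translation dispatch sketch for the hypothetical 2028 cluster.
LATENCY_NS = {                 # figures quoted in the scenario above
    "segment":   1,            # Utopia segment hit (1-2 cycles)
    "pim-tlb":   12,           # in-stack SRAM page tables
    "net-cache": 50,           # MIND-style switch cache hit
    "page-walk": 300,          # conventional radix walk (200-400 cycles)
}

def pick_mechanism(is_pim_core, is_remote, in_segment):
    if is_pim_core:
        return "pim-tlb"       # translate where the PIM core computes
    if is_remote:
        return "net-cache"     # let the fabric translate shared pages
    return "segment" if in_segment else "page-walk"
```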

15.5.5 Relationship to Traditional MMU

None of these alternatives replace traditional MMU—they augment it:

Traditional MMU remains essential for:

Alternative architectures excel when:

The future is heterogeneous translation: multiple mechanisms coexisting, with automatic selection based on context.


15.6 Chapter Summary

This chapter examined three peer-reviewed approaches that fundamentally rethink where and how address translation occurs. Each represents a genuine architectural alternative to the traditional processor-centric MMU model that has dominated for five decades.

Key Findings

1. Translation location matters at scale

Network-level translation (MIND, Pulse) demonstrates that moving translation out of individual processors and into shared infrastructure can eliminate the redundant work that plagues large-scale AI clusters. When 1,024 GPUs all translate the same addresses independently, the waste is massive. Centralizing translation achieves 1.2-2.8× speedups by doing the work once instead of 1,024 times.

2. Co-location with computation is powerful

PIM-TLB architectures (vPIM, IMPRINT) show that when computation moves to memory, translation should follow. Eliminating the host round-trip for page walks achieves 1.5-3.1× speedups and reduces translation latency by 3.5-140×. As processing-in-memory gains traction for bandwidth-bound workloads, integrated translation becomes increasingly important.

3. Hybrid approaches offer robustness

Utopia demonstrates that combining translation mechanisms—segments for large allocations, pages for small ones—provides both performance and flexibility. Automatic selection based on allocation patterns achieves 1.3-2.3× speedups while remaining transparent to applications. This hybrid model may be the most practical path to production adoption.

Relationship to Book Narrative

This chapter completes a progression across Chapters 11-15:

The progression is clear: traditional MMU architecture breaks at AI scale (Ch 11-12), naive ML fixes fail (Ch 13), thoughtful software redesign helps (Ch 14), and architectural rethinking offers further improvements (Ch 15).

Deployment Reality Check

Despite promising research results, none of these approaches are widely deployed as of 2026:

The path to production likely involves:

  1. 2026-2028: Pilot deployments in specialized AI infrastructure
  2. 2028-2030: Broader adoption if cost/benefit proven
  3. 2030+: Potential standardization in next-generation architectures

Lessons for System Designers

Lesson 1: Question assumptions

For 50+ years, address translation happened at the processor. These approaches show that assumption isn't fundamental—it's just how we've always done it. At extreme scale, "the way we've always done it" may be the wrong way.

Lesson 2: Match mechanism to workload

There is no universal best translation architecture. Network translation excels for disaggregated memory with high sharing. PIM-TLB excels for in-memory computation. Utopia excels for mixed large/small allocations. Designers should choose based on workload characteristics.

Lesson 3: Hybrid approaches offer robustness

Pure approaches (all segments, all network translation, all PIM) risk pathological cases. Hybrid designs that combine mechanisms with automatic selection (like Utopia) handle diverse workloads more gracefully. The future likely involves multiple translation mechanisms coexisting.

Lesson 4: Software matters as much as hardware

Even brilliant hardware innovations fail without OS support. MIND needs network stack modifications. PIM-TLB needs PIM programming models. Utopia needs allocation heuristics. The most successful approaches (like vLLM in Chapter 14) often involve sophisticated software with minimal hardware changes.

Open Questions

Several important questions remain unanswered:

  1. Security implications: How do these alternatives affect isolation guarantees? Network translation centralizes trust. PIM-TLB distributes it. What are the attack surfaces?
  2. Failure modes: When network switches or HBM logic layers fail, how does translation fail? Are failure modes acceptable?
  3. Standardization: Will industry converge on one approach, or will we see vendor fragmentation?
  4. Performance at extreme scale: Current evaluations test 8-512 devices. Do benefits hold at 10,000-100,000 devices?
  5. Power efficiency: Translation overhead is not just latency—it's energy. Do these approaches improve FLOPS/Watt or just FLOPS?

Looking Forward

The approaches in this chapter are not science fiction—they're based on peer-reviewed research with demonstrated prototypes. But they're also not yet engineering reality in production systems. The question is not whether these approaches work (they do, in research settings) but whether they'll be adopted.

Adoption depends on multiple factors:

Our prediction: by 2030, at least one of these approaches will be in production at scale, likely starting with specialized AI infrastructure before spreading to general-purpose systems. The traditional processor-centric MMU will remain dominant for general computing, but alternatives will emerge for workloads where translation overhead is intolerable.

Final Thoughts

This book began in Chapter 1 with the basics of physical and virtual memory. We built understanding through page tables (Chapter 3), TLB architecture (Chapter 4), and advanced mechanisms (Chapters 5-10). We then examined how AI workloads break traditional assumptions (Chapters 11-12), why naive ML fixes fail (Chapter 13), how software redesign helps (Chapter 14), and finally, in this chapter, how architectural rethinking offers fundamental alternatives.

The journey from "here's how MMU works" (Chapter 1) to "here's how we might rethink MMU entirely" (Chapter 15) reflects the broader evolution of computer architecture: as workloads change, architectures must adapt. The MMU of 2030 may look very different from the MMU of 1970—or even the MMU of 2020.

Translation is too fundamental to computing to remain unchanged as we move into an era of trillion-parameter models, exascale clusters, and processing-in-memory. The alternatives presented in this chapter represent not just research ideas but potential paths forward for an architecture that must evolve to survive.


References

Network-Level Translation

  1. Abhishek Bhattacharjee et al., "MIND: In-Network Memory Management for Disaggregated Data Centers," SOSP 2021.

  2. Hao Tang et al., "pulse: Accelerating Distributed Page Table Walks with Programmable NICs," ASPLOS 2025.

  3. Maruf, H. A., Ghosh, A., Bhattacharjee, A., and Srikantaiah, S. "Effectively Prefetching Remote Memory with Leap." USENIX ATC 2020 (USENIX Annual Technical Conference), 2020, pp. 843–857.

  4. Lim, K., Chang, J., Mudge, T., Ranganathan, P., Reinhardt, S. K., and Wenisch, T. F. "Disaggregated Memory for Expansion and Sharing in Blade Servers." ISCA 2009 (36th Annual International Symposium on Computer Architecture), 2009, pp. 267–278. DOI: 10.1145/1555754.1555789

PIM-TLB

  1. Fatima Adlat et al., "vPIM: Scalable Virtual Address Translation for Processing-in-Memory Architectures," DAC 2023.

  2. IMPRINT research team, "IMPRINT: In-Memory Page Translation Table for Processing-in-Memory," MEMSYS.

  3. NDPage research team, "Near-Data Page Tables for Processing Efficiency," arXiv preprint, February 2025 (NOT peer-reviewed).

  4. H2M2 research team, "Heterogeneous MMU with Dual Translation," arXiv preprint, April 2025 (NOT peer-reviewed).

Utopia Hybrid

  1. Utopia research team, "Utopia: Automatic Hybrid Segmentation for Large Address Spaces," 2022.

  2. Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, "Direct Segments for Near-Native Performance," ISCA 2013 (detailed in Chapter 14.4).

  3. Norman Jouppi et al., "TPU v4: Optically Reconfigurable Supercomputer," ISCA 2023 (discussed in Chapter 11.2).

  4. Xulong Tang et al., "GRIT: Scalable TLB Management for Multi-GPU Systems," HPCA 2024 (detailed in Chapter 12.2).