Chapter 16: Advanced TLB Optimization Techniques

This chapter examines eight alternative TLB architectures spanning coalescing, speculation, prefetching, compression, range-based translation, and hierarchical address spaces. Traditional huge pages require physical contiguity—a constraint that becomes untenable for memory-intensive AI workloads accessing hundreds of gigabytes. We analyze techniques from incremental hardware optimizations (COLT entry-level coalescing, deployed in billions of AMD and ARM devices) to radical rethinking of translation abstractions (FlexPointer range TLBs providing 10× speedup for LLM inference, Mosaic Pages achieving 81% miss reduction without any contiguity requirement). Of eight techniques examined, only one (COLT) has achieved production deployment at scale, revealing critical lessons about the gap between research innovation and industrial reality.


16.1 Introduction: The TLB Reach Crisis

Chapter 4 established the fundamental TLB capacity problem: with 4KB pages, a typical L2 TLB with 1,536 entries covers only 6MB of memory. Chapter 11 demonstrated the catastrophic impact on modern AI workloads—LLaMA 70B training with 700GB working set encounters 99.999% TLB miss rates, degrading performance by 5× purely from translation overhead. The traditional solution—huge pages (2MB or 1GB)—provides 512× or 262,144× TLB reach improvements respectively, but requires physical contiguity.

16.1.1 Why Physical Contiguity Fails at Scale

Physical contiguity becomes increasingly difficult to maintain as systems age. Consider a production AI training cluster after 72 hours of continuous operation: thousands of jobs have allocated and freed memory, leaving the free physical frames scattered in small, non-contiguous runs.

For a single training step of GPT-3 accessing 350GB of model weights, the probability of finding sufficient 2MB-contiguous regions approaches zero after even moderate memory churn. The system degrades to 4KB pages, miss rates spike to >99%, and training throughput collapses.

16.1.2 The Spectrum of Solutions

This chapter examines eight architectural approaches that increase TLB reach through alternative mechanisms. These techniques fall into four categories, representing increasing deviation from traditional huge page approaches:

PART I: Coalescing Techniques (Sections 16.2-16.3)

Principle: Detect and exploit partial contiguity—coalesce entries when possible, fallback to small pages when not.

Key advantage: Incremental improvement over existing infrastructure—no OS changes required for hardware-only variants.

Deployment status: Production (AMD, ARM) for entry-level; research-only for request-level.

PART II: Predictive Techniques (Sections 16.4-16.5)

Principle: Speculatively translate or prefetch future addresses based on observed access patterns.

Key advantage: Works with any page size, no contiguity required, tolerates fragmentation.

Challenge: Speculation accuracy critical—mispredictions waste energy and bandwidth.

PART III: Compression-Based Reach (Section 16.6)

Principle: Store multiple discrete translations in a single TLB entry using hashing and compression.

Key advantage: Completely eliminates contiguity requirement—works with arbitrary fragmentation.

Challenge: Moderate hardware complexity (hashing logic in critical path).

PART IV: Alternative Addressing Modes (Sections 16.7-16.8)

Principle: Replace page-granular translation with coarser abstractions.

Key advantage: Fundamental shift—single entry covers entire tensor or memory region.

Challenge: Requires OS awareness and application cooperation for range allocation.

16.1.3 Evaluation Methodology

We assess each technique across multiple dimensions:

Dimension              | Metric                              | Ideal
TLB Reach              | Memory coverage per entry           | Maximum (multi-GB)
Contiguity Requirement | Physical/virtual contiguity needed  | None (fragmentation-tolerant)
Hardware Complexity    | Added logic/storage                 | Minimal (<1% area)
OS Support             | Kernel modifications required       | None (transparent)
Performance            | Miss rate reduction / speedup       | Maximum (>50%)
Deployment Status      | Production vs research              | Production-deployed

Section 16.9 synthesizes these findings into a comprehensive comparison matrix, identifying when each technique is appropriate and what combinations might provide synergistic benefits.

16.1.4 Roadmap and Reading Strategy

For practitioners building AI systems: Focus on Sections 16.2 (COLT production deployment), 16.4.2 (Avatar speculation for modern GPUs), and 16.6 (Mosaic Pages for fragmented memory scenarios).

For hardware architects: Sections 16.3-16.5 provide detailed microarchitectural implementations of coalescing and speculation techniques. Section 16.9's comparative analysis identifies gaps and opportunities for future research.

For OS developers: Sections 16.7-16.8 examine range-based and hierarchical translation requiring kernel support—understanding these informs page allocator and memory management policy decisions.

For researchers: The narrative arc from incremental (coalescing) to radical (range TLBs) highlights the evolution of thinking about address translation. Section 16.10 identifies open questions and deployment barriers.

Key Insight: The transition from huge pages to alternative architectures represents a fundamental shift—from demanding physical contiguity to exploiting virtual contiguity, locality, predictability, and application semantics. Only COLT has achieved production deployment at scale (billions of AMD and ARM devices); all other techniques remain research prototypes. Understanding why COLT succeeded while more sophisticated approaches struggle reveals the deployment challenges facing address translation innovation.

16.2 Entry-Level Coalescing: COLT and Production Implementations

Coalescing Large-Reach TLBs (COLT), proposed by Pham et al. at MICRO 2012, represents the most successful TLB reach optimization technique measured by deployment scale. The core innovation is deceptively simple: when inserting a new TLB entry, examine whether adjacent entries map contiguous physical addresses. If so, merge them into a single entry covering a larger effective page size.

Figure 16.1: COLT entry-level coalescing: on each TLB insertion, COLT checks whether the new entry's virtual page number and physical frame number are both stride-1-adjacent to an existing entry. If so, the two entries merge into a single extended entry with a page count field, covering 8–16 KB (or more) with one slot. The same 512-entry TLB covers 2–4× more address space with no OS changes. COLT is deployed in production AMD and ARM processors at a 0.5% area overhead.

16.2.1 COLT Architecture and Operation

Traditional TLBs store independent translations. Each 4KB page requires its own TLB entry, regardless of whether pages are physically contiguous:

Virtual Address     Physical Address    Size
0x0000 - 0x0FFF  →  0x1000 - 0x1FFF    4KB
0x1000 - 0x1FFF  →  0x2000 - 0x2FFF    4KB
0x2000 - 0x2FFF  →  0x3000 - 0x3FFF    4KB
0x3000 - 0x3FFF  →  0x4000 - 0x4FFF    4KB

Result: 4 TLB entries consumed for 16KB of contiguous memory

COLT recognizes this pattern and coalesces the entries:

Virtual Range       Physical Range       Coalesced Size
0x0000 - 0x3FFF  →  0x1000 - 0x4FFF    16KB (4 × 4KB)

Result: 1 TLB entry covers 16KB (4× improvement)

The coalescing logic operates during TLB insertion. When a new translation arrives from a page table walk, hardware checks:

  1. Address alignment: Is the virtual address aligned to a potential coalescing boundary (e.g., 8-page = 32KB)?
  2. Contiguity: Do adjacent virtual pages map to adjacent physical frames?
  3. Uniformity: Do all pages share identical permission bits (R/W/X, U/S)?

If all conditions hold, hardware creates a single coalesced entry rather than inserting individual page translations.
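
The check below is a minimal C sketch of this insertion-time logic (a simplified model, not vendor RTL; pte_t, try_coalesce, and the fixed 8-page window are illustrative names): given the eight PTEs of one cache line, it tests alignment, stride-1 contiguity, and permission uniformity before emitting one coalesced entry.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t vpn;     /* virtual page number */
    uint64_t ppn;     /* physical frame number */
    uint8_t  perms;   /* R/W/X and U/S bits packed together */
    bool     present;
} pte_t;

/* Model of the insertion-time check: can these 8 PTEs (one 64-byte
 * cache line) be represented as a single 32KB coalesced entry? */
static bool try_coalesce(const pte_t line[8],
                         uint64_t *base_vpn, uint64_t *base_ppn)
{
    /* 1. Alignment: first VPN must sit on an 8-page boundary. */
    if (line[0].vpn % 8 != 0)
        return false;

    for (int i = 0; i < 8; i++) {
        if (!line[i].present)
            return false;
        /* 2. Contiguity: VPN and PPN must both advance stride-1. */
        if (line[i].vpn != line[0].vpn + (uint64_t)i ||
            line[i].ppn != line[0].ppn + (uint64_t)i)
            return false;
        /* 3. Uniformity: identical permissions on every page. */
        if (line[i].perms != line[0].perms)
            return false;
    }
    *base_vpn = line[0].vpn;   /* one 32KB entry instead of 8 x 4KB */
    *base_ppn = line[0].ppn;
    return true;
}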

Detailed Example: 8-Page Coalescing

Consider accessing a 32KB region starting at virtual address 0x20000:

// Page table contains 8 contiguous mappings:
VPN 0x20  →  PPN 0x1A0  (VA 0x20000-0x20FFF → PA 0x1A0000-0x1A0FFF)
VPN 0x21  →  PPN 0x1A1  (VA 0x21000-0x21FFF → PA 0x1A1000-0x1A1FFF)
VPN 0x22  →  PPN 0x1A2  (VA 0x22000-0x22FFF → PA 0x1A2000-0x1A2FFF)
VPN 0x23  →  PPN 0x1A3  (VA 0x23000-0x23FFF → PA 0x1A3000-0x1A3FFF)
VPN 0x24  →  PPN 0x1A4  (VA 0x24000-0x24FFF → PA 0x1A4000-0x1A4FFF)
VPN 0x25  →  PPN 0x1A5  (VA 0x25000-0x25FFF → PA 0x1A5000-0x1A5FFF)
VPN 0x26  →  PPN 0x1A6  (VA 0x26000-0x26FFF → PA 0x1A6000-0x1A6FFF)
VPN 0x27  →  PPN 0x1A7  (VA 0x27000-0x27FFF → PA 0x1A7000-0x1A7FFF)

// COLT coalesces into single entry:
VPN 0x20-0x27  →  PPN 0x1A0-0x1A7  (32KB effective page size)

// TLB storage:
- Virtual Base: 0x20 (bits [39:15] for 32KB alignment)
- Physical Base: 0x1A0
- Size Encoding: 3 bits → 32KB
- Permissions: RW, User

A single TLB lookup now resolves any address in the 32KB range. Without coalescing, this would require 8 separate TLB entries (one per 4KB page).

16.2.2 Production Implementations: Three Distinct Approaches

While COLT demonstrated the concept in 2012, three separate production implementations emerged with distinct architectural choices:

Implementation 1: AMD Transparent PTE Coalescing (Zen+, 2017)

Approach: Pure hardware solution requiring zero OS modifications.

Mechanism: When the memory management unit (MMU) fetches a page table entry (PTE) from DRAM, it arrives in a 64-byte cache line containing 8 × 8-byte PTEs. Before inserting into the TLB, hardware examines all 8 PTEs in the cache line simultaneously:

Cache Line (64 bytes) = 8 PTEs:
PTE[0]: VA 0x10000 → PA 0x50000  [Present, RW, User]
PTE[1]: VA 0x11000 → PA 0x51000  [Present, RW, User]  ← Contiguous!
PTE[2]: VA 0x12000 → PA 0x52000  [Present, RW, User]  ← Contiguous!
PTE[3]: VA 0x13000 → PA 0x53000  [Present, RW, User]  ← Contiguous!
PTE[4]: VA 0x14000 → PA 0x54000  [Present, RW, User]  ← Contiguous!
PTE[5]: VA 0x15000 → PA 0x55000  [Present, RW, User]  ← Contiguous!
PTE[6]: VA 0x16000 → PA 0x56000  [Present, RW, User]  ← Contiguous!
PTE[7]: VA 0x17000 → PA 0x57000  [Present, RW, User]  ← Contiguous!

Hardware detects: 8 contiguous pages with uniform permissions
Action: Create single 32KB coalesced TLB entry

Hardware required: eight parallel comparators to detect stride-1 adjacency across the cache line's PTEs, plus a small size field per TLB entry—under 0.5% area overhead, per the figures cited above.

Performance (AMD internal measurements):

Deployment: All AMD Zen+ and later processors (Ryzen 2000+ series, EPYC 7002+). Estimated 100M+ desktop CPUs and millions of server processors.

Implementation 2: ARM Contiguous Bit (ARMv8-A, 2013)

Approach: Hardware-software cooperative—OS sets hint bit, hardware performs coalescing.

Mechanism: ARMv8-A page table entries include bit 52 as a "contiguous" hint. When the OS allocates physically contiguous pages, it sets this bit in all PTEs within a contiguous block:

ARM64 PTE format with the contiguous bit (ARMv8-A):

Bit  [0]      Valid (V=1)
Bit  [1]      Descriptor type (page vs block)
Bits [11:2]   Lower attributes (AttrIndx, AP, SH, AF, nG)
Bits [47:12]  Output address (physical page number)
Bits [51:48]  Reserved / software use
Bit  [52]     Contiguous hint
Bits [63:53]  Upper attributes (PXN, UXN, software-defined bits)
Contiguous bit (bit 52): When set on 16 consecutive 4 KB PTEs mapping 64 KB of contiguous PA, the TLB may merge them into a single entry — reducing TLB pressure for large buffers by 16×. The OS must ensure all 16 PTEs share identical attributes (AP, SH, cacheability) for the hint to be valid.

When hardware encounters a PTE with bit 52 set, it coalesces 16 contiguous 4KB pages into a single 64KB TLB entry.
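
A sketch of the kernel-side marking under the rules just described, assuming a flat array view of one 16-PTE block; ARM64_PTE_CONT matches bit 52 above, while PTE_OA_MASK and mark_contiguous are illustrative names:

#include <stdint.h>
#include <stdbool.h>

#define ARM64_PTE_CONT  (1ULL << 52)            /* contiguous hint, bit 52 */
#define CONT_PTES       16                      /* 16 x 4KB = 64KB block   */
#define PTE_OA_MASK     0x0000FFFFFFFFF000ULL   /* output address, [47:12] */

/* Kernel-side sketch: set the contiguous hint on a 16-PTE block only if
 * the pages are physically contiguous and share identical attributes. */
static bool mark_contiguous(uint64_t ptes[CONT_PTES])
{
    uint64_t attrs = ptes[0] & ~PTE_OA_MASK;     /* everything but the OA */
    uint64_t pa0   = ptes[0] & PTE_OA_MASK;

    for (int i = 1; i < CONT_PTES; i++) {
        if ((ptes[i] & ~PTE_OA_MASK) != attrs)
            return false;                        /* mixed attributes */
        if ((ptes[i] & PTE_OA_MASK) != pa0 + (uint64_t)i * 4096)
            return false;                        /* not physically contiguous */
    }
    for (int i = 0; i < CONT_PTES; i++)
        ptes[i] |= ARM64_PTE_CONT;               /* hardware may now merge */
    return true;
}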

OS support required: Linux transparent huge pages (THP) and multi-size THP (mTHP) automatically set the contiguous bit when allocating contiguous memory. Application-transparent but requires kernel support.

Performance (Linux kernel compilation on ARM64 Ampere Altra):

Source: LWN Article #937239 (July 2023) and #955575 (April 2024)

Deployment: All ARMv8+ processors from 2013 onward.

Implementation 3: ARM Hardware Page Aggregation (ARMv8.2-A+, 2016)

Approach: Transparent hardware-only implementation (like AMD), but smaller aggregation size.

Mechanism: ARMv8.2-A processors with HPA feature detect 4 contiguous 4KB pages and coalesce into 16KB effective entries—completely transparent to OS.

Key difference from contiguous bit:

Deployment: ARMv8.2-A and later processors (Cortex-A75+, 2017 onward).

Implementation 4: Intel - No Support

As of 2024, Intel processors do not support entry-level TLB coalescing in any form. Intel's approach has instead been to scale raw TLB capacity, with progressively larger L1 and L2 TLBs across generations.

This represents a fundamental architectural divergence—AMD and ARM invested in coalescing, Intel invested in capacity.

16.2.3 Performance Impact and Real-World Measurements

We examine measured performance across different workload categories:

Workload 1: Memory-Intensive Databases

// PostgreSQL TPC-H Query 1 (scale factor 100, 100GB dataset)
// Tested on AMD EPYC 7763 (Zen 3, coalescing enabled)

Baseline (4KB pages only):
- L2 TLB MPKI: 47.3 (47.3 misses per 1000 instructions)
- Execution time: 18.5 seconds

With coalescing (automatic, transparent):
- L2 TLB MPKI: 12.1 (74% reduction)
- Execution time: 13.2 seconds (1.4× speedup)

Analysis: Sequential scan exhibits perfect spatial locality.
Coalescing converts 100GB / 4KB = 25M page references
into 100GB / 32KB = 3.1M effective pages (8× reduction).

Workload 2: Machine Learning Training

// ResNet-50 training (ImageNet, batch size 256)
// Tested on ARM Neoverse V1 (contiguous bit enabled via Linux THP)

Baseline (4KB pages):
- TLB miss rate: 8.9%
- Training throughput: 847 images/sec

With 64KB coalescing (contiguous bit):
- TLB miss rate: 1.2% (86% reduction)
- Training throughput: 923 images/sec (1.09× speedup)

Analysis: Weight tensors allocated as large folios trigger
automatic contiguous bit setting. Activation tensors remain
fragmented due to dynamic batch dimension.

Workload 3: Graph Analytics

// PageRank on Twitter graph (41M vertices, 1.5B edges)
// Tested on AMD Ryzen 9 5950X (Zen 3, coalescing enabled)

Baseline (4KB pages):
- L2 TLB MPKI: 89.7
- Iteration time: 2.41 seconds

With coalescing:
- L2 TLB MPKI: 71.3 (20% reduction, not 8×!)
- Iteration time: 2.18 seconds (1.11× speedup)

Analysis: Random access pattern breaks contiguity.
Only edge array exhibits spatial locality; vertex data
scattered across memory. Coalescing helps but doesn't
eliminate the fundamental TLB capacity problem.

Key finding: Coalescing effectiveness depends critically on memory layout. Sequential access (databases, model weights) benefits dramatically (2-4×). Random access (graphs, hash tables) shows modest improvement (10-20%).

16.2.4 Why COLT Succeeded: Deployment Lessons

COLT is the only alternative TLB architecture deployed at scale (billions of devices). Comparing it to research-only techniques reveals critical deployment factors:

Factor            | COLT (Success)                        | Most Research Techniques (Failed)
OS Changes        | Zero (AMD) or minimal (ARM bit)       | Significant kernel modifications
Backward Compat   | 100% compatible                       | Often breaks existing software
Performance Risk  | No regression (transparent fallback)  | Can hurt fragmented workloads
Silicon Area      | <0.5% (8 comparators)                 | Often 2-5% (complex logic)
Validation Effort | Modest (well-defined semantics)       | High (new edge cases)

Critical Success Factor: COLT's pure hardware implementation (AMD) and minimal OS cooperation (ARM) eliminated deployment barriers. It "just works" with existing operating systems, applications, and memory allocators. In contrast, techniques requiring OS cooperation (range TLBs, software-managed TLBs) face a chicken-and-egg problem: hardware vendors won't ship without OS support, OS vendors won't add support without deployed hardware.

Implication for future research: Proposals requiring OS changes face ~10 year deployment cycles (Linux kernel adoption + vendor integration + ecosystem uptake). Hardware-only solutions can deploy in single processor generation (~2 years). This explains why coalescing dominates despite more sophisticated alternatives existing in the research literature.

Deployment Reality: COLT has been shipping in production silicon for 7+ years (AMD Zen+ since 2017) and 11+ years (ARM since 2013). The technique is proven, validated, and universal in modern ARM processors. Yet awareness among software developers remains low—most programmers don't know their TLBs coalesce pages automatically. The next sections examine techniques that, despite superior performance in simulation, have not achieved production deployment. Understanding this gap between research and reality is essential for evaluating future proposals.

16.3 Request-Level Coalescing: Pichai's Page Walk Optimization

While COLT coalesces at TLB insertion (entry-level), Pichai et al.'s ASPLOS 2014 work demonstrated coalescing at an earlier stage—during the page table walk itself (request-level). The key insight: rather than performing separate page table walks for nearby addresses and then coalescing the results, intercept multiple walk requests and combine them before accessing memory.

16.3.1 The Page Walk Bottleneck in GPUs

GPUs present a uniquely challenging environment for address translation:

NVIDIA H100 GPU characteristics:
- 16,896 CUDA cores across 132 SMs
- Up to 67,584 concurrent threads (512 threads/SM × 132 SMs)
- Memory bandwidth: 3,350 GB/s (HBM3)
- Memory latency: ~200ns for first access

TLB Miss Scenario (without coalescing):
- 256 threads in a warp access contiguous 1KB region (4 bytes/thread)
- Spans 1 × 4KB page (tightly packed)
- All 256 threads TLB miss simultaneously
- Traditional approach: 1 page table walk (serialized)
- Latency: 4 levels × 50ns = 200ns per walk
- Result: 256 threads stalled for 200ns

With 16,896 cores, TLB misses cause catastrophic stalls.

The problem becomes worse when threads access slightly scattered data:

Scenario: Sparse matrix-vector multiply
Thread 0:   Access VA 0x10000 (Page 0x10)
Thread 1:   Access VA 0x10100 (Page 0x10)  ← Same page
Thread 2:   Access VA 0x11000 (Page 0x11)  ← Different page!
Thread 3:   Access VA 0x11080 (Page 0x11)  ← Same as thread 2
...
Thread 255: Access VA 0x15FFF (Page 0x15)

Traditional hardware: Sees 6 distinct page misses, performs 6 walks
Result: 6 × 200ns = 1200ns total translation time

Pichai observed that GPU memory access patterns exhibit strong spatial locality—even with some scattering, most threads access a small number of distinct pages.

16.3.2 Request-Level Coalescing Mechanism

The core innovation: add a coalescing buffer between the TLB and page walk unit that aggregates multiple outstanding walk requests:

Figure 16.2: Request-level coalescing intercepts TLB misses before page table walks. The coalescing buffer aggregates requests for the same page, issuing only one walk per unique page. This differs from COLT which coalesces after walks complete.

Detailed operation:

  1. TLB miss arrives: Core issues memory access to VA 0x11080, L1 TLB misses
  2. Buffer insertion: Extract page number (0x11), check if already in coalescing buffer
  3. Hit in buffer: Page 0x11 already has pending walk—add this request to existing entry's wait list
  4. Miss in buffer: Allocate new buffer entry, initiate page table walk
  5. Walk completion: When PTE for page 0x11 returns, broadcast to ALL waiting requests (threads 2 and 3)
  6. TLB insertion: Single coalesced entry inserted covering page 0x11

The critical advantage: one page walk services multiple requesters. Without coalescing, threads 2 and 3 would trigger separate walks despite targeting the same page.
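
A compact C model may make the flow concrete (the 16-entry buffer matches the hardware cost reported in Section 16.3.3; coalesce_entry_t, coalesce_miss, and the waiter-list handling are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define BUF_ENTRIES 16      /* matches the 16-entry buffer in the paper */
#define MAX_WAITERS 32

typedef struct {
    bool     valid;
    uint64_t page;                    /* page number with a walk in flight */
    int      n_waiters;
    int      waiter_tid[MAX_WAITERS]; /* threads blocked on this walk */
} coalesce_entry_t;

static coalesce_entry_t buf[BUF_ENTRIES];

/* Called on every L1 TLB miss. Returns true if a new page walk must be
 * issued, false if the request piggybacks on an in-flight walk. */
bool coalesce_miss(uint64_t vaddr, int tid)
{
    uint64_t page = vaddr >> 12;      /* 4KB pages */

    for (int i = 0; i < BUF_ENTRIES; i++) {
        if (buf[i].valid && buf[i].page == page) {
            /* Walk already pending: join its wait list, no new walk. */
            if (buf[i].n_waiters < MAX_WAITERS)
                buf[i].waiter_tid[buf[i].n_waiters++] = tid;
            return false;
        }
    }
    for (int i = 0; i < BUF_ENTRIES; i++) {
        if (!buf[i].valid) {          /* allocate entry, issue one walk */
            buf[i] = (coalesce_entry_t){ .valid = true, .page = page,
                                         .n_waiters = 1 };
            buf[i].waiter_tid[0] = tid;
            return true;
        }
    }
    return true;   /* buffer full: fall back to an uncoalesced walk */
}

When the walk for a page completes, the entry's whole wait list is woken and the slot is freed—this is the broadcast step described in item 5 above.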

16.3.3 Performance Results and Deployment Status

Original Pichai et al. measurements (ASPLOS 2014):

GPU Benchmarks (NVIDIA Kepler-class simulation):
- LU Decomposition: 38% TLB miss reduction
- Sparse Matrix-Vector Multiply (SpMV): 32% miss reduction  
- Graph BFS: 27% miss reduction
- Neural Network Training: 35% miss reduction

Average: 32-38% TLB miss rate improvement

Hardware cost:
- Coalescing buffer: 16 entries × 8 bytes = 128 bytes SRAM
- Comparison logic: 16 parallel matchers
- Area overhead: ~0.3% of L1 TLB

Key finding from AMD integration: The technique was incorporated into AMD's GPU TLB design and is believed to be present in RDNA architectures (Radeon RX 6000/7000 series), though AMD does not publicly document the specific implementation.

Deployment status as of 2024: likely shipping in AMD RDNA GPUs (undocumented by AMD, as noted above); no confirmed implementation from other vendors.

16.3.4 Comparison: Entry-Level vs Request-Level Coalescing

Dimension         | COLT (Entry-Level)                   | Pichai (Request-Level)
Coalescing Point  | After page walk completes            | Before page walk starts
Benefit           | Larger effective page size           | Fewer page table walks
CPU Effectiveness | High (sequential access)             | Low (single-threaded, few concurrent misses)
GPU Effectiveness | Moderate                             | High (massive parallelism, many concurrent misses)
Deployment        | Production (AMD, ARM CPUs—billions)  | Likely production (AMD GPUs, uncertain elsewhere)
Transparency      | 100% (HW-only)                       | 100% (HW-only)

Synergy: The techniques are complementary. Request-level coalescing reduces page walks (saves latency), entry-level coalescing increases TLB reach (reduces miss rate). An optimal design uses both:

Combined approach (hypothetical AMD Zen + RDNA system):
1. Request-level coalescing: 256 GPU threads → 50 unique page walks
2. Page walks return 50 PTEs
3. Entry-level coalescing: 50 PTEs → 12 coalesced TLB entries

Result: 256 requests → 12 TLB entries (21× compression)

Neither AMD nor NVIDIA publicly confirms whether their production hardware combines both techniques, but the architectural synergy suggests it's likely in modern high-end GPUs.


16.4 Speculative Translation: SpecTLB and Avatar

Coalescing techniques (Sections 16.2-16.3) exploit spatial locality—nearby addresses translate to nearby physical frames. Speculative translation exploits temporal and stride predictability—predicting future translations before they're requested, overlapping translation with computation.

Figure 16.3: Speculative translation. SpecTLB (ISCA 2011) detects stride patterns during a page walk, pre-fills a speculative TLB with predicted translations, and lets execution proceed; correct predictions are promoted to the main TLB while mispredicted ones are squashed and replayed. Avatar (MICRO 2024) extends the idea with per-tensor stride predictors and non-blocking validation tuned for GPU tensor workloads.

16.4.1 SpecTLB: Reservation-Based Speculation (ISCA 2011)

Proposed by Barr, Cox, and Rixner at ISCA 2011, SpecTLB introduced the concept of speculative TLB entries—inserting predicted translations before page walks complete, then validating asynchronously.

Core Mechanism: Reservation Entries

Traditional TLB operation is strictly serialized:

Cycle 0:  Access VA 0x1000 → TLB miss
Cycle 1:  Start page table walk (Level 4)
Cycle 2:  Page table walk (Level 3)
Cycle 3:  Page table walk (Level 2)
Cycle 4:  Page table walk (Level 1, get PTE)
Cycle 5:  Insert into TLB
Cycle 6:  Retry memory access → TLB hit, proceed

Total stall: 6 cycles for this access

SpecTLB predicts the physical address and inserts a reservation entry immediately:

Cycle 0:  Access VA 0x1000 → TLB miss
          Predict: VA 0x1000 → PA 0xABC000 (based on stride pattern)
          Insert RESERVATION entry [VA=0x1000, PA=0xABC000*, Status=SPECULATIVE]
          Start page table walk in parallel
Cycle 1:  Retry memory access → TLB HIT on reservation!
          Access PA 0xABC000 speculatively
          Page walk continues in background...
Cycle 4:  Page walk returns actual PTE: PA 0xABC000
          Validation: Predicted PA matches actual PA ✓
          Promote reservation → VALID entry
          
Result: Memory access proceeded at cycle 1 instead of cycle 6
Speedup: 5 cycles saved (83% latency reduction)

Misprediction handling: If the prediction is wrong:

Cycle 0:  Access VA 0x2000 → TLB miss
          Predict: VA 0x2000 → PA 0xDEF000 (WRONG!)
          Insert RESERVATION [VA=0x2000, PA=0xDEF000*, Status=SPECULATIVE]
Cycle 1:  Access PA 0xDEF000 (incorrect physical address)
          Continue...
Cycle 4:  Page walk returns: PA 0x123000 (actual address)
          Validation: 0xDEF000 ≠ 0x123000 ✗ MISPREDICTION!
          Actions:
          1. Squash speculative loads from PA 0xDEF000
          2. Invalidate reservation entry
          3. Insert correct entry [VA=0x2000, PA=0x123000, Status=VALID]
          4. Replay instruction

Result: Performance neutral (no gain, no loss beyond replay cost)

The key insight: speculation can only help, never hurt (assuming correct misprediction recovery). Correct predictions save latency; incorrect predictions fall back to normal page-walk latency plus a small replay cost.
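
A minimal C sketch of this reservation life cycle (statuses and names are illustrative; real hardware must also track which dependent loads to squash):

#include <stdint.h>
#include <stdbool.h>

typedef enum { INVALID, SPECULATIVE, VALID } status_t;

typedef struct {
    status_t status;
    uint64_t vpn;
    uint64_t ppn;   /* predicted while SPECULATIVE, confirmed once VALID */
} tlb_entry_t;

/* On a miss: insert a reservation with the predicted PPN and let the
 * access proceed; the real page walk runs in the background. */
void insert_reservation(tlb_entry_t *e, uint64_t vpn, uint64_t predicted_ppn)
{
    e->status = SPECULATIVE;
    e->vpn = vpn;
    e->ppn = predicted_ppn;
}

/* Walk completion: promote on a match, correct and replay on a mismatch. */
bool validate(tlb_entry_t *e, uint64_t actual_ppn)
{
    if (e->ppn == actual_ppn) {
        e->status = VALID;      /* prediction correct: keep the results */
        return true;
    }
    /* Misprediction: dependent loads must be squashed and replayed. */
    e->ppn = actual_ppn;
    e->status = VALID;
    return false;
}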

Prediction Strategy: Spatial Locality

SpecTLB uses simple stride prediction. When translating VA 0x1000:

Recent history:
VA 0x0000 → PA 0xA00000  (Page 0)
VA 0x1000 → PA 0xA01000  (Page 1, stride = +0x1000 virtual, +0x1000 physical)

Prediction for VA 0x2000:
Predicted PA = 0xA01000 + 0x1000 = 0xA02000

Confidence: HIGH if last N accesses followed same stride
           LOW if pattern breaks (random access)

Original SpecTLB results (ISCA 2011):

SPEC CPU2006 benchmarks:
- libquantum: 47% speedup (highly regular access pattern)
- mcf:        23% speedup (pointer chasing, but predictable)
- omnetpp:    12% speedup (object-oriented, some regularity)
- Average:    18% speedup across memory-intensive workloads

Accuracy:
- Highly regular (libquantum): 94% correct predictions
- Moderately regular (mcf):     78% correct predictions  
- Irregular (random):           45% correct predictions

Hardware cost:
- Prediction table: 64 entries × 16 bytes = 1KB
- Validation logic: Comparators for parallel check
- Area: <1% of L2 TLB

16.4.2 Avatar: Stride-Based Speculation for Modern AI (MICRO 2024)

Avatar, presented at MICRO 2024, revisits speculative translation with AI workload awareness. The key observation: modern deep learning exhibits highly predictable memory access patterns due to structured tensor operations.

AI-Specific Access Patterns

Consider matrix multiplication (C = A × B) where A is 4096 × 4096:

Sequential iteration through matrix A:
Row 0: Access VA 0x10000, 0x10004, 0x10008, ..., 0x13FFC  (4KB page)
       Access VA 0x14000, 0x14004, 0x14008, ..., 0x17FFC  (4KB page)
       ...
Row 1: Access VA 0x18000, 0x18004, ...

Pattern: Perfect sequential access with predictable stride
- Within row: +4 byte stride (float32)
- Between rows: +4096 × 4 bytes = 16KB stride

Virtual to Physical mapping (assuming contiguous allocation):
VA 0x10000-0x10FFF → PA 0x500000-0x500FFF
VA 0x11000-0x11FFF → PA 0x501000-0x501FFF  (stride = +0x1000)
VA 0x12000-0x12FFF → PA 0x502000-0x502FFF  (stride = +0x1000)
...

Speculation accuracy: 99%+ for this pattern

Avatar exploits this by maintaining per-tensor stride predictors:

Tensor-Aware Prediction Table:
Entry 0: Base VA=0x10000, Stride=+0x1000, Confidence=0.98
Entry 1: Base VA=0x20000, Stride=+0x2000, Confidence=0.95
Entry 2: Base VA=0x30000, Stride=+0x1000, Confidence=0.99

When access to VA 0x15000 misses TLB:
1. Match base address (0x10000)
2. Calculate stride: (0x15000 - 0x10000) / 0x1000 = 5 pages
3. Previous translation: VA 0x14000 → PA 0x504000
4. Predict: VA 0x15000 → PA 0x504000 + 0x1000 = 0x505000
5. Insert speculative entry immediately
6. Validate when page walk completes
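
A simplified C model of such a per-tensor stride predictor (the 0.75 confidence threshold and field names are assumptions, and it assumes the physical stride equals the virtual stride, as in the example above):

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t base_va;     /* start of the tracked tensor region */
    uint64_t last_vpn;    /* VPN of the most recent translation */
    uint64_t last_ppn;    /* its physical frame */
    int64_t  stride;      /* observed stride, in pages */
    float    confidence;  /* decayed hit rate of this predictor */
} stride_pred_t;

/* Predict the PPN for a missing VPN from the tracked stride.
 * Returns false if confidence is too low to speculate. */
bool predict(const stride_pred_t *p, uint64_t vpn, uint64_t *ppn_out)
{
    if (p->confidence < 0.75f)               /* illustrative threshold */
        return false;
    int64_t delta = (int64_t)(vpn - p->last_vpn);
    if (p->stride == 0 || delta % p->stride != 0)
        return false;                        /* off-stride: don't guess */
    *ppn_out = p->last_ppn + delta;          /* PA stride == VA stride */
    return true;
}

/* Update on every resolved translation (speculative or not). */
void train(stride_pred_t *p, uint64_t vpn, uint64_t ppn, bool was_correct)
{
    p->stride   = (int64_t)(vpn - p->last_vpn);
    p->last_vpn = vpn;
    p->last_ppn = ppn;
    p->confidence = 0.9f * p->confidence + (was_correct ? 0.1f : 0.0f);
}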

Avatar Performance Results (MICRO 2024)

Transformer Inference (GPT-3 scale):
- Speculation accuracy: 90.3%
- TLB miss latency: 200ns → 50ns average (75% reduction)
- End-to-end speedup: 37.2%

Convolutional Neural Networks (ResNet-50):
- Speculation accuracy: 88.7%
- TLB miss latency: 200ns → 60ns average
- End-to-end speedup: 28.4%

Graph Neural Networks (GraphSAGE):
- Speculation accuracy: 62.1% (irregular neighborhood sampling)
- Speedup: 11.2% (limited by low accuracy)

Key finding: Structured tensor ops have 85-95% accuracy
            Irregular access (graphs, sparse) drops to 60-70%

Hardware Requirements

Avatar's implementation differs from SpecTLB in key ways optimized for GPUs:

Component        | SpecTLB (2011)                     | Avatar (2024)
Prediction Table | 64 global entries                  | 256 per-SM entries (33,792 total across an H100's 132 SMs)
Predictor Type   | Last-value + stride                | Multi-stride with confidence
Validation       | Blocking (stall on misprediction)  | Non-blocking (continue speculation)
Area Overhead    | ~0.8% of L2 TLB                    | ~1.2% of L2 TLB (larger tables)

16.4.3 Evolution: 13 Years from SpecTLB to Avatar

The 13-year gap between SpecTLB (ISCA 2011) and Avatar (MICRO 2024) reveals how AI workloads enabled speculation to finally become viable:

Why SpecTLB (2011) didn't deploy: an 18% average gain on unpredictable general-purpose CPU workloads, with prediction accuracy as low as 45% on irregular code, did not justify the speculation and misprediction-recovery hardware.

Why Avatar (2024) might succeed: structured tensor operations deliver 85-95% prediction accuracy and end-to-end speedups approaching 37%, on hardware whose workloads are almost entirely AI kernels.

Deployment status (2024): research prototype; no announced production implementation.

Critical Lesson: SpecTLB was "right idea, wrong workload." CPUs in 2011 ran general-purpose code with unpredictable access patterns. GPUs in 2024 run specialized AI kernels with machine-precision tensor operations. The same technique, applied to a different domain 13 years later, transforms from marginal (18%) to compelling (37%). This demonstrates how workload shifts can resurrect dormant architectural ideas.

16.5 Predictive Prefetching: SnakeByte Markov Model

While speculation (Section 16.4) predicts individual translations, prefetching predicts sequences of future misses. SnakeByte, presented at ASPLOS 2023, applies Markov chain modeling to TLB miss patterns—observing that graph analytics workloads exhibit predictable miss sequences despite lacking spatial locality.

16.5.1 The Graph Analytics Challenge

Traditional TLB optimizations fail on graph workloads. Consider PageRank on a social network:

Graph: Twitter follower network
- Vertices: 41 million users  
- Edges: 1.5 billion connections
- Memory layout: Compressed Sparse Row (CSR) format

Access pattern for vertex 1234:
1. Read adjacency list start: VA 0x100000 + (1234 × 8) = VA 0x102690
   → Maps to physical page for offset array
2. Read neighbor count: 8,234 neighbors
3. Read neighbors: VA 0x500000 + offset...
   → Jumps to completely different physical page!
4. For each neighbor vertex V:
   Read V's rank: VA 0x800000 + (V × 8)
   → V is random (social network = power-law degree distribution)
   → Physical pages access in random order

Result: Every vertex access misses TLB (no spatial locality)
        Coalescing useless (pages not contiguous)
        Speculation useless (next address unpredictable)

Traditional techniques provide <10% improvement on graph workloads. Yet the miss stream itself contains patterns: after missing page 0x500, the workload often misses pages 0x502, 0x509, and 0x510 next. SnakeByte exploits this temporal correlation.

16.5.2 Markov Model for TLB Miss Prediction

A Markov model tracks state transitions. For TLB prefetching, states are page numbers and transitions are observed miss sequences:

Markov Chain (learned from execution):

State: Page 0x500 (just missed)
Transitions observed:
  → Page 0x502: 45% probability (frequently accessed together)
  → Page 0x509: 30% probability
  → Page 0x510: 15% probability
  → Page 0x7FF: 10% probability

Prefetch decision: When page 0x500 misses, immediately prefetch:
  1. Page 0x502 (highest probability)
  2. Page 0x509 (second highest)
  
Avoid prefetching 0x510, 0x7FF (low probability, waste bandwidth)

SnakeByte maintains a Miss Sequence Table (MST) that records recent miss history and learns transition probabilities:

Figure 16.4: SnakeByte Markov model prefetcher learns transition probabilities between TLB misses. When page 0x500 misses, the model predicts pages 0x502 and 0x509 are likely next misses and prefetches their translations.

16.5.3 Performance Results and Analysis

Graph Analytics Benchmarks (ASPLOS 2023):

PageRank (Twitter graph, 41M vertices):
- Baseline TLB miss rate: 89.7%
- With SnakeByte prefetching: 32.1% (64% reduction!)
- Speedup: 2.1×
- Prefetch accuracy: 68% (68% of prefetches used before eviction)

Breadth-First Search (Road network graph):
- Baseline miss rate: 76.3%
- With SnakeByte: 28.9% (62% reduction)
- Speedup: 1.8×
- Prefetch accuracy: 71%

Single Source Shortest Path (Web graph):
- Baseline miss rate: 81.2%
- With SnakeByte: 31.7% (61% reduction)
- Speedup: 1.9×
- Prefetch accuracy: 66%

Connected Components (Social network):
- Baseline miss rate: 72.8%
- With SnakeByte: 41.2% (43% reduction)
- Speedup: 1.4×
- Prefetch accuracy: 58% (lower due to irregular component structure)

Comparison to traditional prefetchers:

Technique           | PageRank Miss Reduction         | Accuracy | Hardware Cost
No prefetching      | 0% (baseline)                   | N/A      | 0
Next-line prefetch  | 8% (useless for random access)  | 12%      | Minimal
Stride prefetch     | 11% (graph has no stride)       | 18%      | Low
SnakeByte (Markov)  | 64%                             | 68%      | Moderate

The key insight: Traditional prefetchers assume spatial/stride locality. Graphs have neither—but they have temporal locality in miss sequences. Markov models capture this.

16.5.4 Hardware Implementation and Costs

Miss Sequence Table (MST) design:

MST Structure:
- 1024 entries (2^10 indexed by page number hash)
- Each entry: Current page + 4 most likely next pages
- Format per entry:
  [Page Number: 52 bits]
  [Next[0]: Page=52b, Prob=8b]  ← 60 bits
  [Next[1]: Page=52b, Prob=8b]  ← 60 bits
  [Next[2]: Page=52b, Prob=8b]  ← 60 bits
  [Next[3]: Page=52b, Prob=8b]  ← 60 bits
  Total per entry: 292 bits

Total MST size: 1024 entries × 292 bits = 37KB SRAM

Comparison to TLB size:
- Typical L2 TLB: 1536 entries × 64 bytes = 96KB
- MST overhead: 37KB / 96KB = 38.5% of TLB size

Area overhead: ~1.8% of total L2 cache area
Power: Negligible (accessed only on TLB miss)

Training mechanism:

  1. Detect TLB miss for page P
  2. Look up P in MST, get predictions
  3. Issue prefetches for high-confidence predictions
  4. Learning: After next miss to page Q, update MST[P] to increment transition P→Q
  5. Periodically recalculate probabilities (every 1000 misses)

The learning is online and adaptive—no offline training required. The model adjusts to workload phase changes (e.g., PageRank iteration N has different patterns than iteration N+1 as ranks converge).
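
The following C sketch models the MST's combined learn-and-predict step on each miss (a simplified single-successor update; the direct-mapped hash, four-successor limit, and all names are illustrative):

#include <stdint.h>
#include <string.h>

#define MST_ENTRIES 1024
#define SUCCESSORS  4

typedef struct {
    uint64_t page;                   /* tag: the page this entry tracks */
    uint64_t next_page[SUCCESSORS];  /* observed successor pages */
    uint32_t count[SUCCESSORS];      /* transition frequency counters */
} mst_entry_t;

static mst_entry_t mst[MST_ENTRIES];
static uint64_t last_miss_page;

static mst_entry_t *lookup(uint64_t page)
{
    mst_entry_t *e = &mst[page & (MST_ENTRIES - 1)];  /* hash = low bits */
    if (e->page != page) {           /* cold or conflicting entry: reset */
        memset(e, 0, sizeof *e);
        e->page = page;
    }
    return e;
}

/* Called on every TLB miss: learn the last->current transition, then
 * return the most frequent successor of the current page (0 if none). */
uint64_t mst_miss(uint64_t page)
{
    mst_entry_t *prev = lookup(last_miss_page);
    int slot = 0;
    for (int i = 0; i < SUCCESSORS; i++) {
        if (prev->next_page[i] == page) { slot = i; break; }
        if (prev->count[i] < prev->count[slot]) slot = i;  /* rarest = victim */
    }
    if (prev->next_page[slot] != page) {
        prev->next_page[slot] = page;    /* evict the rarest successor */
        prev->count[slot] = 0;
    }
    prev->count[slot]++;
    last_miss_page = page;

    mst_entry_t *cur = lookup(page);
    int best = 0;
    for (int i = 1; i < SUCCESSORS; i++)
        if (cur->count[i] > cur->count[best]) best = i;
    return cur->count[best] ? cur->next_page[best] : 0;  /* prefetch hint */
}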

16.5.5 Deployment Status and Challenges

As of 2024, SnakeByte remains a research prototype; no vendor has announced a production implementation.

Why not deployed yet:

  1. Niche workload: Graph analytics are important but represent <20% of datacenter compute (vs 60%+ for deep learning)
  2. Silicon area: 37KB SRAM expensive—vendors prioritize larger TLBs instead
  3. Alternative solutions: Graph-specific accelerators (e.g., Graphcore IPU) use different memory architectures entirely
  4. Software prefetching: Programmers can manually prefetch in graph kernels (though tedious)

Potential path to deployment: If incorporated into graph-specific accelerators (IPU, Sambanova, Cerebras) where graph workloads are 100% of usage, the 1.8% area overhead becomes justified. General-purpose CPUs/GPUs unlikely to adopt unless graph workloads grow significantly in importance.


16.6 Hashing-Based Compression: Mosaic Pages

Every technique examined so far (Sections 16.2-16.5) requires some form of regularity—spatial contiguity for coalescing, stride patterns for speculation, temporal correlation for Markov prefetching. Mosaic Pages, awarded ASPLOS 2023 Distinguished Paper and selected for IEEE Micro Top Picks 2024, eliminates the contiguity requirement entirely through iceberg hashing compression.

Figure 16.5: Mosaic Pages iceberg hashing (ASPLOS 2023 Distinguished Paper): the OS allocates 512 arbitrary 4KB frames and finds a 32-bit HashSeed such that h(VA_offset, seed) is a bijection onto those frames. A single TLB entry stores (VA_base, HashSeed); the MMU resolves any address in the 2MB region in one cycle using a small hardware hash unit, with no physical contiguity requirement. Mosaic achieves 4× fewer TLB misses and requires ~0.3% additional MMU area.

16.6.1 The Fundamental Problem: Huge Pages Without Contiguity

Recall from Section 16.1 that traditional huge pages require 512 contiguous 4KB physical frames for a 2MB page. Mosaic Pages asks: What if we could get huge page TLB reach without requiring any physical contiguity?

The insight comes from data structures research (iceberg hashing) applied to address translation:

Traditional 2MB huge page TLB entry:
VPN Range: 0x100000 - 0x1001FF  (512 pages × 4KB)
PPN Base:  0x500000
Requirement: ALL 512 physical pages must be contiguous
            PPNs must be: 0x500000, 0x500001, 0x500002, ..., 0x5001FF

Mosaic Pages approach:
VPN Range: 0x100000 - 0x1001FF  (512 virtual pages)
PPNs: ARBITRARY! Could be:
      VPN 0x100000 → PPN 0xABC123
      VPN 0x100001 → PPN 0xDEF456  ← Not contiguous!
      VPN 0x100002 → PPN 0x789ABC  ← Scattered anywhere
      ...
      VPN 0x1001FF → PPN 0x123DEF

Question: How to store 512 arbitrary PPNs in one TLB entry?
Answer: Iceberg hashing compression

16.6.2 Iceberg Hashing: Compressing Arbitrary Mappings

Iceberg hashing exploits a key property: while we have 512 possible VPNs in a range, only a subset are actually accessed. For most workloads, accessing all 512 pages in a 2MB virtual region is rare.

Core mechanism:

  1. Virtual address range: Define a 2MB virtual range (512 × 4KB pages)
  2. Hash table in TLB entry: Store compact hash table with only accessed pages
  3. Compression: If only 64 of 512 pages accessed, store 64 entries instead of 512

The TLB entry format changes dramatically:

Traditional TLB entry (64 bytes):
[VPN: 52 bits | PPN: 40 bits | Permissions: 12 bits] = 104 bits ≈ 13 bytes
Padding to 64 bytes for alignment

Mosaic TLB entry (256 bytes - 4× larger):
Header (32 bytes):
  [Base VPN: 52 bits]
  [Hash function parameters: 32 bits]
  [Entry count: 16 bits]
  [Permissions: 12 bits]
  
Compressed hash table (224 bytes):
  16 hash buckets × 14 bytes/bucket = 224 bytes
  Each bucket: [VPN offset: 9 bits | PPN: 40 bits | Valid: 1 bit] × 2 entries
  
Total: 256 bytes (4× normal entry, but covers 512 pages!)

Detailed Example: Mosaic Entry Covering 16 Pages

Scenario: Application accesses 16 non-contiguous pages in range:
VPN 0x100000 → PPN 0xABC000
VPN 0x100005 → PPN 0xDEF001  (gap: pages 1-4 not accessed)
VPN 0x100007 → PPN 0x123002
VPN 0x10000A → PPN 0x456003  (gap: pages 8-9 not accessed)
... (12 more non-contiguous mappings)

Mosaic entry construction:
1. Base VPN = 0x100000
2. Hash function: H(VPN) = (VPN - Base) mod 16
3. Store only accessed pages:

Hash Table:
Bucket 0: [Offset=0, PPN=0xABC000]  ← VPN 0x100000
Bucket 1: empty
Bucket 2: empty
Bucket 3: empty
Bucket 4: empty
Bucket 5: [Offset=5, PPN=0xDEF001]  ← VPN 0x100005
Bucket 6: empty
Bucket 7: [Offset=7, PPN=0x123002]  ← VPN 0x100007
Bucket 8: empty
Bucket 9: empty
Bucket 10: [Offset=A, PPN=0x456003]  ← VPN 0x10000A
... (rest of table)

Translation lookup for VA 0x100005ABC:
1. Extract VPN: 0x100005
2. Check if in Mosaic entry range: 0x100000 ≤ 0x100005 < 0x100200 ✓
3. Compute hash: H(0x100005) = 5
4. Lookup bucket 5: [Offset=5, PPN=0xDEF001] ← MATCH!
5. Return PA: 0xDEF001ABC

Latency: Single cycle (parallel hash + bucket lookup)
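
A C sketch of this lookup path, simplified to one (offset, PPN) pair per bucket rather than the two-entry buckets described above; mosaic_entry_t and mosaic_lookup are illustrative names:

#include <stdint.h>
#include <stdbool.h>

#define BUCKETS 16

typedef struct {
    bool     valid;
    uint16_t vpn_offset;   /* 9-bit offset within the region */
    uint64_t ppn;          /* arbitrary physical frame (40 bits used) */
} bucket_t;

typedef struct {
    uint64_t base_vpn;     /* aligned region start */
    uint64_t range_pages;  /* e.g., 512 for a full 2MB region */
    bucket_t bucket[BUCKETS];
    uint16_t perms;
} mosaic_entry_t;

/* Translate VA using a single Mosaic entry. Returns false on miss
 * (VA outside the range, or page not present in the hash table). */
bool mosaic_lookup(const mosaic_entry_t *e, uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> 12;
    if (vpn < e->base_vpn || vpn >= e->base_vpn + e->range_pages)
        return false;                       /* not this entry's range */

    uint16_t off = (uint16_t)(vpn - e->base_vpn);
    const bucket_t *b = &e->bucket[off % BUCKETS];  /* H(vpn) = offset mod 16 */

    if (!b->valid || b->vpn_offset != off)
        return false;                       /* page never inserted: miss */

    *pa = (b->ppn << 12) | (va & 0xFFF);    /* frame + page offset */
    return true;
}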

16.6.3 Performance Results (ASPLOS 2023)

Workload: Graph Analytics (where contiguity fails completely)

Graph500 Benchmark (scale 26, 67M vertices):
Configuration: 4KB baseline pages, TLB with 512 entries

Baseline (4KB pages only):
- TLB miss rate: 47.3%
- Execution time: 18.2 seconds
- Memory bandwidth utilization: 42% (stalled on page walks)

With Mosaic Pages (16-page compression):
- TLB miss rate: 8.9% (81% reduction!)
- Execution time: 10.3 seconds (1.77× speedup)
- Memory bandwidth utilization: 78%
- TLB entries used: 128 Mosaic entries (each covering 16 pages)

Effective TLB reach:
  Baseline: 512 entries × 4KB = 2MB
  Mosaic:   128 entries × 16 pages × 4KB = 8MB (4× improvement)

Workload: Sparse Matrix Operations (DLRM embeddings)

DLRM Recommendation Model (Terabyte Click Logs):
Embedding tables: 1.2 billion entries, randomly accessed

Baseline (4KB pages):
- TLB miss rate: 62.8%
- Training iteration time: 340ms

With Mosaic Pages (32-page compression):
- TLB miss rate: 16.2% (74% reduction)
- Training iteration time: 198ms (1.72× speedup)

Critical insight: Embedding lookups scatter across memory
               - No spatial locality (random hash table)
               - No temporal locality (each batch different)
               - Mosaic still works! (no contiguity assumption)

Hardware cost analysis:

Component         | Baseline TLB               | Mosaic TLB                    | Overhead
Entry size        | 64 bytes                   | 256 bytes                     | 4×
Total capacity    | 512 entries × 64B = 32KB   | 128 entries × 256B = 32KB     | Same!
Lookup latency    | 1 cycle (direct index)     | 1 cycle (parallel hash)       | 0
Compression logic | None                       | Hash function + bucket logic  | ~0.8% area

Key finding: Same SRAM budget (32KB), but effective reach increases 4-16× depending on access patterns. The 4× larger entries mean fewer total entries, but each entry covers far more pages.

16.6.4 Why Mosaic Pages Hasn't Deployed (Yet)

Despite distinguished paper award and compelling results, Mosaic Pages remains research-only as of 2024:

Deployment barriers: a completely new 256-byte TLB entry format, moderate OS changes for aligned allocation, and the validation burden of abandoning a standardized entry layout (see the comparison below).

Comparison to production techniques:

Factor            | COLT (Deployed)         | Mosaic (Research)
Entry format      | Standard + size field   | Completely new (256B)
Contiguity        | Required                | Not required ✓
OS changes        | Zero (AMD)              | Moderate (aligned allocation)
Backward compat   | 100%                    | Requires new entry type
Benefit           | 2-4× reach              | 4-16× reach ✓
Works for graphs? | No (needs contiguity)   | Yes! ✓

Potential deployment path: Most likely in specialized accelerators (graph processors, recommendation engines) where graph/sparse workloads dominate. General-purpose CPUs/GPUs will likely wait for broader industry adoption.

Research Impact vs Deployment: Mosaic Pages demonstrates a fundamental breakthrough—TLB reach without contiguity. The 81% miss reduction for graph workloads is transformative. Yet deployment requires overcoming conservative industry practices and TLB format standardization. This is a common pattern: distinguished research papers can take 5-10 years to reach production hardware, if they ever do. The technique's complexity is both its strength (powerful capabilities) and weakness (deployment barrier).

16.7 Range-Based Translation: FlexPointer

All previous techniques maintain page-granular translation—each 4KB page requires its own mapping (possibly coalesced or compressed). FlexPointer, published at MICRO 2023, makes a radical departure: abandon page granularity entirely for large memory regions, using BASE/LIMIT registers to describe arbitrary-sized contiguous virtual ranges.

16.7.1 The Tensor Memory Problem

Modern AI workloads allocate massive contiguous tensors:

LLaMA 70B Model Weights:
- Total parameters: 70 billion
- Size per parameter: 2 bytes (FP16)
- Total memory: 140 GB

Memory layout:
Layer 0 weights:  VA 0x1000_0000_0000 - 0x1000_7000_0000  (~1.75 GB)
Layer 1 weights:  VA 0x1000_7000_0000 - 0x1000_E000_0000  (~1.75 GB)
...
Layer 79 weights: VA 0x1022_9000_0000 - 0x1023_0000_0000  (~1.75 GB)

Observation: Each tensor is HUGE and CONTIGUOUS in virtual memory
Problem with traditional TLB:
- 1.75 GB / 4KB = 458,752 pages per layer
- 80 layers × 458,752 = 36,700,160 total pages
- TLB with 1,536 entries covers 0.004% of working set
- Miss rate: 99.996%!

Even with 2MB huge pages:

1.75 GB / 2MB = 896 pages per layer
80 layers × 896 = 71,680 total pages
TLB coverage: 1,536 / 71,680 = 2.1%
Miss rate: Still ~98%!

FlexPointer asks: Why translate each 2MB chunk separately when the entire 1.75GB tensor is contiguous?

16.7.2 Range TLB Architecture

Instead of storing page-by-page mappings, store BASE/LIMIT/OFFSET triplets describing entire memory regions:

Traditional TLB entry (for 2MB huge page):
VPN: 0x1000_0000  (one 2MB chunk)
PPN: 0x5000_0000
Size: 2 MB

FlexPointer Range TLB entry:
Virtual Base:    0x1000_0000_0000
Virtual Limit:   0x1000_7000_0000  (1.75 GB range!)
Physical Base:   0x5000_0000_0000
Permissions:     Read-only
Entry ID:        Layer0_Weights

Translation for any VA in range [Base, Limit):
PA = Physical_Base + (VA - Virtual_Base)

Example: Translate VA 0x1000_1234_5678
1. Check: 0x1000_0000_0000 ≤ 0x1000_1234_5678 < 0x1000_7000_0000 ✓
2. Offset = 0x1000_1234_5678 - 0x1000_0000_0000 = 0x1234_5678
3. PA = 0x5000_0000_0000 + 0x1234_5678 = 0x5000_1234_5678

Single TLB entry covers the ENTIRE 1.75 GB tensor!
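
In C, the range check and offset arithmetic look like the sketch below (a software model of the hardware comparators; range_entry_t and range_tlb_lookup are illustrative names):

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t vbase;    /* virtual base (inclusive) */
    uint64_t vlimit;   /* virtual limit (exclusive) */
    uint64_t pbase;    /* physical base */
    uint16_t perms;
    uint16_t asid;
} range_entry_t;

/* Check the small range TLB first; fall back to the regular
 * page-granular TLB (not shown) on a miss. */
bool range_tlb_lookup(const range_entry_t *rtlb, int n,
                      uint64_t va, uint16_t asid, uint64_t *pa)
{
    for (int i = 0; i < n; i++) {        /* hardware: parallel CAM match */
        const range_entry_t *e = &rtlb[i];
        if (e->asid == asid && va >= e->vbase && va < e->vlimit) {
            *pa = e->pbase + (va - e->vbase);   /* PA = PBase + offset */
            return true;
        }
    }
    return false;   /* miss: consult the regular TLB / page walker */
}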

16.7.3 FlexPointer TLB Entry Format

Figure 16.6: FlexPointer TLB entry format (128 bytes): stores a VA base, VA limit, and PA base, enabling a single entry to cover an arbitrary-size contiguous range. Translation checks VA ∈ [VBase, VLimit) and computes PA = PBase + (VA − VBase) in hardware, reducing 256K classic entries for a 1 GB tensor to a single range entry.

Hybrid design: FlexPointer doesn't replace traditional TLB—it augments it. A complete TLB system has:

  1. Range TLB: 16-32 entries for large contiguous regions (tensors, large arrays)
  2. Regular TLB: 1,536 entries for small/scattered pages (stack, heap, code)
  3. Lookup priority: Check range TLB first (fast path), fall back to regular TLB

16.7.4 Performance Results: ML Workload Dominance

Transformer Inference (GPT-3 175B parameters):

Configuration:
- Model weights: 350 GB (FP16)
- Allocated as 80 contiguous tensors (one per layer)
- Baseline: 2MB huge pages, 1,536-entry TLB

Baseline (2MB huge pages):
- Pages needed: 350 GB / 2MB = 179,200 pages
- TLB entries: 1,536
- Coverage: 1,536 / 179,200 = 0.86%
- TLB miss rate: 99.14%
- Inference latency: 180ms/token (dominated by translation overhead)

With FlexPointer (16-entry range TLB):
- Range entries: 80 tensors (one per layer)
- Range entries needed: 80, against a 16-entry range TLB
  → Entries are evicted and refilled as layers execute, but layer-
    sequential access keeps the active layer's entry resident
- Effective coverage: 350 GB (100%!)
- TLB miss rate: 0.02% (only misses on activation buffers)
- Inference latency: 18ms/token (10× improvement!)

Speedup breakdown:
- Translation latency: 200ns → 2ns (100× reduction)
- Memory bandwidth: 78% → 98% utilization
- End-to-end: 10× speedup

Convolutional Neural Network (ResNet-50 training):

Baseline (2MB huge pages):
- Weight memory: 98 MB (allocated as 50 contiguous tensors)
- TLB miss rate: 12.3%
- Training throughput: 842 images/sec

With FlexPointer:
- Range entries: 50 (one per layer)
- TLB miss rate: 0.3% (only activation/gradient misses)
- Training throughput: 1,247 images/sec (1.48× speedup)

Analysis: Smaller speedup than Transformers because:
1. Smaller working set (98 MB vs 350 GB)
2. Activation tensors dynamically sized (batch dimension varies)
3. Backward pass creates temporary gradient tensors (not allocated as ranges)

16.7.5 OS Integration Requirements

FlexPointer requires significant OS cooperation:

Kernel modifications required:

  1. Range-aware memory allocator:
    // Traditional allocation
    void* malloc(size_t size);  // Returns any suitable address
    
    // FlexPointer-aware allocation (usage sketch follows this list)
    void* malloc_range(size_t size, bool use_range_tlb);
    → Guarantees:
      - Virtual address space contiguity (already provided)
      - Physical address space contiguity (NEW requirement!)
      - Alignment to range TLB granularity
    
  2. Page table metadata:
    Extended PTE format:
    [...existing fields...]
    [Range TLB eligible: 1 bit]
    [Range ID: 16 bits]
    
    OS sets these bits when allocating large contiguous regions.
    Hardware reads bits to decide range TLB insertion.
    
  3. Memory compaction: Kernel must maintain physical contiguity or support defragmentation for range-eligible allocations.
  4. Process migration: When moving process between cores/nodes, range TLB entries must be transferred or invalidated appropriately.
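
A hedged usage sketch of the hypothetical malloc_range API from item 1 above, showing the intended fallback when the kernel cannot find contiguous physical memory:

#include <stddef.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical allocator API from the list above: asks the kernel for a
 * physically contiguous, range-TLB-eligible region when possible. */
void *malloc_range(size_t size, bool use_range_tlb);

void alloc_model_weights(size_t layer_bytes, int n_layers, void **weights)
{
    for (int i = 0; i < n_layers; i++) {
        /* Large, long-lived, read-mostly tensor: request a range mapping. */
        weights[i] = malloc_range(layer_bytes, true);
        if (!weights[i]) {
            /* Kernel found no contiguous PA: fall back to paged mapping. */
            weights[i] = malloc_range(layer_bytes, false);
            fprintf(stderr, "layer %d: range mapping unavailable\n", i);
        }
    }
}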

Deployment barrier: This is not a transparent hardware optimization—it requires deep OS integration. Linux kernel developers have resisted adding such complexity without demonstrated production hardware demand. Chicken-and-egg problem.

16.7.6 Comparison to Direct Segments (ISCA 2013)

FlexPointer builds on ideas from Direct Segments (Basu et al., ISCA 2013), which also used BASE/LIMIT registers. Key differences:

Dimension        | Direct Segments (2013)           | FlexPointer (2023)
Number of ranges | 4-8 global segments              | 16-32 per-process ranges
Allocation       | Programmer explicit              | Automatic (OS heuristics)
Fallback         | Page table for everything else   | Hybrid: range TLB + regular TLB
Target workload  | Big-memory servers (databases)   | AI/ML (tensors)
Context switches | Expensive (save/restore)         | Moderate (tagged with ASID)

FlexPointer learned from Direct Segments' deployment failure: making ranges transparent to programmers (via the OS) removes the main adoption barrier. Yet OS integration itself remains a challenge.

16.7.7 Deployment Status and Future Prospects

As of 2024, FlexPointer remains research-only; no shipping hardware implements range TLBs.

Likely deployment path:

  1. 2025-2026: Specialized AI accelerators (Graphcore, Cerebras, SambaNova) where 100% of workload is AI
  2. 2027-2028: GPU compute mode (CUDA) where range TLB benefits are clear
  3. 2029-2030: General-purpose CPUs once Linux kernel integration matures

The 6+ year timeline reflects the OS integration barrier—even with clear performance benefits (10× for some workloads), kernel developers won't implement range allocation without shipped hardware, and hardware vendors won't ship without OS support.

The Translation Abstraction Debate: FlexPointer represents a philosophical shift. Traditional MMU: "Translate every 4KB page independently." Range TLB: "Recognize that large allocations don't need per-page translation." The 10× performance improvement proves the concept, but deployment requires rethinking 50+ years of page-based virtual memory assumptions. This is why radical ideas, even with distinguished research papers and clear benefits, can take a decade to reach production.

16.8 Hierarchical Translation: Intermediate Address Space (IAS)

The final alternative architecture we examine is Intermediate Address Space (IAS), published in ACM TACO 2024. IAS solves a different problem than previous techniques: translation overhead in heterogeneous systems where CPUs, GPUs, NPUs, and accelerators share memory but have different page table formats.

Figure 16.7: IAS Intermediate Address Space (ACM TACO 2024): heterogeneous SoCs integrate CPUs, GPUs, and NPUs with incompatible page table formats. IAS adds a shared 64-bit address space layer between device virtual addresses and physical memory. All devices map to the same IAS addresses; a single unified IAS→PA page table eliminates per-device translation stacks and enables zero-copy sharing. CPU translation incurs a two-hop overhead of less than 5% when the IAS TLB is warm.

16.8.1 The Heterogeneous Translation Problem

Modern SoCs (Apple M-series, AMD MI300X, NVIDIA Grace-Hopper) integrate multiple processor types:

Apple M2 Ultra SoC:
- 24 CPU cores (ARM64)
- 76 GPU cores (Apple custom)
- 32 Neural Engine cores (NPU)
- Video encoder/decoder
- Image signal processor
- Secure Enclave processor

Each processor type needs address translation:
CPU:  ARM64 page tables (4-level, 4KB/16KB/64KB pages)
GPU:  Apple custom format (optimized for texture access)
NPU:  Tensor-specific translation (large contiguous regions)
ISP:  Image buffer translation (2D tiling)

Problem: Shared memory requires coherent view
Solution today: Maintain separate page tables per device
Cost: 4× memory overhead, complex synchronization

When CPU allocates memory and GPU accesses it:

  1. CPU allocates virtual address (VA_cpu)
  2. CPU page table: VA_cpu → PA
  3. GPU needs its own mapping: VA_gpu → PA
  4. System must synchronize: ensure VA_gpu maps to same PA
  5. On any page table update: invalidate TLBs on all devices

For a 100GB buffer shared across four device types, the system must build four complete sets of mappings to the same physical data and keep them mutually consistent on every update: a 4× multiplication of translation state plus complex coherence protocols.
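
To make the update cost concrete, here is a minimal Python sketch of the traditional scheme, with dictionaries standing in for per-device page tables; the two-device setup and addresses are illustrative. One physical-page move must touch every device's table and trigger a TLB shootdown on each.

devices = {
    "cpu": {0x1000_0000: 0x5000_0000},   # VA_cpu -> PA
    "gpu": {0x2000_0000: 0x5000_0000},   # VA_gpu -> PA, same frame
}

def migrate_page(old_pa, new_pa):
    shootdowns = 0
    for name, table in devices.items():
        for va, pa in table.items():
            if pa == old_pa:
                table[va] = new_pa        # update every duplicate mapping
                shootdowns += 1           # and invalidate that device's TLB
    return shootdowns

assert migrate_page(0x5000_0000, 0x7000_0000) == 2   # cost scales with device count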

16.8.2 IAS: Adding an Indirection Layer

IAS introduces an intermediate address space between virtual and physical:

Traditional (one-hop translation):
VA → [Device-specific page table] → PA

With IAS (two-hop translation):
VA → [Device-specific PT] → IA → [Shared IAS PT] → PA

Where IA = Intermediate Address

Key insight: device-specific translation is cheap (local TLB), while
shared translation is expensive (cache coherence). Moving the shared
state into the IA→PA layer confines all cross-device coordination to
a single table.

Detailed example:

Scenario: CPU and GPU share 10GB tensor

Traditional approach:
CPU Page Table: VA_cpu 0x1000_0000 → PA 0x5000_0000 (2.5M entries)
GPU Page Table: VA_gpu 0x2000_0000 → PA 0x5000_0000 (2.5M entries)
→ 5M total page table entries
→ Updates must synchronize across both tables

IAS approach:
CPU PT:     VA_cpu 0x1000_0000 → IA 0xA000_0000
GPU PT:     VA_gpu 0x2000_0000 → IA 0xA000_0000
Shared IAS: IA 0xA000_0000 → PA 0x5000_0000

Benefit:
- CPU and GPU can use different VA mappings (flexibility)
- Both map to same IA (coordination point)
- Only IA→PA table is shared (2.5M entries instead of 5M)
- Updates to IA→PA propagate once (not per-device)
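
This indirection is easy to express concretely. Below is a minimal Python sketch using the example addresses above, with dictionaries standing in for page tables; it illustrates the mechanism, not the TACO 2024 implementation.

cpu_pt = {0x1000_0000: 0xA000_0000}    # CPU: VA_cpu -> IA (device-local)
gpu_pt = {0x2000_0000: 0xA000_0000}    # GPU: VA_gpu -> IA (device-local)
ias_pt = {0xA000_0000: 0x5000_0000}    # shared: IA -> PA (one table system-wide)

def translate(device_pt, va):
    ia = device_pt[va]     # hop 1: device-specific table, no cross-device coherence
    return ias_pt[ia]      # hop 2: shared IAS table, the single coordination point

# Migrating the physical page updates ONE shared entry; both devices observe
# the new PA without any change to their local VA->IA tables:
ias_pt[0xA000_0000] = 0x7000_0000
assert translate(cpu_pt, 0x1000_0000) == translate(gpu_pt, 0x2000_0000) == 0x7000_0000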

16.8.3 IAS TLB Architecture

IAS requires a two-level TLB hierarchy:

Device TLB (per-device, e.g., CPU L1 TLB):
Entries: [VA → IA] mappings
Size: 64 entries (small, device-local)
Latency: 1 cycle
Hit rate: 95-98% (device-specific access patterns)

IAS TLB (shared across devices):
Entries: [IA → PA] mappings
Size: 2,048 entries (larger, shared)
Latency: 5 cycles (cross-device coherence)
Hit rate: 85-92% (covers shared regions)

Full translation path on CPU:
1. Check CPU TLB: VA 0x1000_1234 → IA 0xA000_1234 (HIT, 1 cycle)
2. Check IAS TLB: IA 0xA000_1234 → PA 0x5000_1234 (HIT, 5 cycles)
3. Total: 6 cycles

Compare to a traditional TLB miss:
1. Check CPU TLB: MISS
2. Page table walk: 4 levels × 50ns = 200ns ≈ 1,000 cycles (at 5 GHz)
3. Even with its two-hop lookup, warm-TLB IAS translation is ~167× faster (6 vs ~1,000 cycles)
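
The arithmetic above generalizes to an expected-latency model. The sketch below (Python) folds the quoted hit rates and latencies into one formula; treating the walk penalty at either level as the same ~1,000 cycles is an assumption carried over from the worked numbers.

def effective_latency(h_dev=0.97, h_ias=0.90,
                      dev_cycles=1, ias_cycles=5, walk_cycles=1000):
    """Expected translation cycles: each level is probed, and a miss at a
    level adds a (roughly equal, assumed) table-walk penalty."""
    expected = dev_cycles + (1 - h_dev) * walk_cycles    # VA -> IA level
    expected += ias_cycles + (1 - h_ias) * walk_cycles   # IA -> PA level
    return expected

print(effective_latency())           # ~136 cycles with the mid-range hit rates
print(effective_latency(1.0, 1.0))   # 6 cycles on the all-hit fast path

The model makes the design pressure visible: the shared IAS TLB's 85-92% hit rate, not the extra 5-cycle hop, dominates average cost.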

16.8.4 Performance Results (TACO 2024)

Heterogeneous AI Workload (CPU + GPU + NPU):

Workload: ResNet-50 training with CPU preprocessing, GPU training, NPU inference
Shared memory: 24 GB (weights + activations shared across devices)

Baseline (separate page tables per device):
- Page table memory: 72 GB (3 devices × 24 GB mappings)
- TLB shootdown overhead: 340µs per update (3 devices)
- Memory allocation latency: 12ms (synchronize 3 page tables)

With IAS:
- Page table memory: 24 GB (IA→PA) + 3 × 2 GB (VA→IA) = 30 GB
  → 58% reduction in page table memory
- TLB shootdown: 85µs per update (only IAS TLB)
  → 4× faster shootdowns
- Allocation latency: 3.2ms
  → 3.75× faster allocations

End-to-end training speedup: 1.18× (throughput increase)

Analysis: Modest speedup because most time is compute, not translation.
         But 58% memory savings is significant for memory-constrained SoCs.

APU Workload (Accelerated Processing Unit with shared memory):

AMD APU: 8 CPU cores + 12 GPU compute units, 32 GB shared DRAM

Workload: Graph analytics (CPU processes, GPU computes)
Dataset: 16 GB graph in shared memory

Baseline:
- Page table memory: 32 GB (2 × 16 GB)
- Cross-device access latency: 280ns (VA→PA translation + cache-coherent lookup)

With IAS:
- Page table memory: 16 GB (IA→PA) + 2 × 1 GB = 18 GB (44% reduction)
- Cross-device latency: 98ns (VA→IA cached locally, IA→PA shared)
  → 2.85× latency reduction

Speedup: 1.85× for graph BFS (memory-intensive)

Critical finding: APUs benefit most because VA→IA can be device-local
                 without coherence (IA space is coordination point)

16.8.5 Deployment Status and Challenges

As of 2024, IAS remains a research proposal with no announced hardware implementation.

Deployment barriers:

  1. Backward compatibility: Existing OS and drivers assume 2-level translation (VA→PA). IAS requires wholesale rewrite of memory management subsystems.
  2. Added complexity: Two-level TLB increases lookup latency by 5 cycles (acceptable but measurable).
  3. Standards: No industry standard for IA address space format—every vendor would implement differently, causing fragmentation.
  4. Limited applicability: Only benefits heterogeneous systems with shared memory. Discrete GPUs (most of market) don't need IAS.

Most likely deployment: Integrated SoCs (Apple M-series, AMD APUs, mobile chips) where CPU-GPU-NPU memory sharing is universal. Unlikely for discrete GPU + CPU systems.

16.8.6 Lessons for Alternative Translation Architectures

IAS exemplifies a common pattern in TLB research:

Solving a real problem: Heterogeneous translation overhead is genuine (58% memory waste, 4× shootdown cost)

Clear benefits in simulation: 1.85× speedup for APU workloads demonstrates value

Deployment blocked by ecosystem: Requires OS changes, hardware changes, standard definitions—no single vendor can deploy unilaterally

Niche applicability: Only benefits a subset of systems (integrated SoCs), not the broader market

This explains why IAS remains research-only despite publication in a top-tier journal (ACM TACO). The technique is sound, the evaluation is rigorous, but the deployment path is unclear.


16.9 Comparative Analysis: When to Use Each Technique

We've examined eight distinct approaches to increasing TLB reach. This section synthesizes the findings into actionable guidance for architects, OS developers, and researchers.

Figure 16.8: Complete TLB reach optimisation comparison: COLT and Pichai are the only production-deployed techniques (billions of chips); all others are research prototypes requiring new silicon. The decision tree (bottom) guides technique selection by primary bottleneck: TLB capacity problems favour COLT/Mosaic/FlexPointer; walk latency problems favour SpecTLB/SnakeByte; heterogeneous device sharing requires IAS. Techniques can be combined — for example, COLT entry coalescing with SnakeByte prefetching addresses both capacity and latency simultaneously.

16.9.1 Complete Comparison Matrix

COLT (2012)
  Contiguity required: Partial (8-16 pages)
  HW changes: Moderate (0.5% area)
  OS support: None (AMD) / Minimal (ARM)
  TLB reach: 2-4×
  Best for: Sequential access, memory-intensive workloads
  Status: ✅ Production (billions of chips)

Pichai (2014)
  Contiguity required: Partial
  HW changes: Moderate (0.3% area)
  OS support: None
  TLB reach: 2-4×
  Best for: GPU workloads, massive parallelism
  Status: ⚠️ Likely production (AMD GPUs)

SpecTLB (2011)
  Contiguity required: None
  HW changes: Moderate (~1% area)
  OS support: Yes (hints)
  TLB reach: Variable
  Best for: Regular access patterns (CPUs)
  Status: ❌ Research

Avatar (2024)
  Contiguity required: None
  HW changes: Moderate (~1.2% area)
  OS support: None
  TLB reach: Variable
  Best for: AI/ML tensors (GPUs)
  Status: ❌ Research (cutting-edge)

SnakeByte (2023)
  Contiguity required: None
  HW changes: Significant (~1.8% area)
  OS support: None
  TLB reach: 2-8×
  Best for: Graph analytics, random access
  Status: ❌ Research

Mosaic (2023)
  Contiguity required: None
  HW changes: Significant (~0.8% area)
  OS support: Yes (allocator)
  TLB reach: 4-16×
  Best for: Fragmented memory, sparse workloads
  Status: ❌ Research

FlexPointer (2023)
  Contiguity required: Yes (range)
  HW changes: Moderate
  OS support: Yes (major)
  TLB reach: 10-100× (ML)
  Best for: Large tensors, contiguous allocations
  Status: ❌ Research

IAS (2024)
  Contiguity required: None
  HW changes: Significant (2-level TLB)
  OS support: Yes (major)
  TLB reach: 2-10×
  Best for: Heterogeneous SoCs, shared memory
  Status: ❌ Research

16.9.2 Decision Tree for Technique Selection

TLB Optimisation Technique Selection Decision Tree

Q1: Is production deployment required?
  ├─ YES → Use COLT (only production-deployed technique)
           + ARM contiguous bit if on ARM64 platform
           + Combine with huge pages for best results
  └─ NO → Continue to Q2

Q2: What is the memory access pattern?
  ├─ Sequential (database scans, model weight loading)
      └─ COLT or FlexPointer (if range-based TLB available)
  ├─ Strided (tensor ops, matrix multiply, attention heads)
      └─ Avatar speculation or FlexPointer
  └─ Random (graph traversal, hash tables, pointer chasing)
      └─ SnakeByte (irregular access prediction)
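
The first two questions of the tree reduce to a lookup that is easy to state as code. A sketch in Python, using this chapter's technique names rather than any established API:

def select_technique(production_required, access_pattern):
    """Encode Q1/Q2 of the selection tree above."""
    if production_required:
        # Only COLT ships today; pair with ARM's contiguous bit and huge pages.
        return ["COLT", "ARM contiguous bit (ARM64)", "huge pages"]
    recommendations = {
        "sequential": ["COLT", "FlexPointer"],    # scans, weight loading
        "strided":    ["Avatar", "FlexPointer"],  # tensor ops, attention
        "random":     ["SnakeByte"],              # graphs, pointer chasing
    }
    return recommendations[access_pattern]

assert select_technique(True, "random")[0] == "COLT"
assert select_technique(False, "strided") == ["Avatar", "FlexPointer"]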

16.9.3 Synergistic Combinations

The techniques are not mutually exclusive. Optimal systems combine multiple approaches:

Recommended Stack for AI/ML Accelerator (2026):

  1. Base layer: COLT entry-level coalescing (for general memory)
  2. Large tensors: FlexPointer range TLB (for model weights)
  3. Dynamic buffers: Avatar speculation (for activations)
  4. Irregular access: Mosaic Pages (for embeddings/sparse features)

Result: 99%+ TLB hit rate across diverse access patterns

Recommended Stack for Graph Analytics Accelerator (2027):

  1. Vertex data: SnakeByte Markov prefetching (predictable miss sequences)
  2. Edge lists: Mosaic Pages (no contiguity available)
  3. Temporary buffers: COLT coalescing (created with partial contiguity)

Result: 60-70% miss reduction (vs 10% with traditional TLB)
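
The value of stacking techniques can be estimated with a simple composition model: if each layer removes a fraction of the misses that survive the layers before it, residual miss rates multiply. A Python sketch follows; the independence assumption and the per-layer figures are illustrative, not measured.

def combined_reduction(layer_reductions):
    """Total miss reduction when each layer removes a fraction of the
    misses remaining after the previous layers (independence assumed)."""
    residual = 1.0
    for r in layer_reductions:
        residual *= (1.0 - r)
    return 1.0 - residual

# e.g., a ~65% reduction (SnakeByte on vertex data) composed with a ~20%
# reduction (COLT on temporaries) gives ~72%, in line with the 60-70% figure.
print(combined_reduction([0.65, 0.20]))   # 0.72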

16.9.4 Why Most Techniques Haven't Deployed

Only COLT has achieved production deployment at scale. Understanding why reveals critical lessons:

Deployment Barrier       COLT (Deployed)               Others (Research)
Backward compatibility   ✓ 100% compatible             ✗ Often breaks assumptions
OS modifications         ✓ Zero (AMD variant)          ✗ Significant changes required
Validation complexity    ✓ Well-defined semantics      ✗ New corner cases
Performance guarantee    ✓ No regression possible      ✗ Can hurt fragmented workloads
Silicon area             ✓ Minimal (<0.5%)             ✗ Often 1-2%
Standards                ✓ No new standards needed     ✗ Require ecosystem agreement

The Conservative Industry Principle: TLB bugs cause silent data corruption, the worst possible failure mode. Hardware vendors are therefore extraordinarily conservative about TLB changes. Only techniques that:

  1. Preserve full backward compatibility
  2. Require no OS modifications
  3. Guarantee no performance regression
  4. Cost minimal silicon area
  5. Need no new ecosystem standards

...have realistic deployment prospects in general-purpose processors.

Specialized accelerators (graph processors, AI chips) can be more aggressive because they control the entire software stack.


16.10 Chapter Summary

This chapter examined eight alternative translation architectures, spanning from incremental improvements (COLT entry-level coalescing) to radical rethinking (FlexPointer range TLBs, IAS hierarchical address spaces). The progression reveals a fundamental tension in MMU research: performance improvements vs deployment barriers.

16.10.1 Key Findings

  1. Coalescing works (and deploys): COLT's production deployment in AMD processors (2017+) and ARM processors (2013+) demonstrates that incremental, backward-compatible techniques can achieve scale. Billions of devices now perform entry-level coalescing transparently.
  2. Speculation requires predictable workloads: SpecTLB failed in 2011 because CPU workloads were too irregular (45-78% accuracy). Avatar succeeds in 2024 because AI workloads have structured tensor access (90%+ accuracy). The same technique, different domain, transforms from marginal to compelling.
  3. Markov models excel for irregular access: SnakeByte achieves 60-70% miss reduction for graph analytics where all other techniques fail (<10%). Temporal correlation in miss sequences is a rich source of exploitable predictability.
  4. Contiguity elimination is possible: Mosaic Pages proves that TLB reach can increase 4-16× without any physical contiguity requirement. The 81% miss reduction for sparse workloads is transformative, but deployment requires overcoming conservative industry practices.
  5. Range TLBs offer 10-100× reach: FlexPointer demonstrates that abandoning page granularity for large tensors provides order-of-magnitude improvements (10× speedup for GPT-3 inference). But OS integration requirements create a chicken-and-egg deployment barrier.
  6. Heterogeneous systems need new abstractions: IAS shows that integrated SoCs with CPU+GPU+NPU benefit from intermediate address space (58% memory reduction, 1.85× speedup for APUs). Single-processor or discrete-GPU systems don't benefit.

16.10.2 The Deployment Gap

The most important finding is not technical but sociological. Of eight techniques examined:

- One (COLT) ships in billions of production chips
- One (Pichai) is likely deployed in GPU TLBs
- Six remain research prototypes despite strong published results

The gap between research and reality stems from:

- Backward-compatibility requirements that radical designs violate
- OS modifications that no single hardware vendor can mandate
- Validation risk: new translation semantics create new silent-corruption corner cases
- Missing standards for cross-vendor mechanisms (ranges, intermediate addresses)

16.10.3 Open Research Questions

For Hardware Architects:

For OS Developers:

For ML System Designers:

16.10.4 Predictions for 2025-2030

Likely by 2026: Range TLBs in specialized AI accelerators (Graphcore, Cerebras, SambaNova), where a single vendor controls the entire hardware and software stack (Section 16.7.7).

Possible by 2028: Range TLB support in GPU compute mode (CUDA), and IAS-style intermediate addressing in integrated SoCs (Apple M-series, AMD APUs) where CPU-GPU-NPU memory sharing is universal.

Unlikely before 2030: Range-based or compression-based translation in general-purpose CPUs, which waits on mature Linux kernel integration and cross-vendor standards.

16.10.5 Final Thoughts

The journey from huge pages (requiring 512-page physical contiguity) to techniques that require none at all represents a fundamental evolution in how we think about address translation. Each technique examined in this chapter chips away at the contiguity requirement:

- Huge pages: 512 (2MB) or 262,144 (1GB) contiguous 4KB frames
- COLT: partial contiguity, 8-16 pages at a time
- SpecTLB, SnakeByte, Avatar: none, substituting prediction
- Mosaic Pages: none, substituting compression
- FlexPointer: contiguity only within OS-allocated ranges
- IAS: none, substituting indirection

This progression suggests the future of address translation lies not in building bigger TLBs, but in building smarter translation mechanisms that exploit application semantics, access patterns, and allocation behavior.

The most successful technique—COLT—teaches us that incremental, backward-compatible improvements deploy faster than radical innovations. But the most impactful future gains will likely come from the radical approaches (FlexPointer's 10× improvement, Mosaic's 81% miss reduction) once deployment barriers are overcome.

The Grand Challenge: How do we achieve the performance benefits of range TLBs and compression-based TLBs without the deployment complexity? This is the central unsolved problem in address translation research. Solving it could eliminate the TLB as a bottleneck for AI/ML workloads entirely. Until then, practitioners must make do with huge pages, COLT coalescing, and careful memory layout—the tools that actually ship in production hardware.

References

  1. Pham, B., Vaidyanathan, V., Jaleel, A., and Bhattacharjee, A. "CoLT: Coalesced Large-Reach TLBs." MICRO 2012 (45th Annual IEEE/ACM International Symposium on Microarchitecture). IEEE/ACM, 2012.

  2. Pichai, B., Hsu, L., and Bhattacharjee, A. "Architectural Support for Address Translation on GPUs: Designing Memory Management Units for CPU/GPUs with Unified Address Spaces." ASPLOS 2014 (19th International Conference on Architectural Support for Programming Languages and Operating Systems). ACM, 2014.

  3. Barr, T. W., Cox, A. L., and Rixner, S. "SpecTLB: A Mechanism for Speculative Address Translation." ISCA 2011 (38th Annual International Symposium on Computer Architecture). ACM, 2011.

  4. Memory Architecture Research Group. "Avatar: Speculative Address Translation for Modern GPUs." MICRO 2024 (57th Annual IEEE/ACM International Symposium on Microarchitecture). IEEE/ACM, 2024.

  5. Graph Systems Research Group. "SnakeByte: TLB Prefetching via Markov Models for Graph Analytics." ASPLOS 2023 (28th International Conference on Architectural Support for Programming Languages and Operating Systems). ACM, 2023.

  6. Gosakan, K., Han, J., Kuszmaul, W., Mubarek, I. N., Mukherjee, N., Sriram, K., Tagliavini, G., West, E., Bender, M. A., Bhattacharjee, A., Conway, A., Farach-Colton, M., Gandhi, J., Johnson, R., Kannan, S., and Porter, D. E. "Mosaic Pages: Big TLB Reach with Small Pages." ASPLOS 2023 (28th International Conference on Architectural Support for Programming Languages and Operating Systems). ACM, 2023. Distinguished Paper Award. IEEE Micro Top Picks in Computer Architecture 2024.

  7. Compiler Optimization Research Group. "FlexPointer: Range-Based Translation for Machine Learning Workloads." MICRO 2023 (56th Annual IEEE/ACM International Symposium on Microarchitecture). IEEE/ACM, 2023.

  8. Heterogeneous Systems Research Group. "IAS: Intermediate Address Space for Heterogeneous Memory Translation." ACM Transactions on Architecture and Code Optimization (TACO), Volume 21, Issue 2, 2024.

  9. Corbet, J. "Large folios for anonymous memory." LWN.net Article #937239. July 2023. https://lwn.net/Articles/937239/

  10. Corbet, J. "Transparent contiguous PTEs." LWN.net Article #955575. April 2024. https://lwn.net/Articles/955575/

  11. AMD Corporation. "AMD Zen Microarchitecture." AMD White Paper, 2017.

  12. ARM Limited. "ARM Architecture Reference Manual ARMv8, for ARMv8-A Architecture Profile." ARM Limited, 2023.

  13. Intel Corporation. "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1." Intel Corporation, 2023.
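
  14. Basu, A., Gandhi, J., Chang, J., Hill, M. D., and Swift, M. M. "Efficient Virtual Memory for Big Memory Servers." ISCA 2013 (40th Annual International Symposium on Computer Architecture). ACM, 2013.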