Chapter 4: Translation Lookaside Buffer (TLB) - Deep Dive

4.1 Introduction: The Most Critical Cache You've Never Heard Of

In Chapter 1, we introduced the Translation Lookaside Buffer (TLB) as a specialized cache that stores recent virtual-to-physical address translations. We saw how TLB hits provide translations in 1-2 CPU cycles (~1 nanosecond), while TLB misses require a full page table walk taking ~100 cycles (~100 nanoseconds)—a 50-100× performance difference.

This chapter explores the TLB in comprehensive detail. While most programmers are familiar with CPU caches (L1, L2, L3), the TLB remains relatively obscure despite being equally critical to system performance. A well-designed application with excellent L1 cache behavior can still suffer catastrophic performance degradation from TLB misses. Conversely, TLB-aware optimization can yield speedups of 2-5× or more on memory-intensive workloads.

Why the TLB Deserves Deep Study

Performance Impact: The TLB is on the critical path of every single memory access. Modern processors execute billions of instructions per second, many accessing memory multiple times per instruction. Even a small TLB miss rate can dominate execution time. Consider:

TLB hit: 1-2 cycles (0.25-0.5 ns @ 4GHz)
TLB miss, page walk cache hit: ~10-20 cycles (2.5-5 ns)
TLB miss, full walk, L3 hit: ~40-100 cycles (10-25 ns)
TLB miss, full walk, DRAM: ~100-200 cycles (25-50 ns)

A 5% TLB miss rate with 100-cycle penalty costs 5 cycles per access on average—more expensive than an L1 cache miss! For applications with poor TLB behavior, the miss rate can exceed 20-30%, causing 20-30 cycle average penalty and making memory accesses 10× slower than they should be.

Architectural Diversity: Unlike data caches, which are relatively standardized, TLB designs vary dramatically across architectures:

x86-64: Hardware-managed with complex multi-level hierarchy
ARM64: Hardware-managed with extensive ASID/VMID support
RISC-V: Software-managed with explicit TLB instructions
MIPS: Historically software-managed, influencing RISC-V

Understanding these differences is crucial for writing portable, high-performance systems software.

Virtualization Complexity: Cloud computing has made virtualization ubiquitous. Two-stage address translation (guest virtual → guest physical → host physical) doubles TLB pressure and introduces new management challenges. Modern TLBs must cache two-dimensional translations while maintaining coherency across VM switches.

Security Implications: The TLB interacts intimately with memory protection, making it security-critical: - TLB poisoning can bypass access control - TLB timing attacks leak information across security boundaries - Spectre and Meltdown exploit speculative TLB behavior - TLB shootdown creates denial-of-service vectors

What We'll Cover

This chapter provides comprehensive coverage of TLB architecture, management, and optimization:

Sections 4.2-4.3: TLB hardware architecture, entry structure, and management strategies (hardware vs. software)

Sections 4.4-4.5: Address space identifiers (ASID/PCID/VMID) and TLB operations (invalidation, shootdown)

Sections 4.6-4.7: Large pages, TLB coverage, and multicore coherency

Section 4.8: Platform-specific implementations (Intel, AMD, ARM, RISC-V) with real hardware specifications

Sections 4.9-4.10: Performance optimization and virtualization

Sections 4.11-4.12: Security considerations and advanced topics

By the end of this chapter, you'll understand not just what the TLB does, but how it works, why it's designed as it is, and how to write TLB-efficient software.

4.2 TLB Architecture and Organization

4.2.1 The TLB Hierarchy: Why Multiple Levels?

Modern processors don't have a single TLB—they have a multi-level TLB hierarchy analogous to the data cache hierarchy (L1/L2/L3). This design emerges from fundamental tradeoffs:

Speed vs. Capacity Tradeoff: - Small, fully-associative TLBs (32-128 entries) can provide 1-cycle lookup - Large TLBs (1024-2048 entries) require multiple cycles or set-associativity - Solution: Multi-level hierarchy with small, fast L1 TLBs backed by larger L2 TLBs

Instruction vs. Data Separation: - Instruction fetches and data accesses have different locality patterns - Code typically has high spatial locality (sequential execution) - Data access patterns vary widely by application - Solution: Separate L1 ITLB and L1 DTLB, unified L2 TLB

Page Size Heterogeneity: - Modern systems use mixed page sizes (4KB, 2MB, 1GB) - Different page sizes have different working set requirements - Optimal TLB organization differs by page size - Solution: Multiple TLBs or flexible entries supporting multiple sizes

Typical TLB Hierarchy

Figure 4.1: TLB hierarchy: L1 ITLB and DTLB serve in 1 cycle; misses fall to L2 STLB (8–12 cycles); STLB misses trigger the hardware page table walker (50–200 ns).

L1 TLBs (Instruction and Data): - Size: 32-128 entries (typically 64 for data, 128 for instructions) - Associativity: Fully associative (all entries checked in parallel) - Latency: 1 cycle (integrated into pipeline) - Hit Rate: 95-99% for well-behaved applications - Organization: Separate ITLB (instruction) and DTLB (data)

The L1 TLBs are tiny by design—they must provide translation on the same cycle as L1 cache access, requiring fully parallel hardware that becomes impractical beyond ~128 entries. The split between instructions and data allows the CPU to fetch instructions while simultaneously accessing data, maintaining the illusion of Harvard architecture at the TLB level.

L2 TLB (Shared/Unified): - Size: 512-2048 entries (Intel: 1536, AMD Zen 3: 2048) - Associativity: 4-way to 12-way set-associative - Latency: 5-10 cycles - Hit Rate: 98-99.9% when L1 misses - Organization: Unified for both instructions and data

The L2 TLB acts as a victim cache for the L1 TLBs, storing translations evicted from L1. Its larger capacity provides better coverage of the application's working set, while set-associativity keeps lookup time reasonable. Modern L2 TLBs often support multiple page sizes in a single structure.

Real-World Example: Intel Ice Lake (2019)

Intel's Ice Lake microarchitecture provides a concrete example:

L1 DTLB:
  - 64 entries for 4KB/2MB pages (flexible)
  - 32 entries for 4MB pages
  - 4 entries for 1GB pages
  - Fully associative
  - 1-cycle latency

L1 ITLB:
  - 128 entries for 4KB pages
  - 8 entries for 2MB/4MB pages
  - Fully associative
  - 1-cycle latency

L2 STLB (Shared):
  - 1536 entries for 4KB or 2MB pages
  - 16 entries for 1GB pages
  - 12-way set-associative
  - 5-cycle latency

Notice the allocation strategy: Many small-page entries, fewer large-page entries. This reflects typical usage where most translations are 4KB, but a few critical large mappings (e.g., shared libraries, huge pages) use 2MB/1GB pages.

4.2.2 TLB Entry Structure: What Does Each Entry Store?

A TLB entry must contain all information needed to perform address translation and permission checks. Let's examine the detailed structure:

Basic TLB Entry Components

1. Virtual Page Number (VPN) - The Tag

The VPN identifies which virtual page this entry translates. For a 4KB page in a 48-bit virtual address space:

Virtual Address (48 bits): [47:12] = VPN (36 bits), [11:0] = Offset (12 bits)

The VPN serves as the TLB's tag—the lookup key for associative matching. During translation: 1. CPU extracts VPN from virtual address 2. TLB compares VPN against all tags in parallel (fully-associative) or within a set (set-associative) 3. Match → TLB hit; No match → TLB miss

2. Physical Frame Number (PFN) - The Data

The PFN specifies where this virtual page maps in physical memory. For 4KB pages:

Physical Address (52 bits): [51:12] = PFN (40 bits), [11:0] = Offset (12 bits)

On a TLB hit, the CPU: 1. Takes the PFN from the matched TLB entry 2. Concatenates it with the page offset from the virtual address 3. Issues the resulting physical address to the cache/memory system

3. Permission Bits

Permission bits enforce memory protection at hardware speed:

R (Read): Page is readable
W (Write): Page is writable
X (Execute): Page contains executable code
U/S (User/Supervisor): Accessible in user mode vs. kernel only

Checking these on every access in the page table would be too slow. Caching them in the TLB allows permission checks in the same cycle as translation.

4. Status and Control Flags

Modern TLBs cache additional page table entry flags:

V (Valid): TLB entry is valid (present and usable)
G (Global): Page is shared across all address spaces (don't flush on context switch)
D (Dirty): Page has been modified (write has occurred)
A (Accessed): Page has been accessed (read or write)
PAT/MTRR bits: Memory type (write-back, write-through, uncached, etc.)

These flags serve multiple purposes: - G-bit optimization: Kernel mappings marked global survive context switches, avoiding TLB refills - D-bit dirty tracking: OS uses this for copy-on-write and page-out decisions
- A-bit access tracking: Guides page replacement algorithms - Memory type control: Critical for device I/O and memory-mapped registers

5. Address Space Identifier (ASID/PCID/VMID)

To avoid flushing the entire TLB on context switches, modern processors tag entries with an identifier:

ASID (ARM): 8-bit or 16-bit Address Space ID
PCID (x86-64): 12-bit Process Context ID
VMID (Virtualization): Virtual Machine ID

This allows the TLB to hold translations from multiple processes simultaneously:

TLB Lookup: Match (VPN, ASID) → Return PFN

Without ASID/PCID, every context switch requires:

1. Flush entire TLB (except global entries)
2. All subsequent accesses miss TLB
3. Expensive page table walks refill TLB
4. Takes ~1000 cycles to recover performance

With ASID/PCID:

1. Change ASID/PCID register
2. TLB entries from old process remain cached
3. TLB entries from new process (if still present) hit immediately
4. Significant performance improvement on context-heavy workloads

We'll explore ASID/PCID/VMID mechanisms in detail in Section 4.4.

6. Page Size Indicator

Since modern TLBs support multiple page sizes, each entry must indicate its size:

Page Size (PS) bits:
  00 = 4 KB
  01 = 2 MB  
  10 = 1 GB
  11 = 512 GB (future/experimental)

This is crucial because: - Virtual address → VPN extraction differs by page size - Large pages have fewer VPN bits, more offset bits - TLB must extract the correct PFN portion

For example, with a 2MB page:

Virtual Address: [47:21] = VPN (27 bits), [20:0] = Offset (21 bits)
Physical Address: [51:21] = PFN (31 bits), [20:0] = Offset (21 bits)

How Page Size is Determined During Page Walk

During a hardware page table walk, the PS (Page Size) bit in intermediate page table entries determines whether the walk continues or terminates at that level:

x86-64 Page Walk with PS Bit Checking:

1. Read CR3 → Get PML4 base address
2. Index into PML4 using VA[47:39]
3. Read PML4 Entry (PML4E)
4. Check P bit; if 0 → Page Fault

5. Index into PDPT using VA[38:30]
6. Read PDPT Entry (PDPTE)
7. Check P bit; if 0 → Page Fault
8. Check PDPTE.PS (bit 7):
   - If PS=1: This is a 1 GB page
     → Extract PFN from PDPTE[51:30] (1 GB aligned)
     → Page offset = VA[29:0] (30 bits for 1 GB = 2^30)
     → Physical Address = (PFN << 30) | VA[29:0]
     → STOP: Insert 1 GB TLB entry
   - If PS=0: Continue to next level

9. Index into PD using VA[29:21]
10. Read PD Entry (PDE)
11. Check P bit; if 0 → Page Fault
12. Check PDE.PS (bit 7):
    - If PS=1: This is a 2 MB page
      → Extract PFN from PDE[51:21] (2 MB aligned)
      → Page offset = VA[20:0] (21 bits for 2 MB = 2^21)
      → Physical Address = (PFN << 21) | VA[20:0]
      → STOP: Insert 2 MB TLB entry
    - If PS=0: Continue to next level

13. Index into PT using VA[20:12]
14. Read PT Entry (PTE)
15. Check P bit; if 0 → Page Fault
16. Extract PFN from PTE[51:12] (4 KB aligned)
17. Page offset = VA[11:0] (12 bits for 4 KB = 2^12)
18. Physical Address = (PFN << 12) | VA[11:0]
19. Insert 4 KB TLB entry

Key Points:

PS bit only exists in upper-level entries: PML4E doesn't have PS bit (always points to PDPT). PDPTE and PDE have PS bits to enable 1GB and 2MB pages respectively.
PS=1 terminates the walk early: When PS=1, that entry contains the final physical address (appropriately aligned), not a pointer to the next level.
TLB entry size matches page size: The TLB entry's PS field is set based on which level terminated the walk:
- Walk stopped at PDPTE with PS=1 → TLB entry PS=10 (1GB)
- Walk stopped at PDE with PS=1 → TLB entry PS=01 (2MB)
- Walk completed to PTE → TLB entry PS=00 (4KB)
Alignment requirements: Large page physical addresses must be aligned:
- 1 GB pages: PFN must be aligned to 1GB boundary (bits [29:0] = 0 in physical address)
- 2 MB pages: PFN must be aligned to 2MB boundary (bits [20:0] = 0 in physical address)

Architecture-Specific Variations:

ARM64: Uses descriptor type field instead of PS bit:

Bits [1:0] in descriptor:
  00 = Invalid
  01 = Block entry (large page at this level)
  10 = Reserved
  11 = Table entry (points to next level) or Page entry (at level 3)

RISC-V: Uses elegant R/W/X encoding:

R W X = 0 0 0 → Pointer to next level (continue walk)
R W X ≠ 0 0 0 → Leaf entry (page at this level)

Page size determined by level:
  Level 2 leaf → 1 GB page
  Level 1 leaf → 2 MB page  
  Level 0 leaf → 4 KB page

For complete details on page table walk algorithms and PS bit encoding across architectures, see Chapter 3, Section 3.4 "Multi-Level Page Tables" and Section 3.5.2 "x86-64 Page Table Entry Structure" where large page formats are covered in depth.

Complete TLB Entry Layout

Here's a representative TLB entry structure (x86-64 style):

TLB Entry Field Layout (x86-64 without ASID)
VPN 36 bits	PS 2 bits	PFN 40 bits	PCID 12 bits	Flags 12 bits	Valid 1 bit
Tag	4K / 2M / 1G	Physical Frame	Process ID	R \| W \| X \| U \| G \| D \| A	V

PS = page size selector; PCID = Process Context ID (x86-64 ASID equivalent); Flags: R=Read, W=Write, X=Execute, U=User, G=Global, D=Dirty, A=Accessed

ARM64 would have ASID instead of PCID, potentially VMID for virtualization, and slightly different flag bits, but the structure is similar.

Note: Actual hardware implementations may optimize encoding, use additional bits for internal bookkeeping, or employ specialized representations for different page sizes.

4.2.3 Associativity and Replacement Policies

TLB lookup must be fast—ideally single-cycle for L1 TLBs. This requirement drives associativity choices.

Fully Associative TLBs

Design: Any translation can be stored in any TLB entry. Lookup checks all entries in parallel.

Advantages: - Maximum flexibility—no capacity misses due to set conflicts - Best hit rate for given size - Simple to implement at small sizes (32-128 entries)

Disadvantages: - Requires parallel comparators for every entry (expensive in silicon area and power) - Doesn't scale beyond ~128 entries - Replacement policy must consider entire TLB

Typical Use: L1 TLBs (both ITLB and DTLB)

Set-Associative TLBs

Design: TLB is divided into sets; each translation hashes to a specific set and can only reside in one of N entries within that set (N-way set-associative).

Address Mapping:

VPN → Set Index = VPN[m:0] where m = log₂(num_sets)

For a 1024-entry, 8-way set-associative TLB: - 128 sets (1024 / 8) - VPN[6:0] determines the set (2^7 = 128) - Check 8 entries in parallel within that set

Advantages: - Scales to larger sizes (1024-4096 entries feasible) - Reduces comparator count (only N comparators vs. total entries) - Power efficient—only activate one set

Disadvantages: - Conflict misses possible (two VPNs hash to same set, exceed associativity) - Lower hit rate than fully-associative at same total size - Set index bits "wasted" for tagging purposes

Typical Use: L2/L3 TLBs

Replacement Policies

When a TLB is full, inserting a new entry requires evicting an existing one. The replacement policy determines which entry to evict.

1. Least Recently Used (LRU)

Strategy: Evict the entry that hasn't been used for the longest time.

Implementation: Maintain timestamp or age counter for each entry, evict oldest.

Advantages: - Theoretically optimal for temporal locality - Predictable behavior

Disadvantages: - Requires per-entry metadata (logâ''N bits for N-way associative) - Update cost on every TLB access (write timestamp) - Complex for large associativity

Use: Common in smaller TLBs (L1) where tracking is feasible

2. Pseudo-LRU (PLRU)

Strategy: Approximate LRU using a tree of "recently used" bits.

Implementation: For 8-way associative, use 7 bits to encode a binary tree:

           [0]
          /   \
       [1]     [2]
      /  \    /  \
    [3]  [4][5]  [6]
    / \  / \ / \ / \
   E0 E1E2E3E4E5E6 E7

Each internal node bit points toward "recently used" subtree.

Advantages: - Much less metadata (N-1 bits for N entries vs. N*log₂N) - Cheaper to update (log₂N bits) - Good approximation of LRU

Disadvantages: - Not true LRU (pathological cases exist) - More complex than random

Use: Common in L2 TLBs

3. Not Recently Used (NRU)

Strategy: Mark entries as "used" on access, prefer evicting "not used" entries.

Implementation: Single "used" bit per entry, periodically clear all bits.

Advantages: - Very simple (1 bit per entry) - Low overhead - Reasonably effective

Disadvantages: - Coarse granularity - Periodic clearing introduces complexity

Use: Some simpler TLB designs

4. Random Replacement

Strategy: Evict a random entry from the set.

Implementation: Use a linear feedback shift register (LFSR) to generate pseudo-random index.

Advantages: - Simplest possible (no per-entry metadata) - No pathological cases - Surprisingly effective in practice

Disadvantages: - No exploitation of temporal locality - Variable performance

Use: Simple embedded processors, some RISC designs

Hardware Example: Intel Skylake L2 STLB

Intel's Skylake uses a 12-way set-associative L2 STLB with 1536 entries for 4KB/2MB pages:

Organization:
  - 128 sets (1536 entries / 12 ways)
  - VPN[6:0] selects set
  - 12-way associative within each set
  - Pseudo-LRU replacement policy

When a new translation is loaded: 1. Hash VPN to determine set (VPN[6:0]) 2. If set is full, select victim using PLRU tree 3. Evict victim entry 4. Insert new translation

The 12-way associativity provides good hit rates while keeping the comparator count manageable (only 12 parallel comparisons per lookup instead of 1536).

4.3 TLB Management: Hardware vs Software

One of the most fundamental design decisions in TLB architecture is who manages TLB misses: hardware or software? This choice has profound implications for performance, complexity, flexibility, and OS design. Modern architectures take different approaches, each with compelling rationales.

Figure 4.2: Hardware-managed vs software-managed TLB. Left: x86-64/ARM64 hardware page walker automatically traverses PML4→PDPT→PD→PT on a TLB miss (10–50 cycles, no OS involvement). Right: RISC-V/MIPS software-managed TLB raises a TLB-miss exception and invokes the OS handler to walk page tables and write the entry via SFENCE.VMA (100–1,000 cycles).

4.3.1 Hardware-Managed TLBs: The x86 and ARM Approach

In hardware-managed TLB systems, the processor automatically walks the page tables on a TLB miss and fills the TLB entry without software intervention. This is the approach taken by x86-64 (Intel and AMD) and ARM64 architectures.

How Hardware TLB Management Works

When a TLB miss occurs:

1. CPU detects TLB miss (VPN not found in TLB)

2. Hardware page table walker activates: - Reads page table base register (CR3 on x86, TTBR0_EL1 on ARM) - Follows page table hierarchy level by level - Reads each page table entry from memory - Validates permissions and present bits at each level

3. On success: - Hardware constructs TLB entry from leaf PTE - Inserts entry into TLB (following replacement policy) - Retries the memory access (now hits in TLB) - No software intervention required

4. On failure (page not present, permission violation): - Generate page fault exception - Transfer control to OS page fault handler - OS resolves the fault and returns - Hardware retries, now succeeds

x86-64 Hardware Page Walk Example

Consider accessing virtual address 0x00401234 with a TLB miss:

1. TLB lookup for VPN 0x401 → MISS
2. Hardware page walker activates:
   - Read CR3 → PML4 base address = 0x1000000
   - Extract PML4 index from VA: bits [47:39] = 0
   - Read PML4E at 0x1000000 + (0 × 8) = 0x1000000
   - PML4E contains PDPT base = 0x2000000, P=1 (present)
   
   - Extract PDPT index from VA: bits [38:30] = 0
   - Read PDPTE at 0x2000000 + (0 × 8) = 0x2000000
   - PDPTE contains PD base = 0x3000000, P=1
   
   - Extract PD index from VA: bits [29:21] = 2
   - Read PDE at 0x3000000 + (2 × 8) = 0x3000010
   - PDE contains PT base = 0x4000000, P=1
   
   - Extract PT index from VA: bits [20:12] = 1
   - Read PTE at 0x4000000 + (1 × 8) = 0x4000008
   - PTE contains PFN = 0xA2B4C, P=1, R/W=1, U/S=0, NX=0

3. Hardware constructs TLB entry:
   - VPN = 0x401
   - PFN = 0xA2B4C
   - Permissions = R/W/X, Supervisor only
   - Insert into TLB

4. Retry access → TLB HIT → Success

Total cost: 4 memory accesses (one per level) + TLB insertion = ~40-200 cycles depending on cache hits

Advantages of Hardware Management

1. Software Simplicity

The OS doesn't need TLB miss handlers. Page fault handling code is simpler—it only deals with genuinely invalid accesses, not TLB capacity misses.

2. Performance Predictability

Hardware page walks have deterministic, well-optimized performance. The same code path is used billions of times per second, so it's worth extensive microarchitectural optimization.

3. Page Walk Caches

Modern processors add Page Walk Caches (PWC) that cache intermediate page table entries: - PML4 cache: Caches PML4 entries - PDPT cache: Caches PDPT entries
- PD cache: Caches PD entries

These dramatically reduce the cost of TLB misses. Example with PWC:

TLB miss, but all intermediate levels cached:
- Read PML4E from PWC (hit) - 0 cycles
- Read PDPTE from PWC (hit) - 0 cycles
- Read PDE from PWC (hit) - 0 cycles
- Read PTE from L3 cache - 40 cycles
Total: ~40 cycles instead of 200

Intel processors report PWC hit rates of 70-90% for typical workloads, significantly mitigating TLB miss penalties.

4. Concurrent Page Walks

Hardware can support multiple simultaneous page walks for different TLB misses, improving throughput under high miss rates.

Disadvantages of Hardware Management

1. Inflexibility

The page table format is fixed by hardware. OS cannot use custom page table structures or compression schemes.

2. Hardware Complexity

Implementing a page table walker requires substantial silicon area and power. The walker must handle multiple page sizes, two-stage translation (virtualization), and various corner cases.

3. Attack Surface

Hardware page walkers have been targets of attacks: - Speculative execution: Spectre variant 1.1 exploits speculative page walks - Timing side channels: Page table structure can leak information through timing

4. Wasted Work

If a page is genuinely not present (swapped out), the hardware walker traverses multiple levels unnecessarily before generating a page fault.

4.3.2 Software-Managed TLBs: The RISC-V and MIPS Approach

Software-managed TLB systems take the opposite approach: on a TLB miss, the processor generates an exception and software fills the TLB entry. This is the approach used by RISC-V and historically by MIPS.

How Software TLB Management Works

1. CPU detects TLB miss

2. CPU generates TLB miss exception: - Instruction TLB miss: Jump to ITLB miss handler vector - Data TLB miss: Jump to DTLB miss handler vector - Or unified TLB miss vector (implementation-dependent)

3. Software TLB refill handler executes: - Reads page table base from CSR (Control and Status Register) - Walks page tables in software - Constructs TLB entry - Uses special instruction to write TLB entry - Returns from exception

4. CPU retries memory access → TLB HIT

RISC-V Software TLB Refill Example

RISC-V provides explicit instructions for TLB management:

TLB Write Instruction:

# Insert TLB entry
sfence.vma x0, x0    # Fence to synchronize
# (TLB insertion happens implicitly via CSR writes or other mechanism)

Typical RISC-V TLB Miss Handler:

void tlb_miss_handler(vaddr_t fault_addr) {
    // Extract VPN from faulting address
    vpn_t vpn = fault_addr >> 12;
    
    // Read page table base
    paddr_t pt_base = read_csr(satp) & SATP_PPN_MASK;
    
    // Sv39: 3-level page table walk
    // Level 2 (VPN[2])
    uint64_t *l2_pte = (uint64_t*)(pt_base + ((vpn >> 18) & 0x1FF) * 8);
    if (!(*l2_pte & PTE_V)) goto page_fault;
    if ((*l2_pte & (PTE_R | PTE_W | PTE_X))) goto found_leaf; // Huge page
    
    // Level 1 (VPN[1])
    paddr_t l1_base = (*l2_pte >> 10) << 12;
    uint64_t *l1_pte = (uint64_t*)(l1_base + ((vpn >> 9) & 0x1FF) * 8);
    if (!(*l1_pte & PTE_V)) goto page_fault;
    if ((*l1_pte & (PTE_R | PTE_W | PTE_X))) goto found_leaf; // Large page
    
    // Level 0 (VPN[0])
    paddr_t l0_base = (*l1_pte >> 10) << 12;
    uint64_t *l0_pte = (uint64_t*)(l0_base + (vpn & 0x1FF) * 8);
    if (!(*l0_pte & PTE_V)) goto page_fault;
    
found_leaf:
    // Check permissions (R/W/X, U/S)
    if (!check_permissions(pte, access_type)) goto page_fault;
    
    // Insert into TLB (implementation-specific)
    tlb_insert(vpn, *l0_pte);
    return;
    
page_fault:
    // Real page fault - call OS handler
    handle_page_fault(fault_addr);
}

Advantages of Software Management

1. Flexibility

OS has complete control over page table format. Possibilities include: - Compressed page tables: Use custom compression schemes - Inverted page tables: Hash-based lookup structures - Radix trees: Alternative hierarchical structures - Per-process optimizations: Different processes can use different schemes

2. Simplified Hardware

No hardware page walker required, saving silicon area and power. This aligns with RISC-V's philosophy of minimal hardware complexity.

3. Transparency

Software can instrument TLB misses for profiling, debugging, or security monitoring without hardware support.

4. Rapid Evolution

New page table designs can be deployed via software updates without hardware changes.

Disadvantages of Software Management

1. Performance Cost

Every TLB miss requires exception entry/exit overhead: - Save/restore CPU state - Jump to handler (potentially cold in instruction cache) - Execute handler code - Return from exception

Typical overhead: 200-1000 cycles for a TLB miss, even with fast handler.

2. Software Complexity

OS must implement TLB refill handlers correctly for every page size, permission combination, and edge case. Bugs in handlers can cause crashes or security vulnerabilities.

3. No Page Walk Caches

Without hardware page walks, there's no obvious place to add page walk caches. Software can cache intermediate results, but this adds more complexity.

4. Interrupt Latency

TLB miss handlers execute with interrupts often disabled, increasing worst-case interrupt latency.

4.3.3 Hybrid Approaches and Modern Trends

Modern architectures increasingly blur the hardware/software boundary:

ARM's Hardware Management with Software Hooks

ARM64 uses hardware page walks but allows software intervention:

Access Flag Faults: If a PTE's Access Flag (AF) is not set, ARM can either: - Set it automatically (pure hardware), or - Generate an Access Flag fault for software to handle

This lets OS implement custom page tracking without full software TLB management.

Hardware Table Walk Disable: ARM allows disabling hardware page walks entirely via TCR_EL1.HPD, falling back to software management if desired.

RISC-V Extensions for Hardware Assistance

While RISC-V's base specification requires software TLB management, extensions can add hardware assistance:

Svpbmt Extension: Adds page-based memory types (analogous to x86 PAT) without requiring full hardware page walks.

Svnapot Extension:
Naturally-aligned power-of-two (NAPOT) regions, allowing more flexible large page support.

x86 Software Page Table Walking

x86 allows software to traverse page tables using regular memory accesses, useful for: - Debugging and introspection - Custom translation mechanisms - Security monitoring

The CR3 register can be read by software, and page tables are in normal memory (though often write-protected by OS).

4.4 Address Space Identifiers: ASID, PCID, and VMID

Context switches—changing which process is running on a CPU—are extremely common in modern systems. A typical desktop or server experiences hundreds to thousands of context switches per second. Without optimization, each context switch would require flushing the entire TLB, destroying all cached translations and causing expensive TLB misses until the TLB refills with the new process's translations.

Address Space Identifiers solve this problem by allowing the TLB to cache translations from multiple processes simultaneously, tagging each entry with a process ID. This section explores three mechanisms: ASID (ARM), PCID (x86), and VMID (virtualization).

4.4.1 The Context Switch Problem

Figure 4.3: TLB entry structure and ASID/PCID context tagging. Each entry stores VPN, PFN, permission bits, page-size flag, and an ASID/PCID tag. Without context identifiers a full TLB flush is required on every context switch; PCID (x86-64) and ASID (ARM64/RISC-V) let entries from multiple processes coexist, saving 10-30% of context-switch overhead.

Without address space identifiers, context switching requires:

Old Behavior:
1. Process A is running
   - TLB contains A's translations
   - TLB hits provide fast translation

2. Context switch to Process B
   - Must flush TLB (except global entries)
   - Write CR3/TTBR0 with B's page table base
   
3. Process B starts running
   - Every memory access misses TLB initially
   - Expensive page table walks refill TLB
   - Takes 500-2000 cycles to warm up TLB
   
4. Context switch back to Process A
   - Flush TLB again
   - A's translations gone despite recent use
   - Another 500-2000 cycle warmup

Performance impact measurement:

Assuming: - Context switches every 10ms (quantum) - 1000 cycles TLB warmup cost per switch - 4 GHz CPU

Cost per second:

100 switches/sec × 1000 cycles/switch = 100,000 cycles/sec
= 0.0025% of CPU time

This seems small, but: - Real warmup cost is higher for memory-intensive applications - Cache pollution from context switches adds overhead - High-frequency context switches (server workloads) amplify the problem

Studies show context switch overhead can reach 1-5% of CPU time on busy servers.

4.4.2 ASID: ARM's Address Space Identifier

ARM's solution to the context switch problem is the Address Space Identifier (ASID), a tag attached to each TLB entry identifying which process it belongs to.

ASID Basics

Size: 8-bit or 16-bit (implementation-defined) - 8-bit ASID: 256 distinct address spaces - 16-bit ASID: 65,536 distinct address spaces

Storage: The ASID is part of the Translation Control Register: - TTBR0_EL1[63:48]: ASID for user-space translations - TTBR1_EL1[63:48]: ASID for kernel-space translations (often ignored since kernel is global)

TLB Entry Format:

TLB Entry with ASID (ARM64 / RISC-V)
VPN	ASID 8–16 bits	PFN	Flags
Virtual page tag	Address Space ID	Physical Frame	R / W / X / U / G

ASID tags each entry with the address space it belongs to, eliminating full TLB flushes on context switch. ARM64 supports 8-bit (256) or 16-bit (65,536) ASIDs.

ASID Lookup Process

1. Memory access to virtual address VA
2. Extract VPN from VA
3. Read current ASID from TTBR0_EL1
4. TLB lookup: Match (VPN, ASID)
   - If found → TLB HIT (use cached translation)
   - If not found → TLB MISS (page walk or software refill)

Key property: TLB can hold translations from multiple processes simultaneously:

TLB Contents Example:
Entry 0: VPN=0x1000, ASID=0x05, PFN=0x9000 (Process A)
Entry 1: VPN=0x1000, ASID=0x12, PFN=0x7000 (Process B) ← Different PFN!
Entry 2: VPN=0x2000, ASID=0x05, PFN=0x8000 (Process A)
Entry 3: VPN=0x3000, ASID=0x12, PFN=0xA000 (Process B)

Notice entries 0 and 1 have the same VPN but different ASIDs—they're different virtual pages in different processes, mapped to different physical frames.

Context Switch with ASID

Optimized Behavior:
1. Process A running (ASID=0x05)
   - TLB contains A's translations
   
2. Context switch to Process B (ASID=0x12)
   - Update TTBR0_EL1 with B's page table base and ASID
   - NO TLB FLUSH
   
3. Process B starts running
   - TLB lookups use VPN + ASID=0x12
   - B's entries (if still in TLB) hit immediately
   - A's entries remain but won't match (wrong ASID)
   - No warmup delay if B's entries still cached
   
4. Context switch back to A (ASID=0x05)
   - Update TTBR0_EL1 to A's page table + ASID
   - A's entries still in TLB → immediate hits!

Performance improvement: Eliminates TLB flush overhead, especially beneficial for workloads with frequent context switches between small set of processes.

ASID Allocation Strategies

Challenge: Modern systems run thousands of processes, but ASID space is limited (256 or 65,536 values). How should OS allocate ASIDs?

Strategy 1: Simple Allocation

asid_t next_asid = 1;  // 0 often reserved

asid_t allocate_asid() {
    if (next_asid == MAX_ASID) {
        // Rollover: flush entire TLB
        tlb_flush_all();
        next_asid = 1;
    }
    return next_asid++;
}

Pros: Simple, guarantees unique ASIDs for active processes Cons: Periodic TLB flushes when ASID space exhausted

Strategy 2: Lazy Allocation

Linux on ARM uses a more sophisticated scheme:

// Generation number tracks ASID epochs
uint64_t asid_generation = 1;

struct task_struct {
    uint64_t asid_generation;  // When ASID was allocated
    asid_t asid;               // Current ASID
};

void switch_to_task(task_t *task) {
    if (task->asid_generation != current_asid_generation) {
        // ASID is stale (from old generation)
        task->asid = allocate_new_asid();
        task->asid_generation = current_asid_generation;
    }
    
    write_ttbr0(task->pgd, task->asid);
}

When ASIDs are exhausted:

void asid_rollover() {
    // Increment generation
    asid_generation++;
    
    // All old ASIDs now stale
    // Active processes get new ASIDs on next switch
    
    // Flush TLB entries from old generation
    tlb_flush_all();
}

Pros: Only active processes consume ASIDs Cons: Still requires periodic flushes

Strategy 3: Reserved ASIDs

Reserve ASIDs for frequently-used processes:

// Reserve ASID 1 for init process
// Reserve ASID 2 for systemd
// etc.

Ensures critical processes never lose their TLB entries.

Figure 4.4: Address space identifiers across architectures. Left: ARM64 ASID (8/16-bit in TTBR0_EL1) tags TLB entries per process, enabling context switches without a full TLB flush. Centre: x86-64 PCID (12-bit in CR3[11:0]) with no-flush CR3 writes and INVPCID selective invalidation, used by KPTI. Right: VMID adds a second tagging level for virtualization — each VM gets a VMID, each process within a VM retains its ASID, enabling nested context switching without TLB flushes.

4.4.3 PCID: x86-64's Process Context Identifier

Intel and AMD x86-64 processors introduced Process Context Identifier (PCID) as an optional feature to mirror ARM's ASID functionality.

PCID Basics

Size: 12-bit (4096 distinct contexts)

Enabling: Set CR4.PCIDE = 1 to enable PCID support

Storage: Lower 12 bits of CR3 contain PCID when enabled:

CR3 with PCID:
Bits [63:12]: Page table physical base address
Bits [11:0]: PCID (process context identifier)

TLB Entry Format: Similar to ARM, each TLB entry tagged with PCID

Differences from ARM ASID

1. Larger Space: 12-bit (4096) vs ARM's 8-bit/16-bit (256/65536)

2. Voluntary Flush Control:

When writing CR3, bit 63 controls TLB flushing:

// Flush TLB for old PCID
write_cr3(new_pgd | new_pcid);  

// Keep old TLB entries (bit 63 = 1)
write_cr3(new_pgd | new_pcid | CR3_NOFLUSH);

This allows software to choose when to flush on context switches.

3. Global Pages Remain Global:

Even with PCID, pages marked Global (G-bit) are shared across all PCIDs. Kernel mappings remain global.

INVPCID Instruction

x86 provides explicit TLB invalidation control via the INVPCID instruction:

Invocation:

invpcid rax, [rbx]

Where: - rax: Type of invalidation (0-3) - [rbx]: Descriptor (PCID + linear address)

Invalidation Types:

Type 0: Individual Address

Invalidate single address in specific PCID
Descriptor: {PCID, LinearAddress}

Type 1: Single PCID

Invalidate all non-global entries for specific PCID
Descriptor: {PCID, ignored}

Type 2: All PCIDs, non-global

Invalidate all non-global entries in all PCIDs
Descriptor: ignored

Type 3: All PCIDs, including global

Invalidate ALL entries (global and non-global)
Descriptor: ignored

Example usage:

// Switch to new process
void switch_mm(mm_context_t *new_mm) {
    uint64_t new_cr3 = new_mm->pgd | new_mm->pcid | CR3_NOFLUSH;
    write_cr3(new_cr3);
    
    // If we reused a PCID, invalidate old entries selectively
    if (new_mm->pcid_needs_flush) {
        invpcid_single_context(new_mm->pcid);
        new_mm->pcid_needs_flush = false;
    }
}

4.4.4 VMID: Virtualization Machine Identifier

When running virtual machines, we have an additional layer: each VM is a separate address space. Without optimization, switching between VMs would require flushing the TLB, just like switching processes.

Virtual Machine Identifiers (VMID) extend the ASID/PCID concept to virtualization.

Intel VPID (Virtual Processor Identifier)

Size: 16-bit (65,536 distinct VMs)

Purpose: Tag TLB entries with VM identifier to avoid flushes on VM switch

Storage: VMCS (Virtual Machine Control Structure) contains VPID for each VM

TLB Entry Format:

TLB Entry with PCID + VPID (Intel VT-x Virtualization)
VPN	PCID 12 bits	VPID 16 bits	PFN	Flags
Virtual page	Guest process	Guest VM	Physical Frame	R / W / X / U

VPID (Virtual Processor Identifier) tags entries per guest VM; PCID tags per guest process. Together they allow host and all guest TLB entries to coexist without flushing on VM-exit or context switch.

Notice: Can have both PCID and VPID, allowing TLB to cache: - Multiple processes in multiple VMs simultaneously - Example: VM1's Process A, VM1's Process B, VM2's Process A, Host OS

Lookup:

Match (VPN, PCID, VPID) → TLB hit

ARM VMID

Size: 8-bit or 16-bit (implementation-defined)

Storage: VTTBR_EL2[63:48] contains VMID for Stage 2 translation

Combination with ASID:

ARM can tag entries with both ASID and VMID:

Entry: VPN, ASID (guest process), VMID (which VM), PFN

This enables caching: - Multiple guest processes in multiple VMs - Host processes - All without TLB flushes on switches

RISC-V VMID (Hypervisor Extension)

Size: Implementation-defined (typically 7-14 bits)

Storage: hgatp register bits [57:44] contain VMID

Integration: Works with ASID mechanism:

TLB Entry: VPN, ASID, VMID, PFN

Performance Impact of VMID

Without VMID:

VM Switch Cost:
1. Save VM1 state
2. Flush TLB (lose all VM1's translations)
3. Load VM2 state
4. VM2 starts cold (all TLB misses)
Cost: 2,000-5,000 cycles

With VMID:

VM Switch Cost:
1. Save VM1 state
2. Update VMID register (VM1 entries stay in TLB)
3. Load VM2 state
4. VM2 starts warm (cached entries still valid)
Cost: 500-1,000 cycles

Improvement: 2-5× faster VM switches, critical for cloud workloads with many VMs per host.

4.5 TLB Operations and Instructions

Processors provide explicit instructions to manage TLB state. Understanding these operations is essential for OS developers implementing memory management and for anyone debugging TLB-related issues.

Figure 4.5: TLB shootdown protocol on a 4-core system. Core 0 unmaps a page and broadcasts an IPI to all other cores (①②). Each core receives the interrupt, executes INVLPG (x86) or TLBI (ARM) to invalidate the specific TLB entry (③), then sends an ACK (④). Core 0 waits for all ACKs before freeing the physical page (⑤) — this ordering invariant prevents stale-translation memory corruption. The batching optimization amortizes IPI cost across N invalidations.

4.5.1 TLB Invalidation Instructions

x86-64: INVLPG and Family

INVLPG (Invalidate Page):

invlpg [address]

Invalidates the TLB entry (if present) for the page containing address on the current CPU.

Usage:

// After changing a PTE, invalidate the corresponding TLB entry
pte_t *pte = get_pte(addr);
pte->present = 0;
asm volatile("invlpg (%0)" :: "r"(addr) : "memory");

Scope: Local processor only (does not affect other CPUs)

INVPCID (Invalidate Process Context ID):

More flexible invalidation with four types (described in Section 4.4.3): - Type 0: Single address in specific PCID - Type 1: All addresses in specific PCID
- Type 2: All non-global addresses in all PCIDs - Type 3: All addresses including global

Example:

// Invalidate entire address space for PCID 5
uint64_t desc[2] = {5, 0};  // PCID=5, address ignored
asm volatile("invpcid (%0), %1" :: "r"(desc), "r"(1) : "memory");

ARM64: TLBI Instructions

ARM provides a rich set of TLB invalidation instructions:

Basic Invalidation:

TLBI VAE1IS, X0    ; Invalidate VA in EL1, inner shareable
TLBI ASIDE1IS, X0  ; Invalidate all entries for ASID, inner shareable
TLBI VMALLE1IS     ; Invalidate all EL1 entries, inner shareable

Naming Convention: - VA: Virtual Address - ASIDE: Address Space ID Entry - VMALLE: Virtual Memory All Entries - E1/E2: Exception Level (privilege level) - IS: Inner Shareable (affects other CPUs in cluster)

Invalidate specific address:

void tlb_invalidate_page(vaddr_t addr) {
    // Construct descriptor: VA[55:12] | ASID[63:48]
    uint64_t desc = (addr & ~0xFFF) | (current_asid << 48);
    asm volatile("tlbi vae1is, %0" :: "r"(desc));
    asm volatile("dsb ish");  // Data synchronization barrier
    asm volatile("isb");      // Instruction synchronization barrier
}

RISC-V: SFENCE.VMA

Instruction: SFENCE.VMA rs1, rs2

Semantics: - rs1: Virtual address (0 = all addresses) - rs2: ASID (0 = all ASIDs)

Examples:

# Invalidate specific address in current ASID
sfence.vma a0, zero

# Invalidate all addresses in specific ASID
sfence.vma zero, a1

# Invalidate entire TLB
sfence.vma zero, zero

Usage:

// After modifying page table
pte_t *pte = walk_page_table(addr);
pte->flags |= PTE_WRITE;

// Invalidate TLB entry
asm volatile("sfence.vma %0, zero" :: "r"(addr));

SFENCE.VMA is expensive: It's a serializing instruction that may flush the entire TLB. OSes should batch invalidations when possible.

4.5.2 TLB Shootdown: Multicore Synchronization

Problem: TLB invalidation instructions are local to one CPU. If multiple CPUs are executing the same process, their TLBs may all cache a stale translation that needs invalidation.

TLB Shootdown is the process of invalidating TLB entries on remote CPUs.

The TLB Shootdown Protocol

Scenario: CPU 0 modifies a PTE that might be cached in other CPUs' TLBs

Protocol:

CPU 0 (initiator):
1. Change PTE in memory
2. Invalidate local TLB entry
3. Send IPI (Inter-Processor Interrupt) to all other CPUs
4. Wait for acknowledgment from all CPUs
5. Continue execution

CPU 1, 2, 3, ... (targets):
1. Receive IPI
2. Interrupt current execution
3. Execute IPI handler:
   - Invalidate TLB entry
   - Send acknowledgment back to CPU 0
4. Resume execution

Implementation Example (Linux-style)

void tlb_shootdown(vaddr_t addr) {
    cpumask_t *mask = mm_cpumask(current->mm);  // CPUs running this process
    
    // Invalidate local TLB first
    local_tlb_invalidate(addr);
    
    // Send IPI to remote CPUs
    smp_call_function_many(mask, 
                           remote_tlb_invalidate,  // Function to execute
                           (void*)addr,            // Argument
                           1);                     // Wait for completion
}

// Executed on remote CPUs via IPI
void remote_tlb_invalidate(void *addr_ptr) {
    vaddr_t addr = (vaddr_t)addr_ptr;
    local_tlb_invalidate(addr);
}

Performance Cost

IPI overhead: - Send IPI: ~100-200 cycles - Receive and handle IPI: ~200-500 cycles per CPU - Wait for acknowledgments: ~100-1000 cycles

Total for 4-CPU system: ~1,000-3,000 cycles

For 64-CPU system: ~10,000-40,000 cycles

This is why minimizing TLB shootdowns is critical for scalability.

4.5.3 Batching and Optimization

Batch Invalidation

Instead of shooting down individual pages:

// BAD: Many TLB shootdowns
for (int i = 0; i < 1000; i++) {
    pte_t *pte = &page_table[i];
    pte->flags |= PTE_WRITE;
    tlb_shootdown(i * PAGE_SIZE);  // IPI storm!
}

Better: Batch them:

// GOOD: Single batched shootdown
for (int i = 0; i < 1000; i++) {
    pte_t *pte = &page_table[i];
    pte->flags |= PTE_WRITE;
    // Don't shootdown yet
}
tlb_shootdown_range(start, end);  // One IPI for entire range

Range-Based Invalidation

Modern processors support invalidating ranges: - x86 INVPCID type 2: Invalidate all non-global entries - ARM TLBI VALE1IS with range: Invalidate range of addresses - RISC-V: Multiple SFENCE.VMA invocations (can be optimized by hardware)

Lazy Shootdown

For rarely-used pages, defer shootdown until necessary:

struct mm_struct {
    cpumask_t tlb_flush_pending;  // CPUs needing flush
};

// Mark flush pending, don't IPI immediately
void lazy_shootdown(addr) {
    mm->tlb_flush_pending = mm->cpumask;
    // Actual flush happens on next context switch to this mm
}

Tradeoff: Potential for stale translations vs. IPI overhead

4.6 Large Pages and TLB Coverage

One of the most powerful TLB optimizations is using large pages (also called huge pages or superpages). By mapping larger contiguous regions with a single TLB entry, large pages dramatically improve TLB coverage—the amount of memory accessible through the TLB.

4.6.1 The TLB Coverage Problem

TLB Coverage is the total amount of memory that can be addressed by all TLB entries:

Figure 4.6: Large pages and TLB coverage: 4 KB pages give only 2 MB reach with a 512-entry TLB; 2 MB pages give 1 GB; 1 GB pages give 512 GB. The 2 MB page size offers the best coverage-to-fragmentation trade-off for AI/ML workloads with terabyte-scale working sets.

TLB Coverage = (Number of TLB entries) × (Page size)

For a typical L1 DTLB with 64 entries and 4KB pages:

Coverage = 64 entries × 4 KB = 256 KB

Problem: Modern applications routinely use gigabytes of memory. With 256 KB coverage: - Accessing 1 GB of memory requires TLB entries for 262,144 pages - With only 64 entries, we can cache 0.024% of needed translations - TLB miss rate approaches 99.9%+ for large working sets

Real-world example: A database scanning a 10 GB table with 4KB pages:

Pages needed: 10 GB / 4 KB = 2,621,440 pages
L1 TLB entries: 64
L2 TLB entries: 1,536
Total TLB coverage: 1,600 entries × 4 KB = 6.4 MB (0.064% of dataset)
Result: Catastrophic TLB miss rate, performance 10-50× slower

4.6.2 Large Page Sizes

Modern processors support multiple page sizes:

Architecture	Page Sizes
x86-64	4 KB, 2 MB, 1 GB (4-level), 512 GB (5-level)
ARM64	4 KB, 2 MB, 1 GB (with 4KB granule)
	16 KB, 32 MB (with 16KB granule)
	64 KB, 512 MB (with 64KB granule)
RISC-V	4 KB, 2 MB, 1 GB (Sv39), 512 GB (Sv48)

4.6.3 TLB Coverage Improvement with Large Pages

Using 2 MB pages:

L1 DTLB coverage: 64 entries × 2 MB = 128 MB (512× improvement)
L2 STLB coverage: 1,536 entries × 2 MB = 3 GB (512× improvement)

Using 1 GB pages:

L1 DTLB coverage: 4 entries × 1 GB = 4 GB (16,384× improvement)

Revisiting the 10 GB database scan with 2 MB pages:

Pages needed: 10 GB / 2 MB = 5,120 pages
TLB coverage: 1,600 entries × 2 MB = 3.2 GB (32% of dataset)
Result: Dramatically improved TLB hit rate
Performance improvement: 2-10× for memory-intensive operations

4.6.4 Hardware Support for Large Pages

x86-64 Large Page Implementation

2 MB Pages (Page Directory Entry mapping):

In the page table hierarchy, a PDE (Page Directory Entry) normally points to a page table. With the PS (Page Size) bit set, it instead maps a 2 MB page directly:

Normal 4KB Mapping:
PML4E → PDPTE → PDE → PTE → 4 KB page

2 MB Large Page:
PML4E → PDPTE → PDE (PS=1) → 2 MB page
                  ↑
                  Terminates walk here

PDE format for 2 MB page:

Bits [51:21]: Physical address (2 MB aligned)
Bit [7] (PS): 1 (indicates 2 MB page)
Bits [20:13]: Available for OS use
Other bits: R/W, U/S, PWT, PCD, A, D, G, PAT, NX

1 GB Pages (Page Directory Pointer Table Entry mapping):

Similarly, a PDPTE with PS=1 maps a 1 GB page:

Normal Path:
PML4E → PDPTE → PDE → PTE → 4 KB page

1 GB Large Page:
PML4E → PDPTE (PS=1) → 1 GB page
          ↑
          Terminates here

Intel TLB organization for large pages:

Intel processors typically have separate TLB entries or sections for different page sizes:

L1 DTLB (Ice Lake):
  - 64 entries for 4KB pages
  - 32 entries for 2MB/4MB pages (shared pool)
  - 4 entries for 1GB pages

L2 STLB:
  - 1536 entries for 4KB/2MB pages (unified)
  - 16 entries for 1GB pages (separate)

Notice: Unified vs. separate pools is implementation-dependent.

ARM64 Large Page Implementation

ARM64 calls large pages "blocks" and supports them at different table levels:

4KB Granule: - Level 3: 4 KB pages (leaf entries) - Level 2: 2 MB blocks - Level 1: 1 GB blocks

16KB Granule: - Level 3: 16 KB pages - Level 2: 32 MB blocks

64KB Granule: - Level 3: 64 KB pages - Level 2: 512 MB blocks

Descriptor Type Encoding:

ARM uses the descriptor type field to distinguish blocks from tables:

Bits [1:0]:
  00 = Invalid
  01 = Block entry (Level 1 or 2)
  11 = Table entry (points to next level) or Page entry (Level 3)

ARM Cortex-A77 TLB:

L1 Instruction TLB:
  - 48 entries, fully associative
  - Supports all page sizes

L1 Data TLB:
  - 48 entries, fully associative
  - Supports all page sizes

L2 TLB:
  - 1024 entries, 4-way set-associative
  - Unified for instructions and data
  - Supports all page sizes

RISC-V Large Page Implementation

RISC-V uses an elegant encoding: any PTE with R/W/X bits (leaf entry) can be a large page at its level.

Sv39 (3-level):

Level 2: VPN[2] → 1 GB page (if R/W/X set)
Level 1: VPN[1] → 2 MB page (if R/W/X set)
Level 0: VPN[0] → 4 KB page (if R/W/X set)

PTE encoding:

R W X | Meaning
0 0 0 | Pointer to next level (not a leaf)
0 0 1 | Executable page (leaf)
0 1 0 | Invalid
0 1 1 | Executable + writable page (leaf)
1 0 0 | Read-only page (leaf)
1 0 1 | Read-execute page (leaf)
1 1 0 | Read-write page (leaf)
1 1 1 | Read-write-execute page (leaf)

Any non-zero R/W/X combination makes the PTE a leaf, determining page size by the level.

4.6.5 Operating System Support: Transparent Huge Pages

Challenge: Applications must explicitly request large pages via mmap() with MAP_HUGETLB flag or use hugetlbfs. This requires code changes and isn't transparent.

Solution: Transparent Huge Pages (THP) automatically promote 4KB pages to 2MB pages when possible.

THP Algorithm (Linux)

Promotion (4KB → 2MB):

Conditions for promotion:
1. Process allocates 512 contiguous 4KB pages (= 2MB)
2. Physical memory has contiguous 2MB region available
3. No holes or conflicting mappings
4. Region is sufficiently aligned

Process:
1. Allocate 2MB physically-contiguous region
2. Copy data from 512 small pages to new region
3. Update page table: replace 512 PTEs with single 2MB PDE
4. Free old 4KB pages
5. Invalidate TLB (single entry now covers 2MB)

Demotion (2MB → 4KB):

When memory pressure increases or pages need to be swapped individually:

1. Allocate 512 separate 4KB pages
2. Copy data from 2MB page
3. Update page table: replace 2MB PDE with 512 PTEs
4. Free 2MB page
5. TLB invalidation (now need 512 entries)

THP Performance Trade-offs

Benefits: - Transparent to applications (no code changes) - Dramatic TLB coverage improvement (512×) - Reduced page table memory overhead - Better memory bandwidth utilization

Costs: - Compaction overhead: Creating 2MB contiguous regions requires moving pages - Memory bloat: If only 4KB of a 2MB page is used, 2044KB is wasted - Page fault latency: Allocating 2MB takes longer than 4KB - TLB shootdown cost: Splitting 2MB pages affects all CPUs

When THP helps: - Large sequential access patterns (databases, scientific computing) - Streaming workloads (video processing, ML training) - Memory-mapped files (large datasets)

When THP hurts: - Random access with poor spatial locality - Small allocations (memory bloat) - Frequent mmap/munmap (allocation overhead)

4.6.6 Huge Page Performance Measurements

Benchmark: STREAM Memory Bandwidth Test

Measures sustained memory bandwidth with large arrays:

4KB Pages:
  Copy:   120 GB/s
  Scale:  115 GB/s
  Add:    125 GB/s
  Triad:  130 GB/s
  
2MB Pages (Huge Pages):
  Copy:   145 GB/s (+21%)
  Scale:  140 GB/s (+22%)
  Add:    155 GB/s (+24%)
  Triad:  160 GB/s (+23%)

Why the improvement? - Reduced TLB misses (99% → 20% miss rate) - Fewer page table walks - Better memory-level parallelism

Benchmark: Database Index Scan

10 GB B-tree index scan with random accesses:

4KB Pages:
  Queries/sec: 12,500
  Avg latency: 80 ms
  TLB miss rate: 45%

2MB Pages:
  Queries/sec: 34,000 (+172%)
  Avg latency: 29 ms (-64%)
  TLB miss rate: 8%

Improvement analysis: - TLB miss reduction: 45% → 8% (80% reduction) - Each miss costs ~100 cycles - Savings: 37% of accesses × 100 cycles = 37 cycles/access saved - At 10^9 accesses: 37 billion cycles saved = ~9 seconds @ 4GHz

4.6.7 1GB Page Support

For very large memory footprints (100GB+ datasets), even 2MB pages may not provide sufficient TLB coverage. 1GB pages offer an additional 512× improvement.

Requirements for 1GB pages: - Processor support (most modern x86-64, ARM64) - Physically contiguous 1GB regions (challenging!) - Boot-time reservation or kernel-level allocation

Usage via hugetlbfs:

# Reserve 8GB of 1GB pages at boot
echo 8 > /proc/sys/vm/nr_overcommit_hugepages

# Create hugetlbfs mount
mkdir /mnt/huge_1GB
mount -t hugetlbfs -o pagesize=1G none /mnt/huge_1GB

# Application uses mmap()
fd = open("/mnt/huge_1GB/myfile", O_CREAT | O_RDWR);
ptr = mmap(NULL, 8GB, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

Real-world use case: Database buffer pool:

Database with 128 GB buffer pool:

4KB pages:
  - Pages needed: 33,554,432
  - TLB entries: 1,600 (0.005% coverage)
  - TLB miss rate: ~99%
  - Performance: 50 MB/s scan rate

2MB pages:
  - Pages needed: 65,536  
  - TLB entries: 1,600 (2.4% coverage)
  - TLB miss rate: ~60%
  - Performance: 850 MB/s (+17×)

1GB pages:
  - Pages needed: 128
  - TLB entries: 16 (12.5% coverage)
  - TLB miss rate: ~15%
  - Performance: 2,800 MB/s (+56×)

4.7 Multicore TLB Coherency

Modern processors have multiple cores, each with its own private TLB. When one core modifies a page table entry, other cores' TLBs may contain stale translations. Maintaining TLB coherency across cores is essential for correctness.

4.7.1 The TLB Coherency Problem

Scenario: Two cores running the same process

Initial state:
  - Virtual page 0x1000 maps to physical frame 0x9000
  - Both CPU0's TLB and CPU1's TLB cache this translation

CPU0:
  1. Changes PTE: 0x1000 now maps to 0xA000 (copy-on-write)
  2. Invalidates its own TLB entry
  3. Continues execution with correct translation

CPU1:
  1. Still has old translation (0x1000 → 0x9000) in TLB
  2. Accesses virtual address 0x1000
  3. TLB HIT with stale entry!
  4. Reads from wrong physical page (0x9000 instead of 0xA000)
  5. CORRECTNESS VIOLATION

Without TLB coherency, this leads to: - Reading stale data - Lost writes (writing to old copy) - Security violations (accessing freed pages)

4.7.2 TLB Shootdown Protocol (Revisited)

We introduced TLB shootdown in Section 4.5.2. Here we examine it in more detail with focus on correctness and optimization.

Strict TLB Shootdown

Algorithm:

void change_pte_strict(pte_t *pte, pte_t new_value) {
    cpumask_t *cpus = get_active_cpus();
    
    // Step 1: Prevent new accesses with old translation
    smp_mb();  // Memory barrier
    
    // Step 2: Modify PTE
    WRITE_ONCE(*pte, new_value);
    
    // Step 3: Local TLB invalidation
    local_tlb_flush(pte_address);
    
    // Step 4: Send IPI to all other CPUs
    smp_call_function_many(cpus, remote_tlb_flush, pte_address, 1);
    
    // Step 5: Wait for all CPUs to acknowledge
    // (handled by smp_call_function_many's wait flag)
}

Guarantees: - After return, NO CPU has stale TLB entry - Safe to free old physical page - Safe to reuse old physical page for different purpose

Cost: Synchronous IPIs to all CPUs (expensive!)

Deferred TLB Shootdown

Optimization: Defer TLB flushes until strictly necessary

struct mm_struct {
    cpumask_t tlb_flush_pending;  // CPUs with pending flushes
};

void change_pte_deferred(pte_t *pte, pte_t new_value) {
    // Step 1: Modify PTE
    WRITE_ONCE(*pte, new_value);
    
    // Step 2: Local invalidation
    local_tlb_flush(pte_address);
    
    // Step 3: Mark other CPUs as needing flush
    mm->tlb_flush_pending |= get_active_cpus();
    
    // No IPI sent yet!
}

// Flush happens lazily:
void context_switch_to_mm(mm_struct *mm) {
    if (cpumask_test_cpu(cpu, &mm->tlb_flush_pending)) {
        // Flush entire TLB for this mm
        local_tlb_flush_all();
        cpumask_clear_cpu(cpu, &mm->tlb_flush_pending);
    }
    load_mm(mm);
}

Benefits: - No immediate IPI cost - Flush batched with context switch (amortized cost)

Restrictions: - Cannot free page immediately (some CPUs may access it) - Requires tracking which pages are "in flight"

4.7.3 Hardware TLB Coherency Support

Some architectures provide hardware assistance for TLB coherency:

ARM64 Shareability Domains

ARM defines memory shareability domains that affect TLB coherency:

Inner Shareable (IS):

TLBI VALE1IS, X0  ; Invalidate for all CPUs in inner shareable domain

Hardware automatically broadcasts TLB invalidation to all cores in the cluster. No software IPI needed!

Outer Shareable (OS):

TLBI VALE1OS, X0  ; Invalidate across multiple clusters

Non-Shareable (NS):

TLBI VALE1, X0    ; Local CPU only

Benefits: - Reduced software overhead (no IPI handling) - Lower latency (hardware paths faster than interrupts) - Simpler software

x86 INVLPG and Broadcast

On some x86 implementations, certain TLB operations are broadcast automatically within a coherency domain, though this is not architecturally guaranteed. Software still must use IPIs for correctness.

4.7.4 TLB Coherency and Memory Ordering

TLB coherency interacts with memory ordering and barriers:

Problem:

CPU 0:                          CPU 1:
pte->flags |= PTE_WRITE;        data = *ptr;
invalidate_tlb(ptr);            // Which flags does CPU1's TLB see?

Without proper barriers, CPU1 might: - Read old PTE from TLB (before write visible) - Read new PTE but with stale flags cached

Solution: Memory Barriers

CPU 0:
WRITE_ONCE(pte->flags, new_flags);  // Store-release semantics
smp_wmb();                          // Write memory barrier
invalidate_tlb_globally(ptr);       // Ensures visibility

CPU 1:
// TLB miss forces PTE reload
pte = READ_ONCE(*pte_ptr);          // Load-acquire semantics
smp_rmb();                          // Read memory barrier

x86 ordering: x86 has strong ordering (TSO: Total Store Order), so explicit barriers are rarely needed

ARM ordering: ARM has weak ordering, requiring explicit barriers:

DSB ISH   ; Data Synchronization Barrier, Inner Shareable
ISB       ; Instruction Synchronization Barrier

4.7.5 Batched TLB Shootdown Optimization

Problem: Many page table modifications in a short time (e.g., munmap() of large region)

Bad approach:

for (int i = 0; i < 10000; i++) {
    pte_t *pte = &page_table[i];
    pte->present = 0;
    tlb_shootdown(pte);  // IPI for EACH page!
}
// Total: 10,000 IPIs, 5-10 seconds on 64-core system

Good approach: Batch shootdown

struct tlb_batch {
    vaddr_t addresses[256];
    int count;
    cpumask_t cpus;
};

void batched_shootdown(vaddr_t start, size_t len) {
    struct tlb_batch batch;
    batch.count = 0;
    batch.cpus = get_active_cpus();
    
    for (addr = start; addr < start + len; addr += PAGE_SIZE) {
        pte_t *pte = get_pte(addr);
        pte->present = 0;
        
        batch.addresses[batch.count++] = addr;
        
        if (batch.count >= 256) {
            flush_batch(&batch);
            batch.count = 0;
        }
    }
    
    if (batch.count > 0) {
        flush_batch(&batch);
    }
}

void flush_batch(struct tlb_batch *batch) {
    // Local flush
    for (int i = 0; i < batch->count; i++) {
        local_tlb_flush(batch->addresses[i]);
    }
    
    // Single IPI for entire batch
    smp_call_function_many(batch->cpus, remote_flush_batch, batch, 1);
}

Performance: - Before: 10,000 IPIs = 40,000-80,000 cycles - After: 40 IPIs = 1,600-3,200 cycles (25× faster)

4.7.6 TLB Coherency in NUMA Systems

On NUMA (Non-Uniform Memory Access) systems, TLB coherency becomes more complex:

Challenges: - CPUs on different NUMA nodes have higher IPI latency - Page tables may reside on remote NUMA nodes - Memory barriers have different costs across nodes

Optimization: Lazy cross-node shootdown

void shootdown_numa_aware(pte_t *pte, pte_t new_val) {
    cpumask_t *local_cpus, *remote_cpus;
    
    partition_cpus_by_numa(&local_cpus, &remote_cpus);
    
    // Immediate shootdown on local node (low latency)
    smp_call_function_many(local_cpus, tlb_flush, pte, 1);
    
    // Deferred shootdown on remote nodes
    mark_flush_pending(remote_cpus, pte);
}

4.8 Platform-Specific TLB Implementations

Now we examine real TLB implementations across major processor families, understanding the specific design choices, capabilities, and performance characteristics of modern hardware.

4.8.1 Intel x86-64 TLB Architecture

Intel Ice Lake (10th Gen Core, 2019)

L1 Data TLB (DTLB):

4KB pages:
  - 64 entries, 4-way set-associative
  - 1-cycle hit latency
  
2MB/4MB pages:
  - 32 entries, 4-way set-associative
  - 1-cycle hit latency
  
1GB pages:
  - 4 entries, fully associative
  - 1-cycle hit latency

L1 Instruction TLB (ITLB):

4KB pages:
  - 128 entries, fully associative
  - 1-cycle hit latency
  
2MB/4MB pages:
  - 8 entries, fully associative
  - 1-cycle hit latency

L2 Shared TLB (STLB):

4KB/2MB pages:
  - 1536 entries, 12-way set-associative
  - 5-cycle hit latency
  
1GB pages:
  - 16 entries, 4-way set-associative
  - 5-cycle hit latency

Page Walk Cache:

Intel adds caches for intermediate page table entries:

PML4 cache: 16 entries (caches PML4 entries)
PDPT cache: 32 entries (caches PDPT entries)
PD cache: 256 entries (caches PD entries)

Performance characteristics:

L1 DTLB hit: 1 cycle (0.25 ns @ 4 GHz)
L1 DTLB miss, L2 STLB hit: 5 cycles (1.25 ns)
L2 STLB miss, PWC hit: 10-20 cycles (2.5-5 ns)
Full page walk (L3 hit): 40-60 cycles (10-15 ns)
Full page walk (DRAM): 100-200 cycles (25-50 ns)

TLB Miss Rate Observations (SPEC CPU2017):

Workload          L1 DTLB      L2 STLB      PWC
               Miss Rate    Miss Rate    Hit Rate
mcf (pointer)     12.3%        8.2%        45%
lbm (streaming)    3.8%        2.1%        75%
gcc (code)         1.2%        0.3%        85%
average            2.5%        1.1%        72%

Intel Sapphire Rapids (4th Gen Xeon, 2023)

Improvements over Ice Lake: - Larger L2 STLB: 2048 entries (vs 1536) - Enhanced PWC: 512 PD entries (vs 256) - Support for 5-level paging (57-bit virtual addresses) - Improved TLB replacement algorithms

4.8.2 AMD Zen 4 TLB Architecture (2022)

L1 Data TLB:

All page sizes (4KB/2MB/1GB):
  - 96 entries, fully associative
  - 1-cycle hit latency

L1 Instruction TLB:

All page sizes:
  - 96 entries, fully associative
  - 1-cycle hit latency

L2 Unified TLB:

4KB pages:
  - 3072 entries, 12-way set-associative
  - ~7 cycle hit latency
  
2MB pages:
  - 3072 entries shared with 4KB
  - ~7 cycle hit latency
  
1GB pages:
  - 96 entries
  - ~7 cycle hit latency

Key AMD innovations: - Unified L2 TLB: No separation between instruction and data - Very large L2: 3072 entries provide excellent coverage - Fully-associative L1: Eliminates conflict misses

Performance comparison (normalized to Intel):

Metric                 Intel Ice Lake    AMD Zen 4
L1 TLB capacity        192 entries       192 entries
L2 TLB capacity        1552 entries      3168 entries
L2 TLB hit rate        ~95%              ~97-98%
Memory-intensive perf  1.0× baseline     1.15-1.25×

4.8.3 ARM Cortex-A78 TLB Architecture (2020)

L1 Instruction TLB:

Fully associative:
  - 48 entries
  - Supports 4KB, 16KB, 64KB, 2MB, 32MB, 1GB pages
  - ~1 cycle hit latency

L1 Data TLB:

Fully associative:
  - 48 entries per cluster
  - Supports all page sizes
  - ~1 cycle hit latency

L2 Unified TLB:

4-way set-associative:
  - 1280 entries
  - Unified for instructions and data
  - ~6 cycle hit latency
  - Supports all page sizes

ARM-specific features:

ASID Support: - 16-bit ASID in Cortex-A78 (65536 address spaces) - No TLB flush on context switch - ASID in every TLB entry

VMID Support: - 16-bit VMID for virtualization - Each TLB entry tagged with (ASID, VMID) - Supports many VMs without TLB pollution

Inner/Outer Shareability: - TLB invalidation can target specific shareability domains - Reduces unnecessary TLB flushes

Performance characteristics:

L1 TLB hit: 1 cycle
L1 miss, L2 hit: 6 cycles
L2 miss, page walk: 50-150 cycles (no PWC in Cortex-A78)

4.8.4 RISC-V Implementation: SiFive U74 (2020)

TLB Configuration:

Instruction TLB:
  - 32 entries, fully associative
  - Supports 4KB, 2MB, 1GB pages (Sv39)
  
Data TLB:
  - 40 entries, fully associative
  - Supports 4KB, 2MB, 1GB pages

Software-Managed: - TLB miss generates exception - Software walks page tables - Software inserts TLB entry via CSR writes

ASID Support: - 16-bit ASID in satp register (Sv39 mode) - Software can use sfence.vma with ASID parameter

Performance:

TLB hit: 1 cycle
TLB miss (software handling): 50-200 cycles depending on handler

Trade-offs vs Hardware-Managed:

Advantages: - Simpler hardware (smaller chip area, lower power) - Flexible page table formats - Easy to add custom TLB management features

Disadvantages: - Higher TLB miss latency (software path) - No page walk cache (no hardware walker) - Handler complexity

SiFive U74 TLB Miss Handler (typical):

Optimized assembly code that walks Sv39 page tables takes ~50 instructions (best case). With instruction cache hits and no data cache misses: ~50 cycles. Worst case with cache misses: 200+ cycles.

4.8.5 Apple M-Series TLB Architecture

Apple doesn't publish detailed microarchitectural specifications, but reverse engineering and performance analysis reveal:

Apple M1 (2020):

Firestorm (Performance) Core:

Estimated L1 TLB:
  - ~96-128 entries (instruction + data combined guess)
  - Fully associative
  - Supports 4KB, 16KB, 2MB pages
  
Estimated L2 TLB:
  - ~2000+ entries
  - High associativity

Icestorm (Efficiency) Core:

Estimated L1 TLB:
  - ~48-64 entries
  - Fully associative

Observations from benchmarks: - Exceptional TLB performance compared to competitors - Very low TLB miss rates even with large working sets - Likely very large L2 TLB or advanced prefetching

Apple M3 (2023): - Estimated further TLB improvements - Better handling of mixed page sizes - Enhanced coherency for heterogeneous cores

4.9 Alternative Page Table Structures and TLB Implications

While hierarchical multi-level page tables (covered in Chapter 3) are the dominant approach in modern processors, alternative page table structures exist and influence TLB design in important ways. Understanding these alternatives provides insight into the design space and trade-offs in virtual memory systems.

4.9.1 Inverted Page Tables

Concept: Instead of one page table per process mapping virtual pages to physical frames, an inverted page table (IPT) has one entry per physical frame mapping back to virtual pages.

Structure and Operation

Traditional (forward) page table:

Virtual Page → [Page Table Entry] → Physical Frame
One table per process
Size grows with virtual address space

Inverted page table:

Physical Frame → [IPT Entry] → (Process ID, Virtual Page)
One global table for entire system
Size determined by physical memory

IPT Entry Format:

Hashed TLB Entry Format (RISC-V Large ASID proposal)
PID / ASID 16 bits	VPN 52 bits	Chain pointer	Flags R/W/X/…
Address space	Virtual page tag	→ next entry (collision chain)	Permission bits

Hashed TLBs use the VPN as a hash key; the chain pointer resolves collisions. Eliminates CAM hardware cost at large entry counts but adds potential multi-cycle lookup on collision.

Lookup Process with Hash Table

Since we can't index directly by virtual address, IPTs use a hash table:

1. Hash (PID, VPN) → Hash bucket index
2. Read IPT entry at that index
3. Check if (IPT.PID, IPT.VPN) matches (Current_PID, Requested_VPN)
   - Match → Physical frame number = entry index
   - No match → Follow chain pointer to next collision entry
4. If chain ends without match → Page fault

Example: Process 42 accesses virtual page 0x1234:

Hash(42, 0x1234) = 0x9876
Read IPT[0x9876]:
  PID = 42
  VPN = 0x1234
  Match! → Physical frame = 0x9876

Advantages of Inverted Page Tables

1. Fixed Memory Overhead:

Traditional: Memory = (Number of processes) × (Virtual AS size / Page size) × (Entry size)
  Example: 1000 processes × (2^48 / 2^12) × 8 bytes = 2 PB (impossible!)

Inverted: Memory = (Physical memory size / Page size) × (Entry size)
  Example: 256 GB / 4 KB × 16 bytes = 1 GB (manageable)

2. No Per-Process Tables: Simplifies context switching and memory management.

3. Natural Support for Shared Memory: Multiple (PID, VPN) pairs can map to same physical frame.

Disadvantages of Inverted Page Tables

1. Slow Lookup:

Hash collisions → Must traverse chain
Worst case: O(n) lookups for n collisions
Traditional hierarchical: O(1) after TLB miss, deterministic memory accesses

2. Complex TLB Refill:

On a TLB miss, software must:

tlb_refill_ipt(pid_t pid, vpn_t vpn) {
    uint64_t hash = hash_function(pid, vpn);
    uint64_t index = hash % ipt_size;
    
    while (ipt[index].valid) {
        if (ipt[index].pid == pid && ipt[index].vpn == vpn) {
            // Found it!
            tlb_insert(vpn, index, ipt[index].flags);
            return;
        }
        
        // Collision - follow chain
        if (ipt[index].chain == NULL)
            break;  // End of chain
        index = ipt[index].chain;
    }
    
    // Not found - page fault
    page_fault(pid, vpn);
}

3. Poor Locality: Hash function scatters consecutive virtual pages across IPT, destroying spatial locality.

4. No Natural Large Page Support: Difficult to represent 2MB/1GB pages efficiently.

Historical Usage: PowerPC and IA-64

IBM PowerPC: Used inverted page tables extensively in RS/6000 and early POWER systems.

HP/Intel IA-64 (Itanium): Supported both traditional and inverted page tables:

Virtual Hash Page Table (VHPT):
  - Hardware-walked hash table
  - Software defines hash function
  - TLB miss handler can use VHPT or traditional tables

Modern Status: Largely abandoned in favor of hierarchical tables due to: - Hardware page walkers prefer deterministic access patterns - Page walk caches optimize hierarchical lookups - Large pages benefit hierarchical structures

4.9.2 Shadow Page Tables for Virtualization

Context: Before hardware virtualization support (Intel EPT, AMD NPT), hypervisors used shadow page tables to virtualize memory.

The Problem Without Hardware Support

Guest OS maintains page tables: Guest Virtual Address (GVA) → Guest Physical Address (GPA)

But guest "physical" addresses are actually virtual from host perspective:

Guest Virtual (GVA) → Guest Physical (GPA) → Host Physical (HPA)

Without hardware support, TLB can only cache one translation level. Solution: Shadow page tables.

Shadow Page Table Construction

Hypervisor maintains shadow page tables that directly map GVA → HPA:

Guest Page Table:    GVA → GPA
Host Page Table:     GPA → HPA
Shadow Page Table:   GVA → HPA (composition of above)

Construction algorithm:

void build_shadow_pte(guest_pte_t *guest_pte, shadow_pte_t *shadow_pte) {
    // Guest PTE: GVA → GPA
    gpa_t gpa = guest_pte->pfn << 12;
    
    // Walk host page table: GPA → HPA
    hpa_t hpa = walk_host_page_table(gpa);
    
    // Build shadow PTE: GVA → HPA
    shadow_pte->pfn = hpa >> 12;
    shadow_pte->flags = guest_pte->flags;  // Preserve permissions
    shadow_pte->present = guest_pte->present && (hpa != INVALID);
}

Shadow Page Table Synchronization

Challenge: Guest OS modifies its page tables, shadow must stay synchronized.

Solution: Write-protect guest page tables:

1. Mark all guest page table pages as read-only in host page tables
2. Guest attempts to write page table entry
3. Page fault (write to read-only page)
4. Hypervisor intercepts fault (VM exit)
5. Hypervisor:
   a. Emulate the write to guest page table
   b. Update corresponding shadow page table entry
   c. Flush TLB entry if needed
6. Resume guest execution (VM entry)

Performance cost:

Every guest page table modification:
  - VM exit (~1000 cycles)
  - Shadow PT update
  - TLB flush
  - VM entry (~1000 cycles)
Total: ~2000+ cycles per PT modification

Example: Guest Linux fork():

Linux fork() creates new process:
  - Allocates new page tables
  - Copies hundreds of PTEs
  - Each copy triggers VM exit!
  
Without shadow PTs (naive): 1 fork = ~500 VM exits = 1,000,000 cycles
With shadow PTs: 1 fork = ~500 VM exits = 1,000,000 cycles
With hardware (EPT/NPT): 1 fork = 0 VM exits = normal overhead

This is why hardware-assisted virtualization (EPT/NPT) was so important!

Shadow Page Table TLB Implications

TLB caches GVA → HPA directly:

TLB Entry (with shadow PT):
  VPN = Guest Virtual Page
  PFN = Host Physical Frame (not GPA!)
  ASID/PCID = Guest's process context
  VMID = Which VM

On VM switch: - Must flush all TLB entries for old VM - Shadow page tables remain but TLB is cold - New VM starts with empty TLB

Memory overhead:

Shadow PT per guest:
  Same size as guest's page tables
  100s of MB for typical guest
  
If running 100 VMs:
  10-100 GB of shadow page tables!

4.9.3 Nested/Extended Page Tables (EPT/NPT) - Modern Hardware Approach

Hardware support eliminates shadow page tables:

Intel EPT (Extended Page Tables):

GVA → GPA → HPA
  ↓       ↓
Guest PT  EPT

Hardware walks both automatically

TLB entries cache final translation:

TLB Entry with EPT:
  Guest VPN → Host PFN
  Tagged with (PCID, VPID)

Key advantage: Guest can modify its page tables without VM exits!

Performance comparison:

Operation           Shadow PT        EPT/NPT
Guest PT write      2000+ cycles     ~10 cycles
TLB miss            Same as native   ~2× native (dual walk)
VM switch           Flush TLB        Keep TLB (VPID)
Memory overhead     100s MB          ~10 MB (EPT structure)

We'll explore EPT/NPT in detail in Section 4.11.

EPT worst-case memory access count. The "~2× native (dual walk)" shorthand above describes the cached average case with both the guest and EPT page-walk caches (PWCs) warm. In the true worst case with no caching, 4-level guest tables, and 4-level EPT, the hardware performs up to 25 memory accesses for one guest virtual address: five guest-table accesses (CR3 + 4 table levels), each requiring a 4-level EPT walk (4 accesses each = 20 EPT accesses), plus the final data read. Chapter 3 §3.6.2 provides the full derivation. In practice the combined PWC reduces this dramatically; measured average overhead for typical virtualised workloads is 2–7% (Intel SDM, Vol. 3C §28.2).

4.9.4 Other Page Table Variations

Clustered Page Tables

Concept: Group multiple consecutive page table entries into clusters.

Benefits: - Reduce page table memory overhead - Improve spatial locality - TLB can cache cluster info

Usage: Some embedded systems and research OSes.

Multi-Level TLBs with Different Organizations

Some processors use different page table structures at different privilege levels:

ARM64 Example:

User space (EL0): Uses TTBR0_EL1
  - Can use 4KB, 16KB, or 64KB granule
  
Kernel space (EL1): Uses TTBR1_EL1
  - Can use different granule than user space!
  
TLB must track which granule for each entry

4.10 Self-Modifying Code and Instruction/Data Coherency

When a program modifies its own code (writes instructions that will later be executed), maintaining coherency between instruction and data caches—and between ITLB and DTLB—becomes critical.

4.10.1 The Coherency Problem

Scenario: Program writes new instruction, then executes it:

// Self-modifying code example
void *code_page = mmap(..., PROT_READ|PROT_WRITE|PROT_EXEC, ...);

// Write new instruction
*(uint32_t*)code_page = 0x12345678;  // MOV instruction (example)

// Execute it
void (*func)() = (void(*)())code_page;
func();  // Execute newly written instruction

Problem: Multiple caches involved:

Write instruction (STORE):
  1. Hits in D-cache (data cache)
  2. Translates via DTLB
  3. Instruction now in D-cache, may not be in I-cache

Execute instruction (FETCH):
  1. Misses in I-cache (instruction cache)
  2. Translates via ITLB
  3. May fetch stale instruction from memory if D-cache not flushed

Three coherency issues:

I-cache vs D-cache: Modified instruction in D-cache not visible to I-cache
ITLB vs DTLB: Different translations cached?
Store buffer: Write may still be pending in store buffer

4.10.2 Architecture-Specific Handling

x86-64: Hardware Coherency (Mostly)

x86 guarantees: - Data cache is write-through or write-back with snooping - Self-modifying code detected via store-to-instruction checks - But: Explicit serialization still required for correctness

Required sequence:

; Write new instruction
mov [rdi], eax        ; Store new instruction

; Serialize execution
mfence                ; Memory fence (ensure stores visible)
; or
cpuid                 ; Serializing instruction
; or  
jmp short $+2        ; Jump serializes pipeline

; Now safe to execute new code
call rdi

x86 Self-Modifying Code (SMC) Detection:

Modern x86 processors detect when a store might modify code in the I-cache and automatically invalidate affected I-cache lines. However, this doesn't cover all cases:

If store and fetch use different linear addresses but same physical address:
  - Different virtual pages mapped to same physical page
  - Store via VPN1, execute via VPN2
  - SMC detection may miss this!
  
Solution: Must explicitly flush I-cache

x86 I-cache flush:

// No direct I-cache flush instruction in x86!
// Must use serializing instruction or pipeline flush

asm volatile("mfence; cpuid" ::: "eax", "ebx", "ecx", "edx");

ITLB/DTLB coherency:

x86 doesn't guarantee ITLB/DTLB coherency automatically. If same virtual page accessed via ITLB and DTLB:

; After changing page table entry affecting execute permissions
invlpg [address]      ; Invalidates BOTH ITLB and DTLB

ARM64: Explicit Coherency Required

ARM64 requires explicit cache maintenance:

; Write new instruction
str w0, [x1]          ; Store instruction

; Point of Unification (PoU) maintenance
dc cvau, x1           ; Clean data cache to PoU
dsb ish               ; Data Synchronization Barrier

ic ivau, x1           ; Invalidate instruction cache by VA
dsb ish               ; Ensure completion
isb                   ; Instruction Synchronization Barrier

; Now safe to execute
br x1

ARM cache maintenance instructions:

DC CVAU: Clean data cache to Point of Unification
IC IVAU: Invalidate instruction cache by VA
IC IALLU: Invalidate all instruction caches
DSB: Data Synchronization Barrier
ISB: Instruction Synchronization Barrier

Point of Unification (PoU):

The point in the cache hierarchy where instruction and data caches are guaranteed to see the same copy of memory:

Figure 4.7: Cache and TLB coherence points (ARM PoU/PoC): PoU is where L1 I$ and D$ converge; PoC (LLC) is visible to all system agents including DMA.

ITLB/DTLB coherency:

ARM provides separate TLB invalidation for instruction and data:

tlbi vae1is, x0       ; Invalidate TLB entry (both I and D)
dsb ish
isb

Expensive operation: ARM's explicit maintenance is slower than x86's hardware coherency:

x86 SMC: ~100-200 cycles (mostly automatic)
ARM SMC: ~500-1000 cycles (explicit cache maintenance)

RISC-V: Software Responsibility

RISC-V provides minimal hardware support:

# Write new instruction
sw a0, 0(a1)

# Fence instruction and data
fence.i              ; Instruction fence - invalidates I-cache

# Optionally flush data cache (implementation-dependent)
# No standard instruction!

# TLB flush if page permissions changed
sfence.vma zero, zero

# Execute new code
jalr a1

FENCE.I instruction:

Ensures all previous stores to instruction memory are visible to subsequent instruction fetches
May be expensive (flushes entire I-cache on some implementations)
Does NOT guarantee D-cache to I-cache coherency by itself!

RISC-V cache coherency ambiguity:

The RISC-V spec intentionally leaves cache coherency details to implementations:

Weakly-ordered implementations:
  - May require explicit D-cache flush (implementation-specific)
  - No standard way to flush D-cache in base spec
  
Strongly-ordered implementations:
  - FENCE.I may be sufficient

Privileged cache management (SiFive example):

// SiFive-specific cache flush
void flush_dcache_range(void *addr, size_t len) {
    uintptr_t start = (uintptr_t)addr;
    uintptr_t end = start + len;
    
    for (uintptr_t i = start; i < end; i += CACHE_LINE_SIZE) {
        asm volatile("cflush.d.l1 %0" :: "r"(i));
    }
}

4.10.3 JIT Compilers and Dynamic Code Generation

Just-In-Time (JIT) compilers generate code at runtime, making self-modifying code handling critical:

Java JVM, JavaScript V8, .NET CLR all need efficient SMC:

Best practices:

Separate pages for code generation:

// Allocate executable page
void *code = mmap(NULL, 4096, 
                  PROT_READ|PROT_WRITE,  // Initially writable
                  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

// Write code
generate_code(code, ...);

// Flush caches
flush_instruction_cache(code, 4096);

// Make executable (W^X: Write XOR Execute)
mprotect(code, 4096, PROT_READ|PROT_EXEC);

// Execute
((void(*)())code)();

Batch code generation: Generate multiple functions before flushing
Code patching: When patching existing code:

// x86 example: patch a call target
void patch_call(void *call_site, void *new_target) {
    uint8_t *insn = call_site;
    
    // Calculate relative offset
    int32_t offset = (uint8_t*)new_target - (insn + 5);
    
    // Atomic write (single instruction on x86)
    *(int32_t*)(insn + 1) = offset;
    
    // Flush I-cache
    __builtin___clear_cache(insn, insn + 5);
}

Performance impact:

Cache maintenance overhead per JIT compilation:
  x86-64: ~100-500 cycles
  ARM64:  ~1000-3000 cycles
  RISC-V: ~500-2000 cycles (implementation-dependent)
  
For a JIT compiling 1000 functions/sec:
  x86: 0.1-0.5M cycles/sec = negligible
  ARM: 1-3M cycles/sec = ~0.025-0.075% @ 4GHz

4.10.4 Write XOR Execute (W^X) Security

Modern systems enforce W^X policy: memory pages are either writable OR executable, never both simultaneously.

Implications for SMC:

// Old (insecure) way:
void *code = mmap(..., PROT_READ|PROT_WRITE|PROT_EXEC, ...);
write_code(code);
execute_code(code);  // Dangerous!

// New (W^X) way:
void *code = mmap(..., PROT_READ|PROT_WRITE, ...);
write_code(code);
flush_caches(code);
mprotect(code, ..., PROT_READ|PROT_EXEC);  // Remove write permission
execute_code(code);  // Safe

W^X enforcement requires TLB invalidation:

mprotect(addr, len, PROT_READ|PROT_EXEC) {
    // Update page table entries
    for (each page in range) {
        pte->writable = 0;
        pte->executable = 1;
    }
    
    // MUST invalidate TLB!
    tlb_flush_range(addr, len);
    // Without this, CPU might use cached writable permission
}

Apple M1 W^X hardware enforcement:

Apple Silicon enforces W^X at hardware level: - Attempting to set W+X bits in page table triggers fault - JIT must use dual-mapping technique

Modern processors support Simultaneous Multithreading (SMT), where multiple hardware threads share a single physical core. Understanding how TLBs are shared (or not shared) between threads is critical for performance.

4.11.1 SMT Architectures: Intel Hyperthreading and AMD SMT

Intel Hyperthreading (2-way SMT): - 2 logical processors per physical core - Share: L1/L2 caches, TLBs, execution units - Private: Architectural registers, instruction pointer

AMD SMT (2-way): - Similar to Intel - 2 threads per core

IBM POWER9 (8-way SMT): - 8 hardware threads per core! - More complex resource sharing

4.11.2 TLB Partitioning Strategies

Three approaches to TLB sharing:

1. Fully Shared TLB (Intel, AMD approach)

Both threads share entire TLB:

Physical Core:
  L1 ITLB: 128 entries (shared by Thread 0 and Thread 1)
  L1 DTLB: 64 entries (shared by Thread 0 and Thread 1)
  L2 STLB: 1536 entries (shared)

Advantages: - Maximum TLB capacity utilization - One thread can use all entries if other thread idle

Disadvantages: - TLB contention: Threads thrash each other's entries - Security: Thread 0 can observe Thread 1's TLB timing

Example conflict:

Thread 0: Accesses pages 0-127 (fills ITLB)
Thread 1: Accesses pages 200-327 (evicts Thread 0's entries)
Thread 0: Re-accesses pages 0-127 (TLB misses!)
Result: Thrashing, performance degradation

2. Statically Partitioned TLB

Divide TLB entries between threads:

L1 DTLB (64 entries):
  Thread 0: Entries 0-31 (32 entries)
  Thread 1: Entries 32-63 (32 entries)

Advantages: - No thrashing - Predictable performance - Better security (isolation)

Disadvantages: - Underutilization if one thread idle - Fixed partitioning may not match workload needs

Implementation: Tag entries with thread ID:

TLB Entry:
  VPN: 36 bits
  PFN: 40 bits
  Thread: 1 bit (0 or 1)
  ... other fields ...

Lookup:
  Match (VPN, Current_Thread_ID)

3. Dynamically Partitioned TLB

Allocate entries dynamically based on need:

Replacement policy includes thread ID:
  - Try to evict entry from same thread first
  - Allow borrowing if thread has many idle entries
  - Enforce min/max quotas per thread

Intel implementation (approximation):

Intel uses pseudo-LRU with thread-aware bias:

int select_victim_tlb_entry() {
    // Prefer evicting entry from current thread
    for (int i = 0; i < TLB_SIZE; i++) {
        if (tlb[i].thread_id == current_thread && 
            !tlb[i].recently_used) {
            return i;
        }
    }
    
    // Fall back to global LRU
    return lru_victim();
}

4.11.3 TLB Miss Handling with SMT

Hardware-managed TLBs (x86, ARM):

Multiple page walkers can operate simultaneously:

Intel Ice Lake:
  - 2 page walkers per core
  - Thread 0 TLB miss → Walker 0 starts page walk
  - Thread 1 TLB miss → Walker 1 starts page walk simultaneously
  - Both walks proceed in parallel

Software-managed TLBs (RISC-V):

TLB miss generates exception, runs handler:

Thread 0: TLB miss → Exception on Thread 0
Thread 1: TLB miss → Exception on Thread 1

Problem: Both threads can't handle TLB miss simultaneously!

Solution: Handler must handle concurrency

RISC-V SMT TLB handler design:

// Per-thread TLB miss handler state
struct tlb_miss_state {
    vaddr_t fault_addr;
    int in_progress;
    spinlock_t lock;
};

__thread struct tlb_miss_state tms;

void tlb_miss_handler(vaddr_t addr) {
    tms.fault_addr = addr;
    tms.in_progress = 1;
    
    // Walk page table (thread-safe, read-only operation)
    pte_t pte = walk_page_table(addr);
    
    // Insert into TLB (may need lock if TLB management not atomic)
    spin_lock(&tlb_lock);
    tlb_insert(addr, pte);
    spin_unlock(&tlb_lock);
    
    tms.in_progress = 0;
}

4.11.4 ASID/PCID with SMT

Challenge: How do address space identifiers work with SMT?

Approach 1: Shared ASID/PCID

Both threads run same process (common for single-threaded programs):

Thread 0: PCID = 5
Thread 1: PCID = 5
Both access same address space, share TLB entries naturally

Approach 2: Different ASIDs/PCIDs

Threads run different processes (uncommon):

Thread 0: PCID = 5 (Process A)
Thread 1: PCID = 17 (Process B)
TLB entries tagged with both Thread ID and PCID

TLB lookup with SMT:

Match (VPN, PCID, Thread_ID)

This prevents Thread 0 from hitting Thread 1's entries even if PCID matches.

4.11.5 Performance Impact of SMT on TLB

Measurement: SPECint2017 with SMT:

Configuration          IPC per thread    Total IPC    TLB miss rate
1 thread (no SMT)      2.5              2.5          2.1%
2 threads (SMT)        1.8              3.6          4.8%
                       ↑                 ↑            ↑
                       Lower per-thread  Higher total TLB thrashing

TLB contention increases miss rate: - Single thread: 2.1% DTLB miss rate - Two threads (SMT): 4.8% DTLB miss rate per thread (2.3× worse)

When SMT helps despite TLB contention: - Threads have different workloads (one CPU-bound, one memory-bound) - Hides memory latency (while one thread waits on cache miss, other runs) - Total throughput improves even if individual thread slower

When SMT hurts: - Both threads memory-intensive with large working sets - TLB thrashing dominates performance - Better to disable SMT and run threads on separate cores

4.12 Virtualization with Mismatched Page Sizes

When running virtual machines, the guest OS and hypervisor may use different page sizes at different translation stages. Understanding how TLBs handle this is crucial for virtualization performance.

4.12.1 Two-Stage Translation Recap

Without extended page tables:

Guest Virtual (GVA) → Guest Physical (GPA) → Host Physical (HPA)
        ↓                      ↓
    Guest PT              Shadow PT/EPT

With Intel EPT / AMD NPT:

Hardware walks both page tables:

GVA → GPA: Guest page table walk (4 levels for 4KB pages)
GPA → HPA: EPT/NPT walk (4 levels for 4KB pages)

Worst case: 4 × 4 = 16 memory accesses!
  - 4 accesses for guest walk
  - Each access's GPA must be translated via EPT (× 4 each)

4.12.2 Mismatched Page Sizes: The Problem

Scenario: Guest uses 4KB pages, host uses 2MB pages (or vice versa)

Case 1: Guest 4KB, Host 2MB (common):

Guest: Maps 4KB virtual page → 4KB physical page
EPT: Maps 2MB physical page → 2MB host physical

Guest page table:
  VA 0x0000 → GPA 0x10000 (4KB page)
  VA 0x1000 → GPA 0x11000 (4KB page)
  ...

EPT:
  GPA 0x00000-0x1FFFFF → HPA 0x80000000-0x801FFFFF (2MB page)

TLB entry must store both:

TLB Entry:
  Guest VPN: 0x0 (4KB granularity)
  Guest page size: 4KB
  Host PFN: 0x80000 (2MB granularity)
  Host page size: 2MB
  
Translation:
  GVA 0x0000 → HPA 0x80000000 (via 4KB+2MB)
  GVA 0x1000 → HPA 0x80001000 (via 4KB+2MB, same EPT entry!)

Problem: TLB entries at 4KB granularity, but each EPT TLB entry covers 2MB!

4.12.3 Intel EPT with Mixed Page Sizes

Intel's approach: Separate TLB structures

Guest TLB: Caches GVA → GPA translations
  - Uses guest page sizes (4KB, 2MB, 1GB)
  
EPT TLB: Caches GPA → HPA translations  
  - Uses EPT page sizes (4KB, 2MB, 1GB)
  
Combined TLB: Caches final GVA → HPA
  - Granularity is MIN(guest_page_size, ept_page_size)

Example combinations:

Guest 4KB + EPT 4KB:
  Final TLB entry: 4KB (straightforward)
  
Guest 4KB + EPT 2MB:
  Final TLB entry: 4KB
  - Single EPT TLB entry (2MB) covers 512 guest TLB entries (4KB each)
  - EPT TLB has high hit rate
  
Guest 2MB + EPT 4KB:
  Final TLB entry: 4KB (must track 4KB EPT chunks)
  - Single guest TLB entry (2MB) requires 512 EPT TLB entries!
  - EPT TLB thrashing likely
  - Poor performance!
  
Guest 2MB + EPT 2MB:
  Final TLB entry: 2MB (best case)
  - Both levels aligned
  - Optimal TLB utilization

Intel Ice Lake EPT TLB:

STLB:
  1536 entries for GVA → HPA (combined)
  
Plus separate EPT TLB (undocumented size)
  Estimated ~256-512 entries for GPA → HPA only

4.12.4 AMD NPT with Mixed Page Sizes

AMD approach: Similar to Intel

AMD Zen 3 NPT TLB:

L1 Guest TLB: 96 entries (GVA → GPA)
L1 Nested TLB: 96 entries (GPA → HPA)
L2 Combined TLB: 3072 entries (GVA → HPA final)

Page size handling:

If guest uses 2MB and nested uses 4KB:
  - Guest TLB entry covers 2MB of GVA → GPA
  - But GPA → HPA breaks down into 512 × 4KB nested mappings
  - L2 TLB must cache 512 separate entries!
  - Potential thrashing

4.12.5 ARM64 Stage 2 Translation

ARM terminology:

Stage 1: VA → IPA (Intermediate Physical Address)
Stage 2: IPA → PA (Physical Address)

ARMv8.4-A concatenated tables:

ARM allows different granules for Stage 1 and Stage 2:

Stage 1: 4KB granule (guest page tables)
Stage 2: 64KB granule (hypervisor)

Problem: Mismatched granularity!

ARM TLB handling:

TLB Entry (ARM with Stage 2):
  VA: Guest virtual address
  IPA: Intermediate physical address
  PA: Final physical address
  Stage 1 size: 4KB, 16KB, 64KB, 2MB, 32MB, 1GB
  Stage 2 size: 64KB, 512MB (for 64KB granule)
  Combined size: MIN(S1_size, S2_size)

Optimization: Concatenated page tables

ARMv8.4 allows concatenating up to 16 small page tables:

Instead of:
  Stage 1: 4KB granule → 4KB pages
  
Use:
  Stage 1: 4KB granule × 16 concatenation → 64KB effective
  
Matches 64KB Stage 2 granule → Better TLB efficiency

4.12.6 RISC-V Hypervisor Extension

RISC-V H extension: Defines two-stage translation

VS-stage: VA → GPA (Sv39, Sv48, Sv57)
G-stage: GPA → HPA (Sv39x4, Sv48x4, Sv57x4)

Note: G-stage uses same structure as VS-stage (same page sizes)

Advantage: No mismatch problem! Both stages use 4KB, 2MB, 1GB.

TLB structure:

RISC-V with H-extension TLB:
  - Separate VS-stage TLB (optional)
  - Separate G-stage TLB (optional)
  - Combined TLB (required)
  
Implementation-dependent which TLBs exist

4.12.7 Performance Recommendations

Best practices for optimal TLB performance with virtualization:

Align page sizes: Use same page size for guest and host

Guest: 2MB huge pages
Host/EPT: 2MB huge pages
→ Optimal TLB hit rate

Avoid mismatched large pages:

BAD:
  Guest: 2MB pages
  Host: 4KB pages
  → Each guest page requires 512 host TLB entries!
  
GOOD:
  Guest: 4KB pages
  Host: 2MB pages
  → Each host TLB entry covers 512 guest pages

Use hardware support:

Enable EPT/NPT huge pages in hypervisor
Encourage guest to use huge pages (Transparent Huge Pages)

Measurement: Database performance with EPT:

Configuration               Queries/sec    TLB miss rate
Guest 4KB, EPT 4KB          15,000        18%
Guest 4KB, EPT 2MB          28,000        8%
Guest 2MB, EPT 2MB          42,000        3%
Guest 2MB, EPT 4KB          12,000        35% (thrashing!)

4.13 GPU and Hardware Accelerator MMUs

Graphics Processing Units (GPUs) and other hardware accelerators have their own Memory Management Units with unique requirements and design trade-offs compared to CPU MMUs.

4.13.1 Why GPUs Need MMUs

Traditional approach (pre-2010): GPU access only pinned physical memory - OS pins memory pages (prevents swapping) - GPU uses physical addresses directly - Simple but inflexible

Figure 4.8: GPU MMU architecture versus CPU TLB. CPUs use per-core, fully-coherent hardware-managed TLBs with ASID isolation. GPU MMUs use a shared two-level hierarchy (per-SM L1 + shared L2) with hardware broadcast shootdown via NVSwitch, but a TLB miss incurs a full PCIe page fault at 25× CPU miss latency.

Problems: - Limits memory oversubscription - Complicates CPU-GPU data sharing - Requires complex buffer management - Can't support virtual memory per-process

Modern approach: GPU has full MMU - GPU uses virtual addresses - Supports page faults - Enables unified virtual addressing - Allows memory overcommit

4.13.2 NVIDIA GPU MMU Architecture

NVIDIA Pascal/Volta/Ampere architecture:

GPU MMU Components:
  1. TLBs (multiple levels)
  2. Page tables (hierarchical, similar to CPU)
  3. Page fault handling unit
  4. Unified Virtual Addressing (UVA) support

TLB Hierarchy:

Per-SM (Streaming Multiprocessor):
  L1 TLB: ~64 entries per SM
    - Handles most translations
    - 1-2 cycle hit latency
    
Shared across GPU:
  L2 TLB: ~512-2048 entries (implementation-dependent)
    - Unified for all SMs
    - ~10-20 cycle hit latency

Page Table Structure:

NVIDIA uses multi-level page tables similar to x86:

Virtual Address (48-bit):
  [47:39] → PDE3 (9 bits)
  [38:30] → PDE2 (9 bits)
  [29:21] → PDE1 (9 bits)
  [20:12] → PDE0 (9 bits)
  [11:0]  → Offset (12 bits)

Supports 4KB, 64KB, and 2MB pages

Unified Virtual Addressing (UVA):

CPU and GPU share same virtual address space:

CPU pointer: 0x7fff1234 points to memory
GPU pointer: 0x7fff1234 points to SAME memory

Benefits:
  - No address translation needed for shared data
  - Simplified programming model
  - Zero-copy data sharing

4.13.3 GPU Page Fault Handling

Traditional CPU: Page fault stalls one thread, others continue

GPU: Page fault can stall thousands of threads!

NVIDIA approach: Replay mechanism

1. Warp (32 threads) accesses memory
2. Some threads: TLB miss → Page fault
3. GPU:
   a. Records faulting threads
   b. Notifies CPU driver
   c. Continues executing other warps
4. CPU:
   a. Handles page fault
   b. Allocates/migrates page
   c. Updates page tables
   d. Signals GPU
5. GPU:
   a. Invalidates TLB entry
   b. Replays faulting threads
   c. Memory access succeeds

Challenge: High page fault latency

CPU page fault: ~1,000-10,000 cycles
GPU page fault: ~50,000-500,000 cycles!
  - PCIe communication overhead
  - Driver processing
  - Page migration
  
Impact: Page faults kill GPU performance!

Solution: Demand paging with prefetching

CUDA Unified Memory prefetches pages:

// Hint to prefetch memory to GPU
cudaMemPrefetchAsync(ptr, size, gpu_device, stream);

// Or advise on expected access pattern
cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, gpu_device);

4.13.4 AMD GPU MMU (IOMMU Integration)

AMD approach: Integrate GPU MMU with CPU IOMMU

HSA (Heterogeneous System Architecture):

Shared page tables between CPU and GPU:
  - Single set of page tables
  - CPU and GPU see identical virtual address space
  - IOMMU translates GPU accesses

IOMMU TLB (AMD):

Per-GPU:
  IOMMU TLB: 512-1024 entries
  
Handles:
  - GPU memory accesses
  - DMA from GPU
  - Peer-to-peer GPU transfers

Coherency: AMD supports cache-coherent GPU memory access (APU architectures)

GPU load:
  1. Check GPU L1 cache
  2. Check GPU L2 cache
  3. Check CPU cache hierarchy (coherent!)
  4. Load from DRAM

4.13.5 Intel GPUs and Shared Virtual Memory

Intel Xe architecture:

Fully integrated GPU MMU:
  - Shares CPU's page tables
  - Uses CPU's TLBs for some operations
  - Hardware page walker shared

Page fault handling:

Intel GPUs can trigger CPU page faults directly:

GPU thread → Page fault → CPU handles → GPU resumes
Latency: ~10,000-50,000 cycles (better than NVIDIA)

Tile-based architecture:

Intel Xe-HP (data center GPU):
  - Multiple tiles (GPU dies)
  - Each tile has own L1 TLB
  - Shared L2 TLB across tiles
  - Coherent with CPU TLBs

4.13.6 Apple M-series Unified Memory

Apple's approach: True unified memory

M1/M2/M3:
  - CPU and GPU share physical RAM
  - No discrete GPU memory
  - Shared page tables
  - Zero-copy between CPU and GPU

TLB structure:

CPU Cores:
  - Per-core L1 TLBs
  - Shared L2 TLB
  
GPU Cores:
  - Per-cluster TLBs
  - Shared with CPU at some level (undocumented)

Performance benefits:

Traditional discrete GPU:
  CPU→GPU transfer: PCIe bandwidth (~16 GB/s)
  
Apple M1:
  CPU↔GPU shared memory: ~200 GB/s
  No transfer needed!

4.13.7 IOMMU and DMA Remapping

I/O MMU (IOMMU) / System MMU (SMMU):

Provides virtual addressing for DMA devices (GPUs, NICs, storage controllers):

Intel VT-d (IOMMU):

Device Memory Access:
  1. Device issues DMA with virtual address
  2. IOMMU intercepts
  3. IOMMU walks page tables (similar to CPU)
  4. IOMMU TLB caches translation
  5. Physical address sent to memory controller

IOMMU TLB structure:

Intel VT-d:
  IOTLB: ~256-512 entries per device
  Hierarchical structure
  Supports 4KB, 2MB, 1GB pages

ARM SMMU (System MMU):

Stage 1: Device VA → IPA
Stage 2: IPA → PA (for virtualization)

Similar to ARM CPU MMU with Stage 2 translation

IOMMU page fault handling:

Device page fault:
  1. IOMMU detects fault
  2. Generates interrupt to CPU
  3. OS page fault handler runs
  4. Updates page tables
  5. IOMMU TLB invalidated
  6. Device DMA retried
  
Latency: ~10,000-100,000 cycles

4.13.8 Performance Considerations

GPU vs CPU TLB differences:

Aspect	CPU TLB	GPU TLB
Threads	2-8 per core	1000s per GPU
Working set	MB-GB	GB-TB
TLB entries	100s-1000s	1000s-10,000s
Page fault tolerance	Medium	Very low
Coherency	Hardware	Limited/None

Best practices:

Minimize GPU page faults:

// Prefetch before kernel launch
cudaMemPrefetchAsync(data, size, device);
cudaStreamSynchronize(stream);
launch_kernel<<<...>>>();

Use large pages:

// Request 2MB pages for large allocations
cudaMallocManaged(&ptr, 1024 * 1024 * 1024);  // 1 GB
cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, device);

Align page sizes between CPU and GPU:

CPU: Use 2MB transparent huge pages
GPU: Configure for 2MB pages
Result: Efficient TLB usage on both

Measurement: Matrix multiplication (4096×4096):

Configuration                  Time      Page Faults
Discrete GPU, no prefetch      120 ms    15,000
Discrete GPU, with prefetch    45 ms     0
Integrated GPU (Apple M1)      28 ms     0 (unified memory)

4.14 Chapter Summary

Chapter 4 explored the Translation Lookaside Buffer as the performance linchpin of every virtual memory system. Unlike caches that store data, the TLB stores virtual-to-physical address translations, allowing the CPU to bypass the multi-level page table walk on every memory reference. A TLB miss forces a hardware page table walk consuming 10–100 or more cycles, and when working sets exceed TLB capacity this penalty can dominate total execution time—making TLB behaviour a first-order performance concern for systems programmers.

The architectural analysis revealed that modern processors deploy multi-level TLB hierarchies. The L1 TLB (typically 4–128 entries, 1-cycle lookup, fully associative) sits closest to the execution pipeline; the L2 unified STLB (512–4096 entries, 6–12 cycle lookup) provides a larger second-chance buffer. ARM's three-level hierarchy—micro-TLB, main TLB, and Walker Cache—illustrates how the same fundamental design challenge plays out differently across implementation families. Entry structure was dissected in detail: the physical frame number, permission bits (R/W/X), privilege level, dirty and accessed bits, and the critical context-tag fields that distinguish entries across address spaces.

Three distinct mechanisms for suppressing TLB flush overhead were examined. Intel's PCID (Process Context Identifier, 12 bits) tags all TLB entries with a process identifier, allowing the processor to retain translations across context switches and invalidate selectively via INVLPG. ARM64's ASID (Address Space Identifier, 8 or 16 bits configurable) provides equivalent functionality with its TLBI instruction family. RISC-V supports ASIDs in a broadly compatible manner via the SATP CSR. Without these tags, every context switch flushes the entire TLB—an overhead measured in hundreds of microseconds on busy servers. In virtualised environments, ARM64's VMID extends the tagging scheme to two dimensions, permitting simultaneous retention of guest and hypervisor translations without mutual interference.

Large page support emerged as the highest-leverage TLB optimisation available to software engineers without hardware changes. A single 2 MB huge page on x86-64 covers 512 standard 4 KB pages, consuming one TLB entry that would otherwise require 512 entries for the same working set. The reach advantage translates directly to runtime: database and scientific computing workloads with tens of gigabytes of working data commonly achieve 15–30 percent throughput improvement from huge page adoption. ARM64's contiguous PTE hint and RISC-V's Svnapot extension offer equivalent mechanisms. The challenges—THP fragmentation, defragmentation latency, hugepage availability under memory pressure—require profiling to confirm net benefit in production.

Multicore TLB coherency introduced the TLB shootdown protocol. When a process unmaps a virtual range, all cores that may hold a cached translation for that range must invalidate it before the physical page can be reused; stale translations that permit access to recycled pages represent a security and correctness hazard. On x86-64 this is accomplished via inter-processor interrupts that execute INVLPG on every affected core; on ARM64 the TLBI VALE1IS/ASIDE1IS broadcast instructions offer more surgical options. At 128-core scale, a single munmap() can generate 127 IPIs consuming microseconds of cross-core synchronisation—a rarely-measured but real source of latency in high-throughput servers.

GPU and accelerator MMUs were examined as the chapter's final frontier. GPU L1 TLBs are small (64–128 entries per shader multiprocessor) but hundreds of them operate in parallel; a single miss from one warp can serialise the entire multiprocessor if it requires a page table walk. NVIDIA's UVM architecture and the Pichai coalescing mechanism represent the state of the art for GPU address translation at scale.

Chapter 5 extends this analysis to I/O devices, examining how the IOMMU performs the same address translation function for DMA operations—and why the costs and design choices differ fundamentally when the entity requesting translation cannot tolerate page faults or handle TLB misses in software.

References

General TLB Architecture and Design

Intel Corporation. Intel® 64 and IA-32 Architectures Optimization Reference Manual. Order Number: 248966-046. June 2023. Section 2.5: "Translation Lookaside Buffers."
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1. Chapter 4: "Paging" and Section 8.1.3: "Handling Self-Modifying Code." 2024.
ARM Limited. ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile. ARM DDI 0487J.a. March 2023. Chapter D5: "The AArch64 Virtual Memory System Architecture" and Section D5.10: "TLB maintenance instructions" and Section B2.2.5: "Maintenance of caches and memory hierarchy."
ARM Limited. ARM Cortex-A78 Core Technical Reference Manual. ARM 100230_0002_00_en. 2020.
RISC-V International. The RISC-V Instruction Set Manual, Volume I: Unprivileged ISA. Version 20191213. Section 2.7: "FENCE.I Instruction." December 2019.
RISC-V International. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture. Version 20211203. December 2021. Section 4.2.1: "Supervisor Memory-Management Fence Instruction."
AMD. Software Optimization Guide for AMD Family 19h Processors. Publication # 56665. June 2023.
Jacob, Bruce, and Trevor Mudge. "Virtual Memory: Issues of Implementation." IEEE Computer 31.6 (1998): 33-43. DOI: 10.1109/2.683005
Bhattacharjee, Abhishek, and Daniel Lustig. Architectural Support for Address Translation. Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2017. DOI: 10.2200/S00752ED1V01Y201610CAC037

TLB Management and Multicore Systems

Baumann, Andrew, et al. "The Multikernel: A new OS architecture for scalable multicore systems." Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP 2009). ACM, 2009. DOI: 10.1145/1629575.1629579
Amit, Nadav, et al. "Optimizing the TLB shootdown algorithm with page access tracking." 2017 USENIX Annual Technical Conference (ATC 17). 2017. Pages 27-39.
Vilanova, Lluís, et al. "CODOMs: Protecting software with code-centric memory domains." Proceedings of the 41st ACM SIGARCH International Symposium on Computer Architecture (ISCA 2014). 2014.

Large Pages and TLB Coverage

Bienia, Christian, et al. "The PARSEC benchmark suite: Characterization and architectural implications." Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT 2008). ACM, 2008. DOI: 10.1145/1454115.1454128
Kwon, Youngjin, et al. "Coordinated and Efficient Huge Page Management with Ingens." 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016). 2016. Pages 705-721.
Navarro, Juan, et al. "Practical, transparent operating system support for superpages." 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2002). 2002. Pages 89-104.

Page Walk Caches

Barr, T. W., Cox, A. L., & Rixner, S. "Translation caching: skip, don't walk (the page table)." ACM SIGARCH Computer Architecture News, 39(3) (2011): 48-59. DOI: 10.1145/2024723.2000118

Alternative Page Table Structures

Talluri, M., & Hill, M. D. "Surpassing the TLB performance of superpages with less operating system support." ACM SIGPLAN Notices, 29(11) (1994): 171-182. DOI: 10.1145/191126.191235 [Inverted page tables analysis]
Liedtke, Jochen. "Address space sparsity and fine granularity." ACM SIGARCH Computer Architecture News 23.5 (1995): 78-86. DOI: 10.1145/224021.224028 Clustered page tables
Chang, Albert, and Mark F. Mergen. "801 storage: Architecture and programming." ACM Transactions on Computer Systems (TOCS) 6.1 (1988): 28-50. DOI: 10.1145/35037.35039 [Early inverted page table implementation]

Virtualization and Shadow Page Tables

Waldspurger, C. A. "Memory resource management in VMware ESX server." ACM SIGOPS Operating Systems Review, 36(SI) (2002): 181-194. DOI: 10.1145/844128.844146 [Classic shadow page table paper]
Adams, Keith, and Ole Agesen. "A comparison of software and hardware techniques for x86 virtualization." Proceedings of the 12th international conference on Architectural support for programming languages and operating systems (ASPLOS 2006). ACM, 2006. DOI: 10.1145/1168857.1168860
Bhargava, Ravi, et al. "Accelerating two-dimensional page walks for virtualized systems." ACM SIGARCH Computer Architecture News 36.1 (2008): 26-35. DOI: 10.1145/1353534.1346286
Ahn, J., Jin, S., & Huh, J. "Revisiting hardware-assisted page walks for virtualized systems." Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA 2012). IEEE, 2012. DOI: 10.1109/ISCA.2012.6237040 [EPT/NPT TLB performance analysis with mismatched page sizes]

Self-Modifying Code and Cache Coherency

Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. Section 11.6: "Self-Modifying Code." 2024.
ARM Limited. ARM Architecture Reference Manual ARMv8. Section D4.4.4: "Instruction caches and memory coherency." 2023.
Boehm, Hans-J., and Sarita V. Adve. "Foundations of the C++ concurrency memory model." Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2008). 2008. DOI: 10.1145/1375581.1375591

GPU and Accelerator MMUs

NVIDIA Corporation. NVIDIA CUDA C Programming Guide. Version 12.0. 2023. Chapter 3.2: "Unified Memory Programming."
AMD. Graphics Core Next Architecture, Generation 3. White Paper. 2016.
Intel Corporation. Intel® Data Center GPU Max Series Software Developer's Guide. Document Number: 767252-001. 2023.
ARM Limited. ARM CoreLink MMU-600 System Memory Management Unit Technical Reference Manual. ARM 100310_0200_00_en. 2020. [IOMMU/SMMU for accelerators]
Silicon Graphics, Inc. (SGI). OpenGL on SGI Reality Engine Graphics. Technical Report. 1994. [Early GPU virtual memory concepts]
Pichai, Bharath, et al. "Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces." ACM SIGPLAN Notices 49.4 (2014): 743-758. DOI: 10.1145/2644865.2541940
Power, Jason, et al. "Supporting x86-64 address translation for 100s of GPU lanes." 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2014. DOI: 10.1109/HPCA.2014.6835960
Ausavarungnirun, Rachata, et al. "Mosaic: A GPU memory manager with application-transparent support for multiple page sizes." Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2017). ACM, 2017. DOI: 10.1145/3123939.3123975 [GPU TLB with mixed page sizes]

IOMMU and DMA Remapping

Intel Corporation. Intel® Virtualization Technology for Directed I/O (VT-d): Architecture Specification. Revision 3.3. September 2020.
Ben-Yehuda, Muli, et al. "The price of safety: Evaluating IOMMU performance." Linux Symposium. Vol. 9. 2007. Pages 9-20.
ARM Limited. ARM System Memory Management Unit Architecture Specification, SMMU architecture version 3. ARM IHI 0070D.b. 2019.

Performance Analysis and Benchmarking

Basu, Arkaprava, et al. "Efficient virtual memory for big memory servers." Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA 2013). ACM, 2013. DOI: 10.1145/2485922.2485943
Karakostas, Vasileios, et al. "Redundant memory mappings for fast access to large memories." Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA 2015). IEEE, 2015. DOI: 10.1145/2749469.2750383
Cox, Russ, and William Josephson. "Optimizing network virtualization in Xen." Proceedings of the USENIX Annual Technical Conference. 2006.
Gandhi, Jayneel, et al. "Badger TLB: A mechanism for optimizing TLB miss handling." 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2014. DOI: 10.1109/MICRO.2014.45

Historical and Survey Papers

Smith, Alan Jay, and James R. Goodman. "The impact of memory management on commercial computer systems." Communications of the ACM 26.12 (1983): 1002-1011. DOI: 10.1145/358476.358484
Denning, Peter J. "Virtual memory." ACM Computing Surveys (CSUR) 2.3 (1970): 153-189. DOI: 10.1145/356571.356573 [Seminal paper on virtual memory]
Kilburn, Tom, et al. "One-level storage system." IRE Transactions on Electronic Computers 2 (1962): 223-235. DOI: 10.1109/IREELC.1962.5408922 [Atlas computer - first virtual memory system]

Additional Resources

Rosenblum, Mendel. "The design and implementation of a log-structured file system." ACM Transactions on Computer Systems (TOCS) 10.1 (1992): 26-52. [Discusses TLB implications of different page table designs]
Saulsbury, Ashley, Fredrik Dahlgren, and Per Stenström. "Recency-based TLB preloading." ACM SIGARCH Computer Architecture News 28.2 (2000): 117-127. DOI: 10.1145/342001.339668
Marwaha, Manish, and Abhishek Bhattacharjee. "Spatiotemporal optimizations for multi-GPU systems via decoupled modules." IEEE Computer Architecture Letters 21.2 (2022): 85-88. [Modern GPU MMU optimizations]

§4.12 — Virtualization TLB Performance

Ahn, J., Jin, S., and Huh, J. "Revisiting hardware-assisted page walks for virtualized systems." Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), pp. 476–487, 2012. DOI: 10.1145/2366231.2337200

§4.13 — GPU and Accelerator TLBs

NVIDIA Corporation. CUDA C++ Programming Guide. Chapter 3.2: "Unified Memory Programming." 2024. docs.nvidia.com/cuda
AMD. AMD Graphics Core Next Architecture, Generation 3. Section 5: "Memory Hierarchy." 2016. gpuopen.com/GCN3
Intel Corporation. Intel Data Center GPU Flex Series Product Brief. 2023. intel.com/data-center-gpu-flex