In Chapter 1, we introduced the Translation Lookaside Buffer (TLB) as a specialized cache that stores recent virtual-to-physical address translations. We saw how TLB hits provide translations in 1-2 CPU cycles (~1 nanosecond), while TLB misses require a full page table walk taking ~100 cycles (~100 nanoseconds)—a 50-100× performance difference.
This chapter explores the TLB in comprehensive detail. While most programmers are familiar with CPU caches (L1, L2, L3), the TLB remains relatively obscure despite being equally critical to system performance. A well-designed application with excellent L1 cache behavior can still suffer catastrophic performance degradation from TLB misses. Conversely, TLB-aware optimization can yield speedups of 2-5× or more on memory-intensive workloads.
Performance Impact: The TLB is on the critical path of every single memory access. Modern processors execute billions of instructions per second, many accessing memory multiple times per instruction. Even a small TLB miss rate can dominate execution time. Consider:
A 5% TLB miss rate with 100-cycle penalty costs 5 cycles per access on average—more expensive than an L1 cache miss! For applications with poor TLB behavior, the miss rate can exceed 20-30%, causing 20-30 cycle average penalty and making memory accesses 10× slower than they should be.
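To make the arithmetic concrete, here is a minimal C sketch of the cost model used above—average translation cost as hit cost plus miss rate times miss penalty. The 1-cycle and 100-cycle figures are this chapter's approximations, not measurements:

#include <stdio.h>

int main(void) {
    const double hit_cycles  = 1.0;    /* TLB hit cost */
    const double miss_cycles = 100.0;  /* full page table walk cost */
    const double rates[] = { 0.01, 0.05, 0.20, 0.30 };

    /* Average cost = hit cost + miss_rate * miss penalty */
    for (int i = 0; i < 4; i++) {
        double avg = hit_cycles + rates[i] * miss_cycles;
        printf("miss rate %4.0f%% -> average %5.1f cycles per access\n",
               rates[i] * 100.0, avg);
    }
    return 0;
}

At a 5% miss rate this prints 6.0 cycles: the ~5-cycle average miss penalty plus the 1-cycle hit cost.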
Architectural Diversity: Unlike data caches, which are relatively standardized, TLB designs vary dramatically across architectures: entry counts, associativity, supported page sizes, and even who handles misses (hardware or software) all differ.
Understanding these differences is crucial for writing portable, high-performance systems software.
Virtualization Complexity: Cloud computing has made virtualization ubiquitous. Two-stage address translation (guest virtual → guest physical → host physical) doubles TLB pressure and introduces new management challenges. Modern TLBs must cache two-dimensional translations while maintaining coherency across VM switches.
Security Implications: The TLB interacts intimately with memory protection, making it security-critical:
- TLB poisoning can bypass access control
- TLB timing attacks leak information across security boundaries
- Spectre and Meltdown exploit speculative TLB behavior
- TLB shootdown creates denial-of-service vectors
This chapter provides comprehensive coverage of TLB architecture, management, and optimization:
Sections 4.2-4.3: TLB hardware architecture, entry structure, and management strategies (hardware vs. software)
Sections 4.4-4.5: Address space identifiers (ASID/PCID/VMID) and TLB operations (invalidation, shootdown)
Sections 4.6-4.7: Large pages, TLB coverage, and multicore coherency
Section 4.8: Platform-specific implementations (Intel, AMD, ARM, RISC-V) with real hardware specifications
Sections 4.9-4.10: Performance optimization and virtualization
Sections 4.11-4.12: Security considerations and advanced topics
By the end of this chapter, you'll understand not just what the TLB does, but how it works, why it's designed as it is, and how to write TLB-efficient software.
Modern processors don't have a single TLB—they have a multi-level TLB hierarchy analogous to the data cache hierarchy (L1/L2/L3). This design emerges from fundamental tradeoffs:
Speed vs. Capacity Tradeoff:
- Small, fully-associative TLBs (32-128 entries) can provide 1-cycle lookup
- Large TLBs (1024-2048 entries) require multiple cycles or set-associativity
- Solution: Multi-level hierarchy with small, fast L1 TLBs backed by larger L2 TLBs

Instruction vs. Data Separation:
- Instruction fetches and data accesses have different locality patterns
- Code typically has high spatial locality (sequential execution)
- Data access patterns vary widely by application
- Solution: Separate L1 ITLB and L1 DTLB, unified L2 TLB

Page Size Heterogeneity:
- Modern systems use mixed page sizes (4KB, 2MB, 1GB)
- Different page sizes have different working set requirements
- Optimal TLB organization differs by page size
- Solution: Multiple TLBs or flexible entries supporting multiple sizes

L1 TLBs (Instruction and Data):
- Size: 32-128 entries (typically 64 for data, 128 for instructions)
- Associativity: Fully associative (all entries checked in parallel)
- Latency: 1 cycle (integrated into pipeline)
- Hit Rate: 95-99% for well-behaved applications
- Organization: Separate ITLB (instruction) and DTLB (data)
The L1 TLBs are tiny by design—they must provide translation on the same cycle as L1 cache access, requiring fully parallel hardware that becomes impractical beyond ~128 entries. The split between instructions and data allows the CPU to fetch instructions while simultaneously accessing data, maintaining the illusion of Harvard architecture at the TLB level.
L2 TLB (Shared/Unified):
- Size: 512-2048 entries (Intel: 1536, AMD Zen 3: 2048)
- Associativity: 4-way to 12-way set-associative
- Latency: 5-10 cycles
- Hit Rate: 98-99.9% when L1 misses
- Organization: Unified for both instructions and data
The L2 TLB acts as a victim cache for the L1 TLBs, storing translations evicted from L1. Its larger capacity provides better coverage of the application's working set, while set-associativity keeps lookup time reasonable. Modern L2 TLBs often support multiple page sizes in a single structure.
Real-World Example: Intel Ice Lake (2019)
Intel's Ice Lake microarchitecture provides a concrete example:
L1 DTLB:
- 64 entries for 4KB pages
- 32 entries for 2MB/4MB pages
- 4 entries for 1GB pages
- Fully associative
- 1-cycle latency
L1 ITLB:
- 128 entries for 4KB pages
- 8 entries for 2MB/4MB pages
- Fully associative
- 1-cycle latency
L2 STLB (Shared):
- 1536 entries for 4KB or 2MB pages
- 16 entries for 1GB pages
- 12-way set-associative
- 5-cycle latency
Notice the allocation strategy: Many small-page entries, fewer large-page entries. This reflects typical usage where most translations are 4KB, but a few critical large mappings (e.g., shared libraries, huge pages) use 2MB/1GB pages.
A TLB entry must contain all information needed to perform address translation and permission checks. Let's examine the detailed structure:
1. Virtual Page Number (VPN) - The Tag
The VPN identifies which virtual page this entry translates. For a 4KB page in a 48-bit virtual address space:
Virtual Address (48 bits): [47:12] = VPN (36 bits), [11:0] = Offset (12 bits)
The VPN serves as the TLB's tag—the lookup key for associative matching. During translation:
1. CPU extracts VPN from virtual address
2. TLB compares VPN against all tags in parallel (fully-associative) or within a set (set-associative)
3. Match → TLB hit; No match → TLB miss
2. Physical Frame Number (PFN) - The Data
The PFN specifies where this virtual page maps in physical memory. For 4KB pages:
Physical Address (52 bits): [51:12] = PFN (40 bits), [11:0] = Offset (12 bits)
On a TLB hit, the CPU:
1. Takes the PFN from the matched TLB entry
2. Concatenates it with the page offset from the virtual address
3. Issues the resulting physical address to the cache/memory system
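A minimal C sketch of this hit path for 4 KB pages follows; the entry layout and function names are illustrative, not any particular processor's:

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define OFFSET_MASK ((1ULL << PAGE_SHIFT) - 1)

typedef struct {
    uint64_t vpn;   /* tag: virtual page number */
    uint64_t pfn;   /* data: physical frame number */
    bool     valid;
} tlb_entry_t;

/* Fully-associative lookup: compare the VPN against every entry
 * (hardware does all comparisons in parallel; this loop models it). */
bool tlb_translate(const tlb_entry_t *tlb, int entries,
                   uint64_t va, uint64_t *pa) {
    uint64_t vpn = va >> PAGE_SHIFT;
    for (int i = 0; i < entries; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            /* Hit: concatenate the PFN with the untranslated offset. */
            *pa = (tlb[i].pfn << PAGE_SHIFT) | (va & OFFSET_MASK);
            return true;
        }
    }
    return false;   /* Miss: a page table walk is required */
}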
3. Permission Bits
Permission bits (read, write, execute, and user/supervisor) enforce memory protection at hardware speed.
Checking these on every access in the page table would be too slow. Caching them in the TLB allows permission checks in the same cycle as translation.
4. Status and Control Flags
Modern TLBs cache additional page table entry flags, including the Global (G), Dirty (D), and Accessed (A) bits and memory-type attributes. These flags serve multiple purposes:
- G-bit optimization: Kernel mappings marked global survive context switches, avoiding TLB refills
- D-bit dirty tracking: OS uses this for copy-on-write and page-out decisions
- A-bit access tracking: Guides page replacement algorithms
- Memory type control: Critical for device I/O and memory-mapped registers
5. Address Space Identifier (ASID/PCID/VMID)
To avoid flushing the entire TLB on context switches, modern processors tag entries with an identifier: an ASID on ARM, a PCID on x86, and a VMID for virtual machines.
This allows the TLB to hold translations from multiple processes simultaneously:
TLB Lookup: Match (VPN, ASID) → Return PFN
Without ASID/PCID, every context switch requires:
1. Flush entire TLB (except global entries)
2. All subsequent accesses miss TLB
3. Expensive page table walks refill TLB
4. Takes ~1000 cycles to recover performance
With ASID/PCID:
1. Change ASID/PCID register
2. TLB entries from old process remain cached
3. TLB entries from new process (if still present) hit immediately
4. Significant performance improvement on context-heavy workloads
We'll explore ASID/PCID/VMID mechanisms in detail in Section 4.4.
6. Page Size Indicator
Since modern TLBs support multiple page sizes, each entry must indicate its size:
Page Size (PS) bits:
00 = 4 KB
01 = 2 MB
10 = 1 GB
11 = 512 GB (future/experimental)
This is crucial because:
- Virtual address → VPN extraction differs by page size
- Large pages have fewer VPN bits, more offset bits
- TLB must extract the correct PFN portion
For example, with a 2MB page:
Virtual Address: [47:21] = VPN (27 bits), [20:0] = Offset (21 bits)
Physical Address: [51:21] = PFN (31 bits), [20:0] = Offset (21 bits)
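A small C sketch of size-dependent VPN/offset extraction (the PS encoding follows the table above; the helper names are illustrative):

#include <stdint.h>
#include <stdio.h>

/* Offset width by PS field: 00 = 4 KB, 01 = 2 MB, 10 = 1 GB */
static const int offset_bits[] = { 12, 21, 30 };

static uint64_t vpn_of(uint64_t va, int ps)    { return va >> offset_bits[ps]; }
static uint64_t offset_of(uint64_t va, int ps) { return va & ((1ULL << offset_bits[ps]) - 1); }

int main(void) {
    uint64_t va = 0x00401234;
    /* As a 4 KB page: VPN = 0x401, offset = 0x234 */
    printf("4KB: vpn=0x%llx offset=0x%llx\n",
           (unsigned long long)vpn_of(va, 0), (unsigned long long)offset_of(va, 0));
    /* As a 2 MB page: VPN = 0x2, offset = 0x1234 */
    printf("2MB: vpn=0x%llx offset=0x%llx\n",
           (unsigned long long)vpn_of(va, 1), (unsigned long long)offset_of(va, 1));
    return 0;
}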
During a hardware page table walk, the PS (Page Size) bit in intermediate page table entries determines whether the walk continues or terminates at that level:
x86-64 Page Walk with PS Bit Checking:
1. Read CR3 → Get PML4 base address
2. Index into PML4 using VA[47:39]
3. Read PML4 Entry (PML4E)
4. Check P bit; if 0 → Page Fault
5. Index into PDPT using VA[38:30]
6. Read PDPT Entry (PDPTE)
7. Check P bit; if 0 → Page Fault
8. Check PDPTE.PS (bit 7):
- If PS=1: This is a 1 GB page
→ Extract PFN from PDPTE[51:30] (1 GB aligned)
→ Page offset = VA[29:0] (30 bits for 1 GB = 2^30)
→ Physical Address = (PFN << 30) | VA[29:0]
→ STOP: Insert 1 GB TLB entry
- If PS=0: Continue to next level
9. Index into PD using VA[29:21]
10. Read PD Entry (PDE)
11. Check P bit; if 0 → Page Fault
12. Check PDE.PS (bit 7):
- If PS=1: This is a 2 MB page
→ Extract PFN from PDE[51:21] (2 MB aligned)
→ Page offset = VA[20:0] (21 bits for 2 MB = 2^21)
→ Physical Address = (PFN << 21) | VA[20:0]
→ STOP: Insert 2 MB TLB entry
- If PS=0: Continue to next level
13. Index into PT using VA[20:12]
14. Read PT Entry (PTE)
15. Check P bit; if 0 → Page Fault
16. Extract PFN from PTE[51:12] (4 KB aligned)
17. Page offset = VA[11:0] (12 bits for 4 KB = 2^12)
18. Physical Address = (PFN << 12) | VA[11:0]
19. Insert 4 KB TLB entry
Key Points:
PS bit only exists in upper-level entries: PML4E doesn't have PS bit (always points to PDPT). PDPTE and PDE have PS bits to enable 1GB and 2MB pages respectively.
PS=1 terminates the walk early: When PS=1, that entry contains the final physical address (appropriately aligned), not a pointer to the next level.
TLB entry size matches page size: The TLB entry's PS field is set based on which level terminated the walk: PDPTE for 1 GB pages, PDE for 2 MB pages, PTE for 4 KB pages.
Alignment requirements: Large page physical addresses must be aligned to the page size: 2 MB pages on 2 MB boundaries, 1 GB pages on 1 GB boundaries.
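The following C sketch models the walk above in software. The real walk is performed by the hardware walker; the phys_to_virt helper and the constants here are illustrative assumptions:

#include <stdint.h>

#define P_BIT     (1ULL << 0)
#define PS_BIT    (1ULL << 7)
#define ADDR_MASK 0x000FFFFFFFFFF000ULL   /* PTE bits 51:12 */

extern uint64_t *phys_to_virt(uint64_t pa);  /* assumed: map a table's physical address */

/* Returns the physical address, or -1 to signal a page fault. */
uint64_t walk_x86_64(uint64_t cr3, uint64_t va) {
    static const int shifts[] = { 39, 30, 21, 12 };   /* PML4, PDPT, PD, PT */
    uint64_t table = cr3 & ADDR_MASK;

    for (int level = 0; level < 4; level++) {
        uint64_t entry = phys_to_virt(table)[(va >> shifts[level]) & 0x1FF];
        if (!(entry & P_BIT))
            return (uint64_t)-1;                       /* not present: page fault */

        /* PS=1 in a PDPTE (level 1) or PDE (level 2) terminates early. */
        if ((level == 1 || level == 2) && (entry & PS_BIT)) {
            uint64_t off = (1ULL << shifts[level]) - 1;
            return (entry & ADDR_MASK & ~off) | (va & off);
        }
        table = entry & ADDR_MASK;                     /* descend one level */
    }
    return table | (va & 0xFFF);                       /* level 3 PTE: 4 KB page */
}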
Architecture-Specific Variations:
ARM64: Uses descriptor type field instead of PS bit:
Bits [1:0] in descriptor:
00 = Invalid
01 = Block entry (large page at this level)
10 = Reserved
11 = Table entry (points to next level) or Page entry (at level 3)
RISC-V: Uses elegant R/W/X encoding:
R W X = 0 0 0 → Pointer to next level (continue walk)
R W X ≠ 0 0 0 → Leaf entry (page at this level)
Page size determined by level:
Level 2 leaf → 1 GB page
Level 1 leaf → 2 MB page
Level 0 leaf → 4 KB page
For complete details on page table walk algorithms and PS bit encoding across architectures, see Chapter 3, Section 3.4 "Multi-Level Page Tables" and Section 3.5.2 "x86-64 Page Table Entry Structure" where large page formats are covered in depth.
Here's a representative TLB entry structure (x86-64 style):
| VPN (36 bits) | PS (2 bits) | PFN (40 bits) | PCID (12 bits) | Flags (12 bits) | Valid (1 bit) |
|---|---|---|---|---|---|
| Tag | 4K / 2M / 1G | Physical frame | Process ID | R/W/X/U/G/D/A | V |
PS = page size selector; PCID = Process Context ID (x86-64 ASID equivalent); Flags: R=Read, W=Write, X=Execute, U=User, G=Global, D=Dirty, A=Accessed
ARM64 would have ASID instead of PCID, potentially VMID for virtualization, and slightly different flag bits, but the structure is similar.
Note: Actual hardware implementations may optimize encoding, use additional bits for internal bookkeeping, or employ specialized representations for different page sizes.
TLB lookup must be fast—ideally single-cycle for L1 TLBs. This requirement drives associativity choices.
Design: Any translation can be stored in any TLB entry. Lookup checks all entries in parallel.
Advantages:
- Maximum flexibility—no capacity misses due to set conflicts
- Best hit rate for given size
- Simple to implement at small sizes (32-128 entries)

Disadvantages:
- Requires parallel comparators for every entry (expensive in silicon area and power)
- Doesn't scale beyond ~128 entries
- Replacement policy must consider entire TLB
Typical Use: L1 TLBs (both ITLB and DTLB)
Design: TLB is divided into sets; each translation hashes to a specific set and can only reside in one of N entries within that set (N-way set-associative).
Address Mapping:
VPN → Set Index = VPN[m−1:0], where m = log₂(num_sets)
For a 1024-entry, 8-way set-associative TLB:
- 128 sets (1024 / 8)
- VPN[6:0] determines the set (2^7 = 128)
- Check 8 entries in parallel within that set
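In C, the index computation is a simple mask (a sketch with illustrative names):

#include <stdint.h>

#define NUM_SETS 128   /* 1024 entries / 8 ways */

/* Low-order VPN bits select the set; the remaining bits form the stored tag. */
static inline uint32_t tlb_set_index(uint64_t vpn) { return (uint32_t)(vpn & (NUM_SETS - 1)); }
static inline uint64_t tlb_tag(uint64_t vpn)       { return vpn >> 7; /* log2(NUM_SETS) */ }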
Advantages:
- Scales to larger sizes (1024-4096 entries feasible)
- Reduces comparator count (only N comparators vs. total entries)
- Power efficient—only activate one set

Disadvantages:
- Conflict misses possible (two VPNs hash to same set, exceed associativity)
- Lower hit rate than fully-associative at same total size
- Set index bits "wasted" for tagging purposes
Typical Use: L2/L3 TLBs
When a TLB is full, inserting a new entry requires evicting an existing one. The replacement policy determines which entry to evict.
1. Least Recently Used (LRU)
Strategy: Evict the entry that hasn't been used for the longest time.
Implementation: Maintain timestamp or age counter for each entry, evict oldest.
Advantages:
- Theoretically optimal for temporal locality
- Predictable behavior

Disadvantages:
- Requires per-entry metadata (log₂N bits for N-way associativity)
- Update cost on every TLB access (write timestamp)
- Complex for large associativity
Use: Common in smaller TLBs (L1) where tracking is feasible
2. Pseudo-LRU (PLRU)
Strategy: Approximate LRU using a tree of "recently used" bits.
Implementation: For 8-way associative, use 7 bits to encode a binary tree:
          [0]
        /     \
      [1]      [2]
     /   \    /   \
   [3]   [4] [5]  [6]
   / \   / \ / \  / \
  E0 E1 E2 E3 E4 E5 E6 E7
Each internal node bit points toward "recently used" subtree.
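A sketch of tree-PLRU victim selection and update for an 8-way set, in C. The bit layout follows the tree above (node 0 is the root; children of node i are 2i+1 and 2i+2); this is an illustrative model, not any specific core's logic:

#include <stdint.h>

/* Convention: a set bit means the LEFT subtree was used more recently,
 * so the victim search follows the opposite side at each node. */
int plru_victim(uint8_t tree) {
    int node = 0;
    for (int level = 0; level < 3; level++) {
        int left_recent = (tree >> node) & 1;
        node = 2 * node + 1 + left_recent;   /* step away from the recent side */
    }
    return node - 7;                         /* leaf nodes 7..14 are ways 0..7 */
}

/* On an access, point every node on the way's path toward it. */
uint8_t plru_touch(uint8_t tree, int way) {
    int node = way + 7;                      /* leaf index */
    while (node > 0) {
        int parent = (node - 1) / 2;
        if (node == 2 * parent + 1)
            tree |= (uint8_t)(1u << parent);     /* reached via left child */
        else
            tree &= (uint8_t)~(1u << parent);    /* reached via right child */
        node = parent;
    }
    return tree;
}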
Advantages:
- Much less metadata (N−1 bits per set vs. N·log₂N)
- Cheaper to update (log₂N bits)
- Good approximation of LRU

Disadvantages:
- Not true LRU (pathological cases exist)
- More complex than random
Use: Common in L2 TLBs
3. Not Recently Used (NRU)
Strategy: Mark entries as "used" on access, prefer evicting "not used" entries.
Implementation: Single "used" bit per entry, periodically clear all bits.
Advantages:
- Very simple (1 bit per entry)
- Low overhead
- Reasonably effective

Disadvantages:
- Coarse granularity
- Periodic clearing introduces complexity
Use: Some simpler TLB designs
4. Random Replacement
Strategy: Evict a random entry from the set.
Implementation: Use a linear feedback shift register (LFSR) to generate pseudo-random index.
Advantages:
- Simplest possible (no per-entry metadata)
- No pathological cases
- Surprisingly effective in practice

Disadvantages:
- No exploitation of temporal locality
- Variable performance
Use: Simple embedded processors, some RISC designs
Intel's Skylake uses a 12-way set-associative L2 STLB with 1536 entries for 4KB/2MB pages:
Organization:
- 128 sets (1536 entries / 12 ways)
- VPN[6:0] selects set
- 12-way associative within each set
- Pseudo-LRU replacement policy
When a new translation is loaded:
1. Hash VPN to determine set (VPN[6:0])
2. If set is full, select victim using PLRU tree
3. Evict victim entry
4. Insert new translation
The 12-way associativity provides good hit rates while keeping the comparator count manageable (only 12 parallel comparisons per lookup instead of 1536).
One of the most fundamental design decisions in TLB architecture is who manages TLB misses: hardware or software? This choice has profound implications for performance, complexity, flexibility, and OS design. Modern architectures take different approaches, each with compelling rationales.
In hardware-managed TLB systems, the processor automatically walks the page tables on a TLB miss and fills the TLB entry without software intervention. This is the approach taken by x86-64 (Intel and AMD) and ARM64 architectures.
When a TLB miss occurs:
1. CPU detects TLB miss (VPN not found in TLB)
2. Hardware page table walker activates:
- Reads page table base register (CR3 on x86, TTBR0_EL1 on ARM)
- Follows page table hierarchy level by level
- Reads each page table entry from memory
- Validates permissions and present bits at each level
3. On success:
- Hardware constructs TLB entry from leaf PTE
- Inserts entry into TLB (following replacement policy)
- Retries the memory access (now hits in TLB)
- No software intervention required
4. On failure (page not present, permission violation):
- Generate page fault exception
- Transfer control to OS page fault handler
- OS resolves the fault and returns
- Hardware retries, now succeeds
Consider accessing virtual address 0x00401234 with a TLB miss:
1. TLB lookup for VPN 0x401 → MISS
2. Hardware page walker activates:
- Read CR3 → PML4 base address = 0x1000000
- Extract PML4 index from VA: bits [47:39] = 0
- Read PML4E at 0x1000000 + (0 × 8) = 0x1000000
- PML4E contains PDPT base = 0x2000000, P=1 (present)
- Extract PDPT index from VA: bits [38:30] = 0
- Read PDPTE at 0x2000000 + (0 × 8) = 0x2000000
- PDPTE contains PD base = 0x3000000, P=1
- Extract PD index from VA: bits [29:21] = 2
- Read PDE at 0x3000000 + (2 × 8) = 0x3000010
- PDE contains PT base = 0x4000000, P=1
- Extract PT index from VA: bits [20:12] = 1
- Read PTE at 0x4000000 + (1 × 8) = 0x4000008
- PTE contains PFN = 0xA2B4C, P=1, R/W=1, U/S=0, NX=0
3. Hardware constructs TLB entry:
- VPN = 0x401
- PFN = 0xA2B4C
- Permissions = R/W/X, Supervisor only
- Insert into TLB
4. Retry access → TLB HIT → Success
Total cost: 4 memory accesses (one per level) + TLB insertion = ~40-200 cycles depending on cache hits
1. Software Simplicity
The OS doesn't need TLB miss handlers. Page fault handling code is simpler—it only deals with genuinely invalid accesses, not TLB capacity misses.
2. Performance Predictability
Hardware page walks have deterministic, well-optimized performance. The same code path is used billions of times per second, so it's worth extensive microarchitectural optimization.
3. Page Walk Caches
Modern processors add Page Walk Caches (PWC) that cache intermediate page table entries:
- PML4 cache: Caches PML4 entries
- PDPT cache: Caches PDPT entries
- PD cache: Caches PD entries
These dramatically reduce the cost of TLB misses. Example with PWC:
TLB miss, but all intermediate levels cached:
- Read PML4E from PWC (hit) - 0 cycles
- Read PDPTE from PWC (hit) - 0 cycles
- Read PDE from PWC (hit) - 0 cycles
- Read PTE from L3 cache - 40 cycles
Total: ~40 cycles instead of 200
Intel processors report PWC hit rates of 70-90% for typical workloads, significantly mitigating TLB miss penalties.
4. Concurrent Page Walks
Hardware can support multiple simultaneous page walks for different TLB misses, improving throughput under high miss rates.
1. Inflexibility
The page table format is fixed by hardware. OS cannot use custom page table structures or compression schemes.
2. Hardware Complexity
Implementing a page table walker requires substantial silicon area and power. The walker must handle multiple page sizes, two-stage translation (virtualization), and various corner cases.
3. Attack Surface
Hardware page walkers have been targets of attacks:
- Speculative execution: Spectre variant 1.1 exploits speculative page walks
- Timing side channels: Page table structure can leak information through timing
4. Wasted Work
If a page is genuinely not present (swapped out), the hardware walker traverses multiple levels unnecessarily before generating a page fault.
Software-managed TLB systems take the opposite approach: on a TLB miss, the processor generates an exception and software fills the TLB entry. This is the approach historically used by MIPS and adopted by some RISC-V implementations.
1. CPU detects TLB miss
2. CPU generates TLB miss exception:
- Instruction TLB miss: Jump to ITLB miss handler vector
- Data TLB miss: Jump to DTLB miss handler vector
- Or unified TLB miss vector (implementation-dependent)
3. Software TLB refill handler executes:
- Reads page table base from CSR (Control and Status Register)
- Walks page tables in software
- Constructs TLB entry
- Uses special instruction to write TLB entry
- Returns from exception
4. CPU retries memory access → TLB HIT
RISC-V provides explicit instructions for TLB management:
TLB Synchronization Instruction:
# Synchronize translations after a page table update
sfence.vma x0, x0
# (TLB insertion itself happens implicitly via CSR writes or an implementation-specific mechanism)
Typical RISC-V TLB Miss Handler:
void tlb_miss_handler(vaddr_t fault_addr, int access_type) {
    // Extract VPN from faulting address
    vpn_t vpn = fault_addr >> 12;
    // Read page table base (satp holds the root PPN; shift to a physical address)
    paddr_t pt_base = (read_csr(satp) & SATP_PPN_MASK) << 12;
    uint64_t *pte;

    // Sv39: 3-level page table walk
    // Level 2 (VPN[2])
    pte = (uint64_t*)(pt_base + ((vpn >> 18) & 0x1FF) * 8);
    if (!(*pte & PTE_V)) goto page_fault;
    if (*pte & (PTE_R | PTE_W | PTE_X)) goto found_leaf; // 1 GB huge page

    // Level 1 (VPN[1])
    pte = (uint64_t*)(((*pte >> 10) << 12) + ((vpn >> 9) & 0x1FF) * 8);
    if (!(*pte & PTE_V)) goto page_fault;
    if (*pte & (PTE_R | PTE_W | PTE_X)) goto found_leaf; // 2 MB large page

    // Level 0 (VPN[0])
    pte = (uint64_t*)(((*pte >> 10) << 12) + (vpn & 0x1FF) * 8);
    if (!(*pte & PTE_V)) goto page_fault;

found_leaf:
    // Check permissions (R/W/X, U/S)
    if (!check_permissions(*pte, access_type)) goto page_fault;
    // Insert into TLB (implementation-specific)
    tlb_insert(vpn, *pte);
    return;

page_fault:
    // Real page fault - call OS handler
    handle_page_fault(fault_addr);
}

1. Flexibility
OS has complete control over page table format. Possibilities include:
- Compressed page tables: Use custom compression schemes
- Inverted page tables: Hash-based lookup structures
- Radix trees: Alternative hierarchical structures
- Per-process optimizations: Different processes can use different schemes
2. Simplified Hardware
No hardware page walker required, saving silicon area and power. This aligns with RISC-V's philosophy of minimal hardware complexity.
3. Transparency
Software can instrument TLB misses for profiling, debugging, or security monitoring without hardware support.
4. Rapid Evolution
New page table designs can be deployed via software updates without hardware changes.
1. Performance Cost
Every TLB miss requires exception entry/exit overhead:
- Save/restore CPU state
- Jump to handler (potentially cold in instruction cache)
- Execute handler code
- Return from exception
Typical overhead: 200-1000 cycles for a TLB miss, even with fast handler.
2. Software Complexity
OS must implement TLB refill handlers correctly for every page size, permission combination, and edge case. Bugs in handlers can cause crashes or security vulnerabilities.
3. No Page Walk Caches
Without hardware page walks, there's no obvious place to add page walk caches. Software can cache intermediate results, but this adds more complexity.
4. Interrupt Latency
TLB miss handlers execute with interrupts often disabled, increasing worst-case interrupt latency.
Modern architectures increasingly blur the hardware/software boundary:
ARM64 uses hardware page walks but allows software intervention:
Access Flag Faults: If a PTE's Access Flag (AF) is not set, ARM can either:
- Set it automatically (pure hardware), or
- Generate an Access Flag fault for software to handle
This lets OS implement custom page tracking without full software TLB management.
Translation Table Walk Disable: ARM allows disabling hardware page table walks for a translation table base via TCR_EL1.EPD0/EPD1; accesses through that base then fault, letting software take over if desired.
While the RISC-V privileged specification keeps architectural TLB management minimal (a single fence, SFENCE.VMA), extensions can add hardware assistance:
Svpbmt Extension: Adds page-based memory types (analogous to x86 PAT) without requiring full hardware page walks.
Svnapot Extension: Naturally-aligned power-of-two (NAPOT) regions, allowing more flexible large page support.
x86 allows software to traverse page tables using regular memory accesses, useful for:
- Debugging and introspection
- Custom translation mechanisms
- Security monitoring
The CR3 register can be read by software, and page tables are in normal memory (though often write-protected by the OS).
Context switches—changing which process is running on a CPU—are extremely common in modern systems. A typical desktop or server experiences hundreds to thousands of context switches per second. Without optimization, each context switch would require flushing the entire TLB, destroying all cached translations and causing expensive TLB misses until the TLB refills with the new process's translations.
Address Space Identifiers solve this problem by allowing the TLB to cache translations from multiple processes simultaneously, tagging each entry with a process ID. This section explores three mechanisms: ASID (ARM), PCID (x86), and VMID (virtualization).
Without address space identifiers, context switching requires:
Old Behavior:
1. Process A is running
- TLB contains A's translations
- TLB hits provide fast translation
2. Context switch to Process B
- Must flush TLB (except global entries)
- Write CR3/TTBR0 with B's page table base
3. Process B starts running
- Every memory access misses TLB initially
- Expensive page table walks refill TLB
- Takes 500-2000 cycles to warm up TLB
4. Context switch back to Process A
- Flush TLB again
- A's translations gone despite recent use
- Another 500-2000 cycle warmup
Performance impact measurement:
Assuming:
- Context switches every 10ms (quantum)
- 1000 cycles TLB warmup cost per switch
- 4 GHz CPU
Cost per second:
100 switches/sec × 1000 cycles/switch = 100,000 cycles/sec
= 0.0025% of CPU time
This seems small, but:
- Real warmup cost is higher for memory-intensive applications
- Cache pollution from context switches adds overhead
- High-frequency context switches (server workloads) amplify the problem
Studies show context switch overhead can reach 1-5% of CPU time on busy servers.
ARM's solution to the context switch problem is the Address Space Identifier (ASID), a tag attached to each TLB entry identifying which process it belongs to.
Size: 8-bit or 16-bit (implementation-defined)
- 8-bit ASID: 256 distinct address spaces
- 16-bit ASID: 65,536 distinct address spaces

Storage: The ASID lives in the translation table base registers:
- TTBR0_EL1[63:48]: ASID for user-space translations
- TTBR1_EL1[63:48]: ASID for kernel-space translations (often ignored since kernel is global)
TLB Entry Format:
| VPN | ASID (8-16 bits) | PFN | Flags |
|---|---|---|---|
| Virtual page tag | Address Space ID | Physical frame | R/W/X/U/G |
ASID tags each entry with the address space it belongs to, eliminating full TLB flushes on context switch. ARM64 supports 8-bit (256) or 16-bit (65,536) ASIDs.
1. Memory access to virtual address VA
2. Extract VPN from VA
3. Read current ASID from TTBR0_EL1
4. TLB lookup: Match (VPN, ASID)
- If found → TLB HIT (use cached translation)
- If not found → TLB MISS (page walk or software refill)
Key property: TLB can hold translations from multiple processes simultaneously:
TLB Contents Example:
Entry 0: VPN=0x1000, ASID=0x05, PFN=0x9000 (Process A)
Entry 1: VPN=0x1000, ASID=0x12, PFN=0x7000 (Process B) ← Different PFN!
Entry 2: VPN=0x2000, ASID=0x05, PFN=0x8000 (Process A)
Entry 3: VPN=0x3000, ASID=0x12, PFN=0xA000 (Process B)
Notice entries 0 and 1 have the same VPN but different ASIDs—they're different virtual pages in different processes, mapped to different physical frames.
Optimized Behavior:
1. Process A running (ASID=0x05)
- TLB contains A's translations
2. Context switch to Process B (ASID=0x12)
- Update TTBR0_EL1 with B's page table base and ASID
- NO TLB FLUSH
3. Process B starts running
- TLB lookups use VPN + ASID=0x12
- B's entries (if still in TLB) hit immediately
- A's entries remain but won't match (wrong ASID)
- No warmup delay if B's entries still cached
4. Context switch back to A (ASID=0x05)
- Update TTBR0_EL1 to A's page table + ASID
- A's entries still in TLB → immediate hits!
Performance improvement: Eliminates TLB flush overhead, especially beneficial for workloads with frequent context switches between small set of processes.
Challenge: Modern systems run thousands of processes, but ASID space is limited (256 or 65,536 values). How should OS allocate ASIDs?
Strategy 1: Simple Allocation
asid_t next_asid = 1; // 0 often reserved
asid_t allocate_asid() {
if (next_asid == MAX_ASID) {
// Rollover: flush entire TLB
tlb_flush_all();
next_asid = 1;
}
return next_asid++;
}

Pros: Simple, guarantees unique ASIDs for active processes
Cons: Periodic TLB flushes when ASID space exhausted
Strategy 2: Lazy Allocation
Linux on ARM uses a more sophisticated scheme:
// Generation number tracks ASID epochs
uint64_t asid_generation = 1;
struct task_struct {
uint64_t asid_generation; // When ASID was allocated
asid_t asid; // Current ASID
};
void switch_to_task(task_t *task) {
    if (task->asid_generation != asid_generation) {
        // ASID is stale (from an old generation)
        task->asid = allocate_new_asid();
        task->asid_generation = asid_generation;
    }
    write_ttbr0(task->pgd, task->asid);
}

When ASIDs are exhausted:
void asid_rollover() {
// Increment generation
asid_generation++;
// All old ASIDs now stale
// Active processes get new ASIDs on next switch
// Flush TLB entries from old generation
tlb_flush_all();
}

Pros: Only active processes consume ASIDs
Cons: Still requires periodic flushes
Strategy 3: Reserved ASIDs
Reserve ASIDs for frequently-used processes:
// Reserve ASID 1 for init process
// Reserve ASID 2 for systemd
// etc.

Ensures critical processes never lose their TLB entries.
Intel and AMD x86-64 processors introduced Process Context Identifier (PCID) as an optional feature to mirror ARM's ASID functionality.
Size: 12-bit (4096 distinct contexts)
Enabling: Set CR4.PCIDE = 1 to enable PCID support
Storage: Lower 12 bits of CR3 contain PCID when enabled:
CR3 with PCID enabled:
Bits [11:0]: PCID (process context identifier)
Bits [51:12]: Page table physical base address
Bit [63]: On writes, controls whether TLB entries for the new PCID are flushed
TLB Entry Format: Similar to ARM, each TLB entry tagged with PCID
1. Space Size: 12 bits (4,096 contexts), more than ARM's 8-bit ASIDs (256) but fewer than 16-bit ASIDs (65,536)
2. Voluntary Flush Control:
When writing CR3, bit 63 controls TLB flushing:
// Writing CR3 with bit 63 clear flushes TLB entries tagged with the new PCID
write_cr3(new_pgd | new_pcid);
// Writing CR3 with bit 63 set keeps old TLB entries
write_cr3(new_pgd | new_pcid | CR3_NOFLUSH);

This allows software to choose when to flush on context switches.
3. Global Pages Remain Global:
Even with PCID, pages marked Global (G-bit) are shared across all PCIDs. Kernel mappings remain global.
x86 provides explicit TLB invalidation control via the INVPCID instruction:
Invocation:
invpcid rax, [rbx]
Where:
- rax: Type of invalidation (0-3)
- [rbx]: Descriptor (PCID + linear address)
Invalidation Types:
Type 0: Individual Address
Invalidate single address in specific PCID
Descriptor: {PCID, LinearAddress}
Type 1: Single PCID
Invalidate all non-global entries for specific PCID
Descriptor: {PCID, ignored}
Type 2: All PCIDs, including global
Invalidate ALL entries (global and non-global)
Descriptor: ignored
Type 3: All PCIDs, non-global
Invalidate all non-global entries in all PCIDs
Descriptor: ignored
Example usage:
// Switch to new process
void switch_mm(mm_context_t *new_mm) {
uint64_t new_cr3 = new_mm->pgd | new_mm->pcid | CR3_NOFLUSH;
write_cr3(new_cr3);
// If we reused a PCID, invalidate old entries selectively
if (new_mm->pcid_needs_flush) {
invpcid_single_context(new_mm->pcid);
new_mm->pcid_needs_flush = false;
}
}

When running virtual machines, we have an additional layer: each VM is a separate address space. Without optimization, switching between VMs would require flushing the TLB, just like switching processes.
Virtual Machine Identifiers (VMID) extend the ASID/PCID concept to virtualization.
Size: 16-bit (65,536 distinct VMs)
Purpose: Tag TLB entries with VM identifier to avoid flushes on VM switch
Storage: VMCS (Virtual Machine Control Structure) contains VPID for each VM
TLB Entry Format:
| VPN | PCID (12 bits) | VPID (16 bits) | PFN | Flags |
|---|---|---|---|---|
| Virtual page | Guest process | Guest VM | Physical frame | R/W/X/U |
VPID (Virtual Processor Identifier) tags entries per guest VM; PCID tags per guest process. Together they allow host and all guest TLB entries to coexist without flushing on VM-exit or context switch.
Notice: An entry can carry both a PCID and a VPID, allowing the TLB to cache:
- Multiple processes in multiple VMs simultaneously
- Example: VM1's Process A, VM1's Process B, VM2's Process A, Host OS
Lookup:
Match (VPN, PCID, VPID) → TLB hit
Size: 8-bit or 16-bit (implementation-defined)
Storage: VTTBR_EL2[63:48] contains VMID for Stage 2 translation
Combination with ASID:
ARM can tag entries with both ASID and VMID:
Entry: VPN, ASID (guest process), VMID (which VM), PFN
This enables caching:
- Multiple guest processes in multiple VMs
- Host processes
- All without TLB flushes on switches
Size: Implementation-defined (typically 7-14 bits)

Storage: hgatp register bits [57:44] contain the VMID
Integration: Works with ASID mechanism:
TLB Entry: VPN, ASID, VMID, PFN
Without VMID:
VM Switch Cost:
1. Save VM1 state
2. Flush TLB (lose all VM1's translations)
3. Load VM2 state
4. VM2 starts cold (all TLB misses)
Cost: 2,000-5,000 cycles
With VMID:
VM Switch Cost:
1. Save VM1 state
2. Update VMID register (VM1 entries stay in TLB)
3. Load VM2 state
4. VM2 starts warm (cached entries still valid)
Cost: 500-1,000 cycles
Improvement: 2-5× faster VM switches, critical for cloud workloads with many VMs per host.
Processors provide explicit instructions to manage TLB state. Understanding these operations is essential for OS developers implementing memory management and for anyone debugging TLB-related issues.
INVLPG (Invalidate Page):
invlpg [address]
Invalidates the TLB entry (if present) for the page containing address on the current CPU.
Usage:
// After changing a PTE, invalidate the corresponding TLB entry
pte_t *pte = get_pte(addr);
pte->present = 0;
asm volatile("invlpg (%0)" :: "r"(addr) : "memory");Scope: Local processor only (does not affect other CPUs)
INVPCID (Invalidate Process Context ID):
More flexible invalidation with four types (described in Section 4.4.3):
- Type 0: Single address in specific PCID
- Type 1: All addresses in specific PCID
- Type 2: All addresses in all PCIDs, including global
- Type 3: All non-global addresses in all PCIDs
Example:
// Invalidate entire address space for PCID 5
uint64_t desc[2] = {5, 0}; // PCID=5, address ignored
asm volatile("invpcid (%0), %1" :: "r"(desc), "r"(1) : "memory");ARM provides a rich set of TLB invalidation instructions:
Basic Invalidation:
TLBI VAE1IS, X0 ; Invalidate VA in EL1, inner shareable
TLBI ASIDE1IS, X0 ; Invalidate all entries for ASID, inner shareable
TLBI VMALLE1IS ; Invalidate all EL1 entries, inner shareable
Naming Convention:
- VA: Virtual Address
- ASIDE: Address Space ID Entry
- VMALLE: Virtual Memory All Entries
- E1/E2: Exception Level (privilege level)
- IS: Inner Shareable (affects other CPUs in cluster)
Invalidate specific address:
void tlb_invalidate_page(vaddr_t addr) {
    // TLBI VAE1IS operand: VA[55:12] in bits [43:0], ASID in bits [63:48]
    uint64_t desc = ((addr >> 12) & 0xFFFFFFFFFFFULL) | ((uint64_t)current_asid << 48);
    asm volatile("tlbi vae1is, %0" :: "r"(desc));
    asm volatile("dsb ish"); // Data synchronization barrier
    asm volatile("isb");     // Instruction synchronization barrier
}

Instruction: SFENCE.VMA rs1, rs2
Semantics:
- rs1: Virtual address (x0 = all addresses)
- rs2: ASID (x0 = all ASIDs)
Examples:
# Invalidate a specific address across all ASIDs
sfence.vma a0, zero
# Invalidate all addresses in specific ASID
sfence.vma zero, a1
# Invalidate entire TLB
sfence.vma zero, zero
Usage:
// After modifying page table
pte_t *pte = walk_page_table(addr);
pte->flags |= PTE_WRITE;
// Invalidate TLB entry
asm volatile("sfence.vma %0, zero" :: "r"(addr));SFENCE.VMA is expensive: It's a serializing instruction that may flush the entire TLB. OSes should batch invalidations when possible.
Problem: TLB invalidation instructions are local to one CPU. If multiple CPUs are executing the same process, their TLBs may all cache a stale translation that needs invalidation.
TLB Shootdown is the process of invalidating TLB entries on remote CPUs.
Scenario: CPU 0 modifies a PTE that might be cached in other CPUs' TLBs
Protocol:
CPU 0 (initiator):
1. Change PTE in memory
2. Invalidate local TLB entry
3. Send IPI (Inter-Processor Interrupt) to all other CPUs
4. Wait for acknowledgment from all CPUs
5. Continue execution
CPU 1, 2, 3, ... (targets):
1. Receive IPI
2. Interrupt current execution
3. Execute IPI handler:
- Invalidate TLB entry
- Send acknowledgment back to CPU 0
4. Resume execution
void tlb_shootdown(vaddr_t addr) {
cpumask_t *mask = mm_cpumask(current->mm); // CPUs running this process
// Invalidate local TLB first
local_tlb_invalidate(addr);
// Send IPI to remote CPUs
smp_call_function_many(mask,
remote_tlb_invalidate, // Function to execute
(void*)addr, // Argument
1); // Wait for completion
}
// Executed on remote CPUs via IPI
void remote_tlb_invalidate(void *addr_ptr) {
vaddr_t addr = (vaddr_t)addr_ptr;
local_tlb_invalidate(addr);
}

IPI overhead:
- Send IPI: ~100-200 cycles
- Receive and handle IPI: ~200-500 cycles per CPU
- Wait for acknowledgments: ~100-1000 cycles
Total for 4-CPU system: ~1,000-3,000 cycles
For 64-CPU system: ~10,000-40,000 cycles
This is why minimizing TLB shootdowns is critical for scalability.
Instead of shooting down individual pages:
// BAD: Many TLB shootdowns
for (int i = 0; i < 1000; i++) {
pte_t *pte = &page_table[i];
pte->flags |= PTE_WRITE;
tlb_shootdown(i * PAGE_SIZE); // IPI storm!
}

Better: Batch them:
// GOOD: Single batched shootdown
for (int i = 0; i < 1000; i++) {
pte_t *pte = &page_table[i];
pte->flags |= PTE_WRITE;
// Don't shootdown yet
}
tlb_shootdown_range(start, end); // One IPI for entire range

Modern processors support invalidating ranges or whole contexts:
- x86: INVPCID type 3 invalidates all non-global entries across PCIDs in one operation
- ARM: TLBI range instructions (e.g., TLBI RVAE1IS, ARMv8.4) invalidate a range of addresses
- RISC-V: Multiple SFENCE.VMA invocations (can be optimized by hardware)
For rarely-used pages, defer shootdown until necessary:
struct mm_struct {
cpumask_t tlb_flush_pending; // CPUs needing flush
};
// Mark flush pending, don't IPI immediately
void lazy_shootdown(vaddr_t addr) {
mm->tlb_flush_pending = mm->cpumask;
// Actual flush happens on next context switch to this mm
}

Tradeoff: Potential for stale translations vs. IPI overhead
One of the most powerful TLB optimizations is using large pages (also called huge pages or superpages). By mapping larger contiguous regions with a single TLB entry, large pages dramatically improve TLB coverage—the amount of memory accessible through the TLB.
TLB Coverage is the total amount of memory that can be addressed by all TLB entries:
TLB Coverage = (Number of TLB entries) × (Page size)
For a typical L1 DTLB with 64 entries and 4KB pages:
Coverage = 64 entries × 4 KB = 256 KB
Problem: Modern applications routinely use gigabytes of memory. With 256 KB coverage:
- Accessing 1 GB of memory requires TLB entries for 262,144 pages
- With only 64 entries, we can cache 0.024% of needed translations
- TLB miss rate approaches 99.9%+ for large working sets
Real-world example: A database scanning a 10 GB table with 4KB pages:
Pages needed: 10 GB / 4 KB = 2,621,440 pages
L1 TLB entries: 64
L2 TLB entries: 1,536
Total TLB coverage: 1,600 entries × 4 KB = 6.4 MB (0.064% of dataset)
Result: Catastrophic TLB miss rate, performance 10-50× slower
Modern processors support multiple page sizes:
| Architecture | Page Sizes |
|---|---|
| x86-64 | 4 KB, 2 MB, 1 GB (4-level), 512 GB (5-level) |
| ARM64 | 4 KB, 2 MB, 1 GB (4 KB granule); 16 KB, 32 MB (16 KB granule); 64 KB, 512 MB (64 KB granule) |
| RISC-V | 4 KB, 2 MB, 1 GB (Sv39), 512 GB (Sv48) |
Using 2 MB pages:
L1 DTLB coverage: 64 entries × 2 MB = 128 MB (512× improvement)
L2 STLB coverage: 1,536 entries × 2 MB = 3 GB (512× improvement)
Using 1 GB pages:
L1 DTLB coverage: 4 entries × 1 GB = 4 GB (16,384× improvement)
Revisiting the 10 GB database scan with 2 MB pages:
Pages needed: 10 GB / 2 MB = 5,120 pages
TLB coverage: 1,600 entries × 2 MB = 3.2 GB (32% of dataset)
Result: Dramatically improved TLB hit rate
Performance improvement: 2-10× for memory-intensive operations
2 MB Pages (Page Directory Entry mapping):
In the page table hierarchy, a PDE (Page Directory Entry) normally points to a page table. With the PS (Page Size) bit set, it instead maps a 2 MB page directly:
Normal 4KB Mapping:
PML4E → PDPTE → PDE → PTE → 4 KB page
2 MB Large Page:
PML4E → PDPTE → PDE (PS=1) → 2 MB page
↑
Terminates walk here
PDE format for 2 MB page:
Bits [51:21]: Physical address (2 MB aligned)
Bit [7] (PS): 1 (indicates 2 MB page)
Bits [20:13]: Reserved (must be zero)
Other bits: R/W, U/S, PWT, PCD, A, D, G, PAT, NX
1 GB Pages (Page Directory Pointer Table Entry mapping):
Similarly, a PDPTE with PS=1 maps a 1 GB page:
Normal Path:
PML4E → PDPTE → PDE → PTE → 4 KB page
1 GB Large Page:
PML4E → PDPTE (PS=1) → 1 GB page
↑
Terminates here
Intel TLB organization for large pages:
Intel processors typically have separate TLB entries or sections for different page sizes:
L1 DTLB (Ice Lake):
- 64 entries for 4KB pages
- 32 entries for 2MB/4MB pages (shared pool)
- 4 entries for 1GB pages
L2 STLB:
- 1536 entries for 4KB/2MB pages (unified)
- 16 entries for 1GB pages (separate)
Notice: Unified vs. separate pools is implementation-dependent.
ARM64 calls large pages "blocks" and supports them at different table levels:
4KB Granule:
- Level 3: 4 KB pages (leaf entries)
- Level 2: 2 MB blocks
- Level 1: 1 GB blocks

16KB Granule:
- Level 3: 16 KB pages
- Level 2: 32 MB blocks

64KB Granule:
- Level 3: 64 KB pages
- Level 2: 512 MB blocks
Descriptor Type Encoding:
ARM uses the descriptor type field to distinguish blocks from tables:
Bits [1:0]:
00 = Invalid
01 = Block entry (Level 1 or 2)
11 = Table entry (points to next level) or Page entry (Level 3)
ARM Cortex-A77 TLB:
L1 Instruction TLB:
- 48 entries, fully associative
- Supports all page sizes
L1 Data TLB:
- 48 entries, fully associative
- Supports all page sizes
L2 TLB:
- 1024 entries, 4-way set-associative
- Unified for instructions and data
- Supports all page sizes
RISC-V uses an elegant encoding: any PTE with R/W/X bits (leaf entry) can be a large page at its level.
Sv39 (3-level):
Level 2: VPN[2] → 1 GB page (if R/W/X set)
Level 1: VPN[1] → 2 MB page (if R/W/X set)
Level 0: VPN[0] → 4 KB page (if R/W/X set)
PTE encoding:
R W X | Meaning
0 0 0 | Pointer to next level (not a leaf)
0 0 1 | Executable page (leaf)
0 1 0 | Reserved (writable but not readable)
0 1 1 | Reserved (writable but not readable)
1 0 0 | Read-only page (leaf)
1 0 1 | Read-execute page (leaf)
1 1 0 | Read-write page (leaf)
1 1 1 | Read-write-execute page (leaf)
Any other non-zero R/W/X combination makes the PTE a leaf (the reserved write-without-read encodings excepted), with page size determined by the level.
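A tiny C helper capturing this encoding (the flag values follow the RISC-V privileged specification's PTE bit positions):

#include <stdbool.h>
#include <stdint.h>

#define PTE_V (1ULL << 0)
#define PTE_R (1ULL << 1)
#define PTE_W (1ULL << 2)
#define PTE_X (1ULL << 3)

/* A valid PTE with any of R/W/X set is a leaf; otherwise it points to
 * the next table level. The page size follows from the walk level. */
static inline bool pte_is_leaf(uint64_t pte) {
    return (pte & PTE_V) && (pte & (PTE_R | PTE_W | PTE_X));
}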
Challenge: Applications must explicitly request large pages via mmap() with the MAP_HUGETLB flag or use hugetlbfs. This requires code changes and isn't transparent.
Solution: Transparent Huge Pages (THP) automatically promote 4KB pages to 2MB pages when possible.
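On Linux, an application can also hint that a region should use huge pages without hugetlbfs. A minimal sketch using the standard mmap/madvise interfaces (error handling abbreviated; the region size is an arbitrary example):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define REGION_SIZE (64UL * 1024 * 1024)  /* 64 MB, a multiple of 2 MB */

int main(void) {
    /* Anonymous mapping; under THP the kernel may back it with 2MB pages. */
    void *buf = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Ask the kernel to prefer transparent huge pages for this range. */
    if (madvise(buf, REGION_SIZE, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    /* Touch the memory so pages (ideally 2MB ones) are actually allocated. */
    for (size_t i = 0; i < REGION_SIZE; i += 4096)
        ((volatile char *)buf)[i] = 1;

    munmap(buf, REGION_SIZE);
    return 0;
}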
Promotion (4KB → 2MB):
Conditions for promotion:
1. Process allocates 512 contiguous 4KB pages (= 2MB)
2. Physical memory has contiguous 2MB region available
3. No holes or conflicting mappings
4. Region is sufficiently aligned
Process:
1. Allocate 2MB physically-contiguous region
2. Copy data from 512 small pages to new region
3. Update page table: replace 512 PTEs with single 2MB PDE
4. Free old 4KB pages
5. Invalidate TLB (single entry now covers 2MB)
Demotion (2MB → 4KB):
When memory pressure increases or pages need to be swapped individually:
1. Allocate 512 separate 4KB pages
2. Copy data from 2MB page
3. Update page table: replace 2MB PDE with 512 PTEs
4. Free 2MB page
5. TLB invalidation (now need 512 entries)
Benefits:
- Transparent to applications (no code changes)
- Dramatic TLB coverage improvement (512×)
- Reduced page table memory overhead
- Better memory bandwidth utilization

Costs:
- Compaction overhead: Creating 2MB contiguous regions requires moving pages
- Memory bloat: If only 4KB of a 2MB page is used, 2044KB is wasted
- Page fault latency: Allocating 2MB takes longer than 4KB
- TLB shootdown cost: Splitting 2MB pages affects all CPUs

When THP helps:
- Large sequential access patterns (databases, scientific computing)
- Streaming workloads (video processing, ML training)
- Memory-mapped files (large datasets)

When THP hurts:
- Random access with poor spatial locality
- Small allocations (memory bloat)
- Frequent mmap/munmap (allocation overhead)
Benchmark: STREAM Memory Bandwidth Test
Measures sustained memory bandwidth with large arrays:
4KB Pages:
Copy: 120 GB/s
Scale: 115 GB/s
Add: 125 GB/s
Triad: 130 GB/s
2MB Pages (Huge Pages):
Copy: 145 GB/s (+21%)
Scale: 140 GB/s (+22%)
Add: 155 GB/s (+24%)
Triad: 160 GB/s (+23%)
Why the improvement?
- Reduced TLB misses (99% → 20% miss rate)
- Fewer page table walks
- Better memory-level parallelism
Benchmark: Database Index Scan
10 GB B-tree index scan with random accesses:
4KB Pages:
Queries/sec: 12,500
Avg latency: 80 ms
TLB miss rate: 45%
2MB Pages:
Queries/sec: 34,000 (+172%)
Avg latency: 29 ms (-64%)
TLB miss rate: 8%
Improvement analysis:
- TLB miss reduction: 45% → 8% (80% reduction)
- Each miss costs ~100 cycles
- Savings: 37% of accesses × 100 cycles = 37 cycles/access saved
- At 10^9 accesses: 37 billion cycles saved = ~9 seconds @ 4GHz
For very large memory footprints (100GB+ datasets), even 2MB pages may not provide sufficient TLB coverage. 1GB pages offer an additional 512× improvement.
Requirements for 1GB pages:
- Processor support (most modern x86-64, ARM64)
- Physically contiguous 1GB regions (challenging!)
- Boot-time reservation or kernel-level allocation
Usage via hugetlbfs:
# Reserve 8 one-gigabyte pages (also possible at boot via hugepagesz=1G hugepages=8)
echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# Create hugetlbfs mount
mkdir /mnt/huge_1GB
mount -t hugetlbfs -o pagesize=1G none /mnt/huge_1GB
# Application uses mmap()
fd = open("/mnt/huge_1GB/myfile", O_CREAT | O_RDWR);
ptr = mmap(NULL, 8UL << 30, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); // 8 GB

Real-world use case: Database buffer pool:
Database with 128 GB buffer pool:
4KB pages:
- Pages needed: 33,554,432
- TLB entries: 1,600 (0.005% coverage)
- TLB miss rate: ~99%
- Performance: 50 MB/s scan rate
2MB pages:
- Pages needed: 65,536
- TLB entries: 1,600 (2.4% coverage)
- TLB miss rate: ~60%
- Performance: 850 MB/s (+17×)
1GB pages:
- Pages needed: 128
- TLB entries: 16 (12.5% coverage)
- TLB miss rate: ~15%
- Performance: 2,800 MB/s (+56×)
Modern processors have multiple cores, each with its own private TLB. When one core modifies a page table entry, other cores' TLBs may contain stale translations. Maintaining TLB coherency across cores is essential for correctness.
Scenario: Two cores running the same process
Initial state:
- Virtual page 0x1000 maps to physical frame 0x9000
- Both CPU0's TLB and CPU1's TLB cache this translation
CPU0:
1. Changes PTE: 0x1000 now maps to 0xA000 (copy-on-write)
2. Invalidates its own TLB entry
3. Continues execution with correct translation
CPU1:
1. Still has old translation (0x1000 → 0x9000) in TLB
2. Accesses virtual address 0x1000
3. TLB HIT with stale entry!
4. Reads from wrong physical page (0x9000 instead of 0xA000)
5. CORRECTNESS VIOLATION
Without TLB coherency, this leads to:
- Reading stale data
- Lost writes (writing to old copy)
- Security violations (accessing freed pages)
We introduced TLB shootdown in Section 4.5.2. Here we examine it in more detail with focus on correctness and optimization.
Algorithm:
void change_pte_strict(pte_t *pte, pte_t new_value) {
cpumask_t *cpus = get_active_cpus();
// Step 1: Prevent new accesses with old translation
smp_mb(); // Memory barrier
// Step 2: Modify PTE
WRITE_ONCE(*pte, new_value);
// Step 3: Local TLB invalidation
local_tlb_flush(pte_address);
// Step 4: Send IPI to all other CPUs
smp_call_function_many(cpus, remote_tlb_flush, pte_address, 1);
// Step 5: Wait for all CPUs to acknowledge
// (handled by smp_call_function_many's wait flag)
}

Guarantees:
- After return, NO CPU has stale TLB entry
- Safe to free old physical page
- Safe to reuse old physical page for different purpose
Cost: Synchronous IPIs to all CPUs (expensive!)
Optimization: Defer TLB flushes until strictly necessary
struct mm_struct {
cpumask_t tlb_flush_pending; // CPUs with pending flushes
};
void change_pte_deferred(pte_t *pte, pte_t new_value) {
// Step 1: Modify PTE
WRITE_ONCE(*pte, new_value);
// Step 2: Local invalidation
local_tlb_flush(pte_address);
// Step 3: Mark other CPUs as needing flush
mm->tlb_flush_pending |= get_active_cpus();
// No IPI sent yet!
}
// Flush happens lazily:
void context_switch_to_mm(mm_struct *mm) {
if (cpumask_test_cpu(cpu, &mm->tlb_flush_pending)) {
// Flush entire TLB for this mm
local_tlb_flush_all();
cpumask_clear_cpu(cpu, &mm->tlb_flush_pending);
}
load_mm(mm);
}

Benefits:
- No immediate IPI cost
- Flush batched with context switch (amortized cost)
Restrictions:
- Cannot free page immediately (some CPUs may access it)
- Requires tracking which pages are "in flight"
Some architectures provide hardware assistance for TLB coherency:
ARM defines memory shareability domains that affect TLB coherency:
Inner Shareable (IS):
TLBI VALE1IS, X0 ; Invalidate for all CPUs in inner shareable domain
Hardware automatically broadcasts TLB invalidation to all cores in the cluster. No software IPI needed!
Outer Shareable (OS):
TLBI VALE1OS, X0 ; Invalidate across multiple clusters
Non-Shareable (NS):
TLBI VALE1, X0 ; Local CPU only
Benefits:
- Reduced software overhead (no IPI handling)
- Lower latency (hardware paths faster than interrupts)
- Simpler software
On some x86 implementations, certain TLB operations are broadcast automatically within a coherency domain, though this is not architecturally guaranteed. Software still must use IPIs for correctness.
TLB coherency interacts with memory ordering and barriers:
Problem:
CPU 0:                          CPU 1:
pte->flags |= PTE_WRITE;        data = *ptr;
invalidate_tlb(ptr);            // Which flags does CPU1's TLB see?

Without proper barriers, CPU1 might:
- Read old PTE from TLB (before write visible)
- Read new PTE but with stale flags cached
Solution: Memory Barriers
CPU 0:
WRITE_ONCE(pte->flags, new_flags); // Store-release semantics
smp_wmb(); // Write memory barrier
invalidate_tlb_globally(ptr); // Ensures visibility
CPU 1:
// TLB miss forces PTE reload
pte = READ_ONCE(*pte_ptr); // Load-acquire semantics
smp_rmb(); // Read memory barrier

x86 ordering: x86 has strong ordering (TSO: Total Store Order), so explicit barriers are rarely needed
ARM ordering: ARM has weak ordering, requiring explicit barriers:
DSB ISH ; Data Synchronization Barrier, Inner Shareable
ISB ; Instruction Synchronization Barrier
Problem: Many page table modifications in a short time (e.g., munmap() of a large region)
Bad approach:
for (int i = 0; i < 10000; i++) {
pte_t *pte = &page_table[i];
pte->present = 0;
tlb_shootdown(pte); // IPI for EACH page!
}
// Total: 10,000 shootdowns × ~10,000-40,000 cycles each ≈ 25-100 ms on a 64-core system
struct tlb_batch {
vaddr_t addresses[256];
int count;
cpumask_t cpus;
};
void batched_shootdown(vaddr_t start, size_t len) {
struct tlb_batch batch;
batch.count = 0;
batch.cpus = get_active_cpus();
for (addr = start; addr < start + len; addr += PAGE_SIZE) {
pte_t *pte = get_pte(addr);
pte->present = 0;
batch.addresses[batch.count++] = addr;
if (batch.count >= 256) {
flush_batch(&batch);
batch.count = 0;
}
}
if (batch.count > 0) {
flush_batch(&batch);
}
}
void flush_batch(struct tlb_batch *batch) {
// Local flush
for (int i = 0; i < batch->count; i++) {
local_tlb_flush(batch->addresses[i]);
}
// Single IPI for entire batch
smp_call_function_many(batch->cpus, remote_flush_batch, batch, 1);
}Performance: - Before: 10,000 IPIs = 40,000-80,000 cycles - After: 40 IPIs = 1,600-3,200 cycles (25× faster)
On NUMA (Non-Uniform Memory Access) systems, TLB coherency becomes more complex:
Challenges:
- CPUs on different NUMA nodes have higher IPI latency
- Page tables may reside on remote NUMA nodes
- Memory barriers have different costs across nodes
Optimization: Lazy cross-node shootdown
void shootdown_numa_aware(pte_t *pte, pte_t new_val) {
cpumask_t *local_cpus, *remote_cpus;
partition_cpus_by_numa(&local_cpus, &remote_cpus);
// Immediate shootdown on local node (low latency)
smp_call_function_many(local_cpus, tlb_flush, pte, 1);
// Deferred shootdown on remote nodes
mark_flush_pending(remote_cpus, pte);
}

Now we examine real TLB implementations across major processor families, understanding the specific design choices, capabilities, and performance characteristics of modern hardware.
L1 Data TLB (DTLB):
4KB pages:
- 64 entries, 4-way set-associative
- 1-cycle hit latency
2MB/4MB pages:
- 32 entries, 4-way set-associative
- 1-cycle hit latency
1GB pages:
- 4 entries, fully associative
- 1-cycle hit latency
L1 Instruction TLB (ITLB):
4KB pages:
- 128 entries, fully associative
- 1-cycle hit latency
2MB/4MB pages:
- 8 entries, fully associative
- 1-cycle hit latency
L2 Shared TLB (STLB):
4KB/2MB pages:
- 1536 entries, 12-way set-associative
- 5-cycle hit latency
1GB pages:
- 16 entries, 4-way set-associative
- 5-cycle hit latency
Page Walk Cache:
Intel adds caches for intermediate page table entries:
PML4 cache: 16 entries (caches PML4 entries)
PDPT cache: 32 entries (caches PDPT entries)
PD cache: 256 entries (caches PD entries)
Performance characteristics:
L1 DTLB hit: 1 cycle (0.25 ns @ 4 GHz)
L1 DTLB miss, L2 STLB hit: 5 cycles (1.25 ns)
L2 STLB miss, PWC hit: 10-20 cycles (2.5-5 ns)
Full page walk (L3 hit): 40-60 cycles (10-15 ns)
Full page walk (DRAM): 100-200 cycles (25-50 ns)
TLB Miss Rate Observations (SPEC CPU2017):
| Workload | L1 DTLB Miss Rate | L2 STLB Miss Rate | PWC Hit Rate |
|---|---|---|---|
| mcf (pointer) | 12.3% | 8.2% | 45% |
| lbm (streaming) | 3.8% | 2.1% | 75% |
| gcc (code) | 1.2% | 0.3% | 85% |
| average | 2.5% | 1.1% | 72% |
Improvements over Ice Lake:
- Larger L2 STLB: 2048 entries (vs 1536)
- Enhanced PWC: 512 PD entries (vs 256)
- Support for 5-level paging (57-bit virtual addresses)
- Improved TLB replacement algorithms
L1 Data TLB:
All page sizes (4KB/2MB/1GB):
- 96 entries, fully associative
- 1-cycle hit latency
L1 Instruction TLB:
All page sizes:
- 96 entries, fully associative
- 1-cycle hit latency
L2 Unified TLB:
4KB pages:
- 3072 entries, 12-way set-associative
- ~7 cycle hit latency
2MB pages:
- 3072 entries shared with 4KB
- ~7 cycle hit latency
1GB pages:
- 96 entries
- ~7 cycle hit latency
Key AMD innovations:
- Unified L2 TLB: No separation between instruction and data
- Very large L2: 3072 entries provide excellent coverage
- Fully-associative L1: Eliminates conflict misses
Performance comparison (normalized to Intel):
| Metric | Intel Ice Lake | AMD Zen 4 |
|---|---|---|
| L1 TLB capacity | 192 entries | 192 entries |
| L2 TLB capacity | 1552 entries | 3168 entries |
| L2 TLB hit rate | ~95% | ~97-98% |
| Memory-intensive perf | 1.0× baseline | 1.15-1.25× |
L1 Instruction TLB:
Fully associative:
- 48 entries
- Supports 4KB, 16KB, 64KB, 2MB, 32MB, 1GB pages
- ~1 cycle hit latency
L1 Data TLB:
Fully associative:
- 48 entries per cluster
- Supports all page sizes
- ~1 cycle hit latency
L2 Unified TLB:
4-way set-associative:
- 1280 entries
- Unified for instructions and data
- ~6 cycle hit latency
- Supports all page sizes
ARM-specific features:
ASID Support:
- 16-bit ASID in Cortex-A78 (65,536 address spaces)
- No TLB flush on context switch
- ASID in every TLB entry

VMID Support:
- 16-bit VMID for virtualization
- Each TLB entry tagged with (ASID, VMID)
- Supports many VMs without TLB pollution

Inner/Outer Shareability:
- TLB invalidation can target specific shareability domains
- Reduces unnecessary TLB flushes
Performance characteristics:
L1 TLB hit: 1 cycle
L1 miss, L2 hit: 6 cycles
L2 miss, page walk: 50-150 cycles (no PWC in Cortex-A78)
TLB Configuration:
Instruction TLB:
- 32 entries, fully associative
- Supports 4KB, 2MB, 1GB pages (Sv39)
Data TLB:
- 40 entries, fully associative
- Supports 4KB, 2MB, 1GB pages
Software-Managed:
- TLB miss generates exception
- Software walks page tables
- Software inserts TLB entry via CSR writes
ASID Support: - 16-bit ASID in satp
register (Sv39 mode) - Software can use sfence.vma with
ASID parameter
Performance:
TLB hit: 1 cycle
TLB miss (software handling): 50-200 cycles depending on handler
Trade-offs vs Hardware-Managed:
Advantages:
- Simpler hardware (smaller chip area, lower power)
- Flexible page table formats
- Easy to add custom TLB management features
Disadvantages:
- Higher TLB miss latency (software path)
- No page walk cache (no hardware walker)
- Handler complexity
SiFive U74 TLB Miss Handler (typical):
Optimized assembly code that walks Sv39 page tables takes ~50 instructions (best case). With instruction cache hits and no data cache misses: ~50 cycles. Worst case with cache misses: 200+ cycles.
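To make the software path concrete, here is a minimal C sketch of the Sv39 walk such a handler performs (the real handler is hand-written assembly). The helpers read_pte_phys() and tlb_write_entry() are assumptions standing in for physical memory access and the implementation-specific TLB insert:

```c
#include <stdint.h>

#define PTE_V (1u << 0)   /* valid */
#define PTE_R (1u << 1)   /* readable */
#define PTE_W (1u << 2)   /* writable */
#define PTE_X (1u << 3)   /* executable */

/* Assumed helpers (hypothetical): physical memory read and TLB insert. */
extern uint64_t read_pte_phys(uint64_t paddr);
extern void tlb_write_entry(uint64_t va, uint64_t pte, int level);

/* Walk a 3-level Sv39 table rooted at root_ppn (from satp) for va.
 * Returns 0 on success (TLB entry inserted), -1 on page fault. */
int sv39_refill(uint64_t root_ppn, uint64_t va) {
    uint64_t table = root_ppn << 12;   /* physical address of root table */
    for (int level = 2; level >= 0; level--) {
        uint64_t idx = (va >> (12 + 9 * level)) & 0x1ff;  /* VPN[level] */
        uint64_t pte = read_pte_phys(table + idx * 8);
        if (!(pte & PTE_V))
            return -1;                       /* invalid entry: fault */
        if (pte & (PTE_R | PTE_W | PTE_X)) { /* leaf (may be a superpage) */
            tlb_write_entry(va, pte, level);
            return 0;
        }
        table = (pte >> 10) << 12;           /* PPN field -> next level */
    }
    return -1;                               /* no leaf found: fault */
}
```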
Apple doesn't publish detailed microarchitectural specifications, but reverse engineering and performance analysis reveal:
Apple M1 (2020):
Firestorm (Performance) Core:
Estimated L1 TLB:
- ~96-128 entries (instruction + data combined; estimated)
- Fully associative
- Supports 4KB, 16KB, 2MB pages
Estimated L2 TLB:
- ~2000+ entries
- High associativity
Icestorm (Efficiency) Core:
Estimated L1 TLB:
- ~48-64 entries
- Fully associative
Observations from benchmarks:
- Exceptional TLB performance compared to competitors
- Very low TLB miss rates even with large working sets
- Likely very large L2 TLB or advanced prefetching
Apple M3 (2023):
- Estimated further TLB improvements
- Better handling of mixed page sizes
- Enhanced coherency for heterogeneous cores
While hierarchical multi-level page tables (covered in Chapter 3) are the dominant approach in modern processors, alternative page table structures exist and influence TLB design in important ways. Understanding these alternatives provides insight into the design space and trade-offs in virtual memory systems.
Concept: Instead of one page table per process mapping virtual pages to physical frames, an inverted page table (IPT) has one entry per physical frame mapping back to virtual pages.
Traditional (forward) page table:
Virtual Page → [Page Table Entry] → Physical Frame
One table per process
Size grows with virtual address space
Inverted page table:
Physical Frame → [IPT Entry] → (Process ID, Virtual Page)
One global table for entire system
Size determined by physical memory
IPT Entry Format:
| PID / ASID (16 bits) | VPN (52 bits)    | Chain pointer                  | Flags (R/W/X/…) |
|---|---|---|---|
| Address space        | Virtual page tag | → next entry (collision chain) | Permission bits |
Lookup hashes the (PID, VPN) pair to select a bucket; the chain pointer resolves collisions. This avoids the CAM hardware cost that full associativity would impose at large entry counts, at the price of a potential multi-cycle lookup on collision.
Since we can't index directly by virtual address, IPTs use a hash table:
1. Hash (PID, VPN) → Hash bucket index
2. Read IPT entry at that index
3. Check if (IPT.PID, IPT.VPN) matches (Current_PID, Requested_VPN)
- Match → Physical frame number = entry index
- No match → Follow chain pointer to next collision entry
4. If chain ends without match → Page fault
Example: Process 42 accesses virtual page 0x1234:
Hash(42, 0x1234) = 0x9876
Read IPT[0x9876]:
PID = 42
VPN = 0x1234
Match! → Physical frame = 0x9876
1. Fixed Memory Overhead:
Traditional: Memory = (Number of processes) × (Virtual AS size / Page size) × (Entry size)
Example: 1000 processes × (2^48 / 2^12) × 8 bytes ≈ 0.5 PB (impossible!)
Inverted: Memory = (Physical memory size / Page size) × (Entry size)
Example: 256 GB / 4 KB × 16 bytes = 1 GB (manageable)
2. No Per-Process Tables: Simplifies context switching and memory management.
3. Natural Support for Shared Memory: Multiple (PID, VPN) pairs can map to same physical frame.
1. Slow Lookup:
Hash collisions → must traverse the chain
Worst case: O(n) probes for n colliding entries
Traditional hierarchical: a fixed, deterministic number of memory accesses after a TLB miss (e.g., four for a 4-level walk)
2. Complex TLB Refill:
On a TLB miss, software must:
void tlb_refill_ipt(pid_t pid, vpn_t vpn) {
    uint64_t index = hash_function(pid, vpn) % ipt_size;

    while (ipt[index].valid) {
        if (ipt[index].pid == pid && ipt[index].vpn == vpn) {
            // Found it: the entry's index IS the physical frame number
            tlb_insert(vpn, index, ipt[index].flags);
            return;
        }
        // Collision - follow the chain
        if (ipt[index].chain == IPT_CHAIN_END)  /* sentinel, e.g. ~0 */
            break;  // End of chain
        index = ipt[index].chain;
    }
    // Not found - raise a page fault
    page_fault(pid, vpn);
}
3. Poor Locality: The hash function scatters consecutive virtual pages across the IPT, destroying spatial locality.
4. No Natural Large Page Support: Difficult to represent 2MB/1GB pages efficiently.
IBM PowerPC: Used inverted page tables extensively in RS/6000 and early POWER systems.
HP/Intel IA-64 (Itanium): Supported both traditional and inverted page tables:
Virtual Hash Page Table (VHPT):
- Hardware-walked hash table
- Software defines hash function
- TLB miss handler can use VHPT or traditional tables
Modern Status: Largely abandoned in favor of hierarchical tables because:
- Hardware page walkers prefer deterministic access patterns
- Page walk caches optimize hierarchical lookups
- Large pages fit naturally into hierarchical structures
Context: Before hardware virtualization support (Intel EPT, AMD NPT), hypervisors used shadow page tables to virtualize memory.
Guest OS maintains page tables: Guest Virtual Address (GVA) → Guest Physical Address (GPA)
But guest "physical" addresses are actually virtual from host perspective:
Guest Virtual (GVA) → Guest Physical (GPA) → Host Physical (HPA)
Without hardware support, TLB can only cache one translation level. Solution: Shadow page tables.
Hypervisor maintains shadow page tables that directly map GVA → HPA:
Guest Page Table: GVA → GPA
Host Page Table: GPA → HPA
Shadow Page Table: GVA → HPA (composition of above)
Construction algorithm:
void build_shadow_pte(guest_pte_t *guest_pte, shadow_pte_t *shadow_pte) {
// Guest PTE: GVA → GPA
gpa_t gpa = guest_pte->pfn << 12;
// Walk host page table: GPA → HPA
hpa_t hpa = walk_host_page_table(gpa);
// Build shadow PTE: GVA → HPA
shadow_pte->pfn = hpa >> 12;
shadow_pte->flags = guest_pte->flags; // Preserve permissions
shadow_pte->present = guest_pte->present && (hpa != INVALID);
}
Challenge: The guest OS modifies its own page tables; the shadow must stay synchronized.
Solution: Write-protect guest page tables:
1. Mark all guest page table pages as read-only in host page tables
2. Guest attempts to write page table entry
3. Page fault (write to read-only page)
4. Hypervisor intercepts fault (VM exit)
5. Hypervisor:
a. Emulate the write to guest page table
b. Update corresponding shadow page table entry
c. Flush TLB entry if needed
6. Resume guest execution (VM entry)
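A hypervisor's fault handler for these intercepted writes might look like the following sketch. It reuses build_shadow_pte() from above; every other helper name (decode_store_operand(), shadow_pte_for(), and so on) is hypothetical, and real shadow-MMU implementations are far more involved:

```c
// Invoked on a VM exit caused by a write fault on a write-protected
// guest page-table page. Minimal sketch with hypothetical helpers.
void handle_guest_pt_write(vcpu_t *vcpu, gpa_t fault_gpa) {
    // 1. Decode and emulate the guest's store to its own page table
    guest_pte_t new_pte = decode_store_operand(vcpu);
    write_guest_memory(fault_gpa, &new_pte, sizeof(new_pte));

    // 2. Recompute the corresponding shadow entry (GVA -> HPA)
    shadow_pte_t *spte = shadow_pte_for(vcpu, fault_gpa);
    build_shadow_pte(&new_pte, spte);

    // 3. Drop any stale cached translation before resuming
    invalidate_tlb_entry(vcpu, gva_for_shadow_pte(spte));

    resume_guest(vcpu);  // VM entry
}
```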
Performance cost:
Every guest page table modification:
- VM exit (~1000 cycles)
- Shadow PT update
- TLB flush
- VM entry (~1000 cycles)
Total: ~2000+ cycles per PT modification
Example: Guest Linux fork():
Linux fork() creates new process:
- Allocates new page tables
- Copies hundreds of PTEs
- Each copy triggers VM exit!
With shadow PTs: 1 fork ≈ 500 VM exits ≈ 1,000,000 cycles
With hardware (EPT/NPT): 1 fork = 0 VM exits = normal overhead
This is why hardware-assisted virtualization (EPT/NPT) was so important!
TLB caches GVA → HPA directly:
TLB Entry (with shadow PT):
VPN = Guest Virtual Page
PFN = Host Physical Frame (not GPA!)
ASID/PCID = Guest's process context
VMID = Which VM
On VM switch:
- Must flush all TLB entries for the old VM
- Shadow page tables remain, but the TLB is cold
- New VM starts with an empty TLB
Memory overhead:
Shadow PT per guest:
Same size as guest's page tables
100s of MB for typical guest
If running 100 VMs:
10-100 GB of shadow page tables!
Hardware support eliminates shadow page tables:
Intel EPT (Extended Page Tables):
GVA ──(Guest PT)──> GPA ──(EPT)──> HPA
Hardware walks both automatically
TLB entries cache final translation:
TLB Entry with EPT:
Guest VPN → Host PFN
Tagged with (PCID, VPID)
Key advantage: Guest can modify its page tables without VM exits!
Performance comparison:
| Operation       | Shadow PT      | EPT/NPT                |
|---|---|---|
| Guest PT write  | 2000+ cycles   | ~10 cycles             |
| TLB miss        | Same as native | ~2× native (dual walk) |
| VM switch       | Flush TLB      | Keep TLB (VPID)        |
| Memory overhead | 100s of MB     | ~10 MB (EPT structure) |
We'll explore EPT/NPT in detail in Section 4.11.
EPT worst-case memory access count. The "~2× native (dual walk)" shorthand above describes the cached average case with both the guest and EPT page-walk caches (PWCs) warm. In the true worst case with no caching, 4-level guest tables, and 4-level EPT, the hardware performs up to 25 memory accesses for one guest virtual address: four guest-table reads, each preceded by a 4-level EPT walk, plus EPT walks for the guest CR3 and for the final data address (5 × 4 = 20 EPT accesses), plus the final data read itself (4 + 20 + 1 = 25). Chapter 3 §3.6.2 provides the full derivation. In practice the combined PWC reduces this dramatically; measured average overhead for typical virtualised workloads is 2–7% (Intel SDM, Vol. 3C §28.2).
Concept: Group multiple consecutive page table entries into clusters.
Benefits:
- Reduce page table memory overhead
- Improve spatial locality
- TLB can cache cluster info
Usage: Some embedded systems and research OSes.
Some processors use different page table structures at different privilege levels:
ARM64 Example:
User space (EL0): Uses TTBR0_EL1
- Can use 4KB, 16KB, or 64KB granule
Kernel space (EL1): Uses TTBR1_EL1
- Can use different granule than user space!
TLB must track which granule for each entry
When a program modifies its own code (writes instructions that will later be executed), maintaining coherency between instruction and data caches—and between ITLB and DTLB—becomes critical.
Scenario: Program writes new instruction, then executes it:
// Self-modifying code example
void *code_page = mmap(..., PROT_READ|PROT_WRITE|PROT_EXEC, ...);
// Write new instruction
*(uint32_t*)code_page = 0x12345678; // MOV instruction (example)
// Execute it
void (*func)() = (void(*)())code_page;
func(); // Execute newly written instruction
Problem: Multiple caches involved:
Write instruction (STORE):
1. Hits in D-cache (data cache)
2. Translates via DTLB
3. Instruction now in D-cache, may not be in I-cache
Execute instruction (FETCH):
1. Misses in I-cache (instruction cache)
2. Translates via ITLB
3. May fetch stale instruction from memory if D-cache not flushed
Three coherency issues arise:
1. D-cache vs. I-cache: the store lands in the data cache; the instruction cache may still hold, or later fetch, stale bytes.
2. Pipeline state: instructions already fetched or speculatively decoded may predate the store.
3. ITLB vs. DTLB: if the mapping or its permissions changed, the two TLBs may disagree about the page.
x86 guarantees:
- Data cache is write-through or write-back with snooping
- Self-modifying code detected via store-to-instruction checks
- But: explicit serialization is still required for correctness
Required sequence:
; Write new instruction
mov [rdi], eax ; Store new instruction
; Serialize execution
mfence ; Memory fence (ensure stores visible)
; or
cpuid ; Serializing instruction
; or
jmp short $+2 ; Jump serializes pipeline
; Now safe to execute new code
call rdi
x86 Self-Modifying Code (SMC) Detection:
Modern x86 processors detect when a store might modify code in the I-cache and automatically invalidate affected I-cache lines. However, this doesn't cover all cases:
If store and fetch use different linear addresses but same physical address:
- Different virtual pages mapped to same physical page
- Store via VPN1, execute via VPN2
- SMC detection may miss this!
Solution: Must explicitly flush I-cache
x86 I-cache flush:
// No direct I-cache flush instruction in x86!
// Must use serializing instruction or pipeline flush
asm volatile("mfence; cpuid" ::: "eax", "ebx", "ecx", "edx");
ITLB/DTLB coherency:
x86 doesn't guarantee ITLB/DTLB coherency automatically. If same virtual page accessed via ITLB and DTLB:
; After changing page table entry affecting execute permissions
invlpg [address] ; Invalidates BOTH ITLB and DTLB
ARM64 requires explicit cache maintenance:
; Write new instruction
str w0, [x1] ; Store instruction
; Point of Unification (PoU) maintenance
dc cvau, x1 ; Clean data cache to PoU
dsb ish ; Data Synchronization Barrier
ic ivau, x1 ; Invalidate instruction cache by VA
dsb ish ; Ensure completion
isb ; Instruction Synchronization Barrier
; Now safe to execute
br x1
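In portable C, GCC and Clang wrap this maintenance sequence in a builtin (the same one the JIT examples later in this section use); a thin wrapper might look like:

```c
// Emits the required dc/dsb/ic/isb sequence on ARM64 (or the target's
// equivalent; effectively a no-op plus serialization on x86) for [begin, end).
static inline void publish_code(char *begin, char *end) {
    __builtin___clear_cache(begin, end);
}
```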
ARM cache maintenance instructions:
- dc cvau, Xt: clean D-cache line by VA to the Point of Unification
- ic ivau, Xt: invalidate I-cache line by VA to the Point of Unification
- ic iallu: invalidate the entire I-cache (local PE)
Point of Unification (PoU):
The point in the cache hierarchy where instruction and data caches are guaranteed to see the same copy of memory:
ITLB/DTLB coherency:
ARM provides separate TLB invalidation for instruction and data:
tlbi vae1is, x0 ; Invalidate TLB entry (both I and D)
dsb ish
isb
Expensive operation: ARM's explicit maintenance is slower than x86's hardware coherency:
x86 SMC: ~100-200 cycles (mostly automatic)
ARM SMC: ~500-1000 cycles (explicit cache maintenance)
RISC-V provides minimal hardware support:
# Write new instruction
sw a0, 0(a1)
# Fence instruction and data
fence.i          # Instruction fence - makes prior stores visible to fetches
# Optionally flush data cache (implementation-dependent)
# No standard instruction!
# TLB flush if page permissions changed
sfence.vma zero, zero
# Execute new code
jalr a1
FENCE.I instruction: FENCE.I guarantees that subsequent instruction fetches on the executing hart observe all stores previously made visible to that same hart. It says nothing about other harts; publishing code across harts additionally requires data fences and, typically, an IPI so remote harts execute their own FENCE.I.
RISC-V cache coherency ambiguity:
The RISC-V spec intentionally leaves cache coherency details to implementations:
Weakly-ordered implementations:
- May require explicit D-cache flush (implementation-specific)
- No standard way to flush D-cache in base spec
Strongly-ordered implementations:
- FENCE.I may be sufficient
Privileged cache management (SiFive example):
// SiFive-specific cache flush
void flush_dcache_range(void *addr, size_t len) {
uintptr_t start = (uintptr_t)addr;
uintptr_t end = start + len;
for (uintptr_t i = start; i < end; i += CACHE_LINE_SIZE) {
asm volatile("cflush.d.l1 %0" :: "r"(i));
}
}
Just-In-Time (JIT) compilers generate code at runtime, making self-modifying code handling critical:
Java JVM, JavaScript V8, .NET CLR all need efficient SMC:
Best practices:
// Allocate executable page
void *code = mmap(NULL, 4096,
PROT_READ|PROT_WRITE, // Initially writable
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// Write code
generate_code(code, ...);
// Flush caches
flush_instruction_cache(code, 4096);
// Make executable (W^X: Write XOR Execute)
mprotect(code, 4096, PROT_READ|PROT_EXEC);
// Execute
((void(*)())code)();
Batch code generation: Generate multiple functions before flushing
Code patching: When patching existing code:
// x86 example: patch a call target
void patch_call(void *call_site, void *new_target) {
uint8_t *insn = call_site;
// Calculate relative offset
int32_t offset = (uint8_t*)new_target - (insn + 5);
// Atomic write (single instruction on x86)
*(int32_t*)(insn + 1) = offset;
// Flush I-cache
__builtin___clear_cache(insn, insn + 5);
}
Performance impact:
Cache maintenance overhead per JIT compilation:
x86-64: ~100-500 cycles
ARM64: ~1000-3000 cycles
RISC-V: ~500-2000 cycles (implementation-dependent)
For a JIT compiling 1000 functions/sec:
x86: 0.1-0.5M cycles/sec = negligible
ARM: 1-3M cycles/sec = ~0.025-0.075% @ 4GHz
Modern systems enforce W^X policy: memory pages are either writable OR executable, never both simultaneously.
Implications for SMC:
// Old (insecure) way:
void *code = mmap(..., PROT_READ|PROT_WRITE|PROT_EXEC, ...);
write_code(code);
execute_code(code); // Dangerous!
// New (W^X) way:
void *code = mmap(..., PROT_READ|PROT_WRITE, ...);
write_code(code);
flush_caches(code);
mprotect(code, ..., PROT_READ|PROT_EXEC); // Remove write permission
execute_code(code); // Safe
W^X enforcement requires TLB invalidation:
mprotect(addr, len, PROT_READ|PROT_EXEC) {
// Update page table entries
for (each page in range) {
pte->writable = 0;
pte->executable = 1;
}
// MUST invalidate TLB!
tlb_flush_range(addr, len);
// Without this, CPU might use cached writable permission
}
Apple M1 W^X hardware enforcement:
Apple Silicon enforces W^X at the hardware level:
- Attempting to set W+X bits in a page table entry triggers a fault
- JITs must use a dual-mapping technique (see the sketch below)
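One portable way to realise dual mapping is sketched below; this is an illustration of the general technique, not Apple's actual mechanism (macOS JITs typically use MAP_JIT plus per-thread write-protect toggling). The idea: back the code buffer with one shared-memory object and map it twice, one writable view for the compiler and one executable view for execution.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Dual-map a code buffer: *rw_view for the compiler thread, *rx_view for
// execution. Sketch only: error paths trimmed, the shm name is illustrative.
int dual_map(void **rw_view, void **rx_view, size_t len) {
    int fd = shm_open("/jit_code", O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) return -1;
    shm_unlink("/jit_code");            // keep the object anonymous once mapped
    if (ftruncate(fd, len) != 0) { close(fd); return -1; }

    *rw_view = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    *rx_view = mmap(NULL, len, PROT_READ | PROT_EXEC,  MAP_SHARED, fd, 0);
    close(fd);
    if (*rw_view == MAP_FAILED || *rx_view == MAP_FAILED) return -1;
    return 0;
    // Writes through *rw_view become executable via *rx_view only after the
    // usual I-cache/D-cache maintenance (e.g., __builtin___clear_cache on the
    // rx view) - note the SMC-detection caveat for aliased pages above.
}
```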
Modern processors support Simultaneous Multithreading (SMT), where multiple hardware threads share a single physical core. Understanding how TLBs are shared (or not shared) between threads is critical for performance.
Intel Hyperthreading (2-way SMT):
- 2 logical processors per physical core
- Shared: L1/L2 caches, TLBs, execution units
- Private: architectural registers, instruction pointer
AMD SMT (2-way):
- Similar to Intel
- 2 threads per core
IBM POWER9 (8-way SMT):
- 8 hardware threads per core!
- More complex resource sharing
Three approaches to TLB sharing:
Approach 1 - Fully shared: Both threads share the entire TLB:
Physical Core:
L1 ITLB: 128 entries (shared by Thread 0 and Thread 1)
L1 DTLB: 64 entries (shared by Thread 0 and Thread 1)
L2 STLB: 1536 entries (shared)
Advantages:
- Maximum TLB capacity utilization
- One thread can use all entries if the other thread is idle
Disadvantages:
- TLB contention: threads thrash each other's entries
- Security: Thread 0 can observe Thread 1's TLB timing
Example conflict:
Thread 0: Accesses pages 0-127 (fills ITLB)
Thread 1: Accesses pages 200-327 (evicts Thread 0's entries)
Thread 0: Re-accesses pages 0-127 (TLB misses!)
Result: Thrashing, performance degradation
Approach 2 - Statically partitioned: Divide TLB entries between threads:
L1 DTLB (64 entries):
Thread 0: Entries 0-31 (32 entries)
Thread 1: Entries 32-63 (32 entries)
Advantages:
- No thrashing
- Predictable performance
- Better security (isolation)
Disadvantages:
- Underutilization if one thread is idle
- Fixed partitioning may not match workload needs
Implementation: Tag entries with thread ID:
TLB Entry:
VPN: 36 bits
PFN: 40 bits
Thread: 1 bit (0 or 1)
... other fields ...
Lookup:
Match (VPN, Current_Thread_ID)
Approach 3 - Dynamically shared: Allocate entries dynamically based on need:
Replacement policy includes thread ID:
- Try to evict entry from same thread first
- Allow borrowing if thread has many idle entries
- Enforce min/max quotas per thread
Intel implementation (approximation):
Intel uses pseudo-LRU with thread-aware bias:
int select_victim_tlb_entry() {
// Prefer evicting entry from current thread
for (int i = 0; i < TLB_SIZE; i++) {
if (tlb[i].thread_id == current_thread &&
!tlb[i].recently_used) {
return i;
}
}
// Fall back to global LRU
return lru_victim();
}
Hardware-managed TLBs (x86, ARM):
Multiple page walkers can operate simultaneously:
Intel Ice Lake:
- 2 page walkers per core
- Thread 0 TLB miss → Walker 0 starts page walk
- Thread 1 TLB miss → Walker 1 starts page walk simultaneously
- Both walks proceed in parallel
Software-managed TLBs (RISC-V):
TLB miss generates exception, runs handler:
Thread 0: TLB miss → Exception on Thread 0
Thread 1: TLB miss → Exception on Thread 1
Problem: Both threads can't handle TLB miss simultaneously!
Solution: Handler must handle concurrency
RISC-V SMT TLB handler design:
// Per-hardware-thread TLB miss handler state
struct tlb_miss_state {
    vaddr_t fault_addr;
    int in_progress;
};

static spinlock_t tlb_lock;          // guards the shared TLB write port
__thread struct tlb_miss_state tms;  // one instance per hardware thread

void tlb_miss_handler(vaddr_t addr) {
    tms.fault_addr = addr;
    tms.in_progress = 1;

    // Walk page table (thread-safe: read-only traversal)
    pte_t pte = walk_page_table(addr);

    // Insert into TLB (lock needed if TLB insertion is not atomic)
    spin_lock(&tlb_lock);
    tlb_insert(addr, pte);
    spin_unlock(&tlb_lock);

    tms.in_progress = 0;
}
Challenge: How do address space identifiers work with SMT?
Approach 1: Shared ASID/PCID
Both threads run same process (common for single-threaded programs):
Thread 0: PCID = 5
Thread 1: PCID = 5
Both access same address space, share TLB entries naturally
Approach 2: Different ASIDs/PCIDs
Threads run different processes (uncommon):
Thread 0: PCID = 5 (Process A)
Thread 1: PCID = 17 (Process B)
TLB entries tagged with both Thread ID and PCID
TLB lookup with SMT:
Match (VPN, PCID, Thread_ID)
This prevents Thread 0 from hitting Thread 1's entries even if PCID matches.
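A toy model of this three-way tag match follows; the names, field widths, and sequential loop are illustrative only (real hardware compares all entries in parallel), not any vendor's actual structure:

```c
#include <stdbool.h>
#include <stdint.h>

// Illustrative SMT-shared TLB entry, tagged with PCID and hardware thread.
typedef struct {
    uint64_t vpn;        // virtual page number
    uint64_t pfn;        // physical frame number
    uint16_t pcid;       // process-context identifier
    uint8_t  thread_id;  // owning hardware thread (0 or 1)
    bool     valid;
} tlb_entry_t;

#define TLB_SIZE 64
static tlb_entry_t tlb[TLB_SIZE];

// Hit only if VPN, PCID, and hardware thread all match.
bool tlb_lookup(uint64_t vpn, uint16_t pcid, uint8_t thread, uint64_t *pfn) {
    for (int i = 0; i < TLB_SIZE; i++) {  // hardware does this in parallel
        if (tlb[i].valid && tlb[i].vpn == vpn &&
            tlb[i].pcid == pcid && tlb[i].thread_id == thread) {
            *pfn = tlb[i].pfn;
            return true;
        }
    }
    return false;  // miss: start a page walk
}
```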
Measurement: SPECint2017 with SMT:
| Configuration     | IPC per thread | Total IPC | DTLB miss rate |
|---|---|---|---|
| 1 thread (no SMT) | 2.5            | 2.5       | 2.1%           |
| 2 threads (SMT)   | 1.8            | 3.6       | 4.8%           |
Per-thread IPC drops while total IPC rises; the higher miss rate reflects TLB thrashing.
TLB contention increases the miss rate:
- Single thread: 2.1% DTLB miss rate
- Two threads (SMT): 4.8% DTLB miss rate per thread (2.3× worse)
When SMT helps despite TLB contention:
- Threads have different workloads (one CPU-bound, one memory-bound)
- Hides memory latency (while one thread waits on a cache miss, the other runs)
- Total throughput improves even if each thread runs slower
When SMT hurts:
- Both threads are memory-intensive with large working sets
- TLB thrashing dominates performance
- Better to disable SMT and run the threads on separate cores
When running virtual machines, the guest OS and hypervisor may use different page sizes at different translation stages. Understanding how TLBs handle this is crucial for virtualization performance.
Without extended page tables:
Guest Virtual (GVA) ──(Guest PT)──> Guest Physical (GPA) ──(Shadow PT/EPT)──> Host Physical (HPA)
With Intel EPT / AMD NPT:
Hardware walks both page tables:
GVA → GPA: Guest page table walk (4 levels for 4KB pages)
GPA → HPA: EPT/NPT walk (4 levels for 4KB pages)
Worst case: 24 memory accesses for the walk (25 including the final data read)!
- 4 reads of guest page-table entries
- 5 × 4 = 20 EPT accesses: a 4-level EPT walk for the guest CR3, one before each guest-table read, and one for the final data address
Scenario: Guest uses 4KB pages, host uses 2MB pages (or vice versa)
Case 1: Guest 4KB, Host 2MB (common):
Guest: Maps 4KB virtual page → 4KB physical page
EPT: Maps 2MB physical page → 2MB host physical
Guest page table:
VA 0x0000 → GPA 0x10000 (4KB page)
VA 0x1000 → GPA 0x11000 (4KB page)
...
EPT:
GPA 0x00000-0x1FFFFF → HPA 0x80000000-0x801FFFFF (2MB page)
TLB entry must store both:
TLB Entry:
Guest VPN: 0x0 (4KB granularity)
Guest page size: 4KB
Host PFN: 0x80000 (2MB granularity)
Host page size: 2MB
Translation:
GVA 0x0000 → HPA 0x80000000 (via 4KB+2MB)
GVA 0x1000 → HPA 0x80001000 (via 4KB+2MB, same EPT entry!)
Problem: TLB entries at 4KB granularity, but each EPT TLB entry covers 2MB!
Intel's approach: Separate TLB structures
Guest TLB: Caches GVA → GPA translations
- Uses guest page sizes (4KB, 2MB, 1GB)
EPT TLB: Caches GPA → HPA translations
- Uses EPT page sizes (4KB, 2MB, 1GB)
Combined TLB: Caches final GVA → HPA
- Granularity is MIN(guest_page_size, ept_page_size)
Example combinations:
Guest 4KB + EPT 4KB:
Final TLB entry: 4KB (straightforward)
Guest 4KB + EPT 2MB:
Final TLB entry: 4KB
- Single EPT TLB entry (2MB) covers 512 guest TLB entries (4KB each)
- EPT TLB has high hit rate
Guest 2MB + EPT 4KB:
Final TLB entry: 4KB (must track 4KB EPT chunks)
- Single guest TLB entry (2MB) requires 512 EPT TLB entries!
- EPT TLB thrashing likely
- Poor performance!
Guest 2MB + EPT 2MB:
Final TLB entry: 2MB (best case)
- Both levels aligned
- Optimal TLB utilization
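The effective-granularity rule behind these cases is simple enough to express directly. A toy helper (names illustrative only):

```c
#include <stdint.h>

// Page sizes in bytes: 4 KB, 2 MB, 1 GB
#define SZ_4K (4ULL << 10)
#define SZ_2M (2ULL << 20)
#define SZ_1G (1ULL << 30)

// Effective size of a combined GVA->HPA TLB entry.
static inline uint64_t combined_entry_size(uint64_t guest_sz, uint64_t ept_sz) {
    return guest_sz < ept_sz ? guest_sz : ept_sz;  // MIN(guest, EPT)
}

// How many combined entries one guest-level mapping expands into
// when the EPT side uses smaller pages.
static inline uint64_t entries_per_guest_mapping(uint64_t guest_sz,
                                                 uint64_t ept_sz) {
    return guest_sz / combined_entry_size(guest_sz, ept_sz);
}

// entries_per_guest_mapping(SZ_2M, SZ_4K) == 512: the thrashing case above.
// entries_per_guest_mapping(SZ_2M, SZ_2M) == 1:   the aligned best case.
```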
Intel Ice Lake EPT TLB:
STLB:
1536 entries for GVA → HPA (combined)
Plus separate EPT TLB (undocumented size)
Estimated ~256-512 entries for GPA → HPA only
AMD approach: Similar to Intel
AMD Zen 3 NPT TLB:
L1 Guest TLB: 96 entries (GVA → GPA)
L1 Nested TLB: 96 entries (GPA → HPA)
L2 Combined TLB: 3072 entries (GVA → HPA final)
Page size handling:
If guest uses 2MB and nested uses 4KB:
- Guest TLB entry covers 2MB of GVA → GPA
- But GPA → HPA breaks down into 512 × 4KB nested mappings
- L2 TLB must cache 512 separate entries!
- Potential thrashing
ARM terminology:
Stage 1: VA → IPA (Intermediate Physical Address)
Stage 2: IPA → PA (Physical Address)
ARMv8.4-A concatenated tables:
ARM allows different granules for Stage 1 and Stage 2:
Stage 1: 4KB granule (guest page tables)
Stage 2: 64KB granule (hypervisor)
Problem: Mismatched granularity!
ARM TLB handling:
TLB Entry (ARM with Stage 2):
VA: Guest virtual address
IPA: Intermediate physical address
PA: Final physical address
Stage 1 size: 4KB, 16KB, 64KB, 2MB, 32MB, 1GB
Stage 2 size: 64KB, 512MB (for 64KB granule)
Combined size: MIN(S1_size, S2_size)
Optimization: Concatenated page tables
ARMv8.4 allows concatenating up to 16 small page tables:
Instead of:
Stage 1: 4KB granule → 4KB pages
Use:
Stage 1: 4KB granule × 16 concatenation → 64KB effective
Matches 64KB Stage 2 granule → Better TLB efficiency
RISC-V H extension: Defines two-stage translation
VS-stage: VA → GPA (Sv39, Sv48, Sv57)
G-stage: GPA → HPA (Sv39x4, Sv48x4, Sv57x4)
Note: G-stage uses same structure as VS-stage (same page sizes)
Advantage: No mismatch problem! Both stages use 4KB, 2MB, 1GB.
TLB structure:
RISC-V with H-extension TLB:
- Separate VS-stage TLB (optional)
- Separate G-stage TLB (optional)
- Combined TLB (required)
Implementation-dependent which TLBs exist
Best practices for optimal TLB performance with virtualization:
GOOD:
Guest: 2MB huge pages
Host/EPT: 2MB huge pages
→ Optimal TLB hit rate
BAD:
Guest: 2MB pages
Host: 4KB pages
→ Each guest page requires 512 host TLB entries!
GOOD:
Guest: 4KB pages
Host: 2MB pages
→ Each host TLB entry covers 512 guest pages
Enable EPT/NPT huge pages in hypervisor
Encourage guest to use huge pages (Transparent Huge Pages)
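From inside the guest, an application can request huge-page backing explicitly. A minimal Linux example using the real mmap(2)/madvise(2) interfaces (error handling trimmed):

```c
#include <stdlib.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1ULL << 30;  // 1 GB working set

    // Anonymous mapping, then ask the kernel to back it with
    // transparent huge pages (2 MB on x86-64) where possible.
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;
    madvise(buf, len, MADV_HUGEPAGE);

    /* ... touch memory; huge pages are installed on first fault ... */

    munmap(buf, len);
    return 0;
}
```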
Measurement: Database performance with EPT:
| Configuration      | Queries/sec | TLB miss rate    |
|---|---|---|
| Guest 4KB, EPT 4KB | 15,000      | 18%              |
| Guest 4KB, EPT 2MB | 28,000      | 8%               |
| Guest 2MB, EPT 2MB | 42,000      | 3%               |
| Guest 2MB, EPT 4KB | 12,000      | 35% (thrashing!) |
Graphics Processing Units (GPUs) and other hardware accelerators have their own Memory Management Units with unique requirements and design trade-offs compared to CPU MMUs.
Traditional approach (pre-2010): GPUs accessed only pinned physical memory
- OS pins memory pages (prevents swapping)
- GPU uses physical addresses directly
- Simple but inflexible
Problems:
- Limits memory oversubscription
- Complicates CPU-GPU data sharing
- Requires complex buffer management
- Can't support per-process virtual memory
Modern approach: the GPU has a full MMU
- GPU uses virtual addresses
- Supports page faults
- Enables unified virtual addressing
- Allows memory overcommit
NVIDIA Pascal/Volta/Ampere architecture:
GPU MMU Components:
1. TLBs (multiple levels)
2. Page tables (hierarchical, similar to CPU)
3. Page fault handling unit
4. Unified Virtual Addressing (UVA) support
TLB Hierarchy:
Per-SM (Streaming Multiprocessor):
L1 TLB: ~64 entries per SM
- Handles most translations
- 1-2 cycle hit latency
Shared across GPU:
L2 TLB: ~512-2048 entries (implementation-dependent)
- Unified for all SMs
- ~10-20 cycle hit latency
Page Table Structure:
NVIDIA uses multi-level page tables similar to x86:
Virtual Address (48-bit):
[47:39] → PDE3 (9 bits)
[38:30] → PDE2 (9 bits)
[29:21] → PDE1 (9 bits)
[20:12] → PDE0 (9 bits)
[11:0] → Offset (12 bits)
Supports 4KB, 64KB, and 2MB pages
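Extracting the four 9-bit directory indices from such an address is mechanical; a sketch in C (field names follow the layout above):

```c
#include <stdint.h>

// Decode a 48-bit GPU virtual address into its four 9-bit table indices
// and 12-bit page offset, per the layout sketched above.
typedef struct {
    unsigned pde3, pde2, pde1, pde0, offset;
} gpu_va_fields_t;

static inline gpu_va_fields_t decode_gpu_va(uint64_t va) {
    return (gpu_va_fields_t){
        .pde3   = (unsigned)((va >> 39) & 0x1ff),
        .pde2   = (unsigned)((va >> 30) & 0x1ff),
        .pde1   = (unsigned)((va >> 21) & 0x1ff),
        .pde0   = (unsigned)((va >> 12) & 0x1ff),
        .offset = (unsigned)( va        & 0xfff),
    };
}
```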
Unified Virtual Addressing (UVA):
CPU and GPU share same virtual address space:
CPU pointer: 0x7fff1234 points to memory
GPU pointer: 0x7fff1234 points to SAME memory
Benefits:
- No address translation needed for shared data
- Simplified programming model
- Zero-copy data sharing
Traditional CPU: Page fault stalls one thread, others continue
GPU: Page fault can stall thousands of threads!
NVIDIA approach: Replay mechanism
1. Warp (32 threads) accesses memory
2. Some threads: TLB miss → Page fault
3. GPU:
a. Records faulting threads
b. Notifies CPU driver
c. Continues executing other warps
4. CPU:
a. Handles page fault
b. Allocates/migrates page
c. Updates page tables
d. Signals GPU
5. GPU:
a. Invalidates TLB entry
b. Replays faulting threads
c. Memory access succeeds
Challenge: High page fault latency
CPU page fault: ~1,000-10,000 cycles
GPU page fault: ~50,000-500,000 cycles!
- PCIe communication overhead
- Driver processing
- Page migration
Impact: Page faults kill GPU performance!
Solution: Demand paging with prefetching
CUDA Unified Memory prefetches pages:
// Hint to prefetch memory to GPU
cudaMemPrefetchAsync(ptr, size, gpu_device, stream);
// Or advise on expected access pattern
cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, gpu_device);
AMD approach: Integrate GPU MMU with CPU IOMMU
HSA (Heterogeneous System Architecture):
Shared page tables between CPU and GPU:
- Single set of page tables
- CPU and GPU see identical virtual address space
- IOMMU translates GPU accesses
IOMMU TLB (AMD):
Per-GPU:
IOMMU TLB: 512-1024 entries
Handles:
- GPU memory accesses
- DMA from GPU
- Peer-to-peer GPU transfers
Coherency: AMD supports cache-coherent GPU memory access (APU architectures)
GPU load:
1. Check GPU L1 cache
2. Check GPU L2 cache
3. Check CPU cache hierarchy (coherent!)
4. Load from DRAM
Intel Xe architecture:
Fully integrated GPU MMU:
- Shares CPU's page tables
- Uses CPU's TLBs for some operations
- Hardware page walker shared
Page fault handling:
Intel GPUs can trigger CPU page faults directly:
GPU thread → Page fault → CPU handles → GPU resumes
Latency: ~10,000-50,000 cycles (better than NVIDIA)
Tile-based architecture:
Intel Xe-HP (data center GPU):
- Multiple tiles (GPU dies)
- Each tile has own L1 TLB
- Shared L2 TLB across tiles
- Coherent with CPU TLBs
Apple's approach: True unified memory
M1/M2/M3:
- CPU and GPU share physical RAM
- No discrete GPU memory
- Shared page tables
- Zero-copy between CPU and GPU
TLB structure:
CPU Cores:
- Per-core L1 TLBs
- Shared L2 TLB
GPU Cores:
- Per-cluster TLBs
- Shared with CPU at some level (undocumented)
Performance benefits:
Traditional discrete GPU:
CPU→GPU transfer: PCIe bandwidth (~16 GB/s)
Apple M1:
CPU↔GPU shared memory: ~200 GB/s
No transfer needed!
I/O MMU (IOMMU) / System MMU (SMMU):
Provides virtual addressing for DMA devices (GPUs, NICs, storage controllers):
Intel VT-d (IOMMU):
Device Memory Access:
1. Device issues DMA with virtual address
2. IOMMU intercepts
3. IOMMU walks page tables (similar to CPU)
4. IOMMU TLB caches translation
5. Physical address sent to memory controller
IOMMU TLB structure:
Intel VT-d:
IOTLB: ~256-512 entries per device
Hierarchical structure
Supports 4KB, 2MB, 1GB pages
ARM SMMU (System MMU):
Stage 1: Device VA → IPA
Stage 2: IPA → PA (for virtualization)
Similar to ARM CPU MMU with Stage 2 translation
IOMMU page fault handling:
Device page fault:
1. IOMMU detects fault
2. Generates interrupt to CPU
3. OS page fault handler runs
4. Updates page tables
5. IOMMU TLB invalidated
6. Device DMA retried
Latency: ~10,000-100,000 cycles
GPU vs CPU TLB differences:
| Aspect | CPU TLB | GPU TLB |
|---|---|---|
| Threads | 2-8 per core | 1000s per GPU |
| Working set | MB-GB | GB-TB |
| TLB entries | 100s-1000s | 1000s-10,000s |
| Page fault tolerance | Medium | Very low |
| Coherency | Hardware | Limited/None |
Best practices:
// Prefetch before kernel launch
cudaMemPrefetchAsync(data, size, device);
cudaStreamSynchronize(stream);
launch_kernel<<<...>>>();
// Request 2MB pages for large allocations
cudaMallocManaged(&ptr, 1024 * 1024 * 1024); // 1 GB
cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, device);
CPU: Use 2MB transparent huge pages
GPU: Configure for 2MB pages
Result: Efficient TLB usage on both
Measurement: Matrix multiplication (4096×4096):
Configuration Time Page Faults
Discrete GPU, no prefetch 120 ms 15,000
Discrete GPU, with prefetch 45 ms 0
Integrated GPU (Apple M1) 28 ms 0 (unified memory)
Chapter 4 explored the Translation Lookaside Buffer as the performance linchpin of every virtual memory system. Unlike caches that store data, the TLB stores virtual-to-physical address translations, allowing the CPU to bypass the multi-level page table walk on every memory reference. A TLB miss forces a hardware page table walk consuming 10–100 or more cycles, and when working sets exceed TLB capacity this penalty can dominate total execution time—making TLB behaviour a first-order performance concern for systems programmers.
The architectural analysis revealed that modern processors deploy multi-level TLB hierarchies. The L1 TLB (typically 4–128 entries, 1-cycle lookup, fully associative) sits closest to the execution pipeline; the L2 unified STLB (512–4096 entries, 6–12 cycle lookup) provides a larger second-chance buffer. ARM's three-level hierarchy—micro-TLB, main TLB, and Walker Cache—illustrates how the same fundamental design challenge plays out differently across implementation families. Entry structure was dissected in detail: the physical frame number, permission bits (R/W/X), privilege level, dirty and accessed bits, and the critical context-tag fields that distinguish entries across address spaces.
Three distinct mechanisms for suppressing TLB flush overhead were examined. Intel's PCID (Process Context Identifier, 12 bits) tags all TLB entries with a process identifier, allowing the processor to retain translations across context switches and invalidate selectively via INVLPG and INVPCID. ARM64's ASID (Address Space Identifier, 8 or 16 bits configurable) provides equivalent functionality with its TLBI instruction family. RISC-V supports ASIDs in a broadly compatible manner via the satp CSR. Without these tags, every context switch flushes the entire TLB, an overhead measured in hundreds of microseconds on busy servers. In virtualised environments, ARM64's VMID extends the tagging scheme to two dimensions, permitting simultaneous retention of guest and hypervisor translations without mutual interference.
Large page support emerged as the highest-leverage TLB optimisation available to software engineers without hardware changes. A single 2 MB huge page on x86-64 covers 512 standard 4 KB pages, consuming one TLB entry that would otherwise require 512 entries for the same working set. The reach advantage translates directly to runtime: database and scientific computing workloads with tens of gigabytes of working data commonly achieve 15–30 percent throughput improvement from huge page adoption. ARM64's contiguous PTE hint and RISC-V's Svnapot extension offer equivalent mechanisms. The challenges—THP fragmentation, defragmentation latency, hugepage availability under memory pressure—require profiling to confirm net benefit in production.
Multicore TLB coherency introduced the TLB shootdown protocol. When a process unmaps a virtual range, all cores that may hold a cached translation for that range must invalidate it before the physical page can be reused; stale translations that permit access to recycled pages represent a security and correctness hazard. On x86-64 this is accomplished via inter-processor interrupts that execute INVLPG on every affected core; on ARM64 the TLBI VALE1IS/ASIDE1IS broadcast instructions offer more surgical options. At 128-core scale, a single munmap() can generate 127 IPIs consuming microseconds of cross-core synchronisation—a rarely-measured but real source of latency in high-throughput servers.
GPU and accelerator MMUs were examined as the chapter's final frontier. GPU L1 TLBs are small (64–128 entries per shader multiprocessor) but hundreds of them operate in parallel; a single miss from one warp can serialise the entire multiprocessor if it requires a page table walk. NVIDIA's UVM architecture and the Pichai coalescing mechanism represent the state of the art for GPU address translation at scale.
Chapter 5 extends this analysis to I/O devices, examining how the IOMMU performs the same address translation function for DMA operations—and why the costs and design choices differ fundamentally when the entity requesting translation cannot tolerate page faults or handle TLB misses in software.
Intel Corporation. Intel® 64 and IA-32 Architectures Optimization Reference Manual. Order Number: 248966-046. June 2023. Section 2.5: "Translation Lookaside Buffers."
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1. Chapter 4: "Paging" and Section 8.1.3: "Handling Self-Modifying Code." 2024.
ARM Limited. ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile. ARM DDI 0487J.a. March 2023. Chapter D5: "The AArch64 Virtual Memory System Architecture" and Section D5.10: "TLB maintenance instructions" and Section B2.2.5: "Maintenance of caches and memory hierarchy."
ARM Limited. ARM Cortex-A78 Core Technical Reference Manual. ARM 100230_0002_00_en. 2020.
RISC-V International. The RISC-V Instruction Set Manual, Volume I: Unprivileged ISA. Version 20191213. Section 2.7: "FENCE.I Instruction." December 2019.
RISC-V International. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture. Version 20211203. December 2021. Section 4.2.1: "Supervisor Memory-Management Fence Instruction."
AMD. Software Optimization Guide for AMD Family 19h Processors. Publication # 56665. June 2023.
Jacob, Bruce, and Trevor Mudge. "Virtual Memory: Issues of Implementation." IEEE Computer 31.6 (1998): 33-43. DOI: 10.1109/2.683005
Bhattacharjee, Abhishek, and Daniel Lustig. Architectural Support for Address Translation. Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2017. DOI: 10.2200/S00752ED1V01Y201610CAC037
Baumann, Andrew, et al. "The Multikernel: A new OS architecture for scalable multicore systems." Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP 2009). ACM, 2009. DOI: 10.1145/1629575.1629579
Amit, Nadav, et al. "Optimizing the TLB shootdown algorithm with page access tracking." 2017 USENIX Annual Technical Conference (ATC 17). 2017. Pages 27-39.
Vilanova, Lluís, et al. "CODOMs: Protecting software with code-centric memory domains." Proceedings of the 41st ACM SIGARCH International Symposium on Computer Architecture (ISCA 2014). 2014.
Bienia, Christian, et al. "The PARSEC benchmark suite: Characterization and architectural implications." Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT 2008). ACM, 2008. DOI: 10.1145/1454115.1454128
Kwon, Youngjin, et al. "Coordinated and Efficient Huge Page Management with Ingens." 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016). 2016. Pages 705-721.
Navarro, Juan, et al. "Practical, transparent operating system support for superpages." 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2002). 2002. Pages 89-104.
Barr, T. W., Cox, A. L., & Rixner, S. "Translation caching: skip, don't walk (the page table)." ACM SIGARCH Computer Architecture News, 39(3) (2011): 48-59. DOI: 10.1145/2024723.2000118
Talluri, M., & Hill, M. D. "Surpassing the TLB performance of superpages with less operating system support." ACM SIGPLAN Notices, 29(11) (1994): 171-182. DOI: 10.1145/191126.191235 [Inverted page tables analysis]
Liedtke, Jochen. "Address space sparsity and fine granularity." ACM SIGARCH Computer Architecture News 23.5 (1995): 78-86. DOI: 10.1145/224021.224028 Clustered page tables
Chang, Albert, and Mark F. Mergen. "801 storage: Architecture and programming." ACM Transactions on Computer Systems (TOCS) 6.1 (1988): 28-50. DOI: 10.1145/35037.35039 [Early inverted page table implementation]
Waldspurger, C. A. "Memory resource management in VMware ESX server." ACM SIGOPS Operating Systems Review, 36(SI) (2002): 181-194. DOI: 10.1145/844128.844146 [Classic shadow page table paper]
Adams, Keith, and Ole Agesen. "A comparison of software and hardware techniques for x86 virtualization." Proceedings of the 12th international conference on Architectural support for programming languages and operating systems (ASPLOS 2006). ACM, 2006. DOI: 10.1145/1168857.1168860
Bhargava, Ravi, et al. "Accelerating two-dimensional page walks for virtualized systems." ACM SIGARCH Computer Architecture News 36.1 (2008): 26-35. DOI: 10.1145/1353534.1346286
Ahn, J., Jin, S., & Huh, J. "Revisiting hardware-assisted page walks for virtualized systems." Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA 2012). IEEE, 2012. DOI: 10.1109/ISCA.2012.6237040 [EPT/NPT TLB performance analysis with mismatched page sizes]
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. Section 11.6: "Self-Modifying Code." 2024.
ARM Limited. ARM Architecture Reference Manual ARMv8. Section D4.4.4: "Instruction caches and memory coherency." 2023.
Boehm, Hans-J., and Sarita V. Adve. "Foundations of the C++ concurrency memory model." Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2008). 2008. DOI: 10.1145/1375581.1375591
NVIDIA Corporation. NVIDIA CUDA C Programming Guide. Version 12.0. 2023. Chapter 3.2: "Unified Memory Programming."
AMD. Graphics Core Next Architecture, Generation 3. White Paper. 2016.
Intel Corporation. Intel® Data Center GPU Max Series Software Developer's Guide. Document Number: 767252-001. 2023.
ARM Limited. ARM CoreLink MMU-600 System Memory Management Unit Technical Reference Manual. ARM 100310_0200_00_en. 2020. [IOMMU/SMMU for accelerators]
Silicon Graphics, Inc. (SGI). OpenGL on SGI Reality Engine Graphics. Technical Report. 1994. [Early GPU virtual memory concepts]
Pichai, Bharath, et al. "Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces." ACM SIGPLAN Notices 49.4 (2014): 743-758. DOI: 10.1145/2644865.2541940
Power, Jason, et al. "Supporting x86-64 address translation for 100s of GPU lanes." 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2014. DOI: 10.1109/HPCA.2014.6835960
Ausavarungnirun, Rachata, et al. "Mosaic: A GPU memory manager with application-transparent support for multiple page sizes." Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2017). ACM, 2017. DOI: 10.1145/3123939.3123975 [GPU TLB with mixed page sizes]
Intel Corporation. Intel® Virtualization Technology for Directed I/O (VT-d): Architecture Specification. Revision 3.3. September 2020.
Ben-Yehuda, Muli, et al. "The price of safety: Evaluating IOMMU performance." Linux Symposium. Vol. 9. 2007. Pages 9-20.
ARM Limited. ARM System Memory Management Unit Architecture Specification, SMMU architecture version 3. ARM IHI 0070D.b. 2019.
Basu, Arkaprava, et al. "Efficient virtual memory for big memory servers." Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA 2013). ACM, 2013. DOI: 10.1145/2485922.2485943
Karakostas, Vasileios, et al. "Redundant memory mappings for fast access to large memories." Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA 2015). IEEE, 2015. DOI: 10.1145/2749469.2750383
Cox, Russ, and William Josephson. "Optimizing network virtualization in Xen." Proceedings of the USENIX Annual Technical Conference. 2006.
Gandhi, Jayneel, et al. "Badger TLB: A mechanism for optimizing TLB miss handling." 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2014. DOI: 10.1109/MICRO.2014.45
Smith, Alan Jay, and James R. Goodman. "The impact of memory management on commercial computer systems." Communications of the ACM 26.12 (1983): 1002-1011. DOI: 10.1145/358476.358484
Denning, Peter J. "Virtual memory." ACM Computing Surveys (CSUR) 2.3 (1970): 153-189. DOI: 10.1145/356571.356573 [Seminal paper on virtual memory]
Kilburn, Tom, et al. "One-level storage system." IRE Transactions on Electronic Computers 2 (1962): 223-235. DOI: 10.1109/IREELC.1962.5408922 [Atlas computer - first virtual memory system]
Rosenblum, Mendel. "The design and implementation of a log-structured file system." ACM Transactions on Computer Systems (TOCS) 10.1 (1992): 26-52. [Discusses TLB implications of different page table designs]
Saulsbury, Ashley, Fredrik Dahlgren, and Per Stenström. "Recency-based TLB preloading." ACM SIGARCH Computer Architecture News 28.2 (2000): 117-127. DOI: 10.1145/342001.339668
Marwaha, Manish, and Abhishek Bhattacharjee. "Spatiotemporal optimizations for multi-GPU systems via decoupled modules." IEEE Computer Architecture Letters 21.2 (2022): 85-88. [Modern GPU MMU optimizations]
Intel Corporation. Intel Data Center GPU Flex Series Product Brief. 2023. intel.com/data-center-gpu-flex