In the first two chapters, we established the fundamental concepts of memory management and virtual memory. We explored how the Memory Management Unit (MMU) provides the crucial abstraction layer between virtual addresses—what programs see—and physical addresses—where data actually resides in RAM. We introduced page tables as the data structures that store these virtual-to-physical mappings, and we touched on important optimization structures like the Translation Lookaside Buffer (TLB) and multi-level page tables.
Now, in Part 2 of this book, we dive deep into the heart of virtual memory systems: the page tables themselves. These deceptively simple data structures represent one of the most critical design decisions in computer architecture, influencing everything from memory overhead and translation speed to security guarantees and virtualization capabilities.
The design of page table structures has far-reaching implications across multiple dimensions:
Performance: Every memory access by a running program potentially requires consulting the page tables. With modern processors executing billions of instructions per second, even small inefficiencies in address translation can significantly impact overall system performance. The difference between a well-designed page table structure and a poor one can mean the difference between 5% and 50% overhead in memory-intensive applications.

Memory Overhead: Page tables themselves consume memory—sometimes substantial amounts. A naive single-level page table for a 64-bit address space would require tens of petabytes of memory just to store the mapping information! Practical page table designs must balance the memory overhead of the tables themselves against the efficiency of the translation process.

Security: Page tables are the primary mechanism for memory protection in modern systems. They enforce process isolation, preventing programs from reading or modifying each other's memory. They enable key security features like the NX (No Execute) bit that prevents code execution from data pages, mitigating entire classes of exploits. The robustness of these protections depends critically on page table design and implementation.

Virtualization: Modern cloud computing relies on virtualization, where multiple guest operating systems run simultaneously on shared hardware. This introduces a new layer of complexity: guest operating systems manage their own page tables (translating guest virtual addresses to what they perceive as physical addresses), while the hypervisor must translate these "guest physical addresses" to actual physical addresses. This two-stage translation process doubles the complexity of address translation and places new demands on page table structures.

Page table design has evolved dramatically over the past five decades:
The 1960s-1970s: Early virtual memory systems like the Atlas computer and later the VAX-11 used simple, single-level page tables. With 32-bit or smaller address spaces, this straightforward approach was practical—a 32-bit address space with 4KB pages requires only 2^20 (about 1 million) page table entries.

The 1980s-1990s: As address spaces grew and memory became cheaper but still finite, multi-level page tables became standard. The 32-bit x86 architecture introduced two-level tables, and this hierarchical approach allowed systems to allocate page table memory only for regions of the address space actually in use.

The 2000s: The transition to 64-bit computing posed new challenges. A naive single-level page table for a 64-bit address space would be impossibly large. Modern x86-64 processors use four-level page tables, and recent models support five levels. The introduction of hardware virtualization support (Intel EPT and AMD NPT) added the complexity of two-stage address translation.

The 2010s-Present: Contemporary systems face new pressures from big data workloads, virtualized cloud environments, and security threats. Innovations like huge pages (2MB and 1GB mappings), transparent huge page support, and increasingly sophisticated TLB designs reflect the ongoing optimization of page table structures for modern workloads.

In this chapter, we will systematically explore page table structures from simple to complex, always grounding our discussion in real-world implementations:
Single-Level Page Tables (Section 3.2): We begin with the simplest approach, examining why it works for small address spaces and understanding its fundamental limitations. We'll look at historical systems like the VAX-11 and modern embedded systems that still use this approach.

Two-Level Page Tables (Section 3.3): We'll examine how a second level of indirection dramatically reduces memory overhead, using the 32-bit x86 architecture as our primary example. This section includes detailed address translation examples with concrete numbers.

Multi-Level Page Tables (Section 3.4): This is where we get into the details of modern architectures. For each architecture, we'll examine the exact address format, register usage, and translation process.
Page Table Entry Deep Dive (Section 3.5): We'll dissect the bit-level structure of page table entries across x86, ARM, and RISC-V architectures, understanding how each bit affects memory access, permissions, and system behavior. We'll explain the difference between hardware-managed and software-managed bits, and how operating systems leverage this structure.

Virtualization and Two-Stage Translation (Section 3.6): One of the most important sections in this chapter, here we tackle the complexities of running virtual machines.

Throughout this chapter, we will focus on three major architecture families that together represent the vast majority of computing devices in use today:
x86-64 (Intel and AMD): Dominates servers, desktops, and laptops. We'll draw heavily from the Intel 64 and IA-32 Architectures Software Developer's Manual and AMD's equivalent documentation.

ARM64 (ARMv8-A): Powers billions of smartphones, tablets, and increasingly, servers (AWS Graviton, Apple M-series). We'll reference the ARM Architecture Reference Manual.

RISC-V: An emerging open-source architecture with clean design and growing adoption. We'll use the RISC-V Privileged Architecture Specification.

Each architecture offers unique insights into page table design, and by comparing them, we'll develop a comprehensive understanding of the design space and trade-offs involved.
This chapter is the first of several that will explore page tables in depth.
By the end of this chapter, you will have a thorough understanding of how modern processors implement address translation, why they make the design choices they do, and how these choices affect system performance, security, and capability. You'll be able to read architecture manuals with confidence, understand performance profiles of memory-intensive applications, and make informed decisions about page size and memory configuration in production systems.
Let's begin our detailed exploration of page table structures.
The single-level page table represents the most straightforward approach to address translation. Conceptually, it's nothing more than a large array where each entry maps one virtual page to one physical page. Despite its simplicity—or perhaps because of it—understanding single-level page tables provides the foundation for comprehending more sophisticated designs.
In a single-level page table system, the virtual address is divided into two parts:
1. Page number: Used as an index into the page table
2. Page offset: Identifies the byte within the page
For a system with 4KB (4096-byte) pages, the offset requires 12 bits (since 2^12 = 4096). The remaining high-order bits form the page number. For a 32-bit address space, this gives us:

Bits 31-12: Page number (20 bits)
Bits 11-0: Page offset (12 bits)
The page number component provides 2^20 = 1,048,576 possible pages, meaning our page table needs exactly this many entries.
The Page Table Base Register (PTBR)

Every architecture that implements virtual memory includes a special register that points to the base address of the current page table. When the processor needs to translate a virtual address, it starts with this register. The specific register varies by architecture:
x86 Architecture: Uses the CR3 register (Control Register 3), also called the "page table base register" or PTBR in generic architectural discussions. CR3 holds the physical address of the top-level page directory. When the operating system switches processes (context switch), it loads CR3 with the physical address of the new process's page table structure.

ARM Architecture: Uses two translation table base registers for flexibility: TTBR0 for the lower (user) portion of the address space and TTBR1 for the upper (kernel) portion. This dual-register approach allows ARM systems to maintain separate page tables for user and kernel space, with independent switching. User processes can switch (updating TTBR0) without touching kernel mappings (TTBR1).
RISC-V Architecture: Uses the satp register (Supervisor Address Translation and Protection). The satp register not only contains the base address of the page table but also encodes the paging mode (Sv39, Sv48, or Sv57) and an Address Space Identifier (ASID) for TLB tagging. Specifically, for RV64 (64-bit RISC-V):

Bits 63-60: MODE (0 = Bare, 8 = Sv39, 9 = Sv48, 10 = Sv57)
Bits 59-44: ASID (16 bits)
Bits 43-0: PPN (physical page number of the root page table)
When the CPU needs to translate a virtual address using a single-level page table, the process is straightforward:
1. Extract page number from virtual address (VA bits 31-12 for 32-bit, 4KB pages)
2. Read base address from PTBR (CR3 on x86, TTBR0_EL1 on ARM, satp on RISC-V)
3. Calculate PTE address: PTE_address = PTBR + (page_number × entry_size)
4. Read PTE from memory at PTE_address
5. Extract physical page number from PTE
6. Combine with page offset: PA = (physical_page_number << 12) | page_offset
7. Access physical memory at PA
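The steps above can be sketched in Python. This is a minimal model rather than real MMU hardware: the page table is just a list of physical page numbers, with None marking unmapped pages.

```python
PAGE_SHIFT = 12                       # 4KB pages: the offset is the low 12 bits
OFFSET_MASK = (1 << PAGE_SHIFT) - 1

def translate_single_level(va, page_table):
    """Translate a 32-bit virtual address through a flat page table."""
    page_number = va >> PAGE_SHIFT            # step 1: extract the page number
    pte = page_table[page_number]             # steps 2-4: one memory read
    if pte is None:                           # invalid entry: no mapping exists
        raise MemoryError("page fault at VA 0x%08x" % va)
    return (pte << PAGE_SHIFT) | (va & OFFSET_MASK)   # steps 5-6: combine

# Map virtual page 0x00405 to physical page 0x00305 (illustrative values)
table = [None] * (1 << 20)                    # 2^20 entries for a 32-bit space
table[0x00405] = 0x00305
pa = translate_single_level(0x00405678, table)
# pa == 0x00305678
```

Note that the model performs exactly one table lookup per translation, mirroring the single memory access described above.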
This process requires one memory access to read the page table entry (assuming the PTBR is in a CPU register, which it is). If the translation is not cached in the TLB, every memory access requires this additional lookup, effectively doubling the number of memory accesses.
The memory overhead of single-level page tables is easy to calculate but reveals the approach's fundamental limitation.
32-bit Address Space Example

Consider a classic 32-bit system with 4KB pages:

Virtual address space: 2^32 bytes = 4GB
Number of pages: 2^32 / 2^12 = 2^20 = 1,048,576
Entry size: 4 bytes
Page table size: 2^20 × 4 bytes = 4MB
This means every single process requires 4MB just for its page table, regardless of how much memory the process actually uses. A process using only 4KB of memory still needs a 4MB page table—a 1024:1 overhead ratio!
The VAX-11/780: A Historical Example

The VAX-11/780, introduced by Digital Equipment Corporation in 1978, provides an excellent historical example of single-level page tables in practice. The VAX-11 architecture used 32-bit virtual addresses with unusually small 512-byte pages.
With 512-byte pages, a 32-bit address space requires 2^32 / 2^9 = 2^23 = 8,388,608 page table entries.
At 4 bytes per entry, this would be 32MB per process—an enormous amount in an era when the VAX-11/780 shipped with only 2MB to 8MB of physical memory! In practice, the VAX used a clever optimization: the virtual address space was divided into four regions (P0, P1, S0, S1), each with its own page table. User processes used P0 and P1, which could be much smaller than the full address space, making the overhead manageable.
64-bit Address Space: The Impossibility

To understand why single-level page tables don't scale to 64-bit address spaces, let's do the arithmetic. With 4KB pages, a 64-bit address space contains 2^64 / 2^12 = 2^52 pages; at 8 bytes per entry, the table would occupy 2^52 × 8 = 2^55 bytes = 32 petabytes.

To put this in perspective, 32 petabytes is far more than the physical memory of even the largest supercomputers, and the table would be needed for every process.

Even if we used larger 2MB pages (21 bits for offset), the table would still need 2^43 entries × 8 bytes = 64 terabytes per process.

Still impossibly large. This calculation alone demonstrates why multi-level page tables are absolutely necessary for 64-bit computing.
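These sizes can be checked mechanically. A quick sketch (the 8-byte entry size is assumed, matching 64-bit architectures):

```python
def flat_table_bytes(va_bits, page_shift, entry_size=8):
    """Memory consumed by a fully-allocated single-level page table."""
    entries = 1 << (va_bits - page_shift)     # one entry per virtual page
    return entries * entry_size

TB = 1 << 40
PB = 1 << 50

# 64-bit space, 4KB pages: 2^52 entries x 8 bytes = 32 petabytes
assert flat_table_bytes(64, 12) == 32 * PB
# 64-bit space, 2MB pages: 2^43 entries x 8 bytes = 64 terabytes
assert flat_table_bytes(64, 21) == 64 * TB
# For comparison, a 32-bit space with 4KB pages and 4-byte entries: 4MB
assert flat_table_bytes(32, 12, entry_size=4) == 4 * (1 << 20)
```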
Despite their limitations, single-level page tables have some advantages:
Simplicity: The translation algorithm is trivial. There's only one memory access (to fetch the PTE), and no complex tree traversal.

Predictable Performance: Every translation takes exactly the same time—one memory access. There are no worst-case scenarios where deeply nested page tables create variable latency.

Fast Context Switching: Switching page tables only requires updating a single register (PTBR/CR3/TTBR/satp). No complex page table structures need to be manipulated.

Cache Friendly: Since the entire page table is contiguous in memory, it benefits from spatial locality. Sequential address translations access nearby PTEs, which may be cached together.

When They're Still Used Today

Single-level page tables remain practical in specific scenarios:
Small Address Spaces: Embedded systems with limited address spaces (e.g., 16-bit or small 32-bit systems) where the page table fits comfortably in available memory.

Real-Time Systems: Where predictable, constant-time translation latency is more important than memory efficiency. Some safety-critical embedded systems prefer single-level tables because their behavior is completely predictable.

Hardware Accelerators: Some specialized processors (GPUs, DSPs) with limited memory management capabilities use simplified single-level schemes.

Educational Systems: Courses and textbooks on operating systems often implement single-level tables first because they're easiest to understand and implement.

The fatal flaw of single-level page tables is the complete page table problem: the page table must be fully allocated even if only a tiny fraction of the virtual address space is actually used.
Sparse Address Space Inefficiency

Modern programs have sparse memory usage patterns. A typical application might use a small code segment near the bottom of the address space, a data segment and heap above it, shared libraries mapped somewhere in the middle, and a stack near the top.
These regions are typically scattered across a 4GB (32-bit) or larger address space, with vast empty regions between them. Yet a single-level page table allocates entries for every possible page in the address space, including all the unused regions.
Example: Small Program on 32-bit System

Consider a minimal "Hello, World" program that touches only about 16KB of memory: a few pages of code, data, and stack.

With single-level page tables:

Page table size: 4MB (fixed, regardless of usage)
Actual memory used: ~16KB
Overhead ratio: 4MB / 16KB = 256:1
The page table is 256 times larger than the program's actual memory usage! This is clearly unsustainable as systems scale.
Transition to Multi-Level Tables

This limitation drove the development of hierarchical page table structures, which we'll examine in the next section. The key insight: if we can avoid allocating page table entries for unused regions of the address space, we can dramatically reduce memory overhead while maintaining the virtual memory abstraction.
The single-level page table, despite its limitations, teaches us the fundamental mechanics of address translation. Every more sophisticated page table structure builds on these basics, adding levels of indirection to solve the memory overhead problem while preserving the elegance of the virtual-to-physical mapping concept.
The two-level page table represents a breakthrough in virtual memory design. By introducing one level of indirection—a page table that points to page tables—we can eliminate the requirement that the entire page table be allocated at once. This simple change reduces memory overhead by orders of magnitude while adding only modest complexity to the translation process.
A two-level page table divides the virtual address into three parts instead of two:
1. Page directory index: Selects an entry in the page directory (the first level)
2. Page table index: Selects an entry in a page table (the second level)
3. Page offset: Identifies the byte within the physical page
The page directory is always present—it's the single top-level structure pointed to by the PTBR register. However, the second-level page tables are allocated on demand. If a region of the virtual address space is not in use, the corresponding page directory entry is marked invalid, and no page table is allocated for that region.
This is the key insight: we only allocate page tables for regions of virtual address space that are actually mapped to physical memory.
Directory → Table → Page Hierarchy

The translation process now involves two table lookups:
Virtual Address
↓
Page Directory Index → [Page Directory] → Page Directory Entry
↓
(contains physical address of page table)
↓
Page Table Index → [Page Table] → Page Table Entry
↓
(contains physical page number)
↓
Physical Page + Offset → Physical Address
Compare this to single-level:

Virtual Address
↓
Page Number → [Page Table] → Page Table Entry
↓
Physical Page + Offset → Physical Address
We've traded one additional memory access for the ability to leave second-level page tables unallocated.
The Intel IA-32 architecture (32-bit x86) provides the canonical example of two-level paging. Introduced with the Intel 80386 in 1985, this design influenced countless subsequent architectures.
Address Format

For 32-bit x86 with 4KB pages:
Bits 31-22: Page Directory Index (10 bits) = 1024 entries
Bits 21-12: Page Table Index (10 bits) = 1024 entries
Bits 11-0: Page Offset (12 bits) = 4096 bytes
Why 10 bits for each index? Because each table (both directory and page tables) is designed to fit exactly in one 4KB page: 1024 entries × 4 bytes per entry = 4096 bytes = one page.
This self-referential elegance means page tables can be paged—the page tables themselves are just pages that can be swapped out if memory is tight!
Page Directory Entry (PDE) Structure

Each 32-bit PDE contains:
Bits 31-12: Physical address of page table (20 bits)
Bit 11: Reserved
Bit 10: Reserved
Bit 9: Available for system use
Bit 8: Global (G)
Bit 7: Page Size (PS) - if 1, this PDE maps a 4MB page directly
Bit 6: Reserved (0)
Bit 5: Accessed (A)
Bit 4: Cache Disable (PCD)
Bit 3: Write-Through (PWT)
Bit 2: User/Supervisor (U/S)
Bit 1: Read/Write (R/W)
Bit 0: Present (P)
The Present bit (P) is crucial: if P=0, this page directory entry is invalid, and no page table exists for this 4MB region of virtual address space. The CPU doesn't access this memory at all.
Page Table Entry (PTE) Structure

Each 32-bit PTE contains:
Bits 31-12: Physical page frame number (20 bits)
Bit 11: Reserved
Bit 10: Reserved
Bit 9: Available for system use
Bit 8: Global (G)
Bit 7: Page Attribute Table (PAT)
Bit 6: Dirty (D)
Bit 5: Accessed (A)
Bit 4: Cache Disable (PCD)
Bit 3: Write-Through (PWT)
Bit 2: User/Supervisor (U/S)
Bit 1: Read/Write (R/W)
Bit 0: Present (P)
Again, if P=0, this virtual page is not mapped to physical memory, and accessing it triggers a page fault.
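Decoding these bits is simple masking. A sketch using the flag positions above (the helper name and dictionary layout are illustrative, not part of any real API):

```python
# Flag bit positions in a 32-bit x86 PTE, matching the layout above
P, RW, US, PWT, PCD, A, D, PAT, G = (1 << b for b in range(9))

def decode_pte(pte):
    """Return the frame base and the commonly-inspected flags of a PTE."""
    return {
        "frame":    pte & 0xFFFFF000,    # bits 31-12: physical frame base
        "present":  bool(pte & P),
        "writable": bool(pte & RW),
        "user":     bool(pte & US),
        "accessed": bool(pte & A),
        "dirty":    bool(pte & D),
    }

info = decode_pte(0x00305027)   # frame 0x00305000 with P, R/W, U/S, A set
```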
CR3 Register and Translation

The CR3 register (Control Register 3) points to the physical address of the page directory:
Bits 31-12: Physical address of page directory (20 bits)
Bits 11-0: Flags and reserved bits
When the MMU translates an address:
1. Extract indices: Split virtual address into directory index (bits 31-22), table index (bits 21-12), and offset (bits 11-0)
2. Read PDE:
- Calculate PDE address: CR3[31:12] + (directory_index × 4)
- Read 32-bit PDE from this address
- Check Present bit: if P=0, generate page fault
3. Read PTE:
- Extract page table base from PDE[31:12]
- Calculate PTE address: PDE[31:12] + (table_index × 4)
- Read 32-bit PTE from this address
- Check Present bit: if P=0, generate page fault
4. Form physical address:
- Extract physical page number from PTE[31:12]
- Combine with offset: PA = (PTE[31:12] << 12) | offset
5. Access memory: Read from or write to the physical address
Let's translate a specific virtual address step by step to make this concrete.
Given:

Virtual Address: 0x0040_5678 = 0000 0000 0100 0000 0101 0110 0111 1000
CR3 (page directory base): 0x0010_0000

Step 1: Extract the address fields
Bits 31-22 (PD Index): 0000 0000 01 = 0x001 = 1
Bits 21-12 (PT Index): 00 0000 0101 = 0x005 = 5
Bits 11-0 (Offset): 0110 0111 1000 = 0x678 = 1656
Step 2: Read Page Directory Entry

So we need:
PDE Address = CR3 + (PD_index × 4)
= 0x0010_0000 + (1 × 4)
= 0x0010_0000 + 0x4
= 0x0010_0004
Read 32-bit value at physical address 0x0010_0004. Let's say we find:
PDE = 0x0020_0007
Interpret this PDE:
Bits 31-12: 0x0020_0 = physical address of page table (0x0020_0000)
Bits 11-0: 0x007 = flags
Bit 0 (P): 1 = Present ✓
Bit 1 (R/W): 1 = Read/Write allowed
Bit 2 (U/S): 1 = User accessible
Other bits: 0 = disabled/not used
The page table is at physical address 0x0020_0000.
Step 3: Read Page Table Entry

PTE Address = Page_Table_Base + (PT_index × 4)
= 0x0020_0000 + (5 × 4)
= 0x0020_0000 + 0x14
= 0x0020_0014
Read 32-bit value at physical address 0x0020_0014. Let's say we find:
PTE = 0x0030_5027
Interpret this PTE:
Bits 31-12: 0x0030_5 = physical page frame number (0x0030_5000)
Bits 11-0: 0x027 = flags
Bit 0 (P): 1 = Present ✓
Bit 1 (R/W): 1 = Read/Write allowed
Bit 2 (U/S): 1 = User accessible
Bit 5 (A): 1 = Accessed (hardware set this on previous access)
Other bits: 0 = disabled/not used
The physical page is at 0x0030_5000.
Step 4: Form final physical address

Physical Address = Physical_Page_Base + Offset
= 0x0030_5000 + 0x678
= 0x0030_5678
Without a TLB cache hit, this translation requires two additional memory accesses beyond the actual data access. This is why the TLB is so critical for performance!
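We can replay this walk in code. The sketch below models physical memory as a dictionary of 32-bit entries, seeded with the values from the example above:

```python
def walk_two_level(va, cr3, mem):
    """32-bit x86 two-level walk; mem maps physical addresses to 32-bit entries."""
    pd_index = (va >> 22) & 0x3FF             # bits 31-22
    pt_index = (va >> 12) & 0x3FF             # bits 21-12
    offset   = va & 0xFFF                     # bits 11-0

    pde = mem[cr3 + pd_index * 4]             # first extra memory access
    if not (pde & 1):                         # Present bit
        raise MemoryError("page fault: PDE not present")

    pte = mem[(pde & 0xFFFFF000) + pt_index * 4]   # second extra memory access
    if not (pte & 1):
        raise MemoryError("page fault: PTE not present")

    return (pte & 0xFFFFF000) | offset

mem = {
    0x0010_0004: 0x0020_0007,   # PDE 1 -> page table at 0x0020_0000
    0x0020_0014: 0x0030_5027,   # PTE 5 -> frame 0x0030_5000
}
pa = walk_two_level(0x0040_5678, 0x0010_0000, mem)
# pa == 0x0030_5678
```

The two dictionary lookups correspond exactly to the two extra memory accesses the TLB exists to avoid.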
Now let's quantify the memory savings that two-level page tables provide.
Scenario: Process Using 16MB of Memory

Consider a process that uses 16MB of memory scattered across its address space: a code segment at 0x0000_0000 - 0x003F_FFFF, plus heap, shared library, and stack regions, each falling within its own 4MB-aligned region.

Let's calculate which page directory entries need page tables. Each second-level page table covers 4MB of virtual address space, so these four regions need four page tables:

Page directory: 4KB
Page tables (4): 16KB
Total: 20KB
Single-level: 4,096KB (4MB)
Two-level: 20KB
Savings: 4,076KB (99.5% reduction!)
We've reduced page table memory overhead by 99.5%! Instead of 4MB for every process regardless of memory usage, we now use only 20KB for this 16MB process—and that overhead grows proportionally with actual memory usage rather than address space size.
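The overhead can be computed directly from which 4MB regions are touched. A sketch (the example regions are illustrative):

```python
def two_level_overhead(mapped_vpns):
    """Page-table bytes needed for a set of mapped virtual page numbers."""
    PAGES_PER_TABLE = 1024                    # one page table spans 4MB
    tables_needed = {vpn // PAGES_PER_TABLE for vpn in mapped_vpns}
    return 4096 + len(tables_needed) * 4096   # directory + per-region tables

# 16MB used as four 4MB regions (code, heap, libraries, stack) -> 4 tables
pages = []
for region_start in (0x00000, 0x10000, 0x20000, 0x30000):  # illustrative VPNs
    pages.extend(range(region_start, region_start + 1024))
# two_level_overhead(pages) == 20 * 1024 bytes, vs. 4MB for single-level
```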
Worst-Case Scenario

What if the process used memory maximally scattered across the entire 4GB address space? For instance, one page in each 4MB region: all 1024 page directory entries would need a page table, so page tables would total 1024 × 4KB = 4MB, plus the 4KB directory.
In the absolute worst case, two-level page tables use approximately the same memory as single-level tables. But this worst case is extremely rare. Real programs exhibit locality—code, data, heap, and stack cluster together rather than spreading uniformly across the entire address space.
Typical Case: Much Better

Most processes use tens to hundreds of megabytes of memory in contiguous or semi-contiguous regions. A 100MB process with good locality, for example, needs only about 25 page tables (100KB) plus the 4KB directory.
Compare to 4MB overhead for every process with single-level tables, and the savings are dramatic for the common case.
Two-level page tables reduce memory overhead but at a cost: translation now requires two memory accesses instead of one (when the TLB doesn't have the translation cached).
Performance Impact Without TLB:

Single-level: 1 memory access (page table lookup)
Two-level: 2 memory accesses (directory lookup + table lookup)
If every memory instruction required these lookups, two-level tables would triple memory traffic, leaving only a third of the bandwidth for actual data! This is why the TLB is essential:
Performance Impact With TLB:

TLB hit rate: 95-99% (typical)
Effective cost: ~1.0-1.1 memory accesses per instruction
With a 95% TLB hit rate, the average translation cost is 0.95 × 0 + 0.05 × 2 = 0.1 extra memory accesses per reference.
The TLB amortizes the translation cost so effectively that the performance penalty of two-level tables is typically negligible—a few percent at most—while the memory savings are dramatic.
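The amortization is easy to quantify. A small sketch with illustrative numbers:

```python
def extra_accesses_per_reference(tlb_hit_rate, walk_levels):
    """Average additional memory reads per reference caused by page walks."""
    return (1.0 - tlb_hit_rate) * walk_levels

# Two-level tables: a 95% hit rate costs 0.05 x 2 = 0.1 extra accesses
assert abs(extra_accesses_per_reference(0.95, 2) - 0.10) < 1e-9
# At 99% the cost drops to 0.02 extra accesses per reference
assert abs(extra_accesses_per_reference(0.99, 2) - 0.02) < 1e-9
```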
Two-level page tables solved the memory overhead problem for 32-bit address spaces, but they still have limitations:
32-bit Limitation: Even in the worst case, the overhead is bounded at ~4MB, which is acceptable for systems with gigabytes of RAM.

64-bit Impossibility: For 64-bit address spaces, even two-level tables don't suffice. With 4KB pages and 8-byte entries, keeping the second level at one page (512 entries) would leave a top-level table with 2^43 entries, requiring 64TB of memory. Still too large! This is why modern 64-bit systems use three, four, or even five levels of page tables, which we'll explore in the next section.
The two-level page table represents a sweet spot for 32-bit systems: it provides dramatic memory savings over single-level tables while adding minimal complexity and only modest (TLB-mitigated) performance overhead. Its success in the x86 architecture influenced the design of virtually all subsequent page table implementations.
The transition from 32-bit to 64-bit computing created an entirely new scale of challenge for page table design. A 64-bit address space is not merely twice as large as 32-bit—it's 2^32 times larger. Even with two-level page tables, the overhead would be unmanageable. The solution: add more levels to the hierarchy.
Modern architectures typically use three, four, or even five levels of page tables. Each additional level multiplies the sparseness advantage: we only allocate page table structures for the portions of the vast 64-bit address space that are actually in use. In this section, we'll examine the specific implementations used by the three dominant architecture families: x86-64 (Intel and AMD), ARM64, and RISC-V.
When AMD introduced the x86-64 architecture (also called AMD64 or x86_64) in 2003, it faced a critical design decision: how to handle the exponentially larger 64-bit address space? The solution was four-level paging, which Intel adopted when it began producing 64-bit processors with EM64T (later rebranded Intel 64).
The 48-Bit Compromise

An important detail: despite being called "64-bit," x86-64 processors don't actually use all 64 bits for virtual addresses. Early implementations use only 48 bits, supporting a 256TB virtual address space. This was a pragmatic decision: 256TB far exceeded any foreseeable need at the time, and translating fewer bits keeps page walks shorter and the hardware simpler.
This design allows future expansion if 256TB ever becomes limiting (which it eventually did, as we'll see with five-level paging).
Four-Level Structure

The x86-64 four-level page table hierarchy uses these names:
1. PML4 (Page Map Level 4): The top level
2. PDPT (Page Directory Pointer Table): Third level
3. PD (Page Directory): Second level
4. PT (Page Table): Lowest level, contains PTEs that point to actual pages
Each level contains 512 entries (not 1024 like in 32-bit x86), and each entry is 8 bytes (64 bits). This gives each table a size of:
512 entries × 8 bytes = 4,096 bytes = 4KB = one page
Once again, the page table structures themselves fit in single pages!
Virtual Address Format

A canonical 48-bit x86-64 virtual address is divided:
Bits 63-48: Sign extension (16 bits) - must equal bit 47
Bits 47-39: PML4 index (9 bits) = 512 entries, each covers 512GB
Bits 38-30: PDPT index (9 bits) = 512 entries, each covers 1GB
Bits 29-21: PD index (9 bits) = 512 entries, each covers 2MB
Bits 20-12: PT index (9 bits) = 512 entries, each covers 4KB
Bits 11-0: Page offset (12 bits) = 4096 bytes
Let's understand what each level covers:

- One PML4 entry covers 512GB: 512 (PDPT entries) × 1GB = 512GB
- One PDPT entry covers 1GB: 512 (PD entries) × 2MB = 1GB
- One PD entry covers 2MB: 512 (PT entries) × 4KB = 2MB
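Splitting a virtual address into these fields is pure bit manipulation. A sketch (function names are illustrative):

```python
def split_x86_64_va(va):
    """Split a 48-bit virtual address into its translation fields."""
    return {
        "pml4":   (va >> 39) & 0x1FF,   # bits 47-39
        "pdpt":   (va >> 30) & 0x1FF,   # bits 38-30
        "pd":     (va >> 21) & 0x1FF,   # bits 29-21
        "pt":     (va >> 12) & 0x1FF,   # bits 20-12
        "offset": va & 0xFFF,           # bits 11-0
    }

def is_canonical(va):
    """Bits 63-48 must all replicate bit 47."""
    top = va >> 47                       # bit 47 plus the sign-extension bits
    return top == 0 or top == 0x1FFFF

fields = split_x86_64_va(0x0000_7FFF_FFFF_F000)
# fields["pml4"] == 255; the lower three indices are all 511; offset is 0
```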
In 64-bit mode, CR3 still points to the top-level page structure, but now that's the PML4:
Bits 63-52: Reserved (0)
Bits 51-12: Physical address of PML4 (40 bits)
Bits 11-5: Reserved (0)
Bit 4: PCD (page-level cache disable)
Bit 3: PWT (page-level write-through)
Bits 2-0: Reserved (0)
The 40-bit frame field covers physical address bits 51-12, supporting physical addresses up to 52 bits (4PB of physical RAM), though actual CPU models implement fewer physical address bits.
Entry Format

All four levels (PML4E, PDPTE, PDE, PTE) share a similar 64-bit format, though some bits have level-specific meanings:
Bit 63: XD (Execute Disable) / NX (No Execute)
Bits 62-52: Available for OS use (11 bits)
Bits 51-12: Physical address of next level / page frame (40 bits)
Bits 11-9: Available for OS use (3 bits)
Bit 8: Global (G) - only in lowest-level PTE
Bit 7: PS (Page Size) - in PDPTE/PDE, indicates huge page
Bit 6: Dirty (D) - only in lowest-level PTE
Bit 5: Accessed (A)
Bit 4: PCD (Page-level Cache Disable)
Bit 3: PWT (Page-level Write-Through)
Bit 2: U/S (User/Supervisor)
Bit 1: R/W (Read/Write)
Bit 0: P (Present)
Key points: the XD bit takes effect only when the NXE bit is enabled in the EFER MSR; the PS bit turns a PDE into a 2MB mapping or a PDPTE into a 1GB mapping; and the Dirty and Global bits are meaningful only in entries that map a final page.
The hardware page walker performs these steps on a TLB miss:
1. Read CR3 → get PML4 physical address
2. Extract PML4 index from VA[47:39]
3. Read PML4[index] (PML4E)
4. Check PML4E.P bit; if 0, page fault
5. Extract PDPT physical address from PML4E[51:12]
6. Extract PDPT index from VA[38:30]
7. Read PDPT[index] (PDPTE)
8. Check PDPTE.P bit; if 0, page fault
9. If PDPTE.PS = 1, this is a 1GB huge page → go to step 19, using PDPTE[51:30] as the frame and VA[29:0] as the offset
10. Extract PD physical address from PDPTE[51:12]
11. Extract PD index from VA[29:21]
12. Read PD[index] (PDE)
13. Check PDE.P bit; if 0, page fault
14. If PDE.PS = 1, this is a 2MB huge page → go to step 19, using PDE[51:21] as the frame and VA[20:0] as the offset
15. Extract PT physical address from PDE[51:12]
16. Extract PT index from VA[20:12]
17. Read PT[index] (PTE)
18. Check PTE.P bit; if 0, page fault
19. Check permissions (XD, U/S, R/W); if violation, page fault
20. Extract physical page frame from PTE[51:12]
21. Set PTE.A bit (hardware); if write, set PTE.D bit
22. Combine page frame with offset: PA = (PTE[51:12] << 12) | VA[11:0]
23. Insert translation into TLB
24. Complete memory access
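The walk above can be modeled compactly. This is a sketch rather than cycle-accurate hardware: physical memory is a dict of 64-bit entries, permission checks are omitted, and the example mapping at the bottom is invented for illustration.

```python
PRESENT   = 1 << 0
PS        = 1 << 7                      # huge-page bit (valid in PDPTE and PDE)
ADDR_MASK = 0x000F_FFFF_FFFF_F000      # bits 51-12: next table / frame address

def walk_four_level(va, cr3, mem):
    """x86-64 four-level walk with 1GB and 2MB huge-page short-circuits."""
    table = cr3 & ADDR_MASK
    for depth, shift in enumerate((39, 30, 21, 12)):   # PML4, PDPT, PD, PT
        entry = mem[table + ((va >> shift) & 0x1FF) * 8]
        if not (entry & PRESENT):
            raise MemoryError("page fault at level %d" % depth)
        if depth in (1, 2) and (entry & PS):           # 1GB or 2MB huge page
            offset_mask = (1 << shift) - 1             # 30 or 21 offset bits
            return (entry & ADDR_MASK & ~offset_mask) | (va & offset_mask)
        table = entry & ADDR_MASK
    return table | (va & 0xFFF)          # after the PT level: frame | offset

# Invented example: a 2MB page mapping VA 0x4020_0000... to PA 0x0080_0000
mem = {
    0x1000 + 0 * 8: 0x2000 | PRESENT,            # PML4E -> PDPT at 0x2000
    0x2000 + 1 * 8: 0x3000 | PRESENT,            # PDPTE -> PD at 0x3000
    0x3000 + 1 * 8: 0x0080_0000 | PS | PRESENT,  # PDE maps a 2MB page
}
pa = walk_four_level(0x4020_1234, 0x1000, mem)
# pa == 0x0080_1234
```

Because the PDE carries the PS bit, the walk stops after three reads instead of four, which is exactly why huge pages reduce TLB-miss cost.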
Without TLB caching, this requires four memory reads before the actual data access—five memory accesses total. This is why page walk caches and large TLBs are critical for modern processor performance.
Example: Translating a Concrete Address

Let's translate virtual address 0x0000_7F8A_4050_1678:
First, verify it's canonical:
Binary: 0000 0000 0000 0000 0111 1111 1000 1010 0100 0000 0101 0000 0001 0110 0111 1000
Bit 47: 0
Bits 63-48: all 0 (match bit 47) ✓ Canonical
Extract indices:
Bits 47-39 (PML4): 0 1111 1111 = 0x0FF = 255
Bits 38-30 (PDPT): 0 0010 1001 = 0x029 = 41
Bits 29-21 (PD): 0 0000 0010 = 0x002 = 2
Bits 20-12 (PT): 1 0000 0001 = 0x101 = 257
Bits 11-0 (Offset): 0110 0111 1000 = 0x678 = 1656
Assume CR3 = 0x0010_0000. The walk proceeds:
Step 1: PML4 lookup

PML4E address = 0x0010_0000 + (255 × 8) = 0x0010_07F8
Read PML4E = assume 0x0000_0020_0000_0003
Present: 1 ✓
R/W: 1 ✓
PDPT at 0x0020_0000_0000
Step 2: PDPT lookup

PDPTE address = 0x0020_0000_0000 + (41 × 8) = 0x0020_0000_0148
Read PDPTE = assume 0x0000_0030_0000_0003
Present: 1 ✓
PS: 0 (not 1GB page)
PD at 0x0030_0000_0000
Step 3: PD lookup

PDE address = 0x0030_0000_0000 + (2 × 8) = 0x0030_0000_0010
Read PDE = assume 0x0000_0040_0000_0003
Present: 1 ✓
PS: 0 (not 2MB page)
PT at 0x0040_0000_0000
Step 4: PT lookup

PTE address = 0x0040_0000_0000 + (257 × 8) = 0x0040_0000_0808
Read PTE = assume 0x0000_0050_5000_0027
Present: 1 ✓
R/W: 1 ✓
Accessed: 1 (previously accessed)
Page at 0x0050_5000_0000

Step 5: Form physical address

PA = 0x0050_5000_0000 + 0x678 = 0x0050_5000_0678
Four dependent memory reads were needed before the actual data access. This is why workload performance can degrade significantly if TLB misses are frequent!
In 2019, Intel introduced five-level paging with the Ice Lake processor architecture. This wasn't merely an incremental improvement—it represented a massive expansion of the virtual address space from 256TB (48-bit) to 128PB (57-bit), a 512× increase.
Why Five Levels?

The motivation for five-level paging comes from large-scale systems.
While 256TB seemed enormous in 2003, modern high-end servers can have multiple terabytes of RAM, and applications using memory-mapped files can easily exceed 256TB of virtual address space usage when working with very large datasets.
The LA57 Extension

Five-level paging is enabled by setting the LA57 bit (bit 12) in the CR4 control register. When CR4.LA57 = 1, the processor uses 57-bit virtual addresses; when CR4.LA57 = 0, it uses traditional 48-bit addresses with four-level paging.
This opt-in design allows operating systems to enable 57-bit addressing only when a workload needs it, while existing software continues to run unchanged with four-level paging.
The hierarchy adds one more level at the top:
1. PML5 (Page Map Level 5): New top level
2. PML4 (Page Map Level 4): Second level (same as before)
3. PDPT (Page Directory Pointer Table): Third level
4. PD (Page Directory): Fourth level
5. PT (Page Table): Lowest level
Virtual Address Format (57-bit)

A canonical 57-bit virtual address:
Bits 63-57: Sign extension (7 bits) - must equal bit 56
Bits 56-48: PML5 index (9 bits) = 512 entries, each covers 256TB
Bits 47-39: PML4 index (9 bits) = 512 entries, each covers 512GB
Bits 38-30: PDPT index (9 bits) = 512 entries, each covers 1GB
Bits 29-21: PD index (9 bits) = 512 entries, each covers 2MB
Bits 20-12: PT index (9 bits) = 512 entries, each covers 4KB
Bits 11-0: Page offset (12 bits) = 4096 bytes
Coverage per level:

PML5 entry: 256TB
PML4 entry: 512GB
PDPT entry: 1GB
PD entry: 2MB
PT entry: 4KB
Total address space: 512 (PML5) × 256TB = 128PB
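The coverage figures multiply out directly from the 9-bit field widths. A quick arithmetic check:

```python
KB, MB, GB, TB, PB = (1 << s for s in (10, 20, 30, 40, 50))

pt_entry   = 4 * KB                 # one PTE maps a 4KB page
pd_entry   = 512 * pt_entry         # 2MB
pdpt_entry = 512 * pd_entry         # 1GB
pml4_entry = 512 * pdpt_entry       # 512GB
pml5_entry = 512 * pml4_entry       # 256TB
total      = 512 * pml5_entry       # full 57-bit space

assert pml5_entry == 256 * TB
assert total == 128 * PB
```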
CR3 in Five-Level Mode

When CR4.LA57 = 1, CR3 points to PML5 instead of PML4:
Bits 63-52: Reserved (0)
Bits 51-12: Physical address of PML5 (40 bits)
Bits 11-0: Flags and reserved
The page walk now requires up to five memory accesses:
1. Read CR3 → get PML5 physical address
2. Extract PML5 index from VA[56:48]
3. Read PML5[index] (PML5E)
4. Check PML5E.P bit; if 0, page fault
5. Extract PML4 physical address from PML5E[51:12]
[Then continue with four-level process from PML4 onward...]
6-9: PML4 lookup
10-14: PDPT lookup (check for 1GB huge page)
15-19: PD lookup (check for 2MB huge page)
20-24: PT lookup
25: Combine to form physical address
Without TLB or page walk cache hits, this means five memory reads plus the actual data access—six memory accesses total for a single load or store instruction!
Performance Considerations
The additional level adds a potential fifth memory access to every uncached page walk, slightly lengthening TLB-miss latency. However, Intel's implementation mitigates this: page walk caches cover the new top level, and TLB behavior is unchanged, so the common (TLB-hit) case is unaffected.
Linux kernel added five-level paging support in version 4.14 (November 2017), predating the actual hardware release:
# Linux kernel configuration
CONFIG_X86_5LEVEL=y  # Enable five-level paging support
The kernel detects CPU support and can boot in either four-level or five-level mode. Most distributions enable the config option but don't activate five-level paging unless needed (very large memory configurations or specific workload requirements).
When to Use Five-Level Paging
Five-level paging makes sense for systems whose physical memory or virtual address usage approaches the 256TB four-level limit: very large in-memory databases, huge memory-mapped datasets, and high-end servers with many terabytes of RAM.
Most desktop, laptop, and even server workloads don't benefit from five-level paging and incur a small performance penalty from the extra translation level. As of 2024, five-level paging remains a specialized feature for high-end systems.
The ARM architecture takes a different approach to page table design than x86. Rather than specifying a single fixed structure, ARMv8-A provides multiple configuration options, allowing system designers to choose the best trade-off for their specific use case. This flexibility is one reason ARM dominates embedded systems and mobile devices—the same architecture scales from tiny microcontrollers to high-performance servers.
Three Configuration Dimensions
ARM64 systems can independently choose:
1. Page size: 4KB, 16KB, or 64KB
2. Virtual address size: 39-bit (512GB), 48-bit (256TB), or 52-bit (4PB)
3. Physical address size: Up to 52 bits (4PB of physical RAM)
This creates dozens of possible combinations. We'll focus on the most common configuration: 4KB pages with 48-bit virtual addresses, which uses a four-level page table similar to x86-64.
Translation Table Base Registers
ARM uses separate registers for user and kernel address spaces: TTBR0_EL1 and TTBR1_EL1 (Translation Table Base Registers).
The address space is split: addresses with the top bits clear are translated through TTBR0_EL1 (typically user space), while addresses with the top bits set are translated through TTBR1_EL1 (typically the kernel).
This allows kernel mappings to remain unchanged when switching between processes—only TTBR0_EL1 needs updating.
Four-Level Structure with 4KB Pages (Most Common)
The ARM names for the four levels:
1. Level 0: Top level (each entry covers 512GB)
2. Level 1: Second level (each entry covers 1GB)
3. Level 2: Third level (each entry covers 2MB)
4. Level 3: Lowest level (each entry covers 4KB)
Note that ARM counts from 0 at the top, with level numbers increasing toward the leaves — the opposite of x86's naming, where the numbers (PML4 down through PT) decrease as the walk descends.
Virtual Address Format (48-bit, 4KB pages)
Bit 63: Region select (0 = TTBR0, 1 = TTBR1)
Bits 62-48: Must match bit 63 (canonical form)
Bits 47-39: Level 0 index (9 bits) = 512 entries, each covers 512GB
Bits 38-30: Level 1 index (9 bits) = 512 entries, each covers 1GB
Bits 29-21: Level 2 index (9 bits) = 512 entries, each covers 2MB
Bits 20-12: Level 3 index (9 bits) = 512 entries, each covers 4KB
Bits 11-0: Page offset (12 bits) = 4096 bytes
This is very similar to x86-64's four-level structure, reflecting convergent evolution toward the same solution.
Descriptor Format (64-bit)
ARM uses the term "descriptor" instead of "page table entry." Each 64-bit descriptor contains:
For Upper-Level Descriptors (Levels 0-2):
Bit 63: Ignored
Bits 62-52: Ignored (software can use)
Bits 51-12: Next-level table address (40 bits, 4KB aligned)
Bits 11-2: Ignored
Bit 1: Descriptor type (0 = block, 1 = table)
Bit 0: Valid bit
For Page Descriptors (Level 3):
Bits 63-59: Ignored
Bits 58-55: Reserved
Bit 54: UXN (Unprivileged Execute Never; acts as XN in single-privilege translation regimes)
Bit 53: PXN (Privileged Execute Never)
Bit 52: Contiguous hint (for TLB caching)
Bits 51-12: Output address (physical address, 40 bits)
Bit 11: nG (not Global) - if 1, tagged with ASID
Bit 10: AF (Access Flag) - must be 1 for valid translation
Bits 9-8: SH (Shareability) - for multi-core coherency
Bits 7-6: AP (Access Permissions) - read/write control
Bit 5: NS (Non-Secure bit) - for TrustZone
Bits 4-2: AttrIndx - index into MAIR_ELx (memory attributes)
Bit 1: Descriptor type (always 1 for pages)
Bit 0: Valid bit
Key ARM-specific features:
Access Flag (AF): Unlike x86, where the Accessed bit starts at 0 and hardware sets it to 1, ARM (absent the optional hardware access-flag update feature) requires software to set AF=1 before the page is accessible. If AF=0, the MMU raises an Access Flag fault, allowing the OS to track page accesses.
Shareability (SH): Controls cache coherency in multi-core systems.
64KB Pages (Three Levels): Used in some server deployments for better TLB coverage:
Bits 47-42: Level 1 index (6 bits) = 64 entries, each covers 4TB
Bits 41-29: Level 2 index (13 bits) = 8192 entries, each covers 512MB
Bits 28-16: Level 3 index (13 bits) = 8192 entries, each covers 64KB
Bits 15-0: Page offset (16 bits) = 65536 bytes
Fewer levels mean faster translation, and larger pages mean better TLB hit rates. The trade-off: internal fragmentation (wasted space within pages).
16KB Pages (Four Levels): A middle ground, less commonly used:
Bit 47: Level 0 index (1 bit) = 2 entries
Bits 46-36: Level 1 index (11 bits) = 2048 entries
Bits 35-25: Level 2 index (11 bits) = 2048 entries
Bits 24-14: Level 3 index (11 bits) = 2048 entries
Bits 13-0: Page offset (14 bits) = 16384 bytes
RISC-V, being a newer architecture (first specification released 2011, ratified 2019), learned from decades of x86 and ARM experience. The designers made deliberate choices to simplify page table handling while maintaining flexibility and performance.
The satp Register
RISC-V uses a single register for all address translation configuration: satp (Supervisor Address Translation and Protection). For RV64 (64-bit RISC-V), satp is divided:
Bits 63-60: MODE (4 bits) - selects paging scheme
0000 = Bare (no translation)
1000 = Sv39 (39-bit virtual address, 3 levels)
1001 = Sv48 (48-bit virtual address, 4 levels)
1010 = Sv57 (57-bit virtual address, 5 levels)
Others = Reserved
Bits 59-44: ASID (16 bits) - Address Space Identifier for TLB tagging
Bits 43-0: PPN (44 bits) - Physical Page Number of root page table
This single register encodes the paging mode, the address space identifier, and the physical location of the root page table — everything needed to define an address space.
Sv39 uses a 39-bit virtual address space (512GB) with three levels of page tables. This is the most common scheme in current RISC-V implementations because it balances capability with simplicity.
Virtual Address Format (Sv39):
Bits 63-39: Must be copies of bit 38 (canonical form, 25 bits)
Bits 38-30: VPN[2] (Virtual Page Number level 2, 9 bits) = 512 entries
Bits 29-21: VPN[1] (Virtual Page Number level 1, 9 bits) = 512 entries
Bits 20-12: VPN[0] (Virtual Page Number level 0, 9 bits) = 512 entries
Bits 11-0: Page offset (12 bits) = 4096 bytes
Note: RISC-V numbers levels from 0 (lowest) to 2 (highest), opposite of ARM's convention!
Page Table Entry Format (64-bit):
Bits 63-54: Reserved (must be zero)
Bits 53-28: PPN[2] (26 bits — PPN bits 43:18)
Bits 27-19: PPN[1] (9 bits — PPN bits 17:9)
Bits 18-10: PPN[0] (9 bits — PPN bits 8:0)
Bits 9-8: RSW (Reserved for Supervisor Software)
Bit 7: D (Dirty)
Bit 6: A (Accessed)
Bit 5: G (Global)
Bit 4: U (User accessible)
Bit 3: X (Execute permission)
Bit 2: W (Write permission)
Bit 1: R (Read permission)
Bit 0: V (Valid)
A valid PTE with R=W=X=0 is a pointer to the next table level; any other valid combination marks a leaf. Because a leaf can appear at any level, this allows flexible huge page support without special flags.
Translation Process (Sv39):
1. Read satp.PPN → get root page table (Level 2)
2. Extract VPN[2] from VA[38:30]
3. Read PTE = root_table[VPN[2]]
4. Check PTE.V; if 0, page fault
5. If PTE.R|W|X != 0, this is a gigapage leaf → skip to step 15
6. Extract next table address from PTE.PPN
7. Extract VPN[1] from VA[29:21]
8. Read PTE = level1_table[VPN[1]]
9. Check PTE.V; if 0, page fault
10. If PTE.R|W|X != 0, this is a megapage leaf → skip to step 15
11. Extract next table address from PTE.PPN
12. Extract VPN[0] from VA[20:12]
13. Read PTE = level0_table[VPN[0]]
14. Check PTE.V; if 0, page fault
15. Check permissions (R/W/X/U); if violation, page fault
16. Set PTE.A; if write, set PTE.D
17. Form PA = (PTE.PPN << 12) | VA[11:0] (for mega/gigapage leaves, the low PPN fields must be zero and the page offset widens to VA[20:0] or VA[29:0])
18. Complete memory access
RISC-V also defines four-level (Sv48, 48-bit addresses) and five-level (Sv57, 57-bit addresses) schemes for larger address spaces, similar to x86-64:
Sv48 (256TB virtual space):
Bits 47-39: VPN[3] (9 bits) - Fourth level
Bits 38-30: VPN[2] (9 bits) - Third level
Bits 29-21: VPN[1] (9 bits) - Second level
Bits 20-12: VPN[0] (9 bits) - First level
Bits 11-0: Offset (12 bits)
Sv57 (128PB virtual space):
Bits 56-48: VPN[4] (9 bits) - Fifth level
Bits 47-39: VPN[3] (9 bits) - Fourth level
[... continues similarly ...]
As of 2024, most RISC-V implementations use Sv39 because its 512GB virtual address space covers current workloads, the three-level walk keeps TLB-miss latency low, and the hardware is simpler to implement.
Sv48 and Sv57 provide headroom for future growth as RISC-V moves into high-end server markets.
Comparison Across Architectures
Now that we've examined x86-64, ARM64, and RISC-V in detail, we can compare their approaches:
| Feature | x86-64 | ARM64 | RISC-V |
|---------|--------|-------|--------|
| Common config | 4-level, 48-bit | 4-level, 48-bit | 3-level, 39-bit |
| Max levels | 5 | 4 | 5 |
| Max VA size | 57-bit (128PB) | 52-bit (4PB) | 57-bit (128PB) |
| Page sizes | 4KB, 2MB, 1GB | 4KB, 16KB, 64KB + huge | 4KB, 2MB, 1GB |
| Entry size | 8 bytes | 8 bytes | 8 bytes |
| Entries/table | 512 | 512 or varies | 512 |
| Permission model | R/W, XD separate | AP bits + XN/PXN | Explicit R/W/X |
| ASID support | PCID (optional) | Built-in (16-bit) | Built-in (16-bit) |
| Huge page method | PS bit in PDE/PDPTE | Block descriptors | R/W/X in upper level |
All three architectures converged on similar solutions (4-5 levels, 512 entries per table, 8-byte entries) despite different design philosophies, suggesting these are close to optimal trade-offs for modern systems.
Before moving on to virtualization, let's summarize the page table walk process across architectures:
Memory Accesses Required (without caching):
Each access requires reading from DRAM (or cache), which takes 50-200 CPU cycles depending on cache hits. This is why TLB hit rates of 95-99% are crucial—they eliminate most of these accesses.
Address Space Coverage:
The progression shows a clear pattern: each additional level multiplies the address space by 512 (the number of entries per table).
Modern cloud computing depends critically on virtualization: running multiple guest operating systems simultaneously on shared physical hardware. Each guest OS believes it has exclusive access to physical memory, when in reality it's sharing the machine with other guests. Making this illusion work efficiently requires extending our page table mechanisms to support two-stage address translation.
### 3.6.1 The Virtualization Problem
In traditional (non-virtualized) systems, we have one translation:
Virtual Address (VA) → Physical Address (PA)
The operating system manages page tables, and the MMU translates addresses using these tables.
With virtualization, we have multiple operating systems running as "guests," each managing its own page tables. But these page tables translate to what the guest thinks are physical addresses—which the hypervisor must then translate to actual physical addresses. This creates a two-level translation problem:
Guest Virtual Address (GVA) → Guest Physical Address (GPA) → Host Physical Address (HPA)
Or using more standard terminology:
VA → IPA → PA
(Virtual Address → Intermediate Physical Address → Physical Address)
Before hardware virtualization support, VMMs (Virtual Machine Monitors) like VMware used shadow page tables: the hypervisor intercepted every guest OS page table modification and maintained duplicate "shadow" page tables that directly mapped GVA → HPA. This approach worked, but every guest page-table update forced a costly trap into the hypervisor, and the shadow tables themselves consumed significant memory.
Hardware-assisted two-stage translation solves this by letting the guest manage its own page tables normally while the hypervisor provides a second translation layer.
Intel introduced EPT with the Nehalem microarchitecture in 2008 as part of Intel VT-x (Virtualization Technology). EPT provides hardware support for the second stage of translation.
VMCS and Control
The VMCS (Virtual Machine Control Structure) is a data structure in memory that stores the complete state of a virtual machine. Among many other things, it contains the EPT pointer (EPTP), which holds the physical address of the root EPT table.
When a guest is running, the CPU uses both:
1. Guest CR3 for the first translation (VA → IPA)
2. EPT for the second translation (IPA → PA)
Stage 1: Guest Page Tables (VA → IPA)
The guest OS manages these tables exactly as if it were running on bare metal. If the guest is running 64-bit Linux, it builds the standard four-level x86-64 structure described earlier.
From the guest's perspective, nothing is different—it's completely unaware of the second translation stage.
Stage 2: EPT (IPA → PA)
The hypervisor manages EPT structures. EPT uses its own four-level (or five-level with LA57) page table hierarchy, structurally similar to regular x86-64 page tables but with different entry formats.
EPT Page Table Hierarchy:
1. EPT PML4 (or EPT PML5): Top level
2. EPT PDPT: Second level
3. EPT PD: Third level
4. EPT PT: Lowest level
EPT Page Table Entry Format (64-bit):
Bits 63-52: Ignored
Bits 51-12: Physical address of next level or page frame (40 bits)
Bits 11-8: Ignored
Bit 7: Page size — 1GB mapping in EPT PDPTEs, 2MB in EPT PDEs; ignored in EPT PML4Es
Bits 6-3: Ignored in non-leaf entries; memory type bits in leaf entries
Bit 2: Execute permission
Bit 1: Write permission
Bit 0: Read permission
Unlike normal page tables which use Present/Accessed/Dirty bits, EPT entries have explicit Read/Write/Execute permissions. This gives the hypervisor fine-grained control over guest memory access.
Memory Type Bits (in leaf entries):
Bits 5-3: EPT memory type (bit 6 is IPAT, Ignore guest PAT)
000: Uncacheable (UC)
001: Write Combining (WC)
100: Write-Through (WT)
101: Write-Protected (WP)
110: Write-Back (WB)
The hypervisor can control caching behavior for guest physical memory, important for device memory regions.
Combined Translation Process
When a guest accesses a virtual address, the hardware performs a complex combined walk:
1. Start with guest VA
2. Walk guest page tables to get IPA:
- Read guest CR3 (this is an IPA, so we need to translate it via EPT!)
- For each level of guest page table:
- The page table entry address is an IPA
- Translate IPA to PA using EPT walk
- Read the page table entry from PA
3. Result: IPA from guest page tables
4. Walk EPT to translate IPA → PA
5. Access final PA
This is where it gets complex: each access to a guest page table itself requires an EPT walk! Let's count the memory accesses:
Worst case (no caching, 4-level guest tables, 4-level EPT):
Guest CR3 lookup: 4 EPT accesses to get CR3 PA
Guest PML4 lookup: 4 EPT accesses to get PML4E
Guest PDPT lookup: 4 EPT accesses to get PDPTE
Guest PD lookup: 4 EPT accesses to get PDE
Guest PT lookup: 4 EPT accesses to get PTE
Final IPA→PA: 4 EPT accesses to get data PA
Data access: 1 access
Total: 25 memory accesses!
This access explosion is why nested paging initially looked prohibitively expensive—25 memory accesses for a single load or store would be catastrophic. In practice, caching makes it work.
EPT Violations
When EPT translation fails (missing page, permission violation), the CPU generates an EPT violation VM exit. The hypervisor handles this much as an OS handles a page fault: it allocates or maps the backing memory, updates the EPT entry, and resumes the guest.
The guest is completely unaware this happened—from its perspective, memory just worked.
AMD introduced NPT (marketed as Rapid Virtualization Indexing, RVI; the separate AMD-Vi extension covers I/O virtualization) with the Barcelona microarchitecture in 2007, actually predating Intel's EPT by about a year. The concepts are very similar to EPT, with some differences in terminology and minor implementation details.
Nested vs Extended
AMD uses the term "nested" to emphasize that guest page tables are nested inside hypervisor page tables. The principle is the same as Intel's "extended" page tables.
The nCR3 Register
AMD uses the nCR3 (nested CR3) register, stored in the VMCB (Virtual Machine Control Block, AMD's equivalent of Intel's VMCS). This points to the base of the nested page tables.
NPT Structure
NPT uses the same four-level (or five-level) hierarchy as AMD64 normal paging. Entry format:
Bits 63-52: Available
Bits 51-12: Physical address (40 bits)
Bits 11-9: Available
Bits 8-7: Reserved (0)
Bits 6-5: Ignored
Bit 4-3: Ignored / Memory type
Bit 2: User/Supervisor (U/S)
Bit 1: Read/Write (R/W)
Bit 0: Present (P)
AMD NPT works essentially identically to Intel EPT:
1. Guest manages normal page tables (gCR3 → GVA → GPA)
2. Hypervisor manages NPT (nCR3 → GPA → HPA)
3. Combined walk performs both translations
The same explosion of memory accesses occurs without caching, and the same caching mechanisms mitigate it.
Historical Note
AMD shipping NPT first gave AMD processors a virtualization performance advantage in 2007-2008. When Intel shipped EPT in 2008, virtualization performance on x86 became comparable between the two vendors.
ARM's approach to virtualization emerged with the ARMv7 Virtualization Extensions and matured in ARMv8-A (64-bit ARM). ARM explicitly models two translation stages, with clean separation between guest-managed and hypervisor-managed translations.
Translation Stages
ARM terminology: Stage 1 translates VA → IPA and is managed by the guest OS at EL1; Stage 2 translates IPA → PA and is managed by the hypervisor at EL2.
Stage 2 uses separate page table structures from Stage 1, with different descriptor formats. For 4KB pages with IPA size up to 48 bits:
Stage 2 Table Structure:
Bits 63-59: Ignored
Bits 58-55: Reserved
Bits 54-53: XN[1:0] (Execute Never control — Stage 2 has no PXN, since privilege distinctions are a Stage 1 concern)
Bit 52: Contiguous hint
Bits 51-12: Output address (PA) or next level table address
Bits 11-10: Reserved
Bits 9-8: SH (Shareability)
Bits 7-6: S2AP (Stage 2 Access Permissions)
00: No access
01: Read-only
10: Write-only
11: Read/Write
Bits 5-4: MemAttr (Memory attributes)
Bits 3-2: Ignored
Bit 1: Descriptor type (0 = block, 1 = table/page)
Bit 0: Valid
Key differences from Stage 1: permissions use the two-bit S2AP field rather than AP, memory attributes are encoded directly in MemAttr rather than as an index into MAIR, and entries are tagged by VMID rather than ASID.
When a guest running at EL1 accesses memory:
1. Stage 1 translation: VA → IPA using TTBR0_EL1/TTBR1_EL1
- Guest OS manages these tables
- Four-level walk (typically)
2. Stage 2 translation: IPA → PA using VTTBR_EL2
- Hypervisor manages these tables
- Up to four-level walk
Similar to x86, each access to Stage 1 page tables requires a Stage 2 translation of the table address. Maximum memory accesses:
4 Stage 1 table reads, each preceded by a 4-access Stage 2 walk, plus 4 accesses for the final IPA→PA translation: 4 × (4 + 1) + 4 = 24 table reads before the data access
This is essentially the same worst case as x86's 25 (which includes the data access)—both follow the same (n+1)(m+1) − 1 pattern for n-level guest and m-level hypervisor tables.
Real-World ARM Hypervisors
KVM/ARM, Linux's built-in hypervisor, is the most widely deployed example.
RISC-V added hypervisor support relatively recently—the Hypervisor Extension was frozen in 2019 and ratified as v1.0 in 2021. Being newer, it learned from x86 and ARM's experiences and features a clean, orthogonal design.
Two-Stage Translation in RISC-V
RISC-V terminology: VS-stage translation maps guest virtual addresses to guest physical addresses (GVA → GPA); G-stage translation maps guest physical addresses to supervisor physical addresses (GPA → SPA).
This is conceptually identical to x86 EPT and ARM Stage 2, just with different names.
Control Registers
VS-stage (guest): the vsatp register, same format as regular satp; the guest manages it.
G-stage (hypervisor): the hgatp register, format similar to satp; the hypervisor manages it.
hgatp Register Format (RV64):
Bits 63-60: MODE
0000 = Bare (no translation)
1000 = Sv39x4 (39-bit GPA, 3 levels)
1001 = Sv48x4 (48-bit GPA, 4 levels)
1010 = Sv57x4 (57-bit GPA, 5 levels)
Bits 57-44: VMID (Virtual Machine ID, up to 14 bits; bits 59-58 are reserved)
Bits 43-0: PPN of root G-stage page table
G-stage page tables use the same structure as regular RISC-V page tables, with 64-bit PTEs:
Bits 63-54: Reserved (0)
Bits 53-10: PPN (Physical Page Number)
Bits 9-8: RSW (Reserved for software)
Bit 7: D (Dirty)
Bit 6: A (Accessed)
Bit 5: G (Global)
Bit 4: U (User)
Bit 3: X (Execute)
Bit 2: W (Write)
Bit 1: R (Read)
Bit 0: V (Valid)
The beauty of RISC-V: G-stage tables are structurally identical to VS-stage tables, just managed by different software layers!
Combined Translation
For a guest with Sv39 VS-stage and Sv39x4 G-stage:
1. VS-stage walk (3 levels):
- vsatp and each VS-stage table address is a GPA, requiring a 3-access G-stage walk before the entry itself can be read
2. Final GPA → SPA:
- G-stage walk: 3 memory accesses
3. Data access: 1 access
Total: (3+1) × (3+1) − 1 = 15 table reads plus the data access—16 memory accesses, well below the 25 of a four-level x86 nested walk. The advantage comes from RISC-V defaulting to 3-level VS-stage tables instead of 4, which shrinks the worst case quadratically.
The explosion of memory accesses for two-stage translation would be disastrous without hardware mitigations. Multiple techniques work together to make virtualization practical:
1. Combined TLB (VA → PA Direct)
Modern CPUs cache the final translation (VA → PA) directly in the TLB, skipping both guest and hypervisor page table walks on a hit. The TLB entry is tagged with the virtual machine identifier (VPID/VMID) and the address space identifier (PCID/ASID) of the guest process.
This allows TLB entries from multiple VMs and multiple processes to coexist.
Intel calls the virtual machine tag the VPID (Virtual Processor ID). With a combined TLB hit (95-99% for typical workloads), two-stage translation adds zero overhead for that access!
2. Page Walk Caches for Both Stages
Modern CPUs cache intermediate page table entries for both stages:
If the PWC contains cached entries, many of the potential 20-25 memory accesses are eliminated.
3. Huge Pages
Using 2MB or 1GB pages in either stage dramatically reduces translation overhead:
Many hypervisors aggressively use huge pages when possible.
Real-World Performance Measurements
Multiple academic studies have measured two-stage translation overhead, beginning with Adams and Agesen's 2006 VMware study comparing software and hardware virtualization techniques. With modern hardware support, typical virtualization overhead is small:
Compare to 50-80% overhead with pure software virtualization, and hardware support clearly justifies its complexity!
Two-stage translation enables modern cloud computing. Without it, hypervisors would be stuck with slow, memory-hungry shadow paging, or guests would have to be modified to cooperate (paravirtualization).
The complexity of 20+ potential memory accesses is completely hidden from both guest operating systems and applications by sophisticated caching. This is a triumph of hardware/software co-design: software (hypervisors) provides the management interfaces, hardware provides transparent acceleration.
We've seen that page table walks can require anywhere from 1 to 25 memory accesses depending on the number of levels and whether virtualization is active. These accesses would devastate performance if they occurred on every memory instruction. Modern processors employ multiple levels of caching to mitigate this overhead. Understanding what gets cached—and critically, what doesn't—is essential for performance tuning and system design.
The TLB is the primary cache for address translation, and most developers are familiar with its basic purpose. However, understanding precisely what it caches and its limitations helps explain many performance behaviors.
What the TLB Caches
The TLB caches complete translations:
VA → PA (complete final mapping)
Each TLB entry stores the virtual page number, the corresponding physical frame number, the permission bits, the page size, and an ASID/PCID tag.
Critically, the TLB does not cache intermediate page table entries—only final translations.
When we have a TLB miss, the hardware page walker must still perform the complete page table walk, reading every level from memory (or cache).
TLB Organization
Modern processors have multiple TLB levels with different characteristics: the L1 TLB is split into separate instruction and data TLBs, backed by a larger, slower L2 TLB. Let's examine actual TLB configurations from modern processors:
Intel Skylake (2015-2019):
L1 DTLB:
- 64 entries for 4KB pages
- 32 entries for 2MB/4MB pages
- 4 entries for 1GB pages
L1 ITLB:
- 128 entries for 4KB pages
- 8 entries for 2MB/4MB pages
L2 STLB (shared):
- 1536 entries for 4KB or 2MB pages (shared pool)
- 16 entries for 1GB pages
AMD Zen 3 (2020):
L1 DTLB:
- 72 entries for 4KB pages
- 72 entries for 2MB/1GB pages (separate pool)
L1 ITLB:
- 64 entries for 4KB pages
- 64 entries for 2MB pages
L2 TLB (unified):
- 2048 entries for 4KB/2MB pages
- 1024 entries for 1GB pages
AMD Zen 3's L2 TLB is notably larger than Intel Skylake's, reflecting AMD's emphasis on reducing TLB misses.
ARM Cortex-A77 (2019):
L1 DTLB:
- 48 fully-associative entries for 4KB pages
- 32 fully-associative entries for larger pages
L1 ITLB:
- 48 fully-associative entries for 4KB pages
- 32 fully-associative entries for larger pages
L2 TLB (unified):
- 1280 entries, 4-way set-associative
- Supports 4KB, 16KB, 64KB, 2MB, 32MB, 512MB, 1GB pages
Let's calculate how much memory a TLB can cover:
Intel Skylake L2 STLB:
- All 4KB pages (worst case): 1536 entries × 4KB = 6MB
- All 1GB pages: 16 entries × 1GB = 16GB

AMD Zen 3 L2 TLB:
- All 4KB pages: 2048 entries × 4KB = 8MB
- All 2MB pages: 2048 entries × 2MB = 4GB

ARM Cortex-A77 L2 TLB:
- All 4KB pages: 1280 entries × 4KB = 5MB
These numbers reveal a critical insight: if your working set exceeds 5-8 MB with 4KB pages, you'll experience TLB thrashing. Using huge pages (2MB) extends coverage to multiple gigabytes.
While TLBs are well-known, page walk caches (PWCs) are less documented but equally important for modern processor performance. PWCs cache intermediate page table entries, reducing the number of memory accesses needed on a TLB miss.
What Page Walk Caches Do
PWCs cache upper-level page table entries:
These caches sit between the TLB and main memory. On a TLB miss, the hardware page walker checks the PWC before accessing memory.
Why PWCs MatterWithout PWC, a TLB miss on x86-64 requires 4 memory accesses:
CR3 → PML4E → PDPTE → PDE → PTE → PA
read read read read
With PWC hits on upper levels:
CR3 → PML4E → PDPTE → PDE → PTE → PA
(PWC) (PWC) (PWC) read
Only 1 memory access needed instead of 4!
Intel's Page Walk Cache (Undocumented)
Intel doesn't officially document PWC details in their optimization manuals, but researchers have measured its existence and characteristics: Barr et al. (2011) confirmed dedicated caches for upper-level entries on Intel processors. AMD is slightly more transparent about caching intermediate entries.
Two-stage translation introduces new caching challenges. We could cache:
1. VA → IPA (guest translation only)
2. IPA → PA (hypervisor translation only)
3. VA → PA (complete combined translation)
Modern processors use option 3: cache the final VA → PA mapping directly, skipping both intermediate stages!
VPID/ASID Tagging
To allow multiple VMs to share the TLB without conflicts, entries are tagged with identifiers—Intel calls its tag the VPID (Virtual Processor ID). When the hypervisor switches VMs:
Old behavior (no VPID): Flush entire TLB
New behavior (with VPID): Keep all entries, use VPID to distinguish
This allows VM switches without TLB flushes, so each guest's hot translations survive across world switches.
Understanding what caching structures don't do is as important as understanding what they do: the TLB never caches intermediate page table entries or partial walks—those belong to the PWC. Here's a comprehensive view of what gets cached where:
| Structure | What It Caches | What It Doesn't Cache | Typical Size | Access Time |
|-----------|----------------|----------------------|--------------|-------------|
| L1 TLB | VA→PA (final) | Intermediate PTEs | 64-128 entries | 0-1 cycles |
| L2 TLB | VA→PA (final) | Intermediate PTEs | 1280-2048 entries | 5-10 cycles |
| PWC | PML4E, PDPTE, PDE | Final PTEs, VA→PA | ~32-64 entries/level | 10-20 cycles |
| L1 Cache | Page table data | Nothing special | 32-64 KB | 3-5 cycles |
| L2 Cache | Page table data | Nothing special | 256-512 KB | 10-15 cycles |
| L3 Cache | Page table data | Nothing special | 8-32 MB | 40-60 cycles |
| RAM | Everything | - | GBs | 100-200 cycles |
This hierarchy explains why TLB hit rate is so critical:
The difference between a TLB hit and a complete miss can be 200× in latency!
Programs with good spatial locality (accessing nearby addresses) benefit from high TLB hit rates and warm page walk caches, since consecutive accesses reuse the same translations and upper-level entries.
Programs with poor spatial locality (random access patterns) suffer frequent TLB misses and full-length page walks, since each access tends to touch a different page.
Using 2MB pages instead of 4KB pages multiplies TLB reach by 512.
Example: a 1GB working set needs 262,144 TLB entries with 4KB pages—far beyond any TLB—but only 512 entries with 2MB pages, which fits in a modern L2 TLB.
This is why databases, HPC applications, and VMs aggressively use huge pages.
Understanding caching behavior is crucial for performance tuning, for deciding when huge pages pay off, and for interpreting profiler output that shows translation stalls.
We've discussed page table walks throughout this chapter, but let's consolidate our understanding of how the hardware actually implements the translation process.
The hardware page walker is a dedicated state machine in the MMU that activates on a TLB miss. It operates concurrently with other CPU execution units, allowing out-of-order processors to continue executing independent instructions during the walk.
Walk Initiation:
1. Instruction tries to access memory at virtual address VA
2. TLB lookup: miss
3. PWC lookup for upper levels: partial hits possible
4. Hardware walker activates for remaining levels
5. Other instructions continue (if independent of this load)
x86-64 Four-Level Walk Algorithm:
Input: VA (virtual address)
Output: PA (physical address) or Page Fault
1. base = CR3[51:12] << 12 // PML4 base address
2. index = VA[47:39] // PML4 index
3. pml4e = read(base + index × 8)
4. if pml4e.P == 0: raise PAGE_FAULT
5. base = pml4e[51:12] << 12 // PDPT base
6. index = VA[38:30] // PDPT index
7. pdpte = read(base + index × 8)
8. if pdpte.P == 0: raise PAGE_FAULT
9. if pdpte.PS == 1: goto HUGE_1GB
10. base = pdpte[51:12] << 12
11. index = VA[29:21]
12. pde = read(base + index × 8)
13. if pde.P == 0: raise PAGE_FAULT
14. if pde.PS == 1: goto HUGE_2MB
15. base = pde[51:12] << 12
16. index = VA[20:12]
17. pte = read(base + index × 8)
18. if pte.P == 0: raise PAGE_FAULT
19. check permissions (pte.XD, pte.U/S, pte.R/W)
20. if violation: raise PAGE_FAULT
21. pte.A = 1 (mark accessed)
22. if write: pte.D = 1 (mark dirty)
23. PA = (pte[51:12] << 12) | VA[11:0]
24. install in TLB
25. return PA
HUGE_2MB:
PA = (pde[51:21] << 21) | VA[20:0]
goto step 24
HUGE_1GB:
PA = (pdpte[51:30] << 30) | VA[29:0]
goto step 24
Different architectures divide responsibility between hardware and software for page table management:
x86: Mostly Hardware-Managed
The hardware walks the tables and sets the Accessed and Dirty bits automatically. This simplifies OS implementation but reduces flexibility; some OS algorithms prefer software-managed tracking for more precise control.
Page walk latency depends on cache hits:
Best case (all levels in L1 cache): 4 cache accesses × 4 cycles = 16 cycles, plus ~5 cycles to install the TLB entry—about 20 cycles total.
Typical case (mix of cache levels): ~50-100 cycles.
Worst case (all levels in DRAM): 4 DRAM accesses × 100-200 cycles = 400-800 cycles.
This is why TLB hit rate matters so much: a 95% hit rate means only 5% of memory accesses pay this penalty, but with random access patterns causing frequent TLB misses, performance can degrade dramatically.
Each processor core maintains its own private TLB. When an operating system modifies a page-table mapping that is potentially cached in another core's TLB — for example when unmapping a page, changing permissions, or migrating a process — it must ensure that every core that might hold a stale translation is notified to invalidate it. This process is called a TLB shootdown.
The x2APIC IPI mechanism. On x86-64 systems the shootdown protocol works through the Advanced Programmable Interrupt Controller (APIC). The initiating core sends an Inter-Processor Interrupt (IPI) to every target core using the x2APIC MSR interface (ICR, offset 0x830). The target cores receive the IPI, pause execution, execute an INVLPG instruction (or INVPCID for PCID-aware flushes) to invalidate the relevant TLB entry, then send an acknowledgement back to the initiating core via a shared memory flag. Only after all acknowledgements are received is the page-table update considered safe. ARM uses the TLBI instruction family (e.g. TLBI VAAE1IS for inner-shareable domain invalidation) with a DSB ISH barrier, and RISC-V uses the SFENCE.VMA instruction with optional VA/ASID operands.
Overhead at scale. The shootdown cost is dominated by IPI round-trip latency. On a 16-core system the overhead is modest (typically <1% of cycles). Research by Villavieja et al. (measured in Mittal, 2016) shows this changes dramatically as core count grows: up to 4% of cycles at 16 cores, rising to 25% at 128 cores, because each IPI stalls the initiating core while acknowledgements arrive serially. AI training workloads with rapid gradient-buffer allocation and deallocation can reach 11–74 shootdowns per second, driving coherence overhead high enough to become the dominant bottleneck (GRIT, 2022). Software mitigations include batching shootdowns, lazy TLB invalidation (deferring flushes until a process is rescheduled), and PCID/ASID tagging to scope invalidations. Chapter 12 examines directory-based hardware solutions (IDYLL) that reduce multi-GPU shootdown overhead from O(N) to O(log N).
While hardware handles the mechanics of translation, the operating system is responsible for allocating, initializing, and managing page tables. Let's examine key OS-level page table operations.
When creating a new process (e.g., fork() on UNIX systems):
// Allocate page directory (x86) or PML4 (x86-64)
struct page *pgd = alloc_page(GFP_KERNEL);
pml4_t *pml4 = page_address(pgd);
memset(pml4, 0, PAGE_SIZE); // Zero-initialize
// Copy kernel page table entries
// This way kernel is mapped in all processes
for (int i = 256; i < 512; i++) { // Upper half (kernel mappings)
pml4[i] = kernel_pml4[i];
}
Traditional fork() would copy all memory from parent to child—extremely expensive for large processes. COW optimizes this:
fork() {
    child->pml4 = alloc_page_table();
    for each mapped page in parent:
        // Share the physical page: copy the PTE to the child
        child_pte = parent_pte;
        // Mark both parent and child read-only so the first write faults
        parent_pte.R/W = 0;
        child_pte.R/W = 0;
        // Track the extra reference to the shared page
        page_refcount++;
}
page_fault_handler(address, error_code) {
    if (error_code == WRITE_TO_READONLY) {
        if (page_refcount == 1) {
            // Last remaining reference: just make the page writable
            pte.R/W = 1;
        } else {
            // Multiple references: copy the page for this process
            new_page = alloc_page();
            memcpy(new_page, old_page, PAGE_SIZE);
            pte.pfn = new_page_number;
            pte.R/W = 1;
            page_refcount--;
        }
        invalidate_tlb_entry(address); // drop the stale read-only mapping
    }
}
COW means fork() is nearly instant regardless of process size!
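The effect is easy to observe from user space with a standard POSIX fork(): after the fork the pages are shared copy-on-write, yet a child's write lands in a private copy that the parent never sees.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a child's post-fork write is invisible to the parent,
 * i.e. the COW fault gave the child its own private page. */
int cow_demo(void) {
    size_t len = 1 << 20;              /* 1 MB: spans many 4KB pages */
    char *buf = malloc(len);
    if (!buf)
        return 0;
    memset(buf, 'A', len);

    pid_t pid = fork();                /* near-instant: pages shared COW */
    if (pid < 0) {
        free(buf);
        return 0;
    }
    if (pid == 0) {
        buf[0] = 'B';                  /* write fault copies just this page */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    int ok = (buf[0] == 'A');          /* parent's view is unchanged */
    free(buf);
    return ok;
}
```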
Linux's Transparent Huge Page support automatically promotes regions of memory to 2MB huge pages when beneficial.
The khugepaged kernel thread scans memory and collapses a region into a huge page when:
* 512 contiguous 4KB pages are allocated
* The memory is physically contiguous
* There are no special mappings (device memory, etc.)
Benefit: Applications get huge page performance without explicit huge page requests. Trade-off: Potential memory waste if a process doesn't use all 2MB.

On memory pressure, the OS can reclaim page table memory for unused regions:
unmap_page_range(start, end) {
    for each page table covering [start, end]:
        if all PTEs in table are invalid:
            free_page_table(pt);
            mark_pde_invalid(); // clear the parent-level entry
}
This is the inverse of lazy allocation: lazily free page tables when no longer needed.
Having explored page table structures in depth, let's step back and analyze the fundamental trade-offs that architects face.
| Levels | Max Address Space | Table Overhead (sparse process) | TLB Miss Cost |
|--------|-------------------|---------------------------------|---------------|
| 1 | 4GB (32-bit) | 4MB per process (always fully allocated) | 1 memory access |
| 2 | 4GB (32-bit) | ~8KB | 2 memory accesses |
| 3 | 512GB | ~16KB | 3 memory accesses |
| 4 | 256TB | ~32KB | 4 memory accesses |
| 5 | 128PB | ~40KB | 5 memory accesses |
The "sweet spot" for most systems is 3-4 levels, balancing reasonable overhead with manageable translation cost.
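The 64-bit rows follow directly from the geometry: with 512-entry (9-bit) tables over 4KB (12-bit) pages, k levels reach 2^(12 + 9k) bytes of virtual address space. As a quick check:

```c
/* Reachable virtual address space for k levels of 512-entry (9-bit)
 * tables over 4KB (12-bit) pages: 2^(12 + 9k) bytes. */
unsigned long long max_address_space(int levels) {
    return 1ULL << (12 + 9 * levels);
}
```

Three levels give 2^39 = 512GB, four give 2^48 = 256TB, and five give 2^57 = 128PB, matching the table above.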
Modern trend: hardware page-table walkers win over software-managed TLB refill for general-purpose computing due to better average-case performance—a software miss handler must trap to the OS on every TLB miss.
Two-stage translation adds work to every step of the walk: each guest page-table access must itself be translated through the nested tables, inflating a worst-case 4-level walk from 4 to roughly 24 memory references, and the extra accesses compete for TLB and page-walk-cache capacity. Result: 2-7% typical overhead—acceptable for cloud computing.
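The worst-case cost of a two-dimensional (nested) walk follows a standard counting argument: each of the n guest levels costs an m-reference nested walk plus the fetch of the guest entry itself, and the resulting guest-physical address needs one final m-reference nested walk, for (n+1)(m+1) − 1 references in total.

```c
/* Worst-case memory references for a nested (2-D) page walk with
 * n guest levels and m nested levels:
 *   n*(m+1) references for the guest walk, plus m for the final
 *   nested walk of the guest-physical address = (n+1)*(m+1) - 1. */
int nested_walk_refs(int n, int m) {
    return (n + 1) * (m + 1) - 1;
}
```

For 4-level guest tables over 4-level nested tables this gives 24 references, consistent with the "20+ memory accesses" figure in the summary; 5-over-5 gives 35.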
In this chapter, we've explored page table structures from simple to complex, always grounding our discussion in real-world implementations:
Key Concepts:

1. Single-level page tables work for small address spaces but don't scale to 64-bit systems (would require petabytes of table memory).
2. Two-level page tables solved the overhead problem for 32-bit systems by allocating page tables only for used regions.
3. Multi-level page tables (3-5 levels) are necessary for modern 64-bit systems:
- x86-64: Four levels (48-bit) or five levels (57-bit)
- ARM64: Flexible configurations, typically four levels
- RISC-V: Three levels (Sv39) most common, up to five supported
4. Virtualization requires two-stage translation (VA → IPA → PA):
- Intel EPT, AMD NPT, ARM Stage 2, RISC-V G-stage
- Could require 20+ memory accesses without caching
- Modern hardware makes this practical through sophisticated caching
5. Caching is critical for performance:
- TLB caches final translations (VA → PA)
- Page walk caches (PWC) cache intermediate entries
- Combined with VPID/ASID tagging for virtualization
- 95-99% TLB hit rates make even 5-level paging practical
6. Operating systems manage page tables:
- Lazy allocation saves memory
- Copy-on-write optimizes fork()
- Transparent huge pages improve performance automatically
Design Insights:

The convergence of x86-64, ARM64, and RISC-V on similar solutions (4-5 levels, 512 entries per table, 8-byte entries) despite different design philosophies suggests these represent near-optimal trade-offs for modern systems.
Performance Implications:

With 95-99% TLB hit rates and effective page walk caches, a well-designed page table structure keeps translation overhead in the low single digits, while a poor one can consume a large fraction of cycles in memory-intensive workloads.

In the next chapters, we'll build on this foundation, including hardware support for scaling TLB coherence in Chapter 12.
Page tables are the foundation of modern virtual memory systems. Understanding their structure, implementation, and performance characteristics is essential for systems programmers, architects, and anyone working with high-performance computing or cloud infrastructure.
Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1. Chapter 4: Paging. Document 325384, 2023. https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
AMD Inc. AMD64 Architecture Programmer's Manual, Volume 2: System Programming. Chapter 5: Page Translation and Protection. Publication 24593, Rev. 3.38, 2023. https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
Arm Limited. ARM Architecture Reference Manual for A-profile Architecture. Chapter D5: The AArch64 Virtual Memory System Architecture. DDI 0487J.a, 2023. https://developer.arm.com/documentation/ddi0487
RISC-V International. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture. Version 20211203. Chapter 4: Supervisor-Level ISA. 2021. https://github.com/riscv/riscv-isa-manual
Bhattacharjee, A., Lustig, D. Architectural and Operating System Support for Virtual Memory. Morgan & Claypool Publishers, Synthesis Lectures on Computer Architecture, 2017. DOI: 10.2200/S00795ED1V01Y201708CAC042
Denning, P. J. "Virtual Memory." ACM Computing Surveys, 2(3):153–189, 1970. DOI: 10.1145/356571.356573
Silberschatz, A., Galvin, P. B., Gagne, G. Operating System Concepts, 10th ed. Wiley, 2018. Chapters 8–9: Main Memory and Virtual Memory.
Tanenbaum, A. S., Bos, H. Modern Operating Systems, 4th ed. Pearson, 2015. Chapter 3: Memory Management.
Barr, T., Cox, A., Rixner, S. "Translation Caching: Skip, Don't Walk (the Page Table)." Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA), pp. 48–59, 2010. DOI: 10.1145/1815961.1815970
Lustig, D., Bhattacharjee, A., Martonosi, M. "TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs." ACM Transactions on Architecture and Code Optimization (TACO), 10(1):2, 2013. DOI: 10.1145/2445572.2445574
Gandini, C., Moreto, M., Cristal, A., Valero, M. "A Study of Hardware-Assisted Address Translation." ACM SIGARCH Computer Architecture News, 43(1):34–39, 2016.
Karakostas, V., Gandhi, J., Ayar, F., Cristal, A., Hill, M. D., McKinley, K. S., Nemirovsky, M., Swift, M. M., Ünsal, O. S. "Redundant Memory Mappings for Fast Access to Large Memories." Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 441–453, 2015. DOI: 10.1145/2749469.2749471
Mittal, S. "A survey of techniques for architecting TLBs." Concurrency and Computation: Practice and Experience (CPE), 29(10), 2016. (Citing Villavieja et al. on shootdown overhead.) DOI: 10.1002/cpe.4061
Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide. §10.12: x2APIC IPI delivery. 2024. intel.com/sdm
ARM Limited. ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile. §D5.10: TLB maintenance instructions (TLBI). ARM DDI 0487J.a, 2023. developer.arm.com/ddi0487
Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1. Section 4.2: "Hierarchical Paging Structures: An Overview" (CR3 role); Section 4.3: "32-Bit Paging" (two-level legacy paging). Intel Corporation, 2024. intel.com/sdm
ARM Limited. ARM Architecture Reference Manual for ARMv8-A. Document DDI 0487J.a, Chapter D5: "The AArch64 Virtual Memory System Architecture." ARM Limited, 2024. developer.arm.com/ddi0487
RISC-V International. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture. Version 1.12, Section 4.1.11: "Supervisor Address Translation and Protection Register (satp)." 2021. github.com/riscv/riscv-isa-manual
Levy, H. M. and Lipman, P. H. "Virtual memory management in the VAX/VMS operating system." IEEE Computer, 15(3):35–41, 1982.
Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. Section 4.5: "4-Level Paging and 5-Level Paging"; Section 4.5.4: "5-Level Paging." Intel Corporation, 2024. intel.com/sdm
Shutemov, K. A. "x86: 5-level paging enabling for v4.14." Linux Kernel Mailing List, 2017. lwn.net/Articles/717293
ARM Limited. ARM Architecture Reference Manual for ARMv8-A. ARM DDI 0487J.a, Section D5.2: "The VMSAv8-64 address translation system"; Section D5.3.3: "Descriptor formats." ARM Limited, 2024. developer.arm.com/ddi0487
RISC-V International. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture. Version 1.12, Section 4.3: "Sv39," Section 4.4: "Sv48," Section 4.5: "Sv57." 2021. github.com/riscv/riscv-isa-manual
Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3C. Section 28.2: "The Extended Page Table Mechanism (EPT)." Intel Corporation, 2024. intel.com/sdm
AMD. AMD64 Architecture Programmer's Manual, Volume 2: System Programming. Section 15.25: "Nested Paging." AMD, 2023. amd.com/24593.pdf
ARM Limited. ARM Architecture Reference Manual for ARMv8-A. ARM DDI 0487J.a, Section D5.2.5: "Translation table walks for stage 2 translations." ARM Limited, 2024. developer.arm.com/ddi0487
RISC-V International. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture. Version 1.12, Chapter 8: "Hypervisor Extension." 2021. github.com/riscv/riscv-isa-manual
Adams, K. and Agesen, O. "A comparison of software and hardware techniques for x86 virtualization." ACM SIGPLAN Notices, 41(11):2–13, 2006. DOI: 10.1145/1168857.1168860
Barr, K. et al. "The VMware mobile virtualization platform: is that a hypervisor in your pocket?" ACM SIGOPS Operating Systems Review, 44(4):124–135, 2010. DOI: 10.1145/1899928.1899945
Intel Corporation. Intel 64 and IA-32 Architectures Optimization Reference Manual. Section 2.5: "Translation Lookaside Buffers (TLB)." Intel Corporation, 2024. intel.com/sdm
AMD. Software Optimization Guide for AMD Family 19h Processors (Zen 3). Publication #56665, Rev. 3.01, Section 2.10: "Translation Lookaside Buffer." AMD, 2020. amd.com/56665.pdf
ARM Limited. ARM Cortex-A77 Core Technical Reference Manual. Section 5.2: "TLB organization." ARM Limited, 2019. developer.arm.com/100802
Barr, T. W., Cox, A. L., and Rixner, S. "Translation Caching: Skip, Don't Walk (the Page Table)." ACM SIGARCH Computer Architecture News, 38(3):48–59, 2010. DOI: 10.1145/1815961.1815970
Bhattacharjee, A., Lustig, D., and Martonosi, M. "Shared last-level TLBs for chip multiprocessors." Proceedings of the 17th IEEE Symposium on High Performance Computer Architecture (HPCA), pp. 62–73, 2011. DOI: 10.1109/HPCA.2011.5749717
Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3C. Section 28.3.3.1: "Virtual-Processor Identifiers (VPIDs)." Intel Corporation, 2024. intel.com/sdm
ARM Limited. ARM Architecture Reference Manual for ARMv8-A. ARM DDI 0487J.a, Section D5.10.3: "TLB maintenance requirements" (VMID and ASID usage). ARM Limited, 2024. developer.arm.com/ddi0487
Bhattacharjee, A. and Martonosi, M. "Characterizing the TLB behavior of emerging parallel workloads on chip multiprocessors." Proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 29–40, 2009. DOI: 10.1145/1629575.1629579
Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. Section 4.7: "Page-Fault Exceptions." Intel Corporation, 2024. intel.com/sdm
Appel, A. W. and Li, K. "Virtual memory primitives for user programs." ACM SIGPLAN Notices, 26(4):96–107, 1991. DOI: 10.1145/106972.106984
Arcangeli, A., Eidus, I., and Wright, C. "Increasing memory density by using KSM." Proceedings of the Linux Symposium, pp. 19–28, 2011.
Rosenblum, M. et al. "Complete computer system simulation: The SimOS approach." IEEE Parallel & Distributed Technology, 3(4):34–43, 1995. DOI: 10.1109/88.386348