"The memory hierarchy is a fundamental structure in computer architecture, and virtual memory is one of the greatest ideas in computer science."
-- John L. Hennessy & David A. Patterson, Computer Architecture: A Quantitative Approach
When you write a program in C, Python, or any modern language, you make several assumptions about memory: that your address space starts at (or near) address 0, that memory is contiguous, that pointers simply work, and that no other program can read or corrupt your data.
These assumptions seem natural, but they're not inherent to computer hardware. They're provided by virtual memory--one of computer science's most important abstractions [Bhattacharjee & Lustig, 2017].
Virtual memory (VM) is the hardware and software mechanism that gives each program its own private view of the computer's memory. According to Bhattacharjee & Lustig (2017), virtual memory provides several critical benefits:
1. Simplified Programming (Abstraction)
   - Programs use simple addresses starting from 0
   - No need to know physical memory size or layout
   - Memory appears as one large array
   - Code is portable across different hardware

2. Memory Protection and Isolation
   - Each program has its own address space
   - Programs cannot access each other's memory
   - Operating system memory is protected from user programs
   - Security: malicious code can't read passwords from other programs

3. Memory Overcommitment
   - Total virtual memory across all programs can exceed physical RAM
   - Programs can use address spaces larger than physical memory
   - Inactive data can be stored on disk
   - Efficient memory sharing between programs

4. Flexibility in Memory Allocation
   - Programs can be loaded anywhere in physical memory
   - Memory doesn't need to be contiguous in physical RAM
   - Easy to relocate programs
   - Simplifies dynamic memory allocation
Virtual memory works through address translation. Let's understand this with a concrete example:
```c
#include <stdio.h>

int main() {
    int x = 42;
    printf("Address of x: %p\n", (void *)&x);
    // Might print: 0x7ffe5c3d4a8c
    return 0;
}
```

When you run this program, it prints an address like 0x7ffe5c3d4a8c. This is a virtual address--a number your program uses to refer to a memory location. But this is not where x actually lives in the physical RAM chips.
Two Types of Addresses:

Virtual Address (VA)
- What your program sees and uses
- Example: 0x7ffe5c3d4a8c
- Every program has its own virtual addresses starting from 0
- Abstract, not tied to physical hardware

Physical Address (PA)
- Where data actually resides in RAM chips
- Example: 0x12A4B000 (completely different!)
- Only one set of physical addresses in the entire system
- Corresponds to actual hardware locations

The Translation:

When your program accesses virtual address 0x7ffe5c3d4a8c, the hardware automatically translates it to the corresponding physical address (say, 0x12A4B000) where your data actually lives.
Program's view: Virtual Address 0x7ffe5c3d4a8c
↓
[MMU Translation]
↓
Reality in hardware: Physical Address 0x12A4B000
This simple translation mechanism simultaneously solves all the problems mentioned above:
Isolation: Program A's virtual address
0x1000 maps to physical address 0x3A2F1000.
Program B's virtual address 0x1000 maps to physical address
0x91B52000. Same virtual address, different physical
locations--they can never conflict.
Simplification: Every program uses addresses starting from 0. The program doesn't need to know where it's actually loaded in physical memory.
Overcommitment: A program can have virtual addresses from 0 to 4 billion (on 32-bit systems), even if physical RAM is only 2 GB. Not all virtual addresses need to be in physical memory simultaneously--some can be on disk.
Protection: The translation mechanism includes permission checks. The hardware verifies that your program is allowed to access each physical address it translates to.
The Memory Management Unit (MMU) is the hardware component inside your CPU that performs this translation. On every memory access--billions of times per second--the MMU: 1. translates the virtual address to a physical address, 2. checks that the access is permitted (read/write/execute), and 3. raises an exception (a page fault) if the translation is missing or the access is forbidden.
This happens automatically and transparently. Your program never knows it's using virtual addresses.
While virtual memory elegantly solves programming and protection problems, it introduces new challenges that the MMU must address:
Challenge 1: Translation Performance
- Modern CPUs execute ~4 billion instructions per second
- Many instructions access memory multiple times
- That's billions of translations per second
- Each translation must be fast (sub-nanosecond) or the CPU stalls

Challenge 2: Translation Overhead
- Translation requires looking up tables in memory
- But those table lookups themselves require memory accesses
- This could create infinite recursion
- The translation mechanism itself consumes memory

Challenge 3: The Memory Wall
- Memory access takes ~70 nanoseconds
- During that time, the CPU could execute ~280 instructions
- Most CPU time is spent waiting for memory
- Virtual memory must not make this worse

How These Are Solved:
- Translation Lookaside Buffer (TLB): A cache that remembers recent translations, making most translations effectively instant
- Hardware Page Table Walker: Dedicated logic to quickly traverse translation tables
- Multi-level page tables: A hierarchical structure that reduces memory overhead
- Large pages: Reducing the number of translations needed
Understanding virtual memory requires understanding the interplay between hardware and software:
Hardware (MMU):
- How address translation works (page tables)
- How to make translation fast (TLBs, page table walkers)
- Different architectures' approaches (x86, ARM, RISC-V)

Operating System:
- Managing page tables for all processes
- Handling page faults (when data isn't in RAM)
- Deciding what to keep in physical memory
- Implementing memory protection policies

Together:
- How hardware and OS cooperate
- Performance implications
- Security considerations
- Modern challenges and solutions
By the end of this chapter, you'll understand how modern computers provide every program with the illusion of having abundant, fast, private memory--even though physical memory is limited, slow, and shared.
Given the fundamental challenges outlined above--the memory wall, capacity limitations, and security requirements--memory management is the collection of hardware and software mechanisms that bridge the gap between what programmers need (fast, large, secure memory) and what hardware provides (slow, limited, shared physical RAM).
At its core, memory management must simultaneously solve several competing challenges:
Every program requires memory to store instructions and data. When a process starts, memory must be allocated. When it terminates, that memory must be reclaimed.
The Challenge: How do we allocate memory efficiently when programs have unpredictable memory needs? How do we avoid fragmentation? What happens when physical memory is exhausted?
Example: The malloc() Mystery

```c
#include <stdio.h>
#include <stdlib.h>

int main() {
    int* ptr = malloc(1024 * 1024 * 1024);  // Request 1 GB
    if (ptr != NULL) {
        printf("Allocation succeeded!\n");
        // But have we actually allocated 1 GB of physical RAM?
    }
    free(ptr);
    return 0;
}
```

Surprisingly, on most modern systems, malloc() will succeed even if you only have 512 MB of physical RAM available. How is this possible? The answer lies in virtual memory and demand paging--concepts we'll explore in depth.
Programs operate using virtual addresses (also called logical addresses), but the actual data resides at physical addresses in RAM. The Memory Management Unit (MMU) performs this translation billions of times per second.
The Challenge: How do we translate addresses quickly enough to avoid becoming a bottleneck? How do we handle the translation table itself, which might be enormous?
Concrete Example:
Virtual Address: 0x7fff5fbff710 (Program's view)
Physical Address: 0x3a2f1000 (Actual RAM location)
The MMU must perform this translation for every single memory access. On a modern CPU executing billions of instructions per second, with multiple memory accesses per instruction, this means billions of translations per second.
In a multitasking system, one misbehaving program must not be able to corrupt another program's memory--or worse, the operating system kernel's memory.
Historical Perspective:
Early microcomputers like the Commodore 64 and Apple II had no memory
protection. A single buggy program could crash the entire system. Modern
systems enforce strict isolation.
The Challenge: How do we enforce protection without sacrificing performance? How do we share memory safely when needed (e.g., shared libraries)?
Real-World Impact: The 2018 Spectre and Meltdown vulnerabilities demonstrated that even with hardware memory protection, subtle CPU implementation details could leak protected memory. These vulnerabilities affected billions of devices and showed that memory isolation is harder than it appears [Kocher et al., 2019; Lipp et al., 2018].
Memory management must be fast. Every optimization matters: better cache utilization, reduced TLB misses, efficient page table structures, and intelligent page replacement algorithms.
The Challenge: How do we balance competing goals? Larger pages reduce translation overhead but increase internal fragmentation. More complex page tables enable features but consume memory themselves.
Performance Reality:
To understand why memory management is critical, consider the real cost of different memory operations. These numbers come from actual hardware measurements on a modern Intel Xeon processor:
| Operation | Latency | CPU Cycles (@4 GHz) | Human Scale* |
|---|---|---|---|
| L1 Cache Hit | ~0.5 ns | ~2 cycles | 1 second |
| L2 Cache Hit | ~3 ns | ~12 cycles | 6 seconds |
| L3 Cache Hit | ~12 ns | ~50 cycles | 24 seconds |
| DRAM Access (TLB hit) | ~70 ns | ~280 cycles | 2.3 minutes |
| DRAM Access (TLB miss) | ~150 ns | ~600 cycles | 5 minutes |
| Page Fault (from SSD) | ~25,000 ns | ~100,000 cycles | ~14 hours |
| Page Fault (from HDD) | ~10,000,000 ns | ~40,000,000 cycles | ~8 months |

Table 1.1: Memory hierarchy latencies. *Human scale shows the relative time if an L1 cache hit took 1 second.

Understanding the Scale:

If an L1 cache access took 1 second, a page fault from a hard drive would take roughly eight months. This isn't a small performance penalty--it's the difference between a responsive system and one that appears frozen.
Why the Memory Wall Matters:
The memory wall means that for many applications, performance is
dominated not by CPU speed but by memory access patterns. Programs with
good locality (accessing nearby memory addresses) stay in fast caches.
Programs with poor locality constantly miss caches and TLBs, spending
most of their time waiting for slow DRAM.
Sources: Latency measurements from Hennessy & Patterson (2024), Intel Xeon performance guides, and empirical testing.
The earliest computers like ENIAC and EDVAC had no memory management. Programs had direct access to all physical memory. Only one program could run at a time.
Programming Model:

```
; PDP-8 example (1965)
; Direct physical addressing - no translation
0200: 7200  CLA      ; Clear accumulator
0201: 1234  TAD 234  ; Load from absolute address 234
0202: 3456  DCA 456  ; Store to absolute address 456
```
Limitations:
- No multitasking
- No protection
- Manual memory layout by the programmer
- Physical memory size limited program size
The first hardware memory management used simple base and bounds registers. The base register contained the program's starting address, and all memory accesses were offset from this base. The bounds register limited the maximum offset.
```
Physical Address = Base Register + Program Address

if (Program Address > Bounds Register):
    TRAP to operating system   # Protection fault
```

Example: IBM 7094 (1962)

```
Base:   0x100000
Bounds: 0x010000 (64 KB limit)

Program accesses: 0x1234
Physical address: 0x100000 + 0x1234 = 0x101234 ✓ (within bounds)

Program accesses: 0x20000
0x20000 > 0x010000 → PROTECTION FAULT
```
Limitations:
- Still required contiguous physical memory
- Relocation required copying the entire program in memory
- External fragmentation: gaps between programs
- No sharing of code between processes
Segmentation divided a program's address space into logical units (segments) like code, data, stack, and heap. Each segment had its own base and bounds.
Segments Matched Logical Structure:
Segment 0: Code (executable, read-only)
Segment 1: Data (read-write)
Segment 2: Stack (read-write, grows down)
Segment 3: Heap (read-write, grows up)
Virtual Address Format (Segmented): a segment number (selecting the base/bounds pair) followed by an offset within that segment.
Advantages:
- Matches the logical program structure
- Easy sharing (share the code segment)
- Different protection for different segments
- No internal fragmentation

Disadvantages:
- External fragmentation: segments of varying sizes left gaps in physical memory
- Required compaction (expensive memory copying)
- Large segments still needed contiguous physical memory
Historical Example: Intel 8086 (1978) The 8086 used segmentation to address 1 MB of memory with 16-bit registers:
Physical Address = (Segment x 16) + Offset
The Atlas computer (University of Manchester, 1962) introduced paging--dividing both virtual and physical memory into fixed-size blocks called pages (virtual) and frames (physical) [Fotheringham, 1961; Kilburn et al., 1962].
Key Insight: With fixed-size pages, any page can map to any frame. No external fragmentation!
Typical Page Sizes:
- 4 KB: Standard on x86/x64, ARM, RISC-V
- 16 KB: Option on ARM
- 64 KB: Option on ARM, used by some UNIX variants
- 2 MB / 1 GB: "Huge pages" for reducing translation overhead
Virtual Address Format (4 KB pages, 32-bit address space): a 20-bit virtual page number followed by a 12-bit page offset.

Physical Address Format: a 20-bit physical frame number followed by the same 12-bit offset.
Page Table: The mapping from virtual page numbers to physical frame numbers is stored in a page table.
Advantages of Paging:
1. No external fragmentation (all blocks are the same size)
2. Easy allocation (any free frame works)
3. Efficient sharing (multiple page tables point to the same frame)
4. Demand paging (load pages only when accessed)
5. Virtual memory (address space larger than physical RAM)

Disadvantages:
1. Internal fragmentation (wasted space in partial pages)
2. Page table overhead (memory consumed by the page tables themselves)
3. Translation overhead (every memory access needs translation)
Memory fragmentation comes in two forms: external and internal. Understanding the difference is crucial for evaluating memory management schemes.
External fragmentation occurs when free memory is scattered in small, unusable chunks between allocated blocks.
The Problem:
Free memory becomes fragmented into non-contiguous holes. Even when
sufficient total free memory exists, allocation requests may fail if
they require contiguous memory larger than the biggest hole.
Real-World Example:
After a system runs for 24 hours with frequent allocations and
deallocations, 20-40% of memory can become unusable due to
fragmentation. A system might show 4 GB "free" but be unable to allocate
even 100 MB contiguously.
Solutions:
- Compaction: Move allocated blocks to consolidate free space (expensive: requires copying)
- Paging: Use fixed-size blocks to eliminate external fragmentation entirely
Internal fragmentation occurs when allocated blocks are larger than needed, wasting space within the allocated block.
The Trade-off:
| Page Size | Internal Fragmentation | Page Table Size | TLB Coverage |
|---|---|---|---|
| 1 KB | Low (max 1023 bytes waste) | Large (many entries) | Poor |
| 4 KB | Medium (max 4095 bytes waste) | Medium | Good |
| 2 MB | High (max 2 MB waste) | Small (few entries) | Excellent |
Real-World Impact:
On a system with 16 GB RAM, 4 KB pages, and 100 processes:
- Average internal fragmentation per process: ~500 KB
- Total wasted memory: ~50 MB (0.3% of RAM)
- This waste is predictable and acceptable
Compare this to external fragmentation which can waste 20-40% of RAM unpredictably.
Summary Table:
| Type | Location | Cause | Impact | Predictable? | Solution |
|---|---|---|---|---|---|
| External | Between blocks | Variable-size allocation | Can't use total free memory | No, gets worse over time | Compaction or paging |
| Internal | Within blocks | Fixed-size blocks | Wasted space in blocks | Yes, ≤ (page_size - 1) per allocation | Choose appropriate page size |
Why Modern Systems Use Paging:
External fragmentation is a catastrophic problem: memory becomes increasingly unusable over time, and the only solution (compaction) is expensive and disruptive. A system might show gigabytes of "free" memory but fail to allocate even small contiguous blocks.
Internal fragmentation is a manageable problem: the waste is bounded, predictable, and typically only 0.3-1% of total RAM with 4 KB pages. The system always knows the maximum waste per allocation (page_size - 1 bytes).
This is why virtually all modern operating systems use paging for memory management--accepting small, predictable waste to completely eliminate large, unpredictable memory loss.
Modern systems implement memory management at multiple levels:
The MMU is a specialized hardware component, integrated into the CPU, that performs address translation at hardware speed.
Core MMU Responsibilities:
1. Address Translation: Convert virtual → physical addresses
2. TLB Management: Cache recent translations
3. Protection Checking: Enforce access permissions
4. Exception Generation: Trigger page faults when needed
MMU Pipeline Integration:
Instruction Fetch → Instruction TLB → Instruction Cache → Decode
↓
Physical Address
Data Access → Data TLB → Data Cache → Load/Store Unit
↓
Physical Address
The MMU operates in parallel with the CPU pipeline. Any TLB miss stalls the pipeline while the page table is walked.
Historical Note:
Early CPUs like the Motorola 68020 had optional external MMU chips
(68851). Modern CPUs integrate the MMU on-die for performance.
The OS kernel manages:
- Page table maintenance: Creating/updating page tables for processes
- Page fault handling: Loading pages from disk when absent
- Page replacement: Choosing which pages to evict under memory pressure
- Swapping: Moving entire processes to/from disk
- Memory allocation: Allocating frames to processes
Key OS Data Structures:
```c
// Linux kernel mm_struct (simplified)
struct mm_struct {
    pgd_t *pgd;                  // Page Global Directory (root page table)
    unsigned long start_code;    // Code segment start
    unsigned long end_code;      // Code segment end
    unsigned long start_data;    // Data segment start
    unsigned long end_data;      // Data segment end
    unsigned long start_brk;     // Heap start
    unsigned long brk;           // Heap end (current)
    unsigned long start_stack;   // Stack start
    unsigned long total_vm;      // Total virtual memory pages
    unsigned long locked_vm;     // Locked pages (mlocked)
    unsigned long pinned_vm;     // Pinned pages
    unsigned long data_vm;       // Data pages
    unsigned long exec_vm;       // Executable pages
    unsigned long stack_vm;      // Stack pages
};
```

Page Fault Handling Flow:
1. CPU accesses virtual address
2. TLB miss → Page table walk
3. Page not present → Page Fault Exception
4. CPU switches to kernel mode, saves context
5. OS page fault handler invoked:
a. Check if address is valid (in process's VMA)
b. If invalid → Segmentation Fault (kill process)
c. If valid but not resident:
- Allocate physical frame
- If swapped: Read page from disk
- If zero-page: Zero the frame
- If file-backed: Read from file
- Update page table entry
- Mark page as present
6. Return from exception
7. CPU retries the faulting instruction
8. Translation succeeds → execution continues
Applications don't usually manage memory at the page level. Instead, they use allocation libraries:
C/C++:
```c
void* malloc(size_t size);
void  free(void* ptr);
void* calloc(size_t nmemb, size_t size);
void* realloc(void* ptr, size_t size);
```

How malloc() Works (Simplified):
```c
// glibc malloc implementation strategy (simplified pseudocode)
void* malloc(size_t size) {
    if (size < MMAP_THRESHOLD) {
        // Small allocation: use the heap (brk/sbrk)
        return allocate_from_heap(size);
    } else {
        // Large allocation: use mmap()
        return mmap(NULL, size, PROT_READ|PROT_WRITE,
                    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    }
}
```

Behind the Scenes:
- Small allocations (<128 KB): use the heap via the brk() system call
- Large allocations: use mmap() to map anonymous memory
- Free list management (bins, arenas)
- Metadata overhead (headers, footers)
Memory Allocation Hierarchy:
Application calls malloc()
↓
C Library (glibc)
↓
System Call (brk/mmap)
↓
Kernel (page allocator)
↓
MMU (page table update)
↓
Physical Memory Assignment
Virtual memory is one of computer science's most powerful abstractions. It provides each process with the illusion of:
1. Vast address space (often larger than physical RAM)
2. Contiguous memory (even if physical memory is fragmented)
3. Private memory (isolated from other processes)
4. Uniform access (same addressing model for all memory)

Programmers can assume:
- Memory starts at address 0
- Memory is contiguous
- Pointers just work (within the valid address space)
- No need to manage physical memory locations
Physical memory can be overcommitted--the sum of all virtual address spaces can exceed physical RAM.
Real Example:
Physical RAM: 16 GB
Process 1: 8 GB virtual
Process 2: 8 GB virtual
Process 3: 8 GB virtual
Total: 24 GB virtual → More than physical!
This works because: - Not all virtual pages are actually in use - Pages can be swapped to disk - Many pages are shared (code, libraries)
Each process has its own page table. Process A cannot access Process B's memory.
Process A's view:
0x00000000 - 0x7FFFFFFF: User space
0x80000000 - 0xFFFFFFFF: Kernel space
Process B's view:
0x00000000 - 0x7FFFFFFF: User space (different physical frames!)
0x80000000 - 0xFFFFFFFF: Kernel space (same physical frames)
Files can be mapped into virtual memory. Reading/writing memory reads/writes the file.
```c
// Map a file into memory
int fd = open("database.db", O_RDWR);
void* ptr = mmap(NULL, file_size, PROT_READ|PROT_WRITE,
                 MAP_SHARED, fd, 0);
// Now ptr[x] reads/writes the file!
```

When fork() creates a child process, both parent and child initially share the same physical pages (marked read-only). Only when one writes to a page is it duplicated.
Parent calls fork()
↓
Child created with identical page tables
↓
Both point to same physical frames (read-only)
↓
Parent writes to page → Page Fault
↓
Kernel copies page, updates parent's page table
↓
Parent has private copy, child still has original
This makes fork() extremely fast, even for large
processes.
Virtual memory isn't free: page tables consume memory, every access requires translation, TLB misses stall the pipeline, and page faults can cost milliseconds.
Performance Impact Example:
```c
// Sequential access (good locality)
for (i = 0; i < ARRAY_SIZE; i++) {
    sum += array[i];
}
// TLB hit rate: ~99%+

// Random access (poor locality)
for (i = 0; i < ARRAY_SIZE; i++) {
    sum += array[rand() % ARRAY_SIZE];
}
// TLB hit rate: can drop to 50% or worse
```

Before diving deeper, let's establish precise definitions:
- Virtual Address: The address used by a program (logical address)
- Physical Address: The actual address in RAM hardware
- Address Space: The set of all valid addresses (e.g., 32-bit = 4 GB address space)
- Page: A fixed-size block of virtual memory (typically 4 KB)
- Frame: A fixed-size block of physical memory (same size as a page)
- Page Number: The high-order bits of a virtual address
- Offset: The low-order bits (position within the page)
- Page Table: Data structure mapping virtual pages to physical frames
- Page Table Entry (PTE): One entry in the page table
- Page Directory: Higher-level page table (in multi-level schemes)
- Translation Lookaside Buffer (TLB): Hardware cache of recent translations
- Resident: Page is in physical memory
- Swapped: Page is on disk (swap space)
- Dirty: Page has been modified
- Present: Page is accessible (resident and valid)
- Valid: Address is within the process's address space
- Page Fault: Exception raised when accessing a non-present page
- Swapping: Moving entire processes to/from disk
- Paging: Moving individual pages to/from disk
- Thrashing: System spends more time paging than executing
Let's trace a complete memory access from virtual address to physical address.
System Configuration:
- 32-bit address space
- 4 KB (2^12 byte) pages
- 4 GB (2^32 byte) physical RAM = 1M frames
Virtual Address: 0x12345678
Step 1: Split address into page number and offset

VPN (bits 31-12):   0x12345
Offset (bits 11-0): 0x678
Step 2: Look up in TLB
TLB lookup for VPN 0x12345...
TLB HIT! → Physical Frame Number: 0xA2B4C
Step 3: Construct physical address
Physical Frame: 0xA2B4C
Offset: 0x678
Physical Address = (Frame Number << 12) | Offset
= (0xA2B4C << 12) | 0x678
= 0xA2B4C678
Step 4: Access physical memory
Read from physical address 0xA2B4C678
If TLB Miss had occurred:
TLB miss for VPN 0x12345
→ Hardware page table walk:
1. Read Page Directory entry
2. Read Page Table entry
3. Extract Physical Frame Number: 0xA2B4C
4. Load into TLB
5. Retry access (now TLB hit)
If Page Not Present:
Page table entry for VPN 0x12345 has Present bit = 0
→ Page Fault Exception
→ Kernel page fault handler:
1. Check if address valid
2. Allocate physical frame
3. Load page from disk (if swapped)
4. Update PTE: Frame = 0xA2B4C, Present = 1
5. Return from exception
→ CPU retries instruction (now page is present)
While page tables elegantly solve the problem of address translation, they introduce a significant performance challenge: every memory access requires at least one additional memory read to consult the page table. For a simple memory load instruction, the CPU must first read the page table entry from memory to find the translation, and only then read the actual data from the resulting physical address.
This doubles the effective memory access time! For a program making millions of memory accesses per second, this overhead would be catastrophic.
The Translation Lookaside Buffer (TLB) is a specialized hardware cache that stores recently used virtual-to-physical address translations. It sits between the CPU and the page table, intercepting every address translation request.
How the TLB Works: on every memory access, the MMU first checks the TLB for the virtual page number. On a hit, the cached physical frame number is used immediately. On a miss, the page table is walked in memory and the resulting translation is inserted into the TLB (evicting an older entry), so subsequent accesses to the same page hit.
The performance difference between a TLB hit and miss is dramatic:
| Scenario | Cycles | Time (3 GHz CPU) | Relative Speed |
|---|---|---|---|
| TLB Hit | 1-2 | ~0.5 ns | 1x (baseline) |
| TLB Miss | 100-300 | ~50-100 ns | 100-200x slower |
Modern processors typically achieve 95-99% TLB hit rates during normal operation, meaning most memory accesses avoid the slow page table walk entirely.
A note on TLB hit latency and cache organisation. The "1–2 cycles" figure assumes the TLB result is needed before the L1 data cache can be accessed — the classic PIPT (Physically-Indexed, Physically-Tagged) configuration. Many modern processors instead use VIPT (Virtually-Indexed, Physically-Tagged) L1 caches: the cache index is computed from virtual address bits simultaneously with the TLB lookup. When the VIPT index bits lie entirely within the page offset, there is no aliasing and the effective TLB penalty on an L1 cache hit is zero cycles — the translation arrives just in time for tag comparison. L2 and L3 caches are generally PIPT and pay the full TLB latency. Chapter 7 (§7.8.4) examines VIPT cache aliasing hazards in detail. Mittal (2016) provides a systematic comparison of all four cache-addressing modes (PI-PT, VI-PT, VI-VT, PI-VT).
A typical TLB is organized as an associative cache (content-addressable memory):
Entry Format:
Each entry stores:
- Virtual Page Number (VPN): The tag to match against
- Physical Frame Number (PFN): The translation result
- Valid bit: Whether this entry contains valid data
- Flags: Permissions (read/write/execute) from the PTE

Size Example: A typical L1 TLB might have:
- 64 entries for 4 KB pages
- Coverage: 64 x 4 KB = 256 KB of address space
- Only 256 KB of "working set" can be accessed without misses!
This limited coverage is called TLB reach and is a critical factor in system performance. Programs with larger working sets will experience more TLB misses.
TLB reach is defined precisely as:
TLB Reach = Number of TLB Entries × Page Size
This formula immediately shows why page size selection is a first-order architectural decision. The following table shows TLB reach for typical hardware configurations (Mittal, 2016; Qasem & Magee, 2012):
| TLB Level | Entries | 4 KB pages | 2 MB pages | 1 GB pages |
|---|---|---|---|---|
| L1 DTLB | 64 | 256 KB | 128 MB | 64 GB |
| L2 STLB | 1,536 | 6 MB | 3 GB | 1.5 TB |
| L2 STLB | 2,048 | 8 MB | 4 GB | 2 TB |
The AI/ML reach crisis. Modern large language model training exposes just how narrow 4 KB TLB reach is. GPT-3's working set consists of approximately 350 GB of parameters, plus 700 GB each for momentum and variance under AdamW optimization, plus 50–100 GB of activations — totalling roughly 1,800 GB. An L1 TLB with 64 entries covering 256 KB addresses 0.000014% of that working set; TLB miss rate approaches 100%. Even the 1,536-entry L2 STLB covers only 6 MB. Wu & Taylor (2021) confirmed this directly for deep learning benchmarks: mapping the kernel address space with 8 MB huge pages yielded "a single TLB entry where 4 KB pages would need 2,048 entries," producing 55–61% performance improvement. Switching to 2 MB pages shifts L1 reach from 256 KB to 128 MB (a 512× gain), meaning the L1 TLB alone can cover a meaningful fraction of typical training batches. This is why huge pages are increasingly favoured in AI infrastructure — a topic examined in depth in Chapters 9 and 14.
The TLB is automatically managed by the hardware MMU:
The TLB is updated on:
- Every TLB miss (a new entry is added)
- Page table updates by the OS (the corresponding entry is invalidated)

The TLB is flushed (cleared) on:
- Context switches between processes (different address spaces)
- Explicit flush instructions (e.g., when the kernel updates page tables)
- Some CPUs support ASIDs (Address Space Identifiers) to avoid flushing on context switches
The TLB is arguably the most important performance component in virtual memory systems:
Real-World Example:
Consider a database scanning a 1 GB table with random access:
- With 4 KB pages: 262,144 pages; typical TLB: 64 entries
- TLB covers: 256 KB (0.025% of the data)
- Result: ~99.975% TLB miss rate = catastrophic performance
This is why databases and high-performance applications use huge pages (covered in Chapter 2.7).
The TLB transforms virtual memory from a theoretical elegance into a practical, high-performance system. Modern processors invest significant chip area in TLB structures precisely because this cache is so performance-critical.
In the next section, we'll see how multi-level page tables address another challenge: efficiently managing large, sparse address spaces.
The page table design we've discussed so far has a critical flaw: it assumes a single-level page table that must contain an entry for every possible virtual page. For modern 64-bit systems with vast address spaces, this becomes completely impractical.
Consider a 64-bit system with 4 KB pages:
Address Space Calculation:
- Virtual address space: 2^48 bytes = 256 TB (typical implementation)
- Page size: 4 KB = 2^12 bytes
- Number of pages: 256 TB / 4 KB = 68,719,476,736 pages

Page Table Size:
- Entries needed: 68.7 billion
- Entry size: 8 bytes (64-bit PTE)
- Total page table size: 512 GB per process!
This is absurd. The page table alone would consume more memory than most systems have, and we haven't even allocated memory for the actual program yet!
The key insight is that programs don't use most of their address space. A typical program might use a few megabytes of code, some tens or hundreds of megabytes of heap and data, and a few megabytes of stack--a tiny fraction of a 256 TB address space.
The remaining 99.9999% of the address space is unmapped and unused. Allocating a complete page table for this sparse usage is incredibly wasteful.
Multi-level page tables break the single large page table into a tree structure, allocating page table memory only for regions of the address space that are actually in use.
Concept: Instead of one huge table, create a hierarchy:
- Level 1: Page Directory (small, always present)
- Level 2: Page Tables (only created for used regions)
Let's walk through a concrete example with a 32-bit address space and 4 KB pages.
Address Format: a 10-bit directory index (bits 31-22), a 10-bit table index (bits 21-12), and a 12-bit offset (bits 11-0).
Translation Process:
1. Virtual Address: 0x00403A7C
2. Split address into components:
Dir Index: bits 31-22 = 0x001 = 1
Table Index: bits 21-12 = 0x003 = 3
Offset: bits 11-0 = 0xA7C = 2684
3. Walk the page table hierarchy:
Step 1: Use Dir Index to find Page Table
+-----------------------+
| Page Directory | ← Always in memory
| Entry 0: NULL |
| Entry 1: PTE* 0x5000 | ← Points to Page Table
| Entry 2: NULL |
| ... |
+-----------------------+
Step 2: Use Table Index to find PFN
+-----------------------+
| Page Table @ 0x5000 | ← Only allocated if needed
| Entry 0: PFN 0x100 |
| Entry 1: PFN 0x101 |
| Entry 2: PFN 0x102 |
| Entry 3: PFN 0x1B7 | ← Target entry
| ... |
+-----------------------+
Step 3: Combine PFN with Offset
Physical Address = (0x1B7 << 12) | 0xA7C = 0x001B7A7C
4. Access physical memory at 0x001B7A7C
Single-Level Page Table (32-bit system):
- Entries needed: 4 GB / 4 KB = 1,048,576
- Entry size: 4 bytes
- Total size: 4 MB per process (always, regardless of usage)

Two-Level Page Table:
- Page Directory: 1,024 entries x 4 bytes = 4 KB (always present)
- Page Tables: only allocated for used regions

Example Program Using 268 MB:
- Used pages: 268 MB / 4 KB = 68,608
- Page tables needed: 68,608 / 1,024 = 67
- Page table size: 67 x 4 KB = 268 KB
- Plus directory: 268 KB + 4 KB = 272 KB total
Savings: 4 MB → 272 KB (93% reduction!)
For a 64-bit system with sparse usage, the savings are even more dramatic.
Modern architectures use different numbers of levels:
| Architecture | Levels | Address Bits | Coverage |
|---|---|---|---|
| x86 (32-bit) | 2 | 32 bits | 4 GB |
| x86-64 (4-level) | 4 | 48 bits | 256 TB |
| x86-64 (5-level) | 5 | 57 bits | 128 PB |
| ARM64 (typical) | 4 | 48 bits | 256 TB |
| RISC-V Sv39 | 3 | 39 bits | 512 GB |
| RISC-V Sv48 | 4 | 48 bits | 256 TB |
More levels = larger address space but slower page table walks.
Modern 64-bit systems typically use 4-level page tables as a sweet spot between address space size and lookup performance.
Advantages:
- ✅ Huge memory savings for sparse address spaces
- ✅ Pay only for what you use (demand allocation)
- ✅ Enables large address spaces (48-57 bits practical)
- ✅ Easier sharing between processes at a granular level

Disadvantages:
- ❌ More complex hardware (multiple memory accesses per translation)
- ❌ Slower on TLB miss (must walk multiple levels)
- ❌ More page faults possible (intermediate tables not present)
With multi-level page tables, a TLB miss becomes expensive:
Cost of Address Translation:
- TLB hit: 1 cycle (~0.3 ns)
- TLB miss with 2-level PT: 2 memory accesses (~200 ns) = 667x slower
- TLB miss with 4-level PT: 4 memory accesses (~400 ns) = 1,333x slower
This makes the TLB's caching function absolutely critical. A program with a 99% TLB hit rate will be dramatically faster than one with a 95% hit rate, especially on systems with deep page table hierarchies.
Multi-level page tables are a perfect example of indirection solving a scaling problem. By adding a level of indirection (or several), we transform an impractical memory requirement into a manageable one, enabling the vast 64-bit address spaces that modern programs enjoy.
The specific implementations vary by architecture--x86-64's 4-level structure differs from ARM's implementation, and RISC-V offers multiple options. We'll explore these architecture-specific details in Part III: MMU in Modern Architectures.
This completes our introduction to the fundamental mechanisms of address translation. We've covered: - The basic concept of pages and page tables - How address translation works (virtual → physical) - The critical role of the TLB in making translation fast - How multi-level page tables make large address spaces practical
In the next section, we'll look at real-world examples of how these mechanisms impact system performance and security.
Scenario: Database performing random lookups in a 100 GB dataset on a system with 32 GB RAM.
Analysis:
- Page size: 4 KB
- Total pages: 25 million (100 GB / 4 KB)
- Physical frames: 8 million (32 GB / 4 KB)
- TLB size: 1,536 entries (Intel Xeon)
- TLB coverage: 6 MB (1,536 x 4 KB)
Result: With random access patterns, the TLB hit rate drops to ~5%. Each TLB miss requires a page table walk (4 memory accesses with 4-level paging). Performance can degrade by 60-80% compared to sequential access.
Solution: Use 2 MB huge pages:
- TLB coverage: 3 GB (1,536 x 2 MB)
- TLB hit rate improves to ~60%
- Performance increase: 2-3x for this workload
In 2018, researchers discovered that speculative execution in modern CPUs could be exploited to leak protected memory [Kocher et al., 2019; Lipp et al., 2018].
The Vulnerability: CPUs speculatively execute code before checking permissions. Even though mis-speculated instructions are discarded, they leave traces in the cache and TLB that can be measured.
Attack Example (Simplified):
```c
// Attacker code (user space)
if (x < array1_size) {             // Will be true eventually
    y = array2[array1[x] * 4096];  // Speculatively executed!
}
```

During speculative execution:
1. CPU loads array1[x] (x is out of bounds, pointing into kernel memory)
2. CPU loads array2[array1[x] * 4096]
3. The page is brought into the TLB/cache
4. Speculation was wrong → results are discarded
5. BUT the TLB/cache still contains traces of the kernel data!
6. Attacker measures TLB/cache timing to infer the kernel data
Impact:
- Affected billions of CPUs (Intel, AMD, ARM)
- Required OS patches (KPTI: Kernel Page Table Isolation)
- Performance overhead: 5-30% depending on workload
Lesson: Memory management isn't just about performance--it's critical to security. The MMU is part of the security perimeter.
Rowhammer demonstrated that repeated accesses to DRAM rows can cause bit flips in adjacent rows--a hardware reliability issue that becomes a security vulnerability [Kim et al., 2014].
The Attack:
```c
// Repeatedly access two rows (hammering)
while (1) {
    *addr1;
    *addr2;
    clflush(addr1);   // Flush so the next access hits DRAM, not the cache
    clflush(addr2);
}
// Causes bit flips in adjacent physical rows
```

MMU Connection: Attackers used the MMU's address translation to:
1. Determine which virtual addresses map to adjacent physical rows
2. Bypass MMU protection by flipping bits in page table entries
3. Gain kernel privileges
Impact:
- Demonstrated in 2015 (Google Project Zero)
- Exploitable from JavaScript (no native code needed)
- Led to TRR (Target Row Refresh) in DRAM
Key Takeaways:
Memory management bridges the memory wall between fast CPUs and slow DRAM through caching and virtual memory.
Paging solves fragmentation by using fixed-size pages, enabling efficient allocation and virtual memory.
The MMU performs address translation billions of times per second, making virtual memory practical.
Virtual memory provides powerful abstractions: large address spaces, isolation, memory overcommitment, and simplified programming.
Performance matters: TLB misses and page faults can dominate execution time in memory-intensive workloads.
Security is fundamental: The MMU enforces isolation, but vulnerabilities like Spectre and Rowhammer show that hardware memory protection is subtle and complex.
Thought-Provoking Questions:
If CPU speeds continue increasing while memory latency stagnates, what percentage of CPU time will be spent waiting for memory?
How large should page sizes be? Larger pages reduce translation overhead but increase internal fragmentation. What's the optimal tradeoff?
Can we eliminate page tables entirely? What alternative translation mechanisms might be possible?
Should applications be aware of physical memory layout (NUMA, memory tiers)? Or should the OS hide this complexity?
With persistent memory (NVDIMMs, Optane), do we need to rethink the entire virtual memory model?
This chapter has introduced the fundamental concepts of memory management:
- The memory wall and why management is necessary
- Evolution from simple base/bounds to modern paging
- The role of hardware (MMU), OS, and applications
- The virtual memory abstraction and its benefits
- Real-world security and performance implications
In the chapters ahead:
Chapter 2: Virtual memory concepts in depth--demand paging, page replacement algorithms, working sets, thrashing
Chapter 3: Hardware implementation of MMUs--page table formats, multi-level paging, TLB design
Part II: Historical evolution--from Atlas to modern architectures
Part III: Deep dives into x86, ARM, RISC-V, and other architectures
Part IV: Advanced techniques--huge pages, NUMA, memory compression
Part V: Security--memory protection, confidential computing, isolation
Part VI: Vulnerabilities--Spectre, Meltdown, Rowhammer, and mitigations
Part VII: Debugging and testing MMU implementations
Part VIII: Future directions--CXL, disaggregated memory, AI workloads
Part IX: Performance optimization for real-world systems
Hennessy, J. L., & Patterson, D. A. (2024). Computer Architecture: A Quantitative Approach (7th ed.). Morgan Kaufmann. Chapter 2: Memory Hierarchy Design.
Denning, P. J. (1970). "Virtual memory". Computing Surveys, 2(3), 153-189.
Kilburn, T., Edwards, D. B. G., Lanigan, M. J., & Sumner, F. H. (1962). "One-level storage system". IRE Transactions on Electronic Computers, EC-11(2), 223-235.
Jacob, B., Ng, S. W., & Wang, D. T. (2010). Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann.
Kocher, P., et al. (2019). "Spectre attacks: Exploiting speculative execution". Communications of the ACM, 63(7), 93-101.
Lipp, M., et al. (2018). "Meltdown: Reading kernel memory from user space". 27th USENIX Security Symposium.
Kim, Y., et al. (2014). "Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors". ISCA 2014.
Arpaci-Dusseau, R. H., & Arpaci-Dusseau, A. C. (2018). Operating Systems: Three Easy Pieces. Chapter 18-23 (Virtual Memory). Available at: https://pages.cs.wisc.edu/~remzi/OSTEP/
Mittal, S. (2016). "A survey of techniques for architecting TLBs". Concurrency and Computation: Practice and Experience, 29(10). DOI: 10.1002/cpe.4061
Qasem, A., & Magee, J. (2012). "Improving TLB performance through demand-driven superpaging". Software: Practice and Experience, 43(6), 705-729. DOI: 10.1002/spe.2128