Chapter 1: Memory Management Basics

Part I: Introduction and Fundamentals

"The memory hierarchy is a fundamental structure in computer architecture, and virtual memory is one of the greatest ideas in computer science."
-- John L. Hennessy & David A. Patterson, Computer Architecture: A Quantitative Approach

1.1 Introduction: Why Virtual Memory Exists

The Programming Model We Take for Granted

When you write a program in C, Python, or any modern language, you make several assumptions about memory:

  1. Your program can use memory addresses starting from 0
  2. Memory appears as one large, contiguous space
  3. Other programs won't interfere with your memory
  4. You can use more memory than physically available (within limits)
  5. The program works the same regardless of how much RAM the computer has

These assumptions seem natural, but they're not inherent to computer hardware. They're provided by virtual memory--one of computer science's most important abstractions [Bhattacharjee & Lustig, 2017].

What Virtual Memory Provides

Virtual memory (VM) is the hardware and software mechanism that gives each program its own private view of the computer's memory. According to Bhattacharjee & Lustig (2017), virtual memory provides several critical benefits:

1. Simplified Programming (Abstraction)
   - Programs use simple addresses starting from 0
   - No need to know physical memory size or layout
   - Memory appears as one large array
   - Code is portable across different hardware

2. Memory Protection and Isolation
   - Each program has its own address space
   - Programs cannot access each other's memory
   - Operating system memory is protected from user programs
   - Security: malicious code can't read passwords from other programs

3. Memory Overcommitment
   - Total virtual memory across all programs can exceed physical RAM
   - Programs can use address spaces larger than physical memory
   - Inactive data can be stored on disk
   - Efficient memory sharing between programs

4. Flexibility in Memory Allocation
   - Programs can be loaded anywhere in physical memory
   - Memory doesn't need to be contiguous in physical RAM
   - Easy to relocate programs
   - Simplifies dynamic memory allocation

The Core Mechanism: Address Translation

Virtual memory works through address translation. Let's understand this with a concrete example:

#include <stdio.h>

int main() {
    int x = 42;
    printf("Address of x: %p\n", (void *)&x);
    // Might print: 0x7ffe5c3d4a8c
    return 0;
}

When you run this program, it prints an address like 0x7ffe5c3d4a8c. This is a virtual address--a number your program uses to refer to a memory location. But this is not where x actually lives in the physical RAM chips.

Two Types of Addresses:

Virtual Address (VA)
- What your program sees and uses
- Example: 0x7ffe5c3d4a8c
- Every program has its own virtual addresses, starting from 0
- Abstract, not tied to physical hardware

Physical Address (PA)
- Where data actually resides in RAM chips
- Example: 0x12A4B000 (completely different!)
- Only one set of physical addresses in the entire system
- Corresponds to actual hardware locations

The Translation:

When your program accesses virtual address 0x7ffe5c3d4a8c, hardware automatically translates it to the corresponding physical address (say, 0x12A4B000) where your data actually lives.

Program's view:         Virtual Address 0x7ffe5c3d4a8c
                                  ↓
                        [MMU Translation]
                                  ↓
Reality in hardware:    Physical Address 0x12A4B000

Why Two Address Spaces Solve Everything

This simple translation mechanism simultaneously solves all the problems mentioned above:

Isolation: Program A's virtual address 0x1000 maps to physical address 0x3A2F1000. Program B's virtual address 0x1000 maps to physical address 0x91B52000. Same virtual address, different physical locations--they can never conflict.

Simplification: Every program uses addresses starting from 0. The program doesn't need to know where it's actually loaded in physical memory.

Overcommitment: A program can have virtual addresses from 0 to 4 billion (on 32-bit systems), even if physical RAM is only 2 GB. Not all virtual addresses need to be in physical memory simultaneously--some can be on disk.

Protection: The translation mechanism includes permission checks. The hardware verifies that your program is allowed to access each physical address it translates to.

The Memory Management Unit (MMU)

Memory Hierarchy: Size, Speed, and Latency Trade-offs

Memory Level                        Typical Size      Access Latency
CPU Registers                       16 – 256 bytes    0 cycles (instant)
L1 Cache (per-core, split I$+D$)    32 KB – 64 KB     4 cycles (~1 ns)
L2 Cache (per-core or shared)       256 KB – 1 MB     12 cycles (~4 ns)
L3 Cache (LLC, shared)              8 MB – 64 MB      40 cycles (~14 ns)
Main Memory (DDR4/DDR5/LPDDR5)      8 GB – 512 GB     200 cycles (~70 ns)
SSD / NVMe (swap, backing store)    256 GB – 8 TB     25 μs (~100,000 cycles)
Hard Disk Drive (HDD)               1 TB – 20 TB      10 ms (~40M cycles)

Each level up is faster, smaller, and more expensive per byte. The memory wall: at 4 GHz a CPU cycle takes ~0.25 ns while a DRAM access takes ~70 ns, so every cache miss costs roughly 280 CPU cycles.
Figure 1.1: Memory hierarchy: CPU registers to persistent storage with typical latencies and capacities at each level.

The Memory Management Unit (MMU) is the hardware component inside your CPU that performs this translation. On every memory access--billions of times per second--the MMU:

  1. Receives a virtual address from the program
  2. Translates it to the corresponding physical address
  3. Checks if the access is allowed (read/write/execute permissions)
  4. Either completes the access or triggers an exception

This happens automatically and transparently. Your program never knows it's using virtual addresses.

The Three Fundamental Challenges

While virtual memory elegantly solves programming and protection problems, it introduces new challenges that the MMU must address:

Challenge 1: Translation Performance
- Modern CPUs execute ~4 billion instructions per second
- Many instructions access memory multiple times
- That's billions of translations per second
- Each translation must be fast (sub-nanosecond) or the CPU stalls

Challenge 2: Translation Overhead
- Translation requires looking up tables in memory
- But those table lookups themselves require memory accesses
- This could create infinite recursion
- The translation mechanism itself consumes memory

Challenge 3: The Memory Wall
- Memory access takes ~70 nanoseconds
- During that time, the CPU could execute ~280 instructions
- Most CPU time is spent waiting for memory
- Virtual memory must not make this worse

How These Are Solved:
- Translation Lookaside Buffer (TLB): a cache that remembers recent translations, making most translations instant
- Hardware page table walker: dedicated logic to quickly traverse translation tables
- Multi-level page tables: hierarchical structure that reduces memory overhead
- Large pages: reducing the number of translations needed

What This Chapter Covers

Understanding virtual memory requires understanding the interplay between hardware and software:

Hardware (MMU):
- How address translation works (page tables)
- How to make translation fast (TLBs, page table walkers)
- Different architectures' approaches (x86, ARM, RISC-V)

Operating System:
- Managing page tables for all processes
- Handling page faults (when data isn't in RAM)
- Deciding what to keep in physical memory
- Implementing memory protection policies

Together:
- How hardware and OS cooperate
- Performance implications
- Security considerations
- Modern challenges and solutions

By the end of this chapter, you'll understand how modern computers provide every program with the illusion of having abundant, fast, private memory--even though physical memory is limited, slow, and shared.


1.2 What is Memory Management?

Given the fundamental challenges outlined above--the memory wall, capacity limitations, and security requirements--memory management is the collection of hardware and software mechanisms that bridge the gap between what programmers need (fast, large, secure memory) and what hardware provides (slow, limited, shared physical RAM).

At its core, memory management must simultaneously solve several competing challenges:

1.2.1 The Four Pillars of Memory Management

1. Allocation and Deallocation

Every program requires memory to store instructions and data. When a process starts, memory must be allocated. When it terminates, that memory must be reclaimed.

The Challenge: How do we allocate memory efficiently when programs have unpredictable memory needs? How do we avoid fragmentation? What happens when physical memory is exhausted?

Example: The malloc() Mystery

int* ptr = malloc(1024 * 1024 * 1024);  // Request 1 GB
if (ptr != NULL) {
    printf("Allocation succeeded!\n");
    // But have we actually allocated 1 GB of physical RAM?
}

Surprisingly, on most modern systems, malloc() will succeed even if you only have 512 MB of physical RAM available. How is this possible? The answer lies in virtual memory and demand paging--concepts we'll explore in depth.

2. Address Translation

Programs operate using virtual addresses (also called logical addresses), but the actual data resides at physical addresses in RAM. The Memory Management Unit (MMU) performs this translation billions of times per second.

The Challenge: How do we translate addresses quickly enough to avoid becoming a bottleneck? How do we handle the translation table itself, which might be enormous?

Concrete Example:

Virtual Address:  0x7fff5fbff710  (Program's view)
Physical Address: 0x3a2f1000      (Actual RAM location)

The MMU must perform this translation for every single memory access. On a modern CPU executing billions of instructions per second, with multiple memory accesses per instruction, this means billions of translations per second.

3. Protection and Isolation

In a multitasking system, one misbehaving program must not be able to corrupt another program's memory--or worse, the operating system kernel's memory.

Historical Perspective:
Early microcomputers like the Commodore 64 and Apple II had no memory protection. A single buggy program could crash the entire system. Modern systems enforce strict isolation.

The Challenge: How do we enforce protection without sacrificing performance? How do we share memory safely when needed (e.g., shared libraries)?

Real-World Impact: The 2018 Spectre and Meltdown vulnerabilities demonstrated that even with hardware memory protection, subtle CPU implementation details could leak protected memory. These vulnerabilities affected billions of devices and showed that memory isolation is harder than it appears [Kocher et al., 2019; Lipp et al., 2018].

4. Optimization

Memory management must be fast. Every optimization matters: better cache utilization, reduced TLB misses, efficient page table structures, and intelligent page replacement algorithms.

The Challenge: How do we balance competing goals? Larger pages reduce translation overhead but increase internal fragmentation. More complex page tables enable features but consume memory themselves.

Performance Reality:

To understand why memory management is critical, consider the real cost of different memory operations. These numbers come from actual hardware measurements on a modern Intel Xeon processor:

Operation                        Latency          CPU Cycles (@4GHz)    Human Scale*
L1 Cache Hit                     ~0.5 ns          ~2 cycles             1 second
L2 Cache Hit                     ~3 ns            ~12 cycles            6 seconds
L3 Cache Hit                     ~12 ns           ~50 cycles            24 seconds
DRAM Access (TLB hit)            ~70 ns           ~280 cycles           2.3 minutes
DRAM Access (TLB miss)           ~150 ns          ~600 cycles           5 minutes
Page Fault (from SSD)            ~25,000 ns       ~100,000 cycles       1.2 days
Page Fault (from HDD)            ~10,000,000 ns   ~40,000,000 cycles    1.3 years

Table 1.1: Memory hierarchy latencies. Human scale shows the relative time if L1 cache = 1 second.

Understanding the Scale:
If an L1 cache access took 1 second, a page fault from a hard drive would take over a year. This isn't a small performance penalty--it's the difference between a responsive system and one that appears frozen.

Why the Memory Wall Matters:
The memory wall means that for many applications, performance is dominated not by CPU speed but by memory access patterns. Programs with good locality (accessing nearby memory addresses) stay in fast caches. Programs with poor locality constantly miss caches and TLBs, spending most of their time waiting for slow DRAM.

Sources: Latency measurements from Hennessy & Patterson (2024), Intel Xeon performance guides, and empirical testing.


1.3 The Evolution of Memory Management

1.3.1 Early Systems: No Memory Management (1940s-1950s)

The earliest computers like ENIAC and EDVAC had no memory management. Programs had direct access to all physical memory. Only one program could run at a time.

Programming Model:

; PDP-8 example (1965)
; Direct physical addressing - no translation
0200:  7200    CLA      ; Clear accumulator
0201:  1234    TAD 234  ; Add contents of octal address 234 to accumulator
0202:  3456    DCA 456  ; Deposit accumulator at octal address 456

Limitations:
- No multitasking
- No protection
- Manual memory layout by the programmer
- Physical memory size limited program size

1.3.2 Base and Bounds Registers (1960s)

Figure 1.2: Base-and-bounds translation: the CPU adds the base register to the logical address; exceeding the bounds register raises a protection fault.

The first hardware memory management used simple base and bounds registers. The base register contained the program's starting address, and all memory accesses were offset from this base. The bounds register limited the maximum offset.

if (Program Address >= Bounds Register):
    TRAP to operating system  # Protection fault
else:
    Physical Address = Base Register + Program Address
Figure 1.3: Concrete base-and-bounds calculation examples showing address mapping and bounds-violation detection.

Example: IBM 7094 (1962)

Base:   0x100000
Bounds: 0x010000  (64 KB limit)

Program accesses: 0x1234
Physical address: 0x100000 + 0x1234 = 0x101234 ✓ (within bounds)

Program accesses: 0x20000  
0x20000 > 0x010000 → PROTECTION FAULT

Limitations:
- Still required contiguous physical memory
- Relocation required copying the entire program in memory
- External fragmentation: gaps between programs
- No sharing of code between processes

1.3.3 Segmentation (1960s-1970s)

Figure 1.4: Segmentation architecture: a logical address encodes a segment selector and byte offset; the segment descriptor table maps each selector to a base address and limit.

Segmentation divided a program's address space into logical units (segments) like code, data, stack, and heap. Each segment had its own base and bounds.

Figure 1.5: Segmented address format: the high bits select the segment descriptor; the low bits specify the byte offset within that segment.

Segments Matched Logical Structure:

Figure 1.6: Common segment types (code, data, stack, heap) and their access-protection properties.
Segment 0: Code      (executable, read-only)
Segment 1: Data      (read-write)
Segment 2: Stack     (read-write, grows down)
Segment 3: Heap      (read-write, grows up)

Virtual Address Format (Segmented):

  [ Segment Selector (high bits) | Offset within Segment (low bits) ]

The selector indexes the segment table to fetch the segment's base and limit; the hardware checks the offset against the limit and adds it to the base.

Advantages:
- Matches the program's logical structure
- Easy sharing (share the code segment)
- Different protection for different segments
- No internal fragmentation

Disadvantages:
- External fragmentation: segments of varying sizes left gaps in physical memory
- Required compaction (expensive memory copying)
- Large segments still needed contiguous physical memory

Historical Example: Intel 8086 (1978) The 8086 used segmentation to address 1 MB of memory with 16-bit registers:

Physical Address = (Segment x 16) + Offset

1.3.4 Paging: The Modern Solution (1960s-Present)

Figure 1.7: Paging overview: the virtual address splits into a Virtual Page Number (VPN) and page offset; the page table maps each VPN to a Physical Frame Number (PFN).

The Atlas computer (University of Manchester, 1962) introduced paging--dividing both virtual and physical memory into fixed-size blocks called pages (virtual) and frames (physical) [Fotheringham, 1961; Kilburn et al., 1962].

Key Insight: With fixed-size pages, any page can map to any frame. No external fragmentation!

Typical Page Sizes:
- 4 KB: standard on x86/x64, ARM, RISC-V
- 16 KB: option on ARM
- 64 KB: option on ARM, used by some UNIX variants
- 2 MB / 1 GB: "huge pages" for reducing translation overhead

Page Table Entry (PTE) Structure: x86-64 64-bit Format

Each 64-bit PTE packs the physical frame number together with status bits:

Bit(s)   Name               Meaning
0        P (Present)        1 = page in RAM; 0 = access triggers a page fault
1        R/W                1 = writable; 0 = read-only
2        U/S                1 = user accessible; 0 = kernel only
3        PWT                Page write-through
4        PCD                Page cache disable
5        A (Accessed)       Set by the CPU on any access
6        D (Dirty)          Set by the CPU on a write
8        G (Global)         Don't flush from TLB on context switch
12-51    PFN                Physical frame number (shift left by 12 for the address)
63       NX (No-Execute)    1 = page cannot execute code

A present PTE (P=1) has a valid PFN and can be cached in the TLB, so access proceeds immediately. A non-present PTE (P=0) triggers a page fault: the OS either loads the page from disk (if swapped out) or delivers SIGSEGV (if the virtual page was never mapped).
Figure 1.8: Page Table Entry (PTE) structure: physical frame number plus status bits -- Present (P), Read/Write (R/W), User/Supervisor (U/S), Dirty (D), and Accessed (A).

Virtual Address Format (4 KB pages, 32-bit address space):

  Bits 31-12: Virtual Page Number (VPN), 20 bits (1,048,576 pages)
  Bits 11-0:  Page offset, 12 bits (4096 bytes per page)

Physical Address Format:

  Bits 31-12: Physical Frame Number (PFN)
  Bits 11-0:  Page offset (copied unchanged from the virtual address)

Page Table: The mapping from virtual page numbers to physical frame numbers is stored in a page table. Translation replaces the VPN with the corresponding PFN and leaves the offset untouched.

Advantages of Paging:
1. No external fragmentation (all blocks are the same size)
2. Easy allocation (any free frame works)
3. Efficient sharing (multiple page tables point to the same frame)
4. Demand paging (load pages only when accessed)
5. Virtual memory (address space larger than physical RAM)

Disadvantages:
1. Internal fragmentation (wasted space in partial pages)
2. Page table overhead (memory consumed by the page tables themselves)
3. Translation overhead (every memory access needs translation)


Understanding Fragmentation: Visual Examples

Memory fragmentation comes in two forms: external and internal. Understanding the difference is crucial for evaluating memory management schemes.

External Fragmentation (The Problem with Variable-Size Segments)

External fragmentation occurs when free memory is scattered in small, unusable chunks between allocated blocks.

The Problem:
Free memory becomes fragmented into non-contiguous holes. Even when sufficient total free memory exists, allocation requests may fail if they require contiguous memory larger than the biggest hole.

Real-World Example:
After a system runs for 24 hours with frequent allocations and deallocations, 20-40% of memory can become unusable due to fragmentation. A system might show 4 GB "free" but be unable to allocate even 100 MB contiguously.

Solutions:
- Compaction: move allocated blocks to consolidate free space (expensive: requires copying)
- Paging: use fixed-size blocks to eliminate external fragmentation entirely

Internal Fragmentation (The Problem with Fixed-Size Pages)

Internal fragmentation occurs when allocated blocks are larger than needed, wasting space within the allocated block.

The Trade-off:

Page Size Internal Fragmentation Page Table Size TLB Coverage
1 KB Low (max 1023 bytes waste) Large (many entries) Poor
4 KB Medium (max 4095 bytes waste) Medium Good
2 MB High (max 2 MB waste) Small (few entries) Excellent

Real-World Impact:
On a system with 16 GB RAM, 4 KB pages, and 100 processes: - Average internal fragmentation per process: ~500 KB - Total wasted memory: ~50 MB (0.3% of RAM) - This is predictable and acceptable

Compare this to external fragmentation which can waste 20-40% of RAM unpredictably.

Side-by-Side Comparison

Summary Table:

Type      Location        Cause                     Impact                        Predictable?                            Solution
External  Between blocks  Variable-size allocation  Can't use total free memory   No, gets worse over time                Compaction or paging
Internal  Within blocks   Fixed-size blocks         Wasted space in blocks        Yes, ≤ (page_size - 1) per allocation   Choose appropriate page size

Why Modern Systems Use Paging:

External fragmentation is a catastrophic problem: memory becomes increasingly unusable over time, and the only solution (compaction) is expensive and disruptive. A system might show gigabytes of "free" memory but fail to allocate even small contiguous blocks.

Internal fragmentation is a manageable problem: the waste is bounded, predictable, and typically only 0.3-1% of total RAM with 4 KB pages. The system always knows the maximum waste per allocation (page_size - 1 bytes).

This is why virtually all modern operating systems use paging for memory management--accepting small, predictable waste to completely eliminate large, unpredictable memory loss.


1.4 Hierarchical Levels of Memory Management

Modern systems implement memory management at multiple levels:

1.4.1 Hardware Level: The Memory Management Unit (MMU)

Figure 1.9: Virtual address space layout: kernel space at the top; user space holds stack (grows down), heap (grows up), BSS, data, and code segments.

The MMU is a specialized hardware component, integrated into the CPU, that performs address translation at hardware speed.

Core MMU Responsibilities:
1. Address Translation: convert virtual addresses to physical addresses
2. TLB Management: cache recent translations
3. Protection Checking: enforce access permissions
4. Exception Generation: trigger page faults when needed

MMU Pipeline Integration:

Instruction Fetch → Instruction TLB → Instruction Cache → Decode
                                ↓
                         Physical Address
                         
Data Access → Data TLB → Data Cache → Load/Store Unit
                   ↓
            Physical Address

The MMU operates in parallel with the CPU pipeline. Any TLB miss stalls the pipeline while the page table is walked.

Historical Note:
Early CPUs like the Motorola 68020 had optional external MMU chips (68851). Modern CPUs integrate the MMU on-die for performance.

1.4.2 Operating System Level: The Virtual Memory Manager

The OS kernel manages:
- Page table maintenance: creating and updating page tables for processes
- Page fault handling: loading pages from disk when absent
- Page replacement: choosing which pages to evict under memory pressure
- Swapping: moving entire processes to/from disk
- Memory allocation: assigning frames to processes

Key OS Data Structures:

// Linux kernel mm_struct (simplified)
struct mm_struct {
    pgd_t *pgd;                    // Page Global Directory (root page table)
    unsigned long start_code;       // Code segment start
    unsigned long end_code;         // Code segment end
    unsigned long start_data;       // Data segment start
    unsigned long end_data;         // Data segment end
    unsigned long start_brk;        // Heap start
    unsigned long brk;              // Heap end (current)
    unsigned long start_stack;      // Stack start
    unsigned long total_vm;         // Total virtual memory pages
    unsigned long locked_vm;        // Locked pages (mlocked)
    unsigned long pinned_vm;        // Pinned pages
    unsigned long data_vm;          // Data pages
    unsigned long exec_vm;          // Executable pages
    unsigned long stack_vm;         // Stack pages
};

Page Fault Handling Flow:

1. CPU accesses virtual address
2. TLB miss → Page table walk
3. Page not present → Page Fault Exception
4. CPU switches to kernel mode, saves context
5. OS page fault handler invoked:
   a. Check if address is valid (in process's VMA)
   b. If invalid → Segmentation Fault (kill process)
   c. If valid but not resident:
      - Allocate physical frame
      - If swapped: Read page from disk
      - If zero-page: Zero the frame
      - If file-backed: Read from file
      - Update page table entry
      - Mark page as present
6. Return from exception
7. CPU retries the faulting instruction
8. Translation succeeds → execution continues

1.4.3 Application Level: Memory Allocation Libraries

Applications don't usually manage memory at the page level. Instead, they use allocation libraries:

C/C++:

void* malloc(size_t size);
void free(void* ptr);
void* calloc(size_t nmemb, size_t size);
void* realloc(void* ptr, size_t size);

How malloc() Works (Simplified):

// glibc malloc implementation strategy
void* malloc(size_t size) {
    if (size < MMAP_THRESHOLD) {
        // Small allocation: use heap (brk/sbrk)
        return allocate_from_heap(size);
    } else {
        // Large allocation: use mmap()
        return mmap(NULL, size, PROT_READ|PROT_WRITE,
                    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    }
}

Behind the Scenes:
- Small allocations (<128 KB): use the heap via the brk() system call
- Large allocations: use mmap() to map anonymous memory
- Free lists are managed in bins and arenas
- Each block carries metadata overhead (headers, footers)

Memory Allocation Hierarchy:

Application calls malloc()
        ↓
C Library (glibc)
        ↓
System Call (brk/mmap)
        ↓
Kernel (page allocator)
        ↓
MMU (page table update)
        ↓
Physical Memory Assignment

1.5 The Virtual Memory Abstraction

Virtual memory is one of computer science's most powerful abstractions. It provides each process with the illusion of:
1. Vast address space (often larger than physical RAM)
2. Contiguous memory (even if physical memory is fragmented)
3. Private memory (isolated from other processes)
4. Uniform access (same addressing model for all memory)

1.5.1 Benefits of Virtual Memory

Benefit 1: Simplified Programming Model

Programmers can assume:
- Memory starts at address 0
- Memory is contiguous
- Pointers just work (within the valid address space)
- No need to manage physical memory locations

Benefit 2: Memory Overcommitment

Physical memory can be overcommitted--the sum of all virtual address spaces can exceed physical RAM.

Real Example:

Physical RAM: 16 GB
Process 1:    8 GB virtual
Process 2:    8 GB virtual  
Process 3:    8 GB virtual
Total:       24 GB virtual   → More than physical!

This works because:
- Not all virtual pages are actually in use
- Pages can be swapped to disk
- Many pages are shared (code, libraries)

Benefit 3: Security and Isolation

Each process has its own page table. Process A cannot access Process B's memory.

Process A's view:
0x00000000 - 0x7FFFFFFF: User space
0x80000000 - 0xFFFFFFFF: Kernel space

Process B's view:  
0x00000000 - 0x7FFFFFFF: User space (different physical frames!)
0x80000000 - 0xFFFFFFFF: Kernel space (same physical frames)

Benefit 4: Memory-Mapped Files

Files can be mapped into virtual memory. Reading/writing memory reads/writes the file.

// Map a file into memory
int fd = open("database.db", O_RDWR);
if (fd < 0) { perror("open"); exit(1); }
void* ptr = mmap(NULL, file_size, PROT_READ|PROT_WRITE,
                 MAP_SHARED, fd, 0);
if (ptr == MAP_FAILED) { perror("mmap"); exit(1); }
// Now ((char*)ptr)[x] reads/writes the file!

Benefit 5: Copy-on-Write (COW)

When fork() creates a child process, both parent and child initially share the same physical pages (marked read-only). Only when one writes to a page is it duplicated.

Parent calls fork()
        ↓
Child created with identical page tables
        ↓
Both point to same physical frames (read-only)
        ↓
Parent writes to page → Page Fault
        ↓
Kernel copies page, updates parent's page table
        ↓
Parent has private copy, child still has original

This makes fork() extremely fast, even for large processes.

1.5.2 The Cost of Virtual Memory

Virtual memory isn't free:

  1. Translation overhead: Every memory access requires address translation
  2. TLB misses: Cache misses in the TLB require expensive page table walks
  3. Page table memory: Page tables themselves consume memory
  4. Page faults: Loading pages from disk is extremely slow

Performance Impact Example:

// Sequential access (good locality)
for (i = 0; i < ARRAY_SIZE; i++) {
    sum += array[i];
}
// TLB hit rate: ~99%+

// Random access (poor locality)  
for (i = 0; i < ARRAY_SIZE; i++) {
    sum += array[rand() % ARRAY_SIZE];
}
// TLB hit rate: can drop to 50% or worse

1.6 Key Concepts and Terminology

Before diving deeper, let's establish precise definitions:

Address Spaces

Virtual Address: The address used by a program (logical address)
Physical Address: The actual address in RAM hardware
Address Space: The set of all valid addresses (e.g., 32-bit = 4 GB address space)

Pages and Frames

Virtual Address Format: 4 KB Paging (32-bit and 64-bit)

32-bit virtual address (4 GB address space, two-level paging):
  Page Directory Index: bits 31-22 (10 bits, 1024 PD entries)
  Page Table Index:     bits 21-12 (10 bits)
  Page Offset:          bits 11-0  (12 bits, 4096 bytes = 4 KB)

48-bit virtual address (x86-64, 256 TB address space, four-level paging):
  PML4 Index:  bits 47-39 (9 bits)
  PDPT Index:  bits 38-30 (9 bits)
  PD Index:    bits 29-21 (9 bits)
  PT Index:    bits 20-12 (9 bits)
  Page Offset: bits 11-0  (12 bits, 4 KB)

Key rule: the page offset is NEVER translated; it passes through unchanged (offset = VA mod 4096). Only the VPN (upper bits) changes during translation; the offset selects the byte within the page.

Page Size vs. Offset Bits:
  Page Size        Offset Bits   VPN Bits (32-bit)   VPN Bits (48-bit)
  4 KB (standard)  12            20                  36 (4 levels x 9)
  2 MB huge page   21            11                  27 (3 levels)
Figure 1.10: Virtual address bit layout for x86-64: bits 47-39 index PGD, 38-30 PUD, 29-21 PMD, 20-12 PTE, and 11-0 are the page offset.

Page: A fixed-size block of virtual memory (typically 4 KB)
Frame: A fixed-size block of physical memory (same size as page)
Page Number: The high-order bits of a virtual address
Offset: The low-order bits (position within the page)

Page Tables

Physical Address Format: Frame Number + Offset

52-bit physical address (x86-64, supports up to 4 PB physical RAM):
  Physical Frame Number (PFN): bits 51:12 (40 bits) -- which 4 KB frame in RAM
  Page Offset:                 bits 11:0  (12 bits) -- identical to the VA offset

The 12-bit offset is identical in the virtual and physical address; it is never changed. Only the upper bits change: the VPN (virtual) is replaced with the PFN (physical) via the page table lookup.

Worked Example: Full Address Translation
  Virtual address:    0x0000_7FFF_0010_0A40
  VPN (bits 47:12):   0x7FFF_0010_0 -> page table lookup -> PFN = 0x5A
  Offset (bits 11:0): 0xA40 (unchanged)
  Physical address:   0x5A * 4096 + 0xA40 = 0x5A000 + 0xA40 = 0x5AA40
  (PFN 0x5A = frame 90 in physical RAM; the frame starts at byte
  90 * 4096 = 368,640 = 0x5A000)
Figure 1.11: Physical address format: the Physical Frame Number (PFN) occupies the upper bits; the page offset (copied from the virtual address) occupies the lower 12 bits.

Page Table: Data structure mapping virtual pages to physical frames
Page Table Entry (PTE): One entry in the page table
Page Directory: Higher-level page table (in multi-level schemes)
Translation Lookaside Buffer (TLB): Hardware cache of recent translations

Page States

Resident: Page is in physical memory
Swapped: Page is on disk (swap space)
Dirty: Page has been modified
Present: Page is accessible (resident and valid)
Valid: Address is within process's address space

Memory Operations

Page Fault: Exception when accessing a non-present page
Swapping: Moving entire processes to/from disk
Paging: Moving individual pages to/from disk
Thrashing: System spends more time paging than executing


1.7 A Concrete Example: Address Translation

Step-by-Step Address Translation Example (32-bit, 4 KB pages)

Virtual address: 0x00403A7C
Binary: 0000 0000 0100 0000 0011 1010 0111 1100
Page size = 4 KB = 2^12 bytes; page table entry size = 4 bytes

Step 1 -- Split virtual address into VPN and offset:
  VPN:    bits 31-12 = 0x00403 (virtual page number)
  Offset: bits 11-0  = 0xA7C   (byte within the page)

Step 2 -- TLB lookup with VPN 0x00403:
  TLB HIT: PFN = 0x1B7 (the TLB also confirms R/W=1, U/S=1: user writable)
  Latency: ~1-4 cycles (~1 ns)
  On a TLB MISS, the hardware walks the page table instead
  (Page Directory entry, then Page Table entry on this two-level
  system), finds PFN = 0x1B7, and updates the TLB.

Step 3 -- Compute physical address:
  Physical address = (PFN << 12) | Offset
                   = (0x1B7 << 12) | 0xA7C = 0x001B7A7C
  The page offset (0xA7C) passes through unchanged.

Summary: VA 0x00403A7C -> [MMU: VPN=0x403, lookup, PFN=0x1B7] -> PA 0x001B7A7C. The hardware maps the address silently and invisibly; the program never knows.
Figure 1.12: Step-by-step address translation: virtual address decomposed into VPN and offset; TLB or page table supplies PFN; physical address assembled.

Let's trace a complete memory access from virtual address to physical address.

System Configuration:
- 32-bit address space
- 4 KB (2^12 byte) pages
- 4 GB (2^32 byte) physical RAM = 1M frames

Virtual Address: 0x12345678

Step 1: Split address into page number and offset

Virtual Page Number (VPN): bits 31-12 = 0x12345
Offset:                    bits 11-0  = 0x678

Step 2: Look up in TLB

TLB lookup for VPN 0x12345...
TLB HIT! → Physical Frame Number: 0xA2B4C

Step 3: Construct physical address

Physical Frame: 0xA2B4C
Offset:         0x678

Physical Address = (Frame Number << 12) | Offset
                 = (0xA2B4C << 12) | 0x678
                 = 0xA2B4C678

Step 4: Access physical memory

Read from physical address 0xA2B4C678

If TLB Miss had occurred:

TLB miss for VPN 0x12345
→ Hardware page table walk:
  1. Read Page Directory entry
  2. Read Page Table entry  
  3. Extract Physical Frame Number: 0xA2B4C
  4. Load into TLB
  5. Retry access (now TLB hit)

If Page Not Present:

Page table entry for VPN 0x12345 has Present bit = 0
→ Page Fault Exception
→ Kernel page fault handler:
  1. Check if address valid
  2. Allocate physical frame
  3. Load page from disk (if swapped)
  4. Update PTE: Frame = 0xA2B4C, Present = 1
  5. Return from exception
→ CPU retries instruction (now page is present)

1.8 The Translation Lookaside Buffer (TLB)

Complete Address Translation Flow: TLB Hit and Miss Paths

The CPU issues a virtual address; the VPN is looked up in the TLB:
- HIT (~95% typical): the PFN is found in the TLB. Latency: 1-4 cycles. Fast path.
- MISS (~5%): the hardware page table walker performs 4 DRAM reads (~70 ns each, ~280 ns total), resolves the PFN, and updates the TLB.

Either way, the physical address is formed as PA = (PFN << 12) | offset, the permission bits (R/W/X, U/S) are checked (failure raises a page fault), and the data is fetched from cache or DRAM.

Latency summary:
  TLB hit:  1-4 cycles (~1 ns), no memory access -- fastest possible
  TLB miss: 4 x 70 ns ~= 280 ns ~= 840 wasted CPU cycles per miss
  95% hit rate -> ~44 cycles average; 50% hit rate -> ~420 cycles
Figure 1.13: Address translation control flow: TLB hit returns physical address in 1-2 cycles (fast path); TLB miss triggers a hardware page-table walk costing 100-300 cycles (slow path).

While page tables elegantly solve the problem of address translation, they introduce a significant performance challenge: every memory access requires at least one additional memory read to consult the page table. For a simple memory load instruction, the CPU must:

  1. Read the page table entry from memory (~100 nanoseconds)
  2. Use the PFN to calculate the physical address
  3. Read the actual data from memory (~100 nanoseconds)

This doubles the effective memory access time! For a program making millions of memory accesses per second, this overhead would be catastrophic.

The Solution: Translation Lookaside Buffer

TLB Structure and Operation: Fast Path vs. Miss Path

TLB internal structure (fully-associative, 64-entry example):

  ASID   VPN (Tag)   PFN     Flags   V
  0x01   0x7FFF5     0x1B7   RW-     1
  0x01   0x40000     0x2A1   R-X     1
  0x02   0x7FFF5     0x5C2   RW-     1
  0x01   0x90000     --      ---     0  (invalid)
  ...    (60 more entries)

ASID = Address Space Identifier: lets multiple processes share the TLB without flushing on every context switch.

TLB HIT path (~1-4 cycles): CPU presents the VPN, a tag match is found, the cached PFN forms the physical address, and memory is accessed. No page table access needed.

TLB MISS path (~280+ ns): no tag match, so the hardware walks the page table (4 DRAM reads x ~70 ns) and installs the new mapping in the TLB -- roughly 840 wasted CPU cycles.

Typical TLB specifications: L1 ITLB 64-128 entries (fully associative, 1-cycle); L1 DTLB ~64 entries (4-way set-associative, 1-cycle); L2 STLB 512-2048 entries (4-12 cycle access). Intel Skylake: 64 L1 + 1536 L2 entries; AMD Zen 4: 64 L1 + 2048 L2 entries.
Figure 1.14: TLB operation: hardware compares the VPN tag against all entries in parallel; a hit immediately returns the cached PFN without touching DRAM.

The Translation Lookaside Buffer (TLB) is a specialized hardware cache that stores recently used virtual-to-physical address translations. It sits between the CPU and the page table, intercepting every address translation request.

TLB Performance Impact

The performance difference between a TLB hit and miss is dramatic:

Scenario    Cycles    Time (3 GHz CPU)   Relative Speed
TLB Hit     1-2       ~0.5 ns            1x (baseline)
TLB Miss    100-300   ~50-100 ns         100-200x slower

Modern processors typically achieve 95-99% TLB hit rates during normal operation, meaning most memory accesses avoid the slow page table walk entirely.

A note on TLB hit latency and cache organisation. The "1–2 cycles" figure assumes the TLB result is needed before the L1 data cache can be accessed — the classic PIPT (Physically-Indexed, Physically-Tagged) configuration. Many modern processors instead use VIPT (Virtually-Indexed, Physically-Tagged) L1 caches: the cache index is computed from virtual address bits simultaneously with the TLB lookup. When the VIPT index bits lie entirely within the page offset, there is no aliasing and the effective TLB penalty on an L1 cache hit is zero cycles — the translation arrives just in time for tag comparison. L2 and L3 caches are generally PIPT and pay the full TLB latency. Chapter 7 (§7.8.4) examines VIPT cache aliasing hazards in detail. Mittal (2016) provides a systematic comparison of all four cache-addressing modes (PI-PT, VI-PT, VI-VT, PI-VT).

TLB Structure

TLB Entry Structure: Fields and Their Roles

A single TLB entry encodes everything needed for one virtual-to-physical translation plus access control:

  ASID/PCID (8-12 bits) | VPN tag (36 bits on x86-64, 4-level) | Valid (1 bit) |
  PFN (40 bits, 4 PB support) | R/W (1) | U/S (1) | NX (1) | PCD/PWT (2)

Field descriptions and hardware behavior:
- ASID/PCID: Address Space ID / Process Context ID. Tags each entry to a specific process, eliminating full TLB flushes on context switches. x86-64 PCID: 12 bits (4096 distinct contexts).
- VPN (tag): the virtual page number being translated. Used as the lookup key; hardware compares the incoming VPN against all tags simultaneously (fully associative) in a single cycle.
- Valid: 1 = entry is live; 0 = invalidated (by INVLPG or a TLB flush). Invalid entries are ignored during lookup even if the tag matches.
- PFN (data): the translation result. PA = PFN x 4096 + offset. On a hit, the PFN is immediately used to form the physical address.
- Protection flags: R/W, U/S, NX, copied from the PTE. Hardware checks them on every access; a write to an R/W=0 page raises a page fault even on a TLB hit. NX prevents code execution.
- Global (G): when G=1, the entry is not flushed on CR3 writes (used for kernel pages shared across all processes).
Figure 1.15: TLB entry structure: VPN tag, Physical Frame Number, Address Space ID (ASID) for process isolation, and permission/dirty/valid bits.

A typical TLB is organized as an associative cache (content-addressable memory):

Entry Format:

Each entry stores:
- Virtual Page Number (VPN): the tag to match against
- Physical Frame Number (PFN): the translation result
- Valid bit: whether the entry contains valid data
- Flags: permissions (read/write/execute) from the PTE

Size Example: A typical L1 TLB might have:
- 64 entries for 4 KB pages
- Coverage: 64 x 4 KB = 256 KB of address space
- Only 256 KB of "working set" can be accessed without misses!

This limited coverage is called TLB reach and is a critical factor in system performance. Programs with larger working sets will experience more TLB misses.

TLB Reach: Formal Definition and AI Implications

TLB reach is defined precisely as:

TLB Reach = Number of TLB Entries × Page Size

This formula immediately shows why page size selection is a first-order architectural decision. The following table shows TLB reach for typical hardware configurations (Mittal, 2016; Qasem & Magee, 2012):

TLB Level   Entries   4 KB pages   2 MB pages   1 GB pages
L1 DTLB     64        256 KB       128 MB       64 GB
L2 STLB     1,536     6 MB         3 GB         1.5 TB
L2 STLB     2,048     8 MB         4 GB         2 TB

The AI/ML reach crisis. Modern large language model training exposes just how narrow 4 KB TLB reach is. GPT-3's working set consists of approximately 350 GB of parameters, plus 700 GB each for momentum and variance under AdamW optimization, plus 50–100 GB of activations — totalling roughly 1,800 GB. An L1 TLB with 64 entries covering 256 KB addresses 0.000014% of that working set; TLB miss rate approaches 100%. Even the 1,536-entry L2 STLB covers only 6 MB. Wu & Taylor (2021) confirmed this directly for deep learning benchmarks: mapping the kernel address space with 8 MB huge pages yielded "a single TLB entry where 4 KB pages would need 2,048 entries," producing 55–61% performance improvement. Switching to 2 MB pages shifts L1 reach from 256 KB to 128 MB (a 512× gain), meaning the L1 TLB alone can cover a meaningful fraction of typical training batches. This is why huge pages are increasingly favoured in AI infrastructure — a topic examined in depth in Chapters 9 and 14.

When Does the TLB Get Updated?

The TLB is automatically managed by the hardware MMU:

The TLB is updated on:
- Every TLB miss (a new entry is added)
- Page table updates by the OS (the corresponding entry is invalidated)

The TLB is flushed (cleared) on:
- Context switches between processes (different address spaces)
- Explicit flush instructions (e.g., when the kernel updates page tables)

Some CPUs support ASIDs (Address Space Identifiers) to avoid flushing on context switches.

Why the TLB Matters

The TLB is arguably the most important performance component in virtual memory systems:

  1. Makes Virtual Memory Practical: Without TLBs, virtual memory would be too slow for production use
  2. Working Set Matters: Programs that access memory randomly suffer more TLB misses
  3. Large Pages Help: Using 2 MB pages instead of 4 KB pages increases TLB reach by 512x

Real-World Example:

Consider a database scanning a 1 GB table with random access:
- With 4 KB pages: 262,144 pages; typical TLB: 64 entries
- TLB covers: 256 KB (0.025% of the data)
- Result: ~99.975% TLB miss rate = catastrophic performance

This is why databases and high-performance applications use huge pages (covered in Chapter 2.7).

Key Takeaways

  1. TLB is essential: Virtual memory would be impractically slow without TLB caching
  2. Hit rate is critical: 95%+ hit rates are necessary for good performance
  3. TLB reach is limited: Only 256 KB - 128 MB of working set can be efficiently accessed
  4. Hardware-managed: The MMU automatically updates and manages the TLB
  5. Performance tool: Understanding TLB behavior is crucial for optimization (covered in Part IX)

The TLB transforms virtual memory from a theoretical elegance into a practical, high-performance system. Modern processors invest significant chip area in TLB structures precisely because this cache is so performance-critical.

In the next section, we'll see how multi-level page tables address another challenge: efficiently managing large, sparse address spaces.

1.9 Multi-Level Page Tables

Page Table Mapping: Virtual Pages to Physical Frames

  VPN   Page Table Entry                Physical Frame
  0     PFN=3,  P=1, R/W, U/S           Frame 3  (0x3000)
  1     PFN=7,  P=1, R/W, U/S           Frame 7  (0x7000)
  2     P=0 (not mapped)                -- (page fault if accessed)
  3     PFN=1,  P=1, R/W, U/S           Frame 1  (0x1000)
  4     PFN=9,  P=1, R/W, U/S           Frame 9  (0x9000)
  5     PFN=2,  P=1, R/W, U/S           Frame 2  (0x2000)
  6     P=0 (not mapped)                -- (no frame allocated)
  7     PFN=11, P=1, R=1, W=0, NX       Frame 11 (0xB000)

Non-contiguous mapping in action: VPN 0 -> Frame 3, VPN 1 -> Frame 7, VPN 3 -> Frame 1. Virtual pages appear contiguous to the program but physically occupy non-adjacent frames scattered across RAM. This is the power of paging. VPN 7 is mapped R=1, W=0, NX=1: read-only and non-executable (e.g., a read-only data segment).
Figure 1.16: Page table mapping: non-contiguous virtual pages map to arbitrary physical frames, providing address-space isolation between processes.

The page table design we've discussed so far has a critical flaw: it assumes a single-level page table that must contain an entry for every possible virtual page. For modern 64-bit systems with vast address spaces, this becomes completely impractical.

The Problem: Single-Level Page Tables Don't Scale

Consider a 64-bit system with 4 KB pages:

Address Space Calculation:
- Virtual address space: 2^48 bytes = 256 TB (typical implementation)
- Page size: 4 KB = 2^12 bytes
- Number of pages: 256 TB / 4 KB = 68,719,476,736 pages

Page Table Size:
- Entries needed: 68.7 billion
- Entry size: 8 bytes (64-bit PTE)
- Total page table size: 512 GB per process!

This is absurd. The page table alone would consume more memory than most systems have, and we haven't even allocated memory for the actual program yet!

The Observation: Most Address Spaces Are Sparse

The key insight is that programs don't use most of their address space. A typical program might use:
- A few MB of code and shared libraries
- A few MB to a few GB of heap
- A few MB of stack
- Scattered memory-mapped regions

The remaining 99.9999% of the address space is unmapped and unused. Allocating a complete page table for this sparse usage is incredibly wasteful.

The Solution: Hierarchical Page Tables

Multi-Level Page Tables: x86-64 4-Level Hierarchy (CR3 Walk)

48-bit virtual address (canonical form): sign extension (bits 63-48) | PML4 index (bits 47-39) | PDPT index (38-30) | PD index (29-21) | PT index (20-12) | page offset (11-0).

CR3 holds the PML4 base address. Each table has 512 entries; only entries for mapped regions are non-null, so unused subtrees cost no memory at all.

TLB miss penalty: 4 separate DRAM accesses (PML4 entry, PDPT entry, PD entry, PT entry at ~70 ns each = ~280 ns ~= 840 wasted cycles) before the actual data access (~70 ns more).

Memory savings of the hierarchy: a flat table for 256 TB would need 512 GB per process; the 4-level structure for a typical process needs roughly 1 MB -- about a 512,000x reduction in page table memory.
Figure 1.17: x86-64 four-level page table hierarchy (PGD to PUD to PMD to PTE): each level indexed by 9 bits of the 48-bit virtual address, enabling sparse allocation.

Multi-level page tables break the single large page table into a tree structure, allocating page table memory only for regions of the address space that are actually in use.

Concept: Instead of one huge table, create a hierarchy: - Level 1: Page Directory (small, always present) - Level 2: Page Tables (only created for used regions)

Example: Two-Level Page Tables

Let's walk through a concrete example with a 32-bit address space and 4 KB pages.

Address Format (32-bit, two-level, 4 KB pages):

Directory Index: bits 31-22 (10 bits)
Table Index:     bits 21-12 (10 bits)
Offset:          bits 11-0  (12 bits)

Translation Process:

1. Virtual Address: 0x00403A7C

2. Split address into components:
   Dir Index:   bits 31-22 = 0x001 = 1
   Table Index: bits 21-12 = 0x003 = 3
   Offset:      bits 11-0  = 0xA7C = 2684

3. Walk the page table hierarchy:
   
   Step 1: Use Dir Index to find Page Table
   +-----------------------+
   |   Page Directory      |  ← Always in memory
   |  Entry 0: NULL        |
   |  Entry 1: PTE* 0x5000 |  ← Points to Page Table
   |  Entry 2: NULL        |
   |  ...                  |
   +-----------------------+
   
   Step 2: Use Table Index to find PFN
   +-----------------------+
   |  Page Table @ 0x5000  |  ← Only allocated if needed
   |  Entry 0: PFN 0x100   |
   |  Entry 1: PFN 0x101   |
   |  Entry 2: PFN 0x102   |
   |  Entry 3: PFN 0x1B7   |  ← Target entry
   |  ...                  |
   +-----------------------+
   
   Step 3: Combine PFN with Offset
   Physical Address = (0x1B7 << 12) | 0xA7C = 0x001B7A7C

4. Access physical memory at 0x001B7A7C

Memory Savings

Single-Level Page Table (32-bit system):
- Entries needed: 4 GB / 4 KB = 1,048,576
- Entry size: 4 bytes
- Total size: 4 MB per process (always)

Two-Level Page Table:
- Page Directory: 1,024 entries x 4 bytes = 4 KB (always present)
- Page Tables: only allocated for used regions

Example Program Using 268 MB:
- Used pages: 268 MB / 4 KB = 68,608
- Page tables needed: 68,608 / 1,024 = 67
- Size: 67 x 4 KB = 268 KB
- Plus directory: 268 KB + 4 KB = 272 KB total

Savings: 4 MB → 272 KB (93% reduction!)

For a 64-bit system with sparse usage, the savings are even more dramatic.

How Many Levels?

Modern architectures use different numbers of levels:

Architecture        Levels   Address Bits   Coverage
x86 (32-bit)        2        32             4 GB
x86-64 (4-level)    4        48             256 TB
x86-64 (5-level)    5        57             128 PB
ARM64 (typical)     4        48             256 TB
RISC-V Sv39         3        39             512 GB
RISC-V Sv48         4        48             256 TB

More levels = larger address space but slower page table walks.

Modern 64-bit systems typically use 4-level page tables as a sweet spot between address space size and lookup performance.

Trade-offs

Advantages:
- Huge memory savings for sparse address spaces
- Pay only for what you use (demand allocation)
- Enables large address spaces (48-57 bits practical)
- Easier granular sharing between processes

Disadvantages:
- More complex hardware (multiple memory accesses)
- Slower on TLB miss (must walk multiple levels)
- More page faults possible (intermediate tables may not be present)

TLB Becomes Even More Critical

With multi-level page tables, a TLB miss becomes expensive:

Cost of Address Translation:
- TLB hit: 1 cycle (~0.3 ns)
- TLB miss with 2-level PT: 2 memory accesses (~200 ns) = ~667x slower
- TLB miss with 4-level PT: 4 memory accesses (~400 ns) = ~1,333x slower

This makes the TLB's caching function absolutely critical. A program with a 99% TLB hit rate will be dramatically faster than one with a 95% hit rate, especially on systems with deep page table hierarchies.

Key Takeaways

  1. Single-level page tables don't scale: Would require gigabytes of memory per process
  2. Address spaces are sparse: Programs use tiny fractions of available address space
  3. Multi-level hierarchies: Allocate page table memory only for used regions
  4. Memory savings are dramatic: Often 90-99% reduction in page table overhead
  5. TLB becomes essential: Multiple memory accesses on TLB miss make caching critical
  6. More levels = slower misses: Trade-off between address space size and lookup cost

Multi-level page tables are a perfect example of indirection solving a scaling problem. By adding a level of indirection (or several), we transform an impractical memory requirement into a manageable one, enabling the vast 64-bit address spaces that modern programs enjoy.

The specific implementations vary by architecture--x86-64's 4-level structure differs from ARM's implementation, and RISC-V offers multiple options. We'll explore these architecture-specific details in Part III: MMU in Modern Architectures.


This completes our introduction to the fundamental mechanisms of address translation. We've covered: - The basic concept of pages and page tables - How address translation works (virtual → physical) - The critical role of the TLB in making translation fast - How multi-level page tables make large address spaces practical

In the next section, we'll look at real-world examples of how these mechanisms impact system performance and security.

1.10 Why MMU Matters: Real-World Impact

Memory Latency Comparison: Access Times Across the Hierarchy

  Registers   ~0.3 ns   (~1 cycle)
  L1 cache    ~1 ns     (4 cycles)
  L2 cache    ~4 ns     (12 cycles)
  L3 cache    ~14 ns    (40 cycles)
  DRAM        ~70 ns    (~200 cycles)
  SSD/NVMe    ~25 us    (~75,000 cycles)
  HDD         ~10 ms    (~30M cycles)

A TLB miss costs 4 DRAM reads = ~280 ns. Relative to L1 cache, DRAM is ~70x slower, SSD ~25,000x, and HDD ~10,000,000x.
Figure 1.18: Memory access latency comparison: L1 cache (~1 ns), L2 (~4 ns), L3 (~10 ns), DRAM (~100 ns), NVMe SSD (~100 us), spinning disk (~10 ms).

Case Study 1: The Cost of TLB Misses

Context Switch: TLB Flush vs. ASID/PCID Tagging

Approach 1 -- full TLB flush (naive): writing CR3 on a context switch invalidates every TLB entry. The incoming process starts with a cold TLB (~0% hit rate), and warming it up takes hundreds of page table walks. At ~280 ns per miss and 100-500 misses per switch, the overhead is roughly 28-140 us per context switch.

Approach 2 -- ASID/PCID tagging: each TLB entry carries an address-space tag, so entries from multiple processes coexist in the TLB and no flush is needed; the new process misses only on pages it has not yet touched. x86-64 supports 4096 PCIDs; the result is near-zero switch cost and a typical 5-30% performance improvement.
Figure 1.19: Context switch and TLB management: ASID-tagged TLBs allow entries from multiple address spaces to coexist, eliminating full TLB flushes on every context switch.

Scenario: Database performing random lookups in a 100 GB dataset on a system with 32 GB RAM.

Analysis:
- Page size: 4 KB
- Total pages: 25 million (100 GB / 4 KB)
- Physical frames: 8 million (32 GB / 4 KB)
- TLB size: 1,536 entries (Intel Xeon)
- TLB coverage: 6 MB (1,536 x 4 KB)

Result:
With random access patterns, TLB hit rate drops to ~5%. Each TLB miss requires a page table walk (4 memory accesses for 4-level paging). Performance can degrade by 60-80% compared to sequential access.

Solution:
Use 2 MB huge pages:
- TLB coverage: 3 GB (1,536 x 2 MB)
- TLB hit rate improves to ~60%
- Performance increase: 2-3x for this workload

Case Study 2: Spectre and Meltdown

In 2018, researchers discovered that speculative execution in modern CPUs could be exploited to leak protected memory [Kocher et al., 2019; Lipp et al., 2018].

The Vulnerability: CPUs speculatively execute code before checking permissions. Even though mis-speculated instructions are discarded, they leave traces in the cache and TLB that can be measured.

Attack Example (Simplified):

// Attacker code (user space)
if (x < array1_size) {  // Will be true eventually
    y = array2[array1[x] * 4096];  // Speculatively executed!
}

During speculative execution:
1. CPU loads array1[x] (x is out of bounds, pointing into kernel memory)
2. CPU loads array2[array1[x] * 4096]
3. The page is brought into the TLB/cache
4. Speculation was wrong -> results are discarded
5. BUT the TLB/cache still contains traces of kernel memory!
6. Attacker measures TLB/cache timing to infer kernel data

Impact: - Affected billions of CPUs (Intel, AMD, ARM) - Required OS patches (KPTI: Kernel Page Table Isolation) - Performance overhead: 5-30% depending on workload

Lesson:
Memory management isn't just about performance--it's critical to security. The MMU is part of the security perimeter.

Case Study 3: Rowhammer

Rowhammer demonstrated that repeated accesses to DRAM rows can cause bit flips in adjacent rows--a hardware reliability issue that becomes a security vulnerability [Kim et al., 2014].

The Attack:

// Repeatedly activate two rows in the same DRAM bank ("hammering")
while (1) {
    *(volatile char *)addr1;   // volatile read forces an actual DRAM access
    *(volatile char *)addr2;
    clflush(addr1);            // evict from cache so the next read goes to DRAM
    clflush(addr2);
}
// Repeated row activations cause bit flips in adjacent physical rows

MMU Connection: Attackers used the MMU's address translation to:
  1. Determine which virtual addresses map to adjacent physical rows
  2. Bypass MMU protection to corrupt page tables
  3. Gain kernel privileges

Impact:
  - Demonstrated in 2015 (Google Project Zero)
  - Exploitable from JavaScript (no native code needed)
  - Led to TRR (Target Row Refresh) in DRAM


1.11 Chapter Summary

Key Takeaways:

  1. Memory management bridges the memory wall between fast CPUs and slow DRAM through caching and virtual memory.

  2. Paging solves fragmentation by using fixed-size pages, enabling efficient allocation and virtual memory.

  3. The MMU performs address translation billions of times per second, making virtual memory practical.

  4. Virtual memory provides powerful abstractions: large address spaces, isolation, memory overcommitment, and simplified programming.

  5. Performance matters: TLB misses and page faults can dominate execution time in memory-intensive workloads.

  6. Security is fundamental: The MMU enforces isolation, but vulnerabilities like Spectre and Rowhammer show that hardware memory protection is subtle and complex.

Thought-Provoking Questions:

  1. If CPU speeds continue increasing while memory latency stagnates, what percentage of CPU time will be spent waiting for memory?

  2. How large should page sizes be? Larger pages reduce translation overhead but increase internal fragmentation. What's the optimal tradeoff?

  3. Can we eliminate page tables entirely? What alternative translation mechanisms might be possible?

  4. Should applications be aware of physical memory layout (NUMA, memory tiers)? Or should the OS hide this complexity?

  5. With persistent memory (NVDIMMs, Optane), do we need to rethink the entire virtual memory model?


[Figure 1.20 diagram: ① CPU accesses virtual address 0x5000_1234; ② MMU TLB miss triggers a page table walk that finds P=0 in the PTE; ③ #PF exception raised, hardware saves the PC and faulting VA (CR2) and transfers to the OS; ④ OS checks its VMAs -- an invalid VA means SIGSEGV and the program is killed; ⑤ for a valid VA the OS locates the page in the backing store: zero-fill for a new allocation (~1 µs), read from an mmap'd file (~100 µs on SSD), swap-in from disk (~100 µs-10 ms), or copy-on-write (~1 µs + copy time); ⑥ PTE updated with P=1, TLB filled, and the instruction re-executed transparently. Cost summary: zero/COW fill ~1-10 µs; SSD swap-in ~100 µs = 400,000 cycles; HDD swap-in ~10 ms = 40,000,000 cycles.]
Figure 1.20: Page fault handling flow: CPU raises fault on unmapped access; OS handler maps page into a free frame; PTE updated; faulting instruction retried.

This chapter has introduced the fundamental concepts of memory management:
  - The memory wall and why management is necessary
  - Evolution from simple base/bounds to modern paging
  - The role of hardware (MMU), OS, and applications
  - Virtual memory abstraction and its benefits
  - Real-world security and performance implications

In the chapters ahead:

Chapter 2: Virtual memory concepts in depth--demand paging, page replacement algorithms, working sets, thrashing

Chapter 3: Hardware implementation of MMUs--page table formats, multi-level paging, TLB design

Part II: Historical evolution--from Atlas to modern architectures

Part III: Deep dives into x86, ARM, RISC-V, and other architectures

Part IV: Advanced techniques--huge pages, NUMA, memory compression

Part V: Security--memory protection, confidential computing, isolation

Part VI: Vulnerabilities--Spectre, Meltdown, Rowhammer, and mitigations

Part VII: Debugging and testing MMU implementations

Part VIII: Future directions--CXL, disaggregated memory, AI workloads

Part IX: Performance optimization for real-world systems


References

  1. Hennessy, J. L., & Patterson, D. A. (2024). Computer Architecture: A Quantitative Approach (7th ed.). Morgan Kaufmann. Chapter 2: Memory Hierarchy Design.

  2. Denning, P. J. (1970). "Virtual memory". Computing Surveys, 2(3), 153-189.

  3. Kilburn, T., Edwards, D. B. G., Lanigan, M. J., & Sumner, F. H. (1962). "One-level storage system". IRE Transactions on Electronic Computers, EC-11(2), 223-235.

  4. Jacob, B., Ng, S. W., & Wang, D. T. (2010). Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann.

  5. Kocher, P., et al. (2019). "Spectre attacks: Exploiting speculative execution". Communications of the ACM, 63(7), 93-101.

  6. Lipp, M., et al. (2018). "Meltdown: Reading kernel memory from user space". 27th USENIX Security Symposium.

  7. Kim, Y., et al. (2014). "Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors". ISCA 2014.

  8. Arpaci-Dusseau, R. H., & Arpaci-Dusseau, A. C. (2018). Operating Systems: Three Easy Pieces. Chapter 18-23 (Virtual Memory). Available at: https://pages.cs.wisc.edu/~remzi/OSTEP/

§1.8 -- TLB Reach and AI Workloads

  1. Mittal, S. "A survey of techniques for architecting TLBs." Concurrency and Computation: Practice and Experience (CPE), 29(10), 2016. DOI: 10.1002/cpe.4061

  2. Qasem, A. and Magee, J. "Improving TLB performance through demand-driven superpaging." Software: Practice and Experience (SPE), 43(6):705–729, 2012. DOI: 10.1002/spe.2128

  3. Wu, X. and Taylor, V. "Ensemble learning for CANDLE deep learning benchmarks." Concurrency and Computation: Practice and Experience (CPE), 35(15), 2021. DOI: 10.1002/cpe.6516