Chapter 1: Memory Management Basics

Part I: Introduction and Fundamentals

"The memory hierarchy is a fundamental structure in computer architecture, and virtual memory is one of the greatest ideas in computer science."
-- John L. Hennessy & David A. Patterson, Computer Architecture: A Quantitative Approach

1.1 Introduction: Why Virtual Memory Exists

The Programming Model We Take for Granted

When you write a program in C, Python, or any modern language, you make several assumptions about memory:

  1. Your program can use memory addresses starting from 0
  2. Memory appears as one large, contiguous space
  3. Other programs won't interfere with your memory
  4. You can use more memory than physically available (within limits)
  5. The program works the same regardless of how much RAM the computer has

These assumptions seem natural, but they're not inherent to computer hardware. They're provided by virtual memory--one of computer science's most important abstractions [Bhattacharjee & Lustig, 2017].

What Virtual Memory Provides

Virtual memory (VM) is the hardware and software mechanism that gives each program its own private view of the computer's memory. According to Bhattacharjee & Lustig (2017), virtual memory provides several critical benefits:

1. Simplified Programming (Abstraction)
   - Programs use simple addresses starting from 0
   - No need to know physical memory size or layout
   - Memory appears as one large array
   - Code is portable across different hardware

2. Memory Protection and Isolation
   - Each program has its own address space
   - Programs cannot access each other's memory
   - Operating system memory is protected from user programs
   - Security: malicious code can't read passwords from other programs

3. Memory Overcommitment
   - Total virtual memory across all programs can exceed physical RAM
   - Programs can use address spaces larger than physical memory
   - Inactive data can be stored on disk
   - Efficient memory sharing between programs

4. Flexibility in Memory Allocation
   - Programs can be loaded anywhere in physical memory
   - Memory doesn't need to be contiguous in physical RAM
   - Easy to relocate programs
   - Simplifies dynamic memory allocation

The Core Mechanism: Address Translation

Virtual memory works through address translation. Let's understand this with a concrete example:

#include <stdio.h>

int main() {
    int x = 42;
    printf("Address of x: %p\n", (void *)&x);
    // Might print: 0x7ffe5c3d4a8c
    return 0;
}

When you run this program, it prints an address like 0x7ffe5c3d4a8c. This is a virtual address--a number your program uses to refer to a memory location. But this is not where x actually lives in the physical RAM chips.

Two Types of Addresses:

Virtual Address (VA)
- What your program sees and uses
- Example: 0x7ffe5c3d4a8c
- Every program has its own virtual addresses, starting from 0
- Abstract, not tied to physical hardware

Physical Address (PA)
- Where data actually resides in RAM chips
- Example: 0x12A4B000 (completely different!)
- Only one set of physical addresses in the entire system
- Corresponds to actual hardware locations

The Translation:

When your program accesses virtual address 0x7ffe5c3d4a8c, hardware automatically translates it to the corresponding physical address (say, 0x12A4B000) where your data actually lives.

Program's view:         Virtual Address 0x7ffe5c3d4a8c
                                  ↓
                        [MMU Translation]
                                  ↓
Reality in hardware:    Physical Address 0x12A4B000

Why Two Address Spaces Solve Everything

This simple translation mechanism simultaneously solves all the problems mentioned above:

Isolation: Program A's virtual address 0x1000 maps to physical address 0x3A2F1000. Program B's virtual address 0x1000 maps to physical address 0x91B52000. Same virtual address, different physical locations--they can never conflict.

Simplification: Every program uses addresses starting from 0. The program doesn't need to know where it's actually loaded in physical memory.

Overcommitment: A program can have virtual addresses from 0 to 4 billion (on 32-bit systems), even if physical RAM is only 2 GB. Not all virtual addresses need to be in physical memory simultaneously--some can be on disk.

Protection: The translation mechanism includes permission checks. The hardware verifies that your program is allowed to access each physical address it translates to.

The Memory Management Unit (MMU)

Memory Hierarchy: Size, Speed, and Latency Trade-offs

Memory Level                        Typical Size      Access Latency
CPU Registers                       16 – 256 bytes    0 cycles (instant)
L1 Cache (per-core, split I$+D$)    32 KB – 64 KB     4 cycles (~1 ns)
L2 Cache (per-core or shared)       256 KB – 1 MB     12 cycles (~4 ns)
L3 Cache (LLC, shared)              8 MB – 64 MB      40 cycles (~14 ns)
Main Memory (DDR4/DDR5/LPDDR5)      8 GB – 512 GB     200 cycles (~70 ns)
SSD / NVMe (swap, backing store)    256 GB – 8 TB     25 μs (~100,000 cycles)
Hard Disk Drive (HDD)               1 TB – 20 TB      10 ms (~40M cycles)

Each level up is faster, smaller, and more expensive per byte. The memory wall: at 4 GHz a CPU cycle takes ~0.25 ns while a DRAM access takes ~70 ns, so every cache miss costs roughly 280 CPU cycles.
Figure 1.1: Memory hierarchy: CPU registers to persistent storage with typical latencies and capacities at each level.

The Memory Management Unit (MMU) is the hardware component inside your CPU that performs this translation. On every memory access--billions of times per second--the MMU:

  1. Receives a virtual address from the program
  2. Translates it to the corresponding physical address
  3. Checks if the access is allowed (read/write/execute permissions)
  4. Either completes the access or triggers an exception

This happens automatically and transparently. Your program never knows it's using virtual addresses.

The Three Fundamental Challenges

While virtual memory elegantly solves programming and protection problems, it introduces new challenges that the MMU must address:

Challenge 1: Translation Performance
- Modern CPUs execute ~4 billion instructions per second
- Many instructions access memory multiple times
- That's billions of translations per second
- Each translation must be fast (sub-nanosecond) or the CPU stalls

Challenge 2: Translation Overhead
- Translation requires looking up tables in memory
- But those table lookups themselves require memory accesses
- This could create infinite recursion
- The translation mechanism itself consumes memory

Challenge 3: The Memory Wall
- Memory access takes ~70 nanoseconds
- During that time, the CPU could execute ~280 instructions
- Most CPU time is spent waiting for memory
- Virtual memory must not make this worse

How These Are Solved:
- Translation Lookaside Buffer (TLB): a cache that remembers recent translations, making most translations instant
- Hardware page table walker: dedicated logic to quickly traverse translation tables
- Multi-level page tables: hierarchical structure that reduces memory overhead
- Large pages: reducing the number of translations needed

What This Chapter Covers

Understanding virtual memory requires understanding the interplay between hardware and software:

Hardware (MMU):
- How address translation works (page tables)
- How to make translation fast (TLBs, page table walkers)
- Different architectures' approaches (x86, ARM, RISC-V)

Operating System:
- Managing page tables for all processes
- Handling page faults (when data isn't in RAM)
- Deciding what to keep in physical memory
- Implementing memory protection policies

Together:
- How hardware and OS cooperate
- Performance implications
- Security considerations
- Modern challenges and solutions

By the end of this chapter, you'll understand how modern computers provide every program with the illusion of having abundant, fast, private memory--even though physical memory is limited, slow, and shared.


1.2 What is Memory Management?

Given the fundamental challenges outlined above--the memory wall, capacity limitations, and security requirements--memory management is the collection of hardware and software mechanisms that bridge the gap between what programmers need (fast, large, secure memory) and what hardware provides (slow, limited, shared physical RAM).

At its core, memory management must simultaneously solve several competing challenges:

1.2.1 The Four Pillars of Memory Management

1. Allocation and Deallocation

Every program requires memory to store instructions and data. When a process starts, memory must be allocated. When it terminates, that memory must be reclaimed.

The Challenge: How do we allocate memory efficiently when programs have unpredictable memory needs? How do we avoid fragmentation? What happens when physical memory is exhausted?

Example: The malloc() Mystery

int* ptr = malloc(1024 * 1024 * 1024);  // Request 1 GB
if (ptr != NULL) {
    printf("Allocation succeeded!\n");
    // But have we actually allocated 1 GB of physical RAM?
}

Surprisingly, on most modern systems, malloc() will succeed even if you only have 512 MB of physical RAM available. How is this possible? The answer lies in virtual memory and demand paging--concepts we'll explore in depth.

2. Address Translation

Programs operate using virtual addresses (also called logical addresses), but the actual data resides at physical addresses in RAM. The Memory Management Unit (MMU) performs this translation billions of times per second.

The Challenge: How do we translate addresses quickly enough to avoid becoming a bottleneck? How do we handle the translation table itself, which might be enormous?

Concrete Example:

Virtual Address:  0x7fff5fbff710  (Program's view)
Physical Address: 0x3a2f1000      (Actual RAM location)

The MMU must perform this translation for every single memory access. On a modern CPU executing billions of instructions per second, with multiple memory accesses per instruction, this means billions of translations per second.

3. Protection and Isolation

In a multitasking system, one misbehaving program must not be able to corrupt another program's memory--or worse, the operating system kernel's memory.

Historical Perspective:
Early microcomputers like the Commodore 64 and Apple II had no memory protection. A single buggy program could crash the entire system. Modern systems enforce strict isolation.

The Challenge: How do we enforce protection without sacrificing performance? How do we share memory safely when needed (e.g., shared libraries)?

Real-World Impact: The 2018 Spectre and Meltdown vulnerabilities demonstrated that even with hardware memory protection, subtle CPU implementation details could leak protected memory. These vulnerabilities affected billions of devices and showed that memory isolation is harder than it appears [Kocher et al., 2019; Lipp et al., 2018].

4. Optimization

Memory management must be fast. Every optimization matters: better cache utilization, reduced TLB misses, efficient page table structures, and intelligent page replacement algorithms.

The Challenge: How do we balance competing goals? Larger pages reduce translation overhead but increase internal fragmentation. More complex page tables enable features but consume memory themselves.

Performance Reality:

To understand why memory management is critical, consider the real cost of different memory operations. These numbers come from actual hardware measurements on a modern Intel Xeon processor:

Operation                        Latency          CPU Cycles (@4GHz)    Human Scale*
L1 Cache Hit                     ~0.5 ns          ~2 cycles             1 second
L2 Cache Hit                     ~3 ns            ~12 cycles            6 seconds
L3 Cache Hit                     ~12 ns           ~50 cycles            24 seconds
DRAM Access (TLB hit)            ~70 ns           ~280 cycles           2.3 minutes
DRAM Access (TLB miss)           ~150 ns          ~600 cycles           5 minutes
Page Fault (from SSD)            ~25,000 ns       ~100,000 cycles       1.2 days
Page Fault (from HDD)            ~10,000,000 ns   ~40,000,000 cycles    1.3 years

Table 1.1: Memory hierarchy latencies. Human scale shows the relative time if L1 cache = 1 second.

Understanding the Scale:
If an L1 cache access took 1 second, a page fault from a hard drive would take over a year. This isn't a small performance penalty--it's the difference between a responsive system and one that appears frozen.

Why the Memory Wall Matters:
The memory wall means that for many applications, performance is dominated not by CPU speed but by memory access patterns. Programs with good locality (accessing nearby memory addresses) stay in fast caches. Programs with poor locality constantly miss caches and TLBs, spending most of their time waiting for slow DRAM.

Sources: Latency measurements from Hennessy & Patterson (2024), Intel Xeon performance guides, and empirical testing.


1.3 The Evolution of Memory Management

1.3.1 Early Systems: No Memory Management (1940s-1950s)

The earliest computers like ENIAC and EDVAC had no memory management. Programs had direct access to all physical memory. Only one program could run at a time.

Programming Model:

; PDP-8 example (1965)
; Direct physical addressing - no translation
0200:  7200    CLA      ; Clear accumulator
0201:  1234    TAD 234  ; Add contents of octal address 234 to accumulator
0202:  3456    DCA 456  ; Deposit accumulator at octal address 456

Limitations:
- No multitasking
- No protection
- Manual memory layout by the programmer
- Physical memory size limited program size

1.3.2 Base and Bounds Registers (1960s)

Figure 1.2: Base-and-bounds translation: the CPU adds the base register to the logical address; exceeding the bounds register raises a protection fault.

The first hardware memory management used simple base and bounds registers. The base register contained the program's starting address, and all memory accesses were offset from this base. The bounds register limited the maximum offset.

if (Program Address >= Bounds Register):
    TRAP to operating system  # Protection fault
else:
    Physical Address = Base Register + Program Address
Figure 1.3: Concrete base-and-bounds calculation examples showing address mapping and bounds-violation detection.

Example: IBM 7094 (1962)

Base:   0x100000
Bounds: 0x010000  (64 KB limit)

Program accesses: 0x1234
Physical address: 0x100000 + 0x1234 = 0x101234 ✓ (within bounds)

Program accesses: 0x20000  
0x20000 > 0x010000 → PROTECTION FAULT

Limitations:
- Still required contiguous physical memory
- Relocation required copying the entire program in memory
- External fragmentation: gaps between programs
- No sharing of code between processes

1.3.3 Segmentation (1960s-1970s)

Figure 1.4: Segmentation architecture: a logical address encodes a segment selector and byte offset; the segment descriptor table maps each selector to a base address and limit.

Segmentation divided a program's address space into logical units (segments) like code, data, stack, and heap. Each segment had its own base and bounds.

Figure 1.5: Segmented address format: the high bits select the segment descriptor; the low bits specify the byte offset within that segment.

Segments Matched Logical Structure:

Figure 1.6: Common segment types (code, data, stack, heap) and their access-protection properties.
Segment 0: Code      (executable, read-only)
Segment 1: Data      (read-write)
Segment 2: Stack     (read-write, grows down)
Segment 3: Heap      (read-write, grows up)

Virtual Address Format (Segmented):

  [ Segment Selector (high bits) | Offset within Segment (low bits) ]

The selector indexes the segment table to fetch the segment's base and limit; the hardware checks the offset against the limit and adds it to the base.

Advantages:
- Matches the program's logical structure
- Easy sharing (share the code segment)
- Different protection for different segments
- No internal fragmentation

Disadvantages:
- External fragmentation: segments of varying sizes left gaps in physical memory
- Required compaction (expensive memory copying)
- Large segments still needed contiguous physical memory

Historical Example: Intel 8086 (1978) The 8086 used segmentation to address 1 MB of memory with 16-bit registers:

Physical Address = (Segment x 16) + Offset

1.3.4 Paging: The Modern Solution (1960s-Present)

Figure 1.7: Paging overview: the virtual address splits into a Virtual Page Number (VPN) and page offset; the page table maps each VPN to a Physical Frame Number (PFN).

The Atlas computer (University of Manchester, 1962) introduced paging--dividing both virtual and physical memory into fixed-size blocks called pages (virtual) and frames (physical) [Fotheringham, 1961; Kilburn et al., 1962].

Key Insight: With fixed-size pages, any page can map to any frame. No external fragmentation!

Typical Page Sizes:
- 4 KB: standard on x86/x64, ARM, RISC-V
- 16 KB: option on ARM
- 64 KB: option on ARM, used by some UNIX variants
- 2 MB / 1 GB: "huge pages" for reducing translation overhead

Page Table Entry (PTE) Structure: x86-64 64-bit Format

Each 64-bit PTE packs the physical frame number together with status bits:

Bit(s)   Name               Meaning
0        P (Present)        1 = page in RAM; 0 = access triggers a page fault
1        R/W                1 = writable; 0 = read-only
2        U/S                1 = user accessible; 0 = kernel only
3        PWT                Page write-through
4        PCD                Page cache disable
5        A (Accessed)       Set by the CPU on any access
6        D (Dirty)          Set by the CPU on a write
8        G (Global)         Don't flush from TLB on context switch
12-51    PFN                Physical frame number (shift left by 12 for the address)
63       NX (No-Execute)    1 = page cannot execute code

A present PTE (P=1) has a valid PFN and can be cached in the TLB, so access proceeds immediately. A non-present PTE (P=0) triggers a page fault: the OS either loads the page from disk (if swapped out) or delivers SIGSEGV (if the virtual page was never mapped).
Figure 1.8: Page Table Entry (PTE) structure: physical frame number plus status bits -- Present (P), Read/Write (R/W), User/Supervisor (U/S), Dirty (D), and Accessed (A).

Virtual Address Format (4 KB pages, 32-bit address space):

  Bits 31-12: Virtual Page Number (VPN), 20 bits (1,048,576 pages)
  Bits 11-0:  Page offset, 12 bits (4096 bytes per page)

Physical Address Format:

  Bits 31-12: Physical Frame Number (PFN)
  Bits 11-0:  Page offset (copied unchanged from the virtual address)

Page Table: The mapping from virtual page numbers to physical frame numbers is stored in a page table. Translation replaces the VPN with the corresponding PFN and leaves the offset untouched.

Advantages of Paging:
1. No external fragmentation (all blocks are the same size)
2. Easy allocation (any free frame works)
3. Efficient sharing (multiple page tables point to the same frame)
4. Demand paging (load pages only when accessed)
5. Virtual memory (address space larger than physical RAM)

Disadvantages:
1. Internal fragmentation (wasted space in partial pages)
2. Page table overhead (memory consumed by the page tables themselves)
3. Translation overhead (every memory access needs translation)


Understanding Fragmentation: Visual Examples

Memory fragmentation comes in two forms: external and internal. Understanding the difference is crucial for evaluating memory management schemes.

External Fragmentation (The Problem with Variable-Size Segments)

External fragmentation occurs when free memory is scattered in small, unusable chunks between allocated blocks.

The Problem:
Free memory becomes fragmented into non-contiguous holes. Even when sufficient total free memory exists, allocation requests may fail if they require contiguous memory larger than the biggest hole.

Real-World Example:
After a system runs for 24 hours with frequent allocations and deallocations, 20-40% of memory can become unusable due to fragmentation. A system might show 4 GB "free" but be unable to allocate even 100 MB contiguously.

Solutions:
- Compaction: move allocated blocks to consolidate free space (expensive: requires copying)
- Paging: use fixed-size blocks to eliminate external fragmentation entirely

Internal Fragmentation (The Problem with Fixed-Size Pages)

Internal fragmentation occurs when allocated blocks are larger than needed, wasting space within the allocated block.

The Trade-off:

Page Size Internal Fragmentation Page Table Size TLB Coverage
1 KB Low (max 1023 bytes waste) Large (many entries) Poor
4 KB Medium (max 4095 bytes waste) Medium Good
2 MB High (max 2 MB waste) Small (few entries) Excellent

Real-World Impact:
On a system with 16 GB RAM, 4 KB pages, and 100 processes: - Average internal fragmentation per process: ~500 KB - Total wasted memory: ~50 MB (0.3% of RAM) - This is predictable and acceptable

Compare this to external fragmentation which can waste 20-40% of RAM unpredictably.

Side-by-Side Comparison

Summary Table:

Type      Location        Cause                     Impact                        Predictable?                            Solution
External  Between blocks  Variable-size allocation  Can't use total free memory   No, gets worse over time                Compaction or paging
Internal  Within blocks   Fixed-size blocks         Wasted space in blocks        Yes, ≤ (page_size - 1) per allocation   Choose appropriate page size

Why Modern Systems Use Paging:

External fragmentation is a catastrophic problem: memory becomes increasingly unusable over time, and the only solution (compaction) is expensive and disruptive. A system might show gigabytes of "free" memory but fail to allocate even small contiguous blocks.

Internal fragmentation is a manageable problem: the waste is bounded, predictable, and typically only 0.3-1% of total RAM with 4 KB pages. The system always knows the maximum waste per allocation (page_size - 1 bytes).

This is why virtually all modern operating systems use paging for memory management--accepting small, predictable waste to completely eliminate large, unpredictable memory loss.


1.4 Hierarchical Levels of Memory Management

Modern systems implement memory management at multiple levels:

1.4.1 Hardware Level: The Memory Management Unit (MMU)

Figure 1.9: Virtual address space layout: kernel space at the top; user space holds stack (grows down), heap (grows up), BSS, data, and code segments.

The MMU is a specialized hardware component, integrated into the CPU, that performs address translation at hardware speed.

Core MMU Responsibilities:
1. Address Translation: convert virtual addresses to physical addresses
2. TLB Management: cache recent translations
3. Protection Checking: enforce access permissions
4. Exception Generation: trigger page faults when needed

MMU Pipeline Integration:

Instruction Fetch → Instruction TLB → Instruction Cache → Decode
                                ↓
                         Physical Address
                         
Data Access → Data TLB → Data Cache → Load/Store Unit
                   ↓
            Physical Address

The MMU operates in parallel with the CPU pipeline. Any TLB miss stalls the pipeline while the page table is walked.

Historical Note:
Early CPUs like the Motorola 68020 had optional external MMU chips (68851). Modern CPUs integrate the MMU on-die for performance.

1.4.2 Operating System Level: The Virtual Memory Manager

The OS kernel manages:
- Page table maintenance: creating and updating page tables for processes
- Page fault handling: loading pages from disk when absent
- Page replacement: choosing which pages to evict under memory pressure
- Swapping: moving entire processes to/from disk
- Memory allocation: assigning frames to processes

Key OS Data Structures:

// Linux kernel mm_struct (simplified)
struct mm_struct {
    pgd_t *pgd;                    // Page Global Directory (root page table)
    unsigned long start_code;       // Code segment start
    unsigned long end_code;         // Code segment end
    unsigned long start_data;       // Data segment start
    unsigned long end_data;         // Data segment end
    unsigned long start_brk;        // Heap start
    unsigned long brk;              // Heap end (current)
    unsigned long start_stack;      // Stack start
    unsigned long total_vm;         // Total virtual memory pages
    unsigned long locked_vm;        // Locked pages (mlocked)
    unsigned long pinned_vm;        // Pinned pages
    unsigned long data_vm;          // Data pages
    unsigned long exec_vm;          // Executable pages
    unsigned long stack_vm;         // Stack pages
};

Page Fault Handling Flow:

1. CPU accesses virtual address
2. TLB miss → Page table walk
3. Page not present → Page Fault Exception
4. CPU switches to kernel mode, saves context
5. OS page fault handler invoked:
   a. Check if address is valid (in process's VMA)
   b. If invalid → Segmentation Fault (kill process)
   c. If valid but not resident:
      - Allocate physical frame
      - If swapped: Read page from disk
      - If zero-page: Zero the frame
      - If file-backed: Read from file
      - Update page table entry
      - Mark page as present
6. Return from exception
7. CPU retries the faulting instruction
8. Translation succeeds → execution continues

1.4.3 Application Level: Memory Allocation Libraries

Applications don't usually manage memory at the page level. Instead, they use allocation libraries:

C/C++:

void* malloc(size_t size);
void free(void* ptr);
void* calloc(size_t nmemb, size_t size);
void* realloc(void* ptr, size_t size);

How malloc() Works (Simplified):

// glibc malloc implementation strategy
void* malloc(size_t size) {
    if (size < MMAP_THRESHOLD) {
        // Small allocation: use heap (brk/sbrk)
        return allocate_from_heap(size);
    } else {
        // Large allocation: use mmap()
        return mmap(NULL, size, PROT_READ|PROT_WRITE,
                    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    }
}

Behind the Scenes:
- Small allocations (<128 KB): use the heap via the brk() system call
- Large allocations: use mmap() to map anonymous memory
- Free lists are managed in bins and arenas
- Each block carries metadata overhead (headers, footers)

Memory Allocation Hierarchy:

Application calls malloc()
        ↓
C Library (glibc)
        ↓
System Call (brk/mmap)
        ↓
Kernel (page allocator)
        ↓
MMU (page table update)
        ↓
Physical Memory Assignment

1.5 The Virtual Memory Abstraction

Virtual memory is one of computer science's most powerful abstractions. It provides each process with the illusion of:
1. Vast address space (often larger than physical RAM)
2. Contiguous memory (even if physical memory is fragmented)
3. Private memory (isolated from other processes)
4. Uniform access (same addressing model for all memory)

1.5.1 Benefits of Virtual Memory

Benefit 1: Simplified Programming Model

Programmers can assume:
- Memory starts at address 0
- Memory is contiguous
- Pointers just work (within the valid address space)
- No need to manage physical memory locations

Benefit 2: Memory Overcommitment

Physical memory can be overcommitted--the sum of all virtual address spaces can exceed physical RAM.

Real Example:

Physical RAM: 16 GB
Process 1:    8 GB virtual
Process 2:    8 GB virtual  
Process 3:    8 GB virtual
Total:       24 GB virtual   → More than physical!

This works because:
- Not all virtual pages are actually in use
- Pages can be swapped to disk
- Many pages are shared (code, libraries)

Benefit 3: Security and Isolation

Each process has its own page table. Process A cannot access Process B's memory.

Process A's view:
0x00000000 - 0x7FFFFFFF: User space
0x80000000 - 0xFFFFFFFF: Kernel space

Process B's view:  
0x00000000 - 0x7FFFFFFF: User space (different physical frames!)
0x80000000 - 0xFFFFFFFF: Kernel space (same physical frames)

Benefit 4: Memory-Mapped Files

Files can be mapped into virtual memory. Reading/writing memory reads/writes the file.

// Map a file into memory
int fd = open("database.db", O_RDWR);
if (fd < 0) { perror("open"); exit(1); }
void* ptr = mmap(NULL, file_size, PROT_READ|PROT_WRITE,
                 MAP_SHARED, fd, 0);
if (ptr == MAP_FAILED) { perror("mmap"); exit(1); }
// Now ((char*)ptr)[x] reads/writes the file!

Benefit 5: Copy-on-Write (COW)

When fork() creates a child process, both parent and child initially share the same physical pages (marked read-only). Only when one writes to a page is it duplicated.

Parent calls fork()
        ↓
Child created with identical page tables
        ↓
Both point to same physical frames (read-only)
        ↓
Parent writes to page → Page Fault
        ↓
Kernel copies page, updates parent's page table
        ↓
Parent has private copy, child still has original

This makes fork() extremely fast, even for large processes.

1.5.2 The Cost of Virtual Memory

Virtual memory isn't free:

  1. Translation overhead: Every memory access requires address translation
  2. TLB misses: Cache misses in the TLB require expensive page table walks
  3. Page table memory: Page tables themselves consume memory
  4. Page faults: Loading pages from disk is extremely slow

Performance Impact Example:

// Sequential access (good locality)
for (i = 0; i < ARRAY_SIZE; i++) {
    sum += array[i];
}
// TLB hit rate: ~99%+

// Random access (poor locality)  
for (i = 0; i < ARRAY_SIZE; i++) {
    sum += array[rand() % ARRAY_SIZE];
}
// TLB hit rate: can drop to 50% or worse

1.6 Key Concepts and Terminology

Before diving deeper, let's establish precise definitions:

Address Spaces

Virtual Address: The address used by a program (logical address)
Physical Address: The actual address in RAM hardware
Address Space: The set of all valid addresses (e.g., 32-bit = 4 GB address space)

Pages and Frames

Virtual Address Format: 4 KB Paging (32-bit and 64-bit)

32-bit virtual address (4 GB address space, two-level paging):
  Page Directory Index: bits 31-22 (10 bits, 1024 PD entries)
  Page Table Index:     bits 21-12 (10 bits)
  Page Offset:          bits 11-0  (12 bits, 4096 bytes = 4 KB)

48-bit virtual address (x86-64, 256 TB address space, four-level paging):
  PML4 Index:  bits 47-39 (9 bits)
  PDPT Index:  bits 38-30 (9 bits)
  PD Index:    bits 29-21 (9 bits)
  PT Index:    bits 20-12 (9 bits)
  Page Offset: bits 11-0  (12 bits, 4 KB)

Key rule: the page offset is NEVER translated; it passes through unchanged (offset = VA mod 4096). Only the VPN (upper bits) changes during translation; the offset selects the byte within the page.

Page Size vs. Offset Bits:
  Page Size        Offset Bits   VPN Bits (32-bit)   VPN Bits (48-bit)
  4 KB (standard)  12            20                  36 (4 levels x 9)
  2 MB huge page   21            11                  27 (3 levels)
Figure 1.10: Virtual address bit layout for x86-64: bits 47-39 index PGD, 38-30 PUD, 29-21 PMD, 20-12 PTE, and 11-0 are the page offset.

Page: A fixed-size block of virtual memory (typically 4 KB)
Frame: A fixed-size block of physical memory (same size as page)
Page Number: The high-order bits of a virtual address
Offset: The low-order bits (position within the page)

Page Tables

Physical Address Format: Frame Number + Offset

52-bit physical address (x86-64, supports up to 4 PB physical RAM):
  Physical Frame Number (PFN): bits 51:12 (40 bits) -- which 4 KB frame in RAM
  Page Offset:                 bits 11:0  (12 bits) -- identical to the VA offset

The 12-bit offset is identical in the virtual and physical address; it is never changed. Only the upper bits change: the VPN (virtual) is replaced with the PFN (physical) via the page table lookup.

Worked Example: Full Address Translation
  Virtual address:    0x0000_7FFF_0010_0A40
  VPN (bits 47:12):   0x7FFF_0010_0 -> page table lookup -> PFN = 0x5A
  Offset (bits 11:0): 0xA40 (unchanged)
  Physical address:   0x5A * 4096 + 0xA40 = 0x5A000 + 0xA40 = 0x5AA40
  (PFN 0x5A = frame 90 in physical RAM; the frame starts at byte
  90 * 4096 = 368,640 = 0x5A000)
Figure 1.11: Physical address format: the Physical Frame Number (PFN) occupies the upper bits; the page offset (copied from the virtual address) occupies the lower 12 bits.

Page Table: Data structure mapping virtual pages to physical frames
Page Table Entry (PTE): One entry in the page table
Page Directory: Higher-level page table (in multi-level schemes)
Translation Lookaside Buffer (TLB): Hardware cache of recent translations

Page States

Resident: Page is in physical memory
Swapped: Page is on disk (swap space)
Dirty: Page has been modified
Present: Page is accessible (resident and valid)
Valid: Address is within process's address space

Memory Operations

Page Fault: Exception when accessing a non-present page
Swapping: Moving entire processes to/from disk
Paging: Moving individual pages to/from disk
Thrashing: System spends more time paging than executing


1.7 A Concrete Example: Address Translation

Step-by-Step Address Translation Example (32-bit, 4 KB pages)

Virtual address: 0x00403A7C
Binary: 0000 0000 0100 0000 0011 1010 0111 1100
Page size = 4 KB = 2^12 bytes; page table entry size = 4 bytes

Step 1 -- Split virtual address into VPN and offset:
  VPN:    bits 31-12 = 0x00403 (virtual page number)
  Offset: bits 11-0  = 0xA7C   (byte within the page)

Step 2 -- TLB lookup with VPN 0x00403:
  TLB HIT: PFN = 0x1B7 (the TLB also confirms R/W=1, U/S=1: user writable)
  Latency: ~1-4 cycles (~1 ns)
  On a TLB MISS, the hardware walks the page table instead
  (Page Directory entry, then Page Table entry on this two-level
  system), finds PFN = 0x1B7, and updates the TLB.

Step 3 -- Compute physical address:
  Physical address = (PFN << 12) | Offset
                   = (0x1B7 << 12) | 0xA7C = 0x001B7A7C
  The page offset (0xA7C) passes through unchanged.

Summary: VA 0x00403A7C -> [MMU: VPN=0x403, lookup, PFN=0x1B7] -> PA 0x001B7A7C. The hardware maps the address silently and invisibly; the program never knows.
Figure 1.12: Step-by-step address translation: virtual address decomposed into VPN and offset; TLB or page table supplies PFN; physical address assembled.

Let's trace a complete memory access from virtual address to physical address.

System Configuration:
- 32-bit address space
- 4 KB (2^12 byte) pages
- 4 GB (2^32 byte) physical RAM = 1M frames

Virtual Address: 0x12345678

Step 1: Split address into page number and offset

Virtual Page Number (VPN): bits 31-12 = 0x12345
Offset:                    bits 11-0  = 0x678

Step 2: Look up in TLB

TLB lookup for VPN 0x12345...
TLB HIT! → Physical Frame Number: 0xA2B4C

Step 3: Construct physical address

Physical Frame: 0xA2B4C
Offset:         0x678

Physical Address = (Frame Number << 12) | Offset
                 = (0xA2B4C << 12) | 0x678
                 = 0xA2B4C678

Step 4: Access physical memory

Read from physical address 0xA2B4C678

If TLB Miss had occurred:

TLB miss for VPN 0x12345
→ Hardware page table walk:
  1. Read Page Directory entry
  2. Read Page Table entry  
  3. Extract Physical Frame Number: 0xA2B4C
  4. Load into TLB
  5. Retry access (now TLB hit)

If Page Not Present:

Page table entry for VPN 0x12345 has Present bit = 0
→ Page Fault Exception
→ Kernel page fault handler:
  1. Check if address valid
  2. Allocate physical frame
  3. Load page from disk (if swapped)
  4. Update PTE: Frame = 0xA2B4C, Present = 1
  5. Return from exception
→ CPU retries instruction (now page is present)

1.8 The Translation Lookaside Buffer (TLB)

Complete Address Translation Flow: TLB Hit and Miss Paths

The CPU issues a virtual address; the VPN is looked up in the TLB:
- HIT (~95% typical): the PFN is found in the TLB. Latency: 1-4 cycles. Fast path.
- MISS (~5%): the hardware page table walker performs 4 DRAM reads (~70 ns each, ~280 ns total), resolves the PFN, and updates the TLB.

Either way, the physical address is formed as PA = (PFN << 12) | offset, the permission bits (R/W/X, U/S) are checked (failure raises a page fault), and the data is fetched from cache or DRAM.

Latency summary:
  TLB hit:  1-4 cycles (~1 ns), no memory access -- fastest possible
  TLB miss: 4 x 70 ns ~= 280 ns ~= 840 wasted CPU cycles per miss
  95% hit rate -> ~44 cycles average; 50% hit rate -> ~420 cycles
Figure 1.13: Address translation control flow: TLB hit returns physical address in 1-2 cycles (fast path); TLB miss triggers a hardware page-table walk costing 100-300 cycles (slow path).

While page tables elegantly solve the problem of address translation, they introduce a significant performance challenge: every memory access requires at least one additional memory read to consult the page table. For a simple memory load instruction, the CPU must:

  1. Read the page table entry from memory (~100 nanoseconds)
  2. Use the PFN to calculate the physical address
  3. Read the actual data from memory (~100 nanoseconds)

This doubles the effective memory access time! For a program making millions of memory accesses per second, this overhead would be catastrophic.

The Solution: Translation Lookaside Buffer

TLB Structure and Operation: Fast Path vs. Miss Path

TLB internal structure (fully-associative, 64-entry example):

  ASID   VPN (Tag)   PFN     Flags   V
  0x01   0x7FFF5     0x1B7   RW-     1
  0x01   0x40000     0x2A1   R-X     1
  0x02   0x7FFF5     0x5C2   RW-     1
  0x01   0x90000     --      ---     0  (invalid)
  ...    (60 more entries)

ASID = Address Space Identifier: lets multiple processes share the TLB without flushing on every context switch.

TLB HIT path (~1-4 cycles): CPU presents the VPN, a tag match is found, the cached PFN forms the physical address, and memory is accessed. No page table access needed.

TLB MISS path (~280+ ns): no tag match, so the hardware walks the page table (4 DRAM reads x ~70 ns) and installs the new mapping in the TLB -- roughly 840 wasted CPU cycles.

Typical TLB specifications: L1 ITLB 64-128 entries (fully associative, 1-cycle); L1 DTLB ~64 entries (4-way set-associative, 1-cycle); L2 STLB 512-2048 entries (4-12 cycle access). Intel Skylake: 64 L1 + 1536 L2 entries; AMD Zen 4: 64 L1 + 2048 L2 entries.
Figure 1.14: TLB operation: hardware compares the VPN tag against all entries in parallel; a hit immediately returns the cached PFN without touching DRAM.

The Translation Lookaside Buffer (TLB) is a specialized hardware cache that stores recently used virtual-to-physical address translations. It sits between the CPU and the page table, intercepting every address translation request.

TLB Performance Impact

The performance difference between a TLB hit and miss is dramatic:

Scenario    Cycles    Time (3 GHz CPU)   Relative Speed
TLB Hit     1-2       ~0.5 ns            1x (baseline)
TLB Miss    100-300   ~50-100 ns         100-200x slower

Modern processors typically achieve 95-99% TLB hit rates during normal operation, meaning most memory accesses avoid the slow page table walk entirely.

A note on TLB hit latency and cache organisation. The "1–2 cycles" figure assumes the TLB result is needed before the L1 data cache can be accessed — the classic PIPT (Physically-Indexed, Physically-Tagged) configuration. Many modern processors instead use VIPT (Virtually-Indexed, Physically-Tagged) L1 caches: the cache index is computed from virtual address bits simultaneously with the TLB lookup. When the VIPT index bits lie entirely within the page offset, there is no aliasing and the effective TLB penalty on an L1 cache hit is zero cycles — the translation arrives just in time for tag comparison. L2 and L3 caches are generally PIPT and pay the full TLB latency. Chapter 7 (§7.8.4) examines VIPT cache aliasing hazards in detail. Mittal (2016) provides a systematic comparison of all four cache-addressing modes (PI-PT, VI-PT, VI-VT, PI-VT).

TLB Structure

TLB Entry Structure: Fields and Their Roles

A single TLB entry encodes everything needed for one virtual-to-physical translation plus access control:

  ASID/PCID (8-12 bits) | VPN tag (36 bits on x86-64, 4-level) | Valid (1 bit) |
  PFN (40 bits, 4 PB support) | R/W (1) | U/S (1) | NX (1) | PCD/PWT (2)

Field descriptions and hardware behavior:
- ASID/PCID: Address Space ID / Process Context ID. Tags each entry to a specific process, eliminating full TLB flushes on context switches. x86-64 PCID: 12 bits (4096 distinct contexts).
- VPN (tag): the virtual page number being translated. Used as the lookup key; hardware compares the incoming VPN against all tags simultaneously (fully associative) in a single cycle.
- Valid: 1 = entry is live; 0 = invalidated (by INVLPG or a TLB flush). Invalid entries are ignored during lookup even if the tag matches.
- PFN (data): the translation result. PA = PFN x 4096 + offset. On a hit, the PFN is immediately used to form the physical address.
- Protection flags: R/W, U/S, NX, copied from the PTE. Hardware checks them on every access; a write to an R/W=0 page raises a page fault even on a TLB hit. NX prevents code execution.
- Global (G): when G=1, the entry is not flushed on CR3 writes (used for kernel pages shared across all processes).
Figure 1.15: TLB entry structure: VPN tag, Physical Frame Number, Address Space ID (ASID) for process isolation, and permission/dirty/valid bits.

A typical TLB is organized as an associative cache (content-addressable memory):

Entry Format:

Each entry stores:
- Virtual Page Number (VPN): the tag to match against
- Physical Frame Number (PFN): the translation result
- Valid bit: whether the entry contains valid data
- Flags: permissions (read/write/execute) from the PTE

Size Example: A typical L1 TLB might have:
- 64 entries for 4 KB pages
- Coverage: 64 x 4 KB = 256 KB of address space
- Only 256 KB of "working set" can be accessed without misses!

This limited coverage is called TLB reach and is a critical factor in system performance. Programs with larger working sets will experience more TLB misses.

TLB Reach: Formal Definition and AI Implications

TLB reach is defined precisely as:

TLB Reach = Number of TLB Entries × Page Size

This formula immediately shows why page size selection is a first-order architectural decision. The following table shows TLB reach for typical hardware configurations (Mittal, 2016; Qasem & Magee, 2012):

TLB Level   Entries   4 KB pages   2 MB pages   1 GB pages
L1 DTLB     64        256 KB       128 MB       64 GB
L2 STLB     1,536     6 MB         3 GB         1.5 TB
L2 STLB     2,048     8 MB         4 GB         2 TB

The AI/ML reach crisis. Modern large language model training exposes just how narrow 4 KB TLB reach is. GPT-3's working set consists of approximately 350 GB of parameters, plus 700 GB each for momentum and variance under AdamW optimization, plus 50–100 GB of activations — totalling roughly 1,800 GB. An L1 TLB with 64 entries covering 256 KB addresses 0.000014% of that working set; TLB miss rate approaches 100%. Even the 1,536-entry L2 STLB covers only 6 MB. Wu & Taylor (2021) confirmed this directly for deep learning benchmarks: mapping the kernel address space with 8 MB huge pages yielded "a single TLB entry where 4 KB pages would need 2,048 entries," producing 55–61% performance improvement. Switching to 2 MB pages shifts L1 reach from 256 KB to 128 MB (a 512× gain), meaning the L1 TLB alone can cover a meaningful fraction of typical training batches. This is why huge pages are increasingly favoured in AI infrastructure — a topic examined in depth in Chapters 9 and 14.

When Does the TLB Get Updated?

The TLB is automatically managed by the hardware MMU:

The TLB is updated on:
- Every TLB miss (a new entry is added)
- Page table updates by the OS (the corresponding entry is invalidated)

The TLB is flushed (cleared) on:
- Context switches between processes (different address spaces)
- Explicit flush instructions (e.g., when the kernel updates page tables)

Some CPUs support ASIDs (Address Space Identifiers) to avoid flushing on context switches.

Why the TLB Matters

The TLB is arguably the most important performance component in virtual memory systems:

  1. Makes Virtual Memory Practical: Without TLBs, virtual memory would be too slow for production use
  2. Working Set Matters: Programs that access memory randomly suffer more TLB misses
  3. Large Pages Help: Using 2 MB pages instead of 4 KB pages increases TLB reach by 512x

Real-World Example:

Consider a database scanning a 1 GB table with random access:
- With 4 KB pages: 262,144 pages; typical TLB: 64 entries
- TLB covers: 256 KB (0.025% of the data)
- Result: ~99.975% TLB miss rate = catastrophic performance

This is why databases and high-performance applications use huge pages (covered in Chapter 2.7).

Key Takeaways

  1. TLB is essential: Virtual memory would be impractically slow without TLB caching
  2. Hit rate is critical: 95%+ hit rates are necessary for good performance
  3. TLB reach is limited: Only 256 KB - 128 MB of working set can be efficiently accessed
  4. Hardware-managed: The MMU automatically updates and manages the TLB
  5. Performance tool: Understanding TLB behavior is crucial for optimization (covered in Part IX)

The TLB transforms virtual memory from a theoretical elegance into a practical, high-performance system. Modern processors invest significant chip area in TLB structures precisely because this cache is so performance-critical.

In the next section, we'll see how multi-level page tables address another challenge: efficiently managing large, sparse address spaces.

1.9 Multi-Level Page Tables

Page Table Mapping: Virtual Pages to Physical Frames

  VPN   Page Table Entry                Physical Frame
  0     PFN=3,  P=1, R/W, U/S           Frame 3  (0x3000)
  1     PFN=7,  P=1, R/W, U/S           Frame 7  (0x7000)
  2     P=0 (not mapped)                -- (page fault if accessed)
  3     PFN=1,  P=1, R/W, U/S           Frame 1  (0x1000)
  4     PFN=9,  P=1, R/W, U/S           Frame 9  (0x9000)
  5     PFN=2,  P=1, R/W, U/S           Frame 2  (0x2000)
  6     P=0 (not mapped)                -- (no frame allocated)
  7     PFN=11, P=1, R=1, W=0, NX       Frame 11 (0xB000)

Non-contiguous mapping in action: VPN 0 -> Frame 3, VPN 1 -> Frame 7, VPN 3 -> Frame 1. Virtual pages appear contiguous to the program but physically occupy non-adjacent frames scattered across RAM. This is the power of paging. VPN 7 is mapped R=1, W=0, NX=1: read-only and non-executable (e.g., a read-only data segment).
Figure 1.16: Page table mapping: non-contiguous virtual pages map to arbitrary physical frames, providing address-space isolation between processes.

The page table design we've discussed so far has a critical flaw: it assumes a single-level page table that must contain an entry for every possible virtual page. For modern 64-bit systems with vast address spaces, this becomes completely impractical.

The Problem: Single-Level Page Tables Don't Scale

Consider a 64-bit system with 4 KB pages:

Address Space Calculation:
- Virtual address space: 2^48 bytes = 256 TB (typical implementation)
- Page size: 4 KB = 2^12 bytes
- Number of pages: 256 TB / 4 KB = 68,719,476,736 pages

Page Table Size:
- Entries needed: 68.7 billion
- Entry size: 8 bytes (64-bit PTE)
- Total page table size: 512 GB per process!

This is absurd. The page table alone would consume more memory than most systems have, and we haven't even allocated memory for the actual program yet!

The Observation: Most Address Spaces Are Sparse

The key insight is that programs don't use most of their address space. A typical program might use:
- A few MB of code and shared libraries
- A few MB to a few GB of heap
- A few MB of stack
- Scattered memory-mapped regions

The remaining 99.9999% of the address space is unmapped and unused. Allocating a complete page table for this sparse usage is incredibly wasteful.

The Solution: Hierarchical Page Tables

Multi-Level Page Tables: x86-64 4-Level Hierarchy (CR3 Walk)

48-bit virtual address (canonical form): sign extension (bits 63-48) | PML4 index (bits 47-39) | PDPT index (38-30) | PD index (29-21) | PT index (20-12) | page offset (11-0).

CR3 holds the PML4 base address. Each table has 512 entries; only entries for mapped regions are non-null, so unused subtrees cost no memory at all.

TLB miss penalty: 4 separate DRAM accesses (PML4 entry, PDPT entry, PD entry, PT entry at ~70 ns each = ~280 ns ~= 840 wasted cycles) before the actual data access (~70 ns more).

Memory savings of the hierarchy: a flat table for 256 TB would need 512 GB per process; the 4-level structure for a typical process needs roughly 1 MB -- about a 512,000x reduction in page table memory.
Figure 1.17: x86-64 four-level page table hierarchy (PGD to PUD to PMD to PTE): each level indexed by 9 bits of the 48-bit virtual address, enabling sparse allocation.

Multi-level page tables break the single large page table into a tree structure, allocating page table memory only for regions of the address space that are actually in use.

Concept: Instead of one huge table, create a hierarchy: - Level 1: Page Directory (small, always present) - Level 2: Page Tables (only created for used regions)

Example: Two-Level Page Tables

Let's walk through a concrete example with a 32-bit address space and 4 KB pages.

Address Format (32-bit, two-level, 4 KB pages):

Directory Index: bits 31-22 (10 bits)
Table Index:     bits 21-12 (10 bits)
Offset:          bits 11-0  (12 bits)

Translation Process:

1. Virtual Address: 0x00403A7C

2. Split address into components:
   Dir Index:   bits 31-22 = 0x001 = 1
   Table Index: bits 21-12 = 0x003 = 3
   Offset:      bits 11-0  = 0xA7C = 2684

3. Walk the page table hierarchy:
   
   Step 1: Use Dir Index to find Page Table
   +-----------------------+
   |   Page Directory      |  ← Always in memory
   |  Entry 0: NULL        |
   |  Entry 1: PTE* 0x5000 |  ← Points to Page Table
   |  Entry 2: NULL        |
   |  ...                  |
   +-----------------------+
   
   Step 2: Use Table Index to find PFN
   +-----------------------+
   |  Page Table @ 0x5000  |  ← Only allocated if needed
   |  Entry 0: PFN 0x100   |
   |  Entry 1: PFN 0x101   |
   |  Entry 2: PFN 0x102   |
   |  Entry 3: PFN 0x1B7   |  ← Target entry
   |  ...                  |
   +-----------------------+
   
   Step 3: Combine PFN with Offset
   Physical Address = (0x1B7 << 12) | 0xA7C = 0x001B7A7C

4. Access physical memory at 0x001B7A7C

Memory Savings

Single-Level Page Table (32-bit system):
- Entries needed: 4 GB / 4 KB = 1,048,576
- Entry size: 4 bytes
- Total size: 4 MB per process (always)

Two-Level Page Table:
- Page Directory: 1,024 entries x 4 bytes = 4 KB (always present)
- Page Tables: only allocated for used regions

Example Program Using 268 MB:
- Used pages: 268 MB / 4 KB = 68,608
- Page tables needed: 68,608 / 1,024 = 67
- Size: 67 x 4 KB = 268 KB
- Plus directory: 268 KB + 4 KB = 272 KB total

Savings: 4 MB → 272 KB (93% reduction!)

For a 64-bit system with sparse usage, the savings are even more dramatic.

How Many Levels?

Modern architectures use different numbers of levels:

Architecture        Levels   Address Bits   Coverage
x86 (32-bit)        2        32             4 GB
x86-64 (4-level)    4        48             256 TB
x86-64 (5-level)    5        57             128 PB
ARM64 (typical)     4        48             256 TB
RISC-V Sv39         3        39             512 GB
RISC-V Sv48         4        48             256 TB

More levels = larger address space but slower page table walks.

Modern 64-bit systems typically use 4-level page tables as a sweet spot between address space size and lookup performance.

Trade-offs

Advantages:
- Huge memory savings for sparse address spaces
- Pay only for what you use (demand allocation)
- Enables large address spaces (48-57 bits practical)
- Easier granular sharing between processes

Disadvantages:
- More complex hardware (multiple memory accesses)
- Slower on TLB miss (must walk multiple levels)
- More page faults possible (intermediate tables may not be present)

TLB Becomes Even More Critical

With multi-level page tables, a TLB miss becomes expensive:

Cost of Address Translation:
- TLB hit: 1 cycle (~0.3 ns)
- TLB miss with 2-level PT: 2 memory accesses (~200 ns) = ~667x slower
- TLB miss with 4-level PT: 4 memory accesses (~400 ns) = ~1,333x slower

This makes the TLB's caching function absolutely critical. A program with a 99% TLB hit rate will be dramatically faster than one with a 95% hit rate, especially on systems with deep page table hierarchies.

Key Takeaways

  1. Single-level page tables don't scale: Would require gigabytes of memory per process
  2. Address spaces are sparse: Programs use tiny fractions of available address space
  3. Multi-level hierarchies: Allocate page table memory only for used regions
  4. Memory savings are dramatic: Often 90-99% reduction in page table overhead
  5. TLB becomes essential: Multiple memory accesses on TLB miss make caching critical
  6. More levels = slower misses: Trade-off between address space size and lookup cost

Multi-level page tables are a perfect example of indirection solving a scaling problem. By adding a level of indirection (or several), we transform an impractical memory requirement into a manageable one, enabling the vast 64-bit address spaces that modern programs enjoy.

The specific implementations vary by architecture--x86-64's 4-level structure differs from ARM's implementation, and RISC-V offers multiple options. We'll explore these architecture-specific details in Part III: MMU in Modern Architectures.


This completes our introduction to the fundamental mechanisms of address translation. We've covered: - The basic concept of pages and page tables - How address translation works (virtual → physical) - The critical role of the TLB in making translation fast - How multi-level page tables make large address spaces practical

In the next section, we'll look at real-world examples of how these mechanisms impact system performance and security.

1.10 Why MMU Matters: Real-World Impact

Memory Latency Comparison: Access Times Across the Hierarchy

  Registers   ~0.3 ns   (~1 cycle)
  L1 cache    ~1 ns     (4 cycles)
  L2 cache    ~4 ns     (12 cycles)
  L3 cache    ~14 ns    (40 cycles)
  DRAM        ~70 ns    (~200 cycles)
  SSD/NVMe    ~25 us    (~75,000 cycles)
  HDD         ~10 ms    (~30M cycles)

A TLB miss costs 4 DRAM reads = ~280 ns. Relative to L1 cache, DRAM is ~70x slower, SSD ~25,000x, and HDD ~10,000,000x.
Figure 1.18: Memory access latency comparison: L1 cache (~1 ns), L2 (~4 ns), L3 (~10 ns), DRAM (~100 ns), NVMe SSD (~100 us), spinning disk (~10 ms).

Case Study 1: The Cost of TLB Misses

Context Switch: TLB Flush vs. ASID/PCID Tagging

Approach 1 -- full TLB flush (naive): writing CR3 on a context switch invalidates every TLB entry. The incoming process starts with a cold TLB (~0% hit rate), and warming it up takes hundreds of page table walks. At ~280 ns per miss and 100-500 misses per switch, the overhead is roughly 28-140 us per context switch.

Approach 2 -- ASID/PCID tagging: each TLB entry carries an address-space tag, so entries from multiple processes coexist in the TLB and no flush is needed; the new process misses only on pages it has not yet touched. x86-64 supports 4096 PCIDs; the result is near-zero switch cost and a typical 5-30% performance improvement.
Figure 1.19: Context switch and TLB management: ASID-tagged TLBs allow entries from multiple address spaces to coexist, eliminating full TLB flushes on every context switch.

Scenario: Database performing random lookups in a 100 GB dataset on a system with 32 GB RAM.

Analysis:
- Page size: 4 KB
- Total pages: 25 million (100 GB / 4 KB)
- Physical frames: 8 million (32 GB / 4 KB)
- TLB size: 1,536 entries (Intel Xeon)
- TLB coverage: 6 MB (1,536 x 4 KB)

Result:
With random access patterns, TLB hit rate drops to ~5%. Each TLB miss requires a page table walk (4 memory accesses for 4-level paging). Performance can degrade by 60-80% compared to sequential access.

Solution:
Use 2 MB huge pages:
- TLB coverage: 3 GB (1,536 x 2 MB)
- TLB hit rate improves to ~60%
- Performance increase: 2-3x for this workload

Case Study 2: Spectre and Meltdown

In 2018, researchers discovered that speculative execution in modern CPUs could be exploited to leak protected memory [Kocher et al., 2019; Lipp et al., 2018].

The Vulnerability: CPUs speculatively execute code before checking permissions. Even though mis-speculated instructions are discarded, they leave traces in the cache and TLB that can be measured.

Attack Example (Simplified):

// Attacker code (user space)
if (x < array1_size) {  // Will be true eventually
    y = array2[array1[x] * 4096];  // Speculatively executed!
}

During speculative execution:
1. CPU loads array1[x] (x is out of bounds, pointing into kernel memory)
2. CPU loads array2[array1[x] * 4096]
3. The page is brought into the TLB/cache
4. Speculation was wrong -> results are discarded
5. BUT the TLB/cache still contains traces of kernel memory!
6. Attacker measures TLB/cache timing to infer kernel data

Impact: - Affected billions of CPUs (Intel, AMD, ARM) - Required OS patches (KPTI: Kernel Page Table Isolation) - Performance overhead: 5-30% depending on workload

Lesson:
Memory management isn't just about performance--it's critical to security. The MMU is part of the security perimeter.

Case Study 3: Rowhammer

Rowhammer demonstrated that repeated accesses to DRAM rows can cause bit flips in adjacent rows--a hardware reliability issue that becomes a security vulnerability [Kim et al., 2014].

The Attack:

// Repeatedly activate two rows in the same DRAM bank ("hammering")
while (1) {
    *(volatile char *)addr1;   // volatile read forces an actual DRAM access
    *(volatile char *)addr2;
    clflush(addr1);            // evict from cache so the next read goes to DRAM
    clflush(addr2);
}
// Repeated row activations cause bit flips in adjacent physical rows

MMU Connection: Attackers used the MMU's address translation to:
  1. Determine which virtual addresses map to adjacent physical rows
  2. Bypass MMU protection to corrupt page tables
  3. Gain kernel privileges

Impact:
  - Demonstrated in 2015 (Google Project Zero)
  - Exploitable from JavaScript (no native code needed)
  - Led to TRR (Target Row Refresh) in DRAM


1.11 Chapter Summary

Key Takeaways:

  1. Memory management bridges the memory wall between fast CPUs and slow DRAM through caching and virtual memory.

  2. Paging solves fragmentation by using fixed-size pages, enabling efficient allocation and virtual memory.

  3. The MMU performs address translation billions of times per second, making virtual memory practical.

  4. Virtual memory provides powerful abstractions: large address spaces, isolation, memory overcommitment, and simplified programming.

  5. Performance matters: TLB misses and page faults can dominate execution time in memory-intensive workloads.

  6. Security is fundamental: The MMU enforces isolation, but vulnerabilities like Spectre and Rowhammer show that hardware memory protection is subtle and complex.

Thought-Provoking Questions:

  1. If CPU speeds continue increasing while memory latency stagnates, what percentage of CPU time will be spent waiting for memory?

  2. How large should page sizes be? Larger pages reduce translation overhead but increase internal fragmentation. What's the optimal tradeoff?

  3. Can we eliminate page tables entirely? What alternative translation mechanisms might be possible?

  4. Should applications be aware of physical memory layout (NUMA, memory tiers)? Or should the OS hide this complexity?

  5. With persistent memory (NVDIMMs, Optane), do we need to rethink the entire virtual memory model?


[Figure 1.20 diagram: ① CPU accesses virtual address 0x5000_1234; ② MMU TLB miss triggers a page table walk that finds P=0 in the PTE; ③ #PF exception raised, hardware saves the PC and faulting VA (CR2) and transfers to the OS; ④ OS checks its VMAs -- an invalid VA means SIGSEGV and the program is killed; ⑤ for a valid VA the OS locates the page in the backing store: zero-fill for a new allocation (~1 µs), read from an mmap'd file (~100 µs on SSD), swap-in from disk (~100 µs-10 ms), or copy-on-write (~1 µs + copy time); ⑥ PTE updated with P=1, TLB filled, and the instruction re-executed transparently. Cost summary: zero/COW fill ~1-10 µs; SSD swap-in ~100 µs = 400,000 cycles; HDD swap-in ~10 ms = 40,000,000 cycles.]
Figure 1.20: Page fault handling flow: CPU raises fault on unmapped access; OS handler maps page into a free frame; PTE updated; faulting instruction retried.

This chapter has introduced the fundamental concepts of memory management:
  - The memory wall and why management is necessary
  - Evolution from simple base/bounds to modern paging
  - The role of hardware (MMU), OS, and applications
  - Virtual memory abstraction and its benefits
  - Real-world security and performance implications

In the chapters ahead:

Chapter 2: Virtual memory concepts in depth--demand paging, page replacement algorithms, working sets, thrashing

Chapter 3: Hardware implementation of MMUs--page table formats, multi-level paging, TLB design

Part II: Historical evolution--from Atlas to modern architectures

Part III: Deep dives into x86, ARM, RISC-V, and other architectures

Part IV: Advanced techniques--huge pages, NUMA, memory compression

Part V: Security--memory protection, confidential computing, isolation

Part VI: Vulnerabilities--Spectre, Meltdown, Rowhammer, and mitigations

Part VII: Debugging and testing MMU implementations

Part VIII: Future directions--CXL, disaggregated memory, AI workloads

Part IX: Performance optimization for real-world systems


References

  1. Hennessy, J. L., & Patterson, D. A. (2024). Computer Architecture: A Quantitative Approach (7th ed.). Morgan Kaufmann. Chapter 2: Memory Hierarchy Design.

  2. Denning, P. J. (1970). "Virtual memory". Computing Surveys, 2(3), 153-189.

  3. Kilburn, T., Edwards, D. B. G., Lanigan, M. J., & Sumner, F. H. (1962). "One-level storage system". IRE Transactions on Electronic Computers, EC-11(2), 223-235.

  4. Jacob, B., Ng, S. W., & Wang, D. T. (2010). Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann.

  5. Kocher, P., et al. (2019). "Spectre attacks: Exploiting speculative execution". Communications of the ACM, 63(7), 93-101.

  6. Lipp, M., et al. (2018). "Meltdown: Reading kernel memory from user space". 27th USENIX Security Symposium.

  7. Kim, Y., et al. (2014). "Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors". ISCA 2014.

  8. Arpaci-Dusseau, R. H., & Arpaci-Dusseau, A. C. (2018). Operating Systems: Three Easy Pieces. Chapter 18-23 (Virtual Memory). Available at: https://pages.cs.wisc.edu/~remzi/OSTEP/

§1.8 -- TLB Reach and AI Workloads

  1. Mittal, S. "A survey of techniques for architecting TLBs." Concurrency and Computation: Practice and Experience (CPE), 29(10), 2016. DOI: 10.1002/cpe.4061

  2. Qasem, A. and Magee, J. "Improving TLB performance through demand-driven superpaging." Software: Practice and Experience (SPE), 43(6):705–729, 2012. DOI: 10.1002/spe.2128

  3. Wu, X. and Taylor, V. "Ensemble learning for CANDLE deep learning benchmarks." Concurrency and Computation: Practice and Experience (CPE), 35(15), 2021. DOI: 10.1002/cpe.6516