Chapter 2: Virtual Memory Concepts

“Virtual memory is a powerful illusion—programs believe they have unlimited, contiguous memory, when in reality they’re sharing limited physical RAM with dozens of other processes.”

2.1 Introduction: The Illusion of Unlimited Memory

In Chapter 1, we learned that virtual memory gives each process its own private address space, with the MMU translating virtual addresses to physical addresses. But we left an important question unanswered: What happens when a program’s virtual address space is larger than the available physical RAM?

Consider a realistic scenario:

- Your laptop has 16 GB of physical RAM
- You’re running: Chrome (50 tabs using 8 GB), an IDE (3 GB), a database (4 GB), and a video editor (6 GB)
- Total virtual memory needed: 21 GB
- Physical RAM available: 16 GB

How does this work? The answer is virtual memory in its fullest sense—not just address translation, but the ability to transparently use disk storage as an extension of physical RAM.

The Core Insight

Not all of a program’s virtual memory needs to be in physical RAM simultaneously. Most programs exhibit locality of reference:

- They access a small subset of their code repeatedly (loops, frequently called functions)
- They work with a limited working set of data at any given time
- Large data structures are often accessed sequentially or in predictable patterns

Virtual memory exploits this behavior through demand paging—loading pages into physical RAM only when they’re actually accessed, and evicting unused pages to disk when memory is scarce.

What This Chapter Covers

This chapter explores how operating systems implement virtual memory to create the illusion of abundant memory:

  1. Demand Paging: How pages are loaded on-demand from disk
  2. Page Replacement: Which pages to evict when memory is full
  3. Thrashing: When the system spends more time paging than executing
  4. Advanced Techniques: Copy-on-Write, memory-mapped files, shared memory
  5. Performance Optimization: Reducing page faults and improving locality

By the end, you’ll understand how your computer runs dozens of programs simultaneously, each thinking it has gigabytes of memory, on hardware with limited physical RAM.


Figure 2.1: The virtual memory illusion: each process sees a private, contiguous 64-bit address space far larger than physical RAM. The OS and MMU transparently map virtual pages to physical frames, with inactive pages spilled to disk.

2.2 Demand Paging: Loading Pages on Demand

2.2.1 The Basic Concept

Demand paging is the technique of loading pages into physical memory only when they’re accessed, not when the program starts [Denning, 1970].

Figure 2.2: Demand paging page fault handling flow: TLB hit returns physical address in 1–5 cycles; TLB miss triggers a page table walk; if the page is absent, the OS page fault handler loads it from disk/swap, costing millions of cycles.

2.2.2 How Demand Paging Works

Step-by-Step Process:

  1. Program Starts: The OS builds the process's page tables but marks its pages as not present (P=0); little or no code or data is loaded into RAM yet.

  2. First Instruction Fetch: The CPU fetches the first instruction. Its page is not present, so the MMU raises a page fault.

  3. Page Fault Handler (OS)

    1. Save CPU state (registers, program counter)
    2. Check if address is valid (within process's virtual address space)
    3. If invalid → Segmentation fault (kill process)
    4. If valid:
       a. Find a free physical frame (or evict a page)
       b. Load page from disk into that frame
       c. Update page table: frame number, Present=1
    5. Restore CPU state
    6. Return from exception
    7. CPU retries the instruction (now succeeds)
  4. Execution Continues: Each later access to a not-yet-resident page repeats the same fault-and-load cycle; pages that are never touched are never loaded. A sketch of the handler's bookkeeping appears below.
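
The bookkeeping in steps 2 through 4 can be sketched as a tiny user-space simulation. This illustrates the logic only, not real OS code: the page table is an array of present/frame pairs, and the "disk read" is just a comment.

#include <stdio.h>
#include <stdbool.h>

#define NUM_PAGES 8                 /* virtual pages in a toy address space */

struct pte { bool present; int frame; };

static struct pte page_table[NUM_PAGES];   /* all pages start out not present */
static int next_free_frame = 0;
static int faults = 0;

/* Touch a virtual page, loading it on first access (steps 2-4 above) */
static void touch(int vpn) {
    if (!page_table[vpn].present) {                  /* page fault            */
        faults++;
        int frame = next_free_frame++;               /* step 4a: pick a frame */
        /* step 4b: a real OS would now read the page from disk into frame    */
        page_table[vpn] = (struct pte){ .present = true, .frame = frame };
        printf("fault: loaded page %d into frame %d\n", vpn, frame);
    }
    /* the access now proceeds through page_table[vpn].frame */
}

int main(void) {
    int refs[] = {0, 1, 0, 2, 1, 0, 3};              /* simulated accesses */
    for (unsigned i = 0; i < sizeof refs / sizeof refs[0]; i++)
        touch(refs[i]);
    printf("7 accesses, %d page faults\n", faults);  /* only 4 distinct pages fault */
    return 0;
}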

2.2.3 Page Table Entries Revisited

Page table entries must distinguish between three states:

State 1: Present in Memory

┌─────────┬───┬───┬───┬───┬───┬───┬───┬───────────────────┐
│  Frame  │ P │ D │ A │ R │ W │ X │ U │    (unused)       │
│ Number  │ 1 │   │   │   │   │   │   │                   │
└─────────┴───┴───┴───┴───┴───┴───┴───┴───────────────────┘
P=1: Page is in physical memory at specified frame

State 2: Swapped to Disk

┌─────────────────────────────┬───┬─────────────────────────┐
│    Disk Block Number        │ P │     (disk metadata)      │
│   (where page is stored)    │ 0 │                          │
└─────────────────────────────┴───┴──────────────────────────┘
P=0: Page is on disk, not in memory

State 3: Never Allocated

┌─────────────────────────────────────────────────────────┐
│                    All zeros                             │
│             (or special bit pattern)                     │
└──────────────────────────────────────────────────────────┘
Valid=0 in some architectures: Page has never been used
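
The three states are typically distinguished by testing a handful of bits. Below is a sketch of how an OS might decode a 64-bit PTE; the bit layout is illustrative, not that of any particular architecture:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_t;

/* Illustrative layout: bit 0 = Present, bit 1 = Dirty, bit 2 = Accessed,
   bits 12..51 = frame number (matching 4 KB pages)                      */
#define PTE_PRESENT      (1ULL << 0)
#define PTE_DIRTY        (1ULL << 1)
#define PTE_ACCESSED     (1ULL << 2)
#define PTE_FRAME_SHIFT  12

bool     pte_present(pte_t pte)   { return (pte & PTE_PRESENT) != 0; }
uint64_t pte_frame(pte_t pte)     { return pte >> PTE_FRAME_SHIFT; }

/* State 2: when Present = 0 but the entry is non-zero, the OS is free to
   reuse the remaining bits, e.g. to record the swap slot holding the page */
uint64_t pte_swap_slot(pte_t pte) { return pte >> 1; }

/* State 3: an all-zero entry means the page was never allocated */
bool pte_unallocated(pte_t pte)   { return pte == 0; }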

2.2.4 Types of Pages

Different types of memory pages behave differently with demand paging:

Code Pages (Text Segment):
- Loaded from the executable file on disk
- Read-only, never modified
- Can be discarded (not swapped) since the original is on disk
- Shared between multiple instances of the same program

Data Pages (Initialized Data):
- Loaded from the executable file
- Can be modified (writable)
- If modified, must be swapped to disk when evicted
- Private to each process

BSS Pages (Uninitialized Data):
- Not stored in the executable file
- Allocated on first access
- Filled with zeros (zero page)
- No disk I/O needed initially

Stack Pages:
- Allocated on demand as the stack grows
- Filled with zeros
- Can be swapped when inactive

Heap Pages:
- Allocated by malloc() / mmap()
- Initially filled with zeros (for security)
- Can be swapped when inactive

Anonymous Pages:
- Not backed by any file
- Created by mmap(MAP_ANONYMOUS)
- Must be swapped to disk if evicted

2.2.5 The Cost of Page Faults

Page faults are expensive. Let’s quantify the cost:

Operation                          Latency
────────────────────────────────────────────
L1 Cache hit                       ~1 ns
DRAM access                        ~70 ns
Page fault (page in memory cache)  ~1 μs
Page fault (from SSD)              ~100 μs
Page fault (from HDD)              ~10,000 μs (10 ms)
────────────────────────────────────────────

Example: Processing 1 million array elements

// Array: 4 MB (1 million × 4 bytes)
// Page size: 4 KB
// Pages needed: 1,000

int sum = 0;
for (int i = 0; i < 1000000; i++) {
    sum += array[i];
}

Scenario 1: All pages in memory (no page faults) - Memory accesses: 1,000,000 - Time: 1,000,000 × 70 ns = 70 ms

Scenario 2: All pages on SSD (1,000 page faults) - Page faults: 1,000 × 100 μs = 100 ms - Memory accesses: 1,000,000 × 70 ns = 70 ms - Total: 170 ms (2.4× slower)

Scenario 3: All pages on HDD (1,000 page faults) - Page faults: 1,000 × 10 ms = 10,000 ms (10 seconds!) - Memory accesses: 70 ms - Total: ~10 seconds (140× slower)

Lesson: Page faults to disk are catastrophically expensive. Minimizing page faults is critical for performance.
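
The arithmetic behind these scenarios is easy to reproduce. A quick back-of-the-envelope calculation using the latencies from the table above:

#include <stdio.h>

int main(void) {
    double accesses = 1e6;                 /* 1 million array elements      */
    double faults   = 1000;                /* 1,000 pages touched           */

    double mem_ms = accesses * 70e-9 * 1e3;        /* 70 ns per DRAM access */
    double ssd_ms = faults   * 100e-6 * 1e3;       /* 100 us per SSD fault  */
    double hdd_ms = faults   * 10e-3  * 1e3;       /* 10 ms per HDD fault   */

    printf("All in RAM : %6.0f ms\n", mem_ms);                    /*     70 ms */
    printf("SSD-backed : %6.0f ms (%4.1fx slower)\n",
           mem_ms + ssd_ms, (mem_ms + ssd_ms) / mem_ms);          /*    170 ms */
    printf("HDD-backed : %6.0f ms (%4.0fx slower)\n",
           mem_ms + hdd_ms, (mem_ms + hdd_ms) / mem_ms);          /* ~10,070 ms */
    return 0;
}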

2.2.6 Optimizing Demand Paging: Prefetching

Operating systems use prefetching to reduce page faults by predicting which pages will be needed:

Sequential Prefetching:

Process accesses page N
  ↓
OS predicts pages N+1, N+2, N+3 will be needed soon
  ↓
Load N+1, N+2, N+3 in background (before they're accessed)
  ↓
When process accesses N+1 → Already in memory (no fault)

Effectiveness: - Works great for sequential access patterns (file reading, array traversal) - Wasted effort for random access patterns - Modern OSes detect access patterns and adapt

Linux readahead:

# Check current readahead setting (in 512-byte sectors)
blockdev --getra /dev/sda
# Output: 256 (256 sectors = 128 KB prefetched)

# Increase readahead for better sequential performance (512 sectors = 256 KB)
blockdev --setra 512 /dev/sda
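
Applications can also hint their own access pattern to the kernel so that readahead kicks in sooner. A minimal sketch using posix_fadvise(); the file name is a placeholder:

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);    /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    /* Declare sequential access: the kernel can prefetch more aggressively */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[64 * 1024];
    ssize_t n;
    long total = 0;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;          /* data is usually already in the page cache */

    printf("read %ld bytes\n", total);
    close(fd);
    return 0;
}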

2.3 Page Replacement Algorithms

When physical memory is full and a page fault occurs, the OS must choose which page to evict (remove from memory) to make room for the new page. This is the page replacement problem [Belady, 1966].

Figure 2.3: Page replacement algorithms: FIFO evicts the oldest page (simple but can evict frequently used pages); LRU evicts the least recently used page (near-optimal, exploits temporal locality); Clock/Second-Chance approximates LRU using a reference bit and circular scan.

2.3.1 The Optimal Algorithm (Theoretical)

Belady’s Optimal Algorithm (OPT): Evict the page that will be used furthest in the future.

Example:

Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Physical frames: 3

Time:  1  2  3  4  1  2  5  1  2  3  4  5
     ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
  F1 │ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 3│ 3│ 3│
     ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
  F2 │  │ 2│ 2│ 2│ 2│ 2│ 2│ 2│ 2│ 2│ 4│ 4│
     ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
  F3 │  │  │ 3│ 4│ 4│ 4│ 5│ 5│ 5│ 5│ 5│ 5│
     └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
Faults: F  F  F  F        F        F  F

Page faults: 7

Why it’s optimal: At each eviction, it chooses the page that won’t be needed for the longest time, minimizing future faults.

Why it’s impractical: Requires knowledge of future memory references—impossible in real systems.

Value: Provides a theoretical lower bound for comparison. Any practical algorithm will have ≥ OPT page faults.

2.3.2 First-In-First-Out (FIFO)

Algorithm: Evict the page that has been in memory the longest.

Implementation:

// Circular list of resident pages; the oldest page sits at 'front'
#define NUM_FRAMES 3

struct page_queue {
    int frames[NUM_FRAMES];   // page numbers currently resident
    int front;                // index of the oldest page (next victim)
};

// Evict the oldest page and install new_page in its slot
int fifo_replace(struct page_queue *q, int new_page) {
    int victim = q->frames[q->front];
    q->frames[q->front] = new_page;            // new page reuses the victim's frame slot
    q->front = (q->front + 1) % NUM_FRAMES;    // the next-oldest page becomes the front
    return victim;                             // caller writes the victim back if dirty
}

Example (same reference string):

Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Physical frames: 3

Time:  1  2  3  4  1  2  5  1  2  3  4  5
     ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
  F1 │ 1│ 1│ 1│ 4│ 4│ 4│ 5│ 5│ 5│ 5│ 5│ 5│
     ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
  F2 │  │ 2│ 2│ 2│ 1│ 1│ 1│ 1│ 1│ 3│ 3│ 3│
     ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
  F3 │  │  │ 3│ 3│ 3│ 2│ 2│ 2│ 2│ 2│ 4│ 4│
     └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
Faults: F  F  F  F  F  F  F        F  F

Page faults: 9

Advantages: - Simple to implement - Fair (all pages treated equally) - Low overhead

Disadvantages: - Doesn’t consider page usage patterns - Can evict frequently used pages - Suffers from Belady’s Anomaly (more frames → more faults in some cases)

Belady’s Anomaly Example:

Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5

With 3 frames: 9 page faults
With 4 frames: 10 page faults (!)

More memory → worse performance (rare but possible with FIFO)
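
These counts are easy to check mechanically. A small, self-contained simulation of FIFO replacement on the reference string above reproduces both numbers (9 faults with 3 frames, 10 with 4):

#include <stdio.h>
#include <stdbool.h>

/* Count FIFO page faults for a reference string with a given frame count */
static int fifo_faults(const int *refs, int n, int num_frames) {
    int frames[16];                      /* resident pages (num_frames <= 16) */
    int oldest = 0, loaded = 0, faults = 0;

    for (int i = 0; i < n; i++) {
        bool hit = false;
        for (int j = 0; j < loaded; j++)
            if (frames[j] == refs[i]) { hit = true; break; }
        if (hit) continue;

        faults++;
        if (loaded < num_frames) {
            frames[loaded++] = refs[i];              /* free frame available   */
        } else {
            frames[oldest] = refs[i];                /* evict the oldest page  */
            oldest = (oldest + 1) % num_frames;
        }
    }
    return faults;
}

int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    int n = sizeof refs / sizeof refs[0];

    printf("FIFO, 3 frames: %d faults\n", fifo_faults(refs, n, 3));   /* 9  */
    printf("FIFO, 4 frames: %d faults\n", fifo_faults(refs, n, 4));   /* 10 */
    return 0;
}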

2.3.3 Least Recently Used (LRU)

Algorithm: Evict the page that hasn’t been used for the longest time.

Rationale: If a page hasn’t been used recently, it’s less likely to be used soon (temporal locality).

Example (same reference string):

Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Physical frames: 3

Time:  1  2  3  4  1  2  5  1  2  3  4  5
     ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
  F1 │ 1│ 1│ 1│ 4│ 4│ 4│ 5│ 5│ 5│ 3│ 3│ 3│
     ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
  F2 │  │ 2│ 2│ 2│ 1│ 1│ 1│ 1│ 1│ 1│ 4│ 4│
     ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
  F3 │  │  │ 3│ 3│ 3│ 2│ 2│ 2│ 2│ 2│ 2│ 5│
     └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
Faults: F  F  F  F  F  F  F        F  F  F

Page faults: 10

Note: on this particular reference string LRU actually does worse than FIFO (10 faults vs. 9). The cyclic pattern 1, 2, 3, 4, 1, 2, 5, ... is an adversarial case for recency-based eviction. On typical workloads with strong temporal locality, LRU comes close to OPT and clearly beats FIFO, which is why practical algorithms try to approximate it.

Perfect LRU Implementation (Impractical):

#include <stdint.h>

typedef uint64_t timestamp_t;
#define MAX_TIME   UINT64_MAX
#define NUM_FRAMES 1024

// Timestamp every memory reference
struct page {
    int frame_number;
    timestamp_t last_access;   // Updated on EVERY access -- this is the fatal cost
};

static struct page pages[NUM_FRAMES];

// Scan all frames and evict the one with the oldest timestamp
int lru_replace(void) {
    timestamp_t oldest = MAX_TIME;
    int victim = -1;

    for (int i = 0; i < NUM_FRAMES; i++) {
        if (pages[i].last_access < oldest) {
            oldest = pages[i].last_access;
            victim = i;
        }
    }
    return victim;
}

Problem: Updating timestamps on every memory access is impossibly expensive (billions of updates per second).

Hardware Support (x86 Accessed Bit): - Page table entries have an “Accessed” (A) bit - MMU sets A=1 automatically when page is accessed - OS can read and clear the A bit periodically - Provides approximation of recency without timestamps

Advantages: - Excellent performance (close to OPT) - Adapts to program behavior - No Belady’s Anomaly

Disadvantages: - Expensive to implement perfectly - Requires hardware support (Accessed bit)

2.3.4 Clock Algorithm (Second Chance / Not Recently Used)

Algorithm: Approximation of LRU using the hardware Accessed bit.

Implementation:

1. Organize pages in a circular list (like a clock)
2. Keep a "hand" pointer pointing to next candidate
3. When eviction needed:
   a. Check page at hand position
   b. If Accessed bit = 0 → Evict this page
   c. If Accessed bit = 1 → Set to 0, move hand forward
   d. Repeat until finding page with Accessed = 0

Visual Representation:

               Hand
                ↓
              Page 0
              (A=1)
         ↗            ↘
     Page 7            Page 1
     (A=0)             (A=1)
       ↑                 ↓
     Page 6            Page 2
     (A=1)             (A=0)
         ↖            ↙
              Page 3
              (A=1)

  (Pages 4 and 5 omitted; the hand sweeps clockwise)

Clock hand points to Page 0
Need to evict a page:
  - Page 0: A=1 → Set A=0, advance
  - Page 1: A=1 → Set A=0, advance
  - Page 2: A=0 → EVICT Page 2
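
In code, the second-chance scan is only a few lines. Here is a sketch of the victim-selection loop, assuming the OS has already copied the hardware Accessed bits into a per-frame `referenced` flag:

#include <stdbool.h>

#define NUM_FRAMES 8

struct frame {
    int  page;         /* page currently held in this frame               */
    bool referenced;   /* copy of the hardware Accessed bit for that page */
};

static struct frame frames[NUM_FRAMES];
static int hand = 0;   /* clock hand: next frame to examine               */

/* Second-chance policy: a referenced frame gets its bit cleared and is
   skipped once; the first unreferenced frame encountered is the victim. */
int clock_select_victim(void) {
    for (;;) {
        if (!frames[hand].referenced) {
            int victim = hand;
            hand = (hand + 1) % NUM_FRAMES;     /* advance past the victim */
            return victim;
        }
        frames[hand].referenced = false;        /* give it a second chance */
        hand = (hand + 1) % NUM_FRAMES;
    }
}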

Advantages: - Simple to implement - Low overhead - Good approximation of LRU - Used in many real systems (Linux, BSD)

Disadvantages: - Not as good as perfect LRU - May scan many pages before finding victim

Linux Implementation (Simplified):

// Linux uses two lists: active and inactive
struct lru_list {
    struct list_head active_list;
    struct list_head inactive_list;
};

// Pages move between lists based on Accessed bit
void age_pages() {
    for each page in active_list {
        if (page->accessed == 0)
            move_to_inactive_list(page);
        page->accessed = 0;  // Clear for next period
    }
}

2.3.5 Comparison of Algorithms

Performance on typical workloads:

Algorithm       Page Faults (relative)   Implementation Complexity             Used in Practice?
─────────────────────────────────────────────────────────────────────────────────────────────────
Optimal (OPT)   1.0× (best possible)     Impossible (needs future knowledge)   No (theoretical)
LRU             ~1.1×                    High (expensive to track)             No (too expensive)
Clock (NRU)     ~1.3×                    Low (uses Accessed bit)               Yes (most OSes)
FIFO            ~1.8×                    Very low (simple queue)               Rarely (too poor)
─────────────────────────────────────────────────────────────────────────────────────────────────

Source: approximate relative figures for typical workloads, drawing on Aho, Denning & Ullman (1971) and modern OS textbooks.

Real-World Choice: Most operating systems use variants of the Clock algorithm because it offers the best balance of performance and implementation cost.


2.4 Thrashing and the Working Set Model

Figure 2.4: Thrashing and the working set model: CPU utilization collapses when total working set sizes exceed physical RAM. The working set W(t, Δ) defines pages actively used in the last Δ references; allocating fewer frames than the working set triggers thrashing.

2.4.1 What is Thrashing?

Thrashing occurs when a system spends more time swapping pages between memory and disk than executing useful work [Denning, 1968].

Symptoms: - CPU utilization drops dramatically (from 80% to <10%) - Disk I/O spikes to 100% - System becomes unresponsive - Applications freeze or run extremely slowly

Figure 2.4 (above) visualizes this collapse: CPU utilization rises with the number of active processes up to a peak, then falls off a cliff once their combined working sets no longer fit in physical RAM.

2.4.2 Why Thrashing Occurs

Scenario: Too many active processes, insufficient physical memory

System: 4 GB RAM, 10 processes, each needs 500 MB working set
Total needed: 5 GB
Available: 4 GB
Deficit: 1 GB

What happens:
1. Process A runs, faults, loads pages
2. No free frames → evict pages from Process B
3. Process B scheduled, faults on recently evicted pages
4. No free frames → evict pages from Process C
5. Process C scheduled, faults...
6. Repeat endlessly (thrashing)

The Vicious Cycle:

Too many processes
   ↓
Each process gets too few frames
   ↓
Frequent page faults
   ↓
High disk I/O
   ↓
CPU idle waiting for I/O
   ↓
OS sees low CPU utilization
   ↓
OS adds MORE processes (!)
   ↓
Even worse thrashing

2.4.3 The Working Set Model

Working Set (Peter Denning, 1968): The set of pages a process actively uses during a time window.

Definition:
WS(t, Δ) = Set of pages referenced in time interval [t - Δ, t]

Where: - t = current time - Δ = working set window (e.g., 10,000 memory references)

Example:

Process accesses pages: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3

Working set window (Δ) = last 5 references

At position 10:
Last 5 references: 2, 5, 1, 2, 3
Working set: {1, 2, 3, 5} (4 pages)
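
Computing the working set from a reference string is straightforward; a small helper that reproduces the example above:

#include <stdio.h>
#include <stdbool.h>

/* |WS(t, delta)|: number of distinct pages among the last `delta`
   references ending at index t (0-based) of the reference string  */
int working_set_size(const int *refs, int t, int delta) {
    int start = t - delta + 1;
    if (start < 0) start = 0;

    int distinct = 0;
    for (int i = start; i <= t; i++) {
        bool seen_before = false;
        for (int j = start; j < i; j++)
            if (refs[j] == refs[i]) { seen_before = true; break; }
        if (!seen_before) distinct++;
    }
    return distinct;
}

int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3};
    /* Position 10 (index 9), window of 5: pages {2, 5, 1, 2, 3} -> 4 distinct */
    printf("working set size = %d\n", working_set_size(refs, 9, 5));
    return 0;
}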

Working Set Size Over Time:

Size
 |     ┌─────┐
 |    ╱       ╲     ┌────┐
 |   ╱         ╲   ╱      ╲
 |  ╱           ╲ ╱        ╲
 | ╱             ╲          ╲
 └────────────────────────────→ Time
   Init   Execute  Compute  I/O

Programs transition between phases: - Initialization: Large working set (loading libraries) - Steady state: Small working set (main loop) - Phase change: Working set spike (new functionality)

2.4.4 Preventing Thrashing

Strategy 1: Working Set Algorithm

For each process P:
  1. Estimate working set size WS(P)
  2. Allocate at least WS(P) frames to P
  3. If Σ WS(all processes) > Total RAM:
     → Suspend some processes (don't run them)
  4. Only run processes whose working sets fit

Strategy 2: Page Fault Frequency (PFF)

Monitor page fault rate for each process:

If fault rate > upper threshold:
  → Allocate more frames to this process
  
If fault rate < lower threshold:
  → Take away frames from this process
  
If no frames available:
  → Suspend process (swap out entirely)
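
A toy version of the PFF feedback loop, with a single process, a global frame pool, and illustrative thresholds (the numbers are placeholders, not values any real kernel uses):

#include <stdio.h>

#define PFF_UPPER 50.0      /* faults/second above which we add frames  */
#define PFF_LOWER  5.0      /* faults/second below which we take frames */

struct proc { const char *name; int frames; int suspended; };

static int free_frames = 4;                 /* global pool of spare frames */

void pff_adjust(struct proc *p, double fault_rate) {
    if (fault_rate > PFF_UPPER) {
        if (free_frames > 0) { free_frames--; p->frames++; }   /* grow allocation     */
        else                 { p->suspended = 1; }             /* no memory: swap out */
    } else if (fault_rate < PFF_LOWER && p->frames > 1) {
        p->frames--; free_frames++;                            /* shrink allocation   */
    }
}

int main(void) {
    struct proc p = { "worker", 8, 0 };
    double samples[] = { 80.0, 60.0, 20.0, 2.0, 1.0 };   /* observed fault rates */
    for (int i = 0; i < 5; i++) {
        pff_adjust(&p, samples[i]);
        printf("rate=%5.1f  frames=%d  suspended=%d\n", samples[i], p.frames, p.suspended);
    }
    return 0;
}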

Strategy 3: Linux OOM Killer When memory is critically low, Linux uses the Out-Of-Memory (OOM) killer:

// Simplified, illustrative OOM scoring -- the real kernel heuristic differs
struct process {
    int   pid;
    char *name;
    long  memory_usage;        // resident memory
    long  runtime;             // seconds of CPU time
    int   is_root_process;
    int   is_oom_protected;    // e.g. oom_score_adj = -1000
};

long calculate_oom_score(struct process *p) {
    long score = p->memory_usage;          // Higher usage → higher score
    score -= p->runtime * 10;              // Longer running → lower score
    if (p->is_root_process)  score = 0;    // Don't kill init
    if (p->is_oom_protected) score = 0;    // Don't kill critical processes
    return score;
}

// Kill the highest-scoring process
void oom_kill(void) {
    struct process *victim = find_highest_score();
    send_signal(victim, SIGKILL);
    log("OOM: Killed process %d (%s)", victim->pid, victim->name);
}

Real-World Example:

# Linux kernel message when OOM killer activates
[1234567.890] Out of memory: Kill process 12345 (chrome) score 925 or sacrifice child
[1234567.891] Killed process 12345 (chrome) total-vm:8388608kB, anon-rss:7340032kB, file-rss:0kB

2.4.5 Detecting Thrashing

Metrics to monitor:

# Linux: Check page fault rate
sar -B 1 10
# Output:
# pgfault/s: Page faults per second
# major/s: Major page faults (from disk)
# If major/s > 100, you might be thrashing

# Check swap activity
vmstat 1
# si: Swap in (KB/s)
# so: Swap out (KB/s)
# If si + so > 1000 KB/s sustained, investigate

# Check if processes are waiting for I/O
top
# Look for high %wa (wait for I/O)
# If %wa > 30%, likely I/O bound (possibly thrashing)

2.5 Copy-on-Write (COW)

2.5.1 The Problem COW Solves

When a process calls fork() to create a child process, traditionally the OS would:

  1. Copy all of the parent’s memory pages to the child
  2. This could be hundreds of megabytes or gigabytes
  3. Very slow and wasteful if the child immediately calls exec() to run a new program

Example:

#include <stdlib.h>
#include <unistd.h>

// Parent process using 2 GB of memory
int main(void) {
    char *big_buffer = malloc(2UL * 1024 * 1024 * 1024);  // 2 GB (UL suffix avoids int overflow)
    // ... fill buffer with data ...

    pid_t pid = fork();  // Create child

    if (pid == 0) {
        // Child process
        execl("/bin/ls", "ls", (char *)NULL);  // Immediately replace image with a new program
        // All that copied memory would have been wasted!
    }
    return 0;
}

Traditional approach: Copy 2 GB, then immediately discard it when exec() is called.

Figure 2.5: Copy-on-Write (COW): after fork(), parent and child share physical pages marked read-only. On the first write by either process, the MMU raises a protection fault; the OS copies only the written page to a new frame, leaving all other pages shared.

2.5.2 How Copy-on-Write Works

Copy-on-Write (COW): Initially share memory pages between parent and child, copying only when one writes to a page.

Implementation:

1. Parent calls fork()
2. Child's page table created
3. Both parent and child page tables point to SAME physical frames
4. All pages marked as read-only (even if originally writable)
5. Parent and child run

6. Either process tries to WRITE to a shared page
7. MMU detects write to read-only page → Page Fault
8. OS page fault handler:
   a. Recognize this is COW fault
   b. Allocate new physical frame
   c. Copy page contents to new frame
   d. Update faulting process's page table to new frame
   e. Mark the new copy writable; the original frame can also be made
      writable once only one process still maps it
9. Retry the write instruction (now succeeds)

Visual Example:

Before fork():
Parent Process
  Virtual Page 0 → Physical Frame 100 (R/W)
  Virtual Page 1 → Physical Frame 101 (R/W)

After fork() (COW):
Parent Process                Child Process
  VP 0 → PF 100 (R-only)       VP 0 → PF 100 (R-only)  } Shared
  VP 1 → PF 101 (R-only)       VP 1 → PF 101 (R-only)  } Shared

Parent writes to VP 0:
  - Page fault (write to read-only)
  - Allocate new frame 200
  - Copy frame 100 → frame 200
  - Update parent's PT: VP 0 → PF 200 (R/W)
  
After parent's write:
Parent Process                Child Process
  VP 0 → PF 200 (R/W)          VP 0 → PF 100 (R/W)     } Private
  VP 1 → PF 101 (R-only)       VP 1 → PF 101 (R-only)  } Still shared
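
The sharing is easy to observe from user space: after fork(), a write by the child lands in a private copy and never shows up in the parent. A small demonstration (the 64 MB size is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t size = 64UL * 1024 * 1024;          /* 64 MB buffer                   */
    char *buf = malloc(size);
    memset(buf, 'P', size);                    /* parent fills it with 'P'       */

    pid_t pid = fork();                        /* cheap: only page tables copied */
    if (pid == 0) {
        buf[0] = 'C';                          /* write -> COW fault -> private copy */
        printf("child  sees: %c\n", buf[0]);   /* prints 'C'                     */
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    printf("parent sees: %c\n", buf[0]);       /* still 'P': the child's write
                                                  copied the page instead of
                                                  modifying the shared frame     */
    free(buf);
    return 0;
}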

2.5.3 Benefits of COW

Performance:

Traditional fork() of 1 GB process:
  - Copy 1 GB = 250,000 page copies (at 4 KB/page)
  - Time: ~500 ms (at 2 GB/s memory bandwidth)

COW fork() of 1 GB process:
  - Copy last-level page tables only = ~2 MB (262,144 PTEs × 8 bytes)
  - Time: ~1 ms
  
Speed improvement: 500× faster!

Memory Efficiency:

If child immediately calls exec():
  - Traditional: Wasted 1 GB of copy
  - COW: Wasted 0 bytes (no copies made)

If parent and child both read (common case):
  - Traditional: Uses 2 GB (duplicate memory)
  - COW: Uses 1 GB (shared)

Real-World Impact: Modern systems heavily use fork() + exec(): - Shell executing commands - Web servers spawning workers - Build systems compiling code

COW makes these operations practical.

2.5.4 COW in Other Contexts

File Systems (ZFS, Btrfs):

# Create snapshot of filesystem
zfs snapshot pool/filesystem@snapshot

# Snapshot is instantaneous (COW)
# Original and snapshot share blocks
# Blocks copied only when modified

Containers (Docker):

# Docker image layers use COW
docker run ubuntu  # Shares base image with all containers
# Each container has COW layer for changes
# Saves gigabytes of disk space

Virtual Machines:

Base disk image: windows.qcow2 (20 GB)
VM1 snapshot: vm1.qcow2 (points to windows.qcow2, COW)
VM2 snapshot: vm2.qcow2 (points to windows.qcow2, COW)

All VMs share base image, copy blocks only when writing

2.6 Memory-Mapped Files

2.6.1 Traditional File I/O vs. Memory Mapping

Traditional File I/O:

int fd = open("data.txt", O_RDONLY);
char buffer[4096];
read(fd, buffer, 4096);  // Copy from kernel buffer to user buffer
// Work with buffer
close(fd);

Data path: Disk → Kernel buffer → User buffer (two copies)

Memory-Mapped I/O:

int fd = open("data.txt", O_RDONLY);
char *ptr = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
// ptr[x] directly accesses file contents
// No explicit read() call needed
munmap(ptr, file_size);
close(fd);

Data path: Disk → Kernel buffer → Mapped into user address space (zero copy)

Figure 2.6: Memory-mapped files vs. traditional file I/O: traditional I/O copies data through kernel buffers (two copies per read); mmap() maps file pages directly into the process address space, allowing zero-copy access and sharing the page cache across processes.

2.6.2 How Memory Mapping Works

Step-by-Step:

1. Application calls mmap(file, size, ...)
2. OS creates VMA (Virtual Memory Area) in process address space
3. VMA marked as "file-backed" (points to file)
4. No pages loaded yet (demand paging)
5. Application accesses ptr[offset]
6. Page fault (page not present)
7. OS reads page from file into memory
8. Maps page into process address space
9. Access succeeds

Page Table Entry for Memory-Mapped File:

┌──────────────┬───┬───┬───┬────────────────────────┐
│ Frame Number │ P │ D │ A │  File: data.txt        │
│              │   │   │   │  Offset: 12288         │
└──────────────┴───┴───┴───┴────────────────────────┘

If evicted: no need to write to swap; just drop the page (the original file is the backing store)
If dirty (D=1): write the modified page back to the original file first

(The file name and offset are drawn inside the PTE here for illustration; in practice they are recorded in the VMA describing the mapping, while the PTE holds only the frame number and flags.)

2.6.3 Advantages of Memory Mapping

1. Simplified Programming:

// Traditional I/O (complex)
while (offset < file_size) {
    bytes_read = read(fd, buffer, BUFFER_SIZE);
    if (bytes_read <= 0) break;
    process_data(buffer, bytes_read);
    offset += bytes_read;
}

// Memory mapping (simple)
for (int i = 0; i < file_size; i++) {
    process_byte(file_ptr[i]);  // Just like an array!
}

2. Zero-Copy I/O: No copying between kernel and user space—direct access to kernel’s page cache.

3. Sharing Between Processes:

// Process A
int fd = open("shared.dat", O_RDWR);
void *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

// Process B (different process)
int fd = open("shared.dat", O_RDWR);
void *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

// Both processes see the same physical pages
// Changes by Process A are visible to Process B

4. Lazy Loading: Large files can be mapped without loading entire file into memory. Pages loaded on-demand.
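
Putting these together, here is a self-contained sketch that counts the lines of a file through a mapping; the file is paged in lazily as the loop touches each page, and nothing is ever copied into a user buffer:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    long lines = 0;
    for (off_t i = 0; i < st.st_size; i++)   /* each new 4 KB page faults in on demand */
        if (data[i] == '\n')
            lines++;

    printf("%ld lines\n", lines);
    munmap(data, st.st_size);
    close(fd);
    return 0;
}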

2.6.4 Use Cases

Databases:

// SQLite, LMDB, and many others use mmap
void *db_file = mmap(NULL, db_size, PROT_READ|PROT_WRITE, 
                     MAP_SHARED, db_fd, 0);
// Treat file as in-memory database
// OS handles caching automatically

Shared Libraries:

# All processes using libc.so share the same physical pages
$ cat /proc/1234/maps | grep libc
7f1234567000-7f1234700000 r-xp 00000000 08:01 12345  /lib/x86_64-linux-gnu/libc.so.6

# Code pages shared, data pages COW

Inter-Process Communication (Shared Memory):

// Modern POSIX shared memory
int shm_fd = shm_open("/my_shm", O_CREAT|O_RDWR, 0666);
ftruncate(shm_fd, 4096);
void *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, shm_fd, 0);
// Fastest IPC mechanism (no copies, just shared pages)

2.7 Advanced Topics

Figure 2.7: Huge pages and NUMA: 2 MB huge pages reduce TLB pressure by covering 512× more virtual address space per entry; NUMA architecture places memory banks physically close to specific CPU sockets, with cross-node accesses incurring 1.5–2× higher latency.

2.7.1 Huge Pages / Large Pages

Modern systems support multiple page sizes:

Standard Pages: 4 KB (x86-64)
Huge Pages: 2 MB, 1 GB (x86-64)

Benefits of Huge Pages:

Scenario: 1 GB memory region

With 4 KB pages:
  - Pages needed: 262,144
  - TLB entries needed: 262,144 (won't fit!)
  - TLB misses: Constant

With 2 MB pages:
  - Pages needed: 512
  - TLB entries needed: 512 (fits in L2 TLB)
  - TLB misses: Rare

Performance improvement: 2-4× for memory-intensive workloads

Enabling Huge Pages (Linux):

# Transparent Huge Pages (automatic)
echo always > /sys/kernel/mm/transparent_hugepage/enabled

# Explicit huge pages
echo 1024 > /proc/sys/vm/nr_hugepages  # Allocate 1024 × 2MB = 2GB

Trade-offs: - ✓ Fewer TLB misses - ✓ Smaller page tables - ✗ More internal fragmentation - ✗ Slower allocation (finding contiguous 2 MB)
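
From user space, huge pages can be requested either explicitly (MAP_HUGETLB, which needs the hugepage pool configured as shown above) or as a hint for transparent huge pages (madvise with MADV_HUGEPAGE). A Linux-specific sketch:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;                       /* 1 GB region */

    /* Explicit huge pages: fails unless nr_hugepages has been reserved */
    void *huge = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (huge == MAP_FAILED)
        perror("mmap(MAP_HUGETLB)");
    else
        munmap(huge, len);

    /* Transparent huge pages: a hint the kernel may or may not honor */
    void *thp = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (thp != MAP_FAILED) {
        madvise(thp, len, MADV_HUGEPAGE);
        memset(thp, 0, len);                      /* touching the region allocates pages */
        munmap(thp, len);
    }
    return 0;
}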

2.7.2 Memory Overcommitment

Linux allows allocating more virtual memory than physical RAM + swap:

# Check overcommit mode
cat /proc/sys/vm/overcommit_memory
# 0: Heuristic (default)
# 1: Always allow
# 2: Never overcommit (strict accounting)

# Example: System with 16 GB RAM, 8 GB swap
# Can allocate 100 GB virtual memory if overcommit=1
# OS assumes not all will be used simultaneously

Risk: If all processes actually use their allocated memory, OOM killer activates.

2.7.3 NUMA (Non-Uniform Memory Access)

Modern multi-socket servers have NUMA:

CPU 0 ←fast→ Memory 0
  ↓slow
CPU 1 ←fast→ Memory 1

Accessing local memory: 70 ns
Accessing remote memory: 140 ns (2× slower)

NUMA-Aware Allocation:

#include <numa.h>   // libnuma; link with -lnuma

// Linux: allocate memory on the local NUMA node
void *ptr = numa_alloc_local(size);

// Or restrict the thread to node 0's CPUs; with the default first-touch
// policy, subsequent allocations are backed by node 0's memory
numa_run_on_node(0);
void *ptr2 = malloc(size);  // pages come from Node 0 when first touched

2.8 Performance Optimization Strategies

2.8.1 Reduce Page Faults

Technique 1: Improve Locality

// Bad: Column-major access (poor locality)
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        sum += matrix[i][j];  // Jumps between pages
    }
}

// Good: Row-major access (good locality)
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        sum += matrix[i][j];  // Sequential within pages
    }
}

Technique 2: Use mlock() for Critical Pages

// Prevent pages from being swapped out
void *critical_data = malloc(size);
if (mlock(critical_data, size) == 0) {
    // Pages are now resident and pinned: no major (disk-backed)
    // page faults on this range until munlock() or process exit
}

Technique 3: Prefetch

// GCC built-in
__builtin_prefetch(&data[i + 64], 0, 3);  // Prefetch for reading

2.8.2 Monitor Performance

Linux Tools:

# Page fault statistics
ps -o min_flt,maj_flt,cmd -p <pid>
# min_flt: Minor faults (page in cache)
# maj_flt: Major faults (from disk)

# System-wide memory stats
vmstat 1

# Detailed process memory map
pmap -X <pid>

# Memory profiling with perf
perf stat -e page-faults,minor-faults,major-faults ./program

2.9 Chapter Summary

Key Concepts:

  1. Demand Paging: Pages loaded from disk only when accessed, enabling programs larger than physical RAM.

  2. Page Replacement: When memory is full, OS must choose which page to evict. Clock algorithm offers best practical performance.

  3. Thrashing: Occurs when system spends more time swapping than executing. Prevented by limiting active processes to those whose working sets fit in RAM.

  4. Copy-on-Write: Efficient fork() implementation that shares pages until written, making process creation 500× faster.

  5. Memory-Mapped Files: Map files directly into address space, simplifying I/O and enabling zero-copy access.

  6. Huge Pages: Reduce TLB misses for memory-intensive workloads (2-4× speedup).

Performance Principles:

- Minimize major page faults: a single disk-backed fault costs as much as tens of thousands of ordinary memory accesses.
- Preserve locality of reference: access data in the order it is laid out in memory so each loaded page is fully used.
- Keep the working set resident: a process whose working set does not fit in its allocated frames will thrash.
- Prefer shared, file-backed, and lazily allocated memory (COW, mmap) over private copies.

Real-World Impact:

These virtual memory techniques enable: - Running dozens of programs on limited RAM - Instant process creation (fork()) - Efficient large file processing - Shared libraries saving gigabytes of memory - Databases performing well with files larger than RAM


Chapter 3 will explore how these concepts are implemented in hardware: - Page table structures (single-level, multi-level, inverted) - Translation Lookaside Buffer (TLB) architecture - Hardware page table walkers - Different architectures’ approaches (x86, ARM, RISC-V)

One hardware optimisation introduced in Chapter 3 is especially relevant to the page-walk latency figures seen throughout this book: the Page Walk Cache (PWC) (§3.7.2). When the TLB misses, the hardware page walker must traverse multiple levels of the page table stored in main memory — four DRAM reads on x86-64 in the worst case. The PWC caches the upper-level page-directory entries (PML4E, PDPTE, PDE on x86-64) so that a typical TLB miss requires only a single final DRAM access rather than four, reducing page-walk latency by 50–80% (Barr, Cox & Rixner, 2011; Bhattacharjee, Lustig & Martonosi, 2011). Chapter 3 also covers ARM’s TTBR0/TTBR1 dual-register design (§3.7.3), which splits the address space so that user and kernel translations never share TLB entries, and the VMID+ASID tagging scheme that allows multiple VMs and processes to coexist in the TLB without flushing on every context switch.

Chapter 4 will cover OS-specific implementations: - Linux memory management (mm_struct, page allocator, kswapd) - Windows memory manager - How OSes coordinate with MMU hardware


References

  1. Denning, P. J. (1968). “The working set model for program behavior”. Communications of the ACM, 11(5), 323-333.

  2. Denning, P. J. (1970). “Virtual memory”. Computing Surveys, 2(3), 153-189.

  3. Belady, L. A. (1966). “A study of replacement algorithms for a virtual-storage computer”. IBM Systems Journal, 5(2), 78-101.

  4. Aho, A. V., Denning, P. J., & Ullman, J. D. (1971). “Principles of optimal page replacement”. Journal of the ACM, 18(1), 80-93.

  5. Bhattacharjee, A., & Lustig, D. (2017). Architectural and Operating System Support for Virtual Memory. Morgan & Claypool.

  6. Gorman, M. (2004). Understanding the Linux Virtual Memory Manager. Prentice Hall.

  7. Love, R. (2010). Linux Kernel Development (3rd ed.). Addison-Wesley. Chapter 15: The Page Cache and Page Writeback.

  8. Silberschatz, A., Galvin, P. B., & Gagne, G. (2018). Operating System Concepts (10th ed.). Wiley. Chapter 9: Main Memory, Chapter 10: Virtual Memory. ISBN: 978-1-119-45862-5.


Next Chapter: Page Table Structures and TLB Architecture - Hardware implementation of address translation.