“Virtual memory is a powerful illusion—programs believe they have unlimited, contiguous memory, when in reality they’re sharing limited physical RAM with dozens of other processes.”
In Chapter 1, we learned that virtual memory gives each process its own private address space, with the MMU translating virtual addresses to physical addresses. But we left an important question unanswered: What happens when a program’s virtual address space is larger than the available physical RAM?
Consider a realistic scenario: - Your laptop has 16 GB of physical RAM - You’re running: Chrome (50 tabs using 8 GB), IDE (3 GB), database (4 GB), video editor (6 GB) - Total virtual memory needed: 21 GB - Physical RAM available: 16 GB
How does this work? The answer is virtual memory in its fullest sense—not just address translation, but the ability to transparently use disk storage as an extension of physical RAM.
Not all of a program’s virtual memory needs to be in physical RAM simultaneously. Most programs exhibit locality of reference: - They access a small subset of their code repeatedly (loops, frequently called functions) - They work with a limited working set of data at any given time - Large data structures are often accessed sequentially or in predictable patterns
Virtual memory exploits this behavior through demand paging—loading pages into physical RAM only when they’re actually accessed, and evicting unused pages to disk when memory is scarce.
This chapter explores how operating systems implement virtual memory to create the illusion of abundant memory:
By the end, you’ll understand how your computer runs dozens of programs simultaneously, each thinking it has gigabytes of memory, on hardware with limited physical RAM.
Demand paging is the technique of loading pages into physical memory only when they’re accessed, not when the program starts [Denning, 1970].
Step-by-Step Process:
Program Starts
First Instruction Fetch
Page Fault Handler (OS)
1. Save CPU state (registers, program counter)
2. Check if address is valid (within process's virtual address space)
3. If invalid → Segmentation fault (kill process)
4. If valid:
a. Find a free physical frame (or evict a page)
b. Load page from disk into that frame
c. Update page table: frame number, Present=1
5. Restore CPU state
6. Return from exception
7. CPU retries the instruction (now succeeds)
Execution Continues
Page table entries must distinguish between three states:
State 1: Present in Memory
┌─────────┬───┬───┬───┬───┬───┬───┬───┬───────────────────┐
│ Frame │ P │ D │ A │ R │ W │ X │ U │ (unused) │
│ Number │ 1 │ │ │ │ │ │ │ │
└─────────┴───┴───┴───┴───┴───┴───┴───┴───────────────────┘
P=1: Page is in physical memory at specified frame
State 2: Swapped to Disk
┌─────────────────────────────┬───┬──────────────────────────┐
│ Disk Block Number │ P │ (disk metadata) │
│ (where page is stored) │ 0 │ │
└─────────────────────────────┴───┴──────────────────────────┘
P=0: Page is on disk, not in memory
State 3: Never Allocated
┌──────────────────────────────────────────────────────────┐
│ All zeros │
│ (or special bit pattern) │
└──────────────────────────────────────────────────────────┘
Valid=0 in some architectures: Page has never been used
Different types of memory pages behave differently with demand paging:
Code Pages (Text Segment): - Loaded from executable file on disk - Read-only, never modified - Can be discarded (not swapped) since original is on disk - Shared between multiple instances of same program
Data Pages (Initialized Data): - Loaded from executable file - Can be modified (writable) - If modified, must be swapped to disk when evicted - Private to each process
BSS Pages (Uninitialized Data): - Not stored in executable file - Allocated on first access - Filled with zeros (zero page) - No disk I/O needed initially
Stack Pages: - Allocated on demand as stack grows - Filled with zeros - Can be swapped when inactive
Heap Pages: - Allocated by malloc() / mmap() - Initially filled with zeros (for security) - Can be swapped when inactive
Anonymous Pages: - Not backed by any file - Created by mmap(MAP_ANONYMOUS) - Must be swapped to disk if evicted
Page faults are expensive. Let’s quantify the cost:
Operation Latency
────────────────────────────────────────────
L1 Cache hit ~1 ns
DRAM access ~70 ns
Page fault (page in memory cache) ~1 μs
Page fault (from SSD) ~100 μs
Page fault (from HDD) ~10,000 μs (10 ms)
────────────────────────────────────────────
Example: Processing 1 million array elements
// Array: 4 MB (1 million × 4 bytes)
// Page size: 4 KB
// Pages needed: 1,000
int sum = 0;
for (int i = 0; i < 1000000; i++) {
sum += array[i];
}
Scenario 1: All pages in memory (no page faults) - Memory accesses: 1,000,000 - Time: 1,000,000 × 70 ns = 70 ms
Scenario 2: All pages on SSD (1,000 page faults) - Page faults: 1,000 × 100 μs = 100 ms - Memory accesses: 1,000,000 × 70 ns = 70 ms - Total: 170 ms (2.4× slower)
Scenario 3: All pages on HDD (1,000 page faults) - Page faults: 1,000 × 10 ms = 10,000 ms (10 seconds!) - Memory accesses: 70 ms - Total: ~10 seconds (140× slower)
Lesson: Page faults to disk are catastrophically expensive. Minimizing page faults is critical for performance.
Operating systems use prefetching to reduce page faults by predicting which pages will be needed:
Sequential Prefetching:
Process accesses page N
↓
OS predicts pages N+1, N+2, N+3 will be needed soon
↓
Load N+1, N+2, N+3 in background (before they're accessed)
↓
When process accesses N+1 → Already in memory (no fault)
Effectiveness: - Works great for sequential access patterns (file reading, array traversal) - Wasted effort for random access patterns - Modern OSes detect access patterns and adapt
Linux readahead:
# Check current readahead setting (in 512-byte sectors)
blockdev --getra /dev/sda
# Output: 256 (256 sectors × 512 bytes = 128 KB prefetched)
# Increase readahead for better sequential performance
blockdev --setra 512 /dev/sda
When physical memory is full and a page fault occurs, the OS must choose which page to evict (remove from memory) to make room for the new page. This is the page replacement problem [Belady, 1966].
Belady’s Optimal Algorithm (OPT): Evict the page that will be used furthest in the future.
Example:
Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Physical frames: 3
Time:     1  2  3  4  1  2  5  1  2  3  4  5
    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
F1  │ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 4│ 4│
    ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
F2  │  │ 2│ 2│ 2│ 2│ 2│ 2│ 2│ 2│ 3│ 3│ 3│
    ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
F3  │  │  │ 3│ 4│ 4│ 4│ 5│ 5│ 5│ 5│ 5│ 5│
    └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
Faults:   F  F  F  F        F        F  F
Page faults: 7
Why it’s optimal: At each eviction, it chooses the page that won’t be needed for the longest time, minimizing future faults.
Why it’s impractical: Requires knowledge of future memory references—impossible in real systems.
Value: Provides a theoretical lower bound for comparison. Any practical algorithm will have ≥ OPT page faults.
Algorithm: Evict the page that has been in memory the longest.
Implementation:
// Simple queue of page frames
struct page_queue {
int frames[NUM_FRAMES];
int front, rear;
};
int fifo_replace(struct page_queue *q) {
int victim = q->frames[q->front];
q->front = (q->front + 1) % NUM_FRAMES;
return victim;
}
Example (same reference string):
Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Physical frames: 3
Time:     1  2  3  4  1  2  5  1  2  3  4  5
    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
F1  │ 1│ 1│ 1│ 4│ 4│ 4│ 5│ 5│ 5│ 5│ 5│ 5│
    ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
F2  │  │ 2│ 2│ 2│ 1│ 1│ 1│ 1│ 1│ 3│ 3│ 3│
    ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
F3  │  │  │ 3│ 3│ 3│ 2│ 2│ 2│ 2│ 2│ 4│ 4│
    └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
Faults:   F  F  F  F  F  F  F        F  F
Page faults: 9
Advantages: - Simple to implement - Fair (all pages treated equally) - Low overhead
Disadvantages: - Doesn’t consider page usage patterns - Can evict frequently used pages - Suffers from Belady’s Anomaly (more frames → more faults in some cases)
Belady’s Anomaly Example:
Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
With 3 frames: 9 page faults
With 4 frames: 10 page faults (!)
More memory → worse performance (rare but possible with FIFO)
Algorithm: Evict the page that hasn’t been used for the longest time.
Rationale: If a page hasn’t been used recently, it’s less likely to be used soon (temporal locality).
Example (same reference string):
Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Physical frames: 3
Time:     1  2  3  4  1  2  5  1  2  3  4  5
    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
F1  │ 1│ 1│ 1│ 4│ 4│ 4│ 5│ 5│ 5│ 3│ 3│ 3│
    ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
F2  │  │ 2│ 2│ 2│ 1│ 1│ 1│ 1│ 1│ 1│ 4│ 4│
    ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
F3  │  │  │ 3│ 3│ 3│ 2│ 2│ 2│ 2│ 2│ 2│ 5│
    └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
Faults:   F  F  F  F  F  F  F        F  F  F
Page faults: 10
Note: This particular reference string is adversarial for LRU; here it actually does worse than FIFO (10 vs 9 faults). On real workloads with strong temporal locality, LRU typically performs far better than FIFO, and unlike FIFO it never suffers Belady's Anomaly.
Perfect LRU Implementation (Impractical):
// Timestamp every memory reference
struct page {
    int frame_number;
    timestamp_t last_access; // Updated on EVERY access
};
struct page pages[NUM_FRAMES]; // one entry per physical frame

int lru_replace() {
    timestamp_t oldest = MAX_TIME;
    int victim = -1;
    for (int i = 0; i < NUM_FRAMES; i++) {
        if (pages[i].last_access < oldest) {
            oldest = pages[i].last_access;
            victim = i;
        }
    }
    return victim;
}
Problem: Updating timestamps on every memory access is prohibitively expensive (billions of updates per second).
Hardware Support (x86 Accessed Bit): - Page table entries have an “Accessed” (A) bit - MMU sets A=1 automatically when page is accessed - OS can read and clear the A bit periodically - Provides approximation of recency without timestamps
Advantages: - Excellent performance (close to OPT) - Adapts to program behavior - No Belady’s Anomaly
Disadvantages: - Expensive to implement perfectly - Requires hardware support (Accessed bit)
Algorithm: Approximation of LRU using the hardware Accessed bit.
Implementation:
1. Organize pages in a circular list (like a clock)
2. Keep a "hand" pointer pointing to next candidate
3. When eviction needed:
a. Check page at hand position
b. If Accessed bit = 0 → Evict this page
c. If Accessed bit = 1 → Set to 0, move hand forward
d. Repeat until finding page with Accessed = 0
Visual Representation:
Page 0
(A=1)
↗ ↖
Page 7 Page 1
(A=0) (A=1)
↑ Hand → ↓
Page 6 Page 2
(A=1) (A=0)
↖ ↗
Page 3
(A=1)
Clock hand points to Page 0
Need to evict a page:
- Page 0: A=1 → Set A=0, advance
- Page 1: A=1 → Set A=0, advance
- Page 2: A=0 → EVICT Page 2
Advantages: - Simple to implement - Low overhead - Good approximation of LRU - Used in many real systems (Linux, BSD)
Disadvantages: - Not as good as perfect LRU - May scan many pages before finding victim
Linux Implementation (Simplified):
// Linux uses two lists: active and inactive
struct lru_list {
struct list_head active_list;
struct list_head inactive_list;
};
// Pages move between lists based on Accessed bit
void age_pages() {
for each page in active_list {
if (page->accessed == 0)
move_to_inactive_list(page);
page->accessed = 0; // Clear for next period
}
}
Performance on typical workloads:
| Algorithm | Page Faults (relative) | Implementation Complexity | Used in Practice? |
|---|---|---|---|
| Optimal (OPT) | 1.0× (best possible) | Impossible (needs future knowledge) | No (theoretical) |
| LRU | ~1.1× | High (expensive to track) | No (too expensive) |
| Clock (Second-Chance) | ~1.3× | Low (uses Accessed bit) | Yes (most OSes) |
| FIFO | ~1.8× | Very low (simple queue) | Rarely (too poor) |
Source: Relative figures are illustrative approximations, informed by classic replacement-policy analyses [Aho et al., 1971] and modern OS textbooks.
Real-World Choice: Most operating systems use variants of the Clock algorithm because it offers the best balance of performance and implementation cost.
Thrashing occurs when a system spends more time swapping pages between memory and disk than executing useful work [Denning, 1968].
Symptoms: - CPU utilization drops dramatically (from 80% to <10%) - Disk I/O spikes to 100% - System becomes unresponsive - Applications freeze or run extremely slowly
Visual Representation of System State:
Scenario: Too many active processes, insufficient physical memory
System: 4 GB RAM, 10 processes, each needs 500 MB working set
Total needed: 5 GB
Available: 4 GB
Deficit: 1 GB
What happens:
1. Process A runs, faults, loads pages
2. No free frames → evict pages from Process B
3. Process B scheduled, faults on recently evicted pages
4. No free frames → evict pages from Process C
5. Process C scheduled, faults...
6. Repeat endlessly (thrashing)
The Vicious Cycle:
Too many processes
↓
Each process gets too few frames
↓
Frequent page faults
↓
High disk I/O
↓
CPU idle waiting for I/O
↓
OS sees low CPU utilization
↓
OS adds MORE processes (!)
↓
Even worse thrashing
Working Set (Peter Denning, 1968): The set of pages a process actively uses during a time window.
Definition:
WS(t, Δ) = Set of pages referenced in time interval [t - Δ, t]
Where: - t = current time - Δ = working set window (e.g., 10,000 memory references)
Example:
Process accesses pages: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3
Working set window (Δ) = last 5 references
At position 10:
Last 5 references: 2, 5, 1, 2, 3
Working set: {1, 2, 3, 5} (4 pages)
Working Set Size Over Time:
Size
 |        ┌─────┐
 |       ╱       ╲         ┌────┐
 |      ╱         ╲       ╱      ╲
 |     ╱           ╲_____╱        ╲
 └─────────────────────────────────→ Time
    Init    Execute   Compute   I/O
Programs transition between phases: - Initialization: Large working set (loading libraries) - Steady state: Small working set (main loop) - Phase change: Working set spike (new functionality)
Strategy 1: Working Set Algorithm
For each process P:
1. Estimate working set size WS(P)
2. Allocate at least WS(P) frames to P
3. If Σ WS(all processes) > Total RAM:
→ Suspend some processes (don't run them)
4. Only run processes whose working sets fit
Strategy 2: Page Fault Frequency (PFF)
Monitor page fault rate for each process:
If fault rate > upper threshold:
→ Allocate more frames to this process
If fault rate < lower threshold:
→ Take away frames from this process
If no frames available:
→ Suspend process (swap out entirely)
Strategy 3: Linux OOM Killer When memory is critically low, Linux uses the Out-Of-Memory (OOM) killer:
// Simplified OOM scoring
int calculate_oom_score(process P) {
score = P.memory_usage; // Higher usage → higher score
score -= P.runtime * 10; // Longer running → lower score
if (P.is_root_process) score = 0; // Don't kill init
if (P.is_oom_protected) score = 0;// Don't kill critical processes
return score;
}
// Kill highest scoring process
void oom_kill() {
process victim = find_highest_score();
send_signal(victim, SIGKILL);
log("OOM: Killed process %d (%s)", victim.pid, victim.name);
}
Real-World Example:
# Linux kernel message when OOM killer activates
[1234567.890] Out of memory: Kill process 12345 (chrome) score 925 or sacrifice child
[1234567.891] Killed process 12345 (chrome) total-vm:8388608kB, anon-rss:7340032kB, file-rss:0kB
Metrics to monitor:
# Linux: Check page fault rate
sar -B 1 10
# Output:
# pgfault/s: Page faults per second
# major/s: Major page faults (from disk)
# If major/s > 100, you might be thrashing
# Check swap activity
vmstat 1
# si: Swap in (KB/s)
# so: Swap out (KB/s)
# If si + so > 1000 KB/s sustained, investigate
# Check if processes are waiting for I/O
top
# Look for high %wa (wait for I/O)
# If %wa > 30%, likely I/O bound (possibly thrashing)
When a process calls fork() to create a child process, traditionally the OS would: 1. Copy all of the parent’s memory pages to the child 2. This could be hundreds of megabytes or gigabytes 3. Very slow and wasteful if the child immediately calls exec() to run a new program
Example:
// Parent process using 2 GB of memory
int main() {
    char *big_buffer = malloc(2UL * 1024 * 1024 * 1024); // 2 GB (UL suffix avoids int overflow)
    // ... fill buffer with data ...
    pid_t pid = fork(); // Create child
    if (pid == 0) {
        // Child process
        execl("/bin/ls", "ls", (char *)NULL); // Immediately replace with new program
        // All that copied memory is wasted!
    }
}
Traditional approach: Copy 2 GB, then immediately discard it when exec() is called.
Copy-on-Write (COW): Initially share memory pages between parent and child, copying only when one writes to a page.
Implementation:
1. Parent calls fork()
2. Child's page table created
3. Both parent and child page tables point to SAME physical frames
4. All pages marked as read-only (even if originally writable)
5. Parent and child run
6. Either process tries to WRITE to a shared page
7. MMU detects write to read-only page → Page Fault
8. OS page fault handler:
a. Recognize this is COW fault
b. Allocate new physical frame
c. Copy page contents to new frame
d. Update faulting process's page table to new frame
e. Mark the new copy writable (the original frame becomes writable again once only one process still references it)
9. Retry the write instruction (now succeeds)
Visual Example:
Before fork():
Parent Process
Virtual Page 0 → Physical Frame 100 (R/W)
Virtual Page 1 → Physical Frame 101 (R/W)
After fork() (COW):
Parent Process Child Process
VP 0 → PF 100 (R-only) VP 0 → PF 100 (R-only) } Shared
VP 1 → PF 101 (R-only) VP 1 → PF 101 (R-only) } Shared
Parent writes to VP 0:
- Page fault (write to read-only)
- Allocate new frame 200
- Copy frame 100 → frame 200
- Update parent's PT: VP 0 → PF 200 (R/W)
After parent's write:
Parent Process Child Process
VP 0 → PF 200 (R/W) VP 0 → PF 100 (R/W) } Private
VP 1 → PF 101 (R-only) VP 1 → PF 101 (R-only) } Still shared
Performance:
Traditional fork() of 1 GB process:
- Copy 1 GB = 262,144 page copies (at 4 KB/page)
- Time: ~500 ms (at 2 GB/s memory bandwidth)
COW fork() of 1 GB process:
- Copy page tables only = ~250 KB
- Time: ~1 ms
Speed improvement: 500× faster!
Memory Efficiency:
If child immediately calls exec():
- Traditional: Wasted 1 GB of copy
- COW: Wasted 0 bytes (no copies made)
If parent and child both read (common case):
- Traditional: Uses 2 GB (duplicate memory)
- COW: Uses 1 GB (shared)
Real-World Impact: Modern systems heavily use fork() + exec(): - Shell executing commands - Web servers spawning workers - Build systems compiling code
COW makes these operations practical.
File Systems (ZFS, Btrfs):
# Create snapshot of filesystem
zfs snapshot pool/filesystem@snapshot
# Snapshot is instantaneous (COW)
# Original and snapshot share blocks
# Blocks copied only when modified
Containers (Docker):
# Docker image layers use COW
docker run ubuntu # Shares base image with all containers
# Each container has COW layer for changes
# Saves gigabytes of disk space
Virtual Machines:
Base disk image: windows.qcow2 (20 GB)
VM1 snapshot: vm1.qcow2 (points to windows.qcow2, COW)
VM2 snapshot: vm2.qcow2 (points to windows.qcow2, COW)
All VMs share base image, copy blocks only when writing
Traditional File I/O:
int fd = open("data.txt", O_RDONLY);
char buffer[4096];
read(fd, buffer, 4096); // Copy from kernel buffer to user buffer
// Work with buffer
close(fd);
Data path: Disk → Kernel buffer → User buffer (two copies)
Memory-Mapped I/O:
int fd = open("data.txt", O_RDONLY);
char *ptr = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
// ptr[x] directly accesses file contents
// No explicit read() call needed
munmap(ptr, file_size);
close(fd);
Data path: Disk → Kernel buffer → Mapped into user address space (zero copy)
Step-by-Step:
1. Application calls mmap(file, size, ...)
2. OS creates VMA (Virtual Memory Area) in process address space
3. VMA marked as "file-backed" (points to file)
4. No pages loaded yet (demand paging)
5. Application accesses ptr[offset]
6. Page fault (page not present)
7. OS reads page from file into memory
8. Maps page into process address space
9. Access succeeds
Page Table Entry for Memory-Mapped File:
┌──────────────┬───┬───┬───┬────────────────────────â”
│ Frame Number │ P │ D │ A │ File: data.txt │
│ │ │ │ │ Offset: 12288 │
└──────────────┴───┴───┴───┴────────────────────────┘
If evicted: Don't swap to swap file, just drop
If dirty (D=1): Write back to original file
1. Simplified Programming:
// Traditional I/O (complex)
while (offset < file_size) {
bytes_read = read(fd, buffer, BUFFER_SIZE);
if (bytes_read <= 0) break;
process_data(buffer, bytes_read);
offset += bytes_read;
}
// Memory mapping (simple)
for (int i = 0; i < file_size; i++) {
process_byte(file_ptr[i]); // Just like an array!
}
2. Zero-Copy I/O: No copying between kernel and user space—direct access to kernel’s page cache.
3. Sharing Between Processes:
// Process A
int fd = open("shared.dat", O_RDWR);
void *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
// Process B (different process)
int fd = open("shared.dat", O_RDWR);
void *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
// Both processes see the same physical pages
// Changes by Process A are visible to Process B
4. Lazy Loading: Large files can be mapped without loading the entire file into memory; pages are loaded on demand.
Databases:
// SQLite, LMDB, and many others use mmap
void *db_file = mmap(NULL, db_size, PROT_READ|PROT_WRITE,
MAP_SHARED, db_fd, 0);
// Treat file as in-memory database
// OS handles caching automatically
Shared Libraries:
# All processes using libc.so share the same physical pages
$ cat /proc/1234/maps | grep libc
7f1234567000-7f1234700000 r-xp 00000000 08:01 12345 /lib/x86_64-linux-gnu/libc.so.6
# Code pages shared, data pages COW
Inter-Process Communication (Shared Memory):
// Modern POSIX shared memory
int shm_fd = shm_open("/my_shm", O_CREAT|O_RDWR, 0666);
ftruncate(shm_fd, 4096);
void *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, shm_fd, 0);
// Fastest IPC mechanism (no copies, just shared pages)
Modern systems support multiple page sizes:
Standard Pages: 4 KB (x86-64)
Huge Pages: 2 MB, 1 GB (x86-64)
Benefits of Huge Pages:
Scenario: 1 GB memory region
With 4 KB pages:
- Pages needed: 262,144
- TLB entries needed: 262,144 (won't fit!)
- TLB misses: Constant
With 2 MB pages:
- Pages needed: 512
- TLB entries needed: 512 (fits in L2 TLB)
- TLB misses: Rare
Performance improvement: 2-4× for memory-intensive workloads
Enabling Huge Pages (Linux):
# Transparent Huge Pages (automatic)
echo always > /sys/kernel/mm/transparent_hugepage/enabled
# Explicit huge pages
echo 1024 > /proc/sys/vm/nr_hugepages # Allocate 1024 × 2MB = 2GB
Trade-offs: - ✓ Fewer TLB misses - ✓ Smaller page tables - ✗ More internal fragmentation - ✗ Slower allocation (finding contiguous 2 MB)
Linux allows allocating more virtual memory than physical RAM + swap:
# Check overcommit mode
cat /proc/sys/vm/overcommit_memory
# 0: Heuristic (default)
# 1: Always allow
# 2: Never overcommit (strict accounting)
# Example: System with 16 GB RAM, 8 GB swap
# Can allocate 100 GB virtual memory if overcommit=1
# OS assumes not all will be used simultaneously
Risk: If all processes actually use their allocated memory, the OOM killer activates.
Modern multi-socket servers have NUMA:
CPU 0 ←fast→ Memory 0
   ↕ slow
CPU 1 ←fast→ Memory 1
Accessing local memory: 70 ns
Accessing remote memory: 140 ns (2× slower)
NUMA-Aware Allocation:
// Linux: Allocate memory on local node
void *ptr = numa_alloc_local(size);
// Pin thread to CPU 0 and allocate from Node 0
numa_run_on_node(0);
void *ptr = malloc(size); // First touch from CPU on Node 0 allocates from Node 0
Technique 1: Improve Locality
// Bad: Column-major access (poor locality)
for (int j = 0; j < N; j++) {
for (int i = 0; i < N; i++) {
sum += matrix[i][j]; // Jumps between pages
}
}
// Good: Row-major access (good locality)
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
sum += matrix[i][j]; // Sequential within pages
}
}
Technique 2: Use mlock() for Critical Pages
// Prevent page from being swapped out
void *critical_data = malloc(size);
mlock(critical_data, size); // Lock in RAM
// Pages stay resident in RAM (no major faults from swap-out)
Technique 3: Prefetch
// GCC built-in
__builtin_prefetch(&data[i + 64], 0, 3); // Prefetch for reading
Linux Tools:
# Page fault statistics
ps -o min_flt,maj_flt,cmd -p <pid>
# min_flt: Minor faults (page in cache)
# maj_flt: Major faults (from disk)
# System-wide memory stats
vmstat 1
# Detailed process memory map
pmap -X <pid>
# Memory profiling with perf
perf stat -e page-faults,minor-faults,major-faults ./program
Key Concepts:
Demand Paging: Pages loaded from disk only when accessed, enabling programs larger than physical RAM.
Page Replacement: When memory is full, OS must choose which page to evict. Clock algorithm offers best practical performance.
Thrashing: Occurs when system spends more time swapping than executing. Prevented by limiting active processes to those whose working sets fit in RAM.
Copy-on-Write: Efficient fork() implementation that shares pages until written, making process creation 500× faster.
Memory-Mapped Files: Map files directly into address space, simplifying I/O and enabling zero-copy access.
Huge Pages: Reduce TLB misses for memory-intensive workloads (2-4× speedup).
Performance Principles:
Real-World Impact:
These virtual memory techniques enable: - Running dozens of programs on limited RAM - Instant process creation (fork()) - Efficient large file processing - Shared libraries saving gigabytes of memory - Databases performing well with files larger than RAM
Chapter 3 will explore how these concepts are implemented in hardware: - Page table structures (single-level, multi-level, inverted) - Translation Lookaside Buffer (TLB) architecture - Hardware page table walkers - Different architectures’ approaches (x86, ARM, RISC-V)
One hardware optimisation introduced in Chapter 3 is especially relevant to the page-walk latency figures seen throughout this book: the Page Walk Cache (PWC) (§3.7.2). When the TLB misses, the hardware page walker must traverse multiple levels of the page table stored in main memory — four DRAM reads on x86-64 in the worst case. The PWC caches the upper-level page-directory entries (PML4E, PDPTE, PDE on x86-64) so that a typical TLB miss requires only a single final DRAM access rather than four, reducing page-walk latency by 50–80% (Barr, Cox & Rixner, 2011; Bhattacharjee, Lustig & Martonosi, 2011). Chapter 3 also covers ARM’s TTBR0/TTBR1 dual-register design (§3.7.3), which splits the address space so that user and kernel translations never share TLB entries, and the VMID+ASID tagging scheme that allows multiple VMs and processes to coexist in the TLB without flushing on every context switch.
Chapter 4 will cover OS-specific implementations: - Linux memory management (mm_struct, page allocator, kswapd) - Windows memory manager - How OSes coordinate with MMU hardware
Denning, P. J. (1968). “The working set model for program behavior”. Communications of the ACM, 11(5), 323-333.
Denning, P. J. (1970). “Virtual memory”. Computing Surveys, 2(3), 153-189.
Belady, L. A. (1966). “A study of replacement algorithms for a virtual-storage computer”. IBM Systems Journal, 5(2), 78-101.
Aho, A. V., Denning, P. J., & Ullman, J. D. (1971). “Principles of optimal page replacement”. Journal of the ACM, 18(1), 80-93.
Bhattacharjee, A., & Lustig, D. (2017). Architectural and Operating System Support for Virtual Memory. Morgan & Claypool.
Gorman, M. (2004). Understanding the Linux Virtual Memory Manager. Prentice Hall.
Love, R. (2010). Linux Kernel Development (3rd ed.). Addison-Wesley. Chapter 15: The Page Cache and Page Writeback.
Silberschatz, A., Galvin, P. B., & Gagne, G. (2018). Operating System Concepts (10th ed.). Wiley. Chapter 9: Main Memory, Chapter 10: Virtual Memory. ISBN: 978-1-119-45862-5.
Next Chapter: Page Table Structures and TLB Architecture - Hardware implementation of address translation.