Chapter 8: Advanced MMU Topics - System Integration and Optimization

8.1 Introduction: The Crisis Point

Consider a production PostgreSQL database server with 24 GB of RAM. The monitoring dashboard shows 17 GB "available" memory — plenty of headroom, right? Then, without warning, the database process is killed. The system log contains one cryptic line: Out of memory: Killed process 4721 (postgres) score 823. How can a process be killed when 17 GB is "available"?

The answer lies in how the Linux kernel distinguishes between memory that is free and memory that is merely reclaimable. That 17 GB was almost entirely page cache — file data buffered in RAM to speed up reads. When a large memory allocation demand arrived, the kernel needed pages immediately, discovered that reclaiming cache required writing dirty pages to disk, could not complete reclaim fast enough, and triggered the OOM killer as a last resort. The database had done nothing wrong. The system's memory management cascade had failed to keep pace with demand.

This chapter bridges the hardware mechanisms covered in Chapters 1–7 with the software policies that drive real system behaviour. We have established how the MMU translates addresses (Chapter 3), how TLBs accelerate that translation (Chapter 4), and how page faults interrupt and redirect execution (Chapter 7). Now we examine what happens when memory genuinely runs short: which pages are chosen for eviction, how the kernel manages nested virtualisation faults, how hardware-software cooperation via accessed and dirty bits drives reclaim decisions, and what performance levers a systems programmer can pull when things go wrong.

8.1.1 Three Real-World Scenarios

Scenario 1 — The Redis fork() surprise. Redis performs a background save via fork(), relying on copy-on-write semantics. Immediately after the fork, both parent and child share all pages. Within milliseconds, the parent begins writing to nearly every key, triggering copy-on-write faults for each modified page. Memory usage doubles in seconds. The system that had 4 GB free now has 200 MB free and is thrashing.

Scenario 2 — The VM host boot storm. A cloud host brings up 32 virtual machines simultaneously. Each VM's two-stage address translation (§8.2) requires EPT entries for its physical memory. Before any EPT entries exist, every memory access inside each VM triggers a nested page fault — a VM exit, EPT entry allocation, and VM re-entry. With 32 VMs booting in parallel, the host is processing tens of thousands of EPT faults per second. Boot takes 5× longer than on bare metal.

Scenario 3 — The memory leak cascade. A long-running application leaks 100 MB/hour. After 48 hours, it has consumed 4.8 GB of additional virtual memory. Page tables for this memory themselves occupy 10 MB (§8.7 covers why). The kernel begins reclaiming pages. Some are clean and cheap to evict; others are dirty and require writeback, stalling the allocating thread. Eventually the OOM killer intervenes.

All three scenarios involve the same fundamental mechanisms: watermarks that trigger reclaim, LRU lists that determine eviction order, and the accessed and dirty bits that hardware sets to inform those decisions. Figure 8.1 shows how these mechanisms cascade from normal operation through increasing pressure to OOM intervention.

[Figure 8.1 rendered as text.] Memory pressure cascade, phase by phase: Phase 1 — Normal operation (free above the HIGH watermark, ~10%): kswapd sleeps; all allocations are served from the free page pool (~0 µs). Phase 2 — Background reclaim (LOW watermark crossed, ~5%): kswapd wakes, scans the LRU lists, writes back dirty pages, frees clean ones (1–5 ms; applications may not notice yet). Phase 3 — Direct reclaim (MIN watermark crossed, ~1%): the allocating process itself performs reclaim, holding the CPU while scanning pages (10–100 ms; the application stalls visibly). Phase 4 — OOM killer (memory exhausted): the kernel selects a victim by badness score and sends SIGKILL. "17 GB available" was cached data, not truly free — PostgreSQL is killed. Sections 8.2–8.4 explain the hardware mechanisms underlying each phase.
Figure 8.1: Memory pressure cascade: four phases from normal operation through kswapd background reclaim, direct reclaim, and OOM killer intervention. The left column shows memory watermark levels (HIGH/LOW/MIN); the right column shows the system response at each threshold with typical latency costs.

8.1.2 The Hardware–Software Contract

Chapters 1–7 described the hardware side of memory management: the MMU translates virtual to physical addresses, the TLB caches recent translations, and the page fault mechanism delivers control to the OS when translation fails. But hardware alone cannot decide which pages deserve to stay in RAM. That decision requires knowledge of application behaviour — specifically, which pages are being used and which have not been touched recently.

The hardware–software contract for memory reclaim rests on two bits embedded in every page table entry: the Accessed (A) bit and the Dirty (D) bit. The hardware sets these bits automatically as pages are accessed and written, providing the OS with a low-cost usage signal that does not require instrumentation of every memory access. The OS reads and clears these bits periodically, using the pattern of recent access to classify pages as hot or cold. Cold clean pages are trivially evictable; cold dirty pages must be written back before eviction; hot pages are left in place.

The mechanism seems simple, but the details — and the subtle differences between x86-64, ARM64, and RISC-V implementations — determine whether a system handles memory pressure gracefully or collapses into thrashing. Sections 8.2–8.4 establish the foundation; Sections 8.5–8.8 show how these primitives compose into the full Linux memory management stack.

8.2 Nested Page Tables: Beyond the Basics

Chapter 3 introduced two-stage address translation for virtualisation: a guest virtual address (GVA) first traverses the guest's own page tables (Stage 1) to produce a guest physical address (GPA), which then traverses the host's EPT or Stage-2 tables to yield a host physical address (HPA). We calculated that without caching, this walk could require up to 25 memory accesses. We showed that TLBs and page walk caches reduce the steady-state overhead to 2–7%. What we did not cover was what happens when the translation fails at either or both stages — and why those failures are so much more expensive than a simple page fault on bare metal.

8.2.1 Nested Fault Taxonomy

There are four fundamentally different kinds of nested page fault, each with distinct causes, costs, and remediation strategies. Figure 8.2 illustrates all four types with their typical latencies.

Figure 8.2: Nested page fault taxonomy: four types defined by which translation stage fails. Type 1 (guest page fault) involves no VM exit and costs 1–10 µs. Type 2 (EPT/Stage-2 violation) requires a VM exit and costs 5–10 µs. Type 3 (permission violation) covers COW, MMIO, and migration tracking. Type 4 (cascading fault storm) occurs when the guest page table walk itself triggers EPT faults, costing ~50 µs per access and slowing VM boot by 50%.

Type 1 — Guest page fault (Stage 1 failure). The guest application accesses an unmapped virtual address. The guest's own page tables have no entry for it, so the CPU generates a page fault exception that is injected into the guest. The guest OS handles it exactly as it would on bare metal: allocating a page, installing a PTE, and returning. The host hypervisor is entirely uninvolved. Cost: identical to a native page fault, 1–10 µs for a minor fault.

Type 2 — EPT / Stage-2 violation (Stage 2 failure). Stage 1 succeeds — the guest has a valid mapping for the virtual address — but the resulting GPA has no entry in the EPT (or ARM Stage-2 table). The CPU triggers a VM exit (reason 48 in the x86 VMCS). The hypervisor allocates a host page, installs the EPT entry mapping GPA→HPA, and re-enters the VM. Cost: approximately 5–10 µs (2 µs VM exit + 3 µs allocation + 2 µs VM re-entry).

Type 3 — Permission violation. Both Stage 1 and Stage 2 have entries, but EPT denies the requested access type. This pattern covers three important use cases. Copy-on-write: the guest writes to a shared page; EPT marks it read-only; the hypervisor duplicates the physical page and installs a writable EPT entry (~10 µs). MMIO emulation: the guest accesses a device register address; the hypervisor emulates the device operation (50–500 µs depending on complexity). Live migration dirty-page tracking: the hypervisor marks all EPT entries read-only at the start of a migration scan; any guest write triggers an EPT violation that records the page as dirty before reinstating write permission (~4 µs per fault).

Type 4 — Cascading fault storm. This is the critical failure mode unique to two-stage translation. When the guest OS is walking its own page tables — during a Type 1 fault handler, for example — each level of the guest PT walk reads from a guest physical address. If those guest PT pages have not yet been mapped in the EPT, each PT-level read triggers its own Type 2 EPT violation. A four-level guest PT walk can thus require up to four cascading VM exits before the original guest fault can even be delivered. In the worst case, one user-space memory access costs approximately 50 µs — 50× the bare-metal cost. This is why VMs boot 50% slower than bare metal: the guest kernel is initialising its page tables from scratch, every PT-level read misses the EPT, and the hypervisor processes thousands of cascading faults per second.
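The arithmetic behind that ~50 µs figure is worth making explicit. The sketch below (plain C, parameterised by the per-exit cost quoted above) counts one EPT violation per guest page-table level plus one for the data page itself; the function names are ours, for illustration:

```c
#include <assert.h>

/* Worst-case VM exits for one guest access when no guest-PT page is
 * yet mapped in the EPT: one EPT violation per guest page-table
 * level, plus one for the data page itself. */
static int cascade_vm_exits(int guest_pt_levels)
{
    return guest_pt_levels + 1;   /* 4-level walk -> 5 exits */
}

/* Estimated cost in microseconds, using the ~10 us per handled
 * EPT violation quoted in the text. */
static int cascade_cost_us(int guest_pt_levels, int us_per_exit)
{
    return cascade_vm_exits(guest_pt_levels) * us_per_exit;
}
```

With a four-level guest walk and 10 µs per exit, this reproduces the ~50 µs worst case cited above — roughly 50× the cost of a native minor fault.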

8.2.2 EPT Pre-population and Large EPT Pages

The principal mitigation for Type 4 storms is EPT pre-population: the hypervisor maps all expected GPA→HPA translations in the EPT at VM creation time, before the guest boots. This transforms the boot-time cascade of thousands of Type 2 faults into a one-time upfront cost during VM initialisation. For a 4 GB VM, pre-populating the EPT with 4 KB pages requires 1 million EPT entries — about 8 MB of host memory — and takes roughly 20 ms. That 20 ms cost is paid once; without it, the VM spends an equivalent amount of time processing EPT faults during every boot.

A complementary strategy is using 2 MB EPT pages (large pages at the EPT level). Because each 2 MB EPT entry covers 512 contiguous 4 KB pages, a 4 GB VM's EPT shrinks from 1 million entries to just 2,048. This reduces the number of EPT faults during boot by 512× and also improves EPT TLB (EPTP cache) hit rates. The tradeoff is 2 MB alignment requirements for guest memory and reduced granularity for COW and live-migration dirty tracking.
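The sizing arithmetic for both granularities can be checked with a few lines of C (a back-of-envelope model: it counts leaf entries only and assumes 8 bytes per entry, as in the text; upper-level tables add a little more):

```c
#include <assert.h>
#include <stdint.h>

#define KIB (1024ULL)
#define MIB (1024ULL * KIB)
#define GIB (1024ULL * MIB)

/* Number of EPT leaf entries needed to map vm_size bytes at the
 * given page granularity. */
static uint64_t ept_entries(uint64_t vm_size, uint64_t page_size)
{
    return vm_size / page_size;
}

/* Approximate footprint of the leaf entries alone, at 8 bytes each. */
static uint64_t ept_leaf_bytes(uint64_t vm_size, uint64_t page_size)
{
    return ept_entries(vm_size, page_size) * 8;
}
```

For a 4 GB VM this gives 1,048,576 entries (8 MB) with 4 KB pages, versus just 2,048 entries with 2 MB pages — the 512× reduction discussed above.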

8.3 Architecture Differences in Nested Fault Handling

While all three major ISA families provide two-stage translation for virtualisation, they differ significantly in what information the hypervisor receives when a Stage-2 fault occurs and how efficiently batched fault processing can be implemented. Figure 8.3 provides a side-by-side comparison across x86-64, ARM64, and RISC-V.

[Figure 8.3 rendered as text.] Headline numbers: average Type 2 fault latency is ~6.2 µs on x86-64 (rich VMCS exit qualification), ~7.1 µs on ARM64 (ESR_EL2/HPFAR_EL2 syndrome reporting adds ~1 µs), and ~9.3 µs on RISC-V (htval only; the software re-walk adds ~2–3 µs). Typical virtualisation overhead versus bare metal: x86-64 2–5%, ARM64 2–7%, RISC-V 5–12%. Key per-ISA optimisations: EPT pre-population (x86-64), batched fault handling (ARM64), large G-stage pages (RISC-V).
Figure 8.3: Nested fault handling comparison across x86-64 (Intel EPT), ARM64 (Stage-2), and RISC-V (H extension). Columns compare fault signal mechanism, information richness, average Type-2 fault latency, and the primary hypervisor optimisation available for each ISA. RISC-V's software re-walk requirement adds ~2–3 µs to handler latency versus x86-64.

8.3.1 x86-64 EPT Violations

When an EPT violation occurs on x86-64, the CPU records exit reason 48 in the VMCS and provides a 64-bit exit qualification field containing rich diagnostic information: whether the fault was caused by a read, write, or instruction fetch; which levels of the EPT were valid; whether the guest linear address is valid; and a flag indicating whether the fault was caused by the guest PT walker itself (crucial for diagnosing Type 4 cascades). The faulting GPA is available directly in a separate VMCS field. This richness means the hypervisor fault handler can typically resolve EPT violations without re-walking any tables, keeping the average Type 2 fault latency to approximately 6.2 µs.
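A hypervisor's exit handler typically classifies the fault by testing a handful of qualification bits. The sketch below is an illustrative user-space rendering of that decode — the bit positions follow the Intel SDM's EPT-violation layout, but the function names are ours, not any real hypervisor's API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Selected bits of the x86-64 EPT-violation exit qualification
 * (see the Intel SDM for the full layout). */
#define EPT_Q_READ      (1ULL << 0)  /* caused by a data read         */
#define EPT_Q_WRITE     (1ULL << 1)  /* caused by a data write        */
#define EPT_Q_EXEC      (1ULL << 2)  /* caused by instruction fetch   */
#define EPT_Q_GLA_VALID (1ULL << 7)  /* guest linear address valid    */
#define EPT_Q_GLA_XLAT  (1ULL << 8)  /* 1: linear-address access,
                                        0: guest PT walk (Type 4!)    */

/* True when the violation was raised by the guest's own page table
 * walker -- the signature of a Type 4 cascading fault. */
static bool is_guest_pt_walk_fault(uint64_t qual)
{
    return (qual & EPT_Q_GLA_VALID) && !(qual & EPT_Q_GLA_XLAT);
}

static bool is_write_fault(uint64_t qual)
{
    return qual & EPT_Q_WRITE;
}
```

Because bit 8 distinguishes a linear-address access from a guest PT walk, the handler can decide between installing a single EPT entry and pre-populating a whole range without re-walking anything.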

A key x86-specific optimisation is posted interrupts for EPT faults: when using nested virtualisation (running a hypervisor inside a VM), Intel's architecture allows batching EPT fault notifications, reducing VM exit frequency for workloads with high nested fault rates by 40–60%.

8.3.2 ARM64 Stage-2 Faults

ARM64's Stage-2 fault mechanism is, in some respects, even more informative than x86-64's. The ESR_EL2 register's Instruction Specific Syndrome (ISS) field reports not just the access type (read or write) but also the exact page table level at which the fault occurred (encoded in the DFSC field, ISS bits [5:0]). The S1PTW bit explicitly identifies faults caused by Stage-1 page table walks — directly distinguishing the Type 4 cascade scenario from a simple unmapped GPA. The faulting intermediate physical address (IPA, equivalent to GPA) is available in HPFAR_EL2. Average Type 2 latency is approximately 7.1 µs, slightly higher than x86-64 due to the additional state saved on VM exit.
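The corresponding ARM64 syndrome decode can be sketched the same way (bit positions follow the Arm ARM's data-abort ISS encoding; the helper names are illustrative, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Selected ISS fields of ESR_EL2 for a data abort taken to EL2
 * (see the Arm ARM for the full encoding). */
#define ESR_ISS_WNR   (1ULL << 6)        /* write, not read            */
#define ESR_ISS_S1PTW (1ULL << 7)        /* fault on a Stage-1 PT walk */
#define ESR_ISS_DFSC(iss) ((iss) & 0x3fULL)

/* Translation faults encode the failing level in the low DFSC bits:
 * 0b000100..0b000111 = translation fault, level 0..3. */
static bool is_translation_fault(uint64_t iss)
{
    return (ESR_ISS_DFSC(iss) & ~0x3ULL) == 0x04;
}

static int fault_level(uint64_t iss)
{
    return (int)(ESR_ISS_DFSC(iss) & 0x3);
}

/* S1PTW set means the Stage-2 fault hit during the guest's own
 * Stage-1 walk -- the ARM64 analogue of a Type 4 cascade. */
static bool is_stage1_walk_fault(uint64_t iss)
{
    return iss & ESR_ISS_S1PTW;
}
```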

ARM's FEAT_S2POE extension (available in Armv9.2) enables a particularly powerful optimisation: the hypervisor can handle up to 64 Stage-2 faults per VM exit by batching fault resolution in the EL2 handler. For VM boot workloads dominated by Type 4 cascades, this can improve fault-handling throughput by up to 64×.

8.3.3 RISC-V H Extension

RISC-V's hypervisor (H) extension takes the minimalist philosophy that characterises RISC-V ISA design generally. When a Guest Page Fault occurs, the htval CSR holds the faulting GPA (shifted right by two bits), and scause indicates the fault type. However, unlike x86-64 and ARM64, RISC-V does not record which PT level failed or whether the fault was caused by a Stage-1 PT walk. The hypervisor must re-walk the G-stage (Stage-2) tables manually to determine the cause, adding approximately 2–3 µs to the fault handler latency and pushing average Type 2 handling to ~9.3 µs.
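A minimal classification sketch for the RISC-V side (cause codes are from the privileged specification; note that htval is defined to hold the GPA shifted right by two bits, so the handler shifts it back):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Guest-page-fault cause codes from the RISC-V privileged spec. */
#define CAUSE_FETCH_GUEST_PAGE_FAULT  20
#define CAUSE_LOAD_GUEST_PAGE_FAULT   21
#define CAUSE_STORE_GUEST_PAGE_FAULT  23

static bool is_guest_page_fault(uint64_t scause)
{
    return scause == CAUSE_FETCH_GUEST_PAGE_FAULT ||
           scause == CAUSE_LOAD_GUEST_PAGE_FAULT  ||
           scause == CAUSE_STORE_GUEST_PAGE_FAULT;
}

/* htval holds the faulting guest-physical address shifted right by
 * two bits; recover the GPA before re-walking the G-stage tables. */
static uint64_t gpa_from_htval(uint64_t htval)
{
    return htval << 2;
}
```

Everything beyond this — which stage failed, which level, whether a Stage-1 walk was in progress — the hypervisor must reconstruct itself, which is exactly the 2–3 µs re-walk cost noted above.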

The primary RISC-V mitigation is using large Giga-pages (1 GB) in the G-stage tables, reducing the total number of G-stage entries for a typical VM from millions to thousands. One G-stage Giga-page covers 512 2 MB regions, requiring 512× fewer fault-handled entries during boot. As RISC-V hypervisor implementations mature, software-managed fault-batching techniques similar to ARM's FEAT_S2POE are expected to emerge.

8.3.4 Practical Guidance for Hypervisor Developers

The key takeaway from the architecture comparison is that fault-handling overhead is not just about raw latency — it is about the information available to avoid redundant work. An x86-64 hypervisor receiving an EPT violation knows immediately whether it is handling a simple unmapped GPA (install EPT entry, re-enter) or a cascading PT-walk fault (likely requiring pre-population of a range of GPA→HPA mappings to prevent recurrence). A RISC-V hypervisor receiving a Guest Page Fault must spend extra cycles rediscovering this context. Cross-referencing this section with Chapter 3 §3.6 (which describes two-stage translation structure) and Chapter 7 §7.3 (which covers x86-64 fault qualification fields in the single-stage context) provides the complete picture of how hardware-reported fault information shapes OS and hypervisor design.

8.4 Accessed and Dirty Bits: Hardware–Software Cooperation

When physical memory fills, the OS must choose which pages to evict. Optimal eviction requires knowing two things about each page: how recently it was accessed (recency) and whether it has been modified since it was last loaded from disk (cleanliness). The MMU provides both signals through the Accessed (A) bit and the Dirty (D) bit embedded in every PTE. Figure 8.4 summarises how each major ISA implements these bits and the dramatic cost difference between clean and dirty eviction.

Figure 8.4: Accessed and dirty bits: ISA comparison and software usage. Top: hardware A/D bit mechanism for x86-64 (automatic), ARM64 (configurable HA/HD mode), and RISC-V (software-only). Middle: the four-step clear-and-check cycle the OS uses to identify hot versus cold pages. Bottom: the order-of-magnitude cost gap between evicting a clean page (~150 µs) and a dirty page requiring writeback (1–10 ms).

8.4.1 Hardware A/D Bit Mechanisms

x86-64 implements A/D bits as hardware-automatic. When the page walker installs a translation in the TLB, it atomically sets the Accessed bit in the PTE using a locked read-modify-write cycle. On the first write to a page, the Dirty bit is set atomically as well. This costs approximately 100 cycles on the first access but is invisible thereafter (subsequent accesses read a cached TLB entry). The CPU also propagates A/D bits upward through the page table hierarchy in 5-level paging, allowing the OS to track which PDE and PDPE ranges are active.
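To make the x86-64 encoding concrete: the A and D bits sit at bits 5 and 6 of a PTE. The standalone helpers below mirror (in name only) the kernel's pte_young()/pte_mkold() accessors; they operate on a plain integer rather than a live page table:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86-64 PTE flag positions: Accessed is bit 5, Dirty is bit 6. */
#define PTE_ACCESSED (1ULL << 5)
#define PTE_DIRTY    (1ULL << 6)

static bool pte_young(uint64_t pte) { return pte & PTE_ACCESSED; }
static bool pte_dirty(uint64_t pte) { return pte & PTE_DIRTY; }

/* Clear the A bit for the next aging pass.  In a real kernel this is
 * an atomic read-modify-write and must be followed by a TLB flush,
 * since the hardware may hold a stale cached copy of the PTE. */
static uint64_t pte_mkold(uint64_t pte) { return pte & ~PTE_ACCESSED; }
```

Note that clearing A leaves D untouched: a page can be "cold but dirty", which is precisely the expensive-to-evict case discussed in §8.4.3.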

ARM64 provides a configurable mechanism via the HA and HD bits in the Translation Control Register (TCR_EL1). With HA=1, the hardware automatically sets the Access Flag (AF) in PTEs, matching x86-64 behaviour at ~100 cycles overhead. With HA=0 (the default on many implementations), any access to a page with AF=0 generates an Access Flag fault — a lightweight exception that the OS handles by setting AF=1 and returning. This mode adds 400–500 cycles per first-access but gives the OS a guaranteed hook to record the exact time of first access, enabling more precise LRU tracking than x86-64's hardware-set approach. The HD bit similarly controls whether the Dirty Bit Modifier (DBM) mechanism marks pages dirty automatically or via fault.

RISC-V's base Sv39/Sv48 page table format defines A and D bits but does not require hardware to update them (hardware updating is permitted, and later standardised as the Svadu extension, but most implementations leave it to software). Tracking A bits in software would require taking a fault on every first access to every page — adding over 2,000 cycles of overhead per first access. In practice, most RISC-V operating systems pre-set A=1 for all pages and do not track recency via A bits at all, relying instead on software-side access counters or simply not distinguishing hot from cold pages. This is a meaningful limitation for workloads with large working sets on RISC-V systems.

8.4.2 The Clear-and-Check Aging Cycle

On x86-64 and ARM64 with hardware A bits, the OS implements a clear-and-check cycle to identify hot and cold pages. The cycle has four steps: (1) Clear all A bits across the active page set, simultaneously invalidating TLB entries for those pages (since the PTE has changed). (2) Wait a scan period — typically a few seconds in Linux. (3) Check which A bits have been re-set by hardware during the waiting period; pages with A=1 received at least one access and are classified as "hot"; pages with A=0 received no access and are "cold". (4) Add cold pages to the inactive LRU list as eviction candidates; hot pages remain on the active list.

The critical cost in this cycle is step (1): clearing A bits requires an IPI (inter-processor interrupt) to flush TLB entries on all CPUs that may have cached the now-modified PTEs. On a 128-core system, this TLB shootdown (covered in detail in Chapter 12 for GPU contexts) can take microseconds to milliseconds depending on contention. Linux optimises this by batching A-bit clears across many pages before issuing a single shootdown, amortising the IPI cost.
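The batching idea can be sketched in a few lines: clear A bits across a whole batch of PTEs, then charge a single shootdown for the lot. This is an illustrative user-space model — a real kernel operates on live page tables and issues a real IPI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_ACCESSED (1ULL << 5)   /* x86-64 Accessed bit position */

/* Clear the A bit across a batch of PTEs, then account ONE TLB
 * shootdown for the whole batch rather than one per page.
 * Returns the number of A bits actually cleared. */
static size_t age_batch(uint64_t *ptes, size_t n, unsigned *shootdowns)
{
    size_t cleared = 0;
    for (size_t i = 0; i < n; i++) {
        if (ptes[i] & PTE_ACCESSED) {
            ptes[i] &= ~PTE_ACCESSED;
            cleared++;
        }
    }
    if (cleared)
        (*shootdowns)++;   /* one IPI, amortised over the batch */
    return cleared;
}
```

Clearing 512 A bits for the cost of one shootdown instead of 512 is the difference between microseconds and milliseconds of aging overhead on a many-core machine.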

8.4.3 Dirty Bit and Eviction Cost

The Dirty bit determines whether evicting a page requires writing it to disk. A clean page (D=0) has not been modified since it was loaded — it is an exact copy of data on disk or in a file. Evicting it is cheap: simply unmap it, and reload from disk later if needed. Cost: approximately 150 µs (mostly TLB shootdown and PTE clear). A dirty page (D=1) has been modified in RAM and differs from its on-disk representation. Before eviction, it must be written back to its backing store (swap partition, file, or tmpfs). Cost: roughly 1 ms for an NVMe SSD, and 10 ms or more for a rotating disk — one to two orders of magnitude more expensive than a clean eviction.

The practical consequence is that the OS must proactively write back dirty pages — not wait until eviction is urgent. Linux's kswapd daemon does exactly this: it begins writing dirty pages to disk as soon as the LOW watermark is crossed (Section 8.6), long before any application is blocked waiting for free memory. The goal is to ensure that by the time pages must be evicted, most of them are already clean. When this proactive writeback fails — for example, when a workload is writing dirty data faster than the disk can absorb it — direct reclaim is forced to handle dirty evictions synchronously, causing the application latency spikes described in Scenario 1 at the start of this chapter.
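A toy cost model using the chapter's round numbers (~150 µs per clean eviction, ~10 ms per dirty eviction on slow storage — both figures from the text, the function name ours) shows why the dirty ratio dominates reclaim latency:

```c
#include <assert.h>
#include <stdint.h>

/* Rough reclaim cost model: ~150 us to drop a clean page,
 * ~10,000 us to write back and drop a dirty page on slow storage. */
#define CLEAN_EVICT_US    150ULL
#define DIRTY_EVICT_US  10000ULL

static uint64_t reclaim_cost_us(uint64_t n_clean, uint64_t n_dirty)
{
    return n_clean * CLEAN_EVICT_US + n_dirty * DIRTY_EVICT_US;
}
```

Reclaiming 1,000 clean pages costs about 0.15 s of reclaim work; make 30% of them dirty and the bill rises to roughly 3.1 s. This is exactly why kswapd works to clean pages long before eviction becomes urgent.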

8.4.4 Relationship to Page Replacement Algorithms

The A and D bits feed directly into the page replacement algorithms discussed in Section 8.5. The Clock algorithm uses the A bit as a "second chance" flag: on first examination, a page with A=1 gets its bit cleared and is passed over; on second examination with A=0, it is evicted. Linux's Multi-Generational LRU (MGLRU, Section 8.5.4) extends this to four generations, using A-bit patterns over multiple scan periods to classify pages from "youngest" (accessed recently) to "oldest" (not accessed in multiple periods). As established in Chapter 2 §2.3, the accessed bit provides hardware support for approximate LRU at near-zero cost per access — the crucial insight that makes practical LRU approximation feasible at scale.

8.5 Page Replacement Algorithms: From Theory to Practice

Now that we understand how the hardware provides A and D bits, we can explore how software uses them to make intelligent eviction decisions. The goal is simple: when we must evict pages, choose the ones least likely to be needed soon. The challenge is doing this efficiently without perfect knowledge of future access patterns.

[Figure 8.5 rendered as text.] Bélády's OPT vs LRU, 3 frames. OPT, with oracle knowledge of the future, evicts the page whose next use lies furthest ahead; LRU approximates it using recency as a proxy.

Reference string (t1–t15):  7 0 1 2 0 3 0 4 2 3 0 3 2 1 2

OPT (Bélády) — evict the page used FURTHEST in the future:
  t4:  ref 2 → evict 7 (never used again)
  t6:  ref 3 → evict 1 (next use t14; 0 is needed at t7, 2 at t9)
  t8:  ref 4 → evict 0 (next use t11; 2 is needed at t9, 3 at t10)
  t11: ref 0 → evict 4 (never used again)
  t14: ref 1 → evict 0 or 3 (neither is used again)
  Result: 8 faults (3 compulsory + 5 replacements) — the theoretical minimum

LRU — evict the page NOT USED for the longest time:
  t4: evict 7; t6: evict 1 — same choices as OPT so far
  t8: ref 4 → evict 2 (least recently used, at t4) — suboptimal: 2 is needed next, at t9
  t9–t11: faults on 2, 3, 0 as recency-based eviction chases the working set
  t14: ref 1 → fault
  Result: 10 faults (2 extra versus OPT)

Fault timeline (F = fault, H = hit):
  OPT: F F F F H F H F H H F H H F H   (8 faults)
  LRU: F F F F H F H F F F F H H F H   (10 faults)

Why OPT cannot be implemented: it requires knowing future page accesses — impossible in a general-purpose OS without a workload oracle. OPT is used as a benchmark to measure how close real algorithms come to the theoretical minimum; specialised systems (databases, CDNs) may approximate OPT-like behaviour with access-pattern predictors. True LRU requires per-access timestamps, which are too expensive; Linux instead uses a two-list Clock scheme — an active list of recently used pages and an inactive list of aging eviction candidates — driven by the hardware-set Accessed bit in each PTE.
Figure 8.5: Bélády's OPT versus LRU on a 15-step reference string with 3 frames. OPT incurs 8 faults by evicting the page used furthest in the future; LRU incurs 10. OPT is the theoretical lower bound and is used as a benchmark for real algorithms.

8.5.1 The Ideal: Perfect LRU

The gold standard among implementable algorithms is Least Recently Used (LRU): evict the page that was accessed longest ago. (Bélády's OPT in Figure 8.5 does better still, but requires knowing the future.) If page A was last accessed 10 seconds ago and page B was last accessed 5 seconds ago, evict page A. This works because of temporal locality—pages accessed recently are more likely to be accessed again soon.

The problem? Perfect LRU is impossible to implement efficiently:

// Perfect LRU (theoretical, impossibly expensive)
struct page {
    unsigned long last_access_time;  // Timestamp of last access
    // ... other fields ...
};

// Would need to update on EVERY memory access
void perfect_lru_track_access(struct page *page) {
    page->last_access_time = get_nanoseconds();
    
    // This would run on EVERY memory access!
    // CPU does billions of memory accesses per second
    // Cost would be catastrophic
}

// Eviction would be simple
struct page *perfect_lru_evict(void) {
    struct page *page, *oldest = NULL;
    unsigned long oldest_time = ULONG_MAX;
    
    // Find page with smallest timestamp
    // (for_each_page stands in for an iterator over all struct pages)
    for_each_page(page) {
        if (page->last_access_time < oldest_time) {
            oldest_time = page->last_access_time;
            oldest = page;
        }
    }
    
    return oldest;  // Evict this one
}

Why is this impossible?

Cost analysis of perfect LRU:

System with 8 GB RAM = 2 million pages (at 4KB each)

Tracking overhead:
  - Update timestamp on every access
  - Modern CPU: 3 billion memory accesses/second
  - Each update: read timestamp, write new timestamp = 2 memory accesses
  - Additional memory traffic: 6 billion accesses/second just for tracking
  - Memory bandwidth: Doubles!
  - Cannot use TLB (TLB caches translations, not tracking state)
  
Eviction overhead:
  - Scan 2 million pages to find minimum timestamp
  - 2M × 10 ns = 20 milliseconds per eviction
  - Under heavy pressure: 1000 evictions/second
  - 20 ms × 1000 = 20,000 ms = 20 seconds/second
  - More than 100% of CPU time just finding pages to evict!

⚠️ PERFORMANCE TRAP

You cannot track every memory access in software. The CPU does billions of accesses per second. Even a single instruction per access would consume all CPU cycles. This is why hardware provides the A bit—it tracks accesses transparently during address translation with minimal overhead (only on first access to set the bit).

Perfect LRU is a beautiful theoretical concept but completely impractical. Real systems must approximate LRU using the limited information available from hardware—primarily the A bit.

8.5.2 Clock Algorithm: Practical LRU Approximation

The Clock algorithm, invented by Fernando Corbató in 1968, provides a brilliant approximation of LRU using only the A bit. Pages are arranged in a circular list. A "clock hand" sweeps around, giving each page with A=1 a "second chance" by clearing its A bit. Pages with A=0 (not accessed since last sweep) are evicted.

The key insight: A bit provides binary information (accessed/not accessed) rather than precise timestamps. This is enough for good approximation:

// Clock algorithm (also called "Second Chance")
struct clock_state {
    struct page *hand;           // Current position in circular list
    struct list_head page_list;  // Circular list of all pages
};

struct page *clock_evict(struct clock_state *clock) {
    struct page *page = clock->hand;
    
    // Scan forward until we find a page to evict
    // This is the "clock hand" sweeping
    while (1) {
        pte_t *pte = get_pte_for_page(page);
        
        // Check accessed bit
        if (pte_young(*pte)) {
            // Page was accessed recently (A=1)
            // Give it a second chance: clear A bit and move on
            *pte = pte_mkold(*pte);  // Clear A bit
            
            // Move to next page in circular list
            page = list_next_entry(page, lru);
            if (&page->lru == &clock->page_list)
                page = list_first_entry(&clock->page_list, struct page, lru);
            
        } else {
            // Page not accessed since last sweep (A=0)
            // This page hasn't been used for at least one full rotation
            // Evict it!
            
            // Advance the hand *before* unlinking the victim:
            // list_del() poisons the page's list pointers, so calling
            // list_next_entry() on it afterward would be a bug
            clock->hand = list_next_entry(page, lru);
            if (&clock->hand->lru == &clock->page_list)
                clock->hand = list_first_entry(&clock->page_list, struct page, lru);
            
            // Remove victim from the list and hand it to the caller
            list_del(&page->lru);
            
            return page;
        }
    }
}

/* Based on classical Clock algorithm
   Reference: Corbató et al., "A Paging Experiment with the Multics System", 1968 */

Why this works—the "second chance" gives pages with A=1 another opportunity to be accessed:

Example: 8 pages in circular list

Initial state (all pages recently accessed):
  [A=1][A=1][A=1][A=1][A=1][A=1][A=1][A=1]
   ^hand

Eviction request #1:
  Scan page 0: A=1 → clear to A=0, advance
  Scan page 1: A=1 → clear to A=0, advance
  Scan page 2: A=1 → clear to A=0, advance
  ...all have A=1...
  Scan page 7: A=1 → clear to A=0, advance
  Back to page 0: A=0 → EVICT!
  
  Result: All pages got their A bit cleared
  First revisited page (longest since accessed) was evicted
  
Eviction request #2 (pages 2,4,6 accessed between evictions):
  [removed][A=0][A=1][A=0][A=1][A=0][A=1][A=0]
            ^hand
  
  Scan page 1: A=0 → EVICT!
  
  Page 1 wasn't accessed since last eviction, so evicted immediately
  Pages 2,4,6 were accessed, so they survived

This approximates LRU: Pages that haven't been accessed for
a full rotation are evicted. Recently accessed pages survive.

Performance characteristics:

Clock algorithm analysis:

Best case (page found immediately):
  - Check 1 page
  - Cost: ~100 ns
  - O(1) if victims readily available

Worst case (all pages accessed):
  - Clear A bit on all N pages
  - Cost: N × 100 ns
  - Then evict first page
  - O(N) where N = number of pages

Average case:
  - Scan ~10-50 pages before finding victim
  - Cost: ~1-5 µs
  - Amortized O(1) because subsequent evictions are fast

The Clock algorithm has a crucial property: it doesn't suffer from Bélády's anomaly (where adding more memory makes performance worse). It's also simple to implement and has good average-case performance.
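
The kernel-style listing above cannot run outside the kernel, but the sweep itself reduces to a few lines of user-space C. A minimal sketch, using a fixed array in place of the kernel's circular list, with `accessed[]` standing in for the hardware A bits and `present[]` marking pages still resident:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PAGES 8

/* One Clock eviction: sweep from *hand, clearing A bits (second
 * chance) until a page with A=0 is found. Returns the victim's
 * index and advances the hand past it. */
int clock_evict(bool accessed[], bool present[], int *hand)
{
    for (;;) {
        int i = *hand;
        *hand = (*hand + 1) % NR_PAGES;   /* hand always advances */

        if (!present[i])
            continue;                      /* slot already evicted */

        if (accessed[i])
            accessed[i] = false;           /* second chance: clear A */
        else {
            present[i] = false;            /* A=0: evict this page */
            return i;
        }
    }
}
```

Running the trace from the example: with all eight A bits set, the first eviction sweeps the full circle clearing bits and takes page 0; after pages 2, 4, and 6 are re-accessed, the next eviction takes page 1 immediately.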

8.5.3 Two-Handed Clock: Considering Cleanliness

Basic Clock doesn't distinguish between clean and dirty pages. But we know dirty pages are 50-1000× more expensive to evict (must write to disk). We can do better with a two-handed variant:

// Two-handed clock: front hand clears A bits, back hand evicts
struct two_handed_clock {
    struct page *front_hand;     // Clears A bits (runs ahead)
    struct page *back_hand;      // Evicts pages (follows)
    unsigned int hand_distance;  // Distance between hands (tunable)
    struct list_head page_list;  // Circular list of all pages
};

// Advance a hand one position, wrapping past the list head
static struct page *advance_hand(struct two_handed_clock *clock,
                                 struct page *page) {
    page = list_next_entry(page, lru);
    if (&page->lru == &clock->page_list)
        page = list_first_entry(&clock->page_list, struct page, lru);
    return page;
}

void advance_front_hand(struct two_handed_clock *clock) {
    struct page *page = clock->front_hand;
    pte_t *pte = get_pte_for_page(page);
    
    // Front hand simply clears A bits
    // Doesn't evict anything
    if (pte_young(*pte)) {
        *pte = pte_mkold(*pte);
    }
    
    // Advance to next page
    clock->front_hand = advance_hand(clock, page);
}

struct page *evict_with_back_hand(struct two_handed_clock *clock) {
    struct page *page = clock->back_hand;
    
    while (1) {
        pte_t *pte = get_pte_for_page(page);
        
        // Back hand prefers clean pages over dirty
        if (!pte_young(*pte)) {  // A=0: not accessed recently
            if (!pte_dirty(*pte) || mem_pressure_high()) {
                // Clean AND cold: perfect victim.
                // Dirty but cold: acceptable only under high memory
                // pressure, since evicting it costs a disk write.
                
                // Advance the hand *before* unlinking the victim:
                // list_del() poisons the page's list pointers
                clock->back_hand = advance_hand(clock, page);
                list_del(&page->lru);
                return page;
            }
            // Dirty, cold, low pressure: keep searching for a clean page
        }
        
        // Not a good victim: advance
        page = advance_hand(clock, page);
        clock->back_hand = page;
        
        // If back hand catches up to front hand, advance front
        if (distance(clock->back_hand, clock->front_hand) < clock->hand_distance) {
            advance_front_hand(clock);
        }
    }
}

/* Conceptual implementation based on BSD VM system
   Reference: McKusick et al., "The Design and Implementation of the 4.4BSD OS", 1996 */

The hand distance is a tunable parameter that controls how aggressive the system is:

Small hand distance (e.g., 64 pages):
  - Front hand barely ahead of back hand
  - Pages have little time to be re-accessed after A cleared
  - More aggressive eviction
  - Good under heavy memory pressure
  
Large hand distance (e.g., 512 pages):
  - Front hand far ahead of back hand  
  - Pages have more time to be re-accessed after A cleared
  - Less aggressive eviction
  - Good under light memory pressure
  - Better approximation of LRU
  
Example with hand_distance = 128:
  - Front hand clears A bits on pages
  - Back hand follows 128 pages behind
  - If page accessed in those 128 page-scans, A gets set again
  - Back hand sees A=1 and skips it
  - Only pages not accessed for 128+ page intervals get evicted

Performance improvement over single-handed Clock:

Evicting 1000 pages under memory pressure:

Single-handed Clock (no clean/dirty preference):
  - Evicts first page with A=0 (might be dirty)
  - 500 dirty pages, 500 clean pages (50/50 mix)
  - Average: 500 × 5ms (dirty) + 500 × 0.1ms (clean) = 2,550 ms

Two-handed Clock (prefers clean):
  - Scans ahead for clean pages with A=0
  - Evicts clean pages when available
  - Falls back to dirty only under high pressure
  - 900 clean, 100 dirty (90/10 mix due to preference)
  - Average: 100 × 5ms (dirty) + 900 × 0.1ms (clean) = 590 ms
  
Speed-up: 2,550 ms → 590 ms = 4.3× faster eviction!
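
The arithmetic behind these totals is worth encoding, if only to make the assumptions explicit. A sketch using the costs assumed throughout this section (~5 ms per dirty eviction, ~0.1 ms per clean one):

```c
#include <assert.h>

/* Back-of-envelope eviction cost: dirty pages cost ~5 ms each
 * (disk write), clean pages ~0.1 ms (just unmap and free).
 * Returns the total in microseconds to stay in integer math. */
unsigned long eviction_cost_us(unsigned long dirty, unsigned long clean)
{
    return dirty * 5000 + clean * 100;
}
```

Plugging in the two mixes above: 500/500 gives 2,550,000 µs (2,550 ms); 100/900 gives 590,000 µs (590 ms)—the dirty-page count dominates, which is exactly why the back hand hunts for clean victims first.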

💡 KEY INSIGHT

The two-handed approach separates A bit clearing (front hand) from eviction decisions (back hand). This gives pages time to be re-accessed after their A bit is cleared, providing better LRU approximation. The hand distance controls the tradeoff between accuracy (large distance) and responsiveness to memory pressure (small distance).

8.5.4 Multi-Generational LRU (MGLRU): Modern Sophistication

Linux kernel 6.1 (released December 2022) merged a new approach called Multi-Generational LRU (MGLRU), developed by Yu Zhao. Instead of binary A bit state (accessed/not-accessed), MGLRU tracks multiple generations, allowing pages to age gradually:

// MGLRU: Pages age through 4 generations
#define MAX_NR_GENS 4  // Gen 3 (newest) → Gen 2 → Gen 1 → Gen 0 (oldest)

struct lruvec {
    struct lru_gen_struct {
        unsigned long timestamps[MAX_NR_GENS];  // When each gen started
        struct list_head lists[MAX_NR_GENS];    // Pages in each gen
    } lrugen;
};

void mglru_age_pages(struct lruvec *lruvec) {
    int gen;
    
    // Rotate generations: 3→2→1→0
    // Gen 0 pages become eviction candidates
    // New pages enter at gen 3
    
    // Move gen 0 to eviction list
    list_splice_init(&lruvec->lrugen.lists[0], &eviction_list);
    
    // Shift everyone down a generation
    for (gen = 0; gen < MAX_NR_GENS - 1; gen++) {
        list_splice_init(&lruvec->lrugen.lists[gen + 1],
                         &lruvec->lrugen.lists[gen]);
        lruvec->lrugen.timestamps[gen] = lruvec->lrugen.timestamps[gen + 1];
    }
    
    // Start new generation at gen 3
    INIT_LIST_HEAD(&lruvec->lrugen.lists[MAX_NR_GENS - 1]);
    lruvec->lrugen.timestamps[MAX_NR_GENS - 1] = jiffies;
    
    // Now scan all pages and promote those with A=1.
    // Use the _safe iterator: list_move() unlinks entries while we
    // walk. (&all_pages stands in for the zone's full page list.)
    struct page *page, *tmp;
    list_for_each_entry_safe(page, tmp, &all_pages, lru) {
        pte_t *pte = get_pte_for_page(page);
        
        if (pte_young(*pte)) {
            // Page was accessed: promote to newest generation
            list_move(&page->lru, &lruvec->lrugen.lists[MAX_NR_GENS - 1]);
            
            // Clear A bit for next scan
            *pte = pte_mkold(*pte);
        }
        // Pages with A=0 stay in their current generation
        // and keep aging toward gen 0
    }
}

struct page *mglru_evict(struct lruvec *lruvec) {
    // Evict from gen 0 (oldest generation): these pages survived
    // multiple aging cycles without being accessed.
    // Note: list_first_entry() never returns NULL, so test
    // list_empty() explicitly before dereferencing.
    if (list_empty(&lruvec->lrugen.lists[0])) {
        // Gen 0 empty: age all pages to refill it, then retry
        mglru_age_pages(lruvec);
        if (list_empty(&lruvec->lrugen.lists[0]))
            return NULL;  // Nothing evictable
    }
    
    struct page *page = list_first_entry(&lruvec->lrugen.lists[0],
                                         struct page, lru);
    list_del(&page->lru);
    return page;
}

/* Conceptual implementation based on Linux MGLRU
   Reference: Yu Zhao, "Multi-Gen LRU", merged in Linux kernel 6.1, 2022
   Documentation: Documentation/admin-guide/mm/multigen_lru.rst */

How pages age through generations:

Example: Page lifecycle in MGLRU

T=0s: Page allocated, enters Gen 3 (newest)
  Gens: [0:empty][1:empty][2:empty][3:THIS PAGE]

T=5s: First aging cycle, page has A=1 (was accessed)
  → Stays in Gen 3 (promoted because A=1)
  Gens: [0:empty][1:empty][2:old pages][3:THIS PAGE + new]

T=10s: Second aging cycle, page has A=0 (not accessed)
  → Ages to Gen 2
  Gens: [0:old][1:older][2:THIS PAGE][3:new]

T=15s: Third aging cycle, page has A=0 again
  → Ages to Gen 1
  Gens: [0:older][1:THIS PAGE][2:newer][3:newest]

T=20s: Fourth aging cycle, page accessed (A=1)
  → Promoted back to Gen 3!
  Gens: [0:cold pages][1:cool][2:warm][3:THIS PAGE]
  Page got "second wind"—accessed again, so promoted

T=25s-40s: Page not accessed for 3 cycles
  → Ages through Gen 2 → Gen 1 → Gen 0
  Gens: [0:THIS PAGE][1:...][2:...][3:...]

T=45s: Eviction needed
  → Page evicted from Gen 0
  It survived 45 seconds but hadn't been accessed for the last ~25 s

Why MGLRU is better than two-list LRU (Linux's previous approach):

Comparison: Two-list LRU vs MGLRU

Two-list LRU (Linux 6.0 and earlier):
  - Active list: Recently accessed pages
  - Inactive list: Not recently accessed
  - Single promotion/demotion step
  - Problem: One-time accesses pollute active list
  
  Example: grep through large file
    - Touches each page once
    - All pages promoted to active list
    - Displaces useful hot pages
    - "Cache pollution" problem

MGLRU (Linux 6.1+):
  - Four generations: 3 (hot) → 2 → 1 → 0 (cold)
  - Must survive multiple aging cycles to stay
  - Filters out one-time accesses naturally
  
  Example: same grep through large file
    - Pages enter Gen 3 on first access
    - Not accessed again: age to Gen 2 → 1 → 0
    - Evicted before polluting "hot" generations
    - Truly hot pages stay in Gen 3 through multiple promotions

Real-world benchmark results:

📊 REAL NUMBERS - MGLRU PERFORMANCE

Benchmark: PostgreSQL with 8 GB database, 4 GB RAM
Workload: Mixed read/write with occasional large scans

Two-list LRU:
  - Query latency p50: 15 ms
  - Query latency p99: 180 ms
  - Page faults: 2,200/sec
  - Throughput: 4,100 queries/sec

MGLRU:
  - Query latency p50: 12 ms (20% better)
  - Query latency p99: 140 ms (22% better)
  - Page faults: 1,850/sec (16% fewer)
  - Throughput: 4,900 queries/sec (19% better)

Why: MGLRU better identifies truly hot pages (accessed frequently across multiple generations) vs. one-time scan pages (accessed once, then aged out quickly).

MGLRU represents the state-of-the-art in page replacement. It provides better LRU approximation than Clock or two-list approaches while maintaining similar O(1) amortized cost.
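
The per-page aging rule at the heart of MGLRU can be modeled in isolation. A minimal sketch for a single page, assuming (as in the listing above) that an accessed page is always promoted straight to the newest generation:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_GENS 4   /* gen 3 = newest ... gen 0 = oldest */

/* One aging cycle for one page: if the A bit was set, promote to
 * the newest generation and clear the bit; otherwise age one
 * generation toward gen 0 (the eviction-candidate generation).
 * Returns the page's new generation. */
int mglru_age_one(int gen, bool *accessed)
{
    if (*accessed) {
        *accessed = false;     /* clear A bit for the next scan */
        return NR_GENS - 1;    /* promoted back to gen 3 */
    }
    return gen > 0 ? gen - 1 : 0;
}
```

Replaying the T=0..T=45 lifecycle above: accessed → stays at gen 3; two idle cycles → gen 1; accessed again → back to gen 3 ("second wind"); three idle cycles → gen 0, an eviction candidate.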

8.6 Memory Reclaim: When and How to Free Pages

Page replacement algorithms tell us which pages to evict. Now we need to understand when eviction happens and how the kernel manages the process of freeing memory under pressure.

Figure 8.6: Linux kswapd memory reclaim state machine. kswapd wakes when free memory drops below the low watermark, scans the inactive LRU list, initiates writeback or page freeing, and returns to sleep once the high watermark is reached. If no pages can be reclaimed, the OOM killer is invoked.

8.6.1 Watermarks: Detecting Memory Pressure

The kernel doesn't wait until memory is completely exhausted to start reclaiming pages. Instead, it maintains three watermark levels that trigger increasingly aggressive reclaim:

// Per-zone watermarks (simplified)
struct zone {
    unsigned long watermark[NR_WMARK];
};

enum zone_watermarks {
    WMARK_MIN,     // Emergency threshold: direct reclaim
    WMARK_LOW,     // Wake kswapd background thread
    WMARK_HIGH,    // kswapd can sleep
};

// Check if zone needs reclaim
bool zone_watermark_ok(struct zone *zone, unsigned int order) {
    unsigned long free_pages = zone_page_state(zone, NR_FREE_PAGES);
    unsigned long min = zone->watermark[WMARK_MIN];
    unsigned long low = zone->watermark[WMARK_LOW];
    unsigned long high = zone->watermark[WMARK_HIGH];
    
    if (free_pages < min) {
        // Below min: emergency! Direct reclaim required
        return false;
    } else if (free_pages < low) {
        // Below low: wake background reclaim daemon
        wake_up_kswapd();
        return true;  // We already know free_pages >= min here
    } else if (free_pages >= high) {
        // Above high: all good, no reclaim needed
        return true;
    }
    
    // Between low and high: background reclaim running
    return true;
}

/* Based on Linux kernel mm/page_alloc.c */

Typical watermark values for an 8 GB system:

8 GB RAM = 2,097,152 pages (at 4KB each)

WMARK_HIGH:  96 MB  (24,576 pages)  1.2% of RAM
             ↑ kswapd stops when this is reached

WMARK_LOW:   64 MB  (16,384 pages)  0.8% of RAM
             ↑ kswapd wakes when we drop below this

WMARK_MIN:   32 MB  (8,192 pages)   0.4% of RAM
             ↑ direct reclaim if we drop below this

These are derived from the vm.min_free_kbytes sysctl
Kernel default: min_free_kbytes = sqrt(16 × RAM_in_KB)
For 8 GB: sqrt(16 × 8,388,608) ≈ 11,585 KB ≈ 11 MB
(the 32 MB shown here reflects a raised min_free_kbytes,
a common tuning on server systems)
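
The default derivation can be checked with a few lines of C. A simplified sketch: it ignores vm.watermark_scale_factor and per-zone distribution, uses the classic low = min + min/4 and high = min + min/2 spacing, and includes an integer square root so no libm is needed:

```c
#include <assert.h>

/* Integer square root (floor), by bisection */
static unsigned long int_sqrt_ul(unsigned long x)
{
    unsigned long lo = 0, hi = x;
    while (lo < hi) {
        unsigned long mid = lo + (hi - lo + 1) / 2;
        if (mid <= x / mid)   /* mid*mid <= x, without overflow */
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

struct watermarks { unsigned long min, low, high; };

/* Simplified system-wide watermarks, all values in KB:
 * min = sqrt(16 * RAM_in_KB), low = min + min/4, high = min + min/2 */
struct watermarks calc_watermarks(unsigned long ram_kb)
{
    struct watermarks w;
    w.min  = int_sqrt_ul(ram_kb * 16);
    w.low  = w.min + w.min / 4;
    w.high = w.min + w.min / 2;
    return w;
}
```

For 8 GB (8,388,608 KB) this gives min ≈ 11.3 MB; memory-heavy servers frequently raise vm.min_free_kbytes well above this default to buy kswapd more headroom.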

The watermarks create three operational regimes:

📊 REAL NUMBERS - WATERMARK BEHAVIOR

Scenario: Gradual memory filling

T=0s: Free = 512 MB (above HIGH)
  → System idle, no reclaim

T=30s: Application malloc(200 MB), Free = 312 MB
  → Still above HIGH, no reclaim triggered

T=45s: malloc(250 MB), Free = 62 MB (below LOW)
  → kswapd wakes, starts background reclaim
  → malloc() returns immediately (non-blocking)
  → kswapd reclaims in background

T=47s: kswapd freed 50 MB, Free = 112 MB (above HIGH)
  → kswapd goes back to sleep

T=60s: malloc(90 MB), Free = 22 MB (below MIN!)
  → Direct reclaim! Application blocks!
  → malloc() won't return until memory freed
  → User sees application freeze

8.6.2 kswapd: Background Reclaim Daemon

The kswapd kernel thread is responsible for background memory reclaim. There's one kswapd per NUMA node, and it spends most of its time asleep, waking only when memory drops below WMARK_LOW:

// kswapd main loop (simplified)
static int kswapd(void *p) {
    struct pglist_data *pgdat = (struct pglist_data *)p;
    
    while (!kthread_should_stop()) {
        bool work_done = false;
        
        // Check all zones in this NUMA node
        struct zone *zone;
        for_each_zone(zone) {
            if (zone_free_pages(zone) < zone->watermark[WMARK_LOW]) {
                // Below low watermark: need to reclaim
                // (note: don't name a local "current"—that's the
                // kernel's macro for the running task)
                unsigned long target = zone->watermark[WMARK_HIGH];
                unsigned long free_now = zone_free_pages(zone);
                unsigned long to_reclaim = target - free_now;
                
                // Reclaim pages until we hit HIGH watermark
                unsigned long reclaimed = shrink_zone(zone, to_reclaim);
                
                if (reclaimed > 0)
                    work_done = true;
            }
        }
        }
        
        if (!work_done) {
            // All zones above LOW watermark: go to sleep
            // Will be woken when zone drops below LOW
            set_current_state(TASK_INTERRUPTIBLE);
            schedule();  // Sleep until woken
        }
    }
    
    return 0;
}

// How pages are reclaimed from a zone
static unsigned long shrink_zone(struct zone *zone, unsigned long nr_to_reclaim) {
    unsigned long nr_reclaimed = 0;
    
    while (nr_reclaimed < nr_to_reclaim) {
        // Use page replacement algorithm (MGLRU/Clock) to pick victim
        struct page *page = select_page_to_evict();
        if (!page)
            break;  // No more evictable pages
        
        // Evict the page (may need to write to disk if dirty)
        if (try_to_free_page(page)) {
            nr_reclaimed++;
        }
        
        // Don't hog CPU: yield occasionally
        if (need_resched())
            cond_resched();
    }
    
    return nr_reclaimed;
}

/* Conceptual implementation based on Linux kernel mm/vmscan.c */

kswapd operates on the principle of "work ahead of demand":

Timeline: kswapd behavior

T=0s: Free=128 MB (above HIGH=96 MB)
  → kswapd sleeping

T=10s: malloc(64 MB), Free=64 MB (equals LOW)
  → Allocation wakes kswapd
  → malloc() returns immediately (success)
  → kswapd starts reclaiming in background

T=10.1s: kswapd examines pages
  → Finds ~10,000 clean file-backed pages (40 MB)
  → Drops them (no disk I/O needed)
  → Free: 64 MB → 104 MB (above HIGH)
  → kswapd goes back to sleep
  
Total user impact: 0 ms (malloc didn't block)
Background reclaim handled everything

💡 KEY INSIGHT

kswapd runs in the background to prevent applications from having to do direct reclaim. By freeing pages before memory is critically low, kswapd allows allocations to succeed without blocking. This is the difference between a responsive system and one that stutters under memory pressure.

8.6.3 Direct Reclaim: The Emergency Path

When memory drops below WMARK_MIN, kswapd can't keep up. The kernel takes drastic action: the allocating process must free memory itself before proceeding. This is called direct reclaim, and it's catastrophic for latency:

// Allocation path with direct reclaim
void *do_malloc(size_t size) {
    // Try to allocate normally
    struct page *page = alloc_pages(GFP_KERNEL, get_order(size));
    if (page)
        return page_to_virt(page);  // Success!
    
    // Failed: check if we're below MIN watermark
    // (zone = the zone we tried to allocate from; lookup omitted here)
    if (zone_free_pages(zone) < zone->watermark[WMARK_MIN]) {
        // Emergency: must reclaim before allocating
        // This process BLOCKS here!
        
        unsigned long start = jiffies;
        printk(KERN_WARNING "Process %d entering direct reclaim\n", current->pid);
        
        // Try to reclaim enough pages to satisfy allocation
        unsigned long nr_reclaimed = try_to_free_pages(size / PAGE_SIZE + 32);
        
        unsigned long elapsed = jiffies_to_msecs(jiffies - start);
        printk(KERN_WARNING "Direct reclaim took %lu ms\n", elapsed);
        
        // Try allocation again
        page = alloc_pages(GFP_KERNEL, get_order(size));
        if (page)
            return page_to_virt(page);
        
        // Still failed: OOM path
        return NULL;
    }
    
    return NULL;
}

// Direct reclaim implementation
static unsigned long try_to_free_pages(unsigned long nr_to_reclaim) {
    unsigned long nr_reclaimed = 0;
    
    while (nr_reclaimed < nr_to_reclaim) {
        // Desperately search for evictable pages
        struct page *page = find_evictable_page();
        if (!page)
            break;
        
        pte_t *pte = get_pte_for_page(page);
        
        if (pte_dirty(*pte)) {
            // Page is dirty: MUST write to disk
            // This is the killer—disk I/O in the allocation path!
            
            if (page_is_file_backed(page)) {
                // Write back to file: 1-10 ms
                writeback_page(page);
            } else {
                // Write to swap: 1-10 ms
                swap_writepage(page);
            }
            
            // Each dirty page costs milliseconds!
        }
        
        // Free the page
        free_page(page);
        nr_reclaimed++;
    }
    
    return nr_reclaimed;
}

/* Based on Linux kernel mm/vmscan.c, __alloc_pages_direct_reclaim() */

The disaster scenario in detail:

User calls malloc(1 MB):

T=0: Enter allocation path
  - Check free pages: 28 MB (below MIN=32 MB)
  - kswapd is trying but can't keep up
  - Must do direct reclaim!

T=1: Start searching for pages to evict
  - Check first page: dirty, must write
  - Write to swap: 5 ms
  
T=6: Still need more pages
  - Check second page: dirty, must write
  - Write to swap: 5 ms
  
... (≈50 of the ~256 pages needed for 1 MB are dirty) ...

T=250: Finally reclaimed enough
  - 50 dirty pages × 5 ms = 250 ms writing to disk
  - Clean pages and scanning overhead: ~2 ms
  - Total time in direct reclaim: 252 ms

T=252: Retry allocation
  - Success! Finally have free pages
  - Return from malloc()

From user's perspective:
  - malloc(1 MB) took 252 milliseconds!
  - On a fast CPU this should be microseconds
  - Application appears frozen for 1/4 second

⚠️ PERFORMANCE TRAP

Direct reclaim is the reason applications "freeze" under memory pressure. A simple malloc() call can block for hundreds of milliseconds while the kernel desperately writes dirty pages to disk. This happens in the application's context, so the application can't make progress. From the user's perspective, the application has hung.

Real example: Video player doing malloc() for frame buffer hits direct reclaim. 250 ms delay = 15 dropped frames at 60 FPS. Video stutters noticeably.

The only way to avoid direct reclaim is ensuring kswapd keeps free memory above WMARK_MIN. This requires proper tuning and avoiding memory-intensive workloads on undersized systems.

8.6.4 Swappiness: Tuning Reclaim Behavior

When under memory pressure, the kernel must decide: evict file-backed pages (easy to re-read) or anonymous pages (must write to swap)? The vm.swappiness sysctl controls this tradeoff:

// Simplified swappiness calculation
void get_scan_count(struct lruvec *lruvec, unsigned long *nr_to_scan,
                    unsigned long total_scan) {
    unsigned long anon_prio, file_prio;
    unsigned long ap, fp;
    unsigned long anon_size = lruvec_lru_size(lruvec, LRU_ANON);
    unsigned long file_size = lruvec_lru_size(lruvec, LRU_FILE);
    
    // swappiness ranges from 0 to 200
    // Default is 60
    unsigned long swappiness = vm_swappiness;
    
    // Calculate scan priorities
    // Higher swappiness = more willing to swap anonymous pages
    anon_prio = swappiness;
    file_prio = 200 - swappiness;
    
    // Adjust by LRU list sizes
    ap = anon_prio * anon_size;
    fp = file_prio * file_size;
    
    // Calculate how many pages to scan from each list.
    // Multiply before dividing: ap / (ap + fp) on its own would
    // truncate to zero in integer arithmetic.
    nr_to_scan[LRU_ANON] = ap * total_scan / (ap + fp);
    nr_to_scan[LRU_FILE] = fp * total_scan / (ap + fp);
    
    /* 
     * swappiness=0:   Strongly prefer evicting file pages
     * swappiness=60:  Default balanced behavior
     * swappiness=100: Equal treatment of anon and file
     * swappiness=200: Strongly prefer evicting anon pages (swapping)
     */
}

/* Based on Linux kernel mm/vmscan.c, get_scan_count() */

How different swappiness values behave:

Scenario: System has 4 GB anonymous (malloc'd) memory
          and 4 GB file cache (mmap'd files, executables)
          Need to reclaim 1 GB

swappiness=0 (Avoid swap, prefer file eviction):
  Priority: anon=0, file=200
  Result: Evict mostly file pages
  - Reclaim 950 MB from file cache
  - Reclaim 50 MB from anonymous memory
  Good for: Database servers (keep data in RAM, OK to re-read files)

swappiness=60 (Default balanced):
  Priority: anon=60, file=140
  Result: Prefer file eviction (30% anon, 70% file)
  - Reclaim 700 MB from file cache
  - Reclaim 300 MB from anonymous memory
  Good for: General purpose systems

swappiness=100 (Equal treatment):
  Priority: anon=100, file=100
  Result: Evict proportionally
  - Reclaim 500 MB from file cache
  - Reclaim 500 MB from anonymous memory
  Good for: Workloads with equal importance

swappiness=200 (Prefer swapping):
  Priority: anon=200, file=0
  Result: Evict mostly anonymous pages
  - Reclaim 50 MB from file cache
  - Reclaim 950 MB from anonymous memory
  Good for: File servers (keep file cache hot, OK to swap processes)
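
The split itself is simple enough to verify numerically. A sketch of the priority formula above (note the multiply-before-divide ordering, which integer arithmetic requires to avoid truncating to zero):

```c
#include <assert.h>

/* Split a reclaim scan between the anon and file LRU lists using
 * the simplified priorities: anon_prio = swappiness,
 * file_prio = 200 - swappiness, each weighted by list size.
 * Sizes and total_scan can be in any consistent unit (pages, MB). */
void scan_split(unsigned int swappiness,
                unsigned long anon_size, unsigned long file_size,
                unsigned long total_scan,
                unsigned long *scan_anon, unsigned long *scan_file)
{
    unsigned long ap = (unsigned long)swappiness * anon_size;
    unsigned long fp = (unsigned long)(200 - swappiness) * file_size;

    /* multiply first, then divide, to keep integer precision */
    *scan_anon = ap * total_scan / (ap + fp);
    *scan_file = total_scan - *scan_anon;
}
```

With 4 GB on each list and 1,024 MB to scan, swappiness=100 splits 512/512, while swappiness=60 yields roughly a 30/70 anon/file split (60 out of 200 priority points), and swappiness=0 directs the entire scan at the file list.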

Real-world tuning example:

Database server (PostgreSQL):
  - Database uses shared_buffers (anonymous memory)
  - Keep this hot in RAM at all costs!
  - OK to drop file cache (can re-read from disk)
  
  Tuning: vm.swappiness=10
  Result: Kernel strongly prefers evicting file cache
          Database working set stays in RAM
          Query latency: Stable
          
Desktop system:
  - Applications use anonymous memory
  - Frequently access same files (documents, code)
  - File cache is valuable
  
  Tuning: vm.swappiness=60 (default)
  Result: Balanced eviction
          Applications can swap if needed
          File cache provides fast re-access

File server (NFS/Samba):
  - Most requests are for file data
  - File cache is critical for performance
  - Server processes can swap if needed
  
  Tuning: vm.swappiness=150
  Result: Kernel prefers swapping server processes
          File cache stays hot
          File serving latency: Low and stable

💡 KEY INSIGHT - WORKLOAD SPECIFIC TUNING

There is no "best" swappiness value. It depends entirely on your workload:
- Database: Low swappiness (prefer file eviction)
- Desktop: Medium swappiness (balanced)
- File server: High swappiness (prefer swapping processes)

The default of 60 works reasonably for general-purpose systems but can be far from optimal for specialized workloads.

8.6.5 OOM Killer: The Last Resort

When reclaim fails—the kernel can't free enough memory even with direct reclaim—the Out-Of-Memory (OOM) killer is invoked. Its job: kill a process to free memory:

// OOM killer victim selection (simplified)
struct task_struct *select_bad_process(void) {
    struct task_struct *p;
    struct task_struct *chosen = NULL;
    unsigned long chosen_points = 0;
    
    // Scan all processes
    for_each_process(p) {
        unsigned long points = 0;
        
        // Never consider kernel threads or init—and check before
        // touching p->mm, which is NULL for kernel threads
        if (p->flags & PF_KTHREAD || p->pid == 1)
            continue;
        
        // Calculate "badness" score
        // Higher score = more likely to be killed
        
        // 1. Memory usage (primary factor)
        // RSS (resident set size) in pages
        points = get_mm_rss(p->mm);
        
        // 2. Add swap usage
        points += get_mm_counter(p->mm, MM_SWAPENTS);
        
        // 3. Add half of virtual memory
        // (might be mostly unmapped but still indicates big process)
        points += p->mm->total_vm / 2;
        
        // 4. Adjustments
        
        // Root processes: -25% (less likely to kill)
        if (has_capability(p, CAP_SYS_ADMIN))
            points -= points / 4;
        
        // oom_score_adj: user tunable (-1000 to +1000)
        // -1000 = never kill, +1000 = always prefer to kill
        // The real kernel scales it against total RAM, so +1000
        // adds the equivalent of every page in the system
        // (totalpages = total RAM pages, a parameter in real code)
        points += p->signal->oom_score_adj * totalpages / 1000;
        if ((long)points < 0)
            points = 0;  // clamp: a negative adj can push below zero
        
        // Track highest score
        if (points > chosen_points) {
            chosen_points = points;
            chosen = p;
        }
    }
    
    return chosen;
}

void oom_kill_process(struct task_struct *p) {
    pr_err("Out of memory: Kill process %d (%s)\n", p->pid, p->comm);
    
    // Dump memory info for debugging
    dump_memory_state();
    
    // Send SIGKILL to the process
    // This is immediate and cannot be caught
    send_sig(SIGKILL, p, 1);
    
    // Mark for memory reclaim
    mark_oom_victim(p);
}

/* Based on Linux kernel mm/oom_kill.c */

Example OOM decision:

System State:
  Total RAM: 8 GB
  Free: 10 MB (below MIN)
  Direct reclaim: Failed to free enough

Process Table:
  PID  Name          RSS    Swap   VM    Score  Adj  Final
  1    init          4 MB   0      4 MB  6      0    6 (never killed)
  234  sshd          8 MB   0      12 MB 14     0    14
  567  postgres      2 GB   100 MB 4 GB  4196   0    4196 ← Selected!
  890  chrome        1 GB   200 MB 3 GB  2760   0    2760
  1024 vim           50 MB  0      80 MB 90     0    90

(Score = RSS + Swap + VM/2, shown here in MB)

Victim: postgres (PID 567)
  Reason: Highest score (4196 points)
  Breakdown:
    - RSS: 2 GB = 524,288 pages
    - Swap: 100 MB = 25,600 pages
    - VM/2: 4 GB / 2 = 2 GB = 524,288 pages
    - Total: 1,074,176 pages ≈ 4.1 GB score
    - Adjustments: None (oom_score_adj=0)

Result:
  kernel: Out of memory: Kill process 567 (postgres) score 512
  kernel: Killed process 567, total-vm:4194304kB, anon-rss:2097152kB
  (the logged score is normalized to 0–1000:
   1,074,176 / 2,097,152 total pages × 1000 ≈ 512)
  
Impact:
  - Process killed instantly
  - 2 GB RAM freed
  - System recovers
  - Database users see connection errors

🔍 WHAT'S REALLY HAPPENING - OOM KILLER SELECTION

The OOM killer tries to:
1. Kill the process using the most memory (biggest impact)
2. Avoid killing critical system processes (init, systemd)
3. Respect user hints (oom_score_adj)
4. Minimize collateral damage (prefer single process over multiple)

It's not random—it's making a calculated choice about which process death will best help the system recover.
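
The scoring walkthrough condenses into one function. A sketch of the badness formula used above (RSS + swap + VM/2, with oom_score_adj scaled against total RAM as the real mm/oom_kill.c does); this is a simplified model, not the kernel's exact oom_badness():

```c
#include <assert.h>

/* Simplified OOM badness score, in pages.
 * oom_score_adj is scaled so that +1000 adds (and -1000 subtracts)
 * the equivalent of every page in the system; scores clamp at 0. */
long oom_badness(unsigned long rss_pages, unsigned long swap_pages,
                 unsigned long total_vm_pages, long oom_score_adj,
                 unsigned long totalpages)
{
    long points = (long)(rss_pages + swap_pages + total_vm_pages / 2);

    points += oom_score_adj * (long)totalpages / 1000;
    return points > 0 ? points : 0;
}
```

For the postgres entry (2 GB RSS, 100 MB swap, 4 GB VM on an 8 GB machine) this yields 1,074,176 pages; an adj of -1000 subtracts all of RAM and clamps the score to zero, which is why -1000 means "never kill."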

Tuning OOM behavior:

# Protect critical process from OOM killer
echo -1000 > /proc/$(pidof mysqld)/oom_score_adj
# Now mysqld is exempt from the OOM killer entirely

# Make process more likely to be killed
echo +500 > /proc/$(pidof big-batch-job)/oom_score_adj
# big-batch-job will be preferred victim

# Check current scores
ps aux | while read line; do
    pid=$(echo $line | awk '{print $2}')
    if [ -f /proc/$pid/oom_score ]; then
        score=$(cat /proc/$pid/oom_score)
        echo "$score $line"
    fi
done | sort -rn | head -20
# Shows 20 processes most likely to be OOM killed

The OOM killer is a last resort, but it's better than system deadlock. Proper memory management (sizing, tuning swappiness, monitoring) should prevent OOM from ever happening in production.


8.7 Page Table Management: The Meta-Problem

We've discussed evicting data pages to free memory. But page tables themselves consume memory—often a surprising amount. On a system with hundreds of processes or VMs, page table overhead can reach gigabytes. This creates a meta-problem: we need memory to manage memory.

Figure 8.7: Linux OOM killer scoring and selection. The badness score is computed as (RSS / total_pages) × 1000 plus the oom_score_adj tunable. The process with the highest score is sent SIGKILL; kernel threads and processes with adj = −1000 are immune.

8.7.1 Page Table Overhead Calculations

Let's calculate how much memory page tables consume for a typical process:

// Page table overhead for x86-64 4-level paging
// (byte-size helper macros spelled out for completeness)
#define PAGE_SIZE 4096UL
#define MB(x) ((unsigned long)(x) << 20)
#define GB(x) ((unsigned long)(x) << 30)

unsigned long calculate_pt_overhead(unsigned long virtual_size) {
    unsigned long overhead = 0;
    
    // Level 1: PML4 (always allocated)
    // One PML4 per process, regardless of size
    overhead += PAGE_SIZE;  // 4 KB
    
    // Level 2: PDPTs
    // One PDPT per 512 GB of virtual address space
    unsigned long nr_pdpts = (virtual_size + GB(512) - 1) / GB(512);
    overhead += nr_pdpts * PAGE_SIZE;
    
    // Level 3: PDs
    // One PD per 1 GB of virtual address space
    unsigned long nr_pds = (virtual_size + GB(1) - 1) / GB(1);
    overhead += nr_pds * PAGE_SIZE;
    
    // Level 4: PTs
    // One PT per 2 MB of virtual address space
    unsigned long nr_pts = (virtual_size + MB(2) - 1) / MB(2);
    overhead += nr_pts * PAGE_SIZE;
    
    return overhead;
}

// Example calculations:
// 10 GB process:
//   PML4: 4 KB (1 table)
//   PDPTs: 4 KB (1 table for 0-512 GB)
//   PDs: 40 KB (10 tables, one per GB)
//   PTs: 20,480 KB = 20 MB (5,120 tables, one per 2 MB)
//   Total: ~20 MB overhead (0.2% of 10 GB)

// 100 processes × 1 GB each:
//   Per process: 2 MB overhead
//   Total: 200 MB overhead

// 8 VMs × 8 GB each (with EPT):
//   Guest page tables: 8 × 16 MB = 128 MB
//   Host EPT: 8 × 16 MB = 128 MB
//   Total: 256 MB overhead

The overhead scales with how widely the virtual address space is used, not with how much data is resident. A process that touches just one 4 KB page in each 2 MB region of a 100 GB mapping needs as many page tables as one with all 100 GB populated: each PT must exist even if only one of its 512 entries is valid.

Real-world example:

📊 REAL NUMBERS - PAGE TABLE OVERHEAD SCALING

PostgreSQL database server (64 GB RAM):
  - 3 postgres processes, 16 GB shared_buffers each
  - Many memory-mapped files (10,000 files × 10 MB avg)
  - Address space per process: 48 GB virtual

Page table breakdown per process:
  - Dense regions (shared buffers): 16 GB → 32 MB of PTs
  - Sparse regions (mmap'd files): 32 GB → 64 MB of PTs
  - Total per process: ~96 MB
  - Total for 3 processes: 288 MB

That's 288 MB of RAM (0.45% of total) just for page tables!
With 2 MB huge pages: 288 MB → 576 KB (a 512× reduction)

8.7.2 Lazy Allocation: Avoiding Waste

Allocating all page tables upfront would be wasteful. Most processes don't use their entire address space. The kernel uses lazy allocation: page tables are created only when first accessed:

// Lazy page table allocation on page fault
int handle_page_fault(unsigned long address) {
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    
    // Walk page tables, allocating levels as needed
    
    // Level 1: PGD (PML4) always exists
    // Allocated during process creation
    pgd = pgd_offset(current->mm, address);
    if (pgd_none(*pgd)) {
        // Shouldn't happen—PML4 always allocated
        return VM_FAULT_OOM;
    }
    
    // Level 2: P4D (5-level paging) or PUD
    // Allocate if missing
    p4d = p4d_alloc(current->mm, pgd, address);
    if (!p4d)
        return VM_FAULT_OOM;
    
    // Level 3: PUD (PDPT)
    // Allocate if missing
    pud = pud_alloc(current->mm, p4d, address);
    if (!pud)
        return VM_FAULT_OOM;
    
    // Level 4: PMD (PD)
    // Allocate if missing
    pmd = pmd_alloc(current->mm, pud, address);
    if (!pmd)
        return VM_FAULT_OOM;
    
    // Level 5: PTE (PT)
    // Allocate if missing
    pte = pte_alloc(current->mm, pmd, address);
    if (!pte)
        return VM_FAULT_OOM;
    
    // Now handle the actual fault (demand paging, COW, etc.)
    return do_anonymous_page(current->mm, pte, address);
}

/* Based on Linux kernel mm/memory.c, __handle_mm_fault() */

The savings can be enormous:

Example: Process with sparse 64 GB address space
  - Actually using only 1 GB scattered across the space
  - 10,000 separate small allocations

Eager allocation (allocate all page tables upfront):
  - PML4: 4 KB
  - PDPTs: 4 KB
  - PDs: 256 KB (64 tables for 64 GB)
  - PTs: 131,072 KB = 128 MB (32,768 tables for 64 GB)
  - Total: ~128 MB

Lazy allocation (only what's used):
  - PML4: 4 KB
  - PDPTs: 4 KB (only 1 needed for first 512 GB)
  - PDs: 40 KB (only 10 needed for 10 regions of 1 GB each)
  - PTs: 20 MB (only 5,120 needed for 1 GB actual usage)
  - Total: ~20 MB

Savings: 128 MB → 20 MB = 6.4× reduction!

Lazy allocation is why you can run hundreds of processes on limited RAM—most of their address space exists only virtually, without consuming physical memory for page tables.

8.7.3 Page Table Reclamation

Just as data pages can be reclaimed under pressure, empty page tables can be freed. When a region of memory is unmapped, the kernel checks if entire page tables are now empty:

// Reclaim page tables when memory is unmapped
void unmap_page_range(struct vm_area_struct *vma,
                     unsigned long addr, unsigned long end) {
    pgd_t *pgd;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    
    // Walk the page tables, clearing entries
    for (; addr < end; addr += PAGE_SIZE) {
        pgd = pgd_offset(vma->vm_mm, addr);
        pud = pud_offset(pgd, addr);
        pmd = pmd_offset(pud, addr);
        pte = pte_offset_map(pmd, addr);
        
        // Clear the PTE
        pte_clear(vma->vm_mm, addr, pte);
        
        pte_unmap(pte);
    }
    
    // Now check whether any page-table pages are completely empty
    // and can be freed. (Simplified: the real kernel batches this
    // work in free_pgtables(); the *_is_empty() helpers below are
    // illustrative, not kernel APIs.)
    
    // Check if the entire PT is empty
    pmd = pmd_offset(pud, addr);
    if (pmd_present(*pmd) && pt_is_empty(pmd)) {
        // All 512 PTEs in this PT are now invalid:
        // free the PT page itself, then clear the PMD entry
        pte_free(vma->vm_mm, pmd_page(*pmd));
        pmd_clear(pmd);
    }
    
    // Check if the entire PD is empty
    pud = pud_offset(pgd, addr);
    if (pud_present(*pud) && pd_is_empty(pud)) {
        // All 512 PMD entries in this PD are now invalid:
        // free the PD page, then clear the PUD entry
        pmd_free(vma->vm_mm, pud_page(*pud));
        pud_clear(pud);
    }
    
    // Check if the entire PDPT is empty
    pgd = pgd_offset(vma->vm_mm, addr);
    if (pgd_present(*pgd) && pdpt_is_empty(pgd)) {
        // All 512 PUD entries in this PDPT are now invalid:
        // free the PDPT page, then clear the PGD entry
        pud_free(vma->vm_mm, pgd_page(*pgd));
        pgd_clear(pgd);
    }
    
    // Note: the PML4 (top level) is never freed here;
    // it is freed only when the process exits.
}

/* Based on Linux kernel mm/memory.c, unmap_page_range() */

When reclamation happens:

Scenario 1: munmap() large region
  - Application calls munmap(addr, 1 GB)
  - Kernel clears all PTEs in that 1 GB range
  - Checks each PT: all 512 PTEs invalid? Free PT
  - Result: Immediately reclaim 512 PTs = 2 MB

Scenario 2: Process exit
  - All memory unmapped at once
  - All page tables freed
  - Result: Multi-megabyte instant reclaim

Scenario 3: Memory pressure
  - Kernel scans for empty page tables
  - Frees them even if process still running
  - Result: Slow gradual reclaim (background)

But there's a subtlety—lazy reclamation:

// Conceptual sketch: Linux can defer freeing empty page tables
// rather than releasing them immediately on unmap.
// The helpers below are illustrative, not kernel APIs.

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma) {
    // Check if we should actually free page tables now
    // or defer until later
    
    if (system_under_memory_pressure()) {
        // Memory pressure: aggressively free empty tables
        free_empty_page_tables(vma);
    } else {
        // No pressure: defer freeing
        // Empty tables might be re-used soon
        // Keep them around to avoid re-allocation cost
    }
}

This lazy approach trades memory for performance: empty page tables consume memory but avoiding re-allocation reduces fault latency if the region is accessed again.

8.7.4 Compact vs Sparse Layouts

Page tables don't have to be physically contiguous, but placing them close together in physical memory improves cache performance. The Page Walk Cache (PWC) benefits from spatial locality:

// Two allocation strategies for page tables

// Strategy 1: Sparse allocation (simple but slow)
pte_t *allocate_pt_sparse(void) {
    // Allocate from general page pool
    // Pages come from anywhere in physical memory
    struct page *page = alloc_page(GFP_KERNEL);
    return page_to_virt(page);
    
    // Result: PTs scattered across physical memory
    // PWC hit rate: ~60-70%
    // Page walk latency: Higher variance
}

// Strategy 2: Compact allocation (complex but fast)
pte_t *allocate_pt_compact(struct mm_struct *mm) {
    // Try to allocate from contiguous region
    // Keep all of this process's page tables close together
    
    if (!mm->pt_pool) {
        // First PT allocation: reserve contiguous region
        // Reserve 16 MB for all future page tables
        mm->pt_pool = alloc_contiguous(PAGE_SIZE * 4096);
        mm->pt_pool_index = 0;
    }
    
    if (mm->pt_pool_index < 4096) {
        // Allocate from reserved pool
        pte_t *pt = mm->pt_pool + mm->pt_pool_index * PAGE_SIZE;
        mm->pt_pool_index++;
        return pt;
    }
    
    // Pool exhausted: fall back to sparse
    return allocate_pt_sparse();
    
    // Result: First 4096 PTs in contiguous 16 MB region
    // PWC hit rate: ~85-90%
    // Page walk latency: Lower and more predictable
}

/* Conceptual implementation based on page table clustering techniques */

Performance impact:

Benchmark: Random access to 100 GB sparse mapping

Sparse page table layout:
  - PTs scattered across physical memory
  - Random physical addresses
  - PWC miss on 40% of page walks
  - Average page walk: 120 ns
  - Total time: 120 ms per million TLB-missing accesses

Compact page table layout:
  - PTs in contiguous 200 MB region
  - Spatial locality in physical memory
  - PWC miss on 15% of page walks
  - Average page walk: 85 ns
  - Total time: 85 ms per million TLB-missing accesses

Speed-up: 29% faster page walks!

📊 REAL NUMBERS - COMPACT ALLOCATION IMPACT

Database workload (100 GB working set, random access):

Sparse PT allocation:
  - Page walk latency p50: 115 ns
  - Page walk latency p99: 220 ns
  - PWC hit rate: 62%
  - TLB miss penalty: high variance

Compact PT allocation:
  - Page walk latency p50: 82 ns (28% better)
  - Page walk latency p99: 140 ns (36% better)
  - PWC hit rate: 88%
  - TLB miss penalty: lower and predictable

The tradeoff: compact allocation requires reserving contiguous memory upfront, which may not be available under fragmentation. Most systems use hybrid approaches: try compact allocation, fall back to sparse if needed.

8.7.5 Huge Pages: The Ultimate Optimization

We've mentioned huge pages throughout this book. Their impact on page table overhead deserves special attention:

// Overhead comparison: 4KB vs 2MB pages
void compare_page_table_overhead(void) {
    unsigned long data_size = GB(100);  // 100 GB of data
    
    // With 4KB pages:
    unsigned long pages_4kb = data_size / KB(4);   // 26,214,400 pages
    unsigned long pts_needed = pages_4kb / 512;    // 51,200 PTs
    unsigned long overhead_4kb = pts_needed * PAGE_SIZE;
    
    printf("4KB pages:\n");
    printf("  Data pages: %lu (26.2 million)\n", pages_4kb);
    printf("  PTs needed: %lu (51,200)\n", pts_needed);
    printf("  PT overhead: %lu MB (%.2f%%)\n", 
           overhead_4kb / MB(1), 
           100.0 * overhead_4kb / data_size);
    
    // With 2MB pages:
    unsigned long pages_2mb = data_size / MB(2);  // 51,200 pages
    unsigned long pmds_needed = pages_2mb / 512;  // 100 PMDs
    unsigned long overhead_2mb = pmds_needed * PAGE_SIZE;
    
    printf("\n2MB pages:\n");
    printf("  Data pages: %lu (51,200)\n", pages_2mb);
    printf("  PMDs needed: %lu (100)\n", pmds_needed);
    printf("  PT overhead: %lu KB (%.4f%%)\n",
           overhead_2mb / KB(1),
           100.0 * overhead_2mb / data_size);
    
    printf("\nReduction: %lux smaller!\n", overhead_4kb / overhead_2mb);
}

// Output:
// 4KB pages:
//   Data pages: 26214400 (26.2 million)
//   PTs needed: 51200 (51,200)
//   PT overhead: 200 MB (0.20%)
//
// 2MB pages:
//   Data pages: 51200 (51,200)
//   PMDs needed: 100 (100)
//   PT overhead: 400 KB (0.0004%)
//
// Reduction: 512x smaller!

Real PostgreSQL example revisited:

Earlier we saw PostgreSQL with:
  - 3 processes × 48 GB virtual each
  - 288 MB in page tables

With transparent huge pages (THP) enabled:
  - Same 3 processes × 48 GB virtual
  - Dense regions use 2MB pages
  - Sparse regions still use 4KB (can't use huge pages)
  
New breakdown:
  - Dense (16 GB shared buffers): 32 MB → 64 KB of PTs (512× reduction)
  - Sparse (32 GB mmap'd files): 64 MB → 64 MB (unchanged)
  - Total per process: ~64 MB
  - Total for 3 processes: ~192 MB (was 288 MB)
  
Savings: ~96 MB freed

If fully converted to huge pages:
  - Total: 288 MB → 576 KB (512× reduction!)
  - Savings: ~287 MB freed

Huge pages provide:
  1. ~512× less page table overhead (the entire bottom level is eliminated)
  2. 512× better TLB coverage (each TLB entry covers 2 MB, not 4 KB)
  3. Faster page walks (one fewer level to traverse)

The challenge is allocation: 2MB huge pages require 2MB contiguous physical memory, which may not be available due to fragmentation.


8.8 Performance Analysis and Tuning

We've explored the mechanisms—now let's see how to diagnose problems and optimize real systems. This section covers the tools for identifying memory management issues and the parameters for tuning them.

8.8.1 Essential Profiling Tools

Tool 1: /proc/meminfo - System-Wide Memory Status

$ cat /proc/meminfo
MemTotal:       65536000 kB   # Total RAM
MemFree:         2048000 kB   # Truly free (not in use)
MemAvailable:   17920000 kB   # Free + reclaimable (misleading!)
PageTables:      8192000 kB   # Memory used by page tables ← RED FLAG!
Active:         28672000 kB   # Recently accessed pages
Inactive:       16384000 kB   # Not recently accessed
Dirty:           1024000 kB   # Modified, needs writeback
Shmem:          16384000 kB   # Shared memory (tmpfs, mmap)
SwapTotal:      8388608 kB    # Total swap space
SwapFree:       8388608 kB    # Unused swap

Key metrics to watch:

Red flags:
  - PageTables > 1% of MemTotal → Page table overhead problem
  - MemAvailable misleading when PageTables is high
  - Dirty > 10% of MemTotal → Write pressure
  - SwapFree much less than SwapTotal → System swapping heavily

Example problematic output:
  MemTotal:    64 GB
  MemAvailable: 17 GB ← Looks OK!
  PageTables:    8 GB ← PROBLEM! Not reclaimable
  Actual available: 17 GB - 8 GB = 9 GB only

Tool 2: vmstat - Real-Time Monitoring

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 2048000 256000 8192000  0    0   100  200 1000 2000 25 10 60  5  0
 2  1      0 1024000 256000 8192000  0  5120  50  5000 1200 2200 20 15 50 15  0
Column meanings:
  si = swap in (pages/s)     so = swap out (pages/s)
  bi = blocks in (read)      bo = blocks out (write)

Interpreting swap columns:

Good: si=0, so=0
  → No swapping
  → System healthy

Warning: si=100, so=200
  → Light swapping (100-200 pages/sec)
  → Monitor, may worsen

Critical: si=5120, so=10240
  → Heavy swapping (5-10 MB/sec)
  → System thrashing
  → Applications unresponsive
  → Action required immediately!

Real example during PostgreSQL OOM:
  si=8192, so=16384 (8-16 MB/sec swapping)
  bo=25000 (25 MB/sec disk writes)
  wa=45 (45% CPU time waiting for I/O)
  → Database completely frozen

⚠️ PERFORMANCE TRAP - MISLEADING METRICS

High MemAvailable doesn't guarantee allocation success! If PageTables is high, that "available" memory is actually consumed by un-reclaimable page table overhead. Always check PageTables when diagnosing OOM issues.

Tool 3: perf - Hardware Counter Analysis

# Measure TLB miss rate
$ perf stat -e dTLB-load-misses,dTLB-loads ./application

 Performance counter stats for './application':

     12,234,567  dTLB-load-misses   # 2.5% of all dTLB accesses
    487,654,321  dTLB-loads
    
      5.123 seconds time elapsed

Analyzing results:

TLB miss rate calculation:
  Misses: 12,234,567
  Total accesses: 487,654,321
  Miss rate: 2.5%
  
Cost analysis:
  Each miss: ~100 cycles (page table walk)
  Total cost: 12,234,567 × 100 = 1,223,456,700 cycles
  At 3 GHz: 1.2 billion cycles / 3 billion per second = 0.4 seconds
  
  Application ran for 5.1 seconds
  TLB miss overhead: 0.4 / 5.1 = 7.8% of runtime!

Action: Enable huge pages to reduce TLB misses

Tool 4: /proc/[pid]/smaps - Per-Process Memory Maps

$ cat /proc/1234/smaps
7f8a2c000000-7f8a30000000 rw-p 00000000 00:00 0 
Size:              65536 kB   # VMA size
KernelPageSize:        4 kB   # Underlying page size
MMUPageSize:           4 kB   # Page size used by MMU
Rss:               32768 kB   # Actually in RAM
Pss:               16384 kB   # Proportional share (RSS/sharers)
Shared_Clean:       8192 kB   # Shared, not modified
Shared_Dirty:          0 kB   # Shared, modified
Private_Clean:     16384 kB   # Private, not modified
Private_Dirty:      8192 kB   # Private, modified ← Will need swap
Referenced:        24576 kB   # Recently accessed
Anonymous:         32768 kB   # Not file-backed
AnonHugePages:     20480 kB   # Using transparent huge pages
Swap:               4096 kB   # Currently in swap
SwapPss:            2048 kB   # Proportional swap

Detecting issues:

Issue 1: Fragmentation
  Size: 65536 kB (whole VMA)
  AnonHugePages: 4096 kB (only small portion)
  → Fragmentation preventing THP
  
Issue 2: Swap pressure
  Swap: 16384 kB
  Private_Dirty: 32768 kB
  → Half the private pages swapped out
  → Performance degradation likely

Issue 3: Memory leaks
  $ watch -n 1 'cat /proc/1234/smaps | grep -A 15 heap'
  [Watch Rss growing continuously]
  → Memory leak in heap

8.8.2 Common Performance Issues

Issue 1: TLB Thrashing

Symptoms:
  - High CPU usage
  - perf shows high dTLB-load-misses
  - Application slower than expected
  - No disk I/O (rules out paging)

Diagnosis:
  $ perf stat -e dTLB-load-misses,dTLB-loads,iTLB-load-misses ./app
  
  dTLB-load-misses: 45,000,000  # Data TLB misses
  dTLB-loads: 500,000,000       # Total data accesses
  iTLB-load-misses: 5,000,000   # Instruction TLB misses
  
  dTLB miss rate: 9% ← VERY HIGH (should be <1%)

Root cause:
  - Working set larger than TLB coverage
  - Random access pattern (poor locality)
  - 4KB pages: TLB covers only 1-2 MB
  
Solutions:
  1. Enable huge pages:
     echo always > /sys/kernel/mm/transparent_hugepage/enabled
     → 2MB pages: TLB covers 256-512 MB (256× improvement)
     
  2. Restructure data for better locality:
     - Sort by access pattern
     - Use SOA (struct of arrays) instead of AOS (array of structs)
     
  3. Reduce the working set per phase:
     - Process fewer items per batch
     - Partition data so each phase fits in TLB coverage

Issue 2: Swap Thrashing

Symptoms:
  - System extremely slow
  - Applications freeze frequently
  - vmstat shows high si/so values
  - Disk I/O pegged at 100%

Diagnosis:
  $ vmstat 1
  si    so
  8192  16384   ← 8-16 MB/sec swap I/O
  7680  15360
  8704  17408   ← Consistently high
  
  $ iostat -x 1
  Device: rrqm/s wrqm/s  await %util
  sda     0.00   0.00   45.2   99.8  ← Disk saturated

Root cause:
  - Working set > Physical RAM
  - Kernel constantly swapping pages in/out
  - Page faults on every access
  
Solutions:
  1. Emergency: Kill memory hog
     $ ps aux --sort=-%mem | head
     $ kill -9 [PID]
     
  2. Temporary: Disable swap
     $ swapoff -a
     → Forces OOM instead of thrashing
     → Better than 30-minute freeze
     
  3. Permanent: Add RAM or reduce working set
     - Scale horizontally (more machines)
     - Reduce cached data
     - Use mmap with MAP_POPULATE cautiously

Issue 3: High Page Table Overhead

Symptoms:
  - PageTables field in /proc/meminfo unusually high
  - OOM kills despite "available" memory
  - Many small memory mappings

Diagnosis:
  $ cat /proc/meminfo | grep PageTables
  PageTables:     8388608 kB    ← 8 GB in page tables!
  
  $ cat /proc/1234/status | grep VmData
  VmData:    16777216 kB         ← 16 GB virtual
  
  $ cat /proc/1234/maps | wc -l
  45123                          ← 45,000 VMAs!
  
  Overhead: 8 GB PTs for 16 GB data = 50% overhead!

Root cause:
  - Fragmented memory mappings
  - Many small mmap() calls
  - Database with thousands of memory-mapped files
  
Solutions:
  1. Enable huge pages (~512× fewer PTs):
     $ echo 24576 > /proc/sys/vm/nr_hugepages
     → Pre-allocate 48 GB of huge pages
     
  2. Consolidate mappings:
     - Use fewer, larger mmap() calls
     - Batch allocations
     - Reduce number of mapped files
     
  3. PostgreSQL specific:
     huge_pages = on
     → Shared buffers use huge pages
     → 8 GB overhead → 16 MB

8.8.3 Tuning Parameters

# Key sysctl parameters for memory management

# Minimum free memory (triggers kswapd)
vm.min_free_kbytes = 32768        # 32 MB minimum
# Too low: Direct reclaim more likely
# Too high: Wastes memory

# Swappiness (0-200, default 60)
vm.swappiness = 10                # Database server
vm.swappiness = 60                # Desktop/general
vm.swappiness = 150               # File server
# Higher = more willing to swap

# Dirty page writeback
vm.dirty_ratio = 20               # 20% max dirty before blocking
vm.dirty_background_ratio = 10   # 10% starts background writeback
# Lower = More frequent writes, less burst latency

# Zone reclaim (NUMA)
vm.zone_reclaim_mode = 0          # Prefer remote memory over reclaim
vm.zone_reclaim_mode = 1          # Prefer local reclaim
# 0 = Better for most workloads
# 1 = NUMA-aware databases

# Overcommit behavior
vm.overcommit_memory = 0          # Heuristic (default)
vm.overcommit_memory = 1          # Always allow
vm.overcommit_memory = 2          # Never overcommit
# 1 = Cloud/container environments
# 2 = Critical systems (database)

# Transparent huge pages
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# always = Automatic THP
# madvise = Application controlled
# never = Disabled

Real tuning example - PostgreSQL optimization:

# Before tuning:
PageTables: 8 GB
Swapping: Heavy (si=5120/s)
Query p99: 2.5 seconds

# Apply tuning:
echo 10 > /proc/sys/vm/swappiness                      # Avoid swap
echo 24576 > /proc/sys/vm/nr_hugepages                 # 48 GB huge pages
echo always > /sys/kernel/mm/transparent_hugepage/enabled

# PostgreSQL config:
huge_pages = on
shared_buffers = 16GB

# After tuning:
PageTables: 512 MB (16× reduction!)
Swapping: None (si=0/s)
Query p99: 180 ms (14× faster!)
TLB miss rate: 8% → 0.3% (27× improvement)

8.9 Chapter Summary

We began this chapter with a mystery: a PostgreSQL instance killed by the OOM killer despite 17 GB of "available" memory. Now we can explain exactly what happened:

The PostgreSQL OOM Mystery - Solved:

System: 64 GB RAM, 3 postgres processes

Problem:
  - Each process: 16 GB shared_buffers + memory-mapped files
  - Virtual address space per process: 48 GB
  - Fragmented mappings: 45,000 VMAs each
  
Page table overhead calculation:
  - Dense regions: 16 GB → 32 MB PTs each
  - Sparse regions: 32 GB → 64 MB PTs each
  - Total per process: 96 MB
  - Total for 3 processes: 288 MB
  - But with 45,000 VMAs: 2.7 GB per process
  - Grand total: 8.1 GB in page tables!

The math:
  MemTotal: 64 GB
  Used for data: 48 GB (3 × 16 GB)
  Used for PTs: 8 GB (un-reclaimable!)
  File cache: 6 GB
  Kernel/other: 2 GB
  Total: 64 GB (exactly full!)
  
  MemAvailable reported: 17 GB (the kernel's optimistic estimate)
  But PageTables: 8 GB (pinned; cannot be evicted!)
  Actually available: 17 - 8 = 9 GB
  
  New allocation: 10 GB request
  9 GB < 10 GB → OOM!

Solution applied:
  Enable huge pages:
    - PageTables: 8 GB → 512 MB (16× reduction)
    - Now actually available: 17 GB - 0.5 GB = 16.5 GB
    - 10 GB allocation succeeds
    - Plus: 15% performance improvement from TLB efficiency

Integration with Previous Chapters

This capstone chapter brings together concepts from throughout the book:

Chapter 1 (Virtual Memory Abstraction) introduced process address spaces. We now understand the overhead of that abstraction—page tables consume memory to enable the mapping.

Chapter 2 (Demand Paging) showed how processes don't need all pages in RAM. We now know which pages to evict (A/D bits) and how to free them (kswapd, direct reclaim).

Chapter 3 (Page Tables) explained multi-level hierarchies. We now understand when to allocate those levels (lazy allocation) and when to free them (reclamation under pressure).

Chapter 4 (TLB) showed translation caching. We now know the cost when TLB fails (page table walks) and why huge pages matter (512× better coverage).

Chapter 7 (Page Faults) covered fault handling. We now understand nested faults in VMs (cascading EPT violations) and A/D bit faults on RISC-V.

Together, these chapters form a complete understanding of the hardware-software contract for memory management.

Practical Takeaways

For System Administrators:

Monitoring checklist:
  ✓ Check PageTables in /proc/meminfo weekly
  ✓ Set up alerts for si/so in vmstat
  ✓ Configure OOM killer to protect critical services
  ✓ Tune swappiness for your workload type
  ✓ Consider huge pages for large-memory services

Configuration:
  - Database servers: swappiness=10, huge pages on
  - File servers: swappiness=150, keep file cache hot
  - Desktops: swappiness=60 (default)
  - VMs: Pre-populate EPT, use huge pages in both guest and host

For Application Developers:

Best practices:
  ✓ Use madvise(MADV_HUGEPAGE) for large allocations
  ✓ Avoid fragmented allocations (many small mmaps)
  ✓ Consider memory-mapped files vs read/write tradeoffs
  ✓ Profile with perf to find TLB thrashing
  ✓ Structure data for spatial locality

Red flags:
  ✗ Allocating millions of small objects (page table overhead)
  ✗ Random access patterns (TLB thrashing)
  ✗ Frequent mmap/munmap (page table churn)
  ✗ Large sparse mappings (wasted page tables)

For Hypervisor Operators:

Optimization checklist:
  ✓ Pre-populate EPT during VM creation
  ✓ Use 2MB EPT pages when possible
  ✓ Reserve huge page pool at host boot
  ✓ Enable huge pages in both guest and host
  ✓ Monitor EPT fault rate with perf

VM density calculation:
  Account for:
    - Guest RAM
    - Guest page tables (~0.2% of RAM)
    - Host EPT (~0.2% of RAM)
    - Total: ~1.004 × guest RAM
  
  Example: 16 VMs × 8 GB
    - Guest usage: 128 GB
    - Metadata: 512 MB
    - Total needed: 128.5 GB

The Big Picture

Memory management in modern systems involves a complex dance between four layers:

  1. Application behaviour: allocation patterns, access locality, mmap usage
  2. Operating system policy: reclaim, OOM selection, page table management
  3. Hypervisor translation (when virtualised): EPT/NPT, nested faults
  4. Hardware mechanisms: MMU, TLB, accessed/dirty bits

Understanding all four layers is essential for:
  - Debugging production mysteries
  - Optimizing performance
  - Scaling systems efficiently
  - Avoiding catastrophic failures

The PostgreSQL mystery we solved demonstrates this. The problem wasn't in any single layer—it was the interaction between application behavior (fragmented mappings), OS bookkeeping (page table overhead), and hardware constraints (physical RAM limits).

References

  1. Denning, P.J. "Virtual Memory." ACM Computing Surveys, 2(3):153–189, 1970. DOI: 10.1145/356571.356573

  2. Corbató, F.J. "A Paging Experiment with the Multics System." MIT Project MAC Report MAC-M-384. MIT, 1968.

  3. Adams, K. and Agesen, O. "A Comparison of Software and Hardware Techniques for x86 Virtualization." Proc. 12th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2006). ACM, 2006. DOI: 10.1145/1168857.1168860

  4. Zhao, Y. "Multi-Gen LRU." Linux kernel mailing list, 2022. Merged in Linux 6.1. lore.kernel.org

  5. Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide. Intel Corporation, 2023. intel.com/sdm

  6. Arm Limited. Arm Architecture Reference Manual for A-profile Architecture. Arm Limited, 2023. developer.arm.com

  7. RISC-V International. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, v20211203. RISC-V International, 2021. github.com/riscv/riscv-isa-manual

  8. Linux Kernel Contributors. mm/vmscan.c, mm/oom_kill.c, mm/huge_memory.c. Linux kernel v6.8, 2024. elixir.bootlin.com/linux/v6.8/source/mm

  9. Arcangeli, A. "Transparent Huge Pages." Linux Kernel Documentation, 2011. kernel.org/doc/…/transhuge.html

  10. Gregg, B. Systems Performance: Enterprise and the Cloud, 2nd ed. Addison-Wesley, 2020. ISBN: 978-0-13-682015-4.

  11. Megiddo, N. and Modha, D.S. "ARC: A Self-Tuning, Low Overhead Replacement Cache." Proc. 2nd USENIX Conf. on File and Storage Technologies (FAST '03). USENIX, 2003.

  12. Johnson, T. and Shasha, D. "2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm." Proc. 20th Int'l Conf. on Very Large Data Bases (VLDB 1994). Morgan Kaufmann, 1994.