Consider a production PostgreSQL database server with 24 GB of RAM. The monitoring dashboard shows 17 GB "available" memory — plenty of headroom, right? Then, without warning, the database process is killed. The system log contains one cryptic line: Out of memory: Killed process 4721 (postgres) score 823. How can a process be killed when 17 GB is "available"?
The answer lies in how the Linux kernel distinguishes between memory that is free and memory that is merely reclaimable. That 17 GB was almost entirely page cache — file data buffered in RAM to speed up reads. When a large memory allocation demand arrived, the kernel needed pages immediately, discovered that reclaiming cache required writing dirty pages to disk, could not complete reclaim fast enough, and triggered the OOM killer as a last resort. The database had done nothing wrong. The system's memory management cascade had failed to keep pace with demand.
This chapter bridges the hardware mechanisms covered in Chapters 1–7 with the software policies that drive real system behaviour. We have established how the MMU translates addresses (Chapter 3), how TLBs accelerate that translation (Chapter 4), and how page faults interrupt and redirect execution (Chapter 7). Now we examine what happens when memory genuinely runs short: which pages are chosen for eviction, how the kernel manages nested virtualisation faults, how hardware-software cooperation via accessed and dirty bits drives reclaim decisions, and what performance levers a systems programmer can pull when things go wrong.
Scenario 1 — The Redis fork() surprise. Redis performs a background save via fork(), relying on copy-on-write semantics. Immediately after the fork, both parent and child share all pages. Within milliseconds, the parent begins writing to nearly every key, triggering copy-on-write faults for each modified page. Memory usage doubles in seconds. The system that had 4 GB free now has 200 MB free and is thrashing.
Scenario 2 — The VM host boot storm. A cloud host brings up 32 virtual machines simultaneously. Each VM's two-stage address translation (§8.2) requires EPT entries for its physical memory. Before any EPT entries exist, every memory access inside each VM triggers a nested page fault — a VM exit, EPT entry allocation, and VM re-entry. With 32 VMs booting in parallel, the host is processing tens of thousands of EPT faults per second. Boot takes 5× longer than on bare metal.
Scenario 3 — The memory leak cascade. A long-running application leaks 100 MB/hour. After 48 hours, it has consumed 4.8 GB of additional virtual memory. Page tables for this memory themselves occupy 10 MB (§8.7 covers why). The kernel begins reclaiming pages. Some are clean and cheap to evict; others are dirty and require writeback, stalling the allocating thread. Eventually the OOM killer intervenes.
All three scenarios involve the same fundamental mechanisms: watermarks that trigger reclaim, LRU lists that determine eviction order, and the accessed and dirty bits that hardware sets to inform those decisions. Figure 8.1 shows how these mechanisms cascade from normal operation through increasing pressure to OOM intervention.
Chapters 1–7 described the hardware side of memory management: the MMU translates virtual to physical addresses, the TLB caches recent translations, and the page fault mechanism delivers control to the OS when translation fails. But hardware alone cannot decide which pages deserve to stay in RAM. That decision requires knowledge of application behaviour — specifically, which pages are being used and which have not been touched recently.
The hardware–software contract for memory reclaim rests on two bits embedded in every page table entry: the Accessed (A) bit and the Dirty (D) bit. The hardware sets these bits automatically as pages are accessed and written, providing the OS with a low-cost usage signal that does not require instrumentation of every memory access. The OS reads and clears these bits periodically, using the pattern of recent access to classify pages as hot or cold. Cold clean pages are trivially evictable; cold dirty pages must be written back before eviction; hot pages are left in place.
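To make the contract concrete, here is a minimal sketch of how an OS-side scanner might read these bits on x86-64. The bit positions (Accessed = bit 5, Dirty = bit 6) follow the Intel SDM; the pte_ptr argument and the simplified non-atomic clear are illustrative only, since a real kernel uses an atomic read-modify-write and must also flush the stale TLB entry.
#include <stdint.h>
#include <stdbool.h>
#define PTE_ACCESSED (1ULL << 5) /* set by hardware on any access */
#define PTE_DIRTY    (1ULL << 6) /* set by hardware on first write */
/* Classify a page from its PTE, then clear A for the next scan. */
static bool page_was_accessed(volatile uint64_t *pte_ptr)
{
    uint64_t pte = *pte_ptr;
    bool hot = pte & PTE_ACCESSED;
    /* Clear A so the next scan period measures fresh accesses.
       Simplified: a real kernel uses an atomic RMW here and must
       invalidate the TLB entry, or the cached translation will
       keep serving the stale A state. */
    *pte_ptr = pte & ~PTE_ACCESSED;
    return hot;
}
static bool page_needs_writeback(uint64_t pte)
{
    return pte & PTE_DIRTY; /* dirty: write back before eviction */
}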
The mechanism seems simple, but the details — and the subtle differences between x86-64, ARM64, and RISC-V implementations — determine whether a system handles memory pressure gracefully or collapses into thrashing. Sections 8.2–8.4 establish the foundation; Sections 8.5–8.8 show how these primitives compose into the full Linux memory management stack.
Chapter 3 introduced two-stage address translation for virtualisation: a guest virtual address (GVA) first traverses the guest's own page tables (Stage 1) to produce a guest physical address (GPA), which then traverses the host's EPT or Stage-2 tables to yield a host physical address (HPA). We calculated that without caching, this walk could require up to 25 memory accesses. We showed that TLBs and page walk caches reduce the steady-state overhead to 2–7%. What we did not cover was what happens when the translation fails at either or both stages — and why those failures are so much more expensive than a simple page fault on bare metal.
There are four fundamentally different kinds of nested page fault, each with distinct causes, costs, and remediation strategies. Figure 8.2 illustrates all four types with their typical latencies.
Type 1 — Guest page fault (Stage 1 failure). The guest application accesses an unmapped virtual address. The guest's own page tables have no entry for it, so the CPU generates a page fault exception that is injected into the guest. The guest OS handles it exactly as it would on bare metal: allocating a page, installing a PTE, and returning. The host hypervisor is entirely uninvolved. Cost: identical to a native page fault, 1–10 µs for a minor fault.
Type 2 — EPT / Stage-2 violation (Stage 2 failure). Stage 1 succeeds — the guest has a valid mapping for the virtual address — but the resulting GPA has no entry in the EPT (or ARM Stage-2 table). The CPU triggers a VM exit (reason 48 in the x86 VMCS). The hypervisor allocates a host page, installs the EPT entry mapping GPA→HPA, and re-enters the VM. Cost: approximately 5–10 µs (2 µs VM exit + 3 µs allocation + 2 µs VM re-entry).
Type 3 — Permission violation. Both Stage 1 and Stage 2 have entries, but EPT denies the requested access type. This pattern covers three important use cases. Copy-on-write: the guest writes to a shared page; EPT marks it read-only; the hypervisor duplicates the physical page and installs a writable EPT entry (~10 µs). MMIO emulation: the guest accesses a device register address; the hypervisor emulates the device operation (50–500 µs depending on complexity). Live migration dirty-page tracking: the hypervisor marks all EPT entries read-only at the start of a migration scan; any guest write triggers an EPT violation that records the page as dirty before reinstating write permission (~4 µs per fault).
Type 4 — Cascading fault storm. This is the critical failure mode unique to two-stage translation. When the guest OS is walking its own page tables — during a Type 1 fault handler, for example — each level of the guest PT walk reads from a guest physical address. If those guest PT pages have not yet been mapped in the EPT, each PT-level read triggers its own Type 2 EPT violation. A four-level guest PT walk can thus require up to four cascading VM exits before the original guest fault can even be delivered. In the worst case, one user-space memory access costs approximately 50 µs — 50× the bare-metal cost. This is why the boot storm of Scenario 2 is so punishing: the guest kernel is initialising its page tables from scratch, every PT-level read misses the EPT, and the hypervisor processes thousands of cascading faults per second.
The principal mitigation for Type 4 storms is EPT pre-population: the hypervisor maps all expected GPA→HPA translations in the EPT at VM creation time, before the guest boots. This transforms the boot-time cascade of thousands of Type 2 faults into a one-time upfront cost during VM initialisation. For a 4 GB VM, pre-populating the EPT with 4 KB pages requires 1 million EPT entries — about 8 MB of host memory — and takes roughly 20 ms. That 20 ms cost is paid once; without it, the VM spends an equivalent amount of time processing EPT faults during every boot.
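A sketch of what pre-population looks like, assuming hypothetical helpers alloc_host_page() and ept_install() (and the permission flags) in place of a real hypervisor's memory API:
#include <stdint.h>
/* Conceptual sketch of EPT pre-population at VM creation time.
   alloc_host_page(), ept_install(), and the EPT_* flags are
   hypothetical placeholders, not a real hypervisor interface. */
#define GUEST_RAM_BYTES (4ULL << 30) /* 4 GB guest */
#define EPT_PAGE_BYTES  4096ULL
static void ept_prepopulate(struct vm *vm)
{
    for (uint64_t gpa = 0; gpa < GUEST_RAM_BYTES; gpa += EPT_PAGE_BYTES) {
        uint64_t hpa = alloc_host_page(vm); /* pick a free host frame */
        ept_install(vm, gpa, hpa, EPT_READ | EPT_WRITE | EPT_EXEC);
    }
    /* ~1 million iterations for 4 GB of 4 KB pages: roughly 20 ms
       paid once here, instead of ~1 million Type 2 faults spread
       across every subsequent boot. */
}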
A complementary strategy is using 2 MB EPT pages (large pages at the EPT level). Because each 2 MB EPT entry covers 512 contiguous 4 KB pages, a 4 GB VM's EPT shrinks from 1 million entries to just 2,048. This reduces the number of EPT faults during boot by 512× and also improves the hit rate of the CPU's EPT translation caches. The tradeoff is 2 MB alignment requirements for guest memory and reduced granularity for COW and live-migration dirty tracking.
While all three major ISA families provide two-stage translation for virtualisation, they differ significantly in what information the hypervisor receives when a Stage-2 fault occurs and how efficiently batched fault processing can be implemented. Figure 8.3 provides a side-by-side comparison across x86-64, ARM64, and RISC-V.
When an EPT violation occurs on x86-64, the CPU records exit reason 48 in the VMCS and provides a 64-bit exit qualification field containing rich diagnostic information: whether the fault was caused by a read, write, or instruction fetch; which levels of the EPT were valid; whether the guest linear address is valid; and a flag indicating whether the fault was caused by the guest PT walker itself (crucial for diagnosing Type 4 cascades). The faulting GPA is available directly in a separate VMCS field. This richness means the hypervisor fault handler can typically resolve EPT violations without re-walking any tables, keeping the average Type 2 fault latency to approximately 6.2 µs.
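As a sketch, an x86-64 handler might classify the violation like this. vmcs_read() stands in for the hypervisor's VMREAD wrapper, handle_ept_fault() is a placeholder, and only the access-type bits (0–2 per the Intel SDM) of the qualification are decoded here:
#include <stdint.h>
#include <stdbool.h>
static void classify_ept_violation(void)
{
    uint64_t qual = vmcs_read(EXIT_QUALIFICATION);
    uint64_t gpa  = vmcs_read(GUEST_PHYSICAL_ADDRESS);
    bool read  = qual & (1 << 0); /* data read */
    bool write = qual & (1 << 1); /* data write */
    bool fetch = qual & (1 << 2); /* instruction fetch */
    /* Higher qualification bits report whether the guest linear
       address is valid and whether the fault occurred during the
       guest's own PT walk, the Type 4 signature discussed above. */
    handle_ept_fault(gpa, read, write, fetch);
}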
A key x86-specific optimisation is posted interrupts for EPT faults: when using nested virtualisation (running a hypervisor inside a VM), Intel's architecture allows batching EPT fault notifications, reducing VM exit frequency for workloads with high nested fault rates by 40–60%.
ARM64's Stage-2 fault mechanism is, in some respects, even more informative than x86-64's. The ESR_EL2 register's Instruction Specific Syndrome (ISS) field reports not just the access type (read or write) but also the exact translation table level at which the fault occurred, encoded in the fault status code. The S1PTW bit explicitly identifies faults caused by Stage-1 page table walks — directly distinguishing the Type 4 cascade scenario from a simple unmapped GPA. The faulting intermediate physical address (IPA, equivalent to GPA) is available in HPFAR_EL2. Average Type 2 latency is approximately 7.1 µs, slightly higher than x86-64 due to the additional state saved on VM exit.
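A corresponding ARM64 sketch, using the Linux-style read_sysreg() accessor. The EC (bits [31:26]), WnR (ISS bit 6), and S1PTW (ISS bit 7) positions follow the ARM Architecture Reference Manual; handle_stage2_fault() is a placeholder:
#include <stdbool.h>
static void classify_stage2_abort(void)
{
    unsigned long esr = read_sysreg(esr_el2);
    unsigned long ec  = (esr >> 26) & 0x3f;
    if (ec == 0x24) { /* data abort from a lower exception level */
        bool write = esr & (1UL << 6); /* WnR: write, not read */
        bool s1ptw = esr & (1UL << 7); /* fault hit the Stage-1 PT walk */
        /* HPFAR_EL2's FIPA field holds IPA[47:12] in bits [39:4]. */
        unsigned long ipa = (read_sysreg(hpfar_el2) >> 4) << 12;
        handle_stage2_fault(ipa, write, s1ptw); /* placeholder handler */
    }
}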
ARM's FEAT_S2POE extension (available in Armv9.2) enables a particularly powerful optimisation: the hypervisor can handle up to 64 Stage-2 faults per VM exit by batching fault resolution in the EL2 handler. For VM boot workloads dominated by Type 4 cascades, this can improve fault-handling throughput by up to 64×.
RISC-V's hypervisor (H) extension takes the minimalist philosophy that characterises RISC-V ISA design generally. When a Guest Page Fault occurs, the htval CSR holds the faulting GPA (right-shifted by two bits), and scause indicates the fault type. However, unlike x86-64 and ARM64, RISC-V does not record which PT level failed or whether the fault was caused by a Stage-1 PT walk. The hypervisor must re-walk the G-stage (Stage-2) tables manually to determine the cause, adding approximately 2–3 µs to the fault handler latency and pushing average Type 2 handling to ~9.3 µs.
The primary RISC-V mitigation is using large Giga-pages (1 GB) in the G-stage tables, reducing the total number of G-stage entries for a typical VM from millions to thousands. One G-stage Giga-page covers 512 2 MB regions, requiring 512× fewer fault-handled entries during boot. As RISC-V hypervisor implementations mature, software-managed fault-batching techniques similar to ARM's FEAT_S2POE are expected to emerge.
The key takeaway from the architecture comparison is that fault-handling overhead is not just about raw latency — it is about the information available to avoid redundant work. An x86-64 hypervisor receiving an EPT violation knows immediately whether it is handling a simple unmapped GPA (install EPT entry, re-enter) or a cascading PT-walk fault (likely requiring pre-population of a range of GPA→HPA mappings to prevent recurrence). A RISC-V hypervisor receiving a Guest Page Fault must spend extra cycles rediscovering this context. Cross-referencing this section with Chapter 3 §3.6 (which describes two-stage translation structure) and Chapter 7 §7.3 (which covers x86-64 fault qualification fields in the single-stage context) provides the complete picture of how hardware-reported fault information shapes OS and hypervisor design.
When physical memory fills, the OS must choose which pages to evict. Optimal eviction requires knowing two things about each page: how recently it was accessed (recency) and whether it has been modified since it was last loaded from disk (cleanliness). The MMU provides both signals through the Accessed (A) bit and the Dirty (D) bit embedded in every PTE. Figure 8.4 summarises how each major ISA implements these bits and the dramatic cost difference between clean and dirty eviction.
x86-64 implements A/D bits as hardware-automatic. When the page walker installs a translation in the TLB, it atomically sets the Accessed bit in the PTE using a locked read-modify-write cycle. On the first write to a page, the Dirty bit is set atomically as well. This costs approximately 100 cycles on the first access but is invisible thereafter (subsequent accesses read a cached TLB entry). The walker also sets Accessed bits at every level of the hierarchy it traverses (the Dirty bit exists only in leaf entries), allowing the OS to track which PDE and PDPT ranges are active.
ARM64 provides a configurable mechanism via the HA and HD bits in the Translation Control Register (TCR_EL1). With HA=1, the hardware automatically sets the Access Flag (AF) in PTEs, matching x86-64 behaviour at ~100 cycles overhead. With HA=0 (the default on many implementations), any access to a page with AF=0 generates an Access Flag fault — a lightweight exception that the OS handles by setting AF=1 and returning. This mode adds 400–500 cycles per first-access but gives the OS a guaranteed hook to record the exact time of first access, enabling more precise LRU tracking than x86-64's hardware-set approach. The HD bit similarly controls whether the Dirty Bit Modifier (DBM) mechanism marks pages dirty automatically or via fault.
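Enabling the hardware-managed mode is a one-time configuration step. A minimal sketch, assuming EL1 execution and Linux-style sysreg accessors (TCR_EL1.HA is bit 39 and TCR_EL1.HD is bit 40, per FEAT_HAFDBS in Armv8.1+):
#define TCR_HA (1UL << 39) /* hardware Access Flag updates */
#define TCR_HD (1UL << 40) /* hardware dirty-state updates via DBM */
static void enable_hw_access_dirty(void)
{
    unsigned long tcr = read_sysreg(tcr_el1);
    tcr |= TCR_HA | TCR_HD; /* HA=1, HD=1: match x86-64 behaviour */
    write_sysreg(tcr, tcr_el1);
    isb(); /* synchronise the system-register write */
}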
RISC-V has no mandatory hardware A/D bit support in the Sv39/Sv48 page table format. The specification permits hardware to set the A and D bits (a "may", not a "must"), but most RISC-V implementations leave them to software. Tracking A bits in software would require taking a fault on every first access to every page — adding over 2,000 cycles of overhead per first access. In practice, most RISC-V operating systems pre-set A=1 for all pages and do not track recency via A bits at all, relying instead on software-side access counters or simply not distinguishing hot from cold pages. This is a meaningful limitation for workloads with large working sets on RISC-V systems.
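A sketch of the common workaround: pre-set A and D when the mapping is created, so the hardware never needs to update them. Bit positions follow the RISC-V privileged specification (V=0, A=6, D=7; the PPN starts at bit 10); make_leaf_pte() itself is illustrative:
#define PTE_V (1UL << 0) /* valid */
#define PTE_A (1UL << 6) /* accessed */
#define PTE_D (1UL << 7) /* dirty */
static unsigned long make_leaf_pte(unsigned long ppn, unsigned long perms)
{
    /* Pre-setting A and D means hardware never needs to update them,
       but the OS permanently gives up the recency signal. */
    return (ppn << 10) | perms | PTE_V | PTE_A | PTE_D;
}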
On x86-64 and ARM64 with hardware A bits, the OS implements a clear-and-check cycle to identify hot and cold pages. The cycle has four steps: (1) Clear all A bits across the active page set, simultaneously invalidating TLB entries for those pages (since the PTE has changed). (2) Wait a scan period — typically a few seconds in Linux. (3) Check which A bits have been re-set by hardware during the waiting period; pages with A=1 received at least one access and are classified as "hot"; pages with A=0 received no access and are "cold". (4) Add cold pages to the inactive LRU list as eviction candidates; hot pages remain on the active list.
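In outline, the cycle looks like this; for_each_tracked_page(), pte_of(), flush_tlb_batched(), and the other helpers are conceptual stand-ins for the kernel's real iteration and shootdown machinery:
#define SCAN_PERIOD_SECONDS 5
static void scan_working_set(void)
{
    struct page *page;
    /* Step 1: clear A bits, then issue one batched TLB shootdown. */
    for_each_tracked_page(page)
        clear_accessed_bit(pte_of(page));
    flush_tlb_batched();
    /* Step 2: wait one scan period. */
    sleep_seconds(SCAN_PERIOD_SECONDS);
    /* Steps 3-4: A bits re-set by hardware mark the hot pages. */
    for_each_tracked_page(page) {
        if (accessed_bit_set(pte_of(page)))
            mark_active(page);   /* hot: stays on the active list */
        else
            mark_inactive(page); /* cold: eviction candidate */
    }
}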
The critical cost in this cycle is step (1): clearing A bits requires an IPI (inter-processor interrupt) to flush TLB entries on all CPUs that may have cached the now-modified PTEs. On a 128-core system, this TLB shootdown (covered in detail in Chapter 12 for GPU contexts) can take microseconds to milliseconds depending on contention. Linux optimises this by batching A-bit clears across many pages before issuing a single shootdown, amortising the IPI cost.
The Dirty bit determines whether evicting a page requires writing it to disk. A clean page (D=0) has not been modified since it was loaded; it is an exact copy of data on disk or in a file. Evicting it costs little: simply unmap it, and reload from disk later if needed. Cost: approximately 150 µs (mostly TLB shootdown and PTE clear). A dirty page (D=1) has been modified in RAM and differs from its on-disk representation. Before eviction, it must be written back to its backing store (swap partition, file, or tmpfs). Cost: 1–10 ms for an NVMe SSD, 10 ms or more for a rotating disk. Even at the fast end, dirty eviction costs roughly 7× more than clean eviction; at the slow end, the gap approaches 70×.
The practical consequence is that the OS must proactively write back dirty pages — not wait until eviction is urgent. Linux's kswapd daemon does exactly this: it begins writing dirty pages to disk as soon as the LOW watermark is crossed (Section 8.6), long before any application is blocked waiting for free memory. The goal is to ensure that by the time pages must be evicted, most of them are already clean. When this proactive writeback fails — for example, when a workload is writing dirty data faster than the disk can absorb it — direct reclaim is forced to handle dirty evictions synchronously, causing the application latency spikes described in Scenario 1 at the start of this chapter.
The A and D bits feed directly into the page replacement algorithms discussed in Section 8.5. The Clock algorithm uses the A bit as a "second chance" flag: on first examination, a page with A=1 gets its bit cleared and is passed over; on second examination with A=0, it is evicted. Linux's Multi-Generational LRU (MGLRU, Section 8.5.4) extends this to four generations, using A-bit patterns over multiple scan periods to classify pages from "youngest" (accessed recently) to "oldest" (not accessed in multiple periods). As established in Chapter 2 §2.3, the accessed bit provides hardware support for approximate LRU at near-zero cost per access — the crucial insight that makes practical LRU approximation feasible at scale.
Now that we understand how the hardware provides A and D bits, we can explore how software uses them to make intelligent eviction decisions. The goal is simple: when we must evict pages, choose the ones least likely to be needed soon. The challenge is doing this efficiently without perfect knowledge of future access patterns.
The gold standard among implementable policies is Least Recently Used (LRU): evict the page that was accessed longest ago. (The truly optimal policy, Bélády's MIN, evicts the page whose next use lies furthest in the future, but it requires knowing the future; LRU is the best practical proxy.) If page A was last accessed 10 seconds ago and page B was last accessed 5 seconds ago, evict page A. This works because of temporal locality—pages accessed recently are more likely to be accessed again soon.
The problem? Perfect LRU is impossible to implement efficiently:
// Perfect LRU (theoretical, impossibly expensive)
struct page {
unsigned long last_access_time; // Timestamp of last access
// ... other fields ...
};
// Would need to update on EVERY memory access
void perfect_lru_track_access(struct page *page) {
page->last_access_time = get_nanoseconds();
// This would run on EVERY memory access!
// CPU does billions of memory accesses per second
// Cost would be catastrophic
}
// Eviction would be simple
struct page *perfect_lru_evict(void) {
struct page *oldest = NULL;
unsigned long oldest_time = ULONG_MAX;
// Find page with smallest timestamp
for_each_page(page) {
if (page->last_access_time < oldest_time) {
oldest_time = page->last_access_time;
oldest = page;
}
}
return oldest; // Evict this one
}

Why is this impossible?
Cost analysis of perfect LRU:
System with 8 GB RAM = 2 million pages (at 4KB each)
Tracking overhead:
- Update timestamp on every access
- Modern CPU: 3 billion memory accesses/second
- Each update: read timestamp, write new timestamp = 2 memory accesses
- Additional memory traffic: 6 billion accesses/second just for tracking
- Memory bandwidth: Doubles!
- Cannot use TLB (TLB caches translations, not tracking state)
Eviction overhead:
- Scan 2 million pages to find minimum timestamp
- 2M × 10 ns = 20 milliseconds per eviction
- Under heavy pressure: 1000 evictions/second
- 20 ms × 1000 = 20,000 ms = 20 seconds/second
- More than 100% of CPU time just finding pages to evict!
⚠️ PERFORMANCE TRAP
You cannot track every memory access in software. The CPU does billions of accesses per second. Even a single instruction per access would consume all CPU cycles. This is why hardware provides the A bit—it tracks accesses transparently during address translation with minimal overhead (only on first access to set the bit).
Perfect LRU is a beautiful theoretical concept but completely impractical. Real systems must approximate LRU using the limited information available from hardware—primarily the A bit.
The Clock algorithm, invented by Fernando Corbató in 1968, provides a brilliant approximation of LRU using only the A bit. Pages are arranged in a circular list. A "clock hand" sweeps around, giving each page with A=1 a "second chance" by clearing its A bit. Pages with A=0 (not accessed since last sweep) are evicted.
The key insight: A bit provides binary information (accessed/not accessed) rather than precise timestamps. This is enough for good approximation:
// Clock algorithm (also called "Second Chance")
struct clock_state {
struct page *hand; // Current position in circular list
struct list_head page_list; // Circular list of all pages
};
struct page *clock_evict(struct clock_state *clock) {
struct page *page = clock->hand;
// Scan forward until we find a page to evict
// This is the "clock hand" sweeping
while (1) {
pte_t *pte = get_pte_for_page(page);
// Check accessed bit
if (pte_young(*pte)) {
// Page was accessed recently (A=1)
// Give it a second chance: clear A bit and move on
*pte = pte_mkold(*pte); // Clear A bit
// Move to next page in circular list
page = list_next_entry(page, lru);
if (&page->lru == &clock->page_list)
page = list_first_entry(&clock->page_list, struct page, lru);
} else {
// Page not accessed since last sweep (A=0)
// This page hasn't been used for at least one full rotation
// Evict it!
// Remove from list
list_del(&page->lru);
// Advance hand to next page
clock->hand = list_next_entry(page, lru);
if (&clock->hand->lru == &clock->page_list)
clock->hand = list_first_entry(&clock->page_list, struct page, lru);
return page;
}
}
}
/* Based on classical Clock algorithm
Reference: Corbató et al., "A Paging Experiment with the Multics System", 1968 */

Why this works—the "second chance" gives pages with A=1 another opportunity to be accessed:
Example: 8 pages in circular list
Initial state (all pages recently accessed):
[A=1][A=1][A=1][A=1][A=1][A=1][A=1][A=1]
^hand
Eviction request #1:
Scan page 0: A=1 → clear to A=0, advance
Scan page 1: A=1 → clear to A=0, advance
Scan page 2: A=1 → clear to A=0, advance
...all have A=1...
Scan page 7: A=1 → clear to A=0, advance
Back to page 0: A=0 → EVICT!
Result: All pages got their A bit cleared
First revisited page (longest since accessed) was evicted
Eviction request #2 (pages 2,4,6 accessed between evictions):
[removed][A=0][A=1][A=0][A=1][A=0][A=1][A=0]
^hand
Scan page 1: A=0 → EVICT!
Page 1 wasn't accessed since last eviction, so evicted immediately
Pages 2,4,6 were accessed, so they survived
This approximates LRU: Pages that haven't been accessed for
a full rotation are evicted. Recently accessed pages survive.
Performance characteristics:
Clock algorithm analysis:
Best case (page found immediately):
- Check 1 page
- Cost: ~100 ns
- O(1) if victims readily available
Worst case (all pages accessed):
- Clear A bit on all N pages
- Cost: N × 100 ns
- Then evict first page
- O(N) where N = number of pages
Average case:
- Scan ~10-50 pages before finding victim
- Cost: ~1-5 µs
- Amortized O(1) because subsequent evictions are fast
The Clock algorithm has a crucial property: it doesn't suffer from Bélády's anomaly (where adding more memory makes performance worse). It's also simple to implement and has good average-case performance.
Basic Clock doesn't distinguish between clean and dirty pages. But dirty pages are dramatically more expensive to evict, since they must first be written back to disk (§8.4). We can do better with a two-handed variant:
// Two-handed clock: front hand clears A bits, back hand evicts
struct two_handed_clock {
struct page *front_hand; // Clears A bits (runs ahead)
struct page *back_hand; // Evicts pages (follows)
unsigned int hand_distance; // Distance between hands (tunable)
};
void advance_front_hand(struct two_handed_clock *clock) {
struct page *page = clock->front_hand;
pte_t *pte = get_pte_for_page(page);
// Front hand simply clears A bits
// Doesn't evict anything
if (pte_young(*pte)) {
*pte = pte_mkold(*pte);
}
// Advance to next page
clock->front_hand = list_next_entry(page, lru);
}
struct page *evict_with_back_hand(struct two_handed_clock *clock) {
struct page *page = clock->back_hand;
while (1) {
pte_t *pte = get_pte_for_page(page);
// Back hand prefers clean pages over dirty
if (!pte_young(*pte)) { // A=0: not accessed recently
if (!pte_dirty(*pte)) {
// Clean AND cold: perfect victim!
// Evict immediately
list_del(&page->lru);
clock->back_hand = list_next_entry(page, lru);
return page;
}
// Dirty but cold: acceptable victim if pressured
// Mark as candidate, but keep searching for clean page
if (mem_pressure_high()) {
// High pressure: take dirty page
list_del(&page->lru);
clock->back_hand = list_next_entry(page, lru);
return page;
}
}
// Not a good victim: advance
page = list_next_entry(page, lru);
clock->back_hand = page;
// If back hand catches up to front hand, advance front
if (distance(clock->back_hand, clock->front_hand) < clock->hand_distance) {
advance_front_hand(clock);
}
}
}
/* Conceptual implementation based on BSD VM system
Reference: McKusick et al., "The Design and Implementation of the 4.4BSD Operating System", 1996 */

The hand distance is a tunable parameter that controls how aggressive the system is:
Small hand distance (e.g., 64 pages):
- Front hand barely ahead of back hand
- Pages have little time to be re-accessed after A cleared
- More aggressive eviction
- Good under heavy memory pressure
Large hand distance (e.g., 512 pages):
- Front hand far ahead of back hand
- Pages have more time to be re-accessed after A cleared
- Less aggressive eviction
- Good under light memory pressure
- Better approximation of LRU
Example with hand_distance = 128:
- Front hand clears A bits on pages
- Back hand follows 128 pages behind
- If page accessed in those 128 page-scans, A gets set again
- Back hand sees A=1 and skips it
- Only pages not accessed for 128+ page intervals get evicted
Performance improvement over single-handed Clock:
Evicting 1000 pages under memory pressure:
Single-handed Clock (no clean/dirty preference):
- Evicts first page with A=0 (might be dirty)
- 500 dirty pages, 500 clean pages (50/50 mix)
- Average: 500 × 5ms (dirty) + 500 × 0.1ms (clean) = 2,550 ms
Two-handed Clock (prefers clean):
- Scans ahead for clean pages with A=0
- Evicts clean pages when available
- Falls back to dirty only under high pressure
- 900 clean, 100 dirty (90/10 mix due to preference)
- Average: 100 × 5ms (dirty) + 900 × 0.1ms (clean) = 590 ms
Speed-up: 2,550 ms → 590 ms = 4.3× faster eviction!
💡 KEY INSIGHT
The two-handed approach separates A bit clearing (front hand) from eviction decisions (back hand). This gives pages time to be re-accessed after their A bit is cleared, providing better LRU approximation. The hand distance controls the tradeoff between accuracy (large distance) and responsiveness to memory pressure (small distance).
Linux kernel 6.1 (released December 2022) introduced a new approach called Multi-Generational LRU. Instead of binary A bit state (accessed/not-accessed), MGLRU tracks multiple generations, allowing pages to age gradually:
// MGLRU: Pages age through 4 generations
#define MAX_NR_GENS 4 // Gen 3 (newest) → Gen 2 → Gen 1 → Gen 0 (oldest)
struct lruvec {
struct lru_gen_struct {
unsigned long timestamps[MAX_NR_GENS]; // When each gen started
struct list_head lists[MAX_NR_GENS]; // Pages in each gen
} lrugen;
};
void mglru_age_pages(struct lruvec *lruvec) {
int gen;
// Rotate generations: 3→2→1→0
// Gen 0 pages become eviction candidates
// New pages enter at gen 3
// Move gen 0 to eviction list
list_splice_init(&lruvec->lrugen.lists[0], &eviction_list);
// Shift everyone down a generation
for (gen = 0; gen < MAX_NR_GENS - 1; gen++) {
list_splice_init(&lruvec->lrugen.lists[gen + 1],
&lruvec->lrugen.lists[gen]);
lruvec->lrugen.timestamps[gen] = lruvec->lrugen.timestamps[gen + 1];
}
// Start new generation at gen 3
INIT_LIST_HEAD(&lruvec->lrugen.lists[MAX_NR_GENS - 1]);
lruvec->lrugen.timestamps[MAX_NR_GENS - 1] = jiffies;
// Now scan all pages and promote those with A=1
struct page *page;
list_for_each_entry(page, &all_pages, lru) {
pte_t *pte = get_pte_for_page(page);
if (pte_young(*pte)) {
// Page was accessed: promote to newest generation
list_move(&page->lru, &lruvec->lrugen.lists[MAX_NR_GENS - 1]);
// Clear A bit for next scan
*pte = pte_mkold(*pte);
}
// Pages with A=0 stay in their current generation
// They're aging toward gen 0
}
}
struct page *mglru_evict(struct lruvec *lruvec) {
// Evict from gen 0 (oldest generation)
// These pages have survived multiple aging cycles without being accessed
// If gen 0 is empty, age all pages first and retry
// (list_first_entry() never returns NULL, so emptiness must be
// checked explicitly with list_empty())
if (list_empty(&lruvec->lrugen.lists[0]))
mglru_age_pages(lruvec);
if (list_empty(&lruvec->lrugen.lists[0]))
return NULL; // nothing evictable
struct page *page = list_first_entry(&lruvec->lrugen.lists[0],
struct page, lru);
list_del(&page->lru);
return page;
}
/* Conceptual implementation based on Linux MGLRU
Reference: Yu Zhao, "Multi-Gen LRU", Linux kernel 6.1+, 2022
Documentation: Documentation/admin-guide/mm/multigen_lru.rst */

How pages age through generations:
Example: Page lifecycle in MGLRU
T=0s: Page allocated, enters Gen 3 (newest)
Gens: [0:empty][1:empty][2:empty][3:THIS PAGE]
T=5s: First aging cycle, page has A=1 (was accessed)
→ Stays in Gen 3 (promoted because A=1)
Gens: [0:empty][1:empty][2:old pages][3:THIS PAGE + new]
T=10s: Second aging cycle, page has A=0 (not accessed)
→ Ages to Gen 2
Gens: [0:old][1:older][2:THIS PAGE][3:new]
T=15s: Third aging cycle, page has A=0 again
→ Ages to Gen 1
Gens: [0:older][1:THIS PAGE][2:newer][3:newest]
T=20s: Fourth aging cycle, page accessed (A=1)
→ Promoted back to Gen 3!
Gens: [0:cold pages][1:cool][2:warm][3:THIS PAGE]
Page got "second wind"—accessed again, so promoted
T=25s-40s: Page not accessed for 3 cycles
→ Ages through Gen 2 → Gen 1 → Gen 0
Gens: [0:THIS PAGE][1:...][2:...][3:...]
T=45s: Eviction needed
→ Page evicted from Gen 0
It survived 45 seconds but wasn't accessed for last 20s
Why MGLRU is better than two-list LRU (Linux's previous approach):
Comparison: Two-list LRU vs MGLRU
Two-list LRU (Linux 6.0 and earlier):
- Active list: Recently accessed pages
- Inactive list: Not recently accessed
- Single promotion/demotion step
- Problem: One-time accesses pollute active list
Example: grep through large file
- Touches each page once
- All pages promoted to active list
- Displaces useful hot pages
- "Cache pollution" problem
MGLRU (Linux 6.1+):
- Four generations: 3 (hot) → 2 → 1 → 0 (cold)
- Must survive multiple aging cycles to stay
- Filters out one-time accesses naturally
Example: same grep through large file
- Pages enter Gen 3 on first access
- Not accessed again: age to Gen 2 → 1 → 0
- Evicted before polluting "hot" generations
- Truly hot pages stay in Gen 3 through multiple promotions
Real-world benchmark results:
📊 REAL NUMBERS - MGLRU PERFORMANCE
Benchmark: PostgreSQL with 8 GB database, 4 GB RAM
Workload: Mixed read/write with occasional large scans

Two-list LRU (Linux 6.0):
- Query latency p50: 15 ms
- Query latency p99: 180 ms
- Page faults: 2,200/sec
- Throughput: 4,100 queries/sec

MGLRU (Linux 6.1):
- Query latency p50: 12 ms (20% better)
- Query latency p99: 140 ms (22% better)
- Page faults: 1,850/sec (16% fewer)
- Throughput: 4,900 queries/sec (19% better)
Why: MGLRU better identifies truly hot pages (accessed frequently across multiple generations) vs. one-time scan pages (accessed once, then aged out quickly).
MGLRU represents the state-of-the-art in page replacement. It provides better LRU approximation than Clock or two-list approaches while maintaining similar O(1) amortized cost.
Page replacement algorithms tell us which pages to evict. Now we need to understand when eviction happens and how the kernel manages the process of freeing memory under pressure.
The kernel doesn't wait until memory is completely exhausted to start reclaiming pages. Instead, it maintains three watermark levels that trigger increasingly aggressive reclaim:
// Per-zone watermarks (simplified)
struct zone {
unsigned long watermark[NR_WMARK];
};
enum zone_watermarks {
WMARK_MIN, // Emergency threshold: direct reclaim
WMARK_LOW, // Wake kswapd background thread
WMARK_HIGH, // kswapd can sleep
};
// Check if zone needs reclaim
bool zone_watermark_ok(struct zone *zone, unsigned int order) {
unsigned long free_pages = zone_page_state(zone, NR_FREE_PAGES);
unsigned long min = zone->watermark[WMARK_MIN];
unsigned long low = zone->watermark[WMARK_LOW];
unsigned long high = zone->watermark[WMARK_HIGH];
if (free_pages < min) {
// Below min: emergency! Direct reclaim required
return false;
} else if (free_pages < low) {
// Below low: wake background reclaim daemon
wake_up_kswapd();
return free_pages >= min; // Still above min
} else if (free_pages >= high) {
// Above high: all good, no reclaim needed
return true;
}
// Between low and high: background reclaim running
return true;
}
/* Based on Linux kernel mm/page_alloc.c */

Typical watermark values for an 8 GB system:
8 GB RAM = 2,097,152 pages (at 4KB each)
WMARK_HIGH: 96 MB (24,576 pages) 1.2% of RAM
↑ kswapd stops when this is reached
WMARK_LOW: 64 MB (16,384 pages) 0.8% of RAM
↑ kswapd wakes when we drop below this
WMARK_MIN: 32 MB (8,192 pages) 0.4% of RAM
↑ direct reclaim if we drop below this
These are calculated from the vm.min_free_kbytes sysctl
Default min_free_kbytes ≈ 4 × sqrt(RAM in KB)
For 8 GB: 4 × sqrt(8,388,608) ≈ 11,600 KB ≈ 11 MB;
administrators commonly raise it (32 MB is assumed here)
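A sketch of the derivation, reusing struct zone from the listing above; the 1.25× and 1.5× ratios follow the kernel's classic formula, before vm.watermark_scale_factor adjustments:
static void setup_watermarks(struct zone *zone, unsigned long min_free_kbytes)
{
    /* Convert kilobytes to 4 KB pages. */
    unsigned long min_pages = min_free_kbytes >> (PAGE_SHIFT - 10);
    zone->watermark[WMARK_MIN]  = min_pages;
    zone->watermark[WMARK_LOW]  = min_pages + min_pages / 4; /* MIN × 1.25 */
    zone->watermark[WMARK_HIGH] = min_pages + min_pages / 2; /* MIN × 1.5 */
}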
The watermarks create three operational regimes:
📊 REAL NUMBERS - WATERMARK BEHAVIOR
Scenario: Gradual memory filling
T=0s: Free = 512 MB (above HIGH) → System idle, no reclaim
T=30s: Application malloc(200 MB), Free = 312 MB (above HIGH) → Still above HIGH, no reclaim triggered
T=45s: malloc(250 MB), Free = 62 MB (below LOW) → kswapd wakes, starts background reclaim → malloc() returns immediately (non-blocking) → kswapd reclaims in background
T=47s: kswapd freed 50 MB, Free = 112 MB (above HIGH) → kswapd goes back to sleep
T=60s: malloc(90 MB), Free = 22 MB (below MIN!) → Direct reclaim! Application blocks! → malloc() won't return until memory freed → User sees application freeze
The kswapd kernel thread is responsible for background memory reclaim. There's one kswapd per NUMA node, and it spends most of its time asleep, waking only when memory drops below WMARK_LOW:
// kswapd main loop (simplified)
static int kswapd(void *p) {
struct pglist_data *pgdat = (struct pglist_data *)p;
while (!kthread_should_stop()) {
bool work_done = false;
// Check all zones in this NUMA node
for_each_zone(zone) {
if (zone_free_pages(zone) < zone->watermark[WMARK_LOW]) {
// Below low watermark: need to reclaim
unsigned long target = zone->watermark[WMARK_HIGH];
unsigned long free_now = zone_free_pages(zone); // 'current' is a kernel macro, so avoid it as a name
unsigned long to_reclaim = target - free_now;
// Reclaim pages until we hit HIGH watermark
unsigned long reclaimed = shrink_zone(zone, to_reclaim);
if (reclaimed > 0)
work_done = true;
}
}
if (!work_done) {
// All zones above LOW watermark: go to sleep
// Will be woken when zone drops below LOW
set_current_state(TASK_INTERRUPTIBLE);
schedule(); // Sleep until woken
}
}
return 0;
}
// How pages are reclaimed from a zone
static unsigned long shrink_zone(struct zone *zone, unsigned long nr_to_reclaim) {
unsigned long nr_reclaimed = 0;
while (nr_reclaimed < nr_to_reclaim) {
// Use page replacement algorithm (MGLRU/Clock) to pick victim
struct page *page = select_page_to_evict();
if (!page)
break; // No more evictable pages
// Evict the page (may need to write to disk if dirty)
if (try_to_free_page(page)) {
nr_reclaimed++;
}
// Don't hog CPU: yield occasionally
if (need_resched())
cond_resched();
}
return nr_reclaimed;
}
/* Conceptual implementation based on Linux kernel mm/vmscan.c */

kswapd operates on the principle of "work ahead of demand":
Timeline: kswapd behavior
T=0s: Free=128 MB (above HIGH=96 MB)
→ kswapd sleeping
T=10s: malloc(64 MB), Free=64 MB (equals LOW)
→ Allocation wakes kswapd
→ malloc() returns immediately (success)
→ kswapd starts reclaiming in background
T=10.1s: kswapd examines pages
→ Finds ~10,000 clean file-backed pages (≈40 MB)
→ Drops them (no disk I/O needed)
→ Free=64 MB → 104 MB (above HIGH)
→ kswapd goes back to sleep
Total user impact: 0 ms (malloc didn't block)
Background reclaim handled everything
💡 KEY INSIGHT
kswapd runs in the background to prevent applications from having to do direct reclaim. By freeing pages before memory is critically low, kswapd allows allocations to succeed without blocking. This is the difference between a responsive system and one that stutters under memory pressure.
When memory drops below WMARK_MIN, kswapd can't keep up. The kernel takes drastic action: the allocating process must free memory itself before proceeding. This is called direct reclaim, and it's catastrophic for latency:
// Allocation path with direct reclaim
void *do_malloc(size_t size) {
// Try to allocate normally
struct page *page = alloc_pages(GFP_KERNEL, get_order(size));
if (page)
return page_to_virt(page); // Success!
// Failed: check if we're below MIN watermark
if (zone_free_pages(zone) < zone->watermark[WMARK_MIN]) {
// Emergency: must reclaim before allocating
// This process BLOCKS here!
unsigned long start = jiffies;
printk(KERN_WARNING "Process %d entering direct reclaim\n", current->pid);
// Try to reclaim enough pages to satisfy allocation
unsigned long nr_reclaimed = try_to_free_pages(size / PAGE_SIZE + 32);
unsigned long elapsed = jiffies_to_msecs(jiffies - start);
printk(KERN_WARNING "Direct reclaim took %lu ms\n", elapsed);
// Try allocation again
page = alloc_pages(GFP_KERNEL, get_order(size));
if (page)
return page_to_virt(page);
// Still failed: OOM path
return NULL;
}
return NULL;
}
// Direct reclaim implementation
static unsigned long try_to_free_pages(unsigned long nr_to_reclaim) {
unsigned long nr_reclaimed = 0;
while (nr_reclaimed < nr_to_reclaim) {
// Desperately search for evictable pages
struct page *page = find_evictable_page();
if (!page)
break;
pte_t *pte = get_pte_for_page(page);
if (pte_dirty(*pte)) {
// Page is dirty: MUST write to disk
// This is the killer—disk I/O in the allocation path!
if (page_is_file_backed(page)) {
// Write back to file: 1-10 ms
writeback_page(page);
} else {
// Write to swap: 1-10 ms
swap_writepage(page);
}
// Each dirty page costs milliseconds!
}
// Free the page
free_page(page);
nr_reclaimed++;
}
return nr_reclaimed;
}
/* Based on Linux kernel mm/vmscan.c, __alloc_pages_direct_reclaim() */

The disaster scenario in detail:
User calls malloc(1 MB):
T=0: Enter allocation path
- Check free pages: 28 MB (below MIN=32 MB)
- kswapd is trying but can't keep up
- Must do direct reclaim!
T=1: Start searching for pages to evict
- Check first page: dirty, must write
- Write to swap: 5 ms
T=6: Still need more pages
- Check second page: dirty, must write
- Write to swap: 5 ms
... (repeat for ~50 dirty pages while scanning many more) ...
T=250: Finally reclaimed enough
- 50 pages × 5 ms = 250 ms writing dirty pages
- Plus scanning overhead: 2 ms
- Total time in direct reclaim: 252 ms
T=252: Retry allocation
- Success! Finally have free pages
- Return from malloc()
From user's perspective:
- malloc(1 MB) took 252 milliseconds!
- On a fast CPU this should be microseconds
- Application appears frozen for 1/4 second
⚠️ PERFORMANCE TRAP
Direct reclaim is the reason applications "freeze" under memory pressure. A simple malloc() call can block for hundreds of milliseconds while the kernel desperately writes dirty pages to disk. This happens in the application's context, so the application can't make progress. From the user's perspective, the application has hung.
Real example: Video player doing malloc() for frame buffer hits direct reclaim. 250 ms delay = 15 dropped frames at 60 FPS. Video stutters noticeably.
The only way to avoid direct reclaim is ensuring kswapd keeps free memory above WMARK_MIN. This requires proper tuning and avoiding memory-intensive workloads on undersized systems.
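Direct reclaim is also easy to detect from userspace. A minimal sketch that samples the pgscan_direct counter from /proc/vmstat (a real field in mainline Linux); if it rises between samples, some process is paying the direct-reclaim penalty:
#include <stdio.h>
#include <string.h>
static unsigned long read_pgscan_direct(void)
{
    char key[64];
    unsigned long val = 0, result = 0;
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f)
        return 0;
    /* Each /proc/vmstat line is "name value". */
    while (fscanf(f, "%63s %lu", key, &val) == 2) {
        if (strcmp(key, "pgscan_direct") == 0) {
            result = val;
            break;
        }
    }
    fclose(f);
    return result; /* rising across samples => direct reclaim is occurring */
}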
When under memory pressure, the kernel must decide: evict file-backed pages (easy to re-read) or anonymous pages (must write to swap)? The vm.swappiness sysctl controls this tradeoff:
// Simplified swappiness calculation
void get_scan_count(struct lruvec *lruvec, unsigned long total_scan, unsigned long *nr_to_scan) {
unsigned long anon_prio, file_prio;
unsigned long ap, fp;
unsigned long anon_size = lruvec_lru_size(lruvec, LRU_ANON);
unsigned long file_size = lruvec_lru_size(lruvec, LRU_FILE);
// swappiness ranges from 0 to 200
// Default is 60
unsigned long swappiness = vm_swappiness;
// Calculate scan priorities
// Higher swappiness = more willing to swap anonymous pages
anon_prio = swappiness;
file_prio = 200 - swappiness;
// Adjust by LRU list sizes
ap = anon_prio * anon_size;
fp = file_prio * file_size;
// Calculate how many pages to scan from each list
// (multiply before dividing: integer division would truncate ap/(ap+fp) to zero)
nr_to_scan[LRU_ANON] = ap * total_scan / (ap + fp);
nr_to_scan[LRU_FILE] = fp * total_scan / (ap + fp);
/*
* swappiness=0: Strongly prefer evicting file pages
* swappiness=60: Default balanced behavior
* swappiness=100: Equal treatment of anon and file
* swappiness=200: Strongly prefer evicting anon pages (swapping)
*/
}
/* Based on Linux kernel mm/vmscan.c, get_scan_count() */

How different swappiness values behave:
Scenario: System has 4 GB anonymous (malloc'd) memory
and 4 GB file cache (mmap'd files, executables)
Need to reclaim 1 GB
swappiness=0 (Avoid swap, prefer file eviction):
Priority: anon=0, file=200
Result: Evict mostly file pages
- Reclaim 950 MB from file cache
- Reclaim 50 MB from anonymous memory
Good for: Database servers (keep data in RAM, OK to re-read files)
swappiness=60 (Default balanced):
Priority: anon=60, file=140
Result: Slightly prefer file eviction
- Reclaim 600 MB from file cache
- Reclaim 400 MB from anonymous memory
Good for: General purpose systems
swappiness=100 (Equal treatment):
Priority: anon=100, file=100
Result: Evict proportionally
- Reclaim 500 MB from file cache
- Reclaim 500 MB from anonymous memory
Good for: Workloads with equal importance
swappiness=200 (Prefer swapping):
Priority: anon=200, file=0
Result: Evict mostly anonymous pages
- Reclaim 50 MB from file cache
- Reclaim 950 MB from anonymous memory
Good for: File servers (keep file cache hot, OK to swap processes)
Real-world tuning example:
Database server (PostgreSQL):
- Database uses shared_buffers (anonymous memory)
- Keep this hot in RAM at all costs!
- OK to drop file cache (can re-read from disk)
Tuning: vm.swappiness=10
Result: Kernel strongly prefers evicting file cache
Database working set stays in RAM
Query latency: Stable
Desktop system:
- Applications use anonymous memory
- Frequently access same files (documents, code)
- File cache is valuable
Tuning: vm.swappiness=60 (default)
Result: Balanced eviction
Applications can swap if needed
File cache provides fast re-access
File server (NFS/Samba):
- Most requests are for file data
- File cache is critical for performance
- Server processes can swap if needed
Tuning: vm.swappiness=150
Result: Kernel prefers swapping server processes
File cache stays hot
File serving latency: Low and stable
💡 KEY INSIGHT - WORKLOAD SPECIFIC TUNING
There is no "best" swappiness value. It depends entirely on your workload: - Database: Low swappiness (prefer file eviction) - Desktop: Medium swappiness (balanced) - File server: High swappiness (prefer swapping processes)
The default of 60 works reasonably for general-purpose systems but can be far from optimal for specialized workloads.
When reclaim fails—the kernel can't free enough memory even with direct reclaim—the Out-Of-Memory (OOM) killer is invoked. Its job: kill a process to free memory:
// OOM killer victim selection (simplified)
struct task_struct *select_bad_process(void) {
struct task_struct *p;
struct task_struct *chosen = NULL;
unsigned long chosen_points = 0;
// Scan all processes
for_each_process(p) {
unsigned long points = 0;
// Calculate "badness" score
// Higher score = more likely to be killed
// 1. Memory usage (primary factor)
// RSS (resident set size) in pages
points = get_mm_rss(p->mm);
// 2. Add swap usage
points += get_mm_counter(p->mm, MM_SWAPENTS);
// 3. Add half of virtual memory
// (might be mostly unmapped but still indicates big process)
points += p->mm->total_vm / 2;
// 4. Adjustments
// Root processes: -25% (less likely to kill)
if (has_capability(p, CAP_SYS_ADMIN))
points -= points / 4;
// oom_score_adj: user tunable (-1000 to +1000)
// -1000 = never kill
// +1000 = always prefer to kill
points += p->signal->oom_score_adj;
// 5. Never kill kernel threads or init
if (p->flags & PF_KTHREAD || p->pid == 1)
continue;
// Track highest score
if (points > chosen_points) {
chosen_points = points;
chosen = p;
}
}
return chosen;
}
void oom_kill_process(struct task_struct *p) {
pr_err("Out of memory: Kill process %d (%s) score %lu or sacrifice child\n",
p->pid, p->comm, p->signal->oom_score);
// Dump memory info for debugging
dump_memory_state();
// Send SIGKILL to the process
// This is immediate and cannot be caught
force_sig(SIGKILL, p);
// Mark for memory reclaim
mark_oom_victim(p);
}
/* Based on Linux kernel mm/oom_kill.c */

Example OOM decision:
System State:
Total RAM: 8 GB
Free: 10 MB (below MIN)
Direct reclaim: Failed to free enough
Process Table:
PID Name RSS Swap VM Score Adj Final
1 init 4 MB 0 4 MB 4 0 4 (never kill)
234 sshd 8 MB 0 12 MB 14 0 14
567 postgres 2 GB 100 MB 4 GB 2348 0 2348 ← Selected!
890 chrome 1 GB 200 MB 3 GB 1400 0 1400
1024 vim 50 MB 0 80 MB 90 0 90
Victim: postgres (PID 567)
Reason: Highest score (2348 points)
Breakdown:
- RSS: 2 GB = 524,288 pages
- Swap: 100 MB = 25,600 pages
- VM/2: 4 GB / 2 = 2 GB = 524,288 pages
- Total: 1,074,176 pages ≈ 4.1 GB score
- Adjustments: None (oom_score_adj=0)
Result:
kernel: Out of memory: Kill process 567 (postgres) score 3276800
kernel: Killed process 567, total-vm:4194304kB, anon-rss:2097152kB
Impact:
- Process killed instantly
- 2 GB RAM freed
- System recovers
- Database users see connection errors
🔍 WHAT'S REALLY HAPPENING - OOM KILLER SELECTION
The OOM killer tries to:
1. Kill the process using the most memory (biggest impact)
2. Avoid killing critical system processes (init, systemd)
3. Respect user hints (oom_score_adj)
4. Minimize collateral damage (prefer single process over multiple)
It's not random—it's making a calculated choice about which process death will best help the system recover.
Tuning OOM behavior:
# Protect critical process from OOM killer
echo -1000 > /proc/$(pidof mysqld)/oom_score_adj
# Now mysqld will never be killed (unless all other options exhausted)
# Make process more likely to be killed
echo +500 > /proc/$(pidof big-batch-job)/oom_score_adj
# big-batch-job will be preferred victim
# Check current scores
ps aux | while read line; do
pid=$(echo $line | awk '{print $2}')
if [ -f /proc/$pid/oom_score ]; then
score=$(cat /proc/$pid/oom_score)
echo "$score $line"
fi
done | sort -rn | head -20
# Shows the 20 processes most likely to be OOM-killed

The OOM killer is a last resort, but it's better than system deadlock. Proper memory management (sizing, tuning swappiness, monitoring) should prevent OOM from ever happening in production.
We've discussed evicting data pages to free memory. But page tables themselves consume memory—often a surprising amount. On a system with hundreds of processes or VMs, page table overhead can reach gigabytes. This creates a meta-problem: we need memory to manage memory.
Let's calculate how much memory page tables consume for a typical process:
// Page table overhead for multi-level paging
unsigned long calculate_pt_overhead(unsigned long virtual_size) {
unsigned long nr_pages = virtual_size / PAGE_SIZE;
unsigned long overhead = 0;
// Level 1: PML4 (always allocated)
// One PML4 per process, regardless of size
overhead += PAGE_SIZE; // 4 KB
// Level 2: PDPTs
// One PDPT per 512 GB of virtual address space
unsigned long nr_pdpts = (virtual_size + GB(512) - 1) / GB(512);
overhead += nr_pdpts * PAGE_SIZE;
// Level 3: PDs
// One PD per 1 GB of virtual address space
unsigned long nr_pds = (virtual_size + GB(1) - 1) / GB(1);
overhead += nr_pds * PAGE_SIZE;
// Level 4: PTs
// One PT per 2 MB of virtual address space
unsigned long nr_pts = (virtual_size + MB(2) - 1) / MB(2);
overhead += nr_pts * PAGE_SIZE;
return overhead;
}
// Example calculations:
// 10 GB process:
// PML4: 4 KB (1 table)
// PDPTs: 4 KB (1 table for 0-512 GB)
// PDs: 40 KB (10 tables, one per GB)
// PTs: 20,480 KB = 20 MB (5,120 tables, one per 2 MB)
// Total: ~20 MB overhead (0.2% of 10 GB)
// 100 processes × 1 GB each:
// Per process: 2 MB overhead
// Total: 200 MB overhead
// 8 VMs × 8 GB each (with EPT):
// Guest page tables: 8 × 16 MB = 128 MB
// Host EPT: 8 × 16 MB = 128 MB
// Total: 256 MB overhead

The overhead scales with virtual address space size, not physical memory usage. A process that touches even one byte in every 2 MB region of a 100 GB mapping consumes as much page table memory as one with all 100 GB resident.
Real-world example:
📊 REAL NUMBERS - PAGE TABLE OVERHEAD SCALING
PostgreSQL database server (64 GB RAM):
- 3 postgres processes, 16 GB shared_buffers each
- Many memory-mapped files (10,000 files × 10 MB avg)
- Address space per process: 48 GB virtual

Page table breakdown per process:
- Dense regions (shared buffers): 16 GB → 32 MB of PTs
- Sparse regions (mmap'd files): 32 GB → 64 MB of PTs
- Total per process: ~96 MB
- Total for 3 processes: 288 MB
That's 288 MB of RAM (0.45% of total) just for page tables! With huge pages: 288 MB → 576 KB (512× reduction)
Allocating all page tables upfront would be wasteful. Most processes don't use their entire address space. The kernel uses lazy allocation: page tables are created only when first accessed:
// Lazy page table allocation on page fault
int handle_page_fault(unsigned long address) {
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
// Walk page tables, allocating levels as needed
// Level 1: PGD (PML4) always exists
// Allocated during process creation
pgd = pgd_offset(current->mm, address);
if (pgd_none(*pgd)) {
// Shouldn't happen—PML4 always allocated
return VM_FAULT_OOM;
}
// Level 2: P4D (5-level paging) or PUD
// Allocate if missing
p4d = p4d_alloc(current->mm, pgd, address);
if (!p4d)
return VM_FAULT_OOM;
// Level 3: PUD (PDPT)
// Allocate if missing
pud = pud_alloc(current->mm, p4d, address);
if (!pud)
return VM_FAULT_OOM;
// Level 4: PMD (PD)
// Allocate if missing
pmd = pmd_alloc(current->mm, pud, address);
if (!pmd)
return VM_FAULT_OOM;
// Level 5: PTE (PT)
// Allocate if missing
pte = pte_alloc(current->mm, pmd, address);
if (!pte)
return VM_FAULT_OOM;
// Now handle the actual fault (demand paging, COW, etc.)
return do_anonymous_page(current->mm, pte, address);
}
/* Based on Linux kernel mm/memory.c, __handle_mm_fault() */

The savings can be enormous:
Example: Process with sparse 64 GB address space
- Actually using only 1 GB scattered across the space
- 10,000 separate small allocations
Eager allocation (allocate all page tables upfront):
- PML4: 4 KB
- PDPTs: 4 KB
- PDs: 256 KB (64 tables for 64 GB)
- PTs: 131,072 KB = 128 MB (32,768 tables for 64 GB)
- Total: ~128 MB
Lazy allocation (only what's used):
- PML4: 4 KB
- PDPTs: 4 KB (only 1 needed for first 512 GB)
- PDs: 40 KB (only 10 needed for 10 regions of 1 GB each)
- PTs: 20 MB (only 5,120 needed for 1 GB actual usage)
- Total: ~20 MB
Savings: 128 MB → 20 MB = 6.4× reduction!
Lazy allocation is why you can run hundreds of processes on limited RAM—most of their address space exists only virtually, without consuming physical memory for page tables.
Just as data pages can be reclaimed under pressure, empty page tables can be freed. When a region of memory is unmapped, the kernel checks if entire page tables are now empty:
// Reclaim page tables when memory is unmapped
void unmap_page_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end) {
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
// Walk the page tables, clearing entries
for (; addr < end; addr += PAGE_SIZE) {
pgd = pgd_offset(vma->vm_mm, addr);
pud = pud_offset(pgd, addr);
pmd = pmd_offset(pud, addr);
pte = pte_offset_map(pmd, addr);
// Clear the PTE
pte_clear(vma->vm_mm, addr, pte);
pte_unmap(pte);
}
// Now check if any page table pages are completely empty
// and can be freed
// Check if the entire PT is now empty
// (pt_empty() and friends are conceptual helpers: true when all
// 512 entries at that level are clear; pmd_none_or_clear_bad()
// would test the opposite condition, a missing table)
pmd = pmd_offset(pud, addr);
if (pmd_present(*pmd) && pt_empty(pmd)) {
// All 512 PTEs in this PT are invalid
// Free the PT page itself
pte_free(vma->vm_mm, pmd_page(*pmd));
pmd_clear(pmd);
}
// Check if the entire PD is now empty
pud = pud_offset(pgd, addr);
if (pud_present(*pud) && pd_empty(pud)) {
// All 512 PMDs in this PD are invalid
// Free the PD page
pmd_free(vma->vm_mm, pud_page(*pud));
pud_clear(pud);
}
// Check if the entire PDPT is now empty
pgd = pgd_offset(vma->vm_mm, addr);
if (pgd_present(*pgd) && pdpt_empty(pgd)) {
// All 512 PUDs in this PDPT are invalid
// Free the PDPT page
pud_free(vma->vm_mm, pgd_page(*pgd));
pgd_clear(pgd);
}
// Note: Never free PML4 (top level)
// It's freed only when process exits
}
/* Based on Linux kernel mm/memory.c, unmap_page_range() */

When reclamation happens:
Scenario 1: munmap() large region
- Application calls munmap(addr, 1 GB)
- Kernel clears all PTEs in that 1 GB range
- Checks each PT: all 512 PTEs invalid? Free PT
- Result: Immediately reclaim 512 PTs = 2 MB
Scenario 2: Process exit
- All memory unmapped at once
- All page tables freed
- Result: Multi-megabyte instant reclaim
Scenario 3: Memory pressure
- Kernel scans for empty page tables
- Frees them even if process still running
- Result: Slow gradual reclaim (background)
But there's a subtlety—lazy reclamation:
// Linux doesn't always immediately free empty page tables
// It may defer freeing until memory pressure
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma) {
// Check if we should actually free page tables now
// or defer until later
if (system_under_memory_pressure()) {
// Memory pressure: aggressively free empty tables
free_empty_page_tables(vma);
} else {
// No pressure: defer freeing
// Empty tables might be re-used soon
// Keep them around to avoid re-allocation cost
}
}

This lazy approach trades memory for performance: empty page tables consume memory, but avoiding re-allocation reduces fault latency if the region is accessed again.
Page tables don't have to be physically contiguous, but placing them close together in physical memory improves cache performance. The Page Walk Cache (PWC) benefits from spatial locality:
// Two allocation strategies for page tables
// Strategy 1: Sparse allocation (simple but slow)
pte_t *allocate_pt_sparse(void) {
// Allocate from general page pool
// Pages come from anywhere in physical memory
struct page *page = alloc_page(GFP_KERNEL);
return page_to_virt(page);
// Result: PTs scattered across physical memory
// PWC hit rate: ~60-70%
// Page walk latency: Higher variance
}
// Strategy 2: Compact allocation (complex but fast)
pte_t *allocate_pt_compact(struct mm_struct *mm) {
// Try to allocate from contiguous region
// Keep all of this process's page tables close together
if (!mm->pt_pool) {
// First PT allocation: reserve contiguous region
// Reserve 16 MB for all future page tables
mm->pt_pool = alloc_contiguous(PAGE_SIZE * 4096);
mm->pt_pool_index = 0;
}
if (mm->pt_pool_index < 4096) {
// Allocate from reserved pool
pte_t *pt = (pte_t *)((char *)mm->pt_pool + mm->pt_pool_index * PAGE_SIZE); // byte-wise offset, not pte_t-wise
mm->pt_pool_index++;
return pt;
}
// Pool exhausted: fall back to sparse
return allocate_pt_sparse();
// Result: First 4096 PTs in contiguous 16 MB region
// PWC hit rate: ~85-90%
// Page walk latency: Lower and more predictable
}
/* Conceptual implementation based on page table clustering techniques */

Performance impact:
Benchmark: Random access to 100 GB sparse mapping
Sparse page table layout:
- PTs scattered across physical memory
- Random physical addresses
- PWC miss on 40% of page walks
- Average page walk: 120 ns
- Total time: 12 ms per million accesses (walk overhead, at a ~10% TLB miss rate)
Compact page table layout:
- PTs in contiguous 200 MB region
- Spatial locality in physical memory
- PWC miss on 15% of page walks
- Average page walk: 85 ns
- Total time: 8.5 ms per million accesses (same ~10% TLB miss rate)
Speed-up: 29% faster page walks!
📊 REAL NUMBERS - COMPACT ALLOCATION IMPACT
Database workload (100 GB working set, random access):
Sparse PT allocation:
- Page walk latency p50: 115 ns
- Page walk latency p99: 220 ns
- PWC hit rate: 62%
- TLB miss penalty: High variance
Compact PT allocation:
- Page walk latency p50: 82 ns (28% better)
- Page walk latency p99: 140 ns (36% better)
- PWC hit rate: 88%
- TLB miss penalty: Lower and predictable
The tradeoff: compact allocation requires reserving contiguous memory upfront, which may not be available under fragmentation. Most systems use hybrid approaches: try compact allocation, fall back to sparse if needed.
We've mentioned huge pages throughout this book. Their impact on page table overhead deserves special attention:
// Overhead comparison: 4KB vs 2MB pages
void compare_page_table_overhead(void) {
unsigned long data_size = GB(100); // 100 GB of data
// With 4KB pages:
unsigned long pages_4kb = data_size / KB(4); // 26,214,400 pages
unsigned long pts_needed = pages_4kb / 512; // 51,200 PTs
unsigned long overhead_4kb = pts_needed * PAGE_SIZE;
printf("4KB pages:\n");
printf(" Data pages: %lu (26.2 million)\n", pages_4kb);
printf(" PTs needed: %lu (51,200)\n", pts_needed);
printf(" PT overhead: %lu MB (%.2f%%)\n",
overhead_4kb / MB(1),
100.0 * overhead_4kb / data_size);
// With 2MB pages:
unsigned long pages_2mb = data_size / MB(2); // 51,200 pages
unsigned long pmds_needed = pages_2mb / 512; // 100 PMDs
unsigned long overhead_2mb = pmds_needed * PAGE_SIZE;
printf("\n2MB pages:\n");
printf(" Data pages: %lu (51,200)\n", pages_2mb);
printf(" PMDs needed: %lu (100)\n", pmds_needed);
printf(" PT overhead: %lu KB (%.4f%%)\n",
overhead_2mb / KB(1),
100.0 * overhead_2mb / data_size);
printf("\nReduction: %lux smaller!\n", overhead_4kb / overhead_2mb);
}
// Output:
// 4KB pages:
// Data pages: 26214400 (26.2 million)
// PTs needed: 51200 (51,200)
// PT overhead: 200 MB (0.20%)
//
// 2MB pages:
// Data pages: 51200 (51,200)
// PMDs needed: 100 (100)
// PT overhead: 400 KB (0.0004%)
//
// Reduction: 512x smaller!
Real PostgreSQL example revisited:
Earlier we saw PostgreSQL with:
- 3 processes × 48 GB virtual each
- 288 MB in page tables
With transparent huge pages (THP) enabled:
- Same 3 processes × 48 GB virtual
- Dense regions use 2MB pages
- Sparse regions still use 4KB (can't use huge pages)
New breakdown (per process, using the dense/sparse split from the OOM analysis below):
- Dense (16 GB): 32 MB → 64 KB (512× reduction)
- Sparse (32 GB): 64 MB → 64 MB (unchanged)
- Total per process: ~64.1 MB
- Total for 3 processes: ~192 MB (was 288 MB)
Savings: ~96 MB freed
If fully converted to huge pages:
- Total: 288 MB → 576 KB (512× reduction!)
- Savings: ~287.4 MB freed
Huge pages provide:
1. 512× less page table overhead (a single PMD entry replaces an entire 4 KB PT)
2. 512× better TLB coverage (each TLB entry covers 2 MB instead of 4 KB)
3. Faster page walks (one fewer level to traverse)
The challenge is allocation: 2MB huge pages require 2MB contiguous physical memory, which may not be available due to fragmentation.
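Applications can still request huge pages opportunistically. The sketch below (my own illustration, assuming Linux with THP available) hands the kernel a 2 MB-aligned anonymous region and hints with madvise(MADV_HUGEPAGE). Because it is only a hint, fragmentation can still force a fall-back to 4 KB pages, so it is worth verifying the result via the AnonHugePages field in /proc/[pid]/smaps afterwards.
/* Sketch: request THP backing for a large anonymous region.
 * madvise(MADV_HUGEPAGE) is only a hint; under physical fragmentation
 * the kernel silently falls back to 4 KB pages. */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void) {
    size_t len = 1UL << 30;                        /* 1 GiB region */

    /* Over-allocate by 2 MB so we can carve out a 2 MB-aligned start:
       THP can only back a region at 2 MB-aligned boundaries. */
    char *raw = mmap(NULL, len + HUGE_2MB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) { perror("mmap"); return 1; }
    char *aligned =
        (char *)(((uintptr_t)raw + HUGE_2MB - 1) & ~(HUGE_2MB - 1));

    if (madvise(aligned, len, MADV_HUGEPAGE) != 0)
        perror("madvise");          /* THP disabled or kernel too old */

    for (size_t off = 0; off < len; off += HUGE_2MB)
        aligned[off] = 1;           /* each fault may install a 2 MB page */

    munmap(raw, len + HUGE_2MB);
    return 0;
}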
We've explored the mechanisms—now let's see how to diagnose problems and optimize real systems. This section covers the tools for identifying memory management issues and the parameters for tuning them.
Tool 1: /proc/meminfo - System-Wide Memory Status
$ cat /proc/meminfo
MemTotal: 65536000 kB # Total RAM
MemFree: 2048000 kB # Truly free (not in use)
MemAvailable: 17920000 kB # Free + reclaimable (misleading!)
PageTables: 8192000 kB # Memory used by page tables ← RED FLAG!
Active: 28672000 kB # Recently accessed pages
Inactive: 16384000 kB # Not recently accessed
Dirty: 1024000 kB # Modified, needs writeback
Shmem: 16384000 kB # Shared memory (tmpfs, mmap)
SwapTotal: 8388608 kB # Total swap space
SwapFree: 8388608 kB # Unused swap
Key metrics to watch:
Red flags:
- PageTables > 1% of MemTotal → Page table overhead problem
- MemAvailable misleading when PageTables is high
- Dirty > 10% of MemTotal → Write pressure
- SwapFree much less than SwapTotal → Much of swap already consumed; check si/so for active swapping
Example problematic output:
MemTotal: 64 GB
MemAvailable: 17 GB ← Looks OK!
PageTables: 8 GB ← PROBLEM! Not reclaimable
Actual available: 17 GB - 8 GB = 9 GB only
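The adjustment is easy to automate. The sketch below applies this chapter's rule of thumb (subtracting PageTables from MemAvailable); the field names are real /proc/meminfo keys, but the adjusted figure is our heuristic, not a kernel-provided metric.
/* Minimal sketch: report MemAvailable minus un-reclaimable PageTables. */
#include <stdio.h>
#include <string.h>

static long meminfo_kb(const char *key) {
    char line[256];
    long kb = -1;
    size_t klen = strlen(key);
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) return -1;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            sscanf(line + klen + 1, "%ld", &kb);
            break;
        }
    fclose(f);
    return kb;
}

int main(void) {
    long avail = meminfo_kb("MemAvailable");
    long pt    = meminfo_kb("PageTables");
    printf("MemAvailable: %ld kB\n", avail);
    printf("PageTables:   %ld kB\n", pt);
    printf("Adjusted:     %ld kB\n", avail - pt);
    return 0;
}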
Tool 2: vmstat - Real-Time Monitoring
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 2048000 256000 8192000 0 0 100 200 1000 2000 25 10 60 5 0
2 1 0 1024000 256000 8192000 0 5120 50 5000 1200 2200 20 15 50 15 0
↑ ↑ ↑ ↑
| | | |
| | | +--- Blocks out (write)
| | +-------- Blocks in (read)
| +------------ Swap out (KB/s)
+---------------- Swap in (KB/s)
Interpreting swap columns:
Good: si=0, so=0
→ No swapping
→ System healthy
Warning: si=100, so=200
→ Light swapping (100-200 KB/s)
→ Monitor, may worsen
Critical: si=5120, so=10240
→ Heavy swapping (5-10 MB/sec)
→ System thrashing
→ Applications unresponsive
→ Action required immediately!
Real example during PostgreSQL OOM:
si=8192, so=16384 (8-16 MB/sec swapping)
bo=25000 (25 MB/sec disk writes)
wa=45 (45% CPU time waiting for I/O)
→ Database completely frozen
⚠️ PERFORMANCE TRAP - MISLEADING METRICS
High MemAvailable doesn't guarantee allocation success! If PageTables is high, that "available" memory is actually consumed by un-reclaimable page table overhead. Always check PageTables when diagnosing OOM issues.
Tool 3: perf - Hardware Counter Analysis
# Measure TLB miss rate
$ perf stat -e dTLB-load-misses,dTLB-loads ./application
Performance counter stats for './application':
12,234,567 dTLB-load-misses # 2.5% of all dTLB accesses
487,654,321 dTLB-loads
5.123 seconds time elapsed
Analyzing results:
TLB miss rate calculation:
Misses: 12,234,567
Total accesses: 487,654,321
Miss rate: 2.5%
Cost analysis:
Each miss: ~100 cycles (page table walk)
Total cost: 12,234,567 × 100 = 1,223,456,700 cycles
At 3 GHz: 1.2 billion cycles / 3 billion per second = 0.4 seconds
Application ran for 5.1 seconds
TLB miss overhead: 0.4 / 5.1 = 7.8% of runtime!
Action: Enable huge pages to reduce TLB misses
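The same estimate as a small reusable helper. The ~100-cycle walk cost is this chapter's rough figure; substitute a measured walk latency where you have one.
/* Worked form of the estimate above: TLB-miss overhead as a fraction
 * of runtime, from perf counter output. */
#include <stdio.h>

double tlb_overhead_fraction(unsigned long long misses,
                             double cycles_per_walk,
                             double cpu_hz, double runtime_s) {
    double walk_seconds = misses * cycles_per_walk / cpu_hz;
    return walk_seconds / runtime_s;
}

int main(void) {
    /* Numbers from the perf run above: ~8% of runtime spent walking. */
    double f = tlb_overhead_fraction(12234567ULL, 100.0, 3e9, 5.123);
    printf("TLB miss overhead: %.1f%% of runtime\n", f * 100.0);
    return 0;
}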
Tool 4: /proc/[pid]/smaps - Per-Process Memory Maps
$ cat /proc/1234/smaps
7f8a2c000000-7f8a30000000 rw-p 00000000 00:00 0
Size: 65536 kB # VMA size
KernelPageSize: 4 kB # Underlying page size
MMUPageSize: 4 kB # Page size used by MMU
Rss: 32768 kB # Actually in RAM
Pss: 16384 kB # Proportional share (private + shared/number of sharers)
Shared_Clean: 8192 kB # Shared, not modified
Shared_Dirty: 0 kB # Shared, modified
Private_Clean: 16384 kB # Private, not modified
Private_Dirty: 8192 kB # Private, modified ← Will need swap
Referenced: 24576 kB # Recently accessed
Anonymous: 32768 kB # Not file-backed
AnonHugePages: 20480 kB # Using transparent huge pages
Swap: 4096 kB # Currently in swap
SwapPss: 2048 kB # Proportional swap
Detecting issues:
Issue 1: Fragmentation
Size: 65536 kB (whole VMA)
AnonHugePages: 4096 kB (only small portion)
→ Fragmentation preventing THP
Issue 2: Swap pressure
Swap: 16384 kB
Private_Dirty: 32768 kB
→ Half the private pages swapped out
→ Performance degradation likely
Issue 3: Memory leaks
$ watch -n 1 'cat /proc/1234/smaps | grep -A 15 heap'
[Watch Rss growing continuously]
→ Memory leak in heap
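To quantify the fragmentation issue above across a whole process rather than one VMA, a short helper can total the relevant smaps fields. This is an illustrative sketch; the field names are the real smaps keys shown above. Run it with a PID argument (or no argument for the calling process itself): a low THP-backed percentage on a large anonymous working set points at fragmentation defeating THP.
/* Sketch: how much of a process's anonymous memory is THP-backed? */
#include <stdio.h>

int main(int argc, char **argv) {
    char path[64], line[256];
    long kb, anon = 0, thp = 0;
    snprintf(path, sizeof(path), "/proc/%s/smaps",
             argc > 1 ? argv[1] : "self");
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Anonymous: %ld kB", &kb) == 1) anon += kb;
        else if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1) thp += kb;
    }
    fclose(f);
    printf("Anonymous:     %ld kB\n", anon);
    printf("AnonHugePages: %ld kB (%.1f%% THP-backed)\n",
           thp, anon ? 100.0 * thp / anon : 0.0);
    return 0;
}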
Issue 1: TLB Thrashing
Symptoms:
- High CPU usage
- perf shows high dTLB-load-misses
- Application slower than expected
- No disk I/O (rules out paging)
Diagnosis:
$ perf stat -e dTLB-load-misses,dTLB-loads,iTLB-load-misses ./app
dTLB-load-misses: 45,000,000 # Data TLB misses
dTLB-loads: 500,000,000 # Total data accesses
iTLB-load-misses: 5,000,000 # Instruction TLB misses
dTLB miss rate: 9% ← VERY HIGH (should be <1%)
Root cause:
- Working set larger than TLB coverage
- Random access pattern (poor locality)
- 4KB pages: TLB covers only 1-2 MB
Solutions:
1. Enable huge pages:
echo always > /sys/kernel/mm/transparent_hugepage/enabled
→ 2MB pages: TLB covers 256-512 MB (256× improvement)
2. Restructure data for better locality:
- Sort by access pattern
- Use SoA (struct of arrays) instead of AoS (array of structs); see the sketch after this list
3. Reduce the working set per batch if possible:
- Process fewer items per batch
- Partition data to fit in TLB
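As referenced in solution 2, here is what the AoS-to-SoA restructuring looks like. The account layout is a made-up example, but the page arithmetic in the comments is exact for 4 KB pages.
/* AoS vs SoA: same data, very different TLB behaviour when only one
 * field is scanned. */
#include <stddef.h>

#define N 1000000

/* AoS: scanning just balance strides 64 bytes per record, so a 4 KB
   page yields only 64 useful values; one TLB entry covers 64 records. */
struct account_aos {
    long id;
    long balance;
    char name[48];
};

/* SoA: balance values are contiguous; a 4 KB page holds 512 of them,
   so one TLB entry covers 512 records instead of 64. */
struct accounts_soa {
    long id[N];
    long balance[N];
    char name[N][48];
};

long total_aos(const struct account_aos *a) {
    long sum = 0;
    for (size_t i = 0; i < N; i++) sum += a[i].balance;
    return sum;
}

long total_soa(const struct accounts_soa *a) {
    long sum = 0;
    for (size_t i = 0; i < N; i++) sum += a->balance[i];
    return sum;
}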
Issue 2: Swap Thrashing
Symptoms:
- System extremely slow
- Applications freeze frequently
- vmstat shows high si/so values
- Disk I/O pegged at 100%
Diagnosis:
$ vmstat 1
si so
8192 16384 ← 8-16 MB/sec swap I/O
7680 15360
8704 17408 ← Consistently high
$ iostat -x 1
Device: rrqm/s wrqm/s await %util
sda 0.00 0.00 45.2 99.8 ← Disk saturated
Root cause:
- Working set > Physical RAM
- Kernel constantly swapping pages in/out
- Page faults on every access
Solutions:
1. Emergency: Kill memory hog
$ ps aux --sort=-%mem | head
$ kill -9 [PID]
2. Temporary: Disable swap
$ swapoff -a
→ Forces OOM instead of thrashing
→ Better than a 30-minute freeze
→ Caveat: swapoff must first read everything back from swap, so it can itself stall on a loaded system
3. Permanent: Add RAM or reduce working set
- Scale horizontally (more machines)
- Reduce cached data
- Use mmap with MAP_POPULATE cautiously
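One further lever, not listed above: pinning latency-critical memory with mlock() so the kernel can never swap it out. A minimal sketch follows (the 64 MB size is illustrative); use it sparingly, since locked pages raise pressure on everything else and RLIMIT_MEMLOCK applies.
/* Pin latency-critical state so it is immune to swap-out. */
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64UL * 1024 * 1024;  /* 64 MB of critical state */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    if (mlock(p, len) != 0)           /* faults in and pins every page */
        perror("mlock");              /* likely RLIMIT_MEMLOCK too low */
    /* ... work with p: these pages can never be swapped out ... */
    munlock(p, len);
    munmap(p, len);
    return 0;
}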
Issue 3: High Page Table Overhead
Symptoms:
- PageTables field in /proc/meminfo unusually high
- OOM kills despite "available" memory
- Many small memory mappings
Diagnosis:
$ cat /proc/meminfo | grep PageTables
PageTables: 8388608 kB ← 8 GB in page tables!
$ cat /proc/1234/status | grep VmData
VmData: 16777216 kB ← 16 GB virtual
$ cat /proc/1234/maps | wc -l
45123 ← 45,000 VMAs!
Overhead: 8 GB PTs for 16 GB data = 50% overhead!
Root cause:
- Fragmented memory mappings
- Many small mmap() calls
- Database with thousands of memory-mapped files
Solutions:
1. Enable huge pages (512× reduction):
$ echo 24576 > /proc/sys/vm/nr_hugepages
→ Pre-allocate 48 GB of huge pages
2. Consolidate mappings:
- Use fewer, larger mmap() calls (see the arena sketch after this list)
- Batch allocations
- Reduce number of mapped files
3. PostgreSQL specific:
huge_pages = on
→ Shared buffers use huge pages
→ 8 GB overhead → 16 MB
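The arena sketch referenced in solution 2: one large mapping carved up by a trivial bump allocator in user space. This is an illustrative pattern, not a production allocator; the point is that a single VMA with densely filled PTs replaces thousands of small mappings.
/* One VMA instead of thousands: a minimal bump-allocator arena. */
#include <stdio.h>
#include <sys/mman.h>

struct arena {
    char  *base;
    size_t size;
    size_t used;
};

int arena_init(struct arena *a, size_t size) {
    a->base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    a->size = size;
    a->used = 0;
    return a->base == MAP_FAILED ? -1 : 0;
}

/* Each "allocation" is just a pointer into the single mapping. */
void *arena_alloc(struct arena *a, size_t n) {
    n = (n + 15) & ~15UL;                /* 16-byte alignment */
    if (a->used + n > a->size) return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

int main(void) {
    struct arena a;
    if (arena_init(&a, 1UL << 30) != 0) { perror("mmap"); return 1; }
    for (int i = 0; i < 100000; i++)     /* would have been 100k mmaps */
        arena_alloc(&a, 4096);
    printf("one VMA, %zu bytes used\n", a.used);
    return 0;
}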
# Key sysctl parameters for memory management
# Minimum free memory (triggers kswapd)
vm.min_free_kbytes = 32768 # 32 MB minimum
# Too low: Direct reclaim more likely
# Too high: Wastes memory
# Swappiness (0-100 before Linux 5.8, 0-200 since; default 60)
vm.swappiness = 10 # Database server
vm.swappiness = 60 # Desktop/general
vm.swappiness = 150 # File server
# Higher = more willing to swap
# Dirty page writeback
vm.dirty_ratio = 20 # 20% max dirty before blocking
vm.dirty_background_ratio = 10 # 10% starts background writeback
# Lower = More frequent writes, less burst latency
# Zone reclaim (NUMA)
vm.zone_reclaim_mode = 0 # Prefer remote memory over reclaim
vm.zone_reclaim_mode = 1 # Prefer local reclaim
# 0 = Better for most workloads
# 1 = NUMA-aware databases
# Overcommit behavior
vm.overcommit_memory = 0 # Heuristic (default)
vm.overcommit_memory = 1 # Always allow
vm.overcommit_memory = 2 # Never overcommit
# 1 = Cloud/container environments
# 2 = Critical systems (database)
# Transparent huge pages
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# always = Automatic THP
# madvise = Application controlled
# never = DisabledReal tuning example - PostgreSQL optimization:
# Before tuning:
PageTables: 8 GB
Swapping: Heavy (si=5120/s)
Query p99: 2.5 seconds
# Apply tuning:
echo 10 > /proc/sys/vm/swappiness # Avoid swap
echo 24576 > /proc/sys/vm/nr_hugepages # 48 GB huge pages
echo always > /sys/kernel/mm/transparent_hugepage/enabled
# PostgreSQL config:
huge_pages = on
shared_buffers = 16GB
# After tuning:
PageTables: 512 MB (16× reduction!)
Swapping: None (si=0/s)
Query p99: 180 ms (14× faster!)
TLB miss rate: 8% → 0.3% (27× improvement)
We began this chapter with a mystery: a PostgreSQL instance killed by the OOM killer despite 17 GB of "available" memory. Now we can explain exactly what happened:
The PostgreSQL OOM Mystery - Solved:
System: 64 GB RAM, 3 postgres processes
Problem:
- Each process: 16 GB shared_buffers + memory-mapped files
- Virtual address space per process: 48 GB
- Fragmented mappings: 45,000 VMAs each
Page table overhead calculation:
- Dense regions: 16 GB → 32 MB PTs each
- Sparse regions: 32 GB → 64 MB PTs each
- Total per process: 96 MB
- Total for 3 processes: 288 MB
- But fragmented across 45,000 VMAs, each pinning partially-filled PTs: ~2.7 GB per process
- Grand total: 8.1 GB in page tables!
The math:
MemTotal: 64 GB
Used for data: 48 GB (3 × 16 GB)
Used for PTs: 8 GB (un-reclaimable!)
File cache: 6 GB
Kernel/other: 2 GB
Total: 64 GB (exactly full!)
MemAvailable reported: 17 GB (file cache + free)
But PageTables: 8 GB (cannot evict!)
Actually available: 17 - 8 = 9 GB
New allocation: 10 GB request
9 GB < 10 GB → OOM!
Solution applied:
Enable huge pages:
- PageTables: 8 GB → 512 MB (16× reduction)
- Now actually available: 17 GB - 0.5 GB = 16.5 GB
- 10 GB allocation succeeds
- Plus: 15% performance improvement from TLB efficiency
This capstone chapter brings together concepts from throughout the book:
Chapter 1 (Virtual Memory Abstraction) introduced process address spaces. We now understand the overhead of that abstraction—page tables consume memory to enable the mapping.
Chapter 2 (Demand Paging) showed how processes don't need all pages in RAM. We now know which pages to evict (A/D bits) and how to free them (kswapd, direct reclaim).
Chapter 3 (Page Tables) explained multi-level hierarchies. We now understand when to allocate those levels (lazy allocation) and when to free them (reclamation under pressure).
Chapter 4 (TLB) showed translation caching. We now know the cost when TLB fails (page table walks) and why huge pages matter (512× better coverage).
Chapter 7 (Page Faults) covered fault handling. We now understand nested faults in VMs (cascading EPT violations) and A/D bit faults on RISC-V.
Together, these chapters form a complete understanding of the hardware-software contract for memory management.
For System Administrators:
Monitoring checklist:
✓ Check PageTables in /proc/meminfo weekly
✓ Set up alerts for si/so in vmstat
✓ Configure OOM killer to protect critical services
✓ Tune swappiness for your workload type
✓ Consider huge pages for large-memory services
Configuration:
- Database servers: swappiness=10, huge pages on
- File servers: swappiness=150, keep file cache hot
- Desktops: swappiness=60 (default)
- VMs: Pre-populate EPT, use huge pages in both guest and host
For Application Developers:
Best practices:
✓ Use madvise(MADV_HUGEPAGE) for large allocations
✓ Avoid fragmented allocations (many small mmaps)
✓ Consider memory-mapped files vs read/write tradeoffs
✓ Profile with perf to find TLB thrashing
✓ Structure data for spatial locality
Red flags:
✗ Allocating millions of small objects (page table overhead)
✗ Random access patterns (TLB thrashing)
✗ Frequent mmap/munmap (page table churn)
✗ Large sparse mappings (wasted page tables)
For Hypervisor Operators:
Optimization checklist:
✓ Pre-populate EPT during VM creation
✓ Use 2MB EPT pages when possible
✓ Reserve huge page pool at host boot
✓ Enable huge pages in both guest and host
✓ Monitor EPT fault rate with perf
VM density calculation:
Account for:
- Guest RAM
- Guest page tables (~0.2% of RAM)
- Host EPT (~0.2% of RAM)
- Total: ~1.004 × guest RAM
Example: 16 VMs × 8 GB
- Guest usage: 128 GB
- Metadata: 512 MB
- Total needed: 128.5 GB
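The same rule as a tiny helper (the 0.2% figures are the rough estimates above, not exact constants):
/* Host RAM needed for a set of VMs, including translation metadata. */
#include <stdio.h>

double host_ram_needed_gb(int vms, double guest_gb) {
    double overhead = 0.002 + 0.002;   /* guest page tables + host EPT */
    return vms * guest_gb * (1.0 + overhead);
}

int main(void) {
    printf("16 x 8 GB VMs need %.1f GB\n", host_ram_needed_gb(16, 8.0));
    return 0;
}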
Memory management in modern systems involves a complex dance between four layers:
- Application behavior: allocation patterns, access locality, mapping granularity
- Operating system policy: watermarks, reclaim, LRU ordering, OOM decisions
- Virtualization: nested translation and EPT management on hosts running VMs
- Hardware mechanisms: the MMU, TLB, and accessed/dirty bits
Understanding all four layers is essential for:
- Debugging production mysteries
- Optimizing performance
- Scaling systems efficiently
- Avoiding catastrophic failures
The PostgreSQL mystery we solved demonstrates this. The problem wasn't in any single layer—it was the interaction between application behavior (fragmented mappings), OS bookkeeping (page table overhead), and hardware constraints (physical RAM limits).
Denning, P.J. "Virtual Memory." ACM Computing Surveys, 2(3):153–189, 1970. DOI: 10.1145/356571.356573
Corbató, F.J. "A Paging Experiment with the Multics System." MIT Project MAC Report MAC-M-384. MIT, 1968.
Adams, K. and Agesen, O. "A Comparison of Software and Hardware Techniques for x86 Virtualization." Proc. 12th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2006). ACM, 2006. DOI: 10.1145/1168857.1168860
Zhao, Y. "Multi-Gen LRU." Linux kernel mailing list, 2022. Merged in Linux 6.1. lore.kernel.org
Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide. Intel Corporation, 2023. intel.com/sdm
Arm Limited. Arm Architecture Reference Manual for A-profile Architecture. Arm Limited, 2023. developer.arm.com
RISC-V International. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, v20211203. RISC-V International, 2021. github.com/riscv/riscv-isa-manual
Linux Kernel Contributors. mm/vmscan.c, mm/oom_kill.c, mm/huge_memory.c. Linux kernel v6.8, 2024. elixir.bootlin.com/linux/v6.8/source/mm
Arcangeli, A. "Transparent Huge Pages." Linux Kernel Documentation, 2011. kernel.org/doc/…/transhuge.html
Gregg, B. Systems Performance: Enterprise and the Cloud, 2nd ed. Addison-Wesley, 2020. ISBN: 978-0-13-682015-4.
Megiddo, N. and Modha, D.S. "ARC: A Self-Tuning, Low Overhead Replacement Cache." Proc. 2nd USENIX Conf. on File and Storage Technologies (FAST '03). USENIX, 2003.
Johnson, T. and Shasha, D. "2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm." Proc. 20th Int'l Conf. on Very Large Data Bases (VLDB 1994). Morgan Kaufmann, 1994.