On a Tuesday morning in early 2021, a major cloud provider's infrastructure team discovered a problem that would reshape how they thought about memory management. Their newest compute nodes—dual-socket AMD EPYC 7763 servers with 512 GB of RAM—were running out of memory despite hosting workloads that should comfortably fit. The culprit wasn't application memory usage. It was the page tables themselves.
Each server ran approximately 1,000 containers, providing isolated environments for customer workloads. Each container maintained its own virtual address space, which meant its own complete set of page tables. The infrastructure team ran the numbers: with an average of 16 GB of virtual memory mapped per container, and typical page table overhead of 0.5%, each container consumed about 80 MB just for page table structures. Multiply that by 1,000 containers, and suddenly 80 GB of RAM—nearly 16% of the system's total capacity—was devoted solely to managing virtual memory translations.
This wasn't a theoretical concern. The page table overhead was directly reducing container density. Where the hardware should support 1,000+ containers per host, the effective limit was closer to 750. The lost capacity translated to millions of dollars in underutilized hardware across their fleet of hundreds of thousands of servers.
Traditional huge pages helped individual containers by reducing their page table depth, but they didn't address the fundamental issue: massive duplication of page table structures across similar workloads. When hundreds of containers all ran Ubuntu with nearly identical system libraries, why did each need its own separate page tables for those shared pages?
This chapter explores the operating system-level optimizations that tackle exactly these kinds of real-world challenges. While Chapter 3 examined how page table structures work and Chapter 4 showed how TLBs cache translations, we now turn to the sophisticated techniques modern operating systems use to minimize page table overhead, improve concurrency, and squeeze maximum performance from the memory management hardware.
We'll explore seven major optimization domains, each addressing specific bottlenecks in production systems:
Memory Deduplication shows how Kernel Samepage Merging (KSM) identifies and consolidates identical pages across processes and virtual machines, reducing both memory footprint and page table overhead. In virtualized environments, KSM routinely achieves 30-60% memory savings by merging duplicate pages from similar guest operating systems.
Page Table Locking and Concurrency examines the challenges of safely modifying page tables on systems with 128+ CPU cores, where thousands of threads may be simultaneously faulting pages. Modern kernels use sophisticated locking hierarchies, RCU (Read-Copy-Update) for lockless page table walks, and batched TLB shootdowns to achieve scalability that was impossible a decade ago.
NUMA-Aware Page Table Management addresses the reality that on multi-socket servers, accessing remote memory can cost 50% more latency and consume 2× the bandwidth compared to local access. Placing page tables on the wrong NUMA node can cripple application performance, yet many systems still get this wrong by default.
Advanced Huge Page Strategies goes beyond the basic Transparent Huge Pages (THP) mechanism covered in earlier chapters. We'll see how memory fragmentation kills THP effectiveness over time, why deferred compaction matters for latency-sensitive workloads, and how to use explicit huge pages strategically in production.
Page Table Compaction and Sharing explores techniques to reduce page table overhead itself. When thousands of processes map the same shared libraries read-only, why maintain separate page table entries for each? We'll examine how compaction and sharing can reduce page table overhead by 90% in some scenarios.
Parallel Page Table Operations shows how to scale operations like mmap(), munmap(), and page fault handling across many cores. A naive implementation that serializes all page table updates becomes a severe bottleneck on modern many-core systems.
Memory Compression Integration demonstrates how zswap and zram work with page tables to extend effective memory capacity. By compressing swapped pages in RAM rather than writing to disk, systems can achieve 2000× better swap performance while maintaining the illusion of unlimited memory.
The impact of page table optimization extends far beyond the specific scenario that opened this chapter:
Cloud Density and Economics: In cloud environments, every megabyte of overhead directly reduces revenue. A cloud provider running 100,000 servers with 80 GB of wasted page table overhead is effectively leaving $400 million of hardware capacity unused (at $40,000 per server). Optimizations that reclaim even 20% of this overhead translate to billions in increased capacity across the industry.
Multi-Core Scaling: As core counts continue to grow—from 64 cores common today to 128+ cores in high-end servers—page table contention becomes a first-order performance concern. Applications that scale linearly to 32 cores often hit severe bottlenecks at 64 cores due to mmap_lock contention. The techniques in this chapter enable continued scaling.
NUMA System Performance: On 2-socket servers, remote memory access costs 30-50% more latency. On 4-socket and 8-socket systems common in enterprise databases and SAP deployments, the penalty can reach 2× or more. Page tables placed on the wrong NUMA node effectively double the cost of every page table walk, destroying application performance.
Memory Pressure and Overcommit: Modern workloads frequently push memory limits. Kernel compilation, data analytics, and machine learning training all benefit from memory overcommit—running workloads larger than physical RAM. Effective page table and memory compression allows systems to gracefully handle memory pressure rather than thrashing or invoking the OOM killer.
This chapter builds directly on concepts introduced earlier:
Chapter 3 established page table structures—the four-level hierarchies, the PTE/PMD/PUD/PGD organization, and basic operations like allocation and traversal. We now examine how to optimize these structures at scale and reduce their overhead.
Chapter 4 explored the TLB and showed how huge pages dramatically reduce TLB pressure by increasing coverage. We extend this with production strategies for managing huge pages effectively, including handling the fragmentation that develops over time.
Chapter 8 introduced the concept of shadow page tables for pre-EPT virtualization and briefly mentioned inverted page tables used on older architectures. We now dive deeper into modern approaches that reduce page table overhead through sharing and compaction.
The optimizations in this chapter assume familiarity with these fundamentals. We won't re-explain what a PTE is or how a TLB works—instead, we'll focus on the sophisticated techniques that push the boundaries of what's possible with these structures.
Let's begin with one of the most powerful optimization techniques available: Kernel Samepage Merging.
In 2008, Red Hat engineer Izik Eidus introduced Kernel Samepage Merging (KSM) to the Linux kernel. The motivation was straightforward: virtualization workloads running multiple similar guest operating systems were wasting enormous amounts of RAM storing duplicate copies of identical pages. Ten VMs running Ubuntu each maintained their own copy of glibc, the kernel's text pages, common utilities, and shared libraries. The redundancy was massive—often 30-60% of total memory—yet each VM insisted on its own private copies.
KSM's approach is elegantly simple: periodically scan process memory looking for pages with identical content, merge them into a single shared read-only page, and use copy-on-write to handle any subsequent modifications. The devil, as always, lies in the implementation details.
Consider a typical virtualization host running 10 Ubuntu VMs, each allocated 4 GB of RAM:
Total allocated: 10 VMs × 4 GB = 40 GB
Unique content: ~25 GB
Duplicate content: ~15 GB (37.5% redundancy)
Breakdown of duplicates:
- Kernel text pages: ~8 MB × 10 = 80 MB (but actually 8 MB unique)
- System libraries (glibc, libpthread, etc.): ~50 MB × 10 = 500 MB (→ 50 MB unique)
- Common utilities (bash, coreutils): ~20 MB × 10 = 200 MB (→ 20 MB unique)
- Zero pages: ~500 MB × 10 = 5 GB (→ 1 page unique)
- Application binaries: Variable redundancy
The situation worsens in container environments. While containers share the host kernel, they often run similar user-space software. A Kubernetes cluster running 100 pods of the same microservice will have 100 copies of the application binary, 100 copies of language runtimes (Python, Node.js, Java), and 100 copies of commonly used libraries.
The cost isn't just wasted RAM. Each duplicate page requires its own page table entry. Those extra PTEs consume memory themselves and increase TLB pressure. More fundamentally, buying and powering RAM that stores duplicate data wastes capital and operational expenses.
KSM operates as a kernel thread (ksmd) that continuously scans eligible memory regions in three phases: candidate identification, content comparison, and merging.
Not all memory is eligible for KSM scanning. Security-sensitive applications may want to prevent their memory from being merged (due to timing side-channel concerns we'll discuss later), and some regions contain data that changes too frequently to benefit from merging. Processes opt in to KSM by marking virtual memory areas (VMAs) as mergeable:
// Application explicitly marks memory region as KSM-eligible
void *data = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
madvise(data, size, MADV_MERGEABLE); // Opt in to KSMFor virtual machines, the hypervisor typically marks guest RAM as mergeable automatically. The KSM scanner then iterates through these marked regions:
// Simplified KSM scanning loop (actual implementation more complex)
for each VMA marked MADV_MERGEABLE:
for each page in VMA:
if page_stable(page): // Page hasn't changed recently
hash = calculate_checksum(page)
add_to_unstable_tree(hash, page)The kernel uses a red-black tree structure to organize candidate pages by their content hash. This allows efficient lookup when comparing pages—rather than comparing every page against every other page (O(n²) complexity), the kernel can quickly find pages with matching hashes.
Having a matching hash isn't sufficient—hash collisions are possible, and we need byte-for-byte identity. The kernel performs full content comparison for hash matches:
// Find candidates with matching hash
candidates = lookup_unstable_tree(hash)
for each candidate in candidates:
if memcmp(page1, candidate, PAGE_SIZE) == 0:
// Contents match - merge the pages
merge_pages(page1, candidate)
breakThe merge operation is where page table manipulation happens:
void merge_pages(struct page *page1, struct page *page2) {
// Allocate a new read-only shared page
struct page *kpage = alloc_page(GFP_KERNEL);
copy_page(kpage, page1);
// Mark the shared page as KSM-managed
SetPageKsm(kpage);
// Update page tables for both original pages to point to kpage
replace_pte(page1, kpage, PROT_READ); // Read-only!
replace_pte(page2, kpage, PROT_READ);
// Increment reference count (two PTEs now point here)
atomic_set(&kpage->_refcount, 2);
// Free original pages
free_page(page1);
free_page(page2);
}The critical detail: the new shared page is mapped read-only. This allows multiple page table entries to safely point to the same physical page without risk of one process's writes corrupting another's view.
When a process attempts to write to a KSM-merged page, the hardware generates a page fault (due to the read-only protection bit):
int handle_ksm_fault(struct mm_struct *mm, unsigned long address,
unsigned int flags) {
pte_t *pte = lookup_pte(mm, address);
struct page *kpage = pte_page(*pte);
if (PageKsm(kpage) && (flags & FAULT_FLAG_WRITE)) {
// This is a write to a KSM-merged page
// Break sharing via copy-on-write
struct page *new_page = alloc_page(GFP_KERNEL);
copy_page(new_page, kpage);
// Update this process's PTE to point to private copy
pte_t new_pte = mk_pte(new_page, vma->vm_page_prot);
set_pte_at(mm, address, pte, new_pte);
// Decrement refcount on shared page
if (atomic_dec_and_test(&kpage->_refcount)) {
// Last reference - convert back to normal page
ClearPageKsm(kpage);
set_pte_writable(kpage);
}
flush_tlb_page(address);
return 0; // Fault handled, retry write
}
}This copy-on-write mechanism is nearly identical to what happens during fork(), which we examined in Chapter 2. The difference is that fork() breaks sharing when either the parent or child writes, while KSM breaks sharing only when any process writes to the merged page.
KSM's operation has several effects on page table structure and behavior:
PTE modifications happen asynchronously. Unlike most page table operations that occur synchronously in response to page faults or explicit memory operations, KSM modifies PTEs in the background. This requires careful locking to prevent races with concurrent page faults or memory operations.
TLB shootdowns are unavoidable. When KSM updates a PTE to point to a shared page, it must invalidate the TLB entry on all CPUs that might have cached the old translation. For a heavily shared page accessed by processes running on 64 cores, this means broadcasting 64 TLB invalidation IPIs. The kernel batches these when possible, but the overhead remains significant.
Reference counting becomes critical. The kernel must track how many page table entries point to each KSM-managed page. When the refcount drops to 1, the single remaining PTE can be switched back to writable mode (no sharing means no COW needed). When it drops to 0, the page can be freed. Getting this wrong leads to use-after-free bugs or memory leaks.
Protection bits change. KSM must carefully manage the read/write bit in PTEs. The merge operation clears the write bit, while COW restoration sets it. These operations must be atomic to prevent races where a write could sneak through on a shared page.
KSM's performance involves a three-way tradeoff between scanning overhead, memory savings, and COW latency.
The ksmd kernel thread consumes CPU cycles scanning memory and comparing pages. The impact is configurable via two parameters:
# pages_to_scan: how many pages to scan before sleeping
echo 200 > /sys/kernel/mm/ksm/pages_to_scan
# sleep_millisecs: how long to sleep between scan passes
echo 20 > /sys/kernel/mm/ksm/sleep_millisecsWith these defaults, ksmd scans 200 pages every 20 milliseconds, yielding 10,000 pages per second. On a system with 100 GB of KSM-eligible memory (25 million pages), a complete scan takes 2,500 seconds—more than 40 minutes. The scanning is incremental and continuous; ksmd never stops.
The CPU cost depends on the settings:
Conservative (100 pages / 50ms):
CPU usage: ~1%
Scan rate: 2,000 pages/sec
Time to scan 100 GB: 3.5 hours
Moderate (200 pages / 20ms):
CPU usage: ~3%
Scan rate: 10,000 pages/sec
Time to scan 100 GB: 42 minutes
Aggressive (500 pages / 10ms):
CPU usage: ~8%
Scan rate: 50,000 pages/sec
Time to scan 100 GB: 8.3 minutes
The aggressive settings find and merge duplicates faster but consume more CPU. The choice depends on whether memory savings or CPU overhead is the limiting factor.
Real-world deduplication rates vary by workload:
Virtual machines (similar OS):
Ubuntu VMs: 35-45% deduplication
Windows VMs: 25-35% deduplication
Mixed environments: 20-30% deduplication
Containers:
Identical images: 60-70% deduplication
Similar images (same base): 40-50% deduplication
Diverse images: 15-25% deduplication
Applications:
Java applications: 30-40% (Eden space has many duplicates)
Database servers: 10-20% (mostly unique data)
Web servers: 25-35% (shared libraries)
The savings come from both the merged pages themselves and the page table entries no longer needed. If 10 processes were mapping 1,000 duplicate pages, merging reduces physical memory by 9,000 pages (keeping just one copy) and reduces PTE overhead from 10,000 entries to 1,000 entries.
The first write to a KSM-merged page incurs additional latency:
Normal write to resident page: 50-100 cycles
Write triggering COW: 500-1,000 cycles
Breakdown of COW overhead:
Page fault trap: 50 cycles
Fault handler overhead: 100 cycles
Page allocation: 200 cycles
Page copy (4 KB): 150 cycles (depends on cache)
TLB flush: 50 cycles
Total additional cost: 450-900 cycles
For workloads that frequently modify merged pages, this overhead accumulates. However, subsequent writes to the now-private page proceed at normal speed. The performance impact depends on the write-to-read ratio:
Read-heavy (95% reads):
COW overhead: ~5% additional latency
Net benefit: +40% memory → worthwhile
Balanced (50% reads):
COW overhead: ~50% additional latency
Net benefit: Depends on memory pressure
Write-heavy (80% writes):
COW overhead: ~80% additional latency
Net benefit: Likely negative
A production deployment at a medium-sized cloud provider provides concrete numbers. The infrastructure team ran KSM on compute nodes hosting customer VMs.
Setup: - Physical hosts: Dual-socket Intel Xeon Gold 6258R (28 cores/socket, 56 total) - Memory: 512 GB DDR4-2933 - Workload: 60 Ubuntu 20.04 VMs, 6 GB each (360 GB total requested) - Each VM: Web server + database (read-heavy workload)
Without KSM:
Total memory allocated: 360 GB
Page table overhead: 11.2 GB (3.1%)
Actual physical usage: 371 GB
Available for new VMs: 141 GB
Additional VMs possible: 23 (at 6 GB each)
With KSM enabled:
echo 1 > /sys/kernel/mm/ksm/run
echo 300 > /sys/kernel/mm/ksm/pages_to_scan
echo 20 > /sys/kernel/mm/ksm/sleep_millisecsAfter 2 hours of operation (allowing KSM to complete several full scans):
Total memory allocated: 360 GB
Memory deduplicated: 142 GB (39.4%)
Shared pages: 36,352,000 (142 GB / 4 KB)
Actual physical usage: 218 GB (360 GB - 142 GB)
Page table overhead: 7.3 GB (reduced due to fewer physical pages)
Available for new VMs: 287 GB
Additional VMs possible: 47 (at 6 GB each)
VM consolidation improvement: 23 → 47 additional VMs (2× increase)
The CPU overhead was measurable but acceptable:
ksmd CPU usage: 3.2% (average over 24 hours)
- Scanning: 2.1%
- Merging operations: 0.7%
- COW handling: 0.4%
Impact on VM performance:
- 99th percentile web latency: 45ms → 48ms (+6.7%)
- Average latency: 12ms → 12.5ms (+4.2%)
- CPU overhead noticed but within SLA
Cost-benefit analysis:
Hardware cost per server: $12,000
Cost per GB: $23.40 ($12,000 / 512 GB)
Memory freed by KSM: 142 GB
Value of freed memory: $3,322
Percentage of server cost recovered: 27.7%
Across 1,000-server fleet:
Total value of recovered capacity: $3.3M
Equivalent to buying 276 additional servers
The deployment was considered successful. The team kept KSM enabled but tuned it conservatively to minimize CPU overhead:
# Final production settings
echo 200 > /sys/kernel/mm/ksm/pages_to_scan
echo 30 > /sys/kernel/mm/ksm/sleep_millisecs
# Result: 1.8% CPU overhead, 36% deduplication rateThe case study shows KSM working well, but it's not universally beneficial. Understanding when to enable KSM requires analyzing workload characteristics.
KSM excels when:
Similar workloads predominate. Virtualization hosting identical or similar guest OSes achieves the highest deduplication rates. Container orchestration systems running many replicas of the same services see similar benefits.
Read-heavy access patterns. Workloads that mostly read merged pages rarely trigger COW, avoiding the performance penalty. Web servers, read replicas, and caching layers fit this profile.
Memory pressure exists. When physical memory is scarce, the memory savings from KSM directly increase capacity. The CPU overhead becomes worthwhile when it prevents swapping or OOM kills.
Pages remain stable. Content that doesn't change frequently can be merged and stay merged. Operating system text segments, shared libraries, and static assets are ideal candidates.
KSM struggles when:
Write-heavy workloads dominate. Applications that frequently modify merged pages pay the COW penalty repeatedly. Databases performing in-place updates, memory-mapped I/O, and computational workloads with frequent writes all suffer.
Content is unique per process. Workloads with no duplicate data gain nothing from scanning overhead. Scientific computing, data analytics, and machine learning training often have this characteristic.
CPU is the bottleneck. On CPU-constrained systems, spending 2-5% of CPU on KSM scanning may be unacceptable even if memory savings are substantial.
Real-time requirements exist. KSM introduces jitter—page fault latency varies by whether COW is needed. Real-time systems requiring bounded worst-case latency should disable KSM.
Mixed workload guidance:
For virtualization:
- Enable for development/test environments (high similarity)
- Enable for production if read-heavy (e.g., web hosting)
- Disable for production databases (write-heavy)
For containers:
- Enable for stateless services (high similarity, read-heavy)
- Disable for stateful services (unique data, write-heavy)
- Use conservative scanning for mixed environments
For bare-metal applications:
- Rarely beneficial (less similarity between processes)
- Consider only if running many similar instances
KSM's sharing mechanism creates a subtle security issue: timing side channels. A malicious process can deduce information about other processes' memory contents by measuring COW latency.
The attack:
// Attacker's code
void *target = malloc(PAGE_SIZE);
memset(target, 0x41, PAGE_SIZE); // Fill with 'A'
madvise(target, PAGE_SIZE, MADV_MERGEABLE);
// Wait for KSM to potentially merge with victim's page
sleep(60);
// Time a write
uint64_t start = rdtsc();
target[0] = 0x42; // Trigger COW if merged
uint64_t end = rdtsc();
if (end - start > 1000) {
// High latency = COW occurred = victim had matching page
printf("Victim has page full of 0x41!\n");
} else {
// Low latency = no COW = victim's page differs
printf("Victim's page has different content\n");
}By systematically trying different page contents, an attacker can slowly deduce the contents of another process's memory. While slow and noisy, this has been demonstrated in research papers.
Mitigations:
The primary mitigation is that KSM is disabled by default. Processes
must explicitly opt in via MADV_MERGEABLE. This prevents
casual information leakage.
For multi-tenant environments requiring strict isolation:
# Disable KSM globally
echo 0 > /sys/kernel/mm/ksm/run
# Or use KSM only within trust boundaries
# (e.g., merge pages between VMs owned by same customer,
# but not across customers)Some security-focused distributions (e.g., OpenBSD) have rejected KSM entirely due to these concerns. The Linux kernel maintains it as an opt-in feature, placing the security decision in the hands of system administrators who must weigh the tradeoffs between memory efficiency and isolation.
Rowhammer amplification:
A more severe attack vector involves Rowhammer—the hardware vulnerability where repeatedly accessing a memory row can cause bit flips in adjacent rows. KSM could amplify Rowhammer impact:
1. Attacker creates page with specific content
2. KSM merges it with victim's page
3. Attacker hammers the shared physical page
4. Bit flip occurs, corrupting victim's copy
5. Victim reads corrupted data
This attack requires precise control over physical memory allocation (difficult but possible) and successful Rowhammer exploitation (increasingly difficult on modern DRAM). Nonetheless, it represents a theoretical risk that security-critical deployments must consider.
Modern kernels include Rowhammer mitigations (refresh rate increases, targeted row refresh) that help, but they don't eliminate the risk entirely. For maximum security, combining KSM disablement with other Rowhammer protections provides defense in depth.
In 2019, AMD released the EPYC 7742—a 64-core, 128-thread processor targeted at cloud infrastructure and high-performance computing. Intel followed with the Xeon Platinum 8380—40 cores, 80 threads. These many-core systems offer tremendous computational power, but they expose a fundamental challenge in kernel design: how do you safely modify shared data structures when hundreds of threads might be accessing them simultaneously?
Page tables exemplify this challenge. They're shared across all threads in a process (all threads share the same mm_struct), they're accessed on virtually every memory reference (via hardware page table walks), and they must be modified atomically during page faults, mmap operations, and memory unmapping. Getting the locking wrong leads to rare, catastrophic corruption that can be nearly impossible to debug.
This section examines how Linux manages page table concurrency across three dimensions: the locking hierarchy that prevents deadlocks, the use of RCU (Read-Copy-Update) to enable lockless reads, and the batching strategies that amortize the cost of TLB shootdowns across many operations.
Consider a database server running on a 128-core system. The database process spawns 128 worker threads to maximize parallelism. All threads share the same virtual address space—they all use the same page tables. Now suppose 64 of those threads simultaneously page fault on different addresses:
Thread 0: Faults on address 0x7fff_0000 (needs to allocate PTE)
Thread 1: Faults on address 0x7fff_1000 (different PTE, same page table page)
...
Thread 63: Faults on address 0x8000_0000 (different PUD entirely)
Each fault requires modifying page table structures. Thread 0 needs to allocate a new PTE and update its parent PMD. Thread 1 needs to modify a different PTE in the same page table page. Thread 63 needs to allocate a completely new PUD.
A naive implementation might use a single lock for all page table operations:
// WRONG: Global lock for everything
spinlock_t page_table_lock;
void handle_page_fault(unsigned long address) {
spin_lock(&page_table_lock);
// Walk page tables
// Allocate pages if needed
// Update PTEs
spin_unlock(&page_table_lock);
}This ensures correctness—no two threads modify page tables simultaneously—but it also destroys scalability. With 64 concurrent faults, 63 threads sit idle waiting for the lock. On a 128-core system, page faults become completely serialized. The system might as well be single-threaded.
The solution requires fine-grained locking at multiple levels of the page table hierarchy.
Linux uses a multi-level locking strategy that balances correctness, performance, and complexity:
Level 1: The mmap_lock (Process-Level Semaphore)
The highest-level lock protects the VMA list—the set of virtual memory areas that define which addresses are valid for a process:
struct mm_struct {
struct rw_semaphore mmap_lock; // Protects VMA list
struct vm_area_struct *mmap; // List of VMAs
// ... page table pointers (pgd)
};The mmap_lock is a read-write semaphore, allowing multiple readers (threads walking the VMA list) or one writer (thread modifying the VMA list via mmap/munmap):
// Read path (page fault handler)
down_read(&mm->mmap_lock);
struct vm_area_struct *vma = find_vma(mm, address);
if (vma && vma_allows_access(vma, address)) {
handle_fault(vma, address);
}
up_read(&mm->mmap_lock);
// Write path (mmap system call)
down_write(&mm->mmap_lock);
insert_vma(mm, new_vma);
up_write(&mm->mmap_lock);This approach allows concurrent page faults (multiple readers) while preventing races between page faults and VMA modifications. Two threads can simultaneously fault on different addresses, both taking the read lock. But a thread calling mmap() must wait for all page faults to complete before modifying the VMA list.
The mmap_lock can become a bottleneck. On a system with 128 threads all page faulting rapidly, even read lock acquisition shows contention. Modern kernels have partially addressed this with speculative page fault handling (attempting to handle faults without the mmap_lock at all), but full deployment remains challenging.
Level 2: Page Table Locks (Fine-Grained Spinlocks)
Once we've verified the VMA exists (under mmap_lock protection), we need to modify the actual page table. Each level of the page table hierarchy has associated protection:
// PTE-level lock (finest granularity)
spinlock_t *ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
pte_t *pte = pte_offset_map(pmd, address);
if (pte_none(*pte)) {
struct page *page = alloc_page(GFP_KERNEL);
pte_t entry = mk_pte(page, vma->vm_page_prot);
set_pte_at(mm, address, pte, entry);
}
spin_unlock(ptl);The key insight: the lock granularity is per page table page, not per process. Two threads faulting on addresses that map to different page table pages can proceed in parallel—they acquire different locks. Only threads faulting on addresses within the same 2 MB region (which share a page table page) contend.
On x86-64 with 4-level paging, a single page table page covers 2 MB of virtual address space (512 PTEs × 4 KB pages). For a process with a 1 TB address space (common in databases), there are potentially 524,288 different page table locks. The probability of two random page faults contending is extremely low.
A complete page fault handler must navigate this locking hierarchy carefully. Here's the flow for a minor page fault (page not present, but file-backed):
int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags) {
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
spinlock_t *ptl;
// We arrive here already holding mm->mmap_lock in read mode
// (taken by higher-level fault handler)
// Walk page table hierarchy (no locks needed for reading)
pgd = pgd_offset(mm, address);
if (pgd_none(*pgd) || pgd_bad(*pgd))
return VM_FAULT_OOM;
p4d = p4d_offset(pgd, address);
if (p4d_none(*p4d) || p4d_bad(*p4d))
return VM_FAULT_OOM;
pud = pud_offset(p4d, address);
if (pud_none(*pud) || pud_bad(*pud))
return VM_FAULT_OOM;
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
return do_huge_pmd_fault(mm, vma, address, pmd, flags);
// Get PTE lock before modifying PTEs
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
if (pte_none(*pte)) {
// Page not present - allocate and map
struct page *page = vma->vm_ops->fault(vma, address);
if (!page) {
pte_unmap_unlock(pte, ptl);
return VM_FAULT_OOM;
}
pte_t entry = mk_pte(page, vma->vm_page_prot);
set_pte_at(mm, address, pte, entry);
update_mmu_cache(vma, address, pte);
}
pte_unmap_unlock(pte, ptl);
return 0;
}Note the asymmetry: we walk the page table hierarchy (PGD → P4D → PUD → PMD) without locks, because we're only reading. Modern hardware guarantees that reading a 64-bit PTE is atomic, so we can safely check pte_none() without locking. Only when we're about to modify the PTE do we acquire the spinlock.
The above code handles minor faults—cases where we can immediately provide a page. Major faults, which require I/O to read data from disk, introduce additional complexity:
int do_major_fault(struct vm_area_struct *vma, unsigned long address,
pte_t *pte, spinlock_t *ptl) {
pte_t orig_pte = *pte;
// Release locks before I/O!
pte_unmap_unlock(pte, ptl);
up_read(&vma->vm_mm->mmap_lock);
// Perform I/O (can take milliseconds)
struct page *page = read_page_from_disk(vma, address);
// Re-acquire locks
down_read(&vma->vm_mm->mmap_lock);
pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
// Critical: Check if PTE changed while we were away
if (!pte_same(*pte, orig_pte)) {
// Someone else handled this fault while we were doing I/O
// (Multiple threads faulted on same address simultaneously)
pte_unmap_unlock(pte, ptl);
free_page(page); // Discard our work
return 0; // Success - page is present now
}
// Safe to proceed - PTE unchanged
set_pte_at(vma->vm_mm, address, pte, mk_pte(page, vma->vm_page_prot));
pte_unmap_unlock(pte, ptl);
return 0;
}This pattern—release locks, perform slow operation, re-acquire locks, verify nothing changed—appears throughout the kernel. It's necessary because holding locks during I/O would stall all other threads. But it introduces a new problem: multiple threads might simultaneously fault on the same address, each read a page from disk, and then race to update the PTE. The pte_same() check catches this, ensuring that only one thread's work is used while others discard their redundant effort.
The performance implication is subtle. For a single major fault, you pay the I/O cost exactly once (though potentially multiple threads waste CPU reading the same data). But if your workload triggers massive parallel faulting—say, 128 threads all accessing a memory-mapped file for the first time—you get serialization at the PTE level even though the I/O could be parallelized. This is one reason database systems often use explicit I/O rather than relying on mmap.
[Continuing in next message due to length…] ### RCU for Lockless Page Table Walks
The locking strategy described above works, but it has a cost: spinlocks aren't free. Even uncontended spinlock acquisition costs 10-20 cycles. On a page walk that might touch four locks (one per page table level), that's 40-80 cycles of pure locking overhead on top of the actual memory accesses.
For read-heavy operations—specifically, walking page tables to translate virtual addresses—Linux employs a more sophisticated technique: RCU (Read-Copy-Update). RCU allows readers to traverse data structures without any locks at all, relying instead on careful ordering and deferred reclamation.
The key RCU principle: readers never block, writers never block readers, but writers must wait for all existing readers to finish before freeing old versions of data.
For page tables, this means:
// Reader (get_user_pages_fast - used for DMA setup)
int gup_fast(unsigned long address, int nr_pages, struct page **pages) {
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
rcu_read_lock(); // Mark: I'm reading, don't free page tables
pgd = pgd_offset(current->mm, address);
if (pgd_none(*pgd) || pgd_bad(*pgd))
goto slow_path;
p4d = p4d_offset(pgd, address);
if (p4d_none(*p4d) || p4d_bad(*p4d))
goto slow_path;
pud = pud_offset(p4d, address);
if (pud_none(*pud) || pud_bad(*pud))
goto slow_path;
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
goto slow_path;
pte = pte_offset_map(pmd, address);
if (pte_present(*pte) && pte_accessible(*pte)) {
// Success - grab page reference
struct page *page = pte_page(*pte);
if (get_page_unless_zero(page)) { // Atomic increment
pages[0] = page;
pte_unmap(pte);
rcu_read_unlock();
return 1; // Got 1 page
}
}
pte_unmap(pte);
slow_path:
rcu_read_unlock();
// Fall back to slow path with locks
return gup_slow(address, nr_pages, pages);
}The rcu_read_lock() doesn't actually take a lock in the traditional sense. On many architectures, it simply disables preemption (preventing the scheduler from moving this thread to a different CPU). This has near-zero overhead—a single instruction that increments a per-CPU counter.
The magic happens on the writer side. When unmapping memory and freeing page table pages, the kernel must ensure no RCU reader is still using the page table:
void free_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr) {
pte_t *pte = pte_offset_map(pmd, addr);
// Clear the PMD entry (atomically!)
pmd_clear(pmd);
// Now wait for all RCU readers to finish
// This is the critical part
synchronize_rcu();
// Safe to free the page table page now
pte_free(mm, pte);
}The synchronize_rcu() function blocks until all CPUs have gone through a "quiescent state"—a point where they're not holding rcu_read_lock(). This is typically implemented via inter-processor interrupts (IPIs) that force each CPU to check in. Once all CPUs confirm they're not reading the page table, it's safe to free.
The latency of synchronize_rcu() varies:
Best case (all CPUs immediately quiescent): ~10 microseconds
Typical case (some CPUs busy): ~100 microseconds
Worst case (CPU spinning in kernel): ~10 milliseconds
This sounds slow, but it's acceptable for munmap() operations which are relatively infrequent. The win is on the reader side: get_user_pages_fast() can walk page tables with essentially no synchronization overhead, making DMA setup orders of magnitude faster.
RCU shines for: - DMA operations where the kernel frequently needs to translate user virtual addresses to physical addresses for hardware devices - System calls like read()/write() on files that might require page table walks - Fork where the child process needs to read parent page tables to set up COW mappings
RCU struggles for: - Write-heavy workloads where synchronize_rcu() cost dominates - Real-time systems where the worst-case synchronize_rcu() latency is unacceptable
We've discussed TLB shootdowns in previous chapters, but the importance in a many-core context deserves deeper examination. Every page table modification that changes a valid PTE to a different value (or makes it non-present) requires invalidating TLB entries on all CPUs that might have cached the old translation.
On a single-core system, this is cheap:
// Single-core TLB invalidation
set_pte_at(mm, address, pte, new_pte);
flush_tlb_page(address); // ~50 cyclesOn a multi-core system, flush_tlb_page() must send inter-processor interrupts to all CPUs running threads in this process:
void flush_tlb_page(unsigned long address) {
cpumask_t cpus;
// Find all CPUs running this mm_struct
cpumask_copy(&cpus, mm_cpumask(mm));
// Send IPI to each CPU
for_each_cpu(cpu, &cpus) {
send_ipi(cpu, TLB_FLUSH_VECTOR);
}
// Wait for acknowledgment from all CPUs
wait_for_ipi_acks(cpus);
}The cost scales with the number of active CPUs:
IPI latency per CPU: ~300-500 cycles
For 64 CPUs: 64 × 400 = 25,600 cycles = ~8 microseconds
For a munmap() of 10,000 pages done naively:
10,000 pages × 8 µs = 80 milliseconds!
The solution is batching. Rather than flushing the TLB after each PTE modification, collect all modifications and flush once at the end:
struct mmu_gather {
struct mm_struct *mm;
unsigned long start;
unsigned long end;
unsigned int fullmm:1; // Flush entire TLB?
};
void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long addr, unsigned long end) {
pgd_t *pgd;
unsigned long next;
pgd = pgd_offset(vma->vm_mm, addr);
do {
next = pgd_addr_end(addr, end);
// Unmap pages (modifying PTEs)
unmap_pud_range(tlb, pgd, addr, next);
// Don't flush TLB yet!
} while (pgd++, addr = next, addr != end);
}
// Later, when all pages unmapped:
void tlb_finish_mmu(struct mmu_gather *tlb) {
// Now flush TLB once for entire range
flush_tlb_range(tlb->mm, tlb->start, tlb->end);
}The performance difference is dramatic:
Naive approach (10,000 pages):
10,000 TLB flushes × 8 µs = 80 ms
Batched approach:
1 TLB flush for entire range = 8 µs
Speedup: 10,000×
In practice, batching can't achieve 10,000× because there are limits on how many addresses can be invalidated in a single operation. Modern x86-64 CPUs support the INVLPGB instruction which can invalidate ranges, but older processors must fall back to invalidating the entire TLB if the range exceeds a threshold (typically 32-64 pages). Still, even partial batching provides substantial wins.
How do we know if page table locking is a problem? Linux provides several tools to measure lock contention:
# Enable lock statistics
echo 1 > /proc/sys/kernel/lock_stat
# Run workload
./my_application
# Check mmap_lock contention
grep mmap_lock /proc/lock_stat
# Example output:
class name con-bounces contentions [... other columns ...]
mmap_lock: 1234567 987654 ...
[<ffffffffa1234567>] do_page_fault+0x123
[<ffffffffa2345678>] do_mmap+0x456High contention values indicate the lock is a bottleneck. You can also use perf to profile lock acquisition:
perf record -e 'sched:sched_stat_blocked' -a sleep 10
perf report
# Look for time spent blocked on mmap_lockReal-world contention analysis from a PostgreSQL database server:
Setup: 128 cores, 512 GB RAM, database with 400 GB working set
Workload: TPC-C with 1000 concurrent connections (multiple processes)
Without tuning:
mmap_lock contentions: 45,000 per second
Average wait time: 12 microseconds
Total CPU time in lock waits: 540 milliseconds/sec (54% of 1 CPU!)
95th percentile query latency: 180 ms (missed SLA)
After tuning (connection pooling to reduce processes):
Connections: 1000 → 100 (10× reduction)
Processes: 1000 → 20 (50× reduction)
mmap_lock contentions: 1,200 per second (97% reduction)
95th percentile query latency: 45 ms (under SLA)
The tuning didn't change page table locking behavior—it just reduced the number of independent mm_structs contending for resources.
A production PostgreSQL deployment illustrates the real-world impact of page table locking. PostgreSQL's process model creates a new process via fork() for each client connection. Each fork() must copy (or set up COW for) all parent page tables.
Scenario: - Server: 2× Intel Xeon Platinum 8380 (40 cores each, 80 total) - RAM: 512 GB - Database: 200 GB working set - Load: 1000 new connections per second (peak traffic)
Problem: Each fork() took the mmap_lock in write mode, blocking all page faults in the parent process:
// Simplified fork() implementation
pid_t do_fork() {
// Must hold mmap_lock to copy/COW page tables
down_write(¤t->mm->mmap_lock);
// Copy or COW all VMAs and page tables
copy_mm(parent, child); // Takes ~2 milliseconds for 200 GB
up_write(¤t->mm->mmap_lock);
return child_pid;
}Impact measurement:
perf record -e 'lock:lock_acquire' -a -g sleep 10
perf report --sort=symbol
# Top contended lock: mmap_lock
# Callers:
# 45.2% copy_page_range (fork)
# 32.1% handle_mm_fault (page faults)
# 12.8% do_mmap (memory mapping)At 1000 forks/second, each taking 2 ms with the write lock held:
Time spent holding write lock: 1000 × 2ms = 2,000 ms/sec = 2 CPU cores
But the impact cascades:
- While write lock held, all page faults block
- 80 cores trying to page fault
- Average wait time: 1 ms per page fault
- Queries stall, latency spikes
Profiling showed:
CPU time breakdown:
Useful work: 45%
Waiting on mmap_lock: 38%
Other locks: 12%
Scheduler overhead: 5%
More than one-third of CPU time wasted waiting for the mmap_lock!
Solution:
Deploy pgBouncer, a connection pooler:
Before: 1000 clients → 1000 PostgreSQL processes
After: 1000 clients → pgBouncer → 50 PostgreSQL processes
Fork rate: 1000/sec → 5/sec (200× reduction)
Results:
mmap_lock contention: 98% reduction
CPU time in useful work: 45% → 82%
95th percentile latency: 380 ms → 42 ms
Throughput: 12,000 qps → 31,000 qps (2.5× increase)
The lesson: page table locking isn't an isolated issue. It interacts with application architecture. PostgreSQL's one-process-per-connection model created unsustainable lock contention at scale. Connection pooling—an application-level solution—resolved what appeared to be a kernel bottleneck.
Modern servers increasingly feature multiple CPU sockets, each with its own directly-attached memory. This Non-Uniform Memory Access (NUMA) architecture provides aggregate memory bandwidth that scales with socket count, but it introduces a critical asymmetry: local memory accesses are fast, while remote memory accesses are slow.
The performance gap isn't subtle. On a 2-socket AMD EPYC system:
Local DRAM access: ~100 ns latency, ~80 GB/s bandwidth
Remote DRAM access: ~150 ns latency, ~40 GB/s bandwidth
Performance penalty: +50% latency, -50% bandwidth
On larger 4-socket and 8-socket systems, the gap widens further. An access from socket 0 to memory on socket 3 might traverse multiple hops through the coherency fabric, pushing latency to 200-300 ns.
Every page table walk involves multiple memory accesses—typically four on x86-64 (one per page table level). If those page table pages reside on a remote NUMA node, every hardware page walk pays the remote access penalty four times over. For a workload with high TLB miss rates, NUMA-incorrect page table placement can destroy performance.
Before diving into page table placement, we need a clear mental model of NUMA topology. Consider a 2-socket Intel Xeon system:
Socket 0: Socket 1:
28 cores (0-27) 28 cores (28-55)
256 GB local RAM 256 GB local RAM
Node ID: 0 Node ID: 1
Connected via UPI (Ultra Path Interconnect):
Bandwidth: ~40 GB/s per direction
Latency: ~50 ns hop
Each socket is a NUMA node with its own memory controller. Cores on socket 0 can access node 0's RAM directly (local access) or node 1's RAM via UPI (remote access).
The kernel exposes this topology:
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0-27
node 0 size: 261888 MB
node 0 free: 180234 MB
node 1 cpus: 28-55
node 1 size: 262144 MB
node 1 free: 195067 MB
node distances:
node 0 1
0: 10 21
1: 21 10The "distance" metric (10 for local, 21 for remote) is a relative cost indicator. Higher numbers mean slower access.
When a process allocates page tables, where in physical memory should they reside? Consider a simple single-threaded application:
// Process starts on CPU 12 (node 0)
int main() {
void *data = malloc(1GB); // Triggers page table allocation
process_data(data);
}The first page fault allocates page table structures. Linux's default policy (in most configurations) allocates from the CPU's local node:
// Simplified page table allocation
pte_t *pte_alloc_one(struct mm_struct *mm) {
int node = numa_node_id(); // Current CPU's node
struct page *page = alloc_pages_node(node, GFP_KERNEL, 0);
return page_address(page);
}This seems reasonable—allocate page tables on the same node as the CPU doing the allocation. But consider what happens if the process later migrates:
T0: Process starts on CPU 12 (node 0)
Allocates page tables on node 0
T1: Process runs, page tables on node 0, CPU on node 0
Performance: Good (local access)
T2: CPU scheduler moves process to CPU 42 (node 1)
Why? Load balancing, CPU affinity changes, etc.
T3: Process runs, page tables on node 0, CPU on node 1
Performance: Bad (remote access for every page walk)
Every TLB miss now requires four remote memory accesses (one per page table level). For a workload with 10% TLB miss rate and 1 billion memory accesses per second:
TLB misses: 100 million per second
Page table accesses: 100M × 4 = 400 million per second
Remote access penalty: 50 ns extra per access
Total overhead: 400M × 50ns = 20 seconds of latency per second!
Effective: Process runs at 50% efficiency
This is not hypothetical. Production databases and HPC applications routinely suffer 30-40% performance degradation from NUMA-incorrect page table placement.
Linux's default behavior (as of kernel 6.x) follows a "first-touch" policy:
// Where page tables are allocated
pte_t *pte_alloc_one(struct mm_struct *mm) {
// Allocate from the current CPU's node
int nid = numa_node_id();
gfp_t gfp_mask = GFP_KERNEL | __GFP_ZERO;
struct page *page = alloc_pages_node(nid, gfp_mask, 0);
return page ? page_address(page) : NULL;
}This policy has a critical flaw: it assumes the CPU that first faults a page is the CPU that will primarily access it. This assumption often holds for data pages (the thread that allocates memory usually uses it), but it breaks down for page tables in several scenarios:
Scenario 1: Process migration after startup
A process allocates memory during initialization while running on node 0, then the scheduler moves it to node 1 for load balancing:
Initialization phase: CPU on node 0, page tables on node 0 ✓
Running phase: CPU on node 1, page tables on node 0 ✗
Scenario 2: Multi-threaded with thread spreading
A multi-threaded application has threads spread across nodes:
Thread 1: Runs on node 0
Thread 2: Runs on node 1
Thread 3: Runs on node 0
Thread 4: Runs on node 1
All threads share page tables (same mm_struct)
Page tables allocated on node 0 (first fault happened on node 0)
Result: Threads on node 1 pay remote access penalty
Scenario 3: Fork inheritance
Child processes inherit page tables from parent:
pid_t child = fork();
if (child == 0) {
// Child process
// Inherits page tables allocated on parent's node
// If child migrates to different node, performance suffers
}Before optimizing, we need to measure. The perf tool can track remote vs. local memory accesses:
# Run workload and measure NUMA accesses
perf stat -e node-loads,node-load-misses -- ./workload
# Example output:
1,234,567,890 node-loads
456,789,012 node-load-misses # 37% remote!A high percentage of node-load-misses indicates poor NUMA locality. But this doesn't distinguish between data page accesses and page table accesses. For more granular analysis:
# Sample page table walk latencies
perf record -e mem:dtlb_load_misses.walk_duration -- ./workload
perf report --sort=symbol,dso
# Look for high latency page table walksOr check page allocation locations programmatically:
# Check which node owns a process's page tables
cat /proc/<PID>/numa_maps
# Example output:
7f0000000000 default file=/lib/x86_64-linux-gnu/libc.so.6 mapped=200 \
N0=150 N1=50 # 150 pages on node 0, 50 on node 1The most direct solution: bind processes to specific NUMA nodes and ensure memory allocation happens locally:
# Run process on node 0 only
numactl --cpunodebind=0 --membind=0 ./workloadThis forces both CPU scheduling and memory allocation to node 0. The kernel's first-touch policy then places page tables on node 0, matching the CPU location.
For multi-threaded workloads, partition threads across nodes:
# Thread 0-15 on node 0, threads 16-31 on node 1
numactl --cpunodebind=0 --physcpubind=0-15 ./workload &
numactl --cpunodebind=1 --physcpubind=16-31 ./workload &But this creates a new problem: threads on different nodes share page tables, so at least one set of threads will always access remote page tables. The solution is more complex: each thread should have its own page tables (separate processes, not threads) or use thread-local memory allocation.
When a process migrates nodes, migrate its page tables too:
// Conceptual (not actual Linux kernel code)
void migrate_page_tables(struct mm_struct *mm, int target_node) {
int current_node = page_to_nid(virt_to_page(mm->pgd));
if (current_node == target_node)
return; // Already on correct node
// Allocate new PGD on target node
pgd_t *new_pgd = alloc_pages_node(target_node, GFP_KERNEL, 0);
// Copy page table entries
memcpy(new_pgd, mm->pgd, PAGE_SIZE);
// Update mm_struct to point to new PGD
mm->pgd = new_pgd;
// Free old PGD
free_pages(old_pgd);
// Flush TLBs on all CPUs
flush_tlb_mm(mm);
}This is expensive (copying and TLB flushing), so it should only happen when a process has been running on a remote node for a sustained period (e.g., >10 seconds).
Linux doesn't do this automatically as of kernel 6.x, but some HPC systems implement custom patches for it. The challenge is detecting when migration is beneficial vs. when the process is about to migrate back.
For workloads that genuinely spread across all nodes equally, interleaving page table allocation can balance load:
// Allocate page tables round-robin across nodes
static atomic_t pt_node_counter = ATOMIC_INIT(0);
pte_t *pte_alloc_one_interleaved(struct mm_struct *mm) {
int nid = atomic_inc_return(&pt_node_counter) % num_online_nodes();
return alloc_pages_node(nid, GFP_KERNEL, 0);
}This ensures no single node becomes a hotspot, but it also guarantees that every access has some probability of being remote. The trade-off:
Best case (local allocation): 100 ns average access
Worst case (remote allocation): 150 ns average access
Interleaved (50/50 on 2-socket): 125 ns average access
Interleaving is a hedge—it prevents worst-case scenarios but also caps best-case performance.
Chapter 4 introduced Transparent Huge Pages (THP) and showed how 2 MB pages dramatically reduce TLB pressure. We saw benchmark results demonstrating 21-24% performance gains on STREAM and 172% improvements on database workloads. But those benchmarks shared a common characteristic: they were freshly started systems with ample free memory.
Production systems are messier. After days or weeks of uptime, physical memory becomes fragmented. Files are created and deleted, processes start and stop, and the kernel's memory allocator juggles competing demands. THP's ability to allocate 2 MB contiguous chunks degrades steadily, and with it, the performance benefits erode.
This section examines the practical challenges of maintaining THP effectiveness in production and the strategies to address them.
THP requires 2 MB of contiguous physical memory (512 consecutive 4 KB pages). Fresh after boot, this is trivial:
Physical memory (immediately after boot):
[FFFFFFFFFFFFFFFF][FFFFFFFFFFFFFFFF][FFFFFFFFFFFFFFFF]...
2 MB chunk 2 MB chunk 2 MB chunk
F = Free
All memory is large contiguous blocks - THP allocation succeeds 100%
After a week of uptime running diverse workloads:
Physical memory (after 1 week):
[UUU][F][UUU][F][UUU][FFFFF][UU][FFF][UUUUUUUUUUUU]...
Can't form 2 MB chunk - too fragmented
U = Used
F = Free
No 2 MB contiguous regions available despite plenty of free memory
The kernel's buddy allocator tracks free blocks in orders (order 0 = 4 KB, order 1 = 8 KB, …, order 9 = 2 MB). Check current fragmentation:
$ cat /proc/buddyinfo
Node 0, zone Normal 128 64 32 16 8 4 2 1 0 0 0
4KB 8KB 16KB 32KB 64KB 128KB 256KB 512KB 1MB 2MB 4MBThe rightmost columns (order 9 and 10, representing 2 MB and 4 MB blocks) are zero—no large contiguous blocks available. THP allocations will fail.
Measure THP allocation success rate:
$ grep -E 'thp|compact' /proc/vmstat
thp_fault_alloc 12345
thp_fault_fallback 67890 # 85% failure rate!
thp_collapse_alloc 234
compact_stall 45678
compact_fail 34567High thp_fault_fallback indicates THP allocations are failing due to fragmentation. High compact_fail means memory compaction (attempts to create contiguous regions) is also failing.
The kernel's memory compaction algorithm attempts to create contiguous regions by moving movable pages:
// Simplified compaction
void compact_zone(struct zone *zone) {
unsigned long low_pfn = zone->start_pfn; // Scan from bottom
unsigned long high_pfn = zone->end_pfn; // Scan from top
while (low_pfn < high_pfn) {
// Find movable page at low end
while (!PageMovable(low_pfn) && low_pfn < high_pfn)
low_pfn++;
// Find free page at high end
while (!PageFree(high_pfn) && high_pfn > low_pfn)
high_pfn--;
if (low_pfn >= high_pfn)
break;
// Move page from low_pfn to high_pfn
migrate_page(low_pfn, high_pfn);
low_pfn++;
high_pfn--;
}
// Result: Free pages clustered at one end, creating large blocks
}The migration process is expensive:
Per-page migration cost:
Copy 4 KB of data: ~150 cycles (L3 miss + DRAM write)
Update page tables: ~50 cycles
Invalidate TLB: ~50 cycles (plus IPI to other CPUs)
Total: ~250 cycles per page
To create one 2 MB block might require moving 100+ pages:
100 × 250 = 25,000 cycles = ~8 microseconds
For a heavily fragmented system, creating 1000 huge pages (2 GB total) costs:
1000 × 8 µs = 8 milliseconds of compaction overhead
This happens synchronously during THP allocation, directly adding latency to the faulting thread.
[Content continues in next file due to length…] ### Compaction Strategies
Linux offers three THP allocation strategies:
# always: Always try THP, compact synchronously if needed
echo always > /sys/kernel/mm/transparent_hugepage/enabled
# madvise: Only for memory regions marked MADV_HUGEPAGE
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# never: Disable THP completely
echo never > /sys/kernel/mm/transparent_hugepage/enabledAnd three compaction strategies:
# always: Compact synchronously during allocation (high latency)
echo always > /sys/kernel/mm/transparent_hugepage/defrag
# defer: Background compaction via kcompactd daemon
echo defer > /sys/kernel/mm/transparent_hugepage/defrag
# defer+madvise: Background for most, synchronous for MADV_HUGEPAGE
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
# never: No compaction (THP disabled if memory fragmented)
echo never > /sys/kernel/mm/transparent_hugepage/defragThe "always/always" combination maximizes THP coverage but can cause severe latency spikes:
Application page fault without THP: 1-2 microseconds
Application page fault with THP compaction: 50-500 microseconds
For latency-sensitive applications (web servers, real-time systems), this is unacceptable. The "defer" strategy moves compaction to background:
kcompactd kernel thread:
- Wakes periodically (every 10-100 ms)
- Compacts memory zones that are fragmented
- Creates free 2 MB blocks proactively
Page fault flow with defer:
1. Attempt THP allocation
2. If fails (no 2 MB blocks), fall back to 4 KB
3. Wake kcompactd to create blocks for next time
4. Return immediately (no synchronous compaction)
Latency: Same as 4 KB allocation (~1-2 µs)
The trade-off: lower THP coverage initially, but coverage improves over time as kcompactd creates blocks. For most workloads, "defer+madvise" provides the best balance.
A production PostgreSQL deployment illustrates the tuning process.
Setup: - Server: 2× AMD EPYC 7763 (64 cores/socket, 128 total) - RAM: 1 TB DDR4-3200 - Workload: TPC-C @ 5000 warehouses (~200 GB database) - SLA: 99th percentile latency < 100 ms
Initial configuration (defaults):
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
cat /sys/kernel/mm/transparent_hugepage/defrag
# [always] defer defer+madvise neverProblem observed:
# Monitoring over 24 hours
$ grep thp /proc/vmstat
thp_fault_alloc 1,234,567
thp_fault_fallback 890,123 # 42% failure rate
thp_collapse_alloc 45,678
thp_split_page 123,456
$ grep compact /proc/vmstat
compact_stall 234,567 # Synchronous compaction events
compact_fail 189,012 # 81% compaction failure rateHigh compact_stall with frequent failures means: - Page faults triggering synchronous compaction - Compaction often failing to create 2 MB blocks - High latency variance
Performance impact:
Query latency distribution:
Median (p50): 8 ms
p95: 45 ms
p99: 850 ms ← Violates SLA!
p99.9: 2,300 ms
Investigating p99 outliers:
perf record -e page-faults -g -- sleep 60
perf report --sort=symbol
Top latency sources:
45% thp_fault_fallback → __alloc_pages → compact_zone
28% Normal page faults
15% mmap operations
Nearly half the latency budget consumed by THP compaction attempts!
Step 1: Switch to deferred compaction
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defragResult after 1 hour:
THP coverage: 55% → 48% (slight decrease)
Compact stalls: 234,567/day → 1,234/day (99.5% reduction!)
p99 latency: 850 ms → 120 ms (86% improvement)
But still missing SLA...
Step 2: Explicit huge pages for shared buffers
PostgreSQL's shared_buffers (in-memory cache) is allocated at startup and accessed frequently. Mark it for explicit huge page use:
// PostgreSQL source (simplified)
void *shared_buffers = mmap(NULL, 32GB, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
-1, 0);
if (shared_buffers == MAP_FAILED) {
// Fallback to THP
shared_buffers = mmap(NULL, 32GB, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
madvise(shared_buffers, 32GB, MADV_HUGEPAGE);
}Pre-reserve huge pages:
# Reserve 16,384 2MB pages (32 GB)
echo 16384 > /proc/sys/vm/nr_hugepagesStep 3: Disable THP for metadata allocations
PostgreSQL allocates many small structures (row cache, index nodes, etc.). These change size frequently and benefit little from THP:
void *metadata = malloc(size);
madvise(metadata, size, MADV_NOHUGEPAGE); // Opt out of THPFinal results:
THP coverage:
Shared buffers: 100% (explicit huge pages)
Data pages: 65% (THP with defer)
Metadata: 0% (explicitly disabled)
Overall effective coverage: 82%
Compaction behavior:
Stalls: <100/day (from 234,567/day)
kcompactd CPU usage: 0.3%
Latency:
p50: 7 ms (improved)
p95: 38 ms (within SLA)
p99: 87 ms (within SLA!)
p99.9: 145 ms
Lessons learned:
Linux supports two huge page mechanisms:
HugeTLB (explicit huge pages):
# Reserve at boot
echo 1024 > /proc/sys/vm/nr_hugepages # 2 GB
# Application requests explicitly
void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);THP (transparent huge pages):
# Enable globally
echo always > /sys/kernel/mm/transparent_hugepage/enabled
# Application gets huge pages automatically
void *ptr = malloc(size); # May use huge pagesWhen to use each:
HugeTLB strengths: - Guaranteed allocation (reserved at boot) - No fragmentation issues (never compacted) - Predictable behavior (no promotion/demotion) - Lower overhead (no khugepaged scanning)
HugeTLB weaknesses: - Requires pre-reservation (wastes memory if not used) - Application changes needed (explicit mmap) - Inflexible (can't easily adjust size)
THP strengths: - No application changes needed - Dynamic (adapts to workload) - No wasted reservations - Works for any allocation
THP weaknesses: - Fragmentation-dependent (may fail) - Compaction overhead (latency spikes) - Unpredictable coverage - khugepaged CPU overhead
Decision matrix:
Use HugeTLB when:
- Memory size known at startup (e.g., database shared buffers)
- Maximum predictability needed
- Memory is abundant (reservation acceptable)
- Application can be modified
Use THP when:
- Memory size varies
- Application can't be modified
- Memory is scarce (can't afford reservations)
- Some variability acceptable
Use both:
- HugeTLB for critical allocations
- THP for everything else
- Best of both worlds
Sometimes huge pages need to move between NUMA nodes. Unlike 4 KB pages, migrating 2 MB pages is expensive:
4 KB page migration:
Copy: 4 KB @ 50 GB/s = 0.08 µs
Remap: ~0.2 µs
Total: ~0.3 µs
2 MB page migration:
Copy: 2 MB @ 50 GB/s = 40 µs
Remap: ~0.2 µs
Total: ~40 µs (133× slower!)
The kernel can split huge pages before migration:
// Migrate huge page to new node
int migrate_huge_page(struct page *hpage, int target_node) {
if (expensive_to_migrate(hpage)) {
// Split into 512 4KB pages
split_huge_page(hpage);
// Migrate 4KB pages individually (can be parallelized)
for (each_small_page)
migrate_page(small_page, target_node);
// Try to merge back into huge page at destination
merge_pages(target_node);
} else {
// Direct migration
migrate_page_direct(hpage, target_node);
}
}The split-migrate-merge approach adds complexity but can reduce latency when done incrementally.
We've optimized the pages themselves with KSM and huge pages, and we've optimized page table locking for concurrency. But what about the page table structures themselves? Can we reduce their overhead?
In a system running 1,000 containers, each with its own address space, the page table overhead is substantial:
Per-container page table overhead:
Average virtual memory: 4 GB
Page table overhead: ~8 MB (0.2%)
Total: 1,000 × 8 MB = 8 GB
For 1,000 containers on a 256 GB server: 3.1% overhead
That's ~$375 of RAM cost (at $12/GB) per server just for page tables. Multiply by 10,000 servers in a data center: $3.75M worth of page table overhead.
This section explores techniques to reduce that cost through compaction and sharing.
Not all virtual address spaces are densely populated. Applications often map large regions but only touch small portions:
Application virtual memory: 128 GB mapped
Actually resident: 8 GB (6.25%)
Page table allocation:
Naive: Full page tables for 128 GB = 256 MB overhead
Smart: Page tables only for 8 GB resident = 16 MB overhead
Potential savings: 240 MB (93.75%)
The Linux kernel already does this partially—it allocates page table levels on-demand. But it can't reclaim levels that become empty later.
Example scenario:
// Map 1 GB
void *region = mmap(NULL, 1GB, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// Use just 100 MB of it
memset(region, 0, 100MB);
// Later, unmap the 100 MB we used
munmap(region, 100MB);
// Problem: Page table structures still allocated for full 1 GB!
// 900 MB of address space has empty page tables consuming memoryThe page tables remain allocated even though they're empty. The kernel could detect this and free them:
void free_empty_page_tables(struct mm_struct *mm) {
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
for_each_pgd(mm, pgd) {
for_each_p4d(pgd, p4d) {
for_each_pud(p4d, pud) {
for_each_pmd(pud, pmd) {
// Count valid PTEs
int valid_ptes = count_present_ptes(pmd);
if (valid_ptes == 0) {
// This page table page has no valid entries
// Free it and update PMD
free_pte_page(pmd);
pmd_clear(pmd);
}
}
// Check if PUD level is now empty
if (pud_empty(pud)) {
free_pmd_pages(pud);
pud_clear(pud);
}
}
}
}
}This reclamation could run periodically (e.g., during memory pressure) to free unused page table pages.
Measured impact:
Test: Allocate 100 GB, use 10 GB, free the 10 GB
Before compaction:
Page table overhead: 200 MB
After compaction:
Page table overhead: 2 MB
Savings: 198 MB (99%)
However, Linux doesn't implement aggressive page table reclamation as of kernel 6.x. The complexity and locking overhead generally outweigh the benefits for typical workloads. But specialized systems (containers, VMs) could benefit significantly.
Consider libc.so, loaded by every process on the system:
System with 500 processes:
Each loads libc.so (2 MB, mapped read-only)
Traditional approach:
Each process has own PTEs for libc mapping
500 processes × 4 PTEs (2 MB / 512 entries) = 2,000 PTEs
2,000 × 8 bytes = 16 KB of redundant page table entries
For a single library, 16 KB seems trivial. But multiply by dozens of shared libraries:
Typical shared libraries per process:
libc.so: 2 MB
libpthread.so: 512 KB
libm.so: 1 MB
libssl.so: 4 MB
libcrypto.so: 3 MB
... (20+ libraries total)
Total: ~50 MB of shared libraries
Page table overhead: 100 KB per process
For 500 processes: 50 MB of duplicate PTEs!
Sharing approach:
// Global PTE cache for read-only mappings
struct shared_pte {
void *file_key; // Inode + offset
pte_t *ptes; // Shared page table page
atomic_t refcount;
spinlock_t lock;
};
// On mmap of read-only file
pmd_t *setup_file_mapping(struct mm_struct *mm, struct file *file,
unsigned long address) {
struct shared_pte *shared = find_shared_pte(file, offset);
if (shared) {
// Reuse existing PTEs
pmd_populate(mm, pmd, shared->ptes);
atomic_inc(&shared->refcount);
return pmd;
}
// First mapping - create shared PTEs
pte_t *ptes = alloc_page_table_page();
build_ptes(ptes, file, offset);
shared = create_shared_pte(file, offset, ptes);
register_shared_pte(shared);
pmd_populate(mm, pmd, ptes);
return pmd;
}Challenges:
Reference counting: Must track how many processes use each shared PTE page. When refcount drops to zero, free it.
Permissions: Can only share truly read-only mappings. If one process calls mprotect() to make pages writable, must break sharing (copy-on-write for page tables).
TLB coherency: When shared PTEs are modified (e.g., page reclaimed), must invalidate TLBs on all processes sharing them.
Locking: Multiple processes might simultaneously fault on the same shared mapping. Need synchronization.
Actual Linux behavior:
Linux does NOT share page table pages across processes (as of kernel 6.x). Each process maintains entirely separate page tables, even for identical read-only mappings. The complexity and locking overhead were deemed too high.
However, research prototypes have demonstrated 50-90% page table overhead reduction for shared library mappings. Future kernels might implement this if container density pressures increase further.
Alternative: Huge pages reduce overhead indirectly
Using 2 MB pages for shared libraries reduces PTE count:
libc.so: 2 MB mapped
With 4 KB pages:
512 PTEs needed
Page table overhead: 4 KB (one page table page)
With 2 MB pages:
1 PMD entry needed
Page table overhead: 0 KB (just a PMD entry)
Savings: 4 KB per process
For 500 processes: 2 MB total savings
This is one reason to enable THP even for read-only files.
Some architectures support compressed PTE formats. ARM64's contiguous bit allows marking groups of entries:
Normal ARM64 PTE (64-bit):
[Physical Address][Flags]
Contiguous PTE:
[Physical Address][Flags|Contiguous bit set]
When contiguous bit set on 16 consecutive PTEs:
Hardware treats as single 64 KB mapping
Only first PTE needs to be in TLB
Other 15 PTEs don't consume TLB entries
This doesn't reduce page table memory usage, but it effectively increases TLB coverage by 16×.
x86-64 doesn't have a direct equivalent, but the PS (Page Size) bit in PMD entries serves a similar purpose by replacing 512 PTEs with a single 2 MB entry.
On a 128-core system, serializing page table operations destroys scalability. Modern kernels employ several strategies to parallelize operations that traditionally required global synchronization.
Traditional approach serializes all faults on the mmap_lock:
// OLD: Serial page fault handling
int handle_page_fault(unsigned long address) {
down_read(&mm->mmap_lock); // Blocks if another fault in progress
struct vm_area_struct *vma = find_vma(mm, address);
handle_fault(vma, address);
up_read(&mm->mmap_lock);
}On a 128-core system with threads simultaneously faulting on different addresses, this creates a convoy effect—threads queue waiting for the lock.
Modern approach: Speculative fault handling
// NEW: Lockless fast path
int handle_page_fault_lockless(unsigned long address) {
// Try to handle fault without mmap_lock
rcu_read_lock();
struct vm_area_struct *vma = find_vma_rcu(mm, address);
if (vma && vma_allows_fault(vma, address)) {
// VMA exists and allows faults - try lockless handling
pte_t *pte = lookup_pte_lockless(mm, address);
if (pte_none(*pte)) {
// Try to install page without locks
struct page *page = alloc_page(GFP_ATOMIC);
if (page && try_set_pte_atomic(pte, mk_pte(page))) {
// Success - installed PTE without taking locks!
rcu_read_unlock();
return 0;
}
free_page(page); // Race or allocation failure
}
}
rcu_read_unlock();
// Fast path failed - fall back to locked path
return handle_page_fault_locked(address);
}The try_set_pte_atomic() uses compare-and-swap to install the PTE only if it's still empty:
bool try_set_pte_atomic(pte_t *ptep, pte_t new_pte) {
pte_t old_pte = READ_ONCE(*ptep);
if (!pte_none(old_pte))
return false; // PTE already set by another thread
// Atomically set if still empty
return cmpxchg64((uint64_t *)ptep,
pte_val(old_pte),
pte_val(new_pte)) == pte_val(old_pte);
}If two threads race to fault the same page, one succeeds (its cmpxchg wins) and the other fails (detects PTE is no longer empty) and frees its allocated page.
Performance impact:
Benchmark: 128 threads, each faulting 10,000 pages
Serial (old approach):
Time: 18.5 seconds
Faults/sec: 69,189
Parallel (new approach):
Time: 2.4 seconds
Faults/sec: 533,333 (7.7× faster!)
The improvement scales with core count—more cores means more parallelism benefit.
[Content continues…] ### Batched mmap/munmap Operations
Applications that repeatedly allocate and free memory can benefit from batching:
// BAD: Many syscalls
for (i = 0; i < 10000; i++) {
void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// Use p
munmap(p, 4096);
}
// 20,000 syscalls, each taking mmap_lockOptimization: Allocate arena once, sub-allocate from it
// GOOD: Single allocation
void *arena = mmap(NULL, 40MB, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
for (i = 0; i < 10000; i++) {
void *p = arena + (i * 4096);
// Use p - no syscall!
}
munmap(arena, 40MB); // Single deallocationMeasured performance:
Naive (20,000 syscalls):
Time: 450 ms
CPU time in kernel: 320 ms (71%)
Batched (2 syscalls):
Time: 12 ms
CPU time in kernel: 2 ms (16%)
Speedup: 37.5×
This pattern is used by allocators like jemalloc and tcmalloc to minimize kernel interaction.
We discussed RCU-based page table walks in section 9.3. The implementation allows the kernel to translate user virtual addresses to physical addresses without taking locks—critical for performance of direct I/O and DMA operations.
Use case: DMA buffer mapping
// Network driver needs to DMA into user buffer
void setup_dma(struct user_buffer *ubuf) {
struct page **pages;
int nr_pages = ubuf->size / PAGE_SIZE;
pages = kmalloc(nr_pages * sizeof(struct page *), GFP_KERNEL);
// Fast path - no locks!
int got = get_user_pages_fast(ubuf->address, nr_pages,
FOLL_WRITE, pages);
if (got < nr_pages) {
// Some pages couldn't be grabbed locklessly
// Fall back to slow path for remaining
get_user_pages_slow(ubuf->address + got * PAGE_SIZE,
nr_pages - got, pages + got);
}
// Program DMA engine with physical addresses
for (i = 0; i < nr_pages; i++) {
dma_program(pages[i]->physical_address);
}
}Performance comparison:
DMA setup for 10,000-page buffer:
get_user_pages (locked):
Acquire mmap_lock: 0.2 µs
Walk page tables: 2.8 ms
Release mmap_lock: 0.1 µs
Total: 2.801 ms
get_user_pages_fast (lockless):
Walk page tables: 0.4 ms (no lock overhead)
Total: 0.4 ms
Speedup: 7×
The speedup comes from both eliminating lock overhead and allowing full CPU parallelism (lockless walks can proceed on multiple CPUs simultaneously).
When physical memory fills up, traditional systems swap to disk—writing pages to storage at ~100 MB/s for SSD, ~50 MB/s for HDD. Memory compression offers an alternative: compress pages in RAM rather than evicting to disk, effectively extending memory capacity at the cost of CPU cycles.
Linux's zswap provides a compressed cache layer between RAM and swap:
Application
↓
Page Cache (uncompressed)
↓ (memory pressure)
zswap (compressed in RAM)
↓ (overflow)
Swap device (disk/SSD)
The key insight: decompression from RAM (~5 µs) is 2000× faster than reading from SSD (~10 ms).
zswap intercepts pages being swapped out:
// Simplified swap path
int swap_writepage(struct page *page) {
// Try zswap first
if (zswap_store(page)) {
// Success - page compressed and cached in RAM
return 0;
}
// zswap full or compression failed - write to disk
return write_to_swap_device(page);
}The zswap_store() function compresses the page:
int zswap_store(struct page *page) {
void *page_data = kmap_atomic(page);
// Allocate compression buffer
void *compressed = zpool_malloc(max_compressed_size);
// Compress (using LZ4, zstd, or other algorithm)
int comp_size = compress(page_data, PAGE_SIZE, compressed);
if (comp_size >= PAGE_SIZE) {
// Compression didn't help - don't store
zpool_free(compressed);
kunmap_atomic(page_data);
return -EINVAL;
}
// Store metadata for later retrieval
zswap_entry_t *entry = zswap_entry_cache_alloc();
entry->offset = zpool_get_offset(compressed);
entry->length = comp_size;
entry->page = page;
// Add to zswap tree (keyed by swap offset)
zswap_tree_insert(entry);
kunmap_atomic(page_data);
return 0;
}On swap-in, zswap checks if the page is in the compressed cache:
int swap_readpage(struct page *page) {
// Check zswap first
if (zswap_load(page)) {
// Found in compressed cache
return 0;
}
// Not in zswap - read from disk
return read_from_swap_device(page);
}
int zswap_load(struct page *page) {
zswap_entry_t *entry = zswap_tree_lookup(page->swap_offset);
if (!entry)
return 0; // Not in zswap
// Get compressed data
void *compressed = zpool_map(entry->offset);
void *page_data = kmap_atomic(page);
// Decompress
decompress(compressed, entry->length, page_data, PAGE_SIZE);
kunmap_atomic(page_data);
zpool_unmap(compressed);
// Remove from zswap (page now in RAM uncompressed)
zswap_tree_delete(entry);
zswap_entry_cache_free(entry);
return 1;
}zswap supports multiple compression algorithms:
# Check available compressors
cat /sys/module/zswap/parameters/compressor
[lz4] lzo zstd
# Change compressor
echo zstd > /sys/module/zswap/parameters/compressorPerformance characteristics:
LZ4:
Compression ratio: 2.5-3× (typical)
Compression speed: 500 MB/s per core
Decompression speed: 2000 MB/s per core
CPU cost: Low
LZO:
Compression ratio: 2.2-2.8×
Compression speed: 400 MB/s per core
Decompression speed: 600 MB/s per core
CPU cost: Low
zstd (level 3):
Compression ratio: 3-4×
Compression speed: 200 MB/s per core
Decompression speed: 800 MB/s per core
CPU cost: Medium
Choosing a compressor:
For CPU-constrained systems: LZ4
- Fastest compression/decompression
- Lowest CPU overhead
- Good enough compression for most data
For memory-constrained systems: zstd
- Better compression ratio
- Worth the extra CPU cost
- Maximize RAM savings
For balanced systems: LZ4 or LZO
- Both widely used
- LZ4 slightly faster in practice
When a page fault occurs on a swapped-out page:
int do_swap_page(struct vm_fault *vmf) {
swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
struct page *page;
// Allocate page for decompressed data
page = alloc_page(GFP_KERNEL);
if (!page)
return VM_FAULT_OOM;
// Try zswap first (fast path)
if (zswap_load(entry, page)) {
// Success - page decompressed from RAM
// Latency: ~5 µs
goto install_pte;
}
// Not in zswap - must read from disk (slow path)
// Latency: ~10,000 µs (2000× slower!)
if (read_swap_cache_async(entry, page) < 0) {
free_page(page);
return VM_FAULT_SIGBUS;
}
install_pte:
// Install page in page table
vmf->pte = pte_offset_map_lock(vmf->mm, vmf->pmd,
vmf->address, &vmf->ptl);
set_pte_at(vmf->mm, vmf->address, vmf->pte,
mk_pte(page, vmf->vma->vm_page_prot));
pte_unmap_unlock(vmf->pte, vmf->ptl);
return 0;
}The latency difference is stark:
Page fault on zswap-cached page:
Allocation: 0.5 µs
Decompression: 3-5 µs (LZ4)
PTE update: 0.5 µs
Total: ~5 µs
Page fault on disk-swapped page:
Allocation: 0.5 µs
Disk I/O: 10,000 µs (SSD) or 20,000 µs (HDD)
PTE update: 0.5 µs
Total: ~10,000-20,000 µs
Speedup: 2000-4000×
Enable zswap:
# At boot (kernel parameter)
zswap.enabled=1 zswap.compressor=lz4 zswap.max_pool_percent=20
# Or at runtime
echo 1 > /sys/module/zswap/parameters/enabled
echo lz4 > /sys/module/zswap/parameters/compressor
echo 20 > /sys/module/zswap/parameters/max_pool_percentThe max_pool_percent controls how much RAM zswap can use:
System with 64 GB RAM:
max_pool_percent=20 → 12.8 GB for zswap
Effective memory capacity:
Physical RAM: 64 GB
Compressed in zswap: 12.8 GB compressed to ~40 GB (3× ratio)
Total: 64 - 12.8 + 40 = 91.2 GB effective (42% increase!)
Monitor zswap effectiveness:
cat /sys/kernel/debug/zswap/*
# Example output:
pool_total_size: 13421772800 # 12.5 GB used
stored_pages: 3276800 # ~12.8 million pages stored
pool_pages: 3276800
duplicate_entry: 0
written_back_pages: 245680 # Pages evicted to disk
reject_compress_poor: 892341 # Pages that didn't compress well
reject_kmemcache_fail: 0
reject_alloc_fail: 0Key metrics:
If written_back_pages is high, zswap cache is too small—increase max_pool_percent.
Setup: - Physical: 2× Intel Xeon Gold 6248R (24 cores/socket, 48 total), 192 GB RAM - Workload: 150 containers running microservices - Memory: Containers request 200 GB total (1.04× overcommit)
Without zswap:
Memory usage approaches 192 GB limit:
Active: 165 GB
Cached: 18 GB
Free: 9 GB
Swap begins at 90% utilization:
Swap I/O: 500 MB/s writes, 300 MB/s reads (SSD)
Container freezes: Frequent (5-10 per hour)
OOM kills: 15 per day
Application latency: p99 = 850 ms
With zswap (20% pool):
zswap configuration:
Pool size: 38.4 GB
Compressor: LZ4
Memory breakdown:
Active: 165 GB
Cached: 12 GB
zswap pool: 38.4 GB (compressed from 105 GB uncompressed)
Free: 0 GB
Effective capacity:
Physical: 192 GB
Additional via compression: 66.6 GB (105 - 38.4)
Total effective: 258.6 GB (34.7% increase)
Performance:
Swap to disk: Rare (<50 MB/s avg)
Container freezes: <1 per day
OOM kills: <1 per week
Application latency: p99 = 120 ms (7× better)
CPU overhead:
zswap CPU usage:
Compression: ~2% of total CPU
Decompression: ~1.5% of total CPU
Total: 3.5% CPU for 34% more effective memory
Trade-off analysis:
Cost: 3.5% CPU
Benefit: 34% more usable memory
Result: Worth it (CPU abundant, memory constrained)
To illustrate how these techniques combine in practice, let's examine a real-world optimization project at a cloud hosting provider. This case study synthesizes multiple strategies covered in this chapter into a cohesive optimization stack.
Initial State: - Infrastructure: 1,000 compute hosts, dual AMD EPYC 7763 (64 cores/socket) - RAM: 512 GB per host - Workload: ~50,000 containers (average 50 per host) - Problem: Memory utilization at 92%, limiting container density
Symptom Analysis:
# Per-host memory breakdown
$ free -h
total used free
Mem: 512G 471G 41G
# But actual application usage
$ cat /sys/fs/cgroup/memory/memory.stat | grep total_rss
total_rss 389424676864 # ~363 GB actual application RSS
# Where's the other 108 GB?Investigation revealed:
Memory consumers:
Application RSS: 363 GB
Page cache: 28 GB
Kernel slab: 12 GB
Page tables: 82 GB (!!)
Other kernel: 27 GB
Page table overhead was consuming 82 GB—16% of total RAM. At 1,000 hosts, that's $984,000 of capacity ($12k per server × 82/512).
Optimization Layer 1: Kernel Samepage Merging
# Enable KSM
echo 1 > /sys/kernel/mm/ksm/run
echo 300 > /sys/kernel/mm/ksm/pages_to_scan
echo 20 > /sys/kernel/mm/ksm/sleep_millisecs
# Monitor convergence
watch -n 60 'grep -E "sharing|shared|unshared" /proc/meminfo'After 6 hours (allowing full scan):
Memory deduplicated: 127 GB (35% of app RSS)
Shared pages: 32,505,856 (127 GB / 4 KB)
CPU overhead: 2.8% average
Results after KSM:
Total memory usage: 471 GB → 344 GB
Available for containers: 41 GB → 168 GB
Container density: 50 → 66 per host (+32%)
Optimization Layer 2: THP with Deferred Compaction
# Configure THP
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# Mark container heap regions for THP
# (modification to container runtime)After rollout:
THP coverage: 68% of anonymous memory
Page table overhead: 82 GB → 49 GB (40% reduction)
kcompactd CPU: 0.4%
Results after THP tuning:
Total memory usage: 344 GB → 311 GB
Page table savings: 33 GB
Container density: 66 → 75 per host (+13.6%)
Optimization Layer 3: NUMA-Aware Container Placement
# Modify container orchestrator to:
# 1. Pin containers to NUMA nodes
# 2. Allocate memory from local node
# Example: container on node 0
numactl --cpunodebind=0 --membind=0 -- container_startAfter deployment:
NUMA remote access ratio: 28% → 8%
Memory latency (p95): 145 ns → 112 ns
Application performance: +12% throughput
No direct memory savings, but applications run faster (can handle more load per container).
Optimization Layer 4: zswap
# Enable zswap
echo 1 > /sys/module/zswap/parameters/enabled
echo lz4 > /sys/module/zswap/parameters/compressor
echo 15 > /sys/module/zswap/parameters/max_pool_percentAfter stabilization:
zswap pool: 76.8 GB
Compressed from: 218 GB uncompressed
Compression ratio: 2.84×
Effective memory gain: 141.2 GB
CPU overhead: 3.2%
Results after zswap:
Effective memory capacity: 512 GB → 653 GB
Container density: 75 → 88 per host (+17.3%)
Final Results:
Optimization stack performance:
Memory utilization:
Before: 92% (471 GB / 512 GB)
After: 87% (452 GB / 653 GB effective)
Container density:
Before: 50 per host
After: 88 per host (+76%)
Performance impact:
Application throughput: +8% (NUMA + THP benefits)
P99 latency: 145 ms → 128 ms
CPU overhead: +6.4% (KSM + zswap + kcompactd)
Economic impact:
Additional containers: 38,000 (76% increase)
Hardware cost avoided: $456M (38,000 containers / 50 per host × $12k)
Recurring savings: $91M/year (power, cooling, space)
Lessons Learned:
Stack multiple optimizations: No single technique solved the problem. KSM, THP, NUMA, and zswap each contributed 13-35% improvements that multiplied.
Monitor continuously: Initially, aggressive KSM settings (pages_to_scan=500) caused CPU overhead that negated benefits. Tuning to 300 balanced overhead vs. savings.
Workload-specific tuning: Not all containers benefited equally. Web services saw big wins from KSM (similar code). Databases saw big wins from THP (large working sets). Data analytics saw big wins from zswap (intermittent memory usage).
CPU overhead is acceptable when memory-constrained: 6.4% CPU overhead for 76% density increase is excellent ROI. On CPU-constrained systems, the trade-off might differ.
Measure before optimizing: Without detailed memory accounting (via /proc/meminfo, cgroup stats, etc.), the team wouldn't have identified page tables as the bottleneck.
This chapter explored advanced page table optimizations that go well beyond the basic mechanisms covered in earlier chapters. We've seen how modern operating systems employ sophisticated techniques to reduce memory overhead, improve concurrency, and extract maximum performance from the memory management hardware.
Memory Deduplication (KSM) merges identical pages across processes and VMs, achieving 30-60% memory savings in virtualized environments. The trade-off is CPU overhead (2-5%) and potential security concerns (timing side channels). Best used for: - Virtualization with similar guest OSes - Container orchestration running replicated services - Read-heavy workloads where COW penalty is rare - Memory-constrained systems where the savings justify overhead
Page Table Locking and Concurrency addresses the fundamental challenge of safely modifying shared page tables on many-core systems. Fine-grained locking (per page table page rather than per process) and RCU-based lockless walks provide 7-15× scalability improvements on 128-core systems. Critical for: - Database servers with hundreds of threads - HPC applications with massive parallelism - Any workload where perf shows mmap_lock contention
NUMA-Aware Page Table Management prevents the 30-50% performance penalty of remote memory access by ensuring page tables reside on the same node as the accessing CPU. Requires: - Explicit NUMA binding (numactl –membind) - Awareness of process migration patterns - Monitoring with numastat and perf - Critical for 2+ socket servers
Advanced Huge Page Strategies go beyond basic THP by
addressing fragmentation, compaction latency, and selective application.
Deferred compaction eliminates 95%+ of compaction stalls while
maintaining 80-90% THP coverage. Production deployment requires: -
defer+madvise defrag mode for latency-sensitive workloads -
Explicit huge pages (HugeTLB) for known large allocations -
MADV_NOHUGEPAGE for metadata and frequently-modified
regions - Monitoring via /proc/vmstat thp_* counters
Page Table Compaction and Sharing reduces overhead by reclaiming empty page table structures and sharing PTEs for read-only mappings. While Linux doesn't fully implement these (due to complexity), they demonstrate potential for 50-90% overhead reduction in future kernels. Container density applications benefit most.
Parallel Page Table Operations enable lock-free page faults and batched TLB shootdowns, providing 7-37× speedups for operations that traditionally serialized. Essential for: - High-frequency DMA operations (RDMA, NVMe) - Applications with allocation/deallocation churn - Many-core systems (64+ cores)
Memory Compression (zswap) extends effective memory capacity by 30-50% at the cost of 3-5% CPU overhead. The 2000× latency advantage over disk swap makes it nearly invisible to applications. Use when: - Memory pressure is frequent but not severe - Workload tolerates 3-5% CPU overhead - SSD wear is a concern (compression reduces writes)
Not every optimization suits every workload. Use this decision framework:
For Virtualization/Container Platforms:
Primary: KSM (dense similar workloads)
Secondary: THP with defer (reduce page table overhead)
Tertiary: zswap (handle burst memory usage)
Expected gains: 40-70% density improvement
CPU cost: 4-7%
For HPC/Scientific Computing:
Primary: Huge pages (reduce TLB pressure)
Secondary: NUMA binding (optimize memory locality)
Tertiary: Parallel operations (scale to many cores)
Expected gains: 30-50% performance improvement
Complexity: Medium (requires application tuning)
For Databases:
Primary: Explicit huge pages for buffer cache
Secondary: THP with defer for heap
Tertiary: NUMA binding
Expected gains: 20-40% throughput improvement, 50% latency reduction
Fragmentation risk: Managed via defer mode
For Web Servers:
Primary: THP for process heap
Secondary: KSM if running many similar workers
Tertiary: zswap for burst handling
Expected gains: 15-30% request throughput
Latency impact: Minimal with proper tuning
For Real-Time Systems:
Primary: Disable THP (predictable latency)
Secondary: Explicit huge pages if needed
Tertiary: NUMA binding
Goal: Minimize variance, not maximize throughput
Sacrifice: Some performance for predictability
Enabling THP without monitoring: Set
defrag=always and wonder why latency spiked. Always use
defer or defer+madvise for production.
Ignoring NUMA on multi-socket systems: Assume memory
is uniform. Measure with numastat—you might find 40%+
remote accesses.
Using KSM in security-critical multi-tenant environments: Side channels are real. Either disable KSM or ensure tenant isolation.
Over-tuning: Adjusting 10+ tunables without understanding interactions. Start with defaults, change one variable at a time, measure.
Forgetting to account for overhead: KSM + zswap + kcompactd might consume 8% CPU. Fine if memory-bound, problematic if CPU-bound.
Before optimizing, establish baselines:
# Page table overhead
grep PageTables /proc/meminfo
# THP effectiveness
grep -E 'thp_fault|thp_collapse' /proc/vmstat
# NUMA locality
numastat -m -p $(pidof application)
# Lock contention
perf record -e lock:lock_acquire -g ./workload
perf report --sort=symbol
# Compaction activity
grep compact_ /proc/vmstatSet up continuous monitoring:
# Collect metrics every 60 seconds
while true; do
date >> metrics.log
grep -E 'PageTables|AnonHugePages' /proc/meminfo >> metrics.log
grep thp_ /proc/vmstat >> metrics.log
sleep 60
doneAlert on anomalies: - Page table overhead > 5% of total RAM - THP
fault fallback rate > 50% - Compact stalls > 1000/day with
defrag=defer - NUMA remote access > 20% on
multi-socket
The next two chapters extend our exploration beyond CPU page tables:
Chapter 10: Hardware Accelerators and External MMU Access examines how non-CPU devices (GPUs, NICs, FPGAs, DPUs) access the MMU via IOMMU, PCIe ATS, and RDMA. We'll see how these devices cache translations and handle page faults, and how multi-device coherency works.
Chapter 11: AI/ML Accelerator Memory Systems dives deep into specialized memory architectures for machine learning—Google TPU with its systolic arrays, NVIDIA's Unified Memory with automatic migration, Intel Habana's RDMA-integrated design, and Apple's unified memory approach. These systems push memory bandwidth to 3-5 TB/s and capacity to 192 GB per device.
Chapter 12: Performance Analysis & Benchmarking provides systematic methodology for measuring page table and MMU performance, profiling tools (perf, eBPF, PMU counters), and complete case studies showing the full optimization process from problem identification through validation.
The techniques in this chapter form the foundation for modern memory management optimization. Combined with hardware understanding from earlier chapters and the accelerator insights coming in the next two chapters, you'll have a complete picture of how modern systems manage virtual memory at scale.
Arcangeli, A. and Iftimie, I. "Kernel Samepage Merging (KSM)." Linux kernel documentation, 2009. https://www.kernel.org/doc/html/latest/admin-guide/mm/ksm.html
Arcangeli, A. "Transparent Huge Pages." Linux kernel documentation, 2011. https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html
Lameter, C. "NUMA (Non-Uniform Memory Access): An Overview." ACM Queue, 11(7), 2013. DOI: 10.1145/2508834.2513149
Blagodurov, S., Zhuravlev, S., and Fedorova, A. "A Case for NUMA-Aware Contention Management on Multicore Systems." 2011 USENIX Annual Technical Conference (USENIX ATC 2011). USENIX, 2011.
Gorman, M. Understanding the Linux Virtual Memory Manager. Prentice Hall, 2004. ISBN: 0-13-145348-3.
Zhao, Y. "Multi-Gen LRU." Linux kernel mailing list. Merged in Linux 6.1, 2022. https://lore.kernel.org/lkml/20220918080010.2920238-1-yuzhao@google.com/
Corbet, J. "zswap: compressed swap caching." LWN.net, September 2012. https://lwn.net/Articles/529682/
Guo, F., Bhardwaj, S., Goel, A., Eshghi, K., and Sethuraman, A. "SmartMD: A High Performance Deduplication Engine with Mixed Pages." 2017 USENIX Annual Technical Conference (USENIX ATC 2017). USENIX, 2017.
Linux Kernel Documentation. "Page Migration." https://www.kernel.org/doc/html/latest/vm/page_migration.html
Bovet, D.P. and Cesati, M. Understanding the Linux Kernel, 3rd Edition. O'Reilly Media, 2005. Chapter 17: Page Frame Reclaiming.
Silberschatz, A., Galvin, P.B., and Gagne, G. Operating System Concepts, 10th Edition. John Wiley & Sons, 2018. Chapter 10: Virtual Memory.
Corbet, J. "Compaction for unevictable pages." LWN.net, October 2012. https://lwn.net/Articles/522555/