In Chapters 3, 4, and 5, we explored the mechanisms of virtual memory: how page tables translate addresses (Chapter 3), how TLBs accelerate those translations (Chapter 4), and how IOMMUs extend protection to devices (Chapter 5). We've seen the intricate hardware structures, the multi-level translation hierarchies, and the performance optimizations that make modern virtual memory work. But we've largely treated memory protection as a footnote—a few bits in the page table entries that enable or disable certain accesses.
This chapter reveals why those bits matter more than everything else combined.
Without the protection mechanisms we'll explore here, all the sophisticated virtual memory infrastructure we've studied would be nothing more than an elaborate address mapping system. The four-level page tables? Just an overcomplicated way to map virtual to physical addresses. The TLB? A cache that makes the mapping faster. The IOMMU? Hardware that does the same mapping for devices. None of it would provide any security without the protection bits and enforcement mechanisms that are the focus of this chapter.
Consider what we learned in previous chapters:
From Chapter 3 (Page Tables): You now understand that every memory access goes through a page table entry (PTE). Each PTE contains:
From Chapter 4 (TLB): You know that the TLB caches these page table entries for performance, reducing translation time from 50-200 cycles to effectively zero for TLB hits. But the TLB doesn't just cache addresses—it caches permissions too. Every TLB entry includes the R/W/X bits, the U/S bit, and other protection flags. The hardware checks these on every single memory access.
From Chapter 5 (IOMMU): You learned that devices need the same protection as CPUs—they get their own page tables and TLBs through the IOMMU. But why? Because without protection, a malicious or buggy device could read kernel memory, overwrite page tables, or access another VM's data. The IOMMU's entire purpose is to enforce the same protection mechanisms we'll study in this chapter.
What This Chapter Adds:
This chapter transforms your understanding of virtual memory from "a mechanism for address translation" to "the foundation of system security." We'll answer questions that previous chapters took for granted:
Modern computer systems face a fundamental security challenge: how do we allow multiple programs to run simultaneously while preventing them from interfering with each other's memory?
Without the protection mechanisms in this chapter:
// Scenario 1: Any user program could do this
int *kernel_memory = (int *)0xffff8000deadbeef;
*kernel_memory = 0x90909090; // Overwrite kernel code
// Result: Instant privilege escalation
// Scenario 2: A simple bug becomes a catastrophe
char buffer[100];
gets(buffer); // Buffer overflow
// Overwrites return address → executes attacker's shellcode
// Result: Complete system compromise
// Scenario 3: Process isolation is meaningless
void *other_process = (void *)0x400000; // Another process's memory
read_password(other_process); // Read their password
// Result: No isolation between processes
Every example above would succeed if we had the virtual memory infrastructure from Chapters 3-5 but without the protection mechanisms from this chapter. The page tables would translate the addresses. The TLB would cache them. The CPU would execute the instructions. But there would be no security, no isolation, no privilege separation—just a very fast address translation mechanism protecting nothing.
The MMU enforces four fundamental security properties that make modern computing possible:
1. Isolation (Enables: Multi-process systems) Each process operates in its own address space, unable to read or modify other processes' memory. This is implemented through the U/S bit in PTEs (which we barely mentioned in Chapter 3) combined with separate page table hierarchies per process.
Chapter 3 showed you: How to walk page tables and translate addresses. This chapter shows you: Why each process has separate page tables, and what happens when a user process tries to access kernel memory (spoiler: instant #PF).
2. Privilege Separation (Enables: Operating systems) The system distinguishes between kernel code (privileged) and user code (unprivileged). This requires the CPU privilege rings/levels we'll study, combined with the U/S bit in every page table entry.
Chapter 3 mentioned: The U/S bit exists. This chapter reveals: How x86 rings, ARM exception levels, and RISC-V modes work together with page table permissions to create the kernel/user boundary. And why Meltdown broke this boundary (requiring KPTI).
3. Execute Protection (Enables: Defense against code injection) Memory pages can be marked as non-executable, preventing data (like stack buffers or heap allocations) from being executed as code. This single bit—NX/XD/XN—defeats entire classes of exploits.
Chapter 3 listed: The NX bit as one of many PTE flags. This chapter proves: Why it's the most important security bit ever added to processors, preventing 70% of buffer overflow exploits at zero performance cost.
4. Access Control (Enables: Fine-grained security) Fine-grained control over read, write, and execute permissions enables sophisticated security policies, from W^X (write XOR execute) to protection keys (fast domain switching) to memory tagging (hardware memory safety).
Previous chapters assumed: Permission bits are checked somehow. This chapter details: Exactly how hardware checks permissions on every access, what happens when checks fail, and how modern features (SMEP, SMAP, MTE, PKU) add defense-in-depth.
Let's make the connections explicit:
Building on Chapter 3 (Page Tables):
Building on Chapter 4 (TLB):
Building on Chapter 5 (IOMMU):
Meltdown (CVE-2017-5754):
; Speculative execution attack (simplified)
mov rax, [kernel_address] ; Speculatively reads kernel memory
mov rbx, [rax * 4096] ; Uses data to access different cache line
; Even though the first instruction faults, cache timing reveals the data!
This exploit worked because:
The fix (KPTI) required fundamentally rethinking how we use page tables—maintaining two sets per process, switching on every system call. This is a pure Chapter 6 concept: using page table structure (Chapter 3) for security (Chapter 6), accepting the TLB flush cost (Chapter 4).
Pre-IOMMU DMA Attacks:
Before IOMMUs existed, a $20 malicious USB device could:
IOMMUs (Chapter 5) solve this by enforcing the same protections (Chapter 6) that the CPU uses. Without Chapter 6's protection model, Chapter 5's IOMMU would be pointless.
We'll explore memory protection mechanisms across three major architectures (x86-64, ARM64, RISC-V) and modern heterogeneous systems:
Basic Protection (Sections 6.2-6.5):
Advanced Features (Sections 6.6-6.9):
Confidential Computing (Sections 6.10-6.13):
Modern Systems (Sections 6.14-6.15):
Synthesis (Sections 6.16-6.19):
If you remember only one thing from this book, remember this: Virtual memory is not about address translation. It's about protection.
The clever four-level page table design from Chapter 3? Its real purpose is enabling fine-grained protection at multiple granularities while keeping page table memory reasonable.
The complex TLB from Chapter 4? It exists primarily because checking permissions on every memory access would be impossibly slow without caching.
The expensive IOMMU from Chapter 5? It replicates all the CPU protection mechanisms we'll study in this chapter because devices are just as dangerous as malicious programs.
Everything we've studied so far was building toward this chapter. Now let's understand why it all matters.
Consider what could happen without memory protection:
Scenario 1: Malicious Program
// Without protection, any program could do this:
int *kernel_memory = (int *)0xffff8000deadbeef;
*kernel_memory = 0x90909090; // Overwrite kernel code with NOPs
Scenario 2: Buggy Program
// A simple bug becomes a security disaster:
char buffer[100];
gets(buffer); // Buffer overflow
// Without protection, overflow corrupts neighboring process memory
Scenario 3: Privilege Escalation
// User process tries to access kernel data:
struct cred *cred = current_cred(); // Get kernel credential struct
cred->uid = 0; // Make myself root!
Without hardware-enforced memory protection, any of these scenarios would succeed, making multi-user systems impossible and single-user systems dangerously fragile.
The MMU enforces four fundamental security properties:
1. Isolation Each process operates in its own address space, unable to read or modify other processes' memory. This is the foundation of modern operating systems.
2. Privilege Separation The system distinguishes between kernel code (privileged) and user code (unprivileged), preventing user programs from directly accessing kernel memory or executing privileged instructions.
3. Execute Protection Memory pages can be marked as non-executable, preventing data (like stack buffers or heap allocations) from being executed as code. This defeats entire classes of exploits.
4. Access Control Fine-grained control over read, write, and execute permissions enables sophisticated security policies, from simple W^X (write XOR execute) to complex protection domains.
Buffer Overflow (Pre-NX Era)
// Vulnerable code:
void process_input(char *input) {
char buffer[256];
strcpy(buffer, input); // No bounds checking!
// Rest of function...
}
// Attack: Input contains 256 bytes of shellcode + return address overwrite
// Without NX: Shellcode executes, attacker gains control
// With NX: Program crashes, attack fails
The introduction of the No-Execute (NX) bit in the early 2000s made this classic attack much harder—but only if the OS and hardware enforce it.
Meltdown (2018)
; Speculative execution attack (simplified)
mov rax, [kernel_address] ; Speculatively reads kernel memory
mov rbx, [rax * 4096] ; Uses data to access different cache line
; Even though the first instruction faults, cache timing reveals the data!
Meltdown exploited speculative execution to bypass privilege checks, reading arbitrary kernel memory from user space. The mitigation, KPTI (Kernel Page-Table Isolation), carries a 5-30% performance penalty depending on workload.
DMA Attack (Pre-IOMMU Era)
// Malicious device or Thunderbolt peripheral:
// 1. Device uses DMA to write to physical memory
// 2. No IOMMU means device can access any physical address
// 3. Overwrite page tables, kernel code, or credentials
// Result: Complete system compromise
This is why Chapter 5's IOMMU is critical—devices need the same isolation as user processes.
Memory protection isn't free. Every security feature has a cost:
| Feature | Security Benefit | Performance Cost | Always Enable? |
|---|---|---|---|
| NX bit | High (prevents code injection) | ~0% | ✅ Yes |
| SMEP/SMAP | High (kernel hardening) | <1% | ✅ Yes |
| KPTI | High (Meltdown mitigation) | 5-30% | ⚠️ If vulnerable |
| Protection Keys | Medium (fast isolation) | <1% | ✅ When available |
| Memory Tagging | High (memory safety) | 5-15% | 🤔 Depends on workload |
| Confidential Compute | Very High (VM isolation) | 1-5% | 🤔 Multi-tenant only |
The art of system security is maximizing protection while minimizing overhead.
Each architecture approaches memory protection differently:
x86-64: Evolutionary Intel and AMD have added security features incrementally over decades:
Result: Comprehensive but complex, with many features for backward compatibility.
ARM64: Clean-Sheet Design ARM designed ARMv8-A (2011) with modern security in mind:
Result: More coherent, but less mature ecosystem.
RISC-V: Minimal and Extensible RISC-V (2010s) embraces modularity:
Result: Elegant but security ecosystem still developing.
By the end of this chapter, you'll understand how modern systems enforce memory protection, from simple permission bits to encrypted confidential computing.
Every page table entry (PTE) contains permission bits that control how memory can be accessed. These bits are checked by the MMU on every memory access, making them the first line of defense in system security.
Modern architectures provide three fundamental permissions:
Read (R): Can the page be read?
Write (W): Can the page be written?
Execute (X): Can the page be executed as code?
However, different architectures encode these permissions differently:
x86-64 uses negative logic for some bits in the PTE:
Bit 63: XD (Execute Disable) - 1 = NOT executable
Bit 2: U/S (User/Supervisor) - 0 = Supervisor, 1 = User
Bit 1: R/W (Read/Write) - 0 = Read-only, 1 = Read-Write
Bit 0: P (Present) - 0 = Not present, 1 = Present
x86-64 PTE Layout (64 bits):
┌─────┬────┬─────────────────────────────────────┬─────────────────────┐
│ XD │ ...│ Physical Address (bits 51-12) │ Flags (bits 11-0) │
│(63) │ │ │ │
└─────┴────┴─────────────────────────────────────┴─────────────────────┘
Flags:
Bit 0: P (Present)
Bit 1: R/W (Read/Write)
Bit 2: U/S (User/Supervisor)
Bit 3: PWT (Page Write-Through)
Bit 4: PCD (Page Cache Disable)
Bit 5: A (Accessed)
Bit 6: D (Dirty)
Bit 7: PAT or PS (Page Size)
Bit 8: G (Global)
Bits 9-11: Available for OS use
Permission Combinations:
| R/W | U/S | XD | Access Rights |
|---|---|---|---|
| 0 | 0 | 0 | Supervisor Read/Execute only |
| 0 | 0 | 1 | Supervisor Read only |
| 1 | 0 | 0 | Supervisor Read/Write/Execute |
| 1 | 0 | 1 | Supervisor Read/Write only |
| 0 | 1 | 0 | User Read/Execute only |
| 0 | 1 | 1 | User Read only |
| 1 | 1 | 0 | User Read/Write/Execute |
| 1 | 1 | 1 | User Read/Write only |
Note: On x86-64, there's no way to have write-only pages (writes imply reads).
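As a quick sketch of how software can interpret these combinations, the hypothetical helper below (our own name and output format) decodes the protection-relevant x86-64 PTE bits into a four-character string: user/supervisor, read, write, execute:

```c
#include <stdint.h>
#include <stdio.h>

// Decode x86-64 PTE protection bits: P (bit 0), R/W (bit 1),
// U/S (bit 2), XD (bit 63). Output like "urw-" or "sr-x".
void describe_pte(uint64_t pte, char out[5]) {
    if (!(pte & 1)) {            // Present bit clear
        snprintf(out, 5, "----");
        return;
    }
    out[0] = (pte & (1ULL << 2)) ? 'u' : 's';   // U/S
    out[1] = 'r';                               // present implies readable
    out[2] = (pte & (1ULL << 1)) ? 'w' : '-';   // R/W
    out[3] = (pte & (1ULL << 63)) ? '-' : 'x';  // XD (negative logic)
    out[4] = '\0';
}
```

For example, a user data PTE (P, R/W, U/S, and XD set) decodes to "urw-", while a bare kernel code PTE (only P set) decodes to "sr-x".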
ARM64 uses positive logic with explicit AP (Access Permission) bits:
Bits 7-6: AP[2:1] (Access Permissions)
Bit 54: XN (Execute Never) - 1 = NOT executable
Bit 53: PXN (Privileged Execute Never)
Bit 10: AF (Access Flag)
ARM64 Descriptor Format (Level 3):
┌────┬────┬────────────────────────────────────┬─────────────────────┐
│ XN │PXN │ Physical Address (bits 47-12) │ Attributes │
│(54)│(53)│ │ │
└────┴────┴────────────────────────────────────┴─────────────────────┘
Attributes:
Bits 1-0: Type (0b11 = Page)
Bits 4-2: AttrIndx (memory type)
Bit 5: NS (Non-secure)
Bits 7-6: AP[2:1] (Access Permissions)
Bits 9-8: SH (Shareability)
Bit 10: AF (Access Flag)
Bit 11: nG (not Global)
AP Bits Encoding:
| AP[2:1] | EL0 Access | EL1 Access | Typical Use |
|---|---|---|---|
| 00 | None | Read/Write | Kernel data |
| 01 | Read/Write | Read/Write | Shared memory |
| 10 | None | Read-only | Kernel code |
| 11 | Read-only | Read-only | Shared read-only |
Execute Control:
This gives ARM more flexible execute control than x86-64!
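That split is easy to model. A minimal sketch of the execute decision under the XN/PXN scheme above (the helper name is ours; real hardware also consults the AP bits and hierarchical table attributes):

```c
#include <stdbool.h>
#include <stdint.h>

#define DESC_XN  (1ULL << 54)  // Execute Never for EL0
#define DESC_PXN (1ULL << 53)  // Privileged Execute Never for EL1

// Can this descriptor's page be executed at the given exception level?
bool can_execute(uint64_t desc, bool at_el0) {
    if (at_el0)
        return !(desc & DESC_XN);   // XN blocks user execution
    return !(desc & DESC_PXN);      // PXN blocks kernel execution
}
```

A page with only PXN set is executable by user code but not by the kernel, a combination x86-64's single XD bit cannot express.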
RISC-V uses the simplest and most direct encoding:
Bit 3: X (Execute)
Bit 2: W (Write)
Bit 1: R (Read)
Bit 0: V (Valid)
RISC-V PTE Layout (Sv39/Sv48):
┌────────────────────────────────────────────┬─────────────────────┐
│ Physical Page Number (PPN) │ Flags (bits 7-0) │
│ │ │
└────────────────────────────────────────────┴─────────────────────┘
Flags:
Bit 0: V (Valid)
Bit 1: R (Read)
Bit 2: W (Write)
Bit 3: X (Execute)
Bit 4: U (User accessible)
Bit 5: G (Global)
Bit 6: A (Accessed)
Bit 7: D (Dirty)
Permission Combinations:
| R | W | X | Meaning |
|---|---|---|---|
| 0 | 0 | 0 | Pointer to next level |
| 0 | 0 | 1 | Execute-only |
| 0 | 1 | 0 | ⚠️ Reserved (W without R) |
| 0 | 1 | 1 | ⚠️ Reserved (W without R) |
| 1 | 0 | 0 | Read-only |
| 1 | 0 | 1 | Read + Execute |
| 1 | 1 | 0 | Read + Write |
| 1 | 1 | 1 | Read + Write + Execute |
Note: RISC-V allows execute-only pages (X=1, R=0, W=0), which x86-64 and ARM64 don't support!
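Since the W-without-R encodings are reserved, code that constructs PTEs should reject them. A small validity check, sketched against this section's bit layout (the helper name is ours):

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_R 0x2
#define PTE_W 0x4
#define PTE_X 0x8

// RISC-V: writable pages must also be readable; W=1 with R=0 is reserved.
bool rwx_combo_valid(uint64_t pte) {
    if ((pte & PTE_W) && !(pte & PTE_R))
        return false;  // reserved encoding
    return true;
}
```

Note that execute-only (X set, R and W clear) passes the check: it is a legal, if unusual, RISC-V encoding.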
| Feature | x86-64 | ARM64 | RISC-V |
|---|---|---|---|
| **Read permission** | Implicit with Present | AP bits | R bit |
| **Write permission** | R/W bit | AP bits | W bit |
| **Execute permission** | XD bit (negative) | XN/PXN bits | X bit |
| **User/Kernel** | U/S bit | AP bits | U bit |
| **Execute-only** | ❌ No | ❌ No | ✅ Yes |
| **Write-only** | ❌ No | ❌ No | ❌ No |
The Present (P) bit (called Valid on RISC-V and ARM64) determines whether a page is currently mapped:
Present = 0: Page not in memory
Present = 1: Page is mapped
x86-64 Example:
// Check if page is present
bool is_present(uint64_t pte) {
return (pte & 0x1); // Bit 0
}
// Get physical address (assumes present)
uint64_t get_phys_addr(uint64_t pte) {
return pte & 0x000FFFFFFFFFF000ULL; // Bits 51-12
}
These bits help the OS manage memory:
Accessed (A) bit: Set by hardware when page is read or written
Dirty (D) bit: Set by hardware when page is written
Example: Linux Page Replacement
// Simplified Linux page aging
void age_pages(void) {
for (each page in system) {
if (pte->accessed) {
page->age++; // Page was used recently
pte->accessed = 0; // Clear for next period
} else {
page->age--; // Page not used
}
if (page->age < threshold && !pte->dirty) {
evict_page(page); // Can discard clean unused pages
}
}
}
This bit determines privilege level access:
x86-64 U/S Bit:
ARM64 AP Bits:
RISC-V U Bit:
Security Example:
// Kernel page table entry (x86-64)
uint64_t kernel_pte =
phys_addr | // Physical address
(1 << 0) | // Present
(1 << 1) | // Read/Write
(0 << 2); // Supervisor only (U/S = 0)
// User page table entry
uint64_t user_pte =
phys_addr | // Physical address
(1 << 0) | // Present
(1 << 1) | // Read/Write
(1 << 2) | // User accessible (U/S = 1)
(1ULL << 63); // No execute (XD = 1)
x86-64: Creating PTEs with Different Permissions
#include <stdint.h>
// PTE bit definitions
#define PTE_P 0x001 // Present
#define PTE_W 0x002 // Writable
#define PTE_U 0x004 // User
#define PTE_A 0x020 // Accessed
#define PTE_D 0x040 // Dirty
#define PTE_NX (1ULL << 63) // No Execute
// Create kernel code page (R-X, supervisor only)
uint64_t make_kernel_code_pte(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P; // Present, Read-only, Supervisor, Execute
}
// Create kernel data page (RW-, supervisor only)
uint64_t make_kernel_data_pte(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | PTE_W | PTE_NX; // Present, Read-Write, No Execute
}
// Create user code page (R-X, user accessible)
uint64_t make_user_code_pte(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | PTE_U; // Present, User, Execute
}
// Create user data page (RW-, user accessible)
uint64_t make_user_data_pte(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | PTE_W | PTE_U | PTE_NX; // Present, Write, User, No Execute
}
// Check permissions
bool is_writable(uint64_t pte) {
return pte & PTE_W;
}
bool is_user_accessible(uint64_t pte) {
return pte & PTE_U;
}
bool is_executable(uint64_t pte) {
return !(pte & PTE_NX);
}
ARM64: Creating Descriptors with Different Permissions
// ARM64 descriptor bits
#define DESC_VALID 0x001
#define DESC_TABLE 0x003
#define DESC_AF 0x400 // Access flag (bit 10)
// AP bits (bits 7-6)
#define AP_KERNEL_RW (0 << 6) // EL0: none, EL1: RW
#define AP_USER_RW (1 << 6) // EL0: RW, EL1: RW
#define AP_KERNEL_RO (2 << 6) // EL0: none, EL1: RO
#define AP_USER_RO (3 << 6) // EL0: RO, EL1: RO
#define DESC_XN (1ULL << 54) // Execute Never (EL0)
#define DESC_PXN (1ULL << 53) // Privileged Execute Never (EL1)
// Create kernel code descriptor (R-X, EL1 only)
uint64_t make_kernel_code_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_KERNEL_RO | DESC_XN;
}
// Create kernel data descriptor (RW-, EL1 only)
uint64_t make_kernel_data_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_KERNEL_RW | DESC_XN | DESC_PXN;
}
// Create user code descriptor (R-X, EL0/EL1)
uint64_t make_user_code_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_USER_RO | DESC_PXN;
}
// Create user data descriptor (RW-, EL0/EL1)
uint64_t make_user_data_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_USER_RW | DESC_XN | DESC_PXN;
}
RISC-V: Creating PTEs with Different Permissions
// RISC-V PTE bits
#define PTE_V 0x001 // Valid
#define PTE_R 0x002 // Read
#define PTE_W 0x004 // Write
#define PTE_X 0x008 // Execute
#define PTE_U 0x010 // User
#define PTE_G 0x020 // Global
#define PTE_A 0x040 // Accessed
#define PTE_D 0x080 // Dirty
// Create kernel code PTE (R-X, supervisor only)
uint64_t make_kernel_code_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_R | PTE_X;
}
// Create kernel data PTE (RW-, supervisor only)
uint64_t make_kernel_data_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_R | PTE_W;
}
// Create user code PTE (R-X, user accessible)
uint64_t make_user_code_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_R | PTE_X | PTE_U;
}
// Create user data PTE (RW-, user accessible)
uint64_t make_user_data_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_R | PTE_W | PTE_U;
}
// RISC-V supports execute-only pages!
uint64_t make_execute_only_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_X; // Only Execute bit set
}
When the MMU translates a virtual address, it checks permissions:
Hardware Permission Check (Pseudo-code):
function check_permissions(pte, access_type, privilege_level):
// Check if page is present/valid
if not pte.present:
raise PAGE_FAULT
// Check privilege level
if pte.user_accessible:
// User page: both user and supervisor can access
allowed = true
else:
// Supervisor page: only supervisor can access
if privilege_level == USER:
raise PAGE_FAULT (protection violation)
allowed = true
// Check access type
if access_type == READ:
// Reads always allowed if present (on x86/ARM)
// RISC-V checks R bit explicitly
if RISC_V and not pte.R:
raise PAGE_FAULT
elif access_type == WRITE:
if not pte.writable:
raise PAGE_FAULT (write protection)
elif access_type == EXECUTE:
if pte.execute_disable: // XD/XN bit
raise PAGE_FAULT (execute protection)
// Set accessed/dirty bits
pte.accessed = 1
if access_type == WRITE:
pte.dirty = 1
return ALLOWED
Permission checking happens on every memory access that misses the TLB:
TLB Hit: Permissions cached with translation (no extra cycles)
TLB Miss, PWC Hit: Permission check during the walk (~10-30 cycles)
TLB Miss, Full Walk: Permission check at each level (~50-200 cycles)
Key Insight: High TLB hit rates (>99%) make permission checking essentially free!
The No-Execute bit is one of the most important security features in modern processors. By preventing code execution from data pages, it defeats entire classes of exploits that have plagued computer security for decades.
Before the NX bit, x86 had no per-page execute permission: any readable page, including the writable stack and heap, could be executed as code. This created a fundamental vulnerability:
Classic Buffer Overflow Exploit:
// Vulnerable function (pre-NX era)
void process_request(char *user_input) {
char buffer[256];
strcpy(buffer, user_input); // No bounds checking!
// If user_input > 256 bytes, overflow corrupts stack
}
// Stack layout:
// ┌──────────────┐ High addresses
// │ Return addr │ ← Attacker overwrites this!
// ├──────────────┤
// │ Saved EBP │
// ├──────────────┤
// │ │
// │ buffer[256] │ ← Overflow starts here
// │ │
// └──────────────┘ Low addresses
// Attack payload:
// [256 bytes of shellcode] + [overwritten return address pointing to shellcode]
Without NX:
With NX:
Each architecture implements execute protection slightly differently:
x86-64: XD (Execute Disable) Bit
Bit 63 of PTE: XD (Execute Disable)
XD = 0: Page is executable
XD = 1: Page is NOT executable (execution causes #PF)
Must be enabled in EFER.NXE (Extended Feature Enable Register)
Checking XD Support:
#include <cpuid.h>
bool has_nx_support(void) {
unsigned int eax, ebx, ecx, edx;
// CPUID function 0x80000001 (Extended Features)
__cpuid(0x80000001, eax, ebx, ecx, edx);
// Bit 20 of EDX = NX support
return (edx & (1 << 20)) != 0;
}
void enable_nx(void) {
    // Set EFER.NXE (bit 11). rdmsr/wrmsr move the MSR value in EDX:EAX,
    // so read and write the two 32-bit halves explicitly.
    uint32_t lo, hi;
    asm volatile(
        "rdmsr"
        : "=a" (lo), "=d" (hi)
        : "c" (0xC0000080)      // EFER MSR
    );
    lo |= (1u << 11);           // NXE bit is in the low half
    asm volatile(
        "wrmsr"
        :
        : "c" (0xC0000080), "a" (lo), "d" (hi)
    );
}
ARM64: XN (Execute Never) Bits
Bit 54: XN (Execute Never for EL0; the ARM manuals call it UXN)
Bit 53: PXN (Privileged Execute Never for EL1)
More flexible than x86: separate control for user/kernel!
RISC-V: X (Execute) Bit
Bit 3: X (Execute permission)
X = 0: NOT executable
X = 1: Executable
Simple positive logic (unlike x86's negative logic)
Example 1: Stack Buffer Overflow
// Before NX: Exploitable
void vulnerable_function(char *input) {
char buffer[64];
strcpy(buffer, input); // Overflow possible
// Attacker can execute shellcode from stack
}
// After NX: Attack fails
void same_function(char *input) {
char buffer[64];
strcpy(buffer, input); // Still overflows
// But stack is marked NX, execution fails!
// Program crashes instead of being compromised
}
Memory Layout with NX:
┌─────────────────┬────────┬────────┐
│ Memory Region │ Perms │ XD/XN │
├─────────────────┼────────┼────────┤
│ .text (code) │ R-X │ No │ ← Executable
│ .rodata (const) │ R-- │ Yes │ ← Read-only
│ .data (globals) │ RW- │ Yes │ ← Data only
│ .bss (zeros) │ RW- │ Yes │ ← Data only
│ Heap │ RW- │ Yes │ ← Data only
│ Stack │ RW- │ Yes │ ← Data only
│ Shared libs │ R-X │ No │ ← Executable
└─────────────────┴────────┴────────┘
Result: Only code sections are executable. Data sections (heap, stack) cannot execute.
Modern systems enforce a stronger policy: no page can be both writable AND executable.
W^X Implementation:
// Legal combinations:
// R-X: Code pages (read and execute, but not write)
// RW-: Data pages (read and write, but not execute)
// R--: Read-only data
// ---: Inaccessible (guard pages)
// ILLEGAL combination:
// RWX: No page should be both writable AND executable!
Why W^X matters:
Without W^X:
// An attacker can:
// 1. Allocate RWX memory
// 2. Write shellcode to it
// 3. Execute the shellcode
// Many JIT compilers did exactly this!
With W^X:
// A JIT compiler must:
// 1. Allocate RW- memory
// 2. Write the JIT-compiled code
// 3. Change it to R-X (using mprotect)
// 4. Execute the code
// 5. The code cannot be modified while it is executable!
Linux W^X Enforcement:
// mmap with RWX fails on W^X systems
void *rwx = mmap(NULL, size,
PROT_READ | PROT_WRITE | PROT_EXEC, // Rejected!
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// Returns MAP_FAILED on systems with strict W^X
// Correct approach:
void *rw = mmap(NULL, size,
PROT_READ | PROT_WRITE, // Initially RW
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// ... write code ...
mprotect(rw, size, PROT_READ | PROT_EXEC); // Change to RX
NX defeated code injection, but attackers adapted with Return-Oriented Programming:
ROP Concept:
Instead of injecting new code, reuse existing code!
1. Find "gadgets" in existing executable memory:
pop rdi; ret
pop rsi; ret
mov [rdi], rsi; ret
2. Chain gadgets by controlling stack:
[address of gadget 1]
[data for gadget 1]
[address of gadget 2]
[data for gadget 2]
...
3. Each gadget executes then returns to next gadget
4. Build complete exploit from existing code snippets!
NX is not enough! Modern systems need additional defenses:
x86-64: Setting XD Bit
#define PTE_P 0x001
#define PTE_W 0x002
#define PTE_U 0x004
#define PTE_NX (1ULL << 63)
// Create non-executable data page
uint64_t create_data_page(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | PTE_W | PTE_U | PTE_NX; // Writable but not executable
}
// Create executable code page
uint64_t create_code_page(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | PTE_U; // Readable and executable, but not writable
// Note: XD bit NOT set = executable
}
// Typical OS memory mapping
void setup_memory_protections(void) {
// Text segment: R-X
for (each page in .text) {
pte = create_code_page(page->phys_addr);
}
// Data segment: RW-
for (each page in .data, .bss) {
pte = create_data_page(page->phys_addr);
}
// Stack: RW-
for (each page in stack) {
pte = create_data_page(page->phys_addr);
}
// Heap: RW-
for (each page in heap) {
pte = create_data_page(page->phys_addr);
}
}
ARM64: Setting XN/PXN Bits
#define DESC_XN (1ULL << 54) // Execute Never (EL0)
#define DESC_PXN (1ULL << 53) // Privileged Execute Never (EL1)
// User data page: NOT executable by user or kernel
uint64_t create_user_data_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_USER_RW |
DESC_XN | DESC_PXN; // Both XN and PXN set
}
// User code page: Executable by user, NOT by kernel
uint64_t create_user_code_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_USER_RO |
DESC_PXN; // PXN set (kernel can't execute), XN clear (user can)
}
// Kernel code page: Executable by kernel only
uint64_t create_kernel_code_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_KERNEL_RO |
DESC_XN; // XN set (user can't execute), PXN clear (kernel can)
}
RISC-V: Setting X Bit
#define PTE_V 0x001
#define PTE_R 0x002
#define PTE_W 0x004
#define PTE_X 0x008
#define PTE_U 0x010
// Non-executable data page
uint64_t create_data_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_R | PTE_W | PTE_U;
// Note: X bit NOT set = not executable
}
// Executable code page
uint64_t create_code_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_R | PTE_X | PTE_U;
// Note: W bit NOT set = not writable (W^X policy)
}
Linux DEP (Data Execution Prevention):
// Check whether an ELF executable requests a non-executable stack
#include <elf.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

bool has_nx_protection(const char *filename) {
    int fd = open(filename, O_RDONLY);
    if (fd < 0) return false;
    Elf64_Ehdr ehdr;
    bool nx = true;  // no PT_GNU_STACK segment: NX enabled by default
    if (read(fd, &ehdr, sizeof(ehdr)) == sizeof(ehdr)) {
        for (int i = 0; i < ehdr.e_phnum; i++) {
            Elf64_Phdr phdr;
            if (pread(fd, &phdr, sizeof(phdr),
                      ehdr.e_phoff + (off_t)i * ehdr.e_phentsize) != sizeof(phdr))
                break;
            if (phdr.p_type == PT_GNU_STACK) {
                nx = !(phdr.p_flags & PF_X);  // NX if the stack lacks PF_X
                break;
            }
        }
    }
    close(fd);
    return nx;
}
Windows DEP:
// Enable DEP for process (Windows API)
#include <windows.h>
void enable_dep(void) {
DWORD flags = PROCESS_DEP_ENABLE; // Enable DEP
SetProcessDEPPolicy(flags);
}
// Check if DEP is enabled system-wide
bool is_dep_enabled(void) {
BOOL permanent;
DWORD flags;
GetProcessDEPPolicy(GetCurrentProcess(), &flags, &permanent);
return (flags & PROCESS_DEP_ENABLE) != 0;
}
Overhead: Essentially zero on modern processors!
Why NX is Free:
Benchmark: NX Performance Impact
// Test: Execute 1 million function calls
// Without NX: 0.523 seconds
// With NX: 0.524 seconds
// Overhead: ~0.2% (within measurement error)
#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

extern void test_function(void);  // any small function under test

uint64_t benchmark_nx_overhead(void) {
    const int iterations = 1000000;
    // Warm up caches and TLB
    for (int i = 0; i < 1000; i++) {
        test_function();
    }
    uint64_t start = rdtsc();
    for (int i = 0; i < iterations; i++) {
        test_function();
    }
    uint64_t end = rdtsc();
    return (end - start) / iterations;
    // Result: ~500 cycles/call (no measurable NX overhead)
}
Studies show NX prevents:
Attack Prevention Timeline:
Bottom Line: NX/XD/XN is the single most effective hardware security feature ever added to processors. Every modern system should have it enabled—and nearly all do by default.
Modern processors provide multiple privilege levels to separate trusted code (kernel) from untrusted code (applications). Each architecture implements this differently, but the goal is the same: prevent user code from directly accessing privileged resources.
Intel's x86 architecture provides four privilege levels called rings:
Why Four Rings?
Original intent (1980s):
Modern reality:
Current Privilege Level (CPL):
The CPL is stored in CS register (Code Segment) bits 0-1:
// Read current privilege level
static inline int get_cpl(void) {
uint16_t cs;
asm volatile("mov %%cs, %0" : "=r" (cs));
return cs & 0x3; // Bits 0-1
}
// Example usage
void check_privilege(void) {
int cpl = get_cpl();
if (cpl == 0) {
printf("Running in Ring 0 (kernel mode)\n");
} else if (cpl == 3) {
printf("Running in Ring 3 (user mode)\n");
}
}
Descriptor Privilege Level (DPL):
Every memory segment and gate has a Descriptor Privilege Level:
// Segment descriptor (simplified)
struct segment_descriptor {
uint16_t limit_low;
uint16_t base_low;
uint8_t base_mid;
uint8_t access; // Contains DPL (bits 5-6)
uint8_t granularity;
uint8_t base_high;
};
// Extract DPL from descriptor
int get_dpl(struct segment_descriptor *desc) {
return (desc->access >> 5) & 0x3; // Bits 5-6
}
// Access check rule:
// CPL <= DPL to access segment (numerically lower = more privileged)
bool can_access_segment(int cpl, int dpl) {
return cpl <= dpl; // Ring 0 can access Ring 0-3
// Ring 3 can only access Ring 3
}
Privilege Transitions:
User → Kernel (Ring 3 → Ring 0):
// System call from user space
// User executes: syscall instruction (x86-64)
asm volatile(
"movq $1, %%rax\n" // System call number (SYS_write = 1 on x86-64)
"movq $1, %%rdi\n" // File descriptor (stdout)
"movq $msg, %%rsi\n" // Buffer pointer
"movq $len, %%rdx\n" // Length
"syscall\n" // Triggers Ring 3 → Ring 0
::: "rax", "rdi", "rsi", "rdx", "rcx", "r11"
);
// On SYSCALL, the CPU hardware:
// 1. Saves user RIP in RCX and RFLAGS in R11
// 2. Loads kernel RIP from MSR IA32_LSTAR
// 3. Masks RFLAGS using MSR IA32_FMASK
// 4. Changes CPL from 3 to 0
// 5. Jumps to the kernel syscall handler
// (the handler itself switches to a kernel stack, e.g. via SWAPGS;
// SYSCALL does not save or switch RSP)
Kernel → User (Ring 0 → Ring 3):
// Return from system call
// Kernel executes: sysretq instruction
void syscall_return(void) {
// CPU hardware does:
// 1. Restore user RIP from RCX and RFLAGS from R11
// 2. Change CPL from 0 to 3
// 3. Resume user code (software restores the user RSP beforehand)
asm volatile("sysretq"); // Returns to Ring 3
}
Privileged Instructions:
Some instructions are Ring 0 only:
// Examples of privileged instructions (Ring 0 only):
// 1. Load CR3 (page table base register)
static inline void load_cr3(uint64_t cr3) {
asm volatile("mov %0, %%cr3" :: "r" (cr3) : "memory");
// Ring 3: #GP (General Protection Fault)
}
// 2. HLT (halt processor)
static inline void halt(void) {
asm volatile("hlt");
// Ring 3: #GP fault
}
// 3. LGDT (load GDT)
static inline void load_gdt(void *gdt_ptr) {
asm volatile("lgdt (%0)" :: "r" (gdt_ptr));
// Ring 3: #GP fault
}
// 4. IN/OUT (I/O port access)
static inline uint8_t inb(uint16_t port) {
uint8_t value;
asm volatile("inb %1, %0" : "=a" (value) : "Nd" (port));
return value;
// Ring 3: #GP fault (unless IOPL allows)
}
ARM64 provides a cleaner privilege model with 4 Exception Levels (EL0-EL3):
Advantages over x86 Rings:
Reading Current Exception Level:
// ARM64: Read current exception level
// (Note: CurrentEL is not readable at EL0; executing this MRS in
// user code raises an undefined-instruction exception)
static inline uint64_t get_current_el(void) {
uint64_t el;
asm volatile("mrs %0, CurrentEL" : "=r" (el));
return (el >> 2) & 0x3; // Bits 2-3
}
void check_exception_level(void) {
uint64_t el = get_current_el();
switch (el) {
case 0: printf("EL0: User mode\n"); break;
case 1: printf("EL1: Kernel mode\n"); break;
case 2: printf("EL2: Hypervisor mode\n"); break;
case 3: printf("EL3: Secure monitor\n"); break;
}
}
Exception Level Transitions:
// EL0 → EL1 (System call)
// User executes: SVC #0 (Supervisor Call)
void user_syscall(void) {
asm volatile("svc #0"); // Triggers exception to EL1
// CPU saves: PC, SPSR (Saved Program Status Register)
// CPU loads: Vector table entry for EL1
// Changes: EL0 → EL1
}
// EL1 → EL0 (Return from exception)
// Kernel executes: ERET (Exception Return)
void kernel_return(void) {
asm volatile("eret"); // Returns to EL0
// CPU restores: PC, SPSR
// Changes: EL1 → EL0
}
// EL1 → EL2 (Hypervisor call)
void kernel_to_hypervisor(void) {
asm volatile("hvc #0"); // Hypervisor Call
// Changes: EL1 → EL2
}
// EL1 → EL3 (Secure monitor call)
void kernel_to_secure_monitor(void) {
asm volatile("smc #0"); // Secure Monitor Call
// Changes: EL1 → EL3 (enters Secure World)
}
ARM64 System Registers:
Different system registers accessible at each EL:
// EL0: Limited access
// - Can read: Counter registers, feature ID registers
// - Cannot: Modify page tables, access devices
// EL1: Kernel access
// - TTBR0_EL1/TTBR1_EL1: Page table base registers
// - SCTLR_EL1: System control register
// - VBAR_EL1: Vector base address register
static inline void set_ttbr0_el1(uint64_t ttbr) {
asm volatile("msr ttbr0_el1, %0" :: "r" (ttbr));
// EL0: Undefined instruction exception
}
// EL2: Hypervisor access
// - All EL1 registers
// - TTBR0_EL2: Hypervisor's own Stage 1 page table base
// - VTTBR_EL2: Stage 2 (guest-physical → physical) page table base
// - HCR_EL2: Hypervisor configuration register
// EL3: Secure monitor access
// - All registers from all levels
// - SCR_EL3: Secure configuration register
RISC-V provides the simplest privilege model:
Determining the Privilege Mode:
// RISC-V deliberately provides no CSR that reports the current
// privilege mode. Code is expected to know its mode by construction:
// firmware runs in M-mode, the kernel in S-mode, applications in
// U-mode. Attempting to read a higher-privilege CSR does not fail
// gracefully; it raises an illegal-instruction exception, which the
// more-privileged trap handler can catch to infer the caller's mode.
//
// What software CAN read after a trap is the PREVIOUS privilege
// mode, from mstatus.MPP (bits 12-11) in M-mode or sstatus.SPP
// (bit 8) in S-mode:
static inline int get_trap_previous_mode(void) {
unsigned long mstatus;
asm volatile("csrr %0, mstatus" : "=r" (mstatus)); // M-mode only
return (mstatus >> 11) & 0x3; // MPP: 0=U-mode, 1=S-mode, 3=M-mode
}
Privilege Transitions:
// U-Mode → S-Mode (system call)
// (ecall from U-mode traps to M-mode unless delegated to S-mode
// via the medeleg register, as Linux-capable firmware arranges)
void user_ecall(void) {
asm volatile("ecall"); // Environment Call
// CPU saves: PC to SEPC
// CPU sets: SCAUSE (exception cause)
// CPU loads: PC from STVEC (trap vector)
// Changes: U-mode → S-mode
}
// S-Mode → M-Mode (machine call)
void supervisor_ecall(void) {
asm volatile("ecall"); // Environment Call
// Similar to above but S → M
}
// Return from trap
void return_from_trap_s_mode(void) {
asm volatile("sret"); // Supervisor Return
// CPU restores: PC from SEPC
// CPU restores: Privilege from SSTATUS.SPP
}
void return_from_trap_m_mode(void) {
asm volatile("mret"); // Machine Return
// CPU restores: PC from MEPC
// CPU restores: Privilege from MSTATUS.MPP
}
RISC-V CSR Access Control:
// CSR (Control and Status Register) address encoding:
// Bits 11-10: Access type (00/01/10 = read/write, 11 = read-only)
// Bits 9-8: Lowest privilege level that may access the CSR
// 00: U-mode accessible
// 01: S-mode accessible
// 10: Hypervisor accessible
// 11: M-mode accessible
// Examples:
// M-mode only registers
#define CSR_MSTATUS 0x300 // Machine status
#define CSR_MIE 0x304 // Machine interrupt enable
#define CSR_MTVEC 0x305 // Machine trap vector
// S-mode accessible registers
#define CSR_SSTATUS 0x100 // Supervisor status
#define CSR_SIE 0x104 // Supervisor interrupt enable
#define CSR_STVEC 0x105 // Supervisor trap vector
// U-mode accessible registers
#define CSR_CYCLE 0xC00 // Cycle counter (read-only)
#define CSR_TIME 0xC01 // Timer (read-only)
#define CSR_INSTRET 0xC02 // Instructions retired (read-only)
// Attempting to access a higher-privilege CSR triggers an
// illegal-instruction exception. Note: the CSR specifier is encoded
// in the instruction itself, so it must be a compile-time constant;
// CSR reads are therefore written as a macro, not a function:
#define read_csr(csr) ({ \
unsigned long __v; \
asm volatile("csrr %0, " #csr : "=r" (__v)); \
__v; \
})
// Usage: unsigned long s = read_csr(sstatus);
| Feature | x86-64 Rings | ARM64 ELs | RISC-V Modes |
|---|---|---|---|
| **Privilege Levels** | 4 (0-3) | 4 (EL0-EL3) | 3-4 (U/S/H/M) |
| **Actually Used** | 2 (Ring 0, 3) | 2-4 (all used) | 2-3 (U/S/M) |
| **User Level** | Ring 3 | EL0 | U-mode |
| **Kernel Level** | Ring 0 | EL1 | S-mode |
| **Hypervisor** | VMX root Ring 0 | EL2 | HS-mode (optional) |
| **Secure Boot** | SMM (legacy) | EL3 | M-mode |
| **Wasted Levels** | Ring 1-2 unused | None | None |
| **Transition Inst** | SYSCALL/SYSRET | SVC/ERET | ECALL/SRET/MRET |
| **Transition Cost** | ~100-300 cycles | ~50-150 cycles | ~50-100 cycles |
| **Design Year** | 1982 (80286) | 2011 (ARMv8) | 2010s |
Historical Note: x86's 4 rings were designed when hardware-assisted virtualization didn't exist. Modern systems essentially use only 2 levels (Ring 0/3), with Ring -1 (VMX root) added later for hypervisors!
Example: What Happens on Invalid Access
// User program tries to access kernel memory
void user_attempt_kernel_access(void) {
uint64_t *kernel_addr = (uint64_t *)0xffff888000000000;
// This will trigger:
// x86-64: #PF (Page Fault) with U/S violation
// ARM64: Synchronous exception with data abort
// RISC-V: Load access fault exception
uint64_t value = *kernel_addr; // FAULT!
}
// Page fault handler (kernel)
void page_fault_handler(struct pt_regs *regs, unsigned long error_code) {
// Check error code
bool user_mode = regs->cs & 3; // x86: CPL from CS
bool supervisor_page = !(error_code & 4); // U/S bit
if (user_mode && supervisor_page) {
// User tried to access supervisor page!
printk("Segmentation fault: user accessing kernel memory\n");
send_signal(current, SIGSEGV); // Kill process
}
}
Measured Cost of System Calls:
// Benchmark: System call overhead
#include <sys/syscall.h>
#include <x86intrin.h>
void benchmark_syscall(void) {
const int iterations = 1000000;
// Measure getpid() - simplest syscall
uint64_t start = __rdtsc();
for (int i = 0; i < iterations; i++) {
syscall(SYS_getpid); // Ring 3 → Ring 0 → Ring 3
}
uint64_t end = __rdtsc();
uint64_t cycles_per_call = (end - start) / iterations;
printf("System call overhead: %lu cycles\n", cycles_per_call);
}
// Typical results:
// x86-64 (SYSCALL): 100-150 cycles
// ARM64 (SVC): 80-120 cycles
// RISC-V (ECALL): 60-100 cycles
//
// Breakdown (approximate; the page-table switch and TLB flush
// apply only when KPTI is enabled):
// - Save user context: 20-40 cycles
// - Switch page tables (KPTI): 20-40 cycles
// - Flush TLB entries (KPTI without PCID): 20-40 cycles
// - Handler overhead: 20-40 cycles
// - Restore user context: 20-40 cycles
Optimization: Avoiding System Calls
// vDSO (virtual Dynamic Shared Object) - no syscall needed!
// Kernel maps read-only memory page into user space
// Contains frequently-used functions that don't need kernel privileges
// Example: gettimeofday() via vDSO
#include <sys/time.h>
void fast_time_access(void) {
struct timeval tv;
// Old way: syscall (100-150 cycles)
// New way: vDSO read (5-10 cycles)
gettimeofday(&tv, NULL); // No Ring transition!
// Kernel updates time in shared memory
// User reads directly - no privilege change needed
}
// Result: 10-30× faster for common operations!
The User/Supervisor (U/S) bit in page table entries is the primary mechanism for enforcing privilege-based memory protection. This simple bit—present in every page table entry—prevents user-mode code from accessing kernel memory.
x86-64 U/S Bit:
Hardware Permission Check:
IF (CPL == 3) THEN // User mode (Ring 3)
IF (PTE.U/S == 0) THEN
// User accessing supervisor page!
RAISE PAGE_FAULT (#PF, error_code with U/S bit set)
END IF
END IF
IF (CPL < 3) THEN // Supervisor mode (Ring 0-2)
// Supervisor can access ALL pages (U/S=0 or 1)
// Unless SMAP is enabled (discussed later)
END IF
Setting U/S Bit:
#include <stdint.h>
#define PTE_P (1ULL << 0) // Present
#define PTE_RW (1ULL << 1) // Read/Write
#define PTE_US (1ULL << 2) // User/Supervisor
#define PTE_NX (1ULL << 63) // No Execute
// Create kernel page (accessible only to kernel)
uint64_t make_kernel_page(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | // Present
PTE_RW | // Read/Write
PTE_NX; // No Execute (data page)
// Note: U/S bit NOT set = supervisor only
}
// Create user page (accessible to user and kernel)
uint64_t make_user_page(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | // Present
PTE_RW | // Read/Write
PTE_US | // User accessible
PTE_NX; // No Execute
}
// Typical kernel memory layout
void setup_kernel_pagetables(void) {
// Kernel code: R-X, Supervisor only
for (each page in .text) {
pte = phys_addr | PTE_P; // Not PTE_US, not PTE_NX
}
// Kernel data: RW-, Supervisor only
for (each page in .data, .bss) {
pte = phys_addr | PTE_P | PTE_RW | PTE_NX; // Not PTE_US
}
// User pages: RW-, User accessible
for (each page in user_memory) {
pte = phys_addr | PTE_P | PTE_RW | PTE_US | PTE_NX;
}
}
Critical Rule: If ANY level of the page table hierarchy has U/S=0, the page is supervisor-only.
┌────────────────────────────────────────┐
│ PML4E (Level 4) │
│ U/S=1 (user-accessible) │
└────────────┬───────────────────────────┘
│
▼
┌────────────────────────────────────────┐
│ PDPTE (Level 3) │
│ U/S=1 (user-accessible) │
└────────────┬───────────────────────────┘
│
▼
┌────────────────────────────────────────┐
│ PDE (Level 2) │
│ U/S=0 (supervisor-only) ← One S sets all! │
└────────────┬───────────────────────────┘
│
▼
┌────────────────────────────────────────┐
│ PTE (Level 1) │
│ U/S=1 (doesn't matter!) │
└────────────────────────────────────────┘
Result: Page is SUPERVISOR-ONLY
Example:
// Walk page tables checking U/S bits
bool is_user_accessible(uint64_t virtual_addr, uint64_t cr3) {
uint64_t *pml4 = (uint64_t *)(cr3 & ~0xFFF);
// Level 4
uint64_t pml4e = pml4[PML4_INDEX(virtual_addr)];
if (!(pml4e & PTE_US)) return false; // Supervisor at L4
// Level 3
uint64_t *pdpt = (uint64_t *)(pml4e & 0x000FFFFFFFFFF000ULL);
uint64_t pdpte = pdpt[PDPT_INDEX(virtual_addr)];
if (!(pdpte & PTE_US)) return false; // Supervisor at L3
// Level 2
uint64_t *pd = (uint64_t *)(pdpte & 0x000FFFFFFFFFF000ULL);
uint64_t pde = pd[PD_INDEX(virtual_addr)];
if (!(pde & PTE_US)) return false; // Supervisor at L2
// Level 1
uint64_t *pt = (uint64_t *)(pde & 0x000FFFFFFFFFF000ULL);
uint64_t pte = pt[PT_INDEX(virtual_addr)];
if (!(pte & PTE_US)) return false; // Supervisor at L1
// All levels have U/S=1
return true; // User accessible!
}
Modern 64-bit systems typically split the virtual address space:
x86-64 Canonical Address Space:
User Space (Lower Half):
0x0000000000000000 - 0x00007FFFFFFFFFFF
- U/S=1 in all page tables
- Accessible from Ring 3
- Contains: user code, data, heap, stack, shared libs
Non-Canonical Addresses (Hole):
0x0000800000000000 - 0xFFFF7FFFFFFFFFFF
- Invalid addresses
- Accessing triggers #GP fault
Kernel Space (Upper Half):
0xFFFF800000000000 - 0xFFFFFFFFFFFFFFFF
- U/S=0 in page tables
- Accessible only from Ring 0
- Contains: kernel code, data, device mappings
Why Split Address Space?
The Meltdown vulnerability (2018) allowed user code to speculatively read kernel memory. The mitigation: KPTI (Kernel Page-Table Isolation).
Before KPTI:
After KPTI:
KPTI Implementation:
// Simplified KPTI implementation
// Two sets of page tables per process
struct mm_struct {
pgd_t *user_pgd; // User page table root
pgd_t *kernel_pgd; // Kernel page table root
};
// System call entry with KPTI
void syscall_entry_with_kpti(void) {
// Running in user space with user_pgd in CR3
// 1. Save user CR3
uint64_t user_cr3 = read_cr3();
// 2. Switch to kernel page tables
uint64_t kernel_cr3 = (uint64_t)current->mm->kernel_pgd;
write_cr3(kernel_cr3);
// 3. Now can safely access all kernel memory
// ... execute syscall handler ...
// 4. Before returning, switch back to user page tables
write_cr3(user_cr3);
// 5. Return to user space
// User cannot see kernel memory anymore
}
// Performance cost: Two CR3 switches per syscall!
// CR3 write flushes TLB → expensive
KPTI Performance Impact:
// Benchmark: Syscall overhead with/without KPTI
void benchmark_kpti_overhead(void) {
// Without KPTI: 100-150 cycles per syscall
// With KPTI: 150-250 cycles per syscall
// Overhead: 50-100 cycles (50-100% slower!)
// Worst case: Syscall-heavy workload
// - Redis: 20-30% slowdown
// - PostgreSQL: 15-25% slowdown
// - Network I/O: 10-20% slowdown
// Best case: Compute-heavy workload
// - Scientific computing: <1% slowdown
// - Video encoding: <2% slowdown
}
KPTI Mitigation Strategies:
// 1. PCID (Process Context ID) - reduces TLB flush cost
// Instead of flushing TLB on CR3 write, tag TLB entries with PCID
// 2. INVPCID instruction - selective TLB invalidation
static inline void invpcid_flush_single(uint64_t pcid, uint64_t addr) {
struct {
uint64_t pcid;
uint64_t addr;
} desc = { pcid, addr };
uint64_t type = 0; // Type 0: invalidate one address in one PCID
asm volatile("invpcid %0, %1" :: "m" (desc), "r" (type));
// Much faster than a full TLB flush
}
// 3. CPU vulnerability detection
bool cpu_needs_kpti(void) {
// Check for Meltdown vulnerability
// Intel CPUs (most): Vulnerable
// AMD CPUs: Not vulnerable (no KPTI needed)
// ARM CPUs: Some vulnerable
if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
return false; // AMD not vulnerable
}
return true; // Enable KPTI
}
copy_to_user() and copy_from_user():
Kernel needs to copy data between user and kernel space:
// Kernel function: write() syscall
ssize_t sys_write(int fd, const char __user *buf, size_t count) {
// buf points to USER memory
// Kernel running in Ring 0
char kernel_buffer[4096];
// Can the kernel access user memory directly?
// Yes! Supervisor can access U/S=1 pages.
// But we must bound the size (or the copy overflows the stack
// buffer) and use copy_from_user() for safety:
if (count > sizeof(kernel_buffer))
count = sizeof(kernel_buffer);
if (copy_from_user(kernel_buffer, buf, count)) {
return -EFAULT; // Invalid user pointer
}
// copy_from_user implementation:
// - Checks buf is valid user address
// - Checks page is present and accessible
// - Safely copies, handling page faults
// - If fault occurs, returns error (no panic)
// Now write kernel_buffer to file...
}
// copy_from_user, conceptually. (The real Linux implementation is
// assembly with exception-table fixups; __try/__except is Windows
// structured exception handling and does not exist in the kernel.)
unsigned long copy_from_user(void *to, const void __user *from, unsigned long n) {
// Check the source range lies entirely within user address space
if (!access_ok(from, n))
return n; // Failed - invalid address
// Copy with fault handling: every load that may fault has an entry
// in the kernel's exception table. If a page fault occurs mid-copy,
// the fault handler redirects execution to fixup code that returns
// the number of bytes NOT copied - no panic, no oops.
return raw_copy_from_user(to, from, n); // 0 on success
}
SMAP (Supervisor Mode Access Prevention):
SMAP (introduced 2014) prevents kernel from accidentally accessing user memory:
// With SMAP enabled:
// - Kernel accessing U/S=1 pages triggers #PF
// - Must explicitly allow access with STAC/CLAC instructions
// x86-64 SMAP instructions:
static inline void stac(void) { // Set AC flag
asm volatile("stac" ::: "cc");
// Now kernel CAN access user pages
}
static inline void clac(void) { // Clear AC flag
asm volatile("clac" ::: "cc");
// Now kernel CANNOT access user pages
}
// copy_from_user with SMAP:
unsigned long copy_from_user_smap(void *to, const void __user *from, unsigned long n) {
if (!access_ok(from, n))
return n;
stac(); // Allow access to user pages
memcpy(to, from, n);
clac(); // Disallow access to user pages
return 0;
}
// Benefit: Prevents accidental kernel bugs
// Example: Kernel bug dereferences user pointer without validation
// Without SMAP: Exploitable (arbitrary read/write)
// With SMAP: Immediate crash (caught before exploitation)
Performance: SMAP overhead is negligible (<1%) because STAC/CLAC are very fast (1-2 cycles).
Modern processors provide sophisticated protection mechanisms beyond basic permission bits. These features implement defense-in-depth: even if attackers bypass one layer, additional protections prevent exploitation.
The Problem:
Even with the NX bit protecting the stack, attackers found ret2usr (return-to-user) attacks:
// Traditional attack (blocked by NX):
// 1. Overflow buffer
// 2. Overwrite return address → shellcode on stack
// 3. Stack has NX → FAIL
// ret2usr attack (bypasses NX):
// 1. Overflow kernel buffer
// 2. Overwrite return address → USER space address
// 3. User space IS executable
// 4. Kernel executes user shellcode → SUCCESS!
// Example exploit:
void kernel_vulnerable_function(char *user_data) {
char buffer[256];
strcpy(buffer, user_data); // Overflow!
// Attacker overwrites return address to 0x400000 (user space)
}
// Attacker's user space code at 0x400000:
void evil_code(void) {
// Runs with Ring 0 privileges!
commit_creds(prepare_kernel_cred(0)); // Escalate to root
}
SMEP Solution:
Prevents kernel (Ring 0) from executing code on user pages (U/S=1):
Hardware check on instruction fetch:
IF (CPL < 3 && PTE.U/S == 1 && fetching_instruction) THEN
RAISE PAGE_FAULT // Kernel cannot execute user pages!
END IF
Enabling/Checking SMEP:
#include <cpuid.h>
#define X86_CR4_SMEP (1UL << 20) // CR4 bit 20
// Check CPU support
bool cpu_has_smep(void) {
uint32_t eax, ebx, ecx, edx;
__cpuid_count(7, 0, eax, ebx, ecx, edx);
return (ebx & (1 << 7)) != 0; // CPUID.(EAX=7,ECX=0):EBX.SMEP[bit 7]
}
// Enable SMEP
void enable_smep(void) {
uint64_t cr4;
asm volatile("mov %%cr4, %0" : "=r" (cr4));
cr4 |= X86_CR4_SMEP;
asm volatile("mov %0, %%cr4" :: "r" (cr4));
}
// Check if SMEP is active
bool is_smep_enabled(void) {
uint64_t cr4;
asm volatile("mov %%cr4, %0" : "=r" (cr4));
return (cr4 & X86_CR4_SMEP) != 0;
}
Performance: SMEP has zero overhead - the check happens in parallel with instruction fetch.
The Problem:
Kernel must access user memory legitimately (e.g., copy_from_user). But unrestricted kernel access to user pages creates vulnerabilities:
// Vulnerable kernel code:
struct ops { void (*callback)(void); };
void kernel_use_object(struct ops *obj) {
obj->callback(); // No validation of where 'obj' points!
}
// Attack:
// 1. A kernel bug (e.g., use-after-free or corrupted pointer) makes
// 'obj' point into USER memory the attacker controls
// 2. Attacker fills that user page with a fake struct ops
// 3. Kernel reads attacker-controlled data → hijacked control flow
// Without SMAP: the kernel happily reads the user page
// With SMAP: the data access itself faults → attack stopped early
SMAP Solution:
Prevents kernel from accessing user pages unless explicitly allowed:
Hardware check on data access:
IF (CPL < 3 && PTE.U/S == 1 && EFLAGS.AC == 0) THEN
RAISE PAGE_FAULT // Kernel cannot access user pages!
END IF
Performance: SMAP overhead is <1% (STAC/CLAC are 1-2 cycles each).
Revolutionary memory safety feature - hardware-assisted memory tagging.
Concept:
Performance: 5-15% overhead (still much faster than software sanitizers).
Control-flow integrity against ROP/JOP attacks.
Performance: BTI overhead is <1% (check happens in parallel with branch).
Protection keys enable fast, fine-grained memory isolation within a single address space. Unlike page table permissions (which require expensive TLB flushes to change), protection keys can be modified in just a few cycles.
[This section covers Intel MPK in detail, including domain-based isolation, cross-domain communication, and performance comparisons showing 50-100× speedup over traditional mprotect.]
Trusted Execution Environments represent a fundamental shift in how we approach system security: rather than trying to secure an entire operating system (which may contain millions of lines of code), we create small, hardware-isolated compartments for executing security-critical code. This section explores the major TEE implementations and their role in modern memory protection.
A Trusted Execution Environment (TEE) is a secure area within a main processor that guarantees:
Key Insight: Unlike traditional OS security (which relies on software access control), TEEs use hardware-enforced isolation—even a compromised operating system or hypervisor cannot break into a TEE.
TEE Use Cases:
ARM TrustZone is the most widely deployed TEE technology, present in billions of mobile devices. It creates two parallel worlds within every ARM processor core.
TrustZone Architecture: Two Worlds
Normal World:
Secure World:
Secure Monitor (EL3):
Memory Protection: The NS Bit
TrustZone extends every memory transaction with a Non-Secure (NS) bit:
Normal World: All transactions have NS=1
Secure World: All transactions have NS=0
Hardware rule: NS=1 transactions cannot access NS=0 memory
TrustZone Address Space Controller (TZASC):
// TZASC configuration (simplified)
struct tzasc_region {
uint64_t base_addr;
uint64_t size;
uint32_t security; // Secure or Non-secure
uint32_t permissions; // Read, Write, Execute
};
// Example: Protect 128MB secure memory region
void configure_secure_memory(void) {
struct tzasc_region region = {
.base_addr = 0x80000000,
.size = 128 * 1024 * 1024, // 128 MB
.security = TZASC_SECURE, // Only accessible from Secure World
.permissions = TZASC_RW
};
tzasc_config_region(0, &region);
}
World Switching: The SMC Instruction
Transition from Normal World → Secure World requires Secure Monitor Call (SMC):
// From Normal World (Linux kernel driver)
// Request service from Secure World
void trustzone_call_secure_service(uint32_t service_id,
uint32_t arg1, uint32_t arg2) {
struct arm_smccc_res res;
// Trigger SMC instruction - enters Secure Monitor at EL3
arm_smccc_smc(service_id, arg1, arg2, 0, 0, 0, 0, 0, &res);
// Returns here after Secure World completes
// res.a0 contains return value
}
// In Secure Monitor (EL3)
void smc_handler(uint32_t smc_fid, uint64_t x1, uint64_t x2, uint64_t x3) {
// Save Normal World context
save_cpu_context(NON_SECURE);
// Switch to Secure World
switch_to_secure_world();
// Restore Secure World context
restore_cpu_context(SECURE);
// Forward call to Secure World OS
return secure_world_dispatch(smc_fid, x1, x2, x3);
}
Context Switch Performance:
TrustZone Page Tables
Each world has its own page tables:
// ARM64 TTBR registers (Translation Table Base Registers)
// Normal World uses:
TTBR0_EL1 // User space page table (Normal World)
TTBR1_EL1 // Kernel space page table (Normal World)
// Secure World uses:
TTBR0_EL1 // User space page table (Secure World)
TTBR1_EL1 // Kernel space page table (Secure World)
// Switching worlds changes which page tables are active
void switch_to_secure_world(void) {
// SCR_EL3.NS = 0 (enter Secure state)
write_scr_el3(read_scr_el3() & ~SCR_NS_BIT);
// Now TTBR0/1_EL1 point to Secure World page tables
}
TrustZone Implementations:
OP-TEE (Open Portable TEE):
Qualcomm QSEE (Qualcomm Secure Execution Environment):
Trustonic Kinibi:
Apple Secure Enclave Processor (SEP):
Intel Software Guard Extensions (SGX) takes a radically different approach: instead of one system-wide secure world, SGX creates per-application secure enclaves that don't trust the OS at all.
SGX Philosophy:
SGX Threat Model:
SGX Architecture
Enclave Page Cache (EPC):
Memory Encryption Engine (MEE):
SGX encrypts enclave memory to protect against:
MEE Design:
Merkle Tree for Integrity:
Root (on-chip)
/ \
/ \
L2 L2
/ \ / \
L1 L1 L1 L1
| | | |
[Data Pages...]
Each page has:
SGX Instructions:
// Create enclave
#include <sgx.h>
// ECREATE - Create enclave
sgx_status_t sgx_create_enclave(
const char *file_name,
int debug,
sgx_launch_token_t *token,
int *updated,
sgx_enclave_id_t *eid,
sgx_misc_attribute_t *misc_attr
);
// EADD - Add page to enclave
// EINIT - Initialize enclave (finalize)
// EENTER - Enter enclave from untrusted code
// EEXIT - Exit enclave to untrusted code
// Example: Calling enclave function
void call_enclave_function(sgx_enclave_id_t eid) {
int result;
sgx_status_t status;
// ECALL: Enter enclave
status = ecall_trusted_function(eid, &result, arg1, arg2);
if (status != SGX_SUCCESS) {
printf("Enclave call failed: %x\n", status);
}
}
SGX Attestation:
SGX provides remote attestation to prove an enclave is genuine:
Attestation Report Contents:
SGX Sealing (Persistent Storage):
Enclaves can't directly access disk. Solution: seal data with enclave-specific key:
// Seal data (encrypt for storage)
sgx_status_t sgx_seal_data(
uint32_t additional_MACtext_length,
const uint8_t *p_additional_MACtext,
uint32_t text2encrypt_length,
const uint8_t *p_text2encrypt,
uint32_t sealed_data_size,
sgx_sealed_data_t *p_sealed_data
);
// Unseal data (decrypt from storage)
sgx_status_t sgx_unseal_data(
const sgx_sealed_data_t *p_sealed_data,
uint8_t *p_additional_MACtext,
uint32_t *p_additional_MACtext_length,
uint8_t *p_decrypted_text,
uint32_t *p_decrypted_text_length
);
// Example usage
void save_secret(const char *secret, size_t len) {
uint32_t sealed_size = sgx_calc_sealed_data_size(0, len);
sgx_sealed_data_t *sealed = malloc(sealed_size);
// Seal with enclave identity
sgx_seal_data(0, NULL, len, (uint8_t*)secret, sealed_size, sealed);
// Save to untrusted storage (file, database)
write_to_file("sealed_data.bin", sealed, sealed_size);
free(sealed);
}
SGX Limitations:
SGX Performance:
// Benchmark: SGX overhead
void benchmark_sgx(void) {
const int iterations = 1000000;
// Native execution
uint64_t start = rdtsc();
for (int i = 0; i < iterations; i++) {
compute_native();
}
uint64_t native_cycles = rdtsc() - start;
// SGX enclave execution
start = rdtsc();
for (int i = 0; i < iterations; i++) {
ecall_compute_enclave(eid);
}
uint64_t sgx_cycles = rdtsc() - start;
printf("Native: %lu cycles\n", native_cycles / iterations);
printf("SGX: %lu cycles\n", sgx_cycles / iterations);
printf("Overhead: %.1f%%\n",
100.0 * (sgx_cycles - native_cycles) / native_cycles);
}
// Typical results:
// Compute-heavy: 5-15% overhead (MEE encryption)
// I/O-heavy: 50-200% overhead (OCALL frequency)
// Memory-intensive: 20-100% overhead (EPC paging)
| Feature | ARM TrustZone | Intel SGX |
|---|---|---|
| **Granularity** | System-wide | Per-application |
| **Trust Model** | Trust Secure World OS | Trust only CPU |
| **Protected Memory** | GB+ (entire regions) | 64-256 MB (EPC) |
| **Memory Encryption** | Optional (implementation) | Always (MEE) |
| **World Switch** | 2-5 μs (SMC) | 8-12K cycles (ECALL) |
| **OS Trust** | Requires trusted OS | No trust needed |
| **Deployment** | 10+ billion devices | Limited (server-only now) |
| **Use Case** | Mobile, embedded | Cloud, datacenter |
| **Multiple TEEs** | One Secure World | Many enclaves |
| **Side Channels** | Less vulnerable | Highly vulnerable |
| **Attestation** | Device attestation | Enclave attestation |
| **Typical Overhead** | <1% (infrequent switch) | 5-15% (encryption) |
When to Use TrustZone:
When to Use SGX:
Graviton (Microsoft Research, OSDI 2018) brings TEE concepts to GPUs—crucial for AI/ML workloads on sensitive data.
Why GPU TEEs Matter:
Graviton Architecture:
Key Graviton Innovations:
Graviton Performance:
Overhead: 17-33% (primarily encryption/decryption)
Breakdown:
- Memory encryption: 10-20%
- Command encryption: 3-5%
- Context switching: 2-4%
- Page table management: 2-4%
Still 50-80× faster than CPU-only confidential computing!
Graviton Limitations:
Qualcomm SPU (Secure Processing Unit):
Samsung Knox:
AMD PSP (Platform Security Processor):
Apple Secure Enclave:
RISC-V Keystone:
Known Vulnerabilities:
Mitigation Strategies:
Mobile Payments:
User taps phone → NFC controller →
Secure World TA → Decrypt token →
Transmit encrypted → Terminal
Protected: Payment token, encryption keys
TrustZone ensures: Token never exposed to Normal World
Streaming DRM (Widevine L1):
Encrypted stream → Normal World app →
Secure World TA → Decrypt in TEE →
Secure video path → Display
Protected: Decryption keys, decrypted frames
TrustZone ensures: HD/4K content protection
Biometric Authentication:
Fingerprint sensor → Secure World driver →
TA processes biometric → Match in TEE →
Return yes/no to Normal World
Protected: Biometric templates, matching algorithm
TrustZone ensures: Templates never leave Secure World
Cloud Confidential Computing:
Encrypted data → SGX enclave →
Decrypt and process → Encrypt result →
Return to client
Protected: Data, computation, keys
SGX ensures: Cloud provider cannot see data
Trends:
Performance Improvements:
Future of TEE Technology:
Copy-On-Write is a memory optimization that leverages page-level protection to defer expensive memory copies until absolutely necessary. It's fundamental to efficient process creation and memory management.
Without COW:
// fork() without COW
pid_t pid = fork();
// Parent process has 100 MB of data
// fork() must copy all 100 MB → child process
// Time: ~50 ms to copy 100 MB
// Memory: 200 MB total (100 MB × 2)
With COW:
// fork() with COW
pid_t pid = fork();
// 1. Child shares parent's pages (read-only)
// 2. Copy only happens on first write
// Time: ~0.1 ms (just page table copy)
// Memory: 100 MB (until write occurs)
Step 1: Mark Pages Read-Only
// During fork(), mark all writable pages as read-only
void cow_fork_setup(struct mm_struct *parent, struct mm_struct *child) {
for (each page in parent address space) {
if (page_is_writable(page)) {
// Mark parent's page read-only
page_clear_write_bit(parent, page);
// Child shares same physical page (also read-only)
child_pte = parent_pte & ~PTE_W; // Clear write bit
child_pte |= PTE_COW; // Mark as COW (OS-specific flag)
// Increment page reference count
page->ref_count++;
}
}
// Flush TLB so read-only protection takes effect
flush_tlb();
}
Step 2: Handle Write Fault
// Page fault handler for COW
void page_fault_handler(unsigned long fault_addr, unsigned long error_code) {
bool is_write = error_code & PF_WRITE;
pte_t *pte = get_pte(current->mm, fault_addr);
if (is_write && pte_is_cow(*pte)) {
// COW fault!
handle_cow_fault(fault_addr, pte);
return;
}
// ... other fault handling ...
}
void handle_cow_fault(unsigned long fault_addr, pte_t *pte) {
struct page *old_page = pte_page(*pte);
// Check reference count
if (page_count(old_page) == 1) {
// Only reference: just make writable in place
*pte = pte_mkwrite(*pte); // these helpers return the updated PTE value
*pte = pte_clear_cow(*pte);
flush_tlb_page(fault_addr);
return;
}
// Multiple references: must copy
struct page *new_page = alloc_page(GFP_KERNEL);
// Copy page contents
copy_page(page_address(new_page), page_address(old_page));
// Update PTE to point to new page
set_pte(pte, mk_pte(new_page, PAGE_READWRITE));
// Decrement old page reference count
put_page(old_page);
// Flush TLB
flush_tlb_page(fault_addr);
}
Example: fork() with COW
void demonstrate_cow(void) {
char *data = malloc(4096);
memset(data, 'A', 4096);
pid_t pid = fork();
if (pid == 0) {
// Child process
printf("Child: data[0] = %c\n", data[0]); // Read: no copy
data[0] = 'B'; // Write: triggers COW!
// Page fault handler:
// 1. Allocate new page
// 2. Copy old page to new page
// 3. Update child's PTE to new page
// 4. Resume execution
printf("Child: data[0] = %c\n", data[0]); // 'B'
} else {
// Parent process
sleep(1);
printf("Parent: data[0] = %c\n", data[0]); // Still 'A'
}
}
Faster Process Creation:
// Benchmark: fork() performance
void benchmark_fork(void) {
// Allocate 100 MB
size_t size = 100 * 1024 * 1024;
char *data = malloc(size);
memset(data, 0, size);
// Without COW: ~50 ms (must copy 100 MB)
// With COW: ~0.1 ms (just copy page tables)
uint64_t start = rdtsc();
pid_t pid = fork();
uint64_t end = rdtsc();
if (pid == 0) {
// Child: measure actual COW overhead
start = rdtsc();
memset(data, 0xFF, size); // Trigger COW for all pages
end = rdtsc();
// COW overhead: ~50 ms (same as full copy, but deferred)
exit(0);
}
wait(NULL);
}
// Result: fork() itself is 500× faster with COW!
Memory Savings:
// Common case: exec() after fork()
pid_t pid = fork();
if (pid == 0) {
// Child immediately calls exec()
execve("/bin/sh", argv, envp);
// exec() replaces address space anyway
// Without COW: Wasted 100% of copy time/memory
// With COW: No copy occurred - pure win!
}
Special optimization: zero page sharing
// Global zero page (read-only, shared by all processes)
struct page *zero_page;
void init_zero_page(void) {
zero_page = alloc_page(GFP_KERNEL);
clear_page(page_address(zero_page));
// Mark as read-only, COW
// All processes share this page for zero-initialized memory
}
// When process requests zero-initialized memory:
void *zero_page_map(size_t size) {
for (each page in size) {
// Point PTE to global zero page (read-only, COW)
pte = mk_pte(zero_page, PAGE_READONLY | PAGE_COW);
// Increment zero page reference count
get_page(zero_page);
}
// On first write:
// 1. Allocate real page
// 2. Copy zero page (trivial: memset 0)
// 3. Update PTE
}
// Memory savings: Huge!
// Example: Process maps 1 GB of zeros
// Without zero page sharing: 1 GB allocated
// With zero page sharing: 4 KB (until written)
Anonymous mmap with MAP_PRIVATE:
// Private mapping uses COW
void *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// Initially: Points to zero page (COW)
// On first write: Allocates real page, copies zeros
File-backed mmap with MAP_PRIVATE:
// File mapping uses COW
int fd = open("file.txt", O_RDONLY);
void *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
MAP_PRIVATE, fd, 0);
// Initially: Points to page cache (read-only, COW)
// On write:
// 1. Allocate private page
// 2. Copy file data to private page
// 3. Update PTE
// 4. Write goes to private page (not file)
close(fd);
munmap(addr, 4096);
// Changes are discarded (MAP_PRIVATE)
COW Overhead Breakdown:
// Page fault handling for COW write:
// 1. Fault entry: ~100 cycles
// 2. Page allocation: ~500 cycles
// 3. Page copy (4KB): ~2000 cycles (cold cache)
// ~500 cycles (hot cache)
// 4. TLB flush: ~50 cycles
// 5. Fault return: ~100 cycles
// Total: ~1250 (hot cache) to ~2750 (cold cache) cycles (~0.4-1 μs on 3 GHz CPU)
// Amortized cost:
// - fork() + no writes: Essentially free
// - fork() + write all: Same cost as direct copy (but deferred)
// - fork() + write few: Huge win (only copy what's needed)
Memory ordering isn't just about performance—weak memory models can create security vulnerabilities if not properly understood and managed.
Different architectures provide different memory ordering guarantees:
x86-64: Total Store Order (TSO)
ARM64: Weak Ordering
RISC-V: RVWMO (Weak Memory Ordering)
Example 1: Broken Lock Implementation
// Incorrect lock implementation (no memory barriers)
struct spinlock {
volatile int locked;
};
void bad_lock(struct spinlock *lock) {
while (__sync_lock_test_and_set(&lock->locked, 1))
cpu_relax();
// Fragile: only the implicit acquire semantics of test_and_set protect
// entry; nothing here provides the release ordering needed at unlock.
}
void bad_unlock(struct spinlock *lock) {
// BUG: No memory barrier!
lock->locked = 0;
// Critical section stores might leak past unlock!
}
// Failure scenario:
int secure_data = 0;
void thread1(struct spinlock *lock) {
    bad_lock(lock);
    secure_data = 42; // Store may become visible after bad_unlock() below!
    bad_unlock(lock);
}
void thread2(struct spinlock *lock) {
    bad_lock(lock);
    int leaked = secure_data; // May read a stale value despite holding the lock!
    bad_unlock(lock);
}
Correct Implementation:
void correct_lock(struct spinlock *lock) {
while (__sync_lock_test_and_set(&lock->locked, 1))
cpu_relax();
// Memory barrier: prevent reordering
__atomic_thread_fence(__ATOMIC_ACQUIRE);
}
void correct_unlock(struct spinlock *lock) {
// Memory barrier: prevent reordering
__atomic_thread_fence(__ATOMIC_RELEASE);
lock->locked = 0;
}
Example 2: Publish-Subscribe Race
// Vulnerable code:
struct message {
int ready;
char data[256];
};
struct message *msg;
// Publisher
void publish(const char *text) {
strcpy(msg->data, text); // Write data
msg->ready = 1; // Signal ready
// Without barrier: ready might be visible before data!
}
// Subscriber
void subscribe(void) {
while (!msg->ready) // Wait for ready
cpu_relax();
printf("%s\n", msg->data); // Read data
// Without barrier: might read stale data!
}
// Correct implementation:
void publish_correct(const char *text) {
strcpy(msg->data, text);
__atomic_thread_fence(__ATOMIC_RELEASE); // Release barrier
msg->ready = 1;
}
void subscribe_correct(void) {
while (!msg->ready)
cpu_relax();
__atomic_thread_fence(__ATOMIC_ACQUIRE); // Acquire barrier
printf("%s\n", msg->data);
}
x86-64 Barriers:
// Compiler barrier (prevent compiler reordering)
#define barrier() asm volatile("" ::: "memory")
// mfence: Full memory barrier
static inline void mfence(void) {
asm volatile("mfence" ::: "memory");
}
// sfence: Store fence
static inline void sfence(void) {
asm volatile("sfence" ::: "memory");
}
// lfence: Load fence
static inline void lfence(void) {
asm volatile("lfence" ::: "memory");
}
ARM64 Barriers:
// DMB: Data Memory Barrier
static inline void dmb(void) {
asm volatile("dmb sy" ::: "memory");
}
// DSB: Data Synchronization Barrier
static inline void dsb(void) {
asm volatile("dsb sy" ::: "memory");
}
// ISB: Instruction Synchronization Barrier
static inline void isb(void) {
asm volatile("isb" ::: "memory");
}
RISC-V Barriers:
// FENCE instruction
static inline void fence(void) {
asm volatile("fence rw, rw" ::: "memory");
}
// FENCE.I: Instruction fence
static inline void fence_i(void) {
asm volatile("fence.i" ::: "memory");
}
Clearing Secrets:
// Insecure: compiler might optimize away
void clear_secret_bad(char *secret, size_t len) {
memset(secret, 0, len);
// Compiler sees: value never read after this
// Optimization: removes memset!
}
// Secure: volatile or explicit barrier
void clear_secret_good(volatile char *secret, size_t len) {
memset((void *)secret, 0, len);
// volatile forces memory write
}
// Or use explicit memory barrier
void clear_secret_barrier(char *secret, size_t len) {
memset(secret, 0, len);
__atomic_thread_fence(__ATOMIC_SEQ_CST); // Force completion
}
// Or use platform-specific secure clear
#ifdef __STDC_LIB_EXT1__
memset_s(secret, len, 0, len); // C11 secure memset
#endif
Confidential computing extends TEE concepts to entire virtual machines, providing strong isolation even in untrusted cloud environments.
Architecture:
Key Features:
Example: Creating SEV-SNP VM
// Simplified SEV-SNP VM creation (Linux/KVM)
#include <linux/kvm.h>
#include <linux/sev.h>
int create_sev_snp_vm(void) {
int kvm_fd = open("/dev/kvm", O_RDWR);
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
// Enable SEV-SNP
struct kvm_sev_cmd cmd = {
.id = KVM_SEV_SNP_INIT,
.data = 0,
};
ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
// Memory regions are automatically encrypted
struct kvm_userspace_memory_region region = {
.slot = 0,
.flags = KVM_MEM_PRIVATE, // Encrypted
.guest_phys_addr = 0,
.memory_size = 1024 * 1024 * 1024, // 1 GB
.userspace_addr = (uint64_t)guest_memory,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
return vm_fd;
}
Performance: SEV-SNP overhead is 1-5% for most workloads.
Architecture:
Key Differences from SEV-SNP:
| Feature | AMD SEV-SNP | Intel TDX |
|---|---|---|
| **Encryption** | AES-128 | MKTME (AES-XTS-128) |
| **CPU Privilege** | PSP (separate processor) | SEAM (new CPU mode) |
| **Page Tables** | RMP | SEPT (Secure EPT) |
| **Shared Memory** | Via encryption | Explicit shared pages |
| **Overhead** | 1-5% | 2-5% |
Four-World Model:
Granule Protection Table (GPT):
// GPT entry (per 4KB granule)
enum gpt_state {
GPT_NORMAL = 0, // Normal world access
GPT_SECURE = 1, // Secure world access
GPT_REALM = 2, // Realm world access
GPT_ROOT = 3, // Root world access
};
// Hardware enforces:
// - Normal world cannot access Realm granules
// - Realm cannot access Normal/Secure granules
// - Root manages all granules
AMD's memory encryption technologies protect data at rest in DRAM, defending against physical memory attacks like cold boot attacks and DMA attacks from malicious hardware.
Traditional Threat Model:
// Traditional security assumes:
// - Software can be malicious → Use page tables for isolation ✓
// - OS can be compromised → Use hypervisor ✓
// - Hypervisor can be evil → Use confidential computing ✓
// But what if attacker has PHYSICAL access?
// 1. Cold boot attack: Freeze RAM, remove it, read contents
// 2. DMA attack: Malicious PCIe device reads all memory
// 3. Memory bus snooping: Hardware tap on DDR bus
// 4. JTAG debugging: Hardware debugger reads DRAM
// Solution: Encrypt memory in the DRAM itself!
System-wide transparent memory encryption introduced with AMD Zen (2017).
Architecture:
Enabling SME:
#include <cpuid.h>
// Check CPU support for SME
bool cpu_has_sme(void) {
uint32_t eax, ebx, ecx, edx;
// CPUID Fn8000_001F - Encryption Memory Capabilities
__cpuid(0x8000001F, eax, ebx, ecx, edx);
// EAX bit 0: SME supported
// EAX bit 1: SEV supported
// EBX bits 5-0: C-bit location in the page table entry
// ECX: number of simultaneously encrypted guests supported
return (eax & (1 << 0)) != 0;
}
// Enable SME system-wide
void enable_sme(void) {
if (!cpu_has_sme()) {
printk("SME not supported on this CPU\n");
return;
}
// Read MSR_AMD64_SYSCFG
uint64_t syscfg = rdmsr(MSR_AMD64_SYSCFG);
// Set MEM_ENCRYPT_EN bit (bit 23)
syscfg |= (1ULL << 23);
// Write back to enable SME
wrmsr(MSR_AMD64_SYSCFG, syscfg);
// From now on, all memory with C-bit=1 is encrypted
printk("SME enabled successfully\n");
}
C-bit in Page Table Entries:
AMD repurposes one of the physical address bits as the C-bit (Ciphertext bit):
Using the C-bit:
// Get C-bit position from CPUID
uint32_t get_c_bit_position(void) {
uint32_t eax, ebx, ecx, edx;
__cpuid(0x8000001F, eax, ebx, ecx, edx);
return ebx & 0x3F; // Bits 5-0
}
// Create encrypted page table entry
uint64_t make_encrypted_pte(uint64_t phys_addr) {
uint32_t c_bit = get_c_bit_position();
uint64_t pte = phys_addr & 0x000FFFFFFFFFF000ULL;
pte |= PTE_P | PTE_W | PTE_NX; // Present, writable, no-execute
pte |= (1ULL << c_bit); // Set C-bit for encryption
return pte;
}
// Kernel decides which pages to encrypt
void setup_encrypted_memory(void) {
// Typically encrypt:
// - All kernel code and data
// - All user process memory
// - Page tables themselves
// Leave unencrypted:
// - DMA buffers (devices need plaintext)
// - Shared memory with devices
// - Boot-time structures
}
Encryption Algorithm:
// SME uses AES-128-XTS mode (similar to disk encryption)
// Each 16-byte block encrypted separately
// Encryption:
// ciphertext = AES-128(plaintext, key, tweak)
// where:
// key = CPU-generated 128-bit key (not accessible to software)
// tweak = physical_address ^ other_factors
// Example (conceptual):
void sme_encrypt_block(uint8_t *plaintext, uint64_t phys_addr,
uint8_t *ciphertext) {
uint8_t key[16]; // Hidden in CPU, can't be read
uint64_t tweak = phys_addr; // Unique per physical address
aes_128_xts_encrypt(plaintext, key, tweak, ciphertext);
}
// Important: Same physical address always uses same tweak
// This allows: read → decrypt → modify → encrypt → write
// Without needing to track what's encrypted
Per-VM memory encryption - extension of SME for virtualized environments.
Key Differences from SME:
| Feature | SME | SEV |
|---|---|---|
| **Scope** | System-wide | Per-VM |
| **Keys** | 1 key for entire system | Unique key per VM |
| **Hypervisor** | Can decrypt | Cannot decrypt guest memory |
| **Use Case** | Physical security | Cloud multi-tenancy |
| **Performance** | <1% | 1-3% |
SEV Architecture:
// Each VM gets unique encryption key
struct sev_vm {
uint32_t asid; // Address Space ID (key index)
uint8_t key[16]; // AES-128 key (in secure processor)
bool encrypted; // Is this VM encrypted?
};
// Create encrypted VM
int create_sev_vm(struct kvm_vm *vm) {
// Allocate ASID (limited resource, typically 16-256 per CPU)
int asid = allocate_asid();
if (asid < 0)
return -EBUSY; // No available ASIDs
// Ask AMD Secure Processor (PSP) to generate key
struct sev_cmd_activate cmd = {
.asid = asid,
.handle = vm->handle,
};
// PSP generates random AES-128 key for this VM
// Key is stored in PSP, never visible to hypervisor
sev_issue_cmd(SEV_CMD_ACTIVATE, &cmd);
// Now all memory accesses by this VM use its unique key
vm->asid = asid;
vm->encrypted = true;
return 0;
}
Automatic encryption without software involvement:
// TSME: All memory automatically encrypted
// - No C-bit needed in page tables
// - Hypervisor doesn't choose what to encrypt
// - Everything encrypted by default
// - Simplifies software (no PTE management)
// Enable TSME (BIOS setting, not OS)
// Once enabled:
// - All DRAM is encrypted
// - No software changes needed
// - Protects against physical attacks
// - But: Cannot selectively decrypt (e.g., for DMA)
Measured Overhead:
// Benchmark: Memory-intensive workloads
// 1. SME (system-wide encryption):
// - CPU overhead: <0.5% (AES-NI acceleration)
// - Memory bandwidth: ~1-2% (encryption/decryption)
// - Latency: +0-2 cycles per access
// - Overall: <1% for most workloads
// 2. SEV (per-VM encryption):
// - Additional ASID switching overhead
// - More context switches (key changes)
// - Overall: 1-3% for typical VMs
// 3. SEV-SNP (with integrity):
// - RMP (Reverse Map Table) checks
// - Memory overhead: ~1-2% of RAM
// - Performance overhead: 1-5%
void benchmark_sme(void) {
const size_t size = 1024 * 1024 * 1024; // 1 GB
char *buf = malloc(size);
// Without SME: 10.5 GB/s
// With SME: 10.3 GB/s (~2% slower)
// Bandwidth test
uint64_t start = rdtsc();
memset(buf, 0, size);
uint64_t end = rdtsc();
printf("Memory bandwidth: %.2f GB/s\n",
(size / 1e9) / ((end - start) / 3.0e9)); // assumes a 3 GHz TSC
}
Why So Fast?
What SME/SEV Protects Against:
// ✓ Cold boot attacks
// Attacker freezes RAM, removes it, inserts into reader
// → DRAM contents are encrypted, useless without key
// ✓ DMA attacks from malicious devices
// PCIe device tries to read kernel memory
// → Reads encrypted data, no decryption key
// ✓ Memory bus snooping
// Hardware probe on DDR bus
// → All data on bus is encrypted
// ✓ JTAG/debugging attacks
// Hardware debugger reads DRAM
// → Gets encrypted data only
What SME/SEV Does NOT Protect Against:
// ✗ Software attacks (buffer overflows, etc.)
// Encryption doesn't prevent code execution exploits
// ✗ Side-channel attacks (cache timing, Spectre, etc.)
// Encrypted memory still vulnerable to microarchitectural attacks
// ✗ Replay attacks (without SEV-SNP)
// Attacker can replay old encrypted memory contents
// ✗ Hypervisor compromising unencrypted VM data
// If VM memory isn't encrypted (C-bit=0), hypervisor can read it
Linux Kernel Support:
// Boot parameter: mem_encrypt=on
// Kernel automatically encrypts all memory
// Check if running with SME (there is no CR4 flag for SME;
// activation is reported by MSR_AMD64_SYSCFG bit 23)
bool is_sme_active(void) {
    return (rdmsr(MSR_AMD64_SYSCFG) & (1ULL << 23)) != 0;
}
// DMA buffer allocation (must be unencrypted)
void *dma_alloc_coherent(struct device *dev, size_t size) {
    int order = get_order(size); // pages needed, as a power of two
    void *virt = alloc_pages(GFP_KERNEL, order);
    // Clear C-bit for DMA buffers (devices need plaintext)
    pte_t *pte = get_pte(virt);
    *pte &= ~(1ULL << c_bit_position);
    flush_tlb(); // make the cleared C-bit visible to the page walker
    return virt;
}
| Feature | AMD SME/SEV | Intel TME/MKTME |
|---|---|---|
| **First Appeared** | 2017 (Zen) | 2019 (Ice Lake) |
| **Encryption** | AES-128-XTS | AES-XTS-128/256 |
| **Granularity** | Per-page (C-bit) | Per-page (KeyID) |
| **VM Isolation** | SEV/SEV-ES/SEV-SNP | TDX |
| **Keys** | PSP manages | CPU manages |
| **Integrity** | SEV-SNP only | TDX includes |
| **Performance** | <1-5% | 1-5% |
RISC-V provides flexible, modular security features through its extensible ISA design. Unlike x86 and ARM which evolved security features over decades, RISC-V was designed from the ground up with security modularity in mind.
Modular Design:
Base ISA: RV32I or RV64I (minimal, required)
↓
Privilege Modes: M, S, U (optional: Hypervisor)
↓
PMP: Physical Memory Protection (M-mode only)
↓
Extensions:
- Cryptography (Zk*)
- Vector (V)
- Bit Manipulation (B)
- Control Flow Integrity (Zicfiss)
Each component is optional and composable, allowing implementations from tiny embedded systems to large servers.
M-mode's security boundary enforcement.
Why PMP Exists:
// Problem: M-mode is the highest privilege level
// M-mode firmware runs before OS loads
// But M-mode needs to protect itself from S-mode OS!
// Example vulnerability without PMP:
void malicious_s_mode_code(void) {
// S-mode OS tries to overwrite M-mode firmware
uint64_t *m_mode_code = (uint64_t *)0x80000000;
*m_mode_code = 0xdeadbeef; // Overwrite M-mode!
// Without PMP: This succeeds! OS compromises firmware.
// With PMP: This triggers exception, OS crashes.
}
PMP Configuration:
RISC-V provides up to 64 PMP entries (implementation-dependent, typically 8-16):
// PMP CSRs (Control and Status Registers)
// - pmpaddr0-pmpaddr63: Address registers
// - pmpcfg0-pmpcfg15: Configuration registers (4 entries each)
// PMP permissions
#define PMP_R 0x01 // Read
#define PMP_W 0x02 // Write
#define PMP_X 0x04 // Execute
#define PMP_L 0x80 // Lock (cannot be changed until reset)
// PMP address matching modes
#define PMP_OFF 0x00 // Disabled
#define PMP_TOR 0x08 // Top of Range
#define PMP_NA4 0x10 // Naturally Aligned 4-byte
#define PMP_NAPOT 0x18 // Naturally Aligned Power-of-Two
// Configure PMP region
void pmp_set_region(int region, uint64_t addr, uint64_t size, uint8_t perm) {
// NAPOT address encoding: base>>2 with the low log2(size)-3 bits set
// (base must be size-aligned and size a power of two)
uint64_t pmpaddr = (addr >> 2) | ((size >> 3) - 1);
// Configuration: permissions | addressing mode
uint8_t pmpcfg = perm | PMP_NAPOT;
// Write to appropriate CSRs
switch (region) {
case 0: {
    asm volatile("csrw pmpaddr0, %0" :: "r" (pmpaddr));
    // pmpcfg0 contains entries 0-3 (8 bits each)
    uint64_t cfg0 = read_csr(pmpcfg0);
    cfg0 = (cfg0 & ~0xFFULL) | pmpcfg;
    asm volatile("csrw pmpcfg0, %0" :: "r" (cfg0));
    break;
}
// ... more regions
}
}
Example: Protecting M-mode Firmware
// Secure boot: M-mode sets up PMP before jumping to S-mode
void m_mode_setup_security(void) {
// Region 0: M-mode code (0x80000000 - 0x80100000)
// Readable and executable by M-mode only
// S-mode and U-mode cannot access
pmp_set_region(0,
0x80000000, // Start address
0x100000, // Size (1 MB)
PMP_R | PMP_X | PMP_L); // R-X, locked
// Region 1: M-mode data (0x80100000 - 0x80200000)
// Read-write for M-mode only
pmp_set_region(1,
0x80100000,
0x100000,
PMP_R | PMP_W | PMP_L); // RW-, locked
// Region 2: RAM for S-mode and U-mode (0x80200000 - 0x88000000)
// Full access for all modes
// (This range is not a naturally aligned power of two, so a real
// implementation would configure it in TOR mode rather than NAPOT)
pmp_set_region(2,
0x80200000,
0x7E00000, // ~126 MB
PMP_R | PMP_W | PMP_X); // RWX
// Region 3: Device memory (0x40000000 - 0x50000000)
// Read-write, no execute
pmp_set_region(3,
0x40000000,
0x10000000,
PMP_R | PMP_W); // RW-
// Now jump to S-mode
// S-mode cannot modify these PMP settings (locked)
jump_to_s_mode();
}
PMP Permission Checking:
// Hardware checks PMP on EVERY memory access from S-mode and U-mode
// M-mode can always access everything (PMP doesn't restrict M-mode)
// Pseudocode for PMP check:
bool pmp_check_access(uint64_t addr, access_type_t type, priv_mode_t mode) {
// M-mode bypasses PMP
if (mode == M_MODE)
return true;
// Find matching PMP entry (check in order)
for (int i = 0; i < num_pmp_entries; i++) {
if (pmp_entry_matches(i, addr)) {
uint8_t perm = pmp_get_permissions(i);
// Check permissions
if (type == READ && !(perm & PMP_R))
return false; // Access fault
if (type == WRITE && !(perm & PMP_W))
return false;
if (type == EXECUTE && !(perm & PMP_X))
return false;
return true; // Access allowed
}
}
// No matching entry: deny by default for S/U mode
return false;
}
PMP Addressing Modes:
// 1. TOR (Top of Range)
// Range: [pmpaddr[i-1], pmpaddr[i])
// Good for: Arbitrary ranges
// 2. NA4 (Naturally Aligned 4-byte)
// Address: pmpaddr << 2
// Size: 4 bytes
// Good for: Single word protection
// 3. NAPOT (Naturally Aligned Power-of-Two)
// Encodes both address and size in pmpaddr
// Size must be power of 2
// Good for: Standard memory regions
// Example: NAPOT encoding for 64 KB region at 0x80000000
void pmp_napot_example(void) {
uint64_t addr = 0x80000000;
uint64_t size = 65536; // 64 KB (power of two, base is size-aligned)
// NAPOT encoding: base>>2 with the low log2(size)-3 bits set to 1
uint64_t pmpaddr = (addr >> 2) | ((size >> 3) - 1);
// pmpaddr = 0x20000000 | 0x1FFF = 0x20001FFF
asm volatile("csrw pmpaddr0, %0" :: "r" (pmpaddr));
// This protects region [0x80000000, 0x80010000)
}
Ratified extension addressing PMP limitations.
Key Improvements:
// 1. Rule Locking Clarification
// Original PMP: Locked rules block all later rules
// ePMP: Locked rules only apply to themselves
// 2. Denied-by-Default Mode
// Original PMP: Unmatched access denied only if any PMP active
// ePMP: New CSR bit makes unmatched access always denied
// 3. M-mode Lockdown
// ePMP adds MSECCFG CSR (Machine Security Configuration)
// MSECCFG register bits:
#define MSECCFG_MML 0x01 // Machine Mode Lockdown
#define MSECCFG_MMWP 0x02 // Machine Mode Whitelist Policy
#define MSECCFG_RLB 0x04 // Rule Locking Bypass
// Enable ePMP security
void epmp_enable_security(void) {
uint64_t mseccfg = 0;
// Enable M-mode lockdown
// Now M-mode also checks PMP (not just S/U mode)
mseccfg |= MSECCFG_MML;
// Enable whitelist policy
// Unmatched addresses are denied (not allowed)
mseccfg |= MSECCFG_MMWP;
asm volatile("csrw mseccfg, %0" :: "r" (mseccfg));
// Now even M-mode must follow PMP rules!
// Provides defense-in-depth for firmware bugs
}
ePMP Use Case: Secure Boot
// Secure boot with ePMP protection
void secure_boot_with_epmp(void) {
// 1. Set up ePMP to protect boot ROM
pmp_set_region(0, 0x10000, 0x10000,
PMP_R | PMP_X | PMP_L); // ROM: R-X, locked
// 2. Enable ePMP with M-mode lockdown
uint64_t mseccfg = MSECCFG_MML | MSECCFG_MMWP;
asm volatile("csrw mseccfg, %0" :: "r" (mseccfg));
// 3. Now even M-mode cannot write to boot ROM
// If M-mode firmware has bug, cannot overwrite itself
// 4. Verify and load next stage
if (verify_signature(next_stage)) {
// Set up PMP for next stage
pmp_set_region(1, next_stage_addr, next_stage_size,
PMP_R | PMP_X);
// Jump to next stage
((void (*)(void))next_stage_addr)();
} else {
// Signature verification failed - halt
while(1);
}
}
Security Monitor architecture using PMP for isolation.
Architecture:
Keystone Enclave Creation:
// Create isolated enclave using PMP
struct enclave {
uint64_t base;
uint64_t size;
int pmp_region;
};
struct enclave create_keystone_enclave(void *code, size_t code_size,
void *data, size_t data_size) {
struct enclave enc;
// 1. Allocate isolated memory
enc.base = alloc_enclave_memory(code_size + data_size);
enc.size = code_size + data_size;
// 2. Copy code and data
memcpy((void *)enc.base, code, code_size);
memcpy((void *)(enc.base + code_size), data, data_size);
// 3. Allocate PMP region for enclave
enc.pmp_region = allocate_pmp_region();
// 4. Configure PMP (M-mode only)
pmp_set_region(enc.pmp_region, enc.base, enc.size,
PMP_R | PMP_W | PMP_X); // RWX for enclave
// 5. Mark region as enclave-only (custom extension)
pmp_set_owner(enc.pmp_region, OWNER_ENCLAVE);
return enc;
}
// Enter enclave
void enter_enclave(struct enclave *enc) {
// Security monitor (M-mode) verifies caller
// Then switches PMP configuration and jumps to enclave
// Only enclave can access its memory
// OS cannot read/write enclave memory (PMP blocks it)
}
Keystone Features:
// 1. Memory Isolation
// - Each enclave gets dedicated PMP region
// - OS cannot access enclave memory
// - Enclaves cannot access each other
// 2. Attestation
// - Security monitor signs enclave measurements
// - Remote party can verify enclave authenticity
// 3. Sealing
// - Enclave-specific keys for persistent storage
// - Data encrypted to specific enclave
// 4. Open Source
// - MIT licensed
// - Customizable for different threat models
// - No proprietary firmware dependencies
Hardware-accelerated cryptography for better security and performance.
// Zkn: NIST algorithms
// - AES encryption/decryption
// - SHA-256/SHA-512
// - SM3/SM4 (Chinese standards)
// Example: one AES middle round (conceptual sketch; a full AES-128
// implementation expands the key with aes64ks1i/aes64ks2, runs nine
// middle rounds, and finishes with a final aes64es round)
void riscv_aes_encrypt_round(uint64_t state[2], const uint64_t rk[2]) {
    // AddRoundKey
    state[0] ^= rk[0];
    state[1] ^= rk[1];
    // aes64esm: SubBytes + ShiftRows + MixColumns on the 128-bit state,
    // computed one 64-bit half at a time from both halves
    uint64_t lo, hi;
    asm volatile("aes64esm %0, %1, %2" : "=r" (lo) : "r" (state[0]), "r" (state[1]));
    asm volatile("aes64esm %0, %1, %2" : "=r" (hi) : "r" (state[1]), "r" (state[0]));
    state[0] = lo;
    state[1] = hi;
    // Much faster than table-based software AES, and constant-time
}
// Zkb: Bit manipulation for crypto
// - Rotate, byte-reverse operations
// - Useful for implementing ciphers
| Feature | RISC-V | x86-64 | ARM64 |
|---|---|---|---|
| **Privilege Levels** | M/S/U (H optional) | Ring 0-3 | EL0-EL3 |
| **Memory Protection** | PMP (M-mode only) | Page tables | Page tables + TrustZone |
| **TEE** | Keystone (open) | SGX (removed from client CPUs; server-only) | TrustZone (built-in) |
| **Memory Encryption** | Extension needed | TME/TDX | CCA Realms (MTE is tagging only) |
| **Crypto Accel** | Zk* (optional) | AES-NI (standard) | Crypto extensions |
| **Flexibility** | High (modular) | Low (fixed) | Medium |
| **Maturity** | Developing | Mature | Mature |
| **Open Source** | Fully open | Proprietary | Some open |
Under Development:
// 1. IOPMP (I/O Physical Memory Protection)
// Like IOMMU but simpler, using PMP concepts
// 2. WorldGuard
// ARM TrustZone-like secure/non-secure worlds
// 3. Control Flow Integrity (Zicfiss)
// Shadow stack for return address protection
// 4. Pointer Masking
// Top-byte-ignore for memory tagging (like ARM MTE)
// 5. Capability Mode (CHERI)
// Hardware-enforced memory safety
// Fat pointers with bounds checking
[Sections 6.14-6.19 to continue...]
Modern computing increasingly relies on specialized accelerators—GPUs, TPUs, FPGAs, and custom ASICs—to achieve performance far beyond what general-purpose CPUs can provide. However, these accelerators introduce new security challenges that extend beyond traditional CPU memory protection.
Why GPUs Need Special Attention:
// Traditional CPU-only workflow:
void process_data(void *sensitive_data) {
// CPU operates on data in protected memory
// MMU enforces permissions
// Data never leaves CPU package unencrypted
compute(sensitive_data);
}
// Modern GPU-accelerated workflow:
void process_data_gpu(void *sensitive_data) {
// 1. CPU allocates GPU memory
void *gpu_mem;
cudaMalloc(&gpu_mem, size); // cudaMalloc returns the pointer via an out-parameter
// 2. Copy data to GPU (crosses PCIe bus!)
cudaMemcpy(gpu_mem, sensitive_data, size, cudaMemcpyHostToDevice);
// 3. GPU processes data
// - GPU has separate DRAM
// - Not protected by CPU MMU
// - Visible to: GPU driver, system software, DMA attacks
launch_kernel<<<blocks, threads>>>(gpu_mem);
// 4. Copy results back
cudaMemcpy(result, gpu_mem, size, cudaMemcpyDeviceToHost);
// Problem: Data exposed at multiple points!
}
Threat Vectors:
NVIDIA GPU Memory Hierarchy:
Key Differences from CPU:
| Feature | CPU | GPU |
|---|---|---|
| **MMU** | Full page-based protection | Limited/none |
| **Address Space** | Per-process isolation | Shared global memory |
| **Privilege Levels** | Ring 0/3, EL0-3, etc. | Minimal (user/kernel mode) |
| **Memory Encryption** | SME/TME | None (standard GPUs) |
| **IOMMU** | Can isolate from other devices | GPU itself not isolated |
Hardware Features (2022+):
NVIDIA H100 introduced confidential computing support for GPUs:
Key Mechanisms:
// 1. CPU-GPU Encrypted Link
// PCIe traffic encrypted with AES-256-GCM
// Prevents snooping on bus
// 2. GPU Memory Encryption
// GDDR6 memory encrypted at rest
// Similar to CPU SME/TME
// 3. Isolated Execution Contexts
// Multiple VMs can share GPU without seeing each other's data
// 4. Attestation
// Remote verification of GPU firmware and configuration
// Example: create a confidential GPU context (illustrative sketch; the
// property and flag names below are hypothetical -- real deployments
// query and enable CC mode via NVML / nvidia-smi)
cudaError_t create_confidential_context(void) {
    // Check that confidential computing mode is available
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.confidentialCompute) { // hypothetical field
        return cudaErrorNotSupported;
    }
    // Create an execution context; with CC mode on, its traffic is encrypted
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    return cudaSuccess;
}
Performance Overhead:
AMD MI300 Series (2023):
Similar to NVIDIA, AMD added memory encryption to data center GPUs:
// ROCm (Radeon Open Compute) API for secure compute
// 1. Check for encryption support (illustrative sketch: the property
//    name below is hypothetical, not a current ROCm API)
hipError_t check_encryption(void) {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);
    if (prop.memoryEncryption) { // hypothetical field
        printf("Memory encryption supported\n");
        return hipSuccess;
    }
    return hipErrorNotSupported;
}
// 2. Allocate encrypted memory
// (hipMallocEncrypted is an illustrative name, not a current ROCm API)
void *allocate_secure_gpu_memory(size_t size) {
void *ptr;
// Request encrypted allocation
hipError_t err = hipMallocEncrypted(&ptr, size);
if (err != hipSuccess) {
return NULL;
}
// Memory is encrypted in GPU DRAM
return ptr;
}
Unified Memory Architecture:
Apple's M-series chips integrate CPU and GPU on same die, sharing memory:
Security Advantages:
// No PCIe exposure: Data never leaves chip
void apple_gpu_compute(void *data, size_t size) {
// CPU and GPU share same memory
// No cudaMemcpy needed!
id<MTLBuffer> buffer = [device newBufferWithBytesNoCopy:data
length:size
options:MTLResourceStorageModeShared];
// GPU accesses same physical pages as CPU
// Protected by same MMU/IOMMU
// No additional exposure
}
Disadvantages:
Benchmark: Confidential GPU Computing:
// NVIDIA H100 with confidential computing
void benchmark_confidential_gpu(void) {
const size_t size = 1024 * 1024 * 1024; // 1 GB
// Standard (unencrypted) mode
float *d_data;
cudaMalloc(&d_data, size);
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
// Bandwidth: ~900 GB/s (H100)
// Compute: 2000 TFLOPS
// Confidential (encrypted) mode
// (cudaMallocEncrypted/cudaMemcpyEncrypted are illustrative names;
//  with CC mode enabled, encryption is applied transparently)
float *d_secure;
cudaMallocEncrypted(&d_secure, size);
cudaMemcpyEncrypted(d_secure, h_data, size);
// Bandwidth: ~810 GB/s (~10% slower)
// Compute: ~1850 TFLOPS (~7.5% slower)
// Overall overhead: 5-15% depending on workload
}
When to Use Confidential GPU:
✅ Use when:
❌ Avoid when:
Modern systems combine multiple processor types—CPUs, GPUs, FPGAs, TPUs, NPUs—in complex heterogeneous architectures. Securing these systems requires coordinating protection across fundamentally different processor designs.
Multiple Processors, Multiple Security Models:
Key Security Challenges:
NVIDIA Grace-Hopper Superchip:
Cache Coherency Security Implications:
// Problem: CPU caches can hold GPU data (and vice versa)
void coherency_security_issue(void) {
    // CPU allocates sensitive data
    int *cpu_data = malloc(4096);
    cpu_data[0] = SECRET_KEY;

    // GPU kernel reads the data (coherent access)
    gpu_kernel<<<1,1>>>(cpu_data);

    // The data now lives in:
    // 1. CPU L1/L2/L3 caches
    // 2. GPU L2 cache
    // 3. DRAM on both sides (CPU-attached and GPU-attached)

    // Security question: how do we ensure cache lines
    // are properly protected across domains?
}
Solution: Coherent Memory Encryption:
// Grace-Hopper implements end-to-end encryption
// across coherent link
// 1. CPU-side: Protected by CPU MMU + SME/TME
// Data encrypted in CPU caches and DRAM
// 2. NVLink-C2C: Encrypted link between CPU and GPU
// Prevents snooping on interconnect
// 3. GPU-side: Protected by GPU encryption
// Data encrypted in GPU caches and HBM
// Result: Coherent access without security compromise
Performance:
Integrated CPU+GPU on Single Package:
Unified Security Model:
// AMD MI300A: CPU and GPU share the same page tables

// CPU sets up a page table entry with protections
uint64_t *pte = get_pte(addr);
*pte = make_pte(phys_addr, PTE_P | PTE_W | PTE_NX);

// GPU accesses the same PTE; hardware enforces the same
// permissions for GPU accesses. No separate GPU page table needed!

// Benefit: consistent protection model
// Trade-off: the GPU must walk the same page tables (slower)
M3 Ultra Example:
Security Advantages:
// 1. No data copies between CPU and GPU
// → No PCIe exposure
// → Faster and more secure
// 2. Single MMU/IOMMU
// → Consistent protection model
// → Easier to verify security
// 3. On-package design
// → Physical security: can't intercept traffic
// → DMA attacks much harder
// Example: Metal (Apple's GPU API)
void apple_unified_security(void) {
    // Allocate memory (shared CPU/GPU)
    id<MTLBuffer> buffer = [device newBufferWithLength:size
                                               options:MTLResourceStorageModeShared];

    // CPU writes to the buffer
    memcpy(buffer.contents, data, size);

    // GPU reads the same buffer:
    // - No copy needed
    // - Same page table protections apply
    // - IOMMU enforces access control
    [commandEncoder setBuffer:buffer offset:0 atIndex:0];
    [commandEncoder dispatchThreads:...];
}
1. Unified Trust Boundary:
// Good: single security perimeter
void secure_heterogeneous_good(void) {
    // Use a CPU TEE (TDX/SEV-SNP) to protect the entire system
    create_confidential_vm();

    // GPU operates within the TEE boundary:
    // - Encrypted link to GPU
    // - GPU memory encrypted
    // - Attestation covers CPU+GPU
    compute_on_gpu(sensitive_data);
}

// Bad: multiple separate security boundaries
void secure_heterogeneous_bad(void) {
    // CPU has one protection scheme
    cpu_encrypt(data);

    // Transfer to GPU (different scheme)
    transfer_to_gpu(data); // Vulnerable during transfer!

    // GPU has a different protection scheme
    gpu_encrypt(data);

    // Multiple boundaries = multiple attack surfaces
}
2. End-to-End Encryption:
// Encrypt data path from source to accelerator
// Source (encrypted in CPU enclave)
void *cpu_enclave_data = allocate_in_enclave(size);
// Transfer (encrypted link)
secure_transfer_to_gpu(cpu_enclave_data, gpu_buffer, size);
// Destination (encrypted in GPU memory)
// GPU processes without exposing plaintext
// No point in data path exposes cleartext
3. Minimize Trust in Drivers:
// Problem: GPU drivers run in kernel, highly privileged
// Solution: Minimal trusted driver + user-space library
// Trusted (kernel driver: minimal code)
int gpu_map_memory(void *addr, size_t size) {
    // Only does: memory mapping, DMA setup
    // Does NOT: touch data, parse commands
    return map_gpu_region(addr, size);
}

// Untrusted (user-space library: complex code)
void launch_kernel(kernel_t kernel, void *args) {
    // Complex logic lives in user space.
    // If compromised: limited damage (no kernel privileges)
    prepare_command_buffer(kernel, args);
    submit_to_gpu();
}
4. Attestation Across Devices:
// Verify the security of the entire heterogeneous system
struct system_attestation {
    uint8_t cpu_measurement[32];
    uint8_t gpu_measurement[32];
    uint8_t fpga_bitstream_hash[32];
    uint8_t interconnect_config[32];
};

bool attest_heterogeneous_system(struct system_attestation *attest) {
    // 1. CPU TEE attestation
    get_cpu_attestation(attest->cpu_measurement);
    // 2. GPU firmware attestation
    get_gpu_attestation(attest->gpu_measurement);
    // 3. FPGA bitstream verification
    get_fpga_attestation(attest->fpga_bitstream_hash);
    // 4. Interconnect security configuration
    get_interconnect_config(attest->interconnect_config);

    // Sign the entire attestation; a remote verifier
    // then checks all components together.
    sign_attestation(attest);
    return verify_all_components(attest);
}
Every security feature has a cost. Understanding these trade-offs is essential for making informed decisions about which protections to enable in production systems.
| Feature | Overhead | Recommendation | When to Reconsider |
|---|---|---|---|
| **NX/XD/XN Bit** | ~0% | ✅ Always | Never disable |
| **SMEP** | <1% | ✅ Always | Never disable |
| **SMAP** | <1% | ✅ Always | Never disable |
| **PAN (ARM)** | <1% | ✅ Always | Never disable |
| **KPTI** | 5-30% | ⚠️ If vulnerable CPU | Disable on patched CPUs |
| **MPK** | <1% | ✅ When available | App-specific |
| **MTE (ARM)** | 5-15% | 🤔 Depends | Development/critical systems |
| **BTI (ARM)** | <1% | ✅ Always | Never disable |
| **SME/SEV** | <1-5% | ✅ Cloud/untrusted | Disable on trusted hardware |
| **GPU Encryption** | 5-15% | 🤔 Depends | Sensitive data only |
1. KPTI (Kernel Page-Table Isolation):
// Benchmark: system call overhead with/without KPTI
void benchmark_kpti_impact(void) {
    const int iterations = 1000000;

    // Measure getpid() (a minimal syscall)
    uint64_t start = rdtsc();
    for (int i = 0; i < iterations; i++) {
        getpid();
    }
    uint64_t cycles = (rdtsc() - start) / iterations;

    // Results:
    //   Without KPTI: ~100 cycles
    //   With KPTI:    ~150-180 cycles
    //   Overhead: 50-80% per syscall!

    // But total system impact depends on syscall frequency:
    //   CPU-bound workload: <5% (few syscalls)
    //   I/O-bound server:   10-20% (many syscalls)
    //   Database:           15-30% (heavy syscall use)
}
When to Disable KPTI:
// Check whether this CPU still needs KPTI
bool needs_kpti(void) {
    uint32_t eax, ebx, ecx, edx;

    // Intel: CPUID.(EAX=7,ECX=0):EDX[29] advertises the
    // IA32_ARCH_CAPABILITIES MSR; RDCL_NO is bit 0 of that MSR
    // (reading the MSR requires ring 0).
    cpuid(7, 0, &eax, &ebx, &ecx, &edx);
    if (edx & (1u << 29)) {
        uint64_t caps = rdmsr(IA32_ARCH_CAPABILITIES);
        if (caps & 1) {       // RDCL_NO: not vulnerable to Meltdown
            return false;     // Safe to disable KPTI
        }
    }

    // AMD CPUs: not vulnerable to Meltdown
    if (cpu_vendor() == AMD) {
        return false;
    }

    return true; // Enable KPTI
}
2. Memory Tagging (ARM MTE):
// MTE provides memory safety, but at a cost
void benchmark_mte_overhead(void) {
    const size_t size = 1024 * 1024 * 1024; // 1 GB

    // Without MTE:
    //   malloc():      50 μs
    //   memcpy():      2.5 GB/s
    //   Random access: 60 ns per access

    // With MTE (synchronous mode):
    //   malloc():      55 μs (+10% for tagging)
    //   memcpy():      2.3 GB/s (-8% for tag checks)
    //   Random access: 65 ns per access (+8%)

    // With MTE (asynchronous mode):
    //   malloc():      52 μs (+4%)
    //   memcpy():      2.45 GB/s (-2%)
    //   Random access: 61 ns per access (+2%)

    // Trade-off:
    //   Sync:  better debugging (immediate faults)
    //   Async: better performance (delayed faults)
}
When to Use MTE:
✅ Enable for:
❌ Disable for:
3. Confidential Computing (SEV-SNP, TDX):
// Cloud VM with confidential computing
void benchmark_confidential_vm(void) {
    // Regular VM:
    //   Network I/O: 100 Gbps
    //   Disk I/O:    10 GB/s
    //   Memory:      200 GB/s
    //   CPU:         2.5 GHz (base)

    // Confidential VM (SEV-SNP):
    //   Network I/O: 95 Gbps  (-5%, encryption)
    //   Disk I/O:    9.5 GB/s (-5%, encryption)
    //   Memory:      190 GB/s (-5%, RMP checks)
    //   CPU:         2.45 GHz (-2%, overhead)

    // Overall: 1-5% performance cost
    // Benefit: complete isolation from the cloud provider

    // Decision:
    //   Use if:  processing sensitive data (healthcare, finance)
    //   Skip if: public data, owned hardware
}
Stacking Security Features:
// Real-world server configuration
void production_server_config(void) {
    // Base performance:  100%
    // Enable NX bit:     100%  (no cost)
    // Enable SMEP:       99.5% (-0.5%)
    // Enable SMAP:       99.0% (-0.5%)
    // Enable KPTI:       85.0% (-14%)
    // Enable SEV-SNP:    81.0% (-4%)
    // Final performance: ~81% of baseline

    // Security gain: protected against:
    //   - Code injection attacks (NX)
    //   - Kernel exploits (SMEP/SMAP)
    //   - Spectre/Meltdown (KPTI)
    //   - Malicious hypervisor (SEV-SNP)

    // Worth it? Depends on threat model and requirements
}
Optimization Strategies:
// 1. Selective protection
void selective_security(void) {
    // Don't protect everything equally

    // Critical data: full protection
    enable_all_security_features(payment_processing);

    // Internal services: moderate protection
    enable_basic_security(logging_service);

    // Public data: minimal protection
    enable_nx_only(static_website);
}

// 2. Hardware acceleration
void use_hardware_acceleration(void) {
    // Modern CPUs have dedicated units:
    //   AES-NI:                 hardware AES encryption (free)
    //   SHA extensions:         hardware hashing (free)
    //   Pointer authentication: hardware CFI (cheap)
    // Always enable hardware-accelerated security
}

// 3. Batch operations
void batch_security_operations(void) {
    // Amortize overhead across multiple operations

    // Bad: individual mprotect calls
    for (int i = 0; i < 1000; i++) {
        mprotect(pages[i], 4096, PROT_READ); // 1000 TLB shootdowns!
    }

    // Good: assign a protection key once, then toggle access cheaply
    for (int i = 0; i < 1000; i++) {
        pkey_mprotect(pages[i], 4096, PROT_READ, pkey);
    }
    pkey_set(pkey, PKEY_DISABLE_WRITE); // later changes: 1 register write!
}
Based on decades of security research and real-world deployments, here are proven best practices for memory protection.
Layer Multiple Protections:
// Don't rely on a single mechanism
void defense_in_depth_example(void) {
    // Layer 1: NX bit (prevent code execution on stack/heap)
    // Layer 2: ASLR (randomize addresses)
    // Layer 3: Stack canaries (detect buffer overflows)
    // Layer 4: SMEP/SMAP (kernel hardening)
    // Layer 5: Control-flow integrity (prevent ROP)

    // If an attacker bypasses one layer, the others still protect
}
Principle: Assume every protection can be bypassed. Multiple layers increase attack difficulty exponentially.
Reduce Attack Surface:
// Bad: large TCB (Trusted Computing Base)
void large_tcb_bad(void) {
    // The entire OS kernel is trusted:
    //   - Millions of lines of code
    //   - Many bugs, large attack surface
}

// Good: small TCB
void small_tcb_good(void) {
    // Only the security monitor is trusted:
    //   - Thousands of lines (Keystone: ~5K)
    //   - Easier to verify and audit
    //   - Fewer bugs, smaller attack surface
}
Example: Keystone vs SGX:
Default-Deny Policies:
// Good: deny by default
void secure_permission_check(void *addr, access_type_t type) {
    // Start by denying everything
    bool allowed = false;

    // Explicitly allow what's needed
    if (is_in_permitted_range(addr) && has_permission(type)) {
        allowed = true;
    }

    // Fail closed
    if (!allowed) {
        raise_fault();
    }
}

// Bad: allow by default
void insecure_permission_check(void *addr, access_type_t type) {
    // Start by allowing everything
    bool denied = false;

    // Deny only specific things
    if (is_in_forbidden_range(addr)) {
        denied = true;
    }

    // Fail open (dangerous!)
    if (!denied) {
        allow_access();
    }
}
Test Security, Not Just Functionality:
// Security unit tests
void test_memory_isolation(void) {
    // Test 1: user cannot access kernel memory
    assert_fault(user_read_kernel_address(KERNEL_DATA));

    // Test 2: NX bit prevents execution
    assert_fault(execute_on_stack(stack_buffer));

    // Test 3: W^X enforced (on kernels that forbid RWX mappings)
    void *page = mmap(NULL, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, ...);
    assert(page == MAP_FAILED);

    // Test 4: SMEP prevents the kernel executing user code
    assert_fault(kernel_jump_to_user_page(user_code));
}
Don't Disable for "Performance":
// Common mistakes to avoid

// WRONG: disable NX for "speed"
void disable_nx_wrong(void) {
    // Saves:  0% (NX has no runtime cost!)
    // Loses:  protection against code injection
    // Verdict: never do this
}

// WRONG: disable ASLR for "reproducibility"
void disable_aslr_wrong(void) {
    // Saves:  ~0% (ASLR has negligible runtime cost)
    // Loses:  makes exploits vastly easier
    // Verdict: only disable in development, never in production
}

// ACCEPTABLE: disable KPTI on patched CPUs
void disable_kpti_acceptable(void) {
    if (cpu_has_hardware_mitigation_for_meltdown()) {
        disable_kpti();
        // Saves:  5-15%
        // Loses:  nothing (hardware provides equivalent protection)
        // Verdict: OK if the CPU is patched
    }
}
Log Security Events:
// Track security-relevant events:
//   - Page faults (access violations)
//   - Privilege violations
//   - Protection key violations
//   - Failed attestations

// Example: page fault handler that logs and alerts
void page_fault_handler(uintptr_t addr, error_code_t err) {
    if (err & PF_PROTECTION_VIOLATION) {
        // Log: timestamp, process, address, access type
        log_security_event(LOG_PROTECTION_FAULT, addr, err);

        // Alert if the fault matches a suspicious pattern
        if (looks_like_exploit(addr, err)) {
            alert_security_team();
        }
    }
}
Learn from others' mistakes. Here are common memory protection failures and how to prevent them.
Problem: Relying only on page tables for security.
// WRONG: page tables alone aren't enough
void insufficient_protection(void) {
    // Set up page tables with proper permissions
    set_page_permissions(addr, PROT_READ | PROT_WRITE);

    // Remaining vulnerabilities:
    //   1. Speculative execution (Meltdown/Spectre)
    //   2. DMA attacks (if no IOMMU)
    //   3. Physical memory attacks (if no encryption)
    //   4. Hypervisor attacks (if no confidential computing)
}

// CORRECT: layer multiple protections
void sufficient_protection(void) {
    // 1. Page tables (basic protection)
    set_page_permissions(addr, PROT_READ | PROT_WRITE);
    // 2. KPTI (Meltdown mitigation)
    enable_kpti();
    // 3. IOMMU (DMA protection)
    enable_iommu_for_device(pci_device);
    // 4. Memory encryption (physical attacks)
    enable_memory_encryption();
    // 5. Confidential computing (hypervisor attacks)
    create_encrypted_vm();
}
Problem: Changing permissions without flushing TLB.
// WRONG: TLB not flushed
void permission_change_wrong(void *addr) {
    // Change the page permission
    uint64_t *pte = get_pte(addr);
    *pte |= PTE_NX; // Make non-executable

    // BUG: the TLB still holds the old permissions!
    // Code can keep executing until the entry is naturally evicted.
}

// CORRECT: always flush the TLB
void permission_change_correct(void *addr) {
    uint64_t *pte = get_pte(addr);
    *pte |= PTE_NX;

    // Flush the TLB entry
    invlpg(addr);                      // x86
    // or: tlbi(addr); dsb(); isb();  // ARM (invalidate, then barriers)
    // or: sfence_vma(addr);          // RISC-V
}
}
Problem: Reusing memory between different security contexts.
// WRONG: reuse memory without clearing it
void memory_reuse_wrong(void) {
    // First use: a secret is stored
    void *secret = malloc(4096);
    memcpy(secret, password, 16);
    free(secret);

    // A later allocation reuses the same heap chunk
    void *data = malloc(4096);
    // BUG: data may still contain the password!
    // (The same hazard exists across processes if the kernel
    // hands out physical pages without zeroing them.)
}

// CORRECT: clear memory between uses
void memory_reuse_correct(void) {
    void *secret = malloc(4096);
    memcpy(secret, password, 16);

    // Clear before freeing
    explicit_bzero(secret, 4096); // Not optimized away
    free(secret);
}
Problem: Dereferencing user pointers in kernel without validation.
// WRONG: trust a user pointer
void kernel_bug(void *user_ptr) {
    // Kernel directly dereferences a user-provided pointer
    int value = *(int *)user_ptr; // Dangerous!

    // Exploits:
    //   1. user_ptr = kernel address → leak kernel data
    //   2. user_ptr = NULL          → kernel NULL deref crash
    //   3. user_ptr = unmapped      → kernel page fault
}

// CORRECT: validate and use safe copy functions
int kernel_safe(void *user_ptr) {
    int value;

    // 1. Check the pointer lies in user space
    if (!access_ok(user_ptr, sizeof(int))) {
        return -EFAULT;
    }

    // 2. Use a safe copy function
    if (copy_from_user(&value, user_ptr, sizeof(int))) {
        return -EFAULT; // Faulted safely
    }

    // 3. Now safe to use the value
    process(value);
    return 0;
}
Problem: Implementing functional security but ignoring timing attacks.
// WRONG: timing-dependent secret comparison
bool password_check_wrong(const char *input, const char *correct) {
    // Stops at the first mismatch
    for (size_t i = 0; i < strlen(correct); i++) {
        if (input[i] != correct[i]) {
            return false; // Early exit leaks information!
        }
    }
    return true;
}

// CORRECT: constant-time comparison
// (assumes both buffers are MAX_PASSWORD_LEN bytes, zero-padded)
bool password_check_correct(const char *input, const char *correct) {
    int diff = 0;

    // Always compare every byte
    for (int i = 0; i < MAX_PASSWORD_LEN; i++) {
        diff |= input[i] ^ correct[i];
    }
    return diff == 0; // No early exit
}
Memory protection is the foundation of system security. Everything we've covered builds on this core principle: isolate, protect, and verify access to memory.
Essential Mechanisms:
Critical Insights:
1. Hardware Memory Safety:
// Trend: CPUs with built-in bounds checking
// ARM MTE is just the beginning
// Future: Full capability-based security (CHERI)
void *capability_pointer = malloc(size);
// Pointer includes: address, bounds, permissions
// Hardware enforces: cannot access outside bounds
// Zero spatial memory errors at hardware level
2. Always-On Confidential Computing:
// Trend: Encryption becomes default, not optional
// Today: Opt-in confidential VMs
create_encrypted_vm(); // Explicit
// Future: All VMs encrypted by default
create_vm(); // Implicitly encrypted
3. Unified Cross-Device Security:
// Trend: Consistent protection across CPU/GPU/FPGA
// Today: Each device has different security model
protect_cpu_memory(); // Page tables + KPTI
protect_gpu_memory(); // Maybe encrypted, maybe not
protect_fpga_memory(); // Custom logic
// Future: Unified security framework
protect_device_memory(cpu); // Same API
protect_device_memory(gpu); // Same guarantees
protect_device_memory(fpga); // Same verification
4. Formal Verification:
// Trend: mathematically proven security

// Today: test and hope
if (test_passes(security_feature)) {
    deploy(); // Hope no bugs remain
}

// Future: formally verified security
if (prove_secure(security_monitor)) {
    deploy(); // Mathematically certain
}

// Example: seL4 microkernel (formally verified)
//   - ~10,000 lines of code
//   - Mathematically proven free of buffer overflows
//   - Proven to correctly enforce isolation
Memory protection has evolved from simple base/bound registers (1960s) to sophisticated multi-layered defenses (2020s). Yet the fundamental goal remains: ensure that code can only access memory it's authorized to access.
The technologies covered in this chapter—from basic page table permissions to encrypted confidential VMs—represent humanity's collective effort to build secure systems. Each mechanism emerged from:
As you design and build systems, remember:
The future of memory protection looks promising: hardware memory safety, universal encryption, formal verification. But the principles remain timeless: isolate, protect, verify.
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1. Chapter 4: "Paging" and Section 4.6: "Access Rights." 2024.
ARM Limited. ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile. ARM DDI 0487J.a. March 2023. Chapter D5: "The AArch64 Virtual Memory System Architecture."
RISC-V International. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture. Version 20211203. December 2021. Chapter 3: "Machine-Level ISA" and Chapter 4: "Supervisor-Level ISA."
Denning, Peter J. "Virtual memory." ACM Computing Surveys (CSUR) 2.3 (1970): 153-189. DOI: 10.1145/356571.356573
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. Section 4.6: "Access Rights" (XD bit). 2024.
AMD. AMD64 Architecture Programmer's Manual, Volume 2: System Programming. Publication #24593. Section 5.3.1: "No-Execute Page Protection." 2023.
ARM Limited. ARM Architecture Reference Manual ARMv8. Section D5.4.5: "Execute-never controls and instruction fetching." 2023.
One, Aleph. "Smashing the Stack for Fun and Profit." Phrack Magazine 7.49 (1996). [Classic buffer overflow exploit paper]
Solar Designer. "Getting around non-executable stack (and fix)." Bugtraq mailing list. August 1997. [Return-to-libc attacks]
Shacham, Hovav. "The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86)." Proceedings of the 14th ACM conference on Computer and communications security (CCS 2007). ACM, 2007. DOI: 10.1145/1315245.1315313 [ROP attacks]
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. Chapter 5: "Protection." 2024.
ARM Limited. ARM Architecture Reference Manual ARMv8. Chapter D1: "The AArch64 System Level Programmers' Model." 2023.
Goldberg, Robert P. "Survey of virtual machine research." Computer 7.6 (1974): 34-45. DOI: 10.1109/MC.1974.6323581 [Early virtualization and privilege levels]
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. Section 4.6: "Access Rights" (SMEP/SMAP). 2024.
Kemerlis, Vasileios P., et al. "kGuard: Lightweight kernel protection against return-to-user attacks." 22nd USENIX Security Symposium. 2013. Pages 459-474.
Pomonis, Marios, et al. "kR^X: Comprehensive kernel protection against just-in-time code reuse." Proceedings of the Twelfth European Conference on Computer Systems (EuroSys 2017). ACM, 2017. DOI: 10.1145/3064176.3064216
ARM Limited. ARM Architecture Reference Manual ARMv8, Supplement: The Armv8.5 Memory Tagging Extension. ARM DDI 0487F.c. 2020.
Serebryany, Konstantin. "ARM Memory Tagging Extension and How It Improves C/C++ Memory Safety." 2020 Security Symposium. USENIX, 2020.
ARM Limited. "Armv8.5-A Memory Tagging Extension White Paper." 2019.
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. Section 4.6.2: "Protection Keys." 2024.
Hedayati, Mohammad, et al. "Hodor: Intra-process isolation for high-throughput data plane libraries." 2019 USENIX Annual Technical Conference (ATC 19). 2019. Pages 489-504.
Park, Soyeon, et al. "libmpk: Software abstraction for Intel memory protection keys." 2019 USENIX Annual Technical Conference (ATC 19). 2019. Pages 241-254.
Vahldiek-Oberwagner, Anjo, et al. "ERIM: Secure, efficient in-process isolation with protection keys (MPK)." 28th USENIX Security Symposium. 2019. Pages 1221-1238.
ARM Limited. ARM Security Technology: Building a Secure System using TrustZone Technology. ARM PRD29-GENC-009492C. 2009.
Ngabonziza, Bernard, et al. "TrustZone explained: Architectural features and use cases." 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC). IEEE, 2016. DOI: 10.1109/CIC.2016.065
ARM Limited. ARMv8-A Architecture and Processors: Trusted Base System Architecture for ARMv8-M. 2018.
Costan, Victor, and Srinivas Devadas. "Intel SGX explained." IACR Cryptology ePrint Archive 2016 (2016): 86.
McKeen, Frank, et al. "Innovative instructions and software model for isolated execution." Proceedings of the 2nd International Workshop on Hardware and Architectural Support for Security and Privacy. 2013. DOI: 10.1145/2487726.2488368
Intel Corporation. Intel® Software Guard Extensions (Intel® SGX) Developer Reference for Linux OS. 2020.
Van Bulck, Jo, et al. "Foreshadow: Extracting the keys to the Intel SGX kingdom with transient out-of-order execution." 27th USENIX Security Symposium. 2018. Pages 991-1008. [SGX vulnerability]
AMD. AMD SEV-SNP: Strengthening VM Isolation with Integrity Protection and More. White Paper #55766. January 2020.
AMD. AMD Secure Encrypted Virtualization API Version 0.24. Publication #55766. 2020.
Kaplan, David, Jeremy Powell, and Tom Woller. "AMD memory encryption." White Paper (2016).
Li, Mengyuan, et al. "CIPHERLEAKS: Breaking constant-time cryptography on AMD SEV via the ciphertext side channel." 31st USENIX Security Symposium. 2022. Pages 717-732. [SEV vulnerability research]
Intel Corporation. Intel® Trust Domain Extensions (Intel® TDX) Module v1.5 Architecture Specification. Document Number: 344425-004US. March 2023.
Intel Corporation. "Intel Trust Domain Extensions." White Paper. 2020.
Intel Corporation. "Intel TDX: Protect Confidential Computing Workloads from Software and Hardware Attacks." 2021.
ARM Limited. ARM Confidential Compute Architecture. 2021.
ARM Limited. "Introducing Arm Confidential Compute Architecture." White Paper. 2021.
ARM Limited. Arm Realm Management Extension (RME) Architecture Specification. ARM DDI 0615A. 2022.
AMD. AMD64 Architecture Programmer's Manual, Volume 2: System Programming. Chapter 7: "Secure Memory Encryption." Publication #24593. 2023.
Kaplan, David. "Protecting VM register state with SEV-ES." AMD White Paper (2017).
AMD. Secure Encrypted Virtualization API. Publication #55766. Rev 0.24. 2020.
RISC-V International. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture. Section 3.6: "Physical Memory Protection." Version 20211203. December 2021.
RISC-V International. RISC-V Physical Memory Protection (PMP) Enhancement (ePMP). Draft Specification. 2021.
Lee, Dayeol, et al. "Keystone: An open framework for architecting trusted execution environments." Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys 2020). 2020. DOI: 10.1145/3342195.3387532
Weiser, Samuel, et al. "TIMBER-V: Tag-isolated memory bringing fine-grained enclaves to RISC-V." Network and Distributed Systems Security Symposium (NDSS 2019). 2019.
NVIDIA Corporation. NVIDIA H100 Tensor Core GPU Architecture. White Paper WP-10026-001_v01. 2022.
NVIDIA Corporation. NVIDIA Confidential Computing. White Paper. 2022.
AMD. AMD Instinct MI300 Architecture. White Paper. 2023.
Volos, Stavros, et al. "Graviton: Trusted execution environments on GPUs." 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2018). 2018. Pages 681-696.
Jang, Insu, et al. "Heterogeneous isolated execution for commodity GPUs." Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2019). 2019. DOI: 10.1145/3297858.3304021
NVIDIA Corporation. NVIDIA Grace Hopper Superchip Architecture. White Paper. 2023.
AMD. AMD Instinct MI300A APU Architecture. White Paper. 2023.
Apple Inc. Apple M3 Technical Overview. 2023.
Pichai, Bharath, et al. "Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces." ACM SIGPLAN Notices 49.4 (2014): 743-758. DOI: 10.1145/2644865.2541940
Kocher, Paul, et al. "Spectre attacks: Exploiting speculative execution." 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019. DOI: 10.1109/SP.2019.00002
Lipp, Moritz, et al. "Meltdown: Reading kernel memory from user space." 27th USENIX Security Symposium. 2018. Pages 973-990.
Gruss, Daniel, et al. "KASLR is dead: Long live KASLR." International Symposium on Engineering Secure Software and Systems. Springer, 2017. DOI: 10.1007/978-3-319-62105-0_11
The Linux Kernel Organization. "Page Table Isolation (PTI)." Documentation/x86/pti.rst. 2018.
Saltzer, Jerome H., and Michael D. Schroeder. "The protection of information in computer systems." Proceedings of the IEEE 63.9 (1975): 1278-1308. DOI: 10.1109/PROC.1975.9939 [Classic security principles]
Anderson, Ross J. Security engineering: a guide to building dependable distributed systems. John Wiley & Sons, 2020. Third edition.
Klein, Gerwin, et al. "seL4: Formal verification of an OS kernel." Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP 2009). ACM, 2009. DOI: 10.1145/1629575.1629596 [Formally verified microkernel]
Ge, Xinyang, et al. "Griffin: Guarding control flows using intel processor trace." Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2017). 2017. DOI: 10.1145/3037697.3037716
Silberschatz, Abraham, Peter Baer Galvin, and Greg Gagne. Operating System Concepts. 10th edition. Wiley, 2018. Chapter 9: "Virtual Memory."
Tanenbaum, Andrew S., and Herbert Bos. Modern Operating Systems. 4th edition. Pearson, 2015. Chapter 3: "Memory Management."
Bryant, Randal E., and David R. O'Hallaron. Computer Systems: A Programmer's Perspective. 3rd edition. Pearson, 2015. Chapter 9: "Virtual Memory."
Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. 6th edition. Morgan Kaufmann, 2017. Appendix B: "Review of Memory Hierarchy."