In Chapters 3, 4, and 5, we explored the mechanisms of virtual memory: how page tables translate addresses (Chapter 3), how TLBs accelerate those translations (Chapter 4), and how IOMMUs extend protection to devices (Chapter 5). We've seen the intricate hardware structures, the multi-level translation hierarchies, and the performance optimizations that make modern virtual memory work. But we've largely treated memory protection as a footnote—a few bits in the page table entries that enable or disable certain accesses.
This chapter reveals why those bits matter more than everything else combined.
Without the protection mechanisms we'll explore here, all the sophisticated virtual memory infrastructure we've studied would be nothing more than an elaborate address mapping system. The four-level page tables? Just an overcomplicated way to map virtual to physical addresses. The TLB? A cache that makes the mapping faster. The IOMMU? Hardware that does the same mapping for devices. None of it would provide any security without the protection bits and enforcement mechanisms that are the focus of this chapter.
Consider what we learned in previous chapters:
From Chapter 3 (Page Tables): You now understand that every memory access goes through a page table entry (PTE). Each PTE contains:
From Chapter 4 (TLB): You know that the TLB caches these page table entries for performance, reducing translation time from 50-200 cycles to effectively zero for TLB hits. But the TLB doesn't just cache addresses—it caches permissions too. Every TLB entry includes the R/W/X bits, the U/S bit, and other protection flags. The hardware checks these on every single memory access.
From Chapter 5 (IOMMU): You learned that devices need the same protection as CPUs—they get their own page tables and TLBs through the IOMMU. But why? Because without protection, a malicious or buggy device could read kernel memory, overwrite page tables, or access another VM's data. The IOMMU's entire purpose is to enforce the same protection mechanisms we'll study in this chapter.
What This Chapter Adds:
This chapter transforms your understanding of virtual memory from "a mechanism for address translation" to "the foundation of system security." We'll answer questions that previous chapters took for granted:
Modern computer systems face a fundamental security challenge: how do we allow multiple programs to run simultaneously while preventing them from interfering with each other's memory?
Without the protection mechanisms in this chapter:
// Scenario 1: Any user program could do this
int *kernel_memory = (int *)0xffff8000deadbeef;
*kernel_memory = 0x90909090; // Overwrite kernel code
// Result: Instant privilege escalation
// Scenario 2: A simple bug becomes a catastrophe
char buffer[100];
gets(buffer); // Buffer overflow
// Overwrites return address → executes attacker's shellcode
// Result: Complete system compromise
// Scenario 3: Process isolation is meaningless
void *other_process = (void *)0x400000; // Another process's memory
read_password(other_process); // Read their password
// Result: No isolation between processes
Every example above would succeed if we had the virtual memory infrastructure from Chapters 3-5 but without the protection mechanisms from this chapter. The page tables would translate the addresses. The TLB would cache them. The CPU would execute the instructions. But there would be no security, no isolation, no privilege separation—just a very fast address translation mechanism protecting nothing.
The MMU enforces four fundamental security properties that make modern computing possible:
1. Isolation (Enables: Multi-process systems) Each process operates in its own address space, unable to read or modify other processes' memory. This is implemented through the U/S bit in PTEs (which we barely mentioned in Chapter 3) combined with separate page table hierarchies per process.
Chapter 3 showed you: How to walk page tables and translate addresses. This chapter shows you: Why each process has separate page tables, and what happens when a user process tries to access kernel memory (spoiler: instant #PF).
2. Privilege Separation (Enables: Operating systems) The system distinguishes between kernel code (privileged) and user code (unprivileged). This requires the CPU privilege rings/levels we'll study, combined with the U/S bit in every page table entry.
Chapter 3 mentioned: The U/S bit exists. This chapter reveals: How x86 rings, ARM exception levels, and RISC-V modes work together with page table permissions to create the kernel/user boundary. And why Meltdown broke this boundary (requiring KPTI).
3. Execute Protection (Enables: Defense against code injection) Memory pages can be marked as non-executable, preventing data (like stack buffers or heap allocations) from being executed as code. This single bit—NX/XD/XN—defeats entire classes of exploits.
Chapter 3 listed: The NX bit as one of many PTE flags. This chapter proves: Why it's the most important security bit ever added to processors, preventing 70% of buffer overflow exploits at zero performance cost.
4. Access Control (Enables: Fine-grained security) Fine-grained control over read, write, and execute permissions enables sophisticated security policies, from W^X (write XOR execute) to protection keys (fast domain switching) to memory tagging (hardware memory safety).
Previous chapters assumed: Permission bits are checked somehow. This chapter details: Exactly how hardware checks permissions on every access, what happens when checks fail, and how modern features (SMEP, SMAP, MTE, PKU) add defense-in-depth.
Let's make the connections explicit:
Building on Chapter 3 (Page Tables):
Building on Chapter 4 (TLB):
Building on Chapter 5 (IOMMU):
Meltdown (CVE-2017-5754):
; Speculative execution attack (simplified)
mov rax, [kernel_address] ; Speculatively reads kernel memory
mov rbx, [rax * 4096] ; Uses data to access different cache line
; Even though the first instruction faults, cache timing reveals the data!
This exploit worked because:
The fix (KPTI) required fundamentally rethinking how we use page tables—maintaining two sets per process, switching on every system call. This is a pure Chapter 6 concept: using page table structure (Chapter 3) for security (Chapter 6), accepting the TLB flush cost (Chapter 4).
Pre-IOMMU DMA Attacks:
Before IOMMUs existed, a $20 malicious USB device could:
IOMMUs (Chapter 5) solve this by enforcing the same protections (Chapter 6) that the CPU uses. Without Chapter 6's protection model, Chapter 5's IOMMU would be pointless.
We'll explore memory protection mechanisms across three major architectures (x86-64, ARM64, RISC-V) and modern heterogeneous systems:
Basic Protection (Sections 6.2-6.5):
Advanced Features (Sections 6.6-6.9):
Confidential Computing (Sections 6.10-6.13):
Modern Systems (Sections 6.14-6.15):
Synthesis (Sections 6.16-6.19):
If you remember only one thing from this book, remember this: Virtual memory is not about address translation. It's about protection.
The clever four-level page table design from Chapter 3? Its real purpose is enabling fine-grained protection at multiple granularities while keeping page table memory reasonable.
The complex TLB from Chapter 4? It exists primarily because checking permissions on every memory access would be impossibly slow without caching.
The expensive IOMMU from Chapter 5? It replicates all the CPU protection mechanisms we'll study in this chapter because devices are just as dangerous as malicious programs.
Everything we've studied so far was building toward this chapter. Now let's understand why it all matters.
Consider what could happen without memory protection:
Scenario 1: Malicious Program
// Without protection, any program could do this:
int *kernel_memory = (int *)0xffff8000deadbeef;
*kernel_memory = 0x90909090; // Overwrite kernel code with NOPs
Scenario 2: Buggy Program
// A simple bug becomes a security disaster:
char buffer[100];
gets(buffer); // Buffer overflow
// Without protection, overflow corrupts neighboring process memory
Scenario 3: Privilege Escalation
// User process tries to access kernel data:
struct cred *cred = current_cred(); // Get kernel credential struct
cred->uid = 0; // Make myself root!
Without hardware-enforced memory protection, any of these scenarios would succeed, making multi-user systems impossible and single-user systems dangerously fragile.
The MMU enforces four fundamental security properties:
1. Isolation Each process operates in its own address space, unable to read or modify other processes' memory. This is the foundation of modern operating systems.
2. Privilege Separation The system distinguishes between kernel code (privileged) and user code (unprivileged), preventing user programs from directly accessing kernel memory or executing privileged instructions.
3. Execute Protection Memory pages can be marked as non-executable, preventing data (like stack buffers or heap allocations) from being executed as code. This defeats entire classes of exploits.
4. Access Control Fine-grained control over read, write, and execute permissions enables sophisticated security policies, from simple W^X (write XOR execute) to complex protection domains.
Buffer Overflow (Pre-NX Era)
// Vulnerable code:
void process_input(char *input) {
char buffer[256];
strcpy(buffer, input); // No bounds checking!
// Rest of function...
}
// Attack: Input contains 256 bytes of shellcode + return address overwrite
// Without NX: Shellcode executes, attacker gains control
// With NX: Program crashes, attack fails
The introduction of the No-Execute (NX) bit in the early 2000s made this classic attack much harder—but only if the OS and hardware enforce it.
Meltdown (2018)
; Speculative execution attack (simplified)
mov rax, [kernel_address] ; Speculatively reads kernel memory
mov rbx, [rax * 4096] ; Uses data to access different cache line
; Even though the first instruction faults, cache timing reveals the data!
Meltdown exploited speculative execution to bypass privilege checks, reading arbitrary kernel memory from user space. The mitigation, KPTI (Kernel Page-Table Isolation), carries a 5-30% performance penalty depending on workload.
DMA Attack (Pre-IOMMU Era)
// Malicious device or Thunderbolt peripheral:
// 1. Device uses DMA to write to physical memory
// 2. No IOMMU means device can access any physical address
// 3. Overwrite page tables, kernel code, or credentials
// Result: Complete system compromise
This is why Chapter 5's IOMMU is critical—devices need the same isolation as user processes.
Memory protection isn't free. Every security feature has a cost:
| Feature | Security Benefit | Performance Cost | Always Enable? |
|---|---|---|---|
| NX bit | High (prevents code injection) | ~0% | ✅ Yes |
| SMEP/SMAP | High (kernel hardening) | <1% | ✅ Yes |
| KPTI | High (Meltdown mitigation) | 5-30% | ⚠️ If vulnerable |
| Protection Keys | Medium (fast isolation) | <1% | ✅ When available |
| Memory Tagging | High (memory safety) | 5-15% | 🤔 Depends on workload |
| Confidential Compute | Very High (VM isolation) | 1-5% | 🤔 Multi-tenant only |
The art of system security is maximizing protection while minimizing overhead.
Each architecture approaches memory protection differently:
x86-64: Evolutionary Intel and AMD have added security features incrementally over decades:
Result: Comprehensive but complex, with many features for backward compatibility.
ARM64: Clean-Sheet Design ARM designed ARMv8-A (2011) with modern security in mind:
Result: More coherent, but less mature ecosystem.
RISC-V: Minimal and Extensible RISC-V (2010s) embraces modularity:
Result: Elegant but security ecosystem still developing.
By the end of this chapter, you'll understand how modern systems enforce memory protection, from simple permission bits to encrypted confidential computing.
Every page table entry (PTE) contains permission bits that control how memory can be accessed. These bits are checked by the MMU on every memory access, making them the first line of defense in system security.
Modern architectures provide three fundamental permissions:
Read (R): Can the page be read?
Write (W): Can the page be written?
Execute (X): Can the page be executed as code?
However, different architectures encode these permissions differently:
x86-64 uses negative logic for some bits in the PTE:
Bit 63: XD (Execute Disable) - 1 = NOT executable
Bit 2: U/S (User/Supervisor) - 0 = Supervisor, 1 = User
Bit 1: R/W (Read/Write) - 0 = Read-only, 1 = Read-Write
Bit 0: P (Present) - 0 = Not present, 1 = Present
x86-64 PTE Layout (64 bits):
┌─────┬────┬─────────────────────────────────────┬─────────────────────┐
│ XD │ ...│ Physical Address (bits 51-12) │ Flags (bits 11-0) │
│(63) │ │ │ │
└─────┴────┴─────────────────────────────────────┴─────────────────────┘
Flags:
Bit 0: P (Present)
Bit 1: R/W (Read/Write)
Bit 2: U/S (User/Supervisor)
Bit 3: PWT (Page Write-Through)
Bit 4: PCD (Page Cache Disable)
Bit 5: A (Accessed)
Bit 6: D (Dirty)
Bit 7: PAT or PS (Page Size)
Bit 8: G (Global)
Bits 9-11: Available for OS use
Permission Combinations:
| R/W | U/S | XD | Access Rights |
|---|---|---|---|
| 0 | 0 | 0 | Supervisor Read/Execute only |
| 0 | 0 | 1 | Supervisor Read only |
| 1 | 0 | 0 | Supervisor Read/Write/Execute |
| 1 | 0 | 1 | Supervisor Read/Write only |
| 0 | 1 | 0 | User Read/Execute only |
| 0 | 1 | 1 | User Read only |
| 1 | 1 | 0 | User Read/Write/Execute |
| 1 | 1 | 1 | User Read/Write only |
Note: On x86-64, there's no way to have write-only pages (writes imply reads).
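As a quick sketch of how software can interpret these combinations, the hypothetical helper below (our own name and output format) decodes the protection-relevant x86-64 PTE bits into a four-character string: user/supervisor, read, write, execute:

```c
#include <stdint.h>
#include <stdio.h>

// Decode x86-64 PTE protection bits: P (bit 0), R/W (bit 1),
// U/S (bit 2), XD (bit 63). Output like "urw-" or "sr-x".
void describe_pte(uint64_t pte, char out[5]) {
    if (!(pte & 1)) {            // Present bit clear
        snprintf(out, 5, "----");
        return;
    }
    out[0] = (pte & (1ULL << 2)) ? 'u' : 's';   // U/S
    out[1] = 'r';                               // present implies readable
    out[2] = (pte & (1ULL << 1)) ? 'w' : '-';   // R/W
    out[3] = (pte & (1ULL << 63)) ? '-' : 'x';  // XD (negative logic)
    out[4] = '\0';
}
```

For example, a user data PTE (P, R/W, U/S, and XD set) decodes to "urw-", while a bare kernel code PTE (only P set) decodes to "sr-x".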
ARM64 uses positive logic with explicit AP (Access Permission) bits:
Bits 7-6: AP[2:1] (Access Permissions)
Bit 54: XN (Execute Never) - 1 = NOT executable
Bit 53: PXN (Privileged Execute Never)
Bit 10: AF (Access Flag)
ARM64 Descriptor Format (Level 3):
┌────┬────┬────────────────────────────────────┬─────────────────────┐
│ XN │PXN │ Physical Address (bits 47-12) │ Attributes │
│(54)│(53)│ │ │
└────┴────┴────────────────────────────────────┴─────────────────────┘
Attributes:
Bits 1-0: Type (0b11 = Page)
Bits 4-2: AttrIndx (memory type)
Bit 5: NS (Non-secure)
Bits 7-6: AP[2:1] (Access Permissions)
Bits 9-8: SH (Shareability)
Bit 10: AF (Access Flag)
Bit 11: nG (not Global)
AP Bits Encoding:
| AP[2:1] | EL0 Access | EL1 Access | Typical Use |
|---|---|---|---|
| 00 | None | Read/Write | Kernel data |
| 01 | Read/Write | Read/Write | Shared memory |
| 10 | None | Read-only | Kernel code |
| 11 | Read-only | Read-only | Shared read-only |
Execute Control:
This gives ARM more flexible execute control than x86-64!
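That split is easy to model. A minimal sketch of the execute decision under the XN/PXN scheme above (the helper name is ours; real hardware also consults the AP bits and hierarchical table attributes):

```c
#include <stdbool.h>
#include <stdint.h>

#define DESC_XN  (1ULL << 54)  // Execute Never for EL0
#define DESC_PXN (1ULL << 53)  // Privileged Execute Never for EL1

// Can this descriptor's page be executed at the given exception level?
bool can_execute(uint64_t desc, bool at_el0) {
    if (at_el0)
        return !(desc & DESC_XN);   // XN blocks user execution
    return !(desc & DESC_PXN);      // PXN blocks kernel execution
}
```

A page with only PXN set is executable by user code but not by the kernel, a combination x86-64's single XD bit cannot express.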
RISC-V uses the simplest and most direct encoding:
Bit 3: X (Execute)
Bit 2: W (Write)
Bit 1: R (Read)
Bit 0: V (Valid)
RISC-V PTE Layout (Sv39/Sv48):
┌────────────────────────────────────────────┬─────────────────────┐
│ Physical Page Number (PPN) │ Flags (bits 7-0) │
│ │ │
└────────────────────────────────────────────┴─────────────────────┘
Flags:
Bit 0: V (Valid)
Bit 1: R (Read)
Bit 2: W (Write)
Bit 3: X (Execute)
Bit 4: U (User accessible)
Bit 5: G (Global)
Bit 6: A (Accessed)
Bit 7: D (Dirty)
Permission Combinations:
| R | W | X | Meaning |
|---|---|---|---|
| 0 | 0 | 0 | Pointer to next level |
| 0 | 0 | 1 | Execute-only |
| 0 | 1 | 0 | ⚠️ Reserved (W without R) |
| 0 | 1 | 1 | ⚠️ Reserved (W without R) |
| 1 | 0 | 0 | Read-only |
| 1 | 0 | 1 | Read + Execute |
| 1 | 1 | 0 | Read + Write |
| 1 | 1 | 1 | Read + Write + Execute |
Note: RISC-V allows execute-only pages (X=1, R=0, W=0), which x86-64 and ARM64 don't support!
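Since the W-without-R encodings are reserved, code that constructs PTEs should reject them. A small validity check, sketched against this section's bit layout (the helper name is ours):

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_R 0x2
#define PTE_W 0x4
#define PTE_X 0x8

// RISC-V: writable pages must also be readable; W=1 with R=0 is reserved.
bool rwx_combo_valid(uint64_t pte) {
    if ((pte & PTE_W) && !(pte & PTE_R))
        return false;  // reserved encoding
    return true;
}
```

Note that execute-only (X set, R and W clear) passes the check: it is a legal, if unusual, RISC-V encoding.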
| Feature | x86-64 | ARM64 | RISC-V |
|---|---|---|---|
| **Read permission** | Implicit with Present | AP bits | R bit |
| **Write permission** | R/W bit | AP bits | W bit |
| **Execute permission** | XD bit (negative) | XN/PXN bits | X bit |
| **User/Kernel** | U/S bit | AP bits | U bit |
| **Execute-only** | ❌ No | ❌ No | ✅ Yes |
| **Write-only** | ❌ No | ❌ No | ❌ No |
The Present (P) bit (called Valid on RISC-V and ARM64) determines whether a page is currently mapped:
Present = 0: Page not in memory
Present = 1: Page is mapped
x86-64 Example:
// Check if page is present
bool is_present(uint64_t pte) {
return (pte & 0x1); // Bit 0
}
// Get physical address (assumes present)
uint64_t get_phys_addr(uint64_t pte) {
return pte & 0x000FFFFFFFFFF000ULL; // Bits 51-12
}
These bits help the OS manage memory:
Accessed (A) bit: Set by hardware when page is read or written
Dirty (D) bit: Set by hardware when page is written
Example: Linux Page Replacement
// Simplified Linux page aging
void age_pages(void) {
for (each page in system) {
if (pte->accessed) {
page->age++; // Page was used recently
pte->accessed = 0; // Clear for next period
} else {
page->age--; // Page not used
}
if (page->age < threshold && !pte->dirty) {
evict_page(page); // Can discard clean unused pages
}
}
}
This bit determines privilege level access:
x86-64 U/S Bit:
ARM64 AP Bits:
RISC-V U Bit:
Security Example:
// Kernel page table entry (x86-64)
uint64_t kernel_pte =
phys_addr | // Physical address
(1 << 0) | // Present
(1 << 1) | // Read/Write
(0 << 2); // Supervisor only (U/S = 0)
// User page table entry
uint64_t user_pte =
phys_addr | // Physical address
(1 << 0) | // Present
(1 << 1) | // Read/Write
(1 << 2) | // User accessible (U/S = 1)
(1ULL << 63); // No execute (XD = 1)
x86-64: Creating PTEs with Different Permissions
#include <stdint.h>
// PTE bit definitions
#define PTE_P 0x001 // Present
#define PTE_W 0x002 // Writable
#define PTE_U 0x004 // User
#define PTE_A 0x020 // Accessed
#define PTE_D 0x040 // Dirty
#define PTE_NX (1ULL << 63) // No Execute
// Create kernel code page (R-X, supervisor only)
uint64_t make_kernel_code_pte(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P; // Present, Read-only, Supervisor, Execute
}
// Create kernel data page (RW-, supervisor only)
uint64_t make_kernel_data_pte(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | PTE_W | PTE_NX; // Present, Read-Write, No Execute
}
// Create user code page (R-X, user accessible)
uint64_t make_user_code_pte(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | PTE_U; // Present, User, Execute
}
// Create user data page (RW-, user accessible)
uint64_t make_user_data_pte(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | PTE_W | PTE_U | PTE_NX; // Present, Write, User, No Execute
}
// Check permissions
bool is_writable(uint64_t pte) {
return pte & PTE_W;
}
bool is_user_accessible(uint64_t pte) {
return pte & PTE_U;
}
bool is_executable(uint64_t pte) {
return !(pte & PTE_NX);
}
ARM64: Creating Descriptors with Different Permissions
// ARM64 descriptor bits
#define DESC_VALID 0x001
#define DESC_TABLE 0x003
#define DESC_AF 0x400 // Access flag (bit 10)
// AP bits (bits 7-6)
#define AP_KERNEL_RW (0 << 6) // EL0: none, EL1: RW
#define AP_USER_RW (1 << 6) // EL0: RW, EL1: RW
#define AP_KERNEL_RO (2 << 6) // EL0: none, EL1: RO
#define AP_USER_RO (3 << 6) // EL0: RO, EL1: RO
#define DESC_XN (1ULL << 54) // Execute Never (EL0)
#define DESC_PXN (1ULL << 53) // Privileged Execute Never (EL1)
// Create kernel code descriptor (R-X, EL1 only)
uint64_t make_kernel_code_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_KERNEL_RO | DESC_XN;
}
// Create kernel data descriptor (RW-, EL1 only)
uint64_t make_kernel_data_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_KERNEL_RW | DESC_XN | DESC_PXN;
}
// Create user code descriptor (R-X, EL0/EL1)
uint64_t make_user_code_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_USER_RO | DESC_PXN;
}
// Create user data descriptor (RW-, EL0/EL1)
uint64_t make_user_data_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_USER_RW | DESC_XN | DESC_PXN;
}
RISC-V: Creating PTEs with Different Permissions
// RISC-V PTE bits
#define PTE_V 0x001 // Valid
#define PTE_R 0x002 // Read
#define PTE_W 0x004 // Write
#define PTE_X 0x008 // Execute
#define PTE_U 0x010 // User
#define PTE_G 0x020 // Global
#define PTE_A 0x040 // Accessed
#define PTE_D 0x080 // Dirty
// Create kernel code PTE (R-X, supervisor only)
uint64_t make_kernel_code_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_R | PTE_X;
}
// Create kernel data PTE (RW-, supervisor only)
uint64_t make_kernel_data_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_R | PTE_W;
}
// Create user code PTE (R-X, user accessible)
uint64_t make_user_code_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_R | PTE_X | PTE_U;
}
// Create user data PTE (RW-, user accessible)
uint64_t make_user_data_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_R | PTE_W | PTE_U;
}
// RISC-V supports execute-only pages!
uint64_t make_execute_only_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_X; // Only Execute bit set
}
When the MMU translates a virtual address, it checks permissions:
Hardware Permission Check (Pseudo-code):
function check_permissions(pte, access_type, privilege_level):
// Check if page is present/valid
if not pte.present:
raise PAGE_FAULT
// Check privilege level
if pte.user_accessible:
// User page: both user and supervisor can access
allowed = true
else:
// Supervisor page: only supervisor can access
if privilege_level == USER:
raise PAGE_FAULT (protection violation)
allowed = true
// Check access type
if access_type == READ:
// Reads always allowed if present (on x86/ARM)
// RISC-V checks R bit explicitly
if RISC_V and not pte.R:
raise PAGE_FAULT
elif access_type == WRITE:
if not pte.writable:
raise PAGE_FAULT (write protection)
elif access_type == EXECUTE:
if pte.execute_disable: // XD/XN bit
raise PAGE_FAULT (execute protection)
// Set accessed/dirty bits
pte.accessed = 1
if access_type == WRITE:
pte.dirty = 1
return ALLOWED
Permission checking happens on every memory access that misses the TLB:
TLB Hit: Permissions cached with translation (no extra cycles)
TLB Miss, PWC Hit: Permission check during the walk (~10-30 cycles)
TLB Miss, Full Walk: Permission check at each level (~50-200 cycles)
Key Insight: High TLB hit rates (>99%) make permission checking essentially free!
The No-Execute bit is one of the most important security features in modern processors. By preventing code execution from data pages, it defeats entire classes of exploits that have plagued computer security for decades.
Before the NX bit, x86 had no per-page execute permission: any readable page, including the writable stack and heap, could be executed as code. This created a fundamental vulnerability:
Classic Buffer Overflow Exploit:
// Vulnerable function (pre-NX era)
void process_request(char *user_input) {
char buffer[256];
strcpy(buffer, user_input); // No bounds checking!
// If user_input > 256 bytes, overflow corrupts stack
}
// Stack layout:
// ┌──────────────┐ High addresses
// │ Return addr │ ← Attacker overwrites this!
// ├──────────────┤
// │ Saved EBP │
// ├──────────────┤
// │ │
// │ buffer[256] │ ← Overflow starts here
// │ │
// └──────────────┘ Low addresses
// Attack payload:
// [256 bytes of shellcode] + [overwritten return address pointing to shellcode]
Without NX:
With NX:
Each architecture implements execute protection slightly differently:
x86-64: XD (Execute Disable) Bit
Bit 63 of PTE: XD (Execute Disable)
XD = 0: Page is executable
XD = 1: Page is NOT executable (execution causes #PF)
Must be enabled in EFER.NXE (Extended Feature Enable Register)
Checking XD Support:
#include <cpuid.h>
bool has_nx_support(void) {
unsigned int eax, ebx, ecx, edx;
// CPUID function 0x80000001 (Extended Features)
__cpuid(0x80000001, eax, ebx, ecx, edx);
// Bit 20 of EDX = NX support
return (edx & (1 << 20)) != 0;
}
void enable_nx(void) {
    // Set EFER.NXE (bit 11). rdmsr/wrmsr move the MSR value in EDX:EAX,
    // so read and write the two 32-bit halves explicitly.
    uint32_t lo, hi;
    asm volatile(
        "rdmsr"
        : "=a" (lo), "=d" (hi)
        : "c" (0xC0000080)      // EFER MSR
    );
    lo |= (1u << 11);           // NXE bit is in the low half
    asm volatile(
        "wrmsr"
        :
        : "c" (0xC0000080), "a" (lo), "d" (hi)
    );
}
ARM64: XN (Execute Never) Bits
Bit 54: XN (Execute Never for EL0; the ARM manuals call it UXN)
Bit 53: PXN (Privileged Execute Never for EL1)
More flexible than x86: separate control for user/kernel!
RISC-V: X (Execute) Bit
Bit 3: X (Execute permission)
X = 0: NOT executable
X = 1: Executable
Simple positive logic (unlike x86's negative logic)
Example 1: Stack Buffer Overflow
// Before NX: Exploitable
void vulnerable_function(char *input) {
char buffer[64];
strcpy(buffer, input); // Overflow possible
// Attacker can execute shellcode from stack
}
// After NX: Attack fails
void same_function(char *input) {
char buffer[64];
strcpy(buffer, input); // Still overflows
// But stack is marked NX, execution fails!
// Program crashes instead of being compromised
}
Memory Layout with NX:
┌─────────────────┬────────┬────────┐
│ Memory Region │ Perms │ XD/XN │
├─────────────────┼────────┼────────┤
│ .text (code) │ R-X │ No │ ← Executable
│ .rodata (const) │ R-- │ Yes │ ← Read-only
│ .data (globals) │ RW- │ Yes │ ← Data only
│ .bss (zeros) │ RW- │ Yes │ ← Data only
│ Heap │ RW- │ Yes │ ← Data only
│ Stack │ RW- │ Yes │ ← Data only
│ Shared libs │ R-X │ No │ ← Executable
└─────────────────┴────────┴────────┘
Result: Only code sections are executable. Data sections (heap, stack) cannot execute.
Modern systems enforce a stronger policy: no page can be both writable AND executable.
W^X Implementation:
// Legal combinations:
// R-X: Code pages (read and execute, but not write)
// RW-: Data pages (read and write, but not execute)
// R--: Read-only data
// ---: Inaccessible (guard pages)
// ILLEGAL combination:
// RWX: No page should be both writable AND executable!
Why W^X matters:
Without W^X:
// An attacker can:
// 1. Allocate RWX memory
// 2. Write shellcode to it
// 3. Execute the shellcode
// Many JIT compilers did exactly this!
With W^X:
// A JIT compiler must:
// 1. Allocate RW- memory
// 2. Write the JIT-compiled code
// 3. Change it to R-X (using mprotect)
// 4. Execute the code
// 5. The code cannot be modified while it is executable!
Linux W^X Enforcement:
// mmap with RWX fails on W^X systems
void *rwx = mmap(NULL, size,
PROT_READ | PROT_WRITE | PROT_EXEC, // Rejected!
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// Returns MAP_FAILED on systems with strict W^X
// Correct approach:
void *rw = mmap(NULL, size,
PROT_READ | PROT_WRITE, // Initially RW
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// ... write code ...
mprotect(rw, size, PROT_READ | PROT_EXEC); // Change to RX
NX defeated code injection, but attackers adapted with Return-Oriented Programming:
ROP Concept:
Instead of injecting new code, reuse existing code!
1. Find "gadgets" in existing executable memory:
pop rdi; ret
pop rsi; ret
mov [rdi], rsi; ret
2. Chain gadgets by controlling stack:
[address of gadget 1]
[data for gadget 1]
[address of gadget 2]
[data for gadget 2]
...
3. Each gadget executes then returns to next gadget
4. Build complete exploit from existing code snippets!
NX is not enough! Modern systems need additional defenses:
x86-64: Setting XD Bit
#define PTE_P 0x001
#define PTE_W 0x002
#define PTE_U 0x004
#define PTE_NX (1ULL << 63)
// Create non-executable data page
uint64_t create_data_page(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | PTE_W | PTE_U | PTE_NX; // Writable but not executable
}
// Create executable code page
uint64_t create_code_page(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | PTE_U; // Readable and executable, but not writable
// Note: XD bit NOT set = executable
}
// Typical OS memory mapping
void setup_memory_protections(void) {
// Text segment: R-X
for (each page in .text) {
pte = create_code_page(page->phys_addr);
}
// Data segment: RW-
for (each page in .data, .bss) {
pte = create_data_page(page->phys_addr);
}
// Stack: RW-
for (each page in stack) {
pte = create_data_page(page->phys_addr);
}
// Heap: RW-
for (each page in heap) {
pte = create_data_page(page->phys_addr);
}
}
ARM64: Setting XN/PXN Bits
#define DESC_XN (1ULL << 54) // Execute Never (EL0)
#define DESC_PXN (1ULL << 53) // Privileged Execute Never (EL1)
// User data page: NOT executable by user or kernel
uint64_t create_user_data_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_USER_RW |
DESC_XN | DESC_PXN; // Both XN and PXN set
}
// User code page: Executable by user, NOT by kernel
uint64_t create_user_code_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_USER_RO |
DESC_PXN; // PXN set (kernel can't execute), XN clear (user can)
}
// Kernel code page: Executable by kernel only
uint64_t create_kernel_code_desc(uint64_t phys_addr) {
return (phys_addr & 0x0000FFFFFFFFF000ULL) |
DESC_VALID | DESC_AF | AP_KERNEL_RO |
DESC_XN; // XN set (user can't execute), PXN clear (kernel can)
}
RISC-V: Setting X Bit
#define PTE_V 0x001
#define PTE_R 0x002
#define PTE_W 0x004
#define PTE_X 0x008
#define PTE_U 0x010
// Non-executable data page
uint64_t create_data_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_R | PTE_W | PTE_U;
// Note: X bit NOT set = not executable
}
// Executable code page
uint64_t create_code_pte(uint64_t phys_addr) {
uint64_t ppn = phys_addr >> 12;
return (ppn << 10) | PTE_V | PTE_R | PTE_X | PTE_U;
// Note: W bit NOT set = not writable (W^X policy)
}
Linux DEP (Data Execution Prevention):
// Check whether an ELF executable requests a non-executable stack
#include <elf.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

bool has_nx_protection(const char *filename) {
    int fd = open(filename, O_RDONLY);
    if (fd < 0) return false;
    Elf64_Ehdr ehdr;
    bool nx = true;  // no PT_GNU_STACK segment: NX enabled by default
    if (read(fd, &ehdr, sizeof(ehdr)) == sizeof(ehdr)) {
        for (int i = 0; i < ehdr.e_phnum; i++) {
            Elf64_Phdr phdr;
            if (pread(fd, &phdr, sizeof(phdr),
                      ehdr.e_phoff + (off_t)i * ehdr.e_phentsize) != sizeof(phdr))
                break;
            if (phdr.p_type == PT_GNU_STACK) {
                nx = !(phdr.p_flags & PF_X);  // NX if the stack lacks PF_X
                break;
            }
        }
    }
    close(fd);
    return nx;
}
Windows DEP:
// Enable DEP for process (Windows API)
#include <windows.h>
void enable_dep(void) {
DWORD flags = PROCESS_DEP_ENABLE; // Enable DEP
SetProcessDEPPolicy(flags);
}
// Check if DEP is enabled system-wide
bool is_dep_enabled(void) {
BOOL permanent;
DWORD flags;
GetProcessDEPPolicy(GetCurrentProcess(), &flags, &permanent);
return (flags & PROCESS_DEP_ENABLE) != 0;
}
Overhead: Essentially zero on modern processors!
Why NX is Free:
Benchmark: NX Performance Impact
// Test: Execute 1 million function calls
// Without NX: 0.523 seconds
// With NX: 0.524 seconds
// Overhead: ~0.2% (within measurement error)
#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

extern void test_function(void);  // any small function under test

uint64_t benchmark_nx_overhead(void) {
    const int iterations = 1000000;
    // Warm up caches and TLB
    for (int i = 0; i < 1000; i++) {
        test_function();
    }
    uint64_t start = rdtsc();
    for (int i = 0; i < iterations; i++) {
        test_function();
    }
    uint64_t end = rdtsc();
    return (end - start) / iterations;
    // Result: ~500 cycles/call (no measurable NX overhead)
}
Studies show NX prevents:
Attack Prevention Timeline:
Bottom Line: NX/XD/XN is the single most effective hardware security feature ever added to processors. Every modern system should have it enabled—and nearly all do by default.
Modern processors provide multiple privilege levels to separate trusted code (kernel) from untrusted code (applications). Each architecture implements this differently, but the goal is the same: prevent user code from directly accessing privileged resources.
Intel's x86 architecture provides four privilege levels called rings:
Why Four Rings?
Original intent (1980s):
Modern reality:
Current Privilege Level (CPL):
The CPL is stored in CS register (Code Segment) bits 0-1:
// Read current privilege level
static inline int get_cpl(void) {
uint16_t cs;
asm volatile("mov %%cs, %0" : "=r" (cs));
return cs & 0x3; // Bits 0-1
}
// Example usage
void check_privilege(void) {
int cpl = get_cpl();
if (cpl == 0) {
printf("Running in Ring 0 (kernel mode)\n");
} else if (cpl == 3) {
printf("Running in Ring 3 (user mode)\n");
}
}
Descriptor Privilege Level (DPL):
Every memory segment and gate has a Descriptor Privilege Level:
// Segment descriptor (simplified)
struct segment_descriptor {
uint16_t limit_low;
uint16_t base_low;
uint8_t base_mid;
uint8_t access; // Contains DPL (bits 5-6)
uint8_t granularity;
uint8_t base_high;
};
// Extract DPL from descriptor
int get_dpl(struct segment_descriptor *desc) {
return (desc->access >> 5) & 0x3; // Bits 5-6
}
// Access check rule:
// CPL <= DPL to access segment (numerically lower = more privileged)
bool can_access_segment(int cpl, int dpl) {
return cpl <= dpl; // Ring 0 can access Ring 0-3
// Ring 3 can only access Ring 3
}
Privilege Transitions:
User → Kernel (Ring 3 → Ring 0):
// System call from user space
// User executes: syscall instruction (x86-64)
asm volatile(
"movq $1, %%rax\n" // System call number (SYS_write = 1 on x86-64)
"movq $1, %%rdi\n" // File descriptor (stdout)
"movq $msg, %%rsi\n" // Buffer pointer
"movq $len, %%rdx\n" // Length
"syscall\n" // Triggers Ring 3 → Ring 0
::: "rax", "rdi", "rsi", "rdx", "rcx", "r11"
);
// On SYSCALL, the CPU hardware:
// 1. Saves user RIP in RCX and RFLAGS in R11
// 2. Loads kernel RIP from MSR IA32_LSTAR
// 3. Masks RFLAGS using MSR IA32_FMASK
// 4. Changes CPL from 3 to 0
// 5. Jumps to the kernel syscall handler
// (the handler itself switches to a kernel stack, e.g. via SWAPGS;
// SYSCALL does not save or switch RSP)
Kernel → User (Ring 0 → Ring 3):
// Return from system call
// Kernel executes: sysretq instruction
void syscall_return(void) {
// CPU hardware does:
// 1. Restore user RIP from RCX and RFLAGS from R11
// 2. Change CPL from 0 to 3
// 3. Resume user code (software restores the user RSP beforehand)
asm volatile("sysretq"); // Returns to Ring 3
}
Privileged Instructions:
Some instructions are Ring 0 only:
// Examples of privileged instructions (Ring 0 only):
// 1. Load CR3 (page table base register)
static inline void load_cr3(uint64_t cr3) {
asm volatile("mov %0, %%cr3" :: "r" (cr3) : "memory");
// Ring 3: #GP (General Protection Fault)
}
// 2. HLT (halt processor)
static inline void halt(void) {
asm volatile("hlt");
// Ring 3: #GP fault
}
// 3. LGDT (load GDT)
static inline void load_gdt(void *gdt_ptr) {
asm volatile("lgdt (%0)" :: "r" (gdt_ptr));
// Ring 3: #GP fault
}
// 4. IN/OUT (I/O port access)
static inline uint8_t inb(uint16_t port) {
uint8_t value;
asm volatile("inb %1, %0" : "=a" (value) : "Nd" (port));
return value;
// Ring 3: #GP fault (unless IOPL allows)
}
ARM64 provides a cleaner privilege model with 4 Exception Levels (EL0-EL3):
Advantages over x86 Rings:
Reading Current Exception Level:
// ARM64: Read current exception level
// (Note: CurrentEL is not readable at EL0; executing this MRS in
// user code raises an undefined-instruction exception)
static inline uint64_t get_current_el(void) {
uint64_t el;
asm volatile("mrs %0, CurrentEL" : "=r" (el));
return (el >> 2) & 0x3; // Bits 2-3
}
void check_exception_level(void) {
uint64_t el = get_current_el();
switch (el) {
case 0: printf("EL0: User mode\n"); break;
case 1: printf("EL1: Kernel mode\n"); break;
case 2: printf("EL2: Hypervisor mode\n"); break;
case 3: printf("EL3: Secure monitor\n"); break;
}
}
Exception Level Transitions:
// EL0 → EL1 (System call)
// User executes: SVC #0 (Supervisor Call)
void user_syscall(void) {
asm volatile("svc #0"); // Triggers exception to EL1
// CPU saves: PC, SPSR (Saved Program Status Register)
// CPU loads: Vector table entry for EL1
// Changes: EL0 → EL1
}
// EL1 → EL0 (Return from exception)
// Kernel executes: ERET (Exception Return)
void kernel_return(void) {
asm volatile("eret"); // Returns to EL0
// CPU restores: PC, SPSR
// Changes: EL1 → EL0
}
// EL1 → EL2 (Hypervisor call)
void kernel_to_hypervisor(void) {
asm volatile("hvc #0"); // Hypervisor Call
// Changes: EL1 → EL2
}
// EL1 → EL3 (Secure monitor call)
void kernel_to_secure_monitor(void) {
asm volatile("smc #0"); // Secure Monitor Call
// Changes: EL1 → EL3 (enters Secure World)
}
ARM64 System Registers:
Different system registers accessible at each EL:
// EL0: Limited access
// - Can read: Counter registers, feature ID registers
// - Cannot: Modify page tables, access devices
// EL1: Kernel access
// - TTBR0_EL1/TTBR1_EL1: Page table base registers
// - SCTLR_EL1: System control register
// - VBAR_EL1: Vector base address register
static inline void set_ttbr0_el1(uint64_t ttbr) {
asm volatile("msr ttbr0_el1, %0" :: "r" (ttbr));
// EL0: Undefined instruction exception
}
// EL2: Hypervisor access
// - All EL1 registers
// - TTBR0_EL2: Hypervisor's own Stage 1 page table base
// - VTTBR_EL2: Stage 2 (guest-physical → physical) page table base
// - HCR_EL2: Hypervisor configuration register
// EL3: Secure monitor access
// - All registers from all levels
// - SCR_EL3: Secure configuration register
RISC-V provides the simplest privilege model:
Determining the Privilege Mode:
// RISC-V deliberately provides no CSR that reports the current
// privilege mode. Code is expected to know its mode by construction:
// firmware runs in M-mode, the kernel in S-mode, applications in
// U-mode. Attempting to read a higher-privilege CSR does not fail
// gracefully; it raises an illegal-instruction exception, which the
// more-privileged trap handler can catch to infer the caller's mode.
//
// What software CAN read after a trap is the PREVIOUS privilege
// mode, from mstatus.MPP (bits 12-11) in M-mode or sstatus.SPP
// (bit 8) in S-mode:
static inline int get_trap_previous_mode(void) {
unsigned long mstatus;
asm volatile("csrr %0, mstatus" : "=r" (mstatus)); // M-mode only
return (mstatus >> 11) & 0x3; // MPP: 0=U-mode, 1=S-mode, 3=M-mode
}
Privilege Transitions:
// U-Mode → S-Mode (system call)
// (ecall from U-mode traps to M-mode unless delegated to S-mode
// via the medeleg register, as Linux-capable firmware arranges)
void user_ecall(void) {
asm volatile("ecall"); // Environment Call
// CPU saves: PC to SEPC
// CPU sets: SCAUSE (exception cause)
// CPU loads: PC from STVEC (trap vector)
// Changes: U-mode → S-mode
}
// S-Mode → M-Mode (machine call)
void supervisor_ecall(void) {
asm volatile("ecall"); // Environment Call
// Similar to above but S → M
}
// Return from trap
void return_from_trap_s_mode(void) {
asm volatile("sret"); // Supervisor Return
// CPU restores: PC from SEPC
// CPU restores: Privilege from SSTATUS.SPP
}
void return_from_trap_m_mode(void) {
asm volatile("mret"); // Machine Return
// CPU restores: PC from MEPC
// CPU restores: Privilege from MSTATUS.MPP
}
RISC-V CSR Access Control:
// CSR (Control and Status Register) address encoding:
// Bits 11-10: Access type (00/01/10 = read/write, 11 = read-only)
// Bits 9-8: Lowest privilege level that may access the CSR
// 00: U-mode accessible
// 01: S-mode accessible
// 10: Hypervisor accessible
// 11: M-mode accessible
// Examples:
// M-mode only registers
#define CSR_MSTATUS 0x300 // Machine status
#define CSR_MIE 0x304 // Machine interrupt enable
#define CSR_MTVEC 0x305 // Machine trap vector
// S-mode accessible registers
#define CSR_SSTATUS 0x100 // Supervisor status
#define CSR_SIE 0x104 // Supervisor interrupt enable
#define CSR_STVEC 0x105 // Supervisor trap vector
// U-mode accessible registers
#define CSR_CYCLE 0xC00 // Cycle counter (read-only)
#define CSR_TIME 0xC01 // Timer (read-only)
#define CSR_INSTRET 0xC02 // Instructions retired (read-only)
// Attempting to access a higher-privilege CSR triggers an
// illegal-instruction exception. Note: the CSR specifier is encoded
// in the instruction itself, so it must be a compile-time constant;
// CSR reads are therefore written as a macro, not a function:
#define read_csr(csr) ({ \
unsigned long __v; \
asm volatile("csrr %0, " #csr : "=r" (__v)); \
__v; \
})
// Usage: unsigned long s = read_csr(sstatus);
| Feature | x86-64 Rings | ARM64 ELs | RISC-V Modes |
|---|---|---|---|
| **Privilege Levels** | 4 (0-3) | 4 (EL0-EL3) | 3-4 (U/S/H/M) |
| **Actually Used** | 2 (Ring 0, 3) | 2-4 (all used) | 2-3 (U/S/M) |
| **User Level** | Ring 3 | EL0 | U-mode |
| **Kernel Level** | Ring 0 | EL1 | S-mode |
| **Hypervisor** | VMX root Ring 0 | EL2 | HS-mode (optional) |
| **Secure Boot** | SMM (legacy) | EL3 | M-mode |
| **Wasted Levels** | Ring 1-2 unused | None | None |
| **Transition Inst** | SYSCALL/SYSRET | SVC/ERET | ECALL/SRET/MRET |
| **Transition Cost** | ~100-300 cycles | ~50-150 cycles | ~50-100 cycles |
| **Design Year** | 1982 (80286) | 2011 (ARMv8) | 2010s |
Historical Note: x86's 4 rings were designed when hardware-assisted virtualization didn't exist. Modern systems essentially use only 2 levels (Ring 0/3), with Ring -1 (VMX root) added later for hypervisors!
Example: What Happens on Invalid Access
// User program tries to access kernel memory
void user_attempt_kernel_access(void) {
uint64_t *kernel_addr = (uint64_t *)0xffff888000000000;
// This will trigger:
// x86-64: #PF (Page Fault) with U/S violation
// ARM64: Synchronous exception with data abort
// RISC-V: Load access fault exception
uint64_t value = *kernel_addr; // FAULT!
}
// Page fault handler (kernel)
void page_fault_handler(struct pt_regs *regs, unsigned long error_code) {
// Check error code
bool user_mode = regs->cs & 3; // x86: CPL from CS
bool supervisor_page = !(error_code & 4); // U/S bit
if (user_mode && supervisor_page) {
// User tried to access supervisor page!
printk("Segmentation fault: user accessing kernel memory\n");
send_signal(current, SIGSEGV); // Kill process
}
}
Measured Cost of System Calls:
// Benchmark: System call overhead
#include <sys/syscall.h>
#include <x86intrin.h>
void benchmark_syscall(void) {
const int iterations = 1000000;
// Measure getpid() - simplest syscall
uint64_t start = __rdtsc();
for (int i = 0; i < iterations; i++) {
syscall(SYS_getpid); // Ring 3 → Ring 0 → Ring 3
}
uint64_t end = __rdtsc();
uint64_t cycles_per_call = (end - start) / iterations;
printf("System call overhead: %lu cycles\n", cycles_per_call);
}
// Typical results:
// x86-64 (SYSCALL): 100-150 cycles
// ARM64 (SVC): 80-120 cycles
// RISC-V (ECALL): 60-100 cycles
//
// Breakdown (approximate; the page-table switch and TLB flush
// apply only when KPTI is enabled):
// - Save user context: 20-40 cycles
// - Switch page tables (KPTI): 20-40 cycles
// - Flush TLB entries (KPTI without PCID): 20-40 cycles
// - Handler overhead: 20-40 cycles
// - Restore user context: 20-40 cycles
Optimization: Avoiding System Calls
// vDSO (virtual Dynamic Shared Object) - no syscall needed!
// Kernel maps read-only memory page into user space
// Contains frequently-used functions that don't need kernel privileges
// Example: gettimeofday() via vDSO
#include <sys/time.h>
void fast_time_access(void) {
struct timeval tv;
// Old way: syscall (100-150 cycles)
// New way: vDSO read (5-10 cycles)
gettimeofday(&tv, NULL); // No Ring transition!
// Kernel updates time in shared memory
// User reads directly - no privilege change needed
}
// Result: 10-30× faster for common operations!
The User/Supervisor (U/S) bit in page table entries is the primary mechanism for enforcing privilege-based memory protection. This simple bit—present in every page table entry—prevents user-mode code from accessing kernel memory.
x86-64 U/S Bit:
Hardware Permission Check:
IF (CPL == 3) THEN // User mode (Ring 3)
IF (PTE.U/S == 0) THEN
// User accessing supervisor page!
RAISE PAGE_FAULT (#PF, error_code with U/S bit set)
END IF
END IF
IF (CPL < 3) THEN // Supervisor mode (Ring 0-2)
// Supervisor can access ALL pages (U/S=0 or 1)
// Unless SMAP is enabled (discussed later)
END IF
Setting U/S Bit:
#include <stdint.h>
#define PTE_P (1ULL << 0) // Present
#define PTE_RW (1ULL << 1) // Read/Write
#define PTE_US (1ULL << 2) // User/Supervisor
#define PTE_NX (1ULL << 63) // No Execute
// Create kernel page (accessible only to kernel)
uint64_t make_kernel_page(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | // Present
PTE_RW | // Read/Write
PTE_NX; // No Execute (data page)
// Note: U/S bit NOT set = supervisor only
}
// Create user page (accessible to user and kernel)
uint64_t make_user_page(uint64_t phys_addr) {
return (phys_addr & 0x000FFFFFFFFFF000ULL) |
PTE_P | // Present
PTE_RW | // Read/Write
PTE_US | // User accessible
PTE_NX; // No Execute
}
// Typical kernel memory layout
void setup_kernel_pagetables(void) {
// Kernel code: R-X, Supervisor only
for (each page in .text) {
pte = phys_addr | PTE_P; // Not PTE_US, not PTE_NX
}
// Kernel data: RW-, Supervisor only
for (each page in .data, .bss) {
pte = phys_addr | PTE_P | PTE_RW | PTE_NX; // Not PTE_US
}
// User pages: RW-, User accessible
for (each page in user_memory) {
pte = phys_addr | PTE_P | PTE_RW | PTE_US | PTE_NX;
}
}
Critical Rule: If ANY level of the page table hierarchy has U/S=0, the page is supervisor-only.
┌────────────────────────────────────────┐
│ PML4E (Level 4) │
│ U/S=1 (user-accessible) │
└────────────┬───────────────────────────┘
│
▼
┌────────────────────────────────────────┐
│ PDPTE (Level 3) │
│ U/S=1 (user-accessible) │
└────────────┬───────────────────────────┘
│
▼
┌────────────────────────────────────────┐
│ PDE (Level 2) │
│ U/S=0 (supervisor-only) ← One S sets all! │
└────────────┬───────────────────────────┘
│
▼
┌────────────────────────────────────────┐
│ PTE (Level 1) │
│ U/S=1 (doesn't matter!) │
└────────────────────────────────────────┘
Result: Page is SUPERVISOR-ONLY
Example:
// Walk page tables checking U/S bits
bool is_user_accessible(uint64_t virtual_addr, uint64_t cr3) {
uint64_t *pml4 = (uint64_t *)(cr3 & ~0xFFF);
// Level 4
uint64_t pml4e = pml4[PML4_INDEX(virtual_addr)];
if (!(pml4e & PTE_US)) return false; // Supervisor at L4
// Level 3
uint64_t *pdpt = (uint64_t *)(pml4e & 0x000FFFFFFFFFF000ULL);
uint64_t pdpte = pdpt[PDPT_INDEX(virtual_addr)];
if (!(pdpte & PTE_US)) return false; // Supervisor at L3
// Level 2
uint64_t *pd = (uint64_t *)(pdpte & 0x000FFFFFFFFFF000ULL);
uint64_t pde = pd[PD_INDEX(virtual_addr)];
if (!(pde & PTE_US)) return false; // Supervisor at L2
// Level 1
uint64_t *pt = (uint64_t *)(pde & 0x000FFFFFFFFFF000ULL);
uint64_t pte = pt[PT_INDEX(virtual_addr)];
if (!(pte & PTE_US)) return false; // Supervisor at L1
// All levels have U/S=1
return true; // User accessible!
}
Modern 64-bit systems typically split the virtual address space:
x86-64 Canonical Address Space:
User Space (Lower Half):
0x0000000000000000 - 0x00007FFFFFFFFFFF
- U/S=1 in all page tables
- Accessible from Ring 3
- Contains: user code, data, heap, stack, shared libs
Non-Canonical Addresses (Hole):
0x0000800000000000 - 0xFFFF7FFFFFFFFFFF
- Invalid addresses
- Accessing triggers #GP fault
Kernel Space (Upper Half):
0xFFFF800000000000 - 0xFFFFFFFFFFFFFFFF
- U/S=0 in page tables
- Accessible only from Ring 0
- Contains: kernel code, data, device mappings
Why Split Address Space?
The Meltdown vulnerability (2018) allowed user code to speculatively read kernel memory. The mitigation: KPTI (Kernel Page-Table Isolation).
Before KPTI:
After KPTI:
KPTI Implementation:
// Simplified KPTI implementation
// Two sets of page tables per process
struct mm_struct {
pgd_t *user_pgd; // User page table root
pgd_t *kernel_pgd; // Kernel page table root
};
// System call entry with KPTI
void syscall_entry_with_kpti(void) {
// Running in user space with user_pgd in CR3
// 1. Save user CR3
uint64_t user_cr3 = read_cr3();
// 2. Switch to kernel page tables
uint64_t kernel_cr3 = (uint64_t)current->mm->kernel_pgd;
write_cr3(kernel_cr3);
// 3. Now can safely access all kernel memory
// ... execute syscall handler ...
// 4. Before returning, switch back to user page tables
write_cr3(user_cr3);
// 5. Return to user space
// User cannot see kernel memory anymore
}
// Performance cost: Two CR3 switches per syscall!
// CR3 write flushes TLB → expensive
KPTI Performance Impact:
// Benchmark: Syscall overhead with/without KPTI
void benchmark_kpti_overhead(void) {
// Without KPTI: 100-150 cycles per syscall
// With KPTI: 150-250 cycles per syscall
// Overhead: 50-100 cycles (50-100% slower!)
// Worst case: Syscall-heavy workload
// - Redis: 20-30% slowdown
// - PostgreSQL: 15-25% slowdown
// - Network I/O: 10-20% slowdown
// Best case: Compute-heavy workload
// - Scientific computing: <1% slowdown
// - Video encoding: <2% slowdown
}
KPTI Mitigation Strategies:
// 1. PCID (Process Context ID) - reduces TLB flush cost
// Instead of flushing TLB on CR3 write, tag TLB entries with PCID
// 2. INVPCID instruction - selective TLB invalidation
static inline void invpcid_flush_single(uint64_t pcid, uint64_t addr) {
struct {
uint64_t pcid;
uint64_t addr;
} desc = { pcid, addr };
uint64_t type = 0; // Type 0: invalidate one address in one PCID
asm volatile("invpcid %0, %1" :: "m" (desc), "r" (type));
// Much faster than a full TLB flush
}
// 3. CPU vulnerability detection
bool cpu_needs_kpti(void) {
// Check for Meltdown vulnerability
// Intel CPUs (most): Vulnerable
// AMD CPUs: Not vulnerable (no KPTI needed)
// ARM CPUs: Some vulnerable
if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
return false; // AMD not vulnerable
}
return true; // Enable KPTI
}
copy_to_user() and copy_from_user():
Kernel needs to copy data between user and kernel space:
// Kernel function: write() syscall
ssize_t sys_write(int fd, const char __user *buf, size_t count) {
// buf points to USER memory
// Kernel running in Ring 0
char kernel_buffer[4096];
// Can the kernel access user memory directly?
// Yes! Supervisor can access U/S=1 pages.
// But we must bound the size (or the copy overflows the stack
// buffer) and use copy_from_user() for safety:
if (count > sizeof(kernel_buffer))
count = sizeof(kernel_buffer);
if (copy_from_user(kernel_buffer, buf, count)) {
return -EFAULT; // Invalid user pointer
}
// copy_from_user implementation:
// - Checks buf is valid user address
// - Checks page is present and accessible
// - Safely copies, handling page faults
// - If fault occurs, returns error (no panic)
// Now write kernel_buffer to file...
}
// copy_from_user, conceptually. (The real Linux implementation is
// assembly with exception-table fixups; __try/__except is Windows
// structured exception handling and does not exist in the kernel.)
unsigned long copy_from_user(void *to, const void __user *from, unsigned long n) {
// Check the source range lies entirely within user address space
if (!access_ok(from, n))
return n; // Failed - invalid address
// Copy with fault handling: every load that may fault has an entry
// in the kernel's exception table. If a page fault occurs mid-copy,
// the fault handler redirects execution to fixup code that returns
// the number of bytes NOT copied - no panic, no oops.
return raw_copy_from_user(to, from, n); // 0 on success
}
SMAP (Supervisor Mode Access Prevention):
SMAP (introduced 2014) prevents kernel from accidentally accessing user memory:
// With SMAP enabled:
// - Kernel accessing U/S=1 pages triggers #PF
// - Must explicitly allow access with STAC/CLAC instructions
// x86-64 SMAP instructions:
static inline void stac(void) { // Set AC flag
asm volatile("stac" ::: "cc");
// Now kernel CAN access user pages
}
static inline void clac(void) { // Clear AC flag
asm volatile("clac" ::: "cc");
// Now kernel CANNOT access user pages
}
// copy_from_user with SMAP:
unsigned long copy_from_user_smap(void *to, const void __user *from, unsigned long n) {
if (!access_ok(from, n))
return n;
stac(); // Allow access to user pages
memcpy(to, from, n);
clac(); // Disallow access to user pages
return 0;
}
// Benefit: Prevents accidental kernel bugs
// Example: Kernel bug dereferences user pointer without validation
// Without SMAP: Exploitable (arbitrary read/write)
// With SMAP: Immediate crash (caught before exploitation)
Performance: SMAP overhead is negligible (<1%) because STAC/CLAC are very fast (1-2 cycles).
Modern processors provide sophisticated protection mechanisms beyond basic permission bits. These features implement defense-in-depth: even if attackers bypass one layer, additional protections prevent exploitation.
The Problem:
Even with the NX bit protecting the stack, attackers found ret2usr (return-to-user) attacks:
// Traditional attack (blocked by NX):
// 1. Overflow buffer
// 2. Overwrite return address → shellcode on stack
// 3. Stack has NX → FAIL
// ret2usr attack (bypasses NX):
// 1. Overflow kernel buffer
// 2. Overwrite return address → USER space address
// 3. User space IS executable
// 4. Kernel executes user shellcode → SUCCESS!
// Example exploit:
void kernel_vulnerable_function(char *user_data) {
char buffer[256];
strcpy(buffer, user_data); // Overflow!
// Attacker overwrites return address to 0x400000 (user space)
}
// Attacker's user space code at 0x400000:
void evil_code(void) {
// Runs with Ring 0 privileges!
commit_creds(prepare_kernel_cred(0)); // Escalate to root
}
SMEP Solution:
Prevents kernel (Ring 0) from executing code on user pages (U/S=1):
Hardware check on instruction fetch:
IF (CPL < 3 && PTE.U/S == 1 && fetching_instruction) THEN
RAISE PAGE_FAULT // Kernel cannot execute user pages!
END IF
Enabling/Checking SMEP:
#include <cpuid.h>
#define X86_CR4_SMEP (1UL << 20) // CR4 bit 20
// Check CPU support
bool cpu_has_smep(void) {
uint32_t eax, ebx, ecx, edx;
__cpuid_count(7, 0, eax, ebx, ecx, edx);
return (ebx & (1 << 7)) != 0; // CPUID.(EAX=7,ECX=0):EBX.SMEP[bit 7]
}
// Enable SMEP
void enable_smep(void) {
uint64_t cr4;
asm volatile("mov %%cr4, %0" : "=r" (cr4));
cr4 |= X86_CR4_SMEP;
asm volatile("mov %0, %%cr4" :: "r" (cr4));
}
// Check if SMEP is active
bool is_smep_enabled(void) {
uint64_t cr4;
asm volatile("mov %%cr4, %0" : "=r" (cr4));
return (cr4 & X86_CR4_SMEP) != 0;
}
Performance: SMEP has zero overhead - the check happens in parallel with instruction fetch.
The Problem:
Kernel must access user memory legitimately (e.g., copy_from_user). But unrestricted kernel access to user pages creates vulnerabilities:
// Vulnerable kernel code:
struct ops { void (*callback)(void); };
void kernel_use_object(struct ops *obj) {
obj->callback(); // No validation of where 'obj' points!
}
// Attack:
// 1. A kernel bug (e.g., use-after-free or corrupted pointer) makes
// 'obj' point into USER memory the attacker controls
// 2. Attacker fills that user page with a fake struct ops
// 3. Kernel reads attacker-controlled data → hijacked control flow
// Without SMAP: the kernel happily reads the user page
// With SMAP: the data access itself faults → attack stopped early
SMAP Solution:
Prevents kernel from accessing user pages unless explicitly allowed:
Hardware check on data access:
IF (CPL < 3 && PTE.U/S == 1 && EFLAGS.AC == 0) THEN
RAISE PAGE_FAULT // Kernel cannot access user pages!
END IF
Performance: SMAP overhead is <1% (STAC/CLAC are 1-2 cycles each).
Revolutionary memory safety feature - hardware-assisted memory tagging.
Concept:
Performance: 5-15% overhead (still much faster than software sanitizers).
Control-flow integrity against ROP/JOP attacks.
Performance: BTI overhead is <1% (check happens in parallel with branch).
Protection keys enable fast, fine-grained memory isolation within a single address space. Unlike page table permissions (which require expensive TLB flushes to change), protection keys can be modified in just a few cycles.
[This section covers Intel MPK in detail, including domain-based isolation, cross-domain communication, and performance comparisons showing 50-100× speedup over traditional mprotect.]
Trusted Execution Environments represent a fundamental shift in how we approach system security: rather than trying to secure an entire operating system (which may contain millions of lines of code), we create small, hardware-isolated compartments for executing security-critical code. This section explores the major TEE implementations and their role in modern memory protection.
A Trusted Execution Environment (TEE) is a secure area within a main processor that guarantees:
Key Insight: Unlike traditional OS security (which relies on software access control), TEEs use hardware-enforced isolation—even a compromised operating system or hypervisor cannot break into a TEE.
TEE Use Cases:
ARM TrustZone is the most widely deployed TEE technology, present in billions of mobile devices. It creates two parallel worlds within every ARM processor core.
TrustZone Architecture: Two Worlds
Normal World:
Secure World:
Secure Monitor (EL3):
Memory Protection: The NS Bit
TrustZone extends every memory transaction with a Non-Secure (NS) bit:
Normal World: All transactions have NS=1
Secure World: All transactions have NS=0
Hardware rule: NS=1 transactions cannot access NS=0 memory
TrustZone Address Space Controller (TZASC):
// TZASC configuration (simplified)
struct tzasc_region {
uint64_t base_addr;
uint64_t size;
uint32_t security; // Secure or Non-secure
uint32_t permissions; // Read, Write, Execute
};
// Example: Protect 128MB secure memory region
void configure_secure_memory(void) {
struct tzasc_region region = {
.base_addr = 0x80000000,
.size = 128 * 1024 * 1024, // 128 MB
.security = TZASC_SECURE, // Only accessible from Secure World
.permissions = TZASC_RW
};
tzasc_config_region(0, &region);
}
World Switching: The SMC Instruction
Transition from Normal World → Secure World requires Secure Monitor Call (SMC):
// From Normal World (Linux kernel driver)
// Request service from Secure World
void trustzone_call_secure_service(uint32_t service_id,
uint32_t arg1, uint32_t arg2) {
struct arm_smccc_res res;
// Trigger SMC instruction - enters Secure Monitor at EL3
arm_smccc_smc(service_id, arg1, arg2, 0, 0, 0, 0, 0, &res);
// Returns here after Secure World completes
// res.a0 contains return value
}
// In Secure Monitor (EL3)
void smc_handler(uint32_t smc_fid, uint64_t x1, uint64_t x2, uint64_t x3) {
// Save Normal World context
save_cpu_context(NON_SECURE);
// Switch to Secure World
switch_to_secure_world();
// Restore Secure World context
restore_cpu_context(SECURE);
// Forward call to Secure World OS
return secure_world_dispatch(smc_fid, x1, x2, x3);
}
Context Switch Performance:
TrustZone Page Tables
Each world has its own page tables:
// ARM64 TTBR registers (Translation Table Base Registers)
// Normal World uses:
TTBR0_EL1 // User space page table (Normal World)
TTBR1_EL1 // Kernel space page table (Normal World)
// Secure World uses:
TTBR0_EL1 // User space page table (Secure World)
TTBR1_EL1 // Kernel space page table (Secure World)
// Switching worlds changes which page tables are active
void switch_to_secure_world(void) {
// SCR_EL3.NS = 0 (enter Secure state)
write_scr_el3(read_scr_el3() & ~SCR_NS_BIT);
// Now TTBR0/1_EL1 point to Secure World page tables
}
TrustZone Implementations:
OP-TEE (Open Portable TEE):
Qualcomm QSEE (Qualcomm Secure Execution Environment):
Trustonic Kinibi:
Apple Secure Enclave Processor (SEP):
Intel Software Guard Extensions (SGX) takes a radically different approach: instead of one system-wide secure world, SGX creates per-application secure enclaves that don't trust the OS at all.
SGX Philosophy:
SGX Threat Model:
SGX Architecture
Enclave Page Cache (EPC):
Memory Encryption Engine (MEE):
SGX encrypts enclave memory to protect against:
MEE Design:
Merkle Tree for Integrity:
Root (on-chip)
/ \
/ \
L2 L2
/ \ / \
L1 L1 L1 L1
| | | |
[Data Pages...]
Each page has:
SGX Instructions:
// Create enclave
#include <sgx.h>
// ECREATE - Create enclave
sgx_status_t sgx_create_enclave(
const char *file_name,
int debug,
sgx_launch_token_t *token,
int *updated,
sgx_enclave_id_t *eid,
sgx_misc_attribute_t *misc_attr
);
// EADD - Add page to enclave
// EINIT - Initialize enclave (finalize)
// EENTER - Enter enclave from untrusted code
// EEXIT - Exit enclave to untrusted code
// Example: Calling enclave function
void call_enclave_function(sgx_enclave_id_t eid) {
int result;
sgx_status_t status;
// ECALL: Enter enclave
status = ecall_trusted_function(eid, &result, arg1, arg2);
if (status != SGX_SUCCESS) {
printf("Enclave call failed: %x\n", status);
}
}
SGX Attestation:
SGX provides remote attestation to prove an enclave is genuine:
Attestation Report Contents:
SGX Sealing (Persistent Storage):
Enclaves can't directly access disk. Solution: seal data with enclave-specific key:
// Seal data (encrypt for storage)
sgx_status_t sgx_seal_data(
uint32_t additional_MACtext_length,
const uint8_t *p_additional_MACtext,
uint32_t text2encrypt_length,
const uint8_t *p_text2encrypt,
uint32_t sealed_data_size,
sgx_sealed_data_t *p_sealed_data
);
// Unseal data (decrypt from storage)
sgx_status_t sgx_unseal_data(
const sgx_sealed_data_t *p_sealed_data,
uint8_t *p_additional_MACtext,
uint32_t *p_additional_MACtext_length,
uint8_t *p_decrypted_text,
uint32_t *p_decrypted_text_length
);
// Example usage
void save_secret(const char *secret, size_t len) {
uint32_t sealed_size = sgx_calc_sealed_data_size(0, len);
sgx_sealed_data_t *sealed = malloc(sealed_size);
// Seal with enclave identity
sgx_seal_data(0, NULL, len, (uint8_t*)secret, sealed_size, sealed);
// Save to untrusted storage (file, database)
write_to_file("sealed_data.bin", sealed, sealed_size);
free(sealed);
}
SGX Limitations:
SGX Performance:
// Benchmark: SGX overhead
void benchmark_sgx(void) {
const int iterations = 1000000;
// Native execution
uint64_t start = rdtsc();
for (int i = 0; i < iterations; i++) {
compute_native();
}
uint64_t native_cycles = rdtsc() - start;
// SGX enclave execution
start = rdtsc();
for (int i = 0; i < iterations; i++) {
ecall_compute_enclave(eid);
}
uint64_t sgx_cycles = rdtsc() - start;
printf("Native: %lu cycles\n", native_cycles / iterations);
printf("SGX: %lu cycles\n", sgx_cycles / iterations);
printf("Overhead: %.1f%%\n",
100.0 * (sgx_cycles - native_cycles) / native_cycles);
}
// Typical results:
// Compute-heavy: 5-15% overhead (MEE encryption)
// I/O-heavy: 50-200% overhead (OCALL frequency)
// Memory-intensive: 20-100% overhead (EPC paging)
| Feature | ARM TrustZone | Intel SGX |
|---|---|---|
| **Granularity** | System-wide | Per-application |
| **Trust Model** | Trust Secure World OS | Trust only CPU |
| **Protected Memory** | GB+ (entire regions) | 64-256 MB (EPC) |
| **Memory Encryption** | Optional (implementation) | Always (MEE) |
| **World Switch** | 2-5 μs (SMC) | 8-12K cycles (ECALL) |
| **OS Trust** | Requires trusted OS | No trust needed |
| **Deployment** | 10+ billion devices | Limited (server-only now) |
| **Use Case** | Mobile, embedded | Cloud, datacenter |
| **Multiple TEEs** | One Secure World | Many enclaves |
| **Side Channels** | Less vulnerable | Highly vulnerable |
| **Attestation** | Device attestation | Enclave attestation |
| **Typical Overhead** | <1% (infrequent switch) | 5-15% (encryption) |
When to Use TrustZone:
When to Use SGX:
Graviton (Microsoft Research, OSDI 2018) brings TEE concepts to GPUs—crucial for AI/ML workloads on sensitive data.
Why GPU TEEs Matter:
Graviton Architecture:
Key Graviton Innovations:
Graviton Performance:
Overhead: 17-33% (primarily encryption/decryption)
Breakdown:
- Memory encryption: 10-20%
- Command encryption: 3-5%
- Context switching: 2-4%
- Page table management: 2-4%
Still 50-80× faster than CPU-only confidential computing!
Graviton Limitations:
Qualcomm SPU (Secure Processing Unit):
Samsung Knox:
AMD PSP (Platform Security Processor):
Apple Secure Enclave:
RISC-V Keystone:
Known Vulnerabilities:
Mitigation Strategies:
Mobile Payments:
User taps phone → NFC controller →
Secure World TA → Decrypt token →
Transmit encrypted → Terminal
Protected: Payment token, encryption keys
TrustZone ensures: Token never exposed to Normal World
Streaming DRM (Widevine L1):
Encrypted stream → Normal World app →
Secure World TA → Decrypt in TEE →
Secure video path → Display
Protected: Decryption keys, decrypted frames
TrustZone ensures: HD/4K content protection
Biometric Authentication:
Fingerprint sensor → Secure World driver →
TA processes biometric → Match in TEE →
Return yes/no to Normal World
Protected: Biometric templates, matching algorithm
TrustZone ensures: Templates never leave Secure World
Cloud Confidential Computing:
Encrypted data → SGX enclave →
Decrypt and process → Encrypt result →
Return to client
Protected: Data, computation, keys
SGX ensures: Cloud provider cannot see data
Trends:
Performance Improvements:
Future of TEE Technology:
Copy-On-Write is a memory optimization that leverages page-level protection to defer expensive memory copies until absolutely necessary. It's fundamental to efficient process creation and memory management.
Without COW:
// fork() without COW
pid_t pid = fork();
// Parent process has 100 MB of data
// fork() must copy all 100 MB → child process
// Time: ~50 ms to copy 100 MB
// Memory: 200 MB total (100 MB × 2)
With COW:
// fork() with COW
pid_t pid = fork();
// 1. Child shares parent's pages (read-only)
// 2. Copy only happens on first write
// Time: ~0.1 ms (just page table copy)
// Memory: 100 MB (until write occurs)
Step 1: Mark Pages Read-Only
// During fork(), mark all writable pages as read-only
void cow_fork_setup(struct mm_struct *parent, struct mm_struct *child) {
for (each page in parent address space) {
if (page_is_writable(page)) {
// Mark parent's page read-only
page_clear_write_bit(parent, page);
// Child shares same physical page (also read-only)
child_pte = parent_pte & ~PTE_W; // Clear write bit
child_pte |= PTE_COW; // Mark as COW (OS-specific flag)
// Increment page reference count
page->ref_count++;
}
}
// Flush TLB so read-only protection takes effect
flush_tlb();
}
Step 2: Handle Write Fault
// Page fault handler for COW
void page_fault_handler(unsigned long fault_addr, unsigned long error_code) {
bool is_write = error_code & PF_WRITE;
pte_t *pte = get_pte(current->mm, fault_addr);
if (is_write && pte_is_cow(*pte)) {
// COW fault!
handle_cow_fault(fault_addr, pte);
return;
}
// ... other fault handling ...
}
void handle_cow_fault(unsigned long fault_addr, pte_t *pte) {
struct page *old_page = pte_page(*pte);
// Check reference count
if (page_count(old_page) == 1) {
// Only reference: just make writable in place
*pte = pte_mkwrite(*pte); // these helpers return the updated PTE value
*pte = pte_clear_cow(*pte);
flush_tlb_page(fault_addr);
return;
}
// Multiple references: must copy
struct page *new_page = alloc_page(GFP_KERNEL);
// Copy page contents
copy_page(page_address(new_page), page_address(old_page));
// Update PTE to point to new page
set_pte(pte, mk_pte(new_page, PAGE_READWRITE));
// Decrement old page reference count
put_page(old_page);
// Flush TLB
flush_tlb_page(fault_addr);
}
Example: fork() with COW
void demonstrate_cow(void) {
char *data = malloc(4096);
memset(data, 'A', 4096);
pid_t pid = fork();
if (pid == 0) {
// Child process
printf("Child: data[0] = %c\n", data[0]); // Read: no copy
data[0] = 'B'; // Write: triggers COW!
// Page fault handler:
// 1. Allocate new page
// 2. Copy old page to new page
// 3. Update child's PTE to new page
// 4. Resume execution
printf("Child: data[0] = %c\n", data[0]); // 'B'
} else {
// Parent process
sleep(1);
printf("Parent: data[0] = %c\n", data[0]); // Still 'A'
}
}
Faster Process Creation:
// Benchmark: fork() performance
void benchmark_fork(void) {
// Allocate 100 MB
size_t size = 100 * 1024 * 1024;
char *data = malloc(size);
memset(data, 0, size);
// Without COW: ~50 ms (must copy 100 MB)
// With COW: ~0.1 ms (just copy page tables)
uint64_t start = rdtsc();
pid_t pid = fork();
uint64_t end = rdtsc();
if (pid == 0) {
// Child: measure actual COW overhead
start = rdtsc();
memset(data, 0xFF, size); // Trigger COW for all pages
end = rdtsc();
// COW overhead: ~50 ms (same as full copy, but deferred)
exit(0);
}
wait(NULL);
}
// Result: fork() itself is 500× faster with COW!
Memory Savings:
// Common case: exec() after fork()
pid_t pid = fork();
if (pid == 0) {
// Child immediately calls exec()
execve("/bin/sh", argv, envp);
// exec() replaces address space anyway
// Without COW: Wasted 100% of copy time/memory
// With COW: No copy occurred - pure win!
}
Special optimization: zero page sharing
// Global zero page (read-only, shared by all processes)
struct page *zero_page;
void init_zero_page(void) {
zero_page = alloc_page(GFP_KERNEL);
clear_page(page_address(zero_page));
// Mark as read-only, COW
// All processes share this page for zero-initialized memory
}
// When process requests zero-initialized memory:
void *zero_page_map(size_t size) {
for (each page in size) {
// Point PTE to global zero page (read-only, COW)
pte = mk_pte(zero_page, PAGE_READONLY | PAGE_COW);
// Increment zero page reference count
get_page(zero_page);
}
// On first write:
// 1. Allocate real page
// 2. Copy zero page (trivial: memset 0)
// 3. Update PTE
}
// Memory savings: Huge!
// Example: Process maps 1 GB of zeros
// Without zero page sharing: 1 GB allocated
// With zero page sharing: 4 KB (until written)
Anonymous mmap with MAP_PRIVATE:
// Private mapping uses COW
void *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// Initially: Points to zero page (COW)
// On first write: Allocates real page, copies zeros
File-backed mmap with MAP_PRIVATE:
// File mapping uses COW
int fd = open("file.txt", O_RDONLY);
void *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
MAP_PRIVATE, fd, 0);
// Initially: Points to page cache (read-only, COW)
// On write:
// 1. Allocate private page
// 2. Copy file data to private page
// 3. Update PTE
// 4. Write goes to private page (not file)
close(fd);
munmap(addr, 4096);
// Changes are discarded (MAP_PRIVATE)
COW Overhead Breakdown:
// Page fault handling for COW write:
// 1. Fault entry: ~100 cycles
// 2. Page allocation: ~500 cycles
// 3. Page copy (4KB): ~2000 cycles (cold cache)
// ~500 cycles (hot cache)
// 4. TLB flush: ~50 cycles
// 5. Fault return: ~100 cycles
// Total: ~1250 (hot cache) to ~2750 (cold cache) cycles (~0.4-1 μs on 3 GHz CPU)
// Amortized cost:
// - fork() + no writes: Essentially free
// - fork() + write all: Same cost as direct copy (but deferred)
// - fork() + write few: Huge win (only copy what's needed)
Memory ordering isn't just about performance—weak memory models can create security vulnerabilities if not properly understood and managed.
Different architectures provide different memory ordering guarantees:
x86-64: Total Store Order (TSO)
ARM64: Weak Ordering
RISC-V: RVWMO (Weak Memory Ordering)
Example 1: Broken Lock Implementation
// Incorrect lock implementation (no memory barriers)
struct spinlock {
volatile int locked;
};
void bad_lock(struct spinlock *lock) {
while (__sync_lock_test_and_set(&lock->locked, 1))
cpu_relax();
// Fragile: only the implicit acquire semantics of test_and_set protect
// entry; nothing here provides the release ordering needed at unlock.
}
void bad_unlock(struct spinlock *lock) {
// BUG: No memory barrier!
lock->locked = 0;
// Critical section stores might leak past unlock!
}
// Failure scenario:
int secure_data = 0;
void thread1(struct spinlock *lock) {
    bad_lock(lock);
    secure_data = 42; // Store may become visible after bad_unlock() below!
    bad_unlock(lock);
}
void thread2(struct spinlock *lock) {
    bad_lock(lock);
    int leaked = secure_data; // May read a stale value despite holding the lock!
    bad_unlock(lock);
}
Correct Implementation:
void correct_lock(struct spinlock *lock) {
while (__sync_lock_test_and_set(&lock->locked, 1))
cpu_relax();
// Memory barrier: prevent reordering
__atomic_thread_fence(__ATOMIC_ACQUIRE);
}
void correct_unlock(struct spinlock *lock) {
// Memory barrier: prevent reordering
__atomic_thread_fence(__ATOMIC_RELEASE);
lock->locked = 0;
}
Example 2: Publish-Subscribe Race
// Vulnerable code:
struct message {
int ready;
char data[256];
};
struct message *msg;
// Publisher
void publish(const char *text) {
strcpy(msg->data, text); // Write data
msg->ready = 1; // Signal ready
// Without barrier: ready might be visible before data!
}
// Subscriber
void subscribe(void) {
while (!msg->ready) // Wait for ready
cpu_relax();
printf("%s\n", msg->data); // Read data
// Without barrier: might read stale data!
}
// Correct implementation:
void publish_correct(const char *text) {
strcpy(msg->data, text);
__atomic_thread_fence(__ATOMIC_RELEASE); // Release barrier
msg->ready = 1;
}
void subscribe_correct(void) {
while (!msg->ready)
cpu_relax();
__atomic_thread_fence(__ATOMIC_ACQUIRE); // Acquire barrier
printf("%s\n", msg->data);
}
x86-64 Barriers:
// Compiler barrier (prevent compiler reordering)
#define barrier() asm volatile("" ::: "memory")
// mfence: Full memory barrier
static inline void mfence(void) {
asm volatile("mfence" ::: "memory");
}
// sfence: Store fence
static inline void sfence(void) {
asm volatile("sfence" ::: "memory");
}
// lfence: Load fence
static inline void lfence(void) {
asm volatile("lfence" ::: "memory");
}
ARM64 Barriers:
// DMB: Data Memory Barrier
static inline void dmb(void) {
asm volatile("dmb sy" ::: "memory");
}
// DSB: Data Synchronization Barrier
static inline void dsb(void) {
asm volatile("dsb sy" ::: "memory");
}
// ISB: Instruction Synchronization Barrier
static inline void isb(void) {
asm volatile("isb" ::: "memory");
}
RISC-V Barriers:
// FENCE instruction
static inline void fence(void) {
asm volatile("fence rw, rw" ::: "memory");
}
// FENCE.I: Instruction fence
static inline void fence_i(void) {
asm volatile("fence.i" ::: "memory");
}
Clearing Secrets:
// Insecure: compiler might optimize away
void clear_secret_bad(char *secret, size_t len) {
memset(secret, 0, len);
// Compiler sees: value never read after this
// Optimization: removes memset!
}
// Secure: volatile or explicit barrier
void clear_secret_good(volatile char *secret, size_t len) {
memset((void *)secret, 0, len);
// volatile forces memory write
}
// Or use explicit memory barrier
void clear_secret_barrier(char *secret, size_t len) {
memset(secret, 0, len);
__atomic_thread_fence(__ATOMIC_SEQ_CST); // Force completion
}
// Or use platform-specific secure clear
#ifdef __STDC_LIB_EXT1__
memset_s(secret, len, 0, len); // C11 secure memset
#endif
Confidential computing extends TEE concepts to entire virtual machines, providing strong isolation even in untrusted cloud environments.
Architecture:
Key Features:
Example: Creating SEV-SNP VM
// Simplified SEV-SNP VM creation (Linux/KVM)
#include <linux/kvm.h>
#include <linux/sev.h>
int create_sev_snp_vm(void) {
int kvm_fd = open("/dev/kvm", O_RDWR);
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
// Enable SEV-SNP
struct kvm_sev_cmd cmd = {
.id = KVM_SEV_SNP_INIT,
.data = 0,
};
ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
// Memory regions are automatically encrypted
struct kvm_userspace_memory_region region = {
.slot = 0,
.flags = KVM_MEM_PRIVATE, // Encrypted
.guest_phys_addr = 0,
.memory_size = 1024 * 1024 * 1024, // 1 GB
.userspace_addr = (uint64_t)guest_memory,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
return vm_fd;
}
Performance: SEV-SNP overhead is 1-5% for most workloads.
Architecture:
Key Differences from SEV-SNP:
| Feature | AMD SEV-SNP | Intel TDX |
|---|---|---|
| **Encryption** | AES-128 | MKTME (AES-XTS-128) |
| **CPU Privilege** | PSP (separate processor) | SEAM (new CPU mode) |
| **Page Tables** | RMP | SEPT (Secure EPT) |
| **Shared Memory** | Via encryption | Explicit shared pages |
| **Overhead** | 1-5% | 2-5% |
Four-World Model:
Granule Protection Table (GPT):
// GPT entry (per 4KB granule)
enum gpt_state {
GPT_NORMAL = 0, // Normal world access
GPT_SECURE = 1, // Secure world access
GPT_REALM = 2, // Realm world access
GPT_ROOT = 3, // Root world access
};
// Hardware enforces:
// - Normal world cannot access Realm granules
// - Realm cannot access Normal/Secure granules
// - Root manages all granules
AMD's memory encryption technologies protect data at rest in DRAM, defending against physical memory attacks like cold boot attacks and DMA attacks from malicious hardware.
Traditional Threat Model:
// Traditional security assumes:
// - Software can be malicious → Use page tables for isolation ✓
// - OS can be compromised → Use hypervisor ✓
// - Hypervisor can be evil → Use confidential computing ✓
// But what if attacker has PHYSICAL access?
// 1. Cold boot attack: Freeze RAM, remove it, read contents
// 2. DMA attack: Malicious PCIe device reads all memory
// 3. Memory bus snooping: Hardware tap on DDR bus
// 4. JTAG debugging: Hardware debugger reads DRAM
// Solution: Encrypt memory in the DRAM itself!
System-wide transparent memory encryption introduced with AMD Zen (2017).
Architecture:
Enabling SME:
#include <cpuid.h>
// Check CPU support for SME
bool cpu_has_sme(void) {
uint32_t eax, ebx, ecx, edx;
// CPUID Fn8000_001F - Encryption Memory Capabilities
__cpuid(0x8000001F, eax, ebx, ecx, edx);
// EAX bit 0: SME supported
// EAX bit 1: SEV supported
// EBX bits 5-0: C-bit location in the page table entry
// ECX: number of simultaneously encrypted guests supported
return (eax & (1 << 0)) != 0;
}
// Enable SME system-wide
void enable_sme(void) {
if (!cpu_has_sme()) {
printk("SME not supported on this CPU\n");
return;
}
// Read MSR_AMD64_SYSCFG
uint64_t syscfg = rdmsr(MSR_AMD64_SYSCFG);
// Set MEM_ENCRYPT_EN bit (bit 23)
syscfg |= (1ULL << 23);
// Write back to enable SME
wrmsr(MSR_AMD64_SYSCFG, syscfg);
// From now on, all memory with C-bit=1 is encrypted
printk("SME enabled successfully\n");
}
C-bit in Page Table Entries:
AMD repurposes one of the physical address bits as the C-bit (Ciphertext bit):
Using the C-bit:
// Get C-bit position from CPUID
uint32_t get_c_bit_position(void) {
uint32_t eax, ebx, ecx, edx;
__cpuid(0x8000001F, eax, ebx, ecx, edx);
return ebx & 0x3F; // Bits 5-0
}
// Create encrypted page table entry
uint64_t make_encrypted_pte(uint64_t phys_addr) {
uint32_t c_bit = get_c_bit_position();
uint64_t pte = phys_addr & 0x000FFFFFFFFFF000ULL;
pte |= PTE_P | PTE_W | PTE_NX; // Present, writable, no-execute
pte |= (1ULL << c_bit); // Set C-bit for encryption
return pte;
}
// Kernel decides which pages to encrypt
void setup_encrypted_memory(void) {
// Typically encrypt:
// - All kernel code and data
// - All user process memory
// - Page tables themselves
// Leave unencrypted:
// - DMA buffers (devices need plaintext)
// - Shared memory with devices
// - Boot-time structures
}
Encryption Algorithm:
// SME uses AES-128-XTS mode (similar to disk encryption)
// Each 16-byte block encrypted separately
// Encryption:
// ciphertext = AES-128(plaintext, key, tweak)
// where:
// key = CPU-generated 128-bit key (not accessible to software)
// tweak = physical_address ^ other_factors
// Example (conceptual):
void sme_encrypt_block(uint8_t *plaintext, uint64_t phys_addr,
uint8_t *ciphertext) {
uint8_t key[16]; // Hidden in CPU, can't be read
uint64_t tweak = phys_addr; // Unique per physical address
aes_128_xts_encrypt(plaintext, key, tweak, ciphertext);
}
// Important: Same physical address always uses same tweak
// This allows: read → decrypt → modify → encrypt → write
// Without needing to track what's encrypted
Per-VM memory encryption - extension of SME for virtualized environments.
Key Differences from SME:
| Feature | SME | SEV |
|---|---|---|
| **Scope** | System-wide | Per-VM |
| **Keys** | 1 key for entire system | Unique key per VM |
| **Hypervisor** | Can decrypt | Cannot decrypt guest memory |
| **Use Case** | Physical security | Cloud multi-tenancy |
| **Performance** | <1% | 1-3% |
SEV Architecture:
// Each VM gets unique encryption key
struct sev_vm {
uint32_t asid; // Address Space ID (key index)
uint8_t key[16]; // AES-128 key (in secure processor)
bool encrypted; // Is this VM encrypted?
};
// Create encrypted VM
int create_sev_vm(struct kvm_vm *vm) {
// Allocate ASID (limited resource, typically 16-256 per CPU)
int asid = allocate_asid();
if (asid < 0)
return -EBUSY; // No available ASIDs
// Ask AMD Secure Processor (PSP) to generate key
struct sev_cmd_activate cmd = {
.asid = asid,
.handle = vm->handle,
};
// PSP generates random AES-128 key for this VM
// Key is stored in PSP, never visible to hypervisor
sev_issue_cmd(SEV_CMD_ACTIVATE, &cmd);
// Now all memory accesses by this VM use its unique key
vm->asid = asid;
vm->encrypted = true;
return 0;
}
Automatic encryption without software involvement:
// TSME: All memory automatically encrypted
// - No C-bit needed in page tables
// - Hypervisor doesn't choose what to encrypt
// - Everything encrypted by default
// - Simplifies software (no PTE management)
// Enable TSME (BIOS setting, not OS)
// Once enabled:
// - All DRAM is encrypted
// - No software changes needed
// - Protects against physical attacks
// - But: Cannot selectively decrypt (e.g., for DMA)
Measured Overhead:
// Benchmark: Memory-intensive workloads
// 1. SME (system-wide encryption):
// - CPU overhead: <0.5% (AES-NI acceleration)
// - Memory bandwidth: ~1-2% (encryption/decryption)
// - Latency: +0-2 cycles per access
// - Overall: <1% for most workloads
// 2. SEV (per-VM encryption):
// - Additional ASID switching overhead
// - More context switches (key changes)
// - Overall: 1-3% for typical VMs
// 3. SEV-SNP (with integrity):
// - RMP (Reverse Map Table) checks
// - Memory overhead: ~1-2% of RAM
// - Performance overhead: 1-5%
void benchmark_sme(void) {
const size_t size = 1024 * 1024 * 1024; // 1 GB
char *buf = malloc(size);
// Without SME: 10.5 GB/s
// With SME: 10.3 GB/s (~2% slower)
// Bandwidth test
uint64_t start = rdtsc();
memset(buf, 0, size);
uint64_t end = rdtsc();
printf("Memory bandwidth: %.2f GB/s\n",
(size / 1e9) / ((end - start) / 3.0e9)); // assumes a 3 GHz TSC
}
Why So Fast?
What SME/SEV Protects Against:
// ✓ Cold boot attacks
// Attacker freezes RAM, removes it, inserts into reader
// → DRAM contents are encrypted, useless without key
// ✓ DMA attacks from malicious devices
// PCIe device tries to read kernel memory
// → Reads encrypted data, no decryption key
// ✓ Memory bus snooping
// Hardware probe on DDR bus
// → All data on bus is encrypted
// ✓ JTAG/debugging attacks
// Hardware debugger reads DRAM
// → Gets encrypted data only
What SME/SEV Does NOT Protect Against:
// ✗ Software attacks (buffer overflows, etc.)
// Encryption doesn't prevent code execution exploits
// ✗ Side-channel attacks (cache timing, Spectre, etc.)
// Encrypted memory still vulnerable to microarchitectural attacks
// ✗ Replay attacks (without SEV-SNP)
// Attacker can replay old encrypted memory contents
// ✗ Hypervisor compromising unencrypted VM data
// If VM memory isn't encrypted (C-bit=0), hypervisor can read it
Linux Kernel Support:
// Boot parameter: mem_encrypt=on
// Kernel automatically encrypts all memory
// Check if running with SME (there is no CR4 flag for SME;
// activation is reported by MSR_AMD64_SYSCFG bit 23)
bool is_sme_active(void) {
    return (rdmsr(MSR_AMD64_SYSCFG) & (1ULL << 23)) != 0;
}
// DMA buffer allocation (must be unencrypted)
void *dma_alloc_coherent(struct device *dev, size_t size) {
    int order = get_order(size); // pages needed, as a power of two
    void *virt = alloc_pages(GFP_KERNEL, order);
    // Clear C-bit for DMA buffers (devices need plaintext)
    pte_t *pte = get_pte(virt);
    *pte &= ~(1ULL << c_bit_position);
    flush_tlb(); // make the cleared C-bit visible to the page walker
    return virt;
}
| Feature | AMD SME/SEV | Intel TME/MKTME |
|---|---|---|
| **First Appeared** | 2017 (Zen) | 2019 (Ice Lake) |
| **Encryption** | AES-128-XTS | AES-XTS-128/256 |
| **Granularity** | Per-page (C-bit) | Per-page (KeyID) |
| **VM Isolation** | SEV/SEV-ES/SEV-SNP | TDX |
| **Keys** | PSP manages | CPU manages |
| **Integrity** | SEV-SNP only | TDX includes |
| **Performance** | <1-5% | 1-5% |
RISC-V provides flexible, modular security features through its extensible ISA design. Unlike x86 and ARM which evolved security features over decades, RISC-V was designed from the ground up with security modularity in mind.
Modular Design:
Base ISA: RV32I or RV64I (minimal, required)
↓
Privilege Modes: M, S, U (optional: Hypervisor)
↓
PMP: Physical Memory Protection (M-mode only)
↓
Extensions:
- Cryptography (Zk*)
- Vector (V)
- Bit Manipulation (B)
- Control Flow Integrity (Zicfiss)
Each component is optional and composable, allowing implementations from tiny embedded systems to large servers.
M-mode's security boundary enforcement.
Why PMP Exists:
// Problem: M-mode is the highest privilege level
// M-mode firmware runs before OS loads
// But M-mode needs to protect itself from S-mode OS!
// Example vulnerability without PMP:
void malicious_s_mode_code(void) {
// S-mode OS tries to overwrite M-mode firmware
uint64_t *m_mode_code = (uint64_t *)0x80000000;
*m_mode_code = 0xdeadbeef; // Overwrite M-mode!
// Without PMP: This succeeds! OS compromises firmware.
// With PMP: This triggers exception, OS crashes.
}
PMP Configuration:
RISC-V provides up to 64 PMP entries (implementation-dependent, typically 8-16):
// PMP CSRs (Control and Status Registers)
// - pmpaddr0-pmpaddr63: Address registers
// - pmpcfg0-pmpcfg15: Configuration registers (4 entries each)
// PMP permissions
#define PMP_R 0x01 // Read
#define PMP_W 0x02 // Write
#define PMP_X 0x04 // Execute
#define PMP_L 0x80 // Lock (cannot be changed until reset)
// PMP address matching modes
#define PMP_OFF 0x00 // Disabled
#define PMP_TOR 0x08 // Top of Range
#define PMP_NA4 0x10 // Naturally Aligned 4-byte
#define PMP_NAPOT 0x18 // Naturally Aligned Power-of-Two
// Configure PMP region
void pmp_set_region(int region, uint64_t addr, uint64_t size, uint8_t perm) {
// NAPOT address encoding: base>>2 with the low log2(size)-3 bits set
// (base must be size-aligned and size a power of two)
uint64_t pmpaddr = (addr >> 2) | ((size >> 3) - 1);
// Configuration: permissions | addressing mode
uint8_t pmpcfg = perm | PMP_NAPOT;
// Write to appropriate CSRs
switch (region) {
case 0: {
    asm volatile("csrw pmpaddr0, %0" :: "r" (pmpaddr));
    // pmpcfg0 contains entries 0-3 (8 bits each)
    uint64_t cfg0 = read_csr(pmpcfg0);
    cfg0 = (cfg0 & ~0xFFULL) | pmpcfg;
    asm volatile("csrw pmpcfg0, %0" :: "r" (cfg0));
    break;
}
// ... more regions
}
}
Example: Protecting M-mode Firmware
// Secure boot: M-mode sets up PMP before jumping to S-mode
void m_mode_setup_security(void) {
// Region 0: M-mode code (0x80000000 - 0x80100000)
// Readable and executable by M-mode only
// S-mode and U-mode cannot access
pmp_set_region(0,
0x80000000, // Start address
0x100000, // Size (1 MB)
PMP_R | PMP_X | PMP_L); // R-X, locked
// Region 1: M-mode data (0x80100000 - 0x80200000)
// Read-write for M-mode only
pmp_set_region(1,
0x80100000,
0x100000,
PMP_R | PMP_W | PMP_L); // RW-, locked
// Region 2: RAM for S-mode and U-mode (0x80200000 - 0x88000000)
// Full access for all modes
// (This range is not a naturally aligned power of two, so a real
// implementation would configure it in TOR mode rather than NAPOT)
pmp_set_region(2,
0x80200000,
0x7E00000, // ~126 MB
PMP_R | PMP_W | PMP_X); // RWX
// Region 3: Device memory (0x40000000 - 0x50000000)
// Read-write, no execute
pmp_set_region(3,
0x40000000,
0x10000000,
PMP_R | PMP_W); // RW-
// Now jump to S-mode
// S-mode cannot modify these PMP settings (locked)
jump_to_s_mode();
}
PMP Permission Checking:
// Hardware checks PMP on EVERY memory access from S-mode and U-mode
// M-mode can always access everything (PMP doesn't restrict M-mode)
// Pseudocode for PMP check:
bool pmp_check_access(uint64_t addr, access_type_t type, priv_mode_t mode) {
// M-mode bypasses PMP
if (mode == M_MODE)
return true;
// Find matching PMP entry (check in order)
for (int i = 0; i < num_pmp_entries; i++) {
if (pmp_entry_matches(i, addr)) {
uint8_t perm = pmp_get_permissions(i);
// Check permissions
if (type == READ && !(perm & PMP_R))
return false; // Access fault
if (type == WRITE && !(perm & PMP_W))
return false;
if (type == EXECUTE && !(perm & PMP_X))
return false;
return true; // Access allowed
}
}
// No matching entry: deny by default for S/U mode
return false;
}
PMP Addressing Modes:
// 1. TOR (Top of Range)
// Range: [pmpaddr[i-1], pmpaddr[i])
// Good for: Arbitrary ranges
// 2. NA4 (Naturally Aligned 4-byte)
// Address: pmpaddr << 2
// Size: 4 bytes
// Good for: Single word protection
// 3. NAPOT (Naturally Aligned Power-of-Two)
// Encodes both address and size in pmpaddr
// Size must be power of 2
// Good for: Standard memory regions
// Example: NAPOT encoding for 64 KB region at 0x80000000
void pmp_napot_example(void) {
uint64_t addr = 0x80000000;
uint64_t size = 65536; // 64 KB (power of two, base is size-aligned)
// NAPOT encoding: base>>2 with the low log2(size)-3 bits set to 1
uint64_t pmpaddr = (addr >> 2) | ((size >> 3) - 1);
// pmpaddr = 0x20000000 | 0x1FFF = 0x20001FFF
asm volatile("csrw pmpaddr0, %0" :: "r" (pmpaddr));
// This protects region [0x80000000, 0x80010000)
}
Ratified extension addressing PMP limitations.
Key Improvements:
// 1. Rule Locking Clarification
// Original PMP: Locked rules block all later rules
// ePMP: Locked rules only apply to themselves
// 2. Denied-by-Default Mode
// Original PMP: Unmatched access denied only if any PMP active
// ePMP: New CSR bit makes unmatched access always denied
// 3. M-mode Lockdown
// ePMP adds MSECCFG CSR (Machine Security Configuration)
// MSECCFG register bits:
#define MSECCFG_MML 0x01 // Machine Mode Lockdown
#define MSECCFG_MMWP 0x02 // Machine Mode Whitelist Policy
#define MSECCFG_RLB 0x04 // Rule Locking Bypass
// Enable ePMP security
void epmp_enable_security(void) {
uint64_t mseccfg = 0;
// Enable M-mode lockdown
// Now M-mode also checks PMP (not just S/U mode)
mseccfg |= MSECCFG_MML;
// Enable whitelist policy
// Unmatched addresses are denied (not allowed)
mseccfg |= MSECCFG_MMWP;
asm volatile("csrw mseccfg, %0" :: "r" (mseccfg));
// Now even M-mode must follow PMP rules!
// Provides defense-in-depth for firmware bugs
}
ePMP Use Case: Secure Boot
// Secure boot with ePMP protection
void secure_boot_with_epmp(void) {
// 1. Set up ePMP to protect boot ROM
pmp_set_region(0, 0x10000, 0x10000,
PMP_R | PMP_X | PMP_L); // ROM: R-X, locked
// 2. Enable ePMP with M-mode lockdown
uint64_t mseccfg = MSECCFG_MML | MSECCFG_MMWP;
asm volatile("csrw mseccfg, %0" :: "r" (mseccfg));
// 3. Now even M-mode cannot write to boot ROM
// If M-mode firmware has bug, cannot overwrite itself
// 4. Verify and load next stage
if (verify_signature(next_stage)) {
// Set up PMP for next stage
pmp_set_region(1, next_stage_addr, next_stage_size,
PMP_R | PMP_X);
// Jump to next stage
((void (*)(void))next_stage_addr)();
} else {
// Signature verification failed - halt
while(1);
}
}
Security Monitor architecture using PMP for isolation.
Architecture:
Keystone Enclave Creation:
// Create isolated enclave using PMP
struct enclave {
uint64_t base;
uint64_t size;
int pmp_region;
};
struct enclave create_keystone_enclave(void *code, size_t code_size,
void *data, size_t data_size) {
struct enclave enc;
// 1. Allocate isolated memory
enc.base = alloc_enclave_memory(code_size + data_size);
enc.size = code_size + data_size;
// 2. Copy code and data
memcpy((void *)enc.base, code, code_size);
memcpy((void *)(enc.base + code_size), data, data_size);
// 3. Allocate PMP region for enclave
enc.pmp_region = allocate_pmp_region();
// 4. Configure PMP (M-mode only)
pmp_set_region(enc.pmp_region, enc.base, enc.size,
PMP_R | PMP_W | PMP_X); // RWX for enclave
// 5. Mark region as enclave-only (custom extension)
pmp_set_owner(enc.pmp_region, OWNER_ENCLAVE);
return enc;
}
// Enter enclave
void enter_enclave(struct enclave *enc) {
// Security monitor (M-mode) verifies caller
// Then switches PMP configuration and jumps to enclave
// Only enclave can access its memory
// OS cannot read/write enclave memory (PMP blocks it)
}
Keystone Features:
// 1. Memory Isolation
// - Each enclave gets dedicated PMP region
// - OS cannot access enclave memory
// - Enclaves cannot access each other
// 2. Attestation
// - Security monitor signs enclave measurements
// - Remote party can verify enclave authenticity
// 3. Sealing
// - Enclave-specific keys for persistent storage
// - Data encrypted to specific enclave
// 4. Open Source
// - MIT licensed
// - Customizable for different threat models
// - No proprietary firmware dependencies
Hardware-accelerated cryptography for better security and performance.
// Zkn: NIST algorithms
// - AES encryption/decryption
// - SHA-256/SHA-512
// - SM3/SM4 (Chinese standards)
// Example: one AES middle round (conceptual sketch; a full AES-128
// implementation expands the key with aes64ks1i/aes64ks2, runs nine
// middle rounds, and finishes with a final aes64es round)
void riscv_aes_encrypt_round(uint64_t state[2], const uint64_t rk[2]) {
    // AddRoundKey
    state[0] ^= rk[0];
    state[1] ^= rk[1];
    // aes64esm: SubBytes + ShiftRows + MixColumns on the 128-bit state,
    // computed one 64-bit half at a time from both halves
    uint64_t lo, hi;
    asm volatile("aes64esm %0, %1, %2" : "=r" (lo) : "r" (state[0]), "r" (state[1]));
    asm volatile("aes64esm %0, %1, %2" : "=r" (hi) : "r" (state[1]), "r" (state[0]));
    state[0] = lo;
    state[1] = hi;
    // Much faster than table-based software AES, and constant-time
}
// Zkb: Bit manipulation for crypto
// - Rotate, byte-reverse operations
// - Useful for implementing ciphers
| Feature | RISC-V | x86-64 | ARM64 |
|---|---|---|---|
| **Privilege Levels** | M/S/U (H optional) | Ring 0-3 | EL0-EL3 |
| **Memory Protection** | PMP (M-mode only) | Page tables | Page tables + TrustZone |
| **TEE** | Keystone (open) | SGX (removed from client CPUs; server-only) | TrustZone (built-in) |
| **Memory Encryption** | Extension needed | TME/TDX | CCA Realms (MTE is tagging only) |
| **Crypto Accel** | Zk* (optional) | AES-NI (standard) | Crypto extensions |
| **Flexibility** | High (modular) | Low (fixed) | Medium |
| **Maturity** | Developing | Mature | Mature |
| **Open Source** | Fully open | Proprietary | Some open |
Under Development:
// 1. IOPMP (I/O Physical Memory Protection)
// Like IOMMU but simpler, using PMP concepts
// 2. WorldGuard
// ARM TrustZone-like secure/non-secure worlds
// 3. Control Flow Integrity (Zicfiss)
// Shadow stack for return address protection
// 4. Pointer Masking
// Top-byte-ignore for memory tagging (like ARM MTE)
// 5. Capability Mode (CHERI)
// Hardware-enforced memory safety
// Fat pointers with bounds checking
[Sections 6.14-6.19 to continue...]
Modern computing increasingly relies on specialized accelerators—GPUs, TPUs, FPGAs, and custom ASICs—to achieve performance far beyond what general-purpose CPUs can provide. However, these accelerators introduce new security challenges that extend beyond traditional CPU memory protection.
Why GPUs Need Special Attention:
// Traditional CPU-only workflow:
void process_data(void *sensitive_data) {
// CPU operates on data in protected memory
// MMU enforces permissions
// Data never leaves CPU package unencrypted
compute(sensitive_data);
}
// Modern GPU-accelerated workflow:
void process_data_gpu(void *sensitive_data) {
// 1. CPU allocates GPU memory
void *gpu_mem;
cudaMalloc(&gpu_mem, size); // cudaMalloc returns the pointer via an out-parameter
// 2. Copy data to GPU (crosses PCIe bus!)
cudaMemcpy(gpu_mem, sensitive_data, size, cudaMemcpyHostToDevice);
// 3. GPU processes data
// - GPU has separate DRAM
// - Not protected by CPU MMU
// - Visible to: GPU driver, system software, DMA attacks
launch_kernel<<<blocks, threads>>>(gpu_mem);
// 4. Copy results back
cudaMemcpy(result, gpu_mem, size, cudaMemcpyDeviceToHost);
// Problem: Data exposed at multiple points!
}
Threat Vectors:
NVIDIA GPU Memory Hierarchy:
Key Differences from CPU:
| Feature | CPU | GPU |
|---|---|---|
| **MMU** | Full page-based protection | Limited/none |
| **Address Space** | Per-process isolation | Shared global memory |
| **Privilege Levels** | Ring 0/3, EL0-3, etc. | Minimal (user/kernel mode) |
| **Memory Encryption** | SME/TME | None (standard GPUs) |
| **IOMMU** | Can isolate from other devices | GPU itself not isolated |
Hardware Features (2022+):
NVIDIA H100 introduced confidential computing support for GPUs:
Key Mechanisms:
// 1. CPU-GPU Encrypted Link
// PCIe traffic encrypted with AES-256-GCM
// Prevents snooping on bus
// 2. GPU Memory Encryption
// GDDR6 memory encrypted at rest
// Similar to CPU SME/TME
// 3. Isolated Execution Contexts
// Multiple VMs can share GPU without seeing each other's data
// 4. Attestation
// Remote verification of GPU firmware and configuration
// Example: create a confidential GPU context (illustrative sketch; the
// property and flag names below are hypothetical -- real deployments
// query and enable CC mode via NVML / nvidia-smi)
cudaError_t create_confidential_context(void) {
    // Check that confidential computing mode is available
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.confidentialCompute) { // hypothetical field
        return cudaErrorNotSupported;
    }
    // Create an execution context; with CC mode on, its traffic is encrypted
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    return cudaSuccess;
}
Performance Overhead:
AMD MI300 Series (2023):
Similar to NVIDIA, AMD added memory encryption to data center GPUs:
// ROCm (Radeon Open Compute) API for secure compute
// 1. Check for encryption support (illustrative sketch: the property
//    name below is hypothetical, not a current ROCm API)
hipError_t check_encryption(void) {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);
    if (prop.memoryEncryption) { // hypothetical field
        printf("Memory encryption supported\n");
        return hipSuccess;
    }
    return hipErrorNotSupported;
}
// 2. Allocate encrypted memory
// (hipMallocEncrypted is an illustrative name, not a current ROCm API)
void *allocate_secure_gpu_memory(size_t size) {
void *ptr;
// Request encrypted allocation
hipError_t err = hipMallocEncrypted(&ptr, size);
if (err != hipSuccess) {
return NULL;
}
// Memory is encrypted in GPU DRAM
return ptr;
}
Unified Memory Architecture:
Apple's M-series chips integrate CPU and GPU on same die, sharing memory:
Security Advantages:
// No PCIe exposure: Data never leaves chip
void apple_gpu_compute(void *data, size_t size) {
// CPU and GPU share same memory
// No cudaMemcpy needed!
id<MTLBuffer> buffer = [device newBufferWithBytesNoCopy:data
length:size
options:MTLResourceStorageModeShared];
// GPU accesses same physical pages as CPU
// Protected by same MMU/IOMMU
// No additional exposure
}
Disadvantages:
Benchmark: Confidential GPU Computing:
// NVIDIA H100 with confidential computing
void benchmark_confidential_gpu(void) {
const size_t size = 1024 * 1024 * 1024; // 1 GB
// Standard (unencrypted) mode
float *d_data;
cudaMalloc(&d_data, size);
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
// Bandwidth: ~900 GB/s (H100)
// Compute: 2000 TFLOPS
// Confidential (encrypted) mode
// (cudaMallocEncrypted/cudaMemcpyEncrypted are illustrative names;
//  with CC mode enabled, encryption is applied transparently)
float *d_secure;
cudaMallocEncrypted(&d_secure, size);
cudaMemcpyEncrypted(d_secure, h_data, size);
// Bandwidth: ~810 GB/s (~10% slower)
// Compute: ~1850 TFLOPS (~7.5% slower)
// Overall overhead: 5-15% depending on workload
}
When to Use Confidential GPU:
✅ Use when:
❌ Avoid when:
Modern systems combine multiple processor types—CPUs, GPUs, FPGAs, TPUs, NPUs—in complex heterogeneous architectures. Securing these systems requires coordinating protection across fundamentally different processor designs.
Multiple Processors, Multiple Security Models:
Key Security Challenges:
NVIDIA Grace-Hopper Superchip:
Cache Coherency Security Implications:
// Problem: CPU caches can hold GPU data (and vice versa)
void coherency_security_issue(void) {
    // CPU allocates sensitive data
    int *cpu_data = malloc(4096);
    cpu_data[0] = SECRET_KEY;

    // GPU kernel reads the data (coherent access)
    gpu_kernel<<<1,1>>>(cpu_data);

    // The data now lives in:
    // 1. CPU L1/L2/L3 caches
    // 2. GPU L2 cache
    // 3. DRAM on both sides (CPU-attached and GPU-attached)

    // Security question: how do we ensure cache lines
    // are properly protected across domains?
}
Solution: Coherent Memory Encryption:
// Grace-Hopper implements end-to-end encryption
// across coherent link
// 1. CPU-side: Protected by CPU MMU + SME/TME
// Data encrypted in CPU caches and DRAM
// 2. NVLink-C2C: Encrypted link between CPU and GPU
// Prevents snooping on interconnect
// 3. GPU-side: Protected by GPU encryption
// Data encrypted in GPU caches and HBM
// Result: Coherent access without security compromise
Performance:
Integrated CPU+GPU on Single Package:
Unified Security Model:
// AMD MI300A: CPU and GPU share the same page tables

// CPU sets up a page table entry with protections
uint64_t *pte = get_pte(addr);
*pte = make_pte(phys_addr, PTE_P | PTE_W | PTE_NX);

// GPU accesses the same PTE; hardware enforces the same
// permissions for GPU accesses. No separate GPU page table needed!

// Benefit: consistent protection model
// Trade-off: the GPU must walk the same page tables (slower)
M3 Ultra Example:
Security Advantages:
// 1. No data copies between CPU and GPU
// → No PCIe exposure
// → Faster and more secure
// 2. Single MMU/IOMMU
// → Consistent protection model
// → Easier to verify security
// 3. On-package design
// → Physical security: can't intercept traffic
// → DMA attacks much harder
// Example: Metal (Apple's GPU API)
void apple_unified_security(void) {
    // Allocate memory (shared CPU/GPU)
    id<MTLBuffer> buffer = [device newBufferWithLength:size
                                               options:MTLResourceStorageModeShared];

    // CPU writes to the buffer
    memcpy(buffer.contents, data, size);

    // GPU reads the same buffer:
    // - No copy needed
    // - Same page table protections apply
    // - IOMMU enforces access control
    [commandEncoder setBuffer:buffer offset:0 atIndex:0];
    [commandEncoder dispatchThreads:...];
}
1. Unified Trust Boundary:
// Good: single security perimeter
void secure_heterogeneous_good(void) {
    // Use a CPU TEE (TDX/SEV-SNP) to protect the entire system
    create_confidential_vm();

    // GPU operates within the TEE boundary:
    // - Encrypted link to GPU
    // - GPU memory encrypted
    // - Attestation covers CPU+GPU
    compute_on_gpu(sensitive_data);
}

// Bad: multiple separate security boundaries
void secure_heterogeneous_bad(void) {
    // CPU has one protection scheme
    cpu_encrypt(data);

    // Transfer to GPU (different scheme)
    transfer_to_gpu(data); // Vulnerable during transfer!

    // GPU has a different protection scheme
    gpu_encrypt(data);

    // Multiple boundaries = multiple attack surfaces
}
2. End-to-End Encryption:
// Encrypt data path from source to accelerator
// Source (encrypted in CPU enclave)
void *cpu_enclave_data = allocate_in_enclave(size);
// Transfer (encrypted link)
secure_transfer_to_gpu(cpu_enclave_data, gpu_buffer, size);
// Destination (encrypted in GPU memory)
// GPU processes without exposing plaintext
// No point in data path exposes cleartext
3. Minimize Trust in Drivers:
// Problem: GPU drivers run in kernel, highly privileged
// Solution: Minimal trusted driver + user-space library
// Trusted (kernel driver: minimal code)
int gpu_map_memory(void *addr, size_t size) {
    // Only does: memory mapping, DMA setup
    // Does NOT: touch data, parse commands
    return map_gpu_region(addr, size);
}

// Untrusted (user-space library: complex code)
void launch_kernel(kernel_t kernel, void *args) {
    // Complex logic lives in user space.
    // If compromised: limited damage (no kernel privileges)
    prepare_command_buffer(kernel, args);
    submit_to_gpu();
}
4. Attestation Across Devices:
// Verify the security of the entire heterogeneous system
struct system_attestation {
    uint8_t cpu_measurement[32];
    uint8_t gpu_measurement[32];
    uint8_t fpga_bitstream_hash[32];
    uint8_t interconnect_config[32];
};

bool attest_heterogeneous_system(struct system_attestation *attest) {
    // 1. CPU TEE attestation
    get_cpu_attestation(attest->cpu_measurement);
    // 2. GPU firmware attestation
    get_gpu_attestation(attest->gpu_measurement);
    // 3. FPGA bitstream verification
    get_fpga_attestation(attest->fpga_bitstream_hash);
    // 4. Interconnect security configuration
    get_interconnect_config(attest->interconnect_config);

    // Sign the entire attestation; a remote verifier
    // then checks all components together.
    sign_attestation(attest);
    return verify_all_components(attest);
}
Every security feature has a cost. Understanding these trade-offs is essential for making informed decisions about which protections to enable in production systems.
| Feature | Overhead | Recommendation | When to Reconsider |
|---|---|---|---|
| **NX/XD/XN Bit** | ~0% | ✅ Always | Never disable |
| **SMEP** | <1% | ✅ Always | Never disable |
| **SMAP** | <1% | ✅ Always | Never disable |
| **PAN (ARM)** | <1% | ✅ Always | Never disable |
| **KPTI** | 5-30% | ⚠️ If vulnerable CPU | Disable on patched CPUs |
| **MPK** | <1% | ✅ When available | App-specific |
| **MTE (ARM)** | 5-15% | 🤔 Depends | Development/critical systems |
| **BTI (ARM)** | <1% | ✅ Always | Never disable |
| **SME/SEV** | <1-5% | ✅ Cloud/untrusted | Disable on trusted hardware |
| **GPU Encryption** | 5-15% | 🤔 Depends | Sensitive data only |
1. KPTI (Kernel Page-Table Isolation):
// Benchmark: system call overhead with/without KPTI
void benchmark_kpti_impact(void) {
    const int iterations = 1000000;

    // Measure getpid() (a minimal syscall)
    uint64_t start = rdtsc();
    for (int i = 0; i < iterations; i++) {
        getpid();
    }
    uint64_t cycles = (rdtsc() - start) / iterations;

    // Results:
    //   Without KPTI: ~100 cycles
    //   With KPTI:    ~150-180 cycles
    //   Overhead: 50-80% per syscall!

    // But total system impact depends on syscall frequency:
    //   CPU-bound workload: <5% (few syscalls)
    //   I/O-bound server:   10-20% (many syscalls)
    //   Database:           15-30% (heavy syscall use)
}
When to Disable KPTI:
// Check whether this CPU still needs KPTI
bool needs_kpti(void) {
    uint32_t eax, ebx, ecx, edx;

    // Intel: CPUID.(EAX=7,ECX=0):EDX[29] advertises the
    // IA32_ARCH_CAPABILITIES MSR; RDCL_NO is bit 0 of that MSR
    // (reading the MSR requires ring 0).
    cpuid(7, 0, &eax, &ebx, &ecx, &edx);
    if (edx & (1u << 29)) {
        uint64_t caps = rdmsr(IA32_ARCH_CAPABILITIES);
        if (caps & 1) {       // RDCL_NO: not vulnerable to Meltdown
            return false;     // Safe to disable KPTI
        }
    }

    // AMD CPUs: not vulnerable to Meltdown
    if (cpu_vendor() == AMD) {
        return false;
    }

    return true; // Enable KPTI
}
2. Memory Tagging (ARM MTE):
// MTE provides memory safety, but at a cost
void benchmark_mte_overhead(void) {
    const size_t size = 1024 * 1024 * 1024; // 1 GB

    // Without MTE:
    //   malloc():      50 μs
    //   memcpy():      2.5 GB/s
    //   Random access: 60 ns per access

    // With MTE (synchronous mode):
    //   malloc():      55 μs (+10% for tagging)
    //   memcpy():      2.3 GB/s (-8% for tag checks)
    //   Random access: 65 ns per access (+8%)

    // With MTE (asynchronous mode):
    //   malloc():      52 μs (+4%)
    //   memcpy():      2.45 GB/s (-2%)
    //   Random access: 61 ns per access (+2%)

    // Trade-off:
    //   Sync:  better debugging (immediate faults)
    //   Async: better performance (delayed faults)
}
When to Use MTE:
✅ Enable for:
❌ Disable for:
3. Confidential Computing (SEV-SNP, TDX):
// Cloud VM with confidential computing
void benchmark_confidential_vm(void) {
    // Regular VM:
    //   Network I/O: 100 Gbps
    //   Disk I/O:    10 GB/s
    //   Memory:      200 GB/s
    //   CPU:         2.5 GHz (base)

    // Confidential VM (SEV-SNP):
    //   Network I/O: 95 Gbps  (-5%, encryption)
    //   Disk I/O:    9.5 GB/s (-5%, encryption)
    //   Memory:      190 GB/s (-5%, RMP checks)
    //   CPU:         2.45 GHz (-2%, overhead)

    // Overall: 1-5% performance cost
    // Benefit: complete isolation from the cloud provider

    // Decision:
    //   Use if:  processing sensitive data (healthcare, finance)
    //   Skip if: public data, owned hardware
}
Stacking Security Features:
// Real-world server configuration
void production_server_config(void) {
    // Base performance:  100%
    // Enable NX bit:     100%  (no cost)
    // Enable SMEP:       99.5% (-0.5%)
    // Enable SMAP:       99.0% (-0.5%)
    // Enable KPTI:       85.0% (-14%)
    // Enable SEV-SNP:    81.0% (-4%)
    // Final performance: ~81% of baseline

    // Security gain: protected against:
    //   - Code injection attacks (NX)
    //   - Kernel exploits (SMEP/SMAP)
    //   - Spectre/Meltdown (KPTI)
    //   - Malicious hypervisor (SEV-SNP)

    // Worth it? Depends on threat model and requirements
}
Optimization Strategies:
// 1. Selective protection
void selective_security(void) {
    // Don't protect everything equally

    // Critical data: full protection
    enable_all_security_features(payment_processing);

    // Internal services: moderate protection
    enable_basic_security(logging_service);

    // Public data: minimal protection
    enable_nx_only(static_website);
}

// 2. Hardware acceleration
void use_hardware_acceleration(void) {
    // Modern CPUs have dedicated units:
    //   AES-NI:                 hardware AES encryption (free)
    //   SHA extensions:         hardware hashing (free)
    //   Pointer authentication: hardware CFI (cheap)
    // Always enable hardware-accelerated security
}

// 3. Batch operations
void batch_security_operations(void) {
    // Amortize overhead across multiple operations

    // Bad: individual mprotect calls
    for (int i = 0; i < 1000; i++) {
        mprotect(pages[i], 4096, PROT_READ); // 1000 TLB shootdowns!
    }

    // Good: assign a protection key once, then toggle access cheaply
    for (int i = 0; i < 1000; i++) {
        pkey_mprotect(pages[i], 4096, PROT_READ, pkey);
    }
    pkey_set(pkey, PKEY_DISABLE_WRITE); // later changes: 1 register write!
}
Based on decades of security research and real-world deployments, here are proven best practices for memory protection.
Layer Multiple Protections:
// Don't rely on a single mechanism
void defense_in_depth_example(void) {
    // Layer 1: NX bit (prevent code execution on stack/heap)
    // Layer 2: ASLR (randomize addresses)
    // Layer 3: Stack canaries (detect buffer overflows)
    // Layer 4: SMEP/SMAP (kernel hardening)
    // Layer 5: Control-flow integrity (prevent ROP)

    // If an attacker bypasses one layer, the others still protect
}
Principle: Assume every protection can be bypassed. Multiple layers increase attack difficulty exponentially.
Reduce Attack Surface:
// Bad: large TCB (Trusted Computing Base)
void large_tcb_bad(void) {
    // The entire OS kernel is trusted:
    //   - Millions of lines of code
    //   - Many bugs, large attack surface
}

// Good: small TCB
void small_tcb_good(void) {
    // Only the security monitor is trusted:
    //   - Thousands of lines (Keystone: ~5K)
    //   - Easier to verify and audit
    //   - Fewer bugs, smaller attack surface
}
Example: Keystone vs SGX:
Default-Deny Policies:
// Good: deny by default
void secure_permission_check(void *addr, access_type_t type) {
    // Start by denying everything
    bool allowed = false;

    // Explicitly allow what's needed
    if (is_in_permitted_range(addr) && has_permission(type)) {
        allowed = true;
    }

    // Fail closed
    if (!allowed) {
        raise_fault();
    }
}

// Bad: allow by default
void insecure_permission_check(void *addr, access_type_t type) {
    // Start by allowing everything
    bool denied = false;

    // Deny only specific things
    if (is_in_forbidden_range(addr)) {
        denied = true;
    }

    // Fail open (dangerous!)
    if (!denied) {
        allow_access();
    }
}
Test Security, Not Just Functionality:
// Security unit tests
void test_memory_isolation(void) {
    // Test 1: user cannot access kernel memory
    assert_fault(user_read_kernel_address(KERNEL_DATA));

    // Test 2: NX bit prevents execution
    assert_fault(execute_on_stack(stack_buffer));

    // Test 3: W^X enforced (on kernels that forbid RWX mappings)
    void *page = mmap(NULL, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, ...);
    assert(page == MAP_FAILED);

    // Test 4: SMEP prevents the kernel executing user code
    assert_fault(kernel_jump_to_user_page(user_code));
}
Don't Disable for "Performance":
// Common mistakes to avoid

// WRONG: disable NX for "speed"
void disable_nx_wrong(void) {
    // Saves:  0% (NX has no runtime cost!)
    // Loses:  protection against code injection
    // Verdict: never do this
}

// WRONG: disable ASLR for "reproducibility"
void disable_aslr_wrong(void) {
    // Saves:  ~0% (ASLR has negligible runtime cost)
    // Loses:  makes exploits vastly easier
    // Verdict: only disable in development, never in production
}

// ACCEPTABLE: disable KPTI on patched CPUs
void disable_kpti_acceptable(void) {
    if (cpu_has_hardware_mitigation_for_meltdown()) {
        disable_kpti();
        // Saves:  5-15%
        // Loses:  nothing (hardware provides equivalent protection)
        // Verdict: OK if the CPU is patched
    }
}
Log Security Events:
// Track security-relevant events:
//   - Page faults (access violations)
//   - Privilege violations
//   - Protection key violations
//   - Failed attestations

// Example: page fault handler that logs and alerts
void page_fault_handler(uintptr_t addr, error_code_t err) {
    if (err & PF_PROTECTION_VIOLATION) {
        // Log: timestamp, process, address, access type
        log_security_event(LOG_PROTECTION_FAULT, addr, err);

        // Alert if the fault matches a suspicious pattern
        if (looks_like_exploit(addr, err)) {
            alert_security_team();
        }
    }
}
Learn from others' mistakes. Here are common memory protection failures and how to prevent them.
Problem: Relying only on page tables for security.
// WRONG: page tables alone aren't enough
void insufficient_protection(void) {
    // Set up page tables with proper permissions
    set_page_permissions(addr, PROT_READ | PROT_WRITE);

    // Remaining vulnerabilities:
    //   1. Speculative execution (Meltdown/Spectre)
    //   2. DMA attacks (if no IOMMU)
    //   3. Physical memory attacks (if no encryption)
    //   4. Hypervisor attacks (if no confidential computing)
}

// CORRECT: layer multiple protections
void sufficient_protection(void) {
    // 1. Page tables (basic protection)
    set_page_permissions(addr, PROT_READ | PROT_WRITE);
    // 2. KPTI (Meltdown mitigation)
    enable_kpti();
    // 3. IOMMU (DMA protection)
    enable_iommu_for_device(pci_device);
    // 4. Memory encryption (physical attacks)
    enable_memory_encryption();
    // 5. Confidential computing (hypervisor attacks)
    create_encrypted_vm();
}
Problem: Changing permissions without flushing TLB.
// WRONG: TLB not flushed
void permission_change_wrong(void *addr) {
    // Change the page permission
    uint64_t *pte = get_pte(addr);
    *pte |= PTE_NX; // Make non-executable

    // BUG: the TLB still holds the old permissions!
    // Code can keep executing until the entry is naturally evicted.
}

// CORRECT: always flush the TLB
void permission_change_correct(void *addr) {
    uint64_t *pte = get_pte(addr);
    *pte |= PTE_NX;

    // Flush the TLB entry
    invlpg(addr);                      // x86
    // or: tlbi(addr); dsb(); isb();  // ARM (invalidate, then barriers)
    // or: sfence_vma(addr);          // RISC-V
}
}
Problem: Reusing memory between different security contexts.
// WRONG: reuse memory without clearing it
void memory_reuse_wrong(void) {
    // First use: a secret is stored
    void *secret = malloc(4096);
    memcpy(secret, password, 16);
    free(secret);

    // A later allocation reuses the same heap chunk
    void *data = malloc(4096);
    // BUG: data may still contain the password!
    // (The same hazard exists across processes if the kernel
    // hands out physical pages without zeroing them.)
}

// CORRECT: clear memory between uses
void memory_reuse_correct(void) {
    void *secret = malloc(4096);
    memcpy(secret, password, 16);

    // Clear before freeing
    explicit_bzero(secret, 4096); // Not optimized away
    free(secret);
}
Problem: Dereferencing user pointers in kernel without validation.
// WRONG: trust a user pointer
void kernel_bug(void *user_ptr) {
    // Kernel directly dereferences a user-provided pointer
    int value = *(int *)user_ptr; // Dangerous!

    // Exploits:
    //   1. user_ptr = kernel address → leak kernel data
    //   2. user_ptr = NULL          → kernel NULL deref crash
    //   3. user_ptr = unmapped      → kernel page fault
}

// CORRECT: validate and use safe copy functions
int kernel_safe(void *user_ptr) {
    int value;

    // 1. Check the pointer lies in user space
    if (!access_ok(user_ptr, sizeof(int))) {
        return -EFAULT;
    }

    // 2. Use a safe copy function
    if (copy_from_user(&value, user_ptr, sizeof(int))) {
        return -EFAULT; // Faulted safely
    }

    // 3. Now safe to use the value
    process(value);
    return 0;
}
Problem: Implementing functional security but ignoring timing attacks.
// WRONG: timing-dependent secret comparison
bool password_check_wrong(const char *input, const char *correct) {
    // Stops at the first mismatch
    for (size_t i = 0; i < strlen(correct); i++) {
        if (input[i] != correct[i]) {
            return false; // Early exit leaks information!
        }
    }
    return true;
}

// CORRECT: constant-time comparison
// (assumes both buffers are MAX_PASSWORD_LEN bytes, zero-padded)
bool password_check_correct(const char *input, const char *correct) {
    int diff = 0;

    // Always compare every byte
    for (int i = 0; i < MAX_PASSWORD_LEN; i++) {
        diff |= input[i] ^ correct[i];
    }
    return diff == 0; // No early exit
}
Memory protection is the foundation of system security. Everything we've covered builds on this core principle: isolate, protect, and verify access to memory.
Essential Mechanisms:
Critical Insights:
1. Hardware Memory Safety:
// Trend: CPUs with built-in bounds checking
// ARM MTE is just the beginning
// Future: Full capability-based security (CHERI)
void *capability_pointer = malloc(size);
// Pointer includes: address, bounds, permissions
// Hardware enforces: cannot access outside bounds
// Zero spatial memory errors at hardware level
2. Always-On Confidential Computing:
// Trend: Encryption becomes default, not optional
// Today: Opt-in confidential VMs
create_encrypted_vm(); // Explicit
// Future: All VMs encrypted by default
create_vm(); // Implicitly encrypted
3. Unified Cross-Device Security:
// Trend: Consistent protection across CPU/GPU/FPGA
// Today: Each device has different security model
protect_cpu_memory(); // Page tables + KPTI
protect_gpu_memory(); // Maybe encrypted, maybe not
protect_fpga_memory(); // Custom logic
// Future: Unified security framework
protect_device_memory(cpu); // Same API
protect_device_memory(gpu); // Same guarantees
protect_device_memory(fpga); // Same verification
4. Formal Verification:
// Trend: mathematically proven security

// Today: test and hope
if (test_passes(security_feature)) {
    deploy(); // Hope no bugs remain
}

// Future: formally verified security
if (prove_secure(security_monitor)) {
    deploy(); // Mathematically certain
}

// Example: seL4 microkernel (formally verified)
//   - ~10,000 lines of code
//   - Mathematically proven free of buffer overflows
//   - Proven to correctly enforce isolation
Memory protection has evolved from simple base/bound registers (1960s) to sophisticated multi-layered defenses (2020s). Yet the fundamental goal remains: ensure that code can only access memory it's authorized to access.
The technologies covered in this chapter—from basic page table permissions to encrypted confidential VMs—represent humanity's collective effort to build secure systems. Each mechanism emerged from:
As you design and build systems, remember:
The future of memory protection looks promising: hardware memory safety, universal encryption, formal verification. But the principles remain timeless: isolate, protect, verify.
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1. Chapter 4: "Paging" and Section 4.6: "Access Rights." 2024.
ARM Limited. ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile. ARM DDI 0487J.a. March 2023. Chapter D5: "The AArch64 Virtual Memory System Architecture."
RISC-V International. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture. Version 20211203. December 2021. Chapter 3: "Machine-Level ISA" and Chapter 4: "Supervisor-Level ISA."
Denning, Peter J. "Virtual memory." ACM Computing Surveys (CSUR) 2.3 (1970): 153-189. DOI: 10.1145/356571.356573
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. Section 4.6: "Access Rights" (XD bit). 2024.
AMD. AMD64 Architecture Programmer's Manual, Volume 2: System Programming. Publication #24593. Section 5.3.1: "No-Execute Page Protection." 2023.
ARM Limited. ARM Architecture Reference Manual ARMv8. Section D5.4.5: "Execute-never controls and instruction fetching." 2023.
One, Aleph. "Smashing the Stack for Fun and Profit." Phrack Magazine 7.49 (1996). [Classic buffer overflow exploit paper]
Solar Designer. "Getting around non-executable stack (and fix)." Bugtraq mailing list. August 1997. [Return-to-libc attacks]
Shacham, Hovav. "The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86)." Proceedings of the 14th ACM conference on Computer and communications security (CCS 2007). ACM, 2007. DOI: 10.1145/1315245.1315313 [ROP attacks]
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. Chapter 5: "Protection." 2024.
ARM Limited. ARM Architecture Reference Manual ARMv8. Chapter D1: "The AArch64 System Level Programmers' Model." 2023.
Goldberg, Robert P. "Survey of virtual machine research." Computer 7.6 (1974): 34-45. DOI: 10.1109/MC.1974.6323581 [Early virtualization and privilege levels]
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. Section 4.6: "Access Rights" (SMEP/SMAP). 2024.
Kemerlis, Vasileios P., et al. "kGuard: Lightweight kernel protection against return-to-user attacks." 22nd USENIX Security Symposium. 2013. Pages 459-474.
Pomonis, Marios, et al. "kR^X: Comprehensive kernel protection against just-in-time code reuse." Proceedings of the Twelfth European Conference on Computer Systems (EuroSys 2017). ACM, 2017. DOI: 10.1145/3064176.3064216
ARM Limited. ARM Architecture Reference Manual ARMv8, Supplement: The Armv8.5 Memory Tagging Extension. ARM DDI 0487F.c. 2020.
Serebryany, Konstantin. "ARM Memory Tagging Extension and How It Improves C/C++ Memory Safety." 2020 Security Symposium. USENIX, 2020.
ARM Limited. "Armv8.5-A Memory Tagging Extension White Paper." 2019.
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. Section 4.6.2: "Protection Keys." 2024.
Hedayati, Mohammad, et al. "Hodor: Intra-process isolation for high-throughput data plane libraries." 2019 USENIX Annual Technical Conference (ATC 19). 2019. Pages 489-504.
Park, Soyeon, et al. "libmpk: Software abstraction for Intel memory protection keys." 2019 USENIX Annual Technical Conference (ATC 19). 2019. Pages 241-254.
Vahldiek-Oberwagner, Anjo, et al. "ERIM: Secure, efficient in-process isolation with protection keys (MPK)." 28th USENIX Security Symposium. 2019. Pages 1221-1238.
ARM Limited. ARM Security Technology: Building a Secure System using TrustZone Technology. ARM PRD29-GENC-009492C. 2009.
Ngabonziza, Bernard, et al. "TrustZone explained: Architectural features and use cases." 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC). IEEE, 2016. DOI: 10.1109/CIC.2016.065
ARM Limited. ARMv8-A Architecture and Processors: Trusted Base System Architecture for ARMv8-M. 2018.
Costan, Victor, and Srinivas Devadas. "Intel SGX explained." IACR Cryptology ePrint Archive 2016 (2016): 86.
McKeen, Frank, et al. "Innovative instructions and software model for isolated execution." Proceedings of the 2nd International Workshop on Hardware and Architectural Support for Security and Privacy. 2013. DOI: 10.1145/2487726.2488368
Intel Corporation. Intel® Software Guard Extensions (Intel® SGX) Developer Reference for Linux OS. 2020.
Van Bulck, Jo, et al. "Foreshadow: Extracting the keys to the Intel SGX kingdom with transient out-of-order execution." 27th USENIX Security Symposium. 2018. Pages 991-1008. [SGX vulnerability]
AMD. AMD SEV-SNP: Strengthening VM Isolation with Integrity Protection and More. White Paper #55766. January 2020.
AMD. AMD Secure Encrypted Virtualization API Version 0.24. Publication #55766. 2020.
Kaplan, David, Jeremy Powell, and Tom Woller. "AMD memory encryption." White Paper (2016).
Li, Mengyuan, et al. "CIPHERLEAKS: Breaking constant-time cryptography on AMD SEV via the ciphertext side channel." 31st USENIX Security Symposium. 2022. Pages 717-732. [SEV vulnerability research]
Intel Corporation. Intel® Trust Domain Extensions (Intel® TDX) Module v1.5 Architecture Specification. Document Number: 344425-004US. March 2023.
Intel Corporation. "Intel Trust Domain Extensions." White Paper. 2020.
Intel Corporation. "Intel TDX: Protect Confidential Computing Workloads from Software and Hardware Attacks." 2021.
ARM Limited. ARM Confidential Compute Architecture. 2021.
ARM Limited. "Introducing Arm Confidential Compute Architecture." White Paper. 2021.
ARM Limited. Arm Realm Management Extension (RME) Architecture Specification. ARM DDI 0615A. 2022.
AMD. AMD64 Architecture Programmer's Manual, Volume 2: System Programming. Chapter 7: "Secure Memory Encryption." Publication #24593. 2023.
Kaplan, David. "Protecting VM register state with SEV-ES." AMD White Paper (2017).
AMD. Secure Encrypted Virtualization API. Publication #55766. Rev 0.24. 2020.
RISC-V International. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture. Section 3.6: "Physical Memory Protection." Version 20211203. December 2021.
RISC-V International. RISC-V Physical Memory Protection (PMP) Enhancement (ePMP). Draft Specification. 2021.
Lee, Dayeol, et al. "Keystone: An open framework for architecting trusted execution environments." Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys 2020). 2020. DOI: 10.1145/3342195.3387532
Weiser, Samuel, et al. "TIMBER-V: Tag-isolated memory bringing fine-grained enclaves to RISC-V." Network and Distributed Systems Security Symposium (NDSS 2019). 2019.
NVIDIA Corporation. NVIDIA H100 Tensor Core GPU Architecture. White Paper WP-10026-001_v01. 2022.
NVIDIA Corporation. NVIDIA Confidential Computing. White Paper. 2022.
AMD. AMD Instinct MI300 Architecture. White Paper. 2023.
Volos, Stavros, et al. "Graviton: Trusted execution environments on GPUs." 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2018). 2018. Pages 681-696.
Jang, Insu, et al. "Heterogeneous isolated execution for commodity GPUs." Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2019). 2019. DOI: 10.1145/3297858.3304021
NVIDIA Corporation. NVIDIA Grace Hopper Superchip Architecture. White Paper. 2023.
AMD. AMD Instinct MI300A APU Architecture. White Paper. 2023.
Apple Inc. Apple M3 Technical Overview. 2023.
Pichai, Bharath, et al. "Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces." ACM SIGPLAN Notices 49.4 (2014): 743-758. DOI: 10.1145/2644865.2541940
Kocher, Paul, et al. "Spectre attacks: Exploiting speculative execution." 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019. DOI: 10.1109/SP.2019.00002
Lipp, Moritz, et al. "Meltdown: Reading kernel memory from user space." 27th USENIX Security Symposium. 2018. Pages 973-990.
Gruss, Daniel, et al. "KASLR is dead: Long live KASLR." International Symposium on Engineering Secure Software and Systems. Springer, 2017. DOI: 10.1007/978-3-319-62105-0_11
The Linux Kernel Organization. "Page Table Isolation (PTI)." Documentation/x86/pti.rst. 2018.
Saltzer, Jerome H., and Michael D. Schroeder. "The protection of information in computer systems." Proceedings of the IEEE 63.9 (1975): 1278-1308. DOI: 10.1109/PROC.1975.9939 [Classic security principles]
Anderson, Ross J. Security engineering: a guide to building dependable distributed systems. John Wiley & Sons, 2020. Third edition.
Klein, Gerwin, et al. "seL4: Formal verification of an OS kernel." Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP 2009). ACM, 2009. DOI: 10.1145/1629575.1629596 [Formally verified microkernel]
Ge, Xinyang, et al. "Griffin: Guarding control flows using intel processor trace." Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2017). 2017. DOI: 10.1145/3037697.3037716
Silberschatz, Abraham, Peter Baer Galvin, and Greg Gagne. Operating System Concepts. 10th edition. Wiley, 2018. Chapter 9: "Virtual Memory."
Tanenbaum, Andrew S., and Herbert Bos. Modern Operating Systems. 4th edition. Pearson, 2015. Chapter 3: "Memory Management."
Bryant, Randal E., and David R. O'Hallaron. Computer Systems: A Programmer's Perspective. 3rd edition. Pearson, 2015. Chapter 9: "Virtual Memory."
Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. 6th edition. Morgan Kaufmann, 2017. Appendix B: "Review of Memory Hierarchy."