Computer Architecture 08 - Virtual Memory and MMU
How virtual memory enables process isolation through the MMU, page tables, and TLB
Why Virtual Memory Exists
In early computers, programs used physical memory addresses directly. If program A was using address 0x1000 and program B also needed that address, a conflict would occur. Running multiple programs simultaneously required coordinating which memory regions each program could use ahead of time, which was impractical.
Virtual memory solves this problem at its root. By giving each process its own independent address space, every process can operate as though it starts from address 0x0. The address 0x1000 as seen by process A and the address 0x1000 as seen by process B map to entirely different locations in physical memory. The hardware device that performs this translation is the MMU (Memory Management Unit).
How Address Translation Works
In a virtual memory system, every address generated by the CPU is a virtual address. This virtual address must be translated to a physical address in actual physical memory before memory can be accessed. The basic unit of translation is the page, typically 4KB in size.
Virtual Address (e.g., 0x00401234)
┌──────────────────┬─────────────┐
│ Virtual Page No. │ Page Offset │
│   (VPN: 0x401)   │   (0x234)   │
└────────┬─────────┴──────┬──────┘
         │                │
  Translated via          │
    Page Table            │
         │                │
         ▼                ▼
┌────────────────────┬─────────────┐
│ Physical Frame No. │ Page Offset │
│    (PFN: 0x8A3)    │   (0x234)   │
└────────────────────┴─────────────┘
Physical Address (e.g., 0x8A3234)
The virtual page number (VPN) is translated to a physical frame number (PFN), while the offset within the page remains unchanged. The data structure that stores this mapping information is the page table.
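The split described above can be sketched in a few lines. The 4KB page constants and the example addresses come from the diagram; the dictionary standing in for the page table is purely illustrative:

```python
PAGE_SHIFT = 12                  # 4KB pages: the low 12 bits are the offset
PAGE_MASK = (1 << PAGE_SHIFT) - 1

# Hypothetical page table: maps VPN -> PFN (only the example mapping here)
page_table = {0x401: 0x8A3}

def translate(vaddr):
    """Split a virtual address, look up the PFN, and reassemble."""
    vpn = vaddr >> PAGE_SHIFT     # virtual page number
    offset = vaddr & PAGE_MASK    # passes through translation unchanged
    pfn = page_table[vpn]         # KeyError here would be a "page fault"
    return (pfn << PAGE_SHIFT) | offset

print(hex(translate(0x00401234)))   # 0x8a3234
```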
Page Tables
The simplest form of page table is a single-level page table: an array structure where the virtual page number serves as an index to look up the physical frame number directly. However, this approach has a critical problem. With a 64-bit address space using 4KB pages, 2^52 page table entries would be required. If each entry is 8 bytes, the page table itself would consume 32PB of memory.
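The figures above can be checked with a quick calculation:

```python
page_size = 4096                 # 4KB pages
entries = 2**64 // page_size     # one entry per virtual page
table_bytes = entries * 8        # 8 bytes per entry

print(entries == 2**52)          # True
print(table_bytes // 2**50)      # 32, i.e. the 32PB figure
```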
To solve this problem, modern processors use multi-level page tables. x86-64 adopts a four-level page table structure.
Virtual Address (48 bits used)
┌────────┬────────┬────────┬────────┬─────────────┐
│  PML4  │  PDPT  │   PD   │   PT   │   Offset    │
│ (9bit) │ (9bit) │ (9bit) │ (9bit) │   (12bit)   │
└───┬────┴───┬────┴───┬────┴───┬────┴─────────────┘
    │        │        │        │
    ▼        ▼        ▼        ▼
  PML4 Table → PDP Table → PD Table → Page Table → Physical Frame
The key advantage of this structure is that lower-level tables need not be allocated at all for unused address ranges. Since the memory actually used by a process is only a tiny fraction of the entire virtual address space, the multi-level structure dramatically reduces the memory consumption of page tables.
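As a sketch, the four 9-bit indices and the 12-bit offset can be extracted with shifts and masks. The bit positions follow the layout above; the test address is an arbitrary illustration:

```python
def split_x86_64(vaddr):
    """Decompose a 48-bit virtual address into the four 9-bit
    table indices and the 12-bit page offset."""
    offset = vaddr & 0xFFF           # bits 0-11
    pt     = (vaddr >> 12) & 0x1FF   # bits 12-20
    pd     = (vaddr >> 21) & 0x1FF   # bits 21-29
    pdpt   = (vaddr >> 30) & 0x1FF   # bits 30-38
    pml4   = (vaddr >> 39) & 0x1FF   # bits 39-47
    return pml4, pdpt, pd, pt, offset

# Build an address with known index values and take it apart again
vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_x86_64(vaddr))   # (1, 2, 3, 4, 5)
```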
TLB: Caching Address Translations
Using a four-level page table means that a single memory access can require up to four additional memory accesses for the translation. With each memory access taking tens of nanoseconds, this overhead would be unacceptable.
The solution is the TLB (Translation Lookaside Buffer). The TLB is a small cache located inside the CPU that stores recently used virtual-to-physical address mappings. When the desired mapping exists in the TLB (a TLB hit), the physical address is obtained immediately without any additional memory accesses.
Can such a small cache truly be effective? It can, because program memory access patterns exhibit strong locality. In typical workloads, TLB hit rates exceed 99%. Such high hit rates are achievable with only 64 to 1024 entries because a single TLB entry covers an entire 4KB page: just 64 entries can cache translations for 256KB of memory.
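As a rough illustration of why locality makes this work, here is a toy fully associative TLB with LRU eviction. Real TLBs are set-associative hardware structures; the sizes and the page table below are made up:

```python
from collections import OrderedDict

class TLB:
    """Toy fully associative TLB with LRU eviction (a software sketch)."""
    def __init__(self, entries=64):
        self.entries = entries
        self.map = OrderedDict()         # VPN -> PFN, LRU order
        self.hits = self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.map:
            self.hits += 1
            self.map.move_to_end(vpn)    # refresh LRU position
            return self.map[vpn]
        self.misses += 1
        pfn = page_table[vpn]            # slow path: walk the page table
        self.map[vpn] = pfn
        if len(self.map) > self.entries:
            self.map.popitem(last=False) # evict least recently used
        return pfn

# A loop repeatedly touching the same 8 pages misses only on first touch
tlb = TLB(entries=64)
table = {vpn: vpn + 1000 for vpn in range(8)}
for _ in range(100):
    for vpn in range(8):
        tlb.lookup(vpn, table)
print(tlb.hits / (tlb.hits + tlb.misses))   # 0.99
```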
Page Faults and Demand Paging
When a process attempts to access a page that does not exist in physical memory, a page fault occurs. This is a hardware exception that transfers control to the operating system's page fault handler.
The operating system reads the required page from disk (or SSD) into physical memory, updates the page table, and re-executes the interrupted instruction. This mechanism is demand paging. Rather than loading all of a program's pages into memory from the start, only the pages actually accessed are loaded at the time of access.
Thanks to this approach, programs larger than physical memory can still run. A system with 8GB of physical memory can simultaneously run three processes each using 4GB precisely because not all pages from all processes need to reside in physical memory at once; only the pages currently in active use need to be kept there.
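A minimal sketch of demand paging, assuming a hypothetical load_from_disk callback in place of a real disk read:

```python
class DemandPager:
    """Sketch of demand paging: a page is loaded only on first access."""
    def __init__(self, load_from_disk):
        self.resident = {}               # VPN -> contents, resident pages only
        self.load_from_disk = load_from_disk
        self.faults = 0

    def access(self, vpn):
        if vpn not in self.resident:     # page fault
            self.faults += 1
            self.resident[vpn] = self.load_from_disk(vpn)  # OS fault handler
        return self.resident[vpn]        # the retried access now succeeds

pager = DemandPager(lambda vpn: f"contents of page {vpn}")
pager.access(7)
pager.access(7)
print(pager.faults)   # 1: only the first access faulted
```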
Page Replacement Algorithms
When physical memory runs low, the operating system must evict an existing page to disk to make room for a new one. Deciding which page to evict is the job of the page replacement algorithm.
The theoretically optimal algorithm replaces the page that will be used furthest in the future (the OPT algorithm). Since the future cannot be predicted, practical approximation algorithms are needed. LRU (Least Recently Used) replaces the page that was accessed longest ago, based on the assumption that past access patterns will resemble future ones. In practice, exact LRU implementation carries significant overhead, so operating systems use approximations such as the clock algorithm that leverages reference bits. Linux implements an approximate LRU based on dual lists, maintaining separate active and inactive lists.
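The clock algorithm mentioned above can be sketched as a sweep over the frames' reference bits. This is simplified: a real implementation keeps the hand position between evictions rather than starting from frame 0 each time:

```python
def clock_evict(frames):
    """One sweep of the clock algorithm over a circular list of frames.
    Each frame is a dict with a 'ref' bit the MMU sets on access.
    Returns the index of the frame to evict."""
    hand = 0
    while True:
        if frames[hand]["ref"] == 0:
            return hand                  # not referenced recently: evict
        frames[hand]["ref"] = 0          # second chance: clear bit, move on
        hand = (hand + 1) % len(frames)

frames = [{"ref": 1}, {"ref": 1}, {"ref": 0}, {"ref": 1}]
print(clock_evict(frames))   # 2: the first frame found with ref == 0
```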
Memory Protection
Page tables provide not only address translation but also memory protection. Each page table entry contains access permission bits for that page.
| Bit | Meaning |
|---|---|
| Present | Whether the page exists in physical memory |
| Read/Write | Read-only or read/write accessible |
| User/Supervisor | Whether accessible from user mode |
| Execute Disable (NX) | Whether code execution on this page is prohibited |
When a user process attempts to access kernel memory, the MMU detects this and raises an exception. The same applies to write attempts on read-only pages or attempts to execute code in data regions. The NX bit plays a critical role in preventing execution of shellcode injected into the stack during buffer overflow attacks. Virtual memory thus serves as the foundation not only for process isolation but for security as well.
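A sketch of the permission check the MMU performs, using the bit positions x86-64 assigns to these flags (Present is bit 0, Read/Write bit 1, User/Supervisor bit 2, and NX bit 63 of the entry); the fault strings are illustrative:

```python
PRESENT, WRITABLE, USER, NX = 1 << 0, 1 << 1, 1 << 2, 1 << 63

def check_access(pte, write, user_mode, execute):
    """Return the fault the MMU would raise, or None if access is allowed."""
    if not pte & PRESENT:
        return "page fault (not present)"
    if write and not pte & WRITABLE:
        return "protection fault (read-only page)"
    if user_mode and not pte & USER:
        return "protection fault (kernel page)"
    if execute and pte & NX:
        return "protection fault (NX page)"
    return None

# A user-mode write to a read-only user page is blocked:
print(check_access(PRESENT | USER, write=True, user_mode=True, execute=False))
```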
Large Pages
The default 4KB page size becomes a limitation for workloads that use large amounts of memory. Mapping 64GB of memory requires approximately 16 million page table entries, and the range the TLB can cover is correspondingly limited.
Large pages (huge pages) mitigate this problem. x86-64 supports large pages of 2MB and 1GB. Using 2MB pages reduces the number of TLB entries needed to map the same memory range by a factor of 512, significantly lowering TLB miss rates. For workloads that access large amounts of memory contiguously, such as databases and virtual machines, the performance improvement from large pages is substantial.
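Both the 16-million-entry figure and the 512x reduction follow directly from the page sizes:

```python
mem = 64 * 2**30                      # 64GB of mappings

entries_4k = mem // (4 * 2**10)       # with 4KB pages
entries_2m = mem // (2 * 2**20)       # with 2MB large pages

print(entries_4k)                     # 16777216, ~16 million entries
print(entries_4k // entries_2m)       # 512: factor saved per 2MB page
```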
Inverted Page Tables
Traditional page tables consume memory proportional to the virtual address space size. An inverted page table is a data structure proportional to the physical memory size instead. It maintains one entry per physical frame, recording which process and which virtual page is using that frame.
This approach is efficient in terms of memory usage, but translating from virtual to physical addresses becomes a hash table lookup rather than array indexing, which can increase translation time. Inverted page tables were used in some architectures such as PowerPC and IA-64, but multi-level page tables have become the standard in today's dominant x86-64 and ARM architectures.
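A sketch of the idea: one entry per physical frame, with a hash index keyed by (process, virtual page) to avoid scanning every frame on each translation. The frame count and mappings are illustrative:

```python
class InvertedPageTable:
    """Toy inverted page table: sized by physical frames, looked up by hash."""
    def __init__(self, num_frames):
        self.frames = [None] * num_frames   # frame i -> (pid, vpn) or None
        self.index = {}                     # (pid, vpn) -> frame number

    def map(self, pid, vpn, frame):
        self.frames[frame] = (pid, vpn)
        self.index[(pid, vpn)] = frame

    def translate(self, pid, vpn):
        return self.index.get((pid, vpn))   # None means page fault

ipt = InvertedPageTable(num_frames=4)
ipt.map(pid=1, vpn=0x401, frame=2)
print(ipt.translate(1, 0x401))   # 2
print(ipt.translate(2, 0x401))   # None: same VPN, different process
```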
What Virtual Memory Enables
Virtual memory is far more than a simple address translation mechanism. Process isolation, memory protection, demand paging, copy-on-write, memory-mapped files, shared libraries: the core features of modern operating systems are all built on virtual memory. The efficient operation of the fork() system call, the ability to access files as memory through mmap(), and the fact that multiple processes can share libc's code while remaining unable to access each other's data are all made possible by virtual memory.
In the next post, we'll look at I/O and DMA.