Linux Internals 04 - Memory Management
Why Virtual Memory Exists
In early computers, programs accessed physical memory directly. If program A was using address 0x1000 and program B tried to use the same address, a collision occurred. One program could overwrite another's memory, and programs larger than physical memory simply could not run.
Virtual memory solves this problem. Each process is given its own contiguous address space, and the process never needs to know where those addresses are mapped in physical memory. When process A reads address 0x1000, it accesses physical memory at 0x50000. When process B reads the same address 0x1000, it accesses physical memory at 0x80000. Both processes believe they have exclusive use of the entire memory, but in reality the kernel is maintaining that illusion.
Structure of an Address Space
In Linux, each process's virtual address space follows a consistent layout. On a 64-bit system, the lower region is user space and the upper region is kernel space.
High address ┌──────────────────┐
             │   Kernel Space   │  (same mapping for all processes)
             ├──────────────────┤
             │      Stack       │  ↓ grows downward
             │       ...        │
             │    Free space    │
             │       ...        │
             │       Heap       │  ↑ grows upward
             ├──────────────────┤
             │       BSS        │
             │  (uninitialized  │
             │     globals)     │
             ├──────────────────┤
             │       Data       │
             │   (initialized   │
             │     globals)     │
             ├──────────────────┤
             │   Text (code)    │
Low address  └──────────────────┘
The text segment contains the executable machine code. The data and BSS segments hold global variables. The heap is used for dynamic allocation via functions like malloc() and grows upward. The stack is used for function calls and local variables and grows downward. The large gap between the heap and stack exists to give both room to grow toward each other.
Page Tables and the MMU
The translation from virtual addresses to physical addresses is performed by the CPU's MMU (Memory Management Unit). This translation does not map the entire memory at once but operates in fixed-size units called pages, typically 4KB.
The kernel maintains a page table for each process. A page table is a data structure that maps virtual page numbers to physical frame numbers. When a process accesses a virtual address, the MMU consults the page table to find the corresponding physical address.
Virtual addr:   [ virtual page number   | offset within page ]
                          │
                          ▼  (page table lookup)
Physical addr:  [ physical frame number | offset within page ]
On a 64-bit system, the virtual address space is vast, so a single flat page table would consume too much memory. Linux therefore uses multi-level page tables. On x86-64, a four-level page table hierarchy is used (PGD, PUD, PMD, PTE), and newer processors support five levels. Thanks to this multi-level structure, page table entries only need to be allocated for address ranges that are actually in use, which saves a significant amount of memory.
Traversing multiple levels on every memory access would introduce several additional memory accesses per lookup. To reduce this cost, the CPU uses a cache called the TLB (Translation Lookaside Buffer). The TLB stores recently translated virtual-to-physical address pairs and returns the physical address immediately on a repeat access, bypassing the page table walk entirely. TLB hit rates have an enormous impact on performance, and TLB invalidation during context switches is a major source of the performance cost discussed in the previous post.
Demand Paging
Does the kernel need to load every page into physical memory when a process starts? No, it does not. Linux uses demand paging. A process's virtual address space is not fully backed by physical memory from the start. Physical pages are allocated only at the moment they are actually accessed.
When a process accesses a virtual address that is not yet mapped to physical memory, a page fault occurs. A page fault is an exception raised by the CPU, handled by the kernel's page fault handler. The handler checks whether the access is valid. If it is, the handler allocates a physical page, updates the page table, and lets the process continue. If the access is invalid, a segmentation fault (SIGSEGV) is raised and the process is terminated.
Page faults come in two varieties. A minor page fault is resolved without disk I/O: allocating a fresh page or handling a Copy-on-Write situation falls into this category. A major page fault requires reading data from disk: restoring a page from swap or loading a file-mapped page from disk. Major page faults involve disk I/O and are therefore hundreds of times slower than minor ones.
Swap
When physical memory runs low, the kernel evicts pages that are not currently in use to a swap area on disk. If those pages are accessed again later, a major page fault occurs and the kernel reads them back from swap into physical memory.
Swap makes it possible to work with data sets larger than physical memory, but disk is tens of thousands of times slower than memory. When swapping becomes frequent, system performance degrades dramatically. This phenomenon is called thrashing: pages are evicted only to be needed again immediately, creating a vicious cycle of constant disk I/O.
# Check swap usage
free -h
               total        used        free      shared  buff/cache   available
Mem:            16Gi       8.2Gi       1.1Gi       512Mi       6.7Gi       7.0Gi
Swap:          4.0Gi       256Mi       3.7Gi
# Monitor swap activity (si: swap in, so: swap out)
vmstat 1
procs   memory            swap        io
 r  b    swpd    free     si   so     bi   bo
 1  0    256k    1.1G      0    0     12   28
Memory Allocation: The Buddy System and Slab Allocator
How the kernel allocates physical memory internally is equally important. Linux uses two levels of allocation mechanisms.
The buddy system manages physical memory in blocks whose sizes are powers of two. Using a 4KB page as the base unit, it maintains lists of 1-page (4KB), 2-page (8KB), 4-page (16KB) blocks, and so on. When an allocation request arrives, the smallest block that fits the request is provided. If no block of the right size is available, a larger block is split in half (the two halves become "buddies"), and upon deallocation, adjacent buddies are merged back into a larger block. This approach minimizes external fragmentation while keeping allocation and deallocation fast.
However, the kernel frequently allocates and deallocates small objects like task_struct and inode. Dedicating an entire 4KB page to a single such object would be wasteful. The slab allocator solves this by pre-dividing pages into many small, equally-sized object slots. Because the same type of object is allocated and freed repeatedly, freed object slots can be reused directly, making allocation fast and reducing internal fragmentation.
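The core idea can be shown in a few lines. This is a deliberately simplified sketch of the free-list technique, not the kernel's actual slab implementation: one page is pre-divided into fixed-size slots, and each free slot stores a pointer to the next free slot, so allocation and free are both a single list operation.

```c
#include <stddef.h>

/* Simplified slab-style cache: one 4 KB "page" split into 64-byte
   slots, with freed slots chained on a free list for instant reuse. */
#define SLAB_BYTES 4096
#define OBJ_SIZE   64                   /* must hold at least a pointer */
#define NOBJS      (SLAB_BYTES / OBJ_SIZE)

struct slab {
    char  mem[SLAB_BYTES];
    void *free_list;                    /* linked through the free slots */
};

void slab_init(struct slab *s)
{
    s->free_list = NULL;
    for (int i = 0; i < NOBJS; i++) {
        void *slot = s->mem + (size_t)i * OBJ_SIZE;
        *(void **)slot = s->free_list;  /* push slot onto the free list */
        s->free_list = slot;
    }
}

void *slab_alloc(struct slab *s)
{
    void *slot = s->free_list;
    if (slot)
        s->free_list = *(void **)slot;  /* pop the head slot */
    return slot;
}

void slab_free(struct slab *s, void *slot)
{
    *(void **)slot = s->free_list;      /* push back for immediate reuse */
    s->free_list = slot;
}
```

The real slab allocator adds per-CPU caches, object constructors, and multiple slabs per cache, but the reuse-a-freed-slot-directly principle is the same.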
The OOM Killer
What happens when both physical memory and swap are exhausted? The Linux kernel invokes the OOM (Out of Memory) Killer. The OOM Killer is a last resort that forcibly terminates processes to free memory and prevent the entire system from grinding to a halt.
The OOM Killer scores each process. Processes that consume more memory receive higher scores, and the process with the highest score is selected for termination. PID 1 and kernel threads are protected. The score of a specific process can be adjusted through /proc/[pid]/oom_score_adj. Setting the value to -1000 excludes the process from OOM Killer consideration entirely.
# Check a process's OOM score
cat /proc/1234/oom_score
# Protect from OOM Killer (-1000 to 1000)
echo -1000 > /proc/1234/oom_score_adj
Does the OOM Killer always terminate the right process? Not necessarily. The process consuming the most memory may well be the most important one. If a database server is killed simply for using a lot of memory, a critical system service goes down. This is why in production environments, it is standard practice to set oom_score_adj appropriately for important processes.
Memory Mapping: mmap
mmap() is a system call that maps a file or device directly into a process's virtual address space. Reads and writes to the mapped region translate directly to reads and writes on the file, eliminating the need for repeated read() and write() system calls.
// Map a file into memory (error handling elided for brevity)
int fd = open("data.bin", O_RDONLY);
struct stat st;
fstat(fd, &st);                  // mapping length = file size
void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Now access file contents directly through addr
printf("%c", ((char *)addr)[0]);
munmap(addr, st.st_size);
close(fd);
One important use of mmap() is loading shared libraries. Shared libraries like libc.so are used by many processes simultaneously. Having each process keep its own copy would waste physical memory. By mapping shared libraries with mmap(), multiple processes can share the same physical pages, significantly reducing physical memory consumption.
There are also anonymous mappings โ memory regions not backed by any file. When malloc() allocates large memory blocks, it internally uses mmap() with an anonymous mapping. Anonymous mappings with the MAP_SHARED | MAP_ANONYMOUS flag combination are also used to implement shared memory between processes.
In the next post, we'll look at file systems and VFS.