A Container Is a Combination of Kernel Features

At first encounter, containers feel like lightweight virtual machines. But containers and virtual machines are fundamentally different technologies. A virtual machine uses a hypervisor to emulate hardware and runs a complete operating system on top of it. A container isolates processes using only host kernel features, without any hardware emulation.

Can process isolation alone achieve an effect similar to a virtual machine? Two mechanisms provided by the Linux kernel β€” namespaces and cgroups β€” make this possible. Namespaces limit the scope of system resources a process can see, while cgroups limit the amount of resources a process can use. When these two are combined, a process behaves as if it were running on an independent system.

Namespaces: Limiting What Can Be Seen

Linux namespaces partition kernel resources so that a group of processes can see only a specific set of resources. Linux currently provides eight types of namespaces.

Namespace       Isolates                               Introduced In
Mount (mnt)     Filesystem mount points                2.4.19
UTS             Hostname and domain name               2.6.19
IPC             System V IPC, POSIX message queues     2.6.19
PID             Process IDs                            2.6.24
Network (net)   Network devices, stack, ports          2.6.29
User            User and group IDs                     3.8
Cgroup          Cgroup root directory                  4.6
Time            Boot time, monotonic time              5.6

Take the PID namespace as an example. The first process inside a container has PID 1. From the host, this same process has an entirely different PID β€” say, 4523. Running ps inside the container shows only the processes belonging to its namespace; processes on the host or in other containers are invisible. It is not that the processes' existence is hidden, but rather that the visible scope is restricted within that namespace.
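Namespace membership is itself observable from userspace. Every process exposes its namespaces as symlinks under /proc, and two processes are in the same namespace exactly when these links point to the same inode:

```shell
# Each process's namespace memberships appear as symlinks under /proc/<pid>/ns/.
# The bracketed number is an inode: processes in the same PID namespace show
# the same value here.
readlink /proc/self/ns/pid
# e.g. pid:[4026531836]

# A shell started with `sudo unshare --pid --fork` would show a different
# inode, confirming it lives in its own PID namespace.
```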

The Mount namespace provides filesystem isolation. Each container can have its own root filesystem and configure mounts completely independently from the host's filesystem structure. This is why a CentOS container can run on an Ubuntu host. The kernel is shared with the host, but the user-space filesystem can belong to an entirely different distribution.
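The isolation is easy to demonstrate: a mount performed inside a new Mount namespace never appears on the host. A minimal sketch, requiring root (note that unshare(1) makes the new namespace's mounts private by default, so nothing propagates back):

```shell
# Mount a tmpfs on /mnt inside a fresh mount namespace and list it.
sudo unshare --mount sh -c 'mount -t tmpfs tmpfs /mnt; grep /mnt /proc/mounts'
# Back on the host, /mnt is untouched: the tmpfs existed only in the namespace.
```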

The Network namespace, as we saw in the previous post, isolates the entire network stack. The User namespace makes it possible to operate as root inside a container while being mapped to an unprivileged user on the host, providing a security layer that minimizes damage in the event of a container escape.
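The User namespace mapping can be seen without any container runtime. On most distributions an unprivileged user can create one directly (a sketch, assuming `unshare` from util-linux is available and unprivileged user namespaces are enabled):

```shell
# Create a user namespace and map the current user to root inside it.
unshare --user --map-root-user sh -c 'id -u; cat /proc/self/uid_map'
# id -u prints 0 inside the namespace; uid_map shows which host UID
# that root actually corresponds to (e.g. "0 1000 1").
```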

Cgroups: Limiting How Much Can Be Used

If namespaces limit visibility, cgroups (control groups) limit usage. Cgroups are a kernel feature that controls and monitors the CPU, memory, disk I/O, and network bandwidth usage of process groups.

Without cgroups, a single container could exhaust all of the host's memory or monopolize the CPU, affecting other containers. Cgroups prevent such resource abuse and are the core mechanism for ensuring fair resource distribution in multi-tenant environments.

In cgroup v1, each resource controller (cpu, memory, blkio, etc.) maintained its own independent hierarchy. This design was flexible but complex: a single process could belong to different groups under different controllers, which made management difficult.

Cgroup v2 adopted a single unified hierarchy to resolve this complexity. All controllers share a single tree, and a process belongs to exactly one cgroup. Additionally, v2 exposes PSI (Pressure Stall Information) metrics for CPU, memory, and I/O, enabling more accurate detection of resource pressure.

Cgroup v2 hierarchy example:

/sys/fs/cgroup/
β”œβ”€β”€ cgroup.controllers    (available controllers)
β”œβ”€β”€ cgroup.subtree_control (controllers enabled for children)
β”œβ”€β”€ container-a/
β”‚   β”œβ”€β”€ memory.max        (memory limit: 512M)
β”‚   β”œβ”€β”€ cpu.max           (CPU allocation: 50%)
β”‚   └── cgroup.procs      (member processes)
└── container-b/
    β”œβ”€β”€ memory.max        (memory limit: 1G)
    β”œβ”€β”€ cpu.max           (CPU allocation: 100%)
    └── cgroup.procs
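These control files are plain text, so a hierarchy like the one above can be built with nothing but `mkdir` and `echo`. A sketch, assuming a cgroup v2 mount at /sys/fs/cgroup and root privileges; the group name `demo` is arbitrary:

```shell
# Create a cgroup, cap memory at 512 MiB and CPU at 50% of one core,
# then move the current shell into it.
sudo mkdir /sys/fs/cgroup/demo
echo 512M | sudo tee /sys/fs/cgroup/demo/memory.max
echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max  # 50ms quota per 100ms period
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs
```

Every child process of the shell now inherits these limits, which is exactly how a container runtime confines a container's process tree.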

Building a Minimal Container with unshare

The fact that a container is a combination of namespaces and cgroups can be verified directly with the unshare command. unshare is a utility that creates new namespaces and runs a command inside them.

# Create new PID, Mount, UTS, and Network namespaces and run a shell
sudo unshare --pid --mount --uts --net --fork --mount-proc /bin/bash

# Inside the container:
hostname my-container      # Hostname can be changed thanks to UTS namespace
ps aux                     # Process list starting from PID 1
ip addr                    # Only a loopback interface, still down

In this state, the host's processes are invisible and the network is isolated. Add chroot or pivot_root to replace the root filesystem, apply cgroup resource limits, and the result is an environment essentially identical to a Docker container. What Docker does is closer to wrapping these kernel features in a convenient interface and automating image distribution and networking.

Overlay Filesystem

How are container images efficiently stored and used? The key is the overlay filesystem (OverlayFS). OverlayFS layers multiple directories hierarchically to present a single unified view.

A container image is composed of a stack of read-only layers. A package installation layer sits on top of a base OS layer, and an application code layer sits on top of that. When a container runs, a single writable layer is added on top of this read-only stack. When reading a file, the layers are searched from top to bottom and the first match is returned. When modifying a file, the original is left intact and a copy is made in the writable layer for modification (copy-on-write).

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Writable layer (container) β”‚  ← stores only changes
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ Layer 3: App code          β”‚  ← read-only
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ Layer 2: Package install   β”‚  ← read-only
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ Layer 1: Base OS           β”‚  ← read-only
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The advantage of this design is that multiple containers using the same image can share the read-only layers. Even if 100 containers use the same Ubuntu image, the base layer is stored only once on disk, and each container has only its own writable layer. The same copy-on-write principle we covered at the virtual memory level in an earlier post is applied here at the filesystem level.
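The layering can be reproduced with a plain `mount` call. A sketch, requiring root and a kernel with OverlayFS; the /tmp/ovl paths are arbitrary:

```shell
# lowerdir is the read-only layer; upperdir receives all writes; workdir is
# OverlayFS's internal scratch space; merged is the unified view a container sees.
mkdir -p /tmp/ovl/lower /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/merged
echo "from lower" > /tmp/ovl/lower/file.txt
sudo mount -t overlay overlay \
    -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
    /tmp/ovl/merged
echo "modified" > /tmp/ovl/merged/file.txt
cat /tmp/ovl/lower/file.txt   # still "from lower": the write was copied up
ls /tmp/ovl/upper             # the modified copy lives in the writable layer
```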

Seccomp and Capabilities

While namespaces and cgroups provide isolation and resource limits, additional defense layers are needed from a security perspective. Processes inside a container still share the host's kernel and can access it directly through system calls.

Seccomp (Secure Computing Mode) restricts which system calls a process can use. Docker applies a default seccomp profile that blocks approximately 40 dangerous system calls. For example, system calls like reboot(), mount(), and kexec_load() cannot be invoked from inside a container. Applying a whitelist approach that permits only the system calls a container needs can greatly reduce the attack surface even if a kernel vulnerability is discovered.
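Whether a process runs under such a filter is visible in /proc, where the kernel reports each process's seccomp mode (0 = disabled, 1 = strict, 2 = filter, the mode Docker's default profile uses):

```shell
# On a typical host shell this prints "Seccomp: 0"; inside a default Docker
# container the same command reports "Seccomp: 2" (filter mode active).
grep '^Seccomp:' /proc/self/status
```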

Linux Capabilities subdivide the traditional root privileges. In the past, it was a binary distinction β€” root or not root. With Capabilities, root privileges are separated into over 30 individual permissions. For example, granting only CAP_NET_BIND_SERVICE allows binding to ports below 1024 but confers no other root privileges. Docker grants containers only a restricted set of Capabilities by default, so even when running as root, the actual level of privilege is very different from that of host root.
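The capability sets are likewise exposed in /proc as bitmasks (decoding a mask requires `capsh` from libcap, which may not be installed):

```shell
# CapEff is the effective capability set. Full host root shows a mask with all
# bits set; a default Docker container shows a much smaller mask.
grep '^CapEff:' /proc/self/status
# Decode a mask into capability names, e.g.:
# capsh --decode=00000000a80425fb
```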

Containers vs Virtual Machines

Understanding the difference between containers and virtual machines clarifies the appropriate use cases for each.

Virtual Machines:                    Containers:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  App A  β”‚ β”‚  App B  β”‚          β”‚  App A  β”‚ β”‚  App B  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€€          β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚Guest OS β”‚ β”‚Guest OS β”‚          β”‚  Bins/  β”‚ β”‚  Bins/  β”‚
β”‚ (full)  β”‚ β”‚ (full)  β”‚          β”‚  Libs   β”‚ β”‚  Libs   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€€          β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚      Hypervisor     β”‚          β”‚   Host OS Kernel    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€          β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚       Hardware      β”‚          β”‚       Hardware      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

With virtual machines, the hypervisor emulates hardware and runs a complete guest OS on top. The level of isolation is very high, allowing different operating systems to run, but guest OS boot time and resource consumption become overhead. Containers share the host kernel, so they can start in milliseconds with minimal memory consumption, but they must use the same kernel as the host and their isolation level is lower than that of virtual machines.

Can containers fully replace virtual machines? Not entirely. Virtual machines are still needed for workloads requiring a different kernel (for example, running Windows on a Linux host) or environments where strong security isolation is essential. However, for the majority of cases where application-level isolation on the same kernel is needed, containers are the far more efficient choice. Recently, there have been efforts like Kata Containers to combine the advantages of both by running containers inside lightweight virtual machines.

OCI Runtime Specification

In the early container ecosystem, Docker was the de facto standard. But as container technology matured, the need for standardization emerged, and the OCI (Open Container Initiative) was born.

OCI defines two core specifications. The Runtime Specification defines the execution environment for a container. The root filesystem path, which namespaces to apply, cgroup settings, seccomp profiles, and more are described in JSON format. The Image Specification defines the format of container images, ensuring compatibility across different runtimes.
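A trimmed, illustrative fragment of such a config.json might look like the following (the field names follow the OCI Runtime Specification; the values are examples):

```json
{
  "ociVersion": "1.0.2",
  "root": { "path": "rootfs", "readonly": true },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" }
    ],
    "resources": {
      "memory": { "limit": 536870912 }
    }
  }
}
```

Everything covered so far appears here as plain configuration: which namespaces to create, where the root filesystem lives, and which cgroup limits to apply.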

runc is the reference implementation of the OCI Runtime Specification. Docker itself uses runc internally to create and run containers. Thanks to this standardization, various container runtimes such as Docker, Podman, and containerd can use the same images and run compatible containers.

Wrapping Up the Series: Connecting Kernel Concepts

This series began with an operating system overview and has arrived at containers, examining the core concepts of the Linux kernel along the way. Looking back, these concepts do not exist in isolation β€” they are intimately connected to each other.

The fork() and clone() calls we learned about in process management are used directly for creating container namespaces. Passing flags like CLONE_NEWPID and CLONE_NEWNET to the clone() system call creates a process in new namespaces. The CFS scheduler integrates with cgroup's CPU controller to manage CPU distribution across containers.

The copy-on-write concept we learned in virtual memory is the same principle behind how the overlay filesystem works. An optimization at the page level has been applied at the filesystem level as well. The OOM Killer from memory management works in conjunction with the cgroup memory controller, so when a container exceeds its memory limit, it terminates processes inside that container rather than on the host.

The VFS abstraction layer we learned about in file systems is the foundation on which OverlayFS operates. The unified interface provided by VFS makes it possible to layer various filesystems on top of one another. Seccomp, which we covered in system calls, is used directly for container system call filtering.

The network namespaces, veth pairs, and bridges we learned about in networking are the very building blocks of container networking. RCU and lock-free data structures from synchronization are the foundation that allows all of these subsystems to maintain high performance in multicore environments.

Ultimately, a container is not a new technology but a sophisticated combination of Linux kernel features that have evolved over decades. Understanding each kernel subsystem naturally leads to understanding how containers work, and it also makes clear where the limitations of containers lie. It is my hope that this series has served as a starting point for understanding the inner workings of the Linux kernel.