Linux Internals 06 - System Calls and the Kernel

What Is a System Call?

Programs in user space cannot access hardware directly. To read a file, create a process, or send a network packet, they must ask the kernel. The channel for these requests is the system call. System calls are the only official interface between user space and kernel space — a contract for using the services the kernel provides.

The Linux kernel offers several hundred system calls; the exact count depends on the architecture and kernel version. From file-related calls like open, read, write, and close, to process-related calls like fork, execve, and wait, to network-related calls like socket, bind, and listen, every operating system service is exposed through system calls. Most code a programmer writes ultimately boils down to combinations of these system calls.

How a System Call Executes

When a system call is invoked, the CPU's execution mode switches from user mode to kernel mode. This is not a simple function call — it is a fundamental mode change where the CPU's privilege level shifts. Here is the process step by step.

First, the program (usually through a library wrapper) places the system call number in a register (rax on x86-64) and the arguments in designated registers (rdi, rsi, rdx, r10, r8, and r9, in that order). It then executes the syscall instruction, which switches the CPU to kernel mode and jumps to the kernel's entry point. The kernel uses the number in rax to look up the corresponding handler in the system call table and runs it. When the handler completes, its return value is placed in rax, and the kernel returns to user mode, normally with the sysret instruction.

User Space                          Kernel Space
┌────────────┐                   ┌─────────────────┐
│ Program    │                   │                 │
│            │  1. rax = call #  │                 │
│ write(fd,  │  2. syscall inst  │  Syscall Table  │
│   buf, n)  │ ──────────────►   │  ┌───┬─────────┐│
│            │                   │  │ 0 │sys_read  ││
│            │  5. sysret        │  │ 1 │sys_write ││
│            │ ◄──────────────   │  │ 2 │sys_open  ││
│ Check ret  │                   │  │...│  ...     ││
└────────────┘                   │  └───┴─────────┘│
                                 │  3. Run handler  │
                                 │  4. Result → rax │
                                 └─────────────────┘

A critical detail in this process is that user space and kernel space have separate stacks. On entry to kernel mode, execution moves to the thread's kernel stack and the user-space register state is saved there. (On x86-64 the syscall instruction itself leaves rsp untouched, so the kernel's entry code performs the switch.) Without this separation, a malicious program could manipulate the stack the kernel runs on and compromise the entire system.

Traps and Interrupts

A system call is one type of software trap: the program deliberately executes an instruction that transfers control to the kernel. Debug breakpoints work the same way. Other CPU exceptions, such as division by zero, are not intentional, but they are also raised by a specific instruction. What all of these share is that they are synchronous: they occur at the precise moment the program executes that instruction.

Interrupts, by contrast, are asynchronous. When an external event occurs — a keyboard press, disk I/O completion, or network packet arrival — the hardware sends a signal to the CPU. The CPU finishes the currently executing instruction and then branches to the interrupt handler to process the event.

Does the distinction between traps and interrupts matter in practice? It does. Traps are handled in the context of the current process, while interrupts can occur regardless of which process is running. The kernel employs a strategy of doing as little work as possible during interrupt handling (the top half) and deferring the rest for later processing (the bottom half). Without this separation, frequent interrupts could severely degrade system responsiveness.

glibc Wrapper Functions

When a C program calls write(), the programmer is not manually setting registers and executing the syscall instruction. That is because glibc (the GNU C Library) wraps this process in a wrapper function.

glibc wrapper functions do more than simply invoke system calls on the programmer's behalf. They standardize error handling: when a system call fails, the wrapper sets errno and returns -1. Higher-level library functions go further and buffer data in user space to reduce unnecessary kernel entries. A good example is printf, which is not a system call wrapper itself but sits on top of write. It accumulates output in a user-space buffer and calls write only when the buffer fills, the stream is flushed, or, on a line-buffered terminal, a newline is written.

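// What the programmer writes
ssize_t n = write(fd, buf, count);
if (n == -1) {
    perror("write failed");
}

// What happens inside glibc (conceptual, x86-64 only)
#include <errno.h>
#include <sys/syscall.h>   // __NR_write
#include <sys/types.h>     // ssize_t, size_t

ssize_t write(int fd, const void *buf, size_t count) {
    long ret;
    asm volatile (
        "syscall"
        : "=a" (ret)                           // rax: return value
        : "a" ((long)__NR_write),              // rax: system call number
          "D" ((long)fd), "S" (buf), "d" (count)   // rdi, rsi, rdx: arguments
        : "rcx", "r11", "memory"               // syscall overwrites rcx and r11
    );
    // The kernel reports failure as a negative errno value in [-4095, -1].
    if (ret < 0 && ret >= -4095) {
        errno = (int)-ret;
        return -1;
    }
    return ret;
}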

Tracing System Calls with strace

strace is a tool that lets you observe in real time which system calls a program makes. It uses the ptrace system call to intercept every system call entry and return of the target process.

$ strace ls /tmp
execve("/usr/bin/ls", ["ls", "/tmp"], 0x7ffd...) = 0
...
openat(AT_FDCWD, "/tmp", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
getdents64(3, /* 5 entries */, 32768)   = 160
getdents64(3, /* 0 entries */, 32768)   = 0
close(3)                                = 0
write(1, "file1.txt  file2.txt\n", 21)  = 21
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?

This output reveals that a simple ls /tmp command actually invokes dozens of system calls. You can see it opening the directory with openat, reading directory entries with getdents64, and writing the results with write. When a program does not behave as expected, strace is a powerful debugging tool that pinpoints exactly which system call failed and why.

Using strace -c shows statistics on call counts and time spent per system call, making it useful for performance analysis as well. You can also filter specific categories with options like strace -e trace=file.

Kernel Modules

Linux is a monolithic kernel, but not all functionality needs to be loaded into memory at boot time. Kernel modules are pieces of code that can be dynamically loaded into or unloaded from a running kernel. Device drivers, file systems, and network protocols are often implemented as kernel modules.

Three basic commands manage kernel modules. insmod loads a module into the kernel, rmmod unloads a module, and lsmod shows the list of currently loaded modules. In practice, modprobe is used more often than insmod because modprobe automatically resolves module dependencies and loads required modules in the correct order.

$ lsmod | head -5
Module                  Size  Used by
snd_hda_intel         57344  2
snd_intel_dspcfg      28672  1 snd_hda_intel
snd_hda_codec        172032  1 snd_hda_intel
snd_hda_core         106496  2 snd_hda_codec,snd_hda_intel

Writing a Simple Kernel Module

The best way to understand how kernel modules work is to write one. Below is a minimal kernel module that prints a message when loaded and another when unloaded.

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("example");
MODULE_DESCRIPTION("A minimal kernel module");

static int __init hello_init(void) {
    printk(KERN_INFO "hello: module loaded\n");
    return 0;
}

static void __exit hello_exit(void) {
    printk(KERN_INFO "hello: module unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);

Several things are worth noting in this code. Unlike a regular C program, a kernel module has no main function. Instead, the module_init and module_exit macros designate the entry and exit points. Output uses printk rather than printf, because the standard C library is not available in kernel space. The output from printk is written to the kernel ring buffer and can be viewed with the dmesg command.

Kernel modules run with the same privileges as the kernel itself, so a buggy module can crash the entire system. This is why kernel development demands far more caution than user-space development.

The /proc and /sys Interfaces

The kernel uses virtual file systems to expose its internal state to user space. /proc and /sys are the two primary examples.

The /proc file system was originally designed for process information. Under /proc, there is a directory named after the PID of each running process, where you can examine the process's memory map, open file descriptors, command-line arguments, and more. Over time, system-wide information unrelated to processes was also added to /proc. /proc/meminfo shows memory usage, /proc/cpuinfo shows CPU information, and /proc/interrupts shows interrupt statistics.

$ cat /proc/self/status | head -8
Name:   cat
Umask:  0022
State:  R (running)
Tgid:   12345
Ngid:   0
Pid:    12345
PPid:   12300
TracerPid:   0

The /sys file system was introduced to address the clutter in /proc. It hierarchically reflects the kernel's device model, providing structured information about devices, buses, and drivers. While /proc accumulated unorganized information for historical reasons, /sys was designed with a structured layout from the start.

These virtual file systems are not read-only. Writing a value to certain files changes the kernel's behavior. For example, writing 1 to /proc/sys/net/ipv4/ip_forward (which requires root privileges) enables IP forwarding. Thanks to these interfaces, the kernel can be configured using nothing more than standard file I/O, with no special APIs or tools required.

In the next post, we'll look at I/O and device management.
