Block Devices and Character Devices

Devices in Linux fall into two broad categories. Block devices read and write data in fixed-size blocks. Hard disks, SSDs, and USB drives belong here. Character devices process data sequentially as a byte stream. Keyboards, mice, serial ports, and terminals are character devices.

Why does this distinction matter? Block devices support random access β€” you can read any block at any position on the disk. Character devices, on the other hand, can only be accessed sequentially, and data that has already passed cannot be re-read. Because of this fundamental difference, the kernel handles the two types through entirely separate paths. Block devices require complex infrastructure like I/O schedulers, the page cache, and buffer caches, while character devices can get by with a relatively simple driver interface.

Running ls -l on entries in /dev shows either b (block) or c (character) as the first character of each device file's permission string.

$ ls -l /dev/sda /dev/tty0
brw-rw---- 1 root disk 8, 0  Feb 12 10:00 /dev/sda
crw--w---- 1 root tty  4, 0  Feb 12 10:00 /dev/tty0

Major and Minor Numbers

The number pairs shown in place of the file size in the output above β€” 8, 0 and 4, 0 β€” are the major and minor numbers. The major number identifies the driver responsible for the device, and the minor number identifies a specific device among the multiple devices managed by that driver.

For example, major number 8 represents the SCSI disk driver. The minor number 0 for /dev/sda refers to the entire first disk, while minor number 1 for /dev/sda1 refers to the first partition. When a process accesses a device file, the kernel uses this number pair to route the request to the appropriate driver.

Is this number-based scheme still adequate for modern systems? In truth, the explosive growth in device counts has exposed the limitations of static number allocation. To address this, Linux introduced dynamic major number allocation and a device management system called udev, which we will cover later in this post.

I/O Schedulers

For block devices β€” especially HDDs with spinning platters β€” the physical movement of the disk head becomes a performance bottleneck. The I/O scheduler reorders I/O requests from multiple processes to minimize the distance the disk head must travel.

Looking at the major I/O schedulers that have been used in Linux reveals the design philosophy behind each one.

The noop scheduler (now called none) does exactly what its name suggests β€” no reordering at all. Requests are forwarded to the device in the order they arrive. For devices like SSDs that have no seek time, reordering is unnecessary overhead, making noop the appropriate choice.

CFQ (Completely Fair Queuing) maintains a separate queue for each process and allocates fair I/O bandwidth to each. It was effective at preventing any single program from monopolizing I/O on desktop systems where multiple applications accessed the disk simultaneously.

The deadline scheduler assigns an expiration time to each I/O request. Read requests have a default deadline of 500ms and write requests 5 seconds, with requests nearing their deadline processed first. This approach prevents starvation caused by I/O reordering while still performing a reasonable level of optimization.

BFQ (Budget Fair Queuing) is the successor to CFQ. It allocates an I/O budget to each process and ensures fair disk usage within that budget. It delivers particularly good responsiveness on slow devices and interactive workloads.

$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none

Modern Linux kernels use the multi-queue block layer (blk-mq), and you can choose among mq-deadline, kyber, bfq, and none as shown in the output above. The advent of NVMe SSDs gave storage devices their own multiple hardware queues, and the legacy single-queue scheduler could not exploit this parallelism β€” which is why the entire block layer was redesigned.

DMA: Data Transfer Without the CPU

What would happen if the CPU had to copy each byte one by one when transferring large amounts of data from disk to memory? The CPU would be unable to perform any other work until the transfer completed. DMA (Direct Memory Access) is the hardware mechanism that solves this problem.

With DMA, the CPU merely instructs the DMA controller β€” "move this much data from here to there" β€” and the DMA controller performs the actual transfer. When the transfer is complete, the DMA controller raises an interrupt to notify the CPU. In the meantime, the CPU is free to execute other processes or perform other computations.

Without DMA (PIO mode)           With DMA
β”Œβ”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”             β”Œβ”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”
β”‚ CPU │◄──►│ Disk β”‚             β”‚ CPU β”‚    β”‚ Disk β”‚
β”‚     β”‚    β”‚      β”‚             β”‚     β”‚    β”‚      β”‚
β”‚Byte β”‚    β”‚      β”‚             β”‚Cmd  β”‚    β”‚      β”‚
β”‚copy β”‚    β”‚      β”‚             β”‚only β”‚    β”‚      β”‚
β””β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”˜             β””β”€β”€β”¬β”€β”€β”˜    β””β”€β”€β”¬β”€β”€β”€β”˜
                                   β”‚          β”‚
                                   β–Ό          β–Ό
                                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                β”‚ DMA Controller  β”‚
                                β”‚(direct transfer)β”‚
                                β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”‚
                                        β–Ό
                                    β”Œβ”€β”€β”€β”€β”€β”€β”
                                    β”‚Memoryβ”‚
                                    β””β”€β”€β”€β”€β”€β”€β”˜

Virtually all modern block device drivers use DMA. DMA is also essential for network cards β€” at high speeds, having the CPU copy each packet is practically impossible.

Buffered I/O and Direct I/O

When a process reads a file with the read system call, does the data go straight from disk to the user buffer? In general, no. Linux uses buffered I/O by default: data read from disk is first stored in the kernel's page cache and then copied to the user buffer.

Why is this extra copy beneficial? When multiple processes read the same data, or when the same process reads it repeatedly, the data can be served directly from the page cache without any disk access. For most workloads, this caching effect far outweighs the cost of the additional copy.

However, applications such as databases implement their own caching strategies; for them, the kernel's page cache means redundant memory consumption and an extra copy. In such cases, the O_DIRECT flag can be used to perform direct I/O, which bypasses the page cache and transfers data directly between the disk and the user buffer.

// Buffered I/O (default)
int fd = open("/data/file", O_RDONLY);

// Direct I/O
int fd = open("/data/file", O_RDONLY | O_DIRECT);

Using direct I/O requires the user buffer address, the transfer sizes, and the file offsets to be aligned to the device's logical block size (typically 512 bytes or 4KB). Despite these constraints, database systems prefer direct I/O because their own cache management is better optimized for their specific workloads than the general-purpose page cache.

The Page Cache

The page cache is the cornerstone of Linux I/O performance. The kernel uses memory that would otherwise sit idle as a page cache to minimize disk access. The buff/cache column in the output of the free command shows exactly this.

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           16Gi       4.2Gi       1.8Gi       256Mi        10Gi        11Gi
Swap:          4.0Gi          0B       4.0Gi

In the output above, 10GB of 16GB is used for buff/cache, yet available shows 11GB. This is because page cache memory can be freed at any time. When a process needs more memory, the kernel shrinks the page cache to make room. A large page cache does not mean the system is running low on memory β€” it means available memory is being used efficiently for caching.

The page cache is equally important for write operations. When data is written via the write system call, it is not immediately committed to disk. Instead, the corresponding page in the page cache is marked dirty. The kernel's writeback threads (historically pdflush, now the bdi flusher) asynchronously write dirty pages to disk after a certain time elapses or when the ratio of dirty pages exceeds a threshold. The fsync system call can be used to force a specific file's dirty pages to be written to disk immediately.

Polling and Interrupts

There are two ways to determine whether a device has completed its work. Polling has the CPU periodically check the device's status register. Interrupts have the device send a signal to the CPU when the work is done.

Are interrupts always superior to polling? In most cases, yes. Polling wastes CPU cycles while waiting for the device to become ready. But with ultra-fast devices, the situation reverses. In environments where I/O completions are extremely frequent β€” NVMe SSDs or high-speed network cards β€” the overhead of processing interrupts itself becomes the bottleneck. Each interrupt triggers a context switch, runs the interrupt handler, and pollutes the cache.

For this reason, modern high-performance drivers use a hybrid approach. They normally operate in interrupt mode but switch to polling mode when I/O surges, eliminating interrupt overhead. NAPI (New API) in the Linux network stack is the canonical implementation of this hybrid approach.

udev and Device Management

In the past, device files in the /dev directory had to be created manually by the administrator. Hundreds of device files were pre-created regardless of whether the actual hardware existed, and when new hardware was added, the administrator had to use the mknod command to create device files by hand.

udev solves this problem fundamentally. When the kernel detects hardware, it generates a uevent. The udev daemon receives this event and automatically creates or removes device files according to its rules. The fact that /dev/sdb appears automatically when you plug in a USB drive and disappears when you unplug it is thanks to udev.

udev rule files are located in the /etc/udev/rules.d/ directory, where you can specify names, permissions, and owners based on device attributes, or configure scripts to run when a device is connected.

# /etc/udev/rules.d/99-usb-storage.rules
# Example rule assigning a fixed name to a specific USB device
SUBSYSTEM=="block", ATTRS{idVendor}=="0781", ATTRS{idProduct}=="5567", SYMLINK+="myusb"

This rule automatically creates a symbolic link at /dev/myusb when a USB device with the specified vendor/product ID is connected. It solves the problem of device names changing β€” /dev/sdb, /dev/sdc β€” each time a device is plugged in.

udev does more than simply manage device files. It serves as a bridge connecting the kernel's device model with the /sys file system and forms the foundation for automation that reacts to hardware events. Since the advent of systemd, udev has been integrated as systemd-udevd and become part of the broader system management framework.

In the next post, we'll look at the internals of the networking stack.