Where the CPU Meets the Outside World

A computer is not composed solely of a CPU and memory. It needs to read data from disks, send packets over the network, and render images on the screen. The data exchange between the CPU and these external devices is I/O (Input/Output), and how efficiently I/O is handled determines the performance of the entire system.

Is I/O really such a significant bottleneck? Modern CPUs can process billions of instructions per second, but disk access is measured in milliseconds and network round-trips take hundreds of microseconds to tens of milliseconds. In terms of CPU cycles, millions of instructions could execute during the time spent waiting for a single disk I/O operation. Managing this gap is a central challenge in system design.

Programmed I/O

The simplest I/O approach is for the CPU to repeatedly check the status of a device. This is called programmed I/O or polling. The CPU reads a device register to check whether data is ready, and if not, loops back to check again.

; Programmed I/O (polling) pseudocode
loop:
    status = read(device_status_register)
    if status != READY:
        goto loop
    data = read(device_data_register)

The problem with this approach is that the CPU cannot perform any other useful work while waiting for data. Polling can actually be efficient when reading small amounts of data from a fast device, but with slow devices it results in severe waste of CPU resources.

Interrupt-Driven I/O

Interrupt-driven I/O addresses the inefficiency of polling. After the CPU sends an I/O request to a device, it continues performing other work. When the device completes the data transfer, it raises an interrupt to notify the CPU. The CPU then suspends its current task and executes the interrupt handler to process the data.

This approach prevents the CPU from sitting idle while waiting for I/O. However, one problem remains: the actual work of moving data between the device and memory still falls to the CPU, byte by byte or word by word. For large transfers, the CPU spends its cycles copying data instead of doing computation that could be used elsewhere.

DMA: Freeing the CPU from Data Transfer

DMA (Direct Memory Access) is a mechanism that allows devices to exchange data directly with memory without CPU involvement. The CPU only needs to tell the DMA controller the memory address, size, and direction (read/write) of the data to transfer. The actual data transfer is then performed independently by the DMA controller, which notifies the CPU via an interrupt when the transfer is complete.

CPU                     DMA Controller            Device       Memory
 β”‚                          β”‚                     β”‚            β”‚
 β”‚ ── Setup Transfer ─────▢ β”‚                     β”‚            β”‚
 β”‚   (address, size, dir)   β”‚                     β”‚            β”‚
 β”‚                          β”‚                     β”‚            β”‚
 β”‚   (performs other work)  β”‚ ◀── data ──────────│            β”‚
 β”‚                          β”‚ ─── data ───────────────────────▢│
 β”‚                          β”‚ ◀── data ──────────│            β”‚
 β”‚                          β”‚ ─── data ───────────────────────▢│
 β”‚                          β”‚                     β”‚            β”‚
 β”‚ ◀── Completion IRQ ──── β”‚                     β”‚            β”‚

With DMA, the CPU can freely perform other computations even during large data transfers. While megabytes of data are being read from disk, the CPU can continue executing instructions for other processes. However, since DMA transfers and the CPU share the memory bus, the CPU's memory access speed may decrease somewhat during heavy DMA transfers. This is known as cycle stealing.
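The handshake in the diagram can be mimicked with a toy model. The `DmaController` class and its `program`/`run` methods are invented names for illustration; on real hardware the "program" step is a handful of register writes, and `run` happens in the controller's own hardware concurrently with the CPU.

```python
# Toy model of the DMA handshake shown in the diagram (simulated;
# DmaController is an illustrative name, not a real API).

class DmaController:
    def program(self, src, dst_buf, dst_off, size, on_complete):
        # CPU side: describe the transfer (address, size, direction),
        # then return immediately.
        self._job = (src, dst_buf, dst_off, size, on_complete)

    def run(self):
        # Controller side: moves the data without CPU involvement,
        # then signals completion (the interrupt).
        src, dst, off, size, done = self._job
        dst[off:off + size] = src[:size]
        done()

memory = bytearray(16)            # destination region in "RAM"
disk_data = b"ABCDEFGH"           # data sitting on the device
completed = []

dma = DmaController()
dma.program(disk_data, memory, 4, 8, lambda: completed.append("irq"))
# ... CPU executes other instructions here ...
dma.run()                         # transfer + completion interrupt
```

Note that the CPU touches only the setup and the completion notification; the data itself never passes through CPU registers.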

Memory-Mapped I/O vs Port I/O

There are two approaches for the CPU to communicate with devices. Port-mapped I/O uses a separate I/O address space and dedicated instructions (IN/OUT on x86) to access devices. Memory-mapped I/O maps device registers into a portion of the memory address space, allowing devices to be accessed using standard memory read/write instructions.

Memory-mapped I/O dominates in modern systems. It simplifies programming since no separate I/O instructions are needed, and existing memory protection mechanisms can be leveraged to control device access. Both the configuration space of PCIe devices and GPU framebuffers are accessed through memory-mapped I/O.
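The essence of memory-mapped I/O is that ordinary loads and stores at reserved addresses drive the device. A minimal sketch, with entirely made-up addresses and register layout, and a dictionary standing in for the physical address space:

```python
# Minimal sketch of memory-mapped I/O: device registers occupy fixed
# offsets inside the ordinary address space, so plain reads and writes
# drive the device. Addresses and register layout here are hypothetical.

REG_BASE   = 0x1000          # base address reserved for the device
REG_CTRL   = REG_BASE + 0x0  # control register
REG_STATUS = REG_BASE + 0x4  # status register

address_space = {}           # stand-in for the physical address space

def store32(addr, value):    # an ordinary memory write = device command
    address_space[addr] = value

def load32(addr):            # an ordinary memory read = device status
    return address_space.get(addr, 0)

store32(REG_CTRL, 0x1)       # start the device with a normal store
store32(REG_STATUS, 0x1)     # (real hardware would set this itself)
ready = load32(REG_STATUS) & 0x1
```

In port-mapped I/O the same operations would instead require dedicated IN/OUT instructions targeting a separate I/O address space.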

Evolution of Bus Architecture

The ISA bus in early PCs operated at 8MHz with a 16-bit width, providing a maximum bandwidth of 8MB/s. As increasingly faster devices β€” graphics cards, network cards, disk controllers β€” emerged, this bandwidth quickly proved insufficient.

The PCI bus operated at 33 or 66MHz with a 32-bit or 64-bit width, providing between 133MB/s and 533MB/s of bandwidth. However, PCI was a shared bus, meaning multiple devices had to compete for the same bus bandwidth.

PCIe (PCI Express) fundamentally changed this limitation. By adopting point-to-point serial links instead of a shared bus, each device could secure its own dedicated bandwidth. PCIe bandwidth is determined by the number of lanes and the generation.

Generation    Per-Lane Bandwidth    x16 Bandwidth
PCIe 3.0      ~1 GB/s               ~16 GB/s
PCIe 4.0      ~2 GB/s               ~32 GB/s
PCIe 5.0      ~4 GB/s               ~64 GB/s
PCIe 6.0      ~8 GB/s               ~128 GB/s

This is why high-performance GPUs use PCIe x16 slots. Devices that require large-scale data transfers secure more lanes to increase their available bandwidth.
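The arithmetic behind the table is simple enough to check directly: link bandwidth is roughly per-lane bandwidth times lane count. The figures below are the approximate values from the table.

```python
# Back-of-the-envelope check of the table above: per-link bandwidth is
# roughly per-lane bandwidth times the lane count.

per_lane_gb_s = {"3.0": 1, "4.0": 2, "5.0": 4, "6.0": 8}  # approx GB/s/lane

def link_bandwidth(gen, lanes):
    return per_lane_gb_s[gen] * lanes

x16_gen4 = link_bandwidth("4.0", 16)   # a GPU in a Gen4 x16 slot
x4_gen4  = link_bandwidth("4.0", 4)    # a typical NVMe SSD on a x4 link
```

The same Gen4 link delivers about 32 GB/s at x16 but only about 8 GB/s at x4, which is why lane count matters as much as generation for bandwidth-hungry devices.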

I/O Scheduling

When multiple processes simultaneously request disk I/O, the operating system's I/O scheduler determines the processing order of these requests. For traditional rotational hard drives, minimizing disk head movement was important, making scheduling algorithms like the elevator algorithm effective.
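The elevator idea can be sketched in a few lines: service pending requests in the direction the head is already moving, then sweep back, rather than jumping to requests in arrival order. This is a simplified SCAN-style ordering over cylinder numbers, not a real kernel scheduler.

```python
# Sketch of the elevator (SCAN) idea: service requests in the current
# head direction first, then reverse, minimizing total head movement.

def elevator_order(head, requests):
    """Return cylinder numbers in SCAN service order (sweeping up first)."""
    up   = sorted(r for r in requests if r >= head)
    down = sorted((r for r in requests if r < head), reverse=True)
    return up + down

# Head at cylinder 50; pending requests scattered across the disk.
order = elevator_order(50, [10, 95, 30, 60, 80])
```

Serving `[60, 80, 95, 30, 10]` moves the head far less than the arrival order would; on an SSD, with no head to move, this reordering buys nothing.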

The proliferation of SSDs changed this situation. SSDs have no physically moving heads, so the concept of seek time disappears entirely. Consequently, simple FIFO or basic priority-based schedulers became more appropriate than complex schedulers optimizing request order. The evolution of Linux's default I/O scheduler from CFQ to mq-deadline and then to none reflects precisely this hardware shift.

IOMMU and Device Isolation

DMA allows devices to access memory directly, but this carries security risks. A malicious or malfunctioning device could read or overwrite arbitrary memory regions.

The IOMMU (I/O Memory Management Unit) solves this problem. Just as the CPU's MMU controls process memory access, the IOMMU controls device memory access. By restricting the memory regions accessible to each device, it prevents devices from accessing unauthorized memory. This is also why the IOMMU is essential when directly assigning physical devices to guest operating systems (passthrough) in virtualization environments.
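Conceptually, the IOMMU keeps a per-device translation table and rejects DMA to anything not mapped for that device. A toy model, with invented class and method names (real IOMMUs use hardware page tables, not dictionaries):

```python
# Toy model of IOMMU enforcement: each device gets its own mapping of
# allowed I/O-virtual pages to physical pages; DMA outside the mapping
# is blocked. Class and method names are illustrative.

class Iommu:
    def __init__(self):
        self.tables = {}   # device id -> {io_virtual_page: physical_page}

    def map(self, dev, iova_page, phys_page):
        self.tables.setdefault(dev, {})[iova_page] = phys_page

    def translate(self, dev, iova_page):
        table = self.tables.get(dev, {})
        if iova_page not in table:
            raise PermissionError("DMA to unmapped page blocked")
        return table[iova_page]

iommu = Iommu()
iommu.map("nic0", iova_page=0x10, phys_page=0x7F)

ok = iommu.translate("nic0", 0x10)     # permitted DMA: translated
try:
    iommu.translate("nic0", 0x20)      # device strays outside its mapping
    blocked = False
except PermissionError:
    blocked = True                     # rogue access is stopped
```

This is exactly the property passthrough depends on: a guest's device can DMA only into pages the hypervisor has mapped for it.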

Modern Storage Interfaces: NVMe

The SATA interface was designed for rotational hard drives, with a command queue depth of only 32 and support for just a single queue. This interface becomes a bottleneck when trying to fully exploit the potential performance of SSDs.

NVMe (Non-Volatile Memory Express) is a protocol designed from the ground up for SSDs. It operates directly over the PCIe bus, supports up to 65,535 queues, and each queue can have a depth of up to 65,536 entries. This design allows each core in a multicore environment to have its own I/O queue, avoiding lock contention on queue access. This is why NVMe SSDs achieve overwhelmingly higher IOPS (I/O Operations Per Second) compared to SATA SSDs.
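The per-core queue design can be illustrated with a minimal sketch. The point is structural: each core submits only to its own queue, so no lock is shared across cores (the command strings and queue layout here are simplified, not the real NVMe submission-queue entry format).

```python
# Sketch of the NVMe queuing model: one submission queue per core, so
# cores enqueue I/O commands without contending for a single shared queue.

NUM_CORES = 4
queues = [[] for _ in range(NUM_CORES)]   # per-core submission queues

def submit(core, command):
    # Each core touches only its own queue: no cross-core lock needed.
    queues[core].append(command)

for core in range(NUM_CORES):
    submit(core, f"read lba={core * 8}")

depths = [len(q) for q in queues]
```

Under SATA's single-queue model, all four cores would instead serialize on one queue of depth 32, which is where the IOPS gap comes from.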

I/O Virtualization: SR-IOV

In virtualization environments where multiple virtual machines share a single physical network card, the hypervisor must mediate I/O requests. This software mediation layer increases latency and CPU overhead.

SR-IOV (Single Root I/O Virtualization) divides a single physical device into multiple Virtual Functions, allowing each virtual machine to access the device directly as if it had a dedicated physical device. Since devices are accessed directly without hypervisor mediation, latency is significantly reduced and CPU overhead is minimized. The widespread adoption of SR-IOV in cloud environments requiring high-performance networking is due to these advantages.

In the next post, we'll look at multicore and modern processors.