Structure of the Linux Networking Stack

From the moment a network packet departs a remote server to the moment it reaches a local application, a remarkably complex process unfolds inside the Linux kernel. Understanding this process requires first grasping the layered architecture of the Linux networking stack.

The Linux networking stack follows a layered structure similar to the OSI model, but the actual implementation aligns more closely with the TCP/IP four-layer model. At the top sits the socket interface, below which lie the transport layer (TCP, UDP), the network layer (IP), and the link layer (Ethernet, drivers). Each layer processes its own header and passes the payload to the next layer.

┌──────────────────────────────────┐
│      Application (User Space)    │
├──────────────────────────────────┤
│         Socket Interface         │
├──────────────────────────────────┤
│    Transport Layer (TCP / UDP)   │
├──────────────────────────────────┤
│    Network Layer (IP / Routing)  │
├──────────────────────────────────┤
│    Netfilter (Packet Filtering)  │
├──────────────────────────────────┤
│   Link Layer (Ethernet / Driver) │
├──────────────────────────────────┤
│          NIC (Hardware)          │
└──────────────────────────────────┘

The Socket API

The only way user-space programs interact with the network is through the socket API. socket(), bind(), listen(), accept(), connect(), send(), recv(): this series of system calls forms the foundation of all network programming.

What makes sockets interesting is their integration with the file abstraction. A socket returns a file descriptor, and data can be sent and received using read() and write() just like with a regular file. The Unix philosophy of "everything is a file" extends to networking as well. Thanks to this, multiplexing mechanisms designed for file I/O, such as select(), poll(), and the epoll family of calls, can be applied identically to network sockets.
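This file-descriptor integration is easy to see in practice. The sketch below (Python, a minimal illustration) creates a connected socket pair, then exchanges data through the generic os.read()/os.write() calls rather than the socket-specific API, and waits for readability with select():

```python
import os
import select
import socket

# A connected pair of sockets: the same kind of fds that socket()/connect()
# would return, usable with generic file I/O.
a, b = socket.socketpair()

os.write(a.fileno(), b"hello")          # plain write(2) on a socket fd

# select() treats the socket fd like any other file descriptor.
readable, _, _ = select.select([b.fileno()], [], [], 1.0)
print(b.fileno() in readable)           # True: data is waiting

print(os.read(b.fileno(), 1024))        # plain read(2): b'hello'

a.close()
b.close()
```

The same fd could just as well be registered with an epoll instance; nothing in the multiplexing layer cares that the descriptor refers to a socket rather than a file.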

Inside the kernel, a socket is represented by two structures: struct socket and struct sock. struct socket handles the interface with the VFS, while struct sock manages the actual network protocol state. For TCP sockets, struct tcp_sock, an extension of struct sock, is used, containing all TCP-specific information such as sequence numbers, window sizes, and congestion control state.

The Packet Receive Path

Following the journey of a network packet from NIC to application provides a concrete understanding of how the kernel's networking implementation works.

When a packet arrives at the NIC, it is copied into a ring buffer in kernel memory via DMA (Direct Memory Access). The NIC then raises an interrupt to notify the CPU of the packet's arrival. In early Linux, every packet triggered an interrupt, but on high-speed networks where millions of packets can arrive per second, interrupt processing itself became the bottleneck.

NAPI (New API) was introduced to solve this problem. NAPI combines interrupts with polling. When the first packet arrives, an interrupt fires, but subsequent interrupts are disabled and the system switches to polling mode to batch-process packets from the ring buffer. When there are no more packets to process, it returns to interrupt mode. This adaptive approach provides the low latency of interrupts under light traffic and the high throughput of polling under heavy traffic.
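The NAPI state machine can be sketched as a toy simulation (Python, purely illustrative; `budget` mirrors the kernel's per-poll NAPI budget, but nothing here is kernel code). One interrupt fires per burst, the IRQ is masked, and the ring is drained by polling:

```python
from collections import deque

class ToyNapi:
    """Toy model of NAPI: one interrupt per burst, then batched polling."""

    def __init__(self, budget=64):
        self.ring = deque()        # stands in for the NIC's RX ring buffer
        self.budget = budget       # max packets per poll round (NAPI budget)
        self.irq_enabled = True
        self.interrupts = 0
        self.processed = 0

    def burst_arrives(self, pkts):
        # Packets are DMA'd into the ring. Only the first raises an
        # interrupt; the driver then masks the IRQ until the ring drains.
        for p in pkts:
            self.ring.append(p)
            if self.irq_enabled:
                self.interrupts += 1
                self.irq_enabled = False
        self._softirq()

    def _softirq(self):
        # Poll loop: process up to `budget` packets per round.
        while self.ring:
            for _ in range(min(self.budget, len(self.ring))):
                self.ring.popleft()
                self.processed += 1
        # Ring is empty: re-enable the interrupt (like napi_complete_done()).
        self.irq_enabled = True

napi = ToyNapi()
for _ in range(10):
    napi.burst_arrives(range(100))       # 10 bursts of 100 packets

print(napi.processed, napi.interrupts)   # 1000 packets, only 10 interrupts
```

Under light traffic every packet is its own "burst" and the model degenerates to one interrupt per packet, which is exactly the low-latency behavior described above.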

Once the NIC driver processes a packet, it is placed in an sk_buff structure and passed to higher layers. The Ethernet header is parsed at the link layer, the IP header is processed at the network layer, and a routing decision is made. Whether the packet is destined for the local host or needs to be forwarded is determined here. Local packets ascend to the transport layer for TCP or UDP processing, and are ultimately placed in the socket's receive queue, from which the recv() system call delivers them to user space.

sk_buff: The Packet Container

The sk_buff (socket buffer) is the most important data structure in the Linux networking stack. Every packet is represented as an sk_buff structure, containing not only the packet data but also metadata such as the receiving interface, protocol information, and timestamps.

A notable aspect of sk_buff's design is that no data is copied when each layer adds or removes headers. Instead, pointers are manipulated to change where the header begins. During transmission, each layer prepends its header to the data (using headroom), and during reception, each layer skips past its header to point at the next layer's payload. This zero-copy approach makes a critical contribution to networking performance.

sk_buff structure:
┌──────────┬──────────────┬────────┬─────────┬─────────┐
│ headroom │ Ethernet hdr │ IP hdr │ TCP hdr │ payload │
└──────────┴──────────────┴────────┴─────────┴─────────┘
     ↑                                   ↑
 parse direction (rx) →      ← prepend headers (tx)
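The pointer-adjustment idea can be imitated with a memoryview in Python (illustrative only; a real sk_buff tracks head/data/tail/end pointers into one buffer, and the header sizes below are the fixed minimums, made explicit for the example). "Removing" a header on receive is just moving an offset forward:

```python
# One contiguous buffer holding a received frame: 14-byte Ethernet header,
# 20-byte IP header, 20-byte TCP header, then the application payload.
frame = bytearray(b"E" * 14 + b"I" * 20 + b"T" * 20 + b"payload")
buf = memoryview(frame)         # slicing a memoryview copies nothing

data = 0                        # like skb->data: where the current layer starts

eth_hdr = buf[data:data + 14]   # link layer "pulls" the Ethernet header
data += 14
ip_hdr = buf[data:data + 20]    # network layer skips its header
data += 20
tcp_hdr = buf[data:data + 20]   # transport layer skips its header
data += 20
payload = buf[data:]            # what the socket-level code finally sees

# The headers were stripped by moving an offset, not by moving bytes.
print(bytes(payload))           # b'payload'
```

Transmission runs the same trick in reverse: the buffer is allocated with headroom, and each layer moves the data pointer backward to prepend its header in place.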

TCP/IP Implementation

The Linux TCP implementation is among the most complex code in the kernel. It goes well beyond simple reliable delivery to include congestion control, flow control, Selective Acknowledgment (SACK), timestamps, window scaling, and decades of accumulated optimizations and extensions.

TCP connection establishment uses a 3-way handshake. On the server side, calling listen() creates a SYN queue (half-open connection queue) and an Accept queue (completed connection queue). When a SYN packet arrives, it enters the SYN queue and a SYN+ACK is sent. When the client's ACK arrives, the connection is complete and moves to the Accept queue, from which the accept() system call retrieves it.

Do the sizes of these two queues matter? Very much so. If the SYN queue fills up, new SYN packets are dropped and connection attempts stall; this is precisely what SYN flood attacks exploit. Linux provides a defense mechanism called SYN cookies that allows legitimate connections to be established without consuming SYN queue entries at all. The accept queue size is bounded by the backlog parameter of listen(); if the application calls accept() too slowly and the queue overflows, completed connections are dropped or refused, so the backlog directly limits how large a connection burst the server can absorb.
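The division of labor between the kernel's queues and accept() is easy to demonstrate (Python sketch on loopback; port 0 asks the kernel for any free port). The handshakes complete before accept() is ever called; accept() merely dequeues finished connections:

```python
import socket

# A listener with a small accept-queue backlog. The kernel completes the
# 3-way handshake on its own; accept() only dequeues finished connections.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))     # port 0: let the kernel pick a free port
srv.listen(2)                  # backlog: bound on the accept queue
addr = srv.getsockname()

# Both clients connect *before* accept() is called: the handshakes
# complete in the kernel and the connections wait in the accept queue.
c1 = socket.create_connection(addr)
c2 = socket.create_connection(addr)
c1.sendall(b"ping")
c2.sendall(b"ping")

s1, _ = srv.accept()           # dequeue the first completed connection
s2, _ = srv.accept()           # and the second
got1, got2 = s1.recv(4), s2.recv(4)
print(got1, got2)              # b'ping' b'ping'

for s in (c1, c2, s1, s2, srv):
    s.close()
```

Note that both clients' data was already buffered by the kernel before the server accepted anything; a full accept queue, by contrast, would have caused further connection attempts to be dropped or refused.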

Congestion control is another core area of TCP. Linux uses CUBIC as its default congestion control algorithm and also supports modern algorithms like BBR. These algorithms dynamically adjust the sending rate based on packet loss and RTT (Round-Trip Time), attempting to prevent network congestion while making maximum use of available bandwidth.
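On Linux, the algorithm a socket is using can be inspected (and, with privileges, changed) through the TCP_CONGESTION socket option. A quick check, guarded because the option is Linux-specific:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
if hasattr(socket, "TCP_CONGESTION"):      # Linux-only socket option
    raw = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    print(raw.split(b"\x00")[0])           # e.g. b'cubic' on most systems
else:
    print("TCP_CONGESTION not available on this platform")
s.close()
```

Setting the option with setsockopt() (e.g. to b"bbr") works the same way, provided the corresponding congestion control module is loaded.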

Netfilter and Packet Filtering

Netfilter is the packet filtering framework in the Linux kernel. It provides five hook points along the path packets take through the networking stack, and rules can be registered at each hook to accept, drop, or modify packets.

Packet receive path:

  โ†’ PREROUTING โ†’ routing decision โ†’ INPUT โ†’ local process
                       โ†“
                    FORWARD โ†’ POSTROUTING โ†’ sent externally

Packet send path:

  local process โ†’ OUTPUT โ†’ routing decision โ†’ POSTROUTING โ†’ sent externally

iptables has long been used as the user-space interface to Netfilter. Rules are organized through combinations of tables (filter, nat, mangle) and chains (INPUT, OUTPUT, FORWARD, etc.). However, iptables has a structural limitation where performance degrades linearly as the number of rules grows, since every packet must traverse the rule list sequentially.

nftables was designed as the successor to iptables. The most significant difference is that rules are compiled into bytecode executed by a small in-kernel virtual machine. It supports set and map data structures enabling O(1) lookups, and it unifies IPv4, IPv6, ARP, and bridge filtering in a single framework. nftables has been in the kernel since 3.13, and most modern distributions now use it as the default firewall backend, often behind an iptables compatibility layer.
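As an illustration, here is a small hypothetical nftables ruleset (the table, set, and chain names are made up for the example). The three allowed ports are matched with a single set lookup via `@allowed_ports`, instead of one sequential rule per port as iptables would require:

```
table inet filter {
    set allowed_ports {
        type inet_service
        elements = { 22, 80, 443 }
    }
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        iif "lo" accept
        tcp dport @allowed_ports accept
    }
}
```

The `inet` family is also an nftables-only convenience: this one table filters both IPv4 and IPv6 traffic.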

Network Namespaces

Network namespaces are a mechanism for isolating the Linux networking stack. Each namespace has its own independent network interfaces, routing tables, iptables rules, and socket list. This makes it possible to operate multiple completely separated network environments on a single physical host.

Network namespaces are the foundation of container networking. When a Docker container runs, each container gets its own network namespace and operates in an environment isolated from the host's network. For a container to communicate with the outside world, a connection between namespaces is needed; this is where virtual network devices come in.

Virtual Network Devices

Linux provides a variety of virtual network devices implemented in software. They have the same interface as physical network devices but process packets inside the kernel instead of through hardware.

A veth (Virtual Ethernet) pair is always created as two ends โ€” a packet entering one end comes out the other. It serves as a virtual cable connecting network namespaces and is used in container networking to link containers to a host bridge.

A bridge is a virtual L2 switch. When multiple network interfaces are attached to a single bridge, Ethernet frames are forwarded between them. Docker's default networking mode uses exactly this. A docker0 bridge is created, and one end of each container's veth pair is attached to this bridge.

tun/tap devices provide a tunnel between user space and the kernel network stack. tun operates at L3 (IP) level, while tap operates at L2 (Ethernet) level. VPN software is a classic example of using these devices. OpenVPN sends and receives packets through an encrypted tunnel via tun/tap devices.
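Creating a tun device comes down to opening /dev/net/tun and issuing a TUNSETIFF ioctl. A minimal Python sketch (the constants are from <linux/if_tun.h>; the ioctl needs CAP_NET_ADMIN, so the code falls back to a message when run unprivileged):

```python
import fcntl
import os
import struct

# Constants from <linux/if_tun.h>.
TUNSETIFF = 0x400454CA
IFF_TUN = 0x0001        # L3 device (IFF_TAP = 0x0002 would be L2)
IFF_NO_PI = 0x1000      # no extra packet-info header on reads/writes

try:
    fd = os.open("/dev/net/tun", os.O_RDWR)
    # ifreq: 16-byte interface name pattern plus the flags field.
    ifr = struct.pack("16sH", b"tun%d", IFF_TUN | IFF_NO_PI)
    fcntl.ioctl(fd, TUNSETIFF, ifr)
    # From here on, os.read(fd, n) yields IP packets the kernel routed to
    # the device, and os.write(fd, pkt) injects packets into the stack.
    os.close(fd)
    status = "tun device created"
except OSError as exc:
    status = f"need CAP_NET_ADMIN (or /dev/net/tun missing): {exc}"

print(status)
```

This read/write loop over the tun fd is exactly where a VPN daemon sits: packets read from the device are encrypted and sent over the tunnel, and decrypted tunnel traffic is written back in.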

Container networking structure:

┌── Container A ──┐   ┌── Container B ──┐
│   eth0 (veth)   │   │   eth0 (veth)   │
└────────┬────────┘   └────────┬────────┘
         │                     │
    ┌────┴─────────────────────┴────┐
    │        docker0 (bridge)       │
    └───────────────┬───────────────┘
                    │ NAT
    ┌───────────────┴───────────────┐
    │      eth0 (physical NIC)      │
    └───────────────────────────────┘

By combining these virtual network devices, complex network topologies can be built entirely in software without physical network equipment. This is the foundation of SDN (Software-Defined Networking) and cloud networking.

Performance Optimizations in Linux Networking

Modern Linux includes a range of optimization techniques for high-performance networking. GRO (Generic Receive Offload) merges multiple small packets into a single large packet to reduce the number of times upper layers must process them. GSO (Generic Segmentation Offload) delays segmenting large packets as late as possible during transmission, reducing per-layer processing overhead. RSS (Receive Side Scaling) distributes received packets across multiple CPU cores for parallel processing.
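GRO's merging step can be sketched as a toy coalescer (Python, illustrative only; real GRO keys flows on the full 5-tuple and also checks TCP flags, IP fields, and size limits before merging):

```python
def gro_coalesce(segments):
    """Merge consecutive in-order segments of the same flow.

    Each segment is (flow_id, seq, payload). This toy only checks that a
    segment continues exactly where the previous one for that flow ended.
    """
    merged = []
    for flow, seq, data in segments:
        if merged:
            mflow, mseq, mdata = merged[-1]
            if flow == mflow and seq == mseq + len(mdata):
                merged[-1] = (mflow, mseq, mdata + data)  # coalesce
                continue
        merged.append((flow, seq, data))
    return merged

segs = [("flowA", 0, b"aaaa"), ("flowA", 4, b"bbbb"),
        ("flowA", 8, b"cccc"), ("flowB", 0, b"xxxx")]
out = gro_coalesce(segs)
print(len(out))        # 2: flowA collapsed into one segment, flowB alone
print(out[0][2])       # b'aaaabbbbcccc'
```

The payoff is that the TCP/IP layers above now run once per merged segment instead of once per wire packet; GSO is the mirror image of this on the transmit side.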

Going further, XDP (eXpress Data Path) allows eBPF programs to be executed at the NIC driver level before a packet even enters the networking stack. Because the full network stack is bypassed, extreme performance can be achieved for tasks like DDoS mitigation and load balancing. It is a technology that represents the direction of modern Linux networking: pursuing both flexibility and performance simultaneously.

In the next post, we'll look at containers and virtualization in Linux.