Linux Internals 08 - Synchronization and Concurrency
Why Concurrency Is a Problem
Modern systems have CPUs with multiple cores, and the kernel runs numerous threads simultaneously. In this environment, when two or more execution flows read and write the same data concurrently, unpredictable results occur. This is a race condition.
What makes race conditions dangerous is that they work correctly most of the time. The problem manifests once in thousands of runs, or only under specific timing conditions, making it extremely difficult to reproduce. Even a simple counter increment is not safe. The code count++ is a single line in a high-level language, but at the CPU level it consists of three steps: reading the value from memory, adding 1, and writing it back. If two threads execute these three steps simultaneously, one increment can be lost.
```
Thread A: read(count=5) → increment(6) → write(count=6)
Thread B: read(count=5) → increment(6) → write(count=6)
Result: count=6 (expected 7)
```
To solve this problem, the code region that accesses shared resources, the critical section, must be protected. Ensuring that only one execution flow can enter the critical section at a time is the essence of synchronization.
Atomic Operations
The most basic synchronization primitive is the atomic operation. Because the CPU executes an atomic operation as a single instruction, no other execution flow can intervene midway.
The Linux kernel provides the atomic_t type and associated functions. Functions like atomic_inc(), atomic_dec(), and atomic_add() translate to single CPU instructions, making them safe without separate locking. On the x86 architecture these use the LOCK prefix, which historically locked the memory bus to block access from other cores; modern CPUs achieve the same exclusivity more cheaply by locking the affected cache line.
However, atomic operations alone cannot solve complex synchronization problems. When two or more variables must be updated simultaneously, or when an action must be performed after checking a condition, more sophisticated mechanisms are needed.
Spinlocks
The spinlock is the most widely used locking mechanism in the kernel. A thread that fails to acquire a spinlock repeatedly checks until the lock is released, literally spinning in place.
Is busy waiting actually efficient? From a user-space perspective, it looks like a waste of CPU time. But in the kernel, the situation is different. When the critical section is very short, the cost of a context switch far exceeds the cost of briefly spinning. Putting a thread to sleep and waking it takes thousands of cycles, whereas a spinlock's critical section typically completes in tens to hundreds of cycles.
There is a rule that must be followed when using spinlocks: never sleep while holding one. If scheduling occurs while a spinlock is held, a thread on another core waiting for the same spinlock will spin forever. Additionally, when a spinlock is needed in an interrupt handler, spin_lock_irqsave() must be used to disable interrupts. Otherwise, an interrupt could fire while the spinlock is held, and the interrupt handler could request the same spinlock, resulting in deadlock.
Mutexes and Semaphores
If spinlocks are suited for short critical sections, mutexes are suited for longer ones. A thread that fails to acquire a mutex enters a wait queue and sleeps. When the lock is released, the waiting thread is woken up to resume execution. This approach does not waste CPU time, but it incurs context switch overhead.
The difference between a mutex and a semaphore lies in the number of concurrent accesses they allow. A mutex permits only one thread to enter the critical section at a time. A semaphore uses a counter to allow up to N threads to access simultaneously. A semaphore with a counter of 1 (a binary semaphore) behaves similarly to a mutex, but mutexes have an important concept called ownership. Only the thread that locked a mutex can unlock it, whereas a semaphore can be released by a different thread. Because a mutex has a well-defined owner, the kernel can apply techniques such as priority inheritance (as in rt_mutex) to mitigate the priority inversion problem.
In kernel development, using mutexes over semaphores is recommended unless there is a specific reason otherwise. Mutexes enforce stricter usage rules, making debugging easier and leaving more room for performance optimization.
Read-Write Locks
In many cases, read access to shared data vastly outnumbers write access. Since threads that only read do not conflict with each other, it is safe to allow them concurrent access. From this observation, the read-write lock was born.
A read-write lock distinguishes between read locks and write locks. Multiple threads can acquire the read lock simultaneously, but the write lock is exclusive. Acquiring a write lock requires waiting until all read locks are released. Conversely, while a write lock is active, reads are also blocked.
This approach has a pitfall. When reads are extremely frequent, writer threads can starve. If new read requests arrive ceaselessly, there is never an opportunity to acquire the write lock. The Linux kernel's rwlock_t does not fully solve this problem, and this is one reason why more sophisticated mechanisms like RCU are needed.
Deadlock
The most feared situation when using synchronization mechanisms is deadlock. A deadlock occurs when two or more execution flows each hold a resource the other needs, and none can make progress.
```
Thread A: lock(X) acquired → waiting for lock(Y)...
Thread B: lock(Y) acquired → waiting for lock(X)...
→ Both wait forever
```
Four conditions must be met simultaneously for a deadlock to occur: mutual exclusion (a resource can only be used by one at a time), hold and wait (holding a resource while waiting for another), no preemption (a resource cannot be forcibly taken from another thread), and circular wait (threads form a cycle of resource dependencies). Breaking any one of these four conditions prevents deadlock.
The most practical prevention method is fixing the lock ordering. By assigning an order to all locks in the system and enforcing that lower-numbered locks are always acquired first, circular waits become impossible. The Linux kernel includes a built-in lock dependency validator called lockdep. It tracks lock acquisition order at runtime and prints warnings when it detects potential deadlock patterns. Because it can discover problems before an actual deadlock occurs, it is an invaluable tool in kernel development.
Futex: Fast Locking in User Space
User-space programs need synchronization too; pthread_mutex is a prime example. Early user-space locking implementations performed poorly because they always had to make system calls: entering the kernel was required whether the lock was contended or not.
Futex (Fast Userspace Mutex) solves this problem. The key idea is that in the uncontended case (the common case), locking is handled entirely in user space using atomic operations with no system call. Only when contention occurs is the kernel involved.
A futex consists of two components: an integer value in shared memory and a wait queue in the kernel. When acquiring a lock, a compare-and-swap atomic operation is performed in user space. If it succeeds, the lock is acquired and no system call is needed. If it fails, the FUTEX_WAIT system call is invoked to sleep on the kernel's wait queue. When releasing the lock, the FUTEX_WAKE system call is invoked only if there are waiters.
This design is effective because the rate of lock contention in real systems is low. Most lock acquisitions succeed without contention, and the overhead of a system call is completely avoided in those cases. The glibc pthread mutex implementation uses futex internally.
RCU: Read Optimization Taken to the Extreme
RCU (Read-Copy-Update) is one of the most sophisticated synchronization mechanisms in the Linux kernel. It was designed to address a fundamental limitation of read-write locks: even read locks cause cache line contention.
The core principle of RCU is that the read side uses no locking at all. Readers simply access the protected data between rcu_read_lock() and rcu_read_unlock(). These functions essentially only disable preemption and touch no shared cache lines, so they scale almost perfectly in multicore environments.
How does the write side work? The writer copies the existing data, modifies the copy, then atomically swaps the pointer. The old data is not freed immediately because there may still be threads reading it. Calling synchronize_rcu() waits until every reader that was active at the time of the swap has exited its read-side critical section. This waiting interval is called the grace period, and once it has elapsed, the old data can be safely freed.
```
Writer:
1. old = rcu_dereference(ptr)
2. new = copy(old)
3. modify(new)
4. rcu_assign_pointer(ptr, new)
5. synchronize_rcu() → wait for grace period
6. free(old)

Reader:
rcu_read_lock()
p = rcu_dereference(ptr) → always valid data, whether old or new
read(p)
rcu_read_unlock()
```
RCU is used extensively in kernel data structures where reads vastly dominate, such as routing tables, process lists, and module lists. Because there is zero cache line contention on the read path, performance does not degrade as the number of cores increases.
Lock-Free Data Structures
It is possible to implement thread-safe data structures without using locks at all. Lock-free data structures operate on atomic operations, particularly CAS (Compare-And-Swap).
The CAS operation checks whether the current value at a memory location matches an expected value; if it does, the value is replaced with a new one, and if not, the operation fails and is retried. Using this, data structures like stacks, queues, and linked lists can be implemented without locks.
Is lock-free always faster than locking? No. In low-contention environments, lock-based implementations can be simpler and more efficient. The true advantage of lock-free lies in progress guarantees. In lock-based implementations, if the thread holding the lock is preempted, all other threads must wait. In lock-free implementations, at least one thread can always make progress. The Linux kernel uses lock-free data structures such as lock-free linked lists (llist) in certain paths, and they are particularly useful in contexts like interrupt handlers where locking is difficult to use.
Choosing a Synchronization Mechanism
The reason the Linux kernel provides so many different synchronization mechanisms is that no single tool can handle every situation. Each mechanism has different tradeoffs, and choosing the right one for the situation has a decisive impact on both performance and correctness.
| Mechanism | Waiting Strategy | Best Suited For |
|---|---|---|
| Atomic Operations | No waiting | Simple updates to a single variable |
| Spinlock | Busy waiting | Short critical sections, interrupt context |
| Mutex | Sleep | Long critical sections, process context |
| Read-Write Lock | Busy wait / Sleep | Frequent reads, infrequent writes |
| RCU | No waiting (read side) | Overwhelmingly read-heavy, very rare writes |
Concurrency is one of the most challenging areas of kernel development, and correct synchronization design determines system stability and performance. The mechanisms the kernel provides are the result of decades of experience and countless bug fixes.
In the next post, we'll look at the Linux networking stack.