Linux Internals 03 - Process Scheduling
Why Scheduling Matters
Hundreds of processes run simultaneously on a modern system, yet the number of CPU cores is limited, and a single core can execute only one process at a time. The scheduler is the kernel component that decides which process gets to use the CPU and when.
If scheduling were simply a matter of ordering, it would not be particularly difficult. In reality, however, the problems a scheduler must solve are multifaceted. Interactive processes need to respond to user input quickly, batch jobs want maximum throughput, and real-time processes must execute within strict time constraints. Satisfying these conflicting demands simultaneously is the core challenge of scheduling.
Preemptive vs Cooperative
Scheduling approaches fall into two broad categories. In cooperative (non-preemptive) scheduling, other processes must wait until the running process voluntarily yields the CPU. If a single process refuses to yield, every other process waits indefinitely and the entire system appears frozen. Early versions of Windows and Mac OS used this approach, which is why a single hung program often required rebooting the entire system.
In preemptive scheduling, the kernel can forcibly interrupt a running process and hand the CPU to another. Linux uses preemptive scheduling. Each process is assigned a time slice of CPU usage, and when that time is exhausted, the kernel intervenes and runs a different process. This ensures that even if one process enters an infinite loop, the rest of the system continues to function normally.
History of the O(1) Scheduler
Linux scheduling has evolved through several iterations. The early Linux scheduler was simple in structure but suffered performance degradation as the number of processes grew. In 2003, Ingo Molnar introduced the O(1) scheduler in Linux 2.6. As its name implies, this scheduler could select the next process to run in constant time regardless of how many processes existed.
The O(1) scheduler used two priority arrays called the active queue and the expired queue. Each array had 140 priority levels, and a bitmap allowed the process with the highest priority to be found instantly. When all processes in the active queue exhausted their time slices, the scheduler simply swapped the pointers of the two queues. This design worked well in server environments but had responsiveness problems with interactive processes on the desktop. The heuristics for identifying interactive processes were complex and difficult to predict.
CFS: The Completely Fair Scheduler
In 2007, Linux 2.6.23 introduced the Completely Fair Scheduler (CFS), created by Ingo Molnar and influenced by ideas from Con Kolivas. The core philosophy of CFS is to simulate an ideal multitasking CPU. An ideal CPU would provide each of n processes with exactly 1/n of CPU time simultaneously. This is impossible in reality, but CFS strives to distribute CPU time as close to this ideal as possible.
The central concept in CFS is vruntime (virtual runtime). Each process maintains a vruntime value that tracks how much CPU time it has actually consumed. CFS always selects the process with the smallest vruntime as the next one to run. A process that has used a lot of CPU time has a large vruntime and gets pushed to the back, while a process that has used little CPU time has a small vruntime and gets priority.
Sorted by vruntime, smallest first (red-black tree)

                 [vruntime=5]
                /            \
     [vruntime=3]            [vruntime=8]
        /
  [vruntime=1]  ← next process to run
To manage runnable processes efficiently, CFS keeps them in a red-black tree, a self-balancing binary search tree in which insertion and deletion run in O(log n) time. Nodes are ordered by vruntime, so the leftmost node is always the process with the smallest vruntime; CFS caches a pointer to that node, making selection of the next process to execute a constant-time operation.
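The pick-the-smallest-vruntime logic can be sketched in a few lines. This is a toy model, not kernel code: a min-heap stands in for the red-black tree (both hand back the smallest key efficiently), and the task names and vruntime values are invented for illustration.

```python
import heapq

class MiniCFS:
    """Toy run queue: always pick the task with the smallest vruntime.
    A min-heap stands in for the kernel's red-black tree."""

    def __init__(self):
        self._queue = []  # (vruntime, name) pairs

    def enqueue(self, name, vruntime):
        heapq.heappush(self._queue, (vruntime, name))

    def pick_next(self):
        # Equivalent to taking the leftmost node of the rbtree.
        vruntime, name = heapq.heappop(self._queue)
        return name, vruntime

rq = MiniCFS()
rq.enqueue("editor", 5)
rq.enqueue("compiler", 3)
rq.enqueue("browser", 8)
rq.enqueue("shell", 1)

name, vr = rq.pick_next()
print(name)  # prints: shell (the task that has consumed the least CPU time)
```

After the picked task runs for a while, the kernel adds the elapsed time to its vruntime and re-inserts it, which naturally pushes heavy CPU users toward the back.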
Nice Values and Priorities
Is it always fair to give every process the same amount of CPU time? Not necessarily. It may not be desirable for a critical service process and a background log cleanup task to receive equal shares of CPU. Nice values address this problem.
Nice values range from -20 to 19. A lower value means higher priority, and a higher value means lower priority. The origin of the name is telling: a process with a high nice value is being "nice" to other processes by yielding to them.
In CFS, the nice value affects the rate at which vruntime increases. A process with a low nice value (high priority) has its vruntime increase more slowly than its actual execution time, so it receives more CPU time. A process with a high nice value (low priority) has its vruntime increase faster, so it receives less. The advantage of this approach is that even low-priority processes never get completely starved of CPU time. Starvation is structurally prevented.
# Check the current process's nice value
nice
# Run a process with nice value 10
nice -n 10 ./background_job
# Change the nice value of a running process (a negative value, i.e. raising priority, requires root)
renice -n -5 -p 1234
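The vruntime scaling described above can be sketched numerically. This is a simplification: the kernel uses a precomputed integer weight table (`sched_prio_to_weight`) rather than floating-point math, but each nice step changes the weight by roughly a factor of 1.25, which the sketch below approximates directly.

```python
NICE_0_WEIGHT = 1024  # the kernel's weight for nice 0

def weight(nice):
    # Approximation of the kernel's sched_prio_to_weight[] table:
    # each nice step scales the weight by about 1.25x.
    return NICE_0_WEIGHT / (1.25 ** nice)

def vruntime_delta(exec_time_ns, nice):
    # vruntime advances more slowly for heavier (higher-priority) tasks:
    # delta_vruntime = delta_exec * NICE_0_WEIGHT / weight
    return exec_time_ns * NICE_0_WEIGHT / weight(nice)

# 10 ms of real CPU time advances vruntime by a different amount per nice value:
for nice in (-5, 0, 5):
    print(nice, round(vruntime_delta(10_000_000, nice)))
# -5 -> 3276800, 0 -> 10000000, 5 -> 30517578 (nanoseconds of vruntime)
```

Because every task's vruntime keeps advancing, even a nice 19 process eventually has the smallest vruntime in the tree and gets scheduled, which is why starvation cannot occur.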
Real-Time Scheduling
CFS is suitable for ordinary processes, but some processes must execute within a guaranteed time window. Audio processing, industrial control systems, and telecommunications equipment are examples. Linux provides real-time scheduling policies for these requirements.
SCHED_FIFO is a first-in-first-out real-time scheduling policy. A SCHED_FIFO process always runs before any normal process and continues to execute until it voluntarily yields the CPU or a higher-priority SCHED_FIFO process appears. There is no time slice concept, so there is no limit on execution time.
SCHED_RR is identical to SCHED_FIFO except that processes at the same priority level share CPU time in a round-robin fashion with time slices. This prevents a single process from monopolizing the CPU when multiple processes share the same priority.
Real-time process priorities range from 1 to 99, and they are always higher than normal process (CFS) priorities. This means that while a real-time process is ready to run, normal processes receive no CPU time at all. If a poorly designed real-time process enters an infinite loop, the entire system can become unresponsive, which is why real-time scheduling must be used carefully.
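On Linux, Python's `os` module exposes these policies, which makes the priority range easy to inspect. A small Linux-only sketch; actually switching a process to SCHED_FIFO requires root (CAP_SYS_NICE), so that call is shown but left commented out:

```python
import os

# Query the priority range for the real-time FIFO policy (1..99 on Linux).
fifo_min = os.sched_get_priority_min(os.SCHED_FIFO)
fifo_max = os.sched_get_priority_max(os.SCHED_FIFO)
print(f"SCHED_FIFO priorities: {fifo_min}..{fifo_max}")

# Switching the calling process to SCHED_FIFO at priority 50 would be:
# os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(50))
```

The equivalent from the command line is `chrt`, e.g. `chrt -f 50 ./rt_task`.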
The Cost of Context Switching
When the scheduler replaces the running process, a context switch occurs. During this operation, the CPU register state of the current process is saved, the register state of the next process is restored, and the memory map (page tables) is switched.
Saving and restoring registers is fast on its own, but the real cost lies in indirect effects. When page tables are switched, the TLB (Translation Lookaside Buffer) is invalidated. Since the new process's data is not in the CPU cache, cache misses become frequent. These indirect costs are far greater than the direct costs, and they grow as the memory footprint of the processes increases.
Switching between threads within the same process is lighter than switching between processes. Because threads share the same address space, no page table switch is needed and the TLB can be preserved. This is one of the performance advantages of multithreaded programming.
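Context switches are directly observable: on Linux, `/proc/<pid>/status` reports how often a process was switched out voluntarily (it blocked on I/O or sleep) versus involuntarily (the scheduler preempted it). A short sketch reading the calling process's own counters:

```python
# Parse the context-switch counters from /proc/self/status (Linux-only).
# voluntary_ctxt_switches:    the process blocked and yielded the CPU
# nonvoluntary_ctxt_switches: the scheduler preempted the process
switches = {}
with open("/proc/self/status") as f:
    for line in f:
        if "ctxt_switches" in line:
            key, value = line.split(":")
            switches[key] = int(value)
print(switches)
```

A process with a high nonvoluntary count is being preempted often, a hint that it is CPU-bound and competing with other runnable tasks.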
CPU Affinity
By default, the scheduler can run a process on any CPU core. But when a process migrates from one core to another, the data cached on the previous core must be reloaded on the new one, causing a performance penalty.
CPU affinity is a mechanism that restricts which CPU cores a process can run on. Pinning a process to a specific core improves cache efficiency, and on NUMA architectures, it can ensure local memory access.
# Restrict process 1234 to run only on CPUs 0 and 1
taskset -p -c 0,1 1234
# Start a program pinned to CPU 2
taskset -c 2 ./my_program
The Linux scheduler itself is aware of this issue and applies soft affinity by default, preferring to run a process on the core where it last executed. What taskset configures is hard affinity: a stronger constraint that forces the process to run exclusively on the specified cores.
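The same hard affinity that taskset sets can be managed programmatically. A Linux-only sketch using `os.sched_getaffinity` and `os.sched_setaffinity`; it pins to a CPU already in the allowed set, so no special privileges are needed:

```python
import os

# 0 means "the calling process", as with taskset -p.
original = os.sched_getaffinity(0)
print("allowed CPUs:", sorted(original))

# Pin to a single CPU that is already in the allowed set,
# then restore the original mask.
os.sched_setaffinity(0, {min(original)})
print("pinned to:", os.sched_getaffinity(0))
os.sched_setaffinity(0, original)
```

Pinning is a double-edged sword: it improves cache locality, but it also forbids the scheduler from using idle cores outside the mask, so measure before committing to it.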
In the next post, we'll look at memory management.