Why Pipelining

In the previous posts, we examined how the CPU fetches, decodes, and executes instructions. Under this approach, however, the next instruction cannot begin until the current one has fully completed. If each stage takes one clock cycle, the three-stage fetch-decode-execute sequence costs three cycles per instruction. In every cycle, only one of the ALU, the memory access unit, and the decoder is working while the other two sit idle.

Pipelining is the idea that resolves this inefficiency. A laundry analogy makes it easy to grasp. If you run the washer, wait until it finishes, then run the dryer, wait until it finishes, then fold, processing three loads sequentially takes nine time units. But if you put the second load into the washer the moment the first moves to the dryer, three loads can be finished in five time units. Each piece of equipment stays continuously busy.

CPU pipelining works on the same principle. The moment the first instruction advances to the decode stage, the fetch unit is already retrieving the second instruction. With each stage operating independently, the ideal throughput of one instruction completing every clock cycle becomes achievable.

The 5-Stage Pipeline

The textbook RISC pipeline consists of five stages.

5-Stage Pipeline

Stage   Cycle1  Cycle2  Cycle3  Cycle4  Cycle5  Cycle6  Cycle7
------------------------------------------------------------
IF      I1      I2      I3      I4      I5
ID              I1      I2      I3      I4      I5
EX                      I1      I2      I3      I4      I5
MEM                             I1      I2      I3      I4
WB                                      I1      I2      I3

IF  = Instruction Fetch
ID  = Instruction Decode / Register Read
EX  = Execute (ALU operation)
MEM = Memory Access
WB  = Write Back (write result to register)

When the pipeline is fully filled, one instruction completes every cycle. A five-stage pipeline does not mean each instruction finishes in one cycle. Each instruction still requires five cycles, but since five instructions are simultaneously being processed at different stages, the throughput reaches one per cycle.
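The fill-and-drain behavior above reduces to a simple formula. A minimal sketch (the function name is illustrative, not from any real tool):

```python
def pipeline_cycles(num_instructions: int, num_stages: int = 5) -> int:
    """Total cycles for an ideal, hazard-free pipeline: the first
    instruction takes num_stages cycles to reach WB, and every
    instruction after it completes one cycle later."""
    return num_stages + (num_instructions - 1)

# Matches the table above: with 5 stages, I3 finishes WB in cycle 7.
print(pipeline_cycles(3))     # 7
print(pipeline_cycles(1000))  # 1004
```

As the instruction count grows, cycles per instruction approach 1, which is exactly the throughput claim in the paragraph above.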

Increasing the number of pipeline stages reduces the work done per stage, shortening per-stage latency and allowing higher clock speeds. This is why the Pentium 4's NetBurst architecture pushed pipeline depth to 20 stages, and later to 31 in the Prescott revision. However, too many stages increase the penalty from the hazards described below, presenting a clear trade-off.
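The trade-off can be put in rough numbers. A back-of-the-envelope model, with the branch fraction and misprediction rate as assumed illustrative values and the flush penalty approximated by the pipeline depth:

```python
def effective_cpi(depth: int, branch_fraction: float = 0.2,
                  mispredict_rate: float = 0.05) -> float:
    """Cycles per instruction once misprediction flushes are counted:
    the ideal 1.0 plus the expected flush cost per instruction."""
    return 1.0 + branch_fraction * mispredict_rate * depth

print(round(effective_cpi(5), 2))   # 1.05
print(round(effective_cpi(31), 2))  # 1.31
```

The deeper pipeline can still win if it clocks enough higher to outweigh the extra flush cost, which is exactly the bet NetBurst made.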

Hazards

Pipelines operate ideally when all instructions are independent. In real programs, however, various dependencies exist between instructions, and these are called hazards. When a hazard occurs, the pipeline must stall or require special handling.

Data hazards occur when a subsequent instruction needs the result of a preceding instruction.

Data Hazard Example

add  r1, r2, r3    IF  ID  EX  MEM  WB
sub  r4, r1, r5        IF  ID  EX   MEM  WB
                             ^
                        r1 has not yet passed through WB,
                        so sub's ID stage cannot read the correct r1 value

The key technique for solving this is forwarding (also called bypassing). Instead of waiting for the WB stage, the result produced in the EX stage is passed directly to the EX input of the next instruction. By adding dedicated data paths in the hardware, data dependencies can be resolved without stalling the pipeline.

Resolved with Forwarding

add  r1, r2, r3    IF  ID  EX  MEM  WB
sub  r4, r1, r5        IF  ID  EX   MEM  WB
                             ^----+
                        EX result forwarded directly

However, when a load instruction is immediately followed by an instruction that uses its result, even forwarding cannot help. The load's result is not available until after the MEM stage, making at least a one-cycle delay (load-use hazard) unavoidable. Compilers perform instruction scheduling to avoid this delay by inserting independent instructions between the load and its use.
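The scheduling idea can be sketched with a toy dependency check. The three-tuple instruction encoding here is an assumption for illustration, not a real ISA:

```python
# Toy model: each instruction is (op, dest_register, source_registers).
# A one-cycle stall occurs whenever an instruction reads a register
# that the immediately preceding load wrote (the load-use hazard).
def count_load_use_stalls(program):
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev[0] == "load" and prev[1] in curr[2]:
            stalls += 1
    return stalls

unscheduled = [
    ("load", "r1", ["r2"]),
    ("add",  "r3", ["r1", "r4"]),   # uses r1 right after the load -> stall
    ("mul",  "r5", ["r6", "r7"]),
]
scheduled = [
    ("load", "r1", ["r2"]),
    ("mul",  "r5", ["r6", "r7"]),   # independent work fills the slot
    ("add",  "r3", ["r1", "r4"]),   # r1 is now available via forwarding
]
print(count_load_use_stalls(unscheduled))  # 1
print(count_load_use_stalls(scheduled))    # 0
```

Moving the independent mul between the load and its use is precisely the compiler's instruction-scheduling transformation.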

Structural hazards occur when two instructions simultaneously need the same hardware resource. For example, if instruction fetch and data memory access occur at the same time but there is only one memory port, one must wait. To resolve this, modern processors duplicate hardware by separating instruction and data caches and providing independent read/write ports on the register file.

Control hazards arise when a branch instruction makes the next instruction to execute uncertain. The outcome of a conditional branch is determined only at the EX stage, but the pipeline has already been fetching subsequent instructions in the meantime. If the branch is taken, all instructions fetched along the wrong path must be flushed, wasting roughly as many cycles as there are stages before the branch resolves. This is why branch cost increases with deeper pipelines.

Branch Prediction

Because the cost of control hazards is significant, modern processors use branch prediction to guess the branch outcome in advance and speculatively execute instructions along the predicted path. If the prediction is correct, the pipeline flows without interruption. If incorrect, the speculatively executed instructions are discarded and execution restarts from the correct path.

The simplest static branch prediction always predicts a fixed direction. For example, the rule "backward branches are taken, forward branches are not taken" is quite effective for loop branches, since the branch at the end of a loop that jumps back to the beginning is almost always taken.

Dynamic branch prediction bases its predictions on past branch history. The most basic structure is the Branch History Table (BHT).

2-bit Saturating Counter (per branch instruction)

           not taken        not taken        not taken
  +----+ ----------> +----+ ----------> +----+ ----------> +----+
  | ST |             | WT |             | WN |             | SN |
  | 11 | <---------- | 10 | <---------- | 01 | <---------- | 00 |
  +----+    taken    +----+    taken    +----+    taken    +----+

  (a taken branch in ST, or a not-taken branch in SN, leaves the state unchanged)

  S = Strongly, W = Weakly
  T = Taken,    N = Not Taken

Prediction: taken if upper bit is 1, not taken if 0

The reason for using a 2-bit counter lies in the limitation of a 1-bit counter. With a 1-bit counter, the single not-taken at a loop's exit flips the prediction, so the branch is mispredicted twice per loop execution: once at the exit itself and once more on the first iteration of the next entry. A 2-bit counter requires two consecutive deviations before its prediction changes, so the loop exit costs only a single misprediction.
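A minimal sketch of the counter's behavior on a loop branch (class and variable names are illustrative):

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0=SN, 1=WN, 2=WT, 3=ST.
    Predicts taken when the counter is 2 or above (upper bit set)."""
    def __init__(self, state: int = 3):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Saturating: increment on taken, decrement on not taken.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch: taken 9 times, then not taken once at the exit.
p = TwoBitPredictor()
mispredicts = 0
for _ in range(3):  # run the loop three times
    for taken in [True] * 9 + [False]:
        if p.predict() != taken:
            mispredicts += 1
        p.update(taken)
print(mispredicts)  # 3 -> only the single exit of each run is mispredicted
```

A 1-bit predictor on the same pattern would additionally miss on the first iteration after every exit.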

A more sophisticated predictor is the correlating predictor, which considers the outcomes of other recent branches in addition to the current one. For example, if if (x > 0) is followed by if (x > 5), and the first branch is not taken, the second is very likely also not taken. Capturing such correlations improves prediction accuracy.

The Branch Target Buffer (BTB) is a structure used alongside branch prediction that caches the target addresses of branch instructions. When a branch is predicted as taken, the target address can be retrieved immediately from the BTB to begin the next fetch. Without a BTB, additional cycles would be needed to compute the branch target.

Modern processor branch prediction accuracy exceeds 95%. However, the impact of the remaining 5% of mispredictions on performance grows with pipeline depth. A single misprediction in a 20-stage pipeline means a loss of 20 cycles. This is why programmers should write branch patterns that are predictable, and why conditional operations on sorted data run faster than on unsorted data.
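The sorted-data effect can be reproduced with the same 2-bit counter idea. This is a simplification with a single simulated counter for a single branch; the threshold and data are illustrative:

```python
import random

def mispredicts(data, threshold=128):
    """Count mispredictions of the branch `x >= threshold` using one
    2-bit saturating counter (0..3, predict taken when >= 2)."""
    state = 3  # start strongly taken
    misses = 0
    for x in data:
        taken = x >= threshold
        if (state >= 2) != taken:
            misses += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses

random.seed(0)
data = [random.randrange(256) for _ in range(10_000)]
print(mispredicts(sorted(data)), "misses when sorted")
print(mispredicts(data), "misses when unsorted")
```

On sorted data the branch outcome changes only once, near the threshold, so the counter mispredicts only a handful of times; on random data it misses on the order of half the iterations.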

Superscalar Execution

Pipelining aims to complete one instruction per cycle. Is one per cycle the limit? It is not. By placing multiple pipelines in parallel, multiple instructions can be simultaneously fetched, decoded, and executed every cycle. This is superscalar execution.

2-way Superscalar Pipeline

Cycle     1      2      3      4      5      6
----------------------------------------------
Pipe0   IF_I1  ID_I1  EX_I1  MEM_I1 WB_I1
Pipe1   IF_I2  ID_I2  EX_I2  MEM_I2 WB_I2
Pipe0          IF_I3  ID_I3  EX_I3  MEM_I3 WB_I3
Pipe1          IF_I4  ID_I4  EX_I4  MEM_I4 WB_I4

-> 2 instructions processed simultaneously each cycle

Most modern high-performance processors adopt 4-way to 8-way superscalar designs. However, increasing the number of instructions processed per cycle requires finding enough independent instructions, and the hardware complexity for this grows sharply. Dependency checking, resource allocation, and result commit logic all need to scale.
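The dependency checking mentioned above can be illustrated with a minimal 2-way issue rule. This is a deliberate simplification that only checks read-after-write and write-after-write conflicts between one pair of instructions:

```python
# Toy model: an instruction is (op, dest_register, source_registers).
def can_dual_issue(first, second) -> bool:
    """Two adjacent instructions may issue in the same cycle only if
    the second neither reads nor writes the first one's destination."""
    dest = first[1]
    return dest != second[1] and dest not in second[2]

i1 = ("add", "r1", ["r2", "r3"])
i2 = ("sub", "r4", ["r1", "r5"])   # reads r1 -> must wait a cycle
i3 = ("mul", "r6", ["r7", "r8"])   # independent of i1
print(can_dual_issue(i1, i2))  # False
print(can_dual_issue(i1, i3))  # True
```

A real 4-way or 8-way machine must run this kind of check across every pair in the issue group, which is one reason the hardware cost grows so sharply with issue width.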

Out-of-Order Execution

In in-order execution, when an earlier instruction stalls due to a hazard, all subsequent instructions stall with it. Even if a later instruction is completely unrelated to the one ahead of it, it cannot execute.

Out-of-order execution (OoO) breaks this constraint. Instructions with no data dependencies can execute ahead of program order.

Program order:
  I1: load  r1, [r2]       <- cache miss, wait hundreds of cycles
  I2: add   r3, r1, r4     <- depends on r1, must wait for I1
  I3: mul   r5, r6, r7     <- unrelated to I1 and I2
  I4: sub   r8, r5, r9     <- depends on r5, must wait for I3

Out-of-order execution:
  Issue I1 -> waiting on cache miss
  Execute I3 first (unrelated to I1)
  Execute I4 after I3 (I3 result available)
  Execute I2 after I1 completes

The core structures of out-of-order execution are reservation stations and the reorder buffer (ROB). Reservation stations originate from Tomasulo's algorithm, developed for the IBM System/360 Model 91: they hold instructions until their operands are ready, dispatching them to execution units as soon as operands become available. The ROB, a later addition to the scheme, commits instruction results in program order, preserving program semantics despite out-of-order execution.

Why does out-of-order execution matter? A cache miss can make a single memory access take hundreds of cycles. An in-order CPU stalls completely during this time, but an out-of-order CPU keeps processing independent instructions in the meantime. This is why modern high-performance processors are essentially unviable without out-of-order execution.
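A toy timing model makes the benefit concrete for the four-instruction example above. It assumes unit latency for everything except the load, which gets a 10-cycle miss penalty, plus unlimited execution units; the tuple encoding is illustrative:

```python
# Each instruction: (name, dest_register, source_registers, latency)
program = [
    ("I1_load", "r1", ["r2"], 10),        # modeled cache miss
    ("I2_add",  "r3", ["r1", "r4"], 1),   # depends on the load
    ("I3_mul",  "r5", ["r6", "r7"], 1),   # independent
    ("I4_sub",  "r8", ["r5", "r9"], 1),   # depends on I3
]

def finish_cycle(program, out_of_order: bool) -> int:
    """Cycle at which the last instruction finishes. In-order: an
    instruction cannot start before the one ahead of it has started,
    so a stalled instruction delays everything behind it. Out-of-order:
    only operand readiness matters."""
    ready = {}        # register -> cycle its value becomes available
    prev_start = -1
    last = 0
    for name, dest, srcs, lat in program:
        start = max(ready.get(r, 0) for r in srcs)
        if not out_of_order:
            start = max(start, prev_start + 1)
            prev_start = start
        finish = start + lat
        ready[dest] = finish
        last = max(last, finish)
    return last

print(finish_cycle(program, out_of_order=False))  # 13
print(finish_cycle(program, out_of_order=True))   # 11
```

Even in this tiny model the independent mul/sub pair executes under the shadow of the miss; with a realistic hundreds-of-cycles miss the gap becomes dramatic.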

Instruction-Level Parallelism

Pipelining, superscalar execution, and out-of-order execution are all techniques for exploiting Instruction-Level Parallelism (ILP). ILP is a measure of how many instructions in a program can execute simultaneously.

Theoretically, programs with arbitrarily high ILP can exist, but in practice, exploitable ILP is limited by data dependencies, control dependencies, and resource constraints. For typical integer programs, the ILP that hardware can extract is on the order of two to three instructions per cycle.

Recognizing the limits of ILP is the fundamental reason processor design shifted toward multicore architectures. Extracting more ILP from a single core causes hardware complexity to grow exponentially, whereas deploying multiple cores within the same transistor budget enables exploitation of thread-level parallelism. Of course, this introduces the new challenge that software must be parallelized to harness multicore performance, but that is a topic for a later post.

In the next post, we'll look at privilege levels and protection rings.