Opening Up the CPU

In the previous post, we saw that the CPU repeats the fetch-decode-execute cycle. But what actually happens during each of these stages? What components exist inside the CPU, and how do they cooperate to process a single instruction?

The internals of a CPU divide into three major components. The ALU performs computations, the control unit decodes instructions and sends control signals to each component, and the datapath is the route along which data flows between them. The interplay of these three elements produces every operation the CPU performs.

ALU: Arithmetic Logic Unit

The ALU is the part of the CPU that actually carries out calculations. As its name suggests, it handles arithmetic operations (addition, subtraction, multiplication, division) and logical operations (AND, OR, NOT, XOR).

One of the ALU's most critical building blocks is the adder. Why is the adder so important? Subtraction is addition of the subtrahend's two's complement, multiplication is a combination of repeated additions and shifts, and division is a combination of repeated subtractions and shifts. In the end, most arithmetic operations are built on top of the adder.
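To see how subtraction reduces to addition, here is an illustrative sketch in Python, assuming 8-bit values (the bit width and function names are just for the example):

```python
# Illustrative sketch: subtraction reduced to addition, assuming 8-bit values.
BITS = 8
MASK = (1 << BITS) - 1  # 0xFF for 8 bits

def twos_complement(x):
    """Two's complement: invert all bits, then add 1 (modulo 2^BITS)."""
    return (~x + 1) & MASK

def subtract(a, b):
    """a - b computed as a + (-b), using only the adder."""
    return (a + twos_complement(b)) & MASK

print(subtract(7, 3))   # 4
print(subtract(3, 7))   # 252, which is -4 as an 8-bit two's complement pattern (0xFC)
```

Note that the hardware never "knows" it is subtracting: the same adder circuit produces both results, and interpreting 252 as -4 is purely a matter of how we read the bit pattern.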

The simplest adder is the Ripple Carry Adder. It propagates the carry generated by each bit's addition to the next bit. The design is straightforward, but to obtain the result at the most significant bit, the carry must propagate sequentially from the least significant bit, making it slower in proportion to the number of bits.

Ripple Carry Adder (4-bit)

  A3 B3    A2 B2    A1 B1    A0 B0
  |  |     |  |     |  |     |  |
+-+--+-+ +-+--+-+ +-+--+-+ +-+--+-+
|  FA  |<|  FA  |<|  FA  |<|  FA  |< C_in
+--+---+ +--+---+ +--+---+ +--+---+
C_out S3     S2       S1       S0

< : carry propagation direction
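The diagram's behavior can be modeled bit by bit. Below is an illustrative Python sketch (not a real circuit description) of a full adder cell and the carry rippling through a chain of them:

```python
# A bit-level sketch of the ripple carry adder above (illustrative, not a circuit).

def full_adder(a, b, c_in):
    """One FA cell: sum bit and carry-out from two input bits and a carry-in."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out

def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two bit lists (index 0 = LSB); the carry ripples from cell to cell."""
    sum_bits, carry = [], c_in
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry

# 0b0110 (6) + 0b0011 (3), with bit lists written LSB-first
s, c_out = ripple_carry_add([0, 1, 1, 0], [1, 1, 0, 0])
print(s, c_out)  # [1, 0, 0, 1] 0  -> 0b1001 = 9
```

The sequential dependence is visible in the loop: each iteration needs the carry from the previous one, which is exactly why the hardware version gets slower as the bit width grows.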

To solve this problem, the Carry Lookahead Adder is used. This approach precomputes the conditions under which a carry will be generated at each bit position, producing results simultaneously without waiting for the carry from the previous bit. The circuit is more complex, but the speed improvement is substantial. The ALUs in modern processors almost universally use this approach or a variant of it.
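The lookahead trick is to express every carry directly in terms of per-bit "generate" (both inputs are 1) and "propagate" (exactly one input is 1) signals. A 4-bit sketch, purely illustrative:

```python
# Sketch of the lookahead idea for 4 bits: each carry is computed directly
# from generate (g) and propagate (p) terms, not from the previous cell.

def carry_lookahead_carries(a_bits, b_bits, c0=0):
    g = [a & b for a, b in zip(a_bits, b_bits)]  # bit i generates a carry
    p = [a ^ b for a, b in zip(a_bits, b_bits)]  # bit i propagates a carry
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]

# Same inputs as 6 + 3: carries into bits 1..3, plus the final carry-out
print(carry_lookahead_carries([0, 1, 1, 0], [1, 1, 0, 0]))  # [0, 1, 1, 0]
```

Every carry expression depends only on the inputs and c0, so in hardware all four can be evaluated in parallel. The price is visible too: the expression for each higher carry grows, which is why the circuit complexity increases with width.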

Multipliers are far more complex circuits than adders. Early processors performed multiplication over multiple cycles using the shift-and-add method, but modern processors use techniques like Wallace Trees and Booth Encoding to execute multiplication much faster. Even so, multiplication remains more expensive than addition, which is why compilers replace multiplication by powers of two with shift operations.
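The shift-and-add scheme that early processors iterated over multiple cycles looks like this in an illustrative Python sketch, along with the power-of-two shortcut compilers exploit:

```python
# Illustrative shift-and-add multiplication, the scheme early processors
# iterated over multiple cycles.

def shift_and_add_multiply(a, b):
    """Multiply by scanning b's bits: add a shifted copy of a for each set bit."""
    result = 0
    while b:
        if b & 1:           # current multiplier bit is 1
            result += a     # add the shifted multiplicand
        a <<= 1             # shift the multiplicand one position left
        b >>= 1             # move to the next multiplier bit
    return result

print(shift_and_add_multiply(13, 11))  # 143

# Multiplication by a power of two needs only a single shift, which is why
# compilers rewrite x * 8 as x << 3:
x = 5
print((x * 8) == (x << 3))  # True
```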

Logical operations are simpler compared to arithmetic. Operations like AND, OR, and XOR can process each bit independently, so there is no carry propagation concern. But simplicity does not mean unimportance. Bit masking, flag checking, conditional branching, and many other core CPU behaviors depend on logical operations.
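A quick sketch of flag manipulation with these operations, using made-up flag bit positions (not taken from any real CPU's status register):

```python
# Flag checking with logical operations, using hypothetical flag positions.
ZERO_FLAG  = 1 << 0   # bit 0 (invented layout for the example)
CARRY_FLAG = 1 << 1   # bit 1
SIGN_FLAG  = 1 << 2   # bit 2

status = 0b101        # zero and sign flags set

print(bool(status & ZERO_FLAG))   # True  -> AND with a mask isolates one bit
status |= CARRY_FLAG              # OR sets a flag without touching the others
status &= ~SIGN_FLAG              # AND with NOT clears a flag
print(bin(status))                # 0b11
```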

The Control Unit

If the ALU is the muscle of the CPU, the control unit is the brain. Its role is to decode instructions and send appropriate control signals to each component of the CPU. Deciding which operation the ALU should perform, which register to read data from, and where to store the result are all responsibilities of the control unit.

There are two main approaches to implementing a control unit.

Hardwired control takes the bit pattern of an instruction as input and uses combinational logic circuits to generate control signals directly. At its core is a state machine that receives the opcode and current state as inputs. This approach is fast but complex to design, and changing the instruction set requires redesigning the circuit itself. RISC processors typically adopt this approach because their regular and simple instruction formats are well suited to combinational logic implementation.

Microprogrammed control stores a sequence of micro-instructions corresponding to each instruction in a control store. When a machine instruction executes, the control unit sequentially reads the corresponding micro-instructions from the control store, generating control signals along the way. This approach is slower but flexible. Modifying the instruction set requires only updating the microcode rather than changing hardware. CISC architectures like x86, with their complex and varied instruction formats, favor this approach.

So which approach do modern processors actually use? In practice, a hybrid is the norm. In x86 processors, simple instructions are handled quickly via hardwired logic, while only complex instructions go through microcode. This maximizes performance for frequently used simple instructions while maintaining compatibility with complex ones.
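The difference between the two control styles can be caricatured in a few lines of Python. The opcodes, signal names, and micro-instruction format below are all invented for the sketch:

```python
# Toy contrast between the two control approaches (hypothetical opcodes/signals).

# Hardwired: control signals fall straight out of combinational logic,
# modeled here as a single direct lookup on the opcode.
HARDWIRED = {
    "ADD": {"alu_op": "add", "reg_write": 1},
    "SUB": {"alu_op": "sub", "reg_write": 1},
}

# Microprogrammed: each instruction points at a sequence of micro-instructions
# in the control store, emitted one per step.
CONTROL_STORE = {
    "ADD": [{"read_regs": 1}, {"alu_op": "add"}, {"reg_write": 1}],
}

def hardwired_decode(opcode):
    return HARDWIRED[opcode]                # one combinational step

def microprogrammed_decode(opcode):
    for micro_op in CONTROL_STORE[opcode]:  # several sequential steps
        yield micro_op

print(hardwired_decode("ADD"))
print(list(microprogrammed_decode("ADD")))
```

The trade-off shows up directly: the hardwired table answers in one step but must be rebuilt to change an instruction, while the control store answers over several steps but can be edited like data.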

The Datapath

The datapath refers to the entire set of paths along which data moves inside the CPU. Registers, the ALU, internal buses, and multiplexers all form the datapath, and the control signals generated by the control unit manipulate each element to perform the desired operation.

Let's trace how the instruction ADD R1, R2, R3 (R1 = R2 + R3) executes through a simple datapath.

Instruction: ADD R1, R2, R3

1. Fetch
   PC -> memory address -> read instruction from memory -> store in IR
   PC = PC + 4 (update to next instruction address)

2. Decode
   IR opcode -> control unit -> generate control signals
   IR register fields -> read R2, R3 values from register file

3. Execute
   R2 value, R3 value -> ALU (addition)
   ALU result -> store in R1
   Update status register (overflow, zero flags, etc.)

At each step, control signals activate the appropriate paths in the datapath. Selecting the read ports of the register file, choosing the ALU operation type, and enabling the write port of the register file are all determined by control signals.

+------+     +--------------+     +-----+
|  PC  |---->|   Memory     |---->| IR  |
+--+---+     +--------------+     +--+--+
   |                                 |
   | +4                   opcode     | register numbers
   |                     +-------+   |
   |                     |Control|<--+
   |                     | Unit  |
   |                     +--+----+
   |                        | control signals
   |         +--------------+-------------+
   |         v              v             v
   |    +---------+    +----------+   +------+
   |    |Register |--->|   ALU    |-->|Result|
   |    |  File   |--->|          |   |Store |
   |    +---------+    +----------+   +------+
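The three steps traced above can be condensed into a toy simulator. The instruction encoding and component names are invented for the sketch, not taken from any real ISA:

```python
# A toy model of the datapath above executing ADD R1, R2, R3.
# The tuple encoding and initial register values are invented for the example.

memory = {0: ("ADD", 1, 2, 3)}          # instruction memory: address -> instruction
regs = {1: 0, 2: 10, 3: 32}             # register file: R2 = 10, R3 = 32
pc = 0

# 1. Fetch: read the instruction at PC into IR, then advance PC
ir = memory[pc]
pc += 4

# 2. Decode: the control unit inspects the opcode; the register file is read
opcode, rd, rs1, rs2 = ir
a, b = regs[rs1], regs[rs2]

# 3. Execute: the ALU adds, the result is written back, flags are updated
result = a + b if opcode == "ADD" else None
regs[rd] = result
zero_flag = int(result == 0)

print(regs[1], pc, zero_flag)  # 42 4 0
```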

Datapath design directly impacts CPU performance. Longer paths mean greater signal propagation delay, which sets the upper bound on clock speed. Processor designers therefore invest considerable effort into minimizing the critical path, the longest signal propagation route through the datapath.

Clock Signal and Timing

Every CPU operation is synchronized by the clock signal. The clock is a signal that alternates periodically between 0 and 1, and register values are updated at the rising edge (the moment the signal transitions from 0 to 1) or falling edge. The clock period is the time taken for one cycle of the clock, and the clock frequency is its reciprocal.

Increasing the clock speed allows more instructions to be processed per unit time. So why not raise the clock speed indefinitely? Because signal propagation through the longest path in the datapath must complete within one clock period. If the clock is too fast, the next operation begins before signals have stabilized, producing incorrect results.
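This relationship gives a back-of-the-envelope bound on the clock. The 0.8 ns figure below is an assumed, illustrative delay, not a measurement of any real chip:

```python
# The critical path bounds the clock period: one cycle must be at least as
# long as the slowest signal path. Illustrative numbers only.
critical_path_delay_s = 0.8e-9          # assume the longest path takes 0.8 ns

max_frequency_hz = 1 / critical_path_delay_s
print(f"{max_frequency_hz / 1e9:.2f} GHz")  # 1.25 GHz
```

Shaving even a fraction of a nanosecond off the critical path raises this ceiling, which is why designers fight over individual gate delays.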

Furthermore, higher clock speeds cause power consumption to increase dramatically. Dynamic power consumption is proportional to clock frequency, and more power means more heat. This power-thermal wall is precisely why the clock speed race stalled around 4 GHz in the mid-2000s. Since then, processor performance improvements have shifted toward deeper pipelining, multicore designs, and microarchitectural refinements rather than raw clock speed.
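The standard first-order model is P = a * C * V^2 * f (a: activity factor, C: switched capacitance, V: supply voltage, f: frequency). The numbers below are illustrative assumptions, chosen only to show why frequency increases are so costly in practice:

```python
# Dynamic power scales as P = a * C * V^2 * f. All values are illustrative.

def dynamic_power(a, c, v, f):
    return a * c * v * v * f

base   = dynamic_power(0.2, 1e-9, 1.0, 3e9)  # 3 GHz at 1.0 V
faster = dynamic_power(0.2, 1e-9, 1.2, 4e9)  # 4 GHz, assuming it needs 1.2 V

print(f"{faster / base:.2f}x the power")  # 1.92x
```

The squared voltage term is the killer: pushing frequency up usually requires raising voltage too, so a ~33% clock increase here nearly doubles the power.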

CPU Design Trade-offs

CPU design involves numerous trade-offs. The hardwired versus microprogrammed control distinction we already examined is one example, and there are many others.

Increasing the number of registers reduces memory accesses, but requires more bits to specify registers in instructions, enlarging instruction size. Making the ALU wider (say, from 32-bit to 64-bit) increases the data size processable in one operation, but increases circuit complexity and area. Adding more pipeline stages to reduce the critical path enables higher clock speeds, but increases the number of instructions discarded on a branch misprediction.

Understanding these trade-offs matters because the direction of software performance optimization is ultimately determined by hardware design choices. Knowing which operations the CPU handles quickly and which are expensive enables writing more efficient code.

In the next post, we'll look at Instruction Set Architecture (ISA).