Automata and Compilers 12 - Code Generation
From IR to Machine Code
An optimized intermediate representation is not yet an executable program. The final stage of translating it into machine instructions that a specific processor can execute is code generation. The code generator must solve three core problems: instruction selection, which maps IR operations to appropriate machine instructions; register allocation, which assigns variables to a limited number of physical registers; and instruction scheduling, which reorders instructions to maximize pipeline efficiency.
These three problems are tightly intertwined. Different instruction selections change the number of registers needed, and register allocation results affect the freedom available for scheduling. Optimizing each in isolation cannot guarantee a globally optimal solution, and balancing the three is the central challenge of code generator design.
Instruction Selection
Instruction selection maps each IR operation to machine instructions for the target architecture. Sometimes a single IR operation corresponds to a single machine instruction, but reality is seldom that simple. A single IR operation may require multiple machine instructions, or conversely, several IR operations may be folded into a single complex instruction.
Tree pattern matching is a prominent technique for instruction selection. The IR is represented as a tree, and predefined instruction patterns are matched against it. When multiple coverings exist for a given tree, the combination with the lowest cost is chosen.
IR tree:                          Pattern match result (x86):

        [+]
       /   \
   [Reg]   [*]        →   lea rax, [rbx + rcx*4]
          /   \           (addition and multiplication folded into a single LEA)
      [Reg]   [4]
In this example, the IR pattern Reg + Reg * 4 maps to a single x86 LEA (Load Effective Address) instruction. This is more efficient than generating two separate instructions for the multiplication and addition.
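The covering idea can be sketched in a few lines. The following is a minimal, illustrative tiler (the tuple IR encoding and pattern set are assumptions for this sketch, not any real compiler's): it recursively computes the cheapest cover of an IR tree, preferring the single-LEA pattern over separate multiply and add instructions when the scale factor allows it.

```python
# IR trees as nested tuples: ("+", l, r), ("*", l, r), ("reg", name), ("const", value).
# cost() returns (total cost, instruction list) for the cheapest tiling found.

def cost(tree):
    op = tree[0]
    if op in ("reg", "const"):
        return 0, []                       # leaves are free: already in a register
    left, right = tree[1], tree[2]
    # Complex pattern: reg + reg * {1,2,4,8}  ->  one LEA instruction (cost 1)
    if (op == "+" and left[0] == "reg" and right[0] == "*"
            and right[1][0] == "reg" and right[2][0] == "const"
            and right[2][1] in (1, 2, 4, 8)):
        return 1, [f"lea rax, [{left[1]} + {right[1][1]}*{right[2][1]}]"]
    # Fallback: tile the children separately, then emit a generic add/imul (cost 1)
    lc, li = cost(left)
    rc, ri = cost(right)
    mnemonic = {"+": "add", "*": "imul"}[op]
    return lc + rc + 1, li + ri + [f"{mnemonic} <generic>"]

tree = ("+", ("reg", "rbx"), ("*", ("reg", "rcx"), ("const", 4)))
print(cost(tree))  # (1, ['lea rax, [rbx + rcx*4]'])
```

With a scale factor of 3 instead of 4, the LEA pattern no longer matches and the fallback produces two instructions at cost 2, which is exactly the comparison a cost-driven selector makes.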
Macro expansion is a simpler approach that mechanically applies a predetermined instruction template for each IR operation. It is straightforward to implement but cannot exploit complex instructions, which may result in lower code quality. In practice, compilers often use macro expansion to generate initial code, then apply peephole optimization to replace sequences of adjacent instructions with more efficient alternatives.
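A peephole pass can be sketched as a sliding window over the instruction list. The two rewrite rules below (a redundant load after a store, and an add of zero) are illustrative examples, as is the tuple instruction encoding; real peephole optimizers carry many such rules and re-run until no rule fires.

```python
# Instructions as tuples: ("store", src_reg, addr), ("load", dst_reg, addr),
# ("add", reg, immediate), ("move", dst_reg, src_reg).

def peephole(insns):
    out, i = [], 0
    while i < len(insns):
        a = insns[i]
        b = insns[i + 1] if i + 1 < len(insns) else None
        if a[0] == "add" and a[2] == 0:
            i += 1                                  # add reg, 0 is a no-op: drop it
        elif b and a[0] == "store" and b[0] == "load" and a[2] == b[2]:
            out += [a, ("move", b[1], a[1])]        # load from same addr -> register move
            i += 2
        else:
            out.append(a)
            i += 1
    return out

code = [("store", "r1", "addr0"),
        ("load", "r2", "addr0"),
        ("add", "r2", 0)]
print(peephole(code))  # [('store', 'r1', 'addr0'), ('move', 'r2', 'r1')]
```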
Register Allocation
Processor registers are tens to hundreds of times faster than memory, but their count is severely limited. x86-64 has 16 general-purpose registers; 64-bit ARM has 31. Optimized IR may use hundreds or thousands of virtual registers, making register allocation (mapping virtual registers to physical ones) the most demanding problem in code generation.
Register allocation can be modeled as a graph coloring problem. First, an interference graph is constructed: each virtual register becomes a node, and an edge connects any two registers that are simultaneously live. If this graph can be colored with as many colors as there are physical registers, every variable can reside in a register.
Interference graph example:
Virtual registers: v1, v2, v3, v4
Physical registers: R1, R2, R3 (3 available)
v1 ── v2
 │ ╲  │            v1 → R1
v3 ── v4           v2 → R2
                   v3 → R2  (v2 and v3 are not simultaneously live)
                   v4 → R3
When the graph cannot be colored, that is, when physical registers are insufficient, some variables must be spilled to memory and reloaded when needed. Because the performance impact varies greatly depending on which variables are spilled, the allocator considers use frequency and loop depth to select the variables with the lowest spill cost.
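A greedy coloring with a spill fallback can be sketched briefly. This is a deliberately simplified stand-in for a real Chaitin-style allocator (no simplify/select worklists, no coalescing, and "spill whatever fails to get a color" instead of a cost model); the interference graph below is the four-node example from the diagram.

```python
def color(graph, k):
    """graph: {node: set of interfering neighbors}; k: number of physical registers.
    Returns ({node: color index}, [spilled nodes])."""
    assignment, spilled = {}, []
    # Most-constrained-first: color high-degree nodes while choices remain
    for node in sorted(graph, key=lambda n: -len(graph[n])):
        taken = {assignment[n] for n in graph[node] if n in assignment}
        free = [c for c in range(k) if c not in taken]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)        # no register left: spill to memory
    return assignment, spilled

graph = {"v1": {"v2", "v3", "v4"},
         "v2": {"v1", "v4"},
         "v3": {"v1", "v4"},
         "v4": {"v1", "v2", "v3"}}
print(color(graph, 3))  # 3 colors suffice; v2 and v3 share one, nothing spills
```

As in the diagram, v2 and v3 end up sharing a register precisely because no edge connects them.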
Graph coloring produces high-quality allocations, but optimal coloring is an NP-complete problem, so execution time can be prohibitive. For environments where compilation speed matters, such as JIT compilers, the linear scan algorithm is used as an alternative. Linear scan sorts variables' live intervals by start point and assigns registers in a single pass, achieving reasonable allocation quality in O(n log n) time.
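The linear scan idea fits in a short sketch. This follows the classic Poletto and Sarkar formulation in simplified form (contiguous intervals only, and the spill heuristic is "evict the active interval that ends furthest in the future"); the interval data is made up for illustration.

```python
def linear_scan(intervals, k):
    """intervals: {name: (start, end)}; k: number of physical registers.
    Returns {name: register index or "spill"}."""
    result, active, free = {}, [], list(range(k))
    for name, (start, end) in sorted(intervals.items(), key=lambda x: x[1][0]):
        # Expire intervals that ended before this one starts, freeing their registers
        for other in [a for a in active if intervals[a][1] < start]:
            active.remove(other)
            free.append(result[other])
        if free:
            result[name] = free.pop()
            active.append(name)
        else:
            # All registers busy: spill whichever interval ends last
            victim = max(active, key=lambda a: intervals[a][1])
            if intervals[victim][1] > end:
                result[name] = result.pop(victim)   # steal the victim's register
                active.remove(victim)
                active.append(name)
                result[victim] = "spill"
            else:
                result[name] = "spill"
    return result

print(linear_scan({"a": (0, 8), "b": (1, 3), "c": (2, 9), "d": (4, 6)}, 2))
```

With two registers, `c` (the longest-lived interval) is spilled, and `d` reuses the register freed when `b`'s interval expires, which is exactly the single-pass reuse that makes the algorithm fast.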
Instruction Scheduling
Modern processors use pipelines to execute multiple instructions simultaneously. However, data dependencies between instructions can cause pipeline stalls. Instruction scheduling reorders instructions, without violating dependencies, to minimize stalls.
Before scheduling:                       After scheduling:
load r1, [addr1]   ← memory access       load r1, [addr1]
add  r2, r1, 1     ← stall waiting       load r3, [addr2]   ← independent load first
load r3, [addr2]                         add  r2, r1, 1     ← runs after r1 is ready
mul  r4, r3, 2     ← stall waiting       mul  r4, r3, 2     ← runs after r3 is ready
By moving the second load earlier, useful work is performed during the latency of the first load. The same set of instructions executes in less time.
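A greedy list scheduler reproduces this reordering. The sketch below is simplified (one instruction issued per cycle, and the latencies are illustrative numbers, not real hardware figures): each cycle it collects the instructions whose dependencies have finished and issues the one with the longest latency, so long loads start as early as possible.

```python
def list_schedule(insns, deps, latency):
    """insns: names in program order; deps: {insn: set of insns it depends on};
    latency: {insn: cycles until its result is ready}. Returns issue order."""
    finish, order, time = {}, [], 0
    remaining = list(insns)
    while remaining:
        ready = [i for i in remaining
                 if all(finish.get(d, float("inf")) <= time for d in deps[i])]
        if ready:
            pick = max(ready, key=lambda i: latency[i])  # start long ops first
            finish[pick] = time + latency[pick]
            order.append(pick)
            remaining.remove(pick)
        time += 1                                        # otherwise: a stall cycle

    return order

insns = ["load1", "add", "load2", "mul"]
deps = {"load1": set(), "add": {"load1"}, "load2": set(), "mul": {"load2"}}
latency = {"load1": 3, "add": 1, "load2": 3, "mul": 1}
print(list_schedule(insns, deps, latency))  # ['load1', 'load2', 'add', 'mul']
```

The second load is hoisted above the dependent `add`, matching the scheduled sequence shown above.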
Stack Frames and Calling Conventions
Function calls require dedicated management during code generation. Each function call creates a stack frame that holds local variables, saved registers, the return address, and function arguments.
Stack frame layout (typical):
High address    ┌───────────────────┐
                │ Arguments         │ ← passed by caller
                ├───────────────────┤
                │ Return address    │ ← saved by call instruction
                ├───────────────────┤
                │ Previous frame    │ ← maintains stack chain
                │ pointer           │
                ├───────────────────┤
                │ Saved registers   │ ← callee-saved registers
                ├───────────────────┤
                │ Local variables   │
Low address     └───────────────────┘ ← stack pointer (SP)
A calling convention defines the rules for argument passing, return value location, and register preservation responsibilities during function calls. Under the x86-64 System V ABI, the first six integer arguments are passed in RDI, RSI, RDX, RCX, R8, and R9, with the return value in RAX. Windows x64 uses RCX, RDX, R8, and R9 instead. The code generator must adhere precisely to the target platform's calling convention; otherwise, the generated code cannot link with code produced by other compilers.
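The integer-argument rules above can be made concrete with a tiny sketch. The helper below simply splits an argument list between registers and stack slots; it ignores details a real code generator must handle, such as floating-point register classes and the Windows x64 shadow space.

```python
# Integer-argument registers, in order, for the two conventions described above
SYSV_INT_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]   # x86-64 System V
WIN64_INT_REGS = ["rcx", "rdx", "r8", "r9"]                # Windows x64

def assign_args(args, regs):
    in_regs = dict(zip(regs, args))    # first len(regs) arguments go in registers
    on_stack = args[len(regs):]        # the rest are passed on the stack
    return in_regs, on_stack

args = ["a", "b", "c", "d", "e"]
print(assign_args(args, SYSV_INT_REGS))
# ({'rdi': 'a', 'rsi': 'b', 'rdx': 'c', 'rcx': 'd', 'r8': 'e'}, [])
print(assign_args(args, WIN64_INT_REGS))
# ({'rcx': 'a', 'rdx': 'b', 'r8': 'c', 'r9': 'd'}, ['e'])
```

The same five-argument call fits entirely in registers under System V but pushes its fifth argument onto the stack under Windows x64, which is why code compiled for one convention cannot call into the other.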
Targeting Different Architectures
For a single compiler to support multiple architectures, the code generator must systematically manage architecture-specific differences. CISC architectures (x86) offer complex instructions and diverse addressing modes, giving instruction selection a wide range of choices. RISC architectures (ARM, RISC-V) focus on executing simple instructions quickly, so a single IR operation is often decomposed into multiple instructions.
In compiler infrastructures like LLVM, target description files declaratively specify each architecture's register set, instruction definitions, and calling conventions. The core algorithms of the code generator remain architecture-independent, while architecture-specific details are read from the target description. The fact that adding a new architecture requires only a new target description rather than rewriting the entire code generator demonstrates the extensibility of IR-based design.
JIT Compilation
Everything we have covered so far has been AOT (Ahead-of-Time) compilation, where machine code is generated before the program runs. The counterpart is JIT (Just-in-Time) compilation, where the compiler translates bytecode or IR to machine code while the program is executing. Java's HotSpot, JavaScript's V8, and .NET's RyuJIT are prominent JIT compilers.
The key advantage of JIT compilation is the ability to use runtime information for optimization. An AOT compiler cannot know what inputs the program will receive or which branches will be taken frequently. A JIT compiler identifies hot paths through profiling and applies aggressive optimizations to those paths. Does JIT always produce faster code than AOT? Not necessarily. The time spent on JIT compilation itself is included in execution time, so for short-running programs, JIT can actually be disadvantageous. This is precisely why JIT's benefits are greatest in long-running server applications.
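The hot-path detection that drives tiered JIT compilation can be caricatured in a few lines. Everything here is a toy stand-in (the class, the threshold, and "compilation" as merely caching the function are all assumptions of the sketch): real JITs count invocations and loop iterations per method and hand hot ones to an optimizing backend.

```python
HOT_THRESHOLD = 100   # illustrative; real runtimes tune this per tier

class TieredRuntime:
    def __init__(self):
        self.counts = {}     # invocation counters (the "profiler")
        self.compiled = {}   # methods promoted to the fast tier

    def call(self, name, func, *args):
        if name in self.compiled:
            return self.compiled[name](*args)        # fast path: compiled code
        self.counts[name] = self.counts.get(name, 0) + 1
        if self.counts[name] >= HOT_THRESHOLD:
            self.compiled[name] = func               # stand-in for real codegen
        return func(*args)                           # slow path: interpreter

rt = TieredRuntime()
for i in range(150):
    rt.call("square", lambda x: x * x, i)
print("square" in rt.compiled)  # True: crossed the threshold and was promoted
```

The threshold embodies the trade-off in the paragraph above: a short-running program never crosses it and pays no compilation cost, while a long-running one amortizes that cost over many fast-path calls.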
From Theory to Practice
This series began with the simplest computational model: the finite automaton. A finite automaton recognizes patterns using nothing more than states and transitions, and this became the theoretical foundation for the lexer, the first stage of a compiler. The equivalence of regular expressions and finite automata, and the ability to convert NFAs to DFAs, were not mere theoretical curiosities but the bedrock upon which real lexer implementations are built.
Recognizing the limitations of finite automata, we advanced to pushdown automata equipped with a stack, which provided the theoretical basis for parsers. We described the structure of programming languages using context-free grammars and studied systematic parsing methods: LL parsing and LR parsing. These parsing theories are the core principles behind parser generators like Yacc and ANTLR.
On top of that foundation, semantic analysis ensured type safety and scope rules, intermediate representations and optimization improved performance while preserving program meaning, and finally, code generation transformed abstract representations into instructions that real hardware can execute.
Looking back, each phase of the compiler is engineering built on theory. Without automata theory, there would have been no principled basis for designing lexers and parsers. Without formal language theory, there would have been no precise way to describe the grammar of programming languages. My hope is that this series has conveyed, at least in part, the fact that the compilers we use every day stand upon decades of theoretical research.