From Theory to Practice

So far we have covered finite automata, regular expressions, context-free grammars, and pushdown automata. Along the way, we noted that each of these theories corresponds to a specific phase of a compiler. Now it is time to see the full picture. A compiler is a program that translates source code into machine code, but this translation is not a single monolithic process — it is a pipeline of systematically combined phases. Understanding why each phase exists and why they are arranged in this particular order makes the logic of compiler design clear.

Overall Compiler Architecture

A compiler's processing pipeline divides broadly into a frontend and a backend, connected by an intermediate representation (IR).

       Source Code
             │
             ▼
┌─────────────────────────┐
│        Frontend         │
│  ┌───────────────────┐  │
│  │ Lexical Analysis  │  │
│  └─────────┬─────────┘  │
│  ┌─────────▼─────────┐  │
│  │  Syntax Analysis  │  │
│  └─────────┬─────────┘  │
│  ┌─────────▼─────────┐  │
│  │ Semantic Analysis │  │
│  └─────────┬─────────┘  │
└────────────┼────────────┘
   ┌─────────▼─────────┐
   │ Intermediate Rep. │
   └─────────┬─────────┘
┌────────────┼────────────┐
│         Backend         │
│  ┌─────────▼─────────┐  │
│  │   Optimization    │  │
│  └─────────┬─────────┘  │
│  ┌─────────▼─────────┐  │
│  │  Code Generation  │  │
│  └───────────────────┘  │
└─────────────────────────┘
             │
             ▼
        Object Code

This structure is not arbitrary. Each phase takes the output of the previous phase as input and performs a higher level of analysis. Separating the phases this way makes it possible to design and optimize each one independently.

Lexical Analysis — The Lexer

The first phase of a compiler's processing is lexical analysis. The lexer breaks a stream of characters into meaningful units called tokens. For example, the source code int x = 42; is decomposed into five tokens: int (keyword), x (identifier), = (operator), 42 (integer literal), and ; (delimiter).

As discussed in earlier posts, a lexer is fundamentally a finite automaton. Each token type is defined by a regular expression, which is converted into a DFA that scans the input. Insignificant characters like whitespace and comments are discarded at this stage, reducing the burden on subsequent phases. The lexer exists as a separate phase because mixing character-level processing with structural analysis would dramatically increase complexity.
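To make this concrete, here is a minimal regex-driven lexer sketch. The token names and patterns are invented for the example, not any real language's specification, but the shape is the same: each token type is a regular expression, and insignificant characters are dropped before anything downstream sees them.

```python
import re

# Illustrative token definitions; a real lexer would cover the full language.
# Order matters: KEYWORD must be tried before the more general IDENT.
TOKEN_SPEC = [
    ("KEYWORD",    r"\bint\b"),
    ("INT_LIT",    r"\d+"),
    ("IDENT",      r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"="),
    ("DELIMITER",  r";"),
    ("WHITESPACE", r"\s+"),   # discarded, just as comments would be
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(source):
    """Yield (kind, text) pairs, skipping insignificant characters."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "WHITESPACE":
            yield (m.lastgroup, m.group())

print(list(lex("int x = 42;")))
# [('KEYWORD', 'int'), ('IDENT', 'x'), ('OPERATOR', '='),
#  ('INT_LIT', '42'), ('DELIMITER', ';')]
```

Under the hood this is exactly the regular-expression-to-automaton story: the `re` engine simulates an automaton built from the alternation of all token patterns.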

Syntax Analysis — The Parser

Syntax analysis takes the token stream produced by the lexer and analyzes the program's structure. The parser verifies that the token sequence conforms to the programming language's grammar and, if it does, produces a parse tree or abstract syntax tree (AST).

A parser is fundamentally a pushdown automaton. It operates based on a context-free grammar and uses a stack to track nested structures. Parsing strategies fall into two main categories. Top-down parsing starts from the start symbol and derives the input, with LL parsers being the representative example. Bottom-up parsing starts from the input tokens and reduces toward the start symbol, with LR parsers being the representative example. Regardless of the strategy, the parser's task is to transform a linear sequence of tokens into a hierarchical structure.
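A small sketch makes the top-down strategy tangible. Below is a hand-written recursive-descent parser (the classic LL-style approach) for a toy grammar of additions and parentheses; the grammar and the AST node shapes are invented for this example, and the host language's call stack plays the role of the pushdown automaton's stack.

```python
# Toy grammar:  expr -> term ('+' term)*    term -> NUMBER | '(' expr ')'

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {peek()!r}")
        pos += 1

    def expr():                          # expr -> term ('+' term)*
        node = term()
        while peek() == "+":
            eat("+")
            node = ("+", node, term())   # build a hierarchical AST node
        return node

    def term():                          # term -> NUMBER | '(' expr ')'
        if peek() == "(":
            eat("(")
            node = expr()
            eat(")")
            return node
        tok = peek()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected number, got {tok!r}")
        eat(tok)
        return ("num", int(tok))

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"trailing input: {peek()!r}")
    return tree

print(parse(["1", "+", "(", "2", "+", "3", ")"]))
# ('+', ('num', 1), ('+', ('num', 2), ('num', 3)))
```

Note how the linear token list goes in and a tree comes out — the transformation described above, in about thirty lines.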

Semantic Analysis

A program that is syntactically correct is not necessarily semantically correct. int x = "hello"; conforms to C's grammar but has a type mismatch. Checking for such errors is the job of semantic analysis.

During semantic analysis, a symbol table is constructed to manage declaration information for variables and functions, type checking is performed, and context-dependent rules are verified — such as whether a variable is used before declaration or whether a function call has the correct number of arguments. In the previous post, we mentioned that some rules exceed the scope of context-free grammars. Those rules are precisely what this phase handles.
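As a sketch of these checks, here is a tiny semantic analyzer over an invented statement format. It maintains a symbol table and enforces two rules a context-free grammar cannot express: declare-before-use and type agreement in assignments.

```python
def check(statements):
    symbols = {}          # symbol table: name -> declared type
    errors = []
    for stmt in statements:
        if stmt[0] == "decl":             # ("decl", type, name)
            _, typ, name = stmt
            symbols[name] = typ
        elif stmt[0] == "assign":         # ("assign", name, value_type)
            _, name, value_type = stmt
            if name not in symbols:
                errors.append(f"'{name}' used before declaration")
            elif symbols[name] != value_type:
                errors.append(
                    f"type mismatch: '{name}' is {symbols[name]}, "
                    f"got {value_type}")
    return errors

program = [
    ("decl", "int", "x"),
    ("assign", "x", "string"),   # int x = "hello";
    ("assign", "y", "int"),      # y was never declared
]
print(check(program))
# ["type mismatch: 'x' is int, got string", "'y' used before declaration"]
```

The symbol table is the key data structure: it carries exactly the context (what has been declared, with what type) that the parser's context-free machinery cannot track.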

The Role of Intermediate Representation

Why place an intermediate representation between the frontend and backend? This design decision is one of the most important ideas in compiler engineering.

Without an IR, generating object code directly from a source language would require M×N compilers for M source languages and N target machines. With a common IR, only M frontends and N backends — a total of M+N modules — need to be implemented. Supporting a new source language requires adding only a frontend, and supporting a new target machine requires adding only a backend.
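The M+N arithmetic can be shown in miniature. In this toy sketch (every "language", the IR, and both "targets" are invented stand-ins), three frontends all emit the same IR tuple and two backends both consume it, so five modules yield all six language-target combinations.

```python
# Each frontend translates its "source" into the shared IR: ("add", a, b).
frontends = {
    "infix":   lambda src: ("add", *map(int, src.split("+"))),
    "prefix":  lambda src: ("add", *map(int, src.split()[1:])),
    "postfix": lambda src: ("add", *map(int, src.split()[:-1])),
}
# Each backend lowers the shared IR to its "machine" syntax.
backends = {
    "stack":    lambda ir: f"PUSH {ir[1]}; PUSH {ir[2]}; ADD",
    "register": lambda ir: f"ADD r0, #{ir[1]}, #{ir[2]}",
}

# 3 frontends + 2 backends = 5 modules, covering 3 * 2 = 6 pipelines.
ir = frontends["infix"]("1+2")
print(backends["register"](ir))      # ADD r0, #1, #2
print(backends["stack"](ir))         # PUSH 1; PUSH 2; ADD
```

Adding a fourth source language means writing one new frontend; every existing backend works with it for free.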

LLVM is the most successful realization of this principle. Built around a common intermediate representation called LLVM IR, it has frontends for C, C++, Rust, Swift, and many other languages, along with backends for x86, ARM, RISC-V, and other architectures. Adding a new language or architecture does not require building an entire compiler from scratch.

Optimization

The optimization phase performs transformations on the IR to improve program performance. Representative optimization techniques include constant folding (pre-computing expressions whose values are known at compile time), dead code elimination (removing code that can never execute), loop optimization (improving the efficiency of loops), and inlining (replacing function calls with the function body).
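Two of these techniques fit in a short sketch. On a made-up AST-like IR, the code below performs constant folding and then eliminates an `if` whose condition folded to a known constant — dead code elimination enabled by the folding pass.

```python
def fold(node):
    """Constant folding: pre-compute operations on known constants."""
    if isinstance(node, tuple) and node[0] == "+":
        left, right = fold(node[1]), fold(node[2])
        if isinstance(left, int) and isinstance(right, int):
            return left + right          # computed at "compile time"
        return ("+", left, right)
    return node

def eliminate_dead(stmts):
    """Drop branches that can never execute after folding."""
    out = []
    for stmt in stmts:
        if stmt[0] == "if":
            _, cond, body = stmt
            cond = fold(cond)
            if cond == 0:                # condition known false: dead code
                continue
            stmt = ("if", cond, body)
        out.append(stmt)
    return out

program = [
    ("assign", "x", ("+", 2, ("+", 3, 4))),   # folds to 9
    ("if", ("+", 0, 0), [("call", "log")]),   # condition folds to 0: removed
]
optimized = [("assign", s[1], fold(s[2])) if s[0] == "assign" else s
             for s in eliminate_dead(program)]
print(optimized)
# [('assign', 'x', 9)]
```

This also illustrates why optimization passes are run in sequence: folding exposes the dead branch that elimination then removes.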

Optimization is performed at the IR level because transformations independent of both the source language and the target machine can be implemented in one place. There are also machine-dependent optimizations — register allocation, instruction scheduling, and others — which are handled during code generation. The ability to separate machine-independent and machine-dependent optimizations is itself an advantage of the phase-separated architecture centered on an IR.

Code Generation

The final phase of the compiler translates the optimized IR into actual target machine instructions. This phase involves register allocation (deciding which variables go in which registers), instruction selection (mapping IR operations to machine instructions), and instruction scheduling (reordering instructions for pipeline efficiency).
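A deliberately naive sketch of the first two subtasks: each variable in a toy three-address IR is assigned the next free register (real register allocators must handle spilling when registers run out), and each IR operation is mapped to one invented pseudo-instruction. No real ISA is targeted.

```python
def codegen(ir):
    registers = {}                      # variable -> register (naive allocation)
    def reg(var):
        if var not in registers:
            registers[var] = f"r{len(registers)}"
        return registers[var]

    asm = []
    for op, dst, *args in ir:
        if op == "const":               # dst = constant
            asm.append(f"MOV {reg(dst)}, #{args[0]}")
        elif op == "add":               # dst = a + b
            a, b = args
            asm.append(f"ADD {reg(dst)}, {reg(a)}, {reg(b)}")
    return asm

ir = [("const", "t0", 40), ("const", "t1", 2), ("add", "x", "t0", "t1")]
for line in codegen(ir):
    print(line)
# MOV r0, #40
# MOV r1, #2
# ADD r2, r0, r1
```

Instruction scheduling is absent here on purpose: it would reorder these lines for pipeline efficiency without changing what they compute.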

The code generator's output is either assembly code or machine code binary, which becomes the program that the operating system can ultimately execute.

Single-Pass vs. Multi-Pass

Compilers can be classified by implementation approach as single-pass or multi-pass. A single-pass compiler reads the source code only once, performing lexing, parsing, and code generation simultaneously. This was advantageous in the early computing era when memory was severely limited, and early C and Pascal compilers used this approach. The C rule requiring functions to be declared before use originated from single-pass compiler constraints.

A multi-pass compiler traverses the source code or IR multiple times, with each pass performing a specific task. Most modern compilers adopt the multi-pass approach because each pass can focus on a single concern, resulting in cleaner implementation and enabling more sophisticated analysis and optimization.

Compilers vs. Interpreters

Compilers and interpreters represent two approaches to executing programming languages. A compiler translates the entire source code into object code before execution, while an interpreter reads and executes source code one statement at a time. Compilers require translation time before execution but produce fast-running code; interpreters can begin execution immediately but run relatively slower.

Many modern language implementations combine both approaches. Java compiles source code to bytecode, which the JVM then interprets or compiles just-in-time (JIT) to native code. Python likewise compiles internally to bytecode before executing it on a virtual machine. Rather than pure compilers or pure interpreters, hybrid approaches that combine the strengths of both have become the mainstream of modern language implementation.
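Python's half of this story can be observed directly: the standard library's dis module disassembles the bytecode that CPython compiles a function into. The exact opcodes vary across Python versions, so the output shown is only representative.

```python
import dis

def add(a, b):
    return a + b

dis.dis(add)
# Output varies by Python version, but resembles:
#   LOAD_FAST    a
#   LOAD_FAST    b
#   BINARY_ADD        (BINARY_OP 0 (+) on newer versions)
#   RETURN_VALUE
```

The function object carries its compiled bytecode in `add.__code__.co_code`; the virtual machine described above is the loop that executes those bytes.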

In the next post, we'll look at the concrete implementation of the compiler's first practical phase — lexical analysis, the lexer.