Automata and Compilers 01 - Finite Automata
Why Start with Automata?
When deciding to study compilers, most people want to jump straight into parsers and code generation. But to truly understand how each phase of a compiler works, the underlying theory is unavoidable. Lexers operate as finite automata, and parsers operate as pushdown automata. Without understanding this theory, you can build a compiler but never understand why it's designed the way it is.
This series starts from automata theory and progresses all the way to implementing an actual compiler. The first topic is the most fundamental computational model: the finite automaton.
The Concept of Finite Automata
A finite automaton is an abstract machine that has a finite number of states and transitions between them by reading input one character at a time. After reading all the input, if the current state is an accepting state, the input is accepted; otherwise, it is rejected.
You encounter structures similar to finite automata in everyday life. Think of a vending machine: its state changes with each coin inserted, and once enough money has been deposited, it transitions to a state that dispenses a drink. Traffic lights work the same way, cycling through a finite set of states: red, yellow, and green.
Formally, a finite automaton is defined by five components.
| Component | Description |
|---|---|
| Q | Finite set of states |
| Σ | Finite input alphabet |
| δ | Transition function (δ: Q × Σ → Q) |
| q₀ | Start state |
| F | Set of accepting states |
Given these five components, a finite automaton is completely determined. The key point is the absence of memory. A finite automaton can only remember its current state; previous inputs are reflected only indirectly through state transitions.
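The five-tuple maps directly onto code. Here is a rough Python sketch (the class and field names are my own, not a standard library API): each component becomes a field, and running the machine is a single loop that carries nothing but the current state.

```python
# A DFA as an explicit five-tuple (illustrative sketch).
from dataclasses import dataclass

@dataclass
class DFA:
    states: set       # Q: finite set of states
    alphabet: set     # Σ: input alphabet
    delta: dict       # δ: maps (state, symbol) to the next state
    start: str        # q₀: start state
    accepting: set    # F: set of accepting states

    def accepts(self, s: str) -> bool:
        # No memory beyond the current state: just follow transitions.
        state = self.start
        for ch in s:
            state = self.delta[(state, ch)]
        return state in self.accepting
```

For instance, a two-state DFA over {"1"} that toggles between "even" and "odd" accepts exactly the strings with an even number of 1s.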
DFA and NFA
Finite automata come in two varieties: deterministic finite automata (DFA) and nondeterministic finite automata (NFA).
In a DFA, for each state and input symbol, there is exactly one next state. There is no ambiguity, making implementation straightforward. You simply look at the current state and the input character to determine the next state.
In an NFA, a single input can lead to multiple possible next states simultaneously, and ε-transitions (epsilon transitions), which occur without reading any input, are allowed. Does this nondeterminism give NFAs more computational power than DFAs? Not at all. The set of languages recognizable by NFAs is exactly the same as the set recognizable by DFAs. This is one of the most important results in automata theory.
However, there is a difference in descriptive convenience. NFAs can often represent the same language with far fewer states. This is why, when converting regular expressions to automata, the standard approach is to first construct an NFA, then convert it to a DFA.
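One simple way to simulate an NFA directly is to track the set of all states it could currently be in. The sketch below uses an example NFA of my own (for strings ending in "01", without ε-transitions for brevity); transitions map a (state, symbol) pair to a set of possible next states.

```python
# NFA simulation by tracking the set of currently possible states (sketch).
# Example NFA for strings ending in "01": from q0 the machine
# nondeterministically guesses where the final "01" begins.
nfa_delta = {
    ("q0", "0"): {"q0", "q1"},  # stay in q0, or guess the final "01" starts here
    ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},        # complete the guessed "01"
}

def nfa_accepts(s: str) -> bool:
    current = {"q0"}
    for ch in s:
        # Follow every possible transition from every possible state.
        current = {t for st in current for t in nfa_delta.get((st, ch), set())}
    return "q2" in current      # accept if any possibility reached q2
```

Note that missing entries (such as reading "0" in q1) simply kill that branch of the computation, which is exactly how nondeterministic "dead ends" behave.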
DFA Example
Consider a DFA that accepts binary strings containing the substring "01".
States: {S0, S1, S2}
Alphabet: {0, 1}
Start state: S0
Accepting states: {S2}
Transitions:
S0 --0--> S1
S0 --1--> S0
S1 --0--> S1
S1 --1--> S2
S2 --0--> S2
S2 --1--> S2
S0 is the state where "0" has not yet been seen. Reading "0" transitions to S1, indicating that a "0" has been encountered. Reading "1" from S1 completes the "01" pattern and transitions to S2, the accepting state, which remains accepting regardless of subsequent input. The automaton performs pattern matching through state transitions alone, without any additional memory.
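The transition table above can be run directly. A minimal Python sketch (using the same state names) encodes δ as a dictionary and loops over the input:

```python
# The "contains 01" DFA as a transition table (sketch, state names as above).
delta = {
    ("S0", "0"): "S1", ("S0", "1"): "S0",
    ("S1", "0"): "S1", ("S1", "1"): "S2",
    ("S2", "0"): "S2", ("S2", "1"): "S2",
}

def accepts(s: str) -> bool:
    state = "S0"
    for ch in s:
        state = delta[(state, ch)]
    return state == "S2"   # S2 is the only accepting state

print(accepts("1101"))  # True: "01" appears at positions 2-3
print(accepts("1110"))  # False: "01" never appears
```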
Converting NFA to DFA
Every NFA can be converted to an equivalent DFA. This conversion is called the subset construction.
The core idea is to map each set of states that could be simultaneously active in the NFA to a single state in the DFA. If the NFA has n states, the DFA could have up to 2ⁿ states. In practice, only reachable states are generated, so the actual number is typically much smaller. However, the possibility of exponential state growth in the worst case is an important consideration.
How does this relate to compilers? When implementing a lexer, the standard procedure is to first convert a regular expression to an NFA (Thompson's construction), then convert the NFA to a DFA (subset construction), and finally minimize the DFA to produce an optimal lexer. Theory translates directly into implementation.
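A rough sketch of the subset construction (ignoring ε-transitions for brevity; the function name and the example NFA are my own): each DFA state is a frozen set of NFA states, and a worklist generates only the subsets actually reachable from the start set.

```python
# Subset construction sketch: DFA states are sets of NFA states.
from collections import deque

def subset_construction(nfa_delta, alphabet, start, accepting):
    dfa_start = frozenset({start})
    dfa_delta = {}
    seen = {dfa_start}
    worklist = deque([dfa_start])
    while worklist:
        S = worklist.popleft()
        for ch in alphabet:
            # Union of all NFA moves from any state in S on ch.
            T = frozenset(t for q in S for t in nfa_delta.get((q, ch), set()))
            dfa_delta[(S, ch)] = T
            if T not in seen:          # generate only reachable subsets
                seen.add(T)
                worklist.append(T)
    dfa_accepting = {S for S in seen if S & accepting}
    return dfa_delta, dfa_start, dfa_accepting

# Example: an NFA for strings ending in "01" (illustrative).
nfa_delta = {("q0", "0"): {"q0", "q1"},
             ("q0", "1"): {"q0"},
             ("q1", "1"): {"q2"}}
delta, start, acc = subset_construction(nfa_delta, {"0", "1"}, "q0", {"q2"})
```

On this 3-state NFA only a handful of the 2³ possible subsets are reachable, illustrating why the exponential worst case rarely appears in practice.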
Limitations of Finite Automata
There are languages that finite automata cannot recognize. A classic example is parenthesis matching. To verify that the number of opening and closing parentheses in a string like "((()))" are equal, you need to count the opening parentheses. But since a finite automaton has only finitely many states, it cannot track parentheses to arbitrary depth.
This limitation manifests directly when processing programming language syntax. Nested blocks, recursive function calls, and parenthesis matching are structures that finite automata simply cannot handle. To address this, a model with unbounded memory in the form of a stack is needed: the pushdown automaton, which we will cover later in this series.
In the next post, we'll look at regular expressions and regular languages, the formal way to describe the languages that finite automata recognize.