Automata and Compilers 06 - Lexical Analysis
The First Gate of the Compiler
When source code enters a compiler, the very first phase it passes through is lexical analysis. To a human reader, source code appears as a meaningful sequence of words and symbols, but from the compiler's perspective, the input is nothing more than a raw stream of characters. The job of the lexical analyzer (the lexer) is to carve that character stream into meaningful units.
You might wonder why the lexer exists as a separate phase at all. Couldn't the parser just read the character stream directly? In theory it could, but doing so would dramatically increase the parser's complexity. Whitespace, comments, and newline characters would all need to be embedded into the parser's grammar. By having the lexer handle these details upfront and deliver a clean token stream, the parser can focus exclusively on the structural grammar of the language.
Tokens, Lexemes, and Patterns
Three concepts are central to lexical analysis.
| Concept | Description | Example |
|---|---|---|
| Token | A grammatical category | IDENTIFIER, NUMBER, IF |
| Lexeme | The actual character string matching a token | count, 42, if |
| Pattern | The rule describing valid lexemes for a token | [a-zA-Z][a-zA-Z0-9]* |
When the lexer processes the source code if (count > 42), it produces the following token stream.
<IF, "if">
<LPAREN, "(">
<IDENTIFIER, "count">
<GT, ">">
<NUMBER, "42">
<RPAREN, ")">
Each token is a pair of its type and its actual value (the lexeme). The parser uses only the token types to apply grammar rules, while the semantic analysis phase later makes use of the lexeme values.
Where Automata Theory Meets Practice
The regular expressions, NFAs, and DFAs covered in earlier posts find their real-world application precisely here, in the lexer. Each token's pattern can be described by a regular expression, and that regular expression is converted to an NFA and then to a DFA, which becomes the engine of the lexer.
The pipeline looks like this.
Regular Expression → (Thompson's Construction) → NFA → (Subset Construction) → DFA → (Minimization) → Minimal DFA
For example, the regular expression [a-zA-Z_][a-zA-Z0-9_]* for recognizing identifiers is first converted into an NFA. Because this NFA contains ε-transitions and is inefficient to execute directly, it is converted to a DFA via subset construction. The resulting DFA then operates as the state machine that recognizes identifiers in the lexer.
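To make the end product concrete, here is a minimal sketch of what the generated DFA amounts to for the identifier pattern: just two states, a start state and an accepting state, simulated by hand (the function name and helper are illustrative, not from any library).

```python
def is_identifier(s):
    """Simulate a two-state DFA for [a-zA-Z_][a-zA-Z0-9_]*."""
    def is_letter(ch):
        return 'a' <= ch <= 'z' or 'A' <= ch <= 'Z'

    state = 0                                  # 0 = start, 1 = accepting
    for ch in s:
        if state == 0 and (is_letter(ch) or ch == '_'):
            state = 1                          # first char: letter or underscore
        elif state == 1 and (is_letter(ch) or '0' <= ch <= '9' or ch == '_'):
            state = 1                          # later chars may also be digits
        else:
            return False                       # no transition: reject
    return state == 1                          # accept only in state 1
```

Running this on "count" or "_tmp1" accepts; "9x" or the empty string rejects, exactly as the pattern dictates.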
What was learned in theory applies directly in practice. Thanks to the closure properties of regular languages, multiple token patterns can be merged into a single combined DFA, which is why a lexer can recognize many different token types in a single pass.
The Maximal Munch Rule
An important principle that lexers follow when recognizing tokens is the maximal munch rule: always produce the longest possible lexeme.
When the input ifcount arrives, the lexer could recognize the keyword IF from the first two characters if. But under the maximal munch rule, it continues reading and finds that the entire string ifcount matches the identifier pattern, so it produces a single token <IDENTIFIER, "ifcount">.
Without this rule, ifcount could be incorrectly split into if + count. The maximal munch rule is implemented naturally in DFA-based lexers. Even when the current state is an accepting state, the lexer continues reading as long as further transitions are possible. When no more transitions can be made, it returns the token corresponding to the last accepting state encountered.
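The "remember the last accepting state" idea can be sketched with a deliberately tiny, made-up DFA: state 2 accepts exactly the keyword "if", while states 1 and 3 accept lowercase identifiers. DELTA, step, ACCEPTING, and next_token are all hypothetical names for illustration.

```python
# Hypothetical DFA: IF matches exactly "if", IDENT matches [a-z]+.
DELTA = {
    (0, 'i'): 1,    # start --'i'--> possible keyword
    (1, 'f'): 2,    # "if" fully matched
}

def step(state, ch):
    if (state, ch) in DELTA:
        return DELTA[(state, ch)]
    if 'a' <= ch <= 'z':
        return 3                    # generic identifier state
    return None                     # no transition

ACCEPTING = {1: 'IDENT', 2: 'IF', 3: 'IDENT'}

def next_token(text, pos):
    state, last, i = 0, None, pos
    while i < len(text):
        nxt = step(state, text[i])
        if nxt is None:
            break                   # no further transition possible
        state = nxt
        i += 1
        if state in ACCEPTING:
            last = (i, ACCEPTING[state])   # remember the longest accept so far
    if last is None:
        raise SyntaxError(f'lexical error at {pos}')
    end, kind = last
    return kind, text[pos:end], end
```

On "ifcount" the driver passes through the IF-accepting state but keeps going, finally returning the longer IDENT match; on "if(" it stops at "(" and falls back to the last accept, IF.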
Handling Whitespace and Comments
One of the lexer's important responsibilities is filtering out elements that are irrelevant to the parser: spaces, tabs, newlines, and comments. In most languages, these serve only to separate tokens and carry no grammatical meaning.
Not all languages work this way, however. In Python, indentation carries grammatical significance, so the lexer must generate INDENT and DEDENT tokens. To do this, the lexer maintains a stack of indentation levels, processing that goes beyond the scope of pure regular languages.
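The indentation stack can be sketched as follows. This is a simplified model assuming space-only indentation (a real lexer also normalizes tabs and rejects indentation that matches no open level); the function name is illustrative.

```python
def indent_tokens(lines):
    """Turn changes in leading whitespace into INDENT/DEDENT tokens."""
    stack = [0]                            # open indentation levels
    tokens = []
    for line in lines:
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:
            stack.append(width)            # deeper: open a block
            tokens.append('INDENT')
        while width < stack[-1]:
            stack.pop()                    # shallower: close blocks
            tokens.append('DEDENT')
        tokens.append(('LINE', line.strip()))
    while stack[-1] > 0:                   # close any blocks open at EOF
        stack.pop()
        tokens.append('DEDENT')
    return tokens
```

For the three lines "a:", "  b", "c", this yields a LINE, an INDENT, a LINE, a DEDENT, and a final LINE, mirroring how Python's lexer brackets the nested block.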
Comment handling is not always straightforward either. Single-line comments from // to the end of the line are simple enough, but block comments /* ... */ in languages that allow nesting cannot be handled by regular expressions alone. Such cases require additional logic in the lexer.
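The extra logic for nested block comments is essentially a depth counter, which is exactly what a finite automaton cannot keep unbounded. A sketch (the function name is illustrative):

```python
def skip_nested_comment(text, pos):
    """Return the index just past a nested /* ... */ comment starting at pos."""
    assert text.startswith('/*', pos)
    depth, i = 0, pos
    while i < len(text):
        if text.startswith('/*', i):
            depth += 1                     # entering a (nested) comment
            i += 2
        elif text.startswith('*/', i):
            depth -= 1                     # closing one level
            i += 2
            if depth == 0:
                return i                   # outermost comment closed
        else:
            i += 1
    raise SyntaxError('unterminated comment')
```

Given "/* a /* b */ c */x", the counter reaches depth 2 and only returns once the outer comment closes, landing just before the "x".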
Implementing a Simple Lexer
To understand the core principles, consider a simple lexer that recognizes integers, identifiers, and a few operators.
import re

# Token patterns, listed in priority order: earlier patterns win ties.
TOKEN_SPEC = [
    ('NUMBER',   r'\d+'),            # integer literals
    ('IDENT',    r'[a-zA-Z_]\w*'),   # identifiers
    ('PLUS',     r'\+'),
    ('MINUS',    r'-'),
    ('LPAREN',   r'\('),
    ('RPAREN',   r'\)'),
    ('SKIP',     r'[ \t]+'),         # spaces and tabs: discarded
    ('NEWLINE',  r'\n'),
    ('MISMATCH', r'.'),              # any other character: lexical error
]

# Combine all patterns into one master regex with named groups.
MASTER_PAT = '|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC)

def tokenize(code):
    tokens = []
    for mo in re.finditer(MASTER_PAT, code):
        kind = mo.lastgroup          # name of the group that matched
        value = mo.group()
        if kind in ('SKIP', 'NEWLINE'):
            continue                 # whitespace produces no tokens here
        elif kind == 'MISMATCH':
            raise SyntaxError(f'Unexpected character: {value!r}')
        else:
            tokens.append((kind, value))
    return tokens
This implementation delegates token recognition to the regular expression engine. It combines all token patterns into a single master regular expression and determines the token type from the name of the group that matched. Because Python's alternation tries alternatives from left to right, priority is determined by the order in which patterns are listed; keyword patterns (written with a trailing \b so they do not clip a longer identifier such as ifcount) would have to appear before the identifier pattern for correct behavior.
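An alternative to ordering keyword patterns first, and one many hand-written lexers use, is to match everything as an identifier and then reclassify lexemes found in a keyword table. A stripped-down sketch (KEYWORDS and the function name are illustrative):

```python
import re

KEYWORDS = {'if', 'else', 'while'}           # hypothetical keyword set

TOKEN_SPEC = [
    ('NUMBER', r'\d+'),
    ('IDENT',  r'[a-zA-Z_]\w*'),
    ('SKIP',   r'\s+'),
]
MASTER_PAT = '|'.join(f'(?P<{n}>{p})' for n, p in TOKEN_SPEC)

def tokens_with_keywords(code):
    for mo in re.finditer(MASTER_PAT, code):
        kind, value = mo.lastgroup, mo.group()
        if kind == 'SKIP':
            continue
        if kind == 'IDENT' and value in KEYWORDS:
            kind = value.upper()             # reclassify keyword lexemes
        yield (kind, value)
```

This keeps maximal munch intact for free: "ifcount" matches as one identifier and is never reclassified, while a standalone "if" becomes an IF token.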
Lexer Generators
In production compilers, writing a lexer by hand every time is impractical. A lexer generator takes a specification of token patterns as input and automatically produces lexer code.
The most well-known examples are Lex and its open-source implementation Flex. These tools take a .l file containing regular expressions paired with actions and transform it into DFA-based C code.
%%
[0-9]+ { return NUMBER; }
[a-zA-Z_][a-zA-Z0-9_]*    { return IDENTIFIER; }
"+" { return PLUS; }
[ \t\n] { /* skip whitespace */ }
. { error("unexpected character"); }
%%
Internally, lexer generators perform the entire regular expression → NFA → DFA → minimal DFA pipeline automatically. Automata theory is delivered to the developer in the form of a tool.
Error Handling
When the lexer encounters a character that does not match any token pattern, a lexical error occurs. Several strategies can be employed at this point.
The simplest approach is to report the error immediately and halt compilation. But this means only one error can be discovered at a time when the source code contains multiple errors, which is inefficient. Practical compilers adopt a strategy of skipping the offending character and continuing, reporting as many errors as possible in a single pass. This is called panic mode recovery, and at the lexer level, it is relatively straightforward to implement.
Error messages should include the file name, line number, and column number so that developers can quickly locate the problem. To provide this information, the lexer must track position data while recognizing tokens, and this information is used for error reporting throughout all subsequent phases of the compiler.
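Position tracking and panic-mode recovery fit naturally into the master-regex style of lexer shown earlier. In this sketch, mismatches are collected with their line and column instead of halting, so one pass reports every lexical error (patterns and names are illustrative, not a full language):

```python
import re

TOKEN_SPEC = [
    ('NUMBER',   r'\d+'),
    ('IDENT',    r'[a-zA-Z_]\w*'),
    ('NEWLINE',  r'\n'),
    ('SKIP',     r'[ \t]+'),
    ('MISMATCH', r'.'),
]
MASTER_PAT = '|'.join(f'(?P<{n}>{p})' for n, p in TOKEN_SPEC)

def tokenize_with_recovery(code):
    tokens, errors = [], []
    line, line_start = 1, 0                  # current line, offset of its start
    for mo in re.finditer(MASTER_PAT, code):
        kind, value = mo.lastgroup, mo.group()
        col = mo.start() - line_start + 1    # 1-based column
        if kind == 'NEWLINE':
            line += 1
            line_start = mo.end()
        elif kind == 'SKIP':
            continue
        elif kind == 'MISMATCH':
            # Panic mode: record the error and skip the offending character.
            errors.append(f'line {line}, col {col}: unexpected {value!r}')
        else:
            tokens.append((kind, value, line, col))
    return tokens, errors
```

Feeding it input with two bad characters on different lines yields all valid tokens plus both error messages, each carrying a usable position.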
The Lexer's Place in the Pipeline
The lexer's position within the overall compiler pipeline can be summarized as follows.
Source Code → [Lexer] → Token Stream → [Parser] → Syntax Tree → [Semantic Analysis] → ...
The lexer operates in close coordination with the parser. In most implementations, the parser requests tokens one at a time and the lexer produces them on demand, a demand-driven approach. Rather than tokenizing the entire source at once, the lexer hands over one token each time the parser asks for one.
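In Python, this demand-driven arrangement maps directly onto a generator: each next() call from the consumer advances the lexer by exactly one token. A minimal sketch (patterns are illustrative):

```python
import re

TOKEN_SPEC = [
    ('NUMBER', r'\d+'),
    ('IDENT',  r'[a-zA-Z_]\w*'),
    ('SKIP',   r'\s+'),
]
MASTER_PAT = '|'.join(f'(?P<{n}>{p})' for n, p in TOKEN_SPEC)

def token_stream(code):
    """Yield one token at a time; the parser pulls tokens on demand."""
    for mo in re.finditer(MASTER_PAT, code):
        if mo.lastgroup != 'SKIP':
            yield (mo.lastgroup, mo.group())

# A parser would consume the stream lazily:
stream = token_stream('x 42')
print(next(stream))   # ('IDENT', 'x')
print(next(stream))   # ('NUMBER', '42')
```

Nothing past the requested token is analyzed until the parser asks again, which is exactly the coordination described above.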
With the token stream produced by the lexer, the task of assembling tokens into grammatical structures falls to the parser. In the next post, we'll look at the first approach to parsing: top-down parsing.