Automata and Compilers 02 - Regular Expressions and Regular Languages
Describing the Languages That Finite Automata Recognize
In the previous post, we saw that finite automata process input purely through state transitions. But how far does the reach of finite automata extend, and how can we describe that reach concisely? The answer to this question is the regular expression.
If you've ever used grep or a text editor's search feature, you've worked with regular expressions. But have you considered why they carry the name "regular"? The name comes from regular languages in the mathematical sense — the set of languages that regular expressions can describe is exactly the set of languages that finite automata can recognize.
The Three Operations of Regular Expressions
Regular expressions are built from three fundamental operations.
| Operation | Notation | Meaning |
|---|---|---|
| Union | R₁ ∪ R₂ | Strings belonging to R₁ or R₂ |
| Concatenation | R₁R₂ | A string from R₁ followed by a string from R₂ |
| Kleene Star | R* | Zero or more repetitions of R |
These three operations alone can express a surprisingly wide range of patterns. For example, over the alphabet {0, 1}, the regular expression (0 ∪ 1)*01 denotes the set of all binary strings ending in "01". The (0 ∪ 1)* part matches any binary string, and the trailing "01" constrains the match to strings that end with that specific suffix.
Practical regex engines add convenience operators like + (one or more repetitions), ? (zero or one occurrence), and character classes, but these are all shorthand for the three basic operations. R+ is simply RR*, and R? is R ∪ ε. From an expressiveness standpoint, the three fundamental operations are sufficient.
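Both the (0 ∪ 1)*01 example and the shorthand equivalences are easy to check with Python's standard `re` module, which writes union as `|`:

```python
import re

# The textbook expression (0 ∪ 1)*01 in Python's syntax: union becomes "|".
ends_in_01 = re.compile(r"(0|1)*01")

assert ends_in_01.fullmatch("1101") is not None
assert ends_in_01.fullmatch("01") is not None
assert ends_in_01.fullmatch("10") is None   # does not end in "01"

# R+ is shorthand for RR*: the two patterns accept exactly the same strings.
for s in ["", "a", "aa", "aaa", "b"]:
    assert (re.fullmatch(r"a+", s) is None) == (re.fullmatch(r"aa*", s) is None)

# R? is shorthand for R ∪ ε: "a?" accepts exactly "" and "a".
assert re.fullmatch(r"a?", "") is not None
assert re.fullmatch(r"a?", "a") is not None
assert re.fullmatch(r"a?", "aa") is None
```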
Kleene's Theorem — Equivalence of Regular Expressions and Finite Automata
A deep connection exists between regular expressions and finite automata. Stephen Kleene proved that a language being describable by a regular expression and being recognizable by a finite automaton are equivalent conditions. Every regular expression has a corresponding finite automaton, and every finite automaton has a corresponding regular expression.
This equivalence is proved in both directions. The conversion from regular expressions to NFAs uses Thompson's construction: for each basic operation (union, concatenation, Kleene star), a corresponding NFA fragment is created, and these fragments are recursively combined to build an NFA for the entire expression. The reverse direction — from finite automata to regular expressions — can be accomplished through state elimination, where states are removed one by one and transition labels are expanded into regular expressions, ultimately yielding an expression that describes all paths from the start state to an accepting state.
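To make the regex → NFA direction concrete, here is a minimal sketch of Thompson's construction in Python. The state representation (plain dicts, with `None` as the ε-move key) is an assumption of this sketch, not a standard API, and acceptance is tested by direct NFA simulation with ε-closures rather than by first converting to a DFA:

```python
# Each Thompson fragment is a (start, accept) pair of states. A state is a
# dict mapping a symbol to a list of successor states; the key None marks
# an ε-move. Every fragment has exactly one start and one accept state.

def new_state():
    return {}

def lit(c):
    """Fragment accepting the single symbol c."""
    s, a = new_state(), new_state()
    s[c] = [a]
    return s, a

def concat(f1, f2):
    """R1 R2: ε-move glues f1's accept state to f2's start state."""
    s1, a1 = f1; s2, a2 = f2
    a1.setdefault(None, []).append(s2)
    return s1, a2

def union(f1, f2):
    """R1 ∪ R2: new start branches into both; both accepts merge."""
    s1, a1 = f1; s2, a2 = f2
    s, a = new_state(), new_state()
    s[None] = [s1, s2]
    a1.setdefault(None, []).append(a)
    a2.setdefault(None, []).append(a)
    return s, a

def star(f):
    """R*: skip the fragment entirely, or loop through it any number of times."""
    s1, a1 = f
    s, a = new_state(), new_state()
    s[None] = [s1, a]                          # zero repetitions allowed
    a1.setdefault(None, []).extend([s1, a])    # loop back or leave
    return s, a

def eps_closure(states):
    """All states reachable from `states` by ε-moves alone."""
    stack, seen, out = list(states), set(map(id, states)), list(states)
    while stack:
        for nxt in stack.pop().get(None, []):
            if id(nxt) not in seen:
                seen.add(id(nxt)); out.append(nxt); stack.append(nxt)
    return out

def accepts(frag, text):
    start, accept = frag
    current = eps_closure([start])
    for c in text:
        current = eps_closure([n for st in current for n in st.get(c, [])])
    return any(st is accept for st in current)

# (0 ∪ 1)*01 built directly from the three operations
pattern = concat(concat(star(union(lit("0"), lit("1"))), lit("0")), lit("1"))

assert accepts(pattern, "1101")
assert accepts(pattern, "01")
assert not accepts(pattern, "10")
assert not accepts(pattern, "")
```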
The significance of this theorem for compilers is clear. When building a lexer, token patterns are described as regular expressions, which can then be mechanically transformed through the NFA → DFA → minimized DFA pipeline to automatically generate an efficient lexer. Tools like lex and flex are built on exactly this principle.
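The NFA → DFA step of that pipeline is the classic subset construction: each DFA state is the *set* of NFA states the machine could be in. The sketch below runs it on a hand-built three-state NFA for (0 ∪ 1)*01; the transition table is illustrative, and ε-moves are omitted to keep things short:

```python
# NFA for (0 ∪ 1)*01 as a transition table: state -> symbol -> successor set.
# State 0 loops on anything, guesses when the final "01" begins.
nfa = {
    0: {"0": {0, 1}, "1": {0}},
    1: {"1": {2}},
    2: {},
}
nfa_start, nfa_accepting = 0, {2}

def subset_construction(nfa, start, accepting, alphabet):
    """Determinize an ε-free NFA: DFA states are frozensets of NFA states."""
    dfa = {}
    start_set = frozenset({start})
    work = [start_set]
    while work:
        S = work.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in alphabet:
            T = frozenset(t for s in S for t in nfa.get(s, {}).get(a, ()))
            dfa[S][a] = T
            if T not in dfa:
                work.append(T)
    return dfa, start_set, {S for S in dfa if S & accepting}

dfa, dstart, daccept = subset_construction(nfa, nfa_start, nfa_accepting, {"0", "1"})

def run(text):
    S = dstart
    for c in text:
        S = dfa[S][c]
    return S in daccept

assert run("1101") and run("01")
assert not run("10") and not run("")
```

Although the subset construction can produce exponentially many states in the worst case, this NFA determinizes into just three reachable DFA states.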
Examples of Regular and Non-Regular Languages
Contrasting regular languages with languages that are not regular makes the character of regular languages more apparent.
Examples of regular languages include binary strings containing an even number of 0s, strings containing a particular substring, programming language identifier patterns ([a-zA-Z_][a-zA-Z0-9_]*), integer literals, and floating-point literals. These patterns can all be decided with a finite number of states, never requiring unbounded memory to track how many times a particular symbol has appeared.
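The first of these, binary strings with an even number of 0s, needs only two states, one per parity of the count so far; a minimal sketch:

```python
# Two-state DFA for "binary strings with an even number of 0s".
# The state stores only the parity of the 0s seen, never the count itself.
def even_zeros(s):
    state = "even"          # start state, also the accepting state
    for c in s:
        if c == "0":
            state = "odd" if state == "even" else "even"
    return state == "even"

assert even_zeros("")       # zero 0s: even
assert even_zeros("1")      # no 0s at all
assert even_zeros("0101")   # two 0s
assert not even_zeros("0")  # one 0: odd
```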
Examples of non-regular languages include {aⁿbⁿ | n ≥ 0} (strings where the number of a's equals the number of b's), properly nested parentheses, and palindromes. What these have in common is a requirement for unbounded counting or matching that finite states alone cannot track.
The Pumping Lemma — Proving a Language Is Not Regular
The pumping lemma is the standard tool for rigorously proving that a language is not regular.
The core intuition is straightforward. Because a DFA recognizing a regular language has finitely many states, processing a sufficiently long string must cause the automaton to revisit the same state at least twice. Revisiting a state means the substring between the two visits can be repeated or removed without affecting acceptance. This repetition is called "pumping."
Formally, for a regular language L, there exists a number p (the pumping length) such that every string s ∈ L with length at least p can be decomposed as s = xyz satisfying three conditions.
1. |xy| ≤ p
2. |y| ≥ 1
3. For all i ≥ 0, xy^i z ∈ L
Using this, one can prove that {aⁿbⁿ | n ≥ 0} is not regular. Assume it is regular, so a pumping length p exists. Choose the string aᵖbᵖ. By condition 1, |xy| ≤ p, so y lies entirely within the first p characters, all of which are a's; by condition 2, y is nonempty. Pumping y therefore increases the count of a's while leaving the b's unchanged, so xy²z has more a's than b's and does not belong to the language. This contradicts condition 3, proving the language is not regular.
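The argument can be checked mechanically for a small pumping length: enumerate every decomposition the lemma permits and confirm that pumping always leaves the language. The choice p = 5 below is arbitrary, purely for illustration:

```python
# Empirical illustration of the pumping-lemma argument for L = {aⁿbⁿ | n ≥ 0}.
def in_language(s):
    n = s.count("a")
    return s == "a" * n + "b" * n

p = 5
s = "a" * p + "b" * p
assert in_language(s)

# Every split s = xyz with |xy| ≤ p and |y| ≥ 1 puts y inside the block of
# a's; pumping with i = 2 then breaks the a/b balance in every single case,
# so no decomposition satisfies the lemma and L cannot be regular.
for xy_len in range(1, p + 1):
    for y_len in range(1, xy_len + 1):
        x = s[: xy_len - y_len]
        y = s[xy_len - y_len : xy_len]
        z = s[xy_len:]
        assert not in_language(x + y * 2 + z)
```

Note that this loop is a demonstration for one value of p, not a proof; the proof is the contradiction argument above, which holds for every p.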
Connection to Lexical Analysis
The most direct application of regular expressions and regular languages in compilers is the lexer (lexical analyzer). The lexer's job is to break a stream of source code characters into meaningful tokens — keywords, identifiers, operators, literals, and so on.
Each token type is described by a regular expression. In C, for instance, identifiers match [a-zA-Z_][a-zA-Z0-9_]*, integer literals match [0-9]+, and floating-point literals match [0-9]+\.[0-9]+([eE][+-]?[0-9]+)?. By combining these regular expressions and converting the result into a single DFA, an efficient lexer can tokenize input in a single pass over the source text.
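In that spirit, here is a toy lexer that combines the token regexes into one alternation with named groups, in the style of lex/flex but driven by Python's `re` rather than a generated DFA. The token set and pattern names are illustrative, not C's full lexical grammar, and the exponent group is written as non-capturing so `Match.lastgroup` reliably reports the token name:

```python
import re

# Order matters: FLOAT before INT so "3.14" is not split into INT "." INT.
TOKEN_SPEC = [
    ("FLOAT", r"[0-9]+\.[0-9]+(?:[eE][+-]?[0-9]+)?"),
    ("INT",   r"[0-9]+"),
    ("IDENT", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("OP",    r"[+\-*/=]"),
    ("SKIP",  r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    """Single left-to-right pass; each match consumes the longest prefix
    its winning pattern allows, mirroring how a lexer DFA operates."""
    tokens, pos = [], 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":           # discard whitespace
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

assert tokenize("x1 = 3.14 + 42") == [
    ("IDENT", "x1"), ("OP", "="), ("FLOAT", "3.14"), ("OP", "+"), ("INT", "42"),
]
```

A production lexer generator would compile this alternation into a single minimized DFA instead of backtracking through alternatives, but the observable tokenization is the same.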
The limitations of regular languages are also the limitations of the lexer. Structures like nested comments (/* /* */ */) or parenthesis matching cannot be handled by regular expressions alone — they require the next stage of the compiler, the parser. This is the fundamental reason compilers separate lexing and parsing into distinct phases.
In the next post, we'll look at context-free grammars — a grammatical framework that goes beyond regular languages.