Automata and Compilers 03 - Context-Free Grammars
Why Regular Languages Are Not Enough
In the previous post, we established the equivalence of regular expressions and finite automata, and confirmed the limits of regular languages through the pumping lemma. In the syntax of programming languages, however, structures that regular languages cannot handle play a central role. Nested parentheses, recursive function calls, and nested if-else blocks all require bookkeeping that a finite set of states cannot provide.
The fact that {aⁿbⁿ | n ≥ 0} is not a regular language means that even "matching the count of opening and closing parentheses" is impossible with finite automata. Does a more powerful grammatical framework exist that can describe such structures? It does, and it is the context-free grammar (CFG).
Definition of Context-Free Grammars
A context-free grammar is defined by four components.
| Component | Description |
|---|---|
| V | Finite set of non-terminal symbols |
| Σ | Finite set of terminal symbols |
| R | Finite set of production rules |
| S | Start symbol, a non-terminal in V |
A production rule has the form A → α, where A is a non-terminal and α is a possibly empty string composed of terminals and non-terminals. The name "context-free" comes from the fact that when substituting a non-terminal A, the same rule can always be applied regardless of the surrounding context. A can be replaced by α wherever it appears, and this is the key distinction from context-sensitive grammars.
Non-terminals represent structures that still need to be expanded, while terminals represent the actual characters that appear in the input. In programming languages, non-terminals correspond to syntactic categories like "expression," "statement," and "block," while terminals correspond to individual tokens such as keywords, identifiers, operators, and literals.
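As a sketch, the four components can be written down as plain data. The representation below (dictionary keys, uppercase non-terminals, `[]` for ε) is purely illustrative, not a standard library API:

```python
# A context-free grammar as plain Python data (illustrative encoding).
# Non-terminals are uppercase strings, terminals lowercase strings,
# and each production's right-hand side is a list of symbols ([] = epsilon).
grammar = {
    "V": {"S"},                       # non-terminal symbols
    "Sigma": {"a", "b"},              # terminal symbols
    "R": {                            # production rules, keyed by left-hand side
        "S": [["a", "S", "b"], []],   # S -> aSb | epsilon
    },
    "start": "S",                     # start symbol, must be in V
}

# Sanity checks on the four-component definition.
assert grammar["start"] in grammar["V"]
assert grammar["V"].isdisjoint(grammar["Sigma"])
```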
Derivations and Parse Trees
The process of generating a string from a context-free grammar is called a derivation. Starting from the start symbol and repeatedly applying production rules eventually produces a string consisting entirely of terminals. This process can proceed in two ways: a leftmost derivation, which always substitutes the leftmost non-terminal first, and a rightmost derivation, which always substitutes the rightmost non-terminal first.
A parse tree is the tree-structured representation of a derivation. The root node is the start symbol, internal nodes are non-terminals, and leaf nodes are terminals. Parse trees visually display the syntactic structure of a string and play a central role in how compilers analyze the meaning of source code.
As a simple example, consider the following grammar.
S → aSb | ε
This grammar generates exactly {aⁿbⁿ | n ≥ 0}. Each application of S → aSb adds one a and one b, and the derivation terminates when S → ε is applied. A language that was impossible to describe with regular languages is expressed in just two production rules with a context-free grammar.
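A leftmost derivation of aabb under this grammar can be traced mechanically. The helper below is a minimal sketch, assuming symbols are single-character strings and the rule set from above:

```python
# Leftmost derivation of "aabb" under S -> aSb | epsilon (sketch).
rules = {"S": [["a", "S", "b"], []]}  # [] stands for epsilon

def leftmost_step(form, production):
    """Replace the leftmost non-terminal in `form` with `production`."""
    i = next(i for i, sym in enumerate(form) if sym in rules)
    return form[:i] + production + form[i + 1:]

form = ["S"]
steps = [form]
form = leftmost_step(form, ["a", "S", "b"])  # S => aSb
steps.append(form)
form = leftmost_step(form, ["a", "S", "b"])  # aSb => aaSbb
steps.append(form)
form = leftmost_step(form, [])               # aaSbb => aabb
steps.append(form)

print(["".join(f) for f in steps])  # ['S', 'aSb', 'aaSbb', 'aabb']
```

Each application of S → aSb visibly adds one a on the left and one b on the right, which is exactly why the counts always match.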
A Grammar for Arithmetic Expressions
As a practical example, consider a grammar that describes arithmetic expressions.
E → E + T | T
T → T * F | F
F → ( E ) | id
Here E stands for expression, T for term, and F for factor. This grammar appears simple, but it encodes an important design decision. By defining E through T and T through F, operator precedence is reflected in the grammar structure itself. Multiplication (*) appears lower in the parse tree than addition (+), which naturally ensures that multiplication is evaluated first.
For the input id + id * id, constructing a parse tree under this grammar places the id * id portion under a T node first, and then combines that result with id + to form the E node. Operator precedence is determined automatically by the hierarchical structure of the grammar.
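One way to see the precedence encoding in action is a small recursive-descent evaluator. The sketch below rewrites the left-recursive rules E → E + T and T → T * F as loops (the standard left-recursion elimination) and, purely for the demo, treats id as an integer literal:

```python
# Minimal recursive-descent evaluator for the E/T/F grammar (sketch).
# Left recursion (E -> E + T, T -> T * F) is handled with while-loops.
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        assert peek() == tok, f"expected {tok}, got {peek()}"
        pos += 1

    def factor():              # F -> ( E ) | id
        if peek() == "(":
            eat("(")
            val = expr()
            eat(")")
            return val
        tok = peek()
        eat(tok)
        return int(tok)        # demo only: ids are integer literals

    def term():                # T -> T * F | F, rewritten as F { * F }
        val = factor()
        while peek() == "*":
            eat("*")
            val *= factor()
        return val

    def expr():                # E -> E + T | T, rewritten as T { + T }
        val = term()
        while peek() == "+":
            eat("+")
            val += term()
        return val

    result = expr()
    assert peek() is None, "trailing input"
    return result

print(parse(["2", "+", "3", "*", "4"]))  # 14, because * binds tighter than +
```

Because `expr` calls `term` and `term` calls `factor`, multiplication is grouped before addition without any explicit precedence table, mirroring the grammar's E/T/F layering.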
Ambiguity
A grammar is called ambiguous if there exist two or more distinct parse trees for a single string. Ambiguity poses a serious problem in compiler design because different parse trees can lead to different program meanings.
A classic example is the dangling else ambiguity.
S → if E then S
| if E then S else S
| other
In the statement if E₁ then if E₂ then S₁ else S₂, the grammar alone does not determine whether the else belongs to the outer if or the inner if. Most programming languages adopt the rule that "else matches the nearest if," but this rule is enforced by an additional convention rather than by the grammar itself.
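The ambiguity can be made concrete by writing both parse trees down as data. The tuple shape `("if", condition, then_branch, else_branch_or_None)` is an illustrative encoding, not a standard one:

```python
# The two parse trees for: if E1 then if E2 then S1 else S2
# Each node is ("if", condition, then_branch, else_branch) with
# else_branch = None when the else is absent.

# Reading 1: else binds to the inner if (the convention most languages adopt).
inner_binding = ("if", "E1", ("if", "E2", "S1", "S2"), None)

# Reading 2: else binds to the outer if.
outer_binding = ("if", "E1", ("if", "E2", "S1", None), "S2")

# Same token string, two distinct trees: the grammar is ambiguous.
assert inner_binding != outer_binding
```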
Several approaches exist for resolving ambiguity. The grammar can be rewritten to eliminate the ambiguity, or the parser can be augmented with precedence rules and associativity rules. Regardless of the approach chosen, the crucial point is that using an ambiguous grammar as-is can cause a compiler to produce different results for the same code.
BNF Notation
In practice, context-free grammars are typically written using BNF (Backus-Naur Form) notation. Introduced by John Backus and Peter Naur to describe the syntax of ALGOL 60, this notation has become the standard way to write production rules.
<expr> ::= <expr> "+" <term> | <term>
<term> ::= <term> "*" <factor> | <factor>
<factor> ::= "(" <expr> ")" | id
Non-terminals are enclosed in angle brackets (< >), ::= means "is defined as," and | separates alternatives. The extended version, EBNF (Extended BNF), adds notation for repetition ({ }) and optional elements ([ ]), allowing grammars to be expressed more concisely. Today, most programming language specifications use some variant of BNF or EBNF to define their syntax.
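For example, using the repetition braces, the left-recursive arithmetic grammar above can be rewritten without left recursion (one possible EBNF rendering):

<expr> ::= <term> { "+" <term> }
<term> ::= <factor> { "*" <factor> }
<factor> ::= "(" <expr> ")" | id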
Where Context-Free Grammars Fit
In the Chomsky hierarchy, context-free grammars occupy the second level — more powerful than regular grammars but weaker than context-sensitive grammars. Every regular language is also a context-free language, but the converse does not hold. {aⁿbⁿ} is the canonical counterexample.
Programming language syntax is largely expressible with context-free grammars, which is why compiler parsers are designed around them. However, rules like "variables must be declared before use" or "types must match" exceed the scope of context-free grammars and are handled in a separate semantic analysis phase.
In the next post, we'll look at pushdown automata — the computational model capable of recognizing context-free languages.