What Is Parsing?

Once the lexer has transformed source code into a token stream, the next task is to verify that those tokens are arranged according to the language's grammar rules and to uncover their structure. This is the role of parsing. Parsing is essentially the reverse of derivation. If derivation is the process by which a grammar produces a string starting from the start symbol, then a parser works in the opposite direction — given a string, it finds the path back to the start symbol.

Parsing methods fall into two broad categories. Top-down methods start from the start symbol and work downward toward the input string. Bottom-up methods start from the input string and work upward toward the start symbol. This post covers top-down parsing.

Recursive Descent Parsers

The most intuitive implementation of top-down parsing is the recursive descent parser. The idea is to write one function for each nonterminal symbol in the grammar, with these functions calling each other recursively to analyze the input.

Consider the following simple expression grammar.

E → T + E | T
T → ( E ) | id | num

A recursive descent parser for this grammar takes the following form.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return None

    def consume(self, expected_type):
        token = self.peek()
        if token and token[0] == expected_type:
            self.pos += 1
            return token
        found = token[0] if token else 'end of input'
        raise SyntaxError(f'Expected {expected_type}, found {found}')

    def parse_E(self):
        # E → T + E | T (the common prefix T is factored in code)
        node = self.parse_T()
        if self.peek() and self.peek()[0] == 'PLUS':
            self.consume('PLUS')
            right = self.parse_E()
            return ('ADD', node, right)
        return node

    def parse_T(self):
        # T → ( E ) | id | num
        token = self.peek()
        if token is None:
            raise SyntaxError('Unexpected end of input')
        if token[0] == 'LPAREN':
            self.consume('LPAREN')
            node = self.parse_E()
            self.consume('RPAREN')
            return node
        elif token[0] == 'IDENT':
            return self.consume('IDENT')
        elif token[0] == 'NUMBER':
            return self.consume('NUMBER')
        raise SyntaxError(f'Unexpected token {token[0]}')

Because each grammar rule maps directly to a function, the structure is clear and debugging is straightforward. In practice, many production compilers — including GCC, Clang, and the Go compiler — use recursive descent parsing. It allows flexible error handling and fine-grained control without depending on a parser generator.

Predictive Parsing and LL(1)

For a recursive descent parser to work correctly, it must be possible to decide which grammar rule to apply by looking at just the current token. Can a single token always determine the right choice? Not for every grammar. But for a class of grammars known as LL(1), it can.

The name LL(1) encodes three things. The first L means the input is read left-to-right. The second L means the parser produces a leftmost derivation. The 1 means exactly one token of lookahead is needed to make each decision.

For an LL(1) parser to choose the correct rule, two sets must be precomputed: the FIRST set and the FOLLOW set.

FIRST and FOLLOW

FIRST(α) is the set of all terminal symbols that can appear as the first symbol of any string derived from α. If α can derive the empty string (ε), then ε is also included in FIRST(α).

FOLLOW(A) is the set of terminal symbols that can appear immediately after the nonterminal A in some derivation. The FOLLOW set of the start symbol always includes $, representing the end of input.

Consider the following grammar.

E  → T E'
E' → + T E' | ε
T  → F T'
T' → * F T' | ε
F  → ( E ) | id

The FIRST and FOLLOW sets for this grammar are as follows.

Symbol   FIRST            FOLLOW
E        { (, id }        { $, ) }
E'       { +, ε }         { $, ) }
T        { (, id }        { +, $, ) }
T'       { *, ε }         { +, $, ) }
F        { (, id }        { *, +, $, ) }

These sets enable the construction of a predictive parsing table. Given a nonterminal A and the next input token a, the choice of which production A → α to apply is determined as follows. If a is in FIRST(α), select that rule. If α can derive ε and a is in FOLLOW(A), select the A → ε rule. If every cell in the table contains at most one rule, the grammar is LL(1).
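These rules translate into code directly. The sketch below computes FIRST and FOLLOW by fixpoint iteration and builds the predictive table for the grammar above; the grammar encoding and function names are illustrative choices, not a standard API.

```python
# Illustrative FIRST/FOLLOW computation and predictive-table construction
# for the expression grammar above. 'ε' marks the empty string.
EPS = 'ε'
GRAMMAR = {
    'E':  [['T', "E'"]],
    "E'": [['+', 'T', "E'"], [EPS]],
    'T':  [['F', "T'"]],
    "T'": [['*', 'F', "T'"], [EPS]],
    'F':  [['(', 'E', ')'], ['id']],
}
NONTERMINALS = set(GRAMMAR)

def first_of_seq(seq, first):
    """FIRST of a sequence of grammar symbols."""
    result = set()
    for sym in seq:
        syms = first[sym] if sym in NONTERMINALS else {sym}
        result |= syms - {EPS}
        if EPS not in syms:          # sym cannot vanish; stop here
            return result
    result.add(EPS)                  # every symbol can derive ε
    return result

def compute_first():
    first = {nt: set() for nt in NONTERMINALS}
    changed = True
    while changed:                   # iterate to a fixpoint
        changed = False
        for nt, prods in GRAMMAR.items():
            for prod in prods:
                before = len(first[nt])
                first[nt] |= first_of_seq(prod, first)
                changed |= len(first[nt]) != before
    return first

def compute_follow(first, start='E'):
    follow = {nt: set() for nt in NONTERMINALS}
    follow[start].add('$')           # end-of-input follows the start symbol
    changed = True
    while changed:
        changed = False
        for nt, prods in GRAMMAR.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in NONTERMINALS:
                        continue
                    tail = first_of_seq(prod[i + 1:], first)
                    before = len(follow[sym])
                    follow[sym] |= tail - {EPS}
                    if EPS in tail:  # the tail can vanish
                        follow[sym] |= follow[nt]
                    changed |= len(follow[sym]) != before
    return follow

def build_table(first, follow):
    table = {}
    for nt, prods in GRAMMAR.items():
        for prod in prods:
            f = first_of_seq(prod, first)
            for t in f - {EPS}:
                assert (nt, t) not in table, 'grammar is not LL(1)'
                table[(nt, t)] = prod
            if EPS in f:             # use FOLLOW(nt) for the ε rule
                for t in follow[nt]:
                    assert (nt, t) not in table, 'grammar is not LL(1)'
                    table[(nt, t)] = prod
    return table
```

Running this on the grammar reproduces the table above, and the asserts in build_table double as the LL(1) check: they fire exactly when some cell would receive two rules.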

Eliminating Left Recursion

A problem that must be resolved in top-down parsers is left recursion. Consider the following grammar rule.

E → E + T | T

In a recursive descent parser, calling parse_E would immediately call parse_E again, resulting in infinite recursion. To fix this, left recursion must be converted to right recursion.

The transformation works as follows. A left-recursive rule of the form A → Aα | β, where β does not begin with A, is rewritten as:

A  → β A'
A' → α A' | ε

Applied to the expression grammar above, this produces:

E  → T E'
E' → + T E' | ε

After the transformation, left recursion is eliminated and the recursive descent parser works correctly. However, the transformation changes the structure of the parse tree. Operations that were left-associative in the original grammar appear right-associative after the transformation, so appropriate adjustments must be made when constructing the abstract syntax tree.
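One common adjustment is to implement E' as a loop rather than a recursive call, folding each new term into the left operand so the AST stays left-associative. A minimal sketch, assuming (type, text) token tuples:

```python
# Sketch: parsing the transformed grammar E → T E', E' → + T E' | ε
# while keeping '+' left-associative in the AST. The E' rule becomes a
# loop instead of a recursive call. Token/node shapes are illustrative.

def parse_E(tokens, pos=0):
    node, pos = parse_T(tokens, pos)
    # E' → + T E' | ε : each iteration consumes one '+ T' and folds it
    # into the left operand, producing ((a + b) + c), not (a + (b + c)).
    while pos < len(tokens) and tokens[pos][0] == 'PLUS':
        right, pos = parse_T(tokens, pos + 1)
        node = ('ADD', node, right)
    return node, pos

def parse_T(tokens, pos):
    tok = tokens[pos]
    if tok[0] in ('IDENT', 'NUMBER'):
        return tok, pos + 1
    raise SyntaxError(f'Unexpected token {tok[0]}')
```

Parsing a + b + c with this sketch yields ('ADD', ('ADD', a, b), c), the left-leaning tree the original grammar described.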

Left Factoring

Another challenge arises from common prefixes. Consider this grammar.

S → if E then S else S
S → if E then S

Both rules share the common prefix if E then S, so looking at just the first token if is insufficient to decide which rule to apply. Left factoring extracts the common prefix.

S  → if E then S S'
S' → else S | ε

Now, upon seeing if, the parser selects the only rule for S, and when else appears, it selects the else branch in S'. However, this transformation does not fully resolve the ambiguity: else belongs to both FIRST(else S) and FOLLOW(S'), so the table still has a conflict. This is the well-known dangling else problem, and it is typically resolved by always consuming an else when one appears, which matches each else with the nearest unmatched if.
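The nearest-if resolution falls out naturally in recursive descent: the innermost active call to the statement parser sees the else first. A small sketch, with conditions and plain statements simplified to single tokens:

```python
# Sketch of the factored rule S → if E then S S', S' → else S | ε.
# Greedily consuming 'else' inside the innermost call binds it to the
# nearest 'if'. Token stream is a simplified assumption: conditions and
# plain statements appear as single placeholder tokens.

def parse_S(tokens, pos):
    if tokens[pos] == 'if':
        # S → if E then S S'
        cond = tokens[pos + 1]                  # placeholder for E
        assert tokens[pos + 2] == 'then'
        then_branch, pos = parse_S(tokens, pos + 3)
        # S' → else S | ε : take 'else' here, in the innermost call,
        # so it attaches to this (nearest) 'if'.
        if pos < len(tokens) and tokens[pos] == 'else':
            else_branch, pos = parse_S(tokens, pos + 1)
            return ('if', cond, then_branch, else_branch), pos
        return ('if', cond, then_branch, None), pos
    return tokens[pos], pos + 1                 # plain statement
```

On if e1 then if e2 then s1 else s2, the else ends up on the inner if, and the outer if's else branch stays empty.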

Table-Driven Predictive Parsing

Instead of recursive descent, an explicit stack and parsing table can be used. In this approach, the parser inspects the top of the stack and the next input token, then consults the parsing table to determine which rule to apply.

Stack: [E, $]    Input: id + id $

1. Stack top E, input id → table selects E → T E'
   Stack: [T, E', $]

2. Stack top T, input id → table selects T → F T'
   Stack: [F, T', E', $]

3. Stack top F, input id → table selects F → id
   Stack: [id, T', E', $]

4. Stack top id, input id → match, pop
   Stack: [T', E', $]    Input: + id $

... (continues)

This approach uses an explicit loop and stack instead of recursive function calls, avoiding call stack depth limitations. However, recursive descent tends to offer more flexibility in error handling.
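The loop above can be sketched directly. The table entries below are written out by hand from the FIRST and FOLLOW sets computed earlier; any (nonterminal, token) pair missing from the table is a syntax error.

```python
# Sketch of a table-driven predictive parser for the expression grammar.
# Terminals are their own token names and '$' marks end of input.
EPS = 'ε'
TABLE = {
    ('E', 'id'): ['T', "E'"],   ('E', '('): ['T', "E'"],
    ("E'", '+'): ['+', 'T', "E'"],
    ("E'", ')'): [EPS],         ("E'", '$'): [EPS],
    ('T', 'id'): ['F', "T'"],   ('T', '('): ['F', "T'"],
    ("T'", '*'): ['*', 'F', "T'"],
    ("T'", '+'): [EPS], ("T'", ')'): [EPS], ("T'", '$'): [EPS],
    ('F', 'id'): ['id'],        ('F', '('): ['(', 'E', ')'],
}
NONTERMINALS = {'E', "E'", 'T', "T'", 'F'}

def parse(tokens):
    stack = ['$', 'E']                  # top of stack = end of the list
    tokens = tokens + ['$']
    i = 0
    while stack:
        top = stack.pop()
        if top == EPS:
            continue                    # ε expands to nothing
        if top in NONTERMINALS:
            prod = TABLE.get((top, tokens[i]))
            if prod is None:
                return False            # no table entry: syntax error
            stack.extend(reversed(prod))  # push rhs, leftmost on top
        elif top == tokens[i]:
            i += 1                      # terminal matched: advance input
        else:
            return False                # terminal mismatch
    return i == len(tokens)
```

This recognizer accepts id + id and ( id ) but rejects id + , reproducing the trace shown above step by step.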

Limitations of Top-Down Parsing

Top-down parsing has the advantages of being intuitive and easy to implement, but it also has fundamental limitations. Many language constructs cannot be expressed as LL(1) grammars: some grammars cannot be made LL(1) by any amount of left recursion elimination and left factoring, and in such cases LL(k) with more lookahead tokens, or backtracking, must be introduced.

From a practical standpoint, recursive descent parsers encode the grammar directly in code, making it easy to insert context-sensitive handling that goes beyond grammar rules. In C, for example, distinguishing typedef names from variable names is a rule that cannot be expressed in the grammar alone but can be handled inside the parser. This flexibility is another reason why many production compilers choose the recursive descent approach.

On the other hand, if automatic handling of a broader class of grammars is needed, bottom-up parsing should be considered. In the next post, we'll look at bottom-up parsing, with a particular focus on LR parsing.