From the Bottom Up

The top-down parsing covered in the previous post started from the start symbol and worked downward toward the input string. Bottom-up parsing is the reverse. It starts from the tokens of the input string and applies grammar rules in reverse, working upward until reaching the start symbol. In this process, the parser traces the reverse of a rightmost derivation.

Does simply reversing the direction make a meaningful difference? It does. Bottom-up parsers handle a much broader class of grammars than top-down parsers. Many grammars that cannot be expressed in LL(1) form are handled naturally by LR parsing, and left recursion needs no transformation at all. This is why bottom-up parsing became the standard method used by parser generators.

Shift-Reduce Parsing

The core mechanism of bottom-up parsing is shift-reduce parsing. The parser uses a stack and an input buffer, repeatedly performing two operations.

  • Shift: Read the next token from the input and push it onto the stack.
  • Reduce: When the top symbols on the stack match the right-hand side of a grammar rule, replace them with the left-hand side (a nonterminal).

Let's trace the parsing of id + id * id using the following grammar.

E → E + T | T
T → T * F | F
F → id
Stack           Input              Action
                id + id * id $
id              + id * id $        Shift
F               + id * id $        Reduce (F → id)
T               + id * id $        Reduce (T → F)
E               + id * id $        Reduce (E → T)
E +             id * id $          Shift
E + id          * id $             Shift
E + F           * id $             Reduce (F → id)
E + T           * id $             Reduce (T → F)
E + T *         id $               Shift
E + T * id      $                  Shift
E + T * F       $                  Reduce (F → id)
E + T           $                  Reduce (T → T * F)
E               $                  Reduce (E → E + T)

When only the start symbol E remains on the stack, parsing has succeeded. An important observation is that reversing the order of reductions yields a rightmost derivation.
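The trace above can be reproduced by a small program. The sketch below hand-codes the shift/reduce decision with one token of lookahead: reductions that would complete a `+` expression are blocked while the next token is `*`, so multiplication binds tighter. A real LR parser derives exactly these decisions from a table rather than hard-coding them.

```python
# A minimal shift-reduce parser for the toy grammar:
#   E -> E + T | T
#   T -> T * F | F
#   F -> id
def parse(tokens):
    stack, trace = [], []
    tokens = tokens + ["$"]          # end-of-input marker
    i = 0
    while True:
        la = tokens[i]               # one token of lookahead
        if stack[-1:] == ["id"]:
            stack[-1:] = ["F"]; trace.append("reduce F -> id")
        elif stack[-3:] == ["T", "*", "F"]:
            stack[-3:] = ["T"]; trace.append("reduce T -> T * F")
        elif stack[-1:] == ["F"]:
            stack[-1:] = ["T"]; trace.append("reduce T -> F")
        elif stack[-3:] == ["E", "+", "T"] and la != "*":
            stack[-3:] = ["E"]; trace.append("reduce E -> E + T")
        elif stack[-1:] == ["T"] and la != "*":
            stack[-1:] = ["E"]; trace.append("reduce E -> T")
        elif la != "$":
            stack.append(la); i += 1; trace.append("shift " + la)
        else:
            return stack, trace
```

Running `parse(["id", "+", "id", "*", "id"])` leaves `["E"]` on the stack and produces the same sequence of actions as the table above; reading the `reduce` entries of the trace backwards spells out the rightmost derivation.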

The critical question is: when should the parser shift and when should it reduce? Automating this decision is precisely what LR parsing accomplishes.

The Mechanics of LR Parsing

An LR parser combines a finite automaton with a parsing table to make shift-reduce decisions automatically. The name LR means the input is read left-to-right while producing the reverse of a rightmost derivation.

The core concepts are items and states. An LR(0) item is a notation that indicates how far parsing has progressed within a grammar rule. A dot (·) marks the current position.

E → · E + T    (nothing recognized yet)
E → E · + T    (E has been recognized)
E → E + · T    (+ has been recognized)
E → E + T ·    (entire rule recognized, ready to reduce)

By collecting all items that are possible at a given point into a single state and defining transitions between these states, an LR automaton is constructed. The shift, reduce, and GOTO actions in the parsing table are derived from this automaton.
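The two operations that build these states, closure and goto, can be sketched in a few lines. This is a simplified construction, assuming the grammar is stored as (lhs, rhs) pairs with an augmented start rule, and an item is a (rule index, dot position) pair:

```python
# LR(0) closure and goto for the expression grammar, augmented with E' -> E.
GRAMMAR = [
    ("E'", ["E"]),           # rule 0: augmented start rule
    ("E",  ["E", "+", "T"]),
    ("E",  ["T"]),
    ("T",  ["T", "*", "F"]),
    ("T",  ["F"]),
    ("F",  ["id"]),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def closure(items):
    """If the dot sits before a nonterminal, pull in that nonterminal's rules."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for rule, dot in list(items):
            rhs = GRAMMAR[rule][1]
            if dot < len(rhs) and rhs[dot] in NONTERMINALS:
                for i, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == rhs[dot] and (i, 0) not in items:
                        items.add((i, 0)); changed = True
    return frozenset(items)

def goto(items, symbol):
    """Advance the dot over `symbol` in every item where it applies."""
    moved = {(r, d + 1) for r, d in items
             if d < len(GRAMMAR[r][1]) and GRAMMAR[r][1][d] == symbol}
    return closure(moved)
```

The initial state is `closure({(0, 0)})`, which expands the single item E' → · E into six items, and repeatedly applying goto over every grammar symbol enumerates all the states of the LR automaton.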

Varieties of LR Parsers

LR parsers come in several variants depending on how the table is constructed, each differing in the range of grammars they can handle and the size of the table.

Variant    Lookahead               Table Size    Grammar Coverage
LR(0)      0                       Small         Very limited
SLR(1)     1 (FOLLOW-based)        Small         Limited
LALR(1)    1 (precise lookahead)   Medium        Most practical grammars
CLR(1)     1 (full lookahead)      Very large    Broadest

LR(0) decides based on state alone, without any lookahead, so the grammars it can handle are extremely limited. SLR(1) uses FOLLOW sets to decide when to reduce, but because FOLLOW sets aggregate possibilities across all contexts, they can produce unnecessary conflicts.
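The FOLLOW sets that SLR(1) relies on can be computed by a standard fixpoint iteration. A sketch for the expression grammar (simplified by the fact that this grammar has no epsilon rules, so FIRST of a sequence is just FIRST of its first symbol):

```python
# FOLLOW-set computation for the epsilon-free expression grammar.
GRAMMAR = [
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["T", "*", "F"]),
    ("T", ["F"]),
    ("F", ["id"]),
]
NT = {lhs for lhs, _ in GRAMMAR}

def compute_follow(start="E"):
    # FIRST sets first: a terminal's FIRST is itself; iterate to a fixpoint.
    first = {s: set() for s in NT}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            f = first[rhs[0]] if rhs[0] in NT else {rhs[0]}
            if not f <= first[lhs]:
                first[lhs] |= f; changed = True
    # FOLLOW: what can appear immediately after each nonterminal.
    follow = {s: set() for s in NT}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            for i, sym in enumerate(rhs):
                if sym not in NT:
                    continue
                if i + 1 == len(rhs):          # at the end: inherit FOLLOW(lhs)
                    nxt = follow[lhs]
                elif rhs[i + 1] in NT:         # followed by a nonterminal
                    nxt = first[rhs[i + 1]]
                else:                          # followed by a terminal
                    nxt = {rhs[i + 1]}
                if not nxt <= follow[sym]:
                    follow[sym] |= nxt; changed = True
    return follow
```

For this grammar it yields FOLLOW(E) = {$, +}, FOLLOW(T) = {$, +, *}, and FOLLOW(F) = {$, +, *}; an SLR(1) parser reduces by a rule for E only when the lookahead is in FOLLOW(E).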

CLR(1) attaches precise lookahead tokens to each item and can handle the broadest range of grammars, but table sizes can grow exponentially. LALR(1) merges states with identical cores from the CLR(1) construction, reducing table size while still handling the vast majority of practical programming language grammars. This is why LALR(1) has been widely adopted as the practical sweet spot.

Conflict Resolution

When constructing an LR parsing table, a single cell may contain more than one possible action. This is called a conflict, and there are two kinds.

Shift-Reduce Conflict: Both shift and reduce are possible in the same state. The classic example is the dangling else problem. In if E then if E then S · else S, shifting else matches it with the inner if, while reducing matches it with the outer if. Most parser generators default to shifting, which binds else to the nearest if.

Reduce-Reduce Conflict: Two different rules could be used to reduce in the same state. This type of conflict generally indicates genuine ambiguity in the grammar, and the recommended solution is to redesign the grammar.

Practical parser generators provide mechanisms for resolving conflicts through operator precedence and associativity declarations. For instance, declaring that + is left-associative and that * has higher precedence than + is sufficient to automatically resolve the shift-reduce conflicts in an expression grammar.
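The resolution rule itself is simple. The sketch below mimics the Yacc-style decision, assuming each operator is declared with a (precedence level, associativity) pair: compare the precedence of the rule being completed (taken from its rightmost terminal) with that of the lookahead token, and fall back to associativity on a tie.

```python
# Yacc-style shift-reduce conflict resolution via precedence declarations.
# Higher level number = binds tighter (later %left/%right declaration).
PREC = {"+": (1, "left"), "-": (1, "left"),
        "*": (2, "left"), "/": (2, "left")}

def resolve(rule_op, lookahead):
    """Decide a shift-reduce conflict between a rule whose rightmost
    terminal is rule_op and an incoming operator token."""
    rule_prec, _ = PREC[rule_op]
    tok_prec, assoc = PREC[lookahead]
    if rule_prec != tok_prec:
        return "reduce" if rule_prec > tok_prec else "shift"
    # Equal precedence: left-associative reduces, right-associative shifts.
    return "reduce" if assoc == "left" else "shift"
```

With E + E on the stack and * as the lookahead, the function shifts, so multiplication binds tighter; with + as the lookahead, left associativity forces a reduce, so a + b + c groups as (a + b) + c.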

Parser Generators: Yacc and Bison

Just as Lex and Flex exist for lexers, Yacc and Bison exist for parsers. These tools take a grammar specification as input and automatically generate LALR(1) parser code.

A Yacc/Bison input file consists of grammar rules paired with actions that execute when each rule is reduced.

%token NUMBER IDENTIFIER
%left '+' '-'
%left '*' '/'

%%
expr : expr '+' expr   { $$ = $1 + $3; }
     | expr '-' expr   { $$ = $1 - $3; }
     | expr '*' expr   { $$ = $1 * $3; }
     | expr '/' expr   { $$ = $1 / $3; }
     | '(' expr ')'    { $$ = $2; }
     | NUMBER           { $$ = $1; }
     ;
%%

The %left and %right declarations specify associativity, and their order determines precedence: later declarations have higher precedence. These declarations are what resolve the shift-reduce conflicts in this ambiguous grammar.

Internally, a parser generator constructs the LR automaton from the grammar, merges states to produce the LALR(1) table, resolves conflicts using precedence rules, and finally outputs table-driven parser code. Automata theory is once again realized as a practical tool.
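The driver loop such a generator emits is itself tiny. The sketch below assumes the generated tables are given as dictionaries: ACTION maps (state, token) to ('shift', next_state), ('reduce', rule_index), or ('accept',); GOTO maps (state, nonterminal) to a state; RULES gives each rule's left-hand side and right-hand-side length.

```python
# Skeleton of a table-driven LR parser: all grammar knowledge lives in
# the tables, and this loop never changes no matter the grammar.
def lr_parse(tokens, action, goto, rules):
    stack = [0]                              # a stack of states
    tokens = tokens + ["$"]
    i = 0
    while True:
        act = action[(stack[-1], tokens[i])]
        if act[0] == "shift":
            stack.append(act[1]); i += 1
        elif act[0] == "reduce":
            lhs, length = rules[act[1]]
            del stack[len(stack) - length:]  # pop one state per RHS symbol
            stack.append(goto[(stack[-1], lhs)])
        else:                                # accept
            return True
```

For a toy grammar S → a, hypothetical hand-built tables would be `ACTION = {(0, "a"): ("shift", 2), (2, "$"): ("reduce", 1), (1, "$"): ("accept",)}`, `GOTO = {(0, "S"): 1}`, `RULES = {1: ("S", 1)}`, and `lr_parse(["a"], ACTION, GOTO, RULES)` accepts.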

Comparing Top-Down and Bottom-Up

The two approaches can be summarized as follows.

Aspect                 Top-Down (LL)                 Bottom-Up (LR)
Direction              Start symbol → input          Input → start symbol
Derivation             Leftmost                      Reverse of rightmost
Grammar coverage       LL(1) ⊂ LR(1)                 Broader range
Left recursion         Requires transformation       Handled directly
Implementation         Recursive descent (manual)    Parser generators (automatic)
Error handling         Flexible                      Structured
Representative tools   ANTLR, hand-written           Yacc, Bison

In practice, both approaches are actively used. When manual control and precise error messages are priorities, recursive descent parsers are the typical choice. When automatic handling of complex grammars matters more, LR parser generators are preferred.

Regardless of the approach chosen, the parser's final output is a parse tree. But a parse tree carries too much redundant detail to be used directly. In the next post, we'll look at abstract syntax trees (ASTs): structures that strip away the unnecessary elements of parse trees and retain only the essential structure.