The Limitations of Parse Trees

When a parser analyzes a token stream according to grammar rules, it produces a parse tree (also called a concrete syntax tree). The parse tree faithfully reflects every step of the derivation process, making it useful for verifying grammatical correctness. But for the subsequent phases of a compiler, it contains far too much unnecessary information.

Consider the parse tree for the expression 2 + 3 * 4. Using a grammar with left recursion eliminated, the parse tree includes nonterminal symbols like E, E', T, T', and F (and, for parenthesized input, the delimiter tokens themselves). But the only information the compiler actually needs is the operational structure: "add 2 to the product of 3 and 4." Auxiliary nonterminals like E' and T', introduced solely to make the grammar LL(1)-compatible, are irrelevant to the language's semantics.

For this reason, compilers use abstract syntax trees (ASTs) instead of parse trees. An AST is a tree that strips away grammatical details from the parse tree, retaining only the semantic structure of the program.

Parse Tree vs AST

Let's compare the two trees for the same expression 2 + 3 * 4.

Parse Tree (grammar E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → number):

E
├── T
│   ├── F
│   │   └── 2
│   └── T'
│       └── ε
└── E'
    ├── +
    ├── T
    │   ├── F
    │   │   └── 3
    │   └── T'
    │       ├── *
    │       ├── F
    │       │   └── 4
    │       └── T'
    │           └── ε
    └── E'
        └── ε

AST:

      +
     / \
    2   *
       / \
      3   4

The parse tree contains every nonterminal from the grammar rules, but the AST retains only operators and operands. Parentheses do not appear in the AST either. In (2 + 3) * 4, the role of the parentheses is captured by the tree structure itself. If the left child of the * node is a + node, that alone conveys that addition is performed first.

This abstraction simplifies every subsequent phase of the compiler. Semantic analysis, optimization, and code generation all operate on the AST, so the design of the AST has a direct impact on the quality of the entire compiler.

Designing AST Nodes

Implementing an AST starts with designing the node types. Each node type corresponds to a language construct, including only semantically necessary information while excluding grammatical details.

AST node types for a simple language might be defined as follows.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NumberLit:
    value: int

@dataclass
class Identifier:
    name: str

@dataclass
class BinaryOp:
    op: str          # '+', '-', '*', '/'
    left: 'Expr'
    right: 'Expr'

@dataclass
class UnaryOp:
    op: str          # '-', '!'
    operand: 'Expr'

@dataclass
class Assignment:
    target: Identifier
    value: 'Expr'

@dataclass
class IfStmt:
    condition: 'Expr'
    then_body: List['Stmt']
    else_body: Optional[List['Stmt']]

@dataclass
class WhileStmt:
    condition: 'Expr'
    body: List['Stmt']

@dataclass
class FuncDecl:
    name: str
    params: List[str]
    body: List['Stmt']

@dataclass
class FuncCall:
    name: str
    args: List['Expr']

@dataclass
class ReturnStmt:
    value: Optional['Expr']

# The 'Expr' and 'Stmt' strings above are forward references to these unions
# (written with PEP 604 syntax, so this needs Python 3.10+).
Expr = NumberLit | Identifier | BinaryOp | UnaryOp | FuncCall
Stmt = Assignment | IfStmt | WhileStmt | FuncDecl | ReturnStmt

Each node type contains only the minimal information needed to represent its language construct. IfStmt has no parentheses or then keyword, just the condition and the bodies. BinaryOp has no precedence information; the tree structure itself determines evaluation order.
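As a quick sanity check on the design, nodes like these can be composed by hand. Here is a minimal sketch for the statement x = 1 + 2, repeating just the node classes it needs so it runs on its own:

```python
from dataclasses import dataclass

# Minimal copies of the node classes above, so this sketch runs on its own.
@dataclass
class NumberLit:
    value: int

@dataclass
class Identifier:
    name: str

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

@dataclass
class Assignment:
    target: Identifier
    value: object

# The statement `x = 1 + 2` as a hand-built AST: no '=' token, no semicolon,
# just the target and the value expression.
stmt = Assignment(
    target=Identifier(name='x'),
    value=BinaryOp(op='+', left=NumberLit(1), right=NumberLit(2)),
)
```

Because dataclasses generate __eq__ and __repr__ automatically, hand-built trees like this are also convenient as expected values in parser tests.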

Building the AST During Parsing

The AST is constructed as the parser applies grammar rules. In a recursive descent parser, each parsing function is written to return an AST node.

def parse_expr(self):
    left = self.parse_term()
    while self.peek() and self.peek()[0] in ('PLUS', 'MINUS'):
        op_token = self.consume(self.peek()[0])
        right = self.parse_term()
        left = BinaryOp(
            op=op_token[1],
            left=left,
            right=right
        )
    return left

def parse_term(self):
    left = self.parse_factor()
    while self.peek() and self.peek()[0] in ('STAR', 'SLASH'):
        op_token = self.consume(self.peek()[0])
        right = self.parse_factor()
        left = BinaryOp(
            op=op_token[1],
            left=left,
            right=right
        )
    return left

def parse_factor(self):
    token = self.peek()
    if token is None:
        raise SyntaxError('Unexpected end of input')
    if token[0] == 'NUMBER':
        self.consume('NUMBER')
        return NumberLit(value=int(token[1]))
    elif token[0] == 'IDENT':
        self.consume('IDENT')
        return Identifier(name=token[1])
    elif token[0] == 'LPAREN':
        self.consume('LPAREN')
        node = self.parse_expr()
        self.consume('RPAREN')
        return node
    raise SyntaxError(f'Unexpected token {token[1]!r}')

Two aspects of this code deserve attention. First, the while loop correctly handles left-associative operations. 2 + 3 + 4 is constructed as BinaryOp('+', BinaryOp('+', 2, 3), 4), producing a structure that evaluates from left to right. Second, when parentheses are encountered, the result of parsing the inner expression is returned directly, so the parentheses themselves never appear in the AST.
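Both properties are easy to verify with a small harness. In the sketch below, the tokenizer, the Parser class wrapper, and the compact node definitions are illustrative assumptions; the three parsing methods are the ones just shown.

```python
import re
from dataclasses import dataclass

@dataclass
class NumberLit:
    value: int

@dataclass
class Identifier:
    name: str

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

TOKEN_RE = re.compile(r'\s*(?:(\d+)|([A-Za-z_]\w*)|(.))')

def tokenize(text):
    # Produce the (TYPE, text) tuples the parser expects.
    symbols = {'+': 'PLUS', '-': 'MINUS', '*': 'STAR', '/': 'SLASH',
               '(': 'LPAREN', ')': 'RPAREN'}
    tokens = []
    for number, ident, sym in TOKEN_RE.findall(text):
        if number:
            tokens.append(('NUMBER', number))
        elif ident:
            tokens.append(('IDENT', ident))
        elif sym.strip():
            tokens.append((symbols[sym], sym))
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def consume(self, kind):
        token = self.tokens[self.pos]
        assert token[0] == kind, f'expected {kind}, got {token[0]}'
        self.pos += 1
        return token

    def parse_expr(self):
        left = self.parse_term()
        while self.peek() and self.peek()[0] in ('PLUS', 'MINUS'):
            op_token = self.consume(self.peek()[0])
            left = BinaryOp(op_token[1], left, self.parse_term())
        return left

    def parse_term(self):
        left = self.parse_factor()
        while self.peek() and self.peek()[0] in ('STAR', 'SLASH'):
            op_token = self.consume(self.peek()[0])
            left = BinaryOp(op_token[1], left, self.parse_factor())
        return left

    def parse_factor(self):
        token = self.peek()
        if token is None:
            raise SyntaxError('Unexpected end of input')
        if token[0] == 'NUMBER':
            self.consume('NUMBER')
            return NumberLit(int(token[1]))
        if token[0] == 'IDENT':
            self.consume('IDENT')
            return Identifier(token[1])
        if token[0] == 'LPAREN':
            self.consume('LPAREN')
            node = self.parse_expr()
            self.consume('RPAREN')
            return node
        raise SyntaxError(f'Unexpected token {token[1]!r}')

# Left associativity: 2 + 3 + 4 groups as (2 + 3) + 4.
ast = Parser(tokenize('2 + 3 + 4')).parse_expr()
assert ast == BinaryOp('+', BinaryOp('+', NumberLit(2), NumberLit(3)), NumberLit(4))

# Parentheses shape the tree but leave no node behind.
ast2 = Parser(tokenize('(2 + 3) * 4')).parse_expr()
assert ast2 == BinaryOp('*', BinaryOp('+', NumberLit(2), NumberLit(3)), NumberLit(4))
```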

In parser generators, AST node construction code is written as actions for each grammar rule. In Yacc/Bison, expressions like $$ = new BinaryOp('+', $1, $3); create AST nodes during reduction.

AST Examples by Language Construct

Let's see how major programming language constructs are represented in an AST.

Expressions: a + b * c

    BinaryOp('+')
    /            \
Identifier('a')  BinaryOp('*')
                  /            \
           Identifier('b')  Identifier('c')

The tree structure itself indicates that * is evaluated before +.

Conditionals: if (x > 0) { return x; } else { return -x; }

IfStmt
├── condition: BinaryOp('>')
│               ├── Identifier('x')
│               └── NumberLit(0)
├── then_body: [ReturnStmt(Identifier('x'))]
└── else_body: [ReturnStmt(UnaryOp('-', Identifier('x')))]
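Written as constructor calls, with compact copies of the relevant node classes so the snippet stands alone, the same tree is:

```python
from dataclasses import dataclass
from typing import List, Optional

# Minimal copies of the node classes from earlier, so the snippet runs alone.
@dataclass
class NumberLit:
    value: int

@dataclass
class Identifier:
    name: str

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

@dataclass
class UnaryOp:
    op: str
    operand: object

@dataclass
class ReturnStmt:
    value: object

@dataclass
class IfStmt:
    condition: object
    then_body: List[object]
    else_body: Optional[List[object]]

# if (x > 0) { return x; } else { return -x; }
node = IfStmt(
    condition=BinaryOp('>', Identifier('x'), NumberLit(0)),
    then_body=[ReturnStmt(Identifier('x'))],
    else_body=[ReturnStmt(UnaryOp('-', Identifier('x')))],
)
```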

Function declarations: function add(a, b) { return a + b; }

FuncDecl
├── name: "add"
├── params: ["a", "b"]
└── body: [ReturnStmt(BinaryOp('+',
                        Identifier('a'),
                        Identifier('b')))]

In each case, the AST strips away all syntactic decoration and clearly exposes the logical structure of the program.

Tree Traversal with the Visitor Pattern

Once the AST is constructed, it must be traversed for various purposes. Each phase, whether type checking, optimization, or code generation, walks the AST to collect information or transform the tree.

The most widely used design pattern for this is the visitor pattern. By separating the processing logic for each node type from the tree structure itself, new analysis phases can be added without modifying the existing AST node classes.

class ASTVisitor:
    def visit(self, node):
        method_name = f'visit_{type(node).__name__}'
        visitor = getattr(self, method_name, self.generic_visit)
        return visitor(node)

    def generic_visit(self, node):
        raise NotImplementedError(f'No visitor for {type(node).__name__}')

class Evaluator(ASTVisitor):
    def visit_NumberLit(self, node):
        return node.value

    def visit_BinaryOp(self, node):
        left = self.visit(node.left)
        right = self.visit(node.right)
        if node.op == '+': return left + right
        if node.op == '-': return left - right
        if node.op == '*': return left * right
        if node.op == '/': return left // right   # integer division, for an int-only language
        raise ValueError(f'Unknown operator {node.op!r}')

class PrettyPrinter(ASTVisitor):
    def visit_NumberLit(self, node):
        return str(node.value)

    def visit_BinaryOp(self, node):
        left = self.visit(node.left)
        right = self.visit(node.right)
        return f'({left} {node.op} {right})'

Evaluator and PrettyPrinter traverse the same AST but produce entirely different results. When a new analysis phase is needed, all that is required is a new class inheriting from ASTVisitor. Because the AST node classes remain untouched, each phase of the compiler can be developed and maintained independently.
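Putting the pieces together, here is a self-contained version of the visitor classes plus a short driver for the AST of 2 + 3 * 4 (the compact node classes are repeated so the sketch runs on its own):

```python
from dataclasses import dataclass

# Minimal node definitions, repeated so this sketch runs on its own.
@dataclass
class NumberLit:
    value: int

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

class ASTVisitor:
    def visit(self, node):
        # Dispatch to visit_<ClassName>, falling back to generic_visit.
        method = getattr(self, f'visit_{type(node).__name__}', self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        raise NotImplementedError(f'No visitor for {type(node).__name__}')

class Evaluator(ASTVisitor):
    def visit_NumberLit(self, node):
        return node.value

    def visit_BinaryOp(self, node):
        left = self.visit(node.left)
        right = self.visit(node.right)
        if node.op == '+': return left + right
        if node.op == '-': return left - right
        if node.op == '*': return left * right
        if node.op == '/': return left // right
        raise ValueError(f'Unknown operator {node.op!r}')

class PrettyPrinter(ASTVisitor):
    def visit_NumberLit(self, node):
        return str(node.value)

    def visit_BinaryOp(self, node):
        return f'({self.visit(node.left)} {node.op} {self.visit(node.right)})'

# The AST for 2 + 3 * 4:
ast = BinaryOp('+', NumberLit(2), BinaryOp('*', NumberLit(3), NumberLit(4)))

print(Evaluator().visit(ast))      # 14
print(PrettyPrinter().visit(ast))  # (2 + (3 * 4))
```

The same tree, two visitors, two results: one collapses it to a value, the other renders it back to fully parenthesized source.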

The Bridge Between Frontend and Backend

The AST plays a pivotal role in compiler architecture as the connection point between the frontend and the backend. The frontend (lexer, parser, semantic analysis) transforms source code into an AST, and the backend (optimization, code generation) starts from the AST to produce target code.

Source Code → [Lexer] → [Parser] → AST → [Semantic Analysis] → Annotated AST → [Optimization] → [Code Generation] → Target Code

The advantage of this structure is that the frontend and backend can be replaced independently. Connecting a different language's frontend to the same backend adds support for a new language, and connecting the same frontend to a different backend produces code for a different architecture. This principle is what allows LLVM to support dozens of languages and dozens of target architectures.

In practice, rather than generating target code directly from the AST, compilers typically lower the AST into a lower-level intermediate representation (IR) before performing optimization and code generation. But the AST remains the most important source-level intermediate representation.
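As a taste of that lowering step, here is a minimal sketch that flattens an expression AST into three-address instructions, one per BinaryOp, each writing a fresh temporary. The function name lower and the (dest, op, arg1, arg2) tuple format are illustrative assumptions, not a standard API:

```python
import itertools
from dataclasses import dataclass

# Minimal expression nodes, repeated so this sketch runs on its own.
@dataclass
class NumberLit:
    value: int

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

def lower(ast):
    """Flatten an expression AST into (dest, op, arg1, arg2) tuples."""
    code = []
    temps = itertools.count()

    def emit(node):
        if isinstance(node, NumberLit):
            return node.value
        # Post-order: operands are lowered before the operation that uses them.
        left = emit(node.left)
        right = emit(node.right)
        dest = f't{next(temps)}'
        code.append((dest, node.op, left, right))
        return dest

    result = emit(ast)
    return code, result

# 2 + 3 * 4 lowers to: t0 = 3 * 4, then t1 = 2 + t0
ast = BinaryOp('+', NumberLit(2), BinaryOp('*', NumberLit(3), NumberLit(4)))
code, result = lower(ast)
print(code)  # [('t0', '*', 3, 4), ('t1', '+', 2, 't0')]
```

Note that operator precedence needs no special handling here: the AST already encodes it, so a simple post-order walk emits the multiplication first.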

In the next post, we'll look at semantic analysis: the checks and annotations performed on the AST.