Automata and Compilers 10 - Semantic Analysis and Type Checking
What the Parser Cannot Catch
The syntactic analysis we have covered so far determines whether a program is grammatically correct. If the parser successfully produces a syntax tree, the program conforms to the grammar rules. But does a grammatically correct program necessarily make sense? Not at all.
Consider the code `int x = "hello";`. It perfectly follows the syntactic structure of a variable declaration with initialization, so the parser generates a syntax tree without complaint. Yet assigning a string to an integer variable is a semantic contradiction. Using an undeclared variable, passing the wrong number of arguments to a function, or performing operations on incompatible types are all errors of the same nature. These errors lie outside the scope of context-free grammars and must be detected in a separate phase. That phase is semantic analysis.
The semantic analyzer traverses the syntax tree, verifying the semantic consistency of the program. Specifically, it checks whether variables are declared before use, whether types are compatible, whether function calls match their definitions in argument count and types, and so on.
Symbol Tables and Scope
The core data structure of semantic analysis is the symbol table. A symbol table is a dictionary that stores information about every identifier in the program: variables, functions, types, and more. Each entry contains attributes such as name, type, scope, and memory location.
Symbol Table Example:
```
┌──────────┬──────────┬────────┬──────────┐
│ Name     │ Type     │ Scope  │ Kind     │
├──────────┼──────────┼────────┼──────────┤
│ x        │ int      │ main   │ variable │
│ y        │ float    │ main   │ variable │
│ add      │ int→int  │ global │ function │
│ i        │ int      │ for    │ variable │
└──────────┴──────────┴────────┴──────────┘
```
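A table like the one above can be sketched in a few lines of Python (used here purely as illustrative pseudocode for the series; `Symbol` and `declare` are hypothetical names, not part of any real compiler):

```python
from dataclasses import dataclass

@dataclass
class Symbol:
    name: str
    type: str
    scope: str
    kind: str   # "variable" or "function"

# A flat symbol table keyed by identifier name.
symtab: dict[str, Symbol] = {}

def declare(sym: Symbol) -> None:
    # Redeclaring a name in the same scope is a semantic error.
    if sym.name in symtab and symtab[sym.name].scope == sym.scope:
        raise Exception(f"duplicate declaration of '{sym.name}'")
    symtab[sym.name] = sym

declare(Symbol("x", "int", "main", "variable"))
declare(Symbol("y", "float", "main", "variable"))
declare(Symbol("add", "int→int", "global", "function"))
print(symtab["x"].type)   # prints "int"
```

A real symbol table also records memory locations and must handle nesting, which is where the scope structures below come in.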
In programming languages, scope determines the region in which an identifier is valid. Most languages support nested scopes, where inner scopes can access identifiers from outer scopes. To handle this, symbol tables are typically implemented as chained or stacked structures.
When entering a new block, a new scope is created; when leaving the block, that scope is discarded. When referencing a variable, the current scope is searched first, and if the identifier is not found, outer scopes are searched in order. This process is called scope resolution.
```
global scope
├── int g = 10
└── function main()
    ├── int x = 5              ← main scope
    ├── if (x > 0)
    │   └── int y = x + g      ← if scope (x resolved from main, g from global)
    └── print(y)               ← error: y is only valid inside the if block
```
In this example, y is declared inside the if block, so referencing y outside that block causes the semantic analyzer to raise an "undeclared variable" error. The parser sees nothing wrong with the syntax of print(y), but the semantic analyzer detects through the symbol table that y is not visible in the current scope.
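The chained lookup described above can be sketched as a minimal `Scope` class (an illustrative sketch; the names are not from any particular compiler):

```python
class Scope:
    """One level of a chained (stacked) symbol table."""
    def __init__(self, parent=None):
        self.symbols = {}    # name -> type
        self.parent = parent

    def declare(self, name, type_):
        self.symbols[name] = type_

    def lookup(self, name):
        # Scope resolution: search the current scope first,
        # then walk outward through enclosing scopes.
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        raise NameError(f"undeclared variable '{name}'")

# Reproduce the example above.
global_scope = Scope()
global_scope.declare("g", "int")
main_scope = Scope(parent=global_scope)
main_scope.declare("x", "int")
if_scope = Scope(parent=main_scope)
if_scope.declare("y", "int")

print(if_scope.lookup("g"))   # "int": resolved via the chain to global
# main_scope.lookup("y") raises NameError: y is not visible outside the if block
```

Entering a block corresponds to constructing a new `Scope` with the current one as `parent`; leaving the block simply drops the reference.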
Type Checking
The most critical task in semantic analysis is type checking. Type checking verifies that types are used consistently throughout every expression and statement in a program.
Type checking falls into two broad categories: static type checking and dynamic type checking. Static type checking is performed at compile time and is the approach taken by languages like C, Java, and Rust. Dynamic type checking is performed at runtime and is used by languages like Python, JavaScript, and Ruby.
Is static type checking always superior to dynamic type checking? Not necessarily. Static type checking catches many errors before execution, increasing program safety, but it imposes constraints on expressiveness. Dynamic type checking offers greater flexibility but incurs the cost of discovering errors only at runtime. Each approach involves different tradeoffs, and compiler designers make their choice with these tradeoffs in mind.
A type checker traverses the syntax tree bottom-up, determining the type of each node. The types of leaf nodes (constants, variables) are known directly, while the types of interior nodes (operators, function calls) are inferred from the types of their children.
```
      [+] : int              ← both children are int, so result is int
     /     \
    /       \
[x]:int   [3]:int            ← leaf types determined from symbol table and literals

      [+] : error!           ← cannot add int and string
     /     \
    /       \
[x]:int   ["hi"]:string
```
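A minimal bottom-up checker for trees like these might look as follows, with expressions encoded as nested tuples (an assumed encoding chosen for illustration, not a standard one):

```python
# Expression nodes: ("+", left, right), ("var", name), ("lit", value)
symtab = {"x": "int"}

def type_of(node):
    kind = node[0]
    if kind == "var":            # leaf: type comes from the symbol table
        return symtab[node[1]]
    if kind == "lit":            # leaf: type comes from the literal itself
        return "int" if isinstance(node[1], int) else "string"
    if kind == "+":              # interior node: inferred from the children
        lt, rt = type_of(node[1]), type_of(node[2])
        if lt == "int" and rt == "int":
            return "int"
        raise TypeError(f"cannot add {lt} and {rt}")

print(type_of(("+", ("var", "x"), ("lit", 3))))        # prints "int"
# type_of(("+", ("var", "x"), ("lit", "hi")))  raises TypeError
```

The recursion mirrors the post-order traversal of the tree: children are typed first, then the parent's type is computed from theirs.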
Type Inference
Requiring explicit type annotations on every variable is tedious. Modern languages use type inference to let the compiler determine types even when the programmer omits them. Writing `auto x = 42;` allows the compiler to infer that x is of type int because 42 is an integer literal.
Simple type inference works by assigning the type of the initializing expression directly to the variable. But when generics or higher-order functions are involved, the problem becomes considerably more complex. The Hindley-Milner type inference algorithm, used in languages like ML and Haskell, resolves relationships among type variables through a technique called unification. The fact that this algorithm can determine the type of every expression without any type annotations from the programmer is a theoretically significant result in its own right.
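At the heart of Hindley-Milner sits unification. The following is a deliberately minimal sketch (it omits the occurs check and let-generalization that a real implementation needs; the encoding of types is an assumption made for this example):

```python
# Types: a plain string is a concrete type ("int"), a string starting with "'"
# is a type variable ("'a"), and ("->", a, b) is the function type a -> b.

def is_var(t):
    return isinstance(t, str) and t.startswith("'")

def resolve(t, subst):
    # Follow the substitution chain until we hit a concrete type or free variable.
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(t1, t2, subst):
    """Extend subst so that t1 and t2 become equal, or raise on failure."""
    t1, t2 = resolve(t1, subst), resolve(t2, subst)
    if t1 == t2:
        return subst
    if is_var(t1):
        subst[t1] = t2
        return subst
    if is_var(t2):
        subst[t2] = t1
        return subst
    if isinstance(t1, tuple) and isinstance(t2, tuple) and t1[0] == t2[0]:
        for a, b in zip(t1[1:], t2[1:]):   # unify constructors argument-wise
            subst = unify(a, b, subst)
        return subst
    raise TypeError(f"cannot unify {t1} with {t2}")

# Unifying 'a -> int with int -> 'b forces 'a = int and 'b = int.
s = unify(("->", "'a", "int"), ("->", "int", "'b"), {})
print(resolve("'a", s), resolve("'b", s))   # prints "int int"
```

Each constraint the inference algorithm collects from the program is fed through `unify`, and the final substitution assigns a type to every type variable.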
Attribute Grammars and Semantic Actions During Parsing
Semantic analysis can be performed as a completely separate pass after parsing, but in practice, many compilers perform semantic processing alongside syntactic analysis. Attribute grammars provide a formal framework for this approach.
In an attribute grammar, each grammar symbol is assigned attributes, and semantic rules are attached to production rules. Attributes are divided into synthesized attributes, which propagate from child nodes to parent nodes, and inherited attributes, which propagate from parent nodes to child nodes.
```
Production:     E → E₁ + T
Semantic rule:  E.type = if (E₁.type == int && T.type == int) then int
                         else if (E₁.type == float || T.type == float) then float
                         else error
```
This rule determines the type of an addition expression. If both operands are integers, the result is an integer; if either is a float, the result is a float; any other combination produces a type error. In this way, type information can be computed simultaneously with parsing.
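The semantic rule translates almost literally into a function that computes the synthesized attribute E.type (a direct transcription of the rule above; as written, it assumes both operands are already known to be numeric):

```python
def add_type(e1_type: str, t_type: str) -> str:
    """Synthesized attribute E.type for the production E -> E1 + T."""
    if e1_type == "int" and t_type == "int":
        return "int"
    if e1_type == "float" or t_type == "float":
        return "float"
    return "error"

print(add_type("int", "int"))     # prints "int"
print(add_type("int", "float"))   # prints "float"
print(add_type("int", "string"))  # prints "error"
```

Because the attribute flows strictly from children to parent, a bottom-up parser can evaluate it at the moment it reduces by this production.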
Parser generators like Yacc/Bison implement this by embedding semantic actions within production rules. Code like `$$ = check_type($1, $3);` is included in the production, and each time the parser applies that rule, the code executes to update the symbol table or check types.
Common Semantic Errors
The following table summarizes the typical errors that a semantic analyzer detects.
| Error Type | Example | Description |
|---|---|---|
| Undeclared variable | `x = 5;` (x not declared) | Referencing an identifier not in the symbol table |
| Duplicate declaration | `int x; int x;` | Redeclaring the same name in the same scope |
| Type mismatch | `int x = "hello";` | Incompatible types on the two sides of an assignment |
| Argument count mismatch | `add(1, 2, 3)` (add takes 2) | Call supplies a different number of arguments than the definition declares |
| Return type mismatch | `int f() { return "hi"; }` | Actual return value does not match the declared return type |
| Invalid operation | `"hello" - "world"` | Operation not defined for the given types |
All of these errors occur in programs that are syntactically legal but semantically wrong. The parser alone cannot filter them out; the additional information provided by symbol tables and type rules is what makes detection possible.
Semantic analysis is the final gate of the frontend. A program that passes this phase is deemed both syntactically and semantically correct, and the compiler can now safely transform it. In the next post, we'll look at the first step of that transformation: intermediate representations and optimization.