Python Lecture 2 - Lexical Structure of Python
What is Lexical Structure?
When we write Python code, how does Python understand it?
x = 42 + y
To us, this means "assign the sum of 42 and y to variable x," but to Python, it's just a sequence of characters.
'x', ' ', '=', ' ', '4', '2', ' ', '+', ' ', 'y'
Before Python can execute code, it must break these characters into meaningful units, just like we recognize letters as words when reading.
Tokens: The Smallest Units of Meaning
The meaningful pieces extracted from code are called tokens.
x = 42 + y
# Tokens Python sees:
# NAME: 'x'
# OP: '='
# NUMBER: 42
# OP: '+'
# NAME: 'y'
Comparing with natural language makes this clearer.
Sentence: "I eat rice"
Words: ["I", "eat", "rice"]
Code: "x = 42"
Tokens: [NAME:'x', OP:'=', NUMBER:42]
Lexical Analysis: Breaking into Tokens
Lexical Analysis is the process of reading code and breaking it into tokens.
This is the first step the Python interpreter performs.
[Source Code]
↓
[Lexical Analysis] ← Today's topic
↓
[Token List]
↓
[Syntax Analysis]
↓
...
Why is this process necessary?
Understanding code all at once is too complex. Python processes it step by step, with lexical analysis as the first stage.
How Lexical Analyzers Work
Lexical analyzers are typically built as state machines.
Though this might sound complex, it's a concept we encounter everywhere in daily life.
State Machine Examples
1. Human Hunger States
Consider our hunger states.
stateDiagram-v2
Hungry --> Full: Eat
Full --> Hungry: Exercise
- States: Hungry, Full
- Transition: From hungry to full when eating
- Reverse transition: From full to hungry when exercising
This is a state machine: a system that is in one specific state at a time and moves to another state when it receives input.
2. Traffic Lights
Traffic lights are also state machines.
stateDiagram-v2
Red --> Green: 30 seconds
Green --> Yellow: 30 seconds
Yellow --> Red: 3 seconds
- States: Red, Yellow, Green
- Transition: States change as time passes
- Pattern: Red → Green → Yellow → Red (repeats)
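The traffic light above can be sketched as a transition table. This is a minimal illustration, not a real controller: the names `TRANSITIONS` and `run_light` are made up for this example, and the durations are copied from the diagram.

```python
# Transition table: current state -> (next state, seconds spent in the current state).
# The names and layout here are illustrative assumptions.
TRANSITIONS = {
    "Red": ("Green", 30),
    "Green": ("Yellow", 30),
    "Yellow": ("Red", 3),
}


def run_light(start, steps):
    """Return the list of states visited, beginning with `start`."""
    state = start
    visited = [state]
    for _ in range(steps):
        state, _seconds = TRANSITIONS[state]  # look up the next state
        visited.append(state)
    return visited


print(run_light("Red", 3))  # ['Red', 'Green', 'Yellow', 'Red']
```

Keeping the transitions in a dictionary means the "rules" of the machine are data, so adding a state is a one-line change.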
3. Calculator Input States
Think about using a calculator.
stateDiagram-v2
Start --> WaitingNumber
WaitingNumber --> WaitingOperator: Number input (5, 3)
WaitingOperator --> WaitingNumber: Operator input (+)
WaitingOperator --> End: (=) input
While it is taking input, a calculator is always in one of two states:
- Waiting for number (WaitingNumber)
- Waiting for operator (WaitingOperator)
Processing "5 + 3 =":
- Start → WaitingNumber → '5' input → WaitingOperator
- WaitingOperator → '+' input → WaitingNumber
- WaitingNumber → '3' input → WaitingOperator
- WaitingOperator → '=' input → End
This is how state machines process input by changing states.
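The calculator walk-through above can be traced in a few lines of Python. This is a sketch under assumptions: `run_calculator` is a hypothetical helper, the inputs are already split into separate strings, and operators other than `+` are added only for illustration.

```python
OPERATORS = {"+", "-", "*", "/"}  # only '+' appears in the diagram; the rest are assumed


def run_calculator(inputs):
    """Return the states visited while processing a list of input strings."""
    state = "WaitingNumber"  # Start moves straight to WaitingNumber
    trace = [state]
    for item in inputs:
        if state == "WaitingNumber" and item.isdigit():
            state = "WaitingOperator"
        elif state == "WaitingOperator" and item in OPERATORS:
            state = "WaitingNumber"
        elif state == "WaitingOperator" and item == "=":
            state = "End"
        else:
            # No transition is defined for this input in this state.
            raise ValueError(f"unexpected input {item!r} in state {state}")
        trace.append(state)
    return trace


print(run_calculator(["5", "+", "3", "="]))
# ['WaitingNumber', 'WaitingOperator', 'WaitingNumber', 'WaitingOperator', 'End']
```

An input with no matching transition, such as two operators in a row, raises an error. That is the same idea behind a lexer rejecting characters it cannot place in any token.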
Lexical Analysis State Machine
Now let's see how Python's lexical analysis works.
Recognizing Integers
Here are the state transitions when reading the string 123:
stateDiagram-v2
[*] --> Start
Start --> ReadingNumber: Digit ('1')
ReadingNumber --> ReadingNumber: Digits ('2', '3')
ReadingNumber --> [*]: Space or operator
note right of ReadingNumber
Collecting digits
Example: "123"
end note
The lexer reads character by character, stays in the "I'm reading a number" state while it sees digits, and completes the token when it encounters a non-digit character.
Result: NUMBER token: 123
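A minimal sketch of the integer diagram above, assuming plain ASCII digits only. The function name `read_integer` is hypothetical, and this is not how CPython's tokenizer is actually written.

```python
DIGITS = "0123456789"


def read_integer(text, pos=0):
    """Read one integer token starting at text[pos].

    Returns (token_text, next_pos), or (None, pos) if no digit is found there.
    """
    state = "Start"
    start = pos
    while pos < len(text) and text[pos] in DIGITS:
        state = "ReadingNumber"  # stay in this state while digits keep coming
        pos += 1
    if state == "ReadingNumber":
        return text[start:pos], pos  # a non-digit (or the end) completed the token
    return None, pos


print(read_integer("123 + 4"))  # ('123', 3)
```

Returning the next position lets a caller resume scanning right after the token.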
Recognizing Floats
What happens with 3.14 that has a decimal point?
stateDiagram-v2
[*] --> Start
Start --> ReadingInteger: Digit ('3')
ReadingInteger --> ReadingInteger: Digit
ReadingInteger --> ReadingFloat: Decimal ('.')
ReadingFloat --> ReadingFloat: Digits ('1', '4')
ReadingInteger --> [*]: Space/operator
ReadingFloat --> [*]: Space/operator
note right of ReadingFloat
Collecting decimal digits
Example: "3.14"
end note
When the lexer is reading an integer and encounters ., it recognizes a float, changes state, and continues reading the decimal digits.
Result: NUMBER token: 3.14
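The float diagram above adds one state to the integer machine, which can be sketched as follows. `read_number` is a hypothetical name, and the sketch is deliberately simplified: real Python number literals also allow forms such as `.5`, `1e10`, and underscores.

```python
DIGITS = "0123456789"


def read_number(text):
    """Return (kind, token_text) for a number at the start of text, else None."""
    state = "Start"
    pos = 0
    for ch in text:
        if state == "Start" and ch in DIGITS:
            state = "ReadingInteger"
        elif state == "ReadingInteger" and ch in DIGITS:
            pass  # keep collecting integer digits
        elif state == "ReadingInteger" and ch == ".":
            state = "ReadingFloat"  # the decimal point switches states
        elif state == "ReadingFloat" and ch in DIGITS:
            pass  # keep collecting decimal digits
        else:
            break  # space, operator, or anything else ends the token
        pos += 1
    if state == "Start":
        return None
    kind = "float" if state == "ReadingFloat" else "int"
    return kind, text[:pos]


print(read_number("3.14 "))  # ('float', '3.14')
print(read_number("42+y"))   # ('int', '42')
```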
Recognizing Multiple Token Types
How does Python read the code x = 42 + y?
stateDiagram-v2
[*] --> Start
Start --> ReadingIdentifier: Letter ('x')
Start --> ReadingNumber: Digit ('4')
Start --> Operator: Operators ('+', '=')
ReadingIdentifier --> ReadingIdentifier: Letter/digit
ReadingIdentifier --> Start: Space (create NAME token)
ReadingNumber --> ReadingNumber: Digit
ReadingNumber --> Start: Space (create NUMBER token)
Operator --> Start: (create OP token)
Start --> [*]: End of string
Processing the code x = 42 + y:
- x → letter → reading identifier → space → NAME: 'x'
- = → operator → OP: '='
- 42 → digits → space → NUMBER: 42
- + → operator → OP: '+'
- y → letter → end → NAME: 'y'
Final result: [NAME:'x', OP:'=', NUMBER:42, OP:'+', NAME:'y']
Using state machines allows systematic breaking of complex code into tokens!
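Putting the pieces together, the diagram above corresponds roughly to the toy tokenizer below. It is a sketch under assumptions: `tokenize_simple` and the `OPERATORS` set are invented for this example, only single-character operators are handled, and token text is kept as a string (so the number appears as `'42'`).

```python
DIGITS = "0123456789"
OPERATORS = "+-*/="  # assumed single-character operator set for this sketch


def tokenize_simple(source):
    """Split source into (type, text) pairs using a hand-written state machine."""
    tokens = []
    pos = 0
    while pos < len(source):
        ch = source[pos]
        if ch.isspace():
            pos += 1  # whitespace only separates tokens
        elif ch.isalpha() or ch == "_":
            start = pos  # ReadingIdentifier: letters, digits, underscores
            while pos < len(source) and (source[pos].isalnum() or source[pos] == "_"):
                pos += 1
            tokens.append(("NAME", source[start:pos]))
        elif ch in DIGITS:
            start = pos  # ReadingNumber: digits only
            while pos < len(source) and source[pos] in DIGITS:
                pos += 1
            tokens.append(("NUMBER", source[start:pos]))
        elif ch in OPERATORS:
            tokens.append(("OP", ch))
            pos += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    return tokens


print(tokenize_simple("x = 42 + y"))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('NAME', 'y')]
```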
Python's Lexical Structure
Let's explore Python's unique lexical rules.
1. Indentation is a Token
Python's most distinctive feature is that indentation is syntax.
def greet():
    print("hello")  # Indentation level 1
    if True:
        print("world")  # Indentation level 2
While many other languages mark blocks with curly braces {}, Python's lexer turns changes in indentation into INDENT and DEDENT tokens.
# Tokens Python sees:
[
NAME: 'def',
NAME: 'greet',
OP: '(',
OP: ')',
OP: ':',
NEWLINE,
INDENT, # Indentation starts!
NAME: 'print',
...
NEWLINE,
DEDENT, # Indentation ends!
]
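The token list above can be checked with the standard library's `tokenize` module. In this sketch the helper name `token_names` is an assumption; `tokenize.generate_tokens` and `tokenize.tok_name` are the real standard-library names.

```python
import io
import tokenize

SOURCE = '''def greet():
    print("hello")
    if True:
        print("world")
'''


def token_names(source):
    """Return the token type names that the tokenize module produces for source."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


names = token_names(SOURCE)
print(names[:7])
# ['NAME', 'NAME', 'OP', 'OP', 'OP', 'NEWLINE', 'INDENT']
print(names.count("INDENT"), names.count("DEDENT"))  # 2 2
```

Each deeper indentation level produces one INDENT, and every INDENT is eventually matched by a DEDENT, much like an opening and closing brace.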
2. Various String Representations
Python can represent strings in many ways.
# Basic strings
s1 = 'hello'
s2 = "world"
# Multi-line strings
s3 = """
Multiple lines
of text
"""
# Raw strings (backslashes are kept as-is, not treated as escapes)
s4 = r'\n is not a newline, just \n'
# f-strings (variable interpolation)
name = "Python"
s5 = f"Hello, {name}!" # "Hello, Python!"
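A few quick checks make the differences between these string forms concrete. The variable names below are illustrative only.

```python
escaped = '\n'   # one character: a newline
raw = r'\n'      # two characters: a backslash and the letter n
print(len(escaped), len(raw))  # 1 2

# Single and double quotes produce the same string value.
same = 'hello' == "hello"
print(same)  # True

# A triple-quoted string keeps its line breaks.
multi = """
Multiple lines
of text
"""
print(multi.count("\n"))  # 3

# An f-string substitutes the value of the expression inside the braces.
name = "Python"
greeting = f"Hello, {name}!"
print(greeting)  # Hello, Python!
```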
3. Various Number Representations
Python supports numbers in multiple formats.
# Regular number
a = 42
# Binary (0s and 1s only)
b = 0b1010 # 10
# Octal (digits 0-7)
c = 0o12 # 10
# Hexadecimal (digits 0-9 and letters A-F)
d = 0xA # 10
# Float
e = 3.14
# Underscores for readability
f = 1_000_000 # 1000000
# Complex numbers (4j is an imaginary literal; 3 + 4j is an expression)
g = 3 + 4j
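To see how these literals look at the lexical level, the sketch below filters NUMBER tokens with the standard `tokenize` module. The helper name `number_tokens` is an assumption made for this example.

```python
import io
import tokenize


def number_tokens(source):
    """Return the text of every NUMBER token in source."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.NUMBER
    ]


# Different spellings, same value.
print(0b1010 == 0o12 == 0xA == 10)  # True
print(1_000_000 == 1000000)         # True

# The lexer keeps the original spelling of each literal.
print(number_tokens("f = 1_000_000 + 0xA\n"))  # ['1_000_000', '0xA']

# 3 + 4j is two NUMBER tokens joined by an operator, not one token.
print(number_tokens("g = 3 + 4j\n"))  # ['3', '4j']
```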
4. Python's Built-in Tokenizer
Python's standard library includes the tokenize module, which lets you inspect tokens directly.
Tokenizing x = 42 gives:
NAME : 'x'
OP : '='
NUMBER : '42'
NEWLINE : '\n'
ENDMARKER : ''
We can confirm it breaks into NAME, OP, NUMBER tokens as expected!
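The output above can be reproduced with a short script along these lines. The helper name `show_tokens` is an assumption; `tokenize.generate_tokens` reads text line by line, so the source is wrapped in `io.StringIO`.

```python
import io
import tokenize


def show_tokens(source):
    """Return (type_name, text) pairs for every token in source."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


for type_name, text in show_tokens("x = 42\n"):
    print(f"{type_name:<10}: {text!r}")
```

Note that the token text is always a string, which is why the number appears as '42' rather than 42.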
Summary
What We Learned
- Lexical Analysis: The first step of breaking code into tokens
- Tokens: Smallest meaningful units (variable names, numbers, operators, etc.)
- State Machine: A method of processing input by changing states
  - Hunger/Full (humans)
  - Traffic lights (red/yellow/green)
  - Calculator (waiting for number/operator)
Python's Unique Features
- Indentation as tokens (INDENT, DEDENT)
- Various string representations (quotes, f-strings, raw strings)
- Multiple number bases (binary, octal, hexadecimal)
- Direct token viewing with the tokenize module
Why This Matters
Understanding lexical analysis helps you:
- Know how Python reads code
- Better understand syntax errors
- Even create your own programming language!