Python Lecture 2 - Lexical Structure of Python
Learn how Python reads and understands your code, starting from the first step of lexical analysis
What is Lexical Structure?
When you write Python code, how does Python understand it?
x = 42 + y
To us, this means "assign the sum of 42 and y to variable x." But to Python, it's just a sequence of characters.
'x', ' ', '=', ' ', '4', '2', ' ', '+', ' ', 'y'
Before Python can do anything, it has to break these characters into meaningful units, much as we group letters into words when reading a sentence.
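To see the "just a sequence of characters" view for yourself, here is a minimal sketch. The variable names `source` and `chars` are illustrative only.

```python
# The raw text, before any lexical analysis has happened.
source = "x = 42 + y"

# Splitting a string into its individual characters.
chars = list(source)
print(chars)
# ['x', ' ', '=', ' ', '4', '2', ' ', '+', ' ', 'y']
print(len(chars))  # 10
```

Note that at this point '4' and '2' are still two separate characters. Nothing has grouped them into the number 42 yet.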
Tokens: The Smallest Units of Meaning
These meaningful pieces are called tokens.
x = 42 + y
# Tokens Python sees:
# NAME: 'x'
# OP: '='
# NUMBER: 42
# OP: '+'
# NAME: 'y'
Comparing with natural language makes this click.
Sentence: "I eat rice"
Words: ["I", "eat", "rice"]
Code: "x = 42"
Tokens: [NAME:'x', OP:'=', NUMBER:42]
Lexical Analysis: Breaking into Tokens
Lexical Analysis is the process of reading code and breaking it into tokens.
This is the very first thing the Python interpreter does.
[Source Code]
↓
[Lexical Analysis] ← Today's topic
↓
[Token List]
↓
[Syntax Analysis]
↓
...
Why bother with this step? Because understanding code all at once is too complex. Python processes it in stages, and lexical analysis is stage one.
How Lexical Analyzers Work
Lexical analyzers are typically built as state machines.
That sounds complicated, but you already know the concept from everyday life.
State Machine Examples
1. Hunger States
Think about your own hunger.
stateDiagram-v2
Hungry --> Full: Eat
Full --> Hungry: Exercise
You're in a specific state (hungry or full), and when you receive input (eating or exercising), you switch to a different state. That's a state machine.
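One common way to write a state machine in code is a lookup table from (current state, input) to next state. The sketch below mirrors the hunger diagram. The names `TRANSITIONS` and `step` are made up for this illustration, and ignoring inputs that do not apply is an assumption of the sketch.

```python
# Transition table: (current state, input) -> next state.
TRANSITIONS = {
    ("Hungry", "Eat"): "Full",
    ("Full", "Exercise"): "Hungry",
}

def step(state, event):
    """Return the next state, or stay in the same state if the input does not apply."""
    return TRANSITIONS.get((state, event), state)

state = "Hungry"
for event in ["Eat", "Exercise", "Eat"]:
    state = step(state, event)
print(state)  # Full
```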
2. Traffic Lights
Traffic lights work the same way.
stateDiagram-v2
Red --> Green: 30 seconds
Green --> Yellow: 30 seconds
Yellow --> Red: 3 seconds
Three states. Time passes, states change. Red, green, yellow, red -- on repeat.
3. Calculator Input
Think about using a calculator.
stateDiagram-v2
Start --> WaitingNumber
WaitingNumber --> WaitingOperator: Number input (5, 3)
WaitingOperator --> WaitingNumber: Operator input (+)
WaitingOperator --> End: (=) input
While it is taking input, the calculator is always in one of two states: waiting for a number or waiting for an operator.
Processing "5 + 3 =" goes like this:
- Start -> WaitingNumber -> '5' input -> WaitingOperator
- WaitingOperator -> '+' input -> WaitingNumber
- WaitingNumber -> '3' input -> WaitingOperator
- WaitingOperator -> '=' input -> End
That's how state machines process input -- by changing states.
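The calculator walk-through can be sketched in a few lines of Python. This is a simplified illustration under some assumptions: `run_calculator` is a hypothetical helper, it only tracks states (it does not compute a result), and it only knows the '+' operator from the example.

```python
def run_calculator(keys):
    """Return the list of states visited while processing the key presses."""
    state = "WaitingNumber"  # Start moves straight to WaitingNumber
    trace = [state]
    for key in keys:
        if state == "WaitingNumber" and key.isdigit():
            state = "WaitingOperator"
        elif state == "WaitingOperator" and key == "+":
            state = "WaitingNumber"
        elif state == "WaitingOperator" and key == "=":
            state = "End"
        else:
            # Any input the diagram has no arrow for is rejected.
            raise ValueError(f"unexpected {key!r} in state {state}")
        trace.append(state)
    return trace

print(run_calculator(["5", "+", "3", "="]))
# ['WaitingNumber', 'WaitingOperator', 'WaitingNumber', 'WaitingOperator', 'End']
```

Pressing an operator first, as in `run_calculator(["+"])`, raises `ValueError`, because no transition exists for that input in the WaitingNumber state.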
Lexical Analysis State Machine
Now let's see how Python's lexical analysis actually works.
Recognizing Integers
Here's what happens when reading the string 123.
stateDiagram-v2
[*] --> Start
Start --> ReadingNumber: Digit ('1')
ReadingNumber --> ReadingNumber: Digits ('2', '3')
ReadingNumber --> [*]: Space or operator
note right of ReadingNumber
Collecting digits
Example: "123"
end note
It reads character by character, stays in the "I'm reading a number" state, then completes the token when it hits a non-digit character.
Result: NUMBER token: 123
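As a rough sketch of the ReadingNumber loop, the function below keeps consuming characters while they are digits and stops at the first non-digit. `read_integer` and `DIGITS` are names invented for this example, and CPython's real tokenizer is considerably more involved.

```python
DIGITS = "0123456789"

def read_integer(text, pos=0):
    """Collect consecutive digits starting at pos; return (token text, next position)."""
    start = pos
    while pos < len(text) and text[pos] in DIGITS:  # stay in ReadingNumber
        pos += 1
    if pos == start:
        raise ValueError(f"no digit at position {start}")
    return text[start:pos], pos

print(read_integer("123 + 4"))  # ('123', 3)
```

The returned position tells the caller where to continue scanning for the next token.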
Recognizing Floats
What about 3.14 with a decimal point?
stateDiagram-v2
[*] --> Start
Start --> ReadingInteger: Digit ('3')
ReadingInteger --> ReadingInteger: Digit
ReadingInteger --> ReadingFloat: Decimal ('.')
ReadingFloat --> ReadingFloat: Digits ('1', '4')
ReadingInteger --> [*]: Space/operator
ReadingFloat --> [*]: Space/operator
note right of ReadingFloat
Collecting decimal digits
Example: "3.14"
end note
While reading an integer, it encounters . and thinks "ah, a float!" It switches state and keeps reading decimal digits.
Result: NUMBER token: 3.14
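The same idea extends to floats by adding one more state. The sketch below follows the diagram's state names. `read_number` is a hypothetical helper, and it deliberately ignores exponents, underscores, and other forms that real Python number literals allow.

```python
DIGITS = "0123456789"

def read_number(text, pos=0):
    """Scan an integer or simple float; return (token text, final state)."""
    if pos >= len(text) or text[pos] not in DIGITS:
        raise ValueError(f"no digit at position {pos}")
    state = "ReadingInteger"
    start = pos
    while pos < len(text):
        ch = text[pos]
        if ch in DIGITS:
            pos += 1                      # keep collecting digits
        elif ch == "." and state == "ReadingInteger":
            state = "ReadingFloat"        # first '.' switches state
            pos += 1
        else:
            break                         # space, operator, or a second '.'
    return text[start:pos], state

print(read_number("3.14 + 1"))  # ('3.14', 'ReadingFloat')
print(read_number("42+y"))      # ('42', 'ReadingInteger')
```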
Recognizing Multiple Token Types
How does Python read x = 42 + y?
stateDiagram-v2
[*] --> Start
Start --> ReadingIdentifier: Letter ('x')
Start --> ReadingNumber: Digit ('4')
Start --> Operator: Operators ('+', '=')
ReadingIdentifier --> ReadingIdentifier: Letter/digit
ReadingIdentifier --> Start: Space (create NAME token)
ReadingNumber --> ReadingNumber: Digit
ReadingNumber --> Start: Space (create NUMBER token)
Operator --> Start: (create OP token)
Start --> [*]: End of string
Processing x = 42 + y:
- x -> letter -> reading identifier -> space -> NAME: 'x'
- = -> operator -> OP: '='
- 42 -> digits -> space -> NUMBER: 42
- + -> operator -> OP: '+'
- y -> letter -> end -> NAME: 'y'
Final result: [NAME:'x', OP:'=', NUMBER:42, OP:'+', NAME:'y']
State machines let you systematically break complex code into tokens.
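Putting the three states together gives a toy tokenizer. This is a teaching sketch, not how CPython does it: `tokenize_simple` is a made-up name, it only knows the '+' and '=' operators from the example, and it treats any space as a separator.

```python
DIGITS = "0123456789"

def tokenize_simple(source):
    """Split source into (kind, text) pairs using the three states from the diagram."""
    tokens = []
    pos = 0
    while pos < len(source):
        ch = source[pos]
        if ch == " ":
            pos += 1                                  # Start: skip spaces
        elif ch.isalpha():
            start = pos                               # ReadingIdentifier
            while pos < len(source) and source[pos].isalnum():
                pos += 1
            tokens.append(("NAME", source[start:pos]))
        elif ch in DIGITS:
            start = pos                               # ReadingNumber
            while pos < len(source) and source[pos] in DIGITS:
                pos += 1
            tokens.append(("NUMBER", source[start:pos]))
        elif ch in "+=":
            tokens.append(("OP", ch))                 # Operator
            pos += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    return tokens

print(tokenize_simple("x = 42 + y"))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('NAME', 'y')]
```

Because each inner loop stops at the first character that does not belong to the token, this version also handles input without spaces, such as `x=42+y`.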
Python's Lexical Rules
Let's look at what makes Python's lexical structure unique.
1. Indentation is a Token
Python's most distinctive feature -- indentation is syntax.
def greet():
    print("hello")      # Indentation level 1
    if True:
        print("world")  # Indentation level 2
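Many other languages use curly braces {} to mark blocks. Python's tokenizer instead turns changes in indentation into actual tokens: INDENT when a block starts and DEDENT when it ends.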
# Tokens Python sees:
[
NAME: 'def',
NAME: 'greet',
OP: '(',
OP: ')',
OP: ':',
NEWLINE,
INDENT, # Indentation starts!
NAME: 'print',
...
NEWLINE,
DEDENT, # Indentation ends!
]
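You can check the INDENT and DEDENT tokens with the standard library's `tokenize` module. The sketch below tokenizes a small function and keeps only the layout-related token names. The variable names are illustrative.

```python
import io
import tokenize

source = (
    "def greet():\n"
    "    print('hello')\n"
    "    if True:\n"
    "        print('world')\n"
)

# Map each token to the name of its type, e.g. 'NAME', 'OP', 'INDENT'.
names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

layout = [n for n in names if n in ("NEWLINE", "INDENT", "DEDENT")]
print(layout)
# ['NEWLINE', 'INDENT', 'NEWLINE', 'NEWLINE', 'INDENT', 'NEWLINE', 'DEDENT', 'DEDENT']
```

Two nested blocks open, so two INDENT tokens appear, and both are closed by DEDENT tokens at the end of the input.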
2. Many Ways to Write Strings
Python gives you multiple string representations.
# Basic strings
s1 = 'hello'
s2 = "world"
# Multi-line strings
s3 = """
Multiple lines
of text
"""
# Raw strings (backslashes are kept literally, not treated as escapes)
s4 = r'\n is not a newline, just \n'
# f-strings (variable interpolation)
name = "Python"
s5 = f"Hello, {name}!" # "Hello, Python!"
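A few quick checks make the differences between these string forms concrete. This is just an illustrative sketch that reuses the variable names from the examples above.

```python
# Single and double quotes produce the same kind of string.
print('hello' == "hello")   # True

# A triple-quoted string keeps its line breaks.
s3 = """
Multiple lines
of text
"""
print(s3.count("\n"))       # 3

# In a raw string, \n stays as two characters: a backslash and the letter n.
s4 = r'\n'
print(len(s4))              # 2
print(len('\n'))            # 1

# An f-string substitutes the value of the expression in braces.
name = "Python"
s5 = f"Hello, {name}!"
print(s5)                   # Hello, Python!
```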
3. Many Ways to Write Numbers
Python supports numbers in multiple formats.
# Regular number
a = 42
# Binary (0s and 1s only)
b = 0b1010 # 10
# Octal (0~7)
c = 0o12 # 10
# Hexadecimal (0~9, A~F)
d = 0xA # 10
# Float
e = 3.14
# Underscores for readability
f = 1_000_000 # 1000000
# Complex numbers (only 4j is a literal; 3 + 4j is an expression that adds two NUMBER tokens)
g = 3 + 4j
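The sketch below checks that the different notations really denote the same values, and uses the standard `tokenize` module to show which pieces of `g = 3 + 4j` are NUMBER tokens. The variable names are illustrative.

```python
import io
import tokenize

# Different spellings, same integer value.
print(0b1010 == 0o12 == 0xA == 10)   # True
print(1_000_000 == 1000000)          # True

# Tokenize the complex-number example and keep only the NUMBER tokens.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO("g = 3 + 4j\n").readline)
]
numbers = [text for kind, text in tokens if kind == "NUMBER"]
print(numbers)                       # ['3', '4j']

# The complex value only exists after the addition is evaluated.
print(type(3 + 4j).__name__)         # complex
```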
4. Python's Built-in Tokenizer
Python's standard library includes a tokenize module that lets you inspect the token stream directly.
Tokenizing the line x = 42 (followed by a newline) gives you:
NAME : 'x'
OP : '='
NUMBER : '42'
NEWLINE : '\n'
ENDMARKER : ''
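It breaks into NAME, OP, and NUMBER tokens exactly as we'd expect, followed by a NEWLINE token that ends the logical line and an ENDMARKER that signals the end of the input.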
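One way to produce a listing like this is with `tokenize.generate_tokens`, which takes a `readline` callable and yields token objects. The sketch below is a minimal example. The `rows` list and the column width are choices made for this illustration.

```python
import io
import tokenize

source = "x = 42\n"

# Collect (token type name, token text) pairs.
rows = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

for kind, text in rows:
    print(f"{kind:<10}: {text!r}")
# NAME      : 'x'
# OP        : '='
# NUMBER    : '42'
# NEWLINE   : '\n'
# ENDMARKER : ''
```

Note that the tokenizer reports token text as strings, so the NUMBER token carries '42' rather than the integer 42. Converting it to a value happens in a later stage.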
Understanding lexical analysis helps you see how Python reads code from the ground up. It also makes syntax errors much less mysterious. And if you ever want to build your own programming language, this is where you'd start.