What is Lexical Structure?

When you write Python code, how does Python understand it?

x = 42 + y

To us, this means "assign the sum of 42 and y to variable x." But to Python, it's just a sequence of characters.

'x', ' ', '=', ' ', '4', '2', ' ', '+', ' ', 'y'

Before Python can do anything, it has to break these characters into meaningful units -- just as we recognize groups of letters as words when reading a sentence.

Tokens: The Smallest Units of Meaning

These meaningful pieces are called tokens.

x = 42 + y

# Tokens Python sees:
# NAME: 'x'
# OP: '='
# NUMBER: 42
# OP: '+'
# NAME: 'y'

Comparing with natural language makes this click.

Sentence: "I eat rice"
Words: ["I", "eat", "rice"]

Code: "x = 42"
Tokens: [NAME:'x', OP:'=', NUMBER:42]

Lexical Analysis: Breaking into Tokens

Lexical Analysis is the process of reading code and breaking it into tokens.
This is the very first thing the Python interpreter does.

[Source Code]
    ↓
[Lexical Analysis] ← Today's topic
    ↓
[Token List]
    ↓
[Syntax Analysis]
    ↓
...

Why bother with this step? Because understanding code all at once is too complex. Python processes it in stages, and lexical analysis is stage one.

How Lexical Analyzers Work

Lexical analyzers use a State Machine.
Sounds complicated, but you already know this concept from everyday life.

State Machine Examples

1. Hunger States

Think about your own hunger.

stateDiagram-v2
    Hungry --> Full: Eat
    Full --> Hungry: Exercise

You're in a specific state (hungry or full), and when you receive input (eating or exercising), you switch to a different state. That's a state machine.

2. Traffic Lights

Traffic lights work the same way.

stateDiagram-v2
    Red --> Green: 30 seconds
    Green --> Yellow: 30 seconds
    Yellow --> Red: 3 seconds

Three states. Time passes, states change. Red, green, yellow, red -- on repeat.
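Those three transitions fit in a single dictionary. Here's a toy sketch (the `TRANSITIONS` table and the tick loop are invented for illustration):

```python
# Traffic-light state machine: each state maps to the next one.
TRANSITIONS = {"Red": "Green", "Green": "Yellow", "Yellow": "Red"}

state = "Red"
seen = []
for _ in range(4):              # four timer ticks
    seen.append(state)
    state = TRANSITIONS[state]  # time passes, state changes

print(seen)  # ['Red', 'Green', 'Yellow', 'Red']
```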

3. Calculator Input

Think about using a calculator.

stateDiagram-v2
    Start --> WaitingNumber
    WaitingNumber --> WaitingOperator: Number input (5, 3)
    WaitingOperator --> WaitingNumber: Operator input (+)
    WaitingOperator --> End: (=) input

While reading input, a calculator is in one of two states -- waiting for a number or waiting for an operator.

Processing "5 + 3 =" goes like this:

  1. Start -> WaitingNumber -> '5' input -> WaitingOperator
  2. WaitingOperator -> '+' input -> WaitingNumber
  3. WaitingNumber -> '3' input -> WaitingOperator
  4. WaitingOperator -> '=' input -> End

That's how state machines process input -- by changing states.
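The walkthrough above can be sketched in a few lines of Python. This is a toy model -- the `run_calculator` helper and its state names are made up to mirror the diagram, not real calculator firmware:

```python
# Calculator state machine: track the current state for each input character.
def run_calculator(inputs):
    state = "WaitingNumber"
    trace = [state]
    for ch in inputs:
        if state == "WaitingNumber" and ch.isdigit():
            state = "WaitingOperator"       # got a number, now expect an operator
        elif state == "WaitingOperator" and ch in "+-*/":
            state = "WaitingNumber"         # got an operator, now expect a number
        elif state == "WaitingOperator" and ch == "=":
            state = "End"                   # '=' finishes the calculation
        else:
            raise ValueError(f"unexpected input {ch!r} in state {state}")
        trace.append(state)
    return trace

print(run_calculator("5+3="))
# ['WaitingNumber', 'WaitingOperator', 'WaitingNumber', 'WaitingOperator', 'End']
```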

Lexical Analysis State Machine

Now let's see how Python's lexical analysis actually works.

Recognizing Integers

Here's what happens when reading the string 123.

stateDiagram-v2
    [*] --> Start
    Start --> ReadingNumber: Digit ('1')
    ReadingNumber --> ReadingNumber: Digits ('2', '3')
    ReadingNumber --> [*]: Space or operator

    note right of ReadingNumber
        Collecting digits
        Example: "123"
    end note

It reads character by character, stays in the "I'm reading a number" state, then completes the token when it hits a non-digit character.

Result: NUMBER token: 123
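A minimal sketch of this state machine in Python. The `scan_number` helper is invented for illustration -- it is not how CPython's tokenizer is actually implemented:

```python
# Integer recognition: stay in ReadingNumber while digits arrive,
# complete the token at the first non-digit character.
def scan_number(text, pos=0):
    state = "Start"
    digits = ""
    while pos < len(text) and text[pos].isdigit():
        state = "ReadingNumber"  # collecting digits
        digits += text[pos]
        pos += 1
    # A space or operator ends the token; pos now points at it.
    return ("NUMBER", int(digits)), pos

print(scan_number("123 + 4"))  # (('NUMBER', 123), 3)
```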

Recognizing Floats

What about 3.14 with a decimal point?

stateDiagram-v2
    [*] --> Start
    Start --> ReadingInteger: Digit ('3')
    ReadingInteger --> ReadingInteger: Digit
    ReadingInteger --> ReadingFloat: Decimal ('.')
    ReadingFloat --> ReadingFloat: Digits ('1', '4')
    ReadingInteger --> [*]: Space/operator
    ReadingFloat --> [*]: Space/operator

    note right of ReadingFloat
        Collecting decimal digits
        Example: "3.14"
    end note

While reading an integer, it encounters '.' and thinks "ah, a float!" It switches state and keeps reading decimal digits.

Result: NUMBER token: 3.14
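The same sketch, extended with the ReadingFloat state (again a toy `scan_number`, not Python's real tokenizer):

```python
# Number recognition with a decimal point: the first '.' switches
# the machine from ReadingInteger to ReadingFloat.
def scan_number(text, pos=0):
    state = "ReadingInteger"
    lexeme = ""
    while pos < len(text):
        ch = text[pos]
        if ch.isdigit():
            lexeme += ch                 # keep collecting digits
        elif ch == "." and state == "ReadingInteger":
            state = "ReadingFloat"       # "ah, a float!"
            lexeme += ch
        else:
            break                        # space/operator ends the token
        pos += 1
    value = float(lexeme) if state == "ReadingFloat" else int(lexeme)
    return ("NUMBER", value), pos

print(scan_number("3.14 + 1"))  # (('NUMBER', 3.14), 4)
```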

Recognizing Multiple Token Types

How does Python read x = 42 + y?

stateDiagram-v2
    [*] --> Start
    Start --> ReadingIdentifier: Letter ('x')
    Start --> ReadingNumber: Digit ('4')
    Start --> Operator: Operators ('+', '=')

    ReadingIdentifier --> ReadingIdentifier: Letter/digit
    ReadingIdentifier --> Start: Space (create NAME token)

    ReadingNumber --> ReadingNumber: Digit
    ReadingNumber --> Start: Space (create NUMBER token)

    Operator --> Start: (create OP token)

    Start --> [*]: End of string

Processing x = 42 + y:

  1. x -> letter -> reading identifier -> space -> NAME: 'x'
  2. = -> operator -> OP: '='
  3. 42 -> digits -> space -> NUMBER: 42
  4. + -> operator -> OP: '+'
  5. y -> letter -> end -> NAME: 'y'

Final result: [NAME:'x', OP:'=', NUMBER:42, OP:'+', NAME:'y']

State machines let you systematically break complex code into tokens.
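Putting the pieces together, here's a toy version of the whole state machine. The `tokenize_line` helper handles only identifiers, integers, and single-character operators -- a sketch of the idea, not Python's actual tokenizer:

```python
# Combined state machine: the Start state dispatches on the first
# character, then each sub-state collects its token and returns to Start.
def tokenize_line(text):
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():                     # Start: skip spaces
            pos += 1
        elif ch.isalpha() or ch == "_":      # ReadingIdentifier
            start = pos
            while pos < len(text) and (text[pos].isalnum() or text[pos] == "_"):
                pos += 1
            tokens.append(("NAME", text[start:pos]))
        elif ch.isdigit():                   # ReadingNumber
            start = pos
            while pos < len(text) and text[pos].isdigit():
                pos += 1
            tokens.append(("NUMBER", int(text[start:pos])))
        elif ch in "+-*/=":                  # Operator: emit OP, back to Start
            tokens.append(("OP", ch))
            pos += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r}")
    return tokens

print(tokenize_line("x = 42 + y"))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', 42), ('OP', '+'), ('NAME', 'y')]
```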

Python's Lexical Rules

Let's look at what makes Python's lexical structure unique.

1. Indentation is a Token

Python's most distinctive feature -- indentation is syntax.

def greet():
    print("hello")  # Indentation level 1
    if True:
        print("world")  # Indentation level 2

Other languages use curly braces {}. Python turns indentation into actual tokens.

# Tokens Python sees:
[
    NAME: 'def',
    NAME: 'greet',
    OP: '(',
    OP: ')',
    OP: ':',
    NEWLINE,
    INDENT,        # Indentation starts!
    NAME: 'print',
    ...
    NEWLINE,
    DEDENT,        # Indentation ends!
]
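You can watch the INDENT and DEDENT tokens appear yourself with the standard tokenize module:

```python
import io
import tokenize

source = (
    "def greet():\n"
    "    print('hello')\n"
)

# tokenize works on a readline callable, so wrap the string in StringIO.
kinds = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(source).readline)]
print(kinds)
# INDENT appears right after the def line's NEWLINE,
# and DEDENT when the indented block ends.
```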

2. Many Ways to Write Strings

Python gives you multiple string representations.

# Basic strings
s1 = 'hello'
s2 = "world"

# Multi-line strings
s3 = """
Multiple lines
of text
"""

# Raw strings (escape sequences are left as-is)
s4 = r'\n is not a newline, just \n'

# f-strings (variable interpolation)
name = "Python"
s5 = f"Hello, {name}!"  # "Hello, Python!"

3. Many Ways to Write Numbers

Python supports numbers in multiple formats.

# Regular number
a = 42

# Binary (0s and 1s only)
b = 0b1010  # 10

# Octal (0~7)
c = 0o12    # 10

# Hexadecimal (0~9, A~F)
d = 0xA     # 10

# Float
e = 3.14

# Underscores for readability
f = 1_000_000  # 1000000

# Complex numbers
g = 3 + 4j

4. Python's Built-in Tokenizer

Python has a tokenize module that lets you see token analysis directly.

Tokenizing x = 42 gives you:

NAME       : 'x'
OP         : '='
NUMBER     : '42'
NEWLINE    : '\n'
ENDMARKER  : ''

It breaks into NAME, OP, NUMBER tokens exactly as we'd expect.
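Here's how to produce that listing with the tokenize module (the column formatting is just for readability):

```python
import io
import tokenize

# generate_tokens takes a readline callable, so wrap the source string.
for tok in tokenize.generate_tokens(io.StringIO("x = 42\n").readline):
    print(f"{tokenize.tok_name[tok.type]:<11}: {tok.string!r}")
```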

Understanding lexical analysis helps you see how Python reads code from the ground up. It also makes syntax errors much less mysterious. And if you ever want to build your own programming language, this is where you'd start.