What is Lexical Structure?

When we write Python code, how does Python understand it?

x = 42 + y

To us, this means "assign the sum of 42 and y to variable x," but to Python, it's just a sequence of characters.

'x', ' ', '=', ' ', '4', '2', ' ', '+', ' ', 'y'

Before Python can execute code, it must break these characters into meaningful units, just like we recognize letters as words when reading.
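
To make the "sequence of characters" view concrete, here is a minimal sketch that splits the example line into individual characters. The variable names `source` and `chars` are chosen just for this illustration.

```python
source = "x = 42 + y"

# Before lexical analysis, the code is nothing more than a sequence of characters.
chars = list(source)
print(chars)
# ['x', ' ', '=', ' ', '4', '2', ' ', '+', ' ', 'y']
print(len(chars))  # 10
```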

Tokens: The Smallest Units of Meaning

The meaningful pieces extracted from code are called tokens.

x = 42 + y

# Tokens Python sees:
# NAME: 'x'
# OP: '='
# NUMBER: 42
# OP: '+'
# NAME: 'y'

Comparing with natural language makes this clearer.

Sentence: "I eat rice"
Words: ["I", "eat", "rice"]

Code: "x = 42"
Tokens: [NAME:'x', OP:'=', NUMBER:42]
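
The comparison above can be sketched in code. The `Token` class below is a hypothetical helper made up for this illustration, not something Python itself exposes under that name; it only shows that a token pairs a category with the text it came from.

```python
from typing import NamedTuple


class Token(NamedTuple):
    kind: str  # category label such as "NAME", "OP", or "NUMBER"
    text: str  # the exact characters taken from the source


# Splitting a sentence on spaces yields bare words...
words = "I eat rice".split()
print(words)  # ['I', 'eat', 'rice']

# ...while a token also records what kind of piece it is.
tokens = [Token("NAME", "x"), Token("OP", "="), Token("NUMBER", "42")]
for tok in tokens:
    print(f"{tok.kind}: {tok.text!r}")
```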

Lexical Analysis: Breaking into Tokens

Lexical Analysis is the process of reading code and breaking it into tokens.
This is the first step the Python interpreter performs.

[Source Code]
    ↓
[Lexical Analysis] ← Today's topic
    ↓
[Token List]
    ↓
[Syntax Analysis]
    ↓
...

Why is this process necessary?

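Making sense of raw characters and program structure in a single pass would be too complex. Python instead processes code in stages, and lexical analysis is the first: once the characters are grouped into tokens, the later stages no longer have to look at individual characters.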
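
The stages can be observed from Python itself with standard-library tools. This is a rough sketch rather than a picture of the interpreter's internals: the `tokenize` module is only a window onto the token stream, and `ast.parse` performs its own tokenizing internally. The example uses `42 + 1` instead of `42 + y` so that the last step can actually run.

```python
import ast
import io
import tokenize

source = "x = 42 + 1\n"

# Stage 1: lexical analysis groups characters into tokens.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
print([tok.string for tok in tokens if tok.string.strip()])
# ['x', '=', '42', '+', '1']

# Stage 2: syntax analysis builds a tree from the code.
tree = ast.parse(source)
print(type(tree.body[0]).__name__)  # Assign

# Later stages: compile the tree and execute the result.
namespace = {}
exec(compile(tree, "<example>", "exec"), namespace)
print(namespace["x"])  # 43
```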

How Lexical Analyzers Work

Lexical analyzers are commonly built as state machines.
Though this might sound complex, it's a concept we encounter everywhere in daily life.

State Machine Examples

1. Human Hunger States

Consider our hunger states.

stateDiagram-v2
    Hungry --> Full: Eat
    Full --> Hungry: Exercise
  • States: Hungry, Full
  • Transition: From hungry to full when eating
  • Reverse transition: From full to hungry when exercising

This is a state machine: a system that is always in one specific state and moves to another state when it receives input.
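
One common way to write such a machine down is a transition table. The sketch below assumes a dictionary keyed by (state, input); the names `TRANSITIONS` and `step` are made up for this example.

```python
# Transition table: (current state, input) -> next state.
TRANSITIONS = {
    ("Hungry", "eat"): "Full",
    ("Full", "exercise"): "Hungry",
}


def step(state, event):
    """Return the next state, staying put if the input does not apply."""
    return TRANSITIONS.get((state, event), state)


state = "Hungry"
for event in ["eat", "exercise", "exercise"]:
    state = step(state, event)
    print(event, "->", state)
# eat -> Full
# exercise -> Hungry
# exercise -> Hungry
```

Exercising while already hungry has no entry in the table, so the state simply stays the same.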

2. Traffic Lights

Traffic lights are also state machines.

stateDiagram-v2
    Red --> Green: 30 seconds
    Green --> Yellow: 30 seconds
    Yellow --> Red: 3 seconds
  • States: Red, Yellow, Green
  • Transition: States change as time passes
  • Pattern: Red → Green → Yellow → Red (repeats)
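
The traffic light differs from the hunger example in that time, not an external input, drives each transition. A minimal sketch, assuming each state stores its successor and how long it lasts (`LIGHTS` and `run` are illustrative names):

```python
# Each state maps to (next state, seconds spent in the state before switching).
LIGHTS = {
    "Red": ("Green", 30),
    "Green": ("Yellow", 30),
    "Yellow": ("Red", 3),
}


def run(start, steps):
    """Follow the cycle for a number of transitions; return states and elapsed time."""
    visited = [start]
    elapsed = 0
    state = start
    for _ in range(steps):
        next_state, seconds = LIGHTS[state]
        elapsed += seconds
        state = next_state
        visited.append(state)
    return visited, elapsed


print(run("Red", 3))  # (['Red', 'Green', 'Yellow', 'Red'], 63)
```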

3. Calculator Input States

Think about using a calculator.

stateDiagram-v2
    Start --> WaitingNumber
    WaitingNumber --> WaitingOperator: Number input (5, 3)
    WaitingOperator --> WaitingNumber: Operator input (+)
    WaitingOperator --> End: (=) input

Between Start and End, the calculator is always in one of two states:
- Waiting for number (WaitingNumber)
- Waiting for operator (WaitingOperator)

Here is how it processes "5 + 3 =":

  1. Start → WaitingNumber → '5' input → WaitingOperator
  2. WaitingOperator → '+' input → WaitingNumber
  3. WaitingNumber → '3' input → WaitingOperator
  4. WaitingOperator → '=' input → End

This is how state machines process input by changing states.
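
The trace above can be reproduced with a small function. This is only a sketch of the diagram, not a working calculator: it tracks states and ignores the arithmetic, and `run_calculator` is a name invented for this example.

```python
def run_calculator(keys):
    """Return the list of states visited while processing the key presses."""
    state = "WaitingNumber"  # Start moves straight to WaitingNumber
    trace = [state]
    for key in keys:
        if state == "WaitingNumber" and key in "0123456789":
            state = "WaitingOperator"
        elif state == "WaitingOperator" and key == "+":
            state = "WaitingNumber"
        elif state == "WaitingOperator" and key == "=":
            state = "End"
        else:
            raise ValueError(f"unexpected key {key!r} in state {state}")
        trace.append(state)
    return trace


print(run_calculator(["5", "+", "3", "="]))
# ['WaitingNumber', 'WaitingOperator', 'WaitingNumber', 'WaitingOperator', 'End']
```

An input such as `["+"]` raises `ValueError`, because the diagram has no transition for an operator while the machine is waiting for a number.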

Lexical Analysis State Machine

Now let's see how Python's lexical analysis works.

Recognizing Integers

Here are the state transitions when reading the string 123:

stateDiagram-v2
    [*] --> Start
    Start --> ReadingNumber: Digit ('1')
    ReadingNumber --> ReadingNumber: Digits ('2', '3')
    ReadingNumber --> [*]: Space or operator

    note right of ReadingNumber
        Collecting digits
        Example: "123"
    end note

The analyzer reads character by character, staying in the "I'm reading a number" state, and completes the token when it encounters a non-digit character.

Result: NUMBER token: 123
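
A hand-written version of this diagram might look like the sketch below. It is a simplified illustration, not CPython's actual tokenizer, and `read_integer` is a hypothetical helper name.

```python
DIGITS = "0123456789"


def read_integer(text, pos=0):
    """Collect consecutive digits starting at pos; return (lexeme, next_pos)."""
    state = "Start"
    start = pos
    while pos < len(text) and text[pos] in DIGITS:
        state = "ReadingNumber"  # stay here while digits keep arriving
        pos += 1
    if state == "Start":
        raise ValueError(f"expected a digit at position {start}")
    # A non-digit (or the end of the text) completes the token.
    return text[start:pos], pos


print(read_integer("123 + 4"))  # ('123', 3)
```

Returning the position after the token lets a caller continue scanning from where the number ended.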

Recognizing Floats

What happens with 3.14, which has a decimal point?

stateDiagram-v2
    [*] --> Start
    Start --> ReadingInteger: Digit ('3')
    ReadingInteger --> ReadingInteger: Digit
    ReadingInteger --> ReadingFloat: Decimal ('.')
    ReadingFloat --> ReadingFloat: Digits ('1', '4')
    ReadingInteger --> [*]: Space/operator
    ReadingFloat --> [*]: Space/operator

    note right of ReadingFloat
        Collecting decimal digits
        Example: "3.14"
    end note

When the analyzer is reading an integer and encounters a ., it concludes "Ah, a float!" and switches state to continue reading the decimal digits.

Result: NUMBER token: 3.14
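
The float diagram can be sketched the same way, with the final state telling us which kind of number was read. This follows the diagram only; real Python also accepts forms such as `.5` or `1e3`, which this sketch deliberately ignores. `read_number` is a made-up name.

```python
DIGITS = "0123456789"


def read_number(text, pos=0):
    """Follow the diagram's states; return (final_state, lexeme)."""
    state = "Start"
    start = pos
    while pos < len(text):
        ch = text[pos]
        if ch in DIGITS:
            if state == "Start":
                state = "ReadingInteger"
            # In ReadingInteger or ReadingFloat a digit keeps the current state.
        elif ch == "." and state == "ReadingInteger":
            state = "ReadingFloat"  # first decimal point: switch to float
        else:
            break  # space, operator, or a second '.' ends the token
        pos += 1
    if state == "Start":
        raise ValueError(f"expected a digit at position {start}")
    return state, text[start:pos]


print(read_number("3.14 "))  # ('ReadingFloat', '3.14')
print(read_number("42+1"))   # ('ReadingInteger', '42')
```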

Recognizing Multiple Token Types

How does Python read the code x = 42 + y?

stateDiagram-v2
    [*] --> Start
    Start --> ReadingIdentifier: Letter ('x')
    Start --> ReadingNumber: Digit ('4')
    Start --> Operator: Operators ('+', '=')

    ReadingIdentifier --> ReadingIdentifier: Letter/digit
    ReadingIdentifier --> Start: Space (create NAME token)

    ReadingNumber --> ReadingNumber: Digit
    ReadingNumber --> Start: Space (create NUMBER token)

    Operator --> Start: (create OP token)

    Start --> [*]: End of string

Here is how the code x = 42 + y is processed:

  1. x → letter → reading identifier → space → NAME: 'x'
  2. = → operator → OP: '='
  3. 42 → digits → space → NUMBER: 42
  4. + → operator → OP: '+'
  5. y → letter → end → NAME: 'y'

Final result: [NAME:'x', OP:'=', NUMBER:42, OP:'+', NAME:'y']

A state machine gives us a systematic way to break complex code into tokens!
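
Putting the three branches together gives a tiny lexer. The sketch below is an illustration under simplifying assumptions (integers only, single-character operators), not how Python's own tokenizer is written; `tokenize_simple` and `OPERATORS` are names invented here. Unlike the diagram, it ends a token on any character that does not belong to it, not only on a space.

```python
DIGITS = "0123456789"
OPERATORS = "+-*/="


def tokenize_simple(source):
    """Split source into (kind, text) pairs with a hand-written state machine."""
    tokens = []
    pos = 0
    while pos < len(source):
        ch = source[pos]
        if ch.isspace():                    # Start: skip whitespace
            pos += 1
        elif ch.isalpha() or ch == "_":     # ReadingIdentifier
            start = pos
            while pos < len(source) and (source[pos].isalnum() or source[pos] == "_"):
                pos += 1
            tokens.append(("NAME", source[start:pos]))
        elif ch in DIGITS:                  # ReadingNumber
            start = pos
            while pos < len(source) and source[pos] in DIGITS:
                pos += 1
            tokens.append(("NUMBER", source[start:pos]))
        elif ch in OPERATORS:               # Operator
            tokens.append(("OP", ch))
            pos += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    return tokens


print(tokenize_simple("x = 42 + y"))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('NAME', 'y')]
```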

Python's Lexical Structure

Let's explore Python's unique lexical rules.

1. Indentation is a Token

Python's most distinctive feature is that indentation is syntax.

def greet():
    print("hello")  # Indentation level 1
    if True:
        print("world")  # Indentation level 2

While many other languages use curly braces {} to mark blocks, Python turns changes in indentation into tokens.

# Tokens Python sees:
[
    NAME: 'def',
    NAME: 'greet',
    OP: '(',
    OP: ')',
    OP: ':',
    NEWLINE,
    INDENT,        # Indentation starts!
    NAME: 'print',
    ...
    NEWLINE,
    DEDENT,        # Indentation ends!
]
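
The indentation tokens can be seen with the standard-library `tokenize` module. This short sketch filters the token stream of the function above down to INDENT and DEDENT; the variable name `layout_tokens` is just for illustration.

```python
import io
import tokenize

source = (
    "def greet():\n"
    "    print('hello')\n"
    "    if True:\n"
    "        print('world')\n"
)

# Keep only the tokens that mark a change in indentation level.
layout_tokens = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type in (tokenize.INDENT, tokenize.DEDENT)
]
print(layout_tokens)  # ['INDENT', 'INDENT', 'DEDENT', 'DEDENT']
```

Two levels of indentation open, so two DEDENT tokens are emitted at the end of the input to close them.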

2. Various String Representations

Python can represent strings in many ways.

# Basic strings
s1 = 'hello'
s2 = "world"

# Multi-line strings
s3 = """
Multiple lines
of text
"""

# Raw strings (backslashes are not treated as escapes)
s4 = r'\n is not a newline, just \n'

# f-strings (variable interpolation)
name = "Python"
s5 = f"Hello, {name}!"  # "Hello, Python!"
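
A few quick checks show what these notations mean in practice. This is a small sketch; the variable names are chosen only for this example.

```python
# Single and double quotes produce the same string.
print('hello' == "hello")  # True

# A raw string keeps the backslash, so r'\n' is two characters.
raw = r'\n'
cooked = '\n'
print(len(raw), len(cooked))  # 2 1

# A triple-quoted string keeps its line breaks.
multi = """
Multiple lines
of text
"""
print(multi.count("\n"))  # 3

# An f-string substitutes the value of the expression in braces.
name = "Python"
greeting = f"Hello, {name}!"
print(greeting)  # Hello, Python!
```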

3. Various Number Representations

Python supports numbers in multiple formats.

# Regular number
a = 42

# Binary (0s and 1s only)
b = 0b1010  # 10

# Octal (digits 0-7)
c = 0o12    # 10

# Hexadecimal (digits 0-9 and A-F)
d = 0xA     # 10

# Float
e = 3.14

# Underscores for readability
f = 1_000_000  # 1000000

# Complex numbers (4j is the imaginary literal; 3 + 4j is an expression made of three tokens)
g = 3 + 4j
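
The sketch below checks that the different notations describe the same values, and uses the standard-library `tokenize` module to show which parts of `g = 3 + 4j` are number tokens. The name `number_tokens` is illustrative.

```python
import io
import tokenize

# Binary, octal, and hexadecimal literals are just different spellings of 10.
print(0b1010 == 0o12 == 0xA == 10)  # True

# Underscores are ignored when the value is computed.
print(1_000_000 == 1000000)  # True

# Lexically, 3 + 4j is two NUMBER tokens joined by an operator.
line = "g = 3 + 4j\n"
number_tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(line).readline)
    if tok.type == tokenize.NUMBER
]
print(number_tokens)  # ['3', '4j']
```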

4. Python's Built-in Tokenizer

Python's standard library includes a tokenize module that lets you inspect the token stream directly.

Tokenizing x = 42 gives:

NAME       : 'x'
OP         : '='
NUMBER     : '42'
NEWLINE    : '\n'
ENDMARKER  : ''

We can confirm it breaks into NAME, OP, NUMBER tokens as expected!
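
A listing like the one above can be produced with a few lines of code. This sketch assumes the source is passed as a string ending in a newline and uses `tokenize.generate_tokens`, which reads text lines; the formatting width is chosen only to line up the columns.

```python
import io
import tokenize

source = "x = 42\n"

# generate_tokens expects a readline-style callable that returns text lines.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(f"{tokenize.tok_name[tok.type]:<10} : {tok.string!r}")
```

Each token also carries its start and end positions (`tok.start`, `tok.end`), which is how tools can point at an exact line and column.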

Summary

What We Learned

  1. Lexical Analysis: The first step of breaking code into tokens
  2. Tokens: Smallest meaningful units (variable names, numbers, operators, etc.)
  3. State Machine: A method of processing input by changing states
     • Hunger/Full (humans)
     • Traffic lights (red/yellow/green)
     • Calculator (waiting for number/operator)

Python's Unique Features

  • Indentation as tokens (INDENT, DEDENT)
  • Various string representations (quotes, f-strings, raw strings)
  • Multiple number bases (binary, octal, hexadecimal)
  • Direct token viewing with tokenize module

Why This Matters

Understanding lexical analysis helps you:
- Know how Python reads code
- Better understand syntax errors
- Even create your own programming language!