GPU Systems 01 - Roadmap to GPU Kernel Engineering
A practical study order from GPU architecture to CUDA, Triton, and kernel optimization
Why Study GPU Systems Separately?
Once you spend enough time around deep learning, you inevitably start using GPUs. At first, loading a model in PyTorch and training it feels like enough. But when models get larger, training gets expensive, and specific operators become bottlenecks, the difference between "using a GPU" and "understanding a GPU" becomes very obvious.
If your goal is to become a GPU Kernel Engineer, being good at library-level usage is not enough. You need to be comfortable with questions like how warps are scheduled, why shared memory matters, and why memory bandwidth keeps showing up as the limiting factor.
This series is about building that intuition. It starts with GPU architecture, then moves into CUDA kernels, Triton, and finally real optimization work.
A Better Order for This Topic
GPU learning gets confusing quickly if you jump straight into framework internals or random optimization tips. A cleaner path is:
- GPU architecture
- CUDA kernel programming
- Triton kernels
- kernel optimization
That order matters because each phase builds directly on the previous one. If you do not understand memory coalescing, Triton's block tiling tends to look like syntax rather than an execution strategy. Once the GPU memory hierarchy and warp execution model click, CUDA and Triton start to look like different interfaces for solving related performance problems.
Phase 1. GPU Architecture
The first thing to internalize is that GPUs organize and execute work very differently from CPUs: thousands of lightweight threads running in lockstep groups, rather than a few fast independent cores.
The core ideas here are:
- the relationship between threads, warps, blocks, and grids
- the differences between global memory, shared memory, and registers
- what memory bandwidth and latency mean in practice
- why occupancy matters
The goal of this phase is simple: you should be able to picture how your code fans out and executes on the GPU.
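To get a feel for what "bandwidth as the limiting factor" means, a back-of-envelope calculation is enough. The sketch below compares memory time and compute time for a vector add; the bandwidth and FLOP numbers are illustrative round figures I am assuming, not measurements of any specific GPU.

```python
# Back-of-envelope: why vector add is memory-bound.
# BANDWIDTH_GBPS and PEAK_FP32_TFLOPS are assumed round numbers,
# roughly in the range of a modern datacenter GPU.

BANDWIDTH_GBPS = 900         # assumed global memory bandwidth, GB/s
PEAK_FP32_TFLOPS = 14        # assumed peak FP32 throughput, TFLOP/s

n = 1 << 28                  # 2^28 float32 elements per vector
bytes_moved = 3 * n * 4      # read a, read b, write c: 12 bytes per element
flops = n                    # one add per element

time_memory_s = bytes_moved / (BANDWIDTH_GBPS * 1e9)
time_compute_s = flops / (PEAK_FP32_TFLOPS * 1e12)

print(f"memory-limited time : {time_memory_s * 1e3:.2f} ms")
print(f"compute-limited time: {time_compute_s * 1e3:.4f} ms")
# The memory time is orders of magnitude larger: the add itself is
# essentially free, and the kernel's speed is set by bytes moved.
```

Under these assumptions the memory-limited time is over a hundred times the compute-limited time, which is why the arithmetic in a vector add never shows up in a profile.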
The first exercises do not need to be fancy:
- setting up a CUDA environment
- a vector add kernel
- a matrix multiply kernel
Vector add is small but excellent for learning thread indexing. Matrix multiply is where memory access patterns and shared memory start to feel real.
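The indexing pattern that vector add teaches can be sketched without a GPU at all. The following is a plain-Python emulation of the standard CUDA launch structure, where the nested loops stand in for blocks and threads; in a real kernel, both loops disappear and each iteration runs as its own thread.

```python
import numpy as np

# CPU emulation of the classic CUDA vector-add indexing pattern:
#   i = blockIdx.x * blockDim.x + threadIdx.x
# Each "thread" handles one element; the i < n guard handles the
# final, partially filled block.

def vector_add_emulated(a, b, block_dim=256):
    n = a.size
    c = np.empty_like(a)
    grid_dim = (n + block_dim - 1) // block_dim   # ceil-divide, as in a real launch
    for block_idx in range(grid_dim):             # on a GPU: one block per SM slot
        for thread_idx in range(block_dim):       # on a GPU: threads run in parallel
            i = block_idx * block_dim + thread_idx
            if i < n:                             # bounds guard for the tail block
                c[i] = a[i] + b[i]
    return c

a = np.arange(1000, dtype=np.float32)
b = np.ones(1000, dtype=np.float32)
assert np.array_equal(vector_add_emulated(a, b), a + b)
```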
Phase 2. CUDA Kernel Programming
Once the architecture picture is in place, you need to write kernels directly.
This phase is not just about learning CUDA syntax. It is about understanding why one kernel is fast and another one is not.
The main topics are:
- memory coalescing
- shared memory access patterns
- warp shuffle
- kernel launch configuration
- register pressure
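Memory coalescing is easiest to see by counting memory segments rather than reading definitions. The sketch below models which segments a single 32-thread warp touches under a coalesced versus a strided access pattern; the 128-byte segment size is a typical transaction granularity I am assuming for illustration.

```python
# Sketch: how many memory segments does one 32-thread warp touch?
# A coalesced access (thread t -> element t) lands in a single 128-byte
# segment; a strided access (thread t -> element t * stride) scatters
# the same 32 loads across many segments, multiplying the traffic.

WARP_SIZE = 32
ELEM_BYTES = 4          # float32
SEGMENT_BYTES = 128     # assumed transaction granularity

def segments_touched(stride):
    addresses = [t * stride * ELEM_BYTES for t in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})

print(segments_touched(1))    # coalesced: 1 segment for the whole warp
print(segments_touched(32))   # strided:   32 segments, 32x the traffic
```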
Good exercises here include a reduction kernel, a softmax kernel, and an optimized matrix multiply. Softmax is especially useful because it quickly teaches you that memory movement and reduction structure often matter more than the arithmetic itself.
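The structure a softmax kernel has to implement is compact enough to sketch in NumPy. The two reductions below (a max pass and a sum pass) are exactly what the GPU version has to organize across threads, and subtracting the row max is what keeps exp() from overflowing.

```python
import numpy as np

# The reduction structure of softmax: one max pass, one sum pass.
# On a GPU these two reductions, not the elementwise math, dominate
# the kernel's memory traffic and shape its design.

def softmax_rows(x):
    row_max = x.max(axis=1, keepdims=True)    # reduction 1: max per row
    e = np.exp(x - row_max)                   # elementwise, numerically safe
    row_sum = e.sum(axis=1, keepdims=True)    # reduction 2: sum per row
    return e / row_sum

x = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
out = softmax_rows(x)
assert np.allclose(out.sum(axis=1), 1.0)      # rows sum to 1, no overflow
```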
Phase 3. Triton Kernels
In LLM work, Triton shows up often enough that it is hard to ignore.
This phase should cover:
- the Triton programming model
- block tiling
- tile-based memory access
- fused softmax
- a layernorm kernel
Triton is useful because it does not hide GPU structure completely, but it still lets you move faster than raw CUDA in many cases. That makes it a good bridge between basic kernel programming and real model-level optimization.
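The core of the Triton programming model, fixed-size tiles plus a mask for the ragged tail, can be sketched in plain NumPy before touching Triton itself. This is only a CPU sketch of the control structure, not real Triton code, but it mirrors the shape of `tl.arange(0, BLOCK)` plus a bounds mask in an actual kernel.

```python
import numpy as np

# NumPy sketch of Triton-style tiling: walk a row in fixed-size blocks,
# masking out-of-bounds lanes in the final partial tile. In Triton, the
# offsets would come from tl.arange and the mask would guard tl.load.

def row_sum_tiled(row, BLOCK=128):
    acc = 0.0
    for start in range(0, len(row), BLOCK):
        offs = start + np.arange(BLOCK)           # like tl.arange(0, BLOCK)
        mask = offs < len(row)                    # guard the ragged tail
        safe = np.minimum(offs, len(row) - 1)     # clamp so indexing is legal
        tile = np.where(mask, row[safe], 0.0)     # masked lanes contribute 0
        acc += tile.sum()
    return acc

row = np.random.rand(1000).astype(np.float32)
assert np.isclose(row_sum_tiled(row), row.sum(), rtol=1e-4)
```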
Phase 4. Kernel Optimization
At this point the question changes from "does the kernel run?" to "does the kernel actually run well?"
The main topics are:
- kernel fusion
- tiling strategies
- tensor core usage
- mixed precision
- memory bottleneck analysis
Once you start looking at transformer blocks this way, you stop seeing them as just a sequence of matmuls. You start asking where memory traffic dominates, where intermediate reads and writes are expensive, and which operations are worth fusing.
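The payoff of fusion can be made concrete by counting bytes rather than timing anything. The toy model below tallies global-memory traffic for an elementwise chain like y = scale * relu(x), run as two kernels versus one fused kernel; the point is the accounting, not the specific op.

```python
# Counting global-memory traffic for a toy elementwise chain:
#   y = scale * relu(x)
# Unfused, the intermediate relu(x) is written to and read back from
# global memory; fused, it never leaves registers.

ELEM_BYTES = 4  # float32

def traffic_unfused(n):
    relu_pass = n * ELEM_BYTES * 2     # kernel 1: read x, write tmp
    scale_pass = n * ELEM_BYTES * 2    # kernel 2: read tmp, write y
    return relu_pass + scale_pass

def traffic_fused(n):
    return n * ELEM_BYTES * 2          # read x, write y; tmp stays in registers

n = 1 << 20
print(traffic_unfused(n) / traffic_fused(n))   # fusion halves the traffic here
```

For a memory-bound chain like this, halving the bytes moved roughly halves the runtime, which is why fusion analysis starts with traffic counts like these.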
Good exercises here include:
- a fused transformer block
- a fast attention kernel
How to Read This Series
This is not meant to be a one-pass CUDA tutorial. The goal is to move from "I use GPUs" to "I can reason about GPU execution and change it when necessary."
That means reading with a few habits in mind:
- do not stop at "the code works"
- always ask whether memory movement is the real bottleneck
- connect toy kernels back to real model operators
If your long-term goal is GPU kernel engineering, API familiarity is only the starting point. Performance intuition is the real skill. That is what this series is trying to build.
The next post will start with the big picture of GPU architecture: the thread model, warps, blocks, and the memory hierarchy.