GPU Systems 01 - Roadmap to GPU Kernel Engineering
A practical study order from GPU architecture to CUDA, Triton, and kernel optimization
Why Study GPU Systems Separately?
Once you spend enough time around deep learning, you inevitably start using GPUs. At first, loading a model in PyTorch and training it feels like enough. But when models get larger, training gets expensive, and specific operators become bottlenecks, the difference between "using a GPU" and "understanding a GPU" becomes very obvious.
If your goal is to become a GPU Kernel Engineer, being good at library-level usage is not enough. You need to be comfortable with questions like how warps are scheduled, why shared memory matters, and why memory bandwidth keeps showing up as the limiting factor.
This series is about building that intuition. It starts with GPU architecture, then moves into CUDA kernels, Triton, and finally real optimization work.
A Better Order for This Topic
GPU learning gets confusing quickly if you jump straight into framework internals or random optimization tips. A cleaner path is:
- GPU architecture
- CUDA kernel programming
- Triton kernels
- kernel optimization
That order matters because each phase builds directly on the previous one. If you do not understand memory coalescing, Triton's block tiling tends to look like syntax rather than an execution strategy. Once the GPU memory hierarchy and warp execution model click, CUDA and Triton start to look like different interfaces for solving related performance problems.
Phase 1. GPU Architecture
The first thing to internalize is that GPUs organize and execute work very differently from CPUs: thousands of lightweight threads running in lockstep groups, rather than a few fast independent cores.
The core ideas here are:
- the relationship between threads, warps, blocks, and grids
- the differences between global memory, shared memory, and registers
- what memory bandwidth and latency mean in practice
- why occupancy matters
The goal of this phase is simple: you should be able to picture how your code fans out and executes on the GPU.
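To get a feel for what "bandwidth as the limiting factor" means, a back-of-envelope calculation is enough. The sketch below compares memory time and compute time for a vector add; the bandwidth and FLOP numbers are illustrative round figures I am assuming, not measurements of any specific GPU.

```python
# Back-of-envelope: why vector add is memory-bound.
# BANDWIDTH_GBPS and PEAK_FP32_TFLOPS are assumed round numbers,
# roughly in the range of a modern datacenter GPU.

BANDWIDTH_GBPS = 900         # assumed global memory bandwidth, GB/s
PEAK_FP32_TFLOPS = 14        # assumed peak FP32 throughput, TFLOP/s

n = 1 << 28                  # 2^28 float32 elements per vector
bytes_moved = 3 * n * 4      # read a, read b, write c: 12 bytes per element
flops = n                    # one add per element

time_memory_s = bytes_moved / (BANDWIDTH_GBPS * 1e9)
time_compute_s = flops / (PEAK_FP32_TFLOPS * 1e12)

print(f"memory-limited time : {time_memory_s * 1e3:.2f} ms")
print(f"compute-limited time: {time_compute_s * 1e3:.4f} ms")
# The memory time is orders of magnitude larger: the add itself is
# essentially free, and the kernel's speed is set by bytes moved.
```

Under these assumptions the memory-limited time is over a hundred times the compute-limited time, which is why the arithmetic in a vector add never shows up in a profile.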
The first exercises do not need to be fancy:
- setting up a CUDA environment
- a vector add kernel
- a matrix multiply kernel
Vector add is small but excellent for learning thread indexing. Matrix multiply is where memory access patterns and shared memory start to feel real.
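The indexing pattern that vector add teaches can be sketched without a GPU at all. The following is a plain-Python emulation of the standard CUDA launch structure, where the nested loops stand in for blocks and threads; in a real kernel, both loops disappear and each iteration runs as its own thread.

```python
import numpy as np

# CPU emulation of the classic CUDA vector-add indexing pattern:
#   i = blockIdx.x * blockDim.x + threadIdx.x
# Each "thread" handles one element; the i < n guard handles the
# final, partially filled block.

def vector_add_emulated(a, b, block_dim=256):
    n = a.size
    c = np.empty_like(a)
    grid_dim = (n + block_dim - 1) // block_dim   # ceil-divide, as in a real launch
    for block_idx in range(grid_dim):             # on a GPU: one block per SM slot
        for thread_idx in range(block_dim):       # on a GPU: threads run in parallel
            i = block_idx * block_dim + thread_idx
            if i < n:                             # bounds guard for the tail block
                c[i] = a[i] + b[i]
    return c

a = np.arange(1000, dtype=np.float32)
b = np.ones(1000, dtype=np.float32)
assert np.array_equal(vector_add_emulated(a, b), a + b)
```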
Phase 2. CUDA Kernel Programming
Once the architecture picture is in place, you need to write kernels directly.
This phase is not just about learning CUDA syntax. It is about understanding why one kernel is fast and another one is not.
The main topics are:
- memory coalescing
- shared memory access patterns
- warp shuffle
- kernel launch configuration
- register pressure
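Memory coalescing is easiest to see by counting memory segments rather than reading definitions. The sketch below models which segments a single 32-thread warp touches under a coalesced versus a strided access pattern; the 128-byte segment size is a typical transaction granularity I am assuming for illustration.

```python
# Sketch: how many memory segments does one 32-thread warp touch?
# A coalesced access (thread t -> element t) lands in a single 128-byte
# segment; a strided access (thread t -> element t * stride) scatters
# the same 32 loads across many segments, multiplying the traffic.

WARP_SIZE = 32
ELEM_BYTES = 4          # float32
SEGMENT_BYTES = 128     # assumed transaction granularity

def segments_touched(stride):
    addresses = [t * stride * ELEM_BYTES for t in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})

print(segments_touched(1))    # coalesced: 1 segment for the whole warp
print(segments_touched(32))   # strided:   32 segments, 32x the traffic
```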
Good exercises here include a reduction kernel, a softmax kernel, and an optimized matrix multiply. Softmax is especially useful because it quickly teaches you that memory movement and reduction structure often matter more than the arithmetic itself.
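The structure a softmax kernel has to implement is compact enough to sketch in NumPy. The two reductions below (a max pass and a sum pass) are exactly what the GPU version has to organize across threads, and subtracting the row max is what keeps exp() from overflowing.

```python
import numpy as np

# The reduction structure of softmax: one max pass, one sum pass.
# On a GPU these two reductions, not the elementwise math, dominate
# the kernel's memory traffic and shape its design.

def softmax_rows(x):
    row_max = x.max(axis=1, keepdims=True)    # reduction 1: max per row
    e = np.exp(x - row_max)                   # elementwise, numerically safe
    row_sum = e.sum(axis=1, keepdims=True)    # reduction 2: sum per row
    return e / row_sum

x = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
out = softmax_rows(x)
assert np.allclose(out.sum(axis=1), 1.0)      # rows sum to 1, no overflow
```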
Phase 3. Triton Kernels
In LLM work, Triton shows up often enough that it is hard to ignore.
This phase should cover:
- the Triton programming model
- block tiling
- tile-based memory access
- fused softmax
- a layernorm kernel
Triton is useful because it does not hide GPU structure completely, but it still lets you move faster than raw CUDA in many cases. That makes it a good bridge between basic kernel programming and real model-level optimization.
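The core of the Triton programming model, fixed-size tiles plus a mask for the ragged tail, can be sketched in plain NumPy before touching Triton itself. This is only a CPU sketch of the control structure, not real Triton code, but it mirrors the shape of `tl.arange(0, BLOCK)` plus a bounds mask in an actual kernel.

```python
import numpy as np

# NumPy sketch of Triton-style tiling: walk a row in fixed-size blocks,
# masking out-of-bounds lanes in the final partial tile. In Triton, the
# offsets would come from tl.arange and the mask would guard tl.load.

def row_sum_tiled(row, BLOCK=128):
    acc = 0.0
    for start in range(0, len(row), BLOCK):
        offs = start + np.arange(BLOCK)           # like tl.arange(0, BLOCK)
        mask = offs < len(row)                    # guard the ragged tail
        safe = np.minimum(offs, len(row) - 1)     # clamp so indexing is legal
        tile = np.where(mask, row[safe], 0.0)     # masked lanes contribute 0
        acc += tile.sum()
    return acc

row = np.random.rand(1000).astype(np.float32)
assert np.isclose(row_sum_tiled(row), row.sum(), rtol=1e-4)
```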
Phase 4. Kernel Optimization
At this point the question changes from "does the kernel run?" to "does the kernel actually run well?"
The main topics are:
- kernel fusion
- tiling strategies
- tensor core usage
- mixed precision
- memory bottleneck analysis
Once you start looking at transformer blocks this way, you stop seeing them as just a sequence of matmuls. You start asking where memory traffic dominates, where intermediate reads and writes are expensive, and which operations are worth fusing.
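The payoff of fusion can be made concrete by counting bytes rather than timing anything. The toy model below tallies global-memory traffic for an elementwise chain like y = scale * relu(x), run as two kernels versus one fused kernel; the point is the accounting, not the specific op.

```python
# Counting global-memory traffic for a toy elementwise chain:
#   y = scale * relu(x)
# Unfused, the intermediate relu(x) is written to and read back from
# global memory; fused, it never leaves registers.

ELEM_BYTES = 4  # float32

def traffic_unfused(n):
    relu_pass = n * ELEM_BYTES * 2     # kernel 1: read x, write tmp
    scale_pass = n * ELEM_BYTES * 2    # kernel 2: read tmp, write y
    return relu_pass + scale_pass

def traffic_fused(n):
    return n * ELEM_BYTES * 2          # read x, write y; tmp stays in registers

n = 1 << 20
print(traffic_unfused(n) / traffic_fused(n))   # fusion halves the traffic here
```

For a memory-bound chain like this, halving the bytes moved roughly halves the runtime, which is why fusion analysis starts with traffic counts like these.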
Good exercises here include:
- a fused transformer block
- a fast attention kernel
How to Read This Series
This is not meant to be a one-pass CUDA tutorial. The goal is to move from "I use GPUs" to "I can reason about GPU execution and change it when necessary."
That means reading with a few habits in mind:
- do not stop at "the code works"
- always ask whether memory movement is the real bottleneck
- connect toy kernels back to real model operators
If your long-term goal is GPU kernel engineering, API familiarity is only the starting point. Performance intuition is the real skill. That is what this series is trying to build.
The next post will start with the big picture of GPU architecture: the thread model, warps, blocks, and the memory hierarchy.