GPU Systems 00 - What You Should Know Before Starting This Series
The background knowledge that makes the GPU Systems series much easier to study properly
From GPU architecture and CUDA kernels to Triton and real kernel optimization work
This series is for engineers who want to understand how GPUs actually execute work and who eventually want to write and optimize their own kernels.
It assumes comfort with Python, basic linear algebra, and enough systems intuition to read low-level performance material without panic.
A practical study order from GPU architecture to CUDA, Triton, and kernel optimization
What threads, warps, blocks, and grids mean in actual GPU execution
How to think about the GPU memory hierarchy and bandwidth bottlenecks
How to choose indexing schemes and launch configurations when writing CUDA kernels
The optimization patterns that keep showing up in CUDA kernels
How Triton fits into real kernel optimization work, especially for LLM-style workloads
Understanding occupancy as a latency-hiding concept instead of just a percentage
A practical way to use profiling and roofline thinking to understand kernel bottlenecks
Using naive matrix multiplication to see memory reuse and traffic problems clearly
Why tiled matrix multiplication and shared memory create such a big performance difference
Why shared memory is not automatically fast and how bank conflicts appear
Why warp-level primitives matter for reductions and lighter-weight cooperation
Using reduction kernels to connect shared memory, warp primitives, and synchronization
How softmax combines reductions, memory traffic, and numerical stability in one kernel
Why normalization kernels are often memory-bound and structurally important
How wider memory operations and alignment affect bandwidth utilization
Why using more registers can improve local efficiency but still reduce total throughput
How tensor cores change performance in compute-heavy kernels and why mixed precision matters
How asynchronous copy and double buffering help overlap memory movement with computation
Closing the GPU Systems series by connecting profiling, Triton experimentation, and FlashAttention-style thinking
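Several items above come down to mapping threads onto data. As a small preview, here is a CPU-side Python sketch of how a 1D launch configuration assigns each thread a global index; `launch_1d` and `add_kernel` are illustrative names for this sketch, mirroring CUDA's `blockIdx.x * blockDim.x + threadIdx.x` pattern rather than any real API:

```python
import math

def launch_1d(kernel, n, block_dim, *args):
    """Emulate a 1D CUDA-style launch on the CPU: call `kernel` once per
    thread, passing the global index that thread would compute on a GPU."""
    grid_dim = math.ceil(n / block_dim)  # enough blocks to cover n elements
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # The standard CUDA indexing formula:
            #   i = blockIdx.x * blockDim.x + threadIdx.x
            i = block_idx * block_dim + thread_idx
            if i < n:  # bounds guard: the last block may overshoot n
                kernel(i, *args)

def add_kernel(i, x, y, out):
    # One "thread's" worth of work: a single element-wise addition.
    out[i] = x[i] + y[i]

n = 10
x = [float(v) for v in range(n)]
y = [1.0] * n
out = [0.0] * n
launch_1d(add_kernel, n, 4, x, y, out)  # a grid of 3 blocks of 4 threads
```

The bounds guard matters because the grid is rounded up to whole blocks, so the last block usually contains threads with no element to process; the same `if i < n` check appears in almost every real CUDA kernel.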
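To preview why the naive-versus-tiled matmul contrast above is so stark, here is a purely illustrative Python model (not CUDA; `matmul_tiled` and its load counter are hypothetical constructs for this sketch) that counts how many element fetches a kernel would issue to global memory. Staging a tile once and reusing it for a whole block of output cuts traffic by roughly the tile width:

```python
def matmul_tiled(A, B, tile):
    """Square matmul that counts simulated global-memory loads.
    Each (tile x tile) block of A and B is fetched once per tile-level
    step, mimicking a staging pass through shared memory. Assumes n is
    divisible by `tile`."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                loads += 2 * tile * tile  # stage one tile of A, one of B
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        for k in range(k0, k0 + tile):
                            C[i][j] += A[i][k] * B[k][j]
    return C, loads

n = 8
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j % 5) for j in range(n)] for i in range(n)]

_, naive_loads = matmul_tiled(A, B, tile=1)  # no reuse: 2 * n**3 loads
C, tiled_loads = matmul_tiled(A, B, tile=4)  # each staged tile reused
```

With `tile=1` the model issues the naive kernel's `2 * n**3` loads; with `tile=4` the count drops by a factor of 4, which is the same reuse argument that shared-memory tiling makes on real hardware.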