Who This Is For

Engineers who want to understand how GPUs actually execute work and eventually write and optimize their own kernels.

Prerequisites

Comfort with Python, basic linear algebra, and enough systems intuition to read low-level performance topics without panic.

What You'll Get

  • Build a concrete mental model of warps, blocks, memory hierarchy, and occupancy
  • Write CUDA and Triton kernels instead of treating GPU work as a black box
  • Understand where kernel performance is won or lost in real workloads

All Posts

  1. GPU Systems 00 - What You Should Know Before Starting This Series

    The background knowledge that makes the rest of the GPU Systems series much easier to study

  2. GPU Systems 01 - Roadmap to GPU Kernel Engineering

    A practical study order from GPU architecture to CUDA, Triton, and kernel optimization

  3. GPU Systems 02 - The Thread, Warp, and Block Execution Model

    What threads, warps, blocks, and grids mean in actual GPU execution

  4. GPU Systems 03 - Memory Hierarchy and Bandwidth

    How to think about the GPU memory hierarchy and bandwidth bottlenecks

  5. GPU Systems 04 - Writing CUDA Kernels and Choosing Launch Configuration

    How to think about indexing and launch configuration when writing CUDA kernels

  6. GPU Systems 05 - Coalescing, Shared Memory, and Reduction Patterns

    The optimization patterns that show up again and again in real CUDA kernels

  7. GPU Systems 06 - Triton and the Practical Shape of Kernel Optimization

    How Triton fits into real kernel optimization work, especially for LLM-style workloads

  8. GPU Systems 07 - Occupancy and Latency Hiding

    Understanding occupancy as a latency-hiding concept instead of just a percentage

  9. GPU Systems 08 - Profiling and the Roofline View

    A practical way to use profiling and roofline thinking to understand kernel bottlenecks

  10. GPU Systems 09 - Why Naive Matrix Multiplication Is Slow

    Using naive matrix multiplication to see memory reuse and traffic problems clearly

  11. GPU Systems 10 - Tiled Matrix Multiplication and Shared Memory

    Why tiled matrix multiplication and shared memory create such a big performance difference

  12. GPU Systems 11 - Shared Memory Bank Conflicts

    Why shared memory is not automatically fast and how bank conflicts appear

  13. GPU Systems 12 - Warp Shuffle and Warp-Level Primitives

    Why warp-level primitives matter for reductions and lighter-weight cooperation

  14. GPU Systems 13 - Reduction Kernels in Depth

    Using reduction kernels to connect shared memory, warp primitives, and synchronization

  15. GPU Systems 14 - Why Softmax Is Such a Good Kernel Exercise

    How softmax combines reductions, memory traffic, and numerical stability in one kernel

  16. GPU Systems 15 - LayerNorm and RMSNorm Kernel Structure

    Why normalization kernels are often memory-bound and structurally important

  17. GPU Systems 16 - Vectorized Loads, Stores, and Alignment

    How wider memory operations and alignment affect bandwidth utilization

  18. GPU Systems 17 - Register Pressure and Spilling

    Why using more registers can improve local efficiency but still reduce total throughput

  19. GPU Systems 18 - Tensor Cores and Mixed Precision

    How tensor cores change performance in compute-heavy kernels and why mixed precision matters

  20. GPU Systems 19 - Asynchronous Copy and Pipelining

    How asynchronous copy and double buffering help overlap memory movement with computation

  21. GPU Systems 20 - From Nsight to Triton to FlashAttention

    Closing the GPU Systems series by connecting profiling, Triton experimentation, and FlashAttention-style thinking