Why Study GPU Systems Separately?

Once you spend enough time around deep learning, you inevitably start using GPUs. At first, loading a model in PyTorch and training it feels like enough. But when models get larger, training gets expensive, and specific operators become bottlenecks, the difference between "using a GPU" and "understanding a GPU" becomes very obvious.

If your goal is to become a GPU Kernel Engineer, being good at library-level usage is not enough. You need to be comfortable with questions like how warps are scheduled, why shared memory matters, and why memory bandwidth keeps showing up as the limiting factor.

This series is about building that intuition. It starts with GPU architecture, then moves into CUDA kernels, Triton, and finally real optimization work.

A Better Order for This Topic

GPU learning gets confusing quickly if you jump straight into framework internals or random optimization tips. A cleaner path is:

  1. GPU architecture
  2. CUDA kernel programming
  3. Triton kernels
  4. kernel optimization

That order matters because each phase keeps reusing the previous one. If you do not understand memory coalescing, Triton block tiling tends to look like syntax rather than an execution strategy. Once the GPU memory hierarchy and warp execution model click, CUDA and Triton start to look like different interfaces for solving related performance problems.

Phase 1. GPU Architecture

The first thing to internalize is that GPUs push work very differently from CPUs: a CPU optimizes the latency of a few threads, while a GPU hides latency by keeping thousands of lightweight threads in flight at once.

The core ideas here are:

  • the relationship between threads, warps, blocks, and grids
  • the differences between global memory, shared memory, and registers
  • what memory bandwidth and latency mean in practice
  • why occupancy matters

The goal of this phase is simple: you should be able to picture how your code fans out and executes on the GPU.
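To make that fan-out concrete, here is a minimal sketch of the thread hierarchy (the kernel name and launch sizes are illustrative, not from any particular codebase):

```cuda
#include <cstdio>

// Hierarchy sketch: a grid of blocks, each block a group of threads,
// and threads executed in lockstep warps of 32.
__global__ void coords() {
    int lane   = threadIdx.x % 32;  // position within the warp
    int warp   = threadIdx.x / 32;  // warp index within the block
    int global = blockIdx.x * blockDim.x + threadIdx.x;  // grid-wide index
    if (lane == 0)                  // one print per warp
        printf("block %d, warp %d, global thread %d\n",
               blockIdx.x, warp, global);
}

// launch sketch: coords<<<4, 128>>>();  // 4 blocks x 128 threads = 16 warps
```

Running something like this once and reading the output is a fast way to internalize how a single launch becomes thousands of cooperating threads.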

The first exercises do not need to be fancy:

  • setting up a CUDA environment
  • a vector add kernel
  • a matrix multiply kernel

Vector add is small but excellent for learning thread indexing. Matrix multiply is where memory access patterns and shared memory start to feel real.
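As a sketch of that first exercise (variable names are illustrative), a CUDA vector add in its entirety looks like:

```cuda
#include <cuda_runtime.h>

// One thread per element: thread i computes c[i] = a[i] + b[i].
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the sketch short; explicit cudaMemcpy also works.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;  // round up to cover n
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Almost everything here recurs in later phases: the global index computation, the bounds guard, and the rounded-up launch configuration.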

Phase 2. CUDA Kernel Programming

Once the architecture picture is in place, you need to write kernels directly.

This phase is not just about learning CUDA syntax. It is about understanding why one kernel is fast and another one is not.

The main topics are:

  • memory coalescing
  • shared memory access patterns
  • warp shuffle
  • kernel launch configuration
  • register pressure

Good exercises here include a reduction kernel, a softmax kernel, and an optimized matrix multiply. Softmax is especially useful because it quickly teaches you that memory movement and reduction structure often matter more than the arithmetic itself.
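A sketch of the reduction exercise, combining two of the topics above (warp shuffle and shared memory); it assumes blockDim.x is a multiple of 32 and that out is zero-initialized:

```cuda
// Sum within one warp using shuffles: no shared memory, no __syncthreads.
__device__ float warpReduceSum(float val) {
    // Each step halves the distance; after 5 steps lane 0 holds the warp sum.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Block-level sum: one partial per warp, then the first warp reduces those.
__global__ void reduceSum(const float *in, float *out, int n) {
    __shared__ float warpSums[32];  // up to 1024 threads = 32 warps
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    v = warpReduceSum(v);
    if (threadIdx.x % 32 == 0)
        warpSums[threadIdx.x / 32] = v;
    __syncthreads();

    if (threadIdx.x < 32) {  // first warp reduces the per-warp partials
        v = (threadIdx.x < blockDim.x / 32) ? warpSums[threadIdx.x] : 0.0f;
        v = warpReduceSum(v);
        if (threadIdx.x == 0) atomicAdd(out, v);  // combine across blocks
    }
}
```

The same shuffle-then-shared-memory structure reappears inside softmax, layernorm, and attention, which is why the reduction exercise pays off so well.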

Phase 3. Triton Kernels

In LLM work, Triton shows up often enough that it is hard to ignore.

This phase should cover:

  • the Triton programming model
  • block tiling
  • tile-based memory access
  • fused softmax
  • a layernorm kernel

Triton is useful because it does not hide GPU structure completely, but it still lets you move faster than raw CUDA in many cases. That makes it a good bridge between basic kernel programming and real model-level optimization.
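To show what that bridge looks like, here is a fused softmax sketch in Triton's style (requires a CUDA GPU and the triton package; it assumes a row-major matrix whose row fits in one block, and it is a teaching sketch, not a tuned kernel):

```python
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, y_ptr, n_cols, BLOCK: tl.constexpr):
    # One program instance handles one row; the row stays on-chip,
    # so max, exp, sum, and divide are fused into a single pass.
    row  = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)          # numerically stable shift
    num = tl.exp(x)
    y = num / tl.sum(num, axis=0)
    tl.store(y_ptr + row * n_cols + cols, y, mask=mask)

# launch sketch, one program per row:
# softmax_kernel[(n_rows,)](x, y, n_cols, BLOCK=triton.next_power_of_2(n_cols))
```

Notice that you still reason about blocks, masks, and memory movement, but the compiler handles thread indexing and shared memory placement for you.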

Phase 4. Kernel Optimization

At this point the question changes from "does the kernel run?" to "does the kernel actually run well?"

The main topics are:

  • kernel fusion
  • tiling strategies
  • tensor core usage
  • mixed precision
  • memory bottleneck analysis

Once you start looking at transformer blocks this way, you stop seeing them as just a sequence of matmuls. You start asking where memory traffic dominates, where intermediate reads and writes are expensive, and which operations are worth fusing.
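The memory-traffic argument for fusion is easiest to see in a toy case. A sketch (illustrative names) of fusing a bias add and a ReLU so the intermediate never round-trips through global memory:

```cuda
// Unfused: two launches; x + bias is written to global memory by the
// first kernel and read back by the second.
__global__ void biasAdd(float *x, const float *bias, int n, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += bias[i % cols];
}
__global__ void relu(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i], 0.0f);
}

// Fused: one read and one write per element; the intermediate value
// lives in a register, and one kernel launch disappears entirely.
__global__ void biasAddRelu(float *x, const float *bias, int n, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i] + bias[i % cols], 0.0f);
}
```

For bandwidth-bound elementwise chains like this, fusion roughly halves global memory traffic, which is exactly the kind of accounting this phase is about.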

Good exercises here include:

  • a fused transformer block
  • a fast attention kernel

How to Read This Series

This is not meant to be a one-pass CUDA tutorial. The goal is to move from "I use GPUs" to "I can reason about GPU execution and change it when necessary."

That means reading with a few habits in mind:

  • do not stop at "the code works"
  • always ask whether memory movement is the real bottleneck
  • connect toy kernels back to real model operators

If your long-term goal is GPU kernel engineering, API familiarity is only the starting point. Performance intuition is the real skill. That is what this series is trying to build.

The next post will start with the big picture of GPU architecture: the thread model, warps, blocks, and the memory hierarchy.