Who This Is For

Engineers who want to understand how GPUs actually execute work and eventually write and optimize their own kernels.

Prerequisites

Comfort with Python, basic linear algebra, and enough systems intuition to read low-level performance topics without panic.

What You'll Get

  • Build a concrete mental model of warps, blocks, memory hierarchy, and occupancy
  • Write CUDA and Triton kernels instead of treating GPU work as a black box
  • Understand where kernel performance is won or lost in real workloads

All Posts

  1. GPU Systems 00 - What You Should Know Before Starting This Series

    The background knowledge that makes the rest of the GPU Systems series much easier to study

  2. GPU Systems 01 - Roadmap to GPU Kernel Engineering

    A practical study order from GPU architecture to CUDA, Triton, and kernel optimization

  3. GPU Systems 02 - The Thread, Warp, and Block Execution Model

    What threads, warps, blocks, and grids mean in actual GPU execution

  4. GPU Systems 03 - Memory Hierarchy and Bandwidth

    How to think about the GPU memory hierarchy and bandwidth bottlenecks

  5. GPU Systems 04 - Writing CUDA Kernels and Choosing Launch Configuration

    How to think about indexing and launch configuration when writing CUDA kernels

  6. GPU Systems 05 - Coalescing, Shared Memory, and Reduction Patterns

    The optimization patterns that show up again and again in real CUDA kernels

  7. GPU Systems 06 - Triton and the Practical Shape of Kernel Optimization

    How Triton fits into real kernel optimization work, especially for LLM-style workloads

  8. GPU Systems 07 - Occupancy and Latency Hiding

    Understanding occupancy as a latency-hiding concept instead of just a percentage

  9. GPU Systems 08 - Profiling and the Roofline View

    A practical way to use profiling and roofline thinking to understand kernel bottlenecks

  10. GPU Systems 09 - Why Naive Matrix Multiplication Is Slow

    Using naive matrix multiplication to see memory reuse and traffic problems clearly

  11. GPU Systems 10 - Tiled Matrix Multiplication and Shared Memory

    Why tiled matrix multiplication and shared memory create such a big performance difference

  12. GPU Systems 11 - Shared Memory Bank Conflicts

    Why shared memory is not automatically fast and how bank conflicts appear

  13. GPU Systems 12 - Warp Shuffle and Warp-Level Primitives

    Why warp-level primitives matter for reductions and lighter-weight cooperation

  14. GPU Systems 13 - Reduction Kernels in Depth

    Using reduction kernels to connect shared memory, warp primitives, and synchronization

  15. GPU Systems 14 - Why Softmax Is Such a Good Kernel Exercise

    How softmax combines reductions, memory traffic, and numerical stability in one kernel

  16. GPU Systems 15 - LayerNorm and RMSNorm Kernel Structure

    Why normalization kernels are often memory-bound and structurally important

  17. GPU Systems 16 - Vectorized Loads, Stores, and Alignment

    How wider memory operations and alignment affect bandwidth utilization

  18. GPU Systems 17 - Register Pressure and Spilling

    Why using more registers can improve local efficiency but still reduce total throughput

  19. GPU Systems 18 - Tensor Cores and Mixed Precision

    How tensor cores change performance in compute-heavy kernels and why mixed precision matters

  20. GPU Systems 19 - Asynchronous Copy and Pipelining

    How asynchronous copy and double buffering help overlap memory movement with computation

  21. GPU Systems 20 - From Nsight to Triton to FlashAttention

    Closing the GPU Systems series by connecting profiling, Triton experimentation, and FlashAttention-style thinking