Jae's Tech Blog

Lectures

All posts in Lectures

February 21, 2026

GPU Systems 12 - Warp Shuffle and Warp-Level Primitives

Why warp-level primitives matter for reductions and lighter-weight cooperation

Lectures
February 23, 2026

GPU Systems 13 - Reduction Kernels in Depth

Using reduction kernels to connect shared memory, warp primitives, and synchronization

Lectures
February 25, 2026

GPU Systems 14 - Why Softmax Is Such a Good Kernel Exercise

How softmax combines reductions, memory traffic, and numerical stability in one kernel

Lectures
February 27, 2026

GPU Systems 15 - LayerNorm and RMSNorm Kernel Structure

Why normalization kernels are often memory-bound and structurally important

Lectures
March 1, 2026

GPU Systems 16 - Vectorized Loads, Stores, and Alignment

How wider memory operations and alignment affect bandwidth utilization

Lectures
March 3, 2026

GPU Systems 17 - Register Pressure and Spilling

Why using more registers can improve local efficiency but still reduce total throughput

Lectures
March 5, 2026

GPU Systems 18 - Tensor Cores and Mixed Precision

How tensor cores change performance in compute-heavy kernels and why mixed precision matters

Lectures
March 7, 2026

GPU Systems 19 - Asynchronous Copy and Pipelining

How asynchronous copy and double buffering help overlap memory movement with computation

Lectures
March 9, 2026

GPU Systems 20 - From Nsight to Triton to FlashAttention

Closing the GPU Systems series by connecting profiling, Triton experimentation, and FlashAttention-style thinking

Lectures

© 2025 Jae · Notes on systems, software, and building things carefully.
