GPU Systems 12 - Warp Shuffle and Warp-Level Primitives
Why warp-level primitives matter for reductions and lighter-weight cooperation
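To give a concrete flavor of the idea in the subtitle, here is a minimal sketch of a warp-level sum reduction built on `__shfl_down_sync`. The kernel name and launch shape are illustrative assumptions, not code from the lectures; the point is that the 32 lanes of a warp can exchange register values directly, with no shared memory and no `__syncthreads()`:

```cuda
#include <cstdio>

// Illustrative sketch (not from the lectures): each of the 32 lanes of a
// warp contributes one value, and __shfl_down_sync folds them together
// entirely in registers -- no shared memory, no block-wide barrier.
__global__ void warpReduceSum(const float *in, float *out) {
    float v = in[threadIdx.x];
    // Halve the stride each step (16, 8, 4, 2, 1); after the loop,
    // lane 0 holds the sum of all 32 lanes.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xFFFFFFFFu, v, offset);
    if (threadIdx.x == 0)
        *out = v;
}

int main() {
    // Launch a single warp: one block of 32 threads.
    float h_in[32], h_out = 0.0f;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;  // expected sum: 32
    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warpReduceSum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

This register-to-register exchange is why warp primitives make reductions cheaper than a pure shared-memory tree for the last 32 elements.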
All posts in the Lectures series:
Why warp-level primitives matter for reductions and lighter-weight cooperation (this post)
Using reduction kernels to connect shared memory, warp primitives, and synchronization
How softmax combines reductions, memory traffic, and numerical stability in one kernel
Why normalization kernels are often memory-bound and structurally important
How wider memory operations and alignment affect bandwidth utilization
Why using more registers can improve local efficiency but still reduce total throughput
How tensor cores change performance in compute-heavy kernels and why mixed precision matters
How asynchronous copy and double buffering help overlap memory movement with computation
Closing the GPU Systems series by connecting profiling, Triton experimentation, and FlashAttention-style thinking