GPU Systems 12 - Warp Shuffle and Warp-Level Primitives
Why warp-level primitives matter for reductions and lighter-weight cooperation
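To give a concrete flavor of the idea in the subtitle, here is a minimal sketch of a warp-level sum reduction built on `__shfl_down_sync`. The kernel name and launch shape are illustrative assumptions, not code from the lectures; the point is that the 32 lanes of a warp can exchange register values directly, with no shared memory and no `__syncthreads()`:

```cuda
#include <cstdio>

// Illustrative sketch (not from the lectures): each of the 32 lanes of a
// warp contributes one value, and __shfl_down_sync folds them together
// entirely in registers -- no shared memory, no block-wide barrier.
__global__ void warpReduceSum(const float *in, float *out) {
    float v = in[threadIdx.x];
    // Halve the stride each step (16, 8, 4, 2, 1); after the loop,
    // lane 0 holds the sum of all 32 lanes.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xFFFFFFFFu, v, offset);
    if (threadIdx.x == 0)
        *out = v;
}

int main() {
    // Launch a single warp: one block of 32 threads.
    float h_in[32], h_out = 0.0f;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;  // expected sum: 32
    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warpReduceSum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

This register-to-register exchange is why warp primitives make reductions cheaper than a pure shared-memory tree for the last 32 elements.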
All posts in the Lectures series:
Why warp-level primitives matter for reductions and lighter-weight cooperation (this post)
Using reduction kernels to connect shared memory, warp primitives, and synchronization
How softmax combines reductions, memory traffic, and numerical stability in one kernel
Why normalization kernels are often memory-bound and structurally important
How wider memory operations and alignment affect bandwidth utilization
Why using more registers can improve local efficiency but still reduce total throughput
How tensor cores change performance in compute-heavy kernels and why mixed precision matters
How asynchronous copy and double buffering help overlap memory movement with computation
Closing the GPU Systems series by connecting profiling, Triton experimentation, and FlashAttention-style thinking