GPU Systems 05 - Coalescing, Shared Memory, and Reduction Patterns
The optimization patterns that keep showing up in CUDA kernels
Why warp-level primitives matter for reductions and lighter-weight cooperation
Using reduction kernels to connect shared memory, warp primitives, and synchronization
How softmax combines reductions, memory traffic, and numerical stability in one kernel