GPU Systems 14 - Why Softmax Is Such a Good Kernel Exercise
How softmax combines reductions, memory traffic, and numerical stability in one kernel
A production-quality custom operator has to behave correctly under mixed precision, not just benchmark well in isolation
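To make the combination of reductions and numerical stability concrete, here is a minimal CPU-side sketch of a row-wise softmax written as the three passes a kernel would perform. The function names `softmax_row` and `naive_softmax_row` are hypothetical, chosen for illustration only, and plain Python floats stand in for whatever precision a real kernel would use. This is a sketch of the standard max-subtraction technique, not a GPU implementation.

```python
import math


def naive_softmax_row(row):
    """Textbook softmax: exp(x) overflows once x is large enough."""
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_row(row):
    """Numerically stable softmax over one row, written as three passes."""
    # Pass 1 (reduction): find the row maximum so every shifted input is <= 0.
    row_max = max(row)
    # Pass 2 (reduction): exponentiate the shifted inputs and sum them.
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    # Pass 3 (elementwise): normalize by the sum.
    return [e / total for e in exps]


if __name__ == "__main__":
    print(softmax_row([1000.0, 1000.0]))
    try:
        naive_softmax_row([1000.0, 1000.0])
    except OverflowError as err:
        print("naive version failed:", err)
```

Each pass reads the whole row, which is where the memory-traffic question comes from: a kernel that keeps the row in registers or shared memory between passes avoids re-reading it from global memory. The overflow problem also becomes much tighter at reduced precision, since the largest finite half-precision value is 65504 and `exp(x)` exceeds it for `x` above roughly 11.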