GPU Systems 04 - Writing CUDA Kernels and Choosing Launch Configuration
How to think about indexing and launch configuration when writing CUDA kernels
Why normalization kernels are often memory-bound and structurally important
A CUDA kernel becomes a real PyTorch operator only when tensor contracts, runtime semantics, and integration details are handled correctly
Fusion is valuable when it reduces memory traffic and intermediate materialization, not just when it reduces the number of visible ops
Triton is not just a convenient kernel language; it is part of the modern PyTorch kernel and compilation story