GPU Systems 20 - From Nsight to Triton to FlashAttention
Closing the GPU Systems series by connecting profiling, Triton experimentation, and FlashAttention-style thinking
What This Series Was Really Trying to Build
The purpose of studying GPU systems is not to memorize a few CUDA features. It is to develop the ability to look at a real operator and ask:
- where is the bottleneck?
- is memory or compute limiting the kernel?
- what structural change is most likely to help?
That is the real skill.
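The "memory or compute" question can be made concrete with a roofline check: compare a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's balance point. The peak numbers below are illustrative placeholders, not measurements of any specific GPU.

```python
# Hypothetical roofline check. PEAK_FLOPS and PEAK_BW are assumed values
# for illustration; substitute the numbers for your actual device.
PEAK_FLOPS = 300e12             # assumed peak throughput, FLOP/s
PEAK_BW = 2e12                  # assumed HBM bandwidth, B/s
BALANCE = PEAK_FLOPS / PEAK_BW  # FLOP/byte at the roofline ridge point

def bound_by(flops, bytes_moved):
    """Classify a kernel as memory- or compute-bound on this roofline."""
    intensity = flops / bytes_moved
    return "compute" if intensity >= BALANCE else "memory"

# Elementwise add on n fp32 values: n FLOPs, 12*n bytes (2 reads + 1 write).
n = 1 << 20
print(bound_by(n, 12 * n))                  # far left of the roofline

# Square matmul of size m: 2*m**3 FLOPs, roughly 3*m*m*4 bytes of traffic.
m = 4096
print(bound_by(2 * m**3, 3 * m * m * 4))    # high intensity
```

The two answers differ by orders of magnitude of intensity, which is why the same optimization playbook does not apply to both kinds of kernels.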
A Practical Workflow
In real work, the loop often looks like this:
- profile the kernel
- interpret it with roofline and stall reasoning
- change the kernel in CUDA or Triton
- measure again
That loop is much closer to GPU kernel engineering than any isolated trick.
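The "measure again" step deserves its own discipline: warm up first, repeat many times, and report a robust statistic. This is a CPU-side sketch of that harness; on a real GPU you would additionally synchronize the device (for example via CUDA events) before reading timestamps, but the loop structure is the same.

```python
# Minimal benchmarking harness sketch. The workload here is a stand-in;
# replace it with a real kernel launch plus a device synchronization.
import statistics
import time

def benchmark(fn, warmup=3, repeats=20):
    """Return the median runtime of fn() in seconds."""
    for _ in range(warmup):         # warm caches / JITs before timing
        fn()
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)  # median is robust to outlier runs

data = list(range(100_000))
t = benchmark(lambda: sum(data))
print(f"median: {t * 1e6:.1f} us")
```

Using the median rather than the mean keeps one slow outlier run (a context switch, a clock ramp) from distorting the comparison between two kernel variants.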
Triton is important inside this loop because it makes experimentation faster. CUDA gives lower-level control, but Triton makes it easier to test new kernel structures quickly. In practice, the two play complementary roles rather than purely competing ones.
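To make the "kernel structure" idea concrete without requiring a GPU, here is a NumPy emulation of Triton's block-program model, not real Triton code: each "program" instance handles one BLOCK-sized tile, with a mask guarding the ragged final tile, mirroring the structure of a masked `tl.load`/`tl.store` kernel.

```python
# NumPy emulation of a Triton-style blocked vector add (illustrative only).
import numpy as np

BLOCK = 128

def vector_add(x, y):
    n = x.shape[0]
    out = np.empty_like(x)
    num_programs = (n + BLOCK - 1) // BLOCK    # grid size, like triton.cdiv
    for pid in range(num_programs):            # each pid = one program instance
        offs = pid * BLOCK + np.arange(BLOCK)  # per-program element offsets
        mask = offs < n                        # guard out-of-bounds lanes
        idx = offs[mask]
        out[idx] = x[idx] + y[idx]             # masked "load", add, "store"
    return out

x = np.random.rand(1000).astype(np.float32)    # 1000 is not a BLOCK multiple
y = np.random.rand(1000).astype(np.float32)
assert np.allclose(vector_add(x, y), x + y)
```

The point of the emulation is that changing the tiling or masking strategy is a few-line edit, which is exactly the kind of structural experiment the profile-edit-measure loop needs to be cheap.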
Why FlashAttention Is a Good Final Example
FlashAttention is a strong closing example because it compresses many of the series' ideas into a single case:
- avoid materializing the full attention matrix
- compute in tiles
- reduce memory traffic aggressively
- handle reductions and numerical stability carefully
- solve a real model-scale bottleneck
It is not just a famous optimization. It is a good demonstration of GPU systems thinking.
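The bullets above can be sketched in NumPy: compute softmax(QKᵀ)V one key/value tile at a time, carrying a running row-max and running denominator so the full attention matrix is never materialized. This is only the online-softmax core under simplified assumptions; real kernels also tile over queries, apply the 1/√d scale, and fuse everything into one pass over HBM.

```python
# FlashAttention-style tiled attention sketch (illustrative, unscaled).
import numpy as np

def attention_tiled(Q, K, V, tile=64):
    n, d = Q.shape
    out = np.zeros((n, d))
    m = np.full(n, -np.inf)            # running row-max of the logits
    s = np.zeros(n)                    # running softmax denominator
    for j in range(0, K.shape[0], tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        logits = Q @ Kj.T              # only an (n, tile) tile, never (n, n)
        m_new = np.maximum(m, logits.max(axis=1))
        scale = np.exp(m - m_new)      # rescale old accumulators to new max
        p = np.exp(logits - m_new[:, None])
        s = s * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vj
        m = m_new
    return out / s[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
# Reference: materialize the full matrix, stable softmax, then multiply.
L = Q @ K.T
P = np.exp(L - L.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(attention_tiled(Q, K, V), ref)
```

Peak intermediate storage drops from O(n²) for the full logits matrix to O(n · tile), which is the whole reason the technique pays off at model scale.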
The important point is not to treat FlashAttention as just one branded algorithm. At a deeper level, it represents a general attitude:
- do not materialize giant intermediates unless you must
- process work in tiles
- treat memory traffic as seriously as arithmetic
- design for numerical correctness while optimizing performance
That attitude is the real lesson.
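The "numerical correctness while optimizing" point fits in a few lines: softmax over large logits overflows unless you subtract the row max first, a shift that changes nothing mathematically but keeps `exp` in range.

```python
# Numerically stable softmax: the max-subtraction trick.
import numpy as np

def softmax_stable(x):
    z = x - x.max()        # shift so the largest exponent is exp(0) = 1
    e = np.exp(z)
    return e / e.sum()

big = np.array([1000.0, 1001.0, 1002.0])
print(softmax_stable(big))  # finite probabilities summing to 1
# np.exp(big) alone overflows to inf, and inf/inf gives nan.
```

Every reduction-rescaling step in a tiled attention kernel is built on this same invariance.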
Where This Leads Next
After this series, the next natural directions are:
- PyTorch internals, to connect kernels into real framework execution
- distributed LLM training, to understand how these kernels behave in multi-GPU systems
In practice, those paths reconnect quickly: in large-model systems, kernels, runtime integration, and communication interact constantly.
For example, PyTorch internals matter once you want to plug a custom kernel cleanly into tensor and autograd flows. Distributed training matters once per-device kernels are no longer the whole performance story, because communication and sharding enter the picture.
So GPU systems is not an isolated niche. It is one major layer in a larger ML systems stack.
Closing Thought
Understanding GPU systems means stopping the habit of treating the GPU as a mysterious fast box. It means starting to see it as a structured execution and memory system. Once that shift happens, even a small CUDA kernel stops looking small.
That is the actual goal of the series: not just more API familiarity, but stronger performance judgment. Once that judgment starts to form, CUDA, Triton, PyTorch internals, and distributed training all become easier to connect into one coherent picture.