GPU Systems 20 - From Nsight to Triton to FlashAttention
Closing the GPU Systems series by connecting profiling, Triton experimentation, and FlashAttention-style thinking
What This Series Was Really Trying to Build
The purpose of studying GPU systems is not to memorize a few CUDA features. It is to develop the ability to look at a real operator and ask:
- where is the bottleneck?
- is memory or compute limiting the kernel?
- what structural change is most likely to help?
That is the real skill.
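The "memory or compute" question can be made concrete with a roofline check: compare a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's balance point. The peak numbers below are illustrative placeholders, not measurements of any specific GPU.

```python
# Hypothetical roofline check. PEAK_FLOPS and PEAK_BW are assumed values
# for illustration; substitute the numbers for your actual device.
PEAK_FLOPS = 300e12             # assumed peak throughput, FLOP/s
PEAK_BW = 2e12                  # assumed HBM bandwidth, B/s
BALANCE = PEAK_FLOPS / PEAK_BW  # FLOP/byte at the roofline ridge point

def bound_by(flops, bytes_moved):
    """Classify a kernel as memory- or compute-bound on this roofline."""
    intensity = flops / bytes_moved
    return "compute" if intensity >= BALANCE else "memory"

# Elementwise add on n fp32 values: n FLOPs, 12*n bytes (2 reads + 1 write).
n = 1 << 20
print(bound_by(n, 12 * n))                  # far left of the roofline

# Square matmul of size m: 2*m**3 FLOPs, roughly 3*m*m*4 bytes of traffic.
m = 4096
print(bound_by(2 * m**3, 3 * m * m * 4))    # high intensity
```

The two answers differ by orders of magnitude of intensity, which is why the same optimization playbook does not apply to both kinds of kernels.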
A Practical Workflow
In real work, the loop often looks like this:
- profile the kernel
- interpret it with roofline and stall reasoning
- change the kernel in CUDA or Triton
- measure again
That loop is much closer to GPU kernel engineering than any isolated trick.
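The "measure again" step deserves its own discipline: warm up first, repeat many times, and report a robust statistic. This is a CPU-side sketch of that harness; on a real GPU you would additionally synchronize the device (for example via CUDA events) before reading timestamps, but the loop structure is the same.

```python
# Minimal benchmarking harness sketch. The workload here is a stand-in;
# replace it with a real kernel launch plus a device synchronization.
import statistics
import time

def benchmark(fn, warmup=3, repeats=20):
    """Return the median runtime of fn() in seconds."""
    for _ in range(warmup):         # warm caches / JITs before timing
        fn()
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)  # median is robust to outlier runs

data = list(range(100_000))
t = benchmark(lambda: sum(data))
print(f"median: {t * 1e6:.1f} us")
```

Using the median rather than the mean keeps one slow outlier run (a context switch, a clock ramp) from distorting the comparison between two kernel variants.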
Triton is important inside this loop because it makes experimentation faster. CUDA gives lower-level control, but Triton makes it easier to test new kernel structures quickly. In practice, the two play complementary roles rather than purely competing ones.
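To make the "kernel structure" idea concrete without requiring a GPU, here is a NumPy emulation of Triton's block-program model, not real Triton code: each "program" instance handles one BLOCK-sized tile, with a mask guarding the ragged final tile, mirroring the structure of a masked `tl.load`/`tl.store` kernel.

```python
# NumPy emulation of a Triton-style blocked vector add (illustrative only).
import numpy as np

BLOCK = 128

def vector_add(x, y):
    n = x.shape[0]
    out = np.empty_like(x)
    num_programs = (n + BLOCK - 1) // BLOCK    # grid size, like triton.cdiv
    for pid in range(num_programs):            # each pid = one program instance
        offs = pid * BLOCK + np.arange(BLOCK)  # per-program element offsets
        mask = offs < n                        # guard out-of-bounds lanes
        idx = offs[mask]
        out[idx] = x[idx] + y[idx]             # masked "load", add, "store"
    return out

x = np.random.rand(1000).astype(np.float32)    # 1000 is not a BLOCK multiple
y = np.random.rand(1000).astype(np.float32)
assert np.allclose(vector_add(x, y), x + y)
```

The point of the emulation is that changing the tiling or masking strategy is a few-line edit, which is exactly the kind of structural experiment the profile-edit-measure loop needs to be cheap.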
Why FlashAttention Is a Good Final Example
FlashAttention is a strong closing example because it compresses many of the series' ideas into a single case:
- avoid materializing the full attention matrix
- compute in tiles
- reduce memory traffic aggressively
- handle reductions and numerical stability carefully
- solve a real model-scale bottleneck
It is not just a famous optimization. It is a good demonstration of GPU systems thinking.
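The bullets above can be sketched in NumPy: compute softmax(QKᵀ)V one key/value tile at a time, carrying a running row-max and running denominator so the full attention matrix is never materialized. This is only the online-softmax core under simplified assumptions; real kernels also tile over queries, apply the 1/√d scale, and fuse everything into one pass over HBM.

```python
# FlashAttention-style tiled attention sketch (illustrative, unscaled).
import numpy as np

def attention_tiled(Q, K, V, tile=64):
    n, d = Q.shape
    out = np.zeros((n, d))
    m = np.full(n, -np.inf)            # running row-max of the logits
    s = np.zeros(n)                    # running softmax denominator
    for j in range(0, K.shape[0], tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        logits = Q @ Kj.T              # only an (n, tile) tile, never (n, n)
        m_new = np.maximum(m, logits.max(axis=1))
        scale = np.exp(m - m_new)      # rescale old accumulators to new max
        p = np.exp(logits - m_new[:, None])
        s = s * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vj
        m = m_new
    return out / s[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
# Reference: materialize the full matrix, stable softmax, then multiply.
L = Q @ K.T
P = np.exp(L - L.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(attention_tiled(Q, K, V), ref)
```

Peak intermediate storage drops from O(n²) for the full logits matrix to O(n · tile), which is the whole reason the technique pays off at model scale.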
The important point is not to treat FlashAttention as just one branded algorithm. At a deeper level, it represents a general attitude:
- do not materialize giant intermediates unless you must
- process work in tiles
- treat memory traffic as seriously as arithmetic
- design for numerical correctness while optimizing performance
That attitude is the real lesson.
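The "numerical correctness while optimizing" point fits in a few lines: softmax over large logits overflows unless you subtract the row max first, a shift that changes nothing mathematically but keeps `exp` in range.

```python
# Numerically stable softmax: the max-subtraction trick.
import numpy as np

def softmax_stable(x):
    z = x - x.max()        # shift so the largest exponent is exp(0) = 1
    e = np.exp(z)
    return e / e.sum()

big = np.array([1000.0, 1001.0, 1002.0])
print(softmax_stable(big))  # finite probabilities summing to 1
# np.exp(big) alone overflows to inf, and inf/inf gives nan.
```

Every reduction-rescaling step in a tiled attention kernel is built on this same invariance.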
Where This Leads Next
After this series, the next natural directions are:
- PyTorch internals, to connect kernels into real framework execution
- distributed LLM training, to understand how these kernels behave in multi-GPU systems
In practice, those paths reconnect quickly: in large-model systems, kernels, runtime integration, and communication interact constantly.
For example, PyTorch internals matter once you want to plug a custom kernel cleanly into tensor and autograd flows. Distributed training matters once per-device kernels are no longer the whole performance story, because communication and sharding enter the picture.
So GPU systems is not an isolated niche. It is one major layer in a larger ML systems stack.
Closing Thought
Understanding GPU systems means stopping the habit of treating the GPU as a mysterious fast box. It means starting to see it as a structured execution and memory system. Once that shift happens, even a small CUDA kernel stops looking small.
That is the actual goal of the series: not just more API familiarity, but stronger performance judgment. Once that judgment starts to form, CUDA, Triton, PyTorch internals, and distributed training all become easier to connect into one coherent picture.