Why Triton Makes More Sense After CUDA

Triton feels different once you have already spent time with CUDA. If you encounter Triton first, it can look pleasantly compact but somewhat magical. After CUDA, it looks more like a productive interface built on the same underlying concerns.

That is one reason Triton shows up so often in LLM optimization work. It lets you experiment with kernel ideas such as fused softmax, layernorm, and attention-related operators without dropping immediately into raw CUDA for everything.
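To make one of those operators concrete, here is the per-row math that a fused softmax kernel computes in a single pass. This is a plain NumPy sketch of the computation, not Triton code; on a GPU, a fused kernel would keep the row in on-chip memory instead of launching separate max, exp, and sum kernels.

```python
import numpy as np

def row_softmax(x):
    """Numerically stable softmax over the last axis.

    A fused kernel computes all three steps per row in one pass,
    rather than as three separate operators with intermediates
    round-tripped through global memory.
    """
    m = x.max(axis=-1, keepdims=True)   # row max, for numerical stability
    e = np.exp(x - m)                   # shifted exponentials
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0]])
p = row_softmax(x)
```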

The Important Way to Read Triton

Triton is not a magic replacement for GPU understanding. It still lives inside the same performance world.

The important questions are still:

  • which tile is being loaded?
  • how regular are the memory accesses?
  • what data is being reused?
  • which operations are worth fusing to reduce memory traffic?

The surface syntax changes, but the performance questions do not.
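The tiling and reuse questions above can be made concrete with a blocked matrix multiply. This is a CPU sketch in NumPy, not a Triton kernel, but the loop structure mirrors what a tiled GPU kernel does: each output tile is computed by streaming matching tiles of the inputs, and each loaded tile is reused across a full tile-sized accumulation.

```python
import numpy as np

def blocked_matmul(A, B, tile=2):
    """Tiled matrix multiply.

    Each (i, j) output tile accumulates products of an A tile and a
    B tile; on a GPU those tiles would sit in shared memory or
    registers while being reused, which is exactly the "what data is
    being reused?" question.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((min(tile, M - i), min(tile, N - j)))
            for k in range(0, K, tile):
                # one tile of A is reused against one tile of B
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 7))
B = rng.standard_normal((7, 6))
C = blocked_matmul(A, B, tile=2)
```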

Why Kernel Fusion Matters Here

Deep learning models often pay a heavy cost for intermediate tensors that one operator writes to global memory and the next operator reads back.

If bias add, activation, and normalization-related work are all separate kernels, the memory traffic can dominate. A fused kernel reduces that traffic by keeping more of the computation local.
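The traffic difference is easy to estimate with back-of-the-envelope arithmetic. The sketch below uses deliberately simplified assumptions: each unfused kernel reads its full input and writes its full output once, and the small bias/scale vectors are ignored.

```python
def traffic_bytes(n_elems, dtype_bytes=4):
    """Rough global-memory traffic for y = norm(act(x + b)).

    Illustrative model only: every kernel does one full-tensor read
    and one full-tensor write; caches and small parameter vectors
    are ignored.
    """
    # Unfused: three kernels, each 1 read + 1 write of the tensor.
    unfused = 3 * 2 * n_elems * dtype_bytes
    # Fused: one kernel, 1 read + 1 write of the tensor.
    fused = 2 * n_elems * dtype_bytes
    return unfused, fused

unfused, fused = traffic_bytes(1_000_000)  # 1M fp32 elements
```

Under these assumptions fusion cuts traffic by 3x, which is why memory-bound elementwise chains are the classic fusion target.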

But fusion is not automatically good. The real question is whether the fused form improves memory behavior without creating new problems such as excessive register pressure.

Why Flash Attention Is So Interesting

Fast attention kernels are useful study material because they make the real optimization mindset visible.

The goal is not just to compute the same math faster. The goal is to rethink what gets materialized, what stays local, and how tiling changes the memory story.
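The "what gets materialized" shift can be seen in the online-softmax recurrence at the heart of flash-style attention. The NumPy sketch below processes K and V in tiles and rescales running statistics, so the full score matrix is never materialized; it is a CPU illustration of the recurrence, not an actual GPU kernel.

```python
import numpy as np

def attention_ref(Q, K, V):
    """Reference attention: materializes the full score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def attention_tiled(Q, K, V, tile=4):
    """Online-softmax attention over K/V tiles.

    Keeps only a running row max, denominator, and output
    accumulator, rescaling them as each new tile arrives.
    """
    d = Q.shape[-1]
    m = np.full(Q.shape[0], -np.inf)   # running row max
    l = np.zeros(Q.shape[0])           # running softmax denominator
    O = np.zeros_like(Q, dtype=float)  # running output accumulator
    for s in range(0, K.shape[0], tile):
        S = Q @ K[s:s+tile].T / np.sqrt(d)   # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)            # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        O = O * scale[:, None] + P @ V[s:s+tile]
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(1)
Q = rng.standard_normal((8, 4))
K = rng.standard_normal((10, 4))
V = rng.standard_normal((10, 4))
```

The recurrence produces exactly the same result as the reference, but its working set per query row is a constant number of tiles rather than the full sequence length.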

That is the point where GPU kernel work starts to feel like systems design rather than just writing math kernels.

Profiling Is the Real Arbiter

By this stage, intuition alone is not enough. You need measurement.

Questions worth checking with profiling tools include:

  • how much memory throughput is actually being reached?
  • is occupancy healthy?
  • why are warps stalling?
  • did fusion increase register pressure too much?

GPU optimization has to be measurement-driven. Otherwise it becomes storytelling.
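The first question on that list reduces to simple arithmetic once a profiler reports bytes moved and kernel time. A minimal sketch, where the peak-bandwidth figure is a placeholder rather than any specific GPU's spec:

```python
def achieved_bandwidth_frac(bytes_moved, kernel_seconds, peak_gbps=900.0):
    """Fraction of peak DRAM bandwidth a kernel reached.

    bytes_moved and kernel_seconds would come from a profiler such
    as Nsight Compute; peak_gbps is the device's theoretical
    bandwidth (900 GB/s here is an illustrative placeholder).
    """
    achieved_gbps = bytes_moved / kernel_seconds / 1e9
    return achieved_gbps / peak_gbps

# e.g. 900 MB moved in 1 ms -> 900 GB/s -> 100% of the assumed peak
frac = achieved_bandwidth_frac(900e6, 1e-3)
```

A memory-bound kernel sitting far below this fraction is the usual signal that access patterns or tiling need another look.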

Where This Series Leads Next

After this series, there are two natural next steps.

One is PyTorch internals: how custom kernels and low-level optimizations get connected to a real training framework.

The other is distributed LLM training: how these kernels matter once the workload expands across multiple GPUs and frameworks.

In practice, GPU kernel work and distributed training work overlap quite a lot. If you do not understand kernels, many framework bottlenecks stay blurry. If you do not understand distributed training, it is easy to miss which kernel optimizations really matter in large-model systems.

Closing Thought

The main point of this series is to stop seeing the GPU as just a fast device. It is a structured compute system with a specific execution model and a specific memory hierarchy. Good kernels come from understanding that structure.

The next series will move into PyTorch internals so the connection between low-level kernels and real training code becomes clearer.