Guessing Is Not Enough for GPU Optimization

Once you go beyond introductory CUDA examples, intuition alone stops being reliable. You may know that coalescing and shared memory matter, but that still does not tell you which part of the kernel actually limits performance.

For example:

  • a kernel may look arithmetic-heavy but still be memory-bound
  • occupancy may look suspicious, but stall reasons may point elsewhere
  • fusion may seem helpful, but register pressure may make it worse

That is why profiling and roofline thinking matter.

What Profiling Is Really Trying to Answer

The purpose of profiling is not just to measure total time. The more important goal is to identify which resource limit the kernel hits first.

Useful questions include:

  • how much memory throughput is actually reached?
  • how busy are the SMs?
  • what are the dominant warp stall reasons?
  • what does the instruction mix look like?
  • is occupancy sufficient?

Without these questions, optimization becomes mostly guesswork.
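The first two questions reduce to simple arithmetic once you have the raw numbers from a profile. A minimal sketch, with hypothetical values standing in for real measurements:

```python
# Sketch: turning raw profile numbers into "how much throughput is
# actually reached?" All figures below are hypothetical; substitute
# values measured on your own kernel and device.

def achieved_bandwidth_gbs(bytes_moved: int, kernel_time_s: float) -> float:
    """Achieved memory throughput in GB/s."""
    return bytes_moved / kernel_time_s / 1e9

# Hypothetical kernel: moves 256 MiB total (reads + writes) in 2.0 ms.
bw = achieved_bandwidth_gbs(256 * 2**20, 2.0e-3)
peak_bw = 900.0  # GB/s, hypothetical device peak
print(f"{bw:.1f} GB/s ({bw / peak_bw:.0%} of peak)")
```

A kernel reaching only a small fraction of peak bandwidth while the SMs sit idle is the first concrete hint of where to look next.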

How Not to Get Lost in Nsight

Tools like Nsight Compute expose a huge number of metrics. It is easy to get overwhelmed. A practical reading order helps:

  1. kernel time and launch count
  2. memory throughput and cache behavior
  3. achieved occupancy plus register/shared memory usage
  4. warp stall reasons
  5. instruction mix and tensor core utilization

This is often enough to form an initial bottleneck hypothesis.

Why the Roofline Model Helps

The roofline perspective asks where a kernel sits relative to the hardware's memory and compute ceilings.

At a high level:

  • low arithmetic intensity often means memory-bound behavior
  • high arithmetic intensity often means compute-bound behavior

This is helpful because it suggests what kind of optimization is more likely to matter next.
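The classification above can be made mechanical. A minimal roofline sketch, using hypothetical peak numbers, computes the attainable ceiling for a given arithmetic intensity and reports which roof the kernel sits under:

```python
# Minimal roofline sketch. Peak numbers are hypothetical; use your
# device's actual peak compute throughput and memory bandwidth.

PEAK_FLOPS = 19.5e12   # FLOP/s, hypothetical
PEAK_BW = 1.55e12      # bytes/s, hypothetical

def attainable_flops(ai: float) -> float:
    """Roofline ceiling: the lower of the compute roof and the memory roof."""
    return min(PEAK_FLOPS, ai * PEAK_BW)

def bound_by(ai: float) -> str:
    ridge = PEAK_FLOPS / PEAK_BW  # intensity where the two roofs meet
    return "memory" if ai < ridge else "compute"

print(bound_by(0.25))   # low intensity, e.g. an elementwise op -> memory
print(bound_by(50.0))   # high intensity, e.g. a large matmul -> compute
```

The ridge point, peak compute divided by peak bandwidth, is the intensity at which a kernel stops being limited by traffic and starts being limited by arithmetic.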

If the kernel is memory-bound, you often care more about:

  • better coalescing
  • more reuse through shared memory
  • fusion to reduce traffic
  • fewer intermediate reads and writes

If the kernel is compute-bound, you often care more about:

  • instruction efficiency
  • tensor core use
  • unrolling and scheduling
  • arithmetic pipeline utilization

Arithmetic Intensity in Practice

Arithmetic intensity is the ratio of arithmetic operations performed to bytes moved to and from memory, typically expressed in FLOPs per byte.

Matrix multiplication tends to have high arithmetic intensity. Elementwise operators tend to have lower intensity. Softmax and normalization kernels often become more memory-sensitive than people initially expect because their traffic patterns dominate.

That is why operator names alone are not enough. The dataflow matters.
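These claims can be checked with rough estimates. A sketch, assuming float32 and counting only the compulsory traffic of reading inputs and writing outputs once:

```python
# Rough arithmetic-intensity estimates (FLOPs per byte, float32),
# counting only compulsory DRAM traffic for each operator.

def ai_matmul(m: int, n: int, k: int) -> float:
    flops = 2 * m * n * k                      # one multiply-add per (i, j, k)
    bytes_moved = 4 * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved

def ai_elementwise_add(n: int) -> float:
    flops = n                # one add per element
    bytes_moved = 4 * 3 * n  # read a and b, write c
    return flops / bytes_moved

print(f"matmul 4096^3: {ai_matmul(4096, 4096, 4096):.0f} FLOPs/byte")
print(f"elementwise:   {ai_elementwise_add(1 << 20):.3f} FLOPs/byte")
```

A large matmul lands hundreds of FLOPs per byte above the ridge point, while an elementwise add sits far below it, which is exactly the compute-bound versus memory-bound split from the roofline picture.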

Why Stall Reasons Are So Useful

Warp stall reasons help explain not just that a kernel is slow, but why it is slow.

For example:

  • memory dependency stalls suggest traffic or latency issues
  • execution dependency stalls suggest instruction-level dependency problems
  • barrier stalls suggest synchronization structure costs

Those signals point you toward very different optimization strategies.

Good Profiling Habits

Some habits make GPU optimization much more reliable:

  • establish a baseline before changing anything
  • change one thing at a time
  • explain why a speedup happened, not just that it happened
  • look at time, throughput, and stall reasons together

Otherwise you may improve something without understanding the cause.
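The first two habits can be enforced with a tiny record-keeping helper. A sketch (the helper names are illustrative, not a real tool); on a GPU you would time actual kernel launches, with proper synchronization, instead of the host-side stand-in used here:

```python
# Sketch of "baseline first, one change at a time" as a tiny log.
# time_kernel uses host wall-clock as a stand-in for real kernel timing.
import time

def time_kernel(fn, *args, repeats: int = 5) -> float:
    """Best-of-N wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

log = []

def record(name: str, seconds: float) -> None:
    baseline = log[0][1] if log else seconds
    log.append((name, seconds))
    print(f"{name}: {seconds * 1e3:.2f} ms  ({baseline / seconds:.2f}x vs baseline)")

# Usage: record("baseline", ...); change ONE thing; record("variant", ...).
```

Keeping every measurement relative to a fixed baseline makes it obvious when a "speedup" is actually noise, and when two changes interacted.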

A Concrete Example

Suppose a softmax kernel is slow. It is easy to blame exponentiation immediately. But profiling may show that the dominant cost is multiple row reads, poor memory reuse, or inefficient reductions.

In that case, memory traffic is a bigger problem than the arithmetic itself.
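The row-read cost can be estimated directly. A sketch comparing a naive three-pass softmax against a fused single pass, assuming float32 rows and an online (streaming) reduction that keeps the row statistics on chip:

```python
# Per-row bytes moved for softmax over a row of n float32 values.

def bytes_three_pass(n: int) -> int:
    # pass 1: read row (find max)
    # pass 2: read row, write exponentials
    # pass 3: read exponentials, write normalized output
    return 4 * (n + 2 * n + 2 * n)

def bytes_fused(n: int) -> int:
    # read the row once, write the result once; max and sum stay on chip
    return 4 * 2 * n

n = 4096
print(bytes_three_pass(n) / bytes_fused(n))  # naive moves 2.5x the bytes
```

The exponentials are identical in both versions; only the traffic changes, which is why the fused kernel wins even though it does no less math.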

The Main Point

Profiling and roofline thinking help turn optimization into a structured process:

  • identify whether the kernel is compute-bound or memory-bound
  • see which upper limit it approaches
  • inspect the dominant stall reasons
  • choose the next optimization based on measured evidence

The next post will apply this lens to naive matrix multiplication and explain why it leaves so much performance on the table.