Guessing Is Not Enough for GPU Optimization

Once you go beyond introductory CUDA examples, intuition alone stops being reliable. You may know that coalescing and shared memory matter, but that still does not tell you which part of the kernel actually limits performance.

For example:

  • a kernel may look arithmetic-heavy but still be memory-bound
  • occupancy may look suspicious, but stall reasons may point elsewhere
  • fusion may seem helpful, but register pressure may make it worse

That is why profiling and roofline thinking matter.

What Profiling Is Really Trying to Answer

The purpose of profiling is not just to measure total time. The more important goal is to identify which resource limit the kernel hits first.

Useful questions include:

  • how much memory throughput is actually reached?
  • how busy are the SMs?
  • what are the dominant warp stall reasons?
  • what does the instruction mix look like?
  • is occupancy sufficient?

Without these questions, optimization becomes mostly guesswork.
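The first two questions reduce to simple arithmetic once you have the raw numbers from a profile. A minimal sketch, with hypothetical values standing in for real measurements:

```python
# Sketch: turning raw profile numbers into "how much throughput is
# actually reached?" All figures below are hypothetical; substitute
# values measured on your own kernel and device.

def achieved_bandwidth_gbs(bytes_moved: int, kernel_time_s: float) -> float:
    """Achieved memory throughput in GB/s."""
    return bytes_moved / kernel_time_s / 1e9

# Hypothetical kernel: moves 256 MiB total (reads + writes) in 2.0 ms.
bw = achieved_bandwidth_gbs(256 * 2**20, 2.0e-3)
peak_bw = 900.0  # GB/s, hypothetical device peak
print(f"{bw:.1f} GB/s ({bw / peak_bw:.0%} of peak)")
```

A kernel reaching only a small fraction of peak bandwidth while the SMs sit idle is the first concrete hint of where to look next.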

How Not to Get Lost in Nsight

Tools like Nsight Compute expose a huge number of metrics. It is easy to get overwhelmed. A practical reading order helps:

  1. kernel time and launch count
  2. memory throughput and cache behavior
  3. achieved occupancy plus register/shared memory usage
  4. warp stall reasons
  5. instruction mix and tensor core utilization

This is often enough to form an initial bottleneck hypothesis.

Why the Roofline Model Helps

The roofline perspective asks where a kernel sits relative to the hardware's memory and compute ceilings.

At a high level:

  • low arithmetic intensity often means memory-bound behavior
  • high arithmetic intensity often means compute-bound behavior

This is helpful because it suggests what kind of optimization is more likely to matter next.
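The classification above can be made mechanical. A minimal roofline sketch, using hypothetical peak numbers, computes the attainable ceiling for a given arithmetic intensity and reports which roof the kernel sits under:

```python
# Minimal roofline sketch. Peak numbers are hypothetical; use your
# device's actual peak compute throughput and memory bandwidth.

PEAK_FLOPS = 19.5e12   # FLOP/s, hypothetical
PEAK_BW = 1.55e12      # bytes/s, hypothetical

def attainable_flops(ai: float) -> float:
    """Roofline ceiling: the lower of the compute roof and the memory roof."""
    return min(PEAK_FLOPS, ai * PEAK_BW)

def bound_by(ai: float) -> str:
    ridge = PEAK_FLOPS / PEAK_BW  # intensity where the two roofs meet
    return "memory" if ai < ridge else "compute"

print(bound_by(0.25))   # low intensity, e.g. an elementwise op -> memory
print(bound_by(50.0))   # high intensity, e.g. a large matmul -> compute
```

The ridge point, peak compute divided by peak bandwidth, is the intensity at which a kernel stops being limited by traffic and starts being limited by arithmetic.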

If the kernel is memory-bound, you often care more about:

  • better coalescing
  • more reuse through shared memory
  • fusion to reduce traffic
  • fewer intermediate reads and writes

If the kernel is compute-bound, you often care more about:

  • instruction efficiency
  • tensor core use
  • unrolling and scheduling
  • arithmetic pipeline utilization

Arithmetic Intensity in Practice

Arithmetic intensity is the ratio of arithmetic operations performed to bytes moved to and from memory, typically expressed in FLOPs per byte.

Matrix multiplication tends to have high arithmetic intensity. Elementwise operators tend to have lower intensity. Softmax and normalization kernels often become more memory-sensitive than people initially expect because their traffic patterns dominate.

That is why operator names alone are not enough. The dataflow matters.
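These claims can be checked with rough estimates. A sketch, assuming float32 and counting only the compulsory traffic of reading inputs and writing outputs once:

```python
# Rough arithmetic-intensity estimates (FLOPs per byte, float32),
# counting only compulsory DRAM traffic for each operator.

def ai_matmul(m: int, n: int, k: int) -> float:
    flops = 2 * m * n * k                      # one multiply-add per (i, j, k)
    bytes_moved = 4 * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved

def ai_elementwise_add(n: int) -> float:
    flops = n                # one add per element
    bytes_moved = 4 * 3 * n  # read a and b, write c
    return flops / bytes_moved

print(f"matmul 4096^3: {ai_matmul(4096, 4096, 4096):.0f} FLOPs/byte")
print(f"elementwise:   {ai_elementwise_add(1 << 20):.3f} FLOPs/byte")
```

A large matmul lands hundreds of FLOPs per byte above the ridge point, while an elementwise add sits far below it, which is exactly the compute-bound versus memory-bound split from the roofline picture.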

Why Stall Reasons Are So Useful

Warp stall reasons help explain not just that a kernel is slow, but why it is slow.

For example:

  • memory dependency stalls suggest traffic or latency issues
  • execution dependency stalls suggest instruction-level dependency problems
  • barrier stalls suggest synchronization structure costs

Those signals point you toward very different optimization strategies.

Good Profiling Habits

Some habits make GPU optimization much more reliable:

  • establish a baseline before changing anything
  • change one thing at a time
  • explain why a speedup happened, not just that it happened
  • look at time, throughput, and stall reasons together

Otherwise you may improve something without understanding the cause.
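The first two habits can be enforced with a tiny record-keeping helper. A sketch (the helper names are illustrative, not a real tool); on a GPU you would time actual kernel launches, with proper synchronization, instead of the host-side stand-in used here:

```python
# Sketch of "baseline first, one change at a time" as a tiny log.
# time_kernel uses host wall-clock as a stand-in for real kernel timing.
import time

def time_kernel(fn, *args, repeats: int = 5) -> float:
    """Best-of-N wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

log = []

def record(name: str, seconds: float) -> None:
    baseline = log[0][1] if log else seconds
    log.append((name, seconds))
    print(f"{name}: {seconds * 1e3:.2f} ms  ({baseline / seconds:.2f}x vs baseline)")

# Usage: record("baseline", ...); change ONE thing; record("variant", ...).
```

Keeping every measurement relative to a fixed baseline makes it obvious when a "speedup" is actually noise, and when two changes interacted.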

A Concrete Example

Suppose a softmax kernel is slow. It is easy to blame exponentiation immediately. But profiling may show that the dominant cost is multiple row reads, poor memory reuse, or inefficient reductions.

In that case, memory traffic is a bigger problem than the arithmetic itself.
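The row-read cost can be estimated directly. A sketch comparing a naive three-pass softmax against a fused single pass, assuming float32 rows and an online (streaming) reduction that keeps the row statistics on chip:

```python
# Per-row bytes moved for softmax over a row of n float32 values.

def bytes_three_pass(n: int) -> int:
    # pass 1: read row (find max)
    # pass 2: read row, write exponentials
    # pass 3: read exponentials, write normalized output
    return 4 * (n + 2 * n + 2 * n)

def bytes_fused(n: int) -> int:
    # read the row once, write the result once; max and sum stay on chip
    return 4 * 2 * n

n = 4096
print(bytes_three_pass(n) / bytes_fused(n))  # naive moves 2.5x the bytes
```

The exponentials are identical in both versions; only the traffic changes, which is why the fused kernel wins even though it does no less math.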

The Main Point

Profiling and roofline thinking help turn optimization into a structured process:

  • identify whether the kernel is compute-bound or memory-bound
  • see which upper limit it approaches
  • inspect the dominant stall reasons
  • choose the next optimization based on measured evidence

The next post will apply this lens to naive matrix multiplication and explain why it leaves so much performance on the table.