GPU Systems 07 - Occupancy and Latency Hiding

The Most Common Mistake With Occupancy

Once people begin reading about GPU optimization, they quickly encounter occupancy. It is usually introduced as the ratio of resident warps on an SM to the maximum number of warps that SM can hold. That definition is correct, but on its own it is easy to misuse.

The most common misunderstandings are:

higher occupancy always means higher performance
low occupancy automatically means a bad kernel
increasing block size is a simple way to improve occupancy and therefore speed

In reality, occupancy is only meaningful when tied back to latency hiding.

Why GPUs Want Many Warps Available

Warps frequently stall on memory access latency or on instruction dependencies. When one warp cannot make progress, the warp scheduler tries to issue from another warp that is ready.

That is the basic throughput idea: while one warp waits, another one runs. This is why having many resident warps matters. Occupancy is one view into how much opportunity the hardware has to hide those delays.
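A toy model can make the "while one warp waits, another one runs" idea concrete. The sketch below assumes every warp alternates a fixed number of busy cycles with a fixed number of stall cycles, and that switching warps is free. The function names and the cycle counts are illustrative assumptions, not measurements from real hardware.

```python
import math


def warps_needed(stall_cycles: int, busy_cycles: int) -> int:
    """Warps required so that some warp is always ready to issue.

    Toy model (an assumption): each warp issues for busy_cycles,
    then waits for stall_cycles, and switching warps costs nothing.
    """
    return math.ceil(stall_cycles / busy_cycles) + 1


def issue_utilization(resident_warps: int, stall_cycles: int, busy_cycles: int) -> float:
    """Fraction of cycles in which the scheduler has something to issue."""
    period = busy_cycles + stall_cycles
    return min(1.0, resident_warps * busy_cycles / period)


if __name__ == "__main__":
    # Illustrative numbers only: a 400-cycle stall after every 20 busy cycles.
    print(warps_needed(400, 20))
    print(issue_utilization(8, 400, 20))
```

Under these assumed numbers the model asks for 21 warps to keep the scheduler busy, and 8 resident warps would cover only about 38% of the cycles. Real schedulers and real stall patterns are far less regular, so this is useful for intuition rather than prediction.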

What Occupancy Actually Measures

Occupancy is usually expressed as:

occupancy = (resident warps on an SM) / (maximum resident warps that SM can support)

The number of resident warps, and therefore the ratio, is limited by several resource constraints:

threads per block
shared memory per block
registers per thread
hardware SM limits

So occupancy is not controlled by one knob. It emerges from the full resource shape of the kernel.
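The way these constraints combine can be sketched as a small calculation: each resource caps how many blocks fit on an SM, and the tightest cap wins. The per-SM limits in `SMLimits` below are hypothetical placeholders, and the sketch ignores the allocation granularity that real hardware applies to registers and shared memory.

```python
from dataclasses import dataclass

WARP_SIZE = 32


@dataclass(frozen=True)
class SMLimits:
    """Hypothetical per-SM limits; real values depend on the architecture."""
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 98304


def occupancy(threads_per_block: int, regs_per_thread: int,
              smem_per_block: int, sm: SMLimits = SMLimits()) -> float:
    """Estimate occupancy as resident warps / maximum resident warps."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division

    # Each resource allows some number of whole blocks on the SM.
    by_warps = sm.max_warps // warps_per_block
    by_regs = sm.registers // (regs_per_thread * warps_per_block * WARP_SIZE)
    by_smem = sm.shared_mem_bytes // smem_per_block if smem_per_block > 0 else sm.max_blocks

    # The tightest limit decides how many blocks are actually resident.
    blocks = min(sm.max_blocks, by_warps, by_regs, by_smem)
    return blocks * warps_per_block / sm.max_warps


if __name__ == "__main__":
    print(occupancy(256, 32, 0))
    print(occupancy(256, 64, 0))
    print(occupancy(256, 32, 49152))
```

With these assumed limits, a 256-thread block at 32 registers per thread reaches 1.0, doubling the registers to 64 drops it to 0.5, and keeping 32 registers but using 48 KiB of shared memory per block drops it to 0.25. Three different knobs, three different outcomes.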

Why Very Low Occupancy Can Hurt

If occupancy is too low, the scheduler has fewer warps to choose from when one stalls. That means memory latency or dependency stalls are harder to hide.

This tends to matter especially in memory-bound kernels, where waiting is already a major part of execution. If there are not enough ready warps, the SM can sit idle more often.

Why High Occupancy Is Not Always Better

Trying to maximize occupancy at all costs can backfire.

For example:

you may cap register use so aggressively that the compiler spills values to local memory
you may choose a block structure that hurts memory access quality
you may weaken useful shared memory reuse just to fit more warps

That is why occupancy should be treated as a means, not as the final goal. The real goal is enough active warps to hide latency without damaging the rest of the kernel design.

A Practical Way to Think About It

In practice, occupancy is useful like this:

if it is very low, suspect resource structure first
if it is reasonably healthy, move on to throughput and stall analysis
if it is high but performance is still poor, another bottleneck probably dominates

For example, a kernel at 50% occupancy can still perform very well if memory access is good and latency hiding is already sufficient. On the other hand, a kernel at 100% occupancy can still be bandwidth-bound, in which case adding more resident warps does not help.

Register Pressure Is Often the Real Story

One of the most common reasons occupancy falls is register pressure. The register file on an SM is shared by all resident threads, so a kernel that keeps too many values live at once reduces how many warps can stay resident on the SM.

This is one of the reasons GPU optimization is so full of tradeoffs. A local improvement in per-thread efficiency can reduce global throughput if it lowers residency too much.
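The register tradeoff can be put into rough numbers. The sketch below assumes a hypothetical SM with 65536 registers and room for 64 warps, and it ignores the allocation granularity real hardware uses, so the figures are only indicative.

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64      # assumed limit for a hypothetical SM
REGISTERS_PER_SM = 65536   # assumed register file size


def resident_warps(regs_per_thread: int) -> int:
    """Warps that fit on the SM when registers are the only constraint."""
    regs_per_warp = regs_per_thread * WARP_SIZE
    return min(MAX_WARPS_PER_SM, REGISTERS_PER_SM // regs_per_warp)


for regs in (32, 64, 128):
    warps = resident_warps(regs)
    print(f"{regs} regs/thread -> {warps} warps "
          f"({warps / MAX_WARPS_PER_SM:.2f} occupancy)")
```

In this model, each doubling of registers per thread halves the number of resident warps: 64, then 32, then 16. Whether that loss matters depends on whether the remaining warps still hide the kernel's latency.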

How to Use Occupancy Tools Correctly

Occupancy calculators and profiler estimates are useful, but they should not be treated like a final score. They tell you what residency is possible under the kernel's current resource use. They do not tell you whether those warps are actually ready to issue, or whether the kernel is optimal.

They are most useful for:

comparing block-size candidates
seeing how register or shared memory use changes residency
explaining why a kernel has fewer active warps than expected
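A minimal sketch of that kind of comparison is shown below: it sweeps block-size candidates and reports both the resident warps and the resource that limits them. The limits in `SM`, the helper names `residency` and `sweep`, and the example resource figures are all assumptions made for illustration; a real occupancy calculator also models allocation granularity.

```python
WARP_SIZE = 32
# Assumed limits for a hypothetical SM.
SM = {"warps": 64, "blocks": 32, "registers": 65536, "smem": 98304}


def residency(block_size: int, regs_per_thread: int, smem_per_block: int):
    """Return (resident warps, name of the limiting resource)."""
    warps_per_block = -(-block_size // WARP_SIZE)  # ceiling division

    # Blocks allowed by each resource, considered independently.
    limits = {
        "warp slots": SM["warps"] // warps_per_block,
        "block slots": SM["blocks"],
        "registers": SM["registers"] // (regs_per_thread * warps_per_block * WARP_SIZE),
    }
    if smem_per_block > 0:
        limits["shared memory"] = SM["smem"] // smem_per_block

    limiter = min(limits, key=limits.get)
    return limits[limiter] * warps_per_block, limiter


def sweep(regs_per_thread: int, smem_per_block: int,
          candidates=(64, 128, 256, 512, 1024)):
    """Evaluate residency for each candidate block size."""
    return {b: residency(b, regs_per_thread, smem_per_block) for b in candidates}


if __name__ == "__main__":
    for block_size, (warps, limiter) in sweep(40, 16384).items():
        print(f"block {block_size}: {warps} warps, limited by {limiter}")
```

For the assumed kernel (40 registers per thread, 16 KiB of shared memory per block), the model gives 12 warps at 64 threads, 24 at 128, 48 at both 256 and 512, and only 32 at 1024. The largest block is not the best one here, and the limiting resource shifts from shared memory to registers as blocks grow.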

The Main Point

Occupancy makes sense when viewed through one question: how well can this kernel hide latency?

That is the useful chain:

resident warps create scheduling choices
scheduling choices help hide latency
latency hiding helps throughput

The next post will move into profiling and roofline thinking so we can decide whether a kernel is really memory-bound or compute-bound.

