GPU Systems 02 - The Thread, Warp, and Block Execution Model
What threads, warps, blocks, and grids mean in actual GPU execution
The First Part of GPU Programming That Usually Feels Abstract
When people first meet CUDA, they quickly run into the terms thread, warp, block, and grid. At the syntax level, it is easy to memorize them as different execution units and move on. But if you want to understand kernel performance, these are not just labels. They describe how work is actually organized and executed.
A CPU usually relies on a small number of strong cores making fine-grained control decisions. A GPU instead throws a large number of threads at the problem and depends on structured grouping to get throughput. The most important group in that structure is the warp.
A Thread Is the Smallest Logical Unit
In CUDA code, one thread often handles one element or one small piece of work.
In a vector add kernel, for example, a thread might:
- read one element from each input vector
- add the two values
- write one output element
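The three steps above can be sketched as a minimal kernel (the name vecAdd and parameter names are illustrative):

```cuda
// Minimal vector add: each thread handles exactly one element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {             // guard: the grid may overshoot the problem size
        c[i] = a[i] + b[i];  // read one element from each input, write one output
    }
}
```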
At this stage, it is easy to focus only on the fact that there are many threads. But the GPU does not schedule each thread completely independently.
A Warp Is the More Real Execution Unit
On NVIDIA GPUs, a warp is a group of 32 threads. The hardware schedules and executes instructions at warp granularity: the threads in a warp issue the same instruction together, a model NVIDIA calls SIMT (single instruction, multiple threads).
This matters because branch behavior inside a warp directly affects performance. If half the threads in a warp take one branch and the other half take the other, the hardware executes both paths in sequence, masking off the threads that did not take each path. This is warp divergence.
So even though the code is written in per-thread terms, performance often depends on how threads behave together inside a warp.
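A minimal sketch of the difference, assuming the standard warp size of 32 (the kernel names are illustrative):

```cuda
// Divergent: even and odd lanes within the same warp take different
// paths, so the warp runs both paths serially with masking.
__global__ void divergentBranch(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] *= 2.0f;  // even lanes
    else            x[i] += 1.0f;  // odd lanes
}

// Uniform per warp: the condition is constant across each warp of 32
// consecutive threads, so every warp takes a single path.
__global__ void uniformBranch(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / warpSize) % 2 == 0) x[i] *= 2.0f;  // whole warp goes one way
    else                         x[i] += 1.0f;  // whole warp goes the other
}
```

Both kernels compute something similar per thread, but the second groups the branch decision at warp granularity, which is exactly the kind of restructuring divergence analysis leads to.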
A Block Is the Cooperation and Resource Unit
A block groups multiple warps together. It matters for two main reasons.
First, threads in the same block can cooperate through shared memory.
Second, a block is resident on a single streaming multiprocessor (SM), so block size directly interacts with per-SM resources such as shared memory and registers. That is why block size decisions affect occupancy and execution efficiency.
A very large block may look good on paper because it contains many threads, but it may also consume enough registers or shared memory to reduce how many blocks can run concurrently.
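The first point, cooperation through shared memory, can be sketched as a block-wide tree reduction (a minimal sketch; the 256-slot buffer assumes a 256-thread block with a power-of-two size):

```cuda
// Each block cooperatively sums its slice of the input into one partial sum.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float buf[256];            // one shared-memory slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                      // make all loads visible block-wide

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();                  // wait before the next halving step
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];         // one partial sum per block
}
```

Nothing in this pattern works across blocks: `__syncthreads()` and shared memory are block-scoped, which is precisely why the block is the cooperation unit.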
The Grid Represents the Full Problem Coverage
The grid is the full launch space of the kernel. In practical terms, it answers the question: how many blocks are being launched to cover the whole problem?
A useful mental model is:
- thread: the smallest logical worker
- warp: the important execution group
- block: the cooperation and resource unit
- grid: the full problem coverage
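All four levels show up in a single launch line. A typical host-side configuration looks like this (the kernel name and device pointers are illustrative placeholders):

```cuda
int n = 1 << 20;            // problem size: about one million elements
int threadsPerBlock = 256;  // 8 warps of 32 threads per block
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // grid covers n

// grid of blocks, blocks of warps, warps of threads
vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
```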
Once this structure becomes familiar, CUDA code starts to feel much less magical.
Why This Directly Affects Performance
This execution model is not just terminology. Most performance issues come back to it.
Questions like these all depend on it:
- Is warp divergence hurting efficiency?
- Is the block size too large, reducing occupancy?
- Is block-level cooperation using shared memory effectively?
- Is the grid large enough to keep the GPU busy?
That is why understanding threads, warps, and blocks is part of performance reasoning, not just code reading.
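The block-size question in particular can be asked of the runtime directly. CUDA's occupancy API reports how many blocks of a given size fit on one SM, given a kernel's register and shared-memory footprint (a sketch; someKernel is a placeholder):

```cuda
int numBlocksPerSm = 0;
int blockSize = 256;

// How many 256-thread blocks of this kernel can be resident per SM?
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocksPerSm, someKernel, blockSize, /*dynamicSmemBytes=*/0);
```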
A Practical Mental Model
In practice, this is a useful way to think about it:
- code is written at thread granularity
- performance often breaks at warp granularity
- optimization is constrained at block granularity
- overall throughput depends on how well the grid fills the GPU
Once that clicks, even a simple vector add kernel starts to look like a real systems example instead of a toy.
The next post will focus on the GPU memory hierarchy: global memory, shared memory, registers, and bandwidth.