GPU Systems 02 - The Thread, Warp, and Block Execution Model
What threads, warps, blocks, and grids mean in actual GPU execution
The First Part of GPU Programming That Usually Feels Abstract
When people first meet CUDA, they quickly run into the terms thread, warp, block, and grid. At the syntax level, it is easy to memorize them as different execution units and move on. But if you want to understand kernel performance, these are not just labels. They describe how work is actually organized and executed.
A CPU usually relies on a small number of strong cores making fine-grained control decisions. A GPU instead throws a large number of threads at the problem and depends on structured grouping to get throughput. The most important group in that structure is the warp.
A Thread Is the Smallest Logical Unit
In CUDA code, one thread often handles one element or one small piece of work.
In a vector add kernel, for example, a thread might:
- read one element from each input vector
- add the two values
- write one output element
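The three steps above can be sketched as a minimal kernel (the name vecAdd and parameter names are illustrative):

```cuda
// Minimal vector add: each thread handles exactly one element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {             // guard: the grid may overshoot the problem size
        c[i] = a[i] + b[i];  // read one element from each input, write one output
    }
}
```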
At this stage, it is easy to focus only on the fact that there are many threads. But the GPU does not schedule each thread completely independently.
A Warp Is the More Real Execution Unit
On NVIDIA GPUs, a warp is a group of 32 threads. The hardware schedules and executes instructions at warp granularity: the threads in a warp issue the same instruction together, a model NVIDIA calls SIMT (single instruction, multiple threads).
This matters because branch behavior inside a warp directly affects performance. If half the threads in a warp take one branch and the other half take the other, the hardware executes both paths in sequence, masking off the threads that did not take each path. This is warp divergence.
So even though the code is written in per-thread terms, performance often depends on how threads behave together inside a warp.
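A minimal sketch of the difference, assuming the standard warp size of 32 (the kernel names are illustrative):

```cuda
// Divergent: even and odd lanes within the same warp take different
// paths, so the warp runs both paths serially with masking.
__global__ void divergentBranch(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] *= 2.0f;  // even lanes
    else            x[i] += 1.0f;  // odd lanes
}

// Uniform per warp: the condition is constant across each warp of 32
// consecutive threads, so every warp takes a single path.
__global__ void uniformBranch(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / warpSize) % 2 == 0) x[i] *= 2.0f;  // whole warp goes one way
    else                         x[i] += 1.0f;  // whole warp goes the other
}
```

Both kernels compute something similar per thread, but the second groups the branch decision at warp granularity, which is exactly the kind of restructuring divergence analysis leads to.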
A Block Is the Cooperation and Resource Unit
A block groups multiple warps together. It matters for two main reasons.
First, threads in the same block can cooperate through shared memory.
Second, a block is resident on a single streaming multiprocessor (SM), so block size directly interacts with per-SM resources such as shared memory and registers. That is why block size decisions affect occupancy and execution efficiency.
A very large block may look good on paper because it contains many threads, but it may also consume enough registers or shared memory to reduce how many blocks can run concurrently.
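The first point, cooperation through shared memory, can be sketched as a block-wide tree reduction (a minimal sketch; the 256-slot buffer assumes a 256-thread block with a power-of-two size):

```cuda
// Each block cooperatively sums its slice of the input into one partial sum.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float buf[256];            // one shared-memory slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                      // make all loads visible block-wide

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();                  // wait before the next halving step
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];         // one partial sum per block
}
```

Nothing in this pattern works across blocks: `__syncthreads()` and shared memory are block-scoped, which is precisely why the block is the cooperation unit.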
The Grid Represents the Full Problem Coverage
The grid is the full launch space of the kernel. In practical terms, it answers the question: how many blocks are being launched to cover the whole problem?
A useful mental model is:
- thread: the smallest logical worker
- warp: the important execution group
- block: the cooperation and resource unit
- grid: the full problem coverage
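All four levels show up in a single launch line. A typical host-side configuration looks like this (the kernel name and device pointers are illustrative placeholders):

```cuda
int n = 1 << 20;            // problem size: about one million elements
int threadsPerBlock = 256;  // 8 warps of 32 threads per block
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // grid covers n

// grid of blocks, blocks of warps, warps of threads
vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
```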
Once this structure becomes familiar, CUDA code starts to feel much less magical.
Why This Directly Affects Performance
This execution model is not just terminology. Most performance issues come back to it.
Questions like these all depend on it:
- Is warp divergence hurting efficiency?
- Is the block size too large, reducing occupancy?
- Is block-level cooperation using shared memory effectively?
- Is the grid large enough to keep the GPU busy?
That is why understanding threads, warps, and blocks is part of performance reasoning, not just code reading.
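The block-size question in particular can be asked of the runtime directly. CUDA's occupancy API reports how many blocks of a given size fit on one SM, given a kernel's register and shared-memory footprint (a sketch; someKernel is a placeholder):

```cuda
int numBlocksPerSm = 0;
int blockSize = 256;

// How many 256-thread blocks of this kernel can be resident per SM?
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocksPerSm, someKernel, blockSize, /*dynamicSmemBytes=*/0);
```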
A Practical Mental Model
In practice, this is a useful way to think about it:
- code is written at thread granularity
- performance often breaks at warp granularity
- optimization is constrained at block granularity
- overall throughput depends on how well the grid fills the GPU
Once that clicks, even a simple vector add kernel starts to look like a real systems example instead of a toy.
The next post will focus on the GPU memory hierarchy: global memory, shared memory, registers, and bandwidth.