Once You Start Writing Kernels, New Questions Show Up

After you get comfortable with the GPU architecture picture, the next step is to write CUDA kernels directly. At that point, syntax is not the hardest part. The more important questions are about indexing and launch configuration.

For example:

  • should the block size be 128, 256, or 512?
  • should the grid be 1D or 2D?
  • should one thread handle one element, or more than one?

These do not have one universal answer. They depend on problem shape and hardware constraints.

Basic Indexing Is Simple, but Not the Full Story

In many CUDA kernels, the global index is computed from blockIdx, blockDim, and threadIdx — in 1D, typically blockIdx.x * blockDim.x + threadIdx.x.

For a 1D problem such as vector add, the mental model is straightforward:

  • blocks divide the work into large chunks
  • threads divide each chunk into smaller pieces
  • a thread computes its global position from block and thread indices

This part is easy. The harder question is whether the launch configuration fits the problem and the hardware well.
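The mental model above translates into only a few lines of CUDA. A minimal sketch (kernel and variable names are illustrative):

```cuda
// 1D vector add: each thread computes one output element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global position
    if (i < n)                                      // guard: grid may overshoot n
        c[i] = a[i] + b[i];
}

// Launch: enough blocks to cover n elements (ceiling division).
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```

The bounds check matters because the grid is rounded up to whole blocks, so the last block usually contains threads with no element to process.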

Block Size Affects More Than Just Thread Count

Block size is not just a thread count choice. It influences:

  • how many warps are in a block
  • how shared memory is used
  • how much register pressure the kernel creates
  • how much occupancy is possible

Using a multiple of 32 is the natural starting point because of warp size. But larger is not always better. A large block may consume enough registers or shared memory that fewer blocks fit on each SM at once, reducing concurrency.

That is why practical tuning often starts with values like 128 or 256 and then compares them with profiling.
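One way to get a data-driven starting point before profiling is to ask the runtime which block size maximizes theoretical occupancy for a given kernel. The runtime accounts for that kernel's register and shared memory usage. A sketch, assuming a kernel named myKernel:

```cuda
int minGridSize = 0;  // minimum grid size to reach full device utilization
int blockSize   = 0;  // suggested block size

// cudaOccupancyMaxPotentialBlockSize is part of the CUDA runtime API.
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                   myKernel,  // hypothetical kernel
                                   0,         // dynamic shared memory per block
                                   0);        // no upper limit on block size
```

The result is a starting point, not a verdict: occupancy is a proxy, so candidates like 128 and 256 are still worth comparing under a profiler.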

Grid Size Matters for Utilization

If block size shapes local execution, grid size determines whether the GPU is kept busy across the full problem.

A tiny problem may not provide enough parallel work to keep the GPU occupied. A large enough problem can fill the device, but then other factors such as memory pressure or occupancy become more important.

So grid size is not just about covering all elements with a ceiling division. It is also about exposing enough parallel work.

Should One Thread Handle One Element?

In introductory kernels, one thread often handles one element. That is fine as a starting point. But in real optimization work, one thread may handle multiple elements.

That can help with:

  • memory access structure
  • register reuse
  • loop unrolling opportunities

Of course, it also makes kernels more complex and may increase register pressure. So once again, the real issue is balance.
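One common shape for "multiple elements per thread" is the grid-stride loop: each thread walks through the array in steps of the total grid width, so the grid size can be chosen for utilization rather than tied to n. A sketch:

```cuda
// Grid-stride version of vector add: each thread may process several
// elements, striding by the total number of threads in the grid.
__global__ void vecAddStrided(const float* a, const float* b, float* c, int n) {
    int stride = gridDim.x * blockDim.x;  // total threads launched
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}
```

A side benefit is robustness: the same kernel handles any n correctly, even when the grid is sized for the device rather than the problem.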

Questions Worth Asking During Launch Design

A useful checklist is:

  • is the block size aligned with warp structure?
  • is shared memory use reasonable?
  • is register pressure acceptable?
  • does the grid structure fit the problem shape?
  • is each thread doing too little or too much?

Once you think this way, launch configuration stops looking like boilerplate and starts looking like part of kernel design.

Matrix Multiply as an Example

Matrix multiply often uses 2D indexing because that maps naturally onto output coordinates.

A naive version may assign one output element to one thread. A better version often assigns one output tile to a block, with threads cooperating inside the tile. At that point, launch configuration and tiling strategy become closely connected.
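A sketch of that tiled structure, assuming square N-by-N matrices with N divisible by the tile width:

```cuda
#define TILE 16  // tile width; each block is TILE x TILE threads

// Each block computes one TILE x TILE output tile of C = A * B.
// Threads cooperatively stage input tiles through shared memory.
__global__ void matmulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // output coordinates
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {        // assumes N % TILE == 0
        // Each thread loads one element of each input tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                        // wait until tiles are loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                        // done reading this tile
    }
    C[row * N + col] = acc;
}
```

Here the launch configuration is dictated by the tiling: the block must be TILE x TILE so that the block shape, the shared memory tiles, and the output tile all line up.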

The Main Point

Writing CUDA kernels is largely about mapping:

  • mapping the problem onto threads
  • mapping threads into warps and blocks
  • mapping that structure onto memory and occupancy constraints

Once that clicks, launch parameters stop feeling like arbitrary numbers and start feeling like design choices.

The next post will focus on coalescing, shared memory, and reduction patterns as the first serious optimization layer.