Access Width Matters Too

After learning coalescing, the next useful step is to think about vectorized loads and stores. Instead of moving one scalar value at a time, a thread may load or store a wider chunk, such as four consecutive floats packed into a single float4.

This can reduce instruction count and better match memory transaction structure.

When It Helps

Vectorized access is most useful when:

  • data is laid out contiguously
  • alignment is appropriate
  • each thread naturally handles multiple adjacent values

In that setting, wider loads and stores can improve effective bandwidth use.
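As a minimal sketch of what that looks like in CUDA, here is a copy kernel where each thread moves one float4 instead of four separate floats. It assumes both pointers are 16-byte aligned and that n is a multiple of 4; the kernel and variable names are illustrative, not from any particular library.

```cuda
// Vectorized copy sketch: one float4 per thread.
// Assumes `in` and `out` are 16-byte aligned and `n` is a multiple of 4.
__global__ void copy_vec4(const float* __restrict__ in,
                          float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one float4 group per thread
    if (i < n / 4) {
        // One 16-byte load and one 16-byte store,
        // instead of four scalar loads and four scalar stores.
        float4 v = reinterpret_cast<const float4*>(in)[i];
        reinterpret_cast<float4*>(out)[i] = v;
    }
}
```

Each warp still accesses a contiguous region, so coalescing is preserved; the vectorization only changes how wide each lane's individual access is.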

Why Alignment Matters

Vectorization depends on memory alignment more than people often expect. A vector access must be aligned to its own width: a float4 load, for example, must be 16-byte aligned. If the address pattern does not satisfy that, the compiler cannot emit the wide access, and forcing it through pointer casts is undefined behavior that can fault at runtime.

That is why vectorization is not just a kernel-level decision. It often depends on upstream layout and padding too.

For example, a float4-style path requires 16-byte alignment. But a tensor that comes from slicing, transposing, or offset indexing may not satisfy that even when its logical shape looks fine. In that case, the vectorized path becomes unsafe to use directly, or it requires extra layout handling first.

So vectorization is tightly connected to layout contracts, not just to local kernel code.
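A common way to encode that contract is a cheap alignment guard that decides between the vectorized and scalar paths. The sketch below assumes the hypothetical helper name aligned16; the point is that the check must apply to the effective base address after any offset, not to the parent buffer.

```cuda
#include <cstdint>

// Guard before choosing the vectorized path: is the effective base
// address 16-byte aligned? (`aligned16` is an illustrative name.)
__host__ __device__ inline bool aligned16(const void* ptr) {
    return (reinterpret_cast<uintptr_t>(ptr) & 0xF) == 0;
}

// Note: a sliced or offset view can break alignment even when the
// parent allocation was aligned. For a valid float4 view, the value
// base + offset * sizeof(float) must itself be a multiple of 16.
```

In practice the guard runs once on the host, and the launcher dispatches either the float4 kernel or a scalar fallback.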

Why It Can Reduce Instruction Count Too

The benefit is not only bandwidth utilization. Vectorized access can also reduce the number of load or store instructions a thread issues.

Of course, this is still conditional. Tail handling and unpacking logic can reduce that benefit. So it is better to think of vectorization as a way to clean up the access structure when the layout supports it, not as an automatic win.

Tail Handling Is Part of the Real Design

If the vector width is 4, the input size will not always be divisible by 4. Then the tail path becomes part of the kernel design.

Common approaches include:

  • keeping a fast vectorized main path and a scalar tail path
  • padding inputs so tails are less awkward
  • accepting small tail overhead while optimizing the main case heavily
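The first approach above, a vectorized main path plus a scalar tail, can be sketched as follows. This assumes 16-byte-aligned pointers; the kernel name and the choice of an elementwise scale are illustrative.

```cuda
// "Vectorized main path + scalar tail" sketch for an elementwise scale.
// Assumes `in` and `out` are 16-byte aligned; `n` may be any size.
__global__ void scale_vec4_tail(const float* __restrict__ in,
                                float* __restrict__ out,
                                float alpha, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int n4  = n / 4;  // number of full float4 groups

    // Fast path: whole float4 groups.
    if (tid < n4) {
        float4 v = reinterpret_cast<const float4*>(in)[tid];
        v.x *= alpha; v.y *= alpha; v.z *= alpha; v.w *= alpha;
        reinterpret_cast<float4*>(out)[tid] = v;
    }

    // Scalar tail: the last n % 4 elements, handled by the first
    // few threads in addition to their vectorized work.
    int j = n4 * 4 + tid;
    if (tid < n % 4) {
        out[j] = alpha * in[j];
    }
}
```

The tail branch costs a little extra control flow, but it keeps the hot loop free of per-element bounds checks, which is usually the right trade when the main case dominates.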

When to Be Careful

Potential downsides include:

  • awkward tail handling
  • increased branch complexity
  • alignment mismatches
  • higher register use

So vectorization is useful, but not automatic.

It also has to be read together with the rest of the memory path. A vectorized global load may look attractive, but shared memory layout, unpack cost, or downstream register pressure can still change the final outcome.

Where It Shows Up Often

Vectorized access is especially common in:

  • normalization kernels
  • fused elementwise operators
  • packing and unpacking stages around matmul-like workloads
  • Triton kernels that load contiguous blocks

These are often the places where memory movement dominates enough for vectorization to matter materially.

Summary

Coalescing organizes access across warp lanes. Vectorization adds another layer by widening each individual thread's accesses when the layout allows it.

The real questions are:

  • is alignment correct?
  • is the tail path reasonable?
  • does the layout support it naturally?
  • does the rest of the kernel still benefit overall?

The next post will look at register pressure and spilling, which often become the next limiting factor once kernels get more ambitious.