GPU Systems 10 - Tiled Matrix Multiplication and Shared Memory
Why tiling and shared memory make such a large performance difference
Why Tiled Matmul Matters So Much
Once you understand the naive matrix multiply, the natural next step is the tiled version. This is one of the most important GPU examples because it exposes shared memory, block cooperation, reuse, and reduced memory traffic in one place.
The Core Tiling Idea
Instead of processing the full matrices at once, the work is divided into smaller tiles. A block is responsible for one output tile, and the input data needed for that tile is loaded in chunks into shared memory.
The key is that many threads in the block can then reuse the same loaded values.
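The mapping from blocks to tiles is just index arithmetic. A minimal sketch, assuming a square tile width `TILE` and the usual two-dimensional block and thread indexing (the names are illustrative):

```cuda
#define TILE 16  // tile width; a tunable choice, not a fixed constant

// Each block owns one TILE x TILE output tile; each thread in the block
// owns one element of that tile.
__device__ void tileCoords(int *row, int *col) {
    *row = blockIdx.y * TILE + threadIdx.y;  // global row of this thread's output
    *col = blockIdx.x * TILE + threadIdx.x;  // global column of this thread's output
}
```

Every thread in a block shares the same `blockIdx`, so all of them work on the same output tile and can cooperate on its inputs.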
Why Shared Memory Helps Here
To compute an output tile, threads need overlapping portions of the input matrices. In the naive version, each thread independently goes back to global memory. In the tiled version:
- threads cooperatively load an input tile
- the tile is stored in shared memory
- multiple threads reuse it before moving on
This can reduce global memory traffic dramatically.
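The pattern above can be sketched as a kernel. This is a minimal version, assuming square N x N matrices, row-major layout, and N divisible by `TILE`; matrix names and the tile width are illustrative, not a definitive implementation:

```cuda
#define TILE 16  // tile width, a common but tunable choice

// Tiled matmul sketch: C = A * B for square N x N row-major matrices,
// assuming N is a multiple of TILE for brevity.
__global__ void tiledMatmul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];  // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];  // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;  // output row this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;  // output column this thread owns
    float acc = 0.0f;

    // Walk across the inner dimension one tile at a time.
    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread brings in one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // wait until the whole tile is loaded

        // Each element loaded above is reused TILE times from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tile is overwritten
    }
    C[row * N + col] = acc;
}
```

Note that each thread issues only two global loads per tile step, yet each loaded value feeds `TILE` different multiply-adds.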
Tile Size Is a Design Choice, Not a Constant
Tile size is central to the design. A very small tile may not create enough reuse. A large tile may consume too much shared memory or too many registers, reducing occupancy.
So the real decision is always a tradeoff:
- larger tiles may improve reuse
- but they increase on-chip resource pressure
That is why tile shape must be chosen together with the SM's resource limits: shared memory capacity, registers per thread, and the occupancy they allow.
Why Synchronization Appears
Threads share the same tile in shared memory, so they must not begin using it before everyone has finished loading it. That is why synchronization barriers appear in tiled kernels.
The barrier is necessary, but it is also a cost. So tiled matmul is not free speed. It is a structured tradeoff: more coordination in exchange for less global memory traffic.
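The barrier placement follows a fixed rhythm, shown here in isolation (a sketch; `As`, `Bs`, and `numTiles` stand for the shared tiles and the loop bound):

```cuda
// Two barriers per tile step, each protecting a different hazard.
for (int t = 0; t < numTiles; ++t) {
    // ... cooperative load of As and Bs from global memory ...
    __syncthreads();  // no thread reads the tile before every thread wrote its part
    // ... partial dot products using As and Bs ...
    __syncthreads();  // no thread overwrites the tile while others still read it
}
```

The first barrier guards a read-after-write hazard, the second a write-after-read hazard; dropping either produces silently wrong results.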
Where the Speedup Comes From
The performance gain comes not from changing the math, but from changing the dataflow:
- fewer global memory loads
- more reuse through shared memory
- cleaner coalesced load opportunities
- better ability to keep the compute units busy
That is why tiled matmul is such an instructive example.
Practical Design Questions
When tuning tiled matrix multiplication, the important questions include:
- how large should the tile be?
- how much shared memory does it consume?
- how many output values should each thread own?
- how much register use is acceptable?
At this point, matrix multiplication starts to feel like a resource design problem rather than just an algorithm implementation.
Summary
Tiled matrix multiplication is important because it shows:
- why shared memory matters
- why block-level cooperation matters
- how memory traffic can be reduced structurally
- how resource tradeoffs shape real kernel design
The next post will isolate one shared-memory-specific issue: bank conflicts.