GPU Systems 10 - Tiled Matrix Multiplication and Shared Memory
Why tiling and shared memory make such a large performance difference
Why Tiled Matmul Matters So Much
Once you understand the naive matrix multiply, the natural next step is the tiled version. This is one of the most important GPU examples because it exposes shared memory, block cooperation, reuse, and reduced memory traffic in one place.
The Core Tiling Idea
Instead of processing the full matrices at once, the work is divided into smaller tiles. A block is responsible for one output tile, and the input data needed for that tile is loaded in chunks into shared memory.
The key is that many threads in the block can then reuse the same loaded values.
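The mapping from blocks to tiles is just index arithmetic. A minimal sketch, assuming a square tile width `TILE` and the usual two-dimensional block and thread indexing (the names are illustrative):

```cuda
#define TILE 16  // tile width; a tunable choice, not a fixed constant

// Each block owns one TILE x TILE output tile; each thread in the block
// owns one element of that tile.
__device__ void tileCoords(int *row, int *col) {
    *row = blockIdx.y * TILE + threadIdx.y;  // global row of this thread's output
    *col = blockIdx.x * TILE + threadIdx.x;  // global column of this thread's output
}
```

Every thread in a block shares the same `blockIdx`, so all of them work on the same output tile and can cooperate on its inputs.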
Why Shared Memory Helps Here
To compute an output tile, threads need overlapping portions of the input matrices. In the naive version, each thread independently goes back to global memory. In the tiled version:
- threads cooperatively load an input tile
- the tile is stored in shared memory
- multiple threads reuse it before moving on
This can reduce global memory traffic dramatically.
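The pattern above can be sketched as a kernel. This is a minimal version, assuming square N x N matrices, row-major layout, and N divisible by `TILE`; matrix names and the tile width are illustrative, not a definitive implementation:

```cuda
#define TILE 16  // tile width, a common but tunable choice

// Tiled matmul sketch: C = A * B for square N x N row-major matrices,
// assuming N is a multiple of TILE for brevity.
__global__ void tiledMatmul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];  // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];  // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;  // output row this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;  // output column this thread owns
    float acc = 0.0f;

    // Walk across the inner dimension one tile at a time.
    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread brings in one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // wait until the whole tile is loaded

        // Each element loaded above is reused TILE times from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tile is overwritten
    }
    C[row * N + col] = acc;
}
```

Note that each thread issues only two global loads per tile step, yet each loaded value feeds `TILE` different multiply-adds.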
Tile Size Is a Design Choice, Not a Constant
Tile size is central to the design. A very small tile may not create enough reuse. A large tile may consume too much shared memory or too many registers, reducing occupancy.
So the real decision is always a tradeoff:
- larger tiles may improve reuse
- but they increase on-chip resource pressure
That is why tile shape must be chosen together with the SM's resource limits: shared memory capacity, registers per thread, and the occupancy they allow.
Why Synchronization Appears
Threads share the same tile in shared memory, so they must not begin using it before everyone has finished loading it. That is why synchronization barriers appear in tiled kernels.
The barrier is necessary, but it is also a cost. So tiled matmul is not free speed. It is a structured tradeoff: more coordination in exchange for less global memory traffic.
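The barrier placement follows a fixed rhythm, shown here in isolation (a sketch; `As`, `Bs`, and `numTiles` stand for the shared tiles and the loop bound):

```cuda
// Two barriers per tile step, each protecting a different hazard.
for (int t = 0; t < numTiles; ++t) {
    // ... cooperative load of As and Bs from global memory ...
    __syncthreads();  // no thread reads the tile before every thread wrote its part
    // ... partial dot products using As and Bs ...
    __syncthreads();  // no thread overwrites the tile while others still read it
}
```

The first barrier guards a read-after-write hazard, the second a write-after-read hazard; dropping either produces silently wrong results.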
Where the Speedup Comes From
The performance gain comes not from changing the math, but from changing the dataflow:
- fewer global memory loads
- more reuse through shared memory
- cleaner coalesced load opportunities
- better ability to keep the compute units busy
That is why tiled matmul is such an instructive example.
Practical Design Questions
When tuning tiled matrix multiplication, the important questions include:
- how large should the tile be?
- how much shared memory does it consume?
- how many output values should each thread own?
- how much register use is acceptable?
At this point, matrix multiplication starts to feel like a resource design problem rather than just an algorithm implementation.
Summary
Tiled matrix multiplication is important because it shows:
- why shared memory matters
- why block-level cooperation matters
- how memory traffic can be reduced structurally
- how resource tradeoffs shape real kernel design
The next post will isolate one shared-memory-specific issue: bank conflicts.