GPU Systems 09 - Why Naive Matrix Multiplication Is Slow
Using naive matrix multiplication to see memory reuse and traffic problems clearly
Why Matrix Multiplication Keeps Coming Back
Matrix multiplication appears constantly in GPU education for a reason. It sits at the center of deep learning workloads, and many of the most important GPU optimization ideas become visible inside it.
The naive version is especially useful because its implementation is simple, yet its performance problems expose the underlying memory behavior very clearly.
What the Naive Kernel Does
In the most straightforward form, one thread computes one output element. A thread assigned to (row, col) walks over the inner dimension and accumulates the dot product.
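A minimal sketch of that mapping might look like the following (the matrix names A, B, C, the row-major layout, and the float element type are assumptions, not fixed by the text):

```cuda
// Naive kernel: one thread computes one element of C = A * B.
// A is M x K, B is K x N, C is M x N, all row-major.
__global__ void naiveMatmul(const float *A, const float *B, float *C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            // Each thread issues its own global loads for A and B,
            // with no cooperation with the other threads in its block.
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}
```

The math is correct and the work mapping is obvious, which is exactly why this version is the standard starting point.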
At first glance, that seems reasonable:
- the work mapping is clear
- the math is correct
- the parallel structure is simple
But the resulting performance is usually much worse than expected.
The Core Problem: Weak Data Reuse
The biggest issue is that many threads repeatedly load related input values from global memory without cooperating.
Threads inside the same block may need overlapping parts of the input matrices, but in the naive kernel each thread simply loads what it needs independently. Reusable data ends up being fetched again and again.
That is why memory traffic becomes much larger than it needs to be.
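Counting the loads makes this concrete. For square N x N matrices (a simplifying assumption), the per-thread tally works out to:

```latex
% Each of the N^2 threads loads 2N values (one row of A, one column of B):
N^2 \times 2N = 2N^3 \ \text{scalar loads}
% But the unique input data is only:
2N^2 \ \text{elements}
% So, absent any reuse, each input element is requested about
\frac{2N^3}{2N^2} = N \ \text{times.}
```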
What It Looks Like From the Memory Side
Matrix multiplication has high arithmetic intensity in principle. But the naive implementation fails to exploit that well because input reuse is not organized explicitly.
So the operator has the potential to be compute-bound, but the kernel structure drags it back toward memory-bound behavior.
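The gap between potential and achieved intensity can be sketched as follows (again assuming square N x N matrices):

```latex
% Total work and minimum data movement for C = A B:
\text{FLOPs} = 2N^3, \qquad \text{data} = 3N^2 \ \text{elements}
% Potential arithmetic intensity therefore grows with N:
\frac{2N^3}{3N^2} = \frac{2N}{3} \ \text{FLOPs per element}
% The naive kernel instead performs one multiply-add (2 FLOPs)
% per two global loads: an effective intensity of about 1 FLOP per load,
% independent of N.
```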
Mapping Also Matters
Even in a naive kernel, thread mapping affects access quality. Depending on how (row, col) coordinates are assigned, one direction may produce cleaner memory access than another.
That is why "one thread per output element" is not the whole story. You still need to ask how warps read the underlying data.
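The two natural (row, col) assignments behave quite differently at the warp level. A sketch, assuming row-major matrices as before:

```cuda
// Two ways to assign (row, col) inside the same naive kernel.

// Mapping 1: threadIdx.x -> col. Within a warp, col is consecutive,
// so B[k*N + col] and C[row*N + col] touch consecutive addresses
// (coalesced), while A[row*K + k] is one address broadcast to all lanes.
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

// Mapping 2: threadIdx.x -> row. Within a warp, row is consecutive,
// so A[row*K + k] strides by K floats between lanes and
// C[row*N + col] strides by N floats: both uncoalesced.
// int row = blockIdx.x * blockDim.x + threadIdx.x;
// int col = blockIdx.y * blockDim.y + threadIdx.y;
```

Same arithmetic, same thread count, very different memory transactions per warp.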
Why Cache Is Not the Full Answer
Cache can help to some extent, but expecting it to rescue the naive kernel on its own is not enough:
- reuse is not explicitly managed
- block-level collaboration is weak
- at large scale, cache alone cannot rescue the traffic pattern
That is why tiled matrix multiplication becomes the next natural step.
What Profiling Usually Reveals
Profiling a naive matrix multiply often leads to observations like:
- there is a lot of arithmetic, but also too much repeated traffic
- memory reuse is weak
- global memory accesses dominate more than expected
- shared memory is not being used to control the dataflow
This is exactly the motivation for the tiled version.
The Important Lesson
Many people initially think a faster matrix multiply must come from a different mathematical formula. In practice, a lot of GPU optimization is not about changing the math. It is about reorganizing the same math into a better dataflow.
That is the right way to read naive matmul.
Summary
Naive matrix multiplication is slow because:
- input reuse is weak
- global memory loads are repeated too often
- block-level collaboration is missing
- the arithmetic potential is wasted by the memory structure
The next post will show how tiled matrix multiplication changes that picture with shared memory and block-level reuse.