GPU Systems 09 - Why Naive Matrix Multiplication Is Slow
Using naive matrix multiplication to see memory reuse and traffic problems clearly
Why Matrix Multiplication Keeps Coming Back
Matrix multiplication appears constantly in GPU education for a reason. It sits at the center of deep learning workloads, and many of the most important GPU optimization ideas become visible inside it.
The naive version is especially useful because its implementation is simple, yet its performance problems expose the underlying memory behavior very clearly.
What the Naive Kernel Does
In the most straightforward form, one thread computes one output element. A thread assigned to (row, col) walks over the inner dimension and accumulates the dot product.
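A minimal sketch of that mapping might look like the following (the matrix names A, B, C, the row-major layout, and the float element type are assumptions, not fixed by the text):

```cuda
// Naive kernel: one thread computes one element of C = A * B.
// A is M x K, B is K x N, C is M x N, all row-major.
__global__ void naiveMatmul(const float *A, const float *B, float *C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            // Each thread issues its own global loads for A and B,
            // with no cooperation with the other threads in its block.
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}
```

The math is correct and the work mapping is obvious, which is exactly why this version is the standard starting point.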
At first glance, that seems reasonable:
- the work mapping is clear
- the math is correct
- the parallel structure is simple
But the resulting performance is usually much worse than expected.
The Core Problem: Weak Data Reuse
The biggest issue is that many threads repeatedly load related input values from global memory without cooperating.
Threads inside the same block may need overlapping parts of the input matrices, but in the naive kernel each thread simply loads what it needs independently. Reusable data ends up being fetched again and again.
That is why memory traffic becomes much larger than it needs to be.
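Counting the loads makes this concrete. For square N x N matrices (a simplifying assumption), the per-thread tally works out to:

```latex
% Each of the N^2 threads loads 2N values (one row of A, one column of B):
N^2 \times 2N = 2N^3 \ \text{scalar loads}
% But the unique input data is only:
2N^2 \ \text{elements}
% So, absent any reuse, each input element is requested about
\frac{2N^3}{2N^2} = N \ \text{times.}
```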
What It Looks Like From the Memory Side
Matrix multiplication has high arithmetic intensity in principle. But the naive implementation fails to exploit that well because input reuse is not organized explicitly.
So the operator has the potential to be compute-bound, but the kernel structure drags it back toward memory-bound behavior.
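The gap between potential and achieved intensity can be sketched as follows (again assuming square N x N matrices):

```latex
% Total work and minimum data movement for C = A B:
\text{FLOPs} = 2N^3, \qquad \text{data} = 3N^2 \ \text{elements}
% Potential arithmetic intensity therefore grows with N:
\frac{2N^3}{3N^2} = \frac{2N}{3} \ \text{FLOPs per element}
% The naive kernel instead performs one multiply-add (2 FLOPs)
% per two global loads: an effective intensity of about 1 FLOP per load,
% independent of N.
```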
Mapping Also Matters
Even in a naive kernel, thread mapping affects access quality. Depending on how (row, col) coordinates are assigned, one direction may produce cleaner memory access than another.
That is why "one thread per output element" is not the whole story. You still need to ask how warps read the underlying data.
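The two natural (row, col) assignments behave quite differently at the warp level. A sketch, assuming row-major matrices as before:

```cuda
// Two ways to assign (row, col) inside the same naive kernel.

// Mapping 1: threadIdx.x -> col. Within a warp, col is consecutive,
// so B[k*N + col] and C[row*N + col] touch consecutive addresses
// (coalesced), while A[row*K + k] is one address broadcast to all lanes.
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

// Mapping 2: threadIdx.x -> row. Within a warp, row is consecutive,
// so A[row*K + k] strides by K floats between lanes and
// C[row*N + col] strides by N floats: both uncoalesced.
// int row = blockIdx.x * blockDim.x + threadIdx.x;
// int col = blockIdx.y * blockDim.y + threadIdx.y;
```

Same arithmetic, same thread count, very different memory transactions per warp.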
Why Cache Is Not the Full Answer
Cache can help to some extent, but expecting it to rescue the naive kernel on its own is not enough:
- reuse is not explicitly managed
- block-level collaboration is weak
- at large scale, cache alone cannot rescue the traffic pattern
That is why tiled matrix multiplication becomes the next natural step.
What Profiling Usually Reveals
Profiling a naive matrix multiply often leads to observations like:
- there is a lot of arithmetic, but also too much repeated traffic
- memory reuse is weak
- global memory accesses dominate more than expected
- shared memory is not being used to control the dataflow
This is exactly the motivation for the tiled version.
The Important Lesson
Many people initially think a faster matrix multiply must come from a different mathematical formula. In practice, a lot of GPU optimization is not about changing the math. It is about reorganizing the same math into a better dataflow.
That is the right way to read naive matmul.
Summary
Naive matrix multiplication is slow because:
- input reuse is weak
- global memory loads are repeated too often
- block-level collaboration is missing
- the arithmetic potential is wasted by the memory structure
The next post will show how tiled matrix multiplication changes that picture with shared memory and block-level reuse.