In Many Kernels, Memory Is the Real Problem

When people first study GPUs, they often focus on core counts or FLOPS. But in real kernel optimization, memory movement is often more limiting than arithmetic. In many ML workloads, performance is determined less by how hard the math is and more by how often data has to move.

That is why it makes sense to study the memory hierarchy early.

Global Memory Is Large but Expensive

Global memory is visible to every thread in the grid and large enough for real workloads, but it is expensive to reach: latency is high and bandwidth is limited relative to the speed of on-chip computation.

If a kernel repeatedly loads from global memory, performs a small amount of work, and writes back out, it can become memory-bound very quickly. This is common in deep learning operators where the arithmetic is not especially complex but the memory traffic is heavy.
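A back-of-envelope comparison makes this concrete: estimate how long the memory system needs versus how long the math units need. The peak numbers below are assumed, round, roughly A100-class figures used only for illustration:

```python
# Is an elementwise kernel memory-bound? Compare memory time vs compute time.
# Assumed, illustrative peak numbers (roughly A100-class):
PEAK_FLOPS = 19.5e12   # FP32 FLOP/s
PEAK_BW = 1.5e12       # global-memory bytes/s

def elementwise_times(n, flops_per_elem=1, bytes_per_elem=12):
    """y = x + z style op: read two floats, write one (12 bytes), 1 FLOP each."""
    t_compute = n * flops_per_elem / PEAK_FLOPS
    t_memory = n * bytes_per_elem / PEAK_BW
    return t_compute, t_memory

t_c, t_m = elementwise_times(1 << 28)  # ~268M elements
print(f"compute: {t_c*1e3:.3f} ms, memory: {t_m*1e3:.3f} ms")
# memory time dwarfs compute time, so the kernel is bandwidth-bound
```

With these numbers the memory side takes over a hundred times longer than the arithmetic, which is exactly the "small amount of work per load" pattern described above.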

Shared Memory Is Fast Block-Local Workspace

Shared memory is a small, fast memory space shared by threads in the same block. It is useful when multiple threads in the block will reuse the same data.

Tiled matrix multiplication is the classic example. A block loads a tile of inputs into shared memory, and the threads reuse it rather than repeatedly going back to global memory.

But shared memory is not free.

  • capacity is small
  • bank conflicts can hurt performance
  • heavy shared memory use can reduce occupancy

So the point is not just to use shared memory, but to use it where reuse actually pays off.
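One way to reason about "where reuse pays off" is to note that in a T×T matmul tile, each element staged into shared memory is read by T threads, but larger tiles also consume more shared memory per block. The capacity figure below is an assumed per-SM value, and `tile_tradeoff` is a hypothetical helper, not a real CUDA API:

```python
# Hypothetical sizing helper: trade reuse against shared-memory capacity.
SHARED_BYTES_PER_SM = 96 * 1024  # assumed shared-memory capacity per SM

def tile_tradeoff(T, dtype_bytes=4):
    reuse = T                                  # each staged element is read by T threads
    bytes_per_block = 2 * T * T * dtype_bytes  # one tile of A plus one tile of B
    blocks_per_sm = SHARED_BYTES_PER_SM // bytes_per_block
    return reuse, bytes_per_block, blocks_per_sm

for T in (8, 16, 32, 64):
    reuse, b, blocks = tile_tradeoff(T)
    print(f"T={T:>2}: reuse x{reuse}, {b} B/block, {blocks} blocks fit per SM")
```

Doubling the tile size doubles the reuse but quadruples the shared-memory footprint, which is the occupancy tension the bullet list points at.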

Registers Are Closest, but Limited

Registers are the closest storage available to a thread and are typically the fastest. But they are also limited.

If a kernel uses too many registers, register pressure goes up and occupancy may fall because fewer warps can stay resident on an SM. That is why an optimization that looks locally good can still hurt overall throughput.
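The register/occupancy trade-off can be sketched with the same kind of arithmetic. The limits below are assumed, illustrative values; real per-architecture limits come from the CUDA occupancy calculator or Nsight Compute:

```python
# Sketch: how per-thread register use caps resident warps on an SM.
# Limits are assumed and illustrative; real values vary by architecture.
REGFILE_PER_SM = 65536   # 32-bit registers per SM
MAX_WARPS_PER_SM = 64
WARP_SIZE = 32

def warps_limited_by_registers(regs_per_thread):
    threads = REGFILE_PER_SM // regs_per_thread
    return min(threads // WARP_SIZE, MAX_WARPS_PER_SM)

for r in (32, 64, 128, 255):
    w = warps_limited_by_registers(r)
    print(f"{r:>3} regs/thread -> {w} resident warps "
          f"({100 * w // MAX_WARPS_PER_SM}% occupancy)")
```

Going from 64 to 128 registers per thread halves the resident warps in this model, which is how a locally reasonable optimization can reduce the SM's ability to hide latency.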

What Bandwidth Really Means Here

Bandwidth is the rate at which data can move between memory and the compute units, usually quoted in bytes per second. In GPU work, it often behaves like a hard ceiling on throughput.

Softmax is a good example. Its arithmetic is not heavy, but it is often bandwidth-bound because it touches the same data repeatedly: a read for the max reduction, reads for the exponentials and their sum, and a final read-and-write to normalize.

In situations like that, reducing memory traffic matters more than reducing arithmetic instructions.
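A rough byte count shows why fusing passes matters more than trimming instructions here. This compares a textbook three-pass softmax (max, exp-sum, normalize) with a fused single-pass version; the model is a hedged back-of-envelope that ignores caches, not a measurement:

```python
def softmax_traffic(n, dtype_bytes=4):
    """Global-memory bytes for a length-n softmax (cache-free model)."""
    # three passes: read for max, read for exp-sum, read + write to normalize
    three_pass = (3 * n + n) * dtype_bytes
    # fused kernel: one read, one write; intermediates stay on chip
    fused = (n + n) * dtype_bytes
    return three_pass, fused

three, fused = softmax_traffic(10**6)
print(f"three-pass: {three/1e6:.0f} MB, fused: {fused/1e6:.0f} MB")
```

The exponentials are identical in both versions; only the traffic changes, and it changes by 2x.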

Questions Worth Asking Repeatedly

A useful habit is to keep asking:

  • how many times is this data being read from global memory?
  • is there enough reuse to justify shared memory?
  • is register use becoming excessive?
  • is this kernel compute-bound or bandwidth-bound?

Once those questions become automatic, kernels start to look much clearer.
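The last question on the list can even be made mechanical with a roofline-style check: compare a kernel's arithmetic intensity (FLOPs per byte of global traffic) against the machine balance of the GPU. The peak numbers below are assumed, illustrative values:

```python
# Roofline-style classifier: compute-bound vs bandwidth-bound.
PEAK_FLOPS = 19.5e12   # assumed FP32 peak, FLOP/s
PEAK_BW = 1.5e12       # assumed global bandwidth, bytes/s
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BW  # ~13 FLOPs per byte

def bound_by(flops, bytes_moved):
    intensity = flops / bytes_moved
    return "compute-bound" if intensity > MACHINE_BALANCE else "bandwidth-bound"

# elementwise add: 1 FLOP per 12 bytes, far below machine balance
print(bound_by(1, 12))                      # bandwidth-bound
# large matmul: O(N^3) FLOPs over O(N^2) bytes, well above it
n = 4096
print(bound_by(2 * n**3, 3 * n * n * 4))    # compute-bound
```

The same check explains the earlier examples: elementwise and softmax-like operators sit far below the balance point, large matmuls sit far above it.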

A Simple Example: Matrix Multiplication

In naive matrix multiplication, each thread computes one output element by reading a full row of one input and a full column of the other from global memory. Neighboring threads need overlapping rows and columns, so the same values are loaded many times.

In a tiled version, the block cooperatively loads tiles into shared memory and reuses them. The arithmetic may look similar on paper, but the memory traffic changes dramatically, and so does performance.
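The traffic difference can be counted directly. For an N×N product, the naive kernel re-reads 2N values per output element, while a T×T tiled kernel loads each staged element once per block phase. This is a cache-free counting model, not a profiler measurement:

```python
def matmul_global_loads(N, T):
    """Global-memory loads for NxN matmul: naive vs TxT tiling (cache-free model)."""
    naive = N * N * 2 * N         # each of the N*N outputs re-reads a row and a column
    tiled = 2 * N * N * (N // T)  # each tile element is loaded once per block phase
    return naive, tiled

naive, tiled = matmul_global_loads(N=1024, T=32)
print(f"naive: {naive:.3e} loads, tiled: {tiled:.3e} loads, "
      f"ratio {naive // tiled}x")
```

In this model the reduction factor is exactly the tile width T, which is why the tiled version's performance changes "dramatically" even though the FLOP count is identical.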

The Main Point

GPU optimization is often less about finding more arithmetic and more about moving less data. Without that intuition, CUDA and Triton optimizations can look like random tricks instead of structured decisions.

The next post will look at writing CUDA kernels directly, especially indexing, launch configuration, and block sizing decisions.