GPU Systems 11 - Shared Memory Bank Conflicts
Why shared memory is not automatically fast and how bank conflicts appear
Shared Memory Is Fast, but Not Automatically Optimal
Shared memory is often introduced as a much faster alternative to global memory, and in general it is. But that does not mean every shared memory access pattern is efficient. One of the main reasons performance can still suffer is bank conflicts.
Why Bank Structure Matters
Shared memory is divided internally into banks; on current NVIDIA GPUs there are 32 banks, and successive 32-bit words map to successive banks. In the ideal case, the threads of a warp access different banks and all accesses proceed in parallel. If multiple threads of a warp touch different words in the same bank, those accesses are serialized.
That is the core idea behind bank conflict.
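A minimal sketch of the worst case, assuming the usual 32-bank, 4-byte-word mapping (the kernel name and launch shape are hypothetical, chosen only to illustrate the pattern):

```cuda
// Hypothetical illustration: a stride-32 access where all 32 lanes of a
// warp land in bank 0, so one load is serialized into 32 transactions.
__global__ void stride_conflict(const float *in, float *out) {
    __shared__ float buf[32 * 32];

    int lane = threadIdx.x;       // assumes blockDim.x == 32 (one warp)
    buf[lane * 32] = in[lane];    // word index lane*32 -> bank (lane*32) % 32 == 0
    __syncthreads();

    out[lane] = buf[lane * 32];   // same 32-way conflict on the read
}
```

With a stride of 1 instead of 32, each lane would hit a distinct bank and the whole warp would proceed in one step.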
Why It Shows Up in Tiled Kernels
Tiled kernels such as matrix multiplication and transpose often rely heavily on shared memory. That makes them more exposed to bank conflict issues.

In these kernels, an access pattern that looks harmless at the array level may still produce poor bank behavior internally.
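A tiled transpose shows this well. The sketch below (hypothetical kernel, assuming `blockDim == (TILE, TILE)` and `n` a multiple of `TILE`) stores the tile row by row, which is conflict-free, but then reads it column by column, so each warp steps through shared memory with stride `TILE`:

```cuda
#define TILE 32

// Hypothetical tiled transpose: the row-major store is conflict-free,
// but the column-major read below strides by TILE words. With TILE == 32,
// all 32 lanes of a warp land in the same bank: a 32-way conflict.
__global__ void transpose_naive(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced, conflict-free
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y]; // column read: conflicts
}
```

At the array level both accesses look like ordinary 2D indexing; only the mapping to banks distinguishes them.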
Why Padding Keeps Appearing
A classic technique for reducing bank conflicts is padding. By slightly changing the shared memory layout, it is often possible to avoid repeated mapping collisions between threads and banks.
This can look like a small memory waste, but in practice it often improves parallel shared-memory access enough to be worthwhile.
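Applied to a tiled transpose, the padding is a single extra element per row (again a hypothetical sketch, same launch assumptions as a standard `TILE`×`TILE` tile):

```cuda
#define TILE 32

// Same transpose pattern, with one float of padding per row. Column
// accesses now step by TILE + 1 = 33 words, so consecutive lanes map to
// distinct banks (33 is coprime with 32) and the read is conflict-free.
__global__ void transpose_padded(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 breaks the bank alignment

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y]; // now conflict-free
}
```

The cost here is 32 unused floats per tile (4224 bytes instead of 4096), which is usually a good trade against a 32-way serialized access.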
What to Watch for
Useful questions include:
- what stride are warp lanes using in shared memory?
- does a transpose-like pattern increase conflict risk?
- can a small layout change reduce collisions?
Summary
Bank conflicts are one of the main reminders that shared memory optimization is deeper than "move data closer." You also need to care about how threads access that close data.

The next post will look at warp shuffle and warp-level primitives as a lighter alternative for certain kinds of collaboration.