GPU Systems 11 - Shared Memory Bank Conflicts
Why shared memory is not automatically fast and how bank conflicts appear
Shared Memory Is Fast, but Not Automatically Optimal
Shared memory is often introduced as a much faster alternative to global memory, and in general it is. But that does not mean every shared memory access pattern is efficient. One of the main reasons performance can still suffer is bank conflicts.
Why Bank Structure Matters
Shared memory is divided internally into banks; on current NVIDIA GPUs there are 32 banks, and successive 32-bit words map to successive banks. In the ideal case, the threads of a warp access different banks and all accesses proceed in parallel. If multiple threads of a warp touch different words in the same bank, those accesses are serialized.
That is the core idea behind bank conflict.
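A minimal sketch of the worst case, assuming the usual 32-bank, 4-byte-word mapping (the kernel name and launch shape are hypothetical, chosen only to illustrate the pattern):

```cuda
// Hypothetical illustration: a stride-32 access where all 32 lanes of a
// warp land in bank 0, so one load is serialized into 32 transactions.
__global__ void stride_conflict(const float *in, float *out) {
    __shared__ float buf[32 * 32];

    int lane = threadIdx.x;       // assumes blockDim.x == 32 (one warp)
    buf[lane * 32] = in[lane];    // word index lane*32 -> bank (lane*32) % 32 == 0
    __syncthreads();

    out[lane] = buf[lane * 32];   // same 32-way conflict on the read
}
```

With a stride of 1 instead of 32, each lane would hit a distinct bank and the whole warp would proceed in one step.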
Why It Shows Up in Tiled Kernels
Tiled kernels such as matrix multiplication and transpose often rely heavily on shared memory. That makes them more exposed to bank conflict issues.

In these kernels, an access pattern that looks harmless at the array level may still produce poor bank behavior internally.
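A tiled transpose shows this well. The sketch below (hypothetical kernel, assuming `blockDim == (TILE, TILE)` and `n` a multiple of `TILE`) stores the tile row by row, which is conflict-free, but then reads it column by column, so each warp steps through shared memory with stride `TILE`:

```cuda
#define TILE 32

// Hypothetical tiled transpose: the row-major store is conflict-free,
// but the column-major read below strides by TILE words. With TILE == 32,
// all 32 lanes of a warp land in the same bank: a 32-way conflict.
__global__ void transpose_naive(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced, conflict-free
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y]; // column read: conflicts
}
```

At the array level both accesses look like ordinary 2D indexing; only the mapping to banks distinguishes them.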
Why Padding Keeps Appearing
A classic technique for reducing bank conflicts is padding. By slightly changing the shared memory layout, it is often possible to avoid repeated mapping collisions between threads and banks.
This can look like a small memory waste, but in practice it often improves parallel shared-memory access enough to be worthwhile.
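Applied to a tiled transpose, the padding is a single extra element per row (again a hypothetical sketch, same launch assumptions as a standard `TILE`×`TILE` tile):

```cuda
#define TILE 32

// Same transpose pattern, with one float of padding per row. Column
// accesses now step by TILE + 1 = 33 words, so consecutive lanes map to
// distinct banks (33 is coprime with 32) and the read is conflict-free.
__global__ void transpose_padded(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 breaks the bank alignment

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y]; // now conflict-free
}
```

The cost here is 32 unused floats per tile (4224 bytes instead of 4096), which is usually a good trade against a 32-way serialized access.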
What to Watch for
Useful questions include:
- what stride are warp lanes using in shared memory?
- does a transpose-like pattern increase conflict risk?
- can a small layout change reduce collisions?
Summary
Bank conflicts are one of the main reminders that shared memory optimization is deeper than "move data closer." You also need to care about how threads access that close data.

The next post will look at warp shuffle and warp-level primitives as a lighter alternative for certain kinds of collaboration.