GPU Systems 03 - Memory Hierarchy and Bandwidth
How to think about the GPU memory hierarchy and bandwidth bottlenecks