Computer Architecture 07 - Memory Hierarchy
The memory hierarchy from registers to HDD and how caches work
All posts in the Lectures series
How virtual memory enables process isolation through the MMU, page tables, and TLB
How the CPU exchanges data with external devices and the principles behind efficient data transfer via DMA
Why clock speeds stopped increasing and the core concepts of modern multicore processor architecture
Once LLM training leaves a single GPU, it stops being only a modeling problem and becomes a systems problem around memory, communication, and recovery
Data parallelism looks simple, but it carries both the cost of synchronizing gradients every step and the cost of replicating the full model state on every GPU
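The replication cost can be made concrete with a back-of-the-envelope byte count. This sketch (the function name is mine) assumes mixed-precision training with Adam, so each parameter costs roughly 16 bytes of model state: fp16 weights and gradients plus fp32 master weights and two fp32 optimizer moments.

```python
def ddp_bytes_per_gpu(n_params: int) -> int:
    """Approximate per-GPU model-state memory under plain data parallelism,
    assuming mixed-precision Adam:
    fp16 params (2 B) + fp16 grads (2 B) + fp32 master params (4 B)
    + Adam first moment (4 B) + Adam second moment (4 B) = 16 B/param.
    Every data-parallel rank holds the full copy."""
    return n_params * (2 + 2 + 4 + 4 + 4)

# e.g. a 7B-parameter model needs ~104 GiB of model state on every GPU,
# before activations -- which is why replication alone can be the bottleneck
gib = ddp_bytes_per_gpu(7_000_000_000) / 2**30
```

The exact byte count varies with optimizer and precision choices, but the shape of the problem is the same: the cost is per-GPU and does not shrink as you add GPUs.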
To reason about distributed training performance, you need a concrete mental model of all-reduce and collective communication costs
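One such mental model is the standard alpha-beta cost of a ring all-reduce: 2(p-1) communication steps, each moving n/p bytes per link. The function below is my own sketch of that formula, not a measurement of any particular library.

```python
def ring_allreduce_seconds(n_bytes: float, p: int,
                           alpha: float, beta: float) -> float:
    """Alpha-beta estimate for ring all-reduce over p workers.

    alpha: per-message latency in seconds
    beta:  seconds per byte (inverse link bandwidth)
    The ring does 2(p-1) steps (reduce-scatter + all-gather),
    each transferring n_bytes / p per worker."""
    steps = 2 * (p - 1)
    return steps * alpha + steps * (n_bytes / p) * beta

# Example: 1 GiB of gradients, 8 GPUs, 100 GB/s links, 5 us step latency
t = ring_allreduce_seconds(2**30, 8, alpha=5e-6, beta=1 / 100e9)
# roughly 19 ms, dominated by the bandwidth term
```

Note the bandwidth term approaches 2n/B as p grows: per-worker traffic is nearly independent of cluster size, which is exactly why ring all-reduce scales well and why the latency term only matters for small messages.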
DDP is not just a wrapper around your model; it is a runtime that coordinates autograd hooks, gradient buckets, and synchronization timing
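The bucketing idea can be sketched in a few lines of plain Python. `assign_buckets` is a hypothetical helper of mine, not PyTorch code; it mimics the greedy grouping DDP does so that a bucket's all-reduce can start as soon as it fills, overlapping with the rest of the backward pass. The 25 MB default mirrors DDP's `bucket_cap_mb`.

```python
def assign_buckets(grad_sizes_mb, bucket_cap_mb=25.0):
    """Greedily group gradients into buckets in reverse registration
    order (roughly the order gradients become ready during backward).
    A bucket is 'flushed' -- its all-reduce can launch -- once it
    reaches the cap, while earlier gradients are still being computed."""
    buckets, current, size = [], [], 0.0
    for idx in reversed(range(len(grad_sizes_mb))):
        current.append(idx)
        size += grad_sizes_mb[idx]
        if size >= bucket_cap_mb:
            buckets.append(current)
            current, size = [], 0.0
    if current:
        buckets.append(current)  # leftover partial bucket
    return buckets

# Four 10 MB gradient tensors -> one full bucket, one leftover
assign_buckets([10, 10, 10, 10])  # [[3, 2, 1], [0]]
```

The bucket size is a real tuning knob: buckets too small pay latency per all-reduce, buckets too large delay the first all-reduce and reduce overlap.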
Adding more GPUs changes optimizer semantics as well as throughput, so batch size and learning rate need to be reasoned about together