Distributed LLM Training 06 - Where LLM Training Memory Actually Goes
Sizing training runs by parameter count alone leads to bad decisions; training memory is the sum of parameters, gradients, optimizer state, and activations
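To make the claim concrete, here is a minimal sketch of the parameter-proportional ("static") memory for mixed-precision Adam training. The byte counts per parameter (2 B fp16 weights, 2 B fp16 gradients, 12 B fp32 optimizer state) are the commonly cited figures, e.g. from the ZeRO paper; the function name and the 7B example are illustrative, not from the post, and activations are deliberately excluded because they scale with batch and sequence length rather than parameter count.

```python
def static_memory_gib(n_params: float) -> dict:
    """GiB of parameter-proportional training state for mixed-precision Adam.

    Assumed per-parameter byte counts (framework-dependent):
      fp16 weights (2) + fp16 grads (2)
      + fp32 master weights (4) + fp32 Adam momentum (4) + fp32 Adam variance (4)
    """
    bytes_per_param = {
        "fp16 parameters": 2,
        "fp16 gradients": 2,
        "fp32 master weights": 4,
        "fp32 Adam momentum": 4,
        "fp32 Adam variance": 4,
    }
    gib = {name: n_params * b / 2**30 for name, b in bytes_per_param.items()}
    gib["total"] = sum(gib.values())
    return gib

mem = static_memory_gib(7e9)  # a hypothetical 7B-parameter model
print(f"total static state: {mem['total']:.1f} GiB")  # ~104 GiB before activations
```

Note that the 2-byte inference footprint of a 7B model (about 13 GiB) understates its training footprint by roughly 8x even before any activations are counted, which is exactly why parameter size alone is a poor planning metric.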
All posts in the Lectures series
Sizing training runs by parameter count alone leads to bad decisions; training memory is the sum of parameters, gradients, optimizer state, and activations
In distributed training, performance is often shaped more by how GPUs are connected than by the raw number of GPUs
Once the model itself is too large for one device, data parallelism is no longer enough and layer-internal computation has to be split
Tensor parallelism becomes real when you map it onto QKV projections, attention output paths, and the two large MLP projections inside a transformer block
As context length grows, activation memory and communication patterns change again, and sequence-oriented partitioning starts to matter
Once the model is split by depth into stages, idle time and stage imbalance become just as important as memory savings
Pipeline efficiency is shaped heavily by scheduling, because bubble size, activation memory, and implementation complexity all depend on it
Saving memory by recomputing activations is not a minor option; it is often a central design choice in large-scale training
ZeRO is best understood as a staged system for removing different forms of replicated training state
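The staged view of ZeRO in the last entry can be sketched numerically: each stage shards one more form of replicated state across the data-parallel group. The function below is an illustrative estimate, assuming the same mixed-precision Adam byte counts as above (2 B fp16 params, 2 B fp16 grads, 12 B fp32 optimizer state) and ignoring activations and communication buffers; the function name and example sizes are hypothetical.

```python
def zero_per_gpu_gib(n_params: float, n_gpus: int, stage: int) -> float:
    """Per-GPU GiB of parameter-proportional state under ZeRO stages 0-3.

    stage 0: everything replicated on every GPU
    stage 1: optimizer state sharded across the group
    stage 2: gradients also sharded
    stage 3: parameters also sharded
    """
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        params /= n_gpus
    return n_params * (params + grads + optim) / 2**30

# Hypothetical 7B model on 8 GPUs: each stage removes another replica.
for s in range(4):
    print(f"ZeRO-{s}: {zero_per_gpu_gib(7e9, 8, s):.1f} GiB per GPU")
```

The pattern to notice is that ZeRO-1 already removes the largest replicated component (the 12 B/param optimizer state), while ZeRO-3 drives per-GPU static state toward the fully sharded limit of 16 B/param divided by the group size.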