Distributed LLM Training 01 - Why LLM Training Becomes a Distributed Systems Problem
Once LLM training leaves a single GPU, it stops being only a modeling problem and becomes a systems problem of memory, communication, and failure recovery
Data parallelism looks simple, but it carries both gradient synchronization cost and full model-state replication cost
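To make both costs concrete, here is a back-of-envelope sketch, assuming a hypothetical 7B-parameter model trained with mixed-precision Adam; the ~16 bytes/param figure is a standard estimate, and the model size and GPU count are assumptions for illustration:

```python
# Back-of-envelope data-parallel costs for a hypothetical 7B-parameter model.
params = 7e9
n_gpus = 8

# Replication cost: every data-parallel rank holds the FULL model state.
# With mixed-precision Adam this is roughly 16 bytes/param: fp16 weights (2 B)
# + fp16 grads (2 B) + fp32 master weights (4 B) + fp32 Adam moments (8 B).
state_per_gpu_gb = params * 16 / 1e9
print(f"state per GPU: {state_per_gpu_gb:.0f} GB, "
      f"replicated {n_gpus}x across the cluster")      # ~112 GB per GPU

# Synchronization cost: every step all-reduces the full fp16 gradient.
grad_gb = params * 2 / 1e9
print(f"gradient volume synced per step: {grad_gb:.0f} GB")  # ~14 GB
```

Neither cost shows up in the training loop's code, which is exactly why data parallelism looks simpler than it is.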
To reason about distributed training performance, you need a concrete mental model for all-reduce and collective communication cost
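A useful first-order model is the bandwidth term of ring all-reduce: each of N ranks sends and receives roughly 2(N-1)/N times the message size, plus latency terms that matter for small messages. The sketch below encodes just the bandwidth term; the 300 GB/s bus bandwidth is an assumed, illustrative figure:

```python
# Minimal cost model for ring all-reduce (bandwidth term only).
# Each rank moves ~2*(N-1)/N of the message, so time is roughly
#   t ≈ 2*(N-1)/N * message_bytes / bus_bandwidth
def ring_allreduce_seconds(message_bytes: float, n_ranks: int,
                           bus_bandwidth_gb_s: float) -> float:
    traffic = 2 * (n_ranks - 1) / n_ranks * message_bytes
    return traffic / (bus_bandwidth_gb_s * 1e9)

# e.g. 14 GB of fp16 gradients across 8 GPUs at an assumed 300 GB/s:
print(f"{ring_allreduce_seconds(14e9, 8, 300):.3f} s")  # ~0.082 s per step
```

The key property: the per-rank traffic approaches 2x the message size as N grows, so the per-step cost is nearly independent of GPU count, and what you pay for is message size over link bandwidth.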
DDP is not just a wrapper around your model; it is a runtime that coordinates autograd hooks, gradient buckets, and synchronization timing
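Here is a minimal sketch of that runtime in PyTorch, assuming a single-node launch via torchrun (which supplies the rendezvous environment variables); bucket_cap_mb is the real knob that controls how many gradients get fused into one all-reduce:

```python
# Minimal DDP sketch. DDP registers autograd hooks on each parameter;
# as buckets of gradients become ready during backward(), it launches
# async all-reduces that overlap with the rest of the backward pass.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")      # reads env vars set by torchrun
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)
    # bucket_cap_mb controls gradient bucket fusion (25 MB is the default).
    ddp_model = DDP(model, device_ids=[rank], bucket_cap_mb=25)
    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    x = torch.randn(32, 1024, device=rank)
    loss = ddp_model(x).square().mean()
    loss.backward()   # gradient all-reduces fire per bucket, overlapping compute
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Because buckets are all-reduced as they fill, larger buckets mean fewer, bigger collectives but less overlap with backward compute; that trade-off is why the bucket size is exposed at all.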
Adding more GPUs changes optimizer semantics as well as throughput, so batch size and learning rate need to be reasoned about together
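One common heuristic is the linear scaling rule: when the global batch grows k-fold because you added data-parallel ranks, scale the base learning rate by k and add warmup. It is a starting point rather than a guarantee, and the base values below are assumptions for illustration:

```python
# Linear scaling rule, worked out numerically (illustrative values).
base_lr = 3e-4
base_global_batch = 256

n_gpus = 32
per_gpu_batch = 32
global_batch = n_gpus * per_gpu_batch      # 1024: the optimizer now sees
                                           # 4x fewer steps per epoch
scale = global_batch / base_global_batch   # k = 4
scaled_lr = base_lr * scale
print(f"global batch {global_batch}: lr {base_lr} -> {scaled_lr}")
```

The point is that GPU count silently became a hyperparameter: the same model and data with 32 ranks instead of 8 is a different optimization trajectory unless batch size and learning rate are adjusted together.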
Looking only at parameter size leads to bad decisions; training memory is really a combination of parameters, gradients, optimizer state, and activations
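A rough accounting for a hypothetical GPT-style 7B configuration makes the point; the static-state figure assumes mixed-precision Adam, and the per-layer activation estimate s·b·h·(34 + 5as/h) bytes follows the commonly cited Megatron activation-recomputation formula (fp16 activations, no recomputation):

```python
# Rough training-memory accounting for an assumed GPT-style configuration.
params   = 7e9
layers   = 32
hidden   = 4096     # h
heads    = 32       # a
seq      = 2048     # s
micro_bs = 1        # b

# Parameters + gradients + optimizer state: ~16 bytes/param with
# mixed-precision Adam (see the data-parallelism sketch above).
static_gb = params * 16 / 1e9

# Activations per transformer layer: s*b*h*(34 + 5*a*s/h) bytes.
act_per_layer = seq * micro_bs * hidden * (34 + 5 * heads * seq / hidden)
act_gb = layers * act_per_layer / 1e9

print(f"params + grads + optimizer: {static_gb:.0f} GB")  # ~112 GB
print(f"activations (bs=1, no recompute): {act_gb:.0f} GB")  # ~31 GB
```

The "7B model" intuition of 14 GB (fp16 weights) is off by nearly an order of magnitude once training state and activations are counted, and activations scale with batch size and sequence length while the static state does not.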
In distributed training, performance is often shaped more by how GPUs are connected than by the raw number of GPUs
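To see why, push the same gradient all-reduce through the cost model from above under different link speeds; the bandwidth figures are assumed, order-of-magnitude values, not measurements:

```python
# Same 14 GB gradient all-reduce across 8 GPUs; only the interconnect changes.
for link, gb_s in [("NVLink-class", 300), ("PCIe 4.0 x16", 32), ("100 GbE", 12.5)]:
    t = 2 * (8 - 1) / 8 * 14e9 / (gb_s * 1e9)
    print(f"{link:>13}: {t:.2f} s per all-reduce")
# NVLink-class: 0.08 s | PCIe 4.0 x16: 0.77 s | 100 GbE: 1.96 s
```

An order-of-magnitude gap in link bandwidth turns a negligible synchronization step into the dominant cost, which is why cluster topology, not GPU count, often decides scaling efficiency.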
Once the model itself is too large for one device, data parallelism is no longer enough and layer-internal computation has to be split
Tensor parallelism becomes real when you map it onto QKV projections, attention output paths, and the two large MLP projections inside a transformer block
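Here is a minimal sketch of the Megatron-style split, computed shard-by-shard on one device for clarity; a real implementation places one shard per rank and replaces the final sum with an all-reduce, and all sizes are illustrative:

```python
# Tensor parallelism inside one transformer MLP, shown with explicit shards.
import torch

h, ffn, tp = 4096, 16384, 2   # hidden size, MLP width, tensor-parallel degree
x = torch.randn(8, h)

# Column parallelism (QKV projections, MLP up-projection): the OUTPUT
# dimension is split, so each rank computes its own slice of the
# activations with no communication on the forward path.
w_up = [torch.randn(h, ffn // tp) for _ in range(tp)]
slices = [x @ w for w in w_up]                 # each: (8, ffn/tp)

# Row parallelism (attention output projection, MLP down-projection):
# the INPUT dimension is split, so each rank produces a partial sum and
# an all-reduce combines them. sum() stands in for that all-reduce here.
w_down = [torch.randn(ffn // tp, h) for _ in range(tp)]
y = sum(torch.relu(s) @ w for s, w in zip(slices, w_down))

print(y.shape)   # (8, 4096): same math as the unsplit block
```

The nonlinearity can be applied per shard exactly because the column split hands each rank its own contiguous slice of the activation vector; pairing a column-parallel layer with a row-parallel layer is what keeps the whole block at one all-reduce per forward pass.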