Distributed LLM Training 01 - Why LLM Training Becomes a Distributed Systems Problem
Once LLM training leaves a single GPU, it stops being only a modeling problem and becomes a systems problem centered on memory, communication, and recovery.
How this series should be read
Many people begin distributed training by memorizing framework names: DDP, FSDP, ZeRO, Megatron-LM, DeepSpeed. That is usually the wrong entry point. The better question is why those techniques had to exist in the first place.
Distributed training is ultimately about balancing three resources at once:
- how computation is split across devices
- how memory usage is kept within limits
- how communication cost is reduced or hidden
LLM training stresses all three at the same time. Parameter counts are large, activations are large, optimizer state is large, and once multiple devices are involved, the cost of exchanging gradients and activations becomes a bottleneck of its own.
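To make "optimizer state is large" concrete, here is a rough per-parameter accounting sketch for mixed-precision Adam training. The 2+2+12 bytes-per-parameter breakdown is the commonly cited figure from the ZeRO paper; activation memory is deliberately omitted because it depends on batch size and sequence length.

```python
# Rough per-parameter memory for mixed-precision Adam training.
# Breakdown follows the ZeRO paper's accounting; activations excluded.

def training_bytes_per_param() -> int:
    fp16_params = 2    # half-precision working copy of the weights
    fp16_grads = 2     # gradients in the same half precision
    fp32_master = 4    # fp32 master weights held by the optimizer
    adam_momentum = 4  # fp32 first moment
    adam_variance = 4  # fp32 second moment
    return fp16_params + fp16_grads + fp32_master + adam_momentum + adam_variance

def model_state_gib(n_params: float) -> float:
    """Total model state (weights + grads + optimizer) in GiB."""
    return n_params * training_bytes_per_param() / 2**30

# A 7B-parameter model needs ~104 GiB of model state alone -- already
# beyond a single 80 GB GPU, before any activations are counted.
print(f"{model_state_gib(7e9):.0f} GiB")
```

Even a model whose weights alone would fit comfortably on one device can blow past its memory once gradients and optimizer state are added, which is why sharding techniques exist at all.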
Why single-GPU intuition breaks
On one GPU, it is easy to summarize a problem as "the step is slow" or "the model does not fit." In distributed training, that rough mental model stops being enough.
- some steps are compute-bound
- some are dominated by all-reduce
- some are held back by the input pipeline
- some fail because memory replication is too expensive
- some slow down because one rank lags behind the others
In other words, distributed training is not just training code anymore. It includes process placement, collective communication, memory sharding, checkpoint strategy, and fault recovery.
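To see why a step can be "dominated by all-reduce," a back-of-the-envelope model helps. This sketch uses the standard ring all-reduce volume formula, where each rank sends and receives roughly 2*(N-1)/N times the gradient buffer; the bandwidth figure is an assumption for illustration, not a measurement of any particular interconnect.

```python
# Back-of-the-envelope cost of synchronizing gradients via ring all-reduce.
# Each of N ranks moves ~2*(N-1)/N times the buffer size over the wire,
# so time is roughly volume / bandwidth (latency terms ignored).

def allreduce_seconds(n_params: float, n_ranks: int,
                      bytes_per_grad: int = 2,
                      bus_gbps: float = 300.0) -> float:
    """bus_gbps is an assumed per-GPU bus bandwidth in GB/s (NVLink-class)."""
    volume = 2 * (n_ranks - 1) / n_ranks * n_params * bytes_per_grad
    return volume / (bus_gbps * 1e9)

# 7B half-precision gradients across 8 GPUs: tens of milliseconds per step,
# which matters once compute per step is also tens of milliseconds.
print(f"{allreduce_seconds(7e9, 8) * 1e3:.1f} ms")
```

The point of the estimate is not the exact number but the shape: communication time scales with model size and barely shrinks as ranks are added, so it has to be overlapped with compute or reduced outright.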
What a distributed training engineer actually needs to do
The job is not just "use more GPUs." The real questions are:
- is the current bottleneck compute, memory, or communication?
- is plain data parallel enough, or do we need sharding?
- if we introduce tensor or pipeline parallelism, what new costs appear?
- what happens to optimizer behavior when global batch size changes?
- how do we recover a long run safely after interruption?
Those are internal-behavior questions, not just framework-usage questions.
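As one example of an internal-behavior question: when data parallelism multiplies the global batch size, a common heuristic is the linear learning-rate scaling rule (scale the learning rate with batch size, usually alongside warmup). It is a heuristic, not a guarantee, and this sketch only illustrates the arithmetic.

```python
# Linear learning-rate scaling heuristic: when data parallelism grows the
# global batch by a factor k, scale the learning rate by k as a starting
# point (typically combined with warmup). A common rule of thumb, not a law.

def scaled_lr(base_lr: float, base_batch: int, global_batch: int) -> float:
    return base_lr * global_batch / base_batch

# Going from 1 GPU at batch 32 to 16 GPUs (global batch 512)
# suggests a 16x larger learning rate under this rule.
print(f"{scaled_lr(3e-4, 32, 512):.4f}")  # 0.0048
```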
The structure of this series
This track will move through:
- synchronous SGD and the real shape of data parallelism
- all-reduce and collective communication cost
- PyTorch DDP internals
- memory accounting in LLM training
- NCCL and hardware topology
- tensor, sequence, and pipeline parallelism
- activation checkpointing, ZeRO, and FSDP
- overlap, checkpointing, fault tolerance, and debugging
- how frameworks like Megatron-LM and DeepSpeed package these ideas
- how to design an actual training stack
The main question to keep asking
For every technique in this series, keep asking:
- what cost is this trying to reduce?
- what new cost does it introduce?
- does it reduce memory replication, communication volume, or idle time?
- does it fit the topology and workload we actually have?
That lens will make the later framework-heavy posts much easier to understand.
The next post starts with synchronous SGD and data parallelism, because most of distributed training is easier to understand once that baseline is clear.