Distributed LLM Training 03 - All-Reduce, Ring, and How to Read Communication Cost
To reason about distributed training performance, you need a concrete mental model of all-reduce and of what collective communication actually costs.
Why all-reduce matters so much
In data parallel training, every rank computes gradients from its own mini-batch, but all ranks must end up with the same gradient result before the optimizer step. That is why all-reduce appears everywhere.
You can think of it as two combined operations:
- reduce: combine values across ranks
- distribute: give the combined result back to every rank
So each rank ends up with the same synchronized tensor.
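The semantics can be shown in a minimal single-process sketch (pure Python, no real communication; the per-rank gradient values are made up for illustration):

```python
# Simulate all-reduce semantics on one machine: each "rank" holds a
# local gradient vector, and after all-reduce every rank holds the sum.
def all_reduce_sum(per_rank_values):
    # reduce: combine (sum) the values element-wise across ranks
    combined = [sum(col) for col in zip(*per_rank_values)]
    # distribute: every rank receives the same combined result
    return [list(combined) for _ in per_rank_values]

# Hypothetical gradients from 3 ranks, 4 elements each.
grads = [[1.0, 2.0, 3.0, 4.0],
         [0.5, 0.5, 0.5, 0.5],
         [2.0, 1.0, 0.0, -1.0]]
synced = all_reduce_sum(grads)
# every rank now holds [3.5, 3.5, 3.5, 3.5]
```

In a real framework this is a single call (e.g. `torch.distributed.all_reduce`), but the contract is the same: identical output on every rank.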
Why ring all-reduce is common
A naive mental model is to send everything to one central node and broadcast it back. That becomes a bottleneck quickly. Ring all-reduce avoids that by arranging ranks in a ring and moving chunked data between neighbors.
The important intuition is:
- split the tensor into chunks, one per rank
- perform a reduce-scatter over those chunks, so each rank ends up owning one fully reduced chunk
- perform an all-gather to distribute the reduced chunks back to every rank
This tends to use available bandwidth more evenly instead of overloading one place.
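The chunked schedule above can be sketched in plain Python. This is a single-process simulation of the standard ring schedule, not real NCCL; each chunk is a single float standing in for a tensor slice:

```python
def ring_all_reduce(chunks_per_rank):
    """Simulate ring all-reduce. chunks_per_rank[r][c] is rank r's
    local value for chunk c. Returns the post-all-reduce state."""
    n = len(chunks_per_rank)
    data = [list(rank) for rank in chunks_per_rank]

    # Reduce-scatter: in each of n-1 steps, every rank sends one chunk to
    # its right neighbor, which accumulates it into its own copy. Sends
    # are collected first to model a simultaneous exchange.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        vals = [data[r][c] for r, c in sends]
        for (r, c), v in zip(sends, vals):
            data[(r + 1) % n][c] += v

    # Now rank r owns the fully reduced chunk (r + 1) % n.
    # All-gather: circulate reduced chunks so every rank gets all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        vals = [data[r][c] for r, c in sends]
        for (r, c), v in zip(sends, vals):
            data[(r + 1) % n][c] = v
    return data

grads = [[1.0, 2.0, 3.0],   # rank 0's three chunks
         [4.0, 5.0, 6.0],   # rank 1
         [7.0, 8.0, 9.0]]   # rank 2
result = ring_all_reduce(grads)
# every rank ends with [12.0, 15.0, 18.0]
```

Note that at every step each rank sends and receives exactly one chunk, which is why no single link or node is overloaded.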
What communication time is made of
A rough mental model splits communication time into two terms:
- latency cost: the fixed overhead of launching the collective and coordinating each step, paid per operation regardless of size
- bandwidth cost: the time to move the actual bytes, proportional to message size
If there are many tiny tensors, latency hurts. If there are very large tensors, bandwidth hurts. That is one reason gradient bucketing matters so much in real frameworks.
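This split is often written as the alpha-beta cost model. A back-of-envelope sketch for ring all-reduce, where the latency and bandwidth constants are illustrative assumptions rather than measurements of any real interconnect:

```python
def ring_all_reduce_time(n_bytes, n_ranks, alpha, beta):
    """Alpha-beta estimate for ring all-reduce.
    alpha: per-step latency in seconds (assumed),
    beta:  seconds per byte (inverse link bandwidth, assumed).
    The ring takes 2*(n-1) steps and each rank moves
    about 2*(n-1)/n of the tensor's bytes over the wire."""
    latency_term = 2 * (n_ranks - 1) * alpha
    bandwidth_term = 2 * (n_ranks - 1) / n_ranks * n_bytes * beta
    return latency_term, bandwidth_term

# Illustrative numbers: 8 ranks, 10 us per-step latency, 100 GB/s links.
alpha, beta = 10e-6, 1 / 100e9

# One 100 MB gradient bucket: bandwidth-dominated.
lat_big, bw_big = ring_all_reduce_time(100e6, 8, alpha, beta)

# The same bytes as 10,000 separate 10 KB collectives: latency-dominated.
lat_one, bw_one = ring_all_reduce_time(10e3, 8, alpha, beta)
lat_small, bw_small = lat_one * 10_000, bw_one * 10_000
```

Under these assumed constants the single bucket pays about 0.14 ms of latency against 1.75 ms of bandwidth time, while the 10,000 tiny collectives pay the same bandwidth time but roughly 1.4 seconds of accumulated latency. That gap is exactly what gradient bucketing is designed to close.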
Hardware topology matters too:
- are GPUs connected by NVLink?
- are they only using PCIe?
- is the traffic intra-node or inter-node?
Those details change the real cost dramatically.
What to look for in practice
When profiling:
- do NCCL kernels dominate after backward finishes?
- do all ranks have similar step times?
- does inter-node communication become the clear bottleneck?
- are too many small collectives being launched?
If utilization looks decent but scaling is disappointing, communication is often hiding underneath.
The next post connects this directly to PyTorch DDP and how gradient synchronization is actually scheduled during backward.