Why all-reduce matters so much

In data parallel training, every rank computes gradients from its own mini-batch, but all ranks must end up with the same reduced gradient (typically the sum or average across ranks) before the optimizer step. That is why all-reduce appears everywhere.

You can think of it as the combination of two operations:

  • reduce: combine values across ranks
  • distribute: give the combined result back to every rank

So each rank ends up with the same synchronized tensor.
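The two steps can be sketched as a toy pure-Python simulation (a hypothetical helper, not a real API; production frameworks issue a single fused collective such as NCCL's all-reduce rather than separate reduce and broadcast passes):

```python
# Toy simulation of all-reduce as reduce + distribute.
# Each inner list plays the role of one rank's gradient tensor.

def all_reduce(per_rank_grads):
    """Sum each rank's gradient elementwise, then hand the result to every rank."""
    # reduce: combine values across ranks (elementwise sum)
    reduced = [sum(vals) for vals in zip(*per_rank_grads)]
    # distribute: every rank receives an identical copy of the result
    return [list(reduced) for _ in per_rank_grads]

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 ranks, 2 parameters each
synced = all_reduce(grads)
assert all(g == [9.0, 12.0] for g in synced)  # every rank holds the same tensor
```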

Why ring all-reduce is common

A naive mental model is to send everything to one central node and broadcast the result back. That central node's links become a bottleneck as the rank count grows. Ring all-reduce avoids this by arranging ranks in a ring and passing chunked data between neighbors.

The important intuition is:

  • split the tensor into N chunks, one per rank
  • perform a reduce-scatter over those chunks, so each rank ends up owning one fully reduced chunk
  • then perform an all-gather so every rank receives all of the reduced chunks

This spreads traffic evenly across every link instead of overloading one place: each rank sends and receives roughly 2(N-1)/N times the tensor size, regardless of the number of ranks.
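The two phases can be simulated directly (a hypothetical pure-Python sketch, not how NCCL implements it; it assumes the tensor length divides evenly by the number of ranks so chunks are equal-sized):

```python
# Toy ring all-reduce: reduce-scatter followed by all-gather.

def ring_all_reduce(per_rank):
    n = len(per_rank)
    chunk_len = len(per_rank[0]) // n
    # each rank's tensor, split into n chunks
    chunks = [
        [list(t[i * chunk_len:(i + 1) * chunk_len]) for i in range(n)]
        for t in per_rank
    ]
    # reduce-scatter: after n-1 steps, rank r owns fully reduced chunk (r+1) % n
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n          # chunk index rank r forwards this step
            dst = (r + 1) % n           # right neighbor accumulates it
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], chunks[r][c])]
    # all-gather: circulate fully reduced chunks until every rank has all of them
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n      # fully reduced chunk rank r forwards
            dst = (r + 1) % n
            chunks[dst][c] = list(chunks[r][c])
    # flatten each rank's chunks back into one tensor
    return [[x for chunk in rank_chunks for x in chunk] for rank_chunks in chunks]

per_rank = [[1.0, 2.0], [3.0, 4.0]]     # 2 ranks, 2 chunks of 1 element
result = ring_all_reduce(per_rank)
assert result == [[4.0, 6.0], [4.0, 6.0]]  # all ranks hold the elementwise sum
```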

What communication time is made of

A rough mental model is:

  • latency cost: a fixed per-message overhead for launching the collective and coordinating the ranks
  • bandwidth cost: moving the actual bytes

With many tiny tensors, the latency term dominates; with very large tensors, the bandwidth term dominates. That is one reason gradient bucketing, which coalesces many small gradients into fewer, larger collectives, matters so much in real frameworks.
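This mental model is often written as an alpha-beta cost model. A sketch with made-up numbers (alpha, beta, and the link speed are all illustrative assumptions, not measurements):

```python
# Alpha-beta cost model for a ring all-reduce:
#   alpha = per-step latency in seconds, beta = seconds per byte.
# The 2*(N-1) steps and 2*(N-1)/N data volume are the standard
# ring all-reduce counts (reduce-scatter + all-gather).

def ring_all_reduce_time(num_ranks, message_bytes, alpha, beta):
    latency = 2 * (num_ranks - 1) * alpha                          # one alpha per ring step
    bandwidth = 2 * (num_ranks - 1) / num_ranks * message_bytes * beta
    return latency + bandwidth

# Made-up numbers: 8 ranks, 10 us per step, 10 GB/s per link.
alpha, beta = 10e-6, 1 / 10e9
many_small = 1000 * ring_all_reduce_time(8, 100_000, alpha, beta)  # 1000 x 100 KB
one_bucket = ring_all_reduce_time(8, 100_000_000, alpha, beta)     # 1 x 100 MB
# same total bytes, but the bucketed version pays the latency term once
assert one_bucket < many_small
```

Under these assumed numbers the single bucketed collective is roughly an order of magnitude faster than a thousand small ones, even though both move the same total data.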

Hardware topology matters too:

  • are GPUs connected by NVLink?
  • are they only using PCIe?
  • is the traffic intra-node or inter-node?

Those details change the real cost dramatically.

What to look for in practice

When profiling:

  • do NCCL kernels dominate after backward finishes?
  • do all ranks have similar step times?
  • does inter-node communication become the clear bottleneck?
  • are too many small collectives being launched?

If GPU utilization looks decent but multi-GPU scaling is disappointing, communication overhead is often the hidden cause.
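A quick back-of-envelope check helps here (the numbers below are made up for illustration): the gap between observed step time and pure compute time bounds the exposed, non-overlapped communication.

```python
# Estimate exposed communication from two profiled numbers (illustrative values).
compute_ms = 85.0   # forward + backward kernel time from a profile
step_ms = 120.0     # observed wall-clock time per training step

exposed_comm_ms = step_ms - compute_ms    # communication not hidden behind compute
compute_fraction = compute_ms / step_ms   # rough upper bound on scaling efficiency
```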

The next post connects this directly to PyTorch DDP and how gradient synchronization is actually scheduled during backward.