Why overlap matters

At scale, communication is unavoidable. The practical goal is to keep communication from extending the critical path of each step: while the GPUs compute, the network should already be moving data. That is what overlap is about.

Typical forms include:

  • overlapping backward compute with gradient all-reduce
  • overlapping parameter prefetch with upcoming compute
  • overlapping reduce-scatter with optimizer preparation
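The first pattern is what frameworks like PyTorch DDP implement with bucketed asynchronous all-reduce: as soon as a bucket of gradients is ready during backward, its collective is launched while later layers keep computing. The sketch below simulates that scheduling in plain Python threads, with time.sleep standing in for compute kernels and network transfers; the layer names, costs, and helper functions are illustrative, not a real framework API.

```python
import threading
import time

# Illustrative per-layer backward cost and per-bucket all-reduce cost (seconds).
COMPUTE_S = {"layer3": 0.02, "layer2": 0.02, "layer1": 0.02}
COMM_S = 0.02

def all_reduce_async(bucket, done_events):
    """Stand-in for an async collective: 'communicates' on a background thread."""
    def run():
        time.sleep(COMM_S)           # pretend the network is busy
        done_events[bucket].set()    # signal this bucket's gradients are reduced
    t = threading.Thread(target=run)
    t.start()
    return t

def backward_with_overlap():
    """Backward pass that launches each bucket's all-reduce as soon as it is ready."""
    done = {name: threading.Event() for name in COMPUTE_S}
    threads = []
    for name, cost in COMPUTE_S.items():              # backward visits the last layer first
        time.sleep(cost)                              # compute this layer's gradients
        threads.append(all_reduce_async(name, done))  # launch comm immediately, keep computing
    for t in threads:
        t.join()                                      # only the final bucket's comm is exposed
    return all(e.is_set() for e in done.values())
```

Run sequentially, this workload would cost compute plus all communication (about 0.12 s here); with overlap, only the last bucket's all-reduce sits on the critical path (about 0.08 s).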

Why it often underperforms

Overlap only works when runtime structure supports it. It breaks down when:

  • buckets become ready too late
  • kernels are too short to hide communication
  • too many small collectives are launched
  • straggler ranks force the other ranks to wait at the collective anyway

So overlap is a runtime and scheduling problem as much as a communication problem.
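The "too many small collectives" failure mode is usually addressed by bucketing: packing many small gradient tensors into a few flat buffers so each collective launch amortizes its fixed cost. This is the idea behind DDP's bucket_cap_mb setting. A minimal greedy packer, assuming gradients are described as (name, element count) pairs, might look like:

```python
def bucket_gradients(grads, bucket_bytes, elem_size=4):
    """Greedily pack per-tensor gradients into fixed-size buckets so that
    one collective covers many small tensors instead of one each.

    grads: list of (name, numel) pairs in the order gradients become ready.
    bucket_bytes: target bucket capacity in bytes.
    elem_size: bytes per element (4 for fp32 gradients).
    """
    buckets, current, current_bytes = [], [], 0
    for name, numel in grads:
        nbytes = numel * elem_size
        # Close the current bucket once adding this tensor would overflow it.
        if current and current_bytes + nbytes > bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets
```

Note the trade-off this exposes: larger buckets mean fewer launches but later readiness, which is exactly the "buckets become ready too late" failure above. Bucket size is a tuning knob, not a free win.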

What good overlap looks like

In a timeline, good overlap shows compute kernels and NCCL activity interleaving with minimal idle gaps. Bad overlap looks more like a large communication block attached after backward compute has largely finished.
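One way to make "minimal idle gaps" quantitative is to measure exposed communication: the portion of collective time not covered by any compute activity. Given compute-busy and communication intervals pulled from a profiler trace, a small interval-subtraction routine recovers it (the function names and trace format here are assumptions, not a particular profiler's API):

```python
def merge(intervals):
    """Merge overlapping [start, end) intervals into a sorted disjoint list."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return merged

def exposed_comm(compute, comm):
    """Total communication time NOT hidden under compute, in the same time
    units as the input intervals (e.g. microseconds from a trace)."""
    busy = merge(compute)
    total = 0.0
    for s, e in merge(comm):
        cursor = s
        for bs, be in busy:
            if be <= cursor or bs >= e:
                continue                  # this compute interval cannot cover [cursor, e)
            if bs > cursor:
                total += bs - cursor      # gap before compute resumes is exposed
            cursor = max(cursor, be)
            if cursor >= e:
                break
        total += max(0.0, e - cursor)     # tail of the comm interval not covered
    return total
```

With compute busy over [0, 10) and an all-reduce over [5, 15), this reports 5 units exposed; a fully hidden collective reports 0. Tracking exposed communication per step is a more honest health metric than total communication volume.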

The next post turns to checkpointing and fault tolerance, because long-running distributed jobs need to be recoverable, not just fast.