Why overlap matters

At scale, communication is unavoidable. The practical goal is to keep communication from extending the critical path of each step: while the GPUs compute, the network should already be moving data. That is what overlap is about.

Typical forms include:

  • overlapping backward compute with gradient all-reduce
  • overlapping parameter prefetch with upcoming compute
  • overlapping reduce-scatter with optimizer preparation
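The first pattern is what frameworks like PyTorch DDP implement with bucketed asynchronous all-reduce: as soon as a bucket of gradients is ready during backward, its collective is launched while later layers keep computing. The sketch below simulates that scheduling in plain Python threads, with time.sleep standing in for compute kernels and network transfers; the layer names, costs, and helper functions are illustrative, not a real framework API.

```python
import threading
import time

# Illustrative per-layer backward cost and per-bucket all-reduce cost (seconds).
COMPUTE_S = {"layer3": 0.02, "layer2": 0.02, "layer1": 0.02}
COMM_S = 0.02

def all_reduce_async(bucket, done_events):
    """Stand-in for an async collective: 'communicates' on a background thread."""
    def run():
        time.sleep(COMM_S)           # pretend the network is busy
        done_events[bucket].set()    # signal this bucket's gradients are reduced
    t = threading.Thread(target=run)
    t.start()
    return t

def backward_with_overlap():
    """Backward pass that launches each bucket's all-reduce as soon as it is ready."""
    done = {name: threading.Event() for name in COMPUTE_S}
    threads = []
    for name, cost in COMPUTE_S.items():              # backward visits the last layer first
        time.sleep(cost)                              # compute this layer's gradients
        threads.append(all_reduce_async(name, done))  # launch comm immediately, keep computing
    for t in threads:
        t.join()                                      # only the final bucket's comm is exposed
    return all(e.is_set() for e in done.values())
```

Run sequentially, this workload would cost compute plus all communication (about 0.12 s here); with overlap, only the last bucket's all-reduce sits on the critical path (about 0.08 s).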

Why it often underperforms

Overlap only works when runtime structure supports it. It breaks down when:

  • buckets become ready too late
  • kernels are too short to hide communication
  • too many small collectives are launched
  • straggler ranks force the other ranks to wait at the collective anyway

So overlap is a runtime and scheduling problem as much as a communication problem.
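The "too many small collectives" failure mode is usually addressed by bucketing: packing many small gradient tensors into a few flat buffers so each collective launch amortizes its fixed cost. This is the idea behind DDP's bucket_cap_mb setting. A minimal greedy packer, assuming gradients are described as (name, element count) pairs, might look like:

```python
def bucket_gradients(grads, bucket_bytes, elem_size=4):
    """Greedily pack per-tensor gradients into fixed-size buckets so that
    one collective covers many small tensors instead of one each.

    grads: list of (name, numel) pairs in the order gradients become ready.
    bucket_bytes: target bucket capacity in bytes.
    elem_size: bytes per element (4 for fp32 gradients).
    """
    buckets, current, current_bytes = [], [], 0
    for name, numel in grads:
        nbytes = numel * elem_size
        # Close the current bucket once adding this tensor would overflow it.
        if current and current_bytes + nbytes > bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets
```

Note the trade-off this exposes: larger buckets mean fewer launches but later readiness, which is exactly the "buckets become ready too late" failure above. Bucket size is a tuning knob, not a free win.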

What good overlap looks like

In a timeline, good overlap shows compute kernels and NCCL activity interleaving with minimal idle gaps. Bad overlap looks more like a large communication block attached after backward compute has largely finished.
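One way to make "minimal idle gaps" quantitative is to measure exposed communication: the portion of collective time not covered by any compute activity. Given compute-busy and communication intervals pulled from a profiler trace, a small interval-subtraction routine recovers it (the function names and trace format here are assumptions, not a particular profiler's API):

```python
def merge(intervals):
    """Merge overlapping [start, end) intervals into a sorted disjoint list."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return merged

def exposed_comm(compute, comm):
    """Total communication time NOT hidden under compute, in the same time
    units as the input intervals (e.g. microseconds from a trace)."""
    busy = merge(compute)
    total = 0.0
    for s, e in merge(comm):
        cursor = s
        for bs, be in busy:
            if be <= cursor or bs >= e:
                continue                  # this compute interval cannot cover [cursor, e)
            if bs > cursor:
                total += bs - cursor      # gap before compute resumes is exposed
            cursor = max(cursor, be)
            if cursor >= e:
                break
        total += max(0.0, e - cursor)     # tail of the comm interval not covered
    return total
```

With compute busy over [0, 10) and an all-reduce over [5, 15), this reports 5 units exposed; a fully hidden collective reports 0. Tracking exposed communication per step is a more honest health metric than total communication volume.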

The next post turns to checkpointing and fault tolerance, because long-running distributed jobs need to be recoverable, not just fast.