Distributed LLM Training 16 - How Communication Overlap Hides Step Time
The goal of overlap is not to eliminate communication entirely, but to make it finish underneath useful computation.
Why overlap matters
At scale, communication is unavoidable. The practical goal is to keep communication from extending the critical path of each step. That is what overlap is about.
Typical forms include:
- overlapping backward compute with gradient all-reduce
- overlapping parameter prefetch with upcoming compute
- overlapping reduce-scatter with optimizer preparation
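The backward/all-reduce case can be sketched as a toy timing model. This is not real NCCL: `time.sleep` stands in for both compute kernels and collectives, and a one-worker thread pool plays the role of an async communication stream. The durations and bucket count are made-up parameters for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumed toy parameters: backward compute per bucket and all-reduce per bucket.
COMPUTE_S = 0.05
COMM_S = 0.04
NUM_BUCKETS = 4

def run_step(overlap: bool) -> float:
    """Return wall-clock step time with or without comm/compute overlap."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as comm_stream:
        handles = []
        for _ in range(NUM_BUCKETS):
            time.sleep(COMPUTE_S)                        # backward compute for one bucket
            h = comm_stream.submit(time.sleep, COMM_S)   # launch "async all-reduce"
            if overlap:
                handles.append(h)                        # keep computing, wait later
            else:
                h.result()                               # block immediately: no overlap
        for h in handles:                                # drain in-flight collectives
            h.result()
    return time.perf_counter() - start

serial = run_step(overlap=False)
overlapped = run_step(overlap=True)
print(f"serial: {serial:.2f}s  overlapped: {overlapped:.2f}s")
```

In the overlapped run, every all-reduce except the last one finishes underneath the next bucket's compute, so only the tail collective extends the step.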
Why it often underperforms
Overlap only works when runtime structure supports it. It breaks down when:
- buckets become ready too late
- kernels are too short to hide communication
- too many small collectives are launched
- a straggler rank forces the other ranks to wait at the next collective anyway
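The "buckets become ready too late" failure mode is easy to quantify with a small model. The sketch below assumes collectives run serially on a single communication stream and asks how much communication spills past the end of backward; the function name and inputs are illustrative, not a real profiler API.

```python
def exposed_comm(ready, comm, backward_end):
    """Time the step is extended past backward because communication
    could not be hidden. Each collective starts when its bucket is
    ready and the comm stream is free; collectives run serially."""
    t = 0.0
    for r, c in sorted(zip(ready, comm)):
        t = max(t, r) + c
    return max(0.0, t - backward_end)

# Buckets ready early and evenly: all communication hides under compute.
print(exposed_comm(ready=[1, 2, 3], comm=[1, 1, 1], backward_end=6))  # → 0.0
# Last bucket ready only as backward ends: its collective is fully exposed.
print(exposed_comm(ready=[1, 2, 6], comm=[1, 1, 1], backward_end=6))  # → 1.0
```

The same model shows why many tiny collectives hurt: each one adds launch latency to `comm`, pushing the last finish time past `backward_end`.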
So overlap is a runtime and scheduling problem as much as a communication problem.
What good overlap looks like
In a timeline, good overlap shows compute kernels and NCCL activity interleaving with minimal idle gaps. Bad overlap looks more like a large communication block attached after backward compute has largely finished.
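One way to turn that visual judgment into a number is to compute, from `(start, end)` intervals pulled out of a profiler trace, what fraction of communication time is covered by compute. This is a sketch under assumptions: interval lists are hypothetical trace exports, and communication intervals are taken to be non-overlapping (one comm stream).

```python
def merge(intervals):
    """Merge possibly-overlapping (start, end) intervals."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return merged

def hidden_fraction(comm, compute):
    """Fraction of total communication time covered by compute kernels."""
    compute = merge(compute)  # avoid double-counting overlapping compute
    total = sum(e - s for s, e in comm)
    hidden = sum(max(0, min(e, ce) - max(s, cs))
                 for s, e in comm for cs, ce in compute)
    return hidden / total if total else 1.0

# Good overlap: the collective runs entirely underneath compute.
print(hidden_fraction(comm=[(1, 3)], compute=[(0, 4)]))  # → 1.0
# Bad overlap: a communication block attached after compute finished.
print(hidden_fraction(comm=[(4, 6)], compute=[(0, 4)]))  # → 0.0
```

A hidden fraction near 1.0 matches the "interleaved, minimal idle gaps" picture above; values well below 1.0 mean the step time still pays for communication directly.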
The next post turns to checkpointing and fault tolerance, because long-running distributed jobs need to be recoverable, not just fast.