Distributed LLM Training 18 - Deadlocks, Timeouts, and OOMs: Debugging Distributed Training
Debugging distributed training is about narrowing down which rank, which collective, and which state transition went wrong.
Why distributed debugging is harder
In a single process, failure is local: a stack trace points at the code that failed. In distributed training, one rank's problem can stall the whole job, and many different problems look similar from the outside:
- everything appears stuck
- a timeout fires
- one rank hits OOM
- some rank reaches a collective later than the others
So debugging has to be structural, not just reactive.
The first split to make
Ask first whether the issue is mainly:
- a correctness issue
- a performance issue
- an environment or systems issue
For example, a timeout could come from a model bug, an uneven input pipeline, or a communication-layer issue.
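Before classifying anything, it helps to make the job louder. As a hedged sketch, assuming a PyTorch + NCCL setup (exact variable names vary across versions), these debug switches turn silent hangs into informative logs:

```shell
# Crank up diagnostics before reproducing the failure.
export NCCL_DEBUG=INFO                    # NCCL logs communicator setup and transport choices
export TORCH_DISTRIBUTED_DEBUG=DETAIL     # PyTorch checks and logs collective consistency per rank
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1  # surface async NCCL errors instead of hanging
# (older PyTorch versions use NCCL_ASYNC_ERROR_HANDLING for the last one)
```

With these set, a reproduction run usually tells you which of the three buckets you are in before you read a single line of model code.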
Common failure patterns
Collective mismatch
One rank enters a collective while another rank is still somewhere else, for example because a conditional skipped a step on that rank. The collective can never complete, so the job hangs until a timeout fires.
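To make this concrete, per-rank logs of collective calls can be compared offline to find the first point of divergence. A minimal pure-Python sketch, where the trace format and the `first_divergence` helper are hypothetical stand-ins for whatever logging you add before each collective:

```python
def first_divergence(traces):
    """Given {rank: [collective names in call order]}, return the first
    step index where ranks disagree, plus what each rank called there.
    Returns None if all traces agree."""
    max_len = max(len(t) for t in traces.values())
    for step in range(max_len):
        calls = {r: (t[step] if step < len(t) else None)
                 for r, t in traces.items()}
        if len(set(calls.values())) > 1:
            return step, calls
    return None

# Rank 1 skipped a barrier (e.g. an early `continue` on an empty batch):
traces = {
    0: ["allreduce", "barrier", "allreduce"],
    1: ["allreduce", "allreduce"],
}
print(first_divergence(traces))  # → (1, {0: 'barrier', 1: 'allreduce'})
```

The point is not the helper itself but the discipline: if every rank logs a counter and a tag before each collective, the first mismatched line identifies both the guilty rank and the guilty code path.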
Stragglers or uneven input
One rank consistently lags due to data length, I/O, or preprocessing variation, so every synchronization point ends up waiting for it.
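A low-tech way to confirm a straggler is to log per-step wall time on every rank and compare the means offline. A sketch under that assumption, where `find_stragglers` and the timing format are hypothetical:

```python
import statistics

def find_stragglers(step_times, factor=1.5):
    """step_times: {rank: [seconds per step]}. Flags ranks whose mean
    step time exceeds `factor` x the median of per-rank means."""
    means = {r: statistics.fmean(t) for r, t in step_times.items()}
    median = statistics.median(means.values())
    return [r for r, m in means.items() if m > factor * median]

# Rank 2 is consistently ~2.5x slower than its peers:
times = {0: [1.0, 1.1], 1: [1.0, 1.0], 2: [2.4, 2.6], 3: [0.9, 1.1]}
print(find_stragglers(times))  # → [2]
```

Comparing against the median rather than the mean matters here: one slow rank drags the mean up and can hide itself behind its own influence on the threshold.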
Rank-local memory behavior
Average memory use across the job may look fine while one specific rank hits an OOM-triggering peak due to fragmentation or load asymmetry.
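This is why memory should be inspected per rank, never as an average. In a PyTorch job the per-rank peaks would come from something like `torch.cuda.max_memory_allocated()` gathered to one place; the analysis itself is plain arithmetic (the helper and the numbers here are illustrative):

```python
def memory_report(peaks_gb, capacity_gb=80.0):
    """peaks_gb: {rank: peak memory in GB}. Flags any rank near capacity
    even when the mean looks comfortable."""
    mean = sum(peaks_gb.values()) / len(peaks_gb)
    worst = max(peaks_gb, key=peaks_gb.get)
    return {"mean_gb": mean,
            "worst_rank": worst,
            "worst_gb": peaks_gb[worst],
            "at_risk": peaks_gb[worst] > 0.95 * capacity_gb}

# Mean ~65 GB looks safe on an 80 GB device, but rank 2 is about to OOM:
peaks = {0: 61.0, 1: 60.5, 2: 79.8, 3: 60.2}
print(memory_report(peaks))
```

A dashboard that plots only the job-wide mean would show a comfortable margin right up until rank 2 crashes the run.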
A practical narrowing strategy
- reduce to the smallest world size that still reproduces the issue
- check whether the failure is deterministic
- identify the first rank that diverges or stalls
- narrow the code region around the last successful collective or log boundary
- separate the data path from the communication path
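The last step, separating the data path from the communication path, can be as simple as swapping the real dataloader for constant synthetic batches: if the failure disappears, suspect the input pipeline; if it persists, suspect the model or communication layer. A minimal sketch, where `synthetic_loader` and the batch shape are made up for illustration:

```python
import itertools

def synthetic_loader(batch, num_steps):
    """Yields the same batch `num_steps` times, standing in for the real
    dataloader so every rank sees identical, constant-shaped input."""
    return itertools.islice(itertools.repeat(batch), num_steps)

# Fixed-shape dummy batch; in a real job this would match the model's inputs.
fake_batch = {"input_ids": [0] * 128, "labels": [0] * 128}
steps = list(synthetic_loader(fake_batch, 3))
print(len(steps))  # → 3 identical batches
```

Because every rank now takes exactly the same number of identically shaped steps, this swap also rules out the straggler and collective-mismatch patterns above in one move.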
The next post looks at Megatron-LM and DeepSpeed as frameworks, with a focus on what they abstract away and what they still leave to the engineer.