Distributed LLM Training 18 - Deadlocks, Timeouts, and OOMs: Debugging Distributed Training
Debugging distributed training is about narrowing down which rank, which collective, and which state transition went wrong.
Why distributed debugging is harder
In a single process, failure is local: a stack trace points at the code that failed. In distributed training, one rank's problem can stall the whole job, and many different problems look similar from the outside:
- everything appears stuck
- a timeout fires
- one rank hits OOM
- some rank reaches a collective later than the others
So debugging has to be structural, not just reactive.
The first split to make
Ask first whether the issue is mainly:
- a correctness issue
- a performance issue
- an environment or systems issue
For example, a timeout could come from a model bug, an uneven input pipeline, or a communication-layer issue.
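Before classifying anything, it helps to make the job louder. As a hedged sketch, assuming a PyTorch + NCCL setup (exact variable names vary across versions), these debug switches turn silent hangs into informative logs:

```shell
# Crank up diagnostics before reproducing the failure.
export NCCL_DEBUG=INFO                    # NCCL logs communicator setup and transport choices
export TORCH_DISTRIBUTED_DEBUG=DETAIL     # PyTorch checks and logs collective consistency per rank
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1  # surface async NCCL errors instead of hanging
# (older PyTorch versions use NCCL_ASYNC_ERROR_HANDLING for the last one)
```

With these set, a reproduction run usually tells you which of the three buckets you are in before you read a single line of model code.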
Common failure patterns
Collective mismatch
One rank enters a collective while another rank is still somewhere else, for example because a conditional skipped a step on that rank. The collective can never complete, so the job hangs until a timeout fires.
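To make this concrete, per-rank logs of collective calls can be compared offline to find the first point of divergence. A minimal pure-Python sketch, where the trace format and the `first_divergence` helper are hypothetical stand-ins for whatever logging you add before each collective:

```python
def first_divergence(traces):
    """Given {rank: [collective names in call order]}, return the first
    step index where ranks disagree, plus what each rank called there.
    Returns None if all traces agree."""
    max_len = max(len(t) for t in traces.values())
    for step in range(max_len):
        calls = {r: (t[step] if step < len(t) else None)
                 for r, t in traces.items()}
        if len(set(calls.values())) > 1:
            return step, calls
    return None

# Rank 1 skipped a barrier (e.g. an early `continue` on an empty batch):
traces = {
    0: ["allreduce", "barrier", "allreduce"],
    1: ["allreduce", "allreduce"],
}
print(first_divergence(traces))  # → (1, {0: 'barrier', 1: 'allreduce'})
```

The point is not the helper itself but the discipline: if every rank logs a counter and a tag before each collective, the first mismatched line identifies both the guilty rank and the guilty code path.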
Stragglers or uneven input
One rank consistently lags due to data length, I/O, or preprocessing variation, so every synchronization point ends up waiting for it.
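A low-tech way to confirm a straggler is to log per-step wall time on every rank and compare the means offline. A sketch under that assumption, where `find_stragglers` and the timing format are hypothetical:

```python
import statistics

def find_stragglers(step_times, factor=1.5):
    """step_times: {rank: [seconds per step]}. Flags ranks whose mean
    step time exceeds `factor` x the median of per-rank means."""
    means = {r: statistics.fmean(t) for r, t in step_times.items()}
    median = statistics.median(means.values())
    return [r for r, m in means.items() if m > factor * median]

# Rank 2 is consistently ~2.5x slower than its peers:
times = {0: [1.0, 1.1], 1: [1.0, 1.0], 2: [2.4, 2.6], 3: [0.9, 1.1]}
print(find_stragglers(times))  # → [2]
```

Comparing against the median rather than the mean matters here: one slow rank drags the mean up and can hide itself behind its own influence on the threshold.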
Rank-local memory behavior
Average memory use across the job may look fine while one specific rank hits an OOM-triggering peak due to fragmentation or load asymmetry.
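This is why memory should be inspected per rank, never as an average. In a PyTorch job the per-rank peaks would come from something like `torch.cuda.max_memory_allocated()` gathered to one place; the analysis itself is plain arithmetic (the helper and the numbers here are illustrative):

```python
def memory_report(peaks_gb, capacity_gb=80.0):
    """peaks_gb: {rank: peak memory in GB}. Flags any rank near capacity
    even when the mean looks comfortable."""
    mean = sum(peaks_gb.values()) / len(peaks_gb)
    worst = max(peaks_gb, key=peaks_gb.get)
    return {"mean_gb": mean,
            "worst_rank": worst,
            "worst_gb": peaks_gb[worst],
            "at_risk": peaks_gb[worst] > 0.95 * capacity_gb}

# Mean ~65 GB looks safe on an 80 GB device, but rank 2 is about to OOM:
peaks = {0: 61.0, 1: 60.5, 2: 79.8, 3: 60.2}
print(memory_report(peaks))
```

A dashboard that plots only the job-wide mean would show a comfortable margin right up until rank 2 crashes the run.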
A practical narrowing strategy
- reduce to the smallest world size that still reproduces the issue
- check whether the failure is deterministic
- identify the first rank that diverges or stalls
- narrow the code region around the last successful collective or log boundary
- separate the data path from the communication path
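The last step, separating the data path from the communication path, can be as simple as swapping the real dataloader for constant synthetic batches: if the failure disappears, suspect the input pipeline; if it persists, suspect the model or communication layer. A minimal sketch, where `synthetic_loader` and the batch shape are made up for illustration:

```python
import itertools

def synthetic_loader(batch, num_steps):
    """Yields the same batch `num_steps` times, standing in for the real
    dataloader so every rank sees identical, constant-shaped input."""
    return itertools.islice(itertools.repeat(batch), num_steps)

# Fixed-shape dummy batch; in a real job this would match the model's inputs.
fake_batch = {"input_ids": [0] * 128, "labels": [0] * 128}
steps = list(synthetic_loader(fake_batch, 3))
print(len(steps))  # → 3 identical batches
```

Because every rank now takes exactly the same number of identically shaped steps, this swap also rules out the straggler and collective-mismatch patterns above in one move.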
The next post looks at Megatron-LM and DeepSpeed as frameworks, with a focus on what they abstract away and what they still leave to the engineer.