Why distributed debugging is harder

In a single process, a failure is local. In distributed training, one rank's problem can stall the whole job, and many different root causes look the same from the outside:

  • everything appears stuck
  • a timeout fires
  • one rank hits OOM
  • some rank reaches a collective later than the others

So debugging has to be structural, not just reactive.

The first split to make

Ask first whether the issue is mainly:

  • a correctness issue
  • a performance issue
  • an environment or systems issue

For example, a collective timeout could stem from a model bug (correctness), an uneven input pipeline (performance), or a flaky interconnect (environment) — the symptom alone doesn't tell you which.

Common failure patterns

collective mismatch

One rank enters a collective while another rank is still elsewhere in the program, so the operation can never complete; from the outside it looks like a hang until a timeout fires.
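A minimal sketch of the pattern, using plain Python threads to simulate two ranks (no distributed framework assumed): rank 0 enters a barrier, but when rank 1 takes a divergent branch and skips it, rank 0's wait can only time out.

```python
# Simulated collective mismatch: two "ranks" as threads, a Barrier as the
# stand-in for an allreduce/barrier collective. Names are illustrative.
import threading

def run_ranks(rank1_enters_barrier: bool, timeout: float = 0.5) -> str:
    barrier = threading.Barrier(2)  # both ranks must arrive for it to release
    result = {}

    def rank0():
        try:
            barrier.wait(timeout=timeout)
            result["rank0"] = "completed"
        except threading.BrokenBarrierError:
            # in a real job this is the rank that reports a collective timeout
            result["rank0"] = "timeout"

    def rank1():
        if rank1_enters_barrier:
            try:
                barrier.wait(timeout=timeout)
            except threading.BrokenBarrierError:
                pass
        # else: rank 1 went down a different code path and never arrives

    t0, t1 = threading.Thread(target=rank0), threading.Thread(target=rank1)
    t0.start(); t1.start(); t0.join(); t1.join()
    return result["rank0"]

print(run_ranks(True))   # -> completed
print(run_ranks(False))  # -> timeout
```

Note that the timed-out rank is not necessarily the buggy one: it may simply be the first to notice that a peer never arrived.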

stragglers or uneven input

One rank consistently lags due to data length, I/O, or preprocessing variation.
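One way to make "consistently lags" concrete is to compare per-rank step times parsed from logs. A hedged sketch, with illustrative names and made-up numbers (not a framework API): flag any rank whose median step time exceeds the across-rank median by a tolerance.

```python
# Straggler detection from per-rank step durations (e.g. parsed from logs).
from statistics import median

def find_stragglers(step_times: dict[int, list[float]], tol: float = 1.25) -> list[int]:
    per_rank = {r: median(t) for r, t in step_times.items()}  # robust per-rank summary
    baseline = median(per_rank.values())                      # typical rank
    return [r for r, m in sorted(per_rank.items()) if m > tol * baseline]

times = {
    0: [0.50, 0.52, 0.51],
    1: [0.49, 0.50, 0.51],
    2: [0.90, 0.95, 0.88],  # consistently slow: long sequences or slow I/O
    3: [0.51, 0.50, 0.52],
}
print(find_stragglers(times))  # -> [2]
```

Using medians rather than means keeps a single slow step (a GC pause, a checkpoint write) from flagging a healthy rank.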

rank-local memory behavior

Average memory may look fine while one specific rank hits a peak due to fragmentation or load asymmetry.
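A toy illustration of why averages mislead here (the numbers are made up; in a real PyTorch job the per-rank peaks would come from gathering `torch.cuda.max_memory_allocated()` across ranks):

```python
# Average peak memory across ranks vs. the single rank nearest its limit.
peak_mem_gb = {0: 31.0, 1: 30.5, 2: 39.6, 3: 30.8}  # rank -> peak GB (illustrative)
limit_gb = 40.0

avg = sum(peak_mem_gb.values()) / len(peak_mem_gb)
worst_rank = max(peak_mem_gb, key=peak_mem_gb.get)

print(f"average peak: {avg:.1f} GB")  # looks comfortable
print(f"rank {worst_rank}: {peak_mem_gb[worst_rank]:.1f} GB of {limit_gb} GB")  # one batch away from OOM
```

A dashboard showing the ~33 GB average suggests plenty of headroom, while rank 2 is one unlucky batch away from an OOM that takes the whole job down.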

A practical narrowing strategy

  1. reduce to the smallest world size that still reproduces the issue
  2. check whether the failure is deterministic
  3. identify the first rank that diverges or stalls
  4. narrow the code region around the last successful collective or log boundary
  5. separate the data path from the communication path
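Step 3 above can be mechanized from logs. A hedged sketch, assuming you can extract the last timestamp each rank logged (names are illustrative): the rank that went quiet first is usually the one to inspect, since the others may only be blocked waiting on it in a collective.

```python
# Identify the first rank to stall, given the last log timestamp per rank.
def first_stalled_rank(last_seen: dict[int, float]) -> int:
    # the rank with the oldest "last heard from" time stopped making progress first
    return min(last_seen, key=last_seen.get)

last_seen = {0: 1042.7, 1: 1042.9, 2: 981.3, 3: 1043.0}  # seconds into the job
print(first_stalled_rank(last_seen))  # -> 2
```

This is deliberately crude — it assumes all ranks log at a similar cadence — but it turns "stare at four hundred log files" into "start with rank 2".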

The next post looks at Megatron-LM and DeepSpeed as frameworks, with a focus on what they abstract and what they still leave to the engineer.