Long runs eventually fail

A short local experiment can feel stable. A long multi-node LLM training run usually does not stay that way: over days or weeks of wall-clock time, hardware faults, network problems, preemption, storage failures, and deployment mistakes all become realistic.

That is why a checkpoint-and-resume strategy is not optional.

What has to be saved

Reliable resume usually requires more than just model weights:

  • model parameters
  • optimizer state
  • scheduler state
  • global step or token counters
  • RNG state
  • enough data-loader or sampling state to resume consistently

If any of those are missing, a resumed run may silently diverge from the original: dropped RNG state changes dropout masks and shuffling, and dropped data-loader state can replay or skip batches.
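The checklist above can be sketched without any framework. This is a minimal, torch-free illustration of the principle (the names `save_checkpoint`, `load_checkpoint`, and the file `ckpt.pkl` are hypothetical); it uses Python's `random` state as a stand-in for the RNG, sampler, and counter state a real trainer would bundle alongside the weights:

```python
import pickle
import random

def save_checkpoint(path, step, params, rng):
    # Bundle everything needed to resume, not just the "weights".
    state = {
        "params": params,  # stand-in for model parameters
        "step": step,      # global step counter
        "rng": rng,        # RNG state, so sampling continues identically
    }
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# "Training": advance an RNG-driven process for a few steps.
random.seed(0)
params = 0.0
for step in range(5):
    params += random.random()
save_checkpoint("ckpt.pkl", step=5, params=params, rng=random.getstate())

# Continue the original run for 3 more steps.
for _ in range(3):
    params += random.random()
original = params

# Resume from the checkpoint, restoring the RNG state too.
state = load_checkpoint("ckpt.pkl")
random.setstate(state["rng"])
resumed = state["params"]
for _ in range(3):
    resumed += random.random()

assert resumed == original  # bit-identical continuation
```

Drop the `random.setstate` line and the final assertion fails: the resumed run draws a different sample stream, which is exactly the silent divergence described above.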

Why sharded training complicates this

With FSDP or ZeRO-style sharding, checkpointing itself has design choices:

  • save a full, gathered state dict (portable, but gathering is expensive at scale)
  • save sharded state directly (fast parallel writes, but tied to the sharding layout)
  • decide where to pay the gather/reshard cost: at save time, at load time, or offline

So checkpoint format is a runtime-cost design decision, not just a correctness one.
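The trade-off can be shown with a toy, single-process stand-in for what frameworks such as FSDP or ZeRO implementations do across ranks (the world size, shard layout, and file names here are invented for illustration): option A gathers everything into one portable file, option B writes one file per rank.

```python
import os
import pickle
import tempfile

WORLD_SIZE = 4
# Each "rank" owns a contiguous shard of a flat parameter vector.
full_params = list(range(16))
shards = {r: full_params[r * 4:(r + 1) * 4] for r in range(WORLD_SIZE)}

tmp = tempfile.mkdtemp()

# Option A: gather a full state and write one file.
# Pays an all-to-one gather at save time; any world size can
# load the result later without re-sharding logic.
gathered = [x for r in range(WORLD_SIZE) for x in shards[r]]
with open(os.path.join(tmp, "full.pkl"), "wb") as f:
    pickle.dump(gathered, f)

# Option B: each rank writes its own shard.
# Saves are parallel and cheap, but loading into a different
# world size pushes the reshard cost to load time.
for r in range(WORLD_SIZE):
    with open(os.path.join(tmp, f"shard_{r}.pkl"), "wb") as f:
        pickle.dump(shards[r], f)

# Either path reconstructs the same parameters.
with open(os.path.join(tmp, "full.pkl"), "rb") as f:
    from_full = pickle.load(f)
from_shards = []
for r in range(WORLD_SIZE):
    with open(os.path.join(tmp, f"shard_{r}.pkl"), "rb") as f:
        from_shards.extend(pickle.load(f))

assert from_full == from_shards == full_params
```

Real implementations make the same choice explicit; in PyTorch, for example, FSDP's state-dict configuration and the `torch.distributed.checkpoint` package let you pick between full and sharded formats rather than hiding the cost.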

The next post looks at debugging distributed training, where deadlocks, timeouts, and rank-local failures need to be narrowed down systematically.