Long runs eventually fail

A short local experiment can feel stable. A long multi-node LLM training run usually does not stay that way: over days or weeks of wall-clock time, hardware faults, network problems, preemption, storage failures, and deployment mistakes all become realistic.

That is why a checkpoint-and-resume strategy is not optional.

What has to be saved

Reliable resume usually requires more than just model weights:

  • model parameters
  • optimizer state
  • scheduler state
  • global step or token counters
  • RNG state
  • enough data-loader or sampling state to resume consistently

If any of those are missing, a resumed run may silently diverge from the original: dropped RNG state changes dropout masks and shuffling, and dropped data-loader state can replay or skip batches.
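The checklist above can be sketched without any framework. This is a minimal, torch-free illustration of the principle (the names `save_checkpoint`, `load_checkpoint`, and the file `ckpt.pkl` are hypothetical); it uses Python's `random` state as a stand-in for the RNG, sampler, and counter state a real trainer would bundle alongside the weights:

```python
import pickle
import random

def save_checkpoint(path, step, params, rng):
    # Bundle everything needed to resume, not just the "weights".
    state = {
        "params": params,  # stand-in for model parameters
        "step": step,      # global step counter
        "rng": rng,        # RNG state, so sampling continues identically
    }
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# "Training": advance an RNG-driven process for a few steps.
random.seed(0)
params = 0.0
for step in range(5):
    params += random.random()
save_checkpoint("ckpt.pkl", step=5, params=params, rng=random.getstate())

# Continue the original run for 3 more steps.
for _ in range(3):
    params += random.random()
original = params

# Resume from the checkpoint, restoring the RNG state too.
state = load_checkpoint("ckpt.pkl")
random.setstate(state["rng"])
resumed = state["params"]
for _ in range(3):
    resumed += random.random()

assert resumed == original  # bit-identical continuation
```

Drop the `random.setstate` line and the final assertion fails: the resumed run draws a different sample stream, which is exactly the silent divergence described above.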

Why sharded training complicates this

With FSDP or ZeRO-style sharding, checkpointing itself has design choices:

  • save a full, gathered state dict (portable, but gathering is expensive at scale)
  • save sharded state directly (fast parallel writes, but tied to the sharding layout)
  • decide where to pay the gather/reshard cost: at save time, at load time, or offline

So checkpoint format is a runtime-cost design decision, not just a correctness one.
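The trade-off can be shown with a toy, single-process stand-in for what frameworks such as FSDP or ZeRO implementations do across ranks (the world size, shard layout, and file names here are invented for illustration): option A gathers everything into one portable file, option B writes one file per rank.

```python
import os
import pickle
import tempfile

WORLD_SIZE = 4
# Each "rank" owns a contiguous shard of a flat parameter vector.
full_params = list(range(16))
shards = {r: full_params[r * 4:(r + 1) * 4] for r in range(WORLD_SIZE)}

tmp = tempfile.mkdtemp()

# Option A: gather a full state and write one file.
# Pays an all-to-one gather at save time; any world size can
# load the result later without re-sharding logic.
gathered = [x for r in range(WORLD_SIZE) for x in shards[r]]
with open(os.path.join(tmp, "full.pkl"), "wb") as f:
    pickle.dump(gathered, f)

# Option B: each rank writes its own shard.
# Saves are parallel and cheap, but loading into a different
# world size pushes the reshard cost to load time.
for r in range(WORLD_SIZE):
    with open(os.path.join(tmp, f"shard_{r}.pkl"), "wb") as f:
        pickle.dump(shards[r], f)

# Either path reconstructs the same parameters.
with open(os.path.join(tmp, "full.pkl"), "rb") as f:
    from_full = pickle.load(f)
from_shards = []
for r in range(WORLD_SIZE):
    with open(os.path.join(tmp, f"shard_{r}.pkl"), "rb") as f:
        from_shards.extend(pickle.load(f))

assert from_full == from_shards == full_params
```

Real implementations make the same choice explicit; in PyTorch, for example, FSDP's state-dict configuration and the `torch.distributed.checkpoint` package let you pick between full and sharded formats rather than hiding the cost.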

The next post looks at debugging distributed training, where deadlocks, timeouts, and rank-local failures need to be narrowed down systematically.