Why recompute at all

Backpropagation needs the intermediate activations produced during the forward pass. When those saved activations grow too large, GPU memory is exhausted before compute becomes the limit. Activation checkpointing stores fewer activations and recomputes the missing ones during the backward pass.

So the tradeoff is simple in principle:

  • save memory
  • spend more compute
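The mechanism can be sketched in a few lines of plain Python, with no framework involved. The names here (layer, forward_checkpointed, recompute_segment) are illustrative, not a real API: keep only every k-th activation on the forward pass, and re-run the forward from the nearest checkpoint when backward needs a missing intermediate.

```python
# Toy sketch of activation checkpointing (pure Python, no framework):
# forward through n layers, saving only every k-th activation,
# then recomputing the missing ones when "backward" needs them.

def layer(x, i):
    return x * 2 + i  # stand-in for an expensive layer

def forward_checkpointed(x0, n_layers, every):
    saved = {0: x0}            # checkpointed activations only
    x = x0
    for i in range(n_layers):
        x = layer(x, i)
        if (i + 1) % every == 0:
            saved[i + 1] = x
    return x, saved

def recompute_segment(saved, start, end):
    # Re-run the forward from the nearest checkpoint to rebuild the
    # activations needed by backward for layers start..end-1.
    x = saved[start]
    acts = [x]
    for i in range(start, end):
        x = layer(x, i)
        acts.append(x)
    return acts

out, saved = forward_checkpointed(1.0, n_layers=8, every=4)
print(len(saved))        # 3 tensors kept (input + 2 checkpoints), not 9
acts = recompute_segment(saved, 4, 8)
print(acts[-1] == out)   # True: recomputation reproduces the forward
```

The extra compute is exactly the re-run of each segment, which is the "spend more compute" half of the tradeoff.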

The real design question

It is better to frame checkpointing around which resource is actually scarce:

  • memory capacity
  • additional compute time
  • scheduling complexity

If memory is the hard wall, recomputation may be unavoidable. If compute is already the main bottleneck and memory is fine, it may be the wrong tradeoff.
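A rough cost model makes the tradeoff concrete. With a checkpoint every k layers, peak activation storage is about n/k checkpoints plus the k activations rebuilt inside one segment; the classic analysis minimizes this near k = sqrt(n), at the price of roughly one extra forward pass. This is a simplified sketch, and peak_stored is an illustrative function, not a real library call:

```python
# Simplified cost model: checkpoints every k layers keep ~n/k boundary
# activations, plus up to k activations live while one segment is
# recomputed. The sum is minimized near k = sqrt(n).
import math

def peak_stored(n_layers, k):
    return math.ceil(n_layers / k) + k

n = 64
best_k = min(range(1, n + 1), key=lambda k: peak_stored(n, k))
print(best_k, peak_stored(n, best_k))   # 8 16, i.e. k = sqrt(64)
```

Storing all 64 activations would need 64 slots; checkpointing at the optimum needs about 16, which is the sqrt(n) memory reduction that motivates the technique.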

Where it is usually applied

Checkpointing is often applied at block granularity or around major substructures such as attention or MLP sections. If the granularity is too fine, overhead and complexity grow. If it is too coarse, the memory benefit may be weak.
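Frameworks expose this at exactly that granularity, for example PyTorch's torch.utils.checkpoint, which wraps a module or segment. The sketch below stays in plain Python to show the block-granularity idea; Block and run_blocks are made-up names, not a real API. Only block inputs are saved, and everything inside a block is recomputed on backward:

```python
# Sketch of block-granularity checkpointing: each "block" groups
# several layers (e.g. an attention + MLP pair). Only the block input
# is saved; all layers inside the block are recomputed on backward.

class Block:
    def __init__(self, fns):
        self.fns = fns            # the layers inside this block

    def forward(self, x):
        for f in self.fns:
            x = f(x)
        return x

def run_blocks(blocks, x, checkpoint=True):
    saved = []                    # block-boundary activations only
    recompute_cost = 0
    for b in blocks:
        if checkpoint:
            saved.append(x)                # keep just the block input
            recompute_cost += len(b.fns)   # paid again during backward
        x = b.forward(x)
    return x, saved, recompute_cost

blocks = [Block([lambda x: x + 1] * 4) for _ in range(3)]
out, saved, cost = run_blocks(blocks, 0)
print(out, len(saved), cost)   # 12 3 12
```

Finer granularity (smaller blocks) saves more boundary activations and adds bookkeeping; coarser granularity (bigger blocks) saves fewer boundaries but recomputes more per segment, which is the tension described above.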

Why distributed settings make it harder

In multi-GPU training, checkpointing interacts with:

  • pipeline schedules
  • tensor-parallel activation layouts
  • communication timing

That makes it more than a local memory trick. It becomes part of the overall training-system design.

The next post moves to ZeRO, where memory reduction is handled by sharding replicated training state instead of only recomputing activations.