Distributed LLM Training 13 - Activation Checkpointing and the Cost of Recomputation
Saving memory by recomputing activations is not a minor option; it is often a central design choice in large-scale training
Why recompute at all
Backpropagation needs the intermediate activations produced during the forward pass. If those activations grow too large, memory runs out before compute does. Activation checkpointing stores only a subset of them and recomputes the rest during the backward pass.
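The mechanics can be shown with a toy two-operation "segment". This is a hand-rolled sketch, not a real framework API: the forward keeps only the segment input, and the backward reruns the forward from that input to rebuild the intermediate it needs.

```python
def forward_segment(x):
    # Stand-in for a block of layers: two ops with one intermediate.
    a1 = x * 2.0   # intermediate activation backward would normally store
    a2 = a1 + 3.0  # segment output
    return a1, a2

def checkpointed_forward(x):
    # Without checkpointing we would keep a1 for backward.
    # With checkpointing we keep only the segment input and drop a1.
    _, out = forward_segment(x)
    saved_input = x
    return out, saved_input

def checkpointed_backward(saved_input, grad_out):
    # Rerun the forward from the saved input to rebuild a1
    # (this extra forward IS the recompute cost), then apply
    # the usual chain rule through the segment.
    a1, _ = forward_segment(saved_input)
    grad_a1 = grad_out * 1.0  # d(a1 + 3)/d(a1) = 1
    grad_x = grad_a1 * 2.0    # d(x * 2)/dx = 2
    return grad_x

out, saved = checkpointed_forward(5.0)
grad = checkpointed_backward(saved, grad_out=1.0)
print(out, grad)  # 13.0 2.0
```

In a real framework the segment is an arbitrary sequence of layers and the recompute happens inside autograd, but the shape of the trade is the same: one value saved, one extra forward paid.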
So the tradeoff is simple in principle:
- save memory
- spend more compute
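The tradeoff can be made concrete with back-of-envelope arithmetic (illustrative numbers, not measurements). A standard scheme checkpoints every sqrt(L)-th layer of L uniform layers, which drops stored activations from O(L) to O(sqrt(L)) at the cost of roughly one extra forward pass:

```python
import math

L = 64              # layers (assumed)
per_layer_mb = 100  # activation memory per layer (assumed)

# Store everything: memory grows linearly with depth.
full_mb = L * per_layer_mb

# Checkpoint every sqrt(L)-th layer: keep the checkpoint boundaries
# plus the one segment currently being recomputed.
segment = int(math.sqrt(L))
ckpt_mb = (L // segment + segment) * per_layer_mb

# Compute cost: if backward costs roughly 2x a forward, one extra
# forward adds about a third to the step time.
extra_compute = 1 / (1 + 2)

print(full_mb, ckpt_mb)  # 6400 1600
```

A 4x memory reduction for ~33% more compute is a good deal when memory is the wall, and a bad one when it is not, which is the question the next section turns to.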
The real design question
Rather than asking whether to checkpoint in the abstract, frame the decision around which resource is actually scarce:
- memory capacity
- additional compute time
- scheduling complexity
If memory is the hard wall, recomputation may be unavoidable. If compute is already the main bottleneck and memory is fine, it may be the wrong tradeoff.
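That framing can be written down as a small planning heuristic. Everything here is hypothetical and illustrative (the function, its inputs, and the save-rate assumption are not from any framework); the point is only that the decision follows from the memory picture:

```python
def checkpoint_fraction(activation_gb, headroom_gb):
    """Rough fraction of blocks to checkpoint (0.0 = none).

    activation_gb: memory the full activation set would need (assumed known)
    headroom_gb:   free device memory after weights and optimizer state
    """
    if activation_gb <= headroom_gb:
        # Everything fits: checkpointing would only cost compute.
        return 0.0
    # Memory is the hard wall: checkpoint just enough blocks to fit,
    # assuming checkpointing a block frees roughly its activation share.
    deficit = activation_gb - headroom_gb
    return min(1.0, deficit / activation_gb)

print(checkpoint_fraction(40, 60))  # 0.0  -> compute stays the bottleneck
print(checkpoint_fraction(80, 60))  # 0.25 -> checkpoint about a quarter
```

Real systems complicate this with fragmentation, non-uniform blocks, and schedule interactions, but the first-order question stays the same: does the activation set fit or not.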
Where it is usually applied
Checkpointing is often applied at block granularity or around major substructures such as attention or MLP sections. If the granularity is too fine, overhead and complexity grow. If it is too coarse, the memory benefit may be weak.
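A toy accounting of stored tensors shows why granularity matters (illustrative, not a profiler; the per-block tensor count is an assumption):

```python
def stored_tensors(n_blocks, tensors_per_block, granularity):
    """Tensors held in memory for backward under each policy.

    'none'     -> store every intermediate (no checkpointing)
    'block'    -> store one boundary activation per transformer block
    'sublayer' -> store the attention and MLP boundaries (two per block)
    """
    if granularity == "none":
        return n_blocks * tensors_per_block
    if granularity == "block":
        return n_blocks
    if granularity == "sublayer":
        return n_blocks * 2
    raise ValueError(granularity)

for g in ("none", "block", "sublayer"):
    print(g, stored_tensors(n_blocks=32, tensors_per_block=12, granularity=g))
# none 384
# block 32
# sublayer 64
```

Finer granularity stores more boundaries but makes each recompute segment cheaper; coarser granularity stores less but reruns larger segments. The sweet spot depends on which tensors dominate, which is why block boundaries and the attention/MLP split are the common choices.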
Why distributed settings make it harder
In multi-GPU training, checkpointing interacts with:
- pipeline schedules
- tensor-parallel activation layouts
- communication timing
That makes it more than a local memory trick. It becomes part of the overall training-system design.
The next post moves to ZeRO, where memory reduction is handled by sharding replicated training state instead of only recomputing activations.