Why recompute at all

Backpropagation needs the intermediate activations produced during the forward pass. When those saved activations grow too large, GPU memory is exhausted before compute becomes the limit. Activation checkpointing stores fewer activations and recomputes the missing ones during the backward pass.

So the tradeoff is simple in principle:

  • save memory
  • spend more compute
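The mechanism can be sketched in a few lines of plain Python, with no framework involved. The names here (layer, forward_checkpointed, recompute_segment) are illustrative, not a real API: keep only every k-th activation on the forward pass, and re-run the forward from the nearest checkpoint when backward needs a missing intermediate.

```python
# Toy sketch of activation checkpointing (pure Python, no framework):
# forward through n layers, saving only every k-th activation,
# then recomputing the missing ones when "backward" needs them.

def layer(x, i):
    return x * 2 + i  # stand-in for an expensive layer

def forward_checkpointed(x0, n_layers, every):
    saved = {0: x0}            # checkpointed activations only
    x = x0
    for i in range(n_layers):
        x = layer(x, i)
        if (i + 1) % every == 0:
            saved[i + 1] = x
    return x, saved

def recompute_segment(saved, start, end):
    # Re-run the forward from the nearest checkpoint to rebuild the
    # activations needed by backward for layers start..end-1.
    x = saved[start]
    acts = [x]
    for i in range(start, end):
        x = layer(x, i)
        acts.append(x)
    return acts

out, saved = forward_checkpointed(1.0, n_layers=8, every=4)
print(len(saved))        # 3 tensors kept (input + 2 checkpoints), not 9
acts = recompute_segment(saved, 4, 8)
print(acts[-1] == out)   # True: recomputation reproduces the forward
```

The extra compute is exactly the re-run of each segment, which is the "spend more compute" half of the tradeoff.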

The real design question

It is better to frame checkpointing around which resource is actually scarce:

  • memory capacity
  • additional compute time
  • scheduling complexity

If memory is the hard wall, recomputation may be unavoidable. If compute is already the main bottleneck and memory is fine, it may be the wrong tradeoff.
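A rough cost model makes the tradeoff concrete. With a checkpoint every k layers, peak activation storage is about n/k checkpoints plus the k activations rebuilt inside one segment; the classic analysis minimizes this near k = sqrt(n), at the price of roughly one extra forward pass. This is a simplified sketch, and peak_stored is an illustrative function, not a real library call:

```python
# Simplified cost model: checkpoints every k layers keep ~n/k boundary
# activations, plus up to k activations live while one segment is
# recomputed. The sum is minimized near k = sqrt(n).
import math

def peak_stored(n_layers, k):
    return math.ceil(n_layers / k) + k

n = 64
best_k = min(range(1, n + 1), key=lambda k: peak_stored(n, k))
print(best_k, peak_stored(n, best_k))   # 8 16, i.e. k = sqrt(64)
```

Storing all 64 activations would need 64 slots; checkpointing at the optimum needs about 16, which is the sqrt(n) memory reduction that motivates the technique.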

Where it is usually applied

Checkpointing is often applied at block granularity or around major substructures such as attention or MLP sections. If the granularity is too fine, overhead and complexity grow. If it is too coarse, the memory benefit may be weak.
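Frameworks expose this at exactly that granularity, for example PyTorch's torch.utils.checkpoint, which wraps a module or segment. The sketch below stays in plain Python to show the block-granularity idea; Block and run_blocks are made-up names, not a real API. Only block inputs are saved, and everything inside a block is recomputed on backward:

```python
# Sketch of block-granularity checkpointing: each "block" groups
# several layers (e.g. an attention + MLP pair). Only the block input
# is saved; all layers inside the block are recomputed on backward.

class Block:
    def __init__(self, fns):
        self.fns = fns            # the layers inside this block

    def forward(self, x):
        for f in self.fns:
            x = f(x)
        return x

def run_blocks(blocks, x, checkpoint=True):
    saved = []                    # block-boundary activations only
    recompute_cost = 0
    for b in blocks:
        if checkpoint:
            saved.append(x)                # keep just the block input
            recompute_cost += len(b.fns)   # paid again during backward
        x = b.forward(x)
    return x, saved, recompute_cost

blocks = [Block([lambda x: x + 1] * 4) for _ in range(3)]
out, saved, cost = run_blocks(blocks, 0)
print(out, len(saved), cost)   # 12 3 12
```

Finer granularity (smaller blocks) saves more boundary activations and adds bookkeeping; coarser granularity (bigger blocks) saves fewer boundaries but recomputes more per segment, which is the tension described above.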

Why distributed settings make it harder

In multi-GPU training, checkpointing interacts with:

  • pipeline schedules
  • tensor-parallel activation layouts
  • communication timing

That makes it more than a local memory trick. It becomes part of the overall training-system design.

The next post moves to ZeRO, where memory reduction is handled by sharding replicated training state instead of only recomputing activations.