Stage partitioning is only half the story

Once a model is split into stages, the next question is how forward and backward work should be scheduled across them. The schedule affects:

  • bubble size
  • activation lifetime
  • implementation complexity
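The first of these, bubble size, is easy to quantify under simplifying assumptions. As an illustration (my own sketch, assuming uniform per-micro-batch compute and a simple fill-then-drain schedule), with p stages and m micro-batches each stage is idle for p - 1 of the m + p - 1 schedule slots:

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a fill-then-drain pipeline schedule.

    With p stages and m micro-batches, each stage is busy for m slots
    out of m + p - 1 total, so p - 1 slots are bubble. Assumes every
    micro-batch takes the same time on every stage.
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# Raising the micro-batch count shrinks the bubble for a fixed depth.
print(bubble_fraction(4, 4))   # 3/7  ~ 0.43
print(bubble_fraction(4, 16))  # 3/19 ~ 0.16
```

This is why schedules are usually run with many more micro-batches than stages.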

The intuition behind GPipe

A GPipe-style schedule pushes multiple micro-batches forward before backward begins in earnest. The structure is conceptually simple, but it often requires keeping many activations alive for a long time.

That simplicity makes it easy to reason about, but potentially expensive in memory: every micro-batch's activations must be stashed from its forward pass until its backward pass, so peak activation memory grows with the number of micro-batches.
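The memory cost can be made concrete with a toy simulation (my own sketch, not any particular framework's scheduler): under a fill-then-drain order, every forward pushes an activation stash and no backward frees one until all forwards are done, so the peak stash count equals the micro-batch count.

```python
def gpipe_peak_stash(num_microbatches: int) -> int:
    """Peak number of activation stashes a stage holds under a
    GPipe-style schedule: all forwards run before any backward,
    so every micro-batch's activations are alive simultaneously."""
    stash = peak = 0
    for _ in range(num_microbatches):  # forward passes: stash activations
        stash += 1
        peak = max(peak, stash)
    for _ in range(num_microbatches):  # backward passes: free stashes
        stash -= 1
    return peak

print(gpipe_peak_stash(8))   # 8
print(gpipe_peak_stash(32))  # 32
```

Doubling the micro-batch count to shrink the bubble therefore doubles peak activation memory under this schedule.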

Why 1F1B matters

One-forward-one-backward (1F1B) schedules alternate forward and backward passes after a short warmup phase, so each micro-batch's activations are freed soon after they are produced. This usually reduces activation lifetime and often gives a more practical memory footprint for large training runs.

The tradeoff is that scheduling becomes more intricate and debugging gets harder.
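The benefit shows up in the same toy simulation (again my own sketch, with hypothetical names): under 1F1B, a stage runs a few warmup forwards and then frees one stash per new forward, so its peak stash count is bounded by the pipeline depth rather than the micro-batch count.

```python
def one_f_one_b_peak_stash(num_stages: int, stage: int,
                           num_microbatches: int) -> int:
    """Peak activation stashes held by `stage` (0-indexed) under 1F1B.

    The stage runs min(p - 1 - stage, m) warmup forwards, then
    alternates one forward with one backward, so at most
    min(p - stage, m) micro-batches are in flight at once.
    """
    p, m = num_stages, num_microbatches
    stash = peak = 0
    remaining_f = remaining_b = m
    for _ in range(min(p - 1 - stage, m)):  # warmup: forwards only
        stash += 1
        remaining_f -= 1
        peak = max(peak, stash)
    while remaining_f > 0:                  # steady state: 1F then 1B
        stash += 1
        remaining_f -= 1
        peak = max(peak, stash)
        stash -= 1
        remaining_b -= 1
    while remaining_b > 0:                  # cooldown: drain backwards
        stash -= 1
        remaining_b -= 1
    return peak

# Stage 0 of a 4-stage pipeline holds at most 4 stashes,
# whether we run 8 micro-batches or 64.
print(one_f_one_b_peak_stash(4, 0, 8))   # 4
print(one_f_one_b_peak_stash(4, 0, 64))  # 4
```

Note that the bubble is unchanged relative to fill-then-drain; 1F1B primarily buys memory, not speed.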

Why interleaving appears

Interleaving introduces virtual stages to reduce bubbles even further. That can improve utilization, but it raises the complexity of orchestration and communication patterns.
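The bubble reduction from interleaving can also be sketched numerically. As an approximation (following the analysis popularized by Megatron-LM's interleaved schedule, assuming uniform compute and m much larger than p), giving each device v virtual stages divides the bubble fraction by roughly v:

```python
def interleaved_bubble_fraction(num_stages: int, num_microbatches: int,
                                virtual_stages: int) -> float:
    """Approximate bubble fraction of an interleaved 1F1B schedule:
    bubble / ideal ~ (p - 1) / (v * m), where v is the number of
    virtual stages (model chunks) per device. v = 1 recovers the
    usual (p - 1) / m approximation for non-interleaved schedules."""
    p, m, v = num_stages, num_microbatches, virtual_stages
    return (p - 1) / (v * m)

print(interleaved_bubble_fraction(4, 16, 1))  # 0.1875
print(interleaved_bubble_fraction(4, 16, 2))  # 0.09375
```

The catch, as noted above, is that each device now sends and receives activations v times per micro-batch, so communication volume and scheduling complexity both grow with v.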

So the question is not just "which is faster?" but also:

  • can we afford the memory cost?
  • can we operate the runtime reliably?
  • is the added complexity justified by the gain?

The next post moves to activation checkpointing, which is often paired with pipeline parallelism and long-context training because activation memory quickly becomes the dominant constraint.