Distributed LLM Training 12 - GPipe, 1F1B, and Interleaving: Choosing a Pipeline Schedule
Pipeline efficiency is shaped heavily by the schedule: bubble size, activation memory, and implementation complexity all depend on it.
Stage partitioning is only half the story
Once a model is split into stages, the next question is how forward and backward work should be scheduled across them. The schedule affects:
- bubble size
- activation lifetime
- implementation complexity
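To make the bubble-size tradeoff concrete, here is a small sketch using the standard back-of-envelope formula for a synchronous pipeline: with p stages and m micro-batches, the idle "bubble" fraction is (p - 1) / (m + p - 1). The function name and the specific numbers are illustrative, not from any framework.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Fraction of pipeline time spent idle for p stages and m micro-batches."""
    return (p - 1) / (m + p - 1)

# More micro-batches amortize the warmup/drain bubble.
for m in (4, 16, 64):
    print(f"p=4, m={m}: bubble = {bubble_fraction(4, m):.2f}")
```

The takeaway: the bubble shrinks as m grows, which is why schedules push to keep many micro-batches in flight.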
The intuition behind GPipe
A GPipe-style schedule pushes all micro-batches forward before backward begins in earnest. The structure is conceptually simple, but every stage must keep activations for all in-flight micro-batches alive until their backward passes arrive.
That simplicity makes it easier to understand and debug, but potentially expensive in memory, since activation memory grows with the number of micro-batches.
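A toy event-ordering sketch (an assumed simplification, not GPipe's actual runtime) makes the memory cost visible: if all forwards run before any backward, the peak number of live activation sets per stage equals the micro-batch count.

```python
def gpipe_schedule(num_microbatches: int):
    """All forwards first, then all backwards (in reverse micro-batch order)."""
    ops = [("F", i) for i in range(num_microbatches)]
    ops += [("B", i) for i in reversed(range(num_microbatches))]
    return ops

def peak_live_activations(ops):
    """Count activation sets held: +1 per forward, -1 per backward."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak

print(peak_live_activations(gpipe_schedule(8)))  # all 8 alive at once
```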
Why 1F1B matters
One-forward-one-backward (1F1B) schedules alternate forward and backward passes after a short warmup phase. This bounds the number of in-flight micro-batches per stage by the pipeline depth rather than the micro-batch count, which shortens activation lifetime and usually gives a more practical memory footprint for large training runs.
The tradeoff is that the schedule becomes more intricate and debugging gets harder.
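A per-stage sketch of the 1F1B op order (an assumed simplification, not any framework's exact runtime): stage s of p runs p - 1 - s warmup forwards, then alternates one forward with one backward, then drains the remaining backwards. Peak live activations end up bounded by the pipeline depth, not the micro-batch count.

```python
def one_f_one_b(stage: int, p: int, m: int):
    """Op order for stage `stage` (0-based) of p stages over m micro-batches."""
    warmup = min(p - 1 - stage, m)
    ops = [("F", i) for i in range(warmup)]
    f, b = warmup, 0
    while b < m:
        if f < m:            # steady state: one forward...
            ops.append(("F", f))
            f += 1
        ops.append(("B", b))  # ...then one backward
        b += 1
    return ops

def peak_live(ops):
    """Peak activation sets held: +1 per forward, -1 per backward."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak

# Stage 0 of a 4-stage pipeline, 8 micro-batches: peak is 4 (the depth),
# where the all-forwards-first ordering would hold all 8.
print(peak_live(one_f_one_b(0, 4, 8)))
```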
Why interleaving appears
Interleaving assigns each rank several non-contiguous model chunks (virtual stages) to shrink bubbles further. That can improve utilization, but each micro-batch now traverses every rank multiple times, which raises the complexity of orchestration and communication patterns.
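As a rough model (an assumption for intuition, not a precise cost formula): with v virtual stages per rank, the warmup/drain bubble shrinks by roughly a factor of v relative to plain 1F1B, at the cost of v times as many point-to-point hops per micro-batch.

```python
def approx_bubble(p: int, m: int, v: int = 1) -> float:
    """Approximate bubble fraction: (p - 1) / v idle slots amortized
    over m micro-batch slots. v=1 recovers the non-interleaved estimate."""
    idle = (p - 1) / v
    return idle / (m + idle)

print(approx_bubble(8, 32, v=1))
print(approx_bubble(8, 32, v=2))  # smaller bubble with interleaving
```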
So the question is not just "which is faster?" but also:
- can we afford the memory cost?
- can we operate the runtime reliably?
- is the added complexity justified by the gain?
The next post moves to activation checkpointing, which is often paired with pipeline and long-context training because activation memory quickly becomes central.