Stage partitioning is only half the story

Once a model is split into stages, the next question is how forward and backward work should be scheduled across them. The schedule affects:

  • bubble size
  • activation lifetime
  • implementation complexity
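The first of these, bubble size, is easy to quantify under simplifying assumptions. As an illustration (my own sketch, assuming uniform per-micro-batch compute and a simple fill-then-drain schedule), with p stages and m micro-batches each stage is idle for p - 1 of the m + p - 1 schedule slots:

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a fill-then-drain pipeline schedule.

    With p stages and m micro-batches, each stage is busy for m slots
    out of m + p - 1 total, so p - 1 slots are bubble. Assumes every
    micro-batch takes the same time on every stage.
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# Raising the micro-batch count shrinks the bubble for a fixed depth.
print(bubble_fraction(4, 4))   # 3/7  ~ 0.43
print(bubble_fraction(4, 16))  # 3/19 ~ 0.16
```

This is why schedules are usually run with many more micro-batches than stages.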

The intuition behind GPipe

A GPipe-style schedule pushes multiple micro-batches forward before backward begins in earnest. The structure is conceptually simple, but it often requires keeping many activations alive for a long time.

That simplicity makes it easy to reason about, but potentially expensive in memory: every micro-batch's activations must be stashed from its forward pass until its backward pass, so peak activation memory grows with the number of micro-batches.
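The memory cost can be made concrete with a toy simulation (my own sketch, not any particular framework's scheduler): under a fill-then-drain order, every forward pushes an activation stash and no backward frees one until all forwards are done, so the peak stash count equals the micro-batch count.

```python
def gpipe_peak_stash(num_microbatches: int) -> int:
    """Peak number of activation stashes a stage holds under a
    GPipe-style schedule: all forwards run before any backward,
    so every micro-batch's activations are alive simultaneously."""
    stash = peak = 0
    for _ in range(num_microbatches):  # forward passes: stash activations
        stash += 1
        peak = max(peak, stash)
    for _ in range(num_microbatches):  # backward passes: free stashes
        stash -= 1
    return peak

print(gpipe_peak_stash(8))   # 8
print(gpipe_peak_stash(32))  # 32
```

Doubling the micro-batch count to shrink the bubble therefore doubles peak activation memory under this schedule.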

Why 1F1B matters

One-forward-one-backward (1F1B) schedules alternate forward and backward passes after a short warmup phase, so each micro-batch's activations are freed soon after they are produced. This usually reduces activation lifetime and often gives a more practical memory footprint for large training runs.

The tradeoff is that scheduling becomes more intricate and debugging gets harder.
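The benefit shows up in the same toy simulation (again my own sketch, with hypothetical names): under 1F1B, a stage runs a few warmup forwards and then frees one stash per new forward, so its peak stash count is bounded by the pipeline depth rather than the micro-batch count.

```python
def one_f_one_b_peak_stash(num_stages: int, stage: int,
                           num_microbatches: int) -> int:
    """Peak activation stashes held by `stage` (0-indexed) under 1F1B.

    The stage runs min(p - 1 - stage, m) warmup forwards, then
    alternates one forward with one backward, so at most
    min(p - stage, m) micro-batches are in flight at once.
    """
    p, m = num_stages, num_microbatches
    stash = peak = 0
    remaining_f = remaining_b = m
    for _ in range(min(p - 1 - stage, m)):  # warmup: forwards only
        stash += 1
        remaining_f -= 1
        peak = max(peak, stash)
    while remaining_f > 0:                  # steady state: 1F then 1B
        stash += 1
        remaining_f -= 1
        peak = max(peak, stash)
        stash -= 1
        remaining_b -= 1
    while remaining_b > 0:                  # cooldown: drain backwards
        stash -= 1
        remaining_b -= 1
    return peak

# Stage 0 of a 4-stage pipeline holds at most 4 stashes,
# whether we run 8 micro-batches or 64.
print(one_f_one_b_peak_stash(4, 0, 8))   # 4
print(one_f_one_b_peak_stash(4, 0, 64))  # 4
```

Note that the bubble is unchanged relative to fill-then-drain; 1F1B primarily buys memory, not speed.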

Why interleaving appears

Interleaving introduces virtual stages to reduce bubbles even further. That can improve utilization, but it raises the complexity of orchestration and communication patterns.
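The bubble reduction from interleaving can also be sketched numerically. As an approximation (following the analysis popularized by Megatron-LM's interleaved schedule, assuming uniform compute and m much larger than p), giving each device v virtual stages divides the bubble fraction by roughly v:

```python
def interleaved_bubble_fraction(num_stages: int, num_microbatches: int,
                                virtual_stages: int) -> float:
    """Approximate bubble fraction of an interleaved 1F1B schedule:
    bubble / ideal ~ (p - 1) / (v * m), where v is the number of
    virtual stages (model chunks) per device. v = 1 recovers the
    usual (p - 1) / m approximation for non-interleaved schedules."""
    p, m, v = num_stages, num_microbatches, virtual_stages
    return (p - 1) / (v * m)

print(interleaved_bubble_fraction(4, 16, 1))  # 0.1875
print(interleaved_bubble_fraction(4, 16, 2))  # 0.09375
```

The catch, as noted above, is that each device now sends and receives activations v times per micro-batch, so communication volume and scheduling complexity both grow with v.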

So the question is not just "which is faster?" but also:

  • can we afford the memory cost?
  • can we operate the runtime reliably?
  • is the added complexity justified by the gain?

The next post moves to activation checkpointing, which is often paired with pipeline parallelism and long-context training because activation memory quickly becomes the dominant constraint.