What pipeline parallelism is splitting

If tensor parallelism splits work inside layers, pipeline parallelism splits groups of layers across stages. For example, a 48-layer transformer might be divided into four stages of 12 layers each.
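As a concrete sketch of that split (the function name and signature here are illustrative, not from any framework), an even contiguous assignment of layers to stages can be computed like this, including the case where the layer count does not divide evenly:

```python
def split_layers(num_layers: int, num_stages: int) -> list[range]:
    """Assign contiguous layer ranges to pipeline stages, as evenly as possible."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        # The first `extra` stages absorb one leftover layer each.
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages

print(split_layers(48, 4))  # four contiguous stages of 12 layers each
```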

This can help when the full model is too large to live comfortably on one device. But simply dividing layers evenly is not the same thing as building an efficient pipeline.

The real enemy is idle time

Pipeline parallelism introduces bubbles. Early in the schedule, later stages sit idle waiting for the first micro-batches to arrive (the fill phase). At the end, early stages have drained their work and sit idle while later stages are still finishing (the drain phase).
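To see where the bubbles come from, here is a minimal simulation (illustrative code, not any framework's API) of a simple fill-drain schedule: each row is a stage, each column a time step, digits are micro-batch ids, and dots are idle slots:

```python
def timeline(p: int, m: int) -> list[str]:
    """ASCII timeline for p stages and m micro-batches: stage s works on
    micro-batch b at time step s + b; '.' marks an idle slot."""
    total = p + m - 1  # total time steps for one forward sweep
    rows = []
    for s in range(p):
        row = ""
        for t in range(total):
            b = t - s
            row += str(b % 10) if 0 <= b < m else "."
        rows.append(row)
    return rows

for r in timeline(4, 6):
    print(r)
```

The triangles of dots at the top right and bottom left are the bubbles; they shrink relative to total time as more micro-batches flow through.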

So the tradeoff is clear:

  • memory is distributed more effectively
  • but idle time becomes a first-class performance concern

Why micro-batches matter

The usual way to keep stages busy is to split a larger batch into micro-batches and flow them through the pipeline. Too few micro-batches means large bubbles. Too many micro-batches increases overhead and complicates activation handling.
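Under the usual idealized assumptions (equal stage times, a simple fill-drain schedule), the idle fraction has a well-known closed form: (p − 1) / (m + p − 1) for p stages and m micro-batches. A quick sketch of how fast the bubble shrinks as m grows:

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of an idealized fill-drain pipeline with equal stage
    times: (p - 1) / (m + p - 1)."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# With 4 stages: few micro-batches leave a large bubble, many shrink it.
for m in (4, 16, 64):
    print(m, round(bubble_fraction(4, m), 3))
```

This is why the micro-batch count is tuned against the stage count rather than chosen in isolation.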

That makes three questions unavoidable:

  • how many stages should exist?
  • how many micro-batches should flow through them?
  • is the stage compute balanced well enough?

Stage imbalance is a real practical problem

If one stage is slower than the others, the whole pipeline effectively runs at that stage's speed. Stages that carry the embedding table, the output projection at the far end, or unusually expensive attention layers can all create imbalance.

So stage design should be based on real compute and memory patterns, not just layer counts.
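One hedged sketch of that idea (names are illustrative, and a real system would use profiled time or memory per layer as the costs): binary-search the smallest achievable cost for the slowest stage, then pack layers greedily against that limit:

```python
def balanced_split(costs: list[float], num_stages: int) -> list[list[float]]:
    """Partition per-layer costs into contiguous stages so the slowest
    stage is as cheap as possible (bisect the limit, then greedy-pack)."""
    def pack(limit: float):
        stages, current = [[]], 0.0
        for c in costs:
            if c > limit:
                return None  # a single layer exceeds the limit
            if current + c > limit:
                stages.append([])
                current = 0.0
            stages[-1].append(c)
            current += c
        return stages if len(stages) <= num_stages else None

    lo, hi = max(costs), sum(costs)
    for _ in range(60):  # bisect the feasible max-stage cost
        mid = (lo + hi) / 2
        if pack(mid) is not None:
            hi = mid
        else:
            lo = mid
    return pack(hi)

# Hypothetical per-layer costs: two heavy layers among cheap ones.
print(balanced_split([3, 1, 1, 1, 3, 1, 1, 1], 2))
```

With equal layer counts the naive split would put both heavy layers in one stage; splitting on measured cost keeps the stages balanced.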

The next post looks at pipeline schedules such as GPipe and 1F1B (one forward, one backward per micro-batch in steady state), because the schedule often matters more than the raw idea of stage splitting.