Distributed LLM Training 11 - Pipeline Parallel Basics and How to Think About Stage Splits
Once the model is split by depth into stages, idle time and stage imbalance become just as important as memory savings.
What pipeline parallelism splits
If tensor parallelism splits work inside layers, pipeline parallelism splits groups of layers across stages. For example, a 48-layer transformer might be divided into four stages of 12 layers each.
This can help when the full model is too large to live comfortably on one device. But simply dividing layers evenly is not the same thing as building an efficient pipeline.
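The even split in the example above can be sketched in a few lines. This is an illustrative helper (the names `split_layers`, `num_layers`, and `num_stages` are mine, not from any framework), and it also handles the case where the layer count does not divide evenly:

```python
def split_layers(num_layers: int, num_stages: int) -> list[range]:
    """Assign contiguous layer ranges to stages, spreading any remainder
    over the earliest stages."""
    base, rem = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < rem else 0)
        stages.append(range(start, start + size))
        start += size
    return stages

# The 48-layer, 4-stage example from the text:
print([len(r) for r in split_layers(48, 4)])  # [12, 12, 12, 12]
```

Note that this splits by layer count only; as discussed below, equal layer counts do not guarantee equal compute per stage.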
The real enemy is idle time
Pipeline parallelism introduces bubbles. Early in the schedule, later stages sit idle waiting for the first micro-batches to arrive; at the end, earlier stages have already finished and sit idle while the last micro-batches drain through the remaining stages.
So the tradeoff is clear:
- memory is distributed more effectively
- but idle time becomes a first-class performance concern
Why micro-batches matter
The usual way to keep stages busy is to split a larger batch into micro-batches and flow them through the pipeline. Too few micro-batches means large bubbles. Too many micro-batches increases overhead and complicates activation handling.
That makes three questions unavoidable:
- how many stages should exist?
- how many micro-batches should flow through them?
- is the stage compute balanced well enough?
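For the first two questions there is a standard back-of-the-envelope estimate: with p equal-cost stages and m micro-batches, the bubble fraction of a synchronous pipeline is roughly (p - 1) / (m + p - 1). A quick sketch makes the tradeoff concrete:

```python
def bubble_fraction(p: int, m: int) -> float:
    """Estimated fraction of pipeline time lost to fill/drain bubbles,
    assuming p equal-cost stages and m micro-batches."""
    return (p - 1) / (m + p - 1)

for m in (4, 8, 16, 32):
    print(f"stages=4  microbatches={m:2d}  bubble={bubble_fraction(4, m):.2f}")
```

The bubble shrinks as m grows, which is exactly why more micro-batches help; the costs of going too far (per-micro-batch overhead, activation handling) are not captured by this formula.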
Stage imbalance is a real practical problem
If one stage is slower than the others, the whole pipeline effectively runs at that stage's speed. Embedding-heavy stages, projection-heavy ends, or asymmetries in attention cost can all create imbalance.
So stage design should be based on real compute and memory patterns, not just layer counts.
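One way to act on this is to partition layers by estimated per-layer cost rather than by count, minimizing the cost of the slowest stage. The sketch below (the function name and inputs are hypothetical; real frameworks use profiled costs) does this with a binary search over the allowed per-stage cost, keeping stages contiguous:

```python
def min_max_stage_cost(per_layer_cost: list[float], num_stages: int) -> list[list[float]]:
    """Split layers into at most num_stages contiguous stages, approximately
    minimizing the cost of the most expensive stage."""
    def stages_needed(cap: float) -> int:
        # Greedily pack layers left to right under a per-stage cap.
        count, cur = 1, 0.0
        for c in per_layer_cost:
            if cur + c > cap:
                count += 1
                cur = 0.0
            cur += c
        return count

    lo, hi = max(per_layer_cost), sum(per_layer_cost)
    for _ in range(60):  # binary search on the slowest-stage cost
        mid = (lo + hi) / 2
        if stages_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid

    # Rebuild the partition at the capacity we found.
    stages, cur = [[]], 0.0
    for c in per_layer_cost:
        if cur + c > hi and stages[-1]:
            stages.append([])
            cur = 0.0
        stages[-1].append(c)
        cur += c
    return stages

# A heavy first layer (e.g. embedding-dominated) pulls the boundaries:
costs = [4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0]
print([sum(s) for s in min_max_stage_cost(costs, 3)])  # [4.0, 4.0, 3.0]
```

An even 8-layer/3-ish split by count would put the expensive layer together with several others; the cost-aware split isolates it, so no stage exceeds a cost of 4.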
The next post looks at pipeline schedules such as GPipe and 1F1B, because the schedule often matters more than the raw idea of stage splitting.