Distributed LLM Training 10 - Sequence Parallelism and the Cost of Long Context
As context length grows, activation memory and communication patterns change again, and partitioning along the sequence dimension starts to matter.
Long context creates a different kind of pressure
In LLM training, the model is not the only thing getting larger. Context length also grows, and that sharply increases activation size and attention-related memory. At that point, ordinary tensor parallelism may not be enough, because it shards weights and some activations across ranks but leaves other activations replicated on every rank.
This is where sequence parallel ideas start to matter.
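To see the pressure concretely, here is a back-of-envelope estimate of per-layer activation memory as sequence length grows. The config values (hidden size, head count, the "10 hidden-sized tensors" factor) are illustrative assumptions, not measurements from any particular model:

```python
# Back-of-envelope activation sizes for one transformer layer.
# All config values are hypothetical and illustrative only.

def activation_bytes(seq_len, hidden=4096, heads=32, batch=1, bytes_per_el=2):
    """Rough fp16 activation footprint of one layer (no checkpointing)."""
    # Linear-in-sequence terms: residual stream, MLP intermediates, etc.
    # (~10 hidden-sized tensors is a toy estimate.)
    linear = batch * seq_len * hidden * bytes_per_el * 10
    # Quadratic term: the attention score matrix, one per head,
    # if it is fully materialized.
    quadratic = batch * heads * seq_len * seq_len * bytes_per_el
    return linear, quadratic

for s in (2_048, 32_768):
    lin, quad = activation_bytes(s)
    print(f"seq={s:6d}  linear={lin/2**30:7.2f} GiB  scores={quad/2**30:7.2f} GiB")
```

Going from 2K to 32K context multiplies the linear terms by 16 but the materialized score matrix by 256, which is why long context changes where the memory hotspot sits.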
What it means to split along sequence
Tensor parallelism usually focuses on hidden dimensions or weight layouts. Sequence parallelism shifts some of the partitioning toward the token dimension.
That matters because:
- some operations, such as LayerNorm and dropout, act per token and do not benefit from hidden-dimension partitioning alone
- activation storage grows at least linearly with sequence length, so long contexts dominate the memory budget
- attention-related intermediates become expensive in both memory and communication
So long context introduces a pressure that is related to, but distinct from, model-size pressure.
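The difference between the two partitioning choices is easy to see on a toy tensor. This single-process sketch (the shapes and `tp_size` are arbitrary assumptions) shards the same `[batch, seq, hidden]` activation along the hidden dimension, as tensor parallelism does, and along the token dimension, as sequence parallelism does:

```python
import numpy as np

# Sketch: sharding a [batch, seq, hidden] activation two ways.
# tp_size and the toy shapes are illustrative assumptions.
batch, seq, hidden, tp_size = 2, 8, 16, 4
x = np.arange(batch * seq * hidden, dtype=np.float32).reshape(batch, seq, hidden)

# Tensor parallelism: split the hidden dimension across ranks.
hidden_shards = np.split(x, tp_size, axis=2)   # each shard: [2, 8, 4]

# Sequence parallelism: split the token dimension instead.
seq_shards = np.split(x, tp_size, axis=1)      # each shard: [2, 2, 16]

# Either way, concatenating the shards recovers the full tensor;
# what differs is which ops can run purely on a rank-local shard.
assert np.array_equal(np.concatenate(seq_shards, axis=1), x)
print(hidden_shards[0].shape, seq_shards[0].shape)
```

A sequence shard holds complete tokens, so per-token operations run locally without communication; a hidden shard holds a slice of every token, which suits the matrix multiplies that tensor parallelism targets.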
Why it helps and why it is hard
Sequence parallelism can reduce activation memory and distribute some operations more effectively. But it also creates extra complexity:
- rank-local shapes become harder to reason about
- gather/scatter patterns may appear around certain layers
- debugging gets significantly harder
Some operations align well with sequence partitioning, while others need more global information and reintroduce synchronization.
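The split in the sentence above can be simulated in a single process. In this sketch (all names and shapes are illustrative, and the "all-gather" is just a concatenation standing in for the real collective), per-token normalization runs on a rank's local slice, but computing attention scores forces the rank to first gather every key:

```python
import numpy as np

# Toy single-process simulation of why attention forces a gather when
# activations are sharded along the sequence dimension.
rng = np.random.default_rng(0)
seq, d, world = 8, 4, 2
q = rng.standard_normal((seq, d))
k = rng.standard_normal((seq, d))

# Each "rank" owns a contiguous slice of tokens.
q_shards = np.split(q, world, axis=0)
k_shards = np.split(k, world, axis=0)

# Per-token ops (normalization, dropout) need only the local slice.
local_norm = [s / np.linalg.norm(s, axis=-1, keepdims=True) for s in q_shards]

# Attention scores need every key: simulate an all-gather of K on rank 0.
k_full = np.concatenate(k_shards, axis=0)     # stands in for the collective
scores_rank0 = q_shards[0] @ k_full.T         # shape: [seq // world, seq]
assert np.allclose(scores_rank0, (q @ k.T)[: seq // world])
```

This is the reintroduced synchronization: the gather is a communication round that purely hidden-dimension sharding of the same layer would not have needed at that point.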
The important questions
- what is the real activation hotspot in long-context training?
- is activation checkpointing enough?
- is the extra communication from sequence partitioning acceptable?
- how does this interact with kernel optimizations such as FlashAttention?
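On the checkpointing question, a rough estimate helps frame what "enough" means. This sketch (layer count, hidden size, and the per-layer intermediate count are illustrative assumptions) compares storing all intermediates against storing only each layer's input and recomputing the rest in the backward pass:

```python
# Rough comparison under illustrative assumptions: activation memory with
# full storage vs. checkpointing only the input to each layer.

def stored_gib(seq, layers=32, hidden=4096, batch=1, bytes_per_el=2,
               checkpoint=False):
    per_layer_input = batch * seq * hidden * bytes_per_el
    if checkpoint:
        # Keep only each layer's input; intermediates are recomputed.
        total = layers * per_layer_input
    else:
        # Keep ~10 hidden-sized intermediates per layer (toy estimate).
        total = layers * 10 * per_layer_input
    return total / 2**30

for s in (4_096, 65_536):
    print(f"seq={s}: full={stored_gib(s):.1f} GiB, "
          f"checkpointed={stored_gib(s, checkpoint=True):.1f} GiB")
```

Checkpointing cuts the constant factor, but the remaining footprint still scales with sequence length, and recomputation can rematerialize the expensive attention intermediates unless a fused kernel avoids them; that is why the interaction with kernels like FlashAttention belongs on the list.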
The next post moves to pipeline parallelism, where instead of splitting inside layers, we begin splitting the model depth itself into stages.