Distributed LLM Training 10 - Sequence Parallelism and the Cost of Long Context
As context length grows, activation memory and communication patterns change again, and partitioning along the sequence dimension starts to matter.
Long context creates a different kind of pressure
In LLM training, the model is not the only thing getting larger. Context length also grows, and that sharply increases activation size and attention-related memory. At that point, ordinary tensor parallelism may not be enough, because it shards weights and some activations across ranks but leaves other activations replicated on every rank.
This is where sequence parallel ideas start to matter.
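To see the pressure concretely, here is a back-of-envelope estimate of per-layer activation memory as sequence length grows. The config values (hidden size, head count, the "10 hidden-sized tensors" factor) are illustrative assumptions, not measurements from any particular model:

```python
# Back-of-envelope activation sizes for one transformer layer.
# All config values are hypothetical and illustrative only.

def activation_bytes(seq_len, hidden=4096, heads=32, batch=1, bytes_per_el=2):
    """Rough fp16 activation footprint of one layer (no checkpointing)."""
    # Linear-in-sequence terms: residual stream, MLP intermediates, etc.
    # (~10 hidden-sized tensors is a toy estimate.)
    linear = batch * seq_len * hidden * bytes_per_el * 10
    # Quadratic term: the attention score matrix, one per head,
    # if it is fully materialized.
    quadratic = batch * heads * seq_len * seq_len * bytes_per_el
    return linear, quadratic

for s in (2_048, 32_768):
    lin, quad = activation_bytes(s)
    print(f"seq={s:6d}  linear={lin/2**30:7.2f} GiB  scores={quad/2**30:7.2f} GiB")
```

Going from 2K to 32K context multiplies the linear terms by 16 but the materialized score matrix by 256, which is why long context changes where the memory hotspot sits.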
What it means to split along sequence
Tensor parallelism usually focuses on hidden dimensions or weight layouts. Sequence parallelism shifts some of the partitioning toward the token dimension.
That matters because:
- some operations, such as LayerNorm and dropout, act per token and do not benefit from hidden-dimension partitioning alone
- activation storage grows at least linearly with sequence length, so long contexts dominate the memory budget
- attention-related intermediates become expensive in both memory and communication
So long context introduces a pressure that is related to, but distinct from, model-size pressure.
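The difference between the two partitioning choices is easy to see on a toy tensor. This single-process sketch (the shapes and `tp_size` are arbitrary assumptions) shards the same `[batch, seq, hidden]` activation along the hidden dimension, as tensor parallelism does, and along the token dimension, as sequence parallelism does:

```python
import numpy as np

# Sketch: sharding a [batch, seq, hidden] activation two ways.
# tp_size and the toy shapes are illustrative assumptions.
batch, seq, hidden, tp_size = 2, 8, 16, 4
x = np.arange(batch * seq * hidden, dtype=np.float32).reshape(batch, seq, hidden)

# Tensor parallelism: split the hidden dimension across ranks.
hidden_shards = np.split(x, tp_size, axis=2)   # each shard: [2, 8, 4]

# Sequence parallelism: split the token dimension instead.
seq_shards = np.split(x, tp_size, axis=1)      # each shard: [2, 2, 16]

# Either way, concatenating the shards recovers the full tensor;
# what differs is which ops can run purely on a rank-local shard.
assert np.array_equal(np.concatenate(seq_shards, axis=1), x)
print(hidden_shards[0].shape, seq_shards[0].shape)
```

A sequence shard holds complete tokens, so per-token operations run locally without communication; a hidden shard holds a slice of every token, which suits the matrix multiplies that tensor parallelism targets.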
Why it helps and why it is hard
Sequence parallelism can reduce activation memory and distribute some operations more effectively. But it also creates extra complexity:
- rank-local shapes become harder to reason about
- gather/scatter patterns may appear around certain layers
- debugging gets significantly harder
Some operations align well with sequence partitioning, while others need more global information and reintroduce synchronization.
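The split in the sentence above can be simulated in a single process. In this sketch (all names and shapes are illustrative, and the "all-gather" is just a concatenation standing in for the real collective), per-token normalization runs on a rank's local slice, but computing attention scores forces the rank to first gather every key:

```python
import numpy as np

# Toy single-process simulation of why attention forces a gather when
# activations are sharded along the sequence dimension.
rng = np.random.default_rng(0)
seq, d, world = 8, 4, 2
q = rng.standard_normal((seq, d))
k = rng.standard_normal((seq, d))

# Each "rank" owns a contiguous slice of tokens.
q_shards = np.split(q, world, axis=0)
k_shards = np.split(k, world, axis=0)

# Per-token ops (normalization, dropout) need only the local slice.
local_norm = [s / np.linalg.norm(s, axis=-1, keepdims=True) for s in q_shards]

# Attention scores need every key: simulate an all-gather of K on rank 0.
k_full = np.concatenate(k_shards, axis=0)     # stands in for the collective
scores_rank0 = q_shards[0] @ k_full.T         # shape: [seq // world, seq]
assert np.allclose(scores_rank0, (q @ k.T)[: seq // world])
```

This is the reintroduced synchronization: the gather is a communication round that purely hidden-dimension sharding of the same layer would not have needed at that point.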
The important questions
- what is the real activation hotspot in long-context training?
- is activation checkpointing enough?
- is the extra communication from sequence partitioning acceptable?
- how does this interact with kernel optimizations such as FlashAttention?
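On the checkpointing question, a rough estimate helps frame what "enough" means. This sketch (layer count, hidden size, and the per-layer intermediate count are illustrative assumptions) compares storing all intermediates against storing only each layer's input and recomputing the rest in the backward pass:

```python
# Rough comparison under illustrative assumptions: activation memory with
# full storage vs. checkpointing only the input to each layer.

def stored_gib(seq, layers=32, hidden=4096, batch=1, bytes_per_el=2,
               checkpoint=False):
    per_layer_input = batch * seq * hidden * bytes_per_el
    if checkpoint:
        # Keep only each layer's input; intermediates are recomputed.
        total = layers * per_layer_input
    else:
        # Keep ~10 hidden-sized intermediates per layer (toy estimate).
        total = layers * 10 * per_layer_input
    return total / 2**30

for s in (4_096, 65_536):
    print(f"seq={s}: full={stored_gib(s):.1f} GiB, "
          f"checkpointed={stored_gib(s, checkpoint=True):.1f} GiB")
```

Checkpointing cuts the constant factor, but the remaining footprint still scales with sequence length, and recomputation can rematerialize the expensive attention intermediates unless a fused kernel avoids them; that is why the interaction with kernels like FlashAttention belongs on the list.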
The next post moves to pipeline parallelism, where instead of splitting inside layers, we begin splitting the model depth itself into stages.