Long context creates a different kind of pressure

In LLM training, the model is not the only thing getting larger. Context length also grows, and longer sequences sharply increase activation size and attention-related memory: most activations scale linearly with sequence length, and the naive attention score matrix scales quadratically. At that point, ordinary tensor parallelism may not be enough.
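A rough back-of-envelope calculation makes this concrete. The model dimensions below are hypothetical (not from any specific model), and the cost model counts only two dominant terms, but it shows how the quadratic attention term overtakes everything as context grows:

```python
def activation_gib(batch, seq_len, hidden, n_heads, n_layers, bytes_per_el=2):
    """Toy per-model activation footprint in GiB, assuming fp16 activations.

    Counts only two dominant terms: hidden-state activations (linear in
    seq_len) and the naive attention score matrix (quadratic in seq_len).
    Real models store more intermediates; this is an illustration only.
    """
    hidden_acts = batch * seq_len * hidden              # O(seq_len)
    attn_scores = batch * n_heads * seq_len * seq_len   # O(seq_len^2)
    total_bytes = n_layers * (hidden_acts + attn_scores) * bytes_per_el
    return total_bytes / 2**30

# Hypothetical 32-layer model: hidden=4096, 32 heads, batch=1.
for s in (2_048, 8_192, 32_768):
    print(f"seq={s:>6}: ~{activation_gib(1, s, 4096, 32, 32):.1f} GiB")
```

A 16x increase in sequence length here produces far more than a 16x increase in activation memory, which is exactly the pressure this post is about.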

This is where sequence parallel ideas start to matter.

What it means to split along sequence

Tensor parallelism usually focuses on hidden dimensions or weight layouts. Sequence parallelism shifts some of the partitioning toward the token dimension.

That matters because:

  • some operations, such as LayerNorm and dropout, do not benefit much from hidden-dimension partitioning alone
  • activation storage explodes with long sequence length
  • attention-related intermediates become expensive in both memory and communication
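A minimal sketch of what "split along sequence" means, using plain Python lists in place of tensors (the shard layout is the simplest contiguous scheme; real implementations may interleave shards for load balancing):

```python
def shard_sequence(tokens, world_size):
    """Split a sequence into contiguous per-rank shards along the token dim."""
    assert len(tokens) % world_size == 0, "assume seq_len divisible by world_size"
    shard = len(tokens) // world_size
    return [tokens[r * shard:(r + 1) * shard] for r in range(world_size)]

tokens = list(range(8))  # a toy sequence of 8 token positions
shards = shard_sequence(tokens, world_size=4)
# Each rank stores activations for only seq_len / world_size positions, so
# per-rank activation memory for token-wise ops drops by a factor of world_size.
print(shards)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Contrast this with tensor parallelism, where every rank holds all token positions but only a slice of the hidden dimension.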

So long context introduces a pressure that is related to, but distinct from, model-size pressure.

Why it helps and why it is hard

Sequence parallelism can reduce activation memory and distribute some operations more effectively. But it also creates extra complexity:

  • rank-local shapes become harder to reason about
  • gather/scatter steps (all-gather, reduce-scatter) appear around layers that need the full sequence
  • debugging gets significantly harder

Some operations align well with sequence partitioning, while others need more global information and reintroduce synchronization.

The important questions

  • what is the real activation hotspot in long-context training?
  • is activation checkpointing enough?
  • is the extra communication from sequence partitioning acceptable?
  • how does this interact with kernel optimizations such as FlashAttention?
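One way to make the checkpointing question concrete is a toy cost model. The constants below are illustrative, not measured, and communication cost is deliberately left out; the point is only the shape of the trade: checkpointing buys memory with recompute, sequence partitioning buys memory with communication.

```python
def memory_and_recompute(n_layers, per_layer_acts, world_size, checkpoint):
    """Toy cost model: (activation memory per rank, extra recompute work).

    - Checkpointing keeps only each layer's input (1 unit) and re-runs the
      forward pass during backward, so memory is small but recompute is
      roughly one extra forward.
    - Sequence partitioning divides token-wise activations by world_size
      with no recompute; its communication cost is not modeled here.
    """
    if checkpoint:
        mem = n_layers * 1                      # layer-boundary inputs only
        recompute = n_layers * per_layer_acts   # re-run forward in backward
    else:
        mem = n_layers * per_layer_acts / world_size
        recompute = 0
    return mem, recompute

# Illustrative: 32 layers, 100 memory units of activations per layer.
print(memory_and_recompute(32, 100, world_size=1, checkpoint=True))   # low mem, high recompute
print(memory_and_recompute(32, 100, world_size=8, checkpoint=False))  # sharded mem, no recompute
```

In practice the two are often combined rather than treated as alternatives, which is part of why the "is checkpointing enough?" question has no single answer.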

The next post moves to pipeline parallelism, where instead of splitting inside layers, we begin splitting the model depth itself into stages.