From abstraction to block structure

The best way to understand tensor parallelism is to lay out a transformer block and ask where the largest computations live.

Usually the heavy pieces are:

  • QKV projection
  • attention output projection
  • MLP up projection
  • MLP down projection

Those are the most natural places to split work across ranks.
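To make the shapes concrete, here is a minimal NumPy sketch of those four matmuls. All dimensions are hypothetical, chosen only for illustration, and the attention computation itself is elided.

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration
seq, d_model = 8, 64
d_ff = 4 * d_model  # a common MLP expansion factor

rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_model))

W_qkv = rng.normal(size=(d_model, 3 * d_model))  # QKV projection
W_o   = rng.normal(size=(d_model, d_model))      # attention output projection
W_up  = rng.normal(size=(d_model, d_ff))         # MLP up projection
W_dn  = rng.normal(size=(d_ff, d_model))         # MLP down projection

qkv = x @ W_qkv                     # (seq, 3 * d_model)
attn = qkv[:, :d_model]             # stand-in for the attention result
attn_out = attn @ W_o               # (seq, d_model)
h = np.maximum(attn_out @ W_up, 0)  # (seq, d_ff), ReLU stand-in
y = h @ W_dn                        # (seq, d_model)
```

Each of these weight matrices is large and dense, which is exactly what makes them good candidates for sharding.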

What happens in attention

If the QKV projection is partitioned, each rank can handle a subset of attention heads or a slice of the hidden dimension. Head partitioning fits the transformer's structure especially well, because each head's attention is computed independently. Downstream, however, the per-rank outputs must be combined, typically with a reduction around the output projection.
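A small NumPy sketch of column-sharding a projection by head groups, with two simulated "ranks" in a list (all sizes are hypothetical, and only the query projection is shown for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, ranks = 8, 64, 2  # hypothetical sizes

x = rng.normal(size=(seq, d_model))
W_q = rng.normal(size=(d_model, d_model))  # query projection only

q_full = x @ W_q  # unsharded reference

# Column-shard W_q so each simulated rank owns a contiguous group of heads
shards = np.split(W_q, ranks, axis=1)
q_parts = [x @ w for w in shards]  # each rank computes only its own heads

# No communication was needed to produce the per-rank head activations;
# concatenating them recovers the full projection exactly.
assert np.allclose(np.concatenate(q_parts, axis=1), q_full)
```

Because the shards line up with head boundaries, each rank can also run its softmax locally; the combine only becomes necessary once the heads feed the output projection.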

So the important questions become:

  • which head or hidden partitioning is being used?
  • where does communication happen relative to softmax and projection?
  • how does longer sequence length affect activation movement?
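The last question can be made concrete with rough arithmetic. The numbers below are hypothetical, but they show the key property: per-layer activation communication grows linearly with sequence length.

```python
# Rough, hypothetical arithmetic: the size of one (batch, seq, hidden)
# activation tensor in fp16 (2 bytes per element), which is the unit
# of a typical per-layer all-reduce.
def activation_bytes(batch, seq, hidden, bytes_per_elem=2):
    return batch * seq * hidden * bytes_per_elem

short = activation_bytes(batch=1, seq=4096, hidden=8192)
long = activation_bytes(batch=1, seq=32768, hidden=8192)

print(short / 2**20, "MiB")  # 64.0 MiB at 4k sequence length
print(long / short)          # 8.0: volume scales linearly with seq
```

An 8x longer context means 8x more bytes moved per synchronization point, which is why activation movement starts to dominate the design at long context.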

What happens in the MLP

The MLP typically expands the hidden dimension (often by a factor of four) and then projects back down. This makes it a natural fit for the classic tensor-parallel layout: a column-parallel up projection followed by a row-parallel down projection.

That pattern works well because the intermediate activation stays sharded: the elementwise nonlinearity applies independently to each shard, so the two linear operations need only a single all-reduce after the down projection.
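A minimal NumPy sketch of that column-parallel / row-parallel pairing, simulating the ranks in a list (all sizes are hypothetical, with ReLU standing in for the nonlinearity):

```python
import numpy as np

rng = np.random.default_rng(1)
seq, d_model, d_ff, ranks = 8, 32, 128, 4  # hypothetical sizes

x = rng.normal(size=(seq, d_model))
W_up = rng.normal(size=(d_model, d_ff))
W_dn = rng.normal(size=(d_ff, d_model))

# Reference: the unsharded MLP
ref = np.maximum(x @ W_up, 0) @ W_dn

# Column-parallel up projection (shard W_up by columns) paired with
# row-parallel down projection (shard W_dn by rows).
up_shards = np.split(W_up, ranks, axis=1)
dn_shards = np.split(W_dn, ranks, axis=0)

# Each "rank" keeps its slice of the intermediate activation local;
# the elementwise nonlinearity applies shard-by-shard, no communication.
partials = [np.maximum(x @ u, 0) @ d for u, d in zip(up_shards, dn_shards)]

# One all-reduce (a plain sum here) at the end recovers the full output.
out = sum(partials)
assert np.allclose(out, ref)
```

The design choice to pair column sharding with row sharding is what eliminates the communication in the middle: the output columns of one shard are exactly the input rows the next shard needs.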

Not every operation fits equally well

Large dense matmuls map cleanly onto tensor parallelism. But not everything inside a transformer looks like that. Normalization needs statistics over the full hidden dimension, residual additions and masking must see activations in a consistent layout, and some indexing operations require careful sharding. These pieces introduce extra synchronization points and can erode the elegance of a parallel layout.
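Normalization is a concrete example: its statistics span the full hidden dimension, so normalizing each shard independently gives a different answer than normalizing the whole row. A small NumPy demonstration, using deterministic data so the mismatch is guaranteed:

```python
import numpy as np

# Hypothetical sizes; deterministic data so the disagreement is certain
hidden, ranks = 16, 2
x = np.arange(hidden, dtype=np.float64)

def layernorm(v, eps=1e-5):
    # Normalize using the mean and variance of the vector passed in
    return (v - v.mean()) / np.sqrt(v.var() + eps)

ref = layernorm(x)  # statistics over the full hidden dimension
sharded = np.concatenate([layernorm(s) for s in np.split(x, ranks)])

# Per-shard statistics disagree with full-row statistics, so naively
# normalizing each shard does not reproduce the reference.
print(np.allclose(ref, sharded))  # False
```

In practice this is why hidden-sharded layouts either gather before the norm or keep normalization in a replicated part of the layout.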

The next post extends this line of thought into sequence parallelism and long-context training, where activation size itself becomes a major design pressure.