Distributed LLM Training 09 - Where Tensor Parallelism Actually Lives Inside a Transformer
Tensor parallelism becomes real when you map it onto QKV projections, attention output paths, and the two large MLP projections inside a transformer block
Move from abstraction to block structure
The best way to understand tensor parallelism is to lay out a transformer block and ask where the largest computations live.
Usually the heavy pieces are:
- QKV projection
- attention output projection
- MLP up projection
- MLP down projection
Those are the most natural places to split work across ranks.
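To make the sizes concrete, here is a small NumPy sketch of the four projections and their parameter counts. The dimensions are toy values chosen for illustration, not taken from any particular model; with a 4x MLP expansion, the four weights together hold 12·hidden² parameters, which is why they dominate the block.

```python
import numpy as np

# Hypothetical block dimensions (illustrative assumptions, not a real model)
hidden = 1024
ffn = 4 * hidden  # common 4x MLP expansion

x = np.random.randn(8, hidden)  # (tokens, hidden)

# The four large matmuls in one transformer block:
w_qkv = np.random.randn(hidden, 3 * hidden)  # QKV projection
w_out = np.random.randn(hidden, hidden)      # attention output projection
w_up = np.random.randn(hidden, ffn)          # MLP up projection
w_down = np.random.randn(ffn, hidden)        # MLP down projection

qkv = x @ w_qkv            # (8, 3 * hidden)
h_mid = x @ w_up           # (8, ffn) -- the expanded activation
h_out = h_mid @ w_down     # (8, hidden)

# 3h^2 + h^2 + 4h^2 + 4h^2 = 12h^2 parameters in these four weights
total_params = w_qkv.size + w_out.size + w_up.size + w_down.size
```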
What happens in attention
If the QKV projection is partitioned, each rank can handle a subset of heads or a slice of the hidden dimension. That maps naturally onto transformer structure, because each head's attention is computed independently. But downstream, the per-rank partial outputs must eventually be combined or reduced.
So the important questions become:
- which head or hidden partitioning is being used?
- where does communication happen relative to softmax and projection?
- how does longer sequence length affect activation movement?
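One way to see the head partitioning concretely is to simulate it in NumPy. The sketch below (toy sizes, variable names are mine) computes attention for a subset of heads on each "rank", projects each head's output through the matching rows of the output weight, and sums the partials. The sum is what an all-reduce would perform in a real implementation. Note that softmax stays entirely local to each head, so no communication is needed between the QKV matmul and the output projection.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, hidden, n_heads = 4, 8, 4
d = hidden // n_heads  # per-head dimension

x = rng.normal(size=(seq, hidden))
wq = rng.normal(size=(hidden, hidden))
wk = rng.normal(size=(hidden, hidden))
wv = rng.normal(size=(hidden, hidden))
wo = rng.normal(size=(hidden, hidden))

def softmax(a):
    e = np.exp(a - a.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attn_subset(head_ids):
    # Attention for a subset of heads, projected through the matching
    # rows of wo. Each head is fully local: QKV slice, scores, softmax,
    # and the partial output projection need no cross-rank data.
    out = np.zeros((seq, hidden))
    for h in head_ids:
        sl = slice(h * d, (h + 1) * d)
        q, k, v = x @ wq[:, sl], x @ wk[:, sl], x @ wv[:, sl]
        a = softmax(q @ k.T / np.sqrt(d))
        out += (a @ v) @ wo[sl, :]
    return out

full = attn_subset(range(n_heads))
# Two "ranks" with two heads each; summing partials = the all-reduce
sharded = attn_subset(range(0, 2)) + attn_subset(range(2, 4))
```

The key property this demonstrates: concatenating head outputs and multiplying by wo is identical to summing each head's product with its own row-block of wo, which is why the output projection is the natural all-reduce point.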
What happens in the MLP
The MLP often expands hidden size and then projects back down. This is a very natural place for tensor parallel layouts such as column-parallel followed by row-parallel projections.
That pattern works well because the column-parallel up projection leaves each rank holding exactly the shard of the expanded activation that the row-parallel down projection needs as input. The elementwise nonlinearity applies shard-locally, so the whole MLP needs only a single all-reduce at the end rather than synchronization between the two linear operations.
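This column-then-row layout can be verified in a few lines of NumPy (toy dimensions, tanh-approximation GELU; all names here are illustrative). Each "rank" applies GELU to its own column shard of the expanded activation, multiplies by the matching rows of the down projection, and the partial outputs are summed, standing in for the all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, ffn, tp = 8, 32, 2  # tp = tensor-parallel degree
x = rng.normal(size=(4, hidden))
w_up = rng.normal(size=(hidden, ffn))
w_down = rng.normal(size=(ffn, hidden))

def gelu(z):
    # tanh approximation of GELU; elementwise, so it applies per shard
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

# Unsharded baseline
ref = gelu(x @ w_up) @ w_down

# Column-parallel up, row-parallel down
shard = ffn // tp
partials = []
for r in range(tp):
    cols = slice(r * shard, (r + 1) * shard)
    h = gelu(x @ w_up[:, cols])         # local shard of expanded activation
    partials.append(h @ w_down[cols, :])  # partial output on this rank
out = sum(partials)  # the one all-reduce per MLP
```

The split axes are chosen precisely so the intermediate shard never has to move: splitting up-projection by columns and down-projection by rows makes the shapes line up rank-locally.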
Not every operation fits equally well
Large dense matmuls fit tensor parallelism well. But not everything inside a transformer looks like that. Normalization, residual additions, masking, and some indexing operations do not decompose along the same axes as the big matmuls, so they require more careful handling and can reduce the elegance of a parallel layout.
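Normalization is a good example of the mismatch. An RMS-style statistic is a mean over the full hidden dimension, so if activations are column-sharded, each rank only holds a partial sum and must exchange it before normalizing. The sketch below (toy sizes, illustrative names) shows that summing per-shard partial sums recovers the full statistic, which is the small extra reduction a sharded layout has to pay for.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, tp = 8, 2  # tp = tensor-parallel degree
x = rng.normal(size=(4, hidden))

# The RMS statistic needs the FULL hidden dimension
full_ms = (x ** 2).mean(axis=-1)

# With column-sharded activations, each rank has only a partial sum of
# squares; ranks must combine them (a small all-reduce) before they can
# normalize -- unlike the big matmuls, this op is not free under TP.
shard = hidden // tp
partial_sum = sum(
    (x[:, r * shard:(r + 1) * shard] ** 2).sum(axis=-1) for r in range(tp)
)
recovered_ms = partial_sum / hidden
```

This is a tiny reduction compared to the matmul all-reduces, but it constrains where sharded activations are allowed to live relative to the normalization layers.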
The next post extends this line of thought into sequence parallelism and long-context training, where activation size itself becomes a major design pressure.