From abstraction to block structure

The best way to understand tensor parallelism is to lay out a transformer block and ask where the largest computations live.

Usually the heavy pieces are:

  • QKV projection
  • attention output projection
  • MLP up projection
  • MLP down projection

Those are the most natural places to split work across ranks.
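To make the shapes concrete, here is a minimal NumPy sketch of those four matmuls. All dimensions are hypothetical, chosen only for illustration, and the attention computation itself is elided.

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration
seq, d_model = 8, 64
d_ff = 4 * d_model  # a common MLP expansion factor

rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_model))

W_qkv = rng.normal(size=(d_model, 3 * d_model))  # QKV projection
W_o   = rng.normal(size=(d_model, d_model))      # attention output projection
W_up  = rng.normal(size=(d_model, d_ff))         # MLP up projection
W_dn  = rng.normal(size=(d_ff, d_model))         # MLP down projection

qkv = x @ W_qkv                     # (seq, 3 * d_model)
attn = qkv[:, :d_model]             # stand-in for the attention result
attn_out = attn @ W_o               # (seq, d_model)
h = np.maximum(attn_out @ W_up, 0)  # (seq, d_ff), ReLU stand-in
y = h @ W_dn                        # (seq, d_model)
```

Each of these weight matrices is large and dense, which is exactly what makes them good candidates for sharding.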

What happens in attention

If the QKV projection is partitioned, each rank can handle a subset of attention heads or a slice of the hidden dimension. Head partitioning fits the transformer's structure especially well, because each head's attention is computed independently. Downstream, however, the per-rank outputs must be combined, typically with a reduction around the output projection.
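A small NumPy sketch of column-sharding a projection by head groups, with two simulated "ranks" in a list (all sizes are hypothetical, and only the query projection is shown for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, ranks = 8, 64, 2  # hypothetical sizes

x = rng.normal(size=(seq, d_model))
W_q = rng.normal(size=(d_model, d_model))  # query projection only

q_full = x @ W_q  # unsharded reference

# Column-shard W_q so each simulated rank owns a contiguous group of heads
shards = np.split(W_q, ranks, axis=1)
q_parts = [x @ w for w in shards]  # each rank computes only its own heads

# No communication was needed to produce the per-rank head activations;
# concatenating them recovers the full projection exactly.
assert np.allclose(np.concatenate(q_parts, axis=1), q_full)
```

Because the shards line up with head boundaries, each rank can also run its softmax locally; the combine only becomes necessary once the heads feed the output projection.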

So the important questions become:

  • which head or hidden partitioning is being used?
  • where does communication happen relative to softmax and projection?
  • how does longer sequence length affect activation movement?
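The last question can be made concrete with rough arithmetic. The numbers below are hypothetical, but they show the key property: per-layer activation communication grows linearly with sequence length.

```python
# Rough, hypothetical arithmetic: the size of one (batch, seq, hidden)
# activation tensor in fp16 (2 bytes per element), which is the unit
# of a typical per-layer all-reduce.
def activation_bytes(batch, seq, hidden, bytes_per_elem=2):
    return batch * seq * hidden * bytes_per_elem

short = activation_bytes(batch=1, seq=4096, hidden=8192)
long = activation_bytes(batch=1, seq=32768, hidden=8192)

print(short / 2**20, "MiB")  # 64.0 MiB at 4k sequence length
print(long / short)          # 8.0: volume scales linearly with seq
```

An 8x longer context means 8x more bytes moved per synchronization point, which is why activation movement starts to dominate the design at long context.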

What happens in the MLP

The MLP typically expands the hidden dimension (often by a factor of four) and then projects back down. This makes it a natural fit for the classic tensor-parallel layout: a column-parallel up projection followed by a row-parallel down projection.

That pattern works well because the intermediate activation stays sharded: the elementwise nonlinearity applies independently to each shard, so the two linear operations need only a single all-reduce after the down projection.
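A minimal NumPy sketch of that column-parallel / row-parallel pairing, simulating the ranks in a list (all sizes are hypothetical, with ReLU standing in for the nonlinearity):

```python
import numpy as np

rng = np.random.default_rng(1)
seq, d_model, d_ff, ranks = 8, 32, 128, 4  # hypothetical sizes

x = rng.normal(size=(seq, d_model))
W_up = rng.normal(size=(d_model, d_ff))
W_dn = rng.normal(size=(d_ff, d_model))

# Reference: the unsharded MLP
ref = np.maximum(x @ W_up, 0) @ W_dn

# Column-parallel up projection (shard W_up by columns) paired with
# row-parallel down projection (shard W_dn by rows).
up_shards = np.split(W_up, ranks, axis=1)
dn_shards = np.split(W_dn, ranks, axis=0)

# Each "rank" keeps its slice of the intermediate activation local;
# the elementwise nonlinearity applies shard-by-shard, no communication.
partials = [np.maximum(x @ u, 0) @ d for u, d in zip(up_shards, dn_shards)]

# One all-reduce (a plain sum here) at the end recovers the full output.
out = sum(partials)
assert np.allclose(out, ref)
```

The design choice to pair column sharding with row sharding is what eliminates the communication in the middle: the output columns of one shard are exactly the input rows the next shard needs.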

Not every operation fits equally well

Large dense matmuls map cleanly onto tensor parallelism. But not everything inside a transformer looks like that. Normalization needs statistics over the full hidden dimension, residual additions and masking must see activations in a consistent layout, and some indexing operations require careful sharding. These pieces introduce extra synchronization points and can erode the elegance of a parallel layout.
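Normalization is a concrete example: its statistics span the full hidden dimension, so normalizing each shard independently gives a different answer than normalizing the whole row. A small NumPy demonstration, using deterministic data so the mismatch is guaranteed:

```python
import numpy as np

# Hypothetical sizes; deterministic data so the disagreement is certain
hidden, ranks = 16, 2
x = np.arange(hidden, dtype=np.float64)

def layernorm(v, eps=1e-5):
    # Normalize using the mean and variance of the vector passed in
    return (v - v.mean()) / np.sqrt(v.var() + eps)

ref = layernorm(x)  # statistics over the full hidden dimension
sharded = np.concatenate([layernorm(s) for s in np.split(x, ranks)])

# Per-shard statistics disagree with full-row statistics, so naively
# normalizing each shard does not reproduce the reference.
print(np.allclose(ref, sharded))  # False
```

In practice this is why hidden-sharded layouts either gather before the norm or keep normalization in a replicated part of the layout.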

The next post extends this line of thought into sequence parallelism and long-context training, where activation size itself becomes a major design pressure.