When data parallelism stops being enough

When a model is too large to fit comfortably on one device, or replicating its parameters on every device is too expensive, data parallelism alone is not enough. That is where tensor parallelism comes in.

The main idea is simple: split large internal operations, especially matrix multiplies, across multiple GPUs.

For a large linear layer, for example, you might split:

  • along the output dimension
  • along the input dimension

The choice changes both the compute pattern and the communication pattern.
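To make the two splits concrete, here is a minimal single-machine sketch. The sizes (d_in, d_out, the number of simulated ranks) are arbitrary illustrative choices, not anything prescribed by a real system:

```python
import numpy as np

# Hypothetical sizes for illustration: a linear layer y = x @ W
# with W of shape (d_in, d_out), split across 2 simulated "ranks".
d_in, d_out, n_ranks = 8, 6, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out))

# Split along the output dimension: each rank holds a W[:, chunk] shard.
col_shards = np.split(W, n_ranks, axis=1)

# Split along the input dimension: each rank holds a W[chunk, :] shard.
row_shards = np.split(W, n_ranks, axis=0)

print([s.shape for s in col_shards])  # [(8, 3), (8, 3)]
print([s.shape for s in row_shards])  # [(4, 6), (4, 6)]
```

Either way, each rank stores only 1/n of the weight matrix; what differs is how inputs and outputs must move between ranks, which the next section spells out.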

Why transformers fit tensor parallelism well

Transformers apply large dense matmuls over and over: the attention projections and the MLP projections are the obvious examples. Those operations are regular enough, and large enough, that tensor parallelism can be worthwhile.

If an operation is too small or too irregular, communication may outweigh the benefit.

A useful mental model

For large linear layers, two common patterns appear:

column parallel

Split weights along the output dimension so each rank computes part of the output.
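A minimal sketch of the column-parallel pattern, simulated on one machine with NumPy (the sizes and two-rank split are illustrative assumptions). Each simulated rank sees the full input and computes a slice of the output; concatenating the slices stands in for the all-gather a real multi-GPU setup would need:

```python
import numpy as np

# Column parallelism: every rank multiplies the full input x by its
# output-dimension shard of W, producing a slice of the output.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))            # (batch, d_in)
W = rng.standard_normal((8, 6))            # (d_in, d_out)

shards = np.split(W, 2, axis=1)            # two (8, 3) shards
slices = [x @ s for s in shards]           # each rank: (4, 3)
y = np.concatenate(slices, axis=1)         # gather slices -> (4, 6)

assert np.allclose(y, x @ W)               # matches the unsharded matmul
```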

row parallel

Split weights along the input dimension so each rank contributes a partial result that later needs reduction.
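The row-parallel pattern can be sketched the same way (again with assumed illustrative sizes). Here the input must also be split along its feature dimension to match each rank's shard; every rank emits a full-shape partial output, and summing the partials stands in for the all-reduce a real setup would perform:

```python
import numpy as np

# Row parallelism: each rank multiplies its slice of the input features
# by its input-dimension shard of W, giving a full-shape partial result.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))            # (batch, d_in)
W = rng.standard_normal((8, 6))            # (d_in, d_out)

x_shards = np.split(x, 2, axis=1)          # two (4, 4) input slices
W_shards = np.split(W, 2, axis=0)          # two (4, 6) weight shards
partials = [xs @ ws for xs, ws in zip(x_shards, W_shards)]
y = sum(partials)                          # reduce partials -> (4, 6)

assert np.allclose(y, x @ W)               # exact after the reduction
```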

In both cases, the basic tradeoff is the same: less memory pressure per rank, but more in-layer communication.

What to ask before using it

  • is the operation large enough to justify the extra communication?
  • does the topology make low-latency communication realistic?
  • how cleanly do activations flow into the next layer?
  • how many additional collectives does the split introduce?

The next post makes this concrete inside a transformer block by looking at where tensor parallelism actually shows up in attention and MLP layers.