When data parallelism stops being enough

When a model is too large to fit comfortably on one device, or replicating its parameters on every device is too expensive, data parallelism alone is not enough. That is where tensor parallelism comes in.

The main idea is simple: split large internal operations, especially matrix multiplies, across multiple GPUs.

For a large linear layer, for example, you might split:

  • along the output dimension
  • along the input dimension

The choice changes both the compute pattern and the communication pattern.
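To make the two splits concrete, here is a minimal single-machine sketch. The sizes (d_in, d_out, the number of simulated ranks) are arbitrary illustrative choices, not anything prescribed by a real system:

```python
import numpy as np

# Hypothetical sizes for illustration: a linear layer y = x @ W
# with W of shape (d_in, d_out), split across 2 simulated "ranks".
d_in, d_out, n_ranks = 8, 6, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out))

# Split along the output dimension: each rank holds a W[:, chunk] shard.
col_shards = np.split(W, n_ranks, axis=1)

# Split along the input dimension: each rank holds a W[chunk, :] shard.
row_shards = np.split(W, n_ranks, axis=0)

print([s.shape for s in col_shards])  # [(8, 3), (8, 3)]
print([s.shape for s in row_shards])  # [(4, 6), (4, 6)]
```

Either way, each rank stores only 1/n of the weight matrix; what differs is how inputs and outputs must move between ranks, which the next section spells out.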

Why transformers fit tensor parallelism well

Transformers apply large dense matmuls over and over: the attention projections and the MLP projections are the obvious examples. Those operations are regular enough, and large enough, that tensor parallelism can be worthwhile.

If an operation is too small or too irregular, communication may outweigh the benefit.

A useful mental model

For large linear layers, two common patterns appear:

column parallel

Split weights along the output dimension so each rank computes part of the output.
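A minimal sketch of the column-parallel pattern, simulated on one machine with NumPy (the sizes and two-rank split are illustrative assumptions). Each simulated rank sees the full input and computes a slice of the output; concatenating the slices stands in for the all-gather a real multi-GPU setup would need:

```python
import numpy as np

# Column parallelism: every rank multiplies the full input x by its
# output-dimension shard of W, producing a slice of the output.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))            # (batch, d_in)
W = rng.standard_normal((8, 6))            # (d_in, d_out)

shards = np.split(W, 2, axis=1)            # two (8, 3) shards
slices = [x @ s for s in shards]           # each rank: (4, 3)
y = np.concatenate(slices, axis=1)         # gather slices -> (4, 6)

assert np.allclose(y, x @ W)               # matches the unsharded matmul
```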

row parallel

Split weights along the input dimension so each rank contributes a partial result that later needs reduction.
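The row-parallel pattern can be sketched the same way (again with assumed illustrative sizes). Here the input must also be split along its feature dimension to match each rank's shard; every rank emits a full-shape partial output, and summing the partials stands in for the all-reduce a real setup would perform:

```python
import numpy as np

# Row parallelism: each rank multiplies its slice of the input features
# by its input-dimension shard of W, giving a full-shape partial result.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))            # (batch, d_in)
W = rng.standard_normal((8, 6))            # (d_in, d_out)

x_shards = np.split(x, 2, axis=1)          # two (4, 4) input slices
W_shards = np.split(W, 2, axis=0)          # two (4, 6) weight shards
partials = [xs @ ws for xs, ws in zip(x_shards, W_shards)]
y = sum(partials)                          # reduce partials -> (4, 6)

assert np.allclose(y, x @ W)               # exact after the reduction
```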

In both cases, the basic tradeoff is the same: less memory pressure per rank, but more in-layer communication.

What to ask before using it

  • is the operation large enough to justify the extra communication?
  • does the topology make low-latency communication realistic?
  • how cleanly do activations flow into the next layer?
  • how many additional collectives does the split introduce?

The next post makes this concrete inside a transformer block by looking at where tensor parallelism actually shows up in attention and MLP layers.