Distributed LLM Training 08 - Tensor Parallel Basics: Splitting Computation Inside the Model
Once the model itself is too large for one device, data parallelism is no longer enough and layer-internal computation has to be split
When data parallelism stops being enough
If the model is too large to fit comfortably on one device, or if parameter replication is too expensive, data parallelism alone is not enough. That is where tensor parallelism enters.
The main idea is simple: split large internal operations, especially matrix multiplies, across multiple GPUs.
For a large linear layer, for example, you might split:
- along the output dimension
- along the input dimension
The choice changes both the compute pattern and the communication pattern.
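To make the two choices concrete, here is a shapes-only sketch of how a single weight matrix would be sharded each way. The dimensions and the `world_size` are made-up illustrative numbers, not taken from any real model.

```python
# Sketch: two ways to shard a linear layer's weight W of shape (d_in, d_out)
# across `world_size` ranks. Illustrative numbers, not a real configuration.

d_in, d_out, world_size = 8, 12, 4

# Split along the output dimension ("column parallel"):
# each rank holds a (d_in, d_out // world_size) slice and computes
# a slice of the output columns.
col_shard_shape = (d_in, d_out // world_size)

# Split along the input dimension ("row parallel"):
# each rank holds a (d_in // world_size, d_out) slice and computes
# a partial sum of the full output.
row_shard_shape = (d_in // world_size, d_out)

print(col_shard_shape)  # (8, 3)
print(row_shard_shape)  # (2, 12)
```

Note that the output split leaves each rank with a valid slice of the result, while the input split leaves each rank with a partial sum over the whole result; that difference is what drives the communication pattern.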
Why transformers fit tensor parallelism well
Transformers repeatedly apply large dense matmuls: the attention projections and the MLP projections are the obvious examples. Those operations are regular enough, and large enough, that tensor parallelism can be worthwhile.
If an operation is too small or too irregular, the communication cost can outweigh the compute savings.
A useful mental model
For large linear layers, two common patterns appear:
column parallel
Split weights along the output dimension so each rank computes part of the output.
row parallel
Split weights along the input dimension so each rank produces a partial sum of the full output, which then has to be combined, typically with an all-reduce.
In both cases, the basic tradeoff is the same: less memory pressure per rank, but more in-layer communication.
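The two patterns can be checked end to end in a single process. The sketch below uses plain Python lists rather than a tensor framework, and simulates the ranks with a loop; concatenating the column-parallel pieces stands in for an all-gather, and summing the row-parallel pieces stands in for an all-reduce.

```python
# Single-process simulation of column- vs row-parallel matmul.
# Plain Python, no framework: a sketch of the math, not a distributed system.

def matmul(x, w):
    # x: (m, k), w: (k, n), both as nested lists.
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

x = [[1.0, 2.0, 3.0, 4.0]]                                # activations (1, 4)
w = [[float(i + j) for j in range(6)] for i in range(4)]  # weights (4, 6)
full = matmul(x, w)

world_size = 2

# Column parallel: each "rank" owns a slice of output columns.
# The full result is the concatenation of per-rank outputs (all-gather).
cols = 6 // world_size
col_parts = []
for r in range(world_size):
    w_shard = [row[r * cols:(r + 1) * cols] for row in w]
    col_parts.append(matmul(x, w_shard))
col_result = [[v for p in col_parts for v in p[0]]]

# Row parallel: each "rank" owns a slice of input rows (and the matching
# slice of the activations). Each rank's output is a partial sum of the
# full output; the partials are summed elementwise (all-reduce).
rows = 4 // world_size
row_parts = []
for r in range(world_size):
    x_shard = [x[0][r * rows:(r + 1) * rows]]
    w_shard = w[r * rows:(r + 1) * rows]
    row_parts.append(matmul(x_shard, w_shard))
row_result = [[sum(p[0][j] for p in row_parts) for j in range(6)]]

assert col_result == full
assert row_result == full
```

Both shardings recover the unsharded result; the difference is only in what each rank holds and which collective stitches the pieces back together.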
What to ask before using it
- is the operation large enough to justify the extra communication?
- does the topology make low-latency communication realistic?
- how cleanly do activations flow into the next layer?
- how many additional collectives does the split introduce?
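The first two questions can be roughed out on the back of an envelope before touching any hardware. The helper below is a sketch of that estimate under stated assumptions: `flops_per_sec` and `link_bytes_per_sec` are hypothetical hardware numbers you would substitute for your own, and the communication term models a ring all-reduce of a row-parallel layer's output.

```python
# Back-of-envelope check: is a sharded matmul's per-rank compute large
# enough to justify its collective? All numbers are illustrative.

def shard_costs(m, k, n, world_size, flops_per_sec, link_bytes_per_sec,
                bytes_per_elem=2):
    # Per-rank compute for an (m, k) @ (k, n) matmul split across ranks:
    # 2*m*k*n flops total, divided by world_size.
    compute_s = 2 * m * k * n / world_size / flops_per_sec
    # Row-parallel needs an all-reduce of the (m, n) output; a ring
    # all-reduce moves ~2 * (world_size - 1) / world_size of it per rank.
    comm_s = (2 * (world_size - 1) / world_size) * m * n * bytes_per_elem \
             / link_bytes_per_sec
    return compute_s, comm_s

# Hypothetical hardware: ~300 TFLOP/s per GPU, ~400 GB/s per link, bf16.
big = shard_costs(4096, 28672, 8192, 8, 300e12, 400e9)    # large MLP proj
small = shard_costs(8, 256, 256, 8, 300e12, 400e9)        # tiny layer
```

On these made-up numbers, the large projection's compute dominates its all-reduce, while the tiny layer is entirely communication-bound, which is the quantitative version of the "too small to be worth it" warning above.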
The next post makes this concrete inside a transformer block by looking at where tensor parallelism actually shows up in attention and MLP layers.