Distributed LLM Training 06 - Where LLM Training Memory Actually Goes
Looking only at parameter size leads to bad decisions; training memory is really a combination of parameters, gradients, optimizer state, and activations
Memory accounting needs to be broken down
When people first reason about large models, they often ask how large the model weights are. That is not enough for training. The real memory picture includes:
- parameters
- gradients
- optimizer state
- activations
On top of that, temporary buffers, allocator fragmentation, and communication buffers can matter.
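As a rough sketch of the static part of that budget (assuming mixed-precision training with an Adam-style optimizer; the byte counts are the commonly cited defaults, not measured values), the per-parameter tally looks like this:

```python
def static_training_memory_gb(n_params,
                              param_bytes=2,    # fp16/bf16 weights
                              grad_bytes=2,     # fp16/bf16 gradients
                              optim_bytes=12):  # fp32 master copy + Adam m and v
    """Rough per-parameter tally of the static training footprint.

    Excludes activations, temporary buffers, fragmentation, and
    communication buffers, which all come on top of this.
    """
    return n_params * (param_bytes + grad_bytes + optim_bytes) / 1e9

# A 7B-parameter model: 7e9 * 16 bytes = 112 GB before any activations.
seven_b_gb = static_training_memory_gb(7e9)
```

So a model whose fp16 weights fit comfortably on one GPU can still blow past device memory once gradients and optimizer state are counted.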
Which part is actually largest?
That depends on the workload, but optimizer state and activations are often much larger than people expect.
parameters
These are the model weights themselves. They dominate inference discussions, but during training they are only one of several major consumers, and frequently not the largest.
gradients
The backward pass produces them, and they typically persist until the optimizer step consumes them. In plain data-parallel setups, every rank holds a full gradient replica, usually in the same dtype as the parameters.
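A quick sanity check (a sketch, assuming 2-byte fp16/bf16 gradients): gradient memory mirrors parameter count, and plain data parallelism does not shrink it per rank:

```python
def grad_memory_gb(n_params, grad_bytes=2):
    # Gradients have one element per parameter, so their footprint is
    # n_params * bytes-per-element, replicated in full on every DP rank.
    return n_params * grad_bytes / 1e9

# A 7B model with 2-byte gradients carries ~14 GB of gradients on each
# rank, no matter how many data-parallel ranks there are.
per_rank_grads_gb = grad_memory_gb(7e9)
```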
optimizer state
Adam-style optimizers keep running first- and second-moment estimates for every parameter (and, in mixed precision, often an fp32 master copy as well), so optimizer state can easily exceed parameter memory.
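Concretely, in one common mixed-precision setup (an illustrative tally, not a universal rule): the live weights are 2-byte fp16/bf16, while Adam keeps an fp32 master copy plus fp32 first and second moments:

```python
fp16_param_bytes = 2
adam_state_bytes = 4 + 4 + 4  # fp32 master copy + first moment m + second moment v

# In this setup optimizer state is 6x the size of the fp16 parameters.
ratio = adam_state_bytes / fp16_param_bytes
```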
activations
Intermediate outputs of the forward pass must often be kept until the backward pass consumes them. Longer sequences, larger hidden sizes, and bigger micro-batches all drive activation memory sharply upward, and naive attention adds a term that grows quadratically with sequence length.
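A commonly cited per-layer estimate for a standard transformer block without checkpointing (from Korthikanti et al., "Reducing Activation Recomputation in Large Transformer Models"; bytes, assuming 2-byte activations) is s·b·h·(34 + 5·a·s/h). A sketch, treating the coefficients as that paper's accounting rather than a universal law:

```python
def activation_bytes_per_layer(batch, seq, hidden, heads):
    """Rough per-layer activation bytes for a standard transformer block,
    no checkpointing, 2-byte activations: s*b*h*(34 + 5*a*s/h).
    The 5*a*s/h term comes from materializing attention scores, so it
    grows quadratically with sequence length.
    """
    return seq * batch * hidden * (34 + 5 * heads * seq / hidden)

# Doubling the batch doubles activations; doubling the sequence length
# more than doubles them because of the attention-score term.
```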
Why this decomposition matters
Different distributed techniques reduce different parts:
- activation checkpointing reduces activation storage
- ZeRO stage 1 shards optimizer state
- ZeRO stage 2 shards gradients too
- ZeRO stage 3 and FSDP shard parameters as well
- tensor parallelism splits individual layers across ranks, redistributing both computation and parameter placement
So the right strategy depends on what is actually causing peak memory.
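The ZeRO stages above can be sketched as a per-rank bytes-per-parameter calculation (following the ZeRO paper's accounting: 2-byte params, 2-byte grads, 12 bytes of optimizer state; real implementations differ in the details):

```python
def per_rank_bytes_per_param(zero_stage, world_size,
                             p=2,    # fp16/bf16 parameters
                             g=2,    # fp16/bf16 gradients
                             o=12):  # fp32 master copy + Adam moments
    """Static per-rank memory per parameter under each ZeRO stage."""
    if zero_stage == 0:   # plain data parallel: everything replicated
        return p + g + o
    if zero_stage == 1:   # shard optimizer state
        return p + g + o / world_size
    if zero_stage == 2:   # shard gradients too
        return p + (g + o) / world_size
    if zero_stage == 3:   # shard parameters as well (FSDP-style)
        return (p + g + o) / world_size
    raise ValueError(zero_stage)

# On 8 ranks: 16.0, 5.5, 3.75, and 2.0 bytes per parameter respectively.
```

Note that none of these stages touch activations, which is why checkpointing remains a separate lever.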
A common mistake
Many teams jump to model parallelism as soon as they see an OOM. But sometimes activation checkpointing or a micro-batch change is enough. In other cases, parameter replication is already so expensive that checkpointing alone cannot help.
That is why the first questions should be:
- is activation memory the real hotspot?
- is optimizer state dominating?
- is parameter replication already too large?
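Those questions can be turned into a first-pass triage. This is a hypothetical helper, not a real API; it just encodes the technique-to-component mapping described above:

```python
def suggest_mitigation(activation_gb, optimizer_gb, param_gb):
    """Pick the dominant memory component and name the technique
    from the list above that targets it."""
    parts = {
        "activations": activation_gb,
        "optimizer state": optimizer_gb,
        "parameters": param_gb,
    }
    hotspot = max(parts, key=parts.get)
    advice = {
        "activations": "activation checkpointing or a smaller micro-batch",
        "optimizer state": "ZeRO stage 1 (shard optimizer state)",
        "parameters": "ZeRO stage 3 / FSDP, or tensor parallelism",
    }
    return hotspot, advice[hotspot]
```

For example, a run whose breakdown is 40 GB of activations, 20 GB of optimizer state, and 10 GB of parameters should reach for checkpointing before any form of model parallelism.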
The next post moves from memory to transport and topology by looking at NCCL and why the same strategy behaves differently on different hardware layouts.