Distributed LLM Training 06 - Where LLM Training Memory Actually Goes
Looking only at parameter size leads to bad decisions; training memory is really a combination of parameters, gradients, optimizer state, and activations
Memory accounting needs to be broken down
When people first reason about large models, they often ask how large the model weights are. That is not enough for training. The real memory picture includes:
- parameters
- gradients
- optimizer state
- activations
On top of that, temporary buffers, allocator fragmentation, and communication buffers can matter.
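As a rough sketch of the static part of that budget (assuming mixed-precision training with an Adam-style optimizer; the byte counts are the commonly cited defaults, not measured values), the per-parameter tally looks like this:

```python
def static_training_memory_gb(n_params,
                              param_bytes=2,    # fp16/bf16 weights
                              grad_bytes=2,     # fp16/bf16 gradients
                              optim_bytes=12):  # fp32 master copy + Adam m and v
    """Rough per-parameter tally of the static training footprint.

    Excludes activations, temporary buffers, fragmentation, and
    communication buffers, which all come on top of this.
    """
    return n_params * (param_bytes + grad_bytes + optim_bytes) / 1e9

# A 7B-parameter model: 7e9 * 16 bytes = 112 GB before any activations.
seven_b_gb = static_training_memory_gb(7e9)
```

So a model whose fp16 weights fit comfortably on one GPU can still blow past device memory once gradients and optimizer state are counted.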
Which part is actually largest?
That depends on the workload, but optimizer state and activations are often much larger than people expect.
parameters
These are the model weights themselves. They dominate inference discussions, but during training they are only one of several major consumers, and frequently not the largest.
gradients
The backward pass produces them, and they typically persist until the optimizer step consumes them. In plain data-parallel setups, every rank holds a full gradient replica, usually in the same dtype as the parameters.
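A quick sanity check (a sketch, assuming 2-byte fp16/bf16 gradients): gradient memory mirrors parameter count, and plain data parallelism does not shrink it per rank:

```python
def grad_memory_gb(n_params, grad_bytes=2):
    # Gradients have one element per parameter, so their footprint is
    # n_params * bytes-per-element, replicated in full on every DP rank.
    return n_params * grad_bytes / 1e9

# A 7B model with 2-byte gradients carries ~14 GB of gradients on each
# rank, no matter how many data-parallel ranks there are.
per_rank_grads_gb = grad_memory_gb(7e9)
```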
optimizer state
Adam-style optimizers keep running first- and second-moment estimates for every parameter (and, in mixed precision, often an fp32 master copy as well), so optimizer state can easily exceed parameter memory.
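Concretely, in one common mixed-precision setup (an illustrative tally, not a universal rule): the live weights are 2-byte fp16/bf16, while Adam keeps an fp32 master copy plus fp32 first and second moments:

```python
fp16_param_bytes = 2
adam_state_bytes = 4 + 4 + 4  # fp32 master copy + first moment m + second moment v

# In this setup optimizer state is 6x the size of the fp16 parameters.
ratio = adam_state_bytes / fp16_param_bytes
```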
activations
Intermediate outputs of the forward pass must often be kept until the backward pass consumes them. Longer sequences, larger hidden sizes, and bigger micro-batches all drive activation memory sharply upward, and naive attention adds a term that grows quadratically with sequence length.
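A commonly cited per-layer estimate for a standard transformer block without checkpointing (from Korthikanti et al., "Reducing Activation Recomputation in Large Transformer Models"; bytes, assuming 2-byte activations) is s·b·h·(34 + 5·a·s/h). A sketch, treating the coefficients as that paper's accounting rather than a universal law:

```python
def activation_bytes_per_layer(batch, seq, hidden, heads):
    """Rough per-layer activation bytes for a standard transformer block,
    no checkpointing, 2-byte activations: s*b*h*(34 + 5*a*s/h).
    The 5*a*s/h term comes from materializing attention scores, so it
    grows quadratically with sequence length.
    """
    return seq * batch * hidden * (34 + 5 * heads * seq / hidden)

# Doubling the batch doubles activations; doubling the sequence length
# more than doubles them because of the attention-score term.
```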
Why this decomposition matters
Different distributed techniques reduce different parts:
- activation checkpointing reduces activation storage
- ZeRO stage 1 shards optimizer state
- ZeRO stage 2 shards gradients too
- ZeRO stage 3 and FSDP shard parameters as well
- tensor parallelism splits individual layers across ranks, redistributing both computation and parameter placement
So the right strategy depends on what is actually causing peak memory.
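The ZeRO stages above can be sketched as a per-rank bytes-per-parameter calculation (following the ZeRO paper's accounting: 2-byte params, 2-byte grads, 12 bytes of optimizer state; real implementations differ in the details):

```python
def per_rank_bytes_per_param(zero_stage, world_size,
                             p=2,    # fp16/bf16 parameters
                             g=2,    # fp16/bf16 gradients
                             o=12):  # fp32 master copy + Adam moments
    """Static per-rank memory per parameter under each ZeRO stage."""
    if zero_stage == 0:   # plain data parallel: everything replicated
        return p + g + o
    if zero_stage == 1:   # shard optimizer state
        return p + g + o / world_size
    if zero_stage == 2:   # shard gradients too
        return p + (g + o) / world_size
    if zero_stage == 3:   # shard parameters as well (FSDP-style)
        return (p + g + o) / world_size
    raise ValueError(zero_stage)

# On 8 ranks: 16.0, 5.5, 3.75, and 2.0 bytes per parameter respectively.
```

Note that none of these stages touch activations, which is why checkpointing remains a separate lever.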
A common mistake
Many teams jump to model parallelism as soon as they see an OOM. But sometimes activation checkpointing or a micro-batch change is enough. In other cases, parameter replication is already so expensive that checkpointing alone cannot help.
That is why the first questions should be:
- is activation memory the real hotspot?
- is optimizer state dominating?
- is parameter replication already too large?
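Those questions can be turned into a first-pass triage. This is a hypothetical helper, not a real API; it just encodes the technique-to-component mapping described above:

```python
def suggest_mitigation(activation_gb, optimizer_gb, param_gb):
    """Pick the dominant memory component and name the technique
    from the list above that targets it."""
    parts = {
        "activations": activation_gb,
        "optimizer state": optimizer_gb,
        "parameters": param_gb,
    }
    hotspot = max(parts, key=parts.get)
    advice = {
        "activations": "activation checkpointing or a smaller micro-batch",
        "optimizer state": "ZeRO stage 1 (shard optimizer state)",
        "parameters": "ZeRO stage 3 / FSDP, or tensor parallelism",
    }
    return hotspot, advice[hotspot]
```

For example, a run whose breakdown is 40 GB of activations, 20 GB of optimizer state, and 10 GB of parameters should reach for checkpointing before any form of model parallelism.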
The next post moves from memory to transport and topology by looking at NCCL and why the same strategy behaves differently on different hardware layouts.