Memory accounting needs to be broken down

When people first reason about large models, they often ask only how large the model weights are. That answers an inference question; for training it is not enough. The real memory picture includes:

  • parameters
  • gradients
  • optimizer state
  • activations

On top of that, temporary buffers, allocator fragmentation, and communication buffers can matter.

Which part is actually largest?

That depends on the workload, but optimizer state and activations are often much larger than people expect.
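
To make that concrete, here is a back-of-envelope tally for a hypothetical 7B-parameter model trained with Adam in mixed precision. The bf16/fp32 storage choices below are illustrative assumptions, not a measurement of any particular framework:

```python
def training_memory_gb(n_params: float) -> dict:
    """Rough per-component memory, ignoring activations and buffers.
    Assumes bf16 params and grads, fp32 Adam state and master weights."""
    GB = 1024 ** 3
    return {
        "params (bf16)": n_params * 2 / GB,
        "grads (bf16)": n_params * 2 / GB,
        # Adam: fp32 first moment + fp32 second moment + fp32 master weights
        "optimizer state (fp32)": n_params * (4 + 4 + 4) / GB,
    }

breakdown = training_memory_gb(7e9)
for name, gb in breakdown.items():
    print(f"{name}: {gb:.1f} GB")
```

Under these assumptions the optimizer state alone is six times the size of the bf16 weights, before a single activation is stored.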

parameters

These are the model weights themselves. They dominate many inference discussions, but training has much more going on.
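
The arithmetic here is just element count times bytes per element, which depends on the storage dtype. A minimal sketch (the dtype sizes are standard; the 7B model size is a made-up example):

```python
# Bytes per element for common storage dtypes.
DTYPE_BYTES = {"fp32": 4, "bf16": 2, "int8": 1}

def param_gb(n_params: float, dtype: str) -> float:
    """Parameter memory in GiB for a given element count and dtype."""
    return n_params * DTYPE_BYTES[dtype] / 1024 ** 3

for dtype in DTYPE_BYTES:
    print(dtype, round(param_gb(7e9, dtype), 1))
```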

gradients

The backward pass produces them, and they usually persist until the optimizer step. In data-parallel setups, a full gradient replica exists on each rank.
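
A hedged sketch of what that replication means at the cluster level (plain arithmetic, not tied to any framework):

```python
def cluster_gradient_gb(n_params: float, n_ranks: int,
                        bytes_per_elem: int = 2) -> float:
    """Total gradient bytes held across a data-parallel job:
    each rank keeps its own full copy until gradients are averaged."""
    return n_params * bytes_per_elem * n_ranks / 1024 ** 3

# One rank holds ~13 GB of bf16 gradients for 7B params;
# 64 data-parallel ranks hold 64 copies of the same information.
print(cluster_gradient_gb(7e9, 1))
print(cluster_gradient_gb(7e9, 64))
```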

optimizer state

Adam-style optimizers store extra per-parameter statistics (typically first and second moments), so the optimizer state can easily exceed parameter memory.
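
To see why "Adam-style" matters, compare per-parameter state bytes against a momentum-SGD baseline. The 2-byte/4-byte sizes assume bf16 weights with fp32 optimizer storage (an illustrative convention, not a universal one):

```python
def optimizer_state_bytes(optimizer: str) -> int:
    """Optimizer state bytes per parameter, assuming fp32 storage."""
    if optimizer == "sgd_momentum":
        return 4                 # one fp32 momentum buffer
    if optimizer == "adam":
        return 4 + 4 + 4         # fp32 m, fp32 v, fp32 master weights
    raise ValueError(optimizer)

PARAM_BYTES = 2                  # bf16 weight
print(optimizer_state_bytes("adam") / PARAM_BYTES)          # 6x the weights
print(optimizer_state_bytes("sgd_momentum") / PARAM_BYTES)  # 2x the weights
```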

activations

Intermediate activations from the forward pass must often be kept for the backward pass. Longer sequences, larger hidden sizes, and larger batches all drive activation memory sharply upward.
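
Exact activation footprints depend on the architecture and implementation, but the scaling itself is easy to sketch. The factor k below is a stand-in for how many (batch, seq, hidden)-shaped tensors a layer keeps — an assumption for illustration, not a measured constant; the point is the linear growth in every dimension:

```python
def activation_bytes(batch: int, seq: int, hidden: int, layers: int,
                     bytes_per_elem: int = 2, k: int = 16) -> int:
    """Rough activation memory: each layer keeps roughly k tensors of
    shape (batch, seq, hidden); k is architecture-dependent (assumed)."""
    return batch * seq * hidden * layers * bytes_per_elem * k

base = activation_bytes(batch=4, seq=2048, hidden=4096, layers=32)
doubled_seq = activation_bytes(batch=4, seq=4096, hidden=4096, layers=32)
print(doubled_seq / base)  # doubling sequence length doubles the estimate
```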

Why this decomposition matters

Different distributed techniques reduce different parts:

  • activation checkpointing reduces activation storage
  • ZeRO stage 1 shards optimizer state
  • ZeRO stage 2 shards gradients too
  • ZeRO stage 3 and FSDP shard parameters as well
  • tensor parallel redistributes model computation and parameter placement

So the right strategy depends on what is actually causing peak memory.
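
The sharding stages above can be sketched as per-rank arithmetic. This follows the common accounting of 2-byte params and grads plus 12 bytes of optimizer state per parameter; the function and its exact numbers are an illustration of the pattern, not any framework's implementation:

```python
def per_rank_gb(n_params: float, n_ranks: int, stage: int) -> float:
    """Per-rank memory for params + grads + optimizer state.
    stage 0 = plain data parallel; 1/2/3 = ZeRO-style sharding stages."""
    GB = 1024 ** 3
    p, g, o = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        o /= n_ranks   # stage 1: shard optimizer state
    if stage >= 2:
        g /= n_ranks   # stage 2: shard gradients too
    if stage >= 3:
        p /= n_ranks   # stage 3: shard parameters as well
    return (p + g + o) / GB

for stage in range(4):
    print(f"stage {stage}: {per_rank_gb(7e9, 64, stage):.1f} GB/rank")
```

Note that none of these stages touch activation memory, which is exactly why the decomposition matters.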

A common mistake

Many teams jump to model parallelism as soon as they see an out-of-memory (OOM) error. But sometimes activation checkpointing or a micro-batch change is enough. In other cases, parameter replication is already so expensive that checkpointing alone cannot help.

That is why the first questions should be:

  • is activation memory the real hotspot?
  • is optimizer state dominating?
  • is parameter replication already too large?
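
Those questions can be answered with the same arithmetic used throughout this post. A toy triage helper (illustrative only; the names and suggested levers are my own, not a prescription):

```python
def dominant_component(params_gb: float, grads_gb: float,
                       optimizer_gb: float, activations_gb: float):
    """Return the largest memory component and a first lever to try."""
    levers = {
        "activations": "activation checkpointing or smaller micro-batches",
        "optimizer": "shard optimizer state (ZeRO stage 1)",
        "grads": "shard gradients too (ZeRO stage 2)",
        "params": "shard parameters (ZeRO stage 3 / FSDP)",
    }
    sizes = {
        "params": params_gb,
        "grads": grads_gb,
        "optimizer": optimizer_gb,
        "activations": activations_gb,
    }
    name = max(sizes, key=sizes.get)
    return name, levers[name]

# Example: the 7B mixed-precision figures from earlier in this post.
print(dominant_component(13, 13, 78, 40))
```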

The next post moves from memory to transport and topology by looking at NCCL and why the same strategy behaves differently on different hardware layouts.