The baseline pattern

The usual starting point for distributed training is data parallelism. Every GPU keeps the same model replica, processes a different mini-batch, then synchronizes gradients before the optimizer step.

At a high level:

  1. replicate the model on every rank
  2. split the input batch across ranks
  3. run forward and backward locally
  4. synchronize gradients with all-reduce
  5. perform the same optimizer step on every rank
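
The five steps above can be sketched in a few lines. This is a single-process simulation, not a real distributed API: ranks are plain loops, the model is one scalar weight fitting y = w * x, and `all_reduce_mean` stands in for the collective.

```python
# Single-process sketch of one synchronous data-parallel step.
# All names here are illustrative, not a real framework API.

def local_gradient(w, batch):
    """Mean gradient of 0.5*(w*x - y)^2 over a local shard (step 3)."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for all-reduce: every rank receives the mean (step 4)."""
    m = sum(values) / len(values)
    return [m] * len(values)

def data_parallel_step(w, global_batch, num_ranks, lr=0.1):
    shard = len(global_batch) // num_ranks
    shards = [global_batch[i * shard:(i + 1) * shard]
              for i in range(num_ranks)]                  # step 2
    grads = [local_gradient(w, s) for s in shards]        # step 3
    grads = all_reduce_mean(grads)                        # step 4
    return [w - lr * g for g in grads]                    # step 5

data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
ws = data_parallel_step(w=0.0, global_batch=data, num_ranks=2)
# Because every rank applies the same update to the same state,
# all replicas stay bit-identical after the step.
assert len(set(ws)) == 1
```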

That baseline matters because many later techniques exist to fix its weaknesses.

Why synchronous SGD is the default

Most practical LLM training still prefers synchronous updates. The main reason is that every rank sees the same parameter state at each step, which makes training behavior more stable and easier to debug. Asynchronous updates may reduce waiting in some cases, but they introduce stale gradients and a harder optimization problem.

For large language models, the extra instability is usually not worth it.
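
One way to see why synchronous updates are predictable: averaging per-rank gradients over equal-size shards reproduces the full-batch gradient exactly, so N synchronous ranks behave like one worker with an N-times larger batch. A toy 1-D least-squares check:

```python
# Averaging shard gradients equals the full-batch gradient when
# shards are equal-size, which is why synchronous data parallelism
# matches ordinary large-batch SGD step for step.

def grad(w, batch):
    """Mean gradient of 0.5*(w*x - y)^2 over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

batch = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
shards = [batch[:2], batch[2:]]

full = grad(0.5, batch)
averaged = sum(grad(0.5, s) for s in shards) / len(shards)
assert abs(full - averaged) < 1e-12
```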

Why data parallelism is attractive

It has real strengths:

  • it is conceptually simple
  • the model code usually needs minimal changes
  • every rank runs the same graph
  • compute scales reasonably well when batches are large enough

If the model fits comfortably on one GPU and communication is not yet dominant, this is usually the first thing to try.

The two costs that appear immediately

1. Full state replication

Every rank holds the full parameter set, full gradients, and full optimizer state. With Adam-like optimizers, which keep two moment estimates per parameter, the optimizer state alone often exceeds the parameters themselves.

If parameters take P bytes, a rough per-rank accounting looks like:

  • parameters: P
  • gradients: P
  • optimizer state: often 2P or more

So data parallelism distributes compute much better than it distributes memory.
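
That accounting is easy to make concrete. The sketch below assumes fp32 everywhere and an Adam-style optimizer with two moment tensors, matching the P / P / 2P breakdown above; mixed-precision recipes change the exact multipliers.

```python
# Rough per-rank memory accounting under plain data parallelism.
# Assumptions: fp32 values (4 bytes) and two optimizer tensors
# (Adam's first and second moments). Activations are not counted.

def replicated_bytes(num_params, bytes_per_value=4, optimizer_tensors=2):
    p = num_params * bytes_per_value
    return {
        "parameters": p,
        "gradients": p,
        "optimizer_state": optimizer_tensors * p,
        "total": (2 + optimizer_tensors) * p,
    }

# A 7B-parameter model in fp32: every rank replicates ~112 GB of
# state before a single activation is stored.
mem = replicated_bytes(7_000_000_000)
print(mem["total"] / 1e9)  # → 112.0 (GB)
```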

2. Gradient synchronization

After backward, gradients must be synchronized. As the model grows, all-reduce becomes more expensive. On multi-node setups, communication can dominate step time even when raw GPU compute is fine.
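
A simple bandwidth-only cost model makes the scaling visible. For a ring all-reduce, each rank sends and receives about 2 * (N - 1) / N * P bytes, so the time is nearly independent of rank count but linear in model size; the numbers below (link speed, model size) are illustrative assumptions, and latency terms are ignored.

```python
# Bandwidth-only cost model for ring all-reduce (latency ignored).
# Traffic per rank is 2 * (N - 1) / N * payload bytes, which is why
# all-reduce time grows with model size, not with rank count.

def ring_all_reduce_seconds(payload_bytes, num_ranks, link_bytes_per_s):
    traffic = 2 * (num_ranks - 1) / num_ranks * payload_bytes
    return traffic / link_bytes_per_s

# fp16 gradients of a 7B-parameter model (14 GB) over an assumed
# 50 GB/s inter-node link, 16 ranks:
t = ring_all_reduce_seconds(14e9, num_ranks=16, link_bytes_per_s=50e9)
print(round(t, 3))  # seconds per gradient synchronization
```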

Why global batch size matters

As you add ranks, global batch size grows:

global_batch = micro_batch_per_rank * num_ranks * grad_accum_steps

That changes optimizer behavior, not just throughput. Adding GPUs can raise throughput while quietly changing convergence if the learning rate and schedule are not adjusted for the larger batch.
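
The formula above, together with the common linear scaling rule (when the global batch grows by a factor of k, scale the base learning rate by k), can be sketched directly. The rule is a heuristic starting point, not a guarantee.

```python
# Global batch size and the (heuristic) linear learning-rate
# scaling rule. Numbers below are illustrative.

def global_batch(micro_batch_per_rank, num_ranks, grad_accum_steps):
    return micro_batch_per_rank * num_ranks * grad_accum_steps

def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: lr grows with the batch-size ratio."""
    return base_lr * new_batch / base_batch

b8 = global_batch(4, num_ranks=8, grad_accum_steps=4)    # 128
b64 = global_batch(4, num_ranks=64, grad_accum_steps=4)  # 1024
print(b8, b64, scaled_lr(3e-4, b8, b64))
```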

Good first questions to ask

  • does the model fit comfortably on one GPU?
  • does all-reduce finish faster than backward compute, or overlap with it?
  • is the new global batch still stable for training?
  • can the input pipeline keep up with more ranks?

If any of those answers is no, more advanced strategies start to matter.

The next post looks at all-reduce itself, because collective communication is the real center of gravity behind many distributed training bottlenecks.