More GPUs change what one step means

A common mistake in distributed training is to increase the GPU count while carrying over optimizer intuition from the single-GPU setup. But once the world size grows, the amount of data contributing to each optimizer step changes too.

Usually:

global_batch = micro_batch_per_rank * num_ranks * grad_accum_steps

That means a "step" is no longer the same unit of learning signal it was on a single GPU.
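The identity above is just arithmetic, but it is worth making concrete. A toy calculation with hypothetical values (all numbers here are illustrative, not a recommendation):

```python
# Toy arithmetic for the batch-size identity above (values are hypothetical).
micro_batch_per_rank = 4   # sequences per rank per forward/backward pass
num_ranks = 16             # world size (total GPUs)
grad_accum_steps = 8       # micro-batches accumulated per optimizer step
seq_len = 2048             # tokens per sequence

global_batch = micro_batch_per_rank * num_ranks * grad_accum_steps
tokens_per_step = global_batch * seq_len

print(global_batch)      # 512 sequences per optimizer step
print(tokens_per_step)   # 1048576 tokens per optimizer step
```

Note that doubling `num_ranks` alone doubles the data behind every optimizer update, which is exactly the shift in step semantics described above.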

Why gradient accumulation is so common

In LLM training, the per-rank micro-batch is often constrained by memory. Gradient accumulation lets you simulate a larger effective batch without a corresponding increase in per-pass activation memory.

It helps because:

  • you can keep per-rank memory under control
  • you can achieve a larger effective global batch
  • you can reduce how often synchronization happens

But it is not free. Longer intervals between optimizer steps change how learning-rate schedules advance, how per-step metrics should be read, and sometimes hardware utilization.
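The mechanics can be sketched without any framework. This is a minimal illustration on a one-parameter least-squares problem (all names and numbers are made up for the sketch): gradients from several micro-batches are accumulated, scaled by the number of accumulation steps, and applied in one update, which matches a single step on the full batch.

```python
# Minimal sketch of gradient accumulation on a 1-D least-squares problem.
# Illustrative only; real training loops use a framework's autograd.

def grad(w, batch):
    # gradient of the mean of 0.5 * (w*x - y)**2 over the batch
    return sum((w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
lr, accum_steps = 0.1, 2
micro_batches = [data[:2], data[2:]]

# Accumulate micro-batch gradients (weights frozen during accumulation),
# scaling by 1/accum_steps so the sum averages over all micro-batches.
acc = 0.0
for mb in micro_batches:
    acc += grad(0.0, mb) / accum_steps
w_accum = 0.0 - lr * acc

# The same update computed directly on the full batch:
w_full = 0.0 - lr * grad(0.0, data)
print(abs(w_accum - w_full) < 1e-12)  # True: the two updates agree
```

The equivalence holds because the weights stay fixed across the accumulation window; that is also why the window lengthens the wall-clock distance between scheduler ticks.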

Why scaling rules are only rough guides

People often mention linear learning-rate scaling, but in practice LLM training is more sensitive than those simple rules suggest.

The real behavior depends on:

  • the optimizer
  • warmup length
  • gradient clipping
  • sequence length
  • data mixture

So "double the GPUs, double the learning rate" is not a reliable universal rule.

What to monitor in practice

  • how many tokens are being processed per optimizer update
  • whether the loss becomes noisier or strangely flat
  • whether throughput improves without harming validation quality
  • whether accumulation makes steps too long in wall-clock time
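The first and last items in the list are easy to instrument. A hypothetical logging helper (the function name and fields are invented for this sketch):

```python
# Hypothetical helper for per-update monitoring: tokens per optimizer
# update and wall-clock seconds per step.
import time

def step_stats(micro_batch, num_ranks, accum_steps, seq_len, step_start):
    tokens = micro_batch * num_ranks * accum_steps * seq_len
    elapsed = time.perf_counter() - step_start
    return {
        "tokens_per_update": tokens,
        "step_seconds": elapsed,
        "tokens_per_second": tokens / elapsed if elapsed > 0 else float("inf"),
    }

t0 = time.perf_counter()
# ... forward/backward over accum_steps micro-batches would run here ...
stats = step_stats(micro_batch=4, num_ranks=16, accum_steps=8,
                   seq_len=2048, step_start=t0)
print(stats["tokens_per_update"])  # 1048576
```

Logging tokens per update rather than steps per second keeps comparisons honest when accumulation or world size changes between runs.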

If convergence quality shifts after scaling out, the issue may be batch semantics rather than pure communication cost.

The next post moves into memory accounting, because many distributed-training decisions only become clear once you separate parameters, gradients, optimizer state, and activations.