Distributed LLM Training 05 - Global Batch Size, Gradient Accumulation, and Learning Rate Scaling
Adding more GPUs changes optimizer semantics, not just throughput, so batch size and learning rate have to be reasoned about together
More GPUs change what one step means
A common mistake in distributed training is to increase GPU count while keeping the same optimizer intuition from the single-GPU setup. But once world size increases, the amount of data contributing to one optimizer step changes too.
Usually:
global_batch = micro_batch_per_rank * num_ranks * grad_accum_steps
That means a step is no longer the same unit of signal it used to be.
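The relationship above can be sketched as a small helper. The function name and numbers are illustrative, not values from any particular run:

```python
def global_batch(micro_batch_per_rank: int, num_ranks: int, grad_accum_steps: int) -> int:
    """Number of sequences contributing to one optimizer update."""
    return micro_batch_per_rank * num_ranks * grad_accum_steps

# Scaling from 8 to 64 ranks, with the other two knobs untouched,
# multiplies the effective batch by 8:
assert global_batch(4, num_ranks=8, grad_accum_steps=4) == 128
assert global_batch(4, num_ranks=64, grad_accum_steps=4) == 1024
```

This is why "same config, more GPUs" quietly changes what each optimizer step means.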
Why gradient accumulation is so common
In LLM training, the per-rank micro-batch is often constrained by memory. Gradient accumulation lets you simulate a larger effective batch by running several micro-batch forward/backward passes before each optimizer step, without a matching increase in peak activation memory.
It helps because:
- you can keep per-rank memory under control
- you can achieve a larger effective global batch
- you can reduce how often synchronization happens
But it is not free. Fewer, longer optimizer steps change how step-based learning-rate schedulers advance, how per-step logs should be interpreted, and sometimes hardware utilization.
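The reason accumulation preserves large-batch semantics is that the gradient of a mean loss over the full batch equals the average of the per-micro-batch gradients. A minimal sketch in plain Python (no framework; a toy squared-error loss chosen for illustration) makes this concrete:

```python
def grad_one(w: float, x: float, y: float) -> float:
    # d/dw of (w*x - y)**2 for a single example
    return 2.0 * (w * x - y) * x

def batch_grad(w: float, xs: list, ys: list) -> float:
    # Mean gradient over a (micro-)batch
    return sum(grad_one(w, x, y) for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# One big batch of 4 examples ...
full = batch_grad(w, xs, ys)

# ... versus two accumulated micro-batches of 2, averaged at step time.
acc = 0.0
for i in range(0, 4, 2):
    acc += batch_grad(w, xs[i:i + 2], ys[i:i + 2])
acc /= 2  # divide by grad_accum_steps before the optimizer step

assert abs(full - acc) < 1e-12
```

The division by the number of accumulation steps is the detail people forget; skipping it silently scales the effective learning rate.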
Why scaling rules are only rough guides
People often mention linear learning-rate scaling, but in practice LLM training is more sensitive than those simple rules suggest.
The real behavior depends on:
- the optimizer
- warmup length
- gradient clipping
- sequence length
- data mixture
So "double the GPUs, double the learning rate" is not a reliable universal rule.
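If you do use the linear-scaling heuristic, it is best treated as a starting point to sweep around. A hedged sketch, where the function name, the cap, and every number are illustrative assumptions rather than anything from this post:

```python
def candidate_lr(base_lr: float, base_global_batch: int,
                 new_global_batch: int, max_lr: float = 3e-3) -> float:
    """Linear-scaling heuristic for a starting learning rate, with a cap."""
    scaled = base_lr * new_global_batch / base_global_batch
    # Clamp: large-batch LLM runs often tolerate less than linear scaling,
    # so cap the heuristic rather than trusting it at extreme batch ratios.
    return min(scaled, max_lr)

assert candidate_lr(3e-4, 256, 512) == 6e-4   # 2x batch -> 2x LR under the heuristic
assert candidate_lr(3e-4, 256, 8192) == 3e-3  # capped instead of extrapolated
```

Even the capped value is only a candidate; warmup length and clipping thresholds usually need to be revisited alongside it.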
What to monitor in practice
- how many tokens are being processed per optimizer update
- whether the loss becomes noisier or strangely flat
- whether throughput improves without harming validation quality
- whether accumulation makes steps too long in wall-clock time
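The first item is easy to instrument. A minimal helper, with hypothetical illustrative numbers:

```python
def tokens_per_update(micro_batch_per_rank: int, num_ranks: int,
                      grad_accum_steps: int, seq_len: int) -> int:
    """Tokens contributing to one optimizer update."""
    return micro_batch_per_rank * num_ranks * grad_accum_steps * seq_len

# e.g. 2 sequences/rank * 64 ranks * 8 accumulation steps * 4096 tokens
assert tokens_per_update(2, 64, 8, 4096) == 4_194_304  # ~4M tokens per update
```

Logging this number alongside loss makes it obvious when a topology change has silently altered step semantics.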
If convergence quality shifts after scaling out, the issue may be batch semantics rather than pure communication cost.
The next post moves into memory accounting, because many distributed-training decisions only become clear once you separate parameters, gradients, optimizer state, and activations.