More GPUs change what one step means

A common mistake in distributed training is to increase the GPU count while carrying over optimizer intuition from the single-GPU setup. But once the world size grows, the amount of data contributing to each optimizer step changes too.

Usually:

global_batch = micro_batch_per_rank * num_ranks * grad_accum_steps

That means a "step" is no longer the same unit of learning signal it was on a single GPU.
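The identity above is just arithmetic, but it is worth making concrete. A toy calculation with hypothetical values (all numbers here are illustrative, not a recommendation):

```python
# Toy arithmetic for the batch-size identity above (values are hypothetical).
micro_batch_per_rank = 4   # sequences per rank per forward/backward pass
num_ranks = 16             # world size (total GPUs)
grad_accum_steps = 8       # micro-batches accumulated per optimizer step
seq_len = 2048             # tokens per sequence

global_batch = micro_batch_per_rank * num_ranks * grad_accum_steps
tokens_per_step = global_batch * seq_len

print(global_batch)      # 512 sequences per optimizer step
print(tokens_per_step)   # 1048576 tokens per optimizer step
```

Note that doubling `num_ranks` alone doubles the data behind every optimizer update, which is exactly the shift in step semantics described above.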

Why gradient accumulation is so common

In LLM training, the per-rank micro-batch is often constrained by memory. Gradient accumulation lets you simulate a larger effective batch without a corresponding increase in per-pass activation memory.

It helps because:

  • you can keep per-rank memory under control
  • you can achieve a larger effective global batch
  • you can reduce how often synchronization happens

But it is not free. Longer intervals between optimizer steps change how learning-rate schedules advance, how per-step metrics should be read, and sometimes hardware utilization.
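The mechanics can be sketched without any framework. This is a minimal illustration on a one-parameter least-squares problem (all names and numbers are made up for the sketch): gradients from several micro-batches are accumulated, scaled by the number of accumulation steps, and applied in one update, which matches a single step on the full batch.

```python
# Minimal sketch of gradient accumulation on a 1-D least-squares problem.
# Illustrative only; real training loops use a framework's autograd.

def grad(w, batch):
    # gradient of the mean of 0.5 * (w*x - y)**2 over the batch
    return sum((w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
lr, accum_steps = 0.1, 2
micro_batches = [data[:2], data[2:]]

# Accumulate micro-batch gradients (weights frozen during accumulation),
# scaling by 1/accum_steps so the sum averages over all micro-batches.
acc = 0.0
for mb in micro_batches:
    acc += grad(0.0, mb) / accum_steps
w_accum = 0.0 - lr * acc

# The same update computed directly on the full batch:
w_full = 0.0 - lr * grad(0.0, data)
print(abs(w_accum - w_full) < 1e-12)  # True: the two updates agree
```

The equivalence holds because the weights stay fixed across the accumulation window; that is also why the window lengthens the wall-clock distance between scheduler ticks.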

Why scaling rules are only rough guides

People often mention linear learning-rate scaling, but in practice LLM training is more sensitive than those simple rules suggest.

The real behavior depends on:

  • the optimizer
  • warmup length
  • gradient clipping
  • sequence length
  • data mixture

So "double the GPUs, double the learning rate" is not a reliable universal rule.

What to monitor in practice

  • how many tokens are being processed per optimizer update
  • whether the loss becomes noisier or strangely flat
  • whether throughput improves without harming validation quality
  • whether accumulation makes steps too long in wall-clock time
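The first and last items in the list are easy to instrument. A hypothetical logging helper (the function name and fields are invented for this sketch):

```python
# Hypothetical helper for per-update monitoring: tokens per optimizer
# update and wall-clock seconds per step.
import time

def step_stats(micro_batch, num_ranks, accum_steps, seq_len, step_start):
    tokens = micro_batch * num_ranks * accum_steps * seq_len
    elapsed = time.perf_counter() - step_start
    return {
        "tokens_per_update": tokens,
        "step_seconds": elapsed,
        "tokens_per_second": tokens / elapsed if elapsed > 0 else float("inf"),
    }

t0 = time.perf_counter()
# ... forward/backward over accum_steps micro-batches would run here ...
stats = step_stats(micro_batch=4, num_ranks=16, accum_steps=8,
                   seq_len=2048, step_start=t0)
print(stats["tokens_per_update"])  # 1048576
```

Logging tokens per update rather than steps per second keeps comparisons honest when accumulation or world size changes between runs.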

If convergence quality shifts after scaling out, the issue may be batch semantics rather than pure communication cost.

The next post moves into memory accounting, because many distributed-training decisions only become clear once you separate parameters, gradients, optimizer state, and activations.