Distributed LLM Training 05 - Global Batch Size, Gradient Accumulation, and Learning Rate Scaling
Adding more GPUs changes optimizer semantics, not just throughput, so batch size and learning rate have to be reasoned about together