Distributed LLM Training 05 - Global Batch Size, Gradient Accumulation, and Learning Rate Scaling
Adding more GPUs changes optimizer semantics, not just throughput, so batch size and learning rate have to be reasoned about together