Distributed LLM Training 01 - Why LLM Training Becomes a Distributed Systems Problem
Once LLM training leaves a single GPU, it stops being only a modeling problem and becomes a systems problem of memory, communication, and failure recovery
Data parallelism looks simple, but it carries both gradient synchronization cost and full model-state replication cost
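To make both costs concrete, here is a back-of-envelope sketch, assuming a hypothetical 7B-parameter model trained with mixed-precision Adam; the ~16 bytes/param figure is a standard estimate, and the model size and GPU count are assumptions for illustration:

```python
# Back-of-envelope data-parallel costs for a hypothetical 7B-parameter model.
params = 7e9
n_gpus = 8

# Replication cost: every data-parallel rank holds the FULL model state.
# With mixed-precision Adam this is roughly 16 bytes/param: fp16 weights (2 B)
# + fp16 grads (2 B) + fp32 master weights (4 B) + fp32 Adam moments (8 B).
state_per_gpu_gb = params * 16 / 1e9
print(f"state per GPU: {state_per_gpu_gb:.0f} GB, "
      f"replicated {n_gpus}x across the cluster")      # ~112 GB per GPU

# Synchronization cost: every step all-reduces the full fp16 gradient.
grad_gb = params * 2 / 1e9
print(f"gradient volume synced per step: {grad_gb:.0f} GB")  # ~14 GB
```

Neither cost shows up in the training loop's code, which is exactly why data parallelism looks simpler than it is.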
To reason about distributed training performance, you need a concrete mental model for all-reduce and collective communication cost
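A useful first-order model is the bandwidth term of ring all-reduce: each of N ranks sends and receives roughly 2(N-1)/N times the message size, plus latency terms that matter for small messages. The sketch below encodes just the bandwidth term; the 300 GB/s bus bandwidth is an assumed, illustrative figure:

```python
# Minimal cost model for ring all-reduce (bandwidth term only).
# Each rank moves ~2*(N-1)/N of the message, so time is roughly
#   t ≈ 2*(N-1)/N * message_bytes / bus_bandwidth
def ring_allreduce_seconds(message_bytes: float, n_ranks: int,
                           bus_bandwidth_gb_s: float) -> float:
    traffic = 2 * (n_ranks - 1) / n_ranks * message_bytes
    return traffic / (bus_bandwidth_gb_s * 1e9)

# e.g. 14 GB of fp16 gradients across 8 GPUs at an assumed 300 GB/s:
print(f"{ring_allreduce_seconds(14e9, 8, 300):.3f} s")  # ~0.082 s per step
```

The key property: the per-rank traffic approaches 2x the message size as N grows, so the per-step cost is nearly independent of GPU count, and what you pay for is message size over link bandwidth.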
DDP is not just a wrapper around your model; it is a runtime that coordinates autograd hooks, gradient buckets, and synchronization timing
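Here is a minimal sketch of that runtime in PyTorch, assuming a single-node launch via torchrun (which supplies the rendezvous environment variables); bucket_cap_mb is the real knob that controls how many gradients get fused into one all-reduce:

```python
# Minimal DDP sketch. DDP registers autograd hooks on each parameter;
# as buckets of gradients become ready during backward(), it launches
# async all-reduces that overlap with the rest of the backward pass.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")      # reads env vars set by torchrun
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)
    # bucket_cap_mb controls gradient bucket fusion (25 MB is the default).
    ddp_model = DDP(model, device_ids=[rank], bucket_cap_mb=25)
    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    x = torch.randn(32, 1024, device=rank)
    loss = ddp_model(x).square().mean()
    loss.backward()   # gradient all-reduces fire per bucket, overlapping compute
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Because buckets are all-reduced as they fill, larger buckets mean fewer, bigger collectives but less overlap with backward compute; that trade-off is why the bucket size is exposed at all.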
Adding more GPUs changes optimizer semantics as well as throughput, so batch size and learning rate need to be reasoned about together
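One common heuristic is the linear scaling rule: when the global batch grows k-fold because you added data-parallel ranks, scale the base learning rate by k and add warmup. It is a starting point rather than a guarantee, and the base values below are assumptions for illustration:

```python
# Linear scaling rule, worked out numerically (illustrative values).
base_lr = 3e-4
base_global_batch = 256

n_gpus = 32
per_gpu_batch = 32
global_batch = n_gpus * per_gpu_batch      # 1024: the optimizer now sees
                                           # 4x fewer steps per epoch
scale = global_batch / base_global_batch   # k = 4
scaled_lr = base_lr * scale
print(f"global batch {global_batch}: lr {base_lr} -> {scaled_lr}")
```

The point is that GPU count silently became a hyperparameter: the same model and data with 32 ranks instead of 8 is a different optimization trajectory unless batch size and learning rate are adjusted together.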
Looking only at parameter size leads to bad decisions; training memory is really a combination of parameters, gradients, optimizer state, and activations
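A rough accounting for a hypothetical GPT-style 7B configuration makes the point; the static-state figure assumes mixed-precision Adam, and the per-layer activation estimate s·b·h·(34 + 5as/h) bytes follows the commonly cited Megatron activation-recomputation formula (fp16 activations, no recomputation):

```python
# Rough training-memory accounting for an assumed GPT-style configuration.
params   = 7e9
layers   = 32
hidden   = 4096     # h
heads    = 32       # a
seq      = 2048     # s
micro_bs = 1        # b

# Parameters + gradients + optimizer state: ~16 bytes/param with
# mixed-precision Adam (see the data-parallelism sketch above).
static_gb = params * 16 / 1e9

# Activations per transformer layer: s*b*h*(34 + 5*a*s/h) bytes.
act_per_layer = seq * micro_bs * hidden * (34 + 5 * heads * seq / hidden)
act_gb = layers * act_per_layer / 1e9

print(f"params + grads + optimizer: {static_gb:.0f} GB")  # ~112 GB
print(f"activations (bs=1, no recompute): {act_gb:.0f} GB")  # ~31 GB
```

The "7B model" intuition of 14 GB (fp16 weights) is off by nearly an order of magnitude once training state and activations are counted, and activations scale with batch size and sequence length while the static state does not.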
In distributed training, performance is often shaped more by how GPUs are connected than by the raw number of GPUs
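To see why, push the same gradient all-reduce through the cost model from above under different link speeds; the bandwidth figures are assumed, order-of-magnitude values, not measurements:

```python
# Same 14 GB gradient all-reduce across 8 GPUs; only the interconnect changes.
for link, gb_s in [("NVLink-class", 300), ("PCIe 4.0 x16", 32), ("100 GbE", 12.5)]:
    t = 2 * (8 - 1) / 8 * 14e9 / (gb_s * 1e9)
    print(f"{link:>13}: {t:.2f} s per all-reduce")
# NVLink-class: 0.08 s | PCIe 4.0 x16: 0.77 s | 100 GbE: 1.96 s
```

An order-of-magnitude gap in link bandwidth turns a negligible synchronization step into the dominant cost, which is why cluster topology, not GPU count, often decides scaling efficiency.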
Once the model itself is too large for one device, data parallelism is no longer enough and layer-internal computation has to be split
Tensor parallelism becomes real when you map it onto QKV projections, attention output paths, and the two large MLP projections inside a transformer block
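Here is a minimal sketch of the Megatron-style split, computed shard-by-shard on one device for clarity; a real implementation places one shard per rank and replaces the final sum with an all-reduce, and all sizes are illustrative:

```python
# Tensor parallelism inside one transformer MLP, shown with explicit shards.
import torch

h, ffn, tp = 4096, 16384, 2   # hidden size, MLP width, tensor-parallel degree
x = torch.randn(8, h)

# Column parallelism (QKV projections, MLP up-projection): the OUTPUT
# dimension is split, so each rank computes its own slice of the
# activations with no communication on the forward path.
w_up = [torch.randn(h, ffn // tp) for _ in range(tp)]
slices = [x @ w for w in w_up]                 # each: (8, ffn/tp)

# Row parallelism (attention output projection, MLP down-projection):
# the INPUT dimension is split, so each rank produces a partial sum and
# an all-reduce combines them. sum() stands in for that all-reduce here.
w_down = [torch.randn(ffn // tp, h) for _ in range(tp)]
y = sum(torch.relu(s) @ w for s, w in zip(slices, w_down))

print(y.shape)   # (8, 4096): same math as the unsplit block
```

The nonlinearity can be applied per shard exactly because the column split hands each rank its own contiguous slice of the activation vector; pairing a column-parallel layer with a row-parallel layer is what keeps the whole block at one all-reduce per forward pass.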