Distributed LLM Training 01 - Why LLM Training Becomes a Distributed Systems Problem
Once LLM training leaves a single GPU, it stops being only a modeling problem and becomes a systems problem around memory, communication, and recovery
From data parallelism to tensor parallelism, FSDP, ZeRO, and modern LLM training frameworks
Readers who want to move from single-GPU model training to multi-GPU and large-model training systems.
Basic deep learning training experience, comfort with GPUs, and a general sense of how model training loops work.
Data parallelism looks simple, but it carries both gradient synchronization cost and full model-state replication cost
To reason about distributed training performance, you need a concrete mental model for all-reduce and collective communication cost
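A minimal sketch of that mental model, assuming the standard ring all-reduce cost analysis (each of p ranks transfers roughly 2(p-1)/p times the buffer size, independent of any particular library):

```python
# Hypothetical helper: bytes each GPU moves in a ring all-reduce of an
# n-byte gradient buffer across p ranks. As p grows, this approaches 2n,
# so per-GPU communication cost stops improving with more GPUs.
def ring_allreduce_bytes_per_gpu(n_bytes: float, p: int) -> float:
    return 2 * (p - 1) / p * n_bytes

# Example: fp16 gradients of a 7B-parameter model are ~14 GB; on 8 GPUs
# each rank transfers about 24.5 GB per synchronization.
print(ring_allreduce_bytes_per_gpu(14.0, 8))  # 24.5 (GB)
```

The key takeaway is that the per-rank volume saturates near twice the buffer size, which is why gradient size, not GPU count, dominates synchronization cost at scale.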
DDP is not just a wrapper around your model; it is a runtime that coordinates autograd hooks, gradient buckets, and synchronization timing
Adding more GPUs changes optimizer semantics as well as throughput, so batch size and learning rate need to be reasoned about together
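One common heuristic for reasoning about the two together is the linear scaling rule (an assumption, not a universal law, and usually paired with warmup):

```python
# Hypothetical helper illustrating the linear scaling rule: when the
# global batch grows k-fold, scale the base learning rate by k.
def scaled_lr(base_lr: float, base_batch: int, global_batch: int) -> float:
    return base_lr * global_batch / base_batch

# 8 GPUs x per-GPU batch 32 = global batch 256, vs. a single-GPU batch of 32:
print(scaled_lr(3e-4, 32, 256))  # 0.0024
```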
Looking only at parameter size leads to bad decisions; training memory is really a combination of parameters, gradients, optimizer state, and activations
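That combination can be sketched with back-of-the-envelope arithmetic, assuming mixed-precision Adam (roughly 2 bytes of fp16 parameters, 2 of fp16 gradients, and 12 of fp32 master weights plus Adam moments per parameter; activations are excluded because they depend on batch and sequence length):

```python
# Hypothetical helper: approximate per-component training state in GB
# for mixed-precision Adam, excluding activations.
def training_state_gb(n_params: float) -> dict:
    b = n_params / 1e9  # billions of parameters
    return {
        "fp16 params": b * 2,
        "fp16 grads": b * 2,
        "fp32 master + Adam m,v": b * 12,
    }

# A "7B" model already needs ~112 GB of state before a single activation:
state = training_state_gb(7e9)
print(sum(state.values()))  # 112.0
```

This is why "the model is only 14 GB in fp16" is the wrong number to plan around.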
In distributed training, performance is often shaped more by how GPUs are connected than by the raw number of GPUs
Once the model itself is too large for one device, data parallelism is no longer enough and layer-internal computation has to be split
Tensor parallelism becomes real when you map it onto QKV projections, attention output paths, and the two large MLP projections inside a transformer block
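The MLP half of that mapping can be sketched numerically, assuming a Megatron-style split: the up-projection is partitioned by columns, the down-projection by matching rows, and summing the partial outputs stands in for the single all-reduce on the output path:

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, ranks = 8, 32, 4
x = rng.standard_normal((2, d))
W1 = rng.standard_normal((d, hidden))   # up-projection
W2 = rng.standard_normal((hidden, d))   # down-projection

def relu(a):
    return np.maximum(a, 0)

# Single-device reference computation.
ref = relu(x @ W1) @ W2

# Each "rank" holds a column slice of W1 and the matching row slice of W2.
# The elementwise nonlinearity commutes with the column split, so no
# communication is needed until the final sum (the all-reduce).
cols = np.split(W1, ranks, axis=1)
rows = np.split(W2, ranks, axis=0)
out = sum(relu(x @ c) @ r for c, r in zip(cols, rows))

print(np.allclose(ref, out))  # True
```

The same column-then-row pattern is what makes the QKV projection and attention output projection pair communication-cheap.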
As context length grows, activation memory and communication patterns change again, and sequence-oriented partitioning starts to matter
Once the model is split by depth into stages, idle time and stage imbalance become just as important as memory savings
Pipeline efficiency is shaped heavily by scheduling, because bubble size, activation memory, and implementation complexity all depend on it
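For the simplest GPipe-style schedule, the bubble has a closed form worth keeping in your head (a sketch under that scheduling assumption; interleaved schedules change the formula):

```python
# Idle ("bubble") fraction of a step for a GPipe-style schedule with
# `stages` pipeline stages and `microbatches` microbatches per step.
def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

# 4 stages, 4 microbatches: 3/7 of the step is idle time.
print(bubble_fraction(4, 4))   # ~0.4286
# Raising microbatches to 32 shrinks it below 9%, at the cost of more
# in-flight activation memory.
print(bubble_fraction(4, 32))  # ~0.0857
```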
Saving memory by recomputing activations is not a minor option; it is often a central design choice in large-scale training
ZeRO is best understood as a staged system for removing different forms of replicated training state
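The staging can be summarized in a few lines, reusing the mixed-precision Adam byte counts from above (2 + 2 + 12 bytes per parameter; the stage numbers follow ZeRO's convention of sharding optimizer state, then gradients, then parameters):

```python
# Hypothetical helper: approximate per-GPU training state (GB) for an
# n-billion-parameter model under ZeRO stages 0-3 with dp-way data
# parallelism and mixed-precision Adam.
def zero_state_per_gpu_gb(n_params_b: float, dp: int, stage: int) -> float:
    params = 2 * n_params_b     # fp16 parameters
    grads = 2 * n_params_b      # fp16 gradients
    optim = 12 * n_params_b     # fp32 master weights + Adam m, v
    if stage >= 1:
        optim /= dp             # stage 1: shard optimizer state
    if stage >= 2:
        grads /= dp             # stage 2: also shard gradients
    if stage >= 3:
        params /= dp            # stage 3: also shard parameters
    return params + grads + optim

# 7B model, 8-way data parallel: 112 GB fully replicated vs. 14 GB sharded.
print(zero_state_per_gpu_gb(7, 8, 0))  # 112.0
print(zero_state_per_gpu_gb(7, 8, 3))  # 14.0
```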
FSDP keeps parameters sharded and only gathers them when needed, making it a direct answer to parameter-replication pressure
The goal of overlap is not to eliminate communication entirely, but to make it finish underneath useful computation
In long distributed runs, reliable recovery is as important as raw throughput
Debugging distributed training is about narrowing down which rank, which collective, and which state transition went wrong
Frameworks are easier to understand when you read them as bundles of parallelization and state-management choices rather than as giant feature lists
Distributed training architecture is not about collecting fashionable techniques, but about choosing the smallest structure that matches the current bottleneck