Who This Is For

Readers who want to move from single-GPU model training to multi-GPU and large-model training systems.

Prerequisites

Basic deep learning training experience, comfort with GPUs, and a general sense of how model training loops work.

What You'll Get

  • Understand the tradeoffs among data, tensor, and pipeline parallelism strategies
  • See how communication, memory, and framework design interact during large-model training
  • Read Megatron-LM, DeepSpeed, and FSDP with a stronger systems perspective

All Posts

  1. Distributed LLM Training 01 - Why LLM Training Becomes a Distributed Systems Problem

    Once LLM training leaves a single GPU, it stops being only a modeling problem and becomes a systems problem around memory, communication, and recovery

  2. Distributed LLM Training 02 - The Real Cost of Synchronous SGD and Data Parallelism

    Data parallelism looks simple, but it carries both gradient synchronization cost and full model-state replication cost

  3. Distributed LLM Training 03 - All-Reduce, Ring, and How to Read Communication Cost

    To reason about distributed training performance, you need a concrete mental model for all-reduce and collective communication cost

  4. Distributed LLM Training 04 - What PyTorch DDP Actually Does Internally

    DDP is not just a wrapper around your model; it is a runtime that coordinates autograd hooks, gradient buckets, and synchronization timing

  5. Distributed LLM Training 05 - Global Batch Size, Gradient Accumulation, and Learning Rate Scaling

    Adding more GPUs changes optimizer semantics as well as throughput, so batch size and learning rate need to be reasoned about together

  6. Distributed LLM Training 06 - Where LLM Training Memory Actually Goes

    Looking only at parameter size leads to bad decisions; training memory is really a combination of parameters, gradients, optimizer state, and activations

  7. Distributed LLM Training 07 - NCCL and Topology: Why the Same GPU Count Can Behave Very Differently

    In distributed training, performance is often shaped more by how GPUs are connected than by the raw number of GPUs

  8. Distributed LLM Training 08 - Tensor Parallel Basics: Splitting Computation Inside the Model

    Once the model itself is too large for one device, data parallelism is no longer enough and layer-internal computation has to be split

  9. Distributed LLM Training 09 - Where Tensor Parallelism Actually Lives Inside a Transformer

    Tensor parallelism becomes real when you map it onto QKV projections, attention output paths, and the two large MLP projections inside a transformer block

  10. Distributed LLM Training 10 - Sequence Parallelism and the Cost of Long Context

    As context length grows, activation memory and communication patterns change again, and sequence-oriented partitioning starts to matter

  11. Distributed LLM Training 11 - Pipeline Parallel Basics and How to Think About Stage Splits

    Once the model is split by depth into stages, idle time and stage imbalance become just as important as memory savings

  12. Distributed LLM Training 12 - GPipe, 1F1B, and Interleaving: Choosing a Pipeline Schedule

    Pipeline efficiency is shaped heavily by scheduling, because bubble size, activation memory, and implementation complexity all depend on it

  13. Distributed LLM Training 13 - Activation Checkpointing and the Cost of Recomputation

    Saving memory by recomputing activations is not a minor option; it is often a central design choice in large-scale training

  14. Distributed LLM Training 14 - What ZeRO Stage 1, 2, and 3 Each Remove

    ZeRO is best understood as a staged system for removing different forms of replicated training state

  15. Distributed LLM Training 15 - How FSDP Differs from DDP and When It Helps

    FSDP keeps parameters sharded and only gathers them when needed, making it a direct answer to parameter-replication pressure

  16. Distributed LLM Training 16 - How Communication Overlap Hides Step Time

    The goal of overlap is not to eliminate communication entirely, but to make it finish underneath useful computation

  17. Distributed LLM Training 17 - Why Checkpointing, Resume, and Fault Tolerance Matter So Much

    In long distributed runs, reliable recovery is as important as raw throughput

  18. Distributed LLM Training 18 - Deadlocks, Timeouts, and OOMs: Debugging Distributed Training

    Debugging distributed training is about narrowing down which rank, which collective, and which state transition went wrong

  19. Distributed LLM Training 19 - How to Read Megatron-LM and DeepSpeed Structurally

    Frameworks are easier to understand when you read them as bundles of parallelization and state-management choices rather than as giant feature lists

  20. Distributed LLM Training 20 - A Practical Order for Designing an LLM Training Stack

    Distributed training architecture is not about collecting fashionable techniques, but about choosing the smallest structure that matches the current bottleneck
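Several of these posts circle the same piece of arithmetic: what overflows a GPU is persistent training state, not parameter count alone. As a small taste of that reasoning (my own sketch, not code from any of the posts), here is the commonly cited mixed-precision Adam layout of 16 bytes of persistent state per parameter:

```python
GIB = 1024 ** 3

def model_state_bytes(n_params: int) -> int:
    """Persistent model state per data-parallel replica for mixed-precision Adam.

    Counts fp16 parameters (2 B) and fp16 gradients (2 B), plus fp32 optimizer
    state: master weights, momentum, and variance (4 B each) -- 16 B per
    parameter in total. Activation memory is workload-dependent and
    deliberately excluded here.
    """
    return n_params * (2 + 2 + 4 + 4 + 4)

# A 7B-parameter model already needs ~104 GiB of model state per replica
# before a single activation is stored -- more than one 80 GB GPU holds.
print(round(model_state_bytes(7_000_000_000) / GIB, 1))
```

This is exactly the replication pressure that posts 02, 06, 14, and 15 take apart: data parallelism copies all 16 bytes per parameter onto every GPU, while ZeRO and FSDP shard different slices of that state away.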