How this series should be read

Many people begin distributed training by memorizing framework names: DDP, FSDP, ZeRO, Megatron-LM, DeepSpeed. That is usually the wrong entry point. The better question is why those techniques had to exist in the first place.

Distributed training is ultimately about balancing three concerns at once:

  • how computation is split across devices
  • how memory usage is kept within limits
  • how communication cost is reduced or hidden

LLM training stresses all three at the same time. Parameter counts are large, activations are large, optimizer state is large, and once multiple devices are involved, the cost of exchanging gradients and activations becomes a bottleneck of its own.
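To see why, it helps to count bytes. A minimal sketch, assuming mixed-precision training with Adam (fp16 weights and gradients, fp32 master weights, two fp32 optimizer moments); the per-parameter byte counts are a common rule of thumb, not a universal layout:

```python
# Rough per-replica memory accounting, assuming mixed precision + Adam:
#   fp16 weights (2 B) + fp16 grads (2 B)
#   + fp32 master weights (4 B) + Adam m and v moments (4 B each)
# => about 16 bytes of training state per parameter, before activations.

def training_state_bytes_per_param(weight_bytes=2, grad_bytes=2,
                                   master_bytes=4, optim_bytes=8):
    return weight_bytes + grad_bytes + master_bytes + optim_bytes

def model_state_gib(num_params):
    # Total training state in GiB, activations excluded.
    return num_params * training_state_bytes_per_param() / 2**30

# A 7B-parameter model needs on the order of 100 GiB of state alone.
print(f"{model_state_gib(7e9):.0f} GiB")
```

Sixteen bytes per parameter, before any activations, is why a 7B model already overflows a single 80 GiB device once full training state is counted, and why replicating that state on every device becomes the first thing to attack.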

Why single-GPU intuition breaks

On one GPU, it is easy to summarize a problem as "the step is slow" or "the model does not fit." In distributed training, that rough mental model stops being enough.

  • some steps are compute-bound
  • some are dominated by all-reduce
  • some are held back by the input pipeline
  • some fail because memory replication is too expensive
  • some slow down because one rank lags behind the others
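A back-of-envelope model makes the first two categories concrete: compare the time a step spends on math with the time its gradient all-reduce needs on the wire. Every number below (achieved FLOP/s, bus bandwidth, model size, rank count) is an illustrative assumption, not a measurement:

```python
# Back-of-envelope check: is a data-parallel step compute-bound or
# dominated by the gradient all-reduce? All numbers are illustrative.

def step_compute_seconds(flops_per_step, achieved_flops):
    return flops_per_step / achieved_flops

def allreduce_seconds(num_params, bytes_per_grad, bus_bandwidth, num_ranks):
    # A ring all-reduce moves ~2*(n-1)/n of the gradient buffer per rank.
    volume = num_params * bytes_per_grad * 2 * (num_ranks - 1) / num_ranks
    return volume / bus_bandwidth

compute = step_compute_seconds(flops_per_step=6e14, achieved_flops=2e14)
comm = allreduce_seconds(7e9, 2, 100e9, num_ranks=8)  # fp16 grads, 100 GB/s
print(f"compute {compute:.2f}s, all-reduce {comm:.2f}s")
```

With these made-up numbers the step is compute-bound, but the all-reduce is not free: unless it is overlapped with the backward pass, it adds a visible tax to every step. Swap in a slower interconnect and the verdict flips.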

In other words, distributed training is not just training code anymore. It includes process placement, collective communication, memory sharding, checkpoint strategy, and fault recovery.
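Since "all-reduce" will come up constantly, here is its semantics in miniature, stripped of NCCL and hardware: every rank contributes a local gradient buffer, and every rank ends up holding the elementwise global sum. A pure-Python sketch of the contract, not of how real collectives are implemented:

```python
# All-reduce semantics in miniature: every rank contributes a local
# gradient buffer, and every rank ends up with the elementwise sum.

def all_reduce_sum(per_rank_grads):
    total = [sum(vals) for vals in zip(*per_rank_grads)]
    return [list(total) for _ in per_rank_grads]  # every rank gets a copy

local = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 ranks, 2 grad entries
print(all_reduce_sum(local))  # every rank now holds [9.0, 12.0]
```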

What a distributed training engineer actually needs to do

The job is not just "use more GPUs." The real questions are:

  • is the current bottleneck compute, memory, or communication?
  • is plain data parallel enough, or do we need sharding?
  • if we introduce tensor or pipeline parallelism, what new costs appear?
  • what happens to optimizer behavior when global batch size changes?
  • how do we recover a long run safely after interruption?
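The batch-size question deserves one concrete example. A widely used heuristic (not a guarantee, and in practice it needs warmup and re-tuning) is to scale the learning rate linearly as the global batch grows with the number of data-parallel ranks:

```python
# Linear learning-rate scaling heuristic: when the global batch grows
# by a factor k, scale the base learning rate by k. A rule of thumb
# that usually needs warmup and validation, not a law.

def scaled_lr(base_lr, base_batch, global_batch):
    return base_lr * global_batch / base_batch

# Going from 256 to 2048 samples per step scales the lr by 8x.
print(scaled_lr(3e-4, base_batch=256, global_batch=2048))
```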

These are questions about internal behavior, not just about framework usage.

The structure of this series

This track will move through:

  1. synchronous SGD and the real shape of data parallelism
  2. all-reduce and collective communication cost
  3. PyTorch DDP internals
  4. memory accounting in LLM training
  5. NCCL and hardware topology
  6. tensor, sequence, and pipeline parallelism
  7. activation checkpointing, ZeRO, and FSDP
  8. overlap, checkpointing, fault tolerance, and debugging
  9. how frameworks like Megatron-LM and DeepSpeed package these ideas
  10. how to design an actual training stack

The main question to keep asking

For every technique in this series, keep asking:

  • what cost is this trying to reduce?
  • what new cost does it introduce?
  • does it reduce memory replication, communication volume, or idle time?
  • does it fit the topology and workload we actually have?

That lens will make the later framework-heavy posts much easier to understand.

The next post starts with synchronous SGD and data parallelism, because most of distributed training is easier to understand once that baseline is clear.