How this series should be read

Many people begin distributed training by memorizing framework names: DDP, FSDP, ZeRO, Megatron-LM, DeepSpeed. That is usually the wrong entry point. The better question is why those techniques had to exist in the first place.

Distributed training is ultimately about balancing three concerns at once:

  • how computation is split across devices
  • how memory usage is kept within limits
  • how communication cost is reduced or hidden

LLM training stresses all three at the same time. Parameter counts are large, activations are large, optimizer state is large, and once multiple devices are involved, the cost of exchanging gradients and activations becomes a bottleneck of its own.
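To see why, it helps to count bytes. A minimal sketch, assuming mixed-precision training with Adam (fp16 weights and gradients, fp32 master weights, two fp32 optimizer moments); the per-parameter byte counts are a common rule of thumb, not a universal layout:

```python
# Rough per-replica memory accounting, assuming mixed precision + Adam:
#   fp16 weights (2 B) + fp16 grads (2 B)
#   + fp32 master weights (4 B) + Adam m and v moments (4 B each)
# => about 16 bytes of training state per parameter, before activations.

def training_state_bytes_per_param(weight_bytes=2, grad_bytes=2,
                                   master_bytes=4, optim_bytes=8):
    return weight_bytes + grad_bytes + master_bytes + optim_bytes

def model_state_gib(num_params):
    # Total training state in GiB, activations excluded.
    return num_params * training_state_bytes_per_param() / 2**30

# A 7B-parameter model needs on the order of 100 GiB of state alone.
print(f"{model_state_gib(7e9):.0f} GiB")
```

Sixteen bytes per parameter, before any activations, is why a 7B model already overflows a single 80 GiB device once full training state is counted, and why replicating that state on every device becomes the first thing to attack.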

Why single-GPU intuition breaks

On one GPU, it is easy to summarize a problem as "the step is slow" or "the model does not fit." In distributed training, that rough mental model stops being enough.

  • some steps are compute-bound
  • some are dominated by all-reduce
  • some are held back by the input pipeline
  • some fail because memory replication is too expensive
  • some slow down because one rank lags behind the others
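A back-of-envelope model makes the first two categories concrete: compare the time a step spends on math with the time its gradient all-reduce needs on the wire. Every number below (achieved FLOP/s, bus bandwidth, model size, rank count) is an illustrative assumption, not a measurement:

```python
# Back-of-envelope check: is a data-parallel step compute-bound or
# dominated by the gradient all-reduce? All numbers are illustrative.

def step_compute_seconds(flops_per_step, achieved_flops):
    return flops_per_step / achieved_flops

def allreduce_seconds(num_params, bytes_per_grad, bus_bandwidth, num_ranks):
    # A ring all-reduce moves ~2*(n-1)/n of the gradient buffer per rank.
    volume = num_params * bytes_per_grad * 2 * (num_ranks - 1) / num_ranks
    return volume / bus_bandwidth

compute = step_compute_seconds(flops_per_step=6e14, achieved_flops=2e14)
comm = allreduce_seconds(7e9, 2, 100e9, num_ranks=8)  # fp16 grads, 100 GB/s
print(f"compute {compute:.2f}s, all-reduce {comm:.2f}s")
```

With these made-up numbers the step is compute-bound, but the all-reduce is not free: unless it is overlapped with the backward pass, it adds a visible tax to every step. Swap in a slower interconnect and the verdict flips.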

In other words, distributed training is not just training code anymore. It includes process placement, collective communication, memory sharding, checkpoint strategy, and fault recovery.
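Since "all-reduce" will come up constantly, here is its semantics in miniature, stripped of NCCL and hardware: every rank contributes a local gradient buffer, and every rank ends up holding the elementwise global sum. A pure-Python sketch of the contract, not of how real collectives are implemented:

```python
# All-reduce semantics in miniature: every rank contributes a local
# gradient buffer, and every rank ends up with the elementwise sum.

def all_reduce_sum(per_rank_grads):
    total = [sum(vals) for vals in zip(*per_rank_grads)]
    return [list(total) for _ in per_rank_grads]  # every rank gets a copy

local = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 ranks, 2 grad entries
print(all_reduce_sum(local))  # every rank now holds [9.0, 12.0]
```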

What a distributed training engineer actually needs to do

The job is not just "use more GPUs." The real questions are:

  • is the current bottleneck compute, memory, or communication?
  • is plain data parallel enough, or do we need sharding?
  • if we introduce tensor or pipeline parallelism, what new costs appear?
  • what happens to optimizer behavior when global batch size changes?
  • how do we recover a long run safely after interruption?
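The batch-size question deserves one concrete example. A widely used heuristic (not a guarantee, and in practice it needs warmup and re-tuning) is to scale the learning rate linearly as the global batch grows with the number of data-parallel ranks:

```python
# Linear learning-rate scaling heuristic: when the global batch grows
# by a factor k, scale the base learning rate by k. A rule of thumb
# that usually needs warmup and validation, not a law.

def scaled_lr(base_lr, base_batch, global_batch):
    return base_lr * global_batch / base_batch

# Going from 256 to 2048 samples per step scales the lr by 8x.
print(scaled_lr(3e-4, base_batch=256, global_batch=2048))
```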

These are questions about internal behavior, not just about framework usage.

The structure of this series

This track will move through:

  1. synchronous SGD and the real shape of data parallelism
  2. all-reduce and collective communication cost
  3. PyTorch DDP internals
  4. memory accounting in LLM training
  5. NCCL and hardware topology
  6. tensor, sequence, and pipeline parallelism
  7. activation checkpointing, ZeRO, and FSDP
  8. overlap, checkpointing, fault tolerance, and debugging
  9. how frameworks like Megatron-LM and DeepSpeed package these ideas
  10. how to design an actual training stack

The main question to keep asking

For every technique in this series, keep asking:

  • what cost is this trying to reduce?
  • what new cost does it introduce?
  • does it reduce memory replication, communication volume, or idle time?
  • does it fit the topology and workload we actually have?

That lens will make the later framework-heavy posts much easier to understand.

The next post starts with synchronous SGD and data parallelism, because most of distributed training is easier to understand once that baseline is clear.