Distributed LLM Training 20 - A Practical Order for Designing an LLM Training Stack
Distributed training architecture is not about collecting fashionable techniques, but about choosing the smallest structure that matches the current bottleneck.
In the end, the techniques have to be composed
This series covered individual techniques one by one, but real systems combine them. That makes sequencing important. The practical question is not "what is the most advanced method?" but "what should be added next?"
A good design order
1. Establish a single-GPU baseline
Measure model behavior, memory peak, throughput, and training stability before scaling out.
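A baseline measurement can be as simple as timing a step callable. The sketch below is a minimal, hypothetical harness (the function name and defaults are illustrative, not from any framework); on a real GPU run you would pair it with `torch.cuda.max_memory_allocated()` to capture the memory peak.

```python
import time

def measure_baseline(step_fn, num_warmup=3, num_steps=10, tokens_per_step=8192):
    """Time a training-step callable and report throughput.

    step_fn is a hypothetical callable wrapping one forward/backward/optimizer
    step. After the timed loop, a CUDA run would also record
    torch.cuda.max_memory_allocated() for the memory peak.
    """
    for _ in range(num_warmup):      # warm up kernels and caches first
        step_fn()
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return {
        "steps_per_sec": num_steps / elapsed,
        "tokens_per_sec": num_steps * tokens_per_step / elapsed,
    }
```

Recording these numbers before scaling out gives every later stage a reference point: if 8 GPUs do not deliver close to 8x the baseline tokens/sec, something specific is wrong.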
2. Scale with data parallelism first
If the model fits on a single GPU, start with DDP. Measure all-reduce cost, global-batch behavior, input-pipeline health, and scaling efficiency.
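Two of those measurements reduce to simple arithmetic. The ring all-reduce bound (each GPU sends and receives 2(n-1)/n of the gradient bytes per step) is standard; the function names below are illustrative:

```python
def ring_allreduce_bytes_per_gpu(grad_bytes, n):
    """Bytes each GPU sends (and receives) per ring all-reduce of the gradients."""
    return 2 * (n - 1) / n * grad_bytes

def scaling_efficiency(throughput_1gpu, throughput_ngpu, n):
    """Fraction of ideal linear speedup actually achieved (1.0 = perfect)."""
    return throughput_ngpu / (n * throughput_1gpu)
```

For example, 8 GPUs reaching 700 samples/sec against a 100 samples/sec baseline gives an efficiency of 0.875; comparing that number across node counts tells you when communication, not compute, has become the bottleneck.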
3. Break the memory problem into parts
Identify whether the main pressure comes from activations, optimizer state, or parameter replication. That determines whether checkpointing, ZeRO, FSDP, or tensor parallelism is the next move.
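A back-of-the-envelope breakdown is often enough to identify the dominant term. The byte counts below follow the common bf16-with-fp32-master-weights Adam recipe; `activation_bytes` would come from a profiler, and the function itself is a sketch, not a library call:

```python
GB = 1024 ** 3

def memory_breakdown_gb(n_params, activation_bytes):
    """Rough per-GPU memory buckets for mixed-precision Adam training,
    before any sharding. Assumes the standard bf16 + fp32-master recipe."""
    return {
        "weights_gb": 2 * n_params / GB,      # bf16 copy used in fwd/bwd
        "grads_gb": 2 * n_params / GB,        # bf16 gradients
        "optimizer_gb": 12 * n_params / GB,   # fp32 master weights + Adam m and v
        "activations_gb": activation_bytes / GB,
    }
```

For a 7B-parameter model the optimizer bucket alone comes to roughly 78 GB, six times the weights, which is why sharding optimizer state (ZeRO stage 1) is so often the first effective move, while an activation-dominated profile points to checkpointing instead.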
4. Choose parallel layouts with topology in mind
For example, tensor parallelism may make the most sense inside a node with fast links, while data parallelism spans nodes.
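That placement rule can be made concrete by laying ranks out so that consecutive ranks (which share a node) form the tensor-parallel groups. The sketch below is illustrative, assuming ranks are assigned node by node; real stacks express the same idea with abstractions like PyTorch's `DeviceMesh`:

```python
def build_2d_layout(world_size, tp_size, gpus_per_node):
    """Assign ranks to (tensor-parallel, data-parallel) groups so every
    TP group stays inside one node (fast NVLink-class links) while DP
    groups span nodes. Sizes here are hypothetical."""
    assert world_size % tp_size == 0
    assert gpus_per_node % tp_size == 0  # TP groups must not straddle nodes
    tp_groups = [list(range(s, s + tp_size))
                 for s in range(0, world_size, tp_size)]
    dp_groups = [list(range(t, world_size, tp_size))
                 for t in range(tp_size)]
    return tp_groups, dp_groups
```

With 16 GPUs, tp_size=2, and 8 GPUs per node, every TP pair lands on one node, and each DP group of 8 crosses the node boundary, exactly the layout the step above argues for.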
5. Build checkpoint and recovery paths early
Large runs are too expensive to treat resumption as an afterthought.
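The one property worth getting right from day one is atomicity: a crash mid-write must never corrupt the last good checkpoint. A minimal sketch of that pattern, with JSON standing in for what a real stack would do with `torch.save` and per-rank shards:

```python
import json
import os

def save_checkpoint(path, step, model_state, optim_state):
    """Write to a temp file, then rename: the rename is atomic on POSIX,
    so readers only ever see a complete checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "model": model_state, "optim": optim_state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Resume entry point: returns None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

Restoring optimizer state and the data-loader position alongside the weights is what makes a resumed run statistically indistinguishable from an uninterrupted one.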
What strong portfolio work looks like
After studying this track, strong project directions include:
- a comparison of DDP and tensor parallelism on a small transformer
- a memory-analysis write-up explaining when checkpointing, ZeRO, or FSDP is the right choice
- a communication-bottleneck analysis using a small multi-node training run
Those artifacts show real engineering judgment better than a generic summary.
The core skill of a distributed training engineer is not memorizing framework names. It is seeing how compute, memory, and communication interact, and choosing structure accordingly.