In the end, the techniques have to be composed

This series covered individual techniques one by one, but real systems combine them. That makes sequencing important. The practical question is not "what is the most advanced method?" but "what should be added next?"

A good design order

1. establish a single-GPU baseline

Measure model behavior, peak memory, throughput, and training stability before scaling out.
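
A minimal sketch of what "measure before scaling" can look like in PyTorch. The helper name `profile_step` and the `model`, `batch`, and `optimizer` arguments are placeholders for whatever the existing single-GPU loop already uses, and the `.loss` access assumes a model that returns its own loss:

```python
import time
import torch

def profile_step(model, batch, optimizer, warmup=3, iters=10):
    """Measure peak GPU memory and step throughput for one configuration."""
    torch.cuda.reset_peak_memory_stats()
    model.train()
    for i in range(warmup + iters):
        if i == warmup:
            torch.cuda.synchronize()
            start = time.perf_counter()   # start timing after warmup steps
        optimizer.zero_grad(set_to_none=True)
        loss = model(**batch).loss        # assumes a model that returns its loss
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"peak memory: {peak_gib:.2f} GiB")
    print(f"throughput:  {iters / elapsed:.2f} steps/s")
```

Numbers from this baseline become the reference point for every scaling decision later.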

2. scale with data parallelism first

If the model fits, start with DDP. Measure all-reduce cost, global batch behavior, input-pipeline health, and scaling efficiency.
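
A minimal DDP skeleton, assuming a `torchrun` launch with the NCCL backend; `build_model` and `build_dataset` stand in for your own constructors:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)   # build_model: your own constructor
    model = DDP(model, device_ids=[local_rank])

    dataset = build_dataset()                # build_dataset: your own pipeline
    sampler = DistributedSampler(dataset)    # shards the dataset across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    num_epochs = 3                           # placeholder
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)             # reshuffle shards each epoch
        for batch in loader:
            ...                              # forward / backward / step as usual

    dist.destroy_process_group()
```

Comparing its steps-per-second against the single-GPU baseline gives the scaling-efficiency number directly.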

3. break the memory problem into parts

Identify whether the main pressure comes from activations, optimizer state, or parameter replication. That determines whether checkpointing, ZeRO, FSDP, or tensor parallelism is the next move.
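
To make the accounting concrete, a rough per-parameter estimate for a fully replicated, mixed-precision Adam setup (roughly 16 bytes of state per parameter before activations, the breakdown popularized by the ZeRO paper). The constants below are assumptions about a typical configuration, not measurements:

```python
def estimate_training_memory_gib(n_params, bytes_per_param=2,
                                 adam_states=2, master_fp32=True):
    """Per-GPU state for a fully replicated (DDP-style) setup.

    Assumes fp16/bf16 params and grads (2 bytes each), fp32 master
    weights, and two fp32 Adam moments (4 bytes each).
    Activations are excluded; measure those empirically.
    """
    params    = n_params * bytes_per_param
    grads     = n_params * bytes_per_param
    optimizer = n_params * 4 * adam_states
    master    = n_params * 4 if master_fp32 else 0
    return (params + grads + optimizer + master) / 2**30

# e.g. a 7B-parameter model: ~16 bytes/param of state before activations
print(f"{estimate_training_memory_gib(7e9):.0f} GiB")
```

If optimizer and master state dominate, ZeRO or FSDP sharding is the natural next step; if activations dominate, checkpointing usually wins.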

4. choose parallel layouts with topology in mind

For example, tensor parallelism may make the most sense inside a node with fast links, while data parallelism spans nodes.
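
One way to express that layout with `torch.distributed` process groups. The sketch assumes consecutive global ranks share a node and that `tp_size` divides the number of GPUs per node; all names are illustrative:

```python
import torch.distributed as dist

def build_groups(world_size, tp_size):
    """Put tensor parallelism on consecutive ranks (same node, fast links)
    and data parallelism across nodes."""
    tp_groups, dp_groups = [], []
    # consecutive ranks -> same TP group (intra-node, NVLink-class bandwidth)
    for start in range(0, world_size, tp_size):
        ranks = list(range(start, start + tp_size))
        tp_groups.append(dist.new_group(ranks))
    # ranks in the same position within their TP group -> same DP group
    for offset in range(tp_size):
        ranks = list(range(offset, world_size, tp_size))
        dp_groups.append(dist.new_group(ranks))
    # every rank must call new_group for every group, in the same order;
    # each rank then uses the one group in each list that contains it
    return tp_groups, dp_groups
```

The point is not the helper itself but the habit: decide which collective runs over which links before committing to a layout.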

5. build checkpoint and recovery paths early

Large runs are too expensive to treat saving and resuming as an afterthought.
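
A single-process sketch of an atomic save/resume path; the file layout is just an example, and sharded setups (FSDP, ZeRO) should use their framework's distributed checkpoint APIs instead:

```python
import os
import torch

def save_checkpoint(step, model, optimizer, scheduler, path="ckpt"):
    """Write to a temp file, then rename, so a crash mid-write
    never corrupts the previous checkpoint."""
    state = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }
    tmp = f"{path}/step_{step}.pt.tmp"
    torch.save(state, tmp)
    os.replace(tmp, f"{path}/step_{step}.pt")   # atomic rename

def resume(model, optimizer, scheduler, ckpt_file):
    state = torch.load(ckpt_file, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["step"]                        # step to continue from
```

Testing the resume path on a small run, before the expensive one, is part of building it early.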

What strong portfolio work looks like

After studying this track, strong project directions include:

  1. a comparison of DDP and tensor parallelism on a small transformer
  2. a memory-analysis write-up explaining when checkpointing, ZeRO, or FSDP is the right choice
  3. a communication-bottleneck analysis using a small multi-node training run

Those artifacts show real engineering judgment better than a generic summary.

The core skill of a distributed training engineer is not memorizing framework names. It is seeing how compute, memory, and communication interact, and choosing structure accordingly.