Do not read frameworks as feature catalogs

Megatron-LM and DeepSpeed expose many options, but the useful way to read them is to ask what they are abstracting.

Reading Megatron-LM

Megatron is a good place to study how transformer-oriented tensor parallelism, pipeline parallelism, and related runtime decisions fit together.

Look for:

  • how parallel groups are defined
  • how stages are assigned
  • how micro-batches are scheduled
  • how checkpoints and optimizer logic align with the parallel structure
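To make the first two items concrete, here is a minimal sketch (not Megatron's actual API; function and variable names are illustrative) of how a flat set of ranks can be carved into tensor-, pipeline-, and data-parallel groups. It assumes Megatron-style ordering, where tensor-parallel ranks are adjacent so their frequent all-reduces stay within a node:

```python
def build_parallel_groups(world_size, tp, pp):
    """Partition ranks 0..world_size-1 into parallel groups.

    tp: tensor-parallel size, pp: pipeline-parallel size.
    Data-parallel size dp is implied: world_size == tp * pp * dp.
    """
    assert world_size % (tp * pp) == 0
    dp = world_size // (tp * pp)

    # Tensor-parallel groups: consecutive ranks of size tp,
    # so intra-group all-reduces use the fastest interconnect.
    tensor_groups = [list(range(i * tp, (i + 1) * tp))
                     for i in range(world_size // tp)]

    # Pipeline-parallel groups: ranks strided by the number of
    # ranks per stage, so each group holds one rank per stage.
    stride = world_size // pp
    pipeline_groups = [list(range(i, world_size, stride))
                       for i in range(stride)]

    # Data-parallel groups: ranks holding the same tensor shard
    # at the same pipeline stage, i.e. exact model replicas.
    data_groups = []
    for p in range(pp):
        for t in range(tp):
            start = p * stride
            data_groups.append([start + t + d * tp for d in range(dp)])

    return tensor_groups, pipeline_groups, data_groups
```

Reading the real initialization code with this mental model makes it easier to see why the group construction order matters: it determines which collectives cross node boundaries.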

Reading DeepSpeed

DeepSpeed focuses more on operational runtime abstractions such as ZeRO state sharding, optimizer offload, and engine management.

Look for:

  • what state is sharded at each stage
  • where runtime step orchestration lives
  • how checkpointing and optimizer state are managed
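The first item, state sharding, is worth pinning down with a small sketch. The following is an illustration of the ZeRO stage-1 idea (names are hypothetical, not DeepSpeed's API): each data-parallel rank keeps optimizer state only for its own contiguous shard of the flattened parameters, cutting optimizer memory roughly by the data-parallel degree:

```python
def shard_bounds(num_params, dp_size, rank):
    """Contiguous [start, end) slice of parameters owned by `rank`."""
    per_rank = (num_params + dp_size - 1) // dp_size  # ceil division
    start = min(rank * per_rank, num_params)
    end = min(start + per_rank, num_params)
    return start, end

class ShardedAdamState:
    """Adam momentum/variance held only for the local shard.

    Without sharding, every data-parallel rank would hold the full
    m and v buffers; here each rank holds ~1/dp_size of them.
    """
    def __init__(self, num_params, dp_size, rank):
        self.start, self.end = shard_bounds(num_params, dp_size, rank)
        n = self.end - self.start
        self.m = [0.0] * n  # first-moment estimate, local shard only
        self.v = [0.0] * n  # second-moment estimate, local shard only
```

With this picture in mind, the checkpointing code becomes easier to read too: each rank saves its shard, and loading must reassemble or re-partition state when the data-parallel degree changes.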

The point is not to memorize every option. The point is to understand which bottleneck the framework is trying to address.

The final post brings the series together by showing how to design a real LLM training stack step by step instead of assembling techniques haphazardly.