Distributed LLM Training 01 - Why LLM Training Becomes a Distributed Systems Problem
Once LLM training leaves a single GPU, it stops being only a modeling problem and becomes a systems problem around memory, communication, and recovery
From data parallelism to tensor parallelism, FSDP, ZeRO, and modern LLM training frameworks
Readers who want to move from single-GPU model training to multi-GPU and large-model training systems.
Basic deep learning training experience, comfort with GPUs, and a general sense of how model training loops work.
Data parallelism looks simple, but it carries both gradient synchronization cost and full model-state replication cost
To reason about distributed training performance, you need a concrete mental model for all-reduce and collective communication cost
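A minimal sketch of that mental model, assuming the standard ring all-reduce cost analysis (each of p ranks transfers roughly 2(p-1)/p times the buffer size, independent of any particular library):

```python
# Hypothetical helper: bytes each GPU moves in a ring all-reduce of an
# n-byte gradient buffer across p ranks. As p grows, this approaches 2n,
# so per-GPU communication cost stops improving with more GPUs.
def ring_allreduce_bytes_per_gpu(n_bytes: float, p: int) -> float:
    return 2 * (p - 1) / p * n_bytes

# Example: fp16 gradients of a 7B-parameter model are ~14 GB; on 8 GPUs
# each rank transfers about 24.5 GB per synchronization.
print(ring_allreduce_bytes_per_gpu(14.0, 8))  # 24.5 (GB)
```

The key takeaway is that the per-rank volume saturates near twice the buffer size, which is why gradient size, not GPU count, dominates synchronization cost at scale.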
DDP is not just a wrapper around your model; it is a runtime that coordinates autograd hooks, gradient buckets, and synchronization timing
Adding more GPUs changes optimizer semantics as well as throughput, so batch size and learning rate need to be reasoned about together
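One common heuristic for reasoning about the two together is the linear scaling rule (an assumption, not a universal law, and usually paired with warmup):

```python
# Hypothetical helper illustrating the linear scaling rule: when the
# global batch grows k-fold, scale the base learning rate by k.
def scaled_lr(base_lr: float, base_batch: int, global_batch: int) -> float:
    return base_lr * global_batch / base_batch

# 8 GPUs x per-GPU batch 32 = global batch 256, vs. a single-GPU batch of 32:
print(scaled_lr(3e-4, 32, 256))  # 0.0024
```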
Looking only at parameter size leads to bad decisions; training memory is really a combination of parameters, gradients, optimizer state, and activations
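That combination can be sketched with back-of-the-envelope arithmetic, assuming mixed-precision Adam (roughly 2 bytes of fp16 parameters, 2 of fp16 gradients, and 12 of fp32 master weights plus Adam moments per parameter; activations are excluded because they depend on batch and sequence length):

```python
# Hypothetical helper: approximate per-component training state in GB
# for mixed-precision Adam, excluding activations.
def training_state_gb(n_params: float) -> dict:
    b = n_params / 1e9  # billions of parameters
    return {
        "fp16 params": b * 2,
        "fp16 grads": b * 2,
        "fp32 master + Adam m,v": b * 12,
    }

# A "7B" model already needs ~112 GB of state before a single activation:
state = training_state_gb(7e9)
print(sum(state.values()))  # 112.0
```

This is why "the model is only 14 GB in fp16" is the wrong number to plan around.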
In distributed training, performance is often shaped more by how GPUs are connected than by the raw number of GPUs
Once the model itself is too large for one device, data parallelism is no longer enough and layer-internal computation has to be split
Tensor parallelism becomes real when you map it onto QKV projections, attention output paths, and the two large MLP projections inside a transformer block
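The MLP half of that mapping can be sketched numerically, assuming a Megatron-style split: the up-projection is partitioned by columns, the down-projection by matching rows, and summing the partial outputs stands in for the single all-reduce on the output path:

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, ranks = 8, 32, 4
x = rng.standard_normal((2, d))
W1 = rng.standard_normal((d, hidden))   # up-projection
W2 = rng.standard_normal((hidden, d))   # down-projection

def relu(a):
    return np.maximum(a, 0)

# Single-device reference computation.
ref = relu(x @ W1) @ W2

# Each "rank" holds a column slice of W1 and the matching row slice of W2.
# The elementwise nonlinearity commutes with the column split, so no
# communication is needed until the final sum (the all-reduce).
cols = np.split(W1, ranks, axis=1)
rows = np.split(W2, ranks, axis=0)
out = sum(relu(x @ c) @ r for c, r in zip(cols, rows))

print(np.allclose(ref, out))  # True
```

The same column-then-row pattern is what makes the QKV projection and attention output projection pair communication-cheap.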
As context length grows, activation memory and communication patterns change again, and sequence-oriented partitioning starts to matter
Once the model is split by depth into stages, idle time and stage imbalance become just as important as memory savings
Pipeline efficiency is shaped heavily by scheduling, because bubble size, activation memory, and implementation complexity all depend on it
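For the simplest GPipe-style schedule, the bubble has a closed form worth keeping in your head (a sketch under that scheduling assumption; interleaved schedules change the formula):

```python
# Idle ("bubble") fraction of a step for a GPipe-style schedule with
# `stages` pipeline stages and `microbatches` microbatches per step.
def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

# 4 stages, 4 microbatches: 3/7 of the step is idle time.
print(bubble_fraction(4, 4))   # ~0.4286
# Raising microbatches to 32 shrinks it below 9%, at the cost of more
# in-flight activation memory.
print(bubble_fraction(4, 32))  # ~0.0857
```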
Saving memory by recomputing activations is not a minor option; it is often a central design choice in large-scale training
ZeRO is best understood as a staged system for removing different forms of replicated training state
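The staging can be summarized in a few lines, reusing the mixed-precision Adam byte counts from above (2 + 2 + 12 bytes per parameter; the stage numbers follow ZeRO's convention of sharding optimizer state, then gradients, then parameters):

```python
# Hypothetical helper: approximate per-GPU training state (GB) for an
# n-billion-parameter model under ZeRO stages 0-3 with dp-way data
# parallelism and mixed-precision Adam.
def zero_state_per_gpu_gb(n_params_b: float, dp: int, stage: int) -> float:
    params = 2 * n_params_b     # fp16 parameters
    grads = 2 * n_params_b      # fp16 gradients
    optim = 12 * n_params_b     # fp32 master weights + Adam m, v
    if stage >= 1:
        optim /= dp             # stage 1: shard optimizer state
    if stage >= 2:
        grads /= dp             # stage 2: also shard gradients
    if stage >= 3:
        params /= dp            # stage 3: also shard parameters
    return params + grads + optim

# 7B model, 8-way data parallel: 112 GB fully replicated vs. 14 GB sharded.
print(zero_state_per_gpu_gb(7, 8, 0))  # 112.0
print(zero_state_per_gpu_gb(7, 8, 3))  # 14.0
```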
FSDP keeps parameters sharded and only gathers them when needed, making it a direct answer to parameter-replication pressure
The goal of overlap is not to eliminate communication entirely, but to make it finish underneath useful computation
In long distributed runs, reliable recovery is as important as raw throughput
Debugging distributed training is about narrowing down which rank, which collective, and which state transition went wrong
Frameworks are easier to understand when you read them as bundles of parallelization and state-management choices rather than as giant feature lists
Distributed training architecture is not about collecting fashionable techniques, but about choosing the smallest structure that matches the current bottleneck