Distributed LLM Training 10 - Sequence Parallelism and the Cost of Long Context
As context length grows, activation memory and communication patterns change again, and sequence-oriented partitioning starts to matter
Once the model is split by depth into stages, idle time and stage imbalance become just as important as memory savings
Pipeline efficiency is shaped heavily by scheduling, because bubble size, activation memory, and implementation complexity all depend on it
Saving memory by recomputing activations is not a minor option; it is often a central design choice in large-scale training
ZeRO is best understood as a staged system for removing different forms of replicated training state
FSDP keeps parameters sharded and only gathers them when needed, making it a direct answer to parameter-replication pressure
The goal of overlap is not to eliminate communication entirely, but to make it finish underneath useful computation
In long distributed runs, reliable recovery is as important as raw throughput
Debugging distributed training is about narrowing down which rank, which collective, and which state transition went wrong