Distributed LLM Training 15 - How FSDP Differs from DDP and When It Helps
FSDP keeps parameters sharded and only gathers them when needed, making it a direct answer to parameter-replication pressure
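The core idea above can be sketched in plain Python: each rank keeps only its shard of a parameter at rest, all-gathers the full tensor just before a layer runs, and frees the gathered copy immediately afterward. This is a minimal simulation of the memory behavior, not real FSDP code; the world size, parameter size, and helper names are illustrative assumptions.

```python
import numpy as np

WORLD_SIZE = 4  # hypothetical number of ranks

def shard(param, world_size):
    """Split a flat parameter into equal per-rank shards (FSDP's resting state)."""
    return np.array_split(param, world_size)

def all_gather(shards):
    """Simulate all-gather: reconstruct the full parameter from every rank's shard."""
    return np.concatenate(shards)

# A "layer" with one million float32 parameters, flattened.
full_param = np.ones(1_000_000, dtype=np.float32)
shards = shard(full_param, WORLD_SIZE)

# DDP: every rank holds the full parameter at all times.
ddp_resident = full_param.nbytes            # 4 MB per rank
# FSDP: each rank holds only its shard between layer executions.
fsdp_resident = shards[0].nbytes            # ~1 MB per rank

# Forward pass as seen from one rank: gather, compute, drop the full copy.
gathered = all_gather(shards)
activation = gathered.sum()   # stand-in for the layer's actual computation
del gathered                  # full parameter freed; only the local shard remains

print(ddp_resident // fsdp_resident)  # → 4: resident footprint shrinks by 1/world_size
```

The transient all-gather means FSDP trades extra communication for a resident parameter footprint of roughly 1/world_size, which is exactly when it helps: models whose replicated parameters, gradients, and optimizer state no longer fit under DDP.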
All posts in the Lectures
The goal of overlap is not to eliminate communication entirely, but to make it finish underneath useful computation
In long distributed runs, reliable recovery is as important as raw throughput
Debugging distributed training is about narrowing down which rank, which collective, and which state transition went wrong
Frameworks are easier to understand when you read them as bundles of parallelization and state-management choices rather than as giant feature lists
Distributed training architecture is not about collecting fashionable techniques, but about choosing the smallest structure that matches the current bottleneck
The background knowledge that makes the GPU Systems series much easier to follow
A practical study order from GPU architecture to CUDA, Triton, and kernel optimization
What threads, warps, blocks, and grids mean in actual GPU execution