Distributed LLM Training 02 - The Real Cost of Synchronous SGD and Data Parallelism
Data parallelism looks simple, but it carries both gradient synchronization cost and full model-state replication cost
To reason about distributed training performance, you need a concrete mental model of what all-reduce and other collective communication operations cost
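To make the replication cost above concrete, here is a minimal sketch of the per-replica memory footprint under plain data parallelism, assuming standard mixed-precision Adam (fp16 parameters and gradients plus fp32 master weights and two fp32 optimizer moments, roughly 16 bytes per parameter); the model sizes below are illustrative, not from this post:

```python
def replica_state_bytes(num_params: int) -> int:
    """Approximate per-GPU model-state footprint for plain data parallelism
    with mixed-precision Adam. Every replica holds a full copy of:
      fp16 parameters (2 B) + fp16 gradients (2 B)
      + fp32 master weights (4 B) + fp32 momentum (4 B) + fp32 variance (4 B)
    = ~16 bytes per parameter (assumption: standard mixed-precision Adam).
    """
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return num_params * bytes_per_param

# Illustrative model sizes (hypothetical, not from the post):
for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{replica_state_bytes(int(n)) / 2**30:.0f} GiB per replica")
```

Even a 7B-parameter model needs on the order of 100 GiB of model state per replica in this regime, which is why "just copy the model to every GPU" stops being free well before the largest models.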
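As a first-order mental model for the synchronization side: a bandwidth-optimal ring all-reduce over p workers sends roughly 2*(p-1)/p * N bytes per GPU for an N-byte gradient buffer, plus a latency term for its 2*(p-1) steps. Here is a minimal alpha-beta cost sketch; the bandwidth and latency defaults are illustrative assumptions, not measurements:

```python
def ring_allreduce_seconds(buffer_bytes: float, num_gpus: int,
                           link_bw_bytes_per_s: float = 150e9,
                           per_step_latency_s: float = 10e-6) -> float:
    """Alpha-beta cost model for ring all-reduce across num_gpus workers:
    2*(p-1) steps (a reduce-scatter phase then an all-gather phase), each
    sending buffer_bytes/p per GPU, i.e. ~2*(p-1)/p * buffer_bytes total
    per GPU. Default bandwidth/latency are assumptions for illustration."""
    p = num_gpus
    latency_term = 2 * (p - 1) * per_step_latency_s
    bandwidth_term = (2 * (p - 1) / p) * buffer_bytes / link_bw_bytes_per_s
    return latency_term + bandwidth_term

# Example: fp16 gradients of a 7B-parameter model (~14 GB) across 8 GPUs
# (model size and cluster shape are hypothetical):
t = ring_allreduce_seconds(14e9, 8)
print(f"~{t * 1e3:.1f} ms per all-reduce")
```

The useful takeaway from this model is that per-GPU traffic is nearly independent of p (the 2*(p-1)/p factor saturates toward 2), so gradient buffer size and link bandwidth, not worker count, dominate the bandwidth term.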