DDP is more than one API call

DistributedDataParallel often looks deceptively simple from the outside. You wrap a model, call it like any other module, and training proceeds. But internally, DDP is actively coordinating gradient readiness, bucket scheduling, and collective calls during the backward pass.
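Here is roughly what that wrapping looks like, as a minimal sketch. It assumes PyTorch with the gloo backend and uses a single-process group (world_size=1) so it runs without launching multiple ranks; in real training each rank would run this same code with its own rank number.

```python
# Minimal DDP wrap, assuming PyTorch with the gloo backend and a
# single-process group so the example runs standalone.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 2)
ddp_model = DDP(model)                # registers per-parameter autograd hooks

out = ddp_model(torch.randn(4, 8))
out.sum().backward()                  # gradient all-reduce happens inside here

print(model.weight.grad is not None)
dist.destroy_process_group()
```

The single visible change is the wrap; everything this post discusses happens inside that backward call.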

That internal behavior explains:

  • why overlap works better for some models than others
  • why bucket size affects performance
  • why unused parameter behavior matters for both correctness and speed

The main pieces

At a high level, DDP depends on:

  • a process group to define which ranks communicate
  • an ordering of parameters for bucketing
  • autograd hooks to detect when gradients are ready
  • a reducer that launches all-reduce when a bucket becomes ready

The important point is that DDP does not always wait for the entire backward pass to finish before starting communication. If a bucket is ready early enough, communication can begin while later parts of backward are still computing.
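The mechanics can be sketched in plain Python. This is a toy model of the reducer, not PyTorch's real implementation: parameters are grouped into fixed-size buckets roughly in reverse registration order (gradients for later layers tend to be ready first), a hook marks each gradient ready, and a bucket's all-reduce launches the moment its last gradient arrives.

```python
# Toy model of DDP's reducer (illustrative, not PyTorch's actual code).
class Reducer:
    def __init__(self, params, bucket_size):
        ordered = list(reversed(params))   # reverse registration order
        self.buckets = [ordered[i:i + bucket_size]
                        for i in range(0, len(ordered), bucket_size)]
        self.ready = set()
        self.launched = []                 # bucket indices, in launch order

    def on_grad_ready(self, param):        # what a per-parameter hook would do
        self.ready.add(param)
        for i, bucket in enumerate(self.buckets):
            if i not in self.launched and all(p in self.ready for p in bucket):
                self.launched.append(i)    # would launch an async all-reduce

params = ["w0", "w1", "w2", "w3"]
reducer = Reducer(params, bucket_size=2)
# Backward produces gradients roughly last-layer-first:
for p in reversed(params):
    reducer.on_grad_ready(p)
print(reducer.launched)  # first bucket launches while "backward" is still running
```

The key behavior to notice is that bucket 0 launches after only half the gradients exist, which is exactly the opening that overlap exploits.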

Why buckets exist

Synchronizing gradients parameter by parameter would create too many collective launches. Synchronizing everything only at the end would kill overlap. Buckets are the practical middle ground.

They help by:

  • reducing the number of collective launches
  • keeping message sizes large enough to use bandwidth well
  • allowing communication to overlap with later backward compute

If buckets are too small, latency dominates. If they are too large, overlap starts too late.
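The latency side of that tradeoff can be put in a back-of-envelope model. The numbers below are illustrative, not measurements, and the model deliberately ignores overlap, which is why the huge bucket still looks fine here even though it would start communicating too late in practice.

```python
# Back-of-envelope cost model for bucket size (illustrative numbers only).
# Each collective launch pays a fixed latency; the payload then moves at
# link bandwidth. Many small buckets pay the latency over and over.
def comm_time(total_mb, bucket_mb, latency_ms=0.05, gbps=100.0):
    ms_per_mb = 8.0 / gbps               # time to move 1 MB at `gbps`
    n_buckets = max(1, round(total_mb / bucket_mb))
    return n_buckets * latency_ms + total_mb * ms_per_mb

for bucket_mb in (1, 25, 500):           # 500 MB of gradients total
    print(bucket_mb, round(comm_time(500, bucket_mb), 2))
```

Tiny buckets multiply the per-launch latency; the missing overlap term is what penalizes the single giant bucket in real runs.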

Common practical pitfalls

1. dynamic or conditional parameter usage

If some parameters are not used on every step, DDP cannot simply wait for all gradients to arrive: it has to discover which ones never will. Turning on find_unused_parameters=True makes DDP traverse the autograd graph after each forward pass to mark skipped parameters as ready, which adds per-iteration overhead; leaving it off while skipping parameters can stall or error on ranks waiting for gradients that never come.
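A sketch of the conditional-usage case, again assuming PyTorch with gloo in a single-process group. The Branchy module and use_b flag are hypothetical names for illustration; the point is that with find_unused_parameters=True, backward completes even when a branch is skipped.

```python
# Conditional parameter usage under DDP, assuming PyTorch with the gloo
# backend in a single-process group (so the example runs standalone).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")
dist.init_process_group("gloo", rank=0, world_size=1)

class Branchy(torch.nn.Module):          # hypothetical example model
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Linear(8, 8)
        self.b = torch.nn.Linear(8, 8)   # only used on some steps

    def forward(self, x, use_b):
        x = self.a(x)
        return self.b(x) if use_b else x

model = Branchy()
ddp_model = DDP(model, find_unused_parameters=True)
ddp_model(torch.randn(4, 8), use_b=False).sum().backward()
print(model.a.weight.grad is not None)   # backward completed despite skipped b
dist.destroy_process_group()
```

On a real multi-rank run, omitting find_unused_parameters=True here would leave the reducer waiting on b's gradients whenever use_b is False.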

2. gradient accumulation and no_sync

Accumulation is often necessary at scale, but the boundaries matter. Inside no_sync, each rank keeps accumulating purely local gradients, so entering or leaving the context on the wrong micro-batch leaves ranks with divergent gradients, and deferring the all-reduce also shifts when communication cost and gradient memory actually appear.
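The usual pattern looks like the following sketch, assuming PyTorch with gloo in a single-process group: synchronization is skipped on all micro-batches except the last, and the optimizer steps only at the boundary.

```python
# Gradient accumulation with no_sync, assuming PyTorch with the gloo
# backend and a single-process group (so the example runs standalone).
import os
import contextlib
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29503")
dist.init_process_group("gloo", rank=0, world_size=1)

ddp_model = DDP(torch.nn.Linear(8, 2))
opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
before = ddp_model.module.weight.detach().clone()

accum_steps = 4
for step in range(accum_steps):
    is_boundary = (step + 1) % accum_steps == 0
    ctx = contextlib.nullcontext() if is_boundary else ddp_model.no_sync()
    with ctx:
        loss = ddp_model(torch.randn(4, 8)).sum() / accum_steps
        loss.backward()                  # all-reduce fires only at the boundary
    if is_boundary:
        opt.step()
        opt.zero_grad()

changed = not torch.equal(before, ddp_model.module.weight)
print(changed)
dist.destroy_process_group()
```

If the final backward were also wrapped in no_sync, the optimizer would step on unsynchronized local gradients, which is the distortion the text above warns about.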

3. overlap is not automatic magic

Model structure, bucket ordering, kernel duration, and network state all affect whether overlap actually works well.
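One way to see those interactions is a toy timeline, not a real profiler. In this sketch, backward runs bucket by bucket on a compute stream; each bucket's all-reduce runs on a separate communication stream as soon as the bucket is ready and that stream is free. Wall time is set by whichever stream finishes last.

```python
# Toy compute/communication overlap timeline (illustrative numbers only).
def wall_time(compute_ms, comm_ms):
    t_compute = 0.0    # when each bucket's gradients become ready
    comm_free = 0.0    # when the comm stream is next available
    t_end = 0.0
    for c, m in zip(compute_ms, comm_ms):
        t_compute += c                     # bucket becomes ready
        start = max(t_compute, comm_free)  # comm stream may still be busy
        comm_free = start + m
        t_end = comm_free
    return max(t_compute, t_end)

# Three buckets: the first is ready early, so its all-reduce overlaps
# with the remaining backward compute.
print(wall_time([10, 10, 10], [8, 8, 8]))   # overlapped schedule
print(30 + 24)                              # same work fully serialized
```

Changing the kernel durations or bucket ordering in this model shifts how much communication hides behind compute, which is the same sensitivity real DDP runs show.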

DDP is best understood as a system for deciding when gradient synchronization can begin. The next post builds on that by looking at global batch size, accumulation, and optimizer behavior together.