DDP is more than one API call

DistributedDataParallel often looks deceptively simple from the outside. You wrap a model, call it like any other module, and training proceeds. But internally, DDP is actively coordinating gradient readiness, bucket scheduling, and collective calls during the backward pass.
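Here is roughly what that wrapping looks like, as a minimal sketch. It assumes PyTorch with the gloo backend and uses a single-process group (world_size=1) so it runs without launching multiple ranks; in real training each rank would run this same code with its own rank number.

```python
# Minimal DDP wrap, assuming PyTorch with the gloo backend and a
# single-process group so the example runs standalone.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 2)
ddp_model = DDP(model)                # registers per-parameter autograd hooks

out = ddp_model(torch.randn(4, 8))
out.sum().backward()                  # gradient all-reduce happens inside here

print(model.weight.grad is not None)
dist.destroy_process_group()
```

The single visible change is the wrap; everything this post discusses happens inside that backward call.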

That internal behavior explains:

  • why overlap works better for some models than others
  • why bucket size affects performance
  • why unused parameter behavior matters for both correctness and speed

The main pieces

At a high level, DDP depends on:

  • a process group to define which ranks communicate
  • an ordering of parameters for bucketing
  • autograd hooks to detect when gradients are ready
  • a reducer that launches all-reduce when a bucket becomes ready

The important point is that DDP does not always wait for the entire backward pass to finish before starting communication. If a bucket is ready early enough, communication can begin while later parts of backward are still computing.
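The mechanics can be sketched in plain Python. This is a toy model of the reducer, not PyTorch's real implementation: parameters are grouped into fixed-size buckets roughly in reverse registration order (gradients for later layers tend to be ready first), a hook marks each gradient ready, and a bucket's all-reduce launches the moment its last gradient arrives.

```python
# Toy model of DDP's reducer (illustrative, not PyTorch's actual code).
class Reducer:
    def __init__(self, params, bucket_size):
        ordered = list(reversed(params))   # reverse registration order
        self.buckets = [ordered[i:i + bucket_size]
                        for i in range(0, len(ordered), bucket_size)]
        self.ready = set()
        self.launched = []                 # bucket indices, in launch order

    def on_grad_ready(self, param):        # what a per-parameter hook would do
        self.ready.add(param)
        for i, bucket in enumerate(self.buckets):
            if i not in self.launched and all(p in self.ready for p in bucket):
                self.launched.append(i)    # would launch an async all-reduce

params = ["w0", "w1", "w2", "w3"]
reducer = Reducer(params, bucket_size=2)
# Backward produces gradients roughly last-layer-first:
for p in reversed(params):
    reducer.on_grad_ready(p)
print(reducer.launched)  # first bucket launches while "backward" is still running
```

The key behavior to notice is that bucket 0 launches after only half the gradients exist, which is exactly the opening that overlap exploits.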

Why buckets exist

Synchronizing gradients parameter by parameter would create too many collective launches. Synchronizing everything only at the end would kill overlap. Buckets are the practical middle ground.

They help by:

  • reducing the number of collective launches
  • keeping message sizes large enough to use bandwidth well
  • allowing communication to overlap with later backward compute

If buckets are too small, latency dominates. If they are too large, overlap starts too late.
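The latency side of that tradeoff can be put in a back-of-envelope model. The numbers below are illustrative, not measurements, and the model deliberately ignores overlap, which is why the huge bucket still looks fine here even though it would start communicating too late in practice.

```python
# Back-of-envelope cost model for bucket size (illustrative numbers only).
# Each collective launch pays a fixed latency; the payload then moves at
# link bandwidth. Many small buckets pay the latency over and over.
def comm_time(total_mb, bucket_mb, latency_ms=0.05, gbps=100.0):
    ms_per_mb = 8.0 / gbps               # time to move 1 MB at `gbps`
    n_buckets = max(1, round(total_mb / bucket_mb))
    return n_buckets * latency_ms + total_mb * ms_per_mb

for bucket_mb in (1, 25, 500):           # 500 MB of gradients total
    print(bucket_mb, round(comm_time(500, bucket_mb), 2))
```

Tiny buckets multiply the per-launch latency; the missing overlap term is what penalizes the single giant bucket in real runs.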

Common practical pitfalls

1. dynamic or conditional parameter usage

If some parameters are not used on every step, DDP cannot simply wait for all gradients to arrive: it has to discover which ones never will. Turning on find_unused_parameters=True makes DDP traverse the autograd graph after each forward pass to mark skipped parameters as ready, which adds per-iteration overhead; leaving it off while skipping parameters can stall or error on ranks waiting for gradients that never come.
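A sketch of the conditional-usage case, again assuming PyTorch with gloo in a single-process group. The Branchy module and use_b flag are hypothetical names for illustration; the point is that with find_unused_parameters=True, backward completes even when a branch is skipped.

```python
# Conditional parameter usage under DDP, assuming PyTorch with the gloo
# backend in a single-process group (so the example runs standalone).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")
dist.init_process_group("gloo", rank=0, world_size=1)

class Branchy(torch.nn.Module):          # hypothetical example model
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Linear(8, 8)
        self.b = torch.nn.Linear(8, 8)   # only used on some steps

    def forward(self, x, use_b):
        x = self.a(x)
        return self.b(x) if use_b else x

model = Branchy()
ddp_model = DDP(model, find_unused_parameters=True)
ddp_model(torch.randn(4, 8), use_b=False).sum().backward()
print(model.a.weight.grad is not None)   # backward completed despite skipped b
dist.destroy_process_group()
```

On a real multi-rank run, omitting find_unused_parameters=True here would leave the reducer waiting on b's gradients whenever use_b is False.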

2. gradient accumulation and no_sync

Accumulation is often necessary at scale, but the boundaries matter. Inside no_sync, each rank keeps accumulating purely local gradients, so entering or leaving the context on the wrong micro-batch leaves ranks with divergent gradients, and deferring the all-reduce also shifts when communication cost and gradient memory actually appear.
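The usual pattern looks like the following sketch, assuming PyTorch with gloo in a single-process group: synchronization is skipped on all micro-batches except the last, and the optimizer steps only at the boundary.

```python
# Gradient accumulation with no_sync, assuming PyTorch with the gloo
# backend and a single-process group (so the example runs standalone).
import os
import contextlib
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29503")
dist.init_process_group("gloo", rank=0, world_size=1)

ddp_model = DDP(torch.nn.Linear(8, 2))
opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
before = ddp_model.module.weight.detach().clone()

accum_steps = 4
for step in range(accum_steps):
    is_boundary = (step + 1) % accum_steps == 0
    ctx = contextlib.nullcontext() if is_boundary else ddp_model.no_sync()
    with ctx:
        loss = ddp_model(torch.randn(4, 8)).sum() / accum_steps
        loss.backward()                  # all-reduce fires only at the boundary
    if is_boundary:
        opt.step()
        opt.zero_grad()

changed = not torch.equal(before, ddp_model.module.weight)
print(changed)
dist.destroy_process_group()
```

If the final backward were also wrapped in no_sync, the optimizer would step on unsynchronized local gradients, which is the distortion the text above warns about.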

3. overlap is not automatic magic

Model structure, bucket ordering, kernel duration, and network state all affect whether overlap actually works well.
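One way to see those interactions is a toy timeline, not a real profiler. In this sketch, backward runs bucket by bucket on a compute stream; each bucket's all-reduce runs on a separate communication stream as soon as the bucket is ready and that stream is free. Wall time is set by whichever stream finishes last.

```python
# Toy compute/communication overlap timeline (illustrative numbers only).
def wall_time(compute_ms, comm_ms):
    t_compute = 0.0    # when each bucket's gradients become ready
    comm_free = 0.0    # when the comm stream is next available
    t_end = 0.0
    for c, m in zip(compute_ms, comm_ms):
        t_compute += c                     # bucket becomes ready
        start = max(t_compute, comm_free)  # comm stream may still be busy
        comm_free = start + m
        t_end = comm_free
    return max(t_compute, t_end)

# Three buckets: the first is ready early, so its all-reduce overlaps
# with the remaining backward compute.
print(wall_time([10, 10, 10], [8, 8, 8]))   # overlapped schedule
print(30 + 24)                              # same work fully serialized
```

Changing the kernel durations or bucket ordering in this model shifts how much communication hides behind compute, which is the same sensitivity real DDP runs show.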

DDP is best understood as a system for deciding when gradient synchronization can begin. The next post builds on that by looking at global batch size, accumulation, and optimizer behavior together.