GPU count alone is not a performance model

Two different 8-GPU setups can have very different scaling behavior. One may scale cleanly, while the other spends much of its time waiting on communication. The reason is usually topology.

  • are GPUs connected with NVLink?
  • are they relying only on PCIe?
  • what is the path between nodes?
  • how are GPUs and NICs placed relative to NUMA domains?

Those details change communication cost dramatically.
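A back-of-the-envelope cost model makes the difference concrete. The sketch below uses only the bandwidth term of a ring all-reduce, 2(p−1)/p × bytes / bandwidth; the link speeds and message size are illustrative assumptions, not measurements of any particular machine:

```python
# Bandwidth term of a ring all-reduce (latency ignored):
# each byte crosses the slowest link roughly 2*(p-1)/p times.
def ring_allreduce_seconds(num_gpus: int, message_bytes: float,
                           link_gb_per_s: float) -> float:
    bytes_on_wire = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return bytes_on_wire / (link_gb_per_s * 1e9)

GRAD_BYTES = 2 * 7e9  # e.g. gradients of a 7B-parameter model in fp16 (assumed)
nvlink = ring_allreduce_seconds(8, GRAD_BYTES, 300)  # ~300 GB/s NVLink (assumed)
pcie = ring_allreduce_seconds(8, GRAD_BYTES, 32)     # ~32 GB/s PCIe Gen4 x16 (assumed)
print(f"NVLink: {nvlink * 1000:.0f} ms, PCIe: {pcie * 1000:.0f} ms")
```

Same GPU count, roughly a 9x difference in all-reduce time, purely from the interconnect.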

What NCCL is doing for you

NCCL (the NVIDIA Collective Communications Library) is NVIDIA's library for efficient collective communication. It implements all-reduce, all-gather, reduce-scatter, and related primitives, and at initialization it detects the hardware topology and tries to map those collectives onto it efficiently.

That does not mean topology stops mattering. It means NCCL is working within those physical constraints.
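You can watch NCCL make those mapping decisions by turning on its debug output. One common setup (environment variables only, no code changes):

```shell
# Log NCCL's topology detection and ring/tree graph search at startup.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH
# Optionally dump the detected topology to an XML file for later inspection.
export NCCL_TOPO_DUMP_FILE=/tmp/nccl_topo.xml
```

The INIT/GRAPH logs show which links NCCL found and which rings or trees it built over them, which is often the fastest way to confirm whether traffic is going where you expect.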

Common bottlenecks in practice

1. slow inter-node paths

Intra-node communication may be fast over NVLink, but scaling can fall apart if inter-node traffic has to cross a much slower path.

2. poor rank placement

If ranks are mapped to GPUs without regard for topology, communication that could stay inside a node may instead cross node boundaries or take a longer path through the system.
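A toy example shows how much placement alone can change. The layout below is assumed for illustration: 2 nodes with 4 GPUs each, and a ring collective where rank r sends to rank r+1:

```python
# Sketch: count how many ring hops cross a node boundary under two
# rank-placement schemes. Layout (2 nodes x 4 GPUs) is assumed.
WORLD, PER_NODE = 8, 4
NUM_NODES = WORLD // PER_NODE

def node_of(rank: int, scheme: str) -> int:
    """Which node a rank lands on under a given placement scheme."""
    if scheme == "block":        # ranks 0-3 on node 0, ranks 4-7 on node 1
        return rank // PER_NODE
    return rank % NUM_NODES      # "round_robin": consecutive ranks alternate nodes

def cross_node_hops(scheme: str) -> int:
    """Ring hops (rank r -> r+1 mod WORLD) that must leave the node."""
    return sum(node_of(r, scheme) != node_of((r + 1) % WORLD, scheme)
               for r in range(WORLD))

print(cross_node_hops("block"))        # 2 hops cross nodes
print(cross_node_hops("round_robin"))  # 8 hops cross nodes
```

With block placement only two hops touch the slow inter-node link; with round-robin placement every hop does, even though the hardware is identical.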

3. communication is the bottleneck, but compute gets blamed

If compute kernels are short and step time is still high, NCCL activity or waiting behavior may be the real issue.
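A quick sanity check is to subtract profiled compute-kernel time from wall-clock step time; both numbers below are illustrative assumptions:

```python
# If GPU math kernels account for only part of the step, the remainder is
# exposed waiting: often communication, sometimes dataloading or launch gaps.
step_ms = 120.0           # measured wall-clock time per training step (assumed)
compute_kernel_ms = 48.0  # total compute-kernel time from a profiler (assumed)
exposed_ms = step_ms - compute_kernel_ms
print(f"exposed (non-compute) time: {exposed_ms:.0f} ms "
      f"({exposed_ms / step_ms:.0%} of the step)")
```

When the exposed fraction is this large, optimizing the compute kernels further cannot recover most of the step time.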

Signals worth watching

  • utilization looks fine, but scaling efficiency drops sharply
  • adding nodes hurts more than expected
  • one collective pattern dominates the profile
  • some ranks are repeatedly slower than others

Those are often cluster-layout or communication-pattern issues rather than model-code issues.
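The first signal is easy to quantify. A hypothetical helper, with illustrative throughput numbers:

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n: int) -> float:
    """Fraction of ideal linear speedup achieved on n workers."""
    return throughput_n / (n * throughput_1)

# Illustrative: 1000 samples/s on 1 GPU, 5200 samples/s on 8 GPUs.
print(scaling_efficiency(5200, 1000, 8))  # 0.65 -> only 65% of linear scaling
```

Tracking this number as you add GPUs or nodes makes the "adding nodes hurts more than expected" pattern visible early.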

The next post moves into tensor parallelism, where communication is no longer just a step-boundary issue but part of the layer execution itself.