GPU count alone is not a performance model

Two different 8-GPU setups can have very different scaling behavior. One may scale cleanly, while the other spends much of its time waiting on communication. The reason is usually topology.

  • are GPUs connected with NVLink?
  • are they relying only on PCIe?
  • what is the path between nodes?
  • how are GPUs and NICs placed relative to NUMA domains?

Those details change communication cost dramatically.
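A back-of-the-envelope cost model makes the difference concrete. The sketch below uses only the bandwidth term of a ring all-reduce, 2(p−1)/p × bytes / bandwidth; the link speeds and message size are illustrative assumptions, not measurements of any particular machine:

```python
# Bandwidth term of a ring all-reduce (latency ignored):
# each byte crosses the slowest link roughly 2*(p-1)/p times.
def ring_allreduce_seconds(num_gpus: int, message_bytes: float,
                           link_gb_per_s: float) -> float:
    bytes_on_wire = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return bytes_on_wire / (link_gb_per_s * 1e9)

GRAD_BYTES = 2 * 7e9  # e.g. gradients of a 7B-parameter model in fp16 (assumed)
nvlink = ring_allreduce_seconds(8, GRAD_BYTES, 300)  # ~300 GB/s NVLink (assumed)
pcie = ring_allreduce_seconds(8, GRAD_BYTES, 32)     # ~32 GB/s PCIe Gen4 x16 (assumed)
print(f"NVLink: {nvlink * 1000:.0f} ms, PCIe: {pcie * 1000:.0f} ms")
```

Same GPU count, roughly a 9x difference in all-reduce time, purely from the interconnect.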

What NCCL is doing for you

NCCL (the NVIDIA Collective Communications Library) is NVIDIA's library for efficient collective communication. It implements all-reduce, all-gather, reduce-scatter, and related primitives, and at initialization it detects the hardware topology and tries to map those collectives onto it efficiently.

That does not mean topology stops mattering. It means NCCL is working within those physical constraints.
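You can watch NCCL make those mapping decisions by turning on its debug output. One common setup (environment variables only, no code changes):

```shell
# Log NCCL's topology detection and ring/tree graph search at startup.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH
# Optionally dump the detected topology to an XML file for later inspection.
export NCCL_TOPO_DUMP_FILE=/tmp/nccl_topo.xml
```

The INIT/GRAPH logs show which links NCCL found and which rings or trees it built over them, which is often the fastest way to confirm whether traffic is going where you expect.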

Common bottlenecks in practice

1. slow inter-node paths

Intra-node communication may be fast over NVLink, but scaling can fall apart if inter-node traffic has to cross a much slower path.

2. poor rank placement

If ranks are mapped to GPUs without regard for topology, communication that could stay inside a node may instead cross node boundaries or take a longer path through the system.
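A toy example shows how much placement alone can change. The layout below is assumed for illustration: 2 nodes with 4 GPUs each, and a ring collective where rank r sends to rank r+1:

```python
# Sketch: count how many ring hops cross a node boundary under two
# rank-placement schemes. Layout (2 nodes x 4 GPUs) is assumed.
WORLD, PER_NODE = 8, 4
NUM_NODES = WORLD // PER_NODE

def node_of(rank: int, scheme: str) -> int:
    """Which node a rank lands on under a given placement scheme."""
    if scheme == "block":        # ranks 0-3 on node 0, ranks 4-7 on node 1
        return rank // PER_NODE
    return rank % NUM_NODES      # "round_robin": consecutive ranks alternate nodes

def cross_node_hops(scheme: str) -> int:
    """Ring hops (rank r -> r+1 mod WORLD) that must leave the node."""
    return sum(node_of(r, scheme) != node_of((r + 1) % WORLD, scheme)
               for r in range(WORLD))

print(cross_node_hops("block"))        # 2 hops cross nodes
print(cross_node_hops("round_robin"))  # 8 hops cross nodes
```

With block placement only two hops touch the slow inter-node link; with round-robin placement every hop does, even though the hardware is identical.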

3. communication is the bottleneck, but compute gets blamed

If compute kernels are short and step time is still high, NCCL activity or waiting behavior may be the real issue.
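A quick sanity check is to subtract profiled compute-kernel time from wall-clock step time; both numbers below are illustrative assumptions:

```python
# If GPU math kernels account for only part of the step, the remainder is
# exposed waiting: often communication, sometimes dataloading or launch gaps.
step_ms = 120.0           # measured wall-clock time per training step (assumed)
compute_kernel_ms = 48.0  # total compute-kernel time from a profiler (assumed)
exposed_ms = step_ms - compute_kernel_ms
print(f"exposed (non-compute) time: {exposed_ms:.0f} ms "
      f"({exposed_ms / step_ms:.0%} of the step)")
```

When the exposed fraction is this large, optimizing the compute kernels further cannot recover most of the step time.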

Signals worth watching

  • utilization looks fine, but scaling efficiency drops sharply
  • adding nodes hurts more than expected
  • one collective pattern dominates the profile
  • some ranks are repeatedly slower than others

Those are often cluster-layout or communication-pattern issues rather than model-code issues.
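The first signal is easy to quantify. A hypothetical helper, with illustrative throughput numbers:

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n: int) -> float:
    """Fraction of ideal linear speedup achieved on n workers."""
    return throughput_n / (n * throughput_1)

# Illustrative: 1000 samples/s on 1 GPU, 5200 samples/s on 8 GPUs.
print(scaling_efficiency(5200, 1000, 8))  # 0.65 -> only 65% of linear scaling
```

Tracking this number as you add GPUs or nodes makes the "adding nodes hurts more than expected" pattern visible early.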

The next post moves into tensor parallelism, where communication is no longer just a step-boundary issue but part of the layer execution itself.