Why fuse at all

Fused operators usually aim to reduce:

  • kernel launch overhead
  • intermediate tensor materialization
  • unnecessary global memory traffic

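So the real benefit is usually memory-system efficiency: intermediate values stay in registers or on-chip memory instead of making a round trip through global memory. A lower operator count is only part of the story.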
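To make the three items above concrete, here is a back-of-the-envelope sketch in plain Python. It is a simplified cost model, not a measurement. It assumes a chain of unary elementwise ops in which every unfused kernel reads its input from global memory once and writes its output once, while the fused kernel reads the original input once and writes only the final result. The function name `elementwise_chain_traffic` and its parameters are made up for this illustration and do not come from any framework.

```python
def elementwise_chain_traffic(num_elements, bytes_per_element, num_ops, fused):
    """Estimate the cost of a chain of unary elementwise ops.

    Assumed model (hypothetical, for illustration only):
      - unfused: one kernel per op, each reading and writing a full tensor
      - fused:   one kernel, reading the input once and writing the output once
    Caches, vectorization, and occupancy are ignored.
    """
    if num_ops < 1:
        raise ValueError("num_ops must be at least 1")

    tensor_bytes = num_elements * bytes_per_element
    kernels = 1 if fused else num_ops

    return {
        # one launch per kernel
        "kernel_launches": kernels,
        # every op except the last materializes a temporary when unfused
        "intermediate_tensors": 0 if fused else num_ops - 1,
        # one read plus one write of a full tensor per kernel
        "global_bytes_moved": 2 * tensor_bytes * kernels,
    }


if __name__ == "__main__":
    # Example: three elementwise ops over 2**20 float32 values.
    n, width, ops = 1 << 20, 4, 3
    unfused = elementwise_chain_traffic(n, width, ops, fused=False)
    fused = elementwise_chain_traffic(n, width, ops, fused=True)
    print("unfused:", unfused)
    print("fused:  ", fused)
    print("traffic ratio:", unfused["global_bytes_moved"] / fused["global_bytes_moved"])
```

Under this model the launch count drops from `num_ops` to 1, and so does the global memory traffic, by the same factor. On a bandwidth-bound chain the traffic term tends to dominate, which is why the memory saving usually matters more than the launch saving. Real kernels will deviate from these numbers, so treat the model as intuition only.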

The next post looks at automatic mixed precision (AMP) and numerical stability, because a fast operator that is unstable in mixed precision is of little practical use.