Why async execution matters

Calling a CUDA operator in PyTorch usually does not block the CPU until the kernel finishes. Instead, the work is enqueued on a CUDA stream and the Python call returns almost immediately, while the GPU drains the queue in the background.
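A quick way to observe this is to time the call itself versus the full completion. The sketch below assumes PyTorch is installed; on a machine with a CUDA device, the launch time is typically far smaller than the total time, while on CPU the two are close.

```python
import time

import torch

# Pick a CUDA device if one is available; the demo degrades gracefully on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2048, 2048, device=device)

start = time.perf_counter()
y = x @ x  # on CUDA this only enqueues the matmul and returns quickly
launch_time = time.perf_counter() - start

if device == "cuda":
    torch.cuda.synchronize()  # block until all queued kernels actually finish
total_time = time.perf_counter() - start
# On a GPU, launch_time << total_time; timing without the synchronize
# would measure little more than launch overhead.
```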

This asynchrony affects:

  • benchmarking correctness
  • overlap between compute and transfer
  • custom operator synchronization behavior
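For the benchmarking point in particular, CUDA events give a cleaner measurement than wall-clock timing around a synchronize. This is a minimal sketch using `torch.cuda.Event`, guarded so it is a no-op on machines without a CUDA device; events record on-device timestamps, so only the recorded work is measured.

```python
import torch

elapsed_ms = None
if torch.cuda.is_available():
    a = torch.randn(1024, 1024, device="cuda")

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()        # timestamp recorded on the stream, not the CPU
    b = a @ a
    end.record()
    end.synchronize()     # wait only until the recorded work has finished

    elapsed_ms = start.elapsed_time(end)  # milliseconds between the events
```

For serious measurements, `torch.utils.benchmark.Timer` wraps warmup, synchronization, and repetition for you; the event pattern above is what such tools build on.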

The next post moves into C++ extensions, where these runtime assumptions become more visible.