PyTorch Internals 08 - CUDA Streams, Events, and Asynchronous Execution
Many PyTorch CUDA operations are asynchronous, so timing, synchronization, and dependencies must be reasoned about explicitly
The purpose of internals knowledge is to make a performance trace interpretable enough that you can actually change it
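The deck's point that timing must be reasoned about explicitly can be made concrete with a small sketch: because kernel launches return to the host immediately, wrapping them in wall-clock timers measures only launch overhead, so device-side work is timed with CUDA events recorded into the stream. This is a minimal illustration, not code from the post; `time_matmul` is a hypothetical helper name, and the sizes and iteration count are arbitrary.

```python
import torch

def time_matmul(n: int = 1024, iters: int = 10) -> float:
    """Average device-side time (ms) of an n x n matmul, via CUDA events.

    A host-side timer around `a @ b` would return almost immediately,
    because the launch is asynchronous; events are recorded into the
    current stream and measure when the kernels actually ran.
    """
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    torch.cuda.synchronize()   # drain any pending work first
    start.record()             # enqueued on the current stream
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()   # block until `end` has executed on device
    return start.elapsed_time(end) / iters

# Guarded so the sketch is importable on CPU-only machines.
if torch.cuda.is_available():
    print(f"avg matmul time: {time_matmul():.3f} ms")
```

Note that `elapsed_time` is only valid after both events have completed, which is why the final `torch.cuda.synchronize()` (or `end.synchronize()`) is required before reading the result.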