The point of this series

After using PyTorch for a while, you eventually run into questions that the surface API does not answer well.

  • why is this operator slower than expected?
  • when does a view silently turn into a copy?
  • how do I connect a custom CUDA kernel to PyTorch properly?
  • where do autograd and distributed runtime actually meet?

Answering those questions requires a working model of PyTorch's internals, not just familiarity with its API.
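The view-versus-copy question, for example, is already visible from public tensor APIs. A minimal sketch (transpose is one common trigger; the variable names are illustrative):

```python
import torch

x = torch.arange(6).reshape(2, 3)
t = x.t()  # transpose is a view: same storage, permuted strides
assert t.data_ptr() == x.data_ptr()
assert not t.is_contiguous()

# reshape cannot express this non-contiguous layout as a view,
# so it silently materializes a copy
flat = t.reshape(6)
assert flat.data_ptr() != x.data_ptr()
```

The same call, `reshape`, returns a view in one case and a copy in another; noticing the difference requires looking below the surface API.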

Why PyTorch matters at this layer

For a GPU kernel engineer or distributed training engineer, PyTorch is not just a model framework. It is the point where tensor layout, dispatcher logic, autograd, extensions, and compiler paths meet.

That means a fast kernel is not enough by itself. To become a usable operator, it has to:

  • accept PyTorch tensors correctly
  • validate dtype, device, and layout assumptions
  • fit into autograd when gradients are needed
  • behave naturally inside real training code

This series is about those boundaries.
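As a rough illustration of the checklist above, here is the kind of Python-side validation an operator wrapper often performs before handing data to a kernel. The wrapper `fast_relu` and its checks are hypothetical, and `torch.relu` stands in for a real custom kernel:

```python
import torch

def fast_relu(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical wrapper: make explicit the assumptions a kernel bakes in.
    if x.dtype not in (torch.float16, torch.float32):
        raise TypeError(f"fast_relu: unsupported dtype {x.dtype}")
    if not x.is_contiguous():
        x = x.contiguous()  # hidden copy: the kernel assumes a dense layout
    # The real kernel launch would go here; torch.relu is a stand-in.
    return torch.relu(x)

y = fast_relu(torch.tensor([-1.0, 0.5]))
```

Because this sketch stays on top of existing PyTorch ops, autograd still works; a kernel called through a C++ extension would additionally need its own autograd integration, which later posts cover.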

The structure of the track

The posts will move through:

  1. tensor storage, shape, and stride
  2. contiguous layout and hidden copies
  3. dispatcher and operator registration
  4. autograd graph and engine behavior
  5. custom autograd functions
  6. allocator, stream, and async execution
  7. C++ and CUDA extensions
  8. custom operators and fusion
  9. profiling, torch.compile, and Triton
  10. where all of this meets distributed training

The next post starts with the tensor itself, because storage, shape, and stride are the foundation for almost every later topic.
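As a small taste of that foundation, storage, shape, and stride are all inspectable from Python today (a sketch using only public tensor APIs; `untyped_storage` requires a recent PyTorch):

```python
import torch

x = torch.zeros(4, 5)
print(x.stride())  # (5, 1): row-major, step 5 elements to move one row

col = x[:, 1]      # a column is a view: same storage, new stride and offset
print(col.stride())          # (5,)
print(col.storage_offset())  # 1

# data_ptr reflects the offset, but the underlying storage is shared
assert col.data_ptr() != x.data_ptr()
assert col.untyped_storage().data_ptr() == x.untyped_storage().data_ptr()
```

These three attributes explain most of the view, copy, and contiguity behavior the later posts build on.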