Who This Is For

ML engineers and systems-minded developers who already use PyTorch but want to understand what happens below the Python API.

Prerequisites

Basic PyTorch training experience, Python proficiency, and some familiarity with tensors and backpropagation.

What You'll Get

  • Understand tensor storage, autograd flow, and custom operator boundaries
  • Know how PyTorch and custom CUDA kernels are wired together
  • Be able to reason about fused operators and extension design with less guesswork
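As a taste of the storage, size, and stride material, here is a minimal sketch (illustrative values, assuming a recent PyTorch) of how a transpose is a view over the same storage and how `.contiguous()` can introduce a hidden copy:

```python
import torch

x = torch.arange(12).reshape(3, 4)  # contiguous 3x4 tensor
v = x.t()                           # transpose is a view: same storage, swapped strides

assert v.data_ptr() == x.data_ptr()  # no data was copied
print(x.stride())          # (4, 1)
print(v.stride())          # (1, 4)
print(v.is_contiguous())   # False

c = v.contiguous()                   # materializes a row-major copy
print(c.data_ptr() == v.data_ptr())  # False: new storage, a "hidden copy"
```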

All Posts

  1. PyTorch Internals 01 - Why You Need to Understand the Internals

    To reason well about performance, custom operators, and distributed runtime behavior, PyTorch has to be understood as a runtime, not just a Python library

  2. PyTorch Internals 02 - Tensors Run on Top of Storage, Size, and Stride

    If you think of a tensor only as an n-dimensional array, you will misunderstand views, layouts, and hidden copies

  3. PyTorch Internals 03 - Contiguous Layout, Memory Format, and Hidden Copies

    Layout affects both operator selection and performance, and sometimes the most expensive thing in a path is an invisible copy

  4. PyTorch Internals 04 - What the Dispatcher and Operator Registry Actually Do

    A single operator name in PyTorch may map to many implementations, and the dispatcher is the runtime layer that decides which one runs

  5. PyTorch Internals 05 - How the Autograd Graph and Engine Work

    Autograd is not just automatic differentiation; it is a graph-construction and backward-execution runtime

  6. PyTorch Internals 06 - When and How to Use a Custom Autograd Function

    Custom autograd functions are a practical place to define forward-backward contracts before dropping to lower-level extensions

  7. PyTorch Internals 07 - Tensor Lifetime, the CUDA Caching Allocator, and Memory Reuse

    PyTorch GPU memory behavior is shaped by a caching allocator, so observed memory usage is not just a story about current tensor objects

  8. PyTorch Internals 08 - CUDA Streams, Events, and Asynchronous Execution

    Many PyTorch CUDA operations are asynchronous, so timing, synchronization, and dependencies must be reasoned about explicitly

  9. PyTorch Internals 09 - The Basic Path of a C++ Extension

    A C++ extension is the first practical bridge between user-defined logic and the PyTorch runtime

  10. PyTorch Internals 10 - Connecting a Custom CUDA Kernel Through an Extension

    A CUDA kernel becomes a real PyTorch operator only when tensor contracts, runtime semantics, and integration details are handled correctly

  11. PyTorch Internals 11 - Operator Schema, Dispatch Keys, and Meta Functions

    A custom operator is not complete until its schema, dispatch behavior, and meta-level shape logic are defined clearly

  12. PyTorch Internals 12 - Backward Implementation Patterns and Saved-State Strategy

    Backward design is really a question about what to save, what to recompute, and how to preserve correct semantics

  13. PyTorch Internals 13 - When a Fused Operator Is Actually Worth It

    Fusion is valuable when it reduces memory traffic and intermediate materialization, not just when it reduces the number of visible ops

  14. PyTorch Internals 14 - AMP, Autocast, and Numerical Stability

    A production-quality custom operator has to behave correctly under mixed precision, not just benchmark well in isolation

  15. PyTorch Internals 15 - Reading Operator Bottlenecks with PyTorch Profiling

    The purpose of internals knowledge is to make a performance trace interpretable enough that you can actually change it

  16. PyTorch Internals 16 - The Big Picture of FX, torch.compile, and Inductor

    PyTorch is no longer only an eager framework; compiler paths are now an important part of its optimization story

  17. PyTorch Internals 17 - What Role Triton Plays Inside the PyTorch Ecosystem

    Triton is not just a convenient kernel language; it is part of the modern PyTorch kernel and compilation story

  18. PyTorch Internals 18 - Where Autograd Meets Distributed Runtime

    DDP and FSDP are not external magic; they depend directly on autograd timing and tensor-state management inside the runtime

  19. PyTorch Internals 19 - Extension Packaging, Testing, and ABI Stability

    A custom operator becomes real production code only when packaging, testing, and compatibility concerns are handled properly

  20. PyTorch Internals 20 - A Practical Path from Internals Knowledge to Real Engineering Work

    The goal of studying PyTorch internals is not trivia, but the ability to connect custom operators, kernel work, profiling, and distributed runtime behavior
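To make the thread running from the autograd posts through backward design concrete, here is a minimal custom autograd function sketch. It is illustrative only (the `Square` operation is a toy, not code from any post), but it shows the forward-backward contract and the saved-state decision the series discusses:

```python
import torch

class Square(torch.autograd.Function):
    """Toy operator y = x * x with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        # The saved-state decision: keep x around because backward needs it.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x  # chain rule: dy/dx = 2x

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)
y.backward()
print(x.grad)  # tensor([6.])
```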