Distributed LLM Training 04 - What PyTorch DDP Actually Does Internally
DDP is not just a wrapper around your model; it is a runtime that coordinates autograd hooks, gradient buckets, and synchronization timing
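A minimal sketch of the gradient-bucketing idea, in plain Python rather than torch internals: DDP groups parameters into fixed-capacity buckets, roughly in reverse registration order (since gradients tend to become ready in reverse order during backward), so one all-reduce fires per bucket instead of per tensor. The cap mirrors DDP's default `bucket_cap_mb=25`; the helper name is illustrative, not a real PyTorch API.

```python
# Illustrative sketch of DDP-style gradient bucketing (not PyTorch's code).
BUCKET_CAP_BYTES = 25 * 1024 * 1024  # mirrors DDP's default bucket_cap_mb=25

def build_buckets(param_sizes_bytes, cap=BUCKET_CAP_BYTES):
    """Greedily pack parameter indices into buckets, in reverse order,
    because backward produces gradients roughly last-parameter-first."""
    buckets, current, current_size = [], [], 0
    for idx, size in reversed(list(enumerate(param_sizes_bytes))):
        if current and current_size + size > cap:
            buckets.append(current)          # bucket full: start a new one
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        buckets.append(current)
    return buckets

# Example: three 12 MB parameters and one 4 MB parameter
sizes = [12 * 2**20, 12 * 2**20, 12 * 2**20, 4 * 2**20]
print(build_buckets(sizes))  # → [[3, 2], [1, 0]]
```

Each inner list is one all-reduce launch; the reverse ordering is what lets communication of early buckets overlap with the rest of backward.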
To reason well about performance, custom operators, and distributed runtime behavior, you have to understand PyTorch as a runtime, not just a Python library
Autograd is not just automatic differentiation; it is a graph-construction and backward-execution runtime
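The "graph-construction plus backward-execution" claim can be made concrete with a toy reverse-mode autograd in pure Python (this is not PyTorch's implementation, just the shape of the mechanism): forward ops record nodes with local gradients, and `backward()` walks the recorded graph in reverse, much as autograd replays `grad_fn`s.

```python
# Toy reverse-mode autograd: the graph is built during forward,
# then executed in reverse by backward(). Illustrative only.
class Var:
    def __init__(self, value):
        self.value, self.parents, self.grad = value, (), 0.0

    def __mul__(self, other):
        out = Var(self.value * other.value)
        # Each parent entry pairs a node with its local gradient.
        out.parents = ((self, other.value), (other, self.value))
        return out

    def __add__(self, other):
        out = Var(self.value + other.value)
        out.parents = ((self, 1.0), (other, 1.0))
        return out

    def backward(self, upstream=1.0):
        self.grad += upstream                 # accumulate, like .grad
        for parent, local_grad in self.parents:
            parent.backward(upstream * local_grad)

x, y = Var(2.0), Var(3.0)
z = x * y + x          # graph is recorded here, during the forward pass
z.backward()
print(x.grad, y.grad)  # → 4.0 2.0
```

Note that nothing differentiates anything symbolically: forward execution leaves behind a graph, and backward is just a second program that runs over it.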
Custom autograd functions are a practical place to define forward-backward contracts before dropping to lower-level extensions
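The forward-backward contract looks like the following, here as a pure-Python mimic of the `torch.autograd.Function` shape (`forward(ctx, ...)`, `ctx.save_for_backward`, `backward(ctx, grad_output)`); the `Ctx` class is a stand-in for illustration, not torch's context object.

```python
import math

# Pure-Python mimic of the torch.autograd.Function contract. Illustrative.
class Ctx:
    def __init__(self):
        self.saved = ()
    def save_for_backward(self, *values):
        self.saved = values

class Exp:
    """Forward computes e**x; backward uses the saved result: d/dx e**x = e**x."""
    @staticmethod
    def forward(ctx, x):
        out = math.exp(x)
        ctx.save_for_backward(out)   # saving the output here is cheaper
        return out                   # than recomputing exp in backward

    @staticmethod
    def backward(ctx, grad_output):
        (out,) = ctx.saved
        return grad_output * out

ctx = Ctx()
y = Exp.forward(ctx, 1.0)
gx = Exp.backward(ctx, 1.0)
print(y == gx)  # → True, since the derivative of e**x is e**x
```

The contract is the point: forward decides what state backward will need, and backward must produce gradients consistent with forward, with no other channel between them.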
Backward design is really a question of what to save, what to recompute, and how to preserve correct semantics
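The save-versus-recompute trade-off can be shown on a one-operation example (a hypothetical illustration, using y = sin(x)): one backward stores cos(x) at forward time (more memory, less backward compute), the other stores only x and recomputes cos(x) during backward, which is the essence of activation checkpointing. Correctness requires both to yield identical gradients.

```python
import math

# Two backward strategies for y = sin(x). Illustrative sketch.
def forward_save(x):
    return math.sin(x), {"cos_x": math.cos(x)}   # save the activation

def backward_save(saved, grad_out):
    return grad_out * saved["cos_x"]

def forward_recompute(x):
    return math.sin(x), {"x": x}                 # save the input only

def backward_recompute(saved, grad_out):
    return grad_out * math.cos(saved["x"])       # recompute in backward

x = 0.5
_, s1 = forward_save(x)
_, s2 = forward_recompute(x)
print(backward_save(s1, 1.0) == backward_recompute(s2, 1.0))  # → True
```

The semantics are fixed; only the memory/compute split between forward and backward changes, which is exactly the knob that checkpointing and recomputation strategies turn.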
DDP and FSDP are not external magic; they depend directly on autograd timing and tensor-state management inside the runtime
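The dependence on autograd timing can be sketched as follows (all names illustrative, not real PyTorch APIs): a per-parameter hook fires as each gradient becomes ready during backward, and a communication call, mocked here, launches the moment a whole bucket is ready, so communication overlaps with the remainder of backward rather than waiting for it to finish.

```python
# Hedged sketch of the DDP hook-and-bucket pattern. Illustrative only.
class BucketReducer:
    def __init__(self, buckets):
        self.pending = [set(b) for b in buckets]  # params still awaited per bucket
        self.launched = []                        # order of mock all-reduce launches

    def on_grad_ready(self, param_idx):
        """The 'autograd hook': called once per parameter during backward."""
        for i, waiting in enumerate(self.pending):
            if param_idx in waiting:
                waiting.discard(param_idx)
                if not waiting:                   # bucket complete:
                    self.launched.append(i)       # launch mock async all-reduce
                return

reducer = BucketReducer(buckets=[[3, 2], [1, 0]])
for idx in [3, 2, 1, 0]:          # gradients become ready in reverse order
    reducer.on_grad_ready(idx)
print(reducer.launched)  # → [0, 1]
```

This is why hook timing matters: if backward reorders or skips a parameter (unused parameters, control flow), the bucket never fills and the collective never launches, which is the class of hang that DDP's `find_unused_parameters` option exists to handle.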