Distributed LLM Training 04 - What PyTorch DDP Actually Does Internally
DDP is not just a wrapper around your model; it is a runtime that coordinates autograd hooks, gradient buckets, and synchronization timing
FSDP keeps parameters sharded and only gathers them when needed, making it a direct answer to parameter-replication pressure
To reason well about performance, custom operators, and distributed runtime behavior, you have to understand PyTorch as a runtime, not just a Python library
If you think of a tensor only as an n-dimensional array, you will misunderstand views, layouts, and hidden copies
Layout affects both operator selection and performance, and sometimes the most expensive thing in a path is an invisible copy
A single operator name in PyTorch may map to many implementations, and the dispatcher is the runtime layer that decides which one runs
Autograd is not just automatic differentiation; it is a graph-construction and backward-execution runtime
Custom autograd functions are a practical place to define forward-backward contracts before dropping to lower-level extensions
PyTorch GPU memory behavior is shaped by a caching allocator, so observed memory usage is not just a story about current tensor objects