Distributed LLM Training 04 - What PyTorch DDP Actually Does Internally
DDP is not just a wrapper around your model; it is a runtime that coordinates autograd hooks, gradient buckets, and synchronization timing
FSDP keeps parameters sharded and only gathers them when needed, making it a direct answer to parameter-replication pressure
To reason well about performance, custom operators, and distributed runtime behavior, you have to understand PyTorch as a runtime, not just a Python library
If you think of a tensor only as an n-dimensional array, you will misunderstand views, layouts, and hidden copies
Layout affects both operator selection and performance, and sometimes the most expensive thing in a path is an invisible copy
A single operator name in PyTorch may map to many implementations, and the dispatcher is the runtime layer that decides which one runs
Autograd is not just automatic differentiation; it is a graph-construction and backward-execution runtime
Custom autograd functions are a practical place to define forward-backward contracts before dropping to lower-level extensions
PyTorch GPU memory behavior is shaped by a caching allocator, so observed memory usage is not just a story about current tensor objects