PyTorch Internals 10 - Connecting a Custom CUDA Kernel Through an Extension
A CUDA kernel becomes a real PyTorch operator only when tensor contracts, runtime semantics, and integration details are handled correctly
All posts in the Lectures series:
A custom operator is not complete until its schema, dispatch behavior, and meta-level shape logic are defined clearly
Backward design is really a question about what to save, what to recompute, and how to preserve correct semantics
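The save-versus-recompute trade-off can be illustrated with a hand-written autograd.Function. SquareReLU here is a made-up example, not an op from the lecture: it saves only the input and recomputes the activation in backward.

```python
import torch

class SquareReLU(torch.autograd.Function):
    """Illustrative op computing relu(x)**2 (hypothetical example)."""

    @staticmethod
    def forward(ctx, x):
        # Save only the input; the relu output is cheap to recompute,
        # so storing it as well would waste activation memory.
        ctx.save_for_backward(x)
        return torch.relu(x) ** 2

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Recompute relu(x) in backward: d/dx relu(x)^2 = 2 * relu(x) for x > 0.
        return grad_out * 2 * torch.relu(x)

x = torch.tensor([-1.0, 2.0], requires_grad=True)
SquareReLU.apply(x).sum().backward()
# x.grad is [0., 4.]: zero where the input was negative, 2*x elsewhere
```

Choosing what goes into save_for_backward is exactly the memory-versus-FLOPs decision the point above describes.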
Fusion is valuable when it reduces memory traffic and intermediate materialization, not just when it reduces the number of visible ops
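A concrete case of that distinction is a bias-add followed by GELU. The function below is a hypothetical illustration: eagerly it materializes an intermediate tensor, while a fused kernel would not.

```python
import torch

def bias_gelu(x, b):
    # Eager execution materializes the intermediate (x + b) in memory and
    # reads it back for the GELU: extra global-memory round trips.
    return torch.nn.functional.gelu(x + b)

# torch.compile can fuse the add and the GELU into one generated kernel,
# so x and b are each read once and only the final result is written.
bias_gelu_fused = torch.compile(bias_gelu)
```

The op count barely changes, but the intermediate tensor (and its reads and writes) disappears, which is where the actual win comes from.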
A production-quality custom operator has to behave correctly under mixed precision, not just benchmark well in isolation
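One way to see the mixed-precision issue is that under autocast an operator can receive dtypes it never sees in isolated benchmarks. A minimal sketch, using CPU autocast with bfloat16 so it runs anywhere:

```python
import torch

model = torch.nn.Linear(4, 4)   # parameters stay float32
x = torch.randn(2, 4)           # input is float32

# Under autocast, eligible ops (like the matmul inside Linear) run in a
# lower-precision dtype; a custom operator will see these mixed dtypes
# unless it registers its own autocast behavior.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)  # bfloat16, even though weights and input are float32
```

A custom op that silently assumes float32 inputs can break, or quietly upcast and lose the performance benefit, in exactly this situation.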
The purpose of internals knowledge is to make a performance trace interpretable enough that you can actually change it
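As a small example of what "interpretable" means in practice, torch.profiler reports dispatcher-level op names; internals knowledge is what maps each row back to a dispatch path and kernel you can actually change:

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(128, 128)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(3):
        torch.mm(x, x)

# Rows are named at the dispatcher level (e.g. "aten::mm"); reading the
# table usefully means knowing which kernel sits behind each name.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```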
PyTorch is no longer only an eager framework; compiler paths are now an important part of its optimization story
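The compiler path starts with graph capture. torch.compile does this via Dynamo and lowers through Inductor; as a lighter-weight stand-in that shows the same idea without a compiler toolchain, torch.fx records the ops of a function into an explicit graph:

```python
import torch

def f(x):
    return torch.relu(x) * 2 + 1

# torch.compile(f) would capture and optimize this at call time; here,
# torch.fx makes the captured graph visible for inspection.
g = torch.fx.symbolic_trace(f)
print(g.graph)  # placeholder -> relu -> mul -> add -> output
```

Once the program exists as a graph rather than opaque Python, fusion and other whole-program optimizations become possible.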
Triton is not just a convenient kernel language; it is part of the modern PyTorch kernel and compilation story
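A minimal Triton kernel shows why it fits this story: the tiling, masking, and launch grid that a CUDA extension handles by hand are expressed directly in Python. This vector-add sketch is a generic illustration (the launch is guarded because it needs a GPU):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized tile, masked at the tail.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

if torch.cuda.is_available():
    x = torch.randn(1024, device="cuda")
    y = torch.randn_like(x)
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 256),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK=256)
```

This is also the language torch.compile's Inductor backend emits for GPU kernels, which is what ties it into the compilation story.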
DDP and FSDP are not external magic; they depend directly on autograd timing and tensor-state management inside the runtime
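The autograd-timing dependence is visible even in a single-process sketch: DDP attaches hooks to every parameter, and as backward fills each gradient bucket an all-reduce is launched, overlapping communication with the rest of the backward pass. A minimal illustration with a one-rank gloo "world" (the address and port are arbitrary):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process world with the gloo (CPU) backend, just to show the mechanism.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 4))
# The gradient all-reduces fire from autograd hooks during this backward
# call, which is why DDP's correctness rests on autograd timing.
model(torch.randn(2, 4)).sum().backward()

grads_ready = all(p.grad is not None for p in model.parameters())
dist.destroy_process_group()
```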