Technical notes by Jae

Systems writing for engineers who want the deeper model.

Long-form posts on platform engineering, Linux, compilers, MLOps, and computer architecture, written to help you build stronger intuition instead of just memorizing terms.

Start here: featured series

A few strong entry points if you are new here.

Browse recent posts (119 posts)

Fresh writing, updates, and ongoing series entries.

  1. Distributed LLM Training 20 - A Practical Order for Designing an LLM Training Stack
  2. Distributed LLM Training 19 - How to Read Megatron-LM and DeepSpeed Structurally
  3. Distributed LLM Training 18 - Deadlocks, Timeouts, and OOMs: Debugging Distributed Training
  4. Distributed LLM Training 17 - Why Checkpointing, Resume, and Fault Tolerance Matter So Much
  5. Distributed LLM Training 16 - How Communication Overlap Hides Step Time
  6. Distributed LLM Training 15 - How FSDP Differs from DDP and When It Helps
  7. Distributed LLM Training 14 - What ZeRO Stage 1, 2, and 3 Each Remove
  8. Distributed LLM Training 13 - Activation Checkpointing and the Cost of Recomputation
  9. Distributed LLM Training 12 - GPipe, 1F1B, and Interleaving: Choosing a Pipeline Schedule
  10. Distributed LLM Training 11 - Pipeline Parallel Basics and How to Think About Stage Splits
  11. Distributed LLM Training 10 - Sequence Parallelism and the Cost of Long Context
  12. Distributed LLM Training 09 - Where Tensor Parallelism Actually Lives Inside a Transformer
  13. Distributed LLM Training 08 - Tensor Parallel Basics: Splitting Computation Inside the Model
  14. Distributed LLM Training 07 - NCCL and Topology: Why the Same GPU Count Can Behave Very Differently
  15. Distributed LLM Training 06 - Where LLM Training Memory Actually Goes
  16. Distributed LLM Training 05 - Global Batch Size, Gradient Accumulation, and Learning Rate Scaling
  17. Distributed LLM Training 04 - What PyTorch DDP Actually Does Internally
  18. Distributed LLM Training 03 - All-Reduce, Ring, and How to Read Communication Cost
  19. Distributed LLM Training 02 - The Real Cost of Synchronous SGD and Data Parallelism
  20. Distributed LLM Training 01 - Why LLM Training Becomes a Distributed Systems Problem

  1. PyTorch Internals 20 - A Practical Path from Internals Knowledge to Real Engineering Work
  2. PyTorch Internals 19 - Extension Packaging, Testing, and ABI Stability
  3. PyTorch Internals 18 - Where Autograd Meets Distributed Runtime
  4. PyTorch Internals 17 - What Role Triton Plays Inside the PyTorch Ecosystem
  5. PyTorch Internals 16 - The Big Picture of FX, torch.compile, and Inductor
  6. PyTorch Internals 15 - Reading Operator Bottlenecks with PyTorch Profiling
  7. PyTorch Internals 14 - AMP, Autocast, and Numerical Stability
  8. PyTorch Internals 13 - When a Fused Operator Is Actually Worth It
  9. PyTorch Internals 12 - Backward Implementation Patterns and Saved-State Strategy
  10. PyTorch Internals 11 - Operator Schema, Dispatch Keys, and Meta Functions
  11. PyTorch Internals 10 - Connecting a Custom CUDA Kernel Through an Extension
  12. PyTorch Internals 09 - The Basic Path of a C++ Extension
  13. PyTorch Internals 08 - CUDA Streams, Events, and Asynchronous Execution
  14. PyTorch Internals 07 - Tensor Lifetime, the CUDA Caching Allocator, and Memory Reuse
  15. PyTorch Internals 06 - When and How to Use a Custom Autograd Function
  16. PyTorch Internals 05 - How the Autograd Graph and Engine Work
  17. PyTorch Internals 04 - What the Dispatcher and Operator Registry Actually Do
  18. PyTorch Internals 03 - Contiguous Layout, Memory Format, and Hidden Copies
  19. PyTorch Internals 02 - Tensors Run on Top of Storage, Size, and Stride
  20. PyTorch Internals 01 - Why You Need to Understand the Internals