Distributed LLM Training 15 - How FSDP Differs from DDP and When It Helps
FSDP keeps parameters sharded and only gathers them when needed, making it a direct answer to parameter-replication pressure
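The core idea above can be sketched in plain Python: each rank keeps only its shard of a parameter at rest, all-gathers the full tensor just before a layer runs, and frees the gathered copy immediately afterward. This is a minimal simulation of the memory behavior, not real FSDP code; the world size, parameter size, and helper names are illustrative assumptions.

```python
import numpy as np

WORLD_SIZE = 4  # hypothetical number of ranks

def shard(param, world_size):
    """Split a flat parameter into equal per-rank shards (FSDP's resting state)."""
    return np.array_split(param, world_size)

def all_gather(shards):
    """Simulate all-gather: reconstruct the full parameter from every rank's shard."""
    return np.concatenate(shards)

# A "layer" with one million float32 parameters, flattened.
full_param = np.ones(1_000_000, dtype=np.float32)
shards = shard(full_param, WORLD_SIZE)

# DDP: every rank holds the full parameter at all times.
ddp_resident = full_param.nbytes            # 4 MB per rank
# FSDP: each rank holds only its shard between layer executions.
fsdp_resident = shards[0].nbytes            # ~1 MB per rank

# Forward pass as seen from one rank: gather, compute, drop the full copy.
gathered = all_gather(shards)
activation = gathered.sum()   # stand-in for the layer's actual computation
del gathered                  # full parameter freed; only the local shard remains

print(ddp_resident // fsdp_resident)  # → 4: resident footprint shrinks by 1/world_size
```

The transient all-gather means FSDP trades extra communication for a resident parameter footprint of roughly 1/world_size, which is exactly when it helps: models whose replicated parameters, gradients, and optimizer state no longer fit under DDP.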
All posts in the Lectures
The goal of overlap is not to eliminate communication entirely, but to make it finish underneath useful computation
In long distributed runs, reliable recovery is as important as raw throughput
Debugging distributed training is about narrowing down which rank, which collective, and which state transition went wrong
Frameworks are easier to understand when you read them as bundles of parallelization and state-management choices rather than as giant feature lists
Distributed training architecture is not about collecting fashionable techniques, but about choosing the smallest structure that matches the current bottleneck
The background knowledge that makes the GPU Systems series much easier to follow
A practical study order from GPU architecture to CUDA, Triton, and kernel optimization
What threads, warps, blocks, and grids mean in actual GPU execution