Distributed LLM Training 06 - Where LLM Training Memory Actually Goes
Sizing training runs by parameter count alone leads to bad decisions; training memory is the sum of parameters, gradients, optimizer state, and activations
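To make the claim concrete, here is a minimal sketch of the parameter-proportional ("static") memory for mixed-precision Adam training. The byte counts per parameter (2 B fp16 weights, 2 B fp16 gradients, 12 B fp32 optimizer state) are the commonly cited figures, e.g. from the ZeRO paper; the function name and the 7B example are illustrative, not from the post, and activations are deliberately excluded because they scale with batch and sequence length rather than parameter count.

```python
def static_memory_gib(n_params: float) -> dict:
    """GiB of parameter-proportional training state for mixed-precision Adam.

    Assumed per-parameter byte counts (framework-dependent):
      fp16 weights (2) + fp16 grads (2)
      + fp32 master weights (4) + fp32 Adam momentum (4) + fp32 Adam variance (4)
    """
    bytes_per_param = {
        "fp16 parameters": 2,
        "fp16 gradients": 2,
        "fp32 master weights": 4,
        "fp32 Adam momentum": 4,
        "fp32 Adam variance": 4,
    }
    gib = {name: n_params * b / 2**30 for name, b in bytes_per_param.items()}
    gib["total"] = sum(gib.values())
    return gib

mem = static_memory_gib(7e9)  # a hypothetical 7B-parameter model
print(f"total static state: {mem['total']:.1f} GiB")  # ~104 GiB before activations
```

Note that the 2-byte inference footprint of a 7B model (about 13 GiB) understates its training footprint by roughly 8x even before any activations are counted, which is exactly why parameter size alone is a poor planning metric.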
All posts in the Lectures series
Sizing training runs by parameter count alone leads to bad decisions; training memory is the sum of parameters, gradients, optimizer state, and activations
In distributed training, performance is often shaped more by how GPUs are connected than by the raw number of GPUs
Once the model itself is too large for one device, data parallelism is no longer enough and layer-internal computation has to be split
Tensor parallelism becomes real when you map it onto QKV projections, attention output paths, and the two large MLP projections inside a transformer block
As context length grows, activation memory and communication patterns change again, and sequence-oriented partitioning starts to matter
Once the model is split by depth into stages, idle time and stage imbalance become just as important as memory savings
Pipeline efficiency is shaped heavily by scheduling, because bubble size, activation memory, and implementation complexity all depend on it
Saving memory by recomputing activations is not a minor option; it is often a central design choice in large-scale training
ZeRO is best understood as a staged system for removing different forms of replicated training state
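The staged view of ZeRO in the last entry can be sketched numerically: each stage shards one more form of replicated state across the data-parallel group. The function below is an illustrative estimate, assuming the same mixed-precision Adam byte counts as above (2 B fp16 params, 2 B fp16 grads, 12 B fp32 optimizer state) and ignoring activations and communication buffers; the function name and example sizes are hypothetical.

```python
def zero_per_gpu_gib(n_params: float, n_gpus: int, stage: int) -> float:
    """Per-GPU GiB of parameter-proportional state under ZeRO stages 0-3.

    stage 0: everything replicated on every GPU
    stage 1: optimizer state sharded across the group
    stage 2: gradients also sharded
    stage 3: parameters also sharded
    """
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        params /= n_gpus
    return n_params * (params + grads + optim) / 2**30

# Hypothetical 7B model on 8 GPUs: each stage removes another replica.
for s in range(4):
    print(f"ZeRO-{s}: {zero_per_gpu_gib(7e9, 8, s):.1f} GiB per GPU")
```

The pattern to notice is that ZeRO-1 already removes the largest replicated component (the 12 B/param optimizer state), while ZeRO-3 drives per-GPU static state toward the fully sharded limit of 16 B/param divided by the group size.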