Jae's Tech Blog

Posts tagged "distributed-training"

February 2, 2026

Distributed LLM Training 10 - Sequence Parallelism and the Cost of Long Context

As context length grows, activation memory and communication patterns change again, and sequence-oriented partitioning starts to matter

Lectures
February 5, 2026

Distributed LLM Training 11 - Pipeline Parallel Basics and How to Think About Stage Splits

Once the model is split by depth into stages, idle time and stage imbalance become just as important as memory savings

Lectures
February 8, 2026

Distributed LLM Training 12 - GPipe, 1F1B, and Interleaving: Choosing a Pipeline Schedule

Pipeline efficiency is shaped heavily by scheduling, because bubble size, activation memory, and implementation complexity all depend on it

Lectures
February 11, 2026

Distributed LLM Training 13 - Activation Checkpointing and the Cost of Recomputation

Saving memory by recomputing activations is not a minor option; it is often a central design choice in large-scale training

Lectures
February 14, 2026

Distributed LLM Training 14 - What ZeRO Stage 1, 2, and 3 Each Remove

ZeRO is best understood as a staged system for removing different forms of replicated training state

Lectures
February 17, 2026

Distributed LLM Training 15 - How FSDP Differs from DDP and When It Helps

FSDP keeps parameters sharded and only gathers them when needed, making it a direct answer to parameter-replication pressure

Lectures
February 20, 2026

Distributed LLM Training 16 - How Communication Overlap Hides Step Time

The goal of overlap is not to eliminate communication entirely, but to make it finish underneath useful computation

Lectures
February 23, 2026

Distributed LLM Training 17 - Why Checkpointing, Resume, and Fault Tolerance Matter So Much

In long distributed runs, reliable recovery is as important as raw throughput

Lectures
February 26, 2026

Distributed LLM Training 18 - Deadlocks, Timeouts, and OOMs: Debugging Distributed Training

Debugging distributed training is about narrowing down which rank, which collective, and which state transition went wrong

Lectures
โ† Previous
1 2 3
Next โ†’

© 2025 Jae · Notes on systems, software, and building things carefully.
