Distributed LLM Training 10 - Sequence Parallelism and the Cost of Long Context
As context length grows, activation memory and communication patterns change again, and sequence-oriented partitioning starts to matter
Once the model is split by depth into stages, idle time and stage imbalance become just as important as memory savings
Pipeline efficiency is shaped heavily by scheduling, because bubble size, activation memory, and implementation complexity all depend on it
Saving memory by recomputing activations is not a minor option; it is often a central design choice in large-scale training
ZeRO is best understood as a staged system for removing different forms of replicated training state
FSDP keeps parameters sharded and only gathers them when needed, making it a direct answer to parameter-replication pressure
The goal of overlap is not to eliminate communication entirely, but to make it finish underneath useful computation
In long distributed runs, reliable recovery is as important as raw throughput
Debugging distributed training is about narrowing down which rank, which collective, and which state transition went wrong