Distributed LLM Training 16 - How Communication Overlap Hides Step Time
The goal of overlap is not to eliminate communication entirely, but to make it finish underneath useful computation
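A toy timing model (my sketch, not from the post; all numbers are made up) makes this concrete: only the fraction of communication that runs concurrently with compute disappears from the step, and the rest stays exposed on the critical path.

```python
# Toy model: step time when some fraction of communication can run
# concurrently with compute (e.g. gradient all-reduce under backward).
def step_time_ms(compute_ms, comm_ms, overlap_fraction):
    """Overlapped communication is hidden under compute; the remainder
    is exposed and added to the critical path."""
    hidden = min(comm_ms * overlap_fraction, compute_ms)
    exposed = comm_ms - hidden
    return compute_ms + exposed

# With no overlap, all 40 ms of communication is exposed (140 ms step);
# with 75% overlap, only 10 ms remains on the critical path (110 ms).
print(step_time_ms(100.0, 40.0, 0.0))   # 140.0
print(step_time_ms(100.0, 40.0, 0.75))  # 110.0
```

Note that even perfect overlap only brings the step down to the compute time: overlap hides communication, it never makes it free.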
Understanding occupancy as a latency-hiding concept instead of just a percentage
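The latency-hiding view can be sketched with Little's law (a toy estimate with illustrative numbers, not vendor specs): the concurrency you need is latency times issue rate, which is why a kernel can be fast at modest occupancy or slow at high occupancy.

```python
import math

# Toy latency-hiding estimate: by Little's law, the in-flight work
# needed to hide a latency is latency x throughput -- a concurrency
# requirement, not an occupancy percentage.
def work_to_hide(latency_cycles, issue_rate_per_cycle):
    """Independent instructions that must be in flight so the scheduler
    always has something ready to issue."""
    return math.ceil(latency_cycles * issue_rate_per_cycle)

# Hiding ~400 cycles of memory latency at 1 issue/cycle needs ~400
# instructions in flight; hiding ~4-cycle ALU latency needs only ~4.
print(work_to_hide(400, 1))  # 400
print(work_to_hide(4, 1))    # 4
```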
A practical way to use profiling and roofline thinking to understand kernel bottlenecks
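The roofline calculation behind that thinking fits in a few lines (machine numbers below are illustrative assumptions, not a specific GPU): attainable throughput is capped by the lower of the compute roof and bandwidth times arithmetic intensity.

```python
# Roofline: performance is bounded by min(peak compute, bandwidth x
# arithmetic intensity), where intensity is FLOPs per byte moved.
def attainable_flops(intensity, peak_flops, peak_bw):
    return min(peak_flops, peak_bw * intensity)

# Illustrative machine: 100 TFLOP/s peak compute, 2 TB/s memory bandwidth.
peak, bw = 100e12, 2e12
# A kernel at 4 FLOP/byte is memory-bound (8 TFLOP/s attainable);
# one at 200 FLOP/byte can reach the compute roof.
print(attainable_flops(4, peak, bw) / 1e12)    # 8.0
print(attainable_flops(200, peak, bw) / 1e12)  # 100.0
```

Comparing a profiled kernel's measured intensity against the machine balance point (peak_flops / peak_bw, here 50 FLOP/byte) tells you which resource to optimize.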
Using naive matrix multiplication to see memory reuse and traffic problems clearly
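A quick traffic count (a sketch under the usual tiling assumptions, not profiled data) shows why the naive kernel is so wasteful: with no reuse, every multiply re-reads both operands from global memory, while tiling divides that traffic by the tile size.

```python
# Element loads from global memory for an n x n matmul, assuming each
# operand tile is reused across a tile x tile block of outputs.
# tile=1 models the naive kernel: every multiply re-reads both operands.
def global_loads(n, tile=1):
    return 2 * n**3 // tile

n = 1024
naive = global_loads(n)            # 2n^3 loads, zero reuse
tiled = global_loads(n, tile=32)   # each element loaded n/32 times
print(naive // tiled)  # 32: tiling cuts global traffic 32x
```

The FLOP count is identical in both cases; only the memory traffic changes, which is exactly the reuse problem the naive kernel makes visible.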
How tensor cores change performance in compute-heavy kernels and why mixed precision matters
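One reason mixed precision matters even before tensor cores enter the picture can be shown with the ideal matmul intensity formula (my sketch; sizes are arbitrary): halving bytes per element doubles arithmetic intensity, pushing more kernels toward the compute roof that tensor cores raise.

```python
# Ideal n x n matmul: 2n^3 FLOPs while moving the three n x n matrices
# once each. Halving bytes per element (fp32 -> fp16) doubles intensity.
def intensity_flops_per_byte(n, bytes_per_elem):
    return (2 * n**3) / (3 * n**2 * bytes_per_elem)

print(intensity_flops_per_byte(4096, 4))  # fp32: ~682.7 FLOP/byte
print(intensity_flops_per_byte(4096, 2))  # fp16: ~1365.3 FLOP/byte
```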
Layout affects both operator selection and performance, and sometimes the most expensive thing in a path is an invisible copy
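A toy stride model (my own, not a framework API) shows where the invisible copy comes from: a transpose can be pure metadata, a stride swap that moves zero bytes, but an operator that insists on contiguous input then forces every element through memory.

```python
# Row-major (C-contiguous) strides in bytes for a given shape.
def strides(shape, itemsize):
    s, acc = [], itemsize
    for dim in reversed(shape):
        s.append(acc)
        acc *= dim
    return tuple(reversed(s))

rows, cols = 1024, 512
print(strides((rows, cols), 4))  # (2048, 4): contiguous fp32
# A metadata transpose just swaps the strides -- zero bytes moved...
print(strides((rows, cols), 4)[::-1])  # (4, 2048)
# ...but materializing it contiguously copies every element:
print(rows * cols * 4)  # 2097152 bytes of traffic hidden in the path
```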
Fusion is valuable when it reduces memory traffic and intermediate materialization, not just when it reduces the number of visible ops
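Counting bytes for a small elementwise chain (a toy model; the exact ratio depends on the ops and dtypes) makes the point: fusion pays off because the intermediate is never written to or re-read from memory, not because one kernel launch replaced two.

```python
# Toy traffic count for y = relu(x + b) over n fp32 elements.
def unfused_bytes(n, itemsize=4):
    """add reads x and b and writes a temp; relu re-reads the temp
    and writes y: the intermediate costs a full write plus read."""
    add = 3 * n * itemsize   # read x, read b, write tmp
    relu = 2 * n * itemsize  # read tmp, write y
    return add + relu

def fused_bytes(n, itemsize=4):
    """One kernel reads x and b and writes y; the temp never exists."""
    return 3 * n * itemsize

n = 1 << 20
print(unfused_bytes(n) / fused_bytes(n))  # ~1.67x more traffic unfused
```

Both versions execute the same arithmetic; the saving is entirely the intermediate's round trip through memory, which is why fusing already compute-bound ops buys little.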
The purpose of internals knowledge is to make a performance trace interpretable enough that you can actually change it