Jae's Tech Blog

Posts tagged "performance"

February 20, 2026

Distributed LLM Training 16 - How Communication Overlap Hides Step Time

The goal of overlap is not to eliminate communication entirely, but to make it finish underneath useful computation

Lectures
February 11, 2026

GPU Systems 07 - Occupancy and Latency Hiding

Understanding occupancy as a latency-hiding concept instead of just a percentage

Lectures
February 13, 2026

GPU Systems 08 - Profiling and the Roofline View

A practical way to use profiling and roofline thinking to understand kernel bottlenecks

Lectures
February 15, 2026

GPU Systems 09 - Why Naive Matrix Multiplication Is Slow

Using naive matrix multiplication to see memory reuse and traffic problems clearly

Lectures
March 5, 2026

GPU Systems 18 - Tensor Cores and Mixed Precision

How tensor cores change performance in compute-heavy kernels and why mixed precision matters

Lectures
January 11, 2026

PyTorch Internals 03 - Contiguous Layout, Memory Format, and Hidden Copies

Layout affects both operator selection and performance, and sometimes the most expensive thing in a path is an invisible copy

Lectures
February 10, 2026

PyTorch Internals 13 - When a Fused Operator Is Actually Worth It

Fusion is valuable when it reduces memory traffic and intermediate materialization, not just when it reduces the number of visible ops

Lectures
February 16, 2026

PyTorch Internals 15 - Reading Operator Bottlenecks with PyTorch Profiling

The purpose of internals knowledge is to make a performance trace interpretable enough that you can actually change it

Lectures

© 2025 Jae · Notes on systems, software, and building things carefully.
