Distributed LLM Training 16 - How Communication Overlap Hides Step Time
The goal of overlap is not to eliminate communication entirely, but to make it finish underneath useful computation
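A toy timing model (my sketch, not from the post; all numbers are made up) makes this concrete: only the fraction of communication that runs concurrently with compute disappears from the step, and the rest stays exposed on the critical path.

```python
# Toy model: step time when some fraction of communication can run
# concurrently with compute (e.g. gradient all-reduce under backward).
def step_time_ms(compute_ms, comm_ms, overlap_fraction):
    """Overlapped communication is hidden under compute; the remainder
    is exposed and added to the critical path."""
    hidden = min(comm_ms * overlap_fraction, compute_ms)
    exposed = comm_ms - hidden
    return compute_ms + exposed

# With no overlap, all 40 ms of communication is exposed (140 ms step);
# with 75% overlap, only 10 ms remains on the critical path (110 ms).
print(step_time_ms(100.0, 40.0, 0.0))   # 140.0
print(step_time_ms(100.0, 40.0, 0.75))  # 110.0
```

Note that even perfect overlap only brings the step down to the compute time: overlap hides communication, it never makes it free.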
Understanding occupancy as a latency-hiding concept instead of just a percentage
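The latency-hiding view can be sketched with Little's law (a toy estimate with illustrative numbers, not vendor specs): the concurrency you need is latency times issue rate, which is why a kernel can be fast at modest occupancy or slow at high occupancy.

```python
import math

# Toy latency-hiding estimate: by Little's law, the in-flight work
# needed to hide a latency is latency x throughput -- a concurrency
# requirement, not an occupancy percentage.
def work_to_hide(latency_cycles, issue_rate_per_cycle):
    """Independent instructions that must be in flight so the scheduler
    always has something ready to issue."""
    return math.ceil(latency_cycles * issue_rate_per_cycle)

# Hiding ~400 cycles of memory latency at 1 issue/cycle needs ~400
# instructions in flight; hiding ~4-cycle ALU latency needs only ~4.
print(work_to_hide(400, 1))  # 400
print(work_to_hide(4, 1))    # 4
```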
A practical way to use profiling and roofline thinking to understand kernel bottlenecks
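The roofline calculation behind that thinking fits in a few lines (machine numbers below are illustrative assumptions, not a specific GPU): attainable throughput is capped by the lower of the compute roof and bandwidth times arithmetic intensity.

```python
# Roofline: performance is bounded by min(peak compute, bandwidth x
# arithmetic intensity), where intensity is FLOPs per byte moved.
def attainable_flops(intensity, peak_flops, peak_bw):
    return min(peak_flops, peak_bw * intensity)

# Illustrative machine: 100 TFLOP/s peak compute, 2 TB/s memory bandwidth.
peak, bw = 100e12, 2e12
# A kernel at 4 FLOP/byte is memory-bound (8 TFLOP/s attainable);
# one at 200 FLOP/byte can reach the compute roof.
print(attainable_flops(4, peak, bw) / 1e12)    # 8.0
print(attainable_flops(200, peak, bw) / 1e12)  # 100.0
```

Comparing a profiled kernel's measured intensity against the machine balance point (peak_flops / peak_bw, here 50 FLOP/byte) tells you which resource to optimize.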
Using naive matrix multiplication to see memory reuse and traffic problems clearly
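A quick traffic count (a sketch under the usual tiling assumptions, not profiled data) shows why the naive kernel is so wasteful: with no reuse, every multiply re-reads both operands from global memory, while tiling divides that traffic by the tile size.

```python
# Element loads from global memory for an n x n matmul, assuming each
# operand tile is reused across a tile x tile block of outputs.
# tile=1 models the naive kernel: every multiply re-reads both operands.
def global_loads(n, tile=1):
    return 2 * n**3 // tile

n = 1024
naive = global_loads(n)            # 2n^3 loads, zero reuse
tiled = global_loads(n, tile=32)   # each element loaded n/32 times
print(naive // tiled)  # 32: tiling cuts global traffic 32x
```

The FLOP count is identical in both cases; only the memory traffic changes, which is exactly the reuse problem the naive kernel makes visible.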
How tensor cores change performance in compute-heavy kernels and why mixed precision matters
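One reason mixed precision matters even before tensor cores enter the picture can be shown with the ideal matmul intensity formula (my sketch; sizes are arbitrary): halving bytes per element doubles arithmetic intensity, pushing more kernels toward the compute roof that tensor cores raise.

```python
# Ideal n x n matmul: 2n^3 FLOPs while moving the three n x n matrices
# once each. Halving bytes per element (fp32 -> fp16) doubles intensity.
def intensity_flops_per_byte(n, bytes_per_elem):
    return (2 * n**3) / (3 * n**2 * bytes_per_elem)

print(intensity_flops_per_byte(4096, 4))  # fp32: ~682.7 FLOP/byte
print(intensity_flops_per_byte(4096, 2))  # fp16: ~1365.3 FLOP/byte
```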
Layout affects both operator selection and performance, and sometimes the most expensive thing in a path is an invisible copy
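A toy stride model (my own, not a framework API) shows where the invisible copy comes from: a transpose can be pure metadata, a stride swap that moves zero bytes, but an operator that insists on contiguous input then forces every element through memory.

```python
# Row-major (C-contiguous) strides in bytes for a given shape.
def strides(shape, itemsize):
    s, acc = [], itemsize
    for dim in reversed(shape):
        s.append(acc)
        acc *= dim
    return tuple(reversed(s))

rows, cols = 1024, 512
print(strides((rows, cols), 4))  # (2048, 4): contiguous fp32
# A metadata transpose just swaps the strides -- zero bytes moved...
print(strides((rows, cols), 4)[::-1])  # (4, 2048)
# ...but materializing it contiguously copies every element:
print(rows * cols * 4)  # 2097152 bytes of traffic hidden in the path
```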
Fusion is valuable when it reduces memory traffic and intermediate materialization, not just when it reduces the number of visible ops
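Counting bytes for a small elementwise chain (a toy model; the exact ratio depends on the ops and dtypes) makes the point: fusion pays off because the intermediate is never written to or re-read from memory, not because one kernel launch replaced two.

```python
# Toy traffic count for y = relu(x + b) over n fp32 elements.
def unfused_bytes(n, itemsize=4):
    """add reads x and b and writes a temp; relu re-reads the temp
    and writes y: the intermediate costs a full write plus read."""
    add = 3 * n * itemsize   # read x, read b, write tmp
    relu = 2 * n * itemsize  # read tmp, write y
    return add + relu

def fused_bytes(n, itemsize=4):
    """One kernel reads x and b and writes y; the temp never exists."""
    return 3 * n * itemsize

n = 1 << 20
print(unfused_bytes(n) / fused_bytes(n))  # ~1.67x more traffic unfused
```

Both versions execute the same arithmetic; the saving is entirely the intermediate's round trip through memory, which is why fusing already compute-bound ops buys little.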
The purpose of internals knowledge is to make a performance trace interpretable enough that you can actually change it