GPU Systems 03 - Memory Hierarchy and Bandwidth
How to think about the GPU memory hierarchy and bandwidth bottlenecks