Distributed LLM Training 08 - Tensor Parallel Basics: Splitting Computation Inside the Model
Once the model itself is too large for a single device, data parallelism alone is no longer enough; the computation inside each layer has to be split across devices
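The idea can be previewed with a toy sketch (assumed setup, using NumPy arrays to stand in for device shards): a linear layer's weight matrix is split column-wise across two "devices", each computes its slice of the output, and concatenating the slices reproduces the unsharded result.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(4, 8)   # batch of activations, replicated on both devices
W = np.random.randn(8, 6)   # full weight matrix (pretend it is too big for one device)

# Column-wise sharding: device 0 holds W0, device 1 holds W1.
W0, W1 = np.split(W, 2, axis=1)

# Each device computes its shard of the output independently.
y0 = x @ W0
y1 = x @ W1

# An all-gather along the feature dimension reassembles the full output.
y = np.concatenate([y0, y1], axis=1)

# The sharded computation matches the unsharded one.
assert np.allclose(y, x @ W)
```

This column split is only one of the possible layouts; later sections of a tensor-parallel walkthrough typically pair it with a row-wise split so that communication is needed only once per pair of layers.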