Jae's Tech Blog

Posts tagged "cuda"

February 19, 2026

GPU Systems 11 - Shared Memory Bank Conflicts

Why shared memory is not automatically fast and how bank conflicts appear

Lectures
February 21, 2026

GPU Systems 12 - Warp Shuffle and Warp-Level Primitives

Why warp-level primitives matter for reductions and lighter-weight cooperation

Lectures
February 23, 2026

GPU Systems 13 - Reduction Kernels in Depth

Using reduction kernels to connect shared memory, warp primitives, and synchronization

Lectures
February 25, 2026

GPU Systems 14 - Why Softmax Is Such a Good Kernel Exercise

How softmax combines reductions, memory traffic, and numerical stability in one kernel

Lectures
February 27, 2026

GPU Systems 15 - LayerNorm and RMSNorm Kernel Structure

Why normalization kernels are often memory-bound and structurally important

Lectures
March 1, 2026

GPU Systems 16 - Vectorized Loads, Stores, and Alignment

How wider memory operations and alignment affect bandwidth utilization

Lectures
March 3, 2026

GPU Systems 17 - Register Pressure and Spilling

Why using more registers can improve local efficiency but still reduce total throughput

Lectures
March 7, 2026

GPU Systems 19 - Asynchronous Copy and Pipelining

How asynchronous copy and double buffering help overlap memory movement with computation

Lectures
January 23, 2026

PyTorch Internals 07 - Tensor Lifetime, the CUDA Caching Allocator, and Memory Reuse

PyTorch GPU memory behavior is shaped by a caching allocator, so observed memory usage is not just a story about current tensor objects

Lectures

© 2025 Jae · Notes on systems, software, and building things carefully.
