GPU Systems
From GPU architecture and CUDA kernels to Triton and real kernel optimization work
Engineers who want to understand how GPUs actually execute work and eventually write and optimize their own kernels.
Thoughts on code, technology, and everything in between
Long-form posts on platform engineering, Linux, compilers, MLOps, and computer architecture, written to help you build stronger intuition instead of just memorizing terms.
A few strong entry points if you are new here.
Fresh writing, updates, and ongoing series entries.
Building reliable ML systems from data pipelines to production monitoring
ML engineers, data scientists, and backend engineers moving from model experiments to production operations.
From finite automata and formal languages to building a compiler from scratch
Readers who want both the theory behind language processing and the bridge to real compiler construction.
How a lexer breaks source code into tokens and where automata theory meets real implementation
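The core of that idea can be sketched in a few lines: each token class is a small regular language, and the lexer repeatedly matches the longest prefix. The token names and the regex-union approach below are illustrative assumptions, not the post's actual implementation.

```python
import re

# Token specification: each (name, regex) pair encodes one finite automaton;
# the lexer tries them as alternatives of a single combined pattern.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def lex(source):
    """Yield (kind, text) tokens, skipping whitespace."""
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        if kind != "SKIP":
            yield (kind, match.group())

tokens = list(lex("x = 42 + y1"))
# -> [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'y1')]
```

Real lexers add line/column tracking and error tokens, but the automata-to-regex bridge is the same.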
Sizing GPU memory by parameter count alone leads to bad decisions; training memory is the sum of parameters, gradients, optimizer state, and activations
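A back-of-the-envelope version of that sum, assuming fp32 weights and an Adam-style optimizer (the byte counts and the 7B example are illustrative, not from the post):

```python
def training_memory_bytes(n_params, bytes_per_param=4, optimizer_states=2,
                          activation_bytes=0):
    """Rough training-memory estimate for fp32 training with Adam.

    weights    : n_params * bytes_per_param
    gradients  : same size as the weights
    optimizer  : Adam keeps 2 extra fp32 states (m and v) per parameter
    activations: workload-dependent, so the caller supplies it
    """
    weights = n_params * bytes_per_param
    gradients = weights
    optimizer = optimizer_states * n_params * 4  # Adam state stays fp32
    return weights + gradients + optimizer + activation_bytes

# A 7B-parameter model in fp32 with Adam, before counting any activations:
gib = training_memory_bytes(7_000_000_000) / 2**30
print(f"{gib:.0f} GiB")  # ~104 GiB -- far more than the 26 GiB of weights alone
```

Mixed precision, ZeRO-style sharding, and activation checkpointing each attack a different term of this sum, which is why they compose.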
Custom autograd functions are a practical place to define forward-backward contracts before dropping to lower-level extensions
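The forward-backward contract can be shown without any framework: forward saves what backward will need, backward turns the upstream gradient into gradients for the inputs. This is a dependency-free sketch loosely modeled on the shape of `torch.autograd.Function`, not the real API.

```python
class Square:
    """A custom op defined by its forward/backward pair. The ctx dict plays
    the role PyTorch's ctx object plays: it carries saved values from
    forward to backward. (Sketch only -- not the actual torch API.)"""

    @staticmethod
    def forward(ctx, x):
        ctx["saved"] = x           # save what backward will need
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        x = ctx["saved"]
        return grad_out * 2 * x    # d(x^2)/dx = 2x, scaled by upstream grad

ctx = {}
y = Square.forward(ctx, 3.0)       # 9.0
dx = Square.backward(ctx, 1.0)     # 6.0
```

Writing the contract at this level first makes it much easier to later port the same op to a C++ or CUDA extension, where the save/restore discipline is the part that bites.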
The concept of virtual memory and how the Linux kernel manages memory
Instruction pipelining, hazard handling, branch prediction, superscalar and out-of-order execution
Adding more GPUs changes optimizer semantics as well as throughput, so batch size and learning rate need to be reasoned about together
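One concrete way to reason about the two together is the linear scaling rule heuristic (scale the learning rate by the same factor as the global batch). The numbers below are illustrative; the rule is a common heuristic, not a law, and warmup still matters.

```python
def scaled_lr(base_lr, base_batch, per_gpu_batch, n_gpus):
    """Linear scaling rule: if the global batch grows k-fold,
    scale the learning rate by k."""
    global_batch = per_gpu_batch * n_gpus
    return base_lr * (global_batch / base_batch)

# Tuned on 1 GPU at batch 32 with lr 0.1; moving to 8 GPUs while keeping
# per-GPU batch 32 multiplies the global batch by 8, so the lr follows:
lr = scaled_lr(0.1, base_batch=32, per_gpu_batch=32, n_gpus=8)
print(lr)  # 0.8
```

The point is that data parallelism silently changed the effective batch from 32 to 256; leaving the learning rate untouched means you are training a different optimization problem.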
How to design golden paths that developers actually want to follow: principles, a step-by-step process, and handling teams that go off-road
Autograd is not just automatic differentiation; it is a graph-construction and backward-execution runtime
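A toy version makes the "graph-construction plus backward-execution" framing concrete: every op records its parents and local derivatives at forward time, and `backward()` replays those edges in reverse. This is a sketch of the idea, not PyTorch's actual implementation.

```python
class Node:
    """A value in the autograd graph. Each op records its parents and a
    local gradient per parent; backward() walks the recorded edges."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # graph edges, recorded at forward time
        self.local_grads = local_grads  # d(out)/d(parent) for each parent
        self.grad = 0.0

    def __mul__(self, other):
        return Node(self.value * other.value,
                    parents=(self, other),
                    local_grads=(other.value, self.value))

    def __add__(self, other):
        return Node(self.value + other.value,
                    parents=(self, other),
                    local_grads=(1.0, 1.0))

    def backward(self, upstream=1.0):
        # Reverse execution: accumulate, then push gradient down each edge.
        # (A real engine topologically sorts so shared nodes run once.)
        self.grad += upstream
        for parent, local in zip(self.parents, self.local_grads):
            parent.backward(upstream * local)

x = Node(2.0)
y = Node(3.0)
z = x * y + x      # dz/dx = y + 1 = 4,  dz/dy = x = 2
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

Seeing the graph built eagerly during forward is what demystifies things like retained graphs and why in-place ops can invalidate saved values.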
DDP is not just a wrapper around your model; it is a runtime that coordinates autograd hooks, gradient buckets, and synchronization timing
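Two of those moving parts, gradient bucketing and the averaging allreduce, can be simulated in plain Python. Bucket sizes and gradient shapes below are made up for illustration; real DDP buckets by bytes over real tensors and launches NCCL collectives.

```python
def bucket_grads(grads, bucket_bytes, elem_bytes=4):
    """Group per-parameter gradients into size-capped buckets, as DDP does
    before issuing one allreduce per bucket instead of one per tensor."""
    cap = bucket_bytes // elem_bytes
    buckets, current, used = [], [], 0
    for g in grads:
        if used + len(g) > cap and current:
            buckets.append(current)     # bucket full: would trigger allreduce
            current, used = [], 0
        current.append(g)
        used += len(g)
    if current:
        buckets.append(current)
    return buckets

def allreduce_mean(replicas):
    """Average one flat gradient across replicas -- the effect of an
    allreduce sum followed by a 1/world_size scaling."""
    world = len(replicas)
    return [sum(vals) / world for vals in zip(*replicas)]

# Two parameters (3 and 2 elements) with a 16-byte cap = 4 floats per bucket:
grads = [[1.0, 1.0, 1.0], [2.0, 2.0]]
buckets = bucket_grads(grads, bucket_bytes=16)
print(len(buckets))                           # 2 buckets, so 2 allreduces
print(allreduce_mean([[1.0, 3.0], [3.0, 5.0]]))  # [2.0, 4.0]
```

The bucketing is why gradient-ready ordering matters: a bucket's allreduce can only launch once every gradient in it has been produced by an autograd hook.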