GPU Systems
From GPU architecture and CUDA kernels to Triton and real kernel optimization work
Engineers who want to understand how GPUs actually execute work and eventually write and optimize their own kernels.
Thoughts on code, technology, and everything in between
Long-form posts on platform engineering, Linux, compilers, MLOps, and computer architecture, written to help you build stronger intuition instead of just memorizing terms.
A few strong entry points if you are new here.
Fresh writing, updates, and ongoing series entries.
Building reliable ML systems from data pipelines to production monitoring
ML engineers, data scientists, and backend engineers moving from model experiments to production operations.
From finite automata and formal languages to building a compiler from scratch
Readers who want both the theory behind language processing and the bridge to real compiler construction.
As context length grows, activation memory and communication patterns change again, and sequence-oriented partitioning starts to matter
What threads, warps, blocks, and grids mean in actual GPU execution
A CUDA kernel becomes a real PyTorch operator only when tensor contracts, runtime semantics, and integration details are handled correctly
How developer portals like Backstage bring order to the chaos of scattered docs, tools, and tribal knowledge
Tensor parallelism becomes real when you map it onto QKV projections, attention output paths, and the two large MLP projections inside a transformer block
A practical study order from GPU architecture to CUDA, Triton, and kernel optimization
How to serve trained models in production and deploy them safely
A C++ extension is the first practical bridge between user-defined logic and the PyTorch runtime
The background knowledge that makes the GPU Systems series much easier to study properly