GPU Systems
From GPU architecture and CUDA kernels to Triton and real kernel optimization work
Engineers who want to understand how GPUs actually execute work and eventually write and optimize their own kernels.
Thoughts on code, technology, and everything in between
Long-form posts on platform engineering, Linux, compilers, MLOps, and computer architecture, written to help you build stronger intuition instead of just memorizing terms.
A few strong entry points if you are new here.
Fresh writing, updates, and ongoing series entries.
Building reliable ML systems from data pipelines to production monitoring
ML engineers, data scientists, and backend engineers moving from model experiments to production operations.
From finite automata and formal languages to building a compiler from scratch
Readers who want both the theory behind language processing and the bridge to real compiler construction.
What goes wrong when experiments aren't tracked, and the tools that solve it
A single operator name in PyTorch may map to many implementations, and the dispatcher is the runtime layer that decides which one runs
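The idea behind that dispatch layer can be sketched as a registry keyed by operator name and backend. This is a toy model written for illustration, not PyTorch's actual dispatcher code; the names `register` and `dispatch` are invented here.

```python
# Toy model of operator dispatch: one op name maps to several kernel
# implementations, and a dispatch key (here just the device string)
# decides at runtime which one runs. Illustrative sketch only.

registry = {}

def register(op, key):
    def deco(fn):
        registry[(op, key)] = fn
        return fn
    return deco

@register("add", "cpu")
def add_cpu(a, b):
    return [x + y for x, y in zip(a, b)]

@register("add", "cuda")
def add_cuda(a, b):
    # Stand-in: a real backend would launch a GPU kernel here.
    return [x + y for x, y in zip(a, b)]

def dispatch(op, key, *args):
    try:
        return registry[(op, key)](*args)
    except KeyError:
        raise NotImplementedError(f"no kernel for {op!r} on {key!r}")

print(dispatch("add", "cpu", [1, 2], [3, 4]))  # [4, 6]
```

The point of the indirection is that callers only ever name the operator; picking the implementation is deferred until the dispatch key is known.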
Why compilers are divided into multiple phases and what each phase does
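The phasing idea fits in a few lines: each pass consumes the previous pass's output and never re-reads the source. A minimal sketch for `1 + 2 * 3`, with a made-up tuple AST and stack-machine output:

```python
# Tiny illustration of compiler phasing: lexing, parsing, and code
# generation as three separate passes. Toy sketch only.
import re

def lex(src):
    # Phase 1: characters -> tokens.
    return re.findall(r"\d+|[+*]", src)

def parse(tokens):
    # Phase 2: tokens -> AST. Grammar: expr -> term ('+' term)*,
    # term -> num ('*' num)*  (so '*' binds tighter than '+').
    def term(i):
        node, i = int(tokens[i]), i + 1
        while i < len(tokens) and tokens[i] == "*":
            node, i = ("*", node, int(tokens[i + 1])), i + 2
        return node, i
    node, i = term(0)
    while i < len(tokens) and tokens[i] == "+":
        rhs, i = term(i + 1)
        node = ("+", node, rhs)
    return node

def codegen(node, out):
    # Phase 3: AST -> stack-machine program.
    if isinstance(node, int):
        out.append(("push", node))
    else:
        op, lhs, rhs = node
        codegen(lhs, out)
        codegen(rhs, out)
        out.append(("add" if op == "+" else "mul", None))
    return out

prog = codegen(parse(lex("1 + 2 * 3")), [])
print(prog)
```

Because each phase has a single, well-defined input format, you can test, replace, or reorder them independently, which is the main argument for the division.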
To reason about distributed training performance, you need a concrete mental model for all-reduce and collective communication cost
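The standard cost model for ring all-reduce can be written down in a few lines. The formula below is the usual bandwidth/latency analysis; the GPU count, buffer size, link bandwidth, and per-step latency are assumed example numbers, not measurements:

```python
# Back-of-envelope ring all-reduce cost model. Each of N ranks moves
# 2*(N-1)/N of the buffer over a link of bandwidth B, across 2*(N-1)
# communication steps that each pay a fixed latency.

def ring_allreduce_time(n_ranks, size_bytes, bw_bytes_per_s, latency_s):
    bandwidth_term = 2 * (n_ranks - 1) / n_ranks * size_bytes / bw_bytes_per_s
    latency_term = 2 * (n_ranks - 1) * latency_s
    return bandwidth_term + latency_term

# Assumed example: 8 GPUs, 1 GiB of gradients, 100 GB/s links, 10 us/step.
t = ring_allreduce_time(8, 2**30, 100e9, 10e-6)
print(f"estimated all-reduce time: {t * 1e3:.2f} ms")
```

Two things fall out of the model: the bandwidth term is nearly independent of the number of ranks (the 2*(N-1)/N factor saturates at 2), while the latency term grows linearly with N, which is why small messages on large clusters are latency-bound.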
How the Linux kernel distributes CPU time among processes and how CFS works
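CFS's core mechanism, always run the task with the smallest virtual runtime, advancing vruntime inversely to task weight, can be simulated in a few lines. The ~1.25x-per-nice-level weight rule below approximates the kernel's weight table; the simulation itself is a toy, not kernel code:

```python
import heapq

# Minimal sketch of CFS: each task accumulates "virtual runtime"
# scaled inversely by its weight, and the scheduler always picks the
# task with the smallest vruntime. Weights use the kernel's roughly
# 1.25x-per-nice-level rule as an approximation.

NICE_0_WEIGHT = 1024

def weight(nice):
    return NICE_0_WEIGHT / (1.25 ** nice)

def simulate(tasks, slice_ms, steps):
    # tasks: {name: nice}. Min-heap of (vruntime, name).
    heap = [(0.0, name) for name in tasks]
    heapq.heapify(heap)
    runtime = {name: 0.0 for name in tasks}
    for _ in range(steps):
        vrt, name = heapq.heappop(heap)
        runtime[name] += slice_ms
        # Heavier (lower-nice) tasks see their vruntime advance slower,
        # so they get picked again sooner.
        vrt += slice_ms * NICE_0_WEIGHT / weight(tasks[name])
        heapq.heappush(heap, (vrt, name))
    return runtime

cpu_time = simulate({"nice0": 0, "nice5": 5}, slice_ms=1, steps=1000)
print(cpu_time)  # the nice-0 task gets roughly 3x the CPU time
```

The ratio of CPU time between the two tasks converges to the ratio of their weights (about 3:1 for nice 0 vs nice 5), which is exactly the fairness property CFS is built around.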
Layout affects both operator selection and performance, and sometimes the most expensive thing in a path is an invisible copy
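The "invisible copy" is easiest to see with strides. Below is a toy model of a strided buffer, not any framework's internals: a transpose is just swapped strides (free), but making the transposed view contiguous forces a real copy of every element:

```python
# A 2x3 matrix stored row-major in a flat buffer. Element (i, j) lives
# at offset i*strides[0] + j*strides[1]. Toy sketch of layout/views.

data = [1, 2, 3, 4, 5, 6]

def at(strides, i, j):
    return data[i * strides[0] + j * strides[1]]

row_major = (3, 1)        # 2x3, contiguous
transposed_view = (1, 3)  # 3x2 view of the SAME buffer: no copy made

print(at(row_major, 0, 2))        # 3
print(at(transposed_view, 2, 0))  # 3: same element through the view

# Materializing the view as a contiguous buffer is the hidden cost:
# every element is touched, even though no arithmetic happens.
contiguous_t = [at(transposed_view, i, j) for i in range(3) for j in range(2)]
print(contiguous_t)  # [1, 4, 2, 5, 3, 6]
```

This is why a profile sometimes shows time in a "copy" or "contiguous" call rather than in the operator you wrote: some kernel in the path required a layout the view didn't have.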
The role of ISAs, CISC vs. RISC philosophy, and x86 vs. ARM design differences
Data parallelism looks simple, but it carries both gradient synchronization cost and full model-state replication cost
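Both costs are just arithmetic once you fix the model size. The numbers below are an assumed example (a 1B-parameter model, mixed-precision Adam, 8 GPUs), and the 2+4+4+4-byte breakdown is the common mixed-precision accounting, not a measurement of any specific setup:

```python
# Rough arithmetic for the two costs of plain data parallelism.
params = 1_000_000_000   # assumed: 1B-parameter model
n_gpus = 8               # assumed cluster size

# Cost 1: every step, the full gradient set is all-reduced.
grad_bytes = params * 2  # fp16 gradients
print(f"gradients synced per step: {grad_bytes / 2**30:.1f} GiB")

# Cost 2: every rank holds a full replica of the model state:
# fp16 weights + fp32 master weights + fp32 Adam m and v moments.
state_bytes_per_rank = params * (2 + 4 + 4 + 4)
print(f"model state per GPU: {state_bytes_per_rank / 2**30:.1f} GiB")
print(f"replicated across cluster: "
      f"{n_gpus * state_bytes_per_rank / 2**30:.1f} GiB")
```

Note that the replication cost grows with cluster size while carrying zero new information, which is the observation that motivates sharded approaches like ZeRO/FSDP.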
How raw data becomes training data, and why data quality matters more than model complexity