GPU Systems
From GPU architecture and CUDA kernels to Triton and real kernel optimization work
Engineers who want to understand how GPUs actually execute work and eventually write and optimize their own kernels.
Thoughts on code, technology, and everything in between
Long-form posts on platform engineering, Linux, compilers, MLOps, and computer architecture, written to help you build stronger intuition instead of just memorizing terms.
A few strong entry points if you are new here.
Fresh writing, updates, and ongoing series entries.
Building reliable ML systems from data pipelines to production monitoring
ML engineers, data scientists, and backend engineers moving from model experiments to production operations.
From finite automata and formal languages to building a compiler from scratch
Readers who want both the theory behind language processing and the bridge to real compiler construction.
How VFS, inodes, and ext4 work in a world where everything is a file
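A minimal sketch of the "everything is a file" idea from that post: every path resolves to an inode, and `os.stat` exposes the inode's metadata (inode number, link count) rather than the name itself. The file names used here are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.txt")
    with open(path, "w") as f:
        f.write("hello")

    st = os.stat(path)
    # The inode number identifies the file independently of its name.
    print("inode:", st.st_ino)
    print("hard links:", st.st_nlink)  # 1: one directory entry points here

    # A hard link adds a second name for the same inode.
    link = os.path.join(d, "alias.txt")
    os.link(path, link)
    assert os.stat(link).st_ino == st.st_ino  # same inode, two names
    assert os.stat(path).st_nlink == 2        # link count went up
```

Deleting one of the two names only decrements the link count; the inode (and the data) survives until the count reaches zero.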
The principles behind recursive descent parsers, LL(1) grammars, and the strengths and limitations of top-down parsing
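The core of recursive descent can be shown in a few lines. This is a toy sketch for the classic LL(1) expression grammar (the grammar and helper names here are illustrative, not taken from the post): one function per nonterminal, and a single token of lookahead decides which production to take, which is exactly the LL(1) property.

```python
# Grammar:
#   expr   -> term   (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> NUMBER | '(' expr ')'
import re

def tokenize(src):
    return re.findall(r"\d+|[-+*/()]", src)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        cur = self.peek()
        if expected is not None and cur != expected:
            raise SyntaxError(f"expected {expected!r}, got {cur!r}")
        self.pos += 1
        return cur

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):     # one-token lookahead
            op = self.eat()
            value = value + self.term() if op == "+" else value - self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.eat()
            value = value * self.factor() if op == "*" else value / self.factor()
        return value

    def factor(self):
        if self.peek() == "(":
            self.eat("(")
            value = self.expr()
            self.eat(")")
            return value
        return int(self.eat())

def evaluate(src):
    return Parser(tokenize(src)).expr()

print(evaluate("2 + 3 * (4 - 1)"))  # 11
```

Note how precedence falls out of the grammar's shape rather than any explicit table: `term` binds tighter than `expr` simply because it sits one level deeper in the recursion. The well-known limitation is left recursion, which a top-down parser like this cannot handle directly.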
Once the model itself is too large for one device, data parallelism is no longer enough and layer-internal computation has to be split
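The basic idea of splitting layer-internal computation can be sketched without any framework. Below, two hypothetical "devices" (simulated as plain Python lists) each hold a column shard of a linear layer's weight matrix, compute a partial output, and concatenating the shards reproduces the unsharded result; the numbers and the two-way split are illustrative.

```python
def matmul(x, w):
    # x: (m, k) row-major lists; w: (k, n); returns (m, n).
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def split_cols(w, parts):
    # Column-shard a (k, n) matrix into `parts` pieces of n // parts columns.
    n = len(w[0]) // parts
    return [[row[p * n:(p + 1) * n] for row in w] for p in range(parts)]

x = [[1.0, 2.0]]            # one input row, k = 2
w = [[1.0, 2.0, 3.0, 4.0],  # k x n weight matrix, n = 4
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)         # unsharded reference result

# Each "device" holds one column shard and computes its slice of the output.
shards = split_cols(w, parts=2)
partials = [matmul(x, shard) for shard in shards]

# Concatenating the per-device outputs (an all-gather in a real system)
# recovers the full result.
gathered = [sum((p[i] for p in partials), []) for i in range(len(x))]
assert gathered == full
```

In a real system each shard lives on a different GPU and the concatenation is a collective communication step, which is why interconnect bandwidth matters so much once models are split this way.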
Many PyTorch CUDA operations are asynchronous, so timing, synchronization, and dependencies have to be reasoned about explicitly
Why CPUs distinguish privilege levels, and how x86 protection rings and ARM exception levels protect the system
In distributed training, performance is often shaped more by how GPUs are connected than by the raw number of GPUs
Why model versioning differs from code versioning, and the role of a model registry
PyTorch GPU memory behavior is shaped by a caching allocator, so observed memory usage is not just a story about current tensor objects
Why IaC is the backbone of any platform, and how tools like Terraform, Pulumi, and Crossplane compare when building self-service infrastructure