Writing
Essays, paper deep dives, and architecture notes. Use the topic filters to jump to what interests you.
How warp schedulers, tensor cores, and instruction pipelines inside an H100 SM keep massive thread counts in flight.
A tour of caching, shared memory, and register files on modern accelerators, with tips for keeping tensor cores fed.
Explains why large-scale training depends on terabyte-per-second HBM stacks and how to budget their bandwidth.
Sets the stage for the GPU architecture series—compute, memory, and interconnect pillars for large-scale training.
Breaks down unCLIP’s diffusion prior, decoder, and editing workflows for high-fidelity text-to-image synthesis.
Covers evidence lower bound (ELBO) derivations and the mechanics of variational autoencoders (VAEs) for generative modeling.
A walkthrough of the byte-pair encoding (BPE) tokenization algorithm that powers modern language models.
Analyzes early text-to-image transformers, discrete VAEs, and zero-shot generation techniques.
A deep dive into CLIP’s contrastive training on 400M image–text pairs and its zero-shot recognition capabilities.