Writing
Essays, paper deep dives, and architecture notes. Use the topic filters to jump to what interests you.
How warp schedulers, tensor cores, and instruction pipelines inside an H100 SM keep massive thread counts in flight.
A tour of caching, shared memory, and register files on modern accelerators, with tips for keeping tensor cores fed.
Explains why large-scale training depends on terabyte-per-second HBM stacks and how to budget their bandwidth.
Sets the stage for the GPU architecture series—compute, memory, and interconnect pillars for large-scale training.
Breaks down unCLIP’s diffusion prior, decoder, and editing workflows for high-fidelity text-to-image synthesis.
Covers evidence lower bound (ELBO) derivations and the mechanics of variational autoencoders (VAEs) for generative modeling.
A walkthrough of the byte-pair encoding (BPE) tokenization algorithm that powers modern language models.
Analyzes early text-to-image transformers, discrete VAEs, and zero-shot generation techniques.
A deep dive into CLIP’s contrastive training on 400M image–text pairs and its zero-shot recognition capabilities.