The GPU Memory Hierarchy: L2, L1, and Registers

September 19, 2025

While High Bandwidth Memory (HBM) provides GPUs with massive throughput, it is still relatively far away from the compute units in terms of latency. To bridge that gap, modern GPUs employ a layered memory hierarchy: L2 cache, L1 cache (shared memory), and registers. Each layer trades capacity for speed, ensuring that the most frequently used data stays as close to the compute units as possible.
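One way to see these capacity steps concretely is to query them at runtime. The sketch below is illustrative rather than definitive: it uses the standard CUDA device-property fields and assumes a single visible GPU (device 0); the numbers it prints will vary by part.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assumes device 0 is the GPU of interest

    // Capacities shrink as we move closer to the compute units.
    printf("Global memory (HBM):   %zu MB\n", prop.totalGlobalMem >> 20);
    printf("L2 cache:              %d KB\n", prop.l2CacheSize >> 10);
    printf("Shared memory per SM:  %zu KB\n", prop.sharedMemPerMultiprocessor >> 10);
    printf("Registers per SM:      %d (32-bit)\n", prop.regsPerMultiprocessor);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    return 0;
}
```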

Why This Memory Hierarchy Matters
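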

Training and inference involve repeatedly accessing weights, activations, and intermediate results. Fetching all of this directly from HBM would overwhelm its bandwidth and introduce unacceptable delays. By staging data through progressively faster and smaller memories, GPUs reduce both latency and energy consumption, keeping tensor cores and CUDA cores consistently fed.
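To make the staging idea concrete, here is a minimal reduction sketch (a sketch, not a production kernel): each thread keeps a running sum in a register, the block combines partial sums through shared memory, and only one value per block ever travels back out to HBM.

```cuda
#include <cuda_runtime.h>

// Launch with 256 threads per block; *out must be zeroed beforehand.
__global__ void block_sum(const float* __restrict__ in, float* __restrict__ out, int n) {
    __shared__ float partial[256];           // lives in the SM's shared memory / L1

    float acc = 0.0f;                        // lives in a per-thread register
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        acc += in[i];                        // each element read from HBM (via L2) exactly once

    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction entirely inside shared memory: no further HBM traffic.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(out, partial[0]);          // one write per block back to HBM
}
```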


Figure 1: Inside the compute die of the NVIDIA H100 GPU. We see 50 MB of L2 cache as well as 132 streaming multiprocessors stacked next to each other.

L2 Cache - Global Buffer for All Streaming Multiprocessors (SMs)
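One hedged illustration of the L2 cache acting as a device-wide buffer: on Ampere- and Hopper-class GPUs, CUDA lets you mark a region of global memory as "persisting" in L2, so data that many SMs keep re-reading (small weight matrices, for example) is more likely to stay resident. The sketch below assumes a hypothetical buffer `buf` small enough to fit in the persisting window.

```cuda
#include <cuda_runtime.h>

void pin_in_l2(cudaStream_t stream, void* buf, size_t bytes) {
    // Reserve part of L2 for persisting accesses (Ampere/Hopper and newer).
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, bytes);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = buf;    // region that all SMs keep re-reading
    attr.accessPolicyWindow.num_bytes = bytes;
    attr.accessPolicyWindow.hitRatio  = 1.0f;   // try to keep the whole window resident
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```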


Figure 2: Inside a streaming multiprocessor (SM) of the H100, with the L1 instruction cache visible at the top.

L1 Cache (Shared Memory) — Local Fast Storage in Each SM
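A minimal sketch of the usual shared-memory pattern, here a tiled matrix transpose: the block loads a tile from HBM once, synchronizes, then re-reads it in a different order at shared-memory speed. The extra padding column is the standard trick to avoid bank conflicts.

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Launch: dim3 block(TILE, TILE); dim3 grid((width+TILE-1)/TILE, (height+TILE-1)/TILE);
__global__ void transpose(const float* __restrict__ in, float* __restrict__ out,
                          int width, int height) {
    // +1 column of padding so column accesses don't hit the same shared-memory bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced read from HBM

    __syncthreads();  // make the whole tile visible to every thread in the block

    int tx = blockIdx.y * TILE + threadIdx.x;  // transposed coordinates
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[threadIdx.x][threadIdx.y]; // coalesced write to HBM
}
```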

Registers — Per-Thread Working Memory
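A short sketch of what per-thread working memory looks like in practice: ordinary local variables such as `xi` and `acc` below are placed in registers by the compiler, so the inner loop generates no memory traffic for them. Compiling with `nvcc -Xptxas -v` reports the per-thread register count for each kernel; kernels that need more registers than are available spill to slower local memory.

```cuda
// Evaluate a polynomial at each input point using Horner's rule.
__global__ void poly_eval(const float* __restrict__ x, float* __restrict__ y,
                          const float* __restrict__ coeff, int degree, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float xi  = x[i];     // one read from global memory
    float acc = 0.0f;     // accumulator lives in a register
    for (int d = degree; d >= 0; --d)
        acc = acc * xi + coeff[d];   // coeff is tiny and stays warm in L1/L2
    y[i] = acc;           // one write back to global memory
}
```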

How They Work Together in Large-Scale Training

Imagine training a transformer with billions of parameters. Here's how the data moves: weight and activation tensors live in HBM; tiles of them are pulled through the L2 cache, which is shared by every SM; each SM stages the tile it is working on in its L1/shared memory; and individual threads hold operands and partial results in registers while the tensor cores and CUDA cores do the arithmetic.
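To tie the levels together, here is a simplified sketch of one tile of the matrix multiply at the heart of every transformer layer, with comments marking which level of the hierarchy each access touches. A real training kernel would drive the tensor cores through libraries such as cuBLAS or CUTLASS, but the data movement follows the same shape.

```cuda
#include <cuda_runtime.h>

#define TILE 16

// C = A * B for square n x n matrices; n assumed to be a multiple of TILE here.
// Launch: dim3 block(TILE, TILE); dim3 grid(n / TILE, n / TILE);
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];   // staged in the SM's shared memory / L1
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                  // partial result lives in a register

    for (int t = 0; t < n / TILE; ++t) {
        // Each A/B element is fetched from HBM (through L2) once per tile,
        // then reused TILE times out of shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];   // shared memory + registers
        __syncthreads();
    }

    C[row * n + col] = acc;            // single write back out to HBM
}
```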
