Streaming Multiprocessors: Scheduling and Execution

September 20, 2025

Modern GPUs achieve their massive parallel performance through Streaming Multiprocessors (SMs). An SM is the fundamental compute block of the GPU, containing everything needed to execute thousands of threads in parallel: cores for arithmetic, fast local memory, registers, and schedulers. By replicating SMs across the GPU die, manufacturers scale performance from tens of gigaflops into the multi-petaflop range. For example, NVIDIA’s A100 GPU integrates 108 SMs, while the newer H100 contains 132. Understanding the role of SMs is critical to understanding why GPUs excel at deep learning and other workloads that demand large-scale parallelism.
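
To make that scaling concrete, the host-side sketch below estimates peak FP32 throughput as SMs × FP32 cores per SM × 2 FLOPs per FMA × clock rate. The 128-cores-per-SM figure is an assumption that holds for A100 and H100 but differs on other architectures:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Assumption: 128 FP32 cores per SM (true for A100/H100; varies by architecture).
    const int fp32CoresPerSM = 128;

    // Peak FP32 = SMs * cores per SM * 2 FLOPs per FMA * clock (Hz).
    // prop.clockRate is reported in kHz.
    double peakTflops = (double)prop.multiProcessorCount * fp32CoresPerSM
                      * 2.0 * (prop.clockRate * 1e3) / 1e12;

    printf("%s: %d SMs, peak FP32 ~%.1f TFLOP/s\n",
           prop.name, prop.multiProcessorCount, peakTflops);
    return 0;
}
```

For an H100 with 132 SMs at a ~1.98 GHz boost clock, this works out to roughly 67 FP32 TFLOP/s, which matches the published figure.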

Why SMs Matter for Machine Learning

Streaming multiprocessors are central to GPU efficiency because they:

- schedule warps so that a memory stall in one warp is hidden by ready work in another;
- keep registers and shared memory physically close to the cores;
- host the CUDA and tensor cores that perform the actual arithmetic; and
- replicate cleanly across the die, so performance scales with SM count.

Without SMs, the GPU’s cores would be isolated units. SMs bring them together into coordinated, highly parallel engines.


Figure 1: H100 SM floorplan. Each SM provides 256 KB of combined L1/shared memory, register files of 16K entries per scheduler partition, 128 FP32 ALUs, and four 4th-generation tensor cores delivering 4,096 mixed-precision FLOPs per cycle.

Anatomy of an SM

Each SM integrates a variety of hardware resources that work together to execute threads efficiently:

- CUDA cores: FP32/INT32 ALUs that execute general arithmetic;
- tensor cores: specialized units for mixed-precision matrix math;
- a large register file providing fast, private storage for each thread;
- shared memory and L1 cache for low-latency data sharing within a block;
- warp schedulers that choose which warps issue instructions each cycle.

Together, these resources form a self-contained compute cluster. The GPU die is then built by replicating SMs, enabling the scaling that makes GPUs effective for workloads ranging from graphics to deep learning.
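
As a rough illustration, the kernel below touches each of these resources: per-thread registers for scalars, the block's shared-memory scratchpad, and the SM's hardware barrier. The kernel name and sizes are illustrative, and it assumes a launch with 256 threads per block:

```cuda
// Minimal sketch: each thread block stages data in the SM's shared memory,
// keeps scalars in per-thread registers, and synchronizes at a block-wide
// hardware barrier -- all resources that live on a single SM.
// Assumes a launch with 256 threads per block.
__global__ void scaleAndSum(const float* in, float* out, float scale, int n) {
    __shared__ float tile[256];                     // SM shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // held in a register

    tile[threadIdx.x] = (i < n) ? in[i] * scale : 0.0f;
    __syncthreads();                                // barrier within the SM

    // One thread per block reduces the tile (simple, not optimized).
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int j = 0; j < blockDim.x; ++j) sum += tile[j];
        out[blockIdx.x] = sum;
    }
}
```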

Parallel Execution in an SM

One of the defining features of an SM is its ability to manage massive numbers of threads. Threads are grouped into warps, and warps are organized into thread blocks. An SM can run multiple blocks at once, with each warp scheduler interleaving instructions across warps to keep the cores busy.
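
Here is a quick sketch of how that hierarchy looks from inside a kernel (the kernel name is illustrative; the 32-thread warp size is standard on NVIDIA hardware):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void whoAmI() {
    // Threads are grouped into warps of 32 (warpSize); warps form the block.
    int warpId = threadIdx.x / warpSize;  // which warp within this block
    int laneId = threadIdx.x % warpSize;  // this thread's slot in its warp
    if (laneId == 0)                      // print one line per warp
        printf("block %d, warp %d\n", blockIdx.x, warpId);
}

int main() {
    // 4 blocks of 128 threads = 4 warps per block for the schedulers
    // to interleave.
    whoAmI<<<4, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
```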

This design allows GPUs to hide latency. For example, when one warp stalls waiting for data from memory, the scheduler can immediately switch to another warp that is ready to execute. This rapid context switching is lightweight and enables the SM to keep its CUDA and tensor cores operating near peak throughput.
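
One way to give the schedulers enough warps to swap between is to size the grid using the CUDA occupancy API, as in the sketch below (the saxpy kernel is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];  // global loads stall this warp,
}                                       // so the scheduler runs another

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256, blocksPerSM = 0;
    // Ask the runtime how many resident blocks of saxpy fit on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy,
                                                  blockSize, 0);

    // Fill every SM so each warp scheduler always has a ready warp.
    int gridSize = blocksPerSM * prop.multiProcessorCount;
    printf("%d blocks/SM x %d SMs = launch at least %d blocks\n",
           blocksPerSM, prop.multiProcessorCount, gridSize);
    return 0;
}
```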

In a typical deep learning kernel (see the sketch after this list):

- each thread block is assigned a tile of the output matrix;
- threads cooperatively stage input tiles from global memory into shared memory;
- the CUDA or tensor cores compute on the staged tiles; and
- while one warp waits on memory, the schedulers issue another, keeping the SM busy.

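A minimal sketch of this pattern is below: a shared-memory tiled matrix multiply, simplified to square n × n matrices with n a multiple of the tile size, and using FP32 CUDA cores rather than tensor cores:

```cuda
#define TILE 16

// Simplified tiled GEMM: C = A * B for square n x n matrices, with n a
// multiple of TILE. Real deep learning kernels use tensor cores (e.g. via
// cuBLAS or WMMA), but the staging pattern is the same.
__global__ void matmulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];  // input tiles staged in shared memory
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                 // accumulator lives in a register

    for (int t = 0; t < n / TILE; ++t) {
        // Cooperative load: each thread brings in one element per tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();              // wait until the tiles are fully staged

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();              // safe to overwrite the shared buffers
    }
    C[row * n + col] = acc;
}

// Launch: dim3 block(TILE, TILE); dim3 grid(n / TILE, n / TILE);
// matmulTiled<<<grid, block>>>(A, B, C, n);
```
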
Real-World Implications for ML
