How GPUs Power Modern Machine Learning: An Introduction to GPU Architecture

September 18, 2025

If you’ve ever trained a deep learning model, you’ve seen how GPUs can shrink training times from weeks to hours. But what actually makes a GPU so good at machine learning? It isn’t just “more cores” — it’s a carefully designed balance of memory, compute units, and parallelism that work together to keep massive numbers of operations flowing during large-scale distributed training.

At a high level, training a neural network boils down to matrix multiplications — multiplying and adding huge grids of numbers, millions or billions of times over. GPUs are built to excel at this. They provide not only thousands of compute units to do the math, but also specialized hardware (tensor cores) and a memory system (HBM and caches) designed to feed those compute units at astonishing speed.
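
To get a feel for the scale involved, here is a back-of-the-envelope count of the work in a single matrix multiplication. The matrix sizes are illustrative, not taken from any particular model.

```latex
% Multiplying A (M x K) by B (K x N) costs about 2MKN floating-point operations:
% one multiply and one add for each step of every K-long dot product.
\text{FLOPs}(A \cdot B) \approx 2\,M K N

% Example: a single 4096 x 4096 x 4096 multiplication
2 \times 4096^{3} \approx 1.37 \times 10^{11} \ \text{FLOPs} \approx 137\ \text{GFLOPs}
```

A single training step of a large model chains thousands of multiplications of this size, which is why raw compute throughput is the headline number for ML hardware.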

Understanding this hardware matters. If you know where GPUs shine and where they bottleneck, you’ll better understand why certain model architectures are feasible, why training sometimes slows down even on powerful clusters, and why “more memory” or “more FLOPs” aren’t always the whole story.

In this article, we’ll discuss the key components of modern GPUs — from high-bandwidth memory to streaming multiprocessors, from FP32 cores to tensor cores — and see how they come together to form the engines behind today’s machine learning revolution. Along the way, I’ll point you to deeper dives on each topic if you want to explore further.

Memory: The Fuel of Computation

No matter how powerful a GPU’s compute cores are, their performance depends on how quickly data can reach them. Training large neural networks requires moving enormous amounts of weights, activations, and gradients, and the memory system is what makes that possible.

High Bandwidth Memory (HBM)

Modern GPUs use High Bandwidth Memory (HBM) instead of the DDR memory used by CPUs or the GDDR memory found in consumer graphics cards. HBM consists of DRAM dies stacked vertically and placed right next to the GPU die, connected through a silicon interposer that provides thousands of high-speed connections. This design delivers extremely high throughput while keeping power consumption manageable.

The result is memory bandwidth on the order of terabytes per second. For example, an NVIDIA H100 GPU provides over 3,000 GB/s of bandwidth, compared to about 100 GB/s for a high-end CPU. This level of throughput ensures that the compute units can be kept busy instead of stalling while waiting for data.

Memory Bandwidth (GB/s)

The “GB/s” figure you see in GPU specifications refers to memory bandwidth: the maximum rate at which data can move between HBM and the GPU cores. Higher bandwidth means more data can be transferred each second, which directly impacts how efficiently large models can be trained.
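
As a rough illustration of why this number matters (the 140 GB working set below is an assumed figure, not a measurement), compare how long it takes to stream the same data through the bandwidths quoted above:

```latex
% Time to move data = bytes moved / memory bandwidth
t_{\text{GPU}} \approx \frac{140\ \text{GB}}{3000\ \text{GB/s}} \approx 47\ \text{ms}
\qquad\text{vs.}\qquad
t_{\text{CPU}} \approx \frac{140\ \text{GB}}{100\ \text{GB/s}} = 1.4\ \text{s}
```

If the compute units finish their math faster than the memory system can deliver the next batch of data, they simply wait, so this number often matters as much as the FLOPs figure.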

Caches

Even with HBM, data movement is expensive in both time and energy. To reduce this cost, GPUs use a cache hierarchy:

- Registers: the fastest storage available, private to each individual thread.
- Shared memory / L1 cache: small, on-chip memory inside each streaming multiprocessor; shared memory is explicitly managed by the programmer and is ideal for data that a block of threads reuses.
- L2 cache: a larger cache shared by all SMs, sitting between them and HBM.

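To make the hierarchy concrete, here is a minimal CUDA sketch (an illustrative example with made-up names like blur1d, not code from a library) that stages a tile of data in per-SM shared memory so each value is fetched from HBM once and then reused by neighbouring threads in the block:

```cuda
#include <cstdio>

#define TILE 256  // threads per block, and elements staged per block

// Each block copies TILE elements of `in` into shared memory (on-chip, per SM),
// then every thread reads its neighbours from that fast copy instead of HBM.
__global__ void blur1d(const float* in, float* out, int n) {
    __shared__ float tile[TILE];                  // shared memory: per-block, on-chip
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) tile[threadIdx.x] = in[gid];     // one HBM read per element
    __syncthreads();                              // wait until the whole tile is loaded

    if (gid < n) {
        // Neighbour reads hit shared memory; only tile edges fall back to HBM.
        float left  = (threadIdx.x > 0)
                        ? tile[threadIdx.x - 1] : in[max(gid - 1, 0)];
        float right = (threadIdx.x + 1 < TILE && gid + 1 < n)
                        ? tile[threadIdx.x + 1] : in[min(gid + 1, n - 1)];
        out[gid] = (left + tile[threadIdx.x] + right) / 3.0f;
    }
}
// Launched as, e.g.: blur1d<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
```
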
Compute: The Workhorses of the GPU

While memory provides the data, the compute units are what actually perform the matrix operations at the heart of training large ML models. In modern GPUs, these units are organized into Streaming Multiprocessors (SMs), which are the fundamental building blocks of GPU computation.

Streaming Multiprocessors (SMs)

Each GPU die contains many SMs (e.g., 108 in the NVIDIA A100, 132 in the H100). An SM contains everything needed to execute thousands of threads in parallel:

- CUDA (FP32) cores for general-purpose arithmetic
- Tensor cores for matrix multiply-accumulate operations
- A register file and shared memory / L1 cache for fast, on-chip data access
- Warp schedulers that issue instructions to the execution units

By replicating SMs across the GPU, large numbers of threads can run simultaneously, providing the massive parallelism needed for matrix multiplications in deep learning.
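
You can check how many SMs your own GPU exposes, and how many threads they can host, with a short query against the CUDA runtime API. A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // properties of GPU 0

    printf("GPU: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);   // e.g., 108 on an A100
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max resident threads on the whole GPU: %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}
```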

CUDA (FP32) Cores

The most general-purpose compute units inside each SM are the CUDA cores. They perform operations in standard floating-point precision (FP32) and can also support FP64 or integer arithmetic depending on the architecture. CUDA cores handle a wide variety of tasks, from control flow and indexing to basic arithmetic.
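
A minimal kernel makes this concrete: every thread runs the function below on one element, and the FP32 multiply-add is issued to a CUDA core. This is a generic SAXPY sketch, not code from any particular library.

```cuda
// y = a * x + y, one element per thread.
// Launched as, e.g.: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        y[i] = a * x[i] + y[i];   // one fused multiply-add in FP32 on a CUDA core
    }
}
```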

Tensor Cores

Introduced with the NVIDIA Volta architecture, tensor cores are specialized units designed specifically for matrix multiply-accumulate (MMA) operations, which form the backbone of deep learning workloads. Tensor cores operate on small matrix tiles (e.g., 16×16) and perform these multiplications in a single instruction, delivering hundreds of floating-point operations per cycle.

They are optimized for lower-precision formats such as FP16, BF16, and FP8, with results typically accumulated in FP32 for accuracy. This design allows tensor cores to deliver an order of magnitude or more higher throughput than CUDA cores on matrix-heavy workloads.
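
The most direct way to program tensor cores is CUDA's warp-level WMMA API, where a whole warp cooperates on one 16×16×16 tile. Below is a minimal FP16-in / FP32-accumulate sketch (illustrative only; real kernels tile this over much larger matrices, and it requires a Volta-or-newer GPU, e.g. compiled with -arch=sm_70):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 output tile: D = A * B,
// with FP16 inputs and FP32 accumulation on the tensor cores.
__global__ void wmma_16x16x16(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                  // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);                // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // the tensor core MMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
// Launched with a single warp: wmma_16x16x16<<<1, 32>>>(d_A, d_B, d_D);
```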

Performance Metrics

To evaluate GPU performance for machine learning tasks, it’s not enough to look at core counts or memory size. What matters is how much useful computation the GPU can perform per second and whether the memory system can keep up.

FLOPs and FLOPs per cycle

FLOP stands for floating-point operation (e.g., a single addition or multiplication). GPU datasheets often report performance in teraFLOPs (TFLOPs), meaning trillions of floating-point operations per second.

The total throughput of a GPU is determined by three factors:

- The number of compute units (CUDA cores or tensor cores) on the chip
- How many floating-point operations each unit can perform per clock cycle
- The clock frequency at which those units run

Together, these define the theoretical peak compute throughput of a GPU.
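
Putting the three factors together gives the familiar peak-throughput formula. The worked example uses publicly listed A100 figures (6,912 FP32 CUDA cores at a ~1.41 GHz boost clock), so treat the result as an approximation:

```latex
% Peak throughput = (number of units) x (FLOPs per unit per cycle) x (clock rate)
\text{Peak FLOPs/s} = N_{\text{units}} \times \frac{\text{FLOPs}}{\text{unit}\cdot\text{cycle}} \times f_{\text{clock}}

% A100, FP32 CUDA cores: each core can issue one fused multiply-add (2 FLOPs) per cycle
6912 \times 2 \times 1.41\times 10^{9} \approx 1.95 \times 10^{13} \approx 19.5\ \text{TFLOPs/s}
```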

Memory vs. Compute Bound

In practice, GPUs rarely achieve their full theoretical FLOPs. The limiting factor depends on the workload:

- Memory-bound workloads perform little arithmetic per byte of data moved, so the cores sit idle waiting on memory bandwidth.
- Compute-bound workloads reuse data heavily once it is on chip, so performance is limited by how fast the cores can execute FLOPs.

Deep learning workloads often fall between these extremes, which is why both FLOPs and memory bandwidth are highlighted in GPU specifications.
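
A quick way to judge which regime a kernel falls into is its arithmetic intensity, the FLOPs it performs per byte moved from memory, compared against the GPU's own ratio of peak compute to bandwidth. The numbers below are round A100-class figures used purely for illustration:

```latex
% Arithmetic intensity of a kernel vs. the hardware's balance point
I = \frac{\text{FLOPs performed}}{\text{bytes moved}}, \qquad
I_{\text{GPU}} \approx \frac{19.5\ \text{TFLOPs/s}}{1.5\ \text{TB/s}} \approx 13\ \text{FLOPs/byte}

% Kernels with I well below ~13 FLOPs/byte are memory-bound on this GPU;
% kernels well above it are compute-bound.
```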

Threads and Parallelism

The defining strength of GPUs is their ability to run tens of thousands of operations at the same time. This parallelism is expressed through a hierarchical execution model based on threads, warps, and blocks.

Processes and Threads

On the CPU, a training job runs as an ordinary process with a handful of threads. When that process launches work on the GPU, the work is expressed as thousands of lightweight GPU threads, each executing the same program on a different piece of data.

Warps and Blocks

On NVIDIA GPUs, threads are grouped into sets of 32 called warps. Warps are the scheduling unit for the GPU: all 32 threads in a warp execute the same instruction at once on different pieces of data. Warps are further organized into thread blocks, which are assigned to a specific streaming multiprocessor (SM).
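
The hierarchy maps directly onto how a kernel is launched. The sketch below (a generic example with an illustrative kernel name, not code from the article) launches a grid of blocks, each holding 256 threads, i.e., 8 warps of 32, and lets each thread recover where it sits in the hierarchy from its built-in indices:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void whereami() {
    int lane = threadIdx.x % 32;                       // position within the warp
    int warp = threadIdx.x / 32;                       // warp index within the block
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (lane == 0)                                     // one printout per warp
        printf("block %d, warp %d, global thread %d\n", blockIdx.x, warp, gid);
}

int main() {
    // 4 blocks x 256 threads = 1,024 threads = 32 warps in flight,
    // distributed across however many SMs the hardware assigns the blocks to.
    whereami<<<4, 256>>>();
    cudaDeviceSynchronize();    // wait for the kernel (and its printf output)
    return 0;
}
```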

Scheduling and Latency Hiding

Each SM contains multiple warp schedulers that decide which warps to execute in a given cycle. When one warp stalls waiting for data from memory, the scheduler can switch to another warp that is ready to run. This ability to juggle thousands of threads allows GPUs to “hide” memory latency and keep their compute units fully utilized.

Parallelism at Scale

By replicating SMs across the GPU and running many blocks in parallel, GPUs can scale to tens of thousands of concurrent threads. This massive parallelism is what makes them ideal for workloads like matrix multiplications and deep learning training, where the same operation is applied repeatedly to large datasets.
