DAY 1 · AM


GPU & VRAM Fundamentals

5 min · Beginner · 1,103 words

Understand the hardware that powers all LLM inference. Learn GPU architecture, memory hierarchy, bandwidth constraints, and how to calculate VRAM requirements for any model.

What you will learn

01 Why GPUs? The Parallelism Story

02 GPU Architecture Essentials

03 VRAM Math — Know Your Numbers

04 Key GPUs for LLM Serving

05 Compute-Bound vs Memory-Bound

06 NVLink & Inter-GPU Bandwidth

Why GPUs? The Parallelism Story

A CPU has a few dozen powerful cores (typically 8–32) optimized for sequential, complex tasks. A GPU has thousands of smaller cores designed to apply the same operation to many data points simultaneously. That is ideal for matrix multiplications, the operation that dominates neural network compute.

💡

The Core Intuition

An A100 GPU has 6,912 CUDA cores and 432 Tensor Cores. During an LLM forward pass, you're multiplying a [batch × seq × hidden] tensor by a [hidden × hidden] weight matrix — billions of multiply-adds happening in parallel. This is why GPUs are irreplaceable.
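To put a number on "billions of multiply-adds", here is a minimal sketch that counts the floating-point operations in one such projection. The shapes used (batch 8, 2,048 tokens, hidden size 4,096) are illustrative assumptions, not values from a specific model.

```python
def matmul_flops(batch: int, seq: int, hidden_in: int, hidden_out: int) -> int:
    """FLOPs for [batch, seq, hidden_in] @ [hidden_in, hidden_out].

    Each output element needs hidden_in multiplies and hidden_in adds,
    so the total is 2 * (number of output elements) * hidden_in.
    """
    return 2 * batch * seq * hidden_in * hidden_out


# Assumed example shapes: 8 prompts of 2,048 tokens, hidden size 4,096.
flops = matmul_flops(batch=8, seq=2048, hidden_in=4096, hidden_out=4096)
print(f"{flops / 1e9:.0f} GFLOPs for a single [hidden x hidden] projection")
```

A transformer layer contains several projections of this kind, and a model stacks dozens of layers, so the per-pass total is far larger than this single figure.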

GPU Architecture Essentials

Streaming Multiprocessors (SMs)

GPUs are organized into Streaming Multiprocessors. Each SM contains CUDA cores, Tensor Cores, registers, and a combined L1 cache / shared memory block. The H100 SXM5 has 132 SMs; the A100 has 108. When you launch a CUDA kernel, the GPU scheduler assigns thread blocks to available SMs.
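One consequence of block-to-SM scheduling is that a kernel runs in "waves": if there are more thread blocks than the SMs can hold at once, the remainder waits for a later round. The sketch below estimates the wave count. The `blocks_per_sm` parameter is an assumed occupancy figure; the real value depends on each kernel's register and shared-memory usage, and the 1,000-block launch is hypothetical.

```python
import math


def num_waves(num_blocks: int, num_sms: int, blocks_per_sm: int = 1) -> int:
    """Scheduling rounds needed to run every thread block."""
    concurrent_slots = num_sms * blocks_per_sm
    return math.ceil(num_blocks / concurrent_slots)


def last_wave_utilization(num_blocks: int, num_sms: int, blocks_per_sm: int = 1) -> float:
    """Fraction of block slots that have work during the final wave."""
    concurrent_slots = num_sms * blocks_per_sm
    remainder = num_blocks % concurrent_slots
    return 1.0 if remainder == 0 else remainder / concurrent_slots


# Hypothetical launch of 1,000 thread blocks, one resident block per SM.
for name, sms in [("A100", 108), ("H100", 132)]:
    waves = num_waves(1000, sms)
    tail = last_wave_utilization(1000, sms)
    print(f"{name}: {waves} waves, last wave {tail:.0%} full")
```

A partially filled last wave leaves SMs idle, which is one reason kernel grid sizes are often tuned to a multiple of the SM count.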

Tensor Cores

Tensor Cores are specialized matrix units introduced in Volta (V100), where each one performs a 4×4 matrix multiply-accumulate (MMA) per clock cycle. On the H100 SXM5, Tensor Cores deliver up to 989 TFLOPS for FP16 and up to 1,979 TOPS for INT8 on dense matrices; structured sparsity doubles both figures, to 1,979 TFLOPS and 3,958 TOPS. They are the engines behind fast inference.

Figure: How a single matmul travels the GPU. Weights start in VRAM (huge but slow), get pulled into L2, are tiled into the SM's L1/shared SRAM, and are finally fed to the Tensor Core, which runs a 4×4 multiply-accumulate every cycle.
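The 4×4 multiply-accumulate itself is simple to write down. The following is a plain-Python sketch of the arithmetic only (D = A·B + C on one tile); it says nothing about how the hardware implements it, and the function name `mma_4x4` is made up for illustration.

```python
def mma_4x4(a, b, c):
    """Return D = A @ B + C for 4x4 matrices given as nested lists.

    One call performs 4 * 4 * 4 = 64 multiply-adds (128 FLOPs).
    """
    n = 4
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
b = [[4 * i + j for j in range(4)] for i in range(4)]  # entries 0..15
ones = [[1] * 4 for _ in range(4)]

d = mma_4x4(identity, b, ones)  # I @ B + 1: every entry of B plus one
print(d)
```

A large matmul is computed by tiling the operands into small blocks like these and accumulating the partial products into C, which is why the accumulate step is part of the primitive.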

Memory Hierarchy

Level	Type	Size (H100)	Bandwidth	Latency
L1 / Shared Mem	On-SM SRAM	256 KB/SM	~33 TB/s	~20 cycles
L2 Cache	On-chip SRAM	50 MB	~12 TB/s	~100 cycles
VRAM (HBM3)	High-Bandwidth Memory	80 GB	3.35 TB/s	~400 cycles
CPU RAM	DDR5	TB range	~100 GB/s	~100 ns
NVMe SSD	PCIe Flash	TB range	~12 GB/s	~100 µs

Memory bandwidth by tier on H100. VRAM is 33× faster than CPU RAM — but L1 is 10× faster still. The decode bottleneck lives at the VRAM tier.
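To make the bandwidth gap concrete, this sketch estimates how long it takes to stream 14 GB (a 7B model in BF16) once through each off-chip tier, using the bandwidths from the table. It assumes ideal sustained bandwidth and ignores latency, so the results are lower bounds rather than measurements.

```python
# Bandwidths taken from the memory hierarchy table, in bytes per second.
TIER_BANDWIDTH = {
    "VRAM (HBM3)": 3.35e12,
    "CPU RAM (DDR5)": 100e9,
    "NVMe SSD": 12e9,
}


def stream_time_ms(num_bytes: float, bytes_per_s: float) -> float:
    """Ideal time to read num_bytes once at the given bandwidth."""
    return num_bytes / bytes_per_s * 1e3


weights_bytes = 14e9  # 7B parameters x 2 bytes (BF16)
for tier, bandwidth in TIER_BANDWIDTH.items():
    print(f"{tier:16s} {stream_time_ms(weights_bytes, bandwidth):9.1f} ms")
```

Since decode reads the weights once per generated token, the VRAM row is effectively a floor on per-token latency at batch size 1.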

⚠️

The Bandwidth Wall

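LLM decode is almost always memory-bandwidth-bound, not compute-bound. At batch size 1, you use roughly 1% of an A100's compute while saturating its memory bandwidth, because every generated token requires streaming all model weights from VRAM once. The bottleneck is loading the weights, not doing the math. This is why batching (more tokens per weight read) and quantization (fewer bytes per weight) matter so much.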
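A back-of-the-envelope bound follows directly: if every decode step must read all weights once, throughput cannot exceed bandwidth divided by weight size. The sketch below applies that to an A100 80GB (2.0 TB/s) and a 7B model. It is an idealized upper bound that ignores KV-cache reads, kernel overheads, and the point where large batches become compute-bound.

```python
def max_decode_tokens_per_s(weight_bytes: float,
                            bandwidth_bytes_per_s: float,
                            batch_size: int = 1) -> float:
    """Upper bound on decode throughput when weight reads dominate.

    One decode step reads every weight once and emits one token for each
    sequence in the batch.
    """
    steps_per_s = bandwidth_bytes_per_s / weight_bytes
    return steps_per_s * batch_size


A100_BANDWIDTH = 2.0e12  # bytes/s, A100 80GB

print(max_decode_tokens_per_s(14e9, A100_BANDWIDTH))                # BF16, batch 1
print(max_decode_tokens_per_s(3.5e9, A100_BANDWIDTH))               # INT4, batch 1
print(max_decode_tokens_per_s(14e9, A100_BANDWIDTH, batch_size=8))  # BF16, batch 8
```

Both levers show up in the formula: quantization shrinks `weight_bytes`, and batching multiplies the tokens produced per weight read.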

VRAM Math — Know Your Numbers

VRAM is the single most important constraint in LLM deployment. You need to fit: model weights + KV cache + activations + framework overhead.

Model Weight VRAM (bytes)

Parameters × Bytes-per-parameter

Precision	Bytes/Param	7B Model	70B Model
FP32	4 bytes	28 GB	280 GB
BF16 / FP16	2 bytes	14 GB	140 GB
FP8	1 byte	7 GB	70 GB
INT8	1 byte	7 GB	70 GB
INT4 (AWQ/GPTQ)	0.5 bytes	3.5 GB	35 GB
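The table rows can be reproduced with a few lines of code. This sketch uses decimal gigabytes (1 GB = 10^9 bytes), matching the table; the dictionary keys are naming choices made here for illustration.

```python
# Bytes per parameter for each precision in the table above.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "fp16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_vram_gb(num_params: float, precision: str) -> float:
    """Weight memory in GB: parameters x bytes per parameter."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision:5s} 7B: {weight_vram_gb(7e9, precision):6.1f} GB   "
          f"70B: {weight_vram_gb(70e9, precision):6.1f} GB")
```

Quantized formats such as AWQ and GPTQ also store per-group scaling metadata, so real INT4 checkpoints tend to come out somewhat larger than the 0.5 bytes/param estimate.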

KV Cache VRAM per Token (bytes)

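2 × num_layers × num_kv_heads × head_dim × sizeof(dtype)

The factor 2 covers keys and values. With standard multi-head attention, num_kv_heads × head_dim equals hidden_size, so the formula reduces to 2 × num_layers × hidden_size × sizeof(dtype). For a Llama-2 7B (32 layers, hidden=4096, BF16): 2 × 32 × 4096 × 2 = 524,288 bytes = 0.5 MB per token. At 2K context with batch 8: 0.5 MB × 2,048 × 8 = 8 GB just for KV cache. Models with grouped-query attention keep fewer KV heads: Llama-3 8B (32 layers, 8 KV heads, head_dim=128, BF16) needs 2 × 32 × 8 × 128 × 2 = 131,072 bytes ≈ 0.125 MB per token, 4× less.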
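The per-token formula can be sketched as a small calculator. It is written in terms of `num_kv_heads × head_dim`, which equals the hidden size for standard multi-head attention and is smaller when a model shares KV heads (grouped-query attention). The two 32-layer configurations below are assumed examples, and sizes are reported in binary units (1 MiB = 2^20 bytes).

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV-cache bytes for one token: keys + values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(per_token_bytes: int, context_len: int, batch_size: int) -> float:
    """Total KV cache for a batch of sequences, in GiB."""
    return per_token_bytes * context_len * batch_size / 2 ** 30


# Assumed configs: 32 layers, head_dim 128, BF16 (2 bytes per value).
full_heads = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
shared_heads = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

print(full_heads, kv_cache_gib(full_heads, context_len=2048, batch_size=8))
print(shared_heads, kv_cache_gib(shared_heads, context_len=2048, batch_size=8))
```

With 32 KV heads of dimension 128 (4,096 in total), the first configuration reproduces the 524,288 bytes per token computed above.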

Total VRAM Estimate

Total = Weights + KV_cache + Activations (~20%) + Framework_overhead (~500MB)
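Putting the terms together gives a quick fit check. The source formula does not say what the ~20% activation allowance is relative to; this sketch assumes it is 20% of the weight size, which is an interpretation rather than a stated rule. The function names are illustrative.

```python
def total_vram_gb(weights_gb: float, kv_cache_gb: float,
                  activation_fraction: float = 0.20,
                  overhead_gb: float = 0.5) -> float:
    """Rough VRAM estimate following the formula above.

    ASSUMPTION: activations are budgeted as a fraction of the weight size.
    """
    activations_gb = activation_fraction * weights_gb
    return weights_gb + kv_cache_gb + activations_gb + overhead_gb


def fits(total_gb: float, gpu_vram_gb: float) -> bool:
    """True if the estimate fits in the GPU's VRAM."""
    return total_gb <= gpu_vram_gb


# 7B model in BF16 (14 GB of weights) with an 8 GB KV cache.
estimate = total_vram_gb(weights_gb=14.0, kv_cache_gb=8.0)
print(f"{estimate:.1f} GB")
print("fits in 24 GB:", fits(estimate, 24), "| fits in 40 GB:", fits(estimate, 40))
```

Treat the result as a planning number; actual usage depends on the serving framework, its memory allocator, and how much KV cache it preallocates.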

Key GPUs for LLM Serving

GPU	VRAM	Bandwidth	FP16 TFLOPS (dense Tensor Core)	Best For
H100 SXM5	80 GB HBM3	3.35 TB/s	989	Large model serving, fine-tuning
H100 PCIe	80 GB HBM2e	2.0 TB/s	756	Inference, lighter workloads
A100 80GB	80 GB HBM2e	2.0 TB/s	312	Proven workhorse, wide support
A100 40GB	40 GB HBM2	1.6 TB/s	312	Mid-size models, cost-effective
RTX 4090	24 GB GDDR6X	1.0 TB/s	165	Dev/small models, budget serving
L40S	48 GB GDDR6	864 GB/s	362	Inference-optimized, good price

Compute-Bound vs Memory-Bound

Understanding when you're compute-bound vs memory-bandwidth-bound is crucial for optimization decisions. The arithmetic intensity of an operation (FLOPs divided by bytes transferred) determines which constraint you hit first. If an operation moves lots of data relative to the math it does, it hits the bandwidth ceiling before the compute ceiling.

📐

Arithmetic Intensity — The Roofline Number

Arithmetic intensity = FLOP / byte. The H100 SXM5 ratio is ~989 TFLOPS ÷ 3.35 TB/s ≈ 295 FLOP/byte. Any kernel below this ratio is memory-bound; above it is compute-bound. LLM decode (batch=1) has intensity ≈ 1 FLOP/byte — 295× below the ridge line. This is why decode is almost always memory-bound and why bandwidth matters far more than raw TFLOP counts for small-batch inference.
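The roofline numbers above can be checked with a short sketch. The decode intensity model used here is a simplification: it counts 2 FLOPs per weight per sequence and assumes each weight is read once per step, ignoring KV-cache and activation traffic. Function names are chosen for illustration.

```python
H100_PEAK_FLOPS = 989e12   # dense FP16 Tensor Core, FLOP/s
H100_BANDWIDTH = 3.35e12   # HBM3, bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the two ceilings meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline: performance is capped by compute or by bandwidth."""
    return min(peak_flops, intensity * bandwidth)


def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """Simplified decode intensity: 2 FLOPs per weight per sequence."""
    return 2.0 * batch_size / bytes_per_param


ridge = ridge_point(H100_PEAK_FLOPS, H100_BANDWIDTH)
intensity = decode_intensity(batch_size=1)
achieved = attainable_flops(intensity, H100_PEAK_FLOPS, H100_BANDWIDTH)

print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"decode at batch 1: {intensity:.0f} FLOP/byte, "
      f"{achieved / H100_PEAK_FLOPS:.2%} of peak compute")
```

Under this model, intensity grows linearly with batch size, which is the arithmetic reason batching moves decode toward the compute ceiling.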

🔢 Compute-Bound

Prefill phase (large batches)
Processing prompt tokens in parallel
Large batch sizes (>16)
Bottleneck: Tensor Core throughput
Fix: Better FLOP utilization, FP8

📡 Memory-Bound

Decode phase (generating tokens)
Small batch sizes (1–4)
Single-user interactive inference
Bottleneck: VRAM bandwidth
Fix: Quantization, larger batches

🎯

Production Rule of Thumb

The roofline model: if your workload performs fewer FLOPs per byte than the GPU's ridge point (peak FLOPS ÷ memory bandwidth), you're memory-bound. For the H100 SXM5 the ridge point is 989 TFLOPS / 3.35 TB/s ≈ 295 FLOP/byte. Most decode steps fall far below this, hence memory-bound.

NVLink & Inter-GPU Bandwidth

When serving models across multiple GPUs, inter-GPU bandwidth becomes a bottleneck. NVLink (direct GPU-to-GPU) is roughly 7× faster than a PCIe 5.0 x16 link:

NVLink 4.0 (H100): 900 GB/s bidirectional per GPU (18 links × 50 GB/s)
PCIe 5.0 x16: ~128 GB/s bidirectional; connects GPUs to the host and NICs, and is the GPU-to-GPU fallback when NVLink is absent
InfiniBand HDR (200 Gb/s per port, about 25 GB/s per direction): for multi-node tensor or pipeline parallelism
NVSwitch (in DGX): all-to-all NVLink at full bandwidth across 8 GPUs
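To see what the link speed means in practice, this sketch estimates the time to move one activation tensor between two GPUs over NVLink 4.0 and PCIe 5.0. It assumes a one-way transfer at half the quoted bidirectional bandwidth and ignores latency, protocol overhead, and the collective algorithm, so it is a lower bound. The tensor shape is an assumed example.

```python
def transfer_time_ms(num_bytes: float, bidirectional_gb_per_s: float) -> float:
    """Ideal one-way transfer time, using half the bidirectional bandwidth."""
    one_way_bytes_per_s = bidirectional_gb_per_s / 2 * 1e9
    return num_bytes / one_way_bytes_per_s * 1e3


# Assumed activation tensor: batch 8 x 2,048 tokens x hidden 4,096 in BF16.
activation_bytes = 8 * 2048 * 4096 * 2

nvlink_ms = transfer_time_ms(activation_bytes, 900)  # NVLink 4.0
pcie_ms = transfer_time_ms(activation_bytes, 128)    # PCIe 5.0

print(f"NVLink 4.0: {nvlink_ms:.3f} ms")
print(f"PCIe 5.0:   {pcie_ms:.3f} ms ({pcie_ms / nvlink_ms:.1f}x slower)")
```

Tensor parallelism exchanges activations like this in every layer, so the per-transfer gap is multiplied by the layer count on each forward pass.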

🔑

Key Takeaways

1. VRAM = weights + KV cache + activations + framework overhead. Know the formula by heart.
2. Decode is almost always memory-bandwidth-bound: your bottleneck is VRAM bandwidth, not compute.
3. Quantization buys you VRAM; batching buys you throughput.
4. H100's HBM3 bandwidth (3.35 TB/s) is about 33× that of DDR5 (~100 GB/s). This gap is why VRAM matters so much.

📚 Further reading

NVIDIA H100 Tensor Core GPU Architecture Whitepaper (nvidia.com)
Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures (arxiv.org)
Making Deep Learning Go Brrrr From First Principles (horace.io)
NVIDIA Deep Learning Performance Guide: GPU Background (nvidia.com)
Tim Dettmers: Which GPU(s) to Get for Deep Learning (timdettmers.com)
