
GPU & VRAM Fundamentals

5 min · Beginner · 1,103 words
Understand the hardware that powers all LLM inference. Learn GPU architecture, memory hierarchy, bandwidth constraints, and how to calculate VRAM requirements for any model.

What you will learn

01 Why GPUs? The Parallelism Story
02 GPU Architecture Essentials
03 VRAM Math: Know Your Numbers
04 Key GPUs for LLM Serving
05 Compute-Bound vs Memory-Bound
06 NVLink & Inter-GPU Bandwidth

Why GPUs? The Parallelism Story

A typical CPU has 8–32 powerful cores optimized for sequential, complex tasks. A GPU has thousands of smaller cores designed to apply the same operation to many data points simultaneously. That design is a natural fit for matrix multiplication, which accounts for the vast majority of the arithmetic in a neural network.

💡
The Core Intuition
An A100 GPU has 6,912 CUDA cores and 432 Tensor Cores. During an LLM forward pass, you're multiplying a [batch × seq × hidden] tensor by a [hidden × hidden] weight matrix — billions of multiply-adds happening in parallel. This is why GPUs are irreplaceable.
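To put a number on "billions of multiply-adds", here is a minimal sketch that counts the work in one such matmul. The shapes in the example (batch 8, 2,048 tokens, hidden size 4,096) are illustrative assumptions, not figures from this lesson, and the function name is made up for the example.

```python
def matmul_flops(batch: int, seq: int, hidden_in: int, hidden_out: int) -> int:
    """FLOPs for a [batch, seq, hidden_in] x [hidden_in, hidden_out] matmul.

    Every output element needs hidden_in multiply-adds, and each
    multiply-add is conventionally counted as 2 FLOPs.
    """
    multiply_adds = batch * seq * hidden_in * hidden_out
    return 2 * multiply_adds


# Hypothetical shapes chosen only for illustration.
flops = matmul_flops(batch=8, seq=2048, hidden_in=4096, hidden_out=4096)
print(f"{flops / 1e9:.1f} GFLOPs for one hidden x hidden projection")  # 549.8
```

A transformer layer contains several projections of this kind, so a single forward pass over a prompt quickly reaches trillions of FLOPs.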

GPU Architecture Essentials

Streaming Multiprocessors (SMs)

GPUs are organized into Streaming Multiprocessors. Each SM contains CUDA cores, Tensor Cores, registers, and shared memory (L1 cache). The H100 has 132 SMs; the A100 has 108. When you launch a CUDA kernel, the GPU scheduler assigns thread blocks to available SMs.

Tensor Cores

Tensor Cores are specialized matrix units introduced in Volta (V100). In Volta, each Tensor Core performs one 4×4 matrix-multiply-accumulate (MMA) per clock cycle; later generations process larger tiles per instruction. On H100 SXM5, Tensor Cores deliver up to 989 TFLOPS for dense FP16 and up to 1,979 TOPS for dense INT8 (3,958 TOPS with 2:4 structured sparsity). They are the engines behind fast inference.
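The MMA operation can be illustrated in plain Python. This is a conceptual sketch only: it shows how a large matmul decomposes into 4×4 multiply-accumulate tiles, not how Tensor Cores are actually programmed, and all function names are invented for the example.

```python
TILE = 4  # tile edge, matching the 4x4 MMA described above


def mma_4x4(a, b, c):
    """Return a @ b + c for 4x4 lists of lists (one multiply-accumulate)."""
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE)) for j in range(TILE)]
        for i in range(TILE)
    ]


def tile(m, row, col):
    """Slice the 4x4 tile whose top-left corner is (row, col)."""
    return [r[col:col + TILE] for r in m[row:row + TILE]]


def tiled_matmul(a, b):
    """Multiply square matrices (size a multiple of TILE) one tile at a time."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            acc = [[0] * TILE for _ in range(TILE)]  # accumulator tile
            for k in range(0, n, TILE):
                acc = mma_4x4(tile(a, i, k), tile(b, k, j), acc)
            for r in range(TILE):
                out[i + r][j:j + TILE] = acc[r]
    return out
```

The accumulator tile stays in place while input tiles stream past it, which is the same reuse pattern that makes on-chip SRAM so valuable on real hardware.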

Figure: Data flow during one matmul. ① Weights live in VRAM (HBM3, 80 GB, 3.35 TB/s) → ② cached in L2 (50 MB, ~12 TB/s) → ③ tiled into L1 / shared SRAM on the SM (256 KB, ~33 TB/s) → ④ MMA compute on the Tensor Core (4×4 MMA per cycle, 989 TFLOPS). Each tier is roughly 3–10× faster but 100–1000× smaller than the one before it. Decode bottleneck: every token re-streams the entire weight matrix through this whole pipeline.
How a single matmul travels the GPU. Weights start in VRAM (huge but slow), get pulled into L2, then tiled into the SM's L1/shared SRAM, and are finally fed to the Tensor Core, which runs a 4×4 multiply-accumulate every cycle.

Memory Hierarchy

| Level | Type | Size (H100) | Bandwidth | Latency |
|---|---|---|---|---|
| L1 / Shared Mem | On-SM SRAM | 256 KB/SM | ~33 TB/s | ~20 cycles |
| L2 Cache | On-chip SRAM | 50 MB | ~12 TB/s | ~100 cycles |
| VRAM (HBM3) | High-Bandwidth Memory | 80 GB | 3.35 TB/s | ~400 cycles |
| CPU RAM | DDR5 | TB range | ~100 GB/s | ~100 ns |
| NVMe SSD | PCIe Flash | TB range | ~12 GB/s | ~100 µs |
Memory bandwidth by tier on H100. VRAM is 33× faster than CPU RAM — but L1 is 10× faster still. The decode bottleneck lives at the VRAM tier.
⚠️
The Bandwidth Wall
LLM decode is almost always memory-bandwidth bound, not compute bound. At batch size 1, you use ~1% of A100's compute but 100% of its bandwidth. The bottleneck is loading model weights from VRAM — not doing the math. This is why batching and quantization matter so much.
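A rough consequence of the bandwidth wall: if every generated token has to re-read all weights from VRAM, bandwidth divided by model size gives a ceiling on batch-1 decode speed. The sketch below uses the 14 GB figure for a 7B model in BF16 and the bandwidth numbers quoted in this lesson; it ignores KV cache reads and kernel overheads, so real throughput will be lower.

```python
def max_decode_tokens_per_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-1 decode speed when each token re-reads all weights."""
    return bandwidth_bytes_per_s / weight_bytes


weights_7b_bf16 = 7e9 * 2   # 7B parameters x 2 bytes = 14 GB
a100_80gb_bw = 2.0e12       # 2.0 TB/s
h100_sxm5_bw = 3.35e12      # 3.35 TB/s

print(f"A100 80GB: {max_decode_tokens_per_s(weights_7b_bf16, a100_80gb_bw):.0f} tokens/s")  # 143
print(f"H100 SXM5: {max_decode_tokens_per_s(weights_7b_bf16, h100_sxm5_bw):.0f} tokens/s")  # 239
```

Halving the bytes per parameter doubles this ceiling, and serving several requests per weight read amortizes the same traffic, which is the intuition behind quantization and batching.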

VRAM Math — Know Your Numbers

VRAM is the single most important constraint in LLM deployment. You need to fit: model weights + KV cache + activations + framework overhead.

Model Weight VRAM (bytes)
Parameters × Bytes-per-parameter
| Precision | Bytes/Param | 7B Model | 70B Model |
|---|---|---|---|
| FP32 | 4 bytes | 28 GB | 280 GB |
| BF16 / FP16 | 2 bytes | 14 GB | 140 GB |
| FP8 | 1 byte | 7 GB | 70 GB |
| INT8 | 1 byte | 7 GB | 70 GB |
| INT4 (AWQ/GPTQ) | 0.5 bytes | 3.5 GB | 35 GB |
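The weight formula is simple enough to keep as a helper. This sketch reproduces the table values; it uses decimal gigabytes (1 GB = 10^9 bytes) to match them, and the dictionary keys are names chosen for this example.

```python
# Bytes per parameter for each precision in the table above.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "fp16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_vram_gb(num_params: float, precision: str) -> float:
    """VRAM needed for the weights alone, in decimal GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("fp32", "bf16", "int8", "int4"):
    print(f"{precision}: 7B = {weight_vram_gb(7e9, precision):g} GB, "
          f"70B = {weight_vram_gb(70e9, precision):g} GB")
```

Real quantized checkpoints carry some extra data such as scales and zero points, so INT4 files tend to be slightly larger than the 0.5 bytes/param estimate.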
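KV Cache VRAM per Token (bytes)
2 × num_layers × num_kv_heads × head_dim × sizeof(dtype)

The leading 2 covers keys and values. With standard multi-head attention, num_kv_heads × head_dim equals hidden_size, so the formula reduces to 2 × num_layers × hidden_size × sizeof(dtype). For a 7B-class model with full multi-head attention such as Llama-2 7B (32 layers, hidden=4096, BF16): 2 × 32 × 4096 × 2 = 524,288 bytes ≈ 0.5 MB per token. At 2K context with batch 8: 0.5 MB × 2000 × 8 = 8 GB just for KV cache. Llama-3 8B uses grouped-query attention with 8 KV heads of dimension 128, so its per-token cost is 2 × 32 × 1024 × 2 = 131,072 bytes ≈ 128 KB, and the same workload needs about 2 GB.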
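The per-token formula and the batch calculation above can be sketched as two small functions. The parameter name kv_dim is chosen here for illustration: it equals hidden_size for standard multi-head attention, and num_kv_heads × head_dim for models that use grouped-query attention and therefore cache fewer heads.

```python
def kv_cache_bytes_per_token(num_layers: int, kv_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of keys plus values stored per token, summed over all layers."""
    return 2 * num_layers * kv_dim * dtype_bytes


def kv_cache_gb(num_layers: int, kv_dim: int, context_len: int,
                batch_size: int, dtype_bytes: int = 2) -> float:
    """Total KV cache in decimal GB for a batch of full-length sequences."""
    per_token = kv_cache_bytes_per_token(num_layers, kv_dim, dtype_bytes)
    return per_token * context_len * batch_size / 1e9


# 32 layers, kv_dim = 4096, BF16: the worked example from the text.
print(kv_cache_bytes_per_token(32, 4096))           # 524288
# Exact value is slightly above the rounded "8 GB" because 524,288 B > 0.5 MB.
print(f"{kv_cache_gb(32, 4096, 2000, 8):.1f} GB")   # 8.4 GB
```

Because the cache grows linearly with both context length and batch size, it is usually the term that decides how many concurrent requests fit on a GPU.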

Total VRAM Estimate
Total = Weights + KV_cache + Activations (~20%) + Framework_overhead (~500MB)
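Putting the terms together, a minimal estimator might look like the following. One assumption is made loudly here: the "~20%" activation term is interpreted as 20% of the weight footprint, which the lesson does not state explicitly. The function names and the 24 GB capacity check are illustrative.

```python
def total_vram_gb(weights_gb: float, kv_cache_gb: float,
                  activation_fraction: float = 0.20,
                  overhead_gb: float = 0.5) -> float:
    """Weights + KV cache + activations + framework overhead, in GB.

    ASSUMPTION: activations are modeled as a fraction of the weight size.
    """
    activations_gb = activation_fraction * weights_gb
    return weights_gb + kv_cache_gb + activations_gb + overhead_gb


def fits(required_gb: float, gpu_vram_gb: float) -> bool:
    """True if the estimate fits in the given VRAM capacity."""
    return required_gb <= gpu_vram_gb


# 7B model in BF16 (14 GB) with the 8 GB KV cache from the example above.
estimate = total_vram_gb(weights_gb=14.0, kv_cache_gb=8.0)
print(f"{estimate:.1f} GB")   # 25.3 GB
print(fits(estimate, 24.0))   # False: too large for a 24 GB card
print(fits(estimate, 40.0))   # True
```

Treat the result as a planning number; measured usage depends on the serving framework and its memory allocator.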

Key GPUs for LLM Serving

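| GPU | VRAM | Bandwidth | FP16 TFLOPS | Best For |
|---|---|---|---|---|
| H100 SXM5 | 80 GB HBM3 | 3.35 TB/s | 989 | Large model serving, fine-tuning |
| H100 PCIe | 80 GB HBM2e | 2.0 TB/s | 756 | Inference, lighter workloads |
| A100 80GB | 80 GB HBM2e | 2.0 TB/s | 312 | Proven workhorse, wide support |
| A100 40GB | 40 GB HBM2 | 1.6 TB/s | 312 | Mid-size models, cost-effective |
| RTX 4090 | 24 GB GDDR6X | 1.0 TB/s | 82.6 | Dev/small models, budget serving |
| L40S | 48 GB GDDR6 | 864 GB/s | 91.6 | Inference-optimized, good price |

Note: the H100 and A100 rows list dense FP16 Tensor Core throughput. The RTX 4090 and L40S rows list non-Tensor (shader) throughput; their dense FP16 Tensor Core peaks are considerably higher (roughly 165 and 362 TFLOPS respectively), so the column is not directly comparable across rows.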

Compute-Bound vs Memory-Bound

Understanding when you're compute-bound vs memory-bandwidth-bound is crucial for optimization decisions. The arithmetic intensity of an operation (FLOPs divided by bytes transferred) determines which constraint you hit first. If an operation moves lots of data relative to the math it does, it hits the bandwidth ceiling before the compute ceiling.

📐
Arithmetic Intensity — The Roofline Number
Arithmetic intensity = FLOP / byte. The H100 SXM5 ratio is ~989 TFLOPS ÷ 3.35 TB/s ≈ 295 FLOP/byte. Any kernel below this ratio is memory-bound; above it is compute-bound. LLM decode (batch=1) has intensity ≈ 1 FLOP/byte — 295× below the ridge line. This is why decode is almost always memory-bound and why bandwidth matters far more than raw TFLOP counts for small-batch inference.
🔢 Compute-Bound
  • Prefill phase (large batches)
  • Processing prompt tokens in parallel
  • Large batch sizes (>16)
  • Bottleneck: Tensor Core throughput
  • Fix: Better FLOP utilization, FP8
📡 Memory-Bound
  • Decode phase (generating tokens)
  • Small batch sizes (1–4)
  • Single-user interactive inference
  • Bottleneck: VRAM bandwidth
  • Fix: Quantization, larger batches
🎯
Production Rule of Thumb
The roofline model: if your kernel performs fewer FLOPs per byte than the GPU's ridge point (peak FLOPS ÷ memory bandwidth), you're memory-bound. The H100 SXM5 ridge point is ≈ 989 TFLOPS / 3.35 TB/s ≈ 295 FLOP/byte. Most decode steps are far below this, hence memory-bound.
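The roofline check described above can be sketched in a few lines. The decode intensity model is a deliberate simplification and an assumption of this example: it counts 2 FLOPs per weight per sequence in the batch, assumes weights are read once per step, and ignores KV cache and activation traffic.

```python
def ridge_point(peak_tflops: float, bandwidth_tb_per_s: float) -> float:
    """GPU ridge point in FLOP/byte (TFLOPS divided by TB/s)."""
    return peak_tflops / bandwidth_tb_per_s


def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOP/byte of decode matmuls (simplified model, see lead-in)."""
    return 2.0 * batch_size / bytes_per_param


def classify(intensity: float, ridge: float) -> str:
    """Label a kernel by which roofline ceiling it hits first."""
    return "compute-bound" if intensity >= ridge else "memory-bound"


h100 = ridge_point(989, 3.35)
print(f"H100 SXM5 ridge point: {h100:.0f} FLOP/byte")   # 295
for batch in (1, 16, 512):
    ai = decode_intensity(batch)
    print(f"batch {batch}: {ai:.0f} FLOP/byte -> {classify(ai, h100)}")
```

Under this simplified model, BF16 decode only crosses the H100 ridge point at batch sizes in the hundreds; in practice KV cache reads push the crossover even further out.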

NVLink & Inter-GPU Bandwidth

When serving models across multiple GPUs, inter-GPU bandwidth becomes a bottleneck. NVLink (direct GPU-to-GPU) is roughly 7x faster than PCIe 5.0 x16 and about 14x faster than PCIe 4.0 x16:

  • NVLink 4.0 (H100): 900 GB/s bidirectional per GPU (18 links at 50 GB/s each)
  • PCIe 5.0 x16: ~128 GB/s bidirectional; links GPUs to the host CPU and network adapters, and carries GPU-to-GPU traffic when NVLink is absent
  • InfiniBand HDR (200 Gb/s per port, ~25 GB/s): for multi-node communication
  • NVSwitch (in DGX): all-to-all NVLink at full bandwidth across 8 GPUs
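To get a feel for what these link speeds mean, the sketch below estimates how long a one-way transfer takes over NVLink versus PCIe. It assumes one direction gets half of the quoted bidirectional figure and ignores latency and protocol overhead; the payload (one BF16 activation tensor of shape [8 × 2048 × 4096]) is an illustrative choice.

```python
def transfer_ms(num_bytes: float, bidirectional_gb_per_s: float) -> float:
    """One-way transfer time in milliseconds.

    ASSUMPTION: a single direction gets half the bidirectional bandwidth.
    """
    one_way_bytes_per_s = bidirectional_gb_per_s / 2 * 1e9
    return num_bytes / one_way_bytes_per_s * 1e3


# Hypothetical payload: [batch 8, seq 2048, hidden 4096] activations in BF16.
payload_bytes = 8 * 2048 * 4096 * 2  # 134,217,728 bytes

nvlink_ms = transfer_ms(payload_bytes, 900.0)  # NVLink 4.0 figure from the list
pcie_ms = transfer_ms(payload_bytes, 128.0)    # PCIe 5.0 figure from the list
print(f"NVLink 4.0: {nvlink_ms:.2f} ms")       # 0.30 ms
print(f"PCIe 5.0:   {pcie_ms:.2f} ms")         # 2.10 ms
print(f"Ratio: {pcie_ms / nvlink_ms:.1f}x")    # 7.0x
```

Tensor parallelism exchanges tensors like this many times per forward pass, so the per-transfer gap compounds quickly.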
🔑
Key Takeaways
1. VRAM = weights + KV cache + activations + framework overhead. Know the formula by heart.
2. Decode is almost always memory-bandwidth-bound: your bottleneck is VRAM bandwidth, not compute.
3. Quantization buys you VRAM; batching buys you throughput.
4. H100's HBM3 bandwidth (3.35 TB/s) is about 33x that of DDR5 CPU RAM (~100 GB/s). This gap is why VRAM matters so much.
