
GPU & VRAM Fundamentals
Understand the hardware that powers all LLM inference. Learn GPU architecture, memory hierarchy, bandwidth constraints, and how to calculate VRAM requirements for any model.
Why GPUs? The Parallelism Story
A typical CPU has 8–32 powerful cores optimized for sequential, complex tasks. A GPU has thousands of smaller cores designed to apply the same operation to many data points simultaneously. That is a natural fit for matrix multiplication, which accounts for the large majority of the compute in a neural network.
GPU Architecture Essentials
Streaming Multiprocessors (SMs)
GPUs are organized into Streaming Multiprocessors. Each SM contains CUDA cores, Tensor Cores, registers, and shared memory (L1 cache). The H100 has 132 SMs; the A100 has 108. When you launch a CUDA kernel, the GPU scheduler assigns thread blocks to available SMs.
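As a rough illustration of block-to-SM assignment, the sketch below counts how many scheduling rounds ("waves") a kernel launch needs. It assumes every SM holds a fixed number of resident blocks (`blocks_per_sm`, a hypothetical parameter that in practice depends on register and shared-memory usage) and that all blocks take equal time; the real scheduler is dynamic.

```python
import math

def num_waves(thread_blocks: int, num_sms: int, blocks_per_sm: int = 1) -> int:
    """Scheduling rounds needed if each SM runs blocks_per_sm blocks at a time."""
    concurrent_blocks = num_sms * blocks_per_sm
    return math.ceil(thread_blocks / concurrent_blocks)

# SM counts from the text: H100 = 132, A100 = 108.
for name, sms in [("H100", 132), ("A100", 108)]:
    print(name, num_waves(thread_blocks=264, num_sms=sms))
```

A launch of 264 blocks fills the H100 exactly twice, while on the A100 the third wave runs with most SMs idle.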
Tensor Cores
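Tensor Cores are specialized matrix units introduced in Volta (V100). Each Volta Tensor Core performs one 4×4 matrix multiply-accumulate (MMA), D = A×B + C, per clock cycle; later generations operate on larger tiles. On H100 SXM, Tensor Cores deliver up to 989 TFLOPS for dense FP16 and up to 1,979 TOPS for dense INT8 (the often-quoted 3,958 TOPS applies only with 2:4 structured sparsity). They are the engines behind fast inference.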
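To make the MMA operation concrete, here is a plain-Python sketch of what one 4×4 multiply-accumulate computes. It only models the arithmetic, not the hardware: real Tensor Cores use reduced-precision inputs and are reached through CUDA libraries, and the function name `mma_4x4` is made up for this example.

```python
def mma_4x4(a, b, c):
    """Return D = A @ B + C for 4x4 matrices given as lists of lists."""
    n = 4
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]

d = mma_4x4(identity, ones, ones)  # I @ ones + ones

# One 4x4x4 MMA is 64 fused multiply-adds, i.e. 128 floating-point operations.
FMAS_PER_MMA = 4 * 4 * 4
FLOPS_PER_MMA = 2 * FMAS_PER_MMA
print(d[0], FLOPS_PER_MMA)
```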
Memory Hierarchy
| Level | Type | Size (H100) | Bandwidth | Latency |
|---|---|---|---|---|
| L1 / Shared Mem | On-SM SRAM | 256 KB/SM | ~33 TB/s | ~20 cycles |
| L2 Cache | On-chip SRAM | 50 MB | ~12 TB/s | ~100 cycles |
| VRAM (HBM3) | High-Bandwidth Memory | 80 GB | 3.35 TB/s | ~400 cycles |
| CPU RAM | DDR5 | TB range | ~100 GB/s | ~100 ns |
| NVMe SSD | PCIe Flash | TB range | ~12 GB/s | ~100 µs |
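One way to read the bandwidth column is to ask how long it takes to stream a fixed amount of data once from each level. The sketch below uses the approximate figures from the table and a 14 GB payload (a 7B model in BF16). It ignores latency and transfer overheads, so treat the results as lower bounds.

```python
# Approximate bandwidths from the table above, in bytes per second.
BANDWIDTH_BYTES_PER_S = {
    "VRAM (HBM3)": 3.35e12,
    "CPU RAM (DDR5)": 100e9,
    "NVMe SSD": 12e9,
}

def transfer_time_s(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move num_bytes once at the given sustained bandwidth."""
    return num_bytes / bandwidth_bytes_per_s

weights_bytes = 14e9  # 7B parameters x 2 bytes (BF16)
for level, bw in BANDWIDTH_BYTES_PER_S.items():
    ms = transfer_time_s(weights_bytes, bw) * 1e3
    print(f"{level:15s} {ms:10.2f} ms")
```

Under these assumptions, reading the weights from VRAM takes about 4 ms, from CPU RAM about 140 ms, and from NVMe more than a second.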
VRAM Math — Know Your Numbers
VRAM is the single most important constraint in LLM deployment. You need to fit: model weights + KV cache + activations + framework overhead.
| Precision | Bytes/Param | 7B Model | 70B Model |
|---|---|---|---|
| FP32 | 4 bytes | 28 GB | 280 GB |
| BF16 / FP16 | 2 bytes | 14 GB | 140 GB |
| FP8 | 1 byte | 7 GB | 70 GB |
| INT8 | 1 byte | 7 GB | 70 GB |
| INT4 (AWQ/GPTQ) | 0.5 bytes | 3.5 GB | 35 GB |
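The table is just parameters × bytes per parameter, which a small helper can reproduce. This is a minimal sketch: it counts weights only and ignores the per-group scales and zero-points that INT4 formats such as AWQ and GPTQ store alongside the weights, so real files are somewhat larger.

```python
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "fp16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes) for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"{precision:5s} 7B: {weight_memory_gb(7e9, precision):6.1f} GB   "
          f"70B: {weight_memory_gb(70e9, precision):6.1f} GB")
```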
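KV cache per token = 2 (K and V) × layers × KV width × bytes per element. With full multi-head attention the KV width equals the hidden size, so for a model with 32 layers and hidden=4096 in BF16 (for example Llama-2 7B): 2 × 32 × 4096 × 2 = 524,288 bytes = 0.5 MiB per token. At 2K context with batch 8: 0.5 MiB × 2,048 × 8 = 8 GiB just for KV cache. Llama-3 8B has the same layer count and hidden size but uses grouped-query attention with 8 KV heads of dimension 128 (KV width 1024), so it needs 2 × 32 × 1024 × 2 = 131,072 bytes = 128 KiB per token, or 2 GiB for the same workload.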
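The KV cache arithmetic can be wrapped in two small functions. The parameter names (`n_kv_heads`, `head_dim`) are choices made for this sketch. Setting `n_kv_heads` to the full attention head count gives a KV width equal to the hidden size and reproduces the 524,288-byte figure; a grouped-query configuration with fewer KV heads shrinks it proportionally.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of K and V stored per token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def kv_cache_bytes(per_token: int, context_len: int, batch_size: int) -> int:
    """Total KV cache size for a batch of sequences at full context."""
    return per_token * context_len * batch_size

# 32 layers, KV width 32 x 128 = 4096 (full multi-head attention), BF16.
mha_per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
mha_total = kv_cache_bytes(mha_per_token, context_len=2048, batch_size=8)

# Same shape with 8 KV heads (grouped-query attention): KV width 1024.
gqa_per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
gqa_total = kv_cache_bytes(gqa_per_token, context_len=2048, batch_size=8)

print(mha_per_token, mha_total / 2**30, "GiB")
print(gqa_per_token, gqa_total / 2**30, "GiB")
```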
Key GPUs for LLM Serving
| GPU | VRAM | Bandwidth | FP16 Tensor TFLOPS (dense) | Best For |
|---|---|---|---|---|
| H100 SXM5 | 80 GB HBM3 | 3.35 TB/s | 989 | Large model serving, fine-tuning |
| H100 PCIe | 80 GB HBM2e | 2.0 TB/s | 756 | Inference, lighter workloads |
| A100 80GB | 80 GB HBM2e | 2.0 TB/s | 312 | Proven workhorse, wide support |
| A100 40GB | 40 GB HBM2 | 1.6 TB/s | 312 | Mid-size models, cost-effective |
| RTX 4090 | 24 GB GDDR6X | 1.0 TB/s | 165 (FP32 accumulate) | Dev/small models, budget serving |
| L40S | 48 GB GDDR6 | 864 GB/s | 362 | Inference-optimized, good price |
Compute-Bound vs Memory-Bound
Understanding when you're compute-bound vs memory-bandwidth-bound is crucial for optimization decisions. The arithmetic intensity of an operation (FLOPs divided by bytes transferred) determines which constraint you hit first. If an operation moves lots of data relative to the math it does, it hits the bandwidth ceiling before the compute ceiling.
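The roofline model turns this into a formula: attainable throughput is the smaller of peak compute and arithmetic intensity × memory bandwidth. The sketch below plugs in the H100 SXM figures used in this lesson (989 TFLOPS, 3.35 TB/s). The intensity of roughly 1 FLOP/byte for batch-1 decode is an assumption: about 2 FLOPs per 2-byte BF16 weight read.

```python
def attainable_tflops(arith_intensity: float, peak_tflops: float,
                      bandwidth_tb_s: float) -> float:
    """Roofline bound: min(compute ceiling, intensity x bandwidth ceiling)."""
    return min(peak_tflops, arith_intensity * bandwidth_tb_s)

def ridge_point(peak_tflops: float, bandwidth_tb_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the two ceilings meet."""
    return peak_tflops / bandwidth_tb_s

H100_PEAK_TFLOPS = 989.0
H100_BANDWIDTH_TB_S = 3.35

print(ridge_point(H100_PEAK_TFLOPS, H100_BANDWIDTH_TB_S))
print(attainable_tflops(1.0, H100_PEAK_TFLOPS, H100_BANDWIDTH_TB_S))
print(attainable_tflops(1000.0, H100_PEAK_TFLOPS, H100_BANDWIDTH_TB_S))
```

An operation needs roughly 295 FLOPs per byte moved before the H100 becomes compute-bound; at 1 FLOP/byte it can reach only about 3.35 TFLOPS, well under 1% of peak.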
- Compute-bound: prefill phase (large batches)
  - Processing prompt tokens in parallel
  - Large batch sizes (>16)
  - Bottleneck: Tensor Core throughput
  - Fix: Better FLOP utilization, FP8
- Memory-bound: decode phase (generating tokens)
  - Small batch sizes (1–4)
  - Single-user interactive inference
  - Bottleneck: VRAM bandwidth
  - Fix: Quantization, larger batches
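For the memory-bound decode case, a back-of-the-envelope ceiling follows from the fact that each decode step has to read every weight from VRAM once. The sketch below assumes that weight reads dominate (KV cache traffic and the compute ceiling are ignored) and that one step emits one token per sequence in the batch, so real throughput will be lower.

```python
def max_decode_tokens_per_s(bandwidth_bytes_per_s: float, weight_bytes: float,
                            batch_size: int = 1) -> float:
    """Upper bound on decode throughput when weight reads are the bottleneck."""
    steps_per_s = bandwidth_bytes_per_s / weight_bytes
    return batch_size * steps_per_s

H100_BANDWIDTH = 3.35e12  # bytes/s

bf16_7b = max_decode_tokens_per_s(H100_BANDWIDTH, 14e9)                 # 7B in BF16
int4_7b = max_decode_tokens_per_s(H100_BANDWIDTH, 3.5e9)                # 7B in INT4
bf16_b4 = max_decode_tokens_per_s(H100_BANDWIDTH, 14e9, batch_size=4)   # batch of 4

print(round(bf16_7b), round(int4_7b), round(bf16_b4))
```

Both fixes listed above show up directly: quantizing to INT4 cuts the bytes read per step by 4x, and a batch of 4 reuses each weight read for 4 tokens.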
NVLink & Inter-GPU Bandwidth
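When serving models across multiple GPUs, inter-GPU bandwidth becomes a bottleneck. NVLink (direct GPU-to-GPU) is roughly 7x faster than PCIe 5.0 x16 and about 14x faster than PCIe 4.0 x16:
- NVLink 4.0 (H100): 900 GB/s bidirectional per GPU (18 links × 50 GB/s)
- PCIe 5.0 x16: ~128 GB/s bidirectional; connects GPUs to the host and NICs, and is the GPU-to-GPU path within a node when NVLink is absent
- InfiniBand HDR (200 Gbps per port, about 25 GB/s per direction): for cross-node communication in multi-node serving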
- NVSwitch (in DGX): all-to-all NVLink at full bandwidth across 8 GPUs
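To see what these numbers mean for tensor parallelism, the sketch below estimates the time for one all-reduce using the standard ring all-reduce traffic volume of 2(N−1)/N × message size per GPU. The per-direction bandwidths are assumptions, taken as half of the bidirectional figures above (450 GB/s for NVLink 4.0, 64 GB/s for PCIe 5.0 x16). The 64 MB message size is hypothetical, and latency and protocol overhead are ignored.

```python
def ring_allreduce_time_s(message_bytes: float, num_gpus: int,
                          link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce across num_gpus GPUs."""
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic_per_gpu / link_bytes_per_s

MESSAGE_BYTES = 64e6      # hypothetical activation tensor size
NVLINK_PER_DIR = 450e9    # assumed: half of 900 GB/s bidirectional
PCIE5_PER_DIR = 64e9      # assumed: half of ~128 GB/s bidirectional

t_nvlink = ring_allreduce_time_s(MESSAGE_BYTES, 8, NVLINK_PER_DIR)
t_pcie = ring_allreduce_time_s(MESSAGE_BYTES, 8, PCIE5_PER_DIR)

print(f"NVLink: {t_nvlink * 1e6:.0f} us, PCIe 5.0: {t_pcie * 1e6:.0f} us, "
      f"ratio: {t_pcie / t_nvlink:.1f}x")
```

Tensor parallelism performs all-reduces in every layer, so under these assumptions the roughly 7x gap applies to every layer of every forward pass.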