DAY 1 · PM

02 / 09

Quantization & Batching

5 min · Beginner · 1,048 words

Master the two most impactful levers for production LLM efficiency: reducing weight precision and maximizing GPU utilization through smart batching.

What you will learn

01 Why Quantization?

02 Precision Formats Explained

03 Quantization Methods

04 Batching Strategies

05 Throughput vs Latency Tradeoff

Why Quantization?

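A 70B-parameter model in FP16 needs about 140 GB of VRAM for the weights alone (70 billion parameters × 2 bytes each). Two A100 80GB cards barely fit it, with little room left for the KV cache. Quantize to INT4 and the weights drop to about 35 GB: single-GPU territory. Quantization is the fastest way to change your hardware economics.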

50%

VRAM saved by INT8 vs FP16

75%

VRAM saved by INT4 vs FP16

2×

throughput gain from INT8 Tensor Cores

<2%

quality loss with INT8 (typical)
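
The arithmetic behind the VRAM figures above can be sketched in a few lines. The helper `weight_memory_gb` is a hypothetical name, not a library function. It estimates weight memory only, assuming decimal gigabytes and ignoring the KV cache, activations, and quantization metadata such as scales and zero points.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Estimate weight memory in decimal GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


PARAMS_70B = 70e9  # the 70B model used as the running example

fp16_gb = weight_memory_gb(PARAMS_70B, 16)
int8_gb = weight_memory_gb(PARAMS_70B, 8)
int4_gb = weight_memory_gb(PARAMS_70B, 4)

# Fraction of weight memory saved relative to FP16
int8_saving = 1 - int8_gb / fp16_gb
int4_saving = 1 - int4_gb / fp16_gb

print(f"FP16: {fp16_gb:.0f} GB, INT8: {int8_gb:.0f} GB, INT4: {int4_gb:.0f} GB")
print(f"INT8 saves {int8_saving:.0%}, INT4 saves {int4_saving:.0%}")
```

Real deployments need headroom on top of these numbers, so treat the result as a lower bound.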

Precision Formats Explained

Format	Bits	Range	Use Case	Hardware
FP32	32	±3.4×10³⁸	Training (reference)	All GPUs
BF16	16	±3.4×10³⁸	Training / inference	Ampere+
FP16	16	±65,504	Inference (standard)	All modern GPUs
FP8 (E4M3)	8	±448	Fast inference, near-BF16 quality	Hopper (H100) / Ada+
INT8	8	-128 to 127	Inference (safe)	Ampere+
INT4	4	-8 to 7	Memory-constrained	Any (emulated or native)

Precision formats for a 7B model. Each halving of bit width (32 → 16 → 8 → 4 bits) halves the VRAM needed for weights; quality tradeoffs compound as you go lower.

📌

BF16 vs FP16

Both are 16-bit, but BF16 has the same exponent range as FP32 (8 exponent bits) at the cost of less precision (7 mantissa bits vs FP16's 10). This makes BF16 more numerically stable for training — overflow is rare. FP16 has better precision but can overflow on large values. For inference, both work fine; prefer BF16 on modern hardware.
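
The range difference follows directly from the bit layout. As a rough sketch, the hypothetical helper `max_finite` below computes the largest finite value of an IEEE-754-style format from its exponent and mantissa bit counts. It assumes the all-ones exponent is reserved for infinity and NaN, which holds for FP32, BF16, and FP16 but not for FP8 E4M3, so E4M3 is left out.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent field is reserved, so the largest usable
    # unbiased exponent equals the bias.
    max_exponent = bias
    largest_significand = 2 - 2.0 ** -mantissa_bits
    return largest_significand * 2.0 ** max_exponent


fp32_max = max_finite(exponent_bits=8, mantissa_bits=23)
bf16_max = max_finite(exponent_bits=8, mantissa_bits=7)
fp16_max = max_finite(exponent_bits=5, mantissa_bits=10)

print(f"FP32 max: {fp32_max:.3e}")
print(f"BF16 max: {bf16_max:.3e}")
print(f"FP16 max: {fp16_max:.0f}")
```

BF16 and FP32 share 8 exponent bits, so their maxima are nearly identical; FP16's 5 exponent bits cap it at 65,504.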

Quantization Methods

Post-Training Quantization (PTQ)

Quantize a trained model without any retraining. Fast, but can lose accuracy on complex tasks.

Python · bitsandbytes INT8

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load model in INT8 (50% VRAM reduction)
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # outlier threshold
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
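
To make the idea of post-training quantization concrete, here is a minimal pure-Python sketch of symmetric absmax INT8 quantization over a single list of weights. It is a simplification for illustration, not how bitsandbytes works internally: its LLM.int8() scheme quantizes vector-wise and handles outliers separately, which is what `llm_int8_threshold` controls. The function names and sample weights are made up.

```python
def quantize_absmax_int8(weights):
    """Symmetric per-tensor quantization: map the largest |w| to 127."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used here
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.02, -0.51, 0.33, 1.27, -0.08]  # illustrative values
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error is bounded by half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Because one scale covers the whole tensor, a single large outlier stretches the step size for every other weight, which is the problem outlier-aware methods try to solve.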

AWQ (Activation-aware Weight Quantization)

AWQ observes activation patterns to identify which weights are important and protects them during 4-bit quantization. Produces much better quality than naive INT4. Preferred for production INT4 deployments.

Python · AWQ with vLLM

from vllm import LLM, SamplingParams

# vLLM natively supports AWQ models
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",
    quantization="awq",
    dtype="auto",
    gpu_memory_utilization=0.90,
)

GPTQ (Generative Pre-trained Transformer Quantization)

Uses a second-order optimization technique (approximate inverse Hessian) to minimize quantization error layer by layer. Good quality at 4-bit. Slower to quantize than AWQ but widely supported.

FP8 (The Production Standard on H100)

FP8 is the sweet spot on Hopper GPUs — nearly the quality of BF16 at half the VRAM. The H100's Transformer Engine dynamically switches between FP8 and BF16 per layer. vLLM's FP8 support is production-ready as of 2025.

Python · FP8 with vLLM

from vllm import LLM

# FP8 weights plus an FP8 KV cache
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    quantization="fp8",  # requires H100/Ada
    kv_cache_dtype="fp8",  # also quantize KV cache!
    gpu_memory_utilization=0.95,
)
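
Why quantizing the KV cache matters can be estimated with simple arithmetic. The sketch below uses a hypothetical helper, `kv_cache_bytes_per_token`. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumption for a Llama-3.1-70B-class model with grouped-query attention, so check the model's `config.json` before relying on it. The 32,000-token context is also just an example.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key and one value vector per KV head
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed model shape; verify against the actual model config
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

fp16_per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8_per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)

context_tokens = 32_000  # total tokens held in cache across all requests
fp16_cache_gb = fp16_per_token * context_tokens / 1e9
fp8_cache_gb = fp8_per_token * context_tokens / 1e9

print(f"FP16 KV cache: {fp16_cache_gb:.2f} GB, FP8 KV cache: {fp8_cache_gb:.2f} GB")
```

Halving the bytes per cached value roughly doubles how many tokens fit in the same cache budget, which translates into larger batches or longer contexts.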

✅

Production Decision Tree

H100/Hopper GPU → use FP8 (best speed + quality).
Older Ampere GPU → INT8 (safe, 50% VRAM savings).
Need to fit 70B on one 80GB GPU → AWQ INT4 (quality tradeoff, but works).
Doing math-heavy work or code generation → stay at BF16/INT8 minimum.
GGUF (llama.cpp format) is for dev laptops, NOT production servers.
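
The decision tree above could be encoded as a small helper. This is only a sketch: the function name, argument names, and return strings are hypothetical, and the order in which the rules are checked (precision-sensitive work first, then the single-GPU constraint, then GPU architecture) is an assumption, since the rules as stated do not say which wins when they conflict.

```python
def pick_quantization(gpu_arch: str,
                      must_fit_70b_on_one_80gb: bool = False,
                      precision_sensitive: bool = False) -> str:
    """Return a suggested precision following the lesson's rules of thumb."""
    if precision_sensitive:
        # Math-heavy work or code generation: do not go below INT8
        return "bf16-or-int8"
    if must_fit_70b_on_one_80gb:
        # Only 4-bit weights fit a 70B model on a single 80GB card
        return "awq-int4"
    if gpu_arch.lower() == "hopper":
        return "fp8"
    # Default for older hardware such as Ampere
    return "int8"


print(pick_quantization("hopper"))
print(pick_quantization("ampere"))
print(pick_quantization("ampere", must_fit_70b_on_one_80gb=True))
print(pick_quantization("hopper", precision_sensitive=True))
```

Treat the output as a starting point for benchmarking, not a final answer.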

Batching Strategies

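Batching is the second lever. A GPU serving one request at a time can waste 95% or more of its compute capacity, because decoding a single sequence is limited by memory bandwidth rather than arithmetic. Smart batching is how you turn expensive hardware into efficient infrastructure.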

Static Batching

Naive approach: collect N requests, process them together, return all results. Problem: you wait for the slowest request in the batch. GPU sits idle while one long sequence is still generating.

Continuous Batching (Iteration-Level Scheduling)

The game-changer introduced by Orca and implemented in vLLM. Instead of waiting for a full batch to complete, add new requests at every generation step as slots free up. This dramatically improves GPU utilization.

🔄

How Continuous Batching Works

At each decode step (generating one token per running request), the scheduler checks whether any running request has just finished. If so, it immediately fills that slot with a waiting request, so the GPU never sits idle between requests. vLLM does this by default, which is a major reason it can reach the 10–24x throughput gains reported over naive batching.

Continuous batching: as each request finishes, the slot is immediately filled — no idle GPU cycles waiting for a full batch to complete.
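
The difference between static and continuous batching shows up even in a toy simulation. The sketch below counts decode steps only, assuming every request needs a fixed number of output tokens and every step costs the same regardless of batch occupancy. Real schedulers, including vLLM's, are far more involved, and the function names and workload are invented for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, slots):
    """Each batch runs until its longest request finishes."""
    total_steps = 0
    for start in range(0, len(output_lengths), slots):
        batch = output_lengths[start:start + slots]
        total_steps += max(batch)
    return total_steps


def continuous_batching_steps(output_lengths, slots):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# One long request followed by seven short ones, two batch slots
lengths = [100, 10, 10, 10, 10, 10, 10, 10]
static_steps = static_batching_steps(lengths, slots=2)
continuous_steps = continuous_batching_steps(lengths, slots=2)
print(static_steps, continuous_steps)
```

With this workload the static scheduler needs 130 steps and the continuous one needs 100, because the short requests flow through the second slot while the long request is still generating.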

Dynamic Batching

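Collect requests until a maximum wait time or a maximum batch size is reached, whichever comes first. Used by Triton Inference Server and TensorRT-LLM deployments (TensorRT-LLM also offers in-flight batching, its name for continuous batching). This lets you trade latency (waiting to fill the batch) for throughput.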
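
The "maximum wait time or maximum batch size" rule can be sketched with simulated arrival timestamps instead of a real request queue. The function `form_batches` and its parameters are hypothetical and do not correspond to Triton's configuration options. The wait window is assumed to start when the first request of a batch arrives.

```python
def form_batches(arrival_times, max_batch_size, max_wait):
    """Group arrivals into batches closed by size or by elapsed wait time."""
    batches = []
    current = []
    deadline = None
    for arrival in sorted(arrival_times):
        # The open batch timed out before this request arrived: close it
        if current and arrival > deadline:
            batches.append(current)
            current = []
        if not current:
            deadline = arrival + max_wait
        current.append(arrival)
        # The batch is full: close it immediately
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals_ms = [0, 1, 2, 50, 51, 52, 53, 54]
batches = form_batches(arrivals_ms, max_batch_size=4, max_wait=10)
print(batches)
```

Raising `max_wait` produces fuller batches at the cost of extra queueing delay for the first request in each batch, which is exactly the latency-for-throughput trade described above.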

Chunked Prefill

vLLM's chunked prefill splits large prefill phases (processing long prompts) into chunks, interleaving them with decode steps. Prevents one long prompt from blocking all decode for other requests — critical for mixed workloads.
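
A toy model helps show the interleaving. In the sketch below, each scheduler step has a fixed token budget: running decode requests get one token each first, and whatever budget remains goes to the next chunk of the pending prompt. This is an illustrative simplification, not vLLM's scheduler. The name `token_budget` is hypothetical and only loosely analogous to a per-step batched-token limit.

```python
def schedule_chunked_prefill(prompt_tokens, num_decoding, token_budget):
    """Plan steps that mix decode tokens with chunks of one long prefill."""
    if token_budget <= num_decoding:
        raise ValueError("token_budget must exceed the number of decoding requests")
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        # Decodes are served first; the prefill gets the leftover budget
        prefill_chunk = min(remaining, token_budget - num_decoding)
        steps.append({"decode_tokens": num_decoding, "prefill_tokens": prefill_chunk})
        remaining -= prefill_chunk
    return steps


plan = schedule_chunked_prefill(prompt_tokens=1000, num_decoding=8, token_budget=256)
for step in plan:
    print(step)
```

Here the 1,000-token prompt is spread over five steps, and the eight decoding requests keep producing a token on every one of them instead of stalling until the prefill completes.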

Batching Strategy	Throughput	Latency	Complexity	Used In
Static	Low	High (waits for all)	Simple	Legacy systems
Dynamic	Medium	Medium	Medium	Triton, TRT-LLM
Continuous	High	Low (per-request)	Complex	vLLM, SGLang
Chunked Prefill	High	Low (TTFT controlled)	High	vLLM v1

Throughput vs Latency Tradeoff

These are fundamentally in tension. More batching = higher throughput, higher latency per request. Set your optimization target based on your SLA:

Interactive chatbot: minimize TTFT (time to first token) and TPOT (time per output token). Keep batches small. Target TTFT < 500 ms.
Batch processing / offline jobs: maximize tokens/second. Fill batches as much as possible.
API serving: balance using continuous batching + SLA-based routing.
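
To check a deployment against targets like these, you need to measure the two latency metrics per request. A minimal sketch, assuming you have recorded the request start time and the arrival time of each streamed token in seconds (the function name and sample timestamps are made up):

```python
def ttft_and_tpot(request_time, token_times):
    """Return (time to first token, mean time per output token) in seconds."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_time
    if len(token_times) == 1:
        return ttft, 0.0
    # Average gap between consecutive tokens after the first one
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Request sent at t=0; five tokens streamed back
token_timestamps = [0.25, 0.375, 0.5, 0.625, 0.75]
ttft, tpot = ttft_and_tpot(0.0, token_timestamps)
print(f"TTFT: {ttft * 1000:.0f} ms, TPOT: {tpot * 1000:.0f} ms")
```

Tracking these per request, rather than only aggregate tokens per second, shows whether a larger batch size is pushing interactive traffic past its latency target.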

🔑

Key Takeaways

1. INT8 = safe, 50% VRAM savings, use it by default. FP8 = best on H100. INT4 = last resort for fitting large models.
2. Continuous batching (vLLM's default) is the single biggest throughput multiplier: 10x+ over static batching.
3. Always quantize the KV cache too (not just weights) for full memory savings.
4. Benchmark your specific model+hardware combo; generic rules can be off by 2x.
