The Engineering Codex/LLM Systems Engineering

Quantization & Batching

5 min · Beginner · 1,048 words
Master the two most impactful levers for production LLM efficiency: reducing weight precision and maximizing GPU utilization through smart batching.

What you will learn

01 Why Quantization?
02 Precision Formats Explained
03 Quantization Methods
04 Batching Strategies
05 Throughput vs Latency Tradeoff

Why Quantization?

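A 70B-parameter model in FP16 needs about 140 GB of VRAM for its weights alone (70 billion parameters × 2 bytes). Two A100 80GB cards barely fit it, with little room left for the KV cache. Quantize to INT4 and the weights drop to about 35 GB, which is single-GPU territory. Quantization is the fastest way to change your hardware economics.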

  • 50%: VRAM saved by INT8 vs FP16
  • 75%: VRAM saved by INT4 vs FP16
  • Throughput gain from INT8 Tensor Cores
  • <2%: quality loss with INT8 (typical)
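
The memory figures above follow from simple arithmetic: parameters × bits per weight ÷ 8. The sketch below reproduces them. The function name `weight_memory_gb` is hypothetical, and the estimate covers weights only; it ignores the KV cache, activations, and the small overhead of quantization scales.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight memory in decimal GB (1 GB = 1e9 bytes).

    Weights only: KV cache, activations, and quantization metadata
    (scales, zero points) are deliberately left out.
    """
    return params_billions * bits_per_weight / 8


# 70B model at the precisions discussed above
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B @ {name}: {weight_memory_gb(70, bits):.1f} GB")
# 70B @ FP16: 140.0 GB
# 70B @ INT8: 70.0 GB
# 70B @ INT4: 35.0 GB
```

Treat the result as a lower bound when sizing hardware, since a serving process also needs headroom for the KV cache.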

Precision Formats Explained

Format | Bits | Range | Use Case | Hardware
FP32 | 32 | ±3.4×10³⁸ | Training (reference) | All GPUs
BF16 | 16 | ±3.4×10³⁸ | Training / inference | Ampere+
FP16 | 16 | ±65,504 | Inference (standard) | All modern GPUs
FP8 (E4M3) | 8 | ±448 | Fast inference, quality++ | H100/Hopper+
INT8 | 8 | -128 to 127 | Inference (safe) | Ampere+
INT4 | 4 | -8 to 7 | Memory-constrained | Any (emulated or native)
[Figure: weight memory for a 7B model, most to least precise. FP32 (32 bits): 28 GB · BF16 (16 bits): 14 GB · INT8 (8 bits): 7 GB · FP8 (8 bits, H100): 7 GB · INT4 (4 bits): 3.5 GB]
Precision formats for a 7B model. Each step down halves VRAM; quality tradeoffs compound as you go lower.
📌
BF16 vs FP16
Both are 16-bit, but BF16 has the same exponent range as FP32 (8 exponent bits) at the cost of less precision (7 mantissa bits vs FP16's 10). This makes BF16 more numerically stable for training — overflow is rare. FP16 has better precision but can overflow on large values. For inference, both work fine; prefer BF16 on modern hardware.
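
The range-versus-precision tradeoff can be checked without a GPU. The sketch below is an emulation under stated assumptions, not a library API: `to_bf16` and `to_fp16` are hypothetical helpers that round a Python float to the nearest BF16 or FP16 value using only the standard `struct` module. BF16 is emulated by keeping the top 16 bits of the FP32 bit pattern (with round-to-nearest-even), and NaN handling is omitted.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (emulated; NaN not handled)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # bfloat16 is the upper 16 bits of float32; add a bias so the
    # truncation below rounds to nearest, ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value; overflow becomes infinity."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 range; hardware would give inf
        return float("inf") if x > 0 else float("-inf")


# Range: 70,000 exceeds FP16's maximum of 65,504 but fits easily in BF16
print(to_fp16(70000.0))  # inf
print(to_bf16(70000.0))  # 70144.0

# Precision: FP16 keeps 10 mantissa bits, BF16 only 7
print(to_fp16(1.001))    # 1.0009765625
print(to_bf16(1.001))    # 1.0
```

The large value survives in BF16 but is coarsely rounded, while the small increment survives in FP16 but is lost in BF16.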

Quantization Methods

Post-Training Quantization (PTQ)

Quantize a trained model without any retraining. Fast, but can lose accuracy on complex tasks.

Python · bitsandbytes INT8
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load model in INT8 (50% VRAM reduction)
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # activations above this magnitude are treated as outliers and kept in FP16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
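
To make the idea of post-training quantization concrete, here is a minimal sketch of symmetric absmax INT8 quantization on a plain Python list. It is an illustration only, not what bitsandbytes runs internally: the function names are hypothetical and a single scale is used for the whole tensor. The second example shows why outliers get special treatment: one large value stretches the scale and crushes the small weights.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_absmax_int8(weights)
print(q)  # [50, -127, 2, 100]

# A single outlier inflates the scale, so small weights lose most of their detail
q_outlier, _ = quantize_absmax_int8([0.5, -1.27, 0.02, 60.0])
print(q_outlier)  # [1, -3, 0, 127]
```

Rounding error per weight is at most half the scale, which is why a smaller scale (no outliers, or per-channel scales) preserves more accuracy.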

AWQ (Activation-aware Weight Quantization)

AWQ observes activation patterns to identify which weights are important and protects them during 4-bit quantization. Produces much better quality than naive INT4. Preferred for production INT4 deployments.

Python · AWQ with vLLM
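from vllm import LLM, SamplingParams

# vLLM natively supports AWQ models
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",
    quantization="awq",
    dtype="auto",
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may reserve
)

# Generate with the quantized model (prompt and sampling values are illustrative)
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain quantization in one sentence."], sampling_params)
print(outputs[0].outputs[0].text)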

GPTQ (Generative Pre-trained Transformer Quantization)

Uses a second-order optimization technique (approximate inverse Hessian) to minimize quantization error layer by layer. Good quality at 4-bit. Slower to quantize than AWQ but widely supported.

FP8 (The Production Standard on H100)

FP8 is the sweet spot on Hopper GPUs: nearly the quality of BF16 at half the VRAM. The H100's Transformer Engine dynamically chooses between FP8 and 16-bit precision per layer. vLLM's FP8 support is production-ready as of 2025.

Python · FP8 with vLLM
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    quantization="fp8",  # native FP8 compute requires Hopper (H100) or Ada GPUs
    kv_cache_dtype="fp8",  # also quantize KV cache!
    gpu_memory_utilization=0.95,
)
Production Decision Tree
  • H100/Hopper GPU → use FP8 (best speed+quality).
  • Older Ampere GPU → INT8 (safe, 50% VRAM savings).
  • Need to fit 70B on one 80GB GPU → AWQ INT4 (quality tradeoff, but works).
  • Doing math-heavy / code generation → stay at BF16/INT8 minimum.
  • GGUF (llama.cpp format) is for dev laptops, NOT production servers.

Batching Strategies

Batching is the second lever. A GPU running one request at a time wastes 95%+ of its capacity. Smart batching is how you turn expensive hardware into efficient infrastructure.

Static Batching

Naive approach: collect N requests, process them together, return all results. Problem: you wait for the slowest request in the batch. GPU sits idle while one long sequence is still generating.

Continuous Batching (Iteration-Level Scheduling)

The game-changer introduced by Orca and implemented in vLLM. Instead of waiting for a full batch to complete, add new requests at every generation step as slots free up. This dramatically improves GPU utilization.

🔄
How Continuous Batching Works
At each decode step (every running request generates one token), the scheduler checks whether any running request has just finished. If so, it immediately fills that slot with a waiting request, so the GPU never sits idle between requests. vLLM does this by default, which is why it can achieve 10–24x better throughput than naive batching, depending on the workload.
Step T:   Req A · Req B · Req C · Req D
Step T+1: Req A ✓ · Req B · Req C · Req D
Step T+2: Req E (new) · Req B · Req C · Req D
Step T+3: Req E · Req B ✓ · Req C · Req D
Continuous batching: as each request finishes, the slot is immediately filled — no idle GPU cycles waiting for a full batch to complete.
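
The difference between static and continuous batching can be seen in a toy simulation that counts decode steps. This is a sketch under strong assumptions, not a model of any real scheduler: each request is represented only by the number of tokens it will generate, prefill cost is ignored, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Freed slots are refilled from the queue before every decode step."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token
        running = [left - 1 for left in running]
        running = [left for left in running if left > 0]  # drop finished requests
        steps += 1
    return steps


lengths = [2, 8, 3, 8, 2, 2, 3, 2]  # tokens each request will generate
print(static_batching_steps(lengths, batch_size=4))      # 11
print(continuous_batching_steps(lengths, batch_size=4))  # 8
```

With mixed short and long requests, the static schedule spends steps on half-empty batches, while the continuous schedule keeps the four slots occupied until the queue drains.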

Dynamic Batching

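Collect requests up to a maximum wait time or maximum batch size, whichever comes first. Triton Inference Server's dynamic batcher works this way. (TensorRT-LLM's own scheduler uses in-flight batching, its name for continuous batching.) Dynamic batching lets you trade latency (waiting to fill the batch) for throughput.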
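
The "maximum wait time or maximum batch size, whichever comes first" rule can be sketched with the standard library. The function `collect_batch` and its parameters are hypothetical, and the timer here starts when the function is called, which is a simplifying assumption rather than how any particular server measures the delay.

```python
import queue
import time


def collect_batch(request_queue, max_batch_size, max_wait_s):
    """Gather requests until the batch is full or the wait budget is spent."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough: ship a partial batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # nothing arrived before the deadline
    return batch


pending = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

print(collect_batch(pending, max_batch_size=3, max_wait_s=0.05))  # [0, 1, 2]
print(collect_batch(pending, max_batch_size=3, max_wait_s=0.05))  # [3, 4]
```

The first call returns as soon as the batch is full; the second waits out the remaining budget and returns a partial batch. Raising `max_wait_s` fills batches more often at the cost of added latency per request.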

Chunked Prefill

vLLM's chunked prefill splits large prefill phases (processing long prompts) into chunks, interleaving them with decode steps. Prevents one long prompt from blocking all decode for other requests — critical for mixed workloads.
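
A rough sketch of the idea: give each scheduler step a fixed token budget, serve the running decode requests first (one token each), and spend what is left on the next slice of the long prompt. The function name, the `token_budget` parameter, and the assumption that the number of decoding requests stays constant are all illustrative simplifications, not vLLM's actual scheduler code.

```python
def schedule_chunked_prefill(prompt_len, num_decoding, token_budget):
    """Return per-step (decode_tokens, prefill_tokens) until the prompt is processed.

    Assumes a constant number of decoding requests, each consuming one
    token of the budget per step.
    """
    if token_budget <= num_decoding:
        raise ValueError("token budget leaves no room for prefill")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget - num_decoding)
        steps.append((num_decoding, chunk))
        remaining -= chunk
    return steps


# A 5,000-token prompt arrives while 48 requests are decoding
for decode_tokens, prefill_tokens in schedule_chunked_prefill(5000, 48, 2048):
    print(f"decode={decode_tokens} prefill={prefill_tokens}")
# decode=48 prefill=2000
# decode=48 prefill=2000
# decode=48 prefill=1000
```

The 48 decoding requests receive a token on every step instead of stalling until the full prompt is processed. A smaller budget shortens each step, which helps decode latency, but the long prompt then needs more steps before its first token.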

Batching Strategy | Throughput | Latency | Complexity | Used In
Static | Low | High (waits for all) | Simple | Legacy systems
Dynamic | Medium | Medium | Medium | Triton
Continuous | High | Low (per-request) | Complex | vLLM, SGLang, TRT-LLM
Chunked Prefill | High | Low (TTFT controlled) | High | vLLM v1

Throughput vs Latency Tradeoff

These are fundamentally in tension. More batching = higher throughput, higher latency per request. Set your optimization target based on your SLA:

  • Interactive chatbot: minimize TTFT (time to first token) and TPOT (time per output token). Keep batches small. Target TTFT < 500 ms.
  • Batch processing / offline jobs: maximize tokens/second. Fill batches as much as possible.
  • API serving: balance using continuous batching + SLA-based routing.
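
To relate these metrics to what a user experiences, end-to-end latency for a streamed response is commonly approximated as TTFT plus TPOT for every token after the first. The sketch below applies that approximation. The function name and the example numbers are illustrative assumptions, not measurements.

```python
def end_to_end_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate request latency: first token, then one TPOT per later token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + tpot_s * (output_tokens - 1)


# Illustrative numbers: 400 ms TTFT, 30 ms per token, 200-token answer
latency = end_to_end_latency_s(ttft_s=0.4, tpot_s=0.03, output_tokens=200)
print(f"{latency:.2f} s")  # 6.37 s
```

For long answers TPOT dominates the total, while TTFT determines how responsive the chat feels. Larger batches usually raise TPOT, which is the latency cost paid for higher aggregate throughput.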
🔑
Key Takeaways
1. INT8 = safe, 50% VRAM savings, use it by default. FP8 = best on H100. INT4 = last resort for fitting large models.
2. Continuous batching (vLLM's default) is the single biggest throughput multiplier: 10x+ over static batching.
3. Always quantize the KV cache too (not just weights) for full memory savings.
4. Benchmark your specific model+hardware combo; generic rules can be off by 2x.
