
Quantization & Batching
Master the two most impactful levers for production LLM efficiency: reducing weight precision and maximizing GPU utilization through smart batching.
Why Quantization?
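A 70B-parameter model in FP16 needs about 140 GB of VRAM for the weights alone (2 bytes per parameter). Two A100 80GB cards barely fit it once the KV cache and activations are added. Quantize to INT4 and the weights drop to about 35 GB, which is single-GPU territory. Quantization is the fastest way to change your hardware economics.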
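The arithmetic behind those numbers can be checked with a few lines of Python. This is a minimal sketch that counts weights only; the helper name `weight_memory_gb` is made up for illustration, and real deployments also need room for the KV cache, activations, and quantization metadata such as scales.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# A 70B-parameter model at three precisions
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
```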
Precision Formats Explained
| Format | Bits | Range | Use Case | Hardware |
|---|---|---|---|---|
| FP32 | 32 | ±3.4×10³⁸ | Training (reference) | All GPUs |
| BF16 | 16 | ±3.4×10³⁸ | Training / inference | Ampere+ |
| FP16 | 16 | ±65,504 | Inference (standard) | All modern GPUs |
| FP8 (E4M3) | 8 | ±448 | Fast inference, near-BF16 quality | Hopper (H100) / Ada+ |
| INT8 | 8 | -128 to 127 | Inference (safe) | Ampere+ |
| INT4 | 4 | -8 to 7 | Memory-constrained | Any (emulated or native) |
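The ranges in the table follow from each format's bit layout. The sketch below derives them from exponent and mantissa widths; the function names are illustrative, and the E4M3 branch assumes the common FP8 convention in which the top exponent code is still usable for finite values and only the all-ones mantissa is reserved for NaN.

```python
def float_max(exp_bits: int, man_bits: int, ieee_like: bool = True) -> float:
    """Largest finite value of a binary floating-point format."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (FP32, FP16, BF16)
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2 ** (-man_bits)
    else:
        # Assumed E4M3 convention: top exponent usable, all-ones mantissa is NaN
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 2 - 2 ** (1 - man_bits)
    return max_mantissa * 2.0 ** max_exp


def int_range(bits: int) -> tuple[int, int]:
    """Range of a signed two's-complement integer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


print("FP32:", float_max(8, 23))
print("BF16:", float_max(8, 7))
print("FP16:", float_max(5, 10))
print("FP8 E4M3:", float_max(4, 3, ieee_like=False))
print("INT8:", int_range(8))
print("INT4:", int_range(4))
```

BF16 keeps FP32's 8 exponent bits, which is why its range matches FP32 while its precision is much lower.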
Quantization Methods
Post-Training Quantization (PTQ)
Quantize a trained model without any retraining. Fast, but can lose accuracy on complex tasks.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load model in INT8 (50% VRAM reduction)
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # outlier threshold
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
```
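To make the idea of post-training quantization concrete, here is a minimal sketch of symmetric absmax round-to-nearest quantization on a handful of weights. It is not what bitsandbytes does internally (LLM.int8() additionally handles outlier features in higher precision); the function names and sample weights are made up for illustration.

```python
def quantize_absmax(weights: list[float], bits: int = 8) -> tuple[list[int], float]:
    """Map floats to signed integers using one shared scale (symmetric absmax)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 1.27, -0.04]  # illustrative values
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

The rounding error of each weight is bounded by half the scale, so a single large outlier inflates the scale and the error for every other weight in the group. That is the motivation for the outlier threshold above.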
AWQ (Activation-aware Weight Quantization)
AWQ observes activation patterns on calibration data to identify which weight channels matter most, then protects them by scaling those channels before 4-bit quantization. It produces much better quality than naive round-to-nearest INT4 and is preferred for production INT4 deployments.
```python
from vllm import LLM, SamplingParams

# vLLM natively supports AWQ models
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",
    quantization="awq",
    dtype="auto",
    gpu_memory_utilization=0.90,
)
```
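The core trick behind activation-aware scaling can be shown on a toy group of four weights. This is a simplified sketch, not the AWQ algorithm itself: real AWQ searches for per-channel scales using calibration data, whereas the scale of 2.0 and all values below are hand-picked for illustration.

```python
def quantize_group(weights: list[float], bits: int = 4) -> list[float]:
    """Round-to-nearest with one shared absmax scale; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


weights = [0.2, 1.0, -0.7, 0.45]       # one quantization group (illustrative)
activations = [10.0, 1.0, 1.0, 1.0]    # channel 0 sees large activations
s = [2.0, 1.0, 1.0, 1.0]               # hand-picked scale for the salient channel

exact = dot(activations, weights)

# Naive INT4: quantize the weights as they are
naive_err = abs(dot(activations, quantize_group(weights)) - exact)

# Scaled: multiply the salient weight by s, divide its activation by s.
# (x / s) * (w * s) equals x * w, so the full-precision output is unchanged.
scaled_w = [w * si for w, si in zip(weights, s)]
scaled_x = [x / si for x, si in zip(activations, s)]
awq_err = abs(dot(scaled_x, quantize_group(scaled_w)) - exact)

print(f"naive error: {naive_err:.3f}, scaled error: {awq_err:.3f}")
```

Because the scaled weight still does not exceed the group's maximum, the shared scale stays the same while the salient weight gets finer relative resolution, so the output error shrinks.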
GPTQ (Generative Pre-trained Transformer Quantization)
Uses a second-order optimization technique (approximate inverse Hessian) to minimize quantization error layer by layer. Good quality at 4-bit. Slower to quantize than AWQ but widely supported.
FP8 (The Production Standard on H100)
FP8 is the sweet spot on Hopper GPUs — nearly the quality of BF16 at half the VRAM. The H100's Transformer Engine dynamically switches between FP8 and BF16 per layer. vLLM's FP8 support is production-ready as of 2025.
```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    quantization="fp8",          # requires H100/Ada
    kv_cache_dtype="fp8",        # also quantize KV cache!
    gpu_memory_utilization=0.95,
)
```
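Quantizing the KV cache matters because the cache grows with every token in every active sequence. The sketch below estimates its size; the model shape (80 layers, 8 KV heads, head dimension 128) is an assumption about a Llama-3.1-70B-style architecture with grouped-query attention, and the helper name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies (factor 2 = keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3.1-70B-style shape (not taken from the text above)
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

fp16_per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8_per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)

context = 8192  # tokens in one sequence
print(f"FP16 KV cache: {fp16_per_token * context / 2**30:.2f} GiB per sequence")
print(f"FP8  KV cache: {fp8_per_token * context / 2**30:.2f} GiB per sequence")
```

Under these assumptions, halving the bytes per value roughly doubles the number of tokens that fit in the same cache budget, which translates directly into larger batches.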
Batching Strategies
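Batching is the second lever. A GPU decoding one request at a time is limited by memory bandwidth rather than compute: it reloads the full set of weights for every token and can waste 95%+ of its compute capacity. Smart batching is how you turn expensive hardware into efficient infrastructure.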
Static Batching
Naive approach: collect N requests, process them together, return all results. Problem: you wait for the slowest request in the batch. GPU sits idle while one long sequence is still generating.
Continuous Batching (Iteration-Level Scheduling)
The game-changer introduced by Orca and implemented in vLLM. Instead of waiting for a full batch to complete, add new requests at every generation step as slots free up. This dramatically improves GPU utilization.
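A small simulation makes the difference between static and continuous batching visible. It is a toy model under stated assumptions: every active request produces one token per step, prefill cost is ignored, and the function names and request lengths are invented for illustration.

```python
from collections import deque


def static_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Free slots are refilled from the queue before every generation step."""
    waiting = deque(output_lens)
    running: list[int] = []  # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active request emits a token; finished ones leave
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lens = [10, 2, 2, 2, 10, 2, 2, 2]  # two long requests among short ones
print("static:    ", static_batching_steps(lens, batch_size=4))
print("continuous:", continuous_batching_steps(lens, batch_size=4))
```

With these inputs the static scheduler pays for the long request twice, once per batch, while the continuous scheduler overlaps the second long request with the tail of the first.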
Dynamic Batching
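Collect requests up to a maximum wait time or maximum batch size, whichever comes first. This is how Triton Inference Server's dynamic batcher works; TensorRT-LLM models are often served behind it, although TensorRT-LLM also offers in-flight (continuous) batching of its own. Dynamic batching lets you trade latency (waiting to fill the batch) for throughput.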
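The "maximum wait or maximum size, whichever comes first" rule can be sketched on simulated arrival timestamps. This is an illustrative model, not Triton's implementation; `dynamic_batches` and its parameters are hypothetical names, and times are in milliseconds.

```python
def dynamic_batches(arrival_times: list[int], max_batch_size: int,
                    max_wait: int) -> list[tuple[int, list[int]]]:
    """Group sorted arrival times into (dispatch_time, request_indices) batches."""
    batches: list[tuple[int, list[int]]] = []
    current: list[int] = []
    window_start = 0
    for i, t in enumerate(arrival_times):
        # The pending batch timed out before this request arrived: dispatch it
        if current and t - window_start >= max_wait:
            batches.append((window_start + max_wait, current))
            current = []
        if not current:
            window_start = t  # first request of a new batch starts the timer
        current.append(i)
        # The batch is full: dispatch immediately without waiting
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        batches.append((window_start + max_wait, current))
    return batches


arrivals = [0, 1, 2, 3, 50, 120, 125]
for dispatch_time, members in dynamic_batches(arrivals, max_batch_size=4, max_wait=20):
    print(f"t={dispatch_time:>3} ms  batch={members}")
```

A burst fills a batch and leaves at once, while a lone request waits the full `max_wait` before being dispatched alone. That waiting time is the latency being traded for throughput.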
Chunked Prefill
vLLM's chunked prefill splits large prefill phases (processing long prompts) into chunks, interleaving them with decode steps. Prevents one long prompt from blocking all decode for other requests — critical for mixed workloads.
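The interleaving can be sketched as a per-step token budget: decode tokens are scheduled first and whatever budget remains is spent on prompt chunks. This is a simplified model, not vLLM's scheduler; `token_budget` and the other names are hypothetical, and the number of decoding requests is held constant for clarity (in a real system, finished prefills would join the decode set).

```python
def schedule_steps(prompt_lens: list[int], decode_reqs: int,
                   token_budget: int) -> list[tuple[int, list[tuple[int, int]]]]:
    """Return one (decode_tokens, [(prompt_index, chunk_len), ...]) plan per step."""
    if decode_reqs >= token_budget:
        raise ValueError("token_budget must leave room for prefill chunks")
    remaining = list(prompt_lens)
    plan = []
    while any(r > 0 for r in remaining):
        budget = token_budget - decode_reqs  # each decode request uses one token
        chunks = []
        for i, r in enumerate(remaining):
            if budget <= 0:
                break
            if r > 0:
                take = min(r, budget)  # prefill only as much as the budget allows
                chunks.append((i, take))
                remaining[i] -= take
                budget -= take
        plan.append((decode_reqs, chunks))
    return plan


plan = schedule_steps(prompt_lens=[1000, 300], decode_reqs=12, token_budget=512)
for step, (decodes, chunks) in enumerate(plan, start=1):
    print(f"step {step}: {decodes} decode tokens + prefill chunks {chunks}")
```

The 1000-token prompt is spread over two steps, and the 12 decoding requests keep emitting a token on every step instead of stalling behind one large prefill.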
| Batching Strategy | Throughput | Latency | Complexity | Used In |
|---|---|---|---|---|
| Static | Low | High (waits for all) | Simple | Legacy systems |
| Dynamic | Medium | Medium | Medium | Triton, TRT-LLM |
| Continuous | High | Low (per-request) | Complex | vLLM, SGLang |
| Chunked Prefill | High | Low (TTFT controlled) | High | vLLM v1 |
Throughput vs Latency Tradeoff
These are fundamentally in tension. More batching = higher throughput, higher latency per request. Set your optimization target based on your SLA:
- Interactive chatbot: minimize time to first token (TTFT) and time per output token (TPOT). Keep batches small and target TTFT < 500 ms.
- Batch processing / offline jobs: maximize tokens/second. Fill batches as much as possible.
- API serving: balance using continuous batching + SLA-based routing.
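The tension can be illustrated with a toy cost model in which every decode step has a fixed cost (loading the weights) plus a small cost per sequence in the batch. The constants of 20 ms and 0.5 ms are invented for illustration, not measurements, and real hardware behaves less linearly.

```python
def decode_step_ms(batch_size: int, fixed_ms: float = 20.0,
                   per_seq_ms: float = 0.5) -> float:
    """Time for one decode step; this is also the TPOT each request experiences."""
    return fixed_ms + per_seq_ms * batch_size


def tokens_per_second(batch_size: int) -> float:
    """Aggregate throughput: one token per sequence per step."""
    return batch_size / decode_step_ms(batch_size) * 1000.0


for b in (1, 16, 64):
    print(f"batch={b:>2}  TPOT={decode_step_ms(b):.1f} ms  "
          f"throughput={tokens_per_second(b):.0f} tok/s")
```

In this model, growing the batch raises aggregate throughput substantially while each request's per-token latency climbs more slowly, which is why the right batch size depends on the SLA rather than on a single optimum.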
Further Reading
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (arxiv.org)
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (arxiv.org)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (arxiv.org)
- Orca: A Distributed Serving System for Transformer-Based Language Models (continuous batching) (arxiv.org)
- Hugging Face: Quantization Overview (huggingface.co)