
KV Caching, Speculative Decoding & Token Throughput
Understand the KV cache, the most critical optimization in LLM inference, along with speculative decoding and how to measure and improve token throughput.
The KV Cache: LLM's Working Memory
Every transformer layer computes attention: each token attends to all previous tokens through their Key (K) and Value (V) tensors. Without caching, generating token #500 means recomputing K and V for all 499 previous tokens at every layer. The KV cache stores these tensors, so each decode step runs the forward pass on the new token alone and reads everything else from the cache.
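To make the equivalence concrete, here is a minimal single-head sketch in NumPy. The projection matrices and "embeddings" are random stand-ins, not real model weights; the point is only that appending one K/V row per step gives the same attention output as recomputing the whole prefix.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    """Attention output for one query vector q over all rows of K and V."""
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d_model, steps = 16, 6
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
tokens = rng.standard_normal((steps, d_model))  # stand-in token embeddings

# Without a cache: recompute K and V for the whole prefix at every step.
no_cache_out = []
for t in range(steps):
    prefix = tokens[: t + 1]
    K, V = prefix @ W_k, prefix @ W_v
    no_cache_out.append(attend(tokens[t] @ W_q, K, V))

# With a cache: project only the newest token and append it.
k_cache = np.empty((0, d_model))
v_cache = np.empty((0, d_model))
cache_out = []
for t in range(steps):
    x = tokens[t]
    k_cache = np.vstack([k_cache, x @ W_k])
    v_cache = np.vstack([v_cache, x @ W_v])
    cache_out.append(attend(x @ W_q, k_cache, v_cache))

outputs_match = np.allclose(no_cache_out, cache_out)
print("cached and uncached outputs match:", outputs_match)
```

The uncached loop does work proportional to the prefix length at every step, while the cached loop does a constant amount of projection work per step and pays only in memory.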
KV Cache Size Math
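For Llama-3 8B (32 layers, 8 KV heads, head_dim=128, BF16, batch=32, seq=4096), the cache stores 2 (K and V) × 32 layers × 8 KV heads × 128 dims × 2 bytes = 128 KiB per token. That is 512 MiB per 4096-token sequence and 16 GiB for the full batch of 32.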
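The sizing for this configuration can be reproduced in a few lines of Python. The helper name `kv_cache_bytes` is made up for this sketch; it simply multiplies out the dimensions listed above and ignores allocator overhead or block padding.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, bytes_per_elem,
                   seq_len, batch_size):
    """Bytes needed to hold K and V for every layer, KV head, and token."""
    # The leading 2 accounts for storing both K and V.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

# Llama-3 8B in BF16 (2 bytes per element)
per_token = kv_cache_bytes(32, 8, 128, 2, seq_len=1, batch_size=1)
per_sequence = kv_cache_bytes(32, 8, 128, 2, seq_len=4096, batch_size=1)
full_batch = kv_cache_bytes(32, 8, 128, 2, seq_len=4096, batch_size=32)

print(f"per token:    {per_token / 2**10:.0f} KiB")     # 128 KiB
print(f"per sequence: {per_sequence / 2**20:.0f} MiB")  # 512 MiB
print(f"full batch:   {full_batch / 2**30:.0f} GiB")    # 16 GiB
```

At this batch size and context length the cache is about as large as the BF16 weights themselves, which is why KV memory, not compute, usually limits concurrency.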
Prefix Caching (Automatic Prompt Caching)
When many requests share the same prefix (e.g., a system prompt), vLLM can cache those KV blocks and reuse them for any later request that begins with the same tokens. It identifies matching blocks by hashing their content.
| Scenario | Without Prefix Cache | With Prefix Cache | Saving |
|---|---|---|---|
| System prompt: 2k tokens | Process 2k tokens per request | Process 0 prefix tokens after the first request | ~100% of prefix prefill |
| Multi-turn chat (4 turns) | Re-process all history | Reuse cached history | ~75% of prefill |
| RAG with fixed context | Re-prefill documents per query | Reuse cached document KV blocks | 60–80% of prefill |
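The content-hashing idea can be sketched with a toy cache. This is an illustration of the general technique, not vLLM's implementation: `ToyPrefixCache`, `block_hashes`, and the block size of 16 are assumptions made for the example, and the "KV block" is just a placeholder string.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (assumed value for this sketch)

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with its parent hash, so a block only
    matches when the entire prefix before it matches too."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

class ToyPrefixCache:
    def __init__(self):
        self.blocks = {}  # block hash -> cached KV block (placeholder here)

    def lookup_and_fill(self, token_ids):
        """Return (tokens reused from cache, tokens that still need prefill)."""
        reused = 0
        hashes = block_hashes(token_ids)
        for h in hashes:
            if h not in self.blocks:
                break  # the first miss ends the reusable prefix
            reused += BLOCK_SIZE
        for h in hashes:
            self.blocks.setdefault(h, "kv-block")
        return reused, len(token_ids) - reused

system_prompt = list(range(2048))  # stand-in for a 2k-token system prompt
cache = ToyPrefixCache()
first = cache.lookup_and_fill(system_prompt + [9001, 9002, 9003])
second = cache.lookup_and_fill(system_prompt + [7001, 7002])
print(first)   # (0, 2051): cold cache, everything is prefilled
print(second)  # (2048, 2): the shared prefix is served from cache
```

Chaining each hash to its parent is what makes a hit safe: a block in the middle of a prompt is reused only if every token before it is identical, which is the condition under which its K and V values are identical.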
```python
import requests

# Query the vLLM metrics endpoint (Prometheus text format).
response = requests.get("http://localhost:8000/metrics", timeout=10)
response.raise_for_status()

# Look for these metrics (exact names vary between vLLM versions):
#   vllm:gpu_cache_usage_perc     - % of KV cache used
#   vllm:cpu_cache_usage_perc     - CPU offload cache
#   vllm:num_preemptions_total    - requests evicted
#   vllm:request_cache_hit_ratio  - prefix cache hits
print(response.text)
```
KV Cache Quantization
You can quantize the KV cache to FP8 independently of the weight precision. Relative to a BF16 or FP16 cache, this halves KV cache VRAM, with minimal quality impact on most tasks.
```shell
# --kv-cache-dtype fp8 halves KV cache memory relative to BF16
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-cache-dtype fp8 \
  --dtype bfloat16
```
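One way to see what the saving buys is to ask how many tokens fit in a fixed KV budget. The sketch below uses the Llama-3.1-70B attention shape (80 layers, 8 KV heads, head_dim 128); the 20 GiB budget is a hypothetical figure, and the small overhead of FP8 scaling factors is ignored.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of K and V stored per token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128  # Llama-3.1-70B attention shape
budget_bytes = 20 * 2**30                # hypothetical 20 GiB left for KV cache

bf16_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)

bf16_tokens = budget_bytes // bf16_per_token
fp8_tokens = budget_bytes // fp8_per_token

print(f"BF16: {bf16_per_token // 1024} KiB/token, {bf16_tokens} tokens fit")
print(f"FP8:  {fp8_per_token // 1024} KiB/token, {fp8_tokens} tokens fit")
```

Doubling the token capacity translates directly into either twice the concurrent sequences or twice the context length at the same concurrency.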
Speculative Decoding
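Speculative decoding parallelizes the otherwise sequential decode phase. A small draft model quickly generates several candidate tokens, and the large target model verifies all of them in a single forward pass. Tokens that pass verification are accepted; the first rejected token is replaced with the target model's own choice, so the output matches what the target model would have produced alone.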
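A toy sketch of the draft-then-verify loop under greedy decoding follows. Both "models" are deterministic arithmetic functions invented for the example (`target_next`, `draft_next`), and the verification step calls the target once per position, whereas a real target model would score all k+1 positions in one batched forward pass.

```python
def target_next(context):
    """Toy stand-in for the large model's greedy next token."""
    return (sum(context) * 31 + len(context)) % 50

def draft_next(context):
    """Toy draft model: agrees with the target except at every 4th position."""
    guess = target_next(context)
    return (guess + 1) % 50 if len(context) % 4 == 0 else guess

def speculative_generate(prompt, num_new_tokens, k=4):
    tokens, target_passes = list(prompt), 0
    goal = len(prompt) + num_new_tokens
    while len(tokens) < goal:
        # 1) Draft k candidate tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Verify. Counted as ONE target pass, since a real model would
        #    score all k+1 positions together.
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # the target's token is always kept
            if i == k or draft[i] != expected:
                break  # stop at the first mismatch (or after the bonus token)
        tokens.extend(accepted)
    return tokens[:goal], target_passes

def greedy_generate(prompt, num_new_tokens):
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target_next(tokens))
    return tokens

prompt = [3, 1, 4]
spec_tokens, passes = speculative_generate(prompt, num_new_tokens=24)
baseline = greedy_generate(prompt, num_new_tokens=24)
print("same output as target-only decoding:", spec_tokens == baseline)
print("target passes:", passes, "vs 24 without speculation")
```

Every emitted token is the target's own choice for its context, so the sequence is identical to plain greedy decoding; the gain comes only from needing fewer target passes.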
Speculative Decoding Variants
| Method | Draft Source | Overhead | Acceptance Rate | Speedup |
|---|---|---|---|---|
| Draft Model | Smaller LM (e.g. 7B → 70B) | Medium | High | 1.5–2.5× |
| N-Gram Matching | Repeated phrases in prompt | Near zero | Medium | 1.2–1.5× |
| EAGLE / EAGLE-2 | Feature-based head | Low | Very High | 2–3× |
| MLP Speculator | Trained MLP on hidden states | Low | High | 1.5–2× |
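The n-gram row needs no draft model at all. A minimal sketch of the idea, with the hypothetical helper `ngram_propose`, looks for an earlier occurrence of the most recent tokens and proposes whatever followed it. Real implementations operate on token IDs and differ in matching details.

```python
def ngram_propose(tokens, ngram_size=3, num_draft=5):
    """Propose draft tokens by finding the most recent earlier occurrence of
    the last `ngram_size` tokens and copying what followed it."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan right to left, skipping the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return tokens[follow:follow + num_draft]
    return []  # no match: fall back to normal decoding for this step

context = "the quick brown fox jumps over the lazy dog . the quick brown".split()
print(ngram_propose(context))  # ['fox', 'jumps', 'over', 'the', 'lazy']
```

This works well when the output repeats spans of the prompt, as in summarization, code editing, or RAG answers that quote their sources, and rarely helps on free-form generation.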
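```shell
# --num-speculative-tokens is the k value; tune it for your workload.
# Newer vLLM releases may expect these options under --speculative-config instead.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 4
```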
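A rough way to reason about the choice of k is the expected number of tokens emitted per target pass. Under the simplifying assumption that each draft token is accepted independently with the same probability, this is (1 − α^(k+1)) / (1 − α), where α is the acceptance rate. The acceptance rate of 0.8 below is an illustrative value, not a measurement.

```python
def expected_tokens_per_pass(acceptance_rate, k):
    """Expected tokens emitted per target forward pass with k draft tokens,
    assuming each draft token is accepted independently at the same rate."""
    a = acceptance_rate
    if a == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1 - a ** (k + 1)) / (1 - a)

for k in (1, 3, 5, 8):
    print(f"k={k}: {expected_tokens_per_pass(0.8, k):.2f} tokens per target pass")
```

The gain flattens as k grows while the drafting cost keeps rising linearly, so the best k is usually small and depends on the measured acceptance rate for your workload.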
Token Throughput Metrics
Knowing your performance numbers is as important as achieving them. Here are the key metrics:
| Metric | Full Name | What It Measures | Good Target |
|---|---|---|---|
| TTFT | Time to First Token | Latency from request to first token | < 500ms (interactive) |
| TPOT | Time Per Output Token | Average time per token during decode, excluding the first token | < 50ms (interactive) |
| TBT | Time Between Tokens | Latency between consecutive tokens; TPOT is its average | < 50ms |
| E2E Latency | End-to-End Latency | Total time for request completion | Depends on output length |
| TPS | Tokens Per Second | Total throughput across all requests | Maximize for batch jobs |
| RPS | Requests Per Second | Request throughput | Depends on output length |
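The per-request metrics in the table can be derived from token arrival timestamps collected on the client. The sketch below uses a hypothetical helper, `latency_metrics`, and made-up timestamps; it reports throughput for a single request only, whereas TPS in the table aggregates across all concurrent requests.

```python
from statistics import mean

def latency_metrics(request_start, token_times):
    """Compute per-request latency metrics from token arrival times (seconds)."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]  # TBT samples
    tpot = mean(gaps) if gaps else 0.0
    e2e = token_times[-1] - request_start
    return {
        "ttft": ttft,
        "tpot": tpot,
        "max_tbt": max(gaps, default=0.0),
        "e2e": e2e,
        "tokens_per_s": len(token_times) / e2e,
    }

# Hypothetical timestamps: first token after 0.30 s, then one every 0.04 s.
arrivals = [0.30 + 0.04 * i for i in range(11)]
metrics = latency_metrics(0.0, arrivals)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Tracking the maximum gap alongside the average is useful because a single long stall is visible to a user even when TPOT looks healthy.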
Benchmarking with vLLM
```shell
# Official vLLM benchmark tool
python benchmarks/benchmark_throughput.py \
  --backend vllm \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 128

# Benchmark latency (per-request metrics)
python benchmarks/benchmark_latency.py \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --batch-size 1 \
  --input-len 512 \
  --output-len 128 \
  --num-iters 30
```
- Speculative Decoding: Accelerating Large Language Model Inference via Speculative Sampling (arxiv.org)
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (arxiv.org)
- Flash-Decoding for Long-Context Inference (arxiv.org)
- vLLM: Speculative Decoding Documentation (vllm.ai)
- Efficient Memory Management for LLM Serving with PagedAttention (arxiv.org)