KV Caching, Speculative Decoding & Token Throughput

5 min · Intermediate · 1,041 words
Understand the most critical optimization in LLM inference — the KV cache — plus speculative decoding and how to measure and improve token throughput.

What you will learn

01 The KV Cache: LLM's Working Memory
02 KV Cache Size Math
03 Prefix Caching (Automatic Prompt Caching)
04 KV Cache Quantization
05 Speculative Decoding
06 Token Throughput Metrics

The KV Cache: LLM's Working Memory

Every transformer layer computes attention: each token attends to all previous tokens through their Key (K) and Value (V) vectors. Without caching, generating token #500 means recomputing K and V for all 499 previous tokens at every layer. The KV cache stores those tensors, so each decode step only computes Q, K, and V for the single new token and reads everything else from the cache.

🔍
Why Cache K and V, Not Q?
In self-attention, the Query (Q) at each step belongs to the token currently being generated; once that step finishes, you never need it again. Keys and Values, on the other hand, represent all past tokens that the current token attends to. Because attention is causal, those past K and V tensors are identical for every future token in the sequence, so caching them avoids recomputing N key and value projections per layer at every decode step.
🧠
Prefill vs Decode
Prefill: process all prompt tokens in parallel (one forward pass, compute-bound, fast per token). All K and V tensors are computed and stored in the KV cache.
Decode: generate one token at a time. Each step loads the entire KV cache plus the model weights from VRAM, computes attention, and outputs one token, so it is memory-bandwidth-bound. For generation-heavy workloads, this is where most of your serving time is spent.
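To make the caching argument concrete, here is a minimal sketch of single-head attention in pure Python, decoded once with and once without a KV cache. The helper names (`project`, `attend`, `decode_with_cache`, `decode_without_cache`), the toy dimension, and the random weights are all illustrative assumptions; a real model has many layers and heads, but the bookkeeping is the same.

```python
import math
import random

def project(x, w):
    """Multiply vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def decode_without_cache(tokens, wq, wk, wv):
    outputs, kv_projections = [], 0
    for t in range(len(tokens)):
        # Recompute K and V for every token seen so far at each step.
        keys = [project(x, wk) for x in tokens[: t + 1]]
        values = [project(x, wv) for x in tokens[: t + 1]]
        kv_projections += t + 1
        outputs.append(attend(project(tokens[t], wq), keys, values))
    return outputs, kv_projections

def decode_with_cache(tokens, wq, wk, wv):
    outputs, kv_projections = [], 0
    cache_k, cache_v = [], []
    for x in tokens:
        # Only the new token's K and V are computed; the rest come from the cache.
        cache_k.append(project(x, wk))
        cache_v.append(project(x, wv))
        kv_projections += 1
        outputs.append(attend(project(x, wq), cache_k, cache_v))
    return outputs, kv_projections

rng = random.Random(0)
dim = 4
wq, wk, wv = ([[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
              for _ in range(3))
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]

slow, slow_count = decode_without_cache(tokens, wq, wk, wv)
fast, fast_count = decode_with_cache(tokens, wq, wk, wv)
print(slow_count, fast_count)  # 21 6
```

Both paths produce the same attention outputs; the cached path projects each token to K and V once (6 positions), while the uncached path re-projects the whole history at every step (1 + 2 + ... + 6 = 21 positions).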

KV Cache Size Math

KV Cache Size (bytes)
2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × dtype_bytes

For Llama-3 8B (32 layers, 8 KV heads, head_dim=128, BF16, batch=32, seq=4096):

2 × 32 × 8 × 128 × 4096 × 32 × 2 bytes = 17,179,869,184 bytes ≈ 17.2 GB (16 GiB)
⚠️
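KV Cache Grows Linearly with Context
Double your context length → double your KV cache size (the same holds for batch size). This is why 128k context models are challenging. At 128k tokens with batch 8, a 70B model's KV cache alone is well over 100 GB: for a Llama-3-70B-style configuration (80 layers, 8 KV heads, head_dim=128, BF16) the formula above gives roughly 344 GB. This is driving research into KV compression, sliding window attention, and CPU offloading.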
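The size formula is easy to turn into a small calculator. The sketch below is a direct transcription of it; the function name `kv_cache_bytes` is an illustrative choice, and the configuration values are the Llama-3 8B numbers quoted in this section.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, dtype_bytes):
    """KV cache size in bytes; the leading 2 accounts for storing both K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Llama-3 8B: 32 layers, 8 KV heads, head_dim=128, BF16 (2 bytes), batch=32, seq=4096
bf16 = kv_cache_bytes(32, 8, 128, 4096, 32, 2)
# Same configuration with twice the context length
bf16_double_context = kv_cache_bytes(32, 8, 128, 8192, 32, 2)
# Same configuration with a 1-byte (FP8) cache
fp8 = kv_cache_bytes(32, 8, 128, 4096, 32, 1)

print(f"BF16, 4k context: {bf16 / 1e9:.1f} GB")                 # 17.2 GB
print(f"BF16, 8k context: {bf16_double_context / 1e9:.1f} GB")  # 34.4 GB
print(f"FP8,  4k context: {fp8 / 1e9:.1f} GB")                  # 8.6 GB
```

Every factor enters the product once, so the cache scales linearly with sequence length, batch size, and bytes per element.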

Prefix Caching (Automatic Prompt Caching)

When many requests share the same prefix (e.g., a system prompt), vLLM can cache those KV blocks and reuse them for any request that starts with the same tokens. It identifies matching blocks by hashing their content.

| Scenario | Without Prefix Cache | With Prefix Cache | Saving |
|---|---|---|---|
| System prompt: 2k tokens | Process 2k tokens per request | Process 0 system-prompt tokens | ~100% of system-prompt prefill |
| Multi-turn chat (4 turns) | Re-process all history | Reuse cached history | 75% of prefill |
| RAG with fixed context | Re-process docs per query | Cache doc KV blocks | 60–80% of prefill |
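The content-hashing idea can be sketched in a few lines. This is an illustration of the general technique, not vLLM's actual implementation: the function names, the block size of 16, and the use of SHA-256 are all assumptions made for the example. Each block's hash chains in its parent's hash, so a block only matches when the entire prefix before it matches too.

```python
import hashlib

def block_hashes(token_ids, block_size=16):
    """Hash each full block of tokens, chaining in the previous block's hash."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

def cached_prefix_tokens(cache, token_ids, block_size=16):
    """Return how many leading tokens hit the cache, then register this request's blocks."""
    hashes = block_hashes(token_ids, block_size)
    hit_blocks = 0
    for h in hashes:
        if h not in cache:
            break
        hit_blocks += 1
    cache.update(hashes)
    return hit_blocks * block_size

system_prompt = list(range(40))               # 40 shared "token ids"
request_a = system_prompt + [1000, 1001, 1002]
request_b = system_prompt + [2000, 2001]

cache = set()
first = cached_prefix_tokens(cache, request_a)   # cold cache
second = cached_prefix_tokens(cache, request_b)  # shares the system prompt
print(first, second)  # 0 32
```

Only whole blocks are reusable in this scheme: the two requests share 40 tokens, but just the first two 16-token blocks (32 tokens) hit, because the third block already contains request-specific tokens.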
Python · Measuring cache hit rate in vLLM
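import requests

# Query the vLLM metrics endpoint (Prometheus text format).
response = requests.get("http://localhost:8000/metrics", timeout=5)
response.raise_for_status()

# Metric names vary by vLLM version; typical ones to look for:
# vllm:gpu_cache_usage_perc  - fraction of the GPU KV cache in use
# vllm:cpu_cache_usage_perc  - fraction of the CPU offload cache in use
# vllm:num_preemptions_total - cumulative request preemptions
# prefix cache hit metrics   - e.g. vllm:gpu_prefix_cache_hit_rate in some releases
for line in response.text.splitlines():
    # Skip "# HELP" / "# TYPE" lines; keep cache and preemption series.
    if not line.startswith("#") and ("cache" in line or "preempt" in line):
        print(line)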

KV Cache Quantization

You can quantize the KV cache to FP8 independently of the weight precision. Compared with a 16-bit (BF16/FP16) cache, this halves KV cache VRAM, with minimal quality impact on most tasks.

Shell · FP8 KV cache in vLLM
# --kv-cache-dtype fp8 halves KV cache memory relative to a 16-bit cache
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --kv-cache-dtype fp8 \
    --dtype bfloat16

Speculative Decoding

Speculative decoding is a clever parallelization trick for decode. It exploits the fact that a small draft model can quickly generate candidate tokens, and the large target model can verify multiple tokens in a single forward pass.

🚀
The Algorithm
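1. The draft model generates k tokens cheaply (fast, lower quality).
2. The target model runs one forward pass over the context plus all k draft tokens.
3. Acceptance sampling: accept draft token i with probability min(1, p_target(token_i) / p_draft(token_i)). At the first rejection, discard the remaining draft tokens and sample a correction from the normalized residual distribution max(0, p_target − p_draft).
4. Each target forward pass therefore yields between 1 and k + 1 tokens: the accepted draft tokens plus either the correction or, if all k were accepted, one bonus token sampled from the target.
There is no quality degradation: the output distribution matches sampling from the target model alone. The speedup comes from accepted tokens costing no extra target forward passes.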
[Figure: a small, fast draft model cheaply proposes k draft tokens ("the", "cat", "sat", "on", "the"); the target model verifies all k tokens in one forward pass, accepting them as free tokens or, on rejection, sampling a correction from the target.]
Speculative decoding: the draft model guesses k tokens cheaply; the target model verifies all in a single forward pass. Accepted tokens are free — only rejections cost extra work.
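The verification step can be sketched with toy probability tables. The code below follows the standard speculative sampling rule (accept with probability min(1, p_target / p_draft), otherwise sample from the normalized residual); the function names and the dictionary-based distributions are illustrative assumptions, and a real system would read these probabilities from the two models' logits.

```python
import random

def sample(dist, rng):
    """Draw one token from a dict of token -> non-negative weight."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

def speculative_step(draft_tokens, p_draft, p_target, rng):
    """Verify k draft tokens against the target model's distributions.

    p_draft[i] and p_target[i] map token -> probability at draft position i.
    p_target holds k + 1 entries; the last one is used for the bonus token.
    Returns the tokens emitted by this single target forward pass.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        q = p_draft[i][tok]              # > 0, since the draft sampled this token
        p = p_target[i].get(tok, 0.0)
        if rng.random() < min(1.0, p / q):
            emitted.append(tok)          # accepted
            continue
        # Rejected: sample a correction from max(0, p_target - p_draft), then stop.
        residual = {t: max(0.0, pt - p_draft[i].get(t, 0.0))
                    for t, pt in p_target[i].items()}
        emitted.append(sample(residual, rng))
        return emitted
    # All k drafts accepted: the same pass also yields one bonus token.
    emitted.append(sample(p_target[len(draft_tokens)], rng))
    return emitted

rng = random.Random(0)

# Draft and target agree exactly: both drafts accepted, plus a bonus token.
agree = [{"the": 1.0}, {"cat": 1.0}]
all_accepted = speculative_step(["the", "cat"], agree, agree + [{"sat": 1.0}], rng)
print(all_accepted)  # ['the', 'cat', 'sat']

# Target assigns zero probability to the second draft token: it is replaced.
corrected = speculative_step(
    ["the", "dog"],
    [{"the": 1.0}, {"dog": 1.0}],
    [{"the": 1.0}, {"cat": 1.0}, {"sat": 1.0}],
    rng,
)
print(corrected)  # ['the', 'cat']
```

The two calls use deterministic distributions so the outcome does not depend on the random seed: full agreement yields k + 1 tokens, and a hard disagreement yields the accepted prefix plus one correction.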

Speculative Decoding Variants

| Method | Draft Source | Overhead | Acceptance Rate | Speedup |
|---|---|---|---|---|
| Draft Model | Smaller LM (e.g. 7B → 70B) | Medium | High | 1.5–2.5× |
| N-Gram Matching | Repeated phrases in prompt | Near zero | Medium | 1.2–1.5× |
| EAGLE / EAGLE-2 | Feature-based head | Low | Very High | 2–3× |
| MLP Speculator | Trained MLP on hidden states | Low | High | 1.5–2× |
⚠️
Speculative Decoding is NOT Always a Win
At large batch sizes (>32), the target model's compute is already saturated, so verifying extra tokens is no longer nearly free and the draft model just adds overhead. Speculative decoding shines at batch sizes 1–8. Blindly setting k=10 without tuning can backfire: when acceptance rates are low, most of the draft and verification work is thrown away, and latency can degrade by 175% or more. Always benchmark your actual workload.
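A back-of-the-envelope model helps show why k needs tuning. The sketch below uses the standard analysis of speculative sampling under two simplifying assumptions that are not from this lesson: each draft token is accepted independently with probability `alpha`, and a verification pass costs the same as one ordinary target decode step (reasonable when decode is memory-bound, not at saturated batch sizes). `draft_cost_ratio` is the cost of one draft step relative to one target step; all names are illustrative.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target forward pass with k draft tokens."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimated_speedup(alpha, k, draft_cost_ratio):
    """Tokens per pass divided by the cost of k draft steps plus one target step."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost_ratio + 1.0)

good = estimated_speedup(alpha=0.8, k=5, draft_cost_ratio=0.05)
bad = estimated_speedup(alpha=0.3, k=10, draft_cost_ratio=0.05)
print(f"alpha=0.8, k=5:  {good:.2f}x")   # 2.95x
print(f"alpha=0.3, k=10: {bad:.2f}x")    # 0.95x
```

Under these assumptions, a well-matched draft model with a moderate k gives close to a 3× gain, while a large k with a low acceptance rate drops below 1×, meaning speculation is slower than plain decoding. Real numbers depend on batch size, hardware, and workload, so treat this only as intuition before benchmarking.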
Shell · Speculative decoding in vLLM
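# --num-speculative-tokens is the k value; tune it for your workload.
# Flag names differ across vLLM releases (newer ones use --speculative-config); check your version.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5 \
    --tensor-parallel-size 4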

Token Throughput Metrics

Knowing your performance numbers is as important as achieving them. Here are the key metrics:

| Metric | Full Name | What It Measures | Good Target |
|---|---|---|---|
| TTFT | Time to First Token | Latency from request to first token | < 500ms (interactive) |
| TPOT | Time Per Output Token | Average time between tokens during decode | < 50ms (interactive) |
| TBT | Time Between Tokens | Inter-token latency (same as TPOT) | < 50ms |
| E2E Latency | End-to-End Latency | Total time for request completion | Depends on output length |
| TPS | Tokens Per Second | Total throughput across all requests | Maximize for batch jobs |
| RPS | Requests Per Second | Request throughput | Depends on output length |
E2E Latency Estimation
E2E = TTFT + (output_tokens × TPOT)
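These metrics can be derived from nothing more than a request start time and the arrival time of each output token. The sketch below is one way to do that; the function names and the timestamps are made up for illustration, and in practice the timestamps would come from your client or load-testing tool.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, TPOT, and E2E latency (seconds) from token arrival times."""
    ttft = token_times[0] - request_start
    # Gaps between consecutive tokens are the per-token TBT values.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    e2e = token_times[-1] - request_start
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}

def estimate_e2e(ttft, output_tokens, tpot):
    """The estimate from this section: E2E = TTFT + (output_tokens x TPOT)."""
    return ttft + output_tokens * tpot

# Hypothetical request: first token after 0.40 s, then one token every 0.05 s.
measured = latency_metrics(0.0, [0.40, 0.45, 0.50, 0.55])
print(measured)

# Planning estimate for a 128-token answer with TTFT = 0.5 s and TPOT = 0.05 s.
planned = estimate_e2e(0.5, 128, 0.05)
print(f"{planned:.1f} s")  # 6.9 s
```

Note that a measured E2E for n tokens contains only n − 1 inter-token gaps, so the estimate formula overshoots by about one TPOT; for outputs of more than a few tokens the difference is negligible.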

Benchmarking with vLLM

Shell · vLLM benchmarking
# Official vLLM benchmark tool
python benchmarks/benchmark_throughput.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --num-prompts 1000 \
    --input-len 512 \
    --output-len 128

# Benchmark latency (per-request metrics)
python benchmarks/benchmark_latency.py \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --batch-size 1 \
    --input-len 512 \
    --output-len 128 \
    --num-iters 30
🔑
Key Takeaways
1. The KV cache is your biggest VRAM consumer at long contexts; quantizing it to FP8 halves it with minimal quality impact on most tasks.
2. Enable prefix caching whenever you have repeated prompt prefixes.
3. Speculative decoding gives a 1.5–3× speedup at low batch sizes, but tune the k parameter.
4. Always measure TTFT, TPOT, and TPS; they tell very different stories about your serving health.
