DAY 3

04 / 09

KV Caching, Speculative Decoding & Token Throughput

5 min · Intermediate · 1,041 words

Understand the most critical optimization in LLM inference — the KV cache — plus speculative decoding and how to measure and improve token throughput.

What you will learn

01 The KV Cache: LLM's Working Memory

02 KV Cache Size Math

03 Prefix Caching (Automatic Prompt Caching)

04 KV Cache Quantization

05 Speculative Decoding

06 Token Throughput Metrics

The KV Cache: LLM's Working Memory

Every transformer layer computes attention: each token attends to all previous tokens through their Key (K) and Value (V) vectors. Without caching, generating token #500 means recomputing K and V for all 499 previous tokens at every layer. The KV cache stores those tensors, so each decode step runs the model on only the newest token and reads everything else from the cache.

🔍

Why Cache K and V, Not Q?

In self-attention, the Query (Q) at each step is computed from the current token only and is never needed again. Keys and Values, on the other hand, represent all past tokens that the current token attends to. Those past computations are identical for every future token in the sequence, so caching them avoids recomputing N key and value projections per layer at every decode step.
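The equivalence described above can be checked on a toy example. The following is a minimal single-head sketch in NumPy, not how a real inference engine lays out its cache; the weight matrices, dimensions, and token embeddings are random placeholders chosen for illustration.

Python · Toy check that cached decode matches full recompute

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query vector over t keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])        # shape (t,)
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                           # shape (d,)

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((6, d))             # hypothetical token embeddings

# With a cache: project only the newest token, append its K and V.
K_cache, V_cache, cached_out = [], [], []
for x in tokens:
    K_cache.append(x @ Wk)
    V_cache.append(x @ Wv)
    cached_out.append(attend(x @ Wq, np.array(K_cache), np.array(V_cache)))

# Without a cache: re-project every earlier token at every step.
naive_out = [
    attend(tokens[t] @ Wq, tokens[: t + 1] @ Wk, tokens[: t + 1] @ Wv)
    for t in range(len(tokens))
]

print(np.allclose(cached_out, naive_out))
```

Both paths produce the same attention outputs; the cached path simply skips the repeated K and V projections for earlier tokens.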

🧠

Prefill vs Decode

Prefill: process all prompt tokens in parallel (one forward pass, compute-bound). All K and V tensors are computed and stored in the KV cache. Decode: generate one token at a time. Each step loads the model weights and the entire KV cache from VRAM, computes attention, and outputs one token, so it is memory-bandwidth-bound. For long generations, this is where most of your serving time is spent.
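To get a feel for why decode is bandwidth-bound, a rough lower bound on step time is the bytes read per step divided by memory bandwidth. The sketch below assumes memory traffic is the only cost; the model size, cache size, and 2 TB/s bandwidth are hypothetical numbers, not measurements.

Python · Back-of-envelope decode step time

```python
def decode_step_ms(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step if reading memory is the only cost."""
    return (weight_bytes + kv_cache_bytes) / bandwidth_bytes_per_s * 1e3

# Hypothetical setup: 8B parameters in BF16, 2 GB of KV cache, 2 TB/s of bandwidth.
weights = 8e9 * 2
kv = 2e9
step = decode_step_ms(weights, kv, 2e12)
print(f"{step:.1f} ms per step, at most {1e3 / step:.0f} steps/s")
```

Real step times are higher because of kernel launch overhead, compute, and communication, but the estimate shows why shrinking the bytes read per step (smaller weights, smaller KV cache) directly helps decode.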

KV Cache Size Math

KV Cache Size (bytes)

2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × dtype_bytes

For Llama-3 8B (32 layers, 8 KV heads, head_dim=128, BF16, batch=32, seq=4096):

2 × 32 × 8 × 128 × 4096 × 32 × 2 bytes = 17,179,869,184 bytes ≈ 17.2 GB (16 GiB)
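The formula translates directly into a small helper. This is a sketch of the arithmetic only; the function name is made up for illustration, and the example evaluates the Llama-3 8B configuration listed above.

Python · KV cache size calculator

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, dtype_bytes):
    # The leading 2 accounts for storing both K and V.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# 32 layers, 8 KV heads, head_dim=128, seq=4096, batch=32, BF16 (2 bytes)
total = kv_cache_bytes(32, 8, 128, 4096, 32, 2)
per_token = kv_cache_bytes(32, 8, 128, 1, 1, 2)

print(f"total: {total / 1e9:.2f} GB ({total / 2**30:.0f} GiB)")
print(f"per token: {per_token / 1024:.0f} KiB")
```

The per-token figure is the handy one to remember: multiply it by the total number of tokens in flight (batch size × sequence length) to budget VRAM.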

⚠️

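KV Cache Grows Linearly with Context

Double your context length → double your KV cache size: growth is linear in sequence length (it is attention compute, not the cache, that grows quadratically). This is why 128k context models are challenging. At 128k tokens with batch 8, a 70B model's KV cache alone runs to hundreds of GB: roughly 340 GB in BF16, assuming a Llama-3-70B-style layout with 80 layers, 8 KV heads, and head_dim=128. This is driving research into KV compression, sliding window attention, and CPU offloading.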

Prefix Caching (Automatic Prompt Caching)

When many requests share the same prefix (e.g., a system prompt), vLLM can cache those KV blocks and reuse them for any matching request. It identifies matching blocks by hashing their content.

Scenario	Without Prefix Cache	With Prefix Cache	Saving
System prompt: 2k tokens	Process 2k tokens per request	Process only the tokens after the prompt	~100% of system-prompt prefill
Multi-turn chat (4 turns)	Re-process all history	Reuse cached history	~75% of prefill
RAG with fixed context	Re-process docs per query	Cache doc KV blocks	60–80% of prefill
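The content-hashing idea can be illustrated with a toy block cache. This is a simplified sketch, not vLLM's actual implementation: the block size, hash function, and helper names are all assumptions made for the example. Each block's hash includes its parent's hash, so a block only matches when the entire prefix before it matches too.

Python · Toy prefix cache with chained block hashes

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (hypothetical; real block sizes are larger)

def block_hashes(token_ids):
    """Hash each full block together with its parent hash, so a block's
    identity depends on the whole prefix that precedes it."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # ignore a partial last block
    for i in range(0, full, BLOCK_SIZE):
        block = ",".join(map(str, token_ids[i:i + BLOCK_SIZE])).encode()
        parent = hashlib.sha256(parent + block).digest()
        hashes.append(parent)
    return hashes

def cached_prefix_tokens(token_ids, cache):
    """Count leading tokens whose blocks are already cached, then register all blocks."""
    hashes = block_hashes(token_ids)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    cache.update(hashes)
    return hits * BLOCK_SIZE

cache = set()
system = list(range(100, 108))                      # 8-token shared system prompt
first = cached_prefix_tokens(system + [1, 2, 3, 4], cache)
second = cached_prefix_tokens(system + [5, 6, 7, 8], cache)
print(first, second)
```

The first request finds nothing cached; the second reuses the two system-prompt blocks and only needs prefill for its own four tokens.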

Python · Measuring cache hit rate in vLLM

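import requests

# Query the vLLM metrics endpoint (Prometheus text format)
response = requests.get("http://localhost:8000/metrics", timeout=10)
response.raise_for_status()

# Metric names vary across vLLM versions; typical ones include:
# vllm:gpu_cache_usage_perc  - fraction of KV cache in use
# vllm:cpu_cache_usage_perc  - CPU offload cache usage
# vllm:num_preemptions_total - requests preempted under memory pressure
# prefix-cache hit metrics   - name depends on the version; search for "prefix_cache"
for line in response.text.splitlines():
    if not line.startswith("#") and ("cache" in line or "preempt" in line):
        print(line)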
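For dashboards or alerts it is often easier to work with parsed values than raw text. Below is a minimal sketch of a parser for the Prometheus text format; it ignores optional timestamps and other edge cases, and the sample payload (including the label and values) is made up for illustration rather than captured from a real server.

Python · Parsing the metrics text into a dictionary

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_with_labels: float}."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP and TYPE lines
            continue
        name, _, value = line.rpartition(" ")
        try:
            values[name] = float(value)
        except ValueError:
            pass                               # skip lines this simple parser cannot read
    return values

# Made-up sample payload in the same shape as a /metrics response.
sample = """\
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.42
# TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="demo"} 3.0
"""

metrics = parse_metrics(sample)
for name, value in metrics.items():
    print(name, value)
```

In practice you would pass `response.text` from the metrics endpoint instead of the sample string, or use an existing Prometheus client library.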

KV Cache Quantization

You can quantize the KV cache to FP8 independently of the weight precision. Relative to a 16-bit (FP16/BF16) cache, this saves 50% of KV cache VRAM with minimal quality impact on most tasks.

Shell · FP8 KV cache in vLLM

# --kv-cache-dtype fp8 halves KV cache memory relative to a 16-bit cache
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --kv-cache-dtype fp8 \
    --dtype bfloat16

Speculative Decoding

Speculative decoding is a clever parallelization trick for decode. It exploits the fact that a small draft model can quickly generate candidate tokens, and the large target model can verify multiple tokens in a single forward pass.

🚀

The Algorithm

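1. The draft model generates k tokens cheaply (fast, lower quality). 2. The target model runs one forward pass over the context plus all k draft tokens. 3. Acceptance sampling: accept draft token i with probability min(1, p_target(token_i) / p_draft(token_i)); at the first rejection, discard the remaining draft tokens and sample a correction from the normalized residual max(0, p_target − p_draft). 4. Each target forward pass yields between 1 and k+1 tokens: the accepted draft tokens plus one corrected or bonus token. The output distribution matches the target model's, so there is no quality degradation. The speedup comes from accepted tokens costing only the cheap draft pass.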

Speculative decoding: the draft model guesses k tokens cheaply; the target model verifies all in a single forward pass. Accepted tokens are free — only rejections cost extra work.
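The acceptance step can be sketched in a few lines of NumPy. This is a minimal illustration of standard speculative sampling over explicit probability tables, not vLLM's implementation; the function name, vocabulary size, and distributions are placeholders.

Python · Sketch of the speculative sampling acceptance step

```python
import numpy as np

def verify_draft(draft_tokens, p_draft, p_target, rng):
    """Decide which draft tokens to keep.

    draft_tokens: k token ids proposed by the draft model
    p_draft:  (k, vocab) draft distributions at each drafted position
    p_target: (k + 1, vocab) target distributions from one forward pass
    Returns the tokens emitted this step (between 1 and k + 1).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        accept_prob = min(1.0, p_target[i, tok] / p_draft[i, tok])
        if rng.random() < accept_prob:
            out.append(int(tok))
            continue
        # Rejected: resample from the normalized residual max(0, p_target - p_draft)
        # and drop the remaining draft tokens.
        residual = np.maximum(p_target[i] - p_draft[i], 0.0)
        out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
        return out
    # All k accepted: sample one bonus token from the target's next distribution.
    out.append(int(rng.choice(p_target.shape[1], p=p_target[-1])))
    return out

rng = np.random.default_rng(0)
vocab, k = 5, 3
p_target = rng.dirichlet(np.ones(vocab), size=k + 1)
p_draft = p_target[:k].copy()   # a perfect draft: every token is accepted
draft = [int(rng.choice(vocab, p=p_draft[i])) for i in range(k)]
tokens = verify_draft(draft, p_draft, p_target, rng)
print(len(tokens), tokens[:k] == draft)
```

With a perfect draft all k tokens are accepted and the bonus token brings the step to k + 1 tokens. The worse the draft matches the target, the earlier the loop exits.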

Speculative Decoding Variants

Method	Draft Source	Overhead	Acceptance Rate	Speedup
Draft Model	Smaller LM (e.g. 7B → 70B)	Medium	High	1.5–2.5×
N-Gram Matching	Repeated phrases in prompt	Near zero	Medium	1.2–1.5×
EAGLE / EAGLE-2	Feature-based head	Low	Very High	2–3×
MLP Speculator	Trained MLP on hidden states	Low	High	1.5–2×

⚠️

Speculative Decoding is NOT Always a Win

At large batch sizes (>32), the target model's compute is already saturated and the draft model just adds overhead. Spec decoding shines at batch size 1–8. Blindly setting k=10 without tuning can increase latency by 175% or more if acceptance rates are low. Always benchmark your actual workload.
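A simplified analytical model helps explain why k needs tuning. The sketch below assumes each draft token is accepted independently with probability alpha and that one draft step costs a fraction c of a target pass; both are idealizations, the values used are hypothetical, and batching effects are ignored entirely.

Python · Expected speedup as a function of k and acceptance rate

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per target pass for acceptance rate alpha."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def expected_speedup(alpha, k, c):
    """c is the cost of one draft step relative to one target forward pass."""
    return expected_tokens(alpha, k) / (k * c + 1)

for alpha in (0.8, 0.4):
    row = {k: round(expected_speedup(alpha, k, c=0.1), 2) for k in (2, 5, 10)}
    print(f"alpha={alpha}: {row}")
```

Under these assumptions, a high acceptance rate rewards a moderate k, while a low acceptance rate with k=10 drops below 1×, meaning speculative decoding is slower than plain decoding.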

Shell · Speculative decoding in vLLM

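# --num-speculative-tokens is the k value; tune it for your workload.
# Flag names may differ between vLLM releases; check the docs for your version.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5 \
    --tensor-parallel-size 4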

Token Throughput Metrics

Knowing your performance numbers is as important as achieving them. Here are the key metrics:

Metric	Full Name	What It Measures	Good Target
TTFT	Time to First Token	Latency from request to first token	< 500ms (interactive)
TPOT	Time Per Output Token	Average time between tokens during decode	< 50ms (interactive)
TBT	Time Between Tokens	Latency of each individual gap between tokens (TPOT is its average)	< 50ms
E2E Latency	End-to-End Latency	Total time for request completion	Depends on output length
TPS	Tokens Per Second	Total throughput across all requests	Maximize for batch jobs
RPS	Requests Per Second	Request throughput	Depends on output length

E2E Latency Estimation

E2E ≈ TTFT + (output_tokens − 1) × TPOT
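If you record a timestamp for each streamed token, these metrics fall out of simple arithmetic. The sketch below uses hypothetical timestamps rather than a live server, and the helper name is made up for illustration. TPOT is averaged over the gaps between tokens, since the first token is already covered by TTFT.

Python · Computing TTFT, TPOT and E2E from token timestamps

```python
def latency_metrics(request_time, token_times):
    """Compute latency metrics (in seconds) from token arrival timestamps."""
    ttft = token_times[0] - request_time
    e2e = token_times[-1] - request_time
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e, "max_tbt": max(gaps, default=0.0)}

# Hypothetical stream: request at t=0, first token at 0.30 s, then one token every 0.04 s.
times = [0.30 + 0.04 * i for i in range(5)]
m = latency_metrics(0.0, times)
print({name: round(value, 3) for name, value in m.items()})
```

Tracking the worst gap (`max_tbt`) alongside the average is useful because a single long stall is visible to users even when TPOT looks healthy.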

Benchmarking with vLLM

Shell · vLLM benchmarking

# Official vLLM benchmark tool (run from the vLLM repository root)
python benchmarks/benchmark_throughput.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --num-prompts 1000 \
    --input-len 512 \
    --output-len 128

# Benchmark latency (per-request metrics)
python benchmarks/benchmark_latency.py \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --batch-size 1 \
    --input-len 512 \
    --output-len 128 \
    --num-iters 30

🔑

Key Takeaways

1. KV cache is your biggest VRAM consumer at long contexts; quantizing it to FP8 halves it with minimal quality impact on most tasks. 2. Enable prefix caching whenever you have repeated prompt prefixes. 3. Speculative decoding gives 1.5–3× speedup at low batch sizes, but tune the k parameter. 4. Always measure TTFT, TPOT, and TPS; they tell very different stories about your serving health.
