
Prompt Caching & Cost Optimization
Learn how to slash LLM inference costs by 30–80% using prompt caching, semantic caching, model routing, and intelligent batching strategies.
The Economics of LLM Serving
LLM inference costs scale with tokens. At production scale, an unoptimized pipeline can cost 5–10× more than necessary. Cost optimization is a core infrastructure responsibility, not an afterthought.
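To make "costs scale with tokens" concrete, here is a minimal sketch of a monthly spend estimate. The function name, the workload numbers, and the prices are illustrative assumptions, not any provider's actual rates.

```python
def monthly_cost(requests_per_day: int,
                 input_tokens: int,
                 output_tokens: int,
                 input_price_per_mtok: float,
                 output_price_per_mtok: float,
                 days: int = 30) -> float:
    """Estimate monthly spend in dollars; prices are per million tokens."""
    per_request = (input_tokens * input_price_per_mtok
                   + output_tokens * output_price_per_mtok) / 1_000_000
    return per_request * requests_per_day * days


# Hypothetical workload: 100k requests/day, 3,000 input + 500 output tokens,
# at assumed prices of $3 / $15 per million input / output tokens.
baseline = monthly_cost(100_000, 3_000, 500, 3.0, 15.0)
print(f"Baseline: ${baseline:,.0f}/month")
```

Plugging each optimization below into a model like this is a quick way to check whether it is worth the engineering effort for a given workload.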
Prompt Caching (Provider-Level)
Major LLM providers now offer prompt caching: if your request's prefix (system prompt + conversation history) matches a recent request's prefix, the provider charges 50–90% less for those cached tokens.
| Provider | Cache Discount | Min Cacheable Tokens | Cache Lifespan |
|---|---|---|---|
| Anthropic (Claude) | 90% off cached input tokens | 1,024 tokens | 5 minutes |
| OpenAI (GPT-4o) | 50% off cached input tokens | 1,024 tokens | ~5–10 minutes |
| Google (Gemini) | 75% off cached tokens | 32k tokens | Configurable (1hr+) |
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# long_system_prompt and user_query are supplied by the application.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_system_prompt,  # must be at least 1,024 tokens to be cached
        "cache_control": {"type": "ephemeral"},  # cache the prefix up to this block
    }],
    messages=[{"role": "user", "content": user_query}],
)

# Check cache stats in the response
usage = response.usage
print(f"Tokens read from cache (hit): {usage.cache_read_input_tokens}")
print(f"Tokens written to cache (miss): {usage.cache_creation_input_tokens}")
```
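To see how the discount plays out across many requests, the following is a rough cost model rather than a billing formula. It assumes cache reads are billed at 10% of the base input price (the 90% discount in the table) and that the first request pays a 25% surcharge to write the cache. The surcharge, the function names, and the $3 per million tokens price are assumptions for illustration; check current provider pricing before relying on them.

```python
def uncached_input_cost(prefix_tokens: int, suffix_tokens: int,
                        requests: int, base_price_per_mtok: float) -> float:
    """Input cost when every request pays full price for the whole prompt."""
    return (prefix_tokens + suffix_tokens) * requests * base_price_per_mtok / 1_000_000


def cached_input_cost(prefix_tokens: int, suffix_tokens: int,
                      requests: int, base_price_per_mtok: float,
                      read_multiplier: float = 0.10,    # assumed: 90% off cache reads
                      write_multiplier: float = 1.25):  # assumed: 25% write surcharge
    """Input cost when the first request writes the cache and the rest hit it."""
    write = prefix_tokens * write_multiplier
    reads = prefix_tokens * read_multiplier * (requests - 1)
    uncached = suffix_tokens * requests  # the per-request suffix is never cached
    return (write + reads + uncached) * base_price_per_mtok / 1_000_000


# Hypothetical: 5,000-token system prompt, 200-token user turn, 1,000 requests
# that all arrive while the cache entry is still alive.
without_cache = uncached_input_cost(5_000, 200, 1_000, 3.0)
with_cache = cached_input_cost(5_000, 200, 1_000, 3.0)
print(f"Without cache: ${without_cache:.2f}, with cache: ${with_cache:.2f}")
print(f"Input savings: {1 - with_cache / without_cache:.0%}")
```

The model also shows why low-traffic endpoints benefit less: if requests arrive less often than the cache lifespan, every request pays the write price and never gets a read.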
Semantic Caching
Provider prompt caching requires exact prefix matches. Semantic caching is broader: if a new query is semantically similar to a previous query, return the cached response directly — no LLM call at all.
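```python
import numpy as np
import redis
from redis.commands.search.query import Query

redis_client = redis.Redis()


async def semantic_cache_lookup(query: str, threshold: float = 0.95):
    # Embed the query; embed() is the application's own embedding function.
    # Assumes the index stores float32 vectors with the COSINE distance metric.
    query_vec = np.asarray(await embed(query), dtype=np.float32).tobytes()

    # Nearest-neighbour search on the Redis vector index
    knn = (
        Query("*=>[KNN 1 @vec $vec AS score]")
        .return_fields("response", "score")
        .sort_by("score")
        .dialect(2)  # KNN syntax requires query dialect 2
    )
    results = redis_client.ft("cache_idx").search(knn, query_params={"vec": query_vec})

    # Redis returns cosine *distance* (0 = identical), so convert to similarity
    if results.docs and 1.0 - float(results.docs[0].score) >= threshold:
        return results.docs[0].response  # Cache hit
    return None  # Cache miss: call the LLM
```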
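The Redis lookup covers only the read path and needs a live index to run. As a self-contained illustration of the whole idea (store on miss, return on a close-enough match), here is a small in-memory sketch. `toy_embed` is a deliberately crude bag-of-words stand-in for a real embedding model, and the class name and threshold are hypothetical choices.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: bag-of-words token counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str):
        """Return the cached response of the most similar query, or None."""
        q = toy_embed(query)
        best = max(self.entries,
                   key=lambda entry: cosine_similarity(q, entry[0]),
                   default=None)
        if best is not None and cosine_similarity(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))


cache = SemanticCache(threshold=0.8)
cache.put("how do I reset my password", "Use the 'Forgot password' link.")
hit = cache.get("how do I reset my password please")  # near-duplicate wording
miss = cache.get("what is your refund policy")        # unrelated query
print(hit, miss)
```

The threshold is the main tuning knob: set it too low and users receive answers to questions they did not ask, set it too high and the cache rarely hits.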
Model Routing
Not all queries need GPT-4 or Claude Opus. Route simple queries to cheaper models (10–100× cheaper per token) and only use expensive models for complex tasks.
| Query Type | Model Tier | Cost Ratio | Example |
|---|---|---|---|
| Simple Q&A, classification | Small (GPT-4o-mini, Haiku) | 1× | "What day is it?" |
| Standard generation | Mid (GPT-4o, Sonnet) | 10–20× | Email drafting |
| Complex reasoning | Large (o1, Opus) | 100× | Multi-step analysis |
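One simple way to implement the tiers in the table is a rule-based router in front of the model call. The sketch below is a heuristic illustration only: the keyword sets, the 30-word cutoff, and the tier-to-model mapping are assumptions, and production routers often use a small classifier model instead.

```python
import re

# Tier-to-model mapping taken from the table; adjust to the models you use.
MODEL_TIERS = {"small": "gpt-4o-mini", "mid": "gpt-4o", "large": "o1"}

# Hypothetical keyword hints for each tier.
REASONING_WORDS = {"analyze", "compare", "prove", "derive", "debug"}
GENERATION_WORDS = {"draft", "write", "rewrite", "summarize"}


def route(query: str) -> str:
    """Pick a model name for a query using cheap keyword and length rules."""
    words = re.findall(r"[a-z']+", query.lower())
    vocabulary = set(words)
    if vocabulary & REASONING_WORDS:
        return MODEL_TIERS["large"]
    if vocabulary & GENERATION_WORDS or len(words) > 30:
        return MODEL_TIERS["mid"]
    return MODEL_TIERS["small"]


print(route("What day is it?"))
print(route("Draft an email to the team about the launch"))
print(route("Analyze the trade-offs between these two architectures"))
```

Because misrouting a hard query to a small model costs quality rather than money, it is worth logging routing decisions and sampling them for review.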
Batching for Cost
For non-real-time workloads (document processing, data enrichment, overnight batch jobs), batch API endpoints offer a 50% cost reduction at the expense of latency (results can take up to 24 hours).
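```python
import json
from openai import OpenAI

client = OpenAI()

# Build one request per document; `documents` is a list of strings
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": doc}],
        },
    }
    for i, doc in enumerate(documents)
]

# Write the requests as a JSONL file and upload it
with open("batch_input.jsonl", "w") as f:
    for request in requests:
        f.write(json.dumps(request) + "\n")

with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

# Submit the batch (billed at a 50% discount)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
```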
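Once a batch completes, the results come back as a JSONL file with one record per request, matched to the input by `custom_id`. The sketch below parses such a file into a dictionary. The record shape is an assumption based on the documented batch output format, the sample records are hand-written for illustration, and `parse_batch_output` is a hypothetical helper name.

```python
import json


def parse_batch_output(jsonl_text: str) -> dict[str, str]:
    """Map custom_id -> assistant message text, skipping failed requests."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response")
        if record.get("error") or not response or response.get("status_code") != 200:
            continue  # failed request: retry or log it separately
        body = response["body"]
        results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results


# Hand-written sample in the assumed output shape: one success, one failure.
sample_output = "\n".join([
    json.dumps({
        "custom_id": "req-0",
        "error": None,
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"role": "assistant",
                                              "content": "Summary of document 0"}}]},
        },
    }),
    json.dumps({
        "custom_id": "req-1",
        "response": None,
        "error": {"code": "invalid_request", "message": "bad input"},
    }),
])

parsed = parse_batch_output(sample_output)
print(parsed)
```

In a real job the text would come from the batch's output file once its status is `completed`; results are not guaranteed to be in input order, which is why the lookup goes through `custom_id`.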
Output Token Cost — The Hidden Driver
Input tokens are comparatively cheap; output tokens typically cost 3–5× more per token, because the prompt is processed in one parallel pass while each output token requires its own sequential decode step. An uncontrolled LLM that writes verbose, padded answers can cost 5× more than a tuned one, so output cost controls are as important as input optimization.
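A quick calculation shows how much of a request's cost the output can account for. The prices ($3 input, $15 output per million tokens) and token counts below are assumed for illustration only.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_mtok: float = 3.0,     # assumed price
                 output_price_per_mtok: float = 15.0):  # assumed price (5x input)
    """Dollar cost of one request; prices are per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Same 1,000-token prompt, verbose answer vs. concise answer
verbose = request_cost(1_000, 800)
concise = request_cost(1_000, 200)
output_share = (800 * 15.0) / (1_000 * 3.0 + 800 * 15.0)

print(f"Verbose: ${verbose:.4f}, concise: ${concise:.4f}")
print(f"Verbose costs {verbose / concise:.1f}x more; "
      f"output is {output_share:.0%} of the verbose request's cost")
```

Under these assumptions, trimming the answer length does more for the bill than any input-side optimization on the same request could.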
Cost Optimization Checklist
- Enable provider prompt caching: automatic on OpenAI, opt-in via `cache_control` on Anthropic. Needs a stable prefix of at least 1,024 tokens and saves up to 90% on cached tokens.
- Compress system prompts: trim verbose instructions and use fewer words. Every token costs money.
- Implement semantic caching: Redis + VectorDB. Save 20–60% on repetitive queries.
- Model routing: send simple queries to cheaper models. Can save 70–90% on token cost.
- RAG over long context: don't send 100k-token documents; retrieve the relevant 2k tokens instead.
- Output length limits: set `max_tokens` aggressively. Don't let models ramble.
- Batch API for offline workloads: automatic 50% cost reduction.
- Self-host for high volume: at scale (>100M tokens/month), self-hosted vLLM can beat API pricing.
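Whether self-hosting wins depends on where your volume sits relative to a break-even point. The sketch below computes that point under a deliberately simplified model: a fixed monthly self-hosting cost compared against a blended API price. Both figures are hypothetical, and the model ignores engineering time, utilization, and quality differences between models.

```python
def break_even_tokens(self_host_monthly_cost: float,
                      api_price_per_mtok: float) -> float:
    """Monthly token volume at which API spend equals the fixed hosting cost."""
    return self_host_monthly_cost / api_price_per_mtok * 1_000_000


def cheaper_option(monthly_tokens: int,
                   self_host_monthly_cost: float,
                   api_price_per_mtok: float) -> str:
    """Return 'self-host' or 'api', whichever costs less at this volume."""
    api_cost = monthly_tokens * api_price_per_mtok / 1_000_000
    return "self-host" if self_host_monthly_cost < api_cost else "api"


# Hypothetical: $1,500/month for a GPU server vs. a blended $10 per million tokens
threshold = break_even_tokens(1_500.0, 10.0)
print(f"Break-even: {threshold / 1_000_000:.0f}M tokens/month")
print(cheaper_option(50_000_000, 1_500.0, 10.0))
print(cheaper_option(300_000_000, 1_500.0, 10.0))
```

With cheaper API models the break-even volume rises proportionally, so rerun the numbers whenever provider pricing changes.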