The Engineering Codex/LLM Systems Engineering
DAY 7 · AM
08 / 09

Prompt Caching & Cost Optimization

schedule4 minsignal_cellular_altIntermediate960 words
Learn how to slash LLM inference costs by 30–80% using prompt caching, semantic caching, model routing, and intelligent batching strategies.

What you will learn

01The Economics of LLM Serving
02Prompt Caching (Provider-Level)
03Semantic Caching
04Model Routing
05Batching for Cost
06Output Token Cost — The Hidden Driver

The Economics of LLM Serving

LLM inference costs scale with tokens. At production scale, an unoptimized pipeline can cost 5–10× more than necessary. Cost optimization is a core infrastructure responsibility, not an afterthought.

Monthly LLM Cost Estimate
Cost = Daily_requests × Avg_input_tokens × input_price + Avg_output_tokens × output_price

Prompt Caching (Provider-Level)

Major LLM providers now offer prompt caching: if your request's prefix (system prompt + conversation history) matches a recent request's prefix, the provider charges 50–90% less for those cached tokens.

ProviderCache DiscountMin Cacheable TokensCache Lifespan
Anthropic (Claude)90% off cached input tokens1,024 tokens5 minutes
OpenAI (GPT-4o)50% off cached input tokens1,024 tokens~5–10 minutes
Google (Gemini)75% off cached tokens32k tokensConfigurable (1hr+)
Baseline (no optimization) 100% + Prompt caching (90% off cached tokens) ~70% + Semantic cache (skip 40% of calls) ~42% + Model routing (80% to mini) ~12% ~10% of original cost 90% saved →
Stacking optimizations compounds: prompt caching + semantic caching + model routing can reduce LLM costs by up to 90%. Each layer is independent and additive.
Python · Anthropic prompt caching
import anthropic

client = anthropic.Anthropic()

# Mark system prompt as cacheable
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    system=[{
        "type": "text",
        "text": long_system_prompt,  # Must be > 1024 tokens
        "cache_control": {"type": "ephemeral"}  # ← cache this!
    }],
    messages=[{"role": "user", "content": user_query}],
    max_tokens=1024
)

# Check cache stats in response
usage = response.usage
print(f"Cache hits: {usage.cache_read_input_tokens}")
print(f"Cache miss: {usage.cache_creation_input_tokens}")

Semantic Caching

Provider prompt caching requires exact prefix matches. Semantic caching is broader: if a new query is semantically similar to a previous query, return the cached response directly — no LLM call at all.

🔄
How Semantic Caching Works
1. Embed the incoming query. 2. Search a VectorDB of cached queries. 3. If similarity > threshold (e.g., 0.95 cosine), return the cached response. 4. Otherwise, call the LLM, cache the response + query embedding. This can eliminate 20–60% of LLM calls for FAQ-style applications.
Python · Semantic cache with Redis
import hashlib
from redis.commands.search.query import Query

async def semantic_cache_lookup(query: str, threshold=0.95):
    # Embed the query
    query_vec = await embed(query)
    
    # Search Redis vector index
    results = redis.ft("cache_idx").search(
        Query("*=>[KNN 1 @vec $vec AS score]")
            .return_fields("response", "score")
            .sort_by("score"),
        query_params={"vec": query_vec}
    )
    
    if results.docs and float(results.docs[0].score) > threshold:
        return results.docs[0].response  # Cache hit!
    
    return None  # Cache miss — call LLM

Model Routing

Not all queries need GPT-4 or Claude Opus. Route simple queries to cheaper models (10–100× cheaper per token) and only use expensive models for complex tasks.

Query TypeModel TierCost RatioExample
Simple Q&A, classificationSmall (GPT-4o-mini, Haiku)"What day is it?"
Standard generationMid (GPT-4o, Sonnet)10–20×Email drafting
Complex reasoningLarge (o1, Opus)100×Multi-step analysis

Batching for Cost

For non-real-time workloads (document processing, data enrichment, overnight batch jobs), batch API endpoints offer 50% cost reduction at the expense of latency (up to 24 hours).

Python · OpenAI Batch API
from openai import OpenAI
import json

client = OpenAI()

# Create JSONL batch file
requests = [{
    "custom_id": f"req-{i}",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": doc}]
    }
} for i, doc in enumerate(documents)]

# Submit batch (50% cheaper!)
batch = client.batches.create(
    input_file_id=file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

Output Token Cost — The Hidden Driver

Input tokens are cheap; output tokens are typically 3–5× more expensive per token (because each output token requires a full decode step). An uncontrolled LLM that writes verbose, padded answers can cost 5× more than a tuned one. Output cost controls are as important as input optimization.

✂️
Output Cost Reduction Techniques
1. Set max_tokens aggressively — cap at what's realistic for the task. 2. Prompt engineering: instruct the model to be concise ("reply in under 100 words"). 3. Structured output: JSON responses are shorter than prose narration. 4. Temperature tuning: lower temperature reduces rambling. 5. Stop sequences: define where the model should stop generating to prevent overruns.

Cost Optimization Checklist

  • Enable provider prompt caching — free if your system prompt is >1024 tokens, up to 90% savings on cached tokens
  • Compress system prompts — trim verbose instructions, use fewer words. Every token costs money.
  • Implement semantic caching — Redis + VectorDB. Save 20–60% on repetitive queries.
  • Model routing — send simple queries to cheaper models. Can save 70–90% on token cost.
  • RAG over long context — don't send 100k token documents; retrieve relevant 2k tokens instead.
  • Output length limits — set max_tokens aggressively. Don't let models ramble.
  • Batch API for offline workloads — automatic 50% cost reduction.
  • Self-host for high volume — at scale (>100M tokens/month), self-hosted vLLM beats API pricing.
🔑
Key Takeaways
1. Enable prompt caching in your API calls — it's a few lines of code for up to 90% savings. 2. Semantic caching with Redis + embeddings can eliminate 30–60% of LLM calls for FAQ workloads. 3. Model routing is the highest-leverage optimization: sending simple queries to cheap models saves 90%+ on those calls. 4. For self-hosted: quantization + batching typically achieves <$0.10/1M tokens vs $1–5/1M for major APIs.

Finished reading?