Prompt Caching & Cost Optimization

4 min · Intermediate · 960 words

Learn how to slash LLM inference costs by 30–80% using prompt caching, semantic caching, model routing, and intelligent batching strategies.

What you will learn

01 The Economics of LLM Serving

02 Prompt Caching (Provider-Level)

03 Semantic Caching

04 Model Routing

05 Batching for Cost

06 Output Token Cost: The Hidden Driver

The Economics of LLM Serving

LLM inference costs scale with tokens. At production scale, an unoptimized pipeline can cost 5–10× more than necessary. Cost optimization is a core infrastructure responsibility, not an afterthought.

Monthly LLM Cost Estimate

Monthly cost = Daily_requests × 30 × (Avg_input_tokens × input_price + Avg_output_tokens × output_price)
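
As a rough worked example, the estimate can be computed directly. This is a minimal sketch: it assumes a 30-day month, applies the per-request cost to every daily request, and uses illustrative traffic and price figures that are not quotes from any provider.

```python
def monthly_llm_cost(
    daily_requests: int,
    avg_input_tokens: float,
    avg_output_tokens: float,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    days: int = 30,
) -> float:
    """Estimate monthly spend in dollars; prices are per 1M tokens."""
    per_request = (
        avg_input_tokens * input_price_per_mtok
        + avg_output_tokens * output_price_per_mtok
    ) / 1_000_000
    return daily_requests * days * per_request


# Hypothetical workload: 100k requests/day, 2,000 input and 300 output tokens each,
# at an assumed $3 / $15 per 1M input / output tokens.
cost = monthly_llm_cost(100_000, 2_000, 300, 3.0, 15.0)
print(f"${cost:,.0f} per month")  # prints: $31,500 per month
```

Even though outputs are far shorter than inputs in this example, they account for roughly 43% of the bill, because the assumed output price is 5× the input price.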

Prompt Caching (Provider-Level)

Major LLM providers now offer prompt caching: if your request's prefix (system prompt + conversation history) matches a recent request's prefix, the provider charges 50–90% less for those cached tokens.

Provider	Cache Discount	Min Cacheable Tokens	Cache Lifespan
Anthropic (Claude)	90% off cached input tokens	1,024 tokens	5 minutes
OpenAI (GPT-4o)	50% off cached input tokens	1,024 tokens	~5–10 minutes
Google (Gemini)	75% off cached tokens	32k tokens	Configurable (1hr+)
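
To see what these discounts mean for a single request, here is a small sketch of the input-side arithmetic. The function name and the example numbers are illustrative assumptions, and the sketch ignores any surcharge a provider may apply when a prefix is first written to the cache.

```python
def effective_input_cost(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_mtok: float,
    cache_discount: float,
) -> float:
    """Dollar cost of one request's input when cached tokens are billed at a discount."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    discounted = cached_tokens * (1 - cache_discount)
    return (uncached + discounted) * price_per_mtok / 1_000_000


# Hypothetical request: 10,000 input tokens, 8,000 of them a cached prefix,
# at an assumed $3 per 1M input tokens and a 90% cache discount.
with_cache = effective_input_cost(10_000, 8_000, 3.0, 0.90)
without_cache = effective_input_cost(10_000, 0, 3.0, 0.90)
print(f"saving: {1 - with_cache / without_cache:.0%}")
```

The saving on the whole request is smaller than the headline discount, because only the cached share of the prompt is discounted.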

Stacking optimizations compounds: prompt caching + semantic caching + model routing can reduce LLM costs by up to 90%. The layers work independently, but each one applies to the cost left over by the others, so their savings multiply rather than add.
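
A minimal sketch of how stacked savings combine, assuming each layer removes a fixed fraction of whatever cost remains after the previous one. The per-layer percentages are hypothetical.

```python
def combined_savings(*layer_savings: float) -> float:
    """Total fractional saving when each layer cuts a fraction of the remaining cost."""
    remaining = 1.0
    for saving in layer_savings:
        if not 0 <= saving <= 1:
            raise ValueError("each saving must be a fraction between 0 and 1")
        remaining *= 1 - saving
    return 1 - remaining


# Hypothetical layers: 40% from prompt caching, 30% from semantic caching,
# 50% from model routing.
total = combined_savings(0.40, 0.30, 0.50)
print(f"{total:.0%}")  # prints: 79%
```

Under this assumption the per-layer percentages cannot simply be summed: 40% + 30% + 50% would exceed 100%, while the compounded figure is 79%.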

Python · Anthropic prompt caching

import anthropic

client = anthropic.Anthropic()

# long_system_prompt and user_query are assumed to be defined elsewhere.
# Mark the system prompt as cacheable
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    system=[{
        "type": "text",
        "text": long_system_prompt,  # must be at least 1,024 tokens to be cached
        "cache_control": {"type": "ephemeral"}  # cache the prefix up to this block
    }],
    messages=[{"role": "user", "content": user_query}],
    max_tokens=1024
)

# Check cache stats in the response
usage = response.usage
print(f"Tokens read from cache: {usage.cache_read_input_tokens}")
print(f"Tokens written to cache: {usage.cache_creation_input_tokens}")

Semantic Caching

Provider prompt caching requires exact prefix matches. Semantic caching is broader: if a new query is semantically similar to a previous query, return the cached response directly — no LLM call at all.

How Semantic Caching Works

1. Embed the incoming query.
2. Search a VectorDB of cached queries.
3. If similarity > threshold (e.g., 0.95 cosine), return the cached response.
4. Otherwise, call the LLM, cache the response + query embedding.

This can eliminate 20–60% of LLM calls for FAQ-style applications.

Python · Semantic cache with Redis

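import numpy as np
import redis
from redis.commands.search.query import Query

# Assumes a search index "cache_idx" with a COSINE vector field "vec"
r = redis.Redis()

async def semantic_cache_lookup(query: str, threshold: float = 0.95):
    # Embed the query; embed() is your own embedding function (not shown)
    query_vec = np.asarray(await embed(query), dtype=np.float32).tobytes()

    # KNN search over the Redis vector index (KNN queries need dialect 2)
    knn = (
        Query("*=>[KNN 1 @vec $vec AS score]")
        .return_fields("response", "score")
        .sort_by("score")
        .dialect(2)
    )
    results = r.ft("cache_idx").search(knn, query_params={"vec": query_vec})

    # With a COSINE index, score is a distance (1 - similarity): lower is closer
    if results.docs and 1 - float(results.docs[0].score) >= threshold:
        return results.docs[0].response  # Cache hit

    return None  # Cache miss: call the LLM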
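
The full loop, including the "call the LLM and cache the result" step, can be sketched without any external service. This is a toy in-memory version for illustration only: `InMemorySemanticCache`, `answer`, and the `embed` and `call_llm` callables are hypothetical names, and the linear scan stands in for a real vector index.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    """Toy semantic cache: linear scan over stored (embedding, response) pairs."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: list = []  # list of (embedding, response) tuples

    def lookup(self, query: str) -> Optional[str]:
        """Return the closest cached response if it clears the threshold."""
        vec = self.embed(query)
        best_score, best_response = 0.0, None
        for cached_vec, response in self.entries:
            score = cosine_similarity(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query: str, response: str) -> None:
        """Remember a response under the embedding of its query."""
        self.entries.append((self.embed(query), response))


def answer(query: str, cache: InMemorySemanticCache,
           call_llm: Callable[[str], str]) -> str:
    """Serve from the cache when possible; otherwise call the LLM and cache it."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached              # cache hit: no LLM call
    response = call_llm(query)     # cache miss: pay for one call
    cache.store(query, response)   # and remember it for similar queries
    return response
```

The threshold is the main tuning knob: set too low, users get answers to questions they did not ask; set too high, the cache rarely hits.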

Model Routing

Not all queries need GPT-4 or Claude Opus. Route simple queries to cheaper models (10–100× cheaper per token) and only use expensive models for complex tasks.

Query Type	Model Tier	Cost Ratio	Example
Simple Q&A, classification	Small (GPT-4o-mini, Haiku)	1×	"What day is it?"
Standard generation	Mid (GPT-4o, Sonnet)	10–20×	Email drafting
Complex reasoning	Large (o1, Opus)	100×	Multi-step analysis
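
A minimal routing sketch based on the tiers in the table. The keyword cues, the 200-word cutoff, and the model identifiers are hypothetical placeholders; a production router would more likely use a small trained classifier or a cheap LLM call to judge difficulty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    model: str          # placeholder identifier, not a real model ID
    relative_cost: int  # rough cost ratio from the table (15 = midpoint of 10-20)


SMALL = Tier("small", "small-model", 1)
MID = Tier("mid", "mid-model", 15)
LARGE = Tier("large", "large-model", 100)

# Hypothetical cues that hint at the kind of work a query needs.
REASONING_CUES = ("analyze", "prove", "step by step", "compare", "plan")
GENERATION_CUES = ("write", "draft", "summarize")


def route(query: str) -> Tier:
    """Pick the cheapest tier that plausibly handles the query."""
    text = query.lower()
    if any(cue in text for cue in REASONING_CUES) or len(text.split()) > 200:
        return LARGE
    if any(cue in text for cue in GENERATION_CUES):
        return MID
    return SMALL


for q in ("What day is it?", "Draft an email to my landlord",
          "Analyze the trade-offs step by step"):
    print(f"{q!r} -> {route(q).name}")
```

Defaulting to the small tier keeps costs low, but misrouted hard queries hurt quality, so it is worth logging routing decisions and escalating when the cheap model's answer looks inadequate.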

Batching for Cost

For non-real-time workloads (document processing, data enrichment, overnight batch jobs), batch API endpoints offer a 50% cost reduction in exchange for higher latency (results can take up to 24 hours).

Python · OpenAI Batch API

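from openai import OpenAI
import json

client = OpenAI()

# documents is assumed to be a list of strings defined elsewhere.
# Build one chat-completion request per document
requests = [{
    "custom_id": f"req-{i}",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": doc}]
    }
} for i, doc in enumerate(documents)]

# Write the requests as JSONL (one JSON object per line)
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload the file so the batch can reference it by ID
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

# Submit batch (50% cheaper)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)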
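
Batch results arrive later as a JSONL file, one record per request, in no guaranteed order. The sketch below matches results back to requests by `custom_id`. It assumes the output layout documented for the Batch API (a `response` object with `status_code` and `body`, plus an `error` field); `parse_batch_output` is a hypothetical helper name.

```python
import json


def parse_batch_output(jsonl_text: str) -> dict:
    """Map each custom_id to the assistant text of its chat completion.

    Records with an error or a non-200 status are skipped.
    """
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            continue
        choices = response["body"]["choices"]
        results[record["custom_id"]] = choices[0]["message"]["content"]
    return results
```

Once `client.batches.retrieve(batch.id)` reports a completed status, the output file can be downloaded with `client.files.content(...)` using the batch's `output_file_id`, and its text passed to this function.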

Output Token Cost — The Hidden Driver

Input tokens are cheap; output tokens are typically 3–5× more expensive per token (because each output token requires a full decode step). An uncontrolled LLM that writes verbose, padded answers can cost 5× more than a tuned one. Output cost controls are as important as input optimization.

Output Cost Reduction Techniques

1. Set max_tokens aggressively: cap it at what is realistic for the task.
2. Prompt engineering: instruct the model to be concise ("reply in under 100 words").
3. Structured output: a compact JSON response is often shorter than prose narration.
4. Temperature tuning: a lower temperature can reduce rambling, though it does not bound length.
5. Stop sequences: define where the model should stop generating to prevent overruns.
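
Several of these controls are just request parameters. The sketch below builds keyword arguments for an Anthropic `client.messages.create(...)` call without sending anything. The helper name, the 2-tokens-per-word headroom, the temperature value, and the `END_OF_ANSWER` marker are all illustrative assumptions to tune per task.

```python
def concise_request_kwargs(user_query: str, max_words: int = 100) -> dict:
    """Build keyword arguments for client.messages.create() that bound output length."""
    return {
        "model": "claude-sonnet-4-20250514",
        # Hard cap: assumed headroom of about 2 tokens per requested word.
        "max_tokens": 2 * max_words,
        # Lower temperature to discourage meandering output.
        "temperature": 0.2,
        # Generation halts as soon as the model emits this marker.
        "stop_sequences": ["END_OF_ANSWER"],
        # Ask for brevity and for the marker that triggers the stop sequence.
        "system": (
            f"Reply in under {max_words} words. "
            "When you are done, write END_OF_ANSWER."
        ),
        "messages": [{"role": "user", "content": user_query}],
    }


kwargs = concise_request_kwargs("Summarize our refund policy.", max_words=80)
print(kwargs["max_tokens"])  # prints: 160
```

The prompt instruction shapes the answer, while `max_tokens` and the stop sequence act as backstops if the model ignores it.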

Cost Optimization Checklist

Enable provider prompt caching: a few lines of code once your system prompt reaches the provider minimum (1,024 tokens for Anthropic and OpenAI), for up to 90% savings on cached tokens.
Compress system prompts: trim verbose instructions and use fewer words. Every token costs money.
Implement semantic caching: Redis + a vector index. Avoid 20–60% of LLM calls on repetitive queries.
Model routing: send simple queries to cheaper models. Can save 70–90% on token cost.
RAG over long context: don't send 100k-token documents; retrieve the relevant 2k tokens instead.
Output length limits: set max_tokens aggressively. Don't let models ramble.
Batch API for offline workloads: automatic 50% cost reduction.
Self-host for high volume: at scale (>100M tokens/month), self-hosted vLLM can beat API pricing.
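
Whether self-hosting pays off depends on volume. A back-of-the-envelope break-even sketch, using hypothetical figures and ignoring engineering time, utilization, and whether one server can actually sustain the throughput:

```python
def break_even_tokens_per_month(
    fixed_monthly_cost: float,
    api_price_per_mtok: float,
) -> float:
    """Monthly token volume above which a fixed-cost server is cheaper than the API."""
    if api_price_per_mtok <= 0:
        raise ValueError("api_price_per_mtok must be positive")
    return fixed_monthly_cost / api_price_per_mtok * 1_000_000


# Hypothetical figures: a $500/month GPU server vs an API at $5 per 1M tokens.
volume = break_even_tokens_per_month(500.0, 5.0)
print(f"Break-even at about {volume / 1e6:.0f}M tokens/month")
```

With a cheaper API tier or a more expensive server the break-even volume rises proportionally, so it is worth rerunning the numbers with your own prices.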

Key Takeaways

1. Enable prompt caching in your API calls: it takes a few lines of code for up to 90% savings on cached tokens.
2. Semantic caching with Redis + embeddings can eliminate 20–60% of LLM calls for FAQ workloads.
3. Model routing is the highest-leverage optimization: sending simple queries to cheap models saves 90%+ on those calls.
4. For self-hosted deployments, quantization + batching can achieve <$0.10/1M tokens vs $1–5/1M for major APIs.

📚 Further reading

Anthropic: Prompt Caching (Official Announcement), anthropic.com
OpenAI: Prompt Caching Documentation, openai.com
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, arxiv.org
Helicone: Implementing Semantic Caching for LLMs, helicone.ai
OpenAI Batch API Documentation, openai.com