Caching & Read Paths
Cache invalidation is famously the second-hardest problem in computing. Master the cache hierarchy, the three classic patterns (cache-aside, write-through, write-back), HTTP caching headers, stampede protection, and the semantic-cache pattern that LLMs forced us to invent.
What you will learn
The two ways to make something fast are to do less work or do the same work in a less expensive place. Caching is the second strategy applied to read paths — pull the answer from RAM instead of disk, from the edge instead of the origin, from a precomputed sketch instead of the full computation. This chapter covers the cache hierarchy, the patterns that hold up across providers, and the new shape AI workloads added: semantic caches that match by meaning instead of by string equality.
The Cache Hierarchy
A cache is a faster, smaller copy of something. The hierarchy ranges from CPU registers to global CDN PoPs — each layer 10–1000× faster than the next, and 10–100× smaller.
From the backend engineer's perspective, the layers you control are:
- In-process memory — a small LRU inside the app for things that don't need cross-instance consistency (config, hot lookups). Wins: 100 ns access, no network. Costs: invisible state, harder rolling deploys.
- Redis / Memcached — shared across instances, sub-millisecond access. The workhorse layer.
- HTTP / CDN cache — caches at the edge, often in a PoP within a few ms of the user. Wins: zero origin load. Costs: harder invalidation.
- The browser cache — free of charge if you set the right headers. Often forgotten.
Three Patterns That Cover 95% of Cases
Cache-aside (lazy loading) — the default
The application asks the cache first; on miss, it queries the source, populates the cache, and returns. Simple, robust, and what most production systems use.
```python
import json

# `redis` is a connected client (e.g. redis.Redis(...)); `db` is the app's data layer.
def get_user(user_id):
    key = f"user:{user_id}"
    cached = redis.get(key)
    if cached is not None:
        return json.loads(cached)
    user = db.fetch_user(user_id)  # source of truth
    if user is not None:
        redis.set(key, json.dumps(user), ex=600)  # 10-min TTL
    return user
```
Trade-offs: cache and DB can drift on writes (handled by invalidating the key on write); every miss pays the full cost of the source query, and simultaneous misses multiply it (handled by stampede protection, below).
Write-through
The app writes to the cache and the cache writes synchronously to the source. Cache and source are always in sync; reads are always served from the cache. Cost: writes are slower (two systems acknowledge), and a cache outage blocks writes.
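A minimal sketch of the write path, assuming a plain dict stands in for the source of truth. WriteThroughCache is a hypothetical name, not a library class, and writing the source before the cache copy is one possible ordering.

```python
class WriteThroughCache:
    """Toy write-through cache: a write is acknowledged only after the source has it."""

    def __init__(self, store):
        self.store = store   # stand-in for the source of truth (a dict here)
        self.cache = {}

    def put(self, key, value):
        # Write the source first; if that raises, the cache is left untouched.
        self.store[key] = value
        self.cache[key] = value

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.store.get(key)  # cold start: fall back to the source
        if value is not None:
            self.cache[key] = value
        return value


db = {}
cache = WriteThroughCache(db)
cache.put("user:1", {"name": "Ada"})
```

Because put() touches both systems before returning, a slow or unavailable source slows or blocks every write, which is the cost described above.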
Write-back (write-behind)
The app writes to the cache; the cache buffers writes and flushes to the source asynchronously. Very fast writes, real risk of data loss on cache failure. Used in metrics ingestion, log buffers, and analytics — places where some loss is acceptable for throughput.
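The buffering can be sketched as follows, assuming a dict as the source and an explicit flush() call in place of a real background flusher. WriteBackCache is a hypothetical name.

```python
class WriteBackCache:
    """Toy write-back cache: writes land in memory and reach the store on flush()."""

    def __init__(self, store):
        self.store = store   # stand-in for the source of truth
        self.cache = {}
        self.dirty = set()   # keys written since the last flush

    def put(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)  # acknowledged before the store has seen it

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        return self.store.get(key)

    def flush(self):
        """Write all dirty keys to the store; return how many were flushed."""
        flushed = 0
        for key in list(self.dirty):
            self.store[key] = self.cache[key]
            self.dirty.discard(key)
            flushed += 1
        return flushed


db = {}
wb = WriteBackCache(db)
wb.put("hits:home", 41)
wb.put("hits:home", 42)   # coalesced: only the latest value is flushed
```

Anything still in the dirty set when the process dies is lost, which is why this pattern fits metrics and logs better than orders or payments.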
Choosing a TTL
Every cached value lives at the intersection of two questions: how long can it be stale before users notice? and how often does it actually change? Some heuristics:
- User profile data: 5–60 minutes. Users tolerate small lag after editing their own profile (and you can invalidate on save).
- Public catalogs / pricing: hours, with hard invalidation on change.
- Per-user feed / inbox: 30 s – 5 min, or use ETag-based revalidation.
- Static config (feature flags): 30–60 s; tolerate brief lag rather than synchronously checking on every request.
- Computed aggregates: long TTL plus background refresh. "Hourly stats" updated every 5 min stays fresh enough.
Eviction policies
Caches are bounded; what gets evicted matters.
| Policy | What it evicts | Best for |
|---|---|---|
| LRU | Least recently used | General workloads; a common Redis setting (allkeys-lru) |
| LFU | Least frequently used | Skewed access; protects hot keys from one-time scans |
| FIFO | Oldest insertion | Streams, time-windowed |
| TTL only | Expired keys | Bounded total via TTL discipline |
| Random | Arbitrary key | Last resort; cheapest to compute |
Redis's default noeviction policy rejects writes with an error once memory is full, which is sometimes correct (you don't want silent eviction in a cache that's also a queue) and sometimes terrible (sudden 500s when a newly shipped feature causes a traffic spike). Choose deliberately and monitor evictions as a metric.
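To make the LRU row concrete, here is a small in-process sketch built on the standard library's OrderedDict. LRUCache and its evictions counter are illustrative names; this is exact LRU over a handful of keys, not how any particular server implements it.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded cache that evicts the least recently used key when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.evictions = 0   # expose this as a metric

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)   # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # drop the least recently used
            self.evictions += 1


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" is now more recent than "b"
cache.put("c", 3)    # over capacity: evicts "b"
```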
HTTP Caching — the Free Layer You're Probably Underusing
HTTP has a complete caching protocol baked in. CDNs, browsers, and proxies already speak it. Setting the right headers turns the edge into your highest-leverage cache for free.
The four headers worth knowing
- Cache-Control: the modern controller. "public, max-age=300" caches everywhere for 5 minutes. "private, no-cache" caches only in the browser and revalidates each time.
- ETag: an opaque validator for the response, often a hash of the body. The next request sends If-None-Match; if the content is unchanged, the server returns 304 Not Modified with no body. Saves bandwidth when the cached copy has expired but the content has not changed.
- Last-Modified: same idea, with a timestamp. Use ETag for anything not strictly time-ordered.
- Vary: "Vary: Accept-Encoding, Authorization" tells caches "the response depends on these request headers; cache them separately". Forgetting Vary on auth is how one user gets another user's response.
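The ETag / If-None-Match exchange can be sketched without a web framework. The functions make_etag and respond are hypothetical helpers, and the comparison is simplified: a real server also handles lists of tags, weak validators, and the "*" wildcard.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """ETag derived from the response body (quoted, as the header requires)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match=None):
    """Return (status, headers, body) for a GET, honoring If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=300"}
    if if_none_match == etag:
        return 304, headers, b""   # client copy is still valid: send no body
    return 200, headers, body


# First request: full response. Second request: client echoes the ETag back.
status, headers, _ = respond(b'{"items": []}')
status2, _, body2 = respond(b'{"items": []}', if_none_match=headers["ETag"])
```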
```http
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "a1b2c3d4"
Vary: Accept-Encoding

{ "items": [...] }
```
stale-while-revalidate — the underrated trick
The stale-while-revalidate directive tells the cache: "after max-age, return the stale copy immediately, but kick off a background revalidation". The user sees the fast response; the next request sees the fresh one. Combined with a short max-age and a longer stale-while-revalidate window, this removes most user-visible revalidation latency; only a truly cold cache, or an entry older than both windows, still waits on the origin.
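The decision a cache makes for a given entry can be written as a small function. This is an illustrative sketch: cache_state and its return labels are made-up names, and ages are in seconds.

```python
def cache_state(age_s, max_age_s, swr_s):
    """Classify a cached response by its age under max-age + stale-while-revalidate."""
    if age_s < max_age_s:
        return "fresh"              # serve from cache, no origin contact
    if age_s < max_age_s + swr_s:
        return "stale-revalidate"   # serve stale now, refresh in the background
    return "expired"                # must fetch or revalidate before serving


# With max-age=300, stale-while-revalidate=60:
states = [cache_state(age, 300, 60) for age in (100, 330, 400)]
```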
The default cache key is the URL plus any headers named in Vary. Customize it: many CDNs let you include cookies (Cookie), query parameters, or specific headers. For per-tenant SaaS, include the tenant ID. For LLM-personalized responses, include the user ID. Cache keys must include every dimension that changes the answer.
Stampedes — When Many Misses Hit at Once
A cache stampede happens when a popular cached value expires (or is evicted) and many concurrent requests miss simultaneously, each independently hitting the source. The source, sized for cache-hit traffic, gets crushed. Four remedies, roughly in order of complexity:
1. Lock + double-check
On miss, take a short distributed lock (Redis SET with the NX option and an expiry). One request rebuilds the value and populates the cache; the others wait briefly and retry the cache. Simple and effective up to high concurrency.
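```python
import time

# `redis` is a connected client; `encode`, `decode`, and `expensive_source` are app helpers.
def get_value(key):
    v = redis.get(key)
    if v is not None:
        return decode(v)
    lock_key = f"lock:{key}"
    if redis.set(lock_key, 1, ex=10, nx=True):  # only the first miss wins
        try:
            v = redis.get(key)  # double-check: a previous winner may have just filled it
            if v is not None:
                return decode(v)
            v = expensive_source(key)
            redis.set(key, encode(v), ex=600)
            return v
        finally:
            # If a rebuild can outlast the 10 s lock TTL, store a unique token
            # and delete the lock only if it still holds that token.
            redis.delete(lock_key)
    # someone else is rebuilding: wait and retry
    for _ in range(10):
        time.sleep(0.05)
        v = redis.get(key)
        if v is not None:
            return decode(v)
    return expensive_source(key)  # fallback
```
2. Probabilistic early refresh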
Each request that finds a near-expired entry has a small probability of rebuilding it, and that probability rises as expiry approaches. The first few flip the coin; one of them wins, refreshes, and the rest get the fresh value. This smooths the spike that a fixed-TTL approach creates. The classic algorithm is XFetch (Vattani et al., 2015).
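The coin flip can be sketched in a few lines, following the XFetch rule of refreshing when now - delta * beta * ln(rand) >= expiry. The function name and the injectable rng parameter are assumptions made for testability; delta is the measured duration of the last rebuild.

```python
import math
import random


def should_refresh_early(now, expiry, delta, beta=1.0, rng=random.random):
    """XFetch-style check: True means this request should rebuild the value now.

    delta is how long the last rebuild took (seconds); beta > 1 refreshes earlier.
    """
    # 1 - rng() lies in (0, 1], so the log is always defined and <= 0.
    gap = -delta * beta * math.log(1.0 - rng())
    return now + gap >= expiry


# Ten seconds before expiry, with a 1-second rebuild: early refresh is unlikely.
# With a 3-second rebuild the same random draw can already trigger it.
draw = lambda: 1 - math.exp(-5)   # fixed "random" value so the example is repeatable
early_cheap = should_refresh_early(now=100, expiry=110, delta=1, rng=draw)
early_costly = should_refresh_early(now=100, expiry=110, delta=3, rng=draw)
```

Slower rebuilds get a larger head start, so expensive keys are refreshed earlier than cheap ones.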
3. Background refresh
The cache value never expires for the reader; a background job refreshes it on a schedule. Best for values that are very hot and tolerate small lag — public landing pages, leaderboards.
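A minimal sketch of this reader/refresher split, assuming a single process and a daemon thread as the "background job". RefreshedValue and its loader argument are hypothetical; a production version would more likely be a scheduled job writing to a shared cache.

```python
import threading


class RefreshedValue:
    """Readers never miss: a refresher replaces the value; readers only read."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._value = loader()   # populate once at startup

    def get(self):
        with self._lock:
            return self._value

    def refresh_once(self):
        fresh = self._loader()   # run the expensive work outside the lock
        with self._lock:
            self._value = fresh

    def start(self, interval_s):
        """Refresh every interval_s seconds on a daemon thread until stopped."""
        stop = threading.Event()

        def loop():
            # wait() returns False on timeout and True once stop is set
            while not stop.wait(interval_s):
                self.refresh_once()

        threading.Thread(target=loop, daemon=True).start()
        return stop   # call stop.set() to end the loop


versions = iter(range(1000))
leaderboard = RefreshedValue(lambda: f"snapshot-{next(versions)}")
```

If the loader raises during a refresh, the previous value stays in place, so readers keep getting the last good snapshot.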
4. Soft TTL + hard TTL
The entry has a soft expiry (after which it's considered stale) and a hard expiry (after which it's gone). Between soft and hard, the first reader rebuilds; everyone else gets the stale value. The classic combination of #1 and #2.
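A single-process sketch of the soft/hard split, with an injectable clock so the behavior is easy to follow. SoftHardCache is a hypothetical name, and the rebuilding set stands in for what would be a distributed lock across processes.

```python
class SoftHardCache:
    """Serve stale between soft and hard TTL while one reader rebuilds."""

    def __init__(self, soft_ttl, hard_ttl, clock):
        self.soft_ttl, self.hard_ttl, self.clock = soft_ttl, hard_ttl, clock
        self.entries = {}         # key -> (value, stored_at)
        self.rebuilding = set()   # keys with a rebuild in flight

    def get(self, key, loader):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None:
            value, stored_at = entry
            age = now - stored_at
            if age < self.soft_ttl:
                return value   # fresh
            if age < self.hard_ttl and key in self.rebuilding:
                return value   # stale, but someone else is already rebuilding
        # Missing, past the hard TTL, or first reader after the soft TTL: rebuild.
        self.rebuilding.add(key)
        try:
            value = loader(key)
            self.entries[key] = (value, self.clock())
            return value
        finally:
            self.rebuilding.discard(key)
```

Past the hard TTL every reader rebuilds, so in practice this is paired with a lock for the cold-miss case.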
Invalidation — Where Most Cache Bugs Live
The two strategies, with their failure modes:
- TTL-based — accept staleness up to TTL. Simplest. Breaks when users edit their own data and immediately re-read.
- Write-through invalidation — on every write, delete (or update) the cache key. Cleanest UX. Breaks when the write completes but the invalidation fails (network blip).
Production systems usually combine both: invalidate on write and set a short TTL as a safety net. The TTL guarantees no value lives forever wrong; the invalidation gives you immediate freshness on the happy path.
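The combination can be sketched with an in-memory stand-in for the cache. TTLCache, read_user, and write_user are hypothetical names; with Redis the same three operations would be GET, SET with an expiry, and DEL.

```python
import time


class TTLCache:
    """In-memory stand-in for a cache with get / set-with-expiry / delete."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.data = {}   # key -> (value, expires_at)

    def get(self, key):
        item = self.data.get(key)
        if item is None or item[1] <= self.clock():
            self.data.pop(key, None)
            return None
        return item[0]

    def set(self, key, value, ex):
        self.data[key] = (value, self.clock() + ex)

    def delete(self, key):
        self.data.pop(key, None)


def read_user(db, cache, user_id, ttl_s=300):
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:
        user = db.get(user_id)
        if user is not None:
            cache.set(key, user, ex=ttl_s)   # the TTL is the safety net
    return user


def write_user(db, cache, user_id, user):
    db[user_id] = user                  # 1. commit to the source of truth
    cache.delete(f"user:{user_id}")     # 2. invalidate; the next read repopulates
```

If step 2 fails, or a write bypasses write_user entirely, the stale entry survives only until its TTL runs out.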
Event-driven invalidation
For systems where the writer and the cache live in different services, publish an event ("user.123.updated") on a queue and have cache consumers act on it. Day 3 covers the patterns; for now, note that this gives you decoupled invalidation that can fan out to many caches (Redis, CDN, browser via push) reliably.
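The fan-out can be illustrated with an in-process bus. EventBus is a hypothetical stand-in for a real queue or pub/sub topic, the two dicts stand in for a Redis cache and an edge cache, and the topic name with an id payload is an assumption made for this sketch.

```python
from collections import defaultdict


class EventBus:
    """In-process stand-in for a message queue or pub/sub topic."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self.subscribers[topic]:
            handler(payload)


bus = EventBus()
redis_like = {"user:123": "cached profile"}
edge_like = {"/users/123": "cached page"}

# Each cache owns its own consumer and knows its own key format.
bus.subscribe("user.updated", lambda e: redis_like.pop(f"user:{e['id']}", None))
bus.subscribe("user.updated", lambda e: edge_like.pop(f"/users/{e['id']}", None))

# The writer only announces what changed; it does not know which caches exist.
bus.publish("user.updated", {"id": 123})
```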
Semantic Caches — When Strings Aren't the Right Key
Classical caches are keyed by the URL or the function arguments. For an LLM-backed feature, two requests with different prompts can have nearly identical meaning — "Summarize this article" and "Give me a summary of this article" should both hit the cache. Exact-match keys miss. Semantic caches solve this with embeddings.
How it works
- Embed the prompt with a small embedding model (e.g. OpenAI text-embedding-3-small or a local one).
- Look up the nearest neighbour in your vector store, scoped by tenant + relevant context.
- If similarity ≥ threshold (typically 0.93–0.97 cosine), return the stored completion.
- Otherwise call the LLM, store the (prompt-vector, completion) pair, return the completion.
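The steps above can be sketched in pure Python. Everything here is illustrative: SemanticCache is a hypothetical class, the linear scan stands in for a vector store, and toy_vectors is a hand-written lookup table standing in for a real embedding model.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed           # function: prompt -> vector
        self.threshold = threshold
        self.entries = {}            # scope -> list of (vector, completion)

    def lookup(self, scope, prompt):
        vec = self.embed(prompt)
        best, best_sim = None, -1.0
        for stored_vec, completion in self.entries.get(scope, []):
            sim = cosine(vec, stored_vec)
            if sim > best_sim:
                best, best_sim = completion, sim
        return best if best_sim >= self.threshold else None

    def complete(self, scope, prompt, call_llm):
        hit = self.lookup(scope, prompt)
        if hit is not None:
            return hit
        completion = call_llm(prompt)   # miss: pay for the model call
        self.entries.setdefault(scope, []).append((self.embed(prompt), completion))
        return completion


# Hand-written vectors: the first two prompts point in nearly the same direction.
toy_vectors = {
    "Summarize this article": [1.0, 0.0, 0.1],
    "Give me a summary of this article": [0.98, 0.05, 0.12],
    "Translate this article to French": [0.1, 1.0, 0.0],
}
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return f"completion for: {prompt}"

cache = SemanticCache(embed=toy_vectors.__getitem__)
first = cache.complete("tenant-a", "Summarize this article", fake_llm)
second = cache.complete("tenant-a", "Give me a summary of this article", fake_llm)
third = cache.complete("tenant-a", "Translate this article to French", fake_llm)
```

The scope argument keeps one tenant's completions from being served to another, at the cost of a separate hit rate per scope.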
Where it shines and where it bites
Best for high-traffic, low-personalization endpoints: public chatbots, FAQ-style assistants, search rerankers. Hit rates of 30–60% are common, with corresponding cost/latency wins. Where it bites: anything personalized to user data has to scope the cache by user (defeating much of the win), and a similarity threshold set too low serves wrong answers. Tune by sampling: capture 1% of "hits", re-run the LLM, compare the embedding similarity of the cached and fresh completions, and alarm if drift increases.
Prompt-prefix cache (the providers' version)
Anthropic and OpenAI both support prompt caching for repeated long prompt prefixes (system prompts, few-shot examples, retrieved documents). This is a server-side optimization: the first call with a given prefix pays roughly full input price (Anthropic adds a surcharge for the cache write), and subsequent calls that reuse the prefix within a short window, on the order of minutes, are billed at a fraction of the input cost for the cached tokens (around 10% on Anthropic; OpenAI's discount varies by model). Use it whenever your application has a stable system prompt or retrieved context that would otherwise be reprocessed on every request.
Cache Observability
You cannot tune a cache you cannot measure. The four metrics every cache layer should expose:
- Hit ratio — hits / (hits + misses). The number that justifies the cache's existence.
- Eviction rate — items pushed out due to memory pressure. High eviction with low hit ratio means undersized cache or poorly-chosen TTLs.
- Latency P50 / P99 — slow caches are worse than no cache. Watch for tail spikes (Redis swap, rebalancing).
- Stampede signal — count of concurrent rebuild attempts on the same key. Should be near zero if locks are working.
Add per-key tagging when you have a few hot keys you want to monitor; aggregate per-feature for everything else.
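A minimal in-process sketch of the four metrics. CacheStats and its event names are hypothetical; in practice these counters would be exported to whatever metrics system you already run, and the "rebuild" count here is a simple total, not a true concurrency gauge.

```python
import math
from collections import Counter


class CacheStats:
    def __init__(self):
        self.counts = Counter()
        self.latencies_ms = []

    def record(self, event, latency_ms=None):
        """event is one of: 'hit', 'miss', 'eviction', 'rebuild'."""
        self.counts[event] += 1
        if latency_ms is not None:
            self.latencies_ms.append(latency_ms)

    def hit_ratio(self):
        total = self.counts["hit"] + self.counts["miss"]
        return self.counts["hit"] / total if total else 0.0

    def percentile(self, p):
        """Nearest-rank percentile of recorded latencies, p in (0, 100]."""
        if not self.latencies_ms:
            return None
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


stats = CacheStats()
for latency in (1, 2, 3):
    stats.record("hit", latency_ms=latency)
stats.record("miss", latency_ms=40)
stats.record("rebuild")
```

One CacheStats instance per feature gives the aggregate view; a separate instance for a known hot key gives the per-key view.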
cache.get("summary:" + article_hash) in Redis, with the LLM call only on miss. Now identical articles share completions across all users.
- Hierarchy — pick the right layer (in-process, Redis, edge).
- Pattern — cache-aside is the default; the others have stronger constraints.
- Key — include every dimension that changes the answer.
- TTL — short enough to limit damage, long enough to win.
- Stampede — lock or jitter on the rebuild path.
- Observe — hit ratio, eviction, latency, stampede count.
trending:{item_id}. TTL: 60 s soft, 5 min hard — "trending" tolerates minor lag, want immediate freshness on score change. Eviction: Redis allkeys-lru, sized for ~3× the trending working set so cold items don't constantly evict. Stampede: use SET-NX lock on miss, with 2-second lock TTL; concurrent misses on the same key wait. Refresh: probabilistic early-refresh once an entry is older than soft TTL, so the spike of expirations doesn't all hit at once. Observability: hit ratio per item, eviction count, lock-wait counter. Bonus: set stale-while-revalidate if you also serve via a CDN.
Cache-Control, ETag, Vary, and stale-while-revalidate is free infrastructure most teams underuse.
4) Stampedes are real and prevented by locks, jitter, soft/hard TTL, or background refresh.
5) Semantic caches match by embedding similarity — the LLM-era addition to the family.
6) Hit ratio + eviction + latency + stampede count are the metrics that turn caching from black art into engineering.
- RFC 9111 — HTTP Caching (datatracker.ietf.org)
- Redis — Eviction policies (redis.io)
- web.dev — stale-while-revalidate (web.dev)
- Cache stampede — overview & XFetch reference (wikipedia.org)
- Anthropic — Prompt caching (anthropic.com)
- OpenAI — Prompt caching (openai.com)