The Engineering Codex/Backend Engineering for the AI Era
DAY 2 · PM

Caching & Read Paths

12 min · Intermediate · 2,545 words
Cache invalidation is famously one of the two hard problems in computer science. Master the cache hierarchy, the three classic patterns (cache-aside, write-through, write-back), HTTP caching headers, stampede protection, and the semantic-cache pattern that LLMs forced us to invent.

What you will learn

01 The Cache Hierarchy
02 Three Patterns That Cover 95% of Cases
03 HTTP Caching — the Free Layer You're Probably Underusing
04 Stampedes — When Many Misses Hit at Once
05 Invalidation — Where Most Cache Bugs Live
06 Semantic Caches — When Strings Aren't the Right Key

The two ways to make something fast are to do less work or do the same work in a less expensive place. Caching is the second strategy applied to read paths — pull the answer from RAM instead of disk, from the edge instead of the origin, from a precomputed sketch instead of the full computation. This chapter covers the cache hierarchy, the patterns that hold up across providers, and the new shape AI workloads added: semantic caches that match by meaning instead of by string equality.

🔑
Today's read-path stack
1) The cache hierarchy — CPU, app, edge — with very different costs and latencies. 2) Three patterns: cache-aside, write-through, write-back. 3) HTTP caching — the headers your CDN already speaks. 4) Stampedes and how to survive a cold cache. 5) Semantic caches — caching by embedding similarity for LLM workloads.

The Cache Hierarchy

A cache is a faster, smaller copy of something. The hierarchy ranges from CPU registers to global CDN PoPs — each layer 10–1000× faster than the next, and 10–100× smaller.

[Figure: the cache hierarchy, from smaller and faster to bigger and slower: CPU registers (~1 ns) → L1/L2/L3 cache (1–30 ns) → process RAM (100 ns) → Redis, same VPC (0.5 ms) → DB query, warm (1–10 ms) → origin: cold disk / external (10–500 ms).]
The cache hierarchy. Each layer is the cache for the one below it. Most of your real wins are at the Redis and edge layers.

From the backend engineer's perspective, the layers you control are:

  • In-process memory — a small LRU inside the app for things that don't need cross-instance consistency (config, hot lookups). Wins: 100 ns access, no network. Costs: invisible state, harder rolling deploys.
  • Redis / Memcached — shared across instances, sub-millisecond access. The workhorse layer.
  • HTTP / CDN cache — caches at the edge, often in a PoP within a few ms of the user. Wins: zero origin load. Costs: harder invalidation.
  • The browser cache — free of charge if you set the right headers. Often forgotten.
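
As a minimal sketch of the in-process layer, the class below combines LRU eviction with a per-entry TTL. The name `LocalLRU`, its parameters, and the injectable `clock` are assumptions made for illustration; it is not thread-safe, and each app instance would hold its own independent copy.

```python
import time
from collections import OrderedDict


class LocalLRU:
    """Tiny in-process LRU cache with a per-entry TTL (illustrative only)."""

    def __init__(self, max_items=1024, ttl_seconds=60.0, clock=time.monotonic):
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self.clock = clock              # injectable so tests can control time
        self._data = OrderedDict()      # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self.clock() >= expires_at:  # expired: drop it and report a miss
            del self._data[key]
            return default
        self._data.move_to_end(key)     # mark as most recently used
        return value

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl_seconds, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)   # evict the least recently used
```

A short TTL matters here more than in Redis: because every instance has its own copy, the TTL is the only thing bounding how long instances can disagree.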

Three Patterns That Cover 95% of Cases

Cache-aside (lazy loading) — the default

The application asks the cache first; on miss, it queries the source, populates the cache, and returns. Simple, robust, and what most production systems use.

python — cache-aside skeleton
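import json

def get_user(user_id):
    # `redis` is an already-connected client and `db` is the application's
    # data-access layer; both are assumed to be defined elsewhere.
    key = f"user:{user_id}"
    cached = redis.get(key)
    if cached is not None:
        return json.loads(cached)              # cache hit

    user = db.fetch_user(user_id)              # source of truth
    if user is not None:                       # missing users are not cached
        redis.set(key,
                  json.dumps(user),
                  ex=600)                      # 10-min TTL
    return user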

Trade-offs: cache and DB can drift on writes (handled by invalidating the key on write); the first request after a cache miss pays the full cost (handled by stampede protection, below).

Write-through

The app writes to the cache and the cache writes synchronously to the source. Cache and source are always in sync; reads are always served from the cache. Cost: writes are slower (two systems acknowledge), and a cache outage blocks writes.

Write-back (write-behind)

The app writes to the cache; the cache buffers writes and flushes to the source asynchronously. Very fast writes, real risk of data loss on cache failure. Used in metrics ingestion, log buffers, and analytics — places where some loss is acceptable for throughput.

[Figure: the three patterns side by side. Cache-aside: App → Cache (1 GET), 2 miss → DB, 3 row; the app populates the cache after a miss; default, invalidate on write. Write-through: App → Cache (1 SET) → DB (2 sync); the cache writes through to the DB; strong consistency, slower writes. Write-back: App → Cache (1 SET) → DB (2 async); the cache buffers and flushes later; fastest writes, risk of loss.]
The three patterns. Cache-aside is the default; the other two pick a stronger constraint and pay for it.
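
To make the difference between the two write paths concrete, here is a small sketch in which a plain dict stands in for the source of truth. The class names and the explicit `flush()` call are assumptions for illustration; a real write-back cache would flush from a timer or background worker, and this write-through variant writes the source first so a failed source write never leaves the cache ahead of it.

```python
class WriteThroughCache:
    """Writes reach the source synchronously, then the cache is updated."""

    def __init__(self, source):
        self.source = source          # dict standing in for the database
        self.cache = {}

    def set(self, key, value):
        self.source[key] = value      # synchronous write to the source of truth
        self.cache[key] = value       # cache updated only after the source accepts

    def get(self, key):
        return self.cache.get(key, self.source.get(key))


class WriteBackCache:
    """Writes land in the cache immediately and reach the source on flush()."""

    def __init__(self, source):
        self.source = source
        self.cache = {}
        self.dirty = set()            # keys written but not yet flushed

    def set(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        """Push buffered writes to the source; returns how many keys were flushed."""
        flushed = 0
        for key in list(self.dirty):
            self.source[key] = self.cache[key]
            self.dirty.discard(key)
            flushed += 1
        return flushed
```

Anything still in `dirty` when the process dies is lost, which is the data-loss risk the write-back pattern accepts in exchange for fast writes.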

Choosing a TTL

Every cached value lives at the intersection of two questions: how long can it be stale before users notice? and how often does it actually change? Some heuristics:

  • User profile data: 5–60 minutes. Users tolerate small lag after editing their own profile (and you can invalidate on save).
  • Public catalogs / pricing: hours, with hard invalidation on change.
  • Per-user feed / inbox: 30 s – 5 min, or use ETag-based revalidation.
  • Static config (feature flags): 30–60 s; tolerate brief lag rather than synchronously checking on every request.
  • Computed aggregates: long TTL plus background refresh. "Hourly stats" updated every 5 min stays fresh enough.

Eviction policies

Caches are bounded; what gets evicted matters.

| Policy   | What it evicts        | Best for                                              |
|----------|-----------------------|-------------------------------------------------------|
| LRU      | Least recently used   | General workloads; allkeys-lru in Redis               |
| LFU      | Least frequently used | Skewed access; protects hot keys from one-time scans  |
| FIFO     | Oldest insertion      | Streams, time-windowed                                |
| TTL only | Expired keys          | Bounded total via TTL discipline                      |
| Random   | Arbitrary key         | Last resort; cheapest to compute                      |

Redis's default noeviction policy returns errors on writes when memory is full, which is sometimes correct (you don't want silent eviction in a cache that's also a queue) and sometimes terrible (sudden 500s when a newly shipped feature causes a traffic spike). Choose deliberately and monitor evictions as a metric.

HTTP Caching — the Free Layer You're Probably Underusing

HTTP has a complete caching protocol baked in. CDNs, browsers, and proxies already speak it. Setting the right headers turns the edge into your highest-leverage cache for free.

The four headers worth knowing

  • Cache-Control — the modern controller. public, max-age=300 caches everywhere for 5 minutes. private, no-cache caches only in the browser and revalidates each time.
  • ETag — an opaque validator for the response, often a hash of the body. The next request sends If-None-Match; if unchanged, the server returns 304 Not Modified with no body. Saves bandwidth even after the cached copy has expired.
  • Last-Modified — same idea, with a timestamp. Use ETag for anything not strictly time-ordered.
  • Vary — Vary: Accept-Encoding, Authorization tells caches "the response depends on these request headers; cache them separately". Forgetting Vary on auth is how one user gets another user's response.

http — well-cached response
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "a1b2c3d4"
Vary: Accept-Encoding

{ "items": [...] }
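
On the server side, the ETag round trip can be sketched as a pure function. The names `make_etag` and `respond` are hypothetical, the ETag is assumed to be a truncated SHA-256 of the body, and request headers are assumed to arrive as a dict with canonical capitalisation.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """ETag derived from the body bytes, wrapped in the quotes HTTP requires."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, request_headers: dict):
    """Return (status, headers, body), answering 304 when If-None-Match matches."""
    etag = make_etag(body)
    headers = {
        "Cache-Control": "public, max-age=300, stale-while-revalidate=60",
        "ETag": etag,
        "Vary": "Accept-Encoding",
    }
    candidates = []
    for tag in request_headers.get("If-None-Match", "").split(","):
        tag = tag.strip()
        if tag.startswith("W/"):      # If-None-Match compares weakly
            tag = tag[2:]
        candidates.append(tag)
    if etag in candidates or "*" in candidates:
        return 304, headers, b""      # client copy is still valid: no body
    return 200, headers, body
```

The 304 path still has to produce the body to hash it in this sketch; storing the ETag alongside the data avoids that work when regeneration is expensive.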

stale-while-revalidate — the underrated trick

The stale-while-revalidate directive tells the cache: "after max-age, return the stale copy immediately, but kick off a background revalidation". The user sees the fast response; the next request sees the fresh one. Combined with a short max-age and a longer stale-while-revalidate window, you essentially eliminate the user-visible latency of expired entries for any key that is requested regularly.
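
The timeline a cache follows for one entry can be sketched as below. The helper names are hypothetical and the parser is deliberately simplified; it handles the directives used in this section, not the full Cache-Control grammar.

```python
def parse_cache_control(value):
    """Split a Cache-Control header into {directive: argument-or-True}."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg.strip('"') if arg else True
    return directives


def cache_state(age_seconds, cache_control):
    """Classify a cached response by its age in seconds."""
    d = parse_cache_control(cache_control)
    max_age = int(d.get("max-age", 0))
    swr = int(d.get("stale-while-revalidate", 0))
    if age_seconds < max_age:
        return "fresh"                      # serve from cache, no origin traffic
    if age_seconds < max_age + swr:
        return "stale-while-revalidate"     # serve stale now, refresh in background
    return "expired"                        # must go to the origin before answering
```

With `public, max-age=300, stale-while-revalidate=60`, an entry is fresh for the first 300 seconds, served stale with a background refresh for the next 60, and only then forces a reader to wait on the origin.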

💡
Cache key includes more than the URL
Your CDN's cache key is by default the URL plus the headers in Vary. Customize it: many CDNs let you include cookies (Cookie), query parameters, or specific headers. For per-tenant SaaS, include the tenant ID. For LLM-personalized responses, include the user ID. Cache keys must include every dimension that changes the answer.
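
One way to apply this in application code is to build a canonical key from every input that changes the answer and nothing else. This is a sketch: `build_cache_key`, the `resp:` prefix, and the list of ignored tracking parameters are all assumptions for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical list of query parameters that never change the response.
IGNORED_PARAMS = frozenset({"utm_source", "utm_medium", "utm_campaign"})


def build_cache_key(url, tenant_id, vary_headers=None):
    """Canonical key: tenant + host + path + sorted query + selected headers."""
    parts = urlsplit(url)
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in IGNORED_PARAMS
    )
    headers = sorted((k.lower(), v) for k, v in (vary_headers or {}).items())
    canonical = "\n".join([
        str(tenant_id),
        parts.netloc.lower(),
        parts.path,
        urlencode(query),
        urlencode(headers),
    ])
    return "resp:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting the query string and dropping tracking parameters raises the hit rate; adding the tenant ID and the relevant headers keeps one tenant's response from being served to another.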

Stampedes — When Many Misses Hit at Once

A cache stampede happens when a popular cached value expires (or is evicted) and many concurrent requests miss simultaneously, each independently hitting the source. The source, sized for cache-hit traffic, gets crushed. Four remedies, roughly in order of complexity:

1. Lock + double-check

On miss, take a short distributed lock (Redis SET with the NX option and an expiry). One request rebuilds the value and populates the cache; the others wait briefly and retry the cache. Simple and effective up to high concurrency.

python — cache-aside with anti-stampede lock
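import time

def get_value(key):
    # `redis`, `decode`, `encode`, and `expensive_source` are assumed to be
    # defined elsewhere in the application.
    v = redis.get(key)
    if v is not None:
        return decode(v)

    lock_key = f"lock:{key}"
    if redis.set(lock_key, 1, ex=10, nx=True):  # only the first miss wins
        try:
            v = redis.get(key)                  # double-check: filled while we raced for the lock?
            if v is not None:
                return decode(v)
            v = expensive_source(key)
            redis.set(key, encode(v), ex=600)
            return v
        finally:
            # Caveat: if the rebuild outlives the 10 s lock TTL, this can
            # delete a lock that another request has since acquired.
            redis.delete(lock_key)

    # someone else is rebuilding: wait and retry
    for _ in range(10):
        time.sleep(0.05)
        v = redis.get(key)
        if v is not None:
            return decode(v)
    return expensive_source(key)              # fallback after ~0.5 s of waiting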

2. Probabilistic early refresh

Each request that finds a near-expired entry has a small probability of rebuilding it early, and that probability rises as the expiry approaches. The first few flip the coin; one of them wins, refreshes, and the rest get the fresh value. This smooths the spike that a fixed-TTL approach creates. The classic reference is the XFetch algorithm (Vattani et al., 2015).
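
The coin flip can be sketched as a single function in the style of XFetch. The function name and the injectable `now` and `rand` parameters are assumptions added so the behaviour can be tested deterministically; `delta` is the time the last rebuild took and `beta` is the tuning knob.

```python
import math
import random
import time


def should_refresh_early(expiry, delta, beta=1.0, now=None, rand=random.random):
    """XFetch-style check: True means this request should rebuild the value now.

    expiry: absolute expiry time of the cached entry (seconds)
    delta:  how long the last rebuild took (seconds)
    beta:   tuning knob; values above 1.0 refresh earlier, below 1.0 later
    """
    if now is None:
        now = time.time()
    u = 1.0 - rand()                          # uniform in (0, 1], so log(u) is defined
    early_by = -delta * beta * math.log(u)    # non-negative random head start
    return now + early_by >= expiry
```

Because the head start scales with `delta`, values that are slow to rebuild begin refreshing earlier, and an already expired entry always returns True.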

3. Background refresh

The cache value never expires for the reader; a background job refreshes it on a schedule. Best for values that are very hot and tolerate small lag — public landing pages, leaderboards.

4. Soft TTL + hard TTL

The entry has a soft expiry (after which it's considered stale) and a hard expiry (after which it's gone). Between soft and hard, the first reader rebuilds; everyone else gets the stale value. The classic combination of #1 and #2.
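
A single-process sketch of the soft/hard split follows. `SoftHardCache` and its `(value, should_rebuild)` return shape are assumptions for illustration; the in-memory `_rebuilding` set plays the role a distributed lock would play across instances, and the class is not thread-safe.

```python
import time


class SoftHardCache:
    """Serve stale between the soft and hard TTL while one reader rebuilds."""

    def __init__(self, soft_ttl, hard_ttl, clock=time.monotonic):
        self.soft_ttl = soft_ttl
        self.hard_ttl = hard_ttl
        self.clock = clock
        self._data = {}            # key -> (stored_at, value)
        self._rebuilding = set()   # keys with a rebuild already claimed

    def set(self, key, value):
        self._data[key] = (self.clock(), value)
        self._rebuilding.discard(key)

    def get(self, key):
        """Return (value, should_rebuild) for the calling reader."""
        entry = self._data.get(key)
        if entry is None:
            return None, True
        stored_at, value = entry
        age = self.clock() - stored_at
        if age >= self.hard_ttl:               # past hard expiry: entry is gone
            del self._data[key]
            self._rebuilding.discard(key)
            return None, True
        if age < self.soft_ttl:                # fresh
            return value, False
        if key in self._rebuilding:            # stale, rebuild already claimed
            return value, False
        self._rebuilding.add(key)              # stale, first reader claims rebuild
        return value, True
```

Only the first reader after the soft expiry is told to rebuild; everyone else keeps getting the stale value until `set` is called. A hard miss still tells every reader to rebuild, so it needs the lock from remedy 1.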

⚠️
Cold start is also a stampede
When a service deploys, its in-process caches start empty, and so does any freshly provisioned shared cache. The first traffic is a stampede by another name. Either prewarm caches in the deploy step (read top-N hot keys), or use a slow rollout that fills caches gradually. Forgetting this in a Black Friday deploy is the textbook on-call story.

Invalidation — Where Most Cache Bugs Live

The two strategies, with their failure modes:

  • TTL-based — accept staleness up to TTL. Simplest. Breaks when users edit their own data and immediately re-read.
  • Write-through invalidation — on every write, delete (or update) the cache key. Cleanest UX. Breaks when the write completes but the invalidation fails (network blip).

Production systems usually combine both: invalidate on write and set a short TTL as a safety net. The TTL guarantees no value lives forever wrong; the invalidation gives you immediate freshness on the happy path.
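
The combined approach can be sketched with in-memory stand-ins. `DictCache` is a hypothetical stub exposing only the `get`/`set`/`delete` calls used here (it records the TTL but does not enforce it), and a plain dict stands in for the database.

```python
import json

SAFETY_TTL_SECONDS = 600   # safety net: bounds staleness if an invalidation is lost


class DictCache:
    """Stand-in for a Redis client; records the TTL but does not enforce it."""

    def __init__(self):
        self.store = {}
        self.ttls = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value, ex=None):
        self.store[key] = value
        self.ttls[key] = ex

    def delete(self, key):
        self.store.pop(key, None)
        self.ttls.pop(key, None)


def get_user(db, cache, user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    user = db.get(user_id)
    if user is not None:
        cache.set(key, json.dumps(user), ex=SAFETY_TTL_SECONDS)
    return user


def update_user(db, cache, user_id, fields):
    db[user_id] = {**db.get(user_id, {}), **fields}   # 1. write the source of truth
    cache.delete(f"user:{user_id}")                   # 2. invalidate; next read repopulates
```

Deleting the key rather than writing the new value into the cache means two racing writers cannot leave the older value cached; the next reader simply reloads from the source.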

Event-driven invalidation

For systems where the writer and the cache live in different services, publish an event ("user.123.updated") on a queue and have cache consumers act on it. Day 3 covers the patterns; for now, note that this gives you decoupled invalidation that can fan out to many caches (Redis, CDN, browser via push) reliably.

Semantic Caches — When Strings Aren't the Right Key

Classical caches are keyed by the URL or the function arguments. For an LLM-backed feature, two requests with different prompts can have nearly identical meaning — "Summarize this article" and "Give me a summary of this article" should both hit the cache. Exact-match keys miss. Semantic caches solve this with embeddings.

[Figure: Prompt (user message) → Embed (vector(prompt)) → Vector store (k-NN search) → Hit (similarity ≥ θ): return cached completion, or Miss: call LLM, store (vec → completion). Pick θ around 0.93–0.97 cosine. Lower threshold → more hits, more risk of wrong answer.]
Semantic cache: embed the prompt, k-NN against past prompts, return the stored completion if close enough.

How it works

  1. Embed the prompt with a small embedding model (e.g. OpenAI text-embedding-3-small or a local one).
  2. Look up the nearest neighbour in your vector store, scoped by tenant + relevant context.
  3. If similarity ≥ threshold (typically 0.93–0.97 cosine), return the stored completion.
  4. Otherwise call the LLM, store the (prompt-vector, completion) pair, return the completion.
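
The four steps can be sketched in memory as follows. Everything named here is an assumption for illustration: `embed` and `call_llm` are caller-supplied functions, `scope` stands for the tenant-plus-context partition, and the linear scan stands in for the k-NN query a real vector store would run.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = {}          # scope -> list of (vector, completion)

    def lookup(self, scope, vector):
        """Return the closest stored completion at or above the threshold, else None."""
        best_score, best = -1.0, None
        for stored_vector, completion in self._entries.get(scope, []):
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best = score, completion
        return best if best_score >= self.threshold else None

    def store(self, scope, vector, completion):
        self._entries.setdefault(scope, []).append((vector, completion))


def answer(cache, scope, prompt, embed, call_llm):
    vector = embed(prompt)                      # step 1: embed
    hit = cache.lookup(scope, vector)           # steps 2-3: nearest neighbour + threshold
    if hit is not None:
        return hit
    completion = call_llm(prompt)               # step 4: miss, call the model
    cache.store(scope, vector, completion)
    return completion
```

Keeping `scope` as the outer key makes it structurally impossible for one tenant's completion to be returned to another, whatever the similarity score.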

Where it shines and where it bites

Best for high-traffic, low-personalization endpoints — public chatbots, FAQ-style assistants, search rerankers. Hit rates of 30–60% are common, with corresponding cost/latency wins. Where it bites: anything personalized to user data has to scope the cache by user (defeating much of the win), and a threshold set too low serves wrong answers. Tune by sampling: capture 1% of "hits", re-run the LLM, compare the embedding similarity of the cached and fresh completions, and alarm if drift increases.

Prompt-prefix cache (the providers' version)

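Anthropic and OpenAI both support prompt caching for repeated long prompt prefixes (system prompts, few-shot examples, retrieved documents). This is a server-side optimization: the first call with a given prefix is billed at the normal input rate or slightly above it, and later calls that reuse the prefix while it is still cached (a window of minutes) are billed at a steep discount on the cached tokens, as low as ~10% of the input price. The exact discount, the cache lifetime, and whether caching is automatic or opt-in differ by provider and model. Use it whenever your application has a stable system prompt or retrieved context that would otherwise be reprocessed on every request.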

🌱
Layer caches
A production AI feature often runs three caches in series: HTTP cache for identical requests, semantic cache for similar prompts, prompt-prefix cache at the LLM provider for repeated prefixes. Each layer catches what the previous missed; the cost-per-call descends from full to tiny across the chain.

Cache Observability

You cannot tune a cache you cannot measure. The four metrics every cache layer should expose:

  • Hit ratio — hits / (hits + misses). The number that justifies the cache's existence.
  • Eviction rate — items pushed out due to memory pressure. High eviction with low hit ratio means undersized cache or poorly-chosen TTLs.
  • Latency P50 / P99 — slow caches are worse than no cache. Watch for tail spikes (Redis swap, rebalancing).
  • Stampede signal — count of concurrent rebuild attempts on the same key. Should be near zero if locks are working.

Add per-key tagging when you have a few hot keys you want to monitor; aggregate per-feature for everything else.
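
A minimal counter object for these four metrics might look like the sketch below. `CacheStats` and its field names are assumptions, and the unbounded latency list is only for illustration; a production service would feed a histogram in its metrics library instead.

```python
import math
from dataclasses import dataclass, field


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    evictions: int = 0
    concurrent_rebuilds: int = 0            # the stampede signal
    latencies_ms: list = field(default_factory=list)

    def record_lookup(self, hit, latency_ms):
        if hit:
            self.hits += 1
        else:
            self.misses += 1
        self.latencies_ms.append(latency_ms)

    @property
    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def latency_percentile(self, p):
        """Nearest-rank percentile of recorded latencies; None if nothing recorded."""
        if not self.latencies_ms:
            return None
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]
```

Keeping one `CacheStats` per feature, plus one per hot key you care about, matches the per-feature and per-key split described above.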

Quick check
A team puts an HTTP cache in front of an LLM-summarization endpoint. Cache key: the URL + the article's hash. Hit rate is 0%. Why? What change unlocks a meaningful hit rate?
Answer
Why: the URL likely includes session or auth tokens that change per user, even when the article being summarized is identical. Each request's URL is unique, so the CDN never matches. Fix: key the cache on the user-agnostic input, the article hash, not on the URL. Either use a custom cache key function in the CDN (Cloudflare Workers, Fastly VCL), or move the cache one layer in: have the application call cache.get("summary:" + article_hash) against Redis and invoke the LLM only on a miss. Now identical articles share completions across all users.
Mnemonic — caching playbook
"Hierarchy, Pattern, Key, TTL, Stampede, Observe."
  • Hierarchy — pick the right layer (in-process, Redis, edge).
  • Pattern — cache-aside is the default; the others have stronger constraints.
  • Key — include every dimension that changes the answer.
  • TTL — short enough to limit damage, long enough to win.
  • Stampede — lock or jitter on the rebuild path.
  • Observe — hit ratio, eviction, latency, stampede count.
Flashcard
Your service has a hot endpoint where 80% of requests are for the same five "trending" items. Database read replicas are saturated. You can use Redis. Design the cache, including TTL, eviction, key, and stampede protection.
Answer
Pattern: cache-aside. Key: trending:{item_id}. TTL: 60 s soft, 5 min hard; "trending" tolerates minor lag, and you can invalidate on score change where immediate freshness matters. Eviction: Redis allkeys-lru, sized for ~3× the trending working set so cold items don't constantly evict hot ones. Stampede: take a SET NX lock on miss with a 2-second lock TTL; concurrent misses on the same key wait and retry. Refresh: probabilistic early refresh once an entry is older than its soft TTL, so the expirations don't all hit at once. Observability: hit ratio per item, eviction count, lock-wait counter. Bonus: set stale-while-revalidate if you also serve via a CDN.
🔑
Key takeaways
1) Caches live in a hierarchy; the highest leverage is usually at Redis and the CDN edge. 2) Cache-aside is the default; write-through and write-back are for cases with a strong reason to pay their costs. 3) HTTP caching with Cache-Control, ETag, Vary, and stale-while-revalidate is free infrastructure most teams underuse. 4) Stampedes are real and prevented by locks, jitter, soft/hard TTL, or background refresh. 5) Semantic caches match by embedding similarity — the LLM-era addition to the family. 6) Hit ratio + eviction + latency + stampede count are the metrics that turn caching from black art into engineering.
