DAY 5

Reliability Engineering — SLOs, Retries, Circuit Breakers

14 min · Advanced · 3,062 words
Reliability isn't an outcome — it's a discipline. Master SLOs and error budgets, the retry-with-jitter rules that prevent thundering herds, circuit breakers, bulkheads, timeouts, and the AI-call reliability patterns that emerged when external dependencies became the long pole in every request.

What you will learn

01 SLIs, SLOs, and the Error Budget
02 The Reliability Triangle: Timeouts + Retries + Circuit Breakers
03 Bulkheads — Containing Blast Radius
04 Hedged Requests — Trading Cost for Tail Latency
05 Backpressure and Rate Limiting
06 Idempotency Done Right

Every system fails. The only question is whether the failure stops at one component or cascades. Reliability engineering is the practice of containing failures: choosing what "working" means (SLOs), controlling what failure looks like (timeouts, retries, circuit breakers), and rehearsing what comes next (chaos drills, runbooks). It's also where the AI era forced the biggest mental shift — when the long pole in your latency budget is somebody else's GPU cluster, your reliability story is mostly about coping with their flakiness.

🔑
Today's reliability stack
1) SLOs and error budgets — what "reliable" actually means. 2) Timeouts, retries with jitter, exponential backoff — the three lines of code that prevent thundering herds. 3) Circuit breakers, bulkheads, hedged requests — patterns that keep one slow dependency from sinking the ship. 4) Idempotency for retries — how to make "try again" actually safe. 5) The AI-call reliability playbook — model fallback, semantic caching, async escape hatches.

SLIs, SLOs, and the Error Budget

You cannot improve what you cannot measure. The SRE practice (Google, 2003 onward) gave us a vocabulary that is now industry standard.

  • SLI (Service Level Indicator) — a number that measures user-perceived behaviour. Example: "fraction of requests with status 200 and latency < 500 ms".
  • SLO (Service Level Objective) — a target for an SLI over a window. Example: "99.9% over a rolling 30 days".
  • SLA (Service Level Agreement) — a contractual SLO with a penalty clause. Always looser than your internal SLOs.
  • Error budget — the complement of the SLO (100% minus the target). 99.9% over 30 days = 0.1% × 43,200 minutes ≈ 43 minutes of allowed failure. That's your budget to spend on risky things: deploys, experiments, migrations.
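
The error-budget arithmetic can be sketched in a few lines. The function names below are illustrative and not taken from any SRE tooling; the sketch assumes "bad minutes" is an acceptable unit for your SLI.

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Minutes of allowed failure for an SLO (e.g. 0.999) over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, window_days: float, bad_minutes: float) -> float:
    """Fraction of the budget still unspent (negative once it is blown)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


print(round(error_budget_minutes(0.999, 30), 1))   # 30-day budget at 99.9%
print(round(error_budget_minutes(0.9999, 365), 2)) # yearly budget at 99.99%
```

A team could feed `budget_remaining` into its deploy policy, for example pausing risky changes once the value drops to zero.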

Picking SLOs

The trap: setting SLOs based on what you wish were true rather than what the user notices. Three rules:

  1. One per critical user journey. Latency for "home page loads". Availability for "checkout completes". Don't aggregate everything into one number.
  2. Start with what you measure today. If you're at 99.5% and want 99.9%, that's a real engineering project. Setting 99.99% as an aspiration tells the team nothing.
  3. Tighter SLOs cost exponentially more. Going from 99% (3.65 days/yr) to 99.9% (8.76 hours) is one project; from 99.9% to 99.99% (52 minutes) is several. Pick the level the business actually needs.
  • 99%: 3.65 days / year · achievable with basic monitoring + on-call
  • 99.9%: 8.76 hours / year · the typical SaaS target
  • 99.99%: 52.6 minutes / year · multi-AZ active-active, automated failover
  • 99.999%: 5.26 minutes / year · multi-region, telco-grade; usually impossible per-feature

Each "nine" is roughly 10× the engineering investment.
SLO levels mapped to annual downtime. The top of this chart is approximately reachable; the bottom is a research project.

The error budget conversation

The error budget unlocks an honest dialog. If the budget is exhausted, deploys pause until the next window. If the budget is largely unused, the team can take more risk — ship faster, run experiments, do migrations. Without an error budget, every engineering decision becomes a fight between feature velocity and "reliability". With one, the trade-off is mechanical.

The Reliability Triangle: Timeouts + Retries + Circuit Breakers

Three controls compose to handle most upstream-failure modes. Missing any one is an outage waiting to happen.

  • Timeout: bound the wait · connect, read, total · never "wait forever"
  • Retry: backoff + jitter · cap attempts · only on retryable errors
  • Circuit breaker: stop hammering · closed → open → half-open · fail fast, recover gradually

Each control fixes one failure mode. Together they keep one slow dependency from cascading.
Timeout caps the wait. Retry handles transient failures. Circuit breaker stops the retry storm.

Timeouts

Every network call must have a timeout. "The default" is rarely correct — most HTTP clients have no timeout out of the box, or one too long to be useful. Configure three numbers:

  • Connect timeout — how long to wait for the TCP+TLS handshake. 1–3 seconds is typical.
  • Read timeout — how long to wait for the next bytes of the response, including the first byte. Must accommodate the slowest legitimate response.
  • Total / overall timeout — the budget for the whole call, including retries. Must fit inside the parent's timeout (timeout budgets cascade up the call graph).

For LLM calls specifically: the read timeout has to handle the slow-streaming case (a 30 s response with 100 ms gaps between tokens). Total timeout might be 60 s. Always use streaming clients for LLM calls; the inactivity timeout is the right knob, not the wall-clock total.
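
One way to make timeout budgets cascade is to pass a single deadline down the call graph and derive each call's timeout from what is left. The `Deadline` class below is a hypothetical helper written for this illustration, not part of any HTTP client; the injectable `clock` exists only to make it testable.

```python
import time


class Deadline:
    """A total time budget shared by every call made for one request."""

    def __init__(self, budget_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, per_call_cap_s: float) -> float:
        """Timeout for the next downstream call: the cap or what is left."""
        left = self.remaining()
        if left <= 0.0:
            raise TimeoutError("deadline exceeded before the call started")
        return min(per_call_cap_s, left)


# Usage: a handler with a 2 s budget caps each downstream call at 1.5 s.
deadline = Deadline(budget_s=2.0)
first_call_timeout = deadline.timeout_for(1.5)
```

The value returned by `timeout_for` would be handed to whatever client makes the call, so a retry late in the request automatically gets a shorter timeout than the first attempt.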

Retries — with jitter, always

Retries handle transient failures: a momentarily-overloaded peer, a lost packet, a brief network hiccup. The wrong way: for i in range(3): time.sleep(1); ... — every failed client retries in lockstep, hammering the recovering peer. The right way: exponential backoff with full jitter.

python — retry with exponential backoff + full jitter
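import random
import time

MAX_ATTEMPTS = 5
BASE_DELAY   = 0.1   # 100 ms
MAX_DELAY    = 10    # 10 s cap


class RetryableError(Exception):
    """Placeholder for transient failures: 5xx, connection errors, timeouts."""


class NonRetryableError(Exception):
    """Placeholder for permanent failures: bad requests, business errors."""


def call_with_retry(fn):
    for attempt in range(MAX_ATTEMPTS):
        try:
            return fn()
        except RetryableError:
            if attempt == MAX_ATTEMPTS - 1:
                raise                                      # attempts exhausted
            backoff = min(MAX_DELAY, BASE_DELAY * 2 ** attempt)
            time.sleep(random.uniform(0, backoff))         # FULL jitter
        # NonRetryableError (and any other exception) propagates immediately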

The jitter is what saves you. AWS published its well-known "Exponential Backoff and Jitter" article in 2015, comparing backoff without jitter against three jittered variants; full jitter (a random delay between 0 and the exponential cap) was among the most effective at breaking up synchronized retries. Many modern client libraries apply the same logic.

Three rules for retries

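  • Only retry retryable errors. 5xx, connection failures, timeouts, plus the few 4xx codes that signal a transient condition (429 Too Many Requests, 408 Request Timeout). Never other 4xx (the request was wrong; retrying just hits the wall harder).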
  • Always cap attempts. 3–5 is typical. Beyond that, you're amplifying load on a recovering dependency.
  • Always carry an idempotency key for non-GET retries (Day 1 PM). Without it, retries can double-charge or double-create.
🚨
Retry storms cascade upward
If service A makes up to 3 attempts against B, and B makes up to 3 attempts against C, a single failure of C generates 3 attempts on B and 9 on C from one user request; add one more retrying layer above A and C sees 27. Don't retry at every layer. The classic guidance: retry at a single layer, as close to the failing call as possible, and propagate failures upward from there. Or use a shared budget across layers (service meshes such as Linkerd do this with retry budgets).
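
The amplification is multiplicative per hop, which a short calculation makes concrete. This is a back-of-the-envelope sketch; `attempts_per_layer` is an illustrative name, and it assumes every attempt at every hop fails, which is the worst case.

```python
def attempts_per_layer(attempts: list[int]) -> list[int]:
    """attempts[i] is the max attempts hop i makes against the next service.

    Returns the worst-case number of calls each downstream service receives
    from a single user request.
    """
    totals, running = [], 1
    for n in attempts:
        running *= n
        totals.append(running)
    return totals


print(attempts_per_layer([3, 3]))     # A -> B -> C
print(attempts_per_layer([3, 3, 3]))  # one more retrying layer on top
```

With three attempts at each of two hops the last service sees 9 calls; a third retrying hop brings that to 27.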

Circuit breakers

When a peer is genuinely down, retries are wasted bandwidth. The circuit breaker tracks recent failure rate; when it crosses a threshold, the breaker opens and subsequent calls fail immediately without hitting the network. After a cooldown, the breaker enters half-open, lets one or two trial requests through, and either re-opens (failures continue) or closes (success returns).

  • CLOSED (all calls pass) → OPEN when failure rate > θ
  • OPEN (fail fast) → HALF-OPEN when cooldown elapsed
  • HALF-OPEN (trial requests) → CLOSED when trial succeeded
  • HALF-OPEN → OPEN when trial failed (re-open)

Closed = healthy. Open = stop calling. Half-open = test recovery without unleashing all traffic.
The three states of a circuit breaker. State transitions trip on failure-rate thresholds and time.

How to size a breaker

Common settings: open after 50% failures over a 10-second window with a minimum of 20 requests; cooldown 30 seconds; half-open allows 3 trial requests. Tune to your traffic: a 1-RPS service never reaches a 20-request minimum inside a 10-second window, so widen the window or lower the minimum.

The Hystrix library (Netflix) popularized this pattern; Resilience4j (Java), Polly (.NET), pybreaker (Python), and service meshes (Istio, Linkerd) all implement it. For LLM calls, breakers are particularly valuable: when the upstream provider is rate-limiting, opening the breaker preserves your remaining budget instead of burning through it on doomed retries.
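
To show the state machine in code, here is a minimal single-threaded sketch. It is not how any of the libraries above implement the pattern: it assumes a count-based window (the last `window` calls) rather than a time-based one, allows a single trial call in half-open, and is not thread-safe. All names are illustrative.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, min_calls=20, window=50,
                 cooldown_s=30.0, clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.min_calls = min_calls
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._results = deque(maxlen=window)   # True means the call failed
        self._opened_at = None
        self.state = "closed"

    def call(self, fn):
        if self.state == "open":
            if self._clock() - self._opened_at < self.cooldown_s:
                raise CircuitOpenError("failing fast; dependency marked down")
            self.state = "half_open"           # cooldown elapsed: one trial
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half_open":
            self.state = "closed"              # trial succeeded
            self._results.clear()
        self._results.append(False)

    def _on_failure(self):
        self._results.append(True)
        if self.state == "half_open" or self._should_trip():
            self.state = "open"                # trip, or re-open after trial
            self._opened_at = self._clock()

    def _should_trip(self):
        if len(self._results) < self.min_calls:
            return False                       # too little data to judge
        return sum(self._results) / len(self._results) >= self.failure_ratio
```

A caller would wrap each dependency call as `breaker.call(lambda: client.get(...))` and treat `CircuitOpenError` as the signal to use the fallback path.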

Bulkheads — Containing Blast Radius

The bulkhead pattern (named after ship compartments) isolates resources so failure in one area can't sink the whole vessel. Concretely: rather than one shared HTTP client, give each downstream dependency its own connection pool with its own concurrency cap. When dependency A is slow, its pool fills up; calls to A queue or are rejected (for example with a 429 or 503); calls to B and C are unaffected.

yaml — bulkhead via per-dependency client
# illustrative settings in the style of resilience4j (key names are not the library's actual schema)
clients:
  payments-api:
    max-concurrent: 50
    timeout: 5s
    retry: { attempts: 3, base-delay: 100ms }
  llm-provider:
    max-concurrent: 20      # tighter — slow, expensive
    timeout: 60s
    retry: { attempts: 2, base-delay: 500ms }
  search-api:
    max-concurrent: 100
    timeout: 200ms
    retry: { attempts: 2, base-delay: 50ms }

The bulkhead alone is the #1 fix for the cascading-failure pattern from Day 1: a single slow LLM endpoint exhausting the global connection pool and degrading every other endpoint. With per-dependency limits, the LLM endpoint hits its own ceiling first; everything else stays fast.
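
In application code, a per-dependency concurrency cap can be approximated with one semaphore per dependency. The sketch below assumes a threaded server and rejects immediately when the cap is reached instead of queueing; `Bulkhead` and `BulkheadFullError` are illustrative names, and the caps simply mirror the configuration above.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency cap is already in use."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject right away rather than pile up waiters.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: concurrency cap reached")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream dependency, never one shared pool.
BULKHEADS = {
    "payments-api": Bulkhead("payments-api", 50),
    "llm-provider": Bulkhead("llm-provider", 20),
    "search-api": Bulkhead("search-api", 100),
}
```

When the LLM bulkhead is full, handlers get `BulkheadFullError` in microseconds and can degrade, while calls through the other two bulkheads proceed untouched.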

Hedged Requests — Trading Cost for Tail Latency

The Tail at Scale paper (Dean & Barroso, 2013) introduced hedged requests: send a request to backend A; if it hasn't returned within, say, the P95 of normal latency, send the same request to backend B; take whichever returns first; cancel the loser. The cost is roughly +5% requests, since only the slowest 5% of calls ever trigger a hedge; the win is dramatic: tail latency drops sharply because slow backends rarely drag both copies.

This is most useful when:

  • Backends are stateless or idempotent (so the duplicate is harmless).
  • The 5% cost is much smaller than the latency gain in user-impact terms.
  • You can identify a P95 threshold to start the hedge — too eager kills the cost saving; too late doesn't help.

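For LLM workloads: hedge across different model providers when latency varies. Send to OpenAI; if no first token after 1 s, also send to Anthropic; take whichever responds. A completion has no side effects on your own system, so the duplicate is harmless and the cancel is safe, though the provider may still bill for tokens the cancelled call already generated.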
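
The hedge itself is a small piece of concurrency logic. Below is a minimal asyncio sketch; `hedged`, `primary`, and `backup` are illustrative names for any two zero-argument coroutine functions that perform the same idempotent request. It assumes the first call to finish wins even if it finished with an error, which a production version would probably refine.

```python
import asyncio


async def hedged(primary, backup, hedge_after_s: float):
    """Run primary(); if still pending after hedge_after_s, race backup()."""
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()              # fast path: no hedge was needed
    second = asyncio.create_task(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()                      # cancel the loser
    return done.pop().result()


async def _demo():
    async def slow():
        await asyncio.sleep(0.5)
        return "slow backend"

    async def fast():
        await asyncio.sleep(0.01)
        return "fast backend"

    return await hedged(slow, fast, hedge_after_s=0.05)


print(asyncio.run(_demo()))
```

Setting `hedge_after_s` near the observed P95 keeps the extra load small, since most primary calls return before the hedge fires.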

Backpressure and Rate Limiting

From Day 3: when producers outpace consumers, queues grow unbounded and the system silently fails. The cure is backpressure — apply pressure backward to the source. Three layers usually compose:

  • Per-tenant rate limits. Token-bucket per tenant, sized for fair share. One noisy tenant cannot crowd out others.
  • Per-instance concurrency caps. Each app pod accepts at most N concurrent in-flight requests; beyond that, return 503 with Retry-After.
  • Adaptive load shedding. When latency P99 climbs above SLO, the LB starts rejecting low-priority traffic. Algorithms like Adaptive Concurrency Limits (Netflix) auto-tune.

For LLM calls, the upstream rate limit is what bites first. The pattern:

  1. Maintain a local token-bucket per provider; refill at the documented RPM.
  2. Workers acquire a token before calling; block briefly if empty.
  3. On a 429 from the provider, drain tokens to zero, sleep Retry-After, then refill.
  4. Surface backpressure to your callers — return a 503 or queue the work.
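
The four steps above can be sketched as a small token bucket. This is an illustrative single-threaded version: the class and method names are made up for the example, the injectable `clock` is there for testing, and callers are expected to sleep briefly or shed load when `try_acquire` returns False.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_min: float, capacity: float,
                 clock=time.monotonic):
        self.rate_per_s = rate_per_min / 60.0
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()
        self._blocked_until = 0.0

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self._last)
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.rate_per_s)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self._clock()
        if now < self._blocked_until:
            return False                    # still honouring Retry-After
        self._refill(now)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def on_429(self, retry_after_s: float) -> None:
        """Provider pushed back: drain, pause, then refill from zero."""
        now = self._clock()
        self._tokens = 0.0
        self._last = now + retry_after_s    # refill starts after the pause
        self._blocked_until = now + retry_after_s
```

A worker would call `try_acquire()` before each provider request and, on False, either wait a few milliseconds and try again or return a 503 to its own caller.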

Idempotency Done Right

Day 1 PM introduced the idempotency-key contract; here we focus on the implementation pitfalls.

The four common bugs

  1. Storing the key after the side effect. If the worker crashes between executing and recording, retries re-execute. Fix: insert the key first with status in_progress, do the work, update to complete with the response. The unique constraint on the key blocks duplicates.
  2. Not handling concurrent retries. Two retries arrive simultaneously, both find no key, both insert, both execute. Fix: a unique constraint on (key) — one wins; the other catches the conflict and waits for the in-progress execution to finish, then returns the stored response.
  3. Re-executing on body mismatch. The client retries with a different body but the same key. The right answer: 422 with "key already used with different request". Allowing it to re-execute is a privilege-escalation primitive in some contexts.
  4. Forgetting to TTL the key store. Idempotency tables grow forever; queries get slow; the table becomes the bottleneck. Use a 24h or 7d TTL with a background sweeper.
sql — idempotency key table
CREATE TABLE idempotency_keys (
  key            TEXT PRIMARY KEY,
  request_hash   TEXT NOT NULL,
  status         TEXT NOT NULL CHECK (status IN ('in_progress', 'complete')),
  response       JSONB,
  status_code    INT,
  expires_at     TIMESTAMPTZ NOT NULL DEFAULT now() + INTERVAL '24 hours'
);

CREATE INDEX idempotency_expiry ON idempotency_keys (expires_at);
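
A minimal sketch of the insert-first flow follows, using an in-memory SQLite table so it runs anywhere. It is a simplification under stated assumptions: the schema is trimmed (no expiry column), a concurrent duplicate gets a 409 instead of waiting for the original to finish, and a crash inside `do_work` leaves the key stuck at in_progress until it is swept. `handle` and `do_work` are hypothetical names.

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE idempotency_keys ("
    " key TEXT PRIMARY KEY, request_hash TEXT NOT NULL,"
    " status TEXT NOT NULL, response TEXT, status_code INT)"
)


def handle(key: str, body: dict, do_work):
    """Return (status_code, response); do_work runs at most once per key."""
    req_hash = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    try:
        # Claim the key BEFORE the side effect; the primary key blocks dupes.
        with db:
            db.execute(
                "INSERT INTO idempotency_keys (key, request_hash, status)"
                " VALUES (?, ?, 'in_progress')", (key, req_hash))
    except sqlite3.IntegrityError:
        row = db.execute(
            "SELECT request_hash, status, response, status_code"
            " FROM idempotency_keys WHERE key = ?", (key,)).fetchone()
        if row[0] != req_hash:
            return 422, {"error": "key already used with different request"}
        if row[1] == "in_progress":
            return 409, {"error": "original request still in progress"}
        return row[3], json.loads(row[2])   # replay the stored response
    response = do_work(body)
    with db:
        db.execute(
            "UPDATE idempotency_keys SET status = 'complete',"
            " response = ?, status_code = 200 WHERE key = ?",
            (json.dumps(response), key))
    return 200, response
```

Calling `handle` twice with the same key and body runs `do_work` once and replays the stored response the second time; the same key with a different body is rejected.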

Graceful Degradation — Better Wrong Than Down

When a non-critical dependency fails, fall back rather than fail the whole request. Useful examples:

  • Personalization service down → fall back to popular items, not an error.
  • Recommendation service down → render the page without the carousel.
  • LLM model unavailable → fall back to a smaller model, a cached answer, or a templated response.
  • Search down → return the static category list with a banner.

Implementation: every fallback is a small, fast code path that the breaker activates when the primary trips open. Make sure the fallback is cheap — falling back to another remote call merely shifts the failure.

The AI-Era Reliability Playbook

LLM dependencies have made reliability harder than the textbook anticipated. The patterns that have settled out across the industry:

1. Multi-provider routing

Don't depend on a single provider. Configure two or three (e.g., Anthropic + OpenAI + a self-hosted Llama). A simple router ranks them by cost-per-token, then health-checks each in parallel, then routes the call to the highest-ranked healthy provider. When one provider has an outage, traffic shifts automatically. Tools like OpenRouter and LiteLLM package this pattern.
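
The routing decision itself is small. The sketch below assumes each provider exposes some cheap health signal; `Provider`, `is_healthy`, and the cost figures are placeholders for illustration, and the health checks run sequentially here rather than in parallel to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Provider:
    name: str
    cost_per_1k_tokens: float          # hypothetical price used for ranking
    is_healthy: Callable[[], bool]     # e.g. backed by a circuit breaker


def pick_provider(providers: list[Provider]) -> Provider:
    """Return the cheapest provider whose health check passes."""
    for provider in sorted(providers, key=lambda p: p.cost_per_1k_tokens):
        if provider.is_healthy():
            return provider
    raise RuntimeError("no healthy provider available")


providers = [
    Provider("provider-a", 3.0, lambda: True),
    Provider("provider-b", 1.0, lambda: False),   # cheapest, but down
    Provider("provider-c", 2.0, lambda: True),
]
print(pick_provider(providers).name)
```

Tying `is_healthy` to the state of a per-provider circuit breaker is one way to make the traffic shift automatic.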

2. Model fallback

Within a provider, fall back from the big model (Sonnet, GPT-4o) to a smaller one (Haiku, GPT-4o-mini) under load. The smaller model is cheaper, faster, and almost always available; for many tasks it's good enough. The retry sequence: try the big model 1×, on 5xx fall back to the small one immediately rather than retrying the expensive call.
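
A fallback chain of this kind can be sketched as follows. `ProviderError` and the `(name, call)` pairs are hypothetical stand-ins for whatever your client library raises and exposes; the sketch also treats 429 as a reason to fall back, which is an assumption beyond the 5xx case described above.

```python
class ProviderError(Exception):
    """Hypothetical error carrying an HTTP-like status code."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


def complete_with_fallback(prompt: str, models):
    """Try each (name, call) pair once, in order; return (name, reply)."""
    if not models:
        raise ValueError("at least one model is required")
    last_error = None
    for name, call in models:
        try:
            return name, call(prompt)
        except ProviderError as err:
            if err.status < 500 and err.status != 429:
                raise                  # a bad request fails on every model
            last_error = err           # overloaded or down: try the next one
    raise last_error
```

Returning the model name alongside the reply lets the caller log how often the smaller model is actually serving traffic.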

3. Semantic and prompt-prefix caches

Covered in Day 2 PM. The cache is a reliability tool too: when the upstream is rate-limiting, served-from-cache responses keep the user experience alive.

4. Async escape hatch

For non-interactive flows ("summarize this PDF"), the request creates a job and returns 202 Accepted. The user polls or receives a webhook. If the LLM is overloaded right now, the worker simply retries with backoff — the user never sees the failure mode, only the delay.

5. Eval-as-canary

When a new model version drops, don't switch all traffic. Run the eval suite (Day 7) against the new version's responses; if regression metrics stay green, slowly shift traffic. Treat model versions like deploys.

🌱
Mock the AI in tests
A test suite that hits the real LLM is slow, flaky, and expensive. Mock the LLM client at the boundary; have a small set of integration tests against the real provider that run nightly. Day-to-day CI runs against deterministic stubs. The integration tests validate the contract; the unit tests validate the application logic that wraps it.
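
A stub at the client boundary can be as small as this. `StubLLM` and `summarize` are invented for the example; the only assumption is that your application code receives the LLM client as an argument rather than constructing it internally.

```python
class StubLLM:
    """Deterministic stand-in for a real LLM client."""

    def __init__(self, canned_reply: str):
        self.canned_reply = canned_reply
        self.prompts = []                  # records what the app sent

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.canned_reply


def summarize(llm, text: str) -> str:
    """Application logic under test: prompt building and reply cleanup."""
    reply = llm.complete(f"Summarize in one sentence:\n{text}")
    return reply.strip() or "(no summary available)"


stub = StubLLM("  A short summary.  ")
print(summarize(stub, "some long document"))
```

The unit test then asserts on both the cleaned-up output and the prompt the stub recorded, with no network call involved.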

Chaos Engineering — Rehearsing Failures

Reliability isn't proven by green dashboards in the steady state; it's proven by surviving failures. Chaos engineering (Netflix, 2010) is the practice of deliberately injecting failures and observing how the system responds.

  • Pod kill — the original Chaos Monkey. Random instance termination during business hours.
  • Latency injection — slow a dependency by 500 ms; do circuit breakers and timeouts react?
  • Failure injection — return 500s from a dependency; do fallbacks engage?
  • Region failover drill — close a whole region; does traffic shift within the error budget?

Tools: Gremlin, Chaos Mesh, AWS Fault Injection Service, Istio fault injection. Start small (10% of one staging environment) and graduate. A failover the team hasn't drilled is a failover that doesn't work.

Quick check
Your service depends on an LLM API. The provider has a 60-second outage. During the outage, your P99 latency spikes from 800 ms to 30 s, and connection pool errors propagate to unrelated endpoints. After the outage, things stay degraded for several minutes. Diagnose three issues and propose fixes.
Answer
Issue 1 — no per-dependency bulkhead. Handlers stalled on the LLM consumed the global pool. Fix: isolate the LLM client with its own concurrency limit and pool.
Issue 2 — no circuit breaker. The service kept retrying after the provider was clearly broken. Fix: a breaker that opens at > 50% failure rate over 10 s; while open, route to the fallback model or the async queue.
Issue 3 — sync retries with no jitter. When the provider recovered, every client in the system retried in lockstep, prolonging the brownout. Fix: exponential backoff with full jitter; cap retries at 3; respect Retry-After. Bonus issue: no SLO + error budget conversation, so nobody knew this was urgent until the user complaints.

The Reliability Checklist

For every external dependency in your service, answer these questions in writing:

  1. What's the timeout (connect, read, total)?
  2. What errors are retryable vs not? How many attempts, what backoff?
  3. Is there an idempotency key on writes?
  4. Is the call inside a bulkhead with a concurrency cap?
  5. Is there a circuit breaker, and what are its thresholds?
  6. What's the fallback when the breaker is open?
  7. What does the user see when the call fails?
  8. Is the call's latency and error rate visible on a dashboard? Alerted on?

If you can't answer these, the dependency is a hidden liability. Most outages trace back to one of these questions being unanswered for a service that became critical six months after launch.

Mnemonic — reliability triangle
"Timeout, Retry, Breaker — and Idempotency makes them safe."
  • Timeout — bound the wait.
  • Retry — exponential + full jitter, capped attempts.
  • Breaker — fail fast when the dependency is genuinely down.
  • Idempotency — the retry doesn't double-charge.
Flashcard
Your team's SLO is 99.9% availability for the chat completion endpoint, measured monthly. The LLM provider's published SLA is 99.9%. What's wrong with this picture, and what's the fix?
Answer
Math: if your service depends on a 99.9% provider with no fallback, your maximum is 99.9%. With your own ~0.1% of unrelated failures (deploys, internal bugs, network), you're closer to 99.8%. Fix: dependency math — you need higher reliability than your dependencies, OR multiple providers with health-check failover, OR a graceful-degradation path (cached / smaller-model / async). Never set your SLO equal to your provider's SLA. The error budget = your budget for everything else, after dependency failures eat their share.
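
The dependency math can be checked with a short calculation. The sketch assumes failures are independent, which real outages often are not, so treat the failover figure as an upper bound; both function names are illustrative.

```python
from math import prod


def serial_availability(*parts: float) -> float:
    """Availability when every part must work for the request to succeed."""
    return prod(parts)


def with_failover(*providers: float) -> float:
    """Availability when any one of the redundant providers is enough."""
    return 1.0 - prod(1.0 - p for p in providers)


# Your own 99.9% stacked on a single 99.9% provider:
print(round(serial_availability(0.999, 0.999), 6))
# The same service with two independent 99.9% providers behind failover:
print(round(serial_availability(0.999, with_failover(0.999, 0.999)), 6))
```

The first figure lands near 99.8%, below the stated SLO; the second stays just under 99.9% because the redundant pair is no longer the weakest link.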
🔑
Key takeaways
1) SLOs and error budgets turn reliability arguments into mechanics. 2) The timeout + retry + circuit breaker triangle is the foundational pattern; missing any one is an outage waiting to happen. 3) Exponential backoff with full jitter is the safe default retry strategy. 4) Bulkheads per dependency keep one slow endpoint from sinking the rest. 5) Idempotency keys make retries safe — without them, retries are wrong by construction. 6) The AI-era playbook — multi-provider routing, model fallback, semantic cache, async escape — is just "reliability for slow flaky dependencies" applied to the slowest, flakiest dependency you have.
