Reliability Engineering — SLOs, Retries, Circuit Breakers
Reliability isn't an outcome — it's a discipline. Master SLOs and error budgets, the retry-with-jitter rules that prevent thundering herds, circuit breakers, bulkheads, timeouts, and the AI-call reliability patterns that emerged when external dependencies became the long pole in every request.
What you will learn
Every system fails. The only question is whether the failure stops at one component or cascades. Reliability engineering is the practice of containing failures: choosing what "working" means (SLOs), controlling what failure looks like (timeouts, retries, circuit breakers), and rehearsing what comes next (chaos drills, runbooks). It's also where the AI era forced the biggest mental shift — when the long pole in your latency budget is somebody else's GPU cluster, your reliability story is mostly about coping with their flakiness.
SLIs, SLOs, and the Error Budget
You cannot improve what you cannot measure. The SRE practice (Google, 2003 onward) gave us a vocabulary that is now industry standard.
- SLI (Service Level Indicator) — a number that measures user-perceived behaviour. Examples: "fraction of requests with status 200 and latency < 500 ms".
- SLO (Service Level Objective) — a target for an SLI over a window. Example: "99.9% over a rolling 30 days".
- SLA (Service Level Agreement) — a contractual SLO with a penalty clause. Always looser than your internal SLOs.
- Error budget — the complement of the SLO (100% minus the target). 99.9% over 30 days = 43 minutes of allowed failure (0.1% of 43,200 minutes). That's your budget to spend on risky things: deploys, experiments, migrations.
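The budget arithmetic above is easy to script. This is a minimal sketch; the function names are illustrative rather than taken from any SRE tooling, and it assumes a time-based availability SLO.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed failure for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


print(round(error_budget_minutes(0.999), 1))  # 43.2
```

A deploy gate could compare `budget_remaining` against a threshold before allowing a risky rollout.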
Picking SLOs
The trap: setting SLOs based on what you wish were true rather than what the user notices. Three rules:
- One per critical user journey. Latency for "home page loads". Availability for "checkout completes". Don't aggregate everything into one number.
- Start with what you measure today. If you're at 99.5% and want 99.9%, that's a real engineering project. Setting 99.99% as an aspiration tells the team nothing.
- Tighter SLOs cost exponentially more. Going from 99% (3.65 days/yr) to 99.9% (8.76 hours) is one project; from 99.9% to 99.99% (52 minutes) is several. Pick the level the business actually needs.
The error budget conversation
The error budget unlocks an honest dialog. If the budget is exhausted, deploys pause until the next window. If the budget is largely unused, the team can take more risk — ship faster, run experiments, do migrations. Without an error budget, every engineering decision becomes a fight between feature velocity and "reliability". With one, the trade-off is mechanical.
The Reliability Triangle: Timeouts + Retries + Circuit Breakers
Three controls compose to handle most upstream-failure modes. Missing any one is an outage waiting to happen.
Timeouts
Every network call must have a timeout. "The default" is rarely correct — most HTTP clients have no timeout out of the box, or one too long to be useful. Configure three numbers:
- Connect timeout — how long to wait for the TCP+TLS handshake. 1–3 seconds is typical.
- Read timeout — how long to wait between bytes once the response starts. Must accommodate the slowest legitimate response.
- Total / overall timeout — the budget for the whole call, including retries. Must fit inside the parent's timeout (timeout budgets cascade up the call graph).
For LLM calls specifically: the read timeout has to handle the slow-streaming case (a 30 s response with 100 ms gaps between tokens). Total timeout might be 60 s. Always use streaming clients for LLM calls; the inactivity timeout is the right knob, not the wall-clock total.
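One way to make timeout budgets cascade is to pass a deadline down the call graph and derive each call's timeout from what is left. The sketch below is hypothetical (the `Deadline` class and its method names are not from any particular library); the clock is injectable so the behaviour can be tested without sleeping.

```python
import time


class Deadline:
    """A total time budget shared by a call and everything it calls."""

    def __init__(self, budget_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, per_call_cap_s: float) -> float:
        """Timeout for one downstream call: its own cap, but never more than what is left."""
        left = self.remaining()
        if left <= 0.0:
            raise TimeoutError("deadline exceeded before the call started")
        return min(per_call_cap_s, left)
```

A handler with a 2 s budget would create `Deadline(2.0)` once and pass `deadline.timeout_for(cap)` to each client call, so a retry late in the request cannot outlive the parent's timeout.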
Retries — with jitter, always
Retries handle transient failures: a momentarily-overloaded peer, a lost packet, a brief network hiccup. The wrong way: for i in range(3): time.sleep(1); ... — every failed client retries in lockstep, hammering the recovering peer. The right way: exponential backoff with full jitter.
import random
import time

MAX_ATTEMPTS = 5
BASE_DELAY = 0.1   # 100 ms
MAX_DELAY = 10     # 10 s cap


class RetryableError(Exception):
    """Transient failure: 5xx, connection error, timeout."""


class NonRetryableError(Exception):
    """Permanent failure: 4xx, business errors."""


def call_with_retry(fn):
    for attempt in range(MAX_ATTEMPTS):
        try:
            return fn()
        except RetryableError:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # out of attempts: surface the last error
            backoff = min(MAX_DELAY, BASE_DELAY * 2 ** attempt)
            delay = random.uniform(0, backoff)  # FULL jitter
            time.sleep(delay)
        except NonRetryableError:
            raise  # 4xx, business errors

The jitter is what saves you. AWS published its well-known "Exponential Backoff and Jitter" article in 2015, comparing several backoff strategies; full jitter (a random delay between 0 and the exponential cap) was the strongest at preventing convoy failures. Most modern client libraries apply the same logic.
Three rules for retries
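- Only retry retryable errors. 5xx, connection failures, timeouts. Never ordinary 4xx (the request was wrong; retrying just hits the wall harder). The usual exceptions are 429 and 408, which signal "try again later" rather than "bad request".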
- Always cap attempts. 3–5 is typical. Beyond that, you're amplifying load on a recovering dependency.
- Always carry an idempotency key for non-GET retries (Day 1 PM). Without it, retries can double-charge or double-create.
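The first rule can be made mechanical with a small classifier. This is a sketch under stated assumptions: the status set is one common choice rather than a standard, it treats 429 and 408 as retryable, and `next_delay` is a hypothetical helper that combines full jitter with a server-supplied Retry-After value in seconds.

```python
import random

# Assumption: these statuses indicate transient conditions worth retrying.
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}


def is_retryable(status: int) -> bool:
    """True if a response with this HTTP status may succeed on a later attempt."""
    return status in RETRYABLE_STATUSES


def next_delay(attempt: int, retry_after=None, base=0.1, cap=10.0) -> float:
    """Full-jitter backoff, but never shorter than the server's Retry-After."""
    backoff = min(cap, base * 2 ** attempt)
    jittered = random.uniform(0, backoff)
    if retry_after is not None:
        return max(jittered, retry_after)
    return jittered
```

Honouring Retry-After even when it exceeds the local cap is a design choice: the server has told you when capacity returns, so retrying sooner only wastes an attempt.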
Circuit breakers
When a peer is genuinely down, retries are wasted bandwidth. The circuit breaker tracks recent failure rate; when it crosses a threshold, the breaker opens and subsequent calls fail immediately without hitting the network. After a cooldown, the breaker enters half-open, lets one or two trial requests through, and either re-opens (failures continue) or closes (success returns).
How to size a breaker
Common settings: open after 50% failures over a 10-second window with a minimum of 20 requests; cooldown 30 seconds; half-open allows 3 trial requests. Tune to your traffic: a 1-RPS service can't collect 20 requests in a 10-second window, so widen the window or lower the minimum.
The Hystrix library (Netflix) popularized this pattern; resilience4j, Polly (.NET), pybreaker (Python), and service meshes (Istio, Linkerd) all implement it. For LLM calls, breakers are particularly valuable: when the upstream provider is rate-limiting, opening the breaker preserves your remaining budget instead of burning through it on doomed retries.
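The closed / open / half-open cycle can be sketched in a few dozen lines. This is an illustrative, single-threaded version rather than the API of any library named above: the class and parameter names are made up, it closes after one successful trial call instead of three, and it needs a lock before being shared across threads.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, window_s=10.0, min_calls=20,
                 cooldown_s=30.0, clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.window_s = window_s
        self.min_calls = min_calls
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._events = deque()  # (timestamp, succeeded) for recent calls
        self._state = "closed"
        self._opened_at = 0.0

    @property
    def state(self):
        # An open breaker becomes half-open once the cooldown has elapsed.
        if self._state == "open" and self._clock() - self._opened_at >= self.cooldown_s:
            self._state = "half_open"
        return self._state

    def call(self, fn):
        state = self.state
        if state == "open":
            raise CircuitOpenError("failing fast; dependency marked unhealthy")
        try:
            result = fn()
        except Exception:
            self._on_failure(state)
            raise
        self._on_success(state)
        return result

    def _record(self, ok):
        now = self._clock()
        self._events.append((now, ok))
        # Drop events that have aged out of the rolling window.
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()

    def _on_success(self, state):
        if state == "half_open":
            # Trial call succeeded: close and start with a clean window.
            self._state = "closed"
            self._events.clear()
        else:
            self._record(True)

    def _on_failure(self, state):
        if state == "half_open":
            self._trip()  # trial failed: back to open for another cooldown
            return
        self._record(False)
        total = len(self._events)
        failures = sum(1 for _, ok in self._events if not ok)
        if total >= self.min_calls and failures / total >= self.failure_ratio:
            self._trip()

    def _trip(self):
        self._state = "open"
        self._opened_at = self._clock()
        self._events.clear()
```

Callers wrap each dependency call as `breaker.call(lambda: client.get(...))` and treat `CircuitOpenError` as the signal to use a fallback.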
Bulkheads — Containing Blast Radius
The bulkhead pattern (named after ship compartments) isolates resources so failure in one area can't sink the whole vessel. Concretely: rather than one shared HTTP client, give each downstream dependency its own connection pool with its own concurrency cap. When dependency A is slow, its pool fills up; calls to A queue or are rejected (for example with a 429 or 503); calls to B and C are unaffected.
# resilience4j-style configuration
clients:
  payments-api:
    max-concurrent: 50
    timeout: 5s
    retry: { attempts: 3, base-delay: 100ms }
  llm-provider:
    max-concurrent: 20   # tighter: slow, expensive
    timeout: 60s
    retry: { attempts: 2, base-delay: 500ms }
  search-api:
    max-concurrent: 100
    timeout: 200ms
    retry: { attempts: 2, base-delay: 50ms }

The bulkhead alone is the #1 fix for the cascading-failure pattern from Day 1: a single slow LLM endpoint exhausting the global connection pool and degrading every other endpoint. With per-dependency limits, the LLM endpoint hits its own ceiling first; everything else stays fast.
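In application code, a per-dependency concurrency cap can be approximated with one semaphore per dependency. The sketch below uses hypothetical names (`Bulkhead`, `BulkheadFullError`) and rejects immediately when the cap is reached instead of queueing, which is only one of the two behaviours described above.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency cap is already in use."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: shed load rather than pile up waiting threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: concurrency cap reached")
        try:
            yield
        finally:
            self._slots.release()


# Caps mirror the illustrative configuration above.
BULKHEADS = {
    "payments-api": Bulkhead("payments-api", 50),
    "llm-provider": Bulkhead("llm-provider", 20),
    "search-api": Bulkhead("search-api", 100),
}
```

A call site then reads `with BULKHEADS["llm-provider"].slot(): ...`, and a saturated LLM pool raises `BulkheadFullError` without touching the other two.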
Hedged Requests — Trading Cost for Tail Latency
The Tail at Scale paper (Dean & Barroso) popularized hedged requests: send a request to backend A; if it hasn't returned within (e.g.) the P95 of normal latency, send the same request to backend B; take whichever returns first; cancel the loser. The cost is roughly +5% requests, because only the slowest 5% of calls ever trigger the hedge. The win can be dramatic: tail latency drops sharply because it is rare for both copies to land on a slow backend.
This is most useful when:
- Backends are stateless or idempotent (so the duplicate is harmless).
- The 5% cost is much smaller than the latency gain in user-impact terms.
- You can identify a P95 threshold to start the hedge — too eager kills the cost saving; too late doesn't help.
For LLM workloads: hedge across different model providers when latency varies. Send to OpenAI; if no first token after 1 s, also send to Anthropic; take whichever responds. Idempotency makes the cancel safe.
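A hedge can be sketched with asyncio. This is a minimal illustration, not a production client: `hedged`, `slow_backend`, and `fast_backend` are hypothetical names, the backends are simulated with sleeps, and the first task to finish wins even if it finished with an exception.

```python
import asyncio


async def hedged(primary, backup, hedge_after_s):
    """Run primary(); if it is still pending after hedge_after_s, also run backup().

    Returns the first result to arrive and cancels the loser.
    """
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()  # primary was fast enough; no hedge sent

    second = asyncio.create_task(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Let cancelled tasks unwind before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    winner = first if first in done else second
    return winner.result()


async def slow_backend():
    await asyncio.sleep(1.0)  # simulated straggler
    return "A"


async def fast_backend():
    await asyncio.sleep(0.01)  # simulated healthy backend
    return "B"
```

Running `asyncio.run(hedged(slow_backend, fast_backend, hedge_after_s=0.05))` sends the hedge after 50 ms and returns the fast backend's answer while the straggler is cancelled.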
Backpressure and Rate Limiting
From Day 3: when producers outpace consumers, queues grow unbounded and the system silently fails. The cure is backpressure — apply pressure backward to the source. Three layers usually compose:
- Per-tenant rate limits. Token-bucket per tenant, sized for fair share. One noisy tenant cannot crowd out others.
- Per-instance concurrency caps. Each app pod accepts at most N concurrent in-flight requests; beyond that, return 503 with Retry-After.
- Adaptive load shedding. When latency P99 climbs above SLO, the LB starts rejecting low-priority traffic. Algorithms like Adaptive Concurrency Limits (Netflix) auto-tune.
For LLM calls, the upstream rate limit is what bites first. The pattern:
- Maintain a local token-bucket per provider; refill at the documented RPM.
- Workers acquire a token before calling; block briefly if empty.
- On a 429 from the provider, drain tokens to zero, sleep Retry-After, then refill.
- Surface backpressure to your callers — return a 503 or queue the work.
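The local bucket described in the list above could look like the following. It is a sketch with assumed names (`ProviderBucket`, `on_429`), it is not thread-safe, and it interprets "then refill" as resuming the normal refill rate from an empty bucket once the Retry-After pause ends.

```python
import time


class ProviderBucket:
    """Client-side token bucket for one provider, refilled at its documented RPM."""

    def __init__(self, rpm: int, clock=time.monotonic):
        self.capacity = float(rpm)
        self.rate = rpm / 60.0  # tokens per second
        self._clock = clock
        self._tokens = float(rpm)
        self._last = clock()
        self._paused_until = 0.0

    def _refill(self):
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; callers wait briefly or shed load on False."""
        if self._clock() < self._paused_until:
            return False  # still honouring the provider's Retry-After
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def on_429(self, retry_after_s: float):
        """Provider pushed back: drain the bucket and pause until Retry-After elapses."""
        self._tokens = 0.0
        self._paused_until = self._clock() + retry_after_s
        self._last = self._paused_until  # refill starts counting after the pause
```

When `try_acquire` keeps returning False, that is the point at which to return a 503 or enqueue the work rather than block the request thread.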
Idempotency Done Right
Day 1 PM introduced the idempotency-key contract; here we focus on the implementation pitfalls.
The four common bugs
- Storing the key after the side effect. If the worker crashes between executing and recording, retries re-execute. Fix: insert the key first with status in_progress, do the work, update to complete with the response. The unique constraint on the key blocks duplicates.
- Not capturing concurrent retries. Two retries arrive simultaneously, both find no key, both insert, both execute. Fix: a unique constraint on (key) — one wins; the other catches the conflict and waits for the in-progress execution to finish, then returns the stored response.
- Re-executing on body mismatch. The client retries with a different body but the same key. The right answer: 422 with "key already used with different request". Allowing it to re-execute is a privilege-escalation primitive in some contexts.
- Forgetting to TTL the key store. Idempotency tables grow forever; queries get slow; the table becomes the bottleneck. Use a 24h or 7d TTL with a background sweeper.
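The first three fixes can be sketched against an in-memory SQLite table. The `handle` function and its return shape are hypothetical, and several simplifications are assumptions of this sketch: a concurrent duplicate gets a 409 instead of waiting for the in-progress execution, a failed `do_work` leaves the key in_progress, and expiry is not handled.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE idempotency_keys ("
    " key TEXT PRIMARY KEY,"
    " request_hash TEXT NOT NULL,"
    " status TEXT NOT NULL,"
    " response TEXT)"
)


def handle(key, request_hash, do_work):
    """Run do_work at most once per key. Returns (status_code, body)."""
    try:
        with db:  # commits on success, rolls back on error
            # Claim the key BEFORE the side effect; the primary key blocks duplicates.
            db.execute(
                "INSERT INTO idempotency_keys (key, request_hash, status)"
                " VALUES (?, ?, 'in_progress')",
                (key, request_hash),
            )
    except sqlite3.IntegrityError:
        stored_hash, status, response = db.execute(
            "SELECT request_hash, status, response FROM idempotency_keys WHERE key = ?",
            (key,),
        ).fetchone()
        if stored_hash != request_hash:
            return 422, {"error": "key already used with different request"}
        if status == "in_progress":
            # Simplification: report the conflict instead of waiting.
            return 409, {"error": "original request still in progress"}
        return 200, json.loads(response)  # replay the stored response

    body = do_work()
    with db:
        db.execute(
            "UPDATE idempotency_keys SET status = 'complete', response = ? WHERE key = ?",
            (json.dumps(body), key),
        )
    return 200, body
```

A retry with the same key and body returns the stored response without calling `do_work` again, which is what prevents the double charge.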
CREATE TABLE idempotency_keys (
    key          TEXT PRIMARY KEY,
    request_hash TEXT NOT NULL,
    status       TEXT NOT NULL CHECK (status IN ('in_progress', 'complete')),
    response     JSONB,
    status_code  INT,
    expires_at   TIMESTAMPTZ NOT NULL DEFAULT now() + INTERVAL '24 hours'
);
CREATE INDEX idempotency_expiry ON idempotency_keys (expires_at);

Graceful Degradation — Better Wrong Than Down
When a non-critical dependency fails, fall back rather than fail the whole request. Useful examples:
- Personalization service down → fall back to popular items, not an error.
- Recommendation service down → render the page without the carousel.
- LLM model unavailable → fall back to a smaller model, a cached answer, or a templated response.
- Search down → return the static category list with a banner.
Implementation: every fallback is a small, fast code path that the breaker activates when the primary trips open. Make sure the fallback is cheap — falling back to another remote call merely shifts the failure.
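A fallback path can be as small as a try/except around the primary call. In this sketch `POPULAR_ITEMS` and `recommendations` are hypothetical, and the fallback is an in-memory list so it cannot itself fail on the network.

```python
# Assumption: a precomputed list refreshed out of band and held in memory.
POPULAR_ITEMS = ["item-1", "item-2", "item-3"]


def recommendations(user_id, fetch_personalized):
    """Return personalized items, or popular items if the primary call fails."""
    try:
        return {"items": fetch_personalized(user_id), "degraded": False}
    except Exception:
        # Cheap local fallback; the flag lets the UI and dashboards see the degradation.
        return {"items": POPULAR_ITEMS, "degraded": True}
```

Returning a `degraded` flag keeps the fallback visible, so a silent long-running degradation still shows up in metrics.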
The AI-Era Reliability Playbook
LLM dependencies have made reliability harder than the textbook anticipated. The patterns that have settled out across the industry:
1. Multi-provider routing
Don't depend on a single provider. Configure two or three (e.g., Anthropic + OpenAI + a self-hosted Llama). A simple router ranks them by cost-per-token, then health-checks each in parallel, then routes the call to the highest-ranked healthy provider. When one provider has an outage, traffic shifts automatically. Tools like OpenRouter and LiteLLM package this as a product or library.
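The routing rule reduces to "cheapest healthy provider wins". The sketch below uses made-up names (`Provider`, `pick_provider`) and checks health sequentially for brevity, whereas the router described above probes in parallel; how `is_healthy` is implemented (a breaker state, a recent probe) is left as an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Provider:
    name: str
    cost_per_1k_tokens: float
    is_healthy: Callable[[], bool]  # e.g. backed by a breaker or a recent probe


def pick_provider(providers: Sequence[Provider]) -> Provider:
    """Return the cheapest provider that currently reports healthy."""
    for provider in sorted(providers, key=lambda p: p.cost_per_1k_tokens):
        if provider.is_healthy():
            return provider
    raise RuntimeError("no healthy provider available")
```

If every provider is unhealthy, the `RuntimeError` is the cue to serve from cache or push the work to an async queue.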
2. Model fallback
Within a provider, fall back from the big model (Sonnet, GPT-4o) to a smaller one (Haiku, GPT-4o-mini) under load. The smaller model is cheaper, faster, and almost always available; for many tasks it's good enough. The retry sequence: try the big model 1×, on 5xx fall back to the small one immediately rather than retrying the expensive call.
3. Semantic and prompt-prefix caches
Day 2 PM. The cache is a reliability tool too: when the upstream is rate-limiting, served-from-cache responses keep the user experience alive.
4. Async escape hatch
For non-interactive flows ("summarize this PDF"), the request creates a job and returns 202 Accepted. The user polls or receives a webhook. If the LLM is overloaded right now, the worker simply retries with backoff — the user never sees the failure mode, only the delay.
5. Eval-as-canary
When a new model version drops, don't switch all traffic. Run the eval suite (Day 7) against the new version's responses; if regression metrics stay green, slowly shift traffic. Treat model versions like deploys.
Chaos Engineering — Rehearsing Failures
Reliability isn't proven by green dashboards in the steady state; it's proven by surviving failures. Chaos engineering (Netflix, 2010) is the practice of deliberately injecting failures and observing how the system responds.
- Pod kill — the original Chaos Monkey. Random instance termination during business hours.
- Latency injection — slow a dependency by 500 ms; do circuit breakers and timeouts react?
- Failure injection — return 500s from a dependency; do fallbacks engage?
- Region failover drill — take a whole region offline; does traffic shift within the time your SLO allows?
Tools: Gremlin, Chaos Mesh, AWS Fault Injection Service, Istio fault injection. Start small (10% of one staging environment) and graduate. A failover that has never been drilled should be assumed not to work.
Issue 2 — no circuit breaker. The service kept retrying after the provider was clearly broken. Fix: a breaker that opens at > 50% failure rate over 10 s; while open, route to the fallback model or the async queue.
Issue 3 — sync retries with no jitter. When the provider recovered, every client in the system retried in lockstep, prolonging the brownout. Fix: exponential backoff with full jitter; cap retries at 3; respect Retry-After.
Bonus issue: no SLO + error budget conversation, so nobody knew this was urgent until users complained.
The Reliability Checklist
For every external dependency in your service, answer these questions in writing:
- What's the timeout (connect, read, total)?
- What errors are retryable vs not? How many attempts, what backoff?
- Is there an idempotency key on writes?
- Is the call inside a bulkhead with a concurrency cap?
- Is there a circuit breaker, and what are its thresholds?
- What's the fallback when the breaker is open?
- What does the user see when the call fails?
- Is the call's latency and error rate visible on a dashboard? Alerted on?
If you can't answer these, the dependency is a hidden liability. Most outages trace back to one of these questions being unanswered for a service that became critical six months after launch.
- Timeout — bound the wait.
- Retry — exponential + full jitter, capped attempts.
- Breaker — fail fast when the dependency is genuinely down.
- Idempotency — the retry doesn't double-charge.
- Google SRE Workbook — Implementing SLOs (sre.google)
- AWS — Exponential Backoff and Jitter (aws.amazon.com)
- Fowler — CircuitBreaker (martinfowler.com)
- Dean & Barroso — The Tail at Scale (research.google)
- Netflix — Adaptive Concurrency Limits (netflixtechblog.com)
- Principles of Chaos Engineering (principlesofchaos.org)