The Engineering Codex/Backend Engineering for the AI Era
DAY 1 · AM

Backend Foundations & the Modern Service Anatomy

13 min · Intermediate · 2,782 words
Before pattern names and frameworks, the backend is a request walking through a small number of moving parts. Learn the four-layer service anatomy, where every microsecond of latency goes, why statelessness scaled the web, and how an LLM call distorts every assumption you previously held about p99.

What you will learn

01 The Four Layers Every Backend Has
02 The Request Lifecycle — Where Time Actually Goes
03 Statelessness — The Decision That Built the Modern Web
04 Concurrency Models — How Each Process Spends Its Time
05 The AI Exception — When the Tail Becomes the Workload
06 The Shape of a Modern Service

A backend is the place where a request arrives, becomes a few database queries, maybe a remote call, and then leaves as a response. Stripped of frameworks and acronyms, that's it. Everything you'll learn this week is tooling around those four steps: arrive, query, call out, respond. This chapter installs the mental scaffolding: the layers that exist in every production service, where time and money actually go, and the way modern AI dependencies twist the old physics.

🔑
Today's anatomy
1) The four layers — edge, app, data, async — and the contract each one keeps. 2) The request lifecycle from DNS to response, with where the milliseconds live. 3) Statelessness as a load-bearing decision, not a style. 4) The AI exception: why a single LLM call can dominate every other line in your latency budget and what that forces you to redesign.

The Four Layers Every Backend Has

The shape of a production backend is the same whether you're running a Django monolith, a 200-service mesh, or a single Lambda. Four layers, ordered by what's closest to the user. Drawing them out is the first thing you do at every architecture review.

[Figure: the four layers, ordered from the user inward, plus external dependencies]
EDGE: DNS · CDN · WAF · TLS · L7 load balancer · ingress
APPLICATION: handler · auth · validation · business logic · ORM
DATA: RDBMS · cache · vector · object store
ASYNC: queues · workers · cron · streams
EXTERNAL DEPENDENCIES: payments · email · LLMs · third-party APIs (each adds tail latency)
The four-layer service anatomy. Most production services are 80% one of these layers; mature ones use all four.

Edge — the polite bouncer

The edge is everything between the client's first TCP packet and your application code. DNS resolves the hostname (cached, subject to TTLs), TLS performs the cryptographic handshake (1 RTT for a full TLS 1.3 handshake, 2 RTTs in TLS 1.2, both on top of the TCP handshake), a CDN may serve a cached response without ever talking to your origin, a WAF filters obviously hostile traffic, and an L7 load balancer (NGINX, Envoy, ALB) parses the HTTP request and routes by path, host, or weight. The edge layer is where you absorb load cheaply. A request served from the CDN cache is essentially free; a request that reaches an unprotected origin is the expensive case you want to avoid.

Application — the only place "your code" runs

The app layer is where you spend your career. A request handler authenticates the caller, validates input, runs business logic, calls the data layer, and renders a response. The handler is sometimes 50 lines and sometimes 5,000; mature codebases ruthlessly extract patterns into middleware (auth, logging, rate-limit, error-handling) so the handler reads like the business operation it represents. The app should be stateless — anything it remembers between requests goes in the data or async layer.
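The middleware idea above can be sketched in a few lines. This is a minimal, framework-free illustration: the `Request`/`Response` dictionaries, the decorator names `with_auth` and `with_timing`, and the `create_order` handler are all hypothetical, chosen only to show how cross-cutting concerns wrap a handler so the handler body reads like the business operation.

```python
import time
from typing import Callable, Dict

Request = Dict[str, str]
Response = Dict[str, object]
Handler = Callable[[Request], Response]


def with_auth(handler: Handler) -> Handler:
    """Reject requests without a token before any business logic runs."""
    def wrapped(request: Request) -> Response:
        if not request.get("token"):
            return {"status": 401, "body": "unauthenticated"}
        return handler(request)
    return wrapped


def with_timing(handler: Handler) -> Handler:
    """Attach the handler's wall-clock duration to the response."""
    def wrapped(request: Request) -> Response:
        start = time.perf_counter()
        response = handler(request)
        response["elapsed_ms"] = (time.perf_counter() - start) * 1000
        return response
    return wrapped


@with_timing
@with_auth
def create_order(request: Request) -> Response:
    # Only the business operation lives here; auth and timing are middleware.
    return {"status": 201, "body": f"order created for {request['item']}"}


ok = create_order({"token": "demo-token", "item": "book"})
denied = create_order({"item": "book"})
```

Real frameworks provide their own middleware hooks; the point of the sketch is only the layering, with the outermost wrapper seeing every request, including the rejected ones.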

Data — durable state

Whatever has to survive a process restart lives here: relational databases, key-value stores, caches, vector stores, object storage. Day 2 is dedicated entirely to this layer. The single most important property is that multiple application instances must be able to read and write the same data; once that's true, the app layer is interchangeable and you can scale it horizontally without ceremony.

Async — work that escapes the request

Anything you can't or shouldn't do during the request belongs here. Sending an email, generating a thumbnail, recomputing an embedding, calling an LLM that takes 12 seconds: all of these escape the request lifecycle into a queue, where worker processes consume jobs at their own pace. We treat this as a first-class layer because most failure modes in modern backends are async-related: jobs that retry forever, jobs that never run, jobs that run twice. Day 3 is dedicated to this layer.
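Two of the failure modes named above (jobs that run twice, jobs that retry forever) are commonly handled with a dedupe check and an attempt cap. The sketch below is a hedged illustration only: the job dictionary shape, `MAX_ATTEMPTS`, and the in-memory `processed_ids` and `dead_letters` containers are assumptions standing in for a durable store and a real dead-letter queue.

```python
MAX_ATTEMPTS = 3  # assumed cap; real systems tune this per job type


def process(job, handler, processed_ids, dead_letters):
    """Run one delivery of a job under at-least-once delivery semantics."""
    if job["id"] in processed_ids:
        return "duplicate"          # already done: a redelivery must be a no-op
    if job["attempts"] >= MAX_ATTEMPTS:
        dead_letters.append(job)    # stop retrying; park it for a human
        return "dead-lettered"
    job["attempts"] += 1
    try:
        handler(job)
    except Exception:
        return "retry"              # the queue would redeliver later
    processed_ids.add(job["id"])
    return "done"


sent = []
processed_ids, dead_letters = set(), []

email_job = {"id": "job-1", "attempts": 0, "to": "a@example.com"}
first = process(email_job, lambda j: sent.append(j["to"]), processed_ids, dead_letters)
second = process(email_job, lambda j: sent.append(j["to"]), processed_ids, dead_letters)


def always_fails(job):
    raise RuntimeError("provider down")


bad_job = {"id": "job-2", "attempts": 0, "to": "b@example.com"}
outcomes = [process(bad_job, always_fails, processed_ids, dead_letters) for _ in range(4)]
```

In this sketch the dedupe record is written after the handler succeeds, so a crash between the two can still cause one repeat; a production version would make the side effect itself idempotent or record both in one transaction.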

💡
External dependencies aren't a layer — they're a tax
Every external API you call (Stripe, SendGrid, OpenAI) is somebody else's full four-layer stack with its own outage modes. Treat these as taxed dependencies: budget time, budget failure, plan retries, and assume the third party is slower and less reliable than your own data layer. The AI era turned this from an occasional concern into a P99 dominator.

The Request Lifecycle — Where Time Actually Goes

Latency budgets are the difference between an engineer who's vaguely worried about performance and one who can defend a number. A user-perceptible web request should generally finish in under 200 ms server-side for snappy UI; an internal API call between services should be in the single-digit milliseconds. When you blow through these, you blow through them in the same predictable few places.

[Figure: latency contributions of a typical web request]
DNS lookup: ~10–50 ms
TLS handshake: ~30–100 ms
LB / ingress: < 1 ms
App handler: 5–50 ms
DB query: 1–20 ms
External call (LLM, payments): 100 ms – 30 s · the long tail
DNS & TLS happen once per connection (with HTTP/2 keep-alive). The expensive recurring cost is the app+DB+external trio. In an LLM-backed request, the external call is often 100× the rest combined.
A typical web request. Numbers are order-of-magnitude. The right-hand external call is the AI-era surprise.
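A latency budget becomes defensible once it is written down as arithmetic. The following sketch sums per-stage costs against the 200 ms server-side target mentioned earlier. The stage values are assumptions picked from inside the order-of-magnitude ranges in the figure, not measurements, and `summarize` is a hypothetical helper.

```python
from typing import Dict

BUDGET_MS = 200  # server-side target for a snappy UI, from the text above


def summarize(stages: Dict[str, float], budget_ms: float = BUDGET_MS) -> dict:
    """Total a request's stage latencies and name the dominant stage."""
    total = sum(stages.values())
    dominant = max(stages, key=stages.get)
    return {
        "total_ms": total,
        "over_budget": total > budget_ms,
        "dominant_stage": dominant,
        "dominant_share": stages[dominant] / total,
    }


# Assumed illustrative values on a warm connection (DNS and TLS already paid).
crud_request = {"lb": 1, "app": 20, "db": 10}
llm_request = {**crud_request, "llm": 3000}

crud_summary = summarize(crud_request)
llm_summary = summarize(llm_request)
```

With these assumed numbers the CRUD request sits comfortably inside the budget, while the LLM-backed request exceeds it many times over and almost all of its time belongs to a single stage.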

The classic distribution

For a healthy CRUD-style service, your P50 (median) might be 30 ms, P99 150 ms, P99.9 600 ms. The shape matters: the P50 tells you the app is fast in the common case; the gap between P50 and P99 tells you the variance — usually caused by GC pauses, slow queries, contention, or noisy neighbours. The gap between P99 and P99.9 is dominated by retries, leader elections, and unlikely-but-real failure modes. Your SLO (Day 5) usually targets P99.
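To make the percentile vocabulary concrete, here is a small sketch that computes P50, P99, and P99.9 with the nearest-rank method. The sample distribution is synthetic and chosen to reproduce the 30 / 150 / 600 ms shape described above; production systems normally derive percentiles from histograms in the metrics pipeline rather than from raw sample lists.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # round() guards against float noise such as 990.0000000000001
    rank = math.ceil(round(p * len(ordered) / 100, 9))
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in ms: mostly fast, a thin slow band, a rare very slow tail.
latencies_ms = [30] * 9800 + [150] * 185 + [600] * 15

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)
```

Note how only 2% of requests are slow at all, yet they fully determine P99 and P99.9; the median says nothing about them.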

Where one millisecond goes

Operation | Order of magnitude | Why
L1 cache reference | ~1 ns | On-CPU; speed of light wins
Main memory reference | ~100 ns | Off-chip RAM
SSD random read | ~150 µs | Page lookup + flash latency
Same-region network round trip | ~0.5 ms | Through 2-3 switches
Cross-region same continent | ~30 ms | Speed of light + fibre routing
Continent-to-continent | ~100–200 ms | Submarine cables, hops
Indexed Postgres query (warm) | 1–5 ms | B-tree lookup + a few page reads
Redis GET | 0.3–1 ms | Single in-memory hash lookup
LLM call (small model, no streaming) | 500 ms – 5 s | Tokenize, infer, decode every token
LLM call (large model, full response) | 5–30 s | Sequential token generation dominates

Internalize this table. Most performance bugs come from accidentally putting a 30 ms network round-trip inside an N+1 loop, or, more recently, putting a 3-second LLM call on the critical path of an HTTP handler.
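The N+1 trap is easiest to see by counting round trips. In the sketch below, `FakeDB` is a hypothetical stand-in that counts calls instead of doing real I/O, and `RTT_MS` assumes the 30 ms round trip from the sentence above; the method names are illustrative, not any ORM's API.

```python
RTT_MS = 30  # assumed cost of one network round trip to the database


class FakeDB:
    """Hypothetical stand-in that counts round trips instead of doing I/O."""

    def __init__(self, authors):
        self.authors = authors
        self.round_trips = 0

    def get_author(self, author_id):
        self.round_trips += 1           # one query, one round trip
        return self.authors[author_id]

    def get_authors(self, author_ids):
        self.round_trips += 1           # one batched query for all ids
        return {i: self.authors[i] for i in author_ids}


authors = {i: f"author-{i}" for i in range(100)}
post_author_ids = list(range(100))

# N+1 shape: one query per post inside the loop.
slow_db = FakeDB(authors)
names_slow = [slow_db.get_author(i) for i in post_author_ids]

# Batched shape: a single query, then in-memory lookups.
fast_db = FakeDB(authors)
by_id = fast_db.get_authors(post_author_ids)
names_fast = [by_id[i] for i in post_author_ids]

slow_ms = slow_db.round_trips * RTT_MS
fast_ms = fast_db.round_trips * RTT_MS
```

Both shapes return the same data; under the stated assumption the loop costs 100 round trips (about 3 seconds) and the batch costs one (about 30 ms).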

⚠️
Tail amplification
If a request fans out to 10 backends in parallel and each has a 1% chance of being slow, the probability that at least one is slow is ≈ 10%. Tail latency in distributed systems multiplies fast — this is why circuit breakers, hedged requests, and timeouts (Day 5) matter so much. Naive parallel fan-out turns rare slowness into common slowness.
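The arithmetic behind that figure is the complement rule: if each backend is independently slow with probability p, the chance that at least one of n is slow is 1 − (1 − p)ⁿ. A short sketch, assuming independence between backends (which real systems only approximate):

```python
def p_any_slow(fan_out: int, p_slow: float) -> float:
    """Probability that at least one of `fan_out` independent calls is slow."""
    return 1 - (1 - p_slow) ** fan_out


one = p_any_slow(1, 0.01)
ten = p_any_slow(10, 0.01)
hundred = p_any_slow(100, 0.01)

for n, p in ((1, one), (10, ten), (100, hundred)):
    print(f"fan-out {n:>3}: {p:.1%} of requests see at least one slow backend")
```

At a fan-out of 10 the result is about 9.6%, which the text rounds to 10%; at a fan-out of 100 it is about 63%, so the "rare" slow case becomes the typical one.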

Statelessness — The Decision That Built the Modern Web

The single most consequential design choice in modern backends is to make the application layer stateless. "Stateless" doesn't mean your service has no state — it means any instance can serve any request. State that has to persist between requests lives in the data layer (or in async storage); the app layer holds only what it needs for the current request.

Why this matters

  • Horizontal scaling is trivial. Add more containers, the load balancer fans traffic across them, no further coordination needed.
  • Failure is cheap. A pod dies, its replicas pick up the load. No "sticky sessions" to migrate.
  • Deploys are simple. Rolling deploys replace instances one at a time; ongoing requests are unaffected because the next request can land anywhere.
  • Local reasoning works. A bug investigation looks at one request, not a long-lived session graph.

Where state hides anyway

Even "stateless" services smuggle state in three places, and each is a real source of bugs:

  1. In-memory caches. An LRU cache inside the process is invisible state; it makes instances behave differently from each other and breaks rolling-deploy assumptions. Externalize to Redis or accept the inconsistency consciously.
  2. Local file system. Anything written to /tmp survives only as long as the container does. Treat it as scratch and never as truth.
  3. Connection pools. Database and HTTP-client connection pools are stateful (open sockets, in-flight handshakes). They survive between requests within an instance but not across instances. Configure pool sizes with autoscaling in mind.
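"With autoscaling in mind" usually comes down to one inequality: the per-instance pool size times the maximum instance count must stay under the database's connection limit. A rough sketch of that calculation follows; the function name and the `reserved` headroom are assumptions, and the limit of 100 used in the example is PostgreSQL's default `max_connections`.

```python
def max_pool_size_per_instance(db_max_connections: int,
                               reserved: int,
                               max_instances: int) -> int:
    """Largest per-instance pool that cannot exhaust the database even
    when the autoscaler is at its maximum instance count."""
    usable = db_max_connections - reserved  # keep headroom for admin, migrations
    if usable <= 0 or max_instances <= 0:
        raise ValueError("no usable connections or no instances")
    return usable // max_instances


# Example: limit of 100, 10 reserved, autoscaler capped at 12 instances.
pool_at_12 = max_pool_size_per_instance(100, reserved=10, max_instances=12)
# If the cap is later raised to 30 instances, the safe pool size shrinks.
pool_at_30 = max_pool_size_per_instance(100, reserved=10, max_instances=30)
```

Sizing for the current instance count instead of the autoscaler's cap is how a traffic spike turns into "too many connections" errors. A connection pooler in front of the database relaxes this constraint but does not remove the need to do the arithmetic.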
🌱
Sessions belong in the data layer
In-process session storage is the classic statelessness pitfall. Either issue signed JWTs (stateless: client carries the token, server validates the signature) or store the session in Redis with a TTL (centrally accessible from any instance). The middle path of "sessions in process memory plus sticky load balancing" is a pattern you'll inherit one day and want to dismantle the same week.
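The signed-token option can be illustrated with the standard library alone. The sketch below is a simplified HMAC-signed token, not a standards-compliant JWT: the token layout, the `SECRET` value, and the function names are assumptions for illustration, and a real service would use a maintained JWT library and a secret loaded from configuration.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret-change-me"  # assumption: shared by every app instance


def _sign(body: bytes) -> bytes:
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()


def issue_token(user_id: str, ttl_seconds: int = 3600, now: float = None) -> str:
    """Encode the claims and append a signature over them."""
    now = time.time() if now is None else now
    claims = {"sub": user_id, "exp": now + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    return (body + b"." + _sign(body)).decode()


def verify_token(token: str, now: float = None):
    """Return the claims if the signature is valid and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        body, signature = token.encode().rsplit(b".", 1)
    except ValueError:
        return None  # no separator: not one of our tokens
    if not hmac.compare_digest(signature, _sign(body)):
        return None  # tampered with, or signed with a different secret
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] < now:
        return None
    return claims


token = issue_token("user-42", now=1_000)
claims = verify_token(token, now=1_001)
```

Because verification needs only the shared secret, any instance can validate the token without a lookup, which is exactly the "any instance can serve any request" property. The trade-off is that a token cannot be revoked before it expires without reintroducing shared state.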

Concurrency Models — How Each Process Spends Its Time

Inside a single application instance, three concurrency models cover the territory. Choose deliberately; the wrong choice for your workload becomes a latency wall you can't easily remove.

Model | Examples | Strength | Weakness
Thread-per-request | Java/Spring, Ruby Puma, .NET | Simple sync code; familiar debugging | Threads are expensive (~1 MB stack); scales to ~10k connections per box
Event loop / async | Node.js, Python asyncio, Go (goroutines multiplexed by the runtime), Rust async | 10⁵+ concurrent connections per box; tiny memory per request | Coloured functions (in async/await languages; Go avoids them); blocking sync work easily stalls the loop
Process-per-request | CGI, classic PHP | Strong isolation; OOM in one request is contained | Fork cost on each request; gone from modern stacks

For an AI-heavy backend, the event-loop model wins by default. Most of the time the handler is waiting: for the database, for the LLM API, for the embedding service. Letting one OS thread serve thousands of concurrently waiting requests makes far better use of hardware than blocking a thread per request. Go and Rust handle this with goroutines/futures, Python with asyncio, Node.js by construction. Threads come back into play when you have CPU-bound work (image resizing, JSON parsing of huge payloads), where you want a worker pool fed by the event loop so the loop itself never blocks.
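A small asyncio sketch shows both halves of that paragraph: many waiting requests sharing one thread, and CPU-bound work handed off so the loop stays free. `fake_llm_call` and `cpu_bound` are hypothetical stand-ins, and the 50 ms delay is an arbitrary assumption that keeps the example fast.

```python
import asyncio
import time


async def fake_llm_call(i: int, delay_s: float = 0.05) -> int:
    # Stand-in for a network wait; the event loop serves other tasks meanwhile.
    await asyncio.sleep(delay_s)
    return i


def cpu_bound(n: int) -> int:
    # Stand-in for image resizing or parsing a huge payload.
    return sum(k * k for k in range(n))


async def main():
    start = time.perf_counter()
    # 1,000 concurrent "requests", all waiting at once on a single OS thread.
    results = await asyncio.gather(*(fake_llm_call(i) for i in range(1000)))
    waited_s = time.perf_counter() - start
    # Hand CPU work to a worker thread so the loop keeps serving I/O.
    total = await asyncio.to_thread(cpu_bound, 10_000)
    return results, waited_s, total


results, waited_s, total = asyncio.run(main())
```

Run sequentially, the thousand waits would take about 50 seconds; run concurrently they finish in roughly the time of one. In CPython, a worker thread keeps the loop responsive but pure-Python CPU work still contends for the GIL, so a process pool is the usual choice for heavy computation.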

The AI Exception — When the Tail Becomes the Workload

Everything above describes the world before LLMs became routine dependencies. The shift is quantitative, not qualitative — it's still just "backend" — but the numbers break old assumptions hard enough to require a redesign.

[Figure: latency histograms, classic CRUD service vs. AI-backed service]
Classic CRUD service: P50 ≈ 30 ms · P99 ≈ 150 ms · long tail rare
AI-backed service: P50 ≈ 1.8 s · P99 ≈ 12 s · the tail IS the workload
When the median request takes seconds, you must redesign the request lifecycle around it.
Latency histograms. Classic services have most weight at the left; AI-backed services have most weight on the right.

What changes when an LLM is in the call path

  • Streaming becomes the answer for UX. A 12-second response feels broken; a 12-second stream that emits its first tokens after 300 ms feels alive. Server-Sent Events and chunked transfer are now table stakes (Day 1 PM).
  • Async by default for non-interactive flows. Reports, batch summarization, document indexing — all of these escape the request lifecycle into queues with webhook completion (Day 3).
  • Idempotency is non-negotiable. LLM calls fail randomly: rate-limit, transient 5xx, timeouts. The retry has to be safe, which means each call carries an idempotency key and the server deduplicates (Day 5).
  • Cost is now a request-level metric. Token counts, model choice, and cache hits become observability primitives alongside latency (Day 6).
  • Caching changes shape. Exact-match URL caches don't hit when prompts vary; semantic caches that match by embedding similarity do (Day 2 PM).
  • Vector stores join the data layer. Retrieval-augmented generation requires similarity search at request time. Vector indexes (HNSW, IVF) become as routine as B-trees (Day 2 AM, Day 7).
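The streaming point in the list above rests on a simple wire format: a Server-Sent Events stream is a sequence of text frames, each made of `field: value` lines and terminated by a blank line. The sketch below frames tokens as SSE events. The JSON payload shape and the final `done` event name are assumptions for illustration, not part of any provider's API.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token as soon as the token is available."""
    for token in tokens:
        # JSON-encode the payload so a newline inside a token cannot
        # break the frame; the blank line ("\n\n") ends each event.
        yield f"data: {json.dumps({'token': token})}\n\n"
    # Assumed convention: a named event tells the client the stream is over.
    yield "event: done\ndata: {}\n\n"


frames = list(sse_events(["Hel", "lo\n", " world"]))
```

Because the function is a generator, a web framework can flush each frame as it is produced, so the user sees the first token long before the last one exists. The response would be served with the `text/event-stream` content type.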
🚨
The most expensive line in your code
A single naive synchronous LLM call inside an HTTP handler can wreck your service under load. Three things go wrong simultaneously: (1) worker and connection pools fill up because handlers are all waiting on the LLM provider; (2) load balancer health checks time out and traffic gets shifted; (3) request retries amplify the load on the upstream provider, turning a brownout into an outage. Treat LLM calls like external network I/O on a slow link from day one: with timeouts, retries with backoff, an async path, and observability on the duration.
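The defensive posture described in this callout (timeout, bounded retries with backoff, a stable idempotency key) fits in one small wrapper. This is a sketch under stated assumptions: `TransientError`, `call_with_retries`, and the `flaky_provider` stand-in are hypothetical names, and a real client would map the provider's rate-limit and 5xx responses onto the retryable error.

```python
import asyncio
import random
import uuid


class TransientError(Exception):
    """Assumed marker for retryable failures (rate limit, transient 5xx)."""


async def call_with_retries(call, *, attempts=3, timeout_s=10.0, base_delay_s=0.5):
    # One key for the whole logical call, reused on every attempt, so the
    # server can recognise a retry and deduplicate it.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(call(idempotency_key), timeout=timeout_s)
        except (asyncio.TimeoutError, TransientError):
            if attempt == attempts - 1:
                raise  # retry budget spent: surface the failure
            # Exponential backoff with full jitter avoids synchronized retries.
            delay = base_delay_s * (2 ** attempt)
            await asyncio.sleep(random.uniform(0, delay))


seen_keys = []


async def flaky_provider(key: str) -> str:
    # Hypothetical upstream that fails twice, then succeeds.
    seen_keys.append(key)
    if len(seen_keys) < 3:
        raise TransientError("rate limited")
    return "summary text"


result = asyncio.run(call_with_retries(flaky_provider, base_delay_s=0.001))
```

The attempt cap matters as much as the backoff: unbounded retries are what turn an upstream brownout into an outage.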

The Shape of a Modern Service

Bringing the layers together, here's the shape of a single backend service in 2026, drawn so we can refer to it for the rest of the week.

  • An edge running on a CDN with a smart WAF, terminating TLS, doing path-based routing.
  • A handful of app instances in containers, behind an L7 load balancer, autoscaling on CPU and request rate, stateless save for connection pools and short-lived caches.
  • A primary relational store (Postgres or MySQL) with at least one read replica and a connection pooler (PgBouncer/RDS Proxy), holding the system of record.
  • A cache layer (Redis) for hot reads, rate limiting counters, and short-lived session/work data.
  • A message queue or broker (SQS / RabbitMQ / Kafka / NATS) decoupling the request from any work that doesn't have to be synchronous.
  • A vector store (pgvector, Qdrant, Pinecone) when retrieval-augmented features are present.
  • A handful of external dependencies — payments, email, LLMs, observability — each treated as a slow, possibly-down third party.
  • A telemetry pipeline (OpenTelemetry → metrics, logs, traces) hooked into the SLO machinery.
Quick check
An engineer adds a feature where the request handler calls an LLM that takes 6 seconds, then writes the result to Postgres, then returns 200 OK to the user. Connection-pool errors and degraded latency start showing up across unrelated endpoints. Why does the LLM call cause symptoms in endpoints that don't use it?
Answer
Connection-pool exhaustion. The application has a fixed number of workers and database (or HTTP-client) connections shared across all handlers. The 6-second LLM call holds onto an app worker (and possibly a DB transaction) for the whole duration. As the LLM endpoint takes more concurrent traffic, those workers and connections are unavailable to every other endpoint, which is why unrelated endpoints degrade. The fix: move the LLM call out of the request. Return a job ID, push the work to a queue, and deliver the result via webhook or polling. Reserve the request lifecycle for fast operations only. We'll explore this pattern explicitly on Day 3.

The Frame for the Rest of the Week

Each remaining chapter zooms into one of these layers or one of the cross-cutting concerns:

Day | Topic | Layer / concern
1 PM | API design (REST / gRPC / streaming) | Edge ↔ App contract
2 AM | Storage systems | Data layer
2 PM | Caching & read paths | Edge + Data
3 | Concurrency & queues | Async layer
4 | Distributed systems | Cross-cutting
5 | Reliability | Cross-cutting
6 | Observability | Cross-cutting
7 | The AI-era backend | Synthesis: all layers
Mnemonic — the four layers
"Edge, App, Data, Async — outside in."
  • Edge — anything before your code runs.
  • App — your code, stateless if you can help it.
  • Data — the truth, cached if you can afford it.
  • Async — work the request shouldn't wait for.
Flashcard
A teammate proposes caching the entire response of an LLM-backed endpoint with a 5-minute TTL keyed on the request URL. Users complain that personalized answers come back wrong. Diagnose and propose two fixes.
Answer
Diagnosis: the URL doesn't capture the dimensions that change the response — most importantly the user identity and any per-user context (history, preferences, stored documents). The cache returns one user's answer to another. Fix 1: include user-scope in the cache key (user ID, session ID), and any other relevant input. Fix 2: use a semantic cache over prompts rather than a raw URL cache: hash the rendered prompt (after personalization) or look up the embedding similarity over recent prompts. The general lesson: cache keys must include every dimension that changes the answer, and LLM workloads usually have more such dimensions than classical APIs.
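Fix 1 amounts to building the cache key from every input that changes the answer. A minimal sketch, assuming the relevant dimensions are the user, the model, the rendered prompt, and the sampling parameters; the `cache_key` function and its `llm:` prefix are hypothetical.

```python
import hashlib
import json


def cache_key(user_id: str, model: str, rendered_prompt: str, params: dict) -> str:
    """Derive a cache key from every dimension that changes the response."""
    material = json.dumps(
        {"user": user_id, "model": model, "prompt": rendered_prompt, "params": params},
        sort_keys=True,  # stable key regardless of dict insertion order
    )
    return "llm:" + hashlib.sha256(material.encode()).hexdigest()


prompt = "Summarize my open orders."
alice = cache_key("alice", "small-model", prompt, {"temperature": 0.2, "max_tokens": 256})
bob = cache_key("bob", "small-model", prompt, {"temperature": 0.2, "max_tokens": 256})
alice_again = cache_key("alice", "small-model", prompt, {"max_tokens": 256, "temperature": 0.2})
```

Two users sending the same prompt get different keys, so one user's personalized answer can never be served to another, while the same user repeating the same request still hits the cache.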
🔑
Key takeaways
1) Every backend has the same four layers: edge, app, data, async — drawing them is the first thing in every architecture review. 2) Latency is dominated by the same handful of operations; internalize the order-of-magnitude table so you can spot a pathological code path in seconds. 3) Statelessness at the app layer is what makes horizontal scaling cheap and deploys boring; the few places state hides are exactly where production bugs live. 4) The AI exception turns the latency tail into the median; streaming, async, idempotency and cost-aware observability are no longer optional. 5) Treat external dependencies as slow third-parties — including, especially, the LLMs.
