Backend Foundations & the Modern Service Anatomy
Before pattern names and frameworks, the backend is a request walking through a small number of moving parts. Learn the four-layer service anatomy, where every microsecond of latency goes, why statelessness scaled the web, and how an LLM call distorts every assumption you previously held about p99.
What you will learn
A backend is the place where a request arrives, becomes a few database queries, maybe a remote call, and then leaves as a response. Stripped of frameworks and acronyms, that's it. Everything you'll learn this week is tooling around those four steps. This chapter installs the mental scaffolding: the layers that exist in every production service, where time and money actually go, and the way modern AI dependencies twist the old physics.
The Four Layers Every Backend Has
The shape of a production backend is the same whether you're running a Django monolith, a 200-service mesh, or a single Lambda. Four layers, ordered by what's closest to the user. Drawing them out is the first thing you do at every architecture review.
Edge — the polite bouncer
The edge is everything from the client's TCP packet to your application code. DNS resolves the hostname (cached, subject to TTLs), TLS performs the cryptographic handshake (1 RTT for a full TLS 1.3 handshake, 2 RTTs for TLS 1.2, in both cases on top of the TCP handshake when the connection is new), a CDN may serve a cached response without ever talking to your origin, a WAF filters obviously hostile traffic, and an L7 load balancer (NGINX, Envoy, ALB) parses the HTTP request and routes by path, host, or weight. The edge layer is where you cheaply absorb load. A request answered from the CDN cache is essentially free; a request that falls through to an unprotected origin is the expensive case you want to avoid.
Application — the only place "your code" runs
The app layer is where you spend your career. A request handler authenticates the caller, validates input, runs business logic, calls the data layer, and renders a response. The handler is sometimes 50 lines and sometimes 5,000; mature codebases ruthlessly extract patterns into middleware (auth, logging, rate-limit, error-handling) so the handler reads like the business operation it represents. The app should be stateless — anything it remembers between requests goes in the data or async layer.
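The middleware idea can be sketched with plain Python decorators. This is an illustration only: the dict-shaped request and response and the names `with_auth`, `with_timing`, and `create_order` are assumptions made for the example, not any framework's API.

```python
import time
from typing import Callable, Dict

# Hypothetical request/response shapes: plain dicts, not a framework's types.
Request = Dict[str, object]
Response = Dict[str, object]
Handler = Callable[[Request], Response]


def with_auth(handler: Handler) -> Handler:
    """Reject requests without a caller identity before business logic runs."""
    def wrapped(request: Request) -> Response:
        if not request.get("user"):
            return {"status": 401, "body": "unauthenticated"}
        return handler(request)
    return wrapped


def with_timing(handler: Handler) -> Handler:
    """Attach the wrapped handler's wall-clock duration to the response."""
    def wrapped(request: Request) -> Response:
        start = time.perf_counter()
        response = handler(request)
        response["elapsed_ms"] = (time.perf_counter() - start) * 1000
        return response
    return wrapped


@with_timing
@with_auth
def create_order(request: Request) -> Response:
    # Only the business operation remains in the handler body.
    return {"status": 201, "body": f"order created for {request['user']}"}
```

With the cross-cutting concerns pulled out, the handler body reads as the business operation alone, and the same wrappers can be reused on every other handler.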
Data — durable state
Whatever has to survive a process restart lives here: relational databases, key-value stores, caches, vector stores, object storage. Day 2 is dedicated entirely to this layer. The single most important property is that multiple application instances must be able to read and write the same data; once that's true, the app layer is interchangeable and you can scale it horizontally without ceremony.
Async — work that escapes the request
Anything you can't or shouldn't do during the request belongs here. Sending an email, generating a thumbnail, retraining an embedding, calling an LLM that takes 12 seconds — all of these escape the request lifecycle into a queue, where worker processes consume jobs at their own pace. We treat this as a first-class layer because most failure modes in modern backends are async-related: jobs that retry forever, jobs that never run, jobs that run twice. Day 3 is its own day.
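A minimal in-process sketch of that hand-off, using the standard library's `queue.Queue` as a stand-in for a real broker. The job shape and the names `handle_request`, `worker`, and `run_demo` are hypothetical, and a production queue would be an external, durable service rather than process memory.

```python
import queue
import threading

jobs = queue.Queue()          # stand-in for an external broker
results = []                  # what the worker has completed so far
results_lock = threading.Lock()


def handle_request(email: str) -> dict:
    """Request path: enqueue the slow work and return immediately."""
    jobs.put({"type": "send_email", "to": email})
    return {"status": 202, "body": "accepted"}


def worker() -> None:
    """Worker path: consume jobs at its own pace until a None sentinel arrives."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            with results_lock:
                results.append(f"sent email to {job['to']}")
        finally:
            jobs.task_done()


def run_demo() -> list:
    t = threading.Thread(target=worker)
    t.start()
    handle_request("a@example.com")
    handle_request("b@example.com")
    jobs.put(None)            # tell the worker to stop
    jobs.join()               # wait until every queued item is processed
    t.join()
    return list(results)
```

The handler answers with 202 before any email is sent; the failure modes named above (retrying forever, never running, running twice) all live on the worker side of the queue.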
The Request Lifecycle — Where Time Actually Goes
Latency budgets are the difference between an engineer who's vaguely worried about performance and one who can defend a number. A user-perceptible web request should generally finish in under 200 ms server-side for snappy UI; an internal API call between services should be in the single-digit milliseconds. When you blow through these, you blow through them in the same predictable few places.
The classic distribution
For a healthy CRUD-style service, your P50 (median) might be 30 ms, P99 150 ms, P99.9 600 ms. The shape matters: the P50 tells you the app is fast in the common case; the gap between P50 and P99 tells you the variance — usually caused by GC pauses, slow queries, contention, or noisy neighbours. The gap between P99 and P99.9 is dominated by retries, leader elections, and unlikely-but-real failure modes. Your SLO (Day 5) usually targets P99.
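To make the percentile vocabulary concrete, here is a small sketch that computes P50, P99, and P99.9 from raw latency samples using the nearest-rank definition. Monitoring systems often use interpolation or histogram approximations instead, so treat this as one reasonable definition rather than what any particular tool does.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(samples_ms):
    """Return the three percentiles discussed in the text."""
    return {
        "p50": percentile(samples_ms, 50),
        "p99": percentile(samples_ms, 99),
        "p99.9": percentile(samples_ms, 99.9),
    }
```

Note that with 1,000 samples, P99.9 is decided by the two slowest requests, which is why the far tail is so sensitive to rare events.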
Where one millisecond goes
| Operation | Order of magnitude | Why |
|---|---|---|
| L1 cache reference | ~1 ns | On-CPU; speed of light wins |
| Main memory reference | ~100 ns | Off-chip RAM |
| SSD random read | ~150 µs | Page lookup + flash latency |
| Same-region network round trip | ~0.5 ms | Through 2-3 switches |
| Cross-region same continent | ~30 ms | Speed of light + fibre routing |
| Continent-to-continent | ~100–200 ms | Submarine cables, hops |
| Indexed Postgres query (warm) | 1–5 ms | B-tree lookup + a few page reads |
| Redis GET | 0.3–1 ms | Single in-memory hash lookup |
| LLM call (small model, no streaming) | 500 ms – 5 s | Tokenize, infer, decode every token |
| LLM call (large model, full response) | 5–30 s | Sequential token generation dominates |
Internalize this table. Most performance bugs come from accidentally putting a 30 ms network round-trip inside an N+1 loop, or, more recently, putting a 3-second LLM call on the critical path of an HTTP handler.
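The N+1 case is worth working through as arithmetic. The sketch below assumes the queries run sequentially and that each costs one network round trip; the function names and the two-query batched alternative are illustrative assumptions.

```python
def n_plus_one_ms(rows: int, round_trip_ms: float) -> float:
    """One query for the list, then one more query per row, all sequential."""
    return (1 + rows) * round_trip_ms


def batched_ms(round_trip_ms: float, queries: int = 2) -> float:
    """One query for the list plus one batched query for all related rows."""
    return queries * round_trip_ms


# 50 rows at a 30 ms cross-region round trip:
slow = n_plus_one_ms(50, 30)   # 51 round trips = 1530 ms
fast = batched_ms(30)          # 2 round trips  = 60 ms
```

At 30 ms per round trip the loop alone is more than seven times a 200 ms budget, while the batched form fits comfortably inside it.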
Statelessness — The Decision That Built the Modern Web
The single most consequential design choice in modern backends is to make the application layer stateless. "Stateless" doesn't mean your service has no state — it means any instance can serve any request. State that has to persist between requests lives in the data layer (or in async storage); the app layer holds only what it needs for the current request.
Why this matters
- Horizontal scaling is trivial. Add more containers, the load balancer fans traffic across them, no further coordination needed.
- Failure is cheap. A pod dies, its replicas pick up the load. No "sticky sessions" to migrate.
- Deploys are simple. Rolling deploys replace instances one at a time; ongoing requests are unaffected because the next request can land anywhere.
- Local reasoning works. A bug investigation looks at one request, not a long-lived session graph.
Where state hides anyway
Even "stateless" services smuggle state in three places, and each is a real source of bugs:
- In-memory caches. An LRU cache inside the process is invisible state; it makes instances behave differently from each other and breaks rolling-deploy assumptions. Externalize to Redis or accept the inconsistency consciously.
- Local file system. Anything written to /tmp survives only as long as the container does. Treat it as scratch and never as truth.
- Connection pools. Database and HTTP-client connection pools are stateful (open sockets, in-flight handshakes). They survive between requests within an instance but not across instances. Configure pool sizes with autoscaling in mind.
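The difference between hidden per-process state and externalized state can be shown in a few lines. In this sketch `SharedStore` is a hypothetical in-memory stand-in for the data layer (in practice something like Redis or Postgres), and the two instance classes are illustrative.

```python
class SharedStore:
    """Hypothetical stand-in for an external store shared by all instances."""

    def __init__(self):
        self._counts = {}

    def incr(self, key: str) -> int:
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]


class StatefulInstance:
    """Anti-pattern: the counter lives in process memory, so instances disagree."""

    def __init__(self):
        self._visits = {}

    def handle(self, user: str) -> int:
        self._visits[user] = self._visits.get(user, 0) + 1
        return self._visits[user]


class StatelessInstance:
    """Holds nothing between requests; all durable state goes to the store."""

    def __init__(self, store: SharedStore):
        self._store = store

    def handle(self, user: str) -> int:
        return self._store.incr(f"visits:{user}")
```

If a load balancer sends a user's first request to one instance and the second to another, the stateless pair reports visits 1 and 2, while the stateful pair reports 1 and 1.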
Concurrency Models — How Each Process Spends Its Time
Inside a single application instance, three concurrency models cover the territory. Choose deliberately; the wrong choice for your workload becomes a latency wall you can't easily remove.
| Model | Examples | Strength | Weakness |
|---|---|---|---|
| Thread-per-request | Java/Spring, Ruby Puma, .NET | Simple sync code; familiar debugging | Threads are expensive (~1 MB stack); scales to ~10k connections per box |
| Event loop / async | Node.js, Python asyncio, Go (goroutines), Rust async | 10⁵+ concurrent connections per box; tiny memory per request | Coloured functions (in async/await languages); easy to stall the loop with blocking sync work |
| Process-per-request | CGI, classic PHP | Strong isolation; OOM in one request is contained | Fork cost on each request; gone from modern stacks |
For an AI-heavy backend, the event-loop model wins by default. Most of the time the handler is waiting: for the database, for the LLM API, for the embedding service. Letting one OS thread serve thousands of concurrently waiting requests makes far better use of hardware than blocking a thread per request. Go and Rust handle this with goroutines/futures, Python with asyncio, Node.js by construction. Threads come back into play when you have CPU-bound work (image resizing, JSON parsing of huge payloads), where you want the event loop to hand that work to a thread pool so the loop itself never blocks.
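A small asyncio sketch of both halves of that argument. `fake_llm_call` simulates waiting on a remote dependency with `asyncio.sleep`, and `cpu_bound` is a placeholder for real CPU work; both names are assumptions for illustration. `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time


async def fake_llm_call(i: int) -> int:
    """Stand-in for an I/O wait (database, LLM API, embedding service)."""
    await asyncio.sleep(0.05)
    return i


def cpu_bound(n: int) -> int:
    """Placeholder for CPU-heavy work that would otherwise stall the loop."""
    return sum(k * k for k in range(n))


async def main() -> tuple:
    start = time.perf_counter()
    # 1,000 concurrent waits share one OS thread; run one after another
    # they would take about 50 seconds.
    results = await asyncio.gather(*(fake_llm_call(i) for i in range(1000)))
    waited_s = time.perf_counter() - start

    # CPU-bound work is handed to a worker thread so the loop stays responsive.
    total = await asyncio.to_thread(cpu_bound, 10_000)
    return len(results), waited_s, total


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In CPython a worker thread keeps the loop responsive but does not by itself add CPU parallelism; for heavy computation a process pool is a common alternative.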
The AI Exception — When the Tail Becomes the Workload
Everything above describes the world before LLMs became routine dependencies. The shift is quantitative, not qualitative — it's still just "backend" — but the numbers break old assumptions hard enough to require a redesign.
What changes when an LLM is in the call path
- Streaming becomes the answer for UX. A 12-second response feels broken; a 12-second stream that emits tokens after 300 ms feels alive. Server-Sent Events or chunked transfer is now table stakes (Day 1 PM).
- Async by default for non-interactive flows. Reports, batch summarization, document indexing — all of these escape the request lifecycle into queues with webhook completion (Day 3).
- Idempotency is non-negotiable. LLM calls fail randomly: rate-limit, transient 5xx, timeouts. The retry has to be safe, which means each call carries an idempotency key and the server deduplicates (Day 5).
- Cost is now a request-level metric. Token counts, model choice, and cache hits become observability primitives alongside latency (Day 6).
- Caching changes shape. Exact-match URL caches don't hit when prompts vary; semantic caches that match by embedding similarity do (Day 2 PM).
- Vector stores join the data layer. Retrieval-augmented generation requires similarity search at request time. Vector indexes (HNSW, IVF) become as routine as B-trees (Day 2 AM, Day 7).
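The idempotency point from the list above can be sketched as follows. `FlakyLLMServer` is a hypothetical in-memory server that does the work and then loses the response a configurable number of times; `call_with_retry` is an illustrative client. A real client would also add backoff and a per-attempt timeout, omitted here for brevity.

```python
import uuid


class FlakyLLMServer:
    """Hypothetical server that deduplicates completed work by idempotency key."""

    def __init__(self, responses_to_drop: int):
        self._responses_to_drop = responses_to_drop
        self._completed = {}
        self.generations = 0   # counts the expensive, billable step

    def complete(self, idempotency_key: str, prompt: str) -> str:
        if idempotency_key not in self._completed:
            self.generations += 1
            self._completed[idempotency_key] = f"summary of: {prompt}"
        if self._responses_to_drop > 0:
            # The work was done, but the client never sees the answer.
            self._responses_to_drop -= 1
            raise TimeoutError("response lost in transit")
        return self._completed[idempotency_key]


def call_with_retry(server: FlakyLLMServer, prompt: str, max_attempts: int = 3) -> str:
    """Retry on timeout, reusing ONE key for the whole logical call."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return server.complete(key, prompt)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
    raise AssertionError("unreachable")
```

Because the key is generated once per logical call rather than once per attempt, two lost responses followed by a success still result in a single generation on the server.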
The Shape of a Modern Service
Bringing the layers together, here's the shape of a single backend service in 2026, drawn so we can refer to it for the rest of the week.
- An edge running on a CDN with a smart WAF, terminating TLS, doing path-based routing.
- A handful of app instances in containers, behind an L7 load balancer, autoscaling on CPU and request rate, stateless save for connection pools and short-lived caches.
- A primary relational store (Postgres or MySQL) with at least one read replica and a connection pooler (PgBouncer/RDS Proxy), holding the system of record.
- A cache layer (Redis) for hot reads, rate limiting counters, and short-lived session/work data.
- A message queue or broker (SQS / RabbitMQ / Kafka / NATS) decoupling the request from any work that doesn't have to be synchronous.
- A vector store (pgvector, Qdrant, Pinecone) when retrieval-augmented features are present.
- A handful of external dependencies — payments, email, LLMs, observability — each treated as a slow, possibly-down third party.
- A telemetry pipeline (OpenTelemetry → metrics, logs, traces) hooked into the SLO machinery.
The Frame for the Rest of the Week
Each remaining chapter zooms into one of these layers or one of the cross-cutting concerns:
| Day | Topic | Layer / concern |
|---|---|---|
| 1 PM | API design (REST / gRPC / streaming) | Edge ↔ App contract |
| 2 AM | Storage systems | Data layer |
| 2 PM | Caching & read paths | Edge + Data |
| 3 | Concurrency & queues | Async layer |
| 4 | Distributed systems | Cross-cutting |
| 5 | Reliability | Cross-cutting |
| 6 | Observability | Cross-cutting |
| 7 | The AI-era backend | Synthesis: all layers |
- Edge — anything before your code runs.
- App — your code, stateless if you can help it.
- Data — the truth, cached if you can afford it.
- Async — work the request shouldn't wait for.
- Kleppmann, Designing Data-Intensive Applications (designingdataintensiveapplications.com)
- Google, Site Reliability Engineering, free book (sre.google)
- Jeff Dean, Latency numbers every programmer should know (github.com)
- Amazon, Builders' Library (aws.amazon.com)
- System Design Primer (github.com)
- Dean & Barroso, The Tail at Scale (research.google)