The Engineering Codex/Backend Engineering for the AI Era
DAY 1 · PM

API Design — REST, gRPC, and Streaming

14 min · Intermediate · 3,002 words
An API is a contract you cannot retract. Master HTTP semantics, the REST conventions worth keeping, when to reach for gRPC or GraphQL instead, the streaming primitives (SSE, chunked, WebSockets) that LLMs forced back into the spotlight, plus versioning, errors, idempotency, and pagination that hold up at scale.

What you will learn

01 HTTP Semantics — The Backbone You Should Treat Seriously
02 The Four Shapes — REST, gRPC, GraphQL, tRPC-style
03 RESTful Design — Conventions Worth Following
04 Idempotency — The Single Most Underused Pattern
05 Streaming — The LLM-Era Renaissance
06 Versioning — Strategies That Survive Years

An API is a public, durable, machine-readable promise. Every other backend concern can be refactored quietly; an API is renegotiated with every client. This chapter is about making that contract precise enough that it survives both your next big change and the streaming-token explosion that LLMs brought into the request lifecycle.

🔑
Today's contract
1) HTTP semantics — verbs, status codes, idempotency, content negotiation. 2) The four shapes: REST, gRPC, GraphQL, and tRPC-style — when each wins. 3) Streaming — SSE, chunked transfer, WebSockets, and which one the LLM call wants. 4) Versioning, errors (RFC 7807), idempotency keys, and the pagination patterns that survive a 100M-row table.

HTTP Semantics — The Backbone You Should Treat Seriously

Most arguments about "REST vs RPC" are actually arguments about HTTP. The protocol's verbs and status codes carry meaning that proxies, browsers, monitoring tools, and CDNs already understand. Use them — don't fight them.

Verbs and what they actually mean

| Verb | Safe? | Idempotent? | Cacheable? | Use for |
| --- | --- | --- | --- | --- |
| GET | yes | yes | yes | Reads with no side effects. Browsers, CDNs, and proxies will retry these freely. |
| HEAD | yes | yes | yes | Headers without body; content existence checks. |
| OPTIONS | yes | yes | no | CORS preflight, capability discovery. |
| PUT | no | yes | no | Replace a resource at a known URL. Same body twice = same state. |
| DELETE | no | yes | no | Remove a resource. Second DELETE typically returns 404 or 204 idempotently. |
| PATCH | no | application-defined | no | Partial update. Use JSON Patch (RFC 6902) or JSON Merge Patch (RFC 7396). |
| POST | no | no | conditional | Create resources or invoke actions. The catch-all when others don't fit. |

Idempotent means "safe to repeat": the second request leaves the system in the same state as the first. (Not to be confused with HTTP's narrower term "safe", which means read-only.) This isn't pedantry: load balancers retry idempotent requests on connection failures, client libraries retry them automatically, and CDNs aggressively cache safe reads such as GET. Marking POST as idempotent (with an idempotency key) unlocks the same retry machinery; Day 5 covers this in depth.
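The retry rule described above can be sketched as a small predicate. This is a minimal illustration, not any particular client library's behaviour: the function name is hypothetical, and treating the presence of an Idempotency-Key header as "retryable" assumes the server actually honours that header.

```python
# Methods that HTTP defines as idempotent. POST and PATCH are deliberately absent.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})


def is_safe_to_retry(method: str, headers: dict) -> bool:
    """Return True when repeating the request should not cause a second side effect."""
    if method.upper() in IDEMPOTENT_METHODS:
        return True
    # Header names are case-insensitive, so normalise before looking the key up.
    normalised = {name.lower(): value for name, value in headers.items()}
    # Assumption: the server deduplicates on this header; otherwise this is unsafe.
    return bool(normalised.get("idempotency-key"))
```

A client wrapper could consult this predicate before replaying a request after a connection reset, and surface the error to the caller when it returns False.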

Status codes — the small set that matters

  • 200 OK / 201 Created / 204 No Content — successes. Use 201 with a Location header for resource creation.
  • 301 / 308 permanent redirect, 302 / 307 temporary. 308/307 preserve the request method; 301/302 may downgrade POST to GET in some clients.
  • 400 Bad Request (malformed or invalid input), 401 Unauthorized (not authenticated), 403 Forbidden (authenticated but not permitted), 404 Not Found.
  • 409 Conflict for state collisions ("resource already exists," "concurrent modification"). 410 Gone when a resource is intentionally retired.
  • 422 Unprocessable Entity (named "Unprocessable Content" in RFC 9110) for syntactically valid but semantically rejected inputs (well-formed JSON, but the values don't make business sense).
  • 429 Too Many Requests with a Retry-After header. The right answer when rate-limiting.
  • 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable (often with Retry-After), 504 Gateway Timeout. Distinguish them: load balancers and clients react differently.
⚠️
200-with-error-in-body is a smell
Returning 200 OK { error: "..." } for an actual failure breaks every monitoring and retry tool you'll ever wire up. Use the right status code and an error body. The status code is for proxies and infrastructure; the body is for the human reading the response.
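To make the point about infrastructure reacting to status codes concrete, here is a hedged sketch of how a client might decide whether and when to retry. The set of retryable statuses and the one-second default are assumptions chosen for illustration; Retry-After is parsed in both of its legal forms (delay in seconds or an HTTP date).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumption: these are the statuses this client treats as transient.
RETRYABLE_STATUSES = frozenset({429, 502, 503, 504})


def retry_delay_seconds(status, headers, now=None, default=1.0):
    """Return how long to wait before retrying, or None if the status is not retryable."""
    if status not in RETRYABLE_STATUSES:
        return None
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Form 1: a non-negative number of seconds.
        return float(value)
    try:
        # Form 2: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

None of this works if failures arrive as 200 OK with an error in the body, because the function never sees a status worth retrying.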

The Four Shapes — REST, gRPC, GraphQL, tRPC-style

You'll choose one (or two) for each surface area of your system. The choice is rarely irreversible, but changing it is far more expensive after launch than before.

  • REST: JSON over HTTP. Resource-shaped, cacheable, universal client. Use for public APIs, CRUD-shaped resources, webhooks, third-party clients.
  • gRPC: protobuf over HTTP/2. Strongly typed, fast and compact, native streaming. Use for internal RPC, service mesh, latency-critical and multi-language systems.
  • GraphQL: query language over HTTP. Client picks fields, over-fetch killer, complex caching. Use for mobile clients, aggregating many backends, rich UIs.
  • tRPC-style: typed RPC, same language on both ends. End-to-end types, no spec to write, single repo. Use for monorepo TypeScript, internal-only APIs, small teams.
Most production estates pick two: REST for the public surface, gRPC for service-to-service. GraphQL sits in front of multiple sources. tRPC is a TypeScript convenience.

REST — the default lingua franca

REST is what every browser, CLI tool, and unfamiliar engineer expects to find when they hit your API. Resources have URLs (/users/123, /orders/abc-456/items); HTTP verbs do the action. The constraints (statelessness, uniform interface, cacheability) are what make CDNs, retries, and content negotiation work for free. Pick REST for any externally-exposed surface unless you have an explicit reason not to.

gRPC — internal speed and types

gRPC uses Protocol Buffers (a binary, schema-defined serialization) over HTTP/2. The wire format is compact (often several times smaller than equivalent JSON, depending on the payload), the schema is strict (codegen for Go, Java, Python, Rust, etc.), and HTTP/2 multiplexing means many parallel calls share one connection. It also has first-class streaming: of its four call types, one is unary and three stream (server-streaming, client-streaming, bidirectional). Use it for service-to-service traffic inside your VPC, mesh, or cluster.

GraphQL — when fetching shape varies

The classic GraphQL win: a mobile client wants a user, their last 5 orders, and the items in each — three round trips with REST, one query with GraphQL. The client declares the field set, the server resolves them. Strong type system, introspection, subscriptions for live updates. The cost: caching is harder (every query is a unique POST body), N+1 query patterns are easy to write, and rate-limiting needs query-cost analysis instead of simple per-endpoint counters. Best when you have many clients with diverse data needs, or you're aggregating across multiple internal services.

tRPC and Connect-style — when client and server share a language

If your frontend and backend are both TypeScript in the same repo, tRPC gives you fully typed RPC bindings by inferring client types from the server code, with no separate schema language or codegen step. ConnectRPC reaches a similar result with protobuf as the source of truth, working across HTTP/1.1 and HTTP/2 in browsers. These are productivity wins for internal-only or single-team APIs, not for external integrations.

💡
REST + gRPC is the common combo
A mature backend often runs REST at the edge (for browsers and partners) and gRPC internally (between services). The REST handlers translate to gRPC calls behind the LB. You get the universal client compatibility on the outside and the type-safety / efficiency on the inside.

RESTful Design — Conventions Worth Following

Resources, not actions

The URL identifies a thing; the verb says what to do with it.

http — resource-shaped
GET    /users                  # list users
POST   /users                  # create a user
GET    /users/123              # fetch one
PATCH  /users/123              # partial update
DELETE /users/123              # remove
GET    /users/123/orders       # nested: this user's orders
POST   /users/123/orders       # create order under this user
POST   /orders/456/refund      # actions live as nested POSTs when they don't map to a resource

Don't put verbs in URLs (POST /createUser, POST /getUser). That's RPC pretending to be REST, and you give up the HTTP semantics you would otherwise get for free. Use plural nouns for collections; pick a convention (kebab-case, camelCase) and stick to it.

Pagination — the four families

Every collection endpoint must paginate. The choice depends on the data.

| Style | Looks like | Strength | Don't use when |
| --- | --- | --- | --- |
| Offset | ?page=2&size=20 | Familiar; supports random access | Tables grow large or change while paginating; OFFSET 100000 reads and discards 100k rows |
| Cursor | ?after=eyJ0cyI6Li4ufQ&limit=20 | Cost per page does not grow with depth; stable under inserts | You truly need random page jumps |
| Keyset | ?since_id=12345&limit=20 | Same as cursor but human-readable; great with B-tree indexes | Sort key is non-unique without tiebreaker |
| Time-window | ?start=2026-01-01&end=2026-01-31 | Natural for events / logs | Object isn't time-shaped |

Cursor pagination is the right default for anything user-facing at scale: stable under churn, fast on indexed columns, and the cursor is opaque so you can change the encoding later without breaking clients.
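A minimal sketch of cursor pagination over a keyset query, assuming a hypothetical events table with created_at and id columns (the table, column names, and cursor layout are illustrative, not taken from any real API). The cursor is just the last row's sort key, base64-encoded so clients treat it as opaque; id acts as the tiebreaker for rows that share a timestamp.

```python
import base64
import json
import sqlite3


def encode_cursor(created_at: str, row_id: int) -> str:
    raw = json.dumps({"ts": created_at, "id": row_id}).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")


def decode_cursor(cursor: str):
    padded = cursor + "=" * (-len(cursor) % 4)  # restore stripped base64 padding
    data = json.loads(base64.urlsafe_b64decode(padded))
    return data["ts"], data["id"]


def fetch_page(conn: sqlite3.Connection, after=None, limit=20):
    """Return (rows, next_cursor); next_cursor is None on the last page."""
    where, params = "", []
    if after is not None:
        ts, row_id = decode_cursor(after)
        # Seek past the last row seen, using id to break ties on created_at.
        where = "WHERE created_at > ? OR (created_at = ? AND id > ?)"
        params = [ts, ts, row_id]
    # Fetch one extra row to learn whether another page exists.
    rows = conn.execute(
        f"SELECT id, created_at FROM events {where} ORDER BY created_at, id LIMIT ?",
        (*params, limit + 1),
    ).fetchall()
    has_more = len(rows) > limit
    rows = rows[:limit]
    next_cursor = encode_cursor(rows[-1][1], rows[-1][0]) if has_more else None
    return rows, next_cursor
```

Because the query seeks on an indexed sort key rather than counting past skipped rows, a deep page costs about the same as the first one, and rows inserted mid-scan cannot shift the window.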

Filtering, sorting, sparse fieldsets

Common conventions to pick from (just be consistent):

  • Filter: ?status=active&created_after=2026-01-01 — flat query params for the common case.
  • Sort: ?sort=-created_at,name — leading - means descending. Default to a deterministic sort to avoid skipped/duplicated rows under pagination.
  • Sparse fieldsets (à la JSON:API): ?fields=id,name,email — useful for mobile to drop bloat. GraphQL solves this natively.
  • Embedding (à la HAL): ?include=author,comments — server pre-fetches related resources. Beware N+1.
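The sort convention above can be parsed in a few lines. This sketch assumes a hypothetical allowlist of sortable fields and always appends id as a tiebreaker, which is one way to get the deterministic ordering that pagination needs.

```python
# Hypothetical allowlist; never interpolate unchecked client input into ORDER BY.
ALLOWED_SORT_FIELDS = {"created_at", "name", "id"}


def parse_sort(param: str, tiebreaker: str = "id"):
    """Turn '-created_at,name' into [(field, direction), ...] with a stable tiebreaker."""
    order = []
    for token in param.split(","):
        token = token.strip()
        if not token:
            continue
        descending = token.startswith("-")
        field = token[1:] if descending else token
        if field not in ALLOWED_SORT_FIELDS:
            raise ValueError(f"unsupported sort field: {field}")
        order.append((field, "DESC" if descending else "ASC"))
    # Append the tiebreaker so rows with equal sort keys keep a fixed order.
    if all(field != tiebreaker for field, _ in order):
        order.append((tiebreaker, "ASC"))
    return order
```

Rejecting unknown fields with an error (rather than ignoring them) keeps typos visible to the client and keeps arbitrary column names out of the query.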

Errors — RFC 7807 Problem Details

Every error gets a structured body. The IETF gave us RFC 7807 ("Problem Details for HTTP APIs", since obsoleted by RFC 9457, which keeps the same core fields) so we don't all reinvent the same shape.

http — RFC 7807 error response
HTTP/1.1 422 Unprocessable Entity
Content-Type: application/problem+json

{
  "type":   "https://api.example.com/errors/validation",
  "title":  "Your request parameters did not validate",
  "status": 422,
  "detail": "start_date must be before end_date",
  "instance": "/orders/45/exports/abc",
  "errors": [
    { "path": "/start_date", "message": "must be before end_date" }
  ],
  "trace_id": "01H8ZX9R8Q5GS9T0K6N2T0PJ4M"
}

Always return the trace ID — when a customer reports a bug they paste it back to you and you find their exact request.
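A small helper keeps every handler emitting the same shape. This is a sketch under assumptions: the function name and return convention (status, headers, body) are hypothetical, and fields such as errors and trace_id are passed as extension members, which the Problem Details format allows alongside the standard ones.

```python
import json


def problem_response(status, title, detail, type_uri="about:blank", instance=None, **extensions):
    """Build (status, headers, body) for a Problem Details error response."""
    if "type" in extensions:
        raise ValueError("pass type_uri instead of an extension named 'type'")
    body = {"type": type_uri, "title": title, "status": status, "detail": detail}
    if instance is not None:
        body["instance"] = instance
    # Extension members (e.g. trace_id, errors) sit next to the standard fields.
    body.update(extensions)
    headers = {"Content-Type": "application/problem+json"}
    return status, headers, json.dumps(body)
```

Keeping the HTTP status and the body's status field in one call site avoids the two drifting apart.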

Idempotency — The Single Most Underused Pattern

HTTP marks GET, PUT, and DELETE idempotent by spec. POST is not — but most real-world POSTs are operations the user wants to be safe to retry: charging a card, sending an email, creating an order. The community pattern, popularized by Stripe and adopted everywhere from Square to Brex, is the idempotency key.

Flow: the client sends Idempotency-Key: K. The server runs SELECT … WHERE key = K. First time: execute, then store (key, hash, response). Repeat: verify the body hash matches, then return the stored response. Client retries are safe: same key, same outcome, charged once.
The idempotency-key pattern. The server stores (key, request-hash, response) so a retry returns the original response without re-executing.

Implementation rules

  • The client generates a UUID per logical operation and sends it in Idempotency-Key.
  • The server stores the (key, request-body-hash, response) on first success, with a TTL (24h is typical).
  • On a retry with the same key, return the stored response — without re-executing.
  • If the request body differs from the stored hash, return 422 Unprocessable Entity — the client is misusing the key.
  • Concurrent requests with the same key need a lock. The standard pattern: insert the key with status in_flight using a unique constraint; if it conflicts, the second request waits or returns 409.

This pattern becomes essential in the AI era because LLM calls are slow and fail transiently often enough that retries are routine. Without idempotency, your charge-card-then-call-LLM-then-store flow can charge twice on a retry.
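The implementation rules above can be sketched with an in-memory store. This is an illustration only: a real deployment would back it with a database row and a unique constraint as the rules describe, and the class name, the (status, payload) return shape, and the choice to answer 409 for in-flight duplicates are assumptions.

```python
import hashlib
import json
import threading
import time


class IdempotencyStore:
    """In-memory stand-in for a table keyed by Idempotency-Key."""

    def __init__(self, ttl_seconds: float = 24 * 3600):
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._records = {}  # key -> {"hash", "response", "expires_at"}

    def execute(self, key, body, operation):
        """Run operation(body) at most once per key; return (status, payload)."""
        body_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        now = time.monotonic()
        with self._lock:
            record = self._records.get(key)
            if record is not None and record["expires_at"] <= now:
                record = None  # expired keys behave as if never seen
            if record is not None:
                if record["hash"] != body_hash:
                    return 422, {"error": "Idempotency-Key reused with a different body"}
                if record["response"] is None:
                    return 409, {"error": "a request with this key is still in flight"}
                return record["response"]
            # Claim the key before executing so concurrent duplicates are rejected.
            self._records[key] = {"hash": body_hash, "response": None, "expires_at": now + self._ttl}
        try:
            response = operation(body)
        except Exception:
            with self._lock:
                self._records.pop(key, None)  # release the claim so a retry can run
            raise
        with self._lock:
            self._records[key]["response"] = response
        return response
```

The operation runs outside the lock, so a slow charge or LLM call does not block unrelated keys; only the claim and the final write are serialised.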

Streaming — The LLM-Era Renaissance

Streaming used to be a niche concern (live data feeds, log tailing). Then chat UIs arrived and every consumer-facing AI feature wanted to render tokens as they were generated. Three protocols cover the territory.

  • SSE (Server-Sent Events): server → client over plain HTTP, auto-reconnect, EventSource API. Use for LLM tokens and feeds.
  • WebSocket: full-duplex, bi-directional frames, stateful connection, your own reconnect logic, heavier infra. Use for chat, games, live collaboration.
  • gRPC streaming: HTTP/2 native, 4 stream modes, strong types, flow control built in, thin browser support. Use for internal streaming.
The three streaming choices. SSE wins for browser-facing LLM responses; WebSockets for two-way chat; gRPC streams for service-to-service.

Server-Sent Events — the LLM default

SSE is plain HTTP that holds the response open and emits text events separated by blank lines. The browser's EventSource API auto-reconnects with a Last-Event-ID header. It's one-way (server → client), simple, and crucially: it works through every middlebox that allows long-lived HTTP responses. OpenAI, Anthropic, and most LLM gateways stream tokens over SSE. One caveat: EventSource only issues GET requests, so POST-based streaming endpoints are usually read with fetch and a stream reader, and reconnection becomes your responsibility.

http — SSE response shape
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
X-Accel-Buffering: no            # disable nginx buffering

event: token
data: {"text": "Hello"}

event: token
data: {"text": ", world"}

event: done
data: {"usage": {"prompt_tokens": 12, "completion_tokens": 4}}
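The wire format above is simple enough to produce and consume by hand. The following sketch assumes JSON payloads on a single data line and events separated by a blank line; the function names are hypothetical, and the parser is deliberately simplified (it ignores the id and retry fields and expects complete text rather than a live stream).

```python
import json


def format_sse(event: str, data: dict, event_id=None) -> str:
    """Serialise one event; the trailing blank line terminates it."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps emits a single line by default, so one data field is enough.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


def parse_sse(text: str):
    """Parse a complete SSE body into [(event_name, payload), ...]."""
    events = []
    for block in text.split("\n\n"):
        name, data_lines = "message", []  # "message" is the default event type
        for line in block.split("\n"):
            if not line or line.startswith(":"):
                continue  # blank line or comment (often used as a keep-alive)
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space after the colon is dropped
            if field == "event":
                name = value
            elif field == "data":
                data_lines.append(value)
        if data_lines:
            events.append((name, json.loads("\n".join(data_lines))))
    return events
```

A streaming handler would yield format_sse(...) per token and flush after each one; buffering several events before flushing defeats the purpose.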

WebSockets — when both sides need to push

WebSockets upgrade an HTTP request to a persistent TCP connection where either side can send frames at any time. Use them for chat (the user is typing while the assistant is responding), real-time games, collaborative editors. The cost: every connection is a long-lived stateful resource on the server, and browsers/proxies have idiosyncratic timeout behaviour you'll deal with.

Chunked transfer — the underrated option

HTTP/1.1 has had Transfer-Encoding: chunked forever. The server doesn't know the response length up front; it sends chunks and a final empty chunk. For internal service-to-service streaming where you don't need the SSE event semantics, plain chunked JSON-Lines (one JSON object per line) is light and effective.
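A hedged sketch of the JSON-Lines idea: the producer yields one newline-terminated JSON object per record, and the consumer reassembles lines from whatever chunk boundaries the transport happens to deliver. The function names are illustrative; the HTTP framing itself is left to the server and client libraries.

```python
import json
from typing import Iterable, Iterator


def to_json_lines(records: Iterable[dict]) -> Iterator[bytes]:
    """Yield one newline-terminated JSON document per record."""
    for record in records:
        yield (json.dumps(record) + "\n").encode("utf-8")


def from_json_lines(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Decode records from byte chunks that may split a line anywhere."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete line; keep the unfinished tail in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        yield json.loads(buffer)  # final record without a trailing newline
```

The reader never assumes that one chunk equals one record, which is the usual bug in hand-rolled streaming consumers.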

🌱
Disable proxy buffering
SSE breaks if anything in the chain buffers. NGINX defaults to buffering responses; either have the upstream send the X-Accel-Buffering: no response header or set proxy_buffering off in the NGINX config. Cloudflare, Fastly, and others have similar settings. Always test SSE end-to-end with the real production proxy stack, because local dev hides the issue.

Versioning — Strategies That Survive Years

The first API change that breaks a paying customer is the day you wish you'd thought about versioning. Three viable schemes:

  • URL path versioning: /v1/users, /v2/users. Visible, easy to route, easy to deprecate. Most public APIs use this.
  • Header versioning: Accept: application/vnd.example.v2+json. Cleaner URLs, but invisible to humans browsing logs.
  • Date versioning (à la Stripe), e.g. Stripe-Version: 2024-04-10. Each new version is a dated snapshot; clients pin a date and get exactly that behaviour. The strongest pattern for long-lived APIs.

Whichever you pick, follow these rules: never break v1 once published; deprecate explicitly with Deprecation and Sunset headers; keep at least 12 months between deprecation and removal; communicate via emails, blog posts, and SDK warnings — not just docs.
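One way to implement date pinning is to have handlers always produce the latest response shape and then walk a changelog backwards, undoing each change the client's pinned version predates. The sketch below is illustrative only: the changelog entries, field names, and function names are made up, and it is not a description of how any particular provider does it.

```python
def _undo_full_name_rename(response: dict) -> dict:
    # Hypothetical change on 2024-04-10: "name" was renamed to "full_name".
    downgraded = {k: v for k, v in response.items() if k != "full_name"}
    downgraded["name"] = response["full_name"]
    return downgraded


def _undo_email_verified(response: dict) -> dict:
    # Hypothetical change on 2023-10-01: "email_verified" was added.
    return {k: v for k, v in response.items() if k != "email_verified"}


# Newest change first, so downgrades apply in reverse chronological order.
CHANGES = [
    ("2024-04-10", _undo_full_name_rename),
    ("2023-10-01", _undo_email_verified),
]


def render_for_version(pinned_version: str, latest_response: dict) -> dict:
    """Reshape the latest response to what a client pinned to a date expects."""
    response = dict(latest_response)
    for introduced_on, downgrade in CHANGES:
        # ISO 8601 dates compare correctly as plain strings.
        if pinned_version < introduced_on:
            response = downgrade(response)
    return response
```

The appeal of this layout is that business logic only ever targets the newest shape; compatibility lives in small, individually testable transforms.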

Rate Limiting and Quotas

Rate limiting protects your service from abuse and your users from each other. Three common shapes:

  • Token bucket — N tokens, refilled at rate R; each request consumes 1. Allows short bursts.
  • Leaky bucket — fixed leak rate; queues if bursty. Smoother output, latency tax.
  • Fixed window — "100 requests per minute". Simplest to implement; suffers from edge bursts at window boundaries.
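Of the three shapes above, the token bucket is the one most often hand-rolled. A minimal single-process sketch follows; the class name and the (allowed, retry_after) return shape are assumptions, the clock is injectable for testing, and a multi-node deployment would need shared state that this does not provide.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled continuously at `refill_per_second`."""

    def __init__(self, capacity: float, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._updated = clock()

    def try_acquire(self, cost: float = 1.0):
        """Return (allowed, retry_after_seconds)."""
        now = self._clock()
        elapsed = now - self._updated
        # Refill lazily based on elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True, 0.0
        # Time until enough tokens accumulate; a natural Retry-After value.
        return False, (cost - self._tokens) / self.refill_per_second
```

Keeping one bucket per user, per tenant, and per IP, and requiring all of them to allow the request, gives the layered limits described in this section.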

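Always return 429 Too Many Requests with Retry-After and the RateLimit-* headers so smart clients know exactly how much budget they have. (The RateLimit-* fields come from an IETF draft rather than a finished standard, and their syntax has varied between draft versions, so document which form you emit.) For multi-tenant systems, layer per-user, per-tenant, and per-IP buckets so one noisy actor can't starve the rest.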

http — 429 response with hints
HTTP/1.1 429 Too Many Requests
Content-Type: application/problem+json
Retry-After: 12
RateLimit-Limit: 100, 100;w=60
RateLimit-Remaining: 0
RateLimit-Reset: 12

{
  "type":   "https://api.example.com/errors/rate-limit",
  "title":  "Too many requests",
  "status": 429,
  "detail": "You have exceeded your tenant's 100 req/min budget."
}

OpenAPI & Documentation as Code

Hand-written API docs rot in days. The cure: OpenAPI as the source of truth — generated from your code annotations or hand-written and used to generate handlers/clients. The whole ecosystem (Stoplight, Postman, Redocly, code-generators for every language) plugs in. Pair with contract tests that diff the spec against actual responses; CI fails if the spec drifts.

Quick check
A team launches an internal API where the client sends a long-running LLM job. Sometimes a network blip causes the client to retry; the LLM gets called twice and the user gets two charges. List two fixes — one in the API contract, one at the application layer.
Answer
Contract fix: require an Idempotency-Key header on the create-job endpoint. The server stores (key → job ID + response). Repeats return the stored response without re-executing.
Application fix: separate the charge and the LLM call. Charge once, get a charge_id, then enqueue an async job referencing that charge_id. The job worker is itself idempotent (checks: has this charge_id already been processed?).
Combined, retries at any layer are safe: the client retry hits the idempotency cache, and the worker dedupes by charge_id. Day 5 is dedicated to this pattern.

Putting It Together — A Production-Grade Endpoint

Here's the surface of a single endpoint that takes everything above seriously.

openapi — POST /v1/completions, with streaming, idempotency, errors
paths:
  /v1/completions:
    post:
      summary: Create a completion (optionally streaming)
      parameters:
        - in: header
          name: Idempotency-Key
          required: true
          schema: { type: string, format: uuid }
        - in: header
          name: X-Tenant-Id
          required: true
          schema: { type: string }
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                model:    { type: string, example: "claude-haiku-4.5" }
                messages: { type: array,  items: { $ref: "#/components/schemas/Message" } }
                stream:   { type: boolean, default: false }
                max_tokens: { type: integer, default: 1024 }
      responses:
        "200":
          description: |
            JSON when stream=false.
            text/event-stream when stream=true.
          content:
            application/json:        { schema: { $ref: "#/components/schemas/Completion" } }
            text/event-stream:       { schema: { $ref: "#/components/schemas/SSEStream" } }
        "401": { $ref: "#/components/responses/Unauthenticated" }
        "403": { $ref: "#/components/responses/Forbidden" }
        "422": { $ref: "#/components/responses/ValidationError" }
        "429": { $ref: "#/components/responses/RateLimited" }
        "503": { $ref: "#/components/responses/Overloaded" }
Mnemonic — API checklist
"Verbs, Status, Errors, Idempotency, Pagination, Versioning, Streaming."
  • Verbs & status carry meaning the rest of the stack already understands.
  • Errors in RFC 7807 with a trace ID.
  • Idempotency-Key on every mutating endpoint that retries can hit.
  • Cursor pagination for collections at scale.
  • Versioning as a first-class deprecation policy.
  • Streaming when the latency budget no longer fits a single response.
Flashcard
A frontend engineer asks: "can we just use POST for everything to keep the URLs simple?" Give two concrete things they'd lose by doing this.
Answer
1) CDN and proxy caching. CDNs cache GETs by URL automatically; POSTs are uncached by default. Reading data over POST means every request goes to origin, which can multiply origin traffic, sometimes by 100×. 2) Safe automatic retries. Load balancers, HTTP clients, and browsers will safely retry GET, PUT, DELETE on connection errors because they're declared idempotent in the spec. They will not retry POST without an idempotency key, leading to either silent failures or unsafe duplicate operations. The HTTP verbs are not bureaucracy; they unlock infrastructure behaviour you'd otherwise have to build yourself.
🔑
Key takeaways
1) HTTP semantics are the cheapest infrastructure win available — verbs, statuses, and headers unlock CDN, retry, and proxy behaviour for free. 2) REST for public, gRPC for internal, GraphQL when fetching shape varies — pick by need, not religion. 3) RFC 7807 + idempotency keys + cursor pagination + Retry-After are the four primitives that distinguish a hobby API from a production one. 4) SSE is the LLM streaming default; WebSockets when both ends push; gRPC streams when both ends are your code. 5) Version explicitly, deprecate humanely, and let the OpenAPI spec be the source of truth — or your contract is whatever last week's commit said.
