API Design — REST, gRPC, and Streaming
An API is a contract you cannot retract. Master HTTP semantics, the REST conventions worth keeping, when to reach for gRPC or GraphQL instead, the streaming primitives (SSE, chunked, WebSockets) that LLMs forced back into the spotlight, plus versioning, errors, idempotency, and pagination that hold up at scale.
What you will learn
An API is a public, durable, machine-readable promise. Every other backend concern can be refactored quietly; an API is renegotiated with every client. This chapter is about making that contract precise enough that it survives both your next big change and the streaming-token explosion that LLMs brought into the request lifecycle.
HTTP Semantics — The Backbone You Should Treat Seriously
Most arguments about "REST vs RPC" are actually arguments about HTTP. The protocol's verbs and status codes carry meaning that proxies, browsers, monitoring tools, and CDNs already understand. Use them — don't fight them.
Verbs and what they actually mean
| Verb | Safe? | Idempotent? | Cacheable? | Use for |
|---|---|---|---|---|
| GET | yes | yes | yes | Reads with no side effects. Browsers, CDNs, & proxies will retry these freely. |
| HEAD | yes | yes | yes | Headers without body — content existence checks. |
| OPTIONS | yes | yes | no | CORS preflight, capability discovery. |
| PUT | no | yes | no | Replace a resource at a known URL. Same body twice = same state. |
| DELETE | no | yes | no | Remove a resource. Second DELETE typically returns 404 or 204 idempotently. |
| PATCH | no | application-defined | no | Partial update. Use JSON Patch (RFC 6902) or JSON Merge Patch (RFC 7396). |
| POST | no | no | conditional | Create resources or invoke actions. The catch-all when others don't fit. |
Idempotent means "safe to repeat": the second request leaves the system in the same state as the first. This isn't pedantry: load balancers retry idempotent requests on connection failures, client libraries retry them automatically, and CDNs aggressively cache the safe, cacheable ones (GET and HEAD). Marking POST as idempotent (with an idempotency key) unlocks the same retry machinery; Day 5 covers this in depth.
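A minimal sketch of how a client or proxy might apply this rule when deciding whether a failed request can be repeated. The function name `is_safe_to_retry` is illustrative, and the code assumes the server honours an `Idempotency-Key` header on non-idempotent methods.

```python
# Methods that HTTP defines as idempotent (see the table above).
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}


def is_safe_to_retry(method: str, headers: dict) -> bool:
    """Return True when repeating the request should not change the outcome."""
    if method.upper() in IDEMPOTENT_METHODS:
        return True
    # Assumption: the server deduplicates on Idempotency-Key for other methods.
    return any(name.lower() == "idempotency-key" for name in headers)


print(is_safe_to_retry("PUT", {}))                              # True
print(is_safe_to_retry("POST", {}))                             # False
print(is_safe_to_retry("POST", {"Idempotency-Key": "a1b2c3"}))  # True
```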
Status codes — the small set that matters
- 200 OK / 201 Created / 204 No Content: successes. Use 201 with a Location header for resource creation.
- 301 / 308 permanent redirect, 302 / 307 temporary. 308/307 preserve the request method; 301/302 may downgrade POST to GET in some clients.
- 400 client error (validation), 401 unauthenticated, 403 forbidden (authenticated but not allowed), 404 not found.
- 409 Conflict for state collisions ("resource already exists," "concurrent modification"). 410 Gone when a resource is intentionally retired.
- 422 Unprocessable Entity for syntactically valid but semantically rejected inputs (well-formed JSON, but the values don't make business sense).
- 429 Too Many Requests with a Retry-After header. The right answer when rate-limiting.
- 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable (often with Retry-After), 504 Gateway Timeout. Distinguish them: load balancers and clients react differently.
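To make "clients react differently" concrete, here is a hedged sketch of a client-side retry decision keyed on the status code. The function name, the 30-second cap, and the choice to treat 500 as non-retryable are assumptions for illustration. Header lookup is case-sensitive here, which a real client would normalise.

```python
from typing import Mapping, Optional


def retry_delay(status: int, headers: Mapping[str, str], attempt: int) -> Optional[float]:
    """Seconds to wait before retrying, or None when retrying will not help."""
    if status in (429, 503):
        # Only the delta-seconds form of Retry-After is handled here;
        # the header may also carry an HTTP-date.
        value = headers.get("Retry-After", "")
        if value.isdigit():
            return float(value)
        return min(2.0 ** attempt, 30.0)
    if status in (502, 504):
        # Gateway trouble is usually transient: exponential backoff, capped.
        return min(2.0 ** attempt, 30.0)
    # Other 4xx are the caller's fault; 500 is treated as non-retryable in this sketch.
    return None


print(retry_delay(429, {"Retry-After": "12"}, attempt=0))  # 12.0
print(retry_delay(504, {}, attempt=3))                     # 8.0
print(retry_delay(422, {}, attempt=0))                     # None
```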
Returning 200 OK { error: "..." } for an actual failure breaks every monitoring and retry tool you'll ever wire up. Use the right status code and an error body. The status code is for proxies and infrastructure; the body is for the human reading the response.
The Four Shapes — REST, gRPC, GraphQL, tRPC-style
You'll choose one (or two) for each surface area of your system. The choice is rarely irreversible, but changing it is far more expensive after launch than before.
REST — the default lingua franca
REST is what every browser, CLI tool, and unfamiliar engineer expects to find when they hit your API. Resources have URLs (/users/123, /orders/abc-456/items); HTTP verbs do the action. The constraints (statelessness, uniform interface, cacheability) are what make CDNs, retries, and content negotiation work for free. Pick REST for any externally-exposed surface unless you have an explicit reason not to.
gRPC — internal speed and types
gRPC uses Protocol Buffers (a binary, schema-defined serialization) over HTTP/2. The wire format is compact (often cited as 5-10× smaller than equivalent JSON, though the ratio depends on the payload), the schema is strict (codegen for Go, Java, Python, Rust, etc.), and HTTP/2 multiplexing means many parallel calls share one connection. It also offers four call types, three of them streaming: unary, server-streaming, client-streaming, and bidirectional. Use it for service-to-service traffic inside your VPC, mesh, or cluster.
GraphQL — when fetching shape varies
The classic GraphQL win: a mobile client wants a user, their last 5 orders, and the items in each — three round trips with REST, one query with GraphQL. The client declares the field set, the server resolves them. Strong type system, introspection, subscriptions for live updates. The cost: caching is harder (every query is a unique POST body), N+1 query patterns are easy to write, and rate-limiting needs query-cost analysis instead of simple per-endpoint counters. Best when you have many clients with diverse data needs, or you're aggregating across multiple internal services.
tRPC and Connect-style — when client and server share a language
If your frontend and backend are both TypeScript in the same repo, tRPC gives you fully typed RPC calls through TypeScript type inference, with no separate schema language or code-generation step. ConnectRPC offers similar end-to-end typing with protobuf as the source of truth, working across HTTP/1.1 and HTTP/2 in browsers. These are productivity wins for internal-only or single-team APIs, not for external integrations.
RESTful Design — Conventions Worth Following
Resources, not actions
The URL identifies a thing; the verb says what to do with it.
GET    /users               # list users
POST   /users               # create a user
GET    /users/123           # fetch one
PATCH  /users/123           # partial update
DELETE /users/123           # remove
GET    /users/123/orders    # nested: this user's orders
POST   /users/123/orders    # create order under this user
POST   /orders/456/refund   # actions live as nested POSTs when they don't map to a resource
Don't put verbs in URLs (POST /createUser, POST /getUser): that's RPC pretending to be REST, and you give up the HTTP semantics you would otherwise get for free. Use plural nouns for collections; pick a convention (kebab-case, camelCase) and stick to it.
Pagination — the four families
Every collection endpoint must paginate. The choice depends on the data.
| Style | Looks like | Strength | Don't use when |
|---|---|---|---|
| Offset | ?page=2&size=20 | Familiar; supports random access | Tables grow large or change while paginating; OFFSET 100000 scans 100k rows |
| Cursor | ?after=eyJ0cyI6Li4ufQ&limit=20 | O(1) per page; stable under inserts | You truly need random page jumps |
| Keyset | ?since_id=12345&limit=20 | Same as cursor but human-readable; great with B-tree indexes | Sort key is non-unique without tiebreaker |
| Time-window | ?start=2026-01-01&end=2026-01-31 | Natural for events / logs | Object isn't time-shaped |
Cursor pagination is the right default for anything user-facing at scale: stable under churn, fast on indexed columns, and the cursor is opaque so you can change the encoding later without breaking clients.
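A minimal sketch of opaque cursor pagination, assuming rows are sorted by `(created_at, id)`. The helper names and the cursor layout (`ts`, `id`) are hypothetical, and the in-memory filter stands in for an indexed `WHERE (created_at, id) > (?, ?)` query.

```python
import base64
import json
from typing import Optional


def encode_cursor(created_at: str, row_id: int) -> str:
    """Pack the last row's sort key into an opaque, URL-safe token."""
    raw = json.dumps({"ts": created_at, "id": row_id}).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")


def decode_cursor(cursor: str) -> tuple:
    padded = cursor + "=" * (-len(cursor) % 4)  # restore stripped padding
    data = json.loads(base64.urlsafe_b64decode(padded))
    return data["ts"], data["id"]


def list_page(rows: list, after: Optional[str] = None, limit: int = 2) -> dict:
    """rows must already be sorted by (created_at, id) ascending."""
    if after is not None:
        last_key = decode_cursor(after)
        # Stand-in for: WHERE (created_at, id) > (?, ?) ORDER BY created_at, id
        rows = [r for r in rows if (r["created_at"], r["id"]) > last_key]
    page = rows[:limit]
    has_more = len(rows) > limit
    next_cursor = encode_cursor(page[-1]["created_at"], page[-1]["id"]) if has_more else None
    return {"items": page, "next": next_cursor}


ROWS = [{"id": i, "created_at": f"2026-01-0{i}"} for i in range(1, 6)]
first = list_page(ROWS)
second = list_page(ROWS, after=first["next"])
print([r["id"] for r in second["items"]])  # [3, 4]
```

Because clients only ever echo the token back, the server can later change what goes inside it (for example, adding a sort direction) without a breaking change.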
Filtering, sorting, sparse fieldsets
Common conventions to pick from (just be consistent):
- Filter: ?status=active&created_after=2026-01-01 (flat query params for the common case).
- Sort: ?sort=-created_at,name (a leading - means descending). Default to a deterministic sort to avoid skipped/duplicated rows under pagination.
- Sparse fieldsets (à la JSON:API): ?fields=id,name,email (useful for mobile to drop bloat). GraphQL solves this natively.
- Embedding (à la HAL): ?include=author,comments (server pre-fetches related resources). Beware N+1.
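The sort convention above could be parsed roughly as follows. This is a sketch: the function name, the allow-list, and the choice of `id` as the tiebreaker are assumptions. The tiebreaker is what makes the order deterministic under pagination.

```python
def parse_sort(param: str, allowed: set) -> list:
    """Turn '-created_at,name' into [(field, direction), ...]."""
    keys = []
    for raw in param.split(","):
        raw = raw.strip()
        if not raw:
            continue
        descending = raw.startswith("-")
        field = raw[1:] if descending else raw
        if field not in allowed:
            # Reject unknown fields rather than passing them to the database.
            raise ValueError(f"cannot sort by {field!r}")
        keys.append((field, "desc" if descending else "asc"))
    # Append a unique tiebreaker so equal values never reorder between pages.
    if all(field != "id" for field, _ in keys):
        keys.append(("id", "asc"))
    return keys


print(parse_sort("-created_at,name", {"created_at", "name", "id"}))
# [('created_at', 'desc'), ('name', 'asc'), ('id', 'asc')]
```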
Errors — RFC 7807 Problem Details
Every error gets a structured body. The IETF gave us RFC 7807 ("Problem Details for HTTP APIs", since obsoleted by RFC 9457, which keeps the same core fields) so we don't all reinvent the same shape.
HTTP/1.1 422 Unprocessable Entity
Content-Type: application/problem+json

{
  "type": "https://api.example.com/errors/validation",
  "title": "Your request parameters did not validate",
  "status": 422,
  "detail": "start_date must be before end_date",
  "instance": "/orders/45/exports/abc",
  "errors": [
    { "path": "/start_date", "message": "must be before end_date" }
  ],
  "trace_id": "01H8ZX9R8Q5GS9T0K6N2T0PJ4M"
}
Always return the trace ID: when a customer reports a bug they paste it back to you and you find their exact request.
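One way to keep every error in this shape is to route them through a single helper. The sketch below is framework-agnostic, and the name `problem_response` is an assumption. Extension members such as `errors` and `trace_id` are passed as keyword arguments.

```python
import json


def problem_response(status, title, type_uri, detail=None, instance=None, **extensions):
    """Build (status, headers, body) for a Problem Details error."""
    body = {"type": type_uri, "title": title, "status": status}
    if detail is not None:
        body["detail"] = detail
    if instance is not None:
        body["instance"] = instance
    body.update(extensions)  # e.g. errors=[...], trace_id="..."
    headers = {"Content-Type": "application/problem+json"}
    return status, headers, json.dumps(body)


status, headers, body = problem_response(
    422,
    "Your request parameters did not validate",
    "https://api.example.com/errors/validation",
    detail="start_date must be before end_date",
    trace_id="01H8ZX9R8Q5GS9T0K6N2T0PJ4M",
)
print(status, headers["Content-Type"])  # 422 application/problem+json
```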
Idempotency — The Single Most Underused Pattern
HTTP marks GET, PUT, and DELETE idempotent by spec. POST is not — but most real-world POSTs are operations the user wants to be safe to retry: charging a card, sending an email, creating an order. The community pattern, popularized by Stripe and adopted everywhere from Square to Brex, is the idempotency key.
Implementation rules
- The client generates a UUID per logical operation and sends it in Idempotency-Key.
- The server stores the (key, request-body-hash, response) on first success, with a TTL (24h is typical).
- On a retry with the same key, return the stored response, without re-executing.
- If the request body differs from the stored hash, return 422 Unprocessable Entity: the client is misusing the key.
- Concurrent requests with the same key need a lock. The standard pattern: insert the key with status in_flight using a unique constraint; if it conflicts, the second request waits or returns 409.
This pattern becomes essential in the AI era because LLM calls fail unpredictably (timeouts, overloaded providers, dropped streams) and clients retry them. Without idempotency, your charge-card-then-call-LLM-then-store flow can charge twice on a retry.
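The implementation rules above can be sketched with an in-memory store. This is an illustration under stated assumptions, not a production design: a dict stands in for a table with a unique constraint on the key, the TTL is omitted, and the class is not thread-safe. `IdempotencyStore` and `charge_card` are hypothetical names.

```python
import hashlib
import json


class IdempotencyStore:
    """In-memory stand-in for a table with a unique constraint on the key."""

    def __init__(self):
        self._records = {}

    def handle(self, key: str, body: dict, execute):
        body_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        record = self._records.get(key)
        if record is not None:
            if record["hash"] != body_hash:
                return 422, {"title": "Idempotency-Key reused with a different body"}
            if record["status"] == "in_flight":
                return 409, {"title": "A request with this key is still in progress"}
            return record["response"]  # replay without re-executing

        # First sighting: claim the key before doing any work.
        self._records[key] = {"hash": body_hash, "status": "in_flight", "response": None}
        try:
            response = execute(body)
        except Exception:
            del self._records[key]  # nothing stored on failure, so a retry starts clean
            raise
        self._records[key] = {"hash": body_hash, "status": "done", "response": response}
        return response


charges = []


def charge_card(body):
    charges.append(body["amount"])
    return 201, {"charge_id": f"ch_{len(charges)}"}


store = IdempotencyStore()
first = store.handle("key-1", {"amount": 500}, charge_card)
retry = store.handle("key-1", {"amount": 500}, charge_card)
print(first == retry, len(charges))  # True 1
```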
Streaming — The LLM-Era Renaissance
Streaming used to be a niche concern (live data feeds, log tailing). Then chat UIs arrived and every consumer-facing AI feature wanted to render tokens as they were generated. Three protocols cover the territory.
Server-Sent Events — the LLM default
SSE is plain HTTP that holds the response open and emits text events separated by blank lines. The browser's EventSource API auto-reconnects with a Last-Event-ID header. It's one-way (server → client), simple, and crucially: it works through every middlebox that allows long-lived HTTP responses. OpenAI, Anthropic, and most LLM gateways stream tokens over SSE.
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
X-Accel-Buffering: no # disable nginx buffering

event: token
data: {"text": "Hello"}

event: token
data: {"text": ", world"}

event: done
data: {"usage": {"prompt_tokens": 12, "completion_tokens": 4}}
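On the server side, producing this wire format takes only a few lines. The sketch below is framework-agnostic: `format_sse` and `stream_tokens` are illustrative names, and the `id:` field is added on the assumption that you want `Last-Event-ID` to be useful on reconnect.

```python
import json


def format_sse(data, event=None, event_id=None) -> str:
    """Serialize one event; the trailing blank line terminates it."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps emits a single line by default, so one data: field is enough.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens, usage):
    """Yield SSE frames for each token, then a final 'done' event."""
    for index, text in enumerate(tokens):
        yield format_sse({"text": text}, event="token", event_id=index)
    yield format_sse({"usage": usage}, event="done")


for frame in stream_tokens(["Hello", ", world"], {"prompt_tokens": 12, "completion_tokens": 4}):
    print(frame, end="")
```

A web framework would write each yielded string to the response and flush immediately, since buffering the frames defeats the point of streaming.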
WebSockets — when both sides need to push
WebSockets upgrade an HTTP request to a persistent TCP connection where either side can send frames at any time. Use them for chat (the user is typing while the assistant is responding), real-time games, collaborative editors. The cost: every connection is a long-lived stateful resource on the server, and browsers/proxies have idiosyncratic timeout behaviour you'll deal with.
Chunked transfer — the underrated option
HTTP/1.1 has had Transfer-Encoding: chunked forever. The server doesn't know the response length up front; it sends chunks and a final empty chunk. For internal service-to-service streaming where you don't need the SSE event semantics, plain chunked JSON-Lines (one JSON object per line) is light and effective.
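On the consuming side, the one subtlety with JSON-Lines over a chunked response is that chunk boundaries do not line up with line boundaries. A minimal sketch of a reader that buffers partial lines follows. The function name is an assumption, and `chunks` stands for whatever byte iterator your HTTP client exposes.

```python
import json


def iter_json_lines(chunks):
    """Yield one parsed object per newline-delimited JSON line."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A chunk may contain zero, one, or many complete lines.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    # Tolerate a final line without a trailing newline.
    if buffer.strip():
        yield json.loads(buffer)


chunks = [b'{"text": "Hel', b'lo"}\n{"text"', b': "!"}\n']
print(list(iter_json_lines(chunks)))  # [{'text': 'Hello'}, {'text': '!'}]
```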
Proxies often buffer streamed responses, which defeats SSE. With nginx, disable buffering with X-Accel-Buffering: no or proxy_buffering off. Cloudflare, Fastly, and others have similar settings. Always test SSE end-to-end with the real production proxy stack; local dev hides the issue.
Versioning — Strategies That Survive Years
The day your first API change breaks a paying customer is the day you wish you'd thought about versioning. Three viable schemes:
- URL path versioning (/v1/users, /v2/users). Visible, easy to route, easy to deprecate. Most public APIs use this.
- Header versioning (Accept: application/vnd.example.v2+json). Cleaner URLs, but invisible to humans browsing logs.
- Date versioning, à la Stripe (Stripe-Version: 2024-04-10). Each new version is a dated snapshot; clients pin a date and get exactly that behaviour. The strongest pattern for long-lived APIs.
Whichever you pick, follow these rules: never break v1 once published; deprecate explicitly with Deprecation and Sunset headers; keep at least 12 months between deprecation and removal; communicate via emails, blog posts, and SDK warnings — not just docs.
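As an illustration of date versioning, here is one way a server might map a pinned date to a behaviour snapshot. This is a hypothetical scheme, not a description of Stripe's implementation: the release dates are made up, and resolving to the latest snapshot at or before the pinned date is an assumption.

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical dated snapshots, kept sorted. ISO dates sort lexicographically.
RELEASES = ["2023-08-16", "2024-04-10", "2024-11-20"]


def resolve_version(requested: Optional[str], account_default: str) -> str:
    """Pick the latest snapshot at or before the pinned date."""
    pinned = requested or account_default  # header wins, else the account's pin
    index = bisect_right(RELEASES, pinned)
    if index == 0:
        raise ValueError(f"no API version at or before {pinned}")
    return RELEASES[index - 1]


print(resolve_version("2024-06-01", "2023-08-16"))  # 2024-04-10
print(resolve_version(None, "2023-08-16"))          # 2023-08-16
```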
Rate Limiting and Quotas
Rate limiting protects your service from abuse and your users from each other. Three common shapes:
- Token bucket — N tokens, refilled at rate R; each request consumes 1. Allows short bursts.
- Leaky bucket — fixed leak rate; queues if bursty. Smoother output, latency tax.
- Fixed window — "100 requests per minute". Simplest to implement; suffers from edge bursts at window boundaries.
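A minimal token-bucket sketch, to make the first shape concrete. The class and method names are illustrative. The clock is injectable so the behaviour can be shown without sleeping, and a real deployment would keep the state in shared storage rather than in process memory.

```python
import time


class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity        # N: maximum burst size
        self.refill_rate = refill_rate  # R: tokens added per second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def retry_after(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens are available (for a Retry-After header)."""
        return max(0.0, cost - self.tokens) / self.refill_rate


now = [0.0]  # fake clock so the example is deterministic
bucket = TokenBucket(capacity=2, refill_rate=1.0, clock=lambda: now[0])
print(bucket.allow(), bucket.allow(), bucket.allow())  # True True False
now[0] = 1.0                                           # one second later
print(bucket.allow())                                  # True
```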
Always return 429 Too Many Requests with Retry-After and the RateLimit-* headers (an IETF draft rather than a finished standard) so smart clients know exactly how much budget they have. For multi-tenant systems, layer per-user, per-tenant, and per-IP buckets so one noisy actor can't starve the rest.
HTTP/1.1 429 Too Many Requests
Content-Type: application/problem+json
Retry-After: 12
RateLimit-Limit: 100, 100;w=60
RateLimit-Remaining: 0
RateLimit-Reset: 12

{
  "type": "https://api.example.com/errors/rate-limit",
  "title": "Too many requests",
  "status": 429,
  "detail": "You have exceeded your tenant's 100 req/min budget."
}
OpenAPI & Documentation as Code
Hand-written API docs rot in days. The cure: OpenAPI as the source of truth — generated from your code annotations or hand-written and used to generate handlers/clients. The whole ecosystem (Stoplight, Postman, Redocly, code-generators for every language) plugs in. Pair with contract tests that diff the spec against actual responses; CI fails if the spec drifts.
Put an Idempotency-Key header on the create-job endpoint. The server stores (key → job ID + response). Repeats return the stored response without re-executing. Application fix: separate the charge and the LLM call. Charge once, get a charge_id, then enqueue an async job referencing that charge_id. The job worker is itself idempotent (checks: has this charge_id already been processed?). Combined, retries at any layer are safe: the client retry hits the idempotency cache, the worker dedupes by charge_id. Day 5 is dedicated to this pattern.
Putting It Together — A Production-Grade Endpoint
Here's the surface of a single endpoint that takes everything above seriously.
paths:
  /v1/completions:
    post:
      summary: Create a completion (optionally streaming)
      parameters:
        - in: header
          name: Idempotency-Key
          required: true
          schema: { type: string, format: uuid }
        - in: header
          name: X-Tenant-Id
          required: true
          schema: { type: string }
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                model: { type: string, example: "claude-haiku-4.5" }
                messages: { type: array, items: { $ref: "#/components/schemas/Message" } }
                stream: { type: boolean, default: false }
                max_tokens: { type: integer, default: 1024 }
      responses:
        "200":
          description: |
            JSON when stream=false.
            text/event-stream when stream=true.
          content:
            application/json: { schema: { $ref: "#/components/schemas/Completion" } }
            text/event-stream: { schema: { $ref: "#/components/schemas/SSEStream" } }
        "401": { $ref: "#/components/responses/Unauthenticated" }
        "403": { $ref: "#/components/responses/Forbidden" }
        "422": { $ref: "#/components/responses/ValidationError" }
        "429": { $ref: "#/components/responses/RateLimited" }
        "503": { $ref: "#/components/responses/Overloaded" }
- Verbs & status carry meaning the rest of the stack already understands.
- Errors in RFC 7807 with a trace ID.
- Idempotency-Key on every mutating endpoint that retries can hit.
- Cursor pagination for collections at scale.
- Versioning as a first-class deprecation policy.
- Streaming when the latency budget no longer fits a single response.
- RFC 9110: HTTP Semantics (datatracker.ietf.org)
- RFC 7807: Problem Details for HTTP APIs (datatracker.ietf.org)
- Stripe: Idempotent Requests (stripe.com)
- WHATWG: Server-Sent Events (whatwg.org)
- gRPC: Core concepts (grpc.io)
- OpenAPI Initiative (openapis.org)
- Microsoft: REST API Guidelines (github.com)