The Engineering Codex/From Code to Internet: Deployment & Operations
DAY 1
01 / 07

DNS & The Routing of a Request

13 min · Beginner · 2,930 words
Before your code runs in production, a request has to find it. That hunt starts with DNS — the planet-scale phonebook that turns 'api.acme.com' into an IP address. Learn the hierarchy, the record types you'll actually use, why TTLs are the most important number in deployment, and how a five-line typo in a zone file can break the internet for your customers.

What you will learn

01 What Actually Happens When You Type a URL
02 The Hierarchy — and Why It's Shaped Like This
03 The Records You'll Actually Use
04 TTL — The Most Important Number You Set
05 Propagation Is a Lie — Cache Expiration Is the Truth
06 How a Domain Becomes a Live Service

Deployment begins one layer above your code: at the address. Before a single byte of your service runs, somebody's browser had to find it. That finding is DNS — the Domain Name System — and for most engineers it sits in a box marked 'someone else's problem' until the day it isn't. The day a domain doesn't resolve, or resolves to the wrong place, or resolves correctly for you but not for half your users, is the day DNS becomes the most important system in your stack. This chapter installs the model so that day is short.

🔑
Today's mental model
1) DNS is a distributed, cached, hierarchical key-value store with one core job: hostname → IP. 2) A lookup passes through three caches (browser, OS, recursive resolver) before it ever reaches the root → TLD → authoritative hierarchy. 3) Records (A, AAAA, CNAME, MX, TXT, NS) are the only API. 4) TTL is the most consequential number in your zone file: it controls how fast you can change your mind. 5) Propagation isn't a thing; cache expiration is.

What Actually Happens When You Type a URL

You type api.acme.com and hit Enter. Before any HTTP, before any TLS, the browser does not know where to send the packet. It needs an IP address. The work it does to find one is DNS resolution, and stripped of its lore, it's a row of caches standing in front of the servers that actually hold the answer.

[Figure: Browser cache (~1 ms, in-process) → OS resolver cache (~1 ms, per machine) → Recursive resolver (ISP / 1.1.1.1 / 8.8.8.8) → Root nameservers (a-m.root-servers.net, 13) → TLD nameservers (.com, .io, .dev, …) → Authoritative nameserver (Route53 / Cloudflare / NS1) → Answer: 203.0.113.42, cached on the way back. Each cache hit short-circuits everything to its right. When a record changes, you wait for the longest cache TTL on the path.]
DNS resolution as a chain of caches. Most lookups never reach an authoritative server — that's the whole point.

Step by step, the first time

  1. Browser cache. Chrome holds a small DNS cache for ~1 minute. chrome://net-internals/#dns reveals it.
  2. OS resolver. The operating system's resolver service (systemd-resolved, mDNSResponder, or similar) caches results from prior lookups by any process on the same machine. scutil --dns on macOS, resolvectl status on Linux.
  3. Recursive resolver. Configured by DHCP or manually — 1.1.1.1 (Cloudflare), 8.8.8.8 (Google), or your ISP. Holds the largest cache because thousands of clients share it.
  4. Root → TLD → authoritative. If nothing is cached, the recursive resolver walks the hierarchy: ask a root server who handles .com, ask the .com TLD server who handles acme.com, then ask the authoritative server for the actual record.
  5. Cache on the way back. Every layer caches the answer for the record's TTL.

The first query for a brand-new domain might take 80–150 ms. The millionth query, answered from a browser or OS cache, takes about 1 ms; answered from the recursive resolver's cache, it costs one round trip to that resolver. The cache is the design.
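The chain of caches can be sketched in a few lines of Python. This is a toy model under loud assumptions: the `TtlCache` class and `resolve` function are invented for illustration, the root → TLD → authoritative walk is collapsed into a single `zone` dictionary, and time is a plain integer of seconds. Real resolvers are far more involved.

```python
class TtlCache:
    """One cache layer: keeps an answer until its expiry time passes."""

    def __init__(self, name):
        self.name = name
        self.entries = {}  # hostname -> (ip, expires_at)

    def get(self, hostname, now):
        entry = self.entries.get(hostname)
        if entry is not None and entry[1] > now:
            return entry
        return None

    def put(self, hostname, ip, expires_at):
        self.entries[hostname] = (ip, expires_at)


def resolve(hostname, layers, zone, now):
    """Ask each cache in order; fall back to the authoritative data.

    Returns (ip, source), where source names whoever answered.
    """
    for depth, layer in enumerate(layers):
        entry = layer.get(hostname, now)
        if entry is not None:
            ip, expires_at = entry
            # Cache on the way back: nearer layers inherit the same expiry.
            for nearer in layers[:depth]:
                nearer.put(hostname, ip, expires_at)
            return ip, layer.name
    ip, ttl = zone[hostname]  # stands in for root -> TLD -> authoritative
    for layer in layers:
        layer.put(hostname, ip, now + ttl)
    return ip, "authoritative"


layers = [TtlCache("browser"), TtlCache("os"), TtlCache("recursive")]
zone = {"api.acme.com": ("203.0.113.42", 300)}

first = resolve("api.acme.com", layers, zone, now=0)    # answered by the authoritative data
second = resolve("api.acme.com", layers, zone, now=10)  # answered by the browser cache

zone["api.acme.com"] = ("203.0.113.99", 300)            # the record changes
stale = resolve("api.acme.com", layers, zone, now=299)  # caches still hold the old IP
fresh = resolve("api.acme.com", layers, zone, now=300)  # TTL expired, new IP fetched
```

Notice that changing `zone` does nothing to the caches. The old answer keeps being served until its expiry time passes, which is the behavior the rest of this chapter keeps returning to.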

💡
Recursive vs authoritative
A recursive resolver answers "give me the IP for this name" by walking the hierarchy on your behalf and caching results. An authoritative nameserver answers "I'm in charge of this zone; here's the record" — it never recurses, never guesses, just speaks for the one zone it owns. Your job as a deploying engineer is to publish records on the authoritative side; the rest of the internet figures out the recursive part.

The Hierarchy — and Why It's Shaped Like This

Read a domain right-to-left, and you read its administrative tree:

text — domain anatomy
api  .  acme  .  com  .
 ↑       ↑       ↑    ↑
 host    SLD    TLD   root (implicit dot)

The trailing dot is real but rarely typed. It marks the root zone, the topmost authority. Underneath the root sit ~1500 top-level domains: legacy (.com, .net, .org), country-code (.uk, .de, .in, and also .io and .ai, which are country codes that the tech industry adopted), and the modern long tail (.dev, .app). Each TLD is run by a registry (Verisign for .com, Public Interest Registry for .org) which sells names through registrars (Namecheap, Cloudflare Registrar, Route53), who sell them to you.

Your acme.com is one entry in the .com registry. You set NS records with the registrar; those records point at your authoritative nameservers (Cloudflare, Route53, NS1, your own BIND installation). From that point onward, your nameserver is the source of truth for everything under acme.com.
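Reading a name right-to-left is easy to show in code. The helper below, `delegation_chain` (a hypothetical name, not part of any library), lists the names a resolver could walk from the root down. It assumes every label boundary is a possible zone cut; in practice api.acme.com usually lives inside the acme.com zone rather than being delegated separately.

```python
def delegation_chain(hostname):
    """Return the names from the root down to the full hostname.

    Each entry ends with the (normally implicit) trailing dot.
    """
    stripped = hostname.rstrip(".")
    if not stripped:
        return ["."]
    labels = stripped.split(".")
    chain = ["."]
    # Walk right-to-left: TLD first, then each more specific name.
    for start in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[start:]) + ".")
    return chain


chain = delegation_chain("api.acme.com")
# chain == [".", "com.", "acme.com.", "api.acme.com."]
```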

⚠️
The two-place truth problem
Your records exist in two places: at the registrar (NS records, glue) and at the authoritative nameserver (everything else). When you migrate DNS providers and forget to update the NS at the registrar, the world keeps asking your old nameserver — and your changes are invisible. Always do the NS-update last and verify with dig +trace acme.com.

The Records You'll Actually Use

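DNS is a record store. Each record has a name, a type, a TTL, and data. You'll meet maybe 30 record types over a career; in deployment you'll touch six regularly (A, AAAA, CNAME, MX, TXT, NS). The table lists those plus three more you should recognize.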

| Type | Purpose | Example |
|------|---------|---------|
| A | Hostname → IPv4 address | api.acme.com 300 A 203.0.113.42 |
| AAAA | Hostname → IPv6 address | api.acme.com 300 AAAA 2001:db8::42 |
| CNAME | Alias for another hostname | www.acme.com 300 CNAME acme.com. |
| MX | Mail exchange | acme.com 3600 MX 10 aspmx.l.google.com. |
| TXT | Free-form text (SPF, DKIM, verification) | acme.com 300 TXT "v=spf1 include:_spf.google.com ~all" |
| NS | Delegate a zone to nameservers | acme.com 86400 NS ns1.cloudflare.com. |
| SRV | Service location with port (XMPP, SIP, MS services) | _sip._tcp.acme.com 3600 SRV 10 60 5060 sip.acme.com. |
| CAA | Which CAs may issue certs for this name | acme.com 86400 CAA 0 issue "letsencrypt.org" |
| PTR | Reverse: IP → name (mostly for mail) | 42.113.0.203.in-addr.arpa PTR mail.acme.com. |
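The "name, TTL, type, data" shape of a record is simple enough to parse by hand. The sketch below handles only the simplified one-line form used in the examples above; `Record` and `parse_record` are illustrative names, and a real zone file (class field, $ORIGIN, omitted TTLs, multi-line records) needs a proper parser.

```python
from typing import NamedTuple


class Record(NamedTuple):
    name: str
    ttl: int
    rtype: str
    data: str


def parse_record(line):
    """Split 'name ttl type data...' into its four parts.

    Everything after the type is kept verbatim as data, so MX priorities
    and quoted TXT strings survive untouched.
    """
    name, ttl, rtype, data = line.strip().split(None, 3)
    return Record(name.rstrip(".").lower(), int(ttl), rtype.upper(), data)


mx = parse_record("acme.com 3600 MX 10 aspmx.l.google.com.")
txt = parse_record('acme.com 300 TXT "v=spf1 include:_spf.google.com ~all"')
```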

The CNAME trap

A CNAME says "to find www.acme.com, look up acme.com." Two rules govern it and both surprise people:

  • A name with a CNAME may not have any other records. Not an MX, not a TXT, nothing. The CNAME is the whole answer for that name.
  • The zone apex (the bare domain like acme.com) cannot be a CNAME. The apex always carries SOA and NS records, and usually MX and TXT as well, so a CNAME there breaks the first rule (RFC 1034).

This is why nearly every modern DNS provider invented a non-standard ALIAS or ANAME record (Route53 calls it an alias record, Cloudflare calls it CNAME flattening). It lets the apex point at a hostname like a load balancer's elb-1234.us-east-1.elb.amazonaws.com, with the provider doing the second lookup at query time and returning the resolved IPs. If you've ever wondered why pointing acme.com straight at an AWS ALB needs a Route53 alias record or another provider's equivalent, it's this.
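Both CNAME rules are mechanical, so a pre-deploy check can catch them. Here is a minimal sketch, assuming your records are available as (name, type) pairs; `cname_violations` is a made-up helper, and it ignores the DNSSEC record types that are allowed to sit beside a CNAME.

```python
def cname_violations(records, apex):
    """Return human-readable problems for the two CNAME rules.

    records: iterable of (name, record_type) pairs.
    apex: the zone's bare domain, e.g. "acme.com".
    """
    types_by_name = {}
    for name, rtype in records:
        key = name.rstrip(".").lower()
        types_by_name.setdefault(key, set()).add(rtype.upper())

    apex = apex.rstrip(".").lower()
    problems = []
    for name, types in sorted(types_by_name.items()):
        if "CNAME" not in types:
            continue
        if name == apex:
            problems.append(f"{name}: CNAME at the zone apex")
        others = sorted(types - {"CNAME"})
        if others:
            problems.append(f"{name}: CNAME coexists with {', '.join(others)}")
    return problems


zone_records = [
    ("acme.com", "SOA"), ("acme.com", "NS"), ("acme.com", "CNAME"),
    ("www.acme.com", "CNAME"),
    ("mail.acme.com", "CNAME"), ("mail.acme.com", "TXT"),
]
problems = cname_violations(zone_records, "acme.com")
```

With the sample data, www.acme.com passes, mail.acme.com is flagged for the extra TXT, and the apex is flagged twice: once for being a CNAME at all and once for sharing the name with SOA and NS.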

TTL — The Most Important Number You Set

Every record has a TTL (Time To Live, in seconds) which tells caches how long to keep the answer. It looks like a number, but it is really a policy: how quickly can I change my mind?

[Figure: cost plotted against TTL value, with two curves: query load on your nameserver and time to roll back a bad change. Short TTL: many queries, fast change. Long TTL: cheap queries, slow change. Pick deliberately.]
TTL trades query cost for change agility. Steady state: long TTL. Migration windows: drop to 60–300 seconds days in advance.

TTL playbook

  • Stable production records: 1 hour to 24 hours (3600–86400). Cheap, snappy.
  • Records you'll change soon (migration, IP rotation, blue-green): drop to 60–300 seconds at least one full TTL window before the change. The old TTL must expire before the new TTL takes effect — or some caches still hold a 24-hour answer.
  • Email/SPF/DKIM: 1 hour. You don't want a flaky DKIM rotation breaking deliverability for a day.
  • NS records: long (24–48 hours). They almost never change; long TTLs reduce the query load on the parent TLD's servers.
🚨
The lower-the-TTL-first trick
If your record currently has a 24-hour TTL and you're about to migrate, dropping the TTL to 300s right before the change doesn't help: a cache that fetched the record an hour before your edit still holds it for another 23 hours. Lower the TTL at least one current-TTL period before the migration, wait that out, then make the change. After the migration is stable, raise the TTL back. Forgetting this turns a planned 5-minute cutover into a 24-hour partial outage.
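The arithmetic behind that warning fits in two small functions. This is a sketch with hypothetical names and dates; it assumes caches honor TTLs exactly, which most but not all do.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at, old_ttl_seconds):
    """First moment when no cache can still hold an answer with the old TTL."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def worst_case_stale_window(cutover_at, ttl_lowered_at, old_ttl_seconds, new_ttl_seconds):
    """Longest time after the cutover that some cache may serve the old IP."""
    safe_at = earliest_safe_cutover(ttl_lowered_at, old_ttl_seconds)
    # A cache refreshed just before the cutover keeps the answer for new_ttl;
    # a cache refreshed just before the TTL edit keeps it until safe_at.
    return max(timedelta(seconds=new_ttl_seconds), safe_at - cutover_at)


lowered = datetime(2024, 1, 1, 9, 0)  # illustrative date: TTL edited 86400 -> 300

patient = worst_case_stale_window(
    cutover_at=datetime(2024, 1, 2, 9, 0), ttl_lowered_at=lowered,
    old_ttl_seconds=86400, new_ttl_seconds=300,
)  # waited a full old-TTL period: 5 minutes of staleness

hasty = worst_case_stale_window(
    cutover_at=datetime(2024, 1, 1, 10, 0), ttl_lowered_at=lowered,
    old_ttl_seconds=86400, new_ttl_seconds=300,
)  # cut over one hour after the edit: 23 hours of staleness
```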

Propagation Is a Lie — Cache Expiration Is the Truth

People say "DNS is propagating" as if records flood outward like water. They don't. DNS is pull, not push. Each cache holds whatever it cached at the moment it asked, until that record's TTL expires and it asks again.

This means "propagation" depends entirely on (1) which TTL was active when each cache last asked, and (2) when that cache last asked. There's no atomic switch; for the duration of the longest cached TTL, some users will see the old answer and some the new one. That's not a bug in DNS — it's the design.
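A small simulation makes the "no atomic switch" point concrete. The function names and the five fetch times are invented; the model assumes each cache re-queries the moment its entry expires, whereas a real cache waits for the next client query.

```python
def switch_times(last_fetch_times, ttl, change_at):
    """For each cache, the first moment it can return the new answer."""
    times = []
    for fetched_at in last_fetch_times:
        expires_at = fetched_at + ttl
        # An entry that expired before the change refetches the new value
        # on first use, so it is never later than the change itself.
        times.append(max(expires_at, change_at))
    return times


def fraction_on_new(last_fetch_times, ttl, change_at, now):
    """Share of caches already serving the new answer at time `now`."""
    times = switch_times(last_fetch_times, ttl, change_at)
    return sum(1 for t in times if t <= now) / len(times)


# Five resolvers that last fetched the record at different seconds.
fetches = [0, 600, 1800, 3000, 3500]
ttl = 3600
change_at = 3600  # the record is edited one hour in

at_change = fraction_on_new(fetches, ttl, change_at, now=3600)  # 0.2
mid_way = fraction_on_new(fetches, ttl, change_at, now=5400)    # 0.6
one_ttl_later = fraction_on_new(fetches, ttl, change_at, now=7200)  # 1.0
```

Nothing was pushed anywhere. Each cache simply ran out its own clock, and only a full TTL after the change is every one of them guaranteed to have the new answer.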

Tools for diagnosing

bash — DNS debugging cheat sheet
# Ask the authoritative server directly (bypass caches)
dig api.acme.com @ns1.cloudflare.com

# Trace the full delegation chain (root → tld → authoritative)
dig +trace api.acme.com

# Check what a public resolver sees (quote the URL so the shell leaves ? and & alone)
curl 'https://dns.google/resolve?name=api.acme.com&type=A'
# Browser tool for many vantage points: https://www.whatsmydns.net/#A/api.acme.com

# Resolve through the local systemd-resolved cache (Linux)
resolvectl query api.acme.com

# Force-flush local caches
sudo killall -HUP mDNSResponder              # macOS
sudo resolvectl flush-caches                  # systemd-resolved (older releases: systemd-resolve --flush-caches)
ipconfig /flushdns                            # Windows

The single most useful command on the list is dig +trace. It walks every step of the hierarchy and shows you exactly where the answer comes from, which is invaluable when a customer says "your site is down" and it loads fine for you.

How a Domain Becomes a Live Service

Putting it together: from buying a domain to getting traffic on day one looks like this.

  1. Register the domain at a registrar (Cloudflare, Namecheap, Route53). The registrar publishes your domain's existence in the TLD registry.
  2. Choose a DNS provider (often the same company; sometimes split: register at one, host DNS at another). Get the authoritative nameserver hostnames they assign you, typically two to four.
  3. At the registrar, set the NS records for your domain to those hostnames. The TLD usually picks up the new delegation within minutes to a couple of hours; resolvers that cached the old delegation keep it until its TTL expires.
  4. At the DNS provider, create your records. A or AAAA pointing to your server, or an ALIAS/CNAME when the target is a hostname such as an ALB. MX for email, CAA to lock down certificate issuance, TXT for verification.
  5. Verify with dig +trace. You should see your authoritative nameservers responding with the records you set.
  6. Point your service at it — configure your server to answer for that hostname, request a TLS cert (Day 4), serve traffic.
🌱
Buy and host together for the first project
For your first deployment, register the domain at a place that also runs DNS — Cloudflare and Route53 both do this well. You'll skip the NS-delegation step entirely and reduce the surface area for mistakes. Split (registrar at one, DNS at another) is a fine pattern for resilience and price; it's just one more thing that can go wrong while you're still learning.

Smart DNS — Health Checks, GeoDNS, Failover

Once a service is global, plain A records aren't enough. Modern DNS providers add answer-time logic on top of the protocol. Every feature below still answers with one of the ordinary record types, but with a brain behind the choice.

  • Health checks. The DNS provider periodically pings your endpoint; if it's down, it stops returning that IP in answers. Cloudflare load balancers, Route53 health checks, NS1 monitors all work this way.
  • GeoDNS. Return different IPs based on where the resolver is on the planet. A user in Frankfurt gets the EU IP; a user in São Paulo gets the SA IP.
  • Latency-based routing. Return the IP whose datacenter has the lowest RTT to the resolver.
  • Weighted routing. Split traffic 90/10 between two backends — the textbook way to canary a new release at the DNS layer.
  • Failover. Primary IP normally; if the health check fails, swap to the secondary.
💡
DNS-level routing has a lower bound
Because DNS is cached, you can't make a routing decision faster than the TTL. With a 60-second TTL, failover takes up to ~60 seconds after the health check notices the failure, and detection adds its own delay on top. For sub-second failover you need a layer-7 load balancer (NGINX, Envoy, ALB) sitting in front of multiple backends behind one IP; DNS just gets traffic to the LB.
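Weighted routing and failover share one idea: filter out unhealthy backends, then pick among the rest by weight. The sketch below is not how any particular provider implements it; `pick_answer`, the two IPs, and the 90/10 split are assumptions chosen to mirror the canary example.

```python
import random


def pick_answer(backends, healthy, rng):
    """Choose which IP to return for one query.

    backends: list of (ip, weight) pairs.
    healthy: set of IPs currently passing health checks.
    rng: a random.Random instance (passed in so runs are repeatable).
    """
    candidates = [(ip, weight) for ip, weight in backends
                  if ip in healthy and weight > 0]
    if not candidates:
        return None  # nothing healthy to answer with
    ips, weights = zip(*candidates)
    return rng.choices(ips, weights=weights, k=1)[0]


rng = random.Random(0)
backends = [("203.0.113.10", 90), ("203.0.113.20", 10)]  # stable, canary

# Both healthy: roughly 90% of answers go to the stable backend.
all_up = {"203.0.113.10", "203.0.113.20"}
picks = [pick_answer(backends, all_up, rng) for _ in range(10000)]
stable_share = picks.count("203.0.113.10") / len(picks)

# Stable backend fails its health check: every answer is the canary.
failover = pick_answer(backends, {"203.0.113.20"}, rng)
```

Keep in mind that the split applies to DNS answers, not to requests. One cached answer at a busy recursive resolver can steer thousands of users to the same backend until the TTL runs out.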

The Records That Aren't About Routing

Three classes of records exist purely to prove things to the rest of the internet — and they bite hard when wrong.

Email auth — SPF, DKIM, DMARC

If your domain sends email, three TXT records keep it out of spam folders.

  • SPF: a TXT at the apex listing which servers may send mail as your domain. v=spf1 include:_spf.google.com ~all.
  • DKIM: a TXT at selector._domainkey.acme.com holding the public key your mail provider signs outbound mail with.
  • DMARC: a TXT at _dmarc.acme.com telling receivers what to do with mail that fails SPF/DKIM. v=DMARC1; p=quarantine; rua=mailto:postmaster@acme.com.
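DMARC records use a "tag=value; tag=value" layout, which makes them easy to inspect in a script. A minimal sketch follows, with `parse_tags` and `dmarc_policy` as illustrative helper names; it does no validation beyond checking the version tag.

```python
def parse_tags(txt):
    """Parse 'k=v; k=v' TXT data into a dict (keys lowercased)."""
    tags = {}
    for part in txt.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_policy(txt):
    """Return the p= policy if this looks like a DMARC record, else None."""
    tags = parse_tags(txt)
    if tags.get("v") != "DMARC1":
        return None
    return tags.get("p")


record = "v=DMARC1; p=quarantine; rua=mailto:postmaster@acme.com"
tags = parse_tags(record)
policy = dmarc_policy(record)  # "quarantine"
```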

Domain ownership verification

SaaS tools (Google Workspace, GitHub Pages, Stripe) often ask you to add a TXT record with a unique token before they'll trust you with the domain. The pattern: they generate a token, you publish it, they re-query, see it, mark you verified.
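The token dance is short enough to sketch from the verifier's side. Everything here is hypothetical: the `acme-verify` prefix, the function names, and the assumption that the TXT values have already been fetched from DNS as a list of strings.

```python
import secrets


def new_token(prefix="acme-verify"):
    """Generate the value the customer is asked to publish in a TXT record."""
    return f"{prefix}={secrets.token_hex(16)}"


def is_verified(expected_token, txt_values):
    """True if any TXT value at the domain equals the expected token.

    TXT data often arrives wrapped in double quotes, so strip them first.
    """
    return any(value.strip('"') == expected_token for value in txt_values)


token = new_token()
before = is_verified(token, ['"v=spf1 include:_spf.google.com ~all"'])
after = is_verified(token, ['"v=spf1 include:_spf.google.com ~all"', f'"{token}"'])
```

Because the check re-queries DNS, the TTL on that TXT name decides how soon a corrected token becomes visible to the verifier.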

CAA — who's allowed to issue certificates

CAA records tell certificate authorities which CAs you authorize to issue TLS certs for your domain. Without it, any CA in the world's trust store can theoretically issue a cert for acme.com if they're tricked into believing it. With it, only the CAs you list will succeed.

text — typical CAA configuration
acme.com.   86400  CAA  0 issue "letsencrypt.org"
acme.com.   86400  CAA  0 issue "amazon.com"
acme.com.   86400  CAA  0 issuewild ";"            ; no wildcards
acme.com.   86400  CAA  0 iodef "mailto:secops@acme.com"
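How a CA reads those four lines can be approximated in code. This is a simplified reading of the CAA rules (RFC 8659), not a compliant implementation: `ca_may_issue` is a made-up function, and it skips the climb to parent domains, the critical flag, and issuer parameters.

```python
def ca_may_issue(caa_records, ca_domain, wildcard=False):
    """Decide whether a CA may issue, given a name's CAA records.

    caa_records: list of (tag, value) pairs, e.g. ("issue", "letsencrypt.org").
    """
    issue = [value for tag, value in caa_records if tag == "issue"]
    issuewild = [value for tag, value in caa_records if tag == "issuewild"]
    # For wildcard certificates, issuewild wins whenever it is present.
    relevant = issuewild if (wildcard and issuewild) else issue
    if not relevant:
        return True  # no applicable restriction published
    # The issuer name is everything before an optional ';' parameter list.
    # A bare ";" yields an empty name, which matches no CA.
    allowed = {value.split(";", 1)[0].strip().lower() for value in relevant}
    return ca_domain.lower() in allowed


caa = [
    ("issue", "letsencrypt.org"),
    ("issue", "amazon.com"),
    ("issuewild", ";"),
    ("iodef", "mailto:secops@acme.com"),
]
normal_cert = ca_may_issue(caa, "letsencrypt.org")                   # True
other_ca = ca_may_issue(caa, "digicert.com")                         # False
wildcard_cert = ca_may_issue(caa, "letsencrypt.org", wildcard=True)  # False
```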

Things That Will Bite You

The DNS war stories that show up in postmortems are mostly the same five mistakes:

  1. Forgetting to lower TTLs before a migration. Half your users see the old IP for 24 hours. (Covered above.)
  2. NS records out of sync. You moved DNS providers but didn't update NS at the registrar. The world still queries the old one.
  3. CNAME at the apex. You set a CNAME on acme.com; some resolvers tolerate it, others return SERVFAIL. Use ALIAS/ANAME or A.
  4. Missing reverse DNS for outgoing email. Your IP has no PTR pointing to your sender domain; receivers reject the mail.
  5. Resolver and provider outages. A bad config push at a major recursive resolver (it has happened to 1.1.1.1, 8.8.8.8, ISPs) breaks half the internet's view of your domain, and there's nothing you can do but wait. The outage you can defend against is on the authoritative side: run authoritative nameservers across more than one provider, because multiple records at the same provider don't help if that provider is down.
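For mistake 4, it helps to know which name the PTR record has to live at. Python's standard `ipaddress` module can build it; the wrapper name `ptr_name` is just for illustration, and the addresses are the documentation examples used earlier in this chapter.

```python
import ipaddress


def ptr_name(ip):
    """Return the reverse-DNS name where the PTR record for `ip` belongs."""
    return ipaddress.ip_address(ip).reverse_pointer


v4_name = ptr_name("203.0.113.42")  # "42.113.0.203.in-addr.arpa"
v6_name = ptr_name("2001:db8::42")  # a nibble-reversed name under ip6.arpa
```

The reverse zone is normally controlled by whoever owns the IP block (your cloud or hosting provider), so setting the PTR is usually a request to them rather than an edit in your own zone.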
Quick check
You're migrating api.acme.com from one server to another at 09:00 tomorrow. The current TTL is 3600 (1 hour). What do you do today, and what do you do tomorrow?
Answer
Today (≥ 1 hour before the migration): change the TTL on the existing record from 3600 to, say, 60. Caches holding the old 1-hour entry expire over the next hour, after which they refresh every minute. At 09:00 tomorrow: change the IP. Within ~60 seconds, nearly all caches will pick up the new value. Once stable: raise the TTL back to 3600 to reduce nameserver load. The trap is doing both edits at once: lowering the TTL doesn't retroactively shrink cache entries that were fetched with the old one.

Where DNS Sits in the Stack You're Building

The rest of this course is about the things DNS points to: the EC2 box (Day 2), the NGINX in front (Day 3), the TLS cert that proves you own the name (Day 4), the deploy process that swaps what's behind the IP (Day 5), the container the process runs in (Day 6), and the pipeline that automates all of it (Day 7). DNS is the entry point; treat it carefully and the rest gets easier. Treat it carelessly and you'll spend a lot of nights with dig +trace.

Mnemonic — DNS in five words
"Hierarchy. Records. TTL. Cache. Authority."
  • Hierarchy — root → TLD → your zone.
  • Records — A, AAAA, CNAME, MX, TXT, NS, CAA.
  • TTL — controls how fast you can change your mind.
  • Cache — every step of the resolution path holds one.
  • Authority — your nameserver is the truth; the registrar's NS records say so.
Flashcard
Why can't you put a CNAME on acme.com (the apex), and what do you use instead when you want the apex to point at a hostname like an AWS load balancer?
Answer
Why not: the apex of a zone must hold SOA and NS records (and almost always MX, TXT). RFC 1034 forbids a name with a CNAME from having any other records, so a CNAME at the apex is illegal. What instead: use a provider-specific ALIAS or ANAME record (Route53 "alias", Cloudflare "CNAME flattening", DNSimple ALIAS). The provider resolves the target hostname at query time and returns A/AAAA records to the client, sidestepping the apex-CNAME prohibition.
🔑
Key takeaways
1) DNS is a chain of caches in front of an authoritative source — most queries never reach the source, and that's the design. 2) Six records cover most deployment work: A, AAAA, CNAME, MX, TXT, NS — plus CAA for cert safety. 3) TTL is policy, not just a number: lower it before migrations, raise it for steady-state. 4) "Propagation" is just cache expiration; you can't go faster than the longest TTL on the path. 5) Use dig +trace when something feels wrong — it answers nine out of ten DNS mysteries.
