The Engineering Codex/From Code to Internet: Deployment & Operations
DAY 4

TLS & Certificate Lifecycle — Padlocks That Don't Expire on Christmas

TLS turned the web from a plaintext gossip channel into a private one. Today every browser shows users a giant warning if your site doesn't have it — and your certificate has a 90-day expiry counting down whether you remember or not. Master the handshake, the chain of trust, Let's Encrypt and ACME, automatic renewal, HSTS, OCSP stapling, and the boring discipline that keeps the padlock green at 3 AM on a holiday.

What you will learn

01 What TLS Actually Does
02 The Handshake — Five Steps in TLS 1.3
03 The Chain of Trust
04 ACME and Let's Encrypt
05 Wiring TLS Into NGINX
06 HSTS — The Browser Lock

TLS is the layer that lets you give a credit card to a website without the coffee shop on the corner reading it. It's also the layer most likely to wake you up on a holiday: certs expire, ACME challenges fail, intermediate CAs change, browsers tighten their requirements every year. The good news: the operational work is mostly automatable, and a handful of patterns turn TLS from a chore into a thing that just works. The path here is to first understand the handshake and the chain of trust, then learn the Let's Encrypt + certbot loop that has become the default, then layer on the production niceties: HSTS, OCSP stapling, modern cipher selection, and the renewal monitoring that catches the one box that forgot.

🔑
Today's outcome
1) The TLS 1.3 handshake in five steps and why you only pay for it once per connection. 2) The chain of trust — leaf, intermediate, root — and what the browser actually verifies. 3) ACME & Let's Encrypt: the protocol, the challenges, certbot. 4) Automatic renewal that keeps working when you're not looking. 5) The production polish: HSTS, OCSP stapling, modern ciphers, CAA, ECDSA. 6) What goes wrong at 3 AM and how to keep it from doing so.

What TLS Actually Does

Three properties, in order of how often they're broken in a careless deployment:

  1. Encryption — bytes on the wire are unreadable to anyone but the two endpoints.
  2. Integrity — bytes can't be modified in flight without detection.
  3. Authentication of the server — the client knows it's actually talking to acme.com, not someone pretending. (Mutual TLS adds client authentication too.)

Encryption alone is parlour-trick simple — a static shared key would do it. The hard part is the third bullet: how does your browser, having never met acme.com before, verify on the first byte that it's the real one? The answer is the chain of trust, and TLS spends most of its complexity here.

The Handshake — Five Steps in TLS 1.3

TLS 1.3 (RFC 8446, 2018) collapsed the older 2-RTT handshake into one. Here's what happens before any HTTP byte flows:

  1. Client → Server: ClientHello (supported ciphers, key share, SNI=acme.com)
  2. Server → Client: ServerHello + Certificate + key share (plus EncryptedExtensions, CertificateVerify, Finished)
  3. Client validates the cert chain against its trust anchors
  4. Client → Server: Client Finished; the handshake is done
  5. Application data flows encrypted in both directions
One round trip total. With session resumption (0-RTT), the next visit can send app data with the very first packet.
TLS 1.3: client and server each send a key share in their first message; the rest is encrypted from the second message onward.

The crucial mechanism is ephemeral key exchange. Both sides contribute random key shares and combine them (Diffie-Hellman) to derive a session secret that only those two parties can compute. Because the private halves of the shares are discarded after the handshake, an attacker who later steals the server's long-term key still cannot recover that secret. This property, forward secrecy, means past traffic can't be decrypted after a future key compromise. TLS 1.3 makes it mandatory.
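To make the key-share idea concrete, here is a toy Diffie-Hellman exchange in Python. It is a sketch under heavy assumptions: real TLS 1.3 uses elliptic-curve groups such as X25519 and feeds the result through a key schedule, while the tiny prime `P`, the base `G`, and the function names below are illustrative only.

```python
import secrets

# Toy parameters for illustration only; far too small for real use.
P = 4294967291  # the largest prime below 2**32
G = 5           # public base both sides agree on


def make_key_share():
    """Return (private, public): a fresh ephemeral pair for one handshake."""
    private = secrets.randbelow(P - 3) + 2  # random value in [2, P - 2]
    public = pow(G, private, P)             # this is the "key share" sent on the wire
    return private, public


def shared_secret(own_private, peer_public):
    """Combine our private value with the peer's public key share."""
    return pow(peer_public, own_private, P)


# Each side generates its own pair and sends only the public half.
client_private, client_public = make_key_share()
server_private, server_public = make_key_share()

client_view = shared_secret(client_private, server_public)
server_view = shared_secret(server_private, client_public)
print(client_view == server_view)
```

Neither private value ever crosses the network, and both are thrown away after the handshake, which is why a later theft of the server's certificate key does not help with recorded traffic.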

SNI — many sites on one IP

Step 1 includes the Server Name Indication (SNI) extension: the hostname the client is trying to reach, sent before the server picks a certificate. Without SNI, every TLS host needed its own IP. With SNI, NGINX can serve acme.com and example.org on the same IP, picking the right cert based on the SNI value. SNI is unencrypted in TLS 1.3 by default; Encrypted Client Hello (ECH) is the proposed fix and is rolling out gradually.
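The selection step can be pictured as a lookup keyed on the SNI value. The sketch below is an illustration in Python, not how NGINX is implemented; the `CERTS` table, the `example.org` path, and the `select_certificate` helper are assumptions made for the example.

```python
# Hypothetical table: hostname -> certificate chain served for it.
CERTS = {
    "acme.com": "/etc/letsencrypt/live/acme.com/fullchain.pem",
    "example.org": "/etc/letsencrypt/live/example.org/fullchain.pem",
}


def select_certificate(sni, cert_paths, default_host):
    """Pick the cert for the requested name, or the default server's cert.

    `sni` is None when the client sent no SNI extension at all.
    """
    host = (sni or "").rstrip(".").lower()  # hostnames are case-insensitive
    return cert_paths.get(host, cert_paths[default_host])


print(select_certificate("Example.ORG", CERTS, default_host="acme.com"))
```

A client that sends no SNI, or an unknown name, gets the default server's cert, which is usually where the hostname-mismatch warnings for forgotten vhosts come from.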

The Chain of Trust

Authentication hinges on a public-key infrastructure (PKI). Your browser ships with a list of ~140 root CAs it trusts (the trust store, baked into the OS or browser). A certificate signed by one of those — directly or transitively — is trusted; one that isn't, isn't.

  Root CA (trust anchor): ISRG Root X1, in your OS
  Intermediate CA: Let's Encrypt R3, signed by the root
  Leaf cert (your domain): acme.com, signed by R3
Server sends leaf + intermediate; client trusts root.
A typical chain. The server sends the leaf and intermediate; the browser already has the root.

What the browser actually checks

  • Signature chain. Leaf signed by intermediate; intermediate signed by a root the browser trusts.
  • Validity dates. Each cert in the chain is currently in its not before / not after window.
  • Hostname match. The hostname in the URL matches one of the leaf's subjectAltName entries (the CN hasn't been the source of truth in browsers since ~2017).
  • Revocation. The cert hasn't been revoked (via OCSP or CRLs — see below).
  • CT logs. The cert appears in public Certificate Transparency logs (Chrome has enforced this since 2018).
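Two of these checks, validity dates and hostname match, are simple enough to sketch in Python. This is a simplified illustration, not a browser's implementation: it assumes the SAN entries and dates have already been extracted from the leaf, handles only a wildcard in the leftmost label, and skips signatures, revocation, and CT entirely. The function names are hypothetical.

```python
from datetime import datetime, timezone


def hostname_matches(hostname, san_entry):
    """Compare a hostname with one subjectAltName DNS entry.

    A leading "*" label stands for exactly one label, so "*.acme.com"
    covers "www.acme.com" but neither "acme.com" nor "a.b.acme.com".
    """
    host_labels = hostname.rstrip(".").lower().split(".")
    san_labels = san_entry.rstrip(".").lower().split(".")
    if len(host_labels) != len(san_labels):
        return False
    if san_labels[0] == "*":
        return host_labels[1:] == san_labels[1:]
    return host_labels == san_labels


def in_validity_window(not_before, not_after, now):
    """True while `now` lies inside the cert's not before / not after window."""
    return not_before <= now <= not_after


def leaf_checks(hostname, sans, not_before, not_after, now):
    """Run both checks and report each result separately."""
    return {
        "validity": in_validity_window(not_before, not_after, now),
        "hostname": any(hostname_matches(hostname, san) for san in sans),
    }


result = leaf_checks(
    "www.acme.com",
    sans=["acme.com", "*.acme.com"],
    not_before=datetime(2026, 1, 1, tzinfo=timezone.utc),
    not_after=datetime(2026, 4, 1, tzinfo=timezone.utc),
    now=datetime(2026, 2, 1, tzinfo=timezone.utc),
)
print(result)
```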
⚠️
Always send the intermediate
A common misconfiguration: you upload only the leaf cert to NGINX, not the chain. Browsers that already have the intermediate cached look fine; mobile or fresh browsers can't build the chain and show a warning. Use fullchain.pem (leaf + intermediates) as ssl_certificate, never just the leaf. Test publicly with ssllabs.com — "Chain issues: incomplete" is the giveaway.

ACME and Let's Encrypt

Before 2016, getting a TLS cert involved a credit card, a CSR you didn't fully understand, and a phone call. Then Let's Encrypt and the ACME protocol made it free, automated, and 90 days at a time. Today, ~half the public internet's certs come through ACME.

How ACME works

  1. You generate a key pair for your account; ACME uses the public key as your identity.
  2. You ask for a cert for a list of names (acme.com, www.acme.com).
  3. The CA challenges you to prove you control each name. Three challenge types:
    • HTTP-01: serve a specific token at http://acme.com/.well-known/acme-challenge/<token>. The CA fetches it. The default for web servers.
    • DNS-01: publish a TXT record at _acme-challenge.acme.com with a specific value. Required for wildcards. Works behind any firewall.
    • TLS-ALPN-01: respond to a special TLS handshake on port 443 with the token. Useful when you can't intercept HTTP-01 paths.
  4. You publish the proof at the location/value the CA specified.
  5. The CA verifies, then issues the cert, signed by their intermediate.
  6. You install the cert and restart your server.
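The "proof" in steps 3 and 4 is a value derived from the challenge token and your account key. The Python sketch below follows my reading of RFC 8555 and RFC 7638: the key authorization is the token, a dot, and the base64url SHA-256 thumbprint of the account's public JWK; HTTP-01 serves that string as the file body, and DNS-01 publishes the base64url SHA-256 digest of it. The JWK and token shown are placeholders, not a real key, and an ACME client such as certbot does all of this for you.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as used throughout ACME."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 thumbprint over the required JWK members, sorted, no whitespace."""
    required = {"EC": ("crv", "kty", "x", "y"), "RSA": ("e", "kty", "n")}[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in required},
        separators=(",", ":"),
        sort_keys=True,
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, jwk: dict) -> str:
    return f"{token}.{jwk_thumbprint(jwk)}"


def http01_path(token: str) -> str:
    """URL path the CA fetches; the response body is the key authorization."""
    return f"/.well-known/acme-challenge/{token}"


def dns01_txt_value(token: str, jwk: dict) -> str:
    """Value for the TXT record at _acme-challenge.<domain>."""
    digest = hashlib.sha256(key_authorization(token, jwk).encode("ascii")).digest()
    return b64url(digest)


# Placeholder account key and token, for illustration only.
account_jwk = {"kty": "EC", "crv": "P-256", "x": "placeholder-x", "y": "placeholder-y"}
token = "example-token"

print(http01_path(token))
print(key_authorization(token, account_jwk))
print(dns01_txt_value(token, account_jwk))
```

Because the thumbprint binds the proof to your account key, someone who merely observes the token cannot answer the challenge on your behalf.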

certbot — the canonical client

certbot is the EFF-maintained ACME client; it does all of the above and ships with hooks for popular web servers.

bash — get a cert in under a minute
sudo apt install certbot python3-certbot-nginx

# Interactive: detects nginx server blocks, modifies them, reloads
sudo certbot --nginx -d acme.com -d www.acme.com

# Or, manage cert + nginx config separately (cleaner for IaC)
sudo certbot certonly --webroot -w /var/www/html -d acme.com -d www.acme.com
# Cert lands at /etc/letsencrypt/live/acme.com/{fullchain,privkey,chain,cert}.pem

# Wildcard (*.acme.com) requires DNS-01
# (needs the DNS plugin too: sudo apt install python3-certbot-dns-cloudflare)
sudo certbot certonly --dns-cloudflare \
  --dns-cloudflare-credentials ~/.secrets/cloudflare.ini \
  -d acme.com -d '*.acme.com'

certbot installs a systemd timer (systemctl status certbot.timer) that runs certbot renew twice a day. It only actually renews certs in the last 30 days of validity, so you get plenty of room to catch failures before they bite.
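The renewal decision itself is a date comparison. The following Python sketch mirrors the behaviour described above (renew only inside the last 30 days); it is not certbot's code, and `should_renew` and `RENEW_WINDOW` are names invented for the illustration.

```python
from datetime import datetime, timedelta, timezone

RENEW_WINDOW = timedelta(days=30)  # window described above


def should_renew(not_after, now, window=RENEW_WINDOW):
    """True once the cert has entered its final `window` of validity."""
    return not_after - now <= window


issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
not_after = issued + timedelta(days=90)

print(should_renew(not_after, issued + timedelta(days=10)))  # early in the cert's life
print(should_renew(not_after, issued + timedelta(days=75)))  # inside the last 30 days
```

With a twice-daily timer and a 30-day window, a failing renewal gets roughly 60 attempts before the cert actually expires, which is the slack your monitoring has to notice the problem.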

💡
90-day expiry is a feature
Older paid CAs sold one- or two-year certs. The shorter the lifetime, the less time a stolen private key remains useful — and the more you're forced to actually verify renewal works. "It worked for two years" is no comfort if the third year breaks at midnight on a Saturday. Embrace the 90-day cycle: your renewal automation is the most important code you don't write.

Wiring TLS Into NGINX

The Day 3 starter config already references certbot's output. The settings worth knowing:

nginx — modern TLS configuration
server {
    listen 443 ssl http2;               # on nginx 1.25.1+, use a separate "http2 on;" directive instead
    server_name acme.com;

    # Cert + chain (fullchain.pem = leaf + intermediates)
    ssl_certificate     /etc/letsencrypt/live/acme.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/acme.com/privkey.pem;

    # Protocols: drop TLS 1.0/1.1 (deprecated since 2020)
    ssl_protocols       TLSv1.2 TLSv1.3;

    # Cipher suites — Mozilla "intermediate" profile, copy/paste
    ssl_ciphers         ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305;
    ssl_prefer_server_ciphers off;        # honour client preference (better with mobile)

    # Session reuse — saves the handshake on repeat visits
    ssl_session_cache    shared:SSL:50m;
    ssl_session_timeout  1d;
    ssl_session_tickets  off;             # no tickets: un-rotated ticket keys weaken forward secrecy

    # OCSP stapling — server fetches revocation status for the client
    ssl_stapling          on;
    ssl_stapling_verify   on;
    resolver              1.1.1.1 8.8.8.8 valid=300s;
    resolver_timeout      5s;

    # HTTP Strict Transport Security: "never come back over http"
    # (end-state values; ramp max-age up first and add preload last, see the HSTS section)
    add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;

    # The rest of your config ↓
}

For the cipher list specifically, don't roll your own. Use ssl-config.mozilla.org — it generates a config matching Mozilla's three profiles (modern, intermediate, old) and is updated as TLS evolves.

HSTS — The Browser Lock

HSTS (HTTP Strict Transport Security) tells the browser: "never visit this site over plain HTTP again — for the next N seconds, upgrade everything to HTTPS automatically." Once a browser has received the header from your domain, even a typed-in http://acme.com goes over HTTPS without an opportunity for an attacker to downgrade.

  • max-age: usually 1 year (31536000) or 2 years (63072000).
  • includeSubDomains: applies the policy to foo.acme.com as well. Don't enable this until you're sure every subdomain has TLS.
  • preload: opt in to the browser's hardcoded HSTS preload list. Once on the list, browsers never even ask — they refuse plain HTTP from the first visit. Submit at hstspreload.org. Hard to undo; commit only when you're sure.
🚨
HSTS is sticky
Set max-age to a year for a few minutes during an experiment, and a year-long policy embeds itself in every visitor's browser. For the rest of that year the browser refuses to reach your site over anything but HTTPS. Sending max-age=0 does clear the policy, but only over a working HTTPS connection, so if you can't keep TLS up you can't switch HSTS off for users who already cached it. Roll out with a small max-age first (a day, then a week, then a year), and only enable preload when you're committed to HTTPS forever.
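One way to keep the rollout honest is to generate the header value from an explicit stage rather than hand-editing it. The Python sketch below is an illustration; `ROLLOUT_STAGES` and `hsts_header` are hypothetical names, and the preload guard reflects the hstspreload.org submission requirements as I understand them (max-age of at least one year plus includeSubDomains).

```python
ONE_YEAR = 31536000

# Stages suggested above: a day, then a week, then a year.
ROLLOUT_STAGES = {
    "day": 86400,
    "week": 604800,
    "year": ONE_YEAR,
}


def hsts_header(max_age, include_subdomains=False, preload=False):
    """Build a Strict-Transport-Security header value."""
    if preload and not (include_subdomains and max_age >= ONE_YEAR):
        raise ValueError("preload needs includeSubDomains and max-age of at least one year")
    parts = [f"max-age={max_age}"]
    if include_subdomains:
        parts.append("includeSubDomains")
    if preload:
        parts.append("preload")
    return "; ".join(parts)


for stage, seconds in ROLLOUT_STAGES.items():
    print(stage, "->", hsts_header(seconds))
```

Refusing to emit preload on a short max-age makes the irreversible step impossible to reach by accident during the ramp-up.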

OCSP Stapling — Revocation Without the Round-Trip

If a private key is stolen, you revoke the cert. The CA publishes the revocation; clients learn about it via OCSP (Online Certificate Status Protocol) or CRLs (Certificate Revocation Lists). Both have problems: OCSP queries leak which sites you visit to the CA; CRLs are large and slow to refresh.

OCSP stapling fixes this. The server periodically fetches the CA-signed OCSP response and "staples" it to the TLS handshake, so the client gets revocation status without contacting the CA itself. The server cannot forge a "good" status, because the signature on the response is the CA's, not its own. Cheaper, more private, faster.

bash — verify stapling works
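echo | openssl s_client -connect acme.com:443 -servername acme.com -status 2>/dev/null \
  | grep -iE 'OCSP response|Cert Status'
# -i matters: the status line is spelled "OCSP Response Status" with a capital R.
# Look for: "OCSP Response Status: successful" + "Cert Status: good"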

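If stapling silently fails (usually because of an outdated chain.pem file or a blocked outbound connection from your server to the OCSP responder), TLS still works, but your server is making the client do the revocation lookup. Watch for it in error.log: OCSP responder query failed. One caveat: Let's Encrypt shut down its OCSP service during 2025 and now publishes revocation through CRLs only, so its current certs carry no OCSP URL and NGINX skips stapling for them with a warning. The directives still apply to certs from CAs that run an OCSP responder.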

CAA Records — Bound to Day 1

Day 1 introduced CAA records — DNS records that name the CAs allowed to issue certs for your domain. Set them now, before you have an incident:

text — restrict cert issuance
acme.com.   86400  CAA  0 issue    "letsencrypt.org"
acme.com.   86400  CAA  0 issuewild ";"             ; forbid wildcards (omit this line if you issue *.acme.com)
acme.com.   86400  CAA  0 iodef "mailto:secops@acme.com"

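Every compliant CA must check CAA before issuing, so an attacker who tricks some other CA's domain validation still gets refused. The remaining path to a rogue cert is taking control of your DNS itself, at which point they can rewrite the CAA records too. The cost is a single DNS edit.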
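The decision a CA makes from those records can be sketched in a few lines of Python. This is a simplified reading of the CAA rules: it ignores the climb up the DNS tree, critical flags, and issuer parameters, and `ca_may_issue` is a hypothetical helper rather than any CA's real code.

```python
def issuer_domain(value):
    """Strip parameters: 'letsencrypt.org; k=v' -> 'letsencrypt.org', ';' -> ''."""
    return value.split(";", 1)[0].strip().lower()


def ca_may_issue(records, ca_domain, wildcard=False):
    """Decide whether `ca_domain` may issue, given (tag, value) CAA records.

    Wildcard requests use issuewild records when any exist, otherwise they
    fall back to issue records. No relevant records means no restriction.
    """
    issue = [value for tag, value in records if tag == "issue"]
    issuewild = [value for tag, value in records if tag == "issuewild"]
    relevant = issuewild if (wildcard and issuewild) else issue
    if not relevant:
        return True
    return any(issuer_domain(value) == ca_domain.lower() for value in relevant)


# The records from the zone snippet above (iodef does not affect issuance).
records = [
    ("issue", "letsencrypt.org"),
    ("issuewild", ";"),
    ("iodef", "mailto:secops@acme.com"),
]

print(ca_may_issue(records, "letsencrypt.org"))
print(ca_may_issue(records, "letsencrypt.org", wildcard=True))
print(ca_may_issue(records, "other-ca.example"))
```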

RSA vs ECDSA — The 2-Key Setup

Modern Let's Encrypt issues both RSA and ECDSA leaf certs. ECDSA is faster, smaller, and supported by 99%+ of clients in the wild. RSA is still the universal fallback. The current best practice on a busy server is to serve both — NGINX picks based on the client's capability:

nginx — dual-cert configuration
ssl_certificate     /etc/letsencrypt/live/acme.com-ecdsa/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/acme.com-ecdsa/privkey.pem;
ssl_certificate     /etc/letsencrypt/live/acme.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/acme.com/privkey.pem;

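Get the second cert with certbot certonly --key-type ecdsa --cert-name acme.com-ecdsa, so it lands in the directory the config above expects. (certbot 2.0 and later default to ECDSA, so on those versions the RSA cert is the one that needs an explicit --key-type rsa.) Most clients now negotiate ECDSA, which trims handshake CPU under load.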

The Renewal Discipline

Certs expire. Browsers refuse expired certs. Every TLS outage post-mortem has the same shape: "the renewal job had been failing for 60 days but nobody noticed." Three layers of defence:

  1. Automated renewal — certbot's timer or a CI job. Run twice a day; the operation is idempotent.
  2. Renewal hooks — auto-reload NGINX when a cert is renewed:
    bash — /etc/letsencrypt/renewal-hooks/deploy/reload-nginx.sh
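    #!/usr/bin/env bash
    set -euo pipefail
    nginx -t                 # abort (non-zero exit) if the config no longer parses
    systemctl reload nginx   # pick up the renewed cert without dropping connections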
  3. External monitoring. Don't trust the renewal job to tell you it's broken — it might not run at all. Probe from outside:
    bash — expiry of the live cert
    echo | openssl s_client -servername acme.com -connect acme.com:443 2>/dev/null \
      | openssl x509 -noout -enddate -checkend $((14 * 86400))
    # -enddate prints notAfter; -checkend exits non-zero if the cert expires within 14 days.
    # Wire that into a Prometheus exporter or a UptimeRobot SSL check; alert at 14 days.
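If you would rather own the probe than depend on a hosted check, the same measurement fits in a short Python script using only the standard library. Treat it as a sketch: `fetch_not_after`, `days_remaining`, `needs_alert`, and the 14-day threshold constant are names chosen for the example, and the alerting hook is left out.

```python
import socket
import ssl
import time

ALERT_DAYS = 14  # threshold suggested above


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect like a client would and return the leaf's notAfter string.

    An already-expired or otherwise invalid cert raises
    ssl.SSLCertVerificationError, which is itself worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after, now=None):
    """Whole days until expiry; negative once the cert has expired."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return int((expires - now) // 86400)


def needs_alert(not_after, now=None, threshold_days=ALERT_DAYS):
    return days_remaining(not_after, now) <= threshold_days
```

A cron job or exporter would call `needs_alert(fetch_not_after("acme.com"))` from a machine outside your network, so the check sees whatever cert real visitors see, including one served by a CDN or load balancer.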

mTLS — When the Client Has a Cert Too

Mutual TLS extends the handshake: the server requests a cert from the client, and the connection only succeeds if the client presents a valid one signed by an authority the server trusts. Use cases:

  • Service-to-service auth in a service mesh (Istio, Linkerd, Cloudflare Access) — every internal call presents a cert.
  • IoT and partner APIs — your customers install a client cert; only those certs can call your API.
  • Bastion or admin endpoints — stronger than passwords, less hassle than VPN for small teams.
nginx — require a client cert
server {
    listen 443 ssl;
    ssl_certificate     /etc/letsencrypt/live/api.acme.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.acme.com/privkey.pem;

    ssl_client_certificate /etc/nginx/ssl/partner-ca.pem;   # CA that signs partner clients
    ssl_verify_client      on;                               # require, fail otherwise
    ssl_verify_depth       2;

    location /webhook/ {
        proxy_set_header X-Client-DN $ssl_client_s_dn;
        proxy_pass       http://app_pool;
    }
}

The DN of the verified client is exposed as $ssl_client_s_dn; pass it to your app as a header for authorization decisions. This pattern replaces shared API tokens for high-trust integrations.
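On the application side, the authorization decision can be as small as an allowlist lookup. The Python sketch below assumes the app receives request headers as a dict and that the DN arrives in the comma-separated form shown; the partner DN and `ALLOWED_CLIENT_DNS` table are made up for the example.

```python
# Hypothetical allowlist: exact subject DN -> internal partner id.
ALLOWED_CLIENT_DNS = {
    "CN=partner-a,O=Example Partner,C=US": "partner-a",
}


def authorize(headers):
    """Return the partner id for a verified client DN, or None to reject.

    Matching the whole DN string avoids parsing it, at the cost of having
    to update the table if a partner's cert subject changes.
    """
    dn = headers.get("X-Client-DN", "").strip()
    return ALLOWED_CLIENT_DNS.get(dn)


print(authorize({"X-Client-DN": "CN=partner-a,O=Example Partner,C=US"}))
print(authorize({"X-Client-DN": "CN=someone-else,O=Unknown,C=US"}))
```

This only holds if the app is reachable solely through NGINX. Anything that can call the app directly can set X-Client-DN to whatever it likes.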

What Goes Wrong (and How You Notice)

  1. Cert expired. All clients see warnings. Cause: renewal job has been failing silently. Defence: external monitor (cert expiry alerts at 14 days).
  2. Chain incomplete. Some clients fine, others see warnings. Cause: ssl_certificate points at cert.pem instead of fullchain.pem. Test with ssllabs.com.
  3. Hostname mismatch. User typed www.acme.com and your cert is only for acme.com. Defence: include both names; redirect www→apex (or vice versa).
  4. HSTS sticky after a misstep. A misconfigured server sent HSTS for the wrong hostname; users can't reach the site over plain HTTP. There's no fix on the server side; users have to clear HSTS state in their browser. Roll HSTS out gradually.
  5. Port 80 blocked. The ACME HTTP-01 challenge fails because the CA can't reach your server on port 80, which HTTP-01 validation (and therefore renewal) requires. Either keep port 80 open (redirect to HTTPS), or use DNS-01.
  6. Mixed content. Page loads over HTTPS but pulls a script over HTTP — browsers block. Cause: hardcoded http:// URLs in templates. Fix at the source; serve everything as https:// or relative.
  7. Cipher mismatch. An ancient client (a payment terminal, an old phone) can't negotiate. Mozilla's "intermediate" config is the right balance for 2026; "modern" excludes enough older devices that consumer-facing apps should adopt it cautiously.
Quick check
A monitoring page-out at 02:14 says "acme.com SSL cert expires in 7 days" but certbot's certbot renew --dry-run exits 0. What are the two most likely causes, and how do you fix each?
Answer
1) The renewal hook didn't reload nginx, so NGINX is still serving the old cert from memory. certbot renew succeeded weeks ago, but the deploy hook script either didn't exist or had a permission problem. Check /etc/letsencrypt/renewal-hooks/deploy/; verify systemctl reload nginx is wired in; reload manually now. 2) The cert is renewed on the origin, but a CDN or load balancer in front of it terminates TLS with its own cert. If CloudFront or an ALB is in front of NGINX, it has its own cert lifecycle: ACM (or your CDN's cert store) is independent and may not be hooked into your renewal. Pull the cert at the edge from outside your network with echo | openssl s_client -connect acme.com:443 -servername acme.com 2>/dev/null | openssl x509 -noout -dates and look at the actual expiry. The fix depends on which layer is stale. The lesson: monitor cert expiry from outside your perimeter, not by trusting the renewal job's exit code.
Mnemonic — TLS at a glance
"Handshake. Chain. ACME. Renew. Monitor."
  • Handshake — one RTT in TLS 1.3, ephemeral keys for forward secrecy.
  • Chain — leaf + intermediate sent; root in the trust store.
  • ACME — automate via certbot, HTTP-01 for hosts with port 80, DNS-01 for wildcards.
  • Renew — twice-daily timer, deploy hook reloads nginx.
  • Monitor — externally, days-remaining alert at 14.
Flashcard
Why is OCSP stapling preferred over the client doing OCSP itself, even though both verify the same revocation status?
Answer
Three reasons. (1) Privacy: client-side OCSP tells the CA which sites the client is visiting; stapled OCSP lets the server fetch on the client's behalf so the CA only sees the server's queries. (2) Performance: client-side OCSP adds an extra round trip during page load to a third-party server, often slow or blocked; stapling piggybacks the response on the existing TLS handshake. (3) Reliability: a CA's OCSP responder being down doesn't hard-fail every connection if the server has a fresh stapled response cached. The trade-off: the server has to keep fetching the response and re-stapling on a schedule; if the OCSP responder is unreachable from your server, stapling fails and you fall back to the client doing it (or worse, soft-fail).
🔑
Key takeaways
1) TLS 1.3 is the default; the handshake is one RTT and provides forward secrecy by default. 2) The chain of trust is leaf → intermediate → root; always serve the full chain. 3) ACME via certbot + HTTP-01 (or DNS-01 for wildcards) is the canonical free-cert path. 4) Production polish: HSTS, OCSP stapling, modern ciphers (Mozilla intermediate), CAA records. 5) Renewal automation + external monitoring is non-negotiable; expired certs are the most preventable outage in deployment.
