The Engineering Codex/Cloud Security Engineering
DAY 2 · PM

Service Mesh, mTLS & Zero Trust

9 min · Advanced · 2,004 words
When identity is the perimeter, every hop must authenticate. Learn the mTLS handshake, SPIFFE/SPIRE workload identity, the service-mesh sidecar architecture, and the NIST SP 800-207 Zero Trust model — plus where each control actually lives in production.

What you will learn

01 Zero Trust — Five Principles, No Products
02 Workload Identity — SPIFFE and SPIRE
03 mTLS — What Actually Happens
04 Service Meshes — Sidecar, Sidecar-less, and the Trade-offs
05 Identity-Aware Proxies for Human Access
06 Cross-Cluster & Multi-Mesh Federation

Yesterday we built network walls. Today we acknowledge that walls alone are insufficient — once an attacker is inside one, the network looks flat. The remedy is to authenticate every hop, not just the front door, and to verify identity and policy on every request. That is the engineering content of Zero Trust: not a product, but a set of architectural choices implemented through workload identity, mutual TLS, identity-aware proxies, and explicit policy at the application layer.

🔑
Today's spine
1) Zero Trust as defined by NIST SP 800-207 — five principles, no products. 2) SPIFFE / SPIRE — how a workload gets a portable, cryptographically verifiable identity. 3) mTLS — what happens on the wire and why a sidecar usually does it. 4) Service mesh internals — Istio, Linkerd, Consul, Cilium. 5) Identity-aware proxies for human access — BeyondCorp / IAP / Cloudflare Access.

Zero Trust — Five Principles, No Products

NIST SP 800-207 (2020) is the foundational reference. It lists seven tenets, which this lesson condenses into five. Stripped of jargon, Zero Trust says: trust is never granted by network location, never permanent, and always re-verified per request. The implications cascade through identity, networking, and policy.

[Figure: A Subject (user or workload) reaches a Resource (app, data, API) only through a Policy Enforcement Point (PEP, the per-request gate), which consults a Policy Decision Point (PDP, "may this happen?"). Every request: PEP asks PDP; PDP evaluates identity, device, context, signals.]
NIST 800-207 logical components. The PDP is centralized policy; the PEP lives at every gateway, sidecar, or filter.
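
The PEP/PDP split can be sketched in a few lines of Python. This is only an illustration: the `Request` fields, the `ALLOWED` table, and the single posture flag are hypothetical, and a real PDP would weigh far more signals. The point is that the enforcement function asks for a decision on every request and that network location never enters the decision.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    subject: str            # authenticated identity of the caller
    resource: str           # what is being accessed
    device_compliant: bool  # posture signal supplied with the request
    source_network: str     # carried along, deliberately never consulted

# Hypothetical policy table: which subject may reach which resource.
ALLOWED = {
    ("alice", "payroll-api"),
    ("spiffe://prod.example.com/ns/checkout/sa/api", "payments"),
}

def pdp_decide(req: Request) -> bool:
    """Policy Decision Point: identity plus posture, never network location."""
    return (req.subject, req.resource) in ALLOWED and req.device_compliant

def pep_handle(req: Request) -> str:
    """Policy Enforcement Point: consults the PDP on every single request."""
    return "forwarded" if pdp_decide(req) else "denied"
```

Note that a request from "corp-lan" with a non-compliant device is denied while a compliant one from the internet is forwarded, which is the inversion of the perimeter model.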

The five tenets, paraphrased

  1. All data sources and computing services are resources. No first-class "trusted" zones.
  2. All communication is secured regardless of network location. Mutual auth and integrity, even for traffic that never leaves the data centre.
  3. Access is granted on a per-session basis. Short-lived, re-verified.
  4. Access is determined by dynamic policy, including identity, device posture, environmental context.
  5. The enterprise monitors the integrity of all owned and associated assets and uses that information for policy decisions.
⚠️
Vendors will sell you Zero Trust
No product implements all five tenets. SASE/ZTNA gateways do (1) and (3) for north-south human traffic; service meshes do (2) and (3) for east-west workload traffic; CIEM/CSPM tools touch (4) and (5). Map the components you own to the tenets — gaps are the work to do.
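
The mapping exercise is a set difference. A minimal sketch, assuming a hypothetical inventory that owns a ZTNA gateway and a service mesh but no CIEM/CSPM tooling; the tenet coverage per component follows the callout above.

```python
# The five tenets as numbered in this lesson.
TENETS = {
    1: "everything is a resource",
    2: "all communication secured",
    3: "per-session access",
    4: "dynamic policy",
    5: "continuous monitoring",
}

# Hypothetical inventory: component -> tenets it addresses.
OWNED = {
    "ztna-gateway": {1, 3},
    "service-mesh": {2, 3},
}

def uncovered(owned: dict) -> list:
    """Return the tenet numbers that no owned component addresses."""
    covered = set().union(*owned.values())
    return sorted(set(TENETS) - covered)

gaps = uncovered(OWNED)
```

Here `gaps` comes out as tenets 4 and 5, which is the work still to do for this imagined estate.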

Workload Identity — SPIFFE and SPIRE

For services to authenticate each other you need each service to have an identity that is unforgeable, automatically rotated, and uniformly understood. SPIFFE (Secure Production Identity Framework For Everyone) is the open spec; SPIRE is the reference implementation. Service meshes (Istio, Linkerd, Consul) implement SPIFFE-compatible identities under the hood.

The SPIFFE ID

A SPIFFE ID is a URI: spiffe://prod.example.com/ns/payments/sa/api. It encodes the trust domain (here, prod.example.com) and a workload-specific path (here, namespace + service account). Two document formats carry the ID:

  • SVIDs (SPIFFE Verifiable Identity Documents) as X.509 certs with the SPIFFE ID in a SAN URI — used for mTLS.
  • SVIDs as JWTs with the SPIFFE ID in sub — used where TLS is not the right layer (request-level auth between API gateways).
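
Because a SPIFFE ID is an ordinary URI, splitting it into trust domain and path needs nothing beyond the standard library. The sketch below is a simplified check, not a full implementation of the SPIFFE ID validation rules; the function name is made up for this example.

```python
from urllib.parse import urlsplit

def parse_spiffe_id(uri: str) -> tuple:
    """Split a SPIFFE ID into (trust_domain, path); raise ValueError if malformed."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe":
        raise ValueError("scheme must be spiffe")
    if not parts.netloc or "@" in parts.netloc or ":" in parts.netloc:
        raise ValueError("trust domain must be a bare host name")
    if parts.query or parts.fragment:
        raise ValueError("query and fragment are not allowed")
    return parts.netloc, parts.path

trust_domain, path = parse_spiffe_id("spiffe://prod.example.com/ns/payments/sa/api")
```

For the ID used in this section, `trust_domain` is `prod.example.com` and `path` is `/ns/payments/sa/api`.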

How SPIRE issues an identity

A SPIRE Agent runs on every node and first proves the node's own identity to the SPIRE Server (node attestation) using platform evidence such as an AWS instance identity document, a GCP instance identity token, or a Kubernetes service-account token. When a workload starts and calls the SPIFFE Workload API, the agent attests the workload: it observes process attributes (Unix UID, container labels, Kubernetes namespace and service account) and matches them against the workload registration entries held by the SPIRE Server. If an entry matches, the agent obtains a fresh X.509-SVID signed by the server, hands it to the workload via the Workload API, and rotates it before expiry. The workload never sees a long-lived secret.

[Figure: SPIRE issuance flow. The Workload (pod / process) talks to the Workload API of the SPIRE Agent (on every node, attests the workload); the agent talks to the SPIRE Server (CA + registry, issues the X.509-SVID). Steps: attest + sign. Result: a short-lived X.509 cert with a spiffe:// SAN, perfect for mTLS.]
Workload identity without secrets. The agent's attestation chain — node + workload — is the trust root.
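
The matching step can be sketched as a subset test: a registration entry applies when every selector it lists was actually observed on the workload. The selector strings follow SPIRE's `type:key:value` style, but the registry contents and the `attest` function are illustrative assumptions, not SPIRE's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    spiffe_id: str
    selectors: frozenset  # every selector must be observed for the entry to match

# Hypothetical registration entries held by the server.
REGISTRY = [
    Entry("spiffe://prod.example.com/ns/payments/sa/api",
          frozenset({"k8s:ns:payments", "k8s:sa:api"})),
    Entry("spiffe://prod.example.com/ns/checkout/sa/api",
          frozenset({"k8s:ns:checkout", "k8s:sa:api"})),
]

def attest(observed: set, registry=REGISTRY) -> list:
    """Return the SPIFFE IDs whose selectors are a subset of what was observed."""
    return [e.spiffe_id for e in registry if e.selectors <= observed]

ids = attest({"k8s:ns:payments", "k8s:sa:api", "k8s:pod-label:app:payments"})
```

Extra observed attributes do not hurt; a missing one means no identity is issued at all.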

Why this matters beyond the mesh

Your cloud identities (AWS roles, GCP service accounts) can also be granted via SPIFFE: components such as SPIRE's OIDC Discovery Provider let a workload's JWT-SVID be exchanged for cloud credentials (for example, AWS STS via web-identity federation), the same pattern as IRSA but provider-agnostic. One identity primitive everywhere.

mTLS — What Actually Happens

Server-side TLS authenticates the server. Mutual TLS authenticates both sides — the client also presents a certificate, the server verifies it against a CA, and both endpoints know whom they are talking to. The conceptual upgrade is small; the operational lift is significant, which is why service meshes exist.

[Figure: TLS 1.3 mutual-TLS handshake.
Client → Server: ClientHello (SNI, ALPN, key share)
Server → Client: ServerHello + Cert + CertReq + Finished
Client → Server: ClientCert + CertVerify + Finished
Both directions: application data (encrypted)
Both sides verify the cert chain against a shared trust bundle. Identity = SAN URI.]
TLS 1.3 mTLS in 1-RTT. The server's CertReq and the client's Cert and CertVerify messages are the only differences from server-only TLS.
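
In application code, the server-side difference is one setting: require a client certificate. The sketch below uses Python's standard `ssl` module; the function names and the optional file-path parameters are placeholders for this example, and in a mesh the sidecar would do this on the app's behalf.

```python
import ssl

def mtls_server_context(cert=None, key=None, trust_bundle=None) -> ssl.SSLContext:
    """Build a TLS 1.3 server context that demands a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    ctx.verify_mode = ssl.CERT_REQUIRED  # makes the server request and verify a client cert
    if cert:
        ctx.load_cert_chain(cert, key)                  # this workload's own cert and key
    if trust_bundle:
        ctx.load_verify_locations(cafile=trust_bundle)  # CAs allowed to sign peers
    return ctx

def peer_spiffe_id(peercert: dict):
    """Pull the spiffe:// URI SAN out of the dict returned by SSLSocket.getpeercert()."""
    for kind, value in peercert.get("subjectAltName", ()):
        if kind == "URI" and value.startswith("spiffe://"):
            return value
    return None
```

After the handshake, authorization decisions key off the value returned by `peer_spiffe_id`, not off the peer's IP address or DNS name.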

What you have to operate

  • A workload CA — typically per-cluster, with a short-lived intermediate. Compromise of the CA means trust loss for everything; protect the root with HSM-backed keys.
  • A cert distribution path — SPIRE workload API, Istio's istiod, Linkerd's identity service. Workloads pull frequently: certificate lifetimes from a few minutes up to 24 hours are typical, depending on the mesh and its configuration.
  • A trust-bundle distribution path — every workload needs the up-to-date CA bundle to verify peers. Federation across clusters needs bundle exchange.
  • An auth policy — "only callers with SPIFFE ID matching X may call /api/admin". Enforced at the sidecar / gateway.
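
Rotation has to start well before expiry so a slow or failed renewal does not cause an outage. A minimal sketch of the timing arithmetic, assuming renewal is triggered at a fixed fraction of the certificate lifetime; the 0.5 value is an assumption for illustration, and real agents pick their own threshold.

```python
from datetime import datetime, timedelta, timezone

ROTATE_FRACTION = 0.5  # assumed threshold, not taken from any specific mesh

def rotation_time(not_before: datetime, not_after: datetime,
                  fraction: float = ROTATE_FRACTION) -> datetime:
    """Moment at which a replacement certificate should be requested."""
    return not_before + (not_after - not_before) * fraction

def needs_rotation(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    return now >= rotation_time(not_before, not_after)

# Example: a certificate valid for one hour.
issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=1)
renew_at = rotation_time(issued, expires)
```

With a one-hour lifetime, `renew_at` falls thirty minutes after issuance, leaving the other half of the lifetime as retry budget.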
💡
PERMISSIVE before STRICT
Istio's PeerAuthentication mode PERMISSIVE accepts both plaintext and mTLS. It lets you roll out the mesh without breaking traffic, observe via metrics which traffic is and is not yet mTLS, then flip to STRICT. Skipping this step is a common cause of failed mesh rollouts.
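
The semantics of the two modes, and the readiness check before flipping, fit in a short Python sketch. Only the two modes named above are modelled, and the `(source, is_mtls)` telemetry records are a made-up shape standing in for whatever your mesh metrics expose.

```python
def accepts(mode: str, is_mtls: bool) -> bool:
    """Would a workload in this mode accept the connection?"""
    if mode == "PERMISSIVE":
        return True
    if mode == "STRICT":
        return is_mtls
    raise ValueError(f"mode not modelled in this sketch: {mode}")

def plaintext_stragglers(observed) -> list:
    """observed: iterable of (source_workload, is_mtls) pairs from telemetry."""
    return sorted({src for src, is_mtls in observed if not is_mtls})

def safe_to_go_strict(observed) -> bool:
    return not plaintext_stragglers(observed)

# Hypothetical telemetry sample gathered while in PERMISSIVE mode.
traffic = [("checkout", True), ("legacy-batch", False), ("frontend", True)]
stragglers = plaintext_stragglers(traffic)
```

In this sample `legacy-batch` still sends plaintext, so flipping to STRICT would break it; the fix is to onboard that caller first.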

Service Meshes — Sidecar, Sidecar-less, and the Trade-offs

The mesh implements four jobs: identity issuance, mTLS, L7 routing/observability, and authz policy. The implementation choices are where flavours differ.

| Mesh | Data plane | Identity | Notes |
| --- | --- | --- | --- |
| Istio | Envoy sidecars + (optional) ambient (per-node L4 + waypoint L7) | SPIFFE-shaped | Most features, biggest operational footprint; ambient mode reduces overhead |
| Linkerd | Rust micro-proxy sidecars (linkerd2-proxy) | SPIFFE-compatible | Lower latency & memory; smaller feature set, extremely opinionated |
| Consul | Envoy sidecars | SPIFFE | Multi-platform: VMs, K8s, ECS together |
| Cilium Service Mesh | eBPF (no sidecars) + optional Envoy per node | SPIFFE via Cilium identity | Lower per-pod cost; relies on kernel features |
| AWS App Mesh / GCP Anthos / Azure SM | Envoy + provider control plane | Provider-issued | Tighter cloud integration; less portability |

The sidecar tax — and why ambient/eBPF exist

A sidecar Envoy adds latency (sub-ms typically), memory (~30-100 MB per pod), and an operational dimension (proxy lifecycle vs app lifecycle, MTU pitfalls). On large clusters this adds up. Ambient mesh (Istio) and Cilium push the L4 mTLS into per-node components, leaving sidecar-class L7 features only where actually needed. The architectural trade-off is real but invisible to most application teams once it works.
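
To see how the per-pod memory figure adds up, here is the arithmetic for a hypothetical cluster of 5,000 pods, using the 30 to 100 MB range quoted above (treated as MiB for the conversion).

```python
def sidecar_memory_gib(pods: int, mib_per_sidecar: float) -> float:
    """Total proxy memory across the cluster, in GiB."""
    return pods * mib_per_sidecar / 1024

PODS = 5000  # hypothetical cluster size
low = sidecar_memory_gib(PODS, 30)
high = sidecar_memory_gib(PODS, 100)
print(f"{PODS} sidecars: {low:.0f}-{high:.0f} GiB of proxy memory")
```

That is roughly 146 to 488 GiB spent on proxies alone, which is the cost that per-node data planes aim to amortize.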

Authz policy — where to write it

The mesh enforces service-to-service authz: "checkout may call payments:Charge but not payments:Refund". Per-request user/end-customer authorization stays in the application or an upstream OPA/CEL filter — the mesh does not have the business-logic context. A common pattern: mesh enforces caller identity; app reads a propagated end-user JWT from a trusted gateway and authorizes on it.

yaml — Istio AuthorizationPolicy
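apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: payments-only-checkout
  namespace: payments
spec:
  # Applies only to workloads labelled app=payments in this namespace.
  selector:
    matchLabels: { app: payments }
  action: ALLOW
  rules:
  - from:
    - source:
        # Caller identity from the peer's mTLS certificate (trust-domain/ns/sa).
        principals: ["cluster.local/ns/checkout/sa/api"]
    to:
    - operation:
        methods: ["POST"]
        paths: ["/v1/charge"]
    when:
    # Needs a validated JWT, i.e. a RequestAuthentication policy on this workload.
    - key: request.auth.claims[scope]
      values: ["payments:charge"]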

Three nested controls: caller workload identity, allowed HTTP method and path, and a JWT scope from the propagated end-user token. Because an ALLOW policy now selects the payments workloads, any request to them that matches no ALLOW rule is rejected (default-deny).
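
The evaluation logic of that policy can be mirrored in a few lines of Python to make the "all three must match, otherwise deny" behaviour explicit. This is a simplified model, not Istio's matcher: it ignores wildcards and other policies, and it assumes the token's scopes have already been validated and split into a set.

```python
# Values copied from the AuthorizationPolicy above.
POLICY = {
    "principals": {"cluster.local/ns/checkout/sa/api"},
    "methods": {"POST"},
    "paths": {"/v1/charge"},
    "scopes": {"payments:charge"},
}

def authorize(principal: str, method: str, path: str, scopes: set) -> bool:
    """ALLOW only when caller, operation, and token scope all match; else deny."""
    return (
        principal in POLICY["principals"]
        and method in POLICY["methods"]
        and path in POLICY["paths"]
        and bool(POLICY["scopes"] & scopes)
    )

CHECKOUT = "cluster.local/ns/checkout/sa/api"
ok = authorize(CHECKOUT, "POST", "/v1/charge", {"payments:charge"})
```

Changing any one input (a different caller, `/v1/refund`, or a token without the scope) flips the result to a denial.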

Quick check
A team enables Istio with STRICT mTLS and AuthorizationPolicy default-deny on day one of a 200-service migration. By 11am, half the platform is broken. What did they skip?
Answer
They skipped the PERMISSIVE-before-STRICT rollout and the audit-only authz policy phase. The right sequence:

  1. Deploy sidecars, with traffic still plaintext.
  2. Flip PeerAuthentication to PERMISSIVE, so both plaintext and mTLS are accepted while you observe.
  3. Audit the metrics and resolve any stragglers.
  4. Flip to STRICT.
  5. Deploy authz policies with action: AUDIT first.
  6. Flip to ALLOW plus default-deny once the audit log is clean.

Each step is reversible; the big-bang rollout is not.

Identity-Aware Proxies for Human Access

The east-west story is mesh + workload identity. The north-south story for humans is the identity-aware proxy — a reverse proxy in front of internal apps that does SSO and per-request policy, replacing the corporate VPN.

  • Google IAP — fronts GCE/GKE/internal apps; SSO via Workspace; per-request claims forwarded.
  • Cloudflare Access / Tailscale / Pomerium — vendor or open-source IAPs.
  • AWS Verified Access — IAP-style for AWS-hosted apps; Cedar policies on identity + device posture.
  • Microsoft Entra application proxy (formerly Azure AD Application Proxy) — Entra-ID-fronted access to internal apps.

The pattern: each request to an internal app must carry a fresh, signed assertion from the IdP that the proxy verifies, augmented with device-posture signals from the endpoint (managed device, encryption on, OS up to date). The original BeyondCorp paper (2014) is the canonical reference for the approach.
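
The per-request check has three parts: the assertion's signature, its freshness, and the device posture. The sketch below uses an HMAC with a shared key purely to stay within the standard library; that is a stand-in, since production proxies verify asymmetrically signed tokens from the IdP. The key, claim names, and posture fields are all hypothetical.

```python
import base64, hashlib, hmac, json

KEY = b"demo-only-shared-secret"  # stand-in for the IdP's signing key
REQUIRED_POSTURE = ("managed", "disk_encrypted", "os_up_to_date")

def sign(claims: dict, key: bytes = KEY) -> str:
    """Issue a toy assertion: base64(JSON claims) + '.' + HMAC-SHA256 tag."""
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    tag = hmac.new(key, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + tag

def verify(assertion: str, now: float, key: bytes = KEY):
    """Return the claims if the tag is valid and the assertion is unexpired."""
    body, _, tag = assertion.rpartition(".")
    expected = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    return claims if claims["exp"] > now else None

def proxy_allows(assertion: str, posture: dict, now: float) -> bool:
    """Gate one request: valid fresh assertion AND every posture signal true."""
    if verify(assertion, now) is None:
        return False
    return all(posture.get(k, False) for k in REQUIRED_POSTURE)
```

A valid assertion from an unmanaged laptop is still refused, which is what separates this model from VPN-style access.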

Cross-Cluster & Multi-Mesh Federation

Two clusters in the same trust domain can share an identity root by sharing the trust bundle. Two clusters in different trust domains use SPIFFE Federation: each side publishes its trust bundle at a well-known endpoint; the other consumes it and now accepts SVIDs from that domain. The mesh's AuthorizationPolicy can match identities across domains.
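
The rule that makes federation safe is that each trust bundle vouches only for its own trust domain. A minimal sketch, with CA names standing in for real certificate-chain verification; the bundle contents and the `accept_svid` function are hypothetical.

```python
from urllib.parse import urlsplit

# Trust bundles keyed by trust domain. The second entry is assumed to have
# been fetched from the partner's federation endpoint.
BUNDLES = {
    "prod.example.com": {"ca-prod-1"},
    "partner.example.org": {"ca-partner-7"},
}

def accept_svid(spiffe_id: str, issuer_ca: str, bundles=BUNDLES) -> bool:
    """Accept only if the issuing CA is in the bundle for the SVID's own domain."""
    trust_domain = urlsplit(spiffe_id).netloc
    return issuer_ca in bundles.get(trust_domain, set())

partner_ok = accept_svid("spiffe://partner.example.org/billing", "ca-partner-7")
cross_signed = accept_svid("spiffe://prod.example.com/ns/payments/sa/api", "ca-partner-7")
```

`partner_ok` is true, but `cross_signed` is false: the partner's CA cannot mint identities inside prod.example.com even though its bundle is trusted.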

🌱
A pragmatic adoption ladder
If a full mesh is overkill: start with Linkerd in default-deny mode on one critical namespace; add egress-only mTLS for outbound calls to other internal services; then graduate to a multi-namespace mesh once you have observability built. Many production estates never need more than Linkerd's defaults.

Threats This Architecture Mitigates

| Threat | Pre Zero Trust | With mesh + ZT |
| --- | --- | --- |
| Lateral movement after one pod compromise | Flat overlay; reach any service | mTLS + AuthZ policy; only allowed callers reach a service |
| Stolen long-lived service credential | Reuse for weeks until rotated | Short-lived SVID; expires within the rotation interval (minutes) |
| Insider with VPN access | Network access ≈ application access | IAP enforces per-request authz with device posture |
| Malicious sidecar / supply chain | Owns the pod's identity | Still scoped: SPIFFE ID is namespace+SA, not cluster-wide |
| Forged JWT replay | App accepts any signed token | Mesh validates JWT issuer + audience + scope per request |
Mnemonic — the four jobs of a mesh
"Issue, encrypt, route, decide."
  • Issue — short-lived workload identities (SPIFFE SVIDs).
  • Encrypt — mTLS on every hop, default deny.
  • Route — L7 routing, retries, circuit breakers, traffic shifting.
  • Decide — authz on caller identity + propagated end-user claims.
Flashcard
An attacker compromises one pod in a service mesh. They control the workload but not the node. What can they reach, and what stops them?
Answer
They can call any service the pod's SPIFFE identity is allowed to call by an AuthorizationPolicy, typically a small set scoped to the pod's role. They cannot impersonate a different workload because the SVID is bound to that pod's identity, attested by the SPIRE agent (or the mesh's own identity service). They cannot fall back to plaintext either: with STRICT mTLS, peers refuse connections that do not present a valid certificate. Their useful blast radius shrinks from "the cluster network" to "the explicit allow-list for this caller". This is the core promise of mesh-based segmentation.
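
The blast-radius argument is a lookup in the allow-list. A toy illustration with a hypothetical set of services and policies; in a flat network the equivalent answer would be "every service".

```python
# Hypothetical allow-list: caller identity -> services it may call.
ALLOW = {
    "frontend": {"checkout"},
    "checkout": {"payments", "inventory"},
    "payments": {"ledger"},
}
ALL_SERVICES = {"frontend", "checkout", "payments", "inventory", "ledger"}

def reachable_with_mesh(compromised: str) -> list:
    """Services the attacker can call using only the compromised pod's identity."""
    return sorted(ALLOW.get(compromised, set()))

def reachable_flat(compromised: str) -> list:
    """Without per-caller policy, every other service is reachable."""
    return sorted(ALL_SERVICES - {compromised})

mesh_reach = reachable_with_mesh("frontend")
flat_reach = reachable_flat("frontend")
```

Compromising `frontend` yields one callable service under policy versus four on a flat network.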
🔑
Key takeaways
1) Zero Trust is an architecture, not a product — five tenets from NIST 800-207 cover identity, network, session, dynamic policy, and continuous monitoring. 2) SPIFFE / SPIRE give workloads cryptographically verifiable identities without long-lived secrets. 3) mTLS authenticates both ends; service meshes (Istio, Linkerd, Cilium) operate the cert pipeline so apps don't have to. 4) Mesh authz enforces caller-identity rules; identity-aware proxies do the same for humans. 5) Roll out PERMISSIVE before STRICT and AUDIT before ALLOW — every successful mesh adoption looks like this.
