
Service Mesh, mTLS & Zero Trust
When identity is the perimeter, every hop must authenticate. Learn the mTLS handshake, SPIFFE/SPIRE workload identity, the service-mesh sidecar architecture, and the NIST SP 800-207 Zero Trust model — plus where each control actually lives in production.
What you will learn
Yesterday we built network walls. Today we acknowledge that walls alone are insufficient — once an attacker is inside one, the network looks flat. The remedy is to authenticate every hop, not just the front door, and to verify identity and policy on every request. That is the engineering content of Zero Trust: not a product, but a set of architectural choices implemented through workload identity, mutual TLS, identity-aware proxies, and explicit policy at the application layer.
Zero Trust — Five Principles, No Products
NIST SP 800-207 (2020) is the foundational reference. Stripped of jargon, Zero Trust says: trust is never granted by network location, never permanent, and always re-verified per request. The implications cascade through identity, networking and policy.
Five of the seven tenets, paraphrased
- All data sources and computing services are resources. No first-class "trusted" zones.
- All communication is secured regardless of network location. Mutual auth and integrity, even for traffic that never leaves the data centre.
- Access is granted on a per-session basis. Short-lived, re-verified.
- Access is determined by dynamic policy, including identity, device posture, environmental context.
- The enterprise monitors the integrity of all owned and associated assets and uses that information for policy decisions.
Workload Identity — SPIFFE and SPIRE
For services to authenticate each other you need each service to have an identity that is unforgeable, automatically rotated, and uniformly understood. SPIFFE (Secure Production Identity Framework For Everyone) is the open spec; SPIRE is the reference implementation. Service meshes (Istio, Linkerd, Consul) implement SPIFFE-compatible identities under the hood.
The SPIFFE ID
A SPIFFE ID is a URI: spiffe://prod.example.com/ns/payments/sa/api. It encodes the trust domain (e.g., prod.example.com) and a workload-specific path (here, namespace + service account). Two implementations carry the ID:
- SVIDs (SPIFFE Verifiable Identity Documents) as X.509 certs with the SPIFFE ID in a SAN URI — used for mTLS.
- SVIDs as JWTs with the SPIFFE ID in the sub claim — used where TLS is not the right layer (request-level auth between API gateways).
How SPIRE issues an identity
A SPIRE Agent runs on every node and first proves the node's own identity to the SPIRE Server (node attestation, using evidence such as a Kubernetes service-account token, an AWS instance identity document, or a GCP instance identity token). When a workload starts and calls the Workload API, the agent attests the workload: it observes process attributes (container labels, Kubernetes namespace and service account, Unix user ID) and matches them against the registration entries held by the SPIRE Server. If an entry matches, the agent obtains a fresh X.509-SVID from the server, hands it to the workload via the SPIFFE Workload API, and rotates it before expiry. The workload never sees a long-lived secret.
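On Kubernetes, the registration rules described above are often declared as a custom resource rather than entered by hand. The sketch below assumes the optional spire-controller-manager is installed, which provides the ClusterSPIFFEID resource; the resource name, labels, and namespace are hypothetical, chosen to reproduce the ns/payments/sa/api path used earlier.

```yaml
# Assumption: spire-controller-manager is deployed alongside SPIRE.
# All names and labels here are illustrative, not taken from a real cluster.
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: payments-workloads
spec:
  # Template rendered per pod; yields e.g.
  # spiffe://prod.example.com/ns/payments/sa/api
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  # Only pods carrying this label are registered.
  podSelector:
    matchLabels:
      app: payments
  # Restrict registration to the payments namespace.
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: payments
```

Deriving the ID from namespace and service account keeps registration tied to attributes the agent can observe, rather than to anything the workload claims about itself.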
Why this matters beyond the mesh
Cloud identities (AWS roles, GCP service accounts) can also be granted from a SPIFFE identity. For example, SPIRE's OIDC Discovery Provider lets a cloud provider validate a workload's JWT-SVID and exchange it for STS credentials: the same pattern as IRSA, but provider-agnostic. One identity primitive everywhere.
mTLS — What Actually Happens
Server-side TLS authenticates the server. Mutual TLS authenticates both sides — the client also presents a certificate, the server verifies it against a CA, and both endpoints know whom they are talking to. The conceptual upgrade is small; the operational lift is significant, which is why service meshes exist.
What you have to operate
- A workload CA — typically per-cluster, with a short-lived intermediate. Compromise of the CA means trust loss for everything; protect the root with HSM-backed keys.
- A cert distribution path — SPIRE Workload API, Istio's istiod, Linkerd's identity service. Workloads re-fetch certificates frequently; lifetimes range from minutes to 24 hours depending on the mesh and its configuration.
- A trust-bundle distribution path — every workload needs the up-to-date CA bundle to verify peers. Federation across clusters needs bundle exchange.
- An auth policy — "only callers with SPIFFE ID matching X may call /api/admin". Enforced at the sidecar / gateway.
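In Istio, the requirement that peers present a certificate is itself a piece of configuration, declared with a PeerAuthentication resource. A minimal sketch, assuming an Istio version that serves the security.istio.io/v1 API (older releases use v1beta1) and a hypothetical payments namespace:

```yaml
# Namespace-wide mTLS mode for the (hypothetical) payments namespace.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    # PERMISSIVE accepts both plaintext and mTLS during rollout.
    # Change to STRICT once metrics show no remaining plaintext callers.
    mode: PERMISSIVE
```

Applying the same resource in the Istio root namespace (istio-system by default) sets the mode mesh-wide instead of per namespace.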
In Istio, PeerAuthentication mode PERMISSIVE accepts both plaintext and mTLS: it lets you roll out the mesh without breaking traffic, observe via metrics which traffic is not yet mTLS, then flip to STRICT. Skipping this step is the most common cause of failed mesh rollouts.
Service Meshes — Sidecar, Sidecar-less, and the Trade-offs
The mesh implements four jobs: identity issuance, mTLS, L7 routing/observability, and authz policy. The implementation choices are where flavours differ.
| Mesh | Data plane | Identity | Notes |
|---|---|---|---|
| Istio | Envoy sidecars + (optional) ambient (per-node L4 + waypoint L7) | SPIFFE-shaped | Most features, biggest operational footprint; ambient mode reduces overhead |
| Linkerd | Rust micro-proxy sidecars (linkerd2-proxy) | SPIFFE-compatible | Lower latency & memory; smaller feature set, extremely opinionated |
| Consul | Envoy sidecars | SPIFFE | Multi-platform: VMs, K8s, ECS together |
| Cilium Service Mesh | eBPF (no sidecars) + optional Envoy per node | SPIFFE via Cilium identity | Lower per-pod cost; relies on kernel features |
| AWS App Mesh / GCP Anthos / Azure SM | Envoy + provider control plane | Provider-issued | Tighter cloud integration; less portability |
The sidecar tax — and why ambient/eBPF exist
A sidecar Envoy adds latency (sub-ms typically), memory (~30-100 MB per pod), and an operational dimension (proxy lifecycle vs app lifecycle, MTU pitfalls). On large clusters this adds up. Ambient mesh (Istio) and Cilium push the L4 mTLS into per-node components, leaving sidecar-class L7 features only where actually needed. The architectural trade-off is real but invisible to most application teams once it works.
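In Istio, the choice between the two data planes is made per namespace with a label. The sketch below uses two hypothetical namespaces and assumes an Istio installation with ambient mode enabled; it is an illustration of the opt-in mechanism, not a recommended layout.

```yaml
# Sidecar mode: Envoy is injected into every pod in this namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout
  labels:
    istio-injection: enabled
---
# Ambient mode: no sidecars; the per-node L4 component handles mTLS.
# L7 features require adding a waypoint proxy separately.
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    istio.io/dataplane-mode: ambient
```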
Authz policy — where to write it
The mesh enforces service-to-service authz: "checkout may call payments:Charge but not payments:Refund". Per-request user/end-customer authorization stays in the application or an upstream OPA/CEL filter — the mesh does not have the business-logic context. A common pattern: mesh enforces caller identity; app reads a propagated end-user JWT from a trusted gateway and authorizes on it.
```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: payments-only-checkout
  namespace: payments
spec:
  selector:
    matchLabels: { app: payments }
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/checkout/sa/api"]
    to:
    - operation:
        methods: ["POST"]
        paths: ["/v1/charge"]
    when:
    - key: request.auth.claims[scope]
      values: ["payments:charge"]
```
Three nested controls: caller workload identity, allowed HTTP method+path, and a JWT scope from the propagated end-user token. Combined with default-deny on the namespace, anything not matched is rejected.
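The policy above leans on two companion resources that are not shown: a namespace-wide default-deny, and a RequestAuthentication that tells the proxy how to validate the end-user JWT so that request.auth.claims is populated. A hedged sketch of both follows; the issuer, JWKS URL, and audience are hypothetical placeholders for whatever identity provider sits behind the gateway.

```yaml
# Default-deny: an AuthorizationPolicy with an empty spec matches no request,
# so every request in the namespace needs an explicit ALLOW elsewhere.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-nothing
  namespace: payments
spec: {}
---
# JWT validation for the payments workload.
# issuer, jwksUri, and audiences are illustrative values only.
apiVersion: security.istio.io/v1
kind: RequestAuthentication
metadata:
  name: end-user-jwt
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payments
  jwtRules:
  - issuer: "https://idp.example.com"
    jwksUri: "https://idp.example.com/.well-known/jwks.json"
    audiences: ["payments"]
```

RequestAuthentication on its own rejects only invalid tokens; requests with no token pass through unauthenticated, so requiring a claim in the AuthorizationPolicy is what actually makes the token mandatory.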
STRICT mTLS and AuthorizationPolicy default-deny on day one of a 200-service migration. By 11am, half the platform is broken. What did they skip?
PeerAuthentication to PERMISSIVE, so both plaintext and mTLS are accepted while you observe; (3) audit the metrics and resolve any stragglers; (4) flip to STRICT; (5) deploy authz policies in action: AUDIT first; (6) flip to ALLOW+default-deny once the audit log is clean. Each step is reversible; the big-bang rollout is not.
Identity-Aware Proxies for Human Access
The east-west story is mesh + workload identity. The north-south story for humans is the identity-aware proxy — a reverse proxy in front of internal apps that does SSO and per-request policy, replacing the corporate VPN.
- Google IAP — fronts GCE/GKE/internal apps; SSO via Workspace; per-request claims forwarded.
- Cloudflare Access / Tailscale / Pomerium — vendor or open-source IAPs.
- AWS Verified Access — IAP-style for AWS-hosted apps; Cedar policies on identity + device posture.
- Azure Application Proxy — Entra-ID-fronted access to internal apps.
The pattern: each request to an internal app must carry a fresh, signed assertion from the IdP that the proxy verifies, augmented with device-posture signals from the endpoint (managed device, encryption on, OS up to date). The original BeyondCorp paper (2014) is the canonical reference for the approach.
Cross-Cluster & Multi-Mesh Federation
Two clusters in the same trust domain can share an identity root by sharing the trust bundle. Two clusters in different trust domains use SPIFFE Federation: each side publishes its trust bundle at a well-known endpoint; the other consumes it and now accepts SVIDs from that domain. The mesh's AuthorizationPolicy can match identities across domains.
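Once the foreign trust bundle is in place, matching a federated identity in policy looks the same as matching a local one, except that the principal begins with the other trust domain. A minimal sketch in Istio, where partner.example.org and the billing workload are hypothetical names:

```yaml
# Allow a workload from a federated trust domain to call payments.
# Assumes the partner.example.org trust bundle is already distributed
# to the proxies; the domain, namespace, and service account are illustrative.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-partner-billing
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payments
  action: ALLOW
  rules:
  - from:
    - source:
        # Format: <trust-domain>/ns/<namespace>/sa/<service-account>
        principals: ["partner.example.org/ns/billing/sa/api"]
```

Without the federated bundle the handshake fails before this policy is evaluated, so the bundle exchange and the policy have to be rolled out together.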
Threats This Architecture Mitigates
| Threat | Pre Zero Trust | With mesh + ZT |
|---|---|---|
| Lateral movement after one pod compromise | Flat overlay; reach any service | mTLS + AuthZ policy; only allowed callers reach a service |
| Stolen long-lived service credential | Reuse for weeks until rotated | Short-lived SVID; expires within the rotation interval (minutes to hours) |
| Insider with VPN access | Network access ≈ application access | IAP enforces per-request authz with device posture |
| Malicious sidecar / supply chain | Owns the pod's identity | Still scoped: SPIFFE ID is namespace+SA, not cluster-wide |
| Forged JWT replay | App accepts any signed token | Mesh validates JWT issuer + audience + scope per request |
- Issue — short-lived workload identities (SPIFFE SVIDs).
- Encrypt — mTLS on every hop, default deny.
- Route — L7 routing, retries, circuit breakers, traffic shifting.
- Decide — authz on caller identity + propagated end-user claims.
AuthorizationPolicy — typically a small set, scoped to the pod's role. They cannot impersonate a different workload because the SVID is bound to that pod's identity, attested by the SPIRE agent. They cannot turn off mTLS — the sidecar enforces it. Their useful blast radius shrinks from "the cluster network" to "the explicit allow-list for this caller". This is the core promise of mesh-based segmentation.
- NIST SP 800-207 — Zero Trust Architecture (nist.gov)
- SPIFFE — Concepts (spiffe.io)
- SPIRE — Architecture (spiffe.io)
- Istio — Security concepts (istio.io)
- Linkerd — Automatic mTLS (linkerd.io)
- BeyondCorp — A New Approach to Enterprise Security (Google, 2014) (research.google)
- CISA — Zero Trust Maturity Model v2.0 (cisa.gov)