The Engineering Codex/Cloud Security Engineering
DAY 3

Kubernetes & Container Security

12 min · Advanced · 2,633 words
Containers don't contain. Master the container threat model, Pod Security Standards, K8s RBAC, NetworkPolicies, admission control with Gatekeeper and Kyverno, runtime detection with Falco and eBPF, and the supply-chain practices that keep production images trustworthy.

What you will learn

01 Containers Don't Contain — The Real Boundary Is the Kernel
02 The API Server — Authentication, RBAC, Audit
03 Pod Security Standards — The Fast Win
04 NetworkPolicies — Default Deny Is the Default You Need
05 Admission Control — Where Policy Becomes Enforcement
06 Workload Identity — IRSA, Workload Identity, Managed Identity

Kubernetes is a single, multi-tenant, API-driven control plane that schedules code onto shared kernels. That single sentence contains every category of K8s vulnerability that has ever shipped: API exposure, multi-tenant boundary failures, the kernel as the real container boundary, and the rise of admission-time policy as the place where most security wins live. This chapter is intentionally dense — it is the one chapter you will revisit most.

🔑
The five planes of Kubernetes security
1) Container isolation — what namespaces, cgroups and seccomp actually do. 2) API server — authentication, RBAC, audit. 3) Workload identity & secrets — IRSA / Workload Identity, projected SA tokens, external secret stores. 4) Network — NetworkPolicies, ingress/egress, mesh. 5) Admission & runtime — Pod Security Standards, OPA/Kyverno, Falco/eBPF. Hardening is not a one-shot — it is closing each plane in turn.

Containers Don't Contain — The Real Boundary Is the Kernel

A container is a Linux process with a curated view of the world: namespaces (mount, PID, network, UTS, IPC, user, cgroup) hide the rest of the system; cgroups limit resources; capabilities drop pieces of root; seccomp filters the syscall surface; optionally AppArmor or SELinux add MAC. None of these protect against a kernel-level vulnerability — that is the hypervisor's job, or a sandboxed runtime's.

[Figure: concentric rings, innermost to outermost: process; namespaces & cgroups; capabilities & seccomp; MAC (AppArmor / SELinux); kernel (shared)]
Each ring narrows what a container can do. The outer ring — the kernel — is the real, hard boundary. A kernel CVE exposes everything on the node.

Capabilities — root is not one privilege

Linux splits root's powers into ~40 capabilities. Containers default to a small subset (CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, etc.). Three to watch:

  • CAP_SYS_ADMIN: "the new root." Mount filesystems, manipulate namespaces, configure devices. Never grant in production unless you know exactly why.
  • CAP_NET_RAW: craft raw packets. It is part of Docker's default capability set, and it lets a compromised pod run ICMP scans or spoof ARP on the pod network. Drop unless needed.
  • CAP_SYS_PTRACE: debug other processes. In a multi-container pod (sidecar + app) that shares a process namespace, one container can use it against another.
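
A minimal sketch of how the capability guidance above might look in a container securityContext. The pod name, image reference and the choice of NET_BIND_SERVICE are placeholders for illustration, not taken from the text; the pattern is to drop everything and add back only what the process demonstrably needs.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-example                          # hypothetical name
spec:
  containers:
  - name: web
    image: registry.example.com/web:1.0      # placeholder image
    securityContext:
      allowPrivilegeEscalation: false        # blocks setuid-style privilege gains
      capabilities:
        drop: ["ALL"]                        # removes NET_RAW, SYS_CHROOT and the rest of the default set
        add: ["NET_BIND_SERVICE"]            # only needed if the process binds a port below 1024
```

If the application listens on a high port, the add line can go as well, leaving the container with no capabilities at all.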

seccomp and the syscall floor

The default Docker seccomp profile blocks ~50 syscalls; the Kubernetes RuntimeDefault profile applies the container runtime's equivalent default. Always set seccompProfile: { type: RuntimeDefault } in the pod or container securityContext: unless the kubelet runs with --seccomp-default, pods without an explicit profile run Unconfined. For higher-risk workloads, generate a custom profile from observed syscalls (e.g. with syscall2seccomp or by recording the workload's syscalls under tracing) and reference it as a Localhost profile.
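
As a sketch, the seccomp setting described above sits in the pod-level securityContext so that every container inherits it. The names and the profile path below are assumptions made for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-example                      # hypothetical name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault                   # the container runtime's default syscall filter
  containers:
  - name: app
    image: registry.example.com/app:1.0      # placeholder image
    # A custom profile would be referenced per container instead, for example:
    # securityContext:
    #   seccompProfile:
    #     type: Localhost
    #     localhostProfile: profiles/app.json   # hypothetical path, relative to the kubelet's seccomp directory
```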

When the kernel is not enough — sandboxed runtimes

Two hardened runtimes plug into K8s as RuntimeClasses. gVisor (Google) intercepts syscalls in user space and re-implements much of the kernel surface: high isolation, some compatibility limits, and a performance cost that depends on the workload (~5-15% is a common figure for CPU-bound work; syscall- and I/O-heavy workloads can pay considerably more). Kata Containers runs each pod in a lightweight VM via QEMU or Cloud Hypervisor: VM-grade isolation at higher startup cost. Both are appropriate for untrusted multi-tenant workloads (CI runners, function-as-a-service).
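
A sketch of how a sandboxed runtime is typically wired in, assuming the nodes' container runtime has already been configured with a gVisor handler named runsc (that node-level setup is not shown, and the pod name and image are placeholders).

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc                               # must match a handler configured in the node's container runtime
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-job                        # hypothetical name
spec:
  runtimeClassName: gvisor                   # schedule this pod onto the sandboxed runtime
  containers:
  - name: job
    image: registry.example.com/ci-runner:1.0   # placeholder image
```

A Kata setup follows the same shape with a different handler name; only the pods that opt in via runtimeClassName pay the isolation cost.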

The API Server — Authentication, RBAC, Audit

Every K8s action is an API call. Hardening the cluster is mostly hardening this API.

Authentication paths

The kube-apiserver accepts several authn modes simultaneously: client certs, static tokens (deprecated), bootstrap tokens, OIDC, service account JWTs, webhook authenticators. For humans, OIDC + IdP MFA is the only acceptable choice in 2026. For workloads, projected service-account tokens are the modern default (audience-bound, time-limited, automatically rotated). The legacy non-expiring, Secret-based SA tokens are deprecated and have not been auto-created since Kubernetes 1.24; when a token is needed outside a pod, kubectl create token <service-account> --duration=1h is the replacement.
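
To make "audience-bound, time-limited" concrete, here is a sketch of a pod that requests its own projected token. The ServiceAccount name, audience string and mount path are assumptions for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: token-example                        # hypothetical name
spec:
  serviceAccountName: payments-api           # hypothetical ServiceAccount
  containers:
  - name: app
    image: registry.example.com/payments:1.0 # placeholder image
    volumeMounts:
    - name: api-token
      mountPath: /var/run/secrets/tokens
      readOnly: true
  volumes:
  - name: api-token
    projected:
      sources:
      - serviceAccountToken:
          path: api-token                    # file name under the mount path
          audience: payments-backend         # hypothetical audience; the receiver must check it
          expirationSeconds: 3600            # the kubelet refreshes the file before expiry
```

The application has to re-read the file periodically, because the kubelet replaces its contents as the token rotates.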

RBAC — least privilege at the API

RBAC has four objects: Role (namespaced) and ClusterRole (cluster-wide) define rules; RoleBinding and ClusterRoleBinding attach them to subjects (Users, Groups, ServiceAccounts). Three rules of thumb that cover most mistakes:

  • No wildcard verbs in ClusterRole. verbs: ["*"] on Pods is admin via exec; on Secrets it is a mass leak; on RoleBindings it is a privilege escalation primitive.
  • No create + list on Secrets in shared namespaces (including default). list returns the full contents of every Secret in the namespace, and create lets a principal request a token Secret for another ServiceAccount; together they enable cross-pod secret stealing.
  • Audit the escalate and bind verbs: they let a principal grant permissions beyond those they themselves hold.
🚨
RBAC privilege escalation primitives
Several K8s permissions are privilege escalation primitives. pods/exec = code execution as the pod's SA. impersonate = act as another user/group. nodes/proxy = read kubelet (incl. exec). configmaps create/update on system namespaces = backdoor admission. The CNCF maintains an evergreen list (SIG Security); rakkess/rbac-lookup map them in your cluster.
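
A minimal sketch of a least-privilege grant in the spirit of the rules above: a namespaced Role with explicit verbs and resources, bound to a single ServiceAccount. All names are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: payments
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]            # no pods/exec, no secrets
  verbs: ["get", "list", "watch"]            # explicit verbs, never "*"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: payments
subjects:
- kind: ServiceAccount
  name: payments-dashboard                   # hypothetical ServiceAccount
  namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
```

To check the result, kubectl auth can-i --list --as=system:serviceaccount:payments:payments-dashboard -n payments prints what the subject can actually do.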

Audit policy

The audit log can record every API request; the audit policy decides at what level. The default policy is too noisy in some distros and missing critical events in others. A good baseline: RequestResponse for sensitive write verbs (create, update, patch, delete) on RBAC objects, ValidatingWebhookConfigurations and MutatingWebhookConfigurations, and CertificateSigningRequests; Metadata for Secrets, so that Secret values never land in the log; Metadata for everything else. Ship to a separate logging tenant: kube-apiserver pods can be compromised, and on-cluster ELK can be tampered with.
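
One way such a baseline could be written as an audit policy file. This is a sketch, not a drop-in policy: it keeps Secrets, ConfigMaps and token reviews at Metadata level (an assumption borrowed from common upstream examples, so that secret values are never written to the log) and records full bodies only for writes to RBAC, admission and certificate objects.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages: ["RequestReceived"]              # skip the duplicate early-stage event
rules:
# Rules are evaluated in order; the first match wins.
# Payload-sensitive resources: who/what/when only, never the body.
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets", "configmaps"]
  - group: "authentication.k8s.io"
    resources: ["tokenreviews"]
# Full request and response bodies for writes that change who can do what.
- level: RequestResponse
  verbs: ["create", "update", "patch", "delete"]
  resources:
  - group: "rbac.authorization.k8s.io"
  - group: "admissionregistration.k8s.io"
  - group: "certificates.k8s.io"
# Everything else: metadata only.
- level: Metadata
```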

Pod Security Standards — The Fast Win

PodSecurityPolicy was deprecated in Kubernetes 1.21 and removed in 1.25. Its replacement is a built-in admission controller (PodSecurity) that enforces the Pod Security Standards, three named profiles (Privileged, Baseline, Restricted), at the namespace level via labels. This is the cheapest single hardening any cluster can adopt.

Profile    | Allows | Forbids | Use for
Privileged | Anything | Nothing | System namespaces with operators that need it
Baseline   | Reasonable defaults | hostNetwork/hostPID/hostIPC, privileged containers, hostPath volumes, non-default capabilities | Most app workloads
Restricted | Hardened defaults: non-root, drop ALL caps (so no NET_RAW), RuntimeDefault seccomp, no privilege escalation | Almost everything risky | Default for new namespaces

Note: Restricted does not require a read-only root filesystem; set readOnlyRootFilesystem: true yourself where the workload allows it.
yaml — namespace label flips PSS to Restricted
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # audit and warn never block pods: audit annotates the API audit log,
    # warn returns a warning to the client. Set these first, add enforce last.
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Set audit and warn first: audit records violations in the API audit log and warn returns warnings to kubectl and CI, and neither breaks deployments. Once the violations are fixed, add enforce. The same audit-then-enforce pattern shows up everywhere in K8s security.
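
For reference, a sketch of a pod that should be admitted under the Restricted profile. The name, image and UID are placeholders; the fields shown are the ones the profile checks.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restricted-ok                        # hypothetical name
  namespace: payments
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001                         # placeholder non-root UID
    seccompProfile: { type: RuntimeDefault }
  containers:
  - name: app
    image: registry.example.com/payments:1.0 # placeholder image
    securityContext:
      allowPrivilegeEscalation: false
      capabilities: { drop: ["ALL"] }
      readOnlyRootFilesystem: true           # not required by Restricted, but a sensible extra
```

Before labelling a live namespace, kubectl label --dry-run=server --overwrite ns payments pod-security.kubernetes.io/enforce=restricted reports which existing pods would violate the profile without changing anything.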

NetworkPolicies — Default Deny Is the Default You Need

By default, every pod can talk to every other pod and the internet. That is unacceptable in any non-trivial cluster. A baseline default-deny NetworkPolicy per namespace, plus explicit allow rules per app, transforms the security posture more than any other single change.

yaml — default-deny + scoped allow
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: payments }
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: allow-checkout-to-payments, namespace: payments }
spec:
  podSelector: { matchLabels: { app: payments } }
  policyTypes: ["Ingress"]
  ingress:
  - from:
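    # kubernetes.io/metadata.name is set automatically on every namespace;
    # a bare "name" label only exists if someone added it by hand
    - namespaceSelector: { matchLabels: { kubernetes.io/metadata.name: checkout } }
      podSelector:       { matchLabels: { app: api } }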
    ports:
    - port: 8080
      protocol: TCP

Caveat: stock NetworkPolicy selects peers by namespace label, pod label, or IP block (CIDR); it cannot match by FQDN or SPIFFE identity, and it has no L7 rules. For those, use CiliumNetworkPolicy (CNP) or CiliumClusterwideNetworkPolicy (CCNP): eBPF-based, with DNS-based egress, L7 HTTP/Kafka rules, and identity-based selection. Stock policies are also only enforced if the cluster's CNI plugin implements them. Many production estates outgrow stock policies within a year.

💡
DNS egress in K8s is hard
An app calling api.stripe.com resolves the name, gets an IP, then connects. Stock NetworkPolicies match on IP/CIDR, and Stripe's IPs change, so static rules break. A CiliumNetworkPolicy with toFQDNs rules watches DNS responses and dynamically allows egress to the resolved IPs, scoped to the pod that made the lookup. The other option is a forward proxy that the pod points at; the proxy enforces the FQDN list.
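
A sketch of the DNS-aware egress pattern with Cilium, assuming Cilium is the cluster's CNI and that cluster DNS runs as kube-dns in kube-system. The policy name and pod labels are hypothetical.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-stripe-egress                  # hypothetical name
  namespace: payments
spec:
  endpointSelector:
    matchLabels:
      app: payments
  egress:
  # Allow DNS to kube-dns and let Cilium inspect the answers;
  # without this rule the FQDN rule below has nothing to learn from.
  - toEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": kube-system
        "k8s:k8s-app": kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"
  # Allow HTTPS only to IPs that api.stripe.com resolved to.
  - toFQDNs:
    - matchName: "api.stripe.com"
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
```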

Admission Control — Where Policy Becomes Enforcement

Built-in admission controllers (PSS, ResourceQuota, LimitRange) handle baseline cases. For richer policy — "images must be signed," "deployments must declare resource limits," "S3 bucket names must include the team tag" — use a programmable admission framework.

[Figure: kubectl apply → authn / authz (RBAC) → mutating webhooks (add sidecar, set fields) → validating webhooks (reject if non-compliant) → etcd (persisted). Mutate first (sidecars, defaults), then validate. Both can be implemented with OPA/Gatekeeper, Kyverno, or CEL.]
The admission chain. Mutating webhooks change the object; validating webhooks accept or reject.

OPA / Gatekeeper

Policies in Rego, distributed via ConstraintTemplate+Constraint objects. The Open Policy Agent ecosystem extends well beyond K8s (Terraform, CI, microservice authz). Steeper learning curve but the most expressive.
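
To show the ConstraintTemplate + Constraint split, here is a sketch modelled on the widely used required-labels example. The constraint name and the choice of the team label are assumptions for illustration.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels                    # must be the lowercase form of the kind below
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8srequiredlabels

      # Emit a violation when any required label is absent from the object.
      violation[{"msg": msg}] {
        provided := {label | input.review.object.metadata.labels[label]}
        required := {label | label := input.parameters.labels[_]}
        missing := required - provided
        count(missing) > 0
        msg := sprintf("missing required labels: %v", [missing])
      }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: workloads-must-have-team             # hypothetical name
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment", "StatefulSet"]
  parameters:
    labels: ["team"]
```

The template holds the reusable logic; each Constraint instantiates it with its own scope and parameters, which is where the extra expressiveness (and the learning curve) comes from.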

Kyverno

K8s-native — policies are themselves Kubernetes resources written in YAML. Lower learning curve, generates and mutates as well as validates, includes a verifyImages rule that integrates with Sigstore (Day 5). For most teams, the right starting point.

Validating Admission Policy (CEL)

Kubernetes 1.30+ ships a built-in admission policy engine using CEL expressions — no external webhook required. Worth using for simple policies; complex composition still favours Kyverno/OPA.
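
A sketch of a simple CEL policy of the kind this engine handles well, assuming a 1.30+ cluster. The policy names and the environment: prod namespace label are hypothetical.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-team-label                   # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments", "statefulsets"]
  validations:
  # has() guards against objects that carry no labels map at all
  - expression: "has(object.metadata.labels) && 'team' in object.metadata.labels"
    message: "workloads must declare a 'team' label"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-team-label-binding
spec:
  policyName: require-team-label
  validationActions: ["Deny"]                # start with ["Warn", "Audit"] to observe first
  matchResources:
    namespaceSelector:
      matchLabels:
        environment: prod                    # hypothetical namespace label
```

A policy does nothing until a binding references it, which makes it easy to roll out the same rule namespace by namespace.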

yaml — Kyverno: require team label and signed image
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: prod-image-and-team }
spec:
  validationFailureAction: Enforce
  rules:
  - name: require-team-label
    match: { resources: { kinds: ["Deployment", "StatefulSet"] } }
    validate:
      message: "deployments must declare a 'team' label"
      pattern: { metadata: { labels: { team: "?*" } } }
  - name: only-signed-images
    match: { resources: { kinds: ["Pod"] } }
    verifyImages:
    - imageReferences: ["ghcr.io/my-org/*"]
      attestors:
      - entries:
        - keyless:
            issuer: "https://token.actions.githubusercontent.com"
            subject: "https://github.com/my-org/*/.github/workflows/release.yml@refs/heads/main"

Two rules, one cluster admission gate. Deployments and StatefulSets without a team label are rejected; pods running ghcr.io/my-org images not signed by the org's GitHub Actions release workflow are rejected. The keyless signing identity comes from Sigstore, which we cover on Day 5.

Workload Identity — IRSA, Workload Identity, Managed Identity

The pre-cloud-native default — bake an AWS access key into a Secret — should never appear in your cluster. The modern pattern federates the cluster's OIDC issuer to the cloud provider's STS:

  • EKS / IRSA — annotate ServiceAccount with role ARN; the projected SA token is exchanged for STS credentials by the AWS SDK.
  • EKS Pod Identity (newer) — managed-agent flavour; no SDK token-exchange code needed.
  • GKE Workload Identity Federation — KSA bound to GSA via project policy.
  • AKS Workload Identity — projected SA token federated to Entra ID.

The clean rule: nothing long-lived on disk in the pod, ever. Tokens are projected, short-lived (~1 hour), audience-bound, and rotate automatically. CSI drivers like Secrets Store CSI Driver can mount cloud-provider secrets (AWS Secrets Manager, GCP Secret Manager, Vault) at well-known paths without storing them as K8s Secrets.
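
As an illustration of the IRSA flavour, a sketch of the ServiceAccount annotation and a pod that uses it. The account ID, role name and image are placeholders, and the IAM role with its OIDC trust policy is assumed to exist already.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api
  namespace: payments
  annotations:
    # placeholder account ID and role name
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/payments-api
---
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
  namespace: payments
spec:
  serviceAccountName: payments-api           # pods pick up the role through this ServiceAccount
  containers:
  - name: app
    image: registry.example.com/payments:1.0 # placeholder image
```

On EKS, a mutating webhook injects the projected token and the AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE environment variables, so the AWS SDK finds its credentials without any key material in the manifest.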

Image & Supply Chain — Pre-Day-5 Foundations

Three image controls that pay back immediately, ahead of the deeper supply-chain treatment on Day 5:

  1. Distroless or chiseled bases. No shell, no package manager — a compromise has nowhere to install tools. gcr.io/distroless/static for Go binaries, distroless/cc for C, language-specific minimal bases for Python/Node.
  2. Pin by digest. Tags can be moved; digests cannot. Production deployments reference image@sha256:abc…. CI computes the digest at build time and propagates it.
  3. Scan in CI and at admission. Trivy, Grype, or your registry's scanner produce a CVE list at build; an admission policy rejects images with critical CVEs older than N days.
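
A sketch of how the pin-by-digest rule could be enforced at admission with Kyverno, modelled on its common require-digest sample. The policy name is hypothetical, and it starts in Audit mode on the assumption that existing workloads still use tags.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digest                 # hypothetical name
spec:
  validationFailureAction: Audit             # switch to Enforce once violations are cleaned up
  rules:
  - name: images-pinned-by-digest
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "images must be referenced by digest (image@sha256:...)"
      pattern:
        spec:
          containers:
          - image: "*@sha256:*"              # every container image must carry a digest
```

This sketch checks regular containers only; init and ephemeral containers would need their own pattern entries.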

Image vulnerability fatigue

A modern Node base image often shows hundreds of CVEs, the vast majority of which are not exploitable in your context. Use the VEX (Vulnerability Exploitability eXchange) format, for example via OpenVEX tooling, to assert non-exploitability and silence noise. Without this discipline, scanner output becomes wallpaper and the team stops reading it.

Runtime Detection — Falco and eBPF

Static admission stops misconfiguration; runtime detection catches the rest. Falco (CNCF graduated, originally Sysdig) reads kernel events via eBPF and matches them against a rule language to surface suspicious behaviour: spawn a shell from a web server pod, read /etc/shadow, write to /proc/sys, mount-from-the-host, exec into a sensitive container.

yaml — Falco rule (excerpt)
- rule: Shell spawned in container
  desc: A shell was spawned inside a container — usually an attacker action.
  condition: >
    spawned_process and container and shell_procs
    and not k8s_containers_excluded
  output: >
    Shell in container (user=%user.name container=%container.name
    parent=%proc.pname cmdline=%proc.cmdline)
  priority: WARNING
  tags: [container, shell, mitre_execution, T1059]

Rules ship pre-tagged with MITRE ATT&CK technique IDs. Modern stacks pair Falco with a response engine — Falco Talon, Sysdig Secure, Datadog CSM — that auto-responds (kill pod, taint node, revoke creds). Pair with Tetragon (Cilium) for eBPF-based observability and policy enforcement that can block at syscall time, not just alert.

Quick check
A pod with a known SQL injection vulnerability is exploited and the attacker runs cat /var/run/secrets/kubernetes.io/serviceaccount/token | curl …. Which controls would have stopped which steps of this attack?
Answer
  • SA token automount disabled (automountServiceAccountToken: false) for pods that don't talk to the K8s API: the file would not exist.
  • Restricted PSS: the process runs as non-root and cannot escalate privileges. That does not protect the token, which the process must be able to read; it mainly matters once the attacker tries to drop tools or escalate.
  • NetworkPolicy default-deny egress: the curl to the attacker's host fails.
  • Falco rule for "shell or unexpected exec in pod" or "egress to unknown destination from pod": fires an alert.
  • Audit logs on token use: use of the stolen token against the API server appears in the K8s audit log, and any STS exchange appears in CloudTrail or the cloud provider's audit log.
Defense in depth: each layer reduces the chance the attack succeeds quietly.
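
The first control in that answer is a one-line setting. A sketch, with hypothetical names, of turning token automount off for a ServiceAccount whose pods never call the K8s API:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-frontend                         # hypothetical ServiceAccount
  namespace: payments
automountServiceAccountToken: false          # pods using this SA get no token file by default
```

The same field exists on the pod spec, where it overrides the ServiceAccount's setting for that pod.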

Multi-Tenancy — Soft, Hard, and the Honest Answer

K8s offers soft multi-tenancy via namespaces + RBAC + NetworkPolicies + ResourceQuotas. This is fine for cooperating teams. Hard multi-tenancy (running untrusted tenants on a shared cluster) is hard — kernel CVEs cross namespaces, noisy-neighbour CPU/memory effects exist, and many controllers leak metadata. The honest options for hard multi-tenancy:

  • One cluster per tenant. Operationally heavy; security simple.
  • Sandboxed runtimes. gVisor or Kata for the tenant pods; same cluster, separate kernel surface.
  • vCluster / virtual clusters. A cluster-in-a-cluster approach: each tenant gets an isolated control plane, while the data plane (nodes and kernel) stays shared, so pair it with sandboxed runtimes.
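
For the soft multi-tenancy case, a sketch of the ResourceQuota and LimitRange pair that limits noisy-neighbour effects per namespace. The namespace name and every number are placeholders to be sized per team.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a                          # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
  - type: Container
    default: { cpu: 500m, memory: 512Mi }          # applied as limits when a container sets none
    defaultRequest: { cpu: 100m, memory: 128Mi }   # applied as requests when a container sets none
```

The LimitRange matters because a quota on requests and limits rejects pods that declare neither; the defaults keep such pods schedulable.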

The Cluster Hardening Checklist

If you do exactly these, you are ahead of most clusters in production today.

  1. Pod Security Standards Restricted on all non-system namespaces.
  2. Default-deny NetworkPolicy per namespace; explicit allow per app.
  3. RBAC review — no wildcard verbs in ClusterRole; SA tokens not auto-mounted unless needed.
  4. Workload identity (IRSA / Workload Identity / Managed Identity) — no static cloud secrets.
  5. Image policy (Kyverno verifyImages with Sigstore) — only signed, scanned, and digest-pinned.
  6. Audit log RequestResponse for sensitive verbs; shipped off-cluster.
  7. Falco/Tetragon runtime detection with at least the default ruleset.
  8. Encrypt etcd at rest (KMS provider plugin) and use Sealed Secrets / External Secrets / SOPS in Git rather than plain Secrets.
  9. IMDSv2 required on EKS nodes; restrict pod access to the metadata IP.
  10. Disable the default NodePort exposure path; ingress through a documented LB only.
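
For item 9, a sketch of restricting pod access to the metadata IP with a stock NetworkPolicy, intended for namespaces that cannot yet adopt default-deny egress. The policy name is hypothetical, and it assumes the usual IPv4 link-local metadata address 169.254.169.254.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-except-metadata         # hypothetical name
  namespace: payments
spec:
  podSelector: {}                            # every pod in the namespace
  policyTypes: ["Egress"]
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except: ["169.254.169.254/32"]       # instance metadata endpoint
```

NetworkPolicies are additive, so this only holds if no other policy in the namespace allows that address, and pods running with hostNetwork: true are not covered by it.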
Mnemonic — five planes
"Container, API, Identity, Network, Runtime."
  • Container — PSS, capabilities, seccomp.
  • API — RBAC, audit, OIDC.
  • Identity — projected tokens, workload identity.
  • Network — default-deny, NetworkPolicy, mesh.
  • Runtime — Falco/Tetragon, sandboxed runtimes.
Flashcard
A pod has privileged: true, hostPID: true, and mounts the host's /var/run/docker.sock. An attacker compromises the pod via a webapp RCE. What is the realistic blast radius?
Answer
The whole node, then likely the cluster. Privileged + hostPID + docker.sock is effectively root on the host: the attacker can nsenter into other containers' namespaces, schedule new privileged containers, read every Secret mounted on this node, and reach the kubelet. From there, kubelet credentials reach the API server with whatever privileges this node has — often enough to compromise the cluster. This is why Pod Security Standards Restricted exists; this combination is the canonical anti-pattern.
🔑
Key takeaways
1) Containers do not isolate at the kernel — namespaces+cgroups+capabilities+seccomp narrow the surface; for untrusted workloads use gVisor/Kata. 2) RBAC and audit at the API server are non-negotiable; wildcard verbs in ClusterRole are escalation primitives. 3) Pod Security Standards Restricted + default-deny NetworkPolicy are the cheapest big-impact changes. 4) Admission control (Kyverno/OPA/CEL) turns soft policy into hard guardrails — including image signing. 5) Falco / Tetragon close the runtime gap; pair with audit logs and on-call. 6) Workload identity replaces static cloud secrets entirely.
