
Kubernetes & Container Security
Containers don't contain. Master the container threat model, Pod Security Standards, K8s RBAC, NetworkPolicies, admission control with Gatekeeper and Kyverno, runtime detection with Falco and eBPF, and the supply-chain practices that keep production images trustworthy.
What you will learn
Kubernetes is a single, multi-tenant, API-driven control plane that schedules code onto shared kernels. That single sentence contains every category of K8s vulnerability that has ever shipped: API exposure, multi-tenant boundary failures, the kernel as the real container boundary, and the rise of admission-time policy as the place where most security wins live. This chapter is intentionally dense — it is the one chapter you will revisit most.
Containers Don't Contain — The Real Boundary Is the Kernel
A container is a Linux process with a curated view of the world: namespaces (mount, PID, network, UTS, IPC, user, cgroup) hide the rest of the system; cgroups limit resources; capabilities drop pieces of root; seccomp filters the syscall surface; optionally AppArmor or SELinux add MAC. None of these protect against a kernel-level vulnerability — that is the hypervisor's job, or a sandboxed runtime's.
Capabilities — root is not one privilege
Linux splits root's powers into ~40 capabilities. Containers default to a small subset (CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, etc.). Three to watch:
- CAP_SYS_ADMIN: "the new root." Mount filesystems, manipulate namespaces, configure devices. Never grant in production unless you know exactly why.
- CAP_NET_RAW: craft raw packets. Useful for ICMP scanning from a compromised pod. Drop unless needed.
- CAP_SYS_PTRACE: debug other processes. In a multi-container pod (sidecar + app) it can be used by one container against another.
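As a minimal sketch of the capability advice above, the pod below drops every capability and adds back only the one it needs. The pod name, image, and port are hypothetical; note that Kubernetes capability names omit the CAP_ prefix.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web                                # hypothetical name
spec:
  containers:
    - name: web
      image: ghcr.io/my-org/web:1.0.0      # hypothetical image
      ports:
        - containerPort: 80
      securityContext:
        allowPrivilegeEscalation: false    # blocks setuid-style privilege gain
        capabilities:
          drop: ["ALL"]                    # removes NET_RAW, SYS_PTRACE, SYS_ADMIN, ...
          add: ["NET_BIND_SERVICE"]        # only needed to bind ports below 1024
```

If the workload listens on a high port, the add line can usually be removed entirely.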
seccomp and the syscall floor
The default Docker seccomp profile blocks ~50 syscalls; the Kubernetes RuntimeDefault profile mirrors it. Always set seccompProfile: { type: RuntimeDefault } in the pod spec — by default many K8s distributions still run Unconfined. For higher-risk workloads, generate a custom profile from observed syscalls (e.g. with syscall2seccomp or by running under seccomp-bpf tracing).
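A sketch of how the seccomp setting might look in practice: the pod-level default is RuntimeDefault, and one higher-risk container overrides it with a custom Localhost profile. The names and the profile path are assumptions; a Localhost profile must already exist on each node under the kubelet's seccomp directory.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: parser                               # hypothetical name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault                   # applies to every container unless overridden
  containers:
    - name: app
      image: ghcr.io/my-org/app:1.0.0        # hypothetical image
    - name: untrusted-parser
      image: ghcr.io/my-org/parser:1.0.0     # hypothetical image
      securityContext:
        seccompProfile:
          type: Localhost
          # hypothetical profile, path relative to the kubelet seccomp root
          localhostProfile: profiles/parser.json
```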
When the kernel is not enough — sandboxed runtimes
Two hardened runtimes plug into K8s as RuntimeClasses. gVisor (Google) intercepts syscalls in user-space and re-implements the kernel surface — high isolation, ~5-15% perf cost, some compatibility limits. Kata Containers runs each pod in a tiny VM via QEMU/Cloud Hypervisor — VM-grade isolation at higher startup cost. Both are appropriate for untrusted multi-tenant workloads (CI runners, function-as-a-service).
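To illustrate the RuntimeClass mechanism, here is a sketch that registers gVisor and opts one pod into it. It assumes the nodes' container runtime is already configured with a handler named runsc (the conventional gVisor handler name); the pod name and image are hypothetical.

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc          # must match a handler configured in containerd/CRI-O on the node
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-build                      # hypothetical CI-runner pod
spec:
  runtimeClassName: gvisor                   # schedule this pod onto the sandboxed runtime
  containers:
    - name: runner
      image: ghcr.io/my-org/ci-runner:1.0.0  # hypothetical image
```

A Kata setup would look the same apart from the handler name, which depends on how Kata was installed.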
The API Server — Authentication, RBAC, Audit
Every K8s action is an API call. Hardening the cluster is mostly hardening this API.
Authentication paths
The kube-apiserver accepts several authn modes simultaneously: client certs, static tokens (deprecated), bootstrap tokens, OIDC, service account JWTs, webhook authenticators. For humans, OIDC + IdP MFA is the only acceptable choice in 2026. For workloads, projected service-account tokens are the modern default (audience-bound, time-limited, automatically rotated). The legacy non-expiring SA token files are deprecated; kubectl create token --duration=1h is the replacement.
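For workloads, a projected token can also be requested explicitly with its own audience and lifetime. The sketch below assumes a hypothetical ServiceAccount, image, and an audience string of vault; the kubelet refreshes the file before it expires.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: token-demo                           # hypothetical name
  namespace: payments
spec:
  serviceAccountName: payments-api           # hypothetical ServiceAccount
  automountServiceAccountToken: false        # skip the default API-server token
  containers:
    - name: app
      image: ghcr.io/my-org/payments:1.0.0   # hypothetical image
      volumeMounts:
        - name: bound-token
          mountPath: /var/run/secrets/tokens
          readOnly: true
  volumes:
    - name: bound-token
      projected:
        sources:
          - serviceAccountToken:
              path: vault-token
              audience: vault                # hypothetical audience; only this relying party should accept it
              expirationSeconds: 3600        # one hour; the API minimum is 600
```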
RBAC — least privilege at the API
RBAC has four objects: Role (namespaced) and ClusterRole (cluster-wide) define rules; RoleBinding and ClusterRoleBinding attach them to subjects (Users, Groups, ServiceAccounts). Three rules of thumb that cover most mistakes:
- No wildcard verbs in ClusterRole. verbs: ["*"] on Pods is admin via exec; on Secrets it is a mass leak; on RoleBinding it is a privilege-escalation primitive.
- No create + list on Secrets in default namespaces. Together they enable cross-pod secret stealing in many distros.
- Audit escalate and bind verbs: they let a principal grant a different set of permissions than they themselves hold.
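A least-privilege grant following the rules of thumb above might look like this sketch: a namespaced Role with explicit read-only verbs, bound to a single ServiceAccount. All names are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deploy-reader
  namespace: payments
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]      # explicit verbs, no "*"
  - apiGroups: [""]
    resources: ["pods", "pods/log"]      # deliberately excludes pods/exec and secrets
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-readonly-deploy-reader
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: ci-readonly                    # hypothetical ServiceAccount
    namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: deploy-reader
```

Using a Role rather than a ClusterRole keeps the grant inside one namespace even if the subject is later reused elsewhere.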
- pods/exec = code execution as the pod's SA.
- impersonate = act as another user/group.
- nodes/proxy = read kubelet (incl. exec).
- configmaps create/update on system namespaces = backdoor admission.

The CNCF maintains an evergreen list (SIG Security); rakkess/rbac-lookup map them in your cluster.

Audit policy
The audit log captures every API request. The default policy is too noisy in some distros and missing critical events in others. A good baseline: RequestResponse for sensitive verbs (create, delete, patch on Secrets, RBAC objects, ValidatingWebhookConfigurations, Certificates, AuditSinks), Metadata for everything else, with reads on Secrets explicitly scrubbed. Ship to a separate logging tenant — kube-apiserver pods can be compromised, and on-cluster ELK can be tampered with.
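One possible shape for such a baseline is sketched below. It is an interpretation, not the only valid policy: because RequestResponse on Secrets would write secret payloads into the log, this sketch keeps every Secret request at Metadata and reserves RequestResponse for writes to RBAC, admission webhook, and certificate objects. Rules are evaluated top to bottom and the first match wins.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages: ["RequestReceived"]          # skip the duplicate early-stage event
rules:
  # Secrets: record who did what, never the request or response body.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # Writes to security-critical API groups: keep full request and response.
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
      - group: "admissionregistration.k8s.io"
      - group: "certificates.k8s.io"
  # Everything else: metadata only.
  - level: Metadata
```

The file is passed to the API server with --audit-policy-file; where the events are shipped is configured separately.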
Pod Security Standards — The Fast Win
The deprecated PodSecurityPolicy is gone. Pod Security Standards ship as a built-in admission controller (PodSecurity) that enforces three named profiles — Privileged, Baseline, Restricted — at the namespace level via labels. This is the cheapest single hardening any cluster can adopt.
| Profile | Allows | Forbids | Use for |
|---|---|---|---|
| Privileged | Anything | Nothing | System namespaces with operators that need it |
| Baseline | Reasonable defaults | HostNetwork/PID/IPC, privileged containers, hostPath volumes, non-default capabilities | Most app workloads |
| Restricted | Hardened pods only: non-root, drop ALL caps (NET_BIND_SERVICE may be added back), RuntimeDefault or Localhost seccomp, no privilege escalation | Everything Baseline forbids, plus almost everything else risky (read-only root FS is not part of the profile; set readOnlyRootFilesystem separately) | Default for new namespaces |
    apiVersion: v1
    kind: Namespace
    metadata:
      name: payments
      labels:
        pod-security.kubernetes.io/enforce: restricted
        pod-security.kubernetes.io/enforce-version: latest
        # audit and warn report violations without blocking pods;
        # observe them before relying on enforce
        pod-security.kubernetes.io/audit: restricted
        pod-security.kubernetes.io/warn: restricted

Use audit mode first to log violations without breaking deployments, then promote to warn (CI/UI warnings), then enforce. The same audit-then-enforce pattern shows up everywhere in K8s security.
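As a sketch of an intermediate rollout stage, a namespace that is not yet ready for Restricted could enforce Baseline while auditing and warning against Restricted. The namespace name is hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: checkout                                        # hypothetical namespace mid-migration
  labels:
    pod-security.kubernetes.io/enforce: baseline        # blocks only the clearly dangerous pods today
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted        # logs what Restricted would reject
    pod-security.kubernetes.io/warn: restricted         # shows the same findings to kubectl/CI users
```

Once the audit log and warnings go quiet, the enforce label can be raised to restricted.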
NetworkPolicies — Default Deny Is the Default You Need
By default, every pod can talk to every other pod and the internet. That is unacceptable in any non-trivial cluster. A baseline default-deny NetworkPolicy per namespace, plus explicit allow rules per app, transforms the security posture more than any other single change.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata: { name: default-deny, namespace: payments }
    spec:
      podSelector: {}
      policyTypes: ["Ingress", "Egress"]
    ---
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata: { name: allow-checkout-to-payments, namespace: payments }
    spec:
      podSelector: { matchLabels: { app: payments } }
      policyTypes: ["Ingress"]
      ingress:
        - from:
            # both selectors in one entry: pods labelled app=api AND in a
            # namespace that carries the label name=checkout
            - namespaceSelector: { matchLabels: { name: checkout } }
              podSelector: { matchLabels: { app: api } }
          ports:
            - port: 8080
              protocol: TCP

Caveat: stock NetworkPolicy is namespace+label scoped; it cannot match by FQDN, by SPIFFE identity, or apply egress L7 rules. For those, use CiliumNetworkPolicy (CNP / CCNP), which is eBPF-based and supports DNS-based egress, L7 HTTP/Kafka rules, and identity-based selection. Most production estates outgrow stock policies within a year.
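A sketch of DNS-based egress with Cilium, assuming Cilium is the cluster's CNI and that cluster DNS runs as kube-dns in kube-system. The first rule lets the pod resolve names through Cilium's DNS proxy (which is how resolved IPs are learned); the second allows HTTPS only to the named host. The policy name and pod labels are hypothetical.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-stripe-egress                  # hypothetical name
  namespace: payments
spec:
  endpointSelector:
    matchLabels:
      app: payments
  egress:
    # Allow DNS lookups and let Cilium observe the answers.
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s:k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # Allow HTTPS only to IPs that api.stripe.com resolved to.
    - toFQDNs:
        - matchName: api.stripe.com
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
```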
A pod that calls api.stripe.com resolves the name, gets an IP, then connects. Stock NetworkPolicies match on IP/CIDR; Stripe's IPs change, so static rules break. Cilium's toFQDNs rules watch DNS responses and dynamically insert egress rules for the resolved IPs, scoped to that pod's lookup. The other option is a forward proxy that the pod points at; the proxy enforces the FQDN list.

Admission Control — Where Policy Becomes Enforcement
Built-in admission controllers (PSS, ResourceQuota, LimitRange) handle baseline cases. For richer policy — "images must be signed," "deployments must declare resource limits," "S3 bucket names must include the team tag" — use a programmable admission framework.
OPA / Gatekeeper
Policies in Rego, distributed via ConstraintTemplate+Constraint objects. The Open Policy Agent ecosystem extends well beyond K8s (Terraform, CI, microservice authz). Steeper learning curve but the most expressive.
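To show the ConstraintTemplate + Constraint split, here is a sketch modelled on the widely used required-labels example. The template holds the Rego and a parameter schema; the Constraint instantiates it for Deployments and a team label. Names are illustrative, and the Constraint API version may differ between Gatekeeper releases.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # One violation per request that lacks any required label.
        violation[{"msg": msg}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("missing required labels: %v", [missing])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: deployments-must-have-team
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    labels: ["team"]
```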
Kyverno
K8s-native — policies are themselves Kubernetes resources written in YAML. Lower learning curve, generates and mutates as well as validates, includes a verifyImages rule that integrates with Sigstore (Day 5). For most teams, the right starting point.
Validating Admission Policy (CEL)
Kubernetes 1.30+ ships a built-in admission policy engine using CEL expressions — no external webhook required. Worth using for simple policies; complex composition still favours Kyverno/OPA.
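A simple CEL policy of the kind described might look like the sketch below: it requires a non-empty team label on Deployments and StatefulSets, and a binding switches it on cluster-wide. Object names are hypothetical.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-team-label
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments", "statefulsets"]
  validations:
    - expression: >-
        has(object.metadata.labels) &&
        'team' in object.metadata.labels &&
        object.metadata.labels['team'] != ''
      message: "workloads must declare a non-empty 'team' label"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-team-label-binding
spec:
  policyName: require-team-label
  validationActions: ["Deny"]     # use ["Audit"] or ["Warn"] to observe before blocking
```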
    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata: { name: prod-image-and-team }
    spec:
      validationFailureAction: Enforce
      rules:
        - name: require-team-label
          match: { resources: { kinds: ["Deployment", "StatefulSet"] } }
          validate:
            message: "deployments must declare a 'team' label"
            pattern: { metadata: { labels: { team: "?*" } } }
        - name: only-signed-images
          match: { resources: { kinds: ["Pod"] } }
          verifyImages:
            - imageReferences: ["ghcr.io/my-org/*"]
              attestors:
                - entries:
                    - keyless:
                        issuer: "https://token.actions.githubusercontent.com"
                        subject: "https://github.com/my-org/*/.github/workflows/release.yml@refs/heads/main"

Two rules, one cluster admission gate. Deployments and StatefulSets without a team label are rejected; pods running images not signed by the org's GitHub Actions release workflow are rejected. The keyless signing identity comes from Sigstore, which we cover on Day 5.
Workload Identity — IRSA, Workload Identity, Managed Identity
The pre-cloud-native default — bake an AWS access key into a Secret — should never appear in your cluster. The modern pattern federates the cluster's OIDC issuer to the cloud provider's STS:
- EKS / IRSA — annotate ServiceAccount with role ARN; the projected SA token is exchanged for STS credentials by the AWS SDK.
- EKS Pod Identity (newer) — managed-agent flavour; no SDK token-exchange code needed.
- GKE Workload Identity Federation — KSA bound to GSA via project policy.
- AKS Workload Identity — projected SA token federated to Entra ID.
The clean rule: nothing long-lived on disk in the pod, ever. Tokens are projected, short-lived (~1 hour), audience-bound, and rotate automatically. CSI drivers like Secrets Store CSI Driver can mount cloud-provider secrets (AWS Secrets Manager, GCP Secret Manager, Vault) at well-known paths without storing them as K8s Secrets.
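For the EKS / IRSA case, the wiring on the Kubernetes side is a single annotation. This sketch assumes the cluster's OIDC provider is already registered in IAM and that the role's trust policy names this ServiceAccount; the account ID, role name, and image are placeholders.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api                         # hypothetical ServiceAccount
  namespace: payments
  annotations:
    # placeholder ARN; the role's trust policy must allow this namespace/name
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/payments-s3-reader
---
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
  namespace: payments
spec:
  serviceAccountName: payments-api           # the AWS SDK picks up the projected token automatically
  containers:
    - name: app
      image: ghcr.io/my-org/payments:1.0.0   # hypothetical image
```

No access key appears anywhere in the manifest, which is the point of the pattern.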
Image & Supply Chain — Pre-Day-5 Foundations
Three image controls that pay back immediately, ahead of the deeper supply-chain treatment on Day 5:
- Distroless or chiseled bases. No shell, no package manager, so a compromise has nowhere to install tools. gcr.io/distroless/static for Go binaries, distroless/cc for C, language-specific minimal bases for Python/Node.
- Pin by digest. Tags can be moved; digests cannot. Production deployments reference image@sha256:abc…. CI computes the digest at build time and propagates it.
- Scan in CI and at admission. Trivy, Grype, or your registry's scanner produce a CVE list at build; an admission policy rejects images with critical CVEs older than N days.
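Digest pinning can also be checked at admission. The sketch below is a Kyverno policy, in the same spec-level style used elsewhere in this chapter, that flags pods whose containers reference an image without a sha256 digest. The policy name is hypothetical; it starts in Audit, covers only spec.containers (init and ephemeral containers would need their own patterns), and newer Kyverno releases prefer a per-rule failure action.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digest                 # hypothetical name
spec:
  validationFailureAction: Audit             # observe first, then switch to Enforce
  rules:
    - name: images-pinned-by-digest
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "images must be referenced by digest (image@sha256:...)"
        pattern:
          spec:
            containers:
              - image: "*@sha256:*"          # wildcard match on the image string
```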
Image vulnerability fatigue
A modern Node base image often shows hundreds of CVEs, the vast majority of which are not exploitable in your context. Use the VEX (Vulnerability Exploitability eXchange) standard or vex/openvex tooling to assert non-exploitability and silence noise. Without this discipline, scanner output becomes wallpaper and the team stops reading it.
Runtime Detection — Falco and eBPF
Static admission stops misconfiguration; runtime detection catches the rest. Falco (CNCF graduated, originally Sysdig) reads kernel events via eBPF and matches them against a rule language to surface suspicious behaviour: spawn a shell from a web server pod, read /etc/shadow, write to /proc/sys, mount-from-the-host, exec into a sensitive container.
    - rule: Shell spawned in container
      desc: A shell was spawned inside a container, usually an attacker action.
      condition: >
        spawned_process and container and shell_procs
        and not k8s_containers_excluded
      output: >
        Shell in container (user=%user.name container=%container.name
        parent=%proc.pname cmdline=%proc.cmdline)
      priority: WARNING
      tags: [container, shell, mitre_execution, T1059]

Rules ship pre-tagged with MITRE ATT&CK technique IDs. Modern stacks pair Falco with a response engine, such as Falco Talon, Sysdig Secure, or Datadog CSM, that auto-responds (kill pod, taint node, revoke creds). Pair with Tetragon (Cilium) for eBPF-based observability and policy enforcement that can block at syscall time, not just alert.
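Custom Falco rules follow the same shape. The sketch below covers the "read /etc/shadow" behaviour mentioned earlier and is written with raw event fields rather than the default macros, so it does not depend on which ruleset is loaded. The rule name and tags are illustrative.

```yaml
- rule: Shadow file read in container
  desc: A process inside a container opened /etc/shadow for reading.
  condition: >
    evt.type in (open, openat, openat2)
    and evt.is_open_read=true
    and fd.name=/etc/shadow
    and container.id != host
  output: >
    /etc/shadow read in container (user=%user.name container=%container.name
    file=%fd.name cmdline=%proc.cmdline)
  priority: WARNING
  tags: [container, filesystem, mitre_credential_access]
```

Expect to add exceptions for legitimate readers (for example, login or PAM helpers in some images) before routing this to on-call.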
cat /var/run/secrets/kubernetes.io/serviceaccount/token | curl …. Which controls would have stopped which steps of this attack?
automountServiceAccountToken: false): the file would not exist. Restricted PSS would have forced the container to run as non-root (a read-only root FS is a separate readOnlyRootFilesystem setting), but the SA token mount is read-only anyway, so this only matters if the attacker tries to drop tools. NetworkPolicy default-deny egress: the curl to attacker.com fails. Falco rule for "shell or unexpected exec in pod" or "egress to unknown destination from pod": fires an alert. Audit log on token use: the STS exchange shows up in CloudTrail/Audit. Defense in depth: each layer reduces the chance the attack succeeds quietly.

Multi-Tenancy — Soft, Hard, and the Honest Answer
K8s offers soft multi-tenancy via namespaces + RBAC + NetworkPolicies + ResourceQuotas. This is fine for cooperating teams. Hard multi-tenancy (running untrusted tenants on a shared cluster) is hard — kernel CVEs cross namespaces, noisy-neighbour CPU/memory effects exist, and many controllers leak metadata. The honest options for hard multi-tenancy:
- One cluster per tenant. Operationally heavy; security simple.
- Sandboxed runtimes. gVisor or Kata for the tenant pods; same cluster, separate kernel surface.
- vCluster / virtual clusters — a cluster-in-a-cluster project. Tenant control plane is isolated; data plane shared with sandboxed runtimes.
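For the soft multi-tenancy building blocks named above, a per-tenant ResourceQuota plus LimitRange is the usual guard against noisy neighbours. This is a sketch with made-up numbers and a hypothetical tenant namespace; it limits resource consumption only and does nothing for kernel-level isolation.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: team-a                  # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      default:                       # limits applied when a container sets none
        cpu: 500m
        memory: 512Mi
      defaultRequest:                # requests applied when a container sets none
        cpu: 100m
        memory: 128Mi
```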
The Cluster Hardening Checklist
If you do exactly these, you are ahead of most clusters in production today.
- Pod Security Standards Restricted on all non-system namespaces.
- Default-deny NetworkPolicy per namespace; explicit allow per app.
- RBAC review: no wildcard verbs in ClusterRole; SA tokens not auto-mounted unless needed.
- Workload identity (IRSA / Workload Identity / Managed Identity): no static cloud secrets.
- Image policy (Kyverno verifyImages with Sigstore): only signed, scanned, and digest-pinned images.
- Audit log RequestResponse for sensitive verbs; shipped off-cluster.
- Falco/Tetragon runtime detection with at least the default ruleset.
- Encrypt etcd at rest (KMS provider plugin) and use Sealed Secrets / External Secrets / SOPS in Git rather than plain Secrets.
- IMDSv2 required on EKS nodes; restrict pod access to the metadata IP.
- Disable the default NodePort exposure path; ingress through a documented LB only.
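For the etcd encryption item in the checklist above, the API server reads an EncryptionConfiguration file. The sketch below assumes a KMS v2 plugin is already running on the control-plane node; the plugin name and socket path are hypothetical. On managed control planes this is usually a provider setting rather than a file you edit.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: ["secrets"]
    providers:
      # First provider is used for writes: envelope encryption via the KMS plugin.
      - kms:
          apiVersion: v2
          name: cloud-kms                                   # hypothetical plugin name
          endpoint: unix:///var/run/kmsplugin/socket.sock   # hypothetical socket path
          timeout: 3s
      # Fallback so Secrets written before encryption was enabled stay readable.
      - identity: {}
```

Existing Secrets are only encrypted when they are next written, so a one-off rewrite of all Secrets is normally part of the rollout.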
- Container — PSS, capabilities, seccomp.
- API — RBAC, audit, OIDC.
- Identity — projected tokens, workload identity.
- Network — default-deny, NetworkPolicy, mesh.
- Runtime — Falco/Tetragon, sandboxed runtimes.
privileged: true, hostPID: true, and mounts the host's /var/run/docker.sock. An attacker compromises the pod via a webapp RCE. What is the realistic blast radius?

nsenter into other containers' namespaces, schedule new privileged containers, read every Secret mounted on this node, and reach the kubelet. From there, kubelet credentials reach the API server with whatever privileges this node has, often enough to compromise the cluster. This is why Pod Security Standards Restricted exists; this combination is the canonical anti-pattern.

ClusterRole are escalation primitives.
3) Pod Security Standards Restricted + default-deny NetworkPolicy are the cheapest big-impact changes.
4) Admission control (Kyverno/OPA/CEL) turns soft policy into hard guardrails, including image signing.
5) Falco / Tetragon close the runtime gap; pair with audit logs and on-call.
6) Workload identity replaces static cloud secrets entirely.

- Kubernetes: Pod Security Standards (kubernetes.io)
- Kubernetes: RBAC Good Practices (kubernetes.io)
- NSA / CISA: Kubernetes Hardening Guidance v1.2 (defense.gov)
- NIST SP 800-190: Application Container Security Guide (nist.gov)
- Falco: Default rules (falco.org)
- Kyverno: Policy library (kyverno.io)
- CNCF: Cloud Native Security Whitepaper v2 (cncf.io)
- CIS Kubernetes Benchmark (cisecurity.org)