
Cloud, Containers & Incident Response
IAM done right, container and Kubernetes hardening, and what good incident response looks like when an alert fires at 03:00 — through detection, containment, eradication, and the post-mortem.
What you will learn
The first six days were about building it right. The last day is about operating it: the controls that protect the cloud infrastructure your code runs on, and the muscle memory that turns a nine-figure breach into a contained incident with a clean post-mortem.
Cloud IAM — The Real Perimeter
In the cloud era, identity is the perimeter. Most public-cloud breaches in the last five years trace back to one of three IAM failures: over-broad permissions, long-lived credentials, or missing MFA on the root account.
The IAM Rules That Pay for Themselves
- No root for daily work. Hardware-MFA the root, lock the credentials in a safe, audit on use.
- SSO only for humans. No IAM users with passwords; one SSO IdP with phishing-resistant MFA.
- Workload identity for services. IRSA on EKS, Workload Identity on GKE, Managed Identity on Azure. Static IAM keys are a smell.
- Least privilege per role. Start from Deny; add only what the access analyzer says was used. Re-review every quarter.
- Resource policies + condition keys. S3 buckets allowing only specific VPC endpoints; KMS keys allowing only specific roles.
- SCP / org policies as guardrails. "No public S3 buckets in this OU" enforced at the org level, not per-account.
- CloudTrail / Audit Log on, immutable, retained. Off-account, append-only.
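The least-privilege rule above can be partly automated. The following is a minimal sketch, not a replacement for a real access analyzer: it scans an IAM policy document for Allow statements that grant wildcard actions. The function names and the example policy (including the bucket name) are hypothetical and exist only for illustration.

```python
def as_list(value):
    """IAM accepts a single string or a list for Action/Resource; normalise to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def find_overbroad_statements(policy):
    """Return (statement_index, reason) for every Allow statement with a wildcard action."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        resources = as_list(stmt.get("Resource"))
        for action in as_list(stmt.get("Action")):
            if action == "*" or action.endswith(":*"):
                scope = "all resources" if "*" in resources else "scoped resources"
                findings.append((index, f"wildcard action {action!r} on {scope}"))
    return findings


# Hypothetical policy: the first statement is the kind of grant the rules warn about.
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "iam:*", "Resource": "*"},
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        },
    ],
}

if __name__ == "__main__":
    for index, reason in find_overbroad_statements(EXAMPLE_POLICY):
        print(f"statement {index}: {reason}")
```

A check like this could run in CI against policy files before they are applied. It only catches the most obvious wildcards, so it complements the quarterly review and does not replace it.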
… iam:* against itself, plus access to thousands of S3 buckets it didn't need. An SSRF vulnerability let the attacker hit IMDSv1 and steal those credentials. 100M records. The bug was SSRF; the breach was IAM. Both fixes, IMDSv2 and least-privilege roles, were free.
Container Hardening
A container is a process with a few extra namespaces. Most container CVEs are about that "few" being smaller than people think.
- Don't run as root. Set a non-root USER in the Dockerfile, and set readOnlyRootFilesystem: true in the container's securityContext.
- Drop all capabilities by default; add back only what you need.
- Distroless or minimal base images. Smaller images = less attack surface and fewer CVEs to patch.
- Pin base images by digest, not tag: FROM gcr.io/distroless/static@sha256:…
- Sign and verify images (cosign). Admission controllers (Kyverno, Gatekeeper) reject unsigned images at deploy.
- Scan continuously — not just at build. New CVEs land against old images every day.
- No secrets in env vars for production. Mount from a secret store at runtime; redact in logs.
FROM gcr.io/distroless/static-debian12@sha256:…
USER 65532:65532
COPY --chown=65532:65532 app /app
ENTRYPOINT ["/app"]
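Two of the rules above (non-root USER, digest-pinned base image) are easy to check mechanically. The sketch below is a deliberately naive Dockerfile linter, written under several assumptions: it does not handle line continuations, build arguments, or multi-stage builds that reference an earlier stage by name. The registry name and the digest made of repeated characters are placeholders, not real images.

```python
import re

# A digest reference ends in "@sha256:" followed by 64 lowercase hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def lint_dockerfile(text):
    """Return a list of findings for unpinned base images and a root final user."""
    findings = []
    final_user = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        instruction = parts[0].upper()
        if instruction == "FROM" and len(parts) > 1:
            # Skip flags such as --platform=... to find the image reference.
            image = next((p for p in parts[1:] if not p.startswith("--")), "")
            if image != "scratch" and not DIGEST_RE.search(image):
                findings.append(f"base image not pinned by digest: {image}")
            final_user = None  # a new stage starts from the base image's user
        elif instruction == "USER" and len(parts) > 1:
            final_user = parts[1]
    if final_user is None:
        findings.append("no USER instruction in final stage (may default to root)")
    elif final_user.split(":")[0] in ("root", "0"):
        findings.append("final stage runs as root")
    return findings


# Placeholder inputs for illustration only.
GOOD_DOCKERFILE = (
    "FROM registry.example/app-base@sha256:" + "a" * 64 + "\n"
    "USER 65532:65532\n"
    "COPY --chown=65532:65532 app /app\n"
    'ENTRYPOINT ["/app"]\n'
)
BAD_DOCKERFILE = "FROM python:3.12\nCOPY app /app\n"

if __name__ == "__main__":
    print(lint_dockerfile(GOOD_DOCKERFILE))
    print(lint_dockerfile(BAD_DOCKERFILE))
```

Some base images already set a non-root user, so the "no USER" finding is a prompt to check, not proof of a problem.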
Kubernetes Hardening Highlights
- Pod Security Standards set to restricted by namespace.
- NetworkPolicy default-deny, then explicit ingress/egress per service.
- RBAC least privilege: no cluster-admin RoleBindings outside emergency break-glass.
- Secrets via CSI driver from a vault; never plain Kubernetes Secret objects without etcd encryption-at-rest.
- Audit log shipped off-cluster.
- kubectl access via SSO + JIT, not kubeconfig files passed around.
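The first two items in the list above translate into small manifests. As a sketch, the Python below builds a Namespace labelled for the restricted Pod Security Standard and a default-deny NetworkPolicy, then prints them as JSON, which kubectl accepts alongside YAML. The namespace name "payments" and the policy name "default-deny-all" are assumptions chosen for the example.

```python
import json


def restricted_namespace(name):
    """Namespace that asks Pod Security Admission to enforce the 'restricted' profile."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {"pod-security.kubernetes.io/enforce": "restricted"},
        },
    }


def default_deny_policy(namespace):
    """NetworkPolicy that selects every pod and allows no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        # An empty podSelector matches all pods; listing both policy types
        # with no rules denies traffic in both directions.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


if __name__ == "__main__":
    for manifest in (restricted_namespace("payments"), default_deny_policy("payments")):
        print(json.dumps(manifest, indent=2))
```

Each service then needs its own explicit allow policies on top of this baseline. A default-deny egress rule also blocks DNS until it is allowed.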
Incident Response — The NIST Lifecycle
Even with everything above, something will eventually fire. Good IR is muscle memory: detect, contain, eradicate, recover, learn — practiced before it's needed.
Preparation — What You Do Before
- An on-call rotation that covers IR, with paging tested.
- A documented severity scale (Sev 0–3) and what each triggers.
- Runbooks for the top-five threat scenarios — credentials leak, public bucket, host compromise, DDoS, insider abuse.
- Pre-authorized authority — "the on-call security engineer can revoke any credential without further approval."
- Out-of-band comms (Signal, separate Slack workspace) for cases where the primary channel is suspect.
- Tabletop exercises every quarter. The first time you run an IR is the worst time to discover the runbook is wrong.
Detection & Triage
Most signals come from a SIEM (Sumo, Splunk, Datadog, Panther) consuming CloudTrail / k8s audit / app logs / EDR. Good detection rules are specific (low FP) and actionable (named runbook).
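To make "specific and actionable" concrete, here is a minimal sketch of a detection rule as data: one narrow predicate plus the runbook it points to. It assumes CloudTrail-style events where a root console sign-in carries eventName "ConsoleLogin" and userIdentity.type "Root". The rule name, severity, and runbook path are hypothetical, and a real SIEM would express this in its own rule language.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DetectionRule:
    name: str
    severity: int                      # 0 (worst) to 3, matching a Sev 0-3 scale
    runbook: str                       # hypothetical path to the named runbook
    matches: Callable[[dict], bool]    # narrow predicate keeps false positives low


def is_root_console_login(event):
    """True for a console sign-in by the account root user."""
    return (
        event.get("eventName") == "ConsoleLogin"
        and event.get("userIdentity", {}).get("type") == "Root"
    )


RULES = [
    DetectionRule(
        name="root-console-login",
        severity=1,
        runbook="runbooks/root-account-use.md",
        matches=is_root_console_login,
    )
]


def evaluate(event, rules=RULES):
    """Return one alert per matching rule, each carrying its runbook."""
    return [
        {"rule": rule.name, "severity": rule.severity, "runbook": rule.runbook}
        for rule in rules
        if rule.matches(event)
    ]


if __name__ == "__main__":
    sample = {"eventName": "ConsoleLogin", "userIdentity": {"type": "Root"}}
    print(evaluate(sample))
```

Carrying the runbook in the alert itself means the responder does not have to search for the right procedure.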
The first 15 minutes after an alert: classify severity, assign IC (incident commander), open the incident channel, capture timeline. Do not start fixing yet — capture state first.
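The first-15-minutes checklist can be captured in a very small record. The sketch below assumes a Sev 0-3 scale and stores timeline entries with UTC timestamps. The class and field names are illustrative and not taken from any incident tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    title: str
    severity: int                 # 0 (worst) to 3
    commander: str                # the assigned incident commander (IC)
    timeline: list = field(default_factory=list)

    def __post_init__(self):
        if self.severity not in (0, 1, 2, 3):
            raise ValueError("severity must be between 0 and 3")

    def log(self, author, note, at=None):
        """Append a (timestamp, author, note) entry; the timestamp defaults to now in UTC."""
        at = at or datetime.now(timezone.utc)
        entry = (at.isoformat(timespec="seconds"), author, note)
        self.timeline.append(entry)
        return entry


if __name__ == "__main__":
    incident = Incident("Suspicious root login", severity=1, commander="alice")
    incident.log("alice", "Alert acknowledged, incident channel opened")
    incident.log("bob", "Captured instance snapshot before any changes")
    for entry in incident.timeline:
        print(entry)
```

Timestamps recorded as events happen make the later post-mortem timeline much easier to reconstruct than one rebuilt from memory.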
Containment
Stop the bleeding without destroying evidence. Two modes:
- Short-term containment: revoke a credential, isolate a host (security group → deny-all), block an IP at the WAF.
- Long-term containment: stand up a clean replacement, route traffic away from the compromised system, preserve the original for forensics.
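For short-term containment of stolen temporary credentials on AWS, one common pattern is a deny-all policy conditioned on aws:TokenIssueTime, which invalidates sessions issued before a cutoff without deleting the role. The sketch below only builds the policy document; attaching it to the role is left out, and the function name is hypothetical. Long-lived access keys are not covered by this condition and would need to be deactivated separately.

```python
import json
from datetime import datetime, timezone


def revoke_sessions_policy(issued_before):
    """Deny everything for role sessions whose token was issued before the cutoff."""
    if issued_before.tzinfo is None:
        raise ValueError("issued_before must be timezone-aware")
    cutoff = issued_before.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                # Sessions issued after the cutoff are unaffected, so the
                # workload can recover by obtaining fresh credentials.
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": cutoff}},
            }
        ],
    }


if __name__ == "__main__":
    policy = revoke_sessions_policy(datetime.now(timezone.utc))
    print(json.dumps(policy, indent=2))
```

Because the role itself is left in place, this contains the attacker while preserving the role's configuration as evidence.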
Eradication & Recovery
Remove malware / persistence / unauthorized accounts. Patch the vulnerability that allowed entry. Then restore service from known-good — backups, freshly built images, rotated secrets. Recovery is when most teams skip steps; the discipline is to verify before declaring it over: run the same detection rules; confirm no new activity.
Post-Incident — The Blameless Post-Mortem
Within a week, write the document. Structure varies, but the load-bearing sections are the same:
Forensics — Capturing State Before You Touch It
If the incident might involve legal exposure (payment data, PHI, regulated industries), forensics matters. The principle is to capture volatile state before persistent, because volatile evaporates:
- Memory: process listing, open sockets, RAM dump (LiME, AVML, EC2 instance memory snapshot).
- Disk: full snapshot before any change. EBS snapshot, GCP disk snapshot, Azure managed-disk snapshot.
- Logs: pull what the SIEM has, but also OS logs from the box (auth.log, journalctl).
- Network: any flow data from VPC Flow Logs, NetFlow, or the load balancer.
Document chain of custody — who touched what artifact when. Most engineers will hand off to a dedicated forensics team; the goal at this stage is not to break the evidence.
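Chain of custody can start as something very simple: a hash of each artifact plus who handled it and when. The following is a minimal sketch using only the standard library. The field names are assumptions, and a dedicated forensics team will likely have its own format and tooling.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large disk or memory images do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_entry(path, handler, action, at=None):
    """Record who did what to which artifact, with a hash to prove it was not altered."""
    at = at or datetime.now(timezone.utc)
    return {
        "artifact": str(path),
        "sha256": sha256_of(path),
        "handler": handler,
        "action": action,
        "at": at.isoformat(timespec="seconds"),
    }


if __name__ == "__main__":
    import sys

    # Usage: python custody.py <artifact-path> <handler-name>
    if len(sys.argv) == 3:
        print(custody_entry(sys.argv[1], sys.argv[2], "collected"))
```

Re-hashing the artifact at each hand-off and comparing against the first entry shows whether the evidence changed in between.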
The IR lifecycle at a glance:
- Plan — runbooks & tabletop
- Detect — SIEM rules + paging
- Contain — stop bleeding without destroying evidence
- Cure — eradicate root cause; rotate creds
- Recover — clean rebuild; verify clean
- Reflect — blameless post-mortem with owners
Closing the Loop
Across seven days you've covered: the goals (CIA), the controls (AAA, defense in depth), the threats (STRIDE, attack trees), the tools (crypto, AuthN/AuthZ), the catalogue (OWASP Top 10), the wires (TLS, mTLS, Zero Trust), the process (Secure SDLC, SBOM/SLSA), and the operations (cloud IAM, IR). None of these are static — each will evolve over the next decade. What stays constant is the way you reason: identify what you protect, name the threats, layer the controls, log the evidence, and rehearse the response.
- NIST SP 800-61 Rev. 2: Computer Security Incident Handling Guide (nist.gov)
- NIST SP 800-190: Application Container Security Guide (nist.gov)
- AWS Well-Architected: Security Pillar (aws.amazon.com)
- CIS Kubernetes Benchmark (cisecurity.org)
- Kubernetes Pod Security Standards (kubernetes.io)
- Google SRE Book, Postmortem Culture: Learning from Failure (google.com)
- MITRE ATT&CK: adversary techniques, for use during IR triage (attack.mitre.org)