DAY 7

09 / 09

Cloud, Containers & Incident Response

7 min · Advanced · 1,541 words

IAM done right, container and Kubernetes hardening, and what good incident response looks like when an alert fires at 03:00 — through detection, containment, eradication, and the post-mortem.

What you will learn

01 Cloud IAM: The Real Perimeter

02 Container Hardening

03 Incident Response: The NIST Lifecycle

04 Forensics: Capturing State Before You Touch It

05 Closing the Loop

The first six days were about building it right. The last day is about operating it: the controls that protect the cloud infrastructure your code runs on, and the muscle memory that turns a would-be nine-figure breach into a contained incident with a clean post-mortem.

Cloud IAM — The Real Perimeter

In the cloud era, identity is the perimeter. Many public-cloud breaches of the last five years trace back to one of three IAM failures: over-broad permissions, long-lived credentials, or missing MFA on the root account.

Figure: Cloud IAM hierarchy. Static keys for workloads are the legacy path; workload identity is the modern one.

The IAM Rules That Pay for Themselves

No root for daily work. Hardware-MFA the root, lock the credentials in a safe, audit on use.
SSO only for humans. No IAM users with passwords; one SSO IdP with phishing-resistant MFA.
Workload identity for services. IRSA on EKS, Workload Identity on GKE, Managed Identity on Azure. Static IAM keys are a smell.
Least privilege per role. Start from Deny; add only what the access analyzer says was used. Re-review every quarter.
Resource policies + condition keys. S3 buckets allowing only specific VPC endpoints; KMS keys allowing only specific roles.
SCP / org policies as guardrails. "No public S3 buckets in this OU" enforced at the org level, not per-account.
CloudTrail / Audit Log on, immutable, retained. Off-account, append-only.
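
To make "least privilege per role" concrete, here is a minimal sketch of a policy lint that flags wildcard grants in an AWS-style IAM policy document. The function names and the example policy are illustrative assumptions, not part of any AWS tool; a real review would lean on the provider's access analyzer rather than a hand-rolled check.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """IAM accepts a single value or a list for these fields; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_overbroad_statements(policy: dict) -> list[str]:
    """Return findings for Allow statements that grant wildcard actions or resources."""
    findings = []
    for index, stmt in enumerate(_as_list(policy.get("Statement"))):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        # "*" grants everything; "service:*" grants every action in one service.
        wild_actions = [a for a in actions if a == "*" or a.endswith(":*")]
        if wild_actions:
            findings.append(f"statement {index}: wildcard action(s) {wild_actions}")
        if "*" in resources and not stmt.get("Condition"):
            findings.append(f"statement {index}: Resource '*' with no Condition")
    return findings


# Hypothetical policy: the first statement is over-broad, the second is scoped.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-reports/*",
        },
    ],
}

for finding in find_overbroad_statements(example_policy):
    print(finding)
```

A check like this fits in CI as a cheap first gate; it does not replace the quarterly review against what was actually used.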

🚫

Capital One, 2019

An over-permissioned WAF role could list and read hundreds of S3 buckets it had no need for. An SSRF vulnerability let the attacker reach the instance metadata service (IMDSv1) and steal that role's temporary credentials. Roughly 100M customer records were exposed. The bug was SSRF; the breach was IAM. Both fixes, least-privilege roles and IMDSv2 (which AWS released later in 2019), cost nothing to adopt.
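
One way to audit for the IMDSv1 half of this failure is to look at each instance's metadata options. The sketch below assumes input shaped like the EC2 DescribeInstances response (the `MetadataOptions.HttpTokens` and `HttpEndpoint` fields); the function name and the instance IDs are made up for illustration, and no API call is made here.

```python
def instances_allowing_imdsv1(describe_instances_response: dict) -> list[str]:
    """Return IDs of instances whose metadata endpoint is on but does not require tokens."""
    flagged = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            options = instance.get("MetadataOptions", {})
            endpoint_enabled = options.get("HttpEndpoint", "enabled") == "enabled"
            # HttpTokens == "required" means only IMDSv2 session tokens are accepted.
            tokens_required = options.get("HttpTokens") == "required"
            if endpoint_enabled and not tokens_required:
                flagged.append(instance.get("InstanceId", "<unknown>"))
    return flagged


# Hypothetical response fragment with three instances.
sample_response = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-0aaaaaaaaaaaaaaa1",
                    "MetadataOptions": {"HttpTokens": "optional", "HttpEndpoint": "enabled"},
                },
                {
                    "InstanceId": "i-0bbbbbbbbbbbbbbb2",
                    "MetadataOptions": {"HttpTokens": "required", "HttpEndpoint": "enabled"},
                },
                {
                    "InstanceId": "i-0ccccccccccccccc3",
                    "MetadataOptions": {"HttpTokens": "optional", "HttpEndpoint": "disabled"},
                },
            ]
        }
    ]
}

print(instances_allowing_imdsv1(sample_response))
```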

Container Hardening

A container is a process with a few extra namespaces. Most container CVEs are about that "few" being smaller than people think.

Don't run as root. Set a non-root USER in the Dockerfile, and set readOnlyRootFilesystem: true in the container's Kubernetes securityContext.
Drop all capabilities by default; add back only what you need.
Distroless or minimal base images. Smaller images = less attack surface and fewer CVEs to patch.
Pin base images by digest, not tag. FROM gcr.io/distroless/static@sha256:…
Sign and verify images (cosign). Admission controllers (Kyverno, Gatekeeper) reject unsigned images at deploy.
Scan continuously — not just at build. New CVEs land against old images every day.
No secrets in env vars for production. Mount from a secret store at runtime; redact in logs.
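
Several of the rules above map onto fields of a Kubernetes container `securityContext`. The following is a rough sketch of a checker over a container spec parsed into a dict; the function name is hypothetical, the `allowPrivilegeEscalation` check is an addition taken from the restricted Pod Security Standard rather than from the list above, and the sketch ignores pod-level settings that a container can inherit.

```python
def container_hardening_gaps(container: dict) -> list[str]:
    """List the hardening settings a container spec is missing."""
    ctx = container.get("securityContext", {})
    gaps = []
    if ctx.get("runAsNonRoot") is not True:
        gaps.append("runAsNonRoot is not true")
    if ctx.get("readOnlyRootFilesystem") is not True:
        gaps.append("readOnlyRootFilesystem is not true")
    # Not in the list above; required by the restricted Pod Security Standard.
    if ctx.get("allowPrivilegeEscalation") is not False:
        gaps.append("allowPrivilegeEscalation is not false")
    dropped = ctx.get("capabilities", {}).get("drop", [])
    if "ALL" not in dropped:
        gaps.append("capabilities.drop does not include ALL")
    return gaps


# Hypothetical container specs, as they would look after parsing a manifest.
hardened = {
    "name": "app",
    "securityContext": {
        "runAsNonRoot": True,
        "readOnlyRootFilesystem": True,
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
    },
}
bare = {"name": "legacy"}

print(container_hardening_gaps(hardened))
print(container_hardening_gaps(bare))
```

In practice an admission controller enforces these rules at deploy time; a script like this is only useful for a quick inventory of what would be rejected.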

Dockerfile (good defaults)

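# Pin the base image by digest so a re-pushed tag cannot change the build.
FROM gcr.io/distroless/static-debian12@sha256:…
# 65532 is the unprivileged "nonroot" user that distroless images ship with.
USER 65532:65532
# Copy the statically linked binary, owned by that user.
COPY --chown=65532:65532 app /app
# Exec form is required: the image has no shell to interpret a command string.
ENTRYPOINT ["/app"]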
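
Digest pinning is easy to check mechanically. Below is a small sketch of a linter that scans Dockerfile text for `FROM` lines that reference an image without an `@sha256:` digest. The function name and the sample Dockerfile (including its placeholder digest) are illustrative assumptions, and the parser handles only the common `FROM [--platform=...] image [AS name]` form.

```python
import re

_FROM = re.compile(r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)", re.IGNORECASE)
_ALIAS = re.compile(r"\s+AS\s+(\S+)\s*$", re.IGNORECASE)


def unpinned_base_images(dockerfile_text: str) -> list[str]:
    """Return base images referenced by tag (or no tag) instead of by digest."""
    unpinned = []
    stage_names = set()
    for line in dockerfile_text.splitlines():
        match = _FROM.match(line)
        if not match:
            continue
        image = match.group(1)
        # Skip "scratch" and references to earlier build stages.
        is_external = image != "scratch" and image not in stage_names
        if is_external and "@sha256:" not in image:
            unpinned.append(image)
        alias = _ALIAS.search(line)
        if alias:
            stage_names.add(alias.group(1))
    return unpinned


# Hypothetical multi-stage Dockerfile; the digest is a placeholder, not a real one.
sample = """\
FROM golang:1.22 AS build
RUN go build -o /app .
FROM gcr.io/distroless/static-debian12@sha256:0123abcd
COPY --from=build /app /app
"""

print(unpinned_base_images(sample))
```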

Kubernetes Hardening Highlights

Pod Security Standards enforced at the restricted level, namespace by namespace.
NetworkPolicy default-deny, then explicit ingress/egress per service.
RBAC least privilege: no cluster-admin bindings outside emergency break-glass.
Secrets via CSI driver from a vault; never plain Kubernetes Secret objects without etcd encryption-at-rest.
Audit log shipped off-cluster.
kubectl access via SSO + JIT, not kubeconfig files passed around.
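
The first two items can be written down as manifests. This sketch builds them as plain dicts and prints JSON, which `kubectl apply -f` accepts alongside YAML. The helper names and the `payments` namespace are assumptions for illustration; the label key and the empty `podSelector` follow the upstream Kubernetes conventions for Pod Security Admission and NetworkPolicy.

```python
import json


def restricted_namespace(name: str) -> dict:
    """Namespace that asks Pod Security Admission to enforce the restricted level."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {"pod-security.kubernetes.io/enforce": "restricted"},
        },
    }


def default_deny_policy(namespace: str) -> dict:
    """NetworkPolicy that selects every pod and allows no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty selector matches all pods in the namespace.
            "podSelector": {},
            # Listing both types with no rules denies traffic in both directions.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


manifests = [restricted_namespace("payments"), default_deny_policy("payments")]
print(json.dumps(manifests, indent=2))
```

Each service then gets its own explicit allow policy on top. Note that default-deny egress also blocks DNS until a rule permits it.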

Incident Response — The NIST Lifecycle

Even with everything above, something will eventually fire. Good IR is muscle memory: detect, contain, eradicate, recover, learn — practiced before it's needed.

Figure: NIST SP 800-61, the canonical incident-response lifecycle.

Preparation — What You Do Before

An on-call rotation that covers IR, with paging tested.
A documented severity scale (Sev 0–3) and what each triggers.
Runbooks for the top five threat scenarios: credential leak, public bucket, host compromise, DDoS, insider abuse.
Pre-authorized authority — "the on-call security engineer can revoke any credential without further approval."
Out-of-band comms (Signal, separate Slack workspace) for cases where the primary channel is suspect.
Tabletop exercises every quarter. The first time you run an IR is the worst time to discover the runbook is wrong.
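
A severity scale is easier to follow at 03:00 when it is data rather than prose. The sketch below shows one possible shape; every value in it (who gets paged, the acknowledgement windows, what each level means) is an invented placeholder, since the real thresholds are an organisational decision.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0  # assumed meaning: confirmed breach or full outage
    SEV1 = 1  # assumed meaning: likely compromise, limited scope
    SEV2 = 2  # assumed meaning: suspicious activity, no confirmed impact
    SEV3 = 3  # assumed meaning: low-risk finding, handle in business hours


@dataclass(frozen=True)
class Triggers:
    page_on_call: bool
    assign_incident_commander: bool
    notify_executives: bool
    ack_within_minutes: int


# Illustrative values only; replace with the organisation's own policy.
SEVERITY_TRIGGERS = {
    Severity.SEV0: Triggers(True, True, True, 5),
    Severity.SEV1: Triggers(True, True, False, 15),
    Severity.SEV2: Triggers(True, False, False, 60),
    Severity.SEV3: Triggers(False, False, False, 480),
}


def triggers_for(severity: Severity) -> Triggers:
    """Look up what a given severity sets in motion."""
    return SEVERITY_TRIGGERS[severity]


print(triggers_for(Severity.SEV1))
```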

Detection & Triage

Most signals come from a SIEM (Sumo Logic, Splunk, Datadog, Panther) consuming CloudTrail, Kubernetes audit logs, application logs, and EDR telemetry. Good detection rules are specific (few false positives) and actionable (each one names the runbook to follow).
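
As an illustration of "specific and actionable", here is a sketch of a rule that counts S3 `GetObject` calls per principal against one bucket and attaches a runbook to each alert. It assumes CloudTrail S3 data events are enabled and already parsed into dicts; the rule name, threshold, and runbook path are hypothetical, and a real SIEM would express this in its own query language.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    rule: str
    principal: str
    bucket: str
    count: int
    runbook: str


def detect_bulk_s3_reads(events, bucket, threshold,
                         runbook="runbooks/s3-bulk-read.md"):
    """Alert on any principal exceeding `threshold` GetObject calls on `bucket`."""
    counts = Counter()
    for event in events:
        if event.get("eventSource") != "s3.amazonaws.com":
            continue
        if event.get("eventName") != "GetObject":
            continue
        # requestParameters can be null in CloudTrail records.
        params = event.get("requestParameters") or {}
        if params.get("bucketName") != bucket:
            continue
        principal = (event.get("userIdentity") or {}).get("arn", "<unknown>")
        counts[principal] += 1
    return [
        Alert("bulk-s3-read", principal, bucket, count, runbook)
        for principal, count in sorted(counts.items())
        if count > threshold
    ]


def _get(arn, bucket="prod-data-bucket"):
    """Build a minimal, made-up CloudTrail-style record for the demo."""
    return {
        "eventSource": "s3.amazonaws.com",
        "eventName": "GetObject",
        "requestParameters": {"bucketName": bucket},
        "userIdentity": {"arn": arn},
    }


noisy = "arn:aws:sts::111122223333:assumed-role/waf-role/session"
quiet = "arn:aws:sts::111122223333:assumed-role/report-role/session"
demo_events = [_get(noisy) for _ in range(6)] + [_get(quiet) for _ in range(2)]

for alert in detect_bulk_s3_reads(demo_events, "prod-data-bucket", threshold=5):
    print(alert)
```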

The first 15 minutes after an alert: classify severity, assign IC (incident commander), open the incident channel, capture timeline. Do not start fixing yet — capture state first.
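
Those first-15-minute steps amount to filling in a small record and starting a timeline. A minimal sketch, assuming nothing about the team's real tooling (the class, field names, and sample values are all hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class Incident:
    title: str
    severity: int
    commander: str
    channel: str
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)

    def log(self, note: str, at: Optional[datetime] = None) -> None:
        """Append a timeline entry, normalised to UTC."""
        moment = at or datetime.now(timezone.utc)
        if moment.tzinfo is None:
            raise ValueError("timeline entries must be timezone-aware")
        self.timeline.append((moment.astimezone(timezone.utc), note))

    def render_timeline(self) -> str:
        """Render entries in chronological order, one per line."""
        return "\n".join(
            f"{ts.strftime('%Y-%m-%d %H:%M:%SZ')}  {note}"
            for ts, note in sorted(self.timeline)
        )


incident = Incident(
    title="Unusual S3 reads on prod-data-bucket",
    severity=1,
    commander="on-call security engineer",
    channel="out-of-band incident channel",
)
incident.log("Alert fired", at=datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc))
incident.log("IC assigned, channel opened",
             at=datetime(2024, 1, 2, 3, 4, tzinfo=timezone.utc))
print(incident.render_timeline())
```

Keeping the timeline in UTC from the first entry means it can be pasted into the post-mortem unchanged.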

Containment

Stop the bleeding without destroying evidence. Two modes:

Short-term containment: revoke a credential, isolate a host (security group → deny-all), block an IP at the WAF.
Long-term containment: stand up a clean replacement, route traffic away from the compromised system, preserve the original for forensics.
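
For the "revoke a credential" case on AWS, one common short-term move is to deny every session of a role that was issued before a cutoff time, using the `aws:TokenIssueTime` condition key. The sketch below only builds that policy document; attaching it to the role (for example as an inline policy) is left out, and the function name is an assumption.

```python
import json
from datetime import datetime, timezone


def revoke_older_sessions_policy(cutoff: datetime) -> dict:
    """Deny all actions for role sessions whose token was issued before `cutoff`."""
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    issued_before = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "DateLessThan": {"aws:TokenIssueTime": issued_before}
                },
            }
        ],
    }


policy = revoke_older_sessions_policy(
    datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)
)
print(json.dumps(policy, indent=2))
```

This invalidates credentials the attacker already holds while letting the workload obtain fresh ones. It does not help if the attacker can still obtain new sessions through the original entry point, so that path has to be closed as well.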

Eradication & Recovery

Remove malware, persistence mechanisms, and unauthorized accounts. Patch the vulnerability that allowed entry. Then restore service from a known-good state: backups, freshly built images, rotated secrets. Recovery is where most teams skip steps; the discipline is to verify before declaring the incident over by re-running the detection rules that fired and confirming there is no new activity.

Post-Incident — The Blameless Post-Mortem

Within a week, write the document. Structure varies, but the load-bearing sections are the same:

📝 Post-mortem template

Summary — one paragraph: what happened, blast radius, customer impact.

Timeline — UTC timestamps from first signal to all-clear.

Root cause — five whys; not a person, a chain of contributing factors.

What went well — detection time, comms, decisions made under pressure.

What didn't — gaps in detection, runbook errors, slow rollback, missing logs.

Action items — owner + due date for each. Not a wishlist.
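
The "owner + due date for each" rule is simple enough to enforce mechanically before a post-mortem is marked complete. A minimal sketch, with hypothetical names and sample items:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def incomplete_action_items(items: List[ActionItem]) -> List[str]:
    """Return descriptions of items that lack an owner or a due date."""
    return [
        item.description
        for item in items
        if not item.owner or item.due is None
    ]


# Made-up action items for illustration.
items = [
    ActionItem("Enforce IMDSv2 on all instances", "platform-team", date(2024, 2, 1)),
    ActionItem("Improve logging"),
    ActionItem("Add runbook for public buckets", owner="security-oncall"),
]

print(incomplete_action_items(items))
```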

⚠️

Blameless ≠ accountability-free

Blameless means we treat humans as fallible parts of a sociotechnical system; the question is what made the failure possible, not who clicked the wrong thing. It does not mean there are no owners — every action item has one. Without follow-through, post-mortems are theatre.

Forensics — Capturing State Before You Touch It

If the incident might involve legal exposure (payment data, PHI, regulated industries), forensics matters. The principle is to capture volatile state before persistent, because volatile evaporates:

Memory: process listing, open sockets, RAM dump (LiME, AVML). On EC2 the dump has to be taken from inside the running instance, because an EBS snapshot does not capture RAM.
Disk: full snapshot before any change. EBS snapshot, GCP disk snapshot, Azure managed-disk snapshot.
Logs: pull what the SIEM has, but also OS logs from the box (auth.log, the systemd journal via journalctl).
Network: any flow data from VPC Flow Logs, NetFlow, or the load balancer.

Document chain of custody — who touched what artifact when. Most engineers will hand off to a dedicated forensics team; the goal at this stage is not to break the evidence.
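
A chain-of-custody log can be as simple as an append-only list of who handled which artifact and when, with a cryptographic hash so later tampering is detectable. The sketch below is one possible shape, with hypothetical names throughout; it hashes in-memory bytes for brevity, whereas a real disk image would be hashed in chunks from the file.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class CustodyEntry:
    artifact: str
    sha256: str
    handler: str
    action: str
    at_utc: datetime


def record_custody(log: List[CustodyEntry], artifact: str, data: bytes,
                   handler: str, action: str) -> CustodyEntry:
    """Hash the artifact and append a timestamped entry to the custody log."""
    entry = CustodyEntry(
        artifact=artifact,
        sha256=hashlib.sha256(data).hexdigest(),
        handler=handler,
        action=action,
        at_utc=datetime.now(timezone.utc),
    )
    log.append(entry)
    return entry


def is_unmodified(entry: CustodyEntry, data: bytes) -> bool:
    """Check that the artifact still matches the hash recorded at capture."""
    return hashlib.sha256(data).hexdigest() == entry.sha256


custody_log: List[CustodyEntry] = []
captured = b"example memory dump contents"  # stand-in for a real artifact
first = record_custody(custody_log, "host-42-memory.lime", captured,
                       handler="on-call engineer", action="captured")

print(first.sha256)
print(is_unmodified(first, captured))
```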

Active recall

An alert fires: "unusual S3 GetObject pattern from prod-data-bucket." The on-call engineer's first instinct is to revoke the role. Why might that be the wrong first step?

Answer

Revoking too soon (a) tips off the attacker, who may have alternate persistence and will go quiet, and (b) destroys evidence about how they used the credentials. The IR playbook usually says: capture session details and recent CloudTrail first (preserve), then revoke. The exception is active exfiltration in progress against critical data; then containment beats forensics. Severity dictates the order, and a runbook takes the guesswork away from the on-call engineer.

Flashcard

Why should the IR comms channel and the post-mortem doc be hosted somewhere different from the affected system?

Answer

If the attacker has access to your primary Slack / Google Workspace, they can read your IR plans in real time and adapt — including deleting evidence and altering the post-mortem. Out-of-band tooling (separate workspace, Signal group, encrypted doc) keeps the response invisible. Many large companies have an entirely separate "break-glass" tenant for exactly this reason.

Mnemonic — IR cycle

"Plan, Detect, Contain, Cure, Recover, Reflect."

Plan — runbooks & tabletop
Detect — SIEM rules + paging
Contain — stop bleeding without destroying evidence
Cure — eradicate root cause; rotate creds
Recover — clean rebuild; verify clean
Reflect — blameless post-mortem with owners

Closing the Loop

Across seven days you've covered: the goals (CIA), the controls (AAA, defense in depth), the threats (STRIDE, attack trees), the tools (crypto, AuthN/AuthZ), the catalogue (OWASP Top 10), the wires (TLS, mTLS, Zero Trust), the process (Secure SDLC, SBOM/SLSA), and the operations (cloud IAM, IR). None of these are static — each will evolve over the next decade. What stays constant is the way you reason: identify what you protect, name the threats, layer the controls, log the evidence, and rehearse the response.

🔑

Key takeaways

1) In the cloud, identity is the perimeter: least privilege, no static workload keys, immutable audit logs.
2) Harden containers and Kubernetes with default-deny network policy, the restricted Pod Security Standard, signed images, and distroless bases.
3) Incident response is a rehearsed process, not an improvisation: runbooks, tabletop exercises, an incident commander, and out-of-band comms.
4) Forensics first, fix second, on anything that might carry legal exposure.
5) The post-mortem is the highest-leverage hour after an incident: blameless, one owner per action item, and followed through.

📚 Further reading

NIST SP 800-61 Rev. 2: Computer Security Incident Handling Guide (nist.gov)
NIST SP 800-190: Application Container Security Guide (nist.gov)
AWS Well-Architected: Security Pillar (aws.amazon.com)
CIS Kubernetes Benchmark (cisecurity.org)
Kubernetes Pod Security Standards (kubernetes.io)
Google SRE Book: Postmortem Culture: Learning from Failure (google.com)
MITRE ATT&CK: adversary techniques, for use during IR triage (attack.mitre.org)
