
Cloud Detection & Incident Response

11 min · Advanced · 2,485 words
Prevention buys you time; detection buys you a story. Master cloud-native telemetry (CloudTrail, Audit Logs, GuardDuty, eBPF), MITRE ATT&CK Cloud detection engineering, the IR lifecycle adapted for ephemeral infrastructure, and the forensic acquisition that survives in environments where evidence vanishes in seconds.

What you will learn

01 The Cloud Telemetry Stack
02 Detection Engineering — Practice, Not Vibes
03 Anomaly Detection in the Cloud
04 The Incident Response Lifecycle — NIST 800-61, Revisited
05 Forensic Acquisition in the Cloud
06 Communications

You will be breached. Not because your defences are weak, but because at some scale a credential leaks, a pipeline is tampered with, a maintainer gets phished, or a bug ships. The question that matters then is not whether your prevention worked but whether your detection did, and whether the response process can answer the auditor and the customer in days rather than months. This chapter is about that machinery: telemetry, detections, incident handling and forensics, all reframed for ephemeral cloud environments where the evidence can disappear when a pod restarts.

🔑
Today's response stack
1) The cloud telemetry sources: CloudTrail / Audit Logs, VPC Flow Logs, GuardDuty / Security Command Center, EDR, eBPF runtime.
2) Detection engineering: MITRE ATT&CK Cloud, Sigma rules, the testing loop.
3) The IR lifecycle (NIST 800-61) and how it changes for ephemeral infra.
4) Forensic acquisition in the cloud: snapshotting, container introspection, memory capture before evidence vanishes.
5) Communications: what regulators, customers and execs need, when, and in what tone.

The Cloud Telemetry Stack

You cannot investigate what you have not collected. The good news: cloud platforms emit the most uniformly structured telemetry in computing history. The bad news: each service emits a different shape, and the noise-to-signal ratio without curation is ferocious.

[Figure: the detection pipeline. Sources: CloudTrail / Audit Logs (every API call); VPC Flow Logs (L4 accept/reject); DNS query logs (Route 53 / Cloud DNS); Runtime, eBPF / Falco (syscalls / process). Detection & SIEM: Sigma rules, CEL; enrichment + asset map; on-call paging. Cold storage: S3 + object lock; separate logging account; 13+ months retention. Investigation: Athena / BigQuery; timeline tools.]
The minimum viable detection pipeline. Every cloud has analogues; the shape is the same.

Sources you cannot skip

  • CloudTrail (AWS) / Cloud Audit Logs (GCP) / Activity Log (Azure). Every API call. Multi-account organization-wide trails into a separate logging account.
  • VPC Flow Logs — L4 connection records. Costly at scale; sample lower-value VPCs, full on prod.
  • DNS query logs — Route 53 Resolver, Cloud DNS, Azure DNS Analytics. Best signal for C2 callouts.
  • K8s audit log — every API call to the cluster.
  • Application logs — structured, correlated by trace ID, scrubbed of secrets.
  • Runtime telemetry — Falco/Tetragon/EDR for syscall and process tree.
  • Identity provider logs — Okta system log, Entra sign-in logs, Workspace Login Audit. Critical for the AuthN side of every cloud incident.
  • SaaS logs — GitHub audit log, Slack audit, M365 unified audit. The supply-chain detection layer.
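As a small illustration of how differently shaped these sources are, the sketch below parses one VPC Flow Logs record. It assumes the default (version 2) space-separated field order; custom log formats will differ, and the `FlowRecord` type and sample line are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Field order of the default (version 2) VPC Flow Logs record.
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]

@dataclass
class FlowRecord:
    srcaddr: str
    dstaddr: str
    dstport: int
    protocol: int
    byte_count: int
    action: str  # ACCEPT or REJECT

def parse_flow_record(line: str) -> Optional[FlowRecord]:
    """Parse one default-format record; return None for NODATA/SKIPDATA rows."""
    raw = dict(zip(FIELDS, line.split()))
    if len(raw) != len(FIELDS) or raw["log_status"] != "OK":
        return None
    return FlowRecord(raw["srcaddr"], raw["dstaddr"], int(raw["dstport"]),
                      int(raw["protocol"]), int(raw["bytes"]), raw["action"])

# Hypothetical sample: an accepted TCP (protocol 6) flow to port 443.
sample = ("2 123456789012 eni-0a1b2c3d 10.0.1.5 203.0.113.9 "
          "49152 443 6 20 4249 1700000000 1700000060 ACCEPT OK")
record = parse_flow_record(sample)
```

Normalising each source into a small typed record like this is one way to keep later detections independent of the raw log shape.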
⚠️
Logs in the compromised account are not logs
A compromised principal can disable trails, mass-delete log streams, or rotate keys. Ship logs to a separate AWS account / GCP project / Azure subscription, with cross-account write-only IAM and S3 Object Lock / Bucket Lock retention. The logging account should have its own admin chain. CloudTrail's organization trail and Azure's central log forwarding are designed for exactly this.
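The write-only idea can be sketched as a bucket policy for the logging account. This is a minimal sketch, not a complete hardening: the bucket name and trail ARN are hypothetical, the two Allow statements follow the shape AWS documents for CloudTrail delivery, and Object Lock itself still has to be enabled on the bucket separately.

```python
def log_bucket_policy(bucket: str, trail_arn: str) -> dict:
    """Build a bucket policy that lets CloudTrail write logs and nobody delete them."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # CloudTrail checks the bucket ACL before delivering.
                "Sid": "CloudTrailAclCheck",
                "Effect": "Allow",
                "Principal": {"Service": "cloudtrail.amazonaws.com"},
                "Action": "s3:GetBucketAcl",
                "Resource": bucket_arn,
                "Condition": {"StringEquals": {"aws:SourceArn": trail_arn}},
            },
            {   # Write-only delivery: PutObject under AWSLogs/, no read permission.
                "Sid": "CloudTrailWrite",
                "Effect": "Allow",
                "Principal": {"Service": "cloudtrail.amazonaws.com"},
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/AWSLogs/*",
                "Condition": {"StringEquals": {
                    "s3:x-amz-acl": "bucket-owner-full-control",
                    "aws:SourceArn": trail_arn,
                }},
            },
            {   # Defence in depth on top of Object Lock: deny deletes for everyone.
                "Sid": "DenyDeletes",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }

# Hypothetical names for illustration only.
policy = log_bucket_policy(
    "org-trail-logs",
    "arn:aws:cloudtrail:us-east-1:111122223333:trail/org-trail",
)
```

The point of the design is that the workload accounts never hold a principal that can read or delete what they have written.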

The native detection services

AWS GuardDuty, GCP Security Command Center, and Microsoft Defender for Cloud all deliver curated detections on top of native telemetry without you operating the pipeline. They are the right starting point: turn them on org-wide before building anything custom. Their findings flow into Security Hub / SCC / Defender XDR for triage. Custom detections come in addition to the native services, not instead of them.

Detection Engineering — Practice, Not Vibes

The detection engineering practice has matured into a four-step loop: hypothesis → rule → test → tune. The hypothesis is grounded in MITRE ATT&CK techniques observed in real incidents; the rule is written and reviewed; tests fire it (and verify it doesn't fire on benign behaviour); tuning closes the gap on noise.

[Figure: Hypothesis (ATT&CK technique) → Rule (Sigma / SPL / KQL) → Test (atomic red team) → Tune (FP review) → back to Hypothesis]
Detection engineering as a loop. Every rule must have an explicit test, a known false-positive class, and an owner.

Anchor on MITRE ATT&CK Cloud

The Cloud sub-matrix names the techniques you actually need to detect. A starter set of high-signal cloud detections, by ATT&CK ID:

| ATT&CK ID | Technique | What to detect |
|---|---|---|
| T1078.004 | Valid Accounts: Cloud | Console / API login from new geo / new ASN / impossible travel |
| T1098.001 | Add cloud credentials | iam:CreateAccessKey on a different user; iam:UpdateLoginProfile |
| T1552.005 | Credentials from cloud metadata API | IMDS access from non-EC2 user-agent or unexpected pod |
| T1530 | Data from cloud storage | Spike in s3:GetObject by one principal; cross-account read |
| T1538 | Cloud service dashboard | Console session reading large amounts of data |
| T1190 | Exploit public-facing application | WAF bypass + IMDS access pattern |
| T1136.003 | Create Account: Cloud Account | iam:CreateUser outside the org's create flow |
| T1562 | Impair Defenses | CloudTrail StopLogging, GuardDuty disable, security-group all-open |
| T1567 | Exfiltration over web service | Egress to first-time-seen domain; large outbound from a service that usually does not |
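To make one row concrete, here is a minimal sketch of the T1098.001 check "iam:CreateAccessKey on a different user". It assumes CloudTrail's usual record shape (`userIdentity.userName` for the caller, `requestParameters.userName` for the target); the sample events are hypothetical.

```python
def creates_key_for_other_user(event: dict) -> bool:
    """True when a CreateAccessKey call targets a user other than the caller."""
    if event.get("eventName") != "CreateAccessKey":
        return False
    caller = (event.get("userIdentity") or {}).get("userName")
    target = (event.get("requestParameters") or {}).get("userName")
    # With no explicit target, IAM creates the key for the caller itself.
    # A role session has no userName, so any named target counts as "other".
    return target is not None and target != caller

# Hypothetical events for illustration.
self_service = {
    "eventName": "CreateAccessKey",
    "userIdentity": {"type": "IAMUser", "userName": "alice"},
    "requestParameters": None,
}
backdoor = {
    "eventName": "CreateAccessKey",
    "userIdentity": {"type": "IAMUser", "userName": "alice"},
    "requestParameters": {"userName": "svc-billing"},
}
```

In practice the same predicate would be expressed in the SIEM's query language; the Python form is only meant to pin down the logic.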

Sigma — portable rule format

Sigma is the Snort/YARA of log detection: a YAML rule format that backends translate to Splunk SPL, Elastic KQL, Sentinel KQL, etc. Pre-built Sigma rules for cloud telemetry are maintained in the SigmaHQ repo; pulling these and adapting is the fastest way to a credible detection corpus.

yaml — Sigma rule: CloudTrail StopLogging
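title: AWS CloudTrail Disabled
id: 4eb19a23-b6b5-4b6a-9e58-1b2f25e6b7c0
status: stable
description: Detects when a CloudTrail trail is stopped, deleted, or reconfigured (defense impairment).
tags: [attack.defense_evasion, attack.t1562.008]
logsource: { product: aws, service: cloudtrail }
detection:
  selection:
    eventSource: cloudtrail.amazonaws.com
    eventName:
      - StopLogging          # trail still exists but stops recording
      - DeleteTrail
      - UpdateTrail          # e.g. delivery redirected to another bucket
      - PutEventSelectors    # e.g. logged event classes narrowed
  # Allow-list the security platform's roles. For assumed-role sessions the role
  # path appears in sessionIssuer.arn, not in userIdentity.arn.
  filter_legitimate:
    userIdentity.sessionContext.sessionIssuer.arn|contains: ":role/security-platform/"
  condition: selection and not filter_legitimate
fields: [userIdentity.arn, eventName, sourceIPAddress, requestParameters.name]
level: high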
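The "test" step of the loop can be sketched by mirroring the rule's logic in plain Python and asserting on one malicious and one benign event. This is an illustration only: real pipelines would convert the Sigma rule with a backend and replay recorded events. The sample events are hypothetical, and the allow-list check looks at both the caller ARN and the session issuer ARN.

```python
SELECTED_EVENTS = {"StopLogging", "DeleteTrail", "UpdateTrail", "PutEventSelectors"}
ALLOWLIST_MARKER = ":role/security-platform/"

def rule_fires(event: dict) -> bool:
    """Mirror of the rule: selection and not filter_legitimate."""
    selection = (event.get("eventSource") == "cloudtrail.amazonaws.com"
                 and event.get("eventName") in SELECTED_EVENTS)
    identity = event.get("userIdentity", {})
    issuer = identity.get("sessionContext", {}).get("sessionIssuer", {})
    arns = [identity.get("arn", ""), issuer.get("arn", "")]
    filter_legitimate = any(ALLOWLIST_MARKER in arn for arn in arns)
    return selection and not filter_legitimate

# Hypothetical true positive: an unexpected user stops the trail.
attacker_event = {
    "eventSource": "cloudtrail.amazonaws.com",
    "eventName": "StopLogging",
    "userIdentity": {"arn": "arn:aws:iam::111122223333:user/dev-temp"},
}
# Hypothetical benign case: the platform team's role reconfigures the trail.
platform_event = {
    "eventSource": "cloudtrail.amazonaws.com",
    "eventName": "UpdateTrail",
    "userIdentity": {
        "arn": "arn:aws:sts::111122223333:assumed-role/trail-admin/session-1",
        "sessionContext": {"sessionIssuer": {
            "arn": "arn:aws:iam::111122223333:role/security-platform/trail-admin"}},
    },
}
```

Keeping at least one firing and one non-firing fixture per rule is what turns the known false-positive class into something a CI job can check.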

The two false-positive classes

Detections fail in two directions, and you must measure both. Tactical FPs ("a backup script trips the rule") come from incomplete context — fixed by enriching with asset metadata or filtering known principals. Strategic FPs ("the rule fires on legitimate behaviour the team is changing") come from drift — the platform changed but the detection didn't. Quarterly rule reviews and rule-as-code (in Git, with PRs and tests) are the only honest way to keep up.

Anomaly Detection in the Cloud

Two flavours have proven durable; many others have not.

Behavioural baselines per principal

Establish a 30-90 day baseline of API actions per principal — region, action verbs, time-of-day, source ASN. Flag deviations: a CI role that has called s3:GetObject for a year suddenly calling iam:CreateAccessKey; a user logging in from a country they have never used. AWS GuardDuty's IAM and S3 detections do this natively; rolling your own with Athena+windowed queries is feasible.
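A minimal sketch of the per-principal baseline, assuming events have already been normalised to flat dicts with hypothetical `principal` and `eventName` keys. It only tracks action names; region, time-of-day and source ASN would be added the same way.

```python
from collections import defaultdict

def build_baseline(history: list) -> dict:
    """Map each principal to the set of API actions seen in the baseline window."""
    baseline = defaultdict(set)
    for event in history:
        baseline[event["principal"]].add(event["eventName"])
    return baseline

def deviations(baseline: dict, new_events: list) -> list:
    """Return events whose action the principal has never performed before."""
    return [e for e in new_events
            if e["eventName"] not in baseline.get(e["principal"], set())]

# Hypothetical data: a CI role that has only ever read from S3.
history = [{"principal": "ci-role", "eventName": "GetObject"}] * 3
today = [
    {"principal": "ci-role", "eventName": "GetObject"},
    {"principal": "ci-role", "eventName": "CreateAccessKey"},
]
flagged = deviations(build_baseline(history), today)
```

A never-seen-before check like this is noisy for new principals, so it is usually paired with a minimum baseline age before alerts are raised.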

Graph and chain detections

Single-event rules miss the multi-step chains attackers actually use. Graph-based detection (Panther, Elastic Attack Discovery, Sentinel Fusion) correlates events: "login from new geo, then key created, then unusual S3 access, all within 30 minutes." The signal is far stronger than any single rule, at the cost of operational complexity.
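The example chain can be sketched as an ordered-sequence match inside a time window. This is a simplified stand-in for what the named products do, not their algorithm; the signal names and sample data are hypothetical, and each signal is assumed to come from an upstream single-event rule.

```python
from datetime import datetime, timedelta

CHAIN = ["new_geo_login", "access_key_created", "unusual_s3_access"]
WINDOW = timedelta(minutes=30)

def find_chains(signals: list) -> set:
    """signals: (timestamp, principal, signal_name) tuples in any order.
    Return principals that emitted the CHAIN signals in order within WINDOW."""
    by_principal = {}
    for ts, principal, name in sorted(signals):
        by_principal.setdefault(principal, []).append((ts, name))
    hits = set()
    for principal, events in by_principal.items():
        for i, (start, name) in enumerate(events):
            if name != CHAIN[0]:
                continue
            step = 1  # index of the next chain element we are waiting for
            for ts, later in events[i + 1:]:
                if ts - start > WINDOW:
                    break
                if later == CHAIN[step]:
                    step += 1
                    if step == len(CHAIN):
                        hits.add(principal)
                        break
    return hits

t0 = datetime(2024, 1, 1, 12, 0)
signals = [
    (t0, "alice", "new_geo_login"),
    (t0 + timedelta(minutes=5), "alice", "access_key_created"),
    (t0 + timedelta(minutes=20), "alice", "unusual_s3_access"),
    (t0, "bob", "new_geo_login"),
    (t0 + timedelta(minutes=10), "bob", "access_key_created"),
    (t0 + timedelta(minutes=45), "bob", "unusual_s3_access"),  # outside the window
]
suspects = find_chains(signals)
```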

The Incident Response Lifecycle — NIST 800-61, Revisited

NIST SP 800-61r2 names four phases: Preparation; Detection & Analysis; Containment, Eradication & Recovery; and Post-Incident Activity. This chapter treats containment as its own step. The cloud changes how each phase is executed, not whether it happens.

[Figure: Prepare (runbooks, IAM) → Detect & Analyse (scope, severity) → Contain (isolate, revoke) → Eradicate & Recover (rebuild, rotate) → Post-IR (retro)]
The phases are linear in name; in practice you cycle Detect→Contain→Detect again as the scope expands.

Preparation — what "ready" means

  • An incident commander on call 24×7. One person owns the response; everyone else assists.
  • Runbooks per scenario: leaked-key, exposed-bucket, suspect-pod, compromised-CI. Not novels — checklists.
  • A break-glass IR role in every account with read-everything + the explicit "contain" actions you may need (revoke session, isolate ENI, snapshot volumes). MFA-required, alarmed on use.
  • Pre-built forensic acquisition Lambda / Cloud Function that snapshots EBS volumes and copies them to an evidence bucket in the IR account.
  • Tabletop exercises twice a year. An AWS-style "breach simulation" or Rhino Security Labs' CloudGoat are good practice grounds.

Detection & Analysis — scope first, narrative second

The pressure when a real incident drops is to do something. The discipline is to first answer scope: which accounts, which principals, what time window, what data classes. Without scope you cannot contain effectively. Use the principal as the unit of analysis: pull every CloudTrail event for the suspect role over the relevant window, plot a timeline, identify lateral movements (assume-role chains, cross-account calls).
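Scoping by principal can be sketched as two small functions: a per-principal timeline and a walk along AssumeRole calls. The sketch assumes CloudTrail-shaped records already loaded as dicts (in practice they would come from Athena or a SIEM export); the role names and account IDs are hypothetical.

```python
def caller_arn(event: dict) -> str:
    """Prefer the role behind an assumed-role session over the session ARN."""
    identity = event.get("userIdentity", {})
    issuer = identity.get("sessionContext", {}).get("sessionIssuer", {})
    return issuer.get("arn") or identity.get("arn", "unknown")

def timeline(events: list, principal_arn: str) -> list:
    """All events made by one principal, oldest first (ISO 8601 UTC strings sort)."""
    mine = [e for e in events if caller_arn(e) == principal_arn]
    return sorted(mine, key=lambda e: e["eventTime"])

def assume_role_chain(events: list, start_arn: str) -> list:
    """Follow AssumeRole calls outward from start_arn; roles in discovery order."""
    reached, frontier = [], [start_arn]
    while frontier:
        current = frontier.pop(0)
        for e in timeline(events, current):
            if e.get("eventName") != "AssumeRole":
                continue
            target = (e.get("requestParameters") or {}).get("roleArn")
            if target and target != start_arn and target not in reached:
                reached.append(target)
                frontier.append(target)
    return reached

def _event(time, name, role_arn, params=None):
    # Helper that builds a hypothetical assumed-role CloudTrail record.
    return {"eventTime": time, "eventName": name, "requestParameters": params,
            "userIdentity": {"sessionContext": {"sessionIssuer": {"arn": role_arn}}}}

CI = "arn:aws:iam::111111111111:role/ci-role"
DEPLOY = "arn:aws:iam::111111111111:role/deploy-role"
PROD = "arn:aws:iam::222222222222:role/prod-admin"
events = [
    _event("2024-01-01T10:05:00Z", "AssumeRole", DEPLOY, {"roleArn": PROD}),
    _event("2024-01-01T10:00:00Z", "AssumeRole", CI, {"roleArn": DEPLOY}),
    _event("2024-01-01T10:07:00Z", "GetObject", PROD),
]
lateral = assume_role_chain(events, CI)
```

The resulting list is the set of principals whose own timelines need pulling next, which is how scope tends to expand during analysis.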

Containment — fast and reversible

Cloud containment levers are uniformly fast and uniformly possible to over-rotate. Useful actions:

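  • Revoke active sessions via IAM: attach an explicit deny with aws iam put-role-policy or put-user-policy (the console's "Revoke active sessions" action adds a deny conditioned on aws:TokenIssueTime), or attach the AWS-managed AWSDenyAll policy. Either blocks requests made with STS credentials that have already been issued.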
  • Disable a credential via IAM but do not delete it — you need the audit trail for forensics.
  • Quarantine an EC2 instance by replacing its security groups with an isolation SG (no inbound, egress only to the forensic acquisition channel).
  • Cordon the suspect pod's Kubernetes node and snapshot it before you drain or destroy anything; draining evicts the pod and discards its in-memory state.
  • Apply an SCP-level guardrail to the affected account (e.g., "deny s3:* in this account"). Extreme, but available.
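For the session-revocation lever, a minimal sketch of the deny document is shown below. It follows the pattern of denying every action for tokens issued before a cutoff, using the `aws:TokenIssueTime` condition key; the function name is hypothetical, and attaching the result (for example with `aws iam put-role-policy`) is left out.

```python
from datetime import datetime, timezone

def revoke_sessions_policy(cutoff: datetime) -> dict:
    """Deny everything for credentials issued before `cutoff`.
    `cutoff` should be timezone-aware; it is rendered in UTC."""
    issued_before = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {"DateLessThan": {"aws:TokenIssueTime": issued_before}},
        }],
    }

# Hypothetical cutoff: the moment the responder decided to contain.
policy = revoke_sessions_policy(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
```

Because the deny is conditional, sessions created after the cutoff keep working, which makes this lever easier to reverse than deleting the role.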
🚨
Contain, do not erase
The first impulse — "kill the pod, terminate the instance, delete the user" — destroys evidence. Take a snapshot first (EBS volume snapshot, K8s pod logs, ECS task definition), then isolate. Cloud snapshots are metadata-cheap; preserving them is almost free. Without them the post-incident retro and the regulator's question are unanswerable.

Eradication & Recovery

For credential compromise: rotate all credentials reachable from the compromised principal — including derived ones. Don't trust "the attacker only used X"; assume they exfiltrated everything they could read. For workload compromise: rebuild from known-good base images at known-good commits, with admission policy verifying signatures. For data-tier compromise: bring up a parallel environment from backups (verified backups, please — test restores quarterly) and cut traffic.
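"Including derived ones" is a graph-reachability question, which the sketch below treats as a breadth-first walk. The `can_read` map is a hypothetical inventory (who or what can read which secret); building it accurately is the hard part and is not shown.

```python
from collections import deque

def secrets_to_rotate(can_read: dict, compromised: str) -> list:
    """Return every secret transitively reachable from the compromised principal.
    can_read maps a principal or secret to the secrets it grants access to."""
    seen, order = set(), []
    queue = deque([compromised])
    while queue:
        node = queue.popleft()
        for nxt in can_read.get(node, []):
            if nxt not in seen and nxt != compromised:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)  # a readable secret may unlock further secrets
    return order

# Hypothetical inventory for illustration.
can_read = {
    "ci-role": ["deploy-key", "artifact-bucket-token"],
    "deploy-key": ["prod-db-password"],
}
rotation_list = secrets_to_rotate(can_read, "ci-role")
```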

Forensic Acquisition in the Cloud

Forensics is hard in environments where evidence is ephemeral. Three patterns repay the work to set up:

EBS / disk snapshotting

Take a snapshot of the volume before termination. Copy the snapshot into the IR account (for encrypted volumes, share the KMS key cross-account). Spin up a forensic VM in a hardened isolated VPC, attach a volume created from the snapshot read-only, and analyse with standard tooling (Plaso for timelines; Volatility if you also captured memory, since a disk snapshot does not include RAM). The whole flow can be automated: a Step Function triggered by a GuardDuty HIGH-severity finding starts the snapshot and copy within seconds.
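The automated flow might start with something like the sketch below. It assumes a GuardDuty finding delivered through EventBridge (`detail.severity`, `detail.resource.instanceDetails.instanceId`) and takes the EC2 client as a parameter, so a boto3 client can be passed in production and a stub in a dry run. The `StubEC2` class, case ID and tag key are hypothetical, and the cross-account copy step is omitted.

```python
HIGH_SEVERITY = 7.0  # GuardDuty's High band starts at 7.0

def instance_to_acquire(event: dict):
    """Return the instance ID from a high-severity finding, else None."""
    detail = event.get("detail", {})
    if detail.get("severity", 0) < HIGH_SEVERITY:
        return None
    return detail.get("resource", {}).get("instanceDetails", {}).get("instanceId")

def snapshot_instance_volumes(ec2, instance_id: str, case_id: str) -> list:
    """Snapshot every EBS volume attached to the instance; return snapshot IDs.
    `ec2` is a boto3 EC2 client or any object with the same two methods."""
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["Volumes"]
    snapshot_ids = []
    for volume in volumes:
        response = ec2.create_snapshot(
            VolumeId=volume["VolumeId"],
            Description=f"IR {case_id}: {instance_id} {volume['VolumeId']}",
            TagSpecifications=[{"ResourceType": "snapshot",
                                "Tags": [{"Key": "ir-case", "Value": case_id}]}],
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids

class StubEC2:
    """Stand-in for a boto3 client so the flow can be exercised offline."""
    def describe_volumes(self, Filters):
        return {"Volumes": [{"VolumeId": "vol-1"}, {"VolumeId": "vol-2"}]}
    def create_snapshot(self, VolumeId, Description, TagSpecifications):
        return {"SnapshotId": "snap-" + VolumeId}

finding = {"detail": {"severity": 8,
                      "resource": {"instanceDetails": {"instanceId": "i-0abc"}}}}
target = instance_to_acquire(finding)
snapshots = snapshot_instance_volumes(StubEC2(), target, "CASE-001")
```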

Container memory and process state

Containers exit and lose memory. Pre-incident: enable Falco/Tetragon to record relevant process trees and syscall events; that telemetry survives the pod. At incident: kubectl debug can attach an ephemeral debug container; criu or memfetch can dump process memory before the kill. For stronger guarantees, run forensic-targeted workloads in Kata containers where the entire VM is snapshottable.

Cloud-native log preservation

Lock the relevant log stream during an active incident — extend retention, place under legal hold (Object Lock with retention period). CloudTrail's event data store in CloudTrail Lake can hold relevant events for years.
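One way to sketch the legal-hold step for CloudTrail logs in S3: compute the daily key prefixes for the incident window, then place a hold on each object. The prefix layout assumes a single-account trail (`AWSLogs/<account>/CloudTrail/<region>/YYYY/MM/DD/`); organization trails add an org ID segment. The bucket must already have Object Lock enabled, pagination is ignored for brevity, and `StubS3` is a hypothetical stand-in for a boto3 S3 client.

```python
from datetime import date, timedelta

def cloudtrail_prefixes(account_id: str, region: str,
                        first_day: date, last_day: date) -> list:
    """Daily S3 key prefixes covering the incident window, inclusive."""
    days = (last_day - first_day).days
    return [f"AWSLogs/{account_id}/CloudTrail/{region}/{d:%Y/%m/%d}/"
            for d in (first_day + timedelta(days=n) for n in range(days + 1))]

def place_legal_holds(s3, bucket: str, prefixes: list) -> list:
    """Turn on a legal hold for every object under the prefixes; return the keys."""
    held = []
    for prefix in prefixes:
        page = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)  # first page only
        for obj in page.get("Contents", []):
            s3.put_object_legal_hold(Bucket=bucket, Key=obj["Key"],
                                     LegalHold={"Status": "ON"})
            held.append(obj["Key"])
    return held

class StubS3:
    """Offline stand-in exposing the two client methods used above."""
    def __init__(self, keys):
        self.keys = keys
    def list_objects_v2(self, Bucket, Prefix):
        return {"Contents": [{"Key": k} for k in self.keys if k.startswith(Prefix)]}
    def put_object_legal_hold(self, Bucket, Key, LegalHold):
        pass

prefixes = cloudtrail_prefixes("111122223333", "us-east-1",
                               date(2024, 2, 28), date(2024, 3, 1))
stub = StubS3(["AWSLogs/111122223333/CloudTrail/us-east-1/2024/02/29/a.json.gz",
               "AWSLogs/111122223333/CloudTrail/us-east-1/2024/03/05/b.json.gz"])
held = place_legal_holds(stub, "org-trail-logs", prefixes)
```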

Communications

Engineers underrate this. Customers, regulators and execs are not your audience for technical detail; they are your audience for accuracy and clarity.

  • Within minutes — incident channel up; commander, scribe, communications lead, technical leads named.
  • Within hours — status page updated if customer-impacting; legal looped in for breach notification analysis.
  • Within 72 hours — GDPR notification window if EU personal data is plausibly affected. Other regimes (CCPA, NYDFS, HIPAA, India DPDP) have their own clocks.
  • Within days — CISA / sector ISAC notification for material incidents. Customer-specific notices.
  • Within weeks — public post-mortem if appropriate (Cloudflare and GitLab have set the bar here).

The single most important communication discipline: say only what you know, and qualify the rest. "We are still confirming whether customer X's data was accessed" is fine. Speculation that turns out wrong erodes trust faster than the breach itself.

Quick check
GuardDuty fires UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.OutsideAWS. CloudTrail confirms an EC2 role's STS credentials were used from an IP outside AWS. The EC2 instance is still running. What are the first three steps, in order?
Answer
1) Snapshot the EBS volumes on the EC2 instance and copy them to the IR account: evidence first. 2) Quarantine the instance by attaching the isolation security group (no inbound, egress only to forensics infra). 3) Revoke the role's active sessions by attaching an explicit deny policy (for example, one conditioned on aws:TokenIssueTime); this cuts off the stolen credentials and any others the attacker derived from them. Editing the trust policy alone only blocks new sessions and does not invalidate credentials already issued. Only after these three: terminate, rebuild, and investigate the entry point (likely SSRF reaching IMDS; check whether IMDSv2 was required). The instance remains running until the snapshot completes; after that, either preserve it under legal hold or terminate it cleanly.

Common Cloud Attack Chains and Their Telemetry

| Chain | Telemetry that catches it |
|---|---|
| SSRF → IMDS → role creds → S3 read | WAF anomaly, IMDS access from non-EC2 user-agent, GuardDuty InstanceCredentialExfiltration, S3 access pattern anomaly |
| Leaked CI access key → IAM key creation → cross-account assume → exfil | CloudTrail CreateAccessKey, GuardDuty RootCredentialUsage, Athena correlation across accounts |
| Compromised pod → service account token → cluster API → secret read | K8s audit log, Falco shell-in-pod rule, anomalous SA token usage |
| Phished console session → MFA bypass → resource snapshot share | IdP impossible-travel, GuardDuty RDSSnapshotShared, Resource Access Manager logs |
| Supply-chain build tampering → image push → cluster pull | OIDC subject mismatch in Sigstore, Kyverno admission deny, registry push anomaly |
Mnemonic — IR triage
"Snapshot, Isolate, Revoke, Rotate, Rebuild."
  • Snapshot before any destructive action.
  • Isolate the affected resource.
  • Revoke active credentials.
  • Rotate every secret in blast radius.
  • Rebuild from known-good artifact.
Flashcard
A junior engineer says "we have CloudTrail and GuardDuty, we're covered for detection." Name three blind spots they leave.
Answer
Three of many: 1) Application logs. CloudTrail covers control plane ("the API call happened"), not data plane ("this user just downloaded 12GB of customer data via the app"). 2) Identity provider events. The session that hit AWS started at Okta/Entra/Workspace; sign-in anomalies, MFA failures, OAuth grant abuse — none in CloudTrail. 3) Runtime in the workload. A pod calling curl evil.com from a shell isn't a CloudTrail event; you need eBPF/Falco. Bonus: SaaS audit logs (GitHub, Slack), VPC Flow Logs for egress patterns, and Kubernetes audit log for cluster-API events.
🔑
Key takeaways
1) Centralize CloudTrail / Audit Logs, VPC Flow Logs, K8s audit, IdP, SaaS, runtime in a separate logging account with object-lock.
2) Turn on the native services (GuardDuty / SCC / Defender) before building anything custom.
3) Detection engineering is a hypothesis → rule → test → tune loop anchored on MITRE ATT&CK Cloud and codified as Sigma.
4) NIST 800-61 phases still apply; preparation is the phase you actually have time for.
5) Snapshot before isolating, isolate before terminating; preserve evidence at every cloud-fast action.
6) Communications discipline matters as much as technical response: say only what you know.
