
Cloud Detection & Incident Response
Prevention buys you time; detection buys you a story. Master cloud-native telemetry (CloudTrail, Audit Logs, GuardDuty, eBPF), MITRE ATT&CK Cloud detection engineering, the IR lifecycle adapted for ephemeral infrastructure, and the forensic acquisition that survives in environments where evidence vanishes in seconds.
What you will learn
You will be breached. Not because your defences are weak, but because at some scale a credential leaks, a pipeline is tampered with, a maintainer gets phished, or a bug ships. The question that matters then is not whether your prevention worked but whether your detection did, and whether the response process can answer the auditor and the customer in days rather than months. This chapter is about that machinery: telemetry, detections, incident handling and forensics, all reframed for ephemeral cloud environments where the evidence can disappear when a pod restarts.
The Cloud Telemetry Stack
You cannot investigate what you have not collected. The good news: cloud platforms emit the most uniformly structured telemetry in computing history. The bad news: each service emits a different shape, and the noise-to-signal ratio without curation is ferocious.
Sources you cannot skip
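- CloudTrail (AWS) / Cloud Audit Logs (GCP) / Activity Log (Azure). Every control-plane API call; data-plane events (for example S3 object reads on AWS, or Data Access logs on GCP) usually have to be enabled explicitly. Run multi-account, organization-wide trails into a separate logging account.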
- VPC Flow Logs — L4 connection records. Costly at scale; sample lower-value VPCs, full on prod.
- DNS query logs — Route 53 Resolver, Cloud DNS, Azure DNS Analytics. Best signal for C2 callouts.
- K8s audit log — every API call to the cluster.
- Application logs — structured, correlated by trace ID, scrubbed of secrets.
- Runtime telemetry — Falco/Tetragon/EDR for syscall and process tree.
- Identity provider logs — Okta system log, Entra sign-in logs, Workspace Login Audit. Critical for the AuthN side of every cloud incident.
- SaaS logs — GitHub audit log, Slack audit, M365 unified audit. The supply-chain detection layer.
The native detection services
AWS GuardDuty, GCP Security Command Center, and Microsoft Defender for Cloud all deliver curated detections on top of native telemetry without you operating the pipeline. They are the right starting point: turn them on org-wide before building anything custom. Their findings flow into Security Hub / SCC / Defender XDR for triage. Custom detections come in addition to these services, not instead of them.
Detection Engineering — Practice, Not Vibes
The detection engineering practice has matured into a four-step loop: hypothesis → rule → test → tune. The hypothesis is grounded in MITRE ATT&CK techniques observed in real incidents; the rule is written and reviewed; tests fire it (and verify it doesn't fire on benign behaviour); tuning closes the gap on noise.
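The loop can be sketched in a few lines of Python. This is a minimal illustration, assuming CloudTrail events arrive as parsed dicts; the hypothesis is T1098.001 (an access key created for a user other than the caller). The function names, the sample identities, and the allowlisted role ARN are all hypothetical.

```python
from typing import Any, Optional

# Tuning knob: roles trusted to mint keys for other users (hypothetical ARN).
ALLOWED_KEY_ISSUERS = {"arn:aws:iam::111111111111:role/iam-automation"}


def caller_arn(identity: dict) -> str:
    """Return the role ARN for assumed-role sessions, else the caller's own ARN."""
    issuer = identity.get("sessionContext", {}).get("sessionIssuer", {})
    return issuer.get("arn") or identity.get("arn", "")


def key_created_for_other_user(event: dict) -> bool:
    """Rule: iam:CreateAccessKey whose target user is not the caller."""
    if event.get("eventSource") != "iam.amazonaws.com":
        return False
    if event.get("eventName") != "CreateAccessKey":
        return False
    identity = event.get("userIdentity", {})
    if caller_arn(identity) in ALLOWED_KEY_ISSUERS:
        return False
    target = (event.get("requestParameters") or {}).get("userName")
    if target is None:  # no target given: the key is for the caller itself
        return False
    return target != identity.get("userName")


def make_event(identity: dict, target: Optional[str]) -> dict:
    """Build a minimal CloudTrail-shaped event for tests."""
    params: Any = {"userName": target} if target else None
    return {
        "eventSource": "iam.amazonaws.com",
        "eventName": "CreateAccessKey",
        "userIdentity": identity,
        "requestParameters": params,
    }


ALICE = {
    "type": "IAMUser",
    "arn": "arn:aws:iam::111111111111:user/alice",
    "userName": "alice",
}
AUTOMATION = {
    "type": "AssumedRole",
    "arn": "arn:aws:sts::111111111111:assumed-role/iam-automation/run-42",
    "sessionContext": {
        "sessionIssuer": {"arn": "arn:aws:iam::111111111111:role/iam-automation"}
    },
}

# Test step: the rule must fire on the attack and stay quiet on benign use.
assert key_created_for_other_user(make_event(ALICE, "bob"))
assert not key_created_for_other_user(make_event(ALICE, "alice"))
assert not key_created_for_other_user(make_event(ALICE, None))
assert not key_created_for_other_user(make_event(AUTOMATION, "bob"))
```

The allowlist is where tuning happens: each entry should arrive through a reviewed change with a matching benign test case, so the rule's exceptions are as auditable as the rule itself.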
Anchor on MITRE ATT&CK Cloud
The Cloud sub-matrix names the techniques you actually need to detect. A starter set of high-signal cloud detections, by ATT&CK ID:
| ATT&CK ID | Technique | What to detect |
|---|---|---|
| T1078.004 | Valid Accounts: Cloud | Console / API login from new geo / new ASN / impossible travel |
| T1098.001 | Add cloud credentials | iam:CreateAccessKey on a different user; iam:UpdateLoginProfile |
| T1552.005 | Credentials from cloud metadata API | IMDS access from non-EC2 user-agent or unexpected pod |
| T1530 | Data from cloud storage | Spike in s3:GetObject by one principal; cross-account read |
| T1538 | Cloud service dashboard | Console session reading large amounts of data |
| T1190 | Exploit public-facing application | WAF bypass + IMDS access pattern |
| T1136.003 | Create Account: Cloud Account | iam:CreateUser outside the org's create flow |
| T1562 | Impair Defenses | CloudTrail StopLogging, GuardDuty disable, security-group all-open |
| T1567 | Exfiltration over web service | Egress to first-time-seen domain; large outbound from a service that usually does not |
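As one illustration of the T1078.004 row, an impossible-travel check can be sketched as a speed test between two consecutive logins. This assumes each login has already been enriched with geo-IP coordinates; the field names (`lat`, `lon`, `time`) and both thresholds are assumptions chosen for the example, not values from any vendor's detection.

```python
import math
from datetime import datetime, timedelta, timezone

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_impossible_travel(prev: dict, curr: dict,
                         max_kmh: float = 900.0, min_km: float = 100.0) -> bool:
    """Flag two logins whose implied travel speed exceeds max_kmh.

    min_km suppresses alerts caused by ordinary geo-IP imprecision.
    """
    distance = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    if distance < min_km:
        return False
    hours = abs((curr["time"] - prev["time"]).total_seconds()) / 3600
    if hours == 0:
        return True
    return distance / hours > max_kmh


t0 = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)
london = {"lat": 51.5074, "lon": -0.1278, "time": t0}
paris = {"lat": 48.8566, "lon": 2.3522, "time": t0 + timedelta(hours=3)}
sydney = {"lat": -33.8688, "lon": 151.2093, "time": t0 + timedelta(hours=2)}

assert not is_impossible_travel(london, paris)   # plausible train journey
assert is_impossible_travel(london, sydney)      # no aircraft is that fast
```

In practice corporate VPN egress points and mobile carrier NAT produce most of the noise, so pairing the speed check with a "new ASN for this user" condition tends to help.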
Sigma — portable rule format
Sigma is the Snort/YARA of log detection: a YAML rule format that backends translate to Splunk SPL, Elastic KQL, Sentinel KQL, etc. Pre-built Sigma rules for cloud telemetry are maintained in the SigmaHQ repo; pulling these and adapting is the fastest way to a credible detection corpus.

    title: AWS CloudTrail Disabled
    id: 4eb19a23-b6b5-4b6a-9e58-1b2f25e6b7c0
    status: stable
    description: Detects when CloudTrail is stopped or deleted (defender impairment).
    tags: [attack.defense_evasion, attack.t1562.008]
    logsource: { product: aws, service: cloudtrail }
    detection:
      selection:
        eventSource: cloudtrail.amazonaws.com
        eventName:
          - StopLogging
          - DeleteTrail
          - UpdateTrail
          - PutEventSelectors
      filter_legitimate:
        userIdentity.arn|contains: ":role/security-platform/"
      condition: selection and not filter_legitimate
    fields: [userIdentity.arn, eventName, sourceIPAddress, requestParameters.name]
    level: high

The two false-positive classes
Detections fail in two directions, and you must measure both. Tactical FPs ("a backup script trips the rule") come from incomplete context — fixed by enriching with asset metadata or filtering known principals. Strategic FPs ("the rule fires on legitimate behaviour the team is changing") come from drift — the platform changed but the detection didn't. Quarterly rule reviews and rule-as-code (in Git, with PRs and tests) are the only honest way to keep up.
Anomaly Detection in the Cloud
Two flavours have proven durable; many others have not.
Behavioural baselines per principal
Establish a 30-90 day baseline of API actions per principal: region, action verbs, time-of-day, source ASN. Flag deviations: a CI role that has called s3:GetObject for a year suddenly calling iam:CreateAccessKey; a user logging in from a country they have never logged in from before. AWS GuardDuty's IAM and S3 detections do this natively; rolling your own with Athena and windowed queries is feasible.
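A minimal sketch of the roll-your-own variant, assuming events have already been normalised into dicts with `principal`, `action` and `region` keys (a hypothetical schema, not the raw CloudTrail shape). It models the baseline as the set of (action, region) pairs each principal has used and flags anything outside that set.

```python
from collections import defaultdict
from typing import Iterable


def build_baseline(history: Iterable[dict]) -> dict:
    """Map each principal to the set of (action, region) pairs it has used."""
    baseline = defaultdict(set)
    for event in history:
        baseline[event["principal"]].add((event["action"], event["region"]))
    return dict(baseline)


def find_deviations(baseline: dict, recent: Iterable[dict]) -> list:
    """Return (principal, action, region, reason) for out-of-baseline events."""
    findings = []
    for event in recent:
        principal = event["principal"]
        pair = (event["action"], event["region"])
        seen = baseline.get(principal)
        if seen is None:
            findings.append((principal, *pair, "principal has no baseline"))
        elif pair not in seen:
            findings.append((principal, *pair, "action/region never seen before"))
    return findings


# Hypothetical history: a CI role that only ever reads from S3 in one region.
history = [
    {"principal": "role/ci-build", "action": "s3:GetObject", "region": "eu-west-1"}
    for _ in range(1000)
]
recent = [
    {"principal": "role/ci-build", "action": "s3:GetObject", "region": "eu-west-1"},
    {"principal": "role/ci-build", "action": "iam:CreateAccessKey", "region": "us-east-1"},
]

baseline = build_baseline(history)
for finding in find_deviations(baseline, recent):
    print(finding)
```

A set-membership baseline like this is deliberately crude: it catches "never seen before" but not "seen before, at a very different volume". Time-of-day and source ASN would be added as further tuple elements or as separate baselines.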
Graph and chain detections
Single-event rules miss the multi-step chains attackers actually use. Graph-based detection (Panther, Elastic Attack Discovery, Sentinel Fusion) correlates events: "login from new geo, then key created, then unusual S3 access, all within 30 minutes." The signal is far stronger than any single rule, at the cost of operational complexity.
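The quoted chain can be expressed as an ordered-sequence match inside a time window. The sketch below is not how any of the named products work internally; it assumes single-event rules have already emitted "signals" as dicts with `principal`, `stage` and `time`, and the stage names are hypothetical labels for those rules.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Assumed stage labels, in the order the attacker must hit them (length >= 2).
CHAIN = ("new_geo_login", "key_created", "unusual_s3_access")
WINDOW = timedelta(minutes=30)


def find_chains(signals: list, chain: tuple = CHAIN, window: timedelta = WINDOW) -> list:
    """Return (principal, start, end) for each in-order chain completed in window."""
    by_principal = defaultdict(list)
    for signal in signals:
        by_principal[signal["principal"]].append(signal)

    hits = []
    for principal, items in by_principal.items():
        items.sort(key=lambda s: s["time"])
        for i, first in enumerate(items):
            if first["stage"] != chain[0]:
                continue
            deadline = first["time"] + window
            next_stage = 1
            for later in items[i + 1:]:
                if later["time"] > deadline:
                    break  # the window for this starting event has closed
                if later["stage"] == chain[next_stage]:
                    next_stage += 1
                    if next_stage == len(chain):
                        hits.append((principal, first["time"], later["time"]))
                        break
    return hits


t0 = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)


def sig(principal: str, stage: str, minutes: int) -> dict:
    return {"principal": principal, "stage": stage, "time": t0 + timedelta(minutes=minutes)}


signals = [
    sig("alice", "new_geo_login", 0),
    sig("alice", "key_created", 5),
    sig("alice", "unusual_s3_access", 20),   # completes within 30 minutes
    sig("bob", "new_geo_login", 0),
    sig("bob", "key_created", 5),
    sig("bob", "unusual_s3_access", 45),     # too late: outside the window
]
print(find_chains(signals))
```

The operational cost mentioned above shows up here as state: every open chain per principal has to be held until its window expires, which is why this logic usually lives in a streaming engine or a scheduled query rather than a stateless rule.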
The Incident Response Lifecycle — NIST 800-61, Revisited
NIST SP 800-61r2 names the phases: Preparation, Detection & Analysis, Containment, Eradication & Recovery, Post-incident. The cloud changes how each phase is executed, not that they happen.
Preparation — what "ready" means
- An incident commander on call 24×7. One person owns the response; everyone else assists.
- Runbooks per scenario: leaked-key, exposed-bucket, suspect-pod, compromised-CI. Not novels — checklists.
- A break-glass IR role in every account with read-everything + the explicit "contain" actions you may need (revoke session, isolate ENI, snapshot volumes). MFA-required, alarmed on use.
- Pre-built forensic acquisition Lambda / Cloud Function that snapshots EBS volumes and copies them to an evidence bucket in the IR account.
- Tabletop exercises twice a year. An AWS-style breach simulation or Rhino Security Labs' CloudGoat are good practice grounds.
Detection & Analysis — scope first, narrative second
The pressure when a real incident drops is to do something. The discipline is to first answer scope: which accounts, which principals, what time window, what data classes. Without scope you cannot contain effectively. Use the principal as the unit of analysis: pull every CloudTrail event for the suspect role over the relevant window, plot a timeline, identify lateral movements (assume-role chains, cross-account calls).
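A minimal sketch of principal-centred scoping, assuming CloudTrail records have been loaded as dicts (for example from the JSON files in the trail bucket). The helper names and the sample ARNs are hypothetical; the record fields used (`eventTime`, `userIdentity.arn`, `eventSource`, `eventName`, `requestParameters.roleArn`) follow the standard CloudTrail record layout.

```python
from datetime import datetime, timezone


def parse_time(value: str) -> datetime:
    """Parse a CloudTrail eventTime such as 2024-05-01T12:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def principal_timeline(events: list, arn_fragment: str,
                       start: datetime, end: datetime) -> list:
    """Return the suspect principal's events inside the window, oldest first."""
    selected = [
        e for e in events
        if arn_fragment in e.get("userIdentity", {}).get("arn", "")
        and start <= parse_time(e["eventTime"]) <= end
    ]
    return sorted(selected, key=lambda e: parse_time(e["eventTime"]))


def assumed_roles(timeline: list) -> list:
    """List the role ARNs the principal assumed: candidate lateral movement."""
    return [
        (e.get("requestParameters") or {}).get("roleArn")
        for e in timeline
        if e.get("eventSource") == "sts.amazonaws.com"
        and e.get("eventName") == "AssumeRole"
    ]


SUSPECT = "arn:aws:sts::111111111111:assumed-role/ci-deploy/build-981"
events = [
    {"eventTime": "2024-05-01T12:10:00Z", "eventSource": "sts.amazonaws.com",
     "eventName": "AssumeRole", "userIdentity": {"arn": SUSPECT},
     "requestParameters": {"roleArn": "arn:aws:iam::222222222222:role/data-reader"}},
    {"eventTime": "2024-05-01T12:02:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "ListBuckets", "userIdentity": {"arn": SUSPECT},
     "requestParameters": None},
    {"eventTime": "2024-04-20T08:00:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "ListBuckets", "userIdentity": {"arn": SUSPECT},
     "requestParameters": None},
]

window_start = parse_time("2024-05-01T00:00:00Z")
window_end = parse_time("2024-05-02T00:00:00Z")
timeline = principal_timeline(events, ":assumed-role/ci-deploy/", window_start, window_end)
for e in timeline:
    print(e["eventTime"], e["eventName"])
print("assumed:", assumed_roles(timeline))
```

Each role returned by `assumed_roles` becomes a new principal to scope in turn, which is how the assume-role chain gets walked across accounts.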
Containment — fast and reversible
Cloud containment levers are uniformly fast and uniformly possible to over-rotate. Useful actions:
- Revoke active sessions via IAM (aws iam put-user-policy with an explicit deny, or put-role-policy when the principal is a role, or attach the AWS-managed AWSDenyAll policy). This cuts off active STS credentials.
- Disable a credential via IAM but do not delete it; you need the audit trail for forensics.
- Quarantine an EC2 instance by replacing its security groups with an isolation SG (no inbound, egress only to the forensic acquisition channel).
- Cordon and drain a Kubernetes pod's node; snapshot the node before destruction.
- Apply an SCP-level guardrail to the affected account (e.g., "deny s3:* in this account"). Extreme, but available.
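For the session-revocation lever, a more reversible alternative to a blanket deny is a policy that denies only credentials issued before a cutoff, using the `aws:TokenIssueTime` condition key. The sketch below only builds the policy document; the function name is hypothetical, and attaching the result (for example as an inline role policy) is left to your own tooling.

```python
import json
from datetime import datetime, timezone


def revoke_sessions_policy(cutoff: datetime) -> str:
    """Build an IAM policy denying all actions to sessions issued before cutoff.

    Sessions created after the cutoff are unaffected, so legitimate workloads
    recover as soon as they obtain fresh credentials.
    """
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    issued_before = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "DateLessThan": {"aws:TokenIssueTime": issued_before}
                },
            }
        ],
    }
    return json.dumps(policy, indent=2)


print(revoke_sessions_policy(datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)))
```

Because `aws:TokenIssueTime` exists only on temporary credentials, this policy does not touch long-lived access keys; those still need to be deactivated separately.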
Eradication & Recovery
For credential compromise: rotate all credentials reachable from the compromised principal — including derived ones. Don't trust "the attacker only used X"; assume they exfiltrated everything they could read. For workload compromise: rebuild from known-good base images at known-good commits, with admission policy verifying signatures. For data-tier compromise: bring up a parallel environment from backups (verified backups, please — test restores quarterly) and cut traffic.
Forensic Acquisition in the Cloud
Forensics is hard in environments where evidence is ephemeral. Three patterns repay the work to set up:
EBS / disk snapshotting
Take a snapshot of the volume before termination. Copy the snapshot into the IR account (for encrypted volumes, the KMS key must be shared cross-account as well). Spin up a forensic VM in a hardened, isolated VPC, attach a volume created from the snapshot read-only, and analyse with standard tooling (Plaso for timelines; Volatility applies only if you also captured a memory image). The whole flow can be automated: a Step Function triggered by a HIGH-severity GuardDuty finding can start the snapshot and copy within seconds.
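The snapshot-and-share step might look like the sketch below. It assumes `ec2` is a boto3 EC2 client (or anything with the same methods) passed in by the caller, which keeps the function testable without AWS access; the function name, the tag key, and the case-ID convention are hypothetical.

```python
def acquire_volume(ec2, volume_id: str, ir_account_id: str, case_id: str) -> str:
    """Snapshot a volume, wait for completion, and share it with the IR account.

    `ec2` is expected to behave like boto3.client("ec2").
    Returns the snapshot ID.
    """
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"IR case {case_id}: evidence for {volume_id}",
        TagSpecifications=[
            {
                "ResourceType": "snapshot",
                "Tags": [{"Key": "ir-case", "Value": case_id}],
            }
        ],
    )
    snapshot_id = snapshot["SnapshotId"]

    # Block until the snapshot is complete; sharing an in-progress snapshot
    # gives the IR account nothing usable.
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # Grant the IR account permission to create volumes from the snapshot.
    ec2.modify_snapshot_attribute(
        SnapshotId=snapshot_id,
        Attribute="createVolumePermission",
        OperationType="add",
        UserIds=[ir_account_id],
    )
    return snapshot_id


# Usage (requires boto3 and credentials for the break-glass IR role):
#   import boto3
#   acquire_volume(boto3.client("ec2"), "vol-0abc...", "333333333333", "IR-2024-017")
```

Two caveats worth encoding in the runbook: snapshots encrypted with the default AWS-managed EBS key cannot be shared across accounts, and for customer-managed keys the key policy must grant the IR account access before the copy on the IR side will succeed.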
Container memory and process state
Containers exit and lose memory. Pre-incident: enable Falco/Tetragon to record relevant process trees and syscall events; that telemetry survives the pod. At incident: kubectl debug can attach an ephemeral debug container; criu or memfetch can dump process memory before the kill. For stronger guarantees, run forensic-targeted workloads in Kata containers where the entire VM is snapshottable.
Cloud-native log preservation
Lock the relevant log stream during an active incident — extend retention, place under legal hold (Object Lock with retention period). CloudTrail's event data store in CloudTrail Lake can hold relevant events for years.
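As a sketch of the legal-hold step, the function below places an Object Lock legal hold on every object under a log prefix. It assumes `s3` behaves like a boto3 S3 client and that the bucket was created with Object Lock enabled; the function name and the example bucket and prefix are hypothetical.

```python
def place_legal_hold(s3, bucket: str, prefix: str) -> list:
    """Put a legal hold on the current version of every object under prefix.

    `s3` is expected to behave like boto3.client("s3").
    Returns the keys that were placed under hold.
    """
    held = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            s3.put_object_legal_hold(
                Bucket=bucket,
                Key=obj["Key"],
                LegalHold={"Status": "ON"},
            )
            held.append(obj["Key"])
    return held


# Usage (requires boto3 and a bucket with Object Lock enabled):
#   import boto3
#   place_legal_hold(boto3.client("s3"), "org-cloudtrail-logs",
#                    "AWSLogs/111111111111/CloudTrail/eu-west-1/2024/05/01/")
```

A legal hold has no expiry and stays until someone with the right permission removes it, which suits an open investigation better than guessing a retention date up front. Recording the returned keys in the case file makes the later release auditable.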
Communications
Engineers underrate this. Customers, regulators and execs are not your audience for technical detail; they are your audience for accuracy and clarity.
- Within minutes — incident channel up; commander, scribe, communications lead, technical leads named.
- Within hours — status page updated if customer-impacting; legal looped in for breach notification analysis.
- Within 72 hours — GDPR notification window if EU personal data is plausibly affected. Other regimes (CCPA, NYDFS, HIPAA, India DPDP) have their own clocks.
- Within days — CISA / sector ISAC notification for material incidents. Customer-specific notices.
- Within weeks — public post-mortem if appropriate (Cloudflare and GitLab have set the bar here).
The single most important communication discipline: say only what you know, and qualify the rest. "We are still confirming whether customer X's data was accessed" is fine. Speculation that turns out wrong erodes trust faster than the breach itself.
Exercise: GuardDuty raises the finding UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.OutsideAWS. CloudTrail confirms an EC2 role's STS credentials were used from an IP outside AWS. The EC2 instance is still running. What are the first three steps, in order?
Common Cloud Attack Chains and Their Telemetry
| Chain | Telemetry that catches it |
|---|---|
| SSRF → IMDS → role creds → S3 read | WAF anomaly, IMDS access from non-EC2 user-agent, GuardDuty InstanceCredentialExfiltration, S3 access pattern anomaly |
| Leaked CI access key → IAM key creation → cross-account assume → exfil | CloudTrail CreateAccessKey, GuardDuty RootCredentialUsage, Athena correlation across accounts |
| Compromised pod → service account token → cluster API → secret read | K8s audit log, Falco shell-in-pod rule, anomalous SA token usage |
| Phished console session → MFA bypass → resource snapshot share | IdP impossible-travel, GuardDuty RDSSnapshotShared, Resource Access Manager logs |
| Supply-chain build tampering → image push → cluster pull | OIDC subject mismatch in Sigstore, Kyverno admission deny, registry push anomaly |
- Snapshot before any destructive action.
- Isolate the affected resource.
- Revoke active credentials.
- Rotate every secret in blast radius.
- Rebuild from known-good artifact.
curl evil.com from a shell isn't a CloudTrail event; you need eBPF/Falco. Bonus: SaaS audit logs (GitHub, Slack), VPC Flow Logs for egress patterns, and Kubernetes audit log for cluster-API events.
- NIST SP 800-61 r2: Computer Security Incident Handling Guide (nist.gov)
- MITRE ATT&CK: Cloud Matrix (mitre.org)
- SigmaHQ: Detection rules (github.com)
- AWS GuardDuty: Finding types (aws.amazon.com)
- AWS: Customer Incident Response Playbooks (github.com)
- Falco: Default rules library (falco.org)
- CloudGoat: Vulnerable AWS environment (github.com)