The Engineering Codex/Cloud Security Engineering
DAY 1 · AM

Cloud Foundations & the Shared Responsibility Model

10 min · Intermediate · 2,171 words
On-prem you owned the wire, the rack and the kernel. In the cloud, half of the stack is somebody else's job — and they will not tell you when you are doing your half wrong. Learn the shared responsibility model, the cloud threat landscape, and the four design heuristics that hold up across every provider.

What you will learn

01 The Shared Responsibility Model
02 The Cloud Threat Landscape — Misconfiguration Dominates
03 Four Heuristics That Survive Any Provider
04 The CSP Maturity Pyramid
05 The Architecture Review Lens

Moving to the cloud is rarely a security upgrade by accident. The hyperscalers operate hardware, hypervisors and global networks better than almost anyone, but you inherit a stack split horizontally between provider and tenant, and almost every cloud breach headline of the last decade has lived on the tenant side of that line. This chapter installs the foundations: who owns what, how the cloud changes the threat model, and the four heuristics you will reach for in every architecture review for the rest of the course.

🔑
The three models for the day
1) The shared responsibility model — what the provider does for you and what stays yours. 2) The cloud threat landscape — why misconfiguration, not exploitation, dominates the data. 3) The four heuristics — least privilege, blast-radius reduction, identity as the perimeter, and audit-by-default — that survive across AWS, Azure, GCP, and the platforms built on top.

The Shared Responsibility Model

The single most useful diagram in cloud security is also the one most often glossed over. AWS, Azure and GCP all publish a near-identical model: the provider is responsible for security of the cloud — the data centres, hypervisors, host OS, the global fibre — and the customer is responsible for security in the cloud — the workload, the data, the configuration, and the identities. The line between them shifts depending on the service model.

Figure: who owns each layer under each service model (top of stack first).
On-Prem | all yours: Data & access, App / runtime, OS / patching, Network, Hypervisor, Hardware, Power / cooling, Physical site
IaaS | yours: Data & IAM, App / runtime, OS & patching, Net configuration | provider's: Hypervisor, Hardware, Site & power
PaaS | yours: Data & IAM, App config | provider's: Runtime, OS, Network, Hypervisor, Hardware
SaaS | yours: Data & IAM | provider's: App, Runtime, OS, Network, Hypervisor, Hardware
The customer line slides up the stack as you move to higher abstractions. Data and IAM are always yours.
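The figure's four columns can be written down as a small lookup table. The sketch below is only a transcription of the figure; the names `RESPONSIBILITY` and `owner_of` are hypothetical, and real services blur these lines more than a table can show.

```python
# Layers per service model, top of stack first, transcribed from the figure.
# "customer" lists what stays with the tenant; "provider" lists the rest.
RESPONSIBILITY = {
    "on-prem": {
        "customer": ["data & access", "app / runtime", "os / patching", "network",
                     "hypervisor", "hardware", "power / cooling", "physical site"],
        "provider": [],
    },
    "iaas": {
        "customer": ["data & iam", "app / runtime", "os & patching", "net configuration"],
        "provider": ["hypervisor", "hardware", "site & power"],
    },
    "paas": {
        "customer": ["data & iam", "app config"],
        "provider": ["runtime", "os", "network", "hypervisor", "hardware"],
    },
    "saas": {
        "customer": ["data & iam"],
        "provider": ["app", "runtime", "os", "network", "hypervisor", "hardware"],
    },
}


def owner_of(model: str, layer: str) -> str:
    """Return 'customer' or 'provider' for a layer under a service model."""
    split = RESPONSIBILITY[model.lower()]
    for side in ("customer", "provider"):
        if layer.lower() in split[side]:
            return side
    raise KeyError(f"{layer!r} is not a layer in the {model} column")


print(owner_of("IaaS", "OS & patching"))  # customer
print(owner_of("PaaS", "OS"))             # provider
```

Whichever cloud column you pick, `"data & iam"` stays on the customer side.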

What is always yours

Whatever the service tier, three things never cross the line. Data — what you put into the platform, including its classification, retention and encryption keys when you choose to manage them. Identity — who you grant access to, with what privileges, for how long. Configuration — every toggle the provider exposes, from S3 bucket ACLs to Lambda execution roles to CloudFront origin policies. Every public-bucket headline of the last decade was a configuration failure on the customer side of the line.
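A configuration failure of the public-bucket kind is easy to spot mechanically. The following is a minimal sketch that inspects a bucket policy document (not ACLs) for Allow statements open to everyone. The function name and the `example-bucket` ARN are hypothetical, and skipping any statement that has a `Condition` is a simplifying assumption; a real analyser would evaluate the condition.

```python
import json


def public_statements(policy_json: str) -> list:
    """Return Allow statements whose principal is the wildcard '*'."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        aws = principal.get("AWS") if isinstance(principal, dict) else None
        is_public = (
            principal == "*"
            or aws == "*"
            or (isinstance(aws, list) and "*" in aws)
        )
        # Assumption: a Condition may narrow the grant, so it is not flagged here.
        if is_public and "Condition" not in stmt:
            flagged.append(stmt)
    return flagged


EXAMPLE_POLICY = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/*",
    }],
})

print(len(public_statements(EXAMPLE_POLICY)))  # 1
```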

The IaaS / PaaS / SaaS gradient

On IaaS (EC2, Compute Engine, Azure VM) you still patch the OS, manage the kernel, and run the firewall agent. On PaaS (Cloud Run, App Service, AWS Fargate) the runtime is provider-operated; you bring code and configuration. On SaaS (Google Workspace, M365, Salesforce) you own only the data, the users, and the integration settings. The threats change as you move up: on Fargate, host-kernel exploits largely become the provider's problem, while token theft and OAuth scope abuse become primary.

⚠️
The shared fate reframing
Google has argued for years that "shared responsibility" leaves customers carrying too much, and proposes "shared fate" — the provider ships secure-by-default blueprints, paved roads, and detection that surfaces customer-side failures. It is a useful frame: the provider's incentive is for your tenant to not be in tomorrow's headline either, because brand contagion is real.

The Cloud Threat Landscape — Misconfiguration Dominates

Year after year — Verizon DBIR, IBM Cost of a Data Breach, Mandiant M-Trends, the CSA's Top Threats reports — the same picture appears: in the cloud, exploitation of provider infrastructure is rare; customer misconfiguration and credential compromise account for the overwhelming majority of incidents. The hypervisor escape exists in research papers; the public S3 bucket exists in last week's news.

Class | Why it dominates | Canonical example
Misconfiguration | APIs make insecure states one click away; defaults change over time | Public S3, world-readable storage container, open Elasticsearch
Credential / identity compromise | Long-lived keys leak in code, CI logs, browser sessions | Leaked AWS access key in a public repo; OAuth refresh-token theft
Excessive permissions | Wildcard policies are easier than scoped ones | s3:* on a worker that only needs GetObject
Vulnerable workload | Container image or function bundles unpatched libraries | Log4Shell in a container, exposed admin panel
Insecure APIs | Cloud APIs are first-class; mistakes are fast | Server-Side Request Forgery hitting 169.254.169.254
Supply chain | One dependency, one base image, one CI runner | Compromised npm package, malicious Docker image
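The "excessive permissions" row is the easiest of these to check in code. As a rough sketch, the hypothetical helper below lists every wildcard action that an IAM-style policy document allows. It looks only at the `Action` element and ignores `NotAction`, resources and conditions, which a real linter would also need to consider.

```python
def wildcard_actions(policy: dict) -> list:
    """Return every allowed action pattern that contains a '*' wildcard."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    found = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        found.extend(a for a in actions if "*" in a)
    return found


# The table's example: a worker that only needs GetObject but was given s3:*.
worker_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}
print(wildcard_actions(worker_policy))  # ['s3:*']
```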

The Capital One breach (2019) is the textbook study: a misconfigured WAF allowed an attacker to make the EC2 instance perform an SSRF call to the IMDS, retrieve the role's credentials, and then call S3 with broad read permissions. Three customer-side mistakes stacked — WAF rule, IMDS version, role scope — and the provider's part of the stack was untouched. Read the post-mortems: each individual mistake felt small.

💡
IMDSv2 is the cheap fix
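After Capital One, AWS rolled out IMDSv2: every metadata request must carry a session token obtained with a PUT request, and the token response has a default hop limit of 1. A typical SSRF can only issue GET requests and cannot set custom headers, so it cannot obtain a token, and the hop limit keeps the token from travelling past the instance itself. Today, IMDSv2 should be required on every instance in every account. We will revisit metadata-service attacks on Day 6.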
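To make the two-step IMDSv2 flow concrete, here is a minimal sketch that builds the two HTTP requests without sending them, so it runs anywhere. The header names and the token endpoint follow the AWS documentation; the helper names are hypothetical, and the token value in the usage lines is a placeholder.

```python
from urllib.request import Request

IMDS_BASE = "http://169.254.169.254/latest"


def token_request(ttl_seconds: int = 21600) -> Request:
    """Step 1: a PUT that asks the metadata service for a session token."""
    return Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path: str, token: str) -> Request:
    """Step 2: a GET that presents the session token in a custom header."""
    return Request(
        f"{IMDS_BASE}/meta-data/{path.lstrip('/')}",
        headers={"X-aws-ec2-metadata-token": token},
    )


step1 = token_request()
step2 = metadata_request("iam/security-credentials/", "PLACEHOLDER-TOKEN")
print(step1.get_method(), step1.full_url)
print(step2.get_method(), step2.full_url)
```

On an actual instance you would pass each request to `urllib.request.urlopen` with a short timeout. An SSRF primitive that can only make the second kind of request, without the header, gets nothing back once IMDSv2 is required.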

The MITRE ATT&CK Cloud matrix

For mapping attacks, MITRE maintains a Cloud sub-matrix of ATT&CK with techniques specific to IaaS, SaaS, Office 365 and Azure AD. Worth bookmarking. Notable cloud-flavoured techniques include T1078.004 (valid cloud accounts), T1098.001 (additional cloud credentials — adding a new access key to a compromised principal), T1530 (data from cloud storage), T1538 (cloud service dashboard), and T1552.005 (credentials from instance metadata). When we get to detection on Day 6, these IDs will be the spine of every rule we write.
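One way these IDs end up in detections is as tags on audit-log events. The sketch below is illustrative only: the technique IDs and names come from the paragraph above, but the mapping from CloudTrail event names to techniques is an assumption made for this example, not a MITRE or AWS publication, and `tag_event` is a hypothetical helper.

```python
# Technique IDs and short names as listed in the text.
TECHNIQUES = {
    "T1078.004": "Valid cloud accounts",
    "T1098.001": "Additional cloud credentials",
    "T1530": "Data from cloud storage",
    "T1538": "Cloud service dashboard",
    "T1552.005": "Credentials from instance metadata",
}

# Assumed, illustrative mapping from CloudTrail event names to techniques.
EVENT_TO_TECHNIQUE = {
    "ConsoleLogin": "T1078.004",
    "CreateAccessKey": "T1098.001",
    "GetObject": "T1530",
}


def tag_event(event: dict) -> dict:
    """Return a copy of the event annotated with an ATT&CK technique, if any."""
    technique = EVENT_TO_TECHNIQUE.get(event.get("eventName"))
    return {
        **event,
        "attack_technique": technique,
        "attack_name": TECHNIQUES.get(technique),
    }


tagged = tag_event({"eventName": "CreateAccessKey", "eventSource": "iam.amazonaws.com"})
print(tagged["attack_technique"], tagged["attack_name"])
```

A tag is not a verdict: `CreateAccessKey` is usually benign, so a real rule would add context such as who made the call and for which principal.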

Quick check
Your team migrates a workload from EC2 to AWS Fargate. Which of the following stops being your responsibility, and which one becomes more important than before?
Stops being yours: patching the host OS, managing the kernel, host-level antivirus/EDR. AWS now operates the underlying compute and OS. Becomes more important: the IAM task role attached to the container — on Fargate it is the only thing standing between your workload and your AWS account, since you can no longer harden the host. A bad task role is now a one-step path to your data plane. Container image hygiene also matters more, because you cannot fall back on host hardening to compensate for a vulnerable image.

Four Heuristics That Survive Any Provider

Cloud APIs change every quarter. The heuristics below are mostly older than the cloud and will outlive whichever console you log into next.

Least privilege (grants & scopes) · Blast-radius reduction (accounts & segments) · Identity is the perimeter (authn & authz) · Audit by default (logs & provenance)
Read in any order — they reinforce each other. A least-privileged role with no logs is half a control; an audited but over-broad role is theatre.

1) Least privilege — and time-bound

Grant the smallest set of actions on the smallest set of resources for the shortest time. The cloud version adds time: long-lived access keys are inherently worse than short-lived STS credentials, and standing admin is worse than just-in-time elevation through a break-glass workflow. Day 1 PM is entirely about how to express this in IAM without making the platform unusable.
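"Smallest actions, smallest resources, shortest time" maps directly onto the fields of a policy document. The sketch below builds one under stated assumptions: the bucket and prefix are hypothetical, and the expiry is expressed with the `aws:CurrentTime` condition key, which is one possible way to time-box a grant and not the STS-based approach the paragraph prefers.

```python
from datetime import datetime, timedelta, timezone


def time_boxed_read_policy(bucket: str, prefix: str,
                           now: datetime, ttl: timedelta) -> dict:
    """Allow s3:GetObject on one prefix, and only until now + ttl.

    `now` must be timezone-aware so the expiry can be rendered in UTC.
    """
    expires = (now + ttl).astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "s3:GetObject",                          # smallest action set
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",    # smallest resource set
            "Condition": {"DateLessThan": {"aws:CurrentTime": expires}},  # shortest time
        }],
    }


start = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
policy = time_boxed_read_policy("example-bucket", "reports/", start, timedelta(hours=1))
print(policy["Statement"][0]["Condition"])
```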

2) Blast-radius reduction — segment by failure domain

Assume any single principal, network segment or service will be compromised. Design so the explosion is contained. The strongest tool here is the account boundary: separate AWS accounts, GCP projects or Azure subscriptions per environment, per workload class, per blast radius. A misconfigured policy in prod-payments cannot expose prod-analytics if they live in different accounts with no shared roles. VPC segmentation, namespaces and service meshes layer below that.
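Blast radius can be estimated as reachability in a graph of cross-account trust. The following is a minimal sketch, assuming you can export "a principal in account A may assume a role in account B" as directed edges; the account names other than prod-payments and prod-analytics are hypothetical.

```python
from collections import deque


def reachable_accounts(trust_edges: dict, compromised: str) -> set:
    """Breadth-first search over trust edges, starting from the compromised account."""
    seen = {compromised}
    queue = deque([compromised])
    while queue:
        current = queue.popleft()
        for nxt in trust_edges.get(current, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {compromised}


# An edge A -> B means: a principal in A can assume a role in B.
edges = {
    "dev": {"shared-ci"},
    "shared-ci": {"prod-analytics"},
    "prod-payments": set(),
}
print(sorted(reachable_accounts(edges, "dev")))            # ['prod-analytics', 'shared-ci']
print(sorted(reachable_accounts(edges, "prod-payments")))  # []
```

The transitive hop is the point of the exercise: dev has no direct trust into prod-analytics, yet it reaches it through the shared CI account.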

3) Identity is the new perimeter

The corporate firewall used to define inside vs outside. In the cloud, every API call originates from a principal — a user, a role, a workload identity — and is signed and authorized at the API. If your identity story is broken, no amount of network segmentation saves you. Conversely, with strong identity you can run flat networks where every hop is mutually authenticated. This is the core thesis of Zero Trust, NIST SP 800-207 and BeyondCorp; we will spend Day 2 PM on it.
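The idea that every call is signed by a principal and checked at the API can be shown with a deliberately simplified HMAC scheme. This is a teaching stand-in and not AWS Signature Version 4 or any provider's real protocol; the function names and the message layout are assumptions made for the sketch.

```python
import hashlib
import hmac


def sign(secret: bytes, method: str, path: str, timestamp: str) -> str:
    """Sign the parts of a request that must not be tampered with."""
    message = "\n".join([method.upper(), path, timestamp]).encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, method: str, path: str,
           timestamp: str, signature: str) -> bool:
    """Recompute the signature server-side and compare in constant time."""
    expected = sign(secret, method, path, timestamp)
    return hmac.compare_digest(expected, signature)


secret = b"demo-secret-not-for-production"
sig = sign(secret, "GET", "/buckets/reports", "2025-01-01T12:00:00Z")
print(verify(secret, "GET", "/buckets/reports", "2025-01-01T12:00:00Z", sig))  # True
print(verify(secret, "GET", "/buckets/payroll", "2025-01-01T12:00:00Z", sig))  # False
```

Where the caller sits on the network plays no part in the check, which is the sense in which identity replaces the perimeter. A real API would also reject stale timestamps and then run an authorization decision for the verified principal.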

4) Audit by default, immutable, central

Cloud control-plane API calls can be logged uniformly through CloudTrail, Cloud Audit Logs and the Azure Activity Log. Turn logging on for every account, ship the logs to a separate logging account with object-lock or write-once storage, and retain them long enough to cover the typical breach dwell time (Mandiant M-Trends 2024 reported a global median of about 10 days, but cloud-specific incidents often span months because credentials are reused). If logs are retained only in the same account that is compromised, you have no logs.
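The three properties named above (separate account, write-once storage, sufficient retention) are simple enough to assert in code. The sketch below uses a hypothetical `TrailConfig` record, not any provider's API, and the 365-day default is an assumption for illustration; pick a figure that fits your own dwell-time estimate.

```python
from dataclasses import dataclass


@dataclass
class TrailConfig:
    source_account: str   # account whose API calls are being logged
    log_account: str      # account that stores the logs
    write_once: bool      # object-lock or equivalent write-once storage
    retention_days: int


def audit_gaps(cfg: TrailConfig, min_retention_days: int = 365) -> list:
    """Return the audit-by-default properties this configuration fails."""
    gaps = []
    if cfg.log_account == cfg.source_account:
        gaps.append("logs are stored in the account they describe")
    if not cfg.write_once:
        gaps.append("log storage is not write-once")
    if cfg.retention_days < min_retention_days:
        gaps.append(f"retention is below {min_retention_days} days")
    return gaps


weak = TrailConfig("prod-payments", "prod-payments", write_once=False, retention_days=30)
strong = TrailConfig("prod-payments", "log-archive", write_once=True, retention_days=400)
print(audit_gaps(weak))
print(audit_gaps(strong))  # []
```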

🌱
Paved roads beat policies
Slack threads where security pleads with developers to use the secure pattern do not scale. Paved roads — golden modules, secure CI templates, sanctioned Helm charts, opinionated Backstage scaffolders — make the safe path also the easy path. Every heuristic above is enforced more reliably by code than by review.

The CSP Maturity Pyramid

Where is your organization on the journey? A useful framing, in the spirit of the AWS Well-Architected security pillar and Google Cloud's adoption framework, is a five-level maturity model. Most teams skip levels and pay for it later.

1 · Inventory & visibility
2 · Identity & access hygiene
3 · Network & data controls
4 · Detection & response
5 · Adaptive
You cannot detect what you cannot see, defend what you cannot identify, or respond to what nobody owns. Build bottom-up.
  • L1 — Inventory. Every account/project/subscription is known, tagged with owner, environment, data classification. Asset registry is automated. Without this, every later level lies.
  • L2 — Identity hygiene. No long-lived root keys; SSO with MFA for humans; workload identity (IRSA, Workload Identity Federation) for services; baseline IAM guardrails (SCPs, Org Policies).
  • L3 — Network & data. Default-deny SGs, private endpoints for managed services, KMS-managed keys with rotation, encryption-in-transit everywhere.
  • L4 — Detection & response. Centralised CloudTrail/audit logs, guardrail-grade detections (GuardDuty, Security Command Center, Defender), runtime detection (Falco, eBPF), 24×7 on-call.
  • L5 — Adaptive. Continuous control monitoring (CSPM/CNAPP), purple-team exercises, blast-radius testing, paved-road enforcement at PR time.
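"Build bottom-up" has a precise meaning: a level counts only if every level beneath it is also met. A minimal sketch of that scoring rule follows; the function name is hypothetical, and the boolean inputs stand in for whatever evidence you collect per level.

```python
LEVELS = [
    "Inventory & visibility",
    "Identity & access hygiene",
    "Network & data controls",
    "Detection & response",
    "Adaptive",
]


def maturity_level(met: list) -> int:
    """Return the highest level reached with no gaps below it (0 if L1 is unmet)."""
    level = 0
    for satisfied in met[:len(LEVELS)]:
        if not satisfied:
            break  # a gap here invalidates everything above it
        level += 1
    return level


# A team with good detection (L4) but no network and data controls (L3) scores 2.
print(maturity_level([True, True, False, True, False]))  # 2
```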
Mnemonic — the four heuristics
"Less, Smaller, Identity, Audit."
  • Less — least privilege, time-bound
  • Smaller — blast-radius reduction (account boundaries first)
  • Identity — every API call signed; identity is the perimeter
  • Audit — by default, immutable, in a separate account

The Architecture Review Lens

Every later chapter in this course loops back here. When you sit in front of a new architecture diagram, ask the same questions in the same order. They will rarely all be answered well; the gaps tell you where to focus.

  1. Trust boundaries. Where does data cross from one trust zone to another? (Internet → VPC, account → account, namespace → namespace, principal → principal.) Each boundary should authenticate, authorize and log.
  2. Identity model. Who or what is calling each API? Are credentials long-lived or short-lived? Is the principal scoped to the resources it actually touches?
  3. Blast radius. If this component is compromised, what can the attacker reach? Different account? Different region? Same row in the database? Map it explicitly.
  4. Data path. Where does sensitive data live, in motion and at rest? Whose key encrypts it, and who can use that key?
  5. Detection. What event would tell you this system is being abused? Where would that event land? Who is paged?
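The five questions work well as a fixed checklist whose unanswered items become the review's findings. The sketch below shows one hypothetical way to record that; the answer texts are invented for illustration.

```python
# The five review questions, in the order given above.
REVIEW_QUESTIONS = [
    "Trust boundaries",
    "Identity model",
    "Blast radius",
    "Data path",
    "Detection",
]


def review_gaps(answers: dict) -> list:
    """Return the questions with no answer or an empty one, in review order."""
    return [q for q in REVIEW_QUESTIONS if not answers.get(q, "").strip()]


answers = {
    "Trust boundaries": "Internet to VPC via ALB; all hops authenticated and logged",
    "Identity model": "Workload identity, short-lived credentials",
    "Blast radius": "",
    "Data path": "Encrypted at rest with a customer-managed key",
}
print(review_gaps(answers))  # ['Blast radius', 'Detection']
```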
Flashcard
A startup runs everything in a single AWS account. The CTO argues separating prod and non-prod into different accounts is "premature complexity." Give one technical and one organisational reason it is the opposite.
Answer
Technical: the account is the strongest AWS-native blast-radius boundary. In a single account, a misconfigured IAM policy, a missing SCP, or a leaked admin key affects every workload simultaneously. Service Quotas, IAM identity limits and audit-log volume all pile into one tenant. Organisational: separation forces cross-environment dependencies to be explicit (who reads what, with what role). Without it, calls between prod and dev accumulate informally, and "dev needs read access to prod for debugging" becomes a permission grant that is never revoked. Splitting accounts late costs far more than splitting them early.
🔑
Key takeaways
1) The shared responsibility model moves the customer line up the stack as you choose higher abstractions, but data, identity and configuration are always yours. 2) Cloud incidents are dominated by misconfiguration and credential compromise, not provider exploitation — design for that distribution. 3) The four heuristics — least privilege, blast-radius reduction, identity-as-perimeter, audit-by-default — survive across providers and decades. 4) Build maturity bottom-up: visibility before control, control before detection, detection before adaptation.
