The Engineering Codex/Cloud Security Engineering
DAY 7

Governance, Compliance & Architecture Review

10 min · Advanced · 2,191 words
Tie it together. Map the controls from days 1-6 onto the compliance regimes auditors actually run (SOC 2, ISO 27001, PCI, HIPAA, FedRAMP), learn how to threat-model cloud-native systems, and run the kind of architecture review that finds real issues without becoming theatre.

What you will learn

01 The Compliance Map
02 Compliance as Code
03 Threat Modeling Cloud-Native Systems
04 The Architecture Review
05 The Risk Register
06 The Engineering–Security Operating Model

Six days in, you have the technical building blocks. Day 7 is the system that operates them — how compliance, threat modeling and review actually fit together in a real engineering organization. Mostly, they are simpler than the consultant decks suggest, and harder than the engineering teams imagine. The trick is that none of them is a one-shot deliverable; all are continuous practices.

🔑
Today's wrap-up
1) The compliance regimes you will actually meet — SOC 2, ISO 27001, PCI DSS, HIPAA, FedRAMP, GDPR — what they require and how to operate them as code. 2) Control mapping — turning controls into auditable evidence without performance theatre. 3) Threat modeling cloud-native systems — STRIDE adapted, tags-as-trust-boundaries, the questions that catch real issues. 4) The architecture review — the lens, the artifacts, and the social discipline that makes it work.

The Compliance Map

Six frameworks cover the large majority of what cloud teams encounter. They overlap heavily, so one well-run control program satisfies most of each.

[Figure: SOC 2 (trust services), ISO 27001 (ISMS, 93 controls), and PCI DSS (cardholder data), with a shared overlap of access, encryption, logging, and change management.]
Most controls satisfy multiple regimes. Build once, audit many.
| Regime | Who needs it | What it really tests | Cloud delta |
|---|---|---|---|
| SOC 2 | Most US B2B SaaS | Trust Services Criteria: Security (mandatory), Availability, Confidentiality, Processing Integrity, Privacy. Type II = operating effectiveness over an observation period, commonly 6 to 12 months | Auditors expect cloud-native evidence: IAM exports, CloudTrail samples, alarm history |
| ISO 27001 | Global B2B / EU customers | An ISMS: formally documented, risk-driven, reviewed. Annex A (2022) has 93 controls | Pairs with ISO 27017 (cloud) and ISO 27018 (PII processors) |
| PCI DSS 4.0 | Anyone touching card data | 12 requirements, hard prescriptive controls (segmentation, encryption, MFA, scanning) | 4.0 added the customized approach for cloud equivalents; explicit mention of containers and serverless |
| HIPAA | US healthcare data | Administrative, physical, technical safeguards. Contractual via BAAs | Cloud BAAs from AWS/GCP/Azure cover infra; your use is still your responsibility |
| FedRAMP | US gov SaaS | NIST SP 800-53 Rev 5 catalog, risk-categorized (Low/Moderate/High). Heavily prescriptive | Related programs include StateRAMP and the DoD Impact Levels (IL2/4/5/6); annual assessments are performed by an accredited 3PAO |
| GDPR / DPDP / CCPA | Personal data of EU, India, and California residents | Lawful basis, data minimization, subject rights, 72-hr breach notification (GDPR), DPIAs | Cloud transfers to non-adequate countries need SCCs / TIA + technical controls |

The 80/20 control bundle

If you want to be in good shape for any of the above, six control families cover the bulk of what every regime cares about:

  1. Identity & access — SSO + MFA for humans, workload identity for services, periodic access reviews, least privilege documented.
  2. Encryption — at rest with customer-managed keys, in transit with TLS 1.2+, key rotation policy.
  3. Logging & monitoring — centralized logs, SIEM with detections, retention meeting the longest applicable regime.
  4. Change management — code review, IaC, approved deploy paths, audit trail.
  5. Vulnerability management — scanning at build, patching SLAs, exception process.
  6. Incident response — runbooks, on-call, post-incident reviews, breach notification process.

Document each as a one-page policy, automate as much as possible (the procedure), and have evidence ready (the artifact: CloudTrail samples, IAM access advisor, scan results, review tickets, post-mortems). That triple — policy, procedure, artifact — is what an auditor wants to see for any control.
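The policy, procedure, artifact triple can be held as data so that gaps are visible before an auditor finds them. The following is a minimal sketch, not a prescribed schema: the `Control` fields, the file paths, and the choice of which regimes each family maps to are illustrative assumptions.

```python
from dataclasses import dataclass

# Assumption: these three regimes share the listed control families,
# as the overlap figure suggests. Adjust the mapping to your own scope.
SHARED_REGIMES = frozenset({"SOC 2", "ISO 27001", "PCI DSS"})


@dataclass(frozen=True)
class Control:
    family: str
    policy: str        # path to the one-page policy (hypothetical paths)
    procedure: str     # the automation that enforces the policy
    artifacts: tuple   # evidence an auditor can sample
    regimes: frozenset = SHARED_REGIMES


CONTROLS = [
    Control("identity-access", "policies/identity-access.md",
            "SSO + MFA enforced at the IdP",
            ("IAM access advisor export", "access review tickets")),
    Control("logging-monitoring", "policies/logging.md",
            "org-wide CloudTrail to a central account",
            ("CloudTrail samples", "alarm history")),
    Control("vulnerability-management", "policies/vuln-mgmt.md",
            "scan at build", ()),  # no evidence collected yet
]


def incomplete_controls(controls):
    """Families missing any leg of the policy/procedure/artifact triple."""
    return [c.family for c in controls
            if not (c.policy and c.procedure and c.artifacts)]


def evidence_by_regime(controls):
    """Build once, audit many: the same artifacts are indexed per regime."""
    index = {}
    for control in controls:
        for regime in sorted(control.regimes):
            index.setdefault(regime, []).extend(control.artifacts)
    return index


print(incomplete_controls(CONTROLS))
print(evidence_by_regime(CONTROLS)["SOC 2"])
```

A control that shows up in `incomplete_controls` is one where the policy exists on paper but nothing can yet be handed to an auditor.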

Compliance as Code

Manual evidence collection is expensive and error-prone. Modern teams treat compliance evidence as a build target, not a quarterly scramble.

  • Continuous control monitoring: Vanta, Drata, Secureframe, Tugboat Logic, Wiz, Prowler, OpenSCAP. Most connect to your cloud APIs (OpenSCAP scans hosts instead) and assert controls ("all S3 buckets are private," "MFA on all human users") continuously.
  • Policy as code: your SCPs, OPA policies, Kyverno rules, and Cloud Custodian policies are the controls. Drift away from them is the audit finding; pre-merge tests prevent it.
  • OSCAL: NIST's structured language for control catalogs and assessments. FedRAMP publishes its baselines in OSCAL and accepts OSCAL-formatted packages; expect this to spread.
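To make "assert controls continuously" concrete, here is a small sketch of the idea, independent of any vendor tool. It assumes a hypothetical snapshot format (a list of dicts with `type`, `id`, and a few flags) that you would populate from your cloud inventory; none of these field names come from a real API.

```python
# Each assertion returns True when the resource complies or is not in scope.
def s3_bucket_private(resource):
    return (resource["type"] != "s3_bucket"
            or resource.get("public_access_blocked") is True)


def human_has_mfa(resource):
    return (resource["type"] != "human_user"
            or resource.get("mfa_enabled") is True)


ASSERTIONS = {
    "all S3 buckets are private": s3_bucket_private,
    "MFA on all human users": human_has_mfa,
}


def evaluate(resources):
    """Return, per control, the ids of resources that violate it."""
    findings = {name: [] for name in ASSERTIONS}
    for resource in resources:
        for name, check in ASSERTIONS.items():
            if not check(resource):
                findings[name].append(resource["id"])
    return findings


# Hypothetical inventory snapshot.
SNAPSHOT = [
    {"type": "s3_bucket", "id": "bucket-logs", "public_access_blocked": True},
    {"type": "s3_bucket", "id": "bucket-public-assets", "public_access_blocked": False},
    {"type": "human_user", "id": "alice", "mfa_enabled": True},
]

print(evaluate(SNAPSHOT))
```

Run on a schedule or on every inventory change, the non-empty lists are the findings and the empty ones are the evidence.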

Threat Modeling Cloud-Native Systems

STRIDE (Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege) was developed at Microsoft in the late 1990s for conventional software products; the categories still apply, but cloud expands the attack surface.

The right level of abstraction

A useful threat model is system-level, not feature-level. Draw the data flow: components, trust boundaries, data classifications. The trust boundaries are where threats live — every flow that crosses one is a risk to evaluate. In cloud, common boundaries:

  • Internet → public ALB / API Gateway.
  • Public subnet → private subnet.
  • Account A → Account B (assume-role, EventBridge, S3 cross-account).
  • Region A → Region B.
  • Provider A → Provider B (multi-cloud).
  • Production → analytics / data warehouse.
  • Synchronous service ↔ third-party SaaS.
[Figure: data-flow diagram. Zones: Internet, App account, Data account. Components: Browser, API, Worker, Aurora (KMS). Each dashed boundary is a place to ask the STRIDE-Cloud questions.]
A simple DFD with trust boundaries. The interesting threats live at every dashed line crossing.
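The same diagram can be written down as data, which makes "every flow that crosses a boundary" a query rather than a judgment call. This sketch assumes the components sit in the zones the figure suggests (Browser on the Internet, API and Worker in the app account, Aurora in the data account); the data classifications are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    source: str
    target: str
    data_class: str  # hypothetical classification labels


# Which trust zone each component lives in (assumed from the figure).
ZONES = {
    "browser": "internet",
    "api": "app-account",
    "worker": "app-account",
    "aurora": "data-account",
}

FLOWS = [
    Flow("browser", "api", "customer-pii"),
    Flow("api", "worker", "job-metadata"),
    Flow("worker", "aurora", "customer-pii"),
]


def boundary_crossings(flows, zones):
    """Flows whose endpoints sit in different trust zones."""
    return [f for f in flows if zones[f.source] != zones[f.target]]


for flow in boundary_crossings(FLOWS, ZONES):
    print(f"{flow.source} -> {flow.target} "
          f"({ZONES[flow.source]} -> {ZONES[flow.target]}, {flow.data_class})")
```

The API to Worker flow stays inside one account and drops out; the other two are the ones to evaluate.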

STRIDE-Cloud question set

For each flow crossing a boundary, ask:

  • Spoofing — who is the caller and how is their identity proven? (mTLS? OIDC? signed request?)
  • Tampering — is the request integrity-protected? (Replay possible? AAD on encrypted DEKs?)
  • Repudiation — is this call logged with the actual principal? (CloudTrail? K8s audit? IdP?)
  • Information disclosure — what data crosses, encrypted with what key, accessible to whom?
  • Denial of service — what is the request quota, the cost ceiling, the autoscale limit?
  • Elevation of privilege — does completing this flow give the caller more than they should keep?

The cloud-flavoured additions

  • Configuration drift — what could turn this benign flow into a public exposure with one bad checkbox?
  • Tag integrity — if access depends on tags (ABAC), what stops the wrong tag value?
  • Cross-account confused-deputy — does any service we trust act on requests from less-trusted callers? (S3 bucket policy with aws:SourceAccount, EventBridge cross-account, KMS grant abuse.)
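The confused-deputy question lends itself to a mechanical check. As a rough sketch, the function below scans an AWS resource policy (as a parsed JSON dict) for Allow statements that trust a service principal without any source-scoping condition. `aws:SourceAccount`, `aws:SourceArn`, and `aws:SourceOrgID` are real global condition keys; the bucket name, Sids, and account ID are placeholders, and a production check would need to handle more policy shapes than this one does.

```python
# Condition keys that tie a service principal's request back to a known source.
GUARD_KEYS = {"aws:sourceaccount", "aws:sourcearn", "aws:sourceorgid"}


def unguarded_service_statements(policy):
    """Return Sids of Allow statements that trust a service principal
    with no source-scoping condition key."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement is also valid JSON policy
        statements = [statements]
    flagged = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        if not (isinstance(principal, dict) and "Service" in principal):
            continue
        # Condition is {operator: {key: value}}; key names are case-insensitive.
        condition_keys = {
            key.lower()
            for operator_block in stmt.get("Condition", {}).values()
            for key in operator_block
        }
        if not condition_keys & GUARD_KEYS:
            flagged.append(stmt.get("Sid", f"statement-{index}"))
    return flagged


# Placeholder policy: one guarded statement, one unguarded.
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GuardedWrite",
            "Effect": "Allow",
            "Principal": {"Service": "cloudtrail.amazonaws.com"},
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::example-log-bucket/*",
            "Condition": {"StringEquals": {"aws:SourceAccount": "111122223333"}},
        },
        {
            "Sid": "UnguardedWrite",
            "Effect": "Allow",
            "Principal": {"Service": "cloudtrail.amazonaws.com"},
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::example-log-bucket/*",
        },
    ],
}

print(unguarded_service_statements(EXAMPLE_POLICY))
```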
  • Build & deploy path — could an attacker shift the binary that ends up in this component? (Day 5.)

The Architecture Review

The cheapest place to find a security issue is the design phase. The most expensive place is post-incident. Architecture reviews bridge them.

What good looks like

  • Triggered by code, not by calendar. A new service, new data class, new external dependency, new region — these trigger a review. Quarterly all-hands reviews catch nothing.
  • Lightweight when possible. A 1-page RFC + 30-min sync covers most. Reserve full threat-model exercises for genuinely new threat surfaces (new auth method, new untrusted input class, new data-out flow).
  • Specific, written follow-ups. "Add an SCP for region pinning" beats "consider further hardening." Tracked in the same issue tracker as feature work, with owners and dates.
  • Reviewers who own outcomes. A review where security raises concerns and engineering politely accepts and ignores them is theatre. Either security has a concrete must-fix list or the concerns convert to documented accepted risks signed by a named risk owner.
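The trigger rules above are simple enough to encode, for example as a check in a pull-request template or RFC intake form. This is only a sketch: the `ChangeSummary` fields mirror the triggers named in the list, and the returned labels are made-up names for the two review tiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeSummary:
    # Triggers for a lightweight review (1-page RFC + 30-min sync).
    new_service: bool = False
    new_data_class: bool = False
    new_external_dependency: bool = False
    new_region: bool = False
    # Genuinely new threat surfaces: reserve a full threat model for these.
    new_auth_method: bool = False
    new_untrusted_input_class: bool = False
    new_data_out_flow: bool = False


def review_level(change):
    """Map a change to the review tier it should trigger."""
    if (change.new_auth_method
            or change.new_untrusted_input_class
            or change.new_data_out_flow):
        return "full-threat-model"
    if (change.new_service
            or change.new_data_class
            or change.new_external_dependency
            or change.new_region):
        return "lightweight-rfc"
    return "none"


print(review_level(ChangeSummary(new_region=True)))
print(review_level(ChangeSummary(new_service=True, new_auth_method=True)))
print(review_level(ChangeSummary()))
```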

The architecture review checklist

One page; the same questions for every review. Most designs fail in predictable ways.

  1. Trust boundaries. Drawn? Each crossing authenticated, authorized, logged?
  2. Identity model. Every principal short-lived? Workload identity for services? OIDC for CI? Long-lived access keys called out as exceptions?
  3. Data classification. What is in motion and at rest? Most-sensitive class drives controls.
  4. Encryption. Customer-managed keys with rotation? Encryption context bound to tenant/purpose?
  5. Network. Default-deny? Egress filtered or proxied? Endpoint policies pin to org?
  6. Admission & supply chain. Image signing required? Provenance verified? Dependency hygiene?
  7. Detection. Three things that, if observed, would tell us this system is being abused. Where they land. Who is paged.
  8. Blast radius. If this component is fully compromised, what can the attacker reach? Document it.
  9. Disaster recovery. RTO / RPO. Backups verified by test restore. Cross-region copies for highest-class data.
  10. Compliance scope. Which regimes apply? Which controls does this design implicate?
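Item 8, blast radius, is at heart a reachability question. One way to document it, sketched here under the assumption that you can list which components each component can reach through its credentials or network paths, is a breadth-first walk over that graph. The component names and edges are hypothetical.

```python
from collections import deque


def blast_radius(reach_graph, compromised):
    """Everything transitively reachable from a fully compromised component."""
    seen = {compromised}
    queue = deque([compromised])
    while queue:
        node = queue.popleft()
        for neighbor in reach_graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(compromised)  # report only what the attacker gains beyond the start
    return seen


# Hypothetical "can reach" edges (credentials or network paths).
REACH = {
    "api": ["worker", "session-cache"],
    "worker": ["aurora", "kms-key"],
    "ci-runner": ["artifact-bucket"],
}

print(sorted(blast_radius(REACH, "api")))
print(sorted(blast_radius(REACH, "ci-runner")))
```

The output for each component is the list to write into the review, and a long list is an argument for shrinking credentials before sign-off.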
⚠️
Beware of the "out of scope"
The most common review failure mode: "that's out of scope for this review." Out-of-scope items have a way of becoming the actual root cause of next year's incident. If a control belongs in another team's lane, file the cross-team ticket as part of the review's output. "Not my problem" is not a valid sign-off.

The Risk Register

Not every issue is fixable today; some are accepted with mitigations. The risk register is the document of record. Each entry has: description, blast radius, likelihood, current mitigations, the remediation plan and date, and the named risk owner. Reviewed monthly; expired items reopened or formally re-accepted.

Why this matters operationally: when an incident happens, the first question regulators and customers ask is "did you know about this risk?" If yes and it was on the register with a plan, that is diligence on the record. If yes and it was buried in a Slack thread, that is negligence. If no, well, you find out together.
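The register fields and the monthly review can be expressed in a few lines. This is a minimal sketch with assumed field names and an invented entry, not a recommended tool; a spreadsheet or issue tracker with the same columns does the same job.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RiskEntry:
    description: str
    blast_radius: str
    likelihood: str
    mitigations: list
    remediation_plan: str
    remediation_due: date
    owner: str  # a named person, not a team alias


def needs_decision(register, today):
    """Entries past their remediation date: reopen or formally re-accept."""
    return [entry for entry in register if entry.remediation_due < today]


# Invented example entries.
REGISTER = [
    RiskEntry(
        description="Legacy service still uses a long-lived access key",
        blast_radius="read access to one S3 bucket",
        likelihood="medium",
        mitigations=["key scoped to a single bucket", "monthly rotation"],
        remediation_plan="migrate to workload identity",
        remediation_due=date(2024, 3, 1),
        owner="J. Doe",
    ),
    RiskEntry(
        description="No cross-region backup for analytics warehouse",
        blast_radius="loss of derived data only",
        likelihood="low",
        mitigations=["source data is replicated"],
        remediation_plan="enable cross-region snapshots",
        remediation_due=date(2024, 9, 1),
        owner="A. Smith",
    ),
]

for entry in needs_decision(REGISTER, today=date(2024, 6, 1)):
    print(f"EXPIRED: {entry.description} (owner: {entry.owner})")
```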

The Engineering–Security Operating Model

Three structures have emerged for how engineering and security work together. None is universally right; all beat "throw it over the wall."

  • Embedded security engineers — one security eng per platform area, in the team's standups. Highest empathy; doesn't scale past tens of teams.
  • Security champions — engineers in product teams trained as the first line; central security supports. Scales further; relies on champions actually having time.
  • Paved roads + shift-left — central security ships golden modules, scaffolders, libraries. Product teams build on top. Highest leverage; works only when the paved road actually solves the team's problem.

The honest truth is that the highest-functioning organisations do all three: paved roads as default, champions in teams, and embedded engineers for the highest-risk components.

Putting It All Together — The Course Loop

A final lap through the course as a single coherent practice:

[Figure: the loop. Architect (model) → Build (paved road) → Run (deploy) → Detect (SIEM) → Respond (IR) → back to Architect.]
The full loop. Today's incidents become tomorrow's threat-model inputs become next quarter's paved-road improvements.
  • Architect — model the system, identify trust boundaries, run STRIDE-Cloud, choose controls. (Days 1, 2, 7.)
  • Build — paved roads with IAM modules, signed images, scoped credentials, IaC with policy-as-code. (Days 1, 5.)
  • Run — deploy with admission policy, mesh-enforced mTLS, default-deny networks. (Days 2, 3, 4.)
  • Detect — telemetry + Sigma rules + GuardDuty + on-call. (Day 6.)
  • Respond — runbooks, snapshots, communications, post-mortems. (Day 6.)
  • Loop — incident retro becomes threat-model input becomes paved-road improvement.
Quick check
A startup is preparing for SOC 2 Type II in nine months. They have ~50 engineers, multi-account AWS, GitHub Actions, EKS. The CEO asks where the time and money should go. Give three priorities and explain why each, in this order, gives the best return.
Answer
  1. Identity baseline. SSO with MFA, no IAM users for humans, IRSA/Workload Identity for services, and a quarterly access review automated via IAM Access Analyzer + Vanta/Drata. A large share of SOC 2 evidence requests land here, and removing long-lived human credentials eliminates most of the leaked-credential class.
  2. Centralized logging. Org-wide CloudTrail to a separate logging account with 13-month retention, GuardDuty enabled everywhere, and runbooks for the most common GuardDuty findings. A Type II audit needs evidence of detection across the whole observation period, and you cannot backfill logs after the fact.
  3. Change-management codification. Branch protection, required reviews, deploys through GitHub Actions with OIDC to AWS, and no production deploys from laptops. SOC 2 cares deeply about change management, and auditors readily accept GitHub history as evidence.
After these three, threat modeling, encryption hardening, and supply-chain investments follow naturally, but identity, logging, and change management are the inflection points.
Mnemonic — review hygiene
"Trust, Identity, Data, Network, Supply, Detect, Blast."
  • Trust — boundaries explicit?
  • Identity — short-lived, scoped?
  • Data — classified, encrypted with managed keys?
  • Network — default-deny + filtered egress?
  • Supply — signed + attested + admission-verified?
  • Detect — three signals + paged owner?
  • Blast — radius mapped, accepted or shrunk?
Flashcard
A team's threat model identifies the risk that a leaked CI access key could read all of S3. They write "mitigated by least-privilege IAM" and close the item. Where is the gap?
Answer
The mitigation is unverifiable. "Least-privilege IAM" is an aspiration; the threat model needs to point to specific controls and specific evidence: (a) the CI role uses GitHub OIDC, not a static key, so there is no key to leak; (b) the role's permissions policy is capped by a permissions boundary that does not allow s3:*; (c) IAM Access Analyzer findings are reviewed for external or unused access on the role; (d) a detection on CloudTrail events (or a GuardDuty finding) fires on unexpected cross-account assume-role. With those four, the risk is genuinely mitigated and provable. "Least-privilege IAM" alone is the kind of statement that closes a ticket and reopens an incident two years later.
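Point (a) is itself checkable evidence. As an illustrative sketch, the function below inspects a role's trust policy (as a parsed JSON dict) and confirms that every `token.actions.githubusercontent.com:sub` pattern is pinned to one repository. The condition key and the `repo:ORG/REPO:...` subject format are GitHub's documented OIDC conventions; the organisation and repository names, the account ID, and the helper names are placeholders.

```python
SUB_KEY = "token.actions.githubusercontent.com:sub"


def subject_patterns(trust_policy):
    """Collect every `sub` claim pattern on web-identity statements."""
    patterns = []
    for stmt in trust_policy.get("Statement", []):
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRoleWithWebIdentity" not in actions:
            continue
        for operator_block in stmt.get("Condition", {}).values():
            value = operator_block.get(SUB_KEY)
            if value is None:
                continue
            patterns.extend([value] if isinstance(value, str) else value)
    return patterns


def is_pinned_to_repo(trust_policy, repo):
    """True only if a sub condition exists and every pattern names `repo`."""
    patterns = subject_patterns(trust_policy)
    prefix = f"repo:{repo}:"
    return bool(patterns) and all(p.startswith(prefix) for p in patterns)


# Placeholder trust policy for a CI role.
CI_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": "arn:aws:iam::111122223333:oidc-provider/token.actions.githubusercontent.com"},
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringEquals": {"token.actions.githubusercontent.com:aud": "sts.amazonaws.com"},
            "StringLike": {SUB_KEY: "repo:example-org/example-repo:ref:refs/heads/main"},
        },
    }],
}

print(is_pinned_to_repo(CI_TRUST_POLICY, "example-org/example-repo"))
```

A trust policy with no `sub` condition, or one that uses an organisation-wide wildcard such as `repo:example-org/*`, fails the check, which is the gap the flashcard is pointing at.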
🔑
Course-end takeaways
1) Compliance regimes overlap heavily; six control families satisfy most of them. 2) Compliance as code turns once-a-year evidence collection into continuous assertion. 3) Threat modeling = data flow + trust boundaries + STRIDE-Cloud questions. 4) Architecture review succeeds when triggered by code, lightweight by default, and produces specific tracked follow-ups with owners. 5) The loop (architect, build, run, detect, respond, loop) is the operating model. None of the seven days stands alone; they are seven faces of one practice.
