Data Protection, KMS & Secrets Management

Encryption is one of the most-claimed and least-understood controls. Learn the envelope-encryption pattern, KMS and HSM internals, key hierarchies, BYOK/HYOK, secrets stores and rotation patterns — and why your real risk is access to the key, not the cipher.

What you will learn

01 Envelope Encryption — The Pattern Behind All Of It
02 Key Hierarchies and HSMs
03 Access Control on Keys
04 BYOK, HYOK, EKM — Customer-Controlled Keys
05 Encryption Patterns Where They Live
06 Secrets Management

If an attacker reaches your data tier, what stops them is not encryption — it is whether they can also use the key. Modern cloud encryption is operationally cheap and on by default. The interesting design questions live in the key plane: who can decrypt, under what conditions, with what audit trail, and what happens when a key is compromised. This chapter is about the engineering of that key plane.

🔑
Today's data-plane and key-plane
1) Envelope encryption — the canonical pattern in every cloud KMS. 2) Key hierarchies — root, KEK, DEK, plus how HSMs change the trust story. 3) Access control on keys — key policies, grants, conditions, and the dual-control patterns. 4) BYOK / HYOK / EKM — what "customer-managed keys" actually buys you. 5) Secrets management — Vault, Secrets Manager, External Secrets, SOPS — and rotation done right.

Envelope Encryption — The Pattern Behind All Of It

Every cloud KMS implements one core pattern. Don't encrypt the data with the master key directly — that key would have to handle huge volumes and any compromise loses everything. Instead, generate a fresh data encryption key (DEK) per object, encrypt the data locally with the DEK, then ask KMS to encrypt the DEK under a long-lived key encryption key (KEK / CMK). Store the encrypted DEK alongside the ciphertext. To read: ask KMS to decrypt the DEK, then decrypt the data locally.

[Figure: envelope encryption flow]
  • App: has plaintext to write.
  • KMS: GenerateDataKey(KEK) → {plain DEK, encrypted DEK}. The HSM root never leaves hardware.
  • Storage layer: { ciphertext = AES-GCM(plaintext, DEK), encrypted_DEK, key_id, nonce, ad }.
  • App: immediately zeroes the plain DEK from memory after the encrypt.
Envelope encryption. KMS performs only small (DEK-size) crypto and never sees your plaintext data. The HSM encrypts and decrypts the DEK; the bulk data is encrypted locally with AES-GCM.
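As a rough illustration of what the storage layer in the figure holds, here is a minimal sketch of the stored record. It performs no cryptography: the byte fields are assumed to come from a KMS call and an AEAD library. The class name `Envelope` and the base64-in-JSON encoding are assumptions made for this example, not a format any KMS mandates.

```python
import base64
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """What gets persisted. The plaintext DEK is deliberately not a field."""
    key_id: str           # identifies the KEK that wrapped the DEK
    encrypted_dek: bytes  # wrapped DEK as returned by the KMS
    nonce: bytes          # AEAD nonce used for the bulk encryption
    ad: bytes             # associated data that was authenticated
    ciphertext: bytes     # AEAD output over the plaintext

    def to_json(self) -> str:
        def b64(raw: bytes) -> str:
            return base64.b64encode(raw).decode("ascii")

        return json.dumps({
            "key_id": self.key_id,
            "encrypted_dek": b64(self.encrypted_dek),
            "nonce": b64(self.nonce),
            "ad": b64(self.ad),
            "ciphertext": b64(self.ciphertext),
        }, sort_keys=True)

    @classmethod
    def from_json(cls, blob: str) -> "Envelope":
        doc = json.loads(blob)
        return cls(
            key_id=doc["key_id"],
            encrypted_dek=base64.b64decode(doc["encrypted_dek"]),
            nonce=base64.b64decode(doc["nonce"]),
            ad=base64.b64decode(doc["ad"]),
            ciphertext=base64.b64decode(doc["ciphertext"]),
        )
```

Keeping `key_id` and `ad` next to the ciphertext is what lets a reader ask the right KEK to unwrap the DEK and supply the same associated data on decrypt.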

Why this matters operationally

  • Throughput. KMS only encrypts a 32-byte DEK — bulk data goes through your local AES-NI at line rate.
  • Audit granularity. Each Decrypt call to KMS is logged in CloudTrail with the caller, so you can attribute every key use back to a principal — far cheaper than logging every read of every record.
  • Rotation. Rotating the KEK does not require re-encrypting all the data. Old DEKs are still decryptable under the old KEK version; new writes use the new version. Re-encrypt-on-read or background re-wrap close the gap when needed.
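The background re-wrap mentioned in the rotation point can be sketched as follows. This is a simplified model under stated assumptions: `StoredObject` and `rewrap_stale` are hypothetical names, and the `rewrap` callable stands in for whatever KMS operation re-encrypts a wrapped DEK under the current KEK version.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StoredObject:
    object_id: str
    kek_version: int      # KEK version that wrapped this object's DEK
    encrypted_dek: bytes  # small wrapped DEK
    ciphertext: bytes     # bulk data, never touched by rotation


def rewrap_stale(objects: Iterable[StoredObject], current_version: int,
                 rewrap: Callable[[bytes], bytes]) -> int:
    """Re-wrap DEKs held under older KEK versions; return how many changed."""
    changed = 0
    for obj in objects:
        if obj.kek_version < current_version:
            # Only the wrapped DEK is replaced; the bulk ciphertext stays as-is.
            obj.encrypted_dek = rewrap(obj.encrypted_dek)
            obj.kek_version = current_version
            changed += 1
    return changed
```

The point of the sketch is the cost profile: rotation touches a few dozen bytes per object, never the data itself.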
💡
Use AEAD, always
The only acceptable bulk cipher modes in 2026 are AEAD (Authenticated Encryption with Associated Data): AES-256-GCM where hardware acceleration is available, ChaCha20-Poly1305 for software-only. Never raw CBC, never CTR without an HMAC, never ECB. Use vetted high-level libraries (AWS Encryption SDK, Tink, libsodium) rather than raw primitives: they generate nonces correctly, bind associated data, and avoid a whole class of low-level mistakes that has produced dozens of CVEs.

Key Hierarchies and HSMs

Every well-designed system has a small number of root keys protecting a larger number of operational keys. Cloud KMS instances are themselves backed by HSMs (Hardware Security Modules) — FIPS 140-2 / 140-3 Level 3 devices where the key material is generated and never exits the boundary in plaintext.

[Figure: key hierarchy, top to bottom]
  • HSM cluster root: hardware.
  • KMS root: provider-managed.
  • CMKs: multi-region, multi-tenant (1s–10s).
  • KEKs: per service / per env (10s–100s).
  • DEKs: millions, per record/object.
A practical hierarchy. Higher = fewer, longer-lived, more tightly controlled. Lower = many, short-lived, ephemeral.

HSM modes — managed vs dedicated

  • Multi-tenant managed KMS — AWS KMS, GCP Cloud KMS, Azure Key Vault standard. Backed by HSMs you don't see; lowest cost; shared across customers within strict isolation.
  • Single-tenant managed — AWS KMS Custom Key Stores backed by CloudHSM; Azure Key Vault Managed HSM; GCP Cloud HSM. Dedicated HSM partition, FIPS 140-2 L3, accessed via the same KMS API.
  • Self-managed CloudHSM — bare HSM cluster, you manage users and authn. Maximum control; you also operate the partition yourself. Used when key material must never touch a multi-tenant tier.

Multi-region keys

AWS multi-region keys, GCP multi-region keys, and Azure cross-region replication all let the same logical key exist in multiple regions, sharing the same key material via secure replication. Useful for active-active databases and DR — without it, every encrypted blob is locked to one region.

Access Control on Keys

Encryption is theatre if anyone with API access can decrypt. Three controls compose:

Key policy

Each KMS key carries a resource policy specifying who may use it for what. The default key policy for a customer-managed key allows the account root to administer it; you must explicitly add principals that may encrypt and decrypt. Separate principals for admin and use — admins set policy, never decrypt; apps decrypt, never administer.

json — KMS key policy: separation of duties
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Key admins set policy and never use the key",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/security-admin" },
      "Action": ["kms:Describe*", "kms:List*", "kms:Get*", "kms:PutKeyPolicy",
                  "kms:UpdateAlias", "kms:EnableKeyRotation", "kms:ScheduleKeyDeletion"],
      "Resource": "*"
    },
    {
      "Sid": "Key users encrypt and decrypt only via VPC endpoint and TLS",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/payments-api" },
      "Action": ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey*", "kms:DescribeKey"],
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:SourceVpce": "vpce-0abcd…" },
        "Bool":         { "aws:SecureTransport": "true" },
        "StringEqualsIfExists": { "kms:EncryptionContext:tenant": "${aws:PrincipalTag/tenant}" }
      }
    }
  ]
}

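Three serious controls in one policy: source VPC endpoint pinning (the call must originate inside our private network), TLS-only, and an encryption context binding so a principal tagged for tenant A cannot decrypt material encrypted with a tenant-B context. Note that StringEqualsIfExists only compares the context when one is supplied: a Decrypt with a missing context still fails because the context is part of the authenticated data, but an Encrypt without any context is allowed, so add a Null condition on the context key if every ciphertext must be bound. Encryption contexts are the most underused KMS feature; they turn key access into a tenant-scoped operation almost for free.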
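To make the interaction of the three conditions concrete, here is a deliberately simplified model of how such a Condition block could be evaluated against a request. It is not AWS's policy engine: it handles only the three operators used above, `condition_allows` is a hypothetical helper, and `vpce-example` is a placeholder endpoint ID.

```python
import re
from typing import Dict, Optional

CONDITION = {
    "StringEquals": {"aws:SourceVpce": "vpce-example"},  # placeholder endpoint ID
    "Bool": {"aws:SecureTransport": "true"},
    "StringEqualsIfExists": {
        "kms:EncryptionContext:tenant": "${aws:PrincipalTag/tenant}"
    },
}

_VARIABLE = re.compile(r"\$\{([^}]+)\}")


def _resolve(value: str, ctx: Dict[str, str]) -> Optional[str]:
    """Substitute ${key} policy variables; None if a variable is unavailable."""
    try:
        return _VARIABLE.sub(lambda m: ctx[m.group(1)], value)
    except KeyError:
        return None


def condition_allows(condition: dict, ctx: Dict[str, str]) -> bool:
    """Return True only if every check in the condition block passes."""
    for operator, checks in condition.items():
        for key, expected in checks.items():
            if key not in ctx:
                if operator.endswith("IfExists"):
                    continue  # an absent key passes an ...IfExists check
                return False
            wanted = _resolve(expected, ctx)
            if wanted is None:
                return False
            actual = ctx[key]
            if operator == "Bool":
                if actual.lower() != wanted.lower():
                    return False
            elif actual != wanted:
                return False
    return True
```

In this model a tenant-tagged principal passes with its own tenant context, fails with another tenant's context, and passes when no context is supplied at all, which is the IfExists behaviour worth knowing about before relying on the policy alone.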

Grants — short-lived delegation

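A grant is an alternative to editing the key policy: it delegates a narrow set of operations on a key to a specific principal, optionally constrained to an encryption context. Grants do not expire on their own, so retire or revoke them as soon as the work is done. They are useful when a service needs to hand a worker temporary use of a key without giving it long-term key permissions. CloudTrail logs grant creation, retirement and revocation, and every use of the key.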

Dual-control / quorum operations

For the most sensitive keys (master signing keys, code-signing roots), require multiple humans to approve any administrative change. AWS CloudHSM supports quorum (M-of-N) authentication for administrative operations; Azure Key Vault Managed HSM protects its security domain with an M-of-N key quorum; many on-prem PKI deployments use Shamir's Secret Sharing for the offline root.
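For readers who have not seen it, Shamir's Secret Sharing can be sketched in a few lines. This is an educational version under stated assumptions: the field prime (2**521 - 1) and the function names are choices made for this example, and a real offline-root ceremony would use audited tooling and hardware rather than this code.

```python
import secrets
from typing import List, Tuple

PRIME = 2**521 - 1  # Mersenne prime, large enough to hold a 256-bit key

Share = Tuple[int, int]  # (x, y) point on the secret polynomial


def split_secret(secret: int, threshold: int, shares: int) -> List[Share]:
    """Split `secret` so that any `threshold` of `shares` can rebuild it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for coeff in reversed(coeffs):  # Horner's rule
            acc = (acc * x + coeff) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(shares: List[Share]) -> int:
    """Lagrange-interpolate the polynomial at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With `split_secret(key, 3, 5)`, any three custodians can rebuild the key, while fewer than three shares reveal nothing about it.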

⚠️
CMK deletion is mostly irreversible
Once a CMK is deleted, nothing ever encrypted under it can be decrypted again. Deletion is only scheduled: the waiting period runs from 7 to 30 days (30 is the default) and the request can be cancelled until it expires, so prefer the longest window. Pair this with monitoring on kms:ScheduleKeyDeletion and an SCP that denies the action on keys tagged protected: true, so that deleting such a key has to go through a separate quorum-approved exception. Mistaken deletion is one of the few cloud actions that can cause silent, unrecoverable data loss.
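The monitoring half of that advice can be sketched as a filter over CloudTrail-style records. The record shape used here (`eventSource`, `eventName`, `userIdentity.arn`, `requestParameters.keyId`, `requestParameters.pendingWindowInDays`) is an assumption about the event layout, and `deletion_findings` is a hypothetical helper; check it against real events before relying on it.

```python
from typing import Iterable, List

PREFERRED_WINDOW_DAYS = 30  # assumption: local policy wants the longest window


def deletion_findings(events: Iterable[dict]) -> List[dict]:
    """Summarize every ScheduleKeyDeletion call found in the given records."""
    findings = []
    for event in events:
        if event.get("eventSource") != "kms.amazonaws.com":
            continue
        if event.get("eventName") != "ScheduleKeyDeletion":
            continue
        params = event.get("requestParameters") or {}
        # When the caller omits the window, this sketch assumes the 30-day default.
        window = params.get("pendingWindowInDays", 30)
        findings.append({
            "key_id": params.get("keyId"),
            "actor": (event.get("userIdentity") or {}).get("arn"),
            "window_days": window,
            "short_window": window < PREFERRED_WINDOW_DAYS,
        })
    return findings
```

Every finding deserves an alert; the `short_window` flag simply marks the ones that leave the least time to cancel.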

BYOK, HYOK, EKM — Customer-Controlled Keys

Three names for a similar idea, with different operational implications:

  • BYOK (Bring Your Own Key). You generate key material on your HSM and import it into the cloud KMS. Decryption still happens in the cloud's HSM; you control the source. Used for compliance regimes that require key generation outside the cloud.
  • HYOK (Hold Your Own Key). The key never enters the cloud. Cloud services call your on-prem HSM via an external interface. Operationally heavy and limits service support — only some services accept HYOK.
  • EKM (External Key Manager). The provider's pattern (AWS XKS / GCP EKM) — your HSM is queried for every key operation. The cloud cannot decrypt without your live participation; revoking access is real and instant.

Honest assessment: BYOK is mostly a compliance check-box. HYOK and EKM provide real residual control — at the cost of latency, availability dependence on your own HSM, and reduced service support. Choose based on the threat model your auditor is actually scared of.

Encryption Patterns Where They Live

At rest

Provider services (S3 SSE-KMS, RDS, EBS, GCS CMEK, Azure Blob CMK) handle disk-level encryption transparently. Always use customer-managed keys instead of provider-managed defaults: the access control and audit upgrade is meaningful even when the cipher is identical.

In transit

TLS everywhere: public, internal, even "trusted" networks. Require TLS 1.3, or at minimum TLS 1.2 with strong cipher suites. Service meshes (Day 2 PM) enforce mTLS uniformly. For database connections, prefer PrivateLink plus the database's native TLS over IPsec or VPN tunnels; there are fewer moving parts.

In use — confidential computing

Memory-resident plaintext is the new at-rest. Confidential VMs (AMD SEV-SNP, Intel TDX, AWS Nitro Enclaves, GCP Confidential VMs, Azure Confidential Computing) encrypt VM memory and provide an attestation channel — the workload can prove what code is running before it accepts decryption keys. Real use cases today: multi-party data analytics where the operator must not see plaintext, and key-management services that run inside enclaves so even the cloud operator cannot read the key.

Application-layer encryption

For especially sensitive fields (payment instruments, health identifiers), encrypt in the application before the data hits the database, using a library such as AWS Encryption SDK, Google Tink, libsodium, or age. The DB sees ciphertext only; admins, DBAs, and BI tools cannot read the field without the app's role.

Secrets Management

A secret is anything an attacker would prize: API tokens, database passwords, OAuth refresh tokens, signing keys, webhook secrets. Secrets in Git repositories or environment files are the single most common cause of leaked-credential incidents. The fix has three parts.

1. A central store

  • HashiCorp Vault — best feature set, runs everywhere, supports dynamic secrets (mints DB credentials per session) and PKI. Operationally heavy.
  • AWS Secrets Manager / SSM Parameter Store — simpler, KMS-backed, native rotation Lambdas. Most AWS estates use this.
  • GCP Secret Manager / Azure Key Vault — equivalents.
  • External Secrets Operator in Kubernetes — bridges any of the above into K8s as native Secrets, kept in sync.

2. Never let a static secret reach the workload

For database passwords, prefer cloud IAM authentication (RDS IAM auth, Cloud SQL IAM auth, Postgres-OIDC) — short-lived tokens minted from the workload identity. For external APIs, mount via the Secrets Store CSI driver so the secret is read into a tmpfs at start and never persisted on disk.

3. Rotate, automatically and visibly

The right rotation cadence depends on the secret. Database passwords: 30-90 days, automated via a Lambda or Vault rotation backend. API tokens to external SaaS: as often as the SaaS allows; many don't allow zero-downtime rotation, so design dual-key acceptance windows. Signing keys: every 12 months with overlap. OIDC client secrets: annually, validated by certificate transparency-style monitoring.
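A dual-key acceptance window is easiest to see with a signed webhook. In this minimal sketch the verifier accepts a signature made with either the new or the previous secret during the overlap; the function names and the HMAC-SHA256 hex format are assumptions for the example, not any particular SaaS provider's scheme.

```python
import hashlib
import hmac
from typing import Sequence


def sign(secret: bytes, payload: bytes) -> str:
    """Hex HMAC-SHA256 signature over the payload."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature: str, accepted_secrets: Sequence[bytes]) -> bool:
    """Accept a signature produced with any currently accepted secret."""
    valid = False
    for secret in accepted_secrets:
        expected = sign(secret, payload).encode("ascii")
        # compare_digest avoids leaking the match position through timing.
        if hmac.compare_digest(expected, signature.encode("utf-8")):
            valid = True
    return valid
```

During rotation the verifier is configured with `[new, old]`; once the sender has switched, the list shrinks to `[new]` and the old secret stops working.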

💡
SOPS for the secrets that have to live in Git
Some secrets (Helm chart values, Terraform variable files, CI configs) naturally want to live next to the code they configure. SOPS (originally a Mozilla project) lets you encrypt only the values (not the YAML keys) using KMS, age, or PGP. Reviewers see structure; only authorized principals decrypt. Combined with git-crypt for whole-file cases and Sealed Secrets for K8s, the in-Git secrets problem is largely solved.

Tokenization vs Encryption

For payment cards, government IDs, and medical record numbers, the typical right answer is not encryption but tokenization: the value is replaced with an opaque, format-preserving token; the original lives only in a hardened vault accessed by a tiny scope of code (the payment service, the identity service). Most application code only ever sees the token, so the data plane simply does not contain regulated data.

Tokenization shrinks PCI/HIPAA scope dramatically — auditors care about systems that handle the actual data, not those that handle tokens. Stripe, Adyen, and most cloud-payment providers tokenize at the edge. For internal cases, VGS, Skyflow, Privacera, and home-rolled vaults are common.
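The shape of a token vault can be sketched with an in-memory toy. Everything here is an assumption made for illustration: the class name `TokenVault`, keeping the last four digits, and the dictionary storage. A real vault is a hardened, access-controlled, audited service, not a Python dict.

```python
import secrets
from typing import Dict


class TokenVault:
    """Toy vault: swaps a card number for a random same-length digit token."""

    def __init__(self) -> None:
        self._value_by_token: Dict[str, str] = {}
        self._token_by_value: Dict[str, str] = {}

    def tokenize(self, pan: str) -> str:
        if not (pan.isascii() and pan.isdigit() and len(pan) >= 8):
            raise ValueError("expected at least 8 ASCII digits")
        existing = self._token_by_value.get(pan)
        if existing is not None:
            return existing  # same value always maps to the same token
        while True:
            # Random digits, keeping the last four so receipts still make sense.
            body = "".join(secrets.choice("0123456789") for _ in range(len(pan) - 4))
            token = body + pan[-4:]
            if token != pan and token not in self._value_by_token:
                break
        self._value_by_token[token] = pan
        self._token_by_value[pan] = token
        return token

    def detokenize(self, token: str) -> str:
        """Only the small in-scope service should ever be able to call this."""
        return self._value_by_token[token]
```

Because the token is random rather than derived from the card number, nothing outside the vault can reverse it, which is why systems that hold only tokens fall out of scope.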

Quick check
A team encrypts a database with a customer-managed KMS key. The key policy allows the application role and the DBA role to kms:Decrypt. Six months later a credential leak gives the attacker the DBA role's credentials. They can SELECT * from the database, see plaintext, and exfiltrate. What was the encryption actually protecting against, and what control was missing?
Answer
Encryption protected against three threats: a stolen disk image (the EBS snapshot is unreadable without the key), accidental third-party access to the storage backend, and provider-side accidental exposure. It did not protect against an authorized-principal compromise — and was never going to. The missing controls are application-layer field encryption (sensitive fields encrypted with a key the DBA does not have), row-level access control at the application boundary, and egress controls / data exfiltration alarms on bulk SELECTs. Encryption at rest is necessary but rarely sufficient; choose the layer that matches the threat.

Cryptographic Agility & Post-Quantum

NIST has now standardized post-quantum algorithms — ML-KEM (FIPS 203, formerly Kyber) for key establishment and ML-DSA (FIPS 204, formerly Dilithium) for signatures. The threat model is "harvest now, decrypt later": attackers store today's TLS captures and decrypt them when CRQCs (cryptographically relevant quantum computers) arrive. For long-lived data, this matters today. Hybrid TLS 1.3 (X25519+ML-KEM) is rolling out in browsers and major CDN/ALB stacks; AWS KMS, GCP and Azure are publishing PQ migration paths. Crypto agility — the ability to swap algorithms without rewriting applications — is suddenly a real engineering requirement. SDKs like AWS Encryption SDK and Tink wrap algorithm choice in algorithm suites with versioned identifiers; aim to use these abstractions rather than hard-coding cipher names.
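The idea of versioned suite identifiers can be shown with a small stand-in. This sketch uses message authentication rather than encryption, purely because it needs only the standard library; the suite registry, the one-byte header, and the function names are assumptions for the example and do not mirror the AWS Encryption SDK or Tink formats.

```python
import hashlib
import hmac

# Hypothetical registry: suite ID -> hash used for the MAC.
SUITES = {
    1: hashlib.sha256,
    2: hashlib.sha3_256,
}
CURRENT_SUITE = 2  # new output uses this; older IDs remain verifiable


def protect(key: bytes, message: bytes, suite_id: int = CURRENT_SUITE) -> bytes:
    """Return header byte + tag + message, with the suite ID covered by the tag."""
    header = bytes([suite_id])
    tag = hmac.new(key, header + message, SUITES[suite_id]).digest()
    return header + tag + message


def verify(key: bytes, blob: bytes) -> bytes:
    """Dispatch on the stored suite ID, check the tag, return the message."""
    if not blob:
        raise ValueError("empty blob")
    suite_id = blob[0]
    digestmod = SUITES.get(suite_id)
    if digestmod is None:
        raise ValueError("unknown suite")
    size = digestmod().digest_size
    tag, message = blob[1:1 + size], blob[1 + size:]
    expected = hmac.new(key, blob[:1] + message, digestmod).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    return message
```

Callers never name an algorithm: moving to a new suite means adding a registry entry and changing `CURRENT_SUITE`, while data written under older IDs still verifies.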

Mnemonic — KMS hygiene
"Envelope, Hierarchy, Context, Audit, Rotate."
  • Envelope-encrypt, never bulk-encrypt with the master.
  • Hierarchy: HSM root → CMK → KEK → DEK.
  • Context: bind every encrypt to tenant / purpose.
  • Audit: every Decrypt logged with caller.
  • Rotate: keys, secrets, and dual-control approvers.
Flashcard
An attacker gains kms:Decrypt on a CMK and tries to decrypt an arbitrary ciphertext. The decryption fails. The CMK supports the algorithm; the policy allows the action. What is the most likely cause?
Answer
An encryption context mismatch. The original encrypt bound an encryption context (e.g. {tenant: "acme"}) and the policy or the SDK requires the same context on decrypt. The attacker is using a different (or empty) context, so KMS rejects the operation as not matching the AAD. This is one of the few "low-effort, high-impact" controls in KMS — and a reason to always use encryption contexts in application code, not just rely on key policy.
🔑
Key takeaways
1) Envelope encryption with AEAD is the universal pattern; KMS encrypts DEKs, your code encrypts data.
2) Key hierarchies + HSM-backed roots concentrate trust where it can be defended.
3) Key policy + encryption context + audit is the real least-privilege story for keys; encrypted ≠ safe if everyone can decrypt.
4) BYOK is mostly a compliance check-box; HYOK/EKM are real residual control at real cost.
5) Secrets stores + rotation + workload identity remove most of the leaked-credential class.
6) Tokenization shrinks regulated-data blast radius further than any encryption can.
7) Plan for crypto agility: post-quantum migration is starting now.
