DAY 1 · PM

IAM Deep Dive — Principals, Policies & STS

11 min · Advanced · 2,462 words
If misconfiguration dominates cloud breaches, IAM is the field where it lives. Master principals, policy evaluation, AssumeRole and STS, conditions and ABAC, federation, and the patterns that make least privilege survivable in production.

What you will learn

01 Principals, Resources, Actions, Conditions
02 Policy Evaluation — Deny Wins
03 STS, AssumeRole and the Short-Lived-Credential Loop
04 Conditions and ABAC — Tags as Policy
05 Federation — Where Identity Comes From
06 Privilege Patterns That Hold Up at Scale

Identity and Access Management is where most cloud security wins — and most cloud breaches — happen. The provider gives you a powerful, declarative engine for saying who can do what to which resources, under what conditions. Most teams use perhaps five percent of it and pay for the other ninety-five percent in incident reports. This chapter is a deep dive into how the engine actually evaluates a request, the credential lifecycle behind every API call, and the patterns that hold up at scale.

🔑
Today's deep cuts
1) The policy evaluation order — explicit deny > allow > implicit deny — and where boundaries fit. 2) STS, AssumeRole, and short-lived credentials — the path from a long-lived secret to a 15-minute token. 3) Conditions, ABAC and tags — how to scope access to the actual resource without writing one role per resource. 4) Federation — OIDC for CI, SAML for humans, IRSA / Workload Identity for pods. 5) Privilege patterns — break-glass, permission boundaries, SCPs and just-in-time access.

Principals, Resources, Actions, Conditions

Strip the JSON away and every cloud's IAM language is the same triple-plus-context: a principal (the caller — user, role, workload identity) requests an action (s3:GetObject, kms:Decrypt) on a resource (the ARN of the bucket key, the URI of the storage object) under some conditions (source IP, tag match, TLS version, MFA presence). Policies are documents that match against this tuple and produce an Allow, a Deny, or no opinion.

[Figure: Principal (role / user) → Action (s3:GetObject) → Resource (arn:aws:s3:::reports/*) → Conditions (tag, IP, MFA…). Every API call is a 4-tuple. Every policy decides on one tuple.]
The same shape appears in AWS IAM, GCP IAM, and Azure RBAC — only the noun mapping differs.
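To make the tuple concrete, here is a minimal sketch of a request and a policy statement as plain Python data. The class names, the `opinion` helper, and the flat `conditions` mapping are illustrative assumptions, not any provider's API. Wildcards are approximated with `fnmatch`, which is case-sensitive and also understands `[...]` sets, unlike real IAM matching.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Request:
    principal: str   # who is calling
    action: str      # e.g. "s3:GetObject"
    resource: str    # e.g. an ARN
    context: dict = field(default_factory=dict)  # condition inputs


@dataclass
class Statement:
    effect: str      # "Allow" or "Deny"
    actions: list    # action patterns, "*" wildcards allowed
    resources: list  # resource patterns, "*" wildcards allowed
    conditions: dict = field(default_factory=dict)  # context key -> required value

    def matches(self, req: Request) -> bool:
        # A statement applies only if action, resource and every condition match.
        return (
            any(fnmatchcase(req.action, p) for p in self.actions)
            and any(fnmatchcase(req.resource, p) for p in self.resources)
            and all(req.context.get(k) == v for k, v in self.conditions.items())
        )


def opinion(stmt: Statement, req: Request):
    """Return "Allow", "Deny", or None when the statement has no opinion."""
    return stmt.effect if stmt.matches(req) else None
```

A statement that does not match returns `None` rather than a denial, which is the "no opinion" outcome described above.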

The provider noun map

Concept | AWS | GCP | Azure
Principal type | User, Role, Service-linked role | User, Service Account, Group | User, Service Principal, Managed Identity
Workload identity | EC2 instance role / IRSA / Pod identity | Workload Identity Federation | Managed Identity
Permission unit | Action (e.g. s3:GetObject) | Permission (storage.objects.get) | Action (Microsoft.Storage/.../read)
Bundle | Policy | Role (predefined / custom) | Role definition
Attachment | Identity / resource policy | Role binding (on resource hierarchy) | Role assignment (on scope)
Org-wide guardrail | SCPs (Organizations) | Org Policies / VPC-SC | Azure Policy / Management Groups
Per-identity ceiling | Permission boundary | (via deny policies) | (via deny assignments)

The vocabulary differs but the algebra is the same. We will use AWS in the examples because its policy language is the most explicit; the patterns translate directly.

Policy Evaluation — Deny Wins

Two principles cover ninety percent of "why is this not working" debugging. An explicit Deny anywhere wins. The default is implicit deny. Everything else is sequencing: AWS evaluates organizational SCPs first (do the org guardrails even allow this action?), then identity policies and resource policies, then any permission boundary, then session policies on STS-assumed roles. If at any point a Deny matches, you stop; otherwise you need an explicit Allow somewhere to succeed.

[Figure: evaluation layers. SCP / Org policy (does the org allow it?) → Resource policy (cross-account / public) → Permission boundary (max for the principal) → Identity policy (user / role grants) → Session policy (STS narrowing) → Conditions (tag / IP / MFA…). Allow only if all pass; an explicit Deny short-circuits.]
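AWS request authorization. Every layer can deny independently. Allows must line up across the guardrail layers (SCP, permission boundary, session policy); within a single account an Allow from either the identity policy or the resource policy is enough, while cross-account access needs both.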
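The two-pass shape of that evaluation can be sketched in a few lines. This is a simplification under stated assumptions: each layer is modelled as a hypothetical function returning "Allow", "Deny", or `None`, and every supplied layer is treated as one that must allow, which fits SCPs, boundaries, identity and session policies but glosses over the same-account resource-policy rules.

```python
from typing import Callable, Dict, Optional, Tuple

Decision = Optional[str]            # "Allow", "Deny", or None for "no opinion"
Layer = Callable[[dict], Decision]  # hypothetical: one function per policy layer


def authorize(request: dict, layers: Dict[str, Layer]) -> Tuple[str, str]:
    decisions = {name: layer(request) for name, layer in layers.items()}

    # Pass 1: an explicit Deny in any layer wins immediately.
    for name, decision in decisions.items():
        if decision == "Deny":
            return ("Deny", f"explicit deny in {name}")

    # Pass 2: a layer with no Allow leaves the request implicitly denied.
    for name, decision in decisions.items():
        if decision != "Allow":
            return ("Deny", f"implicit deny: no allow in {name}")

    return ("Allow", "all layers allow")


# Illustrative layers; the names and rules are made up for this sketch.
layers = {
    "scp": lambda r: "Deny" if r["region"] not in {"us-east-1", "us-east-2"} else "Allow",
    "identity": lambda r: "Allow" if r["action"].startswith("s3:") else None,
    "boundary": lambda r: "Allow" if r["action"] == "s3:GetObject" else None,
}
```

The returned reason string is the useful part when debugging: it says whether a request failed on an explicit Deny or simply never found an Allow in some layer.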

The intersection rule for boundaries

A permission boundary is the maximum a principal can ever be granted, regardless of its identity policy. Boundaries are commonly used in delegated account models: platform engineers want app teams to create their own roles, but only within a ceiling. The effective permissions are the intersection of identity policy and boundary. SCPs work the same way at the org level — they cap an account's possible API surface.

json — service control policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideHomeRegion",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": { "aws:RequestedRegion": ["us-east-1", "us-east-2"] },
        "BoolIfExists":    { "aws:ViaAWSService":   "false" }
      }
    },
    {
      "Sid": "DenyRootUser",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": { "StringLike": { "aws:PrincipalArn": "arn:aws:iam::*:root" } }
    }
  ]
}

This is a real-world starter SCP: pin your accounts to allowed regions (huge blast-radius win) and forbid the root user from doing anything. It runs above identity policies, so even an unbounded admin role cannot create a resource in eu-central-1.
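The first statement's Condition block combines two operators, and both must hold for the Deny to apply. The sketch below mimics that logic in plain Python to show when the region Deny fires; the function name and the dictionary-shaped request context are assumptions for illustration, and condition values are kept as the strings IAM would compare.

```python
ALLOWED_REGIONS = {"us-east-1", "us-east-2"}


def region_deny_matches(context: dict) -> bool:
    """Mimic the DenyOutsideHomeRegion condition block (operators are ANDed)."""
    # StringNotEquals with several values: true only if the region equals none of them.
    outside_home = context.get("aws:RequestedRegion") not in ALLOWED_REGIONS

    # BoolIfExists: an absent key counts as a match; otherwise compare to "false".
    via_service = context.get("aws:ViaAWSService")
    not_via_service = True if via_service is None else via_service == "false"

    return outside_home and not_via_service
```

The `aws:ViaAWSService` carve-out means a call that an AWS service makes on the principal's behalf is not caught by this statement, even when it lands outside the home regions.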

⚠️
SCPs do not grant permissions
A common confusion: an allow-everything SCP, such as the default FullAWSAccess, gives no one any access. SCPs only set the ceiling. You still need identity policies underneath. Allowing a service in an SCP only unblocks it; you must still attach an identity policy that grants s3:GetObject for the principal to actually use S3.

STS, AssumeRole and the Short-Lived-Credential Loop

Long-lived AKIA… access keys are the single biggest source of leaked-credential incidents in cloud breaches. Public repos, CI logs, browser dev-tools, screen shares — they leak. The cure is to treat IAM users as starting points and roles as the actual operating identities. Every workload, CI job, federated user, and most humans should operate from short-lived STS credentials.

[Figure: Caller (user / EC2 / OIDC; long-lived credentials or none) → AssumeRole → STS (checks trust policy + conditions) → temp creds → Role session (15 min – 12 hr, ASIA…). Trust policy says who; permission policy says what; session policy narrows further.]
A trust policy on the role decides who may assume it. Every successful AssumeRole is logged to CloudTrail with the source identity.
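As a rough sketch of that loop from the caller's side, the helper below builds the arguments for the STS `AssumeRole` call and a second function sends them with boto3. The function names are hypothetical, the 15 minute to 12 hour bounds are the ones quoted in the figure above, and the role's own maximum session duration may be lower.

```python
def build_assume_role_args(role_arn: str, session_name: str, duration_seconds: int = 900) -> dict:
    """Validate and assemble keyword arguments for sts.assume_role."""
    # 900 s (15 min) to 43200 s (12 hr); the role's configured maximum may cap this further.
    if not 900 <= duration_seconds <= 43200:
        raise ValueError("duration_seconds must be between 900 and 43200")
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,   # appears in CloudTrail for attribution
        "DurationSeconds": duration_seconds,
    }


def assume(role_arn: str, session_name: str, duration_seconds: int = 900) -> dict:
    """Exchange the caller's current identity for temporary role credentials."""
    import boto3  # imported lazily so the helper above works without the SDK installed

    sts = boto3.client("sts")
    response = sts.assume_role(**build_assume_role_args(role_arn, session_name, duration_seconds))
    # Credentials holds AccessKeyId, SecretAccessKey, SessionToken and Expiration.
    return response["Credentials"]
```

Defaulting to the shortest duration keeps a leaked token useful for minutes rather than hours; callers that need longer sessions have to ask for them explicitly.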

The two policies on every role

A role has two policy documents, and they are easy to confuse. The trust policy defines who can assume the role: its Principal can be a service (ec2.amazonaws.com), another account, an OIDC issuer, or an individual IAM user or role. The permission policies attached to the role define what the resulting session can do once assumed. The trust policy is the door; the permission policy sets what you can do once inside the room.

json — trust policy for a github-actions role
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub":
          "repo:my-org/payments-service:ref:refs/heads/main"
      }
    }
  }]
}

Three things are doing the work in that Condition block. The repo binds to one repository; an attacker with creds in another repo cannot use this role. The ref binds to main; pull-request workflows from forks cannot assume it. And the audience claim binds to sts.amazonaws.com; tokens minted for any other audience (e.g. accidentally configured to GitHub itself) cannot be replayed. Always include all three. The Microsoft / Sysdig research on misconfigured GitHub-OIDC trust policies (2023) found tens of thousands of public examples leaving sub as *.
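A small sketch shows why a wildcard sub is dangerous. The `trust_allows` function is a hypothetical stand-in for the check STS performs on the token's claims, with `fnmatch` approximating StringLike wildcards.

```python
from fnmatch import fnmatchcase


def trust_allows(claims: dict, expected_aud: str, sub_pattern: str) -> bool:
    """Approximate the trust policy: exact audience match plus a sub pattern match."""
    return claims.get("aud") == expected_aud and fnmatchcase(claims.get("sub", ""), sub_pattern)


PINNED_SUB = "repo:my-org/payments-service:ref:refs/heads/main"

# Token from the intended repository and branch.
ours = {"aud": "sts.amazonaws.com", "sub": PINNED_SUB}
# Token from an unrelated repository (hypothetical name) using the same issuer.
theirs = {"aud": "sts.amazonaws.com", "sub": "repo:someone-else/any-repo:ref:refs/heads/main"}

pinned_ok = trust_allows(ours, "sts.amazonaws.com", PINNED_SUB)          # intended caller passes
pinned_blocks = trust_allows(theirs, "sts.amazonaws.com", PINNED_SUB)    # other repo is rejected
wildcard_leaks = trust_allows(theirs, "sts.amazonaws.com", "*")          # "*" admits any repo
```

With the pinned pattern only the intended workflow passes. With `*`, any repository on the same issuer that requests the right audience passes too.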

Session policies — narrowing on assume

When you call AssumeRole you can pass an optional session policy. The resulting credentials' permissions are the intersection of the role's permission policy and your session policy. This is the standard way to mint a credential that is narrower than the role itself — useful for handing tenant-scoped tokens to per-customer worker invocations without creating one role per customer.

Quick check
A pod uses IRSA to assume RoleA, which has s3:* on bucket X. The pod calls AssumeRole with a session policy of s3:GetObject on X/reports/*. Can the resulting credentials write to X/reports/audit.csv?
Answer
No. The session credentials' effective permissions are the intersection of the role permission policy (s3:*) and the session policy (s3:GetObject) — that intersection is just s3:GetObject. PutObject is missing, so the write fails. This is the canonical pattern for safely handing scoped credentials to less-trusted code.
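The quick check can be replayed as code. This is a sketch only: policies are reduced to hypothetical (action pattern, resource pattern) pairs, the bucket is written as the placeholder X from the question, and `fnmatch` stands in for IAM wildcard matching.

```python
from fnmatch import fnmatchcase


def permits(policy, action: str, resource: str) -> bool:
    """True if any (action pattern, resource pattern) pair covers the request."""
    return any(fnmatchcase(action, a) and fnmatchcase(resource, r) for a, r in policy)


role_policy = [("s3:*", "arn:aws:s3:::X/*")]                       # what RoleA grants
session_policy = [("s3:GetObject", "arn:aws:s3:::X/reports/*")]    # passed on AssumeRole


def session_permits(action: str, resource: str) -> bool:
    # Effective permissions are the intersection: both policies must cover the request.
    return permits(role_policy, action, resource) and permits(session_policy, action, resource)
```

Note that the intersection narrows the resource as well as the action: a read of an object outside the reports/ prefix fails even though the role itself would allow it.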

Conditions and ABAC — Tags as Policy

Pure RBAC creates one role per resource scope and explodes by Cartesian product. The dominant scaling pattern is ABAC — Attribute-Based Access Control — where the same role's policy uses tags or attributes to constrain which resources it touches.

json — tag-based ABAC for prod reads
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": "*",
    "Condition": {
      "StringEquals": {
        "aws:ResourceTag/Env":  "${aws:PrincipalTag/Env}",
        "aws:ResourceTag/Team": "${aws:PrincipalTag/Team}"
      }
    }
  }]
}

One policy. Every principal tagged Team=payments,Env=prod can read every bucket tagged the same way; nothing else. Add a new bucket and tag it correctly and access works without a policy change. The trade-off: tags become the security boundary, so a stale or forgotten tag is now a vulnerability, and so is any principal that can edit tags. Tag-based authorization support also varies by service and resource type, so confirm it for each service you rely on. Pair ABAC with mandatory-tag SCPs and a periodic tag-drift audit.
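The variable substitution in that policy can be sketched as follows. The function names are hypothetical, and the rule that an unresolvable variable makes the condition fail is modelled explicitly so that an untagged principal never matches an untagged resource.

```python
import re

_PRINCIPAL_TAG = re.compile(r"\$\{aws:PrincipalTag/([^}]+)\}")


def resolve(template: str, principal_tags: dict):
    """Substitute ${aws:PrincipalTag/Key}; return None if any tag is missing."""
    missing = False

    def substitute(match):
        nonlocal missing
        value = principal_tags.get(match.group(1))
        if value is None:
            missing = True
            return ""
        return value

    resolved = _PRINCIPAL_TAG.sub(substitute, template)
    return None if missing else resolved


def abac_allows(condition: dict, principal_tags: dict, resource_tags: dict) -> bool:
    """Every aws:ResourceTag/<Key> entry must equal the resolved principal tag."""
    for key, template in condition.items():
        tag_key = key.split("/", 1)[1]          # "aws:ResourceTag/Env" -> "Env"
        expected = resolve(template, principal_tags)
        if expected is None or resource_tags.get(tag_key) != expected:
            return False
    return True


CONDITION = {
    "aws:ResourceTag/Env": "${aws:PrincipalTag/Env}",
    "aws:ResourceTag/Team": "${aws:PrincipalTag/Team}",
}
```

Because the comparison is driven entirely by tags, the same `CONDITION` serves every team and environment; only the tag values on principals and resources change.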

The condition keys you will use most

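  • aws:SourceIp — restrict to office / VPN / CI runner IPs (avoid as your only control; anyone egressing through an allowed address, such as a shared NAT or VPN, passes the check, and the key is absent for requests that arrive through a VPC endpoint).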
  • aws:SourceVpce / aws:SourceVpc — only via specific VPC endpoints. Crucial for restricting S3 to private connectivity.
  • aws:MultiFactorAuthPresent / aws:MultiFactorAuthAge — gate destructive actions on fresh MFA.
  • aws:SecureTransport — only over TLS. Pair with a bucket policy that denies otherwise.
  • aws:CalledVia — distinguish direct calls from service-to-service ones.
  • aws:ResourceTag/*, aws:RequestTag/*, aws:PrincipalTag/* — the ABAC backbone.
💡
The hidden trick of aws:PrincipalOrgID
Resource policies for cross-account use should constrain aws:PrincipalOrgID to your AWS Organizations org ID, not specific account IDs. New accounts join the org without a policy update; rogue identities outside the org are blocked even if they have the role ARN. This is a quiet superpower for SaaS-style internal platforms.
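One way to combine aws:PrincipalOrgID with aws:SecureTransport is a bucket policy with two Deny statements, generated here from Python so it can be templated per bucket. The function name, Sids, bucket name and org ID are placeholders. The aws:PrincipalIsAWSService exemption is an addition beyond the text above, included so that AWS services writing to the bucket are not caught by the org check.

```python
import json


def org_only_bucket_policy(bucket: str, org_id: str) -> dict:
    """Build a bucket policy that denies callers outside the org and non-TLS requests."""
    arns = [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideOrg",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": arns,
                "Condition": {
                    # Caller's account is not in our organization...
                    "StringNotEquals": {"aws:PrincipalOrgID": org_id},
                    # ...and the caller is not an AWS service principal.
                    "BoolIfExists": {"aws:PrincipalIsAWSService": "false"},
                },
            },
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": arns,
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }


# Placeholder values; substitute your own bucket name and organization ID.
policy_json = json.dumps(org_only_bucket_policy("reports", "o-exampleorgid"), indent=2)
```

These statements only deny; access inside the org still needs an Allow from an identity policy or an additional Allow statement in the bucket policy.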

Federation — Where Identity Comes From

Three federation flavours cover almost every real principal. Get them right and you may never create another long-lived IAM user.

Humans — SAML or OIDC SSO

Humans authenticate to your IdP (Okta, Entra, Google Workspace, Auth0) and the IdP brokers a SAML or OIDC assertion to AWS IAM Identity Center / GCP Cloud Identity / Azure AD. The principal in CloudTrail is the federated user, with attributes (groups, department) that drive permission set selection. No human should have a long-lived IAM user.

CI — OIDC, never static keys

GitHub Actions, GitLab CI, CircleCI, Buildkite all expose OIDC tokens. Configure the cloud provider's OIDC trust to those issuers and use the sub / workflow / environment claims as the trust gate. The result: zero deploy-time secrets, full audit trail of which workflow assumed which role.

Workloads — IRSA, Workload Identity, Managed Identity

Pods on EKS use IAM Roles for Service Accounts (IRSA): each Kubernetes ServiceAccount maps to an IAM role through an OIDC trust to the cluster's issuer. GKE has Workload Identity Federation; AKS has Pod-managed Identities / Workload Identity. The pattern is the same in every direction: the workload presents a signed identity token (often a Kubernetes-issued JWT) and the cloud's STS exchanges it for short-lived API credentials. We will see this end-to-end on Day 3.
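When a token exchange fails, the fastest diagnostic is usually to look at the claims the workload is actually presenting and compare them with the trust policy. The sketch below decodes a JWT payload without verifying the signature, so it is suitable for debugging only and never for making an authorization decision. The function name is an assumption.

```python
import base64
import json


def unverified_claims(jwt: str) -> dict:
    """Decode the payload segment of a JWT. Does NOT verify the signature."""
    parts = jwt.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    # JWTs strip base64url padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```

For a Kubernetes service account token the `sub` claim has the form `system:serviceaccount:<namespace>:<name>`, which is the value an IRSA trust policy condition is matched against.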

Mnemonic — IAM evaluation
"SCP, Resource, Boundary, Identity, Session, Conditions — any Deny stops the chain."
  • Org guardrails before identity grants.
  • Boundaries cap; Identity grants; Session narrows.
  • Conditions are the real least-privilege lever.

Privilege Patterns That Hold Up at Scale

Permission boundaries for delegated admin

Platform engineers pre-create a permission boundary that defines what any team-created role can ever do ("no IAM mutations on the org's protected paths, no iam:* wildcards"). Developers self-serve roles within this ceiling. Result: app teams move fast without waiting on platform; platform retains an enforceable cap.

Break-glass roles

One or two named break-glass roles per account, with full admin and trust restricted to a small, audited set of humans, MFA-required, and assumption alerted to PagerDuty. Day-to-day work uses scoped roles. Break-glass is for true emergencies, and every assumption triggers a review next business day.

Just-in-time elevation

Tools like AWS IAM Identity Center session policies, GCP Privileged Access Manager, ConductorOne, Sym, or in-house tooling let an engineer request elevated access for a specific time-bounded reason. The role exists; the human cannot use it without a request and approval. Removes the entire "I'm an admin because I might need to be" class of standing risk.

Tag-on-create and resource control policies

Use aws:RequestTag and aws:TagKeys to require that a principal tags the resource appropriately at creation time, then key further policies off those tags. Untagged creates fail. AWS's newer Resource Control Policies (RCPs) give you SCP-style guardrails on the resources themselves, handy when you need to block access to S3 buckets from outside the organisation.
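The tag-on-create rule amounts to two checks, sketched here in plain Python. The function and parameter names are hypothetical: `required` plays the part of aws:RequestTag conditions and `allowed_keys` the part of an aws:TagKeys condition.

```python
def create_allowed(request_tags: dict, required: dict, allowed_keys: set) -> bool:
    """Decide whether a create request carries acceptable tags.

    required:     tag key -> set of permitted values (the aws:RequestTag checks)
    allowed_keys: every tag key the caller may set (the aws:TagKeys check)
    """
    # Each required key must be present with one of its permitted values.
    for key, permitted in required.items():
        if request_tags.get(key) not in permitted:
            return False
    # No tag keys outside the allowed set may be supplied.
    return set(request_tags) <= set(allowed_keys)


# Illustrative values only.
REQUIRED = {"Env": {"prod", "dev"}, "Team": {"payments"}}
ALLOWED_KEYS = {"Env", "Team", "CostCenter"}
```

Restricting the key set as well as the values stops a caller from attaching an extra tag that some other ABAC policy trusts.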

🚨
Privilege escalation patterns to watch
A handful of permissions are privilege escalation primitives on their own. Some classics: iam:CreateAccessKey on another user; iam:PassRole + lambda:CreateFunction (run code as the passed role); iam:UpdateAssumeRolePolicy (rewrite a role's trust policy); iam:AttachUserPolicy on yourself; kms:Decrypt on a key holding other principals' material. Tools like Cloudsplaining, PMapper, and IAM Vulnerable map these. Permission boundaries should explicitly deny these on developer roles.
Flashcard
A developer's role is granted iam:PassRole on a powerful CI role and codebuild:*. They claim they cannot escalate because the developer role itself only has s3:Get*. Are they right?
Answer
No. They can create a CodeBuild project that runs with the powerful CI role passed to it (the iam:PassRole permission), then have the build script call any API the CI role can. Their effective permissions are now the union of their own role and any role they can pass to a service that runs code. iam:PassRole is a transitive grant — always scope it to specific role ARNs, never *.
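The flashcard's reasoning can be written as a one-hop calculation. This is a rough sketch, not a replacement for a privilege-graph tool: the list of code-running actions is a small illustrative subset, the role name is made up, and transitive chains (a passed role that can itself pass further roles) are not followed.

```python
from fnmatch import fnmatchcase

# Illustrative subset of actions that run caller-supplied code as a passed role.
CODE_RUNNERS = {"lambda:CreateFunction", "codebuild:CreateProject", "ec2:RunInstances"}


def has(action_patterns, wanted: str) -> bool:
    """True if any granted pattern covers the wanted action."""
    return any(fnmatchcase(wanted, pattern) for pattern in action_patterns)


def effective_actions(own, passable_roles, role_policies) -> set:
    """Union of the principal's own grants and those of roles it can run code as.

    own:            action patterns granted to the principal
    passable_roles: role names the principal may iam:PassRole
    role_policies:  role name -> action patterns
    """
    result = set(own)
    can_run_code = any(has(own, runner) for runner in CODE_RUNNERS)
    if has(own, "iam:PassRole") and can_run_code:
        for role in passable_roles:
            result |= set(role_policies.get(role, ()))
    return result


developer = ["s3:Get*", "iam:PassRole", "codebuild:*"]
roles = {"ci-deployer": ["*"]}   # hypothetical powerful CI role
```

Removing either half of the pair breaks the escalation path in this model, which is why scoping iam:PassRole to specific role ARNs is the usual fix.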

Auditing IAM in Practice

Two reports run weekly are worth their weight in incidents avoided. Unused access: AWS IAM Access Analyzer (last-accessed) flags principals with permissions they never use; trim or revoke. Public & cross-account exposure: Access Analyzer (external access) reports every resource policy reachable from outside the account or org — your starting list of "why" questions. GCP's Recommender and Azure's Defender for Cloud have equivalents.

For static review, Cloudsplaining (Salesforce) and Parliament (Duo) lint policy documents for over-privilege; PMapper renders the actual privilege graph including transitive paths. Run them in CI on policy changes — the pattern of "the policy looks scoped but transitively grants admin" is too easy to miss in review.
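To give a feel for what such linters check, here is a deliberately tiny sketch that flags two patterns in a policy document: wildcard actions in Allow statements, and escalation-prone actions granted on Resource "*". The rule list and function names are assumptions made for illustration; the tools named above cover far more cases.

```python
# Illustrative subset of the escalation primitives discussed in this chapter.
RISKY_ACTIONS = {
    "iam:PassRole",
    "iam:UpdateAssumeRolePolicy",
    "iam:CreateAccessKey",
    "iam:AttachUserPolicy",
}


def as_list(value):
    """Policy fields may be a single string or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def lint(policy: dict) -> list:
    """Return (statement index, message) findings for Allow statements."""
    findings = []
    for index, statement in enumerate(as_list(policy.get("Statement", []))):
        if statement.get("Effect") != "Allow":
            continue  # Deny statements cannot over-grant
        actions = as_list(statement.get("Action", []))
        any_resource = "*" in as_list(statement.get("Resource", []))
        for action in actions:
            if action == "*" or action.endswith(":*"):
                findings.append((index, f"wildcard action {action}"))
            if action in RISKY_ACTIONS and any_resource:
                findings.append((index, f"{action} on Resource *"))
    return findings
```

A check like this could run in CI against changed policy files and fail the build on any finding, leaving the deeper transitive analysis to a graph tool.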

🔑
Key takeaways
1) Every cloud's IAM evaluates a principal × action × resource × conditions tuple; deny wins, default is implicit deny. 2) STS short-lived credentials and OIDC federation replace static access keys for nearly every modern principal — humans, CI, and workloads. 3) Conditions and ABAC are the real least-privilege lever; tags become a security boundary you must protect. 4) Permission boundaries, SCPs, and just-in-time elevation are how least privilege survives at scale without strangling delivery. 5) Watch iam:PassRole, iam:UpdateAssumeRolePolicy, and kms:Decrypt — they are escalation primitives whether you intend them to be or not.
