
Cloud Network Security — VPC, Routing & Endpoints

11 min · Intermediate · 2,433 words
The cloud network stack is software-defined, but the rules of subnetting, routing, and egress control are the same as ever. Master VPC layout, NACLs vs security groups, PrivateLink and VPC endpoints, transit hubs, and the egress controls that actually catch data exfiltration attempts.

What you will learn

01 VPC Topology: A Subnet Is Just a Routing Decision
02 Security Groups and NACLs: Two Different Tools
03 VPC Endpoints: PrivateLink Is the Shape of Modern Connectivity
04 Egress Control: Where Real Attackers Get Caught
05 Multi-VPC: Peering, Transit Gateway, and Cloud WAN
06 VPC Flow Logs: Your Network Audit Track

Identity is the perimeter. The network is still the second wall — and the first one if your identity story has a hole. This chapter is about how a modern cloud network is laid out, what each control actually enforces, and how to design for failure modes the on-prem network never had: SSRF reaching the metadata service, lateral movement through a flat overlay, and exfiltration through a misconfigured S3 endpoint.

🔑
Today's network mental models
1) Subnet topology — public, private and isolated tiers, AZ spread, and why routing, not labels, defines what "private" means. 2) NACL vs Security Group — stateful vs stateless, allow-only vs allow/deny, and where each shines. 3) VPC endpoints & PrivateLink — keeping S3, KMS and your own services off the internet without losing managed-service convenience. 4) Egress is the exfiltration surface — DNS, IMDS, NAT, and the controls that catch real attackers.

VPC Topology — A Subnet Is Just a Routing Decision

The vocabulary leads developers astray: a subnet labelled public is not magically reachable, and a subnet labelled private is not magically safe. What makes a subnet public is its route table: specifically, a default route (0.0.0.0/0) pointing to an Internet Gateway. A subnet whose default route points to a NAT Gateway is private with egress; a subnet with no default route at all is isolated.

Figure: VPC 10.0.0.0/16 with three subnet tiers and a set of VPC endpoints.
  • Public, 10.0.0.0/24: ALB, NAT GW; default route → IGW
  • Private (egress), 10.0.16.0/20: app, API, workers; default route → NAT GW
  • Isolated (data), 10.0.32.0/20: RDS, ElastiCache; no default route
  • VPC Endpoints (PrivateLink): com.amazonaws.us-east-1.s3 / .kms / .secretsmanager / .sts; private DNS resolves S3 / KMS / Secrets to ENIs in your VPC, with no NAT and no internet
Three subnet tiers. The data tier never reaches a NAT or IGW; managed services arrive over PrivateLink. Replicate per AZ for HA.
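To make the "routing, not labels" point concrete, here is a minimal Python sketch that classifies a subnet purely from its route table. The `Route` type, the tier names, and the target-ID prefixes checked (`igw-`, `nat-`) are illustrative assumptions, and only the IPv4 default route is considered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    destination: str  # CIDR such as "0.0.0.0/0"
    target: str       # target ID such as "igw-0123", "nat-0456" or "local"


def classify_subnet(routes):
    """Classify a subnet by where its IPv4 default route points."""
    for route in routes:
        if route.destination != "0.0.0.0/0":
            continue
        if route.target.startswith("igw-"):
            return "public"
        if route.target.startswith("nat-"):
            return "private-egress"
        # A default route to something else (TGW, appliance ENI, ...)
        return "other-default-route"
    # No default route at all: nothing leaves the VPC by default
    return "isolated"


# Hypothetical route tables for the three tiers in the figure
public_rt = [Route("10.0.0.0/16", "local"), Route("0.0.0.0/0", "igw-0123")]
private_rt = [Route("10.0.0.0/16", "local"), Route("0.0.0.0/0", "nat-0456")]
isolated_rt = [Route("10.0.0.0/16", "local")]
```

A check like this is useful in a CI audit: it ignores the subnet's name tag entirely and reports what the routing actually permits.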

The standard three-tier layout

  • Public subnets hold load balancers, NAT gateways, and bastions. Default route is the Internet Gateway. Nothing else lives here.
  • Private (egress) subnets hold app servers, workers, queue consumers. Default route is the NAT gateway in the public subnet of the same AZ. Returning traffic from outbound calls works; inbound from the internet does not.
  • Isolated subnets hold databases, caches, internal-only services. No default route. Egress to managed services happens through VPC endpoints; cross-AZ replicas live in the matching isolated subnets of the other AZs.
  • One of each per AZ — three AZs is the typical compromise between cost and zonal-failure resilience.
⚠️
CIDR planning is forever
VPC CIDR blocks are hard to change after creation (the primary block is fixed; you can only add secondary ranges), and overlapping ranges across accounts rule out VPC peering and make Transit Gateway routing painful later. Reserve a global IP plan up front (e.g., 10.0.0.0/8 divided per region per environment), even if you start with one VPC. Future-you will not be sorry.
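A global IP plan can be generated and checked with the standard `ipaddress` module. The sketch below is one possible carve-up, not a recommendation: the /12-per-region and /16-per-environment sizes, the region names, and the environment names are all assumptions chosen for illustration.

```python
import ipaddress
from itertools import combinations


def carve(parent, new_prefix, names):
    """Assign consecutive subnets of `parent` to `names`, in order."""
    blocks = ipaddress.ip_network(parent).subnets(new_prefix=new_prefix)
    return {name: next(blocks) for name in names}


def overlapping_pairs(plan):
    """Return the name pairs whose CIDR blocks overlap."""
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(plan.items(), 2)
        if net_a.overlaps(net_b)
    ]


# One /12 per region, then one /16 per environment inside each region
regions = carve("10.0.0.0/8", 12, ["us-east-1", "eu-west-1"])
plan = {}
for region, block in regions.items():
    for env, net in carve(str(block), 16, ["prod", "staging", "dev"]).items():
        plan[f"{region}/{env}"] = net
```

Running `overlapping_pairs` over the plan before any VPC is created is a cheap way to catch the collision that would otherwise surface when two accounts first need to talk.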

Security Groups and NACLs — Two Different Tools

Both filter packets, but they live at different layers and obey different rules. Use both, but for different reasons.

| Aspect | Security Group | Network ACL |
|---|---|---|
| Layer | Around an ENI / instance / endpoint | Around a subnet |
| State | Stateful: return traffic auto-allowed | Stateless: must allow both directions |
| Effect types | Allow only (implicit deny) | Allow and Deny (numbered rules) |
| Reference targets | Other SG IDs (security-group-as-source!) | CIDR only |
| Default | Deny in / allow all out (start by removing the all-out) | Allow in & out |
| Best at | Workload-to-workload micro-segmentation | Coarse subnet guardrails, blocking known-bad CIDRs |
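The stateless, numbered-rule behaviour of a NACL is easy to get wrong, so here is a small Python model of it. It is a simplified sketch: the `NaclRule` type and the sample rules are hypothetical, protocols are ignored, and `peer_ip` means the remote address (source for inbound rules, destination for outbound ones).

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int
    action: str      # "allow" or "deny"
    cidr: str
    port_from: int
    port_to: int


def nacl_decision(rules, peer_ip, port):
    """Lowest-numbered matching rule wins; no match means deny."""
    addr = ipaddress.ip_address(peer_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        in_cidr = addr in ipaddress.ip_network(rule.cidr)
        if in_cidr and rule.port_from <= port <= rule.port_to:
            return rule.action
    return "deny"


def nacl_round_trip(inbound, outbound, client_ip, client_port, server_port):
    """Stateless: the request AND the reply must each be allowed."""
    request_ok = nacl_decision(inbound, client_ip, server_port) == "allow"
    reply_ok = nacl_decision(outbound, client_ip, client_port) == "allow"
    return request_ok and reply_ok


inbound = [
    NaclRule(100, "deny", "203.0.113.0/24", 0, 65535),  # known-bad range
    NaclRule(200, "allow", "0.0.0.0/0", 443, 443),
]
outbound = [
    NaclRule(100, "allow", "0.0.0.0/0", 1024, 65535),   # ephemeral reply ports
]
```

With a security group, the inbound allow on 443 would be enough because the reply is tracked automatically; with the NACL, dropping the ephemeral-port outbound rule silently breaks every connection.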

Security-group-as-source — the underused power tool

Hard-coding IP ranges in security groups is a maintenance nightmare in autoscaling. Instead, allow source by security group ID: "the database SG accepts 5432 from the app-tier SG." Any ENI that joins the app SG is automatically allowed; any ENI removed from it loses access. This makes SG rules identity-based at the network layer, and is the closest you get to micro-segmentation without a service mesh.

terraform — security-group-as-source
resource "aws_security_group" "app" {
  name   = "app-tier"
  vpc_id = aws_vpc.main.id
}

resource "aws_security_group" "db" {
  name   = "db-tier"
  vpc_id = aws_vpc.main.id
}

# DB accepts Postgres only from anything in the app SG
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}
💡
Default-deny egress on security groups
The default SG ruleset allows all outbound traffic. Most workloads only ever talk to a handful of named hosts. Replacing the default-out with explicit allows (e.g. only port 443 to the VPC endpoint CIDRs and your DB SG) closes the most common exfil path: a compromised app calling out to attacker.com on 443. Service-mesh teams can do this at L7; for plain VMs, SG egress is your best bet.
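Finding the groups that still carry the default allow-all egress rule is a good first step. The sketch below assumes its input is shaped like the `SecurityGroups` list returned by the EC2 DescribeSecurityGroups API (`IpPermissionsEgress`, `IpRanges`, `Ipv6Ranges`); the sample group IDs are made up, and fetching the data is left out.

```python
def open_egress_groups(security_groups):
    """Return GroupIds that allow all protocols out to the whole internet."""
    flagged = []
    for group in security_groups:
        for perm in group.get("IpPermissionsEgress", []):
            all_protocols = perm.get("IpProtocol") == "-1"
            anywhere_v4 = any(
                r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", [])
            )
            anywhere_v6 = any(
                r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", [])
            )
            if all_protocols and (anywhere_v4 or anywhere_v6):
                flagged.append(group["GroupId"])
                break  # one finding per group is enough
    return flagged


# Hypothetical sample: one default-style group, one tightened group
sample_groups = [
    {"GroupId": "sg-app", "IpPermissionsEgress": [
        {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}], "Ipv6Ranges": []},
    ]},
    {"GroupId": "sg-db", "IpPermissionsEgress": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "10.0.32.0/20"}], "Ipv6Ranges": []},
    ]},
]
```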

VPC Endpoints — PrivateLink Is the Shape of Modern Connectivity

Two kinds of VPC endpoint exist; they are not interchangeable.

Gateway endpoints

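Used for S3 and DynamoDB. They add a route to your route tables whose destination is the AWS-managed prefix list for the service (for example com.amazonaws.region.s3) and whose target is the endpoint. Traffic to S3 from the VPC stays inside the AWS network and never traverses the IGW or NAT. There is no additional charge for gateway endpoints.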

Interface endpoints (PrivateLink)

Used for everything else — KMS, STS, Secrets Manager, ECR, SSM, partner APIs, your own services. An ENI is created in your VPC, and Route 53 private DNS overrides the public service hostname to resolve to that ENI. Your code keeps using kms.us-east-1.amazonaws.com; the resolution is private. Pay per endpoint per AZ per hour, plus data.

Figure: in the customer VPC, the app ENI (10.0.16.42) connects on port 443 to the KMS endpoint ENI (10.0.32.7). Private DNS rewrites kms.us-east-1.amazonaws.com → 10.0.32.7, and traffic continues over the AWS service backbone to KMS, STS, Secrets, ECR… with no internet and no NAT; policy can pin SourceVpce.
PrivateLink turns AWS managed services into ENIs in your VPC. Combine with aws:SourceVpce in resource policies to pin S3 buckets to your private network only.
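One way to confirm that private DNS is actually in effect for an interface endpoint is to resolve the service hostname from inside the VPC and check that every answer falls within the VPC range. The sketch below is a rough check under that assumption; the function names are hypothetical, and it does not apply to gateway endpoints, which leave DNS answers unchanged and work through routing instead.

```python
import ipaddress
import socket


def resolve(hostname, port=443):
    """Return the sorted set of addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def resolves_inside_vpc(addresses, vpc_cidr):
    """True only if there is at least one answer and all answers are in the VPC."""
    vpc = ipaddress.ip_network(vpc_cidr)
    return bool(addresses) and all(
        ipaddress.ip_address(addr) in vpc for addr in addresses
    )
```

From an instance in the VPC, `resolves_inside_vpc(resolve("kms.us-east-1.amazonaws.com"), "10.0.0.0/16")` would be expected to hold once the KMS endpoint has private DNS enabled; a public answer suggests the endpoint or its DNS setting is missing.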

Endpoint policies — the often-missed control

Each VPC endpoint can carry its own policy that filters which API calls may pass through it. A common hardening is a corporate S3 endpoint policy that allows S3 actions only on buckets whose aws:ResourceOrgID is your org; requests to any other bucket fall through to the implicit deny. Now even if a compromised principal has credentials with broad S3 permissions, attempts to exfiltrate to an attacker bucket via your private network simply fail. This is one of the most powerful exfiltration controls in AWS.

json — S3 endpoint policy that blocks cross-org reads
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
    "Resource": "*",
    "Condition": {
      "StringEquals": { "aws:ResourceOrgID": "o-1234567890" }
    }
  }]
}
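To see why the policy above stops exfiltration, it helps to walk a few requests through it. The following is a deliberately tiny evaluator, not the real IAM policy engine: it handles only Allow statements, exact action names, and the single `StringEquals` condition on `aws:ResourceOrgID`, and the attacker org ID is invented for the example.

```python
import json

POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
    "Resource": "*",
    "Condition": {
      "StringEquals": { "aws:ResourceOrgID": "o-1234567890" }
    }
  }]
}
""")


def endpoint_allows(policy, action, resource_org_id):
    """Toy evaluation: allowed only if some Allow statement matches."""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow" or action not in stmt["Action"]:
            continue
        required = (
            stmt.get("Condition", {})
            .get("StringEquals", {})
            .get("aws:ResourceOrgID")
        )
        if required is None or required == resource_org_id:
            return True
    return False  # implicit deny
```

The interesting case is a `s3:PutObject` to a bucket owned by another organization: the caller's IAM permissions are never consulted here, because the request is dropped at the endpoint first.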

Cross-account PrivateLink for internal services

You can also publish your own service via PrivateLink: an NLB fronts the service in account A and is exposed as an endpoint service with its own service name. Account B creates an interface endpoint to it, which you can require to be explicitly accepted. Internal-SaaS-style architectures (Snowflake, Databricks, internal platforms) all use this. By default the receiving service sees the NLB's private IP addresses as the source, not the consumer's, so source-IP controls cannot tell consumers apart; enable Proxy Protocol v2 on the NLB if you need the consumer's address and endpoint ID, and rely on TLS client certificates or application-level authentication for identity.

Egress Control — Where Real Attackers Get Caught

Inbound is the obvious surface; in practice, post-compromise data movement happens outbound. The goal of cloud egress controls is to make a successful compromise visible and useful exfiltration paths few. Three controls in increasing strength.

NAT plus FQDN allow-list

For workloads that legitimately call a small set of external APIs, a forward proxy with FQDN allow-listing (Squid, AWS Network Firewall with Suricata, GCP Secure Web Proxy) is the canonical answer. Egress goes through the proxy; the proxy enforces "only api.stripe.com, api.github.com". Random calls to attacker.com are rejected and logged. Pair with TLS inspection only if your data classification justifies it; otherwise SNI-only filtering catches most exfil while preserving certificate trust.
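The matching logic at the heart of an FQDN allow-list is small but has sharp edges, mainly around suffix matching. Below is a minimal sketch of one reasonable rule set (exact names plus `*.` wildcards); the function names are hypothetical and real proxies have their own matching semantics.

```python
def normalize(host):
    """Lower-case and drop surrounding whitespace and any trailing dot."""
    return host.strip().rstrip(".").lower()


def fqdn_allowed(host, allow_list):
    """Exact match, or '*.example.com' matching any subdomain of example.com."""
    host = normalize(host)
    for entry in allow_list:
        entry = normalize(entry)
        if entry.startswith("*."):
            # Keep the leading dot so 'evilexample.com' cannot match
            if host.endswith(entry[1:]):
                return True
        elif host == entry:
            return True
    return False


ALLOW = ["api.stripe.com", "api.github.com", "*.internal.example.com"]
```

Note that `api.stripe.com.attacker.com` is rejected because the comparison is on the whole name, not a prefix. That is the classic mistake in hand-rolled allow-lists.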

DNS firewalls — Route 53 Resolver DNS Firewall

Block resolution of malicious or unknown domains at the resolver. Route 53 Resolver DNS Firewall accepts managed lists (AWS-managed, abuse.ch, etc.) and custom lists. Cheap, broad, and catches a lot of commodity attacker tooling that uses DGAs (domain generation algorithms). Does not catch IP-direct callouts; combine with egress firewalls.

Block direct internet egress

The strongest move: only managed services through PrivateLink, never raw internet. Workloads route nowhere outbound except VPC endpoints; package mirrors live in private artifact registries. This is increasingly common in regulated environments and feasible in others if you commit to a private artifact pipeline.

🚨
The metadata service is an SSRF target
Every IaaS metadata service — AWS IMDS, GCP metadata.google.internal, Azure IMDS — sits at a well-known link-local IP. An SSRF in your app or a misconfigured proxy turns into credential theft: the attacker reads the role's STS credentials right out of the metadata response. Mitigations: require IMDSv2 (token + hop-limit 1) on AWS; on GCP, add --metadata=disable-legacy-endpoints=true; on Azure, scope IMDS reads with NSGs and managed-identity hardening. The next chapter (Service Mesh / Zero Trust) treats this as the canonical SSRF case.
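The reason IMDSv2 blunts SSRF is visible in the request shape: the caller must first issue a PUT with a custom header to obtain a session token, and then present that token on every read. A typical SSRF primitive can only make a plain GET. The sketch below builds both requests with the standard library; the helper names are hypothetical, and `read_metadata` only works when run on an EC2 instance.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def token_request(ttl_seconds=21600):
    """Step 1: PUT for a session token (TTL header is mandatory)."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path, token):
    """Step 2: GET the metadata path, presenting the token."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def read_metadata(path, timeout=2):
    """Run both steps; requires network access to the link-local IMDS address."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(metadata_request(path, token), timeout=timeout) as resp:
        return resp.read().decode()
```

The hop limit of 1 complements this: the token response cannot cross an extra network hop, so a container or proxy forwarding the request on behalf of an attacker does not receive it.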

Multi-VPC: Peering, Transit Gateway, and Cloud WAN

Real estates have more than one VPC. The connectivity options progress in scale and cost.

  • VPC Peering. 1:1 link, no transitive routing. Fine for two VPCs; quadratic mess at ten.
  • Transit Gateway (AWS) / Network Connectivity Center (GCP) / Virtual WAN (Azure). A hub-and-spoke router. Each VPC attaches once; routing tables on the gateway segment which spokes can reach which. This is the default for any non-trivial estate.
  • Cloud WAN. A higher-level abstraction layered over Transit Gateways across regions, with policy-driven attachments. Useful past about ten regions.
  • VPC sharing (RAM in AWS, Shared VPC in GCP, virtual networks in Azure). One networking account owns the VPC; workload accounts attach. Centralises networking ops.
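Two properties of peering drive the move to a hub: the number of links grows quadratically, and routing is not transitive. A short Python sketch makes both visible; the VPC names are placeholders.

```python
from itertools import combinations


def full_mesh_peerings(vpcs):
    """Every pair of VPCs needs its own peering connection."""
    return list(combinations(sorted(vpcs), 2))


def can_reach_via_peering(peerings, src, dst):
    """Peering is non-transitive: only a direct link between src and dst counts."""
    return (src, dst) in peerings or (dst, src) in peerings


ten_vpcs = [f"vpc-{i}" for i in range(10)]
mesh = full_mesh_peerings(ten_vpcs)   # n * (n - 1) / 2 links
chain = {("vpc-a", "vpc-b"), ("vpc-b", "vpc-c")}
```

Ten VPCs need 45 peering connections for full connectivity, against ten attachments on a hub, and in the `chain` example `vpc-a` still cannot reach `vpc-c` even though both peer with `vpc-b`.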
💡
Transit Gateway is also a security boundary
TGW route tables let you build segmented networks: separate route tables for prod, nonprod, and shared-services, with explicit propagations only where needed. A misconfigured spoke cannot route into another segment. Combined with account-level separation it is the cloud equivalent of network zones, and Transit Gateway Flow Logs make cross-VPC traffic observable in one place.
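The segmentation described above comes down to two mappings: which route table each attachment is associated with, and which attachments propagate routes into each table. The Python model below is a simplified sketch of that idea with made-up attachment and table names; static routes and blackhole routes are left out.

```python
def can_route(association, propagations, src, dst):
    """src can send to dst only if dst propagates into src's associated table."""
    table = association[src]
    return dst in propagations.get(table, set())


def can_talk(association, propagations, a, b):
    """A working flow needs a route in both directions."""
    return (can_route(association, propagations, a, b)
            and can_route(association, propagations, b, a))


# Which TGW route table each attachment uses for its outbound lookups
association = {
    "prod-vpc": "prod-rt",
    "dev-vpc": "nonprod-rt",
    "shared-vpc": "shared-rt",
}
# Which attachments advertise their CIDRs into each table
propagations = {
    "prod-rt": {"shared-vpc"},
    "nonprod-rt": {"shared-vpc"},
    "shared-rt": {"prod-vpc", "dev-vpc"},
}
```

In this layout prod and dev can each reach shared services, but there is no table in which either one sees the other, so a mistake inside a spoke VPC cannot create a prod-to-dev path.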

VPC Flow Logs — Your Network Audit Track

Every accept/reject decision in the VPC can be logged. The fields cover source/destination IP, ports, packet/byte counts, action, and (in v5) the AWS service the traffic is heading to. Enable on every VPC, ship to S3 with object-lock and a 90+ day retention, and parse with Athena or a SIEM. Patterns to alert on:

  • REJECTs from inside a private subnet — workload trying to reach the internet, often early-stage exfil reconnaissance.
  • Spike in destination IPs from one ENI — possible scanning or staged exfil.
  • Egress to a low-reputation ASN — usually paired with a DNS firewall.
  • IMDS abuse does not show up here: flow logs omit traffic to 169.254.169.254. Detect stolen instance-role credentials in CloudTrail instead, for example role sessions used from source IPs outside your VPCs.
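The first pattern in the list, REJECTs leaving a private subnet, can be prototyped in a few lines before it becomes an Athena query or SIEM rule. This sketch assumes records in the default (version 2) space-separated flow log format; the sample lines, ENI IDs, and CIDRs are invented for illustration.

```python
import ipaddress
from collections import Counter

FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse(line):
    """Map one default-format record onto field names."""
    return dict(zip(FIELDS, line.split()))


def rejected_egress(lines, subnet_cidr, vpc_cidr):
    """Count REJECTed flows per ENI from the subnet to anything outside the VPC."""
    subnet = ipaddress.ip_network(subnet_cidr)
    vpc = ipaddress.ip_network(vpc_cidr)
    hits = Counter()
    for line in lines:
        rec = parse(line)
        if rec.get("log_status") != "OK":  # NODATA/SKIPDATA rows use "-" fields
            continue
        src = ipaddress.ip_address(rec["srcaddr"])
        dst = ipaddress.ip_address(rec["dstaddr"])
        if rec["action"] == "REJECT" and src in subnet and dst not in vpc:
            hits[rec["interface_id"]] += 1
    return hits


sample_lines = [
    "2 123456789012 eni-0a1b 10.0.16.42 198.51.100.20 49152 443 6 5 300 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b 10.0.16.42 10.0.32.7 49153 443 6 9 900 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0c2d 10.0.16.50 10.0.32.9 49154 5432 6 1 60 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0e3f - - - - - - - 1700000000 1700000060 - NODATA",
]
```

Only the first sample record counts: the second was accepted, the third was rejected but stayed inside the VPC, and the fourth carries no traffic data.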
Quick check
An app server in a private subnet legitimately calls S3 every minute. The bucket policy is correct, but after a TGW migration the call now times out. The route to S3 is unchanged. What is the most likely cause and the cheapest fix?
Answer
Most likely cause: the new TGW segmentation removed the path to the gateway endpoint for S3 — or the gateway endpoint was not migrated and traffic now hits the NAT, which is sized for a different load. Cheapest fix: ensure each VPC has its own S3 gateway endpoint (gateway endpoints are free and per-VPC), and confirm the route table of the private subnet has the prefix-list route for S3. Gateway endpoints are scoped to a single VPC; they do not propagate via TGW.

The Cross-Provider View

The vocabulary varies; the architecture survives.

| Concept | AWS | GCP | Azure |
|---|---|---|---|
| L4 firewall around workload | Security Group | Firewall rule (network tag / SA) | NSG (subnet/NIC) |
| L4 firewall around subnet | NACL | (via firewall rule order) | NSG on subnet |
| Private path to managed service | VPC Endpoint (Gateway / Interface) | Private Service Connect / Private Google Access | Private Endpoint / Service Endpoint |
| Hub for many networks | Transit Gateway / Cloud WAN | Network Connectivity Center | Virtual WAN / vNet peering |
| FQDN egress firewall | Network Firewall | Secure Web Proxy / Cloud NGFW | Azure Firewall |
| Service-private exposure | PrivateLink (NLB-fronted) | Private Service Connect | Private Link Service |
| Network observation logs | VPC Flow Logs | VPC Flow Logs | NSG Flow Logs |
Mnemonic — VPC review
"Public, Private, Isolated; Egress, Endpoints, Audit."
  • Three subnet tiers per AZ.
  • Egress is the exfil surface — proxy or block it.
  • Endpoints over internet for managed services.
  • Flow logs into a SIEM, retained outside the account.

A Cloud-Native Reference Pattern

Pulling it all together for a typical web service:

  1. Public ALB in public subnets, terminating TLS with an ACM cert.
  2. App tier in private subnets, running Fargate tasks or EC2 ASGs, security-group-as-source from the ALB SG.
  3. RDS Aurora in isolated subnets, with KMS at-rest encryption and TLS in-transit.
  4. Interface endpoints for KMS, Secrets Manager, STS, ECR, CloudWatch Logs, plus gateway endpoints for S3 and DynamoDB.
  5. NAT gateway present but its egress filtered by AWS Network Firewall — only api.stripe.com, api.sendgrid.com, and your own SaaS allow-list.
  6. Route 53 Resolver DNS Firewall blocking the AWS-managed domain block list.
  7. VPC flow logs to S3 in a separate logging account, retained 13 months.
  8. SCP at the org restricting allowed regions and forbidding internet-routable resources in non-public OUs (for example, denying ec2:RunInstances when the ec2:AssociatePublicIpAddress condition key is true).
Flashcard
A teammate proposes putting RDS in a public subnet and securing it with a strong password and an SG that only allows port 5432 from 0.0.0.0/0. Why is this still bad even with a strong password?
Answer
Internet exposure means a single bug is enough. A zero-day in the RDS engine, a regression in TLS, or a misconfigured client is now open to a global attacker pool. RDS in an isolated subnet, behind an SG that only accepts the app SG and with KMS encryption at rest, narrows the threat to someone who has already compromised the app tier, which drastically reduces the population of credible attackers. The password is one wall; defence in depth is many.
🔑
Key takeaways
1) Routing decides public vs private: always check whether the subnet's default route targets an IGW, a NAT, or nothing. 2) SGs and NACLs are different tools: SGs are stateful and identity-aware; NACLs are stateless subnet guardrails. 3) VPC endpoints plus endpoint policies are the modern way to talk to managed services, and a powerful exfil control when combined with aws:SourceVpce and org pinning. 4) Egress, DNS and IMDS are where post-compromise activity lives: proxy, block or filter all three. 5) Flow logs in a separate account are non-negotiable; you cannot investigate without them.
