DAY 2 · AM

03 / 09

Cloud Network Security — VPC, Routing & Endpoints

11 min · Intermediate · 2,433 words

The cloud network stack is software-defined, but the rules of subnetting, routing, and egress control are the same as ever. Master VPC layout, NACLs vs security groups, PrivateLink and VPC endpoints, transit hubs, and the egress controls that actually catch data exfiltration attempts.

What you will learn

01 VPC Topology — A Subnet Is Just a Routing Decision

02 Security Groups and NACLs — Two Different Tools

03 VPC Endpoints — PrivateLink Is the Shape of Modern Connectivity

04 Egress Control — Where Real Attackers Get Caught

05 Multi-VPC: Peering, Transit Gateway, and Cloud WAN

06 VPC Flow Logs — Your Network Audit Trail

Identity is the perimeter. The network is still the second wall — and the first one if your identity story has a hole. This chapter is about how a modern cloud network is laid out, what each control actually enforces, and how to design for failure modes the on-prem network never had: SSRF reaching the metadata service, lateral movement through a flat overlay, and exfiltration through a misconfigured S3 endpoint.

🔑

Today's network mental models

1) Subnet topology — public, private and isolated tiers, AZ spread, and why routing, not labels, defines what "private" means. 2) NACL vs Security Group — stateful vs stateless, allow-only vs allow/deny, and where each shines. 3) VPC endpoints & PrivateLink — keeping S3, KMS and your own services off the internet without losing managed-service convenience. 4) Egress is the exfiltration surface — DNS, IMDS, NAT, and the controls that catch real attackers.

VPC Topology — A Subnet Is Just a Routing Decision

The vocabulary leads developers astray: a subnet labelled public is not magically reachable, and a subnet labelled private is not magically safe. What makes a subnet public is its route table: specifically, that it has a default route (0.0.0.0/0) pointing to an Internet Gateway. A subnet whose default route points to a NAT Gateway is private with egress; a subnet with no default route at all is isolated.

Three subnet tiers. The data tier never reaches a NAT or IGW; managed services arrive over PrivateLink. Replicate per AZ for HA.
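
The routing rule above can be sketched as a small classifier. This is a minimal illustration, not an AWS API: it assumes each route is given as a hypothetical (destination, target) pair and relies only on the `igw-` and `nat-` ID prefixes that AWS uses for Internet Gateways and NAT Gateways.

```python
from typing import Iterable, Tuple

DEFAULT_ROUTE = "0.0.0.0/0"


def classify_subnet(routes: Iterable[Tuple[str, str]]) -> str:
    """Classify a subnet from its (destination_cidr, target_id) route pairs."""
    for destination, target in routes:
        if destination != DEFAULT_ROUTE:
            continue
        if target.startswith("igw-"):
            return "public"
        if target.startswith("nat-"):
            return "private-with-egress"
        # A default route to something else (TGW, firewall endpoint, ...)
        # needs a human to look at it.
        return "other-default-route"
    # No default route at all: nothing leaves the VPC by default.
    return "isolated"


# Hypothetical route tables for the three tiers.
public_routes = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "igw-0abc")]
private_routes = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-0def")]
isolated_routes = [("10.0.0.0/16", "local")]
```

Running a check like this over every subnet is a cheap way to catch a subnet whose name says "private" while its route table says otherwise.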

The standard three-tier layout

Public subnets hold load balancers, NAT gateways, and bastions. Default route is the Internet Gateway. Nothing else lives here.
Private (egress) subnets hold app servers, workers, queue consumers. Default route is the NAT gateway in the public subnet of the same AZ. Returning traffic from outbound calls works; inbound from the internet does not.
Isolated subnets hold databases, caches, internal-only services. No default route. Egress to managed services happens through VPC endpoints; cross-AZ replicas in matching subnets.
One of each per AZ — three AZs is the typical compromise between cost and zonal-failure resilience.

⚠️

CIDR planning is forever

A VPC's primary CIDR block cannot be changed after creation (you can only add secondary blocks), and overlapping ranges across accounts rule out VPC peering and make Transit Gateway routing painful later. Reserve a global IP plan up front (e.g., 10.0.0.0/8 divided per region per environment), even if you start with one VPC. Future-you will be grateful.
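
One way to keep such a plan honest is to generate it rather than type it. The sketch below uses Python's standard `ipaddress` module; the region names, environment names, and prefix lengths (/12 per region, /16 per environment) are assumptions chosen for illustration.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def carve(parent: str, new_prefix: int, names: List[str]) -> Dict[str, ipaddress.IPv4Network]:
    """Hand out consecutive blocks of size /new_prefix from parent, one per name."""
    blocks = ipaddress.ip_network(parent).subnets(new_prefix=new_prefix)
    return dict(zip(names, blocks))


def overlapping(plan: Dict[str, ipaddress.IPv4Network]) -> List[Tuple[str, str]]:
    """Return every pair of names whose ranges overlap."""
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(plan.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical plan: a /12 per region, then a /16 per environment.
regions = carve("10.0.0.0/8", 12, ["us-east-1", "eu-west-1"])
envs = carve(str(regions["us-east-1"]), 16, ["prod", "staging", "dev"])

# A hand-edited plan with a mistake, to show the overlap check firing.
bad_plan = {
    "team-a": ipaddress.ip_network("10.0.0.0/16"),
    "team-b": ipaddress.ip_network("10.0.128.0/17"),
}
```

Here `overlapping(envs)` is empty, while `overlapping(bad_plan)` reports the `team-a` / `team-b` clash before anyone tries to peer those VPCs.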

Security Groups and NACLs — Two Different Tools

Both filter packets, but they live at different layers and obey different rules. Use both, but for different reasons.

Property	Security Group	Network ACL
Layer	Around an ENI / instance / endpoint	Around a subnet
State	Stateful — return traffic auto-allowed	Stateless — must allow both directions
Effect types	Allow only (implicit deny)	Allow and Deny (numbered rules)
Reference targets	CIDRs, prefix lists, or other SG IDs (security-group-as-source!)	CIDR only
Default	New SG: deny in / allow all out (start by removing the all-out)	Default NACL: allow in & out; custom NACL: deny all until rules are added
Best at	Workload-to-workload micro-segmentation	Coarse subnet guardrails, blocking known-bad CIDRs
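
The "stateless, numbered rules" behaviour in the table is easiest to see in a toy evaluator. This is a deliberately simplified model (it ignores protocol and treats ports as destination ports); the rule numbers and CIDRs are made up for the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NaclRule:
    number: int      # lower numbers are evaluated first
    action: str      # "allow" or "deny"
    cidr: str        # remote CIDR the rule applies to
    port_from: int
    port_to: int


def evaluate_nacl(rules: Iterable[NaclRule], remote_ip: str, port: int) -> str:
    """Return the action of the first matching rule, else the implicit deny."""
    addr = ipaddress.ip_address(remote_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        in_cidr = addr in ipaddress.ip_network(rule.cidr)
        if in_cidr and rule.port_from <= port <= rule.port_to:
            return rule.action
    return "deny"


inbound = [
    NaclRule(90, "deny", "198.51.100.0/24", 0, 65535),  # known-bad range
    NaclRule(100, "allow", "0.0.0.0/0", 443, 443),
]
# Stateless: replies leave towards the client's ephemeral port,
# so they need an outbound rule of their own.
outbound = [NaclRule(100, "allow", "0.0.0.0/0", 1024, 65535)]
```

With an empty outbound list the reply to an allowed inbound request is dropped, which is exactly the failure a security group never produces because it tracks connection state.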

Security-group-as-source — the underused power tool

Hard-coding IP ranges in security groups is a maintenance nightmare in autoscaling. Instead, allow source by security group ID: "the database SG accepts 5432 from the app-tier SG." Any ENI that joins the app SG is automatically allowed; any ENI removed from it loses access. This makes SG rules identity-based at the network layer, and is the closest you get to micro-segmentation without a service mesh.

terraform — security-group-as-source

resource "aws_security_group" "app" {
  name   = "app-tier"
  vpc_id = aws_vpc.main.id
}

resource "aws_security_group" "db" {
  name   = "db-tier"
  vpc_id = aws_vpc.main.id
}

# DB accepts Postgres only from anything in the app SG
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}

💡

Default-deny egress on security groups

The default SG ruleset allows all outbound traffic. Most workloads only ever talk to a handful of named hosts. Replacing the default-out with explicit allows (e.g. only port 443 to the VPC endpoint CIDRs and your DB SG) closes the most common exfil path: a compromised app calling out to attacker.com on 443. Service-mesh teams can do this at L7; for plain VMs, SG egress is your best bet.
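
A quick audit for leftover allow-all egress might look like the following. It is a sketch that assumes the input is the `SecurityGroups` list in the shape returned by the EC2 `DescribeSecurityGroups` API; the sample groups and their IDs are invented.

```python
from typing import Any, Dict, List


def open_egress_groups(security_groups: List[Dict[str, Any]]) -> List[str]:
    """Return IDs of groups with any egress rule open to the whole internet."""
    flagged = []
    for group in security_groups:
        for perm in group.get("IpPermissionsEgress", []):
            v4_open = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
            v6_open = any(r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", []))
            if v4_open or v6_open:
                flagged.append(group["GroupId"])
                break  # one open rule is enough to flag the group
    return flagged


# Hypothetical data: one group with the default allow-all egress rule,
# one tightened to HTTPS towards a VPC-internal range only.
sample_groups = [
    {
        "GroupId": "sg-default-out",
        "IpPermissionsEgress": [
            {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}], "Ipv6Ranges": []}
        ],
    },
    {
        "GroupId": "sg-tightened",
        "IpPermissionsEgress": [
            {
                "IpProtocol": "tcp",
                "FromPort": 443,
                "ToPort": 443,
                "IpRanges": [{"CidrIp": "10.0.0.0/16"}],
                "Ipv6Ranges": [],
            }
        ],
    },
]
```

The check is intentionally strict: it flags any rule to 0.0.0.0/0 or ::/0 regardless of port, on the assumption that internet-bound traffic should go through a proxy or firewall instead.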

VPC Endpoints — PrivateLink Is the Shape of Modern Connectivity

Two kinds of VPC endpoint exist; they are not interchangeable.

Gateway endpoints

Used for S3 and DynamoDB. They add a route to your route tables whose destination is the service's AWS-managed prefix list (for S3, the list named com.amazonaws.region.s3) and whose target is the endpoint itself. Traffic to S3 from the VPC stays inside the AWS network and never traverses the IGW or NAT. Free.

Interface endpoints (PrivateLink)

Used for everything else — KMS, STS, Secrets Manager, ECR, SSM, partner APIs, your own services. An ENI is created in your VPC, and Route 53 private DNS overrides the public service hostname to resolve to that ENI. Your code keeps using kms.us-east-1.amazonaws.com; the resolution is private. Pay per endpoint per AZ per hour, plus data.

PrivateLink turns AWS managed services into ENIs in your VPC. Combine with aws:SourceVpce in resource policies to pin S3 buckets to your private network only.
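
To confirm that private DNS is really overriding the public hostname, you can resolve the service name from inside the VPC and check that every answer is a private address. The sketch below is one way to do that with the standard library; `lookup` performs a live DNS query, so it is defined here but not called.

```python
import ipaddress
import socket
from typing import List


def resolves_privately(addresses: List[str]) -> bool:
    """True only if there is at least one answer and all answers are private IPs."""
    return bool(addresses) and all(
        ipaddress.ip_address(a).is_private for a in addresses
    )


def lookup(hostname: str) -> List[str]:
    """Resolve hostname to its distinct IP addresses (live DNS query)."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


# Usage from a host inside the VPC:
#   resolves_privately(lookup("kms.us-east-1.amazonaws.com"))
# is expected to be True once the interface endpoint and private DNS are in place.
```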

Endpoint policies — the often-missed control

Each VPC endpoint can carry its own policy that filters which API calls can pass through it. A common hardening: a corporate S3 endpoint policy that allows S3 actions only on buckets whose aws:ResourceOrgID matches your organization, so everything else falls to the implicit deny. Even if a compromised principal holds credentials with broad S3 permissions, attempts to copy data to an attacker-owned bucket over your private network fail. This is one of the most powerful exfiltration controls in AWS.

json — S3 endpoint policy that blocks cross-org access

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
    "Resource": "*",
    "Condition": {
      "StringEquals": { "aws:ResourceOrgID": "o-1234567890" }
    }
  }]
}
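
To make the effect of that condition concrete, here is a toy evaluator. It is not the real IAM policy engine: it handles only Allow statements, exact action names, and a single `StringEquals` on `aws:ResourceOrgID`, and the second organization ID is invented.

```python
import json

ENDPOINT_POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
    "Resource": "*",
    "Condition": {
      "StringEquals": { "aws:ResourceOrgID": "o-1234567890" }
    }
  }]
}
""")


def endpoint_allows(policy: dict, action: str, resource_org_id: str) -> bool:
    """Simplified check: does any Allow statement cover this action and org?"""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow" or action not in stmt["Action"]:
            continue
        wanted = stmt.get("Condition", {}).get("StringEquals", {}).get("aws:ResourceOrgID")
        if wanted is None or wanted == resource_org_id:
            return True
    return False  # nothing matched: implicit deny


# A write to a bucket owned by another organization does not match the condition.
own_bucket_write = endpoint_allows(ENDPOINT_POLICY, "s3:PutObject", "o-1234567890")
foreign_bucket_write = endpoint_allows(ENDPOINT_POLICY, "s3:PutObject", "o-attacker001")
```

The same credentials succeed against your own bucket and fail against the foreign one, which is the property that makes the endpoint policy an exfiltration control rather than an access-management one.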

Cross-account PrivateLink for internal services

You can also publish your own service via PrivateLink: an NLB in front of your service in account A, exposed as an endpoint service with its own service name. Account B creates an interface endpoint to it, and you can require that each connection request be accepted by you. Internal-SaaS-style architectures (Snowflake, Databricks, internal platforms) all use this. The receiving service sees the NLB's private IP addresses as the source, not the consumer's own addresses, so source-IP controls cannot tell consumers apart. For identity, rely on the endpoint service's allowed-principals list, Proxy Protocol v2 (which carries the consumer's VPC endpoint ID), or mutual TLS.

Egress Control — Where Real Attackers Get Caught

Inbound is the obvious surface, but in practice post-compromise data movement happens outbound. The goal of cloud egress controls is to make a successful compromise visible and to leave few useful exfiltration paths. Three controls follow, in increasing strength.

NAT plus FQDN allow-list

For workloads that legitimately call a small set of external APIs, a forward proxy with FQDN allow-listing (Squid, AWS Network Firewall with Suricata, GCP Secure Web Proxy) is the canonical answer. Egress goes through the proxy; the proxy enforces "only api.stripe.com, api.github.com". Random calls to attacker.com are rejected and logged. Pair with TLS inspection only if your data classification justifies it; otherwise SNI-only filtering catches most exfil while preserving certificate trust.
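
The matching logic at the heart of an FQDN allow-list is small but easy to get wrong. The sketch below shows one reasonable set of semantics (exact match, plus `*.` wildcards that match subdomains only); real proxies and firewalls each define their own, so treat the rules here as an assumption.

```python
from typing import Iterable


def host_allowed(host: str, allow_list: Iterable[str]) -> bool:
    """Exact match, or '*.example.com' matching any subdomain of example.com."""
    host = host.lower().rstrip(".")
    for entry in allow_list:
        entry = entry.lower()
        if entry.startswith("*."):
            # Keep the leading dot so 'evilgithub.com' cannot match '*.github.com'.
            if host.endswith(entry[1:]):
                return True
        elif host == entry:
            return True
    return False


ALLOW_LIST = ["api.stripe.com", "*.github.com"]
```

Comparing whole labels rather than raw suffixes matters: `api.stripe.com.attacker.com` and `evilgithub.com` must both be rejected, and with this function they are.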

DNS firewalls — Route 53 Resolver DNS Firewall

Block resolution of malicious or unknown domains at the resolver. Route 53 Resolver DNS Firewall accepts AWS-managed domain lists and custom lists (which you can populate from third-party feeds such as abuse.ch). It is cheap, broad, and catches a lot of commodity attacker tooling that uses DGAs (domain generation algorithms). It does not catch IP-direct callouts; combine with egress firewalls.

Block direct internet egress

The strongest move: only managed services through PrivateLink, never raw internet. Workloads route nowhere outbound except VPC endpoints; package mirrors live in private artifact registries. This is increasingly common in regulated environments and feasible in others if you commit to a private artifact pipeline.

🚨

The metadata service is an SSRF target

Every IaaS metadata service — AWS IMDS, GCP metadata.google.internal, Azure IMDS — sits at a well-known link-local IP. An SSRF in your app or a misconfigured proxy turns into credential theft: the attacker reads the role's STS credentials right out of the metadata response. Mitigations: require IMDSv2 (token + hop-limit 1) on AWS; on GCP, add --metadata=disable-legacy-endpoints=true; on Azure, scope IMDS reads with NSGs and managed-identity hardening. The next chapter (Service Mesh / Zero Trust) treats this as the canonical SSRF case.
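
The reason IMDSv2 blunts SSRF is visible in the request shape: a session token must first be obtained with a PUT carrying a custom header, which a typical "fetch this URL for me" bug cannot issue. The sketch below builds the two requests with the standard library; the `fetch` helper only works on an EC2 instance and is not called here.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1: PUT with a TTL header to obtain a session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path: str, token: str) -> urllib.request.Request:
    """Step 2: GET the metadata path, presenting the token in a header."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/{path.lstrip('/')}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch(path: str, timeout: float = 2.0) -> str:
    """Run both steps against the real metadata service (EC2 only)."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(metadata_request(path, token), timeout=timeout) as resp:
        return resp.read().decode()
```

With IMDSv2 required, a plain GET without the token header is refused, and the hop limit of 1 keeps the token response from travelling beyond the instance itself.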

Multi-VPC: Peering, Transit Gateway, and Cloud WAN

Real estates have more than one VPC. The connectivity options progress in scale and cost.

VPC Peering. 1:1 link, no transitive routing. Fine for two VPCs; quadratic mess at ten.
Transit Gateway (AWS) / Network Connectivity Center (GCP) / Virtual WAN (Azure). A hub-and-spoke router. Each VPC attaches once; routing tables on the gateway segment which spokes can reach which. This is the default for any non-trivial estate.
Cloud WAN. A higher-level abstraction layered over Transit Gateways across regions, with policy-driven attachments. Useful past about ten regions.
VPC sharing (RAM in AWS, Shared VPC in GCP, virtual networks in Azure). One networking account owns the VPC; workload accounts attach. Centralises networking ops.
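
The "quadratic mess" of peering is simple arithmetic: a full mesh of n VPCs needs n(n-1)/2 peering connections, while a hub needs one attachment per VPC. A two-function sketch:

```python
def full_mesh_peerings(vpcs: int) -> int:
    """Peering connections needed so that every VPC can reach every other."""
    return vpcs * (vpcs - 1) // 2


def hub_attachments(vpcs: int) -> int:
    """Attachments needed when every VPC connects once to a central hub."""
    return vpcs


for n in (2, 5, 10, 20):
    print(n, full_mesh_peerings(n), hub_attachments(n))
```

At ten VPCs the mesh already needs 45 peering connections, each with its own route entries on both sides, against ten hub attachments.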

💡

Transit Gateway is also a security boundary

TGW route tables let you build segmented networks: separate route tables for prod, nonprod, and shared-services, with explicit propagations only where needed. A misconfigured spoke cannot route into another segment. Combined with account-level separation it is the cloud equivalent of network zones, and Transit Gateway Flow Logs make cross-VPC traffic observable in one place.
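
The segmentation model can be reasoned about with two maps: which route table each attachment is associated with, and which attachments propagate routes into each table. The sketch below is a simplified model with hypothetical attachment and table names, not a rendering of any real configuration.

```python
from typing import Dict, Set

# Each attachment is associated with exactly one TGW route table.
ASSOCIATION: Dict[str, str] = {
    "prod-a": "prod-rt",
    "prod-b": "prod-rt",
    "dev-a": "nonprod-rt",
    "shared": "shared-rt",
}

# Each route table only knows the attachments that propagate into it.
PROPAGATIONS: Dict[str, Set[str]] = {
    "prod-rt": {"prod-a", "prod-b", "shared"},
    "nonprod-rt": {"dev-a", "shared"},
    "shared-rt": {"prod-a", "prod-b", "dev-a"},
}


def can_route(src: str, dst: str) -> bool:
    """Does the route table associated with src hold a route towards dst?"""
    return dst in PROPAGATIONS[ASSOCIATION[src]]


def can_talk(a: str, b: str) -> bool:
    """Two-way traffic needs a route in both directions."""
    return can_route(a, b) and can_route(b, a)
```

In this model `dev-a` can reach `shared` but has no route to `prod-a`, regardless of what security groups inside either VPC allow.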

VPC Flow Logs — Your Network Audit Trail

Every accept/reject decision in the VPC can be logged. The fields cover source/destination IP, ports, packet/byte counts, action, and (in v5) the AWS service the traffic is heading to. Enable on every VPC, ship to S3 with object-lock and a 90+ day retention, and parse with Athena or a SIEM. Patterns to alert on:

REJECTs from inside a private subnet — workload trying to reach the internet, often early-stage exfil reconnaissance.
Spike in destination IPs from one ENI — possible scanning or staged exfil.
Egress to a low-reputation ASN — usually paired with a DNS firewall.
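Instance-role credentials used from outside the instance: VPC Flow Logs do not record traffic to the metadata address (169.254.169.254), so detect stolen IMDS credentials through CloudTrail source IPs or GuardDuty's InstanceCredentialExfiltration findings instead.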
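
As a starting point for the first two patterns, the sketch below parses records in the default (version 2) flow log format and pulls out rejected egress and per-ENI destination fan-out. The sample lines, the ENI ID, and the 10.0.0.0/16 range are made up; a production pipeline would more likely run equivalent queries in Athena or a SIEM.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, List

FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse(line: str) -> Dict[str, str]:
    """Split one default-format flow log line into named fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def rejected_egress(records: Iterable[Dict[str, str]], private_cidr: str) -> List[Dict[str, str]]:
    """REJECT records whose source is inside private_cidr and destination is outside."""
    net = ipaddress.ip_network(private_cidr)
    hits = []
    for rec in records:
        if rec["action"] != "REJECT":  # also skips NODATA records, which carry "-"
            continue
        src = ipaddress.ip_address(rec["srcaddr"])
        dst = ipaddress.ip_address(rec["dstaddr"])
        if src in net and dst not in net:
            hits.append(rec)
    return hits


def destination_fanout(records: Iterable[Dict[str, str]], private_cidr: str) -> Dict[str, int]:
    """Count distinct destinations per ENI for traffic sourced inside private_cidr."""
    net = ipaddress.ip_network(private_cidr)
    seen = defaultdict(set)
    for rec in records:
        if rec["srcaddr"] == "-" or rec["dstaddr"] == "-":
            continue
        if ipaddress.ip_address(rec["srcaddr"]) in net:
            seen[rec["interface_id"]].add(rec["dstaddr"])
    return {eni: len(dsts) for eni, dsts in seen.items()}


SAMPLE = [
    "2 123456789010 eni-0aaa 10.0.1.5 93.184.216.34 49152 443 6 10 840 1418530010 1418530070 REJECT OK",
    "2 123456789010 eni-0aaa 10.0.1.5 10.0.2.9 49153 5432 6 10 840 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-0aaa - - - - - - - 1418530010 1418530070 - NODATA",
]
records = [parse(line) for line in SAMPLE]
```

A sudden jump in the fan-out count for one ENI, or any non-empty `rejected_egress` result from an isolated subnet, is worth an alert.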

Quick check

An app server in a private subnet legitimately calls S3 every minute. The bucket policy is correct, but after a TGW migration the call now times out. The route to S3 is reportedly unchanged. What is the most likely cause and the cheapest fix?

Answer

Most likely cause: the new TGW segmentation removed the path to the gateway endpoint for S3 — or the gateway endpoint was not migrated and traffic now hits the NAT, which is sized for a different load. Cheapest fix: ensure each VPC has its own S3 gateway endpoint (gateway endpoints are free and per-VPC), and confirm the route table of the private subnet has the prefix-list route for S3. Gateway endpoints are scoped to a single VPC; they do not propagate via TGW.

The Cross-Provider View

The vocabulary varies; the architecture survives.

Concept	AWS	GCP	Azure
L4 firewall around workload	Security Group	Firewall rule (network tag / SA)	NSG (subnet/NIC)
L4 firewall around subnet	NACL	(via firewall rule order)	NSG on subnet
Private path to managed service	VPC Endpoint (Gateway / Interface)	Private Service Connect / Private Google Access	Private Endpoint / Service Endpoint
Hub for many networks	Transit Gateway / Cloud WAN	Network Connectivity Center	Virtual WAN / vNet peering
FQDN egress firewall	Network Firewall	Secure Web Proxy / Cloud NGFW	Azure Firewall
Service-private exposure	PrivateLink (NLB-fronted)	Private Service Connect	Private Link Service
Network observation logs	VPC Flow Logs	VPC Flow Logs	NSG Flow Logs

Mnemonic — VPC review

"Public, Private, Isolated; Egress, Endpoints, Audit."

Three subnet tiers per AZ.
Egress is the exfil surface — proxy or block it.
Endpoints over internet for managed services.
Flow logs into a SIEM, retained outside the account.

A Cloud-Native Reference Pattern

Pulling it all together for a typical web service:

Public ALB in public subnets, terminating TLS with an ACM cert.
App tier in private subnets, running Fargate tasks or EC2 ASGs, security-group-as-source from the ALB SG.
RDS Aurora in isolated subnets, with KMS at-rest encryption and TLS in-transit.
Interface endpoints for KMS, Secrets Manager, STS, ECR, CloudWatch Logs, plus gateway endpoints for S3 and DynamoDB.
NAT gateway present but its egress filtered by AWS Network Firewall — only api.stripe.com, api.sendgrid.com, and your own SaaS allow-list.
Route 53 Resolver DNS Firewall blocking the AWS-managed domain block list.
VPC flow logs to S3 in a separate logging account, retained 13 months.
SCP at the org pinning allowed regions and forbidding internet-routable resources in non-public OUs (for example, denying ec2:RunInstances when the ec2:AssociatePublicIpAddress condition key is true).

Flashcard

A teammate proposes putting RDS in a public subnet and securing it with a strong password and an SG that only allows port 5432 from 0.0.0.0/0. Why is this still bad even with a strong password?


Answer

Internet exposure means a single bug is enough. A zero-day in the database engine, a regression in TLS, or a misconfigured client now faces a global attacker pool. RDS in an isolated subnet, behind an SG that only accepts the app SG and with KMS encryption at rest, narrows the threat to anyone who has already compromised the app tier, drastically reducing the population of credible attackers. The password is one wall; defence in depth is many.

🔑

Key takeaways

1) Routing decides public vs private — always verify the route table and IGW/NAT/no-default. 2) SGs and NACLs are different tools: SGs are stateful identity-aware; NACLs are stateless subnet guardrails. 3) VPC endpoints + endpoint policies are the modern way to talk to managed services and a powerful exfil control via aws:SourceVpce / org pinning. 4) Egress, DNS and IMDS are where post-compromise activity lives — proxy, block or filter all three. 5) Flow logs in a separate account are non-negotiable; you cannot investigate without them.

