The Engineering Codex/From Code to Internet: Deployment & Operations
DAY 2
02 / 07

Linux Servers & EC2 — The Box Your Code Runs On

15 min · Beginner · 3,213 words
Underneath every container, every Lambda, every PaaS abstraction is a Linux machine someone provisioned. Learn the parts of an EC2 instance you'll actually touch — AMIs, instance types, VPCs, security groups, key pairs, IAM roles — then the Linux survival kit: SSH, processes, services, file permissions, package management, and the hardening you do before the box is allowed near the public internet.

What you will learn

01 The Anatomy of an EC2 Instance
02 Networking — VPCs, Subnets, and the Firewall
03 SSH — Your First Login
04 The Linux Survival Kit
05 IAM Roles — The End of Long-Lived AWS Keys
06 Hardening Checklist for the First 30 Minutes

Cloud abstractions stack high. Underneath the Lambda is a Firecracker microVM, underneath the microVM is a server, underneath the server is the same kernel you'd boot on your laptop. Every once in a while — when a deploy goes sideways, when a process won't start, when a permission is wrong — you have to drop down to that machine and figure out what's actually happening. Today is the layer where that happens. We'll provision an EC2 instance from zero, walk through the AWS networking primitives that decide whether anyone can reach it, and learn enough Linux to keep a production server alive.

🔑
Today's outcome
By the end you should be able to: 1) Provision an EC2 instance with the right type, AMI, VPC, subnet, security group, and key. 2) SSH in, navigate the filesystem, manage users and permissions. 3) Run a process as a service that survives reboots. 4) Read logs with journalctl and basic shell tools. 5) Harden the box: SSH key-only auth, automatic security updates, and IAM role over static keys.

The Anatomy of an EC2 Instance

EC2 — Elastic Compute Cloud — is AWS's name for "a virtual machine you rent by the second." Conceptually, an instance is six decisions wrapped together.

[Diagram: a VPC (10.0.0.0/16) with a public subnet (10.0.1.0/24) holding the EC2 instance i-0a3f… (type: t3.small, 2 vCPU, 2 GB; AMI: Ubuntu 24.04 LTS; EBS: 20 GB gp3; SG: web-sg on 22, 80, 443; IAM: app-instance-role) and a private subnet (10.0.2.0/24) holding RDS Postgres (no public IP) and ElastiCache Redis (private subnet only). Internet Gateway, NAT, and route tables are the wires that decide who can talk to whom.]
A minimal production VPC: public subnet for the app, private subnet for stateful services it depends on.

The six decisions

| Decision | What it controls | Beginner default |
|---|---|---|
| AMI | The boot image: OS, kernel, pre-installed software | Ubuntu 24.04 LTS or Amazon Linux 2023 |
| Instance type | vCPU, RAM, network, disk class | t3.small or t4g.small (2 vCPU, 2 GB) |
| VPC + subnet | Which network the box lives on, IP range, public vs private | Default VPC, public subnet |
| Security group | Stateful firewall rules: which ports are open, from where | 22 from your IP, 80 + 443 from anywhere |
| Key pair | The SSH keypair you'll use to log in initially | Generate one; store the .pem somewhere you'll find it |
| IAM role | Permissions the instance has to call other AWS services | One scoped to S3 read/write + CloudWatch logs, no AdministratorAccess |

AMIs — the boot image

An AMI (Amazon Machine Image) is a pre-built disk snapshot you boot from. Two flavours matter for production:

  • Stock OS images: Ubuntu LTS, Amazon Linux, Debian. You install your stack post-boot via cloud-init, Ansible, or a config script. Easy to start, but every instance pays the install cost.
  • Custom ("baked") AMIs: you snapshot a fully-configured machine and boot from that. Tools: Packer, EC2 Image Builder. Faster boots, immutable deploys, harder to update — but the right pattern for autoscaling groups.

Instance types — the alphabet soup

EC2 type names encode purpose and generation. t3.small reads as: t family (burstable general-purpose), 3rd gen, small size (2 vCPU, 2 GB). The families:

| Family | For | Notes |
|---|---|---|
| t (t3, t4g) | Burstable general-purpose | Cheap, but throttles after CPU credits run out; not for steady-state CPU loads |
| m (m6i, m7g) | Steady general-purpose | The default for production app servers |
| c (c7i, c7g) | Compute-optimized | More CPU per dollar; for CPU-bound workloads (encoding, ML inference) |
| r (r7i, r7g) | Memory-optimized | For databases, caches, in-memory workloads |
| i / d | Storage-optimized | NVMe SSD or HDD per VM; for big-data and self-hosted DBs |
| g / p | GPU | Inference (g) or training (p) |
| The g suffix | Graviton (ARM) | ~20% cheaper than the x86 equivalent for the same perf; use when your stack supports ARM |
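The naming scheme can be made concrete with a small parser. This is only a sketch: parse_instance_type is a hypothetical helper, and it assumes the common letters-digits-suffix.size form (t3.small, m7g.large). Names that do not fit that pattern are out of scope.

```shell
# Split an instance type such as m7g.large into its parts.
# Hypothetical helper; handles only the <family><generation><suffix>.<size> form.
parse_instance_type() (
  name=$1
  prefix=${name%%.*}        # part before the dot, e.g. m7g
  size=${name#*.}           # part after the dot, e.g. large
  family=$(printf '%s' "$prefix" | sed 's/^\([a-z]*\)[0-9].*$/\1/')
  generation=$(printf '%s' "$prefix" | sed 's/^[a-z]*\([0-9][0-9]*\).*$/\1/')
  suffix=$(printf '%s' "$prefix" | sed 's/^[a-z]*[0-9][0-9]*//')
  printf 'family=%s generation=%s suffix=%s size=%s\n' \
    "$family" "$generation" "${suffix:--}" "$size"
)

parse_instance_type t3.small    # family=t generation=3 suffix=- size=small
parse_instance_type m7g.large   # family=m generation=7 suffix=g size=large
```

A trailing g in the suffix field is the Graviton marker from the last row of the table; an empty suffix is printed as "-".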
💡
Burstable instances aren't free CPU
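A t3.small earns CPU credits at a fixed rate (24 per hour, enough to sustain a 20% baseline on each of its two vCPUs) and spends them when busy. Run hot for too long and the balance hits zero: in standard mode the instance is throttled to the baseline, and in unlimited mode (the default for T3 and T4g) it keeps bursting but you are billed for the surplus. Fine for a low-traffic site or a CI worker; lethal for a busy app server. Watch CPUCreditBalance in CloudWatch; if it trends down, move to a steady-state family such as m.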
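The credit arithmetic behind that warning fits in a few lines of shell. The figures are assumptions taken from AWS's published T3 numbers rather than from this lesson: one credit is one vCPU at 100% for one minute, a t3.small has a 20% baseline per vCPU, and its balance caps at 576 credits. Both function names are hypothetical.

```shell
# Credits per hour that <vcpus> vCPUs at <pct>% earn (at baseline) or burn (under load).
credits_per_hour() {
  echo $(( $1 * $2 * 60 / 100 ))
}

# Whole hours until <balance> credits run out at <util>% average CPU,
# or "never" when the load is at or below the baseline.
hours_until_empty() (
  balance=$1; vcpus=$2; baseline=$3; util=$4
  earned=$(credits_per_hour "$vcpus" "$baseline")
  spent=$(credits_per_hour "$vcpus" "$util")
  if [ "$spent" -le "$earned" ]; then
    echo never
  else
    echo $(( $balance / ($spent - $earned) ))
  fi
)

credits_per_hour 2 20            # 24  (t3.small earn rate)
hours_until_empty 576 2 20 50    # 16  (full balance, 50% average CPU)
hours_until_empty 576 2 20 15    # never
```

Under these assumptions a full balance buys about 16 hours at 50% CPU, which is why a steadily falling CPUCreditBalance is the signal to act on.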

Networking — VPCs, Subnets, and the Firewall

Every EC2 instance lives in a VPC (Virtual Private Cloud), which is a private network you own a slice of. AWS gives you a default VPC in every region; for serious work you'll create your own.

The pieces in order

  • VPC. A CIDR block like 10.0.0.0/16 — 65k addresses you control.
  • Subnets. Slices of the VPC bound to a single Availability Zone. Public subnets have a route to an Internet Gateway and instances there can have public IPs. Private subnets don't, so instances there can only reach the internet via a NAT Gateway (outbound) and never receive inbound public traffic.
  • Internet Gateway (IGW). The wire from the VPC to the public internet. Attached at the VPC level; subnet-level routing decides who uses it.
  • NAT Gateway. Lets private-subnet instances reach the internet outbound (for package downloads, API calls) without being reachable inbound.
  • Route tables. The rules that send traffic the right way. Public subnets route 0.0.0.0/0 to the IGW; private subnets route 0.0.0.0/0 to the NAT.
  • Security groups. Per-instance stateful firewalls. "Allow TCP 22 from 1.2.3.4/32; allow 443 from 0.0.0.0/0; allow all outbound."
  • Network ACLs. Per-subnet stateless firewalls. Most teams leave them open and rely on security groups; they exist for compliance overlays.
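The "65k addresses" figure for a /16 is plain prefix arithmetic, sketched below with two hypothetical helpers. One detail comes from AWS's VPC documentation rather than from this lesson: AWS reserves five addresses in every subnet (the first four and the last), so fewer are assignable than the raw block size suggests.

```shell
# Number of addresses in an IPv4 CIDR block with the given prefix length.
cidr_size() {
  echo $(( 1 << (32 - $1) ))
}

# Addresses you can actually assign in an AWS subnet of that size
# (assumes the five-per-subnet reservation described above).
subnet_usable() {
  echo $(( $(cidr_size "$1") - 5 ))
}

cidr_size 16        # 65536  (the VPC, e.g. 10.0.0.0/16)
cidr_size 24        # 256    (one subnet, e.g. 10.0.1.0/24)
subnet_usable 24    # 251
```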

The mental model that prevents 80% of mistakes

For a normal web app, the layout is:

  1. Public subnet in 2+ AZs holds your load balancer (ALB) and optionally bastion hosts.
  2. Private app subnet in those same AZs holds your EC2 app instances. They have no public IPs; the ALB forwards traffic to them.
  3. Private data subnet holds RDS, ElastiCache, etc. Only reachable from the app subnet.

For a learning deployment with one box, you put the EC2 directly in a public subnet with a public IP. That's fine for day one — just remember it means the box is on the public internet, and the security group is the only thing between it and the world's port scanners.

⚠️
Security group as source
Security groups can reference other security groups as sources. "Allow 5432 from app-sg" is far better than "Allow 5432 from 10.0.1.0/24": the former tracks membership automatically as instances come and go, the latter goes stale silently. When both sides are in your VPC, always use the security group itself as the source (the CLI and API typically identify it by its group ID) rather than a CIDR.

SSH — Your First Login

SSH (Secure Shell) is a remote login protocol that uses public-key authentication. You generate a keypair locally; the public key goes on the server; you log in by proving you hold the private key. Two key facts:

  • Private key never leaves your machine. If it does, treat it as compromised.
  • Permissions matter. SSH refuses to use a private key whose file mode is too permissive. chmod 600 ~/.ssh/id_ed25519.
bash — generate, copy, log in
# 1) Generate a modern key (ed25519 — small, fast, secure)
ssh-keygen -t ed25519 -C "sumit@laptop" -f ~/.ssh/id_ed25519

# 2) Provision the EC2 with the corresponding public key (paste into AWS console,
#    or upload via aws cli):
aws ec2 import-key-pair --key-name sumit-laptop \
  --public-key-material fileb://~/.ssh/id_ed25519.pub

# 3) Log in (Ubuntu uses the user 'ubuntu'; Amazon Linux uses 'ec2-user')
ssh -i ~/.ssh/id_ed25519 ubuntu@203.0.113.42

# Tidier: configure ~/.ssh/config so you can `ssh acme-prod`
cat >> ~/.ssh/config <<'EOF'
Host acme-prod
  HostName 203.0.113.42
  User     ubuntu
  IdentityFile ~/.ssh/id_ed25519
  IdentitiesOnly yes
EOF

SSH agents and forwarding

Type ssh-add ~/.ssh/id_ed25519 once and your local SSH agent holds the unlocked key for the session — no more passphrase prompts. Avoid agent forwarding (-A) unless you trust the destination box completely; a compromised server with your forwarded agent can sign auth requests as you. Prefer ProxyJump (-J bastion) — your private key never leaves your laptop.

🚨
Disable password SSH on the first day
Within minutes of an EC2 going public, automated bots will hammer port 22 with username/password attempts. fail2ban helps but isn't enough. Create /etc/ssh/sshd_config.d/00-hardening.conf containing PasswordAuthentication no, PermitRootLogin no, and ChallengeResponseAuthentication no (spelled KbdInteractiveAuthentication in OpenSSH 8.7 and later), check the syntax with sudo sshd -t, then sudo systemctl reload ssh. From now on, only key-holders can log in. Keep your existing session open and confirm you can still log in from a fresh terminal before closing it; recovery from a misconfigured sshd is painful.
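As a sketch, the drop-in file described above can be staged in a scratch directory and reviewed before it goes anywhere near /etc/ssh. Two assumptions are mine: SSHD_DROPIN_DIR is a hypothetical variable for this illustration, and the file uses KbdInteractiveAuthentication, the current OpenSSH spelling of ChallengeResponseAuthentication (use the older name on older servers).

```shell
# Stage the drop-in somewhere harmless first (SSHD_DROPIN_DIR is hypothetical).
SSHD_DROPIN_DIR="${SSHD_DROPIN_DIR:-$(mktemp -d)}"

cat > "$SSHD_DROPIN_DIR/00-hardening.conf" <<'EOF'
PasswordAuthentication no
PermitRootLogin no
KbdInteractiveAuthentication no
EOF

cat "$SSHD_DROPIN_DIR/00-hardening.conf"    # review before installing

# On the server itself (needs root, so left as comments here):
#   sudo install -m 644 "$SSHD_DROPIN_DIR/00-hardening.conf" /etc/ssh/sshd_config.d/
#   sudo sshd -t                  # syntax check; silent when the config is valid
#   sudo systemctl reload ssh
```

The 00- prefix matters because sshd keeps the first value it reads for each option, so a low-numbered drop-in is read before any later file that sets the same option.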

The Linux Survival Kit

Once you're in, you're staring at a Bourne-Again Shell prompt and a few decades of accumulated UNIX. Five concept clusters cover most of what a deploying engineer needs.

1. The filesystem hierarchy

| Path | What's there |
|---|---|
| /etc | System configuration. Edit with care; restart services after. |
| /var/log | Logs: application, system, kernel. |
| /var/lib | Persistent state: package data, database files. |
| /usr/local | Software you install outside the package manager. |
| /opt | Vendor-provided large applications. |
| /home/<user> | Per-user files. Your ~. |
| /tmp | Scratch, wiped on reboot. Don't put anything important here. |
| /proc · /sys | Kernel-exposed virtual filesystems. Where top reads from. |

2. Users, groups, permissions

Every file has an owner, a group, and three permission triplets (read/write/execute) for owner / group / others. Read it as rwxr-xr-x = 755:

bash — permissions in practice
# Inspect
ls -l /etc/nginx/nginx.conf
# -rw-r--r-- 1 root root 1234 ...   # owner = root, group = root, mode 644

# Change
sudo chown deploy:deploy /var/www/app          # owner deploy, group deploy
sudo chmod 750 /var/www/app                    # rwxr-x---

# Octal cheat sheet
# r=4 w=2 x=1 → 7=rwx, 6=rw-, 5=r-x, 4=r--, 0=---

Two patterns to know: files used by a service should be owned by that service's user (e.g., www-data for nginx-served files), and secrets should be mode 600 (rw-------) so only the owner can read them. sudo elevates to root for one command; never su - into a long root shell unless you have to.
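To make the octal cheat sheet concrete, here is a minimal sketch that expands a three-digit mode into the string ls -l prints. Both function names are hypothetical, and special bits (setuid, setgid, sticky) are deliberately ignored.

```shell
# Convert one octal digit (0-7) to its rwx triplet.
digit_to_rwx() {
  case $1 in
    0) echo '---' ;; 1) echo '--x' ;; 2) echo '-w-' ;; 3) echo '-wx' ;;
    4) echo 'r--' ;; 5) echo 'r-x' ;; 6) echo 'rw-' ;; 7) echo 'rwx' ;;
    *) echo "not an octal digit: $1" >&2; return 1 ;;
  esac
}

# Convert a three-digit mode such as 750 to owner/group/others triplets.
mode_to_rwx() (
  mode=$1
  owner=$(printf '%s' "$mode" | cut -c1)
  group=$(printf '%s' "$mode" | cut -c2)
  others=$(printf '%s' "$mode" | cut -c3)
  printf '%s%s%s\n' \
    "$(digit_to_rwx "$owner")" "$(digit_to_rwx "$group")" "$(digit_to_rwx "$others")"
)

mode_to_rwx 755   # rwxr-xr-x
mode_to_rwx 750   # rwxr-x---
mode_to_rwx 600   # rw-------
```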

3. Processes and services

Every running program is a process with a PID. Tools for inspecting:

bash — process tools
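ps aux | grep '[n]ginx'            # processes matching nginx (the [n] keeps grep itself out of the list)
top                                # live view sorted by CPU; htop is a friendlier alternative if installed
sudo ss -tlnp                      # listening TCP sockets + owning process (root shows other users' processes)
sudo lsof -i :80                   # what's bound to port 80
kill -TERM <pid>                   # polite shutdown
kill -KILL <pid>                   # last resort (same as kill -9); bypasses cleanup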

For services that must outlive your shell, you don't run them with ./myapp & — you write a systemd unit (Day 5). For now: systemctl status nginx, systemctl restart nginx, systemctl enable nginx (start at boot).

4. Logs

Two main log surfaces:

  • journald — systemd's binary log database. Query with journalctl -u nginx -f (follow nginx logs) or journalctl -p err --since "1 hour ago".
  • Plaintext under /var/log: /var/log/syslog, /var/log/nginx/access.log, etc. Read with tail -f, less +F, grep.
bash — log triage in 4 commands
# When a service won't start
journalctl -u myapp.service -n 200 --no-pager

# When you suspect disk pressure
df -h                                # disk usage by mount
du -sh /var/log/*  | sort -h         # biggest log files

# When something's slow but you don't know what
top -o %CPU                          # who's burning CPU
iostat 1                             # disk activity
ss -s                                 # connection counts

# When the service is hot but the box looks idle
strace -p <pid> -c -f                # what syscalls is it making?

5. Package management

Ubuntu/Debian use apt; Amazon Linux 2023 and Fedora use dnf. Pin versions, prefer the OS package over curl | sh when you can.

bash — apt essentials
sudo apt update                            # refresh package index
sudo apt install -y nginx                  # install nginx
apt list --installed | grep nginx          # check what's there
sudo apt upgrade                            # upgrade everything
sudo unattended-upgrades --dry-run         # what auto-updates would do

IAM Roles — The End of Long-Lived AWS Keys

The single biggest day-2 security upgrade for an EC2 box: never put AWS access keys on the instance. Attach an IAM role to the instance instead. The role grants permissions to call AWS APIs (read this S3 bucket, write to that DynamoDB table, fetch this Secrets Manager secret), and the EC2 instance metadata service rotates short-lived credentials automatically.

[Diagram: the SDK on the EC2 instance reads from the IMDSv2 endpoint at 169.254.169.254, which issues short-lived credentials (~6 hr TTL) used to call the AWS API (S3, Secrets Manager, etc.). No long-lived keys on disk. Compromise of one box doesn't leak credentials usable beyond the role.]
IAM instance roles. Boto3, the AWS CLI, and other SDKs find credentials at the metadata endpoint automatically.

Always require IMDSv2

The instance metadata service has two versions. v1 is the original: GET-only, no auth. v2 requires a session token, which defends against SSRF attacks (where an app with a URL-fetch vulnerability is tricked into hitting the metadata endpoint and exfiltrating credentials). New instances should always require IMDSv2:

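bash — enforce IMDSv2 on an existing instance
aws ec2 modify-instance-metadata-options \
  --instance-id i-0a3f… \
  --http-tokens required \
  --http-endpoint enabled

# At launch, pass the same setting to run-instances instead:
#   --metadata-options "HttpTokens=required,HttpEndpoint=enabled"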
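To see what "requires a session token" means in practice, the two-step IMDSv2 exchange can be sketched with curl. The helper names imds_token and imds_get are hypothetical; the endpoint paths and header names follow AWS's IMDSv2 documentation. The functions only return data when called on an EC2 instance, so the example calls are left as comments.

```shell
# Link-local metadata endpoint; reachable only from inside the instance.
IMDS_BASE="http://169.254.169.254/latest"

# Step 1: PUT for a session token. The TTL header is required; 21600 s is the maximum.
imds_token() {
  curl -sf -X PUT "$IMDS_BASE/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"
}

# Step 2: GET a metadata path, presenting the token in a header.
imds_get() {
  curl -sf -H "X-aws-ec2-metadata-token: $(imds_token)" "$IMDS_BASE/meta-data/$1"
}

# On an instance with a role attached:
#   imds_get iam/security-credentials/             # prints the role name
#   imds_get iam/security-credentials/ROLE_NAME    # JSON holding the short-lived credentials
```

The PUT-then-header handshake is the SSRF defence: a vulnerable URL-fetcher that can only issue plain GET requests cannot obtain the token in the first place.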

Hardening Checklist for the First 30 Minutes

Before the box runs anything important, do this once. The boring list separates a hobby project from a production posture.

  • SSH keys only. PasswordAuthentication no, PermitRootLogin no.
  • SSH from your IP, not the world. Restrict the security group's port 22 to your office/VPN CIDR. For one-off access from anywhere, use SSM Session Manager (no port 22 needed at all).
  • Automatic security updates. sudo apt install unattended-upgrades; sudo dpkg-reconfigure --priority=low unattended-upgrades.
  • IAM role over keys. Attach an instance profile; remove any ~/.aws/credentials file you might have left behind.
  • IMDSv2 required. One CLI call, infinitely worth it.
  • Sensible time sync. chrony on by default; verify with chronyc tracking. TLS, logs, auth tokens all go wrong with skewed clocks.
  • Limited inbound. SG opens only what's needed: 80, 443, sometimes 22. Outbound is usually wide-open; restrict if you have a compliance reason.
  • Known-good base AMI. Prefer the official Ubuntu / Amazon Linux AMIs from the AWS Marketplace; community AMIs are unverified.
  • EBS encryption. Default-on at the account level: aws ec2 enable-ebs-encryption-by-default. Costs nothing.
  • CloudWatch agent. Sends metrics + logs to CloudWatch out of the box; you'll need it on Day 7 anyway.
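The dpkg-reconfigure step in the checklist comes down to a two-line APT config file. As a sketch, assuming the stock Debian/Ubuntu layout where that file lives at /etc/apt/apt.conf.d/20auto-upgrades, the snippet below stages the same content in a scratch directory (APT_CONF_DIR is a hypothetical variable for this illustration).

```shell
# Stage the file somewhere harmless first (APT_CONF_DIR is hypothetical).
APT_CONF_DIR="${APT_CONF_DIR:-$(mktemp -d)}"

cat > "$APT_CONF_DIR/20auto-upgrades" <<'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF

cat "$APT_CONF_DIR/20auto-upgrades"

# On the server itself (needs root, so left as comments here):
#   sudo install -m 644 "$APT_CONF_DIR/20auto-upgrades" /etc/apt/apt.conf.d/
#   sudo unattended-upgrades --dry-run     # what auto-updates would do
```

Writing the file directly is handy in cloud-init or a baked AMI, where the interactive dpkg-reconfigure prompt is not available.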

From Empty Box to Live Service — End-to-End

Putting it all together, the path from "nothing" to "a service answering on port 80" looks like this:

  1. Provision a t3.small Ubuntu 24.04 instance in a public subnet with an SG allowing 22 from your IP and 80/443 from anywhere. Attach an instance profile.
  2. SSH in as ubuntu. Apply hardening: disable password auth, enable unattended upgrades, require IMDSv2.
  3. Create a deploy user: sudo adduser --disabled-password deploy, copy your SSH public key into /home/deploy/.ssh/authorized_keys (mode 600, inside a .ssh directory of mode 700, both owned by deploy), and add the user to the sudo group only if it needs sudo.
  4. Install the runtime: apt install nginx, language runtimes, etc. Day 3 covers nginx in depth; Day 5 covers running your app as a systemd service; Day 6 covers doing all of this in a container instead.
  5. Open the right port in the SG. Verify ss -tlnp shows your service listening, then curl from your laptop.
  6. Point DNS at the public IP (or assign an Elastic IP first so it survives a stop/start).
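Step 3 is where file modes most often go wrong, so here is a minimal sketch of the key-install part. install_authorized_key is a hypothetical helper, the key string is a placeholder, and the demo runs against a scratch directory rather than /home/deploy.

```shell
# Append a public key to <home>/.ssh/authorized_keys with owner-only modes.
install_authorized_key() (
  home_dir=$1
  pubkey=$2
  mkdir -p "$home_dir/.ssh"
  chmod 700 "$home_dir/.ssh"                                   # directory: owner only
  printf '%s\n' "$pubkey" >> "$home_dir/.ssh/authorized_keys"
  chmod 600 "$home_dir/.ssh/authorized_keys"                   # file: owner read/write only
)

# Dry run in a scratch directory; the key below is a placeholder, not a real key.
demo_home=$(mktemp -d)
install_authorized_key "$demo_home" "ssh-ed25519 AAAA...placeholder sumit@laptop"
ls -ld "$demo_home/.ssh" "$demo_home/.ssh/authorized_keys"

# On the server, run it as root against /home/deploy, then hand ownership over:
#   sudo chown -R deploy:deploy /home/deploy/.ssh
```

With its default strict checks, sshd ignores an authorized_keys file whose permissions or ownership are too loose, which shows up as a confusing "Permission denied (publickey)".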
🌱
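Public IPv4 addresses are no longer free
Since February 2024 AWS bills every public IPv4 address at $0.005 per hour (about $3.60 a month), whether it is an Elastic IP attached to a running instance, an EIP sitting idle, or an auto-assigned address. Before that, an EIP attached to a running instance was free and only idle ones cost money; the charge exists to discourage hoarding now that the IPv4 supply is exhausted globally. For a learning instance you stop nightly, either accept the dynamic public IP changing on each start or budget roughly 12 cents a day to keep an EIP reserved.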

EC2 Alternatives Worth Knowing

Pure EC2 is the most flexible option — and the most labour-intensive. Several adjacent AWS products handle parts of the work for you:

| Service | What you trade | Best for |
|---|---|---|
| Lightsail | Less control, simpler pricing | Personal sites, demos, dev boxes |
| Elastic Beanstalk | Less control, opinionated env | Quick deploy of a single app |
| ECS / Fargate | Containers, no servers to patch | Production microservices once you've done Day 6 |
| EKS | Full Kubernetes, full complexity | Large fleets with platform teams |
| Lambda | Event-driven, no long-running processes | Glue code, event handlers, sporadic workloads |
| App Runner | Container PaaS | Single-service deployments, autoscaling for free |

The skill of running an EC2 box transfers to all of them — the Linux underneath ECS Fargate is the same Linux. Master the box, then move up the abstraction ladder when the cost-of-ops outweighs the cost-of-control.

Quick check
A new EC2 instance can be SSH'd into from your laptop, but a colleague three desks over can't. Both of you are inside the company office. The instance is in a public subnet with public IP. What are the two most likely causes, and how do you test each one?
Answer
1) Security group source restriction. Your SG probably allows 22 only from your specific public IP, not the office CIDR. Test: aws ec2 describe-security-groups --group-ids sg-… and inspect IpPermissions. Fix: change the source to the office CIDR or set up SSM Session Manager. 2) SSH key. Your colleague's public key isn't in ~/.ssh/authorized_keys on the box. Test: ssh -v <user>@<ip> from their laptop — verbose output shows whether the SG dropped the connection (timeout) or sshd rejected the auth ("Permission denied (publickey)"). The two cases look different on the wire and the fix is correspondingly different.
Mnemonic — what makes an instance
"AMI, type, VPC, SG, key, role."
  • AMI — the boot image.
  • Type — vCPU, RAM, network shape.
  • VPC + subnet — what network it lives on.
  • SG — who can talk to it.
  • Key — how you log in the first time.
  • Role — what it can talk to.
Flashcard
An app on EC2 calls s3:GetObject using credentials in ~/.aws/credentials. The on-call engineer wants to remove the credentials file because the instance is now compromised. What change must precede that, and why is the IAM-role pattern strictly better?
Answer
Precede with: attach an IAM instance profile (a role) to the EC2 with the same S3 permissions the static creds had. The boto3 / AWS SDK credential chain falls back to the EC2 metadata endpoint automatically once the file is gone. Why strictly better: (1) credentials are short-lived (~6 hours) and rotated by AWS, so an exfiltrated creds blob expires fast; (2) no secret to leak in deploy logs or env-dumps; (3) you can change permissions centrally by editing the role; (4) IMDSv2 makes SSRF exfiltration significantly harder than reading a file. The static-key pattern persists only for legacy reasons and should never be the answer for new work.
🔑
Key takeaways
1) An EC2 instance is six decisions: AMI, type, VPC/subnet, security group, key pair, IAM role. 2) Security groups are the first line of defence; restrict SSH to your IP and use other SGs as sources, rather than CIDRs, when both sides are in your VPC. 3) SSH keys only, agent-cached, never agent-forwarded into untrusted servers. 4) The Linux survival kit: filesystem, permissions, processes, logs, packages. These are the same five clusters underneath every higher-level abstraction. 5) IAM roles + IMDSv2 kill the static-credentials class of breaches; do this on day one.
