CI/CD, Monitoring, & On-Call
Six days of patterns — DNS, Linux, NGINX, TLS, systemd, containers — synthesise into one shipping pipeline and the production hygiene that keeps it healthy. Today: a GitHub Actions pipeline that builds, tests, and deploys; secrets that aren't in git; infrastructure as code; the four signals every service needs (latency, traffic, errors, saturation); alerting that doesn't cry wolf; runbooks; and the post-mortem culture that turns outages into learning instead of blame.
What you will learn
You can deploy by hand to one box and call it done. Many small services run that way for years. The moment you have more than one service, more than one engineer, or any business pressure to deploy more than once a week, the manual path becomes the bottleneck — and the source of mistakes. CI/CD is the discipline of automating the path from commit to production: build it the same way every time, test it the same way every time, deploy it the same way every time, and watch it the same way every time. The watching part — what we call observability and on-call — is where this course lives or dies. A deploy you can't watch is a deploy you'll be afraid to make. We'll close the week by tying the operational loop together: pipeline → telemetry → alerts → runbooks → post-mortems → next pipeline change.
Why CI/CD
CI — Continuous Integration — is the practice of merging every change to a shared main branch frequently, with automated build and tests on every merge. CD — Continuous Delivery (or Deployment) — is the discipline of having every commit on main always ready to deploy, ideally with the deploy itself automated.
The benefits compound:
- Smaller batches. Ten one-line changes deployed individually beat one ten-line PR — bisecting failures is cheaper.
- Reproducibility. The deploy is identical at 09:00 on Tuesday and 22:00 on Friday because it's the same script. Manual steps are where surprises hide.
- Speed. A team that deploys hourly does not skip QA — it has automated QA. The two reinforce each other.
- Rollback as a first-class operation. If you deploy by clicking buttons, rollback is improvised. If you deploy by triggering a workflow, rollback is the same workflow with an older SHA.
A Pipeline Worth Reading
Here's a complete GitHub Actions workflow for a typical Node app deployed to ECS Fargate. It's longer than what you'll start with, and shorter than what you'll end with — but every block is doing real work.
name: build and deploy
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
concurrency:
  group: deploy-${{ github.ref }}
  cancel-in-progress: false   # don't cancel in-flight prod deploys
permissions:
  id-token: write   # OIDC for AWS: no static keys
  contents: read
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: npm }
      - run: npm ci
      - run: npm run lint
      - run: npm run test -- --ci
      - run: npm run typecheck
  build:
    needs: test
    if: github.ref == 'refs/heads/main'   # PRs stop after the test job
    runs-on: ubuntu-latest
    outputs:
      image: ${{ steps.tag.outputs.image }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123:role/github-actions-deploy
          aws-region: us-east-1
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      - uses: docker/setup-buildx-action@v3
      - id: tag
        run: |
          IMAGE=${{ steps.ecr.outputs.registry }}/acme:$(git rev-parse --short HEAD)
          echo "image=$IMAGE" >> "$GITHUB_OUTPUT"
      - uses: docker/build-push-action@v5
        with:
          context: .
          platforms: linux/amd64
          push: true
          tags: ${{ steps.tag.outputs.image }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
      - name: scan image
        # @master floats; pin to a release tag or commit SHA for real use
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ steps.tag.outputs.image }}
          severity: HIGH,CRITICAL
          exit-code: 1
  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment: production   # gate via GitHub environment protection
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123:role/github-actions-deploy
          aws-region: us-east-1
      - name: render task definition
        run: |
          aws ecs describe-task-definition --task-definition acme \
            --query taskDefinition > td.json
          # Swap in the new image and drop the read-only fields that
          # register-task-definition rejects as input.
          jq --arg img "${{ needs.build.outputs.image }}" \
            '.containerDefinitions[0].image = $img
             | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
                   .compatibilities, .registeredAt, .registeredBy)' td.json > td.new.json
      - name: deploy
        run: |
          REV=$(aws ecs register-task-definition --cli-input-json file://td.new.json \
            --query 'taskDefinition.taskDefinitionArn' --output text)
          aws ecs update-service --cluster acme --service acme \
            --task-definition "$REV" --force-new-deployment
          aws ecs wait services-stable --cluster acme --services acme
      - name: smoke test
        run: curl -sf https://acme.com/healthz
      - name: notify
        if: always()
        # Block-style env: a flow mapping cannot contain the ${{ }} braces.
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_DEPLOY_WEBHOOK }}
        run: |
          STATUS=${{ job.status }}
          curl -sX POST "$SLACK_WEBHOOK" -H 'Content-Type: application/json' \
            -d "{\"text\":\"deploy ${STATUS}: $GITHUB_SHA\"}"
What's worth pointing out:
- OIDC for AWS auth. The id-token: write permission lets the workflow exchange a GitHub-signed JWT for short-lived AWS credentials. No static AWS_ACCESS_KEY_ID in repo secrets; the credential never exists outside the running job. Set up the IAM role's trust policy to scope to your repo and ref.
- Concurrency group on the ref. One deploy at a time per branch; PR builds run in parallel.
- Image tagged by SHA. Every commit gets a unique image; rollback is "deploy yesterday's SHA."
- Layer cache via GHA. cache-from / cache-to: type=gha reuses Docker layers across runs, which turns 5-minute builds into 30-second ones.
- Image vulnerability scan with Trivy. Hard-fails the build on HIGH/CRITICAL.
- Environment protection. The environment: production setting ties the job to a GitHub environment with optional required reviewers, branch restrictions, and wait timers.
- Smoke test after deploy. curl /healthz is minimal, but it catches the case where ECS reports stable but the service is broken.
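The trust-policy scoping mentioned in the OIDC note can be sketched as a small generator. This is an illustration under assumptions, not the course's actual policy: the function name github_oidc_trust_policy, the repo name acme/acme, and the 12-digit account ID are all hypothetical. One detail it tries to capture is that GitHub puts a different sub claim in the token when a job declares an environment, so the build job (no environment) and the deploy job (environment: production) present different subjects, and the role has to trust both.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, repo: str, branch: str, environment: str) -> dict:
    """Build an IAM trust policy that only one repo's branch/environment jobs can assume."""
    subjects = [
        # Jobs without `environment:` (the build job) present a ref-based subject.
        f"repo:{repo}:ref:refs/heads/{branch}",
        # Jobs with `environment:` set (the deploy job) present an environment-based subject.
        f"repo:{repo}:environment:{environment}",
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        # A list of values is matched as "any of these".
                        f"{GITHUB_OIDC_HOST}:sub": subjects,
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical account ID and repo name, for illustration only.
    policy = github_oidc_trust_policy("123456789012", "acme/acme", "main", "production")
    print(json.dumps(policy, indent=2))
```

Pull-request runs present yet another subject, so with this policy they cannot assume the deploy role even if a workflow change tried to.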
Secrets — Never in Git, Always Rotatable
Three places secrets belong, in increasing rigour:
- Pipeline secrets — GitHub Actions secrets, GitLab CI variables. For CI-only credentials (registry tokens, OIDC role ARNs).
- Cloud secret manager — AWS Secrets Manager, Google Secret Manager, Azure Key Vault. The runtime fetches secrets at startup using its IAM role; rotation is a single API call; access is audited.
- Vault / sops / age — for teams that need encrypted secrets in git (sops + KMS), or platform-spanning secret stores (HashiCorp Vault).
What never works long-term: secrets in .env files committed to git, even private git. They live in shell history, in the IDE's recents, in the laptop you'll lose, in the email you forwarded to a contractor. Pretend the repo will leak; design accordingly.
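As a rough illustration of the "runtime fetches secrets at startup" pattern, the sketch below reads a secret through a Secrets Manager client. It assumes the secret is stored as a JSON object; the secret name acme/prod/app and the DATABASE_URL key are hypothetical. A fake client stands in for boto3 so the example runs without AWS access; in a real service the client would come from boto3.client("secretsmanager") and the task's IAM role would authorise the call.

```python
import json


def load_secret(client, secret_id: str) -> dict:
    """Fetch a JSON-encoded secret once at startup.

    `client` is expected to behave like boto3.client("secretsmanager").
    """
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


class FakeSecretsClient:
    """Stand-in for the boto3 client so the sketch runs without AWS credentials."""

    def get_secret_value(self, SecretId: str) -> dict:
        # Hypothetical payload; a real response carries more fields.
        return {
            "Name": SecretId,
            "SecretString": json.dumps({"DATABASE_URL": "postgres://example"}),
        }


if __name__ == "__main__":
    config = load_secret(FakeSecretsClient(), "acme/prod/app")
    # Log which keys were loaded, never the values themselves.
    print(sorted(config))
```

Because the value is read at startup rather than baked into the image or the repo, rotating it means updating the secret and restarting tasks, with no commit involved.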
Infrastructure as Code
The same logic that says "deploy by script, not by clicking" extends to provisioning infrastructure. Terraform (or Pulumi, OpenTofu, AWS CDK) describes your VPC, EC2s, ALBs, security groups, RDS, IAM roles, S3 buckets, secrets in code; terraform apply creates them; terraform plan shows the diff before any change.
terraform {
  backend "s3" {
    bucket         = "acme-tf-state"
    key            = "prod/main.tfstate"
    region         = "us-east-1"
    dynamodb_table = "acme-tf-locks"
    encrypt        = true
  }
}
provider "aws" { region = "us-east-1" }
# Look up the default VPC referenced by the security group below.
data "aws_vpc" "default" {
  default = true
}
resource "aws_security_group" "web" {
  name        = "acme-web"
  description = "public web"
  vpc_id      = data.aws_vpc.default.id
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
resource "aws_instance" "web" {
  ami                    = "ami-0abcd1234" # ubuntu 24.04, pinned
  instance_type          = "t3.small"
  vpc_security_group_ids = [aws_security_group.web.id]
  # aws_iam_instance_profile.web is assumed to be defined elsewhere in the configuration
  iam_instance_profile   = aws_iam_instance_profile.web.name
  metadata_options {
    http_tokens   = "required" # IMDSv2 only
    http_endpoint = "enabled"
  }
  tags = { Name = "acme-web" }
}
Two crucial details: the remote state backend (S3 + DynamoDB lock) prevents two engineers from racing apply commands; the pinned AMI ID and IMDSv2 enforcement match the hardening from Day 2. State is Terraform's record of what is deployed; changes clicked into the AWS console drift from it, and the next terraform plan will propose undoing them.
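One way to notice drift before it surprises someone is to run a read-only plan on a schedule and look at its exit code. The sketch below relies on terraform plan -detailed-exitcode, which exits 0 when there is nothing to change, 2 when the plan contains changes, and 1 on error. The function names and the idea of running it from a scheduled job are assumptions for illustration. Note that exit code 2 also covers code changes that simply have not been applied yet, not only console edits.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a status string."""
    if code == 0:
        return "in-sync"  # plan succeeded and found nothing to change
    if code == 2:
        return "drift"    # plan succeeded and wants to change something
    return "error"        # 1 (or anything unexpected): the plan itself failed


def check_drift(workdir: str) -> str:
    """Run a read-only plan in `workdir` and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)


if __name__ == "__main__":
    # Only the pure mapping is exercised here; check_drift() needs terraform installed.
    for code in (0, 1, 2):
        print(code, classify_plan_exit(code))
```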
Observability — The Four Golden Signals
Google's SRE book proposed a small set of metrics every service should expose. They've held up for a decade because they cover the failure modes you actually care about.
- Latency. How long do successful requests take? Track P50, P95, P99 — not just the mean.
- Traffic. How many requests per second? RPS, QPS, jobs/s: whatever the unit of work is.
- Errors. What fraction of requests fail? 5xx rate, exceptions, business-level failures.
- Saturation. How full is the system? CPU %, memory %, queue depth, connection-pool utilization, disk-fill rate.
If you instrument these four for every service, every dependency, and every external call, you have ~80% of the visibility you'll ever need. Add custom business metrics on top (signups/min, dollars/hour, AI tokens/sec).
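To make the four signals concrete, here is a small sketch that derives them from a window of raw request records. The record shape (latency in ms, HTTP status), the connection-pool numbers, and the function names are hypothetical; a real service would export counters and histograms to a metrics backend rather than compute this in process. The sample data is chosen so that the mean and P95 look healthy while P99 does not, which is the reason for tracking the tail.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s, pool_in_use, pool_size):
    """requests: list of (latency_ms, http_status) observed during the window."""
    ok_latencies = [ms for ms, status in requests if status < 500]
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "latency_mean_ms": sum(ok_latencies) / len(ok_latencies),
        "latency_p50_ms": percentile(ok_latencies, 50),
        "latency_p95_ms": percentile(ok_latencies, 95),
        "latency_p99_ms": percentile(ok_latencies, 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": pool_in_use / pool_size,
    }


if __name__ == "__main__":
    # 97 fast successes, 1 slow success, 2 slow failures over a 10-second window.
    requests = [(80, 200)] * 97 + [(3000, 200)] + [(3000, 500)] * 2
    for name, value in golden_signals(requests, window_s=10, pool_in_use=10, pool_size=20).items():
        print(f"{name}: {value}")
```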
Logs, metrics, and traces — what each is for
- Metrics — numerical, aggregated, cheap to store at scale. The four golden signals live here. Best for "is the system healthy right now?" and "is this trend worsening?"
- Logs — text or structured events, expensive to store, searchable. Best for "what happened on this exact request?" and forensic investigation.
- Traces — a request's journey across services, with each span's timing. Best for "why is THIS request slow when others aren't?" Distributed systems require traces; single-process apps usually don't.
OpenTelemetry (OTel) unifies the SDKs across languages so the same instrumentation feeds all three storage backends. For most teams: metrics in CloudWatch or Prometheus, logs in CloudWatch or Loki, traces in AWS X-Ray or Tempo, dashboards in Grafana.
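Logs answer "what happened on this exact request?" much more readily when each event is a structured record rather than free text. The sketch below uses only the standard library logging module to emit one JSON object per line; the field names (request_id, route, status, duration_ms) and the logger name are hypothetical, and an OTel SDK or a log shipper would normally sit behind this.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    EXTRA_FIELDS = ("request_id", "route", "status", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("acme.requests")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.info(
        "request finished",
        extra={"request_id": "req-123", "route": "/api/v1/checkout", "status": 502, "duration_ms": 1840},
    )
    print(buffer.getvalue(), end="")
```

With a request_id on every line, a search for that one value pulls together everything the service logged for the failing request.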
Alerting — The Discipline of Not Crying Wolf
An alert that doesn't require human action is a bug. The on-call engineer's pager is sacred — every false alarm trains them to ignore the real ones. Two rules govern good alerts:
- Alert on symptoms, not causes. Page when users are unhappy (latency over SLO, errors above threshold), not when an internal metric crosses a line. "P99 latency over 1s for 5 minutes" is a symptom; "CPU above 80%" is a cause that may or may not affect users.
- Page only what's actionable now. If the runbook says "check this dashboard, file a ticket" — that's a ticket, not a page. Page only for things that need a human at this moment.
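The "over 1s for 5 minutes" part of a symptom alert matters as much as the threshold: a single spike should not page anyone. The sketch below is a simplified model of that hold-down behaviour, similar in spirit to the for: clause in a Prometheus rule but not a reimplementation of it. The function name should_page and the sample values are hypothetical.

```python
def should_page(samples, threshold, hold_seconds):
    """Decide whether a symptom has persisted long enough to page.

    samples: list of (unix_ts, value) tuples sorted by time, e.g. P99 latency in seconds.
    Returns True only if every sample from some point up to the latest one is above
    `threshold` and that unbroken breach spans at least `hold_seconds`.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # a new breach begins here
        else:
            breach_start = None    # any healthy sample resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


if __name__ == "__main__":
    sustained = [(0, 0.4), (60, 1.2), (120, 1.3), (180, 1.1), (240, 1.4), (300, 1.2), (360, 1.5)]
    spike = [(0, 0.4), (60, 1.8), (120, 0.5), (180, 0.4)]
    print("sustained:", should_page(sustained, threshold=1.0, hold_seconds=300))
    print("spike:", should_page(spike, threshold=1.0, hold_seconds=300))
```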
SLO, SLI, error budget — the framework
- SLI (Service Level Indicator): a measurement. "% of requests completing in under 300 ms."
- SLO (Service Level Objective): your target for an SLI. "99.5% of requests under 300 ms over 28 days."
- Error budget: 100% minus the SLO. A 99.5% SLO leaves a 0.5% budget, about 200 minutes per 28 days. Burn fast → page; burn slow → ticket; not burning → ship more code.
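The arithmetic behind the budget is short enough to write down. The sketch below treats the budget as time (minutes of full outage allowed per window) and expresses burn rate as a multiple of the pace that would use up exactly the whole budget over the window. The function names are hypothetical; the 99.5% target and the 28-day window come from the definitions above.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of total failure the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are currently arriving."""
    return observed_error_ratio / (1 - slo)


def days_to_exhaustion(rate: float, window_days: int) -> float:
    """Days until the whole budget is gone if the current burn rate holds."""
    return window_days / rate


if __name__ == "__main__":
    slo, window = 0.995, 28
    print(f"budget: {error_budget_minutes(slo, window):.1f} minutes per {window} days")
    # A 7% error ratio against a 0.5% budget is a 14x burn.
    rate = burn_rate(0.07, slo)
    print(f"burn rate: {rate:.1f}x, budget gone in {days_to_exhaustion(rate, window):.1f} days")
```

For a 99.5% SLO over 28 days this works out to 201.6 minutes, and a sustained 14x burn spends it in 2 days, which is why that multiple is a common line between "page" and "ticket".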
groups:
  - name: acme-slo
    rules:
      - alert: AcmeErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{job="acme",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="acme"}[5m]))
          ) > 0.005 * 14  # 14× normal burn: we'll exhaust budget in 2 days at this rate
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "Acme error rate burning budget fast"
          runbook: "https://wiki.acme.com/runbooks/acme-error-rate"
The runbook URL in the annotation is the single most useful thing on a page. The on-call engineer at 03:00 is not in a state to invent a remediation; they want a checklist.
Runbooks — The Memory of the Team
A runbook for an alert lists: what triggers this, what the user impact is, the diagnosis steps in order, the safe rollbacks, the escalation path, related dashboards, recent context. Format doesn't matter; access does — link from the alert, keep it findable, update it after every incident.
# Acme Error Rate (5xx): Fast Burn

## What it means
The Acme service is returning 5xx at a rate that will exhaust the 28-day error budget in < 2 days at the current pace.

## User impact
Users see HTTP 500 / 502 on the affected requests. Login and checkout requests are most sensitive; the error budget for those is tighter (see SLO doc).

## First 2 minutes
1. Open dashboard: https://grafana.acme.com/d/acme-overview
2. Identify which endpoint(s) are erroring. Often /api/v1/checkout.
3. Identify the start time. Was there a recent deploy? See #deploys Slack.

## If it correlates with a deploy (most common)
1. Roll back: `gh workflow run rollback.yml -f sha=<previous>` (5 min)
2. Verify error rate drops in dashboard.
3. Open an incident channel #inc-yyyymmdd-acme.

## If no recent deploy
1. Check upstreams: payments, auth, db. See the four-golden-signals row.
2. Check capacity: Fargate task count, RDS connections, Redis OOM.
3. Check rate limits: NGINX limit_req log lines.

## Escalation
- Primary on-call: PagerDuty
- Secondary: backend team channel #team-backend
- VP Eng if user-impact > 1 hour
Post-Mortems — Blameless and Permanent
Every incident worth waking someone for is worth writing about. The shape:
- Summary. One paragraph. "On 2026-05-02, a deploy caused 17 minutes of elevated 5xx for /api/checkout, affecting roughly 1,200 users."
- Timeline. Minute-by-minute, in UTC. Detection, alerts, who joined when, what they tried.
- Root cause. What actually broke, in detail. Often more than one cause stacked.
- Impact. Numbers. Users affected, money lost, SLO budget consumed.
- What went well. Detection time, communication, rollback speed.
- What went badly. Where we got lucky, where the system surprised us.
- Action items. Specific tickets with owners and dates. Not platitudes.
Blameless is the cultural choice that makes the rest possible. The question is never "who messed up" but "what about our system made this mistake easy?" Engineers will hide near-misses if they're punished; engineers will surface them if they're celebrated for catching them.
The Operational Loop
Pulling Day 1 through Day 7 together, the operational loop looks like this:
- An engineer makes a change. CI runs lint, tests, type-checks, vulnerability scan, image build.
- The image is pushed to ECR with a SHA tag.
- The deploy job registers a new ECS task definition revision; ECS then rolls the change out one task at a time (Day 5's rolling pattern).
- The new tasks pass health checks (Day 5) and the ALB target group registers them; old tasks drain.
- NGINX (Day 3) keeps serving uninterrupted; TLS (Day 4) keeps users on HTTPS.
- OTel instrumentation flows metrics, logs, and traces to the storage backend; Grafana visualises them.
- If the four golden signals trend bad, alerts fire to the on-call's pager — but only for symptoms, only when actionable.
- The on-call follows the runbook to mitigate (often a rollback, often a rate-limit knob, often a feature flag).
- Within a few days, the team writes a post-mortem and ships the action items — making the next instance of this failure detectable, preventable, or impossible.
- The pipeline gets a small improvement to encode that learning. Tomorrow's deploy is one regression-resistant step better than today's.
The loop in five parts:
- Pipeline — automated build / test / deploy on every commit.
- Telemetry — four golden signals, plus business metrics.
- Alerts — symptoms only, actionable only.
- Runbooks — what to do at 03:00, linked from the alert.
- Post-mortems — blameless, with owned action items that ship.
- Google: Site Reliability Engineering (free book), sre.google
- Google: SRE Workbook (free), sre.google
- GitHub Actions documentation, docs.github.com
- AWS: OIDC for GitHub Actions, github.com
- Terraform documentation, terraform.io
- OpenTelemetry: language-agnostic observability, opentelemetry.io
- Prometheus: alerting best practices, prometheus.io
- Awesome SRE: curated reading list, github.com