DAY 7

07 / 07

CI/CD, Monitoring, & On-Call

14 min · Intermediate · 3,157 words

Six days of patterns — DNS, Linux, NGINX, TLS, systemd, containers — synthesise into one shipping pipeline and the production hygiene that keeps it healthy. Today: a GitHub Actions pipeline that builds, tests, and deploys; secrets that aren't in git; infrastructure as code; the four signals every service needs (latency, traffic, errors, saturation); alerting that doesn't cry wolf; runbooks; and the post-mortem culture that turns outages into learning instead of blame.

What you will learn

01 Why CI/CD

02 A Pipeline Worth Reading

03 Secrets: Never in Git, Always Rotatable

04 Infrastructure as Code

05 Observability: The Four Golden Signals

06 Alerting: The Discipline of Not Crying Wolf

You can deploy by hand to one box and call it done. Many small services run that way for years. The moment you have more than one service, more than one engineer, or any business pressure to deploy more than once a week, the manual path becomes the bottleneck — and the source of mistakes. CI/CD is the discipline of automating the path from commit to production: build it the same way every time, test it the same way every time, deploy it the same way every time, and watch it the same way every time. The watching part — what we call observability and on-call — is where this course lives or dies. A deploy you can't watch is a deploy you'll be afraid to make. We'll close the week by tying the operational loop together: pipeline → telemetry → alerts → runbooks → post-mortems → next pipeline change.

🔑

Today's outcome

1) A GitHub Actions pipeline that builds, tests, scans, pushes an image, and deploys. 2) Secrets management — never in git, always rotatable. 3) Infrastructure as code — Terraform basics, why state matters. 4) The four golden signals: latency, traffic, errors, saturation. 5) Alerting that respects the on-call's pager. 6) Runbooks for the predictable failures and post-mortems for the new ones. 7) The operational loop that turns one good week into many.

Why CI/CD

CI — Continuous Integration — is the practice of merging every change to a shared main branch frequently, with automated build and tests on every merge. CD — Continuous Delivery (or Deployment) — is the discipline of having every commit on main always ready to deploy, ideally with the deploy itself automated.

The benefits compound:

Smaller batches. Ten one-line changes deployed individually beat one ten-line PR — bisecting failures is cheaper.
Reproducibility. The deploy is identical at 09:00 on Tuesday and 22:00 on Friday because it's the same script. Manual steps are where surprises hide.
Speed. A team that deploys hourly does not skip QA — it has automated QA. The two reinforce each other.
Rollback as a first-class operation. If you deploy by clicking buttons, rollback is improvised. If you deploy by triggering a workflow, rollback is the same workflow with an older SHA.

A Pipeline Worth Reading

Here's a complete GitHub Actions workflow for a typical Node app deployed to ECS Fargate. It's longer than what you'll start with, and shorter than what you'll end with — but every block is doing real work.

yaml — .github/workflows/deploy.yml

name: build and deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: deploy-${{ github.ref }}
  cancel-in-progress: false        # don't cancel in-flight prod deploys

permissions:
  id-token: write                  # OIDC for AWS — no static keys
  contents: read

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: npm }
      - run: npm ci
      - run: npm run lint
      - run: npm run test -- --ci
      - run: npm run typecheck

  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    outputs:
      image: ${{ steps.tag.outputs.image }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy   # placeholder account ID
          aws-region: us-east-1
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      - uses: docker/setup-buildx-action@v3
      - id: tag
        run: |
          IMAGE=${{ steps.ecr.outputs.registry }}/acme:$(git rev-parse --short HEAD)
          echo "image=$IMAGE" >> "$GITHUB_OUTPUT"
      - uses: docker/build-push-action@v5
        with:
          context: .
          platforms: linux/amd64
          push: true
          tags: ${{ steps.tag.outputs.image }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
      - name: scan image
        uses: aquasecurity/trivy-action@master   # in real use, pin to a release tag or commit SHA
        with:
          image-ref: ${{ steps.tag.outputs.image }}
          severity: HIGH,CRITICAL
          exit-code: 1

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment: production         # gate via GitHub environment protection
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy   # placeholder account ID
          aws-region: us-east-1
      - name: render task definition
        run: |
          aws ecs describe-task-definition --task-definition acme \
            --query taskDefinition > td.json
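          # Swap in the new image, then drop the read-only fields that
          # describe-task-definition returns but register-task-definition rejects.
          jq --arg img "${{ needs.build.outputs.image }}" \
             '.containerDefinitions[0].image = $img
              | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
                    .compatibilities, .registeredAt, .registeredBy)' td.json > td.new.json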
      - name: deploy
        run: |
          REV=$(aws ecs register-task-definition --cli-input-json file://td.new.json \
                 --query 'taskDefinition.taskDefinitionArn' --output text)
          aws ecs update-service --cluster acme --service acme \
            --task-definition "$REV" --force-new-deployment
          aws ecs wait services-stable --cluster acme --services acme
      - name: smoke test
        run: curl -sf https://acme.com/healthz
      - name: notify
        if: always()
        run: |
          STATUS=${{ job.status }}
          curl -sX POST -H 'Content-Type: application/json' \
            -d "{\"text\":\"deploy ${STATUS}: $GITHUB_SHA\"}" "$SLACK_WEBHOOK"
        env:                          # block style: ${{ }} is not valid inside a { } flow mapping
          SLACK_WEBHOOK: ${{ secrets.SLACK_DEPLOY_WEBHOOK }}

What's worth pointing out:

OIDC for AWS auth. The id-token: write permission lets the workflow exchange a GitHub-signed JWT for short-lived AWS credentials. No static AWS_ACCESS_KEY_ID in repo secrets — the credential never exists outside the running job. Set up the IAM role's trust policy to scope to your repo and ref.
Concurrency group on the ref. One deploy at a time per branch; PR builds run in parallel.
Image tagged by SHA. Every commit gets a unique image; rollback is "deploy yesterday's SHA."
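Layer cache via GHA. cache-from/cache-to: type=gha reuses Docker layers across runs, which can cut a 5-minute build to around 30 seconds when most layers are unchanged.
Image vulnerability scan with Trivy. Fails the job on HIGH/CRITICAL findings, which blocks the deploy job. The scan runs after the push, so a failing image still lands in ECR; it just never gets deployed.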
Environment protection. The environment: production ties the job to a GitHub environment with optional required reviewers, branch restrictions, and wait timers.
Smoke test after deploy. curl /healthz — minimal, but catches the case where ECS reports stable but the service is broken.
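
The single curl in the smoke-test step fails the deploy on the first slow or unlucky response. Here is a minimal sketch of a more patient check, assuming you're happy to run a small Python script in the job instead. The names `http_status` and `smoke_test`, the five attempts, and the three-second delay are illustrative choices, not part of the workflow above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET, or 0 if no response arrived at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # the server answered, but with 4xx/5xx
    except (urllib.error.URLError, OSError):
        return 0          # DNS failure, refused connection, timeout


def smoke_test(
    url: str,
    attempts: int = 5,
    delay_s: float = 3.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Pass as soon as one attempt returns 200; fail after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return True
        if attempt < attempts:
            sleep(delay_s)   # give freshly started tasks a moment to warm up
    return False
```

In a workflow step you would exit non-zero when `smoke_test("https://acme.com/healthz")` returns False. The `fetch` and `sleep` parameters exist only so the retry logic can be exercised without a network.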

💡

CI is also a security boundary

Anything your pipeline can do, an attacker who compromises your repo can do. Lock down: required code review on main, branch protection, environment-level required reviewers for production, OIDC instead of static AWS keys, and a GitHub Actions IAM role restricted to the smallest permission set that ships your service. Audit secrets quarterly: secrets set in a panic six months ago and never rotated are a common breach vector.
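
Scoping the role's trust policy to your repo and ref is easier to see written out. The sketch below builds such a policy as a Python dict and prints it as JSON. The account ID and the `acme/acme` repo name are placeholders, and `github_oidc_trust_policy` is a hypothetical helper. One detail worth knowing: a job that sets `environment: production` presents a subject claim ending in `environment:production` rather than a branch ref, so a role shared by the build and deploy jobs needs both subjects listed.

```python
import json

OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, repo: str, subjects: list[str]) -> dict:
    """Trust policy that lets only the listed GitHub OIDC subjects assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{OIDC_HOST}:aud": "sts.amazonaws.com",
                        # A list under StringEquals matches any one of the values.
                        f"{OIDC_HOST}:sub": [f"repo:{repo}:{s}" for s in subjects],
                    }
                },
            }
        ],
    }


policy = github_oidc_trust_policy(
    account_id="123456789012",      # placeholder account ID
    repo="acme/acme",               # hypothetical org/repo
    subjects=[
        "ref:refs/heads/main",      # jobs on main without an environment (build)
        "environment:production",   # jobs with `environment: production` (deploy)
    ],
)
print(json.dumps(policy, indent=2))
```

Anything not on the subject list, such as a workflow running from a fork or a feature branch, fails the `sub` condition and cannot assume the role.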

Secrets — Never in Git, Always Rotatable

Three places secrets belong, in increasing rigour:

Pipeline secrets — GitHub Actions secrets, GitLab CI variables. For CI-only credentials (registry tokens, OIDC role ARNs).
Cloud secret manager — AWS Secrets Manager, Google Secret Manager, Azure Key Vault. The runtime fetches secrets at startup using its IAM role; rotation is a single API call; access is audited.
Vault / sops / age — for teams that need encrypted secrets in git (sops + KMS), or platform-spanning secret stores (HashiCorp Vault).
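
To make the cloud-secret-manager option concrete, here is a rough sketch of a runtime fetch with a short cache, so that a rotated value is picked up without a redeploy. It assumes a client shaped like boto3's Secrets Manager client, whose `get_secret_value(SecretId=...)` call returns a dict with a `SecretString` key for string secrets. The `SecretCache` class, the five-minute TTL, and the JSON-encoded secret are assumptions made for illustration.

```python
import json
import time
from typing import Any, Callable


class SecretCache:
    """Fetch JSON secrets through a Secrets Manager-style client, cached briefly."""

    def __init__(self, client: Any, ttl_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._client = client
        self._ttl_s = ttl_s
        self._clock = clock
        self._cache: dict[str, tuple[float, dict]] = {}

    def get(self, secret_id: str) -> dict:
        now = self._clock()
        hit = self._cache.get(secret_id)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]   # still fresh: no API call
        resp = self._client.get_secret_value(SecretId=secret_id)
        value = json.loads(resp["SecretString"])
        self._cache[secret_id] = (now, value)
        return value
```

With boto3 installed, usage would look like `SecretCache(boto3.client("secretsmanager")).get("prod/acme/db")`, where the secret name is hypothetical. The TTL is the trade-off knob: shorter means rotations land faster, longer means fewer API calls.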

What never works long-term: secrets in .env files committed to git, even private git. Deleting the file later doesn't help, because the value stays in the repository history and in every clone. Copies also end up in shell history, in the IDE's recents, in the laptop you'll lose, in the email you forwarded to a contractor. Pretend the repo will leak; design accordingly.

Infrastructure as Code

The same logic that says "deploy by script, not by clicking" extends to provisioning infrastructure. Terraform (or Pulumi, OpenTofu, AWS CDK) describes your VPC, EC2s, ALBs, security groups, RDS, IAM roles, S3 buckets, secrets in code; terraform apply creates them; terraform plan shows the diff before any change.

hcl — minimal Terraform for the EC2 from Day 2

terraform {
  backend "s3" {
    bucket         = "acme-tf-state"
    key            = "prod/main.tfstate"
    region         = "us-east-1"
    dynamodb_table = "acme-tf-locks"
    encrypt        = true
  }
}

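provider "aws" { region = "us-east-1" }

# Look up the account's default VPC, referenced by the security group below.
data "aws_vpc" "default" {
  default = true
}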

resource "aws_security_group" "web" {
  name        = "acme-web"
  description = "public web"
  vpc_id      = data.aws_vpc.default.id
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "web" {
  ami                    = "ami-0abcd1234"     # ubuntu 24.04, pinned
  instance_type          = "t3.small"
  vpc_security_group_ids = [aws_security_group.web.id]
  iam_instance_profile   = aws_iam_instance_profile.web.name   # profile defined elsewhere (not shown)
  metadata_options {
    http_tokens   = "required"                    # IMDSv2 only
    http_endpoint = "enabled"
  }
  tags = { Name = "acme-web" }
}

Two crucial details. First, the remote state backend (S3 plus a DynamoDB lock table) prevents two engineers from racing apply commands. Second, the pinned AMI ID and IMDSv2 enforcement match the hardening from Day 2. State is Terraform's record of what's deployed; changes clicked into the AWS console drift from it, and the next terraform plan will propose undoing them.

⚠️

No clicking in production

Once you adopt Terraform, the rule is: every change goes through code. Click-ops in the console creates drift, drift breaks your next apply, broken applies make engineers afraid to apply, fear creates more drift. Use the console for read-only debugging; for any change, edit the .tf file and apply. CI/CD on the Terraform repo (plan on PR, apply on merge with required review) makes this enforceable.
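
One way to wire "plan on PR" into CI, and to catch drift on a schedule, is to lean on the `-detailed-exitcode` flag of `terraform plan`: exit code 0 means no changes, 1 means the plan failed, 2 means there is a diff. The wrapper below is a sketch under that assumption; `classify_plan` and `run_plan` are hypothetical names, and the 60-second lock timeout is an arbitrary choice.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_OUTCOMES = {
    0: "clean",     # succeeded, no changes: live infra matches the code
    1: "error",     # the plan itself failed
    2: "changes",   # succeeded, and there is a diff to apply (or drift to fix)
}


def classify_plan(exit_code: int) -> str:
    """Translate a plan exit code into an outcome; anything unknown is an error."""
    return PLAN_OUTCOMES.get(exit_code, "error")


def run_plan(workdir: str) -> str:
    """Run a non-interactive plan in `workdir` and classify the result."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock-timeout=60s"],
        cwd=workdir,
        check=False,   # exit code 2 is a normal outcome here, not a failure
    )
    return classify_plan(proc.returncode)
```

On a pull request, "changes" is the expected result and the plan output goes to the reviewer. On a nightly run against main, "changes" means someone clicked in the console, and that is worth a ticket.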

Observability — The Four Golden Signals

Google's SRE book proposed a small set of metrics every service should expose. They've held up for a decade because they cover the failure modes you actually care about.

Latency. How long do successful requests take? Track P50, P95, P99 — not just the mean.
Traffic. How many requests per second? RPS, qps, jobs/s — whatever the unit of work is.
Errors. What fraction of requests fail? 5xx rate, exceptions, business-level failures.
Saturation. How full is the system? CPU %, memory %, queue depth, connection-pool utilization, disk-fill rate.

If you instrument these four for every service, every dependency, and every external call, you have ~80% of the visibility you'll ever need. Add custom business metrics on top (signups/min, dollars/hour, AI tokens/sec).
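
To see what the four numbers look like for one window of traffic, here is a small sketch that derives them from raw request records. In production a metrics library does this for you; the `Request` record, the nearest-rank percentile, and the use of in-flight requests as the saturation measure are all simplifying assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int


def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile of an ascending list (p in 0..100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def golden_signals(requests: list[Request], window_s: float,
                   in_flight: int, max_in_flight: int) -> dict:
    """Summarise one window of traffic into the four signals."""
    # Latency is measured on successful requests only; fast failures would flatter it.
    ok = sorted(r.latency_ms for r in requests if r.status < 500)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_ms": {f"p{p}": percentile(ok, p) for p in (50, 95, 99)} if ok else None,
        "traffic_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests) if requests else 0.0,
        "saturation": in_flight / max_in_flight,
    }
```

Feeding it 100 requests over a 10-second window gives 10 requests per second. Because the percentiles come from the sorted successes rather than a mean, one slow outlier shows up in P99 without distorting P50.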

The metrics stack

Figure: a modern observability pipeline. OpenTelemetry's collector unifies metrics, logs, and traces; storage and UI are interchangeable.

Logs, metrics, and traces — what each is for

Metrics — numerical, aggregated, cheap to store at scale. The four golden signals live here. Best for "is the system healthy right now?" and "is this trend worsening?"
Logs — text or structured events, expensive to store, searchable. Best for "what happened on this exact request?" and forensic investigation.
Traces — a request's journey across services, with each span's timing. Best for "why is THIS request slow when others aren't?" Distributed systems require traces; single-process apps usually don't.

OpenTelemetry (OTel) unifies the SDKs across languages so the same instrumentation feeds all three storage backends. For most teams: metrics in CloudWatch or Prometheus, logs in CloudWatch or Loki, traces in AWS X-Ray or Tempo, dashboards in Grafana.
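
Logs only answer "what happened on this exact request?" if each line is machine-searchable and carries the request's trace ID. The sketch below uses only Python's standard `logging` module rather than the OTel SDK; the field names (`trace_id`, `route`, `status`) and the `make_logger` helper are illustrative assumptions, not an OTel schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "msg": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "route": getattr(record, "route", None),
            "status": getattr(record, "status", None),
        }
        return json.dumps(event)


def make_logger(stream=sys.stdout) -> logging.Logger:
    """Return the 'acme' logger, configured to write JSON lines to `stream`."""
    logger = logging.getLogger("acme")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `make_logger().info("checkout failed", extra={"trace_id": "abc123", "route": "/api/v1/checkout", "status": 502})` then emits a single JSON line. Searching the log store for that trace ID returns every line from the same request.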

Alerting — The Discipline of Not Crying Wolf

An alert that doesn't require human action is a bug. The on-call engineer's pager is sacred — every false alarm trains them to ignore the real ones. Two rules govern good alerts:

Alert on symptoms, not causes. Page when users are unhappy (latency over SLO, errors above threshold), not when an internal metric crosses a line. "P99 latency over 1s for 5 minutes" is a symptom; "CPU above 80%" is a cause that may or may not affect users.
Page only what's actionable now. If the runbook says "check this dashboard, file a ticket" — that's a ticket, not a page. Page only for things that need a human at this moment.

SLO, SLI, error budget — the framework

SLI (Service Level Indicator): a measurement. "% of requests completing in under 300 ms."
SLO (Service Level Objective): your target for an SLI. "99.5% of requests under 300 ms over 28 days."
Error budget: 100% – SLO. A 99.5% SLO leaves a 0.5% budget: about 200 minutes of full outage per 28 days (0.005 × 40,320 minutes ≈ 202). Burn fast → page; burn slow → ticket; not burning → ship more code.
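
The budget arithmetic is short enough to keep as a helper. This is a sketch: the function names are made up for illustration, and the "minutes" view treats the budget as equivalent full-outage time, while the request-based view counts failed requests against the allowance.

```python
def error_budget_minutes(slo: float, window_days: int = 28) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the request-based budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo) * total
    return 1 - failed / allowed_failures


print(f"{error_budget_minutes(0.995):.1f} minutes")   # 201.6 minutes
print(f"{error_budget_minutes(0.999):.1f} minutes")   # 40.3 minutes
```

With a 99.5% SLO, 1,000,000 requests so far in the window and 1,500 of them failed, `budget_remaining` comes out at roughly 0.7: about 70% of the budget is still available to spend on risk.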

yaml — Prometheus alert: fast-burn on the error budget

groups:
- name: acme-slo
  rules:
  - alert: AcmeErrorBudgetFastBurn
    expr: |
      (
        sum(rate(http_requests_total{job="acme",status=~"5.."}[5m]))
        / sum(rate(http_requests_total{job="acme"}[5m]))
      ) > 0.005 * 14   # 14× normal burn — we'll exhaust budget in 2 days at this rate
    for: 2m
    labels: { severity: page }
    annotations:
      summary: "Acme error rate burning budget fast"
      runbook: "https://wiki.acme.com/runbooks/acme-error-rate"

The runbook URL in the annotation is the single most useful thing on a page. The on-call engineer at 03:00 is not in a state to invent a remediation; they want a checklist.
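
The `0.005 * 14` threshold in the rule is a burn rate: the observed error ratio divided by the ratio the SLO allows. A sketch of that calculation and of the page/ticket routing follows. The 14× paging threshold comes from the rule above; the 1× ticket threshold and the function names are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1 - slo)


def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Days until a full budget is gone if the current burn rate holds."""
    return float("inf") if rate <= 0 else window_days / rate


def route(rate: float, page_at: float = 14.0, ticket_at: float = 1.0) -> str:
    """Burn fast -> page; burn slow -> ticket; otherwise nothing to do."""
    if rate > page_at:
        return "page"
    if rate > ticket_at:
        return "ticket"
    return "ok"
```

At a 7% error ratio against a 99.5% SLO the burn rate is about 14, and `days_to_exhaustion(14)` is 2.0, which is where the "2 days" in the rule's comment comes from. A burn rate of 3 would empty the budget in a little over 9 days: worth a ticket, not a 03:00 page.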

Runbooks — The Memory of the Team

A runbook for an alert lists: what triggers this, what the user impact is, the diagnosis steps in order, the safe rollbacks, the escalation path, related dashboards, recent context. Format doesn't matter; access does — link from the alert, keep it findable, update it after every incident.

markdown — runbooks/acme-error-rate.md (excerpt)

# Acme Error Rate (5xx) — Fast Burn

## What it means
The Acme service is returning 5xx at a rate that will exhaust the 28-day
error budget in < 2 days at the current pace.

## User impact
Users see HTTP 500 / 502 on the affected requests. Login and checkout
requests are most sensitive — error budget for those is tighter (see SLO doc).

## First 2 minutes
1. Open dashboard: https://grafana.acme.com/d/acme-overview
2. Identify which endpoint(s) are erroring. Often /api/v1/checkout.
3. Identify the start time. Was there a recent deploy? See #deploys Slack.

## If it correlates with a deploy (most common)
1. Roll back: `gh workflow run rollback.yml -f sha=<previous>` (5 min)
2. Verify error rate drops in dashboard.
3. Open an incident channel #inc-yyyymmdd-acme.

## If no recent deploy
1. Check upstreams: payments, auth, db. See the four-golden-signals row.
2. Check capacity: Fargate task count, RDS connections, Redis OOM.
3. Check rate limits: NGINX limit_req log lines.

## Escalation
- Primary on-call: PagerDuty
- Secondary: backend team channel #team-backend
- VP Eng if user-impact > 1 hour
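
Since a page without a runbook link leaves the on-call improvising, it can help to check for the link in CI. The sketch below walks an already-parsed Prometheus rules file (a dict, for example from a YAML loader such as PyYAML's `yaml.safe_load`, which is not imported here). It lists paging alerts whose `runbook` annotation is missing or not an https URL. The function name and the `severity: page` convention follow the alert example earlier in this lesson; treat both as assumptions about your own rules.

```python
def alerts_missing_runbook(rules_doc: dict, key: str = "runbook") -> list[str]:
    """Return names of paging alerts that lack a usable runbook annotation."""
    missing = []
    for group in rules_doc.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue   # recording rules have no pager behind them
            if rule.get("labels", {}).get("severity") != "page":
                continue   # only pages are held to this standard
            url = rule.get("annotations", {}).get(key, "")
            if not url.startswith("https://"):
                missing.append(rule["alert"])
    return missing
```

Failing the pipeline whenever this list is non-empty turns "link from the alert" from a convention into something the team cannot forget.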

Post-Mortems — Blameless and Permanent

Every incident worth waking someone for is worth writing about. The shape:

Summary. One paragraph. "On 2026-05-02, a deploy caused 17 minutes of elevated 5xx for /api/checkout, affecting roughly 1,200 users."
Timeline. Minute-by-minute, in UTC. Detection, alerts, who joined when, what they tried.
Root cause. What actually broke, in detail. Often more than one cause stacked.
Impact. Numbers. Users affected, money lost, SLO budget consumed.
What went well. Detection time, communication, rollback speed.
What went badly. Where we got lucky, where the system surprised us.
Action items. Specific tickets with owners and dates. Not platitudes.

Blameless is the cultural choice that makes the rest possible. The question is never "who messed up" but "what about our system made this mistake easy?" Engineers will hide near-misses if they're punished; engineers will surface them if they're celebrated for catching them.

🌱

The action-item discipline

A post-mortem with action items nobody owns is theatre. Each item must have an owner, a date, and a tracking ticket — and a follow-up review four weeks later to check that they actually shipped. The corollary: don't take on more action items than you can ship. Three real fixes in four weeks is better than twelve aspirations in a doc.

The Operational Loop

Pulling Day 1 through Day 7 together, the operational loop looks like this:

An engineer makes a change. CI runs lint, tests, type-checks, vulnerability scan, image build.
The image is pushed to ECR with a SHA tag.
The deploy job updates the ECS task definition, ECS rolls the change one task at a time (Day 5's rolling pattern).
The new tasks pass health checks (Day 5) and ALB target group attaches them; old tasks drain.
NGINX (Day 3) keeps serving uninterrupted; TLS (Day 4) keeps users on HTTPS.
OTel instrumentation flows metrics, logs, and traces to the storage backend; Grafana visualises them.
If the four golden signals trend bad, alerts fire to the on-call's pager — but only for symptoms, only when actionable.
The on-call follows the runbook to mitigate (often a rollback, often a rate-limit knob, often a feature flag).
Within a few days, the team writes a post-mortem and ships the action items — making the next instance of this failure detectable, preventable, or impossible.
The pipeline gets a small improvement to encode that learning. Tomorrow's deploy is one regression-resistant step better than today's.

Quick check

Your service has SLO of 99.9% successful requests over 28 days. The current error budget is 70% remaining with 12 days left in the window. A teammate proposes shipping a risky refactor that may cause errors. Does the SLO framework say go or no-go, and why?

Answer

Go, with a guardrail. 99.9% over 28 days is a budget of ~40 minutes of unavailability; 70% remaining means ~28 minutes left to spend over the next 12 days. The whole point of an error budget is to be spent: if you never spend it, you're shipping too slowly relative to your reliability targets. The right answer is to ship the risky refactor while there is still budget to absorb a mistake, with a quick rollback path and good observability, accepting that you may consume some of the remaining budget. If the budget were nearly spent (say 5% remaining with most of the window still ahead), the answer would flip: stop shipping risky changes until the budget recovers. The framework gives you a numerical, blameless way to negotiate "can we ship this risky thing", which beats arguing about it. The implicit rule: budget left unspent is wasted; budget spent deliberately is the cost of velocity.

Mnemonic — production hygiene

"Pipeline. Telemetry. Alerts. Runbooks. Post-mortems."

Pipeline — automated build / test / deploy on every commit.
Telemetry — four golden signals, plus business metrics.
Alerts — symptoms only, actionable only.
Runbooks — what to do at 03:00, linked from the alert.
Post-mortems — blameless, with owned action items that ship.

Flashcard

Why is alerting on "CPU > 80%" generally a bad idea, and what would you alert on instead?

Answer

Why bad: CPU is a cause, not a symptom. A service can run at 90% CPU and serve users perfectly fine; another service can be at 30% CPU and timing out because it's blocked on a database. Paging on CPU optimises for an internal metric and trains the on-call to investigate things users don't notice. What instead: page on user-visible failures — request latency above SLO, error rate above SLO, queue lag above the action threshold, saturation indicators that have caused real problems in the past. Use CPU as a diagnostic in the dashboard you visit during an incident, not as a paging trigger. The general rule: page on symptoms, dashboard on causes.

🔑

Course-end takeaways

1) CI/CD turns deploys into a routine, not an event — small batches, automated tests, OIDC instead of static keys. 2) Secrets and infra are code; clicks in the console are forensic only. 3) The four golden signals — latency, traffic, errors, saturation — cover most of what you need to watch. 4) Alert on symptoms, link runbooks, respect the pager. 5) Post-mortems are how teams learn; blameless ones with owned action items beat the same story repeated quarterly. 6) Days 1–7 are layers of one stack: DNS → server → proxy → TLS → process supervision → containers → automation. Master them in order; you can deploy almost anything once you have.

📚 Further reading

Google: Site Reliability Engineering (free book), sre.google
Google: SRE Workbook (free), sre.google
GitHub Actions documentation, docs.github.com
AWS: OIDC for GitHub Actions, github.com
Terraform documentation, terraform.io
OpenTelemetry: language-agnostic observability, opentelemetry.io
Prometheus: alerting best practices, prometheus.io
Awesome SRE: curated reading list, github.com
