The Engineering Codex/Backend Engineering for the AI Era
DAY 6
08 / 09

Observability — Metrics, Logs & Traces

schedule12 minsignal_cellular_altAdvanced2,741 words
If you can't ask new questions of your system at 3 AM, you're not observable. Master the three pillars (metrics, logs, traces), OpenTelemetry, the RED and USE method, SLO burn rates, cardinality control, and the AI-cost telemetry that emerged when token budgets joined latency as a first-class dimension.

What you will learn

01The Three Pillars
02OpenTelemetry — The Standard That Won
03Methods — RED, USE, the Four Golden Signals
04SLO-Driven Alerting and Burn Rates
05Cardinality — The Metric That Eats Budgets
06Logs at Scale — Sampling, Levels, and Volume

Monitoring is checking known signals against known thresholds. Observability is being able to ask new questions of a system you didn't anticipate when you built it — at 3 AM, on the worst day of your career. The shift in vocabulary matters because the second job is harder and far more useful. This chapter is about the data and tooling that make it possible: the three pillars, the standards that knit them together, and the new dimensions LLMs forced into the dashboard (token usage, cost-per-request, model versions).

🔑
Today's observability stack
1) The three pillars — metrics, logs, traces — and what each is good for. 2) OpenTelemetry as the standard wire format. 3) Methods: RED for services, USE for resources, the four golden signals. 4) SLO-driven alerting with multi-window burn rates. 5) Cardinality — the metric that kills observability budgets. 6) AI-cost telemetry — tokens, model, latency, dollars per request.

The Three Pillars

Three different shapes of telemetry, with different costs and questions they answer. A complete observability stack has all three, instrumented from the same code paths so they correlate.

Metrics numeric, aggregated counters, gauges, histograms cheap to store long retention answers how often? how fast? trending up or down? Logs discrete events structured JSON timestamps + fields expensive at volume short-medium retention answers what happened? to which user? in what order? Traces causal request graph spans across services parent/child links sampled (1–10%) retain weeks answers where did time go? which service slow? what called what?
Metrics see the system; logs see events; traces see requests. Each pillar answers different questions; together they cover most incident types.

Metrics

Aggregated numeric measurements over time. Three flavours:

  • Counter — only goes up. http_requests_total. Take rates by differentiating.
  • Gauge — current value. db_connections_in_use, queue_depth.
  • Histogram — distribution of values into buckets. http_request_duration_seconds_bucket{le="0.1"}. Computes percentiles via summation.

Stored in time-series DBs: Prometheus (open source, pull-based), VictoriaMetrics (Prometheus-compatible, scales further), Datadog/New Relic (managed). Cheap per data point; the cost driver is cardinality, covered later.

Logs

Discrete events with timestamps and structured fields. The right shape today is structured JSON:

json — structured log line
{
  "ts": "2026-05-05T18:42:11.234Z",
  "level": "info",
  "msg": "completion finished",
  "service": "chat-api",
  "trace_id": "01H8ZX9R8Q5GS9T0K6N2T0PJ4M",
  "span_id": "a1b2c3d4",
  "tenant_id": "acme",
  "user_id": "u_42",
  "model": "claude-haiku-4.5",
  "latency_ms": 1834,
  "prompt_tokens": 412,
  "completion_tokens": 188,
  "cost_usd": 0.0021
}

Why structured: every field is searchable and aggregateable without regex. Stored in a search/index store: Elasticsearch, OpenSearch, Loki (cheaper, indexes only labels), ClickHouse, BigQuery. Storage cost grows fast — log volume is usually the dominant observability spend at scale.

Traces

The causal graph of a request as it crosses services. A trace is identified by a trace_id; each segment of work is a span with a span_id, parent_span_id, start/end timestamps, and attributes (HTTP method, status, model, etc.).

api / POST /v1/chat3,420 ms auth · 12 ms postgres / SELECT users · 24 ms anthropic / messages.create · 3,180 ms embed · 110 ms redis · 1.4 ms 93% of the request is spent in the LLM call. Auth, DB, embed, and Redis combined are noise. A trace makes that obvious in seconds. Without it, you'd be guessing.
A trace from a chat-completion request. The waterfall view makes "where did time go?" trivially answerable.

Stored in trace stores: Tempo, Jaeger, Zipkin, Datadog APM, Honeycomb, New Relic. Sampled — typically 1–10% of traffic — to keep storage bounded. Modern stacks use tail-based sampling: keep all traces with errors or high latency, sample the rest.

OpenTelemetry — The Standard That Won

Until 2019, every observability vendor had its own SDK and instrumentation library. Switching providers meant rewriting all your instrumentation. OpenTelemetry (CNCF, merger of OpenTracing and OpenCensus) is the lingua franca: a single SDK, a single wire protocol (OTLP), pluggable backends.

The architecture: instrument your code with the OTel SDK; the SDK ships data to an OTel Collector (or directly to a backend); the collector batches, processes, and exports to one or more destinations. Switching from Prometheus to Datadog to Honeycomb is a config change, not a code change.

python — OpenTelemetry tracing
from opentelemetry import trace

tracer = trace.get_tracer("chat-api")

def handle_chat(req):
    with tracer.start_as_current_span("handle_chat") as span:
        span.set_attribute("tenant.id", req.tenant_id)
        span.set_attribute("user.id",   req.user_id)
        with tracer.start_as_current_span("db.lookup_user"):
            user = db.fetch_user(req.user_id)
        with tracer.start_as_current_span("llm.call") as llm_span:
            llm_span.set_attribute("llm.model", "claude-haiku-4.5")
            resp = anthropic.messages.create(model="claude-haiku-4.5", ...)
            llm_span.set_attribute("llm.input_tokens", resp.usage.input_tokens)
            llm_span.set_attribute("llm.output_tokens", resp.usage.output_tokens)
        return resp

OTel auto-instruments most popular libraries (HTTP clients, ORMs, gRPC, Kafka clients) so you get most spans for free. The interesting ones — business operations, AI calls — you instrument by hand with the attributes you'll want to filter on later.

Context propagation

For traces to span services, the trace_id must travel along with each request. OTel uses W3C Trace Context: a traceparent HTTP header that every instrumented HTTP/gRPC client sets and every server reads. Same for queues — the producer puts the context in the message headers; the consumer extracts it. The result: a single trace_id flows from the user's request through every service and worker that participates.

💡
Log every trace_id
Every log line your service emits should include the current trace_id. Then a trace ID lets you correlate: "this request was slow → here are its logs → here are the errors". The OTel logging integration does this automatically; if you're using a different logger, ensure middleware sets the trace ID into the logger's context.

Methods — RED, USE, the Four Golden Signals

Three abbreviations to anchor what you measure. Each is a complete starter dashboard.

RED — for services (Tom Wilkie)

  • Rate — requests per second.
  • Errors — failed requests per second (or as % of rate).
  • Duration — request latency, usually as P50 / P95 / P99.

USE — for resources (Brendan Gregg)

  • Utilization — % time the resource was busy. CPU, disk, memory, network.
  • Saturation — extra work that couldn't be served — queue depth, run-queue length.
  • Errors — error events on the resource.

The four golden signals (Google SRE)

Latency, traffic, errors, saturation. Essentially RED + saturation. The single most-cited "start here" list for any new service.

Pick one method per layer: RED for your application services, USE for your infrastructure. Build them as a starter dashboard before you ship the service to staging.

SLO-Driven Alerting and Burn Rates

Most alerting is hand-tuned thresholds: page when error rate > 5% for 5 minutes. The trouble: this fires on noise (a 30-second blip), misses slow burns (1.5% errors for an hour, also a problem), and doesn't tie to user-perceived reliability.

The SRE playbook replaces threshold alerts with SLO burn-rate alerts. The error budget for a 99.9% SLO over 30 days is 43 minutes of failure. A burn rate is how fast you're consuming that budget right now: at 1× burn, you'll exhaust the budget exactly at the end of the window; at 14.4× you'll exhaust it in 2 days.

PAGE14.4× burn for 5 min & 1 hour→ exhausting 30-day budget in 2 days. Wake someone up. PAGE6× burn for 30 min & 6 hours→ exhausting in 5 days. Worth waking someone up. TICKET3× burn for 2 hours & 24 hours→ exhausting in 10 days. Investigate during business hours. OK< 1× burn over 24 hours→ within budget. Ship features.
Multi-window burn-rate alerts. Each row uses a short window (low-latency detection) and a long window (false-positive filter).

Why two windows per alert?

The short window catches issues quickly; the long window suppresses one-minute spikes. Both must trigger for the alert to fire. The classic table from the Google SRE Workbook ("Alerting on SLOs"):

SeverityBurn rateShort windowLong window
Page14.4×5 minutes1 hour
Page30 minutes6 hours
Ticket2 hours24 hours
Ticket3 days3 days

Adopt this set; you'll have far fewer pages, all of them meaningful.

Cardinality — The Metric That Eats Budgets

Cardinality is the number of distinct label combinations on a metric. http_requests_total{path,method,status,user_id} with 1M users explodes into millions of time series. Time-series databases store one row per unique label set; storage and query cost are linear in cardinality. This is the most common way observability gets unbearably expensive at scale.

Rules of thumb

  • No user IDs as metric labels. Use logs or traces (per-event storage) for per-user data.
  • Bucket high-cardinality dimensions. response_size_bucket with 6 buckets, not raw bytes.
  • Path normalization. /orders/123 and /orders/456 become /orders/{id}.
  • Tenants in metrics: usually OK in B2B (10s-1000s of tenants), problematic in consumer (millions of users).

Wide events as a complement

Honeycomb's pitch is that the cardinality problem stems from forcing telemetry into pre-aggregated metrics. Their answer: store wide structured events (essentially structured logs with traces), then compute aggregates at query time. ClickHouse-backed observability stacks (Datadog Logs, Grafana Loki + Tempo + ClickHouse) follow the same pattern. Wide events let you slice by user_id, request_id, model, anything else — at the cost of storing every event, not aggregates.

⚠️
Audit your metric labels
Quarterly, run a cardinality report on your metrics backend. Look for the top-10 highest-cardinality metrics. Anything with a user ID, request ID, or full URL path is the smoking gun. Either drop the label, bucket it, or move that data to logs/traces. Most observability bills are 80% caused by 20% of metrics.

Logs at Scale — Sampling, Levels, and Volume

Log volume is exponential in user count and noise level. Three controls:

Levels

  • ERROR — something went wrong; needs investigation.
  • WARN — unexpected but recovered. Sample 100%.
  • INFO — successful business operations. Sample 100% for low-volume, downsample for high-volume endpoints.
  • DEBUG — only enabled per-request or per-tenant via dynamic flags.

Sampling

For high-volume INFO logs, sampling is fine. Two modes:

  • Head sampling — decide at request start (keep 1% of all requests' logs).
  • Tail sampling — buffer everything for the request; keep if anything was an error or slow. Catches the interesting cases at fixed cost.

Don't log secrets, even by accident

The classic logging bug: log.info("received request", request=req) emits headers including authorization tokens, cookies, and PII into a long-retention store. Use a logger that redacts known sensitive fields by default; review serializers in code review. The same goes for LLM prompts containing user data — Day 7 covers PII redaction.

Tracing at Scale — Sampling Strategies

Storing 100% of traces is impractical at scale. Three strategies:

  • Head-based, ratio. 1% of all requests get a full trace. Simple. Misses rare-but-interesting cases.
  • Head-based, by attribute. 100% of traces for tenant X (during a debugging session). 100% of traces tagged "new-feature: true" during rollout.
  • Tail-based. Buffer all spans for a request; after the request completes, decide based on outcome (error? slow? interesting?). Keeps the interesting traces at fixed cost. The OTel Collector supports this.

Most production stacks combine: 1% head sampling baseline + 100% on errors + 100% on traces longer than P99 + dynamic 100% per debugging tag.

The AI-Cost Telemetry Layer

Latency, errors, and traffic — the classic three — describe a system without LLMs. Add LLMs and a fourth dimension joins the dashboard: cost. A single endpoint can cost $0.001 or $0.50 per call depending on which model handled it, how long the prompt was, and how many tokens the response generated. Without telemetry, this is invisible until the bill arrives.

What to instrument on every LLM call

  • llm.provider (anthropic, openai, …)
  • llm.model (claude-haiku-4.5, gpt-4o-mini, …)
  • llm.input_tokens and llm.output_tokens
  • llm.cache_read_tokens (when prompt-prefix cache hits)
  • llm.cost_usd (computed from token counts × model pricing)
  • llm.first_token_ms (time to first token, the streaming-UX metric)
  • llm.latency_ms (total response time)
  • llm.finish_reason (stop, length, content_filter, error)
  • llm.tenant_id and llm.feature (so you can attribute cost)

Emit these as both span attributes (so they appear on the trace) and histogram metrics (so dashboards aggregate them). The result: cost-per-tenant, cost-per-feature, cost-per-model — all queryable, all alertable.

Cost dashboards that engineers actually use

  • Daily $ spent per feature. Tells the team which feature actually justifies its cost.
  • Cost-per-request distribution. A long-tail histogram catching the runaway cases.
  • Cache hit ratio for prompt cache. 80%+ on stable system prompts; lower means you're not getting the discount.
  • Tokens-per-request P95. Drift up over time = prompt bloat (the team kept adding context).
  • Cost-budget burn-rate. Same SLO machinery, except the SLO is $ per day rather than reliability.
🚨
Set spend alerts before launch
A misconfigured prompt or an infinite loop in an agent can run a five-figure bill before the workday ends. Alert on $/hour exceeding 3× normal — page on it. Run cost dashboards by feature in the same place as latency dashboards. Cost is a first-class reliability concern in the AI era; pretending otherwise leads to surprise outages of the financial kind.

Eval as Continuous Telemetry

Beyond cost, AI features have a quality dimension that pure latency/error metrics miss: did the model give the right answer? Day 7 covers eval architecture; the observability tie-in is to treat eval as just another set of metrics. Run a sample of production responses through a judge (rubric-based human or another model), score correctness, emit as a metric (llm.eval.score), alert on regressions. Model-version upgrades become trackable like deploys.

Putting It All Together — A Working Telemetry Plan

  1. Instrument with OpenTelemetry. Auto-instrument HTTP, DB, queues. Hand-instrument business operations and LLM calls with rich attributes.
  2. Ship to a Collector, route to: a metrics backend (Prometheus/Mimir/Datadog), a logs backend (Loki/ElasticSearch/Splunk), a trace backend (Tempo/Jaeger/Honeycomb).
  3. Build RED dashboards per service and USE dashboards per resource, with a cost-per-feature dashboard for AI workloads.
  4. Define SLOs per critical user journey; configure multi-window burn-rate alerts.
  5. Audit cardinality quarterly; move per-user data from metrics to logs/traces.
  6. Sample traces with tail-based on errors and latency outliers.
  7. Redact PII and secrets at the logger level; review in CI.
Quick check
Your SLO is 99.9% for chat completions, measured monthly. Last week, error rate spiked from 0.05% to 1.5% for 90 minutes during deploys. Your alert ("error rate > 1%") fired but engineers ignored it. Why was it the wrong alert, and what's the fix?
Show answer
Why wrong: the alert measured an instantaneous error rate, not budget impact. 1% errors for 90 minutes consumed about 11 minutes of the monthly 43-minute budget — significant but not catastrophic. The team learned to ignore the alert because it fires for benign deploys. Fix: SLO burn-rate alerts. A page-worthy threshold like 14.4× burn for 5 min & 1 hour would have not fired (1.5% × 90 min ≈ 1.4× burn over the month, well under 14.4×). A 6× burn for 30 min & 6 hours might fire briefly but suppress quickly. The point of burn rates: alerts correlate with real budget impact, so engineers trust them.
Mnemonic — observability stack
"Metrics, Logs, Traces — and Cost when AI is in the path."
  • Metrics — RED + USE.
  • Logs — structured, with trace_id, redacted.
  • Traces — OTel, tail-sampled on errors.
  • Cost — per request, per feature, per tenant.
Flashcard
Your team launches an AI feature. The product owner says "alert me if costs exceed budget". You set up the metric. What three things must be present in your instrumentation to make this useful, and what alert shape would you use?
Click to flip ↻
Answer
Three needed in instrumentation: (1) llm.cost_usd on every span/log, computed from input/output tokens × model price (so cost can be summed); (2) llm.feature attribute, so you can attribute spend; (3) llm.model, so you can track shifts when fallback to smaller/larger models happens. Alert shape: burn-rate style on a budget. e.g., daily budget = $50; page if > 5× normal hourly burn for 15 min and 1 hour (likely runaway loop or misconfig); ticket if 1.5× normal daily burn for 24 h (creeping prompt bloat). Plus a hard kill switch — circuit breaker that returns 503 if hourly spend exceeds the safety cap.
🔑
Key takeaways
1) Metrics, logs, traces answer different questions; you need all three, ideally correlated by trace_id. 2) OpenTelemetry is the standard wire format — instrument once, send anywhere. 3) RED + USE + golden signals are the starter dashboards every service needs. 4) SLO burn-rate alerts replace threshold alerts and produce far fewer false pages. 5) Cardinality is the silent killer — audit quarterly. 6) For AI workloads, cost is a first-class telemetry dimension; instrument tokens, model, and dollars on every call from day one.

Finished reading?