Observability — Metrics, Logs & Traces
If you can't ask new questions of your system at 3 AM, it isn't observable. Master the three pillars (metrics, logs, traces), OpenTelemetry, the RED and USE methods, SLO burn rates, cardinality control, and the AI-cost telemetry that emerged when token budgets joined latency as a first-class dimension.
What you will learn
Monitoring is checking known signals against known thresholds. Observability is being able to ask new questions of a system you didn't anticipate when you built it — at 3 AM, on the worst day of your career. The shift in vocabulary matters because the second job is harder and far more useful. This chapter is about the data and tooling that make it possible: the three pillars, the standards that knit them together, and the new dimensions LLMs forced into the dashboard (token usage, cost-per-request, model versions).
The Three Pillars
Three different shapes of telemetry, with different costs and questions they answer. A complete observability stack has all three, instrumented from the same code paths so they correlate.
Metrics
Aggregated numeric measurements over time. Three flavours:
- Counter — only goes up. http_requests_total. Take rates by differencing successive samples over time.
- Gauge — current value. db_connections_in_use, queue_depth.
- Histogram — distribution of values into buckets. http_request_duration_seconds_bucket{le="0.1"}. Percentiles are estimated from the cumulative bucket counts, so their accuracy depends on where the bucket boundaries sit.
Stored in time-series DBs: Prometheus (open source, pull-based), VictoriaMetrics (Prometheus-compatible, scales further), Datadog/New Relic (managed). Cheap per data point; the cost driver is cardinality, covered later.
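To make the counter and histogram arithmetic concrete, here is a minimal sketch in plain Python. It is not how any particular time-series database implements these queries; the function names `counter_rate` and `histogram_quantile` and the sample data shapes are assumptions chosen for illustration. The quantile estimate uses linear interpolation inside the matching bucket, which is one common approach.

```python
def counter_rate(samples):
    """Per-second rate from a list of (timestamp_seconds, counter_value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value counts as the increase for that step.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0..1) from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with float("inf").
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # The +Inf bucket has no width, so fall back to the last finite bound.
            if upper_bound == float("inf") or count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

Because the estimate can only interpolate within a bucket, a percentile is never more precise than the bucket it lands in. Placing bucket boundaries around your latency targets keeps the estimate useful where it matters.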
Logs
Discrete events with timestamps and structured fields. The right shape today is structured JSON:
```json
{
  "ts": "2026-05-05T18:42:11.234Z",
  "level": "info",
  "msg": "completion finished",
  "service": "chat-api",
  "trace_id": "01H8ZX9R8Q5GS9T0K6N2T0PJ4M",
  "span_id": "a1b2c3d4",
  "tenant_id": "acme",
  "user_id": "u_42",
  "model": "claude-haiku-4.5",
  "latency_ms": 1834,
  "prompt_tokens": 412,
  "completion_tokens": 188,
  "cost_usd": 0.0021
}
```

Why structured: every field is searchable and aggregatable without regex. Stored in a search/index store: Elasticsearch, OpenSearch, Loki (cheaper, indexes only labels), ClickHouse, BigQuery. Storage cost grows fast; log volume is usually the dominant observability spend at scale.
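One way to produce log lines of this shape is a small formatter on top of Python's standard `logging` module. This is a sketch under assumptions: the class name `JsonFormatter` and the convention of passing structured fields through `extra={"fields": {...}}` are illustrative choices, not part of any library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "ts": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Structured fields arrive via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("chat-api")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info(
        "completion finished",
        extra={"fields": {"tenant_id": "acme", "latency_ms": 1834}},
    )
```

Keeping the fields in a dictionary rather than interpolating them into the message is what makes them queryable later without regex.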
Traces
The causal graph of a request as it crosses services. A trace is identified by a trace_id; each segment of work is a span with a span_id, parent_span_id, start/end timestamps, and attributes (HTTP method, status, model, etc.).
Stored in trace stores: Tempo, Jaeger, Zipkin, Datadog APM, Honeycomb, New Relic. Sampled — typically 1–10% of traffic — to keep storage bounded. Modern stacks use tail-based sampling: keep all traces with errors or high latency, sample the rest.
OpenTelemetry — The Standard That Won
Until 2019, every observability vendor had its own SDK and instrumentation library. Switching providers meant rewriting all your instrumentation. OpenTelemetry (CNCF, merger of OpenTracing and OpenCensus) is the lingua franca: a single SDK, a single wire protocol (OTLP), pluggable backends.
The architecture: instrument your code with the OTel SDK; the SDK ships data to an OTel Collector (or directly to a backend); the collector batches, processes, and exports to one or more destinations. Switching from Prometheus to Datadog to Honeycomb is a config change, not a code change.
```python
import anthropic
from opentelemetry import trace

tracer = trace.get_tracer("chat-api")
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def handle_chat(req):
    # Root span for the request; spans opened inside it become its children.
    with tracer.start_as_current_span("handle_chat") as span:
        span.set_attribute("tenant.id", req.tenant_id)
        span.set_attribute("user.id", req.user_id)
        # `db` is the application's data-access layer, defined elsewhere.
        with tracer.start_as_current_span("db.lookup_user"):
            user = db.fetch_user(req.user_id)
        with tracer.start_as_current_span("llm.call") as llm_span:
            llm_span.set_attribute("llm.model", "claude-haiku-4.5")
            # Remaining arguments are elided here, as in the original.
            resp = client.messages.create(model="claude-haiku-4.5", ...)
            llm_span.set_attribute("llm.input_tokens", resp.usage.input_tokens)
            llm_span.set_attribute("llm.output_tokens", resp.usage.output_tokens)
        return resp
```

OTel auto-instruments most popular libraries (HTTP clients, ORMs, gRPC, Kafka clients) so you get most spans for free. The interesting ones, such as business operations and AI calls, you instrument by hand with the attributes you'll want to filter on later.
Context propagation
For traces to span services, the trace_id must travel along with each request. OTel uses W3C Trace Context: a traceparent HTTP header that every instrumented HTTP/gRPC client sets and every server reads. Same for queues — the producer puts the context in the message headers; the consumer extracts it. The result: a single trace_id flows from the user's request through every service and worker that participates.
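The `traceparent` header is simple enough to parse by hand, which helps when debugging propagation. The sketch below handles only the version `00` layout from the W3C Trace Context specification (`version-traceid-parentid-flags`); the function names are assumptions, and in practice the OTel propagators do this for you.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def outgoing_traceparent(incoming=None):
    """Build the header for a downstream call, continuing the trace if present."""
    ctx = parse_traceparent(incoming) if incoming else None
    trace_id = ctx["trace_id"] if ctx else secrets.token_hex(16)
    sampled = ctx["sampled"] if ctx else True
    span_id = secrets.token_hex(8)  # new span ID for this hop
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

The trace ID stays constant across hops while each hop mints a fresh span ID, which is what lets the backend reassemble the parent-child graph.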
Put the trace_id on every log line. Then a trace ID lets you correlate: "this request was slow → here are its logs → here are the errors". The OTel logging integration does this automatically; if you're using a different logger, ensure middleware sets the trace ID into the logger's context.
Methods — RED, USE, the Four Golden Signals
Three frameworks to anchor what you measure. Each is a complete starter dashboard.
RED — for services (Tom Wilkie)
- Rate — requests per second.
- Errors — failed requests per second (or as % of rate).
- Duration — request latency, usually as P50 / P95 / P99.
USE — for resources (Brendan Gregg)
- Utilization — % time the resource was busy. CPU, disk, memory, network.
- Saturation — extra work that couldn't be served — queue depth, run-queue length.
- Errors — error events on the resource.
The four golden signals (Google SRE)
Latency, traffic, errors, saturation. Essentially RED + saturation. The single most-cited "start here" list for any new service.
Pick one method per layer: RED for your application services, USE for your infrastructure. Build them as a starter dashboard before you ship the service to staging.
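As an illustration of what a RED panel computes, the sketch below derives rate, error ratio, and latency percentiles from raw request records. The `Request` type, the choice of HTTP 5xx as "error", and the nearest-rank percentile are assumptions; a real dashboard would compute these from metrics in the backend rather than from raw records.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float
    status: int


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list; p is an integer 1..100."""
    if not sorted_values:
        return float("nan")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests, window_s):
    """Rate, Errors, Duration for the requests observed in one time window."""
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests) if requests else 0.0,
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
    }
```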
SLO-Driven Alerting and Burn Rates
Most alerting is hand-tuned thresholds: page when error rate > 5% for 5 minutes. The trouble: this fires on noise (a 30-second blip), misses slow burns (1.5% errors for an hour, also a problem), and doesn't tie to user-perceived reliability.
The SRE playbook replaces threshold alerts with SLO burn-rate alerts. The error budget for a 99.9% SLO over 30 days is 43 minutes of failure. A burn rate is how fast you're consuming that budget right now: at 1× burn, you'll exhaust the budget exactly at the end of the window; at 14.4× you'll exhaust it in 2 days.
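The arithmetic behind those numbers is short enough to write out. This is a minimal sketch; the function names are illustrative.

```python
def error_budget_minutes(slo, window_days):
    """Minutes of total failure the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / (1 - slo)


def days_to_exhaustion(rate, window_days):
    """Days until the whole budget is gone if the current burn rate holds."""
    return window_days / rate if rate > 0 else float("inf")
```

For a 99.9% SLO over 30 days, `error_budget_minutes(0.999, 30)` is about 43.2, an observed error ratio of 1.44% gives a burn rate of about 14.4, and `days_to_exhaustion(14.4, 30)` is a little over 2 days.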
Why two windows per alert?
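The long window establishes that a meaningful share of the budget has been burned, which suppresses brief spikes; the short window confirms the burn is still happening, so the alert clears soon after the problem is fixed. Both must exceed the burn-rate threshold for the alert to fire. The table below is adapted from the Google SRE Workbook ("Alerting on SLOs"):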
| Severity | Burn rate | Short window | Long window |
|---|---|---|---|
| Page | 14.4× | 5 minutes | 1 hour |
| Page | 6× | 30 minutes | 6 hours |
| Ticket | 3× | 2 hours | 24 hours |
| Ticket | 1× | 6 hours | 3 days |
Adopt this set; you'll have far fewer pages, all of them meaningful.
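The "both windows must trigger" rule can be sketched as a small evaluator. This example encodes the two paging rows and the 3× ticket row from the table; the `RULES` structure and the window-keyed input dictionary are assumptions for illustration, and a real setup would express the same logic as alerting rules in the metrics backend.

```python
# (severity, burn-rate threshold, short window, long window), most severe first.
RULES = [
    ("page", 14.4, "5m", "1h"),
    ("page", 6.0, "30m", "6h"),
    ("ticket", 3.0, "2h", "24h"),
]


def evaluate(error_ratio_by_window, slo):
    """Return the severity of the first rule whose two windows both exceed
    the burn-rate threshold, or None if no rule fires."""
    budget = 1 - slo
    for severity, threshold, short, long in RULES:
        short_burn = error_ratio_by_window[short] / budget
        long_burn = error_ratio_by_window[long] / budget
        if short_burn >= threshold and long_burn >= threshold:
            return severity
    return None
```

A 30-second blip pushes the short-window burn rate up but leaves the long window below threshold, so nothing fires. A steady 0.35% error ratio against a 99.9% SLO is a 3.5× burn in every window, which skips the paging rules and opens a ticket.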
Cardinality — The Metric That Eats Budgets
Cardinality is the number of distinct label combinations on a metric. http_requests_total{path,method,status,user_id} with 1M users explodes into millions of time series. Time-series databases store one series per unique label set, so storage and query cost grow roughly in proportion to cardinality. This is the most common way observability gets unbearably expensive at scale.
Rules of thumb
- No user IDs as metric labels. Use logs or traces (per-event storage) for per-user data.
- Bucket high-cardinality dimensions. response_size_bucket with 6 buckets, not raw bytes.
- Path normalization. /orders/123 and /orders/456 become /orders/{id}.
- Tenants in metrics: usually OK in B2B (10s-1000s of tenants), problematic in consumer (millions of users).
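Two of these rules are easy to show in code: estimating the worst-case series count for a label set, and normalizing paths before they become label values. This is a sketch; the regular expression only recognizes numeric IDs and UUIDs, which is an assumption about what your URLs contain, and the label counts are made-up inputs.

```python
import re
from math import prod

# A path segment that is all digits or a UUID, followed by "/" or end of string.
_ID_SEGMENT = re.compile(
    r"/(?:[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    r"|\d+)(?=/|$)"
)


def normalize_path(path):
    """Replace ID-like path segments with a placeholder to bound cardinality."""
    return _ID_SEGMENT.sub("/{id}", path)


def max_series(label_cardinalities):
    """Upper bound on time series: the product of distinct values per label.

    Real counts are usually lower because not every combination occurs.
    """
    return prod(label_cardinalities.values())
```

With 50 paths, 5 methods, and 10 status codes the bound is 2,500 series. Adding a `user_id` label with 1M values raises it to 2.5 billion, which is why per-user data belongs in logs or traces.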
Wide events as a complement
Honeycomb's pitch is that the cardinality problem stems from forcing telemetry into pre-aggregated metrics. Their answer: store wide structured events (essentially structured logs with trace context), then compute aggregates at query time. Observability stacks built on columnar stores such as ClickHouse follow the same pattern. Wide events let you slice by user_id, request_id, model, or anything else, at the cost of storing every event rather than aggregates.
Logs at Scale — Sampling, Levels, and Volume
Log volume grows with traffic multiplied by the number of lines each request emits, so noisy code paths dominate quickly. Three controls:
Levels
- ERROR — something went wrong; needs investigation.
- WARN — unexpected but recovered. Sample 100%.
- INFO — successful business operations. Sample 100% for low-volume, downsample for high-volume endpoints.
- DEBUG — only enabled per-request or per-tenant via dynamic flags.
Sampling
For high-volume INFO logs, sampling is fine. Two modes:
- Head sampling — decide at request start (keep 1% of all requests' logs).
- Tail sampling — buffer everything for the request; keep if anything was an error or slow. Catches the interesting cases at fixed cost.
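Head sampling works best when the keep/drop decision is deterministic per request, so every service and every log line for that request agrees. A minimal sketch, assuming the decision is keyed on the trace ID (the function name is illustrative):

```python
import hashlib


def keep_request(trace_id, sample_rate):
    """Deterministically keep roughly `sample_rate` of requests.

    Hashing the trace ID means the same request gets the same answer in
    every process, without any coordination between them.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < sample_rate * 2**64
```

Calling `keep_request(trace_id, 0.01)` at the start of a request keeps about 1% of requests' logs. Tail sampling needs more machinery because the decision waits until the request has finished.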
Don't log secrets, even by accident
The classic logging bug: log.info("received request", request=req) emits headers including authorization tokens, cookies, and PII into a long-retention store. Use a logger that redacts known sensitive fields by default; review serializers in code review. The same goes for LLM prompts containing user data — Day 7 covers PII redaction.
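A redaction pass over structured fields can be sketched in a few lines. The key list below is an assumption and certainly incomplete for any real system; the point is that redaction happens by default in one place rather than at every call site.

```python
# Hypothetical deny-list; extend it to match the fields your services carry.
SENSITIVE_KEYS = {"authorization", "cookie", "set-cookie", "password", "api_key", "token"}


def redact(value):
    """Return a copy of `value` with sensitive dictionary entries masked.

    Walks nested dicts and lists; key matching is case-insensitive.
    """
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact(item) for item in value]
    return value
```

A key-based deny-list does not catch secrets embedded in free text such as prompts or error messages, so it complements review of what gets serialized rather than replacing it.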
Tracing at Scale — Sampling Strategies
Storing 100% of traces is impractical at scale. Three strategies:
- Head-based, ratio. 1% of all requests get a full trace. Simple. Misses rare-but-interesting cases.
- Head-based, by attribute. 100% of traces for tenant X (during a debugging session). 100% of traces tagged "new-feature: true" during rollout.
- Tail-based. Buffer all spans for a request; after the request completes, decide based on outcome (error? slow? interesting?). Keeps the interesting traces at fixed cost. The OTel Collector supports this.
Most production stacks combine: 1% head sampling baseline + 100% on errors + 100% on traces longer than P99 + dynamic 100% per debugging tag.
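The combined policy can be sketched as a decision made once all spans of a trace have been buffered. This is an illustration, not the OTel Collector's implementation: the `Span` type, the `debug` attribute name, and the use of the longest span as a stand-in for total trace duration are all assumptions, and the baseline is modeled as a random draw at decision time for simplicity.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    error: bool = False
    attributes: dict = field(default_factory=dict)


def keep_reason(spans, slow_ms, baseline_rate, rng=random.random):
    """Return why a completed trace should be kept, or None to drop it."""
    if any(span.error for span in spans):
        return "error"
    # The root span covers the whole request, so the longest span
    # approximates end-to-end latency.
    if max((span.duration_ms for span in spans), default=0.0) > slow_ms:
        return "slow"
    if any(span.attributes.get("debug") for span in spans):
        return "debug"
    if rng() < baseline_rate:
        return "baseline"
    return None
```

Returning the reason rather than a boolean makes it easy to record why each trace was retained, which helps when tuning thresholds.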
The AI-Cost Telemetry Layer
Latency, errors, and traffic — the classic three — describe a system without LLMs. Add LLMs and a fourth dimension joins the dashboard: cost. A single endpoint can cost $0.001 or $0.50 per call depending on which model handled it, how long the prompt was, and how many tokens the response generated. Without telemetry, this is invisible until the bill arrives.
What to instrument on every LLM call
- llm.provider (anthropic, openai, …)
- llm.model (claude-haiku-4.5, gpt-4o-mini, …)
- llm.input_tokens and llm.output_tokens
- llm.cache_read_tokens (when prompt-prefix cache hits)
- llm.cost_usd (computed from token counts × model pricing)
- llm.first_token_ms (time to first token, the streaming-UX metric)
- llm.latency_ms (total response time)
- llm.finish_reason (stop, length, content_filter, error)
- llm.tenant_id and llm.feature (so you can attribute cost)
Emit these as both span attributes (so they appear on the trace) and histogram metrics (so dashboards aggregate them). The result: cost-per-tenant, cost-per-feature, cost-per-model — all queryable, all alertable.
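Computing `llm.cost_usd` is a lookup and a multiplication. In the sketch below, the model name and every price are hypothetical placeholders, not real provider rates, and it assumes `input_tokens` excludes tokens served from the prompt cache; check how your provider reports usage before relying on that.

```python
# Hypothetical prices in USD per million tokens. Replace with real rates.
PRICES_PER_MTOK = {
    "example-small": {"input": 1.00, "output": 5.00, "cache_read": 0.10},
}


def llm_cost_usd(model, input_tokens, output_tokens, cache_read_tokens=0):
    """Cost of one call: token counts times the per-token price for the model."""
    price = PRICES_PER_MTOK[model]
    total = (
        input_tokens * price["input"]
        + output_tokens * price["output"]
        + cache_read_tokens * price["cache_read"]
    )
    return total / 1_000_000


def llm_attributes(provider, model, input_tokens, output_tokens, tenant_id, feature):
    """Attribute dictionary to attach to the span and the log line."""
    return {
        "llm.provider": provider,
        "llm.model": model,
        "llm.input_tokens": input_tokens,
        "llm.output_tokens": output_tokens,
        "llm.cost_usd": llm_cost_usd(model, input_tokens, output_tokens),
        "llm.tenant_id": tenant_id,
        "llm.feature": feature,
    }
```

Computing the cost at emit time, rather than at query time, means historical spans keep the price that applied when the call was made.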
Cost dashboards that engineers actually use
- Daily $ spent per feature. Tells the team which feature actually justifies its cost.
- Cost-per-request distribution. A long-tail histogram catching the runaway cases.
- Cache hit ratio for prompt cache. 80%+ on stable system prompts; lower means you're not getting the discount.
- Tokens-per-request P95. Drift up over time = prompt bloat (the team kept adding context).
- Cost-budget burn-rate. Same SLO machinery, except the SLO is $ per day rather than reliability.
Eval as Continuous Telemetry
Beyond cost, AI features have a quality dimension that pure latency/error metrics miss: did the model give the right answer? Day 7 covers eval architecture; the observability tie-in is to treat eval as just another set of metrics. Run a sample of production responses through a judge (rubric-based human or another model), score correctness, emit as a metric (llm.eval.score), alert on regressions. Model-version upgrades become trackable like deploys.
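Alerting on an eval regression can start as a comparison of mean scores between two model versions. This is a deliberately simple sketch: the function name, the 0.05 drop threshold, and the minimum sample count are assumptions, and a production check would likely add a statistical test.

```python
from statistics import mean


def eval_regression(baseline_scores, candidate_scores, max_drop=0.05, min_samples=30):
    """Compare mean eval scores for two model versions.

    Returns (regressed, drop). With too few samples on either side the
    result is (False, None) because the comparison is not trustworthy.
    """
    if len(baseline_scores) < min_samples or len(candidate_scores) < min_samples:
        return False, None
    drop = mean(baseline_scores) - mean(candidate_scores)
    return drop > max_drop, drop
```

Run it over the `llm.eval.score` samples grouped by model version, in the same way a deploy is compared against the previous release.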
Putting It All Together — A Working Telemetry Plan
- Instrument with OpenTelemetry. Auto-instrument HTTP, DB, queues. Hand-instrument business operations and LLM calls with rich attributes.
- Ship to a Collector, route to: a metrics backend (Prometheus/Mimir/Datadog), a logs backend (Loki/Elasticsearch/Splunk), a trace backend (Tempo/Jaeger/Honeycomb).
- Build RED dashboards per service and USE dashboards per resource, with a cost-per-feature dashboard for AI workloads.
- Define SLOs per critical user journey; configure multi-window burn-rate alerts.
- Audit cardinality quarterly; move per-user data from metrics to logs/traces.
- Use tail-based sampling to keep every trace with errors or outlier latency, and sample the rest.
- Redact PII and secrets at the logger level; review in CI.
- Metrics — RED + USE.
- Logs — structured, with trace_id, redacted.
- Traces — OTel, tail-sampled on errors.
- Cost — per request, per feature, per tenant.
(1) llm.cost_usd on every span/log, computed from input/output tokens × model price (so cost can be summed); (2) llm.feature attribute, so you can attribute spend; (3) llm.model, so you can track shifts when fallback to smaller/larger models happens. Alert shape: burn-rate style on a budget, e.g., daily budget = $50; page if > 5× normal hourly burn for 15 min and 1 hour (likely runaway loop or misconfig); ticket if 1.5× normal daily burn for 24 h (creeping prompt bloat). Plus a hard kill switch: a circuit breaker that returns 503 if hourly spend exceeds the safety cap.
- OpenTelemetry — Observability primer (opentelemetry.io)
- Google SRE Workbook — Alerting on SLOs (sre.google)
- Brendan Gregg — The USE method (brendangregg.com)
- Honeycomb — Observability 3.0 and wide events (honeycomb.io)
- Prometheus — Metric naming & labels (prometheus.io)
- CNCF OpenTelemetry project (cncf.io)