
Agent Observability & Evals
Trace every span, score every run, and catch regressions before users do. The OpenTelemetry GenAI spec, the Langfuse stack, golden datasets, LLM-as-judge graders, and the five alerts every production agent needs.
What you will learn
If you cannot replay every decision an agent made, you cannot fix it. Agents are stochastic systems — the same input produces different traces — which means classical APM (CPU graphs, p95 latency) gives you almost no useful signal. You need span-level tracing of every LLM call and tool result, plus an eval harness that catches regressions before users do. This chapter covers both halves of agent observability and the open-source stack that's emerging as the standard.
What's Different About Agent Observability
Traditional APM tracks requests — one input, one output, one duration. An agent run is a tree: a single user request fans out into N LLM calls, M tool invocations, possibly P sub-agent runs. The metrics that matter are different too:
| Traditional APM | Agent observability |
|---|---|
| Request latency p50/p95/p99 | Time-to-first-token, time-to-final-answer, loop iterations |
| Error rate | Tool error rate, model error rate, schema validation rate, hallucination rate |
| Throughput (req/s) | Token throughput (in/out), cost per request, cache hit rate |
| CPU / memory | Context length used, prompt cache utilization, tools called per turn |
| Stack trace on error | Full message history including tool calls + results — the "reasoning trace" |
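As a rough illustration of the right-hand column, the sketch below derives a few of these metrics from the LLM calls in one agent run. The `LlmCall` record shape and the per-million-token prices are assumptions made for this example, not real provider pricing, and it treats one LLM call as one loop iteration.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_MTOK = {"input": 3.00, "cached_input": 0.30, "output": 15.00}


@dataclass
class LlmCall:
    input_tokens: int          # total prompt tokens, including cached ones
    cached_input_tokens: int   # portion of the prompt served from the prompt cache
    output_tokens: int


def run_metrics(calls: list[LlmCall]) -> dict:
    """Aggregate token usage, cache hit rate, and cost for one agent run."""
    total_in = sum(c.input_tokens for c in calls)
    cached = sum(c.cached_input_tokens for c in calls)
    total_out = sum(c.output_tokens for c in calls)
    uncached = total_in - cached
    cost = (
        uncached * PRICE_PER_MTOK["input"]
        + cached * PRICE_PER_MTOK["cached_input"]
        + total_out * PRICE_PER_MTOK["output"]
    ) / 1_000_000
    return {
        "loop_iterations": len(calls),  # assumes one LLM call per loop iteration
        "input_tokens": total_in,
        "output_tokens": total_out,
        "cache_hit_rate": cached / total_in if total_in else 0.0,
        "cost_usd": round(cost, 6),
    }


calls = [LlmCall(1000, 0, 200), LlmCall(1500, 1000, 300)]
metrics = run_metrics(calls)
print(metrics)
```

None of these numbers show up in a CPU or latency graph, which is why they have to be computed from the trace itself.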
Tracing — The Span Hierarchy
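The OpenTelemetry GenAI semantic conventions define the span shape that most observability tools are converging on. A typical agent trace is a tree: a root span for the agent run, with child spans for each LLM call, tool invocation, and sub-agent run.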
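To picture the hierarchy, here is a small plain-Python model of a span tree. It does not use the OpenTelemetry SDK; the `Span` class, the `kind` labels, and the span names (`research-agent`, `web_search`, and so on) are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str  # "agent", "llm", or "tool" (labels chosen for this sketch)
    children: list["Span"] = field(default_factory=list)

    def add(self, child: "Span") -> "Span":
        """Attach a child span and return it so callers can nest further."""
        self.children.append(child)
        return child


def render(span: Span, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, depth-first."""
    lines = [f"{'  ' * depth}{span.kind}: {span.name}"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# One user request fans out into LLM calls, a sub-agent, and a tool call.
root = Span("research-agent", "agent")
root.add(Span("plan", "llm"))
worker = root.add(Span("worker-1", "agent"))
worker.add(Span("choose-tool", "llm"))
worker.add(Span("web_search", "tool"))
root.add(Span("synthesize", "llm"))

print("\n".join(render(root)))
```

In a real deployment the tracing library builds this tree for you; the point here is only the parent/child shape that a trace viewer displays.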
Required attributes per span
- gen_ai.system: "anthropic", "openai", etc.
- gen_ai.request.model, gen_ai.response.model
- gen_ai.usage.input_tokens, gen_ai.usage.cached_input_tokens, gen_ai.usage.output_tokens
- gen_ai.request.temperature, top_p, max_tokens
- For tool calls: tool.name, tool.args (JSON), tool.result (JSON or summary), tool.is_error
- session.id and user.id (where applicable, with PII redaction)
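A minimal sketch of assembling these attributes as plain dictionaries before handing them to whatever tracing client you use. The helper names, the salted-hash redaction scheme, the 500-character truncation, and the model names are assumptions for this example; the attribute keys are the ones listed above.

```python
import hashlib
import json


def redact(value: str, salt: str = "replace-with-secret-salt") -> str:
    """Replace an identifier with a salted hash prefix: joinable across traces, not readable."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def llm_span_attributes(system, request_model, response_model,
                        usage, params, session_id, user_id) -> dict:
    """Build the attribute dict for one LLM-call span."""
    attrs = {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": usage["input_tokens"],
        "gen_ai.usage.cached_input_tokens": usage.get("cached_input_tokens", 0),
        "gen_ai.usage.output_tokens": usage["output_tokens"],
        "gen_ai.request.temperature": params.get("temperature"),
        "gen_ai.request.top_p": params.get("top_p"),
        "gen_ai.request.max_tokens": params.get("max_tokens"),
        "session.id": session_id,
        "user.id": redact(user_id),
    }
    # Drop parameters the caller did not set rather than recording nulls.
    return {k: v for k, v in attrs.items() if v is not None}


def tool_span_attributes(name: str, args: dict, result, is_error: bool) -> dict:
    """Build the attribute dict for one tool-call span."""
    return {
        "tool.name": name,
        "tool.args": json.dumps(args, sort_keys=True),
        "tool.result": json.dumps(result)[:500],  # store a bounded summary, not the full payload
        "tool.is_error": is_error,
    }


llm_attrs = llm_span_attributes(
    system="anthropic",
    request_model="example-model",            # placeholder model names
    response_model="example-model-2025",
    usage={"input_tokens": 1200, "output_tokens": 150},
    params={"temperature": 0.2, "max_tokens": 1024},
    session_id="sess-42",
    user_id="alice@example.com",
)
tool_attrs = tool_span_attributes("web_search", {"q": "otel genai"}, {"hits": 3}, False)
```

Hashing the user ID keeps traces groupable per user without storing the raw identifier; whether that is sufficient redaction depends on your own privacy requirements.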
The Open-Source Observability Stack
Langfuse — the OSS standard
Langfuse is one of the most widely deployed open-source LLM observability tools in 2025-2026. It is self-hostable and OpenTelemetry-compatible, with first-class agent tracing, prompt versioning, and built-in eval runners.
```python
import asyncio

from langfuse import observe, get_client

langfuse = get_client()

# plan_agent, worker_agent, and synthesize are the application's own async
# functions (not shown here).
@observe(name="research-agent")
async def run_agent(query: str):
    # Nested LLM and tool calls are traced automatically as child spans,
    # provided they are instrumented too (for example with @observe).
    plan = await plan_agent(query)
    digests = await asyncio.gather(*[worker_agent(t) for t in plan.subtasks])
    answer = await synthesize(query, digests)

    # Attach scores at runtime; this feeds online evals.
    langfuse.score_current_trace(
        name="answer_complete",
        value=1 if answer.is_complete else 0,
    )
    return answer
```
Other contenders
- Braintrust — paid SaaS, very strong on offline evals + production tracing. Excellent UI for diff-ing prompt changes against eval suites.
- LangSmith — the LangChain ecosystem's tracing tool. Best fit if you're already on LangChain/LangGraph.
- Arize Phoenix — open-source. Strong on RAG-specific observability (embedding drift, retrieval evals).
- Helicone — proxy-based tracing; cost-tracking strength. Easy to drop in without code changes.
Evals — Catching Regressions Before Users Do
Tracing tells you what happened in production. Evals tell you whether a change you're about to ship will make things worse. Every serious agent has at least one eval suite — usually three:
1. Golden dataset regression suite
50-500 hand-curated test cases covering happy paths, edge cases, and known failure modes. Run the suite on every prompt or code change. Each case is checked in one of three ways:
- Deterministic assertion — exact-match, regex, JSON schema, list contains.
- LLM-as-judge — a stronger model scores the output against a rubric.
- Comparison — pairwise A/B between old and new prompt outputs.
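The deterministic half of such a suite can be sketched in a few lines. Everything here is illustrative: the case IDs, the questions, the checker helpers (`exact`, `matches`, `contains_all`), and `stub_agent`, which stands in for a real agent call and deliberately gets one case wrong.

```python
import re
from typing import Callable

Check = Callable[[str], bool]


def exact(expected: str) -> Check:
    """Exact-match assertion (ignoring surrounding whitespace)."""
    return lambda out: out.strip() == expected


def matches(pattern: str) -> Check:
    """Regex assertion."""
    return lambda out: re.search(pattern, out) is not None


def contains_all(items: list[str]) -> Check:
    """List-contains assertion: every item must appear in the output."""
    return lambda out: all(item in out for item in items)


GOLDEN_CASES = [
    {"id": "capital-fr", "input": "What is the capital of France?", "check": exact("Paris")},
    {"id": "order-id", "input": "Give me a sample order ID.", "check": matches(r"^ORD-\d{6}$")},
    {"id": "refund-steps", "input": "How do I get a refund?",
     "check": contains_all(["receipt", "30 days"])},
    {"id": "empty-input", "input": "", "check": exact("Please ask a question.")},
]


def run_suite(agent: Callable[[str], str], cases: list[dict]) -> dict:
    """Run every case through the agent and report pass rate plus failing IDs."""
    failures = [c["id"] for c in cases if not c["check"](agent(c["input"]))]
    return {"pass_rate": 1 - len(failures) / len(cases), "failures": failures}


def stub_agent(question: str) -> str:
    """Canned answers standing in for a real agent; the refund answer is incomplete."""
    canned = {
        "What is the capital of France?": "Paris",
        "Give me a sample order ID.": "ORD-004217",
        "How do I get a refund?": "Bring your receipt to any store.",
        "": "Please ask a question.",
    }
    return canned[question]


report = run_suite(stub_agent, GOLDEN_CASES)
print(report)
```

Reporting the failing case IDs alongside the pass rate matters in practice: a pass rate alone tells you something broke, not what.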
2. Trajectory evals
For agents specifically, the final answer isn't enough — you also need to score the trajectory: did the agent call the right tools, in roughly the right order, with reasonable arguments? Common metrics:
- Tool-selection accuracy — did the agent pick the correct tool for the task?
- Trajectory length — how many steps did it take vs. the optimal path?
- Redundant calls — did the agent call the same tool with the same args twice?
- Loop reliability — what fraction of runs terminate successfully vs. hit the iteration cap?
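One possible way to compute these four metrics, assuming each run has been reduced to an ordered list of `(tool name, args)` pairs and that each test case carries a reference list of expected tools. The `Run` record and the exact metric definitions (for example, ignoring call order in tool-selection accuracy) are choices made for this sketch; stricter definitions are equally valid.

```python
import json
from dataclasses import dataclass


@dataclass
class Run:
    tool_calls: list[tuple[str, dict]]  # (tool name, args) in call order
    expected_tools: list[str]           # reference trajectory for this case
    terminated: bool                    # False if the run hit the iteration cap


def tool_selection_accuracy(run: Run) -> float:
    """Fraction of expected tools the agent actually called (order ignored)."""
    if not run.expected_tools:
        return 1.0
    called = {name for name, _ in run.tool_calls}
    return sum(t in called for t in run.expected_tools) / len(run.expected_tools)


def length_ratio(run: Run) -> float:
    """Steps taken divided by the reference path length; 1.0 means optimal."""
    return len(run.tool_calls) / max(len(run.expected_tools), 1)


def redundant_calls(run: Run) -> int:
    """Count calls that repeat an earlier call with the same tool and same args."""
    seen: set[tuple[str, str]] = set()
    duplicates = 0
    for name, args in run.tool_calls:
        key = (name, json.dumps(args, sort_keys=True))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates


def loop_reliability(runs: list[Run]) -> float:
    """Fraction of runs that terminated on their own instead of hitting the cap."""
    return sum(r.terminated for r in runs) / len(runs)


good = Run(
    tool_calls=[("search", {"q": "a"}), ("search", {"q": "a"}), ("fetch", {"url": "u"})],
    expected_tools=["search", "fetch"],
    terminated=True,
)
stuck = Run(tool_calls=[("search", {"q": "b"})] * 3, expected_tools=["search", "fetch"], terminated=False)
```

Serializing the args with `sort_keys=True` makes two calls with the same arguments in a different key order count as the same call.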
3. Online evals on production traffic
Sample 1-5% of production runs, score them with LLM-as-judge or human review, and dashboard the trend over time. Catches model-update regressions and slow drift that offline evals miss.
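```python
from langfuse import get_client

langfuse = get_client()

EVAL_PROMPT = """
Score the assistant's answer against the user's question on:
- faithfulness (1-5): claims are supported by tool outputs
- relevance (1-5): answer addresses the actual question
- completeness (1-5): all parts of the question are answered

Question: {question}
Tool calls: {tool_calls}
Answer: {answer}

Return JSON: {{"faithfulness": int, "relevance": int, "completeness": int, "rationale": str}}
"""

# judge_llm and extract_tool_calls are your own helpers (not shown):
# judge_llm calls a stronger model and returns the parsed JSON as a dict;
# extract_tool_calls pulls the tool observations out of the fetched trace.
# Client method names below follow the Langfuse Python SDK v3; earlier SDK
# versions name them differently, so check your installed version.
async def score_trace(trace_id: str):
    trace = langfuse.api.trace.get(trace_id)
    judgment = await judge_llm(EVAL_PROMPT.format(
        question=trace.input,
        tool_calls=extract_tool_calls(trace),
        answer=trace.output,
    ))
    # Record each numeric dimension as its own score on the trace.
    for name, val in judgment.items():
        if name != "rationale":
            langfuse.create_score(trace_id=trace_id, name=name, value=val)
```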
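Choosing which 1-5% of runs to score is a separate decision from how to score them. A minimal sketch of one approach, hash-based sampling on the trace ID, is shown below; the function name, the 2% default, and the `trace-N` IDs are assumptions for the example. Hashing rather than calling a random number generator means the same trace is always either in or out of the sample, so re-running the scoring job does not change the sampled set.

```python
import hashlib


def should_sample(trace_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if should_sample(t)]
print(f"sampled {len(sampled)} of {len(trace_ids)} traces")
```

Only the sampled trace IDs would then be passed to a scoring function such as `score_trace`.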
The Five Alerts You Want On Day One
- Cost spike — any 5-minute window with cost > 3× the trailing 24-hour median. Catches loop runaways and prompt-injection cost attacks.
- Loop length spike — p95 iterations per agent run jumps. Suggests the model is thrashing on a new failure mode.
- Tool error rate — any tool's error rate > 5% over 5 minutes. Catches API outages and credential expiries.
- Latency breach: time-to-first-token (TTFT) or final-answer p95 latency exceeds the SLA. Catches model degradation, slow tools, and queue buildup.
- Eval score regression — golden-suite pass rate drops > 2 points after a deploy. Auto-rollback if you trust your evals.
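The alert conditions above reduce to a handful of small predicates. This is a sketch, not a monitoring integration: the function names are invented, the inputs are assumed to come from your metrics store, and the 1.5x factor for loop-length spikes is an assumption, since the list does not specify one. The 3x, 5%, and 2-point thresholds are taken from the list.

```python
from statistics import median


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = max((95 * len(ordered) + 99) // 100, 1)  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def cost_spike(window_cost: float, trailing_window_costs: list[float], factor: float = 3.0) -> bool:
    """Current 5-minute cost exceeds `factor` x the median of the trailing 24h of 5-minute windows."""
    if not trailing_window_costs:
        return False
    return window_cost > factor * median(trailing_window_costs)


def loop_length_spike(current_iterations: list[int], baseline_p95: float, factor: float = 1.5) -> bool:
    """p95 iterations per run jumped well above the baseline (factor is an assumed threshold)."""
    return p95(current_iterations) > factor * baseline_p95


def tool_error_alert(errors: int, calls: int, threshold: float = 0.05) -> bool:
    """A tool's error rate over the last 5 minutes exceeds the threshold."""
    return calls > 0 and errors / calls > threshold


def latency_breach(latencies_s: list[float], sla_s: float) -> bool:
    """p95 latency (TTFT or final answer) exceeds the SLA."""
    return p95(latencies_s) > sla_s


def eval_regression(pass_rate_before: float, pass_rate_after: float, max_drop_points: float = 2.0) -> bool:
    """Golden-suite pass rate (in percent) dropped by more than `max_drop_points` after a deploy."""
    return pass_rate_before - pass_rate_after > max_drop_points


if cost_spike(10.0, [1.0, 2.0, 3.0]):
    print("ALERT: cost spike")
```

Using the median rather than the mean for the cost baseline keeps one earlier runaway from raising the threshold and masking the next one.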
claude-sonnet-4-6 to 4-7). Your eval suite catches the change in 4 minutes; production users see it instantly. Mitigation: pin model versions in production, run scheduled eval suites against pinned versions, and only roll forward after evals pass on staging. Both Anthropic and OpenAI document model-version pinning for exactly this reason.
Debugging A Bad Agent Run
The standard workflow when a user reports "the agent did the wrong thing":
1. Look up the trace by user ID + timestamp in Langfuse / Braintrust / etc.
2. Replay the span tree: read the model's actual reasoning and see which tool was called.
3. Identify which step went wrong: bad tool selection? Bad tool result? Bad synthesis?
4. Add the trace as a test case to the regression suite (with the corrected expectation).
5. Iterate on the prompt or tool description until the eval passes.
6. Re-run the full eval suite to make sure the fix didn't break other cases.
Steps 1-3 are impossible without tracing. Steps 4-6 are impossible without an eval framework. The combination is what turns "agents are flaky" into "agents have a regression and we know which line to fix."
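Turning a bad trace into a regression case can be as simple as the sketch below. The trace dictionary shape, the case fields, and the JSONL file layout are assumptions for this example; adapt them to whatever your tracing tool exports and your eval harness reads.

```python
import json


def trace_to_case(trace: dict, corrected_output: str, note: str) -> dict:
    """Turn a failing production trace into a golden-dataset case."""
    return {
        "id": f"regression-{trace['id']}",
        "input": trace["input"],
        "expected": corrected_output,   # what the agent should have answered
        "observed": trace["output"],    # kept for context when reviewing the case later
        "note": note,
    }


def append_case(path: str, case: dict) -> None:
    """Append the case to a JSONL golden dataset, one case per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")


# Hypothetical trace exported from the tracing tool.
bad_trace = {
    "id": "abc123",
    "input": "What is our refund window?",
    "output": "Refunds are accepted within 14 days.",
}
case = trace_to_case(
    bad_trace,
    corrected_output="Refunds are accepted within 30 days.",
    note="Agent answered from a stale policy document.",
)
print(json.dumps(case, indent=2))
```

Keeping the observed output next to the corrected expectation makes it easier, months later, to remember why the case exists.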
- OpenTelemetry: Semantic Conventions for Generative AI (opentelemetry.io)
- Langfuse: Open Source LLM Observability (langfuse.com)
- Braintrust: LLM Evaluation & Observability (braintrust.dev)
- LangSmith: Evaluation Concepts (docs.smith.langchain.com)
- UK AISI Inspect: Model evaluation framework (github.com)