DAY 6

07 / 09

Agent Observability & Evals

6 min · Advanced · 1,340 words

Trace every span, score every run, and catch regressions before users do. The OpenTelemetry GenAI spec, the Langfuse stack, golden datasets, LLM-as-judge graders, and the five alerts every production agent needs.

What you will learn

01 What's Different About Agent Observability

02 Tracing — The Span Hierarchy

03 The Open-Source Observability Stack

04 Evals — Catching Regressions Before Users Do

05 The Five Alerts You Want On Day One

06 Debugging A Bad Agent Run

If you cannot replay every decision an agent made, you cannot fix it. Agents are stochastic systems: the same input can produce different traces, so classical APM (CPU graphs, p95 latency) tells you that a run was slow or failed but not why. You need span-level tracing of every LLM call and tool result, plus an eval harness that catches regressions before users do. This chapter covers both halves of agent observability and the open-source stack that is emerging as the standard.

🔑

The two halves of agent observability

1) Online — trace every production agent run. Catch failures, attribute cost, debug specific bad outputs. 2) Offline — eval harnesses with golden datasets and LLM-as-judge graders. Catch regressions on every code/prompt change. Both are non-negotiable. Skipping either is how production agents quietly drift into dysfunction.

What's Different About Agent Observability

Traditional APM tracks requests — one input, one output, one duration. An agent run is a tree: a single user request fans out into N LLM calls, M tool invocations, possibly P sub-agent runs. The metrics that matter are different too:

Traditional APM	Agent observability
Request latency p50/p95/p99	Time-to-first-token, time-to-final-answer, loop iterations
Error rate	Tool error rate, model error rate, schema validation rate, hallucination rate
Throughput (req/s)	Token throughput (in/out), cost per request, cache hit rate
CPU / memory	Context length used, prompt cache utilization, tools called per turn
Stack trace on error	Full message history including tool calls + results — the "reasoning trace"
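Several of the agent-side metrics in the table can be derived from per-call token usage. The following is a minimal sketch, assuming a hypothetical `LLMCall` record and made-up per-token prices; neither comes from a specific provider or tool.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_MTOK = {"input": 3.00, "cached_input": 0.30, "output": 15.00}

@dataclass
class LLMCall:
    input_tokens: int          # uncached prompt tokens
    cached_input_tokens: int   # prompt tokens served from the prompt cache
    output_tokens: int

def run_metrics(calls: list[LLMCall]) -> dict:
    """Roll per-call usage up into per-run cost and cache metrics."""
    uncached = sum(c.input_tokens for c in calls)
    cached = sum(c.cached_input_tokens for c in calls)
    output = sum(c.output_tokens for c in calls)
    cost = (uncached * PRICE_PER_MTOK["input"]
            + cached * PRICE_PER_MTOK["cached_input"]
            + output * PRICE_PER_MTOK["output"]) / 1_000_000
    prompt_total = uncached + cached
    return {
        "loop_iterations": len(calls),   # one LLM call per loop iteration
        "cost_usd": round(cost, 6),
        "cache_hit_rate": cached / prompt_total if prompt_total else 0.0,
    }

metrics = run_metrics([LLMCall(1000, 0, 200), LLMCall(500, 1000, 300)])
```

Treating one LLM call as one loop iteration is a simplification; an agent framework that makes several model calls per iteration would need its own counter.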

Tracing — The Span Hierarchy

The OpenTelemetry GenAI semantic conventions define the span shape that most observability tools are converging on. A typical agent trace is a tree of nested spans:

[Figure: a real agent run rendered as a span tree. The OpenTelemetry GenAI spec defines field names; tools like Langfuse, Braintrust, and Arize Phoenix render this tree with cost attribution and replayable inputs/outputs.]
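To make the tree shape concrete, here is a minimal sketch that models spans as nested records and renders them with rolled-up cost. The `Span` class, the span names, and the numbers are illustrative assumptions; this is not the OpenTelemetry SDK and not output from a real run.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: int
    cost_usd: float = 0.0               # cost incurred by this span itself
    children: list["Span"] = field(default_factory=list)

    def total_cost(self) -> float:
        """Own cost plus the cost of every descendant span."""
        return self.cost_usd + sum(c.total_cost() for c in self.children)

def render(span: Span, depth: int = 0) -> list[str]:
    """Return one indented line per span, depth-first."""
    indent = "  " * depth
    lines = [f"{indent}{span.name} ({span.duration_ms} ms, ${span.total_cost():.4f})"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines

trace = Span("agent.run", 4200, children=[
    Span("llm.plan", 900, cost_usd=0.004),
    Span("tool.search", 1300),
    Span("llm.synthesize", 1800, cost_usd=0.011),
])
print("\n".join(render(trace)))
```

The root span carries no cost of its own; its displayed cost is the sum of its children, which is the kind of attribution a trace viewer shows per subtree.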

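Attributes to record per span

The conventions themselves mark only a few of these as required and the rest as recommended; record them all anyway.

gen_ai.system: "anthropic", "openai", etc. (newer versions of the conventions rename this to gen_ai.provider.name)
gen_ai.request.model, gen_ai.response.model
gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, plus a cached-input-token count where the provider reports one (written here as gen_ai.usage.cached_input_tokens; check the current spec for the exact name)
gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.max_tokens
For tool calls: tool.name, tool.args (JSON), tool.result (JSON or summary), tool.is_error (illustrative names; the conventions define gen_ai.tool.name and related attributes)
session.id and user.id (where applicable, with PII redaction)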
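The attribute list above can be assembled in one place so that every LLM span is tagged consistently. This is a sketch under assumptions: `llm_span_attributes` and `redact` are hypothetical helpers, the salted-hash redaction is just one possible PII policy, and the model names are placeholders.

```python
import hashlib

def redact(value: str, salt: str = "rotate-me") -> str:
    """Replace an identifier with a stable, salted hash prefix (hypothetical policy)."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

def llm_span_attributes(system, request_model, response_model,
                        input_tokens, output_tokens, *,
                        temperature=None, session_id=None, user_id=None) -> dict:
    """Build a flat attribute dict using the attribute names listed above."""
    attrs = {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.request.temperature": temperature,
        "session.id": session_id,
        "user.id": redact(user_id) if user_id else None,
    }
    # Drop unset values instead of recording nulls.
    return {k: v for k, v in attrs.items() if v is not None}

attrs = llm_span_attributes("anthropic", "example-model", "example-model-001",
                            1200, 350, temperature=0.2, user_id="alice@example.com")
```

The resulting dict can be handed to whatever tracing SDK is in use, for example through an OpenTelemetry span's `set_attributes` method.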

The Open-Source Observability Stack

Langfuse — the OSS standard

Langfuse is one of the most widely deployed open-source LLM observability tools in 2025-2026. It is self-hostable and OpenTelemetry-compatible, with first-class agent tracing, prompt versioning, and built-in eval runners.

Python · Langfuse with the @observe decorator

import asyncio

from langfuse import observe, get_client

langfuse = get_client()

# plan_agent, worker_agent, and synthesize are your own functions (not shown).
@observe(name="research-agent")
async def run_agent(query: str):
    # Nested calls show up as child spans when they are also decorated with
    # @observe or go through an instrumented client.
    plan = await plan_agent(query)
    digests = await asyncio.gather(*[
        worker_agent(t) for t in plan.subtasks])
    answer = await synthesize(query, digests)

    # Attach scores at runtime; these feed online evals
    langfuse.score_current_trace(
        name="answer_complete", value=1 if answer.is_complete else 0)
    return answer

Other contenders

Braintrust — paid SaaS, very strong on offline evals + production tracing. Excellent UI for diff-ing prompt changes against eval suites.
LangSmith — the LangChain ecosystem's tracing tool. Best fit if you're already on LangChain/LangGraph.
Arize Phoenix — open-source. Strong on RAG-specific observability (embedding drift, retrieval evals).
Helicone — proxy-based tracing; cost-tracking strength. Easy to drop in without code changes.

Evals — Catching Regressions Before Users Do

Tracing tells you what happened in production. Evals tell you whether a change you're about to ship will make things worse. Every serious agent has at least one eval suite — usually three:

1. Golden dataset regression suite

50-500 hand-curated test cases covering happy paths, edge cases, and known failure modes. Run the suite on every prompt or code change. Each case is checked in one of three ways:

Deterministic assertion — exact-match, regex, JSON schema, list contains.
LLM-as-judge — a stronger model scores the output against a rubric.
Comparison — pairwise A/B between old and new prompt outputs.
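The deterministic half of such a suite fits in a few lines. Below is a minimal sketch, assuming a hypothetical case schema (`input`, `assert`, `expected`) and a stand-in `fake_agent`; the `json_keys` check is a simplified substitute for full JSON Schema validation.

```python
import json
import re

def check(case: dict, output: str) -> bool:
    """Apply one deterministic assertion to an agent output."""
    kind, expected = case["assert"], case["expected"]
    if kind == "exact":
        return output.strip() == expected
    if kind == "regex":
        return re.search(expected, output) is not None
    if kind == "contains_all":
        return all(item in output for item in expected)
    if kind == "json_keys":
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(parsed, dict) and set(expected) <= parsed.keys()
    raise ValueError(f"unknown assertion type: {kind}")

def run_suite(cases: list[dict], agent) -> float:
    """Run every case through `agent` and return the pass rate in percent."""
    passed = sum(check(case, agent(case["input"])) for case in cases)
    return 100.0 * passed / len(cases)

GOLDEN = [
    {"input": "2+2", "assert": "exact", "expected": "4"},
    {"input": "order id?", "assert": "regex", "expected": r"ORD-\d{4}"},
    {"input": "profile", "assert": "json_keys", "expected": ["name", "email"]},
]

def fake_agent(prompt: str) -> str:
    # Stand-in for a real agent call; the profile answer is deliberately incomplete.
    return {"2+2": "4", "order id?": "Your order is ORD-1234.",
            "profile": '{"name": "Ada"}'}[prompt]

pass_rate = run_suite(GOLDEN, fake_agent)
```

Here two of the three cases pass, because the profile output lacks the `email` key. LLM-as-judge and pairwise cases would plug into the same loop as additional assertion kinds.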

2. Trajectory evals

For agents specifically, the final answer isn't enough — you also need to score the trajectory: did the agent call the right tools, in roughly the right order, with reasonable arguments? Common metrics:

Tool-selection accuracy — did the agent pick the correct tool for the task?
Trajectory length — how many steps did it take vs. the optimal path?
Redundant calls — did the agent call the same tool with the same args twice?
Loop reliability — what fraction of runs terminate successfully vs. hit the iteration cap?
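The metrics above can be computed from a recorded list of tool calls. The sketch below assumes a hypothetical step format (`tool`, `args`) and a reference list of expected tools; it ignores call order and measures tool selection as the share of reference tools that were used, which is only one of several reasonable definitions.

```python
import json

def trajectory_metrics(actual: list[dict], expected_tools: list[str],
                       hit_iteration_cap: bool) -> dict:
    """Score one run's tool-call trajectory against a reference path."""
    used = {step["tool"] for step in actual}
    # Tool-selection accuracy: share of reference tools that were actually used.
    selection = sum(t in used for t in expected_tools) / len(expected_tools)
    # Redundant calls: repeats of an identical (tool, args) pair.
    seen, redundant = set(), 0
    for step in actual:
        key = (step["tool"], json.dumps(step["args"], sort_keys=True))
        redundant += key in seen
        seen.add(key)
    return {
        "tool_selection_accuracy": selection,
        "length_ratio": len(actual) / len(expected_tools),  # 1.0 means optimal
        "redundant_calls": redundant,
        "terminated": not hit_iteration_cap,
    }

run = [
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "search", "args": {"q": "refund policy"}},   # duplicate call
    {"tool": "fetch", "args": {"url": "https://example.com/policy"}},
]
scores = trajectory_metrics(run, ["search", "fetch"], hit_iteration_cap=False)
```

Loop reliability is then the mean of `terminated` across many runs rather than a property of any single run.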

3. Online evals on production traffic

Sample 1-5% of production runs, score them with LLM-as-judge or human review, and dashboard the trend over time. Catches model-update regressions and slow drift that offline evals miss.
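One way to pick the sample is to hash the trace ID, so the same trace is always in or out of the sample no matter which worker decides. This is a sketch of that approach, assuming string trace IDs and a hypothetical `in_sample` helper; the text does not prescribe a sampling method.

```python
import hashlib

def in_sample(trace_id: str, rate: float = 0.02) -> bool:
    """Deterministically select about `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < rate

# Roughly 2% of these 10,000 synthetic IDs end up in the sample.
sampled = [f"trace-{i}" for i in range(10_000) if in_sample(f"trace-{i}")]
```

Deterministic sampling also makes the scored set reproducible, which helps when a judge prompt changes and earlier traces need to be rescored.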

Python · LLM-as-judge eval with Langfuse

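from langfuse import get_client

# Assumes the Langfuse Python SDK v3; v2 used different method names.
langfuse = get_client()

EVAL_PROMPT = """
Score the assistant's answer against the user's question on:
- faithfulness (1-5): claims are supported by tool outputs
- relevance    (1-5): answer addresses the actual question
- completeness (1-5): all parts of the question are answered

Question: {question}
Tool calls: {tool_calls}
Answer: {answer}

Return JSON: {{"faithfulness": int, "relevance": int, "completeness": int, "rationale": str}}
"""

async def score_trace(trace_id: str):
    # Fetch the full trace, including its observations (synchronous API call).
    trace = langfuse.api.trace.get(trace_id)
    # Assumption: tool calls were recorded as named spans under the trace.
    tool_calls = [
        {"name": o.name, "input": o.input, "output": o.output}
        for o in trace.observations if o.type == "SPAN"]
    # judge_llm is a placeholder for your own call to a stronger model;
    # it is expected to return the parsed JSON object as a dict.
    judgment = await judge_llm(EVAL_PROMPT.format(
        question=trace.input, tool_calls=tool_calls, answer=trace.output))
    rationale = judgment.get("rationale")
    for name, val in judgment.items():
        if name != "rationale":
            langfuse.create_score(
                trace_id=trace_id, name=name, value=val, comment=rationale)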

The Five Alerts You Want On Day One

Cost spike — any 5-minute window with cost > 3× the trailing 24-hour median. Catches loop runaways and prompt-injection cost attacks.
Loop length spike — p95 iterations per agent run jumps. Suggests the model is thrashing on a new failure mode.
Tool error rate — any tool's error rate > 5% over 5 minutes. Catches API outages and credential expiries.
TTFT or final-answer p95 latency > SLA. Catches model degradation, slow tools, queue buildup.
Eval score regression — golden-suite pass rate drops > 2 points after a deploy. Auto-rollback if you trust your evals.
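The first and third alerts reduce to simple threshold checks. The sketch below assumes costs are already aggregated into one total per 5-minute window; `cost_spike` and `tool_error_alert` are hypothetical helper names, not part of any monitoring product.

```python
from statistics import median

WINDOWS_PER_DAY = 288   # 24 hours of 5-minute windows

def cost_spike(window_costs: list[float], factor: float = 3.0) -> bool:
    """True if the latest window costs more than `factor` x the trailing median.

    `window_costs` holds one cost total per 5-minute window, oldest first,
    with the current window last.
    """
    if len(window_costs) < 2:
        return False                      # nothing to compare against yet
    *history, current = window_costs
    return current > factor * median(history[-WINDOWS_PER_DAY:])

def tool_error_alert(calls: int, errors: int, threshold: float = 0.05) -> bool:
    """True if a tool's error rate over the window exceeds the threshold."""
    return calls > 0 and errors / calls > threshold
```

Using the median rather than the mean keeps a single earlier runaway window from raising the baseline and masking the next one.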

⚠️

The "silent regression"

A production-tested model gets transparently upgraded by the provider (or you bump claude-sonnet-4-6 to 4-7). Even if your eval suite flags the change within 4 minutes, production users have already seen it. Mitigation: pin model versions in production, run scheduled eval suites against pinned versions, and only roll forward after evals pass on staging. Both Anthropic and OpenAI document model-version pinning for exactly this reason.
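The mitigation can be enforced in code with a pin table and a roll-forward gate. This is a minimal sketch; the role name, the dated model identifier, and both helper functions are hypothetical, and the 2-point tolerance simply mirrors the eval-regression alert above.

```python
PINNED = {"planner": "example-model-2025-01-01"}   # hypothetical dated identifier

def resolve_model(role: str, pinned: dict[str, str]) -> str:
    """Map a logical role to its pinned model ID; refuse anything unpinned."""
    if role not in pinned:
        raise KeyError(f"no pinned model for role {role!r}")
    return pinned[role]

def should_roll_forward(pinned_pass_rate: float, candidate_pass_rate: float,
                        max_drop_points: float = 2.0) -> bool:
    """Allow the upgrade only if the pass rate fell by at most `max_drop_points`."""
    return pinned_pass_rate - candidate_pass_rate <= max_drop_points
```

Production code asks `resolve_model` for a role instead of naming a model directly, so changing the pin is a single reviewed edit that can be gated on `should_roll_forward`.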

Debugging A Bad Agent Run

The standard workflow when a user reports "the agent did the wrong thing":

1. Look up the trace by user ID + timestamp in Langfuse / Braintrust / etc.
2. Replay the span tree: read the model's actual reasoning, see which tool was called.
3. Identify which step went wrong: bad tool selection? Bad tool result? Bad synthesis?
4. Add the trace as a test case to the regression suite (with the corrected expectation).
5. Iterate on the prompt or tool description until the eval passes.
6. Re-run the full eval suite to make sure the fix didn't break other cases.

Steps 1-3 are impossible without tracing. Steps 4-6 are impossible without an eval framework. The combination is what turns "agents are flaky" into "agents have a regression and we know which line to fix."
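Step 4, turning a bad trace into a regression case, might look like the following. This is a sketch under assumptions: the trace is a plain dict with `id`, `input`, and `output` keys, the case schema is hypothetical, and cases are stored as JSON lines in a file of your choosing.

```python
import json

def trace_to_case(trace: dict, corrected_expected: str, note: str) -> dict:
    """Turn a failing production trace into a golden-suite regression case."""
    return {
        "id": f"regression-{trace['id']}",
        "input": trace["input"],
        "bad_output": trace["output"],      # kept for reference, never asserted on
        "assert": "contains_all",           # hypothetical assertion kind
        "expected": [corrected_expected],
        "note": note,
    }

def append_case(path: str, case: dict) -> None:
    """Append one case as a JSON line so the suite file diffs cleanly."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(case, sort_keys=True) + "\n")

failing = {"id": "abc123", "input": "What is the refund window?",
           "output": "Refunds are available for 90 days."}
case = trace_to_case(failing, "30 days", "agent quoted the wrong policy document")
```

Keeping the original bad output alongside the corrected expectation documents why the case exists when someone reads the suite months later.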

🔑

Key takeaways

1) Trace every agent run as a span tree — every LLM call, every tool, every error. 2) Use the OpenTelemetry GenAI conventions; the ecosystem is converging on them. 3) Build an eval suite before launch — golden cases + LLM-as-judge + trajectory scores. 4) Sample production traffic for online evals; catches drift offline tests miss. 5) Five alerts on day one: cost, loop length, tool errors, latency, eval regression.

📚 Further reading

OpenTelemetry: Semantic Conventions for Generative AI (opentelemetry.io)
Langfuse: Open Source LLM Observability (langfuse.com)
Braintrust: LLM Evaluation & Observability (braintrust.dev)
LangSmith: Evaluation Concepts (docs.smith.langchain.com)
UK AISI Inspect: Model evaluation framework (github.com)
