Agent Observability & Evals

6 min · Advanced · 1,340 words
Trace every span, score every run, and catch regressions before users do. The OpenTelemetry GenAI spec, the Langfuse stack, golden datasets, LLM-as-judge graders, and the five alerts every production agent needs.

What you will learn

01 What's Different About Agent Observability
02 Tracing: The Span Hierarchy
03 The Open-Source Observability Stack
04 Evals: Catching Regressions Before Users Do
05 The Five Alerts You Want On Day One
06 Debugging A Bad Agent Run

If you cannot replay every decision an agent made, you cannot fix it. Agents are stochastic systems: the same input can produce different traces, so classical APM (CPU graphs, p95 latency) tells you little about why a run went wrong. You need span-level tracing of every LLM call and tool result, plus an eval harness that catches regressions before users do. This chapter covers both halves of agent observability and the open-source stack that's emerging as the standard.

🔑
The two halves of agent observability
1) Online — trace every production agent run. Catch failures, attribute cost, debug specific bad outputs. 2) Offline — eval harnesses with golden datasets and LLM-as-judge graders. Catch regressions on every code/prompt change. Both are non-negotiable. Skipping either is how production agents quietly drift into dysfunction.

What's Different About Agent Observability

Traditional APM tracks requests — one input, one output, one duration. An agent run is a tree: a single user request fans out into N LLM calls, M tool invocations, possibly P sub-agent runs. The metrics that matter are different too:

Traditional APM              | Agent observability
Request latency p50/p95/p99  | Time-to-first-token, time-to-final-answer, loop iterations
Error rate                   | Tool error rate, model error rate, schema validation rate, hallucination rate
Throughput (req/s)           | Token throughput (in/out), cost per request, cache hit rate
CPU / memory                 | Context length used, prompt cache utilization, tools called per turn
Stack trace on error         | Full message history including tool calls + results (the "reasoning trace")

Tracing — The Span Hierarchy

The OpenTelemetry GenAI semantic conventions define the span and attribute names that LLM observability tools are converging on. A typical agent trace:

SPAN HIERARCHY OF ONE AGENT RUN

agent.run · "research X" · 12.4s · $0.18 · 14 spans
  llm.call · plan · 1.2s · 3.1k tokens
  tool.web_search · "MCP adoption" · 0.8s
  tool.web_search · "AutoGen" · 0.9s · ⚡ parallel
  tool.fetch_url · 1.4s · 18 KB
  tool.fetch_url · ❌ 504 timeout · retry
  tool.fetch_url · ✓ retry · 1.1s
  llm.call · synthesize · 3.8s · 18.4k tokens · ⚡ cache hit
  eval.faithfulness · 0.92

Each span carries: model, tokens (in/cached/out), cost, latency, parent ID, error if any.
A real agent run rendered as a span tree. The OpenTelemetry GenAI spec defines field names; tools like Langfuse, Braintrust, and Arize Phoenix render this tree with cost attribution and replayable inputs/outputs.
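
To make the span tree concrete, here is a minimal sketch of the data structure behind it, using only the standard library. The `Span` class and its field names are illustrative assumptions, not an OpenTelemetry or Langfuse type, and the per-span costs are made-up numbers chosen to add up to the $0.18 shown above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """One node in an agent trace (hypothetical shape, not a library type)."""
    name: str
    latency_s: float
    cost_usd: float = 0.0
    error: str | None = None
    children: list[Span] = field(default_factory=list)

    def total_cost(self) -> float:
        # Cost of this span plus everything beneath it.
        return self.cost_usd + sum(c.total_cost() for c in self.children)

    def failed_spans(self) -> list[str]:
        # Depth-first list of span names that recorded an error.
        own = [self.name] if self.error else []
        return own + [n for c in self.children for n in c.failed_spans()]


# Illustrative numbers only.
run = Span("agent.run", 12.4, children=[
    Span("llm.call:plan", 1.2, cost_usd=0.03),
    Span("tool.fetch_url", 1.4, error="504 timeout"),
    Span("tool.fetch_url", 1.1),
    Span("llm.call:synthesize", 3.8, cost_usd=0.15),
])
print(f"${run.total_cost():.2f}", run.failed_spans())
```

Rolling cost up from the leaves is what lets a trace viewer show a single price for the whole run while still attributing it to individual calls.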

Required attributes per span

  • gen_ai.system: "anthropic", "openai", etc.
  • gen_ai.request.model, gen_ai.response.model
  • gen_ai.usage.input_tokens, gen_ai.usage.cached_input_tokens, gen_ai.usage.output_tokens
  • gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.max_tokens
  • For tool calls: tool.name, tool.args (JSON), tool.result (JSON or summary), tool.is_error
  • session.id and user.id (where applicable, with PII redaction)
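
As a rough illustration of the list above, the sketch below builds a flat attribute dictionary for one LLM call. The helper name `llm_span_attributes` and the hashing approach to redaction are assumptions made for this example; the attribute keys are the ones listed above, and your tracing SDK may set some of them for you.

```python
import hashlib


def llm_span_attributes(system, request_model, response_model,
                        input_tokens, cached_input_tokens, output_tokens,
                        user_id=None, temperature=None):
    """Build a flat attribute dict using the names listed above."""
    attrs = {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.cached_input_tokens": cached_input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    if temperature is not None:
        attrs["gen_ai.request.temperature"] = temperature
    if user_id is not None:
        # Redaction choice (an assumption): store a truncated SHA-256 digest
        # so traces can be grouped per user without exposing the raw ID.
        attrs["user.id"] = hashlib.sha256(user_id.encode()).hexdigest()[:16]
    return attrs


attrs = llm_span_attributes("anthropic", "model-a", "model-a",
                            input_tokens=3100, cached_input_tokens=1200,
                            output_tokens=400, user_id="alice@example.com")
```

An unsalted hash only hides the raw value; if user IDs are guessable, a keyed hash or a lookup table of opaque IDs is the safer choice.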

The Open-Source Observability Stack

Langfuse — the OSS standard

Langfuse is one of the most widely deployed open-source LLM observability tools in 2025-2026. It is self-hostable and OpenTelemetry-compatible, with first-class agent tracing, prompt versioning, and built-in eval runners.

Python · Langfuse with the @observe decorator
import asyncio

from langfuse import observe, get_client

langfuse = get_client()

# plan_agent, worker_agent, and synthesize are the application's own coroutines;
# decorate each with @observe so it shows up as a nested span.
@observe(name="research-agent")
async def run_agent(query: str):
    # Nested calls are traced when they are also decorated or instrumented
    plan = await plan_agent(query)
    digests = await asyncio.gather(*[
        worker_agent(t) for t in plan.subtasks])
    answer = await synthesize(query, digests)

    # Attach scores at runtime; these feed online evals
    langfuse.score_current_trace(
        name="answer_complete", value=1 if answer.is_complete else 0)
    return answer

Other contenders

  • Braintrust — paid SaaS, very strong on offline evals + production tracing. Excellent UI for diff-ing prompt changes against eval suites.
  • LangSmith — the LangChain ecosystem's tracing tool. Best fit if you're already on LangChain/LangGraph.
  • Arize Phoenix — open-source. Strong on RAG-specific observability (embedding drift, retrieval evals).
  • Helicone — proxy-based tracing; cost-tracking strength. Easy to drop in without code changes.

Evals — Catching Regressions Before Users Do

Tracing tells you what happened in production. Evals tell you whether a change you're about to ship will make things worse. Every serious agent has at least one eval suite — usually three:

1. Golden dataset regression suite

50-500 hand-curated test cases covering happy paths, edge cases, and known failure modes. Run the suite on every prompt or code change. Each case is checked in one of three ways:

  • Deterministic assertion — exact-match, regex, JSON schema, list contains.
  • LLM-as-judge — a stronger model scores the output against a rubric.
  • Comparison — pairwise A/B between old and new prompt outputs.
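
The deterministic assertions are the cheapest to implement, so they are a good place to start. The sketch below assumes a hypothetical case format, a dict with `kind` and `expected` keys, and uses a required-keys check as a lightweight stand-in for full JSON schema validation.

```python
import json
import re


def check_case(output: str, case: dict) -> bool:
    """Apply one deterministic assertion to an agent output.

    `case` is a hypothetical dict: {"kind": ..., "expected": ...}.
    """
    kind, expected = case["kind"], case["expected"]
    if kind == "exact":
        return output.strip() == expected
    if kind == "regex":
        return re.search(expected, output) is not None
    if kind == "json_keys":
        # Stand-in for a schema check: the output must be a JSON object
        # containing every required key.
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and all(k in data for k in expected)
    if kind == "contains_all":
        return all(item in output for item in expected)
    raise ValueError(f"unknown assertion kind: {kind}")


def pass_rate(results: list[bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty suite."""
    return sum(results) / len(results) if results else 0.0
```

Raising on an unknown `kind` is deliberate: a typo in the dataset should fail the run loudly rather than count as a silent pass.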

2. Trajectory evals

For agents specifically, the final answer isn't enough — you also need to score the trajectory: did the agent call the right tools, in roughly the right order, with reasonable arguments? Common metrics:

  • Tool-selection accuracy — did the agent pick the correct tool for the task?
  • Trajectory length — how many steps did it take vs. the optimal path?
  • Redundant calls — did the agent call the same tool with the same args twice?
  • Loop reliability — what fraction of runs terminate successfully vs. hit the iteration cap?
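
These metrics can be computed directly from the recorded tool calls. The sketch below is one possible set of definitions, not a standard: it assumes each run is a list of `(tool_name, args)` tuples, that you know which tools the task should use, and that runs end with a stop reason of `"completed"` when they terminate normally.

```python
import json


def trajectory_metrics(calls, expected_tools, optimal_steps):
    """Score one run. `calls` is a list of (tool_name, args_dict) tuples."""
    names = [name for name, _ in calls]
    # Tool-selection accuracy here: share of expected tools actually used.
    hits = sum(1 for t in expected_tools if t in names)
    selection_acc = hits / len(expected_tools) if expected_tools else 1.0
    # Redundant calls: same tool with identical arguments seen more than once.
    seen, redundant = set(), 0
    for name, args in calls:
        key = (name, json.dumps(args, sort_keys=True))
        redundant += key in seen
        seen.add(key)
    return {
        "tool_selection_accuracy": selection_acc,
        "length_ratio": len(calls) / optimal_steps,
        "redundant_calls": redundant,
    }


def loop_reliability(stop_reasons):
    """Fraction of runs that finished rather than hitting the iteration cap."""
    done = sum(1 for r in stop_reasons if r == "completed")
    return done / len(stop_reasons)


calls = [("web_search", {"q": "MCP adoption"}),
         ("web_search", {"q": "MCP adoption"}),
         ("fetch_url", {"url": "https://example.com"})]
metrics = trajectory_metrics(calls, ["web_search", "fetch_url"], optimal_steps=2)
```

Serializing the arguments with `sort_keys=True` means two calls with the same arguments in a different key order still count as duplicates.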

3. Online evals on production traffic

Sample 1-5% of production runs, score them with LLM-as-judge or human review, and dashboard the trend over time. Catches model-update regressions and slow drift that offline evals miss.
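
One simple way to pick the sample is to hash the trace ID, which keeps the decision deterministic: the same trace is always in or out, whichever service asks. This is a sketch of that idea; the function name and the 2% default are assumptions.

```python
import hashlib


def should_sample(trace_id: str, rate: float = 0.02) -> bool:
    """Deterministically pick roughly `rate` of traces for online scoring."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


sampled = sum(should_sample(f"trace-{i}", rate=0.05) for i in range(20000))
print(f"sampled {sampled} of 20000 traces")
```

Random sampling would also work, but a hash-based rule makes it possible to re-derive later which traces were meant to be scored.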

Python · LLM-as-judge eval with Langfuse
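import json

from langfuse import get_client

langfuse = get_client()

EVAL_PROMPT = """
Score the assistant's answer against the user's question on:
- faithfulness (1-5): claims are supported by tool outputs
- relevance    (1-5): answer addresses the actual question
- completeness (1-5): all parts of the question are answered

Question: {question}
Tool calls: {tool_calls}
Answer: {answer}

Return JSON: {{"faithfulness": int, "relevance": int, "completeness": int, "rationale": str}}
"""

async def score_trace(trace_id: str):
    # Assumes the Langfuse Python SDK v3 (api.trace.get, create_score)
    trace = langfuse.api.trace.get(trace_id)
    # Assumes tool spans are named "tool.*", as in the span tree earlier
    tool_calls = [
        {"name": o.name, "input": o.input, "output": o.output}
        for o in trace.observations
        if (o.name or "").startswith("tool.")
    ]
    # judge_llm is a placeholder for your own call to a stronger model;
    # it is expected to return the judge's JSON parsed into a dict
    judgment = await judge_llm(EVAL_PROMPT.format(
        question=trace.input,
        tool_calls=json.dumps(tool_calls, default=str),
        answer=trace.output))
    for name, val in judgment.items():
        if name != "rationale":
            langfuse.create_score(trace_id=trace_id, name=name, value=val)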
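
Judges do not always return well-formed output, so it helps to validate the reply before writing scores. A minimal sketch, assuming the judge returns the JSON object requested in the rubric above as a raw string; `parse_judgment` is a hypothetical helper, not part of Langfuse.

```python
import json

DIMENSIONS = ("faithfulness", "relevance", "completeness")


def parse_judgment(raw: str) -> dict:
    """Parse the judge's JSON reply and reject missing or out-of-range scores."""
    data = json.loads(raw)
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
        scores[dim] = value
    scores["rationale"] = str(data.get("rationale", ""))
    return scores


ok = parse_judgment(
    '{"faithfulness": 4, "relevance": 5, "completeness": 3, "rationale": "cites sources"}')
```

Rejecting a malformed judgment, rather than clamping it into range, keeps bad judge output from quietly skewing the dashboard.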

The Five Alerts You Want On Day One

  1. Cost spike: any 5-minute window whose cost exceeds 3× the trailing 24-hour median for 5-minute windows. Catches loop runaways and prompt-injection cost attacks.
  2. Loop length spike: p95 iterations per agent run jumps above its recent baseline. Suggests the model is thrashing on a new failure mode.
  3. Tool error rate: any tool's error rate > 5% over 5 minutes. Catches API outages and credential expiries.
  4. Latency: TTFT or final-answer p95 latency > SLA. Catches model degradation, slow tools, queue buildup.
  5. Eval score regression: golden-suite pass rate drops by more than 2 percentage points after a deploy. Auto-rollback if you trust your evals.
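
The first and third alerts reduce to a few lines of arithmetic. The sketch below assumes you already aggregate cost into 5-minute windows and count errors per tool; the function names and input shapes are hypothetical, and in practice these rules usually live in your metrics backend rather than in application code.

```python
from statistics import median


def cost_spike(window_costs, factor=3.0, windows_per_day=288):
    """True if the latest 5-minute cost exceeds factor x the trailing median.

    `window_costs` is a chronological list of per-window cost totals with at
    least two entries; 288 five-minute windows make up 24 hours.
    """
    *history, latest = window_costs
    baseline = median(history[-windows_per_day:])
    return latest > factor * baseline


def tool_error_alerts(counts, threshold=0.05):
    """`counts` maps tool name -> (errors, calls) for the last 5 minutes."""
    return sorted(tool for tool, (errors, calls) in counts.items()
                  if calls and errors / calls > threshold)


print(cost_spike([1.0] * 288 + [3.5]))
print(tool_error_alerts({"web_search": (1, 100), "fetch_url": (8, 100)}))
```

Using the median rather than the mean keeps one earlier spike from inflating the baseline and masking the next one.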
⚠️
The "silent regression"
A production-tested model gets silently upgraded by the provider (or you bump claude-sonnet-4-6 to 4-7). Even if your eval suite catches the change in 4 minutes, production users see it instantly. Mitigation: pin model versions in production, run scheduled eval suites against pinned versions, and only roll forward after evals pass on staging. Both Anthropic and OpenAI document model-version pinning for exactly this reason.
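
The roll-forward rule can be written as a small gate. This is a sketch under stated assumptions: pass rates are percentages from the golden suite run on staging, the 2-point tolerance is borrowed from the eval-regression alert, and the model strings simply reuse the version names from the example above rather than verified model IDs.

```python
PINNED_MODEL = "claude-sonnet-4-6"     # name taken from the example above
CANDIDATE_MODEL = "claude-sonnet-4-7"  # hypothetical upgrade target


def can_roll_forward(baseline_pass_rate: float, candidate_pass_rate: float,
                     max_drop_points: float = 2.0) -> bool:
    """Allow the upgrade only if the golden-suite pass rate (in percent)
    does not fall by more than `max_drop_points`."""
    return baseline_pass_rate - candidate_pass_rate <= max_drop_points


def model_for_production(baseline: float, candidate: float) -> str:
    """Return the model string production should use after the staging run."""
    return CANDIDATE_MODEL if can_roll_forward(baseline, candidate) else PINNED_MODEL


print(model_for_production(baseline=94.0, candidate=93.0))
print(model_for_production(baseline=94.0, candidate=90.5))
```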

Debugging A Bad Agent Run

The standard workflow when a user reports "the agent did the wrong thing":

  1. Look up the trace by user ID + timestamp in Langfuse / Braintrust / etc.
  2. Replay the span tree — read the model's actual reasoning, see which tool was called.
  3. Identify which step went wrong: bad tool selection? Bad tool result? Bad synthesis?
  4. Add the trace as a test case to the regression suite (with the corrected expectation).
  5. Iterate on the prompt or tool description until the eval passes.
  6. Re-run the full eval suite — make sure the fix didn't break other cases.

Steps 1-3 are impossible without tracing. Steps 4-6 are impossible without an eval framework. The combination is what turns "agents are flaky" into "agents have a regression and we know which line to fix."
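
Step 4 is the one most often skipped, so it is worth making cheap. A minimal sketch, assuming the trace has been exported as a dict with `id` and `input` keys and that the golden dataset is a JSONL file; the case fields and both helper names are hypothetical.

```python
import json


def trace_to_case(trace: dict, corrected_expectation: str, kind: str = "regex") -> dict:
    """Turn a bad production trace into a golden-suite case (step 4)."""
    return {
        "id": f"regression-{trace['id']}",
        "input": trace["input"],
        "kind": kind,
        "expected": corrected_expectation,
        "source_trace": trace["id"],
    }


def append_case(path: str, case: dict) -> None:
    # One JSON object per line keeps the dataset diff-friendly in code review.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")


case = trace_to_case({"id": "t-481", "input": "Summarize MCP adoption"},
                     corrected_expectation=r"Model Context Protocol")
```

Keeping the source trace ID on the case lets you jump from a failing eval back to the production run that motivated it.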

🔑
Key takeaways
1) Trace every agent run as a span tree — every LLM call, every tool, every error. 2) Use the OpenTelemetry GenAI conventions; the ecosystem is converging on them. 3) Build an eval suite before launch — golden cases + LLM-as-judge + trajectory scores. 4) Sample production traffic for online evals; catches drift offline tests miss. 5) Five alerts on day one: cost, loop length, tool errors, latency, eval regression.
