The Engineering Codex/Agentic AI with LLM APIs
DAY 7 · PM
09 / 09

Agent Frameworks — Choosing Your Stack

schedule6 minsignal_cellular_altIntermediate1,427 words
Claude Agent SDK, OpenAI Agents SDK, LangGraph, Pydantic AI, Mastra, CrewAI — the six production frameworks that matter, side by side, with a decision rule for picking yours.

What you will learn

01The Six Production Frameworks
02The Side-By-Side Comparison
03The Decision Rule
04The Same Task in Three Frameworks
05Where The Field Is Heading
06The 7-Day Recap

You've covered foundations, tool use, loops, memory, multi-agent, hardening, observability, and computer use. The last decision is the one that determines how much of this you build vs. inherit: which agent framework do you actually use? The 2026 landscape has consolidated to about six serious choices. This chapter is a side-by-side comparison and a decision rule to help you pick.

🔑
The honest meta-rule
Every framework will let you build a working agent. The right choice is the one whose defaults match your stack and whose escape hatches exist where you'll need them. Don't pick on benchmarks or stars — pick on (a) which LLM provider you're committed to, (b) what language your team writes, and (c) how much state your agent needs to durably checkpoint.

The Six Production Frameworks

1. Claude Agent SDK

Claude Agent SDK (formerly Claude Code SDK) ships the entire Claude Code agent loop as a Python and TypeScript library: built-in tools (Read, Edit, Bash, Glob, Grep, WebSearch), hooks (PreToolUse, PostToolUse, SessionStart), subagents, MCP support, JSONL session resume, and an explicit budget model. Pairs with Managed Agents for hosted-sandbox deployment.

  • Strengths: production-tested loop (it's literally Claude Code's runtime); great for code-running agents; minimal glue.
  • Weaknesses: Anthropic-only; less flexible for multi-vendor tool routing.
  • Pick if: you're committed to Claude and want shippable agents with the least scaffolding.

2. OpenAI Agents SDK

OpenAI Agents SDK replaced the deprecated Swarm in March 2025. Three primitives: agents, handoffs, guardrails. Built-in tracing in OpenAI's dashboard. Strong support for voice agents (via gpt-realtime) and the Responses API's built-in tools.

  • Strengths: tight OpenAI integration; voice; clean handoff abstraction.
  • Weaknesses: OpenAI-only; less mature checkpoint/durability story than LangGraph.
  • Pick if: you're on OpenAI, especially for voice or realtime agents.

3. LangGraph

LangGraph models agents as explicit state graphs with first-class checkpointing, human-in-loop pauses, and durable execution. Provider-agnostic. The de-facto standard for long-running, branching, or human-approval-required workflows.

  • Strengths: durable state; HITL; multi-vendor; pairs with LangSmith for tracing.
  • Weaknesses: heavier abstraction; more LOC for simple agents.
  • Pick if: your agent runs >5 minutes, needs checkpointing, or requires human approval steps.

4. Pydantic AI

Pydantic AI brings the type-safety discipline of Pydantic to agent code. Dependency injection, strict structured outputs, model-agnostic, idiomatic Python. Lightest footprint of the serious frameworks.

  • Strengths: strict types; tiny abstraction; great for Python apps that already use Pydantic.
  • Weaknesses: newer; smaller ecosystem than LangChain.
  • Pick if: you want minimal magic, full IDE support, and your stack is Python + FastAPI.

5. Mastra

Mastra is the TypeScript-native answer: agents, workflows, evals, and memory in one TS-first SDK. Plays well with Next.js, Vercel, and modern JS deployment.

  • Strengths: TS ergonomics; integrated workflows + evals; good Next.js integration.
  • Weaknesses: JS/TS only; smaller community than LangGraph.
  • Pick if: your team writes TypeScript and you want a cohesive agent stack without leaving the JS ecosystem.

6. CrewAI

CrewAI takes an opinionated "agents-as-roles + tasks + crews" approach. Less code-heavy than LangGraph; popular for business-process automation.

  • Strengths: approachable mental model; fast to prototype with non-engineers in the loop.
  • Weaknesses: less control; the role/task/crew abstraction can feel like a constraint at scale.
  • Pick if: you're automating business workflows and want a framework that reads like an org chart.

The Side-By-Side Comparison

FRAMEWORK SELECTION MATRIX multi-vendor & durable single-vendor & lean Python Polyglot TypeScript LangGraph Mastra Pydantic AI CrewAI Claude Agent SDK OpenAI Agents
Where each framework lives. Top-left = heavy, durable, multi-vendor. Bottom = leaner, single-vendor. The decision is mostly about (a) language and (b) how durable your agent state needs to be.
FrameworkLanguageVendorState / checkpointingBest for
Claude Agent SDKPython · TSAnthropicBuilt-in (sessions, JSONL)Code-running agents on Claude
OpenAI Agents SDKPython · TSOpenAITracing-onlyOpenAI stack, voice agents
LangGraphPython · JSAnyFirst-class checkpointsLong-running, HITL workflows
Pydantic AIPythonAnyManual (lightweight)Type-safe Python apps
MastraTypeScriptAnyBuilt-in workflowsJS/TS apps, Next.js
CrewAIPythonAnyLightweightBusiness-process automation

The Decision Rule

  1. Is your stack TypeScript? → Mastra (or roll your own; the API surfaces are simple in TS).
  2. Are you locked into one provider? Anthropic → Claude Agent SDK. OpenAI → OpenAI Agents SDK.
  3. Does your agent run >5 minutes or need human approvals? → LangGraph. The checkpointing/HITL story is unmatched.
  4. Do you want minimal magic and full IDE support? → Pydantic AI.
  5. Are non-engineers configuring the agents? → CrewAI's role/task/crew model reads more like business logic.
  6. Otherwise: roll your own with the canonical loop from chapter 1. The frameworks save you 100 lines; sometimes 100 lines is the right answer.
🛠️
Hybrid stacks are normal
Real production teams often combine: Pydantic AI for the domain agents + LangGraph for the long-running workflow that orchestrates them, observed via Langfuse, deployed on Mastra-style serverless functions. Frameworks aren't mutually exclusive — they're libraries with overlapping concerns. The unhealthy pattern is using one framework's abstractions everywhere and fighting it for the use cases it doesn't fit.

The Same Task in Three Frameworks

To make the differences concrete, here's a "research a topic and return a structured report" agent in three frameworks. Same goal, different ergonomics.

Python · Claude Agent SDK
from claude_agent_sdk import ClaudeSDKClient, ClaudeAgentOptions

async with ClaudeSDKClient(options=ClaudeAgentOptions(
    system_prompt="You research topics and return structured JSON reports.",
    allowed_tools=["WebSearch", "WebFetch"],
    max_turns=15,
)) as client:
    await client.query("Research MCP adoption in 2026; return JSON.")
    async for msg in client.receive_response():
        print(msg)
Python · OpenAI Agents SDK
from agents import Agent, Runner
from agents.tools import WebSearchTool

agent = Agent(
    name="Researcher",
    instructions="Research topics; return structured JSON reports.",
    tools=[WebSearchTool()],
    output_type=ReportSchema,
)
result = await Runner.run(agent, "Research MCP adoption in 2026.")
Python · LangGraph
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.sqlite import SqliteSaver

agent = create_react_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[web_search_tool],
    response_format=ReportSchema,
    checkpointer=SqliteSaver.from_conn_string(":memory:"),
)
result = await agent.ainvoke(
    {"messages": ["Research MCP adoption in 2026."]},
    config={"configurable": {"thread_id": "abc"}},
)

Notice the convergence: same shape, same primitives, slightly different ergonomics. Once you've internalized the canonical loop from Day 1, every framework reads as a thin layer over it.

Where The Field Is Heading

Looking at the 2025-2026 trajectory:

  • MCP wins as the tool-server layer. Every framework now treats MCP as default. Tool definitions outlive the framework that called them.
  • Durable execution becomes table-stakes. LangGraph started it; Mastra and Claude Agent SDK have followed. By 2027, an agent framework without checkpointing will look anachronistic.
  • Computer use moves from beta to production-default. The OSWorld trajectory says capability will catch human baseline within 12-18 months for narrow tasks.
  • Eval-driven development becomes the norm. Braintrust, Langfuse Evals, Inspect AI — the toolchain is converging on "every prompt change is a PR with eval results attached."
  • Agent SDKs absorb middleware. Memory tools, context editing, prompt caching — features that lived in third-party libraries in 2024 are now native in the SDKs.

The 7-Day Recap

  1. Day 1 AM — Agents = LLM in a loop with tools and a budget. Workflows beat agents for most tasks.
  2. Day 1 PM — Tool use is JSON in, JSON out. MCP is the universal standard.
  3. Day 2 — ReAct is default; reach for Reflexion / ToT / Plan-and-Execute only when their costs are justified.
  4. Day 3 — Three memory layers: cached context, vector RAG, structured stores. Combine all three.
  5. Day 4 — Multi-agent costs ~15× more tokens. Use orchestrator-workers when sub-tasks are genuinely parallel.
  6. Day 5 — Production = guardrails, retries, caching, sandboxing, escape hatches. Nothing is optional.
  7. Day 6 — Trace every span. Score every run. Catch regressions before users do.
  8. Day 7 AM — Computer use is general but expensive; hybrid (Stagehand) wins for stable-DOM tasks.
  9. Day 7 PM — Pick the framework whose defaults match your stack. Roll your own when 100 lines beats a dependency.
🎉
You've finished the course
You've covered the full stack of building production LLM agents — from the four-line canonical loop to durable multi-agent systems with full observability. Next step: pick one chapter and ship a real thing. A research agent with proper evals beats reading about agents for another month. The frameworks are mature; the patterns are documented; the hard part is now the discipline to bound the loop and ship.

Finished reading?