
Multi-Agent Systems — Orchestration That's Worth The Cost
Orchestrator-workers, handoffs, evaluator pairs — the multi-agent patterns that actually ship. Plus the 15× cost reality and the failure modes the marketing leaves out.
What you will learn
Multi-agent systems sound great in slides and cost a lot in production. The honest, expensive truth: most teams who reach for multi-agent didn't need it. A well-equipped single agent with parallel tools would have done the job for roughly a quarter of the tokens (about 4× a baseline chat versus about 15×, per the cost figures later in this chapter). This chapter covers when multi-agent actually helps, the orchestration patterns that work, and the failure modes nobody puts in the marketing.
When Multi-Agent Actually Helps
Three signals say yes:
- Genuinely parallel sub-tasks. A research query that needs 5 independent searches benefits from 5 concurrent sub-agents. A chatbot answering one question does not.
- Distinct contexts that don't fit in one window. Each sub-agent reads its own corpus and returns a digest; the lead agent never sees the raw text.
- Specialization with sharply different prompts. A "code reviewer" and a "test generator" want incompatible system prompts; multi-agent lets each have its own.
Three signals say no:
- The task is sequential. Step B needs A's output, step C needs B's output. Multi-agent here is just a loop with overhead.
- Coordination dominates work. If sub-agents spend more tokens negotiating than acting, the architecture is wrong.
- Reliability matters more than breadth. Each agent boundary is another failure surface. A single agent fails in one place; multi-agent fails in N.
The Orchestrator-Workers Pattern
The cleanest, most-shipped multi-agent pattern is orchestrator-workers (sometimes called "lead agent + sub-agents" or "coordinator + specialists"). One LLM acts as the lead — it plans, decomposes, and dispatches sub-tasks to worker LLMs. Workers do focused, scoped work and return digested results. The lead synthesizes and either delegates again or finishes.
```python
import asyncio

# lead_agent, worker_agent, PlanSchema, build_synthesis_prompt, and the tools
# (web_search, fetch_url, summarize) are assumed to be defined elsewhere.


async def research_orchestrator(query):
    # 1. Lead plans subtasks
    plan = await lead_agent(
        model="claude-opus-4-7",
        prompt=f"Decompose this research query into 3-6 independent subtasks: {query}",
        response_schema=PlanSchema,
    )

    # 2. Workers run in parallel, each in its own context
    workers = [
        worker_agent(model="claude-sonnet-4-6", task=t, tools=[web_search, fetch_url, summarize])
        for t in plan.subtasks
    ]
    digests = await asyncio.gather(*workers)

    # 3. Lead synthesizes: it sees only the digests, not raw text
    final = await lead_agent(
        model="claude-opus-4-7",
        prompt=build_synthesis_prompt(query, digests),
    )
    return final
```
Other Multi-Agent Patterns
Handoffs (OpenAI Agents SDK style)
Each agent has a specialty (sales, support, billing). When a turn arrives, the routing agent decides which specialist to delegate to. The specialist owns the conversation until it hands back. OpenAI's Agents SDK implements handoffs as a primitive.
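The ownership rule described above can be sketched in plain Python. This is not the OpenAI Agents SDK API: `Specialist`, `Conversation`, and `keyword_router` are hypothetical names, and the keyword router stands in for what would normally be an LLM routing call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Specialist:
    name: str
    # handle(message) returns (reply, done); done=True hands control back.
    handle: Callable[[str], Tuple[str, bool]]


def keyword_router(message: str) -> str:
    """Hypothetical stand-in for an LLM routing call."""
    text = message.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "price" in text or "upgrade" in text:
        return "sales"
    return "support"


class Conversation:
    def __init__(self, specialists: Dict[str, Specialist],
                 router: Callable[[str], str] = keyword_router):
        self.specialists = specialists
        self.router = router
        self.owner: Optional[str] = None  # specialist currently owning the conversation

    def turn(self, message: str) -> Tuple[str, str]:
        # Route only when no specialist owns the conversation.
        if self.owner is None:
            self.owner = self.router(message)
        name = self.owner
        reply, done = self.specialists[name].handle(message)
        if done:
            self.owner = None  # hand back to the router
        return name, reply


# Toy specialists: each one hands back when the user says "thanks".
specialists = {
    n: Specialist(n, lambda m, n=n: (f"[{n}] handled: {m}", "thanks" in m.lower()))
    for n in ("sales", "support", "billing")
}
convo = Conversation(specialists)
for msg in [
    "I need a refund on my invoice",        # router picks billing
    "Also, what does an upgrade cost?",     # billing still owns the conversation
    "Thanks, that's all",                   # billing finishes and hands back
    "What is the price of the pro plan?",   # router runs again and picks sales
]:
    print(convo.turn(msg))
```

The second message mentions an upgrade but stays with billing, because routing happens only when nobody owns the conversation. Whether that stickiness is desirable depends on the product; a real system might let the specialist itself trigger a handoff mid-conversation.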
Critic / Evaluator pairs
One agent produces, another critiques, the producer revises. Bounded by a maximum number of iterations or a quality threshold. Equivalent to Anthropic's "evaluator-optimizer" workflow pattern. Use when you have a measurable quality dimension (correctness, brand fit) but no deterministic test.
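A minimal sketch of the bounded produce-critique-revise loop, assuming the producer and critic are supplied as callables. The function name, its signature, and the toy producer and critic are illustrative assumptions; in practice both callables would wrap LLM calls.

```python
from typing import Callable, Optional, Tuple

Producer = Callable[[str, Optional[str], Optional[str]], str]  # (task, previous_draft, feedback) -> draft
Critic = Callable[[str, str], Tuple[float, str]]               # (task, draft) -> (score, feedback)


def evaluator_optimizer(task: str, produce: Producer, critique: Critic,
                        threshold: float = 0.8,
                        max_iterations: int = 3) -> Tuple[Optional[str], float, int]:
    """Return (best_draft, best_score, iterations_used)."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    best_draft, best_score = None, float("-inf")
    draft, feedback = None, None
    used = 0
    for used in range(1, max_iterations + 1):
        draft = produce(task, draft, feedback)
        score, feedback = critique(task, draft)
        if score > best_score:
            best_draft, best_score = draft, score
        if score >= threshold:  # quality bar reached: stop early
            break
    return best_draft, best_score, used


# Toy stand-ins: the critic rewards longer drafts, the producer appends detail.
def toy_produce(task, previous, feedback):
    return task if previous is None else previous + " +detail"


def toy_critique(task, draft):
    return min(1.0, len(draft.split()) / 6), "add more detail"


draft, score, used = evaluator_optimizer("Summarize the outage", toy_produce, toy_critique)
print(used, draft)
```

Keeping the best-scoring draft rather than the last one guards against a revision that makes things worse, which can happen when the critic's feedback is noisy.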
Group chat / debate
Multiple agents take turns talking until consensus or termination. Frameworks like AutoGen popularized this. In practice it's chatty and expensive — useful for adversarial robustness ("red team vs. blue team") but rarely for production tasks.
Swarm / handoff networks
OpenAI's now-deprecated Swarm popularized stateless handoff networks; the pattern lives on in CrewAI's "agents-as-roles + tasks + crews" abstraction. Each agent is a role; handoffs are first-class. Good for business-process automation; verbose for general purpose.
Cost Reality — The 15× Number
Anthropic's published research-system numbers are the most credible cost reference we have:
| System type | Token usage vs. baseline chat | Performance vs. single-agent Opus |
|---|---|---|
| Chat (single Q&A) | 1× (baseline) | — |
| Single-agent Opus + tools | ~4× | baseline |
| Multi-agent (Opus lead + Sonnet workers) | ~15× | +90.2% |
Their post notes that token usage alone explains about 80% of performance variance, with the number of tool calls and model choice as the other two explanatory factors. Translation: the multi-agent gains come mostly from using more tokens to think more, not from any magic in coordination. If you can afford the tokens with a single agent, you often don't need the architecture.
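To make the multipliers concrete, here is a back-of-envelope estimate. The 1×, 4×, and 15× factors come from the table above; the baseline token count, the price per million tokens, and the query volume are hypothetical placeholders. A single blended price is also a simplification, since a multi-agent setup with cheaper worker models would not pay the lead model's rate on every token.

```python
# Token multipliers relative to a baseline chat, taken from the table above.
MULTIPLIERS = {"chat": 1, "single_agent": 4, "multi_agent": 15}


def estimate_daily_cost(baseline_tokens: int,
                        usd_per_million_tokens: float,
                        queries_per_day: int) -> dict:
    """Rough daily spend per system type under one blended token price."""
    return {
        system: baseline_tokens * mult * queries_per_day * usd_per_million_tokens / 1_000_000
        for system, mult in MULTIPLIERS.items()
    }


# Hypothetical inputs: 2,000 tokens per baseline chat, $10 per million tokens,
# 1,000 queries per day.
costs = estimate_daily_cost(2_000, 10.0, 1_000)
for system, usd in costs.items():
    print(f"{system:>12}: ${usd:,.2f}/day")
print("multi-agent vs single-agent:", costs["multi_agent"] / costs["single_agent"])
```

Under these assumptions the multi-agent system costs 3.75 times what the single agent does, which is the number to weigh against the measured quality gain on your own task.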
Multi-Agent Failure Modes
From Anthropic's post and accumulated production experience:
- Token-cost runaway — workers fan out, each fans out again, costs go exponential. Mitigation: per-agent token budgets enforced by the lead.
- Stale plans — lead drafts a plan, workers find new info that invalidates it, lead doesn't re-plan. Mitigation: bounded re-planning steps.
- Coordination overhead > work — agents spend more tokens summarizing for each other than doing the task. Mitigation: minimize hand-off bandwidth; pass digests not raw text.
- Inconsistent answers across runs — non-determinism compounds across N agents. Mitigation: temperature 0 on lead, fixed seeds where supported, eval suites that catch regressions.
- Prompt injection via worker output — a worker's tool result contains injected instructions; the lead acts on them. Mitigation: treat all sub-agent output as untrusted; the lead should not execute new tools solely on worker recommendation.
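The first mitigation, per-agent token budgets enforced by the lead, could be sketched as follows. `TokenBudget`, `carve`, and `BudgetExceeded` are hypothetical names, and the sketch assumes each model call reports its token usage so it can be charged.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when an agent tries to spend past its allocation."""


@dataclass
class TokenBudget:
    limit: int
    used: int = 0

    @property
    def remaining(self) -> int:
        return self.limit - self.used

    def charge(self, tokens: int) -> None:
        # Refuse the spend instead of letting the agent overrun.
        if self.used + tokens > self.limit:
            raise BudgetExceeded(f"need {tokens}, only {self.remaining} left")
        self.used += tokens

    def carve(self, share: int) -> "TokenBudget":
        """Reserve `share` tokens from this budget for a child agent."""
        self.charge(share)
        return TokenBudget(limit=share)


lead = TokenBudget(limit=100_000)
workers = [lead.carve(20_000) for _ in range(4)]  # 80k reserved, 20k kept for synthesis
sub_worker = workers[0].carve(5_000)              # nested fan-out draws on the worker's own share
workers[0].charge(15_000)                         # the worker spends the rest of its share
try:
    workers[0].charge(1)
except BudgetExceeded as exc:
    print("worker 0 stopped:", exc)
print("lead has", lead.remaining, "tokens left for synthesis")
```

Because a child's budget is always carved out of its parent's, total spend is bounded by the root limit no matter how deep the fan-out goes.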
The Multi-Agent Decision Tree
- Can a single LLM call do this? If yes — stop, you're done.
- Can a workflow (chain / route / parallel) do this? If yes — build the workflow.
- Can a single agent + tools do this? If yes — build the agent.
- Are sub-tasks genuinely independent and parallelizable? Yes? Now consider orchestrator-workers.
- Do specialists need incompatible system prompts? Yes? Consider handoffs.
- Do you have a measurable quality dimension? Yes? Consider evaluator-optimizer.
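The checklist above is ordered, so it can be encoded as a chain of early returns. The `Task` fields and the `recommend` function are hypothetical names for illustration, and the final fallback message is an assumption, since the checklist does not say what to do when no question gets a yes.

```python
from dataclasses import dataclass


@dataclass
class Task:
    single_call_suffices: bool = False
    workflow_suffices: bool = False
    single_agent_suffices: bool = False
    independent_parallel_subtasks: bool = False
    incompatible_specialist_prompts: bool = False
    measurable_quality_dimension: bool = False


def recommend(task: Task) -> str:
    # Cheapest options first: stop at the first one that is enough.
    if task.single_call_suffices:
        return "single LLM call"
    if task.workflow_suffices:
        return "workflow (chain / route / parallel)"
    if task.single_agent_suffices:
        return "single agent + tools"
    # Only now consider multi-agent patterns.
    if task.independent_parallel_subtasks:
        return "orchestrator-workers"
    if task.incompatible_specialist_prompts:
        return "handoffs"
    if task.measurable_quality_dimension:
        return "evaluator-optimizer"
    # Assumption: with no multi-agent signal, go back to the simpler options.
    return "no multi-agent pattern indicated"


print(recommend(Task(single_agent_suffices=True, independent_parallel_subtasks=True)))
print(recommend(Task(independent_parallel_subtasks=True)))
```

The first call returns the single-agent answer even though the sub-tasks are parallelizable, which mirrors the ordering of the checklist: simpler options win whenever they are sufficient.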
References
- Anthropic: How we built our multi-agent research system (anthropic.com)
- OpenAI Agents SDK: Handoffs & Multi-Agent (openai.github.io)
- LangGraph: Stateful Multi-Agent Workflows (docs.langchain.com)
- Microsoft AutoGen: Event-Driven Multi-Agent (microsoft.github.io)
- CrewAI: Roles, Tasks, Crews (docs.crewai.com)