DAY 4


Multi-Agent Systems — Orchestration That's Worth The Cost

6 min · Intermediate · 1,290 words

Orchestrator-workers, handoffs, evaluator pairs — the multi-agent patterns that actually ship. Plus the 15× cost reality and the failure modes the marketing leaves out.

What you will learn

01 When Multi-Agent Actually Helps
02 The Orchestrator-Workers Pattern
03 Other Multi-Agent Patterns
04 Cost Reality: The 15× Number
05 Multi-Agent Failure Modes
06 The Multi-Agent Decision Tree

Multi-agent systems sound great in slides and cost a lot in production. The honest, expensive truth: most teams who reach for multi-agent didn't need it. A well-equipped single agent with parallel tools would have done the job for roughly a quarter of the tokens (about 4× chat versus about 15× chat, per Anthropic's published figures). This chapter covers when multi-agent actually helps, the orchestration patterns that work, and the failure modes nobody puts in the marketing.

🔑

Anthropic's stated rule

From How we built our multi-agent research system, paraphrased: don't go multi-agent until single-agent + tools demonstrably fails. Their own multi-agent research stack used ~15× more tokens than chat. The performance gain (the multi-agent system outperformed single-agent Opus by 90.2% on their internal research eval) was worth it for research; for most products it isn't.

When Multi-Agent Actually Helps

Three signals say yes:

Genuinely parallel sub-tasks. A research query that needs 5 independent searches benefits from 5 concurrent sub-agents. A chatbot answering one question does not.
Distinct contexts that don't fit in one window. Each sub-agent reads its own corpus and returns a digest; the lead agent never sees the raw text.
Specialization with sharply different prompts. A "code reviewer" and a "test generator" want incompatible system prompts; multi-agent lets each have its own.

Three signals say no:

The task is sequential. Step B needs A's output, step C needs B's output. Multi-agent here is just a loop with overhead.
Coordination dominates work. If sub-agents spend more tokens negotiating than acting, the architecture is wrong.
Reliability matters more than breadth. Each agent boundary is another failure surface. A single agent fails in one place; multi-agent fails in N.

The Orchestrator-Workers Pattern

The cleanest, most-shipped multi-agent pattern is orchestrator-workers (sometimes called "lead agent + sub-agents" or "coordinator + specialists"). One LLM acts as the lead — it plans, decomposes, and dispatches sub-tasks to worker LLMs. Workers do focused, scoped work and return digested results. The lead synthesizes and either delegates again or finishes.

This is the pattern Anthropic publicly described for their research system: the lead agent uses Opus and the workers use Sonnet. Workers run in parallel and report compact digests, not raw text, which preserves the lead's context budget.

Python · orchestrator-workers sketch

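import asyncio

# lead_agent, worker_agent, PlanSchema, build_synthesis_prompt and the tool
# objects (web_search, fetch_url, summarize) are placeholders for your own
# LLM-client wrappers. The model IDs are illustrative.
async def research_orchestrator(query):
    # 1. Lead plans subtasks and returns them as structured output
    plan = await lead_agent(
        model="claude-opus-4-7",
        prompt=f"Decompose this research query into 3-6 independent subtasks: {query}",
        response_schema=PlanSchema,
    )
    # 2. Workers run in parallel, each in its own context.
    #    Calling worker_agent() only creates the coroutine; gather() runs them.
    workers = [
        worker_agent(model="claude-sonnet-4-6", task=t,
                     tools=[web_search, fetch_url, summarize])
        for t in plan.subtasks
    ]
    digests = await asyncio.gather(*workers)
    # 3. Lead synthesizes: it sees only the digests, not raw text
    final = await lead_agent(
        model="claude-opus-4-7",
        prompt=build_synthesis_prompt(query, digests),
    )
    return final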
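The sketch above lets one failed worker abort the whole `gather`. A minimal, self-contained variation that isolates failures might look like the following. `stub_worker` and `fan_out` are hypothetical names and the worker is a stand-in for a real LLM call; only `asyncio.gather(..., return_exceptions=True)` is actual library behavior.

```python
import asyncio

async def stub_worker(task: str) -> str:
    """Stand-in for an LLM-backed worker (assumption: returns a short digest)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if "fail" in task:
        raise RuntimeError(f"worker failed on {task!r}")
    return f"digest of {task}"

async def fan_out(tasks):
    # return_exceptions=True puts exceptions in the result list instead of
    # raising, so one bad worker does not discard the other digests.
    results = await asyncio.gather(
        *(stub_worker(t) for t in tasks), return_exceptions=True
    )
    digests = [r for r in results if not isinstance(r, BaseException)]
    failures = [(t, r) for t, r in zip(tasks, results)
                if isinstance(r, BaseException)]
    return digests, failures

digests, failures = asyncio.run(fan_out(["pricing", "fail: legal", "competitors"]))
```

The lead can then decide whether to synthesize from the surviving digests or re-dispatch the failed subtasks.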

Other Multi-Agent Patterns

Handoffs (OpenAI Agents SDK style)

Each agent has a specialty (sales, support, billing). When a turn arrives, the routing agent decides which specialist to delegate to. The specialist owns the conversation until it hands back. OpenAI's Agents SDK implements handoffs as a primitive.
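To make the ownership rule concrete, here is a rough sketch in plain Python. It is not the Agents SDK API: `Specialist`, `Conversation`, `keyword_router`, and the `[handback]` marker are all hypothetical, and a real router would be an LLM call rather than keyword matching.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

HANDBACK = "[handback]"  # hypothetical marker a specialist emits when done

@dataclass
class Specialist:
    name: str
    handle: Callable[[str], str]  # stand-in for an LLM-backed agent

def keyword_router(message: str) -> str:
    """Toy router; in practice this would be an LLM call with structured output."""
    text = message.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "price" in text or "upgrade" in text:
        return "sales"
    return "support"

class Conversation:
    def __init__(self, specialists: Dict[str, Specialist], router):
        self.specialists = specialists
        self.router = router
        self.owner: Optional[str] = None  # which specialist owns the thread

    def turn(self, message: str) -> str:
        # Route only when nobody owns the conversation.
        if self.owner is None:
            self.owner = self.router(message)
        reply = self.specialists[self.owner].handle(message)
        if reply.endswith(HANDBACK):
            self.owner = None  # specialist hands control back to the router
            reply = reply[: -len(HANDBACK)].rstrip()
        return reply

def make_specialist(name: str) -> Specialist:
    def handle(message: str) -> str:
        reply = f"{name}: {message}"
        if "thanks" in message.lower():  # toy rule for "we're done here"
            reply += " " + HANDBACK
        return reply
    return Specialist(name, handle)

convo = Conversation(
    {n: make_specialist(n) for n in ("sales", "support", "billing")},
    keyword_router,
)
```

Note that a pricing question asked mid-thread stays with the billing specialist until it hands back; that stickiness is the difference between a handoff and per-turn routing.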

Critic / Evaluator pairs

One agent produces, another critiques, the producer revises. The loop is bounded by a maximum number of iterations or a quality threshold. This is equivalent to Anthropic's "evaluator-optimizer" workflow pattern. Use it when you have a measurable quality dimension (correctness, brand fit) but no deterministic test.
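The bounded loop can be sketched as below. This is an illustration under assumptions, not a reference implementation: `refine`, `toy_produce`, and `toy_critique` are hypothetical names, the critic is assumed to return a score in [0, 1] plus feedback text, and both toy functions stand in for LLM calls.

```python
from typing import Callable, Tuple

def refine(
    produce: Callable[[str, str], str],            # (task, feedback) -> draft
    critique: Callable[[str], Tuple[float, str]],  # draft -> (score, feedback)
    task: str,
    max_iters: int = 3,
    threshold: float = 0.8,
) -> Tuple[str, float, int]:
    """Return (best draft, its score, iterations used)."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    feedback = ""
    best_draft, best_score = "", float("-inf")
    for i in range(1, max_iters + 1):
        draft = produce(task, feedback)
        score, feedback = critique(draft)
        if score > best_score:  # keep the best draft, not just the last one
            best_draft, best_score = draft, score
        if score >= threshold:  # quality bar met: stop spending tokens
            break
    return best_draft, best_score, i

def toy_produce(task: str, feedback: str) -> str:
    # Toy producer: applies the critic's feedback literally by appending it.
    return task + feedback

def toy_critique(draft: str) -> Tuple[float, str]:
    # Toy critic: wants at least two "[cite]" markers in the draft.
    n = draft.count("[cite]")
    return min(n / 2, 1.0), " [cite]" * (n + 1)

draft, score, iters = refine(toy_produce, toy_critique, "report")
```

Both stopping conditions matter: the threshold saves tokens when the draft is good enough, and the iteration cap guarantees termination when it never is.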

Group chat / debate

Multiple agents take turns talking until consensus or termination. Frameworks like AutoGen popularized this. In practice it's chatty and expensive — useful for adversarial robustness ("red team vs. blue team") but rarely for production tasks.

Swarm / handoff networks

OpenAI's now-deprecated Swarm popularized stateless handoff networks; the pattern lives on in CrewAI's "agents-as-roles + tasks + crews" abstraction. Each agent is a role; handoffs are first-class. It is good for business-process automation but verbose for general-purpose use.

Cost Reality — The 15× Number

Anthropic's published research-system numbers are the most credible cost reference we have:

System type	Token usage vs. baseline chat	Performance vs. single-agent Opus
Chat (single Q&A)	1× (baseline)	—
Single-agent Opus + tools	~4×	baseline
Multi-agent (Opus lead + Sonnet workers)	~15×	+90.2%

Their post notes that token usage by itself explains ~80% of the performance variance on their BrowseComp evaluation, with the number of tool calls and model choice as the other two explanatory factors. Translation: the multi-agent gains come mostly from using more tokens to think more, not from any magic in coordination. If you can afford the tokens with a single agent, you often don't need the architecture.

⚠️

The unit-economics check

Before going multi-agent, work out the per-request cost of a plain chat call with caching, then multiply by 15 (or take your measured single-agent cost and multiply by roughly 4). If that destroys your unit economics, single-agent is your answer regardless of the marginal accuracy gain. A 90% improvement at 15× cost is only a deal when the value of accuracy far exceeds the token bill: research, legal, medical. For consumer chatbots, it isn't.
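As a back-of-the-envelope version of that check, the sketch below applies the ~4× and ~15× token multipliers from the table to a chat baseline. The dollar figures are made-up inputs, and the function names are hypothetical. It also assumes cost scales linearly with tokens, which is only approximate when the lead and workers use differently priced models.

```python
# Token multipliers relative to plain chat, taken from the table above.
SINGLE_AGENT_MULTIPLIER = 4.0
MULTI_AGENT_MULTIPLIER = 15.0

def per_request_cost(chat_cost_usd: float, multiplier: float) -> float:
    """Estimated cost of one request, scaling the chat baseline by tokens used."""
    return chat_cost_usd * multiplier

def is_affordable(chat_cost_usd: float, multiplier: float,
                  value_per_request_usd: float) -> bool:
    """True if the estimated cost stays at or below what a request is worth."""
    return per_request_cost(chat_cost_usd, multiplier) <= value_per_request_usd

# Hypothetical inputs: a cached chat call costs $0.002 and each answered
# request is worth $0.02 to the business.
chat_cost = 0.002
value = 0.02

single_ok = is_affordable(chat_cost, SINGLE_AGENT_MULTIPLIER, value)
multi_ok = is_affordable(chat_cost, MULTI_AGENT_MULTIPLIER, value)
```

With these inputs the single agent costs about $0.008 per request and fits the budget, while the multi-agent system costs about $0.03 and does not.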

Multi-Agent Failure Modes

From Anthropic's post and accumulated production experience:

Token-cost runaway — workers fan out, each fans out again, costs go exponential. Mitigation: per-agent token budgets enforced by the lead.
Stale plans — lead drafts a plan, workers find new info that invalidates it, lead doesn't re-plan. Mitigation: bounded re-planning steps.
Coordination overhead > work — agents spend more tokens summarizing for each other than doing the task. Mitigation: minimize hand-off bandwidth; pass digests not raw text.
Inconsistent answers across runs — non-determinism compounds across N agents. Mitigation: temperature 0 on lead, fixed seeds where supported, eval suites that catch regressions.
Prompt injection via worker output — a worker's tool result contains injected instructions; the lead acts on them. Mitigation: treat all sub-agent output as untrusted; the lead should not execute new tools solely on worker recommendation.
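One way to enforce the per-agent budgets from the first mitigation is to carve each worker's allowance out of its parent's, so nested fan-out can never exceed the root limit. The sketch below is one possible design; `TokenBudget` and `BudgetExceeded` are hypothetical names, and token counts are assumed to come from your LLM client's usage reporting.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent tries to spend more tokens than it was granted."""

class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    @property
    def remaining(self) -> int:
        return self.limit - self.used

    def charge(self, tokens: int) -> None:
        """Record token usage; refuse the spend if it would exceed the limit."""
        if tokens > self.remaining:
            raise BudgetExceeded(
                f"requested {tokens}, only {self.remaining} remaining"
            )
        self.used += tokens

    def child(self, limit: int) -> "TokenBudget":
        """Reserve `limit` tokens from this budget up front for a sub-agent."""
        self.charge(limit)
        return TokenBudget(limit)

root = TokenBudget(10_000)      # the lead's total allowance for one request
worker_a = root.child(4_000)
worker_b = root.child(4_000)
worker_a.charge(3_500)          # worker A reports usage after an LLM call
```

Because reservations are charged to the parent immediately, a third 4,000-token worker is refused here, and a worker that fans out again can only split what it was already given.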

The Multi-Agent Decision Tree

Can a single LLM call do this? If yes — stop, you're done.
Can a workflow (chain / route / parallel) do this? If yes — build the workflow.
Can a single agent + tools do this? If yes — build the agent.
Are sub-tasks genuinely independent and parallelizable? Yes? Now consider orchestrator-workers.
Do specialists need incompatible system prompts? Yes? Consider handoffs.
Do you have a measurable quality dimension? Yes? Consider evaluator-optimizer.
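The tree above reads naturally as a chain of early returns. The sketch below encodes it that way; the `Task` fields and `recommend` are hypothetical names, and the final fallback to a single agent is an assumption that follows this chapter's "when in doubt" advice rather than a step in the tree itself.

```python
from dataclasses import dataclass

@dataclass
class Task:
    single_call_suffices: bool = False
    workflow_suffices: bool = False
    single_agent_suffices: bool = False
    parallel_independent_subtasks: bool = False
    incompatible_specialist_prompts: bool = False
    measurable_quality_dimension: bool = False

def recommend(task: Task) -> str:
    """Walk the decision tree top to bottom; the first match wins."""
    if task.single_call_suffices:
        return "single LLM call"
    if task.workflow_suffices:
        return "workflow"
    if task.single_agent_suffices:
        return "single agent + tools"
    if task.parallel_independent_subtasks:
        return "orchestrator-workers"
    if task.incompatible_specialist_prompts:
        return "handoffs"
    if task.measurable_quality_dimension:
        return "evaluator-optimizer"
    # Assumption: nothing above applied, so default to the simpler architecture.
    return "single agent + tools"

choice = recommend(Task(workflow_suffices=True, parallel_independent_subtasks=True))
```

Order is the point: a task with parallel sub-tasks still gets "workflow" if a workflow is enough, because the cheaper options are checked first.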

Quick check

Your team is building a customer-support agent. The product manager wants four sub-agents: an "intent classifier", a "knowledge retriever", a "tone checker", and a "response writer". Should you build it that way?

Answer

Almost certainly no. Three of those are single-step functions, not agents: intent classification is a single LLM call with structured output; knowledge retrieval is a vector search; tone checking is a single LLM call with a rubric. Only the response writer is plausibly agent-shaped. Build it as a workflow (classify → retrieve → write → tone-check) with a single LLM call per step. You'll spend a fraction of the tokens, get more predictable behavior, and be able to debug each step in isolation. The "multi-agent customer support" pitch is the textbook case of multi-agent overuse.
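For illustration, the workflow version might look like the sketch below. Every function is a stub standing in for one LLM call or one vector search, the knowledge-base text is invented, and all names are hypothetical; the point is only the fixed, linear control flow.

```python
# Toy knowledge base with invented content, keyed by intent.
KB = {
    "billing": "Refunds are processed by the billing team.",
    "general": "Our help center covers most common questions.",
}

def classify(message: str) -> str:
    """Stand-in for one LLM call with structured output."""
    return "billing" if "refund" in message.lower() else "general"

def retrieve(intent: str) -> str:
    """Stand-in for a vector search over the knowledge base."""
    return KB[intent]

def write(message: str, context: str) -> str:
    """Stand-in for one LLM call that drafts the reply."""
    return f"Thanks for reaching out. {context}"

def tone_check(reply: str) -> bool:
    """Stand-in for one LLM call with a rubric; here, a banned-phrase list."""
    return not any(p in reply.lower() for p in ("obviously", "as i said"))

def answer(message: str) -> str:
    # Fixed pipeline: classify -> retrieve -> write -> tone-check.
    intent = classify(message)
    context = retrieve(intent)
    reply = write(message, context)
    if not tone_check(reply):
        reply = "Let me connect you with a teammate who can help."
    return reply

reply = answer("Where is my refund?")
```

Each step can be unit-tested and swapped independently, which is much of what makes the workflow easier to debug than four cooperating agents.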

🔑

Key takeaways

1) Multi-agent costs ~15× more tokens than chat; make sure the value justifies it.
2) Orchestrator-workers is the pattern most worth learning; the rest are niche.
3) Use a strong model for the lead, cheap models for workers.
4) Pass digests, not raw context, between agents.
5) Treat sub-agent output as untrusted; prompt injection at the seam is a real attack surface.
6) When in doubt, single agent + parallel tools wins.

📚 Further reading

Anthropic: How we built our multi-agent research system (anthropic.com)
OpenAI Agents SDK: Handoffs & Multi-Agent (openai.github.io)
LangGraph: Stateful Multi-Agent Workflows (docs.langchain.com)
Microsoft AutoGen: Event-Driven Multi-Agent (microsoft.github.io)
CrewAI: Roles, Tasks, Crews (docs.crewai.com)
