
Multi-Agent Systems — Orchestration That's Worth The Cost

6 min · Intermediate · 1,290 words
Orchestrator-workers, handoffs, evaluator pairs — the multi-agent patterns that actually ship. Plus the 15× cost reality and the failure modes the marketing leaves out.

What you will learn

  1. When Multi-Agent Actually Helps
  2. The Orchestrator-Workers Pattern
  3. Other Multi-Agent Patterns
  4. Cost Reality: The 15× Number
  5. Multi-Agent Failure Modes
  6. The Multi-Agent Decision Tree

Multi-agent systems sound great in slides and cost a lot in production. The honest, expensive truth: most teams who reach for multi-agent didn't need it. A well-equipped single agent with parallel tools would have done the job for a fraction of the price (roughly a quarter of the tokens, going by the ~4× vs. ~15× figures later in this chapter). This chapter covers when multi-agent actually helps, the orchestration patterns that work, and the failure modes nobody puts in the marketing.

🔑
Anthropic's stated rule
From "How we built our multi-agent research system": "Don't go multi-agent until single-agent + tools demonstrably fails." Their own multi-agent research stack used ~15× the tokens of a chat interaction. The performance gain (the multi-agent system beat single-agent Opus by 90.2% on internal evals) was worth it for research; for most products it isn't.

When Multi-Agent Actually Helps

Three signals say yes:

  1. Genuinely parallel sub-tasks. A research query that needs 5 independent searches benefits from 5 concurrent sub-agents. A chatbot answering one question does not.
  2. Distinct contexts that don't fit in one window. Each sub-agent reads its own corpus and returns a digest; the lead agent never sees the raw text.
  3. Specialization with sharply different prompts. A "code reviewer" and a "test generator" want incompatible system prompts; multi-agent lets each have its own.

Three signals say no:

  1. The task is sequential. Step B needs step A's output, step C needs step B's output. Multi-agent here is just a loop with overhead.
  2. Coordination dominates work. If sub-agents spend more tokens negotiating than acting, the architecture is wrong.
  3. Reliability matters more than breadth. Each agent boundary is another failure surface. A single agent fails in one place; multi-agent fails in N.
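
The first "yes" and the first "no" come down to simple arithmetic: independent sub-tasks finish in roughly the time of the slowest one, while a sequential chain takes the sum, so extra agents buy nothing. A minimal sketch of that arithmetic, assuming made-up latencies and ignoring coordination overhead (the function names are illustrative, not from any library):

```python
def sequential_wall_time(latencies_s):
    """Each step waits on the previous one, so latencies add up."""
    return sum(latencies_s)


def parallel_wall_time(latencies_s):
    """Independent steps run concurrently, so the slowest one dominates."""
    return max(latencies_s) if latencies_s else 0


def wall_time_saved(latencies_s):
    """Fraction of wall time saved by running the steps in parallel."""
    total = sequential_wall_time(latencies_s)
    if total == 0:
        return 0.0
    return 1 - parallel_wall_time(latencies_s) / total


# Hypothetical example: five independent searches, 8 seconds each.
searches = [8, 8, 8, 8, 8]
print(sequential_wall_time(searches), parallel_wall_time(searches))
print(f"{wall_time_saved(searches):.0%} saved")
```

If the five steps depended on each other, only `sequential_wall_time` would apply, and splitting them across agents would add cost without saving any time.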

The Orchestrator-Workers Pattern

The cleanest, most-shipped multi-agent pattern is orchestrator-workers (sometimes called "lead agent + sub-agents" or "coordinator + specialists"). One LLM acts as the lead — it plans, decomposes, and dispatches sub-tasks to worker LLMs. Workers do focused, scoped work and return digested results. The lead synthesizes and either delegates again or finishes.

[Figure: Orchestrator-workers, lead + parallel sub-agents. A lead agent (strong model, e.g. Opus / GPT-5) plans and synthesizes. Worker A (search papers, Sonnet), Worker B (scrape news, Sonnet), Worker C (query DB, Haiku), and Worker D (summarize, Haiku) each run in their own context. Lead returns digested findings only, not raw worker context. Up to ~90% wall-time saved by parallelism.]
The pattern Anthropic publicly described for their research system. Lead agent uses Opus, workers use Sonnet/Haiku. Workers run in parallel and report compact digests, not raw text — preserving the lead's context budget.
Python · orchestrator-workers sketch
import asyncio

# lead_agent, worker_agent, PlanSchema, build_synthesis_prompt, and the tools
# (web_search, fetch_url, summarize) are placeholders for your own wrappers
# around the model API; they are not defined in this sketch.

async def research_orchestrator(query):
    # 1. Lead plans subtasks
    plan = await lead_agent(
        model="claude-opus-4-7",
        prompt=f"Decompose this research query into 3-6 independent subtasks: {query}",
        response_schema=PlanSchema,
    )
    # 2. Workers run in parallel, each in its own context.
    #    These are un-awaited coroutines; gather() schedules them concurrently.
    workers = [
        worker_agent(model="claude-sonnet-4-6", task=t,
                     tools=[web_search, fetch_url, summarize])
        for t in plan.subtasks
    ]
    digests = await asyncio.gather(*workers)
    # 3. Lead synthesizes: it sees only the digests, not raw text
    final = await lead_agent(
        model="claude-opus-4-7",
        prompt=build_synthesis_prompt(query, digests),
    )
    return final
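
To watch the data flow end to end without calling a real model, the same shape can be run with stubbed agents. Everything below (`fake_lead_plan`, `fake_worker`, `fake_lead_synthesize`, the digest format) is a hypothetical stand-in rather than any provider's API; the only real library call is `asyncio.gather`, used here with `return_exceptions=True` so that one failed worker doesn't sink the whole run.

```python
import asyncio


async def fake_lead_plan(query: str) -> list[str]:
    # Stand-in for the lead's planning call: returns three fixed subtasks.
    return [f"{query}: angle {i}" for i in range(1, 4)]


async def fake_worker(task: str) -> str:
    # Stand-in for a worker: "reads" a lot of raw text, returns a short digest.
    raw_text = f"(long raw material about {task}) " * 50
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"digest[{task}] ({len(raw_text)} raw chars condensed)"


async def fake_lead_synthesize(query: str, digests: list[str]) -> str:
    # Stand-in for the synthesis call: the lead only ever sees digests.
    return f"Answer to '{query}' from {len(digests)} digests"


async def run(query: str) -> tuple[str, list[str]]:
    subtasks = await fake_lead_plan(query)
    results = await asyncio.gather(
        *(fake_worker(t) for t in subtasks), return_exceptions=True
    )
    # Drop failed workers instead of passing exception objects to the lead.
    digests = [r for r in results if not isinstance(r, BaseException)]
    final = await fake_lead_synthesize(query, digests)
    return final, digests
```

Calling `asyncio.run(run("battery recycling"))` returns the final answer plus the three digests. The raw text each worker built never leaves `fake_worker`, which is the point of the pattern.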

Other Multi-Agent Patterns

Handoffs (OpenAI Agents SDK style)

Each agent has a specialty (sales, support, billing). When a turn arrives, the routing agent decides which specialist to delegate to. The specialist owns the conversation until it hands back. OpenAI's Agents SDK implements handoffs as a primitive.
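
The ownership rule is easier to see in code. The sketch below is plain Python and deliberately does not use the OpenAI Agents SDK API: `route` is a keyword-based stand-in for the routing LLM call, and the hand-back phrase is an assumption made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    owner: str = "router"  # who currently owns the conversation
    log: list = field(default_factory=list)


def route(message: str) -> str:
    # Stand-in for the routing agent; a real system would make an LLM call.
    text = message.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "price" in text or "upgrade" in text:
        return "sales"
    return "support"


def handle_turn(conv: Conversation, message: str) -> str:
    """Return the name of the agent that handled this turn."""
    if conv.owner == "router":
        conv.owner = route(message)  # handoff: a specialist takes ownership
    handler = conv.owner
    conv.log.append((handler, message))
    if "that's all" in message.lower():
        conv.owner = "router"  # specialist hands the conversation back
    return handler


conv = Conversation()
print(handle_turn(conv, "I need a refund"))        # routed to billing
print(handle_turn(conv, "What about the price?"))  # billing still owns the turn
```

Note that the second message mentions "price" but is not re-routed to sales: the specialist keeps the conversation until it hands back.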

Critic / Evaluator pairs

One agent produces, another critiques, the producer revises. Bounded by a maximum number of iterations or a quality threshold. Equivalent to Anthropic's "evaluator-optimizer" workflow pattern. Use when you have a measurable quality dimension (correctness, brand fit) but no deterministic test.
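
A minimal sketch of the loop, bounded by both an iteration cap and a quality threshold. The `produce` and `evaluate` callables stand in for the two LLM calls; the toy versions here are invented so the example runs without a model.

```python
from typing import Callable


def evaluator_optimizer(
    produce: Callable[[str, str], str],            # (task, feedback) -> draft
    evaluate: Callable[[str], tuple[float, str]],  # draft -> (score, feedback)
    task: str,
    threshold: float = 0.8,
    max_iterations: int = 3,
) -> tuple[str, float, int]:
    """Return (best_draft, best_score, iterations_used)."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    feedback = ""
    best_draft, best_score = "", float("-inf")
    for iteration in range(1, max_iterations + 1):
        draft = produce(task, feedback)
        score, feedback = evaluate(draft)
        if score > best_score:
            best_draft, best_score = draft, score
        if score >= threshold:
            break  # good enough: stop paying for more rounds
    return best_draft, best_score, iteration


# Toy stand-ins: the critic wants a source, the producer adds one when told to.
def toy_produce(task: str, feedback: str) -> str:
    return task + (" [source]" if "citation" in feedback else "")


def toy_evaluate(draft: str) -> tuple[float, str]:
    return (1.0, "") if "[source]" in draft else (0.4, "add a citation")


print(evaluator_optimizer(toy_produce, toy_evaluate, "Summarize X"))
```

Keeping the best draft seen so far, rather than the last one, guards against a revision that scores worse than its predecessor.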

Group chat / debate

Multiple agents take turns talking until consensus or termination. Frameworks like AutoGen popularized this. In practice it's chatty and expensive — useful for adversarial robustness ("red team vs. blue team") but rarely for production tasks.

Swarm / handoff networks

OpenAI's now-deprecated Swarm popularized stateless handoff networks; the pattern lives on in CrewAI's "agents-as-roles + tasks + crews" abstraction. Each agent is a role; handoffs are first-class. Good for business-process automation; verbose for general purpose.

Cost Reality — The 15× Number

Anthropic's published research-system numbers are the most credible cost reference we have:

System type | Token usage vs. baseline chat | Performance vs. single-agent Opus
Chat (single Q&A) | 1× (baseline) | -
Single-agent Opus + tools | ~4× | baseline
Multi-agent (Opus lead + Sonnet workers) | ~15× | +90.2%

Their post explicitly notes that token usage alone explains ~80% of performance variance, with the number of tool calls and model choice as the other main factors. Translation: the multi-agent gains come mostly from using more tokens to think more, not from any magic in coordination. If you can afford the tokens with a single agent, you often don't need the architecture.

⚠️
The unit-economics check
Before going multi-agent, work out the per-request cost with caching, then multiply by 15. If that destroys your unit economics, single-agent is your answer regardless of the marginal accuracy gain. A 90% improvement at 15× cost is only a deal when the value of accuracy is unbounded — research, legal, medical. For consumer chatbots, it isn't.
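
The check fits in a few lines. This sketch uses the token multipliers from the table above and assumes cost scales linearly with tokens, which ignores price differences between lead and worker models. The dollar figures are hypothetical.

```python
# Token usage relative to baseline chat, from the table above.
TOKEN_MULTIPLIER = {"chat": 1, "single_agent": 4, "multi_agent": 15}


def per_request_margin(chat_cost_usd, value_per_request_usd, architecture):
    """Return (estimated_cost, margin) for one request.

    Assumption: cost is proportional to tokens used.
    """
    cost = chat_cost_usd * TOKEN_MULTIPLIER[architecture]
    return cost, value_per_request_usd - cost


# Hypothetical numbers: $0.002 per cached chat request, $0.02 of value each.
for arch in TOKEN_MULTIPLIER:
    cost, margin = per_request_margin(0.002, 0.02, arch)
    print(f"{arch:12s} cost=${cost:.3f} margin=${margin:+.3f}")
```

With these made-up inputs, chat and single-agent keep a positive margin and multi-agent goes negative. Plug in your own per-request cost and value before deciding.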

Multi-Agent Failure Modes

From Anthropic's post and accumulated production experience:

  • Token-cost runaway — workers fan out, each fans out again, costs go exponential. Mitigation: per-agent token budgets enforced by the lead.
  • Stale plans — lead drafts a plan, workers find new info that invalidates it, lead doesn't re-plan. Mitigation: bounded re-planning steps.
  • Coordination overhead > work — agents spend more tokens summarizing for each other than doing the task. Mitigation: minimize hand-off bandwidth; pass digests not raw text.
  • Inconsistent answers across runs — non-determinism compounds across N agents. Mitigation: temperature 0 on lead, fixed seeds where supported, eval suites that catch regressions.
  • Prompt injection via worker output — a worker's tool result contains injected instructions; the lead acts on them. Mitigation: treat all sub-agent output as untrusted; the lead should not execute new tools solely on worker recommendation.
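
The first mitigation, per-agent token budgets enforced by the lead, can be sketched as a small budget object that caps both spend and fan-out depth. The class and method names are hypothetical, and real token counts would come from your provider's usage metadata.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent tries to spend or delegate past its limits."""


class TokenBudget:
    def __init__(self, limit: int, depth_left: int = 1):
        self.limit = limit            # total tokens this agent may use
        self.used = 0
        self.depth_left = depth_left  # how many more levels may fan out

    @property
    def remaining(self) -> int:
        return self.limit - self.used

    def spend(self, tokens: int) -> None:
        if tokens > self.remaining:
            raise BudgetExceeded(f"need {tokens}, only {self.remaining} left")
        self.used += tokens

    def delegate(self, tokens: int) -> "TokenBudget":
        """Carve a child budget out of this one for a sub-agent."""
        if self.depth_left <= 0:
            raise BudgetExceeded("fan-out depth limit reached")
        self.spend(tokens)  # reserved tokens count against the parent
        return TokenBudget(tokens, self.depth_left - 1)


lead = TokenBudget(limit=100_000, depth_left=1)
worker = lead.delegate(20_000)  # lead now has 80,000 left
worker.spend(15_000)            # worker has 5,000 left
# worker.delegate(1_000) would raise: workers may not fan out again here.
```

Because every child budget is reserved out of its parent, total spend can never exceed the lead's limit, and `depth_left` stops workers from fanning out again.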

The Multi-Agent Decision Tree

  1. Can a single LLM call do this? If yes — stop, you're done.
  2. Can a workflow (chain / route / parallel) do this? If yes — build the workflow.
  3. Can a single agent + tools do this? If yes — build the agent.
  4. Are sub-tasks genuinely independent and parallelizable? Yes? Now consider orchestrator-workers.
  5. Do specialists need incompatible system prompts? Yes? Consider handoffs.
  6. Do you have a measurable quality dimension? Yes? Consider evaluator-optimizer.
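
The tree reads naturally as a function with early returns. This is a sketch: the boolean inputs are judgments you make about your own task, and the fallback for the case where none of steps 4 to 6 applies is an assumption based on the chapter's "when in doubt" advice.

```python
def recommend_architecture(
    single_call_suffices: bool,
    workflow_suffices: bool,
    single_agent_suffices: bool,
    subtasks_independent: bool = False,
    incompatible_prompts: bool = False,
    measurable_quality: bool = False,
) -> list[str]:
    # Steps 1-3: stop at the simplest thing that works.
    if single_call_suffices:
        return ["single LLM call"]
    if workflow_suffices:
        return ["workflow"]
    if single_agent_suffices:
        return ["single agent + tools"]
    # Steps 4-6: only now consider multi-agent patterns.
    options = []
    if subtasks_independent:
        options.append("orchestrator-workers")
    if incompatible_prompts:
        options.append("handoffs")
    if measurable_quality:
        options.append("evaluator-optimizer")
    # Assumed fallback: when nothing fits, revisit the single agent.
    return options or ["single agent + tools"]


# A research task with independent searches and a scoring rubric.
print(recommend_architecture(False, False, False,
                             subtasks_independent=True,
                             measurable_quality=True))
```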
Quick check
Your team is building a customer-support agent. The product manager wants four sub-agents: an "intent classifier", a "knowledge retriever", a "tone checker", and a "response writer". Should you build it that way?
Answer
Almost certainly no. Three of those are single-step functions, not agents: intent classification is a single LLM call with structured output; knowledge retrieval is a vector search; tone checking is a single LLM call with a rubric. Only the response writer is plausibly agent-shaped, and even that fits in one call here. Build it as a workflow (classify → retrieve → write → tone-check) with a single LLM call per step. You'll spend roughly 1/10th the tokens, get more predictable behavior, and debug it in one afternoon. The "multi-agent customer support" pitch is the textbook case of multi-agent overuse.
🔑
Key takeaways
  1. Multi-agent costs ~15× more tokens than chat; make sure the value justifies it.
  2. Orchestrator-workers is the pattern most worth learning; the rest are niche.
  3. Use a strong model for the lead, cheap models for workers.
  4. Pass digests, not raw context, between agents.
  5. Treat sub-agent output as untrusted; prompt injection at the seam is a real attack surface.
  6. When in doubt, single agent + parallel tools wins.
