AI Products & Strategy April 4, 2026 · 8 min read

Multi-Agent Systems: When One Agent Isn't Enough

Nine articles in this series used a single agent. This one explains when that stops being sufficient and what to do about it. Four signals tell you it is time. Three patterns handle 90% of cases. The hardest part is not building the system; it is debugging it when something goes wrong.

By Vikas Pratap Singh
#ai-agents #multi-agent-systems #agent-orchestration #agent-architecture #agentic-engineering

Part 10 of 12: The Practitioner’s Guide to AI Agents

The Case Against This Article

Nine articles in this series used a single agent. That was deliberate. Article 2 argued that most agent projects should not be agents at all. Article 3 used Pike’s Rules to argue that complexity is a cost, not a feature. Article 6 showed that a single agent with well-engineered context handles more than most teams expect.

If this series has a thesis, it is: start simple, measure, add complexity only when measurement proves you need it.

So why write an article about multi-agent systems?

Because there are cases where a single agent genuinely is not enough, and the difference between “I need multi-agent” and “I think I need multi-agent” is one of the most expensive mistakes in agent engineering. Anthropic’s own team documented this when building their multi-agent research system: early versions spawned 50 subagents for simple queries, endlessly searched for nonexistent sources, and selected SEO content farms over authoritative academic papers. If Anthropic’s engineers over-engineered multi-agent, your team probably will too.

This article is not a tutorial for building multi-agent systems. It is a decision framework for knowing when you actually need one, and what to watch for when you do.

The Default Answer Is Still One Agent

Anthropic’s Building Effective Agents guide is direct: “We recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all.”

The research backs this up. Li (2026) found that a single agent with well-organized skills “can substantially reduce token usage and latency while maintaining competitive accuracy on reasoning benchmarks” compared to multi-agent systems. The paper also identified a phase transition: skill selection accuracy remains stable up to a critical library size, then drops sharply. The failure is driven by semantic confusability among similar skills, not library size alone.

The compound error math from Article 1 applies with greater force here. Every agent-to-agent handoff is a new boundary where context can be lost, instructions can be misinterpreted, and errors can compound. A three-agent chain with 90% per-agent accuracy succeeds only 73% of the time. A five-agent chain drops to 59%.
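The handoff arithmetic is easy to check: end-to-end success is the product of per-agent accuracies. A two-line sketch:

```python
# End-to-end success of an agent chain is the product of per-step accuracies.
def chain_success(per_agent_accuracy: float, num_agents: int) -> float:
    return per_agent_accuracy ** num_agents

print(f"{chain_success(0.9, 3):.0%}")  # 3-agent chain → 73%
print(f"{chain_success(0.9, 5):.0%}")  # 5-agent chain → 59%
```

Every agent you add multiplies in another factor below 1.0; the chain can only get less reliable.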

For practitioners: Before reaching for multi-agent, ask: “Have I tried improving this single agent’s context, tools, and prompts?” If the answer is no, that is your next step, not adding agents. Article 5 (prompt patterns) and Article 6 (context engineering) exist for this reason.

Four Signals That One Agent Is Not Enough

Not every multi-agent decision is premature. Four signals suggest a single agent is hitting real limits, not perceived ones:

Tool count exceeds 15-20. LLMs lose focus when choosing among too many tools. Anthropic’s research system found that splitting tools across specialist subagents helped each model match the right tool to the right sub-task. The orchestrator’s job became delegation, not execution.

Tasks require fundamentally different system prompts. A code-writing agent and a code-review agent need different instructions, different personas, and different success criteria. Forcing both into one prompt produces neither well. If you find yourself writing “when doing X, behave like this; when doing Y, behave like that” in a single system prompt, you have two agents pretending to be one.

Latency demands parallel execution. If five independent data sources need querying and the user cannot wait for sequential calls, fan out to parallel agents and merge the results. Anthropic’s research system achieved up to 90% speedup on complex queries by running subagents in parallel rather than sequentially.

Trust boundaries require isolation. Some tools access sensitive data (PII, financial records, credentials). Isolating those tools in a separate agent with stricter guardrails limits the blast radius if any single agent is compromised. This connects directly to the Isolation criterion from the Context Engineering paper and the lethal trifecta from Article 8.

If none of these four signals apply, you do not need multi-agent. If one or two apply, the patterns below will help.

Three Patterns That Handle 90% of Cases

Anthropic’s guide identifies five workflow patterns. For most production systems, three cover the ground.

Router. One orchestrator receives the user request and dispatches to the right specialist. The specialist executes and returns a result. The orchestrator’s decision is a single, auditable choice. This is the simplest multi-agent pattern and the right starting point. OpenAI’s Agents SDK implements this as the “handoff” primitive: agents transfer control explicitly, carrying conversation context through the transition.
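A minimal Router sketch, not tied to any SDK: `llm` stands in for whatever model client you use, and the stub below exists only so the sketch runs. The point is that the orchestrator makes one auditable choice, with an explicit fallback rather than a silent failure.

```python
from typing import Callable

# Specialist personas with explicitly non-overlapping scopes (illustrative).
SPECIALISTS = {
    "pricing": "You handle pricing data only. Do not analyze usage patterns.",
    "usage": "You analyze usage patterns only. Do not quote prices.",
}

def route(request: str, llm: Callable[[str, str], str]) -> str:
    # One auditable decision: which specialist owns this request.
    choice = llm("Reply with one word: pricing or usage.", request).strip().lower()
    if choice not in SPECIALISTS:
        choice = "pricing"  # explicit fallback, never a silent failure
    return llm(SPECIALISTS[choice], request)

# Stub standing in for a real model client, so the sketch is runnable.
def stub_llm(system: str, user: str) -> str:
    if system.startswith("Reply"):
        return "usage" if "usage" in user.lower() else "pricing"
    return f"[{system.split('.')[0]}] answering: {user}"
```

Because the routing decision is a single string, it can be logged and audited per request, which is exactly what the Supervisor and Fan-out patterns give up.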

Supervisor. One agent directs and monitors others through a multi-step workflow, enforcing quality gates between steps. The research agent’s output must pass a relevance check before the analysis agent sees it. Good for workflows where intermediate quality matters. The supervisor pattern is how Anthropic’s research system works: a lead agent analyzes queries, develops strategy, and spawns subagents while monitoring their progress.
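The quality gate is the defining feature, and it can be sketched in a few lines. The agent and checker callables here are illustrative, not a specific framework API:

```python
# Supervisor sketch: research output must pass a relevance gate before the
# analysis agent ever sees it. All callables are illustrative stand-ins.
def supervise(task, research_agent, relevance_check, analysis_agent, max_retries=2):
    for _ in range(max_retries + 1):
        findings = research_agent(task)
        if relevance_check(task, findings):  # gate: only relevant output advances
            return analysis_agent(findings)
    raise RuntimeError("research output failed the relevance gate")
```

The retry bound matters: an unbounded supervisor loop is how orchestrators end up endlessly re-spawning subagents.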

Fan-out/fan-in. Multiple agents process in parallel, and results merge into a single output. Good for independent sub-tasks: search five sources simultaneously, then synthesize. The merge step is the critical design point: define how conflicts between parallel results are resolved before you build the fan-out. Unresolved conflicts at merge time are the primary source of contradictory multi-agent output.
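A fan-out/fan-in sketch with the merge policy written first, as the text recommends. The source function is an illustrative stand-in for a real network call:

```python
import asyncio
from typing import Any

async def query_source(name: str, question: str) -> dict[str, Any]:
    await asyncio.sleep(0)  # stands in for a real parallel network call
    return {"source": name, "answer": f"{name}'s take on {question}"}

def merge(results: list[dict[str, Any]]) -> dict[str, Any]:
    # Conflict policy, decided before the fan-out was built: keep every
    # answer and flag disagreement explicitly instead of letting
    # contradictions surface downstream.
    answers = {r["source"]: r["answer"] for r in results}
    return {"answers": answers, "conflict": len(set(answers.values())) > 1}

async def fan_out(question: str, sources: list[str]) -> dict[str, Any]:
    results = await asyncio.gather(*(query_source(s, question) for s in sources))
    return merge(list(results))
```

Here `merge` surfaces conflicts rather than resolving them; a production policy might instead prefer a trusted source or re-query, but either way the choice is written down before the parallel agents exist.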

Pike’s Rule 4 applies to pattern selection: start with Router. Move to Supervisor only when you need intermediate quality gates. Use Fan-out/fan-in only when parallelism provides measurable latency improvement. Each step up adds debugging complexity that the simpler pattern avoids.

Task Decomposition: The Hard Part

Choosing a pattern is the easy decision. Decomposing the task correctly is where multi-agent systems succeed or fail.

Non-overlapping scopes. Each sub-agent owns a clearly defined slice of the problem. If two agents can both answer the same sub-question, they will, and their answers may conflict. Define scope boundaries explicitly in each agent’s system prompt: “You handle pricing data only. Do not analyze usage patterns.” Overlap is the primary source of contradictions in multi-agent output.

Right-sized tasks. A task too small for one agent (single API call, simple lookup) does not justify a dedicated sub-agent. A task too large (requiring 50+ tool calls, multiple distinct skill sets) should be decomposed further. The heuristic: if a single agent can complete the task in under 10 tool calls with focused tools, keep it as one agent. If it consistently exceeds that or requires context beyond the window, split.

Explicit handoff contracts. Define the input and output schema for each sub-agent before building the orchestrator. When Agent A passes results to Agent B, both must agree on the data shape. An undefined handoff produces the multi-agent equivalent of a broken API contract: the downstream agent receives data it cannot interpret and either fails or hallucinates through the gap.
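One way to make the contract concrete is to define the handoff schema in code and validate it at the boundary. A minimal sketch with a hypothetical research-to-analysis handoff; the field names are illustrative:

```python
from dataclasses import dataclass, fields

# Explicit handoff contract: the research agent must produce exactly this
# shape before the analysis agent is invoked. Field names are illustrative.
@dataclass(frozen=True)
class ResearchResult:
    query: str
    findings: list[str]
    sources: list[str]

def parse_handoff(payload: dict) -> ResearchResult:
    # Fail loudly at the boundary instead of letting the downstream agent
    # hallucinate through missing fields.
    expected = {f.name for f in fields(ResearchResult)}
    missing = expected - payload.keys()
    if missing:
        raise ValueError(f"handoff violates contract, missing: {sorted(missing)}")
    return ResearchResult(**{k: payload[k] for k in expected})
```

A rejected handoff is cheap to debug; a hallucinated one is not, because the error surfaces two agents downstream.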

Anthropic learned this the hard way. Their initial orchestrator gave vague instructions like “research the semiconductor shortage,” which caused duplicate work and misaligned efforts across subagents. The fix was teaching orchestrators “explicit delegation frameworks with clear task boundaries.” The delegation prompt mattered more than the model.

Why Debugging Multi-Agent Is Fundamentally Harder

This is the cost that multi-agent advocates understate.

When a single agent produces wrong output, you read the trace, find the step where reasoning diverged, and fix the prompt or the tool. When a multi-agent system produces wrong output, you face a new question: which agent introduced the error?

The error may have originated in Agent A’s tool call, been passed through Agent B’s context without detection, and surfaced in Agent C’s output. The agent that looks wrong (C) is not the agent that is wrong (A). Without trace propagation across agents, you are debugging by guessing.

Three practices make multi-agent debugging tractable:

Trace propagation. Every agent gets a trace ID linking it to the parent request. Log the trace ID with every tool call and every inter-agent message. The observability infrastructure from Article 9 is not optional for multi-agent; it is a prerequisite.

Blame isolation. When output is wrong, replay each agent’s input and check its output independently. The agent whose output diverges from its input’s ground truth is the one that introduced the error. This is time-consuming, which is exactly why you should not add agents until measurement proves you need them.

Centralized logging. All agents write to the same log store with the same trace ID. Scattered logs across separate agent processes make correlation impossible.
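The three practices above reduce to a small amount of plumbing. A sketch using only the standard library, where `run_agent` stands in for real agent work: every call carries the parent request's trace ID, and every log line is written with it so a centralized store can reassemble the full run.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agents")

def new_trace_id() -> str:
    # One trace ID per user request, shared by every agent it spawns.
    return uuid.uuid4().hex

def run_agent(name: str, task: str, trace_id: str) -> str:
    # Structured log lines keyed by trace_id make cross-agent correlation
    # a query, not a guessing game.
    log.info(json.dumps({"trace_id": trace_id, "agent": name, "task": task}))
    result = f"{name} finished: {task}"  # stands in for real agent work
    log.info(json.dumps({"trace_id": trace_id, "agent": name, "result": result}))
    return result

trace = new_trace_id()
plan = run_agent("orchestrator", "decompose request", trace)
out = run_agent("specialist", plan, trace)  # same trace_id across the handoff
```

With this in place, blame isolation becomes a filter on `trace_id` followed by a replay of the first agent whose output diverged.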

The Diagrid analysis of production agent frameworks found that LangGraph, CrewAI, and Google ADK all lack true durable execution. State checkpointing (saving snapshots of agent state) is not the same as guaranteed completion. When a multi-agent workflow fails mid-execution, these frameworks hand you a snapshot, not a recovery path.

The Framework Landscape (Briefly)

The frameworks exist and are maturing. What matters is not which one you pick but whether you need one at all.

| Framework | Pattern strength | Trade-off |
| --- | --- | --- |
| Anthropic SDK (direct) | Full control, any pattern | You build the orchestration yourself |
| OpenAI Agents SDK | Cleanest handoff model | Best with OpenAI models |
| LangGraph | Explicit state machines, complex workflows | Steeper learning curve, ecosystem lock-in |
| CrewAI | Role-based teams, fast prototyping | LLM-managed delegation can be inconsistent; higher token overhead |

The common production pattern: prototype in CrewAI (fast iteration), then ship in LangGraph or direct SDK (control and reliability). Do not let the prototyping framework become the production framework without evaluating the trade-offs.

Do Next

| Priority | Action | Why it matters |
| --- | --- | --- |
| Before anything | Revisit your single agent. Apply Article 5 (prompt patterns) and Article 6 (context criteria) first. Measure whether output quality improves. | Most teams that think they need multi-agent have not exhausted single-agent optimization. |
| If a signal is present | Identify which of the four signals applies to your use case. Write it down explicitly. | “We need multi-agent because it feels more sophisticated” is not a signal. If you cannot name the specific limitation, you do not have one. |
| First multi-agent | Start with the Router pattern. One orchestrator, two specialists, explicit handoff contracts. | Router is the simplest to debug. Prove you need Supervisor or Fan-out before building them. |
| Production | Implement trace propagation and centralized logging before deploying multi-agent to production. | You cannot debug what you cannot trace. Multi-agent without observability is flying blind with more engines. |
| Ongoing | Monitor the merge/handoff points specifically. Track how often sub-agents produce contradictory results. | Contradictions at merge time are the signature failure mode of multi-agent. If the rate exceeds 5%, your decomposition has overlap. |


Sources & References

  1. Anthropic: Building Effective Agents (2024)
  2. Anthropic: How We Built Our Multi-Agent Research System (2026)
  3. OpenAI Agents SDK: Multi-Agent Orchestration (2025)
  4. Xiaoxiao Li: When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail (2026)
  5. VentureBeat: 'More Agents' Isn't a Reliable Path to Better Enterprise AI (2026)
  6. Diagrid: Checkpoints Are Not Durable Execution (LangGraph, CrewAI, Google ADK) (2026)
  7. Rob Pike's Rules of Programming (1989)
