Multi-Agent Orchestration

Coordinate multiple AI agents: topology, handoff contracts, shared state, and system-wide cost and failure control.

Use this when one job needs more than one agent working together: an orchestrator dispatching workers, a pipeline of specialists, a peer handoff, or a critic checking a producer. Covers the decision (should this be multi-agent at all), the topology, the handoff contract between agents, shared state, and how you keep cost, latency, and failure under control across the whole system. If a single agent with the right tools can do the job, you do not need this skill; use /ai-agent-design instead. Most jobs do not need multi-agent.

Related skills: Design each agent first with /ai-agent-design. Evaluate the system end to end with /agent-eval-harness. Plan tool and data access with /mcp-integration-plan. Monitor it in production with /llm-observability-plan.

The hard part most teams miss

Adding agents feels like adding capacity. It is usually adding surface area to fail on.

  1. Multi-agent multiplies cost, latency, and failure surface, and the coordination tax often eats the benefit. Two agents are not twice the work; they are the work plus every round trip between them, every duplicated context window, and every chance that one stalls while the other waits. If a single agent with the right tools clears the bar, ship that. Reach for multiple agents only when the job genuinely needs parallel breadth or separated context that one agent cannot hold at once.
  2. The handoff contract is the real failure point, not the agents. Each agent can be individually excellent and the system still produces garbage, because agent A handed agent B something B did not expect. What A passes and what B must return is the interface, and an unspecified interface is where every multi-agent system breaks. Pin the contract before you tune any single agent.
  3. One place has to own termination and total cost, or the system runs away. Per-agent step caps do not bound the system; a parent can re-dispatch workers forever under its own cap. You need a single owner of the global budget and the global stop condition. Without it, a loop you cannot see spends money you did not approve.
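
Point 3 can be illustrated with a minimal sketch. All names here (`GlobalBudget`, `run_worker`, `orchestrate`) are hypothetical, and tool calls are simulated by a counter rather than a real agent framework. Each worker stays inside its own step cap, yet only the shared ceiling stops a parent that keeps re-dispatching.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when the run-wide ceiling is hit."""


@dataclass
class GlobalBudget:
    """Single run-wide ceiling shared by every agent (hypothetical helper)."""
    max_tool_calls: int
    used: int = 0

    def charge(self, calls: int = 1) -> None:
        # Refuse the charge instead of silently overspending.
        if self.used + calls > self.max_tool_calls:
            raise BudgetExceeded(f"{self.used} of {self.max_tool_calls} calls used")
        self.used += calls


def run_worker(budget: GlobalBudget, step_cap: int) -> int:
    """Stand-in worker: spends up to its own step cap, charging every step."""
    for _ in range(step_cap):
        budget.charge()
    return step_cap


def orchestrate(budget: GlobalBudget, step_cap: int, max_dispatches: int) -> str:
    """Parent that keeps re-dispatching; the global budget is what stops it."""
    try:
        for _ in range(max_dispatches):
            run_worker(budget, step_cap)
    except BudgetExceeded:
        return "stopped: global budget hit"
    return "finished"


budget = GlobalBudget(max_tool_calls=25)
outcome = orchestrate(budget, step_cap=10, max_dispatches=1000)
```

No single worker ever exceeds its cap of 10 steps, but without `GlobalBudget` the parent would have dispatched 1000 of them. In a real system the same object would likely also track time and spend.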

Process

Step 1: Gather inputs

Ask the user:

  1. What is the job, end to end? (One or two sentences. The outcome, not the agents.)
  2. Why can't one agent do it? (Be specific: parallel breadth, context that won't fit one window, genuinely distinct skills, or an independent check. If you can't answer, it is probably a single agent.)
  3. What are the distinct roles? (Each agent's job and its one responsibility. If two roles blur, they are one agent.)
  4. What does "done" look like for the whole system? (A checkable success condition for the job, not per agent.)
  5. What is the total budget? (Across all agents: rough tool-call ceiling, time, and dollar cost before the system must stop.)
  6. What is the cost of a wrong final answer? (Reversible and cheap, or irreversible and expensive. This sets how hard the critic or human gate must be.)

Step 2: Confirm it should be multi-agent

Single agent is the default. Go multi-agent only if at least one of the following holds and no cheaper tier (a single agent or a plain workflow) clears the bar:

  • Parallel breadth: the job splits into independent sub-tasks that genuinely run at once and shorten wall-clock time.
  • Context separation: the work needs more focused context than one window can hold well, so isolated agents each carry their own slice.
  • Distinct expertise: sub-tasks need materially different tools, prompts, or models, not just different phrasing.
  • Independent verification: the answer needs a separate critic that did not produce it.

If none hold, drop to a single agent (/ai-agent-design) or a plain workflow. Say so plainly. The coordination tax is real and the cheaper tier usually wins.
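
The gate above can be written down as a small check, which helps keep the decision explicit in a design review. This is only a sketch: `JobProfile`, its field names, and `choose_tier` are illustrative assumptions, and the booleans stand in for judgment calls made in Step 1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class JobProfile:
    """Answers gathered in Step 1 (field names are illustrative)."""
    parallel_breadth: bool = False
    context_separation: bool = False
    distinct_expertise: bool = False
    independent_verification: bool = False
    # True when a single agent or a plain workflow already clears the bar.
    cheaper_tier_clears_bar: bool = True


def choose_tier(job: JobProfile) -> Tuple[str, List[str]]:
    """Return the recommended tier and the Step 2 conditions that justify it."""
    reasons = [
        name
        for name in (
            "parallel_breadth",
            "context_separation",
            "distinct_expertise",
            "independent_verification",
        )
        if getattr(job, name)
    ]
    # Single agent wins unless a condition holds AND the cheaper tier fails.
    if job.cheaper_tier_clears_bar or not reasons:
        return "single-agent", []
    return "multi-agent", reasons


simple = choose_tier(JobProfile(parallel_breadth=True))
complex_job = choose_tier(
    JobProfile(
        parallel_breadth=True,
        independent_verification=True,
        cheaper_tier_clears_bar=False,
    )
)
```

Here `simple` stays single-agent even though parallel breadth holds, because the cheaper tier still clears the bar; `complex_job` goes multi-agent and records which conditions justified it.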

Step 3: Choose the topology

Pick the simplest shape that fits. Name it explicitly.

  • Orchestrator-worker: a lead agent decomposes the job, dispatches workers (often in parallel), and synthesizes their results. The common production pattern, and the right default when sub-tasks are independent and the lead can judge the whole.
  • Sequential pipeline: agents run in a fixed order, each consuming the prior output. Use when stages have a hard dependency order. Cheapest to reason about; no real parallelism.
  • Peer handoff: control passes between agents by role (triage hands off to a specialist, which hands back when done). Use for routing-style work where the next owner depends on content.
  • Debate / critic: a producer and a critic (or several) iterate until the critic passes or a cap is hit. Use when the cost of a wrong answer justifies an independent check. Cap the rounds hard.
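
As one illustration of "cap the rounds hard", the debate / critic shape can be sketched as a bounded loop. The function `run_debate` and the toy producer and critic are hypothetical stand-ins; real agents would be model calls, and the critic here signals a pass by returning `None`, which is an assumption of this sketch.

```python
import itertools
from typing import Callable, Optional, Tuple


def run_debate(
    produce: Callable[[Optional[str]], str],
    critique: Callable[[str], Optional[str]],
    max_rounds: int,
) -> Tuple[str, bool, int]:
    """Iterate producer and critic until the critic passes or the cap is hit.

    Returns (last draft, whether the critic passed, rounds used).
    """
    feedback: Optional[str] = None
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = produce(feedback)      # producer revises using prior feedback
        feedback = critique(draft)     # None means the critic accepts
        if feedback is None:
            return draft, True, round_no
    return draft, False, max_rounds    # hard cap reached without a pass


# Toy stand-ins: the producer numbers its drafts, the critic accepts "v3".
versions = itertools.count(1)


def toy_producer(feedback: Optional[str]) -> str:
    return f"draft v{next(versions)}"


def toy_critic(draft: str) -> Optional[str]:
    return None if draft.endswith("v3") else "needs another pass"


passed = run_debate(toy_producer, toy_critic, max_rounds=5)
capped = run_debate(toy_producer, lambda draft: "never satisfied", max_rounds=2)
```

The caller branches on the boolean: a run that hits the cap without a pass is a distinct outcome (escalate, or fail the run), not a silent success.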

Step 4: Pin the handoff contract

This is the load-bearing step. For every edge between agents, define the interface so neither side guesses:

  • What the sender passes: the exact payload, its shape, and what is required versus optional. Pass the result, not the full transcript.
  • What the receiver must return: the expected output shape and the success or failure signal the caller branches on.
  • What "bad input" looks like and who handles it: the receiver validates what it got and rejects clearly, rather than improvising on a malformed payload.
  • Shared state vs passed state: decide what lives in shared memory all agents read, and what is passed point to point. Keep shared state small and name its single writer; many writers corrupt it silently.
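
A handoff contract for one edge might look like the following sketch, using only the standard library. The edge (an orchestrator handing a question to a research worker), the type names, and the fields are all hypothetical; the point is that required versus optional fields, the success signal, and bad-input rejection are written down rather than implied.

```python
from dataclasses import dataclass
from typing import Optional


class BadHandoff(ValueError):
    """The payload violates the contract for this edge."""


@dataclass(frozen=True)
class ResearchRequest:
    """What the sender passes: the task, not the full transcript."""
    run_id: str            # required
    question: str          # required
    max_sources: int = 5   # optional, with a default

    def __post_init__(self) -> None:
        if not isinstance(self.run_id, str) or not self.run_id:
            raise BadHandoff("run_id is required")
        if not isinstance(self.question, str) or not self.question.strip():
            raise BadHandoff("question must be a non-empty string")
        if not isinstance(self.max_sources, int) or self.max_sources < 1:
            raise BadHandoff("max_sources must be a positive integer")


@dataclass(frozen=True)
class ResearchResult:
    """What the receiver returns: the caller branches on `ok`."""
    run_id: str
    ok: bool
    summary: str = ""
    error: Optional[str] = None


def research_worker(payload: dict) -> ResearchResult:
    """Validate first; reject clearly instead of improvising on bad input."""
    run_id = str(payload.get("run_id", ""))
    try:
        request = ResearchRequest(**payload)
    except (BadHandoff, TypeError) as exc:  # TypeError: missing or unknown keys
        return ResearchResult(run_id=run_id, ok=False, error=f"bad input: {exc}")
    # Stand-in for the real agent call.
    return ResearchResult(
        run_id=request.run_id,
        ok=True,
        summary=f"stub summary for: {request.question}",
    )


good = research_worker({"run_id": "r1", "question": "What changed in v2?"})
bad = research_worker({"run_id": "r1", "transcript": "...long history..."})
```

In `bad`, the sender tried to pass a transcript instead of a question; the worker returns a failure result the caller can branch on rather than guessing what was meant.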

Step 5: Control cost, latency, and failure across the system

Per-agent limits are not enough. Bound the whole thing:

  • Global termination: one owner holds the system stop condition: success met, total budget hit, or unrecoverable failure. This sits on top of each agent's own step cap; it does not replace those caps, and they do not replace it.
  • Total budget: a single ceiling on tool calls, time, and spend across all agents. A parent that re-dispatches workers can blow past every per-agent cap while staying inside each one.
  • Latency: parallel work is bounded by the slowest worker plus synthesis. Set per-worker timeouts and decide whether the lead proceeds on partials or fails the run.
  • Failure isolation: one worker's failure must not corrupt the run. Return its error to the orchestrator as a result it can route around, retry, or drop, never as a crash that takes the system down.
  • Per-agent observability: trace each agent and each handoff separately with a shared run id, so "the system is broken" resolves to a specific agent or a specific edge. See /llm-observability-plan.
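
The latency and failure-isolation bullets can be sketched with `asyncio`. This assumes workers are async callables; `run_isolated`, `dispatch`, the `min_successes` partial policy, and the toy workers are hypothetical names for illustration. Every worker failure or timeout comes back as a result carrying the shared run id, never as an exception that ends the run.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, Optional


@dataclass
class WorkerOutcome:
    name: str
    ok: bool
    value: Optional[str] = None
    error: Optional[str] = None


async def run_isolated(
    name: str, work: Callable[[], Awaitable[str]], timeout_s: float
) -> WorkerOutcome:
    """Run one worker under a timeout and convert any failure into a result."""
    try:
        value = await asyncio.wait_for(work(), timeout=timeout_s)
        return WorkerOutcome(name, True, value=value)
    except asyncio.TimeoutError:
        return WorkerOutcome(name, False, error="timeout")
    except Exception as exc:  # isolate: one worker's crash is just data
        return WorkerOutcome(name, False, error=f"{type(exc).__name__}: {exc}")


async def dispatch(
    run_id: str,
    workers: Dict[str, Callable[[], Awaitable[str]]],
    timeout_s: float,
    min_successes: int,
) -> dict:
    """Run workers in parallel, then apply the partial-results policy."""
    outcomes = await asyncio.gather(
        *(run_isolated(name, work, timeout_s) for name, work in workers.items())
    )
    good = [o for o in outcomes if o.ok]
    return {
        "run_id": run_id,  # shared id so every outcome traces to this run
        "status": "ok" if len(good) >= min_successes else "failed",
        "results": {o.name: o.value for o in good},
        "errors": {o.name: o.error for o in outcomes if not o.ok},
    }


# Toy workers standing in for real agents.
async def fast() -> str:
    return "fast result"


async def slow() -> str:
    await asyncio.sleep(1.0)
    return "slow result"


async def broken() -> str:
    raise RuntimeError("tool unavailable")


report = asyncio.run(
    dispatch(
        "run-001",
        {"fast": fast, "slow": slow, "broken": broken},
        timeout_s=0.05,
        min_successes=1,
    )
)
```

With `min_successes=1` the lead proceeds on partials; raising it to the number of workers turns the same code into a fail-the-run policy. The `errors` map names the failing worker, which is the minimum needed to resolve "the system is broken" to a specific agent.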

Step 6: Output the orchestration design

# Multi-Agent Orchestration: {{system_name}}

**Job:** {{one sentence}}
**Done means:** {{checkable system-level success condition}}
**Why multi-agent:** {{which Step 2 condition holds, and why a single agent fails}}
**Topology:** {{orchestrator-worker / pipeline / peer handoff / debate}}

## Agents
| Agent | Responsibility | Model/tools | Step cap |
|---|---|---|---|

## Handoff contracts
| Edge (A -> B) | A passes | B returns | Bad-input handling |
|---|---|---|---|

## Shared state
- What is shared: {{fields}}
- Single writer: {{who}}
- What is passed point to point: {{payloads}}

## System control
- Global termination owner: {{who}}
- Total budget (calls / time / spend): {{values}}
- Per-worker timeout + partial policy: {{value, proceed-on-partial or fail}}
- Failure isolation: {{how a worker failure is contained}}
- Observability: {{shared run id, per-agent + per-edge traces}}

## Open questions
- {{unresolved decisions}}

Step 7: Review

Ask the user:

  • Could a single agent with these tools do this instead? (If yes, build that.)
  • For each handoff, what happens when the sender passes something malformed?
  • Who owns the global stop, and what is the worst-case total spend before it fires?
  • When one worker hangs or fails, does the system degrade or die?
  • Can you tell which agent or which edge caused a bad result, from the traces alone?

Anti-patterns

| Anti-pattern | Why it fails | Do instead |
|---|---|---|
| Multi-agent where one agent fits | Pays the coordination tax for breadth you never needed | Default to a single agent; go multi only when Step 2 holds |
| Unspecified handoff contract | Each agent works, the system still produces garbage at the seam | Pin payload in and result out for every edge before tuning agents |
| Only per-agent caps | A parent re-dispatches workers and blows the system budget while each cap holds | One owner of global termination and total spend |
| Passing the full transcript downstream | Context and cost balloon as every agent carries every other's history | Pass results, not transcripts; keep shared state small |
| Shared state with many writers | Agents overwrite each other and the corruption is invisible | One named writer per field; others read only |
| A worker error crashes the run | One failure kills work the system could have routed around | Return errors to the orchestrator as results to handle |

Output location

Present the orchestration design as formatted text in the conversation for the user to copy into their design doc.