
Production Pipeline

From Demo to Enterprise Deployment

What Production Means for Multi-Agent Systems

A working demo and a production system are separated by three gaps: reliability (what happens when things fail), observability (can you see what's happening), and cost control (can you afford to run it).

Multi-agent systems multiply all three problems. A single-agent system has one failure mode (the agent fails) and one cost center (one API call). A four-agent system has combinatorial failure modes (any agent can fail, any handoff can break, any consensus round can deadlock) and four cost centers that need independent tracking.

Agent Performance Metrics

You can't improve what you can't measure. Track per-agent metrics across every orchestration:

Metric            What It Tells You            Alert Threshold
Success rate      Is the agent reliable?       < 80% over 5 minutes
Avg confidence    Is routing accurate?         < 0.5 sustained
P99 latency       Is the agent fast enough?    > 60 seconds
Token usage       Is the agent efficient?      > 2x historical average
Cost per task     Is the agent affordable?     > budget / expected tasks

The metrics collector records every SupervisorState and computes these across time windows. Export to your observability stack (Prometheus + Grafana, Datadog, or a PostgreSQL table with a dashboard).
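A minimal sketch of such a collector, assuming a simplified per-call record rather than the full SupervisorState (the field names and class shape here are illustrative):

    // Illustrative per-call record derived from a SupervisorState snapshot.
    interface AgentCallRecord {
      agent: string;
      success: boolean;
      confidence: number; // router confidence, 0..1
      latencyMs: number;
      tokens: number;
      costUsd: number;
    }

    class MetricsCollector {
      private window: AgentCallRecord[] = [];

      record(r: AgentCallRecord): void {
        this.window.push(r);
      }

      // Per-agent success rate over the current window.
      successRate(agent: string): number {
        const runs = this.window.filter((r) => r.agent === agent);
        return runs.length === 0 ? 1 : runs.filter((r) => r.success).length / runs.length;
      }

      // Alert check matching the first row of the table above.
      shouldAlert(agent: string): boolean {
        return this.successRate(agent) < 0.8;
      }
    }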

Correlation matters. If the researcher's success rate drops from 95% to 70% while the analyst's stays at 90%, the problem is in the researcher — maybe its web_search tool is hitting rate limits, or its system prompt can't handle a new type of query. If ALL agents drop simultaneously, the problem is systemic — network issues, API outages, or a bad deployment.

Circuit Breakers

The circuit breaker pattern prevents cascading failures:

CLOSED (normal)
  ↓ 3 consecutive failures
OPEN (rejecting all calls)
  ↓ 30 seconds timeout
HALF_OPEN (testing recovery)
  ↓ 2 consecutive successes → CLOSED
  ↓ 1 failure → OPEN (reset timeout)

Create circuit breakers at two levels:

Agent-level breakers — If the coder agent fails 3 times in a row, stop routing tasks to it. The orchestrator skips coder tasks or routes to a fallback. Other agents continue unaffected.

Tool-level breakers — If the web_search tool fails 3 times, only the agents that depend on it are affected. The researcher loses web_search but can still use document_reader. The writer and analyst (which don't use web_search) are unaffected.

The key behavior: when a circuit is OPEN, the orchestrator doesn't retry — it immediately knows the component is down and works around it. This prevents the "retry storm" where a failed component gets hammered with retries from all four agents simultaneously.
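A minimal sketch of such a breaker, using the thresholds from the state diagram above (the class shape is illustrative; you would instantiate one per agent and one per tool):

    type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

    class CircuitBreaker {
      private state: BreakerState = "CLOSED";
      private failures = 0;
      private successes = 0;
      private openedAt = 0;

      constructor(
        private failureThreshold = 3, // consecutive failures to open
        private recoveryThreshold = 2, // consecutive successes to close
        private timeoutMs = 30_000 // how long OPEN rejects calls
      ) {}

      // The orchestrator checks this before routing to the agent or tool.
      allowCall(): boolean {
        if (this.state === "OPEN") {
          if (Date.now() - this.openedAt >= this.timeoutMs) {
            this.state = "HALF_OPEN"; // timeout elapsed: test recovery
            return true;
          }
          return false; // fail fast, no retry storm
        }
        return true;
      }

      recordSuccess(): void {
        if (this.state === "HALF_OPEN" && ++this.successes >= this.recoveryThreshold) {
          this.state = "CLOSED";
          this.successes = 0;
        }
        this.failures = 0; // any success resets the consecutive-failure count
      }

      recordFailure(): void {
        this.successes = 0;
        if (this.state === "HALF_OPEN" || ++this.failures >= this.failureThreshold) {
          this.state = "OPEN"; // one failure in HALF_OPEN re-opens immediately
          this.openedAt = Date.now(); // resets the timeout
          this.failures = 0;
        }
      }
    }

The orchestrator consults allowCall() before each routing decision; a false return means skip or reroute, never retry.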

Cost Allocation

Multi-agent orchestrations are expensive. Each agent call is a separate API call with its own token usage. A complex query that touches all four agents might cost $0.15-$0.50 in API calls. Multiply by hundreds of queries per day and you need precise cost tracking.

The cost allocator tracks:

  • Per-call costs — input/output tokens, model used, calculated USD
  • Per-agent totals — which agent is most expensive
  • Per-conversation totals — how much did this user's session cost
  • Budget enforcement — stop orchestration when budget is exceeded
Enterprise billing requires this granularity. A customer with a $50/month plan needs hard budget enforcement — when the budget is exhausted, the orchestrator gracefully degrades (simpler queries, fewer agents, cheaper models) rather than generating a surprise bill.

    // Set budget and check before each agent call
    costAllocator.setBudget("conv-123", 0.50); // $0.50 max for this conversation
    
    // Before routing to agent
    if (costAllocator.isOverBudget("conv-123")) {
      // Graceful degradation: use a single agent instead of four
      return singleAgentFallback(query);
    }
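A minimal sketch of a cost allocator behind those calls, adding per-call recording (the model names and per-1K-token prices are illustrative assumptions, not real rates):

    // Illustrative prices in USD per 1K tokens. Use your provider's real rates.
    const PRICE_PER_1K: Record<string, { input: number; output: number }> = {
      "small-model": { input: 0.0005, output: 0.0015 },
      "large-model": { input: 0.01, output: 0.03 },
    };

    class CostAllocator {
      private budgets = new Map<string, number>(); // conversation -> USD cap
      private spent = new Map<string, number>(); // conversation -> USD used
      private byAgent = new Map<string, number>(); // agent -> USD used

      setBudget(conversationId: string, usd: number): void {
        this.budgets.set(conversationId, usd);
      }

      // Record one agent call and return its cost in USD.
      recordCall(
        conversationId: string,
        agent: string,
        model: string,
        inputTokens: number,
        outputTokens: number
      ): number {
        const price = PRICE_PER_1K[model];
        const cost =
          (inputTokens / 1000) * price.input + (outputTokens / 1000) * price.output;
        this.spent.set(conversationId, (this.spent.get(conversationId) ?? 0) + cost);
        this.byAgent.set(agent, (this.byAgent.get(agent) ?? 0) + cost);
        return cost;
      }

      isOverBudget(conversationId: string): boolean {
        const budget = this.budgets.get(conversationId);
        return budget !== undefined && (this.spent.get(conversationId) ?? 0) >= budget;
      }
    }

Calling recordCall after every agent response keeps per-agent and per-conversation totals current, so the isOverBudget check in the earlier snippet stays accurate.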

Staged Autonomy

The path from demo to production has three stages:

Stage 1: Shadow Mode

Agents process queries and produce results, but nothing is delivered to end users automatically. A human reviewer sees every output and decides what to use. Every orchestration is free quality data — you're building an evaluation dataset from real queries.

Entry criteria: System is deployed

Exit criteria: 100+ reviewed orchestrations, 85%+ approval rate

Stage 2: Approval Mode

The orchestrator produces results and presents them to users with Approve / Revise / Reject controls. Low-risk outputs (research summaries) can auto-approve above a confidence threshold. High-risk outputs (customer-facing memos) always require explicit approval.
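A sketch of that gate, assuming each output carries a risk flag and a confidence score (the shape and the 0.9 threshold are illustrative assumptions):

    // Illustrative output shape; the real one depends on your orchestrator.
    interface OrchestratorOutput {
      risk: "low" | "high"; // e.g. research summary vs. customer-facing memo
      confidence: number; // 0..1
      content: string;
    }

    const AUTO_APPROVE_THRESHOLD = 0.9; // assumed, tune per task type

    function routeOutput(output: OrchestratorOutput): "deliver" | "human_review" {
      // Low-risk outputs above the confidence bar skip the human gate.
      if (output.risk === "low" && output.confidence >= AUTO_APPROVE_THRESHOLD) {
        return "deliver";
      }
      // Everything else gets Approve / Revise / Reject controls.
      return "human_review";
    }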

Entry criteria: Shadow mode approval rate > 85%

Exit criteria: 500+ orchestrations, 95%+ auto-approval rate for low-risk tasks

Stage 3: Autonomous with Monitoring

Agents execute and deliver without human gates. Circuit breakers, cost allocators, and metrics provide the safety net. Any quality degradation triggers automatic fallback to Approval Mode.

Entry criteria: Approval mode auto-approval rate > 95%

Fallback trigger: Success rate drops below 80% OR confidence drops below 0.6
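That trigger can be a direct check against the rolling metrics; a minimal sketch, assuming the rates are fed from the metrics collector described earlier:

    type AutonomyStage = "shadow" | "approval" | "autonomous";

    // successRate and avgConfidence are rolling values from the metrics collector.
    function checkFallback(
      stage: AutonomyStage,
      successRate: number,
      avgConfidence: number
    ): AutonomyStage {
      if (stage === "autonomous" && (successRate < 0.8 || avgConfidence < 0.6)) {
        return "approval"; // quality dropped: reinstate the human gate
      }
      return stage;
    }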

This model is both an engineering pattern and a sales tool. Enterprise prospects ask: "What if the AI sends a bad email?" Your answer: "It can't. You start in approval mode. The AI drafts; you approve. When you've seen 500 drafts and approved 95% without changes, you unlock autonomous mode. And even then, circuit breakers catch quality drops."

Evaluation Harness

Staged autonomy requires quantitative evaluation. The evaluation harness runs test scenarios through the orchestrator and measures:

  • Task completion rate — Did all sub-tasks complete successfully?
  • Output quality — Do quality checks pass? Do automated metrics (ROUGE for writing, citation accuracy for research) meet thresholds?
  • Cost efficiency — Is the multi-agent approach cheaper than a single large model call?
  • Latency — Is end-to-end time acceptable for the use case?
Run the eval harness on every deployment. If a code change causes the researcher's citation accuracy to drop from 95% to 80%, catch it before production. This is the same principle as unit tests — but for agent quality, as sketched below.
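A minimal sketch of that deployment gate (the scenario and result shapes, and the thresholds, are illustrative assumptions):

    // Illustrative shapes for scenarios and per-run results.
    interface EvalScenario { query: string; expectedAgents: string[]; }
    interface EvalResult {
      completed: boolean;
      qualityScore: number; // e.g. ROUGE or citation accuracy, 0..1
      costUsd: number;
      latencyMs: number;
    }

    async function evalGate(
      scenarios: EvalScenario[],
      run: (s: EvalScenario) => Promise<EvalResult>
    ): Promise<boolean> {
      if (scenarios.length === 0) return false; // no data, no deploy
      const results = await Promise.all(scenarios.map(run));
      const completionRate = results.filter((r) => r.completed).length / results.length;
      const avgQuality = results.reduce((sum, r) => sum + r.qualityScore, 0) / results.length;
      // Block the deployment if either metric regresses below its threshold.
      return completionRate >= 0.95 && avgQuality >= 0.85;
    }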

The Feedback Flywheel

Production data is your best training data. The flywheel:

  • Orchestration runs → log full SupervisorState
  • Human reviews → approve/reject/revise each agent output
  • Revision data → extract "before" (agent output) and "after" (human revision) pairs
  • Improve agents → update system prompts, adjust routing weights, fine-tune if volume warrants
  • Measure improvement → eval harness confirms quality increased
  • Repeat
The agents that get the most human revisions are the ones that need the most improvement. The router's accuracy improves as you learn which tasks each agent handles well. The decomposer gets better as you see which task structures produce the best results. A sketch of extracting those revision pairs follows below.
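A sketch of the revision-pair extraction step (the record shapes are assumed, not a real schema):

    // Assumed review record, one per reviewed agent output.
    interface ReviewRecord {
      agent: string;
      output: string; // what the agent produced
      decision: "approve" | "reject" | "revise";
      revision?: string; // the human-edited version, when revised
    }

    interface TrainingPair { agent: string; before: string; after: string; }

    function extractPairs(reviews: ReviewRecord[]): TrainingPair[] {
      return reviews
        .filter((r) => r.decision === "revise" && r.revision !== undefined)
        .map((r) => ({ agent: r.agent, before: r.output, after: r.revision! }));
    }

    // Revision counts per agent show where to focus prompt or routing work.
    function revisionCounts(reviews: ReviewRecord[]): Map<string, number> {
      const counts = new Map<string, number>();
      for (const r of reviews) {
        if (r.decision === "revise") {
          counts.set(r.agent, (counts.get(r.agent) ?? 0) + 1);
        }
      }
      return counts;
    }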

This is why staged autonomy matters beyond trust-building — it generates the feedback data you need to actually improve the system.

This is chapter 6 of Multi-Agent Orchestration.

Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.
