Production Pipeline
From Working Demo to Reliable System
Agents in Production
A working demo is not a production system. The gap between "it works on my laptop" and "it handles 50 sales reps reliably every day" is monitoring, evaluation, staged rollout, cost management, and operational runbooks. This module covers the infrastructure that makes agents trustworthy at scale.
Monitoring: What to Measure
You can't improve what you don't measure. Four categories of metrics matter for production agents:
Tool Metrics
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| **Success rate** per tool | Is a tool broken or degraded? | < 95% over 5 min |
| Latency p50 / p95 / p99 | Is a tool getting slow? | p95 > 5s |
| Calls per session | Is the agent looping? | > 15 calls in one session |
| Error distribution | What's failing and why? | Any new error type |
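A minimal sketch of how these tool metrics can be captured at the call site: wrap every tool invocation and record success, error type, and latency. The `MetricsClient` here just logs to the console; swap in whatever telemetry SDK you actually use.

```typescript
interface MetricsClient {
  increment(name: string, tags?: Record<string, string>): void;
  observe(name: string, value: number, tags?: Record<string, string>): void;
}

// Illustrative stand-in for a real telemetry client.
const metrics: MetricsClient = {
  increment: (name, tags) => console.log("incr", name, tags),
  observe: (name, value, tags) => console.log("obs", name, value, tags),
};

async function instrumentedToolCall<T>(
  toolName: string,
  sessionId: string,
  call: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    const result = await call();
    metrics.increment("tool.success", { tool: toolName });
    return result;
  } catch (err) {
    // Tagging failures by error type makes new error classes visible immediately.
    const errorType = err instanceof Error ? err.name : "unknown";
    metrics.increment("tool.error", { tool: toolName, error: errorType });
    throw err;
  } finally {
    // Latency is recorded for successes and failures alike; aggregate p50/p95/p99 downstream.
    metrics.observe("tool.latency_ms", Date.now() - start, { tool: toolName });
    metrics.increment("tool.calls_per_session", { tool: toolName, session: sessionId });
  }
}
```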
Agent Metrics
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Task completion rate | Does the agent accomplish what users ask? | < 80% |
| Turns to completion | How efficient is the agent? | Avg > 6 turns and rising |
| Approval rate | Are guardrails too tight or too loose? | < 60% (too tight) or > 98% (too loose) |
| Fallback rate | How often does the agent say "I can't help"? | > 15% |
Cost Metrics
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Cost per session | Are conversations getting expensive? | > $0.50 per session |
| Cost per user per day | Is any user an outlier? | > $5/day |
| **Token usage** (input/output) | Are prompts bloated? | Input > 50K tokens avg |
| Tool API costs | External API spend | Daily spend > 2x baseline |
User Metrics
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Daily active users | Is the agent being adopted? | Declining week-over-week |
| Session length | Are users engaged or frustrated? | Avg < 1 turn (abandoned) |
| Repeat usage | Do users come back? | < 30% weekly retention |
| Thumbs up/down | Direct quality signal | Negative rate > 20% |
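One way to keep these thresholds reviewable is to encode them as data rather than scattering them across dashboards. A sketch, with illustrative metric names and shapes:

```typescript
type Comparison = "gt" | "lt";

interface AlertRule {
  metric: string;
  comparison: Comparison;
  threshold: number;
  windowMinutes: number;
}

// Illustrative rules mirroring a few of the thresholds in the tables above.
const alertRules: AlertRule[] = [
  { metric: "tool.success_rate", comparison: "lt", threshold: 0.95, windowMinutes: 5 },
  { metric: "tool.latency_p95_ms", comparison: "gt", threshold: 5000, windowMinutes: 5 },
  { metric: "session.cost_usd", comparison: "gt", threshold: 0.5, windowMinutes: 60 },
  { metric: "feedback.negative_rate", comparison: "gt", threshold: 0.2, windowMinutes: 1440 },
];

function shouldAlert(rule: AlertRule, observedValue: number): boolean {
  return rule.comparison === "gt"
    ? observedValue > rule.threshold
    : observedValue < rule.threshold;
}
```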
Evaluation: Automated Testing for Agents
Unit tests check tools. Evaluations check the *agent*. Define test scenarios with expected behaviors:
```typescript
interface AgentEvalScenario {
  name: string;
  input: string;
  expected_tools: string[];           // Tools the agent should call
  expected_tool_sequence?: string[];  // Optional: specific order matters
  expected_output_contains: string[]; // Key phrases in the response
  max_turns: number;                  // Efficiency bound
  max_cost: number;                   // Cost bound
}

const evalScenarios: AgentEvalScenario[] = [
  {
    name: "basic_account_lookup",
    input: "Tell me about the Globex account",
    expected_tools: ["search_crm_contacts", "get_deal_history"],
    expected_output_contains: ["Globex", "ARR", "renewal"],
    max_turns: 4,
    max_cost: 0.10,
  },
  {
    name: "email_draft_with_approval",
    input: "Draft a renewal proposal email to Jane at Globex",
    expected_tools: ["search_crm_contacts", "get_deal_history", "draft_email"],
    expected_output_contains: ["approval", "send"],
    max_turns: 6,
    max_cost: 0.15,
  },
];
```

Run evaluations on every deploy. Track scores over time. A model upgrade that improves general quality but breaks your specific workflows is a regression, not an improvement.
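A sketch of what running these scenarios can look like, assuming a hypothetical `runAgent` harness that executes the agent end-to-end and reports the tools it called, its final response, and turn/cost counters:

```typescript
interface AgentRunResult {
  toolsCalled: string[];
  response: string;
  turns: number;
  cost: number;
}

async function runEvals(
  scenarios: AgentEvalScenario[],
  runAgent: (input: string) => Promise<AgentRunResult>
): Promise<{ name: string; passed: boolean; failures: string[] }[]> {
  const results: { name: string; passed: boolean; failures: string[] }[] = [];

  for (const scenario of scenarios) {
    const run = await runAgent(scenario.input);
    const failures: string[] = [];

    // Every expected tool must have been called at least once.
    for (const tool of scenario.expected_tools) {
      if (!run.toolsCalled.includes(tool)) failures.push(`missing tool: ${tool}`);
    }
    // Key phrases must appear in the final response (case-insensitive).
    for (const phrase of scenario.expected_output_contains) {
      if (!run.response.toLowerCase().includes(phrase.toLowerCase())) {
        failures.push(`missing phrase: ${phrase}`);
      }
    }
    // Efficiency and cost bounds.
    if (run.turns > scenario.max_turns) failures.push(`too many turns: ${run.turns}`);
    if (run.cost > scenario.max_cost) failures.push(`over budget: $${run.cost.toFixed(2)}`);

    results.push({ name: scenario.name, passed: failures.length === 0, failures });
  }
  return results;
}
```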
Staged Autonomy
Don't flip the switch from "demo" to "fully autonomous." Ramp up trust gradually:
Stage 1: Shadow Mode
The agent runs alongside human workflows but takes no actions. It *suggests* what it would do. Humans compare its suggestions to their actual decisions. This builds a dataset of aligned vs. misaligned actions.
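A sketch of what shadow-mode logging might capture; the field names are illustrative, but the core idea is recording the agent's suggestion next to the human's actual decision without executing anything:

```typescript
interface ShadowLogEntry {
  timestamp: string;
  userRequest: string;
  suggestedAction: { tool: string; args: Record<string, unknown> };
  humanAction: { tool: string; args: Record<string, unknown> } | null;
  matched: boolean; // did the human do what the agent would have done?
}

const shadowLog: ShadowLogEntry[] = [];

function recordShadowDecision(
  userRequest: string,
  suggested: ShadowLogEntry["suggestedAction"],
  actual: ShadowLogEntry["humanAction"]
): void {
  shadowLog.push({
    timestamp: new Date().toISOString(),
    userRequest,
    suggestedAction: suggested,
    humanAction: actual,
    matched:
      actual !== null &&
      actual.tool === suggested.tool &&
      JSON.stringify(actual.args) === JSON.stringify(suggested.args),
  });
}
```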
Stage 2: Approval-Required
The agent can take actions, but every medium-risk and above action requires human approval. This is where Module 4's guardrails earn their keep. Track approval rates to identify tools where the agent is reliably correct.
Stage 3: Semi-Autonomous
Low-risk actions execute automatically; medium- and high-risk actions still require approval. Gradually promote tools from "approval-required" to "auto-approved" based on approval rate history:
```typescript
function shouldAutoApprove(toolName: string): boolean {
  const history = getApprovalHistory(toolName, { days: 30 });
  return (
    history.totalDecisions >= 100 &&
    history.approvalRate >= 0.97 &&
    history.lastRejection === null // No rejections in last 30 days
  );
}
```

Stage 4: Autonomous
The agent handles most tasks independently. Humans review a random sample of actions for quality (audit sampling). The agent's scope of autonomy is defined by policy, not by technical limitation.
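Audit sampling can be as simple as flagging a random fraction of autonomous actions for human review. A sketch, with an assumed 5% sample rate:

```typescript
// Illustrative sample rate; tune it to your review capacity and risk tolerance.
const AUDIT_SAMPLE_RATE = 0.05;

interface ExecutedAction {
  id: string;
  tool: string;
  riskLevel: "low" | "medium" | "high";
}

function selectForAudit(action: ExecutedAction): boolean {
  // Assumption: anything above low risk is always reviewed; the rest is sampled.
  if (action.riskLevel !== "low") return true;
  return Math.random() < AUDIT_SAMPLE_RATE;
}
```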
The Feedback Flywheel
Production agents get better over time — but only if you build the feedback loop:
```
User asks a question
        │
        ▼
Agent plans and acts
        │
        ▼
Human approves / rejects / edits
        │
        ▼
Outcome logged with context
        │
        ▼
Rejected actions → identify planning failures
Edited actions   → improve tool parameters
Approved actions → positive training examples
        │
        ▼
Update prompts, tool descriptions, guardrails
        │
        ▼
Agent improves → earns more autonomy
```

Every rejection is a data point. "User rejected email because it was too formal" tells you to adjust the tone prompt. "User edited the CRM update to add a tag" tells you the agent should learn to tag automatically. The flywheel turns human oversight into agent improvement.
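A sketch of the outcome log that feeds this flywheel. The exact shape is an assumption; what matters is capturing the decision, any human edit, and enough context to find recurring failures:

```typescript
type ReviewOutcome = "approved" | "rejected" | "edited";

interface FeedbackRecord {
  sessionId: string;
  tool: string;
  proposedArgs: Record<string, unknown>;
  outcome: ReviewOutcome;
  finalArgs?: Record<string, unknown>; // what the human actually ran, if edited
  reviewerNote?: string;               // e.g. "too formal", "missing tag"
  loggedAt: string;
}

const feedbackLog: FeedbackRecord[] = [];

function logOutcome(record: Omit<FeedbackRecord, "loggedAt">): void {
  feedbackLog.push({ ...record, loggedAt: new Date().toISOString() });
}

// Rejections and edits are the highest-signal records: group them by tool and
// reviewer note to surface recurring planning failures or missing parameters.
function topRejectionReasons(): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of feedbackLog) {
    if (r.outcome === "approved" || !r.reviewerNote) continue;
    const key = `${r.tool}: ${r.reviewerNote}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```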
Cost Optimization
AI agents can get expensive fast. Three levers to control costs:
Model Selection
Not every step needs the best model. Use a tiered approach:
| Task | Model Tier | Why |
|---|---|---|
| Planning (which tools to call) | Large (Claude Sonnet) | Needs strong reasoning |
| Tool argument generation | Medium (Haiku) | Structured output, simpler |
| Response synthesis | Large (Claude Sonnet) | User-facing quality matters |
| Classification / routing | Small (Haiku) | Fast, cheap, sufficient |
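A sketch of routing each step to a tier. The model identifiers are placeholders, not real model IDs; map them to whatever your provider exposes:

```typescript
type StepKind = "planning" | "tool_args" | "synthesis" | "classification";

const MODEL_BY_STEP: Record<StepKind, string> = {
  planning: "large-model-id",       // strong reasoning for tool selection
  tool_args: "medium-model-id",     // structured output, simpler task
  synthesis: "large-model-id",      // user-facing quality matters
  classification: "small-model-id", // fast, cheap, sufficient
};

function modelForStep(step: StepKind): string {
  return MODEL_BY_STEP[step];
}
```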
Caching
Cache tool results that don't change frequently. A contact's deal history from 5 minutes ago is still valid. Set TTLs per tool:
```typescript
const CACHE_TTL: Record<string, number> = {
  search_contacts: 5 * 60,      // 5 min — contacts don't change often
  get_deal_history: 15 * 60,    // 15 min — deals update slowly
  get_product_catalog: 60 * 60, // 1 hour — products rarely change
  search_emails: 0,             // never cache — always fresh
};
```
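A sketch of how those TTLs might be applied: check a cache before calling the tool, and store fresh results with the tool's TTL. The `Map`-based cache and `cachedToolCall` helper are illustrative, not a specific library:

```typescript
interface CacheEntry {
  value: unknown;
  expiresAt: number; // epoch ms
}

const cache = new Map<string, CacheEntry>();

async function cachedToolCall(
  toolName: string,
  args: Record<string, unknown>,
  call: () => Promise<unknown>
): Promise<unknown> {
  const ttlSeconds = CACHE_TTL[toolName] ?? 0;
  const key = `${toolName}:${JSON.stringify(args)}`;

  // Serve from cache only while the entry is still fresh.
  if (ttlSeconds > 0) {
    const hit = cache.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
  }

  const value = await call();
  if (ttlSeconds > 0) {
    cache.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
  return value;
}
```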
Batching
When the agent needs to look up 5 contacts, batch the queries instead of making 5 separate tool calls. Design tools with batch support from the start.
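A sketch of what a batch-capable lookup can look like; `searchContact` is a hypothetical single-record lookup, passed in here only to show the shape of the batching:

```typescript
interface Contact {
  name: string;
  email: string;
  account: string;
}

async function searchContactsBatch(
  names: string[],
  searchContact: (name: string) => Promise<Contact | null>
): Promise<Record<string, Contact | null>> {
  // Run the underlying lookups concurrently and return one keyed result,
  // so the agent spends a single tool call on the whole batch.
  const results = await Promise.all(names.map((n) => searchContact(n)));
  const byName: Record<string, Contact | null> = {};
  names.forEach((n, i) => {
    byName[n] = results[i];
  });
  return byName;
}
```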
Operations Runbook
When things break (they will), have a playbook:
| Symptom | Likely Cause | Response |
|---|---|---|
| All tool calls failing | External API down | Check API status pages, activate fallback tools |
| Agent looping (>15 tool calls) | Ambiguous user request or stale context | Increase max-turns alert, review loop examples |
| Approval rate drops below 60% | Model regression or prompt drift | Roll back to previous prompt version |
| Cost spike (>3x daily baseline) | Looping, new user spike, or verbose model | Check per-user spend, throttle if needed |
| Users reporting wrong answers | Tool returning stale data | Check cache TTLs, verify data source freshness |
| Latency spike (p95 > 10s) | Slow tool or model provider | Check per-tool latency, switch to faster model tier |
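Several of these responses boil down to enforcing budgets per session. A sketch of a guard that operationalizes the looping and cost-spike rows, with limits mirroring the earlier alert thresholds:

```typescript
interface SessionBudget {
  maxToolCalls: number;
  maxCostUsd: number;
}

// Illustrative defaults, aligned with the >15 calls and $0.50/session thresholds above.
const DEFAULT_BUDGET: SessionBudget = { maxToolCalls: 15, maxCostUsd: 0.5 };

function exceedsBudget(
  toolCallsSoFar: number,
  costSoFarUsd: number,
  budget: SessionBudget = DEFAULT_BUDGET
): string | null {
  if (toolCallsSoFar > budget.maxToolCalls) return "tool call budget exceeded";
  if (costSoFarUsd > budget.maxCostUsd) return "cost budget exceeded";
  return null; // within budget, keep going
}
```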
The Four Pillars
Every production agent rests on four pillars: monitoring, evaluation, staged autonomy, and the feedback loop. If any one is weak, the system fails.
In your capstone, you'll instrument your Sales Companion with all four pillars — metrics dashboards, evaluation suites, staged rollout configuration, and the feedback loop that turns every user interaction into an improvement opportunity. This is what separates a demo from a product.
This is chapter 6 of Production AI Agents.
Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.