Production Pipeline: From Working Demo to Reliable System

Agents in Production

A working demo is not a production system. The gap between "it works on my laptop" and "it handles 50 sales reps reliably every day" is monitoring, evaluation, staged rollout, cost management, and operational runbooks. This module covers the infrastructure that makes agents trustworthy at scale.

Monitoring: What to Measure

You can't improve what you don't measure. Four categories of metrics matter for production agents:

Tool Metrics

| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Success rate per tool | Is a tool broken or degraded? | < 95% over 5 min |
| Latency p50 / p95 / p99 | Is a tool getting slow? | p95 > 5s |
| Calls per session | Is the agent looping? | > 15 calls in one session |
| Error distribution | What's failing and why? | Any new error type |
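
Most of these signals can be captured with a thin wrapper around tool execution. A minimal sketch, assuming a StatsD-style metrics client (the `MetricsClient` interface here is a stand-in for whatever you actually use):

type ToolFn = (args: Record<string, unknown>) => Promise<unknown>;

// Stand-in for a StatsD-style metrics client; swap in your own.
interface MetricsClient {
  increment(name: string, tags?: Record<string, string>): void;
  timing(name: string, ms: number, tags?: Record<string, string>): void;
}

function instrumentTool(name: string, fn: ToolFn, metrics: MetricsClient): ToolFn {
  return async (args) => {
    const start = Date.now();
    try {
      const result = await fn(args);
      metrics.increment("tool.success", { tool: name });
      return result;
    } catch (err) {
      // Tag errors by type so a new error class stands out on the dashboard.
      const errorType = err instanceof Error ? err.name : "unknown";
      metrics.increment("tool.error", { tool: name, error_type: errorType });
      throw err;
    } finally {
      // Latency is recorded on success and failure alike.
      metrics.timing("tool.latency", Date.now() - start, { tool: name });
    }
  };
}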

Agent Metrics

| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Task completion rate | Does the agent accomplish what users ask? | < 80% |
| Turns to completion | How efficient is the agent? | Avg > 6 turns and rising |
| Approval rate | Are guardrails too tight or too loose? | < 60% (too tight) or > 98% (too loose) |
| Fallback rate | How often does the agent say "I can't help"? | > 15% |

Cost Metrics

| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Cost per session | Are conversations getting expensive? | > $0.50 per session |
| Cost per user per day | Is any user an outlier? | > $5/day |
| Token usage (input/output) | Are prompts bloated? | Input > 50K tokens avg |
| Tool API costs | External API spend | Daily spend > 2x baseline |

User Metrics

| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Daily active users | Is the agent being adopted? | Declining week-over-week |
| Session length | Are users engaged or frustrated? | Avg < 1 turn (abandoned) |
| Repeat usage | Do users come back? | < 30% weekly retention |
| Thumbs up/down | Direct quality signal | Negative rate > 20% |
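
These thresholds are starting points, not gospel. One way to keep them tunable is to encode them as data rather than hard-coding them into alert logic; a sketch, with rule names and numbers illustrative (taken from the tables above):

interface AlertRule {
  metric: string;
  breached: (value: number) => boolean;
  message: string;
}

// Illustrative rules encoding a few thresholds from the tables above.
const alertRules: AlertRule[] = [
  { metric: "tool.success_rate", breached: (v) => v < 0.95, message: "Tool success rate below 95%" },
  { metric: "agent.fallback_rate", breached: (v) => v > 0.15, message: "Fallback rate above 15%" },
  { metric: "cost.per_session", breached: (v) => v > 0.5, message: "Avg session cost above $0.50" },
];

function checkAlerts(current: Record<string, number>): string[] {
  return alertRules
    .filter((rule) => rule.metric in current && rule.breached(current[rule.metric]))
    .map((rule) => rule.message);
}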

Evaluation: Automated Testing for Agents

Unit tests check tools. Evaluations check the *agent*. Define test scenarios with expected behaviors:

interface AgentEvalScenario {
  name: string;
  input: string;
  expected_tools: string[];           // Tools the agent should call
  expected_tool_sequence?: string[];  // Optional: specific order matters
  expected_output_contains: string[]; // Key phrases in the response
  max_turns: number;                  // Efficiency bound
  max_cost: number;                   // Cost bound
}

const evalScenarios: AgentEvalScenario[] = [
  {
    name: "basic_account_lookup",
    input: "Tell me about the Globex account",
    expected_tools: ["search_crm_contacts", "get_deal_history"],
    expected_output_contains: ["Globex", "ARR", "renewal"],
    max_turns: 4,
    max_cost: 0.10,
  },
  {
    name: "email_draft_with_approval",
    input: "Draft a renewal proposal email to Jane at Globex",
    expected_tools: ["search_crm_contacts", "get_deal_history", "draft_email"],
    expected_output_contains: ["approval", "send"],
    max_turns: 6,
    max_cost: 0.15,
  },
];

Run evaluations on every deploy. Track scores over time. A model upgrade that improves general quality but breaks your specific workflows is a regression, not an improvement.
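
A scenario runner only needs to compare what the agent actually did against these expectations. A minimal sketch, assuming a hypothetical `runAgent` harness that reports tools called, final output, turn count, and cost:

interface AgentRunResult {
  toolsCalled: string[];
  output: string;
  turns: number;
  cost: number;
}

// Hypothetical harness entry point; wire this to your actual agent loop.
declare function runAgent(input: string): Promise<AgentRunResult>;

async function runScenario(scenario: AgentEvalScenario): Promise<string[]> {
  const result = await runAgent(scenario.input);
  const failures: string[] = [];

  for (const tool of scenario.expected_tools) {
    if (!result.toolsCalled.includes(tool)) failures.push(`missing tool: ${tool}`);
  }
  for (const phrase of scenario.expected_output_contains) {
    if (!result.output.includes(phrase)) failures.push(`missing phrase: "${phrase}"`);
  }
  if (result.turns > scenario.max_turns) failures.push(`too many turns: ${result.turns}`);
  if (result.cost > scenario.max_cost) failures.push(`over budget: $${result.cost.toFixed(2)}`);

  return failures; // Empty array means the scenario passed.
}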

Staged Autonomy

Don't flip the switch from "demo" to "fully autonomous." Ramp up trust gradually:

Stage 1: Shadow Mode

The agent runs alongside human workflows but takes no actions. It *suggests* what it would do. Humans compare its suggestions to their actual decisions. This builds a dataset of aligned vs. misaligned actions.
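
Shadow mode is mostly a logging exercise. A sketch of the comparison record it might produce (field names are illustrative):

interface ShadowRecord {
  sessionId: string;
  userRequest: string;
  agentSuggestion: string;  // What the agent would have done
  humanAction: string;      // What the human actually did
  aligned: boolean;         // Did they match? Human-labeled or heuristic
  timestamp: string;
}

function logShadowComparison(record: ShadowRecord): void {
  // Append-only log; this dataset later justifies (or delays) moving to Stage 2.
  console.log(JSON.stringify(record));
}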

Stage 2: Approval-Required

The agent can take actions, but every medium-risk and above action requires human approval. This is where Module 4's guardrails earn their keep. Track approval rates to identify tools where the agent is reliably correct.

Stage 3: Semi-Autonomous

Low-risk actions execute automatically; medium- and high-risk actions still require approval. Gradually promote tools from "approval-required" to "auto-approved" based on approval rate history:

interface ApprovalHistory {
  totalDecisions: number;
  approvalRate: number;
  lastRejection: string | null; // Most recent rejection timestamp, if any
}

declare function getApprovalHistory(toolName: string, opts: { days: number }): ApprovalHistory;

function shouldAutoApprove(toolName: string): boolean {
  const history = getApprovalHistory(toolName, { days: 30 });
  return (
    history.totalDecisions >= 100 &&  // Enough volume to trust the signal
    history.approvalRate >= 0.97 &&
    history.lastRejection === null    // No rejections in the 30-day window
  );
}

Stage 4: Autonomous

The agent handles most tasks independently. Humans review a random sample of actions for quality (audit sampling). The agent's scope of autonomy is defined by policy, not by technical limitation.
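
Audit sampling can be as simple as flagging a random fraction of autonomous actions for later human review. A sketch, with an illustrative 5% rate and a hypothetical `enqueueForReview` queue:

const AUDIT_SAMPLE_RATE = 0.05; // Review roughly 5% of autonomous actions

// Hypothetical review queue; in practice, a ticket or a review dashboard entry.
declare function enqueueForReview(description: string): void;

async function executeAutonomously(action: () => Promise<void>, description: string): Promise<void> {
  await action();
  if (Math.random() < AUDIT_SAMPLE_RATE) {
    // The action has already run; review here is asynchronous quality control.
    enqueueForReview(description);
  }
}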

The Feedback Flywheel

Production agents get better over time — but only if you build the feedback loop:

User asks a question
        │
        ▼
Agent plans and acts
        │
        ▼
Human approves / rejects / edits
        │
        ▼
Outcome logged with context
        │
        ▼
Rejected actions → identify planning failures
Edited actions → improve tool parameters
Approved actions → positive training examples
        │
        ▼
Update prompts, tool descriptions, guardrails
        │
        ▼
Agent improves → earns more autonomy

Every rejection is a data point. "User rejected email because it was too formal" tells you to adjust the tone prompt. "User edited the CRM update to add a tag" tells you the agent should learn to tag automatically. The flywheel turns human oversight into agent improvement.
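
For the flywheel to turn, each approval decision has to be logged with enough context to act on later. A sketch of what such a record might look like (fields are illustrative), plus a query that surfaces the tools to fix first:

type Decision = "approved" | "rejected" | "edited";

interface FeedbackRecord {
  sessionId: string;
  toolName: string;
  proposedArgs: Record<string, unknown>; // What the agent wanted to do
  finalArgs?: Record<string, unknown>;   // What actually ran, if the human edited it
  decision: Decision;
  reason?: string;                       // Free-text from the reviewer, if given
  timestamp: string;
}

function rejectionCounts(records: FeedbackRecord[]): Map<string, number> {
  // Rejections per tool: the tools to fix first.
  const counts = new Map<string, number>();
  for (const r of records) {
    if (r.decision === "rejected") {
      counts.set(r.toolName, (counts.get(r.toolName) ?? 0) + 1);
    }
  }
  return counts;
}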

Cost Optimization

AI agents can get expensive fast. Three levers to control costs:

Model Selection

Not every step needs the best model. Use a tiered approach:

| Task | Model Tier | Why |
|---|---|---|
| Planning (which tools to call) | Large (Claude Sonnet) | Needs strong reasoning |
| Tool argument generation | Medium (Haiku) | Structured output, simpler |
| Response synthesis | Large (Claude Sonnet) | User-facing quality matters |
| Classification / routing | Small (Haiku) | Fast, cheap, sufficient |
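
In code, tiering reduces to a lookup from task type to model. A minimal sketch; the model identifiers are placeholders, so substitute the IDs your provider actually documents:

type TaskType = "planning" | "tool_args" | "synthesis" | "routing";

// Placeholder model IDs; substitute the identifiers your provider documents.
const MODEL_TIERS: Record<TaskType, string> = {
  planning: "large-model",
  tool_args: "medium-model",
  synthesis: "large-model",
  routing: "small-model",
};

function modelFor(task: TaskType): string {
  return MODEL_TIERS[task];
}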

Caching

Cache tool results that don't change frequently. A contact's deal history from 5 minutes ago is still valid. Set TTLs per tool:

const CACHE_TTL: Record<string, number> = {
  search_contacts: 5 * 60,        // 5 min — contacts don't change often
  get_deal_history: 15 * 60,      // 15 min — deals update slowly
  get_product_catalog: 60 * 60,   // 1 hour — products rarely change
  search_emails: 0,               // never cache — always fresh
};
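
A wrapper that consults this table before executing a tool might look like the following sketch. The in-memory `Map` is for illustration; a production deployment would more likely use Redis or another shared cache:

interface CacheEntry {
  value: unknown;
  expiresAt: number; // Epoch milliseconds
}

const cache = new Map<string, CacheEntry>();

async function cachedToolCall(
  toolName: string,
  args: Record<string, unknown>,
  execute: () => Promise<unknown>
): Promise<unknown> {
  const ttlSeconds = CACHE_TTL[toolName] ?? 0;
  if (ttlSeconds === 0) return execute(); // TTL of 0 means always fetch fresh

  const key = `${toolName}:${JSON.stringify(args)}`;
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value;

  const value = await execute();
  cache.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  return value;
}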

Batching

When the agent needs to look up 5 contacts, batch the queries instead of making 5 separate tool calls. Design tools with batch support from the start.
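
The difference shows up in the tool signature. A sketch contrasting the two designs (tool and type names are hypothetical):

interface Contact {
  id: string;
  name: string;
  email: string;
}

// Batch-first design: N contacts cost one tool call and one agent turn.
interface BatchContactTool {
  name: "get_contacts";
  execute(args: { contact_ids: string[] }): Promise<Contact[]>;
}

// Single-lookup design: N contacts cost N tool calls and N turns of token overhead.
interface SingleContactTool {
  name: "get_contact";
  execute(args: { contact_id: string }): Promise<Contact>;
}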

Operations Runbook

When things break (they will), have a playbook:

| Symptom | Likely Cause | Response |
|---|---|---|
| All tool calls failing | External API down | Check API status pages, activate fallback tools |
| Agent looping (> 15 tool calls) | Ambiguous user request or stale context | Increase max-turns alert, review loop examples |
| Approval rate drops below 60% | Model regression or prompt drift | Roll back to previous prompt version |
| Cost spike (> 3x daily baseline) | Looping, new user spike, or verbose model | Check per-user spend, throttle if needed |
| Users reporting wrong answers | Tool returning stale data | Check cache TTLs, verify data source freshness |
| Latency spike (p95 > 10s) | Slow tool or model provider | Check per-tool latency, switch to faster model tier |
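
As one example, the cost-spike response usually ends in a per-user budget check. A sketch, with an illustrative $5/day limit matching the cost table above and a hypothetical spend lookup:

const DAILY_USER_BUDGET_USD = 5;

// Hypothetical spend lookup, backed by your cost-metrics store.
declare function getUserSpendToday(userId: string): number;

function checkBudget(userId: string): { allowed: boolean; reason?: string } {
  const spent = getUserSpendToday(userId);
  if (spent >= DAILY_USER_BUDGET_USD) {
    return { allowed: false, reason: `Daily budget of $${DAILY_USER_BUDGET_USD} exhausted` };
  }
  return { allowed: true };
}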

The Four Pillars

Every production agent rests on four pillars. If any one is weak, the system fails:

  • Observability — You can see what the agent is doing, why, and how well (monitoring + audit logs)
  • Safety — The agent can't cause irreversible harm without human approval (guardrails + approval gates)
  • Economics — The agent costs less than the human effort it replaces (cost caps + model tiering + caching)
  • Flywheel — The agent gets better over time from human feedback (staged autonomy + feedback loop)

In your capstone, you'll instrument your Sales Companion with all four pillars — metrics dashboards, evaluation suites, staged rollout configuration, and the feedback loop that turns every user interaction into an improvement opportunity. This is what separates a demo from a product.

This is chapter 6 of Production AI Agents.
