Production Pipeline
From Working Demo to Reliable System
Agents in Production
A working demo is not a production system. The gap between "it works on my laptop" and "it handles 50 sales reps reliably every day" is monitoring, evaluation, staged rollout, cost management, and operational runbooks. This module covers the infrastructure that makes agents trustworthy at scale.
Monitoring: What to Measure
You can't improve what you don't measure. Four categories of metrics matter for production agents:
Tool Metrics
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| **Success rate** per tool | Is a tool broken or degraded? | < 95% over 5 min |
| Latency p50 / p95 / p99 | Is a tool getting slow? | p95 > 5s |
| Calls per session | Is the agent looping? | > 15 calls in one session |
| Error distribution | What's failing and why? | Any new error type |
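A minimal sketch of how these tool metrics can be captured at the call site: wrap every tool invocation and record success, error type, and latency. The `MetricsClient` here just logs to the console; swap in whatever telemetry SDK you actually use.

```typescript
interface MetricsClient {
  increment(name: string, tags?: Record<string, string>): void;
  observe(name: string, value: number, tags?: Record<string, string>): void;
}

// Illustrative stand-in for a real telemetry client.
const metrics: MetricsClient = {
  increment: (name, tags) => console.log("incr", name, tags),
  observe: (name, value, tags) => console.log("obs", name, value, tags),
};

async function instrumentedToolCall<T>(
  toolName: string,
  sessionId: string,
  call: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    const result = await call();
    metrics.increment("tool.success", { tool: toolName });
    return result;
  } catch (err) {
    // Tagging failures by error type makes new error classes visible immediately.
    const errorType = err instanceof Error ? err.name : "unknown";
    metrics.increment("tool.error", { tool: toolName, error: errorType });
    throw err;
  } finally {
    // Latency is recorded for successes and failures alike; aggregate p50/p95/p99 downstream.
    metrics.observe("tool.latency_ms", Date.now() - start, { tool: toolName });
    metrics.increment("tool.calls_per_session", { tool: toolName, session: sessionId });
  }
}
```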
Agent Metrics
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Task completion rate | Does the agent accomplish what users ask? | < 80% |
| Turns to completion | How efficient is the agent? | Avg > 6 turns and rising |
| Approval rate | Are guardrails too tight or too loose? | < 60% (too tight) or > 98% (too loose) |
| Fallback rate | How often does the agent say "I can't help"? | > 15% |
Cost Metrics
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Cost per session | Are conversations getting expensive? | > $0.50 per session |
| Cost per user per day | Is any user an outlier? | > $5/day |
| **Token usage** (input/output) | Are prompts bloated? | Input > 50K tokens avg |
| Tool API costs | External API spend | Daily spend > 2x baseline |
User Metrics
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Daily active users | Is the agent being adopted? | Declining week-over-week |
| Session length | Are users engaged or frustrated? | Avg < 1 turn (abandoned) |
| Repeat usage | Do users come back? | < 30% weekly retention |
| Thumbs up/down | Direct quality signal | Negative rate > 20% |
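One way to keep these thresholds reviewable is to encode them as data rather than scattering them across dashboards. A sketch, with illustrative metric names and shapes:

```typescript
type Comparison = "gt" | "lt";

interface AlertRule {
  metric: string;
  comparison: Comparison;
  threshold: number;
  windowMinutes: number;
}

// Illustrative rules mirroring a few of the thresholds in the tables above.
const alertRules: AlertRule[] = [
  { metric: "tool.success_rate", comparison: "lt", threshold: 0.95, windowMinutes: 5 },
  { metric: "tool.latency_p95_ms", comparison: "gt", threshold: 5000, windowMinutes: 5 },
  { metric: "session.cost_usd", comparison: "gt", threshold: 0.5, windowMinutes: 60 },
  { metric: "feedback.negative_rate", comparison: "gt", threshold: 0.2, windowMinutes: 1440 },
];

function shouldAlert(rule: AlertRule, observedValue: number): boolean {
  return rule.comparison === "gt"
    ? observedValue > rule.threshold
    : observedValue < rule.threshold;
}
```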
Evaluation: Automated Testing for Agents
Unit tests check tools. Evaluations check the *agent*. Define test scenarios with expected behaviors:
```typescript
interface AgentEvalScenario {
  name: string;
  input: string;
  expected_tools: string[];           // Tools the agent should call
  expected_tool_sequence?: string[];  // Optional: specific order matters
  expected_output_contains: string[]; // Key phrases in the response
  max_turns: number;                  // Efficiency bound
  max_cost: number;                   // Cost bound
}

const evalScenarios: AgentEvalScenario[] = [
  {
    name: "basic_account_lookup",
    input: "Tell me about the Globex account",
    expected_tools: ["search_crm_contacts", "get_deal_history"],
    expected_output_contains: ["Globex", "ARR", "renewal"],
    max_turns: 4,
    max_cost: 0.10,
  },
  {
    name: "email_draft_with_approval",
    input: "Draft a renewal proposal email to Jane at Globex",
    expected_tools: ["search_crm_contacts", "get_deal_history", "draft_email"],
    expected_output_contains: ["approval", "send"],
    max_turns: 6,
    max_cost: 0.15,
  },
];
```

Run evaluations on every deploy. Track scores over time. A model upgrade that improves general quality but breaks your specific workflows is a regression, not an improvement.
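A sketch of what running these scenarios can look like, assuming a hypothetical `runAgent` harness that executes the agent end-to-end and reports the tools it called, its final response, and turn/cost counters:

```typescript
interface AgentRunResult {
  toolsCalled: string[];
  response: string;
  turns: number;
  cost: number;
}

async function runEvals(
  scenarios: AgentEvalScenario[],
  runAgent: (input: string) => Promise<AgentRunResult>
): Promise<{ name: string; passed: boolean; failures: string[] }[]> {
  const results: { name: string; passed: boolean; failures: string[] }[] = [];

  for (const scenario of scenarios) {
    const run = await runAgent(scenario.input);
    const failures: string[] = [];

    // Every expected tool must have been called at least once.
    for (const tool of scenario.expected_tools) {
      if (!run.toolsCalled.includes(tool)) failures.push(`missing tool: ${tool}`);
    }
    // Key phrases must appear in the final response (case-insensitive).
    for (const phrase of scenario.expected_output_contains) {
      if (!run.response.toLowerCase().includes(phrase.toLowerCase())) {
        failures.push(`missing phrase: ${phrase}`);
      }
    }
    // Efficiency and cost bounds.
    if (run.turns > scenario.max_turns) failures.push(`too many turns: ${run.turns}`);
    if (run.cost > scenario.max_cost) failures.push(`over budget: $${run.cost.toFixed(2)}`);

    results.push({ name: scenario.name, passed: failures.length === 0, failures });
  }
  return results;
}
```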
Staged Autonomy
Don't flip the switch from "demo" to "fully autonomous." Ramp up trust gradually:
Stage 1: Shadow Mode
The agent runs alongside human workflows but takes no actions. It *suggests* what it would do. Humans compare its suggestions to their actual decisions. This builds a dataset of aligned vs. misaligned actions.
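A sketch of what shadow-mode logging might capture; the field names are illustrative, but the core idea is recording the agent's suggestion next to the human's actual decision without executing anything:

```typescript
interface ShadowLogEntry {
  timestamp: string;
  userRequest: string;
  suggestedAction: { tool: string; args: Record<string, unknown> };
  humanAction: { tool: string; args: Record<string, unknown> } | null;
  matched: boolean; // did the human do what the agent would have done?
}

const shadowLog: ShadowLogEntry[] = [];

function recordShadowDecision(
  userRequest: string,
  suggested: ShadowLogEntry["suggestedAction"],
  actual: ShadowLogEntry["humanAction"]
): void {
  shadowLog.push({
    timestamp: new Date().toISOString(),
    userRequest,
    suggestedAction: suggested,
    humanAction: actual,
    matched:
      actual !== null &&
      actual.tool === suggested.tool &&
      JSON.stringify(actual.args) === JSON.stringify(suggested.args),
  });
}
```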
Stage 2: Approval-Required
The agent can take actions, but every medium-risk and above action requires human approval. This is where Module 4's guardrails earn their keep. Track approval rates to identify tools where the agent is reliably correct.
Stage 3: Semi-Autonomous
Low-risk actions execute automatically; medium- and high-risk actions still require approval. Gradually promote tools from "approval-required" to "auto-approved" based on approval rate history:
```typescript
function shouldAutoApprove(toolName: string): boolean {
  const history = getApprovalHistory(toolName, { days: 30 });
  return (
    history.totalDecisions >= 100 &&
    history.approvalRate >= 0.97 &&
    history.lastRejection === null // No rejections in last 30 days
  );
}
```

Stage 4: Autonomous
The agent handles most tasks independently. Humans review a random sample of actions for quality (audit sampling). The agent's scope of autonomy is defined by policy, not by technical limitation.
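Audit sampling can be as simple as flagging a random fraction of autonomous actions for human review. A sketch, with an assumed 5% sample rate:

```typescript
// Illustrative sample rate; tune it to your review capacity and risk tolerance.
const AUDIT_SAMPLE_RATE = 0.05;

interface ExecutedAction {
  id: string;
  tool: string;
  riskLevel: "low" | "medium" | "high";
}

function selectForAudit(action: ExecutedAction): boolean {
  // Assumption: anything above low risk is always reviewed; the rest is sampled.
  if (action.riskLevel !== "low") return true;
  return Math.random() < AUDIT_SAMPLE_RATE;
}
```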
The Feedback Flywheel
Production agents get better over time — but only if you build the feedback loop:
```
User asks a question
        │
        ▼
Agent plans and acts
        │
        ▼
Human approves / rejects / edits
        │
        ▼
Outcome logged with context
        │
        ▼
Rejected actions → identify planning failures
Edited actions   → improve tool parameters
Approved actions → positive training examples
        │
        ▼
Update prompts, tool descriptions, guardrails
        │
        ▼
Agent improves → earns more autonomy
```

Every rejection is a data point. "User rejected email because it was too formal" tells you to adjust the tone prompt. "User edited the CRM update to add a tag" tells you the agent should learn to tag automatically. The flywheel turns human oversight into agent improvement.
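A sketch of the outcome log that feeds this flywheel. The exact shape is an assumption; what matters is capturing the decision, any human edit, and enough context to find recurring failures:

```typescript
type ReviewOutcome = "approved" | "rejected" | "edited";

interface FeedbackRecord {
  sessionId: string;
  tool: string;
  proposedArgs: Record<string, unknown>;
  outcome: ReviewOutcome;
  finalArgs?: Record<string, unknown>; // what the human actually ran, if edited
  reviewerNote?: string;               // e.g. "too formal", "missing tag"
  loggedAt: string;
}

const feedbackLog: FeedbackRecord[] = [];

function logOutcome(record: Omit<FeedbackRecord, "loggedAt">): void {
  feedbackLog.push({ ...record, loggedAt: new Date().toISOString() });
}

// Rejections and edits are the highest-signal records: group them by tool and
// reviewer note to surface recurring planning failures or missing parameters.
function topRejectionReasons(): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of feedbackLog) {
    if (r.outcome === "approved" || !r.reviewerNote) continue;
    const key = `${r.tool}: ${r.reviewerNote}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```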
Cost Optimization
AI agents can get expensive fast. Three levers to control costs:
Model Selection
Not every step needs the best model. Use a tiered approach:
| Task | Model Tier | Why |
|---|---|---|
| Planning (which tools to call) | Large (Claude Sonnet) | Needs strong reasoning |
| Tool argument generation | Medium (Haiku) | Structured output, simpler |
| Response synthesis | Large (Claude Sonnet) | User-facing quality matters |
| Classification / routing | Small (Haiku) | Fast, cheap, sufficient |
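A sketch of routing each step to a tier. The model identifiers are placeholders, not real model IDs; map them to whatever your provider exposes:

```typescript
type StepKind = "planning" | "tool_args" | "synthesis" | "classification";

const MODEL_BY_STEP: Record<StepKind, string> = {
  planning: "large-model-id",       // strong reasoning for tool selection
  tool_args: "medium-model-id",     // structured output, simpler task
  synthesis: "large-model-id",      // user-facing quality matters
  classification: "small-model-id", // fast, cheap, sufficient
};

function modelForStep(step: StepKind): string {
  return MODEL_BY_STEP[step];
}
```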
Caching
Cache tool results that don't change frequently. A contact's deal history from 5 minutes ago is still valid. Set TTLs per tool:
```typescript
const CACHE_TTL: Record<string, number> = {
  search_contacts: 5 * 60,      // 5 min — contacts don't change often
  get_deal_history: 15 * 60,    // 15 min — deals update slowly
  get_product_catalog: 60 * 60, // 1 hour — products rarely change
  search_emails: 0,             // never cache — always fresh
};
```
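A sketch of how those TTLs might be applied: check a cache before calling the tool, and store fresh results with the tool's TTL. The `Map`-based cache and `cachedToolCall` helper are illustrative, not a specific library:

```typescript
interface CacheEntry {
  value: unknown;
  expiresAt: number; // epoch ms
}

const cache = new Map<string, CacheEntry>();

async function cachedToolCall(
  toolName: string,
  args: Record<string, unknown>,
  call: () => Promise<unknown>
): Promise<unknown> {
  const ttlSeconds = CACHE_TTL[toolName] ?? 0;
  const key = `${toolName}:${JSON.stringify(args)}`;

  // Serve from cache only while the entry is still fresh.
  if (ttlSeconds > 0) {
    const hit = cache.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
  }

  const value = await call();
  if (ttlSeconds > 0) {
    cache.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
  return value;
}
```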
Batching
When the agent needs to look up 5 contacts, batch the queries instead of making 5 separate tool calls. Design tools with batch support from the start.
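A sketch of what a batch-capable lookup can look like; `searchContact` is a hypothetical single-record lookup, passed in here only to show the shape of the batching:

```typescript
interface Contact {
  name: string;
  email: string;
  account: string;
}

async function searchContactsBatch(
  names: string[],
  searchContact: (name: string) => Promise<Contact | null>
): Promise<Record<string, Contact | null>> {
  // Run the underlying lookups concurrently and return one keyed result,
  // so the agent spends a single tool call on the whole batch.
  const results = await Promise.all(names.map((n) => searchContact(n)));
  const byName: Record<string, Contact | null> = {};
  names.forEach((n, i) => {
    byName[n] = results[i];
  });
  return byName;
}
```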
Operations Runbook
When things break (they will), have a playbook:
| Symptom | Likely Cause | Response |
|---|---|---|
| All tool calls failing | External API down | Check API status pages, activate fallback tools |
| Agent looping (>15 tool calls) | Ambiguous user request or stale context | Increase max-turns alert, review loop examples |
| Approval rate drops below 60% | Model regression or prompt drift | Roll back to previous prompt version |
| Cost spike (>3x daily baseline) | Looping, new user spike, or verbose model | Check per-user spend, throttle if needed |
| Users reporting wrong answers | Tool returning stale data | Check cache TTLs, verify data source freshness |
| Latency spike (p95 > 10s) | Slow tool or model provider | Check per-tool latency, switch to faster model tier |
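Several of these responses boil down to enforcing budgets per session. A sketch of a guard that operationalizes the looping and cost-spike rows, with limits mirroring the earlier alert thresholds:

```typescript
interface SessionBudget {
  maxToolCalls: number;
  maxCostUsd: number;
}

// Illustrative defaults, aligned with the >15 calls and $0.50/session thresholds above.
const DEFAULT_BUDGET: SessionBudget = { maxToolCalls: 15, maxCostUsd: 0.5 };

function exceedsBudget(
  toolCallsSoFar: number,
  costSoFarUsd: number,
  budget: SessionBudget = DEFAULT_BUDGET
): string | null {
  if (toolCallsSoFar > budget.maxToolCalls) return "tool call budget exceeded";
  if (costSoFarUsd > budget.maxCostUsd) return "cost budget exceeded";
  return null; // within budget, keep going
}
```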
The Four Pillars
Every production agent rests on four pillars: monitoring, evaluation, staged autonomy, and the feedback loop. If any one is weak, the system fails.
In your capstone, you'll instrument your Sales Companion with all four pillars — metrics dashboards, evaluation suites, staged rollout configuration, and the feedback loop that turns every user interaction into an improvement opportunity. This is what separates a demo from a product.
This is chapter 6 of Production AI Agents.
Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.