Monitor & Maintain

Costs, Errors & Updates

Running an Agent in Production

Building an agent is a project. Running an agent is an operation. The skills are different — monitoring, cost management, error triage, and safe updates become your daily concerns.

An unsupervised agent without monitoring is a liability. You need visibility into what it is doing, what it costs, and whether it is working correctly.

Cost Tracking

API-based agents spend money on every request. Without tracking, costs creep up silently:

Cost Driver	How It Grows	Mitigation
Token usage	Verbose prompts, long context windows	Set max_tokens, compress context
Request volume	Retry loops, frequent polling	Max retries, longer polling intervals
Model choice	Frontier models can cost 10-50x more per token than small models	Use cheaper models for simple tasks
Workflow chains	Each step is a separate API call	Cache intermediate results

Build a cost dashboard that tracks:

Per-request cost — tokens in, tokens out, model used

Per-workflow cost — sum of all steps in a pipeline

Daily/weekly totals — trend lines to spot anomalies

Budget burn rate — "at this rate, you'll hit your monthly limit on day 18"
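The per-request, per-workflow, and burn-rate figures above can be sketched in a few functions. This is a minimal illustration, not a real billing integration: the model names, the prices in `PRICES_PER_MTOK`, and the `RequestRecord` fields are all hypothetical placeholders, and the projection assumes spending continues at the month's average daily rate.

```python
import math
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICES_PER_MTOK = {
    "small-model": {"input": 0.50, "output": 1.50},
    "frontier-model": {"input": 10.00, "output": 30.00},
}


@dataclass
class RequestRecord:
    workflow_id: str
    model: str
    tokens_in: int
    tokens_out: int


def request_cost(rec: RequestRecord) -> float:
    """Cost of one request in USD, from its token counts and model price."""
    price = PRICES_PER_MTOK[rec.model]
    return (rec.tokens_in * price["input"] + rec.tokens_out * price["output"]) / 1_000_000


def workflow_costs(records: list[RequestRecord]) -> dict[str, float]:
    """Sum request costs per workflow (one workflow = all steps in a pipeline)."""
    totals: dict[str, float] = {}
    for rec in records:
        totals[rec.workflow_id] = totals.get(rec.workflow_id, 0.0) + request_cost(rec)
    return totals


def projected_limit_day(spent_so_far: float, day_of_month: int, monthly_budget: float):
    """Day of the month on which the budget runs out at the average daily
    spend so far, or None if nothing has been spent yet."""
    if spent_so_far <= 0 or day_of_month <= 0:
        return None
    daily_rate = spent_so_far / day_of_month
    return math.ceil(monthly_budget / daily_rate)
```

For example, `projected_limit_day(60.0, 10, 108.0)` averages 6.00 per day and returns 18. A result larger than the number of days in the month means the budget holds at the current rate.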

Error Categorization

Not all errors are equal. Categorize them to prioritize your response:

Category	Examples	Action
Transient	API timeout, rate limit, network blip	Retry automatically (max 3 retries)
Configuration	Wrong API key, expired token, missing env var	Alert immediately — agent is broken
Data	Malformed input, missing file, encoding error	Skip item, log warning, continue
Safety	Permission denied, budget exceeded, blocked command	Halt agent, alert user, review logs

Track error rates over time. A sudden spike in transient errors might indicate an API outage. A gradual increase in data errors might mean your input sources changed format.
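One way to wire the four categories into an agent loop is sketched below. The mapping from Python exception types to categories is an assumption made for illustration, as are the `ConfigError` and `SafetyError` classes, the exponential backoff, and the choice to treat unknown errors as safety errors. A real agent would map its own provider and tool exceptions.

```python
import logging
import time
from enum import Enum

log = logging.getLogger("agent")


class Category(Enum):
    TRANSIENT = "transient"
    CONFIGURATION = "configuration"
    DATA = "data"
    SAFETY = "safety"


class ConfigError(Exception):
    """Hypothetical: wrong API key, expired token, missing env var."""


class SafetyError(Exception):
    """Hypothetical: budget exceeded, blocked command."""


def categorize(exc: Exception) -> Category:
    # Assumed mapping; adjust to the exceptions your stack actually raises.
    if isinstance(exc, (SafetyError, PermissionError)):
        return Category.SAFETY
    if isinstance(exc, ConfigError):
        return Category.CONFIGURATION
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Category.TRANSIENT
    if isinstance(exc, (ValueError, FileNotFoundError)):
        return Category.DATA  # UnicodeError is a ValueError subclass
    return Category.SAFETY  # unknown errors: halt rather than guess


def process_items(items, handler, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Retry transient errors, skip items with data errors, and re-raise
    everything else so the caller can halt and alert."""
    processed, skipped = [], []
    for item in items:
        attempt = 0
        while True:
            try:
                processed.append(handler(item))
                break
            except Exception as exc:
                category = categorize(exc)
                if category is Category.TRANSIENT and attempt < max_retries:
                    attempt += 1
                    sleep(base_delay * 2 ** (attempt - 1))  # backoff: 0.5s, 1s, 2s
                    continue
                if category is Category.DATA:
                    log.warning("skipping %r: %s", item, exc)
                    skipped.append(item)
                    break
                raise  # configuration, safety, or retries exhausted
    return processed, skipped
```

Counting the entries in `skipped` and the retries per run gives the per-category error rates to track over time.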

Health Checks

An agent health check runs automatically (e.g., on startup and every hour) and verifies:

Configuration integrity — config files exist, parse correctly, and have expected fields

Permission validity — allowed directories exist and are accessible

API connectivity — model endpoint responds (small test request)

Tool availability — each registered tool loads and passes a self-test

Disk space — enough space for audit logs and temporary files

Audit log integrity — log file exists, is writable, and hasn't been tampered with

If any check fails, the agent should refuse to start (or shut down gracefully) rather than running in a broken state.
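A minimal sketch of such a health check follows, covering only the local checks: configuration, directories, disk space, and log writability. It assumes a JSON config file, and all function names and paths are illustrative. API connectivity, tool self-tests, and tamper detection depend on your provider and tooling, so they are left out here.

```python
import json
import os
import shutil


def check_config(path, required_fields):
    """Config file exists, parses as a JSON object, and has the expected fields."""
    try:
        with open(path, encoding="utf-8") as f:
            config = json.load(f)
    except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
        return False, f"cannot load config: {exc}"
    if not isinstance(config, dict):
        return False, "config is not a JSON object"
    missing = [name for name in required_fields if name not in config]
    if missing:
        return False, f"missing fields: {missing}"
    return True, "ok"


def check_directories(dirs):
    """Every allowed directory exists and is readable."""
    bad = [d for d in dirs if not (os.path.isdir(d) and os.access(d, os.R_OK))]
    return (not bad), (f"inaccessible: {bad}" if bad else "ok")


def check_disk_space(path, min_free_bytes):
    """Enough free space for audit logs and temporary files."""
    free = shutil.disk_usage(path).free
    return free >= min_free_bytes, f"{free} bytes free"


def check_log_writable(path):
    """Audit log exists and is writable."""
    ok = os.path.isfile(path) and os.access(path, os.W_OK)
    return ok, "ok" if ok else "audit log missing or not writable"


def run_health_checks(checks):
    """Run each named check; return (all_passed, {name: (passed, detail)})."""
    results = {name: check() for name, check in checks.items()}
    return all(passed for passed, _ in results.values()), results
```

At startup, the agent would call `run_health_checks` with a dictionary of zero-argument callables and exit (for example with `sys.exit(1)`) when the first return value is false.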

Safe Update Procedure

Updating an agent — new model, new tools, new dependencies — is the most common way to break a working system. Follow this procedure:

1. Test — Run the update in a separate environment with sample data

Does it produce the same results as the current version?

Do all safety guardrails still work?

Are costs comparable?

2. Stage — Run the updated agent alongside the current version

Compare outputs on the same inputs

Monitor for new errors or unexpected behaviors

Verify audit logs capture the same actions

3. Deploy — Switch to the updated version

Keep the previous version available for rollback

Monitor closely for the first 24 hours

Set up automatic rollback if error rates spike

Never update in place. If you overwrite the current version and the update fails, you have no fallback.
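The staging comparison and the rollback trigger can be sketched as two small functions. Both are illustrative assumptions: agents are modeled as plain callables, outputs are compared with exact equality (too strict for free-form model output, where a similarity score or rubric would be more realistic), and the spike threshold of twice the baseline error rate is a placeholder.

```python
def compare_versions(current_agent, candidate_agent, inputs):
    """Run both versions on the same inputs and collect every disagreement
    as (input, current_output, candidate_output)."""
    mismatches = []
    for item in inputs:
        expected = current_agent(item)
        actual = candidate_agent(item)
        if expected != actual:
            mismatches.append((item, expected, actual))
    return mismatches


def should_roll_back(errors, requests, baseline_rate, spike_factor=2.0, min_requests=50):
    """True when the new version's error rate exceeds the baseline by
    spike_factor. Small samples are ignored to avoid reacting to noise."""
    if requests < min_requests:
        return False
    return errors / requests > baseline_rate * spike_factor
```

With a 2% baseline, `should_roll_back(10, 100, 0.02)` is true (10% observed against a 4% threshold), while `should_roll_back(3, 100, 0.02)` is false.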

Long-Term Maintenance

Agents face maintenance challenges that grow over time:

Model deprecation — providers retire old models. Pin versions and plan migrations.

Prompt drift — as models update, the same prompt may produce different results. Version your prompts.

Dependency rot — npm packages, Python libraries, and OS tools get updates. Schedule regular dependency reviews.

Data format changes — external data sources change their schemas. Build validation that catches format changes early.

Safety rule review — review your safety rules quarterly. Your needs change, and so should your permissions.
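Two of these practices, pinning versions and validating incoming data, are easy to illustrate. In the sketch below the model identifier, prompt names, and record schema are all hypothetical. The point is only that model and prompt versions are written down explicitly, and that schema changes surface as specific problems instead of silent misbehavior.

```python
# Hypothetical pinned configuration: an explicit model version and prompt version.
AGENT_CONFIG = {
    "model": "example-model-2024-01-01",
    "prompt_version": "summarize-v2",
}

# Prompts are kept under version keys so an old version stays reproducible.
PROMPTS = {
    "summarize-v1": "Summarize the text.",
    "summarize-v2": "Summarize the text in three bullet points.",
}

# Hypothetical schema for records from an external data source.
EXPECTED_FIELDS = {"id": int, "title": str, "published": str}


def active_prompt(config=AGENT_CONFIG, prompts=PROMPTS):
    """Look up the pinned prompt; a missing version raises KeyError immediately."""
    return prompts[config["prompt_version"]]


def validate_record(record, expected=EXPECTED_FIELDS):
    """Return a list of schema problems; an empty list means the record is valid."""
    problems = []
    for name, expected_type in expected.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    for name in sorted(set(record) - set(expected)):
        problems.append(f"unexpected field: {name}")
    return problems
```

A rising count of non-empty `validate_record` results is an early signal that an upstream source changed its format.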

The Monitoring Dashboard

Bring it all together in a single view:

Panel	Shows	Alert When
Activity feed	Recent agent actions (from audit log)	Unusual action patterns
Cost tracker	Today's spend, weekly trend, budget remaining	>80% of daily/monthly budget
Error rates	Errors by category, trend over time	Error rate >5% or spike detected
Health status	Last health check result, per-component status	Any component fails
Workflow history	Recent workflow runs, success/failure, duration	Workflow takes >2x normal time
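The threshold-based alert rules in the table can be evaluated with a single function over a metrics snapshot. This is a sketch under assumptions: the `Snapshot` fields are hypothetical, only the daily budget is checked, and the two pattern-based rules (unusual action patterns, spike detection) are omitted because they need historical data.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    spend_today: float
    daily_budget: float
    errors: int
    requests: int
    failed_components: list
    workflow_seconds: float
    typical_workflow_seconds: float


def evaluate_alerts(s: Snapshot) -> list:
    """Apply the fixed thresholds from the dashboard table to one snapshot."""
    alerts = []
    if s.daily_budget > 0 and s.spend_today / s.daily_budget > 0.80:
        alerts.append(f"cost: {s.spend_today / s.daily_budget:.0%} of daily budget used")
    if s.requests > 0 and s.errors / s.requests > 0.05:
        alerts.append(f"errors: rate {s.errors / s.requests:.1%} exceeds 5%")
    for component in s.failed_components:
        alerts.append(f"health: {component} failed")
    if s.typical_workflow_seconds > 0 and s.workflow_seconds > 2 * s.typical_workflow_seconds:
        alerts.append("workflow: run took more than 2x normal time")
    return alerts
```

An empty list means no panel is alerting. Each string could be routed to whatever notification channel the agent already uses.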

Key Takeaways

Running an agent is an operation, not a project. Monitoring is not optional.

Track costs per request, per workflow, and per day. Set alerts that fire before you reach a limit.

Categorize errors (transient, configuration, data, safety) for appropriate responses.

Health checks should run automatically and prevent the agent from starting in a broken state.

Never update in place — test, stage, deploy, and keep the previous version for rollback.

This is chapter 6 of Open Source AI Agents (OpenClaw).
