
Monitor & Maintain

Costs, Errors & Updates

Running an Agent in Production

Building an agent is a project. Running an agent is an operation. The skills are different — monitoring, cost management, error triage, and safe updates become your daily concerns.

An agent that runs unattended with no monitoring is a liability. You need visibility into what it is doing, what it costs, and whether it is working correctly.

Cost Tracking

API-based agents spend money on every request. Without tracking, costs creep up silently:

| Cost Driver | How It Grows | Mitigation |
|---|---|---|
| Token usage | Verbose prompts, long context windows | Set max_tokens, compress context |
| Request volume | Retry loops, frequent polling | Max retries, longer polling intervals |
| Model choice | Frontier models cost 10-50x more | Use cheaper models for simple tasks |
| Workflow chains | Each step is a separate API call | Cache intermediate results |

Build a cost dashboard that tracks:

  • Per-request cost — tokens in, tokens out, model used
  • Per-workflow cost — sum of all steps in a pipeline
  • Daily/weekly totals — trend lines to spot anomalies
  • Budget burn rate — "at this rate, you'll hit your monthly limit on day 18"
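
The tracking levels above can be sketched in a few lines. This is a minimal illustration, not a full dashboard: the model names and per-million-token prices in `PRICES` are placeholders rather than real provider rates, and `CostTracker` is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# Placeholder prices in USD per million tokens: (input, output).
# Replace with your provider's published rates.
PRICES = {
    "small-model": (0.50, 1.50),
    "frontier-model": (10.00, 30.00),
}


def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Per-request cost in USD from token counts and the model used."""
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


@dataclass
class CostTracker:
    monthly_budget: float
    by_workflow: dict = field(default_factory=lambda: defaultdict(float))
    by_day: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, day: int, workflow: str, model: str,
               tokens_in: int, tokens_out: int) -> float:
        """Add one request to the per-workflow and per-day totals."""
        cost = request_cost(model, tokens_in, tokens_out)
        self.by_workflow[workflow] += cost
        self.by_day[day] += cost
        return cost

    def total(self) -> float:
        return sum(self.by_day.values())

    def projected_limit_day(self, today: int) -> Optional[float]:
        """Day of the month on which spend reaches the budget,
        assuming the average daily rate so far continues."""
        spent = self.total()
        if spent <= 0 or today <= 0:
            return None
        return self.monthly_budget / (spent / today)
```

For example, with a 90 USD budget and 20 USD spent after two days, `projected_limit_day(today=2)` returns `9.0`, which is the number behind a "you'll hit your limit on day 9" warning.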

Error Categorization

Not all errors are equal. Categorize them to prioritize your response:

| Category | Examples | Action |
|---|---|---|
| Transient | API timeout, rate limit, network blip | Retry automatically (max 3) |
| Configuration | Wrong API key, expired token, missing env var | Alert immediately; the agent is broken |
| Data | Malformed input, missing file, encoding error | Skip item, log warning, continue |
| Safety | Permission denied, budget exceeded, blocked command | Halt agent, alert user, review logs |

Track error rates over time. A sudden spike in transient errors might indicate an API outage. A gradual increase in data errors might mean your input sources changed format.
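
One way to wire the categories into code is to map exception types to a category and retry only the transient ones. The sketch below is illustrative: `ConfigError` and `BudgetExceeded` are hypothetical exception classes, the mapping in `RULES` is an assumption about how your agent raises errors, and treating unknown errors as safety issues is a conservative design choice rather than a rule from this guide.

```python
import time
from enum import Enum


class Category(Enum):
    TRANSIENT = "transient"
    CONFIGURATION = "configuration"
    DATA = "data"
    SAFETY = "safety"


class ConfigError(Exception):
    """Hypothetical: bad API key, expired token, missing env var."""


class BudgetExceeded(Exception):
    """Hypothetical: a spend limit was hit."""


# Checked in order; the first matching rule wins.
RULES = [
    ((TimeoutError, ConnectionError), Category.TRANSIENT),
    ((ConfigError,), Category.CONFIGURATION),
    ((PermissionError, BudgetExceeded), Category.SAFETY),
    ((FileNotFoundError, UnicodeDecodeError, ValueError), Category.DATA),
]


def categorize(exc: BaseException) -> Category:
    for types, category in RULES:
        if isinstance(exc, types):
            return category
    # Assumption: anything unrecognized halts the agent for review.
    return Category.SAFETY


def run_with_retry(task, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run task(); retry transient errors up to max_retries times
    with exponential backoff, and re-raise everything else."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception as exc:
            if categorize(exc) is not Category.TRANSIENT or attempt >= max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

The `sleep` parameter exists so tests can replace the real delay. Counting the results of `categorize` per category over time gives you the error-rate trends described above.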

Health Checks

An agent health check runs automatically (e.g., on startup and every hour) and verifies:

  • Configuration integrity — config files exist, parse correctly, and have expected fields
  • Permission validity — allowed directories exist and are accessible
  • API connectivity — model endpoint responds (small test request)
  • Tool availability — each registered tool loads and passes a self-test
  • Disk space — enough space for audit logs and temporary files
  • Audit log integrity — log file exists, is writable, and hasn't been tampered with

If any check fails, the agent should refuse to start (or shut down gracefully) rather than running in a broken state.
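
A minimal sketch of three of these checks (configuration, permissions, and audit log plus disk space) follows. It assumes a JSON config file with the hypothetical fields `model`, `allowed_dirs`, and `audit_log`, and an arbitrary 100 MiB free-space threshold. API connectivity, tool self-tests, and tamper detection are left out because they depend on your provider and tooling.

```python
import json
import os
import shutil
from pathlib import Path

REQUIRED_FIELDS = {"model", "allowed_dirs", "audit_log"}  # hypothetical schema
MIN_FREE_BYTES = 100 * 1024 * 1024  # assumed threshold: 100 MiB


def check_config(path: Path) -> dict:
    """Config file exists, parses as JSON, and has the expected fields."""
    config = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    missing = REQUIRED_FIELDS - set(config)
    if missing:
        raise ValueError(f"config missing fields: {sorted(missing)}")
    return config


def run_health_checks(config_path: Path) -> dict:
    """Return {check name: error message, or None if the check passed}."""
    try:
        config = check_config(config_path)
    except (OSError, ValueError) as exc:
        # Without a valid config the remaining checks cannot run.
        return {"config": str(exc)}
    results = {"config": None}

    bad_dirs = [d for d in config["allowed_dirs"]
                if not (os.path.isdir(d) and os.access(d, os.R_OK))]
    results["permissions"] = f"inaccessible: {bad_dirs}" if bad_dirs else None

    log_dir = Path(config["audit_log"]).parent
    if log_dir.is_dir() and os.access(log_dir, os.W_OK):
        results["audit_log"] = None
        free = shutil.disk_usage(log_dir).free
        results["disk_space"] = (None if free >= MIN_FREE_BYTES
                                 else f"only {free} bytes free")
    else:
        results["audit_log"] = f"log directory not writable: {log_dir}"
    return results


def healthy(results: dict) -> bool:
    return all(error is None for error in results.values())
```

At startup, the agent would call `run_health_checks` and exit with a non-zero status when `healthy(...)` is false, logging the failing entries so the cause is visible.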

Safe Update Procedure

Updating an agent (new model, new tools, new dependencies) is the most common way to break a working system. Follow this procedure:

1. Test: Run the update in a separate environment with sample data

  • Does it produce the same results as the current version?
  • Do all safety guardrails still work?
  • Are costs comparable?

2. Stage: Run the updated agent alongside the current version

  • Compare outputs on the same inputs
  • Monitor for new errors or unexpected behaviors
  • Verify audit logs capture the same actions

3. Deploy: Switch to the updated version

  • Keep the previous version available for rollback
  • Monitor closely for the first 24 hours
  • Set up automatic rollback if error rates spike

Never update in place. If you overwrite the current version and the update fails, you have no fallback.
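
The staging comparison and the rollback trigger could look roughly like this. Both functions are hypothetical helpers: `compare_versions` assumes each agent version can be called as a function from input to output and that outputs are deterministic (or normalized) enough for an equality check, and the `spike_factor` and `min_requests` defaults are illustrative values, not recommendations.

```python
from typing import Callable, Iterable


def compare_versions(current: Callable[[str], str],
                     candidate: Callable[[str], str],
                     inputs: Iterable[str]) -> list:
    """Run both versions on the same inputs and collect the differences."""
    mismatches = []
    for item in inputs:
        expected = current(item)
        try:
            actual = candidate(item)
        except Exception as exc:
            # A crash in the candidate counts as a mismatch, not a test failure.
            actual = f"<error: {exc!r}>"
        if actual != expected:
            mismatches.append(
                {"input": item, "current": expected, "candidate": actual}
            )
    return mismatches


def should_roll_back(errors: int, requests: int, baseline_rate: float,
                     spike_factor: float = 2.0, min_requests: int = 50) -> bool:
    """True when the new version's error rate exceeds
    spike_factor times the previous version's baseline rate."""
    if requests < min_requests:
        return False  # too little traffic to judge
    return errors / requests > baseline_rate * spike_factor
```

An empty list from `compare_versions` is the signal to move from staging to deploy. After the switch, a periodic job can call `should_roll_back` with the live counts and restore the previous version when it returns true.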

Long-Term Maintenance

Agents face maintenance challenges that grow over time:

  • Model deprecation — providers retire old models. Pin versions and plan migrations.
  • Prompt drift — as models update, the same prompt may produce different results. Version your prompts.
  • Dependency rot — npm packages, Python libraries, and OS tools get updates. Schedule regular dependency reviews.
  • Data format changes — external data sources change their schemas. Build validation that catches format changes early.
  • Safety rule review — review your safety rules quarterly. Your needs change, and so should your permissions.
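
For the data format item, validation can be as simple as checking each incoming record against the fields you expect. The sketch below assumes records arrive as dictionaries. `EXPECTED_SCHEMA` is a made-up layout for illustration, so substitute the fields your own sources provide.

```python
EXPECTED_SCHEMA = {"id": int, "title": str, "published": str}  # hypothetical layout


def validate_record(record: dict, schema: dict = EXPECTED_SCHEMA) -> list:
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for name, expected_type in schema.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    # New fields are often the first visible sign of an upstream schema change.
    for name in sorted(record.keys() - schema.keys()):
        problems.append(f"unexpected field: {name}")
    return problems
```

Logging these problems as data errors, rather than letting the agent process the record anyway, means a format change shows up as a rising data-error rate instead of silently wrong output.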

The Monitoring Dashboard

Bring it all together in a single view:

| Panel | Shows | Alert When |
|---|---|---|
| Activity feed | Recent agent actions (from audit log) | Unusual action patterns |
| Cost tracker | Today's spend, weekly trend, budget remaining | >80% of daily/monthly budget |
| Error rates | Errors by category, trend over time | Error rate >5% or spike detected |
| Health status | Last health check result, per-component status | Any component fails |
| Workflow history | Recent workflow runs, success/failure, duration | Workflow takes >2x normal time |
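
The numeric alert conditions in the table translate directly into threshold checks. This is a sketch under simple assumptions: `Snapshot` is a hypothetical container for values you would pull from your own cost, error, and health data, and the pattern-based alerts (unusual actions, error spikes) are omitted because they need a baseline model.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    spend: float                     # spend so far in the budget period
    budget: float                    # budget for the same period
    errors: int                      # failed requests in the window
    requests: int                    # total requests in the window
    workflow_seconds: float          # duration of the latest workflow run
    normal_workflow_seconds: float   # typical duration for that workflow
    failed_components: list = field(default_factory=list)


def alerts(s: Snapshot) -> list:
    """Evaluate the dashboard thresholds and return the alerts that fire."""
    found = []
    if s.budget > 0 and s.spend / s.budget > 0.80:
        found.append("cost: over 80% of budget")
    if s.requests > 0 and s.errors / s.requests > 0.05:
        found.append("errors: rate above 5%")
    if s.failed_components:
        found.append("health: failed " + ", ".join(s.failed_components))
    if s.workflow_seconds > 2 * s.normal_workflow_seconds:
        found.append("workflow: more than 2x normal duration")
    return found
```

Keeping the thresholds in one function makes them easy to review alongside the safety rules, and easy to test before they guard a production agent.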

Key Takeaways

  • Running an agent is an operation, not a project. Monitoring is not optional.
  • Track costs at per-request, per-workflow, and daily levels. Set alerts before limits.
  • Categorize errors (transient, configuration, data, safety) for appropriate responses.
  • Health checks should run automatically and prevent the agent from starting in a broken state.
  • Never update in place — test, stage, deploy, and keep the previous version for rollback.

This is chapter 6 of Open Source AI Agents (OpenClaw).
