
Production Pipeline

Rate Limiting, Cost Tracking, Alerting & Audit

The Production Gap

Your workflow automation works. Events flow in, AI nodes process them, the engine routes them, reliability patterns handle failures, and the dashboard shows what's happening.

But between "it works" and "it runs in production" are four concerns that don't appear until you're operating at scale: rate limiting, cost tracking, alerting, and audit trails.

Rate Limiting

Every AI API call costs money, and every provider enforces its own rate limits. Without your own rate limiting layer, a spike in events can:

Blow through your API budget in minutes

Hit the provider's rate limit and get 429 errors for all workflows

Cause cascading failures as retries compound the problem

Two levels of rate limiting:

Per-source limits — Prevent one integration from starving others. If Zendesk sends a burst of 1,000 tickets, limit processing to 50/minute so the Stripe payment workflow still has capacity.

Global limits — Cap total AI API calls across all workflows. If your Anthropic budget is $100/day and each call costs ~$0.003, that's ~33,000 calls/day. Set the global limit below that threshold with a safety margin.
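The budget arithmetic above can be written down as a small helper. This is only a sketch: `dailyCallLimit` and the 10% reserve are illustrative assumptions, not part of the course code.

```typescript
// Hypothetical helper: turn a daily budget into a daily call cap.
// safetyMargin is the fraction of the budget held in reserve (0.1 = 10%).
function dailyCallLimit(budgetUsd: number, costPerCallUsd: number, safetyMargin: number): number {
  const usableBudgetUsd = budgetUsd * (1 - safetyMargin);
  return Math.floor(usableBudgetUsd / costPerCallUsd);
}

// $100/day at ~$0.003 per call with a 10% reserve gives roughly 30,000 calls/day.
const globalDailyLimit = dailyCallLimit(100, 0.003, 0.1);
```

Deriving the cap from the budget, rather than hard-coding it, means the limit follows automatically when pricing or the budget changes.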

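const DAY_MS = 24 * 60 * 60 * 1000; // global window: matches the daily budget
const MINUTE_MS = 60 * 1000; // per-source window

const rateLimiter = {
  perSource: new Map<string, { count: number; windowStart: number }>(),
  global: { count: 0, windowStart: Date.now() },

  // Returns true and counts the call if both limits allow it.
  check(source: string, now: number = Date.now()): boolean {
    // Start a fresh window once the previous one has expired.
    if (now - this.global.windowStart >= DAY_MS) this.global = { count: 0, windowStart: now };
    let entry = this.perSource.get(source);
    if (!entry || now - entry.windowStart >= MINUTE_MS) {
      entry = { count: 0, windowStart: now };
      this.perSource.set(source, entry);
    }
    if (this.global.count >= 30_000 || entry.count >= 50) return false;
    this.global.count++;
    entry.count++;
    return true;
  },
};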

Events that exceed the rate limit aren't dropped; they're queued and processed when capacity frees up. This is backpressure: the system slows down gracefully instead of crashing.
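One way to picture the queuing behavior is the sketch below. `BackpressureQueue` and `QueuedEvent` are hypothetical names, and the queue lives in memory, so a real deployment would likely want a durable queue instead.

```typescript
// Hypothetical event shape; the real event type may carry more fields.
interface QueuedEvent {
  id: string;
  source: string;
}

class BackpressureQueue {
  private pending: QueuedEvent[] = [];
  private allow: (source: string) => boolean;
  private process: (event: QueuedEvent) => void;

  // allow: a rate-limit check, for example (s) => rateLimiter.check(s)
  // process: whatever hands the event to the workflow engine
  constructor(allow: (source: string) => boolean, process: (event: QueuedEvent) => void) {
    this.allow = allow;
    this.process = process;
  }

  // New events join the back of the queue so held events keep their order.
  submit(event: QueuedEvent): void {
    this.pending.push(event);
    this.drain();
  }

  // Also call this on a timer so held events run once a window resets.
  drain(): void {
    const stillPending: QueuedEvent[] = [];
    const blocked = new Set<string>();
    for (const event of this.pending) {
      // Once a source is refused, hold its later events too (preserves order).
      if (!blocked.has(event.source) && this.allow(event.source)) {
        this.process(event);
      } else {
        blocked.add(event.source);
        stillPending.push(event);
      }
    }
    this.pending = stillPending;
  }

  // Queue depth is a useful metric: steady growth means sustained overload.
  get depth(): number {
    return this.pending.length;
  }
}
```

A source that is over its limit only delays its own events; events from other sources in the same pass are still processed.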

Cost Tracking

AI API calls are the main operating cost of workflow automation. Every classifier call, every summarizer call, every generator call consumes tokens. Without tracking, you won't know what you've spent until the bill arrives.

The metrics collector tracks cost at three levels:

Per-node — How many tokens did the classifier use? Is the generator consuming 10x more than expected because the templates grew?

Per-workflow — Total cost of a single execution. The support ticket triage workflow costs ~$0.003 (4 nodes × ~250 tokens each, at $3/1M tokens). The document analysis workflow costs ~$0.05 (larger context).

Per-day / per-month — Aggregate cost for budgeting and alerting. A daily cost alert at $50 gives you time to investigate before hitting $100.

const metrics = metricsCollector.getWorkflowMetrics("support-ticket-triage");
// totalTokensUsed: 142,500
// costUsd: 0.43 (at $3/1M tokens)
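The three levels can all come from the same usage records, grouped by different keys. The sketch below assumes a flat $3/1M token price, as in the example above, and a hypothetical `NodeUsage` record. It is not the actual `metricsCollector` implementation.

```typescript
// Assumed flat rate taken from the example above. Real pricing varies by
// model and usually differs for input and output tokens.
const USD_PER_MILLION_TOKENS = 3;

// Hypothetical usage record, emitted once per node execution.
interface NodeUsage {
  workflowId: string;
  nodeId: string;
  day: string; // calendar day, e.g. "2025-01-15"
  tokens: number;
}

function tokensToUsd(tokens: number): number {
  return (tokens / 1_000_000) * USD_PER_MILLION_TOKENS;
}

// Group usage by any key (node, workflow, or day) and return cost per group.
function costBy(usages: NodeUsage[], key: (u: NodeUsage) => string): Map<string, number> {
  const tokenTotals = new Map<string, number>();
  for (const u of usages) {
    tokenTotals.set(key(u), (tokenTotals.get(key(u)) ?? 0) + u.tokens);
  }
  const costs = new Map<string, number>();
  tokenTotals.forEach((tokens, k) => costs.set(k, tokensToUsd(tokens)));
  return costs;
}
```

Calling `costBy(usages, (u) => u.nodeId)` gives the per-node view, `(u) => u.workflowId` the per-workflow view, and `(u) => u.day` the daily total that feeds a cost alert.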

Cost optimization levers:

Use Claude Haiku for classification (cheap, fast, good enough for categorization)

Use Claude Sonnet only for response generation (higher quality, higher cost)

Cache classifier results for identical inputs (many tickets about the same outage)

Skip the summarizer for low-priority events (save tokens, accept less polish)
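The caching lever is the easiest to sketch. `withCache` below is a hypothetical wrapper that keys on the exact input text. The wrapped function stands in for the classifier call and is not a real API.

```typescript
// Wrap any function of a string so repeated identical inputs skip the call.
// For an async classifier, T is a Promise, so concurrent duplicates share one request.
function withCache<T>(fn: (input: string) => T): (input: string) => T {
  const cache = new Map<string, T>();
  return (input: string): T => {
    if (cache.has(input)) return cache.get(input) as T;
    const result = fn(input);
    cache.set(input, result);
    return result;
  };
}
```

This sketch never evicts entries and would also cache a rejected promise, so a production version would likely add a size or age limit and drop failed results.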

Alerting

The system needs to tell you when something is wrong. Don't wait for customers to report issues.

Three alert categories:

Health alerts — Failure rate exceeds 20% over 15 minutes. This indicates a systemic issue (API down, bad deployment, data corruption), not a single failed event.

Performance alerts — P95 latency exceeds 10 seconds. This means at least 5% of events take longer than 10 seconds, which usually points to AI API degradation or a workflow with too many sequential nodes.

Cost alerts — Daily spend exceeds threshold. Catches both legitimate spikes (marketing campaign triggers 10x normal events) and bugs (infinite loop in event processing).
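The three checks above could be evaluated along these lines. This is a sketch under assumptions: `ExecutionRecord` is a hypothetical shape, the thresholds are the ones named in this chapter, and P95 uses the nearest-rank method.

```typescript
// Hypothetical record of one finished workflow execution.
interface ExecutionRecord {
  finishedAt: number; // epoch milliseconds
  durationMs: number;
  failed: boolean;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Returns the categories that are currently over threshold.
function evaluateAlerts(records: ExecutionRecord[], dailySpendUsd: number, now: number): string[] {
  const windowMs = 15 * 60 * 1000;
  const recent = records.filter((r) => now - r.finishedAt <= windowMs);
  const alerts: string[] = [];
  if (recent.length > 0) {
    const failureRate = recent.filter((r) => r.failed).length / recent.length;
    if (failureRate > 0.2) alerts.push("health");
    if (percentile(recent.map((r) => r.durationMs), 95) > 10_000) alerts.push("performance");
  }
  if (dailySpendUsd > 50) alerts.push("cost");
  return alerts;
}
```

Evaluating over a window rather than per event is what separates a systemic problem from a single failed execution.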

Each alert has a cooldown — after firing, it won't fire again for N minutes. Without cooldowns, a sustained failure generates hundreds of alerts. With a 15-minute cooldown, the team gets one alert and time to investigate.
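A cooldown needs little more than a map from alert ID to the time it last fired. The `AlertCooldown` class below is a hypothetical sketch, with the clock passed in so the behavior is easy to test.

```typescript
class AlertCooldown {
  private lastFired = new Map<string, number>();
  private cooldownMs: number;

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if the alert should be sent now, and records the send time.
  shouldFire(alertId: string, now: number = Date.now()): boolean {
    const last = this.lastFired.get(alertId);
    if (last !== undefined && now - last < this.cooldownMs) return false;
    this.lastFired.set(alertId, now);
    return true;
  }
}
```

Cooldowns are tracked per alert ID, so a suppressed health alert never hides a new cost alert.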

Alert channels match urgency:

Slack for operational issues (team visibility)

Email for cost alerts (paper trail for finance)

PagerDuty for system-down scenarios (immediate response)
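The channel mapping above can be kept in a single lookup table. In this sketch the `Sender` functions are placeholders. The real ones would call each service's API, which is not shown here.

```typescript
type Channel = "slack" | "email" | "pagerduty";
type AlertKind = "operational" | "cost" | "system-down";

// Mapping taken from the list above.
const channelFor: Record<AlertKind, Channel> = {
  operational: "slack",
  cost: "email",
  "system-down": "pagerduty",
};

// A sender is any function that delivers a message to one channel.
type Sender = (message: string) => void;

// Sends the message on the channel that matches the alert's urgency.
function dispatch(kind: AlertKind, message: string, senders: Record<Channel, Sender>): Channel {
  const channel = channelFor[kind];
  senders[channel](message);
  return channel;
}
```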

Audit Trail

Every event processed, every node executed, every error encountered, every human approval — all recorded in an immutable append-only log.

auditTrail.log("workflow-engine", "event.processed", "EVT-1042", {
  workflowId: "support-ticket-triage",
  executionId: "exec-0042",
  durationMs: 127,
  classification: "billing",
  routedTo: "billing-team",
});
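A minimal in-memory sketch of an append-only log with the same call shape is shown below. The parameter names (`actor`, `action`, `resource`, `details`) are guesses based on the call above, and a real audit trail would write to durable, append-only storage rather than an array.

```typescript
interface AuditEntry {
  readonly seq: number;
  readonly timestamp: string; // ISO 8601, UTC
  readonly actor: string;
  readonly action: string;
  readonly resource: string;
  readonly details: Readonly<Record<string, unknown>>;
}

class InMemoryAuditTrail {
  private entries: AuditEntry[] = [];

  // The only write operation: entries are appended, never updated or removed.
  log(actor: string, action: string, resource: string, details: Record<string, unknown> = {}): AuditEntry {
    const entry: AuditEntry = Object.freeze({
      seq: this.entries.length + 1,
      timestamp: new Date().toISOString(),
      actor,
      action,
      resource,
      details: Object.freeze({ ...details }), // shallow copy, then freeze
    });
    this.entries.push(entry);
    return entry;
  }

  // Readers get a copy, so callers cannot reorder or remove stored entries.
  all(): AuditEntry[] {
    return [...this.entries];
  }
}
```

Freezing is shallow here, so nested objects inside `details` would still be mutable. Storing a serialized copy is one way to close that gap.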

The audit trail serves three audiences:

Operations — "Why was ticket TICK-4521 routed to engineering instead of billing?" Query by event ID, see the classifier output (category: technical), understand the routing decision.

Compliance — "What automated actions were taken on customer data last month?" Query by action type, export for review. Regulated industries (finance, healthcare) require this.

Analytics — "Which event types cause the most failures?" Aggregate by action and resource to find systemic issues.
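Each audience maps to a different query over the same records. The functions below are a sketch over a hypothetical `AuditRecord` shape held in an array. A real store would run these as indexed queries.

```typescript
interface AuditRecord {
  timestamp: string; // ISO 8601 in UTC, as produced by Date.prototype.toISOString()
  actor: string;
  action: string;
  resource: string;
}

// Operations: everything that happened to one event or ticket.
function forResource(log: AuditRecord[], resource: string): AuditRecord[] {
  return log.filter((r) => r.resource === resource);
}

// Compliance: all records of one action type within [from, to).
// Same-format ISO 8601 UTC strings sort lexicographically in time order.
function forActionBetween(log: AuditRecord[], action: string, from: string, to: string): AuditRecord[] {
  return log.filter((r) => r.action === action && r.timestamp >= from && r.timestamp < to);
}

// Analytics: count records per group, for example per action.
function countBy(log: AuditRecord[], key: (r: AuditRecord) => string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of log) counts.set(key(r), (counts.get(key(r)) ?? 0) + 1);
  return counts;
}
```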

The Complete Architecture

Step back and see what you've built:

[Architecture diagram]

This is a real production pattern. The specific nodes differ by domain (support tickets vs payments vs documents), but the architecture — event-driven, DAG execution, reliability patterns, observability — is universal.

Many large platforms (Stripe, Twilio, and Datadog, for example) run internal systems built along the same general lines. You've now built one from scratch and understand every component.

This is chapter 6 of AI Workflow Automation.
