
Alert Engine

Turning Detections into Action Without Alert Fatigue

The Alert Fatigue Problem

A monitoring system that detects 50 anomalies but sends 500 alerts is worse than useless. Alert fatigue — when engineers stop paying attention because they're overwhelmed — has contributed to some of the worst outages in tech history.

The alert engine sits between detection and human action. Its job: ensure the right person gets the right information at the right time, and nothing more.

Alert Rules

Not every anomaly deserves an alert. Rules define which anomalies matter:

const rule: AlertRule = {
  id: "rule-latency-spike",
  name: "API Latency Spike",
  metric: "api_latency",       // matches any metric containing this string
  condition: "anomaly_score",   // evaluate anomaly confidence, not raw value
  threshold: 0.6,               // minimum ensemble confidence to alert
  window: 15,                   // minutes — aggregate anomalies in this window
  severity: "high",
  channels: ["slack", "pagerduty"],
  enabled: true,
};
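
To make the matching concrete, here is a minimal sketch of how such a rule could be evaluated against a batch of detected anomalies. The Anomaly and Alert shapes and the evaluateRules body are illustrative assumptions; only the rule fields come from the example above, and window-based aggregation is omitted for brevity.

interface Anomaly {
  metric: string;      // e.g. "api_latency_api_search"
  value: number;       // observed value
  expected: number;    // detector's prediction
  score: number;       // ensemble confidence, 0..1
  timestamp: string;   // ISO timestamp
}

interface AlertRule {
  id: string;
  name: string;
  metric: string;      // substring match against anomaly.metric
  condition: "anomaly_score" | "raw_value";
  threshold: number;
  window: number;      // minutes
  severity: "low" | "medium" | "high" | "critical";
  channels: string[];
  enabled: boolean;
}

interface Alert {
  id: string;
  ruleId: string;
  metric: string;
  severity: AlertRule["severity"];
  message: string;
  triggeredAt: string;
}

// Match each anomaly against every enabled rule; emit an alert when the
// rule's condition (anomaly score or raw value) clears its threshold.
function evaluateRules(anomalies: Anomaly[], rules: AlertRule[]): Alert[] {
  const alerts: Alert[] = [];
  for (const anomaly of anomalies) {
    for (const rule of rules) {
      if (!rule.enabled || !anomaly.metric.includes(rule.metric)) continue;
      const measured = rule.condition === "anomaly_score" ? anomaly.score : anomaly.value;
      if (measured < rule.threshold) continue;
      alerts.push({
        id: `alert-${alerts.length + 1}`,
        ruleId: rule.id,
        metric: anomaly.metric,
        severity: rule.severity,
        message: `${rule.name}: ${anomaly.metric} = ${anomaly.value} (expected ${anomaly.expected}, confidence ${anomaly.score})`,
        triggeredAt: anomaly.timestamp,
      });
    }
  }
  return alerts;
}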

Rule design principles:

  • Use anomaly scores, not raw values for ML-detected anomalies — the detector has already determined what's unusual
  • Use raw value thresholds for safety limits — "CPU above 95% is always bad" regardless of what the detector thinks
  • Match broadly — one CPU rule should cover all hosts (use tag-based matching)
  • Set severity based on impact — latency spikes affecting /api/checkout are higher severity than /api/search

Deduplication

During a real incident, the same anomaly triggers repeatedly. Without deduplication, a 2-hour outage affecting 5 metrics checked every minute generates 600 alerts.

The deduplication key is a composite:

{ruleId}:{metricName}:{hourBucket}

Alerts with the same key within a 60-minute window are suppressed. This reduces 600 alerts to 5 (one per affected metric).

Critical exception: Escalations pass through. If the same metric's severity increases from "high" to "critical," the deduplicator lets the new alert through with an [ESCALATED] prefix. Deteriorating situations must get attention.

const deduped = deduplicateAlerts(alerts, 60);
// Input: 47 alerts from 5 metrics over 2 hours
// Output: 6 alerts (5 initial + 1 escalation)
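
A minimal sketch of how deduplicateAlerts could work, following the composite key above. The Alert shape and severity ranking are assumptions carried over from the earlier sketch, and the course's actual window handling may differ.

type Severity = "low" | "medium" | "high" | "critical";

const SEVERITY_RANK: Record<Severity, number> = { low: 0, medium: 1, high: 2, critical: 3 };

interface Alert {
  id: string;
  ruleId: string;
  metric: string;
  severity: Severity;
  message: string;
  triggeredAt: string;   // ISO timestamp
}

// Suppress repeats sharing {ruleId}:{metricName}:{hourBucket}; let severity
// escalations through with an [ESCALATED] prefix so deterioration stays visible.
function deduplicateAlerts(alerts: Alert[], windowMinutes: number): Alert[] {
  const seen = new Map<string, Alert>();   // dedup key → last alert that passed through
  const out: Alert[] = [];

  for (const alert of alerts) {
    const bucket = Math.floor(Date.parse(alert.triggeredAt) / (windowMinutes * 60_000));
    const key = `${alert.ruleId}:${alert.metric}:${bucket}`;
    const prior = seen.get(key);

    if (!prior) {
      seen.set(key, alert);
      out.push(alert);
    } else if (SEVERITY_RANK[alert.severity] > SEVERITY_RANK[prior.severity]) {
      const escalated = { ...alert, message: `[ESCALATED] ${alert.message}` };
      seen.set(key, escalated);
      out.push(escalated);
    }
    // Same key, same or lower severity: suppressed.
  }
  return out;
}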

Escalation Chains

Escalation policies ensure incidents get attention without bypassing the first responder:

Severity  | Level 1 (0 min)             | Level 2         | Level 3           | Level 4
Critical  | On-call + PagerDuty + Phone | Team lead (10m) | Eng manager (20m) | VP (30m)
High      | On-call + PagerDuty         | Team lead (15m) | Eng manager (45m) |
Medium    | On-call + Slack             | Team lead (30m) |                   |
Low       | On-call + Slack             |                 |                   |

Each level adds more aggressive notification channels and a more senior assignee. For a low-severity alert, level 1 is just a Slack message; for a critical one, level 3 means phone calls to the engineering manager, and level 4 (VP escalation) uses every channel.

const level = getEscalationLevel(alert, now);
// alert triggered 25 minutes ago, severity: critical
// → returns level 3: { channels: ["slack", "pagerduty", "phone", "email"],
//                       assignee: "engineering-manager" }

The time delays are carefully chosen. For critical incidents, 10 minutes without acknowledgment means the on-call engineer is likely asleep, offline, or overwhelmed, so the chain escalates to the team lead right away. For medium issues, 30 minutes gives the engineer time to investigate before bothering the team lead.
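
One way to encode this is a static policy table keyed by severity, as sketched below. The channel lists and assignee names mirror the chains above, but the exact structure is an illustrative guess rather than the course's implementation.

type Severity = "low" | "medium" | "high" | "critical";

interface EscalationLevel {
  afterMinutes: number;   // minutes since trigger before this level applies
  channels: string[];
  assignee: string;
}

// Policy table mirroring the escalation chains above.
const ESCALATION_POLICY: Record<Severity, EscalationLevel[]> = {
  critical: [
    { afterMinutes: 0,  channels: ["slack", "pagerduty", "phone"], assignee: "on-call" },
    { afterMinutes: 10, channels: ["slack", "pagerduty", "phone"], assignee: "team-lead" },
    { afterMinutes: 20, channels: ["slack", "pagerduty", "phone", "email"], assignee: "engineering-manager" },
    { afterMinutes: 30, channels: ["slack", "pagerduty", "phone", "email"], assignee: "vp-engineering" },
  ],
  high: [
    { afterMinutes: 0,  channels: ["slack", "pagerduty"], assignee: "on-call" },
    { afterMinutes: 15, channels: ["slack", "pagerduty"], assignee: "team-lead" },
    { afterMinutes: 45, channels: ["slack", "pagerduty", "phone"], assignee: "engineering-manager" },
  ],
  medium: [
    { afterMinutes: 0,  channels: ["slack"], assignee: "on-call" },
    { afterMinutes: 30, channels: ["slack"], assignee: "team-lead" },
  ],
  low: [
    { afterMinutes: 0,  channels: ["slack"], assignee: "on-call" },
  ],
};

// Pick the deepest level whose delay has already elapsed for an unacknowledged alert.
function getEscalationLevel(
  alert: { severity: Severity; triggeredAt: string },
  now: Date,
): EscalationLevel {
  const elapsedMinutes = (now.getTime() - Date.parse(alert.triggeredAt)) / 60_000;
  const levels = ESCALATION_POLICY[alert.severity];
  let current = levels[0];
  for (const level of levels) {
    if (elapsedMinutes >= level.afterMinutes) current = level;
  }
  return current;
}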

Notification Formatting

Different channels need different formats:

Slack:

🔴 *API Latency Spike: 3 anomalies detected*
api_latency_api_search = 385.2ms (expected 156.3, confidence 0.87)
Status: firing | Triggered: 2026-04-12T14:35:00Z

PagerDuty:

[HIGH] API Latency Spike: 3 anomalies detected — api_latency_api_search = 385.2ms

Email:

Subject: [HIGH] API Latency Spike: 3 anomalies detected

api_latency_api_search = 385.2ms (expected 156.3, confidence 0.87)

Triggered at: 2026-04-12T14:35:00Z
Alert ID: alert-7

The Slack format is richest (markdown, emoji severity indicators). PagerDuty is brief (it shows on phone lock screens). Email includes the alert ID for tracking in post-incident reviews.
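
A sketch of how one formatter could produce all three, assuming the alert carries a title, detail line, status, and id. The function name and emoji mapping are illustrative.

interface NotifiableAlert {
  id: string;
  title: string;    // e.g. "API Latency Spike: 3 anomalies detected"
  detail: string;   // e.g. "api_latency_api_search = 385.2ms (expected 156.3, confidence 0.87)"
  severity: "low" | "medium" | "high" | "critical";
  status: "firing" | "acknowledged" | "resolved";
  triggeredAt: string;
}

const SEVERITY_EMOJI = { low: "🔵", medium: "🟡", high: "🔴", critical: "🚨" };

// Each channel gets its own rendering of the same alert.
function formatAlert(alert: NotifiableAlert, channel: "slack" | "pagerduty" | "email"): string {
  switch (channel) {
    case "slack":
      // Richest format: markdown plus an emoji severity indicator.
      return [
        `${SEVERITY_EMOJI[alert.severity]} *${alert.title}*`,
        alert.detail,
        `Status: ${alert.status} | Triggered: ${alert.triggeredAt}`,
      ].join("\n");
    case "pagerduty":
      // Brief: has to be readable on a phone lock screen.
      return `[${alert.severity.toUpperCase()}] ${alert.title} — ${alert.detail}`;
    case "email":
      // Includes the alert ID for post-incident tracking.
      return [
        `Subject: [${alert.severity.toUpperCase()}] ${alert.title}`,
        "",
        alert.detail,
        "",
        `Triggered at: ${alert.triggeredAt}`,
        `Alert ID: ${alert.id}`,
      ].join("\n");
  }
}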

Alert Lifecycle

The state machine for every alert:

firing → acknowledged → resolved

  • Firing: Anomaly detected, notifications sent, escalation clock running
  • Acknowledged: An engineer clicked "I'm on it." Escalation stops.
  • Resolved: Incident over. Resolution notes captured. Alert archived.

Once resolved, an alert cannot return to firing. If the same issue recurs, it's a new alert — with its own timeline, escalation, and resolution.
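
A tiny sketch of enforcing that one-way flow; the transition map and helper are assumptions, but they capture the rule that resolved is terminal.

type AlertStatus = "firing" | "acknowledged" | "resolved";

// Legal transitions only move forward; "resolved" is terminal.
const TRANSITIONS: Record<AlertStatus, AlertStatus[]> = {
  firing: ["acknowledged"],
  acknowledged: ["resolved"],
  resolved: [],   // a recurring issue becomes a brand-new alert instead
};

function transition(current: AlertStatus, next: AlertStatus): AlertStatus {
  if (!TRANSITIONS[current].includes(next)) {
    throw new Error(`Illegal alert transition: ${current} → ${next}`);
  }
  return next;
}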

Putting It Together

The full alert pipeline:

anomalies → evaluateRules() → deduplicateAlerts() → getEscalationLevel() → notifyAlert()

50 anomalies become 8 alerts, deduplicated to 3, escalated based on severity, and delivered to the right channels. The on-call engineer sees 3 actionable notifications instead of 50 noise events.
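
Sketched as one wiring function, with each stage passed in so the snippet stands alone. The stage names follow the pipeline above; notifyAlert's signature is assumed, and evaluateRules is assumed to close over the configured rule set.

type Severity = "low" | "medium" | "high" | "critical";

interface Anomaly { metric: string; value: number; expected: number; score: number; timestamp: string }
interface Alert { id: string; ruleId: string; metric: string; severity: Severity; message: string; triggeredAt: string }
interface EscalationLevel { channels: string[]; assignee: string }

// Wire the stages end-to-end: evaluate → deduplicate → escalate → notify.
function runAlertPipeline(
  anomalies: Anomaly[],
  evaluateRules: (anomalies: Anomaly[]) => Alert[],
  deduplicateAlerts: (alerts: Alert[], windowMinutes: number) => Alert[],
  getEscalationLevel: (alert: Alert, now: Date) => EscalationLevel,
  notifyAlert: (alert: Alert, channel: string) => void,
  now: Date = new Date(),
): void {
  const candidates = evaluateRules(anomalies);        // anomalies → candidate alerts
  const alerts = deduplicateAlerts(candidates, 60);   // suppress repeats within 60 minutes
  for (const alert of alerts) {
    const level = getEscalationLevel(alert, now);     // channels + assignee by severity and age
    for (const channel of level.channels) {
      notifyAlert(alert, channel);                    // channel-specific formatting and delivery
    }
  }
}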

This is chapter 4 of AI Anomaly Detection.

Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.

View course details