
Alert Engine

Turning Detections into Action Without Alert Fatigue

The Alert Fatigue Problem

A monitoring system that detects 50 anomalies but sends 500 alerts is worse than useless. Alert fatigue — when engineers stop paying attention because they're overwhelmed — has contributed to some of the worst outages in tech history.

The alert engine sits between detection and human action. Its job: ensure the right person gets the right information at the right time, and nothing more.

Alert Rules

Not every anomaly deserves an alert. Rules define which anomalies matter:

const rule: AlertRule = {
  id: "rule-latency-spike",
  name: "API Latency Spike",
  metric: "api_latency",       // matches any metric containing this string
  condition: "anomaly_score",   // evaluate anomaly confidence, not raw value
  threshold: 0.6,               // minimum ensemble confidence to alert
  window: 15,                   // minutes — aggregate anomalies in this window
  severity: "high",
  channels: ["slack", "pagerduty"],
  enabled: true,
};
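
To make the matching concrete, here is a minimal sketch of how such a rule could be evaluated against a batch of detected anomalies. The Anomaly and Alert shapes and the evaluateRules body are illustrative assumptions; only the rule fields come from the example above, and window-based aggregation is omitted for brevity.

interface Anomaly {
  metric: string;      // e.g. "api_latency_api_search"
  value: number;       // observed value
  expected: number;    // detector's prediction
  score: number;       // ensemble confidence, 0..1
  timestamp: string;   // ISO timestamp
}

interface AlertRule {
  id: string;
  name: string;
  metric: string;      // substring match against anomaly.metric
  condition: "anomaly_score" | "raw_value";
  threshold: number;
  window: number;      // minutes
  severity: "low" | "medium" | "high" | "critical";
  channels: string[];
  enabled: boolean;
}

interface Alert {
  id: string;
  ruleId: string;
  metric: string;
  severity: AlertRule["severity"];
  message: string;
  triggeredAt: string;
}

// Match each anomaly against every enabled rule; emit an alert when the
// rule's condition (anomaly score or raw value) clears its threshold.
function evaluateRules(anomalies: Anomaly[], rules: AlertRule[]): Alert[] {
  const alerts: Alert[] = [];
  for (const anomaly of anomalies) {
    for (const rule of rules) {
      if (!rule.enabled || !anomaly.metric.includes(rule.metric)) continue;
      const measured = rule.condition === "anomaly_score" ? anomaly.score : anomaly.value;
      if (measured < rule.threshold) continue;
      alerts.push({
        id: `alert-${alerts.length + 1}`,
        ruleId: rule.id,
        metric: anomaly.metric,
        severity: rule.severity,
        message: `${rule.name}: ${anomaly.metric} = ${anomaly.value} (expected ${anomaly.expected}, confidence ${anomaly.score})`,
        triggeredAt: anomaly.timestamp,
      });
    }
  }
  return alerts;
}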

Rule design principles:

  • Use anomaly scores, not raw values for ML-detected anomalies — the detector has already determined what's unusual
  • Use raw value thresholds for safety limits — "CPU above 95% is always bad" regardless of what the detector thinks
  • Match broadly — one CPU rule should cover all hosts (use tag-based matching)
  • Set severity based on impact — latency spikes affecting /api/checkout are higher severity than /api/search

Deduplication

During a real incident, the same anomaly triggers repeatedly. Without deduplication, a 2-hour outage affecting 5 metrics checked every minute generates 600 alerts.

The deduplication key is a composite:

{ruleId}:{metricName}:{hourBucket}

Alerts with the same key within a 60-minute window are suppressed. This reduces 600 alerts to 5 (one per affected metric).

Critical exception: Escalations pass through. If the same metric's severity increases from "high" to "critical," the deduplicator lets the new alert through with an [ESCALATED] prefix. Deteriorating situations must get attention.

const deduped = deduplicateAlerts(alerts, 60);
// Input: 47 alerts from 5 metrics over 2 hours
// Output: 6 alerts (5 initial + 1 escalation)
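
A minimal sketch of how deduplicateAlerts could work, following the composite key above. The Alert shape and severity ranking are assumptions carried over from the earlier sketch, and the course's actual window handling may differ.

type Severity = "low" | "medium" | "high" | "critical";

const SEVERITY_RANK: Record<Severity, number> = { low: 0, medium: 1, high: 2, critical: 3 };

interface Alert {
  id: string;
  ruleId: string;
  metric: string;
  severity: Severity;
  message: string;
  triggeredAt: string;   // ISO timestamp
}

// Suppress repeats sharing {ruleId}:{metricName}:{hourBucket}; let severity
// escalations through with an [ESCALATED] prefix so deterioration stays visible.
function deduplicateAlerts(alerts: Alert[], windowMinutes: number): Alert[] {
  const seen = new Map<string, Alert>();   // dedup key → last alert that passed through
  const out: Alert[] = [];

  for (const alert of alerts) {
    const bucket = Math.floor(Date.parse(alert.triggeredAt) / (windowMinutes * 60_000));
    const key = `${alert.ruleId}:${alert.metric}:${bucket}`;
    const prior = seen.get(key);

    if (!prior) {
      seen.set(key, alert);
      out.push(alert);
    } else if (SEVERITY_RANK[alert.severity] > SEVERITY_RANK[prior.severity]) {
      const escalated = { ...alert, message: `[ESCALATED] ${alert.message}` };
      seen.set(key, escalated);
      out.push(escalated);
    }
    // Same key, same or lower severity: suppressed.
  }
  return out;
}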

Escalation Chains

Escalation policies ensure incidents get attention without bypassing the first responder:

Severity  | Level 1 (0 min)             | Level 2         | Level 3           | Level 4
Critical  | On-call + PagerDuty + Phone | Team lead (10m) | Eng manager (20m) | VP (30m)
High      | On-call + PagerDuty         | Team lead (15m) | Eng manager (45m) |
Medium    | On-call + Slack             | Team lead (30m) |                   |
Low       | On-call + Slack             |                 |                   |

Each level adds more aggressive notification channels and a more senior assignee. For a low-severity alert, level 1 is just a Slack message; for a critical one, level 3 means phone calls to the engineering manager, and level 4 (VP escalation) uses every channel.

const level = getEscalationLevel(alert, now);
// alert triggered 25 minutes ago, severity: critical
// → returns level 3: { channels: ["slack", "pagerduty", "phone", "email"],
//                       assignee: "engineering-manager" }

The time delays are carefully chosen. For critical incidents, 10 minutes without acknowledgment means the on-call engineer is likely asleep, offline, or overwhelmed, so the chain escalates to the team lead right away. For medium issues, 30 minutes gives the engineer time to investigate before bothering the team lead.
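
One way to encode this is a static policy table keyed by severity, as sketched below. The channel lists and assignee names mirror the chains above, but the exact structure is an illustrative guess rather than the course's implementation.

type Severity = "low" | "medium" | "high" | "critical";

interface EscalationLevel {
  afterMinutes: number;   // minutes since trigger before this level applies
  channels: string[];
  assignee: string;
}

// Policy table mirroring the escalation chains above.
const ESCALATION_POLICY: Record<Severity, EscalationLevel[]> = {
  critical: [
    { afterMinutes: 0,  channels: ["slack", "pagerduty", "phone"], assignee: "on-call" },
    { afterMinutes: 10, channels: ["slack", "pagerduty", "phone"], assignee: "team-lead" },
    { afterMinutes: 20, channels: ["slack", "pagerduty", "phone", "email"], assignee: "engineering-manager" },
    { afterMinutes: 30, channels: ["slack", "pagerduty", "phone", "email"], assignee: "vp-engineering" },
  ],
  high: [
    { afterMinutes: 0,  channels: ["slack", "pagerduty"], assignee: "on-call" },
    { afterMinutes: 15, channels: ["slack", "pagerduty"], assignee: "team-lead" },
    { afterMinutes: 45, channels: ["slack", "pagerduty", "phone"], assignee: "engineering-manager" },
  ],
  medium: [
    { afterMinutes: 0,  channels: ["slack"], assignee: "on-call" },
    { afterMinutes: 30, channels: ["slack"], assignee: "team-lead" },
  ],
  low: [
    { afterMinutes: 0,  channels: ["slack"], assignee: "on-call" },
  ],
};

// Pick the deepest level whose delay has already elapsed for an unacknowledged alert.
function getEscalationLevel(
  alert: { severity: Severity; triggeredAt: string },
  now: Date,
): EscalationLevel {
  const elapsedMinutes = (now.getTime() - Date.parse(alert.triggeredAt)) / 60_000;
  const levels = ESCALATION_POLICY[alert.severity];
  let current = levels[0];
  for (const level of levels) {
    if (elapsedMinutes >= level.afterMinutes) current = level;
  }
  return current;
}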

Notification Formatting

Different channels need different formats:

Slack:

🔴 *API Latency Spike: 3 anomalies detected*
api_latency_api_search = 385.2ms (expected 156.3, confidence 0.87)
Status: firing | Triggered: 2026-04-12T14:35:00Z

PagerDuty:

[HIGH] API Latency Spike: 3 anomalies detected — api_latency_api_search = 385.2ms

Email:

Subject: [HIGH] API Latency Spike: 3 anomalies detected

api_latency_api_search = 385.2ms (expected 156.3, confidence 0.87)

Triggered at: 2026-04-12T14:35:00Z
Alert ID: alert-7

The Slack format is richest (markdown, emoji severity indicators). PagerDuty is brief (it shows on phone lock screens). Email includes the alert ID for tracking in post-incident reviews.
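
A sketch of how one formatter could produce all three, assuming the alert carries a title, detail line, status, and id. The function name and emoji mapping are illustrative.

interface NotifiableAlert {
  id: string;
  title: string;    // e.g. "API Latency Spike: 3 anomalies detected"
  detail: string;   // e.g. "api_latency_api_search = 385.2ms (expected 156.3, confidence 0.87)"
  severity: "low" | "medium" | "high" | "critical";
  status: "firing" | "acknowledged" | "resolved";
  triggeredAt: string;
}

const SEVERITY_EMOJI = { low: "🔵", medium: "🟡", high: "🔴", critical: "🚨" };

// Each channel gets its own rendering of the same alert.
function formatAlert(alert: NotifiableAlert, channel: "slack" | "pagerduty" | "email"): string {
  switch (channel) {
    case "slack":
      // Richest format: markdown plus an emoji severity indicator.
      return [
        `${SEVERITY_EMOJI[alert.severity]} *${alert.title}*`,
        alert.detail,
        `Status: ${alert.status} | Triggered: ${alert.triggeredAt}`,
      ].join("\n");
    case "pagerduty":
      // Brief: has to be readable on a phone lock screen.
      return `[${alert.severity.toUpperCase()}] ${alert.title} — ${alert.detail}`;
    case "email":
      // Includes the alert ID for post-incident tracking.
      return [
        `Subject: [${alert.severity.toUpperCase()}] ${alert.title}`,
        "",
        alert.detail,
        "",
        `Triggered at: ${alert.triggeredAt}`,
        `Alert ID: ${alert.id}`,
      ].join("\n");
  }
}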

Alert Lifecycle

The state machine for every alert:

firing → acknowledged → resolved

  • Firing: Anomaly detected, notifications sent, escalation clock running
  • Acknowledged: An engineer clicked "I'm on it." Escalation stops.
  • Resolved: Incident over. Resolution notes captured. Alert archived.

Once resolved, an alert cannot return to firing. If the same issue recurs, it's a new alert — with its own timeline, escalation, and resolution.
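
A tiny sketch of enforcing that one-way flow; the transition map and helper are assumptions, but they capture the rule that resolved is terminal.

type AlertStatus = "firing" | "acknowledged" | "resolved";

// Legal transitions only move forward; "resolved" is terminal.
const TRANSITIONS: Record<AlertStatus, AlertStatus[]> = {
  firing: ["acknowledged"],
  acknowledged: ["resolved"],
  resolved: [],   // a recurring issue becomes a brand-new alert instead
};

function transition(current: AlertStatus, next: AlertStatus): AlertStatus {
  if (!TRANSITIONS[current].includes(next)) {
    throw new Error(`Illegal alert transition: ${current} → ${next}`);
  }
  return next;
}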

Putting It Together

The full alert pipeline:

anomalies → evaluateRules() → deduplicateAlerts() → getEscalationLevel() → notifyAlert()

50 anomalies become 8 alerts, deduplicated to 3, escalated based on severity, and delivered to the right channels. The on-call engineer sees 3 actionable notifications instead of 50 noise events.
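
Sketched as one wiring function, with each stage passed in so the snippet stands alone. The stage names follow the pipeline above; notifyAlert's signature is assumed, and evaluateRules is assumed to close over the configured rule set.

type Severity = "low" | "medium" | "high" | "critical";

interface Anomaly { metric: string; value: number; expected: number; score: number; timestamp: string }
interface Alert { id: string; ruleId: string; metric: string; severity: Severity; message: string; triggeredAt: string }
interface EscalationLevel { channels: string[]; assignee: string }

// Wire the stages end-to-end: evaluate → deduplicate → escalate → notify.
function runAlertPipeline(
  anomalies: Anomaly[],
  evaluateRules: (anomalies: Anomaly[]) => Alert[],
  deduplicateAlerts: (alerts: Alert[], windowMinutes: number) => Alert[],
  getEscalationLevel: (alert: Alert, now: Date) => EscalationLevel,
  notifyAlert: (alert: Alert, channel: string) => void,
  now: Date = new Date(),
): void {
  const candidates = evaluateRules(anomalies);        // anomalies → candidate alerts
  const alerts = deduplicateAlerts(candidates, 60);   // suppress repeats within 60 minutes
  for (const alert of alerts) {
    const level = getEscalationLevel(alert, now);     // channels + assignee by severity and age
    for (const channel of level.channels) {
      notifyAlert(alert, channel);                    // channel-specific formatting and delivery
    }
  }
}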

This is chapter 4 of AI Anomaly Detection.

Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.

View course details