Alert Engine
Turning Detections into Action Without Alert Fatigue
The Alert Fatigue Problem
A monitoring system that detects 50 anomalies but sends 500 alerts is worse than useless. Alert fatigue — when engineers stop paying attention because they're overwhelmed — has contributed to some of the worst outages in tech history.
The alert engine sits between detection and human action. Its job: ensure the right person gets the right information at the right time, and nothing more.
Alert Rules
Not every anomaly deserves an alert. Rules define which anomalies matter:
```ts
const rule: AlertRule = {
  id: "rule-latency-spike",
  name: "API Latency Spike",
  metric: "api_latency", // matches any metric containing this string
  condition: "anomaly_score", // evaluate anomaly confidence, not raw value
  threshold: 0.6, // minimum ensemble confidence to alert
  window: 15, // minutes — aggregate anomalies in this window
  severity: "high",
  channels: ["slack", "pagerduty"],
  enabled: true,
};
```

Rule design principles:

- Alert on anomaly confidence, not raw values, so one threshold works across metrics with different scales.
- Aggregate anomalies over a time window rather than alerting on every data point.
- Declare severity and channels on the rule itself, so routing is explicit and reviewable.
- Keep an `enabled` flag so a noisy rule can be silenced without being deleted.
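The rule shape above can be exercised with a small evaluator. This is a minimal sketch: the `Anomaly` and `Alert` shapes and the exact `evaluateRules` signature are assumptions for illustration, not the course's API.

```typescript
type Severity = "low" | "medium" | "high" | "critical";

interface Anomaly {
  metric: string; // e.g. "api_latency_api_search"
  score: number;  // ensemble anomaly confidence in [0, 1]
}

interface AlertRule {
  id: string;
  name: string;
  metric: string;    // substring match against anomaly metric names
  threshold: number; // minimum confidence to alert
  window: number;    // minutes
  severity: Severity;
  channels: string[];
  enabled: boolean;
}

interface Alert {
  ruleId: string;
  metric: string;
  severity: Severity;
  count: number; // matching anomalies in the window
}

function evaluateRules(anomalies: Anomaly[], rules: AlertRule[]): Alert[] {
  const alerts: Alert[] = [];
  for (const rule of rules) {
    if (!rule.enabled) continue;
    // Substring match on the metric name, confidence gate on the score.
    const matched = anomalies.filter(
      (a) => a.metric.includes(rule.metric) && a.score >= rule.threshold,
    );
    // One alert per affected metric, not one per anomaly.
    const byMetric = new Map<string, number>();
    for (const a of matched) {
      byMetric.set(a.metric, (byMetric.get(a.metric) ?? 0) + 1);
    }
    for (const [metric, count] of byMetric) {
      alerts.push({ ruleId: rule.id, metric, severity: rule.severity, count });
    }
  }
  return alerts;
}
```

Note the grouping step: even before deduplication, the evaluator emits one alert per affected metric, which keeps the downstream volume bounded.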
Deduplication
During a real incident, the same anomaly triggers repeatedly. Without deduplication, a 2-hour outage affecting 5 metrics checked every minute generates 600 alerts.
The deduplication key is a composite:
```
{ruleId}:{metricName}:{hourBucket}
```

Alerts with the same key within a 60-minute window are suppressed. This reduces 600 alerts to 5 (one per affected metric).
Critical exception: Escalations pass through. If the same metric's severity increases from "high" to "critical," the deduplicator lets the new alert through with an [ESCALATED] prefix. Deteriorating situations must get attention.
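The composite key and the escalation exception can be sketched together. This is a minimal sketch assuming alerts carry a millisecond timestamp; the `FiredAlert` shape and the severity-ranking helper are illustrative, not the course's exact types.

```typescript
type Sev = "low" | "medium" | "high" | "critical";

interface FiredAlert {
  ruleId: string;
  metric: string;
  severity: Sev;
  timestamp: number; // ms since epoch
  title: string;
}

const SEVERITY_RANK: Record<Sev, number> = { low: 0, medium: 1, high: 2, critical: 3 };

function deduplicateAlerts(alerts: FiredAlert[], windowMinutes: number): FiredAlert[] {
  const windowMs = windowMinutes * 60_000;
  const lastSeen = new Map<string, FiredAlert>(); // dedup key → alert let through
  const out: FiredAlert[] = [];
  for (const alert of alerts) {
    // Composite key: {ruleId}:{metricName}:{hourBucket}
    const bucket = Math.floor(alert.timestamp / windowMs);
    const key = `${alert.ruleId}:${alert.metric}:${bucket}`;
    const prev = lastSeen.get(key);
    if (!prev) {
      lastSeen.set(key, alert);
      out.push(alert);
    } else if (SEVERITY_RANK[alert.severity] > SEVERITY_RANK[prev.severity]) {
      // Escalation exception: rising severity always passes through.
      lastSeen.set(key, alert);
      out.push({ ...alert, title: `[ESCALATED] ${alert.title}` });
    }
    // Same key at the same or lower severity: suppressed.
  }
  return out;
}
```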
```ts
const deduped = deduplicateAlerts(alerts, 60);
// Input: 47 alerts from 5 metrics over 2 hours
// Output: 6 alerts (5 initial + 1 escalation)
```

Escalation Chains
Escalation policies ensure incidents get attention without bypassing the first responder:
| Severity | Level 1 (0 min) | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| Critical | On-call + PagerDuty + Phone | Team lead (10m) | Eng manager (20m) | VP (30m) |
| High | On-call + PagerDuty | Team lead (15m) | Eng manager (45m) | — |
| Medium | On-call + Slack | Team lead (30m) | — | — |
| Low | On-call + Slack | — | — | — |
Each level adds more aggressive notification channels. For a medium-severity alert, Level 1 is a Slack message to the on-call engineer; for a critical alert, Level 3 adds phone calls to the engineering manager, and Level 4 (VP escalation) uses every channel.
```ts
const level = getEscalationLevel(alert, now);
// alert triggered 25 minutes ago, severity: critical
// → returns level 3: { channels: ["slack", "pagerduty", "phone", "email"],
//   assignee: "engineering-manager" }
```

The time delays are carefully chosen. For critical incidents, 10 minutes without acknowledgment means the on-call engineer is likely asleep, offline, or overwhelmed. Escalate immediately. For medium issues, 30 minutes gives the engineer time to investigate before bothering the team lead.
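One way to implement the lookup is to store each severity's chain as a table and walk it by elapsed time. The `EscalationStep` shape, the assignee names, and the exact signature below are illustrative assumptions; the channel lists mirror the critical row of the policy table above.

```typescript
interface EscalationStep {
  afterMinutes: number; // escalate once this long passes unacknowledged
  channels: string[];
  assignee: string;
}

// Critical-severity chain from the policy table above (names assumed).
const CRITICAL_POLICY: EscalationStep[] = [
  { afterMinutes: 0,  channels: ["slack", "pagerduty", "phone"], assignee: "on-call" },
  { afterMinutes: 10, channels: ["slack", "pagerduty", "phone"], assignee: "team-lead" },
  { afterMinutes: 20, channels: ["slack", "pagerduty", "phone", "email"], assignee: "engineering-manager" },
  { afterMinutes: 30, channels: ["slack", "pagerduty", "phone", "email"], assignee: "vp-engineering" },
];

function getEscalationLevel(
  triggeredAt: number, // ms since epoch
  now: number,
  policy: EscalationStep[],
): number {
  const elapsedMinutes = (now - triggeredAt) / 60_000;
  // Walk the chain: the highest step whose delay has elapsed wins.
  let level = 1;
  for (let i = 1; i < policy.length; i++) {
    if (elapsedMinutes >= policy[i].afterMinutes) level = i + 1;
  }
  return level;
}
```

Keeping the chain as data rather than code makes it easy to review and to vary per severity.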
Notification Formatting
Different channels need different formats:
Slack:
```
🔴 *API Latency Spike: 3 anomalies detected*
api_latency_api_search = 385.2ms (expected 156.3, confidence 0.87)
Status: firing | Triggered: 2026-04-12T14:35:00Z
```

PagerDuty:

```
[HIGH] API Latency Spike: 3 anomalies detected — api_latency_api_search = 385.2ms
```

Email:

```
Subject: [HIGH] API Latency Spike: 3 anomalies detected
api_latency_api_search = 385.2ms (expected 156.3, confidence 0.87)
Triggered at: 2026-04-12T14:35:00Z
Alert ID: alert-7
```

The Slack format is richest (markdown, emoji severity indicators). PagerDuty is brief (it shows on phone lock screens). Email includes the alert ID for tracking in post-incident reviews.
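A per-channel formatter can be a single switch over the channel name. This is a sketch matched to the samples above; the `AlertView` field names and the emoji mapping are assumptions, not the course's exact types.

```typescript
interface AlertView {
  id: string;
  title: string;
  severity: "low" | "medium" | "high" | "critical";
  metric: string;
  valueMs: number;
  expectedMs: number;
  confidence: number;
  triggeredAt: string; // ISO-8601
}

// Assumed mapping; the samples above only show 🔴 for high severity.
const SEVERITY_EMOJI = { low: "🟢", medium: "🟡", high: "🔴", critical: "🔴" } as const;

function formatAlert(a: AlertView, channel: "slack" | "pagerduty" | "email"): string {
  const detail = `${a.metric} = ${a.valueMs}ms (expected ${a.expectedMs}, confidence ${a.confidence})`;
  switch (channel) {
    case "slack": // richest: markdown plus an emoji severity indicator
      return `${SEVERITY_EMOJI[a.severity]} *${a.title}*\n${detail}\nStatus: firing | Triggered: ${a.triggeredAt}`;
    case "pagerduty": // brief: must fit a phone lock screen
      return `[${a.severity.toUpperCase()}] ${a.title} — ${a.metric} = ${a.valueMs}ms`;
    case "email": // carries the alert ID for post-incident tracking
      return `Subject: [${a.severity.toUpperCase()}] ${a.title}\n${detail}\nTriggered at: ${a.triggeredAt}\nAlert ID: ${a.id}`;
  }
}
```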
Alert Lifecycle
The state machine for every alert:
```
firing → acknowledged → resolved
```

Once resolved, an alert cannot return to firing. If the same issue recurs, it's a new alert — with its own timeline, escalation, and resolution.
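The one-way lifecycle can be enforced with a small transition table. This sketch follows the diagram above; the guard function and its names are illustrative.

```typescript
type AlertState = "firing" | "acknowledged" | "resolved";

// Legal next states for each state, per the diagram above.
const TRANSITIONS: Record<AlertState, AlertState[]> = {
  firing: ["acknowledged"],
  acknowledged: ["resolved"],
  resolved: [], // terminal: a recurrence becomes a brand-new alert
};

function transition(current: AlertState, next: AlertState): AlertState {
  if (!TRANSITIONS[current].includes(next)) {
    throw new Error(`invalid transition: ${current} -> ${next}`);
  }
  return next;
}
```

Making `resolved` a dead end in the table is what guarantees each recurrence gets its own timeline.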
Putting It Together
The full alert pipeline:
```
anomalies → evaluateRules() → deduplicateAlerts() → getEscalationLevel() → notifyAlert()
```

50 anomalies become 8 alerts, deduplicated to 3, escalated based on severity, and delivered to the right channels. The on-call engineer sees 3 actionable notifications instead of 50 noise events.
This is chapter 4 of AI Anomaly Detection.