
Statistical Baselines

Teaching Your System What Normal Looks Like

The Definition of Normal

Before you can detect anomalies, you need a definition of "normal." A baseline is a mathematical model of expected behavior for a metric. Anything that deviates significantly from the baseline is a candidate anomaly.

The simplest baseline: the average value over recent history, plus a band of expected variation. But simple doesn't mean naive — the choice of baseline method, window size, and sensitivity threshold determines whether your system catches real incidents or drowns you in false alerts.

Moving Averages

The moving average is the workhorse of statistical monitoring. Given a window of N recent data points, it computes:

  • Mean — the expected value: mean = sum(values) / N
  • Standard Deviation — how much normal variation exists: stdDev = sqrt(sum((v - mean)^2) / N)
  • Bounds — the expected range: [mean - k*stdDev, mean + k*stdDev]

    const baseline = computeMovingAverage("api_latency_search", points, 24, 2.0);
    // { mean: 156.3, stdDev: 24.1, upperBound: 204.5, lowerBound: 108.1 }
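The guide shows the call but not the function body, so here is a minimal sketch of what `computeMovingAverage` could look like. The signature and the `Point` shape (`{ timestamp, value }`) are assumptions inferred from the example above, not the course's actual implementation:

```typescript
// Sketch only — the real course implementation isn't shown in this guide.
interface Point { timestamp: number; value: number; }

interface Baseline {
  metricId: string;
  mean: number;
  stdDev: number;
  upperBound: number;
  lowerBound: number;
}

function computeMovingAverage(metricId: string, points: Point[], windowSize: number, k = 2.0): Baseline {
  // Use only the most recent windowSize points.
  const window = points.slice(-windowSize).map(p => p.value);
  const n = window.length;
  const mean = window.reduce((sum, v) => sum + v, 0) / n;
  // Population standard deviation: sqrt(sum((v - mean)^2) / N).
  const stdDev = Math.sqrt(window.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n);
  return {
    metricId,
    mean,
    stdDev,
    upperBound: mean + k * stdDev, // values above this are candidate anomalies
    lowerBound: mean - k * stdDev, // values below this are candidate anomalies
  };
}
```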

    The multiplier k (default 2.0) controls sensitivity. At 2 sigma, about 5% of normal data falls outside the bounds (assuming Gaussian distribution). At 3 sigma, only 0.3%.

    Limitation: Moving averages treat all hours equally. A latency of 180ms at 3 AM (when a nightly batch job runs) might be perfectly normal, but the same value at 5 AM (a quiet period with no batch activity) could indicate trouble.

    Z-Scores: The Universal Anomaly Language

    A z-score converts any metric value into a standardized "how unusual is this?" measure:

    z = (value - mean) / stdDev
  • z = 0 → exactly average
  • z = 2 → 2 standard deviations above mean (top ~2.5%)
  • z = 3 → 3 standard deviations above mean (top ~0.15%)
  • z = -2 → 2 standard deviations below mean (unusually low)

    Z-scores are powerful because they're unit-agnostic. You can compare a z-score of 3.5 on latency (ms) directly with a z-score of 3.5 on error rate (ratio) — both mean "equally unusual."

    const { anomalies } = computeZScores("api_latency_search", points, baseline, 2.5);
    // Flags all points with |z| > 2.5, assigns severity:
    // |z| > 5 → critical, |z| > 4 → high, |z| > 3 → medium, else → low
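A sketch of how `computeZScores` could implement the behavior the comments describe — flag points with |z| above the threshold and bucket severity by |z|. The shapes of `Point`, `Baseline`, and the anomaly records are assumptions based on the earlier examples:

```typescript
// Sketch only — shapes and signature are inferred from the guide's examples.
interface Point { timestamp: number; value: number; }
interface Baseline { mean: number; stdDev: number; }
type Severity = "low" | "medium" | "high" | "critical";

interface Anomaly { metricId: string; timestamp: number; value: number; z: number; severity: Severity; }

// Severity bands from the comment above: |z| > 5 critical, > 4 high, > 3 medium, else low.
function severityFor(absZ: number): Severity {
  if (absZ > 5) return "critical";
  if (absZ > 4) return "high";
  if (absZ > 3) return "medium";
  return "low";
}

function computeZScores(metricId: string, points: Point[], baseline: Baseline, threshold = 2.5) {
  const anomalies: Anomaly[] = [];
  for (const p of points) {
    const z = (p.value - baseline.mean) / baseline.stdDev; // z = (value - mean) / stdDev
    if (Math.abs(z) > threshold) {
      anomalies.push({ metricId, timestamp: p.timestamp, value: p.value, z, severity: severityFor(Math.abs(z)) });
    }
  }
  return { anomalies };
}
```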

    Seasonal Decomposition

    API traffic follows patterns: higher during business hours, lower at night, different on weekends. A flat baseline flags every night as "unusually low" and every afternoon as "unusually high."

    Seasonal decomposition separates these expected patterns from real anomalies:

    value = trend + seasonal + residual
  • Trend — the long-term direction (is latency gradually increasing?)
  • Seasonal — the repeating 24-hour cycle (business hours vs. night)
  • Residual — what remains after removing trend and seasonal components

    Anomalies live in the residual. By analyzing only the residual, you avoid flagging normal daily patterns while still catching genuine deviations.

    const result = computeSeasonalBaseline("api_latency_search", points);
    // seasonalPattern: [45, 42, 40, 38, ...] — hourly coefficients
    // residual stdDev used for anomaly thresholds

    The seasonal pattern for API latency typically shows a clear diurnal shape: values 20-30% higher during 9 AM to 5 PM UTC, dropping to baseline overnight. Removing this pattern means a spike at 3 AM (when values are normally low) gets flagged correctly, while elevated values at 2 PM don't trigger false alarms.
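One common way to implement this decomposition is hour-of-day averaging, sketched below. This is an illustration of `value = trend + seasonal + residual`, not the course's actual `computeSeasonalBaseline` (which may use a proper moving-average trend or STL); the overall-mean trend here is a deliberate simplification:

```typescript
// Rough sketch of hour-of-day seasonal decomposition. Assumed Point shape:
// timestamps in milliseconds since epoch, hourly data.
interface Point { timestamp: number; value: number; }

function computeSeasonalBaseline(points: Point[]) {
  // Trend: overall mean for simplicity. A real implementation would use a
  // centered moving average so slow drift ends up in the trend, not the residual.
  const trend = points.reduce((s, p) => s + p.value, 0) / points.length;

  // Seasonal: average detrended value per UTC hour of day (24 buckets).
  const sums: number[] = new Array(24).fill(0);
  const counts: number[] = new Array(24).fill(0);
  for (const p of points) {
    const hour = new Date(p.timestamp).getUTCHours();
    sums[hour] += p.value - trend;
    counts[hour] += 1;
  }
  const seasonalPattern = sums.map((s, h) => (counts[h] ? s / counts[h] : 0));

  // Residual: what remains after removing trend and seasonal components.
  const residuals = points.map(
    p => p.value - trend - seasonalPattern[new Date(p.timestamp).getUTCHours()]
  );
  const residualMean = residuals.reduce((s, r) => s + r, 0) / residuals.length;
  const residualStdDev = Math.sqrt(
    residuals.reduce((s, r) => s + (r - residualMean) ** 2, 0) / residuals.length
  );

  return { trend, seasonalPattern, residuals, residualStdDev };
}
```

Anomaly thresholds are then applied to the residuals using `residualStdDev`, so the repeating daily cycle no longer inflates the bounds.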

    Threshold Tuning

    The threshold is the most consequential parameter in your entire monitoring system. Too low → alert fatigue. Too high → missed incidents. The threshold tuner evaluates multiple options:

    const results = tuneThreshold(points, baseline, knownAnomalyTimestamps);
    // [
    //   { threshold: 2.0, anomalyCount: 45, falsePositives: 38, missed: 0, score: 38 },
    //   { threshold: 2.5, anomalyCount: 18, falsePositives: 12, missed: 0, score: 12 },
    //   { threshold: 3.0, anomalyCount: 8,  falsePositives: 3,  missed: 1, score: 6 },
    //   { threshold: 3.5, anomalyCount: 4,  falsePositives: 1,  missed: 2, score: 7 },
    // ]

    The scoring function weights missed anomalies 3x more than false positives — because a missed P1 outage costs orders of magnitude more than checking a false alarm. Threshold 3.0 minimizes the score in this example.
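The tuner's scoring rule (score = falsePositives + 3 × missed) can be sketched as follows. The `Point`/`Baseline` shapes and the default candidate list are assumptions; the course's real `tuneThreshold` may differ:

```typescript
// Sketch only — scoring rule taken from the text: missed anomalies weigh 3x.
interface Point { timestamp: number; value: number; }
interface Baseline { mean: number; stdDev: number; }

function tuneThreshold(points: Point[], baseline: Baseline, knownAnomalyTimestamps: number[],
                       candidates = [2.0, 2.5, 3.0, 3.5]) {
  const known = new Set(knownAnomalyTimestamps);
  return candidates.map(threshold => {
    // Flag every point whose |z| exceeds this candidate threshold.
    const flagged = points.filter(
      p => Math.abs((p.value - baseline.mean) / baseline.stdDev) > threshold
    );
    const flaggedTs = new Set(flagged.map(p => p.timestamp));
    const falsePositives = flagged.filter(p => !known.has(p.timestamp)).length;
    const missed = knownAnomalyTimestamps.filter(ts => !flaggedTs.has(ts)).length;
    return {
      threshold,
      anomalyCount: flagged.length,
      falsePositives,
      missed,
      score: falsePositives + 3 * missed, // a missed anomaly costs 3x a false positive
    };
  });
}
```

Picking the candidate with the lowest score reproduces the trade-off in the table above: 3.0 beats 3.5 because the extra missed anomaly outweighs the two avoided false positives.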

    When Statistics Aren't Enough

    Statistical baselines work well for:

  • Sudden spikes (large z-score, easy to detect)
  • Stable metrics with low variance

    They struggle with:

  • Gradual drift — a slow increase over days never exceeds the threshold in any single hour
  • Complex patterns — anomalies in the shape of the time series, not just the magnitude
  • Non-Gaussian data — many real metrics have fat tails, making z-score thresholds unreliable

    Module 3 introduces ML-based detection methods that handle these cases without distributional assumptions.

    This is chapter 2 of AI Anomaly Detection.

    Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.
