
ML Detection

Beyond Statistics — Learning What Normal Looks Like

Why ML for Anomaly Detection?

Statistical methods (z-scores, moving averages) make assumptions about your data: that it is roughly Gaussian, that it is stationary, and that normal and abnormal are separable by a fixed threshold. Real production metrics violate all three.

ML-based detectors learn what "normal" looks like from the data itself. They handle:

  • Fat-tailed distributions — real metrics have occasional large values that aren't anomalies
  • Non-stationarity — the normal range shifts over time (deployments, traffic growth)
  • Complex patterns — anomalies in temporal shapes, not just magnitudes

    Isolation Forest

    The isolation forest is one of the most elegant algorithms in anomaly detection. The key insight: anomalies are easy to isolate.

    Imagine randomly drawing a vertical line to split your data points into two groups. Normal points are clustered together — it takes many random splits to isolate one from the crowd. But an outlier sits far from the cluster and can be isolated in just a few splits.

    Normal point: needs 8-12 splits to isolate (deep in the tree)
    Anomalous point: needs 2-4 splits to isolate (near the root)

    The algorithm:

  • Build 100 random binary trees by recursively splitting data at random values
  • For each data point, measure the average path length across all trees
  • Convert to anomaly score: score = 2^(-avgPath / expectedPath)

    Scores range from 0 to 1:

  • ~0.5 — average path length, normal behavior
  • >0.6 — shorter than average, suspicious
  • >0.8 — very short path, highly anomalous

    const result = detectWithIsolationForest("api_latency_search", points);
    // result.anomalies: points with score > 0.6
    // result.scoresPerPoint: score for every timestamp

    The beauty: zero distributional assumptions. It works on Gaussian data, bimodal data, skewed data, and data with complex temporal patterns. The only parameter that matters is the number of trees (100 is standard).
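    The split-count-to-score pipeline can be sketched in a few lines. This is a deliberately simplified one-dimensional version: real implementations split on random features of multi-dimensional points, and `isolationDepth`/`anomalyScore` are illustrative names, not a library API.

```typescript
// Simplified one-dimensional isolation forest: isolate a single value
// against its dataset with random splits, then convert the average
// path length into an anomaly score.

// Count random splits needed to separate x from the rest of the data.
function isolationDepth(x: number, data: number[], maxDepth = 32): number {
  let lo = Math.min(...data);
  let hi = Math.max(...data);
  let current = data;
  let depth = 0;
  while (current.length > 1 && depth < maxDepth && hi > lo) {
    const split = lo + Math.random() * (hi - lo); // random split value
    current = current.filter(v => (x < split ? v < split : v >= split));
    if (x < split) hi = split; else lo = split;   // keep the side holding x
    depth++;
  }
  return depth;
}

// Average path length across many random "trees", converted to a score.
function anomalyScore(x: number, sample: number[], trees = 100): number {
  const data = sample.includes(x) ? sample : [...sample, x];
  let total = 0;
  for (let i = 0; i < trees; i++) total += isolationDepth(x, data);
  const avgPath = total / trees;
  // Expected path length for n points: c(n) = 2(ln(n-1) + 0.5772) - 2(n-1)/n
  const n = data.length;
  const expectedPath = 2 * (Math.log(n - 1) + 0.5772) - (2 * (n - 1)) / n;
  return Math.pow(2, -avgPath / expectedPath); // ~0.5 normal, near 1 anomalous
}
```

    A point far outside the cluster is typically separated on the first or second split, so its average path is short and its score climbs toward 1, while points inside the cluster take many more splits to isolate.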

    Autoencoders

    An autoencoder takes a different approach: learn to reconstruct normal patterns, then flag anything with high reconstruction error.

    The architecture (simplified):

  • Encoder — compresses a window of data points into a smaller representation
  • Decoder — reconstructs the original window from the compressed representation
  • Error — mean squared difference between input and reconstruction

    When trained on normal data:

  • Normal windows are reconstructed accurately (low error)
  • Anomalous windows have patterns the model hasn't seen — high reconstruction error

    const result = detectWithAutoencoder("api_latency_search", points, 12);
    // Uses 12-point sliding windows
    // Anomalies: windows with error > mean_error + 2*std_error

    The windowed approach is the key advantage over point-based methods. A value of 150ms might be normal in isolation, but a sudden jump from a stable 80ms to 150ms within one window creates a pattern the autoencoder fails to reconstruct. This catches gradual drift that point-based z-scores miss.

    Trade-off: Autoencoders need a window size parameter. Too small (3 points) and they miss slow changes; too large (48 points) and they become insensitive to short anomalies. 12 points (12 hours at hourly resolution) is a reasonable starting point.
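    The windowing and error-thresholding logic can be sketched independently of the network itself. In this sketch, `reconstruct()` is a stand-in for a trained autoencoder's decode(encode(window)) step: it just predicts the window mean, so the code runs on its own. The function names are illustrative, not a library API.

```typescript
// Sliding-window reconstruction-error detector. reconstruct() is a
// placeholder for a trained autoencoder; here it predicts the window
// mean so the sketch is self-contained.
function reconstruct(win: number[]): number[] {
  const mean = win.reduce((a, b) => a + b, 0) / win.length;
  return win.map(() => mean);
}

// Mean squared error between a window and its reconstruction.
function windowError(win: number[]): number {
  const rec = reconstruct(win);
  return win.reduce((sum, v, i) => sum + (v - rec[i]) ** 2, 0) / win.length;
}

// Flag windows whose error exceeds mean_error + 2 * std_error;
// returns the start index of each anomalous window.
function detectWindows(points: number[], windowSize = 12): number[] {
  const errors: number[] = [];
  for (let i = 0; i + windowSize <= points.length; i++) {
    errors.push(windowError(points.slice(i, i + windowSize)));
  }
  const mean = errors.reduce((a, b) => a + b, 0) / errors.length;
  const std = Math.sqrt(
    errors.reduce((s, e) => s + (e - mean) ** 2, 0) / errors.length
  );
  return errors.flatMap((e, i) => (e > mean + 2 * std ? [i] : []));
}
```

    A flat 80ms series with a brief jump to 150ms produces a cluster of high-error windows around the jump, while fully flat windows reconstruct perfectly and score zero.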

    Ensemble Scoring

    No single detector catches every type of anomaly:

    Detector            Best At                    Worst At
    Z-score             Sudden spikes              Gradual drift, non-Gaussian data
    Isolation Forest    Point outliers             Temporal pattern anomalies
    Autoencoder         Shape/pattern anomalies    Very brief spikes

    The ensemble combines all three with weighted voting:

    score = 0.40 * iforest_score + 0.35 * autoencoder_score + 0.25 * zscore_score

    A point that triggers only one detector (score ~0.35) probably isn't worth alerting on — it could be a quirk of that particular method. But a point that triggers two or three detectors (score ~0.65+) is almost certainly a real anomaly.
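    In code, the weighted vote is a one-liner. A sketch using the weights above (the interface and function names are ours, and the 0.6 alert threshold is a starting point to tune, not a fixed constant):

```typescript
// Weighted ensemble vote over per-detector scores, each in [0, 1].
interface DetectorScores {
  iforest: number;
  autoencoder: number;
  zscore: number;
}

function ensembleScore(s: DetectorScores): number {
  return 0.40 * s.iforest + 0.35 * s.autoencoder + 0.25 * s.zscore;
}

// Alert only when the combined score clears the (tunable) threshold.
function shouldAlert(s: DetectorScores, threshold = 0.6): boolean {
  return ensembleScore(s) >= threshold;
}
```

    A lone isolation-forest hit (0.9) with quiet co-detectors lands at 0.42, below threshold; the same hit corroborated by the autoencoder (0.8) clears 0.6 and alerts.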

    This is the same principle behind Random Forests, boosting, and other ensemble methods in ML: combining weak learners produces a strong learner. In anomaly detection, combining specialized detectors produces a robust detection system.

    Confidence Calibration

    Raw anomaly scores are not probabilities. An isolation forest score of 0.7 doesn't mean "70% chance of being anomalous." The mapping from raw score to actual probability varies by:

  • Which detector produced the score
  • The characteristics of your data
  • The current false positive rate

    Platt scaling fixes this with a logistic calibration function:

    calibrated = 1 / (1 + exp(a * rawScore + b))

    Parameters a and b are fitted to historical data: for each score range, what fraction turned out to be true anomalies? After calibration, a score of 0.7 genuinely means "70% likely to be a real anomaly."

    const calibrated = calibrateBatch(anomalies, 0.3);
    // Filters out anomalies below 0.3 calibrated confidence
    // Sorts by calibrated confidence (highest first)

    This makes alert thresholds meaningful. You can tell the on-call team: "alerts above 0.6 calibrated confidence are correct 85% of the time."
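    A minimal sketch of the calibration step. The a and b values here are illustrative only; in practice they are fitted so each score bucket's output matches its observed true-anomaly rate in labeled history, and this calibrateBatch is a simplified take on the call shown above (it operates on raw scores rather than anomaly objects):

```typescript
// Platt scaling: squash a raw anomaly score through a fitted logistic
// function. a and b are illustrative defaults, not fitted parameters.
function calibrate(rawScore: number, a = -8, b = 4): number {
  return 1 / (1 + Math.exp(a * rawScore + b));
}

// Drop low-confidence anomalies and sort the rest, highest first.
function calibrateBatch(rawScores: number[], minConfidence = 0.3): number[] {
  return rawScores
    .map(s => calibrate(s))
    .filter(c => c >= minConfidence)
    .sort((x, y) => y - x);
}
```

    Because a is negative, the mapping is monotonic: higher raw scores produce higher calibrated confidence, with the fitted curve stretching or compressing score ranges to match observed outcomes.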

    Choosing Detection Methods

    For a new monitoring deployment, start with:

  • Z-scores as the baseline (fast, interpretable)
  • Add isolation forest for robustness
  • Add autoencoder if you see gradual drift or pattern anomalies
  • Use the ensemble as your production detector
  • Calibrate confidence before wiring to alerts

    Revisit the weights and thresholds monthly as your data evolves.

    This is chapter 3 of AI Anomaly Detection.

    Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.

    View course details