
ML Detection

Beyond Statistics — Learning What Normal Looks Like

Why ML for Anomaly Detection?

Statistical methods (z-scores, moving averages) make assumptions about your data: that it is roughly Gaussian, that it is stationary, and that normal and abnormal are separable by a fixed threshold. Real production metrics violate all three.

ML-based detectors learn what "normal" looks like from the data itself. They handle:

  • Fat-tailed distributions — real metrics have occasional large values that aren't anomalies
  • Non-stationarity — the normal range shifts over time (deployments, traffic growth)
  • Complex patterns — anomalies in temporal shapes, not just magnitudes

    Isolation Forest

    The isolation forest is one of the most elegant algorithms in anomaly detection. The key insight: anomalies are easy to isolate.

    Imagine randomly drawing a vertical line to split your data points into two groups. Normal points are clustered together — it takes many random splits to isolate one from the crowd. But an outlier sits far from the cluster and can be isolated in just a few splits.

    Normal point: needs 8-12 splits to isolate (deep in the tree)
    Anomalous point: needs 2-4 splits to isolate (near the root)

    The algorithm:

  • Build 100 random binary trees by recursively splitting data at random values
  • For each data point, measure the average path length across all trees
  • Convert to anomaly score: score = 2^(-avgPath / expectedPath)

    Scores range from 0 to 1:

  • ~0.5 — average path length, normal behavior
  • >0.6 — shorter than average, suspicious
  • >0.8 — very short path, highly anomalous

    const result = detectWithIsolationForest("api_latency_search", points);
    // result.anomalies: points with score > 0.6
    // result.scoresPerPoint: score for every timestamp

    The beauty: zero distributional assumptions. It works on Gaussian data, bimodal data, skewed data, and data with complex temporal patterns. The only parameter that matters is the number of trees (100 is standard).
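    The split-count-to-score pipeline can be sketched in a few lines. This is a deliberately simplified one-dimensional version: real implementations split on random features of multi-dimensional points, and `isolationDepth`/`anomalyScore` are illustrative names, not a library API.

```typescript
// Simplified one-dimensional isolation forest: isolate a single value
// against its dataset with random splits, then convert the average
// path length into an anomaly score.

// Count random splits needed to separate x from the rest of the data.
function isolationDepth(x: number, data: number[], maxDepth = 32): number {
  let lo = Math.min(...data);
  let hi = Math.max(...data);
  let current = data;
  let depth = 0;
  while (current.length > 1 && depth < maxDepth && hi > lo) {
    const split = lo + Math.random() * (hi - lo); // random split value
    current = current.filter(v => (x < split ? v < split : v >= split));
    if (x < split) hi = split; else lo = split;   // keep the side holding x
    depth++;
  }
  return depth;
}

// Average path length across many random "trees", converted to a score.
function anomalyScore(x: number, sample: number[], trees = 100): number {
  const data = sample.includes(x) ? sample : [...sample, x];
  let total = 0;
  for (let i = 0; i < trees; i++) total += isolationDepth(x, data);
  const avgPath = total / trees;
  // Expected path length for n points: c(n) = 2(ln(n-1) + 0.5772) - 2(n-1)/n
  const n = data.length;
  const expectedPath = 2 * (Math.log(n - 1) + 0.5772) - (2 * (n - 1)) / n;
  return Math.pow(2, -avgPath / expectedPath); // ~0.5 normal, near 1 anomalous
}
```

    A point far outside the cluster is typically separated on the first or second split, so its average path is short and its score climbs toward 1, while points inside the cluster take many more splits to isolate.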

    Autoencoders

    An autoencoder takes a different approach: learn to reconstruct normal patterns, then flag anything with high reconstruction error.

    The architecture (simplified):

  • Encoder — compresses a window of data points into a smaller representation
  • Decoder — reconstructs the original window from the compressed representation
  • Error — mean squared difference between input and reconstruction

    When trained on normal data:

  • Normal windows are reconstructed accurately (low error)
  • Anomalous windows have patterns the model hasn't seen — high reconstruction error

    const result = detectWithAutoencoder("api_latency_search", points, 12);
    // Uses 12-point sliding windows
    // Anomalies: windows with error > mean_error + 2*std_error

    The windowed approach is the key advantage over point-based methods. A value of 150ms might be normal in isolation, but a sudden jump from a stable 80ms to 150ms within one window creates a pattern the autoencoder fails to reconstruct. This catches gradual drift that point-based z-scores miss.

    Trade-off: Autoencoders need a window size parameter. Too small (3 points) and they miss slow changes; too large (48 points) and they become insensitive to short anomalies. 12 points (12 hours at hourly resolution) is a reasonable starting point.
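    The windowing and error-thresholding logic can be sketched independently of the network itself. In this sketch, `reconstruct()` is a stand-in for a trained autoencoder's decode(encode(window)) step: it just predicts the window mean, so the code runs on its own. The function names are illustrative, not a library API.

```typescript
// Sliding-window reconstruction-error detector. reconstruct() is a
// placeholder for a trained autoencoder; here it predicts the window
// mean so the sketch is self-contained.
function reconstruct(win: number[]): number[] {
  const mean = win.reduce((a, b) => a + b, 0) / win.length;
  return win.map(() => mean);
}

// Mean squared error between a window and its reconstruction.
function windowError(win: number[]): number {
  const rec = reconstruct(win);
  return win.reduce((sum, v, i) => sum + (v - rec[i]) ** 2, 0) / win.length;
}

// Flag windows whose error exceeds mean_error + 2 * std_error;
// returns the start index of each anomalous window.
function detectWindows(points: number[], windowSize = 12): number[] {
  const errors: number[] = [];
  for (let i = 0; i + windowSize <= points.length; i++) {
    errors.push(windowError(points.slice(i, i + windowSize)));
  }
  const mean = errors.reduce((a, b) => a + b, 0) / errors.length;
  const std = Math.sqrt(
    errors.reduce((s, e) => s + (e - mean) ** 2, 0) / errors.length
  );
  return errors.flatMap((e, i) => (e > mean + 2 * std ? [i] : []));
}
```

    A flat 80ms series with a brief jump to 150ms produces a cluster of high-error windows around the jump, while fully flat windows reconstruct perfectly and score zero.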

    Ensemble Scoring

    No single detector catches every type of anomaly:

    Detector            Best At                    Worst At
    Z-score             Sudden spikes              Gradual drift, non-Gaussian data
    Isolation Forest    Point outliers             Temporal pattern anomalies
    Autoencoder         Shape/pattern anomalies    Very brief spikes

    The ensemble combines all three with weighted voting:

    score = 0.40 * iforest_score + 0.35 * autoencoder_score + 0.25 * zscore_score

    A point that triggers only one detector (score ~0.35) probably isn't worth alerting on — it could be a quirk of that particular method. But a point that triggers two or three detectors (score ~0.65+) is almost certainly a real anomaly.
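    In code, the weighted vote is a one-liner. A sketch using the weights above (the interface and function names are ours, and the 0.6 alert threshold is a starting point to tune, not a fixed constant):

```typescript
// Weighted ensemble vote over per-detector scores, each in [0, 1].
interface DetectorScores {
  iforest: number;
  autoencoder: number;
  zscore: number;
}

function ensembleScore(s: DetectorScores): number {
  return 0.40 * s.iforest + 0.35 * s.autoencoder + 0.25 * s.zscore;
}

// Alert only when the combined score clears the (tunable) threshold.
function shouldAlert(s: DetectorScores, threshold = 0.6): boolean {
  return ensembleScore(s) >= threshold;
}
```

    A lone isolation-forest hit (0.9) with quiet co-detectors lands at 0.42, below threshold; the same hit corroborated by the autoencoder (0.8) clears 0.6 and alerts.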

    This is the same principle behind Random Forests, boosting, and other ensemble methods in ML: combining weak learners produces a strong learner. In anomaly detection, combining specialized detectors produces a robust detection system.

    Confidence Calibration

    Raw anomaly scores are not probabilities. An isolation forest score of 0.7 doesn't mean "70% chance of being anomalous." The mapping from raw score to actual probability varies by:

  • Which detector produced the score
  • The characteristics of your data
  • The current false positive rate

    Platt scaling fixes this with a logistic calibration function:

    calibrated = 1 / (1 + exp(a * rawScore + b))

    Parameters a and b are fitted to historical data: for each score range, what fraction turned out to be true anomalies? After calibration, a score of 0.7 genuinely means "70% likely to be a real anomaly."

    const calibrated = calibrateBatch(anomalies, 0.3);
    // Filters out anomalies below 0.3 calibrated confidence
    // Sorts by calibrated confidence (highest first)

    This makes alert thresholds meaningful. You can tell the on-call team: "alerts above 0.6 calibrated confidence are correct 85% of the time."
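    A minimal sketch of the calibration step. The a and b values here are illustrative only; in practice they are fitted so each score bucket's output matches its observed true-anomaly rate in labeled history, and this calibrateBatch is a simplified take on the call shown above (it operates on raw scores rather than anomaly objects):

```typescript
// Platt scaling: squash a raw anomaly score through a fitted logistic
// function. a and b are illustrative defaults, not fitted parameters.
function calibrate(rawScore: number, a = -8, b = 4): number {
  return 1 / (1 + Math.exp(a * rawScore + b));
}

// Drop low-confidence anomalies and sort the rest, highest first.
function calibrateBatch(rawScores: number[], minConfidence = 0.3): number[] {
  return rawScores
    .map(s => calibrate(s))
    .filter(c => c >= minConfidence)
    .sort((x, y) => y - x);
}
```

    Because a is negative, the mapping is monotonic: higher raw scores produce higher calibrated confidence, with the fitted curve stretching or compressing score ranges to match observed outcomes.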

    Choosing Detection Methods

    For a new monitoring deployment, start with:

  • Z-scores as the baseline (fast, interpretable)
  • Add isolation forest for robustness
  • Add autoencoder if you see gradual drift or pattern anomalies
  • Use the ensemble as your production detector
  • Calibrate confidence before wiring to alerts

    Revisit the weights and thresholds monthly as your data evolves.

    This is chapter 3 of AI Anomaly Detection.

    Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.

    View course details