Production Pipeline
Root Cause Analysis, Retraining, and SLA Tracking
Beyond Detection
Detection tells you something is wrong. Root cause analysis tells you why. Incident correlation tells you the scope. Model retraining keeps the system accurate. SLA tracking measures how well the system performs. Together, they turn a detection tool into a production monitoring platform.
Root Cause Analysis
Root cause analysis is the most impactful component in the entire system. Instead of an alert that just says "latency spiked," the system produces:
> "API latency spiked 3x at 14:32 — probable cause: deploy-048 (search-service v1.8.3, deployed at 14:02) added fuzzy matching with N+1 queries. Suggested action: roll back search-service to v1.8.2."
The analyzer uses three signals:
1. Deployment Correlation
For each anomaly cluster, find deployments within the previous 4 hours. Closer deployments get higher confidence scores. A deploy 10 minutes before a spike (confidence ~0.95) is far more suspicious than one 3 hours before (~0.30); a scoring sketch appears after this list of signals.
const candidates = analyzeRootCause(anomalies, deployments, incidents);
// [
// { cause: "deploy-048 (search-service v1.8.3)", confidence: 0.88,
// evidence: ["Deployed 30 min before first anomaly", ...],
// suggestedAction: "Roll back search-service to previous version" },
// { cause: "Similar to INC-202: Search latency spike", confidence: 0.40,
// evidence: ["Past root cause: N+1 queries", ...] }
// ]
2. Historical Pattern Matching
Compare current anomaly metrics with past incidents. If the same metrics were impacted before, show the historical root cause and resolution. Engineers love this — "we've seen this exact pattern before, and last time it was a connection pool misconfiguration."
3. Scope Heuristics
If more than 3 different metrics show anomalies simultaneously, it's likely infrastructure-level (load balancer, DNS, network) rather than application-level. This narrows the investigation.
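A minimal sketch of the first and third signals, assuming anomalies and deployments carry plain epoch-millisecond timestamps (the types, function names, and the linear decay shape are illustrative, not the course's exact API): confidence falls off with the gap between the deploy and the first anomaly, and a broad-scope check flags likely infrastructure problems.
interface Deployment { id: string; service: string; deployedAt: number; }
interface Anomaly { metric: string; timestamp: number; }

const LOOKBACK_MS = 4 * 60 * 60 * 1000; // only consider deploys in the previous 4 hours

// Linear decay over the lookback window: ~0.96 at 10 min, ~0.88 at 30 min, ~0.25 at 3 h
function deployConfidence(deploy: Deployment, firstAnomalyAt: number): number {
  const gap = firstAnomalyAt - deploy.deployedAt;
  if (gap < 0 || gap > LOOKBACK_MS) return 0; // deployed after the anomaly, or outside the window
  return 1 - gap / LOOKBACK_MS;
}

// Scope heuristic: many metrics anomalous at once points below the application layer
function likelyInfrastructure(anomalies: Anomaly[]): boolean {
  return new Set(anomalies.map(a => a.metric)).size > 3;
}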
Incident Correlation
When a deployment causes an outage, you don't get one anomaly — you get dozens. Latency spikes on 4 endpoints, error rates jump, CPU and memory increase. Without correlation, engineers see 15 separate alerts and don't realize they're all the same incident.
The correlator groups anomalies by temporal proximity:
const incidents = correlateIncidents(anomalies, alerts, deployments, 30);
// Window: 30 minutes
// Input: 47 anomalies across 12 metrics
// Output: 3 incident groups
// Group 1: Day 7, brief error burst (3 anomalies)
// Group 2: Day 12, search deployment (18 anomalies, 4 alerts)
// Group 3: Day 25, platform-wide (26 anomalies, 7 alerts)
Each group includes the time range, affected metrics, worst severity, related alerts, and correlated deployments. This is the same approach used by PagerDuty's intelligent alert grouping and Datadog's event correlation.
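A minimal sketch of that temporal grouping, assuming each anomaly carries a timestamp (field names are illustrative): sort by time, then start a new group whenever the gap to the previous anomaly exceeds the window.
interface Anomaly { metric: string; timestamp: number; }

function groupByProximity(anomalies: Anomaly[], windowMinutes: number): Anomaly[][] {
  const windowMs = windowMinutes * 60 * 1000;
  const sorted = [...anomalies].sort((a, b) => a.timestamp - b.timestamp);
  const groups: Anomaly[][] = [];
  for (const anomaly of sorted) {
    const current = groups[groups.length - 1];
    const previous = current ? current[current.length - 1] : undefined;
    if (current && previous && anomaly.timestamp - previous.timestamp <= windowMs) {
      current.push(anomaly);   // close enough in time: same incident group
    } else {
      groups.push([anomaly]);  // gap exceeds the window: start a new incident group
    }
  }
  return groups;
}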
Model Retraining
Baselines drift. If your search endpoint improves from 120ms to 80ms after an optimization, the old baseline still expects 120ms. It won't alert on 110ms — which is now 37% above the new normal.
The retraining pipeline runs on a schedule (daily or weekly):
1. Recompute baselines — Use the last 7 days of data to compute new moving averages, seasonal patterns, and bounds. This captures the new normal.
2. Update thresholds — Run the threshold tuner against recent data with known anomaly timestamps from the incident log. Optimize for current false-positive and missed-detection rates.
3. Recalibrate confidence — Update Platt scaling parameters using historical predictions vs. outcomes.
Week 1: mean=120ms, threshold=2.5σ → 3 false positives, 0 missed
Week 2: optimization deployed, mean drops to 80ms
Week 3 (without retraining): 0 alerts because nothing exceeds old threshold
Week 3 (with retraining): new mean=80ms, catches 95ms spike immediately
Critical rule: Exclude known incident windows from training data. If you train on data that includes a 3x latency spike, the model thinks spikes are "normal." Use incident history timestamps to mask those periods.
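A minimal sketch of that masking step, with assumed sample and incident shapes (not the course's exact types): drop every sample that falls inside a known incident window, then recompute the baseline's mean and standard deviation from what remains.
interface Sample { timestamp: number; value: number; }
interface IncidentWindow { start: number; end: number; }

// Exclude samples that fall inside any known incident window
function maskIncidents(samples: Sample[], incidents: IncidentWindow[]): Sample[] {
  return samples.filter(s => !incidents.some(w => s.timestamp >= w.start && s.timestamp <= w.end));
}

function recomputeBaseline(samples: Sample[], incidents: IncidentWindow[]) {
  const clean = maskIncidents(samples, incidents);
  const mean = clean.reduce((sum, s) => sum + s.value, 0) / clean.length;
  const variance = clean.reduce((sum, s) => sum + (s.value - mean) ** 2, 0) / clean.length;
  return { mean, stdDev: Math.sqrt(variance) }; // the new "normal", excluding incident periods
}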
SLA Tracking
Who monitors the monitor? Four key metrics:
MTTD (Mean Time to Detect) — How long between an anomaly starting and the first alert?
MTTR (Mean Time to Resolve) — How long between first alert and resolution?
False Positive Rate — What fraction of alerts were non-incidents?
Detection Coverage — What fraction of real incidents were caught before user reports?
// Compute SLA metrics (MTTD and MTTR are averaged across incidents)
const mean = (values) => values.reduce((sum, v) => sum + v, 0) / values.length;
const sla = {
  mttd: mean(incidents.map(i => timeDiff(i.started_at, i.first_alert_at))),
  mttr: mean(incidents.map(i => timeDiff(i.first_alert_at, i.resolved_at))),
  falsePositiveRate: falseAlerts / totalAlerts,
  coverage: detectedIncidents / totalIncidents,
};
The Complete Pipeline
All six modules connected:
Data Sources → [Collectors] → [Normalize + Window] → Metric Store
↓
[Baseline Computation]
↓
[ML Detection (Ensemble)]
↓
[Confidence Calibration]
↓
[Alert Rules → Dedup → Escalation → Notify]
↓
[Dashboard + Timeline]
↓
[Root Cause Analysis + Incident Correlation]
↓
[SLA Tracking → Retraining → Loop Back]
Each component is independently testable, replaceable, and extensible. Want to add a new data source? Add a collector. New detection method? Add a detector and adjust ensemble weights. New notification channel? Add a formatter to the notifier.
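One way to picture that plug-in structure (these interface and field names are hypothetical, not the course's exact types) is a small interface per extension point that the pipeline iterates over:
interface MetricPoint { metric: string; timestamp: number; value: number; }
interface Anomaly { metric: string; timestamp: number; score: number; }
interface Alert { title: string; severity: "info" | "warning" | "critical"; }

// New data source: implement and register a Collector
interface Collector { collect(sinceMs: number): Promise<MetricPoint[]>; }
// New detection method: implement a Detector and give it an ensemble weight
interface Detector { weight: number; detect(points: MetricPoint[]): Anomaly[]; }
// New notification channel: implement a Notifier with its own formatter
interface Notifier { send(alert: Alert): Promise<void>; }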
This is production monitoring engineering. You built it from scratch — not by configuring someone else's tool, but by understanding every layer of the stack.
This is chapter 6 of AI Anomaly Detection.
Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.
View course details