Monitoring App
Making Detection Visible and Actionable
Why Dashboards Matter
A detection system that nobody looks at is useless. The monitoring dashboard is where your engineering team lives during incidents. It answers three questions: what is happening right now, is it abnormal, and what do I do about it?
Dashboard Architecture
The monitoring app is a Next.js application with two layers:
API Layer — Server-side endpoints that run the detection pipeline:
- /api/metrics — Returns metric summaries (name, latest value, mean, anomaly count)
- /api/anomalies — Returns detected anomalies sorted by confidence

UI Layer — Client-side React components: metric cards, time-series charts, an anomaly timeline, and an alert management panel, each covered below.
```ts
// API endpoint runs the full pipeline on request.
// ingestAll, computeMovingAverage, and computeZScores are the pipeline
// modules from earlier chapters; adjust the import paths to your layout.
import { NextResponse } from "next/server";
import { ingestAll } from "@/lib/ingest";
import { computeMovingAverage, computeZScores } from "@/lib/detect";

export async function GET() {
  const metrics = ingestAll(60);

  const summaries = metrics.map((m) => {
    const baseline = computeMovingAverage(m.name, m.points);
    const { anomalies } = computeZScores(m.name, m.points, baseline);
    // Assumes each point is { timestamp, value }.
    const values = m.points.map((p) => p.value);
    return {
      name: m.name,
      latest: values[values.length - 1],
      mean: values.reduce((sum, v) => sum + v, 0) / values.length,
      anomalyCount: anomalies.length,
    };
  });

  return NextResponse.json({ metrics: summaries });
}
```

In production, you'd cache the pipeline output and refresh on a schedule (every 30-60 seconds) rather than running the full pipeline on every request.
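A minimal sketch of that cache, assuming a single server process (the `cachedPipeline` helper and 30-second TTL are illustrative; a real deployment would share the cache via Redis or similar):

```ts
// Naive in-memory cache for the pipeline output. Per-process only.
type CacheEntry<T> = { at: number; value: T };

const TTL_MS = 30_000; // recompute at most every 30 seconds
let entry: CacheEntry<unknown> | null = null;

export async function cachedPipeline<T>(run: () => Promise<T>): Promise<T> {
  const now = Date.now();
  if (!entry || now - entry.at > TTL_MS) {
    entry = { at: now, value: await run() };
  }
  return entry.value as T;
}
```

The route handler then wraps its map over metrics in `cachedPipeline(...)`, so concurrent requests within the TTL window all get the same precomputed summaries.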
Metric Cards
The first thing an on-call engineer sees. Each card shows:
```
┌─────────────────────────────┐
│ api_latency_api_search  api │
│ 156.3 ms                    │
│ Mean: 142.1 | Points: 720   │
│ ⚠ 12 anomalies detected     │
└─────────────────────────────┘
```

Design principles for monitoring cards: make the latest value the largest element, keep secondary stats (mean, point count) small, and make the anomaly state impossible to miss.
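As a sketch, a card component might consume the summaries from /api/metrics like this (the MetricSummary shape mirrors the route's response above; the class names are placeholders):

```tsx
// MetricCard.tsx — one card per metric, consuming the /api/metrics payload.
type MetricSummary = {
  name: string;
  latest: number;
  mean: number;
  anomalyCount: number;
};

export function MetricCard({ metric }: { metric: MetricSummary }) {
  const abnormal = metric.anomalyCount > 0;
  return (
    <div className={abnormal ? "card card-warning" : "card"}>
      <div className="card-title">{metric.name}</div>
      {/* The latest value is the largest element on the card */}
      <div className="card-value">{metric.latest.toFixed(1)}</div>
      <div className="card-meta">Mean: {metric.mean.toFixed(1)}</div>
      {abnormal && <div>⚠ {metric.anomalyCount} anomalies detected</div>}
    </div>
  );
}
```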
Time-Series Charts
Charts reveal patterns that numbers alone cannot. A good monitoring chart shows three layers: the raw metric line, the baseline mean, and the expected band around it.
```
Value
400│                    ●
   │                   /|\
300│                  / | \
   │     ════╗   |  ╔════════════  ← upper bound
200│   ──────╫──┤──╫────────────   ← mean
   │     ════╝     ╚════════════   ← lower bound
100│
   └────────────────────────────  Time
  Day    10     12     14     16
```

When the metric line exits the baseline band, it's immediately visible as abnormal. This is more intuitive than any number or alert text.
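Shaping those three layers for a charting library is mostly a data transform. A sketch, assuming each baseline point carries a mean and standard deviation, with a ±3σ band mirroring a z-score threshold of 3 (both assumptions, not the chapter's exact shapes):

```ts
// Shape the three chart layers from a metric and its baseline.
type Point = { t: number; value: number };
type BaselinePoint = { t: number; mean: number; std: number };

export function chartLayers(points: Point[], baseline: BaselinePoint[]) {
  return {
    line: points.map((p) => ({ x: p.t, y: p.value })),  // raw metric
    mean: baseline.map((b) => ({ x: b.t, y: b.mean })), // baseline mean
    band: baseline.map((b) => ({
      x: b.t,
      y0: b.mean - 3 * b.std, // lower bound
      y1: b.mean + 3 * b.std, // upper bound
    })),
  };
}
```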
Interactive features worth adding include hover tooltips with exact values and time-range zooming for drilling into an incident window.
Anomaly Timeline
A chronological feed of all detected events:
```
14:35 🔴 HIGH    api_latency_api_search
      385.2ms (expected 156.3ms) — z-score: 4.2, ensemble: 0.87
      Probable cause: deploy-048 (search-service v1.8.3)

14:35 🟡 MEDIUM  error_rate
      0.082 (expected 0.012) — z-score: 3.1, ensemble: 0.62

10:05 🔴 HIGH    cpu_web_1
      91.2% (expected 42.3%) — ensemble: 0.79
```

Group nearby anomalies (within 30 minutes) into "events." A cluster of 12 anomalies across 5 metrics is one incident — presenting them as a group helps engineers understand the scope immediately.
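A sketch of that grouping, with illustrative field names: sort by time, then fold each anomaly into the current event if it lands within 30 minutes of the event's end.

```ts
// Group anomalies into events: any anomaly within 30 minutes of the
// previous one (in time order) belongs to the same event.
type Anomaly = { metric: string; timestamp: number; score: number };
type AnomalyEvent = { start: number; end: number; anomalies: Anomaly[] };

const WINDOW_MS = 30 * 60 * 1000;

export function groupIntoEvents(anomalies: Anomaly[]): AnomalyEvent[] {
  const sorted = [...anomalies].sort((a, b) => a.timestamp - b.timestamp);
  const events: AnomalyEvent[] = [];
  for (const a of sorted) {
    const last = events[events.length - 1];
    if (last && a.timestamp - last.end <= WINDOW_MS) {
      last.anomalies.push(a);
      last.end = a.timestamp;
    } else {
      events.push({ start: a.timestamp, end: a.timestamp, anomalies: [a] });
    }
  }
  return events;
}
```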
Alert Management Panel
The final component closes the loop from detection to human action:
| Status | Meaning | Action |
|---|---|---|
| Firing | Active, escalation running | Acknowledge or Resolve |
| Acknowledged | Engineer investigating | Resolve when done |
| Resolved | Incident over | Add resolution notes |
| Silenced | Suppressed (maintenance) | Auto-expires |
Each alert card shows the triggering metric and anomaly details, the alert's severity and current status, and the action buttons available in that state.
The acknowledge button is the most critical interaction. Clicking it moves the alert from Firing to Acknowledged, which halts the escalation chain and records who took ownership and when.
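A sketch of that transition (the Alert shape here is illustrative, not a fixed schema):

```ts
// Acknowledge transition: only a Firing alert can be acknowledged.
type AlertStatus = "firing" | "acknowledged" | "resolved" | "silenced";

type Alert = {
  id: string;
  status: AlertStatus;
  acknowledgedBy?: string;
  acknowledgedAt?: number;
};

export function acknowledge(alert: Alert, engineer: string): Alert {
  if (alert.status !== "firing") {
    throw new Error(`Cannot acknowledge alert in status "${alert.status}"`);
  }
  // Moving out of "firing" is what halts escalation: the escalation
  // loop skips any alert that is no longer firing.
  return {
    ...alert,
    status: "acknowledged",
    acknowledgedBy: engineer,
    acknowledgedAt: Date.now(),
  };
}
```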
Real-Time Updates
For production monitoring, the dashboard needs to update without manual refresh. Two approaches:
Polling — Fetch /api/metrics every 30 seconds. Simple and works everywhere, but it adds server load.
Server-Sent Events (SSE) — The server pushes updates when new anomalies are detected. Lower latency, lower server load, but requires streaming infrastructure.
For this project, polling is sufficient. In production, SSE or WebSockets give engineers sub-second visibility into developing incidents.
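A minimal polling hook for the approach used in this project, assuming the /api/metrics route from above and a 30-second interval:

```tsx
"use client";
// Polling hook: fetch /api/metrics on mount and every 30 seconds after.
import { useEffect, useState } from "react";

export function useMetrics(intervalMs = 30_000) {
  const [metrics, setMetrics] = useState<unknown[]>([]);

  useEffect(() => {
    let cancelled = false;
    const load = async () => {
      const res = await fetch("/api/metrics");
      const body = await res.json();
      if (!cancelled) setMetrics(body.metrics);
    };
    load();
    const id = setInterval(load, intervalMs);
    return () => {
      cancelled = true;
      clearInterval(id);
    };
  }, [intervalMs]);

  return metrics;
}
```

Swapping this for SSE later means replacing the `setInterval` with an `EventSource` subscription; the rest of the UI stays the same.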
This is chapter 5 of AI Anomaly Detection.
Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.