Monitoring App
Making Detection Visible and Actionable
Why Dashboards Matter
A detection system that nobody looks at is useless. The monitoring dashboard is where your engineering team lives during incidents. It answers three questions: what is happening right now, is it abnormal, and what do I do about it?
Dashboard Architecture
The monitoring app is a Next.js application with two layers:
API Layer — Server-side endpoints that run the detection pipeline:
- /api/metrics — Returns metric summaries (name, latest value, mean, anomaly count)
- /api/anomalies — Returns detected anomalies sorted by confidence

UI Layer — Client-side React components: metric cards, time-series charts, an anomaly timeline, and an alert management panel, each covered below.
```ts
// API endpoint runs the full pipeline on request.
// ingestAll, computeMovingAverage, and computeZScores are the pipeline
// modules from earlier chapters; adjust the import paths to your layout.
import { NextResponse } from "next/server";
import { ingestAll } from "@/lib/ingest";
import { computeMovingAverage, computeZScores } from "@/lib/detect";

export async function GET() {
  const metrics = ingestAll(60);

  const summaries = metrics.map((m) => {
    const baseline = computeMovingAverage(m.name, m.points);
    const { anomalies } = computeZScores(m.name, m.points, baseline);
    // Assumes each point is { timestamp, value }.
    const values = m.points.map((p) => p.value);
    return {
      name: m.name,
      latest: values[values.length - 1],
      mean: values.reduce((sum, v) => sum + v, 0) / values.length,
      anomalyCount: anomalies.length,
    };
  });

  return NextResponse.json({ metrics: summaries });
}
```

In production, you'd cache the pipeline output and refresh on a schedule (every 30-60 seconds) rather than running the full pipeline on every request.
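A minimal sketch of that cache, assuming a single server process (the `cachedPipeline` helper and 30-second TTL are illustrative; a real deployment would share the cache via Redis or similar):

```ts
// Naive in-memory cache for the pipeline output. Per-process only.
type CacheEntry<T> = { at: number; value: T };

const TTL_MS = 30_000; // recompute at most every 30 seconds
let entry: CacheEntry<unknown> | null = null;

export async function cachedPipeline<T>(run: () => Promise<T>): Promise<T> {
  const now = Date.now();
  if (!entry || now - entry.at > TTL_MS) {
    entry = { at: now, value: await run() };
  }
  return entry.value as T;
}
```

The route handler then wraps its map over metrics in `cachedPipeline(...)`, so concurrent requests within the TTL window all get the same precomputed summaries.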
Metric Cards
The first thing an on-call engineer sees. Each card shows:
```
┌─────────────────────────────┐
│ api_latency_api_search  api │
│ 156.3 ms                    │
│ Mean: 142.1 | Points: 720   │
│ ⚠ 12 anomalies detected     │
└─────────────────────────────┘
```

Design principles for monitoring cards: make the latest value the largest element, keep secondary stats (mean, point count) small, and make the anomaly state impossible to miss.
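As a sketch, a card component might consume the summaries from /api/metrics like this (the MetricSummary shape mirrors the route's response above; the class names are placeholders):

```tsx
// MetricCard.tsx — one card per metric, consuming the /api/metrics payload.
type MetricSummary = {
  name: string;
  latest: number;
  mean: number;
  anomalyCount: number;
};

export function MetricCard({ metric }: { metric: MetricSummary }) {
  const abnormal = metric.anomalyCount > 0;
  return (
    <div className={abnormal ? "card card-warning" : "card"}>
      <div className="card-title">{metric.name}</div>
      {/* The latest value is the largest element on the card */}
      <div className="card-value">{metric.latest.toFixed(1)}</div>
      <div className="card-meta">Mean: {metric.mean.toFixed(1)}</div>
      {abnormal && <div>⚠ {metric.anomalyCount} anomalies detected</div>}
    </div>
  );
}
```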
Time-Series Charts
Charts reveal patterns that numbers alone cannot. A good monitoring chart shows three layers: the raw metric line, the baseline mean, and the expected band around it.
```
Value
400│                    ●
   │                   /|\
300│                  / | \
   │     ════╗   |  ╔════════════  ← upper bound
200│   ──────╫──┤──╫────────────   ← mean
   │     ════╝     ╚════════════   ← lower bound
100│
   └────────────────────────────  Time
  Day    10     12     14     16
```

When the metric line exits the baseline band, it's immediately visible as abnormal. This is more intuitive than any number or alert text.
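Shaping those three layers for a charting library is mostly a data transform. A sketch, assuming each baseline point carries a mean and standard deviation, with a ±3σ band mirroring a z-score threshold of 3 (both assumptions, not the chapter's exact shapes):

```ts
// Shape the three chart layers from a metric and its baseline.
type Point = { t: number; value: number };
type BaselinePoint = { t: number; mean: number; std: number };

export function chartLayers(points: Point[], baseline: BaselinePoint[]) {
  return {
    line: points.map((p) => ({ x: p.t, y: p.value })),  // raw metric
    mean: baseline.map((b) => ({ x: b.t, y: b.mean })), // baseline mean
    band: baseline.map((b) => ({
      x: b.t,
      y0: b.mean - 3 * b.std, // lower bound
      y1: b.mean + 3 * b.std, // upper bound
    })),
  };
}
```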
Interactive features worth adding include hover tooltips with exact values and time-range zooming for drilling into an incident window.
Anomaly Timeline
A chronological feed of all detected events:
```
14:35 🔴 HIGH    api_latency_api_search
      385.2ms (expected 156.3ms) — z-score: 4.2, ensemble: 0.87
      Probable cause: deploy-048 (search-service v1.8.3)

14:35 🟡 MEDIUM  error_rate
      0.082 (expected 0.012) — z-score: 3.1, ensemble: 0.62

10:05 🔴 HIGH    cpu_web_1
      91.2% (expected 42.3%) — ensemble: 0.79
```

Group nearby anomalies (within 30 minutes) into "events." A cluster of 12 anomalies across 5 metrics is one incident — presenting them as a group helps engineers understand the scope immediately.
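A sketch of that grouping, with illustrative field names: sort by time, then fold each anomaly into the current event if it lands within 30 minutes of the event's end.

```ts
// Group anomalies into events: any anomaly within 30 minutes of the
// previous one (in time order) belongs to the same event.
type Anomaly = { metric: string; timestamp: number; score: number };
type AnomalyEvent = { start: number; end: number; anomalies: Anomaly[] };

const WINDOW_MS = 30 * 60 * 1000;

export function groupIntoEvents(anomalies: Anomaly[]): AnomalyEvent[] {
  const sorted = [...anomalies].sort((a, b) => a.timestamp - b.timestamp);
  const events: AnomalyEvent[] = [];
  for (const a of sorted) {
    const last = events[events.length - 1];
    if (last && a.timestamp - last.end <= WINDOW_MS) {
      last.anomalies.push(a);
      last.end = a.timestamp;
    } else {
      events.push({ start: a.timestamp, end: a.timestamp, anomalies: [a] });
    }
  }
  return events;
}
```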
Alert Management Panel
The final component closes the loop from detection to human action:
| Status | Meaning | Action |
|---|---|---|
| Firing | Active, escalation running | Acknowledge or Resolve |
| Acknowledged | Engineer investigating | Resolve when done |
| Resolved | Incident over | Add resolution notes |
| Silenced | Suppressed (maintenance) | Auto-expires |
Each alert card shows the triggering metric and anomaly details, the alert's severity and current status, and the action buttons available in that state.
The acknowledge button is the most critical interaction. Clicking it moves the alert from Firing to Acknowledged, which halts the escalation chain and records who took ownership and when.
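A sketch of that transition (the Alert shape here is illustrative, not a fixed schema):

```ts
// Acknowledge transition: only a Firing alert can be acknowledged.
type AlertStatus = "firing" | "acknowledged" | "resolved" | "silenced";

type Alert = {
  id: string;
  status: AlertStatus;
  acknowledgedBy?: string;
  acknowledgedAt?: number;
};

export function acknowledge(alert: Alert, engineer: string): Alert {
  if (alert.status !== "firing") {
    throw new Error(`Cannot acknowledge alert in status "${alert.status}"`);
  }
  // Moving out of "firing" is what halts escalation: the escalation
  // loop skips any alert that is no longer firing.
  return {
    ...alert,
    status: "acknowledged",
    acknowledgedBy: engineer,
    acknowledgedAt: Date.now(),
  };
}
```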
Real-Time Updates
For production monitoring, the dashboard needs to update without manual refresh. Two approaches:
Polling — Fetch /api/metrics every 30 seconds. Simple and works everywhere, but it adds server load.
Server-Sent Events (SSE) — The server pushes updates when new anomalies are detected. Lower latency, lower server load, but requires streaming infrastructure.
For this project, polling is sufficient. In production, SSE or WebSockets give engineers sub-second visibility into developing incidents.
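A minimal polling hook for the approach used in this project, assuming the /api/metrics route from above and a 30-second interval:

```tsx
"use client";
// Polling hook: fetch /api/metrics on mount and every 30 seconds after.
import { useEffect, useState } from "react";

export function useMetrics(intervalMs = 30_000) {
  const [metrics, setMetrics] = useState<unknown[]>([]);

  useEffect(() => {
    let cancelled = false;
    const load = async () => {
      const res = await fetch("/api/metrics");
      const body = await res.json();
      if (!cancelled) setMetrics(body.metrics);
    };
    load();
    const id = setInterval(load, intervalMs);
    return () => {
      cancelled = true;
      clearInterval(id);
    };
  }, [intervalMs]);

  return metrics;
}
```

Swapping this for SSE later means replacing the `setInterval` with an `EventSource` subscription; the rest of the UI stays the same.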
This is chapter 5 of AI Anomaly Detection.
Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.