
Monitoring App

Making Detection Visible and Actionable

Why Dashboards Matter

A detection system that nobody looks at is useless. The monitoring dashboard is where your engineering team lives during incidents. It answers three questions:

  • What's happening right now? — Metric cards, status indicators
  • What's abnormal? — Anomaly timeline, severity badges
  • What have we done about it? — Alert status, acknowledgment tracking

    Dashboard Architecture

    The monitoring app is a Next.js application with two layers:

    API Layer — Server-side endpoints that run the detection pipeline:

  • /api/metrics — Returns metric summaries (name, latest value, mean, anomaly count)
  • /api/anomalies — Returns detected anomalies sorted by confidence

    UI Layer — Client-side React components:

  • Metric cards — one per metric stream, showing current health
  • Time-series charts — interactive visualizations with baseline bands
  • Anomaly timeline — chronological feed of detected events
  • Alert management — acknowledge, resolve, silence controls

    import { NextResponse } from "next/server";

    // API endpoint runs the full pipeline on request
    export async function GET() {
      const metrics = ingestAll(60);
      const summaries = metrics.map((m) => {
        // Baseline + z-score detection for this metric stream
        const baseline = computeMovingAverage(m.name, m.points);
        const { anomalies } = computeZScores(m.name, m.points, baseline);
        // Raw values for the latest/mean summary fields
        const values = m.points.map((p) => p.value);
        return {
          name: m.name,
          latest: values[values.length - 1],
          mean: average(values),
          anomalyCount: anomalies.length,
        };
      });
      return NextResponse.json({ metrics: summaries });
    }

    In production, you'd cache the pipeline output and refresh on a schedule (every 30-60 seconds) rather than running the full pipeline on every request.
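
    A minimal sketch of that caching approach, assuming a module-level cache inside the route handler and a hypothetical runPipeline helper that wraps the ingest, baseline, and detection steps:

    import { NextResponse } from "next/server";

    const TTL_MS = 30_000; // refresh at most every 30 seconds

    let cached: { payload: unknown; fetchedAt: number } | null = null;

    // Illustrative stand-in for the full ingest -> baseline -> detect pipeline above.
    async function runPipeline(): Promise<unknown> {
      return { metrics: [] };
    }

    export async function GET() {
      const now = Date.now();
      // Re-run the pipeline only when the cached result is older than the TTL.
      if (!cached || now - cached.fetchedAt > TTL_MS) {
        cached = { payload: await runPipeline(), fetchedAt: now };
      }
      return NextResponse.json(cached.payload);
    }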

    Metric Cards

    The first thing an on-call engineer sees. Each card shows:

  • Metric name and source — immediately identify which system
  • Latest value — the current state
  • Trend indicator — is it going up, down, or stable?
  • Anomaly badge — red count if anomalies detected

    ┌─────────────────────────────┐
    │ api_latency_api_search  api │
    │         156.3 ms            │
    │ Mean: 142.1 | Points: 720   │
    │ ⚠ 12 anomalies detected     │
    └─────────────────────────────┘
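
    A minimal sketch of such a card as a React component, assuming the summary shape returned by /api/metrics above; the component name and the Tailwind-style class names are this sketch's assumptions:

    interface MetricSummary {
      name: string;
      latest: number;
      mean: number;
      anomalyCount: number;
    }

    export function MetricCard({ metric }: { metric: MetricSummary }) {
      const hasAnomalies = metric.anomalyCount > 0;
      return (
        <div className="rounded-lg bg-zinc-900 p-4 text-zinc-100">
          {/* Metric name and source, so the engineer knows which system */}
          <div className="text-sm text-zinc-400">{metric.name}</div>
          {/* Large current value, readable from across the room */}
          <div className="text-3xl font-semibold">{metric.latest.toFixed(1)}</div>
          <div className="text-xs text-zinc-500">Mean: {metric.mean.toFixed(1)}</div>
          {/* Red anomaly badge only when something was detected */}
          {hasAnomalies && (
            <div className="text-xs text-red-400">⚠ {metric.anomalyCount} anomalies detected</div>
          )}
        </div>
      );
    }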

    Design principles for monitoring cards:

  • Dark theme — reduces eye strain during night incidents, makes colored indicators pop
  • Large numbers — the current value should be readable from across the room
  • Color coding — green (normal), yellow (warning), red (critical); a small helper for this mapping is sketched after this list
  • Minimal text — save details for the drill-down view
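
    As a small illustration of the color-coding principle, here is a hedged helper; the thresholds, the CardStatus type, and the Tailwind-style class names are assumptions of this sketch:

    type CardStatus = "normal" | "warning" | "critical";

    // Map a card's anomaly count to a traffic-light status.
    // The thresholds are illustrative; tune them per metric.
    function cardStatus(anomalyCount: number): CardStatus {
      if (anomalyCount === 0) return "normal";
      if (anomalyCount < 5) return "warning";
      return "critical";
    }

    // Tailwind-style classes for the dark theme described above.
    const statusColor: Record<CardStatus, string> = {
      normal: "text-green-400",
      warning: "text-yellow-400",
      critical: "text-red-500",
    };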

    Time-Series Charts

    Charts reveal patterns that numbers alone cannot. A good monitoring chart shows three layers:

  • The metric line — actual values over time (blue/green)
  • The baseline band — shaded area showing expected range (gray)
  • Anomaly markers — dots or vertical lines at detected timestamps (red)

    Value
     400│           ●
        │          /|\
     300│         / | \
        │    ════╗ |  ╔════════════  ← upper bound
     200│  ──────╫─┤──╫──────────── ← mean
        │    ════╝    ╚════════════  ← lower bound
     100│
        └──────────────────────── Time
             Day 10  12  14  16

    When the metric line exits the baseline band, it's immediately visible as abnormal. This is more intuitive than any number or alert text.
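
    One way to prepare data for such a chart, sketched under the assumption that points and baseline bounds share timestamps; the ChartRow shape and field names are illustrative:

    interface Point { timestamp: number; value: number }
    interface BaselinePoint { timestamp: number; mean: number; upper: number; lower: number }

    interface ChartRow {
      timestamp: number;
      value: number;
      upper: number;      // top of the baseline band
      lower: number;      // bottom of the baseline band
      isAnomaly: boolean; // rendered as a red marker
    }

    // Merge raw points, baseline bounds, and anomaly timestamps into
    // one row per timestamp, ready for a charting library.
    function buildChartRows(
      points: Point[],
      baseline: BaselinePoint[],
      anomalyTimestamps: Set<number>,
    ): ChartRow[] {
      const byTime = new Map(baseline.map((b) => [b.timestamp, b] as const));
      return points.map((p) => {
        const b = byTime.get(p.timestamp);
        return {
          timestamp: p.timestamp,
          value: p.value,
          upper: b?.upper ?? p.value,
          lower: b?.lower ?? p.value,
          isAnomaly: anomalyTimestamps.has(p.timestamp),
        };
      });
    }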

    Interactive features:

  • Brush selection — zoom into specific time ranges
  • Tooltip — hover to see exact values, z-scores, and detector results
  • Click anomaly — jump to the anomaly timeline entry for details

    Anomaly Timeline

    A chronological feed of all detected events:

    14:35  🔴 HIGH  api_latency_api_search
           385.2ms (expected 156.3ms) — z-score: 4.2, ensemble: 0.87
           Probable cause: deploy-048 (search-service v1.8.3)
    
    14:35  🟡 MEDIUM  error_rate
           0.082 (expected 0.012) — z-score: 3.1, ensemble: 0.62
    
    10:05  🔴 HIGH  cpu_web_1
           91.2% (expected 42.3%) — ensemble: 0.79

    Group nearby anomalies (within 30 minutes) into "events." A cluster of 12 anomalies across 5 metrics is one incident — presenting them as a group helps engineers understand the scope immediately.
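
    A hedged sketch of that grouping step; the Anomaly and AnomalyEvent shapes and the groupIntoEvents name are assumptions, while the 30-minute window comes from the rule above:

    interface Anomaly { metric: string; timestamp: number; score: number }
    interface AnomalyEvent { start: number; end: number; anomalies: Anomaly[] }

    const WINDOW_MS = 30 * 60 * 1000; // 30-minute grouping window

    // Sort anomalies by time, then start a new event whenever the gap
    // to the previous anomaly exceeds the window.
    function groupIntoEvents(anomalies: Anomaly[]): AnomalyEvent[] {
      const sorted = [...anomalies].sort((a, b) => a.timestamp - b.timestamp);
      const events: AnomalyEvent[] = [];
      for (const a of sorted) {
        const current = events[events.length - 1];
        if (current && a.timestamp - current.end <= WINDOW_MS) {
          current.anomalies.push(a);
          current.end = a.timestamp;
        } else {
          events.push({ start: a.timestamp, end: a.timestamp, anomalies: [a] });
        }
      }
      return events;
    }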

    Alert Management Panel

    The final component closes the loop from detection to human action:

    Status         Meaning                       Action
    Firing         Active, escalation running    Acknowledge or Resolve
    Acknowledged   Engineer investigating        Resolve when done
    Resolved       Incident over                 Add resolution notes
    Silenced       Suppressed (maintenance)      Auto-expires
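
    The same lifecycle expressed as a TypeScript sketch; the status names and the transition map are illustrative, not the project's actual definitions:

    type AlertStatus = "firing" | "acknowledged" | "resolved" | "silenced";

    // Statuses an alert may move to from its current status, mirroring the table above.
    const allowedTransitions: Record<AlertStatus, AlertStatus[]> = {
      firing: ["acknowledged", "resolved", "silenced"],
      acknowledged: ["resolved"],
      resolved: [],          // terminal; only resolution notes are added afterwards
      silenced: ["firing"],  // an expired silence returns to firing if the condition persists
    };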

    Each alert card shows:

  • Severity badge and title
  • Time since triggered
  • Current escalation level (who's been notified)
  • Action buttons (Acknowledge / Resolve / Silence)

    The acknowledge button is the most critical interaction. Clicking it does three things (sketched after this list):

  • Stops the escalation clock
  • Posts "Acknowledged by {name}" to the Slack thread
  • Records the timestamp for MTTD/MTTR tracking
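
    A hedged sketch of what the acknowledge handler might look like; stopEscalation, postToSlackThread, and the Alert shape are illustrative stand-ins, not real project APIs:

    interface Alert {
      id: string;
      status: "firing" | "acknowledged" | "resolved" | "silenced";
      slackThreadId: string;
      acknowledgedAt?: number;
      acknowledgedBy?: string;
    }

    // Illustrative side-effect helpers; swap in your own integrations.
    declare function stopEscalation(alertId: string): Promise<void>;
    declare function postToSlackThread(threadId: string, text: string): Promise<void>;

    async function acknowledgeAlert(alert: Alert, engineer: string): Promise<Alert> {
      await stopEscalation(alert.id); // stop the escalation clock
      await postToSlackThread(alert.slackThreadId, `Acknowledged by ${engineer}`); // update the thread
      return {
        ...alert,
        status: "acknowledged",
        acknowledgedAt: Date.now(), // recorded for MTTD/MTTR tracking
        acknowledgedBy: engineer,
      };
    }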

    Real-Time Updates

    For production monitoring, the dashboard needs to update without manual refresh. Two approaches:

    Polling — Fetch /api/metrics every 30 seconds. Simple and works everywhere, but it adds server load.

    Server-Sent Events (SSE) — The server pushes updates when new anomalies are detected. Lower latency, lower server load, but requires streaming infrastructure.

    For this project, polling is sufficient. In production, SSE or WebSockets give engineers sub-second visibility into developing incidents.
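
    A minimal sketch of the polling approach as a React hook; the hook name and the 30-second interval are just this sketch's choices:

    import { useEffect, useState } from "react";

    const POLL_INTERVAL_MS = 30_000;

    // Poll /api/metrics on a fixed interval and hand the latest payload
    // to whichever component renders the metric cards.
    export function useMetricSummaries<T = unknown>() {
      const [data, setData] = useState<T | null>(null);

      useEffect(() => {
        let cancelled = false;

        async function refresh() {
          const res = await fetch("/api/metrics");
          if (!cancelled && res.ok) setData(await res.json());
        }

        refresh();                                          // initial load
        const id = setInterval(refresh, POLL_INTERVAL_MS);  // then every 30 seconds
        return () => {
          cancelled = true;
          clearInterval(id);
        };
      }, []);

      return data;
    }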

    This is chapter 5 of AI Anomaly Detection.
