The project

Tracewave, from itch to launch

A real-time anomaly-detection pipeline pointed at a live public firehose: ingestion, a stream bus, windowed processing, three online detectors, time-series storage, and a dashboard that catches spikes as they happen. Anomaly detection looks like magic in a notebook and falls apart in production, so I built the production version.

Why it exists

I kept seeing “data” portfolios that were a single static notebook — impressive once, dead on arrival. I wanted to show the thing the notebook hides: a live distributed system, the full data-platform stack end to end, reacting to real internet activity in motion. Wikimedia broadcasts every edit on the planet over an open stream. That's a free, infinite, genuinely unpredictable signal — exactly what you want to point an anomaly detector at when you're trying to prove something runs for real.

How it came together

  1. Mar 2026

    The itch

    Most data portfolios are a static notebook with a confusion matrix. I wanted the opposite — a system you can watch breathe. Wikimedia publishes every edit on Earth as an open SSE stream. Free, infinite, real. Perfect raw material.

  2. Apr 2026

    Design & scoping

    Drew the boxes: ingest → bus → windowed processor → online detectors → storage → live UI. Set the rule that would shape everything — the core must run as one process or a distributed stack from the same code.

  3. early May 2026

    MVP — one process, real feed

    Ingestor, 1-second tumbling windows, and a rolling z-score wired straight to a Next.js chart over WebSockets. Ugly, but alive: real edits drawing a real line within seconds.

  4. mid May 2026 · WHERE IT GOT HARD

    The part that broke everything

    The firehose doesn't burst politely. Unbounded queues ballooned, late events got dropped, and the chart froze on stale values during quiet spells. This is where it got hard. The fixes: bounded buffers that shed the oldest events and count every drop, windows that emit rate=0 instead of freezing, and late events folded into the current window. Boring-sounding fixes; the whole thing was a toy until they existed.

  5. late May 2026

    Three detectors and a referee

    Added EWMA and Half-Space Trees next to the z-score, then the hard part: an ensemble that rewards agreement so one jumpy detector can't cry wolf. Plus the "why" — diffing each spike against decaying per-dimension baselines so a card explains itself.

  6. early Jun 2026

    Split, store, observe

    Proved the transport-agnostic bet: the same Processor moved behind Redis Streams + TimescaleDB with no logic changes. Every service got Prometheus metrics and a Grafana dashboard watching the pipeline's own health.

  7. Jun 2026

    Dashboard polish & launch

    The NOC-console look, tabular figures, slide-and-settle cards, honest empty/stale states — and a self-contained demo stream so the deployed link is alive without a backend. Shipped.

Key features

Live, explained anomalies

Not just "a spike happened" — each card carries the contributing dimensions (wiki, language, namespace, actor type), a confidence score, and which detectors corroborated. Every window is replayable.

Three detectors, compared

Rolling z-score (interpretable baseline), EWMA control chart (adapts to drift), and Half-Space Trees (online, multivariate). The dashboard shows each score over time and an agreement strip.
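
To make the first two detectors concrete, here is a minimal sketch using only the standard library. It is an illustration under stated assumptions, not Tracewave's actual code: the class names, the window of 30, alpha = 0.2 and the 3-sigma thresholds are all hypothetical, and the EWMA variant shown is a simple exponentially weighted mean and variance with a k-sigma band, which is one plain reading of "control chart". Half-Space Trees are left out because they need the river library.

```python
from collections import deque
from statistics import fmean, pstdev


class RollingZScore:
    """Scores each value against the mean and std of a trailing window."""

    def __init__(self, window: int = 30, threshold: float = 3.0) -> None:
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def update(self, x: float) -> tuple[float, bool]:
        z = 0.0
        if len(self.history) >= 2:
            std = pstdev(self.history)
            if std > 0:  # a perfectly flat history cannot be scored
                z = (x - fmean(self.history)) / std
        self.history.append(x)  # score first, then learn
        return z, abs(z) >= self.threshold


class EwmaBand:
    """Exponentially weighted mean and variance with a k-sigma band."""

    def __init__(self, alpha: float = 0.2, k: float = 3.0) -> None:
        self.alpha = alpha
        self.k = k
        self.mean = None
        self.var = 0.0

    def update(self, x: float) -> tuple[float, bool]:
        if self.mean is None:
            self.mean = x  # the first value seeds the baseline
            return 0.0, False
        deviation = x - self.mean
        std = self.var ** 0.5
        score = deviation / std if std > 0 else 0.0
        # Incremental exponentially weighted mean and variance update.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return score, abs(score) > self.k


baseline = [10, 12, 11, 9, 10, 11, 10, 12, 9, 10]
zscore, ewma = RollingZScore(), EwmaBand()
quiet = [(zscore.update(x)[1], ewma.update(x)[1]) for x in baseline]
spike = (zscore.update(40)[1], ewma.update(40)[1])
print(any(a or b for a, b in quiet), spike)  # False (True, True)
```

Both detectors score a value before learning from it, so a spike cannot dilute the baseline it is judged against. The EWMA version forgets old data gradually, which is what lets it follow slow drift.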

Self-observability

Throughput, lag, dropped events, detector fires and p95 window time are all Prometheus metrics, watched by a provisioned Grafana dashboard. The pipeline reports on its own health.

Never an empty link

A recorded replay loop keeps a demo honest with real captured data, and the deployed frontend falls back to an in-browser simulation when no backend is reachable.

Interesting decisions & challenges

Transport-agnostic core

The Processor takes events from an abstract bus and writes to an abstract store. In dev that's an in-memory queue and a ring buffer; in prod, Redis Streams and TimescaleDB. Same code, two topologies — the single biggest design lever in the project.
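
The shape of that abstraction can be sketched with two small protocols. This is a hedged illustration, not the project's code: the names Bus, Store, InMemoryBus, RingBufferStore, poll, write and run_once are all hypothetical, and the processing step is a placeholder. A Redis Streams bus or a TimescaleDB store would be further classes satisfying the same two protocols.

```python
from collections import deque
from typing import Iterable, Protocol


class Bus(Protocol):
    def poll(self) -> Iterable[dict]: ...


class Store(Protocol):
    def write(self, record: dict) -> None: ...


class InMemoryBus:
    """Dev-mode bus: a plain queue drained on each poll."""

    def __init__(self) -> None:
        self._queue = deque()

    def publish(self, event: dict) -> None:
        self._queue.append(event)

    def poll(self) -> Iterable[dict]:
        while self._queue:
            yield self._queue.popleft()


class RingBufferStore:
    """Dev-mode store: keeps only the most recent `capacity` records."""

    def __init__(self, capacity: int) -> None:
        self.records = deque(maxlen=capacity)

    def write(self, record: dict) -> None:
        self.records.append(record)


class Processor:
    """Depends only on the two protocols, never on a concrete transport."""

    def __init__(self, bus: Bus, store: Store) -> None:
        self.bus = bus
        self.store = store

    def run_once(self) -> int:
        handled = 0
        for event in self.bus.poll():
            # Placeholder for the real windowing and detection work.
            self.store.write({"wiki": event["wiki"], "seen": True})
            handled += 1
        return handled


bus, store = InMemoryBus(), RingBufferStore(capacity=2)
for wiki in ("enwiki", "dewiki", "frwiki"):
    bus.publish({"wiki": wiki})
processor = Processor(bus, store)
print(processor.run_once(), [r["wiki"] for r in store.records])
# 3 ['dewiki', 'frwiki']
```

Because Processor only sees poll and write, swapping the topology means changing what gets passed to its constructor and nothing else.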

Deterministic windowing

Windows fold on a caller-supplied clock, not wall time, so the windowing math is unit-tested and reproducible. Quiet periods still emit rate=0 windows so the series — and the detectors — never freeze on a stale value.
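
A minimal sketch of windowing on a caller-supplied clock, assuming 1-second tumbling windows. The names TumblingWindower, Window, add and advance are hypothetical; this illustrates the idea rather than reproducing the project's Processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start: float  # window start, in the caller's clock units
    count: int    # events folded into this window (0 means rate=0)


class TumblingWindower:
    """Fixed-size windows advanced only by timestamps the caller passes in."""

    def __init__(self, size: float, start: float = 0.0) -> None:
        self.size = size
        self._start = start
        self._count = 0

    def advance(self, now: float) -> list[Window]:
        """Close every window that ended at or before `now`, empty ones included."""
        closed = []
        while now >= self._start + self.size:
            closed.append(Window(self._start, self._count))
            self._start += self.size
            self._count = 0
        return closed

    def add(self, now: float) -> list[Window]:
        """Record one event seen at `now`; return any windows that closed."""
        closed = self.advance(now)
        # A timestamp older than the open window closes nothing, so a late
        # event is simply counted in the current window.
        self._count += 1
        return closed


w = TumblingWindower(size=1.0)
out = []
out += w.add(0.2)
out += w.add(0.7)
out += w.add(3.1)      # closes [0,1) plus the empty [1,2) and [2,3)
out += w.advance(4.0)  # a clock tick with no event closes [3,4)
print([(x.start, x.count) for x in out])
# [(0.0, 2), (1.0, 0), (2.0, 0), (3.0, 1)]
```

No call reads the wall clock, so a test can replay any sequence of timestamps and get the same windows every time. The two empty windows in the output are the rate=0 emissions that keep the series moving through a quiet spell.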

Agreement-weighted confidence

Ensemble confidence is the summed score of detectors that fired divided by the number available — so 3/3 firing at 0.8 reads as 0.80 (corroborated), but 1/3 at 0.8 collapses to 0.27 (suppressed unless very strong). Severity escalates on full agreement.
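
The arithmetic above can be checked with a few lines. This is a sketch of the stated rule only, assuming a hypothetical DetectorResult record with scores already normalised to [0, 1]; the severity escalation is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorResult:
    """One detector's verdict for a window (hypothetical shape)."""
    name: str
    score: float  # assumed normalised to [0, 1]
    fired: bool


def ensemble_confidence(results: list[DetectorResult]) -> float:
    """Sum the scores of detectors that fired, divide by detectors available."""
    if not results:
        return 0.0
    fired_total = sum(r.score for r in results if r.fired)
    return fired_total / len(results)


all_agree = [DetectorResult(n, 0.8, True) for n in ("zscore", "ewma", "hst")]
lone_wolf = [
    DetectorResult("zscore", 0.8, True),
    DetectorResult("ewma", 0.1, False),
    DetectorResult("hst", 0.2, False),
]
print(round(ensemble_confidence(all_agree), 2))  # 0.8
print(round(ensemble_confidence(lone_wolf), 2))  # 0.27
```

Dividing by the number of detectors available, rather than the number that fired, is what penalises a lone detector: its score is averaged against two implicit zeros.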

Backpressure as a first-class metric

Bounded buffers shed the oldest events under load rather than growing without limit, and every drop is counted, never silently swallowed. You can see the system protecting itself.
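
A drop-oldest buffer with a visible drop count fits in a few lines. This is a minimal sketch with a hypothetical SheddingBuffer class; in a real service the dropped attribute would presumably back a Prometheus counter, which is an assumption rather than something shown here.

```python
from collections import deque


class SheddingBuffer:
    """Bounded FIFO that evicts the oldest item when full and counts each drop."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._items = deque()
        self._capacity = capacity
        self.dropped = 0  # exposed so the shedding is observable

    def put(self, item) -> None:
        if len(self._items) >= self._capacity:
            self._items.popleft()  # shed the oldest, keep the freshest
            self.dropped += 1
        self._items.append(item)

    def get(self):
        """Return the oldest buffered item, or None when empty."""
        return self._items.popleft() if self._items else None

    def __len__(self) -> int:
        return len(self._items)


buf = SheddingBuffer(capacity=3)
for n in range(5):
    buf.put(n)
print(len(buf), buf.dropped, buf.get())  # 3 2 2
```

A deque built with maxlen would also discard the oldest item on overflow, but it does so silently. Evicting by hand is what makes every drop countable.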

Tech stack & why

Python + httpx: async SSE ingestion that survives reconnects and resumes from the last event id
Redis Streams: a durable bus with consumer groups and natural backpressure between services
river: online ML primitives, including Half-Space Trees that learn from the stream, no batch retrain
TimescaleDB: time-series storage that's just Postgres, so queries stay boring and familiar
FastAPI + WebSockets: low-latency fan-out of metrics and anomalies to every connected dashboard
Next.js + TypeScript: a typed UI whose wire types mirror the backend's exactly
Tailwind + uPlot: a tight design system and a canvas chart fast enough to redraw every second
Prometheus + Grafana: the pipeline watches its own throughput, lag and drops
Docker Compose: the whole distributed stack comes up with one command

Take it further

GitHub repo ↗ · LinkedIn write-up (soon) · Open the dashboard →

Want to build or collaborate on something like this? Contact me →, or reach out directly at charanreddychanda@gmail.com.