The project

Tracewave, from itch to launch

A real-time anomaly-detection pipeline pointed at a live public firehose: ingestion, a stream bus, windowed processing, three online detectors, time-series storage, and a dashboard that catches spikes as they happen. Anomaly detection looks like magic in a notebook and falls apart in production, so I built the production version.

Why it exists

I kept seeing “data” portfolios that were a single static notebook: impressive once, dead on arrival. I wanted to show the thing the notebook hides: a live distributed system, the full data-platform stack end to end, reacting to real internet activity in motion. Wikimedia broadcasts every edit to its wikis over an open stream. That's a free, endless, genuinely unpredictable signal, exactly what you want to point an anomaly detector at when you're trying to prove something runs for real.

How it came together

Mar 2026
The itch
Most data portfolios are a static notebook with a confusion matrix. I wanted the opposite: a system you can watch breathe. Wikimedia publishes every edit to its wikis as an open SSE stream. Free, endless, real. Perfect raw material.
Apr 2026
Design & scoping
Drew the boxes: ingest → bus → windowed processor → online detectors → storage → live UI. Set the rule that would shape everything — the core must run as one process or a distributed stack from the same code.
early May 2026
MVP — one process, real feed
Ingestor, 1-second tumbling windows, and a rolling z-score wired straight to a Next.js chart over WebSockets. Ugly, but alive: real edits drawing a real line within seconds.
mid May 2026 (where it got hard)
The part that broke everything
The firehose doesn't burst politely. Unbounded queues ballooned, late events got dropped, and the chart froze on stale values during quiet spells. This is where it got hard: bounded buffers that shed the oldest and count drops, windows that emit rate=0 instead of freezing, and late events folded into the current window. Boring-sounding fixes; the whole thing was a toy until they existed.
late May 2026
Three detectors and a referee
Added EWMA and Half-Space Trees next to the z-score, then the hard part: an ensemble that rewards agreement so one jumpy detector can't cry wolf. Plus the "why" — diffing each spike against decaying per-dimension baselines so a card explains itself.
early Jun 2026
Split, store, observe
Proved the transport-agnostic bet: the same Processor moved behind Redis Streams + TimescaleDB with no logic changes. Every service got Prometheus metrics and a Grafana dashboard watching the pipeline's own health.
Jun 2026
Dashboard polish & launch
The NOC-console look, tabular figures, slide-and-settle cards, honest empty/stale states — and a self-contained demo stream so the deployed link is alive without a backend. Shipped.

Key features

Live, explained anomalies

Not just "a spike happened" — each card carries the contributing dimensions (wiki, language, namespace, actor type), a confidence score, and which detectors corroborated. Every window is replayable.

Three detectors, compared

Rolling z-score (interpretable baseline), EWMA control chart (adapts to drift), and Half-Space Trees (online, multivariate). The dashboard shows each score over time and an agreement strip.
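
To make the first two detectors concrete, here is a minimal sketch using only the standard library. The class names, parameters, and the warm-up rule are assumptions for illustration, not the project's actual code, and the EWMA version is a simplified adaptive variant rather than the textbook control chart. Half-Space Trees are left out because they depend on the river library.

```python
import math
from collections import deque


class RollingZScore:
    """Scores each value against a fixed-size window of recent history."""

    def __init__(self, window: int = 60, threshold: float = 3.0) -> None:
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def update(self, x: float) -> tuple[float, bool]:
        # Score against history *before* appending x, so a spike cannot
        # inflate the baseline it is judged against.
        z = 0.0
        if len(self.history) >= 2:
            mean = sum(self.history) / len(self.history)
            var = sum((v - mean) ** 2 for v in self.history) / (len(self.history) - 1)
            std = math.sqrt(var)
            # A perfectly flat history has std == 0; this sketch scores it as 0.
            z = (x - mean) / std if std > 0 else 0.0
        self.history.append(x)
        return z, abs(z) > self.threshold


class EwmaDetector:
    """Tracks an exponentially weighted mean and variance (hypothetical helper)."""

    def __init__(self, alpha: float = 0.1, k: float = 3.0, warmup: int = 10) -> None:
        self.alpha = alpha    # weight given to the newest value
        self.k = k            # fire when the deviation exceeds k standard deviations
        self.warmup = warmup  # assumed rule: never fire during the first `warmup` updates
        self.n = 0
        self.mean = 0.0
        self.var = 0.0

    def update(self, x: float) -> tuple[float, bool]:
        self.n += 1
        if self.n == 1:
            self.mean = x
            return 0.0, False
        deviation = x - self.mean
        std = math.sqrt(self.var)
        score = abs(deviation) / std if std > 0 else 0.0
        # Exponentially weighted updates: recent values count more,
        # so the baseline follows slow drift instead of flagging it.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return score, self.n > self.warmup and score > self.k
```

Both detectors return a `(score, fired)` pair, which is one plausible shape for feeding an agreement strip: the scores are plotted over time and the booleans are compared across detectors.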

Self-observability

Throughput, lag, dropped events, detector fires and p95 window time are all Prometheus metrics, watched by a provisioned Grafana dashboard. The pipeline reports on its own health.

Never an empty link

A recorded replay loop keeps a demo honest with real captured data, and the deployed frontend falls back to an in-browser simulation when no backend is reachable.

Interesting decisions & challenges

Transport-agnostic core

The Processor takes events from an abstract bus and writes to an abstract store. In dev that's an in-memory queue and a ring buffer; in prod, Redis Streams and TimescaleDB. Same code, two topologies — the single biggest design lever in the project.
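
One way to picture the abstraction is with structural typing. This is a minimal sketch under stated assumptions: `Bus`, `Store`, `InMemoryBus`, `RingBufferStore`, their method names, and the `wiki` event field are all hypothetical, and the `Processor` body is a stand-in, since the real interfaces are not shown here.

```python
from collections import deque
from typing import Iterable, Protocol


class Bus(Protocol):
    def read(self, max_events: int) -> list[dict]: ...


class Store(Protocol):
    def write(self, row: dict) -> None: ...


class InMemoryBus:
    """Dev bus: a plain in-process queue."""

    def __init__(self, events: Iterable[dict] = ()) -> None:
        self._queue = deque(events)

    def publish(self, event: dict) -> None:
        self._queue.append(event)

    def read(self, max_events: int) -> list[dict]:
        n = min(max_events, len(self._queue))
        return [self._queue.popleft() for _ in range(n)]


class RingBufferStore:
    """Dev store: keeps only the most recent `capacity` rows."""

    def __init__(self, capacity: int = 3600) -> None:
        self.rows = deque(maxlen=capacity)

    def write(self, row: dict) -> None:
        self.rows.append(row)


class Processor:
    """Depends only on the Bus and Store protocols, never on a concrete transport."""

    def __init__(self, bus: Bus, store: Store) -> None:
        self.bus = bus
        self.store = store

    def step(self, max_events: int = 100) -> int:
        events = self.bus.read(max_events)
        # Stand-in logic: count events per wiki for this batch.
        by_wiki: dict[str, int] = {}
        for event in events:
            wiki = event.get("wiki", "unknown")
            by_wiki[wiki] = by_wiki.get(wiki, 0) + 1
        self.store.write({"events": len(events), "by_wiki": by_wiki})
        return len(events)
```

Under this shape, a production topology would supply a Redis Streams-backed class with the same `read` method and a TimescaleDB-backed class with the same `write` method, and `Processor` would not change.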

Deterministic windowing

Windows fold on a caller-supplied clock, not wall time, so the windowing math is unit-tested and reproducible. Quiet periods still emit rate=0 windows so the series — and the detectors — never freeze on a stale value.
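
The idea can be sketched as a tumbling windower that never reads the system clock. The names `Window`, `TumblingWindower`, `observe`, and `advance` are assumptions for illustration; the sketch shows the two behaviours described above: quiet periods still produce rate=0 windows, and late events are folded into the open window rather than dropped.

```python
from dataclasses import dataclass


@dataclass
class Window:
    start: float  # inclusive window start, in the caller's clock units (seconds)
    count: int    # events folded into this window
    rate: float   # events per second


class TumblingWindower:
    """Fixed-size windows driven entirely by timestamps the caller passes in."""

    def __init__(self, size: float = 1.0, start: float = 0.0) -> None:
        self.size = size
        self.current_start = start
        self.count = 0
        self.late = 0

    def observe(self, event_time: float) -> None:
        # An event older than the open window is late. It is counted as late
        # but still folded into the current window instead of being dropped.
        if event_time < self.current_start:
            self.late += 1
        self.count += 1

    def advance(self, now: float) -> list[Window]:
        """Close every window that has fully elapsed as of `now`."""
        closed = []
        while now >= self.current_start + self.size:
            closed.append(Window(self.current_start, self.count, self.count / self.size))
            self.current_start += self.size
            # Later iterations emit empty windows, so gaps show up as rate=0.
            self.count = 0
        return closed
```

Because `now` is an argument, a unit test can replay the same timestamps and get the same windows every time:

```python
w = TumblingWindower(size=1.0)
w.observe(0.2)
w.observe(0.7)
first = w.advance(1.0)   # one window with count=2
w.observe(0.9)           # late: folded into the window starting at 1.0
rest = w.advance(4.5)    # three windows: count=1, then two empty ones
```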

Agreement-weighted confidence

Ensemble confidence is the summed score of the detectors that fired, divided by the number of detectors available. So 3/3 firing at 0.8 reads as 0.80 (corroborated), but 1/3 firing at 0.8 collapses to 0.8 / 3 ≈ 0.27 (suppressed unless very strong). Severity escalates on full agreement.
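
The arithmetic is small enough to check directly. This is a minimal sketch of the rule as stated; the function name and signature are assumptions, and severity escalation is not modelled because its thresholds are not given.

```python
def ensemble_confidence(fired_scores: list[float], available: int) -> float:
    """Sum the scores of detectors that fired, divided by detectors available."""
    if available <= 0:
        return 0.0
    return sum(fired_scores) / available


# All three detectors fire at 0.8: confidence stays at 0.80.
corroborated = round(ensemble_confidence([0.8, 0.8, 0.8], available=3), 2)

# Only one of three fires at 0.8: confidence drops to 0.27.
lone_detector = round(ensemble_confidence([0.8], available=3), 2)
```

Dividing by the number available rather than the number that fired is what makes a lone detector pay for the silence of the other two.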

Backpressure as a first-class metric

Bounded buffers shed the oldest events under load rather than growing without limit, and every drop is counted, never silently swallowed. You can see the system protecting itself.
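
A minimal sketch of a shed-oldest buffer with a drop counter follows. The class name `SheddingBuffer` and its methods are hypothetical; in a setup like the one described, the `dropped` counter is the value that would be exported as a metric.

```python
from collections import deque


class SheddingBuffer:
    """Bounded FIFO that discards the oldest item when full and counts each drop."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.dropped = 0
        self._items = deque()

    def push(self, item) -> None:
        if len(self._items) >= self.capacity:
            # Shed the oldest event to make room, and record that it happened.
            self._items.popleft()
            self.dropped += 1
        self._items.append(item)

    def pop(self):
        """Return the oldest buffered item, or None if the buffer is empty."""
        return self._items.popleft() if self._items else None

    def __len__(self) -> int:
        return len(self._items)
```

A `deque(maxlen=...)` would shed the oldest item on its own, but it does so silently. Doing the eviction by hand is what makes every drop countable.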

Tech stack & why

Python + httpx

async SSE ingestion that survives reconnects and resumes from the last event id

Redis Streams

a durable bus with consumer groups and natural backpressure between services

river

online ML primitives — Half-Space Trees that learn from the stream, no batch retrain

TimescaleDB

time-series storage that's just Postgres, so queries stay boring and familiar

FastAPI + WebSockets

low-latency fan-out of metrics and anomalies to every connected dashboard

Next.js + TypeScript

a typed UI whose wire types mirror the backend's exactly

Tailwind + uPlot

a tight design system and a canvas chart fast enough to redraw every second

Prometheus + Grafana

the pipeline watches its own throughput, lag and drops

Docker Compose

the whole distributed stack comes up with one command

Take it further

GitHub repo ↗
LinkedIn write-up (soon)
Open the dashboard →

Want to build or collaborate on something like this? Contact me → or reach out directly at charanreddychanda@gmail.com.
