The project
Tracewave, from itch to launch
A real-time anomaly-detection pipeline pointed at a live public firehose: ingestion, a stream bus, windowed processing, three online detectors, time-series storage, and a dashboard that catches spikes as they happen. Anomaly detection looks like magic in a notebook and falls apart in production, so I built the production version.
Why it exists
I kept seeing “data” portfolios that were a single static notebook — impressive once, dead on arrival. I wanted to show the thing the notebook hides: a live distributed system, the full data-platform stack end to end, reacting to real internet activity in motion. Wikimedia broadcasts every edit on the planet over an open stream. That's a free, infinite, genuinely unpredictable signal — exactly what you want to point an anomaly detector at when you're trying to prove something runs for real.
How it came together
- Mar 2026
The itch
Most data portfolios are a static notebook with a confusion matrix. I wanted the opposite — a system you can watch breathe. Wikimedia publishes every edit on Earth as an open SSE stream. Free, infinite, real. Perfect raw material.
- Apr 2026
Design & scoping
Drew the boxes: ingest → bus → windowed processor → online detectors → storage → live UI. Set the rule that would shape everything — the core must run as one process or a distributed stack from the same code.
- early May 2026
MVP — one process, real feed
Ingestor, 1-second tumbling windows, and a rolling z-score wired straight to a Next.js chart over WebSockets. Ugly, but alive: real edits drawing a real line within seconds.
- mid May 2026 · WHERE IT GOT HARD
The part that broke everything
The firehose doesn't burst politely. Unbounded queues ballooned, late events got dropped, and the chart froze on stale values during quiet spells. This is where it got hard: bounded buffers that shed the oldest and count drops, windows that emit rate=0 instead of freezing, and late events folded into the current window. Boring-sounding fixes; the whole thing was a toy until they existed.
- late May 2026
Three detectors and a referee
Added EWMA and Half-Space Trees next to the z-score, then the hard part: an ensemble that rewards agreement so one jumpy detector can't cry wolf. Plus the "why" — diffing each spike against decaying per-dimension baselines so a card explains itself.
- early Jun 2026
Split, store, observe
Proved the transport-agnostic bet: the same Processor moved behind Redis Streams + TimescaleDB with no logic changes. Every service got Prometheus metrics and a Grafana dashboard watching the pipeline's own health.
- Jun 2026
Dashboard polish & launch
The NOC-console look, tabular figures, slide-and-settle cards, honest empty/stale states — and a self-contained demo stream so the deployed link is alive without a backend. Shipped.
Key features
Live, explained anomalies
Not just "a spike happened" — each card carries the contributing dimensions (wiki, language, namespace, actor type), a confidence score, and which detectors corroborated. Every window is replayable.
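One way the "contributing dimensions" on a card could be computed is sketched below. It is an illustration, not Tracewave's actual code: the `DecayingBaseline` class, the `decay` value, and ranking by raw excess count are all assumptions. Each (dimension, value) pair keeps an exponentially decayed average of its per-window count, and a spike window is explained by the pairs that most exceed that average.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (dimension, value), e.g. ("wiki", "enwiki")


class DecayingBaseline:
    """Hypothetical per-dimension baseline: a decayed average of window counts."""

    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay  # assumed value; closer to 1.0 means a longer memory
        self.counts: Dict[Key, float] = defaultdict(float)

    def update(self, window_counts: Dict[Key, int]) -> None:
        # Decay every known key, then blend in this window's counts.
        for key in list(self.counts):
            self.counts[key] *= self.decay
        for key, n in window_counts.items():
            self.counts[key] += (1 - self.decay) * n

    def explain(self, window_counts: Dict[Key, int], top: int = 3) -> List[Tuple[Key, float]]:
        # Excess = how far this window's count sits above the decayed baseline.
        excess = {k: n - self.counts.get(k, 0.0) for k, n in window_counts.items()}
        ranked = sorted(excess.items(), key=lambda kv: kv[1], reverse=True)
        return [(k, e) for k, e in ranked[:top] if e > 0]


baseline = DecayingBaseline()
for _ in range(50):  # quiet traffic establishes the baseline
    baseline.update({("wiki", "enwiki"): 10, ("wiki", "dewiki"): 5})

spike = {("wiki", "enwiki"): 11, ("wiki", "dewiki"): 40}
contributors = baseline.explain(spike)
```

Here `contributors` ranks `("wiki", "dewiki")` first, since it accounts for almost all of the excess. The baseline would normally be updated only after a window has been scored, so a spike does not explain itself away.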
Three detectors, compared
Rolling z-score (interpretable baseline), EWMA control chart (adapts to drift), and Half-Space Trees (online, multivariate). The dashboard shows each score over time and an agreement strip.
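As a rough illustration of the first two detectors, here is a minimal sketch. It assumes the detectors score a per-window rate; the class names, default parameters, `min_std` floor, and warm-up count are hypothetical. The second class is a simplified exponentially weighted mean-and-variance band, a relative of the textbook EWMA control chart rather than a faithful copy of it. Half-Space Trees are omitted for length.

```python
import math
from collections import deque


class RollingZScore:
    """Scores each value against the mean and std of a fixed-size history."""

    def __init__(self, window: int = 60, threshold: float = 3.0, min_std: float = 1e-6) -> None:
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.min_std = min_std  # assumed floor so a flat history cannot divide by zero

    def update(self, x: float) -> float:
        z = 0.0
        if len(self.values) >= 2:
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / (len(self.values) - 1)
            z = (x - mean) / max(math.sqrt(var), self.min_std)
        self.values.append(x)  # score first, then let x join the history
        return z


class EwmaBand:
    """Flags values outside mean +/- k*std, both tracked with exponential weights."""

    def __init__(self, alpha: float = 0.1, k: float = 3.0, warmup: int = 10) -> None:
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, x: float) -> bool:
        self.n += 1
        if self.mean is None:
            self.mean = x
            return False
        diff = x - self.mean
        fired = self.n > self.warmup and self.var > 0 and abs(diff) > self.k * math.sqrt(self.var)
        # Incremental exponentially weighted mean and variance update.
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return fired


zscore, ewma = RollingZScore(), EwmaBand()
rates = [10.0, 12.0] * 10            # steady, slightly noisy traffic
z_quiet = [zscore.update(r) for r in rates]
ewma_quiet = [ewma.update(r) for r in rates]
z_spike = zscore.update(30.0)        # sudden burst
ewma_spike = ewma.update(30.0)
```

Because the band's mean and variance keep moving with the data, it tolerates slow drift that would eventually trip a fixed-window z-score, which is the trade-off the comparison on the dashboard is meant to surface.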
Self-observability
Throughput, lag, dropped events, detector fires and p95 window time are all Prometheus metrics, watched by a provisioned Grafana dashboard. The pipeline reports on its own health.
Never an empty link
A recorded replay loop keeps a demo honest with real captured data, and the deployed frontend falls back to an in-browser simulation when no backend is reachable.
Interesting decisions & challenges
Transport-agnostic core
The Processor takes events from an abstract bus and writes to an abstract store. In dev that's an in-memory queue and a ring buffer; in prod, Redis Streams and TimescaleDB. Same code, two topologies — the single biggest design lever in the project.
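The shape of that abstraction might look like the sketch below. Only the idea (an abstract bus, an abstract store, one Processor) comes from the project; the method names `poll` and `write`, the window payload, and the in-memory classes are assumptions made for illustration. A Redis Streams bus or a TimescaleDB store would satisfy the same two protocols.

```python
from collections import deque
from typing import Deque, Iterable, List, Protocol


class Bus(Protocol):
    """Anything the Processor can pull events from."""
    def poll(self, max_events: int) -> List[dict]: ...


class Store(Protocol):
    """Anything the Processor can write finished windows to."""
    def write(self, window: dict) -> None: ...


class InMemoryBus:
    """Dev-mode bus: a plain in-process queue."""

    def __init__(self, events: Iterable[dict] = ()) -> None:
        self._queue: Deque[dict] = deque(events)

    def publish(self, event: dict) -> None:
        self._queue.append(event)

    def poll(self, max_events: int) -> List[dict]:
        batch: List[dict] = []
        while self._queue and len(batch) < max_events:
            batch.append(self._queue.popleft())
        return batch


class RingBufferStore:
    """Dev-mode store: keeps only the most recent `capacity` windows."""

    def __init__(self, capacity: int) -> None:
        self._windows: Deque[dict] = deque(maxlen=capacity)

    def write(self, window: dict) -> None:
        self._windows.append(window)

    def recent(self) -> List[dict]:
        return list(self._windows)


class Processor:
    """Depends only on the Bus and Store protocols, never on a concrete transport."""

    def __init__(self, bus: Bus, store: Store) -> None:
        self.bus = bus
        self.store = store

    def step(self, max_events: int = 100) -> int:
        events = self.bus.poll(max_events)
        self.store.write({"count": len(events)})  # placeholder for real window logic
        return len(events)


bus = InMemoryBus([{"wiki": "enwiki"}, {"wiki": "dewiki"}, {"wiki": "enwiki"}])
store = RingBufferStore(capacity=2)
processor = Processor(bus, store)
first, second = processor.step(), processor.step()
```

Swapping topologies then means constructing the Processor with different arguments, and the in-memory pair doubles as the test harness.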
Deterministic windowing
Windows fold on a caller-supplied clock, not wall time, so the windowing math is unit-tested and reproducible. Quiet periods still emit rate=0 windows so the series — and the detectors — never freeze on a stale value.
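A minimal sketch of that behaviour follows, assuming 1-second tumbling windows and a `now` value passed in by the caller. The `TumblingWindower` and `Window` names, and the `late` counter, are hypothetical. Nothing in it reads the system clock, so a test can drive time by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    start: float  # inclusive window start, in seconds
    count: int    # events folded into this window
    rate: float   # events per second


class TumblingWindower:
    def __init__(self, width: float = 1.0, start: float = 0.0) -> None:
        self.width = width
        self.current_start = start
        self.count = 0
        self.late = 0  # events whose timestamp predates the open window

    def advance(self, now: float) -> List[Window]:
        """Close every window that ended at or before `now`, including empty ones."""
        closed: List[Window] = []
        while now >= self.current_start + self.width:
            closed.append(Window(self.current_start, self.count, self.count / self.width))
            self.current_start += self.width
            self.count = 0
        return closed

    def add(self, event_ts: float, now: float) -> List[Window]:
        """Fold one event into the open window, closing finished windows first."""
        closed = self.advance(now)
        if event_ts < self.current_start:
            self.late += 1  # late event: counted here rather than dropped
        self.count += 1
        return closed


w = TumblingWindower(width=1.0, start=0.0)
w.add(0.2, now=0.2)
w.add(0.7, now=0.7)
quiet = w.advance(3.5)          # two quiet seconds pass with no events
w.add(0.9, now=3.6)             # a straggler from the first window arrives
after_late = w.advance(4.0)
```

In this run `quiet` holds three windows: one with two events, then two with `rate` 0.0, so the series keeps moving through the silence. The straggler lands in the window starting at 3.0 and increments `late`.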
Agreement-weighted confidence
Ensemble confidence is the summed score of detectors that fired divided by the number available — so 3/3 firing at 0.8 reads as 0.80 (corroborated), but 1/3 at 0.8 collapses to 0.27 (suppressed unless very strong). Severity escalates on full agreement.
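The arithmetic above can be written out in a few lines. This is a sketch: the function names and the convention of passing `None` for a detector that stayed quiet are assumptions, and the project's actual severity rules are not shown.

```python
from typing import Dict, Optional


def ensemble_confidence(scores: Dict[str, Optional[float]]) -> float:
    """Sum of scores from detectors that fired, divided by detectors available.

    `scores` maps detector name to its score, or None if it did not fire.
    """
    if not scores:
        return 0.0
    fired = [s for s in scores.values() if s is not None]
    return sum(fired) / len(scores)


def full_agreement(scores: Dict[str, Optional[float]]) -> bool:
    """True when every available detector fired (the case where severity escalates)."""
    return bool(scores) and all(s is not None for s in scores.values())


corroborated = {"zscore": 0.8, "ewma": 0.8, "hst": 0.8}
lone_voice = {"zscore": 0.8, "ewma": None, "hst": None}

print(round(ensemble_confidence(corroborated), 2))  # 0.8
print(round(ensemble_confidence(lone_voice), 2))    # 0.27
```

Dividing by the number available rather than the number that fired is what makes a lone detector pay for the silence of the other two.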
Backpressure as a first-class metric
Bounded buffers shed the oldest events under load rather than growing without limit, and every drop is counted, never silently swallowed. You can see the system protecting itself.
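A drop-oldest buffer with a visible counter can be sketched as follows. The class name and attributes are hypothetical; in the real pipeline the `dropped` count would presumably feed the Prometheus dropped-events metric rather than sit on a plain attribute.

```python
from collections import deque


class SheddingBuffer:
    """Bounded FIFO that evicts the oldest item when full and counts every eviction."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.dropped = 0
        self._items = deque()

    def push(self, item) -> None:
        if len(self._items) >= self.capacity:
            self._items.popleft()  # shed the oldest event to make room
            self.dropped += 1      # every drop is counted, never silent
        self._items.append(item)

    def pop(self):
        """Return the oldest buffered item, or None if the buffer is empty."""
        return self._items.popleft() if self._items else None

    def __len__(self) -> int:
        return len(self._items)


buf = SheddingBuffer(capacity=3)
for event_id in range(5):  # a burst larger than the buffer
    buf.push(event_id)
```

After the burst, the buffer holds events 2, 3 and 4 and `buf.dropped` is 2. A `deque(maxlen=...)` would evict the oldest item too, but it does so silently, which is why the eviction is done by hand here.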
Tech stack & why
Take it further
Want to build or collaborate on something like this? Contact me, or reach out directly at charanreddychanda@gmail.com.