Data pipeline monitoring: what to track and how to act on it

A data pipeline is only as reliable as your ability to know when it's broken. Most teams discover failures one of two ways: a downstream consumer notices something is wrong, or an alert fires after significant data loss has already occurred. Neither is acceptable in a production environment where decisions depend on data freshness.

The three failure modes to monitor

Data pipeline failures cluster into three categories. Ingestion failures occur when data stops arriving or arrives corrupted — a source schema change, an API rate limit, a network partition. Processing failures occur when transformation logic breaks on unexpected input or a processing node crashes mid-run. Delivery failures occur when processed data fails to reach its destination — a full disk, a closed connection, a misconfigured sink.

Each failure mode requires different monitoring instrumentation and different response protocols.

Metrics every pipeline should emit

Throughput measures events processed per unit of time. A sudden drop in throughput is often the earliest signal of an ingestion or processing problem.

Lag measures the delay between when an event occurs and when it is available downstream. Steadily growing lag indicates a processing backlog.

Error rate measures the percentage of events that fail processing. Even a small rate adds up at scale: a 0.1% error rate at 10 million events per day is 10,000 failed records daily.

Last seen is the timestamp of the most recent successfully processed event. It is critical for detecting silent failures, where the pipeline appears healthy but has stopped receiving data.
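As a rough illustration of how these four metrics might be derived from raw event records, here is a minimal Python sketch. The ProcessedEvent and WindowMetrics types, the summarize_window and is_stale helpers, and all field names are hypothetical; a real pipeline would usually emit these values through its metrics library rather than compute them in batch. The sketch also assumes throughput counts every handled event, including failures, and reports the worst lag in the window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ProcessedEvent:
    """Hypothetical record of one event that passed through the pipeline."""
    occurred_at: datetime    # when the event happened at the source
    processed_at: datetime   # when the pipeline finished handling it
    succeeded: bool          # False if processing failed for this event


@dataclass
class WindowMetrics:
    throughput_per_sec: float      # events handled per second in the window
    max_lag_sec: float             # worst occurrence-to-availability delay
    error_rate: float              # fraction of events that failed (0.0 to 1.0)
    last_seen: Optional[datetime]  # latest successful processing time, if any


def summarize_window(events: list[ProcessedEvent], window_sec: float) -> WindowMetrics:
    """Compute throughput, lag, error rate, and last seen for one window."""
    if window_sec <= 0:
        raise ValueError("window_sec must be positive")
    ok = [e for e in events if e.succeeded]
    lags = [(e.processed_at - e.occurred_at).total_seconds() for e in ok]
    return WindowMetrics(
        throughput_per_sec=len(events) / window_sec,
        max_lag_sec=max(lags, default=0.0),
        error_rate=(len(events) - len(ok)) / len(events) if events else 0.0,
        last_seen=max((e.processed_at for e in ok), default=None),
    )


def is_stale(last_seen: Optional[datetime], now: datetime, max_silence: timedelta) -> bool:
    """True when no successful event has been seen within max_silence."""
    return last_seen is None or now - last_seen > max_silence
```

The is_stale check is what catches the silent-failure case: an empty window yields an error rate of zero, so only the age of last_seen reveals that nothing is arriving.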

Setting meaningful thresholds

Static thresholds — alert when throughput drops below 1,000 events/sec — are a starting point but break quickly in dynamic systems. Traffic patterns vary by time of day, day of week, and product season. A throughput of 500 events/sec at 3am might be completely normal; the same number at peak hours signals a crisis.

Percentage-based thresholds relative to a recent baseline (for example, alert when throughput drops more than 30% below the trailing 1-hour average) adapt to your actual traffic patterns and produce fewer false positives than a fixed cutoff.
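One possible way to express a baseline-relative threshold is sketched below in Python. The BaselineDropDetector class and its parameters are hypothetical names chosen for illustration; it assumes one throughput sample per minute, so a window of 60 samples approximates the trailing 1-hour average, and it stays silent until it has collected a minimum number of samples.

```python
from collections import deque
from typing import Deque, Optional


class BaselineDropDetector:
    """Flags a throughput sample that falls too far below a trailing average."""

    def __init__(self, window_size: int = 60, max_drop: float = 0.30, min_samples: int = 10):
        # With one sample per minute (an assumption), 60 samples is about 1 hour.
        self.samples: Deque[float] = deque(maxlen=window_size)
        self.max_drop = max_drop        # 0.30 means "alert on a drop of more than 30%"
        self.min_samples = min_samples  # avoid alerting on a nearly empty baseline

    def baseline(self) -> Optional[float]:
        """Trailing average, or None until enough samples have been collected."""
        if len(self.samples) < self.min_samples:
            return None
        return sum(self.samples) / len(self.samples)

    def observe(self, throughput: float) -> bool:
        """Record a sample; return True if it should raise an alert."""
        base = self.baseline()  # computed before the new sample is added
        alert = base is not None and base > 0 and throughput < base * (1 - self.max_drop)
        self.samples.append(throughput)
        return alert
```

Because every sample is added to the window, a sustained outage gradually pulls the baseline down and the alert eventually stops firing. Whether to exclude alerting samples from the baseline is a design choice that depends on how quickly legitimate traffic shifts should be absorbed.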

Runbooks and on-call hygiene

An alert without a runbook is an interrupt without a resolution path. For every alert your pipeline can fire, there should be a documented response: what the alert means, what is likely causing it, and the steps to diagnose and resolve it. Good runbooks shorten mean time to resolution and make on-call rotations sustainable for the people on them.
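The three parts of a runbook described above could be kept as structured data next to the alert definitions, which makes it possible to check automatically that no alert ships without one. The following Python sketch assumes hypothetical alert names, a hypothetical Runbook type, and illustrative diagnostic steps; none of these come from a particular alerting tool.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass(frozen=True)
class Runbook:
    meaning: str                    # what the alert means
    likely_causes: tuple[str, ...]  # what is probably causing it
    steps: tuple[str, ...]          # how to diagnose and resolve it


# Hypothetical registry keyed by alert name; the steps are illustrative only.
RUNBOOKS: dict[str, Runbook] = {
    "throughput_drop": Runbook(
        meaning="Throughput fell well below its recent baseline.",
        likely_causes=("source schema change", "API rate limit", "network partition"),
        steps=(
            "Confirm the source is reachable and still sending data.",
            "Inspect recent ingestion errors for schema or rate-limit messages.",
            "Check whether lag is growing, which points at processing instead.",
        ),
    ),
}


def alerts_missing_runbooks(alert_names: Iterable[str], runbooks: Mapping[str, Runbook]) -> list[str]:
    """Return the alerts that have no runbook, or a runbook with no steps."""
    return sorted(
        name for name in alert_names
        if name not in runbooks or not runbooks[name].steps
    )
```

Running a check like alerts_missing_runbooks in continuous integration is one way to keep the rule "every alert has a runbook" from drifting as alerts are added.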

Closing the loop

Pipeline monitoring isn't a set-and-forget system. Review your alert history monthly: which alerts fired most frequently, which produced the most false positives, and which failures weren't caught by any alert. Iterate on your thresholds and runbooks based on real incidents. The goal is a monitoring system that gets quieter over time, not louder.
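A monthly review along these lines could start from a simple tally of the alert history. The Python sketch below assumes each firing has been labeled actionable or not by whoever responded, and that incidents and the alerts that caught them share identifiers; AlertFiring, summarize_alert_history, and uncaught_incidents are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AlertFiring:
    alert_name: str
    actionable: bool  # False means the responder judged it a false positive


def summarize_alert_history(firings: list[AlertFiring]) -> list[tuple[str, int, float]]:
    """Return (alert name, times fired, false-positive rate), noisiest first."""
    fired = Counter(f.alert_name for f in firings)
    false_pos = Counter(f.alert_name for f in firings if not f.actionable)
    rows = [(name, count, false_pos[name] / count) for name, count in fired.items()]
    # Sort by fire count descending, then by name for a stable order.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


def uncaught_incidents(incident_ids: set[str], alerted_incident_ids: set[str]) -> set[str]:
    """Incidents that no alert caught; candidates for new alerts or thresholds."""
    return incident_ids - alerted_incident_ids
```

The first function answers which alerts fired most and which were noisiest; the second surfaces the failures that slipped past every alert.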
