Data Quality
Data quality covers defining, monitoring, and maintaining data integrity. It is critical in large organizations, where bad data can drive incorrect business decisions.
Quality Dimensions
| Dimension | Question |
| Completeness | Are all required values present? |
| Accuracy | Do values correctly represent real-world entities? |
| Consistency | Are the same facts represented the same way across systems? |
| Timeliness | Is data available when needed, and is it fresh enough? |
| Uniqueness | Are there duplicate records? |
| Validity | Does data conform to defined formats, types, ranges? |
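Three of the dimensions above (completeness, uniqueness, validity) can be measured mechanically. The following is a minimal sketch in plain Python; the function names, the sample rows, and the 0 to 130 age range are illustrative assumptions, not part of any particular tool.

```python
def completeness(rows, column):
    """Fraction of rows with a non-null value in `column`."""
    if not rows:
        return 1.0
    present = sum(1 for r in rows if r.get(column) is not None)
    return present / len(rows)


def duplicate_keys(rows, key):
    """Key values that appear more than once (uniqueness)."""
    seen, dupes = set(), set()
    for r in rows:
        k = r.get(key)
        if k in seen:
            dupes.add(k)
        seen.add(k)
    return dupes


def invalid_rows(rows, column, predicate):
    """Rows whose non-null value fails the predicate (validity)."""
    return [
        r for r in rows
        if r.get(column) is not None and not predicate(r[column])
    ]


# Hypothetical sample data, for illustration only.
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 2, "email": "c@example.com", "age": -5},
    {"id": 4, "email": "d@example.com", "age": 41},
]

email_completeness = completeness(rows, "email")   # 3 of 4 rows have an email
dupes = duplicate_keys(rows, "id")                 # id 2 appears twice
bad_age = invalid_rows(rows, "age", lambda v: 0 <= v <= 130)
```

Accuracy, consistency, and timeliness generally need an external reference (a source of truth, another system, or a clock), so they are not covered by this sketch.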
Data Quality Practices
- Define quality rules and thresholds per dataset/column
- Automated checks in pipelines (pre-load AND post-load)
- Quality dashboards and alerting on SLA breaches
- Root cause analysis for quality issues
- Data profiling to discover anomalies
| Tool | Type |
| Great Expectations | Open-source Python validation |
| dbt tests | Built-in + custom assertions on models |
| Apache Griffin | Open-source data quality framework for big data |
| Monte Carlo, Bigeye, Soda | Commercial observability platforms |
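The first two practices (per-column rules with thresholds, run as automated checks) can be sketched independently of the tools listed above. The `Rule` class, the `evaluate` function, and the sample batch are hypothetical; this is not the API of Great Expectations, dbt, or any other listed product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    column: str
    check: Callable[[object], bool]  # True when a single value is acceptable
    min_pass_rate: float             # threshold: required fraction of passing rows


def evaluate(rows, rules):
    """Return the pass rate and an ok flag for every rule."""
    results = {}
    total = len(rows)
    for rule in rules:
        passed = sum(1 for r in rows if rule.check(r.get(rule.column)))
        rate = passed / total if total else 1.0
        results[rule.name] = {"pass_rate": rate, "ok": rate >= rule.min_pass_rate}
    return results


# Hypothetical rules: one strict, one tolerating up to 5% bad rows.
rules = [
    Rule("order_id_not_null", "order_id", lambda v: v is not None, 1.0),
    Rule("amount_non_negative", "amount", lambda v: v is not None and v >= 0, 0.95),
]

batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": -3.0},
    {"order_id": 3, "amount": 7.5},
    {"order_id": 4, "amount": 1.0},
]

report = evaluate(batch, rules)
failed = [name for name, result in report.items() if not result["ok"]]
```

Running the same `evaluate` call once on the staged batch (pre-load) and once on the target table (post-load) is one way to realise the "pre-load AND post-load" practice.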
Data Observability
Goes beyond pipeline monitoring to detect:
- Schema changes
- Distribution shifts
- Null rate changes
- Volume anomalies
- Freshness violations
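As a rough illustration of how such signals can be derived, the sketch below profiles two batches and compares them for schema changes, null rate changes, and volume anomalies. The `profile` and `compare` helpers and both tolerance values are assumptions chosen for the example; distribution shifts and freshness are not covered here.

```python
def profile(rows):
    """Summarize a batch: observed types and null rate per column, plus row count."""
    columns = {}
    for r in rows:
        for col, val in r.items():
            stats = columns.setdefault(col, {"types": set(), "nulls": 0})
            if val is None:
                stats["nulls"] += 1
            else:
                stats["types"].add(type(val).__name__)
    n = len(rows)
    return {
        "row_count": n,
        "schema": {c: sorted(s["types"]) for c, s in columns.items()},
        "null_rate": {c: (s["nulls"] / n if n else 0.0) for c, s in columns.items()},
    }


def compare(baseline, current, null_rate_tol=0.10, volume_tol=0.30):
    """List differences between two profiles that exceed the tolerances."""
    findings = []
    for col in sorted(set(baseline["schema"]) | set(current["schema"])):
        if col not in current["schema"]:
            findings.append(f"schema: column '{col}' removed")
        elif col not in baseline["schema"]:
            findings.append(f"schema: column '{col}' added")
        elif baseline["schema"][col] != current["schema"][col]:
            findings.append(f"schema: column '{col}' type changed")
        if col in baseline["null_rate"] and col in current["null_rate"]:
            delta = current["null_rate"][col] - baseline["null_rate"][col]
            if abs(delta) > null_rate_tol:
                findings.append(f"null rate: '{col}' moved by {delta:+.2f}")
    if baseline["row_count"]:
        change = (current["row_count"] - baseline["row_count"]) / baseline["row_count"]
        if abs(change) > volume_tol:
            findings.append(f"volume: row count changed by {change:+.0%}")
    return findings


# Hypothetical batches: today's load is smaller, has a new column,
# and delivers price as a string with one null.
yesterday = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 3.0},
    {"id": 3, "price": 4.0},
    {"id": 4, "price": 1.0},
]
today = [
    {"id": 1, "price": "9.5", "currency": "EUR"},
    {"id": 2, "price": None, "currency": "EUR"},
]

findings = compare(profile(yesterday), profile(today))
for line in findings:
    print(line)
```

Comparing against a single previous batch keeps the example short; in practice the baseline would more likely be a rolling window of recent profiles.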
Pipeline Monitoring
| Category | Metrics |
| Pipeline health | DAG success/failure rate, task duration, retry count |
| Data freshness | Time since last update, SLA compliance |
| Data volume | Row count per load, deviation from expected |
| Infrastructure | CPU/memory/disk on workers, scheduler lag |
| Data quality | Failed validation count, null rate trends |
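The data freshness row can be turned into a check along the following lines. This is a sketch under stated assumptions: the `freshness_status` function, the 6-hour SLA, and the 80% warning point are all illustrative, and the last-update timestamp is assumed to be available from table metadata or a load log.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(last_updated, sla, now=None, warn_fraction=0.8):
    """Classify data age against an SLA: 'ok', 'warning', or 'breach'."""
    now = now or datetime.now(timezone.utc)
    age = now - last_updated
    if age > sla:
        return "breach", age
    if age > sla * warn_fraction:
        return "warning", age
    return "ok", age


# Fixed timestamps so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sla = timedelta(hours=6)

status_recent, _ = freshness_status(datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc), sla, now)
status_aging, _ = freshness_status(datetime(2024, 1, 1, 6, 30, tzinfo=timezone.utc), sla, now)
status_stale, age_stale = freshness_status(datetime(2024, 1, 1, 5, 0, tzinfo=timezone.utc), sla, now)
```

Returning the age alongside the status lets the same value feed both an SLA compliance metric and a "time since last update" gauge.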
Alerting Best Practices
- Alert on SLA breaches, not just failures
- Use anomaly detection for row counts (sudden drop/spike)
- Escalation tiers (warning -> critical -> pager)
- Include runbook links in alert messages
- Avoid alert fatigue: tune thresholds and group related alerts
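Several of these practices can be combined in one small check: a z-score on recent row counts for anomaly detection, mapped to escalation tiers, with a runbook link in the message. The z-score cut-offs, the function names, and the runbook URL below are hypothetical, and a z-score is only one of several possible anomaly detectors.

```python
from statistics import mean, stdev


def row_count_severity(history, current, warn_z=2.0, critical_z=3.0, page_z=5.0):
    """Map the deviation of `current` from recent history to an escalation tier."""
    if len(history) < 2:
        return None  # not enough history to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        z = 0.0 if current == mu else float("inf")
    else:
        z = abs(current - mu) / sigma
    if z >= page_z:
        return "pager"
    if z >= critical_z:
        return "critical"
    if z >= warn_z:
        return "warning"
    return None


def build_alert(dataset, history, current, runbook_url):
    """Return an alert payload, or None when the row count looks normal."""
    severity = row_count_severity(history, current)
    if severity is None:
        return None
    return {
        "severity": severity,
        "message": f"{dataset}: row count {current} deviates from recent loads",
        "runbook": runbook_url,
    }


# Hypothetical history of daily row counts and a placeholder runbook URL.
history = [1000, 1020, 980, 1010, 990]
runbook = "https://example.com/runbooks/row-count-anomaly"

normal = build_alert("orders", history, 1005, runbook)
drop = build_alert("orders", history, 960, runbook)
collapse = build_alert("orders", history, 400, runbook)
```

Widening `warn_z` is the simplest knob for reducing alert fatigue in this sketch; grouping related alerts would happen downstream in whatever system routes the payloads.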
Gotchas
- A data lake without governance becomes a "data swamp"
- Row-level security must be tested with actual user accounts
- The Prometheus pull model requires every scrape target to be reachable over the network from the Prometheus server
- A _SUCCESS file guarantees job completion, not data correctness
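To illustrate the last point, the sketch below treats the _SUCCESS marker as a necessary but not sufficient condition and adds a basic content check on top. The `output_is_usable` helper, the `part-*.csv` naming, and the minimum row count are assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def output_is_usable(output_dir, min_rows=1):
    """Require the completion marker AND a minimum number of data rows."""
    output_dir = Path(output_dir)
    if not (output_dir / "_SUCCESS").exists():
        return False, "job did not complete (no _SUCCESS marker)"
    rows = 0
    for part in sorted(output_dir.glob("part-*.csv")):
        with part.open(newline="") as fh:
            rows += sum(1 for _ in csv.DictReader(fh))
    if rows < min_rows:
        return False, f"job completed but wrote {rows} rows"
    return True, f"{rows} rows"


# Simulate a job that finished "successfully" but wrote only a header.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "_SUCCESS").touch()
    (out / "part-00000.csv").write_text("id,amount\n")
    empty_result = output_is_usable(out)

    # The same directory after a data row is present.
    (out / "part-00000.csv").write_text("id,amount\n1,10.0\n")
    filled_result = output_is_usable(out)
```

A row count is the weakest useful check; the same hook is a natural place to run the dataset's full quality rules before downstream jobs consume the output.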
See Also