Data Warehouse Architecture¶

★★★★★ Intermediate

A data warehouse (DWH) is a subject-oriented, integrated, non-volatile, time-variant collection of data organized to support decision-making (Bill Inmon). DWH exists because OLTP and OLAP workloads are fundamentally incompatible in one system.

OLTP vs OLAP¶

Characteristic	OLTP	OLAP
Purpose	Transaction processing	Analytical queries
Queries	Simple, frequent, identical	Complex, infrequent, ad-hoc
Users	Many concurrent	Few analysts
Data	Current, operational	Historical, aggregated
Operations	Read/write (CRUD)	Mostly read
Latency	Micro/milliseconds	Seconds to minutes
Examples	PostgreSQL, MySQL (per-service)	Snowflake, ClickHouse, Greenplum

DWH Properties (Inmon)¶

Property	Meaning
Subject-oriented	Covers specific business domains; irrelevant data excluded
Integrated	No contradictions between data from different sources
Non-volatile	Data only appended, never modified or deleted
Time-variant	Maintains maximum historical depth

DWH Layers¶

Sources -> Staging Area -> Central Warehouse -> Data Marts -> BI/Analytics

Layer	Purpose
Staging Area	Raw data landing, cleansing, normalization before loading
MDM (Master Data)	Reference/lookup data ensuring consistency
Central Warehouse	Historical, subject-oriented, integrated data
Data Marts / Cubes	Specialized views for specific analytical tasks

Design Approaches¶

Kimball (Bottom-Up)¶

DWH = collection of data marts
Design marts per consumer need (star schema)
Fast to deliver business value
Drawbacks: marts may define metrics differently; hard to standardize

Inmon (Top-Down)¶

Build enterprise DWH first (3NF), then derive marts
Single source of truth, better consistency
Drawbacks: longer initial development, more complex

Data Vault 2.0¶

Handles many sources with different schemas
Append-only, full auditability
Business rules isolated in mart layer
See data vault for details

Platform Evolution¶

Gen 1: RDBMS DWH (Oracle, DB2) - 1990s
  -> Rigid schema, ETL, batch only

Gen 2: Hadoop ecosystem (HDFS + MapReduce) - 2007+
  -> Data Lake, schema-on-read, horizontal scale

Gen 3: Cloud DWH + Lakehouse - 2015+
  -> Snowflake, Databricks, BigQuery
  -> ELT, auto-scaling, separation of compute/storage

Gen 4: Data Mesh / Data Products - 2020+
  -> Domain-owned data, self-serve platform, federated governance

Choosing Architecture¶

Scenario	Recommended
Small company, structured data, BI focus	Cloud DWH (Snowflake, BigQuery)
Mixed data types, ML workloads	Data Lakehouse (Databricks, Iceberg)
Enterprise, many sources, compliance	DWH + Data Lake hybrid
Real-time + batch analytics	Lambda/Kappa with streaming layer
Many autonomous data teams	Data Mesh with self-serve platform

Gotchas¶

Star schema denormalization trades storage for query performance - acceptable for OLAP
Data Vault adds many joins - mart layer must pre-join for acceptable query performance
OLTP databases should never be queried directly for analytics - always replicate to DWH
Batch window must complete before business hours - monitor and alert on SLA breaches
Data Lake without governance becomes "data swamp"