Data Vault 2.0¶

Data Vault is a data modeling methodology designed for agile DWH environments with many sources, frequent schema changes, and audit requirements. It solves the rigidity of 3NF and duplication issues with dimensional modeling.

Core Table Types¶

Table	Purpose	Key Fields
Hub	Business entities (customer, product)	Hash key, business key, load timestamp, source code
Link	Relationships/transactions between 2+ hubs	Hash key, hub foreign keys, load timestamp, source
Satellite	Descriptive attributes with SCD2 history	Hub/Link hash key, attributes, valid_from, valid_to, source

Key Principles¶

Data enters as-is from sources (no business transformation in core)
Business rules applied only in mart layer
All tables append-only - history tracked via SCD2 in satellites
Hash keys for joins (deterministic, no sequences)
Separates structure (hubs/links) from content (satellites)

Advantages¶

Easy to add new sources (just add new satellites)
Full auditability (every change tracked with timestamps and source codes)
Parallelizable loading (hubs, links, satellites load independently)
Handles many sources with different schemas
Resilient to source changes

Disadvantages¶

Complex for direct querying (need hub + satellite + link joins)
Typically feeds dimensional marts for BI consumption
SCD2 joins across satellites with different change frequencies = complexity

SCD2 (Slowly Changing Dimension Type 2)¶

Tracks historical changes by adding new rows with validity dates:

customer_id | name    | valid_from | valid_to   | is_current
1           | Alice   | 2023-01-01 | 2023-06-15 | false
1           | Alice B | 2023-06-15 | 9999-12-31 | true

When joining satellites from different hubs/links, date ranges must overlap - can cause Cartesian explosion.

Anchor Modeling¶

Extreme form of Data Vault where every attribute gets its own table (6NF-like):

Component	Similar To	Description
Anchor	Hub	Entity identified by surrogate ID
Attribute	Satellite	Single attribute per table, with timestamp
Tie	Link	Relationship between anchors
Knot	Reference	Shared reference data (enums, lookups)

Key Differences from Data Vault¶

More granular: one attribute = one table (vs satellites grouping attributes)
Easier schema evolution (add attribute = add table, no ALTER)
Results in many tables and many JOINs
Better for environments with frequent schema changes

Choosing a Methodology¶

Methodology	Best For	Trade-off
3NF (Inmon)	Enterprise DWH core layer	Complex to query, good for integration
Dimensional (Kimball)	Marts, BI, fast queries	Denormalized, harder to maintain
Data Vault	Agile DWH, multiple sources, audit	Complex to query, needs marts on top
Anchor	Frequently changing schemas	Very many tables, many JOINs

Gotchas¶

Data Vault adds many joins - mart layer must pre-join for acceptable performance
SCD2 date range joins across different satellites can produce unexpected Cartesian products
Hash keys must use consistent hashing across all loading processes
Without proper mart layer, analysts cannot query Data Vault directly