Skip to content

Data Vault 2.0

Data Vault is a data modeling methodology designed for agile DWH environments with many sources, frequent schema changes, and audit requirements. It solves the rigidity of 3NF and duplication issues with dimensional modeling.

Core Table Types

Table Purpose Key Fields
Hub Business entities (customer, product) Hash key, business key, load timestamp, source code
Link Relationships/transactions between 2+ hubs Hash key, hub foreign keys, load timestamp, source
Satellite Descriptive attributes with SCD2 history Hub/Link hash key, attributes, valid_from, valid_to, source

Key Principles

  • Data enters as-is from sources (no business transformation in core)
  • Business rules applied only in mart layer
  • All tables append-only - history tracked via SCD2 in satellites
  • Hash keys for joins (deterministic, no sequences)
  • Separates structure (hubs/links) from content (satellites)

Advantages

  • Easy to add new sources (just add new satellites)
  • Full auditability (every change tracked with timestamps and source codes)
  • Parallelizable loading (hubs, links, satellites load independently)
  • Handles many sources with different schemas
  • Resilient to source changes

Disadvantages

  • Complex for direct querying (need hub + satellite + link joins)
  • Typically feeds dimensional marts for BI consumption
  • SCD2 joins across satellites with different change frequencies = complexity

SCD2 (Slowly Changing Dimension Type 2)

Tracks historical changes by adding new rows with validity dates:

customer_id | name    | valid_from | valid_to   | is_current
1           | Alice   | 2023-01-01 | 2023-06-15 | false
1           | Alice B | 2023-06-15 | 9999-12-31 | true

When joining satellites from different hubs/links, date ranges must overlap - can cause Cartesian explosion.

Anchor Modeling

Extreme form of Data Vault where every attribute gets its own table (6NF-like):

Component Similar To Description
Anchor Hub Entity identified by surrogate ID
Attribute Satellite Single attribute per table, with timestamp
Tie Link Relationship between anchors
Knot Reference Shared reference data (enums, lookups)

Key Differences from Data Vault

  • More granular: one attribute = one table (vs satellites grouping attributes)
  • Easier schema evolution (add attribute = add table, no ALTER)
  • Results in many tables and many JOINs
  • Better for environments with frequent schema changes

Choosing a Methodology

Methodology Best For Trade-off
3NF (Inmon) Enterprise DWH core layer Complex to query, good for integration
Dimensional (Kimball) Marts, BI, fast queries Denormalized, harder to maintain
Data Vault Agile DWH, multiple sources, audit Complex to query, needs marts on top
Anchor Frequently changing schemas Very many tables, many JOINs

Gotchas

  • Data Vault adds many joins - mart layer must pre-join for acceptable performance
  • SCD2 date range joins across different satellites can produce unexpected Cartesian products
  • Hash keys must use consistent hashing across all loading processes
  • Without proper mart layer, analysts cannot query Data Vault directly

See Also