Skip to content

Data Modeling and Normalization

Data modeling is the process of analyzing requirements and creating data structures. Three stages: conceptual (business entities), logical (ER diagrams with attributes/keys), physical (DDL scripts with indexes/partitions). Normalization eliminates redundancy through functional dependencies.

Normal Forms

Mnemonic: "Data depends on the key [1NF], the whole key [2NF], and nothing but the key [3NF]."

First Normal Form (1NF)

  • All values are atomic (no arrays, no composite values)
  • No duplicate rows
  • Has a primary key

Second Normal Form (2NF)

  • Satisfies 1NF
  • Every non-key attribute depends on the entire primary key
  • Only relevant for composite keys - if PK is simple, 2NF is automatic

Third Normal Form (3NF)

  • Satisfies 2NF
  • No transitive dependencies: non-key A depends on non-key B which depends on PK
  • Fix: extract the transitively dependent group into a separate table

Beyond 3NF

  • BCNF, 4NF, 5NF - rarely needed in practice
  • 6NF - extreme decomposition, one attribute per table. Used in [[data-vault#anchor-modeling|anchor modeling]]

Functional Dependencies

{X} -> {Y} means: if two tuples agree on X, they must agree on Y. Each X value determines exactly one Y value.

Anomalies (When Normalization is Insufficient)

Anomaly Problem
Update Changing info requires updating ALL rows containing that entity
Delete Deleting last related row loses the entity data entirely
Insert Cannot insert an entity that has no related records yet

Keys

Key Type Definition
Candidate key Unique + irreducible subset of attributes
Primary key Chosen candidate key
Alternative key Non-chosen candidate keys
Surrogate key System-generated ID (auto-increment, UUID)
Natural key Business attribute (email, SSN)
Composite key Multiple attributes forming a candidate key

Prefer surrogate keys when multiple candidates exist or natural key may change.

ER Diagrams

Cardinality Types

Type Implementation
1:1 FK with UNIQUE constraint
1:M FK in the "many" table
M:M Junction table with FKs to both

Notations

Martin (Crow's Foot) - most practical, expected in interviews: - Rectangles = entities with attributes listed inside - Crow's foot (fork) = "many", single line = "one" - Circle = "zero" (optional), dash = "one" (mandatory)

Chen Notation - academic: rectangles (entities), diamonds (relationships), ovals (attributes)

Normalization vs Denormalization

Criterion Normalized (3NF) Denormalized
Data integrity Better (no anomalies) Worse
INSERT/UPDATE/DELETE Faster (smaller tables) Slower
SELECT (reads) Slower (many JOINs) Faster (wide tables)
Best for OLTP (write integrity) OLAP (read speed, marts)

Deduplication Patterns

-- Method 1: CTE + ROW_NUMBER (most common)
WITH ranked AS (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY key_col1, key_col2
        ORDER BY load_date DESC
    ) AS rn
    FROM source_table
)
SELECT * FROM ranked WHERE rn = 1;

-- Method 2: QUALIFY (Snowflake, BigQuery, DuckDB)
SELECT * FROM source_table
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY key_col1, key_col2
    ORDER BY load_date DESC
) = 1;

-- Method 3: DISTINCT ON (PostgreSQL only)
SELECT DISTINCT ON (key_col1, key_col2) *
FROM source_table
ORDER BY key_col1, key_col2, load_date DESC;

Gotchas

  • Relations (mathematical) have no duplicate tuples and no ordering. Tables (physical) can violate both
  • A relation is automatically in 1NF by definition (set theory). Tables need explicit checking
  • In DWH, constraints (PK, FK) are often disabled for faster bulk loading
  • Most production DWHs are hybrids - Data Vault or 3NF in core, dimensional in marts

See Also