Causal Inference¶

Determining cause-and-effect from data. Correlation does not imply causation - but with the right methods, we can get closer to causal claims even without randomized experiments.

Methods Hierarchy (by strength)¶

Randomized A/B tests - gold standard
Quasi-experiments - groups form naturally rather than by randomization
Counterfactual analysis - no control group; the baseline is modeled

DAGs (Directed Acyclic Graphs)¶

Visualize causal relationships. Identify confounders, mediators, and colliders.

Confounder: affects both treatment and outcome (must control for it)
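Mediator: on the causal path between treatment and outcome (controlling for it removes part of the effect, so leave it out when estimating the total effect)
Collider: caused by both treatment and outcome (don't control for it - conditioning on a collider creates a spurious association)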

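Example: School GPA -> University admission. Motivation, income, and school quality may each influence both GPA and admission, and they also influence each other, so drawing the DAG helps decide which of them to control for.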
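
To make the confounder rule concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: a binary confounder `z`, a binary treatment `t`, an outcome `y`, and a true treatment effect of 2.0. It compares a naive difference in means with one that stratifies on the confounder.

```python
import random
from statistics import fmean

def simulate(n=20000, seed=0):
    """Toy data: z confounds t and y; the true effect of t on y is 2.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0                   # confounder
        t = 1 if rng.random() < (0.8 if z else 0.2) else 0   # z raises the odds of treatment
        y = 2.0 * t + 3.0 * z + rng.gauss(0, 1)              # z also raises the outcome
        rows.append((z, t, y))
    return rows

def mean_diff(rows):
    """Mean outcome of treated units minus mean outcome of untreated units."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return fmean(treated) - fmean(control)

def adjusted_diff(rows):
    """Stratify on z, then average per-stratum differences weighted by stratum size."""
    total = 0.0
    for z in (0, 1):
        stratum = [r for r in rows if r[0] == z]
        total += mean_diff(stratum) * len(stratum) / len(rows)
    return total

rows = simulate()
naive = mean_diff(rows)         # mixes the effect of t with the effect of z
adjusted = adjusted_diff(rows)  # compares like with like inside each z stratum
print(f"naive={naive:.2f} adjusted={adjusted:.2f}")
```

In this setup the naive estimate should land well above 2.0, while the stratified estimate should sit close to it. Stratification only works here because the confounder is observed and discrete.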

Difference-in-Differences (DiD)¶

Compare treatment and control groups before and after intervention.

Effect = (treatment_after - treatment_before) - (control_after - control_before)

Requires: parallel trends assumption - groups would have followed the same trend without intervention.
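
The formula above can be sketched in a few lines. The metric values below are hypothetical, and `did_estimate` and `slope` are illustrative helper names, not part of any library. The second helper compares pre-period slopes, a common sanity check that is consistent with parallel trends but cannot prove the assumption.

```python
from statistics import fmean

def did_estimate(treat_before, treat_after, control_before, control_after):
    """Difference-in-differences computed on group means."""
    treat_change = fmean(treat_after) - fmean(treat_before)
    control_change = fmean(control_after) - fmean(control_before)
    return treat_change - control_change

def slope(ys):
    """OLS slope of a series against its time index 0..n-1."""
    xs = range(len(ys))
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

# Hypothetical weekly metric values for each group and period.
treat_before = [10.0, 11.1, 11.9, 13.0, 14.0]
control_before = [7.0, 8.0, 9.1, 9.9, 11.0]
treat_after = [18.0, 19.0, 20.0]
control_after = [12.0, 13.0, 14.0]

effect = did_estimate(treat_before, treat_after, control_before, control_after)
trend_gap = slope(treat_before) - slope(control_before)  # near zero if pre-trends match
print(f"effect={effect:.2f} pre-period trend gap={trend_gap:.2f}")
```

A trend gap far from zero would be a warning that the control group is a poor stand-in for the treatment group's counterfactual.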

Propensity Score Matching (PSM)¶

When groups aren't randomly assigned:

1. Estimate probability of receiving treatment for each unit (propensity score)
2. Match treatment units to control units with similar propensity scores
3. Compare outcomes between matched pairs

Validation: check covariate distributions become similar after matching.
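
The three steps and the balance check can be sketched with the standard library alone. This is a toy under stated assumptions: the data is simulated with a true effect of 1.0, the propensity model is a hand-rolled logistic regression (`fit_logistic` is an illustrative helper, not a library call), and matching is 1-nearest-neighbor with replacement. In practice a statistics library would usually replace the hand-written fit.

```python
import math
import random
from statistics import fmean

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def fit_logistic(X, t, lr=1.0, steps=150):
    """Logistic regression by batch gradient descent; returns [bias, w1, ...]."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - ti
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

# Toy data (all numbers hypothetical): x1 and x2 drive both treatment and outcome.
rng = random.Random(0)
X, t, y = [], [], []
for _ in range(2000):
    x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    ti = 1 if rng.random() < sigmoid(x1 + 0.5 * x2) else 0
    X.append((x1, x2))
    t.append(ti)
    y.append(1.0 * ti + 2.0 * x1 + 1.0 * x2 + rng.gauss(0, 1))  # true effect = 1.0

# Step 1: estimate a propensity score for every unit.
w = fit_logistic(X, t)
scores = [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]

# Step 2: match each treated unit to the control with the closest score (with replacement).
treated = [i for i, ti in enumerate(t) if ti == 1]
controls = [i for i, ti in enumerate(t) if ti == 0]
pairs = [(i, min(controls, key=lambda j: abs(scores[j] - scores[i]))) for i in treated]

# Step 3: average the outcome difference within matched pairs (effect on the treated).
att = fmean(y[i] - y[j] for i, j in pairs)
naive = fmean(y[i] for i in treated) - fmean(y[j] for j in controls)

# Validation: the gap in mean x1 between groups should shrink after matching.
gap_before = fmean(X[i][0] for i in treated) - fmean(X[j][0] for j in controls)
gap_after = fmean(X[i][0] - X[j][0] for i, j in pairs)
print(f"naive={naive:.2f} matched={att:.2f} x1 gap: {gap_before:.2f} -> {gap_after:.2f}")
```

The brute-force nearest-neighbor search is quadratic in the number of units, which is fine for a sketch but would need sorting or a caliper for larger data.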

Synthetic Control¶

When you can't split users at all:

1. Deploy the feature in a treatment region
2. Build a model that predicts the treatment region's metric from the control regions, fitted on pre-launch data
3. Compare the actual metric against the predicted baseline
4. Difference = estimated effect

Limitation: needs control regions, fragile to extraordinary events.
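
A minimal sketch of these steps follows. It is a deliberately simplified stand-in: a one-variable least-squares fit against the average of two control regions, not the constrained donor-weight optimization of the full synthetic control method. The regions, the 30/10-day split, and the true lift of 5.0 are all assumptions made for the toy data.

```python
import math
import random
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = fmean(xs), fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

# Toy daily metric (all numbers hypothetical): 30 pre-launch days, 10 post-launch days.
rng = random.Random(1)
PRE, POST, TRUE_LIFT = 30, 10, 5.0
days = range(PRE + POST)
control_a = [100 + 0.5 * d + 5 * math.sin(d / 3) + rng.gauss(0, 0.2) for d in days]
control_b = [90 + 0.4 * d + 5 * math.sin(d / 3) + rng.gauss(0, 0.2) for d in days]
control_avg = [(a + b) / 2 for a, b in zip(control_a, control_b)]
treated = [20 + 0.8 * c + rng.gauss(0, 0.2) + (TRUE_LIFT if d >= PRE else 0.0)
           for d, c in enumerate(control_avg)]

# Fit the model on the pre-launch period only.
intercept, slope = fit_line(control_avg[:PRE], treated[:PRE])

# Predict what the treatment region would have done without the feature.
predicted = [intercept + slope * c for c in control_avg[PRE:]]

# The average gap between actual and predicted is the estimated effect.
effect = fmean(actual - pred for actual, pred in zip(treated[PRE:], predicted))
print(f"estimated lift={effect:.2f}")
```

Fitting on pre-launch data only matters: including post-launch days would let the model absorb part of the effect it is supposed to measure.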

Instrumental Variables (IV)¶

When treatment assignment is confounded:

- Find a variable (instrument) that affects treatment, affects the outcome only through treatment, and is unrelated to the confounders
- Use the instrument to isolate the causal variation in treatment
- Classic example: distance to college as an instrument for education's effect on earnings
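
For a single instrument, the idea reduces to a ratio of covariances (the Wald estimator). The sketch below uses simulated data in which the instrument is valid by construction. The variable names, the coefficients, and the true effect of 2.0 are assumptions for illustration.

```python
import random
from statistics import fmean

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = fmean(a), fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

# Toy data (all coefficients hypothetical). u is an unobserved confounder.
rng = random.Random(2)
z, t, y = [], [], []
for _ in range(20000):
    u = rng.gauss(0, 1)
    zi = rng.gauss(0, 1)                        # instrument: independent of u
    ti = zi + u + rng.gauss(0, 1)               # treatment depends on instrument and confounder
    yi = 2.0 * ti + 3.0 * u + rng.gauss(0, 1)   # true effect is 2.0; z enters only through t
    z.append(zi)
    t.append(ti)
    y.append(yi)

ols = cov(t, y) / cov(t, t)  # confounded: absorbs part of u's effect
iv = cov(z, y) / cov(z, t)   # uses only the variation in t that is driven by z
print(f"ols={ols:.2f} iv={iv:.2f}")
```

With real data the validity of the instrument cannot be read off the numbers this way; it has to be argued from domain knowledge.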

Regression Discontinuity (RDD)¶

When treatment is assigned by a threshold (test score, age, income level):

- Compare units just above and just below the threshold
- Near the cutoff, assignment is effectively random, provided units cannot precisely manipulate which side they land on
- Estimate the treatment effect at the discontinuity
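
One common way to estimate the jump is a local linear fit on each side of the cutoff. The sketch below assumes a sharp design (everyone above the cutoff is treated) and uses simulated data. The cutoff of 60, the bandwidth of 15, and the true jump of 3.0 are hypothetical values.

```python
import random
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = fmean(xs), fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

def rdd_effect(x, y, cutoff, bandwidth):
    """Fit a line on each side within the bandwidth; return the jump at the cutoff."""
    # Center x at the cutoff so each intercept is the fitted value at the threshold.
    left = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff <= xi <= cutoff + bandwidth]
    left_at_cutoff, _ = fit_line(*zip(*left))
    right_at_cutoff, _ = fit_line(*zip(*right))
    return right_at_cutoff - left_at_cutoff

# Toy data (all numbers hypothetical): treatment switches on at a score of 60.
rng = random.Random(3)
CUTOFF = 60.0
score, outcome = [], []
for _ in range(10000):
    s = rng.uniform(0, 100)
    score.append(s)
    outcome.append(10 + 0.2 * s + (3.0 if s >= CUTOFF else 0.0) + rng.gauss(0, 1))

effect = rdd_effect(score, outcome, cutoff=CUTOFF, bandwidth=15.0)
print(f"estimated jump at cutoff={effect:.2f}")
```

The bandwidth is a trade-off: a narrow window keeps units comparable but leaves fewer observations, so it is worth checking that the estimate is stable across a few bandwidths.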

Gotchas¶

Omitted variable bias: missing confounders invalidate causal claims
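The DiD parallel trends assumption cannot be tested directly: you can only check that pre-period trends match, which does not guarantee they would have stayed parallel afterwards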
PSM only controls for observed confounders - hidden confounders remain
Synthetic control is fragile with few control units
Don't confuse prediction models with causal models - a feature can predict well without being causal
"Natural experiments" are only as good as the argument for exogeneity