Hypothesis Testing and A/B Testing¶

Statistical framework for making decisions from data. A/B testing applies these principles to controlled experiments. Critical skill for data scientists who need to distinguish real effects from noise.

Hypothesis Testing Framework¶

Formulate hypotheses: H0 (null - no effect) and H1 (alternative - there is an effect)
Choose significance level alpha (typically 0.05)
Compute test statistic from data
Calculate p-value or compare to critical value
Decision: reject H0 if p-value < alpha

Error Types¶

	H0 True	H0 False
Reject H0	Type I (alpha)	Correct
Fail to reject	Correct	Type II (beta)

Type I (false positive): declaring effect when none exists. Controlled by alpha
Type II (false negative): missing a real effect. Power = 1 - beta
p-value: probability of observing data as extreme as ours, assuming H0 is true

Common Tests¶

Test	Use Case	Assumptions
t-test (Student's)	Compare means of two groups	~Normal, similar variance
Welch's t-test	Compare means, unequal variance	~Normal
Mann-Whitney U	Compare distributions non-parametrically	Ordinal+
z-test for proportions	Compare binary proportions	Large n
Chi-square	Test independence of categoricals	Expected count >= 5
ANOVA	Compare means of 3+ groups	Normal, equal variance

Confidence Intervals¶

from scipy import stats
import numpy as np

data = np.array([...])
confidence = 0.95
n = len(data)
mean = data.mean()
se = stats.sem(data)  # standard error
ci = stats.t.interval(confidence, df=n-1, loc=mean, scale=se)

For mean with known variance: x_bar +/- z_(alpha/2) * sigma / sqrt(n) For mean with unknown variance: x_bar +/- t_(alpha/2, n-1) * s / sqrt(n)

A/B Testing¶

Compare two versions (A=control, B=test) on independent user groups. Measure business metrics.

Why Not Just Offline Metrics?¶

Offline model metrics (precision, NDCG) don't directly measure business impact. Companies care about: session length, clicks, conversions, retention, revenue. A/B tests connect model changes to real user behavior.

Experimental Design¶

Split users into control/test groups (hash-based: hash(user_id + salt) % 100)
Run for fixed period (don't peek and stop early)
Compare metrics between groups
Apply statistical tests for significance

Test Power¶

Power = probability of detecting a real effect. Influenced by: - Sample size: more users = more power (main lever) - Effect size: larger effects are easier to detect - Variance: lower variance = more power (use CUPED) - Significance level: higher alpha = more power but more false positives

CUPED (Controlled-experiment Using Pre-Experiment Data)¶

Reduces metric variance by 20-30% using pre-experiment data.

Y_cuped = Y - theta * X
theta* = cov(Y, X) / Var(X)
Var(Y_cuped) = Var(Y) * (1 - rho^2(Y, X))

Higher correlation between pre- and during-experiment metric = more variance reduction.

Measuring CTR¶

CTR = clicks / views. Problem: individual clicks aren't independent (same user generates multiple).

Solutions: - Per-user aggregation: aggregate clicks per user, then t-test - Bootstrap: resample users with replacement, compute CTR per sample - Bucketing: hash users into ~100 buckets, compute CTR per bucket, t-test on buckets - Linearization: LinearizedCTR = clicks - A * views (A = global CTR from control)

Threats to Validity¶

Novelty effect: new users overreact positively. Inflates results
Change aversion: experienced users resist changes. Increases false negatives
Detection: compare effect between new vs experienced users
SRM (Sample Ratio Mismatch): groups aren't equal size - indicates broken randomization

When A/B Tests Are Impossible¶

Two-sided marketplaces (taxi, delivery): affecting one side impacts the other
Visible features: users in different groups see each other's experience
Feature already live: can't run reverse test

Alternatives: - Switchback testing: alternate algorithms across geography x time slots - Synthetic control: deploy everywhere, compare to predicted baseline from control regions - Difference-in-Differences: compare treatment vs control before/after intervention - Propensity Score Matching: match treatment units to similar control units

Causal Inference Methods¶

Method	Strength	When to Use
Randomized A/B	Gold standard	Full control over assignment
DiD	Moderate	Can't randomize, have before/after data
PSM	Moderate	Can't randomize, have covariates
Synthetic Control	Weak	Can't split at all, have control regions

Gotchas¶

Peeking problem: checking results daily and stopping when significant inflates false positive rate. Use sequential testing (SPRT) if you must peek
Multiple comparisons: testing 20 metrics at alpha=0.05 means ~1 will be "significant" by chance. Apply Bonferroni correction
P-hacking: trying many analyses until one is significant. Pre-register analysis plan
Accuracy trap with A/B: 1% increase in CTR can be worth millions - don't dismiss small but significant effects
Don't stockpile results: seasonality means last quarter's results may not generalize