Probability Theory and Statistical Inference¶

Rigorous probability foundations and statistical estimation methods. Bridges the gap between math theory and practical statistics used in ML.

Probability Axioms¶

P(A) >= 0 for any event A
P(sample space) = 1
For mutually exclusive events: P(A1 or A2 or ...) = sum P(Ai)

Key Theorems¶

Law of Large Numbers (Weak)¶

Sample mean converges in probability to population mean as n -> inf. Justifies using sample statistics to estimate population parameters.

Central Limit Theorem¶

For iid random variables with mean mu and variance sigma^2: (X_bar - mu) / (sigma / sqrt(n)) -> N(0,1) as n -> inf

Regardless of original distribution. This is why normal distribution appears everywhere.

Chebyshev's Inequality¶

P(|X - mu| >= k*sigma) <= 1/k^2 for ANY distribution. - k=2: at most 25% of data beyond 2 sigma (empirical: ~5% for normal) - k=3: at most 11% beyond 3 sigma (empirical: ~0.3% for normal)

Statistical Estimation¶

Point Estimation Properties¶

Unbiased: E[theta_hat] = theta (on average, correct)
Consistent: theta_hat -> theta as n -> inf (converges to truth)
Efficient: achieves minimum variance (Cramer-Rao bound)
Sufficient: captures all information about theta from the sample

Maximum Likelihood Estimation (MLE)¶

Find parameters that maximize the likelihood of observed data.

L(theta) = product f(x_i | theta)

In practice, maximize log-likelihood: l(theta) = sum log f(x_i | theta)

Properties: MLE is asymptotically efficient, consistent, and normally distributed.

Example (normal distribution): MLE for mu = sample mean, MLE for sigma^2 = sample variance (biased by n/(n-1)).

Method of Moments¶

Equate sample moments to theoretical moments. Simpler than MLE but often less efficient.

Hypothesis Testing (Theory)¶

Framework¶

H0 (null) and H1 (alternative)
Choose alpha (significance level, typically 0.05)
Compute test statistic
Compare to critical value or compute p-value
Reject H0 if p-value < alpha

Error Types¶

Type I (alpha): reject true H0 (false positive)
Type II (beta): fail to reject false H0 (false negative)
Power = 1 - beta (probability of detecting real effect)

Confidence Intervals¶

Known variance: x_bar +/- z_(alpha/2) * sigma / sqrt(n)
Unknown variance: x_bar +/- t_(alpha/2, n-1) * s / sqrt(n)

Student's t-distribution: heavier tails than normal. Used for small samples or unknown population variance.

Convergence Types¶

From weakest to strongest: 1. In distribution: CDFs converge 2. In probability: P(|X_n - X| > epsilon) -> 0 3. Almost surely: P(lim X_n = X) = 1 4. In mean (L^p): E[|X_n - X|^p] -> 0

Asymptotic Analysis¶

f = o(g): f grows strictly slower. lim f/g = 0
f = O(g): f grows no faster (up to constant)
f ~ g: asymptotically equivalent. lim f/g = 1
Hierarchy: ln(n) << n^a << a^n << n! << n^n

Conditional Expectation and Variance¶

E[X] = E[E[X|Y]] (law of total expectation)
Var(X) = E[Var(X|Y)] + Var(E[X|Y]) (law of total variance)

Useful in: Bayesian analysis, mixture models, hierarchical models.

Gotchas¶

MLE for variance is biased (divide by n, not n-1). Sample variance uses n-1 (Bessel's correction)
CLT requires finite variance - fails for Cauchy and other heavy-tailed distributions
p-value is NOT the probability that H0 is true. It's P(data | H0)
Confidence interval: "95% of intervals constructed this way contain the true parameter" (frequentist interpretation)
Statistical significance != practical significance. A tiny effect can be statistically significant with large n