
Probability and Distributions

Core probability theory and common distributions used throughout data science and ML. Understanding distributions is essential for choosing appropriate statistical tests, building generative models, and interpreting results.

Probability Fundamentals

Probability P(A) = measure of how likely event A is. Range: [0, 1].

Classical definition: P(A) = favorable outcomes / total equally likely outcomes. Frequentist definition: P(A) = lim(N->inf) count(A) / N.

Key Rules

Rule                 Formula
Complement           P(not A) = 1 - P(A)
Addition             P(A or B) = P(A) + P(B) - P(A and B)
Mutually exclusive   P(A or B) = P(A) + P(B)  (when P(A and B) = 0)
Independence         P(A and B) = P(A) * P(B)
Conditional          P(A|B) = P(A and B) / P(B), for P(B) > 0

"At least one" pattern: P(at least one) = 1 - P(none). Much easier than direct computation.
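A quick sketch of the pattern, using a hypothetical example (at least one six in four die rolls):

```python
# P(at least one 6 in 4 fair die rolls) = 1 - P(no 6 in any roll)
p_none = (5 / 6) ** 4
p_at_least_one = 1 - p_none
print(round(p_at_least_one, 4))  # ~0.5177
```

Computing this directly would mean summing P(exactly 1) + P(exactly 2) + P(exactly 3) + P(exactly 4); the complement does it in one step.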

Bayes' Theorem

P(A|B) = P(B|A) * P(A) / P(B)

  • P(A|B): posterior (what we want)
  • P(B|A): likelihood
  • P(A): prior
  • P(B): evidence (marginal)

Gotcha: P(A|B) != P(B|A). P(disease|positive test) is NOT P(positive test|disease).
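A worked example of this gotcha, with hypothetical numbers (1% prevalence, 95% sensitivity, 5% false positive rate):

```python
p_d = 0.01              # prior: P(disease)
p_pos_given_d = 0.95    # likelihood: P(positive | disease)
p_pos_given_not_d = 0.05  # P(positive | no disease)

# Evidence via total probability: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes: posterior P(disease | positive)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # ~0.161, far below the 0.95 likelihood
```

Despite a 95% accurate test, a positive result implies only a ~16% chance of disease, because the 1% prior dominates.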

Total Probability

P(A) = sum_i P(A|B_i) * P(B_i) where {B_i} partitions the sample space.

Random Variables

Discrete: countable values (die roll, number of customers). Described by PMF. Continuous: any value in a range (height, temperature). Described by PDF.

  • Expected value: E(X) = sum(x_i * P(x_i)) or integral(x * f(x) dx)
  • Variance: Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2
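A minimal check of the discrete formulas for a fair six-sided die (an illustrative choice, not from the text):

```python
import numpy as np

x = np.arange(1, 7)        # outcomes 1..6
p = np.full(6, 1 / 6)      # uniform PMF

ex = (x * p).sum()         # E(X) = 3.5
ex2 = (x ** 2 * p).sum()   # E(X^2) = 91/6
var = ex2 - ex ** 2        # Var(X) = E[X^2] - (E[X])^2 ~ 2.9167
print(ex, round(var, 4))
```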

Common Discrete Distributions

Bernoulli

Single trial with success probability p.

  • E(X) = p, Var(X) = p(1-p)

Binomial

k successes in n independent Bernoulli trials.

  • P(k) = C(n,k) * p^k * (1-p)^(n-k)
  • E(X) = np, Var(X) = np(1-p)

Poisson

Number of events in a fixed interval with rate lambda.

  • P(k) = (lambda^k / k!) * e^(-lambda)
  • E(X) = Var(X) = lambda
  • Use for: rare events, count data (calls/hour, accidents/month)

Common Continuous Distributions

Normal (Gaussian)

Bell-shaped, symmetric. Defined by mu (mean) and sigma (std dev).

from scipy import stats
norm = stats.norm(loc=0, scale=1)
norm.pdf(0)      # density at 0
norm.cdf(1.96)   # P(X <= 1.96) ~ 0.975
norm.ppf(0.95)   # 95th percentile ~ 1.645

Central Limit Theorem: sum/average of many independent random variables tends to normal regardless of their individual distributions. This is why the normal distribution is so common.

Exponential

Time between independent events occurring at a constant rate lambda.

  • Memoryless: P(X > s + t | X > s) = P(X > t)
  • E(X) = 1/lambda, Var(X) = 1/lambda^2
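The memoryless property can be checked by simulation; a sketch with arbitrary choices of lambda, s, and t:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
x = rng.exponential(scale=1 / lam, size=1_000_000)

s, t = 0.5, 1.0
lhs = (x > s + t).sum() / (x > s).sum()  # P(X > s+t | X > s)
rhs = (x > t).mean()                     # P(X > t)
print(round(lhs, 3), round(rhs, 3))      # both ~ e^(-lam*t) ~ 0.135
```

Having already waited s units tells you nothing about the remaining wait, which is exactly why the exponential models arrival gaps at a constant rate.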

Uniform

All outcomes equally likely over [a, b].

  • E(X) = (a+b)/2, Var(X) = (b-a)^2/12

Combinatorics

  • Permutations (order matters): P(n,k) = n! / (n-k)!
  • Combinations (order doesn't matter): C(n,k) = n! / (k!(n-k)!)
  • With repetition: permutations = n^k, combinations = C(n+k-1, k)
  • Binomial theorem: (a+b)^n = sum C(n,k) * a^(n-k) * b^k
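The standard library covers these directly; a small sketch with example values:

```python
import math

print(math.perm(5, 2))          # 20: ordered pairs from 5 items, 5!/3!
print(math.comb(5, 2))          # 10: unordered pairs, 5!/(2!*3!)
print(5 ** 3)                   # 125: length-3 sequences with repetition
print(math.comb(5 + 3 - 1, 3))  # 35: size-3 multisets from 5 items

# Binomial theorem sanity check at a = b = 1: sum of C(n,k) = 2^n
assert sum(math.comb(4, k) for k in range(5)) == 2 ** 4
```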

Key Theorems

  • Law of Large Numbers: sample mean converges to population mean as n -> inf
  • Central Limit Theorem: standardized sample mean -> N(0,1) as n -> inf
  • Chebyshev's inequality: P(|X - mu| >= k*sigma) <= 1/k^2 (for any distribution)
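The first two theorems are easy to see by simulation; a sketch using die rolls for the LLN and a skewed (exponential) distribution for the CLT:

```python
import numpy as np

rng = np.random.default_rng(42)

# LLN: the sample mean of die rolls approaches 3.5 as n grows
rolls = rng.integers(1, 7, size=200_000)
print(rolls[:100].mean(), rolls.mean())  # noisy at n=100, ~3.5 at n=200000

# CLT: standardized means of exponential samples look N(0,1)
# even though the exponential itself is heavily skewed
samples = rng.exponential(scale=1.0, size=(10_000, 50))
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(50))  # mu = sigma = 1
print(round(z.mean(), 2), round(z.std(), 2))            # ~0 and ~1
```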

Simulating Distributions

import numpy as np

rng = np.random.default_rng()  # modern Generator API

# Inverse CDF method: if U ~ Uniform(0,1), then F^{-1}(U) has CDF F
# Exponential: F^{-1}(u) = -ln(1-u) / lambda  (-ln(u)/lambda also works, since 1-U ~ U)
u = rng.uniform(size=10000)
x_exp = -np.log(1 - u) / 2.0  # Exp(lambda=2); 1-u avoids log(0), since U is in [0, 1)

# Box-Muller for normal:
u1, u2 = rng.uniform(size=10000), rng.uniform(size=10000)
z = np.sqrt(-2 * np.log(1 - u1)) * np.cos(2 * np.pi * u2)  # N(0,1)

# Direct numpy samplers:
rng.normal(0, 1, 10000)       # normal
rng.poisson(5, 10000)         # poisson
rng.binomial(10, 0.3, 10000)  # binomial

Gotchas

  • Poisson approximates binomial when n is large and p is small (with lambda = np), but they model different things: binomial counts successes in a fixed number of trials, Poisson counts events occurring at a constant rate over an interval
  • Normal distribution is not universal - many real-world distributions are heavily skewed (income, file sizes)
  • CLT requires finite variance - fails for heavy-tailed distributions (Cauchy)
  • Exponential is the only continuous memoryless distribution

See Also