Bayesian Inference for ML¶
Bayesian approach treats model parameters as probability distributions, not point estimates. Instead of "the coefficient is 0.5", you get "the coefficient is 0.5 +/- 0.1 with 95% probability". Critical for uncertainty quantification, small datasets, and decision-making under risk.
Bayes Theorem¶
P(theta|data) = P(data|theta) * P(theta) / P(data)
- Prior P(theta): what you believe before seeing data
- Likelihood P(data|theta): how likely the data is given parameters
- Posterior P(theta|data): updated beliefs after seeing data
- Evidence P(data): normalizing constant (often intractable)
Bayesian vs Frequentist¶
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed, unknown | Random variables |
| Uncertainty | Confidence intervals | Credible intervals |
| Prior knowledge | Not used | Encoded as prior |
| Small data | Unreliable | Regularized by prior |
| Computation | Fast (MLE) | Slower (MCMC/VI) |
| Interpretation | Long-run frequency | Probability of hypothesis |
Probabilistic Programming with PyMC¶
import pymc as pm
import arviz as az
# Bayesian Linear Regression
with pm.Model() as model:
# Priors
alpha = pm.Normal("alpha", mu=0, sigma=10)
beta = pm.Normal("beta", mu=0, sigma=10, shape=X.shape[1])
sigma = pm.HalfNormal("sigma", sigma=1)
# Likelihood
mu = alpha + pm.math.dot(X, beta)
y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
# Inference (MCMC sampling)
trace = pm.sample(2000, tune=1000, cores=4, return_inferencedata=True)
# Analyze posterior
az.summary(trace, var_names=["alpha", "beta", "sigma"])
az.plot_posterior(trace, var_names=["beta"])
az.plot_trace(trace)
MCMC (Markov Chain Monte Carlo)¶
Sampling algorithm to approximate posterior when analytic solution is impossible.
NUTS (No-U-Turn Sampler): default in PyMC, Stan. Adapts step size automatically. Gold standard for continuous parameters.
Diagnostics: - R-hat (Gelman-Rubin): should be < 1.01. Measures chain convergence - Effective sample size (ESS): should be > 400. Accounts for autocorrelation - Divergences: NUTS-specific. Divergences = bad geometry, reparameterize model
# Check diagnostics
az.summary(trace) # includes r_hat and ess columns
az.plot_trace(trace) # visual: chains should mix well (look like fuzzy caterpillars)
Variational Inference (VI)¶
Faster alternative to MCMC. Approximates posterior with simpler distribution. Less accurate but scales to large data.
with model:
# Mean-field variational inference
approx = pm.fit(method='advi', n=30000)
trace_vi = approx.sample(2000)
Bayesian Neural Networks¶
Replace point-weight networks with distributions over weights. Capture epistemic uncertainty.
import torch
import torch.nn as nn
class BayesianLinear(nn.Module):
def __init__(self, in_features, out_features):
super().__init__()
self.weight_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
self.weight_log_sigma = nn.Parameter(torch.full((out_features, in_features), -3.0))
def forward(self, x):
weight_sigma = torch.exp(self.weight_log_sigma)
weight = self.weight_mu + weight_sigma * torch.randn_like(self.weight_mu)
return x @ weight.T
# MC Dropout as cheap Bayesian approximation
class MCDropoutModel(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim, dropout=0.1):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
nn.Linear(hidden_dim, output_dim)
)
def predict_with_uncertainty(self, x, n_samples=100):
self.train() # keep dropout ON
preds = torch.stack([self.layers(x) for _ in range(n_samples)])
return preds.mean(0), preds.std(0) # mean prediction + uncertainty
Bayesian A/B Testing¶
Superior to frequentist hypothesis testing for business decisions:
with pm.Model() as ab_model:
# Priors for conversion rates
p_A = pm.Beta("p_A", alpha=1, beta=1) # uninformative prior
p_B = pm.Beta("p_B", alpha=1, beta=1)
# Observed data
obs_A = pm.Binomial("obs_A", n=n_visitors_A, p=p_A, observed=conversions_A)
obs_B = pm.Binomial("obs_B", n=n_visitors_B, p=p_B, observed=conversions_B)
# Derived: probability B is better
diff = pm.Deterministic("diff", p_B - p_A)
trace = pm.sample(5000, return_inferencedata=True)
# P(B > A)
p_b_better = (trace.posterior["diff"] > 0).mean().item()
print(f"Probability B is better: {p_b_better:.1%}")
Conjugate Priors¶
When prior and posterior are same distribution family, update is analytic (no MCMC needed):
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Bernoulli | Beta | Beta |
| Poisson | Gamma | Gamma |
| Normal (known var) | Normal | Normal |
| Multinomial | Dirichlet | Dirichlet |
Gotchas¶
- Prior sensitivity with small data: with < 100 samples, the prior dominates. Always do prior predictive checks (
pm.sample_prior_predictive()) to verify priors generate plausible data ranges. Garbage priors = garbage posteriors regardless of data - MCMC divergences are not warnings - they are errors: divergent transitions mean the sampler failed to explore the posterior correctly. Results are unreliable. Reparameterize (non-centered parameterization), increase
target_acceptto 0.95, or simplify model - VI underestimates uncertainty: variational inference uses simpler distributions (usually Gaussian) to approximate the true posterior. It consistently underestimates the width of credible intervals. Use MCMC when accurate uncertainty quantification matters