Bayesian Inference for ML

The Bayesian approach treats model parameters as random variables described by probability distributions, not as point estimates. Instead of "the coefficient is 0.5", you get "the coefficient lies between 0.4 and 0.6 with 95% probability". This is critical for uncertainty quantification, small datasets, and decision-making under risk.

Bayes' Theorem

P(theta|data) = P(data|theta) * P(theta) / P(data)

Prior P(theta): what you believe before seeing data
Likelihood P(data|theta): how likely the data is given parameters
Posterior P(theta|data): updated beliefs after seeing data
Evidence P(data): normalizing constant (often intractable)
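
To make the four terms concrete, here is a minimal sketch that applies the formula on a grid of candidate values for a single parameter. The coin-flip data (7 heads in 10 flips) and the flat prior are illustrative assumptions, not taken from any real dataset. With one parameter the evidence is just a sum over the grid; it becomes intractable only as the number of parameters grows.

```python
from math import comb

# Hypothetical data: 7 heads in 10 coin flips.
heads, flips = 7, 10

# Candidate values for theta (probability of heads) on a grid.
grid = [i / 100 for i in range(101)]

# Prior P(theta): flat over [0, 1] (an illustrative choice).
prior = [1.0 for _ in grid]

# Likelihood P(data|theta): binomial probability of the observed flips.
likelihood = [comb(flips, heads) * t**heads * (1 - t) ** (flips - heads) for t in grid]

# Evidence P(data): sum of likelihood * prior over the grid (the normalizer).
unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
evidence = sum(unnormalized)

# Posterior P(theta|data): normalized so it sums to 1 over the grid.
posterior = [u / evidence for u in unnormalized]

theta_map = grid[posterior.index(max(posterior))]
posterior_mean = sum(t * p for t, p in zip(grid, posterior))
print(f"MAP: {theta_map:.2f}, posterior mean: {posterior_mean:.3f}")
```

The posterior mean is pulled slightly toward 0.5 relative to the most probable value, which is the flat prior's influence on a small sample.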

Bayesian vs Frequentist

Aspect	Frequentist	Bayesian
Parameters	Fixed, unknown	Random variables
Uncertainty	Confidence intervals	Credible intervals
Prior knowledge	Not used	Encoded as prior
Small data	Estimates can be unstable	Regularized by prior
Computation	Fast (MLE)	Slower (MCMC/VI)
Interpretation	Long-run frequency	Degree of belief
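
The "Small data" row can be illustrated with a tiny sketch. The dataset (3 successes in 3 trials) and the Beta(1, 1) prior are assumptions chosen for illustration; the posterior uses the standard Beta-Binomial closed form.

```python
# Hypothetical small dataset: 3 conversions out of 3 trials.
successes, trials = 3, 3

# Frequentist point estimate (MLE): the observed proportion.
mle = successes / trials

# Bayesian estimate with a Beta(1, 1) prior: the posterior is
# Beta(1 + successes, 1 + failures), whose mean has a closed form.
prior_a, prior_b = 1, 1
post_a = prior_a + successes
post_b = prior_b + (trials - successes)
posterior_mean = post_a / (post_a + post_b)

print(f"MLE: {mle:.2f}, posterior mean: {posterior_mean:.2f}")
```

The MLE claims a 100% rate from three observations, while the prior keeps the Bayesian estimate away from that extreme.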

Probabilistic Programming with PyMC

import pymc as pm
import arviz as az

# Bayesian linear regression
# Assumes X is a NumPy array of shape (n_samples, n_features) and y has shape (n_samples,)
with pm.Model() as model:
    # Priors
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    beta = pm.Normal("beta", mu=0, sigma=10, shape=X.shape[1])
    sigma = pm.HalfNormal("sigma", sigma=1)

    # Likelihood
    mu = alpha + pm.math.dot(X, beta)
    y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)

    # Inference (MCMC sampling)
    trace = pm.sample(2000, tune=1000, cores=4, return_inferencedata=True)

# Analyze posterior
az.summary(trace, var_names=["alpha", "beta", "sigma"])
az.plot_posterior(trace, var_names=["beta"])
az.plot_trace(trace)

MCMC (Markov Chain Monte Carlo)

A family of sampling algorithms that approximate the posterior when no analytic solution is available.

NUTS (No-U-Turn Sampler): default in PyMC and Stan. Chooses the trajectory length automatically and tunes the step size during warm-up. Gold standard for continuous parameters.
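
NUTS is too involved to sketch in a few lines, but the core MCMC idea (propose a move, accept it with a probability based on the posterior ratio) shows up in the much simpler random-walk Metropolis algorithm. The sketch below is not what PyMC runs; it is a plain-Python illustration on an assumed coin-flip posterior (7 heads in 10 flips, flat prior), and the step size and warm-up length are arbitrary choices.

```python
import math
import random


def log_posterior(theta, heads=7, flips=10):
    """Unnormalized log posterior: binomial likelihood with a flat prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)


def metropolis(log_post, n_samples, start=0.5, step=0.2, seed=0):
    """Random-walk Metropolis with a Gaussian proposal."""
    rng = random.Random(seed)
    theta, current = start, log_post(start)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_post(proposal)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


samples = metropolis(log_posterior, n_samples=20000)
kept = samples[2000:]  # discard warm-up draws
estimate = sum(kept) / len(kept)
print(f"Posterior mean estimate: {estimate:.3f} (exact: {8 / 12:.3f})")
```

Only the unnormalized posterior is needed, which is why MCMC sidesteps the intractable evidence term.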

Diagnostics:
R-hat (Gelman-Rubin): should be < 1.01. Measures chain convergence
Effective sample size (ESS): should be > 400. Accounts for autocorrelation
Divergences: NUTS-specific. Divergences = bad geometry, reparameterize model

# Check diagnostics
az.summary(trace)  # includes r_hat, ess_bulk, and ess_tail columns
az.plot_trace(trace)  # visual: chains should mix well (look like fuzzy caterpillars)
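
To show what R-hat measures, here is a sketch of the classic Gelman-Rubin formula, which compares between-chain and within-chain variance. ArviZ reports a more robust rank-normalized split variant, so values will not match az.summary exactly. The chains below are synthetic draws made up for illustration.

```python
import random
from statistics import mean, variance


def r_hat(chains):
    """Classic (non-split) Gelman-Rubin statistic for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    # W: average within-chain variance; B/n: variance of the chain means.
    within = mean(variance(c) for c in chains)
    between_over_n = variance(chain_means)
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


rng = random.Random(0)
# Four hypothetical chains drawn from the same distribution: well mixed.
good = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# Four chains stuck around different values: not converged.
bad = [[rng.gauss(shift, 1) for _ in range(1000)] for shift in (0, 0, 3, 3)]

print(f"R-hat (mixed): {r_hat(good):.3f}, R-hat (stuck): {r_hat(bad):.3f}")
```

When chains agree, the pooled variance matches the within-chain variance and R-hat sits near 1; chains exploring different regions inflate the between-chain term.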

Variational Inference (VI)

A faster alternative to MCMC. It approximates the posterior with a simpler distribution by turning inference into an optimization problem. Less accurate, but it scales to large datasets.

with model:
    # Mean-field variational inference
    approx = pm.fit(method='advi', n=30000)
    trace_vi = approx.sample(2000)

Bayesian Neural Networks

Replace point-estimate weights with distributions over weights to capture epistemic (model) uncertainty.

import torch
import torch.nn as nn

class BayesianLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.weight_log_sigma = nn.Parameter(torch.full((out_features, in_features), -3.0))

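    def forward(self, x):
        # Reparameterization trick: weight = mu + sigma * eps, with eps ~ N(0, 1)
        weight_sigma = torch.exp(self.weight_log_sigma)
        weight = self.weight_mu + weight_sigma * torch.randn_like(self.weight_mu)
        return x @ weight.T  # no bias term in this minimal layer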

# MC Dropout as cheap Bayesian approximation
class MCDropoutModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout=0.1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, output_dim)
        )

    def predict_with_uncertainty(self, x, n_samples=100):
        self.train()  # keep dropout ON at prediction time
        with torch.no_grad():  # gradients are not needed for prediction
            preds = torch.stack([self.layers(x) for _ in range(n_samples)])
        return preds.mean(0), preds.std(0)  # mean prediction + uncertainty
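
The BayesianLinear layer above only defines the forward pass. Training such a layer variationally usually adds a KL penalty between each weight's distribution and a prior to the data loss. The sketch below assumes a standard normal prior N(0, 1), which the code above does not specify, and uses plain Python for a single weight. The function name is hypothetical; in PyTorch the same formula would be applied elementwise to weight_mu and weight_log_sigma and summed.

```python
import math


def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for one weight, in closed form."""
    sigma_sq = math.exp(2.0 * log_sigma)
    return 0.5 * (sigma_sq + mu * mu - 1.0) - log_sigma


# A weight whose distribution equals the assumed prior incurs no penalty.
kl_at_prior = kl_to_standard_normal(mu=0.0, log_sigma=0.0)

# A narrow distribution like the initial log_sigma of -3.0 is far from N(0, 1).
kl_at_init = kl_to_standard_normal(mu=0.1, log_sigma=-3.0)

print(f"KL at prior: {kl_at_prior:.3f}, KL at narrow init: {kl_at_init:.3f}")
```

The penalty grows as the weight distribution becomes narrower or moves away from zero, which is what stops the learned variances from collapsing to point estimates.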

Bayesian A/B Testing

Often more useful than frequentist hypothesis testing for business decisions, because it answers the question directly: how probable is it that B is better than A?

with pm.Model() as ab_model:
    # Priors for conversion rates
    p_A = pm.Beta("p_A", alpha=1, beta=1)  # uninformative prior
    p_B = pm.Beta("p_B", alpha=1, beta=1)

    # Observed data
    obs_A = pm.Binomial("obs_A", n=n_visitors_A, p=p_A, observed=conversions_A)
    obs_B = pm.Binomial("obs_B", n=n_visitors_B, p=p_B, observed=conversions_B)

    # Derived: difference in conversion rates (B minus A)
    diff = pm.Deterministic("diff", p_B - p_A)

    trace = pm.sample(5000, return_inferencedata=True)

# P(B > A)
p_b_better = (trace.posterior["diff"] > 0).mean().item()
print(f"Probability B is better: {p_b_better:.1%}")
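
Because a Beta prior with binomial data has a closed-form posterior, the same quantity can be cross-checked without MCMC by drawing directly from the two Beta posteriors. The visitor and conversion counts below are hypothetical, as are the helper name and the number of draws.

```python
import random

# Hypothetical traffic numbers, for illustration only.
n_visitors_A, conversions_A = 1000, 120
n_visitors_B, conversions_B = 1000, 150

rng = random.Random(0)
n_draws = 20000


def posterior_draw(conversions, visitors):
    # Beta(1, 1) prior + binomial data -> Beta(1 + successes, 1 + failures).
    return rng.betavariate(1 + conversions, 1 + visitors - conversions)


# Count how often a draw for B exceeds a draw for A.
wins = sum(
    posterior_draw(conversions_B, n_visitors_B) > posterior_draw(conversions_A, n_visitors_A)
    for _ in range(n_draws)
)
p_b_better = wins / n_draws
print(f"Probability B is better: {p_b_better:.1%}")
```

For a model this simple the shortcut and the PyMC model should agree closely; the PyMC version earns its keep once the model gains covariates or hierarchy.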

Conjugate Priors

When the prior and posterior belong to the same distribution family, the update is analytic (no MCMC needed):

Likelihood	Conjugate Prior	Posterior
Bernoulli	Beta	Beta
Poisson	Gamma	Gamma
Normal (known var)	Normal	Normal
Multinomial	Dirichlet	Dirichlet
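
As a sketch of what an analytic update looks like, here is the Poisson/Gamma row in code. The Gamma is parameterized by shape and rate; the prior values and the daily counts are illustrative assumptions, and the function name is hypothetical.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Gamma(alpha, rate=beta) prior + Poisson counts -> Gamma posterior parameters."""
    return alpha + sum(counts), beta + len(counts)


# Hypothetical data: support tickets per day over five days.
counts = [3, 5, 4, 6, 2]

# Gamma(2, 1) prior on the daily rate (an illustrative choice).
post_alpha, post_beta = gamma_poisson_update(2.0, 1.0, counts)
posterior_mean = post_alpha / post_beta  # mean of Gamma(alpha, rate=beta)

print(f"Posterior: Gamma({post_alpha}, {post_beta}), mean rate = {posterior_mean:.2f}")
```

The whole update is two additions: the shape absorbs the total count and the rate absorbs the number of observations.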

Gotchas

Prior sensitivity with small data: with < 100 samples, the prior can dominate the posterior. Always do prior predictive checks (pm.sample_prior_predictive()) to verify that the priors generate plausible data ranges. Garbage priors give garbage posteriors unless there is enough data to overwhelm them
MCMC divergences are not warnings - they are errors: divergent transitions mean the sampler failed to explore part of the posterior correctly, so the results may be biased. Reparameterize (e.g. non-centered parameterization), increase target_accept to 0.95, or simplify the model
VI underestimates uncertainty: variational inference approximates the true posterior with a simpler distribution (usually a factorized Gaussian). Mean-field VI in particular tends to underestimate the width of credible intervals, especially when parameters are correlated. Use MCMC when accurate uncertainty quantification matters
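
The VI gotcha can be checked by hand in one case where the answer is known exactly. If the true posterior is a correlated Gaussian, the factorized Gaussian that minimizes KL(q || p) has variance equal to the inverse of the corresponding diagonal entry of the precision matrix, which is never larger than the true marginal variance. The correlation of 0.9 below is an assumed value for illustration.

```python
# True posterior: 2-D Gaussian with unit variances and correlation rho (hypothetical).
rho = 0.9
cov = [[1.0, rho], [rho, 1.0]]

# Invert the 2x2 covariance to get the precision matrix.
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
precision = [
    [cov[1][1] / det, -cov[0][1] / det],
    [-cov[1][0] / det, cov[0][0] / det],
]

# Minimizing KL(q || p) over factorized Gaussians q gives variance 1 / precision[i][i].
true_var = cov[0][0]
mean_field_var = 1.0 / precision[0][0]

print(f"True marginal variance: {true_var:.2f}, mean-field variance: {mean_field_var:.2f}")
```

The stronger the correlation between parameters, the more the factorized approximation shrinks, which is why credible intervals from mean-field ADVI deserve a sanity check against MCMC on a subsample.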