Bayesian Methods in ML¶

Bayesian thinking updates beliefs with evidence. From simple Naive Bayes classification to Bayesian inference, these methods provide principled uncertainty quantification.

Bayes' Theorem Applied¶

P(hypothesis | data) = P(data | hypothesis) * P(hypothesis) / P(data)

Prior P(hypothesis): what we believe before seeing data
Likelihood P(data | hypothesis): probability of data given hypothesis
Posterior P(hypothesis | data): updated belief after seeing data
Evidence P(data): normalizing constant

Naive Bayes Classifier¶

P(class | features) proportional to P(class) * product(P(feature_i | class))

"Naive" assumption: features are conditionally independent given class. Often violated but works surprisingly well.

from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Gaussian: continuous features (assumes normal distribution)
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Multinomial: count features (great for text/TF-IDF)
mnb = MultinomialNB()
mnb.fit(X_train_tfidf, y_train)

Best for: text classification (spam filtering, sentiment), when training data is small, when you need fast training.

Why It Works Despite Wrong Assumption¶

Classification only needs to rank probabilities correctly, not estimate them precisely
Feature dependencies often cancel out across classes
With limited data, simpler models generalize better

Bayesian vs Frequentist¶

Aspect	Frequentist	Bayesian
Parameters	Fixed, unknown	Random variables with distributions
Data	Random (repeated sampling)	Fixed (observed)
Inference	Point estimate + CI	Full posterior distribution
Prior knowledge	Not incorporated	Explicitly incorporated
Uncertainty	Confidence intervals	Credible intervals

Bayesian Inference¶

Instead of point estimates, compute full posterior distribution of parameters.

Conjugate priors: prior and posterior have same family. - Normal likelihood + Normal prior -> Normal posterior - Binomial likelihood + Beta prior -> Beta posterior

MCMC (Markov Chain Monte Carlo): sample from posterior when analytical solution is intractable.

Applications in Data Science¶

A/B testing: Bayesian A/B testing gives probability that B is better than A (more intuitive than p-values)
Hyperparameter tuning: Bayesian optimization (Optuna, Hyperopt)
Spam filtering: Naive Bayes with word frequencies
Medical diagnosis: prior disease prevalence + test sensitivity
Recommendation systems: Bayesian personalization

Gotchas¶

Naive Bayes assumes feature independence - correlation between features hurts performance
Prior choice matters, especially with small data
MCMC can be slow to converge for complex models
Log-space computation essential to avoid numerical underflow (product of many small probabilities)
Multinomial NB requires non-negative features (counts, TF-IDF)