Ensemble Methods¶

Combining multiple models to produce better predictions than any single model. Three main strategies: bagging (parallel, reduce variance), boosting (sequential, reduce bias), stacking (learn how to combine).

Bagging (Bootstrap Aggregating)¶

Train multiple models on bootstrap samples (random sampling with replacement), aggregate predictions by voting (classification) or averaging (regression).

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=None),
    n_estimators=50,
    max_samples=0.8,    # 80% of data per base model
    max_features=0.8,   # 80% of features per base model
    bootstrap=True,
    oob_score=True,     # out-of-bag estimate (free validation)
    random_state=42
)
bagging.fit(X_train, y_train)
print(f"OOB score: {bagging.oob_score_:.3f}")

Random Forest is bagging + random feature subsets at each split. The most successful bagging method.

Why it works: reduces variance by averaging decorrelated predictions. Each model sees different data, makes different errors. Errors cancel out in aggregation.

Boosting¶

Sequentially train weak learners, each focusing on mistakes of previous ones.

AdaBoost¶

Increases weight of misclassified samples:

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # stumps
    n_estimators=200,
    learning_rate=0.1,
    random_state=42
)
ada.fit(X_train, y_train)

Gradient Boosting (XGBoost / LightGBM / CatBoost)¶

Fits new tree to residuals (gradient of loss function). See gradient boosting for detailed comparison.

Key Boosting Parameters¶

Parameter	Effect	Typical Range
n_estimators	Number of trees	100-10000
learning_rate	Shrinkage per tree	0.01-0.3
max_depth	Tree complexity	3-10
subsample	Row sampling ratio	0.5-1.0
colsample_bytree	Feature sampling	0.5-1.0
reg_alpha (L1)	Sparsity	0-10
reg_lambda (L2)	Smoothing	0-10

Rule of thumb: lower learning rate + more trees = better but slower. Use early stopping.

Stacking¶

Train a meta-learner on predictions of base models:

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100)),
        ('xgb', XGBClassifier(n_estimators=200)),
        ('svm', SVC(probability=True)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # use cross-val predictions to prevent leakage
    stack_method='predict_proba'  # use probabilities not hard labels
)
stack.fit(X_train, y_train)

Multi-level stacking: base models -> level-1 meta-learner -> level-2 meta-learner. Diminishing returns after 2 levels. Popular in Kaggle but rarely worth complexity in production.

Voting¶

Simple combination of diverse models:

from sklearn.ensemble import VotingClassifier

# Hard voting: majority vote
hard_vote = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('rf', RandomForestClassifier()),
        ('svm', SVC())
    ],
    voting='hard'
)

# Soft voting: average probabilities (usually better)
soft_vote = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('rf', RandomForestClassifier()),
        ('svm', SVC(probability=True))
    ],
    voting='soft',
    weights=[1, 2, 1]  # weight better models higher
)

Blending¶

Simpler than stacking - use holdout set instead of cross-validation:

# Split: train -> base_train + blend_set
X_base, X_blend, y_base, y_blend = train_test_split(
    X_train, y_train, test_size=0.3, random_state=42
)

# Train base models on base_train
models = [rf.fit(X_base, y_base), xgb.fit(X_base, y_base)]

# Generate blend features
blend_features = np.column_stack([
    m.predict_proba(X_blend)[:, 1] for m in models
])

# Train meta-model on blend_set
meta = LogisticRegression()
meta.fit(blend_features, y_blend)

Model Diversity¶

Ensembles work best when base models make different errors:

Different algorithms: tree + linear + neural net
Different features: subsets, different preprocessing
Different hyperparameters: shallow + deep trees
Different training data: bootstrap, different time windows

Correlation between model errors is the enemy of ensembles. Two models with 80% accuracy but uncorrelated errors ensemble to ~96%. Two correlated 80% models stay at ~80%.

Gotchas¶

Stacking without cross-validation leaks: if you train base models on all training data and then use those same predictions to train the meta-learner, base model predictions on training data are overfit. Always use out-of-fold predictions for the meta-learner training set
Diminishing returns: going from 1 model to 3-5 diverse models gives biggest improvement. Adding model 50 barely helps. In production, the complexity of maintaining many models often outweighs the marginal accuracy gain. Start with 3 diverse models
Boosting overfits with noisy labels: boosting focuses on hard examples - if those are mislabeled, it memorizes noise. Use early stopping and validate on clean holdout. If label noise is known, prefer bagging or use label smoothing