Gradient Boosting and Tree-Based Models

Gradient boosting is the dominant algorithm for tabular data. CatBoost, LightGBM, and XGBoost are implementations of the same idea. Tree-based models handle mixed data types, non-linear relationships, and interactions without manual feature engineering.

Decision Trees

Base building block. Recursively split data on feature thresholds to minimize impurity.

Splitting criteria:
  • Gini impurity: sum(p_i * (1 - p_i)). Probability of misclassifying a random sample labeled by the node's class distribution
  • Entropy: -sum(p_i * log2(p_i)). Information-theoretic measure of disorder
  • MSE (regression): variance of the target within each split
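These criteria can be computed directly from class probabilities; a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def gini(p):
    """Gini impurity: sum(p_i * (1 - p_i))."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1 - p)))

def entropy(p):
    """Entropy: -sum(p_i * log2(p_i)), skipping zero-probability classes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A pure node has zero impurity; a 50/50 node is maximally impure.
print(gini([1.0, 0.0]))     # 0.0
print(gini([0.5, 0.5]))     # 0.5
print(entropy([0.5, 0.5]))  # 1.0
```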

Pros: interpretable, handles mixed types, no scaling needed. Cons: high variance, prone to overfitting, unstable (small data changes -> different tree).

Random Forest

Ensemble of decision trees. Each tree trained on bootstrap sample with random feature subset.

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
rf.feature_importances_  # built-in importance scores
  • Bagging reduces variance by averaging decorrelated predictions
  • Feature importance: decrease in impurity when splitting on feature
  • Embarrassingly parallel - each tree trains independently

Gradient Boosting

Sequentially build trees, each correcting errors of previous ones.

Idea: fit tree to residuals (errors) of current model, add it with a small learning rate.
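The residual-fitting loop can be sketched in a few lines, assuming scikit-learn's DecisionTreeRegressor as the weak learner (synthetic data for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
trees = []
pred = np.full_like(y, y.mean())  # start from the mean prediction

for _ in range(100):
    residuals = y - pred                     # errors of the current model
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                   # fit a tree to the residuals
    pred += learning_rate * tree.predict(X)  # add it with a small step
    trees.append(tree)

print(np.mean((y - pred) ** 2))  # training MSE shrinks as trees are added
```

The small learning rate is what makes boosting robust: each tree only partially corrects the current errors, leaving room for later trees to refine.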

CatBoost

Best out-of-the-box performance. Handles categorical features natively.

from catboost import CatBoostRegressor, CatBoostClassifier

model = CatBoostRegressor(cat_features=['transmission', 'fuel_type'])
model.fit(
    train[features], train[target],
    eval_set=(val[features], val[target]),
    verbose=100
)

# Predictions
y_pred = model.predict(test[features])

# Classification: get probabilities
model = CatBoostClassifier(cat_features=cat_features)
model.fit(train[features], train[target],
          eval_set=(val[features], val[target]))
scores = model.predict_proba(test[features])[:, 1]

# Feature importance
model.get_feature_importance(prettified=True)

A validation set is mandatory: CatBoost trains iteratively, and without one it simply memorizes the training data. Early stopping (e.g. the early_stopping_rounds parameter) halts training at the validation-error minimum.

CatBoost Cross-Validation

from catboost import cv, Pool

cv_data = cv(
    pool=Pool(X, y, cat_features=cat_features),
    params={'loss_function': 'Logloss', 'eval_metric': 'AUC'},
    fold_count=5, shuffle=True, stratified=True,
    partition_random_seed=42, verbose=False
)
best_iter = cv_data['test-AUC-mean'].idxmax()

Workflow: CV to find optimal iterations -> train final model on full train+val with that count -> evaluate on test.
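A sketch of that workflow, illustrated with scikit-learn's GradientBoostingClassifier so it is self-contained (the CatBoost version with cv/Pool is analogous):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 1) CV over candidate tree counts to pick the best one
candidates = [50, 100, 200]
cv_auc = {
    n: cross_val_score(
        GradientBoostingClassifier(n_estimators=n, random_state=42),
        X_trainval, y_trainval, cv=5, scoring="roc_auc").mean()
    for n in candidates
}
best_n = max(cv_auc, key=cv_auc.get)

# 2) Refit on the full train+val data with that count
final = GradientBoostingClassifier(n_estimators=best_n, random_state=42)
final.fit(X_trainval, y_trainval)

# 3) Evaluate once on the held-out test set
print(final.score(X_test, y_test))
```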

XGBoost

import xgboost as xgb
model = xgb.XGBClassifier(
    n_estimators=500, max_depth=6, learning_rate=0.1,
    subsample=0.8, colsample_bytree=0.8,
    early_stopping_rounds=50, eval_metric='auc'
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=50)

LightGBM

Fastest for large datasets. Leaf-wise growth (vs level-wise).

import lightgbm as lgb
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=31)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(50)])

Key Hyperparameters

Parameter                  Effect                         Typical Range
n_estimators / iterations  Number of trees                100-5000
learning_rate              Step size per tree             0.01-0.3
max_depth                  Tree depth                     3-10
subsample                  Fraction of rows per tree      0.5-1.0
colsample_bytree           Fraction of features per tree  0.5-1.0
min_child_weight           Minimum samples in leaf        1-100
reg_lambda                 L2 regularization              0-10
reg_alpha                  L1 regularization              0-10

Rule of thumb: lower learning_rate + more estimators = better but slower. Start with defaults, tune learning_rate and max_depth first.

Handling Imbalanced Classes

# scale_pos_weight = (negative count) / (positive count)
neg_count, pos_count = (y_train == 0).sum(), (y_train == 1).sum()

# CatBoost
model = CatBoostClassifier(scale_pos_weight=neg_count / pos_count)

# XGBoost
model = xgb.XGBClassifier(scale_pos_weight=neg_count / pos_count)

# Or: tune the decision threshold on a validation set
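Threshold tuning can be sketched as a grid search maximizing F1 on validation scores (synthetic scores stand in for the model's predict_proba output):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=1000)
# stand-in for model.predict_proba(X_val)[:, 1]
val_scores = np.clip(y_val * 0.4 + rng.uniform(0, 0.6, size=1000), 0, 1)

# Try a grid of thresholds; keep the one with the best validation F1
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_val, val_scores >= t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1s))]
print(best_threshold)  # apply this threshold to test-set scores
```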

Comparison

Aspect                CatBoost  XGBoost          LightGBM  Random Forest
Default performance   Best      Good             Good      Good
Categorical handling  Native    Manual encoding  Native    Manual encoding
Speed                 Moderate  Moderate         Fastest   Fast (parallel)
Overfitting risk      Low       Moderate         Moderate  Low
GPU support           Yes       Yes              Yes       No

Gotchas

  • Don't use accuracy on imbalanced datasets - a model that predicts all zeros already scores 80% accuracy on an 80/20 class split
  • CatBoost with no eval_set will overfit silently - always provide validation
  • Feature importance varies across runs and methods - use permutation importance for reliability
  • XGBoost requires manual one-hot encoding for categoricals (unless using enable_categorical=True)
  • Start simple (defaults) before hyperparameter tuning - tuning doesn't compensate for bad features
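The accuracy gotcha is easy to demonstrate:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 80/20 imbalance: a model that always predicts 0 looks accurate
y_true = np.array([0] * 80 + [1] * 20)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))             # 0.8 -- misleading
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0 -- reveals the problem
```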

See Also