Handling Imbalanced Data¶
When one class dominates the dataset (e.g., 99% negative, 1% positive), standard classifiers become biased toward the majority class. Fraud detection, medical diagnosis, and defect detection all suffer from this. A model that always predicts "negative" gets 99% accuracy but is useless.
Measuring the Problem¶
import pandas as pd
# Check imbalance ratio
print(y.value_counts(normalize=True))
# 0 0.985
# 1 0.015 <- severe imbalance
imbalance_ratio = y.value_counts()[0] / y.value_counts()[1]
# > 10:1 = moderate, > 100:1 = severe
Key rule: never use accuracy as the metric for imbalanced problems. Use precision, recall, F1, AUROC, or AUPRC instead.
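A quick sketch of computing these metrics with scikit-learn, assuming a fitted model and a held-out X_test / y_test:
from sklearn.metrics import classification_report, roc_auc_score, average_precision_score
probas = model.predict_proba(X_test)[:, 1]  # positive-class probabilities
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
print("AUROC:", roc_auc_score(y_test, probas))
print("AUPRC:", average_precision_score(y_test, probas))  # a.k.a. average precision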
Data-Level Methods¶
Random Oversampling / Undersampling¶
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
# Oversample minority to match majority
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)
# Undersample majority to match minority
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)
- Oversampling: risk of overfitting (duplicated minority samples)
- Undersampling: loses information from majority class
SMOTE (Synthetic Minority Oversampling)¶
Creates synthetic samples by interpolating between existing minority points and their k-nearest neighbors.
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN
# Basic SMOTE
smote = SMOTE(sampling_strategy='auto', k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
# Borderline-SMOTE: only synthesize near the decision boundary
bl_smote = BorderlineSMOTE(kind='borderline-1', random_state=42)
X_res, y_res = bl_smote.fit_resample(X_train, y_train)
# ADASYN: adaptively generate more samples in harder regions
adasyn = ADASYN(random_state=42)
X_res, y_res = adasyn.fit_resample(X_train, y_train)
SMOTE + Tomek Links (Combined)¶
Oversample minority then clean noisy majority samples near the boundary:
from imblearn.combine import SMOTETomek
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X_train, y_train)
Algorithm-Level Methods¶
Class Weights¶
Most scikit-learn classifiers accept a class_weight parameter, which penalizes misclassifying the minority class more heavily.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# Automatic weight inversely proportional to class frequency
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
# Manual weights
lr = LogisticRegression(class_weight={0: 1, 1: 50})
# XGBoost: scale_pos_weight = n_negative / n_positive
# (CatBoost exposes the same scale_pos_weight parameter)
import xgboost as xgb
model = xgb.XGBClassifier(scale_pos_weight=imbalance_ratio)
Focal Loss¶
Down-weights easy (well-classified) examples, focuses on hard ones:
import torch
import torch.nn.functional as F
def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    pt = torch.exp(-bce)  # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # weight positives by alpha, negatives by 1 - alpha
    loss = alpha_t * (1 - pt) ** gamma * bce
    return loss.mean()
With gamma=0 the loss reduces to (alpha-weighted) cross-entropy; gamma=2 is a typical starting point.
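A minimal usage sketch, assuming a model that outputs one raw logit per example (model, x_batch, and y_batch are placeholder names):
logits = model(x_batch).squeeze(-1)  # raw scores, shape (batch,)
targets = y_batch.float()            # 0/1 labels as floats
loss = focal_loss(logits, targets, alpha=0.25, gamma=2.0)
loss.backward()                      # backprop as with any other loss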
Cost-Sensitive Learning¶
Assign different misclassification costs. False negative (missing fraud) may cost 1000x more than false positive (flagging legitimate transaction).
# Custom scoring with business costs
def business_cost(y_true, y_pred):
    fn_cost = 1000  # missed fraud
    fp_cost = 10    # false alarm
    fn = ((y_true == 1) & (y_pred == 0)).sum()
    fp = ((y_true == 0) & (y_pred == 1)).sum()
    return fn * fn_cost + fp * fp_cost
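To use this in model selection, it can be wrapped as a scikit-learn scorer; a sketch (greater_is_better=False because lower cost is better, so sklearn reports negated costs):
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
cost_scorer = make_scorer(business_cost, greater_is_better=False)
scores = cross_val_score(model, X_train, y_train, scoring=cost_scorer, cv=5)  # higher (closer to 0) is better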
Threshold Tuning¶
The default threshold of 0.5 is almost never optimal for imbalanced data.
import numpy as np
from sklearn.metrics import precision_recall_curve
probas = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, probas)
# Find threshold for target recall (e.g., catch 90% of fraud)
# recalls has one more element than thresholds, so drop its last point
target_recall = 0.90
idx = np.argmin(np.abs(recalls[:-1] - target_recall))
optimal_threshold = thresholds[idx]
y_pred = (probas >= optimal_threshold).astype(int)
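Another common choice, sketched here, is the threshold that maximizes F1, reusing the arrays from precision_recall_curve above (the small epsilon guards against 0/0):
import numpy as np
f1_scores = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-12)
best_f1_threshold = thresholds[np.argmax(f1_scores)]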
Evaluation Strategy¶
- Stratified K-Fold: preserves class distribution in each fold
- AUPRC over AUROC: AUPRC is more informative when the positive class is rare
- Confusion matrix: always inspect raw TP/FP/TN/FN counts
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import average_precision_score
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
ap_scores = []
for train_idx, val_idx in skf.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])  # numpy indexing; use .iloc for DataFrames
    probas = fold_model.predict_proba(X[val_idx])[:, 1]
    ap_scores.append(average_precision_score(y[val_idx], probas))
print("Mean AP:", np.mean(ap_scores))
Gotchas¶
- Never SMOTE the test set: apply resampling only to the training data, inside the cross-validation loop (see the pipeline sketch after this list). Running SMOTE on the full dataset before splitting causes data leakage: synthetic test points are interpolations of training points, yielding inflated metrics
- SMOTE on high-dimensional sparse data fails: SMOTE interpolates in feature space, so for text (TF-IDF) or one-hot encoded categoricals the interpolated points are meaningless. Use class weights or random oversampling instead
- Imbalance ratio changes in production: if training data has 1% fraud but production sees 0.01%, the threshold and class weights need recalibration. Monitor the class distribution in incoming data
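A leakage-safe sketch using an imbalanced-learn Pipeline: the sampler runs only inside each fold's fit and never touches the validation split (the estimator choice here is illustrative):
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),  # applied to training folds only
    ('clf', RandomForestClassifier(random_state=42)),
])
scores = cross_val_score(pipe, X, y, scoring='average_precision', cv=5)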