Feature Engineering and Preprocessing¶

The art of transforming raw data into features that ML models can use effectively. Often the single most impactful step in a DS project - good features beat complex models.

Feature Scaling¶

Linear models and neural networks are sensitive to feature scales. Tree-based models (CatBoost, Random Forest) are NOT.

StandardScaler (Z-score)¶

Transforms to mean=0, std=1: z = (x - mean) / std

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train ONLY
X_test_scaled = scaler.transform(X_test)         # transform with train stats

MinMaxScaler¶

Scales to [0, 1]: x_scaled = (x - x_min) / (x_max - x_min)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)

Critical rule: fit scaler on training data ONLY, then transform both train and test. Fitting on test = data leakage.

Categorical Encoding¶

One-Hot Encoding¶

N categories -> N binary columns. Use for nominal (unordered) categories.

pd.get_dummies(df['col'], drop_first=True)  # drop_first avoids multicollinearity

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first', sparse_output=False)

Label Encoding¶

Maps categories to integers. Only for ordinal data (has natural order).

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['encoded'] = le.fit_transform(df['col'])

Target Encoding¶

Replace category with mean of target for that category. CatBoost does this internally with regularization. High risk of data leakage without proper CV-based encoding.

Missing Value Imputation¶

from sklearn.impute import SimpleImputer

# Numerical: median is robust to outliers
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# Categorical: most frequent or constant
imputer_cat = SimpleImputer(strategy='most_frequent')

Pro tip: create binary indicator is_missing before imputing - missingness itself can be informative.

Feature Selection¶

Filter Methods (model-independent)¶

from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

# Remove near-constant features
sel = VarianceThreshold(threshold=0.01)
X_filtered = sel.fit_transform(X)

# Mutual information (captures non-linear relationships)
mi_scores = mutual_info_classif(X, y)

Wrapper Methods (model-dependent)¶

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

selector = RFE(LogisticRegression(), n_features_to_select=10)
X_selected = selector.fit_transform(X, y)

Embedded Methods (built into model)¶

L1/Lasso: pushes unimportant weights to exactly zero
Tree-based importance: model.feature_importances_ from Random Forest, CatBoost

Regularization¶

Prevents overfitting by penalizing large coefficients.

Type	Penalty	Effect	When to Use
L2 (Ridge)	sum(beta_j^2)	Shrinks all, never zero	Multicollinearity
L1 (Lasso)	sum(\|beta_j\|)	Pushes some to exactly 0	Feature selection
Elastic Net	L1 + L2	Combined	Groups of correlated features

Why L1 produces zeros: L1 ball has corners on axes - optimal point frequently at a corner. L2 ball is smooth - tangent point almost never on an axis.

from sklearn.linear_model import Ridge, Lasso, ElasticNet
ridge = Ridge(alpha=1.0)    # alpha = regularization strength
lasso = Lasso(alpha=0.1)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)

Sklearn Pipelines¶

Chain preprocessing + model. Prevents data leakage.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_cols),
    ('cat', OneHotEncoder(drop='first'), categorical_cols)
])

pipe = Pipeline([
    ('preprocess', preprocessor),
    ('model', LogisticRegression())
])
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)

Feature Engineering Techniques¶

Domain-based: use domain knowledge (lat/lon -> distance, date -> day_of_week)
Interaction features: x1 * x2 captures joint effects
Polynomial features: x^2, x^3 for non-linear relationships
Aggregation features: groupby statistics (mean, count, std) of related entities
Time-based: day_of_week, month, hour, is_weekend, days_since_event
Text-based: word count, TF-IDF features, n-grams

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=False)
X_poly = poly.fit_transform(X)

Non-linear Correlation Detection¶

# phik - detects non-linear AND works with categoricals
import phik
corr_matrix = df.phik_matrix()
df.phik_matrix()['target'].sort_values(ascending=False)

Always verify phik findings with pivot tables. Phik shows WHERE to look, pivot tables confirm.

Gotchas¶

Data leakage via preprocessing: fit scaler/encoder on test data, or compute target encoding without CV
One-hot explosion: 1000 categories = 1000 features. Use target encoding or embeddings instead
Label encoding for nominal: gives false ordinal relationship (model thinks category 3 > category 1)
Polynomial features: degree=3 with 100 features = millions of features. Use interaction_only=True or manually select
Feature importance from single tree model is unreliable - use permutation importance or average over ensemble