Feature Engineering and Preprocessing¶
The art of transforming raw data into features that ML models can use effectively. Often the single most impactful step in a DS project - good features beat complex models.
Feature Scaling¶
Linear models and neural networks are sensitive to feature scales. Tree-based models (CatBoost, Random Forest) are NOT.
StandardScaler (Z-score)¶
Transforms to mean=0, std=1: z = (x - mean) / std
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit on train ONLY
X_test_scaled = scaler.transform(X_test) # transform with train stats
MinMaxScaler¶
Scales to [0, 1]: x_scaled = (x - x_min) / (x_max - x_min)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)
Critical rule: fit scaler on training data ONLY, then transform both train and test. Fitting on test = data leakage.
Categorical Encoding¶
One-Hot Encoding¶
N categories -> N binary columns. Use for nominal (unordered) categories.
pd.get_dummies(df['col'], drop_first=True) # drop_first avoids multicollinearity
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first', sparse_output=False)
Label Encoding¶
Maps categories to integers. Only for ordinal data (has natural order).
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['encoded'] = le.fit_transform(df['col'])
Target Encoding¶
Replace category with mean of target for that category. CatBoost does this internally with regularization. High risk of data leakage without proper CV-based encoding.
Missing Value Imputation¶
from sklearn.impute import SimpleImputer
# Numerical: median is robust to outliers
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
# Categorical: most frequent or constant
imputer_cat = SimpleImputer(strategy='most_frequent')
Pro tip: create binary indicator is_missing before imputing - missingness itself can be informative.
Feature Selection¶
Filter Methods (model-independent)¶
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif
# Remove near-constant features
sel = VarianceThreshold(threshold=0.01)
X_filtered = sel.fit_transform(X)
# Mutual information (captures non-linear relationships)
mi_scores = mutual_info_classif(X, y)
Wrapper Methods (model-dependent)¶
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
selector = RFE(LogisticRegression(), n_features_to_select=10)
X_selected = selector.fit_transform(X, y)
Embedded Methods (built into model)¶
- L1/Lasso: pushes unimportant weights to exactly zero
- Tree-based importance:
model.feature_importances_from Random Forest, CatBoost
Regularization¶
Prevents overfitting by penalizing large coefficients.
| Type | Penalty | Effect | When to Use |
|---|---|---|---|
| L2 (Ridge) | sum(beta_j^2) | Shrinks all, never zero | Multicollinearity |
| L1 (Lasso) | sum(|beta_j|) | Pushes some to exactly 0 | Feature selection |
| Elastic Net | L1 + L2 | Combined | Groups of correlated features |
Why L1 produces zeros: L1 ball has corners on axes - optimal point frequently at a corner. L2 ball is smooth - tangent point almost never on an axis.
from sklearn.linear_model import Ridge, Lasso, ElasticNet
ridge = Ridge(alpha=1.0) # alpha = regularization strength
lasso = Lasso(alpha=0.1)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
Sklearn Pipelines¶
Chain preprocessing + model. Prevents data leakage.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer([
('num', StandardScaler(), numerical_cols),
('cat', OneHotEncoder(drop='first'), categorical_cols)
])
pipe = Pipeline([
('preprocess', preprocessor),
('model', LogisticRegression())
])
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
Feature Engineering Techniques¶
- Domain-based: use domain knowledge (lat/lon -> distance, date -> day_of_week)
- Interaction features: x1 * x2 captures joint effects
- Polynomial features: x^2, x^3 for non-linear relationships
- Aggregation features: groupby statistics (mean, count, std) of related entities
- Time-based: day_of_week, month, hour, is_weekend, days_since_event
- Text-based: word count, TF-IDF features, n-grams
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=False)
X_poly = poly.fit_transform(X)
Non-linear Correlation Detection¶
# phik - detects non-linear AND works with categoricals
import phik
corr_matrix = df.phik_matrix()
df.phik_matrix()['target'].sort_values(ascending=False)
Always verify phik findings with pivot tables. Phik shows WHERE to look, pivot tables confirm.
Gotchas¶
- Data leakage via preprocessing: fit scaler/encoder on test data, or compute target encoding without CV
- One-hot explosion: 1000 categories = 1000 features. Use target encoding or embeddings instead
- Label encoding for nominal: gives false ordinal relationship (model thinks category 3 > category 1)
- Polynomial features: degree=3 with 100 features = millions of features. Use
interaction_only=Trueor manually select - Feature importance from single tree model is unreliable - use permutation importance or average over ensemble
See Also¶
- pandas eda - data exploration before engineering
- linear models - models most affected by feature engineering
- gradient boosting - models with built-in feature handling
- model evaluation - evaluating impact of features on performance