Dimensionality Reduction¶

Reducing the number of features while preserving the important information. Two purposes: visualization (project to 2D/3D) and preprocessing (remove noise, speed up training, mitigate the curse of dimensionality). Two approaches: feature selection (keep a subset of the original features) and feature extraction (build new features from combinations of the originals).

PCA (Principal Component Analysis)¶

Linear projection onto directions of maximum variance.
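To make "directions of maximum variance" concrete, here is a minimal NumPy sketch of the computation behind PCA: center the data, take the SVD, and project onto the leading right singular vectors. The helper name `pca_svd` and the synthetic data are illustrative assumptions; in practice the scikit-learn class is the better choice.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X onto its top principal components using SVD (illustrative helper)."""
    X_centered = X - X.mean(axis=0)              # PCA operates on centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]               # rows are the directions of maximum variance
    explained_variance = S**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return X_centered @ components.T, components, ratio[:n_components]

rng = np.random.default_rng(0)
# Synthetic data: the first two features are strongly correlated, the third is independent
base = rng.normal(size=(500, 1))
X_demo = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

X_proj, components, ratio = pca_svd(X_demo, n_components=2)
print("Explained variance ratio:", ratio)
```

The first component should absorb most of the variance here because two of the three features move together.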

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale before PCA unless all features already share the same units
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA
pca = PCA(n_components=0.95)  # keep 95% of variance
X_pca = pca.fit_transform(X_scaled)

print(f"Reduced from {X.shape[1]} to {X_pca.shape[1]} features")
print(f"Explained variance per component: {pca.explained_variance_ratio_}")

# Cumulative explained variance plot to choose n_components
# (fit with all components; the 95% fit above is already truncated)
import matplotlib.pyplot as plt
pca_full = PCA().fit(X_scaled)
cumvar = pca_full.explained_variance_ratio_.cumsum()
plt.plot(range(1, len(cumvar) + 1), cumvar)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.axhline(y=0.95, color='r', linestyle='--')
plt.show()

When to use: preprocessing before ML, denoising, multicollinearity removal, visualization if data is roughly linear.
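The scaling advice in the snippet above can be checked on synthetic data. The sketch below, with made-up features, compares the first principal component with and without standardization: one independent feature has a huge range, while two small-range features share a common signal.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
# Hypothetical features: two correlated small-scale columns, one independent large-scale column
X_demo = np.column_stack([
    signal + 0.1 * rng.normal(size=n),
    signal + 0.1 * rng.normal(size=n),
    1000 * rng.normal(size=n),
])

pca_raw = PCA(n_components=1).fit(X_demo)
pca_scaled = PCA(n_components=1).fit(StandardScaler().fit_transform(X_demo))

# Absolute loadings of the first component on each original feature
raw_loading = np.abs(pca_raw.components_[0])
scaled_loading = np.abs(pca_scaled.components_[0])
print("unscaled:", raw_loading.round(3))
print("scaled:  ", scaled_loading.round(3))
```

Without scaling, the first component should point almost entirely along the large-range column; after scaling it should follow the shared signal instead.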

Incremental PCA (for large datasets)¶

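from sklearn.decomposition import IncrementalPCA

# batch_size is only used by fit(); partial_fit uses each batch as given,
# and each batch must contain at least n_components samples
ipca = IncrementalPCA(n_components=50, batch_size=1000)
for batch in data_loader:  # data_loader: any iterable of 2D arrays
    ipca.partial_fit(batch)
X_reduced = ipca.transform(X)  # if X does not fit in memory, transform batch by batch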
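A self-contained version of the same pattern might look like the sketch below. The generator `iter_batches` is a hypothetical stand-in for a real data loader, and the sizes are arbitrary. It fits in one pass and transforms in a second pass, so the full-width matrix is never loaded at once.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def iter_batches(n_batches=10, batch_size=200, n_features=20, seed=0):
    """Hypothetical data loader: yields one NumPy batch at a time (same data on every call)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        yield rng.normal(size=(batch_size, n_features))

ipca = IncrementalPCA(n_components=5)

# First pass: update the fit one batch at a time
for batch in iter_batches():
    ipca.partial_fit(batch)

# Second pass: reduce each batch separately and stack the (much narrower) results
X_reduced = np.vstack([ipca.transform(batch) for batch in iter_batches()])
print(X_reduced.shape)
```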

t-SNE¶

Non-linear. Preserves local structure. Best for visualization, not preprocessing.

from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity=30,       # typically 5-50, must be less than n_samples
    learning_rate='auto',
    max_iter=1000,       # named n_iter in scikit-learn < 1.5
    random_state=42
)
X_2d = tsne.fit_transform(X_scaled)

# labels: one class or cluster label per row, used only for coloring
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10', s=5, alpha=0.7)

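Perplexity: controls the effective number of neighbors each point considers. Low values (5-10) emphasize very local neighborhoods and tend to produce many small, tight clusters; high values (30-50) retain more global structure. Try multiple values and compare the plots.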
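Trying several perplexity values can be sketched as a simple loop. The blob data below is synthetic and the three values are just examples; the iteration-count argument is left at its default so the snippet does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Small synthetic dataset so each fit takes only a moment
X_demo, y_demo = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_demo)

for perplexity, emb in embeddings.items():
    print(perplexity, emb.shape)
```

Each entry of `embeddings` can then be plotted side by side, colored by `y_demo`. Structure that shows up at every perplexity is more trustworthy than structure that appears at only one.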

UMAP¶

Usually faster than t-SNE on large datasets, tends to preserve more global structure, and can be used for preprocessing because a fitted model can transform new data (unlike t-SNE).

import umap  # provided by the umap-learn package

reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,       # local vs global balance
    min_dist=0.1,         # how tight clusters are (0.0 = very tight)
    metric='euclidean',
    random_state=42
)
X_2d = reducer.fit_transform(X_scaled)

# UMAP can transform new data (t-SNE cannot); apply the same fitted scaler first
X_new_2d = reducer.transform(scaler.transform(X_new))

n_neighbors: low (5-15) = local structure, high (50-200) = global structure. min_dist: low = tighter clusters, high = spread out.

Feature Selection¶

Filter Methods¶

Score features independently, select top-k:

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# ANOVA F-test (for classification); X is assumed to be a pandas DataFrame
selector = SelectKBest(f_classif, k=20)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]

# Mutual information (captures non-linear relationships)
mi_selector = SelectKBest(mutual_info_classif, k=20)
X_mi = mi_selector.fit_transform(X, y)

# Variance threshold (remove near-constant features); apply to unscaled data,
# since standardized features all have variance 1
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold(threshold=0.01)
X_filtered = vt.fit_transform(X)
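The difference between the two scoring functions shows up when the relationship is non-linear. In the synthetic sketch below, the label depends only on the magnitude of feature 0, so both classes have roughly the same mean on that feature. The F-test, which compares class means, generally has little to detect, while mutual information should still flag the feature.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(2000, 3))
# Label depends on feature 0 through a symmetric, non-linear rule; features 1 and 2 are noise
y_demo = (np.abs(X_demo[:, 0]) > 0.5).astype(int)

f_scores, p_values = f_classif(X_demo, y_demo)
mi_scores = mutual_info_classif(X_demo, y_demo, random_state=0)

print("F-test p-values:   ", p_values.round(3))
print("Mutual information:", mi_scores.round(3))
```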

Wrapper Methods¶

Use model performance to evaluate feature subsets:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Recursive Feature Elimination
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=100),
    n_features_to_select=15,
    step=5  # remove 5 features per iteration
)
rfe.fit(X_train, y_train)
selected = X_train.columns[rfe.support_]  # assumes X_train is a pandas DataFrame
rankings = rfe.ranking_  # 1 = selected
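When the right number of features is not known in advance, scikit-learn's `RFECV` runs the same elimination under cross-validation and keeps the subset size with the best score. This is a minimal sketch on synthetic data; the estimator and dataset sizes are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic problem: 15 features, 4 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=15, n_informative=4, n_redundant=0, random_state=0
)

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,              # drop one feature per iteration
    cv=5,
    scoring="accuracy",
)
rfecv.fit(X_demo, y_demo)

print("Features kept:", rfecv.n_features_)
print("Mask:", rfecv.support_)
```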

Embedded Methods¶

Feature importance from the model itself:

# Tree-based importance
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train, y_train)
importances = model.feature_importances_

# Permutation importance (model-agnostic, more reliable)
from sklearn.inspection import permutation_importance

perm_imp = permutation_importance(model, X_val, y_val, n_repeats=10)
sorted_idx = perm_imp.importances_mean.argsort()[::-1]

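# L1 regularization (Lasso) for linear models. LassoCV is a regressor, so y must be numeric;
# for classification, LogisticRegression with an L1 penalty plays the same role.
from sklearn.linear_model import LassoCV
lasso = LassoCV(cv=5).fit(X_scaled, y)
selected = X.columns[lasso.coef_ != 0]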
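As a runnable illustration of permutation importance, the sketch below uses a synthetic dataset where the informative columns are known in advance, so the ranking can be sanity-checked. The random forest and all sizes are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Importance = drop in validation score when one column is shuffled
result = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)

ranking = result.importances_mean.argsort()[::-1]
for idx in ranking[:5]:
    print(idx, result.importances_mean[idx].round(4), result.importances_std[idx].round(4))
```

Computing the importances on held-out data, as here, measures what the model needs in order to generalize rather than what it memorized during training.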

Autoencoders for Dimensionality Reduction¶

Non-linear alternative to PCA:

import torch.nn as nn

class DimensionReducer(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim)
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

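# Use encoder output as reduced features.
# The model must first be trained to minimize reconstruction error (e.g., MSE between input and output).
import torch

model = DimensionReducer(input_dim=X_tensor.shape[1], latent_dim=16)  # latent_dim=16 is an example value
# ... training loop goes here ...
model.eval()
with torch.no_grad():
    _, X_reduced = model(X_tensor)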
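The encoder output is only useful once the network has been trained to reconstruct its input. The sketch below shows one plausible training loop with an MSE reconstruction loss. The compact `TinyAutoencoder` class, the synthetic data, and the hyperparameters are all assumptions for illustration, not tuned settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Compact encoder/decoder, redefined here so the sketch is self-contained
class TinyAutoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 16), nn.ReLU(), nn.Linear(16, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Synthetic data: 20 observed features generated from 3 underlying factors
X_tensor = torch.randn(512, 3) @ torch.randn(3, 20)

model = TinyAutoencoder(input_dim=20, latent_dim=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

losses = []
model.train()
for epoch in range(200):          # full-batch training keeps the example short
    optimizer.zero_grad()
    reconstruction, _ = model(X_tensor)
    loss = loss_fn(reconstruction, X_tensor)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# After training, the encoder output serves as the reduced representation
model.eval()
with torch.no_grad():
    _, X_reduced = model(X_tensor)
print(X_reduced.shape, losses[0], losses[-1])
```

For real data, mini-batches, a held-out set for early stopping, and standardized inputs are the usual additions.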

Gotchas¶

t-SNE cluster sizes and distances are meaningless: t-SNE distorts distances to preserve local neighborhoods. A large cluster in a t-SNE plot may not be larger in the original space. Do not interpret inter-cluster distances or cluster sizes. Use t-SNE to check whether clusters exist, not to measure them.
PCA on unscaled data is dominated by high-variance features: if one feature ranges over 0-1000 and another over 0-1, the leading components will follow the large-range feature regardless of its importance. Apply StandardScaler before PCA. Exception: features that already share the same units (e.g., all pixel values in 0-255) can be left unscaled.
Feature selection must happen inside cross-validation: selecting features on the full dataset and then running CV leaks information from the validation folds into training. Use a Pipeline so that selection is fit only on the training data of each fold.

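from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(f_classif, k=20)),
    ('model', RandomForestClassifier())
])
scores = cross_val_score(pipe, X, y, cv=5)  # correct: no leakage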
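The size of the leak is easiest to see on pure noise, where any accuracy well above chance must come from the evaluation procedure. The sketch below uses random features and random labels (all sizes arbitrary) and compares selecting features before cross-validation against selecting inside a Pipeline.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
# Pure noise: labels are independent of the features, so honest accuracy should sit near 0.5
X_noise = rng.normal(size=(100, 2000))
y_noise = rng.integers(0, 2, size=100)

# Leaky: features are chosen using every label, including those of the validation folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Correct: the selector is refit on the training part of each fold
honest_pipe = Pipeline([
    ("selector", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_score = cross_val_score(honest_pipe, X_noise, y_noise, cv=5).mean()

print("selection before CV:", round(leaky_score, 3))
print("selection inside CV:", round(honest_score, 3))
```

The leaky estimate should come out far above chance even though there is nothing to learn, while the pipeline estimate should stay close to 0.5.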