Linear Models and Gradient Descent¶

Linear models are the foundation of ML. Simple, interpretable, and surprisingly effective as baselines. Understanding gradient descent is essential - it's how most ML models learn.

Linear Regression¶

Predict continuous target as weighted sum of features: y = w0 + w1x1 + w2x2 + ... + wn*xn

Matrix form: y = Xw where X includes a column of 1s for intercept.

Closed-form solution (OLS): w = (X^T X)^(-1) X^T y

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(model.coef_, model.intercept_)

With statsmodels (for inference)¶

import statsmodels.api as sm
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
model.summary()  # coefficients, t-stats, p-values, R^2, confidence intervals

Five OLS Assumptions¶

Linear relationship between features and target
Normally distributed errors with zero mean
Homoscedasticity (constant error variance)
Independent errors
Correct feature specification (no missing/redundant features)

Logistic Regression¶

Binary classification. Outputs probability via sigmoid: P(y=1|x) = 1 / (1 + exp(-w^T x))

Loss: Binary Cross-Entropy = -sum[y_i * log(p_i) + (1-y_i) * log(1-p_i)]

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # P(class=1)

Gradient: d(Loss)/dw = X^T(p - y) - elegant because sigmoid derivative sigma'(z) = sigma(z)(1 - sigma(z)).

Gradient Descent¶

Iterative optimization: w_new = w_old - lr * gradient(Loss, w)

Variants¶

Variant	Gradient Computed On	Tradeoff
Batch GD	Full dataset	Stable but slow
Stochastic GD (SGD)	Single sample	Fast but noisy
Mini-batch GD	Small batch (32-256)	Best of both

Learning Rate¶

Too high: diverges (loss oscillates or increases)
Too low: converges very slowly
Just right: smooth decrease in loss

Momentum¶

Accelerates through narrow valleys: v_t = beta * v_(t-1) + gradient; w -= lr * v_t

Convergence¶

For convex functions (like linear regression loss): GD converges to global minimum. For non-convex (neural networks): converges to local minimum - often sufficient in practice.

Regularization¶

Type	Penalty	sklearn Class
L2 (Ridge)	alpha * sum(w_j^2)	`Ridge(alpha=1.0)`
L1 (Lasso)	alpha * sum(\|w_j\|)	`Lasso(alpha=0.1)`
Elastic Net	alpha * [ratio * L1 + (1-ratio) * L2]	`ElasticNet(alpha=0.1, l1_ratio=0.5)`

alpha = 0: no regularization
alpha -> inf: all coefficients -> 0 (underfitting)
Choose alpha via cross-validation

Econometrics vs ML Perspective¶

Econometrics: interpretability, causal relationships, coefficient significance (p-values)
ML: prediction accuracy, doesn't care about individual coefficients
Linear models serve both: interpretable AND predictive baseline

Patterns¶

Polynomial Regression¶

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])
pipe.fit(X_train, y_train)

Regularization Selection¶

from sklearn.linear_model import RidgeCV, LassoCV
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5)
ridge.fit(X_train, y_train)
print(f"Best alpha: {ridge.alpha_}")

Gotchas¶

Linear regression requires feature scaling for gradient descent (closed-form doesn't need it, but sklearn uses iterative methods for Lasso/Elastic Net)
Multicollinearity inflates coefficient variance - use Ridge or drop correlated features
Logistic regression is a linear classifier - decision boundary is always a hyperplane
Don't interpret Lasso coefficients as "importance" when features are correlated - it arbitrarily picks one
R^2 can be negative (model worse than predicting mean)