ML System Design¶

Designing end-to-end ML systems that work in production. Covers problem framing, data pipelines, model selection, serving architecture, and monitoring. The model itself is typically a small part of the system, often under 10% of total complexity.

Problem Framing¶

Before any modeling, define:

Business objective: what metric moves the needle (revenue, engagement, cost)?
ML objective: what does the model predict? (classification, ranking, regression)
Data availability: what labeled data exists? Can you generate labels?
Constraints: latency budget, compute budget, fairness requirements
Baseline: what non-ML solution exists? (rules, heuristics)

Common framings:

Spam detection -> binary classification
Search ranking -> learning to rank (pairwise or listwise)
Recommendation -> collaborative filtering + content-based, served as ranking
Fraud -> anomaly detection + supervised classification
ETA prediction -> regression with confidence intervals

Data Pipeline Architecture¶

Raw Data -> Ingestion -> Validation -> Feature Engineering -> Feature Store
                                                                |
Training Pipeline <-------- Historical Features ----------------+
    |                                                           |
Model Registry                                                  |
    |                                                           |
Serving Pipeline <--------- Online Features --------------------+
    |
Predictions -> Logging -> Monitoring -> Feedback Loop

Data Validation¶

import great_expectations as gx

# Define expectations
suite = gx.ExpectationSuite(name="transaction_data")
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="amount")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="amount", min_value=0, max_value=1000000
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnProportionOfUniqueValuesToBeBetween(
        column="user_id", min_value=0.01  # not too many duplicates
    )
)
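
To make explicit what the three expectations above assert, here is a library-free sketch of the same checks over a list of dicts. The function name, the row schema, and the failure messages are hypothetical, and the null handling (nulls skipped in the range and uniqueness checks) is an assumption rather than a statement about how Great Expectations evaluates them.

```python
def validate_transactions(rows):
    """Return a list of human-readable failures; an empty list means the batch passed.

    `rows` is a list of dicts with "amount" and "user_id" keys (hypothetical schema).
    """
    failures = []
    amounts = [r.get("amount") for r in rows]
    non_null = [a for a in amounts if a is not None]

    # 1. amount must not be null
    null_count = len(amounts) - len(non_null)
    if null_count:
        failures.append(f"amount: {null_count} null value(s)")

    # 2. amount must lie in [0, 1_000_000] (inclusive; nulls skipped here)
    out_of_range = [a for a in non_null if not 0 <= a <= 1_000_000]
    if out_of_range:
        failures.append(f"amount: {len(out_of_range)} value(s) out of range")

    # 3. at least 1% of non-null user_id values must be distinct
    user_ids = [r["user_id"] for r in rows if r.get("user_id") is not None]
    if user_ids and len(set(user_ids)) / len(user_ids) < 0.01:
        failures.append("user_id: proportion of unique values below 0.01")

    return failures


batch = [
    {"user_id": "u1", "amount": 25.0},
    {"user_id": "u2", "amount": None},
    {"user_id": "u3", "amount": -5.0},
]
print(validate_transactions(batch))
```

In a pipeline, a non-empty result would typically stop the batch before feature engineering rather than let bad rows reach the feature store.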

Feature Engineering Patterns¶

Temporal Features¶

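import pandas as pd

# Time-based aggregations per entity over trailing windows
def build_temporal_features(df, entity_col, time_col, value_col):
    # Offset windows such as '7D' need a sorted datetime index
    indexed = df.sort_values(time_col).set_index(time_col)
    grouped = indexed.groupby(entity_col)[value_col]
    features = {}
    for window in ['7D', '30D', '90D']:
        # Each window ends at, and includes, the current row
        rolled = grouped.rolling(window)
        features[f'{value_col}_mean_{window}'] = rolled.mean()
        features[f'{value_col}_std_{window}'] = rolled.std()
        features[f'{value_col}_count_{window}'] = rolled.count()
    # Result is indexed by (entity_col, time_col)
    return pd.DataFrame(features)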

Interaction Features¶

# Ratios, differences, products between features
df['price_per_sqft'] = df['price'] / df['sqft']
df['income_to_debt'] = df['income'] / (df['debt'] + 1)
df['recency_x_frequency'] = df['days_since_last'] * df['purchase_count']

Target Encoding (for high-cardinality categoricals)¶

from category_encoders import TargetEncoder

te = TargetEncoder(cols=['zip_code', 'merchant_id'], smoothing=10)
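# Fit on training data only, so test targets never leak into the encoding
X_encoded = te.fit_transform(X_train, y_train)
# Reuse the fitted encoder on the test set; do not refit it there
X_test_encoded = te.transform(X_test)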
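
What smoothing does can be illustrated with a simple additive (m-estimate) form: each category mean is pulled toward the global mean, and categories with few rows are pulled hardest. This is a sketch of the general idea, not necessarily the formula `category_encoders` applies for its `smoothing` parameter; the function names and the `m` parameter are hypothetical.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, m=10.0):
    """Learn a smoothed mean target per category from training data only."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Shrink each category mean toward the global mean; rare categories shrink most.
    mapping = {
        cat: (sums[cat] + m * global_mean) / (counts[cat] + m)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen ones fall back to the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


train_zip = ["A", "A", "A", "A", "B"]
train_y = [1, 1, 1, 0, 1]
mapping, prior = fit_target_encoding(train_zip, train_y, m=10.0)
encoded_test = apply_target_encoding(["A", "B", "C"], mapping, prior)
```

Category "B" has a single positive row, so its raw mean would be 1.0; with smoothing it stays close to the global mean instead, which is the behavior that protects against overfitting on rare categories.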

Model Selection Decision Tree¶

Tabular data?
  Yes -> Start with CatBoost/XGBoost
    Need uncertainty? -> Bayesian methods or conformal prediction
    Need interpretability? -> SHAP on tree model or linear model
  No -> What modality?
    Text -> Pretrained transformer (fine-tune or embed)
    Image -> Pretrained CNN/ViT (fine-tune)
    Sequence/Time -> LSTM, Transformer, or specialized (Prophet, N-BEATS)
    Graph -> GNN (PyG)

Data size < 1000?
  Yes -> Simple model (logistic regression, small RF) + strong regularization
Data size > 1M?
  Yes -> Consider online learning, distributed training, or sampling

Serving Architectures¶

Online Serving (Real-Time)¶

Latency < 100ms per request
REST API (FastAPI, Flask) or gRPC
Model loaded in memory, precomputed features from feature store
Horizontal scaling behind load balancer

Batch Serving¶

Process all data periodically (hourly/daily)
Cheaper, simpler, higher throughput
Use for recommendations, email targeting, report generation
Store predictions in database/cache for lookup

Streaming (Near Real-Time)¶

Process events as they arrive (Kafka/Flink)
For fraud detection, real-time personalization
Model deployed as stream processor
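
A stream processor differs from a request handler mainly in that it keeps state between events. The sketch below is an in-memory stand-in for a Kafka/Flink job: it maintains a per-user sliding window and scores each event as it arrives. The event schema, the function name, and the threshold rule that stands in for a real model are all hypothetical.

```python
from collections import defaultdict, deque


def score_stream(events, window_seconds=60, max_events_in_window=3):
    """Yield (event, flagged) pairs, keeping per-user state between events.

    Each event is a dict with "user_id" and "ts" (seconds); events are assumed
    to arrive in timestamp order. The threshold rule stands in for a real model.
    """
    recent = defaultdict(deque)  # user_id -> timestamps inside the window
    for event in events:
        window = recent[event["user_id"]]
        window.append(event["ts"])
        # Evict timestamps that have fallen out of the sliding window.
        while window and window[0] <= event["ts"] - window_seconds:
            window.popleft()
        velocity = len(window)  # online feature: events in the last N seconds
        yield event, velocity > max_events_in_window


events = [
    {"user_id": "u1", "ts": 0},
    {"user_id": "u1", "ts": 10},
    {"user_id": "u1", "ts": 20},
    {"user_id": "u1", "ts": 30},
    {"user_id": "u2", "ts": 35},
    {"user_id": "u1", "ts": 200},
]
flags = [flagged for _, flagged in score_stream(events)]
```

Only the fourth event is flagged here: it is the first moment user u1 exceeds three events inside one 60-second window.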

A/B Testing ML Models¶

# Traffic splitting
import hashlib

def get_variant(user_id, experiment_id, traffic_pct=0.1):
    hash_key = f"{experiment_id}:{user_id}"
    hash_val = int(hashlib.md5(hash_key.encode()).hexdigest(), 16)
    bucket = hash_val % 1000

    if bucket < traffic_pct * 1000:
        return "treatment"  # new model
    return "control"        # existing model

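Statistical power: collect enough samples to detect the effect size you care about. For small effect sizes (< 1% lift), this may take millions of observations. Use a sequential testing procedure to stop early when results are clear; repeatedly checking a fixed-horizon test instead inflates the false positive rate.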
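
The sample-size claim can be checked with the standard normal-approximation formula for a two-sided two-proportion z-test. This is a rough planning sketch; the function name and the default `alpha` and `power` values are assumptions, and it ignores sequential-testing corrections.

```python
import math
from statistics import NormalDist


def samples_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# 5% baseline conversion, 1% vs 10% relative lift
n_small = samples_per_arm(baseline_rate=0.05, relative_lift=0.01)
n_large = samples_per_arm(baseline_rate=0.05, relative_lift=0.10)
```

With these inputs, detecting a 1% relative lift needs about three million users per arm, while a 10% lift needs on the order of thirty thousand. Required sample size grows with the inverse square of the effect size.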

Feedback Loops¶

Direct feedback: user clicks / doesn't click -> label within minutes
Delayed feedback: fraud confirmed days/weeks later
Implicit feedback: no negative signal (user didn't complain != user is happy)
Feedback loop danger: the model influences what data it sees. A recommendation model that never shows item X will never learn whether X is good
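
One mitigation for the feedback-loop danger is to let a small random fraction of otherwise-suppressed cases through and record that it happened. The sketch below uses a fraud-blocking setting; the function name, threshold, and `explore_rate` are hypothetical, and the hash-based slice is one possible design so that retries of the same transaction get the same decision.

```python
import hashlib


def decide(transaction_id, fraud_score, threshold=0.9, explore_rate=0.01):
    """Return (action, explored) for a scored transaction.

    A deterministic hash of the transaction id picks the exploration slice, so
    the same transaction always gets the same decision on retry.
    """
    digest = hashlib.sha256(f"explore:{transaction_id}".encode()).hexdigest()
    in_explore_slice = int(digest, 16) % 10_000 < explore_rate * 10_000
    if fraud_score < threshold:
        return "allow", False
    if in_explore_slice:
        # Flagged but allowed through: its eventual outcome gives a label
        # the model would otherwise never observe.
        return "allow", True
    return "block", False


decisions = [decide(f"txn-{i}", fraud_score=0.95) for i in range(20_000)]
explored = sum(1 for _, was_explored in decisions if was_explored)
```

Logging the `explored` flag alongside the outcome matters: those rows are the ones whose labels were not filtered by the model's own decisions.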

Monitoring Checklist¶

What	How	Alert When
Input data quality	Schema validation, null checks	Schema violation, null rate > threshold
Feature drift	KS-test, PSI on feature distributions	p-value < 0.001 or PSI > 0.2
Prediction drift	Distribution of model outputs	Mean/variance shift beyond 2 sigma
Model performance	Actual vs predicted (when labels available)	Metric drops > 5% from baseline
Latency	P50/P95/P99 response time	P99 > SLA
Throughput	Requests per second	Drop > 20% from normal
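
The PSI threshold in the table refers to the Population Stability Index, which sums `(actual - expected) * ln(actual / expected)` over bins of a feature. Below is a minimal sketch; the function name, the `eps` floor for empty bins, and the choice of fixed cut points from the reference data are assumptions.

```python
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a reference and a current sample.

    `bin_edges` are sorted interior cut points, fixed from the reference data.
    """
    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of cut points at or below v.
            idx = sum(1 for edge in bin_edges if v >= edge)
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    exp_p = proportions(expected)
    act_p = proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_p, act_p))


reference = list(range(100))             # training-time feature values
edges = [25, 50, 75]                     # quartile cut points of the reference
shifted = [v + 30 for v in reference]    # same shape, shifted upward

psi_same = psi(reference, reference, edges)
psi_shifted = psi(reference, shifted, edges)
```

An identical distribution gives a PSI of zero, while the shifted sample lands well above the 0.2 alert level. Note that the `eps` floor makes empty bins contribute heavily, so the exact value is sensitive to binning.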

Gotchas¶

Training-serving skew is among the most common production ML bugs: a feature is computed differently offline (SQL with full history) than online (real-time with partial data). Use a feature store or shared feature engineering code. Test by comparing offline predictions with production predictions on the same inputs.
Feedback loop bias: if a fraud model blocks transactions, you never see the outcomes of the blocked ones. Training on that biased data reinforces existing biases. Randomly let through a small % of flagged cases (exploration) or use causal inference techniques.
Premature optimization: start with the simplest model that could work (logistic regression, single tree) and establish baseline metrics. Complex models are justified only when simple ones clearly fall short. Many production systems run on logistic regression.
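
The skew test described in the first gotcha (comparing offline and production predictions on the same inputs) can be sketched as a join on request id. The function name, the dict-based inputs, and the tolerance are hypothetical; in practice the two sides would come from a backfill job and the serving logs.

```python
def skew_report(offline_preds, online_preds, tolerance=1e-6):
    """Compare offline and online predictions keyed by request id.

    Returns the ids whose predictions differ by more than `tolerance`, plus
    ids present on only one side (often a sign of a logging or join gap).
    """
    shared = offline_preds.keys() & online_preds.keys()
    mismatched = sorted(
        rid for rid in shared
        if abs(offline_preds[rid] - online_preds[rid]) > tolerance
    )
    unmatched = sorted(offline_preds.keys() ^ online_preds.keys())
    return {
        "compared": len(shared),
        "mismatched": mismatched,
        "unmatched": unmatched,
    }


offline = {"req-1": 0.12, "req-2": 0.87, "req-3": 0.40}
online = {"req-1": 0.12, "req-2": 0.61, "req-4": 0.05}
report = skew_report(offline, online)
```

Here `req-2` is the interesting case: same request, different score, which points at a feature computed differently on the two paths.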