MLOps and Feature Store¶

MLOps bridges data engineering and machine learning, covering experiment tracking, model versioning, feature management, and model serving. Data engineers build the infrastructure that ML teams depend on.

CRISP-DM Methodology¶

Business Understanding -> 2. Data Understanding -> 3. Data Preparation -> 4. Modeling -> 5. Evaluation -> 6. Deployment

Feature Store¶

Central repository for features, feature-building functions, models, and datasets.

Component	Description
Transformations	Precomputed and on-demand feature transforms
Storage	Online (Redis, DynamoDB) for serving; Offline (S3, HDFS, BigQuery) for training
Serving	Feature set config for inference-time preparation
Registry	Metadata about models, datasets, experiments
Monitoring	Data/model quality monitoring

MLflow Platform¶

Four core modules:

Module	Purpose
Models	Standard packaging format (MLmodel, model.pkl, conda.yaml)
Projects	Reproducible code organization
Tracking	Log parameters, metrics, code versions
Model Registry	Centralized model store with versioning

Experiment Tracking with PySpark¶

import mlflow
mlflow.set_tracking_uri("https://mlflow.example.com")
mlflow.set_experiment("PySpark-ML")

mlflow.start_run()
mlflow.log_param('MaxDepth', model.stages[-1].getMaxDepth())
mlflow.log_metric('accuracy', accuracy)
mlflow.log_metric('f1', f1_score)
mlflow.spark.log_model(model, "spark-model", registered_model_name="spark-model")
mlflow.end_run()

# Auto-tracking
mlflow.pyspark.ml.autolog()
pipeline.fit(train)  # triggers auto-tracking

Model Serving¶

Approach	Use Case
Microservices	REST API (Flask/FastAPI), Docker, K8s
Embedded	Edge/mobile deployment
Spark Pipeline	Batch scoring on large datasets

ML Versioning¶

Unlike software, latest model version is not necessarily best. Version = experiment ordinal number.

What to version: datasets, feature functions, trained models, training jobs, experiments.

Tools: ClearML, DVC, MLflow, Kubeflow Pipelines, Prefect, Dagster.

Gotchas¶

ML versioning differs from software versioning - newer != better
Feature Store online vs offline storage serves different latency requirements
Model serving via Spark pipeline is best for batch, not real-time
Always log model artifacts alongside metrics for reproducibility