MLOps and Feature Store¶
★★★★★ Basic
MLOps bridges data engineering and machine learning, covering experiment tracking, model versioning, feature management, and model serving. Data engineers build the infrastructure that ML teams depend on.
CRISP-DM Methodology¶
- Business Understanding -> 2. Data Understanding -> 3. Data Preparation -> 4. Modeling -> 5. Evaluation -> 6. Deployment
Feature Store¶
Central repository for features, feature-building functions, models, and datasets.
| Component | Description |
|---|---|
| Transformations | Precomputed and on-demand feature transforms |
| Storage | Online (Redis, DynamoDB) for serving; Offline (S3, HDFS, BigQuery) for training |
| Serving | Feature set config for inference-time preparation |
| Registry | Metadata about models, datasets, experiments |
| Monitoring | Data/model quality monitoring |
MLflow Platform¶
Four core modules:
| Module | Purpose |
|---|---|
| Models | Standard packaging format (MLmodel, model.pkl, conda.yaml) |
| Projects | Reproducible code organization |
| Tracking | Log parameters, metrics, code versions |
| Model Registry | Centralized model store with versioning |
Experiment Tracking with PySpark¶
import mlflow
mlflow.set_tracking_uri("https://mlflow.example.com")
mlflow.set_experiment("PySpark-ML")
mlflow.start_run()
mlflow.log_param('MaxDepth', model.stages[-1].getMaxDepth())
mlflow.log_metric('accuracy', accuracy)
mlflow.log_metric('f1', f1_score)
mlflow.spark.log_model(model, "spark-model", registered_model_name="spark-model")
mlflow.end_run()
# Auto-tracking
mlflow.pyspark.ml.autolog()
pipeline.fit(train) # triggers auto-tracking
Model Serving¶
| Approach | Use Case |
|---|---|
| Microservices | REST API (Flask/FastAPI), Docker, K8s |
| Embedded | Edge/mobile deployment |
| Spark Pipeline | Batch scoring on large datasets |
ML Versioning¶
Unlike software, latest model version is not necessarily best. Version = experiment ordinal number.
What to version: datasets, feature functions, trained models, training jobs, experiments.
Tools: ClearML, DVC, MLflow, Kubeflow Pipelines, Prefect, Dagster.
Gotchas¶
- ML versioning differs from software versioning - newer != better
- Feature Store online vs offline storage serves different latency requirements
- Model serving via Spark pipeline is best for batch, not real-time
- Always log model artifacts alongside metrics for reproducibility
See Also¶
- apache spark core - Spark ML integration
- etl elt pipelines - feature pipeline patterns
- kubernetes for de - serving infrastructure
- data quality - model monitoring