MLOps 07 - CI/CD for ML
Is Testing Code Enough?
In traditional software, CI/CD is a well-established concept. You commit code, tests run automatically, and if they pass, the build gets deployed. But for ML systems, this approach alone falls short.
An ML system's behavior is determined by three axes: code, data, and models. Even if the code is flawless, problematic training data leads to a model that makes wrong predictions. Even if the model's accuracy is high, a serving pipeline whose preprocessing logic differs from training time will produce errors in production. Ultimately, you need a CI/CD framework that validates all three.
This post examines how ML CI/CD differs from traditional software CI/CD and covers automation strategies that span data validation, model validation, and pipeline validation.
How ML CI/CD Differs from Software CI/CD
In software CI/CD, the test target is code. Unit tests, integration tests, and E2E tests verify code correctness, and if they pass, you deploy. It's deterministic: the same input guarantees the same output.
In ML CI/CD, the situation is different. Model training is inherently non-deterministic. Even with the same code and data, random seeds, GPU computation order, and hyperparameter search can produce different results. The pass criteria shift from "pass or fail" to "does the performance metric exceed a certain threshold."
| Aspect | Software CI/CD | ML CI/CD |
|---|---|---|
| Test Target | Code | Code + Data + Model |
| Result Judgment | Pass/Fail | Performance threshold-based |
| Reproducibility | Deterministic | Non-deterministic elements present |
| Artifacts | Binaries, containers | Model files, metadata |
| Post-deploy Validation | Functional tests | Model performance monitoring |
Applying software CI/CD methodology as-is without recognizing these differences leads to situations where the code builds without issues, but the model fails in production.
Data Testing
The first thing to validate in an ML pipeline is the data, since model quality directly depends on data quality.
Data testing can be broken into schema validation, statistical validation, and business rule validation. Schema validation checks whether the data structure matches expectations: column existence, data types, and allowed ranges. Statistical validation checks whether data distributions fall within normal bounds, verifying that statistics like mean, variance, and missing value ratios haven't shifted significantly from previous datasets. Business rule validation checks domain-specific constraints: age cannot be negative, prices cannot fall outside certain ranges, and so on.
```python
import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("training_data.csv")

# Schema validation
validator.expect_column_to_exist("age")
validator.expect_column_values_to_be_of_type("age", "int64")

# Range validation
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)

# Missing value validation: allow at most 1% nulls in the target
validator.expect_column_values_to_not_be_null("target", mostly=0.99)

# Distribution validation
validator.expect_column_mean_to_be_between("income", min_value=30000, max_value=80000)

# Actually run the suite and fail the CI step if any expectation is unmet
results = validator.validate()
assert results.success, "Data validation failed"
```
Placing data validation as the first step in the pipeline prevents bad data from flowing into the training process altogether.
Model Testing
After data validation passes, the trained model itself must be validated. Model testing operates across three dimensions: performance validation, fairness validation, and robustness validation.
Performance validation checks whether the model's key metrics meet predefined thresholds. It's important not to look at absolute performance alone, but to compare against the model currently deployed in production. If the new model performs worse than the existing one, there's no reason to deploy it.
```python
def validate_model(new_model, baseline_model, test_data):
    # evaluate() is assumed to return a dict of metrics for a model
    new_metrics = evaluate(new_model, test_data)
    baseline_metrics = evaluate(baseline_model, test_data)

    # Absolute threshold validation
    assert new_metrics["accuracy"] >= 0.85, "Accuracy below minimum threshold"
    assert new_metrics["latency_p99"] <= 100, "Latency exceeds 100ms"

    # Relative comparison validation
    assert new_metrics["f1"] >= baseline_metrics["f1"] * 0.98, \
        "F1 score dropped more than 2% compared to baseline"
```
Fairness validation checks whether the model produces biased predictions for specific groups. Model performance should not vary significantly based on sensitive attributes like gender, age, or ethnicity. Robustness validation checks whether the model behaves stably against edge cases and adversarial inputs.
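A group-wise fairness check can be a simple gate in the same pipeline. The sketch below is illustrative and library-free; the group labels and the `max_gap` threshold are assumptions you would tune to your domain.

```python
def check_group_fairness(y_true, y_pred, groups, max_gap=0.05):
    """Fail the pipeline if accuracy differs by more than max_gap across groups."""
    accs = {}
    for g in set(groups):
        pairs = [(t, p) for t, p, grp in zip(y_true, y_pred, groups) if grp == g]
        accs[g] = sum(t == p for t, p in pairs) / len(pairs)
    gap = max(accs.values()) - min(accs.values())
    if gap > max_gap:
        raise AssertionError(f"accuracy gap {gap:.3f} exceeds {max_gap}: {accs}")
    return accs
```

The same pattern applies to any metric (recall, false positive rate) and any sensitive attribute; raising an exception is enough to fail the CI job.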
Pipeline Validation
Even when individual components are functioning correctly, the end-to-end pipeline must be validated separately. This means verifying that the flow from data collection through preprocessing, training, evaluation, and serving executes without breaking.
A particularly critical concern in pipeline validation is training-serving skew. If the preprocessing logic at training time differs from the preprocessing logic at serving time, even a highly accurate model will produce incorrect results in production. Sharing preprocessing code between training and serving, or running tests that compare input-output snapshots, are ways to prevent this.
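A snapshot test for training-serving skew can look like the following sketch. The `preprocess` function and the feature definitions are hypothetical; the point is that both training and serving import the same function, and a snapshot of training-time outputs pins its behavior.

```python
import math

# Hypothetical shared preprocessing, imported by both the training job and
# the serving code so the logic cannot silently diverge.
def preprocess(record: dict) -> list:
    return [math.log1p(record["income"]), record["age"] / 100.0]

def test_training_serving_skew():
    # Raw inputs captured at training time, paired with the feature vectors
    # the training pipeline actually produced for them.
    snapshot = [
        {"raw": {"income": 52000, "age": 34},
         "features": [math.log1p(52000), 0.34]},
    ]
    for case in snapshot:
        served = preprocess(case["raw"])
        assert all(abs(a - b) < 1e-9 for a, b in zip(served, case["features"])), \
            "serving-time features diverged from the training snapshot"
```

Running this test in CI catches the common failure mode where someone changes the serving-side transform without regenerating the training features.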
Automated Retraining Triggers
When monitoring systems detect drift or performance drops below a threshold, you can build a system that automatically kicks off the retraining pipeline. These are called automated retraining triggers.
There are broadly three types of triggers. Time-based triggers retrain on a fixed schedule (daily, weekly). They're simple to implement but may trigger unnecessary retraining. Performance-based triggers fire when monitoring metrics breach a threshold. Data-based triggers fire when a sufficient volume of new data has accumulated.
Is fully automated retraining always desirable? Not necessarily. If an automatically retrained model gets deployed to production without validation, it could amplify problems rather than solve them. The safe approach is to automate retraining but always include a validation step before deployment.
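The three trigger types can be combined into one decision function. This is a minimal sketch; the threshold values and the function name are illustrative, and note that it only queues retraining rather than deploying anything.

```python
from datetime import datetime, timedelta

def should_retrain(last_trained, drift_psi, live_f1, new_rows,
                   max_age_days=7, psi_threshold=0.2,
                   f1_floor=0.80, min_new_rows=100_000):
    """Combine time-, performance-, and data-based triggers.

    Any firing reason queues the retraining pipeline; the retrained model
    still has to pass the validation gates before it can be deployed.
    """
    reasons = []
    if datetime.now() - last_trained > timedelta(days=max_age_days):
        reasons.append("schedule")          # time-based trigger
    if drift_psi > psi_threshold:
        reasons.append("data_drift")        # data-based trigger (drift)
    if live_f1 < f1_floor:
        reasons.append("performance_drop")  # performance-based trigger
    if new_rows >= min_new_rows:
        reasons.append("new_data")          # data-based trigger (volume)
    return reasons  # empty list means no retraining needed
```

Returning the list of reasons, rather than a bare boolean, also gives you the traceability to log *why* each retraining run was kicked off.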
Model Validation Gates
The mandatory validation steps that a retrained model must pass before reaching production are called model validation gates. These gates serve as automated quality checkpoints, blocking models that fail to meet performance standards from going live.
```yaml
# Model validation gate configuration example
validation_gates:
  performance:
    accuracy_min: 0.85
    f1_min: 0.80
    latency_p99_max_ms: 100
  comparison:
    metric: f1
    threshold: 0.98  # At least 98% of current production model
  data_quality:
    missing_rate_max: 0.05
    drift_psi_max: 0.2
```
Only models that pass the validation gates are registered in the model registry with an "approved" status. From there, they are gradually rolled into production through canary deployments or A/B testing.
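Gates only help if the pipeline actually enforces them. Here is one sketch of what a validation script could do with the gate config above once it has been parsed into a dict (e.g. with `yaml.safe_load`); the function name is hypothetical and the metric keys mirror the YAML.

```python
def apply_validation_gates(gates, new_metrics, baseline_metrics):
    """Check a candidate model against the gate config; return (passed, failures)."""
    failures = []

    perf = gates["performance"]
    if new_metrics["accuracy"] < perf["accuracy_min"]:
        failures.append("accuracy below minimum")
    if new_metrics["f1"] < perf["f1_min"]:
        failures.append("f1 below minimum")
    if new_metrics["latency_p99"] > perf["latency_p99_max_ms"]:
        failures.append("latency above maximum")

    # Relative gate: must retain at least `threshold` of the production score
    cmp = gates["comparison"]
    if new_metrics[cmp["metric"]] < baseline_metrics[cmp["metric"]] * cmp["threshold"]:
        failures.append(f"{cmp['metric']} regressed past threshold vs production")

    return (not failures), failures
```

Collecting all failures instead of stopping at the first one makes the CI log far more useful when a candidate model is rejected.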
Integration with Experiment Tracking
CI/CD pipelines need tight integration with experiment tracking systems. Each time automated retraining runs, hyperparameters, training data versions, performance metrics, and model artifacts should be automatically logged in the experiment tracker. Integrating tools like MLflow or Weights & Biases into the pipeline makes it possible to trace which data trained which model configuration and what performance it achieved. Without this traceability, diagnosing problems when they arise becomes extremely difficult.
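What must be captured per run can be sketched without committing to a tracker. The record builder below is a stdlib-only illustration (the function name is hypothetical); with MLflow, these fields would map onto `mlflow.log_params`, `mlflow.log_metrics`, and `mlflow.log_artifact`.

```python
import hashlib
from datetime import datetime, timezone

def build_experiment_record(params, data_path, metrics, model_path):
    """Assemble the traceability record for one retraining run."""
    # Content hash ties the model to the exact training data it saw
    with open(data_path, "rb") as f:
        data_version = hashlib.sha256(f.read()).hexdigest()[:12]
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "params": params,              # hyperparameters for this run
        "data_version": data_version,  # training data fingerprint
        "metrics": metrics,            # evaluation results fed to the gates
        "model_artifact": model_path,  # where the trained model was stored
    }
```

Whatever tool you use, the test of adequacy is the same: given a production incident, can you reconstruct the exact data, config, and metrics behind the deployed model from this record alone?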
GitHub Actions Example
Looking at the structure of an ML CI/CD pipeline built with GitHub Actions makes the differences from a standard CI/CD pipeline concrete.
```yaml
name: ML CI/CD Pipeline

on:
  push:
    paths:
      - 'src/**'
      - 'data/**'
      - 'configs/**'

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate training data
        run: python scripts/validate_data.py --config configs/data_schema.yaml
      - name: Check for data drift
        run: python scripts/check_drift.py --reference data/reference.parquet

  model-training:
    needs: data-validation
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: python scripts/train.py --config configs/training.yaml
      - name: Log to experiment tracker
        run: python scripts/log_experiment.py
      - name: Upload model artifact
        uses: actions/upload-artifact@v4
        with:
          name: trained-model
          path: outputs/model/

  model-validation:
    needs: model-training
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download model artifact
        uses: actions/download-artifact@v4
        with:
          name: trained-model
      - name: Run validation gates
        run: python scripts/validate_model.py --baseline models/production/
      - name: Check fairness metrics
        run: python scripts/check_fairness.py

  deploy-canary:
    needs: model-validation
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Register model
        run: python scripts/register_model.py --status approved
      - name: Deploy canary (10%)
        run: python scripts/deploy.py --strategy canary --weight 10
```
The key points in this pipeline are that data validation precedes training, model validation gates follow training, and the final step is a safe canary deployment. If any stage fails, subsequent stages don't execute, structurally preventing problematic models from reaching production.
Summary
ML CI/CD is fundamentally different from software CI/CD in that it must include data and models, not just code, in its validation scope. A layered approach is required: data validation to guarantee input quality, model validation gates to enforce performance standards, and pipeline validation to ensure training-serving consistency. When automated retraining triggers and experiment tracking integration are added, you have a complete framework for automating and tracking the entire model lifecycle.
In the next post, we'll look at ML pipeline orchestration and workflow management tools.