Is Testing Code Enough?

In traditional software, CI/CD is a well-established concept. You commit code, tests run automatically, and if they pass, the build gets deployed. But for ML systems, this approach alone falls short.

An ML system's behavior is determined by three axes: code, data, and models. Even if the code is flawless, problematic training data leads to a model that makes wrong predictions. Even if the model's accuracy is high, a serving pipeline whose preprocessing logic differs from training time will produce errors in production. Ultimately, you need a CI/CD framework that validates all three.

This post examines how ML CI/CD differs from traditional software CI/CD and covers automation strategies that span data validation, model validation, and pipeline validation.

How ML CI/CD Differs from Software CI/CD

In software CI/CD, the test target is code. Unit tests, integration tests, and E2E tests verify code correctness, and if they pass, you deploy. It's deterministic: the same input guarantees the same output.

In ML CI/CD, the situation is different. Model training is inherently non-deterministic. Even with the same code and data, random seeds, GPU computation order, and hyperparameter search can produce different results. The pass criteria shift from "pass or fail" to "does the performance metric exceed a certain threshold."

| Aspect | Software CI/CD | ML CI/CD |
| --- | --- | --- |
| Test Target | Code | Code + Data + Model |
| Result Judgment | Pass/Fail | Performance threshold-based |
| Reproducibility | Deterministic | Non-deterministic elements present |
| Artifacts | Binaries, containers | Model files, metadata |
| Post-deploy Validation | Functional tests | Model performance monitoring |

Applying software CI/CD methodology as-is without recognizing these differences leads to situations where the code builds without issues, but the model fails in production.

Data Testing

The first thing to validate in an ML pipeline is the data, since model quality directly depends on data quality.

Data testing can be broken into schema validation, statistical validation, and business rule validation. Schema validation checks whether the data structure matches expectations: column existence, data types, and allowed ranges. Statistical validation checks whether data distributions fall within normal bounds, verifying that statistics like mean, variance, and missing value ratios haven't shifted significantly from previous datasets. Business rule validation checks domain-specific constraints: age cannot be negative, prices cannot fall outside certain ranges, and so on.

import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("training_data.csv")

# Schema validation
validator.expect_column_to_exist("age")
validator.expect_column_values_to_be_of_type("age", "int64")

# Range validation
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)

# Missing value validation
validator.expect_column_values_to_not_be_null("target", mostly=0.99)

# Distribution validation
validator.expect_column_mean_to_be_between("income", min_value=30000, max_value=80000)

Placing data validation as the first step in the pipeline prevents bad data from flowing into the training process altogether.

Model Testing

After data validation passes, the trained model itself must be validated. Model testing operates across three dimensions: performance validation, fairness validation, and robustness validation.

Performance validation checks whether the model's key metrics meet predefined thresholds. It's important not to look at absolute performance alone, but to compare against the model currently deployed in production. If the new model performs worse than the existing one, there's no reason to deploy it.

def validate_model(new_model, baseline_model, test_data):
    new_metrics = evaluate(new_model, test_data)
    baseline_metrics = evaluate(baseline_model, test_data)

    # Absolute threshold validation
    assert new_metrics["accuracy"] >= 0.85, "Accuracy below minimum threshold"
    assert new_metrics["latency_p99"] <= 100, "Latency exceeds 100ms"

    # Relative comparison validation
    assert new_metrics["f1"] >= baseline_metrics["f1"] * 0.98, \
        "F1 score dropped more than 2% compared to baseline"

Fairness validation checks whether the model produces biased predictions for specific groups. Model performance should not vary significantly based on sensitive attributes like gender, age, or ethnicity. Robustness validation checks whether the model behaves stably against edge cases and adversarial inputs.
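
A group-wise metric comparison is one simple way to express the fairness check described above. This is a minimal sketch: the group labels and the 0.05 accuracy-gap threshold are illustrative, not values from the post.

```python
# Minimal sketch of a group-wise fairness check.
# The group labels and the 0.05 gap threshold are illustrative assumptions.

def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def check_group_fairness(y_true, y_pred, groups, max_gap=0.05):
    """Fail if accuracy differs across sensitive groups by more than max_gap."""
    by_group = {}
    for g in set(groups):
        idx = [i for i, v in enumerate(groups) if v == g]
        by_group[g] = accuracy([y_true[i] for i in idx],
                               [y_pred[i] for i in idx])
    gap = max(by_group.values()) - min(by_group.values())
    return gap <= max_gap, by_group

# Example usage: group "b" receives noticeably worse predictions
ok, per_group = check_group_fairness(
    y_true=[1, 0, 1, 1, 0, 1],
    y_pred=[1, 0, 1, 0, 0, 1],
    groups=["a", "a", "a", "b", "b", "b"],
)
```

The same pattern extends to F1, false-positive rate, or any other metric; the key design choice is comparing the gap between groups rather than each group's absolute score.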

Pipeline Validation

Even when individual components are functioning correctly, the end-to-end pipeline must be validated separately. This means verifying that the flow from data collection through preprocessing, training, evaluation, and serving executes without breaking.

A particularly critical concern in pipeline validation is training-serving skew. If the preprocessing logic at training time differs from the preprocessing logic at serving time, even a highly accurate model will produce incorrect results in production. Sharing preprocessing code between training and serving, or running tests that compare input-output snapshots, are ways to prevent this.
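
The snapshot-comparison idea can be sketched as a small parity test. Here `preprocess` stands in for the project's single shared preprocessing function, and the snapshot records and expected outputs are made-up examples.

```python
# Sketch of a snapshot-based training-serving parity test.
# preprocess() stands in for the one shared preprocessing function;
# the records and expected outputs are illustrative.

def preprocess(record):
    """Shared logic imported by both the training job and the serving code."""
    return {
        "name": record["name"].strip().lower(),
        "age": min(max(record["age"], 0), 120),
    }

# Raw inputs and their outputs, captured when the production model was trained
snapshot_inputs = [{"name": " Alice ", "age": 130}, {"name": "BOB", "age": -3}]
snapshot_outputs = [{"name": "alice", "age": 120}, {"name": "bob", "age": 0}]

def test_no_training_serving_skew():
    """Re-run the serving-side preprocessing over the snapshot and compare."""
    current = [preprocess(r) for r in snapshot_inputs]
    assert current == snapshot_outputs, "training-serving skew detected"

test_no_training_serving_skew()
```

Running this test in CI whenever the serving code changes catches the case where someone edits preprocessing in one place but not the other.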

Automated Retraining Triggers

When monitoring systems detect drift or performance drops below a threshold, you can build a system that automatically kicks off the retraining pipeline. These are called automated retraining triggers.

There are broadly three types of triggers. Time-based triggers retrain on a fixed schedule (daily, weekly). They're simple to implement but may trigger unnecessary retraining. Performance-based triggers fire when monitoring metrics breach a threshold. Data-based triggers fire when a sufficient volume of new data has accumulated.
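
The three trigger types can be combined into one decision function, sketched below. The thresholds (seven days, an F1 floor of 0.80, 50,000 new samples) are illustrative assumptions, not recommendations from the post.

```python
from datetime import datetime, timedelta

# Sketch combining the three trigger types into one decision.
# All thresholds here are illustrative assumptions.

def should_retrain(last_trained, f1_current, f1_threshold=0.80,
                   new_samples=0, min_new_samples=50_000,
                   max_age=timedelta(days=7), now=None):
    """Return (fire, reason) based on time, performance, and data triggers."""
    now = now or datetime.utcnow()
    if now - last_trained >= max_age:      # time-based trigger
        return True, "schedule"
    if f1_current < f1_threshold:          # performance-based trigger
        return True, "performance"
    if new_samples >= min_new_samples:     # data-based trigger
        return True, "data_volume"
    return False, None

# Example: model is only two days old, but F1 has dropped below the floor
fire, reason = should_retrain(
    last_trained=datetime(2024, 1, 1),
    f1_current=0.75,
    now=datetime(2024, 1, 3),
)
```

Checking the triggers in a fixed order also gives you a single `reason` string to log, which helps when auditing why a retraining run fired.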

Is fully automated retraining always desirable? Not necessarily. If an automatically retrained model gets deployed to production without validation, it could amplify problems rather than solve them. The safe approach is to automate retraining but always include a validation step before deployment.

Model Validation Gates

The mandatory validation steps that a retrained model must pass before reaching production are called model validation gates. These gates serve as automated quality checkpoints, blocking models that fail to meet performance standards from going live.

# Model validation gate configuration example
validation_gates:
  performance:
    accuracy_min: 0.85
    f1_min: 0.80
    latency_p99_max_ms: 100
  comparison:
    metric: f1
    threshold: 0.98  # At least 98% of current production model
  data_quality:
    missing_rate_max: 0.05
    drift_psi_max: 0.2
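
A gate configuration like this needs code that enforces it. The sketch below mirrors the YAML as a plain dict and checks a candidate model against it; the metric values and the `evaluate_gates` helper are illustrative, not part of any particular framework.

```python
# Sketch of enforcing the gate config above. The gates dict mirrors the YAML;
# the metric values and helper name are illustrative.

gates = {
    "performance": {"accuracy_min": 0.85, "f1_min": 0.80,
                    "latency_p99_max_ms": 100},
    "comparison": {"metric": "f1", "threshold": 0.98},
}

def evaluate_gates(metrics, baseline_metrics, gates):
    """Return (passed, failures) for a candidate model."""
    failures = []
    perf = gates["performance"]
    if metrics["accuracy"] < perf["accuracy_min"]:
        failures.append("accuracy")
    if metrics["f1"] < perf["f1_min"]:
        failures.append("f1")
    if metrics["latency_p99_ms"] > perf["latency_p99_max_ms"]:
        failures.append("latency")
    # Relative gate: candidate must reach 98% of the production model's metric
    cmp = gates["comparison"]
    if metrics[cmp["metric"]] < baseline_metrics[cmp["metric"]] * cmp["threshold"]:
        failures.append("regression_vs_baseline")
    return len(failures) == 0, failures

passed, failures = evaluate_gates(
    metrics={"accuracy": 0.88, "f1": 0.82, "latency_p99_ms": 90},
    baseline_metrics={"f1": 0.81},
    gates=gates,
)
```

Collecting every failure instead of stopping at the first makes the CI log show all the reasons a model was rejected at once.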

Only models that pass the validation gates are registered in the model registry with an "approved" status. From there, they are gradually rolled into production through canary deployments or A/B testing.

Integration with Experiment Tracking

CI/CD pipelines need tight integration with experiment tracking systems. Each time automated retraining runs, hyperparameters, training data versions, performance metrics, and model artifacts should be automatically logged in the experiment tracker. Integrating tools like MLflow or Weights & Biases into the pipeline makes it possible to trace which data trained which model configuration and what performance it achieved. Without this traceability, diagnosing problems when they arise becomes extremely difficult.
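
As a sketch of what this logging step might look like with MLflow: the metadata assembly below is illustrative (the config keys and data-version tagging scheme are assumptions), while the `mlflow` calls themselves are standard tracking APIs. It assumes MLflow is installed and a tracking URI is configured.

```python
# Sketch of logging a retraining run to MLflow. The config keys and the
# data-version hashing scheme are illustrative assumptions; the mlflow
# calls are the standard tracking API.
import hashlib

def run_metadata(config, data_path, metrics):
    """Assemble the metadata every retraining run should record."""
    return {
        "params": {"learning_rate": config["lr"], "epochs": config["epochs"]},
        # Tag the run with a short fingerprint of the training data version
        "tags": {"data_version": hashlib.sha256(
            data_path.encode()).hexdigest()[:12]},
        "metrics": metrics,
    }

def log_to_mlflow(meta, artifact_dir="outputs/model"):
    import mlflow  # assumes MLflow is installed and tracking URI configured
    with mlflow.start_run():
        mlflow.log_params(meta["params"])
        mlflow.set_tags(meta["tags"])
        mlflow.log_metrics(meta["metrics"])
        mlflow.log_artifacts(artifact_dir)
```

Separating metadata assembly from the logging call keeps the traceability payload testable without a tracking server.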

GitHub Actions Example

Looking at the structure of an ML CI/CD pipeline built with GitHub Actions makes the differences from a standard CI/CD pipeline concrete.

name: ML CI/CD Pipeline

on:
  push:
    paths:
      - 'src/**'
      - 'data/**'
      - 'configs/**'

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate training data
        run: python scripts/validate_data.py --config configs/data_schema.yaml
      - name: Check for data drift
        run: python scripts/check_drift.py --reference data/reference.parquet

  model-training:
    needs: data-validation
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: python scripts/train.py --config configs/training.yaml
      - name: Log to experiment tracker
        run: python scripts/log_experiment.py
      - name: Upload model artifact
        uses: actions/upload-artifact@v4
        with:
          name: trained-model
          path: outputs/model/

  model-validation:
    needs: model-training
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download model artifact
        uses: actions/download-artifact@v4
        with:
          name: trained-model
      - name: Run validation gates
        run: python scripts/validate_model.py --baseline models/production/
      - name: Check fairness metrics
        run: python scripts/check_fairness.py

  deploy-canary:
    needs: model-validation
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Register model
        run: python scripts/register_model.py --status approved
      - name: Deploy canary (10%)
        run: python scripts/deploy.py --strategy canary --weight 10

The key points in this pipeline are that data validation precedes training, model validation gates follow training, and the final step is a safe canary deployment. If any stage fails, subsequent stages don't execute, structurally preventing problematic models from reaching production.

Summary

ML CI/CD is fundamentally different from software CI/CD in that it must include data and models, not just code, in its validation scope. A layered approach is required: data validation to guarantee input quality, model validation gates to enforce performance standards, and pipeline validation to ensure training-serving consistency. When automated retraining triggers and experiment tracking integration are added, you have a complete framework for automating and tracking the entire model lifecycle.

In the next post, we'll look at ML pipeline orchestration and workflow management tools.