Deployed Models Don't Last Forever

After deployment, a model works well for a while. But over weeks and months, prediction performance begins to quietly erode. No code was changed and no retraining was done, so why does this happen?

The cause almost always lies in the data. A model learns patterns based on the data distribution at training time. But real-world data changes over time. User behavior shifts, market conditions evolve, seasonal factors intervene. When the gap between the training data and the current serving data grows wide enough, the model's predictive power inevitably degrades.

This is the fundamental reason model monitoring exists. Even after deployment, you need to continuously observe the model's health and catch problems early enough to respond.

Data Drift vs. Concept Drift

The causes of model performance degradation fall into two broad categories: data drift and concept drift.

Data drift is a shift in the distribution of input data. For example, if the average age of applicants was 35 when a loan approval model was trained, but a surge of applicants in their 20s later pushes the average down to 28, the model is now receiving data from a distribution it never learned. Prediction reliability suffers as a result.

Concept drift is a change in the relationship between inputs and outputs itself. A pattern that once indicated a fraudulent transaction may no longer mean fraud as fraud techniques evolve. Even though the data distribution looks similar, the definition of the correct answer has changed, so the model needs to be fundamentally retrained.

Does distinguishing between these two actually matter in practice? It does. Data drift can be detected simply by monitoring input data, but concept drift can only be confirmed once labels (ground truth) become available. Since the detection methods differ, the response strategies must differ as well.

Statistical Methods for Drift Detection

Detecting drift requires comparing the distribution of training data against the distribution of current serving data. Several statistical tests are used for this purpose.

PSI (Population Stability Index) quantifies the difference between two distributions. It divides variable values into bins and measures the difference in proportions falling into each bin. A PSI below 0.1 is generally considered stable, between 0.1 and 0.2 warrants attention, and above 0.2 indicates drift has occurred.

import numpy as np

def calculate_psi(expected, actual, bins=10):
    # Bin edges come from percentiles of the reference (training) data,
    # so the bins are meaningful regardless of the feature's scale
    breakpoints = np.percentile(expected, np.linspace(0, 100, bins + 1))
    expected_percents = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_percents = np.histogram(actual, breakpoints)[0] / len(actual)

    # Prevent zero proportions (division by zero and log of zero)
    expected_percents = np.clip(expected_percents, 0.001, None)
    actual_percents = np.clip(actual_percents, 0.001, None)

    psi = np.sum(
        (actual_percents - expected_percents)
        * np.log(actual_percents / expected_percents)
    )
    return psi

The KS test (Kolmogorov-Smirnov test) measures the maximum distance between the empirical cumulative distribution functions of two samples. If the p-value falls below the significance level (typically 0.05), the two distributions are considered significantly different. It works well for continuous variables and is straightforward to implement.
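With SciPy's `ks_2samp`, the test takes only a few lines. The feature values below are synthetic, simulating the loan-applicant age shift described earlier:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=35, scale=5, size=1000)    # distribution at training time
serving_feature = rng.normal(loc=28, scale=5, size=1000)  # shifted serving distribution

# statistic = max distance between the two empirical CDFs
statistic, p_value = ks_2samp(train_feature, serving_feature)
if p_value < 0.05:
    print(f"drift detected (KS statistic={statistic:.3f}, p={p_value:.2e})")
```

A 7-year shift in the mean is large relative to the spread here, so the test flags it decisively; subtler shifts need larger sample windows to reach significance.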

| Method | Strengths | Best For |
| --- | --- | --- |
| PSI | Intuitive interpretation, clear thresholds | Categorical and continuous variables |
| KS Test | Non-parametric, simple implementation | Continuous variables |
| Chi-squared Test | Specialized for categorical data | Categorical variables |
| Wasserstein Distance | Captures shape differences between distributions | Detecting subtle distribution shifts |

In practice, rather than relying on a single method, it's common to combine several methods based on the type and characteristics of each variable.
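One way to combine methods is a small dispatcher that routes each feature to a suitable test based on its type. This is a sketch under simple assumptions: string dtypes are treated as categorical, everything else as continuous, and the 0.05 cutoff is illustrative:

```python
import numpy as np
from scipy import stats

def detect_drift(reference, current, p_threshold=0.05):
    """Route each feature to a test by type: chi-squared for
    categorical features, KS for continuous ones."""
    results = {}
    for name in reference:
        ref = np.asarray(reference[name])
        cur = np.asarray(current[name])
        if ref.dtype.kind in "OUS":  # object / string -> categorical
            categories = np.union1d(ref, cur)
            ref_counts = np.array([(ref == c).sum() for c in categories])
            cur_counts = np.array([(cur == c).sum() for c in categories])
            # Scale reference proportions to the current sample size
            expected = ref_counts / ref_counts.sum() * cur_counts.sum()
            _, p_value = stats.chisquare(cur_counts, expected)
        else:  # numeric -> continuous
            _, p_value = stats.ks_2samp(ref, cur)
        results[name] = bool(p_value < p_threshold)
    return results
```

A production version would also handle categories absent from the reference data and apply multiple-comparison corrections when testing many features at once.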

Monitoring Metrics

Beyond drift detection, there are metrics that need to be observed across the entire serving system. These can be organized into three categories: model performance metrics, service performance metrics, and data quality metrics.

Model performance metrics measure how accurate the model's predictions are. Accuracy, precision, recall, F1 score, and AUC fall into this category. However, since obtaining labels in real time is often difficult in production, the distribution of predicted values or changes in confidence scores are sometimes used as proxy metrics.
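A minimal sketch of such proxy metrics over a window of predicted probabilities might look like this (the 0.6 low-confidence cutoff is an illustrative assumption, not a standard value):

```python
import numpy as np

def confidence_summary(scores, low_conf_threshold=0.6):
    """Summarize a window of predicted probabilities as proxy
    health metrics, usable before ground-truth labels arrive."""
    scores = np.asarray(scores)
    return {
        "mean_confidence": float(scores.mean()),
        # Share of predictions the model is unsure about
        "low_confidence_ratio": float((scores < low_conf_threshold).mean()),
        # A drifting positive rate can hint at input drift
        "positive_rate": float((scores >= 0.5).mean()),
    }
```

Tracking these summaries over time and alerting on sudden changes gives an early signal even when true accuracy cannot yet be computed.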

Service performance metrics reflect the health of the serving system. Response latency (p50, p95, p99), throughput (requests per second), and error rates are tracked. Even if the model itself is fine, degraded serving infrastructure directly impacts user experience.
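Computing the latency percentiles from a window of samples is a one-liner with NumPy:

```python
import numpy as np

def latency_percentiles(latencies_ms):
    """Standard latency percentiles from a window of samples (ms)."""
    p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
    return {"p50": p50, "p95": p95, "p99": p99}
```

The tail percentiles (p95, p99) matter most: averages hide the slow requests that users actually notice.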

Data quality metrics verify the integrity of input data. Missing value ratios, outlier frequency, and schema violation counts are monitored. The goal is to catch problems originating upstream in the data pipeline before they reach the model.
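As a sketch, such checks can be run against each incoming batch of raw records. The field names and expected types below are hypothetical:

```python
def data_quality_report(records, schema):
    """Check a batch of raw input records (dicts) against an expected
    schema mapping field name -> expected type."""
    n = len(records)
    missing = {field: 0 for field in schema}
    schema_violations = 0
    for record in records:
        for field, expected_type in schema.items():
            value = record.get(field)
            if value is None:
                missing[field] += 1
            elif not isinstance(value, expected_type):
                schema_violations += 1
    return {
        "missing_ratio": {f: count / n for f, count in missing.items()},
        "schema_violations": schema_violations,
    }
```

Running this at the pipeline boundary means a broken upstream join or a renamed column triggers an alert before the model ever sees the bad data.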

Alerting Strategies

Collecting metrics alone is not enough. When problems occur, you need timely alerts so you can respond.

The hardest part of alert design is setting thresholds. Setting thresholds too low triggers alerts on trivial fluctuations, causing alert fatigue. Setting them too high means real problems get missed. In most cases, dynamic thresholds based on moving averages or standard deviations are more effective than static ones.
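A dynamic threshold can be sketched as a moving window that flags values deviating too far from the recent mean. The window size and the 3-sigma factor are illustrative defaults, not recommendations:

```python
from collections import deque
from statistics import mean, stdev

class DynamicThreshold:
    """Flag a metric value that deviates more than k standard
    deviations from the moving average of a recent window."""

    def __init__(self, window=30, k=3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def update(self, value):
        is_alert = False
        if len(self.history) >= 2:  # need at least 2 points for stdev
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0 and abs(value - mu) > self.k * sigma:
                is_alert = True
        self.history.append(value)
        return is_alert
```

Because the baseline adapts as the window slides, slow seasonal changes are absorbed while sudden jumps still fire, which is exactly the trade-off static thresholds get wrong.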

Tiering alert severity is equally important. Situations requiring immediate action (serving failures, error rate spikes) need to be distinguished from those requiring trend observation (gradual drift, slow latency increases), so that response priorities are clear.

Deciding When to Retrain

Detecting drift doesn't mean you should automatically retrain. Retraining itself is a costly operation, so you need criteria for judging when it's genuinely necessary.

Generally, several conditions are weighed together. If model performance metrics drop below a predefined threshold, if sustained drift is observed in key input variables, or if business KPIs are meaningfully impacted, retraining is warranted. On the other hand, if the change is a temporary fluctuation or a seasonal pattern, continued observation may be a more appropriate response than retraining.
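Those conditions can be combined in a simple decision helper. Everything here, the AUC threshold, the notion of "key" features, and the KPI flag, is an illustrative assumption that each team must calibrate for itself:

```python
def should_retrain(auc, auc_threshold=0.75,
                   drifted_features=(), key_features=("age", "income"),
                   kpi_impact=False):
    """Combine monitoring signals into a retraining decision.
    `drifted_features` should contain only features with *sustained*
    drift, to avoid reacting to temporary or seasonal fluctuations."""
    performance_degraded = auc < auc_threshold
    key_feature_drift = any(f in key_features for f in drifted_features)
    return performance_degraded or key_feature_drift or kpi_impact
```

Even a crude rule like this is valuable because it makes the retraining criteria explicit and reviewable instead of ad hoc.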

Drift Detection Tools

Building drift detection and monitoring from scratch requires significant effort. Open-source tools can substantially reduce the time to implementation.

Evidently AI is a Python library that automatically generates reports on data drift, model performance, and data quality. It provides dashboard-style visualizations and can be integrated into CI/CD pipelines as an automated validation step.

WhyLabs is a monitoring platform that uses data profiling to detect drift and anomalies in real time. Its defining feature is whylogs, a lightweight logging library that integrates into serving systems with minimal overhead.

The Prometheus and Grafana combination is a widely used stack for service performance monitoring. Custom metrics can be defined to bring model-related metrics under the same umbrella, and alerting rules are highly configurable.
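For example, tiered alerting rules might look like the following Prometheus configuration. The metric names are assumptions; use whatever your serving layer actually exports:

```yaml
groups:
  - name: model-serving
    rules:
      - alert: HighErrorRate
        # Immediate action: error rate above 5% for 5 minutes
        expr: rate(prediction_errors_total[5m]) / rate(prediction_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
      - alert: LatencyCreep
        # Trend observation: p99 latency above 500ms for 30 minutes
        expr: histogram_quantile(0.99, rate(prediction_latency_seconds_bucket[15m])) > 0.5
        for: 30m
        labels:
          severity: warning
```

The `severity` labels feed Alertmanager routing, so critical alerts can page someone while warnings go to a dashboard or chat channel.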

Summary

Models can begin degrading the moment they're deployed. Distinguishing between data drift and concept drift, detecting them with appropriate statistical methods, and building a systematic monitoring framework are the keys to ensuring stability in production ML systems. Only when alerting strategies and retraining decision criteria are also in place can an ML system be considered truly production-ready.

In the next post, we'll look at CI/CD strategies that encompass code, data, and models in ML systems.