MLOps 03 - Experiment Tracking and Training Management
Do You Remember Which Experiment Was Best?
Training models inevitably involves running a large number of experiments. You adjust the learning rate, change the number of layers, apply regularization techniques, and alter preprocessing methods. At first, jotting notes in a notebook or appending version numbers to file names feels sufficient. Before long, you end up with filenames like `model_v1`, `model_v2_final`, and `model_v2_final_real`.
Is this approach sustainable? Once the number of experiments exceeds a few dozen, it becomes difficult to pinpoint which hyperparameter combination produced the best result, what dataset was used, and how preprocessing was performed. The more serious problem arises when you need to reproduce an experiment from a month ago and cannot replicate the same result.
The Cost of Untracked Experiments
When experiments are not tracked systematically, several problems emerge simultaneously.
First, reproducibility vanishes. If you rerun an experiment that previously produced good results but cannot get the same outcome, the value of that experiment is effectively zero. There is no way to determine which of the code, data, or environment settings differed.
Second, collaboration becomes difficult. To understand a teammate's experimental results, you have to ask them directly. When experiment records exist only in an individual's notebook or local files, knowledge becomes locked to a single person and overall team productivity suffers.
Third, decision-making slows down. Choosing which model to deploy to production requires an objective comparison of candidate models' performance. Without a consistent basis for comparison, decisions fall back on intuition, which can lead to poor choices.
Core Elements of Experiment Tracking
A systematic experiment tracking system needs to record several key pieces of information for each experiment.
Parameters are the model's configuration values: learning rate, batch size, number of epochs, model architecture, and so on. Metrics are the quantitative results of the experiment: accuracy, loss, F1 score, AUC, and similar measures. Artifacts are the outputs generated during the experiment: trained model files, confusion matrix images, feature importance plots, and more. Finally, environment information captures the conditions under which the experiment was run, including library versions, GPU type, operating system, and random seed, all of which are necessary for reproducibility.
Experiment record structure:
├── Parameters: lr=0.001, batch_size=64, epochs=100
├── Metrics: accuracy=0.94, loss=0.18, f1=0.92
├── Artifacts: model.pkl, confusion_matrix.png
└── Environment: Python 3.10, PyTorch 2.0, CUDA 11.8, seed=42
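The record structure above can be captured with nothing more than the standard library. The sketch below is a minimal illustration, not a real tracking tool: the `log_experiment` helper, the `runs/exp_001` directory, and the example values are all hypothetical, and a tool like MLflow or W&B would handle this (plus UI and search) for you.

```python
import json
import platform
import random
import sys
from datetime import datetime, timezone
from pathlib import Path

def log_experiment(run_dir, params, metrics, artifacts, seed):
    """Write one experiment record (parameters, metrics, artifacts, environment) as JSON."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "parameters": params,
        "metrics": metrics,
        "artifacts": artifacts,  # paths to files the run produced
        "environment": {
            "python": sys.version.split()[0],
            "os": platform.system(),
            "seed": seed,  # recording the seed is essential for reproducibility
        },
    }
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "record.json").write_text(json.dumps(record, indent=2))
    return record

random.seed(42)  # fix the seed before training so the run can be replayed
record = log_experiment(
    "runs/exp_001",
    params={"lr": 0.001, "batch_size": 64, "epochs": 100},
    metrics={"accuracy": 0.94, "loss": 0.18, "f1": 0.92},
    artifacts=["model.pkl", "confusion_matrix.png"],
    seed=42,
)
```

Because each run lands in its own directory as structured JSON, later comparison is a matter of loading the records, which is exactly the query-and-compare capability the dedicated tools below industrialize.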
MLflow vs. W&B
The two most widely used experiment tracking tools are MLflow and Weights & Biases (W&B).
| Aspect | MLflow | W&B |
|---|---|---|
| Hosting | Self-hosted (open source) | Cloud SaaS (free tier available) |
| Experiment tracking | Parameter, metric, artifact logging | Parameters, metrics, artifacts + automatic system metric collection |
| Visualization | Basic charts | Rich interactive dashboards |
| Model registry | Built-in | Provided as a separate feature |
| Collaboration | Shared central server | Team workspaces, report functionality |
| Learning curve | Low | Medium |
MLflow's greatest strength is that it is open source and can be installed on your own infrastructure, giving you complete control over your data. It is preferred by organizations with strict security or compliance requirements. Beyond experiment tracking, it provides model registry, project packaging, and model serving capabilities within a single platform.
W&B is easy to set up and excels at visualization. Adding just a few lines of code starts experiment tracking, and system metrics like GPU utilization and memory usage are collected automatically. Its team collaboration features and the ability to share experiment results through reports are notable strengths.
Both tools provide solid core experiment tracking capabilities. The choice depends on your organization's circumstances: if data sovereignty matters, MLflow is the typical choice; if rapid adoption and rich visualization are priorities, W&B is worth considering.
Hyperparameter Tuning Strategies
Once an experiment tracking system is in place, hyperparameter tuning can be performed far more systematically.
The simplest method is grid search, which tries every possible combination. It is intuitive, but the number of combinations grows exponentially with the number of parameters and quickly becomes impractical. Random search samples combinations randomly from the parameter space; research (notably Bergstra and Bengio, 2012) has shown it often explores promising regions more efficiently than grid search, because in practice only a few hyperparameters strongly affect the result.
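The difference in trial budgets is easy to see with the standard library. The search space below is hypothetical and deliberately small; the point is that grid search must run every combination, while random search runs a budget you choose.

```python
import itertools
import random

# Hypothetical search space: 3 hyperparameters with a few candidate values each.
space = {
    "lr": [0.1, 0.01, 0.001, 0.0001],
    "batch_size": [32, 64, 128],
    "dropout": [0.0, 0.2, 0.5],
}

# Grid search: every combination (4 * 3 * 3 = 36 trials here; the count
# multiplies with each parameter added, hence the exponential blow-up).
grid = [dict(zip(space, values)) for values in itertools.product(*space.values())]

# Random search: a fixed budget of trials sampled from the same space.
random.seed(42)
budget = 10
randomized = [
    {name: random.choice(choices) for name, choices in space.items()}
    for _ in range(budget)
]

print(len(grid), len(randomized))  # prints: 36 10
```

Adding a fourth parameter with five candidate values would push the grid to 180 trials, while the random-search budget stays whatever you set it to.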
A more advanced method is Bayesian optimization, which intelligently selects the next parameter combination to try based on the results of previous experiments. Tools like Optuna and W&B Sweeps support this approach, offering the advantage of finding near-optimal results with fewer experiment runs.
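Real Bayesian optimization fits a surrogate model over past trials, which is what Optuna and W&B Sweeps do under the hood. The toy sketch below is only a crude stand-in for that idea, not Bayesian optimization proper: after a few uniform exploration trials, it samples new candidates near the best result so far. The one-dimensional `objective` function is hypothetical.

```python
import random

def objective(lr):
    # Hypothetical validation score, peaking near lr = 0.01.
    return 1.0 - 100 * (lr - 0.01) ** 2

random.seed(0)
history = []  # (lr, score) pairs from completed trials

for trial in range(20):
    if len(history) < 5:
        # Explore: the first few trials sample the range uniformly.
        lr = random.uniform(0.0, 0.1)
    else:
        # Exploit: sample near the best point so far (a crude stand-in for
        # the surrogate model a real Bayesian optimizer would fit).
        best_lr, _ = max(history, key=lambda t: t[1])
        lr = min(0.1, max(0.0, random.gauss(best_lr, 0.01)))
    history.append((lr, objective(lr)))

best_lr, best_score = max(history, key=lambda t: t[1])
```

Even this simplification shows the core principle: each new trial is informed by the outcomes of previous ones, which is why such methods typically need far fewer runs than grid search to approach a good configuration.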
Regardless of which tuning strategy you use, comparing and analyzing results is nearly impossible without an experiment tracking system. The tracking system is, in effect, a prerequisite for tuning.
Reproducible Experiments Build Trustworthy Models
Experiment tracking is not merely about keeping records. By transparently managing which conditions produced which results, it enables the entire team to evaluate models against the same criteria and make informed decisions. An experiment that cannot be reproduced is not science, and a model that cannot be reproduced cannot be deployed to production.
In the next post, we'll look at model versioning and registries: how to systematically manage trained models.