Good Models Come From Good Data

One of the most common misconceptions in machine learning projects is that model architecture determines performance. The expectation is that a more complex model will produce better results, but reality tells a different story. When data quality is poor, no amount of model sophistication will improve performance. In fact, a simple model trained on well-curated data often outperforms a complex model trained on noisy data.

So how do you systematically transform raw data into a form suitable for training? That is the role of the data pipeline.

The Structure of a Data Pipeline

A data pipeline is an automated sequence of steps that collects, transforms, and stores data. It defines and manages the flow from the point data is generated to the point a model consumes it.

The most fundamental patterns in this process are ETL and ELT.

ETL (Extract, Transform, Load) extracts data from a source, applies transformations, and then loads it into a destination. This approach is commonly used in traditional data warehouse environments and has the advantage of ensuring data quality during the transformation step. ELT (Extract, Load, Transform), on the other hand, loads raw data into the destination first and transforms it as needed later. This pattern is advantageous in cloud-based data lake environments, where storage costs have become cheap enough to preserve raw data while transforming it into various forms on demand.

ETL: Source → Extract → Transform → Load → Warehouse
ELT: Source → Extract → Load → Data Lake → Transform → Serving

In ML pipelines, the ELT pattern is more frequently chosen. Preserving raw data means you don't need to re-collect data when adding new features or changing preprocessing logic later on.
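The ELT flow above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline; the record shapes and the "lake" directory path are invented for the example. The key point is that `load` persists the raw data untouched, so `transform` can be rewritten later without re-collecting anything.

```python
import json
from pathlib import Path

# Hypothetical raw events from a source system (invented for the example).
RAW_EVENTS = [
    {"user_id": 1, "event": "visit", "ts": "2024-01-01T10:00:00"},
    {"user_id": 1, "event": "purchase", "ts": "2024-01-01T10:05:00"},
]

def extract():
    """Extract: pull raw records from the source."""
    return RAW_EVENTS

def load(records, lake_dir):
    """Load: persist raw data as-is, so it can be re-transformed later."""
    lake_dir.mkdir(parents=True, exist_ok=True)
    path = lake_dir / "events.jsonl"
    with path.open("w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
    return path

def transform(path):
    """Transform: derive a serving-ready view on demand from the raw data."""
    counts = {}
    with path.open() as f:
        for line in f:
            r = json.loads(line)
            counts[r["user_id"]] = counts.get(r["user_id"], 0) + 1
    return counts

# ELT order: extract -> load raw -> transform when needed.
lake_path = load(extract(), Path("lake/raw"))
print(transform(lake_path))  # per-user event counts
```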

Why Feature Engineering Matters

Feature engineering is the process of extracting or creating meaningful variables from raw data that a model can learn from. Even with the same dataset, model performance can vary dramatically depending on which features are constructed.

Consider building a purchase prediction model for an e-commerce platform. Using only the number of user visits as a feature is vastly different from combining features like visits in the last 7 days, average session duration, and average time from cart addition to purchase. The latter set of features captures purchase intent far more accurately.
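A minimal sketch of that contrast, using an invented per-user event log (field names like `session_sec` are assumptions for the example, not a real schema):

```python
from datetime import datetime, timedelta

# Hypothetical per-user event log; fields are invented for the example.
events = [
    {"ts": datetime(2024, 1, 8, 10, 0), "type": "visit", "session_sec": 120},
    {"ts": datetime(2024, 1, 9, 14, 0), "type": "visit", "session_sec": 300},
    {"ts": datetime(2023, 12, 1, 9, 0), "type": "visit", "session_sec": 60},
]

def build_features(events, now):
    """Derive richer purchase-intent signals than a raw visit count."""
    week_ago = now - timedelta(days=7)
    visits = [e for e in events if e["type"] == "visit"]
    recent = [e for e in visits if e["ts"] >= week_ago]
    return {
        "total_visits": len(visits),    # the naive feature
        "visits_last_7d": len(recent),  # recency signal
        "avg_session_sec": sum(e["session_sec"] for e in visits) / len(visits),
    }

print(build_features(events, now=datetime(2024, 1, 10)))
```

The same three raw events yield one naive feature or a small recency-aware feature set, depending only on the engineering applied.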

The challenge is that feature engineering is not a one-time task. As data evolves, business requirements shift, and models improve, features need continuous updating. When this process is done manually, reproducibility suffers, and a problem called training-serving skew emerges: features are computed differently in the training environment versus the serving environment.

The Need for Data Validation

Once a pipeline is automated, it's tempting to assume data will flow without issues. But does it really? In practice, upstream systems change their schemas without notice, null values spike in certain fields, or data distributions shift dramatically compared to historical patterns.

If these problems go undetected before model training, diagnosing the cause of performance degradation becomes extremely difficult. You can't tell whether the problem lies in the model or the data. This is why validating data quality at each stage of the pipeline is essential.

Validation can be divided into schema validation (do columns and types match expectations?), statistical validation (are value ranges, distributions, and null ratios within acceptable bounds?), and business rule validation (are domain-specific constraints being met?).
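The three categories can be sketched with plain Python checks. The column names, types, and the 10% null threshold are all assumptions for the example; in practice a tool like Great Expectations would express these declaratively.

```python
def validate(rows):
    """Run schema, statistical, and business-rule checks; return error list."""
    errors = []

    # 1. Schema validation: expected columns and types (nulls checked below).
    expected = {"user_id": int, "price": float}
    for i, row in enumerate(rows):
        for col, typ in expected.items():
            if col not in row:
                errors.append(f"row {i}: missing column {col}")
            elif row[col] is not None and not isinstance(row[col], typ):
                errors.append(f"row {i}: {col} is not {typ.__name__}")

    # 2. Statistical validation: null ratio within bounds (threshold assumed).
    prices = [r.get("price") for r in rows]
    null_ratio = prices.count(None) / len(prices)
    if null_ratio > 0.10:
        errors.append(f"price null ratio {null_ratio:.0%} exceeds 10%")

    # 3. Business rule validation: domain constraint on values.
    for i, r in enumerate(rows):
        if isinstance(r.get("price"), float) and r["price"] < 0:
            errors.append(f"row {i}: negative price")

    return errors

bad = [{"user_id": 1, "price": 9.99}, {"user_id": "2", "price": -5.0}]
print(validate(bad))  # flags the type error and the negative price
```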

Key Tools

Several tools support data pipeline construction and feature engineering.

| Tool | Role | Key Features |
| --- | --- | --- |
| Apache Airflow | Workflow orchestration | DAG-based pipeline definition, scheduling, monitoring |
| dbt | Data transformation | SQL-based transformations, version control, built-in testing |
| Great Expectations | Data validation | Declarative validation rules, data profiling |
| Feast | Feature store | Training/serving feature consistency, feature reuse |

Apache Airflow defines each pipeline stage as a DAG (Directed Acyclic Graph), managing execution order and dependencies. When a specific task fails, it can be rerun from that stage, making failure recovery straightforward.
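The DAG idea can be illustrated with a stdlib-only toy (this is not Airflow's API; it just shows dependency-ordered execution and resuming after a failure point):

```python
# Each task declares its upstream dependencies; the graph must be acyclic.
deps = {
    "extract": [],
    "transform": ["extract"],
    "validate": ["transform"],
    "load": ["validate"],
}

def topo_order(deps):
    """Return tasks in dependency order via depth-first traversal."""
    order, seen = [], set()
    def visit(task):
        if task in seen:
            return
        for upstream in deps[task]:
            visit(upstream)
        seen.add(task)
        order.append(task)
    for t in deps:
        visit(t)
    return order

def run(deps, done=frozenset()):
    """Execute tasks in order, skipping ones that already succeeded."""
    executed = []
    for task in topo_order(deps):
        if task not in done:
            executed.append(task)  # a real orchestrator would invoke the task here
    return executed

print(run(deps))                    # full run, in dependency order
print(run(deps, done={"extract"}))  # a rerun resumes after the completed stage
```

This mirrors what the orchestrator provides: when `transform` fails, a rerun does not repeat `extract`.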

dbt defines data transformations in SQL and enables version control over them. The ability to write tests against transformation logic makes it useful for ensuring the accuracy of transformed results.

Great Expectations lets you declaratively define expectations about your data and automatically validates them at each pipeline stage. It can also document validation results, providing a history of data quality over time.

Feature stores are a relatively recent concept: centralized systems that manage feature definition, computation, storage, and serving. Using tools like Feast ensures that the same feature computation logic is used during both training and serving, fundamentally eliminating the training-serving skew problem.
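The core of that guarantee can be sketched in a few lines (illustrative only; Feast's actual API differs). Training and serving both resolve features through one shared registry, so the computation logic cannot drift apart:

```python
# A single registry of feature definitions, shared by training and serving.
FEATURE_REGISTRY = {}

def feature(name):
    """Register a feature computation under one shared definition."""
    def wrap(fn):
        FEATURE_REGISTRY[name] = fn
        return fn
    return wrap

@feature("visits_last_7d")
def visits_last_7d(user_events):
    # Hypothetical event shape: {"days_ago": int}.
    return sum(1 for e in user_events if e["days_ago"] <= 7)

def get_features(user_events, names):
    """Called by BOTH the training job and the serving endpoint."""
    return {n: FEATURE_REGISTRY[n](user_events) for n in names}

events = [{"days_ago": 2}, {"days_ago": 30}]
print(get_features(events, ["visits_last_7d"]))  # {'visits_last_7d': 1}
```

Because there is exactly one definition of `visits_last_7d`, the training set and the live request see identical feature values for identical inputs.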

Data Quality Is Model Quality

Ultimately, a data pipeline in MLOps is not just plumbing that moves data around. The core objective is to build a system that guarantees data quality, manages features systematically, and enables rapid detection and response when issues arise. In most cases, improving data quality yields a far greater impact on model performance than changing the model architecture for a marginal gain.

In the next post, we'll look at experiment tracking and training management โ€” how to systematically manage the model training process.