Probabl — the scikit-learn founders
An operational blueprint for team leads delivering reliable machine learning systems. Identify risks, secure your validation methodology, and build on solid engineering foundations.
A comprehensive guide drawn from the field experience of the scikit-learn creators, to lead your ML projects with rigor and efficiency.
Data leakage, overfitting, misleading metrics — the errors that lead to confidently wrong results and their disastrous consequences on business decisions.
Cross-validation, group handling, train/test separation — the fundamentals to guarantee the reliability of your performance reports.
Why you should start with simple, well-optimized baselines before investing in complex models. The importance of fair benchmarks.
Track and compare performance over time, detect drift, understand variations in data, software and models.
Snapshots, evaluation datasets, versioning — how to maintain control over your data in a continuously evolving environment.
CI/CD, testing, modular architecture — why software quality is a non-negotiable pillar of operational data science.
An operational blueprint by Probabl — the scikit-learn founders
Drawing on their experience with real-world ML-driven engineering projects, the Probabl team argues in this document that correct methodological foundations and solid software engineering practices are necessary prerequisites for value creation through ML-driven insights and decision support.
Probabl aims to provide the world with methodological guidance and software tools for data scientists to fully control their workflow and certify their deliverables as reliable assets.
ML courses, tutorials, and self-contained Kaggle-like competitions often display a simplistic view of data science, where the work environment is frozen and the purpose is well scoped. Real-world data science, however, happens in ever-moving environments and requires continuous adaptation. Data can be updated or change in nature, business targets can pivot, software dependencies can ship breaking updates, and state-of-the-art methodologies keep moving forward; all of this requires practitioners to stay up to date to remain competitive. Given the hybrid nature of skills in data science – often unevenly distributed across a spectrum ranging from applied mathematics to software engineering – practitioners may fail to anticipate all aspects of a project and be caught off guard later. As a result, they may struggle to keep projects up to date while minimizing costs and maintaining development velocity.
This blueprint attempts a comprehensive review of the risks that can arise from methodological shortcomings or from weaknesses in implementation and execution, and suggests mitigation measures.
To start leveraging the power of data science and machine learning, some conditions must be met: a problem has been identified, its business impact can be described by a collection of metrics, and data is available to train predictive models and evaluate their performance. Securing this alone can be challenging, but it improves with a deeper understanding of the needs of downstream users and with more refined success metrics. Lacking performance feedback, or optimizing and reporting against the wrong metrics, are the first dead-end risks for many projects, and should be the first worry of all stakeholders.
The purpose of this document is to highlight the risks that are specific to the practice of operational machine learning once the initial set of requirements has been defined, and to outline practices to mitigate these risks.
When these risks materialize unnoticed, all stakeholders may fail to see the errors and enthusiastically buy the conclusions. This can escalate into consequential decisions on short- and long-term roadmaps, or lead organizations to oversell fantasized products or use cases to customers, only to face confusion when real-world performance falls far short of expectations. The consequences can be disastrous for an organization that anticipates ROI and never realizes the value.
These risks are further amplified by human confirmation bias. We tend to trust insights that align with the intuitions we seek to validate, and only question surprising results. In practice, these errors happen more frequently than one would think, and it is urgent to bring data scientists, as well as all stakeholders, to a sufficient level of awareness of these matters.
Here are some common pitfalls – some you may already consider obvious, while others even seasoned practitioners can miss.
One of the most fundamental errors is reporting metrics on the same dataset that was used to fit an estimator, and then assuming the model will generalize to unseen data. The proper mitigation is to keep a completely disjoint dataset for evaluation only – this set is called the validation set. However, when data is scarce or highly variable, results can vary dramatically depending on how the training and validation sets are drawn. It is entirely possible to observe and report encouraging performance on a given split that is nothing more than a statistical fluke, and therefore will not generalize. Cross-validation helps address this problem, and tools in scikit-learn and skore make it easier to assess variance across splits.
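As an illustration, a minimal cross-validation sketch with scikit-learn (on a synthetic dataset) shows how to inspect the spread of scores across splits rather than trusting a single one:

```python
# Cross-validation exposes score variance across splits, guarding
# against reporting a single lucky train/validation draw.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# Report the spread, not just a single number.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A large standard deviation relative to the mean is a warning that any single split's score could be a fluke.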
Datasets may also contain groups of related samples. Exposure to one observation can be highly predictive of others within the same group, but in real-world deployment new data often comes from entirely unseen groups. Failing to account for such structure, and mixing samples from the same group across both training and validation sets, is a common oversight resulting, once again, in overly optimistic reports.
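For grouped data, scikit-learn's GroupKFold keeps each group entirely on one side of every split; a small sketch with synthetic groups:

```python
# GroupKFold never places samples from the same group in both the
# training and validation sides of a split.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.tile([0, 1], 6)
groups = np.repeat([0, 1, 2, 3], 3)  # four groups of three samples

for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # Each validation fold contains only groups unseen during training.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```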
Another frequent mistake is data leakage: one could mistakenly model some predictive target using information available at training and validation time, but fail to see that this data is not available in the real-world application – resulting, once again, in overly optimistic reports.
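A classic instance of leakage is fitting a preprocessing step (such as a scaler) on the full dataset before splitting, which lets validation-set statistics leak into training. Wrapping preprocessing in a scikit-learn Pipeline scopes the fit to each training fold; a sketch on synthetic data:

```python
# Wrapping preprocessing in a Pipeline ensures the scaler is refit on
# each training fold only, so no validation statistics leak in.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, X, y, cv=5)  # scaler fit inside each fold
```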
Data scientists and stakeholders can be biased towards deploying complex models, even when simpler solutions suffice at a fraction of the cost. Increasing complexity unnecessarily drives up development costs, delivery times, and maintenance burden. Stakeholders and data scientists should be careful when steering the roadmap towards model complexity.
Any increase in complexity, and even paradigm shifts (like considering completely novel models or architectures), should always be benchmarked against simple, fairly-optimized baselines. If the team spends weeks on a clever approach, it is only fair to spend at least a few days setting up the baseline correctly. Many machine learning use cases do not require cutting-edge research: sometimes, a quick, off-the-shelf baseline solution may already deliver "good enough" performance.
If it doesn't, then it might be worth looking for more elaborate solutions. Tools like scikit-learn and skrub offer many ready-to-use baseline models that can serve as the starting point of the ML roadmap. If the baseline falls short, or if better model performance would unlock a competitive edge, the next step should be to optimize the baseline thoroughly. Only then should more elaborate changes be considered. Elaborate models must justify their increased costs through fair benchmarks against the optimized baselines.
We've discussed some of the possible methodological mistakes when validating the performance of a single model, but there is another set of intricacies when comparing the relative performance of several models in order to select the best performing one.
Tools like GridSearchCV from scikit-learn, or ComparisonReport from skore, are ideal for comparing the performance of a set of models on a given task and set of metrics. These tools operate end-to-end within a single instance of a computer program (an operating-system process), usually referred to as a run. By nature and design, this ensures that all comparisons in the benchmark use the same data, the same evaluation strategies, the same software environment, and so on, so that the comparison methodology is sound. Such runs typically terminate after outputting reports and optionally saving binaries for the best models.
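A minimal single-run comparison with GridSearchCV, where all candidates share the same data, splits, and environment:

```python
# Within one run, GridSearchCV evaluates every candidate with the same
# data and the same cross-validation splits, then retains the best one.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)
best = grid.best_params_  # the retained value of C for this run
```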
An inherent limitation is that these tools lack built-in support for answering questions about model performance across different benchmark instances. One may wonder: how does the performance seen in one benchmark compare with another set of models evaluated previously, months ago? How would trained models compare on newly collected data? How well did the benchmark metrics translate to performance in real-world usage, and over time (also called "drift control")? And above all, how can the variations be explained, and turned into valuable insights for the future?
The latter question is the most challenging. Models consume data that evolves, rely on software engines that keep being updated, and operate in increasingly complex systems. Without careful tracking, practitioners can overlook those changes and draw wrong conclusions from observed variations in target metrics.
Reliable insights require tools that make it easy to browse, compare, and analyze data, software, and model variations across runs. The quality and impact of the resulting understanding depend directly on how efficiently the toolbox enables this process.
In many cases, investigating the behavior of ML predictive systems comes down to inspecting the data. Inefficient access to your data assets – whether because they have been lost, are locked behind convoluted interfaces, or sit in internal silos within an organization – creates friction that prevents or slows down problem solving.
The first key assets are the frozen datasets used to fit the models. Their distributions might help explain model performance, and data scientists often need to revisit them. One could need to compare snapshots of the same data source at different points in time to understand differences observed between older and newer models.
If the source of data is a database where new data is continuously added, those frozen datasets will be time-based snapshots. As the models need to be updated to account for distributional shifts in the newer data, new snapshots will be generated to refit the models. Over time, the accumulation of these frozen datasets could turn out to be a significant hassle to manage.
Ideally, one may want to limit the number of copies of datasets, in order to save storage costs, while still being able to inspect the training data for a given model. At the same time, the data should remain easy to explore. One may want to quickly answer questions such as: given snapshots at different points in time, what would be a summary of the changes? How does the amount of data, and the distributions, compare from one to another? How does the data of a few selected samples qualitatively compare?
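As a small illustration with pandas (the snapshot contents below are hypothetical), such a summary can start as simply as comparing row counts and a per-column statistic between two snapshots:

```python
# Comparing two hypothetical snapshots of the same source: row counts
# and a basic distribution statistic per snapshot.
import pandas as pd

snap_jan = pd.DataFrame({"amount": [10.0, 12.0, 11.0]})
snap_jun = pd.DataFrame({"amount": [10.0, 15.0, 14.0, 16.0]})

summary = pd.DataFrame(
    {
        "rows": [len(snap_jan), len(snap_jun)],
        "mean_amount": [snap_jan["amount"].mean(), snap_jun["amount"].mean()],
    },
    index=["jan", "jun"],
)
```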
Beyond the datasets used for training, evaluation datasets containing newer data can also be used to keep estimating the comparative performance of predictive ML systems fitted earlier (another instance of drift control). It remains important to check whether the samples in the new evaluation datasets are completely disjoint from those in the former training datasets, or else to treat samples that may have been seen at training time differently from samples drawn from completely unseen data.
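Checking this disjointness can come down to simple set operations on sample identifiers; the IDs below are hypothetical:

```python
# Partition a new evaluation set into samples already seen at training
# time and genuinely unseen samples, so they can be scored separately.
train_ids = {"s1", "s2", "s3", "s4"}   # hypothetical training sample IDs
eval_ids = {"s3", "s5", "s6"}          # hypothetical new evaluation IDs

seen = eval_ids & train_ids     # already exposed to the model
unseen = eval_ids - train_ids   # truly new data
```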
Moreover, observations of distributional shifts between the older datasets and new evaluation datasets might also provide key insights into how the data is evolving.
Efficient tools for inspecting the models are equally necessary, for the same reasons as those stated previously.
The data science community has converged on model cards as the standard artifact centralizing all metadata for a model. Each model has its model card, which references its training dataset, its validation strategy, all parameters of interest, a usage manual, its expected performance, its shortcomings, and its risks. It also documents the sequence of transformations the model applies. The same processes that produce model binaries should also collect and organize this information, so that model cards are generated accurately and comprehensively.
Beyond metadata availability, operating the models requires, for all kinds of purposes (bug fixes, investigating failure cases, ...), the ability to take deep dives into the inner workings of the pipelines. A pipeline is the sequence of transformations that takes as input the identifier of a sample backed by data stored in some source (for instance a database), fetches the relevant data, transforms it, and outputs a prediction, which can either be used as an insight in a report or embedded into a more complex service as decision support.
For a given input identifier, the fetched data should be easily recoverable by human operators for inspection purposes, and all the intermediate steps should be replayable.
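With a fitted scikit-learn Pipeline, the intermediate representation of a sample can be replayed by slicing off the final estimator; a sketch on synthetic data:

```python
# Replaying intermediate pipeline steps on one sample for inspection.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)
pipe = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]
)
pipe.fit(X, y)

# What does the final estimator actually see for this sample?
intermediate = pipe[:-1].transform(X[:1])
prediction = pipe.predict(X[:1])
```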
skrub recently released a DataOps toolbox that gives the flexibility to embed many custom preliminary data transformations inside scikit-learn-compatible objects. We believe there is room for standardizing many more tools to make deep pipeline inspections easier and save data scientists from struggling with impractical objects, endlessly reinventing the wheel, and maintaining boilerplate.
Ultimately, all the practices and tools discussed so far consist in orchestrating and executing sequences of computer instructions. Expectations of robustness and reliability for the outputs and effects they produce cannot exceed the level of underlying software quality and software engineering maturity. The same goes for velocity and time to delivery.
Unhealthy software engineering practices have team members working overnight on emergency fixes, cause deployment anxiety, and produce unreliable deliverables that will crash your demo at best, or deliver subtly but damningly wrong outputs at worst.
We've mentioned earlier the risk of disastrous consequences when reporting misleading performance forecasts to stakeholders because of mistakes in the validation methodology. Bugs in the underlying software, for instance in the implementation of custom metrics that have not been properly tested, can have equally dramatic effects.
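A cheap mitigation is to unit-test any custom metric against hand-computed values or a trusted reference implementation; a sketch with a hypothetical hand-rolled mean absolute error:

```python
# A custom metric tested against scikit-learn's reference implementation.
import numpy as np
from sklearn.metrics import mean_absolute_error

def custom_mae(y_true, y_pred):
    """Hypothetical hand-rolled metric that must match the reference."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

y_true, y_pred = [3.0, -0.5, 2.0], [2.5, 0.0, 2.0]
assert abs(custom_mae(y_true, y_pred) - mean_absolute_error(y_true, y_pred)) < 1e-12
```

Such a test costs minutes to write and rules out an entire class of silently wrong performance reports.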
The effort invested in setting up development environments should be weighed against the ambition and scope of the tasks at hand. Many have rightly warned against premature optimization. However, in many instances we have seen ML practitioners rely exclusively on sandbox environments running Jupyter-like notebooks, let the notebooks grow in size and complexity, keep iterating on them for demos and reporting to stakeholders, and postpone consolidation into sane software engineering practices to a hypothetical transition to production.
In previous sections, we recommended favoring simple, off-the-shelf baselines before committing to more elaborate solutions. Conversely, we warn that a lack of control over the software engineering stakes can limit data scientists' ability to implement an ambitious roadmap, forcing them to fall back on simplistic compromises and lose a competitive edge.
Probabl's ambition is to offer a suite of tools that empowers data scientists by making software quality easier to achieve for all deliverables.
A project could start as a small script, enough to provide some encouraging insights. Soon it grows, becoming hungrier for data but also pickier. As the number of steps in the pipeline increases, so do the complexity and the number of lines of code. Parameters and versions of trained models are suddenly all over the place. Pipeline failures and regressions in new iterations become frequent. As newcomers are recruited, efficient onboarding of new contributors also becomes necessary.
The team then realizes the technical debt it has accumulated, and the need to settle on a collaborative, modular, and iterative code architecture while rationalizing practices and workflows. While it doesn't directly provide value to end users, such in-depth work on the code assets of an ML-driven project comes with its own frictions, and the more standards and boilerplate the team has to maintain, the less energy is left for improving the actual deliverables.
Probabl provides an opinionated and comprehensive set of practices and tools that lets engineers focus right away on the value delivered, delegating the orchestration of everything else to the framework. It offers both guidance to secure the data science methodology and solid software engineering foundations that teams don't have to reinvent or maintain.
Here are the general principles of Probabl's opinionated set of tools and practices:
First of all, practitioners are expected to define an entrypoint for registering datasets and validation strategies. The framework then picks it up and organizes the assets in a filesystem (possibly remote, and/or integrated with other data versioning systems) along with metadata, which practitioners are invited to enrich through a standardized interface for custom distributional-analysis modules, and to browse in a practical interface.
Then, practitioners are invited to implement the whole pipeline of transformations: starting with an input list of sample IDs, then fetching data, setting up the training and validation steps, transforming data into arrays of features, fitting a classifier or regressor and producing predictions, evaluating metrics, and finally dumping model binaries. Each of these steps is enclosed in a specific abstraction, aiming to be standard enough to provide services such as accurate logging, visualization, system resource tracking, caching, and easy unit testing, yet flexible enough to adapt to many use cases and data flows.
From the outside, the pipeline exposes the same interface as familiar scikit-learn Estimator objects. The framework picks up all the steps and exposes a training command that orchestrates the sequence of transformations within the train/validation/test methodology, captures logging, registers the metrics, and saves the binaries in a model warehouse.
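As an illustrative sketch only (the class below is hypothetical, not Probabl's actual API), a pipeline can present this familiar Estimator interface by building on scikit-learn's base classes:

```python
# Hypothetical sketch: a pipeline wrapper exposing the standard
# scikit-learn Estimator interface (fit/predict).
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

class WarehousePipeline(ClassifierMixin, BaseEstimator):
    """Illustrative wrapper; a real framework would also fetch data,
    build features, log, and register artifacts around these steps."""

    def fit(self, X, y):
        self.model_ = LogisticRegression(max_iter=1000).fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)

X, y = make_classification(n_samples=100, random_state=0)
preds = WarehousePipeline().fit(X, y).predict(X[:5])
```

Because the wrapper behaves like any Estimator, it composes with the rest of the ecosystem (cross-validation, grid search, pipelines) for free.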
Additional tools are provided to browse the model warehouses and model metadata, export model cards, clean them up, etc., as well as to reset caches and prune the data of registered datasets.
Some additional primitives will enable implementing modules that integrate the models from the warehouse into more complex services, with utilities for optional post-processing (such as calibration).
Such tools and practices aim to enhance data scientists' and machine learning engineers' control over all the parameters that affect the systems they build. They also aim to clarify the role of all stakeholders, and to break the silos that could limit their reach over the data and the efficiency of deliveries.
For instance, we hold that all data transformations that are non-neutral with respect to the ML systems should eventually be managed within the scope where those systems operate. As far as scalability constraints and compliance requirements allow, the source of raw data should not go through destructive filters before reaching the data scientists and machine learning engineers who leverage it for ML-driven applications.
Upstream data engineering preparation should avoid discarding data, even data that seems to be of bad quality, and should instead consider enriching it with tags and metadata. Data science starts when the data assets are reviewed, and selected or discarded depending on their interest for the downstream tasks.