DuckDQ: Data Quality Assertions for Machine Learning Pipelines

Till Doehmen, Mark Raasveldt, Hannes Mühleisen, Sebastian Schelter

Abstract

Data quality validation plays an important role in ensuring the proper behaviour of production machine learning (ML) applications and services. Observing a lack of existing solutions for quality control in medium-sized production systems, we developed DuckDQ, a lightweight and efficient Python library for data quality validation that integrates seamlessly with existing scikit-learn ML pipelines and does not require a distributed computing environment or ML platform infrastructure, while outperforming existing solutions by a factor of 3 to 40 in terms of runtime. We introduce the notion of data quality assertions, which can stop a pipeline when quality constraints on the input data or the model’s output are not met. Furthermore, we employ stateful metric computations, which greatly enhance the possibilities for post-hoc failure analysis and drift detection, even when the serving data is no longer available.
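
The two ideas named in the abstract, an assertion that halts a pipeline and metrics that outlive the data they were computed on, can be illustrated with a minimal sketch in plain Python. This is not DuckDQ's API: `DataQualityError`, `Constraint`, `completeness`, `assert_quality`, and the dictionary used as a metric store are all hypothetical names chosen for illustration, and a real system would persist the metrics to durable storage rather than an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class DataQualityError(Exception):
    """Raised to stop a pipeline when a quality constraint is violated."""


@dataclass
class Constraint:
    """A named metric plus a predicate that the metric value must satisfy."""
    name: str
    metric: Callable[[List[Optional[float]]], float]
    check: Callable[[float], bool]


def completeness(values: List[Optional[float]]) -> float:
    """Fraction of non-missing values in a column (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def assert_quality(
    column: List[Optional[float]],
    constraints: List[Constraint],
    metric_store: Dict[str, float],
) -> None:
    """Compute every metric, record it, then fail if any constraint is violated."""
    # Record all metrics first so they survive even when the assertion fails.
    for c in constraints:
        metric_store[c.name] = c.metric(column)

    failed = [c.name for c in constraints if not c.check(metric_store[c.name])]
    if failed:
        raise DataQualityError("violated constraints: " + ", ".join(failed))


# Hypothetical serving batch with one missing value.
store: Dict[str, float] = {}
ages = [34.0, None, 51.0, 29.0]
constraints = [Constraint("age_completeness", completeness, lambda m: m >= 0.9)]

try:
    assert_quality(ages, constraints, store)
except DataQualityError as err:
    # In a real pipeline the exception would propagate and halt execution.
    print(err)
```

Because the metrics are written to the store before the constraints are checked, the recorded completeness value remains available for later failure analysis or drift comparison after the batch itself has been discarded.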

Type

Conference paper

Publication

Workshop on Challenges in Deploying and Monitoring ML Systems at the International Conference on Machine Learning (ICML)

Date

June 2021

Links

PDF