Deequ - Data Quality Validation for Machine Learning Pipelines

Sebastian Schelter, Stefan Grafberger, Philipp Schmidt, Tammo Rukat, Mario Kiessling, Andrey Taptunov, Felix Biessmann, Dustin Lange

Abstract

Modern machine learning (ML) systems consist of complex ML pipelines that typically make many implicit assumptions about the data they consume (e.g., about the scales of variables, the presence of missing values, or the dictionary of categorical values). Violations of these assumptions can result in crashes or wrong predictions. We therefore present Deequ, a library that allows users to explicitly specify their assumptions about the data in a declarative way. Deequ enables the efficient automatic validation of these assumptions on large datasets. It is an open source library based on Apache Spark and meets the requirements of production use cases at Amazon.
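
To make the idea of declaratively specified data assumptions concrete, here is a minimal sketch in plain Python. It does not use Deequ's API and does not run on Apache Spark; the names Constraint, is_complete, is_in_range, is_contained_in, and verify are hypothetical helpers invented for this illustration, and the sample rows are made up. It covers the three kinds of assumptions the abstract mentions: missing values, the scale of a variable, and the dictionary of categorical values.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class Constraint:
    """A named assumption about a dataset (hypothetical type, not Deequ's)."""
    description: str
    holds: Callable[[List[Row]], bool]


def is_complete(column: str) -> Constraint:
    # Assumption: the column never contains a missing (None) value.
    return Constraint(
        f"{column} has no missing values",
        lambda rows: all(r.get(column) is not None for r in rows),
    )


def is_in_range(column: str, low: float, high: float) -> Constraint:
    # Assumption: every non-missing value lies within [low, high].
    return Constraint(
        f"{column} lies in [{low}, {high}]",
        lambda rows: all(
            low <= r[column] <= high
            for r in rows
            if r.get(column) is not None
        ),
    )


def is_contained_in(column: str, allowed: Iterable[Any]) -> Constraint:
    # Assumption: every non-missing value comes from a fixed dictionary.
    allowed_set = set(allowed)
    return Constraint(
        f"{column} takes only allowed values",
        lambda rows: all(
            r[column] in allowed_set
            for r in rows
            if r.get(column) is not None
        ),
    )


def verify(rows: List[Row], constraints: List[Constraint]) -> Dict[str, bool]:
    # Evaluate each declared constraint and report whether it holds.
    return {c.description: c.holds(rows) for c in constraints}


# Made-up sample data: one missing age and one unexpected country code.
rows = [
    {"age": 34, "country": "DE"},
    {"age": None, "country": "US"},
    {"age": 51, "country": "XX"},
]

checks = [
    is_complete("age"),
    is_in_range("age", 0, 120),
    is_contained_in("country", ["DE", "US"]),
]

report = verify(rows, checks)
for description, ok in report.items():
    print(f"{'PASS' if ok else 'FAIL'}: {description}")
```

The constraints are declared as data, separately from the code that evaluates them, so the same list could be checked against each new batch before it reaches a pipeline. This sketch scans the rows once per constraint in memory; it says nothing about how Deequ achieves efficient validation on large datasets.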

Type

Conference paper

Publication

Machine Learning Systems workshop at the conference on Neural Information Processing Systems (NeurIPS)

Date

November 2018

Links

PDF