Unit Testing Data with Deequ

Abstract

Modern companies and institutions rely on data to guide every single decision. Missing or incorrect information seriously compromises any decision process. We demonstrate Deequ, an Apache Spark-based library for automating the verification of data quality at scale. This library provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables unit tests for data. Deequ meets the requirements of production use cases at Amazon, and scales to datasets with billions of records if the constraints to evaluate are chosen carefully. Our demonstration walks attendees through a fictitious business use case of validating daily product reviews from a public dataset, and is executed in a proprietary interactive notebook environment. We show attendees how to define data unit tests from automatically suggested constraints and how to create customized tests. Additionally, we demonstrate how to apply Deequ to validate incrementally growing datasets, and give examples of how to configure anomaly detection algorithms on time series of data quality metrics to further automate the data validation.
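To make the idea of a declarative "unit test for data" concrete, the following is a minimal plain-Python sketch. It is not Deequ's API and does not use Spark: the `Check` class, its method names, and the sample review rows are all hypothetical and chosen for illustration only. It shows how built-in constraints (completeness, uniqueness) might be combined with a user-defined predicate in one chained declaration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


class Check:
    """Hypothetical declarative check over a list of rows (NOT Deequ's API)."""

    def __init__(self, description: str) -> None:
        self.description = description
        self._constraints: List[Tuple[str, Callable[[List[Row]], bool]]] = []

    def is_complete(self, column: str) -> "Check":
        # Passes when no row has a missing (None) value in the column.
        def complete(rows: List[Row]) -> bool:
            return all(r.get(column) is not None for r in rows)
        self._constraints.append((f"is_complete({column})", complete))
        return self

    def is_unique(self, column: str) -> "Check":
        # Passes when every value in the column occurs exactly once.
        def unique(rows: List[Row]) -> bool:
            values = [r.get(column) for r in rows]
            return len(values) == len(set(values))
        self._constraints.append((f"is_unique({column})", unique))
        return self

    def satisfies(self, name: str, predicate: Callable[[Row], bool]) -> "Check":
        # User-defined validation code: passes when the predicate holds for every row.
        def holds(rows: List[Row]) -> bool:
            return all(predicate(r) for r in rows)
        self._constraints.append((name, holds))
        return self

    def run(self, rows: List[Row]) -> Dict[str, bool]:
        # Evaluate each declared constraint and report pass/fail by name.
        return {name: fn(rows) for name, fn in self._constraints}


# Made-up sample standing in for one day of product reviews.
reviews: List[Row] = [
    {"review_id": "r1", "star_rating": 5},
    {"review_id": "r2", "star_rating": None},  # missing rating
    {"review_id": "r3", "star_rating": 4},
]

check = (
    Check("daily product reviews")
    .is_complete("review_id")
    .is_unique("review_id")
    .is_complete("star_rating")
    .satisfies(
        "rating_in_range",
        lambda r: r["star_rating"] is None or 1 <= r["star_rating"] <= 5,
    )
)

results = check.run(reviews)
for name, passed in results.items():
    print(f"{name}: {'passed' if passed else 'FAILED'}")
```

In this sketch the missing rating in the second row makes only the `is_complete(star_rating)` constraint fail, so the failure report points at one specific data quality problem, much as a failing unit test points at one specific bug.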

Publication
ACM SIGMOD (demo)