DataWig - Missing Value Imputation for Tables

Abstract

With the growing importance of machine learning (ML) algorithms for practical applications, reducing data quality problems in ML pipelines has become a major focus of research. Ensuring completeness of a data source is one of the most impactful data quality challenges: in many use cases, missing values can break data pipelines. Current missing value imputation methods are focusing on numerical or categorical data and can be difficult to scale to datasets with millions of rows. We release DataWig, a robust and scalable approach for missing value imputation that can be applied to tables with more heterogeneous data types, including unstructured text. DataWig combines deep learning feature extractors with automatic hyperparameter tuning. This enables users without a machine learning background, such as data engineers, to impute missing values with minimal effort in tables with more heterogeneous data types than supported in existing libraries, while requiring less glue code for feature engineering and offering more flexible modelling options. We demonstrate that DataWig compares favourably to existing imputation packages.

Publication
Journal of Machine Learning Research (JMLR), open source software track
Date
Links