Recent Publications

(2018). Deequ - Data Quality Validation for Machine Learning Pipelines. Machine Learning Systems workshop at the conference on Neural Information Processing Systems (NeurIPS).


(2018). Deep Learning for Missing Value Imputation in Tables with Non-Numerical Data. ACM Conference on Information and Knowledge Management (CIKM).


(2018). Benchmarking Distributed Data Processing Systems for Machine Learning Workloads. TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC).


(2018). Automating Large-Scale Data Quality Verification. International Conference on Very Large Databases (VLDB).


(2018). BlockJoin: Efficient Matrix Partitioning Through Joins. International Conference on Very Large Databases (VLDB).



Before joining New York University, I was a Senior Applied Scientist at Amazon Core AI in Berlin, where I worked on data management issues in machine learning applications, such as demand forecasting, metadata and provenance tracking for machine learning pipelines, and automated data quality verification.

I received my Ph.D. from TU Berlin in 2015, where I was advised by Volker Markl, head of the database systems and information management group. My co-supervisors were Klaus-Robert Müller from the machine learning group at TU Berlin and Reza Zadeh from Stanford.

During my studies, I interned with the SystemML group at IBM Research Almaden and the social recommendations team at Twitter in California.

Open Source

I am engaged in open source as an elected member of the Apache Software Foundation, where I currently mentor the Apache MXNet project on behalf of the Apache Incubator. In the past, I have been involved in the Apache Mahout, Apache Flink and Apache Giraph projects.

I actively contribute to deequ, a library for ‘unit-testing’ large datasets with Apache Spark, and to recoreco, a fast item-to-item recommender written in Rust.


I am the originator and chair of the workshop series on Data Management for End-To-End Machine Learning (DEEM) at ACM SIGMOD, which started in 2017.

I regularly review submissions to top-tier data management conferences. I have served on the program committees of SIGMOD 2019, ICDE 2019 (demo track), ICDE 2018, SIGMOD 2017, EDBT 2017 and the Large-Scale Recommender Systems workshop at ACM RecSys 2013-2015. Additionally, I have reviewed journal submissions for IEEE TKDE, ACM TIST, IEEE TPDS, the journal track of ECML/PKDD and the open source track of JMLR. I have also been a reviewer for the Amazon Research Awards.


I’m reachable via email at sebastian.schelter[at] I’m also very active on Twitter as @sscdotopen. Most of the research code that I write is available under an open source license in my GitHub account. Last but not least, I also have a profile on Google Scholar.