Recent Publications


(2019). Learning to Validate the Predictions of Black Box Machine Learning Models on Unseen Data. Human-In-the-Loop Data Analytics workshop at ACM SIGMOD.

(2019). DEEM 2019: Workshop on Data Management for End-to-End Machine Learning. ACM SIGMOD (workshop summary).


(2019). Unit Testing Data with Deequ. ACM SIGMOD (demo).

(2019). Data-Related Challenges in End-to-End Machine Learning. North East Database Day.


(2018). On Challenges in Machine Learning Model Management. IEEE Data Engineering Bulletin.




  • I am collaborating with the Social Media and Political Participation (SMaPP) Lab at NYU, conducting research on the polarization of online political debates on social media.

  • I am co-supervising Sergey Redyuk, a Ph.D. student at Technische Universität Berlin. We conduct research on novel systems for reproducibility of data science experiments and data quality validation for machine learning pipelines.

  • I am conducting research on predicting political affiliation from text and on the social media usage of far-right actors with Prof. Felix Biessmann from Beuth University, Berlin.


  • I consult for Amazon Core AI as a part-time Senior Applied Scientist and work on open source software for large-scale data quality verification with a team from Berlin.

  • I regularly discuss my research on data quality and model validation with Immuta, a company building a data management platform for data science.


Before joining New York University, I was a Senior Applied Scientist at Amazon Core AI in Berlin, where I worked on data management issues in machine learning applications, such as demand forecasting, metadata and provenance tracking for machine learning pipelines, and automated data quality verification.

I received my Ph.D. from TU Berlin in 2015, where I was advised by Volker Markl, head of the database systems and information management group. My co-supervisors were Klaus-Robert Müller from the machine learning group at TU Berlin and Reza Zadeh from Stanford.

During my studies, I interned with the SystemML group at IBM Research Almaden and the social recommendations team at Twitter in California.

Open Source

I am engaged in open source as an elected member of the Apache Software Foundation, where I currently mentor the Apache TVM project on behalf of the Apache Incubator. In the past, I was involved in the Apache Mahout, Apache Flink, Apache Giraph and Apache MXNet projects.

I currently contribute to deequ, a library for ‘unit-testing’ large datasets with Apache Spark, and to recoreco, a fast item-to-item recommender written in Rust.


I am the originator and chair of the workshop series on Data Management for End-to-End Machine Learning (DEEM) at ACM SIGMOD, which started in 2017.

I regularly review submissions to top-tier data management conferences. I have been on the program committee at SIGMOD 2019 & 2020, ICDE 2019 (demo track), ICDE 2018, SIGMOD 2017, EDBT 2017, the workshop on Exploiting Artificial Intelligence Techniques for Data Management at SIGMOD 2019 and the Large-Scale Recommender Systems workshop at ACM RecSys 2013-2015. Additionally, I have reviewed journal submissions for IEEE TKDE, ACM TIST, IEEE TPDS, the journal track of ECML/PKDD and the open source track of JMLR. I have also been a reviewer for the Amazon Research Awards.


I’m reachable via email at sebastian.schelter[at] I’m also active on Twitter as @sscdotopen. Most of the research code that I write is available under an open source license in my GitHub account. Last but not least, I also have a profile on Google Scholar.