Recent Publications


(2019). AdaBench - Towards an Industry Standard Benchmark for Advanced Analytics. TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC).

(2019). An Intermediate Representation for Optimizing Machine Learning Pipelines. International Conference on Very Large Databases (VLDB).


(2019). 'Amnesia' - Towards Machine Learning Models That Can Forget User Data Very Fast. Workshop on Applied AI for Database Systems and Applications (AIDB) at VLDB.


(2019). Efficient Incremental Cooccurrence Analysis for Item-Based Collaborative Filtering. International Conference on Scientific and Statistical Database Management (SSDBM).


(2019). Learning to Validate the Predictions of Black Box Machine Learning Models on Unseen Data. Human-In-the-Loop Data Analytics workshop at ACM SIGMOD.




  • I am hosting Ji Zhang from Huazhong University of Science and Technology (HUST) as a visiting Ph.D. student at NYU. We conduct research on machine learning for data-intensive systems.

  • I am collaborating with Prof. Julia Stoyanovich from New York University on research regarding the impact of data preprocessing on the fairness of machine-assisted decision making.

  • I am co-supervising Sergey Redyuk, a Ph.D. student at Technische Universität Berlin. We conduct research on novel systems for reproducibility and automated documentation of data science experiments.

  • I am conducting research on data validation and data cleaning for machine learning with Prof. Felix Biessmann from Beuth University, Berlin.


  • I consult for Amazon AI as a part-time Senior Applied Scientist and work on open source software for large-scale data quality verification with a team from Berlin.

  • I regularly discuss my research on data quality and model validation with Immuta, a company building a data management platform for data science.


Before joining New York University, I was a Senior Applied Scientist at Amazon Core AI in Berlin, where I worked on data management-related issues of machine learning applications, such as demand forecasting, metadata and provenance tracking of machine learning pipelines, and automating data quality verification.

I received my Ph.D. from TU Berlin in 2015, where I was advised by Volker Markl, head of the database systems and information management group. My co-supervisors were Klaus-Robert Müller from the machine learning group at TU Berlin and Reza Zadeh from Stanford.

During my studies, I interned with the SystemML group at IBM Research Almaden and the social recommendations team at Twitter in California.

Open Source

I am engaged in open source as an elected member of the Apache Software Foundation, where I currently mentor the Apache TVM project on behalf of the Apache Incubator. In the past, I have been involved in the Apache Mahout, Apache Flink, Apache Giraph and Apache MXNet projects.

I currently contribute actively to deequ, a library for ‘unit-testing’ large datasets with Apache Spark, and to recoreco, a fast item-to-item recommender written in Rust.


I am the originator and chair of the workshop series on Data Management for End-To-End Machine Learning (DEEM) at ACM SIGMOD, which started in 2017.

I regularly review submissions to top-tier data management conferences. I have been on the program committee at SIGMOD 2017, 2019 & 2020, ICDE 2018, 2019 (demo track) & 2020, EDBT 2017, the workshop on Exploiting Artificial Intelligence Techniques for Data Management at SIGMOD 2019 and the Large-Scale Recommender Systems workshop at ACM RecSys 2013-2015. Additionally, I have reviewed journal submissions for IEEE TKDE, ACM TIST, IEEE TPDS, the journal track of ECML/PKDD and the open source track of JMLR. I have also been a reviewer for the Amazon Research Awards.


I’m reachable via email at sebastian.schelter[at]. I’m also very active on Twitter as @sscdotopen. Most of the research code that I write is available under an open source license in my GitHub account. Last but not least, I also have a profile on Google Scholar.