Recent Publications

(2023). Improving Retrieval-Augmented Large Language Models via Data Importance Learning. [arxiv preprint].


(2023). Towards Declarative Systems for Data-Centric Machine Learning. Data-Centric Machine Learning Research (DMLR) Workshop at ICML.


(2023). mlwhatif: What If You Could Stop Re-Implementing Your Machine Learning Pipeline Analyses Over and Over?. VLDB (demo).

(2023). Forget Me Now - Fast and Exact Unlearning in Neighborhood-Based Recommendation. ACM SIGIR.


(2023). On the Impact of Outlier Bias on User Clicks. ACM SIGIR.


(2023). Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines. ACM SIGMOD.


(2023). Proactively Screening Machine Learning Pipelines with ArgusEyes. ACM SIGMOD (demo).


(2023). Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making. International Conference on Data Engineering (ICDE).



PhD Students, Research Engineers & Guests
Olivier Sprangers
(with Maarten de Rijke)
Barrie Kersbergen
(with Maarten de Rijke)
Stefan Grafberger
(with Paul Groth)
Zeyu Zhang
(with Iacer Calixto)
Shubha Guha
(with Paul Groth)
Till Doehmen

  • Mozhdeh Ariannezhad (PhD student), first employment as ML Scientist at
  • Arezoo Sarvi (PhD student), first employment as Data Scientist Search at Albert Heijn
  • David Vos (master student), first employment as PhD student at the IRLab of the University of Amsterdam
  • Benjamin Wang (master student), first employment as ML scientist at
  • Ji Zhang (guest researcher), first employment at Huawei



Scientific Career

Before joining the University of Amsterdam, I was a Faculty Fellow at the Center for Data Science at New York University and a Senior Applied Scientist at Amazon Research in Berlin, where I worked on data management-related issues of machine learning applications, such as demand forecasting, metadata and provenance tracking of machine learning pipelines, and automating data quality verification.

I received my Ph.D. with “summa cum laude” from TU Berlin in 2015, where I was advised by Volker Markl, head of the database systems and information management group. My co-supervisors were Klaus-Robert Müller from the machine learning group at TU Berlin and Reza Zadeh from Stanford. During my studies, I interned with the SystemML group at IBM Research Almaden and the social recommendations team at Twitter in California.

Open Source

I have been engaged in open source as an elected member of the Apache Software Foundation since 2012. I have been involved in the Apache Mahout, Apache Flink and Apache Giraph projects, and in the incubation of the Apache MXNet and Apache TVM projects. Besides that, I co-created Deequ, a library for ‘unit-testing’ large datasets with Apache Spark, and Serenade, a low-latency session-based recommender system deployed in production at a large Dutch retailer. Furthermore, I have been a member of the Electronic Frontier Foundation since 2015.

Scientific Service

I am the founder of the workshop series on Data Management for End-To-End Machine Learning (DEEM) at ACM SIGMOD, which I chaired from 2017 to 2020, and I am an Action Editor for the ML Open Source Software track of the Journal of Machine Learning Research. I have served as Associate Editor for PVLDB Volume 15, as editor of two special issues of the IEEE Data Engineering Bulletin in 2021 and 2022, and as co-chair of the industry and applications track of EDBT 2022.

I regularly review submissions to top-tier data management conferences. I have served on the program committees of SIGMOD 2017 and 2019-2024, VLDB 2021, ICDE 2018-2021, EDBT 2017 & 2021, CIKM’20, the PhD Symposium at VLDB’21, the workshop on Data-Centric Machine Learning Research at ICML, the workshop on Exploiting Artificial Intelligence Techniques for Data Management at SIGMOD 2019, the Large-Scale Recommender Systems workshop at ACM RecSys 2013-2015, the workshop on Applied AI for Database Systems and Applications at VLDB’20, the workshop on Table Representation Learning at NeurIPS’22, the DBML workshop at ICDE’21 and Provenance Week’20. Additionally, I have reviewed journal submissions for IEEE TKDE, ACM TIST, IEEE TPDS, IEEE TNNLS, the VLDB Journal, the VLDB Journal Special Issue on Data Science for Responsible Data Management, the journal track of ECML/PKDD and the open source track of JMLR. I have also been a reviewer for the Amazon Research Awards.

At the University of Amsterdam, I coordinate the honors program of the bachelor's program in AI and teach a course on Big Data engineering with up to 200 students.


I’m reachable via email at s.schelter[at]
I’m also very active on Twitter as @sscdotopen. Most of the research code that I write is available under an open source license in one of three GitHub accounts. Last but not least, I also have profiles on Google Scholar and DBLP.