I am an Assistant Professor with the University of Amsterdam, conducting research at the intersection of data management and machine learning. I am affiliated with the Intelligent Data Engineering Lab, manage the AI for Retail Lab, and am a Research Fellow at Ahold Delhaize, an international retailer based in the Netherlands.
My work addresses data-related problems that occur in the real-world application of machine learning. Examples include the automation of data quality validation, the inspection of machine learning pipelines via code instrumentation, and the design of machine learning applications that can efficiently forget data.
Most of my research is accompanied by efficient and scalable open source implementations, many of which are applied in real-world use cases, for example in the Amazon SageMaker Model Monitor service.
Previously, I was a Faculty Fellow with the Center for Data Science at New York University and a Senior Applied Scientist at Amazon Research, after obtaining my Ph.D. in the database group of TU Berlin with Volker Markl. I am active in open source as an elected member of the Apache Software Foundation, and have extensive experience in building real-world systems from my time at Amazon, Twitter, IBM Research, and Zalando.
I have been the editor for the special issue on “data validation for machine learning models and applications” of the IEEE Data Engineering Bulletin.
We will present a paper on learnings from a real world recommender system at one of Europe’s largest e-commerce platforms at ICDE 2021, as well as a paper on maintaining randomised trees for low-latency machine unlearning and a demo on a data distribution debugger for machine learning pipelines at SIGMOD 2021.
PhD Students
Mozhdeh Ariannezhad (with Maarten de Rijke)
Olivier Sprangers (with Maarten de Rijke)
Arezoo Sarvi (with Maarten de Rijke)
Barrie Kersbergen (with Maarten de Rijke)
Stefan Grafberger (with Paul Groth)
Research Engineers, Associated Researchers & Master Students
Shubha Guha
Till Doehmen
Dr. Ji Zhang
Benjamin Wang
Yoran Sturkenboom
Machine learning applications are increasingly used to automate impactful decisions, yet they can be very brittle with respect to their input data, which raises concerns about the correctness, reliability, and fairness of such applications. mlinspect is a library that helps data scientists diagnose and mitigate technical bias that arises during data preprocessing in an ML pipeline. mlinspect can instrument natively written ML code in Python that uses libraries such as pandas or scikit-learn, and automatically applies several inspections to the intermediate results of preprocessing operations.
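The kind of check described above can be illustrated with plain pandas. The following is a minimal sketch and not mlinspect's API: the helper names `group_shares` and `share_changes`, the toy data, and the 0.1 threshold are all assumptions made for illustration. It compares the share of each group in a sensitive column before and after a preprocessing step and flags large shifts.

```python
import pandas as pd


def group_shares(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the relative frequency of each value in `column`."""
    return df[column].value_counts(normalize=True)


def share_changes(before: pd.DataFrame, after: pd.DataFrame, column: str) -> pd.Series:
    """Return how much each group's share changed across a preprocessing step."""
    before_shares = group_shares(before, column)
    # Groups that vanish entirely are assigned a share of 0.0 after the step.
    after_shares = group_shares(after, column).reindex(before_shares.index, fill_value=0.0)
    return after_shares - before_shares


# Hypothetical toy data; 'age_group' plays the role of the sensitive attribute.
patients = pd.DataFrame({
    "age_group": ["young", "young", "young", "old", "old", "old"],
    "income": [30.0, 45.0, 50.0, None, None, 60.0],
})

# An innocent-looking preprocessing step: drop rows with missing values.
cleaned = patients.dropna()

changes = share_changes(patients, cleaned, "age_group")
# Flag groups whose share moved by more than an (assumed) threshold of 0.1.
flagged = changes[changes.abs() > 0.1]
print(flagged)
```

In this toy example, dropping rows with missing income removes mostly rows from one age group, so the filter silently shifts the group proportions that a downstream model is trained on.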
Serenade is a session-based recommender system for e-commerce platforms that handles thousands of requests per second with low latency. Serenade is implemented in Rust, deployable with Kubernetes, and features a custom index for efficient session similarity computation.
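To give an intuition for why an index helps with session similarity, here is a minimal Python sketch of a generic nearest-neighbour session recommender. It is not Serenade's implementation or its index design: the function names `build_index` and `recommend`, the Jaccard similarity, the toy sessions, and the parameters `k` and `n` are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_index(sessions):
    """Map each item to the ids of the historical sessions that contain it."""
    index = defaultdict(set)
    for session_id, items in sessions.items():
        for item in items:
            index[item].add(session_id)
    return index


def recommend(current_items, sessions, index, k=2, n=3):
    """Recommend up to n items based on the k most similar historical sessions."""
    current = set(current_items)
    # Only sessions sharing at least one item can have non-zero similarity,
    # so the index restricts the search to those candidates.
    candidates = set()
    for item in current:
        candidates |= index.get(item, set())
    # Jaccard similarity between the evolving session and each candidate.
    similarities = {}
    for session_id in candidates:
        other = set(sessions[session_id])
        similarities[session_id] = len(current & other) / len(current | other)
    neighbours = sorted(similarities, key=lambda s: (-similarities[s], s))[:k]
    # Score unseen items by the summed similarity of the neighbours containing them.
    scores = defaultdict(float)
    for session_id in neighbours:
        for item in sessions[session_id]:
            if item not in current:
                scores[item] += similarities[session_id]
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]


# Hypothetical historical sessions.
sessions = {
    1: ["shoes", "socks"],
    2: ["shoes", "laces", "socks"],
    3: ["phone", "charger"],
    4: ["socks", "sandals"],
}
index = build_index(sessions)
recommendations = recommend(["shoes"], sessions, index)
print(recommendations)
```

The inverted index means that a request only touches sessions overlapping with the current one, instead of scanning the full click history.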
Snapcase is a research prototype for a recommender system that can instantly “forget” user data (e.g. in response to GDPR deletion requests) and update its recommendations accordingly. Snapcase models various recommendation algorithms via differential computation and is implemented in Rust on top of Differential Dataflow.
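The idea of forgetting a user without retraining can be sketched in a few lines of Python. This is a simplified stand-in and not Snapcase's code: it uses a plain item co-occurrence counter instead of Differential Dataflow, and the class `CooccurrenceModel` and its methods are hypothetical names. Deleting a user subtracts exactly the contributions that user added, so the result matches a model trained without that user.

```python
from collections import Counter
from itertools import permutations


class CooccurrenceModel:
    """Item-to-item recommender based on co-occurrence counts (illustrative only)."""

    def __init__(self):
        self.histories = {}      # user -> set of interacted items
        self.counts = Counter()  # (item_a, item_b) -> number of users with both

    def add_user(self, user, items):
        items = set(items)
        self.histories[user] = items
        for pair in permutations(sorted(items), 2):
            self.counts[pair] += 1

    def forget_user(self, user):
        """Remove a user's data and undo their contribution to the counts."""
        items = self.histories.pop(user)
        for pair in permutations(sorted(items), 2):
            self.counts[pair] -= 1
            if self.counts[pair] == 0:
                del self.counts[pair]

    def recommend(self, item, n=3):
        scores = {b: c for (a, b), c in self.counts.items() if a == item}
        return sorted(scores, key=lambda b: (-scores[b], b))[:n]


model = CooccurrenceModel()
model.add_user("alice", ["a", "b"])
model.add_user("bob", ["a", "c"])
model.add_user("carol", ["a", "b", "c"])
model.forget_user("carol")

# Reference model trained from scratch without carol's data.
retrained = CooccurrenceModel()
retrained.add_user("alice", ["a", "b"])
retrained.add_user("bob", ["a", "c"])

print(model.counts == retrained.counts)
```

The comparison with the retrained model is the property of interest: after the deletion, no trace of the removed user's interactions remains in the counts that drive the recommendations.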
I am collaborating with Prof. Julia Stoyanovich from New York University on research into the impact of data preprocessing on the fairness of machine-assisted decision making.
I am working with Hannes Muehleisen from Centrum Wiskunde & Informatica (CWI) on leveraging DuckDB for the efficient execution of data preprocessing in ML pipelines.
I am conducting research on data validation and data cleaning for machine learning with Prof. Felix Biessmann from Beuth University, Berlin.
I am an associated researcher with BIFOLD, the Berlin Institute for the Foundations of Learning and Data.
Before joining the University of Amsterdam, I was a Faculty Fellow at the Center for Data Science at New York University, and a Senior Applied Scientist at Amazon Research in Berlin, where I worked on data management-related issues of machine learning applications, such as demand forecasting, metadata and provenance tracking of machine learning pipelines, and automating data quality verification.
I received my Ph.D. with “summa cum laude” from TU Berlin in 2015, where I was advised by Volker Markl, head of the database systems and information management group. My co-supervisors were Klaus-Robert Müller from the machine learning group at TU Berlin and Reza Zadeh from Stanford. During my studies, I interned with the SystemML group at IBM Research Almaden and the social recommendations team at Twitter in California.
I am engaged in open source as an elected member of the Apache Software Foundation. I have been involved in Apache Mahout, Apache Flink, and Apache Giraph, and in the incubation of the Apache MXNet and Apache TVM projects. Furthermore, I am the creator of deequ, a library for ‘unit-testing’ large datasets with Apache Spark, and recoreco, a fast item-to-item recommender written in Rust.
I am the founder and chair of the workshop series on Data Management for End-To-End Machine Learning (DEEM) at ACM SIGMOD, which started in 2017. I serve as the editor for two issues of the IEEE Data Engineering Bulletin in 2021 and 2022, and as Associate Editor for the Scalable Data Science Track for PVLDB Volume 15 for the period of April 2021 through March 2022.
I regularly review submissions to top-tier data management conferences. I have been on the program committee at SIGMOD 2017, 2019-2022, VLDB 2021, ICDE 2018-2021, EDBT 2017 & 2021, CIKM’20, the PhD Symposium at VLDB’21, the workshop on Exploiting Artificial Intelligence Techniques for Data Management at SIGMOD 2019, the Large-Scale Recommender Systems workshop at ACM RecSys 2013-2015, the workshop on Applied AI for Database Systems and Applications at VLDB’20, and Provenance Week’20. Additionally, I have reviewed journal submissions for IEEE TKDE, ACM TIST, IEEE TPDS, IEEE TNNLS, VLDB Journal, the VLDB Journal Special Issue on Data Science for Responsible Data Management, the journal track of ECML/PKDD, and the open source track of JMLR. I am also part of the review board of the Journal of Systems Research (JSys), and have been a reviewer for the Amazon Research Awards.
At the University of Amsterdam, I coordinate the honors program of the bachelor's program in AI.
I’m reachable via email at s.schelter[at]uva.nl. I’m also very active on Twitter as @sscdotopen. Most of the research code that I write is available under an open source license in my GitHub account. Last but not least, I also have a profile on Google Scholar.