We have an opening for a PhD candidate in Data Management for Machine Learning at the University of Amsterdam.


We are looking for a PhD candidate who is eager to design, implement and evaluate data validation techniques to make large-scale ML applications more robust and reliable.

Your research will focus on complex ML applications, which include data integration and data preprocessing pipelines. The goal of your research will be to make future systems automatically detect potential data errors early, and warn the operating users (who are typically not ML experts) accordingly. You will provide efficient and scalable implementations of your methods, and integrate them with popular open source systems.


You will be joining an international team of several PhD students at the AI for Retail Lab, a joint UvA-Ahold Delhaize industry lab, which conducts research into the socially responsible application of large-scale data processing and machine learning for retail use cases.

This lab offers a unique setup to PhD candidates: you can conduct academic research at the university, and evaluate your ideas on real world data and systems at the same time, in collaboration with our industry partners, Albert Heijn and, both brands of Ahold Delhaize. The lab is situated within the larger Amsterdam data science and artificial intelligence ecosystem and values practice-informed and interdisciplinary research and outreach.


We provide a selection of research papers and open source projects that are relevant for the envisioned research direction of the PhD.


Open Source Libraries



Required

  • Master’s degree in computer science (or a related field)
  • Creative and independent mindset
  • Fluency in English
  • Strong programming skills in Python and one additional language (e.g., Java, Scala, C/C++ or Rust)
  • Interest in the internals of dataflow systems (Spark, Hadoop, Flink, Timely Dataflow, …) and relational databases
  • Basic understanding of machine learning

Not required, but helpful

  • First-hand experience with real world data processing systems and/or ML deployments (e.g., from internships or jobs)
  • Contributions to open source projects


Please apply via the official job advertisement. Feel free to contact me if you’re interested in the position and/or have some questions.

Recent Publications


(2020). Elastic Machine Learning Algorithms in Amazon SageMaker. ACM SIGMOD.


(2020). Tier-Scrubbing: An Adaptive and Tiered Disk Scrubbing Scheme with Improved MTTD and Reduced Cost. Design Automation Conference (DAC).

(2020). Towards Unsupervised Data Quality Validation on Dynamic Data. Workshop on Explainability for Trustworthy ML Pipelines at EDBT.


(2020). Towards Automated ML Model Monitoring: Measure, Improve and Quantify Data Quality. ML Ops workshop at the Conference on Machine Learning and Systems (MLSys).


(2020). Tier-Scrubbing: An Adaptive and Tiered Disk Scrubbing Scheme. USENIX Conference on File and Storage Technologies (FAST), work-in-progress track.


(2020). Exploring Monte Carlo Tree Search for Join Order Selection. North East Database Day.


(2020). Learning to Validate the Predictions of Black Box Classifiers on Unseen Data. ACM SIGMOD.


(2020). FairPrep: Promoting Data to a First-Class Citizen in Studies on Fairness-Enhancing Interventions. International Conference on Extending Database Technology (EDBT).



I am part of the Information and Language Processing Systems group led by Prof. Maarten de Rijke and the Intelligent Data Engineering Lab led by Prof. Paul Groth.

PhD Students

Mozhdeh Ariannezhad (co-supervised with Maarten de Rijke)
Olivier Sprangers (co-supervised with Maarten de Rijke)
Mariya Hendriksen (co-supervised with Maarten de Rijke)
Sami Jullien (co-supervised with Maarten de Rijke)
Arezoo Sarvi (co-supervised with Maarten de Rijke)
Sergey Redyuk, TU Berlin (co-supervised with Volker Markl)

Master Students

Stefan Grafberger, TU Munich (co-supervised with Alfons Kemper and Julia Stoyanovich)


  • I am collaborating with Prof. Julia Stoyanovich from New York University on research with regard to the impact of data preprocessing on the fairness of machine-assisted decision making.

  • I am conducting research on data validation and data cleaning for machine learning with Prof. Felix Biessmann from Beuth University, Berlin.

Past Collaborations


Before joining the University of Amsterdam, I was a Faculty Fellow at the Center for Data Science at New York University and a Senior Applied Scientist at Amazon Core AI in Berlin, where I worked on data management-related issues of machine learning applications, such as demand forecasting, metadata and provenance tracking of machine learning pipelines, and automating data quality verification.

I received my Ph.D. from TU Berlin in 2015, where I was advised by Volker Markl, head of the database systems and information management group. My co-supervisors were Klaus-Robert Müller from the machine learning group at TU Berlin and Reza Zadeh from Stanford. During my studies, I interned with the SystemML group at IBM Research Almaden and the social recommendations team at Twitter in California.

I am engaged in open source as an elected member of the Apache Software Foundation, where I currently mentor the Apache TVM project on behalf of the Apache Incubator. In the past, I have been involved in the Apache Mahout, Apache Flink, Apache Giraph and Apache MXNet projects. I am currently an active contributor to deequ, a library for ‘unit-testing’ large datasets with Apache Spark, and to recoreco, a fast item-to-item recommender written in Rust.


I am the founder and chair of the workshop series on Data Management for End-To-End Machine Learning (DEEM) at ACM SIGMOD, which started in 2017.

I regularly review submissions to top-tier data management conferences. I have served on the program committees of SIGMOD 2017, 2019 & 2020, VLDB 2021, ICDE 2018, 2019 & 2020, EDBT 2017, the workshop on Exploiting Artificial Intelligence Techniques for Data Management at SIGMOD 2019, and the Large-Scale Recommender Systems workshop at ACM RecSys 2013-2015. Additionally, I have reviewed journal submissions for IEEE TKDE, ACM TIST, IEEE TPDS, IEEE TNNLS, VLDB Journal, the journal track of ECML/PKDD, and the open source track of JMLR. I have also been a reviewer for the Amazon Research Awards.


I’m reachable via email at s.schelter[at] I’m also very active on Twitter as @sscdotopen. Most of the research code that I write is available under an open source license in my GitHub account. Last but not least, I also have a profile on Google Scholar.