Sebastian Schelter

Assistant Professor

University of Amsterdam

I am an Assistant Professor with the University of Amsterdam, conducting research at the intersection of data management and machine learning.

My work addresses data-related problems that occur in the real-world application of machine learning. Examples include scalable data quality validation, data debugging for machine learning pipelines, and enforcing the “right-to-be-forgotten” in deployed machine learning applications.

My research is accompanied by efficient and scalable open source implementations, many of which are applied in real-world use cases, for example in the Amazon SageMaker Model Monitor service, AWS Glue Data Quality, the product recommendation system at bol.com, and large-scale recommendation libraries in cloud environments such as Amazon Web Services and Microsoft Azure.

My research contributions have been recognized with an ACM SIGMOD Systems Award, an ACM SIGMOD Best Demo Runner Up Award, and a Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS.

In the past, I was a Faculty Fellow with the Center for Data Science at New York University and a Senior Applied Scientist at Amazon Research, after obtaining my Ph.D. at the database group of TU Berlin with Volker Markl. I am active in open source as an elected member of the Apache Software Foundation, and have extensive experience in building real-world systems from my time at Amazon, Twitter, IBM Research, and Zalando.

News

Previous News

Oct 18, 2023 ‐ I will join the newly formed Journal of Data-centric Machine Learning Research (DMLR) as an Action Editor.
Aug 30, 2023 ‐ I will give a keynote on Directions Towards Resource-Efficient Machine Learning Systems in e-Commerce at the BIFOLD Weizenbaum Summer School on Artificial Intelligence and Ecological Sustainability in Berlin.
Jun 22, 2023 ‐ I am part of the large group of contributors to Apache Flink, who won the ACM SIGMOD Systems Award 2023! Furthermore, Stefan, Shubha and I won an ACM SIGMOD Best Demo Runner Up Award for our demo on Proactively Screening Machine Learning Pipelines with ArgusEyes.
Feb 22, 2023 ‐ We will present a paper on Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines and a demo on Proactively Screening Machine Learning Pipelines with ArgusEyes at SIGMOD in Seattle.
Feb 2, 2023 ‐ We will present a vision paper on the impact of data cleaning on the fairness of ML models at the new special track of ICDE, originating from our ongoing collaboration with the Center for Responsible AI at New York University.

Recent Publications

All Publications

Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter (2024). Directions Towards Efficient and Automated Data Wrangling with Large Language Models. Databases and Machine Learning workshop at ICDE.

Till Doehmen, Radu Geacu, Madelon Hulsebos, Sebastian Schelter (2024). SchemaPile: A Large Collection of Relational Database Schemas. ACM SIGMOD.

Stefan Grafberger, Zeyu Zhang, Sebastian Schelter, Ce Zhang (2024). Red Onions, Soft Cheese and Data: From Food Safety to Data Traceability for Responsible AI. IEEE Data Engineering Bulletin (Special Issue on Data-Centric Responsible AI).

Bojan Karlaš, David Dao, Matteo Interlandi, Sebastian Schelter, Wentao Wu, Ce Zhang (2024). Canonpipe: Data Debugging with Shapley Importance over Machine Learning Pipelines. International Conference on Learning Representations (ICLR).

Sergey Redyuk, Zoi Kaoudi, Sebastian Schelter, Volker Markl (2024). Assisted Design of Data Science Pipelines. The VLDB Journal — The International Journal on Very Large Data Bases.

Shubha Guha, Falaah Arif Khan, Julia Stoyanovich, Sebastian Schelter (2024). Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making. IEEE Transactions on Knowledge and Data Engineering (TKDE), Special Issue for Best and Innovation Papers from ICDE’23.

Barrie Kersbergen, Olivier Sprangers, Frank Kootte, Shubha Guha, Maarten de Rijke, Sebastian Schelter (2024). Etude - Evaluating the Inference Latency of Session-Based Recommendation Models at Scale. International Conference on Data Engineering (ICDE).

Songgaojun Deng, Olivier Sprangers, Ming Li, Sebastian Schelter, Maarten de Rijke (2024). Domain Generalization in Time Series Forecasting. ACM Transactions on Knowledge Discovery from Data (TKDD).

Team

PhD Students & Guests
Olivier Sprangers (with Maarten de Rijke)
Barrie Kersbergen (with Maarten de Rijke)
Stefan Grafberger (with Paul Groth)
Zeyu Zhang (with Iacer Calixto)
Shubha Guha (with Paul Groth)
Till Doehmen

Alumni (name, role and first employment)

Dr. Mozhdeh Ariannezhad, PhD student, ML scientist at booking.com
Arezoo Sarvi, PhD student, Data scientist search at Albert Heijn
David Vos, master student, PhD student at the University of Amsterdam
Benjamin Wang, master student, ML scientist at booking.com
Dr. Ji Zhang, guest researcher, Research scientist at Huawei

Collaborations

I am collaborating with Prof. Julia Stoyanovich from New York University on research into the impact of data preprocessing on the fairness of machine-assisted decision making.
I am working with Iacer Calixto from the Amsterdam UMC on problems at the intersection of responsible data management and natural language processing.
I am working with Hannes Muehleisen from Centrum Wiskunde & Informatica (CWI) on leveraging DuckDB for the efficient execution of data preprocessing in ML pipelines.
I am an associated researcher with BIFOLD, the Berlin Institute for the Foundations of Learning and Data.

CV

Scientific Career

Before joining the University of Amsterdam, I was a Faculty Fellow at the Center for Data Science at New York University, and a Senior Applied Scientist at Amazon Research in Berlin, where I worked on data management-related issues of machine learning applications, such as demand forecasting, metadata and provenance tracking of machine learning pipelines, and automating data quality verification.

I received my Ph.D. with “summa cum laude” from TU Berlin in 2015, where I was advised by Volker Markl, head of the database systems and information management group. My co-supervisors were Klaus-Robert Müller from the machine learning group at TU Berlin and Reza Zadeh from Stanford. During my studies, I interned with the SystemML group at IBM Research Almaden and the social recommendations team at Twitter in California.

Open Source

I have been engaged in open source as an elected member of the Apache Software Foundation since 2012. I have been involved in Apache Mahout, Apache Flink, Apache Giraph and the incubation of the Apache MXNet and Apache TVM projects. Besides that, I co-created Deequ, a library for ‘unit-testing’ large datasets with Apache Spark, and Serenade, a low-latency session-based recommender system deployed in production at a large Dutch retailer. Furthermore, I have been a member of the Electronic Frontier Foundation since 2015.

Scientific Service

I founded the workshop series on Data Management for End-To-End Machine Learning (DEEM) at ACM SIGMOD and chaired it from 2017 to 2020. I am an Action Editor for the Journal of Data-centric Machine Learning Research and the ML Open Source Software track of the Journal of Machine Learning Research. I have served as Associate Editor for PVLDB Volume 15, as the editor for two special issues of the IEEE Data Engineering Bulletin in 2021 and 2022, and as co-chair of the industry and applications track of EDBT 2022.

I regularly review submissions to top-tier data management conferences. I have served on the program committees of SIGMOD 2017 and 2019-2024, VLDB 2021, ICDE 2018-2021 and 2023, EDBT 2017 and 2021, CIKM’20, the PhD Symposium at VLDB’21, the workshop on Data-Centric Machine Learning Research at ICML, the workshop on Exploiting Artificial Intelligence Techniques for Data Management at SIGMOD 2019, the Large-Scale Recommender Systems workshop at ACM RecSys 2013-2015, the workshop on Applied AI for Database Systems and Applications at VLDB’20, the Table Representation Learning workshop at NeurIPS’22-23, the DBML workshop at ICDE’21 and Provenance Week’20. Additionally, I have reviewed journal submissions for IEEE TKDE, ACM TIST, IEEE TPDS, IEEE TNNLS, the VLDB Journal, the VLDB Journal Special Issue on Data Science for Responsible Data Management, the journal track of ECML/PKDD and the open source track of JMLR. I have also been a reviewer for the Amazon Research Awards.

At the University of Amsterdam, I coordinate the Big Data Engineering track of the computer science master's program and the honors program of the bachelor's program in AI, and teach courses on data engineering with up to 200 students.

Contact

I’m reachable via email at s.schelter[at]uva.nl. I’m also very active on Twitter as @sscdotopen. Most of the research code that I write is available under an open source license across three GitHub accounts. Last but not least, I also have profiles on Google Scholar and DBLP.