Publications

. Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines. ACM SIGMOD, 2023.

. Proactively Screening Machine Learning Pipelines with ArgusEyes. ACM SIGMOD (demo), 2023.

. Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making. International Conference on Data Engineering (ICDE), 2023.

PDF

. How to Make an Outlier? Studying the Effect of Presentational Features on the Outlierness of Items in Product Search Results. ACM Conference on Human Information Interaction and Retrieval (CHIIR), 2022.

. Reconstructing and Querying ML Pipeline Intermediates. Conference on Innovative Data Systems Research (CIDR, abstract), 2022.

PDF

. Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning. Table Representation Learning workshop at NeurIPS, 2022.

PDF

. A Personalized Neighborhood-based Model for Within-basket Recommendation in Grocery Shopping. ACM International Conference on Web Search and Data Mining (WSDM), 2022.

PDF

. DORIAN in action: Assisted Design of Data Science Pipelines. VLDB (demo), 2022.

PDF

. Responsible Data Management. Communications of the ACM, 2022.

PDF

. Letter from the Special Issue Editor. Special issue on “Directions Towards GDPR-Compliant Data Systems and Applications” of the IEEE Data Engineering Bulletin (Vol 45, Issue 1), 2022.

PDF

. Towards Data-Centric What-If Analysis for Native Machine Learning Pipelines. Data Management for End-to-End Machine Learning workshop at ACM SIGMOD, 2022.

PDF

. ReCANet: A Repeat Consumption-Aware Neural Network for Next Basket Recommendation in Grocery Shopping. ACM SIGIR, 2022.

PDF

. GitSchemas: A Schema Dataset for Automating Relational Data Preparation Tasks. Databases for Machine Learning workshop at ICDE, 2022.

PDF

. Serving Low-Latency Session-Based Recommendations at bol.com. ECIR (industry talk), 2022.

PDF

. Serenade - Low-Latency Session-Based Recommendation in e-Commerce at Scale. ACM SIGMOD, 2021.

PDF

. Data Distribution Debugging in Machine Learning Pipelines. The VLDB Journal — The International Journal on Very Large Data Bases (Special Issue on Data Science for Responsible Data Management), 2021.

PDF

. Parameter Efficient Deep Probabilistic Forecasting. International Journal of Forecasting, 2021.

PDF

. Screening Native Machine Learning Pipelines with ArgusEyes. Conference on Innovative Data Systems Research (CIDR, abstract), 2021.

PDF

. Understanding and Mitigating the Effect of Outliers in Fair Ranking. ACM International Conference on Web Search and Data Mining (WSDM), 2021.

PDF

. Efficiently Maintaining Next Basket Recommendations under Additions and Deletions of Baskets and Items. Workshop on Online Recommender Systems and User Modeling at ACM RecSys, 2021.

PDF

. Understanding Multi-channel Customer Behavior in Retail. ACM Conference on Information and Knowledge Management (CIKM), 2021.

PDF

. Towards Efficient Machine Unlearning via Incremental View Maintenance. Workshop on Challenges in Deploying and Monitoring ML Systems at the International Conference on Machine Learning (ICML), 2021.

PDF

. DuckDQ: Data Quality Assertions for Machine Learning Pipelines. Workshop on Challenges in Deploying and Monitoring ML Systems at the International Conference on Machine Learning (ICML), 2021.

PDF

. Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression. ACM SIGKDD, 2021.

PDF

. Letter from the Special Issue Editor. Special issue on “Data validation for machine learning models and applications” of the IEEE Data Engineering Bulletin (Vol 44, Issue 1), 2021.

PDF

. HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning. ACM SIGMOD, 2021.

PDF

. Learnings from a Retail Recommendation System on Billions of Interactions at bol.com. International Conference on Data Engineering (ICDE), 2021.

PDF

. mlinspect: a Data Distribution Debugger for Machine Learning Pipelines. ACM SIGMOD (demo), 2021.

PDF

. Automating Data Quality Validation for Dynamic Data Ingestion. International Conference on Extending Database Technology (EDBT), 2021.

PDF

. Jenga - A Framework to Study the Impact of Data Errors on the Predictions of Machine Learning Models. International Conference on Extending Database Technology (EDBT), 2020.

PDF

. Taming Technical Bias in Machine Learning Pipelines. IEEE Data Engineering Bulletin (Special Issue on Interdisciplinary Perspectives on Fairness and Artificial Intelligence Systems), 2020.

PDF

. Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines. Conference on Innovative Data Systems Research (CIDR), 2020.

PDF

. RetaiL: Open your own grocery store to reduce waste. NeurIPS (demonstration track), 2020.

PDF

. Technical Perspective: Query Optimization for Faster Deep CNN Explanations. ACM SIGMOD Record (Vol 49, Issue 1), 2020.

PDF

. Demand Forecasting in the Presence of Privileged Information. Workshop on Advanced Analytics and Learning on Temporal Data at ECML/PKDD, 2020.

PDF

. A Comparison of Supervised Learning to Match Methods for Product Search. eCommerce workshop at SIGIR, 2020.

PDF

. Analyzing and Predicting Purchase Intent in E-commerce: Anonymous vs. Identified Customers. eCommerce workshop at SIGIR, 2020.

PDF

. Apache Mahout: Machine Learning on Distributed Dataflow Systems. Journal of Machine Learning Research (JMLR), open source software track, 2020.

PDF

. AlphaJoin: Join Order Selection à la AlphaGo. PhD workshop at VLDB, 2020.

. Fairness-Aware Instrumentation of Preprocessing Pipelines for Machine Learning. Human-In-the-Loop Data Analytics workshop at ACM SIGMOD, 2020.

PDF

. HDDse: Enabling High-Dimensional Disk State Embedding for Generic Failure Detection of Heterogeneous Disks in Large Data Centers. USENIX Annual Technical Conference (ATC), 2020.

. Elastic Machine Learning Algorithms in Amazon SageMaker. ACM SIGMOD, 2020.

PDF

. Tier-Scrubbing: An Adaptive and Tiered Disk Scrubbing Scheme with Improved MTTD and Reduced Cost. Design Automation Conference (DAC), 2020.

. Towards Unsupervised Data Quality Validation on Dynamic Data. Workshop on Explainability for Trustworthy ML Pipelines at EDBT, 2020.

PDF

. Towards Automated ML Model Monitoring: Measure, Improve and Quantify Data Quality. ML Ops workshop at the Conference on Machine Learning and Systems (MLSys), 2020.

PDF

. Tier-Scrubbing: An Adaptive and Tiered Disk Scrubbing Scheme. USENIX Conference on File and Storage Technologies (FAST), work-in-progress track, 2020.

PDF

. Exploring Monte Carlo Tree Search for Join Order Selection. North East Database Day, 2020.

PDF

. Learning to Validate the Predictions of Black Box Classifiers on Unseen Data. ACM SIGMOD, 2020.

PDF

. FairPrep: Promoting Data to a First-Class Citizen in Studies on Fairness-Enhancing Interventions. International Conference on Extending Database Technology (EDBT), 2020.

PDF

. Zooming Out on an Evolving Graph. International Conference on Extending Database Technology (EDBT), 2020.

PDF

. 'Amnesia' - A Selection of Machine Learning Models That Can Forget User Data Very Fast. Conference on Innovative Data Systems Research (CIDR), 2020.

PDF

. DataWig - Missing Value Imputation for Tables. Journal of Machine Learning Research (JMLR), open source software track, 2019.

PDF

. An Intermediate Representation for Optimizing Machine Learning Pipelines. International Conference on Very Large Databases (VLDB), 2019.

PDF

. AdaBench - Towards an Industry Standard Benchmark for Advanced Analytics. TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC), 2019.

PDF

. 'Amnesia' - Towards Machine Learning Models That Can Forget User Data Very Fast. Workshop on Applied AI for Database Systems and Applications (AIDB) at VLDB, 2019.

PDF

. Efficient Incremental Cooccurrence Analysis for Item-Based Collaborative Filtering. International Conference on Scientific and Statistical Database Management (SSDBM), 2019.

PDF

. Learning to Validate the Predictions of Black Box Machine Learning Models on Unseen Data. Human-In-the-Loop Data Analytics workshop at ACM SIGMOD, 2019.

PDF

. DEEM 2019: Workshop on Data Management for End-to-End Machine Learning. ACM SIGMOD (workshop summary), 2019.

PDF

. Differential Data Quality Verification on Partitioned Data. International Conference on Data Engineering (ICDE), 2019.

PDF

. Unit Testing Data with Deequ. ACM SIGMOD (demo), 2019.

PDF

. Data-Related Challenges in End-to-End Machine Learning. North East Database Day, 2019.

PDF

. On Challenges in Machine Learning Model Management. IEEE Data Engineering Bulletin, 2018.

PDF

. Deequ - Data Quality Validation for Machine Learning Pipelines. Machine Learning Systems workshop at the conference on Neural Information Processing Systems (NeurIPS), 2018.

PDF

. Deep Learning for Missing Value Imputation in Tables with Non-Numerical Data. ACM Conference on Information and Knowledge Management (CIKM), 2018.

PDF

. Benchmarking Distributed Data Processing Systems for Machine Learning Workloads. TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC), 2018.

PDF

. BlockJoin: Efficient Matrix Partitioning Through Joins. International Conference on Very Large Databases (VLDB), 2018.

PDF

. Automating Large-Scale Data Quality Verification. International Conference on Very Large Databases (VLDB), 2018.

PDF

. On the Ubiquity of Web Tracking: Insights from a Billion-Page Web Crawl. Journal of Web Science (JWS), 2018.

PDF

. Automatically Tracking Metadata and Provenance of Machine Learning Experiments. Machine Learning Systems workshop at the conference on Neural Information Processing Systems (NIPS), 2017.

PDF

. Dark Germany: Hidden Patterns of Participation in Online Far-Right Protests Against Refugee Housing. International Conference on Social Informatics (SocInfo), 2017.

PDF

. Probabilistic Demand Forecasting at Scale. International Conference on Very Large Databases (VLDB), 2017.

PDF

. Dark Germany: Hidden Patterns of Participation in Online Far-Right Protests Against Refugee Housing. ACM Web Science Conference (WebSci), 2017.

PDF

. Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems. Fachtagung für Business, Technologie und Web (BTW), 2017.

PDF

. Structural Patterns in the Rise of Germany’s New Right on Facebook. Data Mining in Politics workshop at the International Conference on Data Mining (ICDM), 2016.

PDF

. Samsara: Declarative Machine Learning on Distributed Dataflow Systems. Machine Learning Systems workshop at the conference on Neural Information Processing Systems (NIPS), 2016.

PDF

. Doubly stochastic large scale kernel learning with the empirical kernel map. arxiv, 2016.

PDF

. Predicting Political Party Affiliation from Text. International Conference on the Advances in Computational Analysis of Political Text (PolText), 2016.

PDF

. Tracking The Trackers: A Large-Scale Analysis of Embedded Web Trackers. AAAI International Conference on Web and Social Media (ICWSM), 2016.

PDF

. Scaling Data Mining in Massively Parallel Dataflow Systems. Technische Universität Berlin, 2015.

PDF

. Optimistic Recovery for Iterative Dataflows in Action. ACM SIGMOD (demo), 2015.

PDF

. Efficient Sample Generation for Scalable Meta Learning. IEEE International Conference on Data Engineering (ICDE), 2015.

PDF

. Factorbird - a Parameter Server Approach to Distributed Matrix Factorization. Distributed Machine Learning and Matrix Computations workshop at the conference on Neural Information Processing Systems (NIPS), 2014.

PDF

. The Stratosphere platform for big data analytics. The VLDB Journal — The International Journal on Very Large Data Bases, 2014.

PDF

. Scaling Data Mining in Massively Parallel Dataflow Systems. PhD Symposium at ACM SIGMOD, 2014.

PDF

. 'All Roads Lead to Rome:' Optimistic Recovery for Distributed Iterative Data Processing. ACM Conference on Information and Knowledge Management (CIKM), 2013.

PDF

. Distributed Matrix Factorization with MapReduce using a series of Broadcast-Joins. ACM Conference on Recommender Systems (RecSys), 2013.

PDF

. Iterative Parallel Data Processing with Stratosphere: An Inside Look. ACM SIGMOD (demo), 2013.

PDF

. Collaborative Filtering with Apache Mahout. Recommender Systems Challenge Workshop in conjunction with ACM RecSys, 2012.

PDF

. Scalable Similarity-Based Neighborhood Methods with MapReduce. ACM Conference on Recommender Systems (RecSys), 2012.

PDF
