
mlwhatif: What If You Could Stop Re-Implementing Your Machine Learning Pipeline Analyses Over and Over?

Published: 01 August 2023

Abstract

    Software systems that learn from data with machine learning (ML) are used in critical decision-making processes. Unfortunately, real-world experience shows that the pipelines for data preparation, feature encoding, and model training in ML systems are often brittle with respect to their input data. As a consequence, data scientists have to run different kinds of data-centric what-if analyses to evaluate the robustness and reliability of such pipelines, e.g., with respect to data errors or preprocessing techniques. These what-if analyses follow a common pattern: they take an existing ML pipeline, create a pipeline variant by introducing a small change, and execute this variant to see how the change impacts the pipeline's output score.
    We recently proposed mlwhatif, a library that enables data scientists to declaratively specify what-if analyses for an ML pipeline, and to automatically generate, optimize and execute the required pipeline variants. We demonstrate how data scientists can leverage mlwhatif for a variety of pipelines and three different what-if analyses focusing on the robustness of a pipeline against data errors, the impact of data cleaning operations, and the impact of data preprocessing operations on fairness. In particular, we demonstrate step-by-step how mlwhatif generates and optimizes the required execution plans for the pipeline analyses. Our library is publicly available at https://github.com/stefan-grafberger/mlwhatif.
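    To make the declarative workflow concrete, the following Python sketch shows how a robustness what-if analysis might be specified and run. The names used here (PipelineAnalyzer, DataCorruption, on_pipeline_from_py_file, add_what_if_analysis, corruption_percentages, and the pipeline script path) are illustrative assumptions derived from the description above, not verbatim API; the authoritative interface is documented in the repository at https://github.com/stefan-grafberger/mlwhatif.

        # Illustrative sketch only: the class, method, and argument names below are
        # assumptions based on the paper's description, not verified mlwhatif API.
        from mlwhatif import PipelineAnalyzer            # hypothetical entry point
        from mlwhatif.analysis import DataCorruption     # hypothetical robustness analysis

        # Declare a robustness what-if analysis: corrupt selected input columns with
        # increasing severity and observe how the pipeline's output score changes.
        robustness = DataCorruption(
            column_to_corruption={"age": "scaling", "income": "missing_values"},
            corruption_percentages=[0.2, 0.5, 0.9],
        )

        # mlwhatif instruments the unmodified pipeline script, generates the required
        # pipeline variants, optimizes the combined execution plan, and runs them.
        result = (
            PipelineAnalyzer
            .on_pipeline_from_py_file("credit_scoring_pipeline.py")  # hypothetical example script
            .add_what_if_analysis(robustness)
            .execute()
        )

        # One report per analysis, e.g., the output score at each corruption level.
        print(result.analysis_to_result_reports[robustness])

    Under this sketch, further analyses (e.g., for data cleaning or for the fairness impact of preprocessing) would be registered the same way, so that all generated pipeline variants can share a single optimized execution plan.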



    Published In

    Proceedings of the VLDB Endowment, Volume 16, Issue 12 (August 2023), 685 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment
