Debugging Machine Learning Pipelines

Authors:

Raoni Lourenço,

Juliana Freire,

Dennis Shasha

DEEM'19: Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning

Article No.: 3, Pages 1 - 10

https://doi.org/10.1145/3329486.3329489

Published: 30 June 2019

Abstract

Machine learning tasks entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous or uninformative outputs, the pipeline may fail or produce incorrect results. Inferring the root cause of failures and unexpected behavior is challenging, usually requiring much human thought, and is both time consuming and error prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our source code and experimental data will be available for reproducibility and enhancement.
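
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it names: run a pipeline repeatedly, record the provenance of each run (which parameter values were used and whether the run succeeded), and look for the smallest parameter-value combinations that appear only in failing runs. Everything here is an assumption made for illustration: the toy `run_pipeline`, the parameter names, and the exhaustive search are hypothetical and are not the authors' method.

```python
from itertools import combinations, product


def run_pipeline(config):
    """Hypothetical pipeline: fails only for an unscaled SVM (assumed for the demo)."""
    return not (config["scaler"] == "none" and config["model"] == "svm")


def collect_provenance(space, evaluate):
    """Run every configuration in `space` and record (config, succeeded) pairs."""
    names = sorted(space)
    records = []
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        records.append((config, evaluate(config)))
    return records


def minimal_failure_explanations(records, max_size=2):
    """Return the smallest parameter-value conjunctions seen only in failing runs."""
    failing = [config for config, ok in records if not ok]
    passing = [config for config, ok in records if ok]
    explanations = []
    for size in range(1, max_size + 1):
        # Candidate conjunctions are drawn from failing runs only.
        candidates = set()
        for config in failing:
            candidates.update(combinations(sorted(config.items()), size))
        for combo in sorted(candidates):
            # Skip conjunctions that extend an explanation already found.
            if any(set(found) <= set(combo) for found in explanations):
                continue
            # Keep the conjunction only if no passing run satisfies it.
            if not any(all(p[k] == v for k, v in combo) for p in passing):
                explanations.append(combo)
    return explanations


# Hypothetical search space; the names and values are illustrative only.
space = {
    "model": ["svm", "tree"],
    "scaler": ["none", "standard"],
    "impute": ["mean", "median"],
}
records = collect_provenance(space, run_pipeline)
print(minimal_failure_explanations(records))
```

Enumerating every configuration is feasible only for tiny search spaces like this one; it is used here purely to keep the sketch short, and a practical debugger would need to choose which configurations to run.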

Index Terms

Debugging Machine Learning Pipelines
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Data provenance

Published In

DEEM'19: Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning

June 2019

72 pages

ISBN:9781450367974

DOI:10.1145/3329486

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

New York University
Defense Advanced Research Projects Agency
National Science Foundation
Conselho Nacional de Desenvolvimento Científico e Tecnológico

Conference

SIGMOD/PODS '19

Sponsor:

SIGMOD

SIGMOD/PODS '19: International Conference on Management of Data

June 30, 2019

Amsterdam, Netherlands

Acceptance Rates

Overall Acceptance Rate 44 of 67 submissions, 66%

Article Metrics

Total Citations: 15
Total Downloads: 1,410

Downloads (Last 12 months): 262
Downloads (Last 6 weeks): 41

Reflects downloads up to 10 Nov 2024

Cited By

Daga E, Groth P. (2024). Data journeys: Explaining AI workflows through abstraction. Semantic Web 15:4 (1057-1083). Online publication date: 4-Oct-2024.
https://doi.org/10.3233/SW-233407
Song Y, Xie X, Xu B. (2024). When debugging encounters artificial intelligence: state of the art and open challenges. Science China Information Sciences 67:4. Online publication date: 21-Feb-2024.
https://doi.org/10.1007/s11432-022-3803-9
Lai T, Simmons A, Barnett S, Schneider J, Vasa R. (2024). Comparative analysis of real issues in open-source machine learning projects. Empirical Software Engineering 29:3. Online publication date: 2-May-2024.
https://doi.org/10.1007/s10664-024-10467-3
Balayn A, Rikalo N, Yang J, Bozzon A. (2023). Faulty or Ready? Handling Failures in Deep-Learning Computer Vision Models until Deployment: A Study of Practices, Challenges, and Needs. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (1-20). Online publication date: 19-Apr-2023.
https://dl.acm.org/doi/10.1145/3544548.3581555
Jung J, Steinberger T, So C. (2023). Towards Actionable Data Science: Domain Experts as End-Users of Data Science Systems. Computer Supported Cooperative Work (CSCW) 33:3 (389-433). Online publication date: 15-Jul-2023.
https://doi.org/10.1007/s10606-023-09475-6
Balayn A, Rikalo N, Lofi C, Yang J, Bozzon A. (2022). How can Explainability Methods be Used to Support Bug Identification in Computer Vision Models? Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (1-16). Online publication date: 29-Apr-2022.
https://dl.acm.org/doi/10.1145/3491102.3517474
Cao T, Truong H, Truong-Huu T, Nguyen M. (2022). Enabling Awareness of Quality of Training and Costs in Federated Machine Learning Marketplaces. 2022 IEEE/ACM 15th International Conference on Utility and Cloud Computing (UCC) (41-50). Online publication date: Dec-2022.
https://doi.org/10.1109/UCC56403.2022.00015
Chai C, Wang J, Luo Y, Niu Z, Li G. (2022). Data Management for Machine Learning: A Survey. IEEE Transactions on Knowledge and Data Engineering (1-1). Online publication date: 2022.
https://doi.org/10.1109/TKDE.2022.3148237
Lourenço R, Freire J, Simon E, Weber G, Shasha D. (2022). BugDoc. The VLDB Journal 32:1 (75-101). Online publication date: 23-Feb-2022.
https://doi.org/10.1007/s00778-022-00733-5
Sambasivan N, Kapania S, Highfill H, Akrong D, Paritosh P, Aroyo L, Kitamura Y, Quigley A, Isbister K, Igarashi T, Bjørn P, Drucker S. (2021). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (1-15). Online publication date: 6-May-2021.
https://dl.acm.org/doi/10.1145/3411764.3445518