Research article
DOI: 10.1145/3329486.3329489

Debugging Machine Learning Pipelines

Published: 30 June 2019

Abstract

Machine learning tasks entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous or uninformative outputs, the pipeline may fail or produce incorrect results. Inferring the root cause of failures and unexpected behavior is challenging: it usually requires much human thought and is both time-consuming and error-prone. We propose a new approach that uses iteration and provenance to automatically infer root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our source code and experimental data will be available for reproducibility and enhancement.
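
To make the idea of combining iteration with provenance more concrete, here is a minimal Python sketch. It is not the authors' algorithm. It assumes a tiny, hypothetical parameter space (`SPACE`), a stand-in `run_pipeline` function with a planted failure condition, and exhaustive enumeration of configurations in place of any smarter choice of which instances to run next. The recorded (configuration, outcome) pairs play the role of provenance, and the "explanation" is the smallest conjunction of parameter values that occurs in failing runs and never in a succeeding one.

```python
from itertools import combinations, product


def run_pipeline(config):
    """Hypothetical pipeline stand-in (an assumption for this sketch):
    it fails whenever the 'minmax' scaler feeds the 'svm' model."""
    return not (config["scaler"] == "minmax" and config["model"] == "svm")


def collect_provenance(space, evaluate):
    """Run every configuration in the space and record (config, succeeded) pairs.
    Exhaustive enumeration is a simplification that only suits small spaces."""
    names = sorted(space)
    records = []
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        records.append((config, evaluate(config)))
    return records


def minimal_explanations(records):
    """Return the smallest parameter-value conjunctions that appear in a
    failing run and in no succeeding run."""
    failures = [cfg for cfg, ok in records if not ok]
    successes = [cfg for cfg, ok in records if ok]
    if not failures:
        return []
    names = sorted(failures[0])
    # Try conjunctions of increasing size so the first hit is the most succinct.
    for size in range(1, len(names) + 1):
        found = set()
        for cfg in failures:
            for subset in combinations(names, size):
                candidate = tuple((name, cfg[name]) for name in subset)
                seen_in_success = any(
                    all(s[name] == value for name, value in candidate)
                    for s in successes
                )
                if not seen_in_success:
                    found.add(candidate)
        if found:
            return sorted(found)
    return []


# Hypothetical parameter space; names and values are illustrative only.
SPACE = {
    "scaler": ["standard", "minmax"],
    "model": ["svm", "tree"],
    "seed": [0, 1],
}

records = collect_provenance(SPACE, run_pipeline)
explanations = minimal_explanations(records)
print(explanations)
```

In this toy setting no single parameter value separates failures from successes, so the search moves on to pairs and reports the scaler and model combination while leaving out the irrelevant `seed`. A real pipeline would make each run expensive, which is why the choice of which configurations to execute matters far more than it does here.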

Published In

DEEM'19: Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning
June 2019
72 pages
ISBN:9781450367974
DOI:10.1145/3329486
Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS '19
Acceptance Rates

Overall acceptance rate: 44 of 67 submissions (66%)