DOI: 10.1145/3555041.3589682
short-paper
Open access

Proactively Screening Machine Learning Pipelines with ARGUSEYES

Published: 05 June 2023

Abstract

Software systems that learn from data with machine learning (ML) are ubiquitous. ML pipelines in these applications often suffer from a variety of data-related issues, such as data leakage, label errors or fairness violations, which require reasoning about complex dependencies between the pipeline's inputs and outputs. These issues are usually detected only in hindsight, after deployment and after they have caused harm in production. We demonstrate ArgusEyes, a system that enables data scientists to proactively screen their ML pipelines for data-related issues as part of continuous integration. ArgusEyes instruments, executes and screens ML pipelines for declaratively specified pipeline issues, and analyzes data artifacts and their provenance to catch potential problems before deployment to production. We demonstrate our system in three scenarios: detecting mislabeled images in a computer vision pipeline, spotting data leakage in a price prediction pipeline, and addressing fairness violations in a credit scoring pipeline.
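
To make the idea of screening a pipeline for data leakage during continuous integration more concrete, the following is a minimal sketch in plain Python. It is not the ArgusEyes API, which this page does not show: the function names `find_leaked_rows` and `screen_for_leakage` are hypothetical, and the check (exact overlap between training and test rows) is a deliberately simplified stand-in for the provenance-based analysis the abstract describes.

```python
from typing import Hashable, Iterable, Sequence


def find_leaked_rows(
    train_rows: Iterable[Sequence[Hashable]],
    test_rows: Iterable[Sequence[Hashable]],
) -> list[tuple]:
    """Return the test rows that also appear verbatim in the training rows.

    Hypothetical helper for illustration only; not part of ArgusEyes.
    """
    train_set = {tuple(row) for row in train_rows}
    # Keep the order of the test data so that reports are reproducible.
    return [tuple(row) for row in test_rows if tuple(row) in train_set]


def screen_for_leakage(
    train_rows: Iterable[Sequence[Hashable]],
    test_rows: Iterable[Sequence[Hashable]],
) -> None:
    """Fail (as a CI check would) if any test row overlaps with training rows."""
    leaked = find_leaked_rows(train_rows, test_rows)
    assert not leaked, f"data leakage: {len(leaked)} test row(s) also in train"
```

A CI job could call `screen_for_leakage` on the train and test splits produced by a pipeline run and let the failed assertion block the build. A real screening system would also need to trace rows through joins, filters and feature encoders, which this sketch does not attempt.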

    Published In

    SIGMOD '23: Companion of the 2023 International Conference on Management of Data
    June 2023
    330 pages
    ISBN: 9781450395076
    DOI: 10.1145/3555041
    This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data validation
    2. machine learning pipelines
    3. provenance tracking

    Qualifiers

    • Short-paper

    Conference

    SIGMOD/PODS '23

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 251
    • Downloads (last 6 weeks): 32
    Reflects downloads up to 25 Feb 2025

    Cited By

    • (2025) Shapley Value Estimation based on Differential Matrix. Proceedings of the ACM on Management of Data 3:1 (1-28). DOI: 10.1145/3709725. Online publication date: 11-Feb-2025
    • (2025) Modyn: Data-Centric Machine Learning Pipeline Orchestration. Proceedings of the ACM on Management of Data 3:1 (1-30). DOI: 10.1145/3709705. Online publication date: 11-Feb-2025
    • (2024) spade: Synthesizing Data Quality Assertions for Large Language Model Pipelines. Proceedings of the VLDB Endowment 17:12 (4173-4186). DOI: 10.14778/3685800.3685835. Online publication date: 8-Nov-2024
    • (2024) P-Shapley: Shapley Values on Probabilistic Classifiers. Proceedings of the VLDB Endowment 17:7 (1737-1750). DOI: 10.14778/3654621.3654638. Online publication date: 30-May-2024
    • (2024) DLProv: A Data-Centric Support for Deep Learning Workflow Analyses. Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning (77-85). DOI: 10.1145/3650203.3663337. Online publication date: 9-Jun-2024
    • (2024) Automated Provenance-Based Screening of ML Data Preparation Pipelines. Datenbank-Spektrum. DOI: 10.1007/s13222-024-00483-4. Online publication date: 30-Sep-2024
    • (2023) Teaching Blue Elephants the Maths for Machine Learning. Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning (1-4). DOI: 10.1145/3595360.3595852. Online publication date: 18-Jun-2023
