DOI: 10.1145/3394486.3403205
research-article
Open access

Vamsa: Automated Provenance Tracking in Data Science Scripts

Published: 20 August 2020

Abstract

    Fairness, bias, and explainability of machine learning (ML) models are active research areas, driven by the self-evident or regulatory requirements of various ML applications. We observe that all of these approaches require a robust understanding of the relationship between ML models and the data used to train them. In this work, we introduce the ML provenance tracking problem: the fundamental idea is to automatically track which columns in a dataset have been used to derive the features/labels of an ML model. We discuss the challenges in capturing such information in the context of Python, the most common language used by data scientists.
    We then present Vamsa, a modular system that extracts provenance from Python scripts without requiring any changes to the users' code. Using 26K real data science scripts, we verify the effectiveness of Vamsa in terms of coverage and performance. We also evaluate Vamsa's accuracy on a smaller subset of manually labeled data. Our analysis shows that Vamsa's precision and recall range from 90.4% to 99.1% and that its latency is on the order of milliseconds for average-size scripts. Drawing on our experience in deploying ML models in production, we also present an example in which Vamsa helps automatically identify models that are affected by data corruption issues.
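
    To make the problem statement concrete, the following is a minimal sketch of column-level provenance extraction by static analysis. It is not Vamsa's algorithm, which the abstract does not describe; it only illustrates the idea of reading a script without running or modifying it. The helper names (`columns_in`, `columns_by_variable`), the sample script, the file name `heart.csv`, and the assumption that the dataset lives in a variable called `df` are all hypothetical. Python 3.9 or later is assumed.

```python
import ast


def columns_in(expr: ast.AST, frame_name: str) -> set[str]:
    """Return string column names used to subscript `frame_name` inside `expr`."""
    found: set[str] = set()
    for node in ast.walk(expr):
        # Only consider expressions of the form frame_name[...].
        if not (isinstance(node, ast.Subscript)
                and isinstance(node.value, ast.Name)
                and node.value.id == frame_name):
            continue
        index = node.slice  # Python 3.9+: the index expression itself
        # df["a"] is one constant; df[["a", "b"]] is a list of constants.
        items = index.elts if isinstance(index, (ast.List, ast.Tuple)) else [index]
        for item in items:
            if isinstance(item, ast.Constant) and isinstance(item.value, str):
                found.add(item.value)
    return found


def columns_by_variable(source: str, frame_name: str = "df") -> dict[str, set[str]]:
    """Map each top-level assigned variable to the columns it was derived from."""
    result: dict[str, set[str]] = {}
    for stmt in ast.parse(source).body:
        if not isinstance(stmt, ast.Assign):
            continue
        cols = columns_in(stmt.value, frame_name)
        if not cols:
            continue
        for target in stmt.targets:
            if isinstance(target, ast.Name):
                result.setdefault(target.id, set()).update(cols)
    return result


# Hypothetical user script; it is parsed as text and never executed.
script = '''
import pandas as pd
df = pd.read_csv("heart.csv")
X = df[["age", "chol"]]
y = df["target"]
'''

provenance = columns_by_variable(script)
for name in sorted(provenance):
    print(name, sorted(provenance[name]))
```

    A sketch like this handles only direct, literal column selections on one known variable. Aliased data frames, columns selected by position, and columns passed through library calls would need the kind of deeper analysis the abstract refers to as challenges.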

    Published In

    KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
    August 2020
    3664 pages
    ISBN:9781450379984
    DOI:10.1145/3394486

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data science
    2. machine learning
    3. provenance

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Cited By

    • (2024) "We Have No Idea How Models will Behave in Production until Production": How Engineers Operationalize Machine Learning. Proceedings of the ACM on Human-Computer Interaction 8:CSCW1 (1-34). DOI: 10.1145/3653697. Online publication date: 26-Apr-2024
    • (2024) DLProv: A Data-Centric Support for Deep Learning Workflow Analyses. Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning (77-85). DOI: 10.1145/3650203.3663337. Online publication date: 9-Jun-2024
    • (2024) Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance. ACM Transactions on Database Systems 49:2 (1-42). DOI: 10.1145/3644385. Online publication date: 10-Apr-2024
    • (2024) An empirical study of challenges in machine learning asset management. Empirical Software Engineering 29:4. DOI: 10.1007/s10664-024-10474-4. Online publication date: 15-Jun-2024
    • (2024) Machine learning experiment management tools: a mixed-methods empirical study. Empirical Software Engineering 29:4. DOI: 10.1007/s10664-024-10444-w. Online publication date: 29-May-2024
    • (2024) Assisted design of data science pipelines. The VLDB Journal 33:4 (1129-1153). DOI: 10.1007/s00778-024-00835-2. Online publication date: 1-Jul-2024
    • (2024) An Abstract Interpretation-Based Data Leakage Static Analysis. Theoretical Aspects of Software Engineering (109-126). DOI: 10.1007/978-3-031-64626-3_7. Online publication date: 14-Jul-2024
    • (2023) Data journeys: Explaining AI workflows through abstraction. Semantic Web (1-27). DOI: 10.3233/SW-233407. Online publication date: 15-Jun-2023
    • (2023) Static Analysis of Data Transformations in Jupyter Notebooks. Proceedings of the 12th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (8-13). DOI: 10.1145/3589250.3596145. Online publication date: 6-Jun-2023
    • (2023) HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation. Proceedings of the ACM on Management of Data 1:1 (1-26). DOI: 10.1145/3588945. Online publication date: 30-May-2023
