Open access

Vamsa: Automated Provenance Tracking in Data Science Scripts

Authors: Mohammad Hossein Namaki, Avrilia Floratou, Fotis Psallidas, Subru Krishnan, Ashvin Agrawal, Markus Weimer

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 1542 - 1551

https://doi.org/10.1145/3394486.3403205

Published: 20 August 2020

Abstract

Fairness, bias, and explainability of machine learning (ML) models have recently attracted considerable research, driven by the self-evident or regulatory requirements of various ML applications. We make the following observation: all of these approaches require a robust understanding of the relationship between ML models and the data used to train them. In this work, we introduce the ML provenance tracking problem: the fundamental idea is to automatically track which columns in a dataset have been used to derive the features and labels of an ML model. We discuss the challenges of capturing such information in the context of Python, the most common language used by data scientists.
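To make the column-tracking idea concrete, the following is a minimal sketch using Python's standard `ast` module. It is not Vamsa's algorithm, which the abstract does not describe. It only illustrates, on a toy script, what mapping the inputs of a training call back to dataset columns could look like. The file name `heart.csv`, the column names, and the helper names `subscript_columns` and `track_columns` are hypothetical, and the sketch assumes a single top-level `.fit(X, y)` call with literal column names.

```python
import ast

# Toy script to analyze. It is only parsed, never executed.
# File and column names are hypothetical.
SCRIPT = '''
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("heart.csv")
X = df[["age", "chol"]]
y = df["target"]
model = LogisticRegression()
model.fit(X, y)
'''


def subscript_columns(node):
    """Return string column names from df["a"] or df[["a", "b"]], else []."""
    if not isinstance(node, ast.Subscript):
        return []
    key = node.slice
    # Python 3.8 wraps the key in ast.Index; 3.9+ stores it directly.
    if type(key).__name__ == "Index":
        key = key.value
    if isinstance(key, ast.Constant) and isinstance(key.value, str):
        return [key.value]
    if isinstance(key, ast.List):
        return [e.value for e in key.elts
                if isinstance(e, ast.Constant) and isinstance(e.value, str)]
    return []


def track_columns(source):
    """Map the arguments of a top-level `.fit(X, y)` call back to column names."""
    var_columns = {}  # variable name -> columns it was selected from
    result = {"features": [], "labels": []}
    for stmt in ast.parse(source).body:
        if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            cols = subscript_columns(stmt.value)
            if cols:
                var_columns[stmt.targets[0].id] = cols
        elif isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call):
            call = stmt.value
            is_fit = isinstance(call.func, ast.Attribute) and call.func.attr == "fit"
            if is_fit and len(call.args) >= 2:
                names = [a.id if isinstance(a, ast.Name) else None
                         for a in call.args[:2]]
                result["features"] = var_columns.get(names[0], [])
                result["labels"] = var_columns.get(names[1], [])
    return result


print(track_columns(SCRIPT))
# {'features': ['age', 'chol'], 'labels': ['target']}
```

Even this toy case hints at why the problem is hard in Python: column names computed at runtime, aliasing, and library calls with unknown semantics all fall outside what such a simple pass can resolve.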

We then present Vamsa, a modular system that extracts provenance from Python scripts without requiring any changes to the users' code. Using 26K real data science scripts, we verify the effectiveness of Vamsa in terms of coverage and performance. We also evaluate Vamsa's accuracy on a smaller subset of manually labeled data. Our analysis shows that Vamsa's precision and recall range from 90.4% to 99.1% and that its latency is on the order of milliseconds for average-size scripts. Drawing on our experience in deploying ML models in production, we also present an example in which Vamsa helps automatically identify models that are affected by data corruption issues.
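The data corruption use case can be illustrated with a small sketch. It assumes that provenance has already been extracted and stored as a mapping from each model to the dataset columns it depends on. The catalog layout, the model and column names, and the function `affected_models` are all hypothetical and are not taken from the paper.

```python
# Hypothetical provenance catalog: model name -> source dataset and the
# columns its features/labels were derived from.
PROVENANCE = {
    "risk_model": {"dataset": "heart.csv", "columns": {"age", "chol", "target"}},
    "triage_model": {"dataset": "heart.csv", "columns": {"sex", "cp", "target"}},
    "churn_model": {"dataset": "customers.csv", "columns": {"age", "plan"}},
}


def affected_models(provenance, dataset, corrupted_columns):
    """Return names of models that read at least one corrupted column."""
    corrupted = set(corrupted_columns)
    return sorted(
        name for name, record in provenance.items()
        if record["dataset"] == dataset and record["columns"] & corrupted
    )


print(affected_models(PROVENANCE, "heart.csv", ["chol"]))
# ['risk_model']
print(affected_models(PROVENANCE, "heart.csv", ["age", "cp"]))
# ['risk_model', 'triage_model']
```

Note that `churn_model` also uses a column named `age`, but from a different dataset, so it is not reported. Keying provenance by both dataset and column avoids that kind of false positive.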

Index Terms

Vamsa: Automated Provenance Tracking in Data Science Scripts
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Data provenance

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

August 2020

3664 pages

ISBN: 9781450379984

DOI: 10.1145/3394486

General Chairs: Rajesh Gupta (UC San Diego, USA), Yan Liu (USC, USA)
Program Chairs: Mohak Shah (LG Electronics, USA), Suju Rajan (LinkedIn, USA)
Publications Chairs: Jiliang Tang (Michigan State, USA), B. Aditya Prakash (Georgia Tech, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Microsoft

Conference

KDD '20

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

July 6 - 10, 2020

CA, Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics

Total Citations: 28
Total Downloads: 981
Downloads (last 12 months): 325
Downloads (last 6 weeks): 38

Reflects downloads up to 10 Aug 2024
Cited By

Shankar S, Garcia R, Hellerstein J, Parameswaran A. (2024). "We Have No Idea How Models will Behave in Production until Production": How Engineers Operationalize Machine Learning. Proceedings of the ACM on Human-Computer Interaction 8:CSCW1 (1-34). Online publication date: 26-Apr-2024.
https://dl.acm.org/doi/10.1145/3653697
Pina D, Chapman A, Kunstmann L, de Oliveira D, Mattoso M. (2024). DLProv: A Data-Centric Support for Deep Learning Workflow Analyses. Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning (77-85). Online publication date: 9-Jun-2024.
https://dl.acm.org/doi/10.1145/3650203.3663337
Chapman A, Lauro L, Missier P, Torlone R. (2024). Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance. ACM Transactions on Database Systems 49:2 (1-42). Online publication date: 10-Apr-2024.
https://dl.acm.org/doi/10.1145/3644385
Zhao Z, Chen Y, Bangash A, Adams B, Hassan A. (2024). An empirical study of challenges in machine learning asset management. Empirical Software Engineering 29:4. Online publication date: 15-Jun-2024.
https://doi.org/10.1007/s10664-024-10474-4
Idowu S, Osman O, Strüber D, Berger T. (2024). Machine learning experiment management tools: a mixed-methods empirical study. Empirical Software Engineering 29:4. Online publication date: 29-May-2024.
https://doi.org/10.1007/s10664-024-10444-w
Redyuk S, Kaoudi Z, Schelter S, Markl V. (2024). Assisted design of data science pipelines. The VLDB Journal (The International Journal on Very Large Data Bases) 33:4 (1129-1153). Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.1007/s00778-024-00835-2
Drobnjaković F, Subotić P, Urban C. (2024). An Abstract Interpretation-Based Data Leakage Static Analysis. Theoretical Aspects of Software Engineering (109-126). Online publication date: 14-Jul-2024.
https://doi.org/10.1007/978-3-031-64626-3_7
Daga E, Groth P. (2023). Data journeys: Explaining AI workflows through abstraction. Semantic Web (1-27). Online publication date: 15-Jun-2023.
https://doi.org/10.3233/SW-233407
Negrini L, Shabadi G, Urban C, Ferrara P, Hadarean L. (2023). Static Analysis of Data Transformations in Jupyter Notebooks. Proceedings of the 12th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (8-13). Online publication date: 6-Jun-2023.
https://dl.acm.org/doi/10.1145/3589250.3596145
Chen S, Tang N, Fan J, Yan X, Chai C, Li G, Du X. (2023). HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation. Proceedings of the ACM on Management of Data 1:1 (1-26). Online publication date: 30-May-2023.
https://dl.acm.org/doi/10.1145/3588945
