DOI: 10.1145/3530800.3534535
short-paper
Open access

Runtime provenance refinement for notebooks

Published: 12 June 2022

Abstract

Computational notebooks (e.g., Jupyter or Apache Zeppelin) have become a popular choice for data exploration, preparation, and ETL. Notebooks are better suited to interactive development of data pipelines than classical workflow systems because they provide immediate feedback on the results of a computation and do not require the full computation to be specified upfront. However, the notebook model suffers from poor reproducibility, does not support automatic incremental re-evaluation of code when the code or inputs change, and does not allow for parallel execution of cells, all symptoms of its kernel-based evaluation strategy. We propose a new "workbook" model that combines the usability of notebooks with the provenance and parallel execution capabilities of workflow systems. This is made possible through a novel approach that refines a static approximation of provenance for Python code at runtime and a scheduler that dynamically adapts the execution order of cells based on data dependencies detected or refuted at runtime. We demonstrate the feasibility of this approach using a prototype implementation in our notebook engine Vizier.
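
To make the idea of refining a static provenance approximation at runtime more concrete, here is a minimal Python sketch. It is not Vizier's implementation; every name in it (static_reads_writes, TrackingScope, run_cell, producers_of) is hypothetical, and it assumes that cells consist only of simple top-level statements. The static pass over-approximates which variables a cell may read or write by walking its syntax tree. The runtime pass executes the cell against a recording scope and observes which variables were actually touched.

```python
import ast


def static_reads_writes(source):
    """Over-approximate the names a cell may read or write, from syntax alone."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # Store or Del
                writes.add(node.id)
    return reads, writes


class TrackingScope(dict):
    """Hypothetical local scope that records which names are actually used."""

    def __init__(self, *args, **kwargs):
        self.reads, self.writes = set(), set()
        super().__init__(*args, **kwargs)

    def __getitem__(self, name):
        # A missing name raises KeyError, so lookup falls back to builtins.
        value = super().__getitem__(name)
        self.reads.add(name)
        return value

    def __setitem__(self, name, value):
        self.writes.add(name)
        super().__setitem__(name, value)


def run_cell(source, state):
    """Execute one cell against shared state; return the observed reads/writes."""
    scope = TrackingScope(state)
    exec(compile(source, "<cell>", "exec"), {}, scope)
    reads, writes = set(scope.reads), set(scope.writes)
    state.update(scope)
    return reads, writes


def producers_of(reads, write_sets):
    """Cells whose writes overlap the given read set."""
    return {name for name, writes in write_sets.items() if writes & reads}


# Assumed example workbook: cell C syntactically mentions both x and y.
cells = {"A": "x = 1", "B": "y = 2", "C": "z = x if x > 0 else y"}
state = {}
static = {name: static_reads_writes(src) for name, src in cells.items()}
actual = {name: run_cell(src, state) for name, src in cells.items()}

upstream_writes = {name: static[name][1] for name in ("A", "B")}
static_deps = producers_of(static["C"][0], upstream_writes)
runtime_deps = producers_of(actual["C"][0], upstream_writes)

print(sorted(static_deps))   # ['A', 'B']
print(sorted(runtime_deps))  # ['A']
```

In this toy workbook the static pass reports that cell C may depend on both A and B, while the observed execution refutes the dependency on B because the branch that reads y is never taken. A scheduler in the spirit of the abstract could use that refined information to avoid re-running C when only B changes, or to run independent cells in parallel.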


Published In

TaPP '22: Proceedings of the 14th International Workshop on the Theory and Practice of Provenance
June 2022
67 pages
ISBN: 9781450393492
DOI: 10.1145/3530800
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '22

Acceptance Rates

TaPP '22 Paper Acceptance Rate 10 of 17 submissions, 59%;
Overall Acceptance Rate 10 of 17 submissions, 59%
