Research article. DOI: 10.1145/2523616.2523619

Scalable lineage capture for debugging DISC analytics

Published: 01 October 2013

Abstract

A fundamental challenge for big-data analytics is how to efficiently tune and debug multi-step dataflows. This paper presents Newt, a scalable architecture for capturing and using record-level data lineage to discover and resolve errors in analytics. Newt's flexible instrumentation allows system developers to collect this fine-grained lineage from a range of data-intensive scalable computing (DISC) architectures, actively recording the flow of data through multi-step, user-defined transformations. Newt pairs this API with a scale-out, fault-tolerant lineage store and query engine.
We find that while active collection can be expensive, it incurs modest runtime overheads for real-world analytics (<36%) and enables novel lineage-based debugging techniques. For instance, Newt can efficiently recreate errors (crashes or bad outputs) or remove input data from the dataflow to enable data cleaning strategies. Additionally, Newt's active lineage collection allows retrospective analyses of a dataflow's behavior, such as identifying anomalous processing steps. As case studies, we instrument two DISC systems, Hadoop and Hyracks, with less than 105 lines of additional code for each. Finally, we use Newt to systematically clean input data to a Hadoop-based de novo genome assembler, improving the quality of the output assembly.
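
To make the idea of record-level lineage concrete, here is a minimal sketch in Python. It is not Newt's API: the `LineageStore` class, its `record` and `backward_trace` methods, the record-id scheme, and the word-count dataflow are all hypothetical names chosen for illustration. The sketch records an input-to-output association for every record passing through a two-step dataflow, then traces one output back to the input records that produced it, which is the kind of query that could support replaying an error or removing suspect inputs.

```python
from collections import defaultdict


class LineageStore:
    """Toy in-memory lineage store (hypothetical; not Newt's real API)."""

    def __init__(self):
        # output record id -> set of input record ids that produced it
        self.parents = defaultdict(set)
        # output record id -> name of the step that produced it
        self.step_of = {}

    def record(self, step, input_id, output_id):
        """Record that `input_id` contributed to `output_id` in `step`."""
        self.parents[output_id].add(input_id)
        self.step_of[output_id] = step

    def backward_trace(self, output_id):
        """Return the ids of source records (no recorded parents)
        that transitively contributed to `output_id`."""
        sources, seen, stack = set(), set(), [output_id]
        while stack:
            rid = stack.pop()
            if rid in seen:
                continue
            seen.add(rid)
            inputs = self.parents.get(rid)
            if inputs:
                stack.extend(inputs)
            else:
                sources.add(rid)
        return sources


def run_wordcount(lines, store):
    """A two-step dataflow (map, then reduce) instrumented for lineage."""
    # Step 1 (map): every input line emits one record per word.
    words = []
    for i, line in enumerate(lines):
        for j, word in enumerate(line.split()):
            word_id = f"w{i}.{j}"
            store.record("map", f"line{i}", word_id)
            words.append((word_id, word))
    # Step 2 (reduce): word records are grouped into per-word counts.
    counts = {}
    for word_id, word in words:
        store.record("reduce", word_id, f"count:{word}")
        counts[word] = counts.get(word, 0) + 1
    return counts


store = LineageStore()
lines = ["a b", "b c", "a a"]
counts = run_wordcount(lines, store)

# Which input lines contributed to the output record for "a"?
culprits = store.backward_trace("count:a")
```

In this toy run, tracing `count:a` backward yields the ids of the first and third input lines; dropping those lines and rerunning the dataflow would be the analogue of a lineage-driven data cleaning step. A real system would need a scale-out, fault-tolerant store rather than an in-process dictionary.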


Published In

SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing
October 2013
427 pages
ISBN:9781450324281
DOI:10.1145/2523616

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SOCC '13: ACM Symposium on Cloud Computing
October 1-3, 2013
Santa Clara, California

Acceptance Rates

SOCC '13 Paper Acceptance Rate 23 of 114 submissions, 20%;
Overall Acceptance Rate 169 of 722 submissions, 23%
