BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark

Authors:

Muhammad Ali Gulzar,

Matteo Interlandi,

Sai Deep Tetali,

Todd Millstein,

Miryung Kim

ICSE '16: Proceedings of the 38th International Conference on Software Engineering

Pages 784 - 795

https://doi.org/10.1145/2884781.2884813

Published: 14 May 2016

Abstract

Developers use cloud computing platforms to process large quantities of data in parallel when developing big data analytics. Debugging the massively parallel computations that run in today's datacenters is time-consuming and error-prone. To address this challenge, we design a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next-generation data-intensive scalable cloud computing platform. This requires rethinking the notion of step-through debugging in a traditional debugger such as gdb, because pausing the entire computation across distributed worker nodes causes significant delay, and naively inspecting millions of records using a watchpoint is too time-consuming for an end user.

First, BigDebug's simulated breakpoints and on-demand watchpoints allow users to selectively examine distributed, intermediate data on the cloud with little overhead. Second, a user can also pinpoint a crash-inducing record and selectively resume relevant sub-computations after a quick fix. Third, a user can determine the root causes of errors (or delays) at the level of individual records through a fine-grained data provenance capability. Our evaluation shows that BigDebug scales to terabytes and its record-level tracing incurs less than 25% overhead on average. It determines crash culprits orders of magnitude more accurately and provides up to 100% time saving compared to the baseline replay debugger. The results show that BigDebug supports debugging at interactive speeds with minimal performance impact.
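The first two primitives can be illustrated with a small, framework-free sketch. It is not BigDebug's API and does not use Spark: the names `DebugLog`, `watch`, `safe_map`, and `parse_age` are hypothetical, and the sample data is made up. The sketch assumes a watchpoint is a guard predicate that copies only matching records aside, and that a crash-inducing record can be captured and skipped instead of aborting the whole job.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, Iterator, List, Tuple


@dataclass
class DebugLog:
    """Hypothetical container for records captured while the pipeline runs."""
    watched: List[Any] = field(default_factory=list)
    culprits: List[Tuple[Any, str]] = field(default_factory=list)


def watch(records: Iterable[Any], guard: Callable[[Any], bool],
          log: DebugLog) -> Iterator[Any]:
    """Pass every record through unchanged; copy only guard matches to the log."""
    for record in records:
        if guard(record):
            log.watched.append(record)
        yield record


def safe_map(records: Iterable[Any], fn: Callable[[Any], Any],
             log: DebugLog) -> Iterator[Any]:
    """Apply fn per record; on failure, log the offending record and continue."""
    for record in records:
        try:
            yield fn(record)
        except Exception as exc:  # broad on purpose: any crash marks a culprit
            log.culprits.append((record, repr(exc)))


def parse_age(line: str) -> int:
    """Toy user-defined function: parse 'name,age' and return the age."""
    _name, age = line.split(",")
    return int(age)


log = DebugLog()
lines = ["ann,31", "bob,forty", "cy,-5", "dee,27"]  # made-up input

# The guard flags suspicious (negative) ages without stopping the pipeline.
ages = list(watch(safe_map(lines, parse_age, log), lambda a: a < 0, log))

print(ages)          # [31, -5, 27]
print(log.watched)   # [-5]
print([record for record, _ in log.culprits])  # ['bob,forty']
```

In this toy version the pipeline finishes despite the malformed record, and the user inspects only the handful of records the guard or the exception handler set aside. A distributed setting would additionally need to ship those records from the workers to the user, which the sketch does not model.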

Index Terms

BigDebug: debugging primitives for interactive big data processing in spark
1. Software and its engineering
  1. Software creation and management
    1. Software development techniques
      1. Error handling and recovery
    2. Software verification and validation
      1. Software defect analysis
        Software testing and debugging

Published In

ICSE '16: Proceedings of the 38th International Conference on Software Engineering

May 2016

1235 pages

ISBN:9781450339001

DOI:10.1145/2884781

General Chair: Laura Dillon (Michigan State University)

Program Chairs: Willem Visser (Stellenbosch University, South Africa), Laurie Williams (North Carolina State University)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

ACM: Association for Computing Machinery
SIGSOFT: ACM Special Interest Group on Software Engineering
IEEE-CS\TCSE: TC on Software Engineering
IEEE-CS\DATC: IEEE Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICSE '16

ICSE '16: 38th International Conference on Software Engineering

May 14 - 22, 2016

Austin, Texas

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Bibliometrics

Total Citations: 64
Total Downloads: 1,286
Downloads (Last 12 months): 105
Downloads (Last 6 weeks): 14

Reflects downloads up to 24 Dec 2024

Cited By

Jindal A, Beedkar K, Singh V, Mohammed J, Singla T, Gupta A, Choudhary K (2024). Reactive Dataflow for Inflight Error Handling in ML Workflows. Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning, pages 51-61. Online publication date: 9-Jun-2024.
https://dl.acm.org/doi/10.1145/3650203.3663333

Huang Y, Wang Z, Li C, Barcelo P, Sanchez-Pi N, Meliou A, Sudarshan S (2024). Demonstration of Udon: Line-by-line Debugging of User-Defined Functions in Data Workflows. Companion of the 2024 International Conference on Management of Data, pages 476-479. Online publication date: 9-Jun-2024.
https://dl.acm.org/doi/10.1145/3626246.3654756

Morán J, Bertolino A, de la Riva C, Tuya J (2024). Automatic Debugging of Design Faults in MapReduce Applications. IEEE Transactions on Software Engineering, 50:4, pages 956-978. Online publication date: Apr-2024.
https://doi.org/10.1109/TSE.2024.3369766

Demirbaga U, Aujla G (2024). RootPath: Root Cause and Critical Path Analysis to Ensure Sustainable and Resilient Consumer-Centric Big Data Processing Under Fault Scenarios. IEEE Transactions on Consumer Electronics, 70:1, pages 1493-1500. Online publication date: Feb-2024.
https://doi.org/10.1109/TCE.2023.3329545

Kovács J, Ligetfalvi B, Lovas R (2024). Automated Debugging Mechanisms for Orchestrated Cloud Infrastructures With Active Control and Global Evaluation. IEEE Access, 12, pages 143193-143214. Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3467228

Huang Y, Wang Z, Li C (2023). Udon: Efficient Debugging of User-Defined Functions in Big Data Systems with Line-by-Line Control. Proceedings of the ACM on Management of Data, 1:4, pages 1-26. Online publication date: 12-Dec-2023.
https://dl.acm.org/doi/10.1145/3626712

Homer M, Beckmann T, Hirschfeld R, Sáenz J, Merino M (2023). Branching Compositional Data Transformations in jq, Visually. Proceedings of the 2nd ACM SIGPLAN International Workshop on Programming Abstractions and Interactive Notations, Tools, and Environments, pages 11-16. Online publication date: 18-Oct-2023.
https://dl.acm.org/doi/10.1145/3623504.3623567

Vu A, Sparka J, de Mecquenem N, Kehrer T, Leser U, Grunske L (2023). Contract-Driven Design of Scientific Data Analysis Workflows. 2023 IEEE 19th International Conference on e-Science (e-Science), pages 1-10. Online publication date: 9-Oct-2023.
https://doi.org/10.1109/e-Science58273.2023.10254898

Kim M (2023). Software Engineering for Data Intensive Scalable Computing and Heterogeneous Computing. 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pages 54-68. Online publication date: 14-May-2023.
https://doi.org/10.1109/ICSE-FoSE59343.2023.00006

Vu A, Tsigkanos C, Quiané-Ruiz J, Markl V, Kehrer T (2023). On Irregularity Localization for Scientific Data Analysis Workflows. Computational Science – ICCS 2023, pages 336-351. Online publication date: 3-Jul-2023.
https://dl.acm.org/doi/10.1007/978-3-031-35995-8_24
