DOI: 10.1145/2884781.2884813
research-article
Public Access

BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark

Published: 14 May 2016

Abstract

Developers use cloud computing platforms to process large quantities of data in parallel when developing big data analytics. Debugging the massively parallel computations that run in today's datacenters is time-consuming and error-prone. To address this challenge, we design a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next-generation data-intensive scalable cloud computing platform. This requires rethinking the notion of step-through debugging in a traditional debugger such as gdb, because pausing the entire computation across distributed worker nodes causes significant delay, and naively inspecting millions of records using a watchpoint is too time-consuming for an end user.
First, BigDebug's simulated breakpoints and on-demand watchpoints allow users to selectively examine distributed, intermediate data on the cloud with little overhead. Second, a user can pinpoint a crash-inducing record and selectively resume relevant sub-computations after a quick fix. Third, a user can determine the root causes of errors (or delays) at the level of individual records through a fine-grained data provenance capability. Our evaluation shows that BigDebug scales to terabytes and that its record-level tracing incurs less than 25% overhead on average. It determines crash culprits orders of magnitude more accurately and provides up to 100% time savings compared to the baseline replay debugger. The results show that BigDebug supports debugging at interactive speeds with minimal performance impact.
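
To make two of these primitives concrete, the following is a minimal sketch in plain Python, not BigDebug's actual API and not Spark code. The helper names `map_with_culprits` and `watch` are hypothetical, and a simple list stands in for one partition of distributed data. The first helper illustrates the idea of capturing crash-inducing records instead of aborting the whole computation; the second illustrates a guarded watchpoint that returns only the few records matching a predicate rather than millions.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_with_culprits(
    records: Iterable[T], fn: Callable[[T], U]
) -> Tuple[List[U], List[Tuple[T, Exception]]]:
    """Apply fn to every record; collect crash-inducing records instead of aborting."""
    results: List[U] = []
    culprits: List[Tuple[T, Exception]] = []
    for record in records:
        try:
            results.append(fn(record))
        except Exception as exc:
            # Keep the failing record together with its error for later inspection.
            culprits.append((record, exc))
    return results, culprits


def watch(records: Iterable[T], guard: Callable[[T], bool], limit: int = 10) -> List[T]:
    """Return at most `limit` records that satisfy the guard predicate."""
    matched: List[T] = []
    for record in records:
        if guard(record):
            matched.append(record)
            if len(matched) >= limit:
                break
    return matched


# Hypothetical partition of raw input: two records cannot be parsed as integers.
partition = ["12", "7", "oops", "30", ""]
parsed, culprits = map_with_culprits(partition, int)

# Guarded watchpoint over the intermediate data: only values above 10 are shown.
suspicious = watch(parsed, lambda value: value > 10)
```

In this sketch the well-formed records continue through the pipeline while the two malformed ones are set aside with their exceptions, which mirrors (under the stated assumptions) the abstract's notion of pinpointing a crash-inducing record and resuming the rest of the computation.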

Cited By

  • (2024) Reactive Dataflow for Inflight Error Handling in ML Workflows. Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning, pages 51-61. DOI: 10.1145/3650203.3663333. Online publication date: 9-Jun-2024
  • (2024) Demonstration of Udon: Line-by-line Debugging of User-Defined Functions in Data Workflows. Companion of the 2024 International Conference on Management of Data, pages 476-479. DOI: 10.1145/3626246.3654756. Online publication date: 9-Jun-2024
  • (2024) Automatic Debugging of Design Faults in MapReduce Applications. IEEE Transactions on Software Engineering, 50:4, pages 956-978. DOI: 10.1109/TSE.2024.3369766. Online publication date: Apr-2024

Published In

ICSE '16: Proceedings of the 38th International Conference on Software Engineering
May 2016
1235 pages
ISBN: 9781450339001
DOI: 10.1145/2884781
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data analytics
  2. data-intensive scalable computing (DISC)
  3. debugging
  4. fault localization and recovery
  5. interactive tools

Qualifiers

  • Research-article

Conference

ICSE '16

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 105
  • Downloads (last 6 weeks): 14
Reflects downloads up to 24 Dec 2024

Citations

  • (2024) RootPath: Root Cause and Critical Path Analysis to Ensure Sustainable and Resilient Consumer-Centric Big Data Processing Under Fault Scenarios. IEEE Transactions on Consumer Electronics, 70:1, pages 1493-1500. DOI: 10.1109/TCE.2023.3329545. Online publication date: Feb-2024
  • (2024) Automated Debugging Mechanisms for Orchestrated Cloud Infrastructures With Active Control and Global Evaluation. IEEE Access, 12, pages 143193-143214. DOI: 10.1109/ACCESS.2024.3467228. Online publication date: 2024
  • (2023) Udon: Efficient Debugging of User-Defined Functions in Big Data Systems with Line-by-Line Control. Proceedings of the ACM on Management of Data, 1:4, pages 1-26. DOI: 10.1145/3626712. Online publication date: 12-Dec-2023
  • (2023) Branching Compositional Data Transformations in jq, Visually. Proceedings of the 2nd ACM SIGPLAN International Workshop on Programming Abstractions and Interactive Notations, Tools, and Environments, pages 11-16. DOI: 10.1145/3623504.3623567. Online publication date: 18-Oct-2023
  • (2023) Contract-Driven Design of Scientific Data Analysis Workflows. 2023 IEEE 19th International Conference on e-Science (e-Science), pages 1-10. DOI: 10.1109/e-Science58273.2023.10254898. Online publication date: 9-Oct-2023
  • (2023) Software Engineering for Data Intensive Scalable Computing and Heterogeneous Computing. 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pages 54-68. DOI: 10.1109/ICSE-FoSE59343.2023.00006. Online publication date: 14-May-2023
  • (2023) On Irregularity Localization for Scientific Data Analysis Workflows. Computational Science – ICCS 2023, pages 336-351. DOI: 10.1007/978-3-031-35995-8_24. Online publication date: 3-Jul-2023