DOI: 10.5555/2486788.2486921

A characteristic study on failures of production distributed data-parallel programs

Published: 18 May 2013

Abstract

SCOPE is adopted by thousands of developers from tens of different product teams in Microsoft Bing for daily web-scale data processing, including index building, search ranking, and advertisement display. A SCOPE job is composed of declarative SQL-like queries and imperative C# user-defined functions (UDFs), which are executed in a pipeline by thousands of machines. Tens of thousands of SCOPE jobs are executed on Microsoft clusters per day, and some of them fail after a long execution time, wasting tremendous resources; reducing SCOPE failures would therefore save significant resources. This paper presents a comprehensive characteristic study of 200 SCOPE failures/fixes and 50 SCOPE failures with debugging statistics from Microsoft Bing, investigating not only major failure types, failure sources, and fixes, but also current debugging practice. Our major findings are: (1) most failures (84.5%) are caused by defects in data processing rather than defects in code logic; (2) table-level failures (22.5%) are mainly caused by programmers' mistakes and frequent data-schema changes, while row-level failures (62%) are mainly caused by exceptional data; (3) 93.0% of fixes do not change the data processing logic; (4) 8.0% of failures have root causes that are not at the failure-exposing stage, making current debugging practice insufficient in such cases. Our study results provide valuable guidelines for future development of data-parallel programs. We believe that these guidelines are not limited to SCOPE, but can also be generalized to other similar data-parallel platforms.
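To make the job structure described above concrete, here is a minimal, self-contained C# sketch (a hypothetical illustration, not the actual SCOPE API): the in-memory string arrays stand in for rows produced by the declarative, SQL-like extraction step, and ProcessRows plays the role of an imperative row-processing UDF. The names ClickRow, Pipeline, and ProcessRows are invented for this example.

```csharp
// Hypothetical sketch of one SCOPE-like pipeline stage (NOT the real SCOPE API).
using System;
using System.Collections.Generic;

class ClickRow
{
    public string Url = "";
    public int Clicks;
}

static class Pipeline
{
    // Imperative "UDF" stage: parse each extracted row and keep positive click counts.
    // An unguarded int.Parse on a malformed field would throw at run time and fail the
    // stage -- the row-level, exceptional-data failure mode the paper studies.
    static IEnumerable<ClickRow> ProcessRows(IEnumerable<string[]> rawRows)
    {
        foreach (var fields in rawRows)
        {
            var row = new ClickRow { Url = fields[0], Clicks = int.Parse(fields[1]) };
            if (row.Clicks > 0)
                yield return row;
        }
    }

    static void Main()
    {
        // Stand-in for the declarative, SQL-like extraction/selection step feeding the UDF.
        var extracted = new[]
        {
            new[] { "http://a.example", "3" },
            new[] { "http://b.example", "0" },
        };

        foreach (var row in ProcessRows(extracted))
            Console.WriteLine(row.Url + "\t" + row.Clicks);
    }
}
```

In a real cluster this UDF would run on thousands of machines over web-scale data, which is why a single exceptional row reaching the unguarded parse can waste hours of execution before the job fails.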




Reviews

A. Squassabia

Soft failure is the unsuccessful premature termination of a data-parallel program, as opposed, for instance, to any kind of hardware failure. This paper, the first of its kind, undertakes a systematic evaluation of soft failures in a big data system. The study examines a random sample of 250 soft failures and provides a classification of root causes, as well as some insight into debugging and fixes.

This work is interesting for at least two reasons: it establishes a peer-reviewed benchmark on soft failures that is valuable for comparison with internal investigations of similar scope, and it provides material from the trenches that can serve as initial criteria for validating coding and software life cycle management practices in a rising discipline (big data) with much confusion and no established history.

For instance, a programmer's error in misspelling a column name was one of the prominent sources of production soft failures. This may be surprising from the perspective of a traditional relational database management system (RDBMS) environment, where production schemas are static and column counts are relatively small. In a big data system, however, there may be thousands of column names and the schema constraints are more dynamic. In combination with undocumented schema churn, the chain of events leading to this type of programmer's error becomes easier to understand; once understood, one can take safeguards against recurrence.

There are no groundbreaking results or findings from this work, but its novelty and incremental contributions are surely welcome. Online Computing Reviews Service
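The misspelled-column failure the reviewer highlights is easy to reproduce in miniature. The following hypothetical C# sketch (again unrelated to the real SCOPE runtime; the column names and SchemaDemo class are invented) shows why a dynamic, dictionary-like schema pushes the error to run time, and what a cheap up-front safeguard looks like.

```csharp
// Hypothetical sketch of a misspelled-column failure against a dynamic schema.
using System;
using System.Collections.Generic;

class SchemaDemo
{
    static void Main()
    {
        // A dynamic, dictionary-like row: column names are only checked at run time,
        // unlike a static RDBMS schema where the typo would fail at compile/plan time.
        var row = new Dictionary<string, string>
        {
            ["Url"] = "http://a.example",
            ["ClickCount"] = "3",
        };

        const string column = "ClickCnt"; // misspelled on purpose

        // Defensive lookup: report the bad column name immediately instead of letting
        // a KeyNotFoundException kill the job deep into its execution.
        if (!row.TryGetValue(column, out var value))
        {
            Console.Error.WriteLine(
                "Unknown column '" + column + "'; schema has: " + string.Join(", ", row.Keys));
            return;
        }

        Console.WriteLine(value);
    }
}
```

Validating referenced column names against the current schema before launching the job is exactly the kind of safeguard against recurrence that the review alludes to.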




Information

Published In

ICSE '13: Proceedings of the 2013 International Conference on Software Engineering
May 2013
1561 pages
ISBN:9781467330763

Publisher

IEEE Press


Qualifiers

  • Research-article

Conference

ICSE '13

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%


Cited By

  • (2023) Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering Perspective. ACM Transactions on Software Engineering and Methodology 32(6), 1–26. DOI: 10.1145/3597204. Online publication date: 29-Sep-2023.
  • (2023) Partial Network Partitioning. ACM Transactions on Computer Systems 41(1–4), 1–34. DOI: 10.1145/3576192. Online publication date: 18-Dec-2023.
  • (2020) Toward a generic fault tolerance technique for partial network partitioning. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, 351–368. DOI: 10.5555/3488766.3488786. Online publication date: 4-Nov-2020.
  • (2020) An empirical study on program failures of deep learning jobs. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 1159–1170. DOI: 10.1145/3377811.3380362. Online publication date: 27-Jun-2020.
  • (2019) Grapple. Proceedings of the Fourteenth EuroSys Conference 2019, 1–17. DOI: 10.1145/3302424.3303972. Online publication date: 25-Mar-2019.
  • (2018) An analysis of network-partitioning failures in cloud systems. Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, 51–68. DOI: 10.5555/3291168.3291173. Online publication date: 8-Oct-2018.
  • (2016) Why Does the Cloud Stop Computing? Proceedings of the Seventh ACM Symposium on Cloud Computing, 1–16. DOI: 10.1145/2987550.2987583. Online publication date: 5-Oct-2016.
  • (2015) An empirical study on quality issues of production big data platform. Proceedings of the 37th International Conference on Software Engineering - Volume 2, 17–26. DOI: 10.5555/2819009.2819014. Online publication date: 16-May-2015.
  • (2015) Testing data transformations in MapReduce programs. Proceedings of the 6th International Workshop on Automating Test Case Design, Selection and Evaluation, 20–25. DOI: 10.1145/2804322.2804326. Online publication date: 30-Aug-2015.
  • (2014) SKI. Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, 415–431. DOI: 10.5555/2685048.2685081. Online publication date: 6-Oct-2014.
