DOI: 10.5555/2486788.2486921

A characteristic study on failures of production distributed data-parallel programs

Published: 18 May 2013

Abstract

SCOPE is adopted by thousands of developers from tens of different product teams in Microsoft Bing for daily web-scale data processing, including index building, search ranking, and advertisement display. A SCOPE job is composed of declarative SQL-like queries and imperative C# user-defined functions (UDFs), which are executed in a pipeline by thousands of machines. Tens of thousands of SCOPE jobs are executed on Microsoft clusters per day, and some of them fail after a long execution time, wasting tremendous resources; reducing SCOPE failures would therefore save significant resources. This paper presents a comprehensive characteristic study of 200 SCOPE failures/fixes and 50 SCOPE failures with debugging statistics from Microsoft Bing, investigating not only major failure types, failure sources, and fixes, but also current debugging practice. Our major findings are: (1) most failures (84.5%) are caused by defects in data processing rather than defects in code logic; (2) table-level failures (22.5%) are mainly caused by programmers' mistakes and frequent data-schema changes, while row-level failures (62%) are mainly caused by exceptional data; (3) 93.0% of fixes do not change the data processing logic; (4) 8.0% of failures have root causes that are not at the failure-exposing stage, making current debugging practice insufficient in such cases. Our study results provide valuable guidelines for future development of data-parallel programs. We believe that these guidelines are not limited to SCOPE, but can also be generalized to other similar data-parallel platforms.
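To make the job structure described above concrete, here is a minimal, self-contained C# sketch (a hypothetical illustration, not the actual SCOPE API): the in-memory string arrays stand in for rows produced by the declarative, SQL-like extraction step, and ProcessRows plays the role of an imperative row-processing UDF. The names ClickRow, Pipeline, and ProcessRows are invented for this example.

```csharp
// Hypothetical sketch of one SCOPE-like pipeline stage (NOT the real SCOPE API).
using System;
using System.Collections.Generic;

class ClickRow
{
    public string Url = "";
    public int Clicks;
}

static class Pipeline
{
    // Imperative "UDF" stage: parse each extracted row and keep positive click counts.
    // An unguarded int.Parse on a malformed field would throw at run time and fail the
    // stage -- the row-level, exceptional-data failure mode the paper studies.
    static IEnumerable<ClickRow> ProcessRows(IEnumerable<string[]> rawRows)
    {
        foreach (var fields in rawRows)
        {
            var row = new ClickRow { Url = fields[0], Clicks = int.Parse(fields[1]) };
            if (row.Clicks > 0)
                yield return row;
        }
    }

    static void Main()
    {
        // Stand-in for the declarative, SQL-like extraction/selection step feeding the UDF.
        var extracted = new[]
        {
            new[] { "http://a.example", "3" },
            new[] { "http://b.example", "0" },
        };

        foreach (var row in ProcessRows(extracted))
            Console.WriteLine(row.Url + "\t" + row.Clicks);
    }
}
```

In a real cluster this UDF would run on thousands of machines over web-scale data, which is why a single exceptional row reaching the unguarded parse can waste hours of execution before the job fails.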




Reviews

A. Squassabia

Soft failure is the unsuccessful premature termination of a data-parallel program, as opposed, for instance, to any kind of hardware failure. This paper, the first of its kind, undertakes a systematic evaluation of soft failures in a big data system. The study examines a random sample of 250 soft failures and provides a classification of root causes, as well as some insight into debugging and fixes.

This work is interesting for at least two reasons: it establishes a peer-reviewed benchmark on soft failures that is valuable for comparison with internal investigations of similar scope, and it provides material from the trenches that can serve as initial criteria for validating coding and software life cycle management practices in a rising discipline (big data) with much confusion and no established history.

For instance, a programmer's error in misspelling a column name was one of the prominent sources of production soft failures. This may be surprising from the perspective of a traditional relational database management system (RDBMS) environment, where production schemas are static and column counts are relatively small. In a big data system, however, there may be thousands of column names and the schema constraints are more dynamic. In combination with undocumented schema churn, the chain of events leading to this type of programmer's error becomes easier to understand; once understood, one can take safeguards against recurrence.

There are no groundbreaking results or findings from this work, but its novelty and incremental contributions are surely welcome. Online Computing Reviews Service
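The misspelled-column failure the reviewer highlights is easy to reproduce in miniature. The following hypothetical C# sketch (again unrelated to the real SCOPE runtime; the column names and SchemaDemo class are invented) shows why a dynamic, dictionary-like schema pushes the error to run time, and what a cheap up-front safeguard looks like.

```csharp
// Hypothetical sketch of a misspelled-column failure against a dynamic schema.
using System;
using System.Collections.Generic;

class SchemaDemo
{
    static void Main()
    {
        // A dynamic, dictionary-like row: column names are only checked at run time,
        // unlike a static RDBMS schema where the typo would fail at compile/plan time.
        var row = new Dictionary<string, string>
        {
            ["Url"] = "http://a.example",
            ["ClickCount"] = "3",
        };

        const string column = "ClickCnt"; // misspelled on purpose

        // Defensive lookup: report the bad column name immediately instead of letting
        // a KeyNotFoundException kill the job deep into its execution.
        if (!row.TryGetValue(column, out var value))
        {
            Console.Error.WriteLine(
                "Unknown column '" + column + "'; schema has: " + string.Join(", ", row.Keys));
            return;
        }

        Console.WriteLine(value);
    }
}
```

Validating referenced column names against the current schema before launching the job is exactly the kind of safeguard against recurrence that the review alludes to.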




Information

Published In

ICSE '13: Proceedings of the 2013 International Conference on Software Engineering
May 2013
1561 pages
ISBN:9781467330763

Publisher

IEEE Press


Qualifiers

  • Research-article

Conference

ICSE '13

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%


Cited By

  • (2023) Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering Perspective. ACM Transactions on Software Engineering and Methodology 32(6), 1–26. DOI: 10.1145/3597204. Online publication date: 29-Sep-2023.
  • (2023) Partial Network Partitioning. ACM Transactions on Computer Systems 41(1–4), 1–34. DOI: 10.1145/3576192. Online publication date: 18-Dec-2023.
  • (2020) Toward a generic fault tolerance technique for partial network partitioning. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, 351–368. DOI: 10.5555/3488766.3488786. Online publication date: 4-Nov-2020.
  • (2020) An empirical study on program failures of deep learning jobs. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 1159–1170. DOI: 10.1145/3377811.3380362. Online publication date: 27-Jun-2020.
  • (2019) Grapple. Proceedings of the Fourteenth EuroSys Conference 2019, 1–17. DOI: 10.1145/3302424.3303972. Online publication date: 25-Mar-2019.
  • (2018) An analysis of network-partitioning failures in cloud systems. Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, 51–68. DOI: 10.5555/3291168.3291173. Online publication date: 8-Oct-2018.
  • (2016) Why Does the Cloud Stop Computing? Proceedings of the Seventh ACM Symposium on Cloud Computing, 1–16. DOI: 10.1145/2987550.2987583. Online publication date: 5-Oct-2016.
  • (2015) An empirical study on quality issues of production big data platform. Proceedings of the 37th International Conference on Software Engineering - Volume 2, 17–26. DOI: 10.5555/2819009.2819014. Online publication date: 16-May-2015.
  • (2015) Testing data transformations in MapReduce programs. Proceedings of the 6th International Workshop on Automating Test Case Design, Selection and Evaluation, 20–25. DOI: 10.1145/2804322.2804326. Online publication date: 30-Aug-2015.
  • (2014) SKI. Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, 415–431. DOI: 10.5555/2685048.2685081. Online publication date: 6-Oct-2014.
