Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Public Access

TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems

Published: 25 March 2016 Publication History

Abstract

We present TaxDC, the largest and most comprehensive taxonomy of non-deterministic concurrency bugs in distributed systems. We study 104 distributed concurrency (DC) bugs from four widely-deployed cloud-scale datacenter distributed systems, Cassandra, Hadoop MapReduce, HBase and ZooKeeper. We study DC-bug characteristics along several axes of analysis such as the triggering timing condition and input preconditions, error and failure symptoms, and fix strategies, collectively stored as 2,083 classification labels in TaxDC database. We discuss how our study can open up many new research directions in combating DC bugs.

References

[1]
http://www.freepastry.org/.
[2]
http://ucare.cs.uchicago.edu/projects/cbs/.
[3]
Apache Cassandra. http://cassandra.apache.org.
[4]
Apache Hadoop. http://hadoop.apache.org.
[5]
Apache Hadoop NextGen MapReduce (YARN).hrefhttps://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.htmlhttps://hadoop.apache.org/docs/current/hrefhttps://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.htmlhadoop-yarn/hadoop-yarn-site/YARN.html.
[6]
Apache HBase. http://hbase.apache.org.
[7]
Apache ZooKeeper. http://zookeeper.apache.org.
[8]
Michael D. Bond, Katherine E. Coons, and Kathryn S. McKinley.textscPacer: Proportional Detection of Data Races. In PLDI, 2010.
[9]
Sebastian Burckhardt, Pravesh Kothari, Madanlal Musuvathi, and Santosh Nagarakatte. A Randomized Scheduler with Probabilistic Guarantees of Finding Bugs. In ASPLOS, 2010.
[10]
Mike Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI '06.
[11]
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. Bigtable: A Distributed Storage System for Structured Data. In OSDI '06.
[12]
Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson Engler. An Empirical Study of Operating System Errors. In SOSP '01.
[13]
Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. The Mystery Machine: End-to-end performance analysis of large-scale Internet services. In OSDI '14.
[14]
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04.
[15]
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's Highly Available Key-value Store. In SOSP '07.
[16]
Ankush Desai, Vivek Gupta, Ethan Jackson, Shaz Qadeer, Sriram Rajamani, and Damien Zufferey. P: Safe Asynchronous Event-Driven Programming. In PLDI '13.
[17]
Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi. Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems. In SoCC '13.
[18]
Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code. In SOSP, 2001.
[19]
Michael D. Ernst, Adam Czeisler, William G. Griswold, and David Notkin. Quickly Detecting Relevant Program Invariants. In ICSE, 2000.
[20]
Cormac Flanagan and Stephen N. Freund. FastTrack: Efficient and Precise Dynamic Race Detection. In PLDI, 2009.
[21]
Pedro Fonseca, Cheng Li, Vishal Singhal, and Rodrigo Rodrigues. A Study of the Internal and External Effects of Concurrency Bugs. In DSN, 2010.
[22]
Dennis Geels, Gautam Altekar, Petros Maniatis, Timothy Roscoe, and Ion Stoica. Friday: Global Comprehension for Distributed Replay. In NSDI '07.
[23]
Dennis Geels, Gautam Altekar, Scott Shenker, and Ion Stoica. Replay Debugging for Distributed Applications. In USENIX ATC '06.
[24]
Haryadi S. Gunawi, Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Koushik Sen, and Dhruba Borthakur. Fate and Destini: A Framework for Cloud Recovery Testing. In NSDI '11.
[25]
Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. What Bugs Live in the Cloud? A Study of 3000
[26]
Issues in Cloud Systems. In SoCC '14.
[27]
Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Junfeng Yang, and Lintao Zhang. Practical Software Model Checking via Dynamic Interface Reduction. In SOSP '11.
[28]
Mingzhe Hao, Gokul Soundararajan, Deepak Kenchammana-Hosekote, Andrew A. Chien, and Haryadi S. Gunawi. The Tail at Store: A Revelation from Millions of Hours of Disk and SSD Deployments. In FAST '16.
[29]
Chris Hawblitzel, Jon Howell, Manos Kapritsos, Jacob R. Lorch, Bryan Parno, Michael L. Roberts, Srinath Setty, and Brian Zill. IronFleet: Proving Practical Distributed Systems Correct. In SOSP '15.
[30]
Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In NSDI '11.
[31]
Chun-Hung Hsiao, Cristiano L. Pereira, Jie Yu, Gilles A. Pokam, Satish Narayanasamy, Peter M. Chen, Ziyun Kong, and Jason Flinn. Race Detection for Event-Driven Mobile Applications. In PLDI, 2014.
[32]
Nicholas Hunt, Tom Bergan, Luis Ceze, and Steven D. Gribble. DDOS: Taming Nondeterminism in Distributed Systems. In ASPLOS '13.
[33]
Nicholas Jalbert, Cristiano Pereira, Gilles Pokam, and Koushik Sen. RADBench: A Concurrency Bug Benchmark Suite. In HotPar, 2011.
[34]
Guoliang Jin, Aditya Thakur, Ben Liblit, and Shan Lu. Instrumentation and Sampling Strategies for Cooperative Concurrency Bug Isolation. In OOPSLA, 2010.
[35]
Guoliang Jin, Wei Zhang, Dongdong Deng, Ben Liblit, and Shan Lu. Automated Concurrency-Bug Fixing. In OSDI, 2012.
[36]
Pallavi Joshi, Haryadi S. Gunawi, and Koushik Sen.textscPreFail: A Programmable Tool for Multiple-Failure Injection. In OOPSLA '11.
[37]
Horatiu Jula, Daniel Tralamazza, Cristian Zamfir, and George Candea. Deadlock Immunity: Enabling Systems To Defend Against Deadlocks. In OSDI, 2008.
[38]
Charles Killian, James W. Anderson, Ranjit Jhala, and Amin Vahdat. Life, Death, and the Critical Transition: Finding Liveness Bugs in Systems Code. In NSDI '07.
[39]
Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F. Lukman, and Haryadi S. Gunawi. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems. In OSDI '14.
[40]
Sihan Li, Hucheng Zhou, Haoxiang Lin, Tian Xiao, Haibo Lin, Wei Lin, and Tao Xie. A Characteristic Study on Failures of Production Distributed Data-Parallel Programs. In ICSE, 2013.
[41]
Ben Liblit, Alex Aiken, Alice X. Zheng, and Michael I. Jordan. Bug Isolation via Remote Program Sampling. In PLDI, 2003.
[42]
Peng Liu, Omer Tripp, and Charles Zhang. Grail: Context-Aware Fixing of Concurrency Bugs. In FSE, 2014.
[43]
Xuezheng Liu, Zhenyu Guo, Xi Wang, Feibo Chen, Xiaochen Lian, Jian Tang, Ming Wu, M. Frans Kaashoek, and Zheng Zhang. D3S: Debugging Deployed Distributed Systems. In NSDI '08.
[44]
Xuezheng Liu, Wei Lin, Aimin Pan, and Zheng Zhang. WiDS Checker: Combating Bugs in Distributed Systems. In NSDI '07.
[45]
Lanyue Lu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Shan Lu. A Study of Linux File System Evolution. In FAST '13.
[46]
Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from Mistakes -- A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS '08.
[47]
Shan Lu, Joseph Tucek, Feng Qin, and Yuanyuan Zhou. AVIO: Detecting Atomicity Violations via Access Interleaving Invariants. In ASPLOS, 2006.
[48]
Brandon Lucia and Luis Ceze. Cooperative Empirical Failure Avoidance for Multithreaded Programs. In ASPLOS, 2013.
[49]
Madanlal Musuvathi and Shaz Qadeer. Iterative Context Bounding for Systematic Testing of Multithreaded Programs. In PLDI, 2007.
[50]
Diego Ongaro and John Ousterhout. In Search of an Understandable Consensus Algorithm. In USENIX ATC '14.
[51]
Nicolas Palix, Gael Thomas, Suman Saha, Christophe Calvès, Julia Lawall, and Gilles Muller. Faults in Linux: Ten Years Later. In ASPLOS, 2011.
[52]
Soyeon Park, Shan Lu, and Yuanyuan Zhou. CTrigger: Exposing Atomicity Violation Bugs from Their Finding Places. In ASPLOS, 2009.
[53]
Gilles Pokam, Klaus Danne, Cristiano Pereira, Rolf Kassa, Tim Kranich, Shiliang Hu, Justin Gottschlich, Nima Honarmand, Nathan Dautenhahn, Samuel T. King, and Josep Torrellas. QuickRec: Prototyping an Intel Architecture Extension for Record and Replay of Multithreaded Programs. In ISCA, 2013.
[54]
Shanxiang Qi, Abdullah A. Muzahid, Wonsun Ahn, and Josep Torrellas. Dynamically Detecting and Tolerating IF-Condition Data Races. In HPCA, 2014.
[55]
Veselin Raychev, Martin T. Vechev, and Manu Sridharan. Effective race detection for event-driven programs. In OOPSLA, 2013.
[56]
Patrick Reynolds, Charles Killian, Janet L. Wiener, Jeffrey C. Mogul, Mehul A. Shah, and Amin Vahdat. Pip: Detecting the Unexpected in Distributed Systems. In NSDI '06.
[57]
Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. ACM TOCS, 1997.
[58]
Koushik Sen. Race Directed Random Testing of Concurrent Programs. In PLDI, 2008.
[59]
Koushik Sen and Gul Agha. Automated Systematic Testing of Open Distributed Programs. In FSE, 2006.
[60]
Jiri Simsa, Randy Bryant, and Garth Gibson. dBug: Systematic Evaluation of Distributed Systems. In SSV '10.
[61]
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. In SIGCOMM '01.
[62]
Kaushik Veeraraghavan, Dongyoon Lee, Benjamin Wester, Jessica Ouyang, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. DoublePlay: Parallelizing Sequential Logging and Replay. In ASPLOS, 2011.
[63]
Yin Wang, Terence Kelly, Manjunath Kudlur, Stephane Lafortune, and Scott Mahlke. Gadara: Dynamic Deadlock Avoidance for Multithreaded Programs. In OSDI, 2008.
[64]
James R. Wilcox, Doug Woos, Pavel Panchekha, Zachary Tatlock, Xi Wang, Michael D. Ernst, and Thomas Anderson. Verdi: A Framework for Implementing and Formally Verifying Distributed System. In PLDI, 2015.
[65]
Tian Xiao, Jiaxing Zhang, Hucheng Zhou, Zhenyu Guo, Sean McDirmid, Wei Lin, Wenguang Chen, and Lidong Zhou. Nondeterminism in MapReduce Considered Harmful? An Empirical Study on Non-commutative Aggregators in MapReduce Programs. In ICSE '14.
[66]
Junfeng Yang, Tisheng Chen, Ming Wu, Zhilei Xu, Xuezheng Liu, Haoxiang Lin, Mao Yang, Fan Long, Lintao Zhang, and Lidong Zhou. MODIST: Transparent Model Checking of Unmodified Distributed Systems. In NSDI '09.
[67]
Jie Yu. A collection of concurrency bugs.textcolorMaroonhttps://github.com/jieyu/concurrency-bugs.
[68]
Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. In OSDI '14.
[69]
Wei Zhang, Marc de Kruijf, Ang Li, Shan Lu, and Karthikeyan Sankaralingam. ConAir: Featherweight concurrency bug recovery via single-threaded idempotent execution. In ASPLOS, 2013.

Cited By

View all
  • (2021)Well-typed programs can go wrong: a study of typing-related bugs in JVM compilersProceedings of the ACM on Programming Languages10.1145/34855005:OOPSLA(1-30)Online publication date: 15-Oct-2021
  • (2021)GoBenchProceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO51591.2021.9370317(187-199)Online publication date: 27-Feb-2021
  • (2020)Hunting Garbage Collection Related Concurrency Bugs through Critical Condition RestorationProceedings of the 2020 ACM Workshop on Forming an Ecosystem Around Software Transformation10.1145/3411502.3418426(17-22)Online publication date: 13-Nov-2020
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM SIGARCH Computer Architecture News
ACM SIGARCH Computer Architecture News  Volume 44, Issue 2
ASPLOS'16
May 2016
774 pages
ISSN:0163-5964
DOI:10.1145/2980024
Issue’s Table of Contents
  • cover image ACM Conferences
    ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2016
    824 pages
    ISBN:9781450340915
    DOI:10.1145/2872362
    • General Chair:
    • Tom Conte,
    • Program Chair:
    • Yuanyuan Zhou
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 March 2016
Published in SIGARCH Volume 44, Issue 2

Check for updates

Author Tags

  1. concurrency bugs
  2. distributed systems
  3. software testing

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)438
  • Downloads (Last 6 weeks)72
Reflects downloads up to 30 Aug 2024

Other Metrics

Citations

Cited By

View all
  • (2021)Well-typed programs can go wrong: a study of typing-related bugs in JVM compilersProceedings of the ACM on Programming Languages10.1145/34855005:OOPSLA(1-30)Online publication date: 15-Oct-2021
  • (2021)GoBenchProceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO51591.2021.9370317(187-199)Online publication date: 27-Feb-2021
  • (2020)Hunting Garbage Collection Related Concurrency Bugs through Critical Condition RestorationProceedings of the 2020 ACM Workshop on Forming an Ecosystem Around Software Transformation10.1145/3411502.3418426(17-22)Online publication date: 13-Nov-2020
  • (2019)Towards understanding bugs in an open source cloud management stack: An empirical study of OpenStack software bugsJournal of Systems and Software10.1016/j.jss.2019.02.025151(210-223)Online publication date: May-2019
  • (2018)Troubleshooting transiently-recurring problems in production systems with blame-proportional loggingProceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference10.5555/3277355.3277387(321-334)Online publication date: 11-Jul-2018
  • (2024)Blame-Correct Support for Receiver Properties in Recursively-Structured Actor ContractsProceedings of the ACM on Programming Languages10.1145/36746438:ICFP(515-543)Online publication date: 15-Aug-2024
  • (2024)Choral: Object-oriented Choreographic ProgrammingACM Transactions on Programming Languages and Systems10.1145/363239846:1(1-59)Online publication date: 16-Jan-2024
  • (2024)Understanding and Detecting Real-World Safety Issues in RustIEEE Transactions on Software Engineering10.1109/TSE.2024.338039350:6(1306-1324)Online publication date: Jun-2024
  • (2024)What's Wrong With Low-Code Development Platforms? An Empirical Study of Low-Code Development Platform BugsIEEE Transactions on Reliability10.1109/TR.2023.329500973:1(695-709)Online publication date: Mar-2024
  • (2023)A Systematic Mapping of the Proposition of Benchmarks in the Software Testing and Debugging DomainSoftware10.3390/software20400212:4(447-475)Online publication date: 12-Oct-2023
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media