research-article

FCatch: Automatically Detecting Time-of-fault Bugs in Cloud Systems

Authors:

Chen Tian

ACM SIGPLAN Notices, Volume 53, Issue 2

Pages 419 - 431

https://doi.org/10.1145/3296957.3177161

Published: 19 March 2018

Abstract

It is crucial for distributed systems to achieve high availability. Unfortunately, this is challenging because component failures (i.e., faults) are common. Developers often cannot anticipate all the timing conditions and system states under which a fault might occur, and so introduce time-of-fault (TOF) bugs that manifest only when a node crashes or a message drops at a particular moment. Although challenging, detecting TOF bugs is fundamental to developing highly available distributed systems. Unlike previous work that relies on fault injection to expose TOF bugs, this paper models TOF bugs as a new type of concurrency bug and develops FCatch to automatically predict TOF bugs by observing correct execution. Evaluation on representative cloud systems shows that FCatch is effective, accurately finding severe TOF bugs.
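
To make the notion of a time-of-fault bug concrete, here is a minimal, hypothetical sketch. It is not taken from the paper and does not show how FCatch works internally. It models a toy worker whose crash is harmless at most points of an execution but leaves a stale lock behind if it happens in the narrow window while the lock is held. All names (`Store`, `run_task`, `recover`, and the crash-point labels) are assumptions made for illustration.

```python
class Crash(Exception):
    """Simulated node crash (hypothetical fault model for this sketch)."""


class Store:
    """Toy shared state that survives a worker crash."""

    def __init__(self):
        self.lock_owner = None
        self.done = False


def run_task(store, worker, crash_at=None):
    """Run a three-step task; crash when reaching point `crash_at`, if given."""

    def maybe_crash(point):
        if crash_at == point:
            raise Crash(point)

    maybe_crash("before_lock")
    store.lock_owner = worker   # step 1: acquire the lock in shared state
    maybe_crash("holding_lock")
    store.done = True           # step 2: do the work
    store.lock_owner = None     # step 3: release the lock
    maybe_crash("after_release")


def recover(store, worker):
    """Naive recovery: a new worker retries but never clears a stale lock."""
    if store.done:
        return "already done"
    if store.lock_owner is not None:
        return "stuck"          # would wait forever on a dead lock owner
    run_task(store, worker)
    return "recovered"


def outcome(crash_at):
    """Crash worker w1 at the given point, then let worker w2 recover."""
    store = Store()
    try:
        run_task(store, "w1", crash_at)
    except Crash:
        pass
    return recover(store, "w2")


results = {p: outcome(p) for p in ("before_lock", "holding_lock", "after_release")}
print(results)
```

In this toy model only one of the three crash points leaves the recovery path stuck, which illustrates why a fault has to land at a specific moment for such a bug to show up.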

Index Terms

FCatch: Automatically Detecting Time-of-fault Bugs in Cloud Systems
1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging
  2. Software organization and properties
    1. Extra-functional properties
      1. Software reliability
    2. Software system structures
      1. Distributed systems organizing principles
        Cloud computing

Published In

ACM SIGPLAN Notices Volume 53, Issue 2

ASPLOS '18

February 2018

809 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/3296957

Editor: Matthew Fluet (Rochester Institute of Technology)

ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
March 2018
827 pages
ISBN:9781450349116
DOI:10.1145/3173162
General Chairs: Xipeng Shen (North Carolina State University, USA), James Tuck (North Carolina State University, USA)
Program Chairs: Ricardo Bianchini (Microsoft Research, USA), Vivek Sarkar (Georgia Institute of Technology, USA)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Huawei
Google Faculty Research Award
CERES Center for Unstoppable Computing
CCF
CNS
IIS

Bibliometrics

Total Citations: 24
Total Downloads: 946
Downloads (Last 12 months): 153
Downloads (Last 6 weeks): 38

Reflects downloads up to 13 Jan 2025

Citations

Cited By

Xu Q, Gao Y, Wei J, Christakis M, Pradel M (2024). An Empirical Study on Kubernetes Operator Bugs. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 1746-1758. Online publication date: 11-Sep-2024.
https://dl.acm.org/doi/10.1145/3650212.3680396

Barletta M, Cinque M, Di Martino C, Kalbarczyk Z, Iyer R (2024). Mutiny! How Does Kubernetes Fail, and What Can We Do About It? 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1-14. Online publication date: 24-Jun-2024.
https://doi.org/10.1109/DSN58291.2024.00016

Lu J, Li H, Liu C, Li L, Cheng K, Yin H, Stavrou A, Cremers C, Shi E (2022). Detecting Missing-Permission-Check Vulnerabilities in Distributed Cloud Systems. Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pp. 2145-2158. Online publication date: 7-Nov-2022.
https://dl.acm.org/doi/10.1145/3548606.3560589

Zampetti F, Kapur R, Di Penta M, Panichella S (2022). An empirical characterization of software bugs in open-source Cyber–Physical Systems. Journal of Systems and Software, vol. 192, article 111425. Online publication date: Oct-2022.
https://doi.org/10.1016/j.jss.2022.111425

Sun X, Suresh L, Ganesan A, Alagappan R, Gasch M, Tang L, Xu T, Angel S, Kasikci B, Kohler E (2021). Reasoning about modern datacenter infrastructures using partial histories. Proceedings of the Workshop on Hot Topics in Operating Systems, pp. 213-220. Online publication date: 1-Jun-2021.
https://dl.acm.org/doi/10.1145/3458336.3465276

Wu H, Pan J, Huang P, Vanbever L, Zhang I (2024). Efficient exposure of partial failure bugs in distributed systems with inferred abstract states. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, pp. 1267-1283. Online publication date: 16-Apr-2024.
https://dl.acm.org/doi/10.5555/3691825.3691895

Pan J, Wu H, Leesatapornwongsa T, Nath S, Huang P, Witchel E, Arpaci-Dusseau A, Rossbach C, Keeton K (2024). Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pp. 46-62. Online publication date: 4-Nov-2024.
https://dl.acm.org/doi/10.1145/3694715.3695979

Feng W, Pei Q, Gao Y, Wang D, Dou W, Wei J, Liang Z, Long Z, Roychoudhury A, Paiva A, Abreu R, Storey M (2024). FaultFuzz: A Coverage Guided Fault Injection Tool for Distributed Systems. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pp. 129-133. Online publication date: 14-Apr-2024.
https://dl.acm.org/doi/10.1145/3639478.3640036

Winter L, Buse F, de Graaf D, von Gleissenthall K, Kulahcioglu Ozkan B (2023). Randomized Testing of Byzantine Fault Tolerant Algorithms. Proceedings of the ACM on Programming Languages, 7:OOPSLA1, pp. 757-788. Online publication date: 6-Apr-2023.
https://dl.acm.org/doi/10.1145/3586053

Wang D, Dou W, Gao Y, Wu C, Wei J, Huang T, Fedorova A, Narayanan D, Di Luna G, Querzoni L (2023). Model Checking Guided Testing for Distributed Systems. Proceedings of the Eighteenth European Conference on Computer Systems, pp. 127-143. Online publication date: 8-May-2023.
https://dl.acm.org/doi/10.1145/3552326.3587442
