DOI: 10.1145/3302424.3303986

FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems

Published: 25 March 2019 Publication History

Abstract

We present a fast and scalable testing approach for datacenter/cloud systems such as Cassandra, Hadoop, Spark, and ZooKeeper. What distinguishes our approach is its ability to overcome the path/state-space explosion problem in testing workloads with complex interleavings of messages and faults. We introduce three powerful algorithms (state symmetry, event independence, and parallel flips), which collectively make our approach on average 16x (up to 78x) faster than other state-of-the-art solutions. We have integrated our techniques with 8 popular datacenter systems, successfully reproduced 12 old bugs, and found 10 new bugs, all without random walks or manual checkpoints.
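
To make the path-explosion problem concrete, the following is a minimal sketch, not FlyMC's algorithm. It counts every ordering of a few in-flight messages, then counts how many remain once orderings that differ only by swapping adjacent independent events are treated as one. The event encoding and the `independent` rule (messages to different nodes commute) are simplifying assumptions made for this illustration only.

```python
from itertools import permutations


def independent(e1, e2):
    # Assumption for this sketch: an event is a (destination_node, message_id)
    # pair, and two messages commute iff they are delivered to different nodes.
    return e1[0] != e2[0]


def count_orderings(events):
    """Return (total orderings, orderings distinct up to independent swaps)."""
    unvisited = set(permutations(events))
    total = len(unvisited)
    classes = 0
    while unvisited:
        # Each flood fill below visits exactly one equivalence class.
        classes += 1
        stack = [unvisited.pop()]
        while stack:
            order = stack.pop()
            for i in range(len(order) - 1):
                if independent(order[i], order[i + 1]):
                    # Swapping adjacent independent events stays in the class.
                    swapped = order[:i] + (order[i + 1], order[i]) + order[i + 2:]
                    if swapped in unvisited:
                        unvisited.remove(swapped)
                        stack.append(swapped)
    return total, classes


# Two messages in flight to node A and two to node B.
events = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]
total, classes = count_orderings(events)
print(total, classes)  # prints "24 4"
```

Under these assumptions, only 4 of the 24 orderings need to be executed, because only the per-node delivery order matters. The total grows factorially with the number of events, which is why pruning of this kind matters for deep workloads.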

Published In

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019
March 2019
714 pages
ISBN:9781450362818
DOI:10.1145/3302424
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Availability
  2. Distributed Concurrency Bugs
  3. Distributed Systems
  4. Reliability
  5. Software Model Checking

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroSys '19
EuroSys '19: Fourteenth EuroSys Conference 2019
March 25 - 28, 2019
Dresden, Germany

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%
