DOI: 10.1145/3552326.3587442
research-article
Open access

Model Checking Guided Testing for Distributed Systems

Published: 08 May 2023

Abstract

Distributed systems have become the backbone of cloud computing. Incorrect system designs and implementations can greatly impair their reliability. Although a distributed system design modelled as a formal specification can be verified by model checking, it remains challenging to determine whether the corresponding implementation conforms to the verified specification. An incorrect implementation can violate its verified specification and cause intricate bugs.
In this paper, we propose a novel distributed system testing technique, model checking guided testing (Mocket), to bridge the gap between a distributed system's specification and its implementation. Specifically, we use the state space generated by formal model checking to guide testing of the system implementation and to unearth bugs in the target distributed system. To evaluate the feasibility and effectiveness of Mocket, we apply it to three popular distributed systems and find 3 previously unknown bugs in them.
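The general idea of letting a model checker's state space drive implementation tests can be illustrated with a small sketch. This is not Mocket's actual algorithm: the toy state graph, the `ToyNode` class, and the edge-coverage strategy below are all hypothetical and chosen only to show the shape of the approach. The sketch derives one test path per uncovered transition in the state graph, replays each path against an implementation, and reports the first step at which the implementation's state diverges from the state the specification expects.

```python
from collections import deque

# Hypothetical toy state space, as a model checker might emit it:
# state -> list of (action, next_state). Not taken from the paper.
STATE_GRAPH = {
    "follower": [("timeout", "candidate")],
    "candidate": [("win_vote", "leader"), ("lose_vote", "follower")],
    "leader": [("step_down", "follower")],
}


def shortest_path_to(graph, start, target):
    """Breadth-first search; returns a list of (action, next_state) or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        for action, nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(action, nxt)]))
    return None


def edge_covering_paths(graph, initial):
    """Build test paths from the initial state until every reachable edge is covered."""
    covered = set()
    paths = []
    for src, edges in graph.items():
        for action, dst in edges:
            if (src, action, dst) in covered:
                continue
            prefix = shortest_path_to(graph, initial, src)
            if prefix is None:
                continue  # source state is unreachable from the initial state
            path = prefix + [(action, dst)]
            current = initial
            for step_action, step_next in path:
                covered.add((current, step_action, step_next))
                current = step_next
            paths.append(path)
    return paths


class ToyNode:
    """Hypothetical implementation under test, with one seeded bug."""

    TRANSITIONS = {
        ("follower", "timeout"): "candidate",
        ("candidate", "win_vote"): "leader",
        ("candidate", "lose_vote"): "follower",
        ("leader", "step_down"): "leader",  # seeded bug: spec expects "follower"
    }

    def __init__(self):
        self.state = "follower"

    def apply(self, action):
        self.state = self.TRANSITIONS.get((self.state, action), self.state)


def replay(path, make_impl):
    """Replay a path on a fresh implementation; return the first mismatch or None."""
    impl = make_impl()
    for step, (action, expected) in enumerate(path):
        impl.apply(action)
        if impl.state != expected:
            return (step, action, expected, impl.state)
    return None


if __name__ == "__main__":
    for test_path in edge_covering_paths(STATE_GRAPH, "follower"):
        print([a for a, _ in test_path], "->", replay(test_path, ToyNode))
```

In a real setting the state graph would come from the model checker's output rather than a hand-written dictionary, and applying an action would mean controlling a running distributed system, which is considerably harder than the in-memory call used here.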



Published In

EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems
May 2023
910 pages
ISBN: 9781450394871
DOI: 10.1145/3552326

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. distributed system
  2. model checking
  3. testing

Qualifiers

  • Research-article

Conference

EuroSys '23

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%



Cited By

  • (2024) Metis. Proceedings of the 22nd USENIX Conference on File and Storage Technologies, 123-140. DOI: 10.5555/3650697.3650705. Online publication date: 27-Feb-2024
  • (2024) Reward Augmentation in Reinforcement Learning for Testing Distributed Systems. Proceedings of the ACM on Programming Languages 8, OOPSLA2, 1928-1954. DOI: 10.1145/3689779. Online publication date: 8-Oct-2024
  • (2024) Erla⁺: Translating TLA⁺ Models into Executable Actor-Based Implementations. Proceedings of the 23rd ACM SIGPLAN International Workshop on Erlang, 13-23. DOI: 10.1145/3677995.3678190. Online publication date: 28-Aug-2024
  • (2024) An Empirical Study on Kubernetes Operator Bugs. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1746-1758. DOI: 10.1145/3650212.3680396. Online publication date: 11-Sep-2024
  • (2024) Non-Functional Requirements Discovery and Quality Assurance Using Goal Model for Earthquake Warning System in Operation. 2024 IEEE 32nd International Requirements Engineering Conference (RE), 275-286. DOI: 10.1109/RE59067.2024.00034. Online publication date: 24-Jun-2024
  • (2024) Validating Traces of Distributed Programs Against TLA+ Specifications. Software Engineering and Formal Methods, 126-143. DOI: 10.1007/978-3-031-77382-2_8. Online publication date: 26-Nov-2024
