research-article
Open access

Trace aware random testing for distributed systems

Published: 10 October 2019

Abstract

Distributed and concurrent applications often have subtle bugs that only get exposed under specific schedules. While these schedules may be found by systematic model checking techniques, in practice, model checkers do not scale to large systems. On the other hand, naive random exploration techniques often require a very large number of runs to find the specific interactions needed to expose a bug. In recent years, several random testing algorithms have been proposed that, on the one hand, exploit state-space reduction strategies from model checking and, on the other, provide guarantees on the probability of hitting bugs of certain kinds.
These existing techniques exploit two orthogonal strategies to reduce the state space: partial-order reduction and bug depth. Testing algorithms based on partial order techniques, such as RAPOS or POS, ensure non-redundant exploration of independent interleavings among system events by imposing an equivalence relation on schedules and ideally exploring only one schedule from each equivalence class. Techniques based on bug depth, such as PCT, exploit the empirical observation that many bugs are exposed by the clever scheduling of a small number of key events. They bias the sample space of schedules to only cover all executions of small depth, rather than the much larger space of all schedules. At this point, there is no random testing algorithm that combines the power of both approaches.
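The equivalence relation used by partial-order techniques can be illustrated with a small sketch. It assumes a hypothetical event encoding `(name, node)` and a simplified dependency relation in which two events are dependent exactly when the same node handles them; neither is taken from the paper. Under those assumptions, the Foata normal form gives one canonical representative per equivalence class, so two schedules are equivalent exactly when their normal forms match.

```python
from typing import List, Sequence, Tuple

# Hypothetical encoding: (event name, node that handles the event).
Event = Tuple[str, str]


def dependent(a: Event, b: Event) -> bool:
    """Assumed dependency relation: events on the same node do not commute."""
    return a[1] == b[1]


def foata_normal_form(schedule: Sequence[Event]) -> Tuple[Tuple[Event, ...], ...]:
    """Return a canonical representative of the schedule's equivalence class.

    Each event is placed in the layer right after the deepest earlier event
    it depends on. Events within one layer are pairwise independent, so
    sorting each layer removes any trace of their original relative order.
    """
    layer_of: List[int] = []
    for i, event in enumerate(schedule):
        depth = 0
        for j in range(i):
            if dependent(schedule[j], event):
                depth = max(depth, layer_of[j] + 1)
        layer_of.append(depth)

    layers: List[List[Event]] = [[] for _ in range(max(layer_of, default=-1) + 1)]
    for event, depth in zip(schedule, layer_of):
        layers[depth].append(event)
    return tuple(tuple(sorted(layer)) for layer in layers)


# Swapping the independent events m1 (node A) and m2 (node B) stays in the class.
s1 = [("m1", "A"), ("m2", "B"), ("m3", "A")]
s2 = [("m2", "B"), ("m1", "A"), ("m3", "A")]
# Reordering the dependent events m1 and m3 (both on node A) leaves the class.
s3 = [("m3", "A"), ("m1", "A"), ("m2", "B")]

same_class = foata_normal_form(s1) == foata_normal_form(s2)
different_class = foata_normal_form(s1) != foata_normal_form(s3)
```

A tester that explores one schedule per normal form avoids re-running interleavings that differ only in the order of independent events.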
In this paper, we provide such an algorithm. Our algorithm, trace-aware PCT (taPCTCP), extends and unifies several different algorithms in the random testing literature. It samples the space of low-depth executions by constructing a schedule online, while taking dependencies among events into account. Moreover, the algorithm comes with a theoretical guarantee on the probability of sampling a trace of low depth: the probability grows exponentially with the depth but only polynomially with the number of racy events explored. We further show that the guarantee is optimal among a large class of techniques.
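For orientation, the depth-based sampling that taPCTCP builds on can be sketched as a classic PCT-style priority scheduler. This is not the trace-aware algorithm of the paper: it ignores dependencies among events, and the function name `pct_schedule`, the thread encoding (a dict of operation labels), and the absence of blocking are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional, Tuple


def pct_schedule(
    threads: Dict[str, List[str]],
    depth: int,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, str]]:
    """Build one schedule online with PCT-style random priorities.

    threads maps a (hypothetical) thread name to its operations in program
    order. depth is the targeted bug depth d; the scheduler inserts d - 1
    random priority change points.
    """
    rng = rng or random.Random()
    total_steps = sum(len(ops) for ops in threads.values())

    # Distinct random initial priorities depth, depth + 1, ..., so that every
    # initial priority is higher than any priority set at a change point.
    names = list(threads)
    rng.shuffle(names)
    priority = {name: depth + i for i, name in enumerate(names)}

    # d - 1 change points drawn uniformly from the step indices 1..total_steps.
    change_points = (
        [rng.randint(1, total_steps) for _ in range(depth - 1)] if total_steps else []
    )

    next_op = {name: 0 for name in threads}
    schedule: List[Tuple[str, str]] = []
    for step in range(1, total_steps + 1):
        # Always run the enabled thread with the highest priority.
        enabled = [t for t in threads if next_op[t] < len(threads[t])]
        current = max(enabled, key=priority.__getitem__)
        schedule.append((current, threads[current][next_op[current]]))
        next_op[current] += 1
        # At the i-th change point, demote the running thread to priority d - i.
        for i, point in enumerate(change_points, start=1):
            if point == step:
                priority[current] = depth - i
    return schedule


example_threads = {"t1": ["a1", "a2"], "t2": ["b1", "b2", "b3"]}
example = pct_schedule(example_threads, depth=3, rng=random.Random(0))
```

With `depth=1` there are no change points, so each thread runs to completion in priority order; larger depths allow a small number of randomly placed preemptions.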
We empirically compare our algorithm with several state-of-the-art random testing approaches for concurrent software on two large-scale distributed systems, Zookeeper and Cassandra, and show that our approach is effective in uncovering subtle bugs and usually outperforms related random testing algorithms.

Supplementary Material

a180-ozcan (a180-ozcan.webm)
Presentation at OOPSLA '19


Published In

Proceedings of the ACM on Programming Languages, Volume 3, Issue OOPSLA
October 2019
2077 pages
EISSN:2475-1421
DOI:10.1145/3366395
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2019
Published in PACMPL Volume 3, Issue OOPSLA

Author Tags

  1. distributed systems
  2. hitting families
  3. partial order reduction
  4. random testing

Qualifiers

  • Research-article

Article Metrics

  • Downloads (Last 12 months): 305
  • Downloads (Last 6 weeks): 21
Reflects downloads up to 15 Oct 2024

Cited By

  • (2024) Reward Augmentation in Reinforcement Learning for Testing Distributed Systems. Proceedings of the ACM on Programming Languages 8:OOPSLA2, pp. 1928-1954. DOI: 10.1145/3689779. Online publication date: 8-Oct-2024
  • (2024) A faster FPRAS for #NFA. Proceedings of the ACM on Management of Data 2:2, pp. 1-22. DOI: 10.1145/3651613. Online publication date: 14-May-2024
  • (2024) FaultFuzz: A Coverage Guided Fault Injection Tool for Distributed Systems. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pp. 129-133. DOI: 10.1145/3639478.3640036. Online publication date: 14-Apr-2024
  • (2024) Predictive Monitoring against Pattern Regular Languages. Proceedings of the ACM on Programming Languages 8:POPL, pp. 2191-2225. DOI: 10.1145/3632915. Online publication date: 5-Jan-2024
  • (2024) A Domain Specific Language for Testing Distributed Protocol Implementations. Networked Systems, pp. 100-117. DOI: 10.1007/978-3-031-67321-4_6. Online publication date: 29-May-2024
  • (2023) Greybox Fuzzing of Distributed Systems. Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pp. 1615-1629. DOI: 10.1145/3576915.3623097. Online publication date: 15-Nov-2023
  • (2023) Probabilistic Concurrency Testing for Weak Memory Programs. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 603-616. DOI: 10.1145/3575693.3575729. Online publication date: 27-Jan-2023
  • (2023) Model Checking Guided Testing for Distributed Systems. Proceedings of the Eighteenth European Conference on Computer Systems, pp. 127-143. DOI: 10.1145/3552326.3587442. Online publication date: 8-May-2023
  • (2023) Evolutionary Approach for Concurrency Testing of Ripple Blockchain Consensus Algorithm. 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 36-47. DOI: 10.1109/ICSE-SEIP58684.2023.00009. Online publication date: May-2023
  • (2023) Automated Test Case Generation for Service Composition from Event Logs. 2023 38th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW), pp. 127-134. DOI: 10.1109/ASEW60602.2023.00022. Online publication date: 11-Sep-2023
