DOI: 10.1145/2524211.2524217

SETSUDŌ: perturbation-based testing framework for scalable distributed systems

Published: 03 November 2013

Abstract

Modern scalable distributed systems are designed to be partition-tolerant. They are often required to scale elastically as the load of service requests grows, and to keep providing seamless service even when some servers malfunction. Partition-tolerance enables such systems to withstand arbitrary loss of messages as "perceived" by the communicating nodes. In practice, however, partition-tolerance and robustness are not tested rigorously. Severe system-level design defects often stay hidden even after deployment, possibly resulting in loss of revenue or customer satisfaction.
We propose a novel, rigorous perturbation-based testing framework, named SETSUDŌ, targeted specifically at exposing system-level defects in scalable distributed systems. It applies perturbations (i.e., controlled changes) from the environment of a system during testing, and leverages awareness of system-internal states to control their timing precisely. It uses a flexible instrumentation framework to select relevant internal states and to implement the system code for perturbations. It also provides a test policy language framework, in which high-level sequences of perturbation scenarios are converted automatically to system-level test code. This test code is woven into the application code automatically during testing, and any observed defects are reported. We have implemented our perturbation testing framework and evaluated it on several open-source projects, where it exposed known defects as well as some previously unknown ones. Our framework leverages small-scale testing, and avoids the upfront infrastructure costs typically needed for large-scale stress testing.
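The core idea in the abstract, triggering a perturbation when the system reaches a chosen internal state and then checking for defects, can be illustrated with a small Python sketch. This is not SETSUDŌ's implementation or policy language: `ToyCluster`, `run_policy`, the `LEADER_ELECTED` state name, and the "a leader must exist" check are all hypothetical, chosen only to make the idea concrete. The hook list stands in for the instrumentation that the framework weaves into application code.

```python
class ToyCluster:
    """Hypothetical system under test: nodes that elect one leader."""

    def __init__(self, nodes, reelect_on_loss):
        self.alive = {n: True for n in nodes}
        self.leader = None
        self.reelect_on_loss = reelect_on_loss  # False models a recovery defect
        self.hooks = []  # stand-in for woven-in instrumentation points

    def _enter(self, state):
        # Report an internal state to every registered test hook.
        for hook in list(self.hooks):
            hook(state, self)

    def elect(self):
        live = sorted(n for n, up in self.alive.items() if up)
        self.leader = live[0] if live else None
        self._enter("LEADER_ELECTED")

    def kill(self, node):
        self.alive[node] = False
        if node == self.leader:
            self.leader = None
            if self.reelect_on_loss:
                self.elect()


def run_policy(cluster, policy):
    """Apply each (trigger_state, perturbation) step in order, then check.

    Returns a list of defect descriptions; an empty list means none found.
    """
    pending = list(policy)

    def hook(state, c):
        # Fire the next perturbation only when its trigger state is reached.
        if pending and pending[0][0] == state:
            _, perturb = pending.pop(0)  # consume before applying (re-entrancy)
            perturb(c)

    cluster.hooks.append(hook)
    cluster.elect()

    defects = []
    if pending:
        defects.append("policy step never triggered: " + pending[0][0])
    if cluster.leader is None:
        defects.append("no leader after perturbation")
    return defects


# Hypothetical policy: crash the leader right after it has been elected.
policy = [("LEADER_ELECTED", lambda c: c.kill(c.leader))]

robust = run_policy(ToyCluster(["n1", "n2", "n3"], reelect_on_loss=True), policy)
buggy = run_policy(ToyCluster(["n1", "n2", "n3"], reelect_on_loss=False), policy)
```

In this sketch the same three-node, small-scale setup exposes the recovery defect in the variant that does not re-elect, because the crash is timed against the internal state rather than injected at a random moment.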


Published In

TRIOS '13: Proceedings of the First ACM SIGOPS Conference on Timely Results in Operating Systems
November 2013
155 pages
ISBN: 9781450324632
DOI: 10.1145/2524211
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SOSP '13
Cited By

  • (2024) MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications. IEEE Transactions on Dependable and Secure Computing, pp. 1-18. DOI: 10.1109/TDSC.2024.3363902. Online publication date: 2024
  • (2023) Enhancing Fault Injection Testing of Service Systems via Fault-Tolerance Bottleneck. IEEE Transactions on Software Engineering, pp. 1-17. DOI: 10.1109/TSE.2023.3285357. Online publication date: 2023
  • (2023) Efficient regression testing of distributed real-time reactive systems in the context of model-driven development. Software and Systems Modeling 22:5, pp. 1565-1587. DOI: 10.1007/s10270-023-01086-5. Online publication date: 6-Mar-2023
  • (2023) The operation and maintenance governance of microservices architecture systems. Journal of Software: Evolution and Process 35:10. DOI: 10.1002/smr.2433. Online publication date: 12-Oct-2023
  • (2022) Data-driven Mobility Analysis and Modeling: Typical and Confined Life of a Metropolitan Population. ACM Transactions on Spatial Algorithms and Systems 8:3, pp. 1-33. DOI: 10.1145/3517222. Online publication date: 27-Sep-2022
  • (2022) Verification of Distributed Quantum Programs. ACM Transactions on Computational Logic 23:3, pp. 1-40. DOI: 10.1145/3517145. Online publication date: 6-Apr-2022
  • (2022) Parameterized Complexity of Elimination Distance to First-Order Logic Properties. ACM Transactions on Computational Logic 23:3, pp. 1-35. DOI: 10.1145/3517129. Online publication date: 6-Apr-2022
  • (2022) Modalities and Parametric Adjoints. ACM Transactions on Computational Logic 23:3, pp. 1-29. DOI: 10.1145/3514241. Online publication date: 6-Apr-2022
  • (2022) Doing More with Less: Overcoming Data Scarcity for POI Recommendation via Cross-Region Transfer. ACM Transactions on Intelligent Systems and Technology 13:3, pp. 1-24. DOI: 10.1145/3511711. Online publication date: 13-Apr-2022
  • (2022) SoK: A Modularized Approach to Study the Security of Automatic Speech Recognition Systems. ACM Transactions on Privacy and Security 25:3, pp. 1-31. DOI: 10.1145/3510582. Online publication date: 19-May-2022
