Abstracting the geniuses away from failure testing

Authors:

Peter Alvaro,

Severine Tymon

Communications of the ACM, Volume 61, Issue 1

Pages 54 - 61

https://doi.org/10.1145/3152483

Published: 27 December 2017

Abstract

Ordinary users need tools that automate the selection of custom-tailored faults to inject.

Index Terms

Abstracting the geniuses away from failure testing
1. Hardware
  1. Hardware test
    1. Fault models and test metrics
  2. Robustness
    1. Fault tolerance
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance

Published In

Communications of the ACM Volume 61, Issue 1

January 2018

110 pages

ISSN:0001-0782

EISSN:1557-7317

DOI:10.1145/3176926

Editor:
Andrew A. Chien
Association for Computing Machinery, New York, NY

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Popular
Refereed
