Abstracting the geniuses away from failure testing

Published: 27 December 2017

Abstract

Ordinary users need tools that automate the selection of custom-tailored faults to inject.

Cited By

  • (2023) Greybox Fuzzing of Distributed Systems. Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, 1615-1629. DOI: 10.1145/3576915.3623097. Online publication date: 15-Nov-2023
  • (2018) Maelstrom. Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, 373-389. DOI: 10.5555/3291168.3291196. Online publication date: 8-Oct-2018
  • (2018) Modelling distributed Erlang within a single node. Proceedings of the 17th ACM SIGPLAN International Workshop on Erlang, 25-36. DOI: 10.1145/3239332.3242764. Online publication date: 29-Sep-2018

Published In

Communications of the ACM, Volume 61, Issue 1
January 2018
110 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3176926
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Popular
  • Refereed
