research-article
Open access

Understanding and Detecting Software Upgrade Failures in Distributed Systems

Published: 26 October 2021

Abstract

Upgrade is one of the most disruptive yet unavoidable maintenance tasks that undermine the availability of distributed systems. Any failure during an upgrade is catastrophic, as it further extends the service disruption caused by the upgrade. The increasing adoption of continuous deployment further increases the frequency and burden of the upgrade task. In practice, upgrade failures have caused many of today's high-profile cloud outages. Unfortunately, there has been little understanding of their characteristics.
This paper presents an in-depth study of 123 real-world upgrade failures that were previously reported by users in 8 widely used distributed systems, shedding light on the severity, root causes, exposing conditions, and fix strategies of upgrade failures. Guided by our study, we have designed a testing framework, DUPTester, that revealed 20 previously unknown upgrade failures in 4 distributed systems, and applied a series of static checkers, DUPChecker, that discovered over 800 cross-version data-format incompatibilities that can lead to upgrade failures. DUPChecker has been requested by HBase developers to be integrated into their toolchain.
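
To make the notion of a cross-version data-format incompatibility concrete, the following is a minimal illustrative sketch in Python. It is not DUPChecker and does not reproduce the paper's analysis. It assumes a simplified, protobuf-like model in which each message field has a numeric tag, a name, a type, and a required flag, and it applies two widely documented compatibility rules: a tag must not change type between versions, and required fields must not be added or removed. The names `Field`, `find_incompatibilities`, `v1`, and `v2` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One field of a serialized message (hypothetical, protobuf-like model)."""
    tag: int
    name: str
    type: str
    required: bool = False


def find_incompatibilities(old, new):
    """Compare two versions of one message schema and list the findings."""
    old_by_tag = {f.tag: f for f in old}
    new_by_tag = {f.tag: f for f in new}
    findings = []
    for tag in sorted(old_by_tag.keys() | new_by_tag.keys()):
        before = old_by_tag.get(tag)
        after = new_by_tag.get(tag)
        if before is not None and after is not None:
            # Same tag in both versions: the wire type must stay the same.
            if before.type != after.type:
                findings.append(
                    f"tag {tag}: type changed {before.type} -> {after.type}")
        elif before is not None and before.required:
            # A required field vanished: old readers will reject new data.
            findings.append(f"tag {tag}: required field '{before.name}' removed")
        elif after is not None and after.required:
            # A required field appeared: new readers will reject old data.
            findings.append(f"tag {tag}: required field '{after.name}' added")
    return findings


# Two made-up schema versions used only to exercise the checker.
v1 = [Field(1, "block_id", "int64", required=True), Field(2, "owner", "string")]
v2 = [Field(1, "block_id", "string", required=True),
      Field(3, "checksum", "bytes", required=True)]

for finding in find_incompatibilities(v1, v2):
    print(finding)
```

In this made-up example the checker flags the type change on tag 1 and the newly required field on tag 3, while the removal of the optional field on tag 2 is treated as compatible. A real checker would have to parse actual schema files from both versions and cover many more rules.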

Published In

SOSP '21: Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles
October 2021
899 pages
ISBN:9781450387095
DOI:10.1145/3477132
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bug detection
  2. distributed systems
  3. study
  4. upgrade failure

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SOSP '21
Acceptance Rates

Overall Acceptance Rate 174 of 961 submissions, 18%
