DOI: 10.5555/3488766.3488808

Testing configuration changes in context to prevent production failures

Published: 04 November 2020

Abstract

Large-scale cloud services deploy hundreds of configuration changes to production systems daily. At such velocity, configuration changes have inevitably become prevalent causes of production failures. Existing misconfiguration detection and configuration validation techniques only check configuration values. These techniques cannot detect common types of failure-inducing configuration changes, such as those that cause code to fail or those that violate hidden constraints.
We present ctests, a new type of test for detecting failure-inducing configuration changes to prevent production failures. The idea behind ctests is simple: connect production system configurations to software tests so that configuration changes can be tested in the context of the code affected by the changes. As a result, ctests can detect configuration changes that expose dormant software bugs as well as diverse misconfigurations.
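The idea of testing a configuration change in the context of the code it affects can be illustrated with a minimal Python sketch. Everything here is hypothetical and is not taken from the paper's implementation: the `Cache` class, the parameter names, and the `run_ctest` helper are invented for illustration. The test takes its configuration from the file under deployment instead of hardcoding values, so a change that passes a plain value check can still be caught by the test's oracle.

```python
import unittest

# Hypothetical stand-in for a deployed configuration file (names are illustrative).
PRODUCTION_CONFIG = {"cache.size.mb": "64", "cache.eviction.policy": "lru"}

SUPPORTED_POLICIES = {"lru", "fifo"}


class Cache:
    """Toy system under test that consumes configuration values."""

    def __init__(self, conf):
        # int() raises ValueError for non-numeric strings such as "64MB".
        self.size_mb = int(conf["cache.size.mb"])
        policy = conf["cache.eviction.policy"]
        if policy not in SUPPORTED_POLICIES:
            raise ValueError("unsupported eviction policy: " + policy)
        self.policy = policy

    def capacity_bytes(self):
        return self.size_mb * 1024 * 1024


def make_ctest(conf):
    """Build a test case whose configuration comes from `conf`, not from hardcoded values."""

    class CacheCtest(unittest.TestCase):
        def test_capacity_is_positive(self):
            cache = Cache(conf)
            # Oracle: the cache must be able to hold at least one byte.
            self.assertGreater(cache.capacity_bytes(), 0)

    return CacheCtest


def run_ctest(conf):
    """Return True if the test passes under the given configuration."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_ctest(conf))
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()
```

Under these assumptions, `run_ctest(PRODUCTION_CONFIG)` passes, while changing `cache.size.mb` to `"0"` fails the oracle even though `"0"` is a well-formed integer, which is the kind of hidden constraint a value-only check would miss.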
We show how to generate ctests by transforming the many existing tests in mature systems. The key challenge that we address is the automated identification of test logic and oracles that can be reused in ctests. We generated thousands of ctests from the existing tests in five cloud systems.
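One way to picture the transformation step is to observe which configuration parameters an existing test reads but does not set itself; those are candidates for being supplied from the configuration under test. The sketch below is an assumption made for illustration, not the paper's actual procedure: `TrackingConfig`, `parameterizable`, and the parameter names are all hypothetical.

```python
class TrackingConfig:
    """Configuration wrapper that records which parameters a test reads and sets."""

    def __init__(self, defaults):
        self._values = dict(defaults)
        self.read = set()
        self.written = set()

    def get(self, key):
        self.read.add(key)
        return self._values[key]

    def set(self, key, value):
        self.written.add(key)
        self._values[key] = value


def parameterizable(test_fn, defaults):
    """Run an existing test once and return the parameters it reads but never sets."""
    conf = TrackingConfig(defaults)
    test_fn(conf)
    return conf.read - conf.written


# Hypothetical default configuration (names are illustrative).
DEFAULTS = {"rpc.timeout.ms": "3000", "rpc.retries": "3", "log.dir": "/tmp/logs"}


def existing_test(conf):
    """An existing test that hardcodes one value and relies on the default for another."""
    conf.set("rpc.retries", "1")  # hardcoded by the test, so not reusable for this parameter
    timeout = int(conf.get("rpc.timeout.ms"))
    retries = int(conf.get("rpc.retries"))
    assert timeout * retries > 0
```

In this toy setting, `parameterizable(existing_test, DEFAULTS)` yields only `rpc.timeout.ms`: the test overrides `rpc.retries` itself and never reads `log.dir`, so neither would be exercised by feeding in a production value.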
Our results show that ctests are effective in detecting failure-inducing configuration changes before deployment. We evaluate ctests on real-world failure-inducing configuration changes, injected misconfigurations, and deployed configuration files from public Docker images. Ctests effectively detect real-world failure-inducing configuration changes and misconfigurations in the deployed files.

Cited By

  • (2024) Ctest4J: A Practical Configuration Testing Framework for Java. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 562-566. DOI: 10.1145/3663529.3663799. Online publication date: 10-Jul-2024
  • (2024) ECFuzz: Effective Configuration Fuzzing for Large-Scale Systems. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1-12. DOI: 10.1145/3597503.3623315. Online publication date: 20-May-2024
  • (2023) What goes wrong in serverless runtimes? A survey of bugs in Knative Serving. Proceedings of the 1st Workshop on SErverless Systems, Applications and MEthodologies, 12-18. DOI: 10.1145/3592533.3592806. Online publication date: 8-May-2023
  • (2021) An Evolutionary Study of Configuration Design and Implementation in Cloud Systems. Proceedings of the 43rd International Conference on Software Engineering, 188-200. DOI: 10.1109/ICSE43902.2021.00029. Online publication date: 22-May-2021

Published In

OSDI'20: Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation
November 2020
1255 pages
ISBN: 978-1-939133-19-9

Sponsors

  • ORACLE
  • VMware
  • Google Inc.
  • Amazon
  • Microsoft

Publisher

USENIX Association

United States
