DOI: 10.1145/3552326.3587448
research-article
Public Access

Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems

Published: 08 May 2023

Abstract

Modern cloud systems are orchestrations of independent, interacting (sub-)systems, each specializing in an important service (e.g., data processing, storage, or resource management). Hence, cloud system reliability is affected not only by the reliability of each individual system, but also by the interplay between these systems. We observe that many recent production incidents in cloud systems manifest through interactions across system boundaries. However, there is a lack of systematic understanding of this emerging failure mode, which we term cross-system interaction failures (or CSI failures). This gap hinders the development of better design and integration practices and of new tooling.
In this paper, we discuss cross-system interaction failures based on analyses of (1) 11 CSI-failure-induced cloud incidents at Google, Azure, and AWS, and (2) 120 CSI failure cases from seven widely co-deployed open-source systems. We focus on understanding discrepancies between interacting systems as the root causes of CSI failures; such failures cannot be understood by analyzing a single system in isolation. This paper draws attention to this emerging failure mode, provides a comprehensive understanding of CSI failure patterns, and discusses potential approaches for mitigation. We advocate for cross-system testing and verification, and demonstrate its potential by cross-testing the Spark-Hive data plane and exposing 15 new discrepancies.


Cited By

  • (2024) Everything Everywhere All At Once: Efficient Cross-Service Program Analysis with OverSeer. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops, 82-87. DOI: 10.1145/3691621.3694937. Online publication date: 27-Oct-2024
  • (2024) Trinity: High-Performance and Reliable Mobile Emulation through Graphics Projection. ACM Transactions on Computer Systems 42:3-4, 1-33. DOI: 10.1145/3643029. Online publication date: 20-Sep-2024
  • (2024) Mutiny! How Does Kubernetes Fail, and What Can We Do About It? 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 1-14. DOI: 10.1109/DSN58291.2024.00016. Online publication date: 24-Jun-2024

Published In

EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems
May 2023
910 pages
ISBN:9781450394871
DOI:10.1145/3552326
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-system interaction
  2. failure study
  3. root cause analysis
  4. cloud system

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%
