MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications

Authors:

Zilong He

IEEE Transactions on Dependable and Secure Computing, Volume 21, Issue 5

Pages 4921 - 4938

https://doi.org/10.1109/TDSC.2024.3363902

Published: 01 September 2024

Abstract

Microservices are a widely adopted architecture for building cloud-native applications. To test application resiliency, chaos engineering is widely used to proactively inject faults into applications. However, the search space formed by possible injection locations is huge due to the scale and complexity of these applications. Although some methods have been proposed to explore the injection space effectively, they cannot prioritize high-impact injection solutions. Additionally, the blast radius of faults injected by existing methods is typically uncertain, so a single injection can cause failures in multiple application functions. Although some tools are designed to conduct request-level injection, they require instrumenting the application code. To tackle these problems, this paper presents MicroFI, a non-intrusive fault injection framework that aims to efficiently test different application functions with request-level injection. Request-level injection limits the blast radius to specified requests without any source code modification. Additionally, MicroFI leverages historical injection results and parallel techniques to accelerate the search. Moreover, an enhanced PageRank is used to measure the impact of faults and prioritize high-impact faults that fail more functions. Evaluations on three microservice applications show that MicroFI precisely injects faults and reduces redundant faults by up to 91% on average. Additionally, by employing prioritization, MicroFI reduces the injection budget needed to cover all high-impact faults by an average of 47.3%.
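
To make the prioritization idea concrete, the following is a minimal sketch of ranking candidate injection targets with plain PageRank over a service call graph. It is not the enhanced PageRank described in the abstract, whose details are not given here. The call graph, the service names, and the helper names `pagerank` and `prioritize` are all hypothetical and chosen only for illustration. Edges point from caller to callee, so a service that many request paths depend on accumulates a higher score and would be tried first.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Plain power-iteration PageRank.

    graph maps each node to the list of nodes it calls (caller -> callee).
    This is the textbook algorithm, not the paper's enhanced variant.
    """
    nodes = sorted(set(graph) | {v for succs in graph.values() for v in succs})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            succs = graph.get(u, [])
            for v in succs:
                new_rank[v] += damping * rank[u] / len(succs)
        rank = new_rank
    return rank


# Hypothetical call graph (assumption, not taken from the paper).
CALL_GRAPH = {
    "frontend": ["cart", "checkout", "catalog"],
    "checkout": ["cart", "payment", "catalog"],
    "cart": ["redis"],
    "catalog": [],
    "payment": [],
    "redis": [],
}


def prioritize(graph):
    """Return services ordered from highest to lowest PageRank score."""
    scores = pagerank(graph)
    return sorted(scores, key=lambda s: (-scores[s], s))


if __name__ == "__main__":
    for service in prioritize(CALL_GRAPH):
        print(service)
```

In this sketch the entry service `frontend` ranks last because nothing calls it, while services reached by several callers rank higher. A real prioritizer would presumably also weight edges by observed request traffic and by past injection results, which this sketch omits.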

Index Terms

MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Fault-tolerant network topologies
2. Software and its engineering

Published In

IEEE Transactions on Dependable and Secure Computing Volume 21, Issue 5

Sept.-Oct. 2024

750 pages

1545-5971 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Publisher

IEEE Computer Society Press

Washington, DC, United States
