research-article
Open access

BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection

Published: 12 July 2024

Abstract

Detecting failures and identifying their root causes promptly and accurately is crucial for ensuring the availability of microservice systems. A typical failure troubleshooting pipeline for microservices consists of two phases: anomaly detection and root cause analysis. While various existing works on root cause analysis require accurate anomaly detection, there is no guarantee of accurate estimation with anomaly detection techniques. Inaccurate anomaly detection results can significantly affect the root cause localization results. To address this challenge, we propose BARO, an end-to-end approach that integrates anomaly detection and root cause analysis for effectively troubleshooting failures in microservice systems. BARO leverages the Multivariate Bayesian Online Change Point Detection technique to model the dependency within multivariate time-series metrics data, enabling it to detect anomalies more accurately. BARO also incorporates a novel nonparametric statistical hypothesis testing technique for robustly identifying root causes, which is less sensitive to the accuracy of anomaly detection compared to existing works. Our comprehensive experiments conducted on three popular benchmark microservice systems demonstrate that BARO consistently outperforms state-of-the-art approaches in both anomaly detection and root cause analysis.
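The abstract's claim that root cause ranking can tolerate imprecise anomaly detection can be illustrated with a small sketch. The abstract does not specify BARO's hypothesis test, so the median and interquartile-range scoring below is an assumed stand-in, and the function names, metric names, and sample values are all hypothetical. Each metric is scored by how far its post-detection samples deviate from the pre-detection median, in units of the pre-detection IQR.

```python
from statistics import median, quantiles


def robust_score(normal, suspect):
    """Largest deviation of `suspect` from the median of `normal`, in IQR units.

    Assumption: median/IQR scoring is an illustrative choice, not a method
    confirmed by the abstract.
    """
    med = median(normal)
    q1, _, q3 = quantiles(normal, n=4)
    iqr = (q3 - q1) or 1e-9  # guard against a constant baseline
    return max(abs(x - med) for x in suspect) / iqr


def rank_root_causes(metrics, change_index):
    """Rank metric names by robust deviation after `change_index`.

    `metrics` maps a metric name to a list of samples; `change_index` is the
    (possibly imprecise) position reported by the anomaly detector.
    """
    scores = {
        name: robust_score(series[:change_index], series[change_index:])
        for name, series in metrics.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: the fault starts at index 10 and hits `cart_cpu` hardest.
metrics = {
    "cart_cpu": [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 55, 60, 58, 57, 59],
    "web_latency": [20, 21, 19, 20, 22, 21, 20, 19, 21, 20, 26, 27, 25, 26, 27],
    "db_mem": [30, 31, 30, 29, 30, 31, 30, 29, 30, 31, 30, 31, 29, 30, 31],
}

exact = rank_root_causes(metrics, change_index=10)  # detector fires on time
late = rank_root_causes(metrics, change_index=12)   # detector fires two samples late
print(exact, late)
```

In this toy data the ranking is the same whether the detector fires on time or two samples late, because the median and IQR of the baseline window are only mildly affected by the two anomalous samples that leak into it. A mean and standard deviation baseline would be distorted more by the same leak.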

Cited By

  • (2024) MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 1057-1068. DOI: 10.1145/3691620.3695485. Online publication date: 27-Oct-2024
  • (2024) Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We? Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 706-715. DOI: 10.1145/3691620.3695065. Online publication date: 27-Oct-2024

      Published In

      Proceedings of the ACM on Software Engineering  Volume 1, Issue FSE
      July 2024
      2770 pages
      EISSN:2994-970X
      DOI:10.1145/3554322
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 12 July 2024
      Published in PACMSE Volume 1, Issue FSE

      Author Tags

      1. Anomaly Detection
      2. Microservice Systems
      3. Root Cause Analysis

      Qualifiers

      • Research-article

      Funding Sources

      • Australian Research Council
