DOI: 10.1145/3534678.3539041
research-article
Open access

Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition

Published: 14 August 2022

Abstract

Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic losses. In the field of online service systems, operators rely on enormous amounts of monitoring data to detect and mitigate failures. Quickly recognizing a small set of root cause indicators for the underlying fault can save much time for failure mitigation. In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition. We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA). The core idea is a sufficient condition for a monitoring variable to be a root cause indicator, i.e., a change in its probability distribution conditioned on its parents in the Causal Bayesian Network (CBN). For application to online service systems, CIRCA constructs a graph among monitoring metrics based on knowledge of the system architecture and a set of causal assumptions. A simulation study illustrates the theoretical reliability of CIRCA. Performance on a real-world dataset further shows that CIRCA can improve the recall of the top-1 recommendation by 25% over the best baseline method.
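The sufficient condition described in the abstract can be written compactly. The notation below is an assumption made for illustration and is not taken from the paper: V_i denotes a monitoring variable, Pa(V_i) its parents in the CBN, and the subscripts "normal" and "fault" mark the distributions before and during the failure.

```latex
\[
P_{\text{fault}}\bigl(V_i \mid \mathbf{Pa}(V_i)\bigr)
\neq
P_{\text{normal}}\bigl(V_i \mid \mathbf{Pa}(V_i)\bigr)
\;\Longrightarrow\;
V_i \text{ is a root cause indicator}
\]
```

Read this way, a variable whose values change only because its parents changed still follows its usual conditional distribution, so it would not be flagged by this criterion.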

Supplemental Material

MP4 File
Presentation video of Causal Inference-based Root Cause Analysis (CIRCA). We formulate root cause analysis for online service systems (OSS) as a new causal inference task named intervention recognition (IR). While exploring the relation between IR and interventional knowledge, we assume that any intervention makes an observable change (the Faithfulness assumption). Under this assumption, we prove that IR is at the second layer of Judea Pearl's "Ladder of Causation" and find a practical criterion to locate the root cause. We provide a guideline to construct the Causal Bayesian Network (CBN) with knowledge of the OSS architecture. Two more techniques, regression-based hypothesis testing and descendant adjustment, are proposed to infer root cause variables in the CBN. Experiments with simulated and real-world datasets show CIRCA's theoretical reliability and practical value over baseline methods.
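To make the idea of regression-based hypothesis testing more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a known three-node graph (a -> b -> c), linear relations, and a score defined as the maximum absolute z-score of the fault-period residuals; the names `design_matrix`, `regression_scores`, and `simulate` are hypothetical, and descendant adjustment is left out.

```python
import numpy as np


def design_matrix(data, parent_names, n_samples):
    """Stack an intercept column with one column per parent metric."""
    columns = [np.ones(n_samples)] + [data[p] for p in parent_names]
    return np.column_stack(columns)


def regression_scores(normal, faulty, parents):
    """Score each metric by how far its fault-period values deviate from a
    linear regression on its parents that was fitted on normal-period data."""
    scores = {}
    for name in normal:
        pa = parents.get(name, [])
        y_normal = normal[name]
        x_normal = design_matrix(normal, pa, len(y_normal))
        # Fit metric ~ parents on the fault-free period (least squares).
        coef, *_ = np.linalg.lstsq(x_normal, y_normal, rcond=None)
        residual = y_normal - x_normal @ coef
        sigma = max(float(residual.std(ddof=1)), 1e-12)
        # Standardize the fault-period residuals with the normal-period spread.
        y_faulty = faulty[name]
        x_faulty = design_matrix(faulty, pa, len(y_faulty))
        z = (y_faulty - x_faulty @ coef) / sigma
        # Assumption of this sketch: aggregate with the maximum |z|.
        scores[name] = float(np.max(np.abs(z)))
    return scores


rng = np.random.default_rng(0)


def simulate(n, b_shift=0.0):
    """Synthetic chain a -> b -> c; b_shift models an intervention on b."""
    a = rng.normal(0.0, 1.0, n)
    b = 0.5 * a + b_shift + rng.normal(0.0, 0.1, n)
    c = 2.0 * b + rng.normal(0.0, 0.1, n)
    return {"a": a, "b": b, "c": c}


parents = {"a": [], "b": ["a"], "c": ["b"]}
normal = simulate(500)               # fault-free reference period
faulty = simulate(50, b_shift=5.0)   # fault injected at b only
scores = regression_scores(normal, faulty, parents)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

In this synthetic setup, `c` also shifts during the fault, but only because its parent `b` shifted, so its residuals stay small and `b` should rank first. A ranking by marginal anomaly scores alone would not separate the two as clearly.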


Information & Contributors

Information

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2022
5033 pages
ISBN: 9781450393850
DOI: 10.1145/3534678
This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. causal inference
  2. intervention recognition
  3. online service systems
  4. root cause analysis

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China
  • State Key Program of National Natural Science of China

Conference

KDD '22
Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 1,332
  • Downloads (last 6 weeks): 160
Reflects downloads up to 09 Nov 2024

Citations

Cited By

  • (2025) Causal similarity learning with multi-level predictive relation aggregation for grouped root cause diagnosis of industrial faults. Control Engineering Practice 154 (106140). DOI: 10.1016/j.conengprac.2024.106140. Online publication date: Jan-2025
  • (2024) Fault Location Method Based on Dynamic Operation and Maintenance Map and Common Alarm Points Analysis. Algorithms 17:5 (217). DOI: 10.3390/a17050217. Online publication date: 16-May-2024
  • (2024) The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (931-943). DOI: 10.1145/3691620.3695475. Online publication date: 27-Oct-2024
  • (2024) Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We? Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (706-715). DOI: 10.1145/3691620.3695065. Online publication date: 27-Oct-2024
  • (2024) Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (266-277). DOI: 10.1145/3663529.3663846. Online publication date: 10-Jul-2024
  • (2024) Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (126-137). DOI: 10.1145/3663529.3663834. Online publication date: 10-Jul-2024
  • (2024) FaultInsight: Interpreting Hyperscale Data Center Host Faults. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (141-152). DOI: 10.1145/3637528.3672051. Online publication date: 25-Aug-2024
  • (2024) Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (6049-6060). DOI: 10.1145/3637528.3671530. Online publication date: 25-Aug-2024
  • (2024) Causal Discovery from Heterogenous Multivariate Time Series. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (5499-5502). DOI: 10.1145/3627673.3680269. Online publication date: 21-Oct-2024
  • (2024) On the Fly Detection of Root Causes from Observed Data with Application to IT Systems. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (5062-5069). DOI: 10.1145/3627673.3680010. Online publication date: 21-Oct-2024
