DOI: 10.1145/3580305.3599849

Interdependent Causal Networks for Root Cause Localization

Published: 04 August 2023

Abstract

The goal of root cause analysis is to identify the underlying causes of system problems by discovering and analyzing the causal structure from system monitoring data. It is indispensable for maintaining the stability and robustness of large-scale complex systems. Existing methods mainly focus on the construction of a single effective isolated causal network, whereas many real-world systems are complex and exhibit interdependent structures (i.e., multiple networks of a system are interconnected by cross-network links). In interdependent networks, the malfunctioning effects of problematic system entities can propagate to other networks or different levels of system entities. Consequently, ignoring the interdependency results in suboptimal root cause analysis outcomes.
In this paper, we propose REASON, a novel framework that enables the automatic discovery of both intra-level (i.e., within-network) and inter-level (i.e., across-network) causal relationships for root cause localization. REASON consists of Topological Causal Discovery (TCD) and Individual Causal Discovery (ICD). The TCD component aims to model the fault propagation in order to trace back to the root causes. To achieve this, we propose novel hierarchical graph neural networks to construct interdependent causal networks by modeling both intra-level and inter-level non-linear causal relations. Based on the learned interdependent causal networks, we then leverage random walk with restarts to model the network propagation of a system fault. The ICD component focuses on capturing abrupt change patterns of a single system entity. This component examines the temporal patterns of each entity's metric data (i.e., time series) and estimates its likelihood of being a root cause based on extreme value theory. Combining the topological and individual causal scores, the top K system entities are identified as root causes. Extensive experiments on three real-world datasets validate the effectiveness of the proposed framework.
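
As a rough illustration of the scoring pipeline described above, the sketch below runs a random walk with restart over a toy causal graph and blends the resulting topological scores with per-entity individual scores to pick the top K candidates. It is not the REASON implementation. The four entity names, the edge list, the restart probability, the weighted-sum blending rule with weight `gamma`, and the individual scores (given here as made-up numbers rather than derived from extreme value theory) are all assumptions chosen for illustration.

```python
# Illustrative sketch only; every name and number below is hypothetical.

def random_walk_with_restart(adj, start, restart=0.3, max_iter=500, tol=1e-12):
    """Return visiting probabilities of a walker that restarts at `start`.

    adj[i][j] > 0 means the walker may step from node i to node j.
    """
    n = len(adj)
    # Row-normalise edge weights into transition probabilities.
    # A node without outgoing edges keeps the walker in place.
    trans = []
    for i, row in enumerate(adj):
        total = sum(row)
        if total == 0:
            trans.append([1.0 if j == i else 0.0 for j in range(n)])
        else:
            trans.append([w / total for w in row])

    probs = [1.0 if i == start else 0.0 for i in range(n)]
    for _ in range(max_iter):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                nxt[j] += (1.0 - restart) * probs[i] * trans[i][j]
        nxt[start] += restart  # probability mass that jumps back to the start
        delta = max(abs(a - b) for a, b in zip(nxt, probs))
        probs = nxt
        if delta < tol:
            break
    return probs


def rank_root_causes(names, topo, indiv, k=2, gamma=0.5):
    """Blend both scores with an assumed weighted sum and return the top k."""
    combined = [gamma * t + (1.0 - gamma) * s for t, s in zip(topo, indiv)]
    ranked = sorted(zip(names, combined), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


names = ["frontend", "cart", "db", "cache"]
# Assumed causal edges: db -> cart -> frontend and cache -> frontend.
# The walker moves against them (effect to cause), starting at the symptom.
reverse_adj = [
    [0, 1, 0, 1],  # frontend may be explained by cart or cache
    [0, 0, 1, 0],  # cart may be explained by db
    [0, 0, 0, 0],  # db has no known upstream cause
    [0, 0, 0, 0],  # cache has no known upstream cause
]
topo_scores = random_walk_with_restart(reverse_adj, start=0)
indiv_scores = [0.1, 0.2, 0.9, 0.1]  # placeholder anomaly scores in [0, 1]
top = rank_root_causes(names, topo_scores, indiv_scores)
print(top)
```

On this toy input the ranking puts `db` first and `cache` second: the walk concentrates probability on entities upstream of the symptom, and the high individual score of `db` lifts it above `cache`. A real system would learn the graph from monitoring data and derive the individual scores from each entity's time series.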

Supplementary Material

MP4 File (apfp246-2min-promo.mp4)
In this video, Dongjie Wang presents a game-changing approach to system failure analysis: 'Interdependent Causal Networks for Root Cause Localization'. Uncover the novel REASON framework that captures both individual and topological properties of interdependent networks. Learn how this approach has outperformed traditional methods in real-world tests, marking a significant advancement in root cause analysis. Join us on this journey into a new frontier of system maintenance.

Published In

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2023
5996 pages
ISBN: 9798400701030
DOI: 10.1145/3580305
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. causal structure learning
  2. graph neural networks
  3. interdependent networks
  4. network propagation
  5. root cause analysis

Qualifiers

  • Research-article

Conference

KDD '23
Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (last 12 months): 797
  • Downloads (last 6 weeks): 67
Reflects downloads up to 30 Aug 2024

Cited By

  • (2024) Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 126-137. DOI: 10.1145/3663529.3663834. Online publication date: 10-Jul-2024
  • (2024) POND: Multi-Source Time Series Domain Adaptation with Information-Aware Prompt Tuning. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3140-3151. DOI: 10.1145/3637528.3671721. Online publication date: 25-Aug-2024
  • (2024) MARLP: Time-series Forecasting Control for Agricultural Managed Aquifer Recharge. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4862-4872. DOI: 10.1145/3637528.3671533. Online publication date: 25-Aug-2024
  • (2024) MULAN: Multi-modal Causal Structure Learning and Root Cause Analysis for Microservice Systems. Proceedings of the ACM Web Conference 2024, 4107-4116. DOI: 10.1145/3589334.3645442. Online publication date: 13-May-2024
  • (2024) Semi-Supervised Metrics-Based Self-Training Root Cause Analysis for Cloud-Native Systems with Class-Imbalanced Data. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6405-6409. DOI: 10.1109/ICASSP48485.2024.10447959. Online publication date: 14-Apr-2024
