BALANCE: Bayesian Linear Attribution for Root Cause Localization

Published: 30 May 2023

Abstract

Root Cause Analysis (RCA) plays an indispensable role in the maintenance and operation of distributed data systems, as it bridges the gap between fault detection and system recovery. Existing works mainly study multidimensional localization or graph-based root cause localization. This paper explores the use of the recently developed framework of explainable AI (XAI) for RCA. In particular, we propose BALANCE (BAyesian Linear AttributioN for root CausE localization), which formulates RCA through the lens of attribution in XAI and seeks to explain the anomalies in the target KPIs by the behavior of the candidate root causes. BALANCE consists of three components. First, we propose a Bayesian multicollinear feature selection (BMFS) model that predicts the target KPIs from the candidate root causes in a forward manner, promoting sparsity while accounting for the correlation between the candidate root causes. Second, we introduce attribution analysis to compute the attribution score of each candidate in a backward manner. Third, if there are multiple KPIs, we merge the estimated root causes related to each KPI. We extensively evaluate BALANCE on one synthetic dataset and three real-world RCA tasks: bad SQL localization, container fault localization, and fault type diagnosis for Exathlon. Results show that BALANCE outperforms the state-of-the-art (SOTA) methods in accuracy with the least running time, achieving at least 6% higher accuracy than SOTA methods on the real tasks. BALANCE has been deployed in production to tackle real-world RCA problems, and the online results further support its use for real-time diagnosis in distributed data systems.
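The forward and backward steps described in the abstract can be illustrated with a minimal sketch. This is not the paper's BMFS model: the forward step below is a generic sparse Bayesian (ARD-style) linear regression, in the spirit of the relevance vector machine, and it ignores the correlation handling that BMFS adds. The backward step assumes that, for a linear model, the attribution of candidate j is its weight times its deviation from a baseline. All names (`fit_ard_regression`, `linear_attribution`), the synthetic data, and the choice of the training mean as the baseline are assumptions made for illustration only.

```python
import numpy as np


def fit_ard_regression(X, y, n_iter=100, alpha_cap=1e6):
    """Sparse Bayesian linear regression with one prior precision per feature.

    A generic ARD-style fit, used only as a stand-in for the paper's BMFS
    model. Returns the posterior mean of the weights.
    """
    n, d = X.shape
    alpha = np.ones(d)          # prior precision of each weight
    beta = 1.0 / np.var(y)      # noise precision
    gram = X.T @ X
    for _ in range(n_iter):
        cov = np.linalg.inv(np.diag(alpha) + beta * gram)
        mean = beta * cov @ X.T @ y
        # gamma[j] in [0, 1]: how strongly the data determine weight j
        gamma = np.clip(1.0 - alpha * np.diag(cov), 0.0, 1.0)
        # Large alpha[j] shrinks weight j toward zero (sparsity)
        alpha = np.clip(gamma / (mean ** 2 + 1e-12), 1e-8, alpha_cap)
        resid = y - X @ mean
        beta = (n - gamma.sum()) / (resid @ resid + 1e-12)
    return mean


def linear_attribution(weights, x_anomalous, x_baseline):
    """Backward step: per-candidate contribution to the KPI deviation."""
    return weights * (x_anomalous - x_baseline)


# Synthetic normal-period data (assumed): 6 candidate metrics, 1 target KPI.
rng = np.random.default_rng(0)
n_samples, n_candidates = 200, 6
X = rng.normal(size=(n_samples, n_candidates))
true_w = np.array([0.0, 2.0, 0.0, 0.0, 0.5, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

# Forward step: fit on centered data so no intercept term is needed.
x_mean, y_mean = X.mean(axis=0), y.mean()
weights = fit_ard_regression(X - x_mean, y - y_mean)

# An anomalous observation in which candidate 1 is strongly shifted.
x_anomalous = rng.normal(size=n_candidates)
x_anomalous[1] += 8.0

# Backward step: attribute the KPI deviation to each candidate and rank.
scores = linear_attribution(weights, x_anomalous, x_mean)
ranking = np.argsort(-np.abs(scores))
print("attribution scores:", np.round(scores, 3))
print("top candidate:", ranking[0])
```

Because the model is linear, the attribution scores sum exactly to the difference between the predicted KPI at the anomalous point and at the baseline, which is the completeness property commonly required of attribution methods. Merging scores across multiple KPIs, the third component in the abstract, is not shown here because the abstract does not specify how it is done.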

Supplemental Material

MP4 File
Presentation video for SIGMOD 2023


Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 1
PACMMOD
May 2023
2807 pages
EISSN: 2836-6573
DOI: 10.1145/3603164
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Bayesian method
  2. attribution analysis
  3. bad SQLs
  4. distributed system
  5. explainable AI
  6. faults diagnosis
  7. root cause analysis

Qualifiers

  • Research-article

Article Metrics

  • Downloads (Last 12 months): 82
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 10 Feb 2025

Cited By

  • (2024) DB-MAGS: Multi-Anomaly Data Generation System for Transactional Databases. Proceedings of the VLDB Endowment 17:12 (4497-4500). DOI: 10.14778/3685800.3685909. Online publication date: 8-Nov-2024
  • (2024) Enabling Window-Based Monotonic Graph Analytics with Reusable Transitional Results for Pattern-Consistent Queries. Proceedings of the VLDB Endowment 17:11 (3003-3016). DOI: 10.14778/3681954.3681979. Online publication date: 30-Aug-2024
  • (2024) CGgraph: An Ultra-Fast Graph Processing System on Modern Commodity CPU-GPU Co-processor. Proceedings of the VLDB Endowment 17:6 (1405-1417). DOI: 10.14778/3648160.3648179. Online publication date: 3-May-2024
  • (2024) PreLog: A Pre-trained Model for Log Analytics. Proceedings of the ACM on Management of Data 2:3 (1-28). DOI: 10.1145/3654966. Online publication date: 30-May-2024
  • (2024) Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations. Proceedings of the ACM on Management of Data 2:3 (1-27). DOI: 10.1145/3654925. Online publication date: 30-May-2024
  • (2024) A Knowledge-Enhanced Transformer-FL Method for Fault Root Cause Localization. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (1607-1616). DOI: 10.1145/3627673.3679816. Online publication date: 21-Oct-2024
  • (2023) Homomorphic Compression: Making Text Processing on Compression Unlimited. Proceedings of the ACM on Management of Data 1:4 (1-28). DOI: 10.1145/3626765. Online publication date: 12-Dec-2023
  • (2023) RECom: A Compiler Approach to Accelerating Recommendation Model Inference with Massive Embedding Columns. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4 (268-286). DOI: 10.1145/3623278.3624761. Online publication date: 25-Mar-2023
