ShapleyIQ: Influence Quantification by Shapley Values for Performance Debugging of Microservices

Published: 07 February 2024

Abstract

Years of experience operating large-scale microservice systems have strengthened our belief that the individual components of such systems, in which anomalies are inevitable, call for a quantification of their influence on end-to-end performance indicators. On a causal graph that represents the complex dependencies between system components, anomalies detected at scattered locations, even when they look similar, can have different implications and call for contrasting remedial actions. To this end, we design ShapleyIQ, an online monitoring and diagnosis service that effectively improves system stability. It is guided by a rigorous analysis of Shapley values on a causal graph. Notably, a new splitting-invariance property addresses, by decomposition, the exponential computational complexity of generic Shapley values.
This service has been deployed on a core infrastructure system on Alibaba Cloud for over a year, covering more than 15,000 operations for 86 services across 2,546 machines. Since then, it has drastically improved DevOps efficiency, and system failures have been reduced by 83.3%. We also conduct an offline evaluation on the open-source microservice system TrainTicket, where the task is to pinpoint the root causes of performance issues in hindsight. Extensive experiments and test cases show that our system achieves 97.3% accuracy in identifying the top-1 root causes on these datasets, outperforming baseline algorithms by at least 28.7% in absolute difference.
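
To make the complexity remark concrete, here is a minimal sketch of the generic (brute-force) Shapley value computation that the abstract contrasts against. It is not the ShapleyIQ algorithm and does not implement the splitting-invariance property. The three services, their extra delays, and the `added_latency` function are hypothetical, chosen only to show how enumerating every coalition grows exponentially with the number of components.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (O(2^n) value calls)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of a coalition of size k that player p then joins.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


# Hypothetical toy request: service A runs first, then B and C run in parallel.
# Each entry is the extra delay (ms) an anomaly adds when it is "switched on".
EXTRA_DELAY_MS = {"A": 30.0, "B": 50.0, "C": 20.0}


def added_latency(active):
    """End-to-end latency increase when only the anomalies in `active` occur."""
    a = EXTRA_DELAY_MS["A"] if "A" in active else 0.0
    b = EXTRA_DELAY_MS["B"] if "B" in active else 0.0
    c = EXTRA_DELAY_MS["C"] if "C" in active else 0.0
    return a + max(b, c)  # sequential stage plus the slower parallel branch


phi = shapley_values(list(EXTRA_DELAY_MS), added_latency)
for service, contribution in phi.items():
    print(f"{service}: {contribution:.1f} ms")
```

In this toy setup the contributions sum to the total added latency, and the parallel branches B and C share the delay of the slower branch rather than each being charged in full. The nested loops visit every subset of the other players, which is the exponential cost a decomposition-based approach is meant to avoid.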

Cited By

  • (2024) Interpretable Failure Localization for Microservice Systems Based on Graph Autoencoder. ACM Transactions on Software Engineering and Methodology. DOI: 10.1145/3695999. Online publication date: 13-Sep-2024
  • (2024) ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 1183-1194. DOI: 10.1145/3691620.3695495. Online publication date: 27-Oct-2024

    Published In

    ASPLOS '23: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4
    March 2023
    430 pages
    ISBN:9798400703942
    DOI:10.1145/3623278

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Conference

    ASPLOS '23

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 241
    • Downloads (last 6 weeks): 22
    Reflects downloads up to 09 Nov 2024
