Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3623278.3624771acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article

ShapleyIQ: Influence Quantification by Shapley Values for Performance Debugging of Microservices

Published: 07 February 2024 Publication History

Abstract

Years of experience in operating large-scale microservice systems strengthens our belief that their individual components, with inevitable anomalies, still demand a quantification of the influences on the end-to-end performance indicators. On a causal graph that represents the complex dependencies between the system components, the scatteredly detected anomalies, even when they look similar, could have different implications with contrastive remedial actions. To this end, we design ShapleyIQ, an online monitoring and diagnosis service that can effectively improve the system stability. It is guided by rigorous analysis on Shapley values for a causal graph. Notably, a new property on splitting invariance addresses the challenging exponential computation complexity problem of generic Shapley values by decomposition.
This service has been deployed on a core infrastructure system on Alibaba Cloud, for over a year with more than 15,000 operations for 86 services across 2,546 machines. Since then, it has drastically improved the DevOps efficiency, and the system failures have been significantly reduced by 83.3%. We also conduct an offline evaluation on an open source microservice system TrainTicket, which is to pinpoint the root causes of the performance issues in hindsight. Extensive experiments and test cases show that our system can achieve 97.3% accuracy in identifying the top-1 root causes for these datasets, which significantly outperforms baseline algorithms by at least 28.7% in absolute difference.

References

[1]
2022. Chaosblade: An Easy to Use and Powerful Chaos Engineering Toolkit. https://github.com/chaosblade-io/chaosblade. Last accessed 2022-08-08.
[2]
2022. Code and Data. https://github.com/lonyle/ShapleyIQ.
[3]
2022. Jaeger. https://www.jaegertracing.io/. Last accessed 2022-08-08.
[4]
2022. Locust. https://locust.io/. Last accessed 2022-08-08.
[5]
Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases, VLDB, Vol. 1215. Citeseer, 487--499.
[6]
The OpenTelemetry Authors. 2022. OpenTelemetry. https://opentelemetry.io/. Last accessed 2022-08-08.
[7]
The Prometheus authors. 2022. Prometheus. https://prometheus.io/. Last accessed 2022-08-08.
[8]
The TrainTicket authors. 2022. TrainTicket. https://github.com/FudanSELab/train-ticket/. Last accessed 2022-08-08.
[9]
Darcy G Benoit. 2005. Automatic diagnosis of performance problems in database management systems. In Second International Conference on Autonomic Computing (ICAC'05). IEEE, 326--327.
[10]
Ranjita Bhagwan, Rahul Kumar, Ramachandran Ramjee, George Varghese, Surjyakanta Mohapatra, Hemanth Manoharan, and Piyush Shah. 2014. Adtributor: Revenue debugging in advertising systems. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14). 43--55.
[11]
Kenneth A Bollen and Judea Pearl. 2013. Eight myths about causality and structural equation models. In Handbook of causal analysis for social research. Springer, 301--328.
[12]
George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier, and Jack Dongarra. 2012. DAGuE: A generic distributed DAG engine for high performance computing. Parallel Comput. 38, 1--2 (2012), 37--51.
[13]
Stephen Boyd and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
[14]
Surajit Chaudhuri and Gerhard Weikum. 2000. Rethinking Database System Architecture: Towards a Self-Tuning RISC-Style Database System. In VLDB. 1--10.
[15]
Pengfei Chen, Yong Qi, and Di Hou. 2019. CauseInfer: Automated End-to-End Performance Diagnosis with Hierarchical Causality Graph in Cloud Environment. IEEE Transactions on Services Computing 12, 2 (2019), 214--230.
[16]
Robert B Cleveland, William S Cleveland, Jean E McRae, and Irma Terpenning. 1990. STL: A seasonal-trend decomposition. J. Off. Stat 6, 1 (1990), 3--73.
[17]
Jean Derks and Stef Tijs. 2000. On merge properties of the Shapley value. International Game Theory Review 2, 04 (2000), 249--257.
[18]
Karl Dias, Mark Ramacher, Uri Shaft, Venkateshwaran Venkataramani, and Graham Wood. 2005. Automatic Performance Diagnosis and Tuning in Oracle. In CIDR. 84--94.
[19]
Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudre-Mauroux. 2013. Oltp-bench: An extensible testbed for benchmarking relational databases. Proceedings of the VLDB Endowment 7, 4 (2013), 277--288.
[20]
Min Du and Feifei Li. 2016. Spell: Streaming parsing of system event logs. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 859--864.
[21]
Christopher Frye, Colin Rowat, and Ilya Feige. 2020. Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability. Advances in Neural Information Processing Systems 33 (2020).
[22]
Yu Gan, Mingyu Liang, Sundar Dev, David Lo, and Christina Delimitrou. 2021. Sage: Practical and Scalable ML-Driven Performance Debugging in Microservices. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Virtual, USA) (ASPLOS '21). Association for Computing Machinery, New York, NY, USA, 135--151.
[23]
Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou. 2019. Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (Providence, RI, USA) (ASPLOS '19). Association for Computing Machinery, New York, NY, USA, 19--33.
[24]
Jayanta K Ghosh. 2011. Causality: Models, reasoning and inference, by Judea pearl.
[25]
John C Harsanyi. 1963. A simplified bargaining model for the n-person cooperative game. International Economic Review 4, 2 (1963), 194--220.
[26]
Xiao He, Ye Li, Jian Tan, Bin Wu, and Feifei Li. 2023. OneShotSTL: One-Shot Seasonal-Trend Decomposition For Online Time Series Anomaly Detection And Forecasting. Proc. VLDB Endow. 16, 6 (apr 2023), 1399--1412.
[27]
Jordan Hochenbaum, Owen S Vallis, and Arun Kejariwal. 2017. Automatic anomaly detection in the cloud via statistical learning. arXiv preprint arXiv:1704.07706 (2017).
[28]
Leslie Lamport. 2019. Time, clocks, and the ordering of events in a distributed system. In Concurrency: the Works of Leslie Lamport. 179--196.
[29]
Zeyan Li, Junjie Chen, Rui Jiao, Nengwen Zhao, Zhijun Wang, Shuwei Zhang, Yanjun Wu, Long Jiang, Leiqin Yan, Zikai Wang, Zhekang Chen, Wenchi Zhang, Xiaohui Nie, Kaixin Sui, and Dan Pei. 2021. Practical Root Cause Localization for Microservice Systems via Trace Analysis. In 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS). 1--10.
[30]
Jinjin Lin, Pengfei Chen, and Zibin Zheng. 2018. Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments. In Service-Oriented Computing, Claus Pahl, Maja Vukovic, Jianwei Yin, and Qi Yu (Eds.). Springer International Publishing, 3--20.
[31]
Qingwei Lin, Jian-Guang Lou, Hongyu Zhang, and Dongmei Zhang. 2016. iDice: Problem Identification for Emerging Issues. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE). 214--224.
[32]
Dewei Liu, Chuan He, Xin Peng, Fan Lin, Chenxi Zhang, Shengfang Gong, Ziang Li, Jiayu Ou, and Zheshun Wu. 2021. MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 338--347.
[33]
Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765--4774. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
[34]
Tomasz P Michalak, Karthik V Aadithya, Piotr L Szczepanski, Balaraman Ravindran, and Nicholas R Jennings. 2013. Efficient computation of the Shapley value for game-theoretic network centrality. Journal of Artificial Intelligence Research 46 (2013), 607--650.
[35]
Dov Monderer and Dov Samet. 2002. Variations on the Shapley value. Handbook of game theory with economic applications 3 (2002), 2055--2076.
[36]
Oracle. 2022. Java ThreadPoolExecutor. https://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ThreadPoolExecutor.html. Last accessed 2022-08-08.
[37]
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. In NIPS 2017 Workshop on Autodiff (Long Beach, California, USA).
[38]
Eric Schulz, Maarten Speekenbrink, and Andreas Krause. 2018. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. Journal of Mathematical Psychology 85 (2018), 1--16.
[39]
Lloyd S Shapley. 2016. 17. A value for n-person games. Princeton University Press.
[40]
Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. 2000. Causation, prediction, and search. MIT press.
[41]
Yongqian Sun, Youjian Zhao, Ya Su, Dapeng Liu, Xiaohui Nie, Yuan Meng, Shiwen Cheng, Dan Pei, Shenglin Zhang, Xianping Qu, and Guo Xuanyao. 2018. Hotspot: Anomaly localization for additive kpis with multi-dimensional attributes. IEEE Access 6 (2018), 10909--10923.
[42]
Robert J Weber. 1988. Probabilistic values for games. The Shapley Value. Essays in Honor of Lloyd S. Shapley (1988), 101--119.
[43]
Gerhard Weikum, Axel Moenkeberg, Christof Hasse, and Peter Zabback. 2002. Self-tuning database technology and information services: from wishful thinking to viable engineering. In VLDB'02: Proceedings of the 28th International Conference on Very Large Databases. Elsevier, 20--31.
[44]
Jianping Weng, Jessie Hui Wang, Jiahai Yang, and Yang Yang. 2018. Root Cause Analysis of Anomalies of Multitier Services in Public Clouds. IEEE/ACM Transactions on Networking 26, 4 (2018), 1646--1659.
[45]
Li Wu, Johan Tordsson, Erik Elmroth, and Odej Kao. 2020. MicroRCA: Root Cause Localization of Performance Issues in Microservices. In NOMS 2020 - 2020 IEEE/IFIP Network Operations and Management Symposium. 1--9.
[46]
Dong Young Yoon, Ning Niu, and Barzan Mozafari. 2016. Dbsherlock: A performance diagnostic tool for transactional databases. In Proceedings of the 2016 International Conference on Management of Data. 1599--1614.
[47]
Guangba Yu, Pengfei Chen, Hongyang Chen, Zijie Guan, Zicheng Huang, Linxiao Jing, Tianjun Weng, Xinmeng Sun, and Xiaoyun Li. 2021. MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments. In Proceedings of the Web Conference 2021 (Ljubljana, Slovenia) (WWW '21). Association for Computing Machinery, New York, NY, USA, 3087--3098.
[48]
Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chenjie Xu, Chao Ji, and Wenyun Zhao. 2018. Benchmarking Microservice Systems for Software Engineering Research. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings (Gothenburg, Sweden) (ICSE '18). Association for Computing Machinery, New York, NY, USA, 323--324.

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ASPLOS '23: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4
March 2023
430 pages
ISBN:9798400703942
DOI:10.1145/3623278
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 February 2024

Permissions

Request permissions for this article.

Check for updates

Badges

Qualifiers

  • Research-article

Conference

ASPLOS '23

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 200
    Total Downloads
  • Downloads (Last 12 months)200
  • Downloads (Last 6 weeks)17
Reflects downloads up to 30 Aug 2024

Other Metrics

Citations

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media