DOI: 10.1145/3623278.3624758
research-article

Sleuth: A Trace-Based Root Cause Analysis System for Large-Scale Microservices with Graph Neural Networks

Published: 07 February 2024

Abstract

Cloud microservices are being scaled up due to the rising demand for new features and the convenience of cloud-native technologies. However, the growing scale of microservices complicates the remote procedure call (RPC) dependency graph, exacerbates the tail-at-scale effect, and makes many of the empirical rules for detecting the root cause of end-to-end performance issues unreliable. Additionally, existing open-source microservice benchmarks are too small to evaluate performance debugging algorithms at production scale, with hundreds or even thousands of services and RPCs.
To address these challenges, we present Sleuth, a trace-based root cause analysis (RCA) system for large-scale microservices using unsupervised graph learning. Sleuth uses a graph neural network to capture the causal impact of each span in a trace, and clusters traces with a trace distance metric to reduce the number of traces required for root cause localization. A pre-trained Sleuth model can be transferred to different microservice applications without any retraining or with few-shot fine-tuning. To quantitatively evaluate the performance and scalability of Sleuth, we propose a method to generate microservice benchmarks comparable to production scale. Experiments on existing benchmark suites and synthetic large-scale microservices indicate that Sleuth significantly outperforms prior work in detection accuracy, performance, and adaptability on a large-scale deployment.
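The idea of clustering traces by a distance metric, so that only a few representative traces need to be analyzed, can be illustrated with a minimal sketch. The abstract does not specify Sleuth's distance metric or clustering algorithm, so everything below is an assumption made for illustration: each trace is reduced to its set of caller/callee edges, the distance is plain Jaccard distance, and the clustering is a simple greedy threshold scheme. The names `jaccard_distance` and `cluster_traces` and the example services are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str]    # (caller, callee); hypothetical trace encoding
Trace = FrozenSet[Edge]   # a trace reduced to its set of RPC edges


def jaccard_distance(a: Trace, b: Trace) -> float:
    """Return 1 - |a & b| / |a | b|; two empty traces have distance 0."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_traces(traces: List[Trace], threshold: float) -> List[List[int]]:
    """Greedy clustering (an illustrative stand-in, not Sleuth's algorithm).

    Each trace joins the first cluster whose representative (its first
    member) lies within `threshold`; otherwise it starts a new cluster.
    Returns clusters as lists of indices into `traces`.
    """
    clusters: List[List[int]] = []
    for i, trace in enumerate(traces):
        for members in clusters:
            if jaccard_distance(traces[members[0]], trace) <= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Three toy traces: the first two share most of their call edges.
traces = [
    frozenset({("frontend", "cart"), ("cart", "db")}),
    frozenset({("frontend", "cart"), ("cart", "db"), ("cart", "cache")}),
    frozenset({("frontend", "search"), ("search", "index")}),
]
clusters = cluster_traces(traces, threshold=0.5)
# Only one representative per cluster would be passed on for analysis.
representatives = [members[0] for members in clusters]
```

In this toy input the first two traces fall into one cluster and the third forms its own, so two traces rather than three would need root cause analysis. A real system would likely use a structure-aware distance and a more robust clustering method than this sketch.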

Published In

ASPLOS '23: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4
March 2023
430 pages
ISBN: 9798400703942
DOI: 10.1145/3623278

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 535 of 2,713 submissions, 20%
