research-article

Sleuth: A Trace-Based Root Cause Analysis System for Large-Scale Microservices with Graph Neural Networks

Authors:

Jiangwei Jiang

ASPLOS '23: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4

Pages 324 - 337

https://doi.org/10.1145/3623278.3624758

Published: 07 February 2024

Abstract

Cloud microservices are being scaled up due to the rising demand for new features and the convenience of cloud-native technologies. However, the growing scale of microservices complicates the remote procedure call (RPC) dependency graph, exacerbates the tail-of-scale effect, and makes many of the empirical rules for detecting the root cause of end-to-end performance issues unreliable. Additionally, existing open-source microservice benchmarks are too small to evaluate performance debugging algorithms at production scale, with hundreds or even thousands of services and RPCs.

To address these challenges, we present Sleuth, a trace-based root cause analysis (RCA) system for large-scale microservices using unsupervised graph learning. Sleuth leverages a graph neural network to capture the causal impact of each span in a trace, and clusters traces using a trace distance metric to reduce the number of traces required for root cause localization. A pre-trained Sleuth model can be transferred to different microservice applications without any retraining or with few-shot fine-tuning. To quantitatively evaluate the performance and scalability of Sleuth, we propose a method to generate microservice benchmarks comparable to production scale. Experiments on existing benchmark suites and synthetic large-scale microservices indicate that Sleuth significantly outperforms prior work in detection accuracy, performance, and adaptability in large-scale deployments.
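The idea of clustering traces by a distance metric so that fewer traces need to be analyzed can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the distance metric or the clustering algorithm, so the sketch substitutes Jaccard distance over span names and a simple greedy leader clustering, and the trace IDs, span names, and threshold are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Hypothetical trace representation: the set of "service:operation" span names.
Trace = FrozenSet[str]


def jaccard_distance(a: Trace, b: Trace) -> float:
    """Return 1 - |a & b| / |a | b| (0.0 for identical sets, 1.0 for disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_traces(traces: Dict[str, Trace], threshold: float) -> List[List[str]]:
    """Greedy leader clustering (a stand-in, not the paper's algorithm).

    A trace joins the first cluster whose representative (its first member)
    lies within `threshold`; otherwise it starts a new cluster.
    """
    clusters: List[List[str]] = []
    for trace_id, spans in traces.items():
        for cluster in clusters:
            if jaccard_distance(spans, traces[cluster[0]]) <= threshold:
                cluster.append(trace_id)
                break
        else:
            clusters.append([trace_id])
    return clusters


def representatives(clusters: List[List[str]]) -> List[str]:
    """Pick one trace per cluster; only these would go on to further analysis."""
    return [cluster[0] for cluster in clusters]


# Hypothetical traces: two similar cart requests and one unrelated order request.
traces = {
    "t1": frozenset({"frontend:GET /cart", "cart:GetCart", "redis:GET"}),
    "t2": frozenset({"frontend:GET /cart", "cart:GetCart", "redis:GET", "auth:Verify"}),
    "t3": frozenset({"frontend:POST /order", "order:Create", "payment:Charge"}),
}
clusters = cluster_traces(traces, threshold=0.3)
reps = representatives(clusters)
```

With these inputs the two cart traces fall into one cluster and the order trace into another, so only two of the three traces would need downstream root cause analysis.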

Published In

ASPLOS '23: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4

March 2023

430 pages

ISBN: 9798400703942

DOI: 10.1145/3623278

Chair: Tor Aamodt

Program Chair: Michael M Swift

Program Co-chair: Natalie Enright Jerger

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ASPLOS '23: 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4

March 25 - 29, 2023

Vancouver, BC, Canada

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%
