research-article

TraceCRL: contrastive representation learning for microservice trace analysis

Authors:

Hong Yang

ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Pages 1221 - 1232

https://doi.org/10.1145/3540250.3549146

Published: 09 November 2022

Abstract

Due to the large volume and high complexity of trace data, microservice trace analysis tasks such as anomaly detection, fault diagnosis, and tail-based sampling widely adopt machine learning techniques. These trace analysis approaches usually use a preprocessing step to map the structured features of traces to vector representations in an ad hoc way. As a result, they may lose important information such as the topological dependencies between service operations. In this paper, we propose TraceCRL, a trace representation learning approach based on contrastive learning and graph neural networks, which can incorporate graph-structured information into downstream trace analysis tasks. Given a trace, TraceCRL constructs an operation invocation graph in which nodes represent service operations and edges represent operation invocations, together with predefined features for invocation status and related metrics. Based on the operation invocation graphs of traces, TraceCRL uses a contrastive learning method to train a graph neural network-based model for trace representation. In particular, TraceCRL employs six trace data augmentation strategies to alleviate the problems of class collision and uniformity of representation in contrastive learning. Our experimental studies show that TraceCRL can significantly improve the performance of trace anomaly detection and offline trace sampling. They also confirm the effectiveness of the trace augmentation strategies and the efficiency of TraceCRL.
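
To make the graph construction step in the abstract more concrete, the following is a minimal sketch of how an operation invocation graph might be derived from the spans of one trace. It is an illustration under stated assumptions, not the TraceCRL implementation: the `Span` record, the `build_invocation_graph` function, the "service/operation" node naming, and the three edge features (call count, mean duration, error rate) are all hypothetical stand-ins for the "predefined features for invocation status and related metrics" that the abstract mentions without listing.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """Hypothetical minimal span record; real tracing systems carry more fields."""
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    operation: str
    duration_ms: float
    is_error: bool


def build_invocation_graph(
    spans: List[Span],
) -> Tuple[List[str], Dict[Tuple[str, str], Dict[str, float]]]:
    """Return (nodes, edges) for one trace.

    nodes: sorted list of "service/operation" names (one node per operation).
    edges: maps (caller_node, callee_node) to aggregated, assumed features.
    """
    by_id = {s.span_id: s for s in spans}
    nodes = set()
    calls_per_edge = defaultdict(list)

    for s in spans:
        callee = f"{s.service}/{s.operation}"
        nodes.add(callee)
        parent = by_id.get(s.parent_id) if s.parent_id is not None else None
        if parent is None:
            continue  # root span or orphan: no incoming invocation edge
        caller = f"{parent.service}/{parent.operation}"
        calls_per_edge[(caller, callee)].append(s)

    edges = {}
    for key, calls in calls_per_edge.items():
        n = len(calls)
        edges[key] = {
            "count": n,
            "mean_duration_ms": sum(c.duration_ms for c in calls) / n,
            "error_rate": sum(c.is_error for c in calls) / n,
        }
    return sorted(nodes), edges


# Toy trace: a gateway calls an order service, which calls inventory twice.
trace = [
    Span("a", None, "gateway", "GET /order", 120.0, False),
    Span("b", "a", "order", "createOrder", 80.0, False),
    Span("c", "b", "inventory", "reserve", 30.0, False),
    Span("d", "b", "inventory", "reserve", 50.0, True),
]
nodes, edges = build_invocation_graph(trace)
```

In this sketch, repeated invocations between the same pair of operations are merged into a single edge with aggregated features, which keeps the graph compact. Whether TraceCRL aggregates in this way is not stated in the abstract. A graph of this shape could then be handed to a graph neural network encoder of the reader's choice.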

Published In

ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

November 2022

1822 pages

ISBN: 9781450394130

DOI: 10.1145/3540250

General Chair:
Abhik Roychoudhury, National University of Singapore, Singapore

Program Chairs:
Cristian Cadar, Imperial College London, UK
Miryung Kim, University of California at Los Angeles, USA

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ESEC/FSE '22

ESEC/FSE '22: 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

November 14 - 18, 2022

Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%
