research-article

STEAM: Observability-Preserving Trace Sampling

Authors:

Saravan Rajmohan,

Dongmei Zhang

ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Pages 1750 - 1761

https://doi.org/10.1145/3611643.3613881

Published: 30 November 2023

Abstract

In distributed systems and microservice applications, tracing is a crucial observability signal employed for comprehending their internal states. To mitigate the overhead associated with distributed tracing, most tracing frameworks utilize a uniform sampling strategy, which retains only a subset of traces. However, this approach is insufficient for preserving system observability. This is primarily attributed to the long-tail distribution of traces in practice, which results in the omission or rarity of minority yet critical traces after sampling. In this study, we introduce an observability-preserving trace sampling method, denoted as STEAM, which aims to retain as much information as possible in the sampled traces. We employ Graph Neural Networks (GNN) for trace representation, while incorporating domain knowledge of trace comparison through logical clauses. Subsequently, we employ a scalable approach to sample traces, emphasizing mutually dissimilar traces. STEAM has been implemented on top of OpenTelemetry, comprising approximately 1.6K lines of Golang code and 2K lines of Python code. Evaluation on four benchmark microservice applications and a production system demonstrates the superior performance of our approach compared to baseline methods. Furthermore, STEAM is capable of processing 15,000 traces in approximately 4 seconds.
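The contrast the abstract draws between uniform sampling and sampling that favors mutually dissimilar traces can be illustrated with a small sketch. This is not the STEAM implementation: the learned GNN representation is replaced by hypothetical 2-D vectors, and the sampling step is a plain greedy max-min (farthest-point) heuristic chosen only for illustration. The labels, cluster centers, and counts are all assumptions.

```python
import math
import random


def uniform_sample(traces, k, rng):
    """Baseline: keep k traces chosen uniformly at random."""
    return rng.sample(traces, k)


def diverse_sample(traces, k):
    """Greedy max-min selection (illustrative stand-in, not STEAM itself):
    repeatedly keep the trace farthest from everything kept so far."""
    kept = [0]  # start from the first trace (arbitrary choice)
    # min_dist[i] = distance from trace i to its nearest kept trace
    min_dist = [math.dist(traces[0][1], vec) for _, vec in traces]
    while len(kept) < min(k, len(traces)):
        nxt = max(range(len(traces)), key=min_dist.__getitem__)
        kept.append(nxt)
        for i, (_, vec) in enumerate(traces):
            min_dist[i] = min(min_dist[i], math.dist(traces[nxt][1], vec))
    return [traces[i] for i in kept]


def make_long_tail(rng):
    """Hypothetical data: (label, 2-D vector) pairs, 990 common and 10 rare."""
    centers = {"common": (0.0, 0.0), "rare-error": (5.0, 5.0), "rare-slow": (-5.0, 5.0)}
    counts = {"common": 990, "rare-error": 5, "rare-slow": 5}
    data = []
    for label, n in counts.items():
        cx, cy = centers[label]
        data += [(label, (cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)))
                 for _ in range(n)]
    return data


rng = random.Random(0)
traces = make_long_tail(rng)
print("uniform:", sorted({label for label, _ in uniform_sample(traces, 10, rng)}))
print("diverse:", sorted({label for label, _ in diverse_sample(traces, 10)}))
```

With this synthetic long-tail mix, a uniform draw of 10 out of 1,000 traces misses any given 5-trace group about 95% of the time, whereas the max-min selection reaches both rare groups within its first three picks because they lie far from the dominant cluster. The quadratic-cost loop here is only meant to show the idea; the abstract's scalability claims refer to the authors' own method.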

Cited By

Poghosyan A, Harutyunyan A, Davtyan E, Petrosyan K, Baloian N. (2024). The Diagnosis-Effective Sampling of Application Traces. Applied Sciences 14:13 (5779). Online publication date: 2-Jul-2024.
https://doi.org/10.3390/app14135779

Chen Z, Jiang Z, Su Y, Lyu M, Zheng Z. (2024). Tracemesh: Scalable and Streaming Sampling for Distributed Traces. 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), 54-65. Online publication date: 7-Jul-2024.
https://doi.org/10.1109/CLOUD62652.2024.00016

Index Terms

STEAM: Observability-Preserving Trace Sampling
1. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Maintaining software
  2. Software notations and tools
    1. Software maintenance tools

Published In

ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

November 2023

2215 pages

ISBN:9798400703270

DOI:10.1145/3611643

General Chair: Satish Chandra (Google, USA)

Program Chairs: Kelly Blincoe (University of Auckland, New Zealand), Paolo Tonella (USI Lugano, Switzerland)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ESEC/FSE '23: 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

December 3 - 9, 2023

San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Bibliometrics

Total Citations: 1
Total Downloads: 334
Downloads (last 12 months): 334
Downloads (last 6 weeks): 28

Reflects downloads up to 04 Oct 2024
