DOI: 10.1145/3493649.3493656
short-paper

Tritium: A Cross-layer Analytics System for Enhancing Microservice Rollouts in the Cloud

Published: 06 December 2021

Abstract

Microservice architectures are widely used in cloud-native applications because their modularity allows components to be developed and deployed independently. With many complex interactions occurring between components, it is difficult to determine the effects of a particular microservice rollout. Site Reliability Engineers must be able to determine with confidence whether a new rollout is at fault for a concurrent or subsequent performance problem in the system so that they can quickly mitigate the issue. We present Tritium, a cross-layer analytics system that synthesizes several types of data to suggest possible causes for Service Level Objective (SLO) violations in microservice applications. It uses event data to identify new version rollouts, tracing data to build a topology graph for the cluster and determine which services are potentially affected by the rollout, and causal impact analysis applied to metric time series to determine whether the rollout is at fault. Tritium rests on the principle that if a rollout is not responsible for a change in an upstream or neighboring SLO metric, then the rollout's telemetry data will do a poor job of predicting the behavior of that SLO metric. In this paper, we experimentally demonstrate that Tritium can accurately attribute SLO violations to downstream rollouts, and we outline the steps necessary to fully realize Tritium.
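The two analysis steps described in the abstract can be illustrated with a minimal sketch. It is not Tritium's implementation. The function names, the `(caller, callee)` span format, and the service names are assumptions made for illustration. The squared correlation used here is a deliberately simple stand-in for the causal impact analysis the abstract mentions, chosen only to show the "poor predictor means not at fault" principle.

```python
from collections import defaultdict, deque


def build_callers(spans):
    """Map each callee to the set of services that call it.

    `spans` is an iterable of (caller, callee) pairs; this format is a
    hypothetical simplification of real tracing data.
    """
    callers = defaultdict(set)
    for caller, callee in spans:
        if caller != callee:
            callers[callee].add(caller)
    return callers


def upstream_of(callers, rolled_out):
    """Breadth-first walk from the rolled-out service towards its callers.

    Returns every service that directly or transitively calls `rolled_out`,
    i.e. the services whose SLO metrics the rollout could affect.
    """
    seen, queue = set(), deque([rolled_out])
    while queue:
        service = queue.popleft()
        for parent in callers.get(service, ()):
            if parent not in seen and parent != rolled_out:
                seen.add(parent)
                queue.append(parent)
    return seen


def r_squared(telemetry, slo_metric):
    """Squared Pearson correlation between two equally long series.

    A stand-in score (not Bayesian structural time series): a value near 0
    means the rollout's telemetry predicts the SLO metric poorly.
    """
    n = len(telemetry)
    mean_x = sum(telemetry) / n
    mean_y = sum(slo_metric) / n
    sxx = sum((x - mean_x) ** 2 for x in telemetry)
    syy = sum((y - mean_y) ** 2 for y in slo_metric)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(telemetry, slo_metric))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no predictive signal
    return (sxy * sxy) / (sxx * syy)


# Illustrative call graph: productpage calls details and reviews,
# and reviews calls ratings.
spans = [
    ("productpage", "details"),
    ("productpage", "reviews"),
    ("reviews", "ratings"),
]
affected = upstream_of(build_callers(spans), "ratings")
print(sorted(affected))
```

In this sketch, a rollout of `ratings` marks `reviews` and `productpage` as candidates, while `details` is excluded because it never calls `ratings`. Each candidate's SLO metric would then be scored against the rollout's telemetry, with a low score suggesting the rollout is not at fault.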

Information

Published In

WoC '21: Proceedings of the Seventh International Workshop on Container Technologies and Container Clouds
December 2021
37 pages
ISBN: 9781450391719
DOI: 10.1145/3493649
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

In-Cooperation

  • IFIP

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Fault diagnosis
  2. container systems
  3. microservices
  4. version rollouts

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

Middleware '21