Tritium: A Cross-layer Analytics System for Enhancing Microservice Rollouts in the Cloud

Authors:

Srinivasan Parthasarathy,

Fabio Oliveira,

Ayse K. Coskun

WoC '21: Proceedings of the Seventh International Workshop on Container Technologies and Container Clouds

Pages 19 - 24

https://doi.org/10.1145/3493649.3493656

Published: 06 December 2021

Abstract

Microservice architectures are widely used in cloud-native applications because their modularity allows components to be developed and deployed independently. Because components interact in many complex ways, it is difficult to determine the effects of a particular microservice rollout. Site Reliability Engineers must be able to determine with confidence whether a new rollout is at fault for a concurrent or subsequent performance problem in the system so that they can quickly mitigate the issue. We present Tritium, a cross-layer analytics system that synthesizes several types of data to suggest possible causes for Service Level Objective (SLO) violations in microservice applications. It uses event data to identify new version rollouts, tracing data to build a topology graph for the cluster and determine which services are potentially affected by the rollout, and causal impact analysis applied to metric time series to determine whether the rollout is at fault. Tritium rests on the principle that if a rollout is not responsible for a change in an upstream or neighboring SLO metric, then the rollout's telemetry data will predict the behavior of that SLO metric poorly. In this paper, we experimentally demonstrate that Tritium can accurately attribute SLO violations to downstream rollouts, and we outline the steps necessary to fully realize Tritium.
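The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the service names, and the `(caller, callee)` span format are all hypothetical, and a plain least-squares R² stands in for the causal impact analysis the abstract names. The sketch only shows the two ideas in miniature: walking a trace-derived call graph to find services upstream of a rollout, and scoring how well the rollout's telemetry predicts an SLO metric.

```python
from collections import defaultdict, deque


def build_call_graph(spans):
    """Map each callee service to the set of services that call it.

    `spans` is an iterable of (caller, callee) pairs, a hypothetical
    simplification of what a tracing backend would report.
    """
    callers = defaultdict(set)
    for caller, callee in spans:
        if caller != callee:
            callers[callee].add(caller)
    return callers


def upstream_services(callers, rolled_out):
    """Return every service that directly or transitively calls `rolled_out`.

    These are the services whose SLO metrics a rollout could plausibly affect.
    """
    seen = set()
    queue = deque([rolled_out])
    while queue:
        service = queue.popleft()
        for caller in callers.get(service, ()):
            if caller not in seen and caller != rolled_out:
                seen.add(caller)
                queue.append(caller)
    return seen


def r_squared(rollout_metric, slo_metric):
    """R^2 of a one-variable least-squares fit of the SLO metric on the
    rollout's telemetry. A low value suggests the rollout explains little
    of the SLO metric's behavior. This is an assumed stand-in, not the
    causal impact analysis used by Tritium.
    """
    n = len(rollout_metric)
    mean_x = sum(rollout_metric) / n
    mean_y = sum(slo_metric) / n
    sxx = sum((x - mean_x) ** 2 for x in rollout_metric)
    syy = sum((y - mean_y) ** 2 for y in slo_metric)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no predictive signal
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(rollout_metric, slo_metric))
    return (sxy * sxy) / (sxx * syy)
```

A caller would first compute `upstream_services(build_call_graph(spans), "ratings")` for a rolled-out service such as a hypothetical `ratings`, then evaluate `r_squared` between the rollout's metric series and each upstream SLO metric, treating low scores as evidence against the rollout being at fault.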

Index Terms

1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        1. Software testing and debugging

Published In

WoC '21: Proceedings of the Seventh International Workshop on Container Technologies and Container Clouds

December 2021

37 pages

ISBN:9781450391719

DOI:10.1145/3493649

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Sponsors

ACM: Association for Computing Machinery

In-Cooperation

IFIP

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

Middleware '21

Sponsor:

ACM

Middleware '21: 22nd International Middleware Conference

December 6, 2021

Virtual Event, Canada
