research-article

Enjoy your observability: an industrial survey of microservice tracing and analysis

Published: 01 January 2022

Abstract

Microservice systems are often deployed in complex cloud-based environments and may involve a large number of service instances that are dynamically created and destroyed. It is thus essential to ensure observability in order to understand these systems’ behaviors and troubleshoot their problems. Distributed tracing and analysis is an important means of achieving observability, but it is known to be challenging. While many companies have started implementing distributed tracing and analysis for microservice systems, it is not clear whether existing approaches provide the required observability. In this article, we present our industrial survey on microservice tracing and analysis, based on interviews with developers and operations engineers of microservice systems from ten companies. Our survey results offer a number of findings. For example, large microservice systems commonly adopt a tracing and analysis pipeline, and the implementations of the pipeline in different companies reflect different tradeoffs among a variety of concerns. Visualization and statistics-based metrics are the most common means of trace analysis, while more advanced analysis techniques such as machine learning and data mining are seldom used. Microservice tracing and analysis is a new big data problem for software engineering, and its practice brings new challenges and opportunities.

Published In

Empirical Software Engineering  Volume 27, Issue 1
Jan 2022
985 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 January 2022
Accepted: 11 August 2021

Author Tags

  1. Microservice
  2. Logging
  3. Tracing
  4. Industrial survey

Qualifiers

  • Research-article

Cited By

  • (2024) TENSAI - Practical and Responsible Observability for Data Quality-aware Large-scale Analytics. Journal of Data and Information Quality 16:4 (1-43). DOI: 10.1145/3708014. Online publication date: 10-Dec-2024
  • (2024) Logging design patterns for cloud-native applications. Proceedings of the 29th European Conference on Pattern Languages of Programs, People, and Practices (1-11). DOI: 10.1145/3698322.3698351. Online publication date: 3-Jul-2024
  • (2024) Demystifying the Fight Against Complexity: A Comprehensive Study of Live Debugging Activities in Production Cloud Systems. Proceedings of the 2024 ACM Symposium on Cloud Computing (341-360). DOI: 10.1145/3698038.3698568. Online publication date: 20-Nov-2024
  • (2024) Using Static Analysis to Aid Monolith to Microservice System Transformation: Tuning Fuzzy c-Means in a VAE-Based GNN Approach. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops (43-53). DOI: 10.1145/3691621.3694933. Online publication date: 27-Oct-2024
  • (2024) The HitchHiker's Guide to High-Assurance System Observability Protection with Efficient Permission Switches. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security (3898-3912). DOI: 10.1145/3658644.3690188. Online publication date: 2-Dec-2024
  • (2024) Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (6049-6060). DOI: 10.1145/3637528.3671530. Online publication date: 25-Aug-2024
  • (2024) Grammar-Based Anomaly Detection of Microservice Systems Execution Traces. Companion of the 15th ACM/SPEC International Conference on Performance Engineering (77-81). DOI: 10.1145/3629527.3651844. Online publication date: 7-May-2024
  • (2024) Process-Aware Intrusion Detection in MQTT Networks. Proceedings of the Fourteenth ACM Conference on Data and Application Security and Privacy (91-102). DOI: 10.1145/3626232.3653271. Online publication date: 19-Jun-2024
  • (2024) VAMP: Visual Analytics for Microservices Performance. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing (1209-1218). DOI: 10.1145/3605098.3636069. Online publication date: 8-Apr-2024
  • (2024) Trace-based Multi-Dimensional Root Cause Localization of Performance Issues in Microservice Systems. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-12). DOI: 10.1145/3597503.3639088. Online publication date: 20-May-2024
