research-article
Open access

Distributed Latency Profiling through Critical Path Tracing: CPT can provide actionable and precise latency analysis.

Published: 29 March 2022

Abstract

Low latency is an important feature for many Google applications such as Search, and latency-analysis tools play a critical role in sustaining low latency at scale. For complex distributed systems that include services that constantly evolve in functionality and data, keeping overall latency to a minimum is a challenging task. In large, real-world distributed systems, existing tools such as RPC telemetry, CPU profiling, and distributed tracing are valuable to understand the subcomponents of the overall system, but are insufficient to perform end-to-end latency analyses in practice. Scalable and accurate fine-grain tracing has made Critical Path Tracing the standard approach for distributed latency analysis for many Google applications, including Google Search.

            Published In

            Queue  Volume 20, Issue 1
            Persistence
            January/February 2022
            101 pages
            ISSN:1542-7730
            EISSN:1542-7749
            DOI:10.1145/3527159
            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Qualifiers

            • Research-article
            • Popular
            • Editor picked
