An automated, cross-layer instrumentation framework for diagnosing performance problems in distributed applications

Published: 20 November 2019

Abstract

Diagnosing performance problems in distributed applications is extremely challenging. A significant reason is that it is hard to know where to place instrumentation a priori to help diagnose problems that may occur in the future. We present the vision of an automated instrumentation framework, Pythia, that runs alongside deployed distributed applications. In response to a newly observed performance problem, Pythia searches the space of possible instrumentation choices to enable the instrumentation needed to help diagnose it. Our vision for Pythia builds on workflow-centric tracing, which records the order and timing of how requests are processed within and among a distributed application's nodes (i.e., records their workflows). It uses the key insight that localizing the sources of high performance variation within the workflows of requests that are expected to perform similarly gives insight into where additional instrumentation is needed.
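The key insight in the abstract can be illustrated with a small sketch. This is not Pythia's algorithm, which the abstract does not specify. It is a minimal illustration that assumes a hypothetical trace format (a request type plus an ordered list of tracepoint names and timestamps), treats requests of the same type as "expected to perform similarly", and uses plain population variance as the measure of performance variation. All function and tracepoint names are invented for the example.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical trace format (an assumption, not from the paper):
# (request_type, [(tracepoint_name, timestamp_ms), ...]) in execution order.


def edge_latencies(events):
    """Yield ((src, dst), latency) for consecutive tracepoints in one workflow."""
    for (src, t0), (dst, t1) in zip(events, events[1:]):
        yield (src, dst), t1 - t0


def rank_edges_by_variance(traces):
    """Group workflows by request type and rank edges by latency variance.

    Edges with the highest variance among similar requests are candidates
    for where additional instrumentation might be enabled.
    """
    samples = defaultdict(list)  # (request_type, edge) -> [latency, ...]
    for request_type, events in traces:
        for edge, latency in edge_latencies(events):
            samples[(request_type, edge)].append(latency)
    ranked = [
        (pvariance(latencies), request_type, edge)
        for (request_type, edge), latencies in samples.items()
        if len(latencies) >= 2  # variance of a single sample is uninformative
    ]
    return sorted(ranked, reverse=True)


# Three "read" requests that should behave alike; one is slow after the lookup.
traces = [
    ("read", [("recv", 0), ("cache_lookup", 2), ("reply", 5)]),
    ("read", [("recv", 0), ("cache_lookup", 2), ("reply", 45)]),
    ("read", [("recv", 0), ("cache_lookup", 3), ("reply", 7)]),
]
top_variance, top_type, top_edge = rank_edges_by_variance(traces)[0]
print(top_type, top_edge)
```

In this toy input the edge from `cache_lookup` to `reply` ranks first, which suggests that the code between those two tracepoints is where finer-grained instrumentation would be most informative.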

Published In

SoCC '19: Proceedings of the ACM Symposium on Cloud Computing
November 2019
503 pages
ISBN:9781450369732
DOI:10.1145/3357223

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-layer
  2. distributed systems
  3. performance
  4. tracing

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

SoCC '19: ACM Symposium on Cloud Computing
November 20-23, 2019
Santa Cruz, CA, USA

Acceptance Rates

SoCC '19 Paper Acceptance Rate 39 of 157 submissions, 25%;
Overall Acceptance Rate 169 of 722 submissions, 23%

Cited By

  • (2024) The Tale of Errors in Microservices. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8:3 (1-36). DOI: 10.1145/3700436. Online publication date: 10-Dec-2024
  • (2024) Toward Adaptive Tracing: Efficient System Behavior Analysis using Language Models. Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results (62-66). DOI: 10.1145/3639476.3639778. Online publication date: 14-Apr-2024
  • (2024) Analyzing Performance Variability in Alibaba's Microservice Architecture: A Critical-Path-Based Perspective. Companion of the 15th ACM/SPEC International Conference on Performance Engineering (82-86). DOI: 10.1145/3629527.3651845. Online publication date: 7-May-2024
  • (2024) Towards Efficient Diagnosis of Performance Bottlenecks in Microservice-Based Applications (Work In Progress paper). Companion of the 15th ACM/SPEC International Conference on Performance Engineering (40-46). DOI: 10.1145/3629527.3651432. Online publication date: 7-May-2024
  • (2024) Unleashing Performance Insights with Online Probabilistic Tracing. 2024 IEEE International Conference on Cloud Engineering (IC2E) (72-82). DOI: 10.1109/IC2E61754.2024.00015. Online publication date: 24-Sep-2024
  • (2024) Enabling Programmable Metric Flows. 2024 IEEE 17th International Conference on Cloud Computing (CLOUD) (386-398). DOI: 10.1109/CLOUD62652.2024.00050. Online publication date: 7-Jul-2024
  • (2023) LatenSeer. Proceedings of the 2023 ACM Symposium on Cloud Computing (502-519). DOI: 10.1145/3620678.3624787. Online publication date: 30-Oct-2023
  • (2023) Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code. Proceedings of the ACM SIGCOMM 2023 Conference (420-437). DOI: 10.1145/3603269.3604823. Online publication date: 10-Sep-2023
  • (2023) Multi-level Adaptive Execution Tracing for Efficient Performance Analysis. 2023 IEEE/ACIS 21st International Conference on Software Engineering Research, Management and Applications (SERA) (104-109). DOI: 10.1109/SERA57763.2023.10197790. Online publication date: 23-May-2023
  • (2023) CONAN: Diagnosing Batch Failures for Cloud Systems. 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (138-149). DOI: 10.1109/ICSE-SEIP58684.2023.00018. Online publication date: May-2023