DOI: 10.5555/3026877.3026924
Article

Non-intrusive performance profiling for entire software stacks based on the flow reconstruction principle

Published: 02 November 2016

Abstract

Understanding the performance behavior of distributed server stacks at scale is non-trivial. Servicing just a single request can trigger numerous sub-requests across heterogeneous software components, and many similar requests are serviced concurrently and in parallel. When a user experiences poor performance, it is extremely difficult to identify the root cause, as well as the software components and machines that are the culprits.
This paper describes Stitch, a non-intrusive tool capable of profiling the performance of an entire distributed software stack using solely the unstructured logs output by heterogeneous software components. Stitch is substantially different from all prior related tools in that it is capable of constructing a system model of an entire software stack without any domain knowledge being built into it. Instead, it automatically reconstructs the extensive domain knowledge of the programmers who wrote the code; it does this by relying on the Flow Reconstruction Principle, which states that programmers log events such that one can reliably reconstruct the execution flow a posteriori.
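
The intuition behind the Flow Reconstruction Principle can be illustrated with a small sketch. This is not Stitch's algorithm, which the abstract does not spell out; it is a minimal illustration under stated assumptions. The log lines, the timestamp format, and the identifier pattern (`req_17`, `blk_3`) are all hypothetical. The sketch shows only that when log messages carry identifiers, grouping messages by identifier value and ordering them by time recovers a per-request event sequence and its latency.

```python
import re
from collections import defaultdict

# Hypothetical, invented log lines from two components (assumption, not real output).
LOG_LINES = [
    "12:00:01.100 frontend: received request req_17",
    "12:00:01.250 storage: req_42 reading block blk_9",
    "12:00:01.300 storage: req_17 reading block blk_3",
    "12:00:01.900 storage: req_17 finished block blk_3",
    "12:00:02.000 frontend: request req_17 done",
]

# Assumed line layout: "HH:MM:SS.mmm component: message".
LINE_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3}) (\w+): (.*)$")
# Assumed identifier shape: lowercase word, underscore, digits.
ID_RE = re.compile(r"\b[a-z]+_\d+\b")


def parse(line):
    """Split a line into (milliseconds since midnight, component, message)."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    h, mi, s, ms = (int(g) for g in m.groups()[:4])
    millis = ((h * 60 + mi) * 60 + s) * 1000 + ms
    return millis, m.group(5), m.group(6)


def group_by_identifier(lines):
    """Map each identifier value to its time-ordered list of events."""
    flows = defaultdict(list)
    for line in lines:
        millis, component, message = parse(line)
        for ident in ID_RE.findall(message):
            flows[ident].append((millis, component, message))
    for events in flows.values():
        events.sort(key=lambda event: event[0])
    return dict(flows)


def duration_ms(events):
    """Elapsed time between the first and last event of one flow."""
    return events[-1][0] - events[0][0]


flows = group_by_identifier(LOG_LINES)
for ident, events in sorted(flows.items()):
    path = " -> ".join(component for _, component, _ in events)
    print(f"{ident}: {duration_ms(events)} ms via {path}")
```

Grouping purely on identifier values is what keeps the sketch free of component-specific knowledge: nothing in it knows what a request or a block is, only that the same token recurs across messages from different components.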

Published In

OSDI'16: Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation
November 2016
786 pages
ISBN: 9781931971331

Sponsors

  • VMware
  • NetApp
  • Google Inc.
  • Microsoft
  • Facebook

Publisher

USENIX Association

United States

Qualifiers

  • Article
