research-article
Open access

Pivot tracing: dynamic causal monitoring for distributed systems

Published: 24 February 2020

Abstract

Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today (logs, counters, and metrics) have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This paper presents Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with a novel relational operator: the happened-before join. Pivot Tracing gives users, at runtime, the ability to define arbitrary metrics at one point of the system, while being able to select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries. Pivot Tracing does not correlate cross-component events using expensive global aggregations, nor does it perform offline analysis. Instead, Pivot Tracing directly correlates events as they happen by piggybacking metadata alongside requests as they execute. This gives Pivot Tracing low runtime overhead: less than 1% for many cross-component monitoring queries.
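
The idea of correlating events by piggybacking metadata on a request can be illustrated with a minimal sketch. This is not Pivot Tracing's API: the names `Request`, `client_send`, `datanode_read`, and `run_example` are hypothetical, and the whole "system" runs in one process. It assumes an upstream event records a value in the request's metadata, and a downstream event, which happens after it on the same request, groups its own measurement by that value.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Request:
    """A request carrying piggybacked metadata across components (hypothetical)."""
    baggage: dict = field(default_factory=dict)


def client_send(request: Request, client_name: str) -> None:
    # Upstream event: pack the value we later want to group by.
    request.baggage["client"] = client_name


def datanode_read(request: Request, nbytes: int, totals: dict) -> None:
    # Downstream event: it happened after client_send on the same request,
    # so the upstream value travels with the request itself.
    if "client" not in request.baggage:
        return  # no preceding upstream event, so the join yields nothing
    totals[request.baggage["client"]] += nbytes


def run_example() -> dict:
    # Bytes read downstream, grouped by the client seen upstream.
    totals = defaultdict(int)
    for client_name, nbytes in [("A", 100), ("B", 50), ("A", 25)]:
        req = Request()
        client_send(req, client_name)
        datanode_read(req, nbytes, totals)
    return dict(totals)
```

Because the grouping key arrives with the request, the downstream component can aggregate locally as events occur, which is the property the abstract contrasts with global aggregation and offline analysis.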

Cited By

  • (2024) Intent-Driven Multi-Engine Observability Dataflows for Heterogeneous Geo-Distributed Clouds. 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), pp. 30-41. DOI: 10.1109/CLOUD62652.2024.00014. Online publication date: 7-Jul-2024
  • (2023) Multi-Layer Observability for Fault Localization in Microservices Based Systems. 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 733-737. DOI: 10.1109/SANER56733.2023.00081. Online publication date: Mar-2023
  • (2020) Efficient reordering and replay of execution traces of distributed reactive systems in the context of model-driven development. Proceedings of the 23rd ACM/IEEE International Conference on Model Driven Engineering Languages and Systems, pp. 285-296. DOI: 10.1145/3365438.3410939. Online publication date: 16-Oct-2020

Published In

Communications of the ACM  Volume 63, Issue 3
March 2020
98 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/3385399
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed
