Pivot tracing: dynamic causal monitoring for distributed systems

Authors:

Rodrigo Fonseca

Communications of the ACM, Volume 63, Issue 3

Pages 94 - 102

https://doi.org/10.1145/3378933

Published: 24 February 2020

Abstract

Monitoring and troubleshooting distributed systems are notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today (logs, counters, and metrics) have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This paper presents Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with a novel relational operator: the happened-before join. Pivot Tracing gives users, at runtime, the ability to define arbitrary metrics at one point of the system, while being able to select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries. Pivot Tracing does not correlate cross-component events using expensive global aggregations, nor does it perform offline analysis. Instead, Pivot Tracing directly correlates events as they happen by piggybacking metadata alongside requests as they execute. This gives Pivot Tracing low runtime overhead: less than 1% for many cross-component monitoring queries.
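
The idea of correlating events by piggybacking metadata on requests can be illustrated with a minimal Python sketch. It is not Pivot Tracing's implementation or API: the names `Request`, `baggage`, `client_send`, `storage_read`, and `bytes_by_client` are hypothetical, and a single process stands in for what would be separate components or machines. An upstream event writes a value into the request's metadata, and a downstream event groups its own measurement by that value.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Request:
    """A request whose metadata travels with it across components (hypothetical)."""
    payload: str
    baggage: dict = field(default_factory=dict)  # piggybacked metadata


# Local aggregation state: bytes read, grouped by client process name.
bytes_by_client = defaultdict(int)


def client_send(proc_name: str, payload: str) -> Request:
    """Upstream event: record the client's process name in the request metadata."""
    request = Request(payload)
    request.baggage["proc_name"] = proc_name
    return request


def storage_read(request: Request, num_bytes: int) -> None:
    """Downstream event: join the local measurement with upstream metadata."""
    # Only events that causally preceded this one can have written to the
    # metadata, so the lookup behaves like a happened-before join.
    proc_name = request.baggage.get("proc_name")
    if proc_name is not None:
        bytes_by_client[proc_name] += num_bytes


def run_demo() -> dict:
    bytes_by_client.clear()
    storage_read(client_send("scanner", "read block A"), 4096)
    storage_read(client_send("scanner", "read block B"), 1024)
    storage_read(client_send("sorter", "read block C"), 2048)
    # A request with no upstream event contributes nothing to the join.
    storage_read(Request("internal read"), 512)
    return dict(bytes_by_client)


if __name__ == "__main__":
    print(run_demo())
```

Because the grouping key arrives with the request itself, the downstream component can aggregate locally as events happen, with no global aggregation or offline analysis step.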

Cited By

Chakraborty A, Eswaran A, Thorat P, Verma M, Gupta P, Jayachandran P. (2024). Intent-Driven Multi-Engine Observability Dataflows for Heterogeneous Geo-Distributed Clouds. 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), pp. 30-41. Online publication date: 7-Jul-2024.
https://doi.org/10.1109/CLOUD62652.2024.00014
Rangaiyengar R, Komondoor R, Medicherla R. (2023). Multi-Layer Observability for Fault Localization in Microservices Based Systems. 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 733-737. Online publication date: Mar-2023.
https://doi.org/10.1109/SANER56733.2023.00081
Babaei M, Bagherzadeh M, Dingel J, Syriani E, Sahraoui H. (2020). Efficient reordering and replay of execution traces of distributed reactive systems in the context of model-driven development. Proceedings of the 23rd ACM/IEEE International Conference on Model Driven Engineering Languages and Systems, pp. 285-296. Online publication date: 16-Oct-2020.
https://dl.acm.org/doi/10.1145/3365438.3410939

Published In

Communications of the ACM Volume 63, Issue 3

March 2020

98 pages

ISSN:0001-0782

EISSN:1557-7317

DOI:10.1145/3385399

Editor:
Andrew A. Chien
Association for Computing Machinery, New York, NY

Copyright © 2020 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed

Funding Sources

NSF
