research-article

Log Discovery for Troubleshooting Open Distributed Systems with TLQ

Published: 26 July 2020

Abstract

Troubleshooting a distributed system can be incredibly difficult. It is rarely feasible to expect a user to know the fine-grained interactions between their system and the environment configuration of each machine used in the system. Because of this, work can grind to a halt when a seemingly trivial detail changes. To address this, there is a plethora of state-of-the-art log analysis tools, debuggers, and visualization suites. However, a user may be executing in an open distributed system where the placement of their components is not known before runtime. This makes tracking debug logs almost as difficult as troubleshooting the failures those logs record, because the location of the logs is usually not transparent to the user (and, by association, to the troubleshooting tools they are using). We present TLQ, a framework designed from first principles for log discovery to enable troubleshooting of open distributed systems. TLQ consists of a querying client and a set of servers which track relevant debug logs spread across an open distributed system. Through a series of examples, we demonstrate how TLQ enables users to discover the locations of their system’s debug logs and in turn use well-defined troubleshooting tools upon those logs in a distributed fashion. Both of these tasks were previously impractical to ask of an open distributed system without significant a priori knowledge. We also concretely verify TLQ’s effectiveness by way of a production system: a biodiversity scientific workflow. We note the potential storage and performance overheads of TLQ compared to a centralized, closed system approach.
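
The client/server split described in the abstract can be illustrated with a small in-memory sketch. This is not TLQ's actual interface: the class names `LogServer` and `QueryClient`, the `register`, `find`, and `discover` methods, the prefix-matching rule, and all host names and paths are hypothetical, chosen only to show the idea of per-machine servers that track log locations and a client that queries them.

```python
from dataclasses import dataclass, field


@dataclass
class LogServer:
    """Hypothetical per-machine server that tracks debug logs written locally."""
    host: str
    logs: dict = field(default_factory=dict)  # component name -> log path

    def register(self, component: str, path: str) -> None:
        # A component reports where it writes its debug log on this machine.
        self.logs[component] = path

    def find(self, prefix: str) -> list:
        # Return (host, component, path) for components whose name starts with prefix.
        return [(self.host, c, p) for c, p in sorted(self.logs.items())
                if c.startswith(prefix)]


class QueryClient:
    """Hypothetical client that asks every known server for matching logs."""

    def __init__(self, servers):
        self.servers = list(servers)

    def discover(self, prefix: str) -> list:
        # Collect matches from each server in the order the servers were given.
        found = []
        for server in self.servers:
            found.extend(server.find(prefix))
        return found


def demo() -> list:
    # Placement is decided at runtime, so the user does not know these hosts up front.
    a, b = LogServer("node-a"), LogServer("node-b")
    a.register("workflow.task1", "/tmp/task1.debug")
    b.register("workflow.task2", "/scratch/task2.debug")
    b.register("other.job", "/scratch/other.debug")
    return QueryClient([a, b]).discover("workflow.")
```

Calling `demo()` returns the host and path of each log belonging to the `workflow.` components, which is the information a user would then hand to an existing troubleshooting tool. A real deployment would replace the direct method calls with network requests, which this sketch deliberately leaves out.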

Cited By

  • (2023) An Anomaly Detection Method Based on Adaptive Log and Dual Feature Fusion Analysis for Distributed Systems. 2023 11th International Conference on Information Systems and Computing Technology (ISCTech), 111-114. DOI: 10.1109/ISCTech60480.2023.00027. Online publication date: 30-Jul-2023
  • (2022) Threshold based Technique to Detect Anomalies using Log Files. 2022 7th International Conference on Machine Learning Technologies (ICMLT), 191-198. DOI: 10.1145/3529399.3529430. Online publication date: 10-Jun-2022
  • (2022) Challenges and Triumphs Teaching Distributed Computing Topics at a Small Liberal Arts College. 2022 IEEE/ACM International Workshop on Education for High Performance Computing (EduHPC), 26-33. DOI: 10.1109/EduHPC56719.2022.00009. Online publication date: Nov-2022

Published In

PEARC '20: Practice and Experience in Advanced Research Computing 2020: Catch the Wave
July 2020
556 pages
ISBN: 9781450366892
DOI: 10.1145/3311790
Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • National Science Foundation

Conference

PEARC '20
Acceptance Rates

Overall Acceptance Rate 133 of 202 submissions, 66%
