Article

E2EWatch: An End-to-End Anomaly Diagnosis Framework for Production HPC Systems

Authors: Benjamin Schwaller, Vitus J. Leung, Ayse K. Coskun

Euro-Par 2021: Parallel Processing: 27th International Conference on Parallel and Distributed Computing, Lisbon, Portugal, September 1–3, 2021, Proceedings

Pages 70–85

https://doi.org/10.1007/978-3-030-85665-6_5

Published: 01 September 2021

Abstract

In today’s High-Performance Computing (HPC) systems, application performance variations are among the most critical challenges because they adversely affect system efficiency, application performance, and cost. System administrators need to identify the anomalies responsible for performance variation and take mitigating actions. One can perform manual root-cause analysis on telemetry data collected by HPC monitoring infrastructures to analyze performance variations. However, manual analysis is time-intensive and limited in impact because of the increasing complexity of HPC systems and telemetry volumes on the order of terabytes per day. State-of-the-art approaches use machine learning-based methods to diagnose performance anomalies automatically. This paper deploys an end-to-end machine learning framework that diagnoses performance anomalies on the compute nodes of a 1488-node production HPC system. We demonstrate job- and node-level anomaly diagnosis results at runtime through a Grafana frontend interface. Furthermore, we discuss the challenges and design decisions of the deployment.

Published In

Euro-Par 2021: Parallel Processing: 27th International Conference on Parallel and Distributed Computing, Lisbon, Portugal, September 1–3, 2021, Proceedings

Sep 2021

651 pages

ISBN:978-3-030-85664-9

DOI:10.1007/978-3-030-85665-6

Editors:
Leonel Sousa (Universidade de Lisboa, Lisbon, Portugal)
Nuno Roma (Universidade de Lisboa, Lisbon, Portugal)
Pedro Tomás (Universidade de Lisboa, Lisbon, Portugal)

© Springer Nature Switzerland AG 2021.

Publisher

Springer-Verlag

Berlin, Heidelberg

Cited By

Taşyaran F, Yasal O, Morgado J, Ilic A, Unat D, Kaya K (2024) P-MoVE: Performance Monitoring and Visualization with Encoded Knowledge. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 1531-1542. Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SCW63240.2024.00193

Quan A, Howell L, Greenberg H (2023) Heterogeneous Syslog Analysis: There Is Hope. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 581-587. Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3624062.3624128

Ardebili M, Bartolini A, Acquaviva A, Benini L (2022) Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems. High Performance Computing. ISC High Performance 2022 International Workshops, pp. 262-276. Online publication date: 29-May-2022
https://dl.acm.org/doi/10.1007/978-3-031-23220-6_18

Molan M, Borghesi A, Benini L, Bartolini A (2022) Analysing Supercomputer Nodes Behaviour with the Latent Representation of Deep Learning Models. Euro-Par 2022: Parallel Processing, pp. 171-185. Online publication date: 22-Aug-2022
https://dl.acm.org/doi/10.1007/978-3-031-12597-3_11
