DOI: 10.1007/978-3-030-85665-6_5
Article

E2EWatch: An End-to-End Anomaly Diagnosis Framework for Production HPC Systems

Published: 01 September 2021

Abstract

In today’s High-Performance Computing (HPC) systems, application performance variations are among the most vital challenges as they adversely affect system efficiency, application performance, and cost. System administrators need to identify the anomalies that are responsible for performance variation and take mitigating actions. One can perform manual root-cause analysis on telemetry data collected by HPC monitoring infrastructures to analyze performance variations. However, manual analysis methods are time-intensive and limited in impact due to the increasing complexity of HPC systems and terabyte/day-sized telemetry data. State-of-the-art approaches use machine learning-based methods to diagnose performance anomalies automatically. This paper deploys an end-to-end machine learning framework that diagnoses performance anomalies on compute nodes on a 1488-node production HPC system. We demonstrate job and node-level anomaly diagnosis results with the Grafana frontend interface at runtime. Furthermore, we discuss challenges and design decisions for the deployment.

Cited By

  • (2024) P-MoVE: Performance Monitoring and Visualization with Encoded Knowledge. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 1531-1542. DOI: 10.1109/SCW63240.2024.00193. Online publication date: 17-Nov-2024
  • (2023) Heterogeneous Syslog Analysis: There Is Hope. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 581-587. DOI: 10.1145/3624062.3624128. Online publication date: 12-Nov-2023
  • (2022) Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems. High Performance Computing. ISC High Performance 2022 International Workshops, pp. 262-276. DOI: 10.1007/978-3-031-23220-6_18. Online publication date: 29-May-2022
  • (2022) Analysing Supercomputer Nodes Behaviour with the Latent Representation of Deep Learning Models. Euro-Par 2022: Parallel Processing, pp. 171-185. DOI: 10.1007/978-3-031-12597-3_11. Online publication date: 22-Aug-2022

Published In

Euro-Par 2021: Parallel Processing: 27th International Conference on Parallel and Distributed Computing, Lisbon, Portugal, September 1–3, 2021, Proceedings
Sep 2021, 651 pages
ISBN: 978-3-030-85664-9
DOI: 10.1007/978-3-030-85665-6

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

1. HPC
2. Anomaly diagnosis
3. Machine learning
4. Telemetry
