research-article

GRAAFE: GRaph Anomaly Anticipation Framework for Exascale HPC systems

Authors:

Mohsen Seyedkazemi Ardebili,

Junaid Ahmed Khan,

Francesco Beneventi,

Daniele Cesarini,

Andrea Borghesi,

Andrea Bartolini

Volume 160, Issue C

Pages 644 - 653

https://doi.org/10.1016/j.future.2024.06.032

Published: 01 November 2024

Abstract

The main limitations of applying predictive tools to large-scale supercomputers are the complexity of deploying Artificial Intelligence (AI) services in production and the difficulty of modeling heterogeneous data sources while preserving topological information in compact models. This paper proposes GRAAFE, a framework for continuously predicting compute node failures in the Marconi100 supercomputer. The framework consists of (i) an anomaly prediction model based on graph neural networks (GNNs) that leverage the nodes’ physical layout in the compute room and (ii) a computationally efficient integration into Marconi100’s ExaMon holistic monitoring system with Kubeflow, an MLOps framework for Kubernetes that enables continuous deployment of AI pipelines. The GRAAFE GNN model achieves an area under the curve (AUC) ranging from 0.91 to 0.78, surpassing the state of the art (SoA), which achieves an AUC between 0.64 and 0.5. GRAAFE sustains anomaly prediction for all Marconi100 nodes every 120 s, requiring an additional 30% of CPU resources and less than 5% more RAM compared with monitoring only.
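As a rough illustration of what "leveraging the nodes' physical layout" can mean for a graph-based model, the sketch below builds a graph from hypothetical rack positions and applies one untrained neighbour-averaging step, the basic operation that GNN layers build on. Everything in it is an assumption made for illustration: the node names, the `(rack, slot)` encoding, the rule that only adjacent slots in the same rack are linked, and the single temperature-like feature. The abstract does not describe how GRAAFE actually constructs its graph or its layers.

```python
from collections import defaultdict


def build_rack_graph(positions):
    """Build an undirected adjacency map from physical positions.

    positions maps a node name to a (rack_id, slot_index) pair.
    Assumption for this sketch: two nodes are neighbours only when they
    sit in adjacent slots of the same rack.
    """
    by_rack = defaultdict(list)
    for node, (rack, slot) in positions.items():
        by_rack[rack].append((slot, node))

    adjacency = {node: set() for node in positions}
    for members in by_rack.values():
        members.sort()  # order the nodes of one rack by slot index
        for (slot_a, a), (slot_b, b) in zip(members, members[1:]):
            if slot_b - slot_a == 1:
                adjacency[a].add(b)
                adjacency[b].add(a)
    return adjacency


def mean_aggregate(features, adjacency):
    """One untrained message-passing step.

    Each node's feature vector is replaced by the mean of its own vector
    and the vectors of its neighbours.
    """
    aggregated = {}
    for node, neighbours in adjacency.items():
        group = [features[node]] + [features[n] for n in sorted(neighbours)]
        dim = len(features[node])
        aggregated[node] = [
            sum(vec[i] for vec in group) / len(group) for i in range(dim)
        ]
    return aggregated


# Hypothetical layout: three nodes stacked in rack 1, one node alone in rack 2.
positions = {"r1n0": (1, 0), "r1n1": (1, 1), "r1n2": (1, 2), "r2n0": (2, 0)}
# Hypothetical single-feature telemetry (for example, a temperature reading).
features = {"r1n0": [40.0], "r1n1": [70.0], "r1n2": [40.0], "r2n0": [55.0]}

adjacency = build_rack_graph(positions)
smoothed = mean_aggregate(features, adjacency)
```

In this toy layout the hot middle node `r1n1` is pulled toward its cooler rack neighbours, while the isolated node `r2n0` keeps its own value. A trained GNN would replace the plain mean with learned transformations, but the flow of information along physical adjacency is the same idea.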

Highlights

• Based on telemetry data, the GRAAFE framework predicts the compute node availability.

• It is the first HPC anomaly prediction framework based on graph neural networks.

• GRAAFE is a full-scale ML-ops framework for anomaly prediction in HPC.

• It requires an additional 30% CPU and 5% more RAM compared to monitoring only.

Cited By

Krumpak R., Rožanec J., Molan M., Angelinelli M., Bartolini A. (2024), Predicting Compute Node Unavailability in HPC: A Graph-Based Machine Learning Approach, Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 737-740. Online publication date: 17-Nov-2024.
https://dl.acm.org/doi/10.1109/SCW63240.2024.00103

Index Terms

GRAAFE: GRaph Anomaly Anticipation Framework for Exascale HPC systems
1. Computer systems organization
  1. Architectures
2. Computing methodologies

Index terms have been assigned to the content through auto-classification.

Published In

Future Generation Computer Systems, Volume 160, Issue C (Nov 2024), 966 pages

Publisher

Elsevier Science Publishers B. V., Netherlands
