research-article

GRAAFE: GRaph Anomaly Anticipation Framework for Exascale HPC systems

Published: 01 November 2024

Abstract

The main limitations of applying predictive tools to large-scale supercomputers are the complexity of deploying Artificial Intelligence (AI) services in production and the difficulty of modeling heterogeneous data sources while preserving topological information in compact models. This paper proposes GRAAFE, a framework for continuously predicting compute node failures in the Marconi100 supercomputer. The framework consists of (i) an anomaly prediction model based on graph neural networks (GNNs) that leverage the nodes' physical layout in the compute room and (ii) a computationally efficient integration into Marconi100's ExaMon holistic monitoring system using Kubeflow, an MLOps framework for Kubernetes that enables continuous deployment of AI pipelines. The GRAAFE GNN model achieves an area under the curve (AUC) between 0.91 and 0.78, surpassing the state of the art (SoA), which achieves an AUC between 0.64 and 0.5. GRAAFE sustains anomaly prediction for all Marconi100 nodes every 120 s, requiring 30% more CPU resources and less than 5% more RAM than monitoring alone.
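To make the idea of a model that uses the nodes' physical layout more concrete, the following is a minimal sketch of one neighbour-aggregation step, the basic operation on which GNN layers build. It is not the GRAAFE model: the chain-shaped rack graph, the function names `rack_adjacency` and `aggregate`, and the telemetry values are all assumptions made for illustration, and a real GNN layer would add learned weights and a non-linearity.

```python
# Hypothetical layout: compute nodes in one rack, indexed bottom to top.
# Linking each node to its physically adjacent nodes is an assumption,
# not the graph construction used in the paper.
def rack_adjacency(num_nodes):
    """Return {node: [neighbours]} for a single rack modelled as a chain."""
    return {
        i: [j for j in (i - 1, i + 1) if 0 <= j < num_nodes]
        for i in range(num_nodes)
    }


def aggregate(features, adjacency):
    """One mean-aggregation step: average each node's features with its neighbours'."""
    out = {}
    for node, neighbours in adjacency.items():
        group = [features[node]] + [features[n] for n in neighbours]
        dim = len(features[node])
        out[node] = [sum(vec[k] for vec in group) / len(group) for k in range(dim)]
    return out


# Toy telemetry per node: [temperature in Celsius, CPU load] (made-up values).
features = {0: [40.0, 0.2], 1: [42.0, 0.4], 2: [70.0, 0.9], 3: [41.0, 0.3]}
adjacency = rack_adjacency(4)
smoothed = aggregate(features, adjacency)
```

After this step, the hot node 2 raises the aggregated temperature of its neighbours 1 and 3, which is how information about a nearby anomaly can reach the representation of a node that still looks healthy on its own.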

Highlights

Based on telemetry data, the GRAAFE framework predicts compute node availability.
It is the first HPC anomaly prediction framework based on graph neural networks.
GRAAFE is a full-scale MLOps framework for anomaly prediction in HPC.
It requires an additional 30% CPU and less than 5% more RAM compared to monitoring only.
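For readers unfamiliar with the AUC figures quoted in the abstract, the sketch below computes the area under the ROC curve in its rank-based form: the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one. The function name `roc_auc` and the labels and scores are illustrative assumptions, not data or code from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up example: 1 marks a node that later failed, 0 a healthy node.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

Under this definition, an AUC of 0.5 corresponds to a score that ranks anomalous and normal samples no better than chance, and 1.0 to a perfect ranking.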

Cited By

  • (2024) Predicting Compute Node Unavailability in HPC: A Graph-Based Machine Learning Approach, Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 737–740, https://doi.org/10.1109/SCW63240.2024.00103. Online publication date: 17-Nov-2024

Published In

Future Generation Computer Systems, Volume 160, Issue C
Nov 2024
966 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Anomaly prediction
  2. High-performance systems
  3. Graph neural networks
  4. MLOps

Qualifiers

  • Research-article
