research-article

MISP: A Multimodal-based Intelligent Server Failure Prediction Model for Cloud Computing Systems

Authors:

Cheng Zhuo

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 5509 - 5520

https://doi.org/10.1145/3637528.3671568

Published: 24 August 2024

Abstract

Traditional server failure prediction methods predominantly rely on single-modality data such as system logs or system status curves. This reliance may lead to an incomplete understanding of system health and impending issues, proving inadequate for the complex and dynamic landscape of contemporary cloud computing environments. The potential of multimodal data to provide comprehensive insights is widely acknowledged, yet the lack of a holistic dataset and the challenges inherent in integrating features from both structured and unstructured data have impeded the exploration of multimodal-based server failure prediction. Addressing these challenges, this paper presents an industrial-scale, comprehensive dataset for server failure prediction, comprising nearly 80 types of structured and unstructured data sourced from real-world industrial cloud systems¹. Building on this resource, we introduce MISP, a model that leverages multimodal fusion techniques for server failure prediction. MISP transforms multimodal data into multi-dimensional sequences, extracts and encodes features both within and across the modalities, and ultimately computes the failure probability from the synthesized features. Experiments demonstrate that MISP significantly outperforms existing methods, enhancing prediction accuracy by approximately 25% over previous state-of-the-art approaches.
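As a rough illustration of the pipeline the abstract describes (one sequence per modality, features within and across modalities, then a failure probability), the following Python sketch uses deliberately simple stand-ins. It is not the MISP architecture: the hashed token counts, mean pooling, cosine-similarity cross-modal feature, logistic output, and every name, dimension, and weight below are assumptions made for this example only.

```python
import math
import zlib
from typing import List, Sequence

LOG_DIM = 4  # hypothetical number of hash buckets for log tokens


def logs_to_sequence(lines: Sequence[str], dim: int = LOG_DIM) -> List[List[float]]:
    """Turn one log line per time step into a T x dim sequence of hashed token counts."""
    seq = []
    for line in lines:
        counts = [0.0] * dim
        for token in line.lower().split():
            # crc32 is deterministic across runs, unlike Python's built-in hash() for str
            counts[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
        seq.append(counts)
    return seq


def pool(seq: Sequence[Sequence[float]]) -> List[float]:
    """Intra-modality feature (assumed): mean over time of each channel."""
    steps = len(seq)
    return [sum(step[c] for step in seq) / steps for c in range(len(seq[0]))]


def activity(seq: Sequence[Sequence[float]]) -> List[float]:
    """Per-time-step L2 norm, a scalar activity trace for one modality."""
    return [math.sqrt(sum(v * v for v in step)) for step in seq]


def cross_modal(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Cross-modality feature (assumed): cosine similarity of the two activity traces."""
    ta, tb = activity(a), activity(b)
    dot = sum(x * y for x, y in zip(ta, tb))
    norm = math.sqrt(sum(x * x for x in ta)) * math.sqrt(sum(y * y for y in tb))
    return dot / norm if norm > 0 else 0.0


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def failure_probability(metrics: Sequence[Sequence[float]],
                        log_lines: Sequence[str],
                        weights: Sequence[float],
                        bias: float = 0.0) -> float:
    """Fuse both modalities and map the fused vector to a probability."""
    log_seq = logs_to_sequence(log_lines)
    fused = pool(metrics) + pool(log_seq) + [cross_modal(metrics, log_seq)]
    if len(weights) != len(fused):
        raise ValueError(f"expected {len(fused)} weights, got {len(weights)}")
    z = sum(w * f for w, f in zip(weights, fused)) + bias
    return sigmoid(z)


# Toy input: 3 time steps, 2 metric channels, one log line per step (all made up).
metrics = [[0.2, 0.5], [0.3, 0.6], [0.9, 0.95]]
log_lines = ["service started", "disk warning", "disk error error timeout"]
weights = [0.5] * (2 + LOG_DIM + 1)  # untrained placeholder weights
p = failure_probability(metrics, log_lines, weights)
```

In a real system the pooling and the linear layer would be replaced by learned encoders and trained parameters; the sketch only shows how structured and unstructured inputs can be brought into a common sequence form before fusion.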

Supplemental Material

MP4 File - Promotional Video

The video briefly introduces failure prediction in cloud computing systems, including the background, the multimodal data, and the proposed model.


Published In

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2024

6901 pages

ISBN:9798400704901

DOI:10.1145/3637528

General Chairs: Ricardo Baeza-Yates (Northeastern University, USA), Francesco Bonchi (CENTAI / Eurecat, Italy)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '24

KDD '24: The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
