DOI: 10.1145/3637528.3671568
Research article

MISP: A Multimodal-based Intelligent Server Failure Prediction Model for Cloud Computing Systems

Published: 24 August 2024

Abstract

Traditional server failure prediction methods predominantly rely on single-modality data such as system logs or system status curves. This reliance may lead to an incomplete understanding of system health and impending issues, proving inadequate for the complex and dynamic landscape of contemporary cloud computing environments. The potential of multimodal data to provide comprehensive insights is widely acknowledged, yet the lack of a holistic dataset and the challenges inherent in integrating features from both structured and unstructured data have impeded the exploration of multimodal-based server failure prediction. Addressing these challenges, this paper presents an industrial-scale, comprehensive dataset for server failure prediction, comprising nearly 80 types of structured and unstructured data sourced from real-world industrial cloud systems. Building on this resource, we introduce MISP, a model that leverages multimodal fusion techniques for server failure prediction. MISP transforms multimodal data into multi-dimensional sequences, extracts and encodes features both within and across the modalities, and ultimately computes the failure probability from the synthesized features. Experiments demonstrate that MISP significantly outperforms existing methods, enhancing prediction accuracy by approximately 25% over previous state-of-the-art approaches.
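The three stages named in the abstract (sequence transformation, within- and cross-modality encoding, and failure probability computation) can be illustrated with a toy sketch. This is not the MISP architecture, which the abstract does not specify. Every function name, the hand-picked features (mean, trend, Pearson correlation), and the logistic scoring with fixed weights are assumptions chosen only to make the data flow concrete.

```python
import math
from collections import Counter


def logs_to_sequence(log_events, num_windows, window_seconds):
    """Turn (timestamp_seconds, template_id) log events into per-window counts.

    Hypothetical stand-in for "transforming unstructured data into a sequence".
    """
    counts = Counter()
    for timestamp, _template in log_events:
        index = int(timestamp // window_seconds)
        if 0 <= index < num_windows:
            counts[index] += 1
    return [float(counts[i]) for i in range(num_windows)]


def encode_within(sequence):
    """Within-modality features (assumed): mean level and last-minus-first trend."""
    mean = sum(sequence) / len(sequence)
    trend = sequence[-1] - sequence[0]
    return [mean, trend]


def encode_across(seq_a, seq_b):
    """Cross-modality feature (assumed): Pearson correlation of aligned sequences."""
    n = len(seq_a)
    mean_a, mean_b = sum(seq_a) / n, sum(seq_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(seq_a, seq_b))
    var_a = sum((a - mean_a) ** 2 for a in seq_a)
    var_b = sum((b - mean_b) ** 2 for b in seq_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_a * var_b)


def failure_probability(features, weights, bias):
    """Map fused features to a probability with a logistic function (assumed)."""
    score = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def predict(metric_series, log_events, window_seconds, weights, bias):
    """Run the three stages end to end on one metric series and one log stream."""
    log_series = logs_to_sequence(log_events, len(metric_series), window_seconds)
    features = (
        encode_within(metric_series)
        + encode_within(log_series)
        + [encode_across(metric_series, log_series)]
    )
    return failure_probability(features, weights, bias)


# Made-up inputs: a CPU utilization curve and timestamped log events.
cpu = [0.2, 0.3, 0.5, 0.9]
events = [(5, "E1"), (130, "E2"), (190, "E2"), (200, "E7"), (230, "E2")]
p = predict(cpu, events, window_seconds=60,
            weights=[1.0, 1.0, 0.5, 0.5, 1.0], bias=-3.0)
print(f"illustrative failure probability: {p:.3f}")
```

In a learned model the features and weights would be trained rather than fixed by hand. The sketch only shows how structured and unstructured inputs can be aligned on a common time axis before fusion.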

Supplemental Material

MP4 File - Promotional Video
The video briefly introduces failure prediction in cloud computing systems, including the background, the multimodal data, and the proposed model.

Published In

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2024
6901 pages
ISBN:9798400704901
DOI:10.1145/3637528
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cloud computing
  2. failure prediction
  3. multimodal data
  4. time series

Qualifiers

  • Research-article

Conference

KDD '24
Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Upcoming Conference

KDD '25

Article Metrics

  • Total Citations: 0
  • Total Downloads: 144
  • Downloads (Last 12 months): 144
  • Downloads (Last 6 weeks): 37

Reflects downloads up to 25 Dec 2024.
