short-paper

EDITS: An Easy-to-difficult Training Strategy for Cloud Failure Prediction

Authors:

Lingling Zheng,

Murali Chintalapati,

Saravan Rajmohan,

Dongmei Zhang

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023

Pages 371 - 375

https://doi.org/10.1145/3543873.3584630

Published: 30 April 2023

Abstract

Cloud failures are a major threat to the reliability of cloud services. Many approaches have been proposed to predict cloud failures before they occur, so that proactive actions can be taken to ensure service reliability. In industrial practice, existing failure prediction approaches mainly focus on applying state-of-the-art time series models to improve prediction performance, but neglect the training strategy. However, as curriculum learning points out, models perform better when they are trained on data in easy-to-difficult order. In this paper, we propose EDITS, a novel training strategy for cloud failure prediction that greatly improves the performance of existing cloud failure prediction models. Our experimental results on industrial and public datasets show that EDITS noticeably enhances the performance of cloud failure prediction models. In addition, EDITS outperforms other curriculum learning methods. More encouragingly, EDITS has been successfully applied to Microsoft 365 and Azure online service systems, where it has noticeably reduced financial losses caused by cloud failures.
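
The abstract does not describe how EDITS scores difficulty or schedules training, so the following is only a generic curriculum-learning sketch of the easy-to-difficult idea, not the EDITS method itself. It assumes a toy 1-D logistic regression and a hypothetical difficulty score (distance from the decision boundary); the names curriculum_stages, train_logistic, and hypothetical_difficulty are illustrative and do not come from the paper.

```python
import math
import random


def curriculum_stages(samples, difficulty, n_stages=3):
    """Yield growing training subsets, easiest samples first."""
    ordered = sorted(samples, key=difficulty)
    for stage in range(1, n_stages + 1):
        cutoff = math.ceil(len(ordered) * stage / n_stages)
        yield ordered[:cutoff]


def train_logistic(stages, epochs_per_stage=20, lr=0.1, seed=0):
    """Fit a 1-D logistic regression with SGD, one curriculum stage at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for subset in stages:
        subset = list(subset)  # copy so shuffling does not disturb the caller
        for _ in range(epochs_per_stage):
            rng.shuffle(subset)
            for x, y in subset:
                p = 1.0 / (1.0 + math.exp(-(w * x + b)))
                w -= lr * (p - y) * x
                b -= lr * (p - y)
    return w, b


# Toy data: label is 1 when x > 0; points near 0 are treated as "difficult".
rng = random.Random(42)
data = [(x, 1 if x > 0 else 0) for x in (rng.uniform(-3, 3) for _ in range(200))]


def hypothetical_difficulty(sample):
    """Assumed score: samples far from the boundary at 0 count as easy."""
    return -abs(sample[0])


stages = list(curriculum_stages(data, hypothetical_difficulty, n_stages=3))
w, b = train_logistic(stages)
accuracy = sum((w * x + b > 0) == (y == 1) for x, y in data) / len(data)
print([len(s) for s in stages], round(accuracy, 3))
```

Each stage here is a superset of the previous one, so the model keeps seeing easy samples while harder ones are introduced. This is one common curriculum schedule among several, chosen only for brevity.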

Published In

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023

April 2023

1567 pages

ISBN:9781450394192

DOI:10.1145/3543873

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

WWW '23

Sponsor:

SIGWEB

WWW '23: The ACM Web Conference 2023

April 30 - May 4, 2023

Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

Yu Z, Ma M, Zhang C, Qin S, Kang Y, Bansal C, Rajmohan S, Dang Y, Pei C, Pei D, Lin Q, Zhang D, d'Amorim M. 2024. MonitorAssistant: Simplifying Cloud Service Monitoring via Large Language Models. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 38–49. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3663529.3663826

Zhong G, Liu F, Jiang J, Wang B, Yao X, Chen C. 2024. Detecting Cloud Anomaly via Broad Network-Based Contrastive Autoencoder. IEEE Transactions on Network and Service Management 21, 3 (2024), 3249–3263. Online publication date: Jun-2024.
https://doi.org/10.1109/TNSM.2024.3353772

Liu Y, Ma M, Zhao P, Li T, Qiao B, Li S, Li Z, Chintalapati M, Dang Y, Bansal C, Rajmohan S, Lin Q, Zhang D. 2024. Early Bird: Ensuring Reliability of Cloud Systems Through Early Failure Prediction. In 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW). 49–54. Online publication date: 28-Oct-2024.
https://doi.org/10.1109/ISSREW63542.2024.00046

Wang Z, Li J, Ma M, Li Z, Kang Y, Zhang C, Bansal C, Chintalapati M, Rajmohan S, Lin Q, Zhang D, Pei C, Xie G. 2024. Large Language Models Can Provide Accurate and Interpretable Incident Triage. In 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). 523–534. Online publication date: 28-Oct-2024.
https://doi.org/10.1109/ISSRE62328.2024.00056

Li H, Ma M, Liu Y, Zhao P, Li S, Li Z, Chintalapati M, Dang Y, Bansal C, Rajmohan S, Lin Q, Zhang D. 2024. Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning. In 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). 499–510. Online publication date: 28-Oct-2024.
https://doi.org/10.1109/ISSRE62328.2024.00054

Campos J, Costa E, Vieira M. 2023. Online Failure Prediction Through Fault Injection and Machine Learning: Methodology and Case Study. In 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE). 451–461. Online publication date: 9-Oct-2023.
https://doi.org/10.1109/ISSRE59848.2023.00021
