DOI: 10.1145/3543873.3584630

EDITS: An Easy-to-difficult Training Strategy for Cloud Failure Prediction

Published: 30 April 2023

Abstract

Cloud failures are a major threat to the reliability of cloud services. Many approaches have been proposed to predict cloud failures before they occur, so that proactive action can be taken to preserve service reliability. In industrial practice, existing failure prediction approaches focus mainly on applying state-of-the-art time series models to improve prediction performance, but neglect the training strategy. However, as curriculum learning points out, models perform better when they are trained on data ordered from easy to difficult. In this paper, we propose EDITS, a novel training strategy for cloud failure prediction that greatly improves the performance of existing cloud failure prediction models. Our experimental results on industrial and public datasets show that EDITS clearly enhances the performance of cloud failure prediction models. In addition, EDITS outperforms other curriculum learning methods. More encouragingly, EDITS has been successfully applied to Microsoft 365 and Azure online service systems, where it has clearly reduced the financial losses caused by cloud failures.
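The easy-to-difficult ordering that curriculum learning refers to can be illustrated with a minimal sketch. This is not the EDITS algorithm, which the abstract does not describe; it only shows the generic idea of sorting samples by a difficulty score and widening the training pool stage by stage. The function name `curriculum_stages`, the per-sample difficulty score, and the toy data are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_stages(
    samples: Sequence[T],
    difficulty: Callable[[T], float],
    num_stages: int,
) -> List[List[T]]:
    """Return cumulative training pools ordered from easy to difficult.

    Stage k holds the easiest k/num_stages fraction of the samples, so each
    later stage keeps the earlier samples and adds harder ones.
    The difficulty function is a hypothetical, user-supplied score.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(samples, key=difficulty)  # easiest first (stable sort)
    stages = []
    for k in range(1, num_stages + 1):
        # Integer ceiling of len(ordered) * k / num_stages.
        cutoff = -(-len(ordered) * k // num_stages)
        stages.append(ordered[:cutoff])
    return stages


if __name__ == "__main__":
    # Toy data: (sample id, assumed difficulty score). Purely illustrative.
    toy = [("w1", 0.9), ("w2", 0.1), ("w3", 0.5), ("w4", 0.3)]
    for i, pool in enumerate(curriculum_stages(toy, lambda s: s[1], 2), start=1):
        print(f"stage {i}:", [name for name, _ in pool])
```

In a real training loop, each pool would be passed to the model for one or more epochs before moving to the next stage. How difficulty is measured is the part a concrete method has to define, and the sketch leaves it open.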

Published In

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023
April 2023
1567 pages
ISBN: 9781450394192
DOI: 10.1145/3543873
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

WWW '23: The ACM Web Conference 2023
April 30 - May 4, 2023
Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (last 12 months): 48
  • Downloads (last 6 weeks): 6
Reflects downloads up to 11 Feb 2025

Cited By

  • (2024) MonitorAssistant: Simplifying Cloud Service Monitoring via Large Language Models. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 38–49. DOI: 10.1145/3663529.3663826. Online publication date: 10-Jul-2024.
  • (2024) Detecting Cloud Anomaly via Broad Network-Based Contrastive Autoencoder. IEEE Transactions on Network and Service Management 21:3, 3249–3263. DOI: 10.1109/TNSM.2024.3353772. Online publication date: Jun-2024.
  • (2024) Early Bird: Ensuring Reliability of Cloud Systems Through Early Failure Prediction. 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW), 49–54. DOI: 10.1109/ISSREW63542.2024.00046. Online publication date: 28-Oct-2024.
  • (2024) Large Language Models Can Provide Accurate and Interpretable Incident Triage. 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), 523–534. DOI: 10.1109/ISSRE62328.2024.00056. Online publication date: 28-Oct-2024.
  • (2024) Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning. 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), 499–510. DOI: 10.1109/ISSRE62328.2024.00054. Online publication date: 28-Oct-2024.
  • (2023) Online Failure Prediction Through Fault Injection and Machine Learning: Methodology and Case Study. 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 451–461. DOI: 10.1109/ISSRE59848.2023.00021. Online publication date: 9-Oct-2023.
