DOI: 10.1145/3540250.3558946
research-article

An empirical investigation of missing data handling in cloud node failure prediction

Published: 09 November 2022

    Abstract

    Cloud computing systems have become increasingly popular in recent years. A typical cloud system uses millions of computing nodes as its basic infrastructure, and node failure has been identified as one of the most prevalent causes of cloud system downtime. To improve the reliability of cloud systems, many previous studies collected monitoring metrics from nodes and built models to predict node failures before they happen. However, based on our experience with large-scale real-world cloud systems at Microsoft, we find that node failure prediction is severely hampered by missing data. A large amount of data is missing, and the problem is even worse for the latest online data used for prediction. As a result, the real-time performance of the node failure prediction model is limited. In this paper, we first characterize the missing data problem for node failure prediction. We then evaluate several existing data interpolation approaches and find that node-dimension interpolation approaches outperform time-dimension ones, and that deep-learning-based interpolation is the best for early prediction. Our findings can help academics and engineers address the missing data problem in cloud node failure prediction and in other data-driven software engineering scenarios.
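
To make the distinction between time-dimension and node-dimension interpolation concrete, here is a minimal sketch in Python. It is an illustration only and not the approaches evaluated in the paper. It assumes linear interpolation along one node's own series for the time dimension, and a simple mean over the other nodes at the same time step for the node dimension. The function names, the matrix layout (rows are nodes, columns are time steps), and the readings are all hypothetical.

```python
from statistics import mean
from typing import List, Optional

# Assumed layout: rows are nodes, columns are time steps, None marks a missing reading.
Matrix = List[List[Optional[float]]]


def fill_time_dimension(series: List[Optional[float]]) -> List[Optional[float]]:
    """Linearly interpolate gaps from the same node's earlier and later readings."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    # Leading and trailing gaps have a reading on one side only, so they stay missing.
    return filled


def fill_node_dimension(metrics: Matrix) -> Matrix:
    """Fill each gap with the mean of the nodes that did report at that time step."""
    filled = [list(row) for row in metrics]
    for t in range(len(metrics[0])):
        observed = [row[t] for row in metrics if row[t] is not None]
        if not observed:
            continue  # no node reported at this time step, nothing to borrow from
        for row in filled:
            if row[t] is None:
                row[t] = mean(observed)
    return filled


# Hypothetical readings for three nodes; node 0 is missing its latest value.
metrics: Matrix = [
    [10.0, None, 30.0, None],
    [20.0, 22.0, 24.0, 26.0],
    [40.0, 42.0, 44.0, 46.0],
]
by_time = [fill_time_dimension(row) for row in metrics]
by_node = fill_node_dimension(metrics)

print(by_time[0])  # [10.0, 20.0, 30.0, None]
print(by_node[0])  # [10.0, 32.0, 30.0, 36.0]
```

In this toy version, the most recent gap on node 0 has no later reading to interpolate toward, so the time-dimension fill leaves it missing, while the node-dimension fill can still borrow from the other nodes at that time step. This is a property of the sketch, not a result reported by the paper.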

    Published In

    ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
    November 2022
    1822 pages
    ISBN:9781450394130
    DOI:10.1145/3540250
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Cloud systems
    2. Missing data
    3. Node failure prediction

    Qualifiers

    • Research-article

    Conference

    ESEC/FSE '22

    Acceptance Rates

    Overall Acceptance Rate 112 of 543 submissions, 21%

    Cited By

    • (2024) KnowLog: Knowledge Enhanced Pre-trained Language Model for Log Understanding. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1-13. DOI: 10.1145/3597503.3623304. Online publication date: 20-May-2024
    • (2024) SOIL: Score Conditioned Diffusion Model for Imbalanced Cloud Failure Prediction. Companion Proceedings of the ACM on Web Conference 2024, 65-72. DOI: 10.1145/3589335.3648303. Online publication date: 13-May-2024
    • (2024) ACPR: Adaptive Classification Predictive Repair Method for Different Fault Scenarios. IEEE Access, 12, 4631-4641. DOI: 10.1109/ACCESS.2023.3346881. Online publication date: 2024
    • (2023) ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection. Proceedings of the VLDB Endowment, 17:3, 359-372. DOI: 10.14778/3632093.3632101. Online publication date: 1-Nov-2023
    • (2023) Diffusion-Based Time Series Data Imputation for Cloud Failure Prediction at Microsoft 365. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2050-2055. DOI: 10.1145/3611643.3613866. Online publication date: 30-Nov-2023
    • (2023) TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1762-1773. DOI: 10.1145/3611643.3613864. Online publication date: 30-Nov-2023
    • (2023) Robust Multimodal Failure Detection for Microservice Systems. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 5639-5649. DOI: 10.1145/3580305.3599902. Online publication date: 6-Aug-2023
    • (2023) CODEC: Cost-Effective Duration Prediction System for Deadline Scheduling in the Cloud. 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 298-308. DOI: 10.1109/ISSRE59848.2023.00069. Online publication date: 9-Oct-2023
    • (2023) TraceArk: Towards Actionable Performance Anomaly Alerting for Online Service Systems. Proceedings of the 45th International Conference on Software Engineering: Software Engineering in Practice, 258-269. DOI: 10.1109/ICSE-SEIP58684.2023.00029. Online publication date: 17-May-2023
    • (2023) Aegis: Attribution of Control Plane Change Impact across Layers and Components for Cloud Systems. Proceedings of the 45th International Conference on Software Engineering: Software Engineering in Practice, 222-233. DOI: 10.1109/ICSE-SEIP58684.2023.00026. Online publication date: 17-May-2023
