DOI: 10.1145/3368089.3417054
research-article

How to mitigate the incident? An effective troubleshooting guide recommendation technique for online service systems

Published: 08 November 2020

Abstract

In recent years, more and more traditional shrink-wrapped software has been delivered as 7x24 online services. Incidents (events that lead to service disruptions or outages) can affect service availability and cause great financial loss, so mitigating them is both important and time critical. In practice, a document describing a mitigation process, called a troubleshooting guide (TSG), is usually used to reduce the Time To Mitigate (TTM). To investigate how TSGs are used in real-world online services, we conduct the first empirical study on 18 real-world, large-scale online service systems at Microsoft. We analyze the distribution and characteristics of TSGs among all incident records from the past two years. According to our study, 27.2% of incidents have TSG records, and 36.2% of them occurred at least twice. In addition, developers spend on average around 36.3% of the entire mitigation time locating the desired TSGs.
Our study shows that incidents can recur and that TSGs can be reused to facilitate incident mitigation. Motivated by these findings, we propose an automated TSG recommendation approach, DeepRmd, which leverages the textual similarity between an incident description and its corresponding TSG using deep learning techniques. We evaluate the effectiveness of DeepRmd on 18 online service systems. The results show that DeepRmd recommends the correct TSG as the Top 1 returned result for 80.3% of incidents, significantly outperforming two baseline approaches.
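
To make the idea of similarity-based TSG recommendation concrete, here is a minimal sketch. It is not DeepRmd: the abstract does not describe the model's architecture, so plain TF-IDF with cosine similarity stands in for the learned similarity. The TSG names, TSG texts, and incident descriptions are hypothetical, invented only for illustration, and `top1_accuracy` simply mirrors the Top 1 metric mentioned above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens, idf):
    """Build a sparse TF-IDF vector, ignoring terms unseen in the TSG corpus."""
    counts = Counter(tokens)
    return {term: n * idf[term] for term, n in counts.items() if term in idf}


def build_index(tsgs):
    """Return (names, vectors, idf) for a dict mapping TSG name to TSG text."""
    names = list(tsgs)
    docs = [tokenize(tsgs[name]) for name in names]
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + len(docs)) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return names, [vectorize(tokens, idf) for tokens in docs], idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(incident, tsgs, k=1):
    """Rank TSGs by textual similarity to the incident description."""
    names, vectors, idf = build_index(tsgs)
    query = vectorize(tokenize(incident), idf)
    ranked = sorted(zip(names, vectors), key=lambda p: cosine(query, p[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


def top1_accuracy(labelled, tsgs):
    """Fraction of incidents whose first recommendation is the expected TSG."""
    hits = sum(1 for text, expected in labelled if recommend(text, tsgs)[0] == expected)
    return hits / len(labelled)


# Hypothetical TSGs and incidents, made up for this sketch.
TSGS = {
    "tsg-disk-full": "Disk usage above threshold on storage node. "
                     "Free space by purging old logs and expanding the volume.",
    "tsg-cert-expired": "TLS certificate expired on gateway. "
                        "Rotate the certificate and restart the front end.",
    "tsg-high-latency": "Request latency is high on the API tier. "
                        "Check the cache hit rate and scale out instances.",
}
LABELLED = [
    ("Alert: storage node disk usage at 98 percent, logs filling volume", "tsg-disk-full"),
    ("Gateway returns TLS handshake errors, certificate expired yesterday", "tsg-cert-expired"),
    ("API request latency is high after deployment", "tsg-high-latency"),
]

print(recommend(LABELLED[0][0], TSGS))
print(top1_accuracy(LABELLED, TSGS))
```

A lexical measure like this only matches incidents that share vocabulary with the TSG text, which is one plausible reason to prefer a learned similarity when descriptions and guides are worded differently.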

Supplementary Material

Auxiliary Teaser Video (fse20ind-p52-p-teaser.mp4)
The full presentation video of the paper.
Auxiliary Presentation Video (fse20ind-p52-p-video.mp4)
The full presentation video of the paper.

      Published In

      ESEC/FSE 2020: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
      November 2020
      1703 pages
      ISBN:9781450370431
      DOI:10.1145/3368089
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Incident management
      2. incident mitigation
      3. online service systems
      4. troubleshooting guide

      Qualifiers

      • Research-article

      Conference

      ESEC/FSE '20

      Acceptance Rates

      Overall Acceptance Rate 112 of 543 submissions, 21%

      Article Metrics

      • Downloads (Last 12 months): 94
      • Downloads (Last 6 weeks): 19
      Reflects downloads up to 09 Jan 2025

      Cited By

      • (2024) Dynamic analysis of nonlinear features for high-precision fault diagnosis of large pumps and compressors in oil and gas fields. Applied Mathematics and Nonlinear Sciences 9:1. DOI: 10.2478/amns-2024-3394. Online publication date: 18-Nov-2024
      • (2024) Building AI Agents for Autonomous Clouds: Challenges and Design Principles. Proceedings of the 2024 ACM Symposium on Cloud Computing, 99-110. DOI: 10.1145/3698038.3698525. Online publication date: 20-Nov-2024
      • (2024) X-Lifecycle Learning for Cloud Incident Management using LLMs. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 417-428. DOI: 10.1145/3663529.3663861. Online publication date: 10-Jul-2024
      • (2024) LM-PACE: Confidence Estimation by Large Language Models for Effective Root Causing of Cloud Incidents. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 388-398. DOI: 10.1145/3663529.3663858. Online publication date: 10-Jul-2024
      • (2024) Leveraging Large Language Models for the Auto-remediation of Microservice Applications: An Experimental Study. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 358-369. DOI: 10.1145/3663529.3663855. Online publication date: 10-Jul-2024
      • (2024) Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 266-277. DOI: 10.1145/3663529.3663846. Online publication date: 10-Jul-2024
      • (2024) Exploring LLM-Based Agents for Root Cause Analysis. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 208-219. DOI: 10.1145/3663529.3663841. Online publication date: 10-Jul-2024
      • (2024) Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 50-61. DOI: 10.1145/3663529.3663827. Online publication date: 10-Jul-2024
      • (2024) Dynamic Alert Suppression Policy for Noise Reduction in AIOps. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, 178-188. DOI: 10.1145/3639477.3639752. Online publication date: 14-Apr-2024
      • (2024) Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, 369-380. DOI: 10.1145/3639477.3639745. Online publication date: 14-Apr-2024
