DOI: 10.1145/3368089.3409768
research-article

Identifying linked incidents in large-scale online service systems

Published: 08 November 2020

Abstract

In large-scale online service systems, incidents occur frequently due to a variety of causes, from software and hardware updates to changes in the operating environment. These incidents can significantly degrade system availability and customer satisfaction. Some incidents are linked because they are duplicates or are inter-related. Linked incidents can greatly help on-call engineers find mitigation solutions and identify root causes. In this work, we investigate incidents and their links in a representative real-world incident management (IcM) system. Based on the identified indicators of linked incidents, we propose LiDAR (Linked Incident identification with DAta-driven Representation), a deep-learning-based approach to incident linking. Specifically, we incorporate the textual descriptions of incidents and structural information extracted from historical linked incidents to identify possible links among a large number of incidents. To show its effectiveness, we apply our method to a real-world IcM system and find that it outperforms other state-of-the-art methods.
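The idea of combining a textual signal with a structural signal can be illustrated with a toy sketch. This is not LiDAR and not the authors' implementation: in place of learned deep representations it uses bag-of-words cosine similarity for text and Jaccard overlap over a hand-written map of historical links for structure. The `component` field, the `historical_links` map, the `alpha` weight, and all function names are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def text_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two descriptions."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def structural_similarity(comp_a: str, comp_b: str, historical_links: dict) -> float:
    """Jaccard overlap of the components historically linked to each component.

    historical_links is an assumed, hand-built map: component -> set of
    components whose incidents were linked to it in the past.
    """
    na = historical_links.get(comp_a, set()) | {comp_a}
    nb = historical_links.get(comp_b, set()) | {comp_b}
    return len(na & nb) / len(na | nb)


def link_score(inc_a: dict, inc_b: dict, historical_links: dict, alpha: float = 0.5) -> float:
    """Weighted blend of the textual and structural signals (alpha is an assumption)."""
    t = text_similarity(inc_a["text"], inc_b["text"])
    s = structural_similarity(inc_a["component"], inc_b["component"], historical_links)
    return alpha * t + (1 - alpha) * s


# Hypothetical data, invented for this example only.
historical_links = {"storage": {"database"}, "database": {"storage"}, "network": set()}
inc_a = {"text": "disk latency high in storage cluster", "component": "storage"}
inc_b = {"text": "database timeouts caused by high disk latency", "component": "database"}
inc_c = {"text": "certificate expired on gateway", "component": "network"}

# The storage/database pair shares words and historical links, so it ranks higher.
print(link_score(inc_a, inc_b, historical_links) > link_score(inc_a, inc_c, historical_links))
```

In this sketch a pair of incidents is ranked as a more likely link when both signals agree; a learned model would replace both hand-written similarity functions.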

Supplementary Material

Auxiliary Teaser Video (fse20main-p978-p-teaser.mp4)
Teaser video for the FSE 2020 research-track talk on this paper.
Auxiliary Presentation Video (fse20main-p978-p-video.mp4)
Presentation video of the FSE 2020 research-track talk on this paper.

Information & Contributors

Information

Published In

ESEC/FSE 2020: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2020
1703 pages
ISBN: 9781450370431
DOI: 10.1145/3368089
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Linked incidents
  2. incident management
  3. link prediction
  4. online service system

Qualifiers

  • Research-article

Conference

ESEC/FSE '20

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 82
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 04 Feb 2025

Citations

Cited By

  • (2025) A Context-Aware Clustering Approach for Assisting Operators in Classifying Security Alerts. IEEE Transactions on Software Engineering 51:1 (153-171). DOI: 10.1109/TSE.2024.3497588. Online publication date: Jan-2025
  • (2024) X-Lifecycle Learning for Cloud Incident Management using LLMs. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (417-428). DOI: 10.1145/3663529.3663861. Online publication date: 10-Jul-2024
  • (2024) MonitorAssistant: Simplifying Cloud Service Monitoring via Large Language Models. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (38-49). DOI: 10.1145/3663529.3663826. Online publication date: 10-Jul-2024
  • (2024) FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice (392-404). DOI: 10.1145/3639477.3639754. Online publication date: 14-Apr-2024
  • (2024) Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice (369-380). DOI: 10.1145/3639477.3639745. Online publication date: 14-Apr-2024
  • (2024) GraphWeaver: Billion-Scale Cybersecurity Incident Correlation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (4479-4486). DOI: 10.1145/3627673.3680057. Online publication date: 21-Oct-2024
  • (2024) Dependency Aware Incident Linking in Large Cloud Systems. Companion Proceedings of the ACM Web Conference 2024 (141-150). DOI: 10.1145/3589335.3648311. Online publication date: 13-May-2024
  • (2024) DualAttlog: Context aware dual attention networks for log-based anomaly detection. Neural Networks (106680). DOI: 10.1016/j.neunet.2024.106680. Online publication date: Aug-2024
  • (2023) Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (682-694). DOI: 10.1145/3611643.3616316. Online publication date: 30-Nov-2023
  • (2023) Detection Is Better Than Cure: A Cloud Incidents Perspective. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1891-1902). DOI: 10.1145/3611643.3613898. Online publication date: 30-Nov-2023
