DOI: 10.1145/3510003.3510055
research-article

Online summarizing alerts through semantic and behavior information

Published: 05 July 2022

Abstract

    Alerts, which record details about system failures, are crucial data for monitoring an online service system. Because system components are intricately correlated, a single failure usually triggers a large number of alerts, which makes traditional manual handling of alerts insufficient. Automatically summarizing alerts is therefore a pressing problem. This paper tackles the challenge with a novel approach based on supervised learning. The proposed approach, OAS (Online Alert Summarizing), first learns two types of information from alerts: semantic information and behavior information. OAS then uses a deep learning model to aggregate the semantic and behavior representations of alerts and thereby determine the correlation between alerts. This allows OAS to summarize newly reported alerts online. Extensive experiments on real alert datasets from two large commercial banks demonstrate the efficiency and effectiveness of OAS.
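
    The flow described in the abstract (represent each alert by semantic and behavior information, score the correlation between alerts, and assign each newly reported alert to a summary group as it arrives) can be illustrated with a toy sketch. This is not the OAS model: bag-of-words counts stand in for the learned semantic representation, per-minute occurrence counts stand in for the learned behavior representation, and a weighted sum of cosine similarities stands in for the deep learning model that aggregates them. All names, the `weight` parameter, and the `threshold` value are hypothetical.

```python
import math
from collections import Counter


def semantic_vector(text):
    """Bag-of-words counts; an assumed stand-in for a learned semantic embedding."""
    return Counter(text.lower().split())


def behavior_vector(timestamps, bucket_seconds=60):
    """Occurrences per time bucket; an assumed stand-in for learned behavior features."""
    return Counter(int(t // bucket_seconds) for t in timestamps)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def correlation(alert_a, alert_b, weight=0.5):
    """Weighted sum of the two similarities (hypothetical; replaces the learned model)."""
    sem = cosine(alert_a["semantic"], alert_b["semantic"])
    beh = cosine(alert_a["behavior"], alert_b["behavior"])
    return weight * sem + (1 - weight) * beh


def make_alert(text, timestamps):
    return {
        "text": text,
        "semantic": semantic_vector(text),
        "behavior": behavior_vector(timestamps),
    }


class OnlineSummarizer:
    """Assigns each incoming alert to the most correlated group, or opens a new one."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.groups = []  # each group is a list of alerts

    def add(self, alert):
        best_idx, best_score = None, 0.0
        for idx, group in enumerate(self.groups):
            score = max(correlation(alert, member) for member in group)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is not None and best_score >= self.threshold:
            self.groups[best_idx].append(alert)
            return best_idx
        self.groups.append([alert])
        return len(self.groups) - 1


summarizer = OnlineSummarizer()
a1 = make_alert("database connection timeout on db01", [0, 10, 65])
a2 = make_alert("database connection timeout on db02", [5, 20, 70])
a3 = make_alert("disk usage high on web07", [5000, 5100])
group_ids = [summarizer.add(a) for a in (a1, a2, a3)]
```

    In this sketch the two database alerts share most of their tokens and fire in the same minutes, so they land in one group, while the disk alert opens a second group. A supervised approach such as the one the abstract describes would learn the correlation function from labeled alerts rather than fix it by hand.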




      Published In

      ICSE '22: Proceedings of the 44th International Conference on Software Engineering
      May 2022
      2508 pages
      ISBN:9781450392211
      DOI:10.1145/3510003
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      In-Cooperation

      • IEEE CS

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. alert summary
      2. online service systems
      3. system maintenance

      Qualifiers

      • Research-article

      Conference

      ICSE '22

      Acceptance Rates

      Overall Acceptance Rate 276 of 1,856 submissions, 15%


      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 90
      • Downloads (Last 6 weeks): 6

      Cited By

      • (2024) Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice. DOI: 10.1145/3639477.3639745 (pp. 369-380). Online publication date: 14-Apr-2024
      • (2024) Dependency Aware Incident Linking in Large Cloud Systems. Companion Proceedings of the ACM on Web Conference 2024. DOI: 10.1145/3589335.3648311 (pp. 141-150). Online publication date: 13-May-2024
      • (2024) A survey on intelligent management of alerts and incidents in IT services. Journal of Network and Computer Applications 224:C. DOI: 10.1016/j.jnca.2024.103842. Online publication date: 2-Jul-2024
      • (2023) Assess and Summarize: Improve Outage Understanding with Large Language Models. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. DOI: 10.1145/3611643.3613891 (pp. 1657-1668). Online publication date: 30-Nov-2023
      • (2023) Incident-Aware Duplicate Ticket Aggregation for Cloud Systems. Proceedings of the 45th International Conference on Software Engineering. DOI: 10.1109/ICSE48619.2023.00193 (pp. 2299-2311). Online publication date: 14-May-2023
      • (2023) Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). DOI: 10.1109/ASE56229.2023.00177 (pp. 79-90). Online publication date: 11-Sep-2023
      • (2023) Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). DOI: 10.1109/ASE56229.2023.00077 (pp. 268-280). Online publication date: 11-Sep-2023
