Research article
DOI: 10.1145/3468264.3468543

Identifying bad software changes via multimodal anomaly detection for online service systems

Published: 18 August 2021

Abstract

In large-scale online service systems, software changes are inevitable and frequent. Because they introduce new code or configurations, changes are likely to cause incidents and degrade user experience. It is therefore essential for engineers to identify bad software changes, so as to reduce the impact of incidents and improve system reliability. To better understand bad software changes, we perform the first empirical study based on large-scale real-world data from a large commercial bank. Our quantitative analyses indicate that about 50.4% of incidents are caused by bad changes, mainly because of code defects, configuration errors, resource contention, and software version issues. In addition, our qualitative analyses show that current practice for detecting bad software changes does not handle well the heterogeneous multi-source data involved in software changes. Based on the findings and motivation obtained from the empirical study, we propose a novel approach named SCWarn, which aims to identify bad changes and produce interpretable alerts accurately and in a timely manner. The key idea of SCWarn is to use multimodal learning to identify anomalies from heterogeneous multi-source data. An extensive study on two datasets with various bad software changes demonstrates that our approach significantly outperforms all the compared approaches, achieving an F1-score of 0.95 on average and reducing MTTD (mean time to detect) by 20.4% to 60.7%. We also share some success stories and lessons learned from practical usage.
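The abstract reports results in terms of F1-score and MTTD. As a rough illustration of what those two numbers measure, the sketch below computes them from their standard definitions. It is not the paper's evaluation code: the function names, the input shapes, and the choice to measure detection delay from the start of a change to its first alert are all assumptions made for this example.

```python
from datetime import datetime, timedelta


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (standard definition).

    tp, fp, fn are counts of true positives, false positives and
    false negatives. Returns 0.0 when there are no true positives.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_time_to_detect(events: list[tuple[datetime, datetime]]) -> timedelta:
    """Average delay over detected bad changes.

    Each event is (change_start, first_alert). Measuring from the start
    of the change is an assumption of this sketch.
    """
    delays = [alert - start for start, alert in events]
    return sum(delays, timedelta()) / len(delays)


# Hypothetical numbers, for illustration only.
t0 = datetime(2021, 2, 10, 12, 0)
example_events = [
    (t0, t0 + timedelta(minutes=2)),
    (t0, t0 + timedelta(minutes=4)),
]
example_f1 = f1_score(tp=19, fp=1, fn=1)
example_mttd = mean_time_to_detect(example_events)
```

With these definitions, a relative MTTD reduction such as the one quoted above would be computed as `1 - new_mttd / baseline_mttd`.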

    Published In

    ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
    August 2021
    1690 pages
    ISBN:9781450385626
    DOI:10.1145/3468264
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Anomaly Detection
    2. Online Service Systems
    3. Software Change

    Qualifiers

    • Research-article

    Conference

    ESEC/FSE '21

    Acceptance Rates

    Overall Acceptance Rate 112 of 543 submissions, 21%

    Article Metrics

    • Downloads (last 12 months): 284
    • Downloads (last 6 weeks): 18
    Reflects downloads up to 26 Jan 2025

    Cited By

    • (2025) Asynchronous Real-Time Federated Learning for Anomaly Detection in Microservice Cloud Applications. IEEE Transactions on Machine Learning in Communications and Networking 3 (176-194). DOI: 10.1109/TMLCN.2025.3527919. Online publication date: 2025
    • (2024) METER: A Dynamic Concept Adaptation Framework for Online Anomaly Detection. Proceedings of the VLDB Endowment 17:4 (794-807). DOI: 10.14778/3636218.3636233. Online publication date: 5-Mar-2024
    • (2024) ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (1183-1194). DOI: 10.1145/3691620.3695495. Online publication date: 27-Oct-2024
    • (2024) Detecting and Explaining Anomalies Caused by Web Tamper Attacks via Building Consistency-based Normality. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (531-543). DOI: 10.1145/3691620.3695024. Online publication date: 27-Oct-2024
    • (2024) SLIM: a Scalable and Interpretable Light-weight Fault Localization Algorithm for Imbalanced Data in Microservice. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (27-39). DOI: 10.1145/3691620.3694984. Online publication date: 27-Oct-2024
    • (2024) MonitorAssistant: Simplifying Cloud Service Monitoring via Large Language Models. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (38-49). DOI: 10.1145/3663529.3663826. Online publication date: 10-Jul-2024
    • (2024) Try with Simpler - An Evaluation of Improved Principal Component Analysis in Log-based Anomaly Detection. ACM Transactions on Software Engineering and Methodology 33:5 (1-27). DOI: 10.1145/3644386. Online publication date: 3-Jun-2024
    • (2024) ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems. Proceedings of the ACM on Software Engineering 1:FSE (24-46). DOI: 10.1145/3643728. Online publication date: 12-Jul-2024
    • (2024) Pre-trained KPI Anomaly Detection Model Through Disentangled Transformer. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (6190-6201). DOI: 10.1145/3637528.3671522. Online publication date: 25-Aug-2024
    • (2024) No More Data Silos: Unified Microservice Failure Diagnosis with Temporal Knowledge Graph. IEEE Transactions on Services Computing (1-14). DOI: 10.1109/TSC.2024.3489444. Online publication date: 2024
