research-article

Identifying bad software changes via multimodal anomaly detection for online service systems

Authors:

Dan Pei

ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Pages 527 - 539

https://doi.org/10.1145/3468264.3468543

Published: 18 August 2021

Abstract

In large-scale online service systems, software changes are inevitable and frequent. Because they introduce new code or configurations, changes are likely to cause incidents and degrade user experience. It is therefore essential for engineers to identify bad software changes, so as to reduce the impact of incidents and improve system reliability. To better understand bad software changes, we perform the first empirical study based on large-scale real-world data from a large commercial bank. Our quantitative analyses indicate that about 50.4% of incidents are caused by bad changes, mainly because of code defects, configuration errors, resource contention, and software version issues. In addition, our qualitative analyses show that the current practice of detecting bad software changes does not handle well the heterogeneous multi-source data involved in software changes. Based on the findings and motivation obtained from the empirical study, we propose a novel approach named SCWarn, which aims to identify bad changes and produce interpretable alerts accurately and in a timely manner. The key idea of SCWarn is to draw on multimodal learning to identify anomalies from heterogeneous multi-source data. An extensive study on two datasets with various bad software changes demonstrates that our approach significantly outperforms all the compared approaches, achieving an F1-score of 0.95 on average and reducing MTTD (mean time to detect) by 20.4% to 60.7%. We also share some success stories and lessons learned from practical usage.
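The abstract reports two evaluation figures, F1-score and the relative reduction in MTTD. The sketch below shows how such figures are conventionally computed from per-change outcomes. It is an illustration only: the `ChangeOutcome` record, its field names, and the choice of minutes as the time unit are assumptions made here, and the paper's exact evaluation protocol is not described on this page.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ChangeOutcome:
    """One software change in a hypothetical evaluation set (assumed schema)."""
    is_bad: bool                              # ground-truth label for the change
    alerted: bool                             # whether the detector raised an alert
    minutes_to_alert: Optional[float] = None  # delay from deployment to alert


def f1_score(outcomes: Sequence[ChangeOutcome]) -> float:
    """Harmonic mean of precision and recall over bad-change alerts."""
    tp = sum(o.is_bad and o.alerted for o in outcomes)
    fp = sum((not o.is_bad) and o.alerted for o in outcomes)
    fn = sum(o.is_bad and (not o.alerted) for o in outcomes)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_time_to_detect(outcomes: Sequence[ChangeOutcome]) -> float:
    """Average alert delay over bad changes that were actually detected."""
    delays = [
        o.minutes_to_alert
        for o in outcomes
        if o.is_bad and o.alerted and o.minutes_to_alert is not None
    ]
    if not delays:
        raise ValueError("no detected bad changes with a recorded delay")
    return sum(delays) / len(delays)


def mttd_reduction(baseline_mttd: float, candidate_mttd: float) -> float:
    """Relative MTTD reduction of a candidate detector versus a baseline."""
    if baseline_mttd <= 0:
        raise ValueError("baseline MTTD must be positive")
    return (baseline_mttd - candidate_mttd) / baseline_mttd
```

Under this convention, a reported reduction of 20.4% corresponds to `mttd_reduction` returning roughly 0.204 for a given baseline. Missed bad changes lower recall (and hence F1) but do not enter the MTTD average, which is one common choice rather than the only one.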

Index Terms

Identifying bad software changes via multimodal anomaly detection for online service systems
1. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Maintaining software

Information & Contributors

Information

Published In

ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

August 2021

1690 pages

ISBN:9781450385626

DOI:10.1145/3468264

General Chairs:
Diomidis Spinellis, Athens University of Economics and Business, Greece
Georgios Gousios, Facebook, Netherlands / Delft University of Technology, Netherlands

Program Chairs:
Marsha Chechik, University of Toronto, Canada
Massimiliano Di Penta, University of Sannio, Italy

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 August 2021

Qualifiers

Research-article

Funding Sources

National Key Research and Development Program of China
State Key Program of National Natural Science of China

Conference

ESEC/FSE '21

Sponsor:

SIGSOFT

ESEC/FSE '21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

August 23 - 28, 2021

Athens, Greece

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Bibliometrics & Citations

Article Metrics

Total Citations: 42
Total Downloads: 963
Downloads (Last 12 months): 284
Downloads (Last 6 weeks): 18

Reflects downloads up to 26 Jan 2025

Citations

Cited By

Raeiszadeh M, Ebrahimzadeh A, Glitho R, Eker J, Mini R. (2025). Asynchronous Real-Time Federated Learning for Anomaly Detection in Microservice Cloud Applications. IEEE Transactions on Machine Learning in Communications and Networking, 3 (176-194). Online publication date: 2025.
https://doi.org/10.1109/TMLCN.2025.3527919

Zhu J, Cai S, Deng F, Ooi B, Zhang W. (2024). METER: A Dynamic Concept Adaptation Framework for Online Anomaly Detection. Proceedings of the VLDB Endowment, 17:4 (794-807). Online publication date: 5-Mar-2024.
https://dl.acm.org/doi/10.14778/3636218.3636233

Sun Y, Shi B, Mao M, Ma M, Xia S, Zhang S, Pei D, Filkov V, Ray B, Zhou M. (2024). ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (1183-1194). Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695495

Liao Y, Xu M, Lin Y, Teoh X, Xie X, Feng R, Liaw F, Zhang H, Dong J, Filkov V, Ray B, Zhou M. (2024). Detecting and Explaining Anomalies Caused by Web Tamper Attacks via Building Consistency-based Normality. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (531-543). Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695024

Ren R, Yang J, Yang L, Gu X, Sun L, Filkov V, Ray B, Zhou M. (2024). SLIM: a Scalable and Interpretable Light-weight Fault Localization Algorithm for Imbalanced Data in Microservice. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (27-39). Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3694984

Yu Z, Ma M, Zhang C, Qin S, Kang Y, Bansal C, Rajmohan S, Dang Y, Pei C, Pei D, Lin Q, Zhang D, d'Amorim M. (2024). MonitorAssistant: Simplifying Cloud Service Monitoring via Large Language Models. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (38-49). Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3663529.3663826

Yang L, Chen J, Gao S, Gong Z, Zhang H, Kang Y, Li H. (2024). Try with Simpler - An Evaluation of Improved Principal Component Analysis in Log-based Anomaly Detection. ACM Transactions on Software Engineering and Methodology, 33:5 (1-27). Online publication date: 3-Jun-2024.
https://dl.acm.org/doi/10.1145/3644386

Yu G, Chen P, He Z, Yan Q, Luo Y, Li F, Zheng Z. (2024). ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems. Proceedings of the ACM on Software Engineering, 1:FSE (24-46). Online publication date: 12-Jul-2024.
https://dl.acm.org/doi/10.1145/3643728

Yu Z, Pei C, Wang X, Ma M, Bansal C, Rajmohan S, Lin Q, Zhang D, Wen X, Li J, Xie G, Pei D, Baeza-Yates R, Bonchi F. (2024). Pre-trained KPI Anomaly Detection Model Through Disentangled Transformer. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (6190-6201). Online publication date: 25-Aug-2024.
https://dl.acm.org/doi/10.1145/3637528.3671522

Zhang S, Zhao Y, Xia S, Wei S, Sun Y, Zhao C, Ma S, Kuang J, Zhu B, Pan L, Guo Y, Pei D. (2024). No More Data Silos: Unified Microservice Failure Diagnosis with Temporal Knowledge Graph. IEEE Transactions on Services Computing (1-14). Online publication date: 2024.
https://doi.org/10.1109/TSC.2024.3489444
