Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey

Published: 03 February 2022

Abstract

The proliferation of services and service interactions within microservices and cloud-native applications makes it harder to detect failures and to identify their possible root causes, both of which are crucial for promptly recovering and fixing applications. Various techniques have been proposed to promptly detect failures based on their symptoms, viz., anomalous behaviour observed in one or more application services, as well as to analyse the logs or monitored performance of such services to determine the possible root causes of observed anomalies. The objective of this survey is to provide a structured overview and qualitative analysis of currently available techniques for anomaly detection and root cause analysis in modern multi-service applications. Some open challenges and research directions stemming from the analysis are also discussed.

Supplementary Material

soldani.zip: supplemental movie, appendix, image, and software files for "Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey"

Published In

ACM Computing Surveys, Volume 55, Issue 3
March 2023
772 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3514180

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 February 2022
Accepted: 01 November 2021
Revised: 01 October 2021
Received: 01 May 2021
Published in CSUR Volume 55, Issue 3

Author Tags

  1. Microservices
  2. multi-service applications
  3. failure detection
  4. anomaly detection
  5. root cause analysis

Qualifiers

  • Survey
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 2,169
  • Downloads (Last 6 weeks): 156

Reflects downloads up to 11 Sep 2024.

Cited By

  • (2024) Towards Future Vehicle Diagnostics in Software-Defined Vehicles. SAE Technical Paper Series. DOI: 10.4271/2024-01-2981. Online publication date: 2-Jul-2024.
  • (2024) Diagnosing and Identifying Standards Affecting on the Ready-Mix Concrete Production Plants Performance: An Analytical Study. Tikrit Journal of Engineering Sciences 31:1 (211-222). DOI: 10.25130/tjes.31.1.18. Online publication date: 9-Mar-2024.
  • (2024) ASOD: an adaptive stream outlier detection method using online strategy. Journal of Cloud Computing: Advances, Systems and Applications 13:1. DOI: 10.1186/s13677-024-00682-0. Online publication date: 5-Jul-2024.
  • (2024) HeMiRCA: Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources. ACM Transactions on Software Engineering and Methodology. DOI: 10.1145/3674726. Online publication date: 1-Jul-2024.
  • (2024) Leveraging Large Language Models for the Auto-remediation of Microservice Applications: An Experimental Study. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (358-369). DOI: 10.1145/3663529.3663855. Online publication date: 10-Jul-2024.
  • (2024) Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (126-137). DOI: 10.1145/3663529.3663834. Online publication date: 10-Jul-2024.
  • (2024) Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (50-61). DOI: 10.1145/3663529.3663827. Online publication date: 10-Jul-2024.
  • (2024) Towards Sustainable Deployment of Microservices over the Cloud-IoT Continuum, with FREEDA. Proceedings of the 4th Workshop on Flexible Resource and Application Management on the Edge (1-4). DOI: 10.1145/3659994.3660311. Online publication date: 3-Jun-2024.
  • (2024) Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (6049-6060). DOI: 10.1145/3637528.3671530. Online publication date: 25-Aug-2024.
  • (2024) MULAN: Multi-modal Causal Structure Learning and Root Cause Analysis for Microservice Systems. Proceedings of the ACM Web Conference 2024 (4107-4116). DOI: 10.1145/3589334.3645442. Online publication date: 13-May-2024.
