A real-time trace-level root-cause diagnosis system in Alibaba datacenters
Z Cai, W Li, W Zhu, L Liu, B Yang - IEEE Access, 2019 - ieeexplore.ieee.org
Root-cause analysis (RCA) for service performance degradation can be challenging given the increasingly complex, interrelated, distributed infrastructure of today's enterprises. Many approaches have been applied in enterprise datacenters to improve maintenance efficiency. This paper introduces a novel graph-level RCA approach comprising tracing, weighted graph matching, and suspicion ranking. The approach builds on the performance profiling, tracing, and logging systems in Alibaba datacenters to speed up real-time root-cause diagnosis. Our system discovers normative patterns and their key graph properties, which are stored and updated offline as a knowledge base; this knowledge base is then used for trace-level risk estimation and for identifying transitions that deviate unexpectedly from the normative patterns. Through testing in production, we show that graph-level RCA is effective at discovering the origins of problems and generating real-time operational support. It greatly decreases the workload of locating the root cause of an anomaly.
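The abstract names the pipeline stages (normative patterns stored as a knowledge base, trace-level risk estimation, suspicion ranking) but gives no algorithmic detail. The following is a minimal illustrative sketch of that general idea and not the paper's method: it swaps in simple call-transition frequencies in place of weighted graph matching. All function names, the trace representation (a list of caller/callee pairs), and the service names are hypothetical.

```python
from collections import Counter, defaultdict


def build_knowledge_base(normal_traces):
    """Estimate P(callee | caller) from traces assumed to be normal.

    Each trace is a list of (caller, callee) edges. This frequency table
    is a stand-in for the paper's offline knowledge base (assumption).
    """
    counts = defaultdict(Counter)
    for trace in normal_traces:
        for caller, callee in trace:
            counts[caller][callee] += 1
    return {
        caller: {callee: n / sum(callees.values()) for callee, n in callees.items()}
        for caller, callees in counts.items()
    }


def trace_risk(trace, kb):
    """Return (mean risk, per-edge risk) for one trace.

    An edge's risk is 1 - P(callee | caller); edges never seen in the
    knowledge base get the maximum risk of 1.0.
    """
    per_edge = {}
    for caller, callee in trace:
        probability = kb.get(caller, {}).get(callee, 0.0)
        per_edge[(caller, callee)] = 1.0 - probability
    total = sum(per_edge.values()) / len(per_edge) if per_edge else 0.0
    return total, per_edge


def rank_suspects(traces, kb):
    """Accumulate edge risk onto each callee and sort, most suspicious first."""
    scores = defaultdict(float)
    for trace in traces:
        _, per_edge = trace_risk(trace, kb)
        for (_, callee), risk in per_edge.items():
            scores[callee] += risk
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical services; three identical normal traces form the baseline.
    normal = [[("gateway", "orders"), ("orders", "db")]] * 3
    kb = build_knowledge_base(normal)
    # The anomalous trace takes a transition absent from the baseline.
    anomalous = [[("gateway", "orders"), ("orders", "cache")]]
    for service, score in rank_suspects(anomalous, kb):
        print(service, score)
```

In this toy setup the unseen `orders` to `cache` transition receives the highest risk, so `cache` is ranked first. A real system would also weigh latency, error codes, and graph structure, none of which are modeled here.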