Root cause analysis of anomalies of multitier services in public clouds
J Weng, JH Wang, J Yang, Y Yang
IEEE/ACM Transactions on Networking, 2018
Anomalies in a tenant's multitier service running on a cloud platform can be caused by the tenant's own components or by performance interference from other tenants. When the performance of a multitier service degrades, the root causes must be identified precisely so that the service can be recovered as soon as possible. In this paper, we argue that cloud providers are in a better position than tenants to solve this problem, and that the solution should be non-intrusive to tenants' services or applications. Based on these two considerations, we propose a solution that lets cloud providers help tenants localize the root causes of any anomaly. With this solution, cloud operators can find the root causes of any anomaly, regardless of whether they lie in the same tenant as the anomaly or in other tenants. In particular, we describe a non-intrusive method for capturing the dependency relationships among components, which improves feasibility. During localization, we exploit measurement data from both the application layer and the underlay infrastructure, and our two-step localization algorithm includes a random walk procedure to model anomaly propagation probability. These techniques improve the accuracy of root cause localization. Our small-scale real-world experiments and large-scale simulation experiments show a 15%-71% improvement in mean average precision over current methods in different scenarios.
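The abstract does not spell out the random walk procedure, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a dependency graph given as a plain dictionary, a hypothetical per-component anomaly score, and a restart probability; the names `rank_root_causes`, `depends_on`, `anomaly_score`, and `restart` are all illustrative. The walk starts at the component where the anomaly was observed, moves along dependency edges in proportion to the anomaly scores, and ranks components by how much probability mass they accumulate.

```python
def rank_root_causes(depends_on, anomaly_score, start, restart=0.15, iters=100):
    """Rank components by the stationary probability of a random walk with restart.

    depends_on: dict mapping a component to the components it calls (assumed input).
    anomaly_score: dict mapping a component to a non-negative score (hypothetical metric).
    start: component where the anomaly was observed.
    """
    nodes = set(depends_on) | {start}
    for deps in depends_on.values():
        nodes.update(deps)

    # Build transition probabilities: from each node the walker may stay put
    # or follow a dependency edge, weighted by the target's anomaly score.
    trans = {}
    for node in nodes:
        weights = {}
        for cand in [node] + list(depends_on.get(node, [])):
            weights[cand] = weights.get(cand, 0.0) + anomaly_score.get(cand, 0.0)
        total = sum(weights.values())
        if total > 0:
            trans[node] = {c: w / total for c, w in weights.items()}
        else:
            trans[node] = {start: 1.0}  # nowhere suspicious to go: jump back to start

    # Power iteration: a deterministic way to compute the walk's long-run
    # visit probabilities without sampling.
    prob = {n: 0.0 for n in nodes}
    prob[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        nxt[start] += restart
        for node, p in prob.items():
            for cand, t in trans[node].items():
                nxt[cand] += (1.0 - restart) * p * t
        prob = nxt

    return sorted(prob.items(), key=lambda item: item[1], reverse=True)


# Toy three-tier example with made-up scores.
toy_graph = {"web": ["app"], "app": ["db", "cache"], "db": [], "cache": []}
toy_scores = {"web": 0.2, "app": 0.3, "db": 0.9, "cache": 0.1}
for component, probability in rank_root_causes(toy_graph, toy_scores, "web"):
    print(f"{component}: {probability:.3f}")
```

In this toy setup the database tier accumulates the most probability mass because its score dominates along the path from the front end. A real system would derive both the graph and the scores from measurements, which this sketch does not attempt.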
ieeexplore.ieee.org