DOI: 10.1145/3038912.3052649

Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications

Published: 03 April 2017

Abstract

In this paper, we describe Roots, a system for automatically identifying the "root cause" of performance anomalies in web applications deployed in Platform-as-a-Service (PaaS) clouds. Roots does not require application-level instrumentation. Instead, it uses a combination of metadata injection and platform-level instrumentation to track the events that application requests trigger within the PaaS cloud.
We describe the extensible architecture of Roots, a prototype implementation of the system, and a statistical methodology for performance anomaly detection and diagnosis. We evaluate the efficacy of Roots using a set of PaaS-hosted web applications, and detail the performance overhead and scalability of the implementation.
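
The idea of tracking platform events through metadata injection can be illustrated with a minimal sketch. This is not the Roots implementation: the header name, the function names, and the service labels used with it are hypothetical, and the sketch only assumes that each request is tagged with an identifier that platform services copy into the events they record, so that events can later be grouped per request without touching application code.

```python
import uuid
from collections import defaultdict

# Hypothetical header name; the abstract does not say what metadata is injected.
REQUEST_ID_HEADER = "X-Request-Id"


def inject_request_id(headers):
    """Return a copy of the headers that carries a request ID (kept if already present)."""
    tagged = dict(headers)
    tagged.setdefault(REQUEST_ID_HEADER, uuid.uuid4().hex)
    return tagged


def record_event(log, headers, service, elapsed_ms):
    """Platform-side hook: append one event labeled with the request's ID."""
    log.append({
        "request_id": headers[REQUEST_ID_HEADER],
        "service": service,
        "elapsed_ms": elapsed_ms,
    })


def group_by_request(log):
    """Correlate events by request ID into {request_id: [(service, elapsed_ms), ...]}."""
    grouped = defaultdict(list)
    for event in log:
        grouped[event["request_id"]].append((event["service"], event["elapsed_ms"]))
    return dict(grouped)


def slowest_service(events):
    """Return the service with the largest recorded time among one request's events."""
    return max(events, key=lambda pair: pair[1])[0]
```

Grouping by an injected identifier is one simple way to attribute time spent in platform services to an individual request; the statistical detection and diagnosis methodology mentioned in the abstract is not reproduced here.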

Published In

WWW '17: Proceedings of the 26th International Conference on World Wide Web
April 2017
1678 pages
ISBN: 9781450349130

Sponsors

  • IW3C2: International World Wide Web Conference Committee

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland

Author Tags

  1. application performance monitoring
  2. cloud computing
  3. platform-as-a-service
  4. root cause analysis
  5. web services

Qualifiers

  • Research-article

Funding Sources

  • NIH
  • ONR NEEC
  • Huawei Technologies
  • NSF
  • California Energy Commission

Conference

WWW '17
Sponsor:
  • IW3C2

Acceptance Rates

WWW '17 Paper Acceptance Rate 164 of 966 submissions, 17%;
Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

  • (2024) Making Sense of Multi-threaded Application Performance at Scale with NonSequitur. Proceedings of the ACM on Programming Languages 8:OOPSLA2 (2325-2354). DOI: 10.1145/3689793. Online publication date: 8-Oct-2024
  • (2024) Disambiguating Performance Anomalies from Workload Changes in Cloud-Native Applications. Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering (286-297). DOI: 10.1145/3629526.3645046. Online publication date: 7-May-2024
  • (2024) Trace-based Multi-Dimensional Root Cause Localization of Performance Issues in Microservice Systems. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-12). DOI: 10.1145/3597503.3639088. Online publication date: 20-May-2024
  • (2024) Optimizing I/O Performance Through Effective vCPU Scheduling Interference Management. IEEE Transactions on Parallel and Distributed Systems 35:12 (2315-2330). DOI: 10.1109/TPDS.2023.3329298. Online publication date: Dec-2024
  • (2024) Syscall Analysis for Resource Stress Identification for Container Network Functions. 2024 IEEE 17th International Conference on Cloud Computing (CLOUD) (256-266). DOI: 10.1109/CLOUD62652.2024.00037. Online publication date: 7-Jul-2024
  • (2023) Detecting Software Anomalies Using Spectrograms and Convolutional Neural Network. Proceedings of the 33rd Annual International Conference on Computer Science and Software Engineering (44-53). DOI: 10.5555/3615924.3615929. Online publication date: 11-Sep-2023
  • (2023) Multitier Web System Reliability: Identifying Causative Metrics and Analyzing Performance Anomaly Using a Regression Model. Sensors 23:4 (1919). DOI: 10.3390/s23041919. Online publication date: 8-Feb-2023
  • (2023) Diffusion-Based Time Series Data Imputation for Cloud Failure Prediction at Microsoft 365. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2050-2055). DOI: 10.1145/3611643.3613866. Online publication date: 30-Nov-2023
  • (2023) Locating Anomaly Clues for Atypical Anomalous Services: An Industrial Exploration. IEEE Transactions on Dependable and Secure Computing 20:4 (2746-2761). DOI: 10.1109/TDSC.2022.3181143. Online publication date: 1-Jul-2023
  • (2023) CONAN: Diagnosing Batch Failures for Cloud Systems. 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (138-149). DOI: 10.1109/ICSE-SEIP58684.2023.00018. Online publication date: May-2023
