Adaptive system anomaly prediction for large-scale hosting infrastructures

Authors:

Haixun Wang

PODC '10: Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing

Pages 173 - 182

https://doi.org/10.1145/1835698.1835741

Published: 25 July 2010

Abstract

Large-scale hosting infrastructures require automatic system anomaly management to achieve continuous system operation. In this paper, we present a novel adaptive runtime anomaly prediction system, called ALERT, to achieve robust hosting infrastructures. In contrast to traditional anomaly detection schemes, ALERT aims to raise advance anomaly alerts to enable just-in-time anomaly prevention. We propose a novel context-aware anomaly prediction scheme to improve prediction accuracy in dynamic hosting infrastructures. We have implemented the ALERT system and deployed it on several production hosting infrastructures, such as the IBM System S stream processing cluster and PlanetLab. Our experiments show that ALERT achieves high prediction accuracy for a range of system anomalies and imposes low overhead on the hosting infrastructure.

Index Terms

Adaptive system anomaly prediction for large-scale hosting infrastructures
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. General and reference
  1. Cross-computing tools and techniques
    1. Reliability

Published In

PODC '10: Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing

July 2010

494 pages

ISBN: 9781605588889

DOI: 10.1145/1835698

General Chairs: Andrea Richa (Arizona State University, USA), Rachid Guerraoui (EPFL, Switzerland)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PODC '10

PODC '10: ACM Symposium on Principles of Distributed Computing

July 25 - 28, 2010

Zurich, Switzerland

Acceptance Rates

Overall Acceptance Rate 740 of 2,477 submissions, 30%
