DOI: 10.1145/1835698.1835741

Adaptive system anomaly prediction for large-scale hosting infrastructures

Published: 25 July 2010

Abstract

Large-scale hosting infrastructures require automatic system anomaly management to achieve continuous system operation. In this paper, we present a novel adaptive runtime anomaly prediction system, called ALERT, to achieve robust hosting infrastructures. In contrast to traditional anomaly detection schemes, which report an anomaly only after it has occurred, ALERT aims to raise alerts in advance so that anomalies can be prevented just in time. We propose a novel context-aware anomaly prediction scheme to improve prediction accuracy in dynamic hosting infrastructures. We have implemented the ALERT system and deployed it on several production hosting infrastructures, such as the IBM System S stream processing cluster and PlanetLab. Our experiments show that ALERT can achieve high prediction accuracy for a range of system anomalies while imposing low overhead on the hosting infrastructure.
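
The idea of context-aware prediction can be illustrated with a toy sketch. This is not ALERT's algorithm, which the abstract does not specify. It assumes a single numeric metric, hypothetical context labels such as "low_load" and "high_load", and a simple nearest-mean rule per context. It only shows why the same metric value may call for an alert in one context and not in another.

```python
from collections import defaultdict
from statistics import mean


class ContextAwarePredictor:
    """Toy per-context nearest-mean classifier (illustrative assumption only)."""

    def __init__(self):
        # context label -> metric values observed during normal operation
        self._normal = defaultdict(list)
        # context label -> metric values observed shortly before an anomaly
        self._pre_anomaly = defaultdict(list)

    def train(self, context, value, precedes_anomaly):
        """Record one labelled sample for the given context."""
        bucket = self._pre_anomaly if precedes_anomaly else self._normal
        bucket[context].append(value)

    def predict(self, context, value):
        """Return True if an advance alert should be raised for this sample."""
        normal = self._normal.get(context)
        alert = self._pre_anomaly.get(context)
        if not normal or not alert:
            # No model for this context yet, so stay silent.
            return False
        # Raise an alert when the value is closer to the pre-anomaly mean
        # than to the normal mean of the current context.
        return abs(value - mean(alert)) < abs(value - mean(normal))


predictor = ContextAwarePredictor()
for v in (20, 25, 30):
    predictor.train("low_load", v, precedes_anomaly=False)
for v in (70, 75, 80):
    predictor.train("low_load", v, precedes_anomaly=True)
for v in (70, 75, 80):
    predictor.train("high_load", v, precedes_anomaly=False)
for v in (95, 97, 99):
    predictor.train("high_load", v, precedes_anomaly=True)

print(predictor.predict("low_load", 72))   # alert in the low-load context
print(predictor.predict("high_load", 72))  # no alert in the high-load context
```

A single context-free model trained on the same samples would see the value 72 labelled both ways. Keeping one model per context removes that conflict in this toy setting.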

Published In

PODC '10: Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
July 2010
494 pages
ISBN:9781605588889
DOI:10.1145/1835698

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. anomaly prediction
  2. context-aware prediction model

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 740 of 2,477 submissions, 30%

Article Metrics

  • Downloads (Last 12 months): 25
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 30 Aug 2024

Cited By

  • (2023) Prevent: An Unsupervised Approach to Predict Software Failures in Production. IEEE Transactions on Software Engineering 49:12 (5139-5153). DOI: 10.1109/TSE.2023.3327583. Online publication date: Dec-2023
  • (2023) Maat: Performance Metric Anomaly Anticipation for Cloud Services with Conditional Diffusion. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) (116-128). DOI: 10.1109/ASE56229.2023.00082. Online publication date: 11-Sep-2023
  • (2023) FSFP: A Fine-Grained Online Service System Performance Fault Prediction Method Based on Cross-attention. 2023 30th Asia-Pacific Software Engineering Conference (APSEC) (81-90). DOI: 10.1109/APSEC60848.2023.00018. Online publication date: 4-Dec-2023
  • (2022) Antifragile and Resilient Geographical Information System Service Delivery in Fog Computing. Sensors 22:22 (8778). DOI: 10.3390/s22228778. Online publication date: 14-Nov-2022
  • (2022) Multi-Resolution Inter-Level Refinement (MR-ILR) Architecture for Anomaly Prediction in IoT Data. 2022 International Conference on INnovations in Intelligent SysTems and Applications (INISTA) (1-6). DOI: 10.1109/INISTA55318.2022.9894159. Online publication date: 8-Aug-2022
  • (2021) Predicting Performance Anomalies in Software Systems at Run-time. ACM Transactions on Software Engineering and Methodology 30:3 (1-33). DOI: 10.1145/3440757. Online publication date: 23-Apr-2021
  • (2021) MADneSs: A Multi-Layer Anomaly Detection Framework for Complex Dynamic Systems. IEEE Transactions on Dependable and Secure Computing 18:2 (796-809). DOI: 10.1109/TDSC.2019.2908366. Online publication date: 1-Mar-2021
  • (2021) Anomaly Detection and Bottleneck Identification of The Distributed Application in Cloud Data Center using Software–Defined Networking. Egyptian Informatics Journal. DOI: 10.1016/j.eij.2021.01.001. Online publication date: Jan-2021
  • (2020) A Taxonomy of Techniques for SLO Failure Prediction in Software Systems. Computers 9:1 (10). DOI: 10.3390/computers9010010. Online publication date: 11-Feb-2020
  • (2020) Applying Machine Learning with Chaos Engineering. 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW) (151-152). DOI: 10.1109/ISSREW51248.2020.00057. Online publication date: Oct-2020
