Light-weight black-box failure detection for distributed systems

Published: 21 September 2012

Abstract

Detecting failures in distributed systems is challenging, as modern datacenters run a variety of applications. Current techniques for detecting failures often require training, have limited scalability, or produce results that are hard to interpret. We present LFD, a light-weight technique to quickly detect performance problems in distributed systems using only correlations of OS metrics. LFD is based on our hypothesis of server application behavior, does not require training, and detects failures with complexity linear in the number of nodes, with results that are interpretable by sysadmins. We further show that LFD is versatile: it can diagnose faults in Hadoop MapReduce systems and in multi-tier web request systems, and its results are intuitive to sysadmins.
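
To make the idea of a training-free check built on correlations of OS metrics more concrete, here is a minimal sketch. It is not the authors' LFD algorithm, which the abstract does not specify. The metric pair (CPU utilization and network throughput), the node names, the window contents, and the 0.5 threshold are all assumptions chosen purely for illustration. Each node is scored independently from its own samples, so the work grows linearly with the number of nodes.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant, since the
    correlation is undefined in that case.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_suspect_nodes(windows, threshold=0.5):
    """Return the names of nodes whose metric correlation is below threshold.

    windows maps a node name to (cpu_samples, net_samples) for one
    time window. Both the metric pair and the threshold are
    hypothetical, not taken from the paper.
    """
    suspects = []
    for node, (cpu, net) in windows.items():  # one pass: linear in node count
        if pearson(cpu, net) < threshold:
            suspects.append(node)
    return suspects


if __name__ == "__main__":
    # Made-up sample windows for three hypothetical nodes.
    windows = {
        "node-a": ([10, 20, 30, 40, 50], [1, 2, 3, 4, 5]),
        "node-b": ([10, 20, 30, 40, 50], [5, 4, 3, 2, 1]),
        "node-c": ([12, 22, 31, 39, 52], [2, 3, 4, 5, 7]),
    }
    print(flag_suspect_nodes(windows))
```

The per-node score is a single number between -1 and 1, which is the kind of output a sysadmin can read directly: a node whose CPU and network activity stop moving together stands out without any model having been trained.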

Published In

MBDS '12: Proceedings of the 2012 workshop on Management of big data systems
September 2012
48 pages
ISBN: 9781450317528
DOI: 10.1145/2378356

Sponsors

In-Cooperation

  • IEEE

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. diagnosis
  2. mapreduce
  3. web applications

Qualifiers

  • Research-article

Conference

ICAC '12: 9th International Conference on Autonomic Computing
September 21, 2012
San Jose, California, USA

Cited By

  • (2022) Themis: Fair Memory Subsystem Resource Sharing with Differentiated QoS in Public Clouds. Proceedings of the 51st International Conference on Parallel Processing, pp. 1-12. DOI: 10.1145/3545008.3545064. Online publication date: 29-Aug-2022
  • (2022) Demeter. Proceedings of the 13th Symposium on Cloud Computing, pp. 31-46. DOI: 10.1145/3542929.3563476. Online publication date: 7-Nov-2022
  • (2020) A New Statistical Method for Anomaly Detection in Distributed Systems. 2020 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 1-4. DOI: 10.1109/CCECE47787.2020.9255700. Online publication date: 30-Aug-2020
  • (2019) DMFD: Non-Intrusive Dependency Inference and Flow Ratio Model for Performance Anomaly Detection in Multi-Tier Cloud Applications. 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), pp. 164-173. DOI: 10.1109/CLOUD.2019.00036. Online publication date: Jul-2019
  • (2018) Secure cloud computing: Continuous anomaly detection approach in legal metrology. 2018 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), pp. 1-6. DOI: 10.1109/I2MTC.2018.8409767. Online publication date: May-2018
  • (2016) Improvements to Online Distributed Monitoring Systems. 2016 IEEE Trustcom/BigDataSE/ISPA, pp. 1093-1100. DOI: 10.1109/TrustCom.2016.0180. Online publication date: Aug-2016
  • (2016) LS-ADT: Lightweight and Scalable Anomaly Detection for Cloud Datacentres. Cloud Computing and Services Science, pp. 135-152. DOI: 10.1007/978-3-319-29582-4_8. Online publication date: 3-Feb-2016
  • (2014) Peer-Comparison Based Fault Diagnosis for Hadoop Systems. Applied Mechanics and Materials, vol. 621, pp. 235-240. DOI: 10.4028/www.scientific.net/AMM.621.235. Online publication date: Aug-2014
  • (2014) MiCA. Proceedings of the 2014 Brazilian Conference on Intelligent Systems, pp. 441-450. DOI: 10.1109/ICPP.2014.53. Online publication date: 18-Oct-2014
