DOI: 10.1109/ICDE.2009.115
Article

Fa: A System for Automating Failure Diagnosis

Published: 29 March 2009

Abstract

    Failures of Internet services and enterprise systems lead to user dissatisfaction and considerable loss of revenue. Since manual diagnosis is often laborious and slow, there is considerable interest in tools that can diagnose the cause of failures quickly and automatically from system-monitoring data. This paper identifies two key data-mining problems arising in a platform for automated diagnosis called Fa. Fa uses monitoring data to construct a database of failure signatures against which data from undiagnosed failures can be matched. Two novel challenges we address are to make signatures robust to the noisy monitoring data in production systems, and to generate reliable confidence estimates for matches. Fa uses a new technique called anomaly-based clustering when the signature database has no high-confidence match for an undiagnosed failure. This technique clusters monitoring data based on how it differs from the failure data, and pinpoints attributes linked to the failure. We show the effectiveness of Fa through a comprehensive experimental evaluation based on failures from a production setting, a variety of failures injected in a testbed, and synthetic data.
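
    The abstract does not spell out how anomaly-based clustering works, so the following is only a rough illustration of the general idea of pinpointing attributes by how failure-time data differs from other monitoring data. It is not Fa's algorithm: it swaps in a plain per-attribute deviation score (a z-score against healthy samples) and performs no clustering. The function names, attribute names (`cpu`, `latency_ms`, `db_conns`), and sample values are all hypothetical.

```python
from statistics import mean, pstdev

def anomaly_scores(healthy, failure):
    """Score each attribute by how far its failure-time mean lies from
    its healthy mean, measured in healthy standard deviations.
    Stand-in heuristic only; not the technique from the paper."""
    scores = {}
    for attr in healthy[0]:
        base = [row[attr] for row in healthy]
        fail = [row[attr] for row in failure]
        # Fall back to 1.0 so constant attributes do not divide by zero.
        spread = pstdev(base) or 1.0
        scores[attr] = abs(mean(fail) - mean(base)) / spread
    return scores

def rank_attributes(healthy, failure, top_k=2):
    """Return the top_k attribute names with the largest deviation."""
    scores = anomaly_scores(healthy, failure)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical monitoring samples: one dict of attribute values per interval.
healthy = [
    {"cpu": 0.30, "latency_ms": 100.0, "db_conns": 20.0},
    {"cpu": 0.35, "latency_ms": 110.0, "db_conns": 22.0},
    {"cpu": 0.32, "latency_ms": 90.0, "db_conns": 18.0},
]
failure = [
    {"cpu": 0.33, "latency_ms": 400.0, "db_conns": 21.0},
    {"cpu": 0.31, "latency_ms": 420.0, "db_conns": 19.0},
]

print(rank_attributes(healthy, failure, top_k=1))
```

    In this made-up data only `latency_ms` moves far from its healthy range, so it ranks first. A real system would also need the robustness to noisy monitoring data and the confidence estimates that the abstract names as challenges, which this sketch does not attempt.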



    Published In

    ICDE '09: Proceedings of the 2009 IEEE International Conference on Data Engineering
    March 2009
    1772 pages
    ISBN: 9780769535456

    Publisher

    IEEE Computer Society

    United States


    Cited By

    • (2023) Runtime Variation in Big Data Analytics. Proceedings of the ACM on Management of Data 1:1 (1-20). DOI: 10.1145/3588921. Online publication date: 30-May-2023
    • (2022) Multi-Tenant Cloud Data Services: State-of-the-Art, Challenges and Opportunities. Proceedings of the 2022 International Conference on Management of Data (2465-2473). DOI: 10.1145/3514221.3522566. Online publication date: 10-Jun-2022
    • (2021) HALO: Hierarchy-aware Fault Localization for Cloud Systems. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (3948-3958). DOI: 10.1145/3447548.3467190. Online publication date: 14-Aug-2021
    • (2020) DeCaf. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice (201-210). DOI: 10.1145/3377813.3381353. Online publication date: 27-Jun-2020
    • (2020) Anomaly Detection Models Based on Context-Aware Sequential Long Short-Term Memory Learning. 2019 IEEE Global Communications Conference (GLOBECOM) (1-6). DOI: 10.1109/GLOBECOM38437.2019.9014287. Online publication date: 17-Jun-2020
    • (2019) Griffon. Proceedings of the ACM Symposium on Cloud Computing (441-452). DOI: 10.1145/3357223.3362716. Online publication date: 20-Nov-2019
    • (2016) DBSherlock. Proceedings of the 2016 International Conference on Management of Data (1599-1614). DOI: 10.1145/2882903.2915218. Online publication date: 26-Jun-2016
    • (2016) Domain-independent planning for services in uncertain and dynamic environments. Artificial Intelligence 236:C (30-64). DOI: 10.1016/j.artint.2016.03.002. Online publication date: 1-Jul-2016
    • (2015) ProvErr. Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (525-534). DOI: 10.1109/CCGrid.2015.86. Online publication date: 4-May-2015
    • (2014) The SIOX Architecture - Coupling Automatic Monitoring and Optimization of Parallel I/O. Proceedings of the 29th International Conference on Supercomputing - Volume 8488 (245-260). DOI: 10.1007/978-3-319-07518-1_16. Online publication date: 22-Jun-2014
