DOI: 10.1145/1555349.1555360
research-article

Reference-driven performance anomaly identification

Published: 15 June 2009

    Abstract

    Complex system software runs under a wide variety of execution conditions, spanning system configurations and workload properties. This paper explores a principled use of reference executions (those whose execution conditions are similar to the target's) to help identify the symptoms and causes of performance anomalies. First, to identify anomaly symptoms, we construct change profiles that probabilistically characterize the expected performance deviations between target and reference executions. By synthesizing several single-parameter change profiles, we can scalably identify anomalous reference-to-target changes in a complex system with multiple execution parameters. Second, to narrow the scope of anomaly root cause analysis, we filter anomaly-related low-level system metrics as those that manifest very differently between target and reference executions. Our anomaly identification approach requires little expert knowledge and no detailed models of system internals, so it can be easily deployed. Using empirical case studies on the Linux I/O subsystem and a J2EE-based distributed online service, we demonstrate our approach's effectiveness in identifying performance anomalies over a wide range of execution conditions as well as across multiple system software versions. In particular, we discovered five previously unknown performance anomaly causes in the Linux 2.6.23 kernel. Additionally, our preliminary results suggest that online anomaly detection and system reconfiguration may help evade performance anomalies in complex online systems.
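
    The two steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a single execution parameter, treats performance as throughput (higher is better), models a change profile as the empirical distribution of relative changes over paired sample runs, and compares metrics by scalar averages. The function names build_change_profile, p_value, and rank_metrics are hypothetical, as are all the sample numbers.

```python
from bisect import bisect_right


def build_change_profile(ref_perf, tgt_perf):
    """Sorted relative changes (target vs. reference) over paired sample runs.

    Assumption: each pair differs in one execution parameter only.
    """
    return sorted((t - r) / r for r, t in zip(ref_perf, tgt_perf))


def p_value(profile, observed_change):
    """Fraction of profiled changes that are at most the observed change.

    A small value means the observed degradation is rarely seen in the
    profile, so the reference-to-target change is a likely anomaly symptom.
    """
    return bisect_right(profile, observed_change) / len(profile)


def rank_metrics(ref_metrics, tgt_metrics):
    """Order metric names by relative target-vs-reference difference, largest first.

    Assumption: each metric is summarized by one average value per execution.
    """
    def rel_diff(name):
        r, t = ref_metrics[name], tgt_metrics[name]
        return abs(t - r) / max(abs(r), 1e-12)

    return sorted(ref_metrics, key=rel_diff, reverse=True)


# Hypothetical throughput samples for ten reference/target pairs.
ref_throughput = [100.0] * 10
tgt_throughput = [95, 98, 102, 90, 105, 99, 101, 97, 60, 103]
profile = build_change_profile(ref_throughput, tgt_throughput)

# How unusual is a 35% throughput drop under this profile?
print(p_value(profile, -0.35))

# Hypothetical low-level metrics averaged over each execution.
ref_metrics = {"ctx_switches": 1000, "disk_seeks": 200, "cache_misses": 500}
tgt_metrics = {"ctx_switches": 1100, "disk_seeks": 900, "cache_misses": 520}
print(rank_metrics(ref_metrics, tgt_metrics))
```

    A low p-value flags the reference-to-target change as a candidate anomaly symptom, and the metrics at the head of the ranking are the first ones to inspect when narrowing down the cause. Combining several single-parameter profiles for multi-parameter changes, as the abstract describes, is left out of this sketch.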

    Published In

    SIGMETRICS '09: Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
    June 2009
    336 pages
    ISBN: 9781605585116
    DOI: 10.1145/1555349

    ACM SIGMETRICS Performance Evaluation Review, Volume 37, Issue 1
    SIGMETRICS '09
    June 2009
    320 pages
    ISSN: 0163-5999
    DOI: 10.1145/2492101

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. operating system
    2. performance anomaly

    Qualifiers

    • Research-article

    Conference

    SIGMETRICS09

    Acceptance Rates

    Overall Acceptance Rate 459 of 2,691 submissions, 17%
