DOI: 10.1109/ICPP.2010.18
Article

Identifying the Root Causes of Wait States in Large-Scale Parallel Applications

Published: 13 September 2010
Abstract

    Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may spread wait states across the entire machine. Moreover, when employing complex point-to-point communication patterns, wait states may propagate along far-reaching cause-effect chains that are hard to track manually and that complicate an assessment of the actual costs of an imbalance. Building on earlier work by Meira Jr. et al., we present a scalable approach that identifies program wait states and attributes their costs in terms of resource waste to their original cause. By replaying event traces in parallel both in forward and backward direction, we can identify the processes and call paths responsible for the most severe imbalances even for runs with tens of thousands of processes.
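
    The idea of charging waiting time to its original cause can be illustrated with a toy model. The sketch below is not the algorithm from the paper; it is a minimal illustration under stated assumptions: only late-sender wait states are considered (a receiver blocks because the matching send starts later), messages are given in chronological order, and a process that was itself delayed by waiting passes the blame for later waits upstream. The names `Message`, `late_sender_wait`, and `attribute_wait_costs` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: int        # rank that sends
    receiver: int      # rank that receives
    send_enter: float  # time the sender entered its send call
    recv_enter: float  # time the receiver entered its receive call


def late_sender_wait(msg: Message) -> float:
    """Time the receiver spends blocked because the send started late."""
    return max(0.0, msg.send_enter - msg.recv_enter)


def attribute_wait_costs(messages: list[Message]) -> dict[int, float]:
    """Toy two-pass attribution of waiting time to root-cause processes.

    Assumption: `messages` is in chronological order.
    """
    # Forward pass: detect the wait state caused by each message.
    waits = [late_sender_wait(m) for m in messages]

    # Backward pass: walk the messages in reverse. If a receiver has
    # already been blamed for later waits, the part explained by its
    # own waiting here is handed on to the sender of this message.
    blame: defaultdict[int, float] = defaultdict(float)
    for msg, wait in zip(reversed(messages), reversed(waits)):
        if wait == 0.0:
            continue
        inherited = min(blame[msg.receiver], wait)
        blame[msg.receiver] -= inherited
        blame[msg.sender] += wait + inherited
    return dict(blame)


if __name__ == "__main__":
    # Rank 0 is slow; rank 1 waits for it, and rank 2 waits for rank 1.
    chain = [
        Message(sender=0, receiver=1, send_enter=5.0, recv_enter=2.0),
        Message(sender=1, receiver=2, send_enter=6.0, recv_enter=3.0),
    ]
    print(attribute_wait_costs(chain))
```

    In this example both waits, one direct and one propagated along the chain, end up charged to rank 0, which mirrors the cause-effect chains the abstract describes. Real traces would also need collective operations, other wait-state types, and call-path information, all of which this sketch leaves out.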


        Published In

        ICPP '10: Proceedings of the 2010 39th International Conference on Parallel Processing
        September 2010
        702 pages
        ISBN: 9780769541563

        Publisher

        IEEE Computer Society

        United States

        Author Tags

        1. parallel program performance analysis
        2. root cause analysis

        Qualifiers

        • Article

        Cited By

        • (2022) PerFlow. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 177-191. DOI: 10.1145/3503221.3508405. Online publication date: 2-Apr-2022.
        • (2016) Identifying the Root Causes of Wait States in Large-Scale Parallel Applications. ACM Transactions on Parallel Computing 3:2, pp. 1-24. DOI: 10.1145/2934661. Online publication date: 20-Jul-2016.
        • (2014) Scalable parallel performance measurement and analysis tools - state-of-the-art and future challenges. Supercomputing Frontiers and Innovations: an International Journal 1:2, pp. 108-123. DOI: 10.14529/jsfi140207. Online publication date: 9-Jul-2014.
        • (2014) Catching Idlers with Ease. Proceedings of the 21st European MPI Users' Group Meeting, pp. 103-108. DOI: 10.1145/2642769.2642783. Online publication date: 9-Sep-2014.
        • (2013) Effective sampling-driven performance tools for GPU-accelerated supercomputers. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI: 10.1145/2503210.2503299. Online publication date: 17-Nov-2013.
        • (2013) Understanding the formation of wait states in applications with one-sided communication. Proceedings of the 20th European MPI Users' Group Meeting, pp. 73-78. DOI: 10.1145/2488551.2488569. Online publication date: 15-Sep-2013.
        • (2013) A scalable infrastructure for the performance analysis of passive target synchronization. Parallel Computing 39:3, pp. 132-145. DOI: 10.1016/j.parco.2012.09.002. Online publication date: 1-Mar-2013.
        • (2012) ADP. ACM SIGMETRICS Performance Evaluation Review 40:1, pp. 283-294. DOI: 10.1145/2318857.2254791. Online publication date: 11-Jun-2012.
        • (2012) ADP. Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems, pp. 283-294. DOI: 10.1145/2254756.2254791. Online publication date: 11-Jun-2012.
