Identifying the Root Causes of Wait States in Large-Scale Parallel Applications

Authors:

Felix Voigtlaender,

Felix Wolf

ACM Transactions on Parallel Computing (TOPC), Volume 3, Issue 2

Article No.: 11, Pages 1 - 24

https://doi.org/10.1145/2934661

Published: 20 July 2016

Abstract

Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may spread wait states across the entire machine. Moreover, when employing complex point-to-point communication patterns, wait states may propagate along far-reaching cause-effect chains that are hard to track manually and that complicate an assessment of the actual costs of an imbalance. Building on earlier work by Meira, Jr., et al., we present a scalable approach that identifies program wait states and attributes their costs in terms of resource waste to their original cause. By replaying event traces in parallel both forward and backward, we can identify the processes and call paths responsible for the most severe imbalances, even for runs with hundreds of thousands of processes.
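To make the idea of wait-state propagation concrete, the following is a minimal sketch on a toy pipeline in which each rank receives from its left neighbor before sending to its right neighbor. It is not the authors' implementation or cost model: the `Rank` record, the functions `forward_replay` and `backward_replay`, and the attribution rule (a wait is first charged to the sender's own earlier wait, and only the remainder to the sender itself) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rank:
    # Hypothetical toy model: computation time before the blocking receive
    # and computation time between the receive and the send.
    work_before_recv: float
    work_after_recv: float


def forward_replay(ranks):
    """Walk the pipeline in execution order and compute, for each rank,
    the time at which it sends and how long it waited in its receive."""
    send_times, waits = [], []
    for i, r in enumerate(ranks):
        ready = r.work_before_recv
        # Rank 0 has no receive; other ranks wait if the sender is late.
        wait = 0.0 if i == 0 else max(0.0, send_times[i - 1] - ready)
        waits.append(wait)
        send_times.append(ready + wait + r.work_after_recv)
    return send_times, waits


def backward_replay(waits):
    """Walk the pipeline against the message direction and attribute each
    wait to the upstream ranks that caused it (simplified rule, see above)."""
    blame = [0.0] * len(waits)
    for i in range(len(waits) - 1, 0, -1):
        remaining = waits[i]
        j = i - 1
        while remaining > 0 and j >= 0:
            # The part of the wait that the sender merely passed on.
            propagated = min(remaining, waits[j])
            # The rest is caused by the sender's own excess computation.
            blame[j] += remaining - propagated
            remaining = propagated
            j -= 1
    return blame


ranks = [Rank(5.0, 0.0), Rank(2.0, 1.0), Rank(2.0, 0.0)]
send_times, waits = forward_replay(ranks)
blame = backward_replay(waits)
print(waits)  # [0.0, 3.0, 4.0]
print(blame)  # [6.0, 1.0, 0.0]
```

In this toy run, rank 2 waits longest, yet most of the total waiting time is charged to rank 0, whose long computation delays the whole chain. The attributed costs sum to the total waiting time, so nothing is counted twice.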

References

[1]

Accelerated Strategic Computing Initiative. 1995. The ASCI SWEEP3D Benchmark Code. (1995). http://www.ccs3.lanl.gov/pal/software/sweep3d/sweep3d_readme.html.

[2]

Laksono Adhianto, Sinchan Banerjee, Michael W. Fagan, Mark Krentel, Gabriel Marin, John Mellor-Crummey, and Nathan R. Tallent. 2010. HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience (April 2010).

[3]

Daniel Becker, Rolf Rabenseifner, Felix Wolf, and John C. Linford. 2009. Scalable timestamp synchronization for event traces of message-passing applications. Parallel Computing 35, 12 (2009), 595--607.

[4]

David Böhme, Bronis R. de Supinski, Markus Geimer, Martin Schulz, and Felix Wolf. 2012. Scalable critical-path based performance analysis. In Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS). 1330--1340.

[5]

David Böhme, Markus Geimer, Felix Wolf, and Lukas Arnold. 2010. Identifying the root causes of wait states in large-scale parallel applications. In Proceedings of the 39th International Conference on Parallel Processing (ICPP). IEEE Computer Society, 90--100. Best Paper Award.

[6]

Maria Calzarossa, Luisa Massari, and Daniele Tessera. 2004. A methodology towards automatic performance analysis of parallel applications. Parallel Computing 30, 2 (Feb. 2004), 211--223.

[7]

Todd Gamblin, Bronis R. de Supinski, Martin Schulz, Rob Fowler, and Daniel A. Reed. 2008. Scalable load-balance measurement for SPMD codes. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC’08).

[8]

Markus Geimer, Felix Wolf, Brian J. N. Wylie, and Bernd Mohr. 2009. A scalable tool architecture for diagnosing wait states in massively-parallel applications. Parallel Computing 35, 7 (2009), 375--388.

[9]

Michael Geissler, S. Rykovanov, Jörg Schreiber, Jürgen Meyer ter Vehn, and G. D. Tsakiris. 2007. 3D simulations of surface harmonic generation with few-cycle laser pulses. New Journal of Physics 9, 7 (2007), 218.

[10]

Michael Geissler, Jörg Schreiber, and Jürgen Meyer ter Vehn. 2006. Bubble acceleration of electrons with few-cycle laser pulses. New Journal of Physics 8, 9 (2006), 186.

[11]

John C. Hayes, Michael L. Norman, Robert A. Fiedler, James O. Bordner, Pak Shing Li, Stephen E. Clark, Asif Ud-Doula, and Mordecai-Mark MacLow. 2006. Simulating radiating and magnetized flows in multi-dimensions with ZEUS-MP. Astrophysical Journal Supplement 165 (2006), 188--228.

[12]

Marc-André Hermanns, Manfred Miklosch, David Böhme, and Felix Wolf. 2013. Understanding the formation of wait states in applications with one-sided communication. In Proceedings of the 20th European MPI Users’ Group Meeting (EuroMPI’13). ACM, New York, NY, 73--78.

[13]

Adolfy Hoisie, Olaf Lubeck, and Harvey Wasserman. 1999. Performance analysis of wavefront algorithms on very-large scale distributed systems. In Proceedings of the Workshop on Wide Area Networks and High Performance Computing, Lecture Notes in Control and Information Sciences, Vol. 249. Springer Berlin/Heidelberg, 171--187.

[14]

Jeffrey K. Hollingsworth. 1996. An online computation of critical path profiling. In Proceedings of the 1st ACM SIGMETRICS Symposium on Parallel and Distributed Tools. 11--20.

[15]

Hassan M. Jafri. 2007. Measuring causal propagation of overhead of inefficiencies in parallel applications. In Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems. Cambridge, MA, 237--243.

[16]

Allen D. Malony, Sameer S. Shende, and Alan Morris. 2005. Phase-based parallel performance profiling. In Proceedings of the Conference on Parallel Computing (ParCo, Malaga, Spain) (NIC Series), Vol. 33. John von Neumann Institute for Computing, 203--210.

[17]

Wagner Meira, Jr., Thomas J. LeBlanc, and Virgílio A. F. Almeida. 1998. Using cause-effect analysis to understand the performance of distributed programs. In Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT’98). ACM, New York, NY, 101--111.

[18]

Wagner Meira, Jr., Thomas J. LeBlanc, and Alexandros Poulos. 1996. Waiting time analysis and performance visualization in Carnival. In Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT’96). ACM, New York, NY, 1--10.

[19]

Oleg Morajko, Anna Morajko, Tomas Margalef, and Emilio Luque. 2008. On-line performance modeling for MPI applications. In Proceedings of the 14th Euro-Par Conference, Lecture Notes in Computer Science, Vol. 5168. Springer, 68--77.

[20]

Martin Schulz. 2005. Extracting critical path graphs from MPI applications. In Proceedings of the IEEE Cluster Conference. Boston, MA.

[21]

Martin Schulz, Greg Bronevetsky, and Bronis R. de Supinski. 2008. On the performance of transparent MPI piggyback messages. In Proceedings of the 15th European PVM/MPI Users’ Group Meeting, Lecture Notes in Computer Science, Vol. 5205. Springer, 194--201.

[22]

David Sundaram-Stukel and Mary K. Vernon. 1999. Predictive analysis of a wavefront application using LogGP. In Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, Vol. 34. 141--150.

[23]

Zoltán Szebenyi, Felix Wolf, and Brian J. N. Wylie. 2009. Space-efficient time-series call-path profiling of parallel applications. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC’09).

[24]

Zoltán Szebenyi, Brian J. N. Wylie, and Felix Wolf. 2008. Scalasca parallel performance analyses of SPEC MPI2007 applications. In Proceedings of the 1st SPEC International Performance Evaluation Workshop, Lecture Notes in Computer Science, Vol. 5119. Springer, 99--123.

[25]

Nathan R. Tallent, Laksono Adhianto, and John Mellor-Crummey. 2010. Scalable identification of load imbalance in parallel executions using call path profiles. In Supercomputing 2010. New Orleans, LA.

[26]

University Corporation for Atmospheric Research (UCAR). 2012. The Community Earth System Model. (Feb. 2012). http://www.cesm.ucar.edu/.

[27]

Jeffrey Vetter (Ed.). 2007. Report of the Workshop on Software Development Tools for Petascale Computing. (August 2007). U.S. Department of Energy, http://www.csm.ornl.gov/workshops/Petascale07/sdtpc_workshop_report.pdf.

[28]

Brian J. N. Wylie. 2012. Parallel performance measurement and analysis scaling lessons. SC’12 Workshop on Extreme-Scale Performance Tools.

[29]

Brian J. N. Wylie, David Böhme, Bernd Mohr, Zoltán Szebenyi, and Felix Wolf. 2010. Performance analysis of Sweep3D on Blue Gene/P with the Scalasca toolset. In Proceedings of the 24th International Parallel & Distributed Processing Symposium and Workshops (IPDPS). IEEE Computer Society.

Published In

ACM Transactions on Parallel Computing Volume 3, Issue 2

August 2016

154 pages

ISSN:2329-4949

EISSN:2329-4957

DOI:10.1145/2974644

Editor:
Phillip B. Gibbons
Carnegie Mellon University, Pittsburgh, USA

Copyright © 2016 ACM.

© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2016

Accepted: 01 May 2016

Revised: 01 October 2014

Received: 01 July 2013

Published in TOPC Volume 3, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

G8 Research Councils Initiative on Multilateral Research
Deutsche Forschungsgemeinschaft (German Research Foundation)
U.S. Department of Energy by Lawrence Livermore National Laboratory
Interdisciplinary Program on Application Software towards Exascale Computing for Global Scale Issues
Helmholtz Association of German Research Centers
