DOI: 10.1145/3605731.3605884
Research Article

Measuring Thread Timing to Assess the Feasibility of Early-bird Message Delivery

Published: 07 September 2023

Abstract

Early-bird communication is a communication/computation overlap technique that combines fine-grained communication with partitioned communication to reduce application run time. Communication is divided among the compute threads so that each thread can initiate transmission of its portion of the data as soon as its computation completes, rather than waiting for all of the threads to finish. However, the benefit of early-bird communication depends on the completion timing of the individual threads.
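
For illustration, the sketch below shows one way this pattern can be expressed with MPI 4.0 partitioned communication (MPI_Psend_init/MPI_Pready), with each OpenMP thread marking its partition ready as soon as its share of the work is done. This is a minimal sketch under stated assumptions, not the paper's code: the two-rank setup, one-partition-per-thread layout, and compute_partition() are illustrative.

```c
/*
 * Minimal sketch of early-bird delivery via MPI 4.0 partitioned
 * communication. Assumptions (not from the paper): two ranks, one
 * partition per compute thread, and a placeholder compute_partition().
 */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define NPART 8    /* one partition per compute thread (assumption) */
#define COUNT 1024 /* doubles per partition (assumption) */

/* Placeholder for one thread's share of the computation. */
static void compute_partition(double *buf, int t)
{
    for (int i = 0; i < COUNT; i++)
        buf[t * COUNT + i] = t + i * 1e-6;
}

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(NPART * COUNT * sizeof *buf);
    MPI_Request req;

    if (rank == 0) { /* sender */
        MPI_Psend_init(buf, NPART, COUNT, MPI_DOUBLE, 1, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        #pragma omp parallel num_threads(NPART)
        {
            int t = omp_get_thread_num();
            compute_partition(buf, t);
            /* Early-bird: this partition may be transmitted now,
             * without waiting for the other threads. */
            MPI_Pready(t, req);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    } else if (rank == 1) { /* receiver */
        MPI_Precv_init(buf, NPART, COUNT, MPI_DOUBLE, 0, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Because MPI_Pready is called per thread, the library is free to begin transmitting a partition while laggard threads are still computing, which is exactly the overlap window this paper quantifies.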
In this paper, we measure and evaluate the potential overlap, i.e., the idle time each thread experiences between finishing its computation and the final thread finishing. These measurements help us understand whether a given application could benefit from early-bird communication. We present our technique for gathering this data and evaluate data collected from three proxy applications: MiniFE, MiniMD, and MiniQMC. To characterize the behavior of these workloads, we study the thread timings at both a macro level, i.e., across all threads in all runs of an application, and a micro level, i.e., within a single process of a single run. We observe that these applications exhibit significantly different behavior. While MiniFE and MiniQMC appear well-suited for early-bird communication because of their wider thread-completion distributions and more frequent laggard threads, the behavior of MiniMD may limit its ability to leverage early-bird communication.
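
The following is a minimal sketch of this kind of measurement, assuming OpenMP and omp_get_wtime(); the skewed dummy workload and the thread count are placeholders, not the paper's actual instrumentation.

```c
/*
 * Sketch of the per-thread timing measurement described above: record
 * when each thread finishes its share, then report how long each thread
 * would idle before the laggard finishes (the potential overlap).
 * The skewed dummy workload and NTHREADS are illustrative assumptions.
 */
#include <omp.h>
#include <stdio.h>

#define NTHREADS 8

int main(void)
{
    double finish[NTHREADS];

    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        volatile double x = 0.0;

        /* Dummy workload, deliberately skewed so threads finish at
         * different times. */
        for (long i = 0; i < (long)(t + 1) * 10000000L; i++)
            x += 1.0;

        finish[t] = omp_get_wtime(); /* timestamp at completion */
    } /* implicit barrier: every thread has recorded its finish time */

    /* The laggard defines when a conventional (non-early-bird)
     * transmission could begin. */
    double last = finish[0];
    for (int t = 1; t < NTHREADS; t++)
        if (finish[t] > last)
            last = finish[t];

    for (int t = 0; t < NTHREADS; t++)
        printf("thread %d: idle %.6f s before laggard\n",
               t, last - finish[t]);
    return 0;
}
```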


Cited By

  • (2024) Taking the MPI standard and the Open MPI library to exascale. The International Journal of High Performance Computing Applications. https://doi.org/10.1177/10943420241265936. Online publication date: 23-Jul-2024.
  • (2023) Modeling and Benchmarking the Potential Benefit of Early-Bird Transmission in Fine-Grained Communication. Proceedings of the 52nd International Conference on Parallel Processing, 306–316. https://doi.org/10.1145/3605573.3605618. Online publication date: 7-Aug-2023.
  • (2023) A Dynamic Network-Native MPI Partitioned Aggregation Over InfiniBand Verbs. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 259–270. https://doi.org/10.1109/CLUSTER52292.2023.00029. Online publication date: 31-Oct-2023.


Published In

ICPP Workshops '23: Proceedings of the 52nd International Conference on Parallel Processing Workshops
August 2023
217 pages
ISBN: 9798400708428
DOI: 10.1145/3605731
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  • high-performance computing
  • computer networks
  • fine-grained communication
  • benchmarks

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Advanced Simulation and Computing Program

Conference

ICPP-W 2023

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
