DOI: 10.1145/3416315.3416321
Research article

Implementation and performance evaluation of MPI persistent collectives in MPC: a case study

Published: 07 October 2020

Abstract

Persistent collective communications have recently been voted into the MPI standard, opening the door to many optimizations that reduce the cost of collectives, in particular for recurring operations. Indeed, the persistent semantics includes an initialization phase that is called only once for a given collective. This phase can absorb the setup costs of the collective, so that they are not paid each time the operation is performed.
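For illustration only (not taken from the paper), the sketch below shows this usage pattern with the interface as later standardized in MPI 4.0; pre-standard implementations typically exposed the same calls under an MPIX_ prefix, and the buffers and iteration count here are arbitrary.

    #include <mpi.h>

    /* Minimal sketch of the persistent-collective usage pattern: the
     * (potentially costly) initialization is performed once, then the same
     * operation is started and completed many times. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        double in = 1.0, out = 0.0;
        MPI_Request req;

        /* One-time initialization phase: setup costs can be paid here. */
        MPI_Allreduce_init(&in, &out, 1, MPI_DOUBLE, MPI_SUM,
                           MPI_COMM_WORLD, MPI_INFO_NULL, &req);

        for (int iter = 0; iter < 1000; iter++) {
            MPI_Start(&req);                    /* launch the recurring collective */
            MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete this occurrence */
        }

        MPI_Request_free(&req);
        MPI_Finalize();
        return 0;
    }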
We give an overview of the implementation of persistent collectives in the MPC MPI runtime. We first present a naïve implementation suitable for any MPI runtime that already provides nonblocking collectives, and then improve it with two levels of caching optimizations. We report the performance of the naïve and optimized versions and discuss their impact on different collective algorithms. On a repetitive benchmark, the optimized versions outperform the naïve one, with up to a 3x speedup for the reduce collective.
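As a further illustration of the naïve approach built on nonblocking collectives, the following hypothetical sketch (names such as naive_allreduce_* are invented and do not reflect MPC's internal API) records the arguments at initialization and launches the existing nonblocking collective at every start, so the collective's schedule is rebuilt each time; this recurring cost is what the caching optimizations aim to remove.

    #include <mpi.h>

    /* Hypothetical sketch of a naïve persistent allreduce layered on top of an
     * existing nonblocking collective: *_init only records the arguments, and
     * every start launches MPI_Iallreduce, rebuilding its schedule each time.
     * Illustrative only; this is not MPC's actual implementation. */
    typedef struct {
        const void  *sendbuf;
        void        *recvbuf;
        int          count;
        MPI_Datatype datatype;
        MPI_Op       op;
        MPI_Comm     comm;
        MPI_Request  inner;   /* request of the underlying nonblocking collective */
    } naive_allreduce_t;

    static void naive_allreduce_init(naive_allreduce_t *p, const void *sendbuf,
                                     void *recvbuf, int count, MPI_Datatype datatype,
                                     MPI_Op op, MPI_Comm comm)
    {
        /* Initialization phase: store the arguments, build nothing yet. */
        p->sendbuf = sendbuf; p->recvbuf = recvbuf; p->count = count;
        p->datatype = datatype; p->op = op; p->comm = comm;
        p->inner = MPI_REQUEST_NULL;
    }

    static void naive_allreduce_start(naive_allreduce_t *p)
    {
        /* Start phase: fall back to the nonblocking collective each time. */
        MPI_Iallreduce(p->sendbuf, p->recvbuf, p->count, p->datatype,
                       p->op, p->comm, &p->inner);
    }

    static void naive_allreduce_wait(naive_allreduce_t *p)
    {
        MPI_Wait(&p->inner, MPI_STATUS_IGNORE);
    }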

Cited By

  • (2022) Towards leveraging collective performance with the support of MPI 4.0 features in MPC. Parallel Computing, vol. 109(C). https://doi.org/10.1016/j.parco.2021.102860. Online publication date: 1 March 2022.

Information

Published In

EuroMPI/USA '20: Proceedings of the 27th European MPI Users' Group Meeting
September 2020
88 pages
ISBN: 9781450388801
DOI: 10.1145/3416315
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 October 2020

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroMPI/USA '20: 27th European MPI Users' Group Meeting
September 21 - 24, 2020
Austin, TX, USA

Acceptance Rates

Overall acceptance rate: 66 of 139 submissions (47%)
