
Improving MPI Multi-threaded RMA Communication Performance

Published: 13 August 2018

Abstract

One-sided communication is crucial to enabling communication concurrency. As core counts have increased, particularly with many-core architectures, one-sided (RMA) communication has been proposed to address the ever-increasing contention at the network interface. The difficulty in using RMA communication with MPI is that the performance of MPI implementations using RMA with multiple concurrent threads is not well understood. Past studies have used MPI RMA in combination with multi-threading (RMA-MT), but they were performed on older MPI implementations lacking RMA-MT optimizations. In addition, prior work has only been done at smaller scale (≤512 cores).
In this paper, we describe a new RMA implementation for Open MPI. The implementation targets scalability and multi-threaded performance. We describe the design and implementation of our RMA improvements and offer an evaluation that demonstrates scaling to 524,288 cores, the full size of a leading supercomputer installation. In contrast, the previous implementation failed to scale past approximately 4,096 cores. To evaluate this approach, we then compare against a vendor-optimized MPI RMA-MT implementation with microbenchmarks, a mini-application, and a full astrophysics code at large scale on a many-core architecture. This is the first time that an evaluation at large scale on many-core architectures has been done for MPI RMA-MT (524,288 cores) and the first large-scale application performance comparison between two different RMA-MT-optimized MPI implementations. The results show an 8.6% benefit to our optimized open-source MPI for a full application code running on 512K cores.
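
For context, the sketch below illustrates the multi-threaded RMA (RMA-MT) usage pattern the abstract refers to: several threads of one MPI process concurrently issue one-sided puts into a passive-target window. This is a minimal illustration under stated assumptions, not code from the paper; the thread count, payload size, neighbor target rank, and use of OpenMP for threading are choices made only for this example.

```c
/* Minimal RMA-MT sketch: multiple threads per rank issue MPI_Put into a
 * passive-target window. Illustrative only; NTHREADS, ELEMS_PER_THREAD,
 * and the neighbor target are assumptions, not values from the paper. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NTHREADS 4          /* assumed threads per rank          */
#define ELEMS_PER_THREAD 16 /* assumed payload (ints) per thread */

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Allocate a window with one slice per thread of the origin rank. */
    int *win_buf;
    MPI_Win win;
    MPI_Aint win_size = (MPI_Aint)NTHREADS * ELEMS_PER_THREAD * sizeof(int);
    MPI_Win_allocate(win_size, sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);

    int target = (rank + 1) % size;           /* neighbor as target rank */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);  /* passive-target epoch    */

    /* Each thread puts into a disjoint slice of the target window; the
     * MPI library must cope with the concurrent RMA calls. */
    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        int payload[ELEMS_PER_THREAD];
        for (int i = 0; i < ELEMS_PER_THREAD; i++)
            payload[i] = rank * 1000 + tid;

        MPI_Put(payload, ELEMS_PER_THREAD, MPI_INT, target,
                (MPI_Aint)tid * ELEMS_PER_THREAD,
                ELEMS_PER_THREAD, MPI_INT, win);
        MPI_Win_flush(target, win);           /* complete this thread's put */
    }

    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Built with an MPI C compiler wrapper with OpenMP enabled (for example, `mpicc -fopenmp`) and run with two or more ranks, this exercises the concurrent put/flush path whose performance the paper studies.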


Published In

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
August 2018
945 pages
ISBN:9781450365109
DOI:10.1145/3225058
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

  • University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 August 2018

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • U.S. Department of Energy

Conference

ICPP 2018

Acceptance Rates

ICPP '18 paper acceptance rate: 91 of 313 submissions (29%).
Overall acceptance rate: 91 of 313 submissions (29%).


Cited By

  • (2024) Taking the MPI standard and the open MPI library to exascale. International Journal of High Performance Computing Applications 38(5), 491-507, Sep 2024. DOI: 10.1177/10943420241265936
  • (2024) Improving the MPI Remote Memory Access Model for Distributed-memory Systems by Implementing One-sided Broadcast. 2024 XXVII International Conference on Soft Computing and Measurements (SCM), 17-21, May 2024. DOI: 10.1109/SCM62608.2024.10554130
  • (2024) CMB: A Configurable Messaging Benchmark to Explore Fine-Grained Communication. 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 28-38, May 2024. DOI: 10.1109/CCGrid59990.2024.00013
  • (2024) Inter-Node Message Passing Through Optical Reconfigurable Memory Channel. IEEE Access 12, 83057-83071, 2024. DOI: 10.1109/ACCESS.2024.3412878
  • (2023) DArray: A High Performance RDMA-Based Distributed Array. Proceedings of the 52nd International Conference on Parallel Processing, 715-724, Aug 2023. DOI: 10.1145/3605573.3605608
  • (2023) A Dynamic Network-Native MPI Partitioned Aggregation Over InfiniBand Verbs. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 259-270, Oct 2023. DOI: 10.1109/CLUSTER52292.2023.00029
  • (2023) Fargraph+: Excavating the parallelism of graph processing workload on RDMA-based far memory system. Journal of Parallel and Distributed Computing 177, 144-159, Jul 2023. DOI: 10.1016/j.jpdc.2023.02.015
  • (2022) Implementation and evaluation of MPI 4.0 partitioned communication libraries. Parallel Computing 108(C), Apr 2022. DOI: 10.1016/j.parco.2021.102827
  • (2022) A methodology for assessing computation/communication overlap of MPI nonblocking collectives. Concurrency and Computation: Practice and Experience 34(22), Aug 2022. DOI: 10.1002/cpe.7168
  • (2021) Partitioned Collective Communication. 2021 Workshop on Exascale MPI (ExaMPI), 9-17, Nov 2021. DOI: 10.1109/ExaMPI54564.2021.00007
