
Improving MPI Multi-threaded RMA Communication Performance

Published: 13 August 2018

Abstract

One-sided communication is crucial to enabling communication concurrency. As core counts have increased, particularly with many-core architectures, one-sided (RMA) communication has been proposed to address the ever-increasing contention at the network interface. The difficulty in using RMA communication with MPI is that the performance of MPI implementations using RMA with multiple concurrent threads is not well understood. Past studies have used MPI RMA in combination with multi-threading (RMA-MT), but they were performed on older MPI implementations lacking RMA-MT optimizations. In addition, prior work has only been done at smaller scale (≤512 cores).
In this paper, we describe a new RMA implementation for Open MPI. The implementation targets scalability and multi-threaded performance. We describe the design and implementation of our RMA improvements and offer an evaluation that demonstrates scaling to 524,288 cores, the full size of a leading supercomputer installation. In contrast, the previous implementation failed to scale past approximately 4,096 cores. To evaluate this approach, we then compare against a vendor-optimized MPI RMA-MT implementation with microbenchmarks, a mini-application, and a full astrophysics code at large scale on a many-core architecture. This is the first time that an evaluation at large scale on many-core architectures has been done for MPI RMA-MT (524,288 cores) and the first large-scale application performance comparison between two different RMA-MT-optimized MPI implementations. The results show an 8.6% benefit to our optimized open-source MPI for a full application code running on 512K cores.
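
For context, the sketch below illustrates the multi-threaded RMA (RMA-MT) usage pattern the abstract refers to: several threads of one MPI process concurrently issue one-sided puts into a passive-target window. This is a minimal illustration under stated assumptions, not code from the paper; the thread count, payload size, neighbor target rank, and use of OpenMP for threading are choices made only for this example.

```c
/* Minimal RMA-MT sketch: multiple threads per rank issue MPI_Put into a
 * passive-target window. Illustrative only; NTHREADS, ELEMS_PER_THREAD,
 * and the neighbor target are assumptions, not values from the paper. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NTHREADS 4          /* assumed threads per rank          */
#define ELEMS_PER_THREAD 16 /* assumed payload (ints) per thread */

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Allocate a window with one slice per thread of the origin rank. */
    int *win_buf;
    MPI_Win win;
    MPI_Aint win_size = (MPI_Aint)NTHREADS * ELEMS_PER_THREAD * sizeof(int);
    MPI_Win_allocate(win_size, sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);

    int target = (rank + 1) % size;           /* neighbor as target rank */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);  /* passive-target epoch    */

    /* Each thread puts into a disjoint slice of the target window; the
     * MPI library must cope with the concurrent RMA calls. */
    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        int payload[ELEMS_PER_THREAD];
        for (int i = 0; i < ELEMS_PER_THREAD; i++)
            payload[i] = rank * 1000 + tid;

        MPI_Put(payload, ELEMS_PER_THREAD, MPI_INT, target,
                (MPI_Aint)tid * ELEMS_PER_THREAD,
                ELEMS_PER_THREAD, MPI_INT, win);
        MPI_Win_flush(target, win);           /* complete this thread's put */
    }

    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Built with an MPI C compiler wrapper with OpenMP enabled (for example, `mpicc -fopenmp`) and run with two or more ranks, this exercises the concurrent put/flush path whose performance the paper studies.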


Published In

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
August 2018
945 pages
ISBN:9781450365109
DOI:10.1145/3225058
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

  • University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 August 2018

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • U.S. Department of Energy

Conference

ICPP 2018

Acceptance Rates

ICPP '18 paper acceptance rate: 91 of 313 submissions (29%).
Overall acceptance rate: 91 of 313 submissions (29%).


Cited By

  • (2024) Taking the MPI standard and the open MPI library to exascale. International Journal of High Performance Computing Applications 38(5), 491-507, Sep 2024. DOI: 10.1177/10943420241265936
  • (2024) Improving the MPI Remote Memory Access Model for Distributed-memory Systems by Implementing One-sided Broadcast. 2024 XXVII International Conference on Soft Computing and Measurements (SCM), 17-21, May 2024. DOI: 10.1109/SCM62608.2024.10554130
  • (2024) CMB: A Configurable Messaging Benchmark to Explore Fine-Grained Communication. 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 28-38, May 2024. DOI: 10.1109/CCGrid59990.2024.00013
  • (2024) Inter-Node Message Passing Through Optical Reconfigurable Memory Channel. IEEE Access 12, 83057-83071, 2024. DOI: 10.1109/ACCESS.2024.3412878
  • (2023) DArray: A High Performance RDMA-Based Distributed Array. Proceedings of the 52nd International Conference on Parallel Processing, 715-724, Aug 2023. DOI: 10.1145/3605573.3605608
  • (2023) A Dynamic Network-Native MPI Partitioned Aggregation Over InfiniBand Verbs. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 259-270, Oct 2023. DOI: 10.1109/CLUSTER52292.2023.00029
  • (2023) Fargraph+: Excavating the parallelism of graph processing workload on RDMA-based far memory system. Journal of Parallel and Distributed Computing 177, 144-159, Jul 2023. DOI: 10.1016/j.jpdc.2023.02.015
  • (2022) Implementation and evaluation of MPI 4.0 partitioned communication libraries. Parallel Computing 108(C), Apr 2022. DOI: 10.1016/j.parco.2021.102827
  • (2022) A methodology for assessing computation/communication overlap of MPI nonblocking collectives. Concurrency and Computation: Practice and Experience 34(22), Aug 2022. DOI: 10.1002/cpe.7168
  • (2021) Partitioned Collective Communication. 2021 Workshop on Exascale MPI (ExaMPI), 9-17, Nov 2021. DOI: 10.1109/ExaMPI54564.2021.00007
