DOI: 10.1145/3126908.3126963
Research article (Public Access)

Why is MPI so slow?: analyzing the fundamental limits in implementing MPI-3.1

Published: 12 November 2017
    Abstract

    This paper provides an in-depth analysis of the software overheads in the MPI performance-critical path and exposes the mandatory overheads imposed by the MPI-3.1 specification. We first present a highly optimized implementation of the MPI-3.1 standard in which the communication stack---all the way from the application to the low-level network communication API---takes only a few tens of instructions. We carefully study these instructions and trace the root causes of the overheads to specific requirements of the MPI standard that cannot be avoided as currently specified. We recommend potential changes to the MPI standard that can minimize these overheads. Our experimental results on a variety of network architectures and applications demonstrate significant benefits from our proposed changes.
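
    To make the per-message software overhead discussed above concrete, the sketch below is a minimal MPI ping-pong microbenchmark of the kind commonly used to measure small-message latency and, by extension, the cost of the MPI critical path. It is an illustration only, written against the standard MPI-3.1 C API; it is not code from the paper, and the message size and iteration counts are arbitrary choices.

        /* Minimal ping-pong latency sketch (illustrative; not from the paper).
         * Build: mpicc -O2 pingpong.c -o pingpong
         * Run:   mpiexec -n 2 ./pingpong */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            const int warmup = 1000;   /* untimed round trips */
            const int iters  = 100000; /* timed round trips */
            char buf[8] = {0};         /* small payload: software overhead dominates */
            int rank, size;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            if (size < 2) {
                if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
                MPI_Abort(MPI_COMM_WORLD, 1);
            }

            double t0 = 0.0;
            for (int i = 0; i < warmup + iters; i++) {
                if (i == warmup) t0 = MPI_Wtime(); /* start timing after warm-up */
                if (rank == 0) {
                    MPI_Send(buf, (int)sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, (int)sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, (int)sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, (int)sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }

            if (rank == 0) {
                double usec = (MPI_Wtime() - t0) * 1e6 / (2.0 * iters);
                printf("approx. one-way latency for %zu-byte messages: %.3f us\n",
                       sizeof buf, usec);
            }

            MPI_Finalize();
            return 0;
        }

    Any additional instructions that the MPI library must execute per message to satisfy standard-mandated semantics show up directly in the latency such a benchmark reports, which is why the paper counts instructions on this path.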

    Published In

    SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2017, 801 pages
    ISBN: 9781450351140
    DOI: 10.1145/3126908
    General Chair: Bernd Mohr; Program Chair: Padma Raghavan
    Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    In-Cooperation: IEEE CS

    Publisher: Association for Computing Machinery, New York, NY, United States

    Conference: SC '17
    Acceptance Rates: SC '17 paper acceptance rate: 61 of 327 submissions (19%); overall acceptance rate: 1,516 of 6,373 submissions (24%)

    Article Metrics
    • Downloads (last 12 months): 306
    • Downloads (last 6 weeks): 43
    Reflects downloads up to 27 Jul 2024

    Cited By
    • (2024) Pure: Evolving Message Passing To Better Leverage Shared Memory Within Nodes. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 133-146. https://doi.org/10.1145/3627535.3638503
    • (2024) Faster and Scalable MPI Applications Launching. IEEE Transactions on Parallel and Distributed Systems 35(2), 264-279. https://doi.org/10.1109/TPDS.2022.3218077
    • (2022) Lessons Learned on MPI+Threads Communication. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-16. https://doi.org/10.1109/SC41404.2022.00082
    • (2021) Efficient exascale discretizations: High-order finite element methods. The International Journal of High Performance Computing Applications. https://doi.org/10.1177/10943420211020803
    • (2021) Logically Parallel Communication for Fast MPI+Threads Applications. IEEE Transactions on Parallel and Distributed Systems 32(12), 3038-3052. https://doi.org/10.1109/TPDS.2021.3075157
    • (2021) On-the-Fly, Robust Translation of MPI Libraries. 2021 IEEE International Conference on Cluster Computing (CLUSTER), 504-515. https://doi.org/10.1109/Cluster48925.2021.00026
    • (2021) High-Performance Implementation of Discontinuous Galerkin Methods with Application in Fluid Flow. Efficient High-Order Discretizations for Computational Fluid Dynamics, 57-115. https://doi.org/10.1007/978-3-030-60610-7_2
    • (2021) Reconfigurable switches for high performance and flexible MPI collectives. Concurrency and Computation: Practice and Experience 34(6). https://doi.org/10.1002/cpe.6769
    • (2020) Hardware Locality-Aware Partitioning and Dynamic Load-Balancing of Unstructured Meshes for Large-Scale Scientific Applications. Proceedings of the Platform for Advanced Scientific Computing Conference, 1-10. https://doi.org/10.1145/3394277.3401851
    • (2020) How I learned to stop worrying about user-visible endpoints and love MPI. Proceedings of the 34th ACM International Conference on Supercomputing, 1-13. https://doi.org/10.1145/3392717.3392773
