Enabling highly scalable remote memory access programming with MPI-3 one sided

Published: 26 September 2018

Abstract

Modern high-performance networks offer remote direct memory access (RDMA), which exposes a process's virtual address space to other processes in the network. The Message Passing Interface (MPI) specification has recently been extended with a programming interface called MPI-3 Remote Memory Access (MPI-3 RMA) for efficiently exploiting state-of-the-art RDMA features. MPI-3 RMA enables a powerful programming model that alleviates many downsides of message passing. In this work, we design and develop bufferless protocols that demonstrate how to implement this interface and support scaling to millions of cores with negligible memory consumption while delivering high performance with minimal overhead. To arm programmers, we provide a spectrum of performance models for RMA functions that enable rigorous mathematical analysis of application performance and facilitate the development of codes that solve given tasks within specified time and energy budgets. We validate the usability of our library and models through several application studies running on up to half a million processes. In a wider sense, our work illustrates how to use RMA principles to accelerate computation- and data-intensive codes.

Published In

Communications of the ACM, Volume 61, Issue 10
October 2018
107 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/3281635
Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • ETH Zurich
