DOI: 10.1145/2907294.2907317

GPU-Aware Non-contiguous Data Movement In Open MPI

Published: 31 May 2016

Abstract

Due to better parallel density and power efficiency, GPUs have become more popular for use in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm, and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movements for GPU-resident data is still in its infancy, imposing a negative impact on the overall application performance.
To address this shortcoming, we present a solution where we take advantage of the inherent parallelism in the datatype packing and unpacking operations. We developed a close integration between Open MPI's stack-based datatype engine, NVIDIA's Unified Memory Architecture and GPUDirect capabilities. In this design the datatype packing and unpacking operations are offloaded onto the GPU and handled by specialized GPU kernels, while the CPU remains the driver for data movements between nodes. By incorporating our design into the Open MPI library we have shown significantly better performance for non-contiguous GPU-resident data transfers on both shared and distributed memory machines.
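To make the usage pattern concrete, the sketch below shows the kind of transfer the paper targets: a strided, non-contiguous column of a GPU-resident matrix, described by an MPI derived datatype and passed directly to MPI_Send/MPI_Recv. It is a minimal illustration assuming a CUDA-aware Open MPI build; the matrix dimensions, ranks, and tag are illustrative and not taken from the paper.

    /* Minimal sketch: exchanging one column of a row-major, GPU-resident
     * matrix via an MPI derived datatype. Assumes a CUDA-aware Open MPI
     * build and at least two ranks; sizes are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int rows = 1024, cols = 1024;
        double *d_matrix;                       /* device-resident buffer */
        cudaMalloc((void **)&d_matrix, (size_t)rows * cols * sizeof(double));

        /* One column of a row-major matrix: 'rows' blocks of one element,
         * separated by a stride of 'cols' elements. */
        MPI_Datatype column;
        MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0) {
            /* The device pointer goes straight to MPI; packing the strided
             * layout (on the GPU, in the paper's design) is the library's
             * responsibility. */
            MPI_Send(d_matrix, 1, column, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_matrix, 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Type_free(&column);
        cudaFree(d_matrix);
        MPI_Finalize();
        return 0;
    }

Under the design summarized above, the library services such a call by launching a GPU kernel that gathers the strided elements into a contiguous staging buffer, while the CPU remains the driver for the internode transfer. The conceptual kernel below handles only a fixed-stride vector layout; the actual Open MPI kernels interpret the datatype engine's internal stack representation, so this is a simplification for illustration.

    /* Conceptual pack kernel: each thread gathers one element of a
     * strided layout into a contiguous staging buffer. Illustrative
     * only; not the kernels shipped in Open MPI. */
    __global__ void pack_vector(const double *src, double *dst,
                                int count, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count)
            dst[i] = src[(size_t)i * stride];
    }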



Information

Published In

HPDC '16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
May 2016
302 pages
ISBN: 9781450343145
DOI: 10.1145/2907294
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 May 2016

Author Tags

  1. datatype
  2. gpu
  3. hybrid architecture
  4. mpi
  5. non-contiguous data

Qualifiers

  • Research-article

Funding Sources

  • National Science Foundation

Conference

HPDC'16

Acceptance Rates

HPDC '16 paper acceptance rate: 20 of 129 submissions, 16%.
Overall acceptance rate: 166 of 966 submissions, 17%.

Article Metrics

  • Downloads (last 12 months): 22
  • Downloads (last 6 weeks): 4
Reflects downloads up to 26 Sep 2024.


Cited By

  • (2023) Julia as a unifying end-to-end workflow language on the Frontier exascale system. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 1989-1999. DOI: 10.1145/3624062.3624278. Published 12 Nov 2023.
  • (2023) Network-Assisted Noncontiguous Transfers for GPU-Aware MPI Libraries. IEEE Micro, 43(2):131-139. DOI: 10.1109/MM.2023.3241133. Published 1 Mar 2023.
  • (2021) TEMPI. Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, pp. 95-106. DOI: 10.1145/3431379.3460645. Published 21 Jun 2021.
  • (2021) Accelerating GPU Message Communication for Autonomous Navigation Systems. 2021 IEEE International Conference on Cluster Computing (CLUSTER), pp. 181-191. DOI: 10.1109/Cluster48925.2021.00029. Published Sep 2021.
  • (2020) Tiling-Based Programming Model for Structured Grids on GPU Clusters. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pp. 43-51. DOI: 10.1145/3368474.3368485. Published 15 Jan 2020.
  • (2020) Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters. 2020 IEEE International Conference on Cluster Computing (CLUSTER), pp. 130-141. DOI: 10.1109/CLUSTER49012.2020.00023. Published Sep 2020.
  • (2020) Using Arm Scalable Vector Extension to Optimize OPEN MPI. 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp. 222-231. DOI: 10.1109/CCGrid49817.2020.00-71. Published May 2020.
  • (2019) High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems. 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 267-276. DOI: 10.1109/HiPC.2019.00041. Published Dec 2019.
  • (2018) ADAPT. Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pp. 118-130. DOI: 10.1145/3208040.3208054. Published 11 Jun 2018.
  • (2018) Automatic Halo Management for the Uintah GPU-Heterogeneous Asynchronous Many-Task Runtime. International Journal of Parallel Programming. DOI: 10.1007/s10766-018-0619-1. Published 7 Dec 2018.
