DOI: 10.1145/2907294.2907317

GPU-Aware Non-contiguous Data Movement In Open MPI

Published: 31 May 2016

Abstract

Due to better parallel density and power efficiency, GPUs have become more popular for use in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm, and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movements for GPU-resident data is still in its infancy, imposing a negative impact on the overall application performance.
To address this shortcoming, we present a solution where we take advantage of the inherent parallelism in the datatype packing and unpacking operations. We developed a close integration between Open MPI's stack-based datatype engine, NVIDIA's Unified Memory Architecture and GPUDirect capabilities. In this design the datatype packing and unpacking operations are offloaded onto the GPU and handled by specialized GPU kernels, while the CPU remains the driver for data movements between nodes. By incorporating our design into the Open MPI library we have shown significantly better performance for non-contiguous GPU-resident data transfers on both shared and distributed memory machines.
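To make the usage pattern concrete, the sketch below shows the kind of transfer the paper targets: a strided, non-contiguous column of a GPU-resident matrix, described by an MPI derived datatype and passed directly to MPI_Send/MPI_Recv. It is a minimal illustration assuming a CUDA-aware Open MPI build; the matrix dimensions, ranks, and tag are illustrative and not taken from the paper.

    /* Minimal sketch: exchanging one column of a row-major, GPU-resident
     * matrix via an MPI derived datatype. Assumes a CUDA-aware Open MPI
     * build and at least two ranks; sizes are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int rows = 1024, cols = 1024;
        double *d_matrix;                       /* device-resident buffer */
        cudaMalloc((void **)&d_matrix, (size_t)rows * cols * sizeof(double));

        /* One column of a row-major matrix: 'rows' blocks of one element,
         * separated by a stride of 'cols' elements. */
        MPI_Datatype column;
        MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0) {
            /* The device pointer goes straight to MPI; packing the strided
             * layout (on the GPU, in the paper's design) is the library's
             * responsibility. */
            MPI_Send(d_matrix, 1, column, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_matrix, 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Type_free(&column);
        cudaFree(d_matrix);
        MPI_Finalize();
        return 0;
    }

Under the design summarized above, the library services such a call by launching a GPU kernel that gathers the strided elements into a contiguous staging buffer, while the CPU remains the driver for the internode transfer. The conceptual kernel below handles only a fixed-stride vector layout; the actual Open MPI kernels interpret the datatype engine's internal stack representation, so this is a simplification for illustration.

    /* Conceptual pack kernel: each thread gathers one element of a
     * strided layout into a contiguous staging buffer. Illustrative
     * only; not the kernels shipped in Open MPI. */
    __global__ void pack_vector(const double *src, double *dst,
                                int count, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count)
            dst[i] = src[(size_t)i * stride];
    }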



Information

Published In

HPDC '16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
May 2016
302 pages
ISBN: 9781450343145
DOI: 10.1145/2907294
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 May 2016

Author Tags

  1. datatype
  2. gpu
  3. hybrid architecture
  4. mpi
  5. non-contiguous data

Qualifiers

  • Research-article

Funding Sources

  • National Science Foundation

Conference

HPDC'16

Acceptance Rates

HPDC '16 paper acceptance rate: 20 of 129 submissions, 16%.
Overall acceptance rate: 166 of 966 submissions, 17%.

Article Metrics

  • Downloads (last 12 months): 22
  • Downloads (last 6 weeks): 4
Reflects downloads up to 26 Sep 2024.


Cited By

  • (2023) Julia as a unifying end-to-end workflow language on the Frontier exascale system. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 1989-1999. DOI: 10.1145/3624062.3624278. Published 12 Nov 2023.
  • (2023) Network-Assisted Noncontiguous Transfers for GPU-Aware MPI Libraries. IEEE Micro, 43(2):131-139. DOI: 10.1109/MM.2023.3241133. Published 1 Mar 2023.
  • (2021) TEMPI. Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, pp. 95-106. DOI: 10.1145/3431379.3460645. Published 21 Jun 2021.
  • (2021) Accelerating GPU Message Communication for Autonomous Navigation Systems. 2021 IEEE International Conference on Cluster Computing (CLUSTER), pp. 181-191. DOI: 10.1109/Cluster48925.2021.00029. Published Sep 2021.
  • (2020) Tiling-Based Programming Model for Structured Grids on GPU Clusters. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pp. 43-51. DOI: 10.1145/3368474.3368485. Published 15 Jan 2020.
  • (2020) Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters. 2020 IEEE International Conference on Cluster Computing (CLUSTER), pp. 130-141. DOI: 10.1109/CLUSTER49012.2020.00023. Published Sep 2020.
  • (2020) Using Arm Scalable Vector Extension to Optimize OPEN MPI. 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp. 222-231. DOI: 10.1109/CCGrid49817.2020.00-71. Published May 2020.
  • (2019) High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems. 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 267-276. DOI: 10.1109/HiPC.2019.00041. Published Dec 2019.
  • (2018) ADAPT. Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pp. 118-130. DOI: 10.1145/3208040.3208054. Published 11 Jun 2018.
  • (2018) Automatic Halo Management for the Uintah GPU-Heterogeneous Asynchronous Many-Task Runtime. International Journal of Parallel Programming. DOI: 10.1007/s10766-018-0619-1. Published 7 Dec 2018.
