DOI: 10.1145/2802658.2802672

GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks

Published: 21 September 2015

Abstract

As we move towards efficient exascale systems, heterogeneous accelerators like NVIDIA GPUs are becoming a significant compute component of modern HPC clusters. It has become important to utilize every cycle of every compute device available in the system; from NICs to GPUs to co-processors, heterogeneous compute resources are the way forward. Another important trend, especially with the introduction of non-blocking collective communication in the latest MPI standard, is overlapping communication with computation, which has become a key design goal for messaging libraries like MVAPICH2 and OpenMPI. In this paper, we present a benchmark that allows users of different MPI libraries to evaluate the performance of GPU-Aware Non-Blocking Collectives. The main performance metrics are overlap and latency. We provide insights on designing a GPU-aware benchmark and discuss the challenges associated with identifying and implementing performance parameters such as overlap, latency, the effect of MPI_Test() calls used to progress communication, the effect of independent GPU communication while the overlapped computation proceeds, and the effect of the complexity, target, and scale of this overlapped computation. To illustrate the efficacy of the proposed benchmark, we provide a comparative performance evaluation of GPU-Aware Non-Blocking Collectives in MVAPICH2 and OpenMPI.
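
For readers who want to see how such a benchmark is typically structured, the sketch below shows the classic overlap-measurement pattern for a non-blocking collective: time the collective alone, time it again with computation (and optional MPI_Test() calls) in flight, time the computation alone, and derive an overlap percentage. This is a minimal illustration, not the paper's actual benchmark code; the kernel dummy_compute, the iteration counts, and the message size are illustrative assumptions, and host buffers are used so the sketch runs with any MPI-3 library, whereas a GPU-aware run would pass buffers allocated with cudaMalloc directly to the MPI calls.

```c
/* Minimal sketch of the overlap-measurement pattern for a non-blocking
 * collective (MPI_Iallgather). Not the paper's benchmark; buffer sizes,
 * iteration counts, and dummy_compute are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the computation overlapped with communication. */
static void dummy_compute(double *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * 1.000001 + 0.5;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 16;   /* elements contributed per process */
    const int iters = 100;       /* amount of overlapped computation */
    double *sendbuf = malloc(count * sizeof(double));
    double *recvbuf = malloc((size_t)count * size * sizeof(double));
    double *work    = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) { sendbuf[i] = rank; work[i] = 1.0; }

    MPI_Request req;

    /* 1. Pure communication latency: issue the collective and wait. */
    double t0 = MPI_Wtime();
    MPI_Iallgather(sendbuf, count, MPI_DOUBLE,
                   recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t_pure = MPI_Wtime() - t0;

    /* 2. Collective overlapped with computation, progressed by MPI_Test. */
    t0 = MPI_Wtime();
    MPI_Iallgather(sendbuf, count, MPI_DOUBLE,
                   recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);
    int done = 0;
    for (int it = 0; it < iters; it++) {
        dummy_compute(work, count);                 /* overlapped work   */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* progress the NBC  */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t_overlapped = MPI_Wtime() - t0;

    /* 3. Computation alone, as the baseline. */
    t0 = MPI_Wtime();
    for (int it = 0; it < iters; it++)
        dummy_compute(work, count);
    double t_compute = MPI_Wtime() - t0;

    /* One common formulation of the overlap metric: the fraction of the
     * pure communication time hidden behind the computation. */
    if (rank == 0) {
        double overlap = 100.0 * (1.0 - (t_overlapped - t_compute) / t_pure);
        printf("pure=%.3f ms overlapped=%.3f ms compute=%.3f ms overlap=%.1f%%\n",
               t_pure * 1e3, t_overlapped * 1e3, t_compute * 1e3, overlap);
    }

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}
```

A production benchmark would additionally add warm-up iterations, average over many repetitions, and sweep message sizes before reporting overlap and latency per collective.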




Information

Published In

EuroMPI '15: Proceedings of the 22nd European MPI Users' Group Meeting
September 2015
149 pages
ISBN:9781450337953
DOI:10.1145/2802658
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • Conseil Régional d'Aquitaine
  • Communauté Urbaine de Bordeaux
  • INRIA: INRIA Rhône-Alpes

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 September 2015


Author Tags

  1. GPU-Aware
  2. MVAPICH2
  3. Micro-Benchmarking
  4. Non-Blocking Collectives
  5. OMB

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroMPI '15
EuroMPI '15: The 22nd European MPI Users' Group Meeting
September 21 - 23, 2015
Bordeaux, France

Acceptance Rates

EuroMPI '15 paper acceptance rate: 14 of 29 submissions (48%)
Overall acceptance rate: 66 of 139 submissions (47%)

Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 1
Reflects downloads up to 26 Sep 2024

Cited By
  • (2022) IMB-ASYNC: a revised method and benchmark to estimate MPI-3 asynchronous progress efficiency. Cluster Computing 25(4):2683-2697. DOI: 10.1007/s10586-021-03452-8. Online publication date: 15-Jan-2022.
  • (2016) OpenSHMEM nonblocking data movement operations with MVAPICH2-X. Proceedings of the First Workshop on PGAS Applications, pages 9-16. DOI: 10.5555/3019040.3019042. Online publication date: 13-Nov-2016.
  • (2016) OpenSHMEM Non-blocking Data Movement Operations with MVAPICH2-X: Early Experiences. 2016 PGAS Applications Workshop (PAW), pages 9-16. DOI: 10.1109/PAW.2016.007. Online publication date: Nov-2016.
  • (2015) A Case for Non-blocking Collectives in OpenSHMEM. Revised Selected Papers of the Second Workshop on OpenSHMEM and Related Technologies: Experiences, Implementations, and Technologies, Volume 9397, pages 69-86. DOI: 10.1007/978-3-319-26428-8_5. Online publication date: 4-Aug-2015.
