DOI: 10.1145/2802658.2802672

GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks

Published: 21 September 2015

Abstract

As we move towards efficient exascale systems, heterogeneous accelerators like NVIDIA GPUs are becoming a significant compute component of modern HPC clusters. It has become important to utilize every cycle of every compute device available in the system; from NICs to GPUs to co-processors, heterogeneous compute resources are the way forward. Another important trend, especially with the introduction of non-blocking collective communication in the latest MPI standard, is overlapping communication with computation, which has become a key design goal for messaging libraries like MVAPICH2 and OpenMPI. In this paper, we present a benchmark that allows users of different MPI libraries to evaluate the performance of GPU-Aware Non-Blocking Collectives. The main performance metrics are overlap and latency. We provide insights on designing a GPU-aware benchmark and discuss the challenges associated with identifying and implementing performance parameters such as overlap, latency, the effect of MPI_Test() calls used to progress communication, the effect of independent GPU communication while the overlapped computation proceeds, and the effect of the complexity, target, and scale of this overlapped computation. To illustrate the efficacy of the proposed benchmark, we provide a comparative performance evaluation of GPU-Aware Non-Blocking Collectives in MVAPICH2 and OpenMPI.
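
For readers who want to see how such a benchmark is typically structured, the sketch below shows the classic overlap-measurement pattern for a non-blocking collective: time the collective alone, time it again with computation (and optional MPI_Test() calls) in flight, time the computation alone, and derive an overlap percentage. This is a minimal illustration, not the paper's actual benchmark code; the kernel dummy_compute, the iteration counts, and the message size are illustrative assumptions, and host buffers are used so the sketch runs with any MPI-3 library, whereas a GPU-aware run would pass buffers allocated with cudaMalloc directly to the MPI calls.

```c
/* Minimal sketch of the overlap-measurement pattern for a non-blocking
 * collective (MPI_Iallgather). Not the paper's benchmark; buffer sizes,
 * iteration counts, and dummy_compute are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the computation overlapped with communication. */
static void dummy_compute(double *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * 1.000001 + 0.5;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 16;   /* elements contributed per process */
    const int iters = 100;       /* amount of overlapped computation */
    double *sendbuf = malloc(count * sizeof(double));
    double *recvbuf = malloc((size_t)count * size * sizeof(double));
    double *work    = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) { sendbuf[i] = rank; work[i] = 1.0; }

    MPI_Request req;

    /* 1. Pure communication latency: issue the collective and wait. */
    double t0 = MPI_Wtime();
    MPI_Iallgather(sendbuf, count, MPI_DOUBLE,
                   recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t_pure = MPI_Wtime() - t0;

    /* 2. Collective overlapped with computation, progressed by MPI_Test. */
    t0 = MPI_Wtime();
    MPI_Iallgather(sendbuf, count, MPI_DOUBLE,
                   recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);
    int done = 0;
    for (int it = 0; it < iters; it++) {
        dummy_compute(work, count);                 /* overlapped work   */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* progress the NBC  */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t_overlapped = MPI_Wtime() - t0;

    /* 3. Computation alone, as the baseline. */
    t0 = MPI_Wtime();
    for (int it = 0; it < iters; it++)
        dummy_compute(work, count);
    double t_compute = MPI_Wtime() - t0;

    /* One common formulation of the overlap metric: the fraction of the
     * pure communication time hidden behind the computation. */
    if (rank == 0) {
        double overlap = 100.0 * (1.0 - (t_overlapped - t_compute) / t_pure);
        printf("pure=%.3f ms overlapped=%.3f ms compute=%.3f ms overlap=%.1f%%\n",
               t_pure * 1e3, t_overlapped * 1e3, t_compute * 1e3, overlap);
    }

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}
```

A production benchmark would additionally add warm-up iterations, average over many repetitions, and sweep message sizes before reporting overlap and latency per collective.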




Information

Published In

EuroMPI '15: Proceedings of the 22nd European MPI Users' Group Meeting
September 2015
149 pages
ISBN:9781450337953
DOI:10.1145/2802658
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • Conseil Régional d'Aquitaine
  • Communauté Urbaine de Bordeaux
  • INRIA: INRIA Rhône-Alpes

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 September 2015


Author Tags

  1. GPU-Aware
  2. MVAPICH2
  3. Micro-Benchmarking
  4. Non-Blocking Collectives
  5. OMB

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroMPI '15
EuroMPI '15: The 22nd European MPI Users' Group Meeting
September 21 - 23, 2015
Bordeaux, France

Acceptance Rates

EuroMPI '15 paper acceptance rate: 14 of 29 submissions (48%)
Overall acceptance rate: 66 of 139 submissions (47%)

Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 1
Reflects downloads up to 26 Sep 2024

Cited By
  • (2022) IMB-ASYNC: a revised method and benchmark to estimate MPI-3 asynchronous progress efficiency. Cluster Computing 25(4):2683-2697. DOI: 10.1007/s10586-021-03452-8. Online publication date: 15-Jan-2022.
  • (2016) OpenSHMEM nonblocking data movement operations with MVAPICH2-X. Proceedings of the First Workshop on PGAS Applications, pages 9-16. DOI: 10.5555/3019040.3019042. Online publication date: 13-Nov-2016.
  • (2016) OpenSHMEM Non-blocking Data Movement Operations with MVAPICH2-X: Early Experiences. 2016 PGAS Applications Workshop (PAW), pages 9-16. DOI: 10.1109/PAW.2016.007. Online publication date: Nov-2016.
  • (2015) A Case for Non-blocking Collectives in OpenSHMEM. Revised Selected Papers of the Second Workshop on OpenSHMEM and Related Technologies: Experiences, Implementations, and Technologies, Volume 9397, pages 69-86. DOI: 10.1007/978-3-319-26428-8_5. Online publication date: 4-Aug-2015.
