Implementation and performance analysis of non-blocking collective operations for MPI

Published: 10 November 2007
DOI: 10.1145/1362622.1362692

Abstract

Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable, high-performance library implementing non-blocking versions of all MPI collective operations. LibNBC is layered on top of MPI-1 and is therefore portable to nearly all parallel architectures. To assess the performance characteristics of our implementation, we also present a microbenchmark that measures both latency and the overlap of computation and communication. Experimental results demonstrate that the blocking performance of the collective operations in our library is comparable to that of collective operations in other high-performance MPI implementations. Our library introduces very low overhead between the application and the underlying MPI and thus, combined with the ability to overlap communication with computation, offers the potential to optimize real-world applications.
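The abstract's central idea, initiating a collective early and completing it only when its result is needed, can be illustrated with a short sketch. The code below uses the MPI-3 non-blocking collective interface (MPI_Iallreduce/MPI_Wait), which later standardized the functionality LibNBC pioneered; LibNBC itself exposes analogous NBC_-prefixed calls whose exact signatures may differ, so treat this as an assumption-laden illustration rather than the paper's own API.

/* Illustrative sketch only: overlapping an allreduce with independent
 * computation, using the MPI-3 non-blocking collective interface.
 * LibNBC exposes analogous NBC_-prefixed calls. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double local = 1.0, global = 0.0;
    MPI_Request req;

    /* Start the collective; it can progress while we compute. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... independent computation that does not touch 'global' ... */

    /* Complete the collective before using its result. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}

A latency/overlap microbenchmark in the spirit of the one the abstract mentions would, broadly, time the blocking collective alone and then time the non-blocking variant with computation interleaved between initiation and completion; the degree to which total runtime stays below the sum of the two phases indicates how much communication cost was hidden.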



Published In

SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing
November 2007
723 pages
ISBN: 9781595937643
DOI: 10.1145/1362622

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. MPI
  2. collective operations
  3. non-blocking collective operations
  4. non-blocking communication
  5. overlap

Qualifiers

  • Research-article

Conference

SC '07

Acceptance Rates

SC '07 Paper Acceptance Rate: 54 of 268 submissions, 20%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
