DOI: 10.1145/1122971.1122978

RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits

Published: 29 March 2006

Abstract

Message Passing Interface (MPI) is a popular parallel programming model for scientific applications. Most high-performance MPI implementations use a rendezvous protocol for efficient transfer of large messages. The protocol can be designed around either RDMA Write or RDMA Read, and it is usually implemented with RDMA Write. The RDMA Write based protocol requires a two-way handshake between the sending and receiving processes. At the same time, to achieve low latency, MPI implementations often provide a polling based progress engine, and the two-way handshake requires this engine to discover multiple control messages. This in turn requires MPI applications to call into the MPI library to make progress, which compute- or I/O-intensive applications cannot do. As a result, most communication progress is made only after the computation or I/O is over. This severely hampers computation/communication overlap, which can have a detrimental impact on overall application performance. In this paper, we propose several mechanisms that exploit RDMA Read and selective interrupt based asynchronous progress to provide better computation/communication overlap on InfiniBand clusters. Our evaluations reveal that it is possible to achieve nearly complete computation/communication overlap using our RDMA Read with interrupt based protocol. Additionally, our schemes yield around 50% better communication progress rate when computation is overlapped with communication. Further, our application evaluation with Linpack (HPL) and NAS-SP (Class C) reveals that MPI_Wait time is reduced by around 30% and 28%, respectively, on a 32-node InfiniBand cluster. We observe that the gains in MPI_Wait time grow as the system size increases, indicating that our designs have a strong positive impact on the scalability of parallel applications.
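
The overlap problem the abstract describes can be made concrete with a small timing sketch. The program below is not code from the paper; it is a minimal illustration in which the 4 MB message size (large enough to trigger the rendezvous protocol in typical MPI implementations), the compute() placeholder that makes no MPI calls, and the two-rank transfer are all assumptions chosen only for demonstration. It posts a non-blocking transfer, computes, and then times MPI_Wait. With a polling-only, RDMA Write based rendezvous the two-way handshake is usually discovered only once MPI_Wait is called, so most of the transfer falls inside the wait; with RDMA Read and interrupt-driven asynchronous progress the transfer can proceed during compute() and the wait shrinks.

    /* Minimal overlap micro-benchmark sketch (illustrative, not from the paper). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (4 * 1024 * 1024)   /* 4 MB: large enough to use the rendezvous path */

    static void compute(void)
    {
        /* Stand-in for application computation that makes no MPI calls,
           so a purely polling-based progress engine cannot run meanwhile. */
        volatile double x = 0.0;
        for (long i = 0; i < 200000000L; i++)
            x += 1e-9;
        (void)x;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Request req = MPI_REQUEST_NULL;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
            MPI_Finalize();
            return 1;
        }

        buf = malloc(N);

        double t0 = MPI_Wtime();
        if (rank == 0)
            MPI_Isend(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        else if (rank == 1)
            MPI_Irecv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);

        compute();                          /* transfer may or may not progress here */

        double t1 = MPI_Wtime();
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* any remaining transfer finishes here  */
        double t2 = MPI_Wtime();

        if (rank <= 1)
            printf("rank %d: compute %.3f s, MPI_Wait %.3f s\n",
                   rank, t1 - t0, t2 - t1);

        free(buf);
        MPI_Finalize();
        return 0;
    }

The time spent in MPI_Wait in a pattern like this is the metric the paper reports improving by around 30% for HPL and 28% for NAS-SP on a 32-node cluster.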

    Published In

    PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
    March 2006
    258 pages
    ISBN: 1595931899
    DOI: 10.1145/1122971

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. InfiniBand
    2. MPI
    3. communication overlap

    Conference

    PPoPP06

    Acceptance Rates

    Overall Acceptance Rate: 230 of 1,014 submissions, 23%

