DOI: 10.1145/1122971.1122978

RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits

Published: 29 March 2006

Abstract

Message Passing Interface (MPI) is a popular parallel programming model for scientific applications. Most high-performance MPI implementations use a rendezvous protocol for efficient transfer of large messages. The protocol can be designed around either RDMA Write or RDMA Read, and it is usually implemented with RDMA Write. The RDMA Write based protocol requires a two-way handshake between the sending and receiving processes. At the same time, to achieve low latency, MPI implementations often provide a polling based progress engine, and the two-way handshake requires this engine to discover multiple control messages. This in turn requires MPI applications to call into the MPI library to make progress, which compute- or I/O-intensive applications cannot do. As a result, most communication progress is made only after the computation or I/O is over. This severely hampers computation/communication overlap, which can have a detrimental impact on overall application performance. In this paper, we propose several mechanisms that exploit RDMA Read and selective interrupt based asynchronous progress to provide better computation/communication overlap on InfiniBand clusters. Our evaluations reveal that it is possible to achieve nearly complete computation/communication overlap using our RDMA Read with interrupt based protocol. Additionally, our schemes yield around 50% better communication progress rate when computation is overlapped with communication. Further, our application evaluation with Linpack (HPL) and NAS-SP (Class C) reveals that MPI_Wait time is reduced by around 30% and 28%, respectively, on a 32-node InfiniBand cluster. We observe that the gains in MPI_Wait time grow as the system size increases, indicating that our designs have a strong positive impact on the scalability of parallel applications.
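
The overlap problem the abstract describes can be made concrete with a small timing sketch. The program below is not code from the paper; it is a minimal illustration in which the 4 MB message size (large enough to trigger the rendezvous protocol in typical MPI implementations), the compute() placeholder that makes no MPI calls, and the two-rank transfer are all assumptions chosen only for demonstration. It posts a non-blocking transfer, computes, and then times MPI_Wait. With a polling-only, RDMA Write based rendezvous the two-way handshake is usually discovered only once MPI_Wait is called, so most of the transfer falls inside the wait; with RDMA Read and interrupt-driven asynchronous progress the transfer can proceed during compute() and the wait shrinks.

    /* Minimal overlap micro-benchmark sketch (illustrative, not from the paper). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (4 * 1024 * 1024)   /* 4 MB: large enough to use the rendezvous path */

    static void compute(void)
    {
        /* Stand-in for application computation that makes no MPI calls,
           so a purely polling-based progress engine cannot run meanwhile. */
        volatile double x = 0.0;
        for (long i = 0; i < 200000000L; i++)
            x += 1e-9;
        (void)x;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Request req = MPI_REQUEST_NULL;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
            MPI_Finalize();
            return 1;
        }

        buf = malloc(N);

        double t0 = MPI_Wtime();
        if (rank == 0)
            MPI_Isend(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        else if (rank == 1)
            MPI_Irecv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);

        compute();                          /* transfer may or may not progress here */

        double t1 = MPI_Wtime();
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* any remaining transfer finishes here  */
        double t2 = MPI_Wtime();

        if (rank <= 1)
            printf("rank %d: compute %.3f s, MPI_Wait %.3f s\n",
                   rank, t1 - t0, t2 - t1);

        free(buf);
        MPI_Finalize();
        return 0;
    }

The time spent in MPI_Wait in a pattern like this is the metric the paper reports improving by around 30% for HPL and 28% for NAS-SP on a 32-node cluster.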

    Published In

    PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
    March 2006
    258 pages
    ISBN: 1595931899
    DOI: 10.1145/1122971

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. InfiniBand
    2. MPI
    3. communication overlap

    Conference

    PPoPP06

    Acceptance Rates

    Overall Acceptance Rate: 230 of 1,014 submissions, 23%

