DOI: 10.1109/SC.2004.15
Article

Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation

Published: 06 November 2004

Abstract

In the area of cluster computing, InfiniBand is becoming increasingly popular due to its open standard and high performance. However, even with InfiniBand, network bandwidth can still become the performance bottleneck for some of today's most demanding applications. In this paper, we study how to overcome the bandwidth bottleneck by using multirail networks. We present different ways of setting up multirail networks with InfiniBand and propose a unified MPI design that can support all of them. We also discuss several important design issues and different policies for using multirail networks, including an adaptive striping scheme that dynamically changes the striping parameters based on current system conditions. We have implemented our design and evaluated it using both microbenchmarks and applications. Our performance results show that multirail networks can significantly improve MPI communication performance. With a two-rail InfiniBand cluster, we achieve almost twice the bandwidth and half the latency for large messages compared with the original MPI. At the application level, the multirail MPI can significantly reduce communication time as well as running time, depending on the communication pattern. We also show that the adaptive striping scheme achieves excellent performance without a priori knowledge of the bandwidth of each rail.
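
To make the striping policy described in the abstract concrete, the following small C sketch illustrates the core idea of weighted striping across rails. It is not taken from the paper: rail_t, post_send(), and the weights are hypothetical placeholders for a real per-rail InfiniBand send path (queue pairs, RDMA writes), and an adaptive scheme would keep updating the weights from observed per-rail completion times rather than fixing them up front.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical per-rail state; a real rail would also hold queue pairs,
 * completion queues, and registered memory for RDMA. */
typedef struct {
    int id;
    double weight;   /* estimated fraction of aggregate bandwidth */
} rail_t;

/* Stub standing in for posting a send (e.g. an RDMA write) on one rail. */
static void post_send(rail_t *rail, const char *buf, size_t len)
{
    (void)buf;
    printf("rail %d: sending %zu bytes\n", rail->id, len);
}

/* Stripe one large message across the rails in proportion to each rail's
 * weight; the last rail takes the remainder so the chunks always sum to len. */
static void stripe_send(rail_t rails[], int nrails, const char *buf, size_t len)
{
    size_t offset = 0;
    for (int i = 0; i < nrails; i++) {
        size_t chunk = (i == nrails - 1)
                           ? len - offset
                           : (size_t)((double)len * rails[i].weight);
        post_send(&rails[i], buf + offset, chunk);
        offset += chunk;
    }
}

int main(void)
{
    /* Two rails with unequal estimated bandwidth shares. */
    rail_t rails[2] = { { 0, 0.6 }, { 1, 0.4 } };
    static char msg[1 << 20];                 /* one 1 MB message */
    stripe_send(rails, 2, msg, sizeof msg);
    return 0;
}

With equal weights this reduces to even striping, which matches the two-rail configuration mentioned in the abstract.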

Published In

SC '04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing
November 2004
724 pages
ISBN: 0769521533

Publisher

IEEE Computer Society

United States

Publication History

Published: 06 November 2004

Qualifiers

  • Article

Conference

SC '04

Acceptance Rates

SC '04 Paper Acceptance Rate: 60 of 200 submissions, 30%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 2
  • Downloads (last 6 weeks): 0
Reflects downloads up to 26 Jan 2025

Citations

Cited By

  • (2020) Dual-Plane Isomorphic Hypercube Network. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pp. 73-80. DOI: 10.1145/3368474.3368493. Online publication date: 15-Jan-2020
  • (2019) UMR-EC. Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, pp. 219-230. DOI: 10.1145/3307681.3325406. Online publication date: 17-Jun-2019
  • (2014) Performance of Applications using Dual-Rail InfiniBand 3D Torus network on the Gordon Supercomputer. Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment, pp. 1-6. DOI: 10.1145/2616498.2616541. Online publication date: 13-Jul-2014
  • (2010) Region-Based Prefetch Techniques for Software Distributed Shared Memory Systems. Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp. 113-122. DOI: 10.1109/CCGRID.2010.16. Online publication date: 17-May-2010
  • (2009) Improving communication-phase completion times in HPC clusters through congestion mitigation. Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, pp. 1-11. DOI: 10.1145/1534530.1534552. Online publication date: 4-May-2009
  • (2007) Using CMT in SCTP-based MPI to exploit multiple interfaces in cluster nodes. Proceedings of the 14th European Conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 204-212. DOI: 10.5555/2396095.2396134. Online publication date: 30-Sep-2007
  • (2006) Adaptive connection management for scalable MPI over InfiniBand. Proceedings of the 20th International Conference on Parallel and Distributed Processing, p. 102. DOI: 10.5555/1898953.1899034. Online publication date: 25-Apr-2006
  • (2006) Optimizing bandwidth limited problems using one-sided communication and overlap. Proceedings of the 20th International Conference on Parallel and Distributed Processing, p. 84. DOI: 10.5555/1898953.1899016. Online publication date: 25-Apr-2006
  • (2006) Efficient RDMA-based multi-port collectives on multi-rail QsNetII clusters. Proceedings of the 20th International Conference on Parallel and Distributed Processing, p. 273. DOI: 10.5555/1898699.1898790. Online publication date: 25-Apr-2006
  • (2006) A software based approach for providing network fault tolerance in clusters with uDAPL interface. Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, article 85. DOI: 10.1145/1188455.1188545. Online publication date: 11-Nov-2006
