DOI: 10.1109/SC.2004.15
Article

Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation

Published: 06 November 2004

Abstract

In the area of cluster computing, InfiniBand is becoming increasingly popular due to its open standard and high performance. However, even with InfiniBand, network bandwidth can still become the performance bottleneck for some of today's most demanding applications. In this paper, we study how to overcome the bandwidth bottleneck by using multirail networks. We present different ways of setting up multirail networks with InfiniBand and propose a unified MPI design that can support all of them. We also discuss several important design issues and different policies for using multirail networks, including an adaptive striping scheme that dynamically changes the striping parameters based on current system conditions. We have implemented our design and evaluated it using both microbenchmarks and applications. Our performance results show that multirail networks can significantly improve MPI communication performance. With a two-rail InfiniBand cluster, we achieve almost twice the bandwidth and half the latency for large messages compared with the original MPI. At the application level, the multirail MPI can significantly reduce communication time as well as running time, depending on the communication pattern. We also show that the adaptive striping scheme achieves excellent performance without a priori knowledge of the bandwidth of each rail.
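
To make the striping policy described in the abstract concrete, the following small C sketch illustrates the core idea of weighted striping across rails. It is not taken from the paper: rail_t, post_send(), and the weights are hypothetical placeholders for a real per-rail InfiniBand send path (queue pairs, RDMA writes), and an adaptive scheme would keep updating the weights from observed per-rail completion times rather than fixing them up front.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical per-rail state; a real rail would also hold queue pairs,
 * completion queues, and registered memory for RDMA. */
typedef struct {
    int id;
    double weight;   /* estimated fraction of aggregate bandwidth */
} rail_t;

/* Stub standing in for posting a send (e.g. an RDMA write) on one rail. */
static void post_send(rail_t *rail, const char *buf, size_t len)
{
    (void)buf;
    printf("rail %d: sending %zu bytes\n", rail->id, len);
}

/* Stripe one large message across the rails in proportion to each rail's
 * weight; the last rail takes the remainder so the chunks always sum to len. */
static void stripe_send(rail_t rails[], int nrails, const char *buf, size_t len)
{
    size_t offset = 0;
    for (int i = 0; i < nrails; i++) {
        size_t chunk = (i == nrails - 1)
                           ? len - offset
                           : (size_t)((double)len * rails[i].weight);
        post_send(&rails[i], buf + offset, chunk);
        offset += chunk;
    }
}

int main(void)
{
    /* Two rails with unequal estimated bandwidth shares. */
    rail_t rails[2] = { { 0, 0.6 }, { 1, 0.4 } };
    static char msg[1 << 20];                 /* one 1 MB message */
    stripe_send(rails, 2, msg, sizeof msg);
    return 0;
}

With equal weights this reduces to even striping, which matches the two-rail configuration mentioned in the abstract.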

Published In

SC '04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing
November 2004
724 pages
ISBN: 0769521533

Publisher

IEEE Computer Society

United States

Publication History

Published: 06 November 2004

Qualifiers

  • Article

Conference

SC '04

Acceptance Rates

SC '04 Paper Acceptance Rate: 60 of 200 submissions, 30%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 2
  • Downloads (last 6 weeks): 0
Reflects downloads up to 26 Jan 2025

Citations

Cited By

  • (2020) Dual-Plane Isomorphic Hypercube Network. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pp. 73-80. DOI: 10.1145/3368474.3368493. Online publication date: 15-Jan-2020
  • (2019) UMR-EC. Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, pp. 219-230. DOI: 10.1145/3307681.3325406. Online publication date: 17-Jun-2019
  • (2014) Performance of Applications using Dual-Rail InfiniBand 3D Torus network on the Gordon Supercomputer. Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment, pp. 1-6. DOI: 10.1145/2616498.2616541. Online publication date: 13-Jul-2014
  • (2010) Region-Based Prefetch Techniques for Software Distributed Shared Memory Systems. Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp. 113-122. DOI: 10.1109/CCGRID.2010.16. Online publication date: 17-May-2010
  • (2009) Improving communication-phase completion times in HPC clusters through congestion mitigation. Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, pp. 1-11. DOI: 10.1145/1534530.1534552. Online publication date: 4-May-2009
  • (2007) Using CMT in SCTP-based MPI to exploit multiple interfaces in cluster nodes. Proceedings of the 14th European Conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 204-212. DOI: 10.5555/2396095.2396134. Online publication date: 30-Sep-2007
  • (2006) Adaptive connection management for scalable MPI over InfiniBand. Proceedings of the 20th International Conference on Parallel and Distributed Processing, p. 102. DOI: 10.5555/1898953.1899034. Online publication date: 25-Apr-2006
  • (2006) Optimizing bandwidth limited problems using one-sided communication and overlap. Proceedings of the 20th International Conference on Parallel and Distributed Processing, p. 84. DOI: 10.5555/1898953.1899016. Online publication date: 25-Apr-2006
  • (2006) Efficient RDMA-based multi-port collectives on multi-rail QsNetII clusters. Proceedings of the 20th International Conference on Parallel and Distributed Processing, p. 273. DOI: 10.5555/1898699.1898790. Online publication date: 25-Apr-2006
  • (2006) A software based approach for providing network fault tolerance in clusters with uDAPL interface. Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, article 85. DOI: 10.1145/1188455.1188545. Online publication date: 11-Nov-2006
