Research Article
DOI: 10.1145/2966884.2966918

Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All

Published: 25 September 2016

Abstract

The MPI all-to-all algorithm is a data-intensive, high-cost collective operation used by many scientific High Performance Computing applications. Optimizations for small data exchanges use aggregation techniques, such as the Bruck algorithm, to minimize the number of messages sent and thereby the overall operation latency. This paper presents three variants of the Bruck algorithm, which differ in the way data is laid out in memory at intermediate steps of the algorithm. Mellanox's InfiniBand support for Host Channel Adapter (HCA) hardware scatter/gather is used selectively to replace CPU-based buffer packing and unpacking. Using this offload capability reduces the eight- and sixteen-byte all-to-all latency on 1024 MPI processes by 9.7% and 9.1%, respectively, and decreases the total memory-handling time by 40.6% and 57.9%, respectively.
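For context on the baseline being optimized, the sketch below (not the authors' implementation; function and buffer names are illustrative) shows a small-message Bruck all-to-all in C over MPI point-to-point calls, with explicit CPU-side packing and unpacking around each exchange. It is this memory-handling step that the paper replaces with HCA hardware gather/scatter.

/*
 * Minimal sketch of a small-message Bruck all-to-all with CPU-based
 * packing/unpacking (illustrative only; not the paper's implementation).
 * Each rank contributes 'bytes' bytes for every other rank.
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static void bruck_alltoall(const char *sendbuf, char *recvbuf,
                           int bytes, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    char *work = malloc((size_t)size * bytes);   /* rotated working copy   */
    char *pack = malloc((size_t)size * bytes);   /* staging for each round */

    /* Phase 1: local rotation so block i holds data destined for
     * rank (rank + i) % size. */
    for (int i = 0; i < size; i++)
        memcpy(work + (size_t)i * bytes,
               sendbuf + (size_t)((rank + i) % size) * bytes, bytes);

    /* Phase 2: log2(size) rounds; in the round for bit k, every block
     * whose index has bit k set is packed, exchanged, and unpacked.
     * This packing/unpacking is the memory-handling cost the paper
     * offloads to the HCA gather/scatter engine. */
    for (int k = 1; k < size; k <<= 1) {
        int sendto   = (rank + k) % size;
        int recvfrom = (rank - k + size) % size;

        int nblocks = 0;
        for (int i = 0; i < size; i++)
            if (i & k)
                memcpy(pack + (size_t)(nblocks++) * bytes,
                       work + (size_t)i * bytes, bytes);

        MPI_Sendrecv_replace(pack, nblocks * bytes, MPI_BYTE,
                             sendto, 0, recvfrom, 0,
                             comm, MPI_STATUS_IGNORE);

        nblocks = 0;
        for (int i = 0; i < size; i++)
            if (i & k)
                memcpy(work + (size_t)i * bytes,
                       pack + (size_t)(nblocks++) * bytes, bytes);
    }

    /* Phase 3: after all rounds, block i holds the data that originated
     * on rank (rank - i) % size; rotate it into the final layout. */
    for (int i = 0; i < size; i++)
        memcpy(recvbuf + (size_t)((rank - i + size) % size) * bytes,
               work + (size_t)i * bytes, bytes);

    free(work);
    free(pack);
}

In the offloaded variants described in the abstract, the memcpy-based packing and unpacking loops around the exchange would instead be expressed as scatter/gather entries handed to the InfiniBand HCA.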

References

[1]
http://www.mpi-forum.org
[2]
http://mvapich.cse.ohio-state.edu/overview/
[3]
https://www.mpich.org/
[4]
http://www.ana-gainaru.com/eurompi16/pingpong.xlsx
[5]
G. Almási, P. Heidelberger, C. J. Archer, X. Martorell, C. C. Erway, J. E. Moreira, B. Steinmacher-Burow, and Y. Zheng. Optimization of MPI collective communication on BlueGene/L systems. In Proceedings of the 19th Annual International Conference on Supercomputing, ICS '05, pages 253--262, New York, NY, USA, 2005. ACM.
[6]
J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, and D. Weathersby. Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 8(11):1143--1156, 1997.
[7]
G. Fagg, G. Bosilca, J. Pješivac-Grbović, T. Angskun, and J. Dongarra. Tuned: An Open MPI collective communications component. In P. Kacsuk, T. Fahringer, and Z. Németh, editors, Distributed and Parallel Systems, chapter 7, pages 65--72. Springer US, Boston, MA, 2007.
[8]
G. Santhanaraman, J. Wu, and D. K. Panda. Zero-copy MPI derived datatype communication over InfiniBand. In Proceedings of the 11th European PVM/MPI Users' Group Meeting, September 2004.
[9]
E. Gabriel, G. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings of the 11th European PVM/MPI Users' Group Meeting, pages 97--104, 2004.
[10]
InfiniBand Trade Association. The InfiniBand Architecture. http://www.infinibandta.org/specs.
[11]
K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur, and D. K. Panda. High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT. Comput. Sci., 26:237--246, June 2011.
[12]
A. R. Mamidala, R. Kumar, D. De, and D. K. Panda. MPI collectives on modern multicore clusters: Performance optimizations and communication characteristics. In IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pages 130--137, 2008.
[13]
Y. Qian. Design and evaluation of efficient collective communications on modern interconnects and multi-core clusters. PhD thesis, Queen's University, Kingston, Ontario, Canada, 2010.
[14]
Y. Qian and A. Afsahi. Process arrival pattern aware alltoall and allgather on InfiniBand clusters. International Journal of Parallel Programming, 39(4):473--493, 2011.
[15]
R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of collective communication operations in MPICH. IJHPCA, 19(1):49--66, 2005.
[16]
J. L. Träff and A. Rougier. MPI collectives and datatypes for hierarchical all-to-all communication. In Proceedings of the 21st European MPI Users' Group Meeting, EuroMPI/ASIA '14, pages 27:27--27:32, New York, NY, USA, 2014. ACM.
[17]
M. G. Venkata, R. L. Graham, J. Ladd, and P. Shamis. Exploring the all-to-all collective optimization space with ConnectX CORE-Direct. In 41st International Conference on Parallel Processing (ICPP), pages 289--298, 2012.
[18]
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche, and D. Panda. High performance alltoall and allgather designs for InfiniBand MIC clusters. May 2014.
[19]
E. Zahavi, G. Johnson, D. J. Kerbyson, and M. Lang. Optimized InfiniBand fat-tree routing for shift all-to-all communication patterns. Concurrency and Computation: Practice and Experience, 22(2):217--231, 2010.



Published In

EuroMPI '16: Proceedings of the 23rd European MPI Users' Group Meeting
September 2016
225 pages
ISBN:9781450342346
DOI:10.1145/2966884

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 September 2016


Author Tags

  1. All-to-All
  2. Collective Communication
  3. MPI
  4. Network Offload

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroMPI 2016
EuroMPI 2016: The 23rd European MPI Users' Group Meeting
September 25 - 28, 2016
Edinburgh, United Kingdom

Acceptance Rates

Overall Acceptance Rate: 66 of 139 submissions, 47%


Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 2
Reflects downloads up to 01 Sep 2024


Cited By

  • (2024) Configurable Algorithms for All-to-All Collectives. ISC High Performance 2024 Research Paper Proceedings (39th International Conference), pages 1-12. DOI: 10.23919/ISC.2024.10528936. Online publication date: May 2024.
  • (2023) PetPS: Supporting Huge Embedding Models with Persistent Memory. Proceedings of the VLDB Endowment, 16(5):1013-1022. DOI: 10.14778/3579075.3579077. Online publication date: 6 March 2023.
  • (2022) Hybrid Approach to Optimize MPI Collectives by In-network-computation and Point-to-Point Messages. 2022 7th International Conference on Computer and Communication Systems (ICCCS), pages 773-783. DOI: 10.1109/ICCCS55155.2022.9846190. Online publication date: 22 April 2022.
  • (2022) Hierarchical Communication Optimization for FFT. 2022 IEEE/ACM International Workshop on Hierarchical Parallelism for Exascale Computing (HiPar), pages 12-21. DOI: 10.1109/HiPar56574.2022.00007. Online publication date: November 2022.
  • (2021) Breakfast of champions. Proceedings of the Workshop on Hot Topics in Operating Systems, pages 199-205. DOI: 10.1145/3458336.3465287. Online publication date: 1 June 2021.
  • (2020) Using Arm Scalable Vector Extension to Optimize OPEN MPI. 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pages 222-231. DOI: 10.1109/CCGrid49817.2020.00-71. Online publication date: May 2020.
