DOI: 10.1145/3547276.3548524
Research Article
Public Access

Designing Hierarchical Multi-HCA Aware Allgather in MPI

Published: 13 January 2023

Abstract

To accelerate communication between nodes, supercomputers are now equipped with multiple network adapters per node, resulting in a "multi-rail" network. The second- and third-ranked systems on the Top500 list use two adapters per node, and the recently deployed ThetaGPU system at Argonne National Laboratory (ANL) uses eight adapters per node. With such an abundance of networking resources, utilizing all of them is a non-trivial task. The Message Passing Interface (MPI) is a dominant programming model for high-performance computing clusters, yet not all MPI collectives utilize all available adapters, and this shortfall becomes more apparent as bandwidth and adapter count grow in a given cluster.
In this work, we take up this task and propose hierarchical, multi-HCA aware Allgather designs; Allgather is a communication-intensive collective widely used in applications such as matrix multiplication and as a building block of other collectives. The proposed designs fully utilize all the available network adapters within a node and provide high overlap between inter-node and intra-node communication. At the micro-benchmark level, the new schemes improve performance for both single-node and multi-node communication. We observe inter-node improvements of up to 62% and 61% over HPC-X and MVAPICH2-X, respectively, at 1024 processes. The design for inter-node communication also boosts the performance of Ring Allreduce by 56% and 44% compared to HPC-X and MVAPICH2-X. At the application level, the enhanced Allgather shows 1.98x and 1.42x improvement in a matrix-vector multiplication kernel compared to HPC-X and MVAPICH2-X, and Allreduce performs up to 7.83% better in deep-learning training than MVAPICH2-X.
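
To make the hierarchical structure concrete, the sketch below (C with MPI) shows the generic two-level Allgather pattern that designs of this kind build on: an intra-node gather to a per-node leader, an inter-node exchange among leaders, and an intra-node broadcast. It is an illustrative sketch under stated assumptions, not the implementation evaluated in the paper; the function name hierarchical_allgather, the contiguous-datatype and uniform processes-per-node assumptions, and the decision to leave multi-rail striping of the leader-level exchange to the MPI library are ours.

/*
 * A minimal sketch (not the authors' implementation) of a two-level
 * Allgather: intra-node gather to a node leader, inter-node exchange
 * among leaders, intra-node broadcast.
 * Assumptions (ours): contiguous datatype, the same number of processes
 * on every node, and node-contiguous rank ordering so the leader-level
 * exchange reproduces the standard rank-ordered Allgather layout.
 * Multi-rail striping of the leader exchange is left to the MPI library.
 */
#include <mpi.h>
#include <stdlib.h>

int hierarchical_allgather(const void *sendbuf, int count, MPI_Datatype dtype,
                           void *recvbuf, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Per-node (shared-memory) communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Node leaders (node_rank == 0) form the inter-node communicator. */
    MPI_Comm leader_comm;
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank,
                   &leader_comm);

    MPI_Aint lb, extent;
    MPI_Type_get_extent(dtype, &lb, &extent);
    char *node_buf = malloc((size_t)count * node_size * extent);

    /* Step 1: gather all local blocks onto the node leader. */
    MPI_Gather(sendbuf, count, dtype, node_buf, count, dtype, 0, node_comm);

    /* Step 2: leaders exchange node-sized blocks across nodes. This is
     * the call an HCA-aware design would stripe across the adapters. */
    if (node_rank == 0) {
        MPI_Allgather(node_buf, count * node_size, dtype,
                      recvbuf, count * node_size, dtype, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* Step 3: leader broadcasts the assembled result within the node. */
    MPI_Bcast(recvbuf, count * size, dtype, 0, node_comm);

    free(node_buf);
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}

In such a structure the inter-node step dominates for large messages, so the HCA-aware part of a design like the paper's would split each leader's node-sized block into chunks, drive the chunks over different adapters concurrently, and overlap that traffic with the intra-node gather and broadcast.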

Cited By

  • Accelerating communication with multi-HCA aware collectives in MPI. Concurrency and Computation: Practice and Experience 36(1). Online publication date: 9 August 2023. DOI: 10.1002/cpe.7879

Information

Published In

ICPP Workshops '22: Workshop Proceedings of the 51st International Conference on Parallel Processing
August 2022
233 pages
ISBN:9781450394451
DOI:10.1145/3547276
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 January 2023

Author Tags

  1. Allgather
  2. Allreduce
  3. Collectives
  4. HCA-aware
  5. MPI
  6. Network-aware

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '22
ICPP '22: 51st International Conference on Parallel Processing
August 29 - September 1, 2022
Bordeaux, France

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
