DOI: 10.1145/3650200.3656591

CommBench: Micro-Benchmarking Hierarchical Networks with Multi-GPU, Multi-NIC Nodes

Published: 03 June 2024

Abstract

Modern high-performance computing systems have multiple GPUs and network interface cards (NICs) per node. The resulting network architectures form multilevel hierarchies of subnetworks built on different interconnect and software technologies. These systems offer multiple vendor-provided communication capabilities and library implementations (IPC, MPI, NCCL, RCCL, OneCCL), whose APIs deliver varying performance across the levels of the hierarchy. Understanding this performance is currently difficult because of the wide range of architectures and programming models (CUDA, HIP, OneAPI).
We present CommBench, a library with cross-system portability and a high-level API that enables developers to easily build microbenchmarks relevant to their use cases and gain insight into the bandwidth and latency of multiple implementation libraries on different networks. We demonstrate CommBench with three sets of microbenchmarks that profile the performance of six systems. Our experimental results reveal the effect of multiple NICs on achievable inter-node bandwidth and also present the performance characteristics of four available communication libraries within and across nodes of NVIDIA, AMD, and Intel GPU networks.
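To make the kind of measurement concrete, the sketch below is a minimal point-to-point ping-pong microbenchmark written in plain MPI with host buffers. It is not CommBench's API and omits what makes CommBench useful (GPU buffers, IPC/NCCL/RCCL/OneCCL backends, grouped patterns across multiple NICs); it only illustrates how latency and bandwidth are typically derived from timed, repeated transfers. The warmup count, iteration count, and message-size sweep are arbitrary choices for illustration.

    // Minimal MPI ping-pong between ranks 0 and 1: times repeated round trips
    // per message size and reports one-way latency and bandwidth.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int warmup = 5, iters = 20;              // illustrative counts
      for (int bytes = 1; bytes <= (1 << 24); bytes *= 2) {
        std::vector<char> buf(bytes);                // host buffer for simplicity
        double start = 0.0;
        for (int i = 0; i < warmup + iters; ++i) {
          if (i == warmup) {                         // start timing after warmup
            MPI_Barrier(MPI_COMM_WORLD);
            start = MPI_Wtime();
          }
          if (rank == 0) {
            MPI_Send(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
        }
        double elapsed = MPI_Wtime() - start;        // time for 'iters' round trips
        if (rank == 0) {
          double one_way_s = elapsed / (2.0 * iters);
          std::printf("%10d B  latency %9.2f us  bandwidth %8.3f GB/s\n",
                      bytes, one_way_s * 1e6, bytes / one_way_s / 1e9);
        }
      }
      MPI_Finalize();
      return 0;
    }

CommBench's actual benchmarks generalize this pattern to groups of GPU endpoints and to multiple communication libraries; see the paper and the library's documentation for the real API.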

Published In

ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, 582 pages
ISBN: 9798400706103
DOI: 10.1145/3650200
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Department of Energy

Conference

ICS '24

Acceptance Rates

ICS '24 Paper Acceptance Rate: 45 of 125 submissions, 36%
Overall Acceptance Rate: 629 of 2,180 submissions, 29%
