DOI: 10.1145/3650200.3656591

CommBench: Micro-Benchmarking Hierarchical Networks with Multi-GPU, Multi-NIC Nodes

Published: 03 June 2024

Abstract

Modern high-performance computing systems have multiple GPUs and network interface cards (NICs) per node. The resulting network architectures form multilevel hierarchies of subnetworks built on different interconnect and software technologies. These systems offer multiple vendor-provided communication capabilities and library implementations (IPC, MPI, NCCL, RCCL, OneCCL), whose APIs deliver varying performance across the levels of the hierarchy. Understanding this performance is currently difficult because of the wide range of architectures and programming models (CUDA, HIP, OneAPI).
We present CommBench, a library with cross-system portability and a high-level API that enables developers to easily build microbenchmarks relevant to their use cases and gain insight into the bandwidth and latency of multiple implementation libraries on different networks. We demonstrate CommBench with three sets of microbenchmarks that profile the performance of six systems. Our experimental results reveal the effect of multiple NICs on achievable inter-node bandwidth and also present the performance characteristics of four available communication libraries within and across nodes of NVIDIA, AMD, and Intel GPU networks.
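To make the kind of measurement concrete, the sketch below is a minimal point-to-point ping-pong microbenchmark written in plain MPI with host buffers. It is not CommBench's API and omits what makes CommBench useful (GPU buffers, IPC/NCCL/RCCL/OneCCL backends, grouped patterns across multiple NICs); it only illustrates how latency and bandwidth are typically derived from timed, repeated transfers. The warmup count, iteration count, and message-size sweep are arbitrary choices for illustration.

    // Minimal MPI ping-pong between ranks 0 and 1: times repeated round trips
    // per message size and reports one-way latency and bandwidth.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int warmup = 5, iters = 20;              // illustrative counts
      for (int bytes = 1; bytes <= (1 << 24); bytes *= 2) {
        std::vector<char> buf(bytes);                // host buffer for simplicity
        double start = 0.0;
        for (int i = 0; i < warmup + iters; ++i) {
          if (i == warmup) {                         // start timing after warmup
            MPI_Barrier(MPI_COMM_WORLD);
            start = MPI_Wtime();
          }
          if (rank == 0) {
            MPI_Send(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
        }
        double elapsed = MPI_Wtime() - start;        // time for 'iters' round trips
        if (rank == 0) {
          double one_way_s = elapsed / (2.0 * iters);
          std::printf("%10d B  latency %9.2f us  bandwidth %8.3f GB/s\n",
                      bytes, one_way_s * 1e6, bytes / one_way_s / 1e9);
        }
      }
      MPI_Finalize();
      return 0;
    }

CommBench's actual benchmarks generalize this pattern to groups of GPU endpoints and to multiple communication libraries; see the paper and the library's documentation for the real API.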

Published In

ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, 582 pages
ISBN: 9798400706103
DOI: 10.1145/3650200
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Department of Energy

Conference

ICS '24

Acceptance Rates

ICS '24 Paper Acceptance Rate: 45 of 125 submissions, 36%
Overall Acceptance Rate: 629 of 2,180 submissions, 29%
