research-article
Open access

BoostN: Optimizing Imbalanced Neighborhood Communication on Homogeneous Many-Core System

Published: 12 August 2024

Abstract

MPI neighborhood communication with sparse and imbalanced patterns is common in process-level parallel programs. However, such programs often suffer significant slowdowns on today's many-core clusters, which feature dozens of cores per node. There are two key causes. First, when a large number of processes access the MPI library simultaneously, they compete heavily for memory and network ports. Second, many neighborhood communications do not map well onto the many-core architecture, creating performance bottlenecks that could otherwise be avoided.
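To make the pattern concrete, the sketch below shows the kind of sparse, per-rank-imbalanced exchange described above, expressed with the standard MPI-3 neighborhood-collective interface (MPI_Dist_graph_create_adjacent plus MPI_Neighbor_alltoallv). The neighbor lists and per-neighbor counts are illustrative placeholders supplied by the caller; in a real SpMV or multigrid code they would come from the matrix or mesh partition and can differ sharply across ranks.

```c
/* Sketch of a sparse, imbalanced neighborhood exchange with MPI-3
 * neighborhood collectives.  Neighbor lists and per-neighbor counts
 * are placeholders supplied by the caller. */
#include <mpi.h>
#include <stdlib.h>

void neighbor_exchange(int indegree, const int *sources,
                       int outdegree, const int *destinations,
                       const int *sendcounts, const int *recvcounts,
                       const double *sendbuf, double *recvbuf)
{
    MPI_Comm graph_comm;

    /* Communicator that records only the ranks this process talks to. */
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   indegree, sources, MPI_UNWEIGHTED,
                                   outdegree, destinations, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0 /* keep rank order */,
                                   &graph_comm);

    /* Displacements derived from the (typically very uneven) counts. */
    int *sdispls = malloc((size_t)outdegree * sizeof *sdispls);
    int *rdispls = malloc((size_t)indegree * sizeof *rdispls);
    for (int i = 0, off = 0; i < outdegree; ++i) {
        sdispls[i] = off;
        off += sendcounts[i];
    }
    for (int i = 0, off = 0; i < indegree; ++i) {
        rdispls[i] = off;
        off += recvcounts[i];
    }

    /* One call exchanges data with all neighbors at once. */
    MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                           recvbuf, recvcounts, rdispls, MPI_DOUBLE,
                           graph_comm);

    free(sdispls);
    free(rdispls);
    MPI_Comm_free(&graph_comm);
}
```

When dozens of ranks per node issue such calls at the same time, they contend for the node's memory bandwidth and network ports, which is the first source of slowdown identified above.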
In this paper, we leverage communication patterns to address these issues. We eliminate redundant copies and aggregate messages to optimize intra-node communication, and we relieve both intra-node and inter-node bottlenecks through process mapping. Combining these optimizations, we present BoostN, a standalone library that accelerates imbalanced neighborhood communication on many-core systems. BoostN works with mainstream homogeneous architectures and recent versions of popular MPI libraries. Experiments show that BoostN achieves a geometric-mean speedup of up to 4.94x for SpMV across 2,708 matrices from SuiteSparse, a speedup of up to 8.18x for a latency-bound Laser problem, and a speedup of up to 8.98x for a bandwidth-bound Oil problem solved with Hypre.
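The abstract does not spell out BoostN's internals, so the following is only a hypothetical sketch of the general zero-copy idea for intra-node neighbors, built on standard MPI-3 shared-memory windows rather than BoostN's own API: each rank on a node exposes its send buffer through a shared window, and an intra-node neighbor reads the data directly instead of routing it through the MPI library's internal staging buffers.

```c
/* Hypothetical illustration only -- not BoostN's actual implementation.
 * Ranks on the same node expose their send buffer through an MPI-3
 * shared-memory window; an intra-node neighbor then reads the peer's
 * data directly, skipping the MPI library's intermediate copies.
 * Window synchronization (e.g. MPI_Win_lock_all or a barrier before
 * reading) is omitted here for brevity. */
#include <mpi.h>

/* Allocate this rank's send buffer inside a node-local shared window. */
double *setup_shared_sendbuf(MPI_Comm comm, MPI_Aint nbytes,
                             MPI_Comm *node_comm, MPI_Win *win)
{
    double *sendbuf;

    /* Sub-communicator containing only the ranks on this node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, node_comm);

    MPI_Win_allocate_shared(nbytes, (int)sizeof(double), MPI_INFO_NULL,
                            *node_comm, &sendbuf, win);
    return sendbuf;
}

/* Obtain a direct pointer to an intra-node peer's send buffer. */
const double *peer_sendbuf(MPI_Win win, int node_rank)
{
    MPI_Aint size;
    int disp_unit;
    double *ptr;

    MPI_Win_shared_query(win, node_rank, &size, &disp_unit, &ptr);
    return ptr;
}
```

Whether the receiver copies the data once into its own buffer or consumes it in place, the redundant staging copy inside the MPI library is avoided, which is the effect the intra-node optimization aims for.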

References

[1]
2023. AMG2013. https://asc.llnl.gov/codes/proxy-apps/amg2013. [Accessed 23-06-2024].
[2]
2023. HYPRE. https://computing.llnl.gov/projects/hypre-scalable-linear-solvers-multigrid-methods. [Accessed 23-06-2024].
[3]
2024. AMD EPYC 9754. https://www.amd.com/en/products/cpu/amd-epyc-9754. [Accessed 23-06-2024].
[4]
2024. HPC-X. https://developer.nvidia.com/networking/. [Accessed 23-06-2024].
[5]
2024. MPI standard. https://www.mpi-forum.org/. [Accessed 23-06-2024].
[6]
2024. MPICH: High-Performance Portable MPI. https://www.mpich.org/. [Accessed 23-06-2024].
[7]
2024. Open MPI: Open Source High Performance Computing. https://www.open-mpi.org/. [Accessed 23-06-2024].
[8]
2024. SuiteSparse Matrix Collection. https://sparse.tamu.edu/. [Accessed 23-06-2024].
[9]
Albert Alexandrov, Mihai F Ionescu, Klaus E Schauser, and Chris Scheiman. 1995. LogGP: Incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation. In Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures. 95–105.
[10]
Robert Anderson, Julian Andrej, Andrew Barker, Jamie Bramwell, Jean-Sylvain Camier, Jakub Cerveny, Veselin Dobrev, Yohann Dudouit, Aaron Fisher, Tzanio Kolev, 2021. MFEM: A modular finite element methods library. Computers & Mathematics with Applications 81 (2021), 42–74.
[11]
Satish Balay, Shrirang Abhyankar, Mark Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Alp Dener, Victor Eijkhout, William Gropp, 2019. PETSc users manual. (2019).
[12]
Amanda Bienz, William D Gropp, and Luke N Olson. 2020. Reducing communication in algebraic multigrid with multi-step node aware communication. The International Journal of High Performance Computing Applications 34, 5 (2020), 547–561.
[13]
Sourav Chakraborty, Mohammadreza Bayatpour, J Hashmi, Hari Subramoni, and Dhabaleswar K Panda. 2018. Cooperative rendezvous protocols for improved performance and overlap. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 361–373.
[14]
Hu Chen, Wenguang Chen, Jian Huang, and Bob Kuhn. 2006. MPIPP: an automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters. In International Conference on Supercomputing. https://api.semanticscholar.org/CorpusID:7998042
[15]
Gerald Collom, Rui Peng Li, and Amanda Bienz. 2023. Optimizing Irregular Communication with Neighborhood Collectives and Locality-Aware Parallelism. In Proceedings of the SC’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. 427–437.
[16]
David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten Von Eicken. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming. 1–12.
[17]
Gokhan Danabasoglu, J-F Lamarque, J Bacmeister, DA Bailey, AK DuVivier, Jim Edwards, LK Emmons, John Fasullo, R Garcia, Andrew Gettelman, 2020. The community earth system model version 2 (CESM2). Journal of Advances in Modeling Earth Systems 12, 2 (2020), e2019MS001916.
[18]
Robert D Falgout and Jacob B Schroder. 2014. Non-Galerkin coarse grids for algebraic multigrid. SIAM Journal on Scientific Computing 36, 3 (2014), C309–C334.
[19]
Karl Fürlinger, Colin Glass, Jose Gracia, Andreas Knüpfer, Jie Tao, Denis Hünich, Kamran Idrees, Matthias Maiterth, Yousri Mhedheb, and Huan Zhou. 2014. DASH: Data structures and algorithms with support for hierarchical locality. In Euro-Par 2014: Parallel Processing Workshops: Euro-Par 2014 International Workshops, Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II 20. Springer, 542–552.
[20]
Hormozd Gahvari, Allison H Baker, Martin Schulz, Ulrike Meier Yang, Kirk E Jordan, and William Gropp. 2011. Modeling the performance of an algebraic multigrid cycle on HPC platforms. In Proceedings of the international conference on Supercomputing. 172–181.
[21]
Brice Goglin and Stéphanie Moreaud. 2013. KNEM: A generic and scalable kernel-assisted intra-node MPI communication framework. J. Parallel and Distrib. Comput. 73, 2 (2013), 176–188.
[22]
Takao Hatazaki. 1998. Rank reordering strategy for MPI topology creation functions. In European Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting. Springer, 188–195.
[23]
Michael A Heroux, Lois Curfman McInnes, Rajeev Thakur, Jeffrey S Vetter, Xiaoye Sherry Li, James Aherns, Todd Munson, and Kathryn Mohror. 2020. ECP software technology capability assessment report. Technical Report. Oak Ridge National Lab.(ORNL), Oak Ridge, TN (United States).
[24]
Torsten Hoefler and Marc Snir. 2011. Generic topology mapping strategies for large-scale parallel architectures. In Proceedings of the international conference on Supercomputing. 75–84.
[25]
Wei Huang, Matthew J Koop, and Dhabaleswar K Panda. 2008. Efficient one-copy MPI shared memory communication in virtual machines. In 2008 IEEE International Conference on Cluster Computing. IEEE, 107–115.
[26]
Laxmikant V Kale and Sanjeev Krishnan. 1993. Charm++ a portable concurrent object oriented system based on c++. In Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications. 91–108.
[27]
George Karypis. 1997. METIS: Unstructured graph partitioning and sparse matrix ordering system. Technical report (1997).
[28]
George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing 20, 1 (1998), 359–392.
[29]
John D McCalpin. 1995. STREAM benchmark. www.cs.virginia.edu/stream/ref.html#what (1995).
[30]
Michael Noeth, Prasun Ratn, Frank Mueller, Martin Schulz, and Bronis R De Supinski. 2009. Scalatrace: Scalable compression and replay of communication traces for high-performance computing. J. Parallel and Distrib. Comput. 69, 8 (2009), 696–710.
[31]
Robert W Numrich and John Reid. 1998. Co-Array Fortran for parallel programming. In ACM Sigplan Fortran Forum, Vol. 17. ACM New York, NY, USA, 1–31.
[32]
K Pedretti and B Barrett. 2020. Xpmem: Cross-process memory mapping.
[33]
Jintao Peng, Jianbin Fang, Jie Liu, Min Xie, Yi Dai, Bo Yang, Shengguo Li, and Zheng Wang. 2023. Optimizing MPI Collectives on Shared Memory Multi-Cores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15.
[34]
James Psota and Armando Solar-Lezama. 2024. Pure: Evolving Message Passing To Better Leverage Shared Memory Within Nodes. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 133–146.
[35]
Ken Raffenetti, Abdelhalim Amer, Lena Oden, Charles Archer, Wesley Bland, Hajime Fujita, Yanfei Guo, Tomislav Janjusic, Dmitry Durnov, Michael Blocksome, 2017. Why is MPI so slow? analyzing the fundamental limits in implementing MPI-3.1. In Proceedings of the international conference for high performance computing, networking, storage and analysis. 1–12.
[36]
Christian Schulz and Jesper Larsson Träff. 2017. Better Process Mapping and Sparse Quadratic Assignment. In Proceedings of the 16th International Symposium on Experimental Algorithms (SEA'17) (LIPIcs, Vol. 75). Dagstuhl, 4:1–4:15. arXiv:1702.04164.
[37]
Matthew Small and Xin Yuan. 2009. Maximizing mpi point-to-point communication performance on rdma-enabled clusters with customized protocols. In Proceedings of the 23rd international conference on Supercomputing. 306–315.
[38]
Aidan P Thompson, H Metin Aktulga, Richard Berger, Dan S Bolintineanu, W Michael Brown, Paul S Crozier, Pieter J In’t Veld, Axel Kohlmeyer, Stan G Moore, Trung Dac Nguyen, 2022. LAMMPS-a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales. Computer Physics Communications 271 (2022), 108171.
[39]
Chris Walshaw and Mark Cross. 2000. Mesh partitioning: a multilevel balancing and refinement algorithm. SIAM Journal on Scientific Computing 22, 1 (2000), 63–80.
[40]
Jidong Zhai, Tianwei Sheng, Jiangzhou He, Wenguang Chen, and Weimin Zheng. 2009. FACT: Fast communication trace collection for parallel applications through program slicing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. 1–12.
[41]
Yili Zheng, Amir Kamil, Michael B Driscoll, Hongzhang Shan, and Katherine Yelick. 2014. UPC++: a PGAS extension for C++. In 2014 IEEE 28th international parallel and distributed processing symposium. IEEE, 1105–1114.

    Published In

    ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
    August 2024
    1279 pages
    ISBN: 9798400717932
    DOI: 10.1145/3673038
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. MPI
    2. Many-core Processor
    3. Neighborhood Communication
    4. Performance Contention
    5. Sparse Problem

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • National Key R&D Program of China

    Conference

    ICPP '24

    Acceptance Rates

    Overall Acceptance Rate 91 of 313 submissions, 29%
