research-article
Open access

BoostN: Optimizing Imbalanced Neighborhood Communication on Homogeneous Many-Core System

Published: 12 August 2024

Abstract

MPI neighborhood communication with sparse and imbalanced patterns is common in process-level parallel programs. However, such programs often suffer significant slowdowns on today's many-core clusters, which feature dozens of cores per node. There are two key causes. First, when a large number of processes access the MPI library simultaneously, they compete heavily for memory and network ports. Second, many neighborhood communications do not map well onto the many-core architecture, creating performance bottlenecks that could otherwise be avoided.
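To make the pattern concrete, the sketch below shows the kind of sparse, per-rank-imbalanced exchange described above, expressed with the standard MPI-3 neighborhood-collective interface (MPI_Dist_graph_create_adjacent plus MPI_Neighbor_alltoallv). The neighbor lists and per-neighbor counts are illustrative placeholders supplied by the caller; in a real SpMV or multigrid code they would come from the matrix or mesh partition and can differ sharply across ranks.

```c
/* Sketch of a sparse, imbalanced neighborhood exchange with MPI-3
 * neighborhood collectives.  Neighbor lists and per-neighbor counts
 * are placeholders supplied by the caller. */
#include <mpi.h>
#include <stdlib.h>

void neighbor_exchange(int indegree, const int *sources,
                       int outdegree, const int *destinations,
                       const int *sendcounts, const int *recvcounts,
                       const double *sendbuf, double *recvbuf)
{
    MPI_Comm graph_comm;

    /* Communicator that records only the ranks this process talks to. */
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   indegree, sources, MPI_UNWEIGHTED,
                                   outdegree, destinations, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0 /* keep rank order */,
                                   &graph_comm);

    /* Displacements derived from the (typically very uneven) counts. */
    int *sdispls = malloc((size_t)outdegree * sizeof *sdispls);
    int *rdispls = malloc((size_t)indegree * sizeof *rdispls);
    for (int i = 0, off = 0; i < outdegree; ++i) {
        sdispls[i] = off;
        off += sendcounts[i];
    }
    for (int i = 0, off = 0; i < indegree; ++i) {
        rdispls[i] = off;
        off += recvcounts[i];
    }

    /* One call exchanges data with all neighbors at once. */
    MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                           recvbuf, recvcounts, rdispls, MPI_DOUBLE,
                           graph_comm);

    free(sdispls);
    free(rdispls);
    MPI_Comm_free(&graph_comm);
}
```

When dozens of ranks per node issue such calls at the same time, they contend for the node's memory bandwidth and network ports, which is the first source of slowdown identified above.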
In this paper, we leverage communication patterns to address these issues. We eliminate redundant copies and aggregate messages to optimize intra-node communication, and we relieve both intra-node and inter-node bottlenecks through process mapping. Combining these optimizations, we present BoostN, a standalone library that accelerates imbalanced neighborhood communication on many-core systems. BoostN works with mainstream homogeneous architectures and recent versions of popular MPI libraries. Experiments show that BoostN achieves a geometric-mean speedup of up to 4.94x for SpMV across 2,708 matrices from SuiteSparse, a speedup of up to 8.18x for a latency-bound Laser problem, and a speedup of up to 8.98x for a bandwidth-bound Oil problem solved with Hypre.
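The abstract does not spell out BoostN's internals, so the following is only a hypothetical sketch of the general zero-copy idea for intra-node neighbors, built on standard MPI-3 shared-memory windows rather than BoostN's own API: each rank on a node exposes its send buffer through a shared window, and an intra-node neighbor reads the data directly instead of routing it through the MPI library's internal staging buffers.

```c
/* Hypothetical illustration only -- not BoostN's actual implementation.
 * Ranks on the same node expose their send buffer through an MPI-3
 * shared-memory window; an intra-node neighbor then reads the peer's
 * data directly, skipping the MPI library's intermediate copies.
 * Window synchronization (e.g. MPI_Win_lock_all or a barrier before
 * reading) is omitted here for brevity. */
#include <mpi.h>

/* Allocate this rank's send buffer inside a node-local shared window. */
double *setup_shared_sendbuf(MPI_Comm comm, MPI_Aint nbytes,
                             MPI_Comm *node_comm, MPI_Win *win)
{
    double *sendbuf;

    /* Sub-communicator containing only the ranks on this node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, node_comm);

    MPI_Win_allocate_shared(nbytes, (int)sizeof(double), MPI_INFO_NULL,
                            *node_comm, &sendbuf, win);
    return sendbuf;
}

/* Obtain a direct pointer to an intra-node peer's send buffer. */
const double *peer_sendbuf(MPI_Win win, int node_rank)
{
    MPI_Aint size;
    int disp_unit;
    double *ptr;

    MPI_Win_shared_query(win, node_rank, &size, &disp_unit, &ptr);
    return ptr;
}
```

Whether the receiver copies the data once into its own buffer or consumes it in place, the redundant staging copy inside the MPI library is avoided, which is the effect the intra-node optimization aims for.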

References

[1]
2023. AMG2013. https://asc.llnl.gov/codes/proxy-apps/amg2013. [Accessed 23-06-2024].
[2]
2023. HYPRE. https://computing.llnl.gov/projects/hypre-scalable-linear-solvers-multigrid-methods. [Accessed 23-06-2024].
[3]
2024. AMD EPYC 9754. https://www.amd.com/en/products/cpu/amd-epyc-9754. [Accessed 23-06-2024].
[4]
2024. HPC-X. https://developer.nvidia.com/networking/. [Accessed 23-06-2024].
[5]
2024. MPI standard. https://www.mpi-forum.org/. [Accessed 23-06-2024].
[6]
2024. MPICH: High-Performance Portable MPI. https://www.mpich.org/. [Accessed 23-06-2024].
[7]
2024. Open MPI: Open Source High Performance Computing. https://www.open-mpi.org/. [Accessed 23-06-2024].
[8]
2024. SuiteSparse Matrix Collection. https://sparse.tamu.edu/. [Accessed 23-06-2024].
[9]
Albert Alexandrov, Mihai F Ionescu, Klaus E Schauser, and Chris Scheiman. 1995. LogGP: Incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation. In Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures. 95–105.
[10]
Robert Anderson, Julian Andrej, Andrew Barker, Jamie Bramwell, Jean-Sylvain Camier, Jakub Cerveny, Veselin Dobrev, Yohann Dudouit, Aaron Fisher, Tzanio Kolev, 2021. MFEM: A modular finite element methods library. Computers & Mathematics with Applications 81 (2021), 42–74.
[11]
Satish Balay, Shrirang Abhyankar, Mark Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Alp Dener, Victor Eijkhout, William Gropp, 2019. PETSc users manual. (2019).
[12]
Amanda Bienz, William D Gropp, and Luke N Olson. 2020. Reducing communication in algebraic multigrid with multi-step node aware communication. The International Journal of High Performance Computing Applications 34, 5 (2020), 547–561.
[13]
Sourav Chakraborty, Mohammadreza Bayatpour, J Hashmi, Hari Subramoni, and Dhabaleswar K Panda. 2018. Cooperative rendezvous protocols for improved performance and overlap. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 361–373.
[14]
Hu Chen, Wenguang Chen, Jian Huang, and Bob Kuhn. 2006. MPIPP: an automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters. In International Conference on Supercomputing. https://api.semanticscholar.org/CorpusID:7998042
[15]
Gerald Collom, Rui Peng Li, and Amanda Bienz. 2023. Optimizing Irregular Communication with Neighborhood Collectives and Locality-Aware Parallelism. In Proceedings of the SC’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. 427–437.
[16]
David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten Von Eicken. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming. 1–12.
[17]
Gokhan Danabasoglu, J-F Lamarque, J Bacmeister, DA Bailey, AK DuVivier, Jim Edwards, LK Emmons, John Fasullo, R Garcia, Andrew Gettelman, 2020. The community earth system model version 2 (CESM2). Journal of Advances in Modeling Earth Systems 12, 2 (2020), e2019MS001916.
[18]
Robert D Falgout and Jacob B Schroder. 2014. Non-Galerkin coarse grids for algebraic multigrid. SIAM Journal on Scientific Computing 36, 3 (2014), C309–C334.
[19]
Karl Fürlinger, Colin Glass, Jose Gracia, Andreas Knüpfer, Jie Tao, Denis Hünich, Kamran Idrees, Matthias Maiterth, Yousri Mhedheb, and Huan Zhou. 2014. DASH: Data structures and algorithms with support for hierarchical locality. In Euro-Par 2014: Parallel Processing Workshops: Euro-Par 2014 International Workshops, Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II 20. Springer, 542–552.
[20]
Hormozd Gahvari, Allison H Baker, Martin Schulz, Ulrike Meier Yang, Kirk E Jordan, and William Gropp. 2011. Modeling the performance of an algebraic multigrid cycle on HPC platforms. In Proceedings of the international conference on Supercomputing. 172–181.
[21]
Brice Goglin and Stéphanie Moreaud. 2013. KNEM: A generic and scalable kernel-assisted intra-node MPI communication framework. J. Parallel and Distrib. Comput. 73, 2 (2013), 176–188.
[22]
Takao Hatazaki. 1998. Rank reordering strategy for MPI topology creation functions. In European Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting. Springer, 188–195.
[23]
Michael A Heroux, Lois Curfman McInnes, Rajeev Thakur, Jeffrey S Vetter, Xiaoye Sherry Li, James Aherns, Todd Munson, and Kathryn Mohror. 2020. ECP software technology capability assessment report. Technical Report. Oak Ridge National Lab.(ORNL), Oak Ridge, TN (United States).
[24]
Torsten Hoefler and Marc Snir. 2011. Generic topology mapping strategies for large-scale parallel architectures. In Proceedings of the international conference on Supercomputing. 75–84.
[25]
Wei Huang, Matthew J Koop, and Dhabaleswar K Panda. 2008. Efficient one-copy MPI shared memory communication in virtual machines. In 2008 IEEE International Conference on Cluster Computing. IEEE, 107–115.
[26]
Laxmikant V Kale and Sanjeev Krishnan. 1993. Charm++ a portable concurrent object oriented system based on c++. In Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications. 91–108.
[27]
George Karypis. 1997. METIS: Unstructured graph partitioning and sparse matrix ordering system. Technical report (1997).
[28]
George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing 20, 1 (1998), 359–392.
[29]
John D McCalpin. 1995. STREAM benchmark. www.cs.virginia.edu/stream/ref.html#what (1995).
[30]
Michael Noeth, Prasun Ratn, Frank Mueller, Martin Schulz, and Bronis R De Supinski. 2009. Scalatrace: Scalable compression and replay of communication traces for high-performance computing. J. Parallel and Distrib. Comput. 69, 8 (2009), 696–710.
[31]
Robert W Numrich and John Reid. 1998. Co-Array Fortran for parallel programming. In ACM Sigplan Fortran Forum, Vol. 17. ACM New York, NY, USA, 1–31.
[32]
K Pedretti and B Barrett. 2020. Xpmem: Cross-process memory mapping.
[33]
Jintao Peng, Jianbin Fang, Jie Liu, Min Xie, Yi Dai, Bo Yang, Shengguo Li, and Zheng Wang. 2023. Optimizing MPI Collectives on Shared Memory Multi-Cores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15.
[34]
James Psota and Armando Solar-Lezama. 2024. Pure: Evolving Message Passing To Better Leverage Shared Memory Within Nodes. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 133–146.
[35]
Ken Raffenetti, Abdelhalim Amer, Lena Oden, Charles Archer, Wesley Bland, Hajime Fujita, Yanfei Guo, Tomislav Janjusic, Dmitry Durnov, Michael Blocksome, 2017. Why is MPI so slow? analyzing the fundamental limits in implementing MPI-3.1. In Proceedings of the international conference for high performance computing, networking, storage and analysis. 1–12.
[36]
Christian Schulz and Jesper Larsson Träff. 2017. Better Process Mapping and Sparse Quadratic Assignment. In Proceedings of the 16th International Symposium on Experimental Algorithms (SEA'17) (LIPIcs, Vol. 75). Dagstuhl, 4:1–4:15. arXiv:1702.04164.
[37]
Matthew Small and Xin Yuan. 2009. Maximizing mpi point-to-point communication performance on rdma-enabled clusters with customized protocols. In Proceedings of the 23rd international conference on Supercomputing. 306–315.
[38]
Aidan P Thompson, H Metin Aktulga, Richard Berger, Dan S Bolintineanu, W Michael Brown, Paul S Crozier, Pieter J In’t Veld, Axel Kohlmeyer, Stan G Moore, Trung Dac Nguyen, 2022. LAMMPS-a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales. Computer Physics Communications 271 (2022), 108171.
[39]
Chris Walshaw and Mark Cross. 2000. Mesh partitioning: a multilevel balancing and refinement algorithm. SIAM Journal on Scientific Computing 22, 1 (2000), 63–80.
[40]
Jidong Zhai, Tianwei Sheng, Jiangzhou He, Wenguang Chen, and Weimin Zheng. 2009. FACT: Fast communication trace collection for parallel applications through program slicing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. 1–12.
[41]
Yili Zheng, Amir Kamil, Michael B Driscoll, Hongzhang Shan, and Katherine Yelick. 2014. UPC++: a PGAS extension for C++. In 2014 IEEE 28th international parallel and distributed processing symposium. IEEE, 1105–1114.

    Published In

    ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
    August 2024
    1279 pages
    ISBN: 9798400717932
    DOI: 10.1145/3673038
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. MPI
    2. Many-core Processor
    3. Neighborhood Communication
    4. Performance Contention
    5. Sparse Problem

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • National Key R&D Program of China

    Conference

    ICPP '24

    Acceptance Rates

    Overall Acceptance Rate 91 of 313 submissions, 29%
