DOI: 10.1145/3524059.3532380

Optimized MPI collective algorithms for dragonfly topology

Published: 28 June 2022

Abstract

The Message Passing Interface (MPI) is the dominant programming model for scientific computing on today's supercomputers. Although many general and efficient algorithms have been proposed for MPI collective operations, there is still room for topology-aware optimization. Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology adopted by a growing number of supercomputers, yet it limits the performance of some MPI collective operations. In this paper, our analysis shows that the bottlenecks of collective algorithms on Dragonfly topology are intra-job interference, inter-job interference, and topology mismatch. We propose five optimizations, namely Pseudo-random Pairwise, Tree-based Shuffle, Reversed Recursive Doubling, Reordered Bruck, and Matched Rabenseifner, for the MPI collective operations All-Gather, All-to-All, All-Reduce, and Reduce-Scatter. We evaluate each optimization with the CODES network simulation framework under minimal, non-minimal, and adaptive routing. The simulation results demonstrate that the performance of All-to-All, All-Gather, All-Reduce, and Reduce-Scatter can be improved by 4.7X, 3.4X, 12.7%, and 4.1X, respectively, for 32768-node jobs with adaptive routing.
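For readers unfamiliar with the baseline, the sketch below shows the classical pairwise-exchange All-to-All, roughly what mainstream MPI libraries use for large messages: in step k, each rank sends its block for (rank + k) mod p and receives from (rank - k) mod p, so the exchange finishes in p - 1 fully busy steps. This is a generic reference sketch, not the authors' code; the paper's Pseudo-random Pairwise optimization presumably reshuffles this fixed peer schedule to spread traffic over the Dragonfly's global links. The function name pairwise_alltoall and the demo in main are illustrative only.

    /*
     * Reference sketch (not from the paper): classical pairwise-exchange
     * All-to-All. The fixed (rank + k) mod p peer schedule is the part a
     * pseudo-random pairwise variant would reorder.
     */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Exchange 'count' elements of 'dtype' with every rank in 'comm'.
     * sendbuf and recvbuf each hold p blocks of 'count' elements. */
    static void pairwise_alltoall(const char *sendbuf, char *recvbuf,
                                  int count, MPI_Datatype dtype, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        MPI_Aint lb, extent;
        MPI_Type_get_extent(dtype, &lb, &extent);
        MPI_Aint block = (MPI_Aint)count * extent;   /* bytes per block */

        /* Block addressed to ourselves: local copy, no network traffic. */
        memcpy(recvbuf + (MPI_Aint)rank * block,
               sendbuf + (MPI_Aint)rank * block, block);

        /* Step k: send block for (rank + k) mod p, receive from (rank - k) mod p.
         * Every rank is busy in every step, so the exchange takes p - 1 steps. */
        for (int k = 1; k < p; k++) {
            int dst = (rank + k) % p;
            int src = (rank - k + p) % p;
            MPI_Sendrecv(sendbuf + (MPI_Aint)dst * block, count, dtype, dst, 0,
                         recvbuf + (MPI_Aint)src * block, count, dtype, src, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* One int per destination: block i is addressed to rank i. */
        int *sbuf = malloc(p * sizeof(int));
        int *rbuf = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) sbuf[i] = rank * 1000 + i;

        pairwise_alltoall((const char *)sbuf, (char *)rbuf, 1, MPI_INT, MPI_COMM_WORLD);

        /* rbuf[i] now holds the value rank i addressed to us: i * 1000 + rank. */
        free(sbuf);
        free(rbuf);
        MPI_Finalize();
        return 0;
    }

Because every rank sends and receives in every step, the peer schedule, rather than the step count, is the main lever for avoiding link contention, which is why reordering it is a natural fit for a Dragonfly network.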


Cited By

  • (2023) Roar: A Router Microarchitecture for In-network Allreduce. Proceedings of the 37th ACM International Conference on Supercomputing, 423-436. DOI: 10.1145/3577193.3593711. Online publication date: 21-Jun-2023.
  • (2023) Generalized Collective Algorithms for the Exascale Era. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 60-71. DOI: 10.1109/CLUSTER52292.2023.00013. Online publication date: 31-Oct-2023.
  • (2023) A transmission optimization method for MPI communications. The Journal of Supercomputing 80(5), 6240-6263. DOI: 10.1007/s11227-023-05699-x. Online publication date: 20-Oct-2023.


    Published In

    ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing
    June 2022
    514 pages
    ISBN:9781450392815
    DOI:10.1145/3524059
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 June 2022


    Author Tags

    1. MPI
    2. collective
    3. dragonfly

    Qualifiers

    • Research-article

    Funding Sources

    • Program for Guangdong Introducing Innovative and Entrepreneurial Teams
    • National Key R&D Program of China
    • Major Program of Guangdong Basic and Applied Research
    • National Natural Science Foundation of China
    • Excellent Youth Foundation of Hunan Province
    • Guangdong Natural Science Foundation

    Conference

    ICS '22

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%


    Article Metrics

    • Downloads (last 12 months): 182
    • Downloads (last 6 weeks): 7
    Reflects downloads up to 09 Nov 2024
