DOI: 10.1145/3524059.3532380

Optimized MPI collective algorithms for dragonfly topology

Published: 28 June 2022

Abstract

The Message Passing Interface (MPI) is the dominant programming model for scientific computing on today's supercomputers. Although many general and efficient algorithms have been proposed for MPI collective operations, there is still room for topology-aware optimization. Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology adopted by a growing number of supercomputers, yet it limits the performance of some MPI collective operations. In this paper, our analysis shows that the bottlenecks of collective algorithms on Dragonfly topology are intra-job interference, inter-job interference, and topology mismatch. We propose five optimizations, namely Pseudo-random Pairwise, Tree-based Shuffle, Reversed Recursive Doubling, Reordered Bruck, and Matched Rabenseifner, for the MPI collective operations All-Gather, All-to-All, All-Reduce, and Reduce-Scatter. We evaluate each optimization with the CODES network simulation framework under minimal, non-minimal, and adaptive routing. The simulation results demonstrate that the performance of All-to-All, All-Gather, All-Reduce, and Reduce-Scatter can be improved by 4.7X, 3.4X, 12.7%, and 4.1X, respectively, for 32768-node jobs with adaptive routing.
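For readers unfamiliar with the baseline, the sketch below shows the classical pairwise-exchange All-to-All, roughly what mainstream MPI libraries use for large messages: in step k, each rank sends its block for (rank + k) mod p and receives from (rank - k) mod p, so the exchange finishes in p - 1 fully busy steps. This is a generic reference sketch, not the authors' code; the paper's Pseudo-random Pairwise optimization presumably reshuffles this fixed peer schedule to spread traffic over the Dragonfly's global links. The function name pairwise_alltoall and the demo in main are illustrative only.

    /*
     * Reference sketch (not from the paper): classical pairwise-exchange
     * All-to-All. The fixed (rank + k) mod p peer schedule is the part a
     * pseudo-random pairwise variant would reorder.
     */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Exchange 'count' elements of 'dtype' with every rank in 'comm'.
     * sendbuf and recvbuf each hold p blocks of 'count' elements. */
    static void pairwise_alltoall(const char *sendbuf, char *recvbuf,
                                  int count, MPI_Datatype dtype, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        MPI_Aint lb, extent;
        MPI_Type_get_extent(dtype, &lb, &extent);
        MPI_Aint block = (MPI_Aint)count * extent;   /* bytes per block */

        /* Block addressed to ourselves: local copy, no network traffic. */
        memcpy(recvbuf + (MPI_Aint)rank * block,
               sendbuf + (MPI_Aint)rank * block, block);

        /* Step k: send block for (rank + k) mod p, receive from (rank - k) mod p.
         * Every rank is busy in every step, so the exchange takes p - 1 steps. */
        for (int k = 1; k < p; k++) {
            int dst = (rank + k) % p;
            int src = (rank - k + p) % p;
            MPI_Sendrecv(sendbuf + (MPI_Aint)dst * block, count, dtype, dst, 0,
                         recvbuf + (MPI_Aint)src * block, count, dtype, src, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* One int per destination: block i is addressed to rank i. */
        int *sbuf = malloc(p * sizeof(int));
        int *rbuf = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) sbuf[i] = rank * 1000 + i;

        pairwise_alltoall((const char *)sbuf, (char *)rbuf, 1, MPI_INT, MPI_COMM_WORLD);

        /* rbuf[i] now holds the value rank i addressed to us: i * 1000 + rank. */
        free(sbuf);
        free(rbuf);
        MPI_Finalize();
        return 0;
    }

Because every rank sends and receives in every step, the peer schedule, rather than the step count, is the main lever for avoiding link contention, which is why reordering it is a natural fit for a Dragonfly network.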


Cited By

  • (2023) Roar: A Router Microarchitecture for In-network Allreduce. Proceedings of the 37th ACM International Conference on Supercomputing, 423-436. DOI: 10.1145/3577193.3593711. Online publication date: 21-Jun-2023.
  • (2023) Generalized Collective Algorithms for the Exascale Era. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 60-71. DOI: 10.1109/CLUSTER52292.2023.00013. Online publication date: 31-Oct-2023.
  • (2023) A transmission optimization method for MPI communications. The Journal of Supercomputing 80(5), 6240-6263. DOI: 10.1007/s11227-023-05699-x. Online publication date: 20-Oct-2023.


    Published In

    ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing
    June 2022
    514 pages
    ISBN:9781450392815
    DOI:10.1145/3524059
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 June 2022


    Author Tags

    1. MPI
    2. collective
    3. dragonfly

    Qualifiers

    • Research-article

    Funding Sources

    • Program for Guangdong Introducing Innovative and Entrepreneurial Teams
    • National Key R&D Program of China
    • Major Program of Guangdong Basic and Applied Research
    • National Natural Science Foundation of China
    • Excellent Youth Foundation of Hunan Province
    • Guangdong Natural Science Foundation

    Conference

    ICS '22

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%


    Article Metrics

    • Downloads (last 12 months): 182
    • Downloads (last 6 weeks): 7
    Reflects downloads up to 09 Nov 2024
