DOI: 10.1145/3577193.3593711

Roar: A Router Microarchitecture for In-network Allreduce

Published: 21 June 2023

Abstract

The allreduce operation is the most commonly used collective operation in distributed and parallel applications. It aggregates data contributed by distributed hosts and broadcasts the aggregated result back to them. In-network computing can accelerate allreduce by offloading the operation into network devices. However, existing in-network solutions struggle to achieve high throughput, to aggregate large messages efficiently, and to produce repeatable results. In this work, we propose a simple and effective router microarchitecture for in-network allreduce, which uses an RDMA protocol to improve throughput. We further discuss strategies to tackle the aforementioned challenges. As demonstrated through experiments, our approach not only shows advantages over state-of-the-art in-network solutions, but also accelerates allreduce at a near-optimal level compared with host-based algorithms.
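For readers unfamiliar with the operation, the sketch below shows the conventional host-based form of allreduce using the standard MPI_Allreduce call. It is an illustrative baseline only, not code from the paper; Roar performs the same reduce-and-broadcast semantics inside the network routers rather than on the hosts.

```c
/* Minimal host-based allreduce sketch (illustrative baseline, not the
 * paper's code): every rank contributes a vector, and every rank
 * receives the element-wise sum of all contributions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank's local contribution; 4 elements chosen arbitrarily. */
    double local[4]  = {rank + 0.0, rank + 1.0, rank + 2.0, rank + 3.0};
    double global[4];

    /* Element-wise sum across all ranks; the result is broadcast back
     * so that every rank holds the identical aggregated vector. */
    MPI_Allreduce(local, global, 4, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: global[0] = %f\n", rank, global[0]);

    MPI_Finalize();
    return 0;
}
```

Because floating-point addition is not associative, host-based algorithms that change the reduction order across runs (or across process counts) can yield bitwise-different sums; this is the repeatability challenge the abstract refers to.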




Published In

ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing
June 2023
505 pages
ISBN: 9798400700569
DOI: 10.1145/3577193

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. in-network computing
  2. allreduce
  3. router
  4. RDMA

Qualifiers

  • Research-article

Conference

ICS '23

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Article Metrics

  • Downloads (Last 12 months)269
  • Downloads (Last 6 weeks)41
Reflects downloads up to 13 Nov 2024


Cited By

  • (2024) Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning. In Proceedings of the 53rd International Conference on Parallel Processing, 148-157. DOI: 10.1145/3673038.3673140. Online publication date: 12-Aug-2024.
  • (2024) PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 245-260. DOI: 10.1109/ISCA59077.2024.00027. Online publication date: 29-Jun-2024.
  • (2024) A Lightweight RDMA Connection Protocol Based on Post-hoc Confirmation. Journal of Parallel and Distributed Computing, 104991. DOI: 10.1016/j.jpdc.2024.104991. Online publication date: Oct-2024.
  • (2023) PiN: Processing in Network-on-Chip. IEEE Design & Test 40(6), 30-38. DOI: 10.1109/MDAT.2023.3307943. Online publication date: Dec-2023.
  • (2023) DFAR: Dynamic-threshold Fault-tolerant Adaptive Routing for Fat Tree Networks. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 721-728. DOI: 10.1109/ICPADS60453.2023.00110. Online publication date: 17-Dec-2023.
