Research article | Open access
DOI: 10.1145/3672198.3673794

Network Load Balancing with Parallel Flowlets for AI Training Clusters

Published: 04 August 2024

Abstract

Unlike traditional data center traffic, AI training traffic consists primarily of a small number of large flows. This characteristic makes it difficult for existing routing strategies to balance routing granularity against reordering overhead. Serial flowlet schemes achieve a better trade-off in TCP scenarios than flow-level or packet-spraying load balancing, but they are ill-suited to AI training clusters built on high-performance RDMA networks.
To tackle this issue, we propose ParaLet, a parallel-flowlet strategy that resolves two problems of serial flowlets: insufficient routing entropy under AI training traffic and the difficulty of detecting inter-packet time gaps in RDMA networks. ParaLet requires only a small number of Queue Pairs, which are decoupled from the connections, thereby circumventing scalability limits. Theoretical analysis and simulations show that ParaLet not only achieves near-optimal throughput but also reduces flow completion time by 1.5-3.4× compared with existing methods.
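The contrast the abstract draws can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual algorithm: `serial_flowlet_path` mimics classic gap-based (serial) flowlet switching, where a flow only re-routes after an idle gap, while `parallet_path` shows the parallel-flowlet idea of striping one flow's chunks over a small fixed pool of queue pairs, each hashed to its own path. All function names, parameter names, and threshold values here are assumptions for illustration.

```python
import random

def serial_flowlet_path(flow_id, pkt_time, state, n_paths, gap=0.0005):
    """Serial flowlet switching: a flow re-hashes to a new path only when
    the inter-packet idle gap exceeds a threshold. With few large flows
    and RDMA's line-rate bursts, such gaps rarely appear, so routing
    entropy stays low. `state` maps flow_id -> (last_pkt_time, path)."""
    last = state.get(flow_id)
    if last is None or pkt_time - last[0] > gap:
        path = random.randrange(n_paths)  # new flowlet: pick a fresh path
    else:
        path = last[1]                    # same flowlet: keep the path
    state[flow_id] = (pkt_time, path)
    return path

def parallet_path(flow_id, chunk_id, n_qps, n_paths):
    """Parallel-flowlet sketch: chunks of one flow are round-robined over
    a small QP pool; each (flow, QP) pair hashes to its own path, so one
    large flow spreads over up to n_qps paths without waiting for gaps."""
    qp = chunk_id % n_qps                 # stripe chunks over the QP pool
    return hash((flow_id, qp)) % n_paths
```

Note that the serial scheme keeps a single path for the whole flow unless traffic pauses, whereas the parallel sketch gives one flow up to `n_qps` concurrent paths regardless of timing, which is the extra routing entropy the abstract refers to.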


    Published In

NAIC '24: Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing
August 2024, 89 pages
ISBN: 9798400707131
DOI: 10.1145/3672198

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. AI training networks
    2. Network load balancing
    3. Routing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference
August 4-8, 2024
Sydney, NSW, Australia

    Acceptance Rates

NAIC '24 paper acceptance rate: 13 of 22 submissions (59%)
