Network Load Balancing with Parallel Flowlets for AI Training Clusters
Recommendations
Network Load Balancing with In-network Reordering Support for RDMA
ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference
Remote Direct Memory Access (RDMA) is widely used in high-performance computing (HPC) and data center networks. In this paper, we first show that RDMA does not work well with existing load balancing algorithms because of its traffic flow characteristics ...
POSTER: Hybrid-Granularity Network Load balancing for Distributed AI Model Training
ACM SIGCOMM Posters and Demos '24: Proceedings of the ACM SIGCOMM 2024 Conference: Posters and Demos
With the increasing number of parameters in artificial intelligence (AI) models, distributed AI model training using numerous servers within data centers has become commonplace. However, traditional load balancing strategies in data center networks are ...
Routing with load balancing: increasing the guaranteed node traffics
In this paper we introduce the novel routing scheme based on load balancing and shortest-path routing. First, we present the linear program for routing optimization. The nonblocking network is considered, which only limits the traffic loads of the ...
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- Research-article
- Research
- Refereed limited