DOI: 10.1145/3603165.3607399
poster

Augmenting Distributed AI Training with Loss-tolerant Transmission

Published: 25 September 2023

Abstract

The parameter server (PS) communication architecture is used in distributed machine learning (DML) systems to speed up model training in data centers (DCs) and on edge nodes. However, it suffers from severe long-tail latency caused by many-to-one "incast" traffic patterns, as well as from non-congestion packet loss, both of which reduce training throughput. To address this challenge, we design the Loss-tolerant Transmission Protocol (LTP), which tolerates partial loss of gradients during synchronization, avoiding unnecessary retransmissions and speeding up synchronization in each iteration. Preliminary evaluation shows that LTP outperforms other schemes in both communication latency and training accuracy.
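
The abstract does not say how the server treats gradient entries that never arrive, so the following is only an illustrative sketch of loss-tolerant synchronization, not the LTP design itself. It assumes each worker's gradient is split into fixed-size packets, that lost packets are never retransmitted, and that the server averages each coordinate over the workers whose packet did arrive. The function names, the packet layout, and the averaging rule are all hypothetical.

```python
import random


def split_into_packets(gradient, packet_size):
    """Split a flat gradient (list of floats) into (offset, values) packets."""
    return [(i, gradient[i:i + packet_size])
            for i in range(0, len(gradient), packet_size)]


def lossy_channel(packets, loss_rate, rng):
    """Drop each packet independently with probability loss_rate.

    Nothing is retransmitted: a dropped packet is simply gone.
    """
    return [p for p in packets if rng.random() >= loss_rate]


def aggregate(received_per_worker, dim):
    """Average each coordinate over the workers whose packet arrived.

    Assumption: a coordinate that no worker delivered gets a zero update.
    """
    total = [0.0] * dim
    count = [0] * dim
    for packets in received_per_worker:
        for offset, values in packets:
            for j, v in enumerate(values):
                total[offset + j] += v
                count[offset + j] += 1
    return [t / c if c else 0.0 for t, c in zip(total, count)]


# Two hypothetical workers push 8-element gradients over a 25%-loss channel.
rng = random.Random(0)
worker_grads = [[1.0] * 8, [3.0] * 8]
received = [lossy_channel(split_into_packets(g, 2), 0.25, rng)
            for g in worker_grads]
update = aggregate(received, dim=8)
```

In this sketch the server can finish an iteration as soon as it stops waiting for packets, rather than stalling on the slowest retransmission. The cost is that some coordinates of `update` are averaged over fewer workers, or not updated at all.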

Published In

ACM TURC '23: Proceedings of the ACM Turing Award Celebration Conference - China 2023
July 2023
173 pages
ISBN:9798400702334
DOI:10.1145/3603165
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Poster
  • Research
  • Refereed limited

Conference

ACM TURC '23
