poster

Augmenting Distributed AI Training with Loss-tolerant Transmission

Authors: Zixuan Chen, Lei Shi, Yongbo Gao, Xuandong Liu, Xin Ai, Sen Liu, Yang Xu

ACM TURC '23: Proceedings of the ACM Turing Award Celebration Conference - China 2023

Pages 65 - 66

https://doi.org/10.1145/3603165.3607399

Published: 25 September 2023

Abstract

The parameter server (PS) communication architecture is used in distributed machine learning (DML) systems to speed up model training in data centers (DCs) and on edge nodes. However, it suffers from severe long-tail latency caused by many-to-one "incast" traffic patterns and from non-congestion packet loss, both of which reduce training throughput. To address this challenge, we design the Loss-tolerant Transmission Protocol (LTP), which permits partial loss of gradients during synchronization, avoiding unnecessary retransmissions and shortening the synchronization time of each iteration. A preliminary evaluation shows that LTP outperforms other schemes in both communication latency and training accuracy.
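The idea of tolerating partial gradient loss can be illustrated with a small simulation. The sketch below is not the LTP design, which the abstract does not specify. It assumes that each worker splits its gradient into fixed-size chunks, that lost chunks are never retransmitted, and that the server averages each chunk over the workers whose copy arrived, falling back to zeros when every copy is lost. The names `packetize`, `lossy_send`, and `aggregate` are hypothetical.

```python
import random
from typing import List, Optional

Chunk = Optional[List[float]]  # None marks a chunk that was lost in transit


def packetize(grad: List[float], chunk_size: int) -> List[List[float]]:
    """Split a flat gradient into fixed-size chunks (the last may be shorter)."""
    return [grad[i:i + chunk_size] for i in range(0, len(grad), chunk_size)]


def lossy_send(chunks: List[List[float]], loss_rate: float,
               rng: random.Random) -> List[Chunk]:
    """Simulate a lossy link: each chunk is dropped independently."""
    return [None if rng.random() < loss_rate else c for c in chunks]


def aggregate(received: List[List[Chunk]], grad_len: int,
              chunk_size: int) -> List[float]:
    """Average each chunk over the workers that delivered it.

    Assumption: a chunk lost by every worker contributes zeros, so the
    server never waits for a retransmission.
    """
    n_chunks = (grad_len + chunk_size - 1) // chunk_size
    out: List[float] = []
    for k in range(n_chunks):
        arrived = [w[k] for w in received if w[k] is not None]
        width = min(chunk_size, grad_len - k * chunk_size)
        if not arrived:
            out.extend([0.0] * width)
        else:
            out.extend(sum(vals) / len(arrived) for vals in zip(*arrived))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]  # two workers
    received = [lossy_send(packetize(g, 2), 0.3, rng) for g in grads]
    print(aggregate(received, grad_len=4, chunk_size=2))
```

Averaging only over delivered copies keeps the scale of the update stable when some workers' chunks are missing; other choices, such as always dividing by the total worker count, are equally plausible under the abstract's description.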

Published In

ACM TURC '23: Proceedings of the ACM Turing Award Celebration Conference - China 2023

July 2023

173 pages

ISBN: 9798400702334

DOI: 10.1145/3603165

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster
Research
Refereed limited

Funding Sources

Key-Area Research and Development Program of Guangdong Province
Natural Science Foundation of Shanghai
Major Key Project of PCL
National Natural Science Foundation of China
Open Research Projects of Zhejiang Lab

Conference

ACM TURC '23: ACM Turing Award Celebration Conference 2023

July 28 - 30, 2023

Wuhan, China
