DOI: 10.1145/3489517.3530417

Sign bit is enough: a learning synchronization framework for multi-hop all-reduce with ultimate compression

Published: 23 August 2022

Abstract

Traditional one-bit compressed stochastic gradient descent cannot be directly employed in multi-hop all-reduce, a widely adopted distributed training paradigm in network-intensive high-performance computing systems such as public clouds. Our theoretical findings show that, due to cascading compression, the training process suffers considerable deterioration in convergence performance. To overcome this limitation, we implement Marsit, a sign-bit compression-based learning synchronization framework. It prevents cascading compression via an elaborate bit-wise operation for unbiased sign aggregation, together with a global compensation mechanism that mitigates compression deviation. The proposed framework retains the same theoretical convergence rate as non-compression mechanisms. Experimental results demonstrate that Marsit reduces training time by up to 35% while preserving the same accuracy as training without compression.
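
The abstract points to two ingredients: a bit-wise operation that aggregates sign bits without re-compressing partial results at every hop, and a global compensation (error-feedback) term that absorbs the compression deviation. The following is a minimal sketch of these ideas, assuming a simple per-tensor scale and NumPy arrays; the names compress_sign and aggregate_signs are hypothetical illustrations and do not reflect the authors' actual Marsit implementation.

```python
import numpy as np

def compress_sign(grad, residual):
    """Compress a gradient to sign bits plus one scale, with error feedback
    (global compensation) so the quantization error carries to the next step.
    Illustrative sketch only, not the Marsit API."""
    corrected = grad + residual               # add accumulated compensation
    scale = np.mean(np.abs(corrected))        # one magnitude shared by all coordinates
    signs = np.sign(corrected)                # one bit per coordinate (+1 / -1)
    new_residual = corrected - scale * signs  # deviation fed back next iteration
    return signs, scale, new_residual

def aggregate_signs(signs_list, scales_list):
    """Average the decoded sign gradients once, instead of re-applying the sign
    operator to partial sums hop by hop (the source of cascading compression)."""
    decoded = [s * c for s, c in zip(signs_list, scales_list)]
    return np.mean(decoded, axis=0)

# Toy run with 4 simulated workers and an 8-dimensional gradient.
rng = np.random.default_rng(0)
dim, workers = 8, 4
grads = [rng.normal(size=dim) for _ in range(workers)]
residuals = [np.zeros(dim) for _ in range(workers)]

signs_list, scales_list = [], []
for i in range(workers):
    s, c, residuals[i] = compress_sign(grads[i], residuals[i])
    signs_list.append(s)
    scales_list.append(c)

global_update = aggregate_signs(signs_list, scales_list)  # shared by every worker
print(global_update)
```

Decoding each worker's contribution before averaging means the sign operator is applied only once per gradient; naively re-signing partial sums at every hop would compound the quantization error, which is the cascading-compression effect the abstract describes.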


Cited By

  • (2024) Distributed Analytics For Big Data. Neurocomputing, 574:C. DOI: 10.1016/j.neucom.2024.127258. Online publication date: 17-Apr-2024.
  • (2023) Birder. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 39529-39552. DOI: 10.5555/3666122.3667839. Online publication date: 10-Dec-2023.
  • (2023) Dynamic Pricing for Client Recruitment in Federated Learning. IEEE/ACM Transactions on Networking, 32(2), 1273-1286. DOI: 10.1109/TNET.2023.3312208. Online publication date: 11-Sep-2023.

Published In

DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference
July 2022
1462 pages
ISBN:9781450391429
DOI:10.1145/3489517
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 August 2022

Author Tags

  1. distributed machine learning
  2. multi-hop all-reduce
  3. signSGD

Qualifiers

  • Research-article

Conference

DAC '22: 59th ACM/IEEE Design Automation Conference
July 10-14, 2022
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

