DOI: 10.5555/3018874.3018875
Research article

Communication quantization for data-parallel training of deep neural networks

Published: 13 November 2016

Abstract

We study data-parallel training of deep neural networks on high-performance computing infrastructure. The key problem in scaling data-parallel training is avoiding a severe communication/computation imbalance. We explore quantizing gradient updates before communication to reduce bandwidth requirements, and compare this against a baseline implementation that uses the MPI allreduce routine. We port two existing quantization approaches, one-bit and threshold, and develop our own adaptive quantization algorithm. The performance of these algorithms is evaluated and compared with MPI_Allreduce when training models on the MNIST dataset and on a synthetic benchmark. On an HPC system, MPI_Allreduce outperforms the existing quantization approaches, while our adaptive quantization is comparable to or better than MPI_Allreduce for large layers without sacrificing accuracy. It is 1.76 times faster than the next best approach for the largest layers in our benchmark and achieves near-linear speedup in data-parallel training.
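
To make the approach concrete, the sketch below (Python with numpy and mpi4py) contrasts the two communication patterns the abstract compares: a full-precision MPI_Allreduce baseline and a threshold-style quantizer with error feedback. The function names, the tau value, and the dense encoding are illustrative assumptions, not the paper's implementation.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def allreduce_average(grad):
    # Baseline: sum full-precision gradients across all ranks, then average.
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)
    return avg / comm.Get_size()

def threshold_quantize(grad, residual, tau=0.01):
    # Threshold-style quantization with error feedback (illustrative):
    # entries whose magnitude (including the carried-over residual) reaches
    # tau are encoded as +/- tau; everything else is held back in the
    # residual so the quantization error is reapplied on the next step.
    v = grad + residual
    mask = np.abs(v) >= tau
    quantized = np.where(mask, np.sign(v) * tau, 0.0)
    new_residual = v - quantized
    return quantized, new_residual

In a real implementation, only the indices and signs of the selected entries would be exchanged (for example via a sparse allgather), which is where the bandwidth savings over the dense MPI_Allreduce come from; the dense arrays returned here are only for clarity.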



Published In

MLHPC '16: Proceedings of the Workshop on Machine Learning in High Performance Computing Environments, November 2016, 66 pages. ISBN: 9781509038824. Publisher: IEEE Press.

Conference

SC16

Overall acceptance rate: 5 of 7 submissions, 71%

Cited By

• (2023) Quantized distributed training of large models with convergence guarantees. Proceedings of the 40th International Conference on Machine Learning, pp. 24020-24044. DOI: 10.5555/3618408.3619409
• (2023) Efficient distributed inference of deep neural networks via restructuring and pruning. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, pp. 6640-6648. DOI: 10.1609/aaai.v37i6.25815
• (2022) CGX. Proceedings of the 23rd ACM/IFIP International Middleware Conference, pp. 241-254. DOI: 10.1145/3528535.3565248
• (2021) Rethinking gradient sparsification as total error minimization. Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 8133-8146. DOI: 10.5555/3540261.3540883
• (2021) Flare. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-16. DOI: 10.1145/3458817.3476178
• (2021) Differentially Quantized Gradient Descent. 2021 IEEE International Symposium on Information Theory (ISIT), pp. 1200-1205. DOI: 10.1109/ISIT45174.2021.9518254
• (2020) WeightGrad: Geo-Distributed Data Analysis Using Quantization for Faster Convergence and Better Accuracy. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 546-556. DOI: 10.1145/3394486.3403097
• (2019) Demystifying Parallel and Distributed Deep Learning. ACM Computing Surveys, 52(4), pp. 1-43. DOI: 10.1145/3320060
• (2019) SparCML. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1145/3295500.3356222
• (2018) AdaComp. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 2827-2835. DOI: 10.5555/3504035.3504380
