DOI: 10.5555/3018874.3018875
Research article

Communication quantization for data-parallel training of deep neural networks

Published: 13 November 2016

Abstract

We study data-parallel training of deep neural networks on high-performance computing infrastructure. The key problem in scaling data-parallel training is avoiding a severe communication/computation imbalance. We explore quantizing gradient updates before communication to reduce bandwidth requirements, and compare this against a baseline implementation that uses the MPI allreduce routine. We port two existing quantization approaches, one-bit and threshold, and develop our own adaptive quantization algorithm. The performance of these algorithms is evaluated and compared with MPI_Allreduce when training models on the MNIST dataset and on a synthetic benchmark. On an HPC system, MPI_Allreduce outperforms the existing quantization approaches, while our adaptive quantization is comparable to or better than MPI_Allreduce for large layers without sacrificing accuracy. It is 1.76 times faster than the next best approach for the largest layers in our benchmark and achieves near-linear speedup in data-parallel training.
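
To make the approach concrete, the sketch below (Python with numpy and mpi4py) contrasts the two communication patterns the abstract compares: a full-precision MPI_Allreduce baseline and a threshold-style quantizer with error feedback. The function names, the tau value, and the dense encoding are illustrative assumptions, not the paper's implementation.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def allreduce_average(grad):
    # Baseline: sum full-precision gradients across all ranks, then average.
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)
    return avg / comm.Get_size()

def threshold_quantize(grad, residual, tau=0.01):
    # Threshold-style quantization with error feedback (illustrative):
    # entries whose magnitude (including the carried-over residual) reaches
    # tau are encoded as +/- tau; everything else is held back in the
    # residual so the quantization error is reapplied on the next step.
    v = grad + residual
    mask = np.abs(v) >= tau
    quantized = np.where(mask, np.sign(v) * tau, 0.0)
    new_residual = v - quantized
    return quantized, new_residual

In a real implementation, only the indices and signs of the selected entries would be exchanged (for example via a sparse allgather), which is where the bandwidth savings over the dense MPI_Allreduce come from; the dense arrays returned here are only for clarity.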



Published In

MLHPC '16: Proceedings of the Workshop on Machine Learning in High Performance Computing Environments, November 2016, 66 pages. ISBN: 9781509038824. Publisher: IEEE Press.

Conference

SC16

Overall acceptance rate: 5 of 7 submissions, 71%

Cited By

• (2023) Quantized distributed training of large models with convergence guarantees. Proceedings of the 40th International Conference on Machine Learning, pp. 24020-24044. DOI: 10.5555/3618408.3619409
• (2023) Efficient distributed inference of deep neural networks via restructuring and pruning. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, pp. 6640-6648. DOI: 10.1609/aaai.v37i6.25815
• (2022) CGX. Proceedings of the 23rd ACM/IFIP International Middleware Conference, pp. 241-254. DOI: 10.1145/3528535.3565248
• (2021) Rethinking gradient sparsification as total error minimization. Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 8133-8146. DOI: 10.5555/3540261.3540883
• (2021) Flare. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-16. DOI: 10.1145/3458817.3476178
• (2021) Differentially Quantized Gradient Descent. 2021 IEEE International Symposium on Information Theory (ISIT), pp. 1200-1205. DOI: 10.1109/ISIT45174.2021.9518254
• (2020) WeightGrad: Geo-Distributed Data Analysis Using Quantization for Faster Convergence and Better Accuracy. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 546-556. DOI: 10.1145/3394486.3403097
• (2019) Demystifying Parallel and Distributed Deep Learning. ACM Computing Surveys, 52(4), pp. 1-43. DOI: 10.1145/3320060
• (2019) SparCML. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1145/3295500.3356222
• (2018) AdaComp. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 2827-2835. DOI: 10.5555/3504035.3504380
