DOI: 10.5555/2969442.2969545

Asynchronous parallel stochastic gradient for nonconvex optimization

Published: 07 December 2015

Abstract

Asynchronous parallel implementations of stochastic gradient (SG) have been widely used for training deep neural networks and have achieved many practical successes recently. However, existing theories cannot explain their convergence and speedup properties, mainly because of the nonconvexity of most deep learning formulations and the asynchronous parallel mechanism. To fill this theoretical gap, this paper studies two asynchronous parallel implementations of SG: one over a computer network and the other on a shared-memory system. We establish an ergodic convergence rate of O(1/√K) for both algorithms and prove that a linear speedup is achievable if the number of workers is bounded by √K, where K is the total number of iterations. Our results generalize and improve the existing analysis for convex minimization.
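For nonconvex problems, an ergodic rate of the order quoted above is typically measured on the averaged squared gradient norm; omitting constants and the precise conditions (which are given in the paper), a guarantee of this form reads

\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\,\|\nabla f(x_k)\|^{2} \;\le\; O\!\left(\frac{1}{\sqrt{K}}\right).

The shared-memory variant can be pictured with a short lock-free sketch in the spirit of the abstract (and of Hogwild!-style methods): several threads repeatedly read the shared parameters, compute a stochastic gradient from one sample, and write the update back without a global lock. Everything below (the toy least-squares objective standing in for a nonconvex loss, the step size, the worker count, and the synthetic data) is an illustrative assumption, not the authors' code:

import threading
import numpy as np

# Synthetic data and settings; all values here are illustrative assumptions.
A = np.random.default_rng(0).standard_normal((1000, 20))
b = np.random.default_rng(1).standard_normal(1000)
x = np.zeros(20)          # shared parameter vector, read and written by every worker
K = 8000                  # total number of stochastic-gradient updates (the K above)
step = 0.005              # constant step size
num_workers = 4

def worker(seed, num_updates):
    rng = np.random.default_rng(seed)            # per-thread sampler
    for _ in range(num_updates):
        i = int(rng.integers(len(b)))            # draw one sample index
        x_read = x.copy()                        # possibly stale snapshot of shared state
        grad = (A[i] @ x_read - b[i]) * A[i]     # stochastic gradient of 0.5*(a_i.x - b_i)^2
        x[:] -= step * grad                      # in-place update, no global lock

threads = [threading.Thread(target=worker, args=(s, K // num_workers))
           for s in range(num_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("mean squared residual:", np.mean((A @ x - b) ** 2))

In the computer-network variant described in the abstract, the same read-compute-update loop would run on workers that pull parameters from and push gradients to a central server, so the asynchrony shows up as delayed gradients rather than inconsistent in-memory reads.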




Published In

NIPS'15: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2
December 2015, 3626 pages

Publisher: MIT Press, Cambridge, MA, United States

Cited By

  • (2022) Sharper convergence guarantees for asynchronous SGD for distributed and federated learning. Proceedings of the 36th International Conference on Neural Information Processing Systems, 17202-17215. DOI: 10.5555/3600270.3601521. Online publication date: 28-Nov-2022.
  • (2022) HET. Proceedings of the VLDB Endowment, 15(2), 312-320. DOI: 10.14778/3489496.3489511. Online publication date: 4-Feb-2022.
  • (2022) SYNTHESIS. Proceedings of the Twenty-Third International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, 151-160. DOI: 10.1145/3492866.3549722. Online publication date: 3-Oct-2022.
  • (2021) Exploiting Simultaneous Communications to Accelerate Data Parallel Distributed Deep Learning. IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 1-10. DOI: 10.1109/INFOCOM42981.2021.9488803. Online publication date: 10-May-2021.
  • (2020) Linearly converging error compensated SGD. Proceedings of the 34th International Conference on Neural Information Processing Systems, 20889-20900. DOI: 10.5555/3495724.3497478. Online publication date: 6-Dec-2020.
  • (2020) Oracle AutoML. Proceedings of the VLDB Endowment, 13(12), 3166-3180. DOI: 10.14778/3415478.3415542. Online publication date: 14-Sep-2020.
  • (2020) RAT - Resilient Allreduce Tree for Distributed Machine Learning. Proceedings of the 4th Asia-Pacific Workshop on Networking, 52-57. DOI: 10.1145/3411029.3411037. Online publication date: 3-Aug-2020.
  • (2020) Zeroth-order Feedback Optimization for Cooperative Multi-Agent Systems. 2020 59th IEEE Conference on Decision and Control (CDC), 3649-3656. DOI: 10.1109/CDC42340.2020.9304302. Online publication date: 14-Dec-2020.
  • (2020) Taming Convergence for Asynchronous Stochastic Gradient Descent with Unbounded Delay in Non-Convex Learning. 2020 59th IEEE Conference on Decision and Control (CDC), 3580-3585. DOI: 10.1109/CDC42340.2020.9303830. Online publication date: 14-Dec-2020.
  • (2019) Local SGD with periodic averaging. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 11082-11094. DOI: 10.5555/3454287.3455281. Online publication date: 8-Dec-2019.
