DOI: 10.5555/2969442.2969545

Asynchronous parallel stochastic gradient for nonconvex optimization

Published: 07 December 2015

Abstract

Asynchronous parallel implementations of stochastic gradient (SG) have been widely used for training deep neural networks and have achieved many practical successes recently. However, existing theories cannot explain their convergence and speedup properties, mainly because of the nonconvexity of most deep learning formulations and the asynchronous parallel mechanism. To fill this theoretical gap, this paper studies two asynchronous parallel implementations of SG: one over a computer network and the other on a shared-memory system. We establish an ergodic convergence rate of O(1/√K) for both algorithms and prove that a linear speedup is achievable if the number of workers is bounded by √K, where K is the total number of iterations. Our results generalize and improve the existing analysis for convex minimization.
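For nonconvex problems, an ergodic rate of the order quoted above is typically measured on the averaged squared gradient norm; omitting constants and the precise conditions (which are given in the paper), a guarantee of this form reads

\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\,\|\nabla f(x_k)\|^{2} \;\le\; O\!\left(\frac{1}{\sqrt{K}}\right).

The shared-memory variant can be pictured with a short lock-free sketch in the spirit of the abstract (and of Hogwild!-style methods): several threads repeatedly read the shared parameters, compute a stochastic gradient from one sample, and write the update back without a global lock. Everything below (the toy least-squares objective standing in for a nonconvex loss, the step size, the worker count, and the synthetic data) is an illustrative assumption, not the authors' code:

import threading
import numpy as np

# Synthetic data and settings; all values here are illustrative assumptions.
A = np.random.default_rng(0).standard_normal((1000, 20))
b = np.random.default_rng(1).standard_normal(1000)
x = np.zeros(20)          # shared parameter vector, read and written by every worker
K = 8000                  # total number of stochastic-gradient updates (the K above)
step = 0.005              # constant step size
num_workers = 4

def worker(seed, num_updates):
    rng = np.random.default_rng(seed)            # per-thread sampler
    for _ in range(num_updates):
        i = int(rng.integers(len(b)))            # draw one sample index
        x_read = x.copy()                        # possibly stale snapshot of shared state
        grad = (A[i] @ x_read - b[i]) * A[i]     # stochastic gradient of 0.5*(a_i.x - b_i)^2
        x[:] -= step * grad                      # in-place update, no global lock

threads = [threading.Thread(target=worker, args=(s, K // num_workers))
           for s in range(num_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("mean squared residual:", np.mean((A @ x - b) ** 2))

In the computer-network variant described in the abstract, the same read-compute-update loop would run on workers that pull parameters from and push gradients to a central server, so the asynchrony shows up as delayed gradients rather than inconsistent in-memory reads.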




Published In

NIPS'15: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2
December 2015, 3626 pages

Publisher: MIT Press, Cambridge, MA, United States

Cited By

  • (2022) Sharper convergence guarantees for asynchronous SGD for distributed and federated learning. Proceedings of the 36th International Conference on Neural Information Processing Systems, 17202-17215. DOI: 10.5555/3600270.3601521. Online publication date: 28-Nov-2022.
  • (2022) HET. Proceedings of the VLDB Endowment, 15(2), 312-320. DOI: 10.14778/3489496.3489511. Online publication date: 4-Feb-2022.
  • (2022) SYNTHESIS. Proceedings of the Twenty-Third International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, 151-160. DOI: 10.1145/3492866.3549722. Online publication date: 3-Oct-2022.
  • (2021) Exploiting Simultaneous Communications to Accelerate Data Parallel Distributed Deep Learning. IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 1-10. DOI: 10.1109/INFOCOM42981.2021.9488803. Online publication date: 10-May-2021.
  • (2020) Linearly converging error compensated SGD. Proceedings of the 34th International Conference on Neural Information Processing Systems, 20889-20900. DOI: 10.5555/3495724.3497478. Online publication date: 6-Dec-2020.
  • (2020) Oracle AutoML. Proceedings of the VLDB Endowment, 13(12), 3166-3180. DOI: 10.14778/3415478.3415542. Online publication date: 14-Sep-2020.
  • (2020) RAT - Resilient Allreduce Tree for Distributed Machine Learning. Proceedings of the 4th Asia-Pacific Workshop on Networking, 52-57. DOI: 10.1145/3411029.3411037. Online publication date: 3-Aug-2020.
  • (2020) Zeroth-order Feedback Optimization for Cooperative Multi-Agent Systems. 2020 59th IEEE Conference on Decision and Control (CDC), 3649-3656. DOI: 10.1109/CDC42340.2020.9304302. Online publication date: 14-Dec-2020.
  • (2020) Taming Convergence for Asynchronous Stochastic Gradient Descent with Unbounded Delay in Non-Convex Learning. 2020 59th IEEE Conference on Decision and Control (CDC), 3580-3585. DOI: 10.1109/CDC42340.2020.9303830. Online publication date: 14-Dec-2020.
  • (2019) Local SGD with periodic averaging. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 11082-11094. DOI: 10.5555/3454287.3455281. Online publication date: 8-Dec-2019.
