HOGWILD!: a lock-free approach to parallelizing stochastic gradient descent

Published: 12 December 2011

Abstract

Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. Using novel theoretical analysis, algorithms, and implementation, this work shows that SGD can be implemented without any locking. We present an update scheme called HOGWILD! that allows processors to access shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates modify only small parts of the decision variable, HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.
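The update scheme described in the abstract can be sketched in a few lines: several threads draw examples and apply sparse gradient updates to a shared weight vector with no locking of any kind. The sketch below is illustrative only, not the authors' implementation; it uses a least-squares objective, Python threads, and a hypothetical `hogwild_sgd` helper, and exploits the fact that each sparse example touches only a few coordinates, so unsynchronized writes rarely collide.

```python
import threading
import numpy as np

def hogwild_sgd(X, y, n_threads=4, epochs=10, lr=0.05):
    """Lock-free parallel SGD sketch for least squares, 0.5*(x.w - y)^2.

    Every thread reads and writes the shared vector ``w`` directly,
    with no locks: updates may occasionally overwrite one another,
    which is exactly the behavior HOGWILD! tolerates.
    """
    n, d = X.shape
    w = np.zeros(d)  # shared state, updated without synchronization

    def worker(rows):
        for _ in range(epochs):
            for i in rows:
                xi = X[i]
                grad = (xi @ w - y[i]) * xi   # per-example gradient
                nz = np.nonzero(xi)[0]        # only the coordinates xi touches
                w[nz] -= lr * grad[nz]        # unsynchronized sparse write

    # Partition the examples across threads and run them concurrently.
    chunks = np.array_split(np.arange(n), n_threads)
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

A real implementation would run on cores without a global interpreter lock and sample examples with replacement; the point of the sketch is only the structure of the update: sparse gradient, nonzero coordinates only, no mutex around the shared vector.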




Published In

NIPS'11: Proceedings of the 24th International Conference on Neural Information Processing Systems, December 2011, 2752 pages.

Publisher: Curran Associates Inc., Red Hook, NY, United States
