DOI: 10.1145/3055399.3055448

Katyusha: the first direct acceleration of stochastic gradient methods

Published: 19 June 2017

Abstract

Nesterov's momentum trick is famously known for accelerating gradient descent, and has been proven useful in building fast iterative algorithms. However, in the stochastic setting, counterexamples exist and prevent Nesterov's momentum from providing similar acceleration, even if the underlying problem is convex.
We introduce Katyusha, a direct, primal-only stochastic gradient method that fixes this issue: it has a provably accelerated convergence rate for convex (off-line) stochastic optimization. The main ingredient is Katyusha momentum, a novel "negative momentum" on top of Nesterov's momentum that can be incorporated into a variance-reduction-based algorithm to speed it up. Since variance reduction has been successfully applied to a growing list of practical problems, our paper suggests that in each such case one could potentially give Katyusha a hug.
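To make the idea concrete, here is a minimal NumPy sketch of Katyusha in its simplest setting: a smooth, sigma-strongly convex finite sum F(x) = (1/n) * sum_i f_i(x) with no proximal term. It combines an SVRG-style variance-reduced gradient with the extra "Katyusha momentum" term tau_2 * x_tilde, which keeps every iterate anchored to the latest snapshot x_tilde. The parameter choices (tau_2 = 1/2, tau_1 = min(sqrt(n*sigma/(3L)), 1/2), step size 1/(3*tau_1*L), epoch length 2n) follow the paper's smooth case, but the snapshot update below is a plain average rather than the paper's weighted average, the proximal/composite case is omitted, and all function and variable names are illustrative rather than taken from any released code.

```python
import numpy as np

def katyusha(grads, x0, L, sigma, n_epochs, seed=0):
    """Illustrative sketch of Katyusha for a smooth, sigma-strongly convex
    finite sum F(x) = (1/n) * sum_i f_i(x).

    grads : list of per-component gradient oracles, grads[i](x) -> np.ndarray
    L     : smoothness constant, sigma : strong-convexity constant
    """
    rng = np.random.default_rng(seed)
    n = len(grads)
    m = 2 * n                                       # epoch length, as in the paper
    tau1 = min(np.sqrt(n * sigma / (3.0 * L)), 0.5)
    tau2 = 0.5                                      # Katyusha ("negative") momentum weight
    alpha = 1.0 / (3.0 * tau1 * L)                  # step size for the z-sequence

    x_tilde = x0.copy()                             # snapshot point
    y = x0.copy()
    z = x0.copy()
    for _ in range(n_epochs):
        mu = sum(g(x_tilde) for g in grads) / n     # full gradient at the snapshot
        y_sum = np.zeros_like(x0)
        for _ in range(m):
            # Katyusha momentum: pull the query point back toward the snapshot x_tilde
            x = tau1 * z + tau2 * x_tilde + (1.0 - tau1 - tau2) * y
            i = rng.integers(n)
            g = mu + grads[i](x) - grads[i](x_tilde)  # variance-reduced stochastic gradient
            z = z - alpha * g                         # "mirror" (large-step) update
            y = x - g / (3.0 * L)                     # "gradient" (small-step) update
            y_sum += y
        x_tilde = y_sum / m                         # plain average (the paper uses a weighted one)
    return x_tilde
```

As a usage sketch, for l2-regularized least squares one could pass grads[i] = lambda x, a=a_i, b=b_i: a * (a @ x - b) + lam * x, with L = max_i ||a_i||^2 + lam and sigma = lam; these constants and names are again only illustrative.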

Supplementary Material

MP4 File (d4_sb_t7.mp4)



Information

Published In

STOC 2017: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing
June 2017
1268 pages
ISBN: 978-1-4503-4528-6
DOI: 10.1145/3055399
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 June 2017


Author Tags

  1. accelerated gradient descent
  2. acceleration
  3. first-order method
  4. momentum
  5. stochastic gradient descent

Qualifiers

  • Research-article


Conference

STOC '17: Symposium on Theory of Computing
June 19-23, 2017
Montreal, Canada

Acceptance Rates

Overall Acceptance Rate 1,469 of 4,586 submissions, 32%



Article Metrics

  • Downloads (last 12 months): 163
  • Downloads (last 6 weeks): 26
Reflects downloads up to 14 Jan 2025


Cited By

  • (2024) Decentralized convex finite-sum optimization with better dependence on condition numbers. Proceedings of the 41st International Conference on Machine Learning, 30807-30841. DOI: 10.5555/3692070.3693312. Online publication date: 21-Jul-2024.
  • (2024) Accelerated Stochastic ExtraGradient: Mixing Hessian and gradient similarity to reduce communication in distributed and federated learning. Uspekhi Matematicheskikh Nauk, 79:6(480), 5-38. DOI: 10.4213/rm10206. Online publication date: 30-Nov-2024.
  • (2024) Incremental quasi-Newton methods with faster superlinear convergence rates. Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, 14097-14105. DOI: 10.1609/aaai.v38i13.29319. Online publication date: 20-Feb-2024.
  • (2024) Optimal Analysis of Method with Batching for Monotone Stochastic Finite-Sum Variational Inequalities. Doklady Mathematics, 108:S2, S348-S359. DOI: 10.1134/S1064562423701582. Online publication date: 25-Mar-2024.
  • (2024) mPage: Probabilistic Gradient Estimator With Momentum for Non-Convex Optimization. IEEE Transactions on Signal Processing, 72, 1375-1386. DOI: 10.1109/TSP.2024.3374106. Online publication date: 2024.
  • (2024) The Powerball Method With Biased Stochastic Gradient Estimation for Large-Scale Learning Systems. IEEE Transactions on Computational Social Systems, 11:6, 7435-7447. DOI: 10.1109/TCSS.2024.3411630. Online publication date: Dec-2024.
  • (2024) FedCore: Straggler-Free Federated Learning with Distributed Coresets. ICC 2024 - IEEE International Conference on Communications, 280-286. DOI: 10.1109/ICC51166.2024.10622224. Online publication date: 9-Jun-2024.
  • (2024) Gradient complexity and non-stationary views of differentially private empirical risk minimization. Theoretical Computer Science, 982, 114259. DOI: 10.1016/j.tcs.2023.114259. Online publication date: Jan-2024.
  • (2024) Method with batching for stochastic finite-sum variational inequalities in non-Euclidean setting. Chaos, Solitons & Fractals, 187, 115396. DOI: 10.1016/j.chaos.2024.115396. Online publication date: Oct-2024.
  • (2024) An accelerated stochastic ADMM for nonconvex and nonsmooth finite-sum optimization. Automatica, 163, 111554. DOI: 10.1016/j.automatica.2024.111554. Online publication date: May-2024.
