DOI: 10.1145/3055399.3055448

Katyusha: the first direct acceleration of stochastic gradient methods

Published: 19 June 2017

Abstract

Nesterov's momentum trick is famously known for accelerating gradient descent, and has been proven useful in building fast iterative algorithms. However, in the stochastic setting, counterexamples exist and prevent Nesterov's momentum from providing similar acceleration, even if the underlying problem is convex.
We introduce Katyusha, a direct, primal-only stochastic gradient method that fixes this issue: it has a provably accelerated convergence rate for convex (off-line) stochastic optimization. The main ingredient is Katyusha momentum, a novel "negative momentum" on top of Nesterov's momentum that can be incorporated into a variance-reduction-based algorithm to speed it up. Since variance reduction has been successfully applied to a growing list of practical problems, our paper suggests that in each such case one could potentially give Katyusha a hug.
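To make the idea concrete, here is a minimal NumPy sketch of Katyusha in its simplest setting: a smooth, sigma-strongly convex finite sum F(x) = (1/n) * sum_i f_i(x) with no proximal term. It combines an SVRG-style variance-reduced gradient with the extra "Katyusha momentum" term tau_2 * x_tilde, which keeps every iterate anchored to the latest snapshot x_tilde. The parameter choices (tau_2 = 1/2, tau_1 = min(sqrt(n*sigma/(3L)), 1/2), step size 1/(3*tau_1*L), epoch length 2n) follow the paper's smooth case, but the snapshot update below is a plain average rather than the paper's weighted average, the proximal/composite case is omitted, and all function and variable names are illustrative rather than taken from any released code.

```python
import numpy as np

def katyusha(grads, x0, L, sigma, n_epochs, seed=0):
    """Illustrative sketch of Katyusha for a smooth, sigma-strongly convex
    finite sum F(x) = (1/n) * sum_i f_i(x).

    grads : list of per-component gradient oracles, grads[i](x) -> np.ndarray
    L     : smoothness constant, sigma : strong-convexity constant
    """
    rng = np.random.default_rng(seed)
    n = len(grads)
    m = 2 * n                                       # epoch length, as in the paper
    tau1 = min(np.sqrt(n * sigma / (3.0 * L)), 0.5)
    tau2 = 0.5                                      # Katyusha ("negative") momentum weight
    alpha = 1.0 / (3.0 * tau1 * L)                  # step size for the z-sequence

    x_tilde = x0.copy()                             # snapshot point
    y = x0.copy()
    z = x0.copy()
    for _ in range(n_epochs):
        mu = sum(g(x_tilde) for g in grads) / n     # full gradient at the snapshot
        y_sum = np.zeros_like(x0)
        for _ in range(m):
            # Katyusha momentum: pull the query point back toward the snapshot x_tilde
            x = tau1 * z + tau2 * x_tilde + (1.0 - tau1 - tau2) * y
            i = rng.integers(n)
            g = mu + grads[i](x) - grads[i](x_tilde)  # variance-reduced stochastic gradient
            z = z - alpha * g                         # "mirror" (large-step) update
            y = x - g / (3.0 * L)                     # "gradient" (small-step) update
            y_sum += y
        x_tilde = y_sum / m                         # plain average (the paper uses a weighted one)
    return x_tilde
```

As a usage sketch, for l2-regularized least squares one could pass grads[i] = lambda x, a=a_i, b=b_i: a * (a @ x - b) + lam * x, with L = max_i ||a_i||^2 + lam and sigma = lam; these constants and names are again only illustrative.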

Supplementary Material

MP4 File (d4_sb_t7.mp4)



Information

Published In

STOC 2017: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing
June 2017
1268 pages
ISBN: 978-1-4503-4528-6
DOI: 10.1145/3055399
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 June 2017


Author Tags

  1. accelerated gradient descent
  2. acceleration
  3. first-order method
  4. momentum
  5. stochastic gradient descent

Qualifiers

  • Research-article


Conference

STOC '17: Symposium on Theory of Computing
June 19-23, 2017
Montreal, Canada

Acceptance Rates

Overall Acceptance Rate 1,469 of 4,586 submissions, 32%



Article Metrics

  • Downloads (last 12 months): 163
  • Downloads (last 6 weeks): 26
Reflects downloads up to 14 Jan 2025


Cited By

  • (2024) Decentralized convex finite-sum optimization with better dependence on condition numbers. Proceedings of the 41st International Conference on Machine Learning, 30807-30841. DOI: 10.5555/3692070.3693312. Online publication date: 21-Jul-2024.
  • (2024) Accelerated Stochastic ExtraGradient: Mixing Hessian and gradient similarity to reduce communication in distributed and federated learning. Uspekhi Matematicheskikh Nauk, 79:6(480), 5-38. DOI: 10.4213/rm10206. Online publication date: 30-Nov-2024.
  • (2024) Incremental quasi-Newton methods with faster superlinear convergence rates. Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, 14097-14105. DOI: 10.1609/aaai.v38i13.29319. Online publication date: 20-Feb-2024.
  • (2024) Optimal Analysis of Method with Batching for Monotone Stochastic Finite-Sum Variational Inequalities. Doklady Mathematics, 108:S2, S348-S359. DOI: 10.1134/S1064562423701582. Online publication date: 25-Mar-2024.
  • (2024) mPage: Probabilistic Gradient Estimator With Momentum for Non-Convex Optimization. IEEE Transactions on Signal Processing, 72, 1375-1386. DOI: 10.1109/TSP.2024.3374106. Online publication date: 2024.
  • (2024) The Powerball Method With Biased Stochastic Gradient Estimation for Large-Scale Learning Systems. IEEE Transactions on Computational Social Systems, 11:6, 7435-7447. DOI: 10.1109/TCSS.2024.3411630. Online publication date: Dec-2024.
  • (2024) FedCore: Straggler-Free Federated Learning with Distributed Coresets. ICC 2024 - IEEE International Conference on Communications, 280-286. DOI: 10.1109/ICC51166.2024.10622224. Online publication date: 9-Jun-2024.
  • (2024) Gradient complexity and non-stationary views of differentially private empirical risk minimization. Theoretical Computer Science, 982, 114259. DOI: 10.1016/j.tcs.2023.114259. Online publication date: Jan-2024.
  • (2024) Method with batching for stochastic finite-sum variational inequalities in non-Euclidean setting. Chaos, Solitons & Fractals, 187, 115396. DOI: 10.1016/j.chaos.2024.115396. Online publication date: Oct-2024.
  • (2024) An accelerated stochastic ADMM for nonconvex and nonsmooth finite-sum optimization. Automatica, 163, 111554. DOI: 10.1016/j.automatica.2024.111554. Online publication date: May-2024.
