Abstract
We consider the problem of minimizing the sum of the average of a large number of smooth convex component functions and a general, possibly non-differentiable, convex function. Although many methods have been proposed to solve this problem under the assumption that the sum is strongly convex, few methods support the non-strongly convex case. Adding a small quadratic regularization is a common device used to tackle non-strongly convex problems; however, it may cause loss of sparsity of solutions or weaken the performance of the algorithms. Avoiding this device, we propose an accelerated randomized mirror descent method for solving this problem without the strong convexity assumption. Our method extends the deterministic accelerated proximal gradient methods of Paul Tseng and can be applied even when proximal points are computed inexactly. We also propose a scheme for solving the problem when the component functions are non-smooth.
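For illustration only, the following minimal Python sketch (not the authors' implementation; all names and data are hypothetical) sets up an instance of the problem class studied in the paper: minimizing the average of a large number of smooth convex losses plus a non-differentiable convex regularizer, here an \(\ell _1\)-regularized logistic regression, which is convex but not strongly convex.

import numpy as np

# Hypothetical instance of the composite problem  min_x (1/n) * sum_i f_i(x) + P(x):
# smooth convex components f_i (logistic losses) and a non-smooth convex P (l1 norm).
# This only illustrates the problem class; it is not the algorithm proposed in the paper.
rng = np.random.default_rng(0)
n, d, lam = 1000, 50, 1e-3
A = rng.standard_normal((n, d))        # feature matrix (synthetic data)
b = rng.choice([-1.0, 1.0], size=n)    # binary labels

def grad_f_i(x, i):
    # gradient of the i-th smooth component f_i(x) = log(1 + exp(-b_i <a_i, x>))
    return (-b[i] / (1.0 + np.exp(b[i] * (A[i] @ x)))) * A[i]

def P(x):
    # non-differentiable convex regularizer; the objective is not strongly convex
    return lam * np.abs(x).sum()

def objective(x):
    F = np.mean(np.log1p(np.exp(-b * (A @ x))))   # average of the n smooth components
    return F + P(x)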
References
Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27(2), 543–547 (1983)
Nesterov, Y.: On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonom. i Mat. Metody 24, 509–517 (1988)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Becker, S., Bobin, J., Candès, E.J.: NESTA: a fast and accurate first-order method for sparse recovery. SIAM J. Imaging Sci. 4(1), 1–39 (2011)
d’Aspremont, A., Banerjee, O., Ghaoui, L.E.: First-order methods for sparse covariance selection. SIAM J. Matrix Anal. Appl. 30(1), 56–66 (2008)
Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)
Tseng, P.: On Accelerated Proximal Gradient Methods for Convex–Concave Optimization. Technical report (2008)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Roux, N.L., Schmidt, M., Bach, F.R.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in Neural Information Processing Systems, pp. 2663–2671 (2012)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24, 2057–2075 (2014)
Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: International Conference on Machine Learning, pp. 2613–2621 (2017)
Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)
Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Advances in Neural Information Processing Systems, pp. 3384–3392 (2015)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)
Cai, J.F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)
Fadili, J.M., Peyré, G.: Total variation projection with first order schemes. IEEE Trans. Image Process. 20(3), 657–669 (2011)
Ma, S., Goldfarb, D., Chen, L.: Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program. 128(1), 321–353 (2011)
Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)
Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1), 37–75 (2014)
Schmidt, M., Roux, N.L., Bach, F.R.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems, pp. 1458–1466 (2011)
Solodov, M., Svaiter, B.: Error bounds for proximal point subproblems and associated inexact proximal point algorithms. Math. Program. 88(2), 371–389 (2000)
Villa, S., Salzo, S., Baldassarre, L., Verri, A.: Accelerated and inexact forward–backward algorithms. SIAM J. Optim. 23(3), 1607–1633 (2013)
Allen-Zhu, Z.K.: Katyusha: the first direct acceleration of stochastic gradient methods. In: ACM SIGACT Symposium on Theory of Computing (2017)
Bregman, L.: The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7(3), 200–217 (1967)
Teboulle, M.: Convergence of proximal-like algorithms. SIAM J. Optim. 7(4), 1069–1083 (1997)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Dordrecht (2004)
Auslender, A.: Numerical Methods for Nondifferentiable Convex Optimization, pp. 102–126. Springer, Berlin (1987)
Lee, Y.J., Mangasarian, O.: SSVM: a smooth support vector machine for classification. Comput. Optim. Appl. 20(1), 5–22 (2001)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)
Fan, R.E., Lin, C.J.: LIBSVM Data: Classification, Regression and Multi-Label. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets (2011). Accessed 01 April 2018
Jacob, L., Obozinski, G., Vert, J.P.: Group Lasso with overlap and graph Lasso. In: International Conference on Machine Learning, pp. 433–440 (2009)
Mosci, S., Villa, S., Verri, A., Rosasco, L.: A primal–dual algorithm for group sparse regularization with overlapping groups. In: Advances in Neural Information Processing Systems, pp. 2604–2612 (2010)
Acknowledgements
We are grateful to the anonymous reviewers and the Editor-in-Chief for their meticulous comments and insightful suggestions. Le Thi Khanh Hien would like to give special thanks to Prof. W. B. Haskell for his support. Le Thi Khanh Hien was supported by Grant A*STAR 1421200078.
Additional information
Communicated by Gabriel Peyré.
Appendix: Proofs of Lemmas, Propositions, and Theorems
Proof of Lemma 3.1
We have
\(\square \)
Proof of Lemma 3.2
For notational succinctness, we omit the subscript s when no confusion is caused. Applying Lemma 2.4(1), we have:
where the last inequality uses \(\left\langle {a}, {b} \right\rangle \le \frac{1}{2}\Vert a\Vert _*^2 + \frac{1}{2}\Vert b\Vert ^2\). Together with the update rule (4), Lemma 2.1 with \(\sigma =1\), and noting that \(\hat{x}_k - y_k=\alpha _2(z_k-z_{k-1})\), we get:
\(\square \)
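For completeness, the elementary inequality invoked in the proof above follows from the definition of the dual norm together with the arithmetic–geometric mean inequality: \(\left\langle {a}, {b} \right\rangle \le \Vert a\Vert _*\,\Vert b\Vert \le \frac{1}{2}\Vert a\Vert _*^2+\frac{1}{2}\Vert b\Vert ^2\).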
Proof of Lemma 3.3
We have \(\bar{z}_{k,s}=\arg \min \nolimits _{x\in X_s} \{\phi _k(x)+D(x,z_{k-1,s})\}\), where we let \(\phi _k(x)=\frac{1}{\theta _s }(\left\langle {v_k}, {x} \right\rangle + P(x))\). From Lemma 2.2, for all \(x\in X_s \cap \mathrm {dom}P \), we have:
Together with \(z_{k,s}\approx _{\varepsilon _{k,s}}\arg \min _{x\in X_s}\theta _s( \phi _k(x) + D(x,z_{k-1,s}))\), we get:
From Lemma 2.3, we get
Thus, the result follows. \(\square \)
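To illustrate the subproblem that defines \(\bar{z}_{k,s}\), consider (for illustration only) the Euclidean setting in which \(D(x,z)=\frac{1}{2}\Vert x-z\Vert ^2\) and \(X_s\) is the whole space. The mirror step then reduces to a standard proximal-gradient update,
\( \bar{z}_{k,s}=\arg \min _{x}\Big \{ \tfrac{1}{\theta _s}\big (\left\langle {v_k}, {x} \right\rangle +P(x)\big )+\tfrac{1}{2}\Vert x-z_{k-1,s}\Vert ^2\Big \} =\mathrm {prox}_{P/\theta _s}\big (z_{k-1,s}-\tfrac{1}{\theta _s}v_k\big ), \)
so that, for instance, when \(P=\lambda \Vert \cdot \Vert _1\) the update is computed coordinate-wise by soft-thresholding.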
Proof of Proposition 3.1
For notational succinctness, we omit the subscript s when no confusion is caused. Applying Lemma 3.2, we have:
From Inequality (12) and Lemma 3.3, we deduce that:
Note that \(\mathbb {E}_{i_k} [v_k]=\nabla F(y_k)\) (we omit the subscript \(i_k\) of the conditional expectation when it is clear from the context) and \(P(\hat{x}_k) \le \alpha _1 P(x_{k-1})+\alpha _2 P(z_k) + \alpha _3 P(\tilde{x}_{s-1})\). Taking expectation with respect to \(i_k\) conditioned on \(i_{k-1}\), it follows from (13) that:
On the other hand, applying Lemma 3.1, the second inequality of Lemma 2.4, and noting that \(\frac{1}{L_Q n q_i}\le \frac{1}{L_i}\) and \(\frac{1}{L_Q}\le \frac{1}{L_A}\), we have:
Therefore, (14) and (15) imply that:
Here, in (a) we use
in (b) we use \(\left\langle {\nabla F(y_k)}, {x_{k-1}-y_k} \right\rangle \le F(x_{k-1})-F(y_k)\). Finally, we take expectation with respect to \(i_{k-1}\) to get the result. \(\square \)
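The argument above uses only the unbiasedness \(\mathbb {E}_{i_k}[v_k]=\nabla F(y_k)\) and the non-uniform sampling probabilities \(q_i\) (e.g., \(q_i\) proportional to \(L_i\)). As a plausible instantiation under these assumptions (given for illustration only; not necessarily the exact estimator of the paper), the following Python sketch computes an SVRG-type variance-reduced gradient with importance sampling, which is unbiased for any probability vector q.

import numpy as np

def vr_gradient(grad_f_i, y, x_snapshot, full_grad_snapshot, q, rng):
    # SVRG-style estimator with importance sampling (hypothetical helper names):
    #   grad_f_i(x, i)      gradient of the i-th smooth component f_i
    #   q                   sampling probabilities, e.g. q_i proportional to L_i
    #   full_grad_snapshot  (1/n) * sum_i grad_f_i(x_snapshot, i)
    # Unbiasedness: E[v] = grad F(y), since the 1/(n q_i) factor cancels q_i in expectation.
    n = len(q)
    i = rng.choice(n, p=q)
    return (grad_f_i(y, i) - grad_f_i(x_snapshot, i)) / (n * q[i]) + full_grad_snapshot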
Proof of Proposition 3.2
Applying Proposition 3.1 with \(x=x^*\), we have:
Denote \(d_{k,s}=\mathbb {E}(F^P(x_{k,s}) -F^P(x^*))\); then
which implies \( \frac{1}{\alpha _{2,s}^2}d_{k,s} \le \frac{\alpha _{1,s}}{\alpha _{2,s}^2} d_{k-1,s} + \frac{\alpha _{3}}{\alpha _{2,s}^2}\tilde{d}_{s-1} + \overline{L} (\mathbb {E}D(x^*,z_{k-1,s}) - \mathbb {E}D(x^*,z_{k,s}))+\frac{ r^*_{k,s}}{\alpha _{2,s}^2}. \) Summing up this inequality from \(k=1\) to \(k=m\), we get:
Using the update rule (5), \(\alpha _{1,s}+\alpha _{3}=1-\alpha _{2,s}\), \(z_{m,s-1}=z_{0,s}\), and \(d_{m,s-1}=d_{0,s}\), we get:
Combining with the update rule (2), we obtain:
Therefore,
where in (a) we use the update rule (5), in (b) we use the property \(\alpha _3 \le 1-\alpha _{2,s+1}\), and in (c) we use the recursive inequality (16). The result then follows. \(\square \)
Proof of Theorem 3.1
Without loss of generality, we can assume that:
When \(\varepsilon _{k,s}=0\), we have \(z_{k,s}=\bar{z}_{k,s}\) and \(r_{k,s}=0\). The convergence rate of exact ASMD follows from Proposition 3.2 by taking \(\alpha _{2,s}=\frac{2}{s+2}\) and noting that \(D(x^*,z_{m,s})\ge 0\). \(\square \)
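For the reader's convenience, the choice \(\alpha _{2,s}=\frac{2}{s+2}\) satisfies the inequality customarily needed for this telescoping argument (stated here under the assumption that the recursive inequality (16) takes the usual form): \(\frac{1-\alpha _{2,s+1}}{\alpha _{2,s+1}^2}=\frac{(s+1)(s+3)}{4}\le \frac{(s+2)^2}{4}=\frac{1}{\alpha _{2,s}^2}\), since \((s+1)(s+3)=(s+2)^2-1\); this is what yields the \(O(1/s^2)\) rate.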
Proof of Theorem 3.2
Recall that Inequality (11) holds for all \(x\). Taking \(x=z_{k,s}\), (11) yields \(D(z_{k,s},\bar{z}_{k,s})\le \frac{\varepsilon _{k,s}}{\theta _s}\). On the other hand, if \( h(\cdot )\) is \(L_h\)-Lipschitz smooth, then:
If \(\Vert z_{k,s}\Vert \le C\), then we let \(C_1=\Vert x^*\Vert +C\). Noting that \(D(z_{k,s},\bar{z}_{k,s})\ge 0\), we have
Hence,
If the adaptive inexact rule \(\max \left\{ \Vert \bar{z}_{k,s}\Vert ^2\varepsilon _{k,s},C\varepsilon _{k,s}\right\} \le C\epsilon _s\) is chosen, we have
In this case, we let \(C_1=\Vert x^*\Vert +\sqrt{C}\). We then have
The result then follows from (17), (18), and Proposition 3.2 easily. \(\square \)
Proof of Theorem 3.3
Let \(x^*_\mu \) be the optimal solution of Problem (9). We have:
where \(\bar{C}=O\left( {\sqrt{\bar{L}_\mu }} \right) \), by applying Theorem 3.2. By Assumption 3.1, we have:
Together with (19) and noting that \(F^P_\mu (x^*)\ge F^P_\mu (x^*_\mu )\), we get:
\(\square \)
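As a concrete, standalone illustration of the smoothing idea underlying Theorem 3.3 (not the specific construction used in the paper): the absolute value \(h(t)=|t|\) admits the Huber-type smooth approximation \(h_\mu (t)=\max _{|u|\le 1}\{ut-\tfrac{\mu }{2}u^2\}\), which equals \(\tfrac{t^2}{2\mu }\) for \(|t|\le \mu \) and \(|t|-\tfrac{\mu }{2}\) otherwise. It is \(\tfrac{1}{\mu }\)-Lipschitz smooth and satisfies \(h_\mu (t)\le h(t)\le h_\mu (t)+\tfrac{\mu }{2}\); running the algorithm on such a smoothed surrogate and transferring the bound back, as in (19), gives a guarantee of the stated type.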
About this article
Cite this article
Hien, L.T.K., Nguyen, C.V., Xu, H. et al. Accelerated Randomized Mirror Descent Algorithms for Composite Non-strongly Convex Optimization. J Optim Theory Appl 181, 541–566 (2019). https://doi.org/10.1007/s10957-018-01469-5