Abstract
We introduce SPIRAL, a SuPerlinearly convergent Incremental pRoximal ALgorithm, for solving nonconvex regularized finite sum problems under a relative smoothness assumption. Each iteration of SPIRAL consists of an inner and an outer loop. It combines incremental gradient updates with a linesearch that has the remarkable property of never being triggered asymptotically, leading to superlinear convergence under mild assumptions at the limit point. Simulation results with L-BFGS directions on convex, nonconvex, and non-Lipschitz differentiable problems show that our algorithm and its adaptive variant are competitive with the state of the art.
Data availability
The datasets used in the numerical experiments are available in https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ and https://hastie.su.domains/ElemStatLearn/data.html.
References
Ahookhosh, M., Themelis, A., Patrinos, P.: A Bregman forward-backward linesearch algorithm for nonconvex composite optimization: superlinear convergence to nonisolated local minima. SIAM J. Optim. 31(1), 653–685 (2021). https://doi.org/10.1137/19M1264783
Aragón Artacho, F.J., Belyakov, A., Dontchev, A.L., López, M.: Local convergence of quasi-Newton methods under metric regularity. Comput. Optim. Appl. 58(1), 225–247 (2014)
Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017)
Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)
Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. In: Neural Networks: Tricks of the Trade, 2nd edn., pp. 437–478. Springer (2012)
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific (2016)
Bertsekas, D.P., Tsitsiklis, J.N.: Gradient convergence in gradient methods with errors. SIAM J. Optim. 10(3), 627–642 (2000)
Blatt, D., Hero, A.O., Gauchman, H.: A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18(1), 29–51 (2007)
Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)
Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556–572 (2007)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)
Bolte, J., Sabach, S., Teboulle, M., Vaisbourd, Y.: First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM J. Optim. 28(3), 2131–2151 (2018)
Cai, X., Lin, C.Y., Diakonikolas, J.: Empirical risk minimization with shuffled SGD: a primal-dual perspective and improved bounds. arXiv preprint arXiv:2306.12498 (2023)
Cai, X., Song, C., Wright, S., Diakonikolas, J.: Cyclic block coordinate descent with variance reduction for composite nonconvex optimization. In: International Conference on Machine Learning, pp. 3469–3494. PMLR (2023)
Candes, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inf. Theory 61(4), 1985–2007 (2015)
Cha, J., Lee, J., Yun, C.: Tighter lower bounds for shuffling SGD: Random permutations and beyond. In: International Conference on Machine Learning, pp. 3855–3912. PMLR (2023)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2, 1–27 (2011)
Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993)
Davis, D., Drusvyatskiy, D., MacPhee, K.J.: Stochastic model-based minimization under high-order growth. arXiv preprint arXiv:1807.00255 (2018)
De Marchi, A., Themelis, A.: Proximal gradient algorithms under local Lipschitz gradient continuity: a convergence and robustness analysis of PANOC. J. Optim. Theory Appl. 194(3), 771–794 (2022)
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)
Defazio, A., Domke, J.: Finito: A faster, permutable incremental gradient method for big data problems. In: International Conference on Machine Learning, pp. 1125–1133 (2014)
Dennis, J.E., Moré, J.J.: A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28(126), 549–560 (1974)
Dennis, J.E., Jr., Moré, J.J.: Quasi-Newton methods, motivation and theory. SIAM Rev. 19(1), 46–89 (1977)
Dragomir, R.A., Even, M., Hendrikx, H.: Fast stochastic Bregman gradient methods: sharp analysis and variance reduction. In: International Conference on Machine Learning, pp. 2815–2825. PMLR (2021)
Duchi, J.C., Ruan, F.: Solving (most) of a set of quadratic equalities: composite optimization for robust phase retrieval. Inf. Inference J. IMA 8(3), 471–529 (2019)
Facchinei, F., Pang, J.S.: Finite-Dimensional Variational Inequalities and Complementarity Problems, vol. II. Springer (2003)
Fang, C., Li, C.J., Lin, Z., Zhang, T.: Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Adv. Neural Inf. Process. Syst. 31 (2018)
Ge, R., Li, Z., Wang, W., Wang, X.: Stabilized SVRG: simple variance reduction for nonconvex optimization. In: Conference on Learning Theory, pp. 1394–1448. PMLR (2019)
Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)
Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2016)
Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. 186, 49–84 (2021)
Hanzely, F., Richtárik, P.: Fastest rates for stochastic mirror descent methods. Comput. Optim. Appl. 1–50 (2021)
Haochen, J., Sra, S.: Random shuffling beats SGD after finite epochs. In: International Conference on Machine Learning, pp. 2624–2633. PMLR (2019)
Hastie, T., Friedman, J., Tibshirani, R.: The Elements of Statistical Learning. Springer, New York (2001)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 26, 315–323 (2013)
Kan, C., Song, W.: The Moreau envelope function and proximal mapping in the sense of the Bregman distance. Nonlinear Anal. Theory Methods Appl. 75(3), 1385–1399 (2012). https://doi.org/10.1016/j.na.2011.07.031
Kurdyka, K.: On gradients of functions definable in \(o\)-minimal structures. Annales de l’institut Fourier 48(3), 769–783 (1998)
Latafat, P., Themelis, A., Ahookhosh, M., Patrinos, P.: Bregman Finito/MISO for nonconvex regularized finite sum minimization without Lipschitz gradient continuity. SIAM J. Optim. 32(3), 2230–2262 (2022)
Latafat, P., Themelis, A., Patrinos, P.: Block-coordinate and incremental aggregated proximal gradient methods for nonsmooth nonconvex problems. Math. Program. 1–30 (2021)
Li, Z., Richtárik, P.: ZeroSARAH: efficient nonconvex finite-sum optimization with zero full gradient computation. arXiv preprint arXiv:2103.01447 (2021)
Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28(1), 333–354 (2018)
Mairal, J.: Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. 25(2), 829–855 (2015)
Mishchenko, K., Khaled, A., Richtárik, P.: Random reshuffling: simple analysis with vast improvements. Adv. Neural Inf. Process. Syst. 33, 17309–17320 (2020)
Mokhtari, A., Eisen, M., Ribeiro, A.: IQN: an incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 28(2), 1670–1698 (2018)
Mokhtari, A., Gürbüzbalaban, M., Ribeiro, A.: Surpassing gradient descent provably: a cyclic incremental method with linear convergence rate. SIAM J. Optim. 28(2), 1420–1447 (2018)
Moritz, P., Nishihara, R., Jordan, M.: A linearly-convergent stochastic L-BFGS algorithm. In: Artificial Intelligence and Statistics, pp. 249–258. PMLR (2016)
Nedic, A., Lee, S.: On stochastic subgradient mirror-descent algorithm with weighted averaging. SIAM J. Optim. 24(1), 84–107 (2014)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Nesterov, Y.: Introductory lectures on convex optimization: a basic course, vol. 137. Springer Science & Business Media (2018)
Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: International Conference on Machine Learning, pp. 2613–2621. PMLR (2017)
Pas, P., Schuurmans, M., Patrinos, P.: Alpaqa: a matrix-free solver for nonlinear MPC and large-scale nonconvex optimization. In: 2022 European Control Conference (ECC), pp. 417–422. IEEE (2022)
Pham, N.H., Nguyen, L.M., Phan, D.T., Tran-Dinh, Q.: ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res. 21, 110–1 (2020)
Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5(2), 201–226 (2013)
Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.J.: Stochastic variance reduction for nonconvex optimization. In: International Conference on Machine Learning, pp. 314–323 (2016)
Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Advances in Neural Information Processing Systems, pp. 1145–1153 (2016)
Rockafellar, R.T.: Convex Analysis. Princeton University Press (1970)
Rockafellar, R.T., Wets, R.J.B.: Variational analysis, vol. 317. Springer Science & Business Media (2009)
Rodomanov, A., Kropotov, D.: A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In: International Conference on Machine Learning, pp. 2597–2605. PMLR (2016)
Sadeghi, H., Giselsson, P.: Hybrid acceleration scheme for variance reduced stochastic optimization algorithms. arXiv preprint arXiv:2111.06791 (2021)
Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1), 83–112 (2017)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14(Feb), 567–599 (2013)
Solodov, M.V., Svaiter, B.F.: An inexact hybrid generalized proximal point algorithm and some new results on the theory of Bregman functions. Math. Oper. Res. 25(2), 214–230 (2000)
Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Found. Comput. Math. 18(5), 1131–1198 (2018)
Teboulle, M.: A simplified view of first order methods for optimization. Math. Program. 170(1), 67–96 (2018)
Themelis, A., Ahookhosh, M., Patrinos, P.: On the acceleration of forward-backward splitting via an inexact Newton method. In: Bauschke, H.H., Burachik, R.S., Luke, D.R. (eds.) Splitting Algorithms, Modern Operator Theory, and Applications, pp. 363–412. Springer International Publishing, Cham (2019)
Themelis, A., Patrinos, P.: SuperMann: a superlinearly convergent algorithm for finding fixed points of nonexpansive operators. IEEE Trans. Autom. Control 64(12), 4875–4890 (2019)
Themelis, A., Stella, L., Patrinos, P.: Forward-backward envelope for the sum of two nonconvex functions: further properties and nonmonotone linesearch algorithms. SIAM J. Optim. 28(3), 2274–2303 (2018)
Vanli, N.D., Gurbuzbalaban, M., Ozdaglar, A.: Global convergence rate of proximal incremental aggregated gradient methods. SIAM J. Optim. 28(2), 1282–1300 (2018)
Wang, Z., Ji, K., Zhou, Y., Liang, Y., Tarokh, V.: Spiderboost and momentum: faster variance reduction algorithms. Adv. Neural Inf. Process. Syst. 32 (2019)
Yang, M., Milzarek, A., Wen, Z., Zhang, T.: A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization. Math. Program. 1–47 (2021)
Yu, P., Li, G., Pong, T.K.: Kurdyka-Łojasiewicz exponent via inf-projection. Found. Comput. Math. 1–47 (2021)
Zhang, H., Dai, Y.H., Guo, L., Peng, W.: Proximal-like incremental aggregated gradient method with linear convergence under Bregman distance growth conditions. Math. Oper. Res. 46(1), 61–81 (2021)
Zhang, J., Liu, H., So, A.M.C., Ling, Q.: Variance-reduced stochastic quasi-Newton methods for decentralized learning: Part I (2022)
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
P. Behmandpoor and M. Moonen acknowledge the research work carried out at the ESAT Laboratory of KU Leuven, in the frame of Research Project FWO nr. G0C0623N ’User-centric distributed signal processing algorithms for next generation cell-free massive MIMO based wireless communication networks’ and Fonds de la Recherche Scientifique—FNRS and Fonds voor Wetenschappelijk Onderzoek— Vlaanderen EOS Project no 30452698 ’(MUSE-WINET) MUlti-SErvice WIreless NETworks’. The scientific responsibility is assumed by its authors. The work of P. Latafat was supported by the Research Foundation Flanders (FWO) grants 1196820N and 12Y7622N. The work of P. Patrinos was supported by the Research Foundation Flanders (FWO) research projects G0A0920N, G086518N, G086318N, and G081222N; Research Council KU Leuven C1 project No. C14/18/068; Fonds de la Recherche Scientifique—FNRS and the Fonds Wetenschappelijk Onderzoek—Vlaanderen under EOS project 30468160 (SeLMA); European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 953348. The work of A. Themelis was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI grant JP21K17710.
Appendices
A Preliminaries
Fact A.1
(basic properties [18, 50]) The following hold for a dgf \(H:\mathbb {R}^n\rightarrow \mathbb {R}\) and \(x,y,z\in \mathbb {R}^n\):

(i) (three-point identity) \( {{\,\textrm{D}\,}}_H(x,z)={{\,\textrm{D}\,}}_H(x,y)+{{\,\textrm{D}\,}}_H(y,z)+\langle {x-y}, {\nabla H(y)-\nabla H(z)}\rangle \) [18, Lem. 3.1].

For any convex set \(\mathcal {U}\subseteq \mathbb {R}^n\) and \(u,v\in \mathcal {U}\) the following hold [50, Thm. 2.1.5, 2.1.10]:

(ii) If \(H\) is \(\mu _{H,\mathcal {U}}\)-strongly convex on \(\mathcal {U}\), then \( \frac{\mu _{H,\mathcal {U}}}{2}\Vert v-u\Vert ^2 \le {{\,\textrm{D}\,}}_{H}(v,u) \le \frac{1}{2\mu _{H,\mathcal {U}}}\Vert \nabla H(v)-\nabla H(u)\Vert ^2 \).

(iii) If \(\nabla H\) is \(\ell _{H,\mathcal {U}}\)-Lipschitz on \(\mathcal {U}\), then \( \frac{1}{2\ell _{H,\mathcal {U}}}\Vert \nabla H(v)-\nabla H(u)\Vert ^2 \le {{\,\textrm{D}\,}}_{H}(v,u) \le \frac{\ell _{H,\mathcal {U}}}{2}\Vert v-u\Vert ^2 \).
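As a quick numerical sanity check (our own illustration, not from the paper), the sketch below verifies the three-point identity (i) and the two-sided bounds (ii)–(iii) for the negative-entropy dgf \(H(x)=\sum _i x_i\log x_i\) on a box where its Hessian is bounded:

```python
import numpy as np

def breg(H, gradH, x, y):
    """Bregman distance D_H(x, y) = H(x) - H(y) - <grad H(y), x - y>."""
    return H(x) - H(y) - gradH(y) @ (x - y)

# Negative-entropy dgf; on the box [0.1, 2]^n its Hessian diag(1/x) is bounded
# between mu = 1/2 and ell = 10, so Fact A.1(ii)-(iii) apply with these constants.
H = lambda x: np.sum(x * np.log(x))
gradH = lambda x: np.log(x) + 1.0

rng = np.random.default_rng(0)
x, y, z = rng.uniform(0.1, 2.0, size=(3, 5))

# (i) three-point identity
lhs = breg(H, gradH, x, z)
rhs = breg(H, gradH, x, y) + breg(H, gradH, y, z) + (x - y) @ (gradH(y) - gradH(z))
assert abs(lhs - rhs) < 1e-12

# (ii)-(iii) lower and upper bounds with mu/2 = 1/4 and ell/2 = 5
d = breg(H, gradH, x, y)
assert 0.25 * np.sum((x - y) ** 2) - 1e-12 <= d <= 5.0 * np.sum((x - y) ** 2) + 1e-12
```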
In the following, some properties of the Bregman Moreau envelope are highlighted. The interested reader is referred to [1] and [37] for proofs and further properties.
Fact A.2
(basic properties of \(\phi ^{H}\) and \({{\,\textrm{prox}\,}}_\phi ^{H}\) [1, 37]) Let \(H:\mathbb {R}^n\rightarrow \mathbb {R}\) denote a dgf (cf. Definition 2.1), and let \(\phi :\mathbb {R}^n\rightarrow \overline{\mathbb {R}}\) be a proper, lsc, and lower bounded function. Then, the following hold:

(i) \({{\,\textrm{prox}\,}}_\phi ^{H}\) is locally bounded, compact-valued, and outer semicontinuous;

(ii) \(\phi ^{H}\) is finite-valued and continuous; it is locally Lipschitz if so is \(\nabla H\);

(iii) \(\phi ^{H}(z) = \phi (v) + {{\,\textrm{D}\,}}_{H}(v, z) \le \phi (y) + {{\,\textrm{D}\,}}_{H}(y, z)\) for any \(y, z\in \mathbb {R}^n\) and \(v \in {{\,\textrm{prox}\,}}_\phi ^{H}(z)\); hence, \(\phi ^{H}(z) \le \phi (z)\);

(iv) \(\inf \phi = \inf \phi ^{H}\) and \({{\,\textrm{argmin}\,}}\phi ^{H}= {{\,\textrm{argmin}\,}}\phi \);

(v) \(\phi ^{H}\) is level bounded iff so is \(\phi \).
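For intuition, here is a small illustrative sketch (our own example, not from the paper) that evaluates \(\phi ^{H}\) and \({{\,\textrm{prox}\,}}_\phi ^{H}\) by brute force for \(\phi =|\cdot |\) with the Euclidean dgf, and checks items (iii) and (iv):

```python
import numpy as np

# 1-D illustration: phi(y) = |y| with the Euclidean dgf H(y) = y^2/2, so that
# D_H(y, z) = (y - z)^2/2 and phi^H is the classical Moreau envelope of |.|.
phi = lambda y: np.abs(y)
D_H = lambda y, z: 0.5 * (y - z) ** 2

grid = np.linspace(-3.0, 3.0, 60001)   # brute-force minimization over a fine grid

def env_and_prox(z):
    vals = phi(grid) + D_H(grid, z)
    j = np.argmin(vals)
    return vals[j], grid[j]            # approximate phi^H(z) and prox_phi^H(z)

env, v = env_and_prox(1.7)
assert env <= phi(1.7)                 # Fact A.2(iii): phi^H(z) <= phi(z)
assert abs(v - 0.7) < 1e-3             # Euclidean prox of |.| soft-thresholds: 1.7 - 1
assert abs(env - 1.2) < 1e-6           # closed form: 0.7 + 0.5*(0.7 - 1.7)^2 = 1.2
assert abs(env_and_prox(0.0)[0]) < 1e-9  # Fact A.2(iv): inf phi^H = inf phi = 0
```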
The following fact gives sufficient conditions for Lipschitz continuity of the Bregman proximal mapping and for continuity of the Moreau envelope, both of which are crucial to the theory developed in Theorems 4.7 and 4.12.
Fact A.3
([39, Lem. A.2]) Let \(\mathcal {V}_i\subseteq \mathbb {R}^n\) be nonempty and convex for \(i\in [N]\), and let \(\mathcal {V}\mathrel {:=}\mathcal {V}_1\times \cdots \times \mathcal {V}_N\). In addition to Assumption 1, suppose that \(g\) is convex and that each \(h_i\), \(i\in [N]\), is \(\ell _{h_i}\)-smooth and \(\mu _{h_i}\)-strongly convex on \(\mathcal {V}_i\). Then, the following hold for the function \(\hat{H}\) as in (4.2) with \(\gamma _i\in (0,\nicefrac {N}{L_{f_i}})\), \(i\in [N]\):

(i) \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) is \({\bar{L}}\)-Lipschitz continuous on \(\mathcal {V}\) for some constant \({\bar{L}}\ge 0\).

If in addition \(f_i\) and \(h_i\) are twice continuously differentiable on \(\mathcal {V}_i\), \(i\in [N]\), then

(ii) \(\varPhi ^{\hat{H}}\) is continuously differentiable on \(\mathcal {V}\) with \(\nabla \varPhi ^{\hat{H}}=\nabla ^2\hat{H}\circ ({{\,\textrm{id}\,}}-{{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}})\).
The following fact establishes the equivalence between problems (1.1) and (4.1).
Fact A.4
([39, Lem. A.1]) Let the functions \(\varphi \) and \(\varPhi \) be as in (1.1) and (4.1), respectively. Then,

(i) \(\partial \varPhi (\varvec{x}) = \{\varvec{v} = (v_1,\dots ,v_N) \mid \sum _iv_i \in \partial \varphi (x)\}\) if \(\varvec{x}=(x,\dots ,x) \in \varDelta \), and \(\partial \varPhi (\varvec{x})\) is empty otherwise;

(ii) \(\varPhi \) has the KL property at \(\varvec{x} = (x,\dots ,x)\) iff so does \(\varphi \) at \(x\); in this case, the desingularizing functions are the same up to a positive scaling.
B Omitted lemmas
Lemma B.1
Suppose that Assumptions 1 and 2 hold and that \(\varphi \) is level bounded. Consider the sequence generated by Algorithm 1. Then, for every \(\ell \in [N]\) there exists \(c_\ell >0\) such that
Proof
By level boundedness of \(\varphi \) and Theorem 4.7, \((\varvec{u}^k)_{k\in \mathbb {N}}\), \((\varvec{z}^k)_{k\in \mathbb {N}}\), and \((\tilde{z}^{{k}}_{{i}})_{k\in \mathbb {N}}\), \(i\in [N]\), are contained in a nonempty bounded set \(\varvec{\mathcal {U}}\). By Assumption 2.A2, \(h_i\) is locally strongly convex and locally Lipschitz, which along with Assumption 2.A1 and Fact A.3 implies that \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) is \({\bar{L}}\)-Lipschitz on a convex subset of \(\varvec{\mathcal {U}}\) for some \({\bar{L}} > 0\). Without loss of generality and for simplicity, we assume the cyclic sweeping rule in the incremental loop, i.e., \(i^\ell =\ell \); the proof carries over easily to shuffled sweeping without replacement. Arguing by induction, for \(\ell =1\) inequality (B.1) holds trivially. Suppose that the claim holds for some \(\ell \ge 1\). Then, by the triangle inequality and the definition of \(\tilde{\varvec{z}}^{k}_{\ell }\) in step 2.7 of Algorithm 2
establishing (B.1). \(\square \)
Lemma B.2
In addition to the assumptions of Lemma B.1, suppose that the directions \(d^k\) in step 1.4 satisfy \(\Vert d^k\Vert \le D\Vert z^k-v^k\Vert \) for some \(D\ge 0\). Then, \(\Vert z^{k+1}- z^k\Vert \le C\Vert \varvec{z}^k - \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}\Vert \) holds for some constant \(C>0\).
Proof
By the same reasoning as in Lemma B.1, \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) is \({\bar{L}}\)-Lipschitz continuous on a bounded convex set containing the iterates \((\varvec{u}^k)_{k\in \mathbb {N}}\), \((\varvec{z}^k)_{k\in \mathbb {N}}\), \((\tilde{z}^{{k}}_{{i}})_{k\in \mathbb {N}}, i\in [N]\). It follows from the assumption on \(\Vert d^k\Vert \) and step 2.4.a of Algorithm 2 that
where \(\eta _1 = {\bar{L}}(1-\tau _k + \tau _k D)\), and Lipschitz continuity of the proximal mapping was used in the last inequality. Further applying the triangle inequality yields
Using this along with the triangle inequality yields
where \(\eta _2 = \sum _{\ell =1}^N \tfrac{c_\ell }{\sqrt{N}}\big ((\bar{L}+1)\eta _1 + {\bar{L}}\big )\). This inequality combined with (B.3) yields
The claimed inequality follows from Lipschitz continuity of \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) and the inclusion \(\varvec{z}^{k} \in {{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\) in step 2.1 of Algorithm 2. \(\square \)
C Omitted proofs
1.1 Proof of Theorem 4.16
By level boundedness of \(\varphi \) and Theorem 4.7, \((\varvec{u}^k)_{k\in \mathbb {N}}\), \((\varvec{z}^k)_{k\in \mathbb {N}}\), and \((\tilde{z}^{{k}}_{{i}})_{k\in \mathbb {N}}\) are contained in a nonempty convex bounded set \(\varvec{\mathcal {U}}\), on which, owing to Assumption 2.A2, \(h_i\) and consequently \(\hat{H}\) are strongly convex. It then follows from Fact A.1(ii), Theorem 4.7(ii), and Lemma B.2 that \( \Vert z^{k+1} - z^k\Vert \rightarrow 0 \). Therefore, the set of limit points of \((z^k)_{k\in \mathbb {N}}\) is nonempty, compact, and connected [11, Rem. 5]. By Theorems 4.7(iv) and 4.7(v) the limit points are stationary for \(\varphi \), and \(\varPhi ^{\hat{H}}(\varvec{z}^k) = \mathcal {L}(v^k, z^k) \rightarrow \varphi _\star \). In the trivial case \(\varPhi ^{\hat{H}}(\varvec{z}^k) = \mathcal {L}(v^k, z^k) = \varphi _\star \) for some \(k\), the claims follow from Theorem 4.7, so assume that \(\varPhi ^{\hat{H}}(\varvec{z}^k)> \varphi _\star \) for all \(k\in \mathbb {N}\). The KL property for \(\varPhi \) is implied by that of \(\varphi \) due to Fact A.4, with desingularizing function \(\psi (s)=\rho s^{1-\theta }\) and exponent \(\theta \in (0,1)\). Let \(\varOmega \) denote the set of limit points of \((\varvec{z}^k=(z^k,\ldots , z^k))_{k\in \mathbb {N}}\). Since \(\hat{H}\) is strongly convex, [72, Lem. 5.1] can be invoked to infer that the function \(\mathcal {M}_{\hat{H}}(\varvec{w}, \varvec{x}) = \varPhi (\varvec{w}) + {{\,\textrm{D}\,}}_{\hat{H}}(\varvec{w}, \varvec{x})\) also has the KL property, with exponent \(\nu = \max \{\theta ,\frac{1}{2}\}\), at every point \((\varvec{z}^\star , \varvec{z}^\star )\) in the compact set \(\varOmega \times \varOmega \). Moreover, by (4.2), \(\mathcal {M}_{\hat{H}}(\varvec{z}^\star ,\varvec{z}^\star ) = \varPhi (\varvec{z}^\star )= \varphi _\star \), where Theorem 4.7(iv) was used in the last equality.
Recall that \(\varvec{z}^{k} \in {{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\) as in step 2.1 of Algorithm 2. Therefore, \(\nabla \hat{H}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}}) - \nabla \hat{H}(\varvec{z}^k) \in \partial \varPhi (\varvec{z}^k)\), resulting in
where \(c = \sup _k\Vert \nabla ^2 \hat{H}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\Vert > 0\) is finite due to \(\tilde{z}^{{k}}_{{N}}\) being bounded (cf. Theorem 4.7(vi)) and continuity of \(\nabla ^2 \hat{H}\). Considering (4.18) with (C.1), since \(\mathcal {M}_{\hat{H}}(\varvec{z}^k, \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}) = \varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}}) \rightarrow \varphi _\star \) from above, and since \((\varvec{z}^k, \bar{{{\varvec{z}}}}^{{k-1}}_{{N}})_{k\in \mathbb {N}}\) is bounded and accumulates on \(\varOmega \times \varOmega \), up to discarding finitely many iterates the following holds
where \(\psi (s)=\rho s^{1-\nu }\) is a desingularizing function for \(\mathcal {M}_{\hat{H}}\) on \(\varOmega \times \varOmega \). Let us define
Then, \( \varDelta _k^{\frac{\nu }{1-\nu }} \le c \rho ^{\frac{1}{1-\nu }}(1-\nu )\Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert . \) Concavity of \(\psi \) also implies
On the other hand by (4.11) and (4.10)
where Fact A.1(ii) was used and \(\mu _{\hat{H}}\) denotes its strong convexity modulus. Combining (C.4) and (C.5),
with some constant \(\eta > 0\), where the last inequality follows from Lemma B.2. Hence, \(\sum _k \Vert \varvec{z}^{k+1} - \varvec{z}^k\Vert \) is finite, i.e., \((\varvec{z}^k)_{k\in \mathbb {N}}\) has finite length and is thus convergent. It then follows from Theorem 4.7(v) that \((\varvec{z}^k)_{k\in \mathbb {N}}\) converges to a stationary point of \(\varphi \). Combining (C.3) and (C.6) we have
with some appropriate \(\alpha > 0\). Hence, if \(\nu =\frac{1}{2}\), i.e., if \(\theta \in (0,\frac{1}{2}]\) for \(\varPhi \), then (C.7) yields \(\varDelta _{k+1} \le (1 - \alpha ) \varDelta _k\). Since \(\alpha > 0\) and \(\frac{\varDelta _{k+1}}{\varDelta _k} > 0\), we have \(1-\alpha \in (0,1)\), and thus \(\varDelta _k\) converges Q-linearly to zero. By (C.3) we then conclude that \((\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}}))_{k\in \mathbb {N}}\) converges Q-linearly, and by Fact A.2(iii), which gives \(\varphi (z^k)=\varPhi (\varvec{z}^k) \le \varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\), that \((\varphi (z^k))_{k\in \mathbb {N}}\) converges R-linearly. Moreover, inequality (C.6) implies that \((\Vert z^{k+1} - z^k\Vert )_{k\in \mathbb {N}}\) converges R-linearly, and thus so does \((z^k)_{k\in \mathbb {N}}\).
D CPU time
The performance results presented in Sect. 5 are also reported versus CPU time. According to the numerical comparisons in Figs. 4, 5, and 6, the proposed algorithm features relatively cheap iterations, with a per-epoch computational cost comparable to that of the other algorithms.
E Algorithm variants
1.1 E.1 Adaptive variant
In this section, the implementation of Table 1 is further discussed. In Table 1, for the first iterate, i.e. \(k=0\), the vectors \(\tilde{z}^{{-1}}_{{i}}\) are initially considered equal to \(z^{\textrm{init}}\) for all \(i\in [N]\). Also, note that the linesearch in step 2.5.d of Table 1 backtracks to step 2.3.a, rather than step 2.5.c. Performing the linesearches in this intertwined fashion is observed to result in acceptance of good directions and reduction in the overall computational complexity [20, 52]. We refer the reader to [20] for the theoretical justification for the effectiveness of this procedure. Note that in Algorithm 3, in the Euclidean case, the same backtrackings can be used with dgfs \(h_i = \frac{1}{2} \Vert \cdot \Vert ^2\). The backtracking linesearches in the first block of Table 1 do not require storing \(\tilde{z}^{{k}}_{{i}}\) and can be performed efficiently. In step 2.1.b \(\sum _{i=1}^N p_i(\cdot ,\tilde{z}^{{k}}_{{i}})\) may be evaluated by storing the scalars \(\sum _{i=1}^N f_i(\tilde{z}^{{k}}_{{i}})\) and \(\sum _{i=1}^N \langle \nabla f_i(\tilde{z}^{{k}}_{{i}}), \tilde{z}^{{k}}_{{i}} \rangle \) and one vector \(\sum _{i=1}^N \nabla f_i(\tilde{z}^{{k}}_{{i}}) \in \mathbb {R}^n\) while performing step 1.10 of the algorithm. Similar tricks apply to the computation of the Bregman distances, functions \(p_i\) in other backtracking linesearches of Table 1, and updating the vectors \(s^k, {\bar{s}}^k\), and \({\tilde{s}}^k\).
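The running-sum bookkeeping described above can be sketched as follows; the quadratic components \(f_i\) and all names here are illustrative assumptions, not the paper's implementation. Two scalars and one \(n\)-vector suffice to evaluate the aggregated linearization \(\sum _i [f_i(\tilde{z}_i) + \langle \nabla f_i(\tilde{z}_i), \cdot - \tilde{z}_i\rangle ]\), and refreshing one component is a cheap incremental update:

```python
import numpy as np

# Hypothetical least-squares components f_i(x) = 0.5*||A_i x - b_i||^2, for illustration.
rng = np.random.default_rng(1)
N, n = 5, 4
A = rng.standard_normal((N, 3, n)); b = rng.standard_normal((N, 3))
f  = lambda i, x: 0.5 * np.sum((A[i] @ x - b[i]) ** 2)
gf = lambda i, x: A[i].T @ (A[i] @ x - b[i])

z_tilde = rng.standard_normal((N, n))

# Running sums: two scalars and one n-vector, as described in the text.
S_f  = sum(f(i, z_tilde[i]) for i in range(N))                 # sum_i f_i(z~_i)
S_g  = sum(gf(i, z_tilde[i]) for i in range(N))                # sum_i grad f_i(z~_i)
S_ip = sum(gf(i, z_tilde[i]) @ z_tilde[i] for i in range(N))   # sum_i <grad f_i(z~_i), z~_i>

def agg_lin(x):
    """sum_i [ f_i(z~_i) + <grad f_i(z~_i), x - z~_i> ] from the running sums alone."""
    return S_f + S_g @ x - S_ip

x = rng.standard_normal(n)
direct = sum(f(i, z_tilde[i]) + gf(i, z_tilde[i]) @ (x - z_tilde[i]) for i in range(N))
assert abs(agg_lin(x) - direct) < 1e-9

# Refreshing one component j: subtract its old contribution, add the new one.
j, z_new = 2, rng.standard_normal(n)
S_f  += f(j, z_new) - f(j, z_tilde[j])
S_g  += gf(j, z_new) - gf(j, z_tilde[j])
S_ip += gf(j, z_new) @ z_new - gf(j, z_tilde[j]) @ z_tilde[j]
z_tilde[j] = z_new
direct = sum(f(i, z_tilde[i]) + gf(i, z_tilde[i]) @ (x - z_tilde[i]) for i in range(N))
assert abs(agg_lin(x) - direct) < 1e-9
```

The same idea extends to the Bregman terms and the vectors \(s^k, {\bar{s}}^k, {\tilde{s}}^k\): each refresh touches only the contribution of the updated component.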
1.2 E.2 Euclidean variant
In this section, the Euclidean variant of the proposed algorithm is outlined in Algorithm 3 for the case where the functions \(f_i\) have Lipschitz continuous gradients with constants \(L_i\). In this case, the distance generating functions are \(h_i=\frac{1}{2}\Vert \cdot \Vert ^2\), and the Bregman distances simplify to \({{\,\textrm{D}\,}}_{h_i}(y,x)=\frac{1}{2}\Vert y-x\Vert ^2\).
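As a minimal illustration of this reduction (with \(g=\lambda \Vert \cdot \Vert _1\) chosen purely as an example, not taken from the paper), the Bregman proximal step with \(h=\frac{1}{2}\Vert \cdot \Vert ^2\) coincides with the classical proximal gradient update:

```python
import numpy as np

# With h = 0.5*||.||^2, argmin_y { <grad f(x), y> + g(y) + (1/gamma) D_h(y, x) }
# is the classical update prox_{gamma g}(x - gamma * grad f(x)).
# Illustrative choice: g = lam*||.||_1, whose prox is soft-thresholding.
soft = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def euclidean_step(x, grad_f, gamma, lam):
    return soft(x - gamma * grad_f(x), gamma * lam)

# Check against direct grid minimization of the linearized model in 1-D.
f_grad = lambda x: x - 2.0            # gradient of f(x) = 0.5*(x - 2)^2
gamma, lam = 0.5, 1.0
x0 = np.array([0.0])
step = euclidean_step(x0, f_grad, gamma, lam)

grid = np.linspace(-4.0, 4.0, 80001)
obj = f_grad(x0)[0] * grid + lam * np.abs(grid) + (grid - x0[0]) ** 2 / (2 * gamma)
assert abs(step[0] - grid[np.argmin(obj)]) < 1e-3
```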
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Behmandpoor, P., Latafat, P., Themelis, A. et al. SPIRAL: a superlinearly convergent incremental proximal algorithm for nonconvex finite sum minimization. Comput Optim Appl 88, 71–106 (2024). https://doi.org/10.1007/s10589-023-00550-8
Keywords
- Finite sum minimization
- Nonsmooth nonconvex optimization
- Relative smoothness
- Superlinear convergence
- KL inequality