SPIRAL: a superlinearly convergent incremental proximal algorithm for nonconvex finite sum minimization

Abstract

We introduce SPIRAL, a SuPerlinearly convergent Incremental pRoximal ALgorithm, for solving nonconvex regularized finite sum problems under a relative smoothness assumption. Each iteration of SPIRAL consists of an inner and an outer loop. It combines incremental gradient updates with a linesearch that has the remarkable property of never being triggered asymptotically, leading to superlinear convergence under mild assumptions at the limit point. Simulation results with L-BFGS directions on different convex, nonconvex, and non-Lipschitz differentiable problems show that our algorithm, as well as its adaptive variant, is competitive with the state of the art.

Data availability

The datasets used in the numerical experiments are available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ and https://hastie.su.domains/ElemStatLearn/data.html.

References

  1. Ahookhosh, M., Themelis, A., Patrinos, P.: A Bregman forward-backward linesearch algorithm for nonconvex composite optimization: superlinear convergence to nonisolated local minima. SIAM J. Optim. 31(1), 653–685 (2021). https://doi.org/10.1137/19M1264783

  2. Aragón Artacho, F.J., Belyakov, A., Dontchev, A.L., López, M.: Local convergence of quasi-Newton methods under metric regularity. Comput. Optim. Appl. 58(1), 225–247 (2014)

  3. Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017)

  4. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)

  5. Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. Neural Networks: Tricks of the Trade: Second Edition, pp. 437–478 (2012)

  6. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific (2016)

  7. Bertsekas, D.P., Tsitsiklis, J.N.: Gradient convergence in gradient methods with errors. SIAM J. Optim. 10(3), 627–642 (2000)

  8. Blatt, D., Hero, A.O., Gauchman, H.: A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18(1), 29–51 (2007)

  9. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)

  10. Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556–572 (2007)

  11. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)

  12. Bolte, J., Sabach, S., Teboulle, M., Vaisbourd, Y.: First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM J. Optim. 28(3), 2131–2151 (2018)

  13. Cai, X., Lin, C.Y., Diakonikolas, J.: Empirical risk minimization with shuffled SGD: a primal-dual perspective and improved bounds. arXiv preprint arXiv:2306.12498 (2023)

  14. Cai, X., Song, C., Wright, S., Diakonikolas, J.: Cyclic block coordinate descent with variance reduction for composite nonconvex optimization. In: International Conference on Machine Learning, pp. 3469–3494. PMLR (2023)

  15. Candes, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inf. Theory 61(4), 1985–2007 (2015)

  16. Cha, J., Lee, J., Yun, C.: Tighter lower bounds for shuffling SGD: Random permutations and beyond. In: International Conference on Machine Learning, pp. 3855–3912. PMLR (2023)

  17. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2, 1–27 (2011)

  18. Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993)

  19. Davis, D., Drusvyatskiy, D., MacPhee, K.J.: Stochastic model-based minimization under high-order growth. arXiv preprint arXiv:1807.00255 (2018)

  20. De Marchi, A., Themelis, A.: Proximal gradient algorithms under local Lipschitz gradient continuity: a convergence and robustness analysis of PANOC. J. Optim. Theory Appl. 194(3), 771–794 (2022)

  21. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)

  22. Defazio, A., Domke, J.: Finito: A faster, permutable incremental gradient method for big data problems. In: International Conference on Machine Learning, pp. 1125–1133 (2014)

  23. Dennis, J.E., Moré, J.J.: A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28(126), 549–560 (1974)

  24. Dennis, J.E., Jr., Moré, J.J.: Quasi-Newton methods, motivation and theory. SIAM Rev. 19(1), 46–89 (1977)

  25. Dragomir, R.A., Even, M., Hendrikx, H.: Fast stochastic Bregman gradient methods: sharp analysis and variance reduction. In: International Conference on Machine Learning, pp. 2815–2825. PMLR (2021)

  26. Duchi, J.C., Ruan, F.: Solving (most) of a set of quadratic equalities: composite optimization for robust phase retrieval. Inf. Inference J. IMA 8(3), 471–529 (2019)

  27. Facchinei, F., Pang, J.S.: Finite-Dimensional Variational Inequalities and Complementarity Problems, vol. II. Springer (2003)

  28. Fang, C., Li, C.J., Lin, Z., Zhang, T.: Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Adv. Neural Inf. Process. Syst. 31 (2018)

  29. Ge, R., Li, Z., Wang, W., Wang, X.: Stabilized SVRG: simple variance reduction for nonconvex optimization. In: Conference on Learning Theory, pp. 1394–1448. PMLR (2019)

  30. Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)

  31. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2016)

  32. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. 186, 49–84 (2021)

  33. Hanzely, F., Richtárik, P.: Fastest rates for stochastic mirror descent methods. Comput. Optim. Appl. 1–50 (2021)

  34. Haochen, J., Sra, S.: Random shuffling beats SGD after finite epochs. In: International Conference on Machine Learning, pp. 2624–2633. PMLR (2019)

  35. Hastie, T., Friedman, J., Tibshirani, R.: The Elements of Statistical Learning. Springer, New York (2001)

  36. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 26, 315–323 (2013)

  37. Kan, C., Song, W.: The Moreau envelope function and proximal mapping in the sense of the Bregman distance. Nonlinear Anal. Theory Methods Appl. 75(3), 1385–1399 (2012). https://doi.org/10.1016/j.na.2011.07.031

  38. Kurdyka, K.: On gradients of functions definable in \(o\)-minimal structures. Annales de l’institut Fourier 48(3), 769–783 (1998)

  39. Latafat, P., Themelis, A., Ahookhosh, M., Patrinos, P.: Bregman Finito/MISO for nonconvex regularized finite sum minimization without Lipschitz gradient continuity. SIAM J. Optim. 32(3), 2230–2262 (2022)

  40. Latafat, P., Themelis, A., Patrinos, P.: Block-coordinate and incremental aggregated proximal gradient methods for nonsmooth nonconvex problems. Math. Program. 1–30 (2021)

  41. Li, Z., Richtárik, P.: ZeroSARAH: efficient nonconvex finite-sum optimization with zero full gradient computation. arXiv preprint arXiv:2103.01447 (2021)

  42. Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28(1), 333–354 (2018)

  43. Mairal, J.: Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. 25(2), 829–855 (2015)

  44. Mishchenko, K., Khaled, A., Richtárik, P.: Random reshuffling: simple analysis with vast improvements. Adv. Neural Inf. Process. Syst. 33, 17309–17320 (2020)

  45. Mokhtari, A., Eisen, M., Ribeiro, A.: IQN: an incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 28(2), 1670–1698 (2018)

  46. Mokhtari, A., Gürbüzbalaban, M., Ribeiro, A.: Surpassing gradient descent provably: a cyclic incremental method with linear convergence rate. SIAM J. Optim. 28(2), 1420–1447 (2018)

  47. Moritz, P., Nishihara, R., Jordan, M.: A linearly-convergent stochastic L-BFGS algorithm. In: Artificial Intelligence and Statistics, pp. 249–258. PMLR (2016)

  48. Nedic, A., Lee, S.: On stochastic subgradient mirror-descent algorithm with weighted averaging. SIAM J. Optim. 24(1), 84–107 (2014)

  49. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  50. Nesterov, Y.: Introductory lectures on convex optimization: a basic course, vol. 137. Springer Science & Business Media (2018)

  51. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: International Conference on Machine Learning, pp. 2613–2621. PMLR (2017)

  52. Pas, P., Schuurmans, M., Patrinos, P.: Alpaqa: a matrix-free solver for nonlinear MPC and large-scale nonconvex optimization. In: 2022 European Control Conference (ECC), pp. 417–422. IEEE (2022)

  53. Pham, N.H., Nguyen, L.M., Phan, D.T., Tran-Dinh, Q.: ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res. 21, 110–1 (2020)

  54. Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5(2), 201–226 (2013)

  55. Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.J.: Stochastic variance reduction for nonconvex optimization. In: International Conference on Machine Learning, pp. 314–323 (2016)

  56. Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Advances in Neural Information Processing Systems, pp. 1145–1153 (2016)

  57. Rockafellar, R.T.: Convex analysis. Princeton University Press (1970)

  58. Rockafellar, R.T., Wets, R.J.B.: Variational analysis, vol. 317. Springer Science & Business Media (2009)

  59. Rodomanov, A., Kropotov, D.: A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In: International Conference on Machine Learning, pp. 2597–2605. PMLR (2016)

  60. Sadeghi, H., Giselsson, P.: Hybrid acceleration scheme for variance reduced stochastic optimization algorithms. arXiv preprint arXiv:2111.06791 (2021)

  61. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1), 83–112 (2017)

  62. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14(Feb), 567–599 (2013)

  63. Solodov, M.V., Svaiter, B.F.: An inexact hybrid generalized proximal point algorithm and some new results on the theory of Bregman functions. Math. Oper. Res. 25(2), 214–230 (2000)

  64. Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Found. Comput. Math. 18(5), 1131–1198 (2018)

  65. Teboulle, M.: A simplified view of first order methods for optimization. Math. Program. 170(1), 67–96 (2018)

  66. Themelis, A., Ahookhosh, M., Patrinos, P.: On the acceleration of forward-backward splitting via an inexact Newton method. In: Bauschke, H.H., Burachik, R.S., Luke, D.R. (eds.) Splitting Algorithms, Modern Operator Theory, and Applications, pp. 363–412. Springer International Publishing, Cham (2019)

  67. Themelis, A., Patrinos, P.: SuperMann: a superlinearly convergent algorithm for finding fixed points of nonexpansive operators. IEEE Trans. Autom. Control 64(12), 4875–4890 (2019)

  68. Themelis, A., Stella, L., Patrinos, P.: Forward-backward envelope for the sum of two nonconvex functions: further properties and nonmonotone linesearch algorithms. SIAM J. Optim. 28(3), 2274–2303 (2018)

  69. Vanli, N.D., Gurbuzbalaban, M., Ozdaglar, A.: Global convergence rate of proximal incremental aggregated gradient methods. SIAM J. Optim. 28(2), 1282–1300 (2018)

  70. Wang, Z., Ji, K., Zhou, Y., Liang, Y., Tarokh, V.: Spiderboost and momentum: faster variance reduction algorithms. Adv. Neural Inf. Process. Syst. 32 (2019)

  71. Yang, M., Milzarek, A., Wen, Z., Zhang, T.: A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization. Math. Program. 1–47 (2021)

  72. Yu, P., Li, G., Pong, T.K.: Kurdyka-Łojasiewicz exponent via inf-projection. Found. Comput. Math. 1–47 (2021)

  73. Zhang, H., Dai, Y.H., Guo, L., Peng, W.: Proximal-like incremental aggregated gradient method with linear convergence under Bregman distance growth conditions. Math. Oper. Res. 46(1), 61–81 (2021)

  74. Zhang, J., Liu, H., So, A.M.C., Ling, Q.: Variance-reduced stochastic quasi-Newton methods for decentralized learning: Part I (2022)

Author information

Corresponding author

Correspondence to Pourya Behmandpoor.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

P. Behmandpoor and M. Moonen acknowledge the research work carried out at the ESAT Laboratory of KU Leuven, in the frame of Research Project FWO nr. G0C0623N ’User-centric distributed signal processing algorithms for next generation cell-free massive MIMO based wireless communication networks’ and Fonds de la Recherche Scientifique—FNRS and Fonds voor Wetenschappelijk Onderzoek— Vlaanderen EOS Project no 30452698 ’(MUSE-WINET) MUlti-SErvice WIreless NETworks’. The scientific responsibility is assumed by its authors. The work of P. Latafat was supported by the Research Foundation Flanders (FWO) grants 1196820N and 12Y7622N. The work of P. Patrinos was supported by the Research Foundation Flanders (FWO) research projects G0A0920N, G086518N, G086318N, and G081222N; Research Council KU Leuven C1 project No. C14/18/068; Fonds de la Recherche Scientifique—FNRS and the Fonds Wetenschappelijk Onderzoek—Vlaanderen under EOS project 30468160 (SeLMA); European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 953348. The work of A. Themelis was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI grant JP21K17710.

Appendices

A Preliminaries

Fact A.1

(basic properties [18, 50]) The following hold for a dgf \(H:\mathbb {R}^n\rightarrow \mathbb {R}\), \(x,y,z\in \mathbb {R}^n\):

  1. (i)

    (three-point identity) \( {{\,\textrm{D}\,}}_H(x,z)={{\,\textrm{D}\,}}_H(x,y)+{{\,\textrm{D}\,}}_H(y,z)+\langle {x-y}, {\nabla H(y)}{-\nabla H(z)}\rangle . \) [18, Lem. 3.1].

For any convex set \(\mathcal {U}\subseteq \mathbb {R}^n\) and \(u,v\in \mathcal {U}\) the following hold [50, Thm. 2.1.5, 2.1.10]:

  1. (ii)

    If \(H\) is \(\mu _{H,\mathcal {U}}\)-strongly convex on \(\mathcal {U}\), then \( \frac{\mu _{H,\mathcal {U}}}{2}\Vert v-u\Vert ^2 \le {{\,\textrm{D}\,}}_{H}(v,u) \le \frac{1}{2\mu _{H,\mathcal {U}}}\Vert \nabla H(v)-\nabla H(u)\Vert ^2 \).

  2. (iii)

    If \(\nabla H\) is \(\ell _{H,\mathcal {U}}\)-Lipschitz on \(\mathcal {U}\), then \( \frac{1}{2\ell _{H,\mathcal {U}}}\Vert \nabla H(v)-\nabla H(u)\Vert ^2 \le {{\,\textrm{D}\,}}_{H}(v,u) \le \frac{\ell _{H,\mathcal {U}}}{2}\Vert v-u\Vert ^2 \).
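
For intuition, these relations are easy to verify numerically. The following minimal sketch uses the dgf \(H(x)=\frac{1}{2}\Vert x\Vert ^2+\frac{1}{4}\Vert x\Vert ^4\), chosen purely for illustration (it is 1-strongly convex on \(\mathbb {R}^n\)), and checks the three-point identity (i) and the two-sided bound in (ii) at random points.

```python
import numpy as np

# Illustrative dgf (not prescribed by Fact A.1): H(x) = 1/2||x||^2 + 1/4||x||^4,
# with gradient grad_H(x) = (1 + ||x||^2) x; it is 1-strongly convex on R^n.
def H(x):
    s = x @ x
    return 0.5 * s + 0.25 * s ** 2

def grad_H(x):
    return (1.0 + x @ x) * x

def D(x, y):
    """Bregman distance D_H(x, y) = H(x) - H(y) - <grad_H(y), x - y>."""
    return H(x) - H(y) - grad_H(y) @ (x - y)

rng = np.random.default_rng(0)
x, y, z = rng.standard_normal((3, 5))

# Fact A.1(i): three-point identity
lhs = D(x, z)
rhs = D(x, y) + D(y, z) + (x - y) @ (grad_H(y) - grad_H(z))
print(abs(lhs - rhs))                     # zero up to round-off

# Fact A.1(ii) with mu = 1:  mu/2 ||x-y||^2 <= D_H(x,y) <= 1/(2 mu) ||grad_H(x)-grad_H(y)||^2
print(0.5 * np.linalg.norm(x - y) ** 2 <= D(x, y) <=
      0.5 * np.linalg.norm(grad_H(x) - grad_H(y)) ** 2)
```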

In the following, some properties of the Bregman Moreau envelope are highlighted. The interested reader is referred to [1] and [37] for proofs and further properties.

Fact A.2

(Basic properties of \(\phi ^{H}\) and \({{\,\textrm{prox}\,}}_\phi ^{H}\), [1, 37]) Let \(H:\mathbb {R}^n\rightarrow \mathbb {R}\) denote a dgf (cf. Definition 2.1), and let \(\phi :\mathbb {R}^n\rightarrow \overline{\mathbb {R}}\) be a proper, lsc, and lower bounded function. Then, the following hold:

  1. (i)

    \({{\,\textrm{prox}\,}}_\phi ^{H}\) is locally bounded, compact-valued, and outer semicontinuous;

  2. (ii)

    \(\phi ^{H}\) is finite-valued and continuous; it is locally Lipschitz if so is \(\nabla H\);

  3. (iii)

    \(\phi ^{H}(z) = \phi (v) + {{\,\textrm{D}\,}}_{H}(v, z) \le \phi (y) + {{\,\textrm{D}\,}}_{H}(y, z)\) for any \(y, z\in \mathbb {R}^n\) and \(v \in {{\,\textrm{prox}\,}}_\phi ^{H}(z)\). Hence, \(\phi ^{H}(z) \le \phi (z)\);

  4. (iv)

    \(\inf \phi = \inf \phi ^{H}\) and \({{\,\textrm{argmin}\,}}\phi ^{H}= {{\,\textrm{argmin}\,}}\phi \);

  5. (v)

    \(\phi ^{H}\) is level-bounded iff so is \(\phi \).
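
These properties can also be illustrated by brute force. The sketch below makes purely illustrative choices (Euclidean dgf \(H=\frac{1}{2}|\cdot |^2\) and the nonconvex \(\phi (w)=(w^2-1)^2\)), evaluates \({{\,\textrm{prox}\,}}_\phi ^{H}\) and \(\phi ^{H}\) by grid search, and checks items (iii) and (iv).

```python
import numpy as np

# Illustrative 1D choices (not from the paper): Euclidean dgf and a nonconvex phi.
phi  = lambda w: (w ** 2 - 1.0) ** 2          # proper, lsc, lower bounded
breg = lambda w, z: 0.5 * (w - z) ** 2        # D_H(w, z) for H = 1/2|.|^2
grid = np.linspace(-3.0, 3.0, 60001)          # brute-force search grid

def prox_env(z):
    vals = phi(grid) + breg(grid, z)
    j = int(np.argmin(vals))
    return grid[j], vals[j]                   # a minimizer v and phi^H(z)

for z in (-2.0, 0.3, 1.7):
    v, env = prox_env(z)
    # Fact A.2(iii): phi^H(z) = phi(v) + D_H(v, z) <= phi(z) (up to grid error)
    assert abs(env - (phi(v) + breg(v, z))) < 1e-12
    assert env <= phi(z) + 1e-9

# Fact A.2(iv): the infima coincide (both are 0, attained near w = +/-1)
env_vals = np.array([prox_env(z)[1] for z in grid[::60]])
print(phi(grid).min(), env_vals.min())        # both close to 0
```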

The following fact studies sufficient conditions for Lipschitz continuity of the Bregman proximal mapping and continuity of the Moreau envelope, both of which are crucial to the theory developed in Theorems 4.7 and 4.12.

Fact A.3

([39, Lem. A.2]) Let \(\mathcal {V}_i\subseteq \mathbb {R}^n\) be nonempty and convex, \(i\in [N]\), and let \(\mathcal {V}\mathrel {:=}\mathcal {V}_1\times \cdots \times \mathcal {V}_N\). In addition to Assumption 1, suppose that \(g\) is convex, and that \(h_i\), \(i\in [N]\), is \(\ell _{h_i}\)-smooth and \(\mu _{h_i}\)-strongly convex on \(\mathcal {V}_i\). Then, the following hold for the function \(\hat{H}\) as in (4.2) with \(\gamma _i\in (0,\nicefrac {N}{L_{f_i}})\), \(i\in [N]\):

  1. (i)

    \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) is \({\bar{L}}\)-Lipschitz continuous on \(\mathcal {V}\) for some constant \({\bar{L}}\ge 0\).

If in addition \(f_i\) and \(h_i\) are twice continuously differentiable on \(\mathcal {V}_i\), \(i\in [N]\), then

  1. (ii)

    \(\varPhi ^{\hat{H}}\) is continuously differentiable on \(\mathcal {V}\) with \(\nabla \varPhi ^{\hat{H}}=\nabla ^2\hat{H}\circ ({{\,\textrm{id}\,}}-{{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}})\).
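
Item (ii) can be sanity-checked with finite differences. The sketch below uses a one-dimensional Euclidean stand-in (\(H(x)=\frac{1}{2}x^2\), so \(\nabla ^2H\equiv 1\), and \(\varPhi (x)=|x|\), whose proximal mapping is soft-thresholding and whose envelope is the Huber function); these choices are for illustration only and are not the composite \(\hat{H}\) and \(\varPhi \) of (4.1)–(4.2).

```python
import numpy as np

# 1D Euclidean stand-in for Fact A.3(ii): H(x) = x^2/2 (so Hess H = 1) and
# Phi(x) = |x|, whose prox is soft-thresholding and whose envelope is Huber.
prox = lambda z: np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)
env  = lambda z: np.abs(prox(z)) + 0.5 * (prox(z) - z) ** 2

# Claimed gradient: grad Phi^H = Hess H * (id - prox), here simply z - prox(z).
z  = np.linspace(-3.0, 3.0, 13)
fd = (env(z + 1e-6) - env(z - 1e-6)) / 2e-6     # central finite differences
print(np.max(np.abs(fd - (z - prox(z)))))        # small finite-difference error
```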

The following fact establishes the equivalence between problems (1.1) and (4.1).

Fact A.4

([39, Lem. A.1]) Let the functions \(\varphi \) and \(\varPhi \) be as in (1.1) and (4.1), respectively. Then,

  1. (i)

    \(\partial \varPhi (\varvec{x}) = \{\varvec{v} = (v_1,\dots ,v_N) \mid \sum _iv_i \in \partial \varphi (x)\}\) if \(\varvec{x}=(x,\dots ,x) \in \varDelta \), and is empty otherwise.

  2. (ii)

    \(\varPhi \) has the KL property at \(\varvec{x} = (x,\dots ,x)\) iff so does \(\varphi \) at x. In this case, the desingularizing functions are the same up to a positive scaling.

B Omitted lemmas

Lemma B.1

Suppose that Assumptions 1 and 2 hold and that \(\varphi \) is level bounded. Consider the sequence generated by Algorithm 1. Then, for every \(\ell \in [N]\) there exists \(c_\ell >0\) such that

$$\begin{aligned} \Vert \tilde{z}^{{k}}_{{\ell }} - u^k\Vert \le c_\ell \Vert \tilde{z}^{{k}}_{{1}} - u^k\Vert . \end{aligned}$$
(B.1)

Proof

By level boundedness of \(\varphi \) and Theorem 4.7, \((\varvec{u}^k)_{k\in \mathbb {N}}\), \((\varvec{z}^k)_{k\in \mathbb {N}}\), \((\tilde{z}^{{k}}_{{i}})_{k\in \mathbb {N}}, i\in [N]\), are contained in a nonempty bounded set \(\varvec{\mathcal {U}}\). By Assumption 2.A2, \(h_i\) is locally strongly convex and locally Lipschitz, which along with Assumption 2.A1 and Fact A.3 implies that \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) is \({\bar{L}}\)-Lipschitz on a convex subset of \(\varvec{\mathcal {U}}\) for some \({\bar{L}} > 0\). Without loss of generality, and for the sake of simplicity, we assume the cyclic sweeping rule in the incremental loop, i.e., \(i^\ell =\ell \); the proof below is easily adapted to the case of sweeping without replacement. Arguing by induction, for \(\ell =1\), (B.1) holds trivially. Suppose that the claim holds for some \(\ell \ge 1\). Then, by the triangle inequality and the definition of \(\tilde{\varvec{z}}^{k}_{\ell }\) in step 2.7 of Algorithm 2

establishing (B.1). \(\square \)

Lemma B.2

In addition to the assumptions in Lemma B.1, suppose that the directions \(d^k\) in step 1.4 satisfy \(\Vert d^k\Vert \le D\Vert z^k-v^k\Vert \) for some \(D\ge 0\). Then, \(\Vert z^{k+1}- z^k\Vert \le C\Vert \varvec{z}^k - \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}\Vert \) holds for some positive C.

Proof

By the same reasoning as in Lemma B.1, \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) is \({\bar{L}}\)-Lipschitz continuous on a bounded convex set containing the iterates \((\varvec{u}^k)_{k\in \mathbb {N}}\), \((\varvec{z}^k)_{k\in \mathbb {N}}\), \((\tilde{z}^{{k}}_{{i}})_{k\in \mathbb {N}}, i\in [N]\). It follows from the assumption on \(\Vert d^k\Vert \) and step 2.4.a of Algorithm 2 that

$$\begin{aligned} \Vert \varvec{z}^k - \varvec{u}^k\Vert &\le (1-\tau _k)\Vert \varvec{z}^k - \varvec{v}^k\Vert + \tau _k \Vert \varvec{d}^k\Vert \\ &\le (1-\tau _k + \tau _k D) \Vert \varvec{z}^k - \varvec{v}^k\Vert \\ &\le \eta _1 \Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert , \end{aligned}$$
(B.2)

where \(\eta _1 = {\bar{L}}(1-\tau _k + \tau _k D)\) and the Lipschitz continuity of the proximal mapping was used in the last inequality. Further using the triangle inequality yields

$$\begin{aligned} \Vert \varvec{u}^k - \bar{{{\varvec{z}}}}_{{N}}^{k-1}\Vert &\le \Vert \varvec{u}^k - \varvec{z}^k\Vert + \Vert \varvec{z}^k - \bar{{{\varvec{z}}}}_{{N}}^{k-1}\Vert {\mathop {\le }\limits ^{{(B.2)}}} \left( \eta _1 + 1\right) \Vert \varvec{z}^k - \bar{{{\varvec{z}}}}_{{N}}^{k-1} \Vert , \quad \text{ and } \\ \Vert \tilde{z}_{{1}}^{k} - u^k\Vert &= \tfrac{1}{\sqrt{N}}\Vert \tilde{\varvec{z}}_{1}^{k} - \varvec{u}^k\Vert \le \tfrac{1}{\sqrt{N}}\Vert \tilde{\varvec{z}}_{1}^{k} - \varvec{z}^k\Vert + \tfrac{1}{\sqrt{N}} \Vert \varvec{z}^k - \varvec{u}^k\Vert \\ &\le \tfrac{\bar{L}}{\sqrt{N}}\Vert \varvec{u}^k - \bar{{{\varvec{z}}}}_{{N}}^{k-1}\Vert + \tfrac{1}{\sqrt{N}} \Vert \varvec{z}^k - \varvec{u}^k\Vert \le \tfrac{1}{\sqrt{N}}\big ((\bar{L}+1)\eta _1 + \bar{L}\big )\Vert \varvec{z}^k - \bar{{{\varvec{z}}}}_{{N}}^{k-1}\Vert , \end{aligned}$$
(B.3)

where the second inequality in the latter chain uses the Lipschitz continuity of \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) and Algorithm 2, and the last one follows from the former chain together with (B.2).

Using this along with the triangle inequality yields

$$\begin{aligned} \Vert \bar{{{\varvec{z}}}}^{{k}}_{{N}} - \varvec{u}^k\Vert \le \sum _{\ell =1}^N \Vert \tilde{z}^{{k}}_{{\ell }} - u^k\Vert {\mathop {\le }\limits ^{{(B.1)}}} \sum _{\ell =1}^N c_\ell \Vert \tilde{z}^{{k}}_{{1}} - u^k\Vert \le \eta _2\Vert \varvec{z}^k - \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}\Vert , \end{aligned}$$

where \(\eta _2 = \sum _{\ell =1}^N \tfrac{c_\ell }{\sqrt{N}}\big ((\bar{L}+1)\eta _1 + {\bar{L}}\big )\). This inequality combined with (B.3) yields

$$\begin{aligned} \Vert z^{k+1}- z^{k}\Vert ={}&\frac{1}{\sqrt{N}}\Vert {{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k}}_{{N}}) - {{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\Vert \le \frac{{\bar{L}}}{\sqrt{N}} \Vert \bar{{{\varvec{z}}}}^{{k}}_{{N}}- \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}\Vert \\ \le {}&\frac{{\bar{L}}}{\sqrt{N}} \Vert \bar{{{\varvec{z}}}}^{{k}}_{{N}}- \varvec{u}^k\Vert + \frac{\bar{L}}{\sqrt{N}} \Vert \varvec{u}^k - \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}\Vert \\&\le \frac{{\bar{L}}}{\sqrt{N}} (\eta _1 + \eta _2 + 1)\Vert \varvec{z}^k - \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}\Vert . \end{aligned}$$

The claimed inequality follows from the Lipschitz continuity of \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) and the inclusion in step 2.1 of Algorithm 2. \(\square \)

C Omitted proofs

1.1 Proof of Theorem 4.16

By level boundedness of \(\varphi \) and Theorem 4.7, \((\varvec{u}^k)_{k\in \mathbb {N}}\), \((\varvec{z}^k)_{k\in \mathbb {N}}\), \((\tilde{z}^{{k}}_{{i}})_{k\in \mathbb {N}}\) are contained in a nonempty convex bounded set \(\varvec{\mathcal {U}}\), on which, owing to Assumption 2.A2, \(h_i\) and consequently \(\hat{H}\) are strongly convex. It then follows from Fact A.1(ii), Theorem 4.7(ii), and Lemma B.2 that \( \Vert z^{k+1} - z^k\Vert \rightarrow 0. \) Therefore, the set of limit points of \((z^k)_{k\in \mathbb {N}}\) is nonempty, compact, and connected [11, Rem. 5]. By Theorems 4.7(iv) and 4.7(v), the limit points are stationary for \(\varphi \), and \(\varPhi ^{\hat{H}}(\varvec{z}^k) = \mathcal {L}(v^k, z^k) \rightarrow \varphi _\star \). In the trivial case \(\varPhi ^{\hat{H}}(\varvec{z}^k) = \mathcal {L}(v^k, z^k) = \varphi _\star \) for some k, the claims follow from Theorem 4.7. Assume now that \(\varPhi ^{\hat{H}}(\varvec{z}^k)> \varphi _\star \) for all \(k\in \mathbb {N}\). The KL property for \(\varPhi \) is implied by that of \(\varphi \) due to Fact A.4, with desingularizing function \(\psi (s)=\rho s^{1-\theta }\) and exponent \(\theta \in (0,1)\). Let \(\varOmega \) denote the set of limit points of \((\varvec{z}^k=(z^k,\ldots , z^k))_{k\in \mathbb {N}}\). Since \(\hat{H}\) is strongly convex, [72, Lem. 5.1] can be invoked to infer that the function \(\mathcal {M}_{\hat{H}}(\varvec{w}, \varvec{x}) = \varPhi (\varvec{w}) + {{\,\textrm{D}\,}}_{\hat{H}}(\varvec{w}, \varvec{x})\) also has the KL property with exponent \(\nu =\max \{\theta ,\frac{1}{2}\}\) at every point \((\varvec{z}^\star , \varvec{z}^\star )\) in the compact set \(\varOmega \times \varOmega \). Moreover, by (4.2), \(\mathcal {M}_{\hat{H}}(\varvec{z}^\star ,\varvec{z}^\star ) = \varPhi (\varvec{z}^\star )= \varphi _\star \), where Theorem 4.7(iv) was used in the last equality. Recall that \(\varvec{z}^{k} \in {{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\) as in step 2.1 of Algorithm 2. Therefore, \(\big (0, \nabla ^2 \hat{H}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^{k})\big ) \in \partial \mathcal {M}_{\hat{H}}(\varvec{z}^{k}, \bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\), resulting in

$$\begin{aligned} {{\,\textrm{dist}\,}}(0,\partial \mathcal {M}_{\hat{H}}(\varvec{z}^{k}, \bar{{{\varvec{z}}}}^{{k-1}}_{{N}})) \le \Vert \nabla ^2 \hat{H}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\Vert \Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert \le c \Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert \nonumber \\ \end{aligned}$$
(C.1)

where \(c = \sup _k\Vert \nabla ^2 \hat{H}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\Vert > 0\) is finite due to \(\tilde{z}^{{k}}_{{N}}\) being bounded (cf. Theorem 4.7(vi)) and continuity of \(\nabla ^2 \hat{H}\). Combining (4.18) with (C.1), since \(\mathcal {M}_{\hat{H}}(\varvec{z}^k, \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}) = \varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}}) \rightarrow \varphi _\star \) from above, and since \((\varvec{z}^k, \bar{{{\varvec{z}}}}^{{k-1}}_{{N}})_{k\in \mathbb {N}}\) is bounded and accumulates on \(\varOmega \times \varOmega \), up to discarding finitely many iterates the following holds

$$\begin{aligned} \psi '\big (\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})- \varphi _\star \big ) &= \psi '\big (\mathcal {M}_{\hat{H}}(\varvec{z}^{k}, \bar{{{\varvec{z}}}}^{{k-1}}_{{N}})- \mathcal {M}_{\hat{H}}(\varvec{z}^\star , \varvec{z}^\star )\big ) \\ &\ge \frac{1}{c\Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert }, \end{aligned}$$
(C.2)

where \(\psi (s)=\rho s^{1-\nu }\) is a desingularizing function for \(\mathcal {M}_{\hat{H}}\) on \(\varOmega \times \varOmega \). Let us define

$$\begin{aligned} \varDelta _k \mathrel {:=}\psi (\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})-\varphi _\star ) &= \rho [\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})-\varphi _\star ]^{1-\nu } \\ &\le \rho [\rho (1-\nu )c\Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert ]^{\frac{1-\nu }{\nu }}. \end{aligned}$$
(C.3)

Then, \( \varDelta _k^{\frac{\nu }{1-\nu }} \le c \rho ^{\frac{1}{1-\nu }}(1-\nu )\Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert . \) Concavity of \(\psi \) also implies

$$\begin{aligned} \varDelta _k - \varDelta _{k+1} &\ge \psi '(\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})-\varphi _\star )(\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})-\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k}}_{{N}})) \\ &{\mathop {\ge }\limits ^{{(C.2)}}} \frac{\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}}) - \varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k}}_{{N}})}{c \Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert }. \end{aligned}$$
(C.4)

On the other hand by (4.11) and (4.10)

$$\begin{aligned} \varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k+1}}_{{N}}) - \varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k}}_{{N}}) \le - {{\,\textrm{D}\,}}_{\hat{H}}(\varvec{z}^{k+1}, \bar{{{\varvec{z}}}}^{{k}}_{{N}}) \le -\frac{\mu _{\hat{H}}}{2}\Vert \varvec{z}^{k+1} - \bar{{{\varvec{z}}}}^{{k}}_{{N}}\Vert ^2, \end{aligned}$$
(C.5)

where Fact A.1(ii) was used and \(\mu _{\hat{H}}\) denotes the strong convexity modulus of \(\hat{H}\). Combining (C.4) and (C.5),

$$\begin{aligned} \varDelta _k - \varDelta _{k+1} \ge \eta \Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert \ge \frac{\eta }{C} \Vert \varvec{z}^{k+1} - \varvec{z}^k\Vert , \end{aligned}$$
(C.6)

for some constant \(\eta > 0\), where the last inequality follows from Lemma B.2. Hence, \(\sum _{k\in \mathbb {N}}\Vert \varvec{z}^{k+1} - \varvec{z}^k\Vert < \infty \), i.e., \((\varvec{z}^k)_{k\in \mathbb {N}}\) has finite length and is thus convergent. It then follows from Theorem 4.7(v) that \((\varvec{z}^k)_{k\in \mathbb {N}}\) converges to a stationary point of \(\varphi \). Combining (C.3) and (C.6), we have

$$\begin{aligned} \varDelta _{k+1} \le \varDelta _k - \alpha \varDelta _k^{{\frac{\nu }{1-\nu }}} \end{aligned}$$
(C.7)

for some \(\alpha > 0\). Hence, if \(\nu =\frac{1}{2}\), i.e., \(\theta \in (0,\frac{1}{2}]\) for \(\varPhi \), then (C.7) reads \(\varDelta _{k+1} \le (1 - \alpha ) \varDelta _k\); since \(\alpha > 0\) and \(\frac{\varDelta _{k+1}}{\varDelta _k} > 0\), necessarily \(1-\alpha \in (0,1)\), and \(\varDelta _k\) converges Q-linearly to zero. By (C.3) we then conclude that \((\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}}))_{k\in \mathbb {N}}\) converges Q-linearly, and by Fact A.2(iii), which yields \(\varphi (z^k)=\varPhi (\varvec{z}^k) \le \varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\), that \((\varphi (z^k))_{k\in \mathbb {N}}\) converges R-linearly. Moreover, inequality (C.6) implies that \((\Vert z^{k+1} - z^k\Vert )_{k\in \mathbb {N}}\) converges R-linearly, and thus so does \((z^k)_{k\in \mathbb {N}}\).
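
The role of the KL exponent in (C.7) can also be seen by iterating the recursion directly; the values of \(\alpha \), \(\nu \), and \(\varDelta _0\) in the following sketch are illustrative and not taken from the analysis.

```python
# Iterate Delta_{k+1} = Delta_k - alpha * Delta_k**(nu/(1-nu)), cf. (C.7).
# For nu = 1/2 the exponent equals 1 and Delta_k decays geometrically
# (Q-linearly); for nu > 1/2 the exponent exceeds 1 and, once Delta_k is
# small, the per-step decrease is much weaker (sublinear decay).
def iterate(nu, alpha=0.1, delta0=1.0, iters=50):
    d = delta0
    for _ in range(iters):
        d = max(d - alpha * d ** (nu / (1.0 - nu)), 0.0)
    return d

print(iterate(0.50))   # (1 - alpha)**50 * delta0, roughly 5e-3
print(iterate(0.75))   # exponent 3: decays only at a polynomial rate
```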

D CPU time

The performance results presented in Sect. 5 are also reported versus CPU time. According to the numerical comparisons in Figs. 4, 5, and 6, the proposed algorithm features relatively cheap iterations and a per-epoch computational cost comparable to that of the other algorithms.

Fig. 4

Performance of different algorithms versus CPU time on the phase retrieval problem (5.2) for 550 epochs on a digit 6 image with \(N=1280\), \(n=256\)

Fig. 5

Performance of different algorithms versus CPU time on the lasso problem of (5.3) for 50 epochs. Synthetic dataset (top left) with \(N=10000\), \(n=400\), synthetic dataset (top center) with \(N=300\), \(n=600\), mg (top right) with \(N=1385\), \(n=6\), triazines (bottom left) with \(N=186\), \(n=60\), housing (bottom center) with \(N=506\), \(n=13\), and cadata (bottom right) with \(N=20640\), \(n=8\)

Fig. 6

Performance of different algorithms versus CPU time on the NN-PCA problem of (5.4) for 500 epochs. MNIST (left) with \(N=60000\), \(n=784\), covtype (left center) with \(N=581012\), \(n=54\), a9a (right center) with \(N=32561\), \(n=123\), and aloi (right) with \(N=108000\), \(n=128\)

E Algorithm variants

1.1 E.1 Adaptive variant

In this section, the implementation in Table 1 is discussed further. For the first iterate, i.e., \(k=0\), the vectors \(\tilde{z}^{{-1}}_{{i}}\) are initialized to \(z^{\textrm{init}}\) for all \(i\in [N]\). Also, note that the linesearch in step 2.5.d of Table 1 backtracks to step 2.3.a, rather than to step 2.5.c. Performing the linesearches in this intertwined fashion is observed to result in the acceptance of good directions and a reduction in the overall computational cost [20, 52]; we refer the reader to [20] for the theoretical justification of this procedure. Note that in Algorithm 3, i.e., in the Euclidean case, the same backtrackings can be used with the dgfs \(h_i = \frac{1}{2} \Vert \cdot \Vert ^2\). The backtracking linesearches in the first block of Table 1 do not require storing \(\tilde{z}^{{k}}_{{i}}\) and can be performed efficiently: in step 2.1.b, \(\sum _{i=1}^N p_i(\cdot ,\tilde{z}^{{k}}_{{i}})\) may be evaluated by storing the scalars \(\sum _{i=1}^N f_i(\tilde{z}^{{k}}_{{i}})\) and \(\sum _{i=1}^N \langle \nabla f_i(\tilde{z}^{{k}}_{{i}}), \tilde{z}^{{k}}_{{i}} \rangle \) and the single vector \(\sum _{i=1}^N \nabla f_i(\tilde{z}^{{k}}_{{i}}) \in \mathbb {R}^n\) while performing step 1.10 of the algorithm. Similar tricks apply to the computation of the Bregman distances, to the functions \(p_i\) in the other backtracking linesearches of Table 1, and to updating the vectors \(s^k, {\bar{s}}^k\), and \({\tilde{s}}^k\).
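
A minimal sketch of this bookkeeping is given below. It assumes, for illustration only, that the relevant part of \(p_i(\cdot ,\tilde{z}^{{k}}_{{i}})\) is the linearization \(f_i(\tilde{z}^{{k}}_{{i}})+\langle \nabla f_i(\tilde{z}^{{k}}_{{i}}),\cdot -\tilde{z}^{{k}}_{{i}}\rangle \); the models in Table 1 additionally involve Bregman terms, which can be handled analogously. All function and variable names are hypothetical.

```python
import numpy as np

# Running aggregates maintained while sweeping over the components (cf. step 1.10):
#   S_f    = sum_i f_i(z_i),  S_ip = sum_i <grad f_i(z_i), z_i>,
#   s_grad = sum_i grad f_i(z_i)   (a single n-vector).
# With these, the aggregated linearization at any trial point x costs O(n),
# independently of N, which is what keeps the backtracking linesearches cheap.
def aggregate_linearization(x, S_f, S_ip, s_grad):
    return S_f + s_grad @ x - S_ip

def update_aggregates(S_f, S_ip, s_grad, f_i, grad_f_i, z_old, z_new):
    """Refresh the aggregates when table entry i moves from z_old to z_new."""
    S_f    += f_i(z_new) - f_i(z_old)
    S_ip   += grad_f_i(z_new) @ z_new - grad_f_i(z_old) @ z_old
    s_grad  = s_grad + grad_f_i(z_new) - grad_f_i(z_old)
    return S_f, S_ip, s_grad

# Quick consistency check with quadratic components f_i(w) = 0.5*||A_i w - b_i||^2
rng = np.random.default_rng(0)
N, n = 5, 4
A, b = rng.standard_normal((N, 3, n)), rng.standard_normal((N, 3))
f  = [lambda w, i=i: 0.5 * np.sum((A[i] @ w - b[i]) ** 2) for i in range(N)]
gf = [lambda w, i=i: A[i].T @ (A[i] @ w - b[i]) for i in range(N)]
z, x = rng.standard_normal((N, n)), rng.standard_normal(n)

S_f    = sum(f[i](z[i]) for i in range(N))
S_ip   = sum(gf[i](z[i]) @ z[i] for i in range(N))
s_grad = sum(gf[i](z[i]) for i in range(N))

direct = sum(f[i](z[i]) + gf[i](z[i]) @ (x - z[i]) for i in range(N))
print(abs(direct - aggregate_linearization(x, S_f, S_ip, s_grad)))  # ~1e-14
```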

1.2 E.2 Euclidean variant

Algorithm 3

SPIRAL - Euclidean version

This section outlines the Euclidean version of the proposed algorithm, Algorithm 3, for the case where the functions \(f_i\) have Lipschitz continuous gradients with constants \(L_i\). In this case, the distance generating functions are \(h_i=\frac{1}{2}\Vert \cdot \Vert ^2\), and consequently the Bregman distances simplify to \({{\,\textrm{D}\,}}_{h_i}(y,x)=\frac{1}{2}\Vert y-x\Vert ^2\).
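
For concreteness, the following is a stripped-down sketch of an incremental aggregated proximal gradient loop of the Finito/MISO type in this Euclidean setting. It omits the linesearch and the quasi-Newton directions, so it is not Algorithm 3 itself; the problem data, the \(\ell _1\) regularizer, and the step size are illustrative choices.

```python
import numpy as np

# Illustrative Euclidean finite-sum problem (not from the paper):
#   minimize (1/N) sum_i 0.5*(a_i @ x - b_i)^2 + lam*||x||_1.
# A table z_i of per-component iterates is kept; each inner step refreshes one
# entry and updates the aggregate forward point incrementally.
rng = np.random.default_rng(1)
N, n, lam = 200, 20, 0.1
A, b = rng.standard_normal((N, n)), rng.standard_normal(N)

grad_f = lambda i, x: (A[i] @ x - b[i]) * A[i]           # gradient of f_i
soft   = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)
obj    = lambda x: 0.5 * np.mean((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

L     = np.max(np.sum(A ** 2, axis=1))                   # L_i = ||a_i||^2 <= L
gamma = 0.9 / L                                          # conservative step size
z     = np.zeros((N, n))                                 # table of iterates
agg   = np.mean([z[i] - gamma * grad_f(i, z[i]) for i in range(N)], axis=0)

for epoch in range(10):
    for i in rng.permutation(N):                         # shuffled sweep
        x = soft(agg, gamma * lam)                       # proximal step on g
        # replace the i-th table entry; keep agg = mean_j(z_j - gamma*grad_f_j(z_j))
        agg += (x - gamma * grad_f(i, x) - z[i] + gamma * grad_f(i, z[i])) / N
        z[i] = x
    print(epoch, obj(soft(agg, gamma * lam)))            # monitor the objective
```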

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Behmandpoor, P., Latafat, P., Themelis, A. et al. SPIRAL: a superlinearly convergent incremental proximal algorithm for nonconvex finite sum minimization. Comput Optim Appl 88, 71–106 (2024). https://doi.org/10.1007/s10589-023-00550-8
