SPIRAL: a superlinearly convergent incremental proximal algorithm for nonconvex finite sum minimization

Abstract

We introduce SPIRAL, a SuPerlinearly convergent Incremental pRoximal ALgorithm, for solving nonconvex regularized finite sum problems under a relative smoothness assumption. Each iteration of SPIRAL consists of an inner and an outer loop. It combines incremental gradient updates with a linesearch that has the remarkable property of never being triggered asymptotically, leading to superlinear convergence under mild assumptions at the limit point. Simulation results with L-BFGS directions on different convex, nonconvex, and non-Lipschitz differentiable problems show that our algorithm, as well as its adaptive variant, is competitive with the state of the art.

Data availability

The datasets used in the numerical experiments are available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ and https://hastie.su.domains/ElemStatLearn/data.html.

References

  1. Ahookhosh, M., Themelis, A., Patrinos, P.: A Bregman forward-backward linesearch algorithm for nonconvex composite optimization: superlinear convergence to nonisolated local minima. SIAM J. Optim. 31(1), 653–685 (2021). https://doi.org/10.1137/19M1264783

  2. Aragón Artacho, F.J., Belyakov, A., Dontchev, A.L., López, M.: Local convergence of quasi-Newton methods under metric regularity. Comput. Optim. Appl. 58(1), 225–247 (2014)

  3. Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017)

  4. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)

  5. Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. Neural Networks: Tricks of the Trade: Second Edition, pp. 437–478 (2012)

  6. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific (2016)

  7. Bertsekas, D.P., Tsitsiklis, J.N.: Gradient convergence in gradient methods with errors. SIAM J. Optim. 10(3), 627–642 (2000)

  8. Blatt, D., Hero, A.O., Gauchman, H.: A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18(1), 29–51 (2007)

  9. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)

  10. Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556–572 (2007)

  11. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)

  12. Bolte, J., Sabach, S., Teboulle, M., Vaisbourd, Y.: First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM J. Optim. 28(3), 2131–2151 (2018)

  13. Cai, X., Lin, C.Y., Diakonikolas, J.: Empirical risk minimization with shuffled SGD: a primal-dual perspective and improved bounds. arXiv preprint arXiv:2306.12498 (2023)

  14. Cai, X., Song, C., Wright, S., Diakonikolas, J.: Cyclic block coordinate descent with variance reduction for composite nonconvex optimization. In: International Conference on Machine Learning, pp. 3469–3494. PMLR (2023)

  15. Candes, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inf. Theory 61(4), 1985–2007 (2015)

  16. Cha, J., Lee, J., Yun, C.: Tighter lower bounds for shuffling SGD: Random permutations and beyond. In: International Conference on Machine Learning, pp. 3855–3912. PMLR (2023)

  17. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2, 1–27 (2011)

  18. Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993)

  19. Davis, D., Drusvyatskiy, D., MacPhee, K.J.: Stochastic model-based minimization under high-order growth. arXiv preprint arXiv:1807.00255 (2018)

  20. De Marchi, A., Themelis, A.: Proximal gradient algorithms under local Lipschitz gradient continuity: a convergence and robustness analysis of PANOC. J. Optim. Theory Appl. 194(3), 771–794 (2022)

  21. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)

  22. Defazio, A., Domke, J.: Finito: A faster, permutable incremental gradient method for big data problems. In: International Conference on Machine Learning, pp. 1125–1133 (2014)

  23. Dennis, J.E., Moré, J.J.: A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28(126), 549–560 (1974)

  24. Dennis, J.E., Jr., Moré, J.J.: Quasi-Newton methods, motivation and theory. SIAM Rev. 19(1), 46–89 (1977)

  25. Dragomir, R.A., Even, M., Hendrikx, H.: Fast stochastic Bregman gradient methods: sharp analysis and variance reduction. In: International Conference on Machine Learning, pp. 2815–2825. PMLR (2021)

  26. Duchi, J.C., Ruan, F.: Solving (most) of a set of quadratic equalities: composite optimization for robust phase retrieval. Inf. Inference J. IMA 8(3), 471–529 (2019)

  27. Facchinei, F., Pang, J.S.: Finite-Dimensional Variational Inequalities and Complementarity Problems, vol. II. Springer (2003)

  28. Fang, C., Li, C.J., Lin, Z., Zhang, T.: Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Adv. Neural Inf. Process. Syst. 31 (2018)

  29. Ge, R., Li, Z., Wang, W., Wang, X.: Stabilized SVRG: simple variance reduction for nonconvex optimization. In: Conference on Learning Theory, pp. 1394–1448. PMLR (2019)

  30. Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)

  31. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2016)

  32. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. 186, 49–84 (2021)

  33. Hanzely, F., Richtárik, P.: Fastest rates for stochastic mirror descent methods. Comput. Optim. Appl. 1–50 (2021)

  34. Haochen, J., Sra, S.: Random shuffling beats SGD after finite epochs. In: International Conference on Machine Learning, pp. 2624–2633. PMLR (2019)

  35. Hastie, T., Friedman, J., Tibshirani, R.: The Elements of Statistical Learning. Springer, New York (2001)

  36. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 26, 315–323 (2013)

  37. Kan, C., Song, W.: The Moreau envelope function and proximal mapping in the sense of the Bregman distance. Nonlinear Anal. Theory Methods Appl. 75(3), 1385–1399 (2012). https://doi.org/10.1016/j.na.2011.07.031

  38. Kurdyka, K.: On gradients of functions definable in \(o\)-minimal structures. Annales de l’institut Fourier 48(3), 769–783 (1998)

  39. Latafat, P., Themelis, A., Ahookhosh, M., Patrinos, P.: Bregman Finito/MISO for nonconvex regularized finite sum minimization without Lipschitz gradient continuity. SIAM J. Optim. 32(3), 2230–2262 (2022)

  40. Latafat, P., Themelis, A., Patrinos, P.: Block-coordinate and incremental aggregated proximal gradient methods for nonsmooth nonconvex problems. Math. Program. 1–30 (2021)

  41. Li, Z., Richtárik, P.: ZeroSARAH: efficient nonconvex finite-sum optimization with zero full gradient computation. arXiv preprint arXiv:2103.01447 (2021)

  42. Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28(1), 333–354 (2018)

  43. Mairal, J.: Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. 25(2), 829–855 (2015)

  44. Mishchenko, K., Khaled, A., Richtárik, P.: Random reshuffling: simple analysis with vast improvements. Adv. Neural Inf. Process. Syst. 33, 17309–17320 (2020)

  45. Mokhtari, A., Eisen, M., Ribeiro, A.: IQN: an incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 28(2), 1670–1698 (2018)

  46. Mokhtari, A., Gürbüzbalaban, M., Ribeiro, A.: Surpassing gradient descent provably: a cyclic incremental method with linear convergence rate. SIAM J. Optim. 28(2), 1420–1447 (2018)

  47. Moritz, P., Nishihara, R., Jordan, M.: A linearly-convergent stochastic L-BFGS algorithm. In: Artificial Intelligence and Statistics, pp. 249–258. PMLR (2016)

  48. Nedic, A., Lee, S.: On stochastic subgradient mirror-descent algorithm with weighted averaging. SIAM J. Optim. 24(1), 84–107 (2014)

  49. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  50. Nesterov, Y.: Introductory lectures on convex optimization: a basic course, vol. 137. Springer Science & Business Media (2018)

  51. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: International Conference on Machine Learning, pp. 2613–2621. PMLR (2017)

  52. Pas, P., Schuurmans, M., Patrinos, P.: Alpaqa: a matrix-free solver for nonlinear MPC and large-scale nonconvex optimization. In: 2022 European Control Conference (ECC), pp. 417–422. IEEE (2022)

  53. Pham, N.H., Nguyen, L.M., Phan, D.T., Tran-Dinh, Q.: ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res. 21, 110–1 (2020)

  54. Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5(2), 201–226 (2013)

  55. Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.J.: Stochastic variance reduction for nonconvex optimization. In: International Conference on Machine Learning, pp. 314–323 (2016)

  56. Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Advances in Neural Information Processing Systems, pp. 1145–1153 (2016)

  57. Rockafellar, R.T.: Convex analysis. Princeton University Press (1970)

  58. Rockafellar, R.T., Wets, R.J.B.: Variational analysis, vol. 317. Springer Science & Business Media (2009)

  59. Rodomanov, A., Kropotov, D.: A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In: International Conference on Machine Learning, pp. 2597–2605. PMLR (2016)

  60. Sadeghi, H., Giselsson, P.: Hybrid acceleration scheme for variance reduced stochastic optimization algorithms. arXiv preprint arXiv:2111.06791 (2021)

  61. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1), 83–112 (2017)

  62. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14(Feb), 567–599 (2013)

  63. Solodov, M.V., Svaiter, B.F.: An inexact hybrid generalized proximal point algorithm and some new results on the theory of Bregman functions. Math. Oper. Res. 25(2), 214–230 (2000)

  64. Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Found. Comput. Math. 18(5), 1131–1198 (2018)

  65. Teboulle, M.: A simplified view of first order methods for optimization. Math. Program. 170(1), 67–96 (2018)

  66. Themelis, A., Ahookhosh, M., Patrinos, P.: On the acceleration of forward-backward splitting via an inexact Newton method. In: Bauschke, H.H., Burachik, R.S., Luke, D.R. (eds.) Splitting Algorithms, Modern Operator Theory, and Applications, pp. 363–412. Springer International Publishing, Cham (2019)

  67. Themelis, A., Patrinos, P.: SuperMann: a superlinearly convergent algorithm for finding fixed points of nonexpansive operators. IEEE Trans. Autom. Control 64(12), 4875–4890 (2019)

  68. Themelis, A., Stella, L., Patrinos, P.: Forward-backward envelope for the sum of two nonconvex functions: further properties and nonmonotone linesearch algorithms. SIAM J. Optim. 28(3), 2274–2303 (2018)

  69. Vanli, N.D., Gurbuzbalaban, M., Ozdaglar, A.: Global convergence rate of proximal incremental aggregated gradient methods. SIAM J. Optim. 28(2), 1282–1300 (2018)

  70. Wang, Z., Ji, K., Zhou, Y., Liang, Y., Tarokh, V.: Spiderboost and momentum: faster variance reduction algorithms. Adv. Neural Inf. Process. Syst. 32 (2019)

  71. Yang, M., Milzarek, A., Wen, Z., Zhang, T.: A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization. Math. Program. 1–47 (2021)

  72. Yu, P., Li, G., Pong, T.K.: Kurdyka-Łojasiewicz exponent via inf-projection. Found. Comput. Math. 1–47 (2021)

  73. Zhang, H., Dai, Y.H., Guo, L., Peng, W.: Proximal-like incremental aggregated gradient method with linear convergence under Bregman distance growth conditions. Math. Oper. Res. 46(1), 61–81 (2021)

  74. Zhang, J., Liu, H., So, A.M.C., Ling, Q.: Variance-reduced stochastic quasi-Newton methods for decentralized learning: Part I (2022)

Author information

Corresponding author

Correspondence to Pourya Behmandpoor.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

P. Behmandpoor and M. Moonen acknowledge the research work carried out at the ESAT Laboratory of KU Leuven, in the frame of Research Project FWO nr. G0C0623N ’User-centric distributed signal processing algorithms for next generation cell-free massive MIMO based wireless communication networks’ and Fonds de la Recherche Scientifique—FNRS and Fonds voor Wetenschappelijk Onderzoek— Vlaanderen EOS Project no 30452698 ’(MUSE-WINET) MUlti-SErvice WIreless NETworks’. The scientific responsibility is assumed by its authors. The work of P. Latafat was supported by the Research Foundation Flanders (FWO) grants 1196820N and 12Y7622N. The work of P. Patrinos was supported by the Research Foundation Flanders (FWO) research projects G0A0920N, G086518N, G086318N, and G081222N; Research Council KU Leuven C1 project No. C14/18/068; Fonds de la Recherche Scientifique—FNRS and the Fonds Wetenschappelijk Onderzoek—Vlaanderen under EOS project 30468160 (SeLMA); European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 953348. The work of A. Themelis was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI grant JP21K17710.

Appendices

A Preliminaries

Fact A.1

(basic properties [18, 50]) The following hold for a dgf \(H:\mathbb {R}^n\rightarrow \mathbb {R}\), \(x,y,z\in \mathbb {R}^n\):

  1. (i)

    (three-point identity) \( {{\,\textrm{D}\,}}_H(x,z)={{\,\textrm{D}\,}}_H(x,y)+{{\,\textrm{D}\,}}_H(y,z)+\langle {x-y}, {\nabla H(y)}{-\nabla H(z)}\rangle . \) [18, Lem. 3.1].

For any convex set \(\mathcal {U}\subseteq \mathbb {R}^n\) and \(u,v\in \mathcal {U}\) the following hold [50, Thm. 2.1.5, 2.1.10]:

  1. (ii)

    If \(H\) is \(\mu _{H,\mathcal {U}}\)-strongly convex on \(\mathcal {U}\), then \( \frac{\mu _{H,\mathcal {U}}}{2}\Vert v-u\Vert ^2 \le {{\,\textrm{D}\,}}_{H}(v,u) \le \frac{1}{2\mu _{H,\mathcal {U}}}\Vert \nabla H(v)-\nabla H(u)\Vert ^2 \).

  2. (iii)

    If \(\nabla H\) is \(\ell _{H,\mathcal {U}}\)-Lipschitz on \(\mathcal {U}\), then \( \frac{1}{2\ell _{H,\mathcal {U}}}\Vert \nabla H(v)-\nabla H(u)\Vert ^2 \le {{\,\textrm{D}\,}}_{H}(v,u) \le \frac{\ell _{H,\mathcal {U}}}{2}\Vert v-u\Vert ^2 \).
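
For intuition, these relations are easy to verify numerically. The following minimal sketch uses the dgf \(H(x)=\frac{1}{2}\Vert x\Vert ^2+\frac{1}{4}\Vert x\Vert ^4\), chosen purely for illustration (it is 1-strongly convex on \(\mathbb {R}^n\)), and checks the three-point identity (i) and the two-sided bound in (ii) at random points.

```python
import numpy as np

# Illustrative dgf (not prescribed by Fact A.1): H(x) = 1/2||x||^2 + 1/4||x||^4,
# with gradient grad_H(x) = (1 + ||x||^2) x; it is 1-strongly convex on R^n.
def H(x):
    s = x @ x
    return 0.5 * s + 0.25 * s ** 2

def grad_H(x):
    return (1.0 + x @ x) * x

def D(x, y):
    """Bregman distance D_H(x, y) = H(x) - H(y) - <grad_H(y), x - y>."""
    return H(x) - H(y) - grad_H(y) @ (x - y)

rng = np.random.default_rng(0)
x, y, z = rng.standard_normal((3, 5))

# Fact A.1(i): three-point identity
lhs = D(x, z)
rhs = D(x, y) + D(y, z) + (x - y) @ (grad_H(y) - grad_H(z))
print(abs(lhs - rhs))                     # zero up to round-off

# Fact A.1(ii) with mu = 1:  mu/2 ||x-y||^2 <= D_H(x,y) <= 1/(2 mu) ||grad_H(x)-grad_H(y)||^2
print(0.5 * np.linalg.norm(x - y) ** 2 <= D(x, y) <=
      0.5 * np.linalg.norm(grad_H(x) - grad_H(y)) ** 2)
```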

In the following, some properties of the Bregman Moreau envelope are highlighted. The interested reader is referred to [1] and [37] for proofs and further properties.

Fact A.2

(Basic properties of \(\phi ^{H}\) and \({{\,\textrm{prox}\,}}_\phi ^{H}\), [1, 37]) Let \(H:\mathbb {R}^n\rightarrow \mathbb {R}\) denote a dgf (cf. Definition 2.1), and let \(\phi :\mathbb {R}^n\rightarrow \overline{\mathbb {R}}\) be a proper, lsc, and lower bounded function. Then, the following hold:

  1. (i)

    \({{\,\textrm{prox}\,}}_\phi ^{H}\) is locally bounded, compact-valued, and outer semicontinuous;

  2. (ii)

    \(\phi ^{H}\) is finite-valued and continuous; it is locally Lipschitz if so is \(\nabla H\);

  3. (iii)

    \(\phi ^{H}(z) = \phi (v) + {{\,\textrm{D}\,}}_{H}(v, z) \le \phi (y) + {{\,\textrm{D}\,}}_{H}(y, z)\) for any \(y, z\in \mathbb {R}^n\) and \(v \in {{\,\textrm{prox}\,}}_\phi ^{H}(z)\). Hence, \(\phi ^{H}(z) \le \phi (z)\);

  4. (iv)

    \(\inf \phi = \inf \phi ^{H}\) and \({{\,\textrm{argmin}\,}}\phi ^{H}= {{\,\textrm{argmin}\,}}\phi \);

  5. (v)

    \(\phi ^{H}\) is level-bounded iff so is \(\phi \).
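
These properties can also be illustrated by brute force. The sketch below makes purely illustrative choices (Euclidean dgf \(H=\frac{1}{2}|\cdot |^2\) and the nonconvex \(\phi (w)=(w^2-1)^2\)), evaluates \({{\,\textrm{prox}\,}}_\phi ^{H}\) and \(\phi ^{H}\) by grid search, and checks items (iii) and (iv).

```python
import numpy as np

# Illustrative 1D choices (not from the paper): Euclidean dgf and a nonconvex phi.
phi  = lambda w: (w ** 2 - 1.0) ** 2          # proper, lsc, lower bounded
breg = lambda w, z: 0.5 * (w - z) ** 2        # D_H(w, z) for H = 1/2|.|^2
grid = np.linspace(-3.0, 3.0, 60001)          # brute-force search grid

def prox_env(z):
    vals = phi(grid) + breg(grid, z)
    j = int(np.argmin(vals))
    return grid[j], vals[j]                   # a minimizer v and phi^H(z)

for z in (-2.0, 0.3, 1.7):
    v, env = prox_env(z)
    # Fact A.2(iii): phi^H(z) = phi(v) + D_H(v, z) <= phi(z) (up to grid error)
    assert abs(env - (phi(v) + breg(v, z))) < 1e-12
    assert env <= phi(z) + 1e-9

# Fact A.2(iv): the infima coincide (both are 0, attained near w = +/-1)
env_vals = np.array([prox_env(z)[1] for z in grid[::60]])
print(phi(grid).min(), env_vals.min())        # both close to 0
```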

The following fact studies sufficient conditions for Lipschitz continuity of the Bregman proximal mapping and continuity of the Moreau envelope, both of which are crucial to the theory developed in Theorems 4.7 and 4.12.

Fact A.3

([39, Lem. A.2]) Let \(\mathcal {V}_i\subseteq \mathbb {R}^n\) be nonempty and convex, \(i\in [N]\), and let \(\mathcal {V}\mathrel {:=}\mathcal {V}_1\times \cdots \times \mathcal {V}_N\). In addition to Assumption 1, suppose that \(g\) is convex, and that \(h_i\), \(i\in [N]\), is \(\ell _{h_i}\)-smooth and \(\mu _{h_i}\)-strongly convex on \(\mathcal {V}_i\). Then, the following hold for the function \(\hat{H}\) as in (4.2) with \(\gamma _i\in (0,\nicefrac {N}{L_{f_i}})\), \(i\in [N]\):

  1. (i)

    \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) is \({\bar{L}}\)-Lipschitz continuous on \(\mathcal {V}\) for some constant \({\bar{L}}\ge 0\).

If in addition \(f_i\) and \(h_i\) are twice continuously differentiable on \(\mathcal {V}_i\), \(i\in [N]\), then

  1. (ii)

    \(\varPhi ^{\hat{H}}\) is continuously differentiable on \(\mathcal {V}\) with \(\nabla \varPhi ^{\hat{H}}=\nabla ^2\hat{H}\circ ({{\,\textrm{id}\,}}-{{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}})\).
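
Item (ii) can be sanity-checked with finite differences. The sketch below uses a one-dimensional Euclidean stand-in (\(H(x)=\frac{1}{2}x^2\), so \(\nabla ^2H\equiv 1\), and \(\varPhi (x)=|x|\), whose proximal mapping is soft-thresholding and whose envelope is the Huber function); these choices are for illustration only and are not the composite \(\hat{H}\) and \(\varPhi \) of (4.1)–(4.2).

```python
import numpy as np

# 1D Euclidean stand-in for Fact A.3(ii): H(x) = x^2/2 (so Hess H = 1) and
# Phi(x) = |x|, whose prox is soft-thresholding and whose envelope is Huber.
prox = lambda z: np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)
env  = lambda z: np.abs(prox(z)) + 0.5 * (prox(z) - z) ** 2

# Claimed gradient: grad Phi^H = Hess H * (id - prox), here simply z - prox(z).
z  = np.linspace(-3.0, 3.0, 13)
fd = (env(z + 1e-6) - env(z - 1e-6)) / 2e-6     # central finite differences
print(np.max(np.abs(fd - (z - prox(z)))))        # small finite-difference error
```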

The following fact establishes the equivalence between problems (1.1) and (4.1).

Fact A.4

([39, Lem. A.1]) Let the functions \(\varphi \) and \(\varPhi \) be as in (1.1) and (4.1), respectively. Then,

  1. (i)

    \(\partial \varPhi (\varvec{x}) = \{\varvec{v} = (v_1,\dots ,v_N) \mid \sum _iv_i \in \partial \varphi (x)\}\) if \(\varvec{x}=(x,\dots ,x) \in \varDelta \), and is empty otherwise.

  2. (ii)

    \(\varPhi \) has the KL property at \(\varvec{x} = (x,\dots ,x)\) iff so does \(\varphi \) at x. In this case, the desingularizing functions are the same up to a positive scaling.

B Omitted lemmas

Lemma B.1

Suppose that Assumptions 1 and 2 hold and that \(\varphi \) is level bounded. Consider the sequence generated by Algorithm 1. Then, for every \(\ell \in [N]\) there exists \(c_\ell >0\) such that

$$\begin{aligned} \Vert \tilde{z}^{{k}}_{{\ell }} - u^k\Vert \le c_\ell \Vert \tilde{z}^{{k}}_{{1}} - u^k\Vert . \end{aligned}$$
(B.1)

Proof

By level boundedness of \(\varphi \) and Theorem 4.7, \((\varvec{u}^k)_{k\in \mathbb {N}}\), \((\varvec{z}^k)_{k\in \mathbb {N}}\), \((\tilde{z}^{{k}}_{{i}})_{k\in \mathbb {N}}, i\in [N]\), are contained in a nonempty bounded set \(\varvec{\mathcal {U}}\). By Assumption 2.A2, \(h_i\) is locally strongly convex and locally Lipschitz, which along with Assumption 2.A1 and Fact A.3 implies that \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) is \({\bar{L}}\)-Lipschitz on a convex subset of \(\varvec{\mathcal {U}}\) for some \({\bar{L}} > 0\). Without loss of generality, and for the sake of simplicity, we assume the cyclic sweeping rule in the incremental loop, i.e., \(i^\ell =\ell \); the proof below is easily adapted to the case of sweeping without replacement. Arguing by induction, for \(\ell =1\), (B.1) holds trivially. Suppose that the claim holds for some \(\ell \ge 1\). Then, by the triangle inequality and the definition of \(\tilde{\varvec{z}}^{k}_{\ell }\) in step 2.7 of Algorithm 2

establishing (B.1). \(\square \)

Lemma B.2

In addition to the assumptions in Lemma B.1, suppose that the directions \(d^k\) in step 1.4 satisfy \(\Vert d^k\Vert \le D\Vert z^k-v^k\Vert \) for some \(D\ge 0\). Then, \(\Vert z^{k+1}- z^k\Vert \le C\Vert \varvec{z}^k - \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}\Vert \) holds for some positive C.

Proof

By the same reasoning as in Lemma B.1, \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) is \({\bar{L}}\)-Lipschitz continuous on a bounded convex set containing the iterates \((\varvec{u}^k)_{k\in \mathbb {N}}\), \((\varvec{z}^k)_{k\in \mathbb {N}}\), \((\tilde{z}^{{k}}_{{i}})_{k\in \mathbb {N}}, i\in [N]\). It follows from the assumption on \(\Vert d^k\Vert \) and step 2.4.a of Algorithm 2 that

$$\begin{aligned} \Vert \varvec{z}^k - \varvec{u}^k\Vert &\le (1-\tau _k)\Vert \varvec{z}^k - \varvec{v}^k\Vert + \tau _k \Vert \varvec{d}^k\Vert \\ &\le (1-\tau _k + \tau _k D) \Vert \varvec{z}^k - \varvec{v}^k\Vert \\ &\le \eta _1 \Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert , \end{aligned}$$
(B.2)

where \(\eta _1 = {\bar{L}}(1-\tau _k + \tau _k D)\) and the Lipschitz continuity of the proximal mapping was used in the last inequality. Further using the triangle inequality yields

$$\begin{aligned} \Vert \varvec{u}^k - \bar{{{\varvec{z}}}}_{{N}}^{k-1}\Vert &\le \Vert \varvec{u}^k - \varvec{z}^k\Vert + \Vert \varvec{z}^k - \bar{{{\varvec{z}}}}_{{N}}^{k-1}\Vert {\mathop {\le }\limits ^{{(B.2)}}} \left( \eta _1 + 1\right) \Vert \varvec{z}^k - \bar{{{\varvec{z}}}}_{{N}}^{k-1} \Vert , \quad \text{ and } \\ \Vert \tilde{z}_{{1}}^{k} - u^k\Vert &= \tfrac{1}{\sqrt{N}}\Vert \tilde{\varvec{z}}_{1}^{k} - \varvec{u}^k\Vert \le \tfrac{1}{\sqrt{N}}\Vert \tilde{\varvec{z}}_{1}^{k} - \varvec{z}^k\Vert + \tfrac{1}{\sqrt{N}} \Vert \varvec{z}^k - \varvec{u}^k\Vert \\ &\le \tfrac{\bar{L}}{\sqrt{N}}\Vert \varvec{u}^k - \bar{{{\varvec{z}}}}_{{N}}^{k-1}\Vert + \tfrac{1}{\sqrt{N}} \Vert \varvec{z}^k - \varvec{u}^k\Vert \le \tfrac{1}{\sqrt{N}}\big ((\bar{L}+1)\eta _1 + \bar{L}\big )\Vert \varvec{z}^k - \bar{{{\varvec{z}}}}_{{N}}^{k-1}\Vert , \end{aligned}$$
(B.3)

where the second inequality in the latter chain uses the Lipschitz continuity of \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) and Algorithm 2, and the last one follows from the former chain together with (B.2).

Using this along with the triangle inequality yields

$$\begin{aligned} \Vert \bar{{{\varvec{z}}}}^{{k}}_{{N}} - \varvec{u}^k\Vert \le \sum _{\ell =1}^N \Vert \tilde{z}^{{k}}_{{\ell }} - u^k\Vert {\mathop {\le }\limits ^{{(B.1)}}} \sum _{\ell =1}^N c_\ell \Vert \tilde{z}^{{k}}_{{1}} - u^k\Vert \le \eta _2\Vert \varvec{z}^k - \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}\Vert , \end{aligned}$$

where \(\eta _2 = \sum _{\ell =1}^N \tfrac{c_\ell }{\sqrt{N}}\big ((\bar{L}+1)\eta _1 + {\bar{L}}\big )\). This inequality combined with (B.3) yields

$$\begin{aligned} \Vert z^{k+1}- z^{k}\Vert ={}&\frac{1}{\sqrt{N}}\Vert {{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k}}_{{N}}) - {{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\Vert \le \frac{{\bar{L}}}{\sqrt{N}} \Vert \bar{{{\varvec{z}}}}^{{k}}_{{N}}- \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}\Vert \\ \le {}&\frac{{\bar{L}}}{\sqrt{N}} \Vert \bar{{{\varvec{z}}}}^{{k}}_{{N}}- \varvec{u}^k\Vert + \frac{\bar{L}}{\sqrt{N}} \Vert \varvec{u}^k - \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}\Vert \\&\le \frac{{\bar{L}}}{\sqrt{N}} (\eta _1 + \eta _2 + 1)\Vert \varvec{z}^k - \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}\Vert . \end{aligned}$$

The claimed inequality follows from the Lipschitz continuity of \({{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}\) and the inclusion in step 2.1 of Algorithm 2. \(\square \)

C Omitted proofs

1.1 Proof of Theorem 4.16

By level boundedness of \(\varphi \) and Theorem 4.7, \((\varvec{u}^k)_{k\in \mathbb {N}}\), \((\varvec{z}^k)_{k\in \mathbb {N}}\), \((\tilde{z}^{{k}}_{{i}})_{k\in \mathbb {N}}\) are contained in a nonempty convex bounded set \(\varvec{\mathcal {U}}\), on which, owing to Assumption 2.A2, \(h_i\) and consequently \(\hat{H}\) are strongly convex. It then follows from Fact A.1(ii), Theorem 4.7(ii), and Lemma B.2 that \( \Vert z^{k+1} - z^k\Vert \rightarrow 0. \) Therefore, the set of limit points of \((z^k)_{k\in \mathbb {N}}\) is nonempty, compact, and connected [11, Rem. 5]. By Theorems 4.7(iv) and 4.7(v), the limit points are stationary for \(\varphi \), and \(\varPhi ^{\hat{H}}(\varvec{z}^k) = \mathcal {L}(v^k, z^k) \rightarrow \varphi _\star \). In the trivial case \(\varPhi ^{\hat{H}}(\varvec{z}^k) = \mathcal {L}(v^k, z^k) = \varphi _\star \) for some k, the claims follow from Theorem 4.7. Assume now that \(\varPhi ^{\hat{H}}(\varvec{z}^k)> \varphi _\star \) for all \(k\in \mathbb {N}\). The KL property for \(\varPhi \) is implied by that of \(\varphi \) due to Fact A.4, with desingularizing function \(\psi (s)=\rho s^{1-\theta }\) and exponent \(\theta \in (0,1)\). Let \(\varOmega \) denote the set of limit points of \((\varvec{z}^k=(z^k,\ldots , z^k))_{k\in \mathbb {N}}\). Since \(\hat{H}\) is strongly convex, [72, Lem. 5.1] can be invoked to infer that the function \(\mathcal {M}_{\hat{H}}(\varvec{w}, \varvec{x}) = \varPhi (\varvec{w}) + {{\,\textrm{D}\,}}_{\hat{H}}(\varvec{w}, \varvec{x})\) also has the KL property with exponent \(\nu =\max \{\theta ,\frac{1}{2}\}\) at every point \((\varvec{z}^\star , \varvec{z}^\star )\) in the compact set \(\varOmega \times \varOmega \). Moreover, by (4.2), \(\mathcal {M}_{\hat{H}}(\varvec{z}^\star ,\varvec{z}^\star ) = \varPhi (\varvec{z}^\star )= \varphi _\star \), where Theorem 4.7(iv) was used in the last equality. Recall that \(\varvec{z}^{k} \in {{\,\textrm{prox}\,}}_\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\) as in step 2.1 of Algorithm 2. Therefore, \(\big (0, \nabla ^2 \hat{H}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^{k})\big ) \in \partial \mathcal {M}_{\hat{H}}(\varvec{z}^{k}, \bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\), resulting in

$$\begin{aligned} {{\,\textrm{dist}\,}}(0,\partial \mathcal {M}_{\hat{H}}(\varvec{z}^{k}, \bar{{{\varvec{z}}}}^{{k-1}}_{{N}})) \le \Vert \nabla ^2 \hat{H}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\Vert \Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert \le c \Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert \nonumber \\ \end{aligned}$$
(C.1)

where \(c = \sup _k\Vert \nabla ^2 \hat{H}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\Vert > 0\) is finite due to \(\tilde{z}^{{k}}_{{N}}\) being bounded (cf. Theorem 4.7(vi)) and continuity of \(\nabla ^2 \hat{H}\). Combining (4.18) with (C.1), since \(\mathcal {M}_{\hat{H}}(\varvec{z}^k, \bar{{{\varvec{z}}}}^{{k-1}}_{{N}}) = \varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}}) \rightarrow \varphi _\star \) from above, and since \((\varvec{z}^k, \bar{{{\varvec{z}}}}^{{k-1}}_{{N}})_{k\in \mathbb {N}}\) is bounded and accumulates on \(\varOmega \times \varOmega \), up to discarding finitely many iterates the following holds

$$\begin{aligned} \psi '\big (\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})- \varphi _\star \big ) &= \psi '\big (\mathcal {M}_{\hat{H}}(\varvec{z}^{k}, \bar{{{\varvec{z}}}}^{{k-1}}_{{N}})- \mathcal {M}_{\hat{H}}(\varvec{z}^\star , \varvec{z}^\star )\big ) \\ &\ge \frac{1}{c\Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert }, \end{aligned}$$
(C.2)

where \(\psi (s)=\rho s^{1-\nu }\) is a desingularizing function for \(\mathcal {M}_{\hat{H}}\) on \(\varOmega \times \varOmega \). Let us define

$$\begin{aligned} \varDelta _k \mathrel {:=}\psi (\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})-\varphi _\star ) &= \rho [\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})-\varphi _\star ]^{1-\nu } \\ &\le \rho [\rho (1-\nu )c\Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert ]^{\frac{1-\nu }{\nu }}. \end{aligned}$$
(C.3)

Then, \( \varDelta _k^{\frac{\nu }{1-\nu }} \le c \rho ^{\frac{1}{1-\nu }}(1-\nu )\Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert . \) Concavity of \(\psi \) also implies

$$\begin{aligned} \varDelta _k - \varDelta _{k+1} &\ge \psi '(\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})-\varphi _\star )(\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})-\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k}}_{{N}})) \\ &{\mathop {\ge }\limits ^{{(C.2)}}} \frac{\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}}) - \varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k}}_{{N}})}{c \Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert }. \end{aligned}$$
(C.4)

On the other hand by (4.11) and (4.10)

$$\begin{aligned} \varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k+1}}_{{N}}) - \varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k}}_{{N}}) \le - {{\,\textrm{D}\,}}_{\hat{H}}(\varvec{z}^{k+1}, \bar{{{\varvec{z}}}}^{{k}}_{{N}}) \le -\frac{\mu _{\hat{H}}}{2}\Vert \varvec{z}^{k+1} - \bar{{{\varvec{z}}}}^{{k}}_{{N}}\Vert ^2, \end{aligned}$$
(C.5)

where Fact A.1(ii) was used and \(\mu _{\hat{H}}\) denotes the strong convexity modulus of \(\hat{H}\). Combining (C.4) and (C.5),

$$\begin{aligned} \varDelta _k - \varDelta _{k+1} \ge \eta \Vert \bar{{{\varvec{z}}}}^{{k-1}}_{{N}} - \varvec{z}^k\Vert \ge \frac{\eta }{C} \Vert \varvec{z}^{k+1} - \varvec{z}^k\Vert , \end{aligned}$$
(C.6)

for some constant \(\eta > 0\), where the last inequality follows from Lemma B.2. Hence, \(\sum _{k\in \mathbb {N}}\Vert \varvec{z}^{k+1} - \varvec{z}^k\Vert < \infty \), i.e., \((\varvec{z}^k)_{k\in \mathbb {N}}\) has finite length and is thus convergent. It then follows from Theorem 4.7(v) that \((\varvec{z}^k)_{k\in \mathbb {N}}\) converges to a stationary point of \(\varphi \). Combining (C.3) and (C.6), we have

$$\begin{aligned} \varDelta _{k+1} \le \varDelta _k - \alpha \varDelta _k^{{\frac{\nu }{1-\nu }}} \end{aligned}$$
(C.7)

for some \(\alpha > 0\). Hence, if \(\nu =\frac{1}{2}\), i.e., \(\theta \in (0,\frac{1}{2}]\) for \(\varPhi \), then (C.7) reads \(\varDelta _{k+1} \le (1 - \alpha ) \varDelta _k\); since \(\alpha > 0\) and \(\frac{\varDelta _{k+1}}{\varDelta _k} > 0\), necessarily \(1-\alpha \in (0,1)\), and \(\varDelta _k\) converges Q-linearly to zero. By (C.3) we then conclude that \((\varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}}))_{k\in \mathbb {N}}\) converges Q-linearly, and by Fact A.2(iii), which yields \(\varphi (z^k)=\varPhi (\varvec{z}^k) \le \varPhi ^{\hat{H}}(\bar{{{\varvec{z}}}}^{{k-1}}_{{N}})\), that \((\varphi (z^k))_{k\in \mathbb {N}}\) converges R-linearly. Moreover, inequality (C.6) implies that \((\Vert z^{k+1} - z^k\Vert )_{k\in \mathbb {N}}\) converges R-linearly, and thus so does \((z^k)_{k\in \mathbb {N}}\).
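
The role of the KL exponent in (C.7) can also be seen by iterating the recursion directly; the values of \(\alpha \), \(\nu \), and \(\varDelta _0\) in the following sketch are illustrative and not taken from the analysis.

```python
# Iterate Delta_{k+1} = Delta_k - alpha * Delta_k**(nu/(1-nu)), cf. (C.7).
# For nu = 1/2 the exponent equals 1 and Delta_k decays geometrically
# (Q-linearly); for nu > 1/2 the exponent exceeds 1 and, once Delta_k is
# small, the per-step decrease is much weaker (sublinear decay).
def iterate(nu, alpha=0.1, delta0=1.0, iters=50):
    d = delta0
    for _ in range(iters):
        d = max(d - alpha * d ** (nu / (1.0 - nu)), 0.0)
    return d

print(iterate(0.50))   # (1 - alpha)**50 * delta0, roughly 5e-3
print(iterate(0.75))   # exponent 3: decays only at a polynomial rate
```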

D CPU time

The performance results presented in Sect. 5 are also reported versus CPU time. According to the numerical comparisons in Figs. 4, 5, and 6, the proposed algorithm features relatively cheap iterations and a per-epoch computational cost comparable to that of the other algorithms.

Fig. 4

Performance of different algorithms versus CPU time on the phase retrieval problem (5.2) for 550 epochs on a digit 6 image with \(N=1280\), \(n=256\)

Fig. 5

Performance of different algorithms versus CPU time on the lasso problem of (5.3) for 50 epochs. Synthetic dataset (top left) with \(N=10000\), \(n=400\), synthetic dataset (top center) with \(N=300\), \(n=600\), mg (top right) with \(N=1385\), \(n=6\), triazines (bottom left) with \(N=186\), \(n=60\), housing (bottom center) with \(N=506\), \(n=13\), and cadata (bottom right) with \(N=20640\), \(n=8\)

Fig. 6

Performance of different algorithms versus CPU time on the NN-PCA problem of (5.4) for 500 epochs. MNIST (left) with \(N=60000\), \(n=784\), covtype (left center) with \(N=581012\), \(n=54\), a9a (right center) with \(N=32561\), \(n=123\), and aloi (right) with \(N=108000\), \(n=128\)

E Algorithm variants

1.1 E.1 Adaptive variant

In this section, the implementation in Table 1 is discussed further. For the first iterate, i.e., \(k=0\), the vectors \(\tilde{z}^{{-1}}_{{i}}\) are initialized to \(z^{\textrm{init}}\) for all \(i\in [N]\). Also, note that the linesearch in step 2.5.d of Table 1 backtracks to step 2.3.a, rather than to step 2.5.c. Performing the linesearches in this intertwined fashion is observed to result in the acceptance of good directions and a reduction in the overall computational cost [20, 52]; we refer the reader to [20] for the theoretical justification of this procedure. Note that in Algorithm 3, i.e., in the Euclidean case, the same backtrackings can be used with the dgfs \(h_i = \frac{1}{2} \Vert \cdot \Vert ^2\). The backtracking linesearches in the first block of Table 1 do not require storing \(\tilde{z}^{{k}}_{{i}}\) and can be performed efficiently: in step 2.1.b, \(\sum _{i=1}^N p_i(\cdot ,\tilde{z}^{{k}}_{{i}})\) may be evaluated by storing the scalars \(\sum _{i=1}^N f_i(\tilde{z}^{{k}}_{{i}})\) and \(\sum _{i=1}^N \langle \nabla f_i(\tilde{z}^{{k}}_{{i}}), \tilde{z}^{{k}}_{{i}} \rangle \) and the single vector \(\sum _{i=1}^N \nabla f_i(\tilde{z}^{{k}}_{{i}}) \in \mathbb {R}^n\) while performing step 1.10 of the algorithm. Similar tricks apply to the computation of the Bregman distances, to the functions \(p_i\) in the other backtracking linesearches of Table 1, and to updating the vectors \(s^k, {\bar{s}}^k\), and \({\tilde{s}}^k\).
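
A minimal sketch of this bookkeeping is given below. It assumes, for illustration only, that the relevant part of \(p_i(\cdot ,\tilde{z}^{{k}}_{{i}})\) is the linearization \(f_i(\tilde{z}^{{k}}_{{i}})+\langle \nabla f_i(\tilde{z}^{{k}}_{{i}}),\cdot -\tilde{z}^{{k}}_{{i}}\rangle \); the models in Table 1 additionally involve Bregman terms, which can be handled analogously. All function and variable names are hypothetical.

```python
import numpy as np

# Running aggregates maintained while sweeping over the components (cf. step 1.10):
#   S_f    = sum_i f_i(z_i),  S_ip = sum_i <grad f_i(z_i), z_i>,
#   s_grad = sum_i grad f_i(z_i)   (a single n-vector).
# With these, the aggregated linearization at any trial point x costs O(n),
# independently of N, which is what keeps the backtracking linesearches cheap.
def aggregate_linearization(x, S_f, S_ip, s_grad):
    return S_f + s_grad @ x - S_ip

def update_aggregates(S_f, S_ip, s_grad, f_i, grad_f_i, z_old, z_new):
    """Refresh the aggregates when table entry i moves from z_old to z_new."""
    S_f    += f_i(z_new) - f_i(z_old)
    S_ip   += grad_f_i(z_new) @ z_new - grad_f_i(z_old) @ z_old
    s_grad  = s_grad + grad_f_i(z_new) - grad_f_i(z_old)
    return S_f, S_ip, s_grad

# Quick consistency check with quadratic components f_i(w) = 0.5*||A_i w - b_i||^2
rng = np.random.default_rng(0)
N, n = 5, 4
A, b = rng.standard_normal((N, 3, n)), rng.standard_normal((N, 3))
f  = [lambda w, i=i: 0.5 * np.sum((A[i] @ w - b[i]) ** 2) for i in range(N)]
gf = [lambda w, i=i: A[i].T @ (A[i] @ w - b[i]) for i in range(N)]
z, x = rng.standard_normal((N, n)), rng.standard_normal(n)

S_f    = sum(f[i](z[i]) for i in range(N))
S_ip   = sum(gf[i](z[i]) @ z[i] for i in range(N))
s_grad = sum(gf[i](z[i]) for i in range(N))

direct = sum(f[i](z[i]) + gf[i](z[i]) @ (x - z[i]) for i in range(N))
print(abs(direct - aggregate_linearization(x, S_f, S_ip, s_grad)))  # ~1e-14
```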

1.2 E.2 Euclidean variant

Algorithm 3

SPIRAL - Euclidean version

This section outlines the Euclidean version of the proposed algorithm, Algorithm 3, for the case where the functions \(f_i\) have Lipschitz continuous gradients with constants \(L_i\). In this case, the distance generating functions are \(h_i=\frac{1}{2}\Vert \cdot \Vert ^2\), and consequently the Bregman distances simplify to \({{\,\textrm{D}\,}}_{h_i}(y,x)=\frac{1}{2}\Vert y-x\Vert ^2\).
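
For concreteness, the following is a stripped-down sketch of an incremental aggregated proximal gradient loop of the Finito/MISO type in this Euclidean setting. It omits the linesearch and the quasi-Newton directions, so it is not Algorithm 3 itself; the problem data, the \(\ell _1\) regularizer, and the step size are illustrative choices.

```python
import numpy as np

# Illustrative Euclidean finite-sum problem (not from the paper):
#   minimize (1/N) sum_i 0.5*(a_i @ x - b_i)^2 + lam*||x||_1.
# A table z_i of per-component iterates is kept; each inner step refreshes one
# entry and updates the aggregate forward point incrementally.
rng = np.random.default_rng(1)
N, n, lam = 200, 20, 0.1
A, b = rng.standard_normal((N, n)), rng.standard_normal(N)

grad_f = lambda i, x: (A[i] @ x - b[i]) * A[i]           # gradient of f_i
soft   = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)
obj    = lambda x: 0.5 * np.mean((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

L     = np.max(np.sum(A ** 2, axis=1))                   # L_i = ||a_i||^2 <= L
gamma = 0.9 / L                                          # conservative step size
z     = np.zeros((N, n))                                 # table of iterates
agg   = np.mean([z[i] - gamma * grad_f(i, z[i]) for i in range(N)], axis=0)

for epoch in range(10):
    for i in rng.permutation(N):                         # shuffled sweep
        x = soft(agg, gamma * lam)                       # proximal step on g
        # replace the i-th table entry; keep agg = mean_j(z_j - gamma*grad_f_j(z_j))
        agg += (x - gamma * grad_f(i, x) - z[i] + gamma * grad_f(i, z[i])) / N
        z[i] = x
    print(epoch, obj(soft(agg, gamma * lam)))            # monitor the objective
```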

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Behmandpoor, P., Latafat, P., Themelis, A. et al. SPIRAL: a superlinearly convergent incremental proximal algorithm for nonconvex finite sum minimization. Comput Optim Appl 88, 71–106 (2024). https://doi.org/10.1007/s10589-023-00550-8
