Abstract
In this paper, we study local convergence of high-order Tensor Methods for solving convex optimization problems with composite objective. We justify local superlinear convergence under the assumption of uniform convexity of the smooth component, having Lipschitz-continuous high-order derivative. The convergence both in function value and in the norm of minimal subgradient is established. Global complexity bounds for the Composite Tensor Method in convex and uniformly convex cases are also discussed. Lastly, we show how local convergence of the methods can be globalized using the inexact proximal iterations.
1 Introduction
Motivation. In Nonlinear Optimization, it seems natural to improve the performance of numerical methods by employing high-order oracles. However, the main obstacle to this approach is the prohibitive complexity of the corresponding Taylor approximations, formed by high-order multidimensional polynomials, which are difficult to store, handle, and minimize. If we go just one step above the commonly used quadratic approximation, we get a multidimensional polynomial of degree three, which is never convex. Consequently, its usefulness for optimization methods is questionable.
However, it was recently shown in [18] that the Taylor polynomials of convex functions have a very interesting structure. It appears that their augmentation by a power of the Euclidean norm with a reasonably large coefficient gives us a global upper convex model of the objective function, which keeps all advantages of the local high-order approximation.
One of the classical and well-known results in Nonlinear Optimization is related to the local quadratic convergence of Newton’s method [13, 19]. Later on, it was generalized to the case of composite optimization problems [14], where the objective is represented as a sum of two convex components: smooth, and possibly nonsmooth but simple. Local superlinear convergence of the Incremental Newton method for finite-sum minimization problems was established in [24].
The study of high-order numerical methods for solving nonlinear equations dates back to the work of Chebyshev in 1838, where scalar methods of order three and four were proposed [2]. Methods of arbitrary order for solving nonlinear equations were studied in [6].
A big step in second-order optimization theory was made in [22], where Cubic regularization of the Newton method with its global complexity estimates was proposed. Additionally, the local superlinear convergence was justified. See also [1] for the local analysis of the Adaptive cubic regularization methods.
Our paper aims to study the local convergence of high-order methods, generalizing the corresponding results from [22] in several ways. We establish local superlinear convergence of the Tensor Method [18] of degree \(p \ge 2\) in the case when the objective is composite and its smooth part is uniformly convex of arbitrary degree q from the interval \(2 \le q < p + 1\) (equivalently, \(p > q - 1\), which is the condition of Corollary 1). For strongly convex functions (\(q=2\)), this gives local convergence of degree p.
Contents. We formulate our problem of interest and define a step of the Regularized Composite Tensor Method in Sect. 2. Then, we state some of its properties, which are required for our analysis.
In Sect. 3, we prove local superlinear convergence of the Tensor Method in function value, and in the norm of minimal subgradient, under the assumption of uniform convexity of the objective.
In Sect. 4, we discuss global behavior of the method and justify sublinear and linear global rates of convergence for convex and uniformly convex cases, respectively.
One application of our developments is provided in Sect. 5. We show how local convergence can be applied for computing an inexact step in proximal methods. A global sublinear rate of convergence for the resulting scheme is also given.
Notation and generalities. In what follows, we denote by \(\mathbb {E}\) a finite-dimensional real vector space, and by \(\mathbb {E}^*\) its dual space, composed of linear functions on \(\mathbb {E}\). For such a function \(s \in \mathbb {E}^*\), we denote by \(\langle s, x \rangle \) its value at \(x \in \mathbb {E}\). Using a self-adjoint positive-definite operator \(B: \mathbb {E}\rightarrow \mathbb {E}^*\) (notation \(B = B^* \succ 0\)), we can endow these spaces with mutually conjugate Euclidean norms:
\[ \Vert x \Vert = \langle B x, x \rangle ^{1/2}, \quad x \in \mathbb {E}, \qquad \Vert g \Vert _{*} = \langle g, B^{-1} g \rangle ^{1/2}, \quad g \in \mathbb {E}^*. \]
For a smooth function \(f: \mathrm{dom} \,f \rightarrow \mathbb {R}\) with convex and open domain \(\mathrm{dom} \,f \subseteq \mathbb {E}\), denote by \(\nabla f(x)\) its gradient, and by \(\nabla ^2 f(x)\) its Hessian evaluated at point \(x \in \mathrm{dom} \,f \subseteq \mathbb {E}\). Note that
\[ \nabla f(x) \in \mathbb {E}^*, \qquad \nabla ^2 f(x) h \in \mathbb {E}^*, \quad h \in \mathbb {E}. \]
For non-differentiable convex function \(f(\cdot )\), we denote by \(\partial f(x) \subset \mathbb {E}^*\) its subdifferential at the point \(x \in \mathrm{dom} \,f\).
In what follows, we often work with directional derivatives. For \(p \ge 1\), denote by
\[ D^p f(x)[h_1, \dots , h_p] \]
the directional derivative of function f at x along directions \(h_i \in \mathbb {E}\), \(i = 1, \dots , p\). If all directions \(h_1, \dots , h_p\) are the same, we apply a simpler notation
\[ D^p f(x)[h]^p, \quad h \in \mathbb {E}. \]
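Note that \(D^p f(x)[ \cdot ]\) is a symmetric p-linear form. Its norm is defined in the standard way:
\[ \Vert D^p f(x) \Vert = \max _{h_1, \dots , h_p} \Bigl\{ \bigl| D^p f(x)[h_1, \dots , h_p] \bigr| : \; \Vert h_i \Vert \le 1, \; i = 1, \dots , p \Bigr\} = \max _{h} \Bigl\{ \bigl| D^p f(x)[h]^p \bigr| : \; \Vert h \Vert \le 1 \Bigr\} \]
(for the last equation see, for example, Appendix 1 in [21]). Similarly, we define
\[ \Vert D^p f(x) - D^p f(y) \Vert = \max _{h} \Bigl\{ \bigl| D^p f(x)[h]^p - D^p f(y)[h]^p \bigr| : \; \Vert h \Vert \le 1 \Bigr\}. \]
In particular, for any \(x \in \mathrm{dom} \,f\) and \(h_1, h_2 \in \mathbb {E}\), we have
\[ D f(x)[h_1] = \langle \nabla f(x), h_1 \rangle , \qquad D^2 f(x)[h_1, h_2] = \langle \nabla ^2 f(x) h_1, h_2 \rangle . \]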
Thus, for the Hessian, our definition corresponds to the spectral norm of the self-adjoint linear operator (the maximal absolute value of its eigenvalues, computed with respect to \(B \succ 0\)).
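Finally, the Taylor approximation of function \(f(\cdot )\) at \(x \in \mathrm{dom} \,f\) is defined as follows:
\[ \Omega _{x,p}(y) \; {\mathop {=}\limits ^{\mathrm {def}}}\; f(x) + \sum _{k=1}^{p} \frac{1}{k!} D^k f(x)[y - x]^k, \quad y \in \mathbb {E}. \]
Consequently, for all \(y \in \mathbb {E}\) we have
\[ \nabla \Omega _{x,p}(y) = \sum _{k=1}^{p} \frac{1}{(k-1)!} D^k f(x)[y - x]^{k-1}, \qquad \nabla ^2 \Omega _{x,p}(y) = \sum _{k=2}^{p} \frac{1}{(k-2)!} D^k f(x)[y - x]^{k-2}. \]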
2 Main inequalities
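In this paper, we consider the following composite convex minimization problem
\[ \min _{x \in \mathrm{dom} \,h} \Bigl\{ F(x) = f(x) + h(x) \Bigr\}, \qquad (2.1) \]
where \(h: \mathbb {E}\rightarrow \mathbb {R}\cup \{+\infty \}\) is a simple proper closed convex function and \(f \in C^{p,p}(\mathrm{dom} \,h)\) for a certain \(p \ge 2\). In other words, we assume that the pth derivative of function f is Lipschitz continuous:
\[ \Vert D^p f(x) - D^p f(y) \Vert \le L_p \Vert x - y \Vert , \quad x, y \in \mathrm{dom} \,h. \qquad (2.2) \]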
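Assuming that \(L_{p} < +\infty \), by the standard integration arguments we can bound the residual between function value and its Taylor approximation:
\[ | f(y) - \Omega _{x,p}(y) | \le \frac{L_p}{(p+1)!} \Vert y - x \Vert ^{p+1}. \qquad (2.3) \]
Applying the same reasoning to functions \(\langle \nabla f(\cdot ), h \rangle \) and \(\langle \nabla ^2 f(\cdot ) h, h \rangle \) with direction \(h \in \mathbb {E}\) being fixed, we get the following guarantees:
\[ \Vert \nabla f(y) - \nabla \Omega _{x,p}(y) \Vert _{*} \le \frac{L_p}{p!} \Vert y - x \Vert ^{p}, \qquad (2.4) \]
\[ \Vert \nabla ^2 f(y) - \nabla ^2 \Omega _{x,p}(y) \Vert \le \frac{L_p}{(p-1)!} \Vert y - x \Vert ^{p-1}, \qquad (2.5) \]
which are valid for all \(x, y \in \mathrm{dom} \,h\).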
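Let us now define one step of the Regularized Composite Tensor Method (RCTM) of degree \(p \ge 2\):
\[ T \equiv T_H(x) \; {\mathop {=}\limits ^{\mathrm {def}}}\; \arg \min _{y \in \mathbb {E}} \Bigl\{ \Omega _{x,p}(y) + \frac{H}{(p+1)!} \Vert y - x \Vert ^{p+1} + h(y) \Bigr\}. \qquad (2.6) \]
It can be shown that for
\[ H \ge p L_p \qquad (2.7) \]
the auxiliary optimization problem in (2.6) is convex (see Theorem 1 in [18]). This condition is crucial for implementability of our methods and we always assume it to be satisfied.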
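As an illustration (our own instantiation, assuming that the regularizer in (2.6) has the standard form \({H \over (p+1)!} \Vert y - x \Vert ^{p+1}\)), for \(p = 2\) the Taylor model is quadratic and one step of the method reads
\[ T_H(x) = \arg \min _{y \in \mathbb {E}} \Bigl\{ f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2} \langle \nabla ^2 f(x)(y - x), y - x \rangle + \frac{H}{6} \Vert y - x \Vert ^{3} + h(y) \Bigr\}, \]
that is, a composite form of the Cubic Regularization of the Newton method. In this case, the convexity condition on the regularization parameter becomes \(H \ge 2 L_2\).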
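Let us write down the first-order optimality condition for the auxiliary optimization problem in (2.6):
\[ \Bigl\langle \nabla \Omega _{x,p}(T) + \frac{H}{p!} \Vert T - x \Vert ^{p-1} B (T - x), \; y - T \Bigr\rangle + h(y) \ge h(T), \qquad (2.8) \]
for all \(y \in \mathrm{dom} \,h\). In other words, for vector
\[ h'(T) \; {\mathop {=}\limits ^{\mathrm {def}}}\; - \Bigl( \nabla \Omega _{x,p}(T) + \frac{H}{p!} \Vert T - x \Vert ^{p-1} B (T - x) \Bigr) \qquad (2.9) \]
we have \(h'(T) {\mathop {\in }\limits ^{(2.8)}} \partial h(T)\). This fact explains our notation
\[ F'(T) \; {\mathop {=}\limits ^{\mathrm {def}}}\; \nabla f(T) + h'(T) \in \partial F(T). \qquad (2.10) \]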
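Let us present some properties of the point \(T = T_H(x)\). First of all, we need some bounds for the norm of vector \(F'(T)\). Note that
\[ \Bigl\Vert F'(T) + \frac{H}{p!} \Vert T - x \Vert ^{p-1} B (T - x) \Bigr\Vert _{*} = \Vert \nabla f(T) - \nabla \Omega _{x,p}(T) \Vert _{*} \le \frac{L_p}{p!} \Vert T - x \Vert ^{p}. \qquad (2.11) \]
Consequently,
\[ \Vert F'(T) \Vert _{*} \le \frac{L_p + H}{p!} \Vert T - x \Vert ^{p}. \qquad (2.12) \]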
Secondly, we use the following lemma.
Lemma 1
Let \(\beta > 1\) and \(H = \beta L_p\). Then
In particular, if \(\beta = p\), then
Proof
Denote \(r = \Vert T - x \Vert \), \(h = {H \over p!}\), and \(l = {L_p \over p!}\). Then inequality (2.11) can be written as follows:
This means that
Denote
Then inequality (2.15) can be rewritten as follows:
Taking into account that \(1+\alpha = {2p \over p-1}\) and \({\alpha \over 1 + \alpha } = {p + 1 \over 2p}\), and using the actual meaning of a, b, and \(\alpha \), we get
It remains to note that
\(\square \)
3 Local convergence
The main goal of this paper is to analyze the local behavior of the Regularized Composite Tensor Method (RCTM):
\[ x_0 \in \mathrm{dom} \,h, \qquad x_{k+1} = T_H(x_k), \quad k \ge 0, \qquad (3.1) \]
as applied to the problem (2.1). In order to prove local superlinear convergence of this scheme, we need one more assumption.
Assumption 1
The objective in problem (2.1) is uniformly convex of degree \(q \ge 2\). That is, for all \(x, y \in \mathrm{dom} \,h\) and for all \(G_x \in \partial F(x), G_y \in \partial F(y)\), it holds:
\[ \langle G_x - G_y, x - y \rangle \ge \sigma _q \Vert x - y \Vert ^{q} \qquad (3.2) \]
for certain \(\sigma _q > 0\).
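It is well known that this assumption guarantees the uniform convexity of the objective function (see, for example, Lemma 4.2.1 in [19]):
\[ F(y) \ge F(x) + \langle G_x, y - x \rangle + \frac{\sigma _q}{q} \Vert y - x \Vert ^{q}, \quad y \in \mathrm{dom} \,h, \qquad (3.3) \]
where \(G_x\) is an arbitrary subgradient from \(\partial F(x)\). Therefore, minimizing both sides of (3.3) in y, we get
\[ F^{*} \ge \min _{y \in \mathbb {E}} \Bigl\{ F(x) + \langle G_x, y - x \rangle + \frac{\sigma _q}{q} \Vert y - x \Vert ^{q} \Bigr\} = F(x) - \frac{q - 1}{q} \Bigl( \frac{1}{\sigma _q} \Bigr) ^{1 \over q - 1} \Vert G_x \Vert _{*}^{q \over q - 1}. \qquad (3.4) \]
This simple inequality gives us the following local convergence rate for RCTM.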
Theorem 1
For any \(k \ge 0\) we have
Proof
Indeed, for any \(k \ge 0\) we have
And this is exactly inequality (3.5). \(\square \)
Corollary 1
If \(p > q-1\), then method (3.1) has local superlinear rate of convergence for problem (2.1).
Proof
Indeed, in this case \({p \over q-1} > 1\). \(\square \)
For example, if \(q = 2\) (strongly convex function) and \(p=2\) (Cubic Regularization of the Newton Method), then the rate of convergence is quadratic. If \(q=2\), and \(p = 3\), then the local rate of convergence is cubic, etc.
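The quadratic rate for \(p = q = 2\) can be observed numerically. The following is a minimal sketch under our own assumptions, not the authors' implementation: a one-dimensional toy objective \(f(x) = {1 \over 2}(x-1)^2 + {1 \over 3}|x-1|^3\) (so that \(\sigma _2 = 1\), \(L_2 = 2\), \(F^* = 0\)), no composite part (\(h \equiv 0\)), and \(H = 2 L_2\). All function names are hypothetical.

```python
import math

L2 = 2.0       # Lipschitz constant of f'' for the toy objective below
H = 2.0 * L2   # regularization parameter H = p * L_p with p = 2


def f(x: float) -> float:
    """Toy objective (an assumption of this sketch), minimized at x = 1 with f* = 0."""
    u = x - 1.0
    return 0.5 * u * u + abs(u) ** 3 / 3.0


def grad(x: float) -> float:
    u = x - 1.0
    return u + abs(u) * u


def hess(x: float) -> float:
    return 1.0 + 2.0 * abs(x - 1.0)


def cubic_newton_step(x: float) -> float:
    """Minimize g*s + a*s^2/2 + (H/6)*|s|^3 over s and return x + s."""
    g, a = grad(x), hess(x)
    if g == 0.0:
        return x
    # Optimality: g + a*s + (H/2)*|s|*s = 0, so r = |s| solves (H/2) r^2 + a r = |g|.
    # The root is written in a form that avoids cancellation for small |g|.
    r = 2.0 * abs(g) / (a + math.sqrt(a * a + 2.0 * H * abs(g)))
    return x - math.copysign(r, g)


def residuals(x0: float, steps: int) -> list:
    """Function residuals f(x_k) - f* for k = 0, ..., steps."""
    out, x = [], x0
    for _ in range(steps + 1):
        out.append(f(x))
        x = cubic_newton_step(x)
    return out
```

Calling `residuals(3.0, 6)` returns a decreasing sequence in which each residual is bounded by a constant multiple of the square of the previous one, as long as the values stay well above machine precision.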
Let us study now the local convergence of the method (3.1) in terms of the norm of gradient. For any \(x \in \mathrm{dom} \,h\) denote
If \(\partial h(x) = \emptyset \), we set \(\eta (x) = +\infty \).
Theorem 2
For any \(k \ge 0\) we have
Proof
Indeed, in view of inequality (3.2), we have
where \(g_k\) is an arbitrary vector from \(\partial h(x_k)\). Therefore, we conclude that
It remains to use inequality (2.12). \(\square \)
As we can see, the condition for superlinear convergence of the method (3.1) in terms of the norm of the gradient is the same as in Corollary 1: we need to have \({p \over q-1} > 1\), that is \(p > q-1\). Moreover, the local rate of convergence has the same order as that for the residual of the function value.
According to Theorem 1, the region of superlinear convergence of RCTM in terms of the function value is as follows:
Alternatively, by Theorem 2, in terms of the norm of minimal subgradient (3.6), the region of superlinear convergence looks as follows:
Note that these sets can be very different. Indeed, set \(\mathcal {Q}\) is a closed and convex neighborhood of the point \(x^*\). At the same time, the structure of the set \(\mathcal {G}\) can be very complex since in general the function \(\eta (x)\) is discontinuous. Let us look at a simple example where \(h(x) = \text{ Ind}_Q(x)\), the indicator function of a closed convex set Q.
Example 1
Consider the following optimization problem:
with
for some fixed \(\sigma _2, \sigma _3 > 0\) and \(\bar{x} = (0, -2) \in \mathbb {R}^2\). We have
where \(r: \mathbb {R}^2 \rightarrow \mathbb {R}\) is
Note that f is uniformly convex of degree \(q = 2\) with constant \(\sigma _2\), and for \(q = 3\) with constant \(\sigma _3\) (see Lemma 4.2.3 in [19]). Moreover, we have for any \(\nu \in [0, 1]\):
Hence, this function is uniformly convex of any degree \(q \in [2, 3]\). At the same time, the Hessian of f is Lipschitz continuous with constant \(L_2 = 4 \sigma _3\) (see Lemma 4.2.4 in [19]).
Clearly, in this problem \(x^*=(0,-1)\), and it can be written in the composite form (2.1) with
Note that for \(x \in \mathrm{dom} \,h \equiv \{ x: \; \Vert x \Vert \le 1\}\), we have
Therefore, if \(\Vert x \Vert < 1\), then \(\eta (x) = \Vert \nabla f(x) \Vert \ge \sigma _2\). If \(\Vert x \Vert = 1\), then
Thus, in any neighbourhood of \(x^*\), \(\eta (x)\) vanishes only along the boundary of the feasible set. \(\square \)
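The discontinuity of \(\eta (\cdot )\) in this example can be checked numerically. The sketch below rests on several assumptions of ours: the smooth part is taken as \(f(x) = {\sigma _2 \over 2} \Vert x - \bar{x} \Vert ^2 + {2 \sigma _3 \over 3} \Vert x - \bar{x} \Vert ^3\) (a form chosen to match the constants quoted above), the norm is the standard Euclidean one (\(B = I\)), and \(\sigma _2 = \sigma _3 = 1\). The helper names are hypothetical.

```python
import math

SIGMA2, SIGMA3 = 1.0, 1.0   # assumed values of sigma_2 and sigma_3
X_BAR = (0.0, -2.0)


def grad_f(x):
    """Gradient of the assumed f(x) = s2/2 |x - xb|^2 + (2 s3 / 3) |x - xb|^3."""
    d = (x[0] - X_BAR[0], x[1] - X_BAR[1])
    scale = SIGMA2 + 2.0 * SIGMA3 * math.hypot(*d)
    return (scale * d[0], scale * d[1])


def eta(x, tol=1e-12):
    """Norm of the minimal subgradient of f + Ind_Q, with Q the unit ball."""
    norm_x = math.hypot(*x)
    if norm_x > 1.0 + tol:
        return math.inf            # outside dom h the subdifferential is empty
    g = grad_f(x)
    if norm_x < 1.0 - tol:
        return math.hypot(*g)      # interior point: the normal cone is {0}
    # Boundary point: the normal cone is {gamma * x, gamma >= 0};
    # minimize |g + gamma * n| over gamma >= 0, where n = x / |x|.
    n = (x[0] / norm_x, x[1] / norm_x)
    gamma = max(0.0, -(g[0] * n[0] + g[1] * n[1]))
    return math.hypot(g[0] + gamma * n[0], g[1] + gamma * n[1])
```

At the solution, `eta((0.0, -1.0))` is zero, while at the nearby interior point `(0.0, -0.999)` the value is at least \(\sigma _2\), which illustrates the jump of \(\eta \) across the boundary.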
So, the question arises how the Tensor Method (3.1) could come to the region \(\mathcal {G}\). The answer follows from the inequalities derived in Sect. 2. Indeed,
and
Thus, at some moment the norm \(\Vert F'(x_k) \Vert _*\) will be small enough to enter \(\mathcal {G}\).
4 Global complexity bounds
Let us briefly discuss the global complexity bounds of the method (3.1), namely the number of iterations required for coming from an arbitrary initial point \(x_0 \in \mathrm{dom} \,h\) to the region \(\mathcal {Q}\). First, note that for every step \(T = T_H(x)\) of the method with parameter \(H \ge p L_p\), we have
Therefore,
with \(x^{*} {\mathop {=}\limits ^{\mathrm {def}}}\arg \min \limits _{y \in \mathbb {E}} F(y)\), which exists by our assumption. Denote by D the maximal radius of the initial level set of the objective, which we assume to be finite:
Then, by monotonicity of the method (3.1) and by convexity we conclude
In the general convex case, we can prove the global sublinear rate of convergence of the Tensor Method of the order \(O({1 / k^p})\) [18]. For completeness of presentation, let us prove an extension of this result to the composite case.
Theorem 3
For the method (3.1) with \(H = pL_p\) we have
Proof
Indeed, in view of (2.14) and (4.2), we have for every \(k \ge 0\)
Denoting \(\delta _k = F(x_k) - F^*\) and \(C = \left( {p! \over (p+1) L_p D^{p + 1} }\right) ^{1 \over p}\), we obtain the following recurrence:
or for \(\mu _k = C^p \delta _k {{\mathop {\le }\limits ^{(4.1)}}} 1\), as follows:
Then, Lemma 1.1 from [8] provides us with the following guarantee:
Therefore,
\(\square \)
For a given degree \(q \ge 2\) of uniform convexity with \(\sigma _q > 0\), and for RCTM of order \(p \ge q - 1\), let us denote by \(\omega _{p, q}\) the following condition number:
Corollary 2
In order to reach the region \(\mathcal {Q}\) it is enough to perform
iterations of the method.
Proof
Plugging (3.8) into (4.3). \(\square \)
We can improve this estimate, knowing that the objective is globally uniformly convex (3.2). Then a linear rate of convergence holds at the first stage, until the method enters the region \(\mathcal {Q}\).
Theorem 4
Let \(\sigma _q > 0\) with \(q \le p + 1\). Then for the method (3.1) with \(H = pL_p\), we have
Therefore, for a given \(\varepsilon > 0\) to achieve \(F(x_K) - F^{*} \le \varepsilon \), it is enough to set
Proof
Indeed, for every \(k \ge 0\)
Denoting \(\delta _k = F(x_k) - F^{*}\), we obtain
\(\square \)
We see that, for RCTM with \(p \ge 2\) minimizing the uniformly convex objective of degree \(q \le p + 1\), the condition number \(\omega ^{1/p}_{p, q}\) is the main factor in the global complexity estimates (4.5) and (4.7). Since in general this number may be arbitrarily big, complexity estimate \(\tilde{O}(\omega _{p, q}^{1 / p})\) in (4.7) is much better than the estimate \(O(\omega _{p, q}^{(p + 1) / (p(p - q + 1))})\) in (4.5) because of relation \({ p + 1 \over p - q + 1} \ge 1\).
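To make the comparison concrete, the exponents of \(\omega _{p, q}\) in the two estimates quoted above can be tabulated. This is only an arithmetic sketch: the function name is hypothetical, the logarithmic factor hidden in \(\tilde{O}(\cdot )\) is ignored, and the constraint \(q < p + 1\) is imposed so that the exponent from (4.5) is finite.

```python
def complexity_exponents(p: int, q: float):
    """Return the exponents of the condition number in (4.7) and in (4.5)."""
    if p < 2 or not (2 <= q < p + 1):
        raise ValueError("need p >= 2 and 2 <= q < p + 1")
    linear_stage = 1.0 / p                       # exponent in estimate (4.7)
    sublinear = (p + 1) / (p * (p - q + 1))      # exponent in estimate (4.5)
    return linear_stage, sublinear
```

For instance, `complexity_exponents(3, 2)` compares \(\omega ^{1/3}\) with \(\omega ^{2/3}\); the second exponent is never smaller than the first because \({p + 1 \over p - q + 1} \ge 1\).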
These global bounds can be improved, by using the universal [3, 10] and the accelerated [7, 9, 10, 17, 28] high-order schemes.
High-order tensor methods for minimizing the gradient norm were developed in [4]. These methods achieve near-optimal global convergence rates, and can be used for coming into the region \(\mathcal {G}\) (3.9). Note that for the composite minimization problems, some modification of these methods is required, which ensures minimization of the subgradient norm.
Finally, let us mention some recent results [12, 20], where it was shown that a proper implementation of third-order schemes by a second-order oracle may lead to a significant acceleration of the methods. However, the relation of these techniques to the local convergence needs further investigation.
5 Application to proximal methods
Let us discuss now a general approach, which uses the local convergence of the methods for justifying the global performance of proximal iterations.
The proximal method [23] is one of the classical methods in theoretical optimization. Every step of the method for solving problem (2.1) is a minimization of the regularized objective:
where \(\{ a_k \}_{k \ge 1}\) is a sequence of positive coefficients, related to the iteration counter.
Of course, in general, we can hope only to solve subproblem (5.1) inexactly. The questions of practical implementation and possible generalizations of the proximal method are still an area of intensive research (see, for example, [11, 25, 26, 27]).
One simple observation on the subproblem (5.1) is that it is 1-strongly convex. Therefore, if we were able to pick an initial point from the region of superlinear convergence (3.8) or (3.9), we could minimize it very quickly by RCTM of degree \(p \ge 2\) up to arbitrary accuracy. In this section, we are going to investigate this approach. For the resulting scheme, we will prove the global rate of convergence of the order \(\tilde{O}(1 / k^{p + 1 \over 2})\).
Denote by \(\varPhi _{k + 1}\) the regularized objective from (5.1):
We fix a sequence of accuracies \(\{\delta _k\}_{k \ge 1}\) and relax the assumption on exact minimization in (5.1). Now, at every step we need to find a point \(x_{k + 1}\) and a corresponding subgradient vector \(g_{k + 1} \in \partial \varPhi _{k + 1}(x_{k + 1})\) with bounded norm:
Denote
The following global convergence result holds for the general proximal method with inexact minimization criterion (5.2).
Theorem 5
Assume that there exists a minimizer \(x^{*} \in \mathrm{dom} \,h\) of the problem (2.1). Then, for any \(k \ge 1\), we have
where
Proof
First, let us prove that for all \(k \ge 0\) and for every \(x \in \mathrm{dom} \,h\), we have
where
This is obviously true for \(k = 0\). Let it hold for some \(k \ge 0\). Consider the step number \(k + 1\) of the inexact proximal method.
By condition (5.2), we have
Equivalently,
Therefore, using the inductive assumption and strong convexity of \(\varPhi _{k + 1}(\cdot )\), we conclude
Thus, inequality (5.4) is valid for all \(k \ge 0\).
Now, plugging \(x \equiv x^{*}\) into (5.4), we have
In order to finish the proof, it is enough to show that \(\alpha _k \le R_k(\delta )\).
Indeed,
Therefore,
\(\square \)
Now, we are ready to use the result on the local superlinear convergence of RCTM in the norm of subgradient (Theorem 2), in order to minimize \(\varPhi _{k + 1}(\cdot )\) at every step of inexact proximal method.
Note that
and it is natural to start the minimization process from the previous point \(x_k\), for which \(\partial \varPhi _{k + 1}(x_k) = a_{k + 1} \partial F(x_k)\). Let us also notice that the Lipschitz constant of the pth derivative (\(p \ge 2\)) of the smooth part of \(\varPhi _{k + 1}\) is \(a_{k + 1} L_p\).
Using our previous notation, one step of RCTM can be written as follows:
where \(H = a_{k + 1}pL_p\). Then, a sufficient condition for \(z = x_k\) to be in the region of superlinear convergence (3.9) is
or, equivalently
To be sure that \(x_k\) is strictly inside the region, we can pick:
Note that this rule requires fixing an initial subgradient \(F'(x_0) \in \partial F(x_0)\), in order to choose \(a_1\).
Finally, we apply the following steps:
We can estimate the required number of these iterations as follows.
Lemma 2
At every iteration \(k \ge 0\) of the inexact proximal method, in order to achieve \(\Vert \varPhi '_{k + 1}(z_t) \Vert _{*} \le \delta _{k + 1}\), it is enough to perform
steps of RCTM (5.8), where
Proof
According to (3.7), one step of RCTM (5.8) provides us with the following guarantee in terms of the subgradients of our objective \(\varPhi _{k + 1}(\cdot )\):
where we used in (3.7) the values \(q = 2\), \(\sigma _q = 1\), \(a_{k + 1} L_p\) for the Lipschitz constant of the pth derivative of the smooth part of \(\varPhi _{k + 1}\), and \(H = a_{k + 1}pL_p\).
Denote \(\beta \equiv \left( { a_{k + 1}(p + 1)L_p \over p! } \right) ^{1 \over p - 1} {{\mathop {=}\limits ^{(5.7)}}} \left( { (p + 1) L_p \over 2 \cdot p! \cdot \Vert F'(x_k)\Vert _* } \right) ^{1 \over p}\). Then, from (5.10) we have
Therefore, for
it holds \(\Vert \varPhi '_{k + 1}(z_t)\Vert _{*} \le \delta _{k + 1}\). To finish the proof, let us estimate \(\Vert F'(x_k) \Vert _{*}\) from above. We have
Thus, for every \(1 \le i \le k\) it holds
with \(\mathcal {D} \equiv R_k^{1/2}(\delta ) \left( \frac{(p + 1) L_p}{p!} \right) ^{1 \over p} 2^{3p - 2 \over 2p}\), and \(\rho \equiv \frac{p - 1}{p}\). Therefore,
Substitution of this bound into (5.12) gives (5.9). \(\square \)
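The interplay between the outer proximal iterations and the inner RCTM steps can be sketched in code. This is a minimal one-dimensional illustration under our own assumptions, not the authors' implementation: the toy objective is \(F(x) = {1 \over 2}(x-1)^2 + {1 \over 3}|x-1|^3\) with \(h \equiv 0\), \(p = 2\) and \(L_2 = 2\); the regularized objective is taken as \(\varPhi _{k+1}(z) = a_{k+1} F(z) + {1 \over 2}(z - x_k)^2\); the coefficient \(a_{k+1}\) is read off the identity for \(\beta \) in the proof above; and the accuracies are \(\delta _k = c / k^s\). All function names are hypothetical.

```python
import math

P = 2      # order of the inner method
L2 = 2.0   # Lipschitz constant of F'' for the toy objective below


def F(x):
    u = x - 1.0
    return 0.5 * u * u + abs(u) ** 3 / 3.0


def dF(x):
    u = x - 1.0
    return u + abs(u) * u


def d2F(x):
    return 1.0 + 2.0 * abs(x - 1.0)


def prox_coefficient(grad_norm):
    """Coefficient a_{k+1}, inferred from the identity for beta in Lemma 2."""
    c = (P + 1) * L2 / math.factorial(P)
    return c ** (-1.0 / P) * (2.0 * grad_norm) ** (-(P - 1.0) / P)


def inexact_prox_step(x_k, delta, max_inner=100):
    """Approximately minimize Phi(z) = a*F(z) + (z - x_k)^2 / 2 by cubic Newton steps."""
    a = prox_coefficient(abs(dF(x_k)))
    H = a * P * L2                       # regularization parameter for Phi
    z = x_k                              # start from the previous outer point
    for _ in range(max_inner):
        g = a * dF(z) + (z - x_k)        # Phi'(z)
        if abs(g) <= delta:              # inexact criterion on the gradient norm
            break
        curv = a * d2F(z) + 1.0          # Phi''(z)
        # r = |s| solves (H/2) r^2 + curv * r = |g| (cubic Newton step in 1D).
        r = 2.0 * abs(g) / (curv + math.sqrt(curv * curv + 2.0 * H * abs(g)))
        z -= math.copysign(r, g)
    return z


def inexact_proximal_method(x0, outer_steps, c=0.1, s=2.0):
    """Run the outer loop with accuracies delta_k = c / k^s."""
    x = x0
    for k in range(1, outer_steps + 1):
        if dF(x) == 0.0:                 # already optimal, a_{k+1} is undefined
            break
        x = inexact_prox_step(x, c / k ** s)
    return x
```

For example, `inexact_proximal_method(3.0, 20)` drives the objective close to its minimal value \(F^* = 0\). The inner loop stops as soon as the gradient of the regularized objective is below \(\delta _{k+1}\), so an outer iteration may perform no inner steps at all when the starting point is already accurate enough.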
Let us prove now the rate of convergence for the outer iterations. This is a direct consequence of Theorem 5 and the choice (5.7) of the coefficients \(\{ a_{k} \}_{k \ge 1}\).
Lemma 3
Let for a given \(\varepsilon > 0\),
Then for every \(1 \le k \le K\), we have
where \(\bar{x}_k \; {\mathop {=}\limits ^{\mathrm {def}}}\; \frac{\sum _{i = 1}^k a_i x_i}{\sum _{i = 1}^k a_i}\), and \(V_k(\varepsilon ) \; {\mathop {=}\limits ^{\mathrm {def}}}\; \left( \frac{\Vert F'(x_0)\Vert _{*} \cdot ( \Vert x_0 - x^{*}\Vert + \sum _{i = 1}^k \delta _i )}{\varepsilon } \right) ^{p - 1 \over k}\).
Proof
Using the inequality between the arithmetic and geometric means, we obtain
Therefore,
where the first inequality holds by convexity. At the same time, we have
Thus, \(\left( \frac{\Vert F'(x_0)\Vert _{*}}{\Vert F'(x_k)\Vert _{*}} \right) ^{p - 1 \over k} \le V_k(\varepsilon )\) and we obtain (5.16). \(\square \)
Remark 1
Note that \(\bigl (\frac{1}{\varepsilon }\bigr )^{p - 1 \over k} = \exp \bigl ( {p - 1 \over k} \ln {1 \over \varepsilon } \bigr )\). Therefore after \(k = O\left( \ln {1 \over \varepsilon }\right) \) iterations, the factor \(V_k(\varepsilon )\) is bounded by an absolute constant.
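The identity in this remark is easy to check numerically. The sketch below evaluates only the \(\varepsilon \)-dependent part of \(V_k(\varepsilon )\); the remaining factors are not modelled, and the function name is hypothetical.

```python
import math


def eps_factor(eps: float, p: int, k: int) -> float:
    """Compute (1 / eps)^((p - 1) / k) through the exponential form of the remark."""
    return math.exp((p - 1) / k * math.log(1.0 / eps))
```

Once \(k \ge \ln {1 \over \varepsilon }\), the exponent \({p - 1 \over k} \ln {1 \over \varepsilon }\) is at most \(p - 1\), so the factor does not exceed \(e^{p - 1}\), independently of \(\varepsilon \).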
Since the local convergence of RCTM is very fast (5.9), we can choose the inner accuracies \(\{ \delta _i \}_{i \ge 1}\) small enough to make the right-hand side of (5.16) of the order \(\tilde{O}(1 / k^{p + 1 \over 2})\). Let us present a precise statement.
Theorem 6
Let \(\delta _k \equiv \frac{c}{k^s}\) for fixed absolute constants \(c > 0\) and \(s > 1\). Let for a given \(\varepsilon > 0\), we have
Then, for every k such that \(\ln \frac{\Vert F'(x_0)\Vert _{*} R}{ \varepsilon } \le k \le K\), we get
where
The total number of oracle calls \(N_k\) during the first k iterations is bounded as follows:
where
Proof
Indeed,
Thus, we obtain (5.18) directly from the bound (5.16), and by the fact that
when \(k \ge \ln \frac{\Vert F'(x_0) \Vert _{*} R }{ \varepsilon } \).
Finally,
\(\square \)
Note that we were able to justify the global performance of the scheme, using only the local convergence results for the inner method. It is interesting to compare our approach with the recent results on the path-following second-order methods [5].
We can drop the logarithmic components in the complexity bounds by using the hybrid proximal methods (see [15, 16]), where at each iteration only one step of RCTM is performed. The resulting rate of convergence there is \(O(1 / k^{p + 1 \over 2})\), without any extra logarithmic factors. However, this rate is worse than the rate \(O(1 / k^p)\) provided by Theorem 3 for the primal iterations of RCTM (3.1).
References
[1] Cartis, C., Gould, N.I.M., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2), 245–295 (2011)
[2] Chebyshev, P.L.: Polnoe sobranie sochinenii. Izd. Akad. Nauk SSSR 5, 7–25 (1951)
[3] Doikov, N., Nesterov, Y.: Minimizing uniformly convex functions by cubic regularization of Newton method. arXiv preprint arXiv:1905.02671 (2019)
[4] Dvurechensky, P., Gasnikov, A., Ostroukhov, P., Uribe, C.A., Ivanova, A.: Near-optimal tensor methods for minimizing the gradient norm of convex function. arXiv preprint arXiv:1912.03381 (2019)
[5] Dvurechensky, P., Nesterov, Y.: Global performance guarantees of second-order methods for unconstrained convex minimization. Technical report, CORE Discussion Paper (2018)
[6] Evtushenko, Y.G., Tretyakov, A.A.: p-th order methods for solving nonlinear system. Dokl. akad. nauk 455(5), 512–515 (2014)
[7] Gasnikov, A., Dvurechensky, P., Gorbunov, E., Vorontsova, E., Selikhanovych, D., Uribe, C.A.: Optimal tensor methods in smooth convex and uniformly convex optimization. In: Conference on Learning Theory, pp. 1374–1391 (2019)
[8] Grapiglia, G.N., Nesterov, Y.: Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM J. Optim. 27(1), 478–506 (2017)
[9] Grapiglia, G.N., Nesterov, Y.: Accelerated regularized Newton methods for minimizing composite convex functions. SIAM J. Optim. 29(1), 77–99 (2019)
[10] Grapiglia, G.N., Nesterov, Y.: Tensor methods for minimizing functions with Hölder continuous higher-order derivatives. arXiv preprint arXiv:1904.12559 (2019)
[11] Güler, O.: On the convergence of the proximal point algorithm for convex minimization. SIAM J. Control Optim. 29(2), 403–419 (1991)
[12] Kamzolov, D., Gasnikov, A.: Near-optimal hyperfast second-order method for convex optimization and its sliding. arXiv preprint arXiv:2002.09050 (2020)
[13] Kantorovich, L.V.: Functional analysis and applied mathematics. Uspekhi Matematicheskikh Nauk 3(6), 89–185 (1948)
[14] Lee, J.D., Sun, Y., Saunders, M.A.: Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim. 24(3), 1420–1443 (2014)
[15] Marques Alves, M., Monteiro, R.D.C., Svaiter, B.F.: Iteration-complexity of a Rockafellar’s proximal method of multipliers for convex programming based on second-order approximations. Optimization 68, 1–30 (2019)
[16] Monteiro, R.D.C., Svaiter, B.F.: On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean. SIAM J. Optim. 20(6), 2755–2787 (2010)
[17] Nesterov, Y.: Accelerating the cubic regularization of Newton’s method on convex problems. Math. Program. 112(1), 159–181 (2008)
[18] Nesterov, Y.: Implementable Tensor Methods in Unconstrained Convex Optimization. Universite catholique de Louvain, Center for Operations Research and Econometrics (CORE), Leuven (2018)
[19] Nesterov, Y.: Lectures on Convex Optimization, vol. 137. Springer, Berlin (2018)
[20] Nesterov, Y.: Superfast second-order methods for unconstrained convex optimization. CORE DP 7, 2020 (2020)
[21] Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia (1994)
[22] Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton’s method and its global performance. Math. Program. 108(1), 177–205 (2006)
[23] Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)
[24] Rodomanov, A., Kropotov, D.: A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In: International Conference on Machine Learning, pp. 2597–2605 (2016)
[25] Salzo, S., Villa, S.: Inexact and accelerated proximal point algorithms. J. Convex Anal. 19(4), 1167–1192 (2012)
[26] Schmidt, M., Roux, N.L., Bach, F.R.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems, pp. 1458–1466 (2011)
[27] Solodov, M.V., Svaiter, B.F.: A unified framework for some inexact proximal point algorithms. Numer. Funct. Anal. Optim. 22(7–8), 1013–1035 (2001)
[28] Song, C., Ma, Y.: Towards unified acceleration of high-order algorithms under Hölder continuity and uniform convexity. arXiv preprint arXiv:1906.00582 (2019)
Acknowledgements
We are very thankful to anonymous referees for valuable comments that improved the initial version of this paper.
The research results of this paper were obtained in the framework of ERC Advanced Grant 788368.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Doikov, N., Nesterov, Y. Local convergence of tensor methods. Math. Program. 193, 315–336 (2022). https://doi.org/10.1007/s10107-020-01606-x
DOI: https://doi.org/10.1007/s10107-020-01606-x
Keywords
- Convex optimization
- High-order methods
- Tensor methods
- Local convergence
- Uniform convexity
- Proximal methods