Abstract
We present a methodology for establishing the existence of quadratic Lyapunov inequalities for a wide range of first-order methods used to solve convex optimization problems. In particular, we consider (i) classes of optimization problems of finite-sum form with (possibly strongly) convex and possibly smooth functional components, (ii) first-order methods that can be written as a linear system on state-space form in feedback interconnection with the subdifferentials of the functional components of the objective function, and (iii) quadratic Lyapunov inequalities that can be used to draw convergence conclusions. We present a necessary and sufficient condition for the existence of a quadratic Lyapunov inequality within a predefined class of Lyapunov inequalities, which amounts to solving a small-sized semidefinite program. We showcase our methodology on several first-order methods that fit the framework. Most notably, our methodology allows us to significantly extend the region of parameter choices that allow for duality gap convergence in the Chambolle–Pock method when the linear operator is the identity mapping.
1 Introduction
First-order methods are used to solve optimization problems and can be analyzed via Lyapunov inequalities. Such inequalities consist of a Lyapunov function that is nonincreasing from one iteration to the next and a residual function that quantifies a lower bound on the potential decrease. The traditional approach of establishing a Lyapunov inequality, which is typically done on a case-by-case basis, amounts to combining and rearranging algorithm update equations and inequalities that describe properties of the objective function. In this paper, we develop an automated methodology for finding Lyapunov inequalities that can be applied to a large class of first-order methods.
The methodology uses an algorithm representation that covers most first-order methods with fixed parameters. The structure of the algorithm representation is a linear system in state-space form in feedback interconnection with a nonlinearity, in our case the subdifferentials of the functional components of the objective function. Such representations are common in the automatic control literature [44] and have previously been used for algorithm analysis, e.g., in [24]. The algorithm representation is also closely connected to the operator splitting framework introduced in [28]. Different algorithms are obtained by instantiating the matrices that define the linear system. Some matrix choices lead to algorithms that cannot solve the optimization problem in general. A contribution of this paper is that we provide conditions on the matrices that are necessary and sufficient for the equivalence between solving an instance of the optimization problem and finding a fixed point of the algorithm.
Our methodology is based on a necessary and sufficient condition for the existence of a quadratic Lyapunov inequality within a predefined class of Lyapunov inequalities. At the core of the methodology is a necessary and sufficient condition, in terms of a semidefinite program, for the optimal value of a quadratic objective function to be nonpositive when optimized over all possible algorithm iterates, fixed points, subgradients, and function values and over the full function class under consideration. This result is applied to the three conditions that we use to define a quadratic Lyapunov inequality. The resulting semidefinite program is feasible if and only if such a quadratic Lyapunov inequality exists, and it provides associated Lyapunov functions and residual functions when feasible.
Other methodologies that analyze optimization algorithms using semidefinite programs are the performance estimation problem (PEP) methodology [15, 38] and the integral quadratic constraints (IQC) methodology [24]. The PEP methodology poses the problem of finding a worst-case function from a predefined class of functions for the algorithm under consideration as an optimization problem. This is then reformulated in a sequence of steps to arrive at a semidefinite program. The PEP methodology, first presented in [15], has been extended in a sequence of works that guarantee tightness in each step of the reformulation [37, 38], has been adapted as a tool for Lyapunov analysis [29, 36, 39], and extended to monotone inclusion problems [34]. The IQC methodology is based on integral quadratic constraints from the control literature [26], which has been adopted for automated convergence analysis of first-order methods under various settings [23, 24, 42]. The IQC methodology uses a simple algorithm representation but lacks tightness guarantees. We are inspired by the strengths of both methodologies; the worst-case analysis and tightness guarantees of PEP and the simple algorithm representation of IQC. Another work that is inspired by the PEP and IQC frameworks for tight Lyapunov function analysis is [39]. Our framework is more general as it can be applied to a wider range of algorithms, allowing, e.g., for proximal operators, and can be used to derive a broader range of convergence results.
The proposed methodology is applied in two ways. First, to find the smallest possible linear convergence rate via quadratic Lyapunov inequalities for the algorithm at hand. This is done via a bisection search over the convergence rate \(\rho \in [0,1[\). Second, to find the range of algorithm parameters for which the Lyapunov analysis can guarantee function value convergence or duality gap convergence. The algorithms we consider are the Douglas–Rachford method [13, 25], the (proximal) gradient method with heavy-ball momentum [17, 32], the three-operator splitting method by Davis and Yin [12], and the Chambolle–Pock method [7].
For the Douglas–Rachford method, we recover some of the known tight linear convergence rate results in [18, 19, 34]. For the gradient method with heavy-ball momentum, we improve, compared to [17], the linear convergence rate, and also extend the range of parameters that guarantee function value suboptimality convergence. We also show convergence of the duality gap for two proximal gradient methods with heavy-ball momentum. For the three-operator splitting method by Davis and Yin we provide linear convergence rate results that improve the ones found in [11, 31]. More strikingly, our methodology allows us to significantly enlarge the range of parameters that give duality gap convergence for the Chambolle–Pock method when the linear operator is assumed to be the identity operator. Traditional proofs such as in [7], allow for proximal operator step-size parameters \(\tau _1,\tau _2>0\) to satisfy \(\tau _1\tau _2<1\) and the coefficient \(\theta \) for the linear combination of previous iterates to satisfy \(\theta =1\). Our analysis allows for a significantly wider range of parameter values, e.g., for \(\theta =1\) we allow for \(\tau _1=\tau _2\in \,]0,1.15]\), for \(\theta =0.35\) we allow for \(\tau _1=\tau _2\in ]0,1.5]\), and for \(\tau _1=\tau _2=0.5\), we allow for \(\theta \in \,[0.03,7.5]\). We also show with the methodology that the extended range of parameters can lead to improved linear convergence rates over the traditional parameter choices.
The paper is organized as follows: in Sect. 2, we introduce the problem class and the algorithm representation. Section 3 discusses interpolation results and frames them in our setting. We define the notion of a quadratic Lyapunov inequality in Sect. 4. Section 5 contains the main result on the existence of a quadratic Lyapunov inequality. Section 6 contains numerical examples and Sect. 7 contains a proof of our core result. Section 8 contains the main conclusions of this work and discusses future work.
An implementation of the methodology and additional numerical examples can be found at: https://github.com/ManuUpadhyaya/TightLyapunovAnalysis.
1.1 Preliminaries
Let \(\mathbb {N}_0\) denote the set of nonnegative integers, \(\mathbb {Z}\) the set of integers, \(\llbracket n,m\rrbracket \) the set of integers inclusively between the integers n and m, \(\mathbb {R}\) the set of real numbers, \(\mathbb {R}_+\) the set of nonnegative real numbers, \(\mathbb {R}_{++}\) the set of positive real numbers, \(\mathbb {R}^n\) the set of all n-tuples of elements of \(\mathbb {R}\), \(\mathbb {R}^{m\times n}\) the set of real-valued matrices of size \(m\times n\), if \(M\in \mathbb {R}^{m\times n}\) then \([M]_{i,j}\) the i, j-th element of M, \(\mathbb {S}^{n}\) the set of symmetric real-valued matrices of size \(n\times n\), and \(\mathbb {S}_+^n\subseteq \mathbb {S}^{n}\) the set of positive semidefinite real-valued matrices of size \(n\times n\). \(\textbf{1}\) denotes the column vector of all ones where the size will be clear from the context.
Throughout this paper, \((\mathcal {H},\left\langle \cdot , \cdot \right\rangle )\) will denote a real Hilbert space. All norms \(\left\Vert \cdot \right\Vert \) are canonical norms where the inner-product will be clear from the context. We denote the identity mapping \(x \mapsto x\) on \(\mathcal {H}\) by \(\mathop {\textrm{Id}}\limits \). Given a function \(f:\mathcal {H}\rightarrow \mathbb {R}\cup \{+\infty \}\), the effective domain of f is the set \(\mathop {\textrm{dom}}\limits f = \mathord {\left\{ x\in \mathcal {H}: f(x)<+\infty \right\} }\). The function f is said to be proper if \(\mathop {\textrm{dom}}\limits f \ne \emptyset \). The subdifferential of a proper function f is the set-valued operator \(\partial f:\mathcal {H}\rightarrow 2^{\mathcal {H}}\) defined as the mapping \(x\mapsto \mathord {\left\{ u\in \mathcal {H}: f(y)\ge f(x)+\left\langle u, y-x \right\rangle \text { for all } y\in \mathcal {H} \right\} }\).
Let \(f:\mathcal {H}\rightarrow \mathbb {R}\cup \{+\infty \}\) and \(\sigma ,\beta \in \mathbb {R}_{+}\). The function f is
-
(i)
convex if \(f\mathord {\left( \mathord {\left( 1 - \lambda \right) } x + \lambda y \right) } \le \mathord {\left( 1 - \lambda \right) } f\mathord {\left( x \right) } + \lambda f\mathord {\left( y \right) }\) for each \(x, y \in \mathcal {H}\) and \(0 \le \lambda \le 1\),
-
(ii)
\(\sigma \)-strongly convex if f is proper and \(f-(\sigma /2)\left\Vert \cdot \right\Vert ^{2}\) is convex, and
-
(iii)
\(\beta \)-smooth if f is differentiable and \(\left\Vert \nabla f(x) - \nabla f(y)\right\Vert \le \beta \left\Vert x - y\right\Vert \) for each \(x,y\in \mathcal {H}\).
Let \(0 \le \sigma < \beta \le +\infty \). We let \(\mathcal {F}_{\sigma ,\beta }\) denote the class of all functions \(f:\mathcal {H}\rightarrow \mathbb {R}\cup \{+\infty \}\) that are
-
(i)
\(\beta \)-smooth and \(\sigma \)-strongly convex if \(\beta < +\infty \), and
-
(ii)
lower semicontinuous and \(\sigma \)-strongly convex if \(\beta = +\infty \).
Let \(f:\mathcal {H}\rightarrow \mathbb {R}\cup \{+\infty \}\) be proper, lower semicontinuous and convex, and let \(\gamma >0\). Then the proximal operator \({\textrm{prox}}_{\gamma f}: \mathcal {H}\rightarrow \mathcal {H}\) is defined as the single-valued operator given by
$$\begin{aligned} {\textrm{prox}}_{\gamma f}(x) = \mathop {\mathrm {arg\,min}}\limits _{y\in \mathcal {H}}\mathord {\left( f(y) + \frac{1}{2\gamma }\left\Vert x-y\right\Vert ^{2} \right) } \end{aligned}$$
for each \(x\in \mathcal {H}\). If \(x,p\in \mathcal {H}\), then \(p = {\textrm{prox}}_{\gamma f}(x)\) \(\Leftrightarrow \) \(\gamma ^{-1}\mathord {\left( x-p \right) } \in \partial f (p)\). Moreover, the conjugate of f, denoted \(f^{*}:\mathcal {H}\rightarrow \mathbb {R}\cup \{+\infty \}\), is the proper, lower semicontinuous and convex function given by \(f^{*}(u) = \sup _{x\in \mathcal {H}}\mathord {\left( \left\langle u, x \right\rangle - f(x) \right) }\) for each \(u\in \mathcal {H}\). If \(x,u\in \mathcal {H}\), then \(u \in \partial f(x) \) \(\Leftrightarrow \) \(x \in \partial f^{*}(u)\) [4, Theorem 16.29].
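As a simple worked illustration of these relations (ours, not part of the paper's development), consider the quadratic function \(f=(\delta /2)\left\Vert \cdot \right\Vert ^{2}\) with \(\delta >0\). Then
$$\begin{aligned} {\textrm{prox}}_{\gamma f}(x) = \frac{1}{1+\gamma \delta }x,\qquad f^{*}(u) = \frac{1}{2\delta }\left\Vert u\right\Vert ^{2},\qquad \partial f(x) = \{\delta x\}, \end{aligned}$$
and one verifies directly that \(\gamma ^{-1}\mathord {\left( x-{\textrm{prox}}_{\gamma f}(x) \right) } = \frac{\delta }{1+\gamma \delta }x = \delta \,{\textrm{prox}}_{\gamma f}(x)\in \partial f\mathord {\left( {\textrm{prox}}_{\gamma f}(x) \right) }\), in agreement with the characterization above.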
Given any positive integer n, we let the inner-product \(\left\langle \cdot , \cdot \right\rangle \) on \(\mathcal {H}^{n}\) be given by \(\left\langle \textbf{z}_1, \textbf{z}_2 \right\rangle =\sum _{j=1}^{n}\left\langle z_1^{(j)}, z_2^{(j)} \right\rangle \) for each \(\textbf{z}_i=\mathord {\left( z_{i}^{(1)},\ldots ,z_{i}^{(n)} \right) }\in \mathcal {H}^{n}\) and \(i\in \llbracket 1,2\rrbracket \). If \(M\in \mathbb {R}^{m\times n}\), we define the tensor product \(M\otimes \mathop {\textrm{Id}}\limits \) to be the mapping \((M\otimes \mathop {\textrm{Id}}\limits ):\mathcal {H}^{n}\rightarrow \mathcal {H}^{m}\) such that
$$\begin{aligned} (M\otimes \mathop {\textrm{Id}}\limits )\textbf{z} = \mathord {\left( \sum _{j=1}^{n}[M]_{1,j}z^{(j)},\ldots ,\sum _{j=1}^{n}[M]_{m,j}z^{(j)} \right) } \end{aligned}$$
for each \(\textbf{z}=\mathord {\left( z^{(1)},\ldots ,z^{(n)} \right) }\in \mathcal {H}^n\). The adjoint satisfies \((M\otimes \mathop {\textrm{Id}}\limits )^*=M^{\top }\otimes \mathop {\textrm{Id}}\limits \). If \(N\in \mathbb {R}^{n\times l}\), the composition rule \((M\otimes \mathop {\textrm{Id}}\limits )\circ (N\otimes \mathop {\textrm{Id}}\limits )=(MN)\otimes \mathop {\textrm{Id}}\limits \) holds. Moreover, if \(M\in \mathbb {R}^{n\times n}\) is invertible, then \((M\otimes \mathop {\textrm{Id}}\limits )^{-1} = M^{-1}\otimes \mathop {\textrm{Id}}\limits \) holds.
If we let \(M_{1}\in \mathbb {R}^{m\times n_{1}}\) and \(M_{2}\in \mathbb {R}^{m\times n_{2}}\), the relations above imply that \(\left\langle (M_1\otimes \mathop {\textrm{Id}}\limits )\textbf{z}_{1}, (M_2\otimes \mathop {\textrm{Id}}\limits )\textbf{z}_{2} \right\rangle =\left\langle \textbf{z}_{1}, \mathord {\left( \mathord {\left( M_1^{\top }M_2 \right) }\otimes \mathop {\textrm{Id}}\limits \right) }\textbf{z}_{2} \right\rangle \) for each \(\textbf{z}_1\in \mathcal {H}^{n_1}\) and \(\textbf{z}_2\in \mathcal {H}^{n_2}\). We define the mapping \(\mathcal {Q}:\mathbb {S}^{n}\times \mathcal {H}^{n} \rightarrow \mathbb {R}\) by \(\mathcal {Q}(M,\textbf{z})=\left\langle \textbf{z}, (M\otimes \mathop {\textrm{Id}}\limits )\textbf{z} \right\rangle \) for each \(M\in \mathbb {S}^{n}\) and \(\textbf{z}\in \mathcal {H}^{n}\). Note that, if \(M\in \mathbb {S}^{n}\), \(N\in \mathbb {R}^{n\times m}\) and \(\textbf{z}\in \mathcal {H}^{m}\), then \(\mathcal {Q}(M,(N\otimes \mathop {\textrm{Id}}\limits )\textbf{z})=\mathcal {Q}(N^{\top }MN,\textbf{z})\).
2 Problem class and algorithm representation
In this section, we introduce the problem class and the algorithm representation. We provide conditions for when solving a problem is equivalent to finding a fixed point of an algorithm. We also provide conditions for when an algorithm can be implemented using scalar multiplications, vector additions, proximal operator evaluations, and gradient evaluations only. We conclude the section by listing a few examples of first-order methods that fit into the algorithm representation.
2.1 Problem class
Let \(0\le \sigma _{i} < \beta _{i} \le +\infty \) for each \(i\in \llbracket 1,m\rrbracket \). Consider the convex optimization problem
$$\begin{aligned} \underset{y\in \mathcal {H}}{\text {minimize}}\quad \sum _{i=1}^{m}f_{i}(y), \end{aligned}$$(1)
where \(f_i\in \mathcal {F}_{\sigma _i,\beta _i}\) for each \(i\in \llbracket 1,m\rrbracket \). Most first-order methods are limited to solving the related inclusion problem
$$\begin{aligned} \text {find } y\in \mathcal {H}\text { such that } 0\in \sum _{i=1}^{m}\partial f_{i}(y). \end{aligned}$$(2)
A solution to (2) is always a solution to (1) and the converse holds under some appropriate constraint qualification, e.g., see [6]. Moreover, it is reasonable to only consider problems such that the inclusion problem (2) is solvable, i.e., there exists at least one point \(y\in \mathcal {H}\) such that \(0\in \sum _{i=1}^m\partial f_i(y)\). Thus, the problem class we consider is all solvable problems of the form (2) where \(f_i\in \mathcal {F}_{\sigma _i,\beta _i}\) for each \(i\in \llbracket 1,m\rrbracket \). For examples of problems that can be modeled according to (1) or (2), we refer to the textbooks [5, 30].
For later convenience, we introduce the notation
$$\begin{aligned} \mathop {\textrm{zer}}\limits \mathord {\left( \sum _{i=1}^m\partial f_i \right) } = \mathord {\left\{ y\in \mathcal {H}: 0\in \sum _{i=1}^m\partial f_i(y) \right\} }. \end{aligned}$$
That is, \(\mathop {\textrm{zer}}\limits \mathord {\left( \sum _{i=1}^m\partial f_i \right) }\) is the set of zeros of the set-valued operator \(\sum _{i=1}^m\partial f_i:\mathcal {H}\rightarrow 2^{\mathcal {H}}:y\mapsto \sum _{i=1}^m\partial f_i(y)\), which is the same as the set of solutions to (2).
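As a simple illustration (ours, not taken from the cited references), one instance of (1) with \(m=2\) over \(\mathcal {H}=\mathbb {R}^{d}\) is the regularized least-squares problem
$$\begin{aligned} \underset{y\in \mathbb {R}^{d}}{\text {minimize}}\quad \underbrace{\tfrac{1}{2}\left\Vert Ay-b\right\Vert ^{2}}_{f_{1}\in \mathcal {F}_{0,\left\Vert A\right\Vert ^{2}}} + \underbrace{\lambda \left\Vert y\right\Vert _{1}}_{f_{2}\in \mathcal {F}_{0,\infty }} \end{aligned}$$
with \(A\in \mathbb {R}^{p\times d}\), \(b\in \mathbb {R}^{p}\), and \(\lambda >0\); here \(f_1\) is smooth with \(\beta _1=\left\Vert A\right\Vert ^{2}\), while \(f_2\) is proper, lower semicontinuous, and convex but nonsmooth.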
2.2 Algorithm representation
We consider algorithms that solve (2) that can be represented as a discrete-time linear system in state-space form in feedback interconnection with the potentially nonlinear and set-valued subdifferentials that define the problem. In particular, let \(\textbf{f}:\mathcal {H}^m\rightarrow (\mathbb {R}\cup \{+\infty \})^m\) and \(\varvec{\partial }\textbf{f}:\mathcal {H}^m\rightarrow 2^{\mathcal {H}^m}\) be mappings containing all functions and subdifferentials associated with (1) and (2) that satisfy
$$\begin{aligned} \textbf{f}(\textbf{y}) = \mathord {\left( f_{1}\mathord {\left( y^{(1)} \right) },\ldots ,f_{m}\mathord {\left( y^{(m)} \right) } \right) } \end{aligned}$$(3)
$$\begin{aligned} \varvec{\partial }\textbf{f}(\textbf{y}) = \partial f_{1}\mathord {\left( y^{(1)} \right) }\times \cdots \times \partial f_{m}\mathord {\left( y^{(m)} \right) } \end{aligned}$$(4)
for each \(\textbf{y}=\mathord {\left( y^{(1)},\ldots ,y^{(m)} \right) }\in \mathcal {H}^{m}\), respectively. We consider algorithms that can be written as: pick an initial \(\textbf{x}_{0}\in \mathcal {H}^{n}\) and let, for each \(k\in \mathbb {N}_0\),
$$\begin{aligned} \begin{aligned} \textbf{x}_{k+1}&= (A\otimes \mathop {\textrm{Id}}\limits )\textbf{x}_{k} + (B\otimes \mathop {\textrm{Id}}\limits )\textbf{u}_{k},\\ \textbf{y}_{k}&= (C\otimes \mathop {\textrm{Id}}\limits )\textbf{x}_{k} + (D\otimes \mathop {\textrm{Id}}\limits )\textbf{u}_{k},\\ \textbf{u}_{k}&\in \varvec{\partial }\textbf{f}(\textbf{y}_{k}),\\ \textbf{F}_{k}&= \textbf{f}(\textbf{y}_{k}), \end{aligned} \end{aligned}$$(5)
where \(\textbf{x}_{k}\in \mathcal {H}^n\), \(\textbf{u}_{k}\in \mathcal {H}^m\), \(\textbf{y}_{k}\in \mathcal {H}^m\), and \(\textbf{F}_{k}\in \mathbb {R}^m\) are the algorithm variables and
$$\begin{aligned} A\in \mathbb {R}^{n\times n},\quad B\in \mathbb {R}^{n\times m},\quad C\in \mathbb {R}^{m\times n},\quad D\in \mathbb {R}^{m\times m} \end{aligned}$$
are fixed matrices containing the parameters of the method at hand. For clarification, individual subgradients and function values are calculated as \(u_k^{(i)}\in \partial f_i\mathord {\left( y_k^{(i)} \right) }\) and \(\textbf{F}_k^{(i)}=f_i\mathord {\left( y_k^{(i)} \right) }\), respectively, so that \(\textbf{u}_k=\mathord {\left( u_k^{(1)},\ldots ,u_k^{(m)} \right) }\) and \(\textbf{F}_k=\mathord {\left( f_1\mathord {\left( y_k^{(1)} \right) },\ldots ,f_m\mathord {\left( y_k^{(m)} \right) } \right) }\). Moreover, representation (5) is a tool for analysis and does not necessarily indicate an efficient implementation, e.g., the function values are not used in the algorithm but are needed for the Lyapunov analysis. The structure in (5) of a linear system in feedback interconnection with a nonlinearity is common in the automatic control literature and has previously been proposed in [24, 43] as a model for algorithm analysis. It can represent a wide range of first-order methods as seen in Sect. 2.5.
Algorithm (5) searches for a fixed point \((\textbf{x}_\star ,\textbf{u}_\star ,\textbf{y}_\star ,\textbf{F}_\star )\in \mathcal {H}^n\times \mathcal {H}^m\times \mathcal {H}^m\times \mathbb {R}^m\) satisfying the fixed-point equations
$$\begin{aligned} \begin{aligned} \textbf{x}_{\star }&= (A\otimes \mathop {\textrm{Id}}\limits )\textbf{x}_{\star } + (B\otimes \mathop {\textrm{Id}}\limits )\textbf{u}_{\star },\\ \textbf{y}_{\star }&= (C\otimes \mathop {\textrm{Id}}\limits )\textbf{x}_{\star } + (D\otimes \mathop {\textrm{Id}}\limits )\textbf{u}_{\star },\\ \textbf{u}_{\star }&\in \varvec{\partial }\textbf{f}(\textbf{y}_{\star }),\\ \textbf{F}_{\star }&= \textbf{f}(\textbf{y}_{\star }), \end{aligned} \end{aligned}$$(6)
from which we want to recover a solution to (2). In particular, we want the problem of finding a fixed point of (5) to be equivalent to solving (2).
2.3 Solutions and fixed points
There are choices of the matrices A, B, C, and D such that it is not possible to extract a solution of (2) from fixed points of (5) in any practical way. To exclude such algorithms, we add the requirement that fixed points should satisfy
$$\begin{aligned} \textbf{y}_{\star } = (\textbf{1}\otimes \mathop {\textrm{Id}}\limits )y_{\star }\qquad \text {and}\qquad \sum _{i=1}^{m}u_{\star }^{(i)} = 0 \end{aligned}$$(7)
for some \(y_\star \in \mathcal {H}\), where \(\textbf{u}_{\star }=\mathord {\left( u_{\star }^{(1)},\ldots ,u_{\star }^{(m)} \right) }\in \mathcal {H}^{m}\). This implies that \(y_\star \) solves (2) since the fixed-point equations (6) give that
$$\begin{aligned} 0 = \sum _{i=1}^{m}u_{\star }^{(i)} \in \sum _{i=1}^{m}\partial f_{i}(y_{\star }). \end{aligned}$$
We say that such fixed points are fixed-point encodings in line with the terminology in [33]. By defining the set of fixed points as
$$\begin{aligned} \Omega _{fixed points }\mathord {\left( f_1,\ldots ,f_m \right) } = \mathord {\left\{ (\textbf{x}_\star ,\textbf{u}_\star ,\textbf{y}_\star ,\textbf{F}_\star )\in \mathcal {H}^{n}\times \mathcal {H}^{m}\times \mathcal {H}^{m}\times \mathbb {R}^{m}: \text {the fixed-point equations (6) hold} \right\} } \end{aligned}$$
and the set of fixed-point encodings as
$$\begin{aligned} \Omega _{fixed-point encodings }\mathord {\left( f_1,\ldots ,f_m \right) } = \mathord {\left\{ (\textbf{x}_\star ,\textbf{u}_\star ,\textbf{y}_\star ,\textbf{F}_\star )\in \Omega _{fixed points }\mathord {\left( f_1,\ldots ,f_m \right) }: \text {(7) holds for some } y_\star \in \mathcal {H} \right\} }, \end{aligned}$$
the requirement that all fixed points are fixed-point encodings can be written as \(\Omega _{fixed points }\mathord {\left( f_1,\ldots ,f_m \right) }=\Omega _{fixed-point encodings }\mathord {\left( f_1,\ldots ,f_m \right) }\). Another requirement is that to each solution of (2), there exists a corresponding fixed point. These two requirements imply that solving (2) is equivalent to finding a fixed point of the algorithm. We say that such algorithms have the fixed-point encoding property.
Definition 1
(Fixed-point encoding property) We say that algorithm (5) has the fixed-point encoding property if
$$\begin{aligned} y_{\star } \in \mathop {\textrm{zer}}\limits \mathord {\left( \sum _{i=1}^m\partial f_i \right) } \quad \Longrightarrow \quad \mathord {\left( \textbf{x}_{\star },\textbf{u}_{\star },(\textbf{1}\otimes \mathop {\textrm{Id}}\limits )y_{\star },\textbf{F}_{\star } \right) }\in \Omega _{fixed points }\mathord {\left( f_1,\ldots ,f_m \right) }\ \text {for some } \textbf{x}_{\star }\in \mathcal {H}^{n},\ \textbf{u}_{\star }\in \mathcal {H}^{m},\ \textbf{F}_{\star }\in \mathbb {R}^{m}, \end{aligned}$$(8)
and
$$\begin{aligned} \Omega _{fixed points }\mathord {\left( f_1,\ldots ,f_m \right) }=\Omega _{fixed-point encodings }\mathord {\left( f_1,\ldots ,f_m \right) } \end{aligned}$$(9)
for each \(\mathord {\left( f_{1},\ldots ,f_m \right) }\in \prod _{i=1}^{m}\mathcal {F}_{\sigma _{i},\beta _{i}}\).
By appropriately restricting A, B, C and D, we can exactly capture the class of algorithms with this property. For \(m\ge 2\), let
$$\begin{aligned} N = \begin{bmatrix} I_{m-1}\\ -\textbf{1}^{\top } \end{bmatrix}\in \mathbb {R}^{m\times (m-1)}, \end{aligned}$$(10)
where \(I_{m-1}\) denotes the identity matrix of size \((m-1)\times (m-1)\).
The fixed-point encoding condition in (7) is then equivalent to \(0=\mathord {\left( N^{\top }\otimes \mathop {\textrm{Id}}\limits \right) }\textbf{y}_\star \) and \(\textbf{u}_{\star }=\mathord {\left( N\otimes \mathop {\textrm{Id}}\limits \right) }{\hat{\textbf{u}}}_\star \). In the case \(m=1\), the fixed-point encoding condition is simply \(\textbf{u}_{\star }=0\). The matrix \(N\) enters in the restriction of A, B, C, and D to exactly capture the class of algorithms with the fixed-point encoding property.
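For example, when \(m=2\) we have \(N=(1,-1)^{\top }\in \mathbb {R}^{2\times 1}\), and the fixed-point encoding condition reads
$$\begin{aligned} 0=\mathord {\left( N^{\top }\otimes \mathop {\textrm{Id}}\limits \right) }\textbf{y}_\star = y_\star ^{(1)}-y_\star ^{(2)} \qquad \text {and}\qquad \textbf{u}_{\star }=\mathord {\left( N\otimes \mathop {\textrm{Id}}\limits \right) }{\hat{\textbf{u}}}_\star =\mathord {\left( {\hat{u}}_\star ^{(1)},-{\hat{u}}_\star ^{(1)} \right) }, \end{aligned}$$
i.e., \(y_\star ^{(1)}=y_\star ^{(2)}\) and \(u_\star ^{(1)}+u_\star ^{(2)}=0\).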
Assumption 1
Suppose that
with the interpretation that the block column containing \(N\) is removed when \(m=1\), and that
with the interpretation that the block row containing \(N^{\top }\) is removed when \(m=1\).
Proposition 1
The following are equivalent:
-
(i)
Assumption 1 holds.
-
(ii)
Algorithm (5) has the fixed-point encoding property.
Proof
(i) \(\Rightarrow \) (ii): Suppose that Assumption 1 holds. Let \(\mathord {\left( f_{1},\ldots ,f_m \right) }\in \prod _{i=1}^{m}\mathcal {F}_{\sigma _{i},\beta _{i}}\).
First, we prove that (8) holds. Suppose that \(y_{\star } \in \mathop {\textrm{zer}}\limits \mathord {\left( \sum _{i=1}^m\partial f_i \right) }\). This implies that there exists a \(\textbf{u}_{\star }=\mathord {\left( u_{\star }^{(1)},\ldots ,u_{\star }^{(m)} \right) }\in \mathcal {H}^{m}\) such that
$$\begin{aligned} u_{\star }^{(i)}\in \partial f_{i}(y_{\star }) \text { for each } i\in \llbracket 1,m\rrbracket \qquad \text {and}\qquad \sum _{i=1}^{m}u_{\star }^{(i)} = 0. \end{aligned}$$(13)
Note that the second part of (13) implies that
$$\begin{aligned} \textbf{u}_{\star } = \mathord {\left( N\otimes \mathop {\textrm{Id}}\limits \right) }{\hat{\textbf{u}}}_{\star } \end{aligned}$$
for \({\hat{\textbf{u}}}_{\star }=\mathord {\left( u_{\star }^{(1)},\ldots ,u_{\star }^{(m-1)} \right) }\in \mathcal {H}^{m-1}\), where \(N\) is defined in (10). We will show that there exists an \(\textbf{x}_{\star }\in \mathcal {H}^{n}\) such that
$$\begin{aligned} \textbf{x}_{\star } = (A\otimes \mathop {\textrm{Id}}\limits )\textbf{x}_{\star } + (B\otimes \mathop {\textrm{Id}}\limits )\textbf{u}_{\star }\qquad \text {and}\qquad (\textbf{1}\otimes \mathop {\textrm{Id}}\limits )y_{\star } = (C\otimes \mathop {\textrm{Id}}\limits )\textbf{x}_{\star } + (D\otimes \mathop {\textrm{Id}}\limits )\textbf{u}_{\star }, \end{aligned}$$(14)
i.e.,
$$\begin{aligned} \mathord {\left( \textbf{x}_{\star },\textbf{u}_{\star },(\textbf{1}\otimes \mathop {\textrm{Id}}\limits )y_{\star },\textbf{f}\mathord {\left( (\textbf{1}\otimes \mathop {\textrm{Id}}\limits )y_{\star } \right) } \right) } \end{aligned}$$
is a fixed-point encoding. This will prove the desired implication. Note that (14) is equivalent to
with the interpretation that \({\hat{\textbf{u}}}_{\star }\) and the block column containing \(N\) is removed when \(m=1\). Moreover, (11) in Assumption 1 implies that there exists a matrix \(U\in \mathbb {R}^{n\times m}\) such that
i.e., each column of the matrix to the left in (11) can be written as linear combinations of the columns of the matrix to the right in (11). If we let \(\textbf{x}_{\star }=\mathord {\left( U\otimes \mathop {\textrm{Id}}\limits \right) }\mathord {\left( {\hat{\textbf{u}}}_{\star }, y_{\star } \right) }\), then we get
as desired.
Second, we prove that (9) holds. Note that the inclusion \(\supseteq \) holds trivially. Therefore, we only need to prove the \(\subseteq \) inclusion. Suppose that \((\textbf{x}_\star ,\textbf{u}_\star ,\textbf{y}_\star ,\textbf{F}_\star )\in \Omega _{fixed points }\mathord {\left( f_1,\ldots ,f_m \right) }\). This implies that
However, (12) in Assumption 1 implies that
with the interpretation that the block row containing \(N^{\top }\) is removed when \(m=1\). In particular, note that this implies that
where \(\textbf{u}_{\star }=\mathord {\left( u_{\star }^{(1)},\ldots ,u_{\star }^{(m)} \right) }\) and \(\textbf{y}_\star = \mathord {\left( y_{\star }^{(1)}, \ldots , y_{\star }^{(m)} \right) }\), for some common value \(y_{\star }\in \mathcal {H}\), since \(\textbf{y}_{\star } = \mathord {\left( C \otimes \mathop {\textrm{Id}}\limits \right) } \textbf{x}_{\star } + \mathord {\left( D \otimes \mathop {\textrm{Id}}\limits \right) } \textbf{u}_{\star }\) (and \(\mathord {\left( N^{\top }\otimes \mathop {\textrm{Id}}\limits \right) }\textbf{y}_{\star }=\mathord {\left( y_{\star }^{(1)} - y_{\star }^{(m)}, \ldots , y_{\star }^{(m-1)}-y_{\star }^{(m)} \right) }\) if \(m>1\)). In particular, \((\textbf{x}_\star ,\textbf{u}_\star ,\textbf{y}_\star ,\textbf{F}_\star )\in \Omega _{fixed-point encodings }\mathord {\left( f_1,\ldots ,f_m \right) }\). This proves the \(\subseteq \) inclusion for (9).
\({(ii)} \Rightarrow { (i):}\) We prove the contrapositive.
First, suppose that (11) does not hold, i.e., there exists \(\mathord {\left( y_{\star },{\hat{\textbf{u}}}_{\star } \right) }\in \mathcal {H}\times \mathcal {H}^{m-1}\) such that
does not hold for any \(\textbf{x}_{\star }\in \mathcal {H}^{n}\). Define \(\textbf{u}_{\star }=\mathord {\left( u_{\star }^{(1)},\ldots ,u_{\star }^{(m)} \right) }\in \mathcal {H}^{m}\) such that
and note that \(\sum _{i=1}^{m}u_{\star }^{(i)} = 0\) holds by construction. Note that (16) then implies that
does not hold for any \(\textbf{x}_{\star }\in \mathcal {H}^{n}\). Thus, if we can show that there exists \(\mathord {\left( f_{1},\ldots ,f_m \right) }\in \prod _{i=1}^{m}\mathcal {F}_{\sigma _{i},\beta _{i}}\) such that \(\varvec{\partial f}\mathord {\left( \mathord {\left( \textbf{1} \otimes \mathop {\textrm{Id}}\limits \right) } y_{\star } \right) } = \mathord {\left\{ \textbf{u}_{\star } \right\} }\) holds, then we are done since this shows that there exists \(y_{\star } \in \mathop {\textrm{zer}}\limits \mathord {\left( \sum _{i=1}^m\partial f_i \right) }\) such that the implication in (8) fails. Let \((\delta _{1},\ldots ,\delta _{m})\in \prod _{i=1}^{m}[\sigma _i,\beta _i]\) and \(f_{i}:\mathcal {H}\rightarrow \mathbb {R}\) such that
$$\begin{aligned} f_{i}(y) = \frac{\delta _{i}}{2}\left\Vert y-y_{\star }\right\Vert ^{2} + \left\langle u_{\star }^{(i)}, y \right\rangle \end{aligned}$$
for each \(y\in \mathcal {H}\) and \(i\in \llbracket 1,m\rrbracket \). Then \(\mathord {\left( f_{1},\ldots ,f_m \right) }\in \prod _{i=1}^{m}\mathcal {F}_{\sigma _{i},\beta _{i}}\) is clear, and \(\varvec{\partial f}\mathord {\left( \mathord {\left( \textbf{1} \otimes \mathop {\textrm{Id}}\limits \right) } y_{\star } \right) } = \mathord {\left\{ \textbf{u}_{\star } \right\} }\) holds since
$$\begin{aligned} \nabla f_{i}(y_{\star }) = \delta _{i}\mathord {\left( y_{\star }-y_{\star } \right) } + u_{\star }^{(i)} = u_{\star }^{(i)} \end{aligned}$$
for each \(i\in \llbracket 1,m\rrbracket \).
Second, suppose that (12) does not hold, i.e., there exists \(\mathord {\left( \textbf{x}_\star , \textbf{u}_\star \right) }\in \mathcal {H}^{n}\times \mathcal {H}^{m}\) such that
but
If we let \(\textbf{u}_{\star }=\mathord {\left( u_{\star }^{(1)},\ldots ,u_{\star }^{(m)} \right) }\) and \(\textbf{y}_{\star } = \mathord {\left( y_{\star }^{(1)}, \ldots , y_{\star }^{(m)} \right) } = \mathord {\left( C \otimes \mathop {\textrm{Id}}\limits \right) } \textbf{x}_{\star } + \mathord {\left( D \otimes \mathop {\textrm{Id}}\limits \right) } \textbf{u}_{\star }\), then at least one of \(y_{\star }^{(1)} = \ldots = y_{\star }^{(m)}\) and \(\sum _{i=1}^{m}u_{\star }^{(i)} = 0\) fails. Thus, if we can show that there exists \(\mathord {\left( f_{1},\ldots ,f_m \right) }\in \prod _{i=1}^{m}\mathcal {F}_{\sigma _{i},\beta _{i}}\) such that (6) holds, then we are done since this shows that there exists \((\textbf{x}_\star ,\textbf{u}_\star ,\textbf{y}_\star ,\textbf{F}_\star )\in \Omega _{fixed points }\mathord {\left( f_1,\ldots ,f_m \right) }\) such that \((\textbf{x}_\star ,\textbf{u}_\star ,\textbf{y}_\star ,\textbf{F}_\star )\notin \Omega _{fixed-point encodings }\mathord {\left( f_1,\ldots ,f_m \right) }\). Let \((\delta _{1},\ldots ,\delta _{m})\in \prod _{i=1}^{m}[\sigma _i,\beta _i]\) and \(f_{i}:\mathcal {H}\rightarrow \mathbb {R}\) such that
$$\begin{aligned} f_{i}(y) = \frac{\delta _{i}}{2}\left\Vert y-y_{\star }^{(i)}\right\Vert ^{2} + \left\langle u_{\star }^{(i)}, y \right\rangle \end{aligned}$$
for each \(y\in \mathcal {H}\) and \(i\in \llbracket 1,m\rrbracket \). Then \(\mathord {\left( f_{1},\ldots ,f_m \right) }\in \prod _{i=1}^{m}\mathcal {F}_{\sigma _{i},\beta _{i}}\), and (6) holds since
$$\begin{aligned} \nabla f_{i}\mathord {\left( y_{\star }^{(i)} \right) } = \delta _{i}\mathord {\left( y_{\star }^{(i)}-y_{\star }^{(i)} \right) } + u_{\star }^{(i)} = u_{\star }^{(i)} \end{aligned}$$
for each \(i\in \llbracket 1,m\rrbracket \). \(\square \)
Remark 1
There exist many different choices of A, B, C, and D in (5) that can represent a given first-order method. The dimension m in \(\textbf{y}\in \mathcal {H}^m\) is fixed due to the number of functional components in problem (1), but the dimension n in \(\textbf{x}\in \mathcal {H}^n\) can vary among representations. In fact, there exists a minimal n such that a given first-order method can be represented as (5), leading to a minimal representation. A necessary condition is that
where both matrices appear in Assumption 1. If these do not hold, the system is not controllable (also often called reachable) [9, Definitions 6.D1] or observable [9, Definitions 6.D2], respectively. This implies that the representation is not minimal [10, Theorem 25.2] and that it is possible, for instance via a Kalman decomposition [10, Section 25.2], to go from this non-minimal representation to a minimal representation that satisfies (17) and represents the same algorithm.
Remark 2
Previously, [27, 35] derived necessary and sufficient conditions for the existence of a fixed point from which a solution can be extracted, using algorithm representations different from (5). Note that the existence of a fixed point from which a solution can be extracted differs from the concept of the fixed-point encoding property considered here.
2.4 Well-posedness
When analyzing existing algorithms, well-posedness is usually clear from the outset. However, when taking the more abstract point of view, as given by Algorithm (5), further discussion is warranted. We would like Algorithm (5) to be well-posed in the sense that it can be initiated at an arbitrary \(\textbf{x}_{0}\in \mathcal {H}^{n}\) and produce an infinite sequence \(\{(\textbf{x}_k, \textbf{u}_k, \textbf{y}_k, \textbf{F}_k)\}_{k=0}^{\infty }\) obeying the algorithm dynamics (5). This holds if for each \(\textbf{x}\in \mathcal {H}^n\), there exist \(\textbf{u}\in \mathcal {H}^m\) and \(\textbf{y}\in \mathcal {H}^m\) such that
$$\begin{aligned} \textbf{y}= (C\otimes \mathop {\textrm{Id}}\limits )\textbf{x}+ (D\otimes \mathop {\textrm{Id}}\limits )\textbf{u}\qquad \text {and}\qquad \textbf{u}\in \varvec{\partial }\textbf{f}(\textbf{y}). \end{aligned}$$(18)
In addition, if \(\textbf{u}\in \mathcal {H}^m\) and \(\textbf{y}\in \mathcal {H}^m\) are unique, then the generated sequence is unique.
If D has a lower-triangular structure, (18) can be solved using back-substitution. If \([D]_{i,i}\ne 0\), an implicit step is needed to find \(y^{(i)}\) and \(u^{(i)}\). If \([D]_{i,i}<0\), this implicit step is a proximal evaluation, which implies uniqueness. If \([D]_{i,i}=0\), \(u^{(i)}\) is found via direct evaluation of \(\partial f_i\mathord {\left( y^{(i)} \right) }\) which is always unique if \(f_i\) is differentiable.
Assumption 2
Let
$$\begin{aligned} I_{differentiable } = \mathord {\left\{ i\in \llbracket 1,m\rrbracket : \beta _{i}<+\infty \right\} } \qquad \text {and}\qquad I_{D} = \mathord {\left\{ i\in \llbracket 1,m\rrbracket : [D]_{i,i}\ne 0 \right\} }, \end{aligned}$$
and assume that \(I_{differentiable }\cup I_{D} = \llbracket 1,m\rrbracket \) and D is lower triangular with nonpositive diagonal elements.
The requirements in Assumption 2 give rise to causal algorithms that generate unique and infinite sequences that evaluate either a proximal operator or a gradient for each \(f_i\) and linearly combine results of previous evaluations to form inputs.
Proposition 2
Suppose that Assumption 2 holds. Then for any \(\mathord {\left( f_{1},\ldots ,f_m \right) }\in \prod _{i=1}^{m}\mathcal {F}_{\sigma _{i},\beta _{i}}\) and \(\textbf{x}_{0}=\mathord {\left( x_{0}^{(1)},\ldots ,x_{0}^{(n)} \right) }\in \mathcal {H}^{n}\), algorithm (5) produces a unique sequence \(\{(\textbf{x}_k, \textbf{u}_k, \textbf{y}_k, \textbf{F}_k)\}_{k=0}^{\infty }\) obeying the algorithm dynamics (5) and can be implemented as the following causal procedure:
where \(\textbf{u}_{k}=\mathord {\left( u_{k}^{(1)},\ldots ,u_{k}^{(m)} \right) }\), \(\textbf{y}_{k}=\mathord {\left( y_{k}^{(1)},\ldots ,y_{k}^{(m)} \right) }\), \(\textbf{F}_{k}=\mathord {\left( F_{k}^{(1)},\ldots ,F_{k}^{(m)} \right) }\) and the empty sum is set equal to zero by convention.
Proof
Let \(\mathord {\left( f_{1},\ldots ,f_m \right) }\in \prod _{i=1}^{m}\mathcal {F}_{\sigma _{i},\beta _{i}}\). Consider an arbitrary \(k\in \mathbb {N}_0\) and pick any \(\textbf{x}_{k}=\mathord {\left( x_{k}^{(1)},\ldots ,x_{k}^{(n)} \right) }\in \mathcal {H}^{n}\). For \(i\in \llbracket 1,m\rrbracket \) in ascending order:
-
\(v_k^{(i)}\) in the inner loop in (19) is a linear combination of previously calculated/known quantities.
-
If \(i\in I_{D}\), then (5) and the structure of D in Assumption 2 give that
$$\begin{aligned} y^{(i)}_k \in v_k^{(i)} + [D]_{i,i} \partial f_{i}\mathord {\left( y^{(i)}_k \right) }&\quad \Leftrightarrow \quad \underbrace{(-[D]_{i,i})^{-1}\mathord {\left( v_{k}^{(i)}-y_{k}^{(i)} \right) }}_{=\, u_k^{(i)}} \in \partial f_{i}\mathord {\left( y^{(i)}_k \right) }\\&\quad \Leftrightarrow \quad y^{(i)}_k = {\textrm{prox}}_{-[D]_{i,i}f_{i}}\mathord {\left( v_k^{(i)} \right) }, \end{aligned}$$which is unique since the proximal operator is single-valued with full domain under our assumptions (recall that each \(f_{i}\) is assumed to be proper, lower semicontinuous, and convex).
-
If \(i\notin I_D\), then \(f_i\) is differentiable due to Assumption 2, and (5) gives that \(y^{(i)}_k = v^{(i)}_k\) and \(u^{(i)}_k = \nabla f_{i}\mathord {\left( y_{k}^{(i)} \right) }\).
An inductive argument concludes the proof. \(\square \)
The requirement that D is lower-triangular is for convenience. If there exists a permutation \(\pi :\llbracket 1,m\rrbracket \rightarrow \llbracket 1,m\rrbracket \) with associated permutation matrix \(P_\pi \) such that \(P_\pi D P_{\pi }^{\top }\) is lower-triangular, the resulting algorithm is equivalent to (19). Let \(\bar{\textbf{y}}_k=P_\pi \textbf{y}_k\), \(\bar{\textbf{u}}_k=P_\pi \textbf{u}_k\), and \(\bar{\textbf{f}}=\textbf{f}\circ (P_\pi ^{\top }\otimes \mathop {\textrm{Id}}\limits )\) (that just reorders the inputs). Then \(\varvec{\partial }\bar{\textbf{f}} = (P_\pi \otimes \mathop {\textrm{Id}}\limits )\circ \varvec{\partial }\textbf{f}\circ (P_\pi ^{\top }\otimes \mathop {\textrm{Id}}\limits )\) and the algorithm is equivalent to
which can be implemented as in (19).
If no permutation matrix exists such that \(P_\pi DP_\pi ^{\top }\) is lower-triangular, then there exist \(i<j\) with \(i,j\in \llbracket 1,m\rrbracket \) such that \([D]_{i,j}\ne 0\) and \([D]_{j,i}\ne 0\). This couples the \(\partial f_i\) and \(\partial f_j\) evaluations such that back-substitution fails and these updates cannot in general be done using only proximal operator or gradient evaluations of \(f_i\) and \(f_j\) individually.
Since the linear combinations decided by A, B, C, and D are arbitrary, all first-order methods that use fixed linear combinations of previously computed quantities and evaluate each individual subdifferential only once per iteration, either via a proximal operator or a gradient evaluation, can be implemented as in (19), potentially after a permutation of variables. We provide a list of examples in Sect. 2.5 that all satisfy Assumption 2. They also satisfy Assumption 1, implying that solving (2) is equivalent to finding a fixed point of the algorithm, and the rank conditions in (17).
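To make the causal procedure concrete, the following minimal sketch (ours, not part of the formal development) implements the generic iteration for \(\mathcal {H}=\mathbb {R}^{d}\) under Assumption 2; the names run_method and oracles are ours, and the user is assumed to supply a proximal-operator or gradient oracle for each functional component.

```python
import numpy as np

def run_method(A, B, C, D, oracles, x0, num_iters):
    """Run the algorithm representation (5) via back-substitution (cf. (19)).

    A, B, C, D : parameter matrices of sizes (n, n), (n, m), (m, n), (m, m);
                 D is assumed lower triangular with nonpositive diagonal.
    oracles    : list of m pairs (prox_i, grad_i); prox_i(v, step) is used when
                 [D]_{i,i} < 0 and grad_i(y) when [D]_{i,i} == 0 (the unused
                 entry of a pair may be None).
    x0         : array of shape (n, d) stacking the n state components of H = R^d.
    """
    n, m = B.shape
    x = x0.copy()
    y = np.zeros((m, x0.shape[1]))
    u = np.zeros_like(y)
    for _ in range(num_iters):
        for i in range(m):  # ascending order: only previously computed u's enter
            v = C[i] @ x + D[i, :i] @ u[:i]
            prox_i, grad_i = oracles[i]
            if D[i, i] < 0:                      # implicit step: proximal evaluation
                y[i] = prox_i(v, -D[i, i])
                u[i] = (v - y[i]) / (-D[i, i])
            else:                                # explicit step: gradient evaluation
                y[i] = v
                u[i] = grad_i(y[i])
        x = A @ x + B @ u                        # state update
    return x, y, u
```

Supplying, for instance, the Douglas–Rachford data of Sect. 2.5.1 together with proximal oracles for \(f_1\) and \(f_2\) reproduces that method.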
2.5 Examples
In this section, we provide examples of a few well-known algorithms that can be written as (5).
2.5.1 Douglas–Rachford method
Let \(\gamma \in \mathbb {R}_{++}\), \(\lambda \in \mathbb {R}\setminus \{0\}\) and \(f_{1},f_{2}\in \mathcal {F}_{0,\infty }\). The Douglas–Rachford method [13, 16, 25] is given by
$$\begin{aligned} x_{k+1} = x_{k} + \lambda \mathord {\left( {\textrm{prox}}_{\gamma f_{2}}\mathord {\left( 2{\textrm{prox}}_{\gamma f_{1}}(x_{k})-x_{k} \right) } - {\textrm{prox}}_{\gamma f_{1}}(x_{k}) \right) }, \end{aligned}$$
which can equivalently be written as
$$\begin{aligned} \begin{aligned} y_{k}^{(1)}&= {\textrm{prox}}_{\gamma f_{1}}(x_{k}),\\ y_{k}^{(2)}&= {\textrm{prox}}_{\gamma f_{2}}\mathord {\left( 2y_{k}^{(1)}-x_{k} \right) },\\ x_{k+1}&= x_{k} + \lambda \mathord {\left( y_{k}^{(2)}-y_{k}^{(1)} \right) }. \end{aligned} \end{aligned}$$
By letting \(\textbf{x}_{k}=x_{k}\), \(\textbf{y}_{k}=\mathord {\left( y_k^{(1)},y_k^{(2)} \right) }\) and \(\textbf{u}_{k}=(\gamma ^{-1}(x_k-y_k^{(1)}),\gamma ^{-1}(2y_k^{(1)}-x_k-y_k^{(2)}))\), we get
where \(\varvec{\partial }\textbf{f}(\textbf{y})=\partial f_1\mathord {\left( y^{(1)} \right) }\times \partial f_2\mathord {\left( y^{(2)} \right) }\) for each \(\textbf{y}=\mathord {\left( y^{(1)},y^{(2)} \right) }\in \mathcal {H}^2\), which matches the form (5).
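For reference, a set of matrices consistent with these variable definitions (a sketch obtained by eliminating the proximal steps; the paper's own display may be arranged differently) is
$$\begin{aligned} A = \begin{bmatrix} 1 \end{bmatrix},\qquad B = \begin{bmatrix} -\lambda \gamma & -\lambda \gamma \end{bmatrix},\qquad C = \begin{bmatrix} 1\\ 1 \end{bmatrix},\qquad D = \begin{bmatrix} -\gamma & 0\\ -2\gamma & -\gamma \end{bmatrix}, \end{aligned}$$
so that \(y_k^{(1)} = x_k-\gamma u_k^{(1)}\), \(y_k^{(2)} = x_k-2\gamma u_k^{(1)}-\gamma u_k^{(2)}\), and \(x_{k+1} = x_k-\lambda \gamma \mathord {\left( u_k^{(1)}+u_k^{(2)} \right) }\); note that D is lower triangular with negative diagonal, in line with Assumption 2.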
2.5.2 Gradient method with heavy-ball momentum
Let \(\gamma ,\beta _1 \in \mathbb {R}_{++}\), \(\delta \in \mathbb {R}\) and \(f_{1}\in \mathcal {F}_{0,\beta _1}\). The gradient method with heavy-ball momentum is given by
$$\begin{aligned} x_{k+1} = x_{k} - \gamma \nabla f_{1}(x_{k}) + \delta \mathord {\left( x_{k}-x_{k-1} \right) }. \end{aligned}$$
By letting \(\textbf{x}_{k}=(x_{k},x_{k-1})\), \(\textbf{y}_k=x_{k}\), and \(\textbf{u}_k=\nabla f_1(x_k)\), we get
where \(\varvec{\partial }\textbf{f}(y)=\{\nabla f_1(y)\}\) for each \(y\in \mathcal {H}\), which matches the form (5).
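One realization consistent with these variable definitions (a sketch; the paper's own display may differ) is
$$\begin{aligned} A = \begin{bmatrix} 1+\delta & -\delta \\ 1 & 0 \end{bmatrix},\qquad B = \begin{bmatrix} -\gamma \\ 0 \end{bmatrix},\qquad C = \begin{bmatrix} 1 & 0 \end{bmatrix},\qquad D = \begin{bmatrix} 0 \end{bmatrix}, \end{aligned}$$
for which \([D]_{1,1}=0\), so that \(u_k=\nabla f_1(y_k)\) is obtained by a direct gradient evaluation.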
2.5.3 Proximal gradient method with heavy-ball momentum terms
Let \(\gamma ,\beta _1 \in \mathbb {R}_{++}\), \(\delta _1,\delta _2\in \mathbb {R}\), \(f_{1}\in \mathcal {F}_{0,\beta _1}\) and \(f_{2}\in \mathcal {F}_{0,\infty }\). A proximal gradient method with heavy-ball momentum terms is given by
$$\begin{aligned} x_{k+1} = {\textrm{prox}}_{\gamma f_{2}}\mathord {\left( x_{k} - \gamma \nabla f_{1}(x_{k}) + \delta _{1}\mathord {\left( x_{k}-x_{k-1} \right) } \right) } + \delta _{2}\mathord {\left( x_{k}-x_{k-1} \right) }. \end{aligned}$$
By letting \(\textbf{x}_{k}=(x_{k},x_{k-1})\), \(\textbf{y}_{k}=\mathord {\left( x_k,x_{k+1}-\delta _2(x_k-x_{k-1}) \right) }\), \(\textbf{u}_{k}=(\nabla f_1(x_k), \gamma ^{-1}(x_k-\gamma \nabla f_1(x_k)+(\delta _1+\delta _2)(x_k-x_{k-1})-x_{k+1}))\), we get
where \(\varvec{\partial }\textbf{f}(\textbf{y})=\mathord {\left\{ \nabla f_{1}\mathord {\left( y^{(1)} \right) } \right\} }\times \partial f_2\mathord {\left( y^{(2)} \right) }\) for each \(\textbf{y}=\mathord {\left( y^{(1)},y^{(2)} \right) }\in \mathcal {H}^2\), which matches the form (5).
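A realization consistent with these variable definitions (again a sketch, obtained by eliminating the gradient and proximal steps) is
$$\begin{aligned} A = \begin{bmatrix} 1+\delta _1+\delta _2 & -(\delta _1+\delta _2)\\ 1 & 0 \end{bmatrix},\quad B = \begin{bmatrix} -\gamma & -\gamma \\ 0 & 0 \end{bmatrix},\quad C = \begin{bmatrix} 1 & 0\\ 1+\delta _1 & -\delta _1 \end{bmatrix},\quad D = \begin{bmatrix} 0 & 0\\ -\gamma & -\gamma \end{bmatrix}, \end{aligned}$$
where the first row of D corresponds to the gradient evaluation of \(f_1\) and the second row to the proximal evaluation of \(f_2\).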
2.5.4 Davis–Yin three-operator splitting method
Let \(\gamma ,\lambda \in \mathbb {R}_{++}\), \(0 \le \sigma _i < \beta _i \le +\infty \) and \(f_i\in \mathcal {F}_{\sigma _i,\beta _i}\) for each \(i\in \llbracket 1,3\rrbracket \), and \(\beta _{2}<\infty \). The three-operator splitting method by Davis and Yin in [12] is given by
$$\begin{aligned} \begin{aligned} x_{k}&= {\textrm{prox}}_{\gamma f_{1}}(z_{k}),\\ z_{k+\frac{1}{2}}&= 2x_{k} - z_{k} - \gamma \nabla f_{2}(x_{k}),\\ z_{k+1}&= z_{k} + \lambda \mathord {\left( {\textrm{prox}}_{\gamma f_{3}}\mathord {\left( z_{k+\frac{1}{2}} \right) } - x_{k} \right) }. \end{aligned} \end{aligned}$$
By letting \(\textbf{x}_{k} = z_k\), \(\textbf{y}_{k}=(x_k,x_k,x_k+\lambda ^{-1}(z_{k+1}-z_k))\) and \(\textbf{u}_{k}= (\gamma ^{-1}(z_k-x_k),\gamma ^{-1}(2x_k - z_k - z_{k + \frac{1}{2}}),\gamma ^{-1}(z_{k+\frac{1}{2}}-x_k-\lambda ^{-1}(z_{k+1}-z_{k})))\), we get
where \(\varvec{\partial }\textbf{f}(\textbf{y})=\partial f_1\mathord {\left( y^{(1)} \right) }\times \mathord {\left\{ \nabla f_{2}\mathord {\left( y^{(2)} \right) } \right\} }\times \partial f_3\mathord {\left( y^{(3)} \right) }\) for each \(\textbf{y}=\mathord {\left( y^{(1)},y^{(2)},y^{(3)} \right) }\in \mathcal {H}^3\), which matches the form (5).
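A realization consistent with these variable definitions (a sketch) is
$$\begin{aligned} A = \begin{bmatrix} 1 \end{bmatrix},\qquad B = \begin{bmatrix} -\lambda \gamma & -\lambda \gamma & -\lambda \gamma \end{bmatrix},\qquad C = \begin{bmatrix} 1\\ 1\\ 1 \end{bmatrix},\qquad D = \begin{bmatrix} -\gamma & 0 & 0\\ -\gamma & 0 & 0\\ -2\gamma & -\gamma & -\gamma \end{bmatrix}, \end{aligned}$$
where the diagonal of D indicates proximal evaluations of \(f_1\) and \(f_3\) and a gradient evaluation of \(f_2\).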
2.5.5 Chambolle–Pock method
Let \(\tau _1, \tau _2 \in \mathbb {R}_{++}\), \( \theta \in \mathbb {R}\), \(0 \le \sigma _i < \beta _i \le +\infty \) and \(f_i\in \mathcal {F}_{\sigma _i,\beta _i}\) for each \(i\in \llbracket 1,2\rrbracket \). The method by Chambolle and Pock in [7, Algorithm 1] is given by
By letting \(\textbf{x}_{k} = \mathord {\left( x_k, y_k \right) }\), \(\textbf{y}_{k}=( x_{k+1},\frac{1}{\tau _2} \mathord {\left( y_k - y_{k+1} \right) } + \mathord {\left( 1 + \theta \right) } x_{k+1} - \theta x_{k})\), and \(\textbf{u}_{k}= (\frac{1}{\tau _1} \mathord {\left( x_k - x_{k+1} \right) } - y_{k},y_{k+1})\), we get
where \(\varvec{\partial }\textbf{f}(\textbf{y})=\partial f_1\mathord {\left( y^{(1)} \right) }\times \partial f_2\mathord {\left( y^{(2)} \right) }\) for each \(\textbf{y}=\mathord {\left( y^{(1)},y^{(2)} \right) }\in \mathcal {H}^2\), which matches the form (5).
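A realization consistent with these variable definitions (a sketch; signs and ordering may differ from the paper's own display) is
$$\begin{aligned} A = \begin{bmatrix} 1 & -\tau _1\\ 0 & 0 \end{bmatrix},\quad B = \begin{bmatrix} -\tau _1 & 0\\ 0 & 1 \end{bmatrix},\quad C = \begin{bmatrix} 1 & -\tau _1\\ 1 & \tfrac{1}{\tau _2}-(1+\theta )\tau _1 \end{bmatrix},\quad D = \begin{bmatrix} -\tau _1 & 0\\ -(1+\theta )\tau _1 & -\tfrac{1}{\tau _2} \end{bmatrix}, \end{aligned}$$
with D lower triangular and negative on the diagonal, corresponding to proximal evaluations of \(f_1\) and \(f_2\).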
3 Interpolation
Tightness of our methodology hinges critically on so-called interpolation conditions for function classes that have been developed in the PEP literature [37, 38]. The following theorem is proved in [38, Theorem 4].
Theorem 1
Let \(0\le \sigma <\beta \le +\infty \) and \(\{(y_{i},F_{i},u_{i})\}_{i\in \mathcal {I}}\) be a finite family of triplets in \(\mathcal {H}\times \mathbb {R}\times \mathcal {H}\) indexed by \(\mathcal {I}\). Then the following are equivalent:
-
(i)
There exists \(f\in \mathcal {F}_{\sigma ,\beta }\) such that
$$\begin{aligned} f(y_i) = F_i \text { and } u_i \in \partial f (y_i) \end{aligned}$$for each \(i\in \mathcal {I}\).
-
(ii)
It holds that
$$\begin{aligned} F_{i} \ge F_{j} + \left\langle u_{j}, y_{i}-y_{j} \right\rangle + \frac{\sigma }{2}\left\Vert y_i-y_j\right\Vert ^{2}+ \frac{1}{2(\beta -\sigma )}\left\Vert u_i-u_j-\sigma (y_i-y_j)\right\Vert ^2 \end{aligned}$$for each \(i,j\in \mathcal {I}\), where \(\frac{1}{2(\beta -\sigma )}\) is interpreted as 0 in the case \(\beta = +\infty \).
Next, we adapt these interpolation conditions to our framework. In the following, we let denote the class of all mappings \(\textbf{f}:\mathcal {H}^m\rightarrow \mathord {\left( \mathbb {R}\cup \{+\infty \} \right) }^m\) defined by (3) for every possible choice of \(f_i\in \mathcal {F}_{\sigma _i,\beta _i}\) and \(i\in \llbracket 1,m\rrbracket \). Moreover, with each , we associate the mapping \(\varvec{\partial }\textbf{f}:\mathcal {H}^m\rightarrow 2^{\mathcal {H}^m}\) defined by (4).
Corollary 1
Let
for each \(l\in \llbracket 1,m\rrbracket \), where \(\otimes \) denotes the Kronecker product and \(\mathord {\left\{ e_i \right\} }_{i=1}^{m}\) denotes the standard basis vectors of \(\mathbb {R}^{m}\). Then for each finite family of triplets in \(\mathcal {H}^m\times \mathbb {R}^m\times \mathcal {H}^m\) indexed by \(\mathcal {I}\), \(\{(\textbf{y}_{i},\textbf{F}_{i},\textbf{u}_{i})\}_{i\in \mathcal {I}}\), the following are equivalent:
-
(i)
There exist such that
$$\begin{aligned} \textbf{f}(\textbf{y}_i) = \textbf{F}_{i} \text { and } \textbf{u}_{i}\in \varvec{\partial }\textbf{f}(\textbf{y}_i) \end{aligned}$$for each \(i\in \mathcal {I}\).
-
(ii)
It holds that
$$\begin{aligned} \textbf{a}_l^{\top }(\textbf{F}_{i}-\textbf{F}_{j}) + \mathcal {Q}\mathord {\left( \textbf{M}_{l},(\textbf{y}_i-\textbf{y}_j,\textbf{u}_i,\textbf{u}_j) \right) } \le 0 \end{aligned}$$for each \(i,j\in \mathcal {I}\) and \(l\in \llbracket 1,m\rrbracket \).
Moreover,
for each \(\textbf{u}\in \mathcal {H}^{m}\) and \(l\in \llbracket 1,m\rrbracket \).
4 Lyapunov inequalities
Convergence properties of many first-order methods can be analyzed via so-called Lyapunov inequalities. We consider Lyapunov inequalities of the form
$$\begin{aligned} V(\varvec{\xi }_{k+1},\varvec{\xi }_\star ) \le \rho V(\varvec{\xi }_{k},\varvec{\xi }_\star ) - R(\varvec{\xi }_{k},\varvec{\xi }_\star ), \end{aligned}$$(25)
where \(\rho \in [0,1]\), \(\varvec{\xi }_k=(\textbf{x}_k,\textbf{u}_k,\textbf{y}_k,\textbf{F}_k)\in \mathcal {S}\) contains all algorithm variables in iteration k, \(\varvec{\xi }_{k+1}=(\textbf{x}_{k+1},\textbf{u}_{k+1},\textbf{y}_{k+1},\textbf{F}_{k+1})\in \mathcal {S}\) contains all algorithm variables in iteration \(k+1\), \(\varvec{\xi }_\star =(\textbf{x}_{\star },\textbf{u}_{\star },\textbf{y}_{\star },\textbf{F}_{\star })\in \mathcal {S}\) is a fixed point, \(V:\mathcal {S}\times \mathcal {S}\rightarrow \mathbb {R}\) is called a Lyapunov function, \(R:\mathcal {S}\times \mathcal {S}\rightarrow \mathbb {R}\) is called a residual function, and \(\mathcal {S}=\mathcal {H}^{n}\times \mathcal {H}^{m}\times \mathcal {H}^{m}\times \mathbb {R}^m\). Once such an inequality has been established, various convergence properties may be concluded depending on the properties of the functions V and R.
We consider quadratic ansatzes of the functions V and R given by
$$\begin{aligned} V(\varvec{\xi },\varvec{\xi }_\star ) = \mathcal {Q}(Q,(\textbf{x}-\textbf{x}_{\star },\textbf{u},\textbf{u}_{\star }))+q^{\top }(\textbf{F}-\textbf{F}_{\star }) \end{aligned}$$(26)
$$\begin{aligned} R(\varvec{\xi },\varvec{\xi }_\star ) = \mathcal {Q}(S,(\textbf{x}-\textbf{x}_{\star },\textbf{u},\textbf{u}_{\star }))+s^{\top }(\textbf{F}-\textbf{F}_{\star }) \end{aligned}$$(27)
for each \(\varvec{\xi },\varvec{\xi }_\star \in \mathcal {S}\), respectively, where \(Q,S\in \mathbb {S}^{n+2\,m}\) and \(q,s\in \mathbb {R}^{m}\) parameterize the functions. These are general ansatzes that allow for arbitrary linear combinations of scalar products between linear combinations of \(x^{(i)}-x_\star ^{(i)}\), \(u^{(i)}\), and \(u_\star ^{(i)}\) and linear combinations of function value differences \(f_i\mathord {\left( y^{(i)} \right) }-f_i\mathord {\left( y_\star ^{(i)} \right) }\).
To draw useful convergence conclusions from (25), we enforce nonnegative quadratic lower bounds on V and R given by
$$\begin{aligned} V(\varvec{\xi },\varvec{\xi }_\star ) \ge \mathcal {Q}(P,(\textbf{x}-\textbf{x}_{\star },\textbf{u},\textbf{u}_{\star }))+p^{\top }(\textbf{F}-\textbf{F}_{\star }) \ge 0 \end{aligned}$$(28)
$$\begin{aligned} R(\varvec{\xi },\varvec{\xi }_\star ) \ge \mathcal {Q}(T,(\textbf{x}-\textbf{x}_{\star },\textbf{u},\textbf{u}_{\star }))+t^{\top }(\textbf{F}-\textbf{F}_{\star }) \ge 0 \end{aligned}$$(29)
where \(P,T\in \mathbb {S}^{n+2\,m}\) and \(p,t\in \mathbb {R}^{m}\). We do not enforce these inequalities on all of \(\mathcal {S}\times \mathcal {S}\) but only when the first argument is a so-called algorithm consistent point and the second argument satisfies the fixed-point equations (6).
Definition 2
(Algorithm consistency) Consider algorithm (5). The point \(\varvec{\xi }=(\textbf{x},\textbf{u},\textbf{y},\textbf{F})\in \mathcal {S}\) is called algorithm-consistent for \(\textbf{f}\) if
$$\begin{aligned} \textbf{y}= (C\otimes \mathop {\textrm{Id}}\limits )\textbf{x}+ (D\otimes \mathop {\textrm{Id}}\limits )\textbf{u},\qquad \textbf{u}\in \varvec{\partial }\textbf{f}(\textbf{y}),\qquad \text {and}\qquad \textbf{F}= \textbf{f}(\textbf{y}). \end{aligned}$$
Restricting (28) and (29) to this subset of \(\mathcal {S}\times \mathcal {S}\) gives a larger class of Lyapunov functions and residual functions compared to requiring them to hold on all of \(\mathcal {S}\times \mathcal {S}\).
In the proposed methodology, the user specifies \((P,p,T,t,\rho )\) and the methodology provides (Q, q, S, s) complying with (25), (28), and (29), if it exists. When such a (Q, q, S, s) exists, the choice of \((P,p,T,t,\rho )\) decides which convergence properties the analysis implies.
-
(i)
Suppose that \(\rho \in [0,1[\). Then
$$\begin{aligned} 0\le \mathcal {Q}(P,(\textbf{x}_k-\textbf{x}_{\star },\textbf{u}_k,\textbf{u}_{\star }))+p^{\top }(\textbf{F}-\textbf{F}_{\star })\le V(\varvec{\xi }_k,\varvec{\xi }_\star )\le \rho ^k V(\varvec{\xi }_0,\varvec{\xi }_\star )\rightarrow 0 \end{aligned}$$as \(k\rightarrow \infty \). In particular,
$$\begin{aligned} \mathord {\left\{ \mathcal {Q}(P,(\textbf{x}_k-\textbf{x}_{\star },\textbf{u}_k,\textbf{u}_{\star }))+p^{\top }(\textbf{F}_k-\textbf{F}_{\star }) \right\} }_{k\in \mathbb {N}_0}\quad \text {converges }\rho \text {-linearly to zero.} \end{aligned}$$(30) -
(ii)
Suppose that \(\rho =1\). Then
$$\begin{aligned} \sum _{k=0}^{\infty }\mathord {\left( \mathcal {Q}(T,(\textbf{x}_k-\textbf{x}_{\star },\textbf{u}_k,\textbf{u}_{\star }))+t^{\top }(\textbf{F}_k-\textbf{F}_{\star }) \right) }\le \sum _{k=0}^{\infty }R(\varvec{\xi }_k,\varvec{\xi }_\star )\le V(\varvec{\xi }_0,\varvec{\xi }_\star ), \end{aligned}$$(31)using a telescoping summation argument. In particular,
$$\begin{aligned} \mathord {\left\{ \mathcal {Q}(T,(\textbf{x}_k-\textbf{x}_{\star },\textbf{u}_k,\textbf{u}_{\star }))+t^{\top }(\textbf{F}_k-\textbf{F}_{\star }) \right\} }_{k\in \mathbb {N}_0}\quad \text {is summable and converges to zero.} \end{aligned}$$(32)
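For completeness, the telescoping summation argument behind (31) can be written out as follows: since \(\rho =1\), C1 gives \(V(\varvec{\xi }_{k+1},\varvec{\xi }_\star )\le V(\varvec{\xi }_{k},\varvec{\xi }_\star )-R(\varvec{\xi }_{k},\varvec{\xi }_\star )\) for each \(k\in \mathbb {N}_0\), and summing over \(k=0,\ldots ,K\) yields
$$\begin{aligned} \sum _{k=0}^{K}R(\varvec{\xi }_k,\varvec{\xi }_\star ) \le \sum _{k=0}^{K}\mathord {\left( V(\varvec{\xi }_k,\varvec{\xi }_\star )-V(\varvec{\xi }_{k+1},\varvec{\xi }_\star ) \right) } = V(\varvec{\xi }_0,\varvec{\xi }_\star )-V(\varvec{\xi }_{K+1},\varvec{\xi }_\star ) \le V(\varvec{\xi }_0,\varvec{\xi }_\star ), \end{aligned}$$
where the last inequality uses \(V(\varvec{\xi }_{K+1},\varvec{\xi }_\star )\ge 0\) from C2; letting \(K\rightarrow \infty \) gives (31).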
Therefore, \((P,p,T,t,\rho )\) needs to be chosen to extract interesting convergence results from the lower bounds. If \(P=T=0\) and \(p=t=0\), then V and R equal to the zero function gives a valid Lyapunov inequality (25) that complies with the lower bounds (28) and (29), but is of no interest. Useful choices of \((P,p,T,t,\rho )\) that imply different specific convergence results are provided in Sect. 4.1.
The above requirements on the Lyapunov inequality, the Lyapunov function, and the residual function are formalized in Definition 4 after we define the notion of a successor.
Definition 3
(Successor) Consider algorithm (5). Given an algorithm-consistent point \(\varvec{\xi }\) for some \(\textbf{f}\), we define a successor of \(\varvec{\xi }\) to be any point \(\varvec{\xi }_{+}=(\textbf{x}_{+},\textbf{u}_{+},\textbf{y}_{+},\textbf{F}_{+})\in \mathcal {S}\) such that
$$\begin{aligned} \textbf{x}_{+} = (A\otimes \mathop {\textrm{Id}}\limits )\textbf{x}+ (B\otimes \mathop {\textrm{Id}}\limits )\textbf{u},\qquad \textbf{y}_{+} = (C\otimes \mathop {\textrm{Id}}\limits )\textbf{x}_{+} + (D\otimes \mathop {\textrm{Id}}\limits )\textbf{u}_{+},\qquad \textbf{u}_{+}\in \varvec{\partial }\textbf{f}(\textbf{y}_{+}),\qquad \textbf{F}_{+} = \textbf{f}(\textbf{y}_{+}). \end{aligned}$$
Definition 4
(Quadratic Lyapunov inequality) Let \(V:\mathcal {S}\times \mathcal {S}\rightarrow \mathbb {R}\) as in (26), \(R:\mathcal {S}\times \mathcal {S}\rightarrow \mathbb {R}\) as in (27), \(P,T\in \mathbb {S}^{n+2m}\), \(p,t\in \mathbb {R}^{m}\) and \(\rho \in [0,1]\). We say that V and R satisfy the \(\mathord {\left( P,p,T,t,\rho \right) }\)-quadratic Lyapunov inequality for algorithm (5) over the class if:
-
C1.
\(V(\varvec{\xi }_{+},\varvec{\xi }_{\star }) \le \rho V(\varvec{\xi },\varvec{\xi }_{\star })-R(\varvec{\xi },\varvec{\xi }_{\star })\) for each \(\varvec{\xi }\in \mathcal {S}\) that is algorithm-consistent for \(\textbf{f}\), each successor \(\varvec{\xi }_{+}\in \mathcal {S}\) of \(\varvec{\xi }\), each \(\varvec{\xi }_{\star }\in \mathcal {S}\) that satisfies (6), and each .
-
C2.
\(V(\varvec{\xi },\varvec{\xi }_{\star })\ge \mathcal {Q}(P,(\textbf{x}-\textbf{x}_{\star },\textbf{u},\textbf{u}_{\star }))+p^{\top }(\textbf{F}-\textbf{F}_\star )\ge 0\) for each \(\varvec{\xi }\in \mathcal {S}\) that is algorithm-consistent for \(\textbf{f}\), each \(\varvec{\xi }_{\star }\in \mathcal {S}\) that satisfies (6), and each .
-
C3.
\(R(\varvec{\xi },\varvec{\xi }_{\star })\ge \mathcal {Q}(T,(\textbf{x}-\textbf{x}_{\star },\textbf{u},\textbf{u}_{\star }))+t^{\top }(\textbf{F}-\textbf{F}_\star )\ge 0\) for each \(\varvec{\xi }\in \mathcal {S}\) that is algorithm-consistent for \(\textbf{f}\), each \(\varvec{\xi }_{\star }\in \mathcal {S}\) that satisfies (6), and each .
The main result in Sect. 5 is a necessary and sufficient condition for the existence of a \((P,p,T,t,\rho )\)-quadratic Lyapunov inequality expressed as a semidefinite feasibility problem over the Lyapunov function and residual function parameters (Q, q, S, s). This is done by providing a necessary and sufficient condition for each of C1, C2, and C3. Conditions C1, C2, and C3 can all be stated as verifying that a quadratic function \(\Phi :\mathcal {S}\times \mathcal {S}\rightarrow \mathbb {R}\) is nonpositive over the subset of \(\mathcal {S}\times \mathcal {S}\) that consists of algorithm-consistent points in the first argument and fixed points in the second. Restricting to this subset adds significant technical complication compared to verifying nonpositivity over the entirety of \(\mathcal {S}\times \mathcal {S}\), but provides the added benefit of a more general Lyapunov analysis.
4.1 Lower bounds and convergence implications
In this section, we provide a few choices of \((P,p,T,t,\rho )\) from which we can draw specific convergence results, under the assumption that there exists a Lyapunov function V and a residual function R that satisfy the \(\mathord {\left( P,p,T,t,\rho \right) }\)-quadratic Lyapunov inequality. Moreover, we assume that Assumptions 1 and 2 hold.
4.1.1 Linear convergence of the distance to the solution
Suppose that \(\rho \in [0,1[\) and
for some \(i\in \llbracket 1,m\rrbracket \), where \(\mathord {\left\{ e_i \right\} }_{i=1}^{m}\) denotes the standard basis vectors of \(\mathbb {R}^{m}\). Then (30) implies that the squared distance to the solution \(\mathord {\left\{ \left\Vert y^{(i)}_{k} - y_{\star }\right\Vert ^{2} \right\} }_{k\in \mathbb {N}_0}\) converges \(\rho \)-linearly to zero, where \(y_{\star }\) is the solution to (2), since
Note that we exclude the case \(\rho = 1\) since then we can only guarantee that the squared distance to the solution remains bounded, not that it converges linearly.
4.1.2 \(\mathcal {O}\mathord {\left( 1/k \right) }\) ergodic convergence
Suppose that \(\rho = 1\).
Function value suboptimality (\(m=1\)).
Suppose that \(m=1\) and
Then (32) implies that the function value suboptimality \(\mathord {\left\{ f_1\mathord {\left( y^{(1)}_{k} \right) } - f_1\mathord {\left( y_{\star } \right) } \right\} }_{k\in \mathbb {N}_0}\) converges to zero, since
Moreover, (31) and Jensen’s inequality imply that the ergodic function value suboptimality
converges to zero with rate \(\mathcal {O}\mathord {\left( 1/k \right) }\) since
Duality gap. Suppose that
Then
since \(\sum _{i=1}^{m}u_\star ^{(i)}=0\) and \(y_\star ^{(1)}=\ldots =y_\star ^{(m)}=y_\star \) (all fixed points are fixed-point encodings). The quantity in (36) is known as the duality gap. Note that if \(m=1\), the duality gap reduces to function value suboptimality. The duality gap is in fact a natural generalization of the function value suboptimality, which we motivate next (see also, e.g., [8], [2, Theorem 3.9], and [1, Section 3.1]). Problem (1) can equivalently be written as
$$\begin{aligned} \underset{\textbf{y}=\mathord {\left( y^{(1)},\ldots ,y^{(m)} \right) }\in \mathcal {H}^{m}}{\text {minimize}}\quad \sum _{i=1}^{m}f_{i}\mathord {\left( y^{(i)} \right) }\quad \text {subject to}\quad y^{(1)}=\cdots =y^{(m)}. \end{aligned}$$
It has the Lagrangian function \(\mathcal {L}:\mathcal {H}^m\times \mathcal {H}^m\rightarrow \mathbb {R}\) given by
where \(\textbf{y}=\mathord {\left( y^{(1)},\ldots ,y^{(m)} \right) }\in \mathcal {H}^m\), and \(\textbf{u}=\mathord {\left( u^{(1)},\ldots ,u^{(m)} \right) }\in \mathcal {H}^m\) are the dual variables. The Lagrangian function satisfies
for each \(\textbf{y},\textbf{u}\in \mathcal {H}^{m}\). In particular, \(\mathcal {L}(\textbf{y}_k,\textbf{u}_\star )-\mathcal {L}(\textbf{y}_\star ,\textbf{u}_k)\) is equal to (36) and (32) implies that the duality gap \(\mathord {\left\{ \mathcal {L}(\textbf{y}_k,\textbf{u}_\star )-\mathcal {L}(\textbf{y}_\star ,\textbf{u}_k) \right\} }_{k\in \mathbb {N}_0}\) converges to zero. Moreover, (31) and Jensen’s inequality imply that the ergodic duality gap
converges to zero with rate \(\mathcal {O}\mathord {\left( 1/k \right) }\).
5 Main result
This section provides a necessary and sufficient condition, in terms of the feasibility of a semidefinite program, for the existence of a quadratic Lyapunov inequality in the sense of Definition 4. First, we introduce some necessary notation. Recall \(N\in \mathbb {R}^{m\times (m-1)}\) defined in (10) when \(m>1\). For all the matrices defined below, the interpretation is that the block column containing \(N\) is removed when \(m=1\). Let
where \(E_{i,j}\in \mathbb {R}^{3m\times (n + 3m-1)}\) for each distinct \(i,j\in \{\text {\o },+,\star \}\), and
where \(H_{i,j}\in \mathbb {R}^{m\times 2m}\) for each distinct \(i,j\in \{\text {\o },+,\star \}\). Define
for each distinct \(i,j\in \{\text {\o },+,\star \}\) and \(l\in \llbracket 1,m\rrbracket \), where the \(\textbf{M}_l\)’s and \(\textbf{a}_{l}\)’s are defined in (23) and (24), respectively. Moreover, let
where \(\Sigma _i\in \mathbb {R}^{(n+2\,m)\times (n+3\,m-1)}\) for each \(i\in \{\text {\o },+\}\).
Theorem 2
(Main result) Assume that Assumption 1 and Assumption 2 hold, let \(\rho \in [0,1]\), and suppose that \(P,T\in \mathbb {S}^{n+2\,m}\) and \(p,t\in \mathbb {R}^{m}\) are such that
$$\begin{aligned} \mathcal {Q}(P,(\textbf{x}-\textbf{x}_{\star },\textbf{u},\textbf{u}_{\star }))+p^{\top }(\textbf{F}-\textbf{F}_{\star })\ge 0 \qquad \text {and}\qquad \mathcal {Q}(T,(\textbf{x}-\textbf{x}_{\star },\textbf{u},\textbf{u}_{\star }))+t^{\top }(\textbf{F}-\textbf{F}_{\star })\ge 0 \end{aligned}$$
for each \(\varvec{\xi }\in \mathcal {S}\) that is algorithm-consistent for \(\textbf{f}\), each \(\varvec{\xi }_{\star }\in \mathcal {S}\) that satisfies (6), and each . Then a sufficient condition for there to exist a Lyapunov function \(V:\mathcal {S}\times \mathcal {S}\rightarrow \mathbb {R}\) as in (26) and a residual function \(R:\mathcal {S}\times \mathcal {S}\rightarrow \mathbb {R}\) as in (27) such that they satisfy the \(\mathord {\left( P,p,T,t,\rho \right) }\)-quadratic Lyapunov inequality for algorithm (5) over the class is that the following system of constraints
is feasible for the scalars \(\lambda _{(l,i,j)}^{{\textrm{C}}1}\), \(\lambda _{(l,i,j)}^{{\textrm{C}}2}\),\(\lambda _{(l,i,j)}^{{\textrm{C}}3}\), matrices Q and S, and vectors q and s. Moreover, if \(\dim (\mathcal {H})\ge n+3m-1\) and there exists \(G\in \mathbb {S}_{++}^{n+3m-1}\) and \(\varvec{\chi }\in \mathbb {R}^{2m}\) such that
then the feasibility of (42) is also a necessary condition.
The proof of Theorem 2 is based on finding, for each of C1, C2, and C3 in Definition 4, the relevant conditions and then combining these conditions to give (42); (42a) corresponds to C1, (42b) corresponds to C2, and (42c) corresponds to C3.
Finding the conditions for C1, C2 and C3 is done in the spirit of PEP by finding the worst-case behavior over algorithm-consistent points, their successors, fixed points (6), and mappings in the class . In particular, given some objective function \(\Phi \), the performance estimation problem we consider is
where we maximize over all variables except A, B, C, D, and \(\Phi \). Let \(S_{\Phi }^\star \) be the optimal value of (PEP).
We consider objective functions \(\Phi \) in (PEP) of the form
$$\begin{aligned} \Phi \mathord {\left( \varvec{\xi },\varvec{\xi }_{+},\varvec{\xi }_{\star } \right) } = \mathcal {Q}(Q_{\text {\o }},(\textbf{x}-\textbf{x}_{\star },\textbf{u},\textbf{u}_{\star }))+q_{\text {\o }}^{\top }(\textbf{F}-\textbf{F}_{\star }) + \mathcal {Q}(Q_{+},(\textbf{x}_{+}-\textbf{x}_{\star },\textbf{u}_{+},\textbf{u}_{\star }))+q_{+}^{\top }(\textbf{F}_{+}-\textbf{F}_{\star }), \end{aligned}$$(44)
parameterized by \(Q_{\text {\o }},Q_{+}\in \mathbb {S}^{n+2m}\) and \(q_{\text {\o }},q_{+}\in \mathbb {R}^{m}\). For each \(cond \in \mathord {\left\{ \textrm{C}1,\textrm{C}2,\textrm{C}3 \right\} }\) separately, the parameters \(Q_{\text {\o }}\), \(Q_{+}\), \(q_{\text {\o }}\) and \(q_{+}\) are chosen such that \(S_{\Phi }^{\star }\le 0\) is a necessary and sufficient condition for cond to hold.
Before we proceed, we reformulate (PEP), and in order to do so we introduce some helpful notation. We let
where \(Q_{\text {\o }}\), \(Q_{+}\), \(q_{\text {\o }}\), and \(q_{+}\) are the parameters in the objective function \(\Phi \) given in (44), and \(\Sigma _{\text {\o }}\) and \(\Sigma _{+}\) are given in (41).
Lemma 1
Let \(\mathcal {I} = \mathord {\left\{ \text {\o },\star \right\} }\) or \(\mathcal {I} = \mathord {\left\{ \text {\o },+,\star \right\} }\), \(S_{\Phi }^\star \) the optimal value of (PEP), and assume that Assumptions 1 and 2 hold. Suppose that \(\Phi \) is of the form (44) and that the right-hand side of (44) only depends on variables with indices in the set \(\mathcal {I}\) (a variable without a subscript is interpreted to have index ø). A sufficient condition for \(S_{\Phi }^\star \le 0\) is that the following system
is feasible for the scalars \(\lambda _{(l,i,j)}\). Furthermore, if \(\dim (\mathcal {H})\ge n+3m-1\), and there exists \(G\in \mathbb {S}_{++}^{n+3m-1}\) and \(\varvec{\chi }\in \mathbb {R}^{2m}\) such that
then (46) is a necessary condition.
Proof
(Proof sketch) The full proof is provided in Sect. 7. The proof first reformulates (PEP) as a semidefinite program, forms the dual problem, which is equal to the feasibility problem (46), and shows strong duality when \(\dim (\mathcal {H})\ge n+3m-1\) and (47) holds. \(\square \)
Proof of Theorem 2
First, suppose that the parameters (Q, q, S, s) are fixed in some Lyapunov function \(V:\mathcal {S}\times \mathcal {S}\rightarrow \mathbb {R}\) as in (26) and some residual function \(R:\mathcal {S}\times \mathcal {S}\rightarrow \mathbb {R}\) as in (27). We consider when V and R satisfy the \(\mathord {\left( P,p,T,t,\rho \right) }\)-quadratic Lyapunov inequality.
C1 holds if \(S_{\Phi }^{\star }\le 0\) for the choice \(Q_{\text {\o }}=S-\rho Q\), \(q_{\text {\o }}=s-\rho q\), \(Q_{+}=Q\), and \(q_{+}=q\), which in turn holds if (42a) is feasible, according to Lemma 1.
C2 holds if \(S_{\Phi }^{\star }\le 0\) for the choice \(Q_{\text {\o }}=P-Q\), \(q_{\text {\o }}=p-q\), \(Q_{+}=0\) and \(q_{+}=0\), which in turn holds if (42b) is feasible, according to Lemma 1.
C3 holds if \(S_{\Phi }^{\star }\le 0\) for the choice \(Q_{\text {\o }}=T-S\), \(q_{\text {\o }}=t-s\), \(Q_{+}=0\) and \(q_{+}=0\), which in turn holds if (42c) is feasible, according to Lemma 1.
If in addition \(\dim (\mathcal {H})\ge n+3m-1\) and (43) hold, then Lemma 1 gives that feasibility of (42a)–(42c) is a necessary condition for C1, C2 and C3 to hold simultaneously.
Second, note that the proof is complete if we leave the parameters (Q, q, S, s) free, as in (42d)–(42e).
6 Numerical examples
The necessary and sufficient condition (42) in Theorem 2 for the existence of a Lyapunov inequality is a semidefinite program of size \(n+2m\) (which is below ten for all examples in Sect. 2.5) and is readily solved by standard solvers. We apply Theorem 2 to each example in Sect. 2.5 in two different ways:
-
B1.
We find the smallest possible \(\rho \in [0,1[\), via bisection search, such that a \(\mathord {\left( P,p,T,t,\rho \right) }\)-Lyapunov inequality exists, where \(\mathord {\left( P,p,T,t \right) }\) is chosen as in (33), which implies that the squared distance to the solution converges \(\rho \)-linearly to zero. The tolerance for the bisection search is set to 0.001 and i is set to 1 in (33) for all examples. A schematic sketch of this bisection search is given after this list.
- B2. We fix \(\rho = 1\) and find a range of algorithm parameters for which there exists a \(\mathord {\left( P,p,T,t,\rho \right) }\)-Lyapunov inequality, where \(\mathord {\left( P,p,T,t \right) }\) is chosen as in (34) if \(m=1\) and (35) if \(m>1\), implying (\(\mathcal {O}\mathord {\left( 1/k \right) }\) ergodic) convergence of the function value suboptimality and the duality gap, respectively. The parameter range is evaluated on a square grid with spacing \(0.01\times 0.01\). A schematic of both procedures is sketched below.
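Procedures B1 and B2 only require repeated feasibility checks of the semidefinite program (42) in Theorem 2. The following minimal Python sketch illustrates the two procedures; the oracle `is_feasible(rho, params)`, which should assemble and solve (42) for a given rate \(\rho \) and given algorithm parameters (e.g., with a standard SDP solver), is a hypothetical placeholder and is not implemented here.

```python
# Minimal sketch of procedures B1 and B2. The feasibility oracle is passed in as a
# callable; it is assumed (hypothetically) to build and solve the semidefinite
# program (42) of Theorem 2 and to report whether it is feasible.

from itertools import product
from typing import Callable, Iterable, Optional


def smallest_rho(is_feasible: Callable[[float, dict], bool],
                 params: dict, tol: float = 1e-3) -> Optional[float]:
    """B1: bisection over rho in [0, 1) for the smallest certified linear rate."""
    lo, hi = 0.0, 1.0
    if not is_feasible(hi, params):      # no certificate even for rho close to 1
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_feasible(mid, params):
            hi = mid                     # certificate exists; try a smaller rate
        else:
            lo = mid                     # infeasible; the rate must be larger
    return hi


def feasible_region(is_feasible: Callable[[float, dict], bool],
                    grid1: Iterable[float], grid2: Iterable[float],
                    make_params: Callable[[float, float], dict]) -> list:
    """B2: fix rho = 1 and collect all grid points whose SDP (42) is feasible."""
    return [(p1, p2) for p1, p2 in product(grid1, grid2)
            if is_feasible(1.0, make_params(p1, p2))]
```

The bisection in B1 implicitly assumes that feasibility is monotone in \(\rho \) (a certificate for some \(\rho \) remains valid for larger \(\rho \)); with the tolerance \(10^{-3}\) it terminates after roughly ten feasibility checks per parameter choice.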
6.1 Douglas–Rachford method
Consider the Douglas–Rachford method in Sect. 2.5.1 in the case when \(f_1\in \mathcal {F}_{1,2}\), \(f_2\in \mathcal {F}_{0,\infty }\) and \(\lambda = 1\). Figure 1 shows the \(\rho \) we obtain via B1. In particular, note that we recover the already known tight rates given in [19, Theorem 2].
6.2 Proximal gradient method with heavy-ball momentum
Consider the gradient method with heavy-ball momentum in Sect. 2.5.2 and the proximal operator extension in Sect. 2.5.3. Note that the method in Sect. 2.5.3 reduces to the one in Sect. 2.5.2 if the proximal operator is removed and either \(\delta _1=0\) or \(\delta _2=0\).
Figure 2a contains the parameter region we obtain via B2 for the gradient method with heavy-ball momentum when \(f_1\in \mathcal {F}_{0,1}\). Note that we improve on the parameter region given in [17] that guarantees \(\mathcal {O}\mathord {\left( 1/k \right) }\) ergodic convergence of the function value suboptimality.
Figure 2b contains the parameter region we obtain via B2 for the (proximal) gradient method with heavy-ball momentum when \(f_1\in \mathcal {F}_{0,1}\) (and \(f_2\in \mathcal {F}_{0,\infty }\)). In particular, note how the feasible parameter region is affected by adding a proximal term—having the momentum term inside the proximal evaluation (\(\delta _2=0\)) gives a slightly smaller region, and having it outside (\(\delta _1=0\)) makes it even smaller.
Figure 2c shows the \(\rho \) we obtain via B1 for the gradient method with heavy-ball momentum when \(f_1\in \mathcal {F}_{1,10}\). Note that we improve both on the rates given in [17] and on the range of allowable momentum parameters \(\delta \) that guarantee linear convergence.
6.3 Davis–Yin three-operator splitting method
Consider the three-operator splitting method by Davis–Yin in Sect. 2.5.4 in the case when \(f_1\in \mathcal {F}_{0,\beta _1}\), \(f_2\in \mathcal {F}_{1,2}\), \(f_3\in \mathcal {F}_{0,\infty }\), \(\gamma = 1/2\), and \(\lambda = 1\). Figure 3 shows the \(\rho \) we obtain via B1. In particular, note that we improve on the rates given in [11, Theorem D.6] and [31, Theorem 3].
6.4 Chambolle–Pock method
Consider the special case of the Chambolle–Pock method when the linear operator is restricted to be the identity operator \(\mathop {\textrm{Id}}\limits \), as presented in Sect. 2.5.5. In this setting, standard convergence proofs, e.g., the ones in [7], allow for \(\theta =1\) and \(\tau _1,\tau _2>0\) satisfying \(\tau _1\tau _2<1\).
Figure 4a shows the range of parameters \(\theta \), \(\tau _1\), and \(\tau _2\) when \(\tau _1=\tau _2\ge 0.5\) that we obtain via B2 when \(f_1,f_2\in \mathcal {F}_{0,\infty }\). This is a significantly larger region than what traditional analyses allow for. In particular, we see that \(\theta \ne 1\) is a valid choice and that \(\tau _1\tau _2>1\) is also valid for many choices of \(\theta \). Moreover, for comparison, Fig. 4a also contains the region if we add the additional restriction in (42) that
where \(Q_{xx}\in \mathbb {S}^{n}\) and modify P in B2 so that
where I is the identity matrix of size \(n\times n\). With these additional restrictions, we recover exactly the traditional convergence region.
Figure 4b shows the \(\rho \) that we obtain via B1 when \(f_1,f_2\in \mathcal {F}_{0.05,50}\) in the region where \(\tau _1=\tau _2\ge 0.5\). In particular, we note that the smallest \(\rho \) is obtained for the parameters \(\tau _1=\tau _2=1.6\) and \(\theta =0.22\), giving a value of \(\rho =0.8812\). If we restrict to the feasible parameter region in Fig. 4a, the optimal parameters are \(\tau _1=\tau _2=1.5\) and \(\theta =0.35\) with \(\rho =0.8891\). Both these rates are significantly better than what can be achieved with traditional parameter choices, where the optimal choice is \(\tau _1=\tau _2=0.99\) and \(\theta =1\), giving \(\rho =0.9266\).
7 Proof of Lemma 1
We prove Lemma 1 only in the case \(\mathcal {I} = \mathord {\left\{ \text {\o },+,\star \right\} }\), as the case \(\mathcal {I} = \mathord {\left\{ \text {\o },\star \right\} }\) is analogous. Recall that we assume that Assumption 1 and Assumption 2 hold.
Proof of Lemma 1
We prove Lemma 1 in a sequence of steps:
Formulating the primal semidefinite program. Recall that \(S_{\Phi }^\star \) is the optimal value of (PEP). By Corollary 1, the constraints of (PEP) can equivalently be written as
By Corollary 1, the last three constraints can be dropped, since they encode \(0\le 0\). By inserting the \(\textbf{y}\), \(\textbf{y}_{+}\), and \(\textbf{y}_{\star }\) equalities and using the notation \(X_{\mathop {\textrm{Id}}\limits }=(X\otimes \mathop {\textrm{Id}}\limits )\), the constraints in (50) can be written as
Using the equality \(\textbf{x}-\textbf{x}_{+} = (\textbf{x}- \textbf{x}_{\star }) - (A_{\mathop {\textrm{Id}}\limits }(\textbf{x}- \textbf{x}_{\star }) + B_{\mathop {\textrm{Id}}\limits }(\textbf{u}- \textbf{u}_{\star }))\) in the first two inequalities and inserting the \(\textbf{x}_{+}\) and \(\textbf{x}_\star \) equalities in the last two inequalities, (51) can equivalently be written as
and using the same equality \(\textbf{x}_{+}-\textbf{x}_\star =A_{\mathop {\textrm{Id}}\limits }(\textbf{x}- \textbf{x}_{\star }) + B_{\mathop {\textrm{Id}}\limits }(\textbf{u}- \textbf{u}_{\star })\), the objective function \(\Phi (\varvec{\xi },\varvec{\xi }_{+},\varvec{\xi }_{\star })\) of (PEP), given in (44), can be written as
Therefore, the first equality in (52) can be dropped since nothing else in (52) and (53) depends on \(\textbf{x}_+\). Moreover, by replacing \(\textbf{x}- \textbf{x}_{\star }\) with \(\Delta \textbf{x}\), we get that (52) can equivalently be written as
and that (53) can equivalently be written as
The first line in (54) and (12) in Assumption 1 imply that
for some \({\hat{\textbf{u}}}_\star \in \mathcal {H}^{m-1}\), where \(N\) is defined in (10). This implies that the first line in (54) can be written as \(\textbf{x}_{\star } = A_{\mathop {\textrm{Id}}\limits }\textbf{x}_{\star }\) if \(m=1\) and as \(\textbf{x}_{\star } = A_{\mathop {\textrm{Id}}\limits }\textbf{x}_{\star } + \mathord {\left( BN \right) }_{\mathop {\textrm{Id}}\limits }{\hat{\textbf{u}}}_{\star }\) if \(m>1\). Moreover, note that nothing else in (54) and (55) depends on \(\textbf{x}_{\star }\). Therefore, \(\textbf{x}_{\star } = 0\) is a valid choice in the \(m=1\) case, and in the \(m>1\) case, (11) in Assumption 1 gives that the first line in (54) can be dropped, since for each \({\hat{\textbf{u}}}_\star \in \mathcal {H}^{m-1}\) there exists an \(\textbf{x}_{\star }\in \mathcal {H}^{n}\) such that \(\textbf{x}_{\star } = A_{\mathop {\textrm{Id}}\limits }\textbf{x}_{\star } + \mathord {\left( BN \right) }_{\mathop {\textrm{Id}}\limits }{\hat{\textbf{u}}}_{\star }\) is satisfied. Hence, (54) can equivalently be written as
and (55) can equivalently be written as
If we let
and use \(\Sigma _{\text {\o }}\) and \(\Sigma _{+}\) defined in (41), (57) can equivalently be written as
where \(\textbf{Q}\) and \(\textbf{q}\) are defined in (45). Using \(E_{i,j}\) and \(H_{i,j}\) defined in (38) and (39), respectively, (56) can equivalently be written as
which with \(\textbf{M}_{(l,i,j)}=E_{i,j}^{\top }\textbf{M}_lE_{i,j}\) and \(\textbf{a}_{(l,i,j)}=H_{i,j}^{\top }\textbf{a}_l\) [also defined in (40)] is equivalent to
The equivalent reformulations (58) and (59) give that (PEP) can be written as
We define the Gramian function \(g:\mathcal {H}^k\rightarrow \mathbb {S}^k_{+}\) such that \([g(\textbf{z})]_{i,j}=\left\langle z^{(i)}, z^{(j)} \right\rangle \) for each \(i,j\in \llbracket 1,k\rrbracket \) and \(\textbf{z}=(z^{(1)},\ldots ,z^{(k)})\in \mathcal {H}^k\). If \(M\in \mathbb {S}^{k}\) and \(\textbf{z}\in \mathcal {H}^k\), then \(\mathcal {Q}(M,\textbf{z})={\textrm{trace}}\mathord {\left( Mg(\textbf{z}) \right) }\). Using this identity, (60) can be written as
with optimal value equal to \(S_{\Phi }^\star \). The problem
is a relaxation of (61) and therefore has optimal value greater than or equal to \(S_{\Phi }^\star \).
We will make use of the following fact: If \(\dim \,\mathcal {H}\ge k\), then \(G\in \mathbb {S}_+^k\) if and only if there exists \(\textbf{z}\in \mathcal {H}^k\) such that \(G=g(\textbf{z})\). [34, Lemma 3.1] shows the result for the case \(k=4\) and is based on the Cholesky decomposition of positive semidefinite matrices. The general case is a straightforward extension. This fact implies that if \(\dim (\mathcal {H})\ge n+3m-1\), then (62) has optimal value equal to \(S_{\Phi }^\star \). Note that (62) is a semidefinite program.
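The following small numerical illustration (assuming the finite-dimensional case \(\mathcal {H}=\mathbb {R}^k\), and not part of the proof) verifies both directions of this fact, as well as the identity \(\mathcal {Q}(M,\textbf{z})={\textrm{trace}}\mathord {\left( Mg(\textbf{z}) \right) }\) when \(\mathcal {Q}(M,\textbf{z})\) is interpreted as the quadratic form \(\sum _{i,j}M_{ij}\left\langle z^{(i)}, z^{(j)} \right\rangle \): a symmetric factorization of a positive semidefinite \(G\) produces vectors whose Gramian is \(G\), playing the role of the Cholesky-type argument of [34, Lemma 3.1].

```python
# Numerical illustration of the Gramian fact: for H = R^k, G is positive
# semidefinite if and only if G = g(z) for some z = (z_1, ..., z_k) in H^k.

import numpy as np


def gramian(Z: np.ndarray) -> np.ndarray:
    """g(z): the k x k matrix with entries <z_i, z_j>; the rows of Z are the z_i."""
    return Z @ Z.T


def vectors_from_psd(G: np.ndarray) -> np.ndarray:
    """Given a positive semidefinite G, return Z whose rows satisfy g(Z) = G."""
    w, V = np.linalg.eigh(G)            # eigendecomposition handles the semidefinite case
    w = np.clip(w, 0.0, None)           # remove tiny negative eigenvalues from round-off
    return V @ np.diag(np.sqrt(w))


rng = np.random.default_rng(0)
k = 5
A = rng.standard_normal((k, k))
G = A @ A.T                              # an arbitrary positive semidefinite matrix
Z = vectors_from_psd(G)
assert np.allclose(gramian(Z), G)        # G is the Gramian of the rows of Z

# The identity Q(M, z) = trace(M g(z)): the quadratic form sum_{i,j} M_ij <z_i, z_j>
# equals trace(M G).
M = rng.standard_normal((k, k)); M = (M + M.T) / 2
quad_form = sum(M[i, j] * (Z[i] @ Z[j]) for i in range(k) for j in range(k))
assert np.isclose(quad_form, np.trace(M @ G))
```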
Dual problem and strong duality. First, we derive the dual problem of (62). If we introduce dual variables \(\lambda _{(l,i,j)}\ge 0\) for each \(l\in \llbracket 1,m\rrbracket \) and distinct \(i,j\in \mathcal {I}\) for the inequality constraints, the objective function of the dual problem becomes
Since the dual problem is a minimization problem over the dual variables \(\lambda _{(l,i,j)}\), we conclude that it can be written as
which is a feasibility problem.
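For concreteness, a feasibility problem of this type, with nonnegative multipliers entering affinely into a linear matrix inequality and a linear equality, can be checked with an off-the-shelf modeling tool. The sketch below uses CVXPY; the placeholder data `Q_mat`, `q_vec`, `M_mats`, and `a_vecs` stand in for the quantities \(\textbf{Q}\), \(\textbf{q}\), \(\textbf{M}_{(l,i,j)}\), and \(\textbf{a}_{(l,i,j)}\), and the exact constraints and sign conventions must be taken from (63); the snippet only illustrates the mechanics of such a check, not the precise form of (63).

```python
# Generic sketch of checking a dual feasibility problem of the type (63) with CVXPY.
# The constraint structure below (an affine linear matrix inequality plus an affine
# equality in nonnegative multipliers) is assumed for illustration; the actual data
# and signs are those of (63) and are not reproduced here.

import cvxpy as cp


def dual_is_feasible(Q_mat, q_vec, M_mats, a_vecs) -> bool:
    lam = cp.Variable(len(M_mats), nonneg=True)          # one multiplier per constraint
    lmi = Q_mat + sum(lam[i] * M_mats[i] for i in range(len(M_mats)))
    eq = q_vec + sum(lam[i] * a_vecs[i] for i in range(len(a_vecs)))
    problem = cp.Problem(cp.Minimize(0), [lmi << 0, eq == 0])
    problem.solve()
    return problem.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)
```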
Next, suppose that the primal problem (62) has a Slater point, i.e., there exists \(G\in \mathbb {S}_{++}^{n+3m-1}\) and \(\varvec{\chi }\in \mathbb {R}^{2m}\) such that
Then there is no duality gap, i.e., strong duality holds, between the primal problem (62) and the dual problem (63).
Alternatives. The last step of the proof compares the optimal values of (PEP) and the dual problem (63). We have established that \(S_{\Phi }^\star \) is less than or equal to the optimal value of (63). Thus, a sufficient condition for \(S_{\Phi }^\star \le 0\) is that the dual problem (63) is feasible. In addition, if \(\dim (\mathcal {H})\ge n+3m-1\) and there exist \(G\in \mathbb {S}_{++}^{n+3m-1}\) and \(\varvec{\chi }\in \mathbb {R}^{2m}\) such that (64) holds, the above condition also becomes a necessary condition.
This concludes the proof. \(\square \)
8 Conclusions
We developed a flexible methodology for automated convergence analysis of a large class of first-order methods for solving convex optimization problems. The main result is a necessary and sufficient condition for the existence of a quadratic Lyapunov inequality within a predefined class of Lyapunov inequalities, which amounts to solving a small-sized semidefinite program. The applicability and efficacy of the methodology are demonstrated by providing several new convergence results in Sect. 6.
We mention a few possible modifications that can be made to extend or modify the applicability and possibly improve the convergence results of the methodology. These were not pursued in the current work in order to maintain accessibility and to avoid unnecessarily burdensome notation, but they constitute natural avenues for future work. First, each functional component \(f_i\) in (2) can be modified to be from any function class that has quadratic interpolation constraints, e.g., the class of smooth functions [37], the class of convex and quadratically upper bounded functions [20], the class of convex and Lipschitz continuous functions [37], etc. Second, the algorithm representation (5) can be extended to allow for more types of oracles (including, e.g., Frank–Wolfe-type oracles [37], Bregman-type oracles [14], or approximate proximal point oracles [3]), but also multiple evaluations of the same subdifferential \(\partial f_{i}\) during the same iteration, enabling the analysis of, e.g., the forward–backward–forward splitting method of Tseng [40]. Third, similar to [24, 39], it is possible to extend the quadratic Lyapunov function and the quadratic residual function ansatzes to not only contain the current iterate \(\varvec{\xi }_{k}\), but some history \(\varvec{\xi }_{k}, \varvec{\xi }_{k-1}, \ldots , \varvec{\xi }_{k+1-h}\) for some integer \(h\ge 1\). This would allow exploring a greater class of Lyapunov inequalities that may lead to improved convergence results.
Finally, the methodology can be used in the process of finding analytical Lyapunov inequalities, convergence results, and optimal algorithm parameters. Indeed, finding a Lyapunov inequality is equivalent to solving a parametric semidefinite program. Obtaining an analytical Lyapunov inequality then amounts to finding a closed-form solution of this semidefinite program, which can in turn be used to derive convergence results and select algorithm parameters. Works that aim to enable finding such closed-form solutions include [21, 22], while a previous work focused on selecting algorithm parameters can be found in [41].
Notes
We use the same symbol \(\mathcal {Q}\) for the mapping independent of the dimension n, which will be clear from context.
In this context, the symbol \(\prod \) is used for Cartesian products.
References
Banert, S., Boţ, R.I., Csetnek, E.R.: Fixing and extending some recent results on the ADMM algorithm. Numer. Algorithms 86(3), 1303–1325 (2021). https://doi.org/10.1007/s11075-020-00934-5
Banert, S., Ringh, A., Adler, J., Karlsson, J., Öktem, O.: Data-driven nonsmooth optimization. SIAM J. Optim. 30(1), 102–131 (2020). https://doi.org/10.1137/18M1207685
Barré, M., Taylor, A.B., Bach, F.: Principled analyses and design of first-order methods with inexact proximal operators. Math. Program. (2022). https://doi.org/10.1007/s10107-022-01903-7
Bauschke, H.H., Combettes, P.L.: Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics. Springer (2017). https://doi.org/10.1007/978-3-319-48311-5
Beck, A.: First-order methods in optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA (2017). https://doi.org/10.1137/1.9781611974997
Boţ, R.I.: Conjugate duality in convex optimization. Lecture notes in economics and mathematical systems. Springer (2010). https://doi.org/10.1007/978-3-642-04900-2
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011). https://doi.org/10.1007/s10851-010-0251-1
Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program. 159(1–2), 253–287 (2016). https://doi.org/10.1007/s10107-015-0957-3
Chen, C.T.: Linear system theory and design, 3rd edn. Oxford University Press, Oxford (2013)
Dahleh, M., Dahleh, M.A., Verghese, G.: Lectures on dynamic systems and control
Davis, D., Yin, W.: A three-operator splitting scheme and its optimization applications (2015)
Davis, D., Yin, W.: A three-operator splitting scheme and its optimization applications. Set-Valued Var. Anal. 25(4), 829–858 (2017). https://doi.org/10.1007/s11228-017-0421-z
Douglas, J., Rachford, H.H.: On the numerical solution of heat conduction problems in two and three space variables. Trans. Am. Math. Soc. 82(2), 421–439 (1956). https://doi.org/10.1090/S0002-9947-1956-0084194-4
Dragomir, R.A., Taylor, A.B., d’Aspremont, A., Bolte, J.: Optimal complexity and certification of Bregman first-order methods. Math. Program. 194(1–2), 41–83 (2022). https://doi.org/10.1007/s10107-021-01618-1
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1/2), 451–482 (2014). https://doi.org/10.1007/s10107-013-0653-0
Eckstein, J., Bertsekas, D.P.: On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55(1), 293–318 (1992). https://doi.org/10.1007/BF01581204
Ghadimi, E., Feyzmahdavian, H.R., Johansson, M.: Global convergence of the heavy-ball method for convex optimization. In: 2015 European control conference (ECC), pp. 310–315 (2015). https://doi.org/10.1109/ECC.2015.7330562
Giselsson, P.: Tight global linear convergence rate bounds for Douglas–Rachford splitting. J. Fixed Point Theory appl. 19(4), 2241–2270 (2017). https://doi.org/10.1007/s11784-017-0417-1
Giselsson, P., Boyd, S.: Linear convergence and metric selection for Douglas-Rachford splitting and ADMM. IEEE Trans. Autom. Control 62(2), 532–544 (2017)
Goujaud, B., Taylor, A., Dieuleveut, A.: Optimal first-order methods for convex functions with a quadratic upper bound (2022)
Henrion, D., Naldi, S., El Din, M.S.: Exact algorithms for linear matrix inequalities. SIAM J. Optim. 26(4), 2512–2539 (2016). https://doi.org/10.1137/15M1036543
Henrion, D., Naldi, S., El Din, M.S.: SPECTRA: a Maple library for solving linear matrix inequalities in exact arithmetic. Optim. Methods Softw. 34(1), 62–78 (2019). https://doi.org/10.1080/10556788.2017.1341505
Lessard, L.: The analysis of optimization algorithms: A dissipativity approach. IEEE Control Syst. Mag. 42(3), 58–72 (2022). https://doi.org/10.1109/MCS.2022.3157115
Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016). https://doi.org/10.1137/15M1009597
Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979). https://doi.org/10.1137/0716071
Megretski, A., Rantzer, A.: System analysis via integral quadratic constraints. IEEE Trans. Autom. Control 42(6), 819–830 (1997). https://doi.org/10.1109/9.587335
Michalowsky, S., Scherer, C., Ebenbauer, C.: Robust and structure exploiting optimisation algorithms: an integral quadratic constraint approach. Int. J. Control 94(11), 2956–2979 (2021). https://doi.org/10.1080/00207179.2020.1745286
Morin, M., Banert, S., Giselsson, P.: Frugal splitting operators: representation, minimal lifting and convergence (2022)
Moucer, C., Taylor, A., Bach, F.: A systematic approach to Lyapunov analyses of continuous-time models in convex optimization (2022)
Nesterov, Y.: Lectures on convex optimization. Springer optimization and its applications, vol. 137, 2nd Edn. Springer, Cham (2018)
Pedregosa, F., Gidel, G.: Adaptive three operator splitting. In: J. Dy, A. Krause (eds.) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80, pp. 4085–4094 (2018). https://proceedings.mlr.press/v80/pedregosa18a.html
Polyak, B.: Introduction to optimization. Optimization software (1987)
Ryu, E.K.: Uniqueness of DRS as the 2 operator resolvent-splitting and impossibility of 3 operator resolvent-splitting. Math. Program. 182(1–2), 233–273 (2020). https://doi.org/10.1007/s10107-019-01403-1
Ryu, E.K., Taylor, A.B., Bergeling, C., Giselsson, P.: Operator splitting performance estimation: tight contraction factors and optimal parameter selection. SIAM J. Optim. 30(3), 2251–2271 (2020). https://doi.org/10.1137/19M1304854
Sundararajan, A., Van Scoy, B., Lessard, L.: Analysis and design of first-order distributed optimization algorithms over time-varying graphs. IEEE Trans. Control Netw. Syst. 7(4), 1597–1608 (2020). https://doi.org/10.1109/TCNS.2020.2988009
Taylor, A.B., Bach, F.: Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. In: A. Beygelzimer, D. Hsu (Eds.) Proceedings of the Thirty-Second Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 99, pp. 2934–2992 (2019). https://proceedings.mlr.press/v99/taylor19a.html
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017). https://doi.org/10.1137/16M108104X
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1/2), 307–345 (2017). https://doi.org/10.1007/s10107-016-1009-3
Taylor, A.B., Van Scoy, B., Lessard, L.: Lyapunov functions for first-order methods: tight automated convergence guarantees. In: J. Dy, A. Krause (Eds.) Proceedings of the 35th international conference on machine learning, proceedings of machine learning research, vol. 80, pp. 4897–4906 (2018). https://proceedings.mlr.press/v80/taylor18a.html
Tseng, P.: A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control. Optim. 38(2), 431–446 (2000). https://doi.org/10.1137/S0363012998338806
Van Scoy, B., Freeman, R.A., Lynch, K.M.: The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Syst. Lett. 2(1), 49–54 (2018)
Van Scoy, B., Lessard, L.: The speed-robustness trade-off for first-order methods with additive gradient noise (2021)
Zhao, S., Lessard, L., Udell, M.: An automatic system to detect equivalence between iterative algorithms (2021)
Zhou, K., Doyle, J.C.: Essentials of robust control. Prentice Hall, Hoboken (1998)
Acknowledgements
This work was partially supported by the ELLIIT Strategic Research Area and the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. S. Banert and P. Giselsson acknowledge support from Vetenskapsrådet grant VR 2021-05710. A.B. Taylor acknowledges support from the French “Agence Nationale de la Recherche” as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), as well as support from the European Research Council (Grant No. SEQUOIA 724063). We are grateful to Olle Kjellqvist for pointing out some of the details in Remark 1.
Funding
Open access funding provided by Lund University.
Keywords
- Performance estimation
- Convex optimization
- First-order methods
- Quadratic constraints
- Lyapunov functions
- Semidefinite programming