Non-asymptotic Global Convergence Rates of BFGS
with Exact Line Search
Abstract
In this paper, we explore the non-asymptotic global convergence rates of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method implemented with exact line search. Notably, due to Dixon’s equivalence result, our findings are also applicable to other quasi-Newton methods in the convex Broyden class employing exact line search, such as the Davidon-Fletcher-Powell (DFP) method. Specifically, we focus on problems where the objective function is strongly convex with Lipschitz continuous gradient and Hessian. Our results hold for any initial point and any symmetric positive definite initial Hessian approximation matrix. The analysis unveils a detailed three-phase convergence process, characterized by distinct linear and superlinear rates, contingent on the iteration progress. Additionally, our theoretical findings demonstrate the trade-offs between linear and superlinear convergence rates for BFGS when we modify the initial Hessian approximation matrix, a phenomenon further corroborated by our numerical experiments.
1 Introduction
In this paper, we consider the unconstrained minimization problem
$$\min_{x \in \mathbb{R}^d} f(x), \tag{1}$$
where $f:\mathbb{R}^d \to \mathbb{R}$ is strongly convex and twice continuously differentiable. We focus on the non-asymptotic global convergence properties of quasi-Newton methods for solving problem (1). The core idea behind quasi-Newton methods is to mimic the update of Newton’s method using only first-order information, i.e., the gradients of $f$. Specifically, the update rule at the $k$-th iteration is
$$x_{k+1} = x_k - \eta_k B_k^{-1} \nabla f(x_k), \tag{2}$$
where $\eta_k$ is the step size and $B_k$ is a matrix constructed from the gradients of $f$ to approximate the Hessian $\nabla^2 f(x_k)$. Various quasi-Newton methods have been developed, each distinguished by its strategy for constructing the Hessian approximation and its inverse. The key methods among them are the Davidon-Fletcher-Powell (DFP) method [davidon1959variable, fletcher1963rapidly], the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [broyden1970convergence, fletcher1970new, goldfarb1970family, shanno1970conditioning], the Symmetric Rank-One (SR1) method [conn1991convergence, khalfan1993theoretical], the Broyden method [broyden1965class], and the limited-memory BFGS (L-BFGS) method [nocedal1980updating, liu1989limited]. Notably, these quasi-Newton methods directly maintain and update the inverse matrix $B_k^{-1}$ using a constant number of matrix-vector multiplications, resulting in a computational cost of $\mathcal{O}(d^2)$ per iteration. This is cheaper than Newton’s method, which requires computing the Hessian and solving a linear system at a cost that can reach $\mathcal{O}(d^3)$ per iteration.
Compared to other first-order methods, such as gradient descent and accelerated gradient descent, the primary advantage of quasi-Newton methods is their ability to achieve Q-superlinear convergence, i.e.,
$$\lim_{k \to \infty} \frac{\|x_{k+1} - x_*\|}{\|x_k - x_*\|} = 0, \tag{3}$$
where $x_*$ denotes the optimal solution of Problem (1). Specifically, [broyden1973local] and [dennis1974characterization] established that both DFP and BFGS converge Q-superlinearly with unit step size $\eta_k = 1$, provided that the initial point $x_0$ lies within a local neighborhood of the optimal solution $x_*$. These results were later extended to various settings [griewank1982local, dennis1989convergence, yuan1991modified, al1998global, li1999globally, yabe2007local, mokhtari2017iqn, gao2019quasi]. However, these local convergence results are all asymptotic and fail to provide an explicit convergence rate after a finite number of iterations.
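To make the notion of Q-superlinear convergence concrete, the following minimal Python sketch compares successive error ratios for two synthetic error sequences. Both sequences are illustrative assumptions chosen by hand; they are not produced by any quasi-Newton method, and the helper name `successive_ratios` is hypothetical.

```python
import numpy as np

def successive_ratios(errors):
    """Return e_{k+1} / e_k for a sequence of positive errors."""
    errors = np.asarray(errors, dtype=float)
    return errors[1:] / errors[:-1]

k = np.arange(1, 9)
linear_errors = 0.5 ** k               # Q-linear: the ratio stays at 0.5
superlinear_errors = (1.0 / k) ** k    # Q-superlinear: the ratio tends to 0

linear_ratios = successive_ratios(linear_errors)
superlinear_ratios = successive_ratios(superlinear_errors)
print(linear_ratios)
print(superlinear_ratios)
```

For the first sequence the ratio is constant, whereas for the second it decreases monotonically toward zero, which is the behavior required by the limit in the definition of Q-superlinear convergence.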
Recently, there has been progress regarding non-asymptotic local convergence analysis of quasi-Newton methods. The authors of [rodomanov2020rates] showed that, if the initial point $x_0$ is in a local neighborhood of the optimal solution $x_*$ and the initial Hessian approximation matrix is initialized as $B_0 = LI$, then BFGS with unit step size attains a local superlinear convergence rate of the form $\left(\frac{dL}{\mu k}\right)^{k/2}$, where $d$ is the problem’s dimension, $L$ is the Lipschitz parameter of the gradient, and $\mu$ is the strong convexity parameter. Later in [rodomanov2020ratesnew], the local convergence rate of BFGS was improved to $\left[e^{\frac{d}{k}\log\frac{L}{\mu}} - 1\right]^{k/2}$ under similar initial conditions. Similar local superlinear convergence analysis has also been established for the SR1 method [ye2023towards]. In a concurrent work [qiujiang2020quasinewton], the authors demonstrated that, if $x_0$ is in a local neighborhood of the optimal solution $x_*$ and $B_0$ is sufficiently close to the exact Hessian at the optimal solution (or selected as the exact Hessian at $x_0$), then BFGS with unit step size achieves a local superlinear rate of $(1/k)^{k/2}$, which is independent of the dimension $d$ and the condition number $L/\mu$. While these non-asymptotic results successfully characterize an explicit superlinear rate, they rely heavily on local analysis: they require the initial point $x_0$ to be sufficiently close to the optimal solution $x_*$, and they impose conditions on the step size $\eta_k$ and the initial Hessian approximation matrix $B_0$. Consequently, these results cannot be directly extended to a global convergence guarantee. We discuss this issue in detail in Section 6.
To guarantee global convergence, quasi-Newton methods must be coupled with line search or trust-region techniques. The first global result for quasi-Newton methods was derived by Powell in [powell1971convergence], where it was established that DFP with exact line search converges globally and Q-superlinearly. Later, Dixon [Dixon] proved that all quasi-Newton methods from the convex Broyden’s class generate the same iterates using exact line search, thus extending Powell’s result to the convex Broyden’s class including BFGS. In order to relax the exact line search condition, the work in [Powell] considered BFGS using inexact line search based on Wolfe conditions and showed that it retains global superlinear convergence. This result was later extended in [byrd1987global] to the convex Broyden class except for DFP. Moreover, [conn1991convergence, khalfan1993theoretical, byrd1996analysis] showed that the SR1 method with trust-region techniques achieves global and superlinear convergence.
However, all these results lack an explicit global convergence rate; they only provide asymptotic convergence guarantees and fail to characterize the explicit global convergence rate of classic quasi-Newton methods. The only exception is a recent work in [krutikov2023convergence], where the authors also studied the global convergence rate of BFGS with exact line search. Specifically, it was shown that BFGS attains a global linear rate of , where denotes the trace of a matrix. We note that after iterations, their linear rate approaches the rate of , which is substantially slower than gradient descent-type methods. More importantly, their study does not extend to demonstrating any superlinear convergence rate and fails to fully characterize the behavior of BFGS.
The discussions above reveal a major gap in classic quasi-Newton methods: the lack of an explicit global convergence rate characterization.
Contributions. In this paper, we present the first results that contain explicit non-asymptotic global linear and superlinear convergence rates for the BFGS method with exact line search. Note that due to the equivalence result by Dixon [Dixon], our results also hold for other quasi-Newton methods in the convex Broyden class with exact line search. At a high level, our convergence analysis sharpens the potential function-based framework first introduced in [QN_tool], leading to a unifying framework for proving both the global linear convergence rates and the superlinear convergence rates. Our convergence results are global as they hold for any initial point and any initial Hessian approximation matrix that is symmetric positive definite. Specifically, our analysis divides the convergence process into three phases, characterized by different convergence rates:
(i) First linear phase: We show that
Here, is the scaled initial Hessian approximation matrix, is a potential function defined later in (18), denotes the condition number, and is defined based on the initial optimality gap with as the Hessian’s Lipschitz parameter. In particular, when , this leads to a linear rate of
(ii) Second linear phase: Upon reaching , the algorithm attains an improved linear rate matching that of standard gradient descent:
(iii) Superlinear phase: When , BFGS achieves a superlinear convergence rate of
where is the normalized initial Hessian approximation matrix.
Table 1: Global convergence results of BFGS with exact line search for the two initializations considered. Columns: Convergence Phase | Convergence Rate | Starting moment. Rows (for each initialization): Linear phase I, Linear phase II, Superlinear phase.
To make our convergence rates easily interpretable, we further consider $B_0 = LI$ and $B_0 = \mu I$ as two special cases. The global convergence results with these two initializations are summarized in Table 1. Our analysis reveals a trade-off between the linear and the superlinear rates, depending on the choice of the initial matrix $B_0$. Specifically, while both initializations lead to the same linear convergence rates, initializing with $B_0 = LI$ allows the algorithm to reach this rate iterations earlier than with $B_0 = \mu I$. On the other hand, for the superlinear convergence phase, the difference between $B_0 = LI$ and $B_0 = \mu I$ essentially boils down to comparing against . Thus, when , initializing with $B_0 = \mu I$ enables an earlier transition to superlinear convergence compared to $B_0 = LI$, as well as a faster superlinear convergence rate. As we shall see in Section 7, our experiments also demonstrate this trade-off.
Additional related work. In addition to the standard quasi-Newton methods such as BFGS, the superlinear convergence of other variants of quasi-Newton methods has also been studied in the literature. The greedy variants of quasi-Newton methods were first introduced in [rodomanov2020greedy] and developed in subsequent works [lin2021greedy, lin2022explicit, ji2023greedy]. Instead of using the difference of successive iterates to update the Hessian approximation matrix, the key idea is to greedily select basis vectors to maximize a certain measure of progress. In [rodomanov2020greedy], greedy BFGS is shown to achieve a local superlinear convergence rate of and the superlinear convergence phase begins after iterations. Similar superlinear convergence rates are extended to works [lin2021greedy, lin2022explicit, ji2023greedy]. However, we note that their results are all local and require the initial point to be sufficiently close to the optimal solution . Recently, along a different line of work, the authors in [jiang2023online, jiang2023accelerated] proposed quasi-Newton-type methods based on the hybrid proximal extragradient framework [solodov1999hybrid, monteiro2010complexity] and studied their global convergence rates. Specifically, it was shown that the quasi-Newton proximal extragradient method in [jiang2023online] achieves a global linear convergence rate of and a global superlinear rate of . However, all these methods are distinct from the classical quasi-Newton methods such as BFGS analyzed in this paper, since they formulate the update of the Hessian approximation matrices as an online convex optimization problem and follow an online learning algorithm to update .
Outline. In Section 2, we provide an overview of the BFGS method with exact line search, outline our assumptions, and introduce some fundamental lemmas for the exact line search scheme. Section 3 presents our general analytical framework, which is employed to establish global linear and superlinear convergence results for the BFGS method, along with the intermediate results for the update of quasi-Newton methods. In Section 4, we establish the global linear convergence rate of BFGS using exact line search and delve into specific cases with and . Section 5 details our global superlinear convergence results, applicable to any choices of and . In Section 6, we contrast our analytical framework with classical asymptotic and recent local non-asymptotic analyses of BFGS. Section 7 displays our numerical experiments that corroborate our theoretical findings. Finally, we finish the paper by presenting some concluding remarks in Section 8.
Notation. We use $\|\cdot\|$ to denote the $\ell_2$-norm of a vector or the spectral norm of a matrix. We denote $\mathbb{S}^d_{+}$ and $\mathbb{S}^d_{++}$ as the set of symmetric positive semidefinite and symmetric positive definite matrices with dimension $d \times d$, respectively. Given two symmetric matrices $A$ and $B$, we denote $A \preceq B$ if and only if $B - A$ is symmetric positive semidefinite. Given a matrix $A$, we use $\mathrm{Tr}(A)$ and $\mathrm{Det}(A)$ to denote its trace and determinant, respectively.
2 Preliminaries
In this section, we first outline the assumptions, notations, and lemmas essential for our convergence proof. Following this, we explore the general framework of quasi-Newton methods incorporating exact line search and provide an overview of the principal concepts underpinning the update mechanism in the convex Broyden’s class of quasi-Newton (QN) methods, which encompasses both the BFGS and DFP algorithms.
2.1 Assumptions
To begin with, we state our assumptions on the objective functions .
Assumption 1.
The objective function $f$ is strongly convex with parameter $\mu > 0$, i.e., $\|\nabla f(x) - \nabla f(y)\| \ge \mu \|x - y\|$, for any $x, y \in \mathbb{R}^d$.
Assumption 2.
The gradient of the objective function $f$ is Lipschitz continuous with parameter $L > 0$, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for any $x, y \in \mathbb{R}^d$.
Both Assumptions 1 and 2 are standard in the convergence analysis of first-order methods. Moreover, since $f$ is twice differentiable, they imply that $\mu I \preceq \nabla^2 f(x) \preceq L I$ for any $x \in \mathbb{R}^d$. Additionally, the condition number of $f$ is defined as $\kappa := L/\mu$. We also remark that Assumptions 1 and 2 are sufficient to prove our global linear convergence rate results. In order to achieve a superlinear convergence rate, we need to impose an additional assumption on the Hessian of the function $f$, which is stated below.
Assumption 3.
The Hessian of the objective function $f$ is Lipschitz continuous with parameter $M > 0$, i.e., $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le M \|x - y\|$ for any $x, y \in \mathbb{R}^d$.
Assumption 3 is also commonly employed in the analysis of quasi-Newton methods such as [QN_tool], as it provides a necessary smoothness condition for the Hessian of the objective function.
2.2 Quasi-Newton methods with exact line search
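Next, we briefly review the template for updating QN methods, focusing specifically on the DFP and BFGS algorithms. At the $k$-th iteration, the update in (2) can be equivalently written as
$$x_{k+1} = x_k + \eta_k d_k, \qquad d_k = -B_k^{-1} \nabla f(x_k). \tag{4}$$
Here, $\eta_k \ge 0$ represents the step size, and $B_k$ is the Hessian approximation matrix. Replacing $B_k$ with the exact Hessian $\nabla^2 f(x_k)$ turns the update into classical Newton’s method. Quasi-Newton methods aim to approximate the Hessian with first-order information, typically adhering to a secant condition and a least-change property. To elaborate, we define the variable difference $s_k$ and gradient difference $y_k$ as
$$s_k := x_{k+1} - x_k, \qquad y_k := \nabla f(x_{k+1}) - \nabla f(x_k). \tag{5}$$
The secant condition mandates that $B_{k+1}$ satisfy $B_{k+1} s_k = y_k$, ensuring the gradient consistency between the quadratic model $h_{k+1}(x) = f(x_{k+1}) + \nabla f(x_{k+1})^\top (x - x_{k+1}) + \frac{1}{2}(x - x_{k+1})^\top B_{k+1} (x - x_{k+1})$ and $f$ at $x_k$ and $x_{k+1}$; that is, $\nabla h_{k+1}(x_k) = \nabla f(x_k)$ and $\nabla h_{k+1}(x_{k+1}) = \nabla f(x_{k+1})$ (see [nocedal2006numerical, Chapter 6]). That said, the secant condition does not uniquely define $B_{k+1}$. Thus, we impose a least-change property to ensure that $B_{k+1}$, satisfying the secant condition, is closest to $B_k$ in a specific proximity measure. Various proximity measures have been proposed in the literature [goldfarb1970family, greenstadt1970variations, fletcher1991new], and here we follow the variational characterization in [fletcher1991new]. Specifically, for any symmetric positive definite matrix $A \in \mathbb{S}^d_{++}$, define the negative log-determinant function $\Phi(A) = -\log \mathrm{Det}(A)$ and define the Bregman divergence generated by $\Phi$ by
$$D_\Phi(A, B) := \Phi(A) - \Phi(B) - \langle \nabla \Phi(B), A - B \rangle = \mathrm{Tr}(B^{-1} A) - \log \mathrm{Det}(B^{-1} A) - d. \tag{6}$$
Note that the Bregman divergence $D_\Phi(A, B)$ can be regarded as a measure of proximity between two positive definite matrices, and $D_\Phi(A, B) = 0$ if and only if $A = B$. For the BFGS update, it was shown in [fletcher1991new] that $B_{k+1}$ is given as the unique solution of the minimization problem:
$$\min_{B \in \mathbb{S}^d_{++}} D_\Phi(B, B_k) \quad \text{s.t.} \quad B s_k = y_k,$$
which admits the following explicit update rule:
$$B_{k+1} = B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top}{s_k^\top y_k}. \tag{7}$$
Moreover, if we define $H_k := B_k^{-1}$ as the inverse of the Hessian approximation matrix, it follows from the Sherman-Morrison formula that
$$H_{k+1} = \left(I - \frac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left(I - \frac{y_k s_k^\top}{s_k^\top y_k}\right) + \frac{s_k s_k^\top}{y_k^\top s_k}. \tag{8}$$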
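As an illustration of the BFGS update of the Hessian approximation and of its inverse, the following Python sketch applies one update in both forms and checks that they are consistent. It is a minimal sketch under stated assumptions: the matrix `A` is a hypothetical positive definite Hessian used only to generate a pair $(s, y)$ with $s^\top y > 0$, and the helper names `bfgs_update_B` and `bfgs_update_H` are not taken from the paper.

```python
import numpy as np

def bfgs_update_B(B, s, y):
    """Direct BFGS update of the Hessian approximation."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def bfgs_update_H(H, s, y):
    """BFGS update of the inverse Hessian approximation."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

rng = np.random.default_rng(0)
d = 5
Q = rng.standard_normal((d, d))
A = Q @ Q.T + d * np.eye(d)    # hypothetical SPD Hessian (assumption)
B = np.eye(d)                  # current Hessian approximation
s = rng.standard_normal(d)     # variable difference
y = A @ s                      # gradient difference of a quadratic, so s^T y > 0

B_new = bfgs_update_B(B, s, y)
H_new = bfgs_update_H(np.linalg.inv(B), s, y)

print(np.linalg.norm(B_new @ s - y))            # secant condition residual
print(np.linalg.norm(H_new @ B_new - np.eye(d)))  # inverse consistency residual
```

Both residuals should be at the level of floating-point round-off, and the updated matrix remains positive definite because $s^\top y > 0$.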
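The DFP update rule can be regarded as the dual of BFGS, where the roles of the Hessian approximation matrix and its inverse are exchanged. Specifically, the DFP update rules are given by
$$B_{k+1}^{\mathrm{DFP}} = \left(I - \frac{y_k s_k^\top}{y_k^\top s_k}\right) B_k \left(I - \frac{s_k y_k^\top}{s_k^\top y_k}\right) + \frac{y_k y_k^\top}{s_k^\top y_k}, \qquad H_{k+1}^{\mathrm{DFP}} = H_k - \frac{H_k y_k y_k^\top H_k}{y_k^\top H_k y_k} + \frac{s_k s_k^\top}{s_k^\top y_k}.$$
Both BFGS and DFP belong to a more general class of QN methods, known as the convex Broyden’s class [broyden1967quasi]. In this class, the Hessian approximation matrix is defined as
$$B_{k+1} = \phi_k B_{k+1}^{\mathrm{DFP}} + (1 - \phi_k) B_{k+1}^{\mathrm{BFGS}},$$
where $B_{k+1}^{\mathrm{BFGS}}$ denotes the BFGS update in (7) and $\phi_k \in [0, 1]$ for any $k \ge 0$. Accordingly, there exists $\psi_k \in [0, 1]$ such that the Hessian inverse approximation matrix is given by
$$H_{k+1} = (1 - \psi_k) H_{k+1}^{\mathrm{DFP}} + \psi_k H_{k+1}^{\mathrm{BFGS}}.$$
The convex Broyden’s class exhibits a crucial property: if the initial Hessian approximation matrix is symmetric positive definite and the objective function is strictly convex, then all subsequent matrices produced by this class remain symmetric positive definite (see [nocedal2006numerical]).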
To guarantee the global convergence of quasi-Newton methods in (4), it is necessary to employ a line search scheme to select the step size $\eta_k$. In this paper, our primary focus is on the exact line search step size, where we aim to minimize the objective function along the search direction $d_k$. Specifically,
$$\eta_k := \arg\min_{\eta \ge 0} f(x_k + \eta d_k). \tag{9}$$
Remarkably, it was shown in [Dixon] that, when employing the exact line search scheme, the convex Broyden’s class of quasi-Newton methods produce identical iterates given that the initial point $x_0$ and the initial matrix $B_0$ are the same. Thus, in the remainder of the paper, we focus on the BFGS update in (7), as all results hold for other algorithms in the convex Broyden family.
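The equivalence result of [Dixon] can be observed numerically on a simple instance. The sketch below is an illustration under assumptions, not the experimental setup of this paper: it uses a hypothetical strongly convex quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$, for which the exact line search step has the closed form $\eta = -g^\top d / (d^\top A d)$, and it runs fewer iterations than the dimension so that no step vanishes. The function names are hypothetical.

```python
import numpy as np

def bfgs_update(B, s, y):
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def dfp_update(B, s, y):
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V @ B @ V.T + rho * np.outer(y, y)

def run_exact_line_search(update, A, b, x0, B0, num_iters):
    """Quasi-Newton iterations on f(x) = 0.5 x^T A x - b^T x."""
    x, B = x0.copy(), B0.copy()
    iterates = [x.copy()]
    for _ in range(num_iters):
        g = A @ x - b
        direction = -np.linalg.solve(B, g)
        # Closed-form exact line search step, valid for quadratics only.
        eta = -(g @ direction) / (direction @ A @ direction)
        s = eta * direction
        y = A @ s
        x = x + s
        B = update(B, s, y)
        iterates.append(x.copy())
    return np.array(iterates)

rng = np.random.default_rng(0)
d = 6
Q = rng.standard_normal((d, d))
A = Q @ Q.T + d * np.eye(d)               # hypothetical SPD Hessian
b = rng.standard_normal(d)
x0 = rng.standard_normal(d)
B0 = np.diag(np.linspace(1.0, 3.0, d))    # shared SPD initial matrix

traj_bfgs = run_exact_line_search(bfgs_update, A, b, x0, B0, num_iters=4)
traj_dfp = run_exact_line_search(dfp_update, A, b, x0, B0, num_iters=4)
max_gap = np.max(np.abs(traj_bfgs - traj_dfp))
print(max_gap)
```

The two trajectories should agree up to round-off even though the two methods maintain different Hessian approximation matrices, since both search directions are parallel at every iteration and the exact line search removes the difference in scaling.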
Finally, we introduce some intermediate results related to the exact line search step size, as defined in (9). These results are essential for the forthcoming demonstration of the convergence rate of the quasi-Newton method.
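Lemma 1.
Consider the quasi-Newton update in (4) with the exact line search step size in (9). Then, for any $k \ge 0$, the following statements hold: (a) $f(x_{k+1}) \le f(x_k)$; (b) $\nabla f(x_{k+1})^\top s_k = 0$ and $y_k^\top s_k = -\nabla f(x_k)^\top s_k$.
Proof.
Given $x_k$ and $d_k$ in the $k$-th iteration, define the function $\varphi(\eta) := f(x_k + \eta d_k)$. By the definition of exact line search in (9), it holds that $\eta_k = \arg\min_{\eta \ge 0} \varphi(\eta)$. To prove (a), note that we have $f(x_{k+1}) = \varphi(\eta_k) \le \varphi(0) = f(x_k)$. To prove (b), by the first-order optimality condition, we have $\varphi'(\eta_k) = \nabla f(x_k + \eta_k d_k)^\top d_k = 0$. Since $x_{k+1} = x_k + \eta_k d_k$ and $s_k = \eta_k d_k$, the above equation implies that $\nabla f(x_{k+1})^\top s_k = 0$. Applying the fact that $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$, we obtain that $y_k^\top s_k = -\nabla f(x_k)^\top s_k$. ∎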
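The consequences of exact line search used in this proof (monotone decrease of the objective, orthogonality of the new gradient to the step, and the resulting identity for $y_k^\top s_k$) can be checked numerically on a non-quadratic function. The sketch below is only an illustration: the regularized logistic-type objective, the data matrix `Z`, and the bisection routine `exact_step` are assumptions introduced here, and the search direction is taken as the negative gradient for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, mu = 4, 20, 0.1
Z = rng.standard_normal((n, d))   # hypothetical data matrix (assumption)

def f(x):
    # Regularized logistic-type loss; strongly convex with parameter mu.
    return np.sum(np.logaddexp(0.0, Z @ x)) + 0.5 * mu * (x @ x)

def grad(x):
    sigma = 1.0 / (1.0 + np.exp(-(Z @ x)))
    return Z.T @ sigma + mu * x

def exact_step(x, direction, num_bisections=100):
    """Solve phi'(eta) = grad(x + eta * direction)^T direction = 0 by bisection."""
    dphi = lambda eta: grad(x + eta * direction) @ direction
    lo, hi = 0.0, 1.0
    while dphi(hi) < 0.0:          # enlarge the bracket until phi' changes sign
        hi *= 2.0
    for _ in range(num_bisections):
        mid = 0.5 * (lo + hi)
        if dphi(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = rng.standard_normal(d)
g = grad(x)
direction = -g                     # a descent direction (corresponds to B_k = I)
eta = exact_step(x, direction)
s = eta * direction
x_next = x + s
y = grad(x_next) - g

print(f(x_next) <= f(x))           # objective does not increase
print(grad(x_next) @ s)            # approximately zero
print(y @ s, -(g @ s))             # the two values coincide
```

Bisection is applicable here because $\varphi'$ is increasing for a strongly convex $f$ and is negative at $\eta = 0$ along a descent direction.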
3 Convergence analysis framework
In this section, we introduce our theoretical framework for establishing the global convergence rates of the BFGS algorithm with exact line search. As previously discussed, due to the equivalence among quasi-Newton methods within the convex Broyden’s class under the exact line search [Dixon], our results also extend to the entire convex Broyden’s class, including the DFP algorithm.
Our framework builds on two key propositions. In Proposition 1, we characterize the amount of function value decrease in one iteration in terms of the angle between the steepest descent direction and the search direction given in (4). Subsequently, Proposition 2 presents a potential function for the BFGS update, which leads to a lower bound on .
To formally start the analysis, we first introduce a weighted version of key vectors and matrices. Specifically, for a weight matrix $P \in \mathbb{S}^d_{++}$, we define the weighted gradient $\hat{g}_k$, the weighted gradient difference $\hat{y}_k$, and the weighted iterate difference $\hat{s}_k$ as
$$\hat{g}_k = P^{-1/2} \nabla f(x_k), \qquad \hat{y}_k = P^{-1/2} y_k, \qquad \hat{s}_k = P^{1/2} s_k. \tag{10}$$
Similarly, we define the weighted Hessian approximation matrix $\hat{B}_k$ as
$$\hat{B}_k = P^{-1/2} B_k P^{-1/2}. \tag{11}$$
Note that the weight matrix $P$ can be chosen as any positive definite matrix, and its choice will be evident from the context. In particular, as we shall see later, we use $P = LI$ in Section 4 to prove the global linear convergence rate, and use $P = \nabla^2 f(x_*)$ in Section 5 to prove the global superlinear convergence rate. Moreover, since the above weighting procedure amounts to a change of the coordinate system, the weighted versions of the vectors and matrices defined in (10) and (11) retain the same algebraic relations as their original forms. In particular, the weighted Hessian approximation matrices generated by the BFGS algorithm follow the update rule:
$$\hat{B}_{k+1} = \hat{B}_k - \frac{\hat{B}_k \hat{s}_k \hat{s}_k^\top \hat{B}_k}{\hat{s}_k^\top \hat{B}_k \hat{s}_k} + \frac{\hat{y}_k \hat{y}_k^\top}{\hat{s}_k^\top \hat{y}_k}. \tag{12}$$
Before introducing our first key proposition, we define a quantity $\hat{\theta}_k$ by
$$\cos(\hat{\theta}_k) = \frac{-\hat{g}_k^\top \hat{s}_k}{\|\hat{g}_k\| \|\hat{s}_k\|}, \tag{13}$$
which is the angle between the weighted steepest descent direction $-\hat{g}_k$ and the weighted iterate difference $\hat{s}_k$. It is well-known that the convergence of QN methods can be established by monitoring the behavior of $\cos(\hat{\theta}_k)$. We next quantify the link between the function value decrease and $\cos(\hat{\theta}_k)$.
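The claim that weighting amounts to a change of coordinates can be verified numerically. The following Python sketch assumes the weighting convention $\hat{g} = P^{-1/2} g$, $\hat{y} = P^{-1/2} y$, $\hat{s} = P^{1/2} s$, and $\hat{B} = P^{-1/2} B P^{-1/2}$; the matrices `A`, `P`, and `B` are hypothetical test data, and `sqrt_and_inv_sqrt` is a helper introduced here for illustration.

```python
import numpy as np

def bfgs_update(B, s, y):
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def sqrt_and_inv_sqrt(P):
    """Return P^{1/2} and P^{-1/2} for a symmetric positive definite P."""
    w, U = np.linalg.eigh(P)
    return (U * np.sqrt(w)) @ U.T, (U / np.sqrt(w)) @ U.T

rng = np.random.default_rng(2)
d = 5
Q = rng.standard_normal((d, d))
A = Q @ Q.T + d * np.eye(d)            # hypothetical Hessian of a quadratic
R = rng.standard_normal((d, d))
P = R @ R.T + np.eye(d)                # arbitrary SPD weight matrix
B = np.diag(np.linspace(1.0, 2.0, d))  # current Hessian approximation
g = rng.standard_normal(d)             # stands in for the current gradient
s = -0.3 * np.linalg.solve(B, g)       # step along the quasi-Newton direction
y = A @ s

P_half, P_inv_half = sqrt_and_inv_sqrt(P)
g_hat, y_hat, s_hat = P_inv_half @ g, P_inv_half @ y, P_half @ s
B_hat = P_inv_half @ B @ P_inv_half

# Updating in weighted coordinates equals weighting the updated matrix.
lhs = bfgs_update(B_hat, s_hat, y_hat)
rhs = P_inv_half @ bfgs_update(B, s, y) @ P_inv_half
cos_hat = -(g_hat @ s_hat) / (np.linalg.norm(g_hat) * np.linalg.norm(s_hat))

print(np.max(np.abs(lhs - rhs)))
print(g_hat @ s_hat, g @ s)
print(cos_hat)
```

The inner products $\hat{g}^\top \hat{s}$ and $\hat{y}^\top \hat{s}$ are invariant under the weighting, while the norms, and hence the cosine of the angle, depend on $P$. This dependence is what makes the choice of weight matrix matter in the analysis.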
Proposition 1.
Proof.
First, we use the definition of in (15) to write
(17) |
Moreover, note that we have by Lemma 1(b). Hence, using the definition of in (13) and the definition of in (15), it follows that
Furthermore, we have from the definition of in (15). Thus, the equality in (17) can be rewritten as
By rearranging the terms in the above equality, we obtain (14). To prove the inequality in (16), note that for any , we have
where the last equality is due to (14). Notice that the terms are non-negative for any . Thus, by applying the inequality of arithmetic and geometric means twice, we obtain that
This completes the proof. ∎
Remark 1.
We note that similar results relating to have appeared in prior work such as [byrd1987global, Lemma 4.2] and [QN_tool], though they are used in the analysis of QN methods with inexact line search. Compared with these prior results, Proposition 1 is more general in the sense that we consider the weighted iterates using a general weight matrix . This flexibility enables us to obtain tighter bounds and, more importantly, to obtain a global superlinear convergence rate under the same framework (see Section 5). Another subtle yet important difference is that previous works typically upper bound the term by prematurely, leading to a worse dependence on the condition number . Instead, we keep in (14) as is and lower bound the term together, as later shown in Proposition 2.
Proposition 1 shows that BFGS’s convergence rate hinges on four quantities: , , , and . Note that and can be bounded using Assumptions 1-3, independent of the QN update, with details deferred to Section 3.1. The focus here is to establish a lower bound for . This involves analyzing the dynamics of the Hessian approximation matrices through their trace and determinant, leveraging the following potential function from [QN_tool] that integrates both:
$$\Psi(A) := \mathrm{Tr}(A) - d - \log \mathrm{Det}(A). \tag{18}$$
Given (6), $\Psi(A)$ can be regarded as the Bregman divergence generated by $\Phi$ between the matrix $A$ and the identity matrix $I$. In particular, $\Psi(A) \ge 0$ for any $A \in \mathbb{S}^d_{++}$, and we have $\Psi(A) = 0$ if and only if $A = I$. Now we are ready to state Proposition 2, which is a classical result in the QN literature (e.g., see [nocedal2006numerical, Section 6.4]). For completeness, we provide its proof in Appendix A.
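To give a feel for the potential function, the following Python sketch evaluates it under the assumption that it takes the trace and log-determinant form $\Psi(A) = \mathrm{Tr}(A) - d - \log \mathrm{Det}(A)$. It considers the two initializations $B_0 = LI$ and $B_0 = \mu I$ after scaling by $1/L$; the numerical values of `d`, `L`, and `mu` are arbitrary illustrative choices, and the function name `potential` is hypothetical.

```python
import numpy as np

def potential(A):
    """Psi(A) = Tr(A) - d - log Det(A) for a symmetric positive definite A."""
    sign, logdet = np.linalg.slogdet(A)
    if sign <= 0:
        raise ValueError("A must be symmetric positive definite")
    return np.trace(A) - A.shape[0] - logdet

d, L, mu = 4, 10.0, 0.1        # illustrative values (assumption)
kappa = L / mu

psi_L = potential((L * np.eye(d)) / L)     # B_0 = L I, scaled by 1/L
psi_mu = potential((mu * np.eye(d)) / L)   # B_0 = mu I, scaled by 1/L

# A random SPD matrix, to illustrate non-negativity.
rng = np.random.default_rng(0)
Q = rng.standard_normal((d, d))
psi_random = potential(Q @ Q.T + 0.1 * np.eye(d))

print(psi_L)
print(psi_mu, d * (1.0 / kappa - 1.0 + np.log(kappa)))
print(psi_random)
```

In terms of eigenvalues, $\Psi(A) = \sum_i (\lambda_i - 1 - \log \lambda_i)$, so the potential vanishes only when every eigenvalue equals one and grows when eigenvalues are either very large or very close to zero.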
Proposition 2.
Taking exponentiation of both sides in (20), Proposition 2 provides a lower bound for the product in relation to the sum and . We will use Assumptions 1-3 to bound the term for any , as shown in Lemma 5 of Section 3.1. Moreover, the second term depends on our choice of the initial Hessian approximation matrix . Specifically, we will consider two different initializations: (i) ; (ii) . As we shall discuss in the upcoming sections, these two choices result in different bounds and thus lead to a trade-off between the initial linear convergence rate and the final superlinear convergence rate.
Having outlined our key propositions, Sections 4 and 5 will merge Proposition 1 and Proposition 2 to demonstrate that BFGS achieves global non-asymptotic linear and superlinear convergence rates, respectively. Our approach involves selecting an appropriate weight matrix and bounding the quantities in (16) to derive the overall convergence rate. Specifically, we set for global linear convergence and for superlinear convergence. The following intermediate lemmas will be used to establish these convergence bounds.
3.1 Intermediate lemmas
Next, we provide some intermediate results that lower bound the quantities and defined in (15) and the term appearing in (19). To do so, we first define the average Hessian matrices and as
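$$J_k := \int_0^1 \nabla^2 f(x_k + \tau (x_{k+1} - x_k)) \, d\tau, \tag{21}$$
$$G_k := \int_0^1 \nabla^2 f(x_* + \tau (x_k - x_*)) \, d\tau. \tag{22}$$
These two matrices play an important role in our analysis, since by the fundamental theorem of calculus, it holds that $y_k = J_k s_k$ and $\nabla f(x_k) = G_k (x_k - x_*)$ for any $k \ge 0$. We also define the weighted average Hessian matrix $\hat{J}_k = P^{-1/2} J_k P^{-1/2}$ for the given weight matrix $P$. Moreover, we define a quantity that depends on the function value at the iterate $x_k$: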
(23) |
where is the Lipschitz constant of the Hessian in Assumption 3 and is the strong convexity parameter in Assumption 1. Given these definitions, in the following lemma, we characterize the relationship between different matrices that appear in our convergence analysis.
Lemma 2.
Suppose Assumptions 1, 2, and 3 hold, and recall the definitions of the matrices in (21), in (22), and the quantity in (23). Then, the following statements hold:
-
(a)
For any , we have that
(24) -
(b)
For any , we have that
(25) -
(c)
For any and any , we have that
(26) -
(d)
For any and , we have that
(27)
Proof.
See Appendix B. ∎
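The role of the average Hessian along a step can be illustrated numerically. The sketch below assumes the definition $J = \int_0^1 \nabla^2 f(x + \tau (x^+ - x)) \, d\tau$ and approximates the integral with Gauss-Legendre quadrature. The regularized logistic-type objective, the data matrix `Z`, and the function names are assumptions introduced for illustration, and the bound `L_bound` is a standard estimate of the gradient Lipschitz constant for this particular objective.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, mu = 4, 30, 0.5
Z = rng.standard_normal((n, d))   # hypothetical data matrix (assumption)

def grad(x):
    sigma = 1.0 / (1.0 + np.exp(-(Z @ x)))
    return Z.T @ sigma + mu * x

def hessian(x):
    sigma = 1.0 / (1.0 + np.exp(-(Z @ x)))
    return (Z.T * (sigma * (1.0 - sigma))) @ Z + mu * np.eye(d)

def average_hessian(x, x_next, num_nodes=20):
    """Approximate the integral of the Hessian along the segment [x, x_next]."""
    nodes, weights = np.polynomial.legendre.leggauss(num_nodes)
    taus, weights = 0.5 * (nodes + 1.0), 0.5 * weights   # map [-1, 1] to [0, 1]
    return sum(w * hessian(x + t * (x_next - x)) for t, w in zip(taus, weights))

x = rng.standard_normal(d)
x_next = x + 0.5 * rng.standard_normal(d)
s = x_next - x
y = grad(x_next) - grad(x)

J = average_hessian(x, x_next)
eigs = np.linalg.eigvalsh(J)
L_bound = mu + 0.25 * np.linalg.eigvalsh(Z.T @ Z).max()

print(np.max(np.abs(J @ s - y)))   # fundamental theorem of calculus: y = J s
print(eigs.min(), eigs.max(), mu, L_bound)
```

Since every Hessian along the segment has eigenvalues between the strong convexity parameter and the gradient Lipschitz constant, the same holds for their average, which the printed eigenvalue range reflects.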
After establishing Lemma 2, in the following three lemmas, we will provide bounds on the quantities , and , respectively. Notice that is independent of the choice of the weight matrix , while and are determined by different options of weight matrix .
Lemma 3.
Proof.
We first prove the first bound in (28). By Assumptions 1 and 2, the function is -strongly convex and its gradient is -Lipschitz. Then for any , it holds that
(29) |
This is also known as the interpolation inequality; see, e.g., [Taylor_convex, Theorem 4]. By setting , in (29) and recalling that , and , we obtain that
Moreover, Lemma 1 shows that due to exact line search. Thus, we can simplify the above inequality as
(30) |
where we used Young’s inequality in the second inequality and the fact that due to the Cauchy-Schwarz inequality in the third inequality. Hence, we conclude that .
Remark 2.
Lemma 4.
Proof.
We first prove (a). When , we have . Since $f$ is $\mu$-strongly convex by Assumption 1, it holds that $\|\nabla f(x_k)\|^2 \ge 2\mu (f(x_k) - f(x_*))$ (see, e.g., [boyd04, (9.9)]). Hence, we conclude that .
Next, we prove (b). When , we have . By applying Taylor’s theorem with Lagrange remainder, there exists such that
(32) |
where we used the fact that in the last equality. Moreover, by the fundamental theorem of calculus, we have
where we use the definition of in (22). Since and we denote , this further implies that
(33) |
Combining (32) and (33) leads to
(34) |
Based on (27) in Lemma 2, we have , which implies that
(35) |
Moreover, it follows from (25) in Lemma 2 that , which implies that
(36) |
Combining (35) and (36), we obtain that
and hence
By using (34) and the fact that , we obtain
and the claim follows. ∎
Lemma 5.
4 Global linear convergence rates
In this section, we establish the explicit global linear convergence rates for the BFGS method using an exact line search step size, marking one of the first non-asymptotic global linear convergence analyses of BFGS with a line search scheme. The subsequent global superlinear convergence analyses build on these linear rates.
Specifically, we combine the fundamental inequality (16) from Proposition 1 with lower bounds of the terms , , and from Lemmas 3, 4, and 5 and Proposition 2 to prove all the global linear convergence rates. In this section, we set the weight matrix as $P = LI$ and we define the weighted matrix $\bar{B}_k$ as:
$$\bar{B}_k := \frac{1}{L} B_k. \tag{37}$$
In the following lemma, we prove the first global linear convergence rate of the BFGS method for any choice of .
Lemma 6.
Proof.
Our starting point is applying Proposition 1 with the weight matrix chosen as . Specifically, (16) shows that to obtain a convergence rate, it suffices to prove a lower bound on . It follows from Lemma 3 that for any . Moreover, by applying Lemma 4 with , we obtain that for any . Furthermore, applying Proposition 2 with , it follows from (20) that
where in the last inequality we used by Lemma 5 with . This further implies that
(39) |
Combining all the pieces above, we get
Thus, it follows from Proposition 1 that
This completes the proof. ∎
Notice that this result holds without the Hessian Lipschitz continuity assumption. In the next lemma, we present another version of the global linear convergence analysis under the additional assumption that the Hessian of $f$ is $M$-Lipschitz. We show that the BFGS method with exact line search will eventually reach a global linear convergence rate of , which matches that of the gradient descent method.
Lemma 7.
Proof.
We follow a similar argument as in the proof of Lemma 6 but with a different lower bound for . Specifically, by Lemma 3, we also have . Combining this with and (39) leads to
(42) |
To begin with, recall the definition that . Since the objective function is non-increasing by Lemma 1, it holds that for any . Thus, from (42) we have
To prove the second claim in (41), we use the fact that for any to get
(43) |
Combining (42) and (43) leads to
(44) |
Next, we prove an upper bound on . First, we assume . Then (38) in Lemma 6 and (40) together imply that
where we used the fact that . Moreover, we decompose the sum into two parts by . For the first part, we have . For the second part, by the definition of , we have
where we used for all in the last inequality. Combining both inequalities, we arrive at
(45) |
Thus, when the number of iterations exceeds , by (44) we have
Together with Proposition 1, this proves the second claim in (41). ∎
We summarize all the global linear convergence results from the above two lemmas in the following theorem.
Theorem 1.
Let be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point and any initial matrix , we have the following global linear convergence rate for any ,
(46) |
where is defined in (37). When , we have that
(47) |
Moreover, when , we have
(48) |
In Theorem 1, we present three distinct linear convergence rates during different phases of the BFGS algorithm with exact line search. Specifically, the linear rate in (46) is applicable from the first iteration, but the contraction factor depends on the quantity , which can be exponentially small and thus imply a slow convergence rate. However, this quantity will be bounded away from zero as the number of iterations increases, resulting in an improved linear rate. In particular, for , the quantity is bounded below by , leading to the second improved linear convergence rate in (47). Furthermore, as shown in Lemma 7, after an additional iterations, we achieve the last linear convergence rate in (48), which is comparable to that of gradient descent.
From the discussions above, we observe that the quantity $\Psi(\bar{B}_0)$ (recall that $\bar{B}_0 = \frac{1}{L} B_0$) plays a critical role in determining the transitions between different linear convergence phases, and a smaller $\Psi(\bar{B}_0)$ implies fewer iterations required to reach each linear convergence phase. Thus, we consider two different initializations: $B_0 = LI$ and $B_0 = \mu I$. Specifically, note that in the first case where $B_0 = LI$, we have $\Psi(\bar{B}_0) = \Psi(I) = 0$ and thus it achieves the best linear convergence results according to Theorem 1. The corresponding global linear rate is presented in Corollary 1.
Corollary 1.
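In the second case where $B_0 = \mu I$, we have $\Psi(\bar{B}_0) = \Psi\left(\frac{1}{\kappa} I\right) = d\left(\frac{1}{\kappa} - 1 + \log \kappa\right)$. The corresponding global linear rate is presented in Corollary 2.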
Corollary 2.
Let be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point and the initial Hessian approximation matrix , we have the following global convergence rate for any ,
(51) |
When , the following linear rate holds
(52) |
Moreover, when , we have
(53) |
Comparing the results in Corollary 2 with those in Corollary 1, we observe that BFGS with requires additional iterations to achieve a similar linear rate as in the first case. However, as we present in the next section, the choice of the initial Hessian approximation matrix achieves a better superlinear convergence rate. This trade-off between the linear and superlinear convergence phase is the fundamental consequence of different choices of the initial Hessian approximation matrix in our convergence analysis.
5 Global superlinear convergence rates
In this section, we establish the non-asymptotic global superlinear convergence rate of BFGS with exact line search, employing a similar approach to the global linear convergence rate analysis from the previous section. We utilize the framework from Proposition 1 and integrate the lower bounds from Lemmas 3, 4, 5, and Proposition 2. The key distinction lies in the choice of the weight matrix: instead of $P = LI$ used in the linear convergence analysis, we opt for $P = \nabla^2 f(x_*)$ for the global superlinear convergence proof.
We define the weighted matrix $\tilde{B}_k$ as:
$$\tilde{B}_k := \nabla^2 f(x_*)^{-1/2} B_k \nabla^2 f(x_*)^{-1/2}. \tag{54}$$
In the following proposition, we first provide a general global convergence bound with an arbitrary initial Hessian approximation matrix . All the global superlinear convergence rates are based on the following proposition.
Proposition 3.
Proof.
Recall that we choose the weight matrix as throughout the proof. From Lemma 3 and Lemma 4(b), we have and . Hence, using the inequality for any , it follows that
(56) |
Moreover, by using the inequality (20) in Proposition 2 with , we obtain that
where in the last inequality we used the fact that from Lemma 5(b). This further implies that
(57) |
Combining (56), (57), and (16) from Proposition 1, we prove that
where the last inequality is due to the fact that for any . ∎
The above global result shows that the error after iterations for the BFGS update with exact line search depends on the potential function of the weighted initial Hessian approximation matrix , i.e., , and on the sum of the weighted function suboptimality terms, i.e., . This result forms the foundation of our superlinear analysis: if we can show that the sum is bounded above, then it yields a superlinear rate of the form .
Having established the non-asymptotic global linear convergence rate of BFGS in the previous section, we can leverage it to show that the sum is uniformly bounded above, allowing us to establish an explicit upper bound for this finite sum. In the following theorem, we apply the linear convergence results from section 4 to prove the non-asymptotic global superlinear convergence rates of BFGS with exact line search for any initial Hessian approximation matrix .
Theorem 2.
Proof.
This result indicates that BFGS with exact line search achieves a superlinear convergence rate when the number of iterations satisfies the condition . The initial matrix critically influences the required iterations to attain this rate, as it appears in the numerator of the upper bound through and . Thus, different choices of yield different values for , affecting the number of iterations required for superlinear convergence. Indeed, one can try to optimize the choice of to make the expression as small as possible. However, here we only focus on two practical initial Hessian approximations: and . Next, in the upcoming corollaries, we present the superlinear convergence results obtained from Theorem 2 when we use these two initial Hessian approximations.
Corollary 3.
Proof.
Corollary 4.
Proof.
As shown in the proofs of Corollary 3 and Corollary 4, selecting minimizes , resulting in . However, in this case could be as large as . Conversely, setting yields a favorable upper bound, allowing both and to be bounded by .
Hence, choosing the initial Hessian approximation as instead of could result in fewer iterations to reach the superlinear convergence phase. This demonstrates the advantage of over in achieving superlinear convergence, highlighting the trade-off between the linear and superlinear convergence performances of different initial Hessian approximation matrices.
Generally, during the initial linear convergence stage, the iterates generated by the BFGS method with outperform those with , due to a faster linear convergence speed. However, the BFGS method with transitions to the ultimate superlinear convergence phase in fewer iterations compared to . This phenomenon has also been observed in our numerical experiments presented in Section 7.
While all of our presented results are global and do not impose any initial condition on , in the following remark, we present a potential local result derivable from Corollary 4.
Remark 3.
Consider the scenario where BFGS starts at a point near the optimal solution such that the initial error condition is satisfied, i.e., . In this case, we can establish that and . Thus, from Corollary 4, we obtain the local superlinear convergence rate of , which aligns with the local convergence result in [rodomanov2020ratesnew]. It is noteworthy that the local result in [rodomanov2020ratesnew] relied on a unit step size, while our local side-result is derived using exact line search.
6 Discussions
Comparison with local non-asymptotic analysis. In this section, we discuss the recent non-asymptotic local convergence results for BFGS and DFP in [rodomanov2020rates, rodomanov2020ratesnew, qiujiang2020quasinewton] and explain why these results cannot be easily extended to achieve global complexity bounds.
To begin with, note that these results are crucially based on local analysis and only apply when the iterates are close to the optimal solution and the step size is set to 1 in this local region. Therefore, to extend their results into a global convergence guarantee, one plausible strategy is to employ a line search scheme to ensure global convergence, and then switch to the local analysis when the iterates enter the region of local convergence. However, this approach faces several challenges.
First, it remains unclear how to explicitly upper bound the number of iterations until the line search subroutine accepts the unit step size . Moreover, assume the iterates enter the region of local convergence after iterations and we have for all . Even then, there is no guarantee that the Hessian approximation matrix will satisfy the necessary conditions required for the local analysis in [rodomanov2020rates, rodomanov2020ratesnew, qiujiang2020quasinewton]. Specifically, for the analysis in [qiujiang2020quasinewton] to hold, must be sufficiently close to the exact Hessian matrix, which is not satisfied in general. Regarding [rodomanov2020ratesnew, rodomanov2020rates], we note that their analyses depend on the condition number of , which could be exponentially large and thus render the superlinear rate meaningless. To be more concrete, inspecting the proofs in [rodomanov2020ratesnew, Lemma 5.4] and [rodomanov2020rates, Theorem 4.2] reveals that the superlinear convergence rate occurs when and , respectively, where with defined in (21) and is the potential function defined in (18). Consequently, it is essential to establish bounds for the smallest and largest eigenvalues of . However, the current theory indicates (see e.g. [rodomanov2020rates, Theorem 4.1]) that , where denotes the initial Newton decrement. This suggests that without a sufficiently small , the extreme eigenvalues of will be exponentially dependent on the condition number , leading to . Hence, a superlinear rate will be achieved only after iterations.
Our convergence framework also diverges significantly from the previous works [rodomanov2020rates, rodomanov2020ratesnew, qiujiang2020quasinewton] in terms of the proof strategy. Specifically, the approach in the aforementioned studies employs an induction argument to control the largest and smallest eigenvalues of the Hessian approximation matrix and prove a local linear convergence rate. In comparison, as presented in Sections 4 and 5, we prove global linear and superlinear convergence rates without explicitly establishing upper or lower bounds on the eigenvalues of . This marks a notable departure from the local convergence analysis in [rodomanov2020rates], [rodomanov2020ratesnew], and [qiujiang2020quasinewton].
Comparison with global asymptotic analysis. As mentioned in Section 3, our convergence analysis framework resembles the approach taken in [Powell, byrd1987global, QN_tool] for proving asymptotic linear convergence rates of classical quasi-Newton methods such as BFGS and DFP. While these works considered inexact line search schemes and thus are different from our exact line search setting, they used a similar inequality as (16) in Proposition 1 to express the convergence rate in terms of the angle . Moreover, the authors in [Powell] and [byrd1987global] analyzed the traces and the determinants of the Hessian approximation matrices separately to lower bound . Later, this process was simplified in [QN_tool] by introducing the potential function given in (18), combining the trace and determinant together as in our Proposition 2. However, since their main focus is on asymptotic convergence, we note that these previous works only demonstrate that is lower bounded by a constant, without giving an explicit form. Furthermore, our work builds upon previous analyses by incorporating a weight matrix , while earlier works correspond to setting . Another notable difference is that we keep the term and lower bound the term as shown in Proposition 2, whereas previous works relied on a looser bound for . These refinements enable us to provide a tighter linear convergence rate for the BFGS method.
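To make the role of the potential function concrete, the following minimal sketch evaluates a trace/log-determinant potential numerically. The definition in (18) is not reproduced in this part of the text, so the form Tr(A) − d − log det(A) used here is an assumption based on the standard potential of this type, and the function name `potential` is hypothetical.

```python
import numpy as np


def potential(A):
    """Evaluate Tr(A) - d - log det(A) for a symmetric positive definite A.

    Assumed form of the potential; it is zero at the identity and
    nonnegative otherwise, since t - 1 - log(t) >= 0 for every t > 0.
    """
    A = np.asarray(A, dtype=float)
    d = A.shape[0]
    sign, logdet = np.linalg.slogdet(A)  # numerically stable log-determinant
    if sign <= 0:
        raise ValueError("A must be positive definite")
    return np.trace(A) - d - logdet


# The potential vanishes at the identity and grows as A moves away from it.
psi_identity = potential(np.eye(4))
psi_scaled = potential(2.0 * np.eye(3))  # equals 3 * (2 - 1 - log 2)
```

Under this assumed form, the potential measures how far all eigenvalues of the (weighted) Hessian approximation are from one, which is why a single scalar can replace separate trace and determinant arguments.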
On the other hand, in demonstrating superlinear convergence, our approach deviates significantly from that of [Powell, byrd1987global, QN_tool]. Specifically, the previous works relied on the Dennis-Moré condition, i.e., , to establish asymptotic superlinear convergence. In comparison, we use the same framework outlined in Section 3 to establish both linear and superlinear convergence rates. The key distinction lies in the choice of the weight matrix : we choose for showing linear convergence and for showing superlinear convergence. Thus, we provide a unified framework for studying the global non-asymptotic convergence of BFGS.
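For reference, the Dennis-Moré condition is commonly stated as follows, where $s_k = x_{k+1} - x_k$ and $x_*$ denotes the minimizer; this notation is assumed here and may differ slightly from the one used in the rest of the paper.

```latex
\lim_{k \to \infty} \frac{\left\| \left( B_k - \nabla^2 f(x_*) \right) s_k \right\|}{\| s_k \|} = 0
```

The condition only requires the Hessian approximation to be accurate along the step directions, not in every direction.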
7 Numerical experiments
In this section, we present our numerical experiments to validate our convergence rate guarantees, and in particular, we explore the difference between the convergence paths of BFGS under the two initializations: and . We further compare these two variants of BFGS implementations with the gradient descent algorithm when deployed with exact line search. Hence, in our numerical experiments, all the step sizes used in BFGS with , BFGS with , and gradient descent are computed by the exact line search condition defined in (9). Specifically, we use the MATLAB optimization package and fminsearch function to determine the exact line search step size for all the algorithms. In our experiments, all initial points are chosen as random vectors in the corresponding Euclidean vector spaces.
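The experimental setup described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the MATLAB code used in the experiments: the exact line search is approximated by bisection on the directional derivative (valid for convex objectives) instead of fminsearch, the test objective is a hypothetical strongly convex function rather than the hard cubic function, and the two scaled-identity initializations are assumed to be L·I and μ·I.

```python
import numpy as np


def exact_line_search(grad, x, d, tol=1e-12, max_iter=200):
    """Approximate argmin over eta >= 0 of f(x + eta * d) for convex f.

    Uses bisection on phi'(eta) = grad(x + eta * d) . d, which is
    nondecreasing in eta when f is convex.
    """
    dphi = lambda eta: grad(x + eta * d) @ d
    lo, hi = 0.0, 1.0
    while dphi(hi) < 0.0 and hi < 1e12:  # expand until the minimizer is bracketed
        lo, hi = hi, 2.0 * hi
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if dphi(mid) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, hi):
            break
    return 0.5 * (lo + hi)


def bfgs_exact(f, grad, x0, B0, num_iters=100, gtol=1e-10):
    """BFGS with (approximately) exact line search; returns x and f-history."""
    x = np.asarray(x0, dtype=float)
    H = np.linalg.inv(B0)  # inverse of the initial Hessian approximation
    g = grad(x)
    history = [f(x)]
    identity = np.eye(x.size)
    for _ in range(num_iters):
        if np.linalg.norm(g) <= gtol:
            break
        d = -H @ g  # quasi-Newton direction
        s = exact_line_search(grad, x, d) * d
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        sy = s @ y
        if sy <= 1e-12 * np.linalg.norm(s) * np.linalg.norm(y):
            break  # curvature too small to update safely
        rho = 1.0 / sy
        V = identity - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)  # inverse BFGS update
        x, g = x_new, g_new
        history.append(f(x))
    return x, history


# Hypothetical test problem: quadratic plus log-cosh, minimized at the origin.
A = np.diag([1.0, 10.0, 100.0])
f_demo = lambda x: 0.5 * x @ A @ x + np.sum(np.log(np.cosh(x)))
grad_demo = lambda x: A @ x + np.tanh(x)
mu, L = 1.0, 101.0  # strong convexity and gradient Lipschitz constants here
x_L, hist_L = bfgs_exact(f_demo, grad_demo, np.ones(3), L * np.eye(3))
x_mu, hist_mu = bfgs_exact(f_demo, grad_demo, np.ones(3), mu * np.eye(3))
```

Plotting `hist_L` and `hist_mu` on a logarithmic scale gives a small-scale analogue of the comparison between the two initializations.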
In our first experiment, we focus on a hard cubic objective function defined in [hard_cubic, Section 5], i.e.,
(64) |
and is defined as
(65) |
where are hyper-parameters and are standard orthogonal unit vectors in . This hard cubic function is used to establish a lower bound for second-order methods. The performance of the methods in addressing this problem is shown in Figures 1 and 2. In Figure 1, we vary the problem’s dimension while holding the condition number constant, whereas in Figure 2, we hold the problem’s dimension constant and explore the methods’ convergence behaviors for different condition numbers.
Several observations are in order. First, BFGS with initially converges faster than BFGS with in most plots, aligning with our theoretical findings that the linear convergence rate of BFGS with surpasses that of .
Second, the transition to superlinear convergence for BFGS with typically occurs around , as predicted by our theoretical analysis. Interestingly, this transition does not always coincide with the iterates approaching the solution’s local neighborhood; in many cases, it occurs for BFGS with even when its error is larger than that of gradient descent.
Third, although BFGS with initially converges faster, its transition to superlinear convergence consistently occurs later than for . Notably, for a fixed dimension , the transition to superlinear convergence for occurs increasingly later as the problem condition number rises, an effect not observed for . This phenomenon indicates that the superlinear rate for is more sensitive to the condition number , which corroborates our theory that the number of iterations required for superlinear convergence is for and is improved to for .
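One simple way to locate the transition between the linear and superlinear phases in a recorded run is to inspect the per-iteration contraction ratios of the function value gap. The helper below is a hypothetical diagnostic, not part of the experimental code, and it assumes the optimal value has already been subtracted from the recorded values.

```python
def contraction_ratios(gaps):
    """Return gap[k+1] / gap[k] for a sequence of positive suboptimality gaps.

    Roughly constant ratios indicate a linear phase; ratios that keep
    shrinking toward zero indicate a superlinear phase.
    """
    return [nxt / cur for cur, nxt in zip(gaps, gaps[1:]) if cur > 0]


# Illustrative sequences (made up for demonstration, not experimental data).
linear_ratios = contraction_ratios([1.0, 0.5, 0.25, 0.125])
superlinear_ratios = contraction_ratios([1.0, 0.1, 0.001, 1e-7])
```

For the first sequence every ratio is the same, while for the second the ratios decrease at every step, which is the signature of superlinear convergence.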
8 Conclusion
In this paper, we proved explicit global linear and superlinear convergence rates for the BFGS method implemented with the exact line search scheme. Our results hold for any initial point and any initial Hessian approximation matrix . We proved a global convergence rate of , where and is defined in (18). This implies a linear rate of when . Moreover, we proved that the linear rate is improved to after iterations. Finally, we proved a superlinear convergence rate of , where .
We further showed that for the specific choice of , BFGS achieves a global linear convergence rate of from the first iteration, an improved linear rate of after iterations, and a superlinear convergence rate of . Moreover, for , BFGS achieves a global linear rate of after iterations, an improved linear rate of after iterations, and a superlinear rate of .
Appendix
Appendix A Proof of Proposition 2
First, we show that
(66) | ||||
(67) |
Taking the trace of both sides of equation (12) and using the fact that for any vector and , we obtain the equality in (66). See Lemma 6.2 of [rodomanov2020rates] for the proof of (67). Taking the logarithm of both sides of the above equation, we obtain that
Recall that and . Since , we also have . Hence, we can write
Thus, we obtain that
where the last inequality holds since for any . Hence (19) follows from the above inequality. Finally, the result in (20) follows from summing both sides of (19) from to , i.e.,
where the last inequality holds since for any .
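The trace and determinant relations that drive this proof can be checked numerically. The sketch below does so for the standard, unweighted BFGS update; the weighted versions used in the proof are assumed to have the same structure after the change of variables induced by the weight matrix. The function name `bfgs_update` and the random test data are hypothetical.

```python
import numpy as np


def bfgs_update(B, s, y):
    """Standard BFGS update of a Hessian approximation B with pair (s, y)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)


rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
B = M @ M.T + d * np.eye(d)  # random symmetric positive definite matrix
J = rng.standard_normal((d, d))
s = rng.standard_normal(d)
y = (J @ J.T + np.eye(d)) @ s  # guarantees the curvature condition s.y > 0

B_plus = bfgs_update(B, s, y)
Bs = B @ s

# Trace relation, using Tr(u v^T) = v^T u for the two rank-one terms.
trace_rhs = np.trace(B) - (Bs @ Bs) / (s @ Bs) + (y @ y) / (s @ y)
# Determinant relation for the BFGS update.
det_rhs = np.linalg.det(B) * (s @ y) / (s @ Bs)

trace_gap = abs(np.trace(B_plus) - trace_rhs)
det_ratio = np.linalg.det(B_plus) / det_rhs
```

Here `trace_gap` should be at the level of floating-point round-off and `det_ratio` should be close to one.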
Appendix B Proof of Lemma 2
- (a)
Recall that . Using the triangle inequality, we have
Moreover, it follows from Assumption 3 that for any . Thus, we can further apply the triangle inequality to obtain
Since is strongly convex, by Assumption 1 and , we have , which implies that . Similarly, since , it also holds that . Hence, we obtain
(68)
Moreover, notice that by Assumption 1, we also have and . Hence, (68) implies that
where we used the definition of in (23). By rearranging the terms, we obtain (24).
- (b)
- (c)
- (d)