
Non-asymptotic Global Convergence Rates of BFGS
with Exact Line Search

Qiujiang Jin, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA {qiujiang@austin.utexas.edu}
Ruichen Jiang, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA {rjiang@utexas.edu}
Aryan Mokhtari, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA {mokhtari@austin.utexas.edu}

Abstract

In this paper, we explore the non-asymptotic global convergence rates of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method implemented with exact line search. Notably, due to Dixon’s equivalence result, our findings are also applicable to other quasi-Newton methods in the convex Broyden class employing exact line search, such as the Davidon-Fletcher-Powell (DFP) method. Specifically, we focus on problems where the objective function is strongly convex with Lipschitz continuous gradient and Hessian. Our results hold for any initial point and any symmetric positive definite initial Hessian approximation matrix. The analysis unveils a detailed three-phase convergence process, characterized by distinct linear and superlinear rates, contingent on the iteration progress. Additionally, our theoretical findings demonstrate the trade-offs between linear and superlinear convergence rates for BFGS when we modify the initial Hessian approximation matrix, a phenomenon further corroborated by our numerical experiments.

1 Introduction

In this paper, we consider the unconstrained minimization problem

$$\min_{x\in\mathbb{R}^{d}} f(x), \tag{1}$$

where $f:\mathbb{R}^{d}\to\mathbb{R}$ is strongly convex and twice continuously differentiable. We focus on the non-asymptotic global convergence properties of quasi-Newton methods for solving problem (1). The core idea behind quasi-Newton methods is to mimic the update of Newton’s method using only first-order information, i.e., the gradients of $f$. Specifically, the update rule at the $k$-th iteration is

$$x_{k+1}=x_{k}-\eta_{k}B_{k}^{-1}\nabla f(x_{k}), \tag{2}$$

where $\eta_{k}$ is the step size and $B_{k}\in\mathbb{R}^{d\times d}$ is a matrix constructed from the gradients of $f$ to approximate the Hessian $\nabla^{2}f(x_{k})$. Various quasi-Newton methods have been developed, each distinguished by its strategy for constructing the Hessian approximation $B_{k}$ and its inverse. The key methods among them are the Davidon-Fletcher-Powell (DFP) method [davidon1959variable, fletcher1963rapidly], the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [broyden1970convergence, fletcher1970new, goldfarb1970family, shanno1970conditioning], the Symmetric Rank-One (SR1) method [conn1991convergence, khalfan1993theoretical], the Broyden method [broyden1965class], and the limited-memory BFGS (L-BFGS) method [nocedal1980updating, liu1989limited]. Notably, these quasi-Newton methods directly maintain and update the inverse matrix $B_{k}^{-1}$ using a constant number of matrix-vector multiplications, resulting in a computational cost of $\mathcal{O}(d^{2})$ per iteration. This is cheaper than Newton’s method, whose iteration involves computing the Hessian and solving a linear system and could therefore incur a computational cost of $\mathcal{O}(d^{3})$.
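To make update (2) concrete, the following is a minimal sketch of BFGS with exact line search on a strongly convex quadratic $f(x)=\frac{1}{2}x^{\top}Ax-b^{\top}x$, where the exact step size has a closed form. The function name, the quadratic test problem, and the stopping tolerance are illustrative assumptions and are not taken from the paper; the inverse update is the standard BFGS formula for $H_{k}=B_{k}^{-1}$.

```python
import numpy as np

def bfgs_exact_line_search_quadratic(A, b, x0, H0, max_iter=50, tol=1e-10):
    """Minimize f(x) = 0.5 x^T A x - b^T x (A symmetric positive definite)
    with BFGS and exact line search. H approximates the inverse Hessian."""
    x = np.asarray(x0, dtype=float).copy()
    H = np.asarray(H0, dtype=float).copy()
    g = A @ x - b                       # gradient of the quadratic
    history = [x.copy()]
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        p = -H @ g                      # direction -B_k^{-1} grad f(x_k)
        # Exact line search: minimizer of eta -> f(x + eta p) for a quadratic.
        eta = -(g @ p) / (p @ A @ p)
        s = eta * p                     # iterate difference x_{k+1} - x_k
        x_new = x + s
        g_new = A @ x_new - b
        y = g_new - g                   # gradient difference
        rho = 1.0 / (y @ s)             # y^T s > 0 because A is positive definite
        # Standard BFGS inverse update. It is written with dense matrix
        # products for readability; expanding it into matrix-vector and
        # outer products gives the O(d^2) cost mentioned in the text.
        V = np.eye(x.size) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
        history.append(x.copy())
    return x, history
```

For a general (non-quadratic) objective, the closed-form step size above would have to be replaced by a one-dimensional minimization along the direction `p`.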

Compared to other first-order methods, such as gradient descent and accelerated gradient descent, the primary advantage of quasi-Newton methods is their ability to achieve Q-superlinear convergence, i.e.,

$$\lim_{k\to\infty}\frac{f(x_{k+1})-f(x_{*})}{f(x_{k})-f(x_{*})}=0\qquad\text{or}\qquad\lim_{k\to\infty}\frac{\|x_{k+1}-x_{*}\|}{\|x_{k}-x_{*}\|}=0, \tag{3}$$

where $x_{*}\in\mathbb{R}^{d}$ denotes the optimal solution of Problem (1). Specifically, [broyden1973local] and [dennis1974characterization] have established that both DFP and BFGS converge Q-superlinearly with unit step size $\eta_{k}=1$, provided that the initial point $x_{0}$ is within a local neighborhood of the optimal solution $x_{*}$. This result was later extended to various settings [griewank1982local, dennis1989convergence, yuan1991modified, al1998global, li1999globally, yabe2007local, mokhtari2017iqn, gao2019quasi]. However, these local convergence results are all asymptotic and fail to provide an explicit convergence rate after a finite number of iterations.

Recently, there has been progress regarding non-asymptotic local convergence analysis of quasi-Newton methods. The authors of [rodomanov2020rates] showed that, if the initial point $x_{0}$ is in a local neighborhood of the optimal solution $x_{*}$ and the initial Hessian approximation matrix $B_{0}$ is initialized as $LI$, then BFGS with unit step size attains a local superlinear convergence rate of the form $(\frac{dL}{\mu k})^{k}$, where $d$ is the problem’s dimension, $L$ is the Lipschitz parameter of the gradient, and $\mu$ is the strong convexity parameter. Later in [rodomanov2020ratesnew], the local convergence rate of BFGS was improved to $(\frac{d\log(L/\mu)}{k})^{k}$ under similar initial conditions. Similar local superlinear convergence analysis has also been established for the SR1 method [ye2023towards].
In a concurrent work [qiujiang2020quasinewton], the authors demonstrated that, if $x_{0}$ is in a local neighborhood of the optimal solution $x_{*}$ and $B_{0}$ is sufficiently close to the exact Hessian at the optimal solution (or selected as the exact Hessian at $x_{0}$), then BFGS with unit step size achieves a local superlinear rate of $(1/k)^{k/2}$, which is independent of the dimension $d$ and the condition number $L/\mu$. While these non-asymptotic results successfully characterize an explicit superlinear rate, they rely heavily on local analysis: they require the initial point to be sufficiently close to the optimal solution $x_{*}$ and impose conditions on the step size and the initial Hessian approximation matrix $B_{0}$. Consequently, these results cannot be directly extended to a global convergence guarantee. We discuss this issue in detail in Section 6.

To guarantee global convergence, quasi-Newton methods must be coupled with line search or trust-region techniques. The first global result for quasi-Newton methods was derived by Powell in [powell1971convergence], where it was established that DFP with exact line search converges globally and Q-superlinearly. Later, Dixon [Dixon] proved that all quasi-Newton methods from the convex Broyden class generate the same iterates when using exact line search, thus extending Powell’s result to the whole convex Broyden class, including BFGS. In order to relax the exact line search condition, the work in [Powell] considered BFGS using inexact line search based on the Wolfe conditions and showed that it retains global superlinear convergence. This result was later extended in [byrd1987global] to the convex Broyden class except for DFP. Moreover, [conn1991convergence, khalfan1993theoretical, byrd1996analysis] showed that the SR1 method with trust-region techniques achieves global and superlinear convergence.

However, all these results are asymptotic: they do not characterize an explicit global convergence rate for classic quasi-Newton methods. The only exception is a recent work in [krutikov2023convergence], where the authors also studied the global convergence rate of BFGS with exact line search. Specifically, it was shown that BFGS attains a global linear rate of $\left(1-2\kappa^{-3}\left(1+\frac{\mu\mathbf{Tr}(B_{0}^{-1})}{k}\right)^{-1}\left(1+\frac{\mathbf{Tr}(B_{0})}{Lk}\right)^{-1}\right)^{k}$, where $\mathbf{Tr}(\cdot)$ denotes the trace of a matrix and $\kappa=L/\mu$ is the condition number. We note that after $k=O(d)$ iterations, their linear rate approaches the rate of $(1-2\kappa^{-3})^{k}$, which is substantially slower than gradient descent-type methods. More importantly, their study does not establish any superlinear convergence rate and thus fails to fully characterize the behavior of BFGS.
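To get a feel for how slow a contraction factor of $1-2\kappa^{-3}$ is, the following small sketch counts the iterations a rate of the form $(1-c)^{k}$ needs to fall below a target accuracy. The helper name and the choice of comparison rate $c=1/(3\kappa)$ (the second linear phase reported in this paper) are illustrative assumptions.

```python
import math

def iterations_to_accuracy(contraction, eps):
    """Iterations k needed so that (1 - contraction)**k <= eps,
    assuming 0 < contraction < 1 and 0 < eps < 1."""
    # (1 - c)^k <= eps  <=>  k >= log(eps) / log(1 - c), since log(1 - c) < 0.
    return math.ceil(math.log(eps) / math.log1p(-contraction))

kappa = 10.0
eps = 1e-6
k_slow = iterations_to_accuracy(2.0 * kappa ** -3, eps)    # rate (1 - 2 kappa^{-3})^k
k_fast = iterations_to_accuracy(1.0 / (3.0 * kappa), eps)  # rate (1 - 1/(3 kappa))^k
print(k_slow, k_fast)
```

The gap between the two counts widens as $\kappa$ grows, since the first scales with $\kappa^{3}$ and the second with $\kappa$.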

The discussions above reveal a major gap in classic quasi-Newton methods: the lack of an explicit global convergence rate characterization.

Contributions. In this paper, we present the first results that contain explicit non-asymptotic global linear and superlinear convergence rates for the BFGS method with exact line search. Note that due to the equivalence result by Dixon [Dixon], our results also hold for other quasi-Newton methods in the convex Broyden class with exact line search. At a high level, our convergence analysis sharpens the potential function-based framework first introduced in [QN_tool], leading to a unifying framework for proving both the global linear convergence rates and the superlinear convergence rates. Our convergence results are global as they hold for any initial point $x_{0}\in\mathbb{R}^{d}$ and any initial Hessian approximation matrix $B_{0}$ that is symmetric positive definite. Specifically, our analysis divides the convergence process into three phases, characterized by different convergence rates:

  (i) First linear phase: We show that

    $$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-e^{-\frac{\Psi(\bar{B}_{0})}{k}}\frac{1}{\kappa}\max\left\{\frac{2}{1+\sqrt{\kappa}},\frac{1}{1+C_{0}}\right\}\right)^{k}.$$

    Here, $\bar{B}_{0}=\frac{1}{L}B_{0}$ is the scaled initial Hessian approximation matrix, $\Psi(\cdot)$ is a potential function defined later in (18), $\kappa=\frac{L}{\mu}$ denotes the condition number, and $C_{0}=\frac{M\sqrt{2(f(x_{0})-f(x_{*}))}}{\mu^{3/2}}$ is defined based on the initial optimality gap with $M$ as the Hessian’s Lipschitz parameter. In particular, when $k\geq\Psi(\bar{B}_{0})$, this leads to a linear rate of

    $$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-\frac{1}{3\kappa}\max\left\{\frac{2}{1+\sqrt{\kappa}},\frac{1}{1+C_{0}}\right\}\right)^{k}.$$

  (ii) Second linear phase: Upon reaching $k\geq(1+C_{0})\Psi(\bar{B}_{0})+3C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}$, the algorithm attains an improved linear rate matching that of standard gradient descent:

    $$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-\frac{1}{3\kappa}\right)^{k}.$$

  (iii) Superlinear phase: When $k\geq\Psi(\tilde{B}_{0})+4C_{0}\Psi(\bar{B}_{0})+12C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}$, BFGS achieves a superlinear convergence rate of

    $$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(\frac{\Psi(\tilde{B}_{0})+4C_{0}\Psi(\bar{B}_{0})+12C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}}{k}\right)^{k},$$

    where $\tilde{B}_{0}=\nabla^{2}f(x_{*})^{-\frac{1}{2}}B_{0}\nabla^{2}f(x_{*})^{-\frac{1}{2}}$ is the normalized initial Hessian approximation matrix.
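The three phase thresholds above are simple arithmetic once $\Psi(\bar{B}_{0})$, $\Psi(\tilde{B}_{0})$, $C_{0}$, and $\kappa$ are known. The sketch below evaluates them together with the superlinear bound; the function names are hypothetical, and the two potential-function values are taken as inputs because the definition of $\Psi$ only appears later in the paper.

```python
import math

def bfgs_phase_thresholds(psi_bar, psi_tilde, C0, kappa):
    """Iteration counts after which each phase's bound applies.

    psi_bar   -- Psi evaluated at the scaled matrix (1/L) B_0
    psi_tilde -- Psi evaluated at the normalized matrix built from B_0
    """
    common = C0 * kappa * min(2.0 * (1.0 + C0), 1.0 + math.sqrt(kappa))
    return {
        "linear_I": psi_bar,
        "linear_II": (1.0 + C0) * psi_bar + 3.0 * common,
        "superlinear": psi_tilde + 4.0 * C0 * psi_bar + 12.0 * common,
    }

def superlinear_bound(k, psi_bar, psi_tilde, C0, kappa):
    """Right-hand side of the superlinear bound; informative once k
    exceeds the 'superlinear' threshold, where the ratio drops below 1."""
    threshold = bfgs_phase_thresholds(psi_bar, psi_tilde, C0, kappa)["superlinear"]
    return (threshold / k) ** k
```

For instance, `bfgs_phase_thresholds(0.0, 10.0, 1.0, 4.0)` describes a hypothetical problem with $\kappa=4$ and $C_{0}=1$; the numbers are made up purely to exercise the formulas.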

Table 1: Summary of our convergence results. The last column presents the number of iterations required to reach the corresponding linear or superlinear convergence phase. For brevity, we drop absolute constants in our results.

| $B_{0}$ | Convergence Phase | Convergence Rate | Starting moment |
| --- | --- | --- | --- |
| $LI$ | Linear phase I | $\left(1-\frac{1}{\kappa\min\{1+C_{0},\sqrt{\kappa}\}}\right)^{k}$ | $1$ |
| $LI$ | Linear phase II | $\left(1-\frac{1}{\kappa}\right)^{k}$ | $C_{0}\kappa\min\{1+C_{0},\sqrt{\kappa}\}$ |
| $LI$ | Superlinear phase | $\left(\frac{d\kappa+C_{0}\kappa\min\{1+C_{0},\sqrt{\kappa}\}}{k}\right)^{k}$ | $d\kappa+C_{0}\kappa\min\{1+C_{0},\sqrt{\kappa}\}$ |
| $\mu I$ | Linear phase I | $\left(1-\frac{1}{\kappa\min\{1+C_{0},\sqrt{\kappa}\}}\right)^{k}$ | $d\log\kappa$ |
| $\mu I$ | Linear phase II | $\left(1-\frac{1}{\kappa}\right)^{k}$ | $d\log\kappa+C_{0}\kappa\min\{1+C_{0},\sqrt{\kappa}\}$ |
| $\mu I$ | Superlinear phase | $\left(\frac{(1+C_{0})d\log\kappa+C_{0}\kappa\min\{1+C_{0},\sqrt{\kappa}\}}{k}\right)^{k}$ | $(1+C_{0})d\log\kappa+C_{0}\kappa\min\{1+C_{0},\sqrt{\kappa}\}$ |

To make our convergence rates easily interpretable, we further consider $B_{0}=LI$ and $B_{0}=\mu I$ as two special cases. The global convergence results with these two initializations are summarized in Table 1. Our analysis reveals a trade-off between the linear and the superlinear rates, depending on the choice of the initial matrix $B_{0}$. Specifically, while both initializations lead to the same linear convergence rates, initializing with $B_{0}=LI$ allows the algorithm to reach this rate $d\log\kappa$ iterations earlier than with $B_{0}=\mu I$. On the other hand, for the superlinear convergence phase, the difference between $B_{0}=LI$ and $B_{0}=\mu I$ essentially boils down to comparing $d\kappa$ against $(1+4C_{0})d\log\kappa$. Thus, when $C_{0}\ll\kappa$, initializing with $B_{0}=\mu I$ enables an earlier transition to superlinear convergence compared to $B_{0}=LI$, as well as a faster superlinear convergence rate.
As we shall see in Section 7, our experiments also demonstrate this trade-off.
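The comparison between $d\kappa$ and $(1+4C_{0})d\log\kappa$ can be checked numerically. The sketch below evaluates both leading terms for given $d$, $\kappa$, and $C_{0}$; the function name and the sample values are illustrative assumptions, and $\log$ is taken to be the natural logarithm.

```python
import math

def superlinear_leading_terms(d, kappa, C0):
    """Leading terms that distinguish the two initializations in the
    superlinear phase: d*kappa for B_0 = L I versus
    (1 + 4 C0) * d * log(kappa) for B_0 = mu I."""
    term_LI = d * kappa
    term_muI = (1.0 + 4.0 * C0) * d * math.log(kappa)
    return term_LI, term_muI

# Small C0 relative to kappa: the mu*I initialization has the smaller term.
print(superlinear_leading_terms(d=10, kappa=100.0, C0=1.0))
# Large C0: the advantage reverses.
print(superlinear_leading_terms(d=10, kappa=100.0, C0=100.0))
```

The second call illustrates why the condition $C_{0}\ll\kappa$ matters: once $C_{0}$ is comparable to or larger than $\kappa/\log\kappa$, the $\mu I$ term is no longer the smaller one.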

Additional related work. In addition to the standard quasi-Newton methods such as BFGS, the superlinear convergence of other variants of quasi-Newton methods has also been studied in the literature. The greedy variants of quasi-Newton methods were first introduced in [rodomanov2020greedy] and developed in subsequent works [lin2021greedy, lin2022explicit, ji2023greedy]. Instead of using the difference of successive iterates to update the Hessian approximation matrix, the key idea is to greedily select basis vectors to maximize a certain measure of progress. In [rodomanov2020greedy], greedy BFGS is shown to achieve a local superlinear convergence rate of $\left(d\kappa\left(1-\frac{1}{d\kappa}\right)^{\frac{k}{2}}\right)^{k}$, and the superlinear convergence phase begins after $d\kappa\ln(d\kappa)$ iterations. Similar superlinear convergence rates were established in [lin2021greedy, lin2022explicit, ji2023greedy]. However, we note that their results are all local and require the initial point to be sufficiently close to the optimal solution $x_{*}$. Recently, along a different line of work, the authors in [jiang2023online, jiang2023accelerated] proposed quasi-Newton-type methods based on the hybrid proximal extragradient framework [solodov1999hybrid, monteiro2010complexity] and studied their global convergence rates.
Specifically, it was shown that the quasi-Newton proximal extragradient method in [jiang2023online] achieves a global linear convergence rate of $\mathcal{O}((1-1/\kappa)^{k})$ and a global superlinear rate of $\mathcal{O}((1+\sqrt{k/\kappa^{2}d})^{-k})$. However, all these methods are distinct from the classical quasi-Newton methods such as BFGS analyzed in this paper, since they formulate the update of the Hessian approximation matrices $B_{k}$ as an online convex optimization problem and follow an online learning algorithm to update $B_{k}$.

Outline. In Section 2, we provide an overview of the BFGS method with exact line search, outline our assumptions, and introduce some fundamental lemmas for the exact line search scheme. Section 3 presents our general analytical framework, which is employed to establish global linear and superlinear convergence results for the BFGS method, along with the intermediate results for the update of quasi-Newton methods. In Section 4, we establish the global linear convergence rate of BFGS using exact line search and delve into specific cases with $B_{0}=LI$ and $B_{0}=\mu I$. Section 5 details our global superlinear convergence results, applicable to any choices of $B_{0}$ and $x_{0}$. In Section 6, we contrast our analytical framework with classical asymptotic and recent local non-asymptotic analyses of BFGS. Section 7 displays our numerical experiments that corroborate our theoretical findings. Finally, we finish the paper by presenting some concluding remarks in Section 8.

Notation. We use $\|\cdot\|$ to denote the $\ell_{2}$ norm of a vector or the spectral norm of a matrix. We denote $\mathbb{S}^{d}_{+}$ and $\mathbb{S}^{d}_{++}$ as the set of symmetric positive semidefinite and symmetric positive definite matrices with dimension $d\times d$, respectively. Given two symmetric matrices $A$ and $B$, we denote $A\preceq B$ if and only if $B-A$ is symmetric positive semidefinite. Given a matrix $A$, we use $\mathbf{Tr}(A)$ and $\mathbf{Det}(A)$ to denote its trace and determinant, respectively.

2 Preliminaries

In this section, we first outline the assumptions, notations, and lemmas essential for our convergence proof. Following this, we explore the general framework of quasi-Newton methods incorporating exact line search and provide an overview of the principal concepts underpinning the update mechanism in the convex Broyden’s class of quasi-Newton (QN) methods, which encompasses both the BFGS and DFP algorithms.

2.1 Assumptions

To begin with, we state our assumptions on the objective function $f$.

Assumption 1.

The objective function $f$ is strongly convex with parameter $\mu > 0$, i.e., $\|\nabla f(x) - \nabla f(y)\| \geq \mu \|x - y\|$, for any $x, y \in \mathbb{R}^d$.

Assumption 2.

The objective function gradient $\nabla f$ is Lipschitz continuous with parameter $L > 0$, i.e., $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$ for any $x, y \in \mathbb{R}^d$.

Both Assumptions 1 and 2 are standard in the convergence analysis of first-order methods. Moreover, since $f$ is twice differentiable, they imply that $\mu I \preceq \nabla^2 f(x) \preceq LI$ for any $x \in \mathbb{R}^d$. Additionally, the condition number of $f$ is defined as $\kappa := \frac{L}{\mu}$. We also remark that Assumptions 1 and 2 are sufficient to prove our global linear convergence rate results. In order to achieve a superlinear convergence rate, we need to impose an additional assumption on the Hessian of the function $f$, which is stated below.

Assumption 3.

The objective function Hessian $\nabla^2 f$ is Lipschitz continuous with parameter $M > 0$, i.e., $\|\nabla^2 f(x) - \nabla^2 f(y)\| \leq M \|x - y\|$ for any $x, y \in \mathbb{R}^d$.

Assumption 3 is also commonly employed in the analysis of quasi-Newton methods (see, e.g., [QN_tool]), as it provides the smoothness of the Hessian needed to establish superlinear convergence rates.
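As a concrete illustration of Assumptions 1 to 3, consider a quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ with $A \in \mathbb{S}^d_{++}$. This example is chosen here for illustration and is not taken from the paper. Its Hessian is constant, so $\mu = \lambda_{\min}(A)$, $L = \lambda_{\max}(A)$, and Assumption 3 holds for any $M > 0$. The following NumPy sketch checks the two gradient inequalities on random pairs of points; the matrix `A`, the vector `b`, and the sample points are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
X = rng.standard_normal((d, d))
A = X @ X.T + np.eye(d)          # illustrative SPD Hessian of a quadratic
b = rng.standard_normal(d)

grad = lambda x: A @ x - b       # gradient of f(x) = 0.5 x^T A x - b^T x

eigs = np.linalg.eigvalsh(A)     # eigenvalues in ascending order
mu, L = eigs[0], eigs[-1]        # strong convexity and smoothness parameters
kappa = L / mu                   # condition number

# Check mu ||x - y|| <= ||grad f(x) - grad f(y)|| <= L ||x - y|| on random pairs,
# with a tiny relative slack for floating-point rounding.
ok = True
for _ in range(100):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    gap = np.linalg.norm(grad(x) - grad(y))
    dist = np.linalg.norm(x - y)
    ok = ok and (mu * dist <= gap * (1 + 1e-12)) and (gap <= L * dist * (1 + 1e-12))
```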

2.2 Quasi-Newton methods with exact line search

Next, we briefly review the template for updating QN methods, focusing specifically on the DFP and BFGS algorithms. Specifically, at the $k$-th iteration, the update in (2) can be equivalently written as

$$x_{k+1} = x_k + \eta_k d_k, \qquad \text{where } \; d_k = -B_k^{-1} g_k \quad \text{and} \quad g_k = \nabla f(x_k). \tag{4}$$

Here, $\eta_k \geq 0$ represents the step size, and $B_k \in \mathbb{R}^{d \times d}$ is the Hessian approximation matrix. Replacing $B_k$ with the exact Hessian $\nabla^2 f(x_k)$ turns the update into classical Newton’s method. Quasi-Newton methods aim to approximate the Hessian with first-order information, typically adhering to a secant condition and a least-change property. To elaborate, we define the variable difference $s_k$ and gradient difference $y_k$ as

$$s_k := x_{k+1} - x_k, \qquad y_k := \nabla f(x_{k+1}) - \nabla f(x_k). \tag{5}$$

The secant condition mandates that $B_{k+1}$ satisfy $y_k = B_{k+1} s_k$, ensuring the gradient consistency between the quadratic model $h_{k+1}(x) = f(x_k) + g_k^\top (x - x_k) + \frac{1}{2} (x - x_k)^\top B_{k+1} (x - x_k)$ and $f$ at $x_k$ and $x_{k+1}$; that is, $\nabla h_{k+1}(x_k) = \nabla f(x_k)$ and $\nabla h_{k+1}(x_{k+1}) = \nabla f(x_{k+1})$ (see [nocedal2006numerical, Chapter 6]). That said, the secant condition does not uniquely define $B_{k+1}$. Thus, we impose a least-change property to ensure that $B_{k+1}$, while satisfying the secant condition, is closest to $B_k$ in a specific proximity measure. Various proximity measures have been proposed in the literature [goldfarb1970family, greenstadt1970variations, fletcher1991new] and here we follow the variational characterization in [fletcher1991new]. Specifically, for any symmetric positive definite matrix $A \in \mathbb{S}^d_{++}$, define the negative log-determinant function $\Phi(A) = -\log \mathbf{Det}(A)$ and define the Bregman divergence generated by $\Phi$ by

$$D_\Phi(A, B) := \Phi(A) - \Phi(B) - \langle \nabla \Phi(B), A - B \rangle = \mathbf{Tr}(B^{-1} A) - \log \mathbf{Det}(B^{-1} A) - d. \tag{6}$$
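The closed form in (6) is straightforward to evaluate numerically. The following is a minimal NumPy sketch, not part of the paper; the function name `bregman_logdet` and the random test matrices are illustrative assumptions. It evaluates the divergence through a linear solve and a log-determinant, and checks that it vanishes when both arguments coincide and is positive otherwise.

```python
import numpy as np

def bregman_logdet(A, B):
    """D_Phi(A, B) = Tr(B^{-1} A) - log Det(B^{-1} A) - d for SPD matrices A, B."""
    d = A.shape[0]
    M = np.linalg.solve(B, A)              # B^{-1} A without forming the inverse
    _, logdet = np.linalg.slogdet(M)       # log-determinant of B^{-1} A
    return np.trace(M) - logdet - d

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
Y = rng.standard_normal((4, 4))
A = X @ X.T + 4 * np.eye(4)                # illustrative SPD matrices
B = Y @ Y.T + 4 * np.eye(4)

div_same = bregman_logdet(A, A)            # expected to be zero up to rounding
div_diff = bregman_logdet(A, B)            # expected to be strictly positive
```

In terms of the eigenvalues $\lambda_i > 0$ of $B^{-1} A$, the divergence equals $\sum_i (\lambda_i - \log \lambda_i - 1)$, which is why it is nonnegative and zero only when every $\lambda_i = 1$.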

Note that the Bregman divergence can be regarded as a measure of proximity between two positive definite matrices, and $D_\Phi(A, B) = 0$ if and only if $A = B$. For the BFGS update, it was shown in [fletcher1991new] that $B_{k+1}$ is given as the unique solution of the minimization problem:

$$\min_{B \in \mathbb{S}^d_{++}} \; D_\Phi(B, B_k) \quad \text{s.t.} \quad y_k = B s_k,$$

which admits the following explicit update rule:

$$B^{\text{BFGS}}_{k+1} := B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top}{s_k^\top y_k}. \tag{7}$$

Moreover, if we define $H_k := B_k^{-1}$ as the inverse of the Hessian approximation matrix, it follows from the Sherman-Morrison formula that

$$H^{\text{BFGS}}_{k+1} := \left(I - \frac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left(I - \frac{y_k s_k^\top}{s_k^\top y_k}\right) + \frac{s_k s_k^\top}{y_k^\top s_k}. \tag{8}$$
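The two forms of the BFGS update in (7) and (8) can be checked against each other numerically. The sketch below is a hedged NumPy illustration, not the authors' implementation; the helper names `bfgs_update_B` and `bfgs_update_H` and the test data are assumptions. The pair $(s_k, y_k)$ is generated as $y_k = A s_k$ for an illustrative positive definite matrix `A`, which guarantees $s_k^\top y_k > 0$.

```python
import numpy as np

def bfgs_update_B(B, s, y):
    """Direct BFGS update (7) of the Hessian approximation."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def bfgs_update_H(H, s, y):
    """Inverse BFGS update (8) of the inverse Hessian approximation."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)   # I - y s^T / (s^T y)
    return V.T @ H @ V + rho * np.outer(s, s)

rng = np.random.default_rng(1)
d = 4
X = rng.standard_normal((d, d))
A = X @ X.T + np.eye(d)          # illustrative SPD matrix playing the role of a Hessian
B = np.eye(d)                    # current approximation B_k
s = rng.standard_normal(d)       # variable difference s_k
y = A @ s                        # gradient difference of a quadratic, so s^T y > 0

B_new = bfgs_update_B(B, s, y)
H_new = bfgs_update_H(np.linalg.inv(B), s, y)

secant_ok = np.allclose(B_new @ s, y)                # secant condition y_k = B_{k+1} s_k
inverse_ok = np.allclose(B_new @ H_new, np.eye(d))   # H_{k+1} is the inverse of B_{k+1}
```

Working with $H_k$ directly avoids solving a linear system when forming $d_k = -H_k g_k$, which is the usual practical reason for the inverse form.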

The DFP update rule can be regarded as the dual of BFGS, where the roles of the Hessian approximation matrix $B_{k+1}$ and its inverse $H_{k+1}$ are exchanged. Specifically, the DFP update rules are given by

$$B^{\text{DFP}}_{k+1} := \left(I - \frac{y_k s_k^\top}{y_k^\top s_k}\right) B_k \left(I - \frac{s_k y_k^\top}{s_k^\top y_k}\right) + \frac{y_k y_k^\top}{y_k^\top s_k}, \quad H^{\text{DFP}}_{k+1} := H_k - \frac{H_k y_k y_k^\top H_k}{y_k^\top H_k y_k} + \frac{s_k s_k^\top}{s_k^\top y_k}.$$

Both BFGS and DFP belong to a more general class of QN methods, known as the convex Broyden’s class [broyden1967quasi]. In this class, the Hessian approximation matrix $B_{k+1}$ is defined as

$$B_{k+1} := \phi_k B^{\text{DFP}}_{k+1} + (1 - \phi_k) B^{\text{BFGS}}_{k+1},$$

where $\phi_k \in [0, 1]$ for any $k \geq 0$. Accordingly, there exists $\psi_k \in [0, 1]$ such that the Hessian inverse approximation matrix $H_{k+1}$ is given by

$$H_{k+1} := (1 - \psi_k) H^{\text{DFP}}_{k+1} + \psi_k H^{\text{BFGS}}_{k+1}.$$

The convex Broyden’s class exhibits a crucial property: if the initial Hessian approximation matrix $B_0$ is symmetric positive definite and the objective function $f$ is strictly convex, then all subsequent $B_k$ matrices produced by this class maintain symmetric positive definiteness (see [nocedal2006numerical]).
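The convex combination defining the Broyden class, together with the preservation of positive definiteness, can be illustrated with a short NumPy sketch. This is an illustration under stated assumptions rather than the paper's code: the helper names, the fixed values of $\phi_k$, and the data are hypothetical, and the pairs $(s_k, y_k)$ are generated as $y_k = A s_k$ for an illustrative positive definite `A`, so that $s_k^\top y_k > 0$ as it would be for a strictly convex objective.

```python
import numpy as np

def bfgs_B(B, s, y):
    """BFGS update of the Hessian approximation."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def dfp_B(B, s, y):
    """DFP update of the Hessian approximation."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)   # I - s y^T / (s^T y)
    return V.T @ B @ V + rho * np.outer(y, y)

def broyden_B(B, s, y, phi):
    """Convex Broyden update: phi = 0 gives BFGS, phi = 1 gives DFP."""
    return phi * dfp_B(B, s, y) + (1.0 - phi) * bfgs_B(B, s, y)

rng = np.random.default_rng(2)
d = 4
X = rng.standard_normal((d, d))
A = X @ X.T + np.eye(d)              # illustrative SPD matrix generating y = A s

min_eigs, secant_residuals = [], []
for phi in (0.0, 0.3, 1.0):          # BFGS, an intermediate member, DFP
    Bk = np.eye(d)                   # SPD initial matrix B_0
    for _ in range(3):
        s = rng.standard_normal(d)
        y = A @ s
        Bk = broyden_B(Bk, s, y, phi)
        secant_residuals.append(np.linalg.norm(Bk @ s - y))
        min_eigs.append(np.linalg.eigvalsh(Bk).min())
```

Every member of the class satisfies the secant condition because both endpoints do, and the smallest eigenvalue recorded in `min_eigs` stays positive in each run.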

To guarantee the global convergence of quasi-Newton methods in (4), it is necessary to employ a line search scheme to select the step size $\eta_k$. In this paper, our primary focus is on the exact line search step size, where we aim to minimize the objective function along the search direction $d_k$. Specifically,

$$\eta_k := \operatorname*{arg\,min}_{\eta \geq 0} f(x_k + \eta d_k). \tag{9}$$
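For a general smooth strongly convex objective, the step size in (9) has no closed form and must be computed numerically. One simple possibility, sketched below, is bisection on the directional derivative $\eta \mapsto \nabla f(x_k + \eta d_k)^\top d_k$, which is nondecreasing in $\eta$ when $f$ is convex. The bisection scheme, the function name `exact_line_search`, and the test objective $f(x) = \frac{1}{2} x^\top A x + \sum_i \log\cosh(x_i)$ are assumptions made for illustration; the paper does not prescribe how the one-dimensional problem is solved.

```python
import numpy as np

def exact_line_search(grad, x, d, tol=1e-12, max_iter=200):
    """Approximate argmin over eta >= 0 of f(x + eta d) by bisection on the slope."""
    def slope(eta):
        return grad(x + eta * d) @ d       # derivative of eta -> f(x + eta d)

    if slope(0.0) >= 0.0:                  # d is not a descent direction
        return 0.0
    hi = 1.0
    while slope(hi) < 0.0:                 # expand the bracket until the slope changes sign
        hi *= 2.0
    lo = 0.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if slope(mid) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Illustrative strongly convex objective: gradient of 0.5 x^T A x + sum(log cosh(x)).
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x + np.tanh(x)

x = np.array([1.0, 1.0])
d = -grad(x)                               # steepest descent direction as an example
eta = exact_line_search(grad, x, d)
slope_at_eta = grad(x + eta * d) @ d       # should be close to zero at the minimizer
```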

Remarkably, it was shown in [Dixon] that, when employing the exact line search scheme, all methods in the convex Broyden’s class produce identical iterates, provided that the initial point $x_0$ and the initial matrix $B_0$ are the same. Thus, in the remainder of the paper, we focus on the BFGS update in (7), as all results also hold for the other algorithms in the convex Broyden’s class.
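The equivalence can be observed numerically in the simplest setting, a strongly convex quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$, where the exact step size has the closed form $\eta_k = -g_k^\top d_k / (d_k^\top A d_k)$. The NumPy sketch below runs BFGS and DFP from the same $x_0$ and $B_0$ and compares the iterates. It is an illustration only: the quadratic test problem, the helper names, and the number of iterations are assumptions, and the agreement holds up to floating-point rounding.

```python
import numpy as np

def bfgs_B(B, s, y):
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def dfp_B(B, s, y):
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)   # I - s y^T / (s^T y)
    return V.T @ B @ V + rho * np.outer(y, y)

def run(update, A, b, x0, B0, num_iters):
    """Quasi-Newton iteration (4) with exact line search on a quadratic."""
    x, B = x0.copy(), B0.copy()
    iterates = [x.copy()]
    for _ in range(num_iters):
        g = A @ x - b
        d = -np.linalg.solve(B, g)
        eta = -(g @ d) / (d @ A @ d)     # closed-form exact step for a quadratic
        s = eta * d
        y = A @ s                        # gradient difference of the quadratic
        B = update(B, s, y)
        x = x + s
        iterates.append(x.copy())
    return np.array(iterates)

rng = np.random.default_rng(3)
n = 6
X = rng.standard_normal((n, n))
A = X @ X.T + np.eye(n)                  # illustrative SPD Hessian
Y = rng.standard_normal((n, n))
B0 = Y @ Y.T + np.eye(n)                 # shared SPD initial matrix
b = rng.standard_normal(n)
x0 = np.zeros(n)

xs_bfgs = run(bfgs_B, A, b, x0, B0, num_iters=4)
xs_dfp = run(dfp_B, A, b, x0, B0, num_iters=4)
max_gap = np.abs(xs_bfgs - xs_dfp).max() # largest entrywise difference between the two runs
```

The number of iterations is kept below the dimension `n` so that the gradient, and hence $s_k^\top y_k$, does not vanish during the run.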

Finally, we introduce some intermediate results related to the exact line search step size, as defined in (9). These results are essential for the forthcoming demonstration of the convergence rate of the quasi-Newton method.

Lemma 1.

Consider the standard quasi-Newton method in (4) with the exact line search specified in (9). The following results hold for any $k \geq 0$:

  (a) $f(x_{k+1}) \leq f(x_k)$.

  (b) $g_{k+1}^\top s_k = 0$ and $y_k^\top s_k = -g_k^\top s_k$.

Proof.

Given $x_k$ and $d_k$ in the $k$-th iteration, define the function $h(\eta) := f(x_k + \eta d_k)$. By the definition of exact line search in (9), it holds that $\eta_k = \operatorname*{arg\,min}_{\eta \geq 0} h(\eta)$. To prove (a), note that we have $f(x_{k+1}) = h(\eta_k) \leq h(0) = f(x_k)$. Moreover, $h'(0) = g_k^\top d_k = -g_k^\top B_k^{-1} g_k < 0$ whenever $g_k \neq 0$, since $B_k$ is positive definite, so the minimizer satisfies $\eta_k > 0$ and the first-order optimality condition yields $h'(\eta_k) = \nabla f(x_k + \eta_k d_k)^\top d_k = 0$; if $g_k = 0$, then $d_k = 0$ and this equality holds trivially. Since $g_{k+1} = \nabla f(x_{k+1}) = \nabla f(x_k + \eta_k d_k)$, $s_k = x_{k+1} - x_k = \eta_k d_k$ and $\eta_k \geq 0$, the above equation implies that $g_{k+1}^\top s_k = \nabla f(x_k + \eta_k d_k)^\top \eta_k d_k = \eta_k \nabla f(x_k + \eta_k d_k)^\top d_k = 0$. Applying the fact that $g_{k+1}^\top s_k = 0$, we obtain that $y_k^\top s_k = g_{k+1}^\top s_k - g_k^\top s_k = -g_k^\top s_k$. ∎
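Both parts of Lemma 1 can be verified numerically along a BFGS trajectory. The following NumPy sketch does so on an illustrative strongly convex quadratic, for which the exact step size is available in closed form; the test problem, the random seed, and the variable names are assumptions and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
X = rng.standard_normal((n, n))
A = X @ X.T + np.eye(n)                  # illustrative SPD Hessian
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x, B = rng.standard_normal(n), np.eye(n)
decrease_ok = True
orth_residuals, curvature_residuals = [], []
for k in range(3):
    g = grad(x)
    d = -np.linalg.solve(B, g)           # search direction d_k = -B_k^{-1} g_k
    eta = -(g @ d) / (d @ A @ d)         # exact line search step for a quadratic
    s = eta * d
    x_next = x + s
    g_next = grad(x_next)
    y = g_next - g

    decrease_ok = decrease_ok and (f(x_next) <= f(x))   # part (a)
    orth_residuals.append(abs(g_next @ s))              # part (b): g_{k+1}^T s_k = 0
    curvature_residuals.append(abs(y @ s + g @ s))      # part (b): y_k^T s_k = -g_k^T s_k

    Bs = B @ s                                          # BFGS update (7)
    B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)
    x = x_next
```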

3 Convergence analysis framework

In this section, we introduce our theoretical framework for establishing the global convergence rates of the BFGS algorithm with exact line search. As previously discussed, due to the equivalence among quasi-Newton methods within the convex Broyden’s class under the exact line search [Dixon], our results also extend to the entire convex Broyden’s class, including the DFP algorithm.

Our framework builds on two key propositions. In Proposition 1, we characterize the amount of function value decrease in one iteration in terms of the angle $\theta_k$ between the steepest descent direction $-g_k$ and the search direction $d_k$ given in (4). Subsequently, Proposition 2 presents a potential function for the BFGS update, which leads to a lower bound on $\cos(\theta_k)$.

To formally start the analysis, we first introduce a weighted version of key vectors and matrices. Specifically, for a weight matrix P𝕊++d𝑃superscriptsubscript𝕊absent𝑑P\in\mathbb{S}_{++}^{d}italic_P ∈ blackboard_S start_POSTSUBSCRIPT + + end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, we define the weighted gradient g^ksubscript^𝑔𝑘\hat{g}_{k}over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, the weighted gradient difference y^ksubscript^𝑦𝑘\hat{y}_{k}over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, and the weighted iterate difference s^ksubscript^𝑠𝑘\hat{s}_{k}over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT as

g^k=P12gk,y^k=P12yk,s^k=P12sk.formulae-sequencesubscript^𝑔𝑘superscript𝑃12subscript𝑔𝑘formulae-sequencesubscript^𝑦𝑘superscript𝑃12subscript𝑦𝑘subscript^𝑠𝑘superscript𝑃12subscript𝑠𝑘\hat{g}_{k}=P^{-\frac{1}{2}}g_{k},\qquad\hat{y}_{k}=P^{-\frac{1}{2}}y_{k},% \qquad\hat{s}_{k}=P^{\frac{1}{2}}s_{k}.over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_P start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT italic_g start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_P start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_P start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT . (10)

Similarly, we define the weighted Hessian approximation matrix B^ksubscript^𝐵𝑘\hat{B}_{k}over^ start_ARG italic_B end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT as

B^k=P12BkP12.subscript^𝐵𝑘superscript𝑃12subscript𝐵𝑘superscript𝑃12\hat{B}_{k}=P^{-\frac{1}{2}}{B}_{k}P^{-\frac{1}{2}}.over^ start_ARG italic_B end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_P start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_P start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT . (11)

Note that the weight matrix $P$ can be chosen as any positive definite matrix, and its choice will be evident from the context. In particular, as we shall see later, we use $P=LI$ in Section 4 to prove the global linear convergence rate, and use $P=\nabla^{2}f(x_{*})$ in Section 5 to prove the global superlinear convergence rate. Moreover, since the above weighting procedure amounts to a change of the coordinate system, the weighted versions of the vectors and matrices defined in (10) and (11) retain the same algebraic relations as their original forms. In particular, the weighted Hessian approximation matrices generated by the BFGS algorithm follow the update rule:

$$\hat{B}_{k+1}=\hat{B}_{k}-\frac{\hat{B}_{k}\hat{s}_{k}\hat{s}_{k}^{\top}\hat{B}_{k}}{\hat{s}_{k}^{\top}\hat{B}_{k}\hat{s}_{k}}+\frac{\hat{y}_{k}\hat{y}_{k}^{\top}}{\hat{s}_{k}^{\top}\hat{y}_{k}}. \tag{12}$$
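The claim that the weighting is only a change of coordinates can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated data; the helper names `bfgs_update` and `sym_power` and all test matrices are illustrative choices, not part of the paper. It applies the standard BFGS update in the original coordinates, transforms the result as in (11), and compares it with the update (12) applied directly to the weighted quantities from (10) and (11).

```python
import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update of a Hessian approximation B (same form as (12))."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def sym_power(P, p):
    """P**p for a symmetric positive definite P, via its eigendecomposition."""
    w, V = np.linalg.eigh(P)
    return (V * w**p) @ V.T

rng = np.random.default_rng(0)
d = 5

# Hypothetical test data: SPD weight P, SPD approximation B, and a pair (s, y)
# with s^T y > 0, obtained as y = H s for an SPD matrix H.
M = rng.standard_normal((d, d)); P = M @ M.T + d * np.eye(d)
N = rng.standard_normal((d, d)); B = N @ N.T + np.eye(d)
K = rng.standard_normal((d, d)); H = K @ K.T + np.eye(d)
s = rng.standard_normal(d)
y = H @ s

P_half, P_minus_half = sym_power(P, 0.5), sym_power(P, -0.5)

# Route 1: update in the original coordinates, then weight as in (11).
B_hat_next_direct = P_minus_half @ bfgs_update(B, s, y) @ P_minus_half

# Route 2: weight everything as in (10)-(11), then apply (12).
B_hat = P_minus_half @ B @ P_minus_half
s_hat, y_hat = P_half @ s, P_minus_half @ y
B_hat_next_from_12 = bfgs_update(B_hat, s_hat, y_hat)

max_gap = np.max(np.abs(B_hat_next_direct - B_hat_next_from_12))
print("largest entrywise difference:", max_gap)
```

The two routes agree up to floating-point rounding because $\hat{B}_{k}\hat{s}_{k}=P^{-\frac{1}{2}}B_{k}s_{k}$, $\hat{s}_{k}^{\top}\hat{B}_{k}\hat{s}_{k}=s_{k}^{\top}B_{k}s_{k}$, and $\hat{s}_{k}^{\top}\hat{y}_{k}=s_{k}^{\top}y_{k}$.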

Before introducing our first key proposition, we define a quantity $\hat{\theta}_{k}$ by

$$\cos(\hat{\theta}_{k})=\frac{-\hat{g}_{k}^{\top}\hat{s}_{k}}{\|\hat{g}_{k}\|\|\hat{s}_{k}\|}, \tag{13}$$

which is the angle between the weighted steepest descent direction $-\hat{g}_{k}$ and the weighted iterate difference $\hat{s}_{k}$. It is well-known that the convergence of QN methods can be established by monitoring the behavior of $\cos(\hat{\theta}_{k})$. We next quantify the link between functional value decrease and $\cos(\hat{\theta}_{k})$.

Proposition 1.

Let $\{x_{k}\}_{k\geq 0}$ be the iterates generated by the BFGS method with exact line search. Given a weight matrix $P\in\mathbb{S}^{d}_{++}$, recall the weighted vectors and matrices defined in (10) and (11). For any $k\geq 0$, we have

$$f(x_{k+1})-f(x_{*})=\left(1-\frac{\hat{\alpha}_{k}\hat{q}_{k}}{\hat{m}_{k}}\cos^{2}(\hat{\theta}_{k})\right)(f(x_{k})-f(x_{*})), \tag{14}$$

where we define

$$\hat{\alpha}_{k}:=\frac{f(x_{k})-f(x_{k+1})}{-\hat{g}_{k}^{\top}\hat{s}_{k}},\qquad\hat{q}_{k}:=\frac{\|\hat{g}_{k}\|^{2}}{f(x_{k})-f(x_{*})},\qquad\hat{m}_{k}:=\frac{\hat{y}_{k}^{\top}\hat{s}_{k}}{\|\hat{s}_{k}\|^{2}}. \tag{15}$$

As a corollary, we have that for any $k\geq 1$,

$$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left[1-\left(\prod_{i=0}^{k-1}\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})\right)^{\frac{1}{k}}\right]^{k}. \tag{16}$$
Proof.

First, we use the definition of $\hat{\alpha}_{k}$ in (15) to write

$$f(x_{k})-f(x_{k+1})=-\hat{\alpha}_{k}\hat{g}_{k}^{\top}\hat{s}_{k}=-\hat{\alpha}_{k}\frac{\hat{g}_{k}^{\top}\hat{s}_{k}}{\|\hat{g}_{k}\|^{2}}\|\hat{g}_{k}\|^{2}. \tag{17}$$

Moreover, note that we have $-\hat{g}_{k}^{\top}\hat{s}_{k}=\hat{y}_{k}^{\top}\hat{s}_{k}$ by Lemma 1(b). Hence, using the definition of $\hat{\theta}_{k}$ in (13) and the definition of $\hat{m}_{k}$ in (15), it follows that

$$\frac{-\hat{g}_{k}^{\top}\hat{s}_{k}}{\|\hat{g}_{k}\|^{2}}=\frac{(\hat{g}_{k}^{\top}\hat{s}_{k})^{2}}{\|\hat{g}_{k}\|^{2}\|\hat{s}_{k}\|^{2}}\frac{\|\hat{s}_{k}\|^{2}}{-\hat{g}_{k}^{\top}\hat{s}_{k}}=\frac{(\hat{g}_{k}^{\top}\hat{s}_{k})^{2}}{\|\hat{g}_{k}\|^{2}\|\hat{s}_{k}\|^{2}}\frac{\|\hat{s}_{k}\|^{2}}{\hat{y}_{k}^{\top}\hat{s}_{k}}=\frac{\cos^{2}(\hat{\theta}_{k})}{\hat{m}_{k}}.$$

Furthermore, we have $\|\hat{g}_{k}\|^{2}=\hat{q}_{k}(f(x_{k})-f(x_{*}))$ from the definition of $\hat{q}_{k}$ in (15). Thus, the equality in (17) can be rewritten as

$$f(x_{k})-f(x_{k+1})=\frac{\hat{\alpha}_{k}\hat{q}_{k}}{\hat{m}_{k}}\cos^{2}(\hat{\theta}_{k})(f(x_{k})-f(x_{*})).$$

By rearranging the terms in the above equality, we obtain (14). To prove the inequality in (16), note that for any $k\geq 1$, we have

$$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}=\prod_{i=0}^{k-1}\frac{f(x_{i+1})-f(x_{*})}{f(x_{i})-f(x_{*})}=\prod_{i=0}^{k-1}\left(1-\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})\right),$$

where the last equality is due to (14). Notice that both the terms $1-\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})$ and the terms $\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})$ are non-negative for any $i\geq 0$. Thus, by applying the inequality of arithmetic and geometric means twice, we obtain that

$$\begin{aligned}
\prod_{i=0}^{k-1}\left(1-\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})\right)&\leq\left[\frac{1}{k}\sum_{i=0}^{k-1}\left(1-\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})\right)\right]^{k}\\
&=\left[1-\frac{1}{k}\sum_{i=0}^{k-1}\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})\right]^{k}\leq\left[1-\left(\prod_{i=0}^{k-1}\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})\right)^{\frac{1}{k}}\right]^{k}.
\end{aligned}$$

This completes the proof. ∎
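As a sanity check of Proposition 1, the identity (14) and the bound (16) can be verified numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the objective is a randomly generated strongly convex quadratic (for which the exact line search step has a closed form), the weight is $P=LI$ with $L$ taken as the largest Hessian eigenvalue, $B_{0}=I$, and all variable names are hypothetical. Fewer than $d$ iterations are run so that $f(x_{k})-f(x_{*})$ stays away from zero.

```python
import numpy as np

rng = np.random.default_rng(1)
d, iters = 10, 6

# Assumed test problem: f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite.
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)
b = rng.standard_normal(d)
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
f_star = f(np.linalg.solve(A, b))
L = np.linalg.eigvalsh(A)[-1]          # gradient Lipschitz constant of f

x, B = rng.standard_normal(d), np.eye(d)
gap0 = f(x) - f_star
ratios, identity_gap = [], 0.0
for _ in range(iters):
    g = grad(x)
    direction = -np.linalg.solve(B, g)
    eta = -(g @ direction) / (direction @ A @ direction)   # exact line search step
    x_next = x + eta * direction
    s, y = x_next - x, grad(x_next) - g

    # Weighted quantities (10) with P = L I.
    g_hat, y_hat, s_hat = g / np.sqrt(L), y / np.sqrt(L), np.sqrt(L) * s

    # Quantities from (13) and (15).
    alpha = (f(x) - f(x_next)) / (-(g_hat @ s_hat))
    q = (g_hat @ g_hat) / (f(x) - f_star)
    m = (y_hat @ s_hat) / (s_hat @ s_hat)
    cos2 = (g_hat @ s_hat) ** 2 / ((g_hat @ g_hat) * (s_hat @ s_hat))
    r = alpha * q * cos2 / m

    # Discrepancy between the two sides of (14), relative to the initial gap.
    lhs, rhs = f(x_next) - f_star, (1.0 - r) * (f(x) - f_star)
    identity_gap = max(identity_gap, abs(lhs - rhs) / gap0)
    ratios.append(r)

    # BFGS update of the Hessian approximation.
    Bs = B @ s
    B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)
    x = x_next

k = len(ratios)
total_ratio = (f(x) - f_star) / gap0                 # left-hand side of (16)
bound = (1.0 - np.prod(ratios) ** (1.0 / k)) ** k    # right-hand side of (16)
print("max discrepancy in (14):", identity_gap)
print("ratio:", total_ratio, "bound from (16):", bound)
```

In this sketch the per-step factor $\hat{\alpha}_{k}\hat{q}_{k}\cos^{2}(\hat{\theta}_{k})/\hat{m}_{k}$ equals $(f(x_{k})-f(x_{k+1}))/(f(x_{k})-f(x_{*}))$, so it does not depend on $P$; the choice of $P$ only changes how that factor splits into the individual quantities.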

Remark 1.

We note that similar results relating $f(x_{k})-f(x_{k+1})$ to $\cos^{2}(\hat{\theta}_{k})$ have appeared in prior work such as [byrd1987global, Lemma 4.2] and [QN_tool], though they are used in the analysis of QN methods with inexact line search. Compared with these prior results, Proposition 1 is more general in the sense that we consider the weighted iterates using a general weight matrix $P$. This flexibility enables us to obtain tighter bounds and, more importantly, to obtain a global superlinear convergence rate under the same framework (see Section 5). Another subtle yet important difference is that previous works typically upper bound the term $\hat{m}_{k}$ by $L$ prematurely, leading to a worse dependence on the condition number $\kappa$. Instead, we keep $\hat{m}_{k}$ in (14) as is and lower bound the term $\cos^{2}(\hat{\theta}_{k})/\hat{m}_{k}$ as a whole, as later shown in Proposition 2.

Proposition 1 shows that BFGS's convergence rate hinges on four quantities: $\hat{\alpha}_{k}$, $\hat{q}_{k}$, $\hat{m}_{k}$, and $\cos(\hat{\theta}_{k})$. Note that $\hat{\alpha}_{k}$ and $\hat{q}_{k}$ can be bounded using Assumptions 1-3, independent of the QN update, with details deferred to Section 3.1. The focus here is to establish a lower bound for $\cos^{2}(\hat{\theta}_{k})/\hat{m}_{k}$. This involves analyzing the dynamics of the Hessian approximation matrices $\{B_{k}\}_{k\geq 0}$ through their trace and determinant, leveraging the following potential function from [QN_tool] that integrates both:

$$\Psi(A):=\mathbf{Tr}(A)-\log\mathbf{Det}(A)-d. \tag{18}$$

Given (6), $\Psi(A)$ can be regarded as the Bregman divergence generated by $\Phi(A)=-\log\det(A)$ between the matrix $A$ and the identity matrix $I$. In particular, $\Psi(A)\geq 0$, and $\Psi(A)=0$ if and only if $A=I$. Now we are ready to state Proposition 2, which is a classical result in the QN literature (e.g., see [nocedal2006numerical, Section 6.4]). For completeness, we provide its proof in Appendix A.
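The potential function in (18) is straightforward to evaluate. The sketch below, assuming NumPy, is an illustration rather than code from the paper: the function names and the values $d=5$, $\kappa=100$ are hypothetical. It also uses the eigenvalue form $\Psi(A)=\sum_{i}(\lambda_{i}-1-\log\lambda_{i})$, which follows from (18) for symmetric positive definite $A$ and makes the non-negativity visible term by term. The scaled-identity example corresponds to $\hat{B}_{0}=I/\kappa$, which arises from $B_{0}=\mu I$ and $P=LI$ under the assumption $\kappa=L/\mu$.

```python
import numpy as np

def psi(A):
    """Potential (18): Tr(A) - log Det(A) - d for a symmetric positive definite A."""
    sign, logdet = np.linalg.slogdet(A)
    if sign <= 0:
        raise ValueError("psi expects a positive definite matrix")
    return np.trace(A) - logdet - A.shape[0]

def psi_from_eigenvalues(A):
    """Same potential written as a sum of lam - 1 - log(lam) over the eigenvalues of A."""
    lam = np.linalg.eigvalsh(A)
    return float(np.sum(lam - 1.0 - np.log(lam)))

d, kappa = 5, 100.0            # hypothetical dimension and condition number
I = np.eye(d)

psi_identity = psi(I)          # weighted matrix when B_0 = L I and P = L I
psi_scaled = psi(I / kappa)    # weighted matrix when B_0 = mu I and P = L I
closed_form = d * (1.0 / kappa - 1.0 + np.log(kappa))

# A random symmetric positive definite matrix, to compare the two formulas.
rng = np.random.default_rng(0)
M = rng.standard_normal((d, d))
A = M @ M.T + 0.1 * I
psi_random = psi(A)

print(psi_identity, psi_scaled, closed_form, psi_random)
```

The potential vanishes at the identity and grows like $d\log\kappa$ for the scaled identity $I/\kappa$.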

Proposition 2.

Given a weight matrix $P\in\mathbb{S}^{d}_{++}$, recall the weighted vectors and matrices defined in (10) and (11). Let $\{\hat{B}_{k}\}_{k\geq 0}$ be the weighted Hessian approximation matrices generated by the BFGS update in (12). Then we have

$$\Psi(\hat{B}_{k+1})\leq\Psi(\hat{B}_{k})+\frac{\|\hat{y}_{k}\|^{2}}{\hat{s}_{k}^{\top}\hat{y}_{k}}-1+\log\frac{\cos^{2}(\hat{\theta}_{k})}{\hat{m}_{k}},\qquad\forall k\geq 0, \tag{19}$$

where $\hat{m}_{k}$ and $\hat{\theta}_{k}$ are defined in (15) and (13), respectively. As a corollary, we have for any $k\geq 1$,

$$\sum_{i=0}^{k-1}\log\frac{\cos^{2}(\hat{\theta}_{i})}{\hat{m}_{i}}\geq-\Psi(\hat{B}_{0})+\sum_{i=0}^{k-1}\left(1-\frac{\|\hat{y}_{i}\|^{2}}{\hat{s}_{i}^{\top}\hat{y}_{i}}\right). \tag{20}$$

Exponentiating both sides of (20), Proposition 2 provides a lower bound for the product $\prod_{i=0}^{k-1}\frac{\cos^{2}(\hat{\theta}_{i})}{\hat{m}_{i}}$ in terms of the sum $\sum_{i=0}^{k-1}\frac{\|\hat{y}_{i}\|^{2}}{\hat{s}_{i}^{\top}\hat{y}_{i}}$ and $\Psi(\hat{B}_{0})$. We will use Assumptions 1-3 to bound the term $\frac{\|\hat{y}_{k}\|^{2}}{\hat{s}_{k}^{\top}\hat{y}_{k}}$ for any $k\geq 0$, as shown in Lemma 5 of Section 3.1. Moreover, the second term $\Psi(\hat{B}_{0})$ depends on our choice of the initial Hessian approximation matrix $B_{0}$. Specifically, we will consider two different initializations: (i) $B_{0}=LI$; (ii) $B_{0}=\mu I$. As we shall discuss in the upcoming sections, these two choices result in different bounds and thus lead to a trade-off between the initial linear convergence rate and the final superlinear convergence rate.
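The one-step inequality (19) and its telescoped form (20) can also be checked numerically. The following sketch is an illustration under stated assumptions, not the authors' experiment: it reuses a random strongly convex quadratic with closed-form exact line search, takes $P=LI$ so that $\hat{B}_{k}=B_{k}/L$, and starts from initialization (ii), $B_{0}=\mu I$, with $\mu$ and $L$ set to the extreme Hessian eigenvalues. All names are hypothetical.

```python
import numpy as np

def psi(A):
    """Potential (18) for a symmetric positive definite matrix A."""
    return np.trace(A) - np.linalg.slogdet(A)[1] - A.shape[0]

rng = np.random.default_rng(2)
d, iters = 10, 6

# Assumed test problem: quadratic with SPD Hessian A, so grad f(x) = A x - b.
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)
b = rng.standard_normal(d)
grad = lambda x: A @ x - b
eigs = np.linalg.eigvalsh(A)
mu, L = eigs[0], eigs[-1]

x, B = rng.standard_normal(d), mu * np.eye(d)   # initialization (ii): B_0 = mu I
psi0 = psi(B / L)                               # Psi of the weighted B_0 for P = L I
log_terms, y_terms, min_slack = [], [], np.inf
for _ in range(iters):
    g = grad(x)
    direction = -np.linalg.solve(B, g)
    eta = -(g @ direction) / (direction @ A @ direction)   # exact line search step
    s = eta * direction
    y = grad(x + s) - g

    # Weighted vectors (10) with P = L I.
    g_hat, y_hat, s_hat = g / np.sqrt(L), y / np.sqrt(L), np.sqrt(L) * s
    cos2 = (g_hat @ s_hat) ** 2 / ((g_hat @ g_hat) * (s_hat @ s_hat))
    m = (y_hat @ s_hat) / (s_hat @ s_hat)
    y_term = (y_hat @ y_hat) / (s_hat @ y_hat)

    Bs = B @ s
    B_next = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

    # Right-hand side of (19) minus its left-hand side; should be non-negative.
    slack = psi(B / L) + y_term - 1.0 + np.log(cos2 / m) - psi(B_next / L)
    min_slack = min(min_slack, slack)
    log_terms.append(np.log(cos2 / m))
    y_terms.append(y_term)
    x, B = x + s, B_next

lhs_20 = sum(log_terms)                               # left-hand side of (20)
rhs_20 = -psi0 + sum(1.0 - t for t in y_terms)        # right-hand side of (20)
print("smallest slack in (19):", min_slack)
print("(20):", lhs_20, ">=", rhs_20)
```

The check relies on $\hat{B}_{k}\hat{s}_{k}$ being parallel to $-\hat{g}_{k}$, which holds here because each step is taken along $-B_{k}^{-1}g_{k}$.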

Having outlined our key propositions, we combine Proposition 1 and Proposition 2 in Sections 4 and 5 to show that BFGS achieves global non-asymptotic linear and superlinear convergence rates, respectively. Our approach involves selecting an appropriate weight matrix $P$ and bounding the quantities in (16) to derive the overall convergence rate. Specifically, we set $P=LI$ for global linear convergence and $P=\nabla^{2}f(x_{*})$ for superlinear convergence. The following intermediate lemmas will be used to establish these convergence bounds.

3.1 Intermediate lemmas

Next, we provide some intermediate results that lower bound the quantities $\hat{\alpha}_{k}$ and $\hat{q}_{k}$ defined in (15) and the term $\frac{\|\hat{y}_{k}\|^{2}}{\hat{s}_{k}^{\top}\hat{y}_{k}}$ appearing in (19). To do so, we first define the average Hessian matrices $J_{k}$ and $G_{k}$ as

$$J_{k} := \int_{0}^{1}\nabla^{2}f(x_{k}+\tau(x_{k+1}-x_{k}))\,d\tau, \tag{21}$$
$$G_{k} := \int_{0}^{1}\nabla^{2}f(x_{k}+\tau(x_{*}-x_{k}))\,d\tau. \tag{22}$$

These two matrices play an important role in our analysis, since by the fundamental theorem of calculus, it holds that $y_{k}=J_{k}s_{k}$ and $g_{k}=G_{k}(x_{k}-x_{*})$ for any $k\geq 0$. We also define the weighted average Hessian matrix $\hat{J}_{k}=P^{-\frac{1}{2}}J_{k}P^{-\frac{1}{2}}$ for the given weight matrix $P\in\mathbb{S}^{d}_{++}$. Moreover, we define a quantity $C_{k}$ that depends on the function value at the iterate $x_{k}$:

$$C_{k} := \frac{M}{\mu^{\frac{3}{2}}}\sqrt{2(f(x_{k})-f(x_{*}))},\qquad \forall k\geq 0, \tag{23}$$

where $M$ is the Lipschitz constant of the Hessian in Assumption 3 and $\mu$ is the strong convexity parameter in Assumption 1. Given these definitions, in the following lemma, we characterize the relationship between different matrices that appear in our convergence analysis.
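The identities $y_{k}=J_{k}s_{k}$ and $g_{k}=G_{k}(x_{k}-x_{*})$ can be checked numerically. The sketch below is only an illustration under stated assumptions: the separable test function, the weights `A`, the two points, and all helper names are hypothetical and do not come from the paper. The function is chosen so that $x_{*}=0$, $f(x_{*})=0$, $\mu=\min_i a_i$, and the Hessian is Lipschitz with constant $M=4/(3\sqrt{3})$.

```python
import math

# Hypothetical test problem (an assumption, not from the paper):
# f(x) = sum_i (a_i / 2 * x_i^2 + log cosh(x_i)), minimized at x_* = 0 with f(x_*) = 0.
A = [1.0, 10.0]                       # curvature weights a_i
MU = min(A)                           # strong convexity parameter mu
M = 4.0 / (3.0 * math.sqrt(3.0))      # Hessian Lipschitz constant for this f

def f(x):
    return sum(0.5 * a * xi * xi + math.log(math.cosh(xi)) for a, xi in zip(A, x))

def grad(x):
    return [a * xi + math.tanh(xi) for a, xi in zip(A, x)]

def hess_diag(x):
    # The Hessian is diagonal with entries a_i + sech^2(x_i).
    return [a + 1.0 / math.cosh(xi) ** 2 for a, xi in zip(A, x)]

def avg_hess_diag(x, z, n=2000):
    # Midpoint-rule approximation of the integral in (21)/(22) along the segment from x to z.
    acc = [0.0] * len(x)
    for j in range(n):
        tau = (j + 0.5) / n
        point = [xi + tau * (zi - xi) for xi, zi in zip(x, z)]
        acc = [c + h / n for c, h in zip(acc, hess_diag(point))]
    return acc

def c_k(x):
    # C_k from (23); here f(x_*) = 0.
    return M / MU ** 1.5 * math.sqrt(2.0 * f(x))

x_k, x_next, x_star = [1.0, -0.5], [0.3, 0.2], [0.0, 0.0]
s_k = [b - a for a, b in zip(x_k, x_next)]
y_k = [b - a for a, b in zip(grad(x_k), grad(x_next))]
J_k = avg_hess_diag(x_k, x_next)
G_k = avg_hess_diag(x_k, x_star)

# Residuals of y_k = J_k s_k and g_k = G_k (x_k - x_*); both should be near zero.
res_J = max(abs(j * s - y) for j, s, y in zip(J_k, s_k, y_k))
res_G = max(abs(g * (a - b) - gr) for g, a, b, gr in zip(G_k, x_k, x_star, grad(x_k)))
print(res_J, res_G, c_k(x_k))
```

Because the Hessian of this test function is diagonal, the matrix-vector products reduce to entrywise products, which keeps the sketch free of linear algebra dependencies.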

Lemma 2.

Suppose Assumptions 1, 2, and 3 hold, and recall the definitions of the matrices $J_{k}$ in (21), $G_{k}$ in (22), and the quantity $C_{k}$ in (23). Then, the following statements hold:

  (a) For any $k\geq 0$, we have that

    $$\frac{1}{1+C_{k}}\nabla^{2}f(x_{*})\preceq J_{k}\preceq(1+C_{k})\nabla^{2}f(x_{*}). \tag{24}$$

  (b) For any $k\geq 0$, we have that

    $$\frac{1}{1+C_{k}}\nabla^{2}f(x_{*})\preceq G_{k}\preceq(1+C_{k})\nabla^{2}f(x_{*}). \tag{25}$$

  (c) For any $k\geq 0$ and any $\hat{\tau}\in[0,1]$, we have that

    $$\frac{1}{1+C_{k}}J_{k}\preceq\nabla^{2}f(x_{k}+\hat{\tau}(x_{k+1}-x_{k}))\preceq(1+C_{k})J_{k}. \tag{26}$$

  (d) For any $k\geq 0$ and any $\tilde{\tau}\in[0,1]$, we have that

    $$\frac{1}{1+C_{k}}G_{k}\preceq\nabla^{2}f(x_{k}+\tilde{\tau}(x_{*}-x_{k}))\preceq(1+C_{k})G_{k}. \tag{27}$$
Proof.

Please check Appendix B. ∎
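As a sanity check on Lemma 2, the four sandwich relations (24)-(27) can be tested on a small example. This is a sketch under assumptions, not the authors' code: the separable test function, the helper names `sandwiched` and `check_lemma2`, and the sample points are all hypothetical. The second point `x_next` is assumed to satisfy $f(x_{k+1})\leq f(x_{k})$, as an exact line search step would guarantee. Since every matrix here is diagonal, the Loewner order reduces to entrywise comparison.

```python
import math

# Hypothetical separable test problem (an assumption, not from the paper):
# f(x) = sum_i (a_i / 2 * x_i^2 + log cosh(x_i)), with x_* = 0 and f(x_*) = 0.
A = [1.0, 10.0]
MU = min(A)                           # strong convexity parameter mu
M = 4.0 / (3.0 * math.sqrt(3.0))      # Hessian Lipschitz constant for this f

def f(x):
    return sum(0.5 * a * xi * xi + math.log(math.cosh(xi)) for a, xi in zip(A, x))

def hess_diag(x):
    # Diagonal Hessian entries a_i + sech^2(x_i).
    return [a + 1.0 / math.cosh(xi) ** 2 for a, xi in zip(A, x)]

def avg_hess_diag(x, z):
    # Closed form of (21)/(22) for this f: the integral of sech^2 is tanh.
    out = []
    for a, xi, zi in zip(A, x, z):
        if abs(zi - xi) < 1e-12:
            out.append(a + 1.0 / math.cosh(xi) ** 2)
        else:
            out.append(a + (math.tanh(zi) - math.tanh(xi)) / (zi - xi))
    return out

def sandwiched(inner, outer, c):
    # For diagonal matrices, outer/(1+c) <= inner <= (1+c)*outer entrywise
    # is equivalent to the corresponding Loewner-order statement.
    return all(o / (1.0 + c) <= i <= (1.0 + c) * o for i, o in zip(inner, outer))

def check_lemma2(x_k, x_next, taus=(0.0, 0.25, 0.5, 0.75, 1.0)):
    # Assumes f(x_next) <= f(x_k), as after an exact line search step.
    zero = [0.0] * len(x_k)
    c = M / MU ** 1.5 * math.sqrt(2.0 * f(x_k))                # C_k from (23)
    h_star = hess_diag(zero)
    J = avg_hess_diag(x_k, x_next)
    G = avg_hess_diag(x_k, zero)
    ok = sandwiched(J, h_star, c) and sandwiched(G, h_star, c)  # (24), (25)
    for t in taus:
        pj = [a + t * (b - a) for a, b in zip(x_k, x_next)]
        pg = [a - t * a for a in x_k]
        ok = ok and sandwiched(hess_diag(pj), J, c)             # (26)
        ok = ok and sandwiched(hess_diag(pg), G, c)             # (27)
    return ok

print(check_lemma2([1.0, -0.5], [0.3, 0.2]))
```

The check only samples a few values of $\hat{\tau}$ and $\tilde{\tau}$, so it illustrates the statement rather than verifying it.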

After establishing Lemma 2, in the following three lemmas, we will provide bounds on the quantities $\hat{\alpha}_{k}$, $\hat{q}_{k}$, and $\frac{\|\hat{y}_{k}\|^{2}}{\hat{s}_{k}^{\top}\hat{y}_{k}}$, respectively. Notice that $\hat{\alpha}_{k}$ is independent of the choice of the weight matrix $P\in\mathbb{S}^{d}_{++}$, while $\hat{q}_{k}$ and $\frac{\|\hat{y}_{k}\|^{2}}{\hat{s}_{k}^{\top}\hat{y}_{k}}$ depend on the choice of the weight matrix $P$.

Lemma 3.

Let $\{x_{k}\}_{k\geq 0}$ be the iterates generated by the BFGS algorithm with exact line search, and recall that $\hat{\alpha}_{k}=\frac{f(x_{k})-f(x_{k+1})}{-\hat{g}_{k}^{\top}\hat{s}_{k}}$ in (15). Suppose Assumptions 1, 2, and 3 hold. Then, for any $k\geq 0$, we have

$$\hat{\alpha}_{k}\geq\max\left\{\frac{1}{1+\sqrt{\kappa}},\frac{1}{2(1+C_{k})}\right\}. \tag{28}$$
Proof.

We first prove the first bound in (28). By Assumptions 1 and 2, the function $f$ is $\mu$-strongly convex and its gradient is $L$-Lipschitz. Then for any $x,y\in\mathbb{R}^{d}$, it holds that

$$f(x)-f(y)-\nabla f(y)^{\top}(x-y)\geq\frac{\|\nabla f(x)-\nabla f(y)\|^{2}}{2(L-\mu)}+\frac{\mu L\|x-y\|^{2}}{2(L-\mu)}-\frac{\mu(\nabla f(y)-\nabla f(x))^{\top}(y-x)}{L-\mu}. \tag{29}$$

This is also known as the interpolation inequality; see, e.g., [Taylor_convex, Theorem 4]. By setting $x=x_{k}$, $y=x_{k+1}$ in (29) and recalling that $s_{k}=x_{k+1}-x_{k}$, $y_{k}=\nabla f(x_{k+1})-\nabla f(x_{k})$, and $g_{k+1}=\nabla f(x_{k+1})$, we obtain that

$$f(x_{k})-f(x_{k+1})+g_{k+1}^{\top}s_{k}\geq\frac{1}{2(L-\mu)}\|y_{k}\|^{2}+\frac{\mu L\|s_{k}\|^{2}}{2(L-\mu)}-\frac{\mu}{L-\mu}y_{k}^{\top}s_{k}.$$

Moreover, Lemma 1 shows that $g_{k+1}^{\top}s_{k}=0$ due to exact line search. Thus, we can simplify the above inequality as

$$\begin{aligned}
f(x_{k})-f(x_{k+1}) &\geq \frac{1}{2(L-\mu)}\|y_{k}\|^{2}+\frac{\mu L\|s_{k}\|^{2}}{2(L-\mu)}-\frac{\mu}{L-\mu}y_{k}^{\top}s_{k}\\
&\geq \frac{\sqrt{\mu L}}{L-\mu}\|y_{k}\|\|s_{k}\|-\frac{\mu}{L-\mu}y_{k}^{\top}s_{k}\geq\left(\frac{\sqrt{\mu L}}{L-\mu}-\frac{\mu}{L-\mu}\right)y_{k}^{\top}s_{k}=\frac{1}{1+\sqrt{\kappa}}s_{k}^{\top}y_{k},
\end{aligned} \tag{30}$$

where we used Young’s inequality in the second inequality and the fact that $s_{k}^{\top}y_{k}\leq\|s_{k}\|\|y_{k}\|$ due to the Cauchy-Schwarz inequality in the third inequality. Since $g_{k+1}^{\top}s_{k}=0$, the denominator of $\hat{\alpha}_{k}$ satisfies $-\hat{g}_{k}^{\top}\hat{s}_{k}=-g_{k}^{\top}s_{k}=s_{k}^{\top}y_{k}$. Hence, we conclude that $\hat{\alpha}_{k}=\frac{f(x_{k})-f(x_{k+1})}{s_{k}^{\top}y_{k}}\geq\frac{1}{1+\sqrt{\kappa}}$.
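For readers who want the algebra behind (30) spelled out, the two steps can be written as follows. This is a routine expansion rather than part of the original proof; it uses only $\kappa=L/\mu$ and $L>\mu$.

```latex
% Young's inequality ab <= a^2/2 + b^2/2 with a = \|y_k\| and b = \sqrt{\mu L}\,\|s_k\|:
\frac{\|y_k\|^{2} + \mu L\,\|s_k\|^{2}}{2(L-\mu)}
  \;\geq\; \frac{\sqrt{\mu L}}{L-\mu}\,\|y_k\|\,\|s_k\|.

% Simplifying the final constant with \kappa = L/\mu:
\frac{\sqrt{\mu L}-\mu}{L-\mu}
  = \frac{\sqrt{\mu}\,(\sqrt{L}-\sqrt{\mu})}{(\sqrt{L}-\sqrt{\mu})(\sqrt{L}+\sqrt{\mu})}
  = \frac{\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}
  = \frac{1}{1+\sqrt{\kappa}}.
```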

Now we proceed to establish the second lower bound on $\hat{\alpha}_{k}$. By Taylor’s theorem, there exists $\tau_{k}\in[0,1]$ such that

$$\begin{aligned}
f(x_{k}) &= f(x_{k+1})+g_{k+1}^{\top}(x_{k}-x_{k+1})+\frac{1}{2}(x_{k}-x_{k+1})^{\top}\nabla^{2}f(x_{k}+\tau_{k}(x_{k+1}-x_{k}))(x_{k}-x_{k+1})\\
&= f(x_{k+1})+\frac{1}{2}s_{k}^{\top}\nabla^{2}f(x_{k}+\tau_{k}(x_{k+1}-x_{k}))s_{k},
\end{aligned}$$

where we used $g_{k+1}^{\top}s_{k}=0$. Moreover, we have $s_{k}^{\top}\nabla^{2}f(x_{k}+\tau_{k}(x_{k+1}-x_{k}))s_{k}\geq\frac{1}{1+C_{k}}s_{k}^{\top}J_{k}s_{k}=\frac{1}{1+C_{k}}s_{k}^{\top}y_{k}$ based on (26) in Lemma 2. Hence,

$$f(x_{k})-f(x_{k+1})=\frac{1}{2}s_{k}^{\top}\nabla^{2}f(x_{k}+\tau_{k}(x_{k+1}-x_{k}))s_{k}\geq\frac{1}{2(1+C_{k})}s_{k}^{\top}y_{k}. \tag{31}$$

By combining the inequalities in (30) and (31), the main claim follows. ∎

Remark 2.

The bounds in Lemma 3 only require the exact line search scheme. Thus, these inequalities are valid not just for BFGS, but also for any iterative algorithm that adheres to the exact line search condition specified in (9).
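Since the bound (28) only needs exact line search, it can be probed with a method simpler than BFGS. The following sketch runs steepest descent with a bisection-based exact line search and records $\hat{\alpha}_{k}$ next to the lower bound in (28). Everything here is an illustrative assumption rather than the paper's experiment: the separable test function, the constants derived for it ($\mu=\min_i a_i$, $L=\max_i a_i+1$, $M=4/(3\sqrt{3})$), the starting point, and the helper names.

```python
import math

# Hypothetical separable test problem (an assumption, not from the paper):
# f(x) = sum_i (a_i / 2 * x_i^2 + log cosh(x_i)), with x_* = 0 and f(x_*) = 0.
A = [1.0, 10.0]
MU, L = min(A), max(A) + 1.0          # strong convexity and gradient Lipschitz constants
KAPPA = L / MU
M = 4.0 / (3.0 * math.sqrt(3.0))      # Hessian Lipschitz constant for this f

def f(x):
    return sum(0.5 * a * xi * xi + math.log(math.cosh(xi)) for a, xi in zip(A, x))

def grad(x):
    return [a * xi + math.tanh(xi) for a, xi in zip(A, x)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def exact_line_search(x, d, iters=200):
    # Bisection on phi'(t) = grad f(x + t d)^T d, which is increasing in t for convex f.
    def slope(t):
        return dot(grad([xi + t * di for xi, di in zip(x, d)]), d)
    lo, hi = 0.0, 1.0
    while slope(hi) < 0.0:            # expand until the minimizer is bracketed
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if slope(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def alpha_hat_trace(x0, steps=8):
    # Steepest descent with exact line search; returns pairs
    # (alpha_hat_k, lower bound from (28)) for each iteration.
    x, out = list(x0), []
    for _ in range(steps):
        g = grad(x)
        d = [-gi for gi in g]
        t = exact_line_search(x, d)
        x_new = [xi + t * di for xi, di in zip(x, d)]
        s = [t * di for di in d]
        y = [a - b for a, b in zip(grad(x_new), g)]
        c = M / MU ** 1.5 * math.sqrt(2.0 * f(x))              # C_k from (23)
        bound = max(1.0 / (1.0 + math.sqrt(KAPPA)), 1.0 / (2.0 * (1.0 + c)))
        out.append(((f(x) - f(x_new)) / dot(s, y), bound))
        x = x_new
    return out

for alpha_hat, bound in alpha_hat_trace([1.0, -0.5]):
    print(f"alpha_hat = {alpha_hat:.4f}, bound = {bound:.4f}")
```

Bisection is used here because the directional derivative is monotone along the search direction, so the step size can be located to machine precision without external solvers.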

Lemma 4.

Recall the definition $\hat{q}_{k}=\frac{\|\hat{g}_{k}\|^{2}}{f(x_{k})-f(x_{*})}$ in (15). Suppose Assumptions 1, 2, and 3 hold. Then we have the following results:

  (a) If we choose $P=LI$, then $\hat{q}_{k}\geq 2/\kappa$.

  (b) If we choose $P=\nabla^{2}f(x_{*})$, then $\hat{q}_{k}\geq 2/(1+C_{k})^{2}$.

Proof.

We first prove (a). When $P=LI$, we have $\hat{q}_{k}=\frac{\|g_{k}\|^{2}}{L(f(x_{k})-f(x_{*}))}$. Since $f$ is $\mu$-strongly convex by Assumption 1, it holds that $\|\nabla f(x_{k})\|^{2}\geq 2\mu(f(x_{k})-f(x_{*}))$ (see, e.g., [boyd04, (9.9)]). Hence, we conclude that $\hat{q}_{k}\geq 2\mu/L=2/\kappa$.

Next, we prove (b). When $P=\nabla^{2}f(x_{*})$, we have $\|\hat{g}_{k}\|^{2}=g_{k}^{\top}P^{-1}g_{k}=g_{k}^{\top}(\nabla^{2}f(x_{*}))^{-1}g_{k}$. By applying Taylor’s theorem with Lagrange remainder, there exists $\tilde{\tau}_{k}\in[0,1]$ such that

$$\begin{aligned}
f(x_{k}) &= f(x_{*})+\nabla f(x_{*})^{\top}(x_{k}-x_{*})+\frac{1}{2}(x_{k}-x_{*})^{\top}\nabla^{2}f(x_{k}+\tilde{\tau}_{k}(x_{*}-x_{k}))(x_{k}-x_{*})\\
&= f(x_{*})+\frac{1}{2}(x_{k}-x_{*})^{\top}\nabla^{2}f(x_{k}+\tilde{\tau}_{k}(x_{*}-x_{k}))(x_{k}-x_{*}),
\end{aligned} \tag{32}$$

where we used the fact that $\nabla f(x_{*})=0$ in the last equality. Moreover, by the fundamental theorem of calculus, we have

\[
\nabla f(x_{k})-\nabla f(x_{*})=\int_{0}^{1}\nabla^{2}f(x_{k}+\tau(x_{*}-x_{k}))(x_{k}-x_{*})\,d\tau=G_{k}(x_{k}-x_{*}),
\]

where we use the definition of $G_{k}$ in (22). Since $\nabla f(x_{*})=0$ and we denote $g_{k}=\nabla f(x_{k})$, this further implies that

\[
x_{k}-x_{*}=G_{k}^{-1}(\nabla f(x_{k})-\nabla f(x_{*}))=G_{k}^{-1}g_{k}.
\tag{33}
\]

Combining (32) and (33) leads to

\[
f(x_{k})-f(x_{*})=\frac{1}{2}g_{k}^{\top}G_{k}^{-1}\nabla^{2}f(x_{k}+\tilde{\tau}_{k}(x_{*}-x_{k}))G_{k}^{-1}g_{k}.
\tag{34}
\]

Based on (27) in Lemma 2, we have $\nabla^{2}f(x_{k}+\tilde{\tau}_{k}(x_{*}-x_{k}))\preceq(1+C_{k})G_{k}$, which implies that

\[
G_{k}^{-1}\nabla^{2}f(x_{k}+\tilde{\tau}_{k}(x_{*}-x_{k}))G_{k}^{-1}\preceq(1+C_{k})G_{k}^{-1}.
\tag{35}
\]

Moreover, it follows from (25) in Lemma 2 that $\frac{1}{1+C_{k}}\nabla^{2}f(x_{*})\preceq G_{k}$, which implies that

\[
G_{k}^{-1}\preceq(1+C_{k})(\nabla^{2}f(x_{*}))^{-1}.
\tag{36}
\]
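The two implications leading to (35) and (36) rest on standard properties of the Loewner order, spelled out here as a short sketch. The shorthand $H_{k}$ for the Hessian at the intermediate point is introduced only for this remark and does not appear in the original; both steps use that $G_{k}$ is symmetric positive definite, which we take from strong convexity.

```latex
% Shorthand (introduced for this remark only):
%   H_k := \nabla^2 f(x_k + \tilde{\tau}_k (x_* - x_k)).
% Fact 1 (congruence): if A \preceq B and S is symmetric, then S A S \preceq S B S.
% Fact 2 (inversion):  if 0 \prec A \preceq B, then B^{-1} \preceq A^{-1}.
\begin{align*}
H_k \preceq (1+C_k)\,G_k
  &\;\Longrightarrow\;
  G_k^{-1} H_k G_k^{-1} \preceq (1+C_k)\,G_k^{-1} G_k G_k^{-1} = (1+C_k)\,G_k^{-1}
  && \text{(Fact 1 with } S = G_k^{-1}\text{)},\\
\tfrac{1}{1+C_k}\nabla^2 f(x_*) \preceq G_k
  &\;\Longrightarrow\;
  G_k^{-1} \preceq \Bigl(\tfrac{1}{1+C_k}\nabla^2 f(x_*)\Bigr)^{-1} = (1+C_k)\,(\nabla^2 f(x_*))^{-1}
  && \text{(Fact 2)}.
\end{align*}
```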

Combining (35) and (36), we obtain that

\[
G_{k}^{-1}\nabla^{2}f(x_{k}+\tilde{\tau}_{k}(x_{*}-x_{k}))G_{k}^{-1}\preceq(1+C_{k})^{2}(\nabla^{2}f(x_{*}))^{-1},
\]

and hence

\[
g_{k}^{\top}G_{k}^{-1}\nabla^{2}f(x_{k}+\tilde{\tau}_{k}(x_{*}-x_{k}))G_{k}^{-1}g_{k}\leq(1+C_{k})^{2}g_{k}^{\top}(\nabla^{2}f(x_{*}))^{-1}g_{k}.
\]

By using (34) and the fact that $\|\hat{g}_{k}\|^{2}=g_{k}^{\top}(\nabla^{2}f(x_{*}))^{-1}g_{k}$, we obtain

\[
\hat{q}_{k}=\frac{\|\hat{g}_{k}\|^{2}}{f(x_{k})-f(x_{*})}\geq\frac{2}{(1+C_{k})^{2}},
\]

and the claim follows. ∎
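As a sanity check on part (b), consider a quadratic objective $f(x)=\frac{1}{2}x^{\top}Ax$ with $x_{*}=0$ and $P=A$. Its Hessian is constant, so, assuming $C_{k}$ vanishes when the Hessian Lipschitz constant is zero (the definition of $C_{k}$ lies outside this section, so this is an assumption), the bound reads $\hat{q}_{k}\geq 2$. A direct calculation gives $\hat{q}_{k}=2$ exactly, so the bound would be tight in this case. The sketch below checks this in plain Python; the matrix `A`, the test point, and all function names are illustrative choices, not taken from the paper.

```python
def mat_vec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))


def inv_2x2(A):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def q_hat_quadratic(A, x):
    """q_hat = g^T A^{-1} g / (f(x) - f(x_*)) for f(x) = 0.5 x^T A x.

    Here x_* = 0, f(x_*) = 0, and the weight matrix is P = A (the Hessian).
    """
    g = mat_vec(A, x)                       # gradient A x
    f_gap = 0.5 * dot(x, g)                 # f(x) - f(x_*)
    g_hat_sq = dot(g, mat_vec(inv_2x2(A), g))  # weighted gradient norm squared
    return g_hat_sq / f_gap


A = [[4.0, 1.0], [1.0, 3.0]]  # hypothetical symmetric positive definite Hessian
q = q_hat_quadratic(A, [1.0, -2.0])
print(q)  # expected to be 2 up to floating-point rounding
```

For a non-quadratic objective the ratio can drop below 2, which is what the factor $(1+C_{k})^{2}$ in the lemma accounts for.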

Lemma 5.

Suppose Assumptions 1, 2, and 3 hold. Then we have

\[
\frac{\|\hat{y}_{k}\|^{2}}{\hat{s}_{k}^{\top}\hat{y}_{k}}\leq\|\hat{J}_{k}\|,\qquad\forall k\geq 0.
\]

As a corollary, we have the following results:

  1. (a)

    If we choose $P=LI$, then $\frac{\|\hat{y}_{k}\|^{2}}{\hat{s}_{k}^{\top}\hat{y}_{k}}\leq 1$.

  2. (b)

    If we choose $P=\nabla^{2}f(x_{*})$, then $\frac{\|\hat{y}_{k}\|^{2}}{\hat{s}_{k}^{\top}\hat{y}_{k}}\leq 1+C_{k}$.

Proof.

Note that by the fundamental theorem of calculus, we have $y_{k}=J_{k}s_{k}$, which implies that $\hat{y}_{k}=\hat{J}_{k}\hat{s}_{k}$. Hence, we can bound

\[
\frac{\|\hat{y}_{k}\|^{2}}{\hat{s}_{k}^{\top}\hat{y}_{k}}=\frac{\hat{s}_{k}^{\top}\hat{J}_{k}\hat{J}_{k}\hat{s}_{k}}{\hat{s}_{k}^{\top}\hat{J}_{k}\hat{s}_{k}}=\frac{\hat{s}_{k}^{\top}\hat{J}_{k}^{\frac{1}{2}}\hat{J}_{k}\hat{J}_{k}^{\frac{1}{2}}\hat{s}_{k}}{\|\hat{J}_{k}^{\frac{1}{2}}\hat{s}_{k}\|^{2}}\leq\|\hat{J}_{k}\|.
\]

Hence, if $P=LI$, then $\|\hat{J}_{k}\|=\frac{1}{L}\|J_{k}\|\leq 1$ by Assumption 2, which proves the result in (a). Moreover, if $P=\nabla^{2}f(x_{*})$, then

\[
\|\hat{J}_{k}\|=\|(\nabla^{2}f(x_{*}))^{-\frac{1}{2}}J_{k}(\nabla^{2}f(x_{*}))^{-\frac{1}{2}}\|\leq 1+C_{k},
\]

by (24) in Lemma 2, which proves the result in (b). ∎
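The key step in this proof is a Rayleigh-quotient bound: for a symmetric positive definite $\hat{J}_{k}$ and $\hat{y}_{k}=\hat{J}_{k}\hat{s}_{k}$, the ratio $\|\hat{y}_{k}\|^{2}/(\hat{s}_{k}^{\top}\hat{y}_{k})$ never exceeds the largest eigenvalue of $\hat{J}_{k}$, which equals its spectral norm. The following sketch checks this numerically in plain Python. The 2x2 matrix `J_hat`, the random directions, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def mat_vec(J, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in J]


def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))


def lambda_max_sym_2x2(J):
    """Largest eigenvalue of a symmetric 2x2 matrix (closed form)."""
    a, b, c = J[0][0], J[0][1], J[1][1]
    return 0.5 * (a + c) + math.sqrt((0.5 * (a - c)) ** 2 + b * b)


def secant_ratio(J, s):
    """Return ||y||^2 / (s^T y) with y = J s."""
    y = mat_vec(J, s)
    return dot(y, y) / dot(s, y)


J_hat = [[0.9, 0.2], [0.2, 0.4]]  # hypothetical symmetric positive definite matrix
rng = random.Random(0)            # fixed seed so the run is reproducible

# Largest ratio observed over many random directions s.
worst = max(
    secant_ratio(J_hat, [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)])
    for _ in range(1000)
)
print(worst <= lambda_max_sym_2x2(J_hat) + 1e-12)
```

The ratio is invariant to rescaling $\hat{s}_{k}$, so only the direction matters, and the bound is attained when $\hat{s}_{k}$ is an eigenvector for the largest eigenvalue.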

4 Global linear convergence rates

In this section, we establish explicit global linear convergence rates for the BFGS method with an exact line search step size, marking one of the first non-asymptotic global linear convergence analyses of BFGS with a line search scheme. The subsequent global superlinear convergence analyses build on these linear rates.

Specifically, we combine the fundamental inequality (16) from Proposition 1 with lower bounds on the terms $\hat{\alpha}_{k}$, $\hat{q}_{k}$, and $\cos^{2}(\hat{\theta}_{k})/\hat{m}_{k}$ from Lemmas 3, 4, and 5 and Proposition 2 to prove all the global linear convergence rates. In this section, we set the weight matrix to $P=LI$ and define the weighted matrix $\bar{B}_{k}$ as:

\[
\bar{B}_{k}=\frac{1}{L}B_{k}.
\tag{37}
\]

In the following lemma, we prove the first global linear convergence rate of the BFGS method for any choice of $B_{0}\in\mathbb{S}^{d}_{++}$.

Lemma 6.

Let $\{x_{k}\}_{k\geq 0}$ be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1 and 2 hold. For any initial point $x_{0}\in\mathbb{R}^{d}$ and any initial Hessian approximation matrix $B_{0}\in\mathbb{S}^{d}_{++}$, we have the following global linear convergence rate for any $k\geq 1$,

\[
\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-e^{-\frac{\Psi(\bar{B}_{0})}{k}}\frac{2}{\kappa(1+\sqrt{\kappa})}\right)^{k}.
\tag{38}
\]
Proof.

Our starting point is applying Proposition 1 with the weight matrix $P$ chosen as $P=LI$. Specifically, (16) shows that to obtain a convergence rate, it suffices to prove a lower bound on $\prod_{i=0}^{k-1}\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})$. It follows from Lemma 3 that $\hat{\alpha}_{k}=\frac{f(x_{k})-f(x_{k+1})}{s_{k}^{\top}y_{k}}\geq\frac{1}{\sqrt{\kappa}+1}$ for any $k\geq 0$.
Moreover, by applying Lemma 4 with $P=LI$, we obtain that $\hat{q}_{k}=\frac{\|\hat{g}_{k}\|^{2}}{f(x_{k})-f(x_{*})}\geq\frac{2}{\kappa}$ for any $k\geq 0$. Furthermore, applying Proposition 2 with $P=LI$, it follows from (20) that

\[
\sum_{i=0}^{k-1}\log\frac{\cos^{2}(\hat{\theta}_{i})}{\hat{m}_{i}}\geq-\Psi(\bar{B}_{0})+\sum_{i=0}^{k-1}\left(1-\frac{\|\hat{y}_{i}\|^{2}}{\hat{s}_{i}^{\top}\hat{y}_{i}}\right)\geq-\Psi(\bar{B}_{0}),
\]

where in the last inequality we used $\frac{\|\hat{y}_{i}\|^{2}}{\hat{s}_{i}^{\top}\hat{y}_{i}}\leq 1$ by Lemma 5 with $P=LI$. This further implies that

\[
\prod_{i=0}^{k-1}\frac{\cos^{2}(\hat{\theta}_{i})}{\hat{m}_{i}}\geq e^{-\Psi(\bar{B}_{0})}.
\tag{39}
\]

Combining all the pieces above, we get

\[
\prod_{i=0}^{k-1}\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})\geq\prod_{i=0}^{k-1}(\hat{\alpha}_{i}\hat{q}_{i})\prod_{i=0}^{k-1}\frac{\cos^{2}(\hat{\theta}_{i})}{\hat{m}_{i}}\geq\left(\frac{2}{\kappa(\sqrt{\kappa}+1)}\right)^{k}e^{-\Psi(\bar{B}_{0})}.
\]

Thus, it follows from Proposition 1 that

\[
\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left[1-\left(\prod_{i=0}^{k-1}\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})\right)^{\frac{1}{k}}\right]^{k}\leq\left(1-e^{-\frac{\Psi(\bar{B}_{0})}{k}}\frac{2}{\kappa(1+\sqrt{\kappa})}\right)^{k}.
\]

This completes the proof. ∎
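To get a feel for how the initial matrix $B_{0}$ enters the bound (38), the sketch below evaluates its right-hand side for two scaled-identity choices of $B_{0}$. It assumes the potential function $\Psi(A)=\mathrm{Tr}(A)-d-\log\det(A)$, which is defined outside this section, so treat that formula as an assumption here. The constants `L`, `mu`, `d` and all function names are hypothetical.

```python
import math


def psi_scaled_identity(c, L, d):
    """Psi(B0 / L) for B0 = c * I in dimension d.

    ASSUMPTION: Psi(A) = tr(A) - d - log det(A); for A = (c / L) * I this
    reduces to d * (r - 1 - log r) with r = c / L.
    """
    r = c / L
    return d * (r - 1.0 - math.log(r))


def linear_rate_bound(k, psi0, kappa):
    """Right-hand side of (38) for iteration count k >= 1."""
    contraction = math.exp(-psi0 / k) * 2.0 / (kappa * (1.0 + math.sqrt(kappa)))
    return (1.0 - contraction) ** k


L, mu, d = 100.0, 1.0, 10  # hypothetical smoothness, strong convexity, dimension
kappa = L / mu             # condition number

for name, c in (("B0 = L*I", L), ("B0 = mu*I", mu)):
    psi0 = psi_scaled_identity(c, L, d)
    bounds = [linear_rate_bound(k, psi0, kappa) for k in (1, 10, 100, 1000)]
    print(name, psi0, bounds)
```

Under the assumed formula, $B_{0}=LI$ gives $\Psi(\bar{B}_{0})=0$, so the factor $e^{-\Psi(\bar{B}_{0})/k}$ equals one from the first iteration; for other choices the factor approaches one only as $k$ grows relative to $\Psi(\bar{B}_{0})$.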

Notice that this result holds without the Hessian Lipschitz continuity assumption. In the next lemma, we present another version of the global linear convergence analysis under the additional assumption that the Hessian of $f$ is $M$-Lipschitz. We show that the BFGS method with exact line search eventually reaches a global linear convergence rate of $\mathcal{O}((1-1/\kappa)^{k})$, which matches the rate of gradient descent.

Lemma 7.

Let $\{x_{k}\}_{k\geq 0}$ be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point $x_{0}\in\mathbb{R}^{d}$ and any initial Hessian approximation matrix $B_{0}\in\mathbb{S}^{d}_{++}$, we have the following global linear convergence rate for any $k\geq 1$,

$$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-e^{-\frac{\Psi(\bar{B}_{0})}{k}}\frac{1}{\kappa}\frac{1}{1+C_{0}}\right)^{k}. \tag{40}$$

Moreover, when $k\geq(1+C_{0})\Psi(\bar{B}_{0})+3C_{0}\kappa\min\{2(1+C_{0}),(1+\sqrt{\kappa})\}$, we have

$$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-\frac{1}{3\kappa}\right)^{k}. \tag{41}$$
Proof.

We follow a similar argument as in the proof of Lemma 6 but with a different lower bound for $\hat{\alpha}_{k}$. Specifically, by Lemma 3, we also have $\hat{\alpha}_{k}=\frac{f(x_{k})-f(x_{k+1})}{s_{k}^{\top}y_{k}}\geq\frac{1}{2(1+C_{k})}$. Combining this with $\hat{q}_{k}\geq 2/\kappa$ and (39) leads to

$$\prod_{i=0}^{k-1}\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})\geq\prod_{i=0}^{k-1}(\hat{\alpha}_{i}\hat{q}_{i})\prod_{i=0}^{k-1}\frac{\cos^{2}(\hat{\theta}_{i})}{\hat{m}_{i}}\geq\left(\frac{1}{\kappa}\right)^{k}e^{-\Psi(\bar{B}_{0})}\prod_{i=0}^{k-1}\frac{1}{1+C_{i}}. \tag{42}$$

To begin with, recall the definition $C_{i}=\frac{M}{\mu^{\frac{3}{2}}}\sqrt{2(f(x_{i})-f(x_{*}))}$. Since the objective function value is non-increasing along the iterates by Lemma 1, it holds that $C_{i}\leq C_{0}$ for any $i\geq 0$. Thus, from (42) we have

$$\prod_{i=0}^{k-1}\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})\geq\left(\frac{1}{\kappa}\right)^{k}e^{-\Psi(\bar{B}_{0})}\left(\frac{1}{1+C_{0}}\right)^{k}.$$

Thus, by using Proposition 1 we obtain (40).

To prove the second claim in (41), we use the fact that $1+x\leq e^{x}$ for any $x\in\mathbb{R}$ to get

$$\prod_{i=0}^{k-1}\frac{1}{1+C_{i}}\geq\prod_{i=0}^{k-1}e^{-C_{i}}=e^{-\sum_{i=0}^{k-1}C_{i}}. \tag{43}$$

Combining (42) and (43) leads to

$$\prod_{i=0}^{k-1}\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})\geq\left(\frac{1}{\kappa}\right)^{k}e^{-\Psi(\bar{B}_{0})-\sum_{i=0}^{k-1}C_{i}}. \tag{44}$$

Next, we prove an upper bound on $\sum_{i=0}^{k-1}C_{i}$. First, we assume $k\geq\Psi(\bar{B}_{0})$. Then (38) in Lemma 6 and (40) together imply that

$$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-\frac{1}{3\kappa}\max\left\{\frac{2}{1+\sqrt{\kappa}},\frac{1}{1+C_{0}}\right\}\right)^{k},$$

where we used the fact that $e^{-\frac{\Psi(\bar{B}_{0})}{k}}\geq e^{-1}\geq\frac{1}{3}$. Moreover, we decompose the sum $\sum_{i=0}^{k-1}C_{i}$ into two parts by $\sum_{i=0}^{k-1}C_{i}=\sum_{i=0}^{\Psi(\bar{B}_{0})-1}C_{i}+\sum_{i=\Psi(\bar{B}_{0})}^{k-1}C_{i}$. For the first part, we have $\sum_{i=0}^{\Psi(\bar{B}_{0})-1}C_{i}\leq C_{0}\Psi(\bar{B}_{0})$. For the second part, by the definition of $C_{i}$, we have

$$\begin{aligned}
\sum_{i=\Psi(\bar{B}_{0})}^{k-1}C_{i} &= C_{0}\sum_{i=\Psi(\bar{B}_{0})}^{k-1}\sqrt{\frac{f(x_{i})-f(x_{*})}{f(x_{0})-f(x_{*})}} \\
&\leq C_{0}\sum_{i=\Psi(\bar{B}_{0})}^{k-1}\left(1-\frac{1}{3\kappa}\max\left\{\frac{2}{1+\sqrt{\kappa}},\frac{1}{1+C_{0}}\right\}\right)^{\frac{i}{2}} \\
&\leq \frac{C_{0}}{1-\sqrt{1-\frac{1}{3\kappa}\max\left\{\frac{1}{1+C_{0}},\frac{2}{1+\sqrt{\kappa}}\right\}}}\leq 3C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\},
\end{aligned}$$

where we used $\sqrt{1-x}\leq 1-\frac{1}{2}x$ for all $0\leq x\leq 1$ in the last inequality. Combining both inequalities, we arrive at

$$\sum_{i=0}^{k-1}C_{i}\leq C_{0}\Psi(\bar{B}_{0})+3C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}. \tag{45}$$

Thus, when the number of iterations $k$ is at least $(1+C_{0})\Psi(\bar{B}_{0})+3C_{0}\kappa\min\{2(1+C_{0}),(1+\sqrt{\kappa})\}$, the bound (45) gives $\Psi(\bar{B}_{0})+\sum_{i=0}^{k-1}C_{i}\leq k$, and by (44) we have

$$\left(\prod_{i=0}^{k-1}\frac{\hat{\alpha}_{i}\hat{q}_{i}}{\hat{m}_{i}}\cos^{2}(\hat{\theta}_{i})\right)^{\frac{1}{k}}\geq\frac{1}{\kappa}e^{-\frac{1}{k}\left(\Psi(\bar{B}_{0})+\sum_{i=0}^{k-1}C_{i}\right)}\geq\frac{1}{e\kappa}\geq\frac{1}{3\kappa}.$$

Together with Proposition 1, this proves the second claim in (41). ∎
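The last step of the chain leading to (45) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `geometric_tail_bound` and the grid of test values are hypothetical. It compares the geometric-series bound $C_{0}/(1-\sqrt{1-x})$, where $x=\frac{1}{3\kappa}\max\{\frac{2}{1+\sqrt{\kappa}},\frac{1}{1+C_{0}}\}$, against the simplified bound $3C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}$.

```python
import math


def geometric_tail_bound(kappa: float, c0: float) -> tuple[float, float]:
    """Return (geometric-series bound, simplified bound) for the tail sum of C_i.

    Hypothetical helper for illustration; kappa >= 1 and c0 >= 0 are assumed.
    """
    # x = (1/(3*kappa)) * max{2/(1+sqrt(kappa)), 1/(1+C0)}, so 0 < x <= 1/3.
    x = max(2.0 / (1.0 + math.sqrt(kappa)), 1.0 / (1.0 + c0)) / (3.0 * kappa)
    # C0 / (1 - sqrt(1-x)), rewritten as C0 * (1 + sqrt(1-x)) / x to avoid
    # cancellation when x is tiny.
    series = c0 * (1.0 + math.sqrt(1.0 - x)) / x
    # sqrt(1-x) <= 1 - x/2 gives the simplified bound 2*C0/x.
    simplified = 3.0 * c0 * kappa * min(2.0 * (1.0 + c0), 1.0 + math.sqrt(kappa))
    return series, simplified


# Spot-check the inequality on a small grid of hypothetical values.
for kappa in (1.0, 10.0, 100.0, 1e4):
    for c0 in (0.01, 1.0, 100.0):
        series, simplified = geometric_tail_bound(kappa, c0)
        assert series <= simplified * (1.0 + 1e-9)
```

The check only illustrates the algebra; it says nothing about how tight the bound is for an actual run of BFGS.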

We summarize all the global linear convergence results from the above two lemmas in the following theorem.

Theorem 1.

Let $\{x_{k}\}_{k\geq 0}$ be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point $x_{0}\in\mathbb{R}^{d}$ and any initial matrix $B_{0}\in\mathbb{S}_{++}^{d}$, we have the following global linear convergence rate for any $k\geq 1$,

$$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-e^{-\frac{\Psi(\bar{B}_{0})}{k}}\frac{1}{\kappa}\max\left\{\frac{2}{1+\sqrt{\kappa}},\frac{1}{1+C_{0}}\right\}\right)^{k}, \tag{46}$$

where $\bar{B}_{0}$ is defined in (37). When $k\geq\Psi(\bar{B}_{0})$, we have that

$$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-\frac{1}{3\kappa}\max\left\{\frac{2}{1+\sqrt{\kappa}},\frac{1}{1+C_{0}}\right\}\right)^{k}. \tag{47}$$

Moreover, when $k\geq(1+C_{0})\Psi(\bar{B}_{0})+3C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}$, we have

$$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-\frac{1}{3\kappa}\right)^{k}. \tag{48}$$

In Theorem 1, we present three distinct linear convergence rates during different phases of the BFGS algorithm with exact line search. Specifically, the linear rate in (46) is applicable from the first iteration, but the contraction factor depends on the quantity $e^{-\Psi(\bar{B}_{0})/k}$, which can be exponentially small and thus imply a slow convergence rate. However, this quantity will be bounded away from zero as the number of iterations $k$ increases, resulting in an improved linear rate. In particular, for $k\geq\Psi(\bar{B}_{0})$, the quantity $e^{-\Psi(\bar{B}_{0})/k}$ is bounded below by $1/3$, leading to the second improved linear convergence rate in (47). Furthermore, as shown in Lemma 7, after an additional $C_{0}\Psi(\bar{B}_{0})+3C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}$ iterations, we achieve the last linear convergence rate in (48), which is comparable to that of gradient descent.
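To make the three phases concrete, the following minimal sketch evaluates whichever of the bounds (46), (47) and (48) apply at a given iteration count. The function name `theorem1_bounds` and the numerical values of $\kappa$, $C_{0}$ and $\Psi(\bar{B}_{0})$ are hypothetical, chosen only for illustration.

```python
import math


def theorem1_bounds(k: int, kappa: float, c0: float, psi: float) -> dict[str, float]:
    """Upper bounds on (f(x_k) - f(x_*)) / (f(x_0) - f(x_*)) that apply at iteration k >= 1.

    Hypothetical helper: keys name the equation each bound comes from.
    """
    gain = max(2.0 / (1.0 + math.sqrt(kappa)), 1.0 / (1.0 + c0))
    # (46) holds for every k >= 1.
    bounds = {"(46)": (1.0 - math.exp(-psi / k) * gain / kappa) ** k}
    # (47) holds once k >= Psi(B0_bar).
    if k >= psi:
        bounds["(47)"] = (1.0 - gain / (3.0 * kappa)) ** k
    # (48) holds once k passes the second threshold of Theorem 1.
    k3 = (1.0 + c0) * psi + 3.0 * c0 * kappa * min(2.0 * (1.0 + c0), 1.0 + math.sqrt(kappa))
    if k >= k3:
        bounds["(48)"] = (1.0 - 1.0 / (3.0 * kappa)) ** k
    return bounds


# Hypothetical problem constants: kappa = 100, C0 = 1, Psi(B0_bar) = 50.
for k in (10, 50, 1300):
    print(k, theorem1_bounds(k, kappa=100.0, c0=1.0, psi=50.0))
```

With these constants the second threshold evaluates to $2\cdot 50+3\cdot 100\cdot\min\{4,11\}=1300$, so the three calls report one, two and three applicable bounds respectively. Since $e^{-\Psi(\bar{B}_{0})/k}\geq 1/3$ whenever $k\geq\Psi(\bar{B}_{0})$, the bound from (46) is never worse than the one from (47) in that regime.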

From the discussions above, we observe that the quantity $\Psi(\bar{B}_{0})$ (recall that $\bar{B}_{0}=\frac{1}{L}B_{0}$) plays a critical role in determining the transitions between different linear convergence phases, and a smaller $\Psi(\bar{B}_{0})$ implies fewer iterations required to reach each linear convergence phase. Thus, we consider two different initializations: $B_{0}=LI$ and $B_{0}=\mu I$. Specifically, note that in the first case where $B_{0}=LI$, we have $\Psi(\bar{B}_{0})=0$ and thus it achieves the best linear convergence results according to Theorem 1. The corresponding global linear rate is presented in Corollary 1.

Corollary 1.

Let $\{x_{k}\}_{k\geq 0}$ be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point $x_{0}\in\mathbb{R}^{d}$ and the initial Hessian approximation matrix $B_{0}=LI$, we have the following global linear convergence rate for any $k\geq 1$,

$$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-\frac{1}{\kappa}\max\left\{\frac{2}{1+\sqrt{\kappa}},\frac{1}{1+C_{0}}\right\}\right)^{k}. \tag{49}$$

Moreover, when $k\geq 3C_{0}\kappa\min\{2(1+C_{0}),(1+\sqrt{\kappa})\}$, we have

$$\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-\frac{1}{3\kappa}\right)^{k}. \tag{50}$$

In the second case where $B_{0}=\mu I$, we have $\Psi(\bar{B}_{0})=\Psi(\frac{\mu}{L}I)=d(\frac{1}{\kappa}-1+\log\kappa)\leq d\log\kappa$. The corresponding global linear rate is presented in Corollary 2.

Corollary 2.

Let $\{x_{k}\}_{k\geq 0}$ be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point $x_{0}\in\mathbb{R}^{d}$ and the initial Hessian approximation matrix $B_{0}=\mu I$, we have the following global convergence rate for any $k\geq 1$,

f(xk)f(x)f(x0)f(x)(1edlogκk1κmax{21+κ,11+C0})k.𝑓subscript𝑥𝑘𝑓subscript𝑥𝑓subscript𝑥0𝑓subscript𝑥superscript1superscript𝑒𝑑𝜅𝑘1𝜅21𝜅11subscript𝐶0𝑘\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-e^{-\frac{d\log{\kappa}% }{k}}\frac{1}{\kappa}\max\left\{\frac{2}{1+\sqrt{\kappa}},\frac{1}{1+C_{0}}% \right\}\right)^{k}.divide start_ARG italic_f ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) end_ARG start_ARG italic_f ( italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) - italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) end_ARG ≤ ( 1 - italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_d roman_log italic_κ end_ARG start_ARG italic_k end_ARG end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG italic_κ end_ARG roman_max { divide start_ARG 2 end_ARG start_ARG 1 + square-root start_ARG italic_κ end_ARG end_ARG , divide start_ARG 1 end_ARG start_ARG 1 + italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG } ) start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT . (51)

When kdlogκ𝑘𝑑𝜅k\geq d\log{\kappa}italic_k ≥ italic_d roman_log italic_κ, the following linear rate holds

f(xk)f(x)f(x0)f(x)(113κmax{21+κ,11+C0})k.𝑓subscript𝑥𝑘𝑓subscript𝑥𝑓subscript𝑥0𝑓subscript𝑥superscript113𝜅21𝜅11subscript𝐶0𝑘\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-\frac{1}{3\kappa}\max% \left\{\frac{2}{1+\sqrt{\kappa}},\frac{1}{1+C_{0}}\right\}\right)^{k}.divide start_ARG italic_f ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) end_ARG start_ARG italic_f ( italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) - italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) end_ARG ≤ ( 1 - divide start_ARG 1 end_ARG start_ARG 3 italic_κ end_ARG roman_max { divide start_ARG 2 end_ARG start_ARG 1 + square-root start_ARG italic_κ end_ARG end_ARG , divide start_ARG 1 end_ARG start_ARG 1 + italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG } ) start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT . (52)

Moreover, when k(1+C0)dlogκ+3C0κmin{2(1+C0),1+κ}𝑘1subscript𝐶0𝑑𝜅3subscript𝐶0𝜅21subscript𝐶01𝜅k\geq(1+C_{0})d\log{\kappa}+3C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}italic_k ≥ ( 1 + italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) italic_d roman_log italic_κ + 3 italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_κ roman_min { 2 ( 1 + italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) , 1 + square-root start_ARG italic_κ end_ARG }, we have

f(xk)f(x)f(x0)f(x)(113κ)k.𝑓subscript𝑥𝑘𝑓subscript𝑥𝑓subscript𝑥0𝑓subscript𝑥superscript113𝜅𝑘\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(1-\frac{1}{3\kappa}\right% )^{k}.divide start_ARG italic_f ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) end_ARG start_ARG italic_f ( italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) - italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) end_ARG ≤ ( 1 - divide start_ARG 1 end_ARG start_ARG 3 italic_κ end_ARG ) start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT . (53)

Comparing the results in Corollary 2 with those in Corollary 1, we observe that BFGS with $B_0=\mu I$ requires an additional $d\log\kappa$ iterations to achieve a linear rate similar to that of the first case. However, as we show in the next section, the choice $B_0=\mu I$ achieves a better superlinear convergence rate. This trade-off between the linear and superlinear convergence phases is a fundamental consequence of the choice of the initial Hessian approximation matrix in our convergence analysis.
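To make this comparison concrete, the following minimal Python sketch evaluates the two iteration thresholds after which the $(1-\frac{1}{3\kappa})^k$ rate is guaranteed: the condition preceding (50) for the first case and the condition preceding (53) for $B_0=\mu I$. The function name and the numerical values of $d$, $\kappa$, and $C_0$ are hypothetical and chosen only for illustration.

```python
import math

def linear_phase_thresholds(d, kappa, C0):
    """Iteration counts after which the (1 - 1/(3*kappa))^k rate is guaranteed."""
    # Term 3*C0*kappa*min{2(1+C0), 1+sqrt(kappa)} shared by both conditions.
    shared = 3.0 * C0 * kappa * min(2.0 * (1.0 + C0), 1.0 + math.sqrt(kappa))
    k_first = shared                                       # condition preceding (50)
    k_muI = (1.0 + C0) * d * math.log(kappa) + shared      # condition preceding (53)
    return k_first, k_muI

# Hypothetical problem constants, used only to illustrate the formulas.
k_first, k_muI = linear_phase_thresholds(d=100, kappa=1e3, C0=0.5)
print(f"first case : k >= {k_first:.1f}")
print(f"B0 = mu*I  : k >= {k_muI:.1f}")
print(f"gap        : {k_muI - k_first:.1f}")
```

With these formulas the gap between the two thresholds equals $(1+C_0)d\log\kappa$, which is of order $d\log\kappa$ whenever $C_0$ is a moderate constant.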

5 Global superlinear convergence rates

In this section, we establish the non-asymptotic global superlinear convergence rate of BFGS with exact line search, employing a similar approach to the global linear convergence rate analysis from the previous section. We utilize the framework from Proposition 1 and integrate the lower bounds from Lemmas 3, 4, 5, and Proposition 2. The key distinction lies in the choice of the weight matrix: instead of $P=LI$ used in the linear convergence analysis, we opt for $P=\nabla^2 f(x_*)$ for the global superlinear convergence proof.

We define the weighted matrix $\tilde{B}_k$ as:

\[
\tilde{B}_k=\nabla^2 f(x_*)^{-\frac{1}{2}}B_k\nabla^2 f(x_*)^{-\frac{1}{2}},\qquad \text{for}\ \ k\geq 0. \tag{54}
\]
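As a numerical illustration of (54), the sketch below forms the weighted matrix with NumPy and evaluates the potential function in the form $\Psi(A)=\mathbf{Tr}(A)-d-\log\mathbf{Det}(A)$, which is the expression expanded in (61). The helper names and the random test matrix standing in for $\nabla^2 f(x_*)$ are assumptions made for this example only.

```python
import numpy as np

def inv_sqrt_spd(H):
    """Return H^{-1/2} for a symmetric positive definite matrix H."""
    eigvals, eigvecs = np.linalg.eigh(H)
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T

def weighted_matrix(B, H_star):
    """B_tilde = H_star^{-1/2} B H_star^{-1/2}, following (54)."""
    R = inv_sqrt_spd(H_star)
    return R @ B @ R

def potential(A):
    """Psi(A) = Tr(A) - d - log Det(A) for a symmetric positive definite A."""
    d = A.shape[0]
    _, logdet = np.linalg.slogdet(A)
    return float(np.trace(A) - d - logdet)

# Hypothetical stand-in for the Hessian at the optimum: a random SPD matrix
# with eigenvalues spread over [mu, L].
d, mu, L = 5, 1.0, 10.0
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H_star = Q @ np.diag(np.linspace(mu, L, d)) @ Q.T
H_star = 0.5 * (H_star + H_star.T)  # remove round-off asymmetry

psi_exact = potential(weighted_matrix(H_star, H_star))         # B = H_star gives B_tilde = I
psi_scaled = potential(weighted_matrix(2.0 * H_star, H_star))  # B = 2*H_star gives B_tilde = 2I
print(psi_exact, psi_scaled, d * (1.0 - np.log(2.0)))
```

The potential vanishes when $B_k$ coincides with $\nabla^2 f(x_*)$ and grows as $\tilde{B}_k$ moves away from the identity, so it acts as a measure of the Hessian approximation error in the weighted geometry.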

In the following proposition, we first provide a general global convergence bound with an arbitrary initial Hessian approximation matrix $B_0\in\mathbb{S}^d_{++}$. All the global superlinear convergence rates are based on the following proposition.

Proposition 3.

Let $\{x_k\}_{k\geq 0}$ be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. Recall the definition of $C_k$ in (23) and $\Psi(\cdot)$ in (18). For any initial point $x_0\in\mathbb{R}^d$ and any initial Hessian approximation matrix $B_0\in\mathbb{S}^d_{++}$, the following result holds for any $k\geq 1$,

\[
\frac{f(x_k)-f(x_*)}{f(x_0)-f(x_*)} \leq \left(\frac{\Psi(\tilde{B}_0)+4\sum_{i=0}^{k-1}C_i}{k}\right)^{k}. \tag{55}
\]
Proof.

Recall that we choose the weight matrix as $P=\nabla^2 f(x_*)$ throughout the proof. From Lemma 3 and Lemma 4(b), we have $\hat{\alpha}_k\geq\frac{1}{2(1+C_k)}$ and $\hat{q}_k\geq\frac{2}{(1+C_k)^2}$. Hence, using the inequality $1+x\leq e^{x}$ for any $x\geq 0$, it follows that

\[
\prod_{i=0}^{k-1}(\hat{\alpha}_i\hat{q}_i) \geq \prod_{i=0}^{k-1}\frac{1}{(1+C_i)^3} \geq \prod_{i=0}^{k-1}e^{-3C_i} = e^{-3\sum_{i=0}^{k-1}C_i}. \tag{56}
\]

Moreover, by using the inequality (20) in Proposition 2 with $P=\nabla^2 f(x_*)$, we obtain that

\[
\sum_{i=0}^{k-1}\log\frac{\cos^2(\hat{\theta}_i)}{\hat{m}_i} \geq -\Psi(\tilde{B}_0)+\sum_{i=0}^{k-1}\left(1-\frac{\|\hat{y}_i\|^2}{\hat{s}_i^\top\hat{y}_i}\right) \geq -\Psi(\tilde{B}_0)-\sum_{i=0}^{k-1}C_i,
\]

where in the last inequality we used the fact that $\frac{\|\hat{y}_i\|^2}{\hat{s}_i^\top\hat{y}_i}\leq 1+C_i$ from Lemma 5(b). This further implies that

\[
\prod_{i=0}^{k-1}\frac{\cos^2(\hat{\theta}_i)}{\hat{m}_i} \geq e^{-\Psi(\tilde{B}_0)-\sum_{i=0}^{k-1}C_i}. \tag{57}
\]

Combining (56), (57), and (16) from Proposition 1, we prove that

\[
\begin{split}
\frac{f(x_k)-f(x_*)}{f(x_0)-f(x_*)} &\leq \left[1-\left(\prod_{i=0}^{k-1}\frac{\hat{\alpha}_i\hat{q}_i}{\hat{m}_i}\cos^2(\hat{\theta}_i)\right)^{\frac{1}{k}}\right]^{k}\\
&\leq \left[1-\left(e^{-3\sum_{i=0}^{k-1}C_i}e^{-\Psi(\tilde{B}_0)-\sum_{i=0}^{k-1}C_i}\right)^{\frac{1}{k}}\right]^{k}\\
&= \left(1-e^{-\frac{\Psi(\tilde{B}_0)+4\sum_{i=0}^{k-1}C_i}{k}}\right)^{k} \leq \left(\frac{\Psi(\tilde{B}_0)+4\sum_{i=0}^{k-1}C_i}{k}\right)^{k},
\end{split}
\]

where the last inequality is due to the fact that $1-e^{-x}\leq x$ for any $x$. ∎

The above global result shows that the error after $k$ iterations of BFGS with exact line search depends on the potential function of the weighted initial Hessian approximation matrix $\tilde{B}_0$, i.e., $\Psi(\tilde{B}_0)$, and on the sum of the weighted function suboptimalities, i.e., $\sum_{i=0}^{k-1}C_i$. This result forms the foundation of our superlinear result: if we can show that the sum $\sum_{i=0}^{k-1}C_i$ is bounded above uniformly in $k$, then (55) yields a superlinear rate of the form $\mathcal{O}((1/k)^{k})$.
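The shape of this rate can be seen by evaluating the right-hand side of (55) for a few values of $k$. The sketch below is only an illustration: the function name is ours, and the values used for $\Psi(\tilde{B}_0)$ and for the upper bound on $\sum_{i=0}^{k-1}C_i$ are hypothetical.

```python
def superlinear_bound(k, psi_tilde0, sum_C):
    """Right-hand side of (55): ((Psi(B_tilde_0) + 4 * sum_C) / k) ** k."""
    return ((psi_tilde0 + 4.0 * sum_C) / k) ** k

# Hypothetical values: psi_tilde0 stands for Psi(B_tilde_0) and sum_C_cap for
# a k-independent upper bound on sum_{i<k} C_i.
psi_tilde0, sum_C_cap = 20.0, 5.0
bounds = {k: superlinear_bound(k, psi_tilde0, sum_C_cap) for k in (10, 40, 80, 160)}
for k, b in bounds.items():
    print(f"k = {k:3d}: bound = {b:.3e}")
```

The bound is uninformative while $k$ is below the numerator (here $40$) and afterwards shrinks faster than any fixed geometric sequence, because the base of the power itself decays like $1/k$.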

Having established the non-asymptotic global linear convergence rate of BFGS in the previous section, we can leverage it to show that the sum $\sum_{i=0}^{k-1}C_i$ is uniformly bounded above, allowing us to establish an explicit upper bound for this finite sum. In the following theorem, we apply the linear convergence results from Section 4 to prove the non-asymptotic global superlinear convergence rates of BFGS with exact line search for any initial Hessian approximation matrix $B_0\in\mathbb{S}^d_{++}$.
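The mechanism behind this step is that a linearly decaying sequence has uniformly bounded partial sums. The sketch below illustrates this with a purely hypothetical sequence $c_i=c_0\rho^{i}$; it is not the quantity $C_i$ defined in (23), and the bound actually used in the analysis is the one stated in (59).

```python
def partial_sums(c0, rho, k_max):
    """Partial sums S_k = sum_{i<k} c0 * rho**i for k = 1, ..., k_max."""
    sums, total = [], 0.0
    for i in range(k_max):
        total += c0 * rho ** i
        sums.append(total)
    return sums

# Hypothetical decay parameters, chosen only for illustration.
c0, rho = 2.0, 0.9
sums = partial_sums(c0, rho, 200)
cap = c0 / (1.0 - rho)  # limit of the geometric series, independent of k
print(sums[9], sums[199], cap)
```

Every partial sum stays below $c_0/(1-\rho)$ regardless of $k$, which is the kind of $k$-independent bound that, once substituted into (55), produces the $(1/k)^{k}$ behavior.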

Theorem 2.

Let $\{x_k\}_{k\geq 0}$ be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point $x_0\in\mathbb{R}^d$ and any initial Hessian approximation matrix $B_0\in\mathbb{S}^d_{++}$, we have the following superlinear convergence rate,

\[
\frac{f(x_k)-f(x_*)}{f(x_0)-f(x_*)} \leq \left(\frac{\Psi(\tilde{B}_0)+4C_0\Psi(\bar{B}_0)+12C_0\kappa\min\{2(1+C_0),1+\sqrt{\kappa}\}}{k}\right)^{k}, \tag{58}
\]

where $\bar{B}_0$ and $\tilde{B}_0$ are defined in (37) and (54).

Proof.

From (45) in Lemma 7, we know that for $k\geq 1$,

\[
\sum_{i=0}^{k-1}C_i \leq C_0\Psi(\bar{B}_0)+3C_0\kappa\min\{2(1+C_0),1+\sqrt{\kappa}\}. \tag{59}
\]

Leveraging (59) and (55) in Proposition 3, we prove that for $k\geq 1$,

\[
\begin{split}
\frac{f(x_k)-f(x_*)}{f(x_0)-f(x_*)} &\leq \left(\frac{\Psi(\tilde{B}_0)+4\sum_{i=0}^{k-1}C_i}{k}\right)^{k}\\
&\leq \left(\frac{\Psi(\tilde{B}_0)+4C_0\Psi(\bar{B}_0)+12C_0\kappa\min\{2(1+C_0),1+\sqrt{\kappa}\}}{k}\right)^{k},
\end{split}
\]

and the proof is complete. ∎

This result indicates that BFGS with exact line search achieves a superlinear convergence rate when the number of iterations satisfies the condition $k\geq\Psi(\tilde{B}_0)+4C_0\Psi(\bar{B}_0)+12C_0\kappa\min\{2(1+C_0),1+\sqrt{\kappa}\}$. The initial matrix $B_0$ critically influences the required iterations to attain this rate, as it appears in the numerator of the upper bound through $\tilde{B}_0=\nabla^2 f(x_*)^{-\frac{1}{2}}B_0\nabla^2 f(x_*)^{-\frac{1}{2}}$ and $\bar{B}_0=(1/L)B_0$.
Thus, different choices of $B_0$ yield different values for $\Psi(\tilde{B}_0)+4C_0\Psi(\bar{B}_0)$, affecting the number of iterations required for superlinear convergence. Indeed, one can try to optimize the choice of $B_0$ to make the expression $\Psi(\tilde{B}_0)+4C_0\Psi(\bar{B}_0)$ as small as possible. However, here we only focus on two practical initial Hessian approximations: $B_0=LI$ and $B_0=\mu I$. Next, in the upcoming corollaries, we present the superlinear convergence results obtained from Theorem 2 when we use these two initial Hessian approximations.
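To see how the $B_0$-dependent part of the numerator in (58) varies with the initialization, the following sketch evaluates $\Psi(\tilde{B}_0)+4C_0\Psi(\bar{B}_0)$ for $B_0=LI$ and $B_0=\mu I$ on a synthetic problem. Everything specific here is an assumption made for illustration: the helper names, the value of $C_0$, and the matrix standing in for $\nabla^2 f(x_*)$, whose eigenvalues are geometrically spaced in $[\mu,L]$.

```python
import numpy as np

def potential(A):
    """Psi(A) = Tr(A) - d - log Det(A), the form expanded in (61)."""
    d = A.shape[0]
    _, logdet = np.linalg.slogdet(A)
    return float(np.trace(A) - d - logdet)

def numerator_term(B0, H_star, L, C0):
    """Psi(B_tilde_0) + 4*C0*Psi(B_bar_0), the B0-dependent part of (58)."""
    eigvals, eigvecs = np.linalg.eigh(H_star)
    R = (eigvecs / np.sqrt(eigvals)) @ eigvecs.T   # H_star^{-1/2}
    B_tilde0 = R @ B0 @ R                          # weighting as in (54)
    B_bar0 = B0 / L                                # B_bar_0 = (1/L) B0
    return potential(B_tilde0) + 4.0 * C0 * potential(B_bar0)

# Hypothetical test problem: SPD matrix with eigenvalues in [mu, L].
d, mu, L, C0 = 50, 1.0, 100.0, 0.5
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H_star = Q @ np.diag(np.geomspace(mu, L, d)) @ Q.T
H_star = 0.5 * (H_star + H_star.T)  # remove round-off asymmetry

term_LI = numerator_term(L * np.eye(d), H_star, L, C0)
term_muI = numerator_term(mu * np.eye(d), H_star, L, C0)
print(f"B0 = L*I : {term_LI:.1f}")
print(f"B0 = mu*I: {term_muI:.1f}")
```

For this particular spectrum the term is smaller with $B_0=\mu I$, but the comparison depends on the eigenvalue distribution and on $C_0$, so the numbers should be read as an example rather than a general ranking.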

Corollary 3.

Let $\{x_k\}_{k\geq 0}$ be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point $x_0\in\mathbb{R}^d$ and the initial Hessian approximation matrix $B_0=LI$, we have the following superlinear convergence rate,

\[
\frac{f(x_k)-f(x_*)}{f(x_0)-f(x_*)} \leq \left(\frac{d\kappa+12C_0\kappa\min\{2(1+C_0),(1+\sqrt{\kappa})\}}{k}\right)^{k}. \tag{60}
\]
Proof.

From Assumptions 1 and 2, we have $\frac{1}{L}I\preceq\nabla^2 f(x_*)^{-1}\preceq\frac{1}{\mu}I$. Since $B_0=LI$, we have

\[
\begin{split}
\Psi(\tilde{B}_0) &= \mathbf{Tr}(\tilde{B}_0)-d-\log\mathbf{Det}(\tilde{B}_0) = \mathbf{Tr}(L\nabla^2 f(x_*)^{-1})-d-\log\mathbf{Det}(L\nabla^2 f(x_*)^{-1})\\
&\leq \mathbf{Tr}(\kappa I)-d-\log\mathbf{Det}(I) = d\kappa-d \leq d\kappa.
\end{split} \tag{61}
\]

Leveraging (61), Ψ(B¯0)=Ψ(I)=0Ψsubscript¯𝐵0Ψ𝐼0\Psi(\bar{B}_{0})=\Psi(I)=0roman_Ψ ( over¯ start_ARG italic_B end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = roman_Ψ ( italic_I ) = 0 and (58) in Theorem 2, we prove that

$$
\begin{split}
\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}&\leq\left(\frac{\Psi(\tilde{B}_{0})+4C_{0}\Psi(\bar{B}_{0})+12C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}}{k}\right)^{k}\\
&\leq\left(\frac{d\kappa+12C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}}{k}\right)^{k}.
\end{split}
$$

Corollary 4.

Let $\{x_{k}\}_{k\geq 0}$ be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point $x_{0}\in\mathbb{R}^{d}$ and the initial Hessian approximation matrix $B_{0}=\mu I$, we have the following superlinear convergence rate,

$$
\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}\leq\left(\frac{(1+4C_{0})d\log\kappa+12C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}}{k}\right)^{k}.
\tag{62}
$$
Proof.

Since $B_{0}=\mu I$, from Assumptions 1 and 2, we have that

$$
\begin{split}
\Psi(\tilde{B}_{0})&=\mathbf{Tr}(\tilde{B}_{0})-d-\log\mathbf{Det}(\tilde{B}_{0})=\mathbf{Tr}(\mu\nabla^{2}f(x_{*})^{-1})-d-\log\mathbf{Det}(\mu\nabla^{2}f(x_{*})^{-1})\\
&\leq\mathbf{Tr}(I)-d-\log\mathbf{Det}\Bigl(\frac{1}{\kappa}I\Bigr)=d-d+d\log\kappa=d\log\kappa.
\end{split}
\tag{63}
$$

Leveraging (63), $\Psi(\bar{B}_{0})=\Psi(\frac{1}{\kappa}I)\leq d\log\kappa$, and (58) in Theorem 2, we prove

$$
\begin{aligned}
\frac{f(x_{k})-f(x_{*})}{f(x_{0})-f(x_{*})}&\leq\left(\frac{\Psi(\tilde{B}_{0})+4C_{0}\Psi(\bar{B}_{0})+12C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}}{k}\right)^{k}\\
&\leq\left(\frac{(1+4C_{0})d\log\kappa+12C_{0}\kappa\min\{2(1+C_{0}),1+\sqrt{\kappa}\}}{k}\right)^{k}.
\end{aligned}
$$

As shown in the proofs of Corollary 3 and Corollary 4, selecting $B_{0}=LI$ minimizes $\Psi(\bar{B}_{0})$, resulting in $\Psi(\bar{B}_{0})=0$. However, $\Psi(\tilde{B}_{0})$ in this case could be as large as $d\kappa$. Conversely, setting $B_{0}=\mu I$ yields a favorable upper bound, allowing both $\Psi(\bar{B}_{0})$ and $\Psi(\tilde{B}_{0})$ to be bounded by $d\log\kappa$.

Hence, choosing the initial Hessian approximation as $B_{0}=\mu I$ instead of $B_{0}=LI$ could result in fewer iterations to reach the superlinear convergence phase. This demonstrates the advantage of $B_{0}=\mu I$ over $B_{0}=LI$ in achieving superlinear convergence, highlighting the trade-off between the linear and superlinear convergence performances of different initial Hessian approximation matrices.

Generally, during the initial linear convergence stage, the iterates generated by the BFGS method with $B_{0}=LI$ outperform those with $B_{0}=\mu I$, due to a faster linear convergence speed. However, the BFGS method with $B_{0}=\mu I$ transitions to the ultimate superlinear convergence phase in fewer iterations compared to $B_{0}=LI$. This phenomenon has also been observed in our numerical experiments presented in Section 7.
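To make the comparison of the two initializations concrete, the following Python sketch evaluates the two potentials numerically. It is only an illustration under stated assumptions: the Hessian at $x_{*}$ is a synthetic matrix whose spectrum lies in $[\mu,L]$, the normalization $\tilde{B}_{0}=\nabla^{2}f(x_{*})^{-1/2}B_{0}\nabla^{2}f(x_{*})^{-1/2}$ is inferred from (61) and (63), and the helper names `psi` and `potentials` as well as all numerical values are hypothetical.

```python
import numpy as np

def psi(A):
    # Potential Psi(A) = Tr(A) - d - log Det(A), the form used in (61) and (63)
    d = A.shape[0]
    _, logdet = np.linalg.slogdet(A)
    return float(np.trace(A) - d - logdet)

def potentials(hess_star, B0, L):
    # Returns (Psi(B0_tilde), Psi(B0_bar)), assuming
    # B0_tilde = H^{-1/2} B0 H^{-1/2} and B0_bar = B0 / L
    w, V = np.linalg.eigh(hess_star)
    H_inv_sqrt = (V / np.sqrt(w)) @ V.T
    return psi(H_inv_sqrt @ B0 @ H_inv_sqrt), psi(B0 / L)

# Synthetic (hypothetical) Hessian at x_* with eigenvalues spread over [mu, L]
rng = np.random.default_rng(0)
d, mu, L = 50, 1.0, 1e3
kappa = L / mu
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
hess_star = (Q * np.geomspace(mu, L, d)) @ Q.T

tilde_L, bar_L = potentials(hess_star, L * np.eye(d), L)
tilde_mu, bar_mu = potentials(hess_star, mu * np.eye(d), L)

print("B0 = L I :", tilde_L, bar_L, " (bound on Psi(B0_tilde): d*kappa =", d * kappa, ")")
print("B0 = mu I:", tilde_mu, bar_mu, " (bound on both: d*log(kappa) =", d * np.log(kappa), ")")
```

The printed values can be compared with the bounds $d\kappa$ and $d\log\kappa$ from (61) and (63); how close they come to those bounds depends on the assumed spectrum.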

While all of our presented results are global and do not impose any initial condition on $x_{0}$, in the following remark, we present a potential local result derivable from Corollary 4.

Remark 3.

Consider the scenario where BFGS starts at a point $x_{0}$ near the optimal solution $x_{*}$ such that the initial error condition $C_{0}=\mathcal{O}(1/\sqrt{\kappa})$ is satisfied, i.e., $f(x_{0})-f(x_{*})=\mathcal{O}(\frac{\mu^{3}}{M^{2}\kappa})$. In this case, we can establish that $(1+4C_{0})d\log\kappa=\mathcal{O}(d\log\kappa)$ and $C_{0}\kappa\min\{1+C_{0},\sqrt{\kappa}\}=\mathcal{O}(1)$. Thus, from Corollary 4, we obtain the local superlinear convergence rate of $\mathcal{O}(\frac{d\log\kappa}{k})^{k}$, which aligns with the local convergence result in [rodomanov2020ratesnew]. It is noteworthy that the local result in [rodomanov2020ratesnew] relied on a unit step size, while our local side-result is derived using exact line search.

6 Discussions

Comparison with local non-asymptotic analysis. In this section, we discuss the recent non-asymptotic local convergence results for BFGS and DFP in [rodomanov2020rates, rodomanov2020ratesnew, qiujiang2020quasinewton] and explain why these results cannot be easily extended to achieve global complexity bounds.

To begin with, note that these results are crucially based on local analysis and only apply when the iterates are close to the optimal solution $x_{*}$ and the step size $\eta_{k}$ is set to 1 in this local region. Therefore, to extend their results into a global convergence guarantee, one plausible strategy is to employ a line search scheme to ensure global convergence, and then switch to the local analysis when the iterates enter the region of local convergence. However, this approach faces several challenges.

First, it remains unclear how to explicitly upper bound the number of iterations until the line search subroutine accepts the unit step size $\eta_{k}=1$. Moreover, assume the iterates enter the region of local convergence after $k_{0}$ iterations and we have $\eta_{k}=1$ for all $k\geq k_{0}$. Even then, there is no guarantee that the Hessian approximation matrix $B_{k_{0}}$ will satisfy the necessary conditions required for the local analysis in [rodomanov2020rates, rodomanov2020ratesnew, qiujiang2020quasinewton]. Specifically, for the analysis in [qiujiang2020quasinewton] to hold, $B_{k_{0}}$ must be sufficiently close to the exact Hessian matrix, which is not satisfied in general. Regarding [rodomanov2020ratesnew, rodomanov2020rates], we note that their analyses depend on the condition number of $B_{k_{0}}$, which could be exponentially large and thus render the superlinear rate meaningless. To be more concrete, inspecting the proofs in [rodomanov2020ratesnew, Lemma 5.4] and [rodomanov2020rates, Theorem 4.2] reveals that the superlinear convergence rate occurs when $k=\Omega(\Psi(\check{B}_{k_{0}}^{-1}))$ and $k=\Omega(\Psi(\check{B}_{k_{0}}))$, respectively, where $\check{B}_{k_{0}}=J_{k_{0}}^{-1/2}B_{k_{0}}J_{k_{0}}^{-1/2}$ with $J_{k_{0}}$ defined in (21) and $\Psi(\cdot)$ is the potential function defined in (18). Consequently, it is essential to establish bounds for the smallest and largest eigenvalues of $\check{B}_{k_{0}}$. However, the current theory indicates (see, e.g., [rodomanov2020rates, Theorem 4.1]) that $e^{-2\kappa M\lambda_{0}}I\preceq\check{B}_{k_{0}}\preceq e^{2\kappa M\lambda_{0}}I$, where $\lambda_{0}=\|(\nabla^{2}f(x_{0}))^{-\frac{1}{2}}\nabla f(x_{0})\|$ denotes the initial Newton decrement. This suggests that without a sufficiently small $\lambda_{0}$, the extreme eigenvalues of $\check{B}_{k_{0}}$ will be exponentially dependent on the condition number $\kappa$, leading to $\Psi(\check{B}_{k_{0}}^{-1}),\Psi(\check{B}_{k_{0}})=\Omega(de^{2\kappa M\lambda_{0}})$. Hence, a superlinear rate will be achieved only after $\Omega(de^{2\kappa M\lambda_{0}})$ iterations.

Our convergence framework also diverges significantly from the previous works [rodomanov2020rates, rodomanov2020ratesnew, qiujiang2020quasinewton] in terms of the proof strategy. Specifically, the approach in the aforementioned studies employs an induction argument to control the largest and smallest eigenvalues of the Hessian approximation matrix $B_{k}$ and prove a local linear convergence rate. In comparison, as presented in Sections 4 and 5, we prove global linear and superlinear convergence rates without explicitly establishing upper or lower bounds on the eigenvalues of $B_{k}$. This marks a notable departure from the local convergence analysis in [rodomanov2020rates], [rodomanov2020ratesnew], and [qiujiang2020quasinewton].

Comparison with global asymptotic analysis. As mentioned in Section 3, our convergence analysis framework resembles the approach taken in [Powell, byrd1987global, QN_tool] for proving asymptotic linear convergence rates of classical quasi-Newton methods such as BFGS and DFP. While these works considered inexact line search schemes and thus differ from our exact line search setting, they used an inequality similar to (16) in Proposition 1 to express the convergence rate in terms of the angle $\hat{\theta}_{k}$. Moreover, the authors in [Powell] and [byrd1987global] analyzed the traces and the determinants of the Hessian approximation matrices $\{B_{k}\}_{k\geq 0}$ separately to lower bound $\prod_{i=0}^{k-1}\cos(\hat{\theta}_{i})$. Later, this process was simplified in [QN_tool] by introducing the potential function $\Psi(\cdot)$ given in (18), combining the trace and determinant together as in our Proposition 2. However, since their main focus is on asymptotic convergence, we note that these previous works only demonstrate that $(\prod_{i=0}^{k-1}\cos(\hat{\theta}_{i}))^{1/k}$ is lower bounded by a constant, without giving an explicit form. Furthermore, our work builds upon previous analyses by incorporating a weight matrix $P$, while earlier works correspond to setting $P=I$. Another notable difference is that we keep the term $\hat{m}_{k}$ and lower bound the term $\cos^{2}(\hat{\theta}_{k})/\hat{m}_{k}$ as shown in Proposition 2, whereas previous works relied on a looser bound for $\hat{m}_{k}$. These refinements enable us to provide a tighter linear convergence rate for the BFGS method.

On the other hand, in demonstrating superlinear convergence, our approach deviates significantly from that of [Powell, byrd1987global, QN_tool]. Specifically, the previous works relied on the Dennis-Moré condition, i.e., $\lim_{k\to\infty}\frac{\|(B_{k}-\nabla^{2}f(x_{*}))s_{k}\|}{\|s_{k}\|}=0$, to establish asymptotic superlinear convergence. In comparison, we use the same framework outlined in Section 3 to establish both linear and superlinear convergence rates. The key distinction lies in the choice of the weight matrix $P$: we choose $P=LI$ for showing linear convergence and $P=\nabla^{2}f(x_{*})$ for showing superlinear convergence. Thus, we provide a unified framework for studying the global non-asymptotic convergence of BFGS.

7 Numerical experiments

In this section, we present our numerical experiments to validate our convergence rate guarantees, and in particular, we explore the difference between the convergence paths of BFGS under the two initializations: $B_{0}=LI$ and $B_{0}=\mu I$. We further compare these two variants of BFGS implementations with the gradient descent algorithm when deployed with exact line search. Hence, in our numerical experiments, all the step sizes used in BFGS with $B_{0}=LI$, BFGS with $B_{0}=\mu I$, and gradient descent are computed by the exact line search condition defined in (9). Specifically, we use the MATLAB optimization package and fminsearch function to determine the exact line search step size for all the algorithms. In our experiments, all initial points are chosen as random vectors in the corresponding Euclidean vector spaces.
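As a rough illustration of this setup, here is a minimal Python sketch of BFGS with an (approximately) exact line search. It is not the MATLAB implementation used in the experiments: instead of fminsearch, the step size is found by bracketing and bisection on the directional derivative, which assumes a strongly convex differentiable objective. The test objective, the helper names `exact_step` and `bfgs_exact`, and all parameter values are hypothetical.

```python
import numpy as np

def exact_step(grad, x, p, tol=1e-12, max_iter=200):
    # Approximates the minimizer of eta -> f(x + eta * p) by finding the root of the
    # directional derivative grad(x + eta p)^T p (f convex, p a descent direction).
    dphi = lambda eta: grad(x + eta * p) @ p
    lo, hi = 0.0, 1.0
    while dphi(hi) < 0:              # expand the bracket until the slope changes sign
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if dphi(mid) < 0:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    return 0.5 * (lo + hi)

def bfgs_exact(f, grad, x0, B0, iters=200, gtol=1e-10):
    # BFGS on the Hessian approximation B; returns the last iterate and f(x_k) for each k.
    x, B = x0.astype(float).copy(), B0.astype(float).copy()
    values = [f(x)]
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < gtol:
            break
        p = -np.linalg.solve(B, g)               # quasi-Newton direction
        s = exact_step(grad, x, p) * p           # step with (approximately) exact step size
        y = grad(x + s) - g                      # gradient difference
        Bs = B @ s
        B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)
        x = x + s
        values.append(f(x))
    return x, np.array(values)

# Hypothetical test problem: Hessian eigenvalues lie in [mu, L], x_* = 0 and f(x_*) = 0
rng = np.random.default_rng(0)
d, mu, L = 20, 1.0, 100.0
diag = np.linspace(mu, L - 1.0, d)
f = lambda x: 0.5 * x @ (diag * x) + np.sum(np.log(np.cosh(x)))
grad = lambda x: diag * x + np.tanh(x)
x0 = rng.standard_normal(d)

_, gaps_L = bfgs_exact(f, grad, x0, L * np.eye(d))
_, gaps_mu = bfgs_exact(f, grad, x0, mu * np.eye(d))
print("B0 = L I :", len(gaps_L) - 1, "iterations, final relative gap", gaps_L[-1] / gaps_L[0])
print("B0 = mu I:", len(gaps_mu) - 1, "iterations, final relative gap", gaps_mu[-1] / gaps_mu[0])
```

Because $f(x_{*})=0$ for this toy objective, the recorded values are the optimality gaps themselves. The two histories can be plotted on a log scale to compare the initializations, although a small, nearly quadratic problem like this one need not reproduce the behavior reported for the hard cubic function.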

In our first experiment, we focus on a hard cubic objective function defined in [hard_cubic, Section 5], i.e.,

$$
f(x)=\frac{\alpha}{12}\left(\sum_{i=1}^{d-1}g(v_{i}^{\top}x-v_{i+1}^{\top}x)-\beta v_{1}^{\top}x\right)+\frac{\lambda}{2}\|x\|^{2},
\tag{64}
$$

and $g:\mathbb{R}\to\mathbb{R}$ is defined as

$$
g(w)=\begin{cases}\frac{1}{3}|w|^{3}&|w|\leq\Delta,\\ \Delta w^{2}-\Delta^{2}|w|+\frac{1}{3}\Delta^{3}&|w|>\Delta,\end{cases}
\tag{65}
$$

where $\alpha,\beta,\lambda,\Delta\in\mathbb{R}$ are hyper-parameters and $\{v_{i}\}_{i=1}^{d}$ are standard orthogonal unit vectors in $\mathbb{R}^{d}$. This hard cubic function is used to establish a lower bound for second-order methods. The performance of the methods in addressing this problem is shown in Figures 1 and 2. In Figure 1, we vary the problem's dimension while holding the condition number constant, whereas in Figure 2, we hold the problem's dimension constant and explore the methods' convergence behaviors for different condition numbers.
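For readers who wish to experiment with this objective, the definitions (64) and (65) can be written out as the short Python sketch below. It assumes that the vectors $v_{i}$ are the standard basis vectors $e_{i}$, so that $v_{i}^{\top}x-v_{i+1}^{\top}x=x_{i}-x_{i+1}$. The function names and the hyper-parameter values in the usage lines are hypothetical and are not the values used in our experiments.

```python
import numpy as np

def g(w, delta):
    # Piecewise function (65): cubic near the origin, quadratic growth outside [-delta, delta]
    a = np.abs(w)
    return np.where(a <= delta, a**3 / 3.0, delta * w**2 - delta**2 * a + delta**3 / 3.0)

def g_prime(w, delta):
    # Derivative of g; both branches equal delta^2 * sign(w) at |w| = delta
    a = np.abs(w)
    return np.where(a <= delta, w * a, 2.0 * delta * w - delta**2 * np.sign(w))

def hard_cubic(x, alpha, beta, lam, delta):
    # Objective (64) with v_i = e_i (assumption), so v_i^T x - v_{i+1}^T x = x_i - x_{i+1}
    diffs = x[:-1] - x[1:]
    return alpha / 12.0 * (np.sum(g(diffs, delta)) - beta * x[0]) + 0.5 * lam * (x @ x)

def hard_cubic_grad(x, alpha, beta, lam, delta):
    # Gradient of (64) under the same assumption
    gp = g_prime(x[:-1] - x[1:], delta)
    grad = np.zeros_like(x)
    grad[:-1] += gp          # contribution of g(x_i - x_{i+1}) to coordinate i
    grad[1:] -= gp           # contribution of g(x_i - x_{i+1}) to coordinate i + 1
    grad[0] -= beta          # linear term -beta * v_1^T x
    return alpha / 12.0 * grad + lam * x

# Usage with hypothetical hyper-parameters
x = np.random.default_rng(0).standard_normal(10)
print(hard_cubic(x, alpha=1.0, beta=1.0, lam=0.1, delta=1.0))
print(np.linalg.norm(hard_cubic_grad(x, alpha=1.0, beta=1.0, lam=0.1, delta=1.0)))
```

The function and its gradient are all that a gradient-based method with a derivative-based line search needs as input.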

Figure 1: Convergence rates of BFGS with $B_{0}=LI$, BFGS with $B_{0}=\mu I$, and gradient descent for solving the hard cubic objective function when the condition number is fixed and the dimension is varied. Panels: (a) $d=40$, $\kappa=10^{3}$; (b) $d=400$, $\kappa=10^{3}$; (c) $d=4000$, $\kappa=10^{3}$.
Figure 2: Convergence rates of BFGS with $B_{0}=LI$, BFGS with $B_{0}=\mu I$, and gradient descent for solving the hard cubic objective function when the condition number is varied and the dimension is fixed. Panels: (a) $d=600$, $\kappa=10$; (b) $d=600$, $\kappa=10^{2}$; (c) $d=600$, $\kappa=10^{3}$.

Several observations are in order. First, BFGS with $B_{0}=LI$ initially converges faster than BFGS with $B_{0}=\mu I$ in most plots, aligning with our theoretical findings that the linear convergence rate of BFGS with $B_{0}=LI$ surpasses that of $B_{0}=\mu I$.

Second, the transition to superlinear convergence for BFGS with $B_{0}=\mu I$ typically occurs around $k\approx d$, as predicted by our theoretical analysis. Interestingly, this transition does not always coincide with the iterates approaching the solution's local neighborhood; in many cases, it occurs for BFGS with $B_{0}=\mu I$ even when its error is larger than that of gradient descent.

Third, although BFGS with $B_{0}=LI$ initially converges faster, its transition to superlinear convergence consistently occurs later than for $B_{0}=\mu I$. Notably, for a fixed dimension $d=600$, the transition to superlinear convergence for $B_{0}=LI$ occurs increasingly later as the problem condition number rises, an effect not observed for $B_{0}=\mu I$. This phenomenon indicates that the superlinear rate for $B_{0}=LI$ is more sensitive to the condition number $\kappa$, which corroborates our theory that the number of iterations required for superlinear convergence is $\mathcal{O}(d\kappa)$ for $B_{0}=LI$ and is improved to $\mathcal{O}(d\log\kappa)$ for $B_{0}=\mu I$.
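The bases of the bounds established for $B_{0}=LI$ (in the proof of Corollary 3) and for $B_{0}=\mu I$ (in (62)) fall below one once $k$ exceeds their numerators, so comparing the numerators gives a back-of-the-envelope view of the two iteration counts. The Python sketch below evaluates them. The function name and the value chosen for $C_{0}$ are hypothetical, and the outputs are thresholds of the upper bounds, not measured transition points.

```python
import math

def superlinear_thresholds(d, kappa, C0):
    # Numerators of the superlinear bounds: the bound ((numerator) / k)^k is below one for k > numerator
    shared = 12.0 * C0 * kappa * min(2.0 * (1.0 + C0), 1.0 + math.sqrt(kappa))
    n_L = d * kappa + shared                                  # B0 = L I
    n_mu = (1.0 + 4.0 * C0) * d * math.log(kappa) + shared    # B0 = mu I, from (62)
    return n_L, n_mu

# Dimension and condition numbers as in Figure 2; C0 = 0.01 is a hypothetical value
for kappa in (1e1, 1e2, 1e3):
    n_L, n_mu = superlinear_thresholds(d=600, kappa=kappa, C0=0.01)
    print(f"kappa = {kappa:g}: threshold for L I = {n_L:.1f}, threshold for mu I = {n_mu:.1f}")
```

Since these are worst-case bounds, the transitions observed in practice can occur much earlier than the thresholds suggest.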

These findings align with our theoretical observations on the trade-off between global linear and superlinear convergence rates for different initial Hessian approximation matrices, as discussed in Sections 4 and 5.
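As a rough illustration of the scaling discussed above, the short Python sketch below tabulates the two theoretical transition scales, $d\kappa$ for $B_0 = LI$ and $d\log\kappa$ for $B_0 = \mu I$, at $d = 600$. The helper name `transition_scales` and the grid of condition numbers are hypothetical choices made for illustration. Absolute constants and the $C_0$-dependent terms of the bounds are ignored, so the numbers indicate orders of magnitude only, not the iteration counts observed in the plots.

```python
import math

def transition_scales(d, kappas):
    """Return (kappa, d*kappa, d*log(kappa)) triples.

    d*kappa is the order of the transition bound for B_0 = L*I and
    d*log(kappa) the order for B_0 = mu*I; constants and C_0 terms are dropped.
    """
    return [(kappa, d * kappa, d * math.log(kappa)) for kappa in kappas]

d = 600  # dimension quoted in the text
kappa_grid = [1e1, 1e2, 1e3, 1e4]  # hypothetical condition numbers

for kappa, scale_LI, scale_muI in transition_scales(d, kappa_grid):
    print(f"kappa={kappa:8.0f}  d*kappa={scale_LI:12.0f}  d*log(kappa)={scale_muI:10.1f}")
```

The first column grows linearly in $\kappa$ while the second grows only logarithmically, which is consistent with the observation that the transition for $B_0 = LI$ is delayed more strongly as the condition number increases.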

8 Conclusion

In this paper, we proved explicit global linear and superlinear convergence rates for the BFGS method implemented with the exact line search scheme. Our results hold for any initial point $x_0$ and any initial Hessian approximation matrix $B_0 \in \mathbb{S}_{++}^{d}$. We proved a global convergence rate of $\bigl(1 - e^{-\frac{\Psi(\bar{B}_0)}{k}} \frac{2}{\kappa\min\{2(1+C_0),\, 1+\sqrt{\kappa}\}}\bigr)^{k}$, where $\bar{B}_0 = B_0/L$ and $\Psi(\cdot)$ is defined in (18). This implies a linear rate of $\bigl(1 - \frac{2}{\kappa\min\{2(1+C_0),\, 1+\sqrt{\kappa}\}}\bigr)^{k}$ when $k \geq \Psi(\bar{B}_0)$.
Moreover, we proved that the linear rate is improved to $(1 - \frac{1}{3\kappa})^{k}$ after $\mathcal{O}\bigl((1+C_0)\Psi(\bar{B}_0) + C_0\kappa\min\{1+C_0, \sqrt{\kappa}\}\bigr)$ iterations. Finally, we proved a superlinear convergence rate of $\mathcal{O}\bigl(\frac{\Psi(\tilde{B}_0) + C_0\Psi(\bar{B}_0) + C_0\kappa\min\{1+C_0, \sqrt{\kappa}\}}{k}\bigr)^{k}$, where $\tilde{B}_0 = \nabla^2 f(x_*)^{-\frac{1}{2}} B_0 \nabla^2 f(x_*)^{-\frac{1}{2}}$.

We further showed that for the specific choice of $B_0 = LI$, BFGS achieves a global linear convergence rate of $\mathcal{O}\bigl(1 - \frac{1}{\kappa\min\{1+C_0, \sqrt{\kappa}\}}\bigr)^{k}$ from the first iteration, an improved linear rate of $(1 - \frac{1}{3\kappa})^{k}$ after $\mathcal{O}(C_0\kappa\min\{1+C_0, \sqrt{\kappa}\})$ iterations, and a superlinear convergence rate of $\mathcal{O}\bigl(\frac{d\kappa + C_0\kappa\min\{1+C_0, \sqrt{\kappa}\}}{k}\bigr)^{k}$.
Moreover, for $B_0 = \mu I$, BFGS achieves a global linear rate of $\mathcal{O}\bigl(1 - \frac{1}{\kappa\min\{1+C_0, \sqrt{\kappa}\}}\bigr)^{k}$ after $\mathcal{O}(d\log\kappa)$ iterations, an improved linear rate of $\mathcal{O}\bigl((1 - \frac{1}{\kappa})^{k}\bigr)$ after $\mathcal{O}(d\log\kappa + C_0\kappa\min\{1+C_0, \sqrt{\kappa}\})$ iterations, and a superlinear rate of $\mathcal{O}\bigl(\frac{(1+C_0)d\log\kappa + C_0\kappa\min\{1+C_0, \sqrt{\kappa}\}}{k}\bigr)^{k}$.
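The way the general bounds specialize to $B_0 = LI$ and $B_0 = \mu I$ can be traced with a short worked computation. It is a sketch under two assumptions that are not restated in this section: that (18) defines $\Psi(A) = \mathbf{Tr}(A) - d - \log\mathbf{Det}(A)$, and that $\mu I \preceq \nabla^2 f(x_*) \preceq LI$ with $\kappa = L/\mu$. Writing $\phi(\lambda) = \lambda - 1 - \log\lambda$, which is decreasing on $(0,1]$ and increasing on $[1,\infty)$, one gets:

```latex
\begin{align*}
\Psi(A) &= \sum_{i=1}^{d} \phi\bigl(\lambda_i(A)\bigr), \qquad \phi(\lambda) = \lambda - 1 - \log\lambda, \\
B_0 = LI: \quad \Psi(\bar{B}_0) &= \Psi(I) = 0, \\
\Psi(\tilde{B}_0) &\le d\,\phi(\kappa) = d(\kappa - 1 - \log\kappa) \le d\kappa
  \quad \text{since } \lambda_i(\tilde{B}_0) \in [1, \kappa], \\
B_0 = \mu I: \quad \Psi(\bar{B}_0) &= \Psi(I/\kappa) = d\bigl(\tfrac{1}{\kappa} - 1 + \log\kappa\bigr) \le d\log\kappa, \\
\Psi(\tilde{B}_0) &\le d\,\phi(1/\kappa) \le d\log\kappa
  \quad \text{since } \lambda_i(\tilde{B}_0) \in [1/\kappa, 1].
\end{align*}
```

Substituting these values into $\Psi(\tilde{B}_0) + C_0\Psi(\bar{B}_0)$ gives $d\kappa$ for $B_0 = LI$ and $(1+C_0)d\log\kappa$ for $B_0 = \mu I$, matching the numerators of the superlinear rates stated above.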

Appendix

Appendix A Proof of Proposition 2

First, we show that

\begin{align}
\mathbf{Tr}(\hat{B}_{k+1}) &= \mathbf{Tr}(\hat{B}_k) - \frac{\|\hat{B}_k\hat{s}_k\|^2}{\hat{s}_k^\top\hat{B}_k\hat{s}_k} + \frac{\|\hat{y}_k\|^2}{\hat{s}_k^\top\hat{y}_k}, \tag{66} \\
\mathbf{Det}(\hat{B}_{k+1}) &= \mathbf{Det}(\hat{B}_k)\,\frac{\hat{s}_k^\top\hat{y}_k}{\hat{s}_k^\top\hat{B}_k\hat{s}_k}. \tag{67}
\end{align}

Taking the trace of both sides of equation (12) and using the fact that $\mathbf{Tr}(ab^\top) = a^\top b$ for any vectors $a$ and $b$, we obtain the equality in (66). See Lemma 6.2 of [rodomanov2020rates] for the proof of (67). Taking the logarithm of both sides of (67), we obtain

\[
\log\frac{\hat{s}_k^\top\hat{y}_k}{\hat{s}_k^\top\hat{B}_k\hat{s}_k} = \log\mathbf{Det}(\hat{B}_{k+1}) - \log\mathbf{Det}(\hat{B}_k).
\]
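The identities (66) and (67) and the logarithmic form above can be sanity-checked numerically. The following is a minimal sketch in plain Python, assuming that (12) is the standard BFGS update $\hat{B}_{k+1} = \hat{B}_k - \frac{\hat{B}_k\hat{s}_k\hat{s}_k^\top\hat{B}_k}{\hat{s}_k^\top\hat{B}_k\hat{s}_k} + \frac{\hat{y}_k\hat{y}_k^\top}{\hat{s}_k^\top\hat{y}_k}$ (this form is consistent with (66) but is not shown in this appendix). The $2\times 2$ matrix `B` and the vectors `s`, `y` are hypothetical test data, chosen so that `B` is positive definite and $s^\top y > 0$; the helper names are illustrative only.

```python
import math

def dot(a, b):
    """Euclidean inner product of two vectors given as lists."""
    return sum(x * z for x, z in zip(a, b))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]

def bfgs_update(B, s, y):
    """Standard BFGS update: B - (Bs)(Bs)^T/(s^T B s) + y y^T/(s^T y)."""
    Bs = matvec(B, s)
    sBs, sy = dot(s, Bs), dot(s, y)
    n = len(s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / sy
             for j in range(n)] for i in range(n)]

def trace2(M):
    return M[0][0] + M[1][1]

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

# Hypothetical data: B is symmetric positive definite and s^T y > 0.
B = [[2.0, 0.5], [0.5, 1.0]]
s = [1.0, -2.0]
y = [0.5, -1.5]

B_new = bfgs_update(B, s, y)
Bs = matvec(B, s)
sBs, sy = dot(s, Bs), dot(s, y)

trace_rhs = trace2(B) - dot(Bs, Bs) / sBs + dot(y, y) / sy  # right side of (66)
det_rhs = det2(B) * sy / sBs                                 # right side of (67)
log_lhs = math.log(sy / sBs)                                 # log form of (67)
log_rhs = math.log(det2(B_new)) - math.log(det2(B))

print("trace:", trace2(B_new), trace_rhs)
print("det:  ", det2(B_new), det_rhs)
print("log:  ", log_lhs, log_rhs)
```

Each printed pair should agree up to floating-point rounding. The check uses unweighted quantities, which is enough here because the identities depend only on the algebraic form of the update.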

Recall that $\hat{m}_k = \frac{\hat{y}_k^\top\hat{s}_k}{\|\hat{s}_k\|^2}$ and $\cos(\hat{\theta}_k) = -\hat{g}_k^\top\hat{s}_k / (\|\hat{g}_k\|\|\hat{s}_k\|)$.
Since $\hat{B}_k\hat{s}_k = -\eta_k\hat{g}_k$, we also have $\cos(\hat{\theta}_k) = \hat{s}_k^\top\hat{B}_k\hat{s}_k / (\|\hat{B}_k\hat{s}_k\|\|\hat{s}_k\|)$. Hence, we can write

\[
\frac{\hat{s}_k^\top\hat{y}_k}{\hat{s}_k^\top\hat{B}_k\hat{s}_k}
= \frac{\|\hat{B}_k\hat{s}_k\|^2\|\hat{s}_k\|^2}{(\hat{s}_k^\top\hat{B}_k\hat{s}_k)^2}\,
\frac{\hat{s}_k^\top\hat{y}_k}{\|\hat{s}_k\|^2}\,
\frac{\hat{s}_k^\top\hat{B}_k\hat{s}_k}{\|\hat{B}_k\hat{s}_k\|^2}
= \frac{\hat{m}_k}{\cos^2(\hat{\theta}_k)}\,
\frac{\hat{s}_k^\top\hat{B}_k\hat{s}_k}{\|\hat{B}_k\hat{s}_k\|^2}.
\]

Thus, we obtain that

\begin{align*}
\Psi(\hat{B}_{k+1}) - \Psi(\hat{B}_k)
&= \mathbf{Tr}(\hat{B}_{k+1}) - \mathbf{Tr}(\hat{B}_k) + \log\mathbf{Det}(\hat{B}_k) - \log\mathbf{Det}(\hat{B}_{k+1}) \\
&= \frac{\|\hat{y}_k\|^2}{\hat{s}_k^\top\hat{y}_k} - \frac{\|\hat{B}_k\hat{s}_k\|^2}{\hat{s}_k^\top\hat{B}_k\hat{s}_k} - \log\frac{\hat{s}_k^\top\hat{y}_k}{\hat{s}_k^\top\hat{B}_k\hat{s}_k} \\
&= \frac{\|\hat{y}_k\|^2}{\hat{s}_k^\top\hat{y}_k} - 1 + \log\frac{\cos^2\hat{\theta}_k}{\hat{m}_k} - \left(\frac{\|\hat{B}_k\hat{s}_k\|^2}{\hat{s}_k^\top\hat{B}_k\hat{s}_k} - \log\frac{\|\hat{B}_k\hat{s}_k\|^2}{\hat{s}_k^\top\hat{B}_k\hat{s}_k} - 1\right) \\
&\leq \frac{\|\hat{y}_k\|^2}{\hat{s}_k^\top\hat{y}_k} - 1 + \log\frac{\cos^2\hat{\theta}_k}{\hat{m}_k},
\end{align*}

where the last inequality holds since $x - \log x - 1 \geq 0$ for any $x > 0$ (the left-hand side is minimized at $x = 1$, where it equals zero). Hence (19) follows from the above inequality. Finally, the result in (20) follows from summing both sides of (19) from $i = 0$ to $k-1$, i.e.,

\begin{align*}
\sum_{i=0}^{k-1}\Psi(\hat{B}_{i+1}) &\leq \sum_{i=0}^{k-1}\Psi(\hat{B}_i) + \sum_{i=0}^{k-1}\left(\frac{\|\hat{y}_i\|^2}{\hat{s}_i^\top\hat{y}_i} - 1\right) + \sum_{i=0}^{k-1}\log\frac{\cos^2\hat{\theta}_i}{\hat{m}_i}, \\
\Psi(\hat{B}_k) &\leq \Psi(\hat{B}_0) + \sum_{i=0}^{k-1}\left(\frac{\|\hat{y}_i\|^2}{\hat{s}_i^\top\hat{y}_i} - 1\right) + \sum_{i=0}^{k-1}\log\frac{\cos^2\hat{\theta}_i}{\hat{m}_i}, \\
\sum_{i=0}^{k-1}\log\frac{\cos^2(\hat{\theta}_i)}{\hat{m}_i} &\geq \Psi(\hat{B}_k) - \Psi(\hat{B}_0) + \sum_{i=0}^{k-1}\left(1 - \frac{\|\hat{y}_i\|^2}{\hat{s}_i^\top\hat{y}_i}\right) \geq -\Psi(\hat{B}_0) + \sum_{i=0}^{k-1}\left(1 - \frac{\|\hat{y}_i\|^2}{\hat{s}_i^\top\hat{y}_i}\right),
\end{align*}

where the last inequality holds since $\Psi(\hat{B}_k) \geq 0$ for any $k \geq 0$.
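The per-step bound (19) and its summed form can also be checked numerically on a small instance. The sketch below is illustrative only and rests on stated assumptions: that (18) defines $\Psi(A) = \mathbf{Tr}(A) - d - \log\mathbf{Det}(A)$, that the update is the standard BFGS formula, and that $\cos(\hat{\theta}_k)$ is evaluated through its second expression $\hat{s}_k^\top\hat{B}_k\hat{s}_k/(\|\hat{B}_k\hat{s}_k\|\|\hat{s}_k\|)$. The matrix `A`, the initial matrix `B`, and the list `steps` are hypothetical data; taking $y = As$ with `A` positive definite is simply a convenient way to guarantee $s^\top y > 0$.

```python
import math

def dot(a, b):
    return sum(x * z for x, z in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def bfgs_update(B, s, y):
    """Standard BFGS update: B - (Bs)(Bs)^T/(s^T B s) + y y^T/(s^T y)."""
    Bs = matvec(B, s)
    sBs, sy = dot(s, Bs), dot(s, y)
    n = len(s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / sy
             for j in range(n)] for i in range(n)]

def psi(B):
    """Assumed potential Psi(B) = Tr(B) - d - log Det(B), for d = 2."""
    tr = B[0][0] + B[1][1]
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    return tr - 2 - math.log(det)

def step_bound(B, s, y):
    """Increment ||y||^2/(s^T y) - 1 + log(cos^2(theta)/m) from the proof."""
    Bs = matvec(B, s)
    m = dot(y, s) / dot(s, s)
    cos2 = dot(s, Bs) ** 2 / (dot(Bs, Bs) * dot(s, s))
    return dot(y, y) / dot(s, y) - 1 + math.log(cos2 / m)

A = [[1.0, 0.2], [0.2, 0.5]]   # hypothetical SPD matrix generating y = A s
B = [[2.0, 0.0], [0.0, 2.0]]   # hypothetical initial approximation
steps = [[1.0, 0.0], [0.3, -1.0], [-0.5, 0.7], [1.0, 1.0]]

psi0, total, gaps = psi(B), 0.0, []
for s in steps:
    y = matvec(A, s)
    inc = step_bound(B, s, y)
    B_next = bfgs_update(B, s, y)
    gaps.append(psi(B) + inc - psi(B_next))  # slack in the per-step bound
    total += inc
    B = B_next

print("per-step slacks:", gaps)
print("Psi(B_k) =", psi(B), " bound =", psi0 + total)
```

Every slack should be nonnegative up to rounding, and the final potential should not exceed $\Psi(\hat{B}_0)$ plus the accumulated increments, which is the telescoped inequality used to derive (20).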

Appendix B Proof of Lemma 2

  1. (a)

    Recall that $J_k = \int_0^1 \nabla^2 f(x_k + \tau(x_{k+1} - x_k))\,d\tau$. Using the triangle inequality, we have

    \begin{align*}
    \|\nabla^2 f(x_*) - J_k\| &= \left\|\int_0^1 \left(\nabla^2 f(x_*) - \nabla^2 f(x_k + \tau(x_{k+1} - x_k))\right) d\tau\right\| \\
    &\leq \int_0^1 \|\nabla^2 f(x_*) - \nabla^2 f(x_k + \tau(x_{k+1} - x_k))\|\,d\tau.
    \end{align*}

    Moreover, it follows from Assumption 3 that $\|\nabla^{2}f(x_{*})-\nabla^{2}f(x_{k}+\tau(x_{k+1}-x_{k}))\|\leq M\|(1-\tau)(x_{*}-x_{k})+\tau(x_{*}-x_{k+1})\|$ for any $\tau\in[0,1]$. Thus, we can further apply the triangle inequality to obtain

    \begin{align*}
    \|\nabla^{2}f(x_{*})-J_{k}\| &\leq \int_{0}^{1}M\|(1-\tau)(x_{*}-x_{k})+\tau(x_{*}-x_{k+1})\|\,d\tau \\
    &\leq M\|x_{k}-x_{*}\|\int_{0}^{1}(1-\tau)\,d\tau+M\|x_{k+1}-x_{*}\|\int_{0}^{1}\tau\,d\tau \\
    &= \frac{M}{2}\left(\|x_{k}-x_{*}\|+\|x_{k+1}-x_{*}\|\right).
    \end{align*}

    Since $f$ is strongly convex by Assumption 1 and $\nabla f(x_{*})=0$, we have $\frac{\mu}{2}\|x_{k}-x_{*}\|^{2}\leq f(x_{k})-f(x_{*})$, which implies that $\|x_{k}-x_{*}\|\leq\sqrt{2(f(x_{k})-f(x_{*}))/\mu}$. Similarly, since $f(x_{k+1})\leq f(x_{k})$, it also holds that $\|x_{k+1}-x_{*}\|\leq\sqrt{2(f(x_{k+1})-f(x_{*}))/\mu}\leq\sqrt{2(f(x_{k})-f(x_{*}))/\mu}$. Hence, we obtain

    \begin{equation}
    \|\nabla^{2}f(x_{*})-J_{k}\|\leq\frac{M}{\sqrt{\mu}}\sqrt{2(f(x_{k})-f(x_{*}))}. \tag{68}
    \end{equation}

    Moreover, notice that by Assumption 1, we also have $J_{k}\succeq\mu I$ and $\nabla^{2}f(x_{*})\succeq\mu I$. Hence, (68) implies that

    \begin{align*}
    \nabla^{2}f(x_{*})-J_{k} &\preceq \|\nabla^{2}f(x_{*})-J_{k}\|I\preceq\frac{M}{\mu^{\frac{3}{2}}}\sqrt{2(f(x_{k})-f(x_{*}))}\,J_{k}=C_{k}J_{k}, \\
    J_{k}-\nabla^{2}f(x_{*}) &\preceq \|J_{k}-\nabla^{2}f(x_{*})\|I\preceq\frac{M}{\mu^{\frac{3}{2}}}\sqrt{2(f(x_{k})-f(x_{*}))}\,\nabla^{2}f(x_{*})=C_{k}\nabla^{2}f(x_{*}),
    \end{align*}

    where we used the definition of $C_{k}$ in (23). By rearranging the terms, we obtain (24).
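
    The two steps left implicit above can be spelled out as follows. This is a sketch, taking $C_{k}=\frac{M}{\mu^{3/2}}\sqrt{2(f(x_{k})-f(x_{*}))}$ as it appears in the preceding display. The second inequality in each chain combines (68) with $I\preceq\frac{1}{\mu}J_{k}$ and $I\preceq\frac{1}{\mu}\nabla^{2}f(x_{*})$, respectively, and the rearrangement reads

    \begin{align*}
    \nabla^{2}f(x_{*})-J_{k}\preceq C_{k}J_{k} &\iff \nabla^{2}f(x_{*})\preceq(1+C_{k})J_{k} \iff \frac{1}{1+C_{k}}\nabla^{2}f(x_{*})\preceq J_{k}, \\
    J_{k}-\nabla^{2}f(x_{*})\preceq C_{k}\nabla^{2}f(x_{*}) &\iff J_{k}\preceq(1+C_{k})\nabla^{2}f(x_{*}).
    \end{align*}

    Together these give the two-sided bound $\frac{1}{1+C_{k}}\nabla^{2}f(x_{*})\preceq J_{k}\preceq(1+C_{k})\nabla^{2}f(x_{*})$.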

  2. (b)

    Recall that $G_{k}=\int_{0}^{1}\nabla^{2}f(x_{k}+\tau(x_{*}-x_{k}))\,d\tau$. Similar to the arguments in (a), we have

    \begin{equation}
    \begin{split}
    \left\|\nabla^{2}f(x_{*})-G_{k}\right\| &= \left\|\int_{0}^{1}\left(\nabla^{2}f(x_{*})-\nabla^{2}f(x_{k}+\tau(x_{*}-x_{k}))\right)d\tau\right\| \\
    &\leq \int_{0}^{1}\|\nabla^{2}f(x_{*})-\nabla^{2}f(x_{k}+\tau(x_{*}-x_{k}))\|\,d\tau \\
    &\leq M\int_{0}^{1}\|(1-\tau)(x_{*}-x_{k})\|\,d\tau = M\|x_{k}-x_{*}\|\int_{0}^{1}(1-\tau)\,d\tau \\
    &= \frac{M}{2}\|x_{k}-x_{*}\| \leq \frac{M}{\sqrt{\mu}}\sqrt{2(f(x_{k})-f(x_{*}))}.
    \end{split}
    \tag{69}
    \end{equation}

    Moreover, notice that by Assumption 1, we also have $G_{k}\succeq\mu I$ and $\nabla^{2}f(x_{*})\succeq\mu I$. The rest follows as in the proof of (a), which proves (25).

  3. (c)

    Recall that $J_{k}=\int_{0}^{1}\nabla^{2}f(x_{k}+\tau(x_{k+1}-x_{k}))\,d\tau$. For any $\hat{\tau}\in[0,1]$, we have

    \begin{equation}
    \begin{split}
    &\phantom{{}={}}\left\|\nabla^{2}f(x_{k}+\hat{\tau}(x_{k+1}-x_{k}))-J_{k}\right\| \\
    &= \left\|\int_{0}^{1}\left(\nabla^{2}f(x_{k}+\hat{\tau}(x_{k+1}-x_{k}))-\nabla^{2}f(x_{k}+\tau(x_{k+1}-x_{k}))\right)d\tau\right\| \\
    &\leq \int_{0}^{1}\left\|\nabla^{2}f(x_{k}+\hat{\tau}(x_{k+1}-x_{k}))-\nabla^{2}f(x_{k}+\tau(x_{k+1}-x_{k}))\right\|d\tau \\
    &\leq \int_{0}^{1}M|\hat{\tau}-\tau|\,\|x_{k+1}-x_{k}\|\,d\tau \leq \frac{1}{2}M\|x_{k+1}-x_{k}\|.
    \end{split}
    \tag{70}
    \end{equation}

    Moreover, by using the triangle inequality, we have $\|x_{k+1}-x_{k}\|\leq\|x_{k+1}-x_{*}\|+\|x_{k}-x_{*}\|\leq\sqrt{\frac{2}{\mu}(f(x_{k+1})-f(x_{*}))}+\sqrt{\frac{2}{\mu}(f(x_{k})-f(x_{*}))}\leq 2\sqrt{\frac{2}{\mu}(f(x_{k})-f(x_{*}))}$. Combining this with (70), we obtain that

    \[
    \left\|\nabla^{2}f(x_{k}+\hat{\tau}(x_{k+1}-x_{k}))-J_{k}\right\|\leq\frac{M}{\sqrt{\mu}}\sqrt{2(f(x_{k})-f(x_{*}))}.
    \]

    Moreover, notice that by Assumption 1, we also have $\nabla^{2}f(x_{k}+\hat{\tau}(x_{k+1}-x_{k}))\succeq\mu I$ and $J_{k}\succeq\mu I$. The rest follows as in the proof of (a), which proves (26).
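
    For completeness, the last inequality in (70) can be checked by evaluating the integral directly. This is a worked step spelled out for convenience. For any $\hat{\tau}\in[0,1]$,

    \[
    \int_{0}^{1}|\hat{\tau}-\tau|\,d\tau=\int_{0}^{\hat{\tau}}(\hat{\tau}-\tau)\,d\tau+\int_{\hat{\tau}}^{1}(\tau-\hat{\tau})\,d\tau=\frac{\hat{\tau}^{2}+(1-\hat{\tau})^{2}}{2}\leq\frac{1}{2},
    \]

    with equality exactly when $\hat{\tau}\in\{0,1\}$. Substituting the bound on $\|x_{k+1}-x_{k}\|$ then gives $\frac{1}{2}M\cdot 2\sqrt{\frac{2}{\mu}(f(x_{k})-f(x_{*}))}=\frac{M}{\sqrt{\mu}}\sqrt{2(f(x_{k})-f(x_{*}))}$, which matches the constant in the bound above.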

  4. (d)

    Recall that $G_{k}=\int_{0}^{1}\nabla^{2}f(x_{k}+\tau(x_{*}-x_{k}))\,d\tau$. For any $\tilde{\tau}\in[0,1]$, we have

    \begin{equation}
    \begin{split}
    &\phantom{{}={}}\left\|\nabla^{2}f(x_{k}+\tilde{\tau}(x_{*}-x_{k}))-G_{k}\right\| \\
    &= \left\|\int_{0}^{1}\left(\nabla^{2}f(x_{k}+\tilde{\tau}(x_{*}-x_{k}))-\nabla^{2}f(x_{k}+\tau(x_{*}-x_{k}))\right)d\tau\right\| \\
    &\leq \int_{0}^{1}\left\|\nabla^{2}f(x_{k}+\tilde{\tau}(x_{*}-x_{k}))-\nabla^{2}f(x_{k}+\tau(x_{*}-x_{k}))\right\|d\tau \\
    &\leq \int_{0}^{1}M|\tilde{\tau}-\tau|\,\|x_{k}-x_{*}\|\,d\tau \leq \frac{1}{2}M\|x_{k}-x_{*}\| \leq \frac{M}{\sqrt{\mu}}\sqrt{2(f(x_{k})-f(x_{*}))}.
    \end{split}
    \tag{71}
    \end{equation}

    Moreover, notice that by Assumption 1, we also have $\nabla^{2}f(x_{k}+\tilde{\tau}(x_{*}-x_{k}))\succeq\mu I$ and $G_{k}\succeq\mu I$. The rest follows as in the proof of (a), which proves (27).

\printbibliography