The Quasi-Cauchy Relation and Diagonal Updating

M. Zhu*, J.L. Nazareth†, and H. Wolkowicz‡
Abstract
The quasi-Cauchy (QC) relation is the weak-secant or weak-quasi-Newton relation of Dennis and Wolkowicz [3] with the added restriction that full matrices are replaced by diagonal matrices. The latter are especially appropriate for scaling a Cauchy (steepest-descent) algorithm, hence our choice of terminology.
In this article, we explore the QC relation and develop variational
techniques for updating diagonal matrices that satisfy it. Numerical
results are also given to illustrate the use of such updates within a
Cauchy algorithm.
Keywords: Weak-secant, Quasi-Cauchy, diagonal updating, Cauchy
algorithm, steepest-descent.
1 Introduction
We consider the problem of finding a local minimum of a smooth, unconstrained nonlinear function, namely,

    $\mathrm{minimize}_{x \in R^n}\; f(x).$    (1)
For a background overview of Newton and Cauchy-type algorithms for solving (1), see Dennis and Schnabel [2] or the recent landmark book of Bertsekas [1].
* Department of Pure and Applied Mathematics, Washington State University, Pullman, WA 99164-3113. E-mail: zhu@delta.math.wsu.edu
† As above. E-mail: nazareth@amath.washington.edu
‡ Department of Combinatorics and Optimization, University of Waterloo, Waterloo, Ontario, Canada. E-mail: hwolkowi@orion.math.uwaterloo.ca
In the latter reference, we find the following important observation ([1], p. 67):
Generally, there is a tendency to think that difficult problems should be addressed with sophisticated methods, such as Newton-like methods. This is often true, particularly for problems with nonsingular local minima that are poorly conditioned. However, it is important to realize that often the reverse is true, namely that for problems with "difficult" cost functions and singular local minima, it is best to use simple methods such as (perhaps diagonally scaled) steepest descent with simple stepsize rules such as a constant or a diminishing stepsize. The reason is that methods that use sophisticated descent directions and stepsize rules often rely on assumptions that are likely to be violated on difficult problems.
Our investigation here is very much in the spirit of these remarks. In particular, we seek effective ways to diagonally scale an algorithm of Cauchy type.
For purposes of discussion, it is useful to identify a hierarchy of relations
that can be employed within Newton and Cauchy algorithms as follows:
Secant or Quasi-Newton (QN): $M_+ s = y$, where the $n$-dimensional vectors $s = x_+ - x$ and $y = g_+ - g$ denote the step and gradient change corresponding to two different points $x$ and $x_+$ and their associated gradients $g$ and $g_+$, and $M_+$ is a full $n \times n$ matrix that approximates the Hessian. This notation is used henceforth. Both $s$ and $y$ are available to the associated QN algorithm, and it requires $O(n^2)$ storage for the matrix $M_+$.
Weak-Secant: $s^T M_+ s = s^T y$. This was introduced and studied by Dennis and Wolkowicz [3]. Again the resulting QN algorithm uses $s$ and $y$ explicitly and requires $O(n^2)$ storage.
Quasi-Cauchy (QC): $s^T D_+ s = s^T y$, where $D_+$ is a diagonal matrix; i.e., the QC relation is the weak-secant relation with matrices restricted to be diagonal, and $s$ and $y$ are available. The associated Cauchy algorithm requires only $O(n)$ storage.
Weak-Quasi-Cauchy: $s^T D_+ s = b$, where $D_+$ is a diagonal matrix and $b \approx s^T g_+ - s^T g = s^T y$ is obtained from differences of directional derivatives along $s$; i.e., the weak-QC relation is the QC relation further weakened so that gradient vectors (hence the vector $y$) are not explicitly used. The notions of QC relations and diagonal updating were originally introduced in this setting in [12], [13]. The associated QC algorithm requires $O(n)$ storage and, in addition, requires only approximations to gradients (quasi-gradients).
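To fix ideas, the following minimal sketch (ours, in Python; the quadratic test function and all names are illustrative assumptions, not from the paper) evaluates each relation for a single step. Note that the QC relation imposes just one scalar condition on the $n$ diagonal entries of $D_+$, which is why a variational principle is needed in Section 2 to single out an update.

import numpy as np

# Assumed quadratic test function f(x) = 0.5 x^T A x, gradient g = A x.
A = np.diag([1.0, 10.0, 100.0])
x, x_plus = np.array([1.0, 1.0, 1.0]), np.array([0.9, 0.5, 0.1])

s = x_plus - x                    # step
y = A @ x_plus - A @ x            # gradient change g_+ - g

# Secant (QN): M_+ s = y, n equations, O(n^2) storage for M_+.
M_plus = A                        # the exact Hessian satisfies it trivially
print(np.allclose(M_plus @ s, y))

# Weak-secant: s^T M_+ s = s^T y, a single scalar equation.
print(np.isclose(s @ (M_plus @ s), s @ y))

# Quasi-Cauchy: s^T D_+ s = s^T y with D_+ diagonal, i.e. one scalar
# condition on n unknowns d_i; O(n) storage. One admissible choice is
# the Oren-Luenberger scaling encountered later in the paper:
d_plus = np.full(3, (s @ y) / (s @ s))
print(np.isclose(s @ (d_plus * s), s @ y))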
We will discuss the general idea of diagonal updating subject to the
QC relation and give numerical results for an implementation of a Cauchy
algorithm that employs such diagonal scaling matrices. A more complete
theory of diagonal updating, including its application to limited-memory
BFGS algorithms and further numerical results, can be found in [16], [17].
2 Diagonal Updating
Suppose $D > 0$ is a positive definite diagonal matrix and $D_+$ is the updated version of $D$, which is also diagonal. We require that the updated $D_+$ satisfy the QC relation and that the deviation between $D$ and $D_+$ be minimized under some variational principle. We would like the latter to preserve positive definiteness in a natural way; i.e., we seek well-posed metric problems such that the solution $D_+$, through the diagonal updating, incorporates available curvature information from the step and gradient changes as well as that contained in $D$. As noted earlier, a diagonal matrix needs only the same computer storage as a vector, so an algorithm with $O(n)$ storage will be maintained. We only consider Cauchy algorithms here, but it is clear that diagonal updating will have wide application to CG and limited-memory QN algorithms as well.
We now focus on two basic forms of the diagonal update.
2.1 Updating D
Consider the variational problem:

    (P): minimize $\|D_+ - D\|_F$
         s.t. $s^T D_+ s = s^T y$

where $s \neq 0$, $s^T y > 0$ and $D > 0$. Let

    $D_+ = D + \Delta, \qquad a = s^T D s, \qquad b = s^T y.$    (2)

Then the variational problem can be stated as

    (P): minimize $\|\Delta\|_F$
         s.t. $s^T \Delta s = b - a$.
In (P), the objective is strictly convex and the feasible set is convex. Therefore, there exists a unique solution to (P). Its Lagrangian function is

    $L(\Delta, \lambda) = \frac{1}{2}\, tr(\Delta^2) + \lambda\, (s^T \Delta s + a - b)$

where $\lambda$ is the Lagrange multiplier associated with the constraint and tr denotes the trace operator. Differentiating with respect to $\Delta$ via the matrix calculus [6], or differentiating with respect to the diagonal elements, setting the result to zero and invoking the constraint $s^T \Delta s = b - a$, we have

    $\Delta = \frac{b - a}{tr(E^2)}\, E, \qquad E = diag[s_1^2, s_2^2, \ldots, s_n^2]$    (3)
where $s_i$ is the $i$-th element of $s$. When $b < a$, note that the resulting $D_+$ is not necessarily positive definite. For algorithmic purposes, a safeguard is needed to ensure $D_+ > 0$. This can easily be achieved by checking the condition

    $\forall i: \quad d_i + \frac{(b - a)\, s_i^2}{tr(E^2)} > 0$    (4)

where $d_i$ is the $i$-th diagonal element of $D$. When the above is violated, we can retain the previous diagonal matrix by setting $D_+ = D$, or use some simple scheme to generate $D_+$ such that $D_+ > 0$. An example is to switch to the basic Oren-Luenberger scaling matrix (used in the L-BFGS algorithm), namely,

    $D_+ = (s^T y / s^T s)\, I$
where $I$ is the identity matrix. It is useful to note that this is precisely the matrix that would be obtained from the QC relation with the further restriction that the diagonal matrix be a scalar multiple of the identity matrix, i.e., instead of a general diagonal matrix one uses a matrix whose diagonal elements are all equal.
An algorithm incorporating these details will be considered in the section
on numerical results later in this paper.
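For concreteness, here is a minimal sketch of the resulting update rule (ours, in Python with NumPy; the name qc_update and the vector representation of $D$ are our conventions, not the authors' Fortran 90 code):

import numpy as np

def qc_update(d, s, y):
    # Diagonal QC update (3) with safeguard (4); d = diagonal of D (> 0).
    a = s @ (d * s)                      # a = s^T D s
    b = s @ y                            # b = s^T y, assumed positive
    e = s * s                            # diagonal of E = diag(s_i^2)
    tr_E2 = np.sum(e * e)                # tr(E^2) = sum_i s_i^4
    d_plus = d + (b - a) * e / tr_E2     # update (3)
    if np.all(d_plus > 0):               # safeguard (4): keep D_+ > 0
        return d_plus
    return np.full_like(d, b / (s @ s))  # Oren-Luenberger fallback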
2.2 Updating $D^{1/2}$
A more efficient way of preserving positive definiteness through diagonal updating is to update the Cholesky¹ factor $D^{1/2}$ to the corresponding $D_+^{1/2}$ with

    $D_+^{1/2} = D^{1/2} + \Omega$

and

    (FP): minimize $\|\Omega\|_F$
          s.t. $s^T (D^{1/2} + \Omega)^2 s = s^T y > 0$.

¹ 'Square-root' would be a more precise choice of terminology, but we use 'Cholesky' to retain the connection with the updating of QN triangular factors of full matrices.
The foregoing variational problem is well posed, being defined over the closed set of matrices for which the corresponding $D_+$ is positive semidefinite. Further, analogously to the full-matrix case in standard QN updating, it always has a viable solution for which $D_+$ is positive definite. This is stated in the following theorem:
Theorem 2.2.1 Let $D > 0$ and $s \neq 0$, and let $a$, $b$, $E$ be defined as in (2) and (3). There is a unique global solution of (FP), which is given by

    $\Omega = 0$ if $b = a$; $\qquad \Omega = -\mu E (I + \mu E)^{-1} D^{1/2}$ if $b \neq a$,    (5)

where $\mu$ is the largest solution of the nonlinear equation $F(\mu) = b$, with

    $F(\mu) \;\stackrel{def}{=}\; s^T (D (I + \mu E)^{-2}) s \;=\; \sum_{i:\, s_i \neq 0} \frac{d_i s_i^2}{(1 + \mu s_i^2)^2}.$    (6)
Proof. In the course of the proof we will see that every expression above is well defined. First, by some simple transformations, problem (FP) is equivalent to

    (FP): minimize $\|w\|_2^2 = w^T w$
          s.t. $w^T E w + 2 w^T E r = b - a$

where

    $r = [d_1^{1/2}, d_2^{1/2}, \ldots, d_n^{1/2}]^T.$

When $b = a$, the global optimal solution is obviously $w = 0$, and hence $\Omega = 0$, which implies that $D_+ = D$ is positive definite. In the following discussion we assume that $b \neq a$. Problem (FP) has a strictly convex objective, with the Hessian $E$ of the constraint being positive semidefinite. By a theorem in [8] concerning a quadratic objective with a quadratic constraint, (FP) has a global solution. Differentiating its Lagrangian

    $L(w, \mu) = w^T w + \mu\, (w^T E w + 2 w^T E r + a - b)$

with respect to $w$, where $\mu$ is the Lagrange multiplier, and setting the result to zero, we have

    $w_i = -\frac{\mu s_i^2 d_i^{1/2}}{1 + \mu s_i^2}, \qquad i = 1, \ldots, n.$
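In practice $\mu$ is computed numerically from (6). Since $\Omega = -\mu E (I + \mu E)^{-1} D^{1/2}$ gives $D_+^{1/2} = (I + \mu E)^{-1} D^{1/2}$, the whole update reduces to $d_{+,i} = d_i / (1 + \mu s_i^2)^2$. The sketch below (ours, in Python; the bracketing constant and iteration count are implementation choices, not from the paper) exploits the fact that $F$ decreases strictly from $+\infty$ to $0$ on $(-1/\max_i s_i^2,\, \infty)$, so for $b > 0$ the largest root of $F(\mu) = b$ is the unique root in that interval.

import numpy as np

def solve_mu(d, s, b):
    # Largest root of F(mu) = b, with F as in (6). F decreases strictly
    # from +inf to 0 on (-1/max_i s_i^2, +inf), so for b > 0 the root is
    # unique there; bracket it and bisect.
    s2 = s * s
    nz = s2 > 0
    F = lambda mu: np.sum(d[nz] * s2[nz] / (1.0 + mu * s2[nz]) ** 2)
    lo = -(1.0 - 1e-10) / s2[nz].max()   # just inside the pole of F
    hi = 1.0
    while F(hi) > b:                     # expand until F(hi) <= b
        hi *= 2.0
    for _ in range(200):                 # plain bisection
        mid = 0.5 * (lo + hi)
        if F(mid) > b:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def cholesky_form_update(d, s, y):
    # Theorem 2.2.1: D_+^{1/2} = (I + mu E)^{-1} D^{1/2}, so that
    # d_+i = d_i / (1 + mu s_i^2)^2, positive by construction.
    a, b = s @ (d * s), s @ y
    if np.isclose(a, b):
        return d.copy()                  # Omega = 0 (case b = a in (5))
    mu = solve_mu(d, s, b)
    return d / (1.0 + mu * s * s) ** 2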
2.3 Discussion
and defining the vector $z$ to be the diagonal elements of $D_+^{1/2}$, we can reexpress the foregoing variational problem as follows:

    minimize $-r^T z + \frac{1}{2} z^T z$
         s.t. $z^T E z = b$

where $b > 0$. When $E$ is nonsingular (hence positive definite) and the equality in the constraint is replaced by an inequality, one obtains a standard trust-region subproblem in the metric defined by $E$. The QC subproblem can be viewed as a simple but nonstandard trust-region problem³. Thus many of the techniques used to solve trust-region subproblems (see, in particular, Rendl and Wolkowicz [15]) can be suitably adapted to solving the QC subproblem more efficiently. Our purpose in the present article is to explore the QC approach at a root level; further refinements, including a comparison with recent nonmonotone Cauchy-based algorithms (see Raydan [14]), will be considered in a subsequent paper.
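Writing out the first-order conditions of this reformulation makes the link with Theorem 2.2.1 explicit (a short check; the multiplier is scaled by one half for convenience):

    $\nabla_z \left[ -r^T z + \tfrac{1}{2} z^T z + \tfrac{\mu}{2} (z^T E z - b) \right] = -r + z + \mu E z = 0 \;\Longrightarrow\; z = (I + \mu E)^{-1} r, \qquad z_i = \frac{d_i^{1/2}}{1 + \mu s_i^2},$

so that $D_+^{1/2} = (I + \mu E)^{-1} D^{1/2}$, in agreement with (5), and the constraint $z^T E z = b$ reduces exactly to the nonlinear equation $F(\mu) = b$ of (6).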
3 Numerical Results
In this section we give some numerical results on the application of diagonal updating to the Cauchy algorithm. Diagonal updating can be used as a dynamic scaling, at each iteration, of the steepest-descent direction in the Cauchy algorithm. The Cauchy direction is ideal when the contours of the objective $f$ to be minimized are hyperspheres. For a general function which is not quadratic, a preconditioning can be used to make the transformed contours closer to hyperspheres, so that the efficiency of the Cauchy direction in the transformed space is enhanced; see [11]. Diagonal updating is a nonfixed preconditioning which incorporates updated curvature information, and its hereditary positive definiteness is naturally maintained when the Cholesky factor is updated, as shown in the previous section. The expectation that the Cauchy method will be significantly accelerated by diagonal updating is supported by our numerical results.
³ Simple because only diagonal matrices are involved, so issues associated with the cost of matrix inversions or factorizations of a more general quadratic objective do not arise; also because all components of $r$ are positive and the eigenvectors associated with the Hessian of the objective (or any diagonal rescaling of it) lie along the coordinate axes, which leads to theoretical and algorithmic simplifications. In particular, $r$ has a nonzero component in the eigenspace associated with the smallest eigenvalue. Nonstandard because $z = r$ is not an acceptable solution of the QC problem when $r^T E r < b$, as it would be in the usual inequality-constrained trust-region problem; also because $E$ can be singular, in which case the corresponding components of $z$ are separable and can be set to the components of $r$.
Table 1: Test problems from [7].

Number   Problem Name
1        Helical valley function
2        Biggs EXP6 function
3        Gaussian function
4        Powell badly scaled function
5        Box 3-dimensional function
6        Variably dimensioned function
7        Watson function
8        Penalty function I
9        Penalty function II
10       Brown badly scaled function
11       Brown and Dennis function
12       Gulf research and development function
13       Trigonometric function
14       Extended Rosenbrock function
15       Extended Powell function
16       Beale function
17       Wood function
18       Chebyquad function
Our source code is written in Fortran 90, using double-precision arithmetic, and was run on an ULTRIX DEC Alpha workstation. The numerical experiments were done within the MINPACK-1 testing environment [10]. Test functions are the standard unconstrained problems collected in [7], which we identify by the numbering in Table 1.
We employ a line search (rather than the simpler stepsize choices mentioned in the quotation at the beginning of this paper) and use a routine of Moré and Thuente [9] based on cubic interpolation, which satisfies the Wolfe conditions at the trial point $x_+ = x + \lambda d$:

    $f(x_+) \le f(x) + \alpha \lambda\, g^T d$    (8)
    $g(x_+)^T d \ge \beta\, g^T d$    (9)

where the line search parameters are chosen as in [4]: $\alpha = 10^{-4}$, $\beta = 0.9$. The stopping criterion is [4]:

    $\|g(x)\| \le 10^{-5} \max\{1.0, \|x\|\}$    (10)
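Stated in code, the acceptance test (8)-(9) might look as follows (a sketch under our reconstruction above; wolfe_ok and the callables f and grad are illustrative names, not the interface of the routine in [9]):

import numpy as np

def wolfe_ok(f, grad, x, d, lam, alpha=1e-4, beta=0.9):
    # Check conditions (8)-(9) at the trial point x_+ = x + lam * d.
    x_plus = x + lam * d
    g_d = grad(x) @ d                           # directional derivative g^T d
    sufficient_decrease = f(x_plus) <= f(x) + alpha * lam * g_d   # (8)
    curvature = grad(x_plus) @ d >= beta * g_d                    # (9)
    return sufficient_decrease and curvature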
The methods tested include:
1. Standard Cauchy algorithm of the simple form $d = -g$ at the $k$-th iteration.
2. Cauchy-OL: Cauchy with Oren-Luenberger scaling, which scales the search direction with the well-known Oren-Luenberger scaling [5]:

       $d = -\frac{y^T s}{y^T y}\, g$

   for all iterations after the first step, where the initial steepest-descent search is employed.
3. Cauchy-DU: Cauchy algorithm with diagonal updating; i.e., at the current iterate the search direction $d$ is obtained by scaling the steepest-descent direction as

       $d = -U_+ g$

   where $U_+$ is updated from $U = D^{-1}$, which corresponds to the complementary diagonal updating:

       (CP): minimize $\|U_+ - U\|_F$
             s.t. $y^T U_+ y = y^T s$.

   For details about the complementarity of (P) and (CP), see [16]. The updated diagonal matrix is given by

       $U_+ = U + \frac{b - c}{tr(G^2)}\, G$

   where

       $c = y^T U y, \qquad G = diag[y_1^2, y_2^2, \ldots, y_n^2],$

   with the following safeguarding policy: the above update is used only when the condition

       $\forall i: \quad u_i + \frac{(b - c)\, y_i^2}{tr(G^2)} > 0$

   is satisfied ($u_i$ are the diagonal elements of $U$). Otherwise the constant diagonal matrix used as the basic matrix in the L-BFGS algorithm is taken, i.e.,

       $U_+ = (y^T s / y^T y)\, I.$    (11)

   For algorithmic considerations regarding L-BFGS, see [4] and [16]. A sketch of this update in code is given after this list.
4. Cauchy-Cholesky: Cauchy algorithm with diagonal updating of the Cholesky factor $U^{1/2}$, where, again considering the complementary problem, we have

       $U_+ = U$ if $b = c$; $\qquad U_+ = U (I + \mu G)^{-2}$ if $b \neq c$,    (12)

   with $\mu$ determined from the nonlinear equation analogous to (6), with $U$, $y$, $G$ in place of $D$, $s$, $E$.
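The following minimal sketch (ours, in Python; cp_update is our name) implements the Cauchy-DU update (CP) with its safeguard and the constant fallback (11); the Cauchy-Cholesky variant (12) is analogous, with $\mu$ obtained as in the sketch following Theorem 2.2.1.

import numpy as np

def cp_update(u, s, y):
    # Complementary diagonal update (CP) with safeguard; u = diag of U.
    b = s @ y                            # b = y^T s
    c = y @ (u * y)                      # c = y^T U y
    g = y * y                            # diagonal of G = diag(y_i^2)
    u_plus = u + (b - c) * g / np.sum(g * g)
    if np.all(u_plus > 0):               # safeguard, mirror of (4)
        return u_plus
    return np.full_like(u, b / (y @ y))  # constant fallback (11)

# A Cauchy-DU step then takes the direction d = -(u_plus * grad) and
# applies the More-Thuente line search to determine the steplength.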
diagonal updating strategy with the safeguard policy for positive definiteness. The Cauchy-DU method is competitive with the Cholesky form because the former can be implemented very simply, whereas the latter incurs an overhead for computing the optimal $\mu$ from a highly nonlinear one-dimensional equation. But from a wider perspective, the Cholesky form of diagonal updating is very successful in accelerating the Cauchy algorithm, and the expense of solving for $\mu$ is a relatively minor portion of the total cost of the algorithm. The Cauchy-OL method is competitive owing to its simple formulation, and indeed there are some cases in the table for which it requires fewer iterations and function/gradient calls than diagonal updating. But it is clear that globally the Cauchy-Cholesky method is best. In comparison, the results for Oren-Luenberger scaling fluctuate greatly. This variability of performance has already been observed in the literature, including [5], where even for the simple BFGS algorithm the Oren-Luenberger scaling for the Hessian matrix, namely $(s^T y / s^T s) I$, does not consistently reduce the iteration and function/gradient counts vis-a-vis the pure BFGS method. Hence, the above results show that diagonal updating could be a more stable scaling in practice.
4 Conclusion
Diagonal updating is a fascinating theory whose appeal arises from its simplicity, its elegant solutions, and the similarity of the variational metrics employed to those of quasi-Newton methods, e.g., BFGS, SR1 and LPD. A thorough exploration of both theoretical and practical aspects is ongoing, and further results, in particular on the use of diagonal updating within the L-BFGS algorithm, can be found in [16] and [17].
References
[1] Bertsekas, D.P. (1995), Nonlinear Programming, Athena Scientific, Belmont, Massachusetts.
[2] Dennis, J.E. and Schnabel, R.B. (1983), Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, New Jersey.
[3] Dennis, J.E. and Wolkowicz, H. (1990), Sizing and Least Change Secant Updates, CORR Report 90-02, Department of Combinatorics and Optimization, University of Waterloo, Ontario, Canada.
[4] Liu, D.C. and Nocedal, J. (1989), On the limited memory BFGS method for large scale optimization, Mathematical Programming B, 45, 503-528.
[5] Luenberger, D.G. (1984), Linear and Nonlinear Programming, second edition, Addison-Wesley.
[6] Magnus, J.R. and Neudecker, H. (1988), Matrix Differential Calculus with Applications in Statistics and Econometrics, Wiley, Chichester.
[7] Moré, J.J., Garbow, B.S. and Hillstrom, K.E. (1981), Testing unconstrained optimization software, ACM Transactions on Mathematical Software, 7, 17-41.
[8] Moré, J.J. (1993), Generalizations of the trust region problem, Optimization Methods and Software, 2, 189-209.
[9] Moré, J.J. and Thuente, D.J. (1994), Line search algorithms with guaranteed sufficient decrease, ACM Transactions on Mathematical Software, 20, 286-307.
[10] Moré, J.J. and Averick, B.A. (1994), User Guide for the MINPACK-2 Test Problem Collection, Technical Memorandum ANL/MCS-TM-157, Argonne National Laboratory.
[11] Nazareth, J.L. (1994), The Newton-Cauchy Framework: A Unified Approach to Unconstrained Nonlinear Minimization, LNCS 769, Springer-Verlag, Berlin.
[12] Nazareth, J.L. (1995), If quasi-Newton then why not quasi-Cauchy?, SIAG/OPT Views-and-News, 6, 11-14.
[13] Nazareth, J.L. (1995), The Quasi-Cauchy Method: A Stepping Stone to Derivative-Free Algorithms, TR 95-3, Department of Pure and Applied Mathematics, Washington State University.
[14] Raydan, M. (1993), On the Barzilai and Borwein choice of steplength for the gradient method, IMA Journal of Numerical Analysis, 13, 321-326.
[15] Rendl, F. and Wolkowicz, H. (1997), A semidefinite framework for trust region subproblems with applications to large scale minimization, Mathematical Programming, 77, 273-300.
[16] Zhu, M. (1997), Limited Memory BFGS Algorithms with Diagonal Updating, M.S. project, School of Electrical Engineering and Computer Science, Washington State University.
[17] Zhu, M. (1997), Techniques for Large-Scale Nonlinear Optimization: Principles and Practice, Ph.D. dissertation, Department of Pure and Applied Mathematics, Washington State University.