
A BRIEF INTRODUCTION TO THE CONJUGATE GRADIENT METHOD


Runar Heggelien Refsnæs, Fall 2009
Introduction
The conjugate gradient (CG) method is one of the most popular and well-known
iterative techniques for solving sparse symmetric positive definite (SPD)
systems of linear equations. It was originally developed as a direct method,
but became popular for its properties as an iterative method, especially
following the development of sophisticated preconditioning techniques.

The intention behind this small note is to present the motivation behind
the CG-method, followed by some details regarding the algorithm and its
implementation. It is important to note that the mathematical derivation of
the method is intentionally incomplete and missing important details. This
is done in order to keep the note small and focus on the more intuitive
characteristics of the CG-method, making it more suitable for a target
audience with a limited mathematical background.

The Quadratic Test Function


A natural starting point in deriving the conjugate gradient method is to
look at the minimization of the quadratic test function

φ(x) = (1/2) xT Ax − xT b

with b, x ∈ R^n and A ∈ R^(n×n), where A is assumed to be SPD.
Note: A symmetric matrix A is SPD (symmetric positive definite) if xT Ax > 0
for all x ≠ 0, or equivalently if all the eigenvalues of A are positive.

The minimizer x∗ of the function φ is given as the point where the gradient
of the function is equal to zero. Direct calculation gives

∇φ(x∗) = Ax∗ − b = 0

or
Ax∗ = b
This shows that the same x∗ that minimizes φ(x) also serves as the solution
to the linear system of equations Ax = b, motivating us to find a method
capable of minimizing φ instead of looking at a direct solver of the algebraic
system.
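To make this concrete, here is a minimal NumPy sketch (the 2×2 SPD matrix and
right-hand side are arbitrary example values) checking that the solution of
Ax = b is a stationary point of φ and that a nearby point has a larger value
of φ:

import numpy as np

# A small SPD matrix and right-hand side (arbitrary example values)
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def phi(x):
    # The quadratic test function phi(x) = 1/2 x^T A x - x^T b
    return 0.5 * x @ A @ x - x @ b

x_star = np.linalg.solve(A, b)    # solution of A x = b
grad = A @ x_star - b             # gradient of phi at x_star
print(grad)                       # ~ [0, 0]: x_star is a stationary point
print(phi(x_star) < phi(x_star + np.array([0.1, -0.2])))  # True: phi grows away from x_star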

The uniqueness of our solution is guaranteed by the SPD condition. To
see this, recall what you learned about unique minima and maxima in
high school: a minimum point is unique and global if the second derivative
is larger than zero at every point of the domain. In multiple dimensions
this corresponds to requiring that the Hessian matrix H = ∇2φ is positive
definite.

For the quadratic test function this means that ∇2φ(x) = A. But since A is
SPD, xT Ax > 0 for all x ≠ 0, and hence our solution x∗ is unique. This
uniqueness can also be seen intuitively by considering the strict convexity
of the function.

Line Search Methods


In the previous section we concluded that minimizing the quadratic test
function is equivalent to solving Ax = b. We now need a strategy for solving
our optimization problem.

The line search methods are a large family of iterative optimization methods
where the iteration is given by

xk+1 = xk + αk pk

The idea is to choose an initial position x0, and in each step walk along a
direction (a line) so that φ(xk+1) < φ(xk). The different methods have various
strategies for how they choose the search direction pk and the step length αk.

The steepest descent method is perhaps the most intuitive and basic line
search method. We remember that the gradient of a function is a vector giving
the direction along which the function increases the most. The method of
steepest descent is based on the strategy that at any given point x, the
search direction given by the negative gradient of the function φ(x) is the
direction of steepest descent. In other words, the negative gradient direction
is the locally optimal search direction. We have already found the gradient
of φ(x) to be Ax − b. We also call this the residual r of the system.

We now have our direction, but we still need to know how far to walk along
it. The natural choice is to walk until we no longer descend, and an
expression for the optimal step length αk is easily found to be

αk = (∇φ(xk)T ∇φ(xk)) / (∇φ(xk)T A∇φ(xk)) = (rk T rk) / (rk T Ark)

by inserting the expression for the next step, xk+1 = xk − α∇φ(xk), into the
quadratic test function and minimizing with respect to α. We repeat this for
every step, taking the gradient of φ at the next point xk+1 and finding a
new step length. With this exact line search, successive search directions
become orthogonal to each other, as can be seen in figure 1.
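As an illustration, a minimal NumPy sketch of steepest descent with this exact
step length; the small SPD system, the iteration cap and the tolerance are all
arbitrary choices:

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # arbitrary small SPD example
b = np.array([1.0, 2.0])

x = np.zeros(2)                 # initial guess x0 = 0
for _ in range(50):
    r = A @ x - b               # gradient of phi at x (the residual)
    if np.linalg.norm(r) < 1e-10:
        break
    alpha = (r @ r) / (r @ A @ r)   # exact line search step length
    x = x - alpha * r               # walk along the negative gradient
print(x, np.linalg.solve(A, b))     # the two should agree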

These orthogonal search directions mean that the method zigzags towards the
solution. While it can be shown that the steepest descent method always
converges for an SPD problem like ours, its zigzag behaviour is clearly not
the optimal and fastest path towards the minimum. The problem is simply that
each successive step is not different enough from the previous ones. We need
a method that somehow uses the information from the previous steps in order
to avoid running back and forth across the valley of our contour plots.

Figure 1: The contour plot of a function, with the steps of the steepest
descent method in red

The Conjugate Gradient Method


We now turn our attention to the A-conjugate direction methods. A-conjugacy
means that a set of nonzero vectors {p0, p1, ..., pn−1} is conjugate with
respect to the SPD matrix A. That is

pi T Apj = 0 for all i ≠ j

A set of n such vectors is linearly independent and hence spans the whole
space R^n. The reason why such A-conjugate sets are important is that we
can minimize our quadratic function φ in n steps by successively minimizing
it along each of the directions. Since the set of A-conjugate vectors acts as
a basis for R^n, we can express the difference between the exact solution x∗
and our first guess x0 as a linear combination of the conjugate vectors:

x∗ − x0 = σ0 p0 + σ1 p1 + ... + σn−1 pn−1

By utilizing our conjugacy property, we find that the coefficients σk are the
same as the step lengths αk that minimize the quadratic function φ along
xk + αk pk, and hence:

x∗ = x0 + α0 p0 + α1 p1 + ... + αn−1 pn−1

You can think of this as gradually building the solution along the dimensions
of our solution space. In fact, for diagonal matrices the conjugate search
vectors coincide with the coordinate axes. At each step k, xk is the exact
solution x∗ projected into the space spanned by the first k search vectors
(starting from x0).

All of this sounds great, but so far we have just assumed that the set of
A-conjugate search directions exists. In practice we need a way to create
it. There are several ways to choose such a set. The eigenvectors of A form
an A-conjugate set, but finding the eigenvectors is computationally expensive,
so we had better find another strategy. A second alternative is to modify the
usual Gram-Schmidt orthogonalization process. This is also not optimal, as it
requires storing all the previous directions.
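As a quick numerical check of the first alternative, the eigenvectors of a
small SPD matrix (an arbitrary example below) are indeed A-conjugate:

import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])          # arbitrary small SPD example

eigvals, V = np.linalg.eigh(A)            # columns of V are the eigenvectors
M = V.T @ A @ V                           # entry (i, j) is v_i^T A v_j
print(np.allclose(M, np.diag(np.diag(M))))   # True: off-diagonal entries vanish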

It turns out there is a conjugate direction method with the very nice property
that each new conjugate vector pk can be computed using only the previous
vector pk−1. Without knowing the other previous vectors, the new vector is
still automatically conjugate to them. The method works its magic by choosing
each new direction pk as a linear combination of the negative residual −rk
and the previous search vector pk−1. We remember that for the quadratic
function the negative residual is equal to the steepest descent or negative
gradient direction. Hence the name of the method: the conjugate gradient
method.

The formula for the new search direction becomes

pk = −rk + βk pk−1

where βk is found by imposing the condition pk−1 T Apk = 0, and is given as

βk = (rk T Apk−1) / (pk−1 T Apk−1) = (rk T rk) / (rk−1 T rk−1)

A comparison of the conjugate gradient method and the steepest descent
method can be seen in figure 2.

Algorithm and Implementation


We are finally ready to write up the algorithm for the conjugate gradient
method. While we have not covered all the details of its derivation, we should
still be able to recognize most of the steps of the iteration and what they do.

Figure 2: The contour plot of a function, with the steps of the steepest
descent method in red and of the conjugate gradient method in green

The conjugate gradient algorithm


Compute r0 = Ax0 − b, p0 = −r0
For k = 0, 1, 2, ... until convergence
    αk = (rk T rk) / (pk T Apk)
    xk+1 = xk + αk pk
    rk+1 = rk + αk Apk
    βk = (rk+1 T rk+1) / (rk T rk)
    pk+1 = −rk+1 + βk pk
End

The first step is to find the initial residual, whose negative is also the
first search direction. If the initial guess x0 is zero, then r0 = −b and
p0 = b. Inside the for-loop we recognize the first line as the calculation
of the step length. In the second line we update our solution by adding our
step. Then in line three we update our residual, and lines four and five give
us the new search direction.

For an actual implementation of the algorithm above, there are several things
to consider. First we notice that both rk T rk and Apk are each used twice in
every iteration. Also, rk+1 T rk+1 is simply rk T rk in the next iteration.
There is no sense in doing the same calculations twice, so one should store
these results the first time they are calculated and reuse them in the
calculations that follow.
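Putting the algorithm and these observations together, a minimal NumPy
implementation might look as follows; the stopping tolerance, the default
iteration cap of n, and the small test system are arbitrary choices, and
rk T rk and Apk are stored and reused as discussed:

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.copy()
    r = A @ x - b                 # initial residual r0 = A x0 - b
    p = -r                        # first search direction p0 = -r0
    rr = r @ r                    # store r_k^T r_k for reuse
    if max_iter is None:
        max_iter = n              # exact solution in at most n steps (in exact arithmetic)
    for k in range(max_iter):
        if np.sqrt(rr) < tol:     # 'until convergence': residual small enough
            break
        Ap = A @ p                # compute A p_k once and reuse it
        alpha = rr / (p @ Ap)     # step length
        x = x + alpha * p         # update the solution
        r = r + alpha * Ap        # update the residual
        rr_new = r @ r
        beta = rr_new / rr        # coefficient for the new direction
        p = -r + beta * p         # new A-conjugate search direction
        rr = rr_new
    return x

# Arbitrary small SPD test system
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))   # the two should agree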

Another thing to consider is the matrix-vector product Apk. When doing
numerical analysis of partial differential equations, our matrix typically
comes from a finite difference discretization or from the stiffness assembly
of the finite element method. This means that A can be very large, but also
very sparse (few non-zero elements) and often with a clear structure (at least
for the finite difference methods), such as the block tri-diagonal banded
matrix from the 2D Poisson problem.

A common trick is to avoid constructing and storing the actual matrix, and
instead perform the equivalent calculations explicitly for each node or
element in the problem. This way we can save both valuable memory and floating
point operations. Using specific knowledge of the problem you are trying to
solve can often be a key factor in achieving good performance, and in general,
numerical solvers that do this will outperform more general solvers.
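As a sketch of this idea, assume the matrix comes from the standard 5-point
stencil for the 2D Poisson problem on an N × N interior grid with zero
boundary values. The product Apk can then be computed without ever forming A
(the function name and the flat vector layout are just illustrative choices):

import numpy as np

def apply_poisson_2d(p, N):
    # Matrix-free product A p for the 5-point stencil on an N x N interior grid,
    # assuming zero Dirichlet boundary values; p is stored as a flat vector.
    u = p.reshape(N, N)
    out = 4.0 * u
    out[1:, :]  -= u[:-1, :]      # neighbour above
    out[:-1, :] -= u[1:, :]       # neighbour below
    out[:, 1:]  -= u[:, :-1]      # neighbour to the left
    out[:, :-1] -= u[:, 1:]       # neighbour to the right
    return out.reshape(N * N)

N = 4
p = np.ones(N * N)
print(apply_poisson_2d(p, N))     # corner, edge and interior nodes give 2, 1 and 0

In the CG loop above, a function like this would simply replace the line
Ap = A @ p.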

Finally, the 'until convergence' condition in our for-loop is not very
precise. As mentioned, CG will theoretically reach the exact solution in no
more than n steps, where n is the size of the matrix A. However, there is no
point in doing all those iterations if our solution is already within a
tolerable margin of error. Furthermore, when doing numerical analysis of PDEs
it is important to remember that the exact solution given by the CG-method is
only the exact solution of our discretized system, not the exact solution of
our PDE. This means that there is no point in running the CG-method towards
an error of zero (technically this is not even possible, due to the round-off
error of floating point numbers); instead one should try to balance the
discretization error and the error in the CG-solver.

Preconditioning
The CG-method is an excellent algorithm, being both extremely fast and very
easy to implement. However, numerical experiments show that quite often the
pure CG-method simply does not converge as fast as we would like. The reason
for this slow convergence is that the system is ill-conditioned. An
ill-conditioned system is a system with a high condition number κ. The
condition number for a system like ours can be expressed as κ = λmax / λmin.
It can be shown that for the conjugate gradient method, the number of
iterations required to reach convergence grows with the square root of the
condition number: Nit ∼ O(√κ).
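For a small matrix, κ is easy to inspect directly, for example:

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                 # arbitrary small SPD example
eigvals = np.linalg.eigvalsh(A)
kappa = eigvals.max() / eigvals.min()      # condition number lambda_max / lambda_min
print(kappa, np.linalg.cond(A))            # the two agree for SPD matrices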

In the introduction of this document it was briefly mentioned that the
CG-method owes much of its popularity to good preconditioning techniques.
Preconditioning is simply a way of manipulating the original system so that
its condition number, and hence its rate of convergence, is improved. The
preconditioner itself is nothing more than a matrix P such that P^(-1)A has a
better condition number. The obvious best choice is P = A, which actually
solves the problem, but of course if we had A^(-1) we wouldn't really be
interested in preconditioning the CG-method in the first place. At the
opposite end we find the cheapest and least effective preconditioner, P = I,
which does absolutely nothing. In practice one tries to find something in
between.
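As an illustrative middle ground, here is a minimal sketch of CG with the
simple Jacobi (diagonal) preconditioner P = diag(A), written with the same
residual convention r = Ax − b as before; the preconditioned update formulas
follow the standard textbook form and are not derived in this note:

import numpy as np

def preconditioned_cg(A, b, tol=1e-10, max_iter=None):
    # CG with a Jacobi (diagonal) preconditioner P = diag(A).
    n = len(b)
    P_inv = 1.0 / np.diag(A)        # applying P^-1 is just a cheap elementwise scaling
    x = np.zeros(n)
    r = A @ x - b
    z = P_inv * r                   # preconditioned residual
    p = -z
    rz = r @ z
    if max_iter is None:
        max_iter = n
    for k in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r + alpha * Ap
        z = P_inv * r
        rz_new = r @ z
        beta = rz_new / rz
        p = -z + beta * p
        rz = rz_new
    return x

# Arbitrary small SPD test system
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(preconditioned_cg(A, b), np.linalg.solve(A, b))   # the two should agree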

There is much to be said about preconditioning. Finding a good preconditioner
can be a very difficult task, but it can likewise lead to very impressive
convergence results. A lot of the time the only real way of deciding what
works best is by doing experiments.

Summary
We started our discussion of the conjugate gradient method by noticing that
a linear system of equations Ax = b could be written as a minimization
problem for the quadratic test function φ(x) = (1/2) xT Ax − xT b. We then
introduced the line search methods as an iterative approach, giving each new
step as xk+1 = xk + αk pk. Our first attempt at using line search was the
steepest descent method, but this gave us slow convergence because of its
zigzag movement. Changing our focus to the A-conjugate direction methods,
we found that by using information from the previous steps, we could get the
exact solution in n or fewer steps. In our desire to build the set of
A-conjugate directions as cheaply as possible, we finally ended up with the
conjugate gradient method. Still hungry for better performance, we then
introduced preconditioning as a way to get even faster convergence.
