Lecture 21
Lecturer: Jonathan Kelner Scribe: Yan Zhang
system
\[
\begin{pmatrix} 100 & -1 & -4 \\ 100 & 100 & 3 \\ 100 & 100 & 100 \end{pmatrix} x
\;=\;
\begin{pmatrix} 100 \\ 200 \\ 300 \end{pmatrix}.
\]
Again, while computing the exact answer would take some work, we can tell at a glance that the solution
should be close to (1, 1, 1)T . In this case, the above-diagonal entries are all small, and once we ignore these,
we can easily solve the remaining lower-triangular system. As before, we may now iteratively improve our
solution by finding the error and repeating the procedure, converging geometrically to the correct answer.
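As a concrete illustration, here is a minimal sketch of this iteration in Python/NumPy (the triangular solve is scipy.linalg.solve_triangular; the 25-iteration cap and the tolerance are arbitrary choices for this example):

import numpy as np
from scipy.linalg import solve_triangular

# The example system above: a "large" lower-triangular part L plus a "small"
# strictly upper-triangular remainder S, so that A = L + S.
A = np.array([[100., -1., -4.],
              [100., 100., 3.],
              [100., 100., 100.]])
b = np.array([100., 200., 300.])

L = np.tril(A)                  # keep the large lower-triangular part
x = np.zeros(3)                 # start from x_0 = 0

for k in range(25):
    r = b - A @ x                                   # current residual
    if np.linalg.norm(r) < 1e-12 * np.linalg.norm(b):
        break
    x = x + solve_triangular(L, r, lower=True)      # x_{k+1} = x_k + L^{-1} r_k

print(x)    # converges geometrically to a vector near (1, 1, 1)

Each pass costs only a triangular solve and a matrix-vector product (O(n^2) for dense matrices), rather than the O(n^3) of a direct solve, which is what makes this kind of iteration attractive.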
2.4 Analysis
Of course, for this method to be useful, we need to know that our iterations do actually improve our estimate.
We would also like a bound on the improvement at each stage so that we know when to stop. To obtain
these results, we need to make precise the notions of L and S being “large” and “small.”
Consider the product
\[
L^{-1} A = L^{-1}(L + S) = I + L^{-1} S.
\]
This gives us some intuition that $L^{-1}$ should be a good approximation of $A^{-1}$ when $L^{-1}S$ is “small” compared to the identity matrix $I$. Proceeding with the analysis, let $x$ denote the actual solution to $Ax = b$. Substituting $A = L + S$, we get $Lx = -Sx + b$, or equivalently,
\[
x = -L^{-1} S x + L^{-1} b.
\]
Define $M = -L^{-1} S$ and $z = L^{-1} b$ and observe that we can rewrite our iterative step as the recurrence
\begin{align*}
x_{k+1} &= x_k + L^{-1} r_k \\
&= x_k + L^{-1} (b - A x_k) \\
&= x_k + L^{-1} b - L^{-1} L x_k - L^{-1} S x_k \\
&= M x_k + z.
\end{align*}
Note that $x$ is a fixed point of this recurrence because it leaves zero residual: $r = b - Ax = 0$ by definition of $x$. In other words, $x = Mx + z$.
Now define the error at step $k$ to be $e_k = x_k - x$ and observe
\begin{align*}
e_{k+1} &= x_{k+1} - x \\
&= M x_k + z - x \\
&= M(x + e_k) + z - x \\
&= (Mx + z - x) + M e_k \\
&= M e_k.
\end{align*}
Iterating this relation gives $e_k = M^k e_0 = -M^k x$, since we could have started our iteration at $x_0 = 0$, in which case $e_0 = -x$. Thus, we can think of the error as behaving roughly like a matrix power. We pause here to make a definition.
Definition 1 The spectral radius $\rho$ of a symmetric matrix $M$ is the absolute value of its largest eigenvalue: $\rho = |\lambda_{\max}|$.
Observe that it follows from the definition that (in the symmetric case)
\[
\|M^k x\| \le \rho^k \|x\|,
\]
so if $\rho < 1$, then powers of $M$ converge exponentially to zero at a rate given by $\rho$. The same holds for general $M$ if we replace “eigenvalue” by “singular value.” Summarizing: if the spectral radius of $M = -L^{-1}S$ is less than $1$, then the errors $e_k = M^k e_0$ shrink geometrically, so the iteration converges to the true solution at a rate governed by $\rho$.
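As a quick check on the running example, here is a minimal NumPy sketch that computes the spectral radius (and, since this $M$ is not symmetric, also the largest singular value) of $M = -L^{-1}S$:

import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[100., -1., -4.],
              [100., 100., 3.],
              [100., 100., 100.]])
L = np.tril(A)                                  # the "large" lower-triangular part
S = A - L                                       # the "small" strictly upper-triangular part
M = -solve_triangular(L, S, lower=True)         # M = -L^{-1} S

rho = np.max(np.abs(np.linalg.eigvals(M)))      # spectral radius
sigma = np.linalg.norm(M, 2)                    # largest singular value
print(rho, sigma)    # roughly 0.03 and well below 1, so the iteration converges very quickly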
When $A$ is invertible, the system $Ax = b$ can be replaced by the normal equations $A^T A x = A^T b$, which have the same solution, and the matrix $A^T A$ is positive definite.
It is worth noting that while it is clear that the above reduction is theoretically valid, it is less clear whether such a reduction is practical. While the matrix product $A^T A$ has the advantage of positive definiteness, it raises several other concerns. For one, the matrix multiplication could be as expensive as solving the system in the first place and could destroy sparsity properties. Additionally, one might worry about the effects of replacing $A$ with $A^T A$ on convergence speed and condition number. As we shall see, however, the trick to getting around these issues is to never actually compute $A^T A$. Instead, since our algorithms will only use this matrix in the context of multiplying by a vector, we can perform such multiplications from right to left via two matrix-vector multiplications, thus avoiding the much more expensive matrix-matrix multiplication.
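For instance, a minimal sketch (with a hypothetical dense matrix A; the only point is the order of operations):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 2000))      # hypothetical matrix; could equally be large and sparse
v = rng.standard_normal(2000)

def normal_matvec(v):
    """Apply A^T A to a vector with two matrix-vector products, right to left."""
    return A.T @ (A @ v)                   # never forms the matrix A^T A itself

w = normal_matvec(v)

Iterative methods that only require a routine for multiplying by the matrix (for example, anything that accepts a scipy.sparse.linalg.LinearOperator) can work with such a function directly.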
4 Steepest Descent
4.1 Motivation
We now discuss the technique of steepest descent, also known as gradient descent, which is a general iterative
method for finding local minima of a function $f$. The idea is that given a current estimate $x_i$, the gradient $\nabla f(x_i)$—or more precisely, its negative—gives the direction in which $f$ is decreasing most rapidly. Hence, one would expect that taking a step in this direction should bring us closer to the minimum we seek. Keeping with our previous notation, we will let $x$ denote the actual minimizer, $x_i$ denote our $i$-th estimate, and
\begin{align}
e_i &= x_i - x, \tag{3} \\
r_i &= b - A x_i = -A e_i \tag{4}
\end{align}
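To connect this with the linear system (following the setup in [1], with $A$ symmetric positive definite), the function being minimized is the quadratic form
\[
f(x) = \tfrac{1}{2}\, x^T A x - b^T x, \qquad \nabla f(x) = Ax - b = -r,
\]
so the direction of steepest descent at $x_i$ is exactly the residual $r_i$, and the unique minimizer of $f$ is the solution of $Ax = b$.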
turns out just to equal the $(i+1)$-st residual, so our orthogonality relation reduces to the condition that successive residuals be orthogonal:
\[
r_{i+1}^T r_i = 0.
\]
Expanding out
\begin{align*}
r_{i+1} &= b - A x_{i+1} \\
&= b - A(x_i + \alpha_i r_i) \\
&= r_i - \alpha_i A r_i
\end{align*}
and substituting into the orthogonality condition gives $r_i^T r_i - \alpha_i\, r_i^T A r_i = 0$, and thus we have a formula for computing the step size along $r_i$ in terms of just $r_i$ itself.
Remark It is important to remember that the residuals $r_i = b - Ax_i$ measure the difference between our objective $b$ and the result $Ax_i$ of our approximation in “range space,” whereas the errors $e_i = x_i - x$ measure the difference between our approximation and the true solution in “domain space.” Thus, the previous orthogonality relation that holds for residual vectors does not mean that successive error vectors in the domain are orthogonal. It does, however, imply that successive differences between consecutive approximations are orthogonal, because these differences $x_{i+1} - x_i = \alpha_i r_i$ are proportional to the residuals.
4.2 Algorithm
To summarize the development thus far, we have obtained an iterative algorithm for steepest descent with
the following update step:
\begin{align}
r_i &= b - A x_i \tag{5} \\
\alpha_i &= \frac{r_i^T r_i}{r_i^T A r_i} \tag{6} \\
x_{i+1} &= x_i + \alpha_i r_i. \tag{7}
\end{align}
As an implementation note, we point out that the runtime of this algorithm is dominated by the two matrix-vector multiplications: $Ax_i$ (used to compute $r_i$) and $Ar_i$ (used in finding the step size $\alpha_i$). In fact, it is enough to do just the latter multiplication because, as we saw before, we can alternatively write
\[
r_{i+1} = r_i - \alpha_i A r_i,
\]
so that after the first step we can find residuals by reusing the computation $A r_i$, which was already done in the previous step. In practice, one needs to be careful about accumulation of roundoff errors, but this problem may be resolved by using (5) every once in a while to recalibrate.
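A minimal sketch of the resulting method in NumPy (the recalibration interval of 50 steps and the tolerance are arbitrary illustrative choices; A is assumed symmetric positive definite):

import numpy as np

def steepest_descent(A, b, x0=None, tol=1e-10, max_iter=10000, recompute_every=50):
    """Solve Ax = b for symmetric positive definite A by steepest descent."""
    x = np.zeros_like(b, dtype=float) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x                                   # (5), computed explicitly once
    for i in range(max_iter):
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        Ar = A @ r                                  # the single matvec per iteration
        alpha = (r @ r) / (r @ Ar)                  # (6)
        x = x + alpha * r                           # (7)
        if (i + 1) % recompute_every == 0:
            r = b - A @ x                           # recalibrate with (5) to control roundoff
        else:
            r = r - alpha * Ar                      # cheap residual update, reusing Ar
    return x

For example, steepest_descent(np.array([[3., 1.], [1., 2.]]), np.array([5., 5.])) returns a vector close to (1, 2).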
4.3 Analysis
Before dealing with general bounds on the rate of convergence of steepest descent, we make the preliminary
observation that in certain special cases, steepest descent converges to the exact solution in just one step.
More precisely, we make the following claim.
Claim 3 If the current error vector $e_i$ is an eigenvector of $A$, then the subsequent descent step moves directly to the correct answer. That is, $e_{i+1} = 0$.
Proof Apply (5)–(7) and the definition of the error (3) to find
\[
e_{i+1} = e_i + \frac{r_i^T r_i}{r_i^T A r_i}\, r_i, \tag{8}
\]
giving the change in the error from step $i$ to step $i+1$. In the case that $e_i$ is an eigenvector of $A$, say with eigenvalue $\lambda$, we have from (4) that $r_i = -A e_i = -\lambda e_i$, and hence (8) reduces to
\[
e_{i+1} = e_i + \frac{1}{\lambda}(-\lambda e_i) = 0.
\]
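A quick numerical check of this claim (with a hypothetical 3-by-3 symmetric positive definite matrix; we start with an error that is exactly an eigenvector of A):

import numpy as np

# Hypothetical symmetric positive definite test matrix and known solution.
A = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
x_true = np.array([1., 2., 3.])
b = A @ x_true

_, V = np.linalg.eigh(A)
x0 = x_true + V[:, 0]                    # the error e_0 is an eigenvector of A

r = b - A @ x0                           # one steepest-descent step, (5)-(7)
alpha = (r @ r) / (r @ (A @ r))
x1 = x0 + alpha * r

print(np.linalg.norm(x1 - x_true))       # essentially 0: the step lands on the solution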
Remark The above result tells us that steepest descent works instantly for error vectors in the eigenspaces
of A. These spaces have dimensions equal to the multiplicities of the corresponding eigenvalues, and in
particular, if A is a multiple of the identity, then steepest descent converges immediately from any starting
point. In general, we are not nearly so lucky and the eigenspaces each have dimension 1, but it is worth
noting that even in this case convergence is qualitatively different from that of our first iterative approach:
there are particular directions along which steepest descent works perfectly, whereas our first approach only
gave the correct answer in the trivial case in which the error was already zero.
In light of the preceding remark, we can expect that convergence should be faster along some directions
than others, and we will see that this is indeed the case. Before jumping headlong into the convergence
analysis, however, it is worthwhile to define a more convenient measure of error: the energy norm
\[
\|e\|_A = \sqrt{e^T A e}. \tag{9}
\]
Motivation for this definition will be provided in the next lecture; for now, we simply take for granted that it obeys the usual properties of a norm—and hence produces the same qualitative notion of convergence—but lends itself to cleaner convergence bounds. We will satisfy ourselves with simply stating the result
and focus on discussing its consequences, since the proof is just a computation using (8) and (9). A more
intuitive line of reasoning will also come in the next lecture.
The general result (10) is quite a mouthful, but fortunately we can understand its flavor just by looking at the two-dimensional case. In this case we have only two eigenvectors $v_1$ and $v_2$. Assume $\lambda_1 > \lambda_2$, so the condition number of $A$ is $\kappa = \lambda_1/\lambda_2$. Define $\mu = \xi_2/\xi_1$ to be the ratio of the components of $e_i$ along the two basis vectors. Then (10) simplifies to an expression depending only on $\kappa$ and $\mu$. Note that the form of this expression corroborates our preliminary observations. If the condition number $\kappa = 1$, convergence occurs instantly, and if $\kappa$ is close to $1$, convergence occurs quickly for all values of $\mu$. If $\kappa$ is large, convergence still occurs instantly if $\mu = 0$ or $\infty$, but now the rate of convergence varies substantially with $\mu$, with the worst case being when the component of $e_i$ along the eigenvector with the smaller eigenvalue exceeds the component along the other by a factor of $\kappa$, i.e., $\mu = \pm\kappa$ (see the lecture slides or [1] for helpful pictures).
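To see this dependence concretely, here is a minimal NumPy demo on a hypothetical diagonal 2-by-2 system with $\kappa = 100$, comparing an error aligned with an eigenvector against a badly mixed one:

import numpy as np

A = np.diag([100.0, 1.0])                 # eigenvalues 100 and 1, so kappa = 100
x_true = np.array([1.0, 1.0])
b = A @ x_true

def energy_error_norms(x0, steps=10):
    """Run plain steepest descent and record the energy norm of the error."""
    x, norms = np.array(x0, dtype=float), []
    for _ in range(steps):
        r = b - A @ x
        if r @ r == 0.0:                  # already exact; avoid a 0/0 step size
            break
        alpha = (r @ r) / (r @ (A @ r))
        x = x + alpha * r
        e = x - x_true
        norms.append(np.sqrt(e @ A @ e))
    return norms

print(energy_error_norms(x_true + np.array([1.0, 0.0])))    # eigenvector error: exact after one step
print(energy_error_norms(x_true + np.array([1.0, 100.0])))  # worst-case mix: shrinks only ~2% per step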
4.4 Some Motivation
To summarize, we have seen that the performance of steepest descent varies depending on the error direction and can sometimes be excellent; however, in the worst case (obtained by maximizing the factor on the right side of (10) over all $\xi_j$) convergence is still only geometric.
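Carrying out that maximization (this is done in [1]) yields the familiar worst-case bound in the energy norm,
\[
\|e_{i+1}\|_A \;\le\; \frac{\kappa - 1}{\kappa + 1}\, \|e_i\|_A ,
\]
where $\kappa$ is the condition number of $A$; for large $\kappa$ the contraction factor is roughly $1 - 2/\kappa$, which is why steepest descent can be very slow on ill-conditioned systems.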
The problem, as can be seen in the lecture figures, is that steepest descent has the potential to “zig-zag too much.” We will see in the next lecture how the method of conjugate gradients overcomes this issue. The big idea here is that the so-called “zig-zagging” arises when the ellipsoidal level curves are very eccentric: the disparity between the magnitudes of the axes of the ellipses causes us to take very tiny steps. We can then think of the energy norm as a normalization of the ellipses into spheres, which removes this issue.
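To make this last remark precise (a standard identity, using the symmetric positive definite square root $A^{1/2}$):
\[
\|e\|_A^2 = e^T A e = \|A^{1/2} e\|_2^2 ,
\]
so measuring error in the energy norm is the same as measuring ordinary Euclidean distance after the change of variables $y = A^{1/2} x$, under which the ellipsoidal level sets of the quadratic form become spheres.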
References
[1] Shewchuk, Jonathan. “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.”
August 1994. http://www.cs.cmu.edu/~jrs/jrspapers.html.