AN INTRODUCTION TO NUMERICAL ANALYSIS
Second Edition
Kendall E. Atkinson
University of Iowa
WILEY
John Wiley & Sons
Copyright 1978, 1989, by John Wiley & Sons, Inc.
All rights reserved. Published simultaneously in Canada.
Reproduction or translation of any part of
this work beyond that permitted by Sections
107 and 108 of the 1976 United States Copyright
Act without the permission of the copyright
owner is unlawful. Requests for permission
or further information should be addressed to
the Permissions Department, John Wiley & Sons.
Library of Congress Cataloging in Publication Data:
Atkinson, Kendall E.
An introduction to numerical analysis/Kendall E. Atkinson.-
2nd ed.
p. cm.
Bibliography: p.
Includes index.
ISBN 0-471-62489-6
1. Numerical analysis. I. Title.
QA297.A84 1988
519.4-dc19
20 19 18 17 16 15 14
PREFACE
to the first edition
This introduction to numerical analysis was written for students in mathematics,
the physical sciences, and engineering, at the upper undergraduate to beginning
graduate level. Prerequisites for using the text are elementary calculus, linear
algebra, and an introduction to differential equations. The student's level of
mathematical maturity or experience with mathematics should be somewhat
higher; I have found that most students do not attain the necessary level until
their senior year. Finally, the student should have a knowledge of computer
programming. The preferred language for most scientific programming is Fortran.
A truly effective use of numerical analysis in applications requires both a
theoretical knowledge of the subject and computational experience with it. The
theoretical knowledge should include an understanding of both the original
problem being solved and of the numerical methods for its solution, including
their derivation, error analysis, and an idea of when they will perform well or
poorly. This kind of knowledge is necessary even if you are only considering
using a package program from your computer center. You must still understand
the program's purpose and limitations to know whether it applies to your
particular situation or not. More importantly, a majority of problems cannot be
solved by the simple application of a standard program. For such problems you
must devise new numerical methods, and this is usually done by adapting
standard numerical methods to the new situation. This requires a good theoreti-
cal foundation in numerical analysis, both to devise the new methods and to
avoid certain numerical pitfalls that occur easily in a number of problem areas.
Computational experience is also very important. It gives a sense of reality to
most theoretical discussions; and it brings out the important difference between
the exact arithmetic implicit in most theoretical discussions and the finite-length
arithmetic computation, whether on a computer or a hand calculator. The use of
a computer also imposes constraints on the structure of numerical methods,
constraints that are not evident and that seem unnecessary from a strictly
mathematical viewpoint. For example, iterative procedures are often preferred
over direct procedures because of simpler programming requirements or com-
puter memory size limitations, even though the direct procedure may seem
simpler to explain and to use. Many numerical examples are given in this text to
illustrate these points, and there are a number of exercises that will give the
student a variety of computational experience.
The book is organized in a fairly standard manner. Topics that are simpler,
both theoretically and computationally, come first; for example, rootfinding for a
single nonlinear equation is covered in Chapter 2. The more sophisticated topics
within numerical linear algebra are left until the last three chapters. If an
instructor prefers, however, Chapters 7 through 9 on numerical linear algebra can
be inserted at any point following Chapter 1. Chapter 1 contains a number of
introductory topics, some of which the instructor may wish to postpone until
later in the course. It is important, however, to cover the mathematical and
notational preliminaries of Section 1.1 and the introduction to computer
floating-point arithmetic given in Section 1.2 and in part of Section 1.3.
The text contains more than enough material for a one-year course. In
addition, introductions are given to some topics that instructors may wish to
expand on from their own notes. For example, a brief introduction is given to
stiff differential equations in the last part of Section 6.8 in Chapter 6; and some
theoretical foundation for the least squares data-fitting problem is given in
Theorem 7.5 and Problem 15 of Chapter 7. These can easily be expanded by
using the references given in the respective chapters.
Each chapter contains a discussion of the research literature and a bibliogra-
phy of some of the important books and papers on the material of the chapter.
The chapters all conclude with a set of exercises. Some of these exercises are
illustrations or applications of the text material, and others involve the develop-
ment of new material. As an aid to the student, answers and hints to selected
exercises are given at the end of the book. It is important, however, for students
to solve some problems in which there is no given answer against which they can
check their results. This forces them to develop a variety of other means for
checking their own work; and it will force them to develop some common sense
or judgment as an aid in knowing whether or not their results are reasonable.
I teach a one-year course covering much of the material of this book. Chapters
1 through 5 form the first semester, and Chapters 6 through 9 form the second
semester. In most chapters, a number of topics can be deleted without any
difficulty arising in later chapters. Exceptions to this are Section 2.5 on linear
iteration methods, Sections 3.1 to 3.3, 3.6 on interpolation theory, Section 4.4 on
orthogonal polynomials, and Section 5.1 on the trapezoidal and Simpson integra-
tion rules.
I thank Professor Herb Hethcote of the University of Iowa for his helpful
advice and for having taught from an earlier rough draft of the book. I am also
grateful for the advice of Professors Robert Barnhill, University of Utah, Herman
Burchard, Oklahoma State University, and Robert J. Flynn, Polytechnic Institute
of New York. I am very grateful to Ada Burns and Lois Friday, who did an
excellent job of typing this and earlier versions of the book. I thank the many
students who, over the past twelve years, enrolled in my course and used my notes and rough drafts rather than a regular text. They pointed out numerous
errors, and their difficulties with certain topics helped me in preparing better
presentations of them. The staff of John Wiley have been very helpful, and the
text is much better as a result of their efforts. Finally, I thank my wife Alice for
her patient and encouraging support, without which the book would probably
have not been completed.
Iowa City, August, 1978 Kendall E. Atkinson
CONTENTS
ONE
ERROR: ITS SOURCES, PROPAGATION, AND ANALYSIS 3
1.1 Mathematical Preliminaries 3
1.2 Computer Representation of Numbers 11
1.3 Definitions and Sources of Error 17
1.4 Propagation of Errors 23
1.5 Errors in Summation 29
1.6 Stability in Numerical Analysis 34
Discussion of the Literature 39
Problems 43

TWO
ROOTFINDING FOR NONLINEAR EQUATIONS 53
2.1 The Bisection Method 56
2.2 Newton's Method 58
2.3 The Secant Method 65
2.4 Muller's Method 73
2.5 A General Theory for One-Point Iteration Methods 76
2.6 Aitken Extrapolation for Linearly Convergent Sequences 83
2.7 The Numerical Evaluation of Multiple Roots 87
2.8 Brent's Rootfinding Algorithm 91
2.9 Roots of Polynomials 94
2.10 Systems of Nonlinear Equations 103
2.11 Newton's Method for Nonlinear Systems 108
2.12 Unconstrained Optimization 111
Discussion of the Literature 114
Problems 117

THREE
INTERPOLATION THEORY 131
3.1 Polynomial Interpolation Theory 131
3.2 Newton Divided Differences 138
3.3 Finite Differences and Table-Oriented Interpolation Formulas 147
3.4 Errors in Data and Forward Differences 151
3.5 Further Results on Interpolation Error 154
3.6 Hermite Interpolation 159
3.7 Piecewise Polynomial Interpolation 163
3.8 Trigonometric Interpolation 176
Discussion of the Literature 183
Problems 185
FOUR
APPROXIMATION OF FUNCTIONS 197
4.1 The Weierstrass Theorem and Taylor's Theorem 197
4.2 The Minimax Approximation Problem 201
4.3 The Least Squares Approximation Problem 204
4.4 Orthogonal Polynomials 207
4.5 The Least Squares Approximation Problem (continued) 216
4.6 Minimax Approximations 222
4.7 Near-Minimax Approximations 225
Discussion of the Literature 236
Problems 239

FIVE
NUMERICAL INTEGRATION 249
5.1 The Trapezoidal Rule and Simpson's Rule 251
5.2 Newton-Cotes Integration Formulas 263
5.3 Gaussian Quadrature 270
5.4 Asymptotic Error Formulas and Their Applications 284
5.5 Automatic Numerical Integration 299
5.6 Singular Integrals 305
5.7 Numerical Differentiation 315
Discussion of the Literature 320
Problems 323

SIX
NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS 333
6.1 Existence, Uniqueness, and Stability Theory 335
6.2 Euler's Method 341
6.3 Multistep Methods 357
6.4 The Midpoint Method 361
6.5 The Trapezoidal Method 366
6.6 A Low-Order Predictor-Corrector Algorithm 373
6.7 Derivation of Higher Order Multistep Methods 381
6.8 Convergence and Stability Theory for Multistep Methods 394
6.9 Stiff Differential Equations and the Method of Lines 409
6.10 Single-Step and Runge-Kutta Methods 418
6.11 Boundary Value Problems 433
Discussion of the Literature 444
Problems 450

SEVEN
LINEAR ALGEBRA 463
7.1 Vector Spaces, Matrices, and Linear Systems 463
7.2 Eigenvalues and Canonical Forms for Matrices 471
7.3 Vector and Matrix Norms 480
7.4 Convergence and Perturbation Theorems 490
Discussion of the Literature 495
Problems 496

EIGHT
NUMERICAL SOLUTION OF SYSTEMS OF LINEAR EQUATIONS 507
8.1 Gaussian Elimination 508
8.2 Pivoting and Scaling in Gaussian Elimination 515
8.3 Variants of Gaussian Elimination 522
8.4 Error Analysis 529
8.5 The Residual Correction Method 540
8.6 Iteration Methods 544
8.7 Error Prediction and Acceleration 552
8.8 The Numerical Solution of Poisson's Equation 557
8.9 The Conjugate Gradient Method 562
Discussion of the Literature 569
Problems 574

NINE
THE MATRIX EIGENVALUE PROBLEM 587
9.1 Eigenvalue Location, Error, and Stability Results 588
9.2 The Power Method 602
9.3 Orthogonal Transformations Using Householder Matrices 609
9.4 The Eigenvalues of a Symmetric Tridiagonal Matrix 619
9.5 The QR Method 623
9.6 The Calculation of Eigenvectors and Inverse Iteration 628
9.7 Least Squares Solution of Linear Systems 633
Discussion of the Literature 645
Problems 648
APPENDIX: MATHEMATICAL SOFTWARE 661
ANSWERS TO SELECTED EXERCISES 667
INDEX 683
ONE
ERROR:
ITS SOURCES,
PROPAGATION,
AND ANALYSIS
The subject of numerical analysis provides computational methods for the study
and solution of mathematical problems. In this text we study numerical methods
for the solution of the most common mathematical problems and we analyze the
errors present in these methods. Because almost all computation is now done on
digital computers, we also discuss the implications of this in the implementation
of numerical methods.
The study of error is a central concern of numerical analysis. Most numerical
methods give answers that are only approximations to the desired true solution,
and it is important to understand and to be able, if possible, to estimate or bound
the resulting error. This chapter examines the various kinds of errors that may
occur in a problem. The representation of numbers in computers is examined,
along with the error in computer arithmetic. General results on the propagation
of errors in calculations are given, with a detailed look at error in summation
procedures. Finally, the concepts of stability and conditioning of problems and
numerical methods are introduced and illustrated. The first section contains
mathematical preliminaries needed for the work of later chapters.
1.1 Mathematical Preliminaries
This section contains a review of results from calculus, which will be used in this
text. We first give some mean value theorems, and then we present and discuss
Taylor's theorem, for functions of one and two variables. The section concludes
with some notation that will be used in later chapters.
Theorem 1.1 (Intermediate Value)  Let $f(x)$ be continuous on the finite interval $a \le x \le b$, and define
$$m = \inf_{a \le x \le b} f(x) \qquad M = \sup_{a \le x \le b} f(x)$$
Then for any number $K$ in the interval $[m, M]$, there is at least one point $\xi$ in $[a, b]$ for which
$$f(\xi) = K$$
In particular, there are points $\underline{x}$ and $\bar{x}$ in $[a, b]$ for which
$$m = f(\underline{x}) \qquad M = f(\bar{x})$$
Theorem 1.2 (Mean Value)  Let $f(x)$ be continuous for $a \le x \le b$, and let it be differentiable for $a < x < b$. Then there is at least one point $\xi$ in $(a, b)$ for which
$$f(b) - f(a) = f'(\xi)(b - a)$$
Theorem 1.3 (Integral Mean Value)  Let $w(x)$ be nonnegative and integrable on $[a, b]$, and let $f(x)$ be continuous on $[a, b]$. Then
$$\int_a^b w(x) f(x)\,dx = f(\xi) \int_a^b w(x)\,dx$$
for some $\xi \in [a, b]$.
These theorems are discussed in most elementary calculus textbooks, and thus
we omit their proofs. Some implications of these theorems are examined in the
problems at the end of the chapter.
One of the most important tools of numerical analysis is Taylor's theorem and
the associated Taylor series. It is used throughout this text. The theorem gives a
relatively simple method for approximating functions f(x) by polynomials, and
thereby gives a method for computing f(x).
Theorem 1.4 (Taylor's Theorem)  Let $f(x)$ have $n + 1$ continuous derivatives on $[a, b]$ for some $n \ge 0$, and let $x, x_0 \in [a, b]$. Then
$$f(x) = p_n(x) + R_{n+1}(x) \tag{1.1.1}$$
$$p_n(x) = f(x_0) + \frac{(x - x_0)}{1!} f'(x_0) + \cdots + \frac{(x - x_0)^n}{n!} f^{(n)}(x_0) \tag{1.1.2}$$
$$R_{n+1}(x) = \frac{(x - x_0)^{n+1}}{(n + 1)!} f^{(n+1)}(\xi) \tag{1.1.3}$$
for some $\xi$ between $x_0$ and $x$.
Proof  The derivation of (1.1.1) is given in most calculus texts. It uses carefully chosen integration by parts in the identity
$$f(x) = f(x_0) + \int_{x_0}^{x} f'(t)\,dt$$
repeating it $n$ times to obtain (1.1.1)-(1.1.3), with the integral form of the remainder $R_{n+1}(x)$. The second form of $R_{n+1}(x)$ is obtained by using the integral mean value theorem with $w(t) = (x - t)^n$.
Using Taylor's theorem, we obtain the following standard formulas:
$$e^x = 1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!} + \frac{x^{n+1}}{(n + 1)!} e^{\xi} \tag{1.1.4}$$
$$\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots + (-1)^n \frac{x^{2n}}{(2n)!} + (-1)^{n+1} \frac{x^{2n+2}}{(2n + 2)!} \cos(\xi) \tag{1.1.5}$$
$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots + (-1)^{n-1} \frac{x^{2n-1}}{(2n - 1)!} + (-1)^n \frac{x^{2n+1}}{(2n + 1)!} \cos(\xi) \tag{1.1.6}$$
$$(1 + x)^{\alpha} = 1 + \binom{\alpha}{1} x + \binom{\alpha}{2} x^2 + \cdots + \binom{\alpha}{n} x^n + \binom{\alpha}{n + 1} x^{n+1} (1 + \xi)^{\alpha - n - 1} \tag{1.1.7}$$
with
$$\binom{\alpha}{k} = \frac{\alpha(\alpha - 1) \cdots (\alpha - k + 1)}{k!} \qquad k = 1, 2, 3, \ldots$$
for any real number $\alpha$. In all cases, the unknown point $\xi$ is located between $x$ and $0$.
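As a quick computational check of (1.1.4), the short Python sketch below (ours, not part of the text; the function name taylor_exp is made up for the illustration) evaluates the Taylor polynomial $p_n(x)$ for $e^x$ and compares the actual error with the remainder bound $x^{n+1} e^{\xi}/(n+1)! \le x^{n+1} e^{x}/(n+1)!$ for $x > 0$.

```python
import math

def taylor_exp(x, n):
    """Degree-n Taylor polynomial of e^x about x0 = 0, summed term by term."""
    term, total = 1.0, 1.0
    for k in range(1, n + 1):
        term *= x / k            # term is now x^k / k!
        total += term
    return total

x, n = 0.5, 4
p = taylor_exp(x, n)
actual_error = math.exp(x) - p
# Remainder bound from (1.1.4), using e^xi <= e^x for 0 <= xi <= x:
bound = x ** (n + 1) * math.exp(x) / math.factorial(n + 1)
print(f"p_n({x}) = {p:.10f}   error = {actual_error:.3e}   bound = {bound:.3e}")
```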
An important special case of (1.1.7) is
$$\frac{1}{1 - x} = 1 + x + x^2 + \cdots + x^n + \frac{x^{n+1}}{1 - x} \tag{1.1.8}$$
This is the case $\alpha = -1$, with $x$ replaced by $-x$. The remainder has a simpler form than in (1.1.7); it is easily proved by multiplying both sides of (1.1.8) by $1 - x$ and then simplifying. Rearranging (1.1.8), we obtain the familiar formula for a finite geometric series:
$$1 + x + x^2 + \cdots + x^n = \frac{1 - x^{n+1}}{1 - x} \qquad x \ne 1 \tag{1.1.9}$$
Infinite series representations for the functions on the left side of (1.1.4) to
(1.1.8) can be obtained by letting n -+ oo. The infinite series for (1.1.4) to (1.1.6)
converge for all x, and those for (1.1.7) and (1.1.8) converge for lxl < 1.
Formula (1.1.8) leads to the well-known infinite geometric series
$$\frac{1}{1 - x} = \sum_{k=0}^{\infty} x^k \qquad |x| < 1 \tag{1.1.10}$$
The Taylor series of any sufficiently differentiable function $f(x)$ can be
calculated directly from the definition (1.1.2), with as many terms included as
desired. But because of the complexity of the differentiation of many functions
f(x), it is often better to obtain indirectly their Taylor polynomial approxima-
tions Pn(x) or their Taylor series, by using one of the preceding formulas (1.1.4)
through (1.1.8). We give three examples, all of which have simpler error terms
than if (1.1.3) were used directly.
Example 1.  Let $f(x) = e^{-x^2}$. Replace $x$ by $-x^2$ in (1.1.4) to obtain
$$e^{-x^2} = 1 - x^2 + \frac{x^4}{2!} - \cdots + (-1)^n \frac{x^{2n}}{n!} + (-1)^{n+1} \frac{x^{2n+2}}{(n + 1)!} e^{\xi_x}$$
with $-x^2 \le \xi_x \le 0$.
2.  Let $f(x) = \tan^{-1}(x)$. Begin by setting $x = -u^2$ in (1.1.8):
$$\frac{1}{1 + u^2} = 1 - u^2 + u^4 - \cdots + (-1)^n u^{2n} + (-1)^{n+1} \frac{u^{2n+2}}{1 + u^2}$$
Integrate over $[0, x]$ to get
$$\tan^{-1}(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \cdots + (-1)^n \frac{x^{2n+1}}{2n + 1} + (-1)^{n+1} \int_0^x \frac{u^{2n+2}}{1 + u^2}\,du \tag{1.1.11}$$
Applying the integral mean value theorem,
$$\int_0^x \frac{u^{2n+2}}{1 + u^2}\,du = \frac{x^{2n+3}}{2n + 3} \cdot \frac{1}{1 + \xi_x^2}$$
with $\xi_x$ between $0$ and $x$.
3.  Let $f(x) = \int_0^1 \sin(xt)\,dt$. Using (1.1.6),
$$f(x) = \int_0^1 \left[ xt - \frac{(xt)^3}{3!} + \cdots + (-1)^{n-1} \frac{(xt)^{2n-1}}{(2n - 1)!} + (-1)^n \frac{(xt)^{2n+1}}{(2n + 1)!} \cos(\xi_{xt}) \right] dt$$
$$= \sum_{j=1}^{n} (-1)^{j-1} \frac{x^{2j-1}}{(2j)!} + (-1)^n \frac{x^{2n+1}}{(2n + 1)!} \int_0^1 t^{2n+1} \cos(\xi_{xt})\,dt$$
with $\xi_{xt}$ between $0$ and $xt$. The integral in the remainder is easily bounded by
1/(2n + 2); but we can also convert it to a simpler form. Although it wasn't
proved, it can be shown that
$$\int_0^1 f(x)\,dx \approx \frac{1}{n} \sum_{j=1}^{n} f\!\left(\frac{j - \frac{1}{2}}{n}\right) \qquad n = 1, 2, 3, \ldots \tag{1.3.4}$$
This is called the midpoint numerical integration rule: see the last part of Section
5.2 for more detail. The general topic of numerical integration is examined in
Chapter 5.
(c)  For the differential equation problem
$$Y'(t) = f(t, Y(t)) \tag{1.3.5}$$
use the approximation of the derivative
$$Y'(t) \approx \frac{Y(t + h) - Y(t)}{h}$$
for some small $h$. Let $t_j = t_0 + jh$ for $j \ge 0$, and define an approximate solution function $y(t_j)$ by
$$\frac{y(t_{j+1}) - y(t_j)}{h} = f(t_j, y(t_j)) \qquad j \ge 0$$
Thus we have
$$y(t_{j+1}) = y(t_j) + h\,f(t_j, y(t_j)) \qquad j \ge 0$$
This is Euler's method for solving an initial value problem for an ordinary differential equation. An extensive discussion and analysis of it is given in Section 6.2. Chapter 6 gives a complete development of numerical methods for solving the initial value problem (1.3.5).
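For readers who want to experiment, here is a minimal sketch of Euler's method in Python (the text assumes Fortran; the test problem $Y' = -Y$, $Y(0) = 1$ and the step size are our own choices for illustration).

```python
import math

def euler(f, t0, y0, h, steps):
    """Euler's method: y_{j+1} = y_j + h f(t_j, y_j), starting from y(t_0) = y0."""
    t, y = t0, y0
    values = [(t, y)]
    for _ in range(steps):
        y = y + h * f(t, y)
        t = t + h
        values.append((t, y))
    return values

# Test problem (our choice): Y'(t) = -Y(t), Y(0) = 1, whose solution is exp(-t).
approx = euler(lambda t, y: -y, t0=0.0, y0=1.0, h=0.1, steps=10)
t_end, y_end = approx[-1]
print(y_end, math.exp(-t_end))   # Euler value at t = 1 versus the true solution
```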
Most numerical analysis problems in the following chapters involve mainly
mathematical truncation errors. The major exception is the solution of systems of
linear equations in which rounding errors are often the major source of error.
Noise in function evaluation  One of the immediate consequences of rounding errors is that the evaluation of a function $f(x)$ using a computer will lead to an approximate function $\hat{f}(x)$ that is not continuous, although this is apparent only when the graph of $\hat{f}(x)$ is looked at on a sufficiently small scale. After each arithmetic operation that is used in evaluating $f(x)$, there will usually be a rounding error. When the effect of these rounding errors is considered, we obtain a computed value $\hat{f}(x)$ whose error $f(x) - \hat{f}(x)$ appears to be a small random number as $x$ varies. This error in $\hat{f}(x)$ is called noise. When the graph of $\hat{f}(x)$ is looked at on a small enough scale, it appears as a fuzzy band of dots, where the $x$ values range over all acceptable floating-point numbers on the machine. This has consequences for many other programs that make use of $\hat{f}(x)$. For example, calculating the root of $f(x)$ by using $\hat{f}(x)$ will lead to uncertainty in the location of the root, because it will likely be located in the intersection of the $x$-axis and the fuzzy banded graph of $\hat{f}(x)$. The following example shows that this can result in considerable uncertainty in the location of the root.
Example  Let
$$f(x) = x^3 - 3x^2 + 3x - 1 \tag{1.3.6}$$
which is just $(x - 1)^3$. We evaluated (1.3.6) in the single precision BASIC of a popular microcomputer, using rounded binary arithmetic with a unit round of $\delta = 2^{-24} \doteq 5.96 \times 10^{-8}$. The graph of $\hat{f}(x)$ on $[0, 2]$, shown in Figure 1.1, is continuous and smooth to the eye, as would be expected. But the graph on the smaller interval $[.998, 1.002]$ shows the discontinuous nature of $\hat{f}(x)$, as is apparent from the graph in Figure 1.2. In this latter case, $\hat{f}(x)$ was evaluated at 640 evenly spaced values of $x$ in $[.998, 1.002]$, resulting in the fuzzy band that is the graph of $\hat{f}(x)$. From the latter graph, it can be seen that there is a large interval of uncertainty as to where $\hat{f}(x)$ crosses the $x$-axis. We return to this topic in Section 2.7 of Chapter 2.

Figure 1.1  Graph of (1.3.6).
Figure 1.2  Detailed graph of (1.3.6).
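The same behavior is easy to reproduce today; the sketch below (ours) evaluates (1.3.6) in IEEE single precision with numpy on the interval [.998, 1.002] and counts how often the computed values change sign, which is a crude measure of the interval of uncertainty for the root.

```python
import numpy as np

# Evaluate f(x) = x^3 - 3x^2 + 3x - 1 = (x - 1)^3 in single precision near x = 1.
x = np.linspace(0.998, 1.002, 641, dtype=np.float32)
f = x**3 - np.float32(3.0) * x**2 + np.float32(3.0) * x - np.float32(1.0)

changes = np.nonzero(np.diff(np.sign(f)))[0]
print("number of sign changes in the computed f:", len(changes))
if len(changes) > 0:
    print("they occur between x =", float(x[changes[0]]),
          "and x =", float(x[changes[-1] + 1]))
```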
Underflow/overflow errors in calculations  We consider another consequence of machine errors. The upper and lower limits for floating-point numbers, given in (1.2.17), can lead to errors in calculations. Sometimes these are unavoidable, but often they are an artifact of the way the calculation is arranged.
To illustrate this, consider evaluating the magnitude of a complex number,
$$|x + iy| = \sqrt{x^2 + y^2} \tag{1.3.7}$$
It is possible this may underflow or overflow, even though the magnitude $|x + iy|$ is within machine limits. For example, if $x_u = 1.7 \times 10^{38}$ from (1.2.17), then (1.3.7) will overflow for $x = y = 10^{20}$, even though $|x + iy| \doteq 1.4 \times 10^{20}$. To avoid this, determine the larger of $|x|$ and $|y|$, say $|x|$. Then rewrite (1.3.7) as
$$|x + iy| = |x|\sqrt{1 + a^2} \qquad a = \frac{y}{x} \tag{1.3.8}$$
We must calculate $\sqrt{1 + a^2}$, with $0 \le a \le 1$. This avoids the problems of (1.3.7), both for underflow and overflow.
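A sketch of the rescaling (1.3.8) in Python (illustrative only; in practice one would simply call math.hypot, which performs a similar rescaling internally):

```python
import math

def cabs(x, y):
    """|x + iy| computed as in (1.3.8), avoiding premature overflow or underflow."""
    x, y = abs(x), abs(y)
    if x < y:
        x, y = y, x              # make x the larger of the two magnitudes
    if x == 0.0:
        return 0.0
    a = y / x                    # 0 <= a <= 1
    return x * math.sqrt(1.0 + a * a)

# The naive sqrt(x*x + y*y) overflows here even in double precision,
# but the rescaled form does not.
print(cabs(1e200, 1e200))
print(math.hypot(1e200, 1e200))
```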
1.4 Propagation of Errors
In this and the next sections, we consider the effect of calculations with numbers
that are in error. To begin, consider the basic arithmetic operations. Let $\omega$ denote one of the arithmetic operations $+, -, \times, \div$; and let $\hat{\omega}$ be the computer version of the same operation, which will usually include rounding. Let $x_A$ and $y_A$ be the numbers being used for calculations, and suppose they are in error, with true values $x_T$ and $y_T$. Then $x_A \hat{\omega} y_A$ is the number actually computed, and for its error,
$$x_T \omega y_T - x_A \hat{\omega} y_A = [x_T \omega y_T - x_A \omega y_A] + [x_A \omega y_A - x_A \hat{\omega} y_A] \tag{1.4.1}$$
The first quantity in brackets is called the propagated error, and the second quantity is normally rounding or chopping error. For this second quantity, we usually have
$$x_A \hat{\omega} y_A = \mathrm{fl}(x_A \omega y_A) \tag{1.4.2}$$
which means that $x_A \omega y_A$ is computed exactly and then rounded. Combining (1.2.9) and (1.4.2),
$$\frac{|x_A \omega y_A - x_A \hat{\omega} y_A|}{|x_A \omega y_A|} \le \delta \tag{1.4.3}$$
provided true rounding is used, with $\delta$ the unit round of the arithmetic.
For the propagated error, we examine particular cases.
Case (a)  Multiplication. For the error in $x_A y_A$,
$$\mathrm{Rel}(x_A y_A) \doteq \mathrm{Rel}(x_A) + \mathrm{Rel}(y_A) \qquad \text{provided } |\mathrm{Rel}(x_A)|, |\mathrm{Rel}(y_A)| \ll 1 \tag{1.4.4}$$
The symbol $\ll$ means "much less than."
Case (b)  Division. By a similar argument,
$$\mathrm{Rel}\!\left(\frac{x_A}{y_A}\right) = \frac{\mathrm{Rel}(x_A) - \mathrm{Rel}(y_A)}{1 - \mathrm{Rel}(y_A)} \tag{1.4.5}$$
$$\mathrm{Rel}\!\left(\frac{x_A}{y_A}\right) \doteq \mathrm{Rel}(x_A) - \mathrm{Rel}(y_A) \qquad \text{provided } |\mathrm{Rel}(y_A)| \ll 1 \tag{1.4.6}$$
For both multiplication and division, relative errors do not propagate rapidly.
Case (c)  Addition and subtraction. Here
$$(x_T \pm y_T) - (x_A \pm y_A) = (x_T - x_A) \pm (y_T - y_A) \tag{1.4.7}$$
This appears quite good and reasonable, but it can be misleading. The relative error in $x_A \pm y_A$ can be quite poor when compared with $\mathrm{Rel}(x_A)$ and $\mathrm{Rel}(y_A)$.
Example  Let $x_T = \pi$, $x_A = 3.1416$, $y_T = \frac{22}{7}$, $y_A = 3.1429$. Then
$$x_T - x_A \doteq -7.35 \times 10^{-6} \qquad \mathrm{Rel}(x_A) \doteq -2.34 \times 10^{-6}$$
$$y_T - y_A \doteq -4.29 \times 10^{-5}$$
$$(x_T - y_T) - (x_A - y_A) = -.0012645 - (-.0013) = 3.55 \times 10^{-5}$$
Although the error in $x_A - y_A$ is quite small, the relative error in $x_A - y_A$ is much larger than that in $x_A$ or $y_A$ alone.
Loss of significance errors This last example shows that it is possible to have a
large decrease in accuracy, in a relative error sense, when subtracting nearly equal
quantities. This can be a very important way in which accuracy is lost when error
is propagated in a calculation. We now give some examples of this phenomenon,
along with suggestions on how it may be avoided in some cases.
Example  Consider solving $ax^2 + bx + c = 0$ when $4ac$ is small compared to $b^2$; use the standard quadratic formula for the roots
$$r_T^{(1)} = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \qquad r_T^{(2)} = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \tag{1.4.8}$$
For definiteness, consider $x^2 - 26x + 1 = 0$. The formulas (1.4.8) yield
$$r_T^{(1)} = 13 + \sqrt{168} \qquad r_T^{(2)} = 13 - \sqrt{168} \tag{1.4.9}$$
Now imagine using a five-digit decimal machine. On it, $\sqrt{168} \doteq 12.961$. Then define
$$r_A^{(1)} = 13 + 12.961 = 25.961 \qquad r_A^{(2)} = 13 - 12.961 = .039 \tag{1.4.10}$$
Using the exact answers,
$$\mathrm{Rel}(r_A^{(1)}) \doteq 1.85 \times 10^{-5} \qquad \mathrm{Rel}(r_A^{(2)}) \doteq -1.25 \times 10^{-2} \tag{1.4.11}$$
For the data entering into the calculation (1.4.10), using the notation of (1.4.7),
$$x_T = 13 \quad x_A = 13 \qquad y_T = \sqrt{168} \quad y_A = 12.961 \qquad \mathrm{Rel}(y_A) \doteq 3.71 \times 10^{-5}$$
The accuracy in $r_A^{(2)}$ is much less than that of the data $x_A$ and $y_A$ entering into the calculation. We say that significant digits have been lost in the subtraction $r_A^{(2)} = x_A - y_A$, or that we have had a loss-of-significance error in calculating $r_A^{(2)}$. In $r_A^{(1)}$, we have five significant digits of accuracy, whereas we have only two significant digits in $r_A^{(2)}$.
To cure this particular problem of accurately calculating $r_T^{(2)}$, convert (1.4.9) to
$$r_T^{(2)} = 13 - \sqrt{168} = \frac{(13 - \sqrt{168})(13 + \sqrt{168})}{13 + \sqrt{168}} = \frac{1}{13 + \sqrt{168}}$$
Then use
$$r_A^{(2)} = \frac{1}{13 + \sqrt{168}} \doteq \frac{1}{25.961} \doteq .038519 \tag{1.4.12}$$
There are two errors here, that of $\sqrt{168} \doteq 12.961$ and that of the final division. But each of these will have small relative errors [see (1.4.6)], and the new value of $r_A^{(2)}$ will be more accurate than the preceding one. By exact calculations, we now have
$$\mathrm{Rel}(r_A^{(2)}) \doteq -1.0 \times 10^{-5}$$
much better than in (1.4.11).
This new computation of $r_A^{(2)}$ demonstrates that the loss-of-significance error is due to the form of the calculation, not to errors in the data of the computation. In this example it was easy to find an alternative computation that eliminated the loss-of-significance error, but this is not always possible. For a complete discussion of the practical computation of roots of a quadratic polynomial, see Forsythe (1969).
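The same cure extends to a general quadratic: compute the root that does not suffer cancellation from (1.4.8), and recover the other from the relation $r^{(1)} r^{(2)} = c/a$. A Python sketch (ours, not the routine discussed by Forsythe):

```python
import math

def real_quadratic_roots(a, b, c):
    """Real roots of a*x^2 + b*x + c = 0, avoiding loss of significance
    when 4ac is small compared to b^2."""
    d = math.sqrt(b * b - 4.0 * a * c)
    # Choose the sign so that -b and the square root do not nearly cancel.
    if b >= 0.0:
        r1 = (-b - d) / (2.0 * a)
    else:
        r1 = (-b + d) / (2.0 * a)
    r2 = c / (a * r1)            # from r1 * r2 = c / a
    return r1, r2

print(real_quadratic_roots(1.0, -26.0, 1.0))   # 13 + sqrt(168) and 13 - sqrt(168)
```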
Example  With many loss-of-significance calculations, Taylor polynomial approximations can be used to eliminate the difficulty. We illustrate this with the evaluation of
$$f(x) = \int_0^1 e^{xt}\,dt = \frac{e^x - 1}{x} \qquad x \ne 0 \tag{1.4.13}$$
For $x = 0$, $f(0) = 1$; and easily, $f(x)$ is continuous at $x = 0$.
To see that there is a loss-of-significance problem when $x$ is small, we evaluate $f(x)$ at $x = 1.4 \times 10^{-9}$, using a popular and well-designed ten-digit hand calculator. The results are
$$e^x \doteq 1.000000001 \qquad \frac{e^x - 1}{x} \doteq \frac{10^{-9}}{1.4 \times 10^{-9}} = .714 \tag{1.4.14}$$
The right-hand sides give the calculator results, and the true answer, rounded to 10 places, is
$$f(x) \doteq 1.000000001$$
The calculation (1.4.14) has had a cancellation of the leading nine digits of accuracy in the operands in the numerator.
To avoid the loss-of-significance error, use a quadratic Taylor approximation to $e^x$ and then simplify $f(x)$:
$$f(x) = 1 + \frac{x}{2} + \frac{x^2}{6} e^{\xi} \qquad 0 \le \xi \le x \tag{1.4.15}$$
With the preceding $x = 1.4 \times 10^{-9}$,
$$f(x) \doteq 1 + 7 \times 10^{-10}$$
with an error of less than $10^{-18}$.
In general, use (1.4.15) on some interval $[0, \delta]$, picking $\delta$ to ensure the error in
$$f(x) \doteq 1 + \frac{x}{2}$$
is sufficiently small. Of course, a higher degree approximation to $e^x$ could be used, allowing a yet larger value of $\delta$.
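In double precision the same comparison looks as follows (a sketch; math.expm1 is the standard library's careful implementation of $e^x - 1$, included here only as a cross-check):

```python
import math

def f_naive(x):
    return (math.exp(x) - 1.0) / x       # suffers cancellation for small x

def f_taylor(x):
    return 1.0 + x / 2.0 + x * x / 6.0   # the approximation (1.4.15), dropping e^xi

x = 1.4e-9
print(f_naive(x))            # cancellation leaves a relative error on the order of 1e-7
print(f_taylor(x))           # 1 + 7e-10, in error by less than 1e-18
print(math.expm1(x) / x)     # library routine that avoids the cancellation
```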
In general, Taylor approximations are often useful in avoiding loss-of-signifi-
cance calculations. But in some cases, the loss-of-significance error is more subtle.
Example  Consider calculating a sum
$$S = x_1 + x_2 + \cdots + x_n \tag{1.4.16}$$
with positive and negative terms $x_j$, each of which is an approximate value. Furthermore, assume the sum $S$ is much smaller than the maximum magnitude of the $x_j$. In calculating such a sum on a computer, it is likely that a loss-of-significance error will occur. We give an illustration of this.
Consider using the Taylor formula (1.1.4) for $e^x$ to evaluate $e^{-5}$:
$$e^{-5} \doteq 1 + \frac{(-5)}{1!} + \frac{(-5)^2}{2!} + \frac{(-5)^3}{3!} + \cdots + \frac{(-5)^n}{n!} \tag{1.4.17}$$
Imagine using a computer with four-digit decimal rounded floating-point arithmetic, so that each of the terms in this series must be rounded to four significant digits. In Table 1.2, we give these rounded terms $x_j$, along with the exact sum of the terms through the given degree. The true value of $e^{-5}$ is .006738, to four significant digits, and this is quite different from the final sum in the table. Also, if (1.4.17) is calculated exactly for $n = 25$, then the correct value of $e^{-5}$ is obtained to four digits.
In this example, the terms $x_j$ become relatively large, but they are then added to form the much smaller number $e^{-5}$. This means there are loss-of-significance
Table 1.2 Calculation of (1.4.17) using four-digit decimal arithmetic
Degree Term Sum Degree Term Sum
0 1.000 1.000 13 -.1960 -.04230
1 -5.000 -4.000 14 .7001E- 1 .02771
2 12.50 8.500 15 -.2334E- 1 .004370
3 -20.83 -12.33 16 .7293E- 2 .01166
4 26.04 13.71 17 -.2145E- 2 .009518
5 -26.04 -12.33 18 .5958E- 3 .01011
6 21.70 9.370 19 -.1568E- 3 .009957
7 -15.50 -6.130 20 .3920E- 4 .009996
8 9.688 3.558 21 -.9333E- 5 .009987
9 -5.382 -1.824 22 .2121E- 5 .009989
10 2.691 .8670 23 -.4611E- 6 .009989
11 -1.223 -.3560 24 .9607E- 7 .009989
12 .5097 .1537 25 -.1921E- 7 .009989
errors in the calculation of the sum. To avoid this problem in this case is quite easy. Either use
$$e^{-5} = \frac{1}{e^5}$$
and form $e^5$ with a series not involving cancellation of positive and negative terms; or simply form $e^{-1} = 1/e$, perhaps using a series, and multiply it times itself to form $e^{-5}$. With other series, it is likely that there will not be such a simple solution.
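The same phenomenon appears in double precision if the argument is pushed further out. The sketch below (ours) sums the series (1.4.17) for $e^{-x}$ directly and compares it with forming $1/e^{x}$; with $x = 20$ the alternating terms reach about $4 \times 10^{7}$, so the rounding errors are comparable in size to the answer itself.

```python
import math

def exp_series(x, terms=150):
    """Sum the Taylor series 1 + x + x^2/2! + ... term by term."""
    s, term = 1.0, 1.0
    for n in range(1, terms):
        term *= x / n
        s += term
    return s

x = 20.0
direct = exp_series(-x)               # large cancellation among the terms
via_reciprocal = 1.0 / exp_series(x)  # no cancellation: all terms positive
print(direct, via_reciprocal, math.exp(-x))
```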
Propagated error in function evaluations  Let $f(x)$ be a given function, and let $\hat{f}(x)$ denote the result of evaluating $f(x)$ on a computer. Then $f(x_T)$ denotes the desired function value and $\hat{f}(x_A)$ is the value actually computed. For the error, we write
$$f(x_T) - \hat{f}(x_A) = [f(x_T) - f(x_A)] + [f(x_A) - \hat{f}(x_A)] \tag{1.4.18}$$
The first quantity in brackets is called the propagated error, and the second is the error due to evaluating $f(x_A)$ on a computer. This second error is generally a small random number, based on an assortment of rounding errors that occurred in carrying out the arithmetic operations defining $f(x)$. We referred to it earlier in Section 1.3 as the noise in evaluating $f(x)$.
For the propagated error, the mean value theorem gives
$$f(x_T) - f(x_A) = f'(\xi)(x_T - x_A) \doteq f'(x_A)(x_T - x_A) \tag{1.4.19}$$
This assumes that $x_A$ and $x_T$ are relatively close and that $f'(x)$ does not vary greatly for $x$ between $x_A$ and $x_T$.

Example  $\sin(\pi/5) - \sin(.628) \doteq \cos(\pi/5)[(\pi/5) - .628] \doteq .00026$, which is an excellent estimate of the error.
Using Taylor's theorem (1.1.12) for functions of two variables, we can generalize the preceding to the propagation of error in functions of two variables:
$$f(x_T, y_T) - f(x_A, y_A) \doteq f_x(x_A, y_A)(x_T - x_A) + f_y(x_A, y_A)(y_T - y_A) \tag{1.4.20}$$
with $f_x = \partial f / \partial x$. We are assuming that $f_x(x, y)$ and $f_y(x, y)$ do not vary greatly for $(x, y)$ between $(x_T, y_T)$ and $(x_A, y_A)$.

Example  For $f(x, y) = x^y$, we have $f_x = y x^{y-1}$, $f_y = x^y \log(x)$. Then (1.4.20) yields
$$x_T^{y_T} - x_A^{y_A} \doteq y_A x_A^{y_A - 1}(x_T - x_A) + x_A^{y_A} \log(x_A)(y_T - y_A) \tag{1.4.21}$$
The relative error in $x_A^{y_A}$ may be large, even though $\mathrm{Rel}(x_A)$ and $\mathrm{Rel}(y_A)$ are small. As a further illustration, take $y_T = y_A = 500$, $x_T = 1.2$, $x_A = 1.2001$. Then
$$x_T^{y_T} = 3.89604 \times 10^{39} \qquad x_A^{y_A} = 4.06179 \times 10^{39} \qquad \mathrm{Rel}(x_A^{y_A}) = -.0425$$
Compare this with $\mathrm{Rel}(x_A) \doteq -8.3 \times 10^{-5}$.
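These numbers are easy to verify (a sketch; plain double precision is more than adequate here):

```python
x_T, x_A = 1.2, 1.2001
y = 500.0

rel_x = (x_T - x_A) / x_T                 # about -8.3e-5
rel_out = (x_T**y - x_A**y) / (x_T**y)    # about -0.0425
print(rel_x, rel_out)
print(y * rel_x)    # first-order estimate y * Rel(x_A) from (1.4.21), about -0.042
```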
Error in data If the input data to an algorithm contain only r digits of
accuracy, then it is sometimes suggested that only r-digit arithmetic should be
used in any calculations involving these data. This is nonsense. It is certainly true
that the limited accuracy of the data will affect the eventual results of the
algorithmic calculations, giving answers that are in error. Nonetheless, there is no
reason to make matters worse by using r-digit arithmetic with correspondingly
sized rounding errors. Instead one should use a higher precision arithmetic, to
avoid any further degradation in the accuracy of results of the algorithm. This
will lead to arithmetic rounding errors that are less significant than the error in
the data, helping to preserve the accuracy associated with the data.
1.5 Errors in Summation
Many numerical methods, especially in linear algebra, involve summations. In
this section, we look at various aspects of summation, particularly as carried out
in floating-point arithmetic.
Consider the computation of the sum
$$S = \sum_{j=1}^{m} x_j \tag{1.5.1}$$
with $x_1, \ldots, x_m$ floating-point numbers. Define
$$S_2 = \mathrm{fl}(x_1 + x_2) = (x_1 + x_2)(1 + \epsilon_2) \tag{1.5.2}$$
where we have made use of (1.4.2) and (1.2.10). Define recursively
$$S_{r+1} = \mathrm{fl}(S_r + x_{r+1}) = (S_r + x_{r+1})(1 + \epsilon_{r+1}) \qquad r = 2, \ldots, m - 1 \tag{1.5.3}$$
The quantities $\epsilon_2, \ldots, \epsilon_m$ satisfy (1.2.8) or (1.2.9), depending on whether chopping or rounding is used.
Expanding the first few sums, we obtain the following:
$$S_2 - (x_1 + x_2) = (x_1 + x_2)\epsilon_2$$
$$S_3 - (x_1 + x_2 + x_3) = (x_1 + x_2)\epsilon_2 + [(x_1 + x_2)(1 + \epsilon_2) + x_3]\epsilon_3 \doteq (x_1 + x_2)\epsilon_2 + (x_1 + x_2 + x_3)\epsilon_3$$
$$S_4 - (x_1 + x_2 + x_3 + x_4) \doteq (x_1 + x_2)\epsilon_2 + (x_1 + x_2 + x_3)\epsilon_3 + (x_1 + x_2 + x_3 + x_4)\epsilon_4$$
Table 1.3 Calculating S on a machine using chopping
n True SL Error LS Error
10 2.929 2.928 .001 2.927 .002
25 3.816 3.813 .003 3.806 .010
50 4.499 4.491 .008 4.479 .020
100 5.187 5.170 .017 5.142 .045
200 5.878 5.841 .037 5.786 .092
500 6.793 6.692 .101 6.569 .224
1000 7.486 7.284 .202 7.069 .417
Table 1.4 Calculating S on a machine using rounding
n True SL Error LS Error
10 2.929 2.929 0.0 2.929 0.0
25 3.816 3.816 0.0 3.817 -.001
50 4.499 4.500 -.001 4.498 .001
100 5.187 5.187 0.0 5.187 0.0
200 5.878 5.878 0.0 5.876 .002
500 6.793 6.794 -.001 6.783 .010
1000 7.486 7.486 0.0 7.449 .037
We have neglected the cross-product terms $\epsilon_i \epsilon_j$, since they will be of much smaller magnitude. By induction, we obtain
$$S_m - \sum_{i=1}^{m} x_i \doteq (x_1 + x_2)\epsilon_2 + (x_1 + x_2 + x_3)\epsilon_3 + \cdots + (x_1 + x_2 + \cdots + x_m)\epsilon_m \tag{1.5.4}$$
From this formula we deduce that the best strategy for addition is to add from the smallest to the largest. Of course, counterexamples can be produced, but over a large number of summations, the preceding rule should be best. This is especially true if the numbers $x_i$ are all of one sign, so that no cancellation occurs in the calculation of the intermediate sums $x_1 + \cdots + x_m$, $m = 1, \ldots, n$. In this case, if chopping is used, rather than rounding, and if all $x_i > 0$, then there is no cancellation in the sums of the $\epsilon_i$. With the strategy of adding from smallest to largest, we minimize the effect of these chopping errors.
Example  Define the terms $x_j$ of the sum $S$ as follows: convert the fraction $1/j$ to a decimal fraction, round it to four significant digits, and let this be $x_j$. To make the errors in the calculation of $S$ more clear, we use four-digit decimal floating-point arithmetic. Tables 1.3 and 1.4 contain the results of four different ways of computing $S$. Adding $S$ from largest to smallest is denoted by LS, and
adding from smallest to largest is denoted by SL. Table 1.3 uses chopped arithmetic, with
$$-.001 \le \epsilon_j \le 0 \tag{1.5.5}$$
and Table 1.4 uses rounded arithmetic, with
$$-.0005 \le \epsilon_j \le .0005 \tag{1.5.6}$$
The numbers $\epsilon_j$ refer to (1.5.4), and their bounds come from (1.2.8) and (1.2.9).
In both tables, it is clear that the strategy of adding S from the smallest term
to the largest is superior to the summation from the largest term to the smallest.
Of much more significance, however, is the far smaller error with rounding as
compared to chopping. The difference is much more than the factor of 2 that
would come from the relative size of the bounds in (1.5.5) and (1.5.6). We next
give an analysis of this.
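The experiment behind Tables 1.3 and 1.4 can be imitated with Python's decimal module, which lets us force every intermediate sum to four significant digits with either chopping or rounding (a sketch; the exact errors obtained will not match the tables digit for digit, but the pattern does).

```python
import math
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN

def sum_4digit(terms, rounding):
    """Add the terms in the given order, forcing each intermediate sum
    to four significant decimal digits."""
    ctx = Context(prec=4, rounding=rounding)
    s = Decimal(0)
    for t in terms:
        s = ctx.add(s, t)
    return float(s)

n = 1000
four = Context(prec=4, rounding=ROUND_HALF_EVEN)
xs = [four.divide(Decimal(1), Decimal(j)) for j in range(1, n + 1)]  # x_j = 1/j to 4 digits
true_value = math.fsum(float(x) for x in xs)

for name, mode in [("chopping", ROUND_DOWN), ("rounding", ROUND_HALF_EVEN)]:
    sl = sum_4digit(sorted(xs), mode)                  # smallest to largest
    ls = sum_4digit(sorted(xs, reverse=True), mode)    # largest to smallest
    print(f"{name:9s}  SL error {true_value - sl:+.3f}   LS error {true_value - ls:+.3f}")
```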
A statistical analysis of error propagation  Consider a general error sum
$$E = \sum_{j=1}^{n} \epsilon_j \tag{1.5.7}$$
of the type that occurs in the summation error (1.5.4). A simple bound is
$$|E| \le n\delta \tag{1.5.8}$$
where $\delta$ is a bound on $\epsilon_1, \ldots, \epsilon_n$. Then $\delta = .001$ or $.0005$ in the preceding example, depending on whether chopping or rounding is used. This bound (1.5.8) is for the worst possible case, in which all the errors $\epsilon_j$ are as large as possible and of the same sign.
When using rounding, the symmetry in sign behavior of the $\epsilon_j$, as shown in (1.2.9), makes a major difference in the size of $E$. In this case, a better model is to assume that the errors $\epsilon_j$ are uniformly distributed random variables in the interval $[-\delta, \delta]$ and that they are independent. Then consider the sample mean
$$\bar{\epsilon} = \frac{E}{n} = \frac{1}{n}\sum_{j=1}^{n} \epsilon_j$$
The sample mean $\bar{\epsilon}$ is a new random variable, having a probability distribution with mean $0$ and variance $\delta^2/3n$. To calculate probabilities for statements involving $\bar{\epsilon}$, it is important to note that the probability distribution for $\bar{\epsilon}$ is well approximated by the normal distribution with the same mean and variance, even for small values such as $n \ge 10$. This follows from the Central Limit Theorem of probability theory [e.g., see Hogg and Craig (1978, chap. 5)]. Using the approximating normal distribution, the probability is $\frac{1}{2}$ that
$$|E| \le .39\,\delta\sqrt{n}$$
and the probability is .99 that
$$|E| \le 1.49\,\delta\sqrt{n} \tag{1.5.9}$$
The result (1.5.9) is a considerable improvement upon (1.5.8) if $n$ is at all large.
This analysis can also be applied to the case of chopping error. But in that case, $-\delta \le \epsilon_j \le 0$. The sample mean now has a mean of $-\delta/2$ and a variance of $\delta^2/12n$. Thus there is a probability of .99 that
$$\left| E + \frac{n\delta}{2} \right| \le .74\,\delta\sqrt{n} \tag{1.5.10}$$
For large $n$, this ensures that $E$ will approximately equal $-n\delta/2$, which is much larger in magnitude than (1.5.9) for the case of rounding errors.
When these results, (1.5.9) and (1.5.10), are applied to the general summation
error (1.5.4), we see the likely reason for the significantly different error behavior
of chopping and rounding in Tables 1.3 and 1.4. In general, rounded arithmetic is
almost always to be preferred to chopped arithmetic.
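The probabilistic statements are easy to check by simulation (a sketch; numpy's random generator stands in for the rounding and chopping errors).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, delta = 1000, 5000, 0.0005     # delta = rounding bound; chopping bound is 2*delta

# Rounding model: errors uniform on [-delta, delta].
E_round = rng.uniform(-delta, delta, size=(trials, n)).sum(axis=1)
# Chopping model: errors uniform on [-2*delta, 0].
E_chop = rng.uniform(-2 * delta, 0.0, size=(trials, n)).sum(axis=1)

print("rounding: 99th percentile of |E| =", np.quantile(np.abs(E_round), 0.99))
print("          1.49*delta*sqrt(n)      =", 1.49 * delta * np.sqrt(n))
print("chopping: mean of E =", E_chop.mean(), "  (-n * chopping bound / 2 =", -n * delta, ")")
```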
Although statistical analyses give more realistic bounds, they are usually much
more difficult to compute. As a more sophisticated example, see Henrici (1962,
pp. 41-59) for a statistical analysis of the error in the numerical solution of differential equations. An example is given in Table 6.3 of Chapter 6 of the
present textbook.
Inner products  Given two vectors $x, y \in R^m$, we call
$$x^T y = \sum_{j=1}^{m} x_j y_j \tag{1.5.11}$$
the inner product of $x$ and $y$. (The notation $x^T$ denotes the matrix transpose of $x$.) Properties of the inner product are examined in Chapter 7, but we note here that
$$x^T x = \sum_{j=1}^{m} x_j^2 \ge 0 \tag{1.5.12}$$
$$|x^T y| \le \sqrt{x^T x}\,\sqrt{y^T y} \tag{1.5.13}$$
The latter inequality is called the Cauchy-Schwarz inequality, and it is proved in
a more general setting in Chapter 4. Sums of the form (1.5.11) occur commonly
in linear algebra problems (for example, matrix multiplication). We now consider
the numerical computation of such sums.
Assume $x_i$ and $y_i$, $i = 1, \ldots, m$, are floating-point numbers. Define
$$S_1 = \mathrm{fl}(x_1 y_1) \qquad S_{k+1} = \mathrm{fl}\bigl(S_k + \mathrm{fl}(x_{k+1} y_{k+1})\bigr) \qquad k = 1, 2, \ldots, m - 1 \tag{1.5.14}$$
Then as before, using (1.2.10),
$$S_1 = x_1 y_1 (1 + \epsilon_1) \qquad S_{k+1} = \bigl[S_k + x_{k+1} y_{k+1}(1 + \epsilon_{k+1})\bigr](1 + \eta_{k+1})$$
with the terms $\epsilon_j$, $\eta_j$ satisfying (1.2.8) or (1.2.9), depending on whether chopping or rounding, respectively, is used. Combining and rearranging the preceding formulas, we obtain
$$S_m = \sum_{j=1}^{m} x_j y_j (1 + \gamma_j) \tag{1.5.15}$$
with
$$1 + \gamma_j \doteq 1 + \epsilon_j + \eta_j + \eta_{j+1} + \cdots + \eta_m \qquad \eta_1 = 0 \tag{1.5.16}$$
The last approximation is based on ignoring the products of the small terms $\eta_i\eta_k$, $\epsilon_i\eta_k$. This brings us back to the same kind of analysis as was done earlier for the sum (1.5.1). The statistical error analysis following (1.5.7) is also valid. For a rigorous bound, it can be shown that if $m\delta < .01$, then
$$|\gamma_j| \le 1.01(m + 2 - j)\delta \qquad j = 1, \ldots, m \tag{1.5.17}$$
where $\delta$ is the unit round given in (1.2.12) [see Forsythe and Moler (1967, p. 92)].
Applying this to (1.5.15) and using (1.5.13),
$$|S - S_m| \le \sum_{j=1}^{m} |x_j y_j \gamma_j| \le \Bigl(\max_{1 \le j \le m} |\gamma_j|\Bigr)\sqrt{x^T x}\,\sqrt{y^T y} \tag{1.5.18}$$
This says nothing about the relative error, since $x^T y$ can be zero even though all $x_i$ and $y_i$ are nonzero.
These results say that the absolute error in $S_m \doteq x^T y$ does not increase very
rapidly, especially if true rounding is used and we consider the earlier statistical
analysis of (1.5.7). Nonetheless, it is often possible to easily and inexpensively
reduce this error a great deal further, and this is usually very important in linear
algebra problems.
Calculate each product xjyj in a higher precision arithmetic, and carry out the
summation in this higher precision arithmetic. When the complete sum has been
computed, then round or chop the result back to the original arithmetic precision.
For example, when $x_i$ and $y_i$ are in single precision, then compute the products and sums in double precision. [On most computers, single and double precision are fairly close in running time, although some computers do not implement double precision in their hardware, but only in their software, which is slower.] The resulting sum $S_m$ will satisfy
$$S_m \doteq (x^T y)(1 + \epsilon) \qquad |\epsilon| \le \delta \tag{1.5.19}$$
a considerable improvement on (1.5.18) or (1.5.15). This can be used in parts of a
single precision calculation, significantly improving the accuracy without having
to do the entire calculation in double precision. For linear algebra problems, this
may halve the storage requirements as compared to that needed for an entirely
double precision computation.
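A sketch of this suggestion in numpy (the vector length and random data are arbitrary): accumulate the single precision data once entirely in single precision, and once in double precision with a single rounding at the end.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50_000
x = rng.standard_normal(m).astype(np.float32)
y = rng.standard_normal(m).astype(np.float32)

# (a) Accumulate entirely in single precision, one term at a time.
s_single = np.float32(0.0)
for xi, yi in zip(x, y):
    s_single = np.float32(s_single + xi * yi)

# (b) Accumulate in double precision, then round once back to single precision.
s_double = np.float32(np.dot(x.astype(np.float64), y.astype(np.float64)))

reference = float(np.dot(x.astype(np.float64), y.astype(np.float64)))
print(abs(s_single - reference), abs(s_double - reference))
```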
1.6 Stability in Numerical Analysis
A number of mathematical problems have solutions that are quite sensitive to
small computational errors, for example rounding errors. To deal with this
phenomenon, we introduce the concepts of stability and condition number. The
condition number of a problem will be closely related to the maximum accuracy
that can be attained in the solution when using finite-length numbers and
computer arithmetic. These concepts will then be extended to the numerical
methods that are used to calculate the solution. Generally we will want to use
numerical methods that have no greater sensitivity to small errors than was true
of the original mathematical problem.
To simplify the presentation, the discussion is limited to problems that have
the form of an equation
F(x, y) = 0 (1.6.1)
The variable x is the unknown being sought, and the variable y is data on which
the solution depends. This equation may represent many different kinds of
problems. For example, (1) F may be a real valued function of the real variable
x, and y may be a vector of coefficients present in the definition of F; or (2) the
equation may be an integral or differential equation, with x an unknown function
and y a given function or given boundary values.
We say that the problem (1.6.1) is stable if the solution x depends in a
continuous way on the variable y. This means that if { Yn} is a sequence of values
approaching y in some sense, then the associated solution values { x n } must also
approach x in some way. Equivalently, if we make ever smaller changes in y,
these must lead to correspondingly smaller changes in x. The sense in which the
changes are small will depend on the norm being used to measure the sizes of the
vectors x and y; there are many possible choices, varying with the problem.
Stable problems are also called well-posed problems, and we will use the two
terms interchangeably. If a problem is not stable, it is called unstable or ill-posed.
Example  (a) Consider the solution of
$$ax^2 + bx + c = 0 \qquad a \ne 0$$
Any solution $x$ is a complex number. For the data in this case, we use $y = (a, b, c)$, the vector of coefficients. It should be clear from the quadratic formula
$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$
that the two solutions for $x$ will vary in a continuous way with the data $y = (a, b, c)$.
(b)  Consider the integral equation problem
$$\int_0^1 \frac{0.75\,x(t)}{1.25 - \cos(2\pi(s + t))}\,dt = y(s) \qquad 0 \le s \le 1 \tag{1.6.2}$$
This is an unstable problem. There are perturbations $\delta_n(s) = y_n(s) - y(s)$ for which
$$\max_{0 \le s \le 1} |\delta_n(s)| \to 0 \qquad \text{as } n \to \infty \tag{1.6.3}$$
and the corresponding solutions $x_n(s)$ satisfy
$$\max_{0 \le s \le 1} |x_n(s) - x(s)| = 1 \qquad \text{all } n \ge 1 \tag{1.6.4}$$
Specifically, define $y_n(s) = y(s) + \delta_n(s)$,
$$\delta_n(s) = \frac{1}{2^n}\cos(2n\pi s) \qquad 0 \le s \le 1, \quad n \ge 1$$
Then it can be shown that
$$x_n(s) = x(s) + \cos(2n\pi s)$$
thus proving (1.6.4).
If a problem (1.6.1) is unstable, then there are serious difficulties in attempting
to solve it. It is usually not possible to solve such problems without first
attempting to understand more about the properties of the solution, usually by
returning to the context in which the mathematical problem was formulated. This
is currently a very active area of research in applied mathematics and numerical
analysis [see, for example, Tikhonov and Arsenin (1977) and Wahba (1980)].
For practical purposes there are many problems that are stable in the
previously given sense, but that are still very troublesome as far as numerical
computations are concerned. To deal with this difficulty, we introduce a measure
of stability called a condition number. It shows that practical stability represents a
continuum of problems, some better behaved than others.
The condition number attempts to measure the worst possible effect on the
solution x of (1.6.1) when the variable y is perturbed by a small amount. Let oy
36 ERROR.: ITS SOURCES, PROPAGA1]0N, AND ANALYSIS
be a perturbation of y, and let x + ox be the solution of the perturbed equation
F(x +ox, y + oy) = 0 (1.6.5)
Define
(1.6.6)
We have used the notation II II to denote a measure of size. Recall the definitions
(1.1.16)-(1.1.18) for vectors from R" and C[ a, b ].-The-example (1.6.2) used the
norm (1.1.18) for mea.Suring the perturbations in both x andy. Commonly x and
y may be different kinds of variables, and then different norms are appropriate.
The supremum in (1.6.6) is taken over all small perturbations o y for which the
perturbed problem (1.6.5).will still make sense. Problems that are unstable lead to
K(x) = oo.
The number $K(x)$ is called the condition number for (1.6.1). It is a measure of the sensitivity of the solution $x$ to small changes in the data $y$. If $K(x)$ is quite large, then there exist small relative changes $\delta y$ in $y$ that lead to large relative changes $\delta x$ in $x$. But if $K(x)$ is small, say $K(x) \le 10$, then small relative changes in $y$ always lead to correspondingly small relative changes in $x$. Since numerical calculations almost always involve a variety of small computational errors, we do not want problems with a large condition number. Such problems are called ill-conditioned, and they are generally very hard to solve accurately.
Example  Consider solving
$$x - a^y = 0 \qquad a > 0 \tag{1.6.7}$$
Perturbing $y$ by $\delta y$, we have the perturbed solution $x + \delta x = a^{y + \delta y}$, and
$$\frac{\delta x}{x} = \frac{a^{y + \delta y} - a^y}{a^y} = a^{\delta y} - 1$$
For the condition number for (1.6.7),
$$K(x) = \frac{|\delta x / x|}{|\delta y / y|} = \left| \frac{a^{\delta y} - 1}{\delta y / y} \right|$$
Restricting $\delta y$ to be small, we have
$$K(x) \doteq |y \ln(a)| \tag{1.6.8}$$
Regardless of how we compute $x$ in (1.6.1), if $K(x)$ is large, then small relative changes in $y$ will lead to much larger relative changes in $x$. If $K(x) = 10^4$ and if the value of $y$ being used has relative error $10^{-7}$ due to using finite-length computer arithmetic and rounding, then it is likely that the resulting value of $x$ will have relative error of about $10^{-3}$. This is a large drop in accuracy, and there is little way to avoid it, except perhaps by doing all computations in longer precision computer arithmetic, provided $y$ can then be obtained with greater accuracy.
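A direct numerical check of (1.6.8) (a sketch; the values of a and y are arbitrary):

```python
import math

def cond_estimate(a, y, dy=1e-8):
    """Estimate K(x) for x - a^y = 0 by actually perturbing y."""
    x = a ** y
    rel_x = (a ** (y + dy) - x) / x
    rel_y = dy / y
    return abs(rel_x / rel_y)

a, y = 10.0, 4.0
print(cond_estimate(a, y))      # measured ratio of relative changes
print(abs(y * math.log(a)))     # the estimate |y ln(a)| from (1.6.8), about 9.21
```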
Example  Consider the $n \times n$ nonsingular matrix
$$Y = \begin{bmatrix} 1 & \frac{1}{2} & \frac{1}{3} & \cdots & \frac{1}{n} \\ \frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \cdots & \frac{1}{n+1} \\ \vdots & & & & \vdots \\ \frac{1}{n} & \frac{1}{n+1} & \cdots & & \frac{1}{2n-1} \end{bmatrix} \tag{1.6.9}$$
which is called the Hilbert matrix. The problem of calculating the inverse of $Y$, or equivalently of solving $YX = I$ with $I$ the identity matrix, is a well-posed problem. The solution $X$ can be obtained in a finite number of steps using only simple arithmetic operations. But the problem of calculating $X$ is increasingly ill-conditioned as $n$ increases.
The ill-conditioning of the numerical inversion of $Y$ will be shown in a practical setting. Let $\tilde{Y}$ denote the result of entering the matrix $Y$ into an IBM 370 computer and storing the matrix entries using single precision floating-point format. The fractional elements of $Y$ will be expanded in the hexadecimal (base 16) number system and then chopped after six hexadecimal digits (about seven decimal digits). Since most of the entries in $Y$ do not have finite hexadecimal expansions, there will be a relative error of about $10^{-6}$ in each such element of $\tilde{Y}$.
Using higher precision arithmetic, we can calculate the exact value of $\tilde{Y}^{-1}$. The inverse $Y^{-1}$ is known analytically, and we can compare it with $\tilde{Y}^{-1}$. For $n = 6$, some of the elements of $\tilde{Y}^{-1}$ differ from the corresponding elements in $Y^{-1}$ in the first nonzero digit. For example, the computed entry in row 6, column 2 is
$$(\tilde{Y}^{-1})_{6,2} = 73866.34$$
which disagrees with the exact value of $(Y^{-1})_{6,2}$ in its leading digit. This makes the calculation of $Y^{-1}$ an ill-conditioned problem, and it becomes increasingly so as $n$ increases. The condition number in (1.6.6) will be at least $10^6$, as a reflection of the poor accuracy in $\tilde{Y}^{-1}$ compared with $Y^{-1}$. Lest this be thought of as an odd pathological example that could not occur in practice, this
particular example occurs when doing least squares approximation theory (e.g.,
see Section 4.3). The general area of ill-conditioned problems for linear systems
and matrix inverses is considered in greater detail in Chapter 8.
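The growth of the ill-conditioning is easy to observe with numpy (a sketch; float32 plays the role of the IBM 370 single precision storage described above).

```python
import numpy as np

def hilbert(n):
    """The n x n Hilbert matrix Y with entries 1/(i + j - 1)."""
    i = np.arange(1, n + 1)
    return 1.0 / (i[:, None] + i[None, :] - 1.0)

for n in (4, 6, 8):
    Y = hilbert(n)
    Y_stored = Y.astype(np.float32).astype(np.float64)  # entries perturbed by ~1e-7
    inv_exact_data = np.linalg.inv(Y)          # inverse from the double precision entries
    inv_stored_data = np.linalg.inv(Y_stored)  # inverse from the rounded entries
    worst = np.max(np.abs(inv_stored_data - inv_exact_data) / np.abs(inv_exact_data))
    print(n, f"cond(Y) = {np.linalg.cond(Y):.2e}", f"worst relative change = {worst:.2e}")
```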
Stability of numerical algorithms A numerical method for solving a mathemati-
cal problem is considered stable if the sensitivity of the numerical answer to the
data is no greater than in the original mathematical problem. We will make this
more precise, again using (1.6.1) as a model for the problem. A numerical method for solving (1.6.1) will generally result in a sequence of approximate problems
$$F_n(x_n, y_n) = 0 \tag{1.6.10}$$
depending on some parameter, say $n$. The data $y_n$ are to approach $y$ as $n \to \infty$; the function values $F_n(z, w)$ are to approach $F(z, w)$ as $n \to \infty$, for all $(z, w)$ near $(x, y)$; and hopefully the resulting approximate solutions $x_n$ will approach $x$ as $n \to \infty$. For example, (1.6.1) may represent a differential equation initial value problem, and (1.6.10) may represent a sequence of finite-difference approximations depending on $h = 1/n$, as in and following (1.3.5). Another case would be where $n$ represents the number of digits being used in the calculations, and we may be solving $F(x, y) = 0$ as exactly as possible within this finite precision arithmetic.
For each of the problems (1.6.10) we can define a condition number $K_n(x_n)$, just as in (1.6.6). Using these condition numbers, define
$$\hat{K}(x) = \lim_{n \to \infty}\,\sup_{k \ge n} K_k(x_k) \tag{1.6.11}$$
We say the numerical method is stable if $\hat{K}(x)$ is of about the same magnitude as $K(x)$ from (1.6.6), for example, if
$$\hat{K}(x) \le 2K(x)$$
If this is true, then the sensitivity of (1.6.10) to changes in the data is about the same as that of the original problem (1.6.1).
Some problems and numerical methods may not fit easily within the frame-
work of (1.6.1), (1.6.6), (1.6.10), and (1.6.11), but there is a general idea of stable
problems and condition numbers that can be introduced and given similar
meaning. The main use of these concepts in this text is in (1) rootfinding for
polynomial equations, (2) solving differential equations, and (3) problems in
numerical linear algebra. Generally there is little problem with unstable numeri-
cal methods in this text. The main difficulty will be the solution of ill-conditioned
problems.
Example  Consider the evaluation of a Bessel function,
$$J_m(y) = \sum_{k=0}^{\infty} \frac{(-1)^k}{k!\,(m + k)!} \left(\frac{y}{2}\right)^{m + 2k} \qquad m \ge 0 \tag{1.6.12}$$
This series converges very rapidly, and the evaluation of $x = J_m(y)$ is easily shown to be a well-conditioned problem in its dependence on $y$.
Now consider the evaluation of $J_m(y)$ using the triple recursion relation
$$J_{m+1}(y) = \frac{2m}{y} J_m(y) - J_{m-1}(y) \qquad m \ge 1 \tag{1.6.13}$$
assuming $J_0(y)$ and $J_1(y)$ are known. We now demonstrate numerically that this
Table 1.5  Computed values of Jm(1)
m Computed Jm (1) True Jm(1)
0 .7651976866 .7651976866
1 .4400505857 .4400505857
2 .1149034848 .1149034849
3 .195633535E- 1 .1956335398E - 1
4 .2476636iE - 2 .2476638964E - 2
5 .2497361E - 3 .2497577302E - 3
6 .207248E- 4 .2093833800E - 4
7 -.10385E- 5 .1502325817E - 5
is an unstable numerical method for evaluating $J_m(y)$, for even moderately large $m$. We take $y = 1$, so that (1.6.13) becomes
$$J_{m+1}(1) = 2m\,J_m(1) - J_{m-1}(1) \qquad m \ge 1 \tag{1.6.14}$$
We use values for $J_0(1)$ and $J_1(1)$ that are accurate to 10 significant digits. The subsequent values $J_m(1)$ are calculated from (1.6.14) using exact arithmetic, and the results are given in Table 1.5. The true values are given for comparison, and they show the rapid divergence of the approximate values from the true values. The only errors introduced were the rounding errors in $J_0(1)$ and $J_1(1)$, and they cause an increasingly large perturbation in $J_m(1)$ as $m$ increases.
The use of three-term recursion relations
$$x_{m+1} = a_m x_m + b_m x_{m-1} \qquad m \ge 1$$
is a common tool in applied mathematics and numerical analysis. But as previously shown, they can lead to unstable numerical methods. For a general analysis of triple recursion relations, see Gautschi (1967). In the case of (1.6.13) and (1.6.14), large loss-of-significance errors are occurring.
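The experiment of Table 1.5 takes only a few lines to repeat (a sketch; scipy.special.jv supplies reference values, and with double precision starting values the visible divergence sets in at a somewhat larger m than in the table, but the mechanism is the same).

```python
from scipy.special import jv   # reference values of J_m(y)

def forward_recursion(y, m_max, J0, J1):
    """Compute J_m(y), m = 0..m_max, by the forward recursion (1.6.14)."""
    vals = [J0, J1]
    for m in range(1, m_max):
        vals.append((2.0 * m / y) * vals[m] - vals[m - 1])
    return vals

vals = forward_recursion(1.0, 20, jv(0, 1.0), jv(1, 1.0))
for m in (5, 10, 15, 20):
    print(m, vals[m], jv(m, 1.0))    # computed value versus the true J_m(1)
```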
Discussion of the Literature
A knowledge of computer arithmetic is important for programmers who are
concerned with numerical accuracy, particularly when writing programs that are
to be widely used. Also, when writing programs to be run on various computers,
their different floating-point characteristics must be taken into account. Classic
treatments of floating-point arithmetic are given in Knuth (1981, chap. 4) and
Sterbenz (1974).
The topic of error propagation, especially that due to rounding/chopping
error, has been difficult to treat in a precise, but useful manner. There are some
important early papers, but the current approaches to the subject are due in large
part to the late J. H. Wilkinson. Much of his work was in numerical linear
algebra, but he made important contributions to many areas of numerical
analysis. For a general introduction to his techniques for analyzing the propa-
gation of errors, with applications to several important problems, see Wilkinson
(1963), (1965), (1984).
Another approach to the control of error is called interval analysis. With it, we carry along an interval $[x_l, x_u]$ in our calculations, rather than a single number $x_A$, and the numbers $x_l$ and $x_u$ are guaranteed to bound the true value $x_T$. The difficulty with this approach is that the size of $x_u - x_l$ is generally much larger than $|x_T - x_A|$, mainly because the possible cancellation of errors of opposite sign is often not reflected in $x_l$ and $x_u$. For an introduction to
this area, showing how to improve on these conservative bounds in particular
cases, see Moore (1966). More recently, this area and that of computer arithmetic
have been combined to give a general theoretical framework allowing the
development of algorithms with rigorous error bounds. As examples of this area,
see the texts of Alefeld and Herzberger (1983), and Kulisch and Miranker (1981),
the symposium proceedings of Alefeld and Grigorieff (1980), and the survey in
Moore (1979).
The topic of ill-posed problems was just touched on in Section 1.6, but it has
been of increasing interest in recent years. There are many problems of indirect
physical measurement that lead to ill-posed problems, and in this form they are
called inverse problems. The book by Lavrentiev (1967) gives a general introduc-
tion, although it discusses mainly (1) analytic continuation of analytic functions
of a complex variable, and (2) inverse problems for differential equations. One of
the major numerical tools used in dealing with ill-posed problems is called
regularization, and an extensive development of it is given in Tikhonov and
Arsenin (1977). As important examples of the more current literature on numeri-
cal methods for ill-posed problems, see Groetsch (1984) and Wahba (1980).
Two new types of computers have appeared in the last 10 to 15 years, and they
are now having an increasingly important impact on numerical analysis. These
are microcomputers and supercomputers. Everyone is aware of microcomputers;
scientists and engineers are using them for an increasing amount of their
numerical calculations. Initially the arithmetic design of microcomputers was
quite poor, with some having errors in their basic arithmetic operations. Re-
cently, an excellent new standard has been produced for arithmetic on microcom-
puters, and with it one can write high-quality and efficient numerical programs.
This standard, the IEEE Standard for Binary Floating-Point Arithmetic, is de-
scribed in IEEE (1985). Implementations on the major families of microprocessors are becoming available; for example, see Palmer and Morse (1984).
The name supercomputer refers to a variety of machine designs, all having in
common the ability to do very high-speed numerical computations, say greater
than 20 million floating-point operations per second. This area is developing and
changing very rapidly, and so we can only give a few references to hint at the effect of these machines on the design of numerical algorithms. Hockney and
Jesshope (1981), and Quinn (1987) are general texts on the architecture of
supercomputers and the design of numerical algorithms on them; Parter (1984) is
a symposium proceedings giving some applications of supercomputers in a
variety of physical problems; and Ortega and Voigt (1985) discuss supercomputers as they are being used to solve partial differential equations. These
machines will become increasingly important in all areas of computing, and their
architectures are likely to affect smaller mainframe computers of the type that are
now more widely used.
Symbolic mathematics is a rapidly growing area, and with it one can do
analytic rather than numerical mathematics, for example, finding antiderivatives
exactly when possible. This area has not significantly affected numerical analysis
to date, but that appears to be changing. In many situations, symbolic mathe-
matics are used for part of a calculation, with numerical methods used for the
remainder of the calculation. One of the most sophisticated of the programming
languages for carrying out symbolic mathematics is MACSYMA, which is
described in Rand (1984). For a survey and historical account of programming
languages for this area, see Van Hulzen and Calmet (1983).
We conclude by discussing the area of mathematical software. This area deals
with the implementation of numerical algorithms as computer programs, with
careful attention given to questions of accuracy, efficiency, flexibility, portability,
and other characteristics that improve the usefulness of the programs. A major
journal of the area is the ACM Transactions on Mathematical Software. For an
extensive survey of the area, including the most important program libraries that
have been developed in recent years, see Cowell (1984). In the ap_pendix to this
book, we give a further discussion of some currently available numerical analysis
computer program packages.
Bibliography
Alefeld, G., and R. Grigorieff, eds. (1980). Fundamentals of Numerical Computa-
tion (Computer-oriented Numerical Analysis). Computing Supplementum 2,
Springer-Verlag, Vienna.
Alefeld, G., and J. Herzberger (1983). Introduction to Interval Computations.
Academic Press, New York.
Bender, E. (1978). An Introduction to Mathematical Modelling. Wiley, New York.
Cowell, W., ed. (1984). Sources and Development of Mathematical Software.
Prentice-Hall, Englewood Cliffs, N.J.
Fadeeva, V. (1959). Computational Methods of Linear Algebra. Dover, New York.
Forsythe, G. (1969). What is a satisfactory quadratic equation solver? In B.
Dejon and P. Henrici (eds.), Constructive Aspects of the Fundamental
Theorem of Algebra, pp. 53-61, Wiley, New York.
Forsythe, G., and C. Moler (1967). Computer Solution of Linear Algebraic
Systems. Prentice-Hall, Englewood Cliffs, NJ.
Forsythe, G., M. Malcolm, and C. Moler (1977). Computer Methods for Mathe-
matical Computations. Prentice-Hall, Englewood Cliffs, NJ.
Gautschi, W. (1967). Computational aspects of three term recurrence relations.
SIAM Rev., 9, 24-82.
42 ERROR: ITS SOURCES, PROPAGATION, AND ANALYSIS
Groetsch, C. (1984). The Theory of Tikhonov Regularization for Fredholm Equa-
tions of the First Kind. Pitman, Boston .
. Henrici, P. (1962). Discrete Variable Methods in Ordinary Differential Equations.
Wiley, New York.
Hockney, R., and C. Jesshope (1981). Parallel Computers: Architecture, Program-
ming, and Algorithms. Adam Hilger, Bristol, England.
Hogg, R., and A. Craig (1978). Introduction to Mathematical Statistics, 4th ed.
Macmillan, New York.
Institute of Electrical and Electronics Engineers (1985). Proposed standard for
binary floating-point arithmetic. (IEEE Task P754), Draft 10.0. IEEE
Society, New York.
Knuth, D. (1981). The Art of Computer Programming, vol. 2, Seminumerica/
Algorithms, 2nd ed. Addison-Wesley, Reading, Mass.
Kulisch, U., and W. Miranker (1981). Computer Arithmetic in Theory and Prac-
tice. Academic Press, New York.
Lavrentiev, M. (1967). Some Improperly Posed Problems of Mathematical Physics.
Springer-Verlag, New York.
Lin, C., and L. Segal (1974). Mathematics Applied to Deterministic Problems in the
Natural Sciences. Macmillan, New York.
Maki, D., and M. Thompson (1973). Mathematical Models and Applications.
Prentice-Hall, Englewood Cliffs, N.J.
Moore, R. (1966). Interval Analysis. Prentice-Hall, Englewood Cliffs, N.J.
Moore, R. (1979). Methods and Applications of Interval Analysis. Society for
Industrial and Applied Mathematics, Philadelphia.
Ortega, J., and R. Voigt (1985). Solution of partial differential equations on
vector and parallel computers. SIAM Rev., 27, 149-240.
Parter, S., ed. (1984). Large Scale Scientific Computation. Academic Press, New
York.
Pahner, J., and S. Morse (1984). The 8087 Primer. Wiley, New York.
Quinn, M. (1987). Designing Efficient Algorithms for Parallel Computers.
McGraw-Hill, New York.
Rand, R. (1984). Computer Algebra in Applied Mathematics: An Introduction to
MA CSYMA. Pitman, Boston.
Rubinow, S. (1975). Introduction to Mathematical Biology. Wiley, New York.
Sterbenz, P. (1974). Floating-Point Computation. Prentice-Hall, Englewood Cliffs,
N.J.
Tikhonov, A., and V. Arsenin (1977). Solutions of Ill-posed Problems. Wiley, New
York.
Van Hulzen, J., and J. Calmet (1983). Computer algebra systems. In B. Buch-
berger, G. Collins, R. Loos (eds.), Computer Algebra: Symbolic and Alge-
braic Computation, 2nd ed. Springer-Verlag, Vienna.
Wahba, G. (1980). Ill-posed problems: Numerical and statistical methods for
mildly, moderately, and severely ill-posed problems with noisy data. Tech.
PROBLEMS 43
Rep. # 595, Statistics Department, Univ. of Wisconsin, Madison. Prepared
for the Proc. Int. Symp. on Ill-Posed Problems, Newark, Del., 1979.
Wahba, G. (1984). Cross-validated spline methods for the estimation of multi-
variate functions from data on functionals, in Statistics: An Appraisal,
H. A. David and H. T. David (eds.), Iowa State Univ. Press, Ames,
pp. 205-235.
Wilkinson, J. (1963). Rounding Errors in Algebraic Processes. Prentice-Hall,
Englewood Cliffs, N.J.
Wilkinson, J. (1965). The Algebraic Eigenvalue Problem. Oxford, England.
Wilkinson, J. (1984). The perfidious polynomial. In Golub (ed.), Studies in
Num.erical Analysis, pp. 1-28, Mathematical Association of America.
Problems
1. (a) Assume f(x) is continuous on as x s b, and consider the average
1 n
S = - L f(xj)
n j=l
with all points xj in the interval [a, b ]. Show that
S=f(r)
for some r in [a, b]. Hint: Use the intermediate value theorem and
consider the range of values of f(x) and thus of S.
(b) part (a) to the sum
n
S= L:wJ(x)
j-1
with all xj in [a, b) and all wj 0.
2. Derive the following inequalities:
for all x, z .s 0.
'IT 'IT
(b) lx- zi s ltan(x)- tan(z)l --<x z<-
2 ' 2
0 s y s x, p 1.
3. (a) Bound the error in the approximation
sin(x) = x lxl s 8
44 ERROR: ITS SOURCES, PROPAGATION, AND ANALYSIS
(b) For small values of 8, measure the relative error in sin(x) == x by
using
sin(x)- x sin(x)- x
sin(x) = x
x+O
Bound this modified relative error for lx I .:s;: 8. Choose 8 to make this
error less than .01, corresponding to a 1 percent error.
4. Assuming g E C[ a, b ], show
5. Construct a Taylor series for the following functions, and bound the error
when truncating after n terms.
1
1
x
(a) - e-
12
dt (b) sin-
1
(x)
lxl < 1
X 0
1 xtan-
1
tdt
(c)
-;fo t
(d) cos(x) + sin(x)
(e) log(l - x) -1 <X< 1 (f)
[ 1 +X]
log--
.1-x
-1<x<1
6. (a) Using the result (1.1.11), we can show
1T 00 ( -1)2j+l
- = tan-l (1) = :E ---
4 j-0 2j + 1
and we can obtain 1r by- multiplying by 4. Why is this not a practical
way to compute 1r?
(b) Using a Taylor polynomial approximation, give a practical way to
evaluate 11.
7. Using Taylor's theorem for functions of two variables, find linear and
quadratic approximations to the following functions f(x, y) for small
values of x and y. Give the tangent plane function z = p(x, y) whose
graph is tangent to that of z = f(x, y) at (0, 0, /(0, 0)).
(a) ..jl + 2x - y
(c) x cos(x ..,.. y)
(b)
1+x
l+y
(d) cos(x + V1r
2
+ y)
PROBLEMS 45
8. Consider the second-order divided difference f[x
0
, x
1
, x
2
] defined in
(1.1.13).
(a) Prove the property (1.1.15), that the order of the arguments x
0
, x
1
, x
2
does not affect the value of the divided difference.
(b) Prove formula (1.1.14),
for some r between the minimum and maximum of Xo, X1, and X2.
Hint: From part (a), there is no loss of generality in assuming
x
0
< x
1
< x
2
Use Taylor's theorem to reduce f[x
0
, x
1
, x
2
], expand-
ing about x
1
; and then use the intermediate value theorem to simplify
the error term.
(c) Assuming f(x) is twice cOntinuously differentiable, show that
f[x
0
, x
1
, x
2
] can be extended continuously to the case where some or
all of the points x
0
, x
1
, and x
2
are-coincident.-For_example, show
exists and compute a formula for it.
9. (a) Show that the vector norms (1.1.16) and (1.1.18) satisfy the three
general properties of norms that are listed following (1.1.18).
(b) Show !lxlb in (1.1.17) is a vector norm, restricting yourself to the
n = 2 case.
(c) Show that the matrix norm (1.1.19) satisfies (1.1.20) and (1.1.21). For
simplicity, consider only matrices of order 2 X 2.
10. Convert the following numbers to their decimal equivalents.
(a) (10101.101h (b) (2A3.FF)
16
(c) (.101010101 ... h
(d) (.AAAA ... h
6
(e) (.00011001100110011 ... h
(f) (11 ... 1 h with the parentheses enclosing n 1s.
11. To convert a positive decimal integer x to its binary equivalent,
46 ERROR: ITS SOURCES, PROPAGATION, AND ANALYSIS
begin by writing
Based on this, use the following algorithm.
(i) x
0
== x; j == 0
(H) While x
1
=I= 0, Do the following:
a
1
==-Remainder of integer divide x/2-
x
1
+
1
==Quotient of integer divide x/2
j :=j + 1
End While
The language of the algorithm should be self-explanatory. Apply it to
convert the following integers to their binary equivalents.
(a) 49 (b) 127 (c) 129
12. To convert a positive decimal fraction x < 1 to its binary equivalent
begin by writing
Based on this, use the following algorithm.
(i) x
1
== x; j == 1
(ii) While x
1
=I= 0, Do the following:
ai == Integer part of 2 xi
xi+l == Fractional part of 2 xi
j==j+1
End While
Apply this algorithm to convert the following decimal numbers to their
binary equivalents.
(a) .8125 (b) 12.0625 (c) .1 (d) .2 (e) .4
(0 = .142857142857 ...
13. Generalize Problems 11 and 12 to the conversion of a decimal integer to its
hexadecimal equivalent.
PROBLEMS 47
14. Predict the output of the following section of Fortran code if it is run on a
binary computer that uses chopped arithmetic.
I= 0
X= 0.0
H = .1
10 I= I+ 1
X=X+H
PRINT * I, X
IF (X .LT. 1.0) GO TO 10
Would the outcome be any different if the statement "X =X + H" was
replaced by "X = I* H"?
15. Derive the bounds (1.2.9) for the relative error in the rounded floating-point
representation of (1.2.5)-(1.2.6).
16. Derive the upper bound result M = pt given in (1.2.16).
17. (a) Write a program to-create-an-overflow-error-on your computer. For
example, input a number x and repeatedly square it.
(b) Write a program to experimentally determine the largest allowable
floating-point number.
18. (a) A simple model for population growth is
dN
-=kN
dt
with N(t) the population at time t and k > 0. Show that this implies
a geometric rate of increase L.1 population:
N(t+1)=CN(t) t ~ t
Find a formula for C.
(b) A more sophisticated model for population growth is
dN
- = kN[1- bN]
dt
with b, k > 0 and 1 - bN
0
> 0. Find the solution to this differential
equation problem. Compare its solution to that of part (a). Describe
the differences in population growth predicted by the two models, for
both large and small values of t.
.. !
j
48 ERROR: ITS SOURCES, PROPAGATION, AND ANALYSIS
19. On your computer, evaluate the two functions
(a) f(x) = x
3
- 3x
2
+ 3x - 1
(b) f(x) = x
3
+ 2x
2
- x - 2
Evaluate them for a large sampling of values of x around 1, and try to
produce the kind of behavior shown in Figure 1.2. Compare the results for
the two functions.
20. Write-a program to compute-experimentally
Limit(xP + yP)
11
P
p-oo
where x and y are positive numbers. First do the computation in the form
just shown. Second, repeat the computation with the idea used in (1.3.8).
Run the program for a variety of large and small values of X and y, for
example, X= y = 10
10
and X= y = 10-
10
21. For the following numbers x A and Xr, how many significant digits are
there in x A with respect to xr?
22.
23.
24.
25.
(a) X A = 451.023, Xr = 451.01
(b) X A = - .045113, Xr = - .045l8
(c) XA = 23.4213, Xr = 23.4604
Let all of the following numbers be correctly rounded to the number of
digits shown: (a) 1.1062 + .947, (b) 23.46 - 12.753, (c) (2.747)(6.83), (d)
8.473/.064. For each calculation, determine the smallest interval in which
the result, using true instead of round,ed values, must be located.
Prove the formula (1.4.5) for Rel(xAIYA).
Given the ~ t o n x
2
- 40x + 1 = 0, find its roots to five significant
digits. Use -./399 ,;. 19.975, correctly rounded to five digits.
Give exact ways of avoiding loss-of-significance errors in the following
computations.
(a) log(x + 1)- log(x) large x
(b) sin(x)- sin(y) x = y
(c) tan(x)- tan(y) x = y
(d)
(e)
1- cos(x)
x2
3
t1 +X - 1,
PROBLEMS 49
x,;,o
x,;,o
26. Use Taylor approximations to avoid the loss-of-significance error in the
following computations.
(a) f(x) = _2_x_
(b)
log(1- x) + xexl
2
f(x) = 3
X
In both cases, what is Limit/(x)?
x-+0
27. Consider evaluating cos(x) for large x by using the Taylor approximation
(1.1.5),
28.
29.
30.
x2 x2n
cos(x) _:__1-- + ... +(-1r--
2! (2n)!
To see the difficulty involved in using this approximation, use it to evaluate
cos (2 'IT) = 1. Determine n so that the Taylor approximation error is less
than .0005. Then repeat the type of computation used in (1.4.17) and Table
1.2. How should cos ( x) be evaluated for larger values of x?
Suppose you wish to compute the values of (a) cos (1.473), (b) tan -
1
(2.621),
(c) ln (1.471), (d) e
2
653
In each case, assume you have only a table of values
of the function with the argument x given in increments of .01. Choose the
table value whose argument is nearest to your given argument. Estimate the
resulting error.
Assume that x A = .937 has three significanJ digits with respect to Xr.
Bound the relative error in xA .. For f(x) = 1- x, bound the error and
relative error in f(xA) with respect to f(xr).
The numbers given below are correctly rounded to the number of digits
shown. Estimate the errors in the function values in terms of the errors in
the arguments. Bound the relative errors.
(a) sin [(3.14)(2.685)] (b) ln (1.712)
(c) (1.
56
)3.414
31. Write a computer subroutine to form the sum
.50 ERROR: ITS SOURCES, PROPAGATION, AND ANALYSIS
in three ways: (1) from smallest to largest, (2) from largest to smallest, and
(3) in double precision, with a single precision rounding/chopping error at
the conclusion of the summation. Use the double precision result to find the
error in the two single precision results. Print the results. Also write a main
program to create the following series in single precision, and use the
subroutine just given to sum the series. [Hint: In writing the subroutine, for
simplicity assume the terms of the series are arranged from largest to
smallest.]
(a)
n 1
2:--:- (b>
1 1
n 1
~ (c)
1 J
n 1
I: -:J (d)
1 J
n ( -1)j
2:-.
J
32. Consider the product a
0
a
1
am, where a
0
, a
1
, , am are m + 1 num-
bers stored in a computer that uses n digit base fJ arithmetic. Define
Pi= fl(a
0
a
1
), p
2
= fl(a
2
p
1
), p
3
= fl(a
3
p
2
), ... , Pm = fl(amPm-
1
). If we
write Pm = a
0
a
1
am(! + w), determine an estimate for w. Assume that
a; = fl (a;), i = 0, 1, ... , m. What is a rigorous bound for w? What is a
statistical estimate for the size of w?
TWO
ROOTFINDING
FOR NONLINEAR
EQUATIONS
Finding one or more roots of an equation
f(x)=O (2.0.1)
is one of the more commonly occurring problems of applied mathematics. In
most cases explicit solutions are not available and we must be satisfied with being
able to find a root to any specified degree of accuracy. The numerical methods
for finding the_roots-are-called iterative methods, and they are the main subject of
this chapter.
We begin with iterative methods for solving (2.0.1) when f(x) is any continu-
ously differentiable real valued function of a real variable x. The iterative
methods for this quite general class of equations will require knowledge of one or
more initial guesses x
0
for the desired root o: of f(x). An initial guess x
0
can
usually be found by using the context in which the problem first arose; otherwise,
a simple graph of y = f(x) will often suffice for estimating x
0
A second major problem discussed in this chapter is that of finding one or
more roots of a polynomial equation
(2.0.2)
The methods of the first problem are often specialized to deal with (2.0.2), and
that will be our approach. But there is a large. literature on methods that have
been developed especially for polynomial equations, using their special properties
in an essential way. These are the most important methods used in creating
automatic computer programs for solving (2.0.2), and we will reference some such
methods.
The third class of problems to be discussed is the solution of nonlinear systems
of equations. These systems are very diverse in form, and the associated numeri-
cal analysis is both extensive and sophisticated. We will just touch on this
subject, indicating some successful methods that are fairly general in applicabil-
ity. An adequate development of the subject requires a good knowledge of both
theoretical and numerical linear algebra, and these topics are not taken up until
Chapters 7 through 9.
The last class of problems discussed in this chapter are optimization problems.
In this case, we seek to maximize or minimize a real valued function /(x
1
, .. , xn)
and to find the point (x
1
, . , xn) at which the optimum is attained. Such
53
54 ROOTFINDING FOR NONLINEAR EQUATIONS
y
I
y=a
Figure 2.1 Iterative solution of a- (1/x) = 0.
problems can often be reduced to a system of nonlinear equations, but it is
usually better to develop special methods to carry out the optimization directly.
The area of optimization is well-developed and extensive. We just briefly intro-
duce and survey the subject.
To illustrate the concept of an iterative method for finding a root of (2.0.1), we
begin with an example. Consider solving
1
/(x)=a--=0
X
(2.0.3)
for a given a > 0. This problem has a practical application to computers without
a machine divide operation. This was true of some early computers, and some
modern-day computers also use the algorithm derived below, as part of their
divide operation.
Let x = 1/a be an approximate solution of the equation. At the point
(x
0
, /(x
0
)), draw the tangent line to the graph of y = f(x) (see Figure 2.1). Let
x
1
be the point at which the tangent line intersects the x-axis. It should be an
improved approximation of the root a.
To obtain an equation for x
1
, match the slopes obtained from the tangent line
and the derivative of f(x) at x
0
/'(xo) = /(x
0
) - 0
Xo- xl
Substituting from (2.0.3) and manipulating, we obtain
x
1
= x
0
(2 - ax
0
)
ROOTFINDING FOR NONLINEAR EQUATIONS 55
The general iteration formula is then obtained by repeating the process, with x
1
replacing x
0
, ad infinitum, to get
(2.0.4)
A form more convenient for theoretical purposes is obtained by introducing
the scaled residual
(2.0.5)
Using it,
n ~ (2.0.6)
For the error,
1 'n
e =--x =-
" a n a
(2.0.7)
We will analyze the convergence of this method, its speed, and its dependence
on x
0
First,
(2.0.8)
Inductively,
n ~ (2.0.9)
From (2.0.7), the error en converges to zero as n ~ oo if and only if rn converges
to zero. From (2.0.9), '"converges to zero if and only if lrol < 1, or equivalently,
-1 < 1- ax
0
< 1
2
0 < x
0
<-
a
(2.0.10)
. In order that x" converge to 1/a, it is necessary and sufficient that x
0
be chosen
to satisfy (2.0.10).
To examine the speed of convergence when (2.0.10) is satisfied, we obtain
formulas for the error and relative error. For the speed of convergence when
(2.0.10) is satisfied,
2 2 2
'n+l '" ena
e =--=-=--
n+l a a a
(2.0.11)
n ~ (2.0.12)
56 ROOTFINDING FOR NONLINEAR EQUATIONS
The notation Rel (x") denotes the relative error in x". Based on equation (2.0.11),
we say e" converges to zero tpUJdratically. To illustrate how rapidly the error will
decrease, suppose that Rel(x
0
) = 0.1. Then Rel(x
4
) = 10-
16
Each iteration
doubles the number of significant digits.
This example illustrates the construction of an iterative method for solving an
equation; a complete convergence analysis has been given. This analysis included
a proof of convergence, a determination of the interval of convergence for the
choice of x
0
, and a determination of the speed of convergence. These ideas are
examined in more detail in the following sections using more general approaches
to solving (2.0.1).
Definition A sequence of iterates { xnln 0} is said to converge with order
p 1 to a point a if
Ia - x I _<cia- xniP
n+l
(2.0.13)
for some c > 0. If p = 1, the sequence is said to converge linearly to
a. In that case, we require c < 1; the constant c is called the rate of
linear convergence of x n to a.
Using this definition, the earlier example (2.0.5)-(2.0.6) has order of conver-
gence 2, which is also called quadratic convergence. This definition of order is not
always a convenient one for some linearly convergent iterative methods. Using
induction on (2.0.13) with p = 1, we obtain
n=2:0 (2.0.14)
This shows directly the convergence of x" to a. For some iterative methods we can
show (2.0.14) directly, whereas (2.0.13) may not be true for any c < 1. In such a
case, the method will still be said to converge linearly with a rate of c.
2.1 The Bisection Method
Assume that f(x) is continuous on a given interval [a, b] and that it also satisfies
f(a)f(b) < 0 (2.1.1)
Using the value Theorem 1.1 from Chapter 1, the function f(x)
must have at least one root in [a, b]. Usually [a, b] is chosen to contain only one
root a, but the following algorithm for the bisection method will always converge
to some root a in [a, b], because of (2.1.1).
Algorithm Bisect (J, a, b,root, !)
1. Define c = (a+ b)/2.
l. If b - c !, then accept root = c, and exit.
I
I
J
I
.I
~
II
!1
THE BISECTION METHOD 57
3. If sign(f(b)) sign(f(c)) s; 0, then a:= c; otherwise, b :=c.
4. Return to step 1.
The interval [a, b] is halved in size for every pass through the algorithm.
Because of step 3, [a, b] will always contain a root of f(x). Since a root a is in
[a, b], it must lie within either [a, c] or [c, b]; and consequently
lc- al ,::5; b- c = c- a
This is justification for the test in step 2. On completion of the algorithm, c will
be an approximation to the root with
lc- al ,::5;
Example Find the largest real root a of
f( X) =: x
6
- X - 1 = 0 (2.1.2)
It is .straightforward to show that 1 < a < 2, and we will use this as our initial
interval [a, b ]. The algorithm Bisect was used with t:. = .00005. The results are
shown in Table 2.1. The first two iterates give the initial interval enclosing a, and
the remaining values en, n ;;:: 1, denote the successive midpoints found using
Bisect. The final value c
15
= 1.13474 was accepted as an approximation to a with
Ia- C1sl s; .00004.
The true solution is
a= 1.13472413840152 (2.1.3)
The true error in c
15
is
a - c
15
=. - .000016
It is much smaller than the predicted error bound. It might seem as though we
could have saved some computation by stopping with an earlier iterate. But there
Table 2.1 Example of bisection method
n
en f(en) n en /(en)
2.0 = b 61.0 8 1.13672 .02062
1.0 =a -1.0 9 1.13477 .00043
1 1.5 8.89063 10 1.13379 -.00960
2 1.25 1.56470 11 1.13428 -.00459
3 1.125 -.09771 12 1.13452 -.00208.
4 1.1875 .61665 13 1.13464 -.00083
5 1.15625 .23327 14 1.13470 -.00020
6 1.14063 .06158 15 1.13474 .00016
7 1.13281 -.01958
58 ROOTFINDING FOR NONLINEAR EQUATIONS
is no way to predict the possibly better accuracy in an earlier iterate, and thus
there is no way we can know the iterate is sufficiently accurate. For example, c
9
is
sufficiently accurate, but there was no way of telling that fact during the
computation.
To examine the speed of convergence, let en denote the nth value of c in the
algorithm. Then it is easy to see that
a = limite"
n ..... oo
(2.1.4)
where b - a denotes the length of the original interval input into Bisect. Using
the variant (2.0.14) for defining linear convergence, we say that the bisection
method converges linearly with a rate of t. The actual error may not decrease by
a factor oft at each step, but the average rate of decrease is t, based on (2.1.4).
The preceding example illustrates the result (2.1.4).
There are several deficiencies in the algorithm Bisect. First, it does not take
account of the limits of machine precision, as described in Section 1.2 of Chapter
1. A practical program would take account of the unit round on the machine [see
(1.2.12)], adjusting the given t: if necessary. The second major problem with
Bisect is that it converges very slowly when compared with the methods defined
in the following sections. The major advantages of the bisection method are: (1)
it is guaranteed to converge (provided f is continuous on [a, b] and (2.1.1) is
satisfied), and (2) a reasonable error bound is available. Methods that at every
step giveJupper and lower bounds on the root a are called enclosure methods. In
Section 2.8, we describe an enclosure algorithm that combines the previously
stated advantages of the bisection method with the faster convergence of the
secant method (described in Section 2.3).
2.2 Newton's Method
Assume that an initial estimate x
0
is known for the desired root a of f(x)= 0.
Newton's method will produce a sequence of iterates { x": n 2 1 }, which we hope
will converge to a. Since x
0
is assumed close to a, approximate the graph of
y = f(x) in the vicinity of its root a by constructing its tangent line at
(x
0
, /(x
0
)). Then use the root of this tangent line to approximate a; call this new
approximation x
1
Repeat this process, ad infinitum, to obtain a sequence of
iterates x". As with the example (2.0.3) beginning this chapter, this leads to the
iteration formula
n20 (2.2.1)
The process is illustrated in Figure 2.2, for the iterates x
1
and x
2
NEWTON'S METHOD 59
Figure 2.2 Newton's method.
Newton's method is the best known procedure for finding the roots of an
equation. It has been generalized in many ways for the solution of other, more
difficult nonlinear problems, for example, systems of nonlinear equations and
nonlinear integral and differential equations. It is not always the best method for
a given problem, but its formal simplicity and its great speed often lead it to be
the first method that people use in attempting to solve a nonlinear problem.
As another approach to (2.2.1), we use a Taylor series development. Ex-
panding f(x) about xn,
with ~ between x and xn. Letting x =a and using f(a) = 0, we solve for a to
obtain
f(xJ (a- xJ
2
f ( ~ n )
a=x ----
n f'(xn)
---
2 f'(xn)
with ~ n between xn and a. We can drop the error term (the last term) to obtain a
better approximation to a than xn, and we recognize this approximation as xn+I
from (2.2.1). Then
n ~ O (2.2.2)
60 ROOTFINDING FOR NONLINEAR EQUATIONS
Table 2.2 Example of Newton's method
n Xn f(x") a - xn
x"+J - xn
0 2.0 61.0 -8.653E- 1
1 1.680628273 19.85 - 5.459E- 1 -2.499E- 1
2 1.430738989 6.147 -2.960E- 1 -1.758E- 1
3 1.254970957 1.652 -1.202E- 1 -9.343E- 2
4 1.161538433 2.943E- 1 -2.681E- 2 -2.519E- 2
5 1.136353274 1.683E- 2 -1.629E- 3 -1.623E- 3
6 1.134730528 6.574E- 5 -6.390E- 6 -6.390E- 6
7 1.134724H9 1.015E- 9 -9.870E- 11 -9.870E- 11
This formula will be used to show that Newton's method has a quadratic order of
convergence, p = 2 in (2.0.13).
Example We again solve for the largest root of
f(x) = x
6
- x - 1 = 0
Newton's method (2.2.1) is used, and the results are shown in Table 2.2. The
computations were carried out in approximately 16-digit floating-point arith-
metic, and the table iterates were rounded from these more accurate computa-
tions. The last column, Xn+l - Xn, is an estimate of a- xn; this is discussed
later in the section.
The Newton method converges very rapidly once an iterate is fairly close to
the root. This is illustrated in iterates x
4
, x
5
, x
6
, x
1
The iterates x
0
, x
1
, x
2
, x
3
show the slow initial convergence that is possible with a poor initial guess x
0
If
the initial guess x
0
= 1 had been chosen, then x
4
would have been accurate to
seven significant digits and x
5
~ o fourteen digits. These results should be
compared with those of the bisection method given in Table 2.1. The much
greater speed of Newton's method is apparent immediately.
Convergence analysis A convergence result will be given, showing the speed of
convergence and also an interval from which initial guesses can be chosen.
Theorem 2.1 Assume f(x), f'(x), and f"(x) are continuous for all x in some
neighborhood of a, and assume /(a)= 0, /'(ex)* 0. Then if x
0
is
chosen sufficiently close to a, the iterates xn, n ~ 0, of (2.2.1) will
converge to a. Moreover,
a - x /"(a}
li . n+l
~ ~ (a...:. xJ
2
= - 2/'(cr.}
(2.2.3}
proving that the iterates have an order of convergence p = 2.
NEWTON'S METHOD 61
Proof Pick a sufficiently small interval I= [a - t:, a+ t:] on which f'(x) =t- 0
[this exists by continuity of f'(x}], and then let
Max lf"(x) I
M = .....::;.:XE=..!/'------
2 Min l!'(x) I
xe/
From (2.2.2),
Pick Ia- x
0
1 .::;; t: and Mia- x
0
1 < 1. Then Mia- x
1
1 < 1, and
Mia- x
1
1 .::;; Mia- x
0
1, which says Ia- x
1
1 .::;; t:. We can apply the
same argument to x
1
, x
2
, .. , inductively, showing that Ia- xnl .::;; t:
and Mia - xnl < 1 for all n :2: 1.
To show convergence, use (2.2.2) to give
(2.2.4)
and inductively,
(2.2.5)
Since Mia - x
0
1 < 1, this shows that xn a as n oo.
In formula (2.2.2), the unknown point is between xn and a,
implying a as n oo. Thus
L
. . a - Xn+l L" .
lffilt = -. lffilt-
n-+oo (a -.xn)
2
n-+oo 2f'(xn)
-f"(a)
2/'( a)
The error column in Table 2.2 can be used to illustrate (2.2.3). In particular,
for that example,
f"(a)
-
2
/'{a) = -2.417
a - x
6
---'--2 = -2.41
(a-x
5
)
Let M denote the limit on the right side of (2.2.2). Then if xn is near a, (2.2.2)
implies
62 ROOTFINDING FOR NONLINEAR EQUATIONS
y
X
Figure 2.3 The Newton-Fourier method.
In order to have convergence of xn to o:, this statement says that we should
probably have
1
Jo:- xol <-
M
(2.2.6)
Thus M is a measure of how close x
0
must be chosen to o: to ensure convergence
to o:. Some examples with large values of Mare given in the problems at the end
of the chapter.
Another approach to the error analysis of Newton's method is given by the
following construction and theorem. Assume f(x) is twice continuously differen-
tiable on an interval [a, b] containing o:. Further assume f(a) < 0, f(b) > 0,
and that
f'(x)>O f"(x)>O for a::;; x::;; b (2.2.7)
Then f(x) is strictly increasing on [a, b], and there is a unique root a in [a, b].
Also, f(x) < 0 for a::;; x <a, and f(x) > 0 for a < x ::;; b.
Let x
0
= b and define the Newton iterates xn as in (2.2.1). Next, define a new
sequence of iterates by
n 0 (2.2.8)
with z
0
= a. The resulting iterates are illustrated in Figure 2.3. With the use of
{ zn }, we obtain excellent upper and lower bounds for a. The use of (2.2.8) with
Newton's method is called the Newton-Fourier method.
Theorem 2.2 As previously, assume f(x) is twice continuously differentiable on
[a, b ], f( a) < 0, f( b) > 0, and condition (2.2. 7). Then the iterates
xn are strictly decreasing to o:, and the iterates z n are strictly
increasing to o:. Moreover,
X - Z
L
. n+l n+l
urut
2
n->oo (xn- zn)
f"(o:)
2f'(a)
(2.2.9)
showing that the distance between xn and zn decreases quadrati-
cally with n.
NEWTON'S METHOD 63
Proof We first show that
(2.2.10)
From the definitions (2.2.1) and (2.2.8),
-[(xo)
x l - Xo = - ~ - (-=-x-o-:-)- < 0
-f(zo)
zl - Zo = -f-'(_x_o_) > 0
From the error formula (2.2.2),
Finally,
some Zo < ro < a
= ( _ )[f'(xo) - f'Uo) l
0
a z0 ( ) >
f' Xo
because f'(x) is an increasing function on [a, b ]. Combining these results
proves (2.2.10). This proof can be repeated inductively to prove that
n;;:=:O (2.2.11)
The sequence { x n} is bounded below by a, and thus it has an infimum
.X; similarly, the sequence { zn} has a supremum z:
Limitxn = X;;::: a Limitzn = z :$; a
n-co n-co
Taking limits in (2.2.1) and (2.2.8), we obtain
_ _ f(x)
- - f(Z)
x=x---
f'(x)
z=z---
f'(x)
which leads to f(x) = 0 = f(Z). Since a is the unique root of f(x) in
[a, b ], this proves { x n} and { z n} converge to a.
The proof of (2.2.9) is more complicated, and we refer the reader to
Ostrowski (1973, p. 70). From Theorem 2.1 and formula (2.2.3), the
64 ROOTFINDING FOR NONLINEAR EQUATIONS
sequence { x"} converges to a quadratically. The result (2.2.9) shows that
lo: - Xnl lzn - Xnl
is an error bound that decreases quadratically. II
The hypotheses of Theorem 2.2 can be reduced to /(x) being twice continu-
ously differentiable in a neighborhood of a and
/'(o:)f"(o:) * 0 (2.2.12)
From this, there will be an interval [a, b] about a with f'(x) and f"(x) nonzero
on the interval. Then the rootfinding problem /(x) = 0 will satisfy (2.2.7) or it
can be easily modified to an equivalent problem that will satisfy it. For example,
if /'(a)< 0, f"(o:) > 0, then consider the rootfinding problem g(x) = 0 with
g(x) = f( -x). The root of g will be -a, and the conditions in (2.2.7) will be
satisfied by g( x) on some interval about -a. The numerical illustration of
Theorem 2.2 will be left until the Problems section.
Error estimation The preceding procedure gives upper and lower bounds for
the root, with the distance x"- z" decreasing quadratically. However, in most
applications, Newton's method is used alone, without (2.2.8). In that case, we use
the following.
Using the mean value theorem,
f(xJ = f(x")- /(a)= a)
-j(xn)
with between xn and a. If f'(x) is not changing rapidly between x" and o:,
then we f'(xn), and
with the last equality following from the definition of Newton's method. For
Newton's method, the standard error estimate is
(2.2.13)
and this is illustrated in Table 2.2. For relative error, use
a - Xn ,;, Xn+l - Xn
a xn+l
The Newton algorithm Using the Newton formula (2.2.1) and the error estimate
(2.2.13), we give the following algorithm.
THE SECANT METHOD 65
Algorithm Newton (f, df, x
0
, t:, root, itmax, ier)
1. Remark: df is the derivative function f'(x), itmax is the
maximum number of iterates to be computed, and ier is an error
flag to the user.
2. itnum := 1
3. denom := df(x
0
).
4. If denom = 0, then ier := 2 and exit.
5. x
1
:= x
0
- /(x
0
)/denom
6. If lx
1
- x
0
1 t:, then set ier := 0, root := x
1
, and exit.
7. If itnum = itmax, set ier := 1 and exit.
8. Otherwise, itnum := itnum + 1, x
0
:= x
1
, and go to step 3.
As with the earlier algorithm Bisect, no account is taken of the limits of the
computer arithmetic, although a practical program would need to do such. Also,
Newton's method is not guaranteed to converge, and thus a test on the number of
iterates (step 7) is necessary.
When Newton's method converges, it generally does so quite rapidly, an
advantage over the bisection method. But again, it need not converge. Another
source of difficulty in some cases is the necessity of knowing f'(x) explicitly.
With some rootfinding problems, this is not possible. The method of the next
section remedies this situation, at the cost of a somewhat slower speed of
convergence.
2.3 The Secant Method
As with Newton's method, the graph of y = f(x) is approximated by a straight
line in the vicinity of the root a. In this case, assume that x
0
and x
1
are two
initial estimates of the root a. Approximate the graph of y = f(x) by the secant
line determined by (x
0
, /(x
0
)) and (x
1
, f(x
1
)). Let its root be denoted by x
2
; we
hope it will be an improved approximation of a. This is illustrated in Figure 2.4.
Using the slope formula with the secant line, we have
Solving for x
2
,
66 ROOTFINDING FOR NONLINEAR EQUATIONS
Figure 2.4 The secant method.
Using x
1
and x
2
, repeat this process to obtain x
3
, etc. The general formula based
on this is
n ~ 1 (2.3.1)
This is the secant method. As with Newton's method, it is not guaranteed to
converge, but when it does converge, the speed is usually greater than that of the
bisection method.
Example Consider again finding the largest root of
f(x) = x
6
- x - 1 = 0
The secant method (2.3.1) was used, and the iteration continued until the
successive differences xn- xn_
1
were considered sufficiently small. The numeri-
cal results are given in Table 2.3. The calculations were done on a binary machine
with approximately 16 decimal digits of accuracy, and the table results are
rounded from the computer results.
The convergence is increasingly rapid as n increases. One way of measuring
this is to calculate the ratios
n ~
For a linear method these are generally constant as xn converges to a. But in this
example, these ratios become smaller with increasing n. One intuitive explanation
is that the straight line connecting (xn_
1
, f(xn_
1
)) and (xn f(xn)) becomes an
THE SECANT METHOD 67
Table 2.3 Example of secant method
n
X"
f(x") a - x"
xn+l - xn
0 2.0 61.0 8.65E- 1
1 1.0 -1.0 1.35E- 1 1.61E- 2
2 1.016129032 -9.154E- 1 1.19E- 1 1.74E- 1
3 1.190577769 6.575E- 1 -5.59E- 2 -7.29E- 2
4 1.117655831 -1.685E- 1 -1.71E- 2 1.49E- 2
5 1.132531550 -2.244E- 2 2.19E- 3 2.29E- 3
6 1.134816808 9.536E- 4 -9.27E-5 -9.32E- 5
7 1.134723646 -5.066E- 6 4.92E- 7 4.92E- 7
9 1.134724138 -1.135E- 9 l.lOE- 10 l.lOE- 10
increasingly accurate approximation to the graph of y = f(x) in the vicinity of
x = a, and consequently the root xn+l the straight line is an increasingly
improved estimate of a. Also note that the iterates x" move above and below the
root a in an apparently random fashion as n increases. An explanation of this
will come from the error formula (2.3.3) given below.
Error analysis Multiply both sides of by -1 and then add a to both
sides, obtaining
The right-hand side can be algebraically to obtain the formula
(2.3.2)
The quantities f[xn-l xn] and j[xn_
1
, xn, a] are-first- and second-order Newton
divided differences, defined in (1.1.13) of Chapter 1. The reader should check
(2.3.2) by substituting from (1.1.13) and then simplifying. Using (1.1.14), formula
(2.3.2) becomes
(2.3.3)
between x"_
1
and x", and f" between xn_
1
, x", and a. Using this error
formula, we can examine the convergence of the secant method.
Theorem 2.3 Assume f(x), f'(x), and f"(x) are continuous for all values of x
in some interval containing a, and assume f'( a) -:# 0. Then if the
initial guesses x
0
and x
1
are chosen sufficiently close to a, the
iterates x" o((2.3.1) will converge to a. The order of convergence
will be p = (1 + /5)/2 = 1.62.
I
J
68 ROOTFINDING FOR NONLINEAR EQUATIONS
Proof For the neighborhood I= [a - t:, a + t:] with some t: > 0, f'(x) =I= 0
everywhere on I. Then define
Max if"(x) I
M = _.:;,:_XE=/:...._ __ _
2Minlf'(x)l
xel
Then for all x
0
, x
1
E [a - t:, a + t:], using (2.3.3),
le2l .S: led leolM,
Mle2l .S: Mle1l Mleol
Further assume that x
1
and x
0
are so chosen that
Then Mje
2
1 < 1 since
(2.3.4)
and thus x
2
E [a - t:, a + t:}. We apply this argument inductively to
show that xn E [a - t:, a + t:} and Mien I .S: 8 for n ;e: 2.
To prove convergence and obtain the order of convergence, continue
applying (2.3.3) to get
For
(2.3.5)
Thus
n;e:l (2.3.6)
with q
0
= q
1
= 1. This is a Fibonacci sequence of numbers, and an
THE SECANT METHOD 69
explicit formula can be given:
1
qn = ..j5 [ron+l- rt+l] n ~
Thus
1 + ..j5
'o = --- = 1.618
2
1- ..j5
't = --- = -.618
2
for large n
(2.3.7)
(2.3 .8)
For example q
5
= 8, and formula (2.3.8) gives 8.025. Returning to
(2.3.5), we obtain the error bound
1
l
e I < -8q
n- M
n ~ (2.3.9)
with qn given by {2.3.7). Since qn--+ oo as n --+ oo, we have xn --+ a.
By doing a more careful derivation, we can actually show that the
order of convergence is p = (1 + ..f5 )/2. To simplify the presentation,
we instead show that this is the rate at which the bound in (2.3.9)
decreases. Let Bn denote the upper bound in (2.3.9). Then
1
-/)qn+l
Bn+l M
-- = r-:..;...,--- = M'o-ll)qn+l-roqn
B;a [ ]'". l)'oq.
which implies an order of convergence p = r
0
= (1 + ..f5 )/2. A similar
result holds for the actual errors en; moreover,
L
.. len+tl I /"(a) l(.,f5-l)/
2
liDlt --= ---
n->oo len!'" 2/'(a)
(2.3.10)
The error formula (2.3.3) can be used to explain the oscillating behavior of the
iterates xn about the root a in the last example. For xn and xn-l near to a,
(2.3.3) implies
f"(a)
a- xn+l = -(a- xJ(a- Xn-1)
2
/'(a)
(2.3.11)
70 ROOTFINDING FOR NONLINEAR EQUATIONS
The sign of a - xn+l is determined from that of the previous two errors, together
with the sign of/"( a)//'( a).
The condition (2.3.4) gives some information on how close the initial values x
0
and x
1
should be to a in order to have convergence. If the quantity M is large, or
more specifically, if
I
!J!ll
2/'(a)
is very large, then a - x
0
and a - x
1
must be correspondingly smaller. Conver-
gence can occur without (2.3.4), but it is likely to initially be quite haphazard in
such a case.
For an error test, use the same error estimate (2.2.13) that was used with
Newton's method, namely
Its use is illustrated in Table 2.3 in the last example. Because the secant method
may not converge, programs implementing it should have an upper limit on the
number of iterates, as with the algorithm Newton in the last section.
A possible problem with the secant method is the calculation of the approxi-
mate derivative
a =
n
where the secant method (2.3.1) is then written
(2.3.12)
(2.3.13)
The calculation of an involves loss of significance errors, in both the numerator
and denominator. Thus it is a less accurate approximation of the derivative of f
as xn a. Nonetheless, we continue to obtain improvements in the accuracy of
xn, until we approach the noise level of f(x) for x near a. At that point, an may
become very different from f'(a), and xn+l can jump rapidly away from the
root. For this reason, Dennis and Schnabel (1983, pp. 31-32) recommend the use
of (2.3.12) until xn -xn-I becomes sufficiently small. They then-recommend
another approximation of f'(x):
with a fixed h. For h, they recommend
where Ta is a reasonable nonzero approximation to a, say xn, and 8 is the
THE SECANT METHOD 71
computer's unit round [see (1.2.12)]. They recommend the use of h .when
lx,- x,_d is smaller than h. The cost of the secant method in function
evaluations will rise slightly, but probably by not more than one or two.
The secant method is well recommended as an efficient and easy-to-use
rootfinding procedure for a wide variety of problems. It also has the advantage of
not requiring a knowledge of f'(x), unlike Newton's method. In Section 2.8, the
secant method will form an important part of another rootfind.ing algorithm that
is guaranteed to converge.
Comparison of Newton's rnetbod and the secant method Newton's method and
the secant method are closely related. If the approximation
is used in the Newton formula (2.2.1), we obtain the secant formula (2.3.1). The
conditions for convergence are almost the same [for example, see (2.2.6) and
(2.3.4) for conditions on the initial error], and the error formulas are similar [see
"(2.2.2) and (2.3.3)]. Nonetheless, there are two major differences. Newton's
method-requires-two function evaluations per iterate, that of f(x,) and f'(x,),
whereas the secant method requires only one function evaluation per iterate, that
of f(x,.) [provided the needed function value f(x,_
1
) is retained from the last
iteration]. Newton's method is generally more expensive per iteration. On the
other hand, Newton's method converges more rapidly [order p = 2 vs. the secant
method's p = 1.62], and consequently it will require fewer iterations to attain a
given desired accuracy. An analysis of the effect of these two differences in the
secant and Newton methods is given below.
We now consider the expenditure of time necessary to reach a desired root a
within a desired tolerance of t:. To simplify the analysis, we assume that the
initial guesses are quite close to the desired root. Define
xn+l
n;:::O
and let x
0
= x
0
We define .X
1
based on the following convergence formula. From
(2.2.3) and (2.3.10), respectively,
l
f"(a) I
n 0, c =
2
/'( a)
r=
1 + v'5
2
72 ROOTFINDING FOR NONLINEAR EQUATIONS
Inductively for the error in the Newton iterates,
1 2"
Ia- xnl = -(cia- Xol)
c
Similarly for the secant method iterates,
d1+r+ ... +r"-11 - I'"
- a- x
0
Using the formula (1.1.9) for a finite geometric series, we obtain
d1+r+ ... +r-l = d(r"-1)/(r-1) = c'"-1
and thus
To satisfy Ia - xnl the Newton iterates, we must have
K
n>--
- log2
2"
(cia- x
0
1) Cf.
K= log
[
log EC ]
log cia- x
0
1
Let m be the time to evaluate f(x), and let s m be the time to evaluate f'(x).
Then the minimum time to obtain the desired accuracy with Newton's method is
(1 + s)mK
TN= (m + ms)n = log2
(2.3.14)
For the secant method, a similar calculation shows that Ia - .Xnl if
K
n>--
- logr
Thus the minimum time necessary to obtain the desired accuracy is
mK
T. = mn = --
s logr
(2.3.15)
MULLER'S METHOD 73
To compare the times for the secant method and Newton's method, we have
The secant method is faster than the Newton method if the ratio is less than one,
log 2
s > ---1 = .44
log r
(2.3.16)
If the time to evaluate f'(x) is more than 44 percent of that necessary to evaluate
f(x), then the secant method is more efficient. In practice, many other factors
will affect the relative costs of the two methods, so that the .44 factor should be
used with caution.
The preceding argument is useful in illustrating that the mathematical speed of
convergence complete picture. Total computing time, ease of use-of an
algorithm, stability, and other factors also have a bearing on the relative
desirability of one algorithm over another one.
2.4 Muller's Method
Muller's method is useful for obtaining both real and complex roots of a
function, and it is reasonably straightforward to implement as a computer
program. We derive it, discuss its convergence, and give some numerical exam-
ples.
Muller's method is a generalization of the approach that led to the secant
method. Given three points Xo, xl, Xz, a quadratic polynomial is constructed that
passes through the three points (x;. f(x;)), i = 0, 1, 2; one of the roots of this
polynomial is used as an improved estimate for a root a of f(x).
The quadratic polynomial is given by
p(x) = f(x
2
) + (x- x
2
)/[x
2
, x
1
] + (x- x
2
)(x- x
1
)/[x
2
, x
1
, x
0
].
(2.4.1)
The divided differences f[x
2
, xd and f[x
2
, x
1
, x
0
] were defined in (1.1.13) of
Chapter 1. To check that
p(x;) = f(x;) i = 0,1,2
just substitute X; into (2.4.1) and then reduce the resulting expression using
(1.1.13). There are other formulas for p(x) given in Chapter 3, but the form
shown in (2.4.1) is the most convenient for defining Muller's method. The
74 ROOTFINDING FOR NONLINEAR EQUATIONS
formula (2.4.1) is called Newton's divided difference form of the interpolating
polynomial, and it is developed in general in Section 3.2 of Chapter 3.
To find the zeros oi (2.4.1) we first rewrite it in the more convenient form
y = /(x
2
) + w(x- x
2
) + f[x
2
, x
1
, x
0
]](x- x
2
)
2
w = /[x
2
, x
1
] + (x
2
- x
1
)/[x
2
, x
1
, x
0
]
= /[x
2
, x
1
] + /[x
2
, x
0
] - J[x
0
, x
1
]
We want to find the smallest value of x - x
2
that satisfies the equation y = 0,
thus finding the root of (2.4.1) that is closest to x
2
The solution is
-w /w
2
- 4f(x
2
)/[x
2
, x
1
, x
0
]
2f[x
2
, x
1
, x
0
]
with the sign chosen to make the numerator as small as possible. Because of the
loss-of-significance errors implicit in this formula, we rationalize the numerator
to obtain the new iteration formula
(2.4.2)
J
with the sign chosen to maximize the magnitude of the denominator.
Repeat (2.4.2) recursively to define a sequence of iterates { xn: n :2::. 0}. If they
converge to a point a, and if ['(a) :1= 0, then a is a root of f(x). To see this, use
(1.1.14) of Chapter 1 and (2.4.2) to give
w f'(a) as n co
2/(a)
a = a - ----;::=========-
['(a:) V[f'(a)]
2
- 2/(a)f"(a:)
showing that the right-hand fraction must be zero. Since /'(a) :1= 0 by assump-
tion, the method of choosing the sign in the denominator implies that the
denominator is nonzero. Then the numerator must be zero, showing /(a)= 0.
The assumption /'(a) :1= 0 will say that a is a simple root. (See Section 2.7 for a
discussion of simple and multiple roots.)
By an argument similar to that used for the secant method, it can be shown
that
.. Ia- xn+ti lt(3>(a) l<p-1)/2
Ltmtt =
n-+oo Ia- xnlp 6f'(a)
p 1.84 (2.4.3)
provided f(x) is three times continuously differentiable in a neighborhood of a
MULLER'S METHOD 75
and f'( a) 0. The order p is the positive root of
With the secant method, real choices of x
0
and x
1
lead to a real value of x
2
But with Muller's method, real choices of x
0
, x
1
, x
2
can and do lead to complex
roots of f(x). This is an important aspect of Muller's method, being one reason it
is used.
The following examples were computed using a commercial program that gives
an automatic implementation of Muller's method. With no initial guesses given,
it found the roots of f(x) in roughly increasing order. After approximations
z
1
, . , z, had been found as roots, the function
(2.4.4)
was used in finding the remaining roots of f(x). [For a discussion of the errors in
this use of g(x), see Peters and Wilkinson (1971)]. In order that an approximate
root z be acceptable to the program, it had to satisfy one of the following two
conditions (specified by the user):
1. lf(z)l .::;; 10-
10
2. z has eight significant digits of accuracy.
In Tables 2.4 and 2.5, the roots are given in the order in which they were found.
The column IT gives the number of iterates that were calculated for each root.
The examples are all for f(x) a polynomial, but the program was designed for
general functions f(x), with x allowed to be complex.
Table 2.4 Muller's method, example 2
IT Root /(root)
9 1.1572211736 - 1 5.96- 8
10 6.1175748452 - 1 + 9.01 - 20i -2.9SE- 7 +9.06- lli
14 2.S337513377EO - 5.05- 17i 2.55- 5 -4.78- 8i
13 4.59922763940 - 5.95- lSi 7.13- 5 +9.37- 6i
s 1.51261026980 + 2.9SE- 16i 3.34- 6 -2.35- 7i
19 1.30060549931 + 9.04- lSi 2.32- 1 +4.15- 7i
16 9.62131684250 - 4.97- 17i -3.66-2 - 5.3SE- 7i
14 1.71168551S7El - 8.4SE- 17i -1.6SE + 0 +2.40- 5i
13 2.21510903791 + 9.35E- lSi 8.61 - 1 + 2.60E - 5i
7 6.84452545310 - 3.43- 2Si -4.49E- 3 -1.22- lSi
4 2.84S7967251E1 + 5.77E- 25i -6.34 + 1 -2.96E- lli
4 3.70991210441 + 2.SOE- 24i 2.12 3 +7.72- 9i
76 ROOTFINDING FOR NONLINEAR EQUATIONS
Table 2.5
IT
41
17
31
10
6
3
Muller's method, example 3
Root
2.9987526 -6.98E- 4i
2.9997591 -2.68E- 4i
3.0003095 -3.17E- 4i
3.0003046 + 3.14E - 4i
5.97E- 15 + 3.000000000 i
5.97E- 15 -3.000000000i
/(root)
- 3.33E - 11 + 6. 70E - 11i
5.68E - 14 - 6.48E - 14i
- 3.41E - 13 + 3.22E - 14i
3.98E - 13 - 3.83E - 14i
4.38E- 11 -1.19E- 11i
4.38E - 11 + 1.19E - 11i
Example 1. f(x) = x
20
- 1. All 20 roots were found with an accuracy of 10
or more significant digits. In all cases, the approximate root z satisfied 1/(z)l <
10-
10
, generally much less. The number of iterates ranged from 1 to 18, with an
average of 8.5.
2. /(x) = Laguerre polynomial of degree 12. The real parts of the roots as
shown are correct, rounded to the number of places shown, but the imaginary
parts should all be zero. The numerical results are given in Table 2.4. Note that
f(x) is quite large for many of the approximate roots.
3.
f(x) = x
6
- 12x
5
+ 63x
4
- 216x
3
+ 567x
2
- 972x + 729
= (x
2
+ 9)(x - 3)
4
The numerical results are given in Table 2.5. Note the inaccuracy in the first four
roots, which is inherent .due to the noise in f(x) associated with a = 3 being a
repeated root. See Section 2.7 for a complete discussion of the problems in
calculating repeated roots.
The last two examples demonstrate why two error tests are necessary, and they
indicate why the routine requests a maximum on the number of iterations to be
allowed per root. The form (2.4.2) of Muller's method is due to Traub (1964,
pp. 210-213). For a computational discussion, see Whitley (1968).
2.5 A General Theory for One-Point Iteration Methods
We now consider solving an equation x = g(x) for a root a by the iteration
n;:;:;O (2.5.1)
with x
0
an initial guess to a. The Newton method fits in this pattern with
f(x)
g(x) = x- f'(x)
(2.5.2)
GENERAL THEORY FOR ONE-POINT ITERATION METHODS 77
Table 2.6 Iteration examples
for x
1
- 3 = 0
Case (i) Case (ii)
n Xn Xn
0 2.0 2.0
1 3.0 1.5
2 9.0 2.0
3 87.0 1.5
Case (iii)
Xn
2.0
1.75
1.732143
1.732051
Each solution of x = g( x) is called a fixed point of g. Although we are usually
interested in solving an equation f(x) = 0, there are many ways this can be
reformulated as a fixed-point problem. At this point, we just illustrate this
reformulation process with some examples.
Example Consider solving x
2
- a = 0 for a > 0.
(i) x = x
2
+ x- a, or more generally, x = x + c(x
2
- a) for some c =I= 0
a
(ii) X= -
X
(iii) X = ~ (X + ~
(2.5.3)
We give a numerical example with a= 3, x
0
= 2, and a= f3 = 1.732051. With
x
0
= 2, the numerical results for (2.5.1) in these cases are given in Table 2.6.
It is natural to ask what makes the various iterative schemes behave in the way
they do in this example. We will develop a general theory to explain this behaVior
and to aid in analyzing new iterative methods.
Lemma 2.4 Let g(x) be continuous on the interval a ~ x ~ b, and assume that
a ~ g(x) ~ b for every a ~ x ~ b . (We say g sends [a, b] into
[a, b], and denote it by g([a, b]) c [a, b].) Then x = g(x) has at
least one solution in [a, b ].
Proof Consider the continuous function g(x)- x. At x =a, it is positive, and
at x = b it is negative. Thus by the intermediate value theorem, it must
have a root in the interval [a, b]. In Figure 2.5, the roots are the
intersection points of y = x andy= g(x).
Lemma 2.5 Let g(x) be continuous on [a, b], and assume g([a, b]) c [a, b].
Furthermore, assume there is a constant 0 <X < 1, with
lg(x)- g(y )I ~ Xlx- Yl for all x, y E [a, b] (2.5.4)
78 ROOTFINDING FOR NONLINEAR EQUATIONS
Then x = g(x) has a unique solution a in [a, b]. Also, the iterates
Xn = g(xn_
1
) n ~ 1
will converge to a for any choice of x
0
in [a, b], and
"A"
Ia- xnl ~
1
_A lx1- Xol
Proof Suppose x- g(x) has two solutions a and {3 in [a, b]. Then
Ia- .BI =lg(a)- g(,B)I ~ "-Ia- /31
(1 - "-)Ia- .BI ~ o
(2.5.5)
Since 0 < A. < 1, this implies a = ,8. Also we know by the earlier lemma
that there is at least one root a in [a, b ].
To examine the convergence of the iterates xn, first note that they all
remain in [a, b ]. To see this, note that the result
XnE[a,b] implies Xn+l = g(xJ E (a, b)
can be used with mathematical induction to prove xn E [a, b] for all n.
For the convergence,
and by induction,
As n --. oo, A n ~ 0; thus, Xn ~ a
y
b-
I
I
I
I
I
I
I
n ~ O
-----------t----
1
I
b
Figure 2.5 Example of Lemma 2.4.
(2.5.6)
(2.5.7)
GENERAL THEORY FOR ONE-POINT ITERATION METHODS 79
To prove the bound (2.5.5), begin with
Ia- Xol Ia- X1l + lx1- Xol .:$Ala- Xol + lx1- Xol
where the last step used (2.5.6). Then solving for Ia- x
0
1, we have
1
Ia - xol
1
_ A lx1 - Xol
(2.5.8)
Combining this with (2.5.7) will complete the proof.
The bound (2.5.6) shows that the sequence { xn} is linearly convergent, with
the rate of convergence bounded by A, based on the definition (2.0.13). Also from
the proof, we can devise a possibly more accurate error bound than (2.5.5).
Repeating the argument that led to (2.5.8), we obtain
1
Ia- xnl
1
_ Alxn+l- xnl
Further, applying (2.5.6) yields the bound
A
Ia.- xn+ll 1 -A lxn+l - xnl
(2.5.9)
When A is computable, this furnishes a practical bound in most situations. Other
error bounds and estimates are discussed in the following section.
If g(x) is differentiable on [a, b], then
g(x)- g(y) = g'(g)(x- y) g between x and y
for all x, y E [a, b ]. Define
.\ = Max lg'(x )I
a:s;x:s;b
Then
lg(x) - g(y )I .:$ .\lx- Yl aU x,yE[a,b]
Theorem 2.6 Assume that g( x) is continuously differentiable on [a, b ], that
g([a, b]) c [a, b], and that
A= Max lg'(x)l < 1
a:>;x:s;b
(2.5.10)
Then
(i) x = g(x) has a unique solution a in [a, b]
(ii) For any choice of x
0
in [a, b], with xn+l = g(xn), n;;::: 0,
Limitxn =a
n-+ co
i
- I
I
-I
80 ROOTFINDING FOR NONLINEAR EQUATIONS
(iii)
An
I
a- X I 'Anla- x I < --jx -X I
n o-1-'Al 0
a - xn+l
Limit = g'( a)
n->oo a- xn
(2.5.11)
Proof Every result comes from the preceding lemmas, except for the rate of
convergence (2.5.11). For it, use
y
n 0 (2.5.12)
with an unknown point between a and xn- Since xn __,.a, we must
have a, and thus
a-x
Limit n+l = = g'(a)
n-tooo a- xn n-+oo
If g'( a) '#- 0, then the sequence { xn} converges to a with order exactly
p = 1, linear convergence. IIIII
0 < g"(OI) < 1 -1 <g'(OI) < 0
g'(OI) < -1
Figure 2.6 Examples of convergent and nonconvergent sequences
Xn+l = g(xn)
GENERAL THEORY FOR ONE-POINT ITERATION METHODS 81
This theorem generalizes to systems of m nonlinear equations in m unknowns.
Just regard x as an element of Rm, g( x) as a function from Rm to Rm, replace the
absolute values by vector and matrix norms, and replace g'(x) by the Jacobian
matrix for g(x). The assumption g([a, b)) c [a, b] must be replaced with a
stronger assumption, and care must be exercised in the choice of a region
generalizing [a, b]. The lemmas generalize, but they are nontrivial to prove. This
is discussed further in Section 2.10.
To see the importance of the assumption (2.5.10) on the size of g'(x), suppose
lg'(a)l > 1. Then if we had a sequence of iterates xn+l = g(xn) and a root
a= g(a), we have (2.5.12). If xn becomes sufficiently close to a, then lg'(gn)l > 1
and the error Ia- xn+d will be greater than lo:- xnl Thus convergence is not
possible if lg'(a)l > 1. We graphically portray the computation of the iterates in
four cases (see Figure 2.6).
To simplify the application of the previous theorem, we give the following
result.
Theorem 2.7  Assume α is a solution of x = g(x), and suppose that g(x) is
continuously differentiable in some neighboring interval about α
with |g'(α)| < 1. Then the results of Theorem 2.6 are still true,
provided x_0 is chosen sufficiently close to α.

Proof  Pick a number λ satisfying |g'(α)| < λ < 1. Then pick an interval
I = [α − ε, α + ε] with

    max_{x ∈ I} |g'(x)| ≤ λ < 1

We have g(I) ⊂ I, since |α − x| ≤ ε implies

    |α − g(x)| = |g(α) − g(x)| = |g'(ξ)| |α − x| ≤ λ|α − x| ≤ ε

Now apply the preceding theorem using [a, b] = [α − ε, α + ε].
Example  Referring back to the earlier example in this section, calculate g'(α).

(i)    g(x) = x² + x − 3        g'(α) = g'(√3) = 2√3 + 1 > 1

(ii)   g(x) = 3/x               g'(√3) = −3/(√3)² = −1

(iii)  g(x) = ½(x + 3/x)        g'(x) = ½(1 − 3/x²)        g'(√3) = 0
Example  For x = x + c(x² − 3), pick c to ensure convergence. Since the
solution is α = √3, and since g'(x) = 1 + 2cx, pick c so that

    −1 < 1 + 2c√3 < 1

For a good rate of convergence, pick c so that

    1 + 2c√3 ≈ 0

This gives c = −1/(2√3) ≈ −.289; we use the nearby value c = −1/4. Then
g'(√3) = 1 − (√3/2) ≈ .134. This gives the iteration scheme

    x_{n+1} = x_n − ¼(x_n² − 3)        n ≥ 0                         (2.5.13)

Table 2.7  Numerical example of iteration (2.5.13)

    n    x_n          α − x_n      Ratio
    0    2.0          −2.68E−1
    1    1.75         −1.79E−2     .0668
    2    1.7343750    −2.32E−3     .130
    3    1.7323608    −3.10E−4     .134
    4    1.7320923    −4.15E−5     .134
    5    1.7320564    −5.56E−6     .134
    6    1.7320516    −7.45E−7     .134
    7    1.7320509    −1.00E−7     .134

The numerical results are given in Table 2.7. The ratio column gives the values of

    (α − x_n) / (α − x_{n−1})        n ≥ 1

The results agree closely with the theoretical value of g'(√3).
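A short Python script that reproduces the computation behind Table 2.7 might look as follows (the output formatting is illustrative, not from the text).

    import math

    alpha = math.sqrt(3.0)
    x = 2.0
    prev_err = alpha - x
    print(f"{0:2d}  {x:.7f}  {prev_err:10.2e}")
    for n in range(1, 8):
        x = x - 0.25 * (x * x - 3.0)      # iteration (2.5.13) with c = -1/4
        err = alpha - x
        ratio = err / prev_err            # should approach g'(sqrt(3)) = 0.134
        print(f"{n:2d}  {x:.7f}  {err:10.2e}  {ratio:.3f}")
        prev_err = err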
Higher order one-point methods  We complete the development of the theory
for one-point iteration methods by considering methods with an order of conver-
gence greater than one, for example, Newton's method.

Theorem 2.8  Assume α is a root of x = g(x), and that g(x) is p times
continuously differentiable for all x near to α, for some p ≥ 2.
Furthermore, assume

    g'(α) = ··· = g^(p−1)(α) = 0                                     (2.5.14)

Then if the initial guess x_0 is chosen sufficiently close to α, the
iteration

    x_{n+1} = g(x_n)        n ≥ 0

will have order of convergence p, and

    lim_{n→∞} (α − x_{n+1}) / (α − x_n)^p = (−1)^{p−1} g^(p)(α) / p!
Proof  Expand g(x_n) about α:

    x_{n+1} = g(x_n) = g(α) + (x_n − α)g'(α) + ··· + ((x_n − α)^{p−1}/(p − 1)!) g^(p−1)(α)
                       + ((x_n − α)^p / p!) g^(p)(ξ_n)

for some ξ_n between x_n and α. Using (2.5.14) and α = g(α),

    α − x_{n+1} = −((x_n − α)^p / p!) g^(p)(ξ_n)

Use Theorem 2.7 and x_n → α to complete the proof.
The Newton method can be analyzed with this result:

    g(x) = x − f(x)/f'(x)        g'(x) = f(x)f''(x) / [f'(x)]²

    g'(α) = 0        g''(α) = f''(α) / f'(α)
This and (2.5.14) give the previously obtained convergence result (2.2.3) for
Newton's method. For other examples of the application of Theorem 2.8, see the
problems at the end of the chapter.
The theory of this section is only for one-point iteration methods, thus
eliminating the secant method and Muller's method from consideration. There is
a corresponding fixed-point theory for multistep fixed-point methods, which can
be found in Traub (1964). We omit it here, principally because only the one-point
fixed-point iteration theory will be needed in later chapters.
2.6 Aitken Extrapolation for Linearly Convergent Sequences

From (2.5.11) of Theorem 2.6,

    lim_{n→∞} (α − x_{n+1}) / (α − x_n) = g'(α)

for a convergent iteration

    x_{n+1} = g(x_n)        n ≥ 0                                    (2.6.1)

In this section, we concern ourselves only with the case of linear convergence.
Thus we will assume

    0 < |g'(α)| < 1                                                  (2.6.2)

We examine estimating the error in the iterates and give a way to accelerate the
convergence of {x_n}.

We begin by considering the ratios

    λ_n = (x_n − x_{n−1}) / (x_{n−1} − x_{n−2})        n ≥ 2         (2.6.3)

Claim:

    lim_{n→∞} λ_n = g'(α)                                            (2.6.4)

To see this, write

    λ_n = [(α − x_{n−1}) − (α − x_n)] / [(α − x_{n−2}) − (α − x_{n−1})]

Using (2.5.12), α − x_n ≈ g'(α)(α − x_{n−1}) and α − x_{n−2} ≈ (α − x_{n−1})/g'(α), so

    λ_n ≈ (1 − g'(α)) / (1/g'(α) − 1) = g'(α)

The quantity λ_n is computable, and when it converges empirically to a value λ,
we assume λ ≈ g'(α).

We use λ_n ≈ g'(α) to estimate the error in the iterates x_n. Assume

    α − x_{n−1} ≈ (α − x_n)/λ_n

Then, since α − x_{n−1} = (α − x_n) + (x_n − x_{n−1}),

    (1/λ_n)(α − x_n) ≈ (α − x_n) + (x_n − x_{n−1})

    α − x_n ≈ (λ_n/(1 − λ_n)) (x_n − x_{n−1})                        (2.6.5)

This is Aitken's error formula for x_n, and it is increasingly accurate as {λ_n}
converges to g'(α).
Table 2.8  Iteration (2.6.6)

    n    x_n          x_n − x_{n−1}    λ_n      α − x_n     Estimate (2.6.5)
    0    6.0000000                              1.55E−2
    1    6.0005845    5.845E−4                  1.49E−2
    2    6.0011458    5.613E−4         .9603    1.44E−2     1.36E−2
    3    6.0016848    5.390E−4         .9604    1.38E−2     1.31E−2
    4    6.0022026    5.178E−4         .9606    1.33E−2     1.26E−2
    5    6.0027001    4.974E−4         .9607    1.28E−2     1.22E−2
    6    6.0031780    4.780E−4         .9609    1.23E−2     1.17E−2
    7    6.0036374    4.593E−4         .9610    1.19E−2     1.13E−2
Example  Consider the iteration

    x_{n+1} = 6.28 + sin(x_n)        n ≥ 0                           (2.6.6)

The true root is α ≈ 6.01550307297. The results of the iteration are given in
Table 2.8, along with the values of λ_n, α − x_n, x_n − x_{n−1}, and the error estimate
(2.6.5). The values of λ_n are converging to

    g'(α) = cos(α) ≈ .9644

and the estimate (2.6.5) is an accurate indicator of the true error. The size of
g'(α) also shows that the iterates will converge very slowly, and in this case,
x_{n+1} − x_n is not an accurate indicator of α − x_n.

Aitken's extrapolation formula is simply (2.6.5), rewritten as an estimate of α:

    α ≈ x_n + (λ_n/(1 − λ_n)) (x_n − x_{n−1})                        (2.6.7)

We denote this right side by x̂_n, for n ≥ 2. By substituting (2.6.3) into (2.6.7), the
formula for x̂_n can be rewritten as

    x̂_n = x_n − (x_n − x_{n−1})² / (x_n − 2x_{n−1} + x_{n−2})        n ≥ 2        (2.6.8)

which is the formula given in many texts.
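A minimal Python sketch of the extrapolation (2.6.8) follows; the function name and the handling of a zero denominator are illustrative choices.

    def aitken_extrapolate(x0, x1, x2):
        """Aitken extrapolate (2.6.8) from three successive iterates
        x_{n-2}, x_{n-1}, x_n; returns the improved estimate of alpha."""
        denom = x2 - 2.0 * x1 + x0
        if denom == 0.0:
            return x2                  # iterates already (numerically) converged
        return x2 - (x2 - x1) ** 2 / denom

    # Using the last three iterates of (2.6.6) as rounded in Table 2.8:
    x_hat = aitken_extrapolate(6.0027001, 6.0031780, 6.0036374)
    # x_hat comes out near 6.0150, much closer to alpha = 6.015503... than x_7;
    # the text's value 6.0149518 uses the unrounded iterates.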
Example  Use the results in Table 2.8 for iteration (2.6.6). With n = 7, using
either (2.6.7) or (2.6.8),

    x̂_7 = 6.0149518        α − x̂_7 = 5.51E−4

Thus the extrapolate x̂_7 is significantly more accurate than x_7.
We now combine linear iteration and Aitken extrapolation in a simpleminded
algorithm.

Algorithm Aitken (g, x_0, ε, root)

1. Remark: It is assumed that |g'(α)| < 1 and that ordinary linear
   iteration using x_0 will converge to α.
2. x_1 := g(x_0), x_2 := g(x_1)
3. Compute the extrapolate x̂_2 from x_0, x_1, x_2 using (2.6.8).
4. If |x̂_2 − x_2| ≤ ε, then root := x̂_2 and exit.
5. Set x_0 := x̂_2 and go to step 2.

This algorithm will usually converge, provided the assumptions of step 1 are
satisfied.
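A Python sketch of the algorithm just outlined is given below; the stopping test and iteration limit are paraphrased, not taken verbatim from the algorithm above.

    import math

    def aitken_algorithm(g, x0, eps=1e-12, max_cycles=50):
        """Two fixed-point steps followed by one Aitken extrapolation,
        repeated.  Assumes |g'(alpha)| < 1 so that ordinary iteration
        from x0 would converge."""
        for _ in range(max_cycles):
            x1 = g(x0)
            x2 = g(x1)
            denom = x2 - 2.0 * x1 + x0
            x_hat = x2 if denom == 0.0 else x2 - (x2 - x1) ** 2 / denom
            if abs(x_hat - x2) <= eps:     # crude stopping test on the extrapolate
                return x_hat
            x0 = x_hat                     # restart the iteration from the extrapolate
        return x_hat

    root = aitken_algorithm(lambda x: 6.28 + math.sin(x), 6.0)   # example (2.6.6)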
Example  To illustrate algorithm Aitken, we repeat the previous example (2.6.6).
The numerical results are given in Table 2.9. The values x_3, x_6, and x_9 are the
Aitken extrapolates defined by (2.6.7). The values of λ_n are given for only the
cases n = 2, 5, and 8, since only then do the errors α − x_n, α − x_{n−1}, and
α − x_{n−2} decrease linearly, as is needed for λ_n ≈ g'(α).
Extrapolation is often used with slowly convergent linear iteration methods for
solving large systems of simultaneous linear equations. The actual methods used
are different from that previously described, but they also are based on the
general idea of finding the qualitative behavior of the error, as in (2.6.1), and of
then using that to produce an improved estimate of the answer. This idea is also
pursued in developing numerical methods for integration, solving differential
equations, and other mathematical problems.
Table 2.9 Algorithm Aitken applied to (2.6.6)
n xn An a - xn
0 6.000000000000 1.55E- 2
1 6.000584501801 1.49E- 2
2 6.001145770761 .96025 1.44E- 2
3 6.014705147543 7.98E- 4
4 6.014733648720 7.69E- 4
5 6.014761128955 .96418 7.42E- 4
6 6.015500802060 2.27E- 6
7 6.015500882935 2.19E- 6
8 6.015500960931 .96439 2.11E- 6
9 6.015503072947 2.05E- 11
2.7 The Numerical Evaluation of Multiple Roots

We say that the function f(x) has a root α of multiplicity p > 1 if

    f(x) = (x − α)^p h(x)                                            (2.7.1)

with h(α) ≠ 0 and h(x) continuous at x = α. We restrict p to be a positive
integer, although some of the following is equally valid for nonintegral values. If
h(x) is sufficiently differentiable at x = α, then (2.7.1) is equivalent to

    f(α) = f'(α) = ··· = f^(p−1)(α) = 0                              (2.7.2)

When finding a root of any function on a computer, there is always an interval
of uncertainty about the root, and this is made worse when the root is multiple.
To see this more clearly, consider evaluating the two functions f_1(x) = x² − 3
and f_2(x) = 9 + x²(x² − 6). Then α = √3 has multiplicity one as a root of f_1
and multiplicity two as a root of f_2. Using four-digit decimal arithmetic,
f_1(x) < 0 for x ≤ 1.731, f_1(1.732) = 0, and f_1(x) > 0 for x ≥ 1.733. But f_2(x)
= 0 for 1.726 ≤ x ≤ 1.738, thus limiting the amount of accuracy that can be
attained in finding a root of f_2(x). A second example of the effect of noise in the
evaluation of a multiple root is illustrated for f(x) = (x − 1)³ in Figures 1.1 and
1.2 of Section 1.3 of Chapter 1. For a final example, consider the following
example.
Example  Evaluate

    f(x) = (x − 1.1)³(x − 2.1) = x⁴ − 5.4x³ + 10.56x² − 8.954x + 2.7951        (2.7.3)

on an IBM PC microcomputer using double precision arithmetic (in BASIC). The
coefficients will not enter exactly because they do not have finite binary expan-
sions (except for the x⁴ term). The polynomial f(x) was evaluated in its
expanded form (2.7.3) and also using the nested multiplication scheme

    f(x) = 2.7951 + x(−8.954 + x(10.56 + x(−5.4 + x)))               (2.7.4)

Table 2.10  Evaluation of f(x) = (x − 1.1)³(x − 2.1)

    x           f(x): (2.7.3)    f(x): (2.7.4)
    1.099992     3.86E−16         5.55E−16
    1.099994     3.86E−16         2.76E−16
    1.099996     2.76E−16         0.0
    1.099998    −5.55E−17         1.11E−16
    1.100000     5.55E−17         0.0
    1.100002     5.55E−17         5.55E−17
    1.100004    −5.55E−17         0.0
    1.100006    −1.67E−16        −1.67E−16
    1.100008    −6.11E−16        −5.00E−16
Figure 2.7  Band of uncertainty in evaluation of a function (y = f(x); simple root and double root).
The numerical results are given in Table 2.10. Note that the arithmetic being
used has about 16 decimal digits in the floating-point representation. Thus,
according to the numerical results in the table, no more than 6 digits of accuracy
can be expected in calculating the root a= 1.1 of f(x). Also note the effect of
using the different representations (2.7.3) and (2.7.4).
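The following short Python experiment repeats the evaluation behind Table 2.10; on present-day hardware the individual values will differ in detail from the table, but the erratic signs near x = 1.1 show the same behavior.

    def f_expanded(x):           # expanded form (2.7.3)
        return x**4 - 5.4*x**3 + 10.56*x**2 - 8.954*x + 2.7951

    def f_nested(x):             # nested form (2.7.4)
        return 2.7951 + x*(-8.954 + x*(10.56 + x*(-5.4 + x)))

    for k in range(-4, 5):
        x = 1.1 + 2.0e-6 * k     # the same evaluation points as in Table 2.10
        print(f"{x:.6f}  {f_expanded(x): .2e}  {f_nested(x): .2e}")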
There is uncertainty in evaluating any function f(x) due to the use of finite
precision arithmetic with its resultant rounding or chopping error. This was
discussed in Section 3 of Chapter 1, under the name of noise in function
evaluation. For multiple roots, this leads to considerable uncertainty as to the
location of the root. In Figure 2.7, the solid line indicates the graph of y = f(x),
and the dotted lines give the region of uncertainty in the evaluation of f(x),
which is due to rounding errors and finite-digit arithmetic. The interval of
uncertainty in finding the root of f(x) is given by the intersection of the band
about the graph of y = f(x) and the x-axis. It is clearly greater with the double
root than with the simple root, even though the vertical widths of the bands
about y = f(x) are the same.
Newton's method and multiple roots Another problem with multiple roots is
that the earlier rootfinding methods will not perform as well when the root being
sought is multiple. We now investigate this for Newton's method.
We consider Newton's method as a fixed-point method, as in (2.5.2), with f(x)
satisfying (2.7.1):

    g(x) = x − f(x)/f'(x)        x ≠ α

Before calculating g'(α), we first simplify g(x) using (2.7.1):

    f'(x) = (x − α)^p h'(x) + p(x − α)^{p−1} h(x)

    g(x) = x − (x − α)h(x) / [p h(x) + (x − α)h'(x)]
Table 2.11  Newton's method for (2.7.6)

    n     x_n             f(x_n)        α − x_n      Ratio
    0     1.22            −1.88E−4      1.00E−2
    1     1.2249867374    −4.71E−5      5.01E−3
    2     1.2274900222    −1.18E−5      2.51E−3      .502
    3     1.2287441705    −2.95E−6      1.26E−3      .501
    4     1.2293718746    −7.38E−7      6.28E−4      .501
    5     1.2296858846    −1.85E−7      3.14E−4      .500
    18    1.2299999621    −2.89E−15     3.80E−8      .505
    19    1.2299999823    −6.66E−16     1.77E−8      .525
    20    1.2299999924    −1.11E−16     7.58E−9      .496
    21    1.2299999963     0.0          3.66E−9      .383

Differentiating,

    g'(x) = 1 − h(x)/[p h(x) + (x − α)h'(x)] − (x − α) d/dx [ h(x)/(p h(x) + (x − α)h'(x)) ]

and

    g'(α) = 1 − 1/p ≠ 0        for p > 1                             (2.7.5)
Thus Newton's method is a linear method with rate of convergence (p - 1)/p.
Example Find the smallest root of
f(x) = -4.68999 + x(9.1389 + x( -5.56 + x)) (2.7 .6)
using Newton's method. The numerical results are shown in Table 2.11. The
calculations were done on an IBM PC microcomputer in double precision
arithmetic (in BASIC). Only partial results are shown, to indicate the general
course of the calculation. The column labeled Ratio is the rate of linear
convergence as measured by An in (2.6.3).
The Newton method for solving for the root of (2.7.6) is clearly linear in this
case, with a linear rate of g'(α) = 1/2. This is consistent with (2.7.5), since
α = 1.23 is a root of multiplicity p = 2. The final iterates in the table are being
affected by the noise in the computer evaluation of f(x). Even though the
floating-point representation contains about 16 digits, only about 8 digits of
accuracy can be found in this case.
To improve Newton's method, we would like a function g(x) for which
g'(α) = 0. Based on the derivation of (2.7.5), define

    g(x) = x − p f(x)/f'(x)

Then easily, g'(α) = 0; thus,

    α − x_{n+1} = −½ g''(ξ_n)(α − x_n)²

with ξ_n between x_n and α. Thus the error decreases quadratically, showing that the method

    x_{n+1} = x_n − p f(x_n)/f'(x_n)        n = 0, 1, 2, ...         (2.7.7)

has order of convergence two, the same as the original Newton method for simple
roots.
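A minimal Python sketch of (2.7.7) is given below; the function names, tolerance, and iteration limit are illustrative choices.

    def modified_newton(f, df, x0, p, tol=1e-12, max_iter=50):
        """Iteration (2.7.7): x_{n+1} = x_n - p*f(x_n)/f'(x_n), for a root
        of known multiplicity p."""
        x = x0
        for _ in range(max_iter):
            step = p * f(x) / df(x)
            x -= step
            if abs(step) <= tol:
                break
        return x

    # The double root alpha = 1.23 of (2.7.6):
    f  = lambda x: -4.68999 + x * (9.1389 + x * (-5.56 + x))
    df = lambda x: 9.1389 + x * (-11.12 + 3.0 * x)
    root = modified_newton(f, df, 1.22, p=2)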
Example Apply (2.7.7) to the preceding example (2.7.6), using p = 2 for a
double root. The results are given in Table 2.12, using the same computer as
previously. The iterates converge rapidly, and then they oscillate around the root.
The accuracy (or lack of it) reflects the noise in f(x) and the multiplicity of the
root.
Newton's method can be used to determine the multiplicity p, as in Table 2.11
combined with (2.7.5), and then (2.7.7) can be used to speed up the convergence.
But the inherent uncertainty in the root due to the noise and the multiplicity will
remain. This can be removed only by analytically reformulating the rootfinding
problem as a new one in which the desired root a is simple. The easiest way to do
Table 2.12 Modified Newton's method (2.7.7), applied to (2.7.6)
    n    x_n             f(x_n)        α − x_n
0 1.22 -1.88E- 4 l.OOE- 2
1 1.2299734748 -1.31E- 9 2.65E- 5
2 1.2299999998 -l.llE- 16 1.85E- 10
3 1.2300003208 -1.92E- 13 -3.21E- 7
4 1.2300000001 -l.llE- 16 -8.54E- 11
5 1.2299993046 -9.04E- 13 6.95E- 7
this is to form the (p − 1)st derivative of f(x), and to then solve

    f^(p−1)(x) = 0                                                   (2.7.8)

which will have α as a simple root.
Example The previous example had a root of multiplicity p = 2. Then it is a
simple root of
    f'(x) = 3x² − 11.12x + 9.1389
Using the last iterate in Table 2.11 as an initial guess, and applying Newton's
method to finding the root of f'(x) just given, only one iteration was needed to
find the value of α to the full precision of the computer.
2.8 Brent's Rootfinding Algorithm

We describe an algorithm that combines the advantages of the bisection method
and the secant method, while avoiding the disadvantages of each of them. The
algorithm is due to Brent (1973, chap. 4), and it is a further development of an
earlier algorithm due to Dekker (1969). The algorithm results in a small interval
that contains the root. If the function is sufficiently smooth around the desired
root ζ, then the order of convergence will be superlinear, as with the secant
method.

In describing the algorithm we use the notation of Brent (1973, p. 47). The
program is entered with two values, a_0 and b_0, for which (1) there is at least one
root ζ of f(x) between a_0 and b_0, and (2) f(a_0)f(b_0) ≤ 0. The program is also
entered with a desired tolerance ε, from which a stopping tolerance δ is
produced:

    δ = ε + 2u|b|                                                    (2.8.1)

with u the unit round for the computer [see (1.2.11) of Chapter 1].

In a typical step of the algorithm, b is the best current estimate of the root ζ,
a is the previous value of b, and c is a past iterate that has been so chosen that
the root ζ lies between b and c (initially c = a). Define m = ½(c − b).

Stop the algorithm if (1) f(b) = 0, or (2) |m| ≤ δ. In either case, accept b as the
approximate root ζ̂. For case (2), because b will usually have been obtained
by the secant method, the root ζ will generally be closer to b than to c. Thus
usually,

    |ζ − ζ̂| ≤ δ

although all that can be guaranteed is that

    |ζ − ζ̂| ≤ 2δ
If the error test is not satisfied, set

    b̄ = b − f(b)(b − a) / [f(b) − f(a)]                              (2.8.2)

Then set

    b'' = { b̄          if b̄ lies between b and b + m = (b + c)/2
          { b + m      otherwise [which is the bisection method]

In the case that a, b, and c are distinct, the secant method in the definition of b̄
is replaced by an inverse quadratic interpolation method. This results in a very
slightly faster convergence for the overall algorithm. Following the determination
of b'', define

    b' = { b''                if |b − b''| > δ
         { b + δ sign(m)      if |b − b''| ≤ δ                       (2.8.3)
If you are some distance from the root, then b' = b". With this choice, the
method is (1) linear (or quadratic) interpolation, or (2) the bisection method;
usually it is (1) for a smooth function f(x). This generally results in a value of m
that does not become small. To obtain a small interval containing the root ζ,
once we are close to it, we use b' := b + δ sign(m), a step of δ in the direction
of c. Because of the way in which a new c is chosen, this will usually result in a
new small interval about ζ. Brent makes an additional important, but technical
step before choosing a new b, usually the b' just given.
Having obtained the new b', we set b = b', a= the old value of b. If the sign
of f(b), using the new b, is the same as with the old b, the value of c is
unchanged; otherwise, c is set to the old value of b, resulting in a smaller interval
about ζ. The accuracy of the value of b is now tested, as described earlier.
Brent has taken great care to avoid underflow and overflow difficulties with his
method, but the program is somewhat complicated to read as a consequence.
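Brent-style rootfinders are available in standard libraries. For example, SciPy's scipy.optimize.brentq implements a descendant of the Brent (1973) algorithm (it is not the particular program discussed here). A minimal usage example, on the function of Case (1) in the example that follows:

    from scipy.optimize import brentq

    f = lambda x: (x - 1.0) * (1.0 + (x - 1.0) ** 2)   # Case (1) below
    root = brentq(f, 0.0, 3.0, xtol=1e-5)
    # requires f(0) and f(3) to have opposite signs; root is approximately 1.0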
Example Each of the following cases was computed on an IBM PC with 8087
arithmetic coprocessor and single precision arithmetic satisfying the IEEE stan-
dard for floating-point arithmetic. The tolerance was ε = 10⁻⁵, and thus

    δ = 10⁻⁵ + 2|b| × (5.96 × 10⁻⁸) ≈ 1.01 × 10⁻⁵

since the root is ζ = 1 in all cases. The functions were evaluated in the form in
which they are given here, and in all cases, the initial interval was [a, b] = [0, 3].
The table values for b and c are rounded to seven decimal digits.
Table 2.13  Example 1 of Brent's method

    b            f(b)          c
    0.0          −2.00E+0      3.0
    0.5          −6.25E−1      3.0
    .7139038     −3.10E−1      3.0
    .9154507     −8.52E−2      3.0
    .9901779     −9.82E−3      3.0
    .9998567     −1.43E−4      3.0
    .9999999     −1.19E−7      3.0
    .9999999     −1.19E−7      1.000010
Case (1)  f(x) = (x − 1)[1 + (x − 1)²]. The numerical results are given in
Table 2.13. This illustrates the necessity of using b' = b + δ sign(m) in order to
obtain a small interval enclosing the root ζ.

Case (2)  f(x) = x² − 1. The numerical results are given in Table 2.14.

Case (3)  f(x) = −1 + x(3 + x(−3 + x)). The root is ζ = 1, of multiplicity
three, and it took 50 iterations to converge to the approximate root 1.000001.
With the initial values a = 0, b = 3, the bisection method would use only 19
iterations for the same accuracy. If Brent's method is compared with the
bisection method over the class of all continuous functions, then the number of
necessary iterates for an error tolerance of δ is approximately

    log₂((b − a)/δ)          for the bisection method

    [log₂((b − a)/δ)]²       for Brent's algorithm
Table 2.14  Example 2 of Brent's method
b /(b) c
0.0 -1.00 3.0
.3333333 -8.89E- 1 3.0
.3333333 -8.89E -1 1.666667
.7777778 -4.00E- 1 1.666667
1.068687 1.42E -1 .7777778
.9917336 -1.65E- 2 1.068687
.9997244 -5.51E- 4 1.068687
1.000000 2.38E- 7 .9997244
1.000000 2.38E- 7 .9999900
Table 2.15  Example 4 of Brent's method

    b            f(b)          c
    0.0          −3.68E−1      3.0
    .5731754     −1.76E−3      3.0
    .5959331     −1.63E−3      3.0
    .6098443     −5.47E−4      3.0
    .6136354     −4.76E−4      1.804922
    .6389258     −1.68E−4      1.804922
    1.22192       3.37E−10     .6389258
    1.221914      3.37E−10     .6389258
    1.216585      1.20E−10     .6389258
    .9277553      0.0          1.216585
Thus there are cases for which bisection is better, as our example shows. But for
sufficiently smooth functions with f'(α) ≠ 0, Brent's algorithm is almost always
far faster.

Case (4)  f(x) = (x − 1) exp[−1/(x − 1)²]. The root x = 1 has infinite mul-
tiplicity, since f^(r)(1) = 0 for all r ≥ 0. The numerical results are given in Table
2.15. Note that the routine has found an exact root for the machine version of
f(x), due to the inherent imprecision in the evaluation of the function; see the
preceding section on multiple roots. This root is of course very inaccurate, but
this is not something that the program can remedy.
Brent's original program, published in 1973, continues to be very popular and
well-used. Nonetheless, improvements and extensions of it continue to be made.
For one of them, and for a review of others, see Le (1985).
2.9 Roots of Polynomials
We will now consider solving the polynomial equation

    p(x) = a_0 + a_1x + ··· + a_nx^n = 0                             (2.9.1)
This problem arises in many ways, and a large literature has been created to deal
with it. Sometimes a particular root is wanted and a good initial guess is known.
In that case, the best approach is to modify one of the earlier iterative methods to
take advantage of the special form of polynomials. In other cases, little may be
known about the location of the roots, and then other methods must be used, of
which there are many. In this section, we just give a brief excursion into the area
of polynomial rootfinding, without any pretense of completeness. Modifications
of the methods of earlier sections will be emphasized, and numerical stability
questions will be considered. We begin with a review of some results on bounding
or roughly locating the roots of (2.9.1).
Location theorems Because p(x) is a polynomial, many results can be given
about the roots of p(x), results that are not true for other functions. The best
known of these is the fundamental theorem of algebra, which allows us to write
p(x) as a unique product (except for order) involving the roots,

    p(x) = a_n(x − z_1)(x − z_2) ··· (x − z_n)                       (2.9.2)

and z_1, ..., z_n are the roots of p(x), repeated according to their multiplicity. We
now give some classical results on locating and bounding these roots.
Descartes's rule of signs is used to bound the number of positive real roots of
p(x), assuming the coefficients a_0, ..., a_n are all real.

    Let v be the number of changes of sign in the coefficients of p(x) in (2.9.1),
    ignoring the zero terms. Let k denote the number of positive real roots of
    p(x), counted according to their multiplicity. Then k ≤ v and v − k is
    even.
A proof of this is given in Henrici (1974, p. 442) and Householder (1970, p. 82).
Example  The expression p(x) = x⁶ − x − 1 has v = 1 changes of sign. There-
fore, k = 1; otherwise, k = 0, and v − k = 1 is not an even integer, a contradic-
tion.

Descartes's rule of signs can also be used to bound the number of negative
roots of p(x). Apply it to the polynomial

    q(x) = p(−x)

Its positive roots are the negative roots of p(x). Applying this to the last
example, q(x) = x⁶ + x − 1. Again there is one positive real root [of q(x)], and
thus one negative real root of p(x).
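Counting the sign changes is easy to mechanize. The following Python sketch (the helper name and the coefficient ordering [a_0, ..., a_n] are illustrative conventions) computes v for p(x) and for p(−x).

    def descartes_bound(coeffs):
        """Number of sign changes v in the coefficients [a_0, ..., a_n],
        ignoring zero terms; v bounds the number of positive real roots."""
        signs = [c > 0 for c in coeffs if c != 0]
        return sum(1 for s0, s1 in zip(signs, signs[1:]) if s0 != s1)

    # p(x) = x^6 - x - 1: one sign change, so k = 1 by the argument above
    v_pos = descartes_bound([-1, -1, 0, 0, 0, 0, 1])      # v = 1
    # q(x) = p(-x) = x^6 + x - 1 bounds the negative roots of p(x)
    v_neg = descartes_bound([-1, 1, 0, 0, 0, 0, 1])       # v = 1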
An upper bound for all of the roots of p(x) is given by the following:

    |z_j| ≤ 1 + max_{0 ≤ k ≤ n−1} |a_k|/|a_n|        j = 1, ..., n   (2.9.3)

This is due to Augustin Cauchy, in 1829, and a proof is given in Householder
(1970, p. 71). Another such result of Cauchy is based on considering the
polynomials

    |a_n|x^n − |a_{n−1}|x^{n−1} − ··· − |a_1|x − |a_0|               (2.9.4)

    |a_n|x^n + |a_{n−1}|x^{n−1} + ··· + |a_1|x − |a_0|               (2.9.5)

Assume that a_0 ≠ 0, which is equivalent to assuming x = 0 is not a root of p(x).
Then by Descartes's rule of signs, each of these polynomials has exactly one
positive root; call them ρ_1 and ρ_2, respectively. Then all roots z_j of p(x) satisfy

    ρ_2 ≤ |z_j| ≤ ρ_1        j = 1, ..., n                           (2.9.6)
The proof of the upper bound is given in Henrici (1974, p. 458) and Householder
(1970, p. 70). The proof of the lower bound can be based on the following
approach, which can also be used in constructing a lower bound for (2.9.3).
Consider the polynomial

    g(x) = x^n p(1/x) = a_0x^n + a_1x^{n−1} + ··· + a_n        a_0 ≠ 0        (2.9.7)

Then the roots of g(x) are 1/z, where z is a root of p(x). If the upper bound
result of (2.9.6) is applied to (2.9.7), the lower bound result of (2.9.6) is obtained.
We leave this application to be shown in a problem.

Because each of the polynomials (2.9.4), (2.9.5) has a single simple positive
root, Newton's method can be easily used to construct ρ_1 and ρ_2. As an initial
guess, use the upper bound from (2.9.3) or experiment with smaller positive
initial guesses. We leave the illustration of these results to the problems.
There are many other results of the preceding type, and both Henrici (1974,
chap. 6) and Householder (1970) wrote excellent treatises on the subject.
Nested multiplication  A very efficient way to evaluate the polynomial p(x)
given in (2.9.1) is to use nested multiplication:

    p(x) = a_0 + x(a_1 + x(a_2 + ··· + x(a_{n−1} + a_nx) ··· ))      (2.9.8)

With formula (2.9.1), there are n additions and 2n − 1 multiplications, and with
(2.9.8) there are n additions and n multiplications, a considerable saving.

For later work, it is convenient to introduce the following auxiliary coeffi-
cients. Let b_n = a_n,

    b_k = a_k + zb_{k+1}        k = n − 1, n − 2, ..., 0             (2.9.9)

By considering (2.9.8), it is easy to see that

    p(z) = b_0                                                       (2.9.10)

Introduce the polynomial

    q(x) = b_1 + b_2x + ··· + b_nx^{n−1}                             (2.9.11)

Then

    b_0 + (x − z)q(x) = b_0 + (x − z)[b_1 + b_2x + ··· + b_nx^{n−1}]
                      = (b_0 − b_1z) + (b_1 − b_2z)x + ··· + (b_{n−1} − b_nz)x^{n−1} + b_nx^n
                      = a_0 + a_1x + ··· + a_nx^n = p(x)

    p(x) = b_0 + (x − z)q(x)                                         (2.9.12)
where q(x) is the quotient and b_0 the remainder when p(x) is divided by x − z.
The use of (2.9.9) to evaluate p(z) and to form the quotient polynomial q(x) is
also called Horner's method.

If z is a root of p(x), then b_0 = 0 and p(x) = (x − z)q(x). To find
additional roots of p(x), we can restrict our search to the roots of q(x). This
reduction process is called deflation; it must be used with some caution, a point
we will return to later.
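A short Python sketch of the recursion (2.9.9) is given below; it returns both p(z) = b_0 and the quotient coefficients b_1, ..., b_n of (2.9.11). The function name and coefficient ordering are illustrative conventions.

    def horner(a, z):
        """Evaluate p(z) by nested multiplication (2.9.9)-(2.9.10).
        a = [a_0, a_1, ..., a_n].  Returns (b_0, q), where b_0 = p(z) and
        q = [b_1, ..., b_n] are the coefficients of q(x) in (2.9.12)."""
        n = len(a) - 1
        b = [0.0] * (n + 1)
        b[n] = a[n]
        for k in range(n - 1, -1, -1):     # b_k = a_k + z*b_{k+1}
            b[k] = a[k] + z * b[k + 1]
        return b[0], b[1:]

    # p(x) = 2 - 3x + x^2 = (x - 1)(x - 2); dividing by x - 1:
    p0, q = horner([2.0, -3.0, 1.0], 1.0)
    # p0 == 0.0 and q == [-2.0, 1.0], i.e., q(x) = x - 2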
Newton's method If we want to apply Newton's method to find a root of p(x),
we must be able to evaluate both p(x) and p'(x) at any point z. From (2.9.12),
    p'(x) = (x − z)q'(x) + q(x)        p'(z) = q(z)                  (2.9.13)

We use (2.9.10) and (2.9.13) in the following adaptation of Newton's method to
polynomial rootfinding.
Algorithm Polynew (a, n, x_0, ε, itmax, root, b, ier)

1.  Remark: a is the vector of polynomial coefficients, ε an error tolerance,
    itmax the maximum number of iterates to be computed, b the vector of
    coefficients for the deflated polynomial, and ier an error indicator.
2.  itnum := 1
3.  z := x_0, b_n := a_n, c := b_n
4.  For k = n − 1, ..., 1:  b_k := a_k + zb_{k+1},  c := b_k + zc
5.  b_0 := a_0 + zb_1
6.  If c = 0, ier := 2 and exit.
7.  x_1 := x_0 − b_0/c
8.  If |x_1 − x_0| ≤ ε, then ier := 0, root := x_1, and exit.
9.  If itnum = itmax, then ier := 1 and exit.
10. Otherwise, itnum := itnum + 1, x_0 := x_1, and go to step 3.
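The following Python sketch follows the steps of Polynew; the return convention (root, deflated coefficients, error flag) mirrors the algorithm, but the details are illustrative.

    def polynew(a, x0, eps=1e-12, itmax=100):
        """Newton's method for a root of p(x) = a[0] + a[1]x + ... + a[n]x^n,
        with p(z) and p'(z) = q(z) computed together via (2.9.9) and (2.9.13)."""
        n = len(a) - 1
        for itnum in range(1, itmax + 1):
            z = x0
            b = [0.0] * (n + 1)
            b[n] = a[n]
            c = b[n]                        # c accumulates q(z) = p'(z)
            for k in range(n - 1, 0, -1):
                b[k] = a[k] + z * b[k + 1]
                c = b[k] + z * c
            b[0] = a[0] + z * b[1]
            if c == 0.0:
                return x0, b[1:], 2         # ier = 2: zero derivative
            x1 = x0 - b[0] / c
            if abs(x1 - x0) <= eps:
                return x1, b[1:], 0         # ier = 0: converged
            x0 = x1
        return x1, b[1:], 1                 # ier = 1: itmax reached

    root, deflated, ier = polynew([-6.0, 11.0, -6.0, 1.0], 4.0)   # (x-1)(x-2)(x-3)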
Stability problems There are many polynomials in which the roots are quite
sensitive to small changes in the coefficients. Some of these are problems with
multiple roots, and it is not surprising that these roots are quite sensitive to small
changes in the coefficients. But there are many polynomials with only simple
roots that appear to be well separated, and for which the roots are still quite
sensitive to small perturbations. Formulas are derived below that explain this
sensitivity, and numerical examples are also given.
For the theory, introduce a second polynomial q(x), and define a perturbation of p(x) by

    p(x; ε) = p(x) + εq(x)                                           (2.9.15)

Denote the zeros of p(x; ε) by z_1(ε), ..., z_n(ε), repeated according to their
multiplicity, and let z_j = z_j(0), j = 1, ..., n, denote the corresponding n zeros of
p(x) = p(x; 0). It is well known that the zeros of a polynomial are continuous
functions of the coefficients of the polynomial [see, for example, Henrici (1974, p.
281)]. Thus z_j(ε) is a continuous function of ε. What we want to
determine is how rapidly the root z_j(ε) varies with ε, for ε near 0.
Example  Define

    p(x; ε) = (x − 1)³ − ε        p(x) = (x − 1)³        ε > 0

Then the roots of p(x) are z_1 = z_2 = z_3 = 1. The roots of p(x; ε) are

    z_j(ε) = 1 + ε^{1/3} ω^{j−1}        j = 1, 2, 3

with ω = ½(−1 + i√3). For all three roots of p(x; ε),

    |z_j(ε) − z_j| = ε^{1/3}

To illustrate this, let ε = .001. Then

    p(x; ε) = x³ − 3x² + 3x − 1.001

which is a relatively small change in p(x). But for the roots,

    |z_j(ε) − z_j| = .1

a relatively large change in the roots z_j = 1.
We now give some more general estimates for z_j(ε) − z_j.

Case (1)  z_j is a simple root of p(x), and thus p'(z_j) ≠ 0. Using the theory of
functions of a complex variable, it is known that z_j(ε) can be written as a power
series:

    z_j(ε) = z_j + Σ_{l ≥ 1} γ_l ε^l                                  (2.9.16)
To estimate z_j(ε) − z_j, we obtain a formula for the first term γ_1 in the series. To
begin, it is easy to see that

    γ_1 = z_j'(0)

To calculate z_j'(ε), differentiate the identity

    p(z_j(ε)) + εq(z_j(ε)) = 0                                       (2.9.17)

which holds for all sufficiently small ε. We obtain

    p'(z_j(ε))z_j'(ε) + q(z_j(ε)) + εq'(z_j(ε))z_j'(ε) = 0

    z_j'(ε) = −q(z_j(ε)) / [p'(z_j(ε)) + εq'(z_j(ε))]

Substituting ε = 0, we obtain

    γ_1 = z_j'(0) = −q(z_j)/p'(z_j)

Returning to (2.9.16),

    |z_j(ε) − z_j − εγ_1| ≤ Kε²        |ε| ≤ ε_0                      (2.9.18)

for some constants ε_0 > 0 and K > 0. To estimate z_j(ε) for small ε, we use

    z_j(ε) ≈ z_j + εγ_1 = z_j − ε q(z_j)/p'(z_j)                      (2.9.19)

The coefficient of ε determines how rapidly z_j(ε) changes relative to ε; if it is
large, the root z_j is called ill-conditioned.
Case (2)  z_j has multiplicity m > 1. By using techniques related to those used in
Case (1), we can obtain

    |z_j(ε) − z_j − ε^{1/m} γ_1| ≤ K|ε|^{2/m}        |ε| ≤ ε_0        (2.9.20)

for some ε_0 > 0, K > 0. There are m possible values for γ_1, given as the m
complex roots of

    γ_1^m = −m! q(z_j) / p^(m)(z_j)
Example  Consider the simple polynomial

    p(x) = (x − 1)(x − 2) ··· (x − 7) = x⁷ − 28x⁶ + ··· + 13068x − 5040        (2.9.21)

For the perturbation, take

    q(x) = x⁶        ε = −.002

Then for the root z_j = j,

    p'(z_j) = ∏_{i ≠ j} (j − i)

From (2.9.19), we have the estimate

    z_j(ε) − j ≈ −ε q(j)/p'(j) = .002 j⁶ / ∏_{i ≠ j}(j − i) ≡ δ(j)    (2.9.22)
The numerical values of δ(j) are given in Table 2.16. The relative error in the
coefficient of x⁶ is .002/28 ≈ 7.1E−5, but the relative errors in the roots are
much larger. In fact, the size of some of the perturbations δ(j) casts doubt on
the validity of using the linear estimate (2.9.22). The actual roots of p(x; ε) = p(x) + εq(x)
are given in Table 2.17, and they correspond closely to the predicted perturba-
tions. The major departure is in the roots for j = 5 and j = 6. They are complex,
which was not predicted by the linear estimate (2.9.22). In these two cases, ε is
outside the radius of convergence of the power series (2.9.16), since the latter will
only have the real coefficients

    γ_l = z_j^(l)(0) / l!

obtained from differentiating (2.9.17).
Table 2.16  Values of δ(j) from (2.9.22)

    j    δ(j)
    1     2.78E−6
    2    −1.07E−3
    3     3.04E−2
    4    −2.28E−1
    5     6.51E−1
    6    −7.77E−1
    7     3.27E−1
Table 2.17 Roots of p(x; f.) for (2.9.21)
    j    z_j(ε)                         z_j(ε) − z_j(0)
1 1.0000028 2.80E- 6
2 1.9989382 -1.06E- 3
3 3.0331253 3.31E- 2
4 3.8195692 -1.80E- 1
5 5.4586758 + .54012578i
6 5.4586758 - .54012578i
7 7.2330128 2.33E- 1
We say that a polynomial whose roots are unstable with respect to small
relative changes in the coefficients is ill-conditioned. Many such polynomials
occur naturally in applications. The previous example should illustrate the
difficulty in determining with only a cursory examination whether or not a
polynomial is ill-conditioned.
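The experiment behind Table 2.17 is easy to repeat with standard library routines; the following Python/NumPy sketch computes the roots of the perturbed polynomial (the exact digits obtained will depend on the arithmetic used).

    import numpy as np

    p = np.poly(np.arange(1, 8))    # coefficients of (x-1)(x-2)...(x-7), highest degree first
    q = np.zeros_like(p)
    q[1] = 1.0                      # x^6 term (index 1 in highest-degree-first order)
    eps = -0.002

    print(np.roots(p + eps * q))    # roots of p(x; eps), cf. Table 2.17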
Polynomial deflation  Another problem occurs with the deflation of a polynomial to
a lower degree polynomial, a process defined following (2.9.12). Since the zeros
will not be found exactly, the lower degree polynomial (2.9.11) found by
extracting the latest root will generally be in error in all of its coefficients. Clearly
from the past example, this can cause a significant perturbation in the roots for
some classes of polynomials. Wilkinson (1963) has analyzed the effects of
deflation and has recommended the following general strategy: (1) Solve for the
roots of smallest magnitude first, ending with those of largest size; (2) after
obtaining approximations to all roots, iterate again using the original polynomial
and using the previously calculated values as initial guesses. A complete discus-
sion can be found in Wilkinson (1963, pp. 55-65).
Example  Consider finding the roots of the degree 6 Laguerre polynomial

    p(x) = 720 − 4320x + 5400x² − 2400x³ + 450x⁴ − 36x⁵ + x⁶
The Newton algorithm of the last section was used to solve for the roots, with
deflation following the acceptance of each new root. The roots were calculated in
two ways: (1) from largest to smallest, and (2) from smallest to largest. The
calculations were in single precision arithmetic on an IBM 360, and the numeri-
cal results are given in Table 2.18. A comparison of the columns headed Method
(1) and Method (2) shows clearly the superiority of calculating the roots in the
order of increasing magnitude. If the results of method (1) are used as initial
guesses for further iteration with the original polynomial, then approximate roots
are obtained with an accuracy better than that of method (2); see the column
headed Method (3) in the table. This table shows the importance of iterating
again with the original polynomial to remove the effects of the deflation process.
Table 2.18  Example involving polynomial deflation

    True        Method (1)    Method (2)    Method (3)
    15.98287    15.98287      15.98279      15.98287
    9.837467    9.837471      9.837469      9.837467
    5.775144    5.775764      5.775207      5.775144
    2.992736    2.991080      2.992710      2.992736
    1.188932    1.190937      1.188932      1.188932
    .2228466    .2219429      .2228466      .2228466
There are other ways to deflate a polynomial, one of which favors finding roots
of largest magnitude first. For a complete discussion see Peters and Wilkinson
(1971, sec. 5). An algorithm is given for composite deflation, which removes the
need to find the roots in any particular order. In that paper, the authors also
discuss the use of implicit deflation, working with

    p(x) / [(x − z_1) ··· (x − z_r)]

to remove the roots z_1, ..., z_r that have been computed previously. This was
given earlier, in (2.4.4), where it was used in connection with Muller's method.
General polynomial rootfinding methods There are a large number of rootfind-
ing algorithms designed especially for polynomials. Many of these are taken up in
detail in the books Dejon and Henrici (1969), Henrici (1974, chap. 6), and
Householder (1970). There are far too many types of such methods to attempt to
describe them all here.
One large class of important methods uses location theorems related to those
described in (2.9.3)-(2.9.6), to iteratively separate the roots into disjoint and ever
smaller regions, often circles. The best known of such methods is probably the
Lehmer-Schur method [see Householder (1970, sec. 2.7)]. Such methods converge
linearly, and for that reason, they are often combined with some more rapidly
convergent method, such as Newton's method. Once the roots have been sep-
arated into distinct regions, the faster method is applied to rapidly obtain the
root within that region. For a general discussion of such rootfinding methods, see
Henrici (1974, sec. 6.10).
Other methods that have been developed into widely used algorithms are the
method of Jenkins and Traub and the method of Laguerre. For the former, see
Householder (1970, p. 173), Jenkins and Traub (1970), (1972). For Laguerre's
method, see Householder (1970, sec. 4.5) and Kahan (1967).
Another easy-to-use numerical method is based on being able to calculate the
eigenvalues of a matrix. Given the polynomial p(x), it is possible to easily
construct a matrix with p(x) as its characteristic polynomial (see Problem 2 of
Chapter 9). Since excellent software exists for solving the eigenvalue problem,
this software can be used to find the roots of a polynomial p(x).
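A small Python sketch of this idea follows; the companion matrix construction shown is one standard choice (and is essentially what numpy.roots does internally), while the helper name is an illustrative convention.

    import numpy as np

    def poly_roots_via_eigs(a):
        """Roots of p(x) = a[0] + a[1]x + ... + a[n]x^n as the eigenvalues of a
        companion matrix whose characteristic polynomial is p(x)/a[n]."""
        a = np.asarray(a, dtype=float)
        n = len(a) - 1
        C = np.zeros((n, n))
        C[1:, :-1] = np.eye(n - 1)          # ones on the subdiagonal
        C[:, -1] = -a[:-1] / a[-1]          # last column from the coefficients
        return np.linalg.eigvals(C)

    roots = poly_roots_via_eigs([-6.0, 11.0, -6.0, 1.0])    # (x-1)(x-2)(x-3)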
2.10 Systems of Nonlinear Equations
This section and the next are concerned with the numerical solution of systems of
nonlinear equations in several variables. These problems are widespread in
applications, and they are varied in form. There is a great variety of methods for
the solution of such systems, so we only introduce the subject. We give some
general theory and some numerical methods that are easily programmed. To do a
complete development of the numerical analysis of solving nonlinear systems, we
would need a number of results from numerical linear algebra, which is not taken
up until Chapters 7-9.
For simplicity of presentation and ease of understanding, the theory is
presented for only two equations:

    f_1(x_1, x_2) = 0        f_2(x_1, x_2) = 0                       (2.10.1)

The generalization to n equations in n variables should be straightforward once
the principal ideas have been grasped. As an additional aid, we will simulta-
neously consider the solution of (2.10.1) in vector notation:

    f(x) = 0        x = [x_1, x_2]^T        f(x) = [f_1(x), f_2(x)]^T        (2.10.2)
The solution of (2.10.1) can be looked upon as a two-step process: (1) Find the
zero curves in the x_1x_2-plane of the surfaces z = f_1(x_1, x_2) and z = f_2(x_1, x_2),
and (2) find the points of intersection of these zero curves in the x_1x_2-plane. This
perspective is used in the next section to generalize Newton's method to solve
(2.10.1).
Fixed-point theory  We begin by generalizing some of the fixed-point iteration
theory of Section 2.5. Assume that the rootfinding problem (2.10.1) has been
reformulated in an equivalent form as

    x_1 = g_1(x_1, x_2)        x_2 = g_2(x_1, x_2)                   (2.10.3)

Denote its solution by α = (α_1, α_2). We study the fixed-point iteration

    x_{1,n+1} = g_1(x_{1,n}, x_{2,n})        x_{2,n+1} = g_2(x_{1,n}, x_{2,n})        n ≥ 0        (2.10.4)

Using vector notation, we write this as

    x_{n+1} = g(x_n)                                                 (2.10.5)
Table 2.19 Example (2.10.7) of fixed-point iteration
    n    x_{1,n}         x_{2,n}        f_1(x_{1,n}, x_{2,n})    f_2(x_{1,n}, x_{2,n})
0 -.5 .25 0.0 1.56E- 2
1 - .497343750 .254062500 2.43E- 4 5.46E- 4
2 -.497254194 .254077922 9.35E- 6 2.12E- 5
3 - .497251343 .254078566 3.64E- 7 8.26E- 7
4 -.497251208 .254078592 1.50E- 8 3.30E- 8
with

    g(x) = [g_1(x_1, x_2), g_2(x_1, x_2)]^T
Example  Consider solving

    f_1 = 3x_1² + 4x_2² − 1 = 0        f_2 = x_2³ − 8x_1³ − 1 = 0    (2.10.6)

for the solution α near (x_1, x_2) = (−.5, .25). We solve this system iteratively with

    [ x_{1,n+1} ]   [ x_{1,n} ]   [ .016  −.17 ] [ 3x_{1,n}² + 4x_{2,n}² − 1 ]
    [ x_{2,n+1} ] = [ x_{2,n} ] − [ .52   −.26 ] [ x_{2,n}³ − 8x_{1,n}³ − 1  ]        (2.10.7)
The origin of this reformulation of (2.10.6) is given later. The numerical results of
(2.10.7) are given in Table 2.19. Clearly the iterates are converging rapidly.
To analyze the convergence of (2.10.5), begin by subtracting the two equations
in (2.10.4) from the corresponding equations

    α_1 = g_1(α_1, α_2)        α_2 = g_2(α_1, α_2)

involving the true solution α. Apply the mean value theorem for functions of two
variables (Theorem 1.5 with n = 1) to these differences to obtain

    α_i − x_{i,n+1} = (∂g_i(ξ_n^(i))/∂x_1)(α_1 − x_{1,n}) + (∂g_i(ξ_n^(i))/∂x_2)(α_2 − x_{2,n})

for i = 1, 2. The points ξ_n^(i) are on the line segment joining α and
x_n. In matrix form, these error equations become

    [ α_1 − x_{1,n+1} ]   [ ∂g_1(ξ_n^(1))/∂x_1    ∂g_1(ξ_n^(1))/∂x_2 ] [ α_1 − x_{1,n} ]
    [ α_2 − x_{2,n+1} ] = [ ∂g_2(ξ_n^(2))/∂x_1    ∂g_2(ξ_n^(2))/∂x_2 ] [ α_2 − x_{2,n} ]        (2.10.8)
Let G_n denote the matrix in (2.10.8). Then we can rewrite this equation as

    α − x_{n+1} = G_n(α − x_n)                                       (2.10.9)

It is convenient to introduce the Jacobian matrix for the functions g_1 and g_2:

    G(x) = [ ∂g_1(x)/∂x_1    ∂g_1(x)/∂x_2 ]
           [ ∂g_2(x)/∂x_1    ∂g_2(x)/∂x_2 ]                          (2.10.10)

In (2.10.9), if x_n is close to α, then G_n will be close to G(α). This will make the
size or norm of G(α) crucial in analyzing the convergence in (2.10.9). The matrix
G(α) plays the role of g'(α) in the theory of Section 2.5. To measure the size of
the errors α − x_n and of the matrices G_n and G(α), we use the vector and matrix
norms of (1.1.16) and (1.1.19) in Chapter 1.
Theorem 2.9  Let D be a closed, bounded, and convex set in the plane. (We say
D is convex if for any two points in D, the line segment joining
them is also in D.) Assume that the components of g(x) are
continuously differentiable at all points of D, and further assume

    1.  g(D) ⊂ D                                                     (2.10.11)

    2.  λ ≡ max_{x ∈ D} ||G(x)||_∞ < 1                                (2.10.12)

Then

(a)  x = g(x) has a unique solution α ∈ D.

(b)  For any initial point x_0 ∈ D, the iteration (2.10.5) will
     converge in D to α.

(c)      ||α − x_{n+1}||_∞ ≤ (||G(α)||_∞ + ε_n) ||α − x_n||_∞         (2.10.13)

     with ε_n → 0 as n → ∞.
Proof  (a) The existence of a fixed point α can be shown by proving that the
sequence of iterates {x_n} from (2.10.5) is convergent in D. We leave
that to a problem, and instead just show the uniqueness of α.

Suppose α and β are both fixed points of g(x) in D. Then

    α − β = g(α) − g(β)                                              (2.10.14)

Apply the mean value theorem to component i, obtaining

    g_i(α) − g_i(β) = (∂g_i(ξ^(i))/∂x_1)(α_1 − β_1) + (∂g_i(ξ^(i))/∂x_2)(α_2 − β_2)        i = 1, 2        (2.10.15)

with ξ^(i) ∈ D, on the line segment joining α and β. Since ||G(x)||_∞ ≤ λ
< 1, we have from the definition of the norm that

    |∂g_i(x)/∂x_1| + |∂g_i(x)/∂x_2| ≤ λ        x ∈ D,  i = 1, 2

Combining this with (2.10.15),

    ||g(α) − g(β)||_∞ ≤ λ||α − β||_∞                                  (2.10.16)

Combined with (2.10.14), this yields

    ||α − β||_∞ ≤ λ||α − β||_∞

which is possible only if α = β, showing the uniqueness of α in D.

(b) Condition (2.10.11) will ensure that all x_n ∈ D if x_0 ∈ D. Next
subtract x_{n+1} = g(x_n) from α = g(α), obtaining

    α − x_{n+1} = g(α) − g(x_n)

The result (2.10.16) applies to any two points in D. Applying this,

    ||α − x_{n+1}||_∞ ≤ λ||α − x_n||_∞                                (2.10.17)

Inductively,

    ||α − x_n||_∞ ≤ λ^n ||α − x_0||_∞                                 (2.10.18)

Since λ < 1, this shows x_n → α as n → ∞.

(c) From (2.10.9) and using (1.1.21),

    ||α − x_{n+1}||_∞ ≤ ||G_n||_∞ ||α − x_n||_∞                        (2.10.19)

As n → ∞, the points used in evaluating G_n will all tend to α, since
they are on the line segment joining x_n and α. Then ||G_n||_∞ → ||G(α)||_∞
as n → ∞. Result (2.10.13) follows from (2.10.19) by letting

    ε_n = ||G_n||_∞ − ||G(α)||_∞
The preceding theorem is the generalization to two variables of Theorem 2.6
for functions of one variable. The following generalizes Theorem 2.7.
Corollary 2.10  Let α be a fixed point of g(x), and assume the components of g(x)
are continuously differentiable in some neighborhood about α.
Further assume

    ||G(α)||_∞ < 1                                                   (2.10.20)

Then for x_0 chosen sufficiently close to α, the iteration x_{n+1} =
g(x_n) will converge to α, and the results of Theorem 2.9 will be
valid on some closed, bounded, convex region about α.

We leave the proof of this as a problem. Based on results in Chapter 7, the
linear convergence of x_n to α will still be true if all eigenvalues of G(α) are less
than one in magnitude, which can be shown to be a weaker assumption than
(2.10.20).
Example  Continue the earlier example (2.10.7). It is straightforward to compute G(x),
and therefore

    G(α) ≈ [ .038920    .000401 ]
           [ .008529   −.006613 ]        ||G(α)||_∞ ≈ .0393

Thus the condition (2.10.20) of the theorem is satisfied. From (2.10.13), it will be
approximately true that

    ||α − x_{n+1}||_∞ ≤ .0393 ||α − x_n||_∞

for all sufficiently large n.
Suppose that A is a constant nonsingular matrix of order 2 × 2. We can then
reformulate (2.10.1) as

    x = x + Af(x) ≡ g(x)                                             (2.10.21)

The example (2.10.7) illustrates this procedure. To see the requirements on A, we
produce the Jacobian matrix. Easily,

    G(x) = I + AF(x)

where F(x) is the Jacobian matrix of f_1 and f_2,

    F(x) = [ ∂f_1(x)/∂x_1    ∂f_1(x)/∂x_2 ]
           [ ∂f_2(x)/∂x_1    ∂f_2(x)/∂x_2 ]                          (2.10.22)

We want to choose A so that (2.10.20) is satisfied. And for rapid convergence, we
want ||G(α)||_∞ ≈ 0, or

    I + AF(α) ≈ 0,        A ≈ −F(α)^{−1}

The matrix in (2.10.7) was chosen in this way using

    A = −F(x_0)^{−1}        x_0 = (−.5, .25)

This suggests using a continual updating of A, say A = −F(x_n)^{−1}. The resulting
method is

    x_{n+1} = x_n − F(x_n)^{−1} f(x_n)        n ≥ 0                  (2.10.23)

We consider this method in the next section.
2.11 Newton's Method for Nonlinear Systems
As with Newton's method for a single equation, there is more than one way of
viewing and deriving the Newton method for solving a system of nonlinear
equations. We begin with an analytic derivation, and then we give a geometric
perspective.
Apply Taylor's theorem for functions of two variables to each of the equations
f_i(x_1, x_2) = 0, expanding f_i(α) about x_0: for i = 1, 2

    0 = f_i(α) = f_i(x_0) + (α_1 − x_{1,0}) ∂f_i(x_0)/∂x_1 + (α_2 − x_{2,0}) ∂f_i(x_0)/∂x_2
               + ½[ (α_1 − x_{1,0})² ∂²f_i(ξ^(i))/∂x_1² + 2(α_1 − x_{1,0})(α_2 − x_{2,0}) ∂²f_i(ξ^(i))/∂x_1∂x_2
                    + (α_2 − x_{2,0})² ∂²f_i(ξ^(i))/∂x_2² ]           (2.11.1)

with ξ^(i) on the line segment joining x_0 and α. If we drop the second-order
terms, we obtain the approximation

    0 ≈ f_i(x_0) + (α_1 − x_{1,0}) ∂f_i(x_0)/∂x_1 + (α_2 − x_{2,0}) ∂f_i(x_0)/∂x_2        i = 1, 2        (2.11.2)

In matrix form,

    0 ≈ f(x_0) + F(x_0)(α − x_0)                                     (2.11.3)

with F(x_0) the Jacobian matrix of f, given in (2.10.22). Solving for α,

    α ≈ x_0 − F(x_0)^{−1} f(x_0) ≡ x_1
The approximation x_1 should be an improvement on x_0, provided x_0 is chosen
sufficiently close to α. This leads to the iteration method first obtained at the end
of the last section,

    x_{n+1} = x_n − F(x_n)^{−1} f(x_n)        n ≥ 0                  (2.11.4)

This is Newton's method for solving the nonlinear system f(x) = 0.
In actual practice, we do not invert F(xn), particularly for systems of more
than two equations. Instead we solve a linear system for a correction term to x_n:

    F(x_n) δ_{n+1} = −f(x_n)        x_{n+1} = x_n + δ_{n+1}          (2.11.5)
This is more efficient in computation time, requiring only about one-third as
many operations as inverting F(xn). See Sections 8.1 and 8.2 for a discussion of
the numerical solution of linear systems of equations.
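A minimal Python/NumPy sketch of (2.11.4)-(2.11.5) is shown below, using the system (2.10.6) above as a test problem; the function names, tolerance, and iteration limit are illustrative choices.

    import numpy as np

    def newton_system(f, F, x0, tol=1e-12, max_iter=50):
        """Newton's method for f(x) = 0: solve F(x_n) delta = -f(x_n),
        then set x_{n+1} = x_n + delta.  f returns a vector, F its Jacobian."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            delta = np.linalg.solve(F(x), -f(x))
            x = x + delta
            if np.linalg.norm(delta, np.inf) <= tol:
                break
        return x

    f = lambda x: np.array([3*x[0]**2 + 4*x[1]**2 - 1, x[1]**3 - 8*x[0]**3 - 1])
    F = lambda x: np.array([[6*x[0], 8*x[1]], [-24*x[0]**2, 3*x[1]**2]])
    root = newton_system(f, F, [-0.5, 0.25])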
There is a geometrical derivation for Newton's method, in analogy with the
tangent line approximation used with single nonlinear equations in Section 2.2.
The graph in space of the equation

    z = P_i(x_1, x_2) ≡ f_i(x_0) + (x_1 − x_{1,0}) ∂f_i(x_0)/∂x_1 + (x_2 − x_{2,0}) ∂f_i(x_0)/∂x_2

is a plane that is tangent to the graph of z = f_i(x_1, x_2) at the point x_0, for
i = 1, 2. If x_0 is near α, then these tangent planes should be good approximations
to the associated surfaces of z = f_i(x_1, x_2), for x = (x_1, x_2) near α. Then the
intersection of the zero curves of the tangent planes z = P_i(x_1, x_2) should be a
good approximation to the corresponding intersection α of the zero curves of the
original surfaces z = f_i(x_1, x_2). This results in the statement (2.11.2). The inter-
section of the zero curves of z = P_i(x_1, x_2), i = 1, 2, is the point x_1.
Example  Consider the system

    f_1 = 4x_1² + x_2² − 4 = 0        f_2 = x_1 + x_2 − sin(x_1 − x_2) = 0

There are only two roots, one near (1, 0) and its reflection through the origin near
(−1, 0). Using (2.11.4) with x_0 = (1, 0), we obtain the results given in Table 2.20.
Table 2.20 Example of Newton's method
    n    x_{1,n}         x_{2,n}          f_1(x_n)      f_2(x_n)
0 1.0 0.0 0.0 1.59E- 1
1 1.0 -.1029207154 1.06E- 2 4.55E- 3
2 .9986087598 -.1055307239 1.46E- 5 6.63E- 7
3 .9986069441 - .1055304923 1.32E- 11 1.87E- 12
Convergence analysis  For the convergence analysis of Newton's method (2.11.4),
regard it as a fixed-point iteration method with

    g(x) = x − F(x)^{−1} f(x)                                        (2.11.6)

Also assume

    Determinant F(α) ≠ 0

which is the analogue of assuming α is a simple root when dealing with a single
equation, as in Theorem 2.1. It can then be shown that the Jacobian G(x) of
(2.11.6) is zero at x = α (see Problem 53); consequently, the condition (2.10.20) is
easily satisfied.

Corollary 2.10 then implies that x_n converges to α, provided x_0 is chosen
sufficiently close to α. In addition, it can be shown that the iteration is quadratic.
Specifically, the formulas (2.11.1) and (2.11.4) can be combined to obtain

    ||α − x_{n+1}||_∞ ≤ B ||α − x_n||_∞²        n ≥ 0                 (2.11.7)

for some constant B > 0.
Variations of Newton's method Newton's method has both advantages and
disadvantages when compared with other methods for solving systems of nonlin-
ear equations. Among its advantages, it is very simple in form and there is great
flexibility in using it on a large variety of problems. If we do not want to bother
supplying partial derivatives to be evaluated by a computer program, we can use
a difference approximation. For example, we commonly use

    ∂f_i(x)/∂x_j ≈ [f_i(x + εe_j) − f_i(x)] / ε                      (2.11.8)

with e_j the jth unit vector and ε some very small number. For a detailed discussion of the choice of ε, see
Dennis and Schnabel (1983, pp. 94-99).
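A forward-difference Jacobian of the form (2.11.8) can be coded in a few lines; the following Python sketch uses a fixed step eps, which is only a placeholder for a more careful choice.

    import numpy as np

    def jacobian_fd(f, x, eps=1e-7):
        """Forward-difference approximation (2.11.8) to the Jacobian of f at x."""
        x = np.asarray(x, dtype=float)
        fx = f(x)
        J = np.empty((len(fx), len(x)))
        for j in range(len(x)):
            xj = x.copy()
            xj[j] += eps
            J[:, j] = (f(xj) - fx) / eps    # [f(x + eps*e_j) - f(x)] / eps
        return J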
The first disadvantage of Newton's method is that there are other methods
which are (1) less expensive to use, and/or (2) easier to use for some special
classes of problems. For a system of m nonlinear equations in m unknowns, each
iterate for Newton's method requires m² + m function evaluations in general. In
addition, Newton's method requires the solution of a system of m linear
equations for each iterate, at a cost of about m³/3 arithmetic operations per linear
system. There are other methods that are as fast or almost as fast in their
mathematical speed of convergence, but that require fewer function evaluations
and arithmetic operations per iteration. These are often referred to as Newton-like,
quasi-Newton, and modified Newton methods. For a general presentation of many
of these methods, see Dennis and Schnabel (1983).
A simple modification of Newton's method is to fix the Jacobian matrix for
several steps, say k:

    x_{rk+j+1} = x_{rk+j} − F(x_{rk})^{−1} f(x_{rk+j})        j = 0, 1, ..., k − 1        (2.11.9)

for r = 0, 1, 2, .... This means the linear system in

    F(x_{rk}) δ_{rk+j+1} = −f(x_{rk+j})        j = 0, 1, ..., k − 1        (2.11.10)

can be solved much more efficiently than in the original Newton method (2.11.5).
The linear system, of order m, requires about m³/3 arithmetic operations for its
solution in the first case, when j = 0. But each subsequent case, j = 1, ..., k − 1,
will require only 2m² arithmetic operations for its solution. See Section 8.1 for
more complete details. The speed of convergence of (2.11.9) will be slower than
the original method (2.11.4), but the actual computation time of the modified
method will often be much less. For a more detailed examination of this
question, see Potra and Ptak (1984, p. 119).
A second problem with Newton's method, and with many other methods, is
that often x_0 must be reasonably close to α in order to obtain convergence.
There are modifications of Newton's method to force convergence for poor
choices of x_0. For example, define

    d_n = −F(x_n)^{−1} f(x_n)        x_{n+1} = x_n + s d_n            (2.11.11)

and choose s > 0 to minimize

    ||f(x_n + sd_n)||₂² = Σ_{j=1}^m [f_j(x_n + sd_n)]²               (2.11.12)

The choice s = 1 in (2.11.11) yields Newton's method, but it may not be the best
choice. In some cases, s may need to be much smaller than 1, at least initially, in
order to ensure convergence. For a more detailed discussion, see Dennis and
Schnabel (1983, chap. 6).
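A crude Python sketch of this damping idea follows; the backtracking rule (halve s until the sum of squares decreases) is only one simple way to pick s, not the strategy of any particular production code.

    import numpy as np

    def damped_newton(f, F, x0, tol=1e-10, max_iter=100):
        """Newton direction (2.11.11) with a simple backtracking choice of s
        based on decreasing the sum of squares (2.11.12).  A sketch only."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            fx = f(x)
            if np.linalg.norm(fx, np.inf) <= tol:
                break
            d = np.linalg.solve(F(x), -fx)          # Newton direction d_n
            s, phi0 = 1.0, np.sum(fx**2)
            while np.sum(f(x + s*d)**2) >= phi0 and s > 1e-10:
                s *= 0.5                            # backtrack: reduce s
            x = x + s*d
        return x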
For an analysis of some current programs for solving nonlinear systems, see
Hiebert (1982). He also discusses the difficulties in producing such software.
2.12 Unconstrained Optimization
Optimization refers to finding the maximum or minimum of a continuous
function f(x_1, ..., x_m). This is an extremely important problem, lying at the
heart of modern industrial engineering, management science, and other areas.
This section discusses some methods and perspectives for calculating the mini-
mum or maximum of a function f(x_1, ..., x_m). No formal algorithms are given,
since this would require too extensive a development.

Vector notation is used in much of the presentation, to give results for a
general number m of variables. We consider only the unconstrained optimization
problem, in which there are no limitations on (x_1, ..., x_m). For simplicity only,
we also assume f(x_1, ..., x_m) is defined for all (x_1, ..., x_m).
Because the behavior of a function f(x) can be quite varied, the problem must
be further limited. A point α is called a strict local minimum of f if f(x) > f(α)
for all x close to α, x ≠ α. We limit ourselves to finding a strict local minimum of
f(x). Generally an initial guess x_0 of α will be known, and f(x) will be assumed
to be twice continuously differentiable with respect to its variables x_1, ..., x_m.
Reformulation as a nonlinear system  With the assumption of differentiability, a
necessary condition for α to be a strict local minimum is that

    ∂f(α)/∂x_i = 0        i = 1, ..., m                              (2.12.1)

Thus the nonlinear system

    ∂f(x)/∂x_i = 0        i = 1, ..., m                              (2.12.2)

can be solved, and each calculated solution can be checked as to whether it is a
local maximum, minimum, or neither. For notation, introduce the gradient vector

    ∇f(x) = [∂f(x)/∂x_1, ..., ∂f(x)/∂x_m]^T

Using this vector, the system (2.12.2) is written more compactly as

    ∇f(x) = 0                                                        (2.12.3)

To solve (2.12.3), Newton's method (2.11.4) can be used, as well as other
rootfinding methods for nonlinear systems. Using Newton's method leads to

    x_{n+1} = x_n − H(x_n)^{−1} ∇f(x_n)        n ≥ 0                  (2.12.4)

with H(x) the Hessian matrix of f,

    H(x) = [ ∂²f(x)/∂x_i ∂x_j ]        1 ≤ i, j ≤ m
If α is a strict local minimum of f, then Taylor's theorem (1.1.12) can be used to
show that H(α) is a nonsingular matrix; then H(x) will be nonsingular for x
close to α. For convergence, the analysis of Newton's method in the preceding
section can be used to prove quadratic convergence of x_n to α provided x_0 is
chosen sufficiently close to α.
The main drawbacks with the iteration (2.12.4) are the same as those given in
the last section for Newton's method for solving nonlinear systems. There are
other, more efficient optimization methods that seek to approximate α by using
only f(x) and ∇f(x). These methods may require more iterations, but generally
their total computing time will be much less than with Newton's method. In
addition, these methods seek to obtain convergence for a larger set of initial
values x_0.
Descent methods  Suppose we are trying to minimize a function f(x). Most
methods for doing so are based on the following general two-step iteration
process.

STEP D1:  At x_n, pick a direction d_n such that f(x) will decrease as x moves
          away from x_n in the direction d_n.

STEP D2:  Let x_{n+1} = x_n + sd_n, with s chosen to minimize

              φ(s) = f(x_n + sd_n),        s ≥ 0                     (2.12.5)

          Usually s is chosen as the positive relative minimum of φ(s).

Such methods are called descent methods.
Descent methods are guaranteed to converge under more general conditions
than for Newton's method (2.12.4). Consider the level surface

    C = {x | f(x) = f(x_0)}

and consider only the connected portion of it, say C', that contains x_0. Then if
C' is bounded and contains α in its interior, descent methods will converge under
very general conditions. This is illustrated for the two-variable case in Figure 2.8.
Several level curves f(x_1, x_2) = c are shown for a set of values c approaching
f(α). The vectors d_n are directions in which f(x) is decreasing.

There are a number of ways for choosing the directions d_n, and the best
known are as follows.
1.  The method of steepest descent. Here d_n = −∇f(x_n). It is the direction in
    which f(x) decreases most rapidly when moving away from x_n. It is a good
    strategy near x_n, but it usually turns out to be a poor strategy for rapid
    convergence to α. (A minimal sketch of this method is given following this list.)

2.  Quasi-Newton methods. These methods can be viewed as approximations
    of Newton's method (2.12.4). They use easily computable approximations of
    H(x_n) or H(x_n)^{−1}, and they are also descent methods. The best known
    examples are the Davidon-Fletcher-Powell method and the Broyden
    methods.

3.  The conjugate gradient method. This uses a generalization of the idea of an
    orthogonal basis for a vector space to generate the directions d_n, with the
    directions related in an optimal way to the function f(x) being minimized.
    In Chapter 8, the conjugate gradient method is used for solving systems of
    linear equations.

Figure 2.8  Illustration of steepest descent method.
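The steepest descent sketch promised above follows (Python; the backtracking line search, the test function, and all names are illustrative choices standing in for a proper choice of s in step D2).

    import numpy as np

    def steepest_descent(f, grad, x0, s0=1.0, tol=1e-8, max_iter=500):
        """Steepest descent with a crude backtracking line search along
        d_n = -grad(x_n).  A sketch only; quasi-Newton or conjugate
        gradient methods are usually far more efficient."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            d = -grad(x)
            if np.linalg.norm(d, np.inf) <= tol:
                break
            s = s0
            while f(x + s*d) >= f(x) and s > 1e-14:
                s *= 0.5                    # backtrack until f decreases
            x = x + s*d
        return x

    # Example: minimize f(x, y) = (x - 1)^2 + 10(y - 2)^2
    f    = lambda x: (x[0] - 1)**2 + 10*(x[1] - 2)**2
    grad = lambda x: np.array([2*(x[0] - 1), 20*(x[1] - 2)])
    xmin = steepest_descent(f, grad, [0.0, 0.0])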
There are many other approaches to minimizing a function, but they are too
numerous to include here. As general references to the preceding ideas, see
Dennis and Schnabel (1983), Fletcher (1980), Gill et al. (1981), and Luenberger
(1984). An important and very different approach to minimizing a function is the
simplex method given in Nelder and Mead (1965), with a discussion given in Gill
et al. (1981, p. 94) and Woods (1985, chap. 2). This method uses only function
values (no derivative values), and it seems to be especially suitable for noisy
functions.

An important project to develop programs for solving optimization problems
and nonlinear systems is under way at the Argonne National Laboratory. The
program package is called MINPACK, and version 1 is available [see More et al.
(1980) and More et al. (1984)]. It contains routines for nonlinear systems and
nonlinear least squares problems. Future versions are intended to include pro-
grams for both unconstrained and constrained optimization problems.
Discussion of the Literature
There is a large literature on methods for calculating the roots of a single
equation. See the books by Householder (1970), Ostrowski (1973), and Traub
(1964) for a more extensive development than has been given here. Newton's
method is one of the most widely used methods, and its development is due to
many people. For an historical account of contributions to it by Newton,
Raphson, and Cauchy, see Goldstine (1977, pp. 64, 278).
For computer programs, most people still use and individually program a
method that is especially suitable for their particular application. However, one
should strongly consider using one of the general-purpose programs that have
been developed in recent years and that are available in the commercial software
libraries. They are usually accurate, efficient, and easy to use. Among such
general-purpose programs, the ones based on Brent (1973) and Dekker (1969) have
been most popular, and further developments of them continue to be made, as in
Le (1985). The IMSL and NAG computer libraries include these and other
excellent rootfinding programs.
Finding the roots of polynomials is an extremely old area, going back to at
least the ancient Greeks. There are many methods and a large literature for them,
and many new methods have been developed in the past 2 to 3 decades. As an
introduction to the area, see Dejon and Henrici (1969), Henrici (1974, chap. 6),
Householder (1970), Traub (1964), and their bibliographies. The article by
Wilkinson (1984) shows some of the practical difficulties of solving the poly-
nomial rootfinding problem on a computer. Accurate, efficient, automatic, and
reliable computer programs have been produced for finding the roots of poly-
nomials. Among such programs are (a) those of Jenkins (1975), Jenkins and
Traub (1970), (1972), and (b) the program ZERPOL of Smith (1967), based on
Laguerre's method [see Kahan (1967), Householder (1970, p. 176)]. These auto-
matic programs are much too sophisticated, both mathematically and algorithmi-
cally, to discuss in an introductory text such as this one. Nonetheless, they are
well worth using. Most people would not be able to write a program that would
be as competitive in both speed and accuracy. The latter is especially important,
since the polynomial rootfinding problem can be very sensitive to rounding
errors, as was shown in examples earlier in the chapter.
The study of numerical methods for solving systems of nonlinear equations
and optimization problems is currently a very popular area of research. For
introductions to numerical methods for solving nonlinear systems, see Baker and
Phillips (1981, pt. 1), Ortega and Rheinboldt (1970), and Rheinboldt (1974). For
generalizations of these methods to nonlinear differential and integral equations,
see Baker and Phillips (1981), Kantorovich (1948) [a classical paper in this area],
Kantorovich and Akilov (1964), and Rall (1969). For a survey of numerical
methods for optimization, see Dennis (1984) and Powell (1982). General intro-
ductions are given in Dennis and Schnabel (1983), Fletcher (1980), (1981), Gill
et al. (1981), and Luenberger (1984). As an example of recent research in
optimization theory and in the development of software, see Boggs et al. (1985).
For computer programs, see Hiebert (1982) and More et al. (1984).
Bibliography
Baker, C., and C. Phillips, eds. (1981). The Numerical Solution of Nonlinear
Problems. Clarendon Press, Oxford, England.
Boggs, P., R. Byrd, and R. Schnabel, eds. (1985). Numerical Optimization 1984.
Society for Industrial and Applied Mathematics, Philadelphia.
Brent, R. (1973). Algorithms for Minimization Without Derivatives. Prentice-Hall,
Englewood Cliffs, N.J.
Byrne, G., and C. Hall, eds. (1973). Numerical Solution of Systems of Nonlinear
Algebraic Equations. Academic Press, New York.
Dejon, B., and P. Henrici, eds. (1969). Constructive Aspects of the Fundamental
Theorem of Algebra. Wiley, New York.
Dekker, T. (1969). Finding a zero by means of successive linear interpolation. In
B. Dejon and P. Henrici (eds.), Constructive Aspects of the Fundamental
Theorem of Algebra, pp. 37-51. Wiley, New York.
Dennis, J. (1984). A user's guide to nonlinear optimization algorithms. Proc.
IEEE, 12, 1765-1776.
Dennis, J., and R. Schnabel (1983). Numerical Methods for Unconstrained Optimi-
zation and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, N.J.
Fletcher, R. (1980). Practical Methods of Optimization, Vol. 1, Unconstrained
Optimization. Wiley, New York.
Fletcher, R. (1981). Practical Methods of Optimization, Vol. 2, Constrained
Optimization. Wiley, New York.
Forsythe, G. (1969). What is a satisfactory quadratic equation solver? In B.
Dejon and P. Henrici (eds.), Constructive Aspects of the Fundamental
Theorem of Algebra, pp. 53-61. Wiley, New York.
Gill, P., W. Murray, and M. Wright (1981). Practical Optimization. Academic
Press, New York.
Goldstine, H. (1977). A History of Numerical Analysis. Springer-Verlag, New
York.
Henrici, P. (1974). Applied and Computational Complex Analysis, Vol. 1. Wiley,
New York.
Hiebert, K. (1982). An evaluation of mathematical software that solves systems of
nonlinear equations. ACM Trans. Math. Softw., 11, 250-262.
Householder, A. (1970). The Numerical Treatment of a Single Nonlinear Equation.
McGraw-Hill, New York.
Jenkins, M. (1975). Algorithm 493: Zeroes of a real polynomial. ACM Trans.
Math. Softw., 1, 178-189.
Jenkins, M., and J. Traub (1970). A three-stage algorithm for real polynomials
using quadratic iteration. SIAM J. Numer. Anal., 7, 545-566.
Jenkins, M., and J. Traub (1972). Algorithm 419-Zeros of a complex poly-
nomial. Commun. ACM, 15, 97-99.
Kahan, W. (1967). Laguerre's method and a circle which contains at least one
zero of a polynomial. SIAM J. Numer. Anal., 4, 474-482.
Kantorovich, L. (1948). Functional analysis and applied mathematics. Usp. Mat.
Nauk, 3, 89-185.
Kantorovich, L., and G. Akilov (1964). Functional Analysis in Normed Spaces.
Pergamon, London.
Le, D. (1985). An efficient derivative-free method for solving nonlinear equations.
ACM Trans. Math. Softw., 11, 250-262.
Luenberger, D. (1984). Linear and Nonlinear Programming, 2nd ed. Wiley, New
York.
More, J., B. Garbow, and K. Hillstrom (1980). User Guide for MINPACK-1.
Argonne Nat. Lab. Rep. ANL-80-74.
More, J., and D. Sorenson. Newton's method. In Studies in Numerical Analysis,
G. Golub (ed.), pp. 29-82. Math. Assoc. America, Washington, D.C.
More, J., D. Sorenson, B. Garbow, and K. Hillstrom (1984). The MINPACK
project. In Sources and Development of Mathematical Software, Cowell (ed.),
pp. 88-111. Prentice-Hall, Englewood Cliffs, N.J.
Nelder, J., and R. Mead (1965). A simplex method for function minimization.
Comput. J., 7, 308-313.
Ortega, J., and W. Rheinboldt (1970). Iterative Solution of Nonlinear Equations in
Several Variables. Academic Press, New York.
Ostrowski, A. (1973). Solution of Equations in Euclidean and Banach Spaces.
Academic Press, New York.
Peters, G., and J. Wilkinson (1971). Practical problems arising in the solution of
polynomial equations. J. Inst. Math. Its Appl. 8, 16-35.
Potra, F., and V. Ptak (1984). Nondiscrete Induction and Iterative Processes.
Pitman, Boston.
Powell, M., ed. (1982). Nonlinear Optimization 1981. NATO Conf. Ser. Academic
Press, New York.
Rall, L. (1969). Computational Solution of Nonlinear Operator Equations. Wiley,
New York.
Rheinboldt, W. (1974). Methods for Solving Systems of Nonlinear Equations.
Society for Industrial and Applied Mathematics, Philadelphia.
Smith, B. (1967). ZERPOL: A zero finding algorithm for polynomials using
Laguerre's method. Dept. of Computer Science, Univ. Toronto, Toronto,
Ont., Canada.
Traub, J. (1964). Iterative Methods for the Solution of Equations. Prentice-Hall,
Englewood Cliffs, N.J.
Whitley, V. (1968). Certification of algorithm 196: Muller's method for finding
roots of an arbitrary function. Commun. ACM 11, 12-14.
Wilkinson, J. (1963). Rounding Errors in Algebraic Processes. Prentice-Hall,
Englewood Cliffs, N.J.
Wilkinson, J. (1984). The perfidious polynomial. In Studies in Numerical Analysis,
G. Golub (ed.). Math. Assoc. America, Washington, D.C.
Woods, D. (1985). An interactive approach for solving multi-objective optimiza-
tion problems. Ph.D. dissertation, William Marsh Rice Univ., Houston,
Tex.
Problems
1. The introductory example for f(x) = a - (1/x) is related to the infinite
   product

       \prod_{j=0}^{\infty} (1 + r^{2^j}) = \lim_{n \to \infty} \left[ (1 + r)(1 + r^2)(1 + r^4) \cdots (1 + r^{2^n}) \right]

   By using formulas (2.0.6) and (2.0.9), we can calculate the value of the
   infinite product. What is this value, and what condition on r is required for
   the infinite product to converge? Hint: Let r = r_0, and write x_n in terms of
   x_0 and r_0.
2. Write a program implementing the algorithm Bisect given in Section 2.1.
   Use the program to calculate the real roots of the following equations. Use
   an error tolerance of \epsilon = 10^{-5}.

   (d) x = 1 + .3 \cos(x)
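A few lines of code suffice to experiment with bisection on equation (d). The sketch below is only an illustration of the idea (it is not the book's Bisect routine); the function name and bracket [0, 2] are choices made here for the example.

    # Minimal bisection sketch for Problem 2; an illustration, not the book's routine.
    import math

    def bisect(f, a, b, eps):
        """Halve [a, b] until its length is below eps; f(a) and f(b) must differ in sign."""
        fa = f(a)
        if fa * f(b) > 0:
            raise ValueError("f(a) and f(b) must have opposite signs")
        while (b - a) / 2.0 > eps:
            c = (a + b) / 2.0
            fc = f(c)
            if fa * fc <= 0:
                b = c           # root lies in [a, c]
            else:
                a, fa = c, fc   # root lies in [c, b]
        return (a + b) / 2.0

    # Equation (d): x = 1 + .3 cos(x), rewritten as f(x) = x - 1 - .3 cos(x) = 0.
    print(bisect(lambda x: x - 1.0 - 0.3 * math.cos(x), 0.0, 2.0, 1.0e-5))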
3. Use the program from Problem 2 to calculate (a) the smallest positive root
   of x - \tan(x) = 0, and (b) the root of this equation that is closest to
   x = 100.
4. Implement the algorithm Newton given in Section 2.2. Use it to solve the
equations in Problem 2.
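A simple driver for Newton's method can be written along the following lines; this is a hedged sketch, not the book's algorithm Newton, and the stopping test on the correction size is an assumption made here.

    # A bare-bones Newton iteration for Problem 4.
    import math

    def newton(f, df, x0, eps=1.0e-5, itmax=50):
        """Iterate x <- x - f(x)/f'(x) until successive corrections fall below eps."""
        x = x0
        for _ in range(itmax):
            dx = f(x) / df(x)
            x -= dx
            if abs(dx) < eps:
                return x
        raise RuntimeError("Newton iteration did not converge")

    # Equation (d) of Problem 2: f(x) = x - 1 - .3 cos(x).
    print(newton(lambda x: x - 1.0 - 0.3 * math.cos(x),
                 lambda x: 1.0 + 0.3 * math.sin(x), x0=1.0))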
5. Use Newton's method to calculate the roots requested in Problem 3.
Attempt to explain the differences in finding the roots of parts (a) and (b).
6. Use Newton's method to calculate the unique root of

       x + e^{-Bx^2} \cos(x) = 0

   with B > 0 a parameter to be set. Use a variety of increasing values of B,
   for example, B = 1, 5, 10, 25, 50. Among the choices of x_0 used, choose
   x_0 = 0 and explain any anomalous behavior. Theoretically, the Newton
   method will converge for any value of x_0 and B. Compare this with actual
   computations for larger values of B.
7. An interesting polynomial rootfinding problem occurs in the computation
   of annuities. An amount of P_1 dollars is put into an account at the
   beginning of years 1, 2, \ldots, N_1. It is compounded annually at a rate of r
   (e.g., r = .05 means a 5 percent rate of interest). At the beginning of years
   N_1 + 1, \ldots, N_1 + N_2, a payment of P_2 dollars is removed from the account.
   After the last payment, the account is exactly zero. The relationship of the
   variables is

   If N_1 = 30, N_2 = 20, P_1 = 2000, P_2 = 8000, then what is r? Use a
   rootfinding method of your choice.
8. Use the Newton-Fourier method to solve the equations in Problems 2
and 6.
9. Use the secant method to solve the equations given in Problem 2.
10. Use the secant method to solve the equation of Problem 6.
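For Problems 9 and 10, the secant iteration can be prototyped as follows; this is a sketch under the usual two-point formula, with the stopping rule chosen here for the illustration.

    # Secant iteration sketch.
    import math

    def secant(f, x0, x1, eps=1.0e-5, itmax=50):
        f0, f1 = f(x0), f(x1)
        for _ in range(itmax):
            x2 = x1 - f1 * (x1 - x0) / (f1 - f0)   # line through the last two points
            if abs(x2 - x1) < eps:
                return x2
            x0, f0, x1, f1 = x1, f1, x2, f(x2)
        raise RuntimeError("secant iteration did not converge")

    print(secant(lambda x: x - 1.0 - 0.3 * math.cos(x), 0.0, 2.0))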
11. Show the error formula (2.3.2) for the secant method,

        \alpha - c = -(\alpha - a)(\alpha - b) \frac{f[a, b, \alpha]}{f[a, b]}
12. Consider Newton's method for finding the positive square root of a > 0.
    Derive the following results, assuming x_0 > 0, x_0 \ne \sqrt{a}.

    (a) x_{n+1} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right), n \ge 0, and thus x_n > \sqrt{a} for all n > 0.

    (c) The iterates \{x_n\} are a strictly decreasing sequence for n \ge 1. Hint:
        Consider the sign of x_{n+1} - x_n.

    (d)     with Rel(x_n) the relative error in x_n.

    (e) If x_0 \ge \sqrt{a} and |Rel(x_0)| \le 0.1, bound Rel(x_4).
13. Newton's method is the commonly used method for calculating square
    roots on a computer. To use Newton's method to calculate \sqrt{a}, an initial
    guess x_0 must be chosen, and it would be most convenient to use a fixed
    number of iterates rather than having to test for convergence. For definite-
    ness, suppose that the computer arithmetic is binary and that the mantissa
    contains 48 binary bits. Write

        a = a_0 \cdot 2^e

    This can be easily modified to the form

        a = b \cdot 2^f        \tfrac{1}{4} \le b < 1

    with f an even integer. Then

        \sqrt{a} = \sqrt{b} \cdot 2^{f/2}

    and the number \sqrt{a} will be in standard floating-point form, once \sqrt{b} is
    known.

    This reduces the problem to that of calculating \sqrt{b} for \tfrac{1}{4} \le b < 1. Use
    the linear interpolating formula

        x_0 = \tfrac{1}{3}(2b + 1)

    as an initial guess for the Newton iteration for calculating \sqrt{b}. Bound the
    error \sqrt{b} - x_0. Estimate how many iterates are necessary in order that

    which is the limit of machine precision for b on a particular computer.
    [Note that the effect of rounding errors is being ignored.] How might the
    choice of x_0 be improved?
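The required number of iterates can be estimated empirically. The sketch below (an illustration only; the value of b and the printing format are choices made here) starts from the linear interpolant x_0 = (2b + 1)/3 and watches the error fall below roughly 2^{-48}.

    # Experiment for Problem 13: Newton's iteration for sqrt(b) on 1/4 <= b < 1.
    import math

    def sqrt_newton(b, steps):
        x = (2.0 * b + 1.0) / 3.0      # initial guess from linear interpolation
        for _ in range(steps):
            x = 0.5 * (x + b / x)      # Newton step for x^2 - b = 0
        return x

    b = 0.3
    for steps in range(5):
        err = abs(sqrt_newton(b, steps) - math.sqrt(b))
        print(steps, err)              # the error is roughly squared at every step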
14. Numerically calculate the Newton iterates for solving x^2 - 1 = 0, and use
    x_0 = 100{,}000. Identify and explain the resulting speed of convergence.
15. (a) Apply Newton's method to the function

            f(x) = \begin{cases} \sqrt{x}, & x \ge 0 \\ -\sqrt{-x}, & x < 0 \end{cases}

        with the root \alpha = 0. What is the behavior of the iterates? Do they
        converge, and if so, at what rate?

    (b) Do the same as in (a), but with

            f(x) = \ldots        x \ge 0,\ x < 0
16. A sequence \{x_n\} is said to converge superlinearly to \alpha if

        |\alpha - x_{n+1}| \le c_n |\alpha - x_n|        n \ge 0

    with c_n \to 0 as n \to \infty. Show that in this case,

        \lim_{n \to \infty} \frac{|\alpha - x_n|}{|x_{n+1} - x_n|} = 1

    Thus |\alpha - x_n| \approx |x_{n+1} - x_n| is increasingly valid as n \to \infty.
17. Newton's method for finding a root \alpha of f(x) = 0 sometimes requires the
    initial guess x_0 to be quite close to \alpha in order to obtain convergence. Verify
    that this is the case for the root \alpha = \pi/2 of

        f(x) = \cos(x) + \sin^2(50x)

    Give a rough estimate of how small |x_0 - \alpha| should be in order to obtain
    convergence to \alpha. Hint: Consider (2.2.6).
18. Write a program to implement Muller's method. Apply it to the rootfinding
problems in Problems 2, 3, and 6.
19. Show that x = 1 + \tan^{-1}(x) has a solution \alpha. Find an interval [a, b]
    containing \alpha such that for every x_0 \in [a, b], the iteration

        x_{n+1} = 1 + \tan^{-1}(x_n)        n \ge 0

    will converge to \alpha. Calculate the first few iterates and estimate the rate of
    convergence.
20. Do the same as in Problem 19, but with the iteration
        n \ge 0
21. To find a root for f(x) = 0 by iteration, rewrite the equation as

        x = x + cf(x) = g(x)

    for some constant c \ne 0. If \alpha is a root of f(x) and if f'(\alpha) \ne 0, how should
    c be chosen in order that the sequence x_{n+1} = g(x_n) converges to \alpha?
22. Consider the equation

        x = d + hf(x)

    with d a given constant and f(x) continuous for all x. For h = 0, a root is
    \alpha = d. Show that for all sufficiently small h, this equation has a root \alpha(h).
    What condition is needed, if any, in order to ensure the uniqueness of the
    root \alpha(h) in some interval about d?
23. The iteration x_{n+1} = 2 - (1 + c)x_n + cx_n^2 will converge to \alpha = 1 for some
    values of c [provided x_0 is chosen sufficiently close to \alpha]. Find the values of
    c for which this is true. For what value of c will the convergence be
    quadratic?
24. Which of the following iterations will converge to the indicated fixed point
    \alpha (provided x_0 is sufficiently close to \alpha)? If it does converge, give the order
    of convergence; for linear convergence, give the rate of linear convergence.

    (a) x_{n+1} = -16 + 6x_n + \dfrac{12}{x_n}        \alpha = 2

    (b)

    (c) x_{n+1} = \dfrac{12}{1 + x_n}        \alpha = 3
25. Show that

        x_{n+1} = \frac{x_n(x_n^2 + 3a)}{3x_n^2 + a}        n \ge 0

    is a third-order method for computing \sqrt{a}. Calculate

    assuming x_0 has been chosen sufficiently close to \alpha.
26. Using Theorem 2.8, show that formula (2.4.11) is a cubically convergent
iteration method.
27. Define an iteration formula by

        x_{n+1} = x_n - \frac{f(x_n) + f\left(x_n - f(x_n)/f'(x_n)\right)}{f'(x_n)}        n \ge 0

    Show that the order of convergence of \{x_n\} to \alpha is at least 3. Hint: Use
    Theorem 2.8, and let

        g(x) = h(x) - \frac{f(h(x))}{f'(x)}        h(x) = x - \frac{f(x)}{f'(x)}
28. There is another modification of Newton's method, similar to the secant
    method, but using a different approximation to the derivative f'(x_n).
    Define

        x_{n+1} = x_n - \frac{f(x_n)^2}{f(x_n + f(x_n)) - f(x_n)}        n \ge 0

    This one-point method is called Steffensen's method. Assuming f'(\alpha) \ne 0,
    show that this is a second-order method. Hint: Write the iteration as
    x_{n+1} = g(x_n). Use f(x) = (x - \alpha)h(x) with h(\alpha) \ne 0, and then compute
    the formula for g(x) in terms of h(x). Having done so, apply Theorem 2.8.
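The quadratic convergence asserted in Problem 28 is easy to observe numerically. The following sketch uses the difference-quotient form of the iteration given above; the test equation and starting point are choices made here for the illustration.

    # Steffensen iteration sketch: Newton's derivative is replaced by
    # the difference quotient [f(x + f(x)) - f(x)] / f(x).
    def steffensen(f, x0, eps=1.0e-10, itmax=50):
        x = x0
        for _ in range(itmax):
            fx = f(x)
            denom = f(x + fx) - fx
            if denom == 0.0:
                return x
            xnew = x - fx * fx / denom
            if abs(xnew - x) < eps:
                return xnew
            x = xnew
        raise RuntimeError("no convergence")

    # Example: the root alpha = 1 of x**2 - 1 = 0, started near the root.
    print(steffensen(lambda x: x * x - 1.0, 1.5))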
29. Given below is a table of iterates from a linearly convergent iteration
    x_{n+1} = g(x_n). Estimate (a) the rate of linear convergence, (b) the fixed
    point \alpha, and (c) the error \alpha - x_5.
n xn
0 1.0949242
1 1.2092751
2 1.2807917
3 1.3254943
4 1.3534339
5 1.3708962
30. The algorithm Aitken, given in Section 2.6, can be shown to be second
    order in its speed of convergence. Let the original iteration be x_{n+1} = g(x_n),
    n \ge 0. The formula (2.6.8) can be rewritten in the equivalent form

        \alpha \approx \hat{x}_{n-2} = x_{n-2} + \frac{(x_{n-1} - x_{n-2})^2}{(x_{n-1} - x_{n-2}) - (x_n - x_{n-1})}        n \ge 2

    To examine the speed of convergence of the Aitken extrapolates, we
    consider the associated sequence

        n \ge 0

    The values z_n are the successive values of x_n produced in the algorithm
    Aitken.

    For g'(\alpha) \ne 0 or 1, show that z_n converges to \alpha quadratically. This is
    true even if |g'(\alpha)| > 1 and the original iteration is divergent. Hint: Do
    not attempt to use Theorem 2.8 directly, as it will be too complicated.
    Instead write

        g(x) = \alpha + (x - \alpha)h(x)        h(\alpha) = g'(\alpha) \ne 0

    Use this to show that

    for some function H(x) bounded about x = \alpha.
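The Aitken acceleration of Problems 30 and 31 can be tried out with a few lines of code. The sketch below restarts the basic iteration from each extrapolate, as described for the algorithm Aitken above; the choice g(x) = 1 + arctan(x) from Problem 19 is made here only for the illustration.

    # Aitken extrapolation sketch.
    import math

    def aitken(xm2, xm1, xm0):
        """One Aitken extrapolate from three consecutive iterates."""
        d1 = xm1 - xm2
        d2 = xm0 - xm1
        return xm2 + d1 * d1 / (d1 - d2)

    g = lambda x: 1.0 + math.atan(x)
    x0 = 1.0
    x1, x2 = g(x0), g(g(x0))
    for _ in range(5):
        z = aitken(x0, x1, x2)          # accelerated value
        print(x2, z)
        x0, x1, x2 = z, g(z), g(g(z))   # restart the basic iteration from the extrapolate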
31. Consider the sequence

        x_n = \alpha + \beta\rho^n + \gamma\rho^{2n}        n \ge 0,\ |\rho| < 1

    with \beta, \gamma \ne 0, which converges to \alpha with a linear rate of \rho. Let \hat{x}_{n-2} be the
    Aitken extrapolate:

        \hat{x}_{n-2} = x_{n-2} + \frac{(x_{n-1} - x_{n-2})^2}{(x_{n-1} - x_{n-2}) - (x_n - x_{n-1})}        n \ge 2

    Show that

        \hat{x}_{n-2} = \alpha + a\rho^{2n} + b\rho^{4n} + c_n\rho^{6n}

    where c_n is bounded as n \to \infty. Derive expressions for a and b. The
    sequence \{\hat{x}_n\} converges to \alpha with a linear rate of \rho^2.
32. Let f(x) have a multiple root \alpha, say of multiplicity m > 1. Show that

        K(x) = \frac{f(x)}{f'(x)}

    has \alpha as a simple root. Why does this not help with the fundamental
    difficulty in numerically calculating multiple roots, namely that of the large
    interval of uncertainty in \alpha?
33. Use Newton's method to calculate the real roots of the following polynomi-
    als as accurately as possible. Estimate the multiplicity of each root, and
    then if necessary, try an alternative way of improving your calculated
    values.

    (a) x^4 - 3.2x^3 + .96x^2 + 4.608x - 3.456

    (b) x^5 + .9x^4 - 1.62x^3 - 1.458x^2 + .6561x + .59049
34. Use the program from Problem 2 to solve the following equations for the
    root \alpha = 1. Use the initial interval [0, 3], and in all cases use \epsilon = 10^{-5} as
    the stopping tolerance. Compare the results with those obtained in Section
    2.8 using Brent's algorithm.

    (i)   (x - 1)[1 + (x - 1)^2] = 0
    (ii)  x^2 - 1 = 0
    (iii) -1 + x(3 + x(-3 + x)) = 0
    (iv)  (x - 1)\exp(-1/(x - 1)^2) = 0
35. Prove the lower bound in (2.9.6), using the upper bound in (2.9.6) and the
suggestion in (2.9.7).
36. Let p(x) be a polynomial of degree n. Let its distinct roots be denoted by
    \alpha_1, \ldots, \alpha_r, of respective multiplicities m_1, \ldots, m_r.

    (a) Show that

            \frac{p'(x)}{p(x)} = \sum_{j=1}^{r} \frac{m_j}{x - \alpha_j}

    (b) Let c be a number for which p'(c) \ne 0. Show there exists a root \alpha of
        p(x) satisfying

            |\alpha - c| \le n \left| \frac{p(c)}{p'(c)} \right|
37. For the polynomial

    define

    Show that every root x of p(x) = 0 satisfies

        |x| \le \operatorname{Max}\{R, \sqrt{R}\}
38. Write a computer program to evaluate the following polynomials p(x) for
    the given values of x and to evaluate the noise in the values p(x). For each
    x, evaluate p(x) in both single and double precision arithmetic; use their
    difference as the noise in the single precision value, due to rounding errors
    in the evaluation of p(x). Use both the ordinary formula (2.9.1) and
    Horner's rule (2.9.8) to evaluate each polynomial; this should show that the
    noise is different in the two cases.

    (a) p(x) = x^4 - 5.7x^3 - .47x^2 + 29.865x - 26.1602

        with steps of h = 0.1 for x.

    (b) p(x) = x^4 - 5.4x^3 + 10.56x^2 - 8.954x + 2.7951        1 \le x \le 1.2

        in steps of h = .001 for x.

    Note: Smaller or larger values of h may be appropriate on different
    computers. Also, before using double precision, enter the coefficients
    in single precision, for a more valid comparison.
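One way to carry out the experiment of Problem 38 without access to a Fortran single-precision compiler is to simulate single precision with numpy's float32 type, as in the sketch below. The sample points and output format are choices made here, not part of the problem statement.

    # Compare the power form (2.9.1) with Horner's rule (2.9.8) for polynomial (b).
    import numpy as np

    coeff = [1.0, -5.4, 10.56, -8.954, 2.7951]    # highest power first

    def power_form(c, x):
        n = len(c) - 1
        return sum(ci * x ** (n - i) for i, ci in enumerate(c))

    def horner(c, x):
        p = c[0]
        for ci in c[1:]:
            p = p * x + ci
        return p

    c32 = [np.float32(ci) for ci in coeff]        # coefficients entered in single precision
    for x in [1.0, 1.05, 1.10, 1.15, 1.20]:
        x32 = np.float32(x)
        noise_power = float(power_form(c32, x32)) - power_form(coeff, x)
        noise_horner = float(horner(c32, x32)) - horner(coeff, x)
        print(f"{x:5.2f}  {noise_power: .3e}  {noise_horner: .3e}")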
39. Use complex arithmetic and Newton's method to calculate a complex root
    of

        p(z) = z^4 - 3z^3 + 20z^2 + 44z + 54

    located near to z_0 = 2.5 + 4.5i.
40. Write a program to find the roots of the following polynomials as accu-
    rately as possible.

    (a) 676039x^{12} - 1939938x^{10} + 2078505x^8 - 1021020x^6 + 225225x^4
        - 18018x^2 + 231

    (b) x^4 - 4.096152422706631x^3 + 3.284232335022705x^2
        + 4.703847577293368x - 5.715767664977294
41. Use a package rootfinding program for polynomials to find the roots of the
polynomials in Problems 38, 39, and 40.
42. For the example f(x) = (x - 1)(x - 2) \cdots (x - 7), (2.9.21) in Section 2.9,
    consider perturbing the coefficient of x^i by \epsilon_i x^i, in which \epsilon_i is chosen so
    that the relative perturbation in the coefficient of x^i is the same as that of
    the example in the text in the coefficient of x^6. What does the linearized
    theory (2.9.19) predict for the perturbations in the roots? A change in which
    coefficient will lead to the greatest changes in the roots?
43. The polynomial f(x) = x^5 - 300x^2 - 126x + 5005 has \alpha = 5 as a root.
    Estimate the effect on \alpha of changing the coefficient of x^5 from 1 to 1 + \epsilon.
44. The stability result (2.9.19) for polynomial roots can be generalized to
    general functions. Let \alpha be a simple root of f(x) = 0, and let f(x) and
    g(x) be continuously differentiable about \alpha. Define F_\epsilon(x) = f(x) + \epsilon g(x).
    Let \alpha(\epsilon) denote a root of F_\epsilon(x) = 0, corresponding to \alpha = \alpha(0) for small \epsilon.
    To see that there exists such an \alpha(\epsilon), and to prove that it is continuously
    differentiable, use the implicit function theorem for functions of one vari-
    able. From this, generalize (2.9.19) to the present situation.
45. Using the stability result in Problem 44, estimate the root \alpha(\epsilon) of

        x - \tan(x) + \epsilon = 0

    Consider two cases explicitly for roots \alpha of x - \tan(x) = 0:
    (1) \alpha \in (.5\pi, 1.5\pi), (2) \alpha \in (31.5\pi, 32.5\pi).
46. Consider the system

        x = \frac{.5}{1 + (x + y)^2}        y = \frac{.5}{1 + (x - y)^2}

    Find a bounded region D for which the hypotheses of Theorem 2.9 are
    satisfied. Hint: What will be the sign of the components of the root \alpha?
    Also, what are the maximum possible values for x and y in the preceding
    formulas?
47. Consider the system

        y = .5 + h \tan^{-1}(x^2 + y^2)

    Show that if h is chosen sufficiently small, then this system has a unique
    solution \alpha within some rectangular region. Moreover, show that simple
    iteration of the form (2.10.4) will converge to this solution.
48. Prove Corollary 2.10. Hint: Use the continuity of the partial derivatives of
the components of g(x).
49. Prove that the iterates \{x_n\} in Theorem 2.9 will converge to a solution of
    x = g(x). Hint: Consider the infinite sum

        x_0 + \sum_{n=0}^{\infty} [x_{n+1} - x_n]

    Its partial sums are

        x_0 + \sum_{n=0}^{N-1} [x_{n+1} - x_n] = x_N

    Thus if the infinite series converges, say to \alpha, then x_N converges to \alpha. Show
    the infinite series converges absolutely by showing and using the result

        \|x_{n+1} - x_n\|_\infty \le \lambda \|x_n - x_{n-1}\|_\infty

    Following this, show that \alpha is a fixed point of g(x).
50. Using Newton's method for nonlinear systems, solve the nonlinear system

    The true solutions are easily determined to be ( fi3, 113 ). As an
    initial guess, use (x_0, y_0) = (1.6, 1.2).
51. Solve the system

    using Newton's method for nonlinear systems. Use each of the initial
    guesses (x_0, y_0) = (1.2, 2.5), (-2, 2.5), (-1.2, -2.5), (2, -2.5). Observe
    which root the method converges to, the number of iterates required,
    and the speed of convergence.
52. Using Newton's method for nonlinear systems, solve for all roots of the
    following nonlinear systems. Use graphs to estimate initial guesses.

    (a) x^2 + y^2 - 2x - 2y + 1 = 0        x + y - 2xy = 0

    (b) x^2 + 2xy + y^2 - x + y - 4 = 0
        5x^2 - 6xy + 5y^2 + 16x - 16y + 12 = 0
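For Problems 50-52, one Newton step for a system solves J(x) dx = -f(x) and updates x. The sketch below applies this to system (a) of Problem 52; the initial guess (0.7, 1.9) and the use of numpy for the linear solve are choices made here for the illustration.

    # Newton's method for the 2x2 system (a) of Problem 52.
    import numpy as np

    def f(v):
        x, y = v
        return np.array([x * x + y * y - 2.0 * x - 2.0 * y + 1.0,
                         x + y - 2.0 * x * y])

    def jac(v):
        x, y = v
        return np.array([[2.0 * x - 2.0, 2.0 * y - 2.0],
                         [1.0 - 2.0 * y, 1.0 - 2.0 * x]])

    v = np.array([0.7, 1.9])                  # guess picked from a rough graph
    for _ in range(10):
        dv = np.linalg.solve(jac(v), -f(v))   # Newton correction
        v = v + dv
        if np.linalg.norm(dv) < 1.0e-12:
            break
    print(v, f(v))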
53. Prove that the Jacobian of

        g(x) = x - F(x)^{-1} f(x)

    is zero at any root \alpha of f(x) = 0, provided F(\alpha) is nonsingular. Combined
    with Corollary 2.10 of Section 2.10, this will prove the convergence of
    Newton's method.
54. Use Newton's method (2.12.4) to find the minimum value of the function

    Experiment with various initial guesses and observe the pattern of conver-
    gence.
THREE

INTERPOLATION THEORY
The concept of interpolation is the selection of a function p(x) from a given
class of functions in such a way that the graph of y = p(x) passes through a
finite set of given data points. In most of this chapter we limit the interpolating
function p(x) to being a polynomial.
Polynomial interpolation theory has a number of important uses. In this text,
its primary use is to furnish some mathematical tools that are used in developing
methods in the areas of approximation theory, numerical integration, and the
numerical solution of differential equations. A second use is in developing means
for working with functions that are stored in tabular form. For example, almost
everyone is familiar from high school algebra with linear interpolation in a table
of logarithms. We derive computationally convenient forms for polynomial
interpolation with tabular data and analyze the resulting error. It is recognized
that with the widespread use of calculators and computers, there is far less use
for table interpolation than in the recent past. We have included it because the
resulting formulas are still useful in other connections and because table interpo-
lation provides us with convenient examples and exercises.
The chapter concludes with introductions to two other topics. These are (1)
piecewise polynomial interpolating functions, spline functions in particular; and
(2) interpolation with trigonometric functions.
3.1 Polynomial Interpolation Theory
Let x_0, x_1, \ldots, x_n be distinct real or complex numbers, and let y_0, y_1, \ldots, y_n be
associated function values. We now study the problem of finding a polynomial
p(x) that interpolates the given data:

    p(x_i) = y_i        i = 0, 1, \ldots, n        (3.1.1)

Does such a polynomial exist, and if so, what is its degree? Is it unique? What is a
formula for producing p(x) from the given data?
By writing

    p(x) = a_0 + a_1 x + \cdots + a_m x^m

for a general polynomial of degree m, we see there are m + 1 independent
parameters a_0, a_1, \ldots, a_m. Since (3.1.1) imposes n + 1 conditions on p(x), it is
reasonable to first consider the case when m = n. Then we want to find
a_0, a_1, \ldots, a_n such that

    a_0 + a_1 x_i + a_2 x_i^2 + \cdots + a_n x_i^n = y_i        i = 0, 1, \ldots, n        (3.1.2)

This is a system of n + 1 linear equations in n + 1 unknowns, and solving it is
completely equivalent to solving the polynomial interpolation problem. In vector
and matrix notation, the system is

    Xa = y

with

    X = [x_i^j]        i, j = 0, 1, \ldots, n        (3.1.3)

The matrix X is called a Vandermonde matrix.
Theorem 3.1    Given n + 1 distinct points x_0, \ldots, x_n and n + 1 ordinates
               y_0, \ldots, y_n, there is a polynomial p(x) of degree \le n that inter-
               polates y_i at x_i, i = 0, 1, \ldots, n. This polynomial p(x) is unique
               among the set of all polynomials of degree at most n.
Proof    Three proofs of this important result are given. Each will furnish some
         needed information and has important uses in other interpolation prob-
         lems.

    (i)  It can be shown that for the matrix X in (3.1.3),

             \det(X) = \prod_{0 \le j < i \le n} (x_i - x_j)        (3.1.4)

         (see Problem 1). This shows that \det(X) \ne 0, since the points x_i
         are distinct. Thus X is nonsingular and the system Xa = y has a
         unique solution a. This proves the existence and uniqueness of an
         interpolating polynomial of degree \le n.
    (ii) By a standard theorem of linear algebra (see Theorem 7.2 of
         Chapter 7), the system Xa = y has a unique solution if and only if
         the homogeneous system Xb = 0 has only the trivial solution
         b = 0. Therefore, assume Xb = 0 for some b. Using b, define

             p(x) = b_0 + b_1 x + \cdots + b_n x^n

         From the system Xb = 0, we have

             p(x_i) = 0        i = 0, 1, \ldots, n

         The polynomial p(x) has n + 1 zeros and degree p(x) \le n. This
         is not possible unless p(x) \equiv 0. But then all coefficients b_i = 0,
         i = 0, 1, \ldots, n, completing the proof.
    (iii) We now exhibit explicitly the interpolating polynomial. To begin,
          we consider the special interpolation problem in which

              y_i = 1        y_j = 0 \text{ for } j \ne i

          for some i, 0 \le i \le n. We want a polynomial of degree \le n with
          the n zeros x_j, j \ne i. Then

              p(x) = c(x - x_0) \cdots (x - x_{i-1})(x - x_{i+1}) \cdots (x - x_n)

          for some constant c. The condition p(x_i) = 1 implies

              c = \frac{1}{(x_i - x_0) \cdots (x_i - x_{i-1})(x_i - x_{i+1}) \cdots (x_i - x_n)}

          This special polynomial is written as

              l_i(x) = \prod_{j \ne i} \frac{x - x_j}{x_i - x_j}        i = 0, 1, \ldots, n        (3.1.5)

          To solve the general interpolation problem (3.1.1), write

              p(x) = y_0 l_0(x) + y_1 l_1(x) + \cdots + y_n l_n(x)

          With the special properties of the polynomials l_i(x), p(x) easily
          satisfies (3.1.1). Also, degree p(x) \le n, since all l_i(x) have de-
          gree n.

          To prove uniqueness, suppose q(x) is another polynomial of
          degree \le n that satisfies (3.1.1). Define

              r(x) = p(x) - q(x)

          Then degree r(x) \le n, and

              r(x_i) = p(x_i) - q(x_i) = y_i - y_i = 0        i = 0, 1, \ldots, n

          Since r(x) has n + 1 zeros, we must have r(x) \equiv 0. This proves
          p(x) = q(x), completing the proof.    ■
Uniqueness is a property that is of practical use in much that follows. We
derive other formulas for the interpolation problem (3.1.1), and uniqueness says
they are the same polynomial. Also, without uniqueness the linear system (3.1.2)
would not be uniquely solvable; from results in linear algebra, this would imply
the existence of data vectors y for which there is no interpolating polynomial of
degree \le n.

The formula

    p_n(x) = \sum_{i=0}^{n} y_i l_i(x)        (3.1.6)

is called Lagrange's formula for the interpolating polynomial.
Example    For n = 1,

    p_1(x) = \frac{x - x_1}{x_0 - x_1} y_0 + \frac{x - x_0}{x_1 - x_0} y_1 = \frac{(x_1 - x)y_0 + (x - x_0)y_1}{x_1 - x_0}

The polynomial of degree \le 2 that passes through the three points (0, 1), (-1, 2),
and (1, 3) is

    p_2(x) = \frac{(x + 1)(x - 1)}{(0 + 1)(0 - 1)} \cdot 1 + \frac{(x - 0)(x - 1)}{(-1 - 0)(-1 - 1)} \cdot 2 + \frac{(x - 0)(x + 1)}{(1 - 0)(1 + 1)} \cdot 3
           = 1 + \frac{1}{2}x + \frac{3}{2}x^2
If a function f(x) is given, then we can form an approximation to it using the
interpolating polynomial

    p_n(x; f) = p_n(x) = \sum_{i=0}^{n} f(x_i) l_i(x)        (3.1.7)

This interpolates f(x) at x_0, \ldots, x_n. For example, we later consider f(x) =
\log_{10} x with linear interpolation. The basic result used in analyzing the error of
interpolation is the following theorem. As a notation, \mathscr{H}\{a, b, c, \ldots\} denotes
the smallest interval containing all of the real numbers a, b, c, \ldots.
Theorem 3.2    Let x_0, x_1, \ldots, x_n be distinct real numbers, and let f be a given
               real valued function with n + 1 continuous derivatives on the
               interval I_t = \mathscr{H}\{t, x_0, \ldots, x_n\}, with t some given real number.
               Then there exists \xi \in I_t with

                   f(t) - \sum_{j=0}^{n} f(x_j) l_j(t) = \frac{(t - x_0) \cdots (t - x_n)}{(n + 1)!} f^{(n+1)}(\xi)        (3.1.8)
Proof    Note that the result is trivially true if t is any node point, since then both
         sides of (3.1.8) are zero. Assume t does not equal any node point. Then
         define

             E(x) = f(x) - p_n(x)        p_n(x) = \sum_{j=0}^{n} f(x_j) l_j(x)

             G(x) = E(x) - \frac{\Psi(x)}{\Psi(t)} E(t)  \text{ for all } x \in I_t        \Psi(x) = (x - x_0) \cdots (x - x_n)        (3.1.9)

         The function G(x) is n + 1 times continuously differentiable on the
         interval I_t, as are E(x) and \Psi(x). Also,

             G(x_i) = E(x_i) - \frac{\Psi(x_i)}{\Psi(t)} E(t) = 0        i = 0, 1, \ldots, n

             G(t) = E(t) - E(t) = 0

         Thus G has n + 2 distinct zeros in I_t. Using the mean value theorem, G'
         has n + 1 distinct zeros. Inductively, G^{(j)}(x) has n + 2 - j zeros in I_t,
         for j = 0, 1, \ldots, n + 1. Let \xi be a zero of G^{(n+1)}(x). Since

             E^{(n+1)}(x) = f^{(n+1)}(x)        \Psi^{(n+1)}(x) = (n + 1)!

         we obtain

             G^{(n+1)}(x) = f^{(n+1)}(x) - \frac{(n + 1)!}{\Psi(t)} E(t)

         Substituting x = \xi and solving for E(t),

             E(t) = \frac{\Psi(t)}{(n + 1)!} f^{(n+1)}(\xi)

         the desired result.

         This may seem a "tricky" derivation, but it is a commonly used
         technique for obtaining some error formulas.    ■
Example    For n = 1, using x in place of t,

    f(x) - \frac{(x_1 - x)f(x_0) + (x - x_0)f(x_1)}{x_1 - x_0} = \frac{(x - x_0)(x - x_1)}{2} f''(\xi_x)        (3.1.10)

for some \xi_x \in \mathscr{H}\{x_0, x_1, x\}. The subscript x on \xi_x shows explicitly that \xi
depends on x; usually we omit the subscript, for convenience.
We now apply the n = 1 case to the common high school technique of linear
interpolation in a logarithm table. Let

    f(x) = \log_{10} x

Then f''(x) = -\log_{10} e / x^2, \log_{10} e \doteq 0.434. In a table, we generally would have
x_0 < x < x_1. Then

    E(x) = \frac{(x - x_0)(x_1 - x)}{2} \cdot \frac{\log_{10} e}{\xi^2}

This gives the upper and lower bounds

    \frac{\log_{10} e}{2} \cdot \frac{(x - x_0)(x_1 - x)}{x_1^2} \le E(x) \le \frac{\log_{10} e}{2} \cdot \frac{(x - x_0)(x_1 - x)}{x_0^2}
This shows that the error function E(x) looks very much like a quadratic
polynomial, especially if the distance h = x_1 - x_0 is reasonably small. For a
uniform bound on [x_0, x_1],

    |\log_{10} x - p_1(x)| \le \frac{h^2}{8} \cdot \frac{.434}{x_0^2} = \frac{.0542 h^2}{x_0^2} \le .0542 h^2        (3.1.11)

for x_0 \ge 1, as is usual in a logarithm table. Note that the interpolation error in a
standard table is much less for x near 10 than near 1. Also, the maximum error is
near the midpoint of [x_0, x_1].
For a four-place table, h = .01, 1 \le x_0 < x_1 \le 10, and (3.1.11) gives

    |\log_{10} x - p_1(x)| \le 5.42 \times 10^{-6}

Since the entries in the table are given to four digits (e.g., \log_{10} 2 \doteq .3010), this
result is sufficiently accurate. Why do we need a more accurate five-place table if
the preceding is so accurate? Because we have neglected to include the effects of
the rounding errors present in the table entries. For example, with \log_{10} 2 \doteq .3010,

    |\log_{10} 2 - .3010| \le .00005

and this will dominate the interpolation error if x_0 or x_1 = 2.
Rounding error analysis for linear interpolation    Let

    \bar{f}_0 = f(x_0) - \epsilon_0        \bar{f}_1 = f(x_1) - \epsilon_1

with \bar{f}_0 and \bar{f}_1 the table entries and \epsilon_0, \epsilon_1 the rounding errors. We will assume

    |\epsilon_0|, |\epsilon_1| \le \epsilon

for a known \epsilon. In the case of the four-place logarithm table, \epsilon = .00005.
We want to bound

    \tilde{E}(x) = f(x) - \frac{(x_1 - x)\bar{f}_0 + (x - x_0)\bar{f}_1}{x_1 - x_0}        x_0 \le x \le x_1        (3.1.12)

Using \bar{f}_i = f(x_i) - \epsilon_i,

    \tilde{E}(x) = f(x) - \frac{(x_1 - x)f(x_0) + (x - x_0)f(x_1)}{x_1 - x_0} + \frac{(x_1 - x)\epsilon_0 + (x - x_0)\epsilon_1}{x_1 - x_0}
               = E(x) + R(x)

    E(x) = \frac{(x - x_0)(x - x_1)}{2} f''(\xi)        (3.1.13)

The error \tilde{E}(x) is the sum of the theoretical interpolation error E(x) and the
function R(x), which depends on \epsilon_0, \epsilon_1.
+ g"(x)i
2
dx (3.7.27)
a a a
This proves (3.7.25). Equality in (3.7.25) occurs only if s_c''(x) - g''(x) \equiv 0 on
[a, b], or equivalently s_c(x) - g(x) is linear. The interpolating conditions then
imply s_c(x) - g(x) \equiv 0. We leave a further discussion of this topic to Problem
38.
Case 2    The "not-a-knot" condition. When the derivative values f'(a) and
f'(b) are not available, we need other end conditions on s(x) in order to
complete the system of equations (3.7.19). This is accomplished by requiring
s^{(3)}(x) to be continuous at x_1 and x_{n-1}. This is equivalent to requiring that s(x)
be a cubic spline function with knots \{x_0, x_2, x_3, \ldots, x_{n-2}, x_n\}, while still
requiring interpolation at all node points in \{x_0, x_1, x_2, \ldots, x_{n-1}, x_n\}. This
reduces system (3.7.19) to n - 3 equations, and the interpolation at x_1 and x_{n-1}
introduces two new equations (we leave their derivation to Problem 34). Again
we obtain a tridiagonal linear system AM= D, although the matrix A does not
possess some of the nice properties of that in (3.7.22). The resulting spline
function will be denoted here by snk(x), with the subscript indicating the
"not-a-knot" condition. A convergence analysis can be given for snk(x), similar
to that given in Theorem 3.4. For a discussion of this, see de Boor (1978, p. 211),
(1985).
There are other ways of introducing endpoint conditions when f'(a) and
f'( b) are unknown. A discussion of some of these can be found in de Boor (1978,
p. 56). However, the preceding scheme is the simplest to apply, and it is widely
used. In special cases, there are simpler endpoint conditions that can be used
than those discussed here, and we take up one of these in Problem 38. In general,
however, the preceding type of endpoint conditions are needed in order to
preserve the rates of convergence given in Theorem 3.4.
Numerical examples    Let f(x) = \tan^{-1} x, 0 \le x \le 5. Table 3.11 gives the er-
rors

    E_i = \max_{0 \le x \le 5} |f^{(i)}(x) - L_n^{(i)}(x)|        i = 0, 1, 2, 3        (3.7.28)

where L_n(x) is the Lagrange piecewise cubic function interpolating f(x) on the
nodes x_j = a + jh, j = 0, 1, \ldots, n, h = (b - a)/n. The columns labeled Ratio
Table 3.11    Lagrange piecewise cubic interpolation: L_n(x)

    n      E_0        Ratio    E_1        Ratio    E_2        Ratio    E_3      Ratio
    2      1.20E-2             1.22E-1             7.81E-1             2.32
                      3.3                 2.1                 1.5               1.2
    4      3.62E-3             5.83E-2             5.24E-1             1.95
                      11.4                6.1                 3.2               1.6
    8      3.18E-4             9.57E-3             1.64E-1             1.19
                      16.9                8.1                 3.9               1.7
    16     1.88E-5             1.11E-3             4.21E-2             .682
                      14.5                7.3                 3.7               1.9
    32     1.30E-6             1.61E-4             1.14E-2             .359
Table 3.12    Hermite piecewise cubic interpolation: Q_n(x)

    n      E_0        Ratio    E_1        Ratio    E_2        Ratio    E_3      Ratio
    3      2.64E-2             5.18E-2             4.92E-1             2.06
                      5.6                 3.0                 2.3               1.5
    6      4.73E-3             1.74E-2             2.14E-1             1.33
                      16.0                8.0                 3.6               1.5
    12     2.95E-4             2.17E-3             5.91E-2             .891
                      13.1                6.7                 3.6               1.9
    24     2.26E-5             3.25E-4             1.66E-2             .475
                      16.0                8.0                 4.0               2.0
    48     1.41E-6             4.06E-5             4.18E-3             .241
give the rate of decrease in the error when n is doubled. Note that the rate of
convergence of L_n^{(i)} to f^{(i)} is proportional to h^{4-i}, i = 0, 1, 2, 3. This can be
rigorously proved, and an indication of the proof is given in Problem 32.

Table 3.12 gives the analogous errors for the Hermite piecewise cubic function
Q_n(x) interpolating f(x). Note that again the errors agree with

    \max_{a \le x \le b} |f^{(i)}(x) - Q_n^{(i)}(x)| \le c h^{4-i}        i = 0, 1, 2, 3

which can also be proved, for some c > 0.

As was stated earlier following (3.7.10), the functions L_n(x) and Q_m(x),
m = 1.5n, are of comparable accuracy in approximating f(x), and Tables 3.11
and 3.12 confirm this. In contrast, the derivative Q_m'(x) is a more accurate
approximation to f'(x) than is L_n'(x). An explanation is given in Problem 32.
In Table 3.13, we give the results of using the complete cubic spline inter-
polant sc(x). To compare with Ln(x) and Qm(x) for comparable amounts of
given data on f(x), we use the same number of evenly spaced interpolation
points as used in Ln(x).
Example    Another informative example is to take f(x) = x^4, 0 \le x \le 1. All of
the preceding interpolation formulas have f^{(4)}(x) as a multiplier in their error
Table 3.13    Complete cubic spline interpolation: s_c(x)

    n      E_0        Ratio    E_1        Ratio    E_2        Ratio    E_3        Ratio
    6      7.09E-3             2.45E-2             1.40E-1             1.06E0
                      21.9                10.7                4.8                 2.6
    12     3.24E-4             2.28E-3             2.90E-2             4.09E-1
                      10.6                5.6                 2.9                 1.6
    24     3.06E-5             4.09E-4             9.84E-3             2.53E-1
                      20.7                9.7                 4.6                 2.1
    48     1.48E-6             4.22E-5             2.13E-3             1.22E-1
                      16.4                8.1                 4.0                 2.0
    96     9.04E-8             5.19E-6             5.30E-4             6.09E-2
Table 3.14    Comparison of three forms of piecewise cubic interpolation

    Method                E_0         E_1         E_2         E_3
    Lagrange (n = 32)     1.10E-8     6.78E-6     2.93E-3     .375
    Hermite  (n = 48)     1.18E-8     1.74E-6     8.68E-4     .250
    Spline   (n = 96)     7.36E-10    2.12E-7     1.09E-4     .0625
formulas. Since f^{(4)}(x) \equiv 24, a constant, the errors for all three forms of interpo-
lation satisfy

    E_j = c_j h^{4-j}        j = 0, 1, 2, 3        (3.7.29)

The constants c_j will vary with the form of interpolation being used. In the actual
computations, the errors behaved exactly like (3.7.29), thus providing another
means for comparing the methods. We give the results in Table 3.14 for only the
most accurate case.
These examples show that the complete cubic interpolating spline is more
accurate, significantly so in some cases. But the examples also show that all of the
methods are probably adequate in terms of accuracy, and that they all converge
at the same rate. Therefore, the decision as to which method of interpolation to
use should depend on other factors, usually arising from the intended area of
application. Spline functions have proved very useful with data fitting problems
and curve fitting, and Lagrange and Hermite functions are more useful for
analytic approximations in solving integral and differential equations, respec-
tively. All of these forms of piecewise polynomial approximation are useful with
all of these applications, and one should choose the form of approximation based
on the needs of the problem being solved.
B-splines    One way of representing cubic spline functions is given in
(3.7.12)-(3.7.13), in which a cubic polynomial is given on each subinterval. This
is satisfactory for interpolation problems, as given in (3.7.16), but for most
applications, there are better ways to represent cubic spline functions. As before,
we look at cubic splines with knots \{x_0, x_1, \ldots, x_n\}.

Define

    x_+^r = \begin{cases} 0, & x < 0 \\ x^r, & x \ge 0 \end{cases}        (3.7.30)

This is a spline of order r + 1, and it has only the one knot x = 0. This can be
used to give a second representation of spline functions. Let s(x) be a spline
function of order m with knots \{x_0, \ldots, x_n\}. Then for x_0 \le x \le x_n,

    s(x) = p_{m-1}(x) + \sum_{j=1}^{n-1} \beta_j (x - x_j)_+^{m-1}        (3.7.31)
with p_{m-1}(x) a uniquely chosen polynomial of degree \le m - 1 and \beta_1, \ldots, \beta_{n-1}
uniquely determined coefficients. The proof of this result is left as Problem 37.
There are several unsatisfactory features to this representation when applying it
to the solution of other problems. The most serious problem is that it often leads
to numerical schemes that are ill-conditioned. For this reason, we introduce
another numerical representation of s(x), one that is much better in its numerical
properties. To simplify the presentation, we consider only cubic splines.

We begin by augmenting the knots \{x_0, \ldots, x_n\}. Choose additional knots

    x_{-3} < x_{-2} < x_{-1} < x_0        x_n < x_{n+1} < x_{n+2} < x_{n+3}        (3.7.32)
in some arbitrary manner. For i = -3, -2, \ldots, n - 1, define

    B_i(x) = (x_{i+4} - x_i) f_x[x_i, x_{i+1}, x_{i+2}, x_{i+3}, x_{i+4}]        (3.7.33)

a fourth-order divided difference of

    f_x(t) = (t - x)_+^3        (3.7.34)

The function B_i(x) is called a B-spline, which is short for basic spline function.
As an alternative to (3.7.33), apply the formula (3.2.5) for divided differences,
obtaining

    B_i(x) = (x_{i+4} - x_i) \sum_{j=i}^{i+4} \frac{(x_j - x)_+^3}{\Psi'(x_j)}

This shows B_i(x) is a cubic spline with knots x_i, \ldots, x_{i+4}. A graph of a typical
B-spline is shown in Figure 3.7. We summarize some important properties of
B-splines as follows.
Figure 3.7    The B-spline B_0(x).

Theorem 3.5    The cubic B-splines satisfy

    (a) B_i(x) = 0 outside of x_i < x < x_{i+4};        (3.7.36)

    (b) 0 \le B_i(x) \le 1,  all x;        (3.7.37)

    (c) \sum_{i=-3}^{n-1} B_i(x) = 1,        x_0 \le x \le x_n;        (3.7.38)

    (d) \int_{x_i}^{x_{i+4}} B_i(x)\, dx = \frac{x_{i+4} - x_i}{4}        (3.7.39)

    (e) If s(x) is a cubic spline function with knots \{x_0, \ldots, x_n\},
        then for x_0 \le x \le x_n,

            s(x) = \sum_{i=-3}^{n-1} \alpha_i B_i(x)        (3.7.40)

        with the choice of \alpha_{-3}, \ldots, \alpha_{n-1} unique.
Proof    (a) For x \le x_i, the function f_x(t) is a cubic polynomial for the interval
             x_i \le t \le x_{i+4}. Thus its fourth-order divided difference is zero. For
             x \ge x_{i+4}, the function f_x(t) = 0 for x_i \le t \le x_{i+4}, and thus B_i(x) = 0.

         (b) See de Boor (1978, p. 131).

         (c) Using the recursion relation for divided differences,

                 B_i(x) = f_x[x_{i+1}, x_{i+2}, x_{i+3}, x_{i+4}] - f_x[x_i, x_{i+1}, x_{i+2}, x_{i+3}]        (3.7.41)

             Next, assume x_k \le x \le x_{k+1}. Then the only B-splines that can be
             nonzero at x are B_{k-3}(x), B_{k-2}(x), \ldots, B_k(x). Using (3.7.41),

                 \sum_{i=-3}^{n-1} B_i(x) = \sum_{i=k-3}^{k} B_i(x)
                   = \sum_{i=k-3}^{k} (f_x[x_{i+1}, x_{i+2}, x_{i+3}, x_{i+4}] - f_x[x_i, x_{i+1}, x_{i+2}, x_{i+3}])
                   = 1 - 0 = 1

             The last step uses (1) the fact that f_x(t) is cubic on [x_{k+1}, x_{k+4}], so
             that the divided difference equals 1, from (3.2.18); and (2) f_x(t) \equiv 0
             on [x_{k-3}, x_k].

         (d) See de Boor (1978, p. 151).

         (e) The concept of B-splines originated with I. J. Schoenberg, and the
             result (3.7.40) is due to him. For a proof, see de Boor (1978, p. 113).    ■
Because of (3.7.36), the sum in (3.7.40) involves at most four nonzero terms.
For x_k \le x < x_{k+1},

    s(x) = \sum_{i=k-3}^{k} \alpha_i B_i(x)        (3.7.42)

In addition, using (3.7.37) and (3.7.38),

    \min_{k-3 \le i \le k} \alpha_i \le s(x) \le \max_{k-3 \le i \le k} \alpha_i

showing that the value of s(x) is bounded by coefficients for B-splines near to x.
In this sense, (3.7.40) is a local representation of s(x), at each x \in [x_0, x_n].

A more general treatment of B-splines is given in de Boor (1978, chaps. 9-11),
along with further properties omitted here. Programs are also given for comput-
ing with B-splines.
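The divided-difference construction of B_i(x) can be checked numerically. The sketch below evaluates B_i(x) directly from truncated cubic powers, as in (3.7.33), and verifies property (3.7.38); it is an illustration only and not the recurrence used in practical B-spline software (see de Boor (1978) for that).

    # Evaluate a cubic B-spline from divided differences of f_x(t) = (t - x)_+^3.
    def trunc_cube(t):
        return t ** 3 if t > 0.0 else 0.0

    def bspline(knots, i, x):
        """Cubic B-spline on knots[i..i+4], via the divided-difference formula."""
        pts = knots[i:i + 5]
        total = 0.0
        for j, xj in enumerate(pts):
            dpsi = 1.0
            for r, xr in enumerate(pts):
                if r != j:
                    dpsi *= (xj - xr)          # Psi'(x_j)
            total += trunc_cube(xj - x) / dpsi
        return (pts[4] - pts[0]) * total

    knots = list(range(-3, 9))                  # x_-3, ..., x_{n+3} with n = 5, evenly spaced
    x = 2.3
    print(sum(bspline(knots, i + 3, x) for i in range(-3, 5)))   # approximately 1.0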
An important generalization of splines arises when the knots are allowed to
coalesce. In particular, let some of the nodes in (3.7.33) become coincident. Then,
so long as x_i < x_{i+4}, the function B_i(x) will be a cubic piecewise polynomial.
Letting two knots coalesce will reduce from two to one the number of continuous
derivatives at the multiple knot. Letting three knots coalesce will mean that
B_i(x) will only be continuous. Doing this, (3.7.40) becomes a representation for
all cubic piecewise polynomials. In this scheme of things, all piecewise poly-
nomial functions are spline functions, and vice versa. This is fully explored in
de Boor (1978).
3.8 Trigonometric Interpolation
An extremely important class of functions are the periodic functions. A function
f(t) is said to be periodic with period T if

    f(t + T) = f(t)        -\infty < t < \infty

and this is not to be true for any smaller positive value of T. The best known
periodic functions are the trigonometric functions. Periodic functions occur
widely in applications, and this motivates our consideration of interpolation
suitable for data derived from such functions. In addition, we use this topic to
introduce the fast Fourier transform (FFT), which is used in solving many
problems that involve data from periodic functions.

By suitably scaling the independent variable, it is always possible to let
T = 2\pi be the period:

    f(t + 2\pi) = f(t)        -\infty < t < \infty        (3.8.1)

We approximate such functions f(t) by using trigonometric polynomials,

    p_n(t) = a_0 + \sum_{j=1}^{n} [a_j \cos(jt) + b_j \sin(jt)]        (3.8.2)
If |a_n| + |b_n| \ne 0, then this function p_n(t) is called a trigonometric polynomial of
degree n. It can be shown by using trigonometric addition formulas that an
equivalent formulation is

    p_n(t) = \alpha_0 + \sum_{j=1}^{n} \left( \alpha_j [\cos(t)]^j + \beta_j [\sin(t)]^j \right)        (3.8.3)

thus partially explaining our use of the word polynomial for such a function. The
polynomial p_n(t) has period 2\pi or an integral fraction thereof.
To study interpolation problems with p_n(t) as a solution, we must impose
2n + 1 interpolating conditions, since p_n(t) contains 2n + 1 coefficients a_j, b_j.
Because of the periodicity of the function f(t) and the polynomial p_n(t), we also
require the interpolation nodes to lie in the interval 0 \le t < 2\pi (or equivalently,
-\pi \le t < \pi or 0 < t \le 2\pi). Thus we assume the existence of the interpolation
nodes

    0 \le t_0 < t_1 < \cdots < t_{2n} < 2\pi        (3.8.4)

and we require p_n(t) to be chosen to satisfy

    p_n(t_i) = f(t_i)        i = 0, 1, \ldots, 2n        (3.8.5)

It is shown later that this problem has a unique solution.
This interpolation problem has an explicit solution, comparable to the Lagrange
formula (3.1.6) for polynomial interpolation; this is dealt with in Problem 41.
Rather than proceeding with such a development, we first convert (3.8.4)-(3.8.5)
to an equivalent problem involving polynomials and functions of a complex
variable. This new formulation is the more natural mathematical setting for
trigonometric polynomial interpolation.
Using Euler's formula

    e^{i\theta} = \cos(\theta) + i\sin(\theta)        i = \sqrt{-1}        (3.8.6)

we obtain

    \cos(\theta) = \frac{e^{i\theta} + e^{-i\theta}}{2}        \sin(\theta) = \frac{e^{i\theta} - e^{-i\theta}}{2i}        (3.8.7)

Using these in (3.8.2), we obtain

    p_n(t) = \sum_{j=-n}^{n} c_j e^{ijt}        (3.8.8)

The coefficients are related by

    c_0 = a_0        c_j = \frac{a_j - ib_j}{2}        c_{-j} = \frac{a_j + ib_j}{2}        1 \le j \le n
Given \{c_j\}, the coefficients \{a_j, b_j\} are easily obtained by solving these latter
equations. Letting z = e^{it}, we can rewrite (3.8.8) as the complex function

    P_n(z) = \sum_{j=-n}^{n} c_j z^j        (3.8.9)

The function z^n P_n(z) is a polynomial of degree \le 2n.
To reformulate the interpolation problem (3.8.4)-(3.8.5), let z_j = e^{it_j},
j = 0, \ldots, 2n. With the restriction in (3.8.4), the numbers z_j are distinct points on
the unit circle |z| = 1 in the complex plane. The interpolation problem is

    P_n(z_j) = f(t_j)        j = 0, \ldots, 2n        (3.8.10)

To see that this is always uniquely solvable, note that it is equivalent to

    Q(z_j) = z_j^n f(t_j)        j = 0, \ldots, 2n

with Q(z) = z^n P_n(z). This is a polynomial interpolation problem, with 2n + 1
distinct node points z_0, \ldots, z_{2n}; Theorem 3.1 shows there is a unique solution.
Also, the Lagrange formula (3.1.6) generalizes to Q(z), and thence to P_n(z).
There are a number of reasons, both theoretical and practical, for converting
to the complex variable form of trigonometric interpolation. The most important
in our view is that interpolation and approximation by trigonometric polynomials
are intimately connected to the subject of differentiable functions of a complex
variable, and much of the theory is better understood from this pi!rspective. We
do not develop this theory, but a complete treatment is given in Henrici (1986,
chap. 13) and Zygmund (1959, chap. 10) ..
Evenly spaced interpolation    The case of interpolation that is of most interest in
applications is to use evenly spaced nodes t_j. More precisely, define

    t_j = \frac{2\pi j}{2n + 1}        j = 0, 1, 2, \ldots        (3.8.11)

The points t_0, \ldots, t_{2n} satisfy (3.8.4), and the points z_j = e^{it_j}, j = 0, \ldots, 2n, are
evenly spaced points on the unit circle |z| = 1. Note also that the points z_j
repeat as j increases by 2n + 1.

We now develop an alternative to the Lagrange form for p_n(t) when the nodes
\{t_j\} satisfy (3.8.11). We begin with the following lemma.
Lemma 4    For all integers k,

    \sum_{j=0}^{2n} e^{ikt_j} = \begin{cases} 2n + 1, & e^{ikt_1} = 1 \\ 0, & \text{otherwise} \end{cases}        (3.8.12)

The condition e^{ikt_1} = 1 is equivalent to k being an integer multiple of
2n + 1.
Proof    Let z = e^{ikt_1}. Then using (3.8.11), e^{ikt_j} = z^j, and the sum in (3.8.12)
         becomes

             S = \sum_{j=0}^{2n} z^j

         If z = 1, then this sums to 2n + 1. If z \ne 1, then the geometric series
         formula (1.1.8) implies

             S = \frac{z^{2n+1} - 1}{z - 1}

         Using (3.8.11), z^{2n+1} = e^{2\pi ki} = 1; thus, S = 0.
The interpolation conditions (3.8.10) can be written as

    \sum_{k=-n}^{n} c_k e^{ikt_j} = f(t_j)        j = 0, 1, \ldots, 2n        (3.8.13)

To find the coefficients c_k, we use Lemma 4. Multiply equation j by e^{-ilt_j}, then
sum over j, restricting l to satisfy -n \le l \le n. This yields

    \sum_{j=0}^{2n} \sum_{k=-n}^{n} c_k e^{i(k-l)t_j} = \sum_{j=0}^{2n} e^{-ilt_j} f(t_j)        (3.8.14)

Reverse the order of summation, and then use Lemma 4 to obtain

    \sum_{j=0}^{2n} e^{i(k-l)t_j} = \begin{cases} 0, & k \ne l \\ 2n + 1, & k = l \end{cases}

Using this in (3.8.14), we obtain

    c_l = \frac{1}{2n + 1} \sum_{j=0}^{2n} f(t_j) e^{-ilt_j}        l = -n, \ldots, n        (3.8.15)

The coefficients \{c_{-n}, \ldots, c_n\} are called the finite Fourier transform of the data
\{f(t_0), \ldots, f(t_{2n})\}. They yield an explicit formula for the trigonometric inter-
polating polynomial p_n(t) of (3.8.8). The formula (3.8.15) is related to the
Fourier coefficients of f(t):

    \frac{1}{2\pi} \int_0^{2\pi} f(t) e^{-ilt}\, dt        -\infty < l < \infty        (3.8.16)

If the trapezoidal numerical integration rule [see Section 5.1] is applied to these
integrals, using 2n + 1 subdivisions of [0, 2\pi], then (3.8.15) is the result, provided
f(t) is periodic on [0, 2\pi]. We next discuss the convergence of p_n(t) to f(t).
Theorem 3.6    Let f(t) be a continuous, periodic function, and let 2\pi be an
               integer multiple of its period. Define

                   \rho_n(f) = \inf_{\deg(q) \le n} \left[ \max_{0 \le t \le 2\pi} |f(t) - q(t)| \right]        (3.8.17)

               with q(t) a trigonometric polynomial. Then the interpolating
               function p_n(t) from (3.8.8) and (3.8.15) satisfies

                   \max_{0 \le t \le 2\pi} |f(t) - p_n(t)| \le c[\ln(n + 2)]\rho_n(f)        n \ge 0        (3.8.18)

               The constant c is independent of f and n.

Proof    See Zygmund (1959, chap. 10, p. 19), since the proof is fairly com-
         plicated.    ■
The quantity \rho_n(f) is called a minimax error (see Chapter 4), and it can be
estimated in a variety of ways. The most important bound on \rho_n(f) is probably
that of D. Jackson. Assume that f(t) is k times continuously differentiable on
[0, 2\pi], k \ge 0, and further assume f^{(k)}(t) satisfies the condition

    |f^{(k)}(t_1) - f^{(k)}(t_2)| \le M|t_1 - t_2|^{\alpha}        \text{all } t_1, t_2

for some 0 < \alpha \le 1. (This is called a Hölder condition.) Then

    \rho_n(f) \le \frac{c_k(f)}{n^{k+\alpha}}        n \ge 1        (3.8.19)

with c_k(f) independent of n. For a proof, see Meinardus (1967, p. 55).
Example Consider approximating /(t) = esin(ll, using the interpolating func-
tion Pn(t). The maximum error
En= Max 1/(t)- Pn(t)l
O.st.s2.,.
for various values of n, is given in Table 3.15. The convergence is rapid.
Table 3.15 Error in trigonometric
polynomial interpolation
n En n En
1 5.39E- 1 6 6.46E- 6
2 9.31E- 2 7 4.01E- 7
3 l.lOE- 2 8 2.22E- 8
4 1.11E- 3 9 l.lOE- 9
5 9.11E- 5 10 5.00E- 11
The fast Fourier transform    The approximation of f(t) by p_n(t) in the preceding
example was very accurate for small values of n. In contrast, the calculation of
the finite Fourier transform (3.8.15) in other applications will often require large
values of n. We introduce a method that is very useful in reducing the cost of
calculating the coefficients \{c_j\} when n is large.

Rather than using formula (3.8.15), we consider the equivalent formula

    d_k = \frac{1}{m} \sum_{j=0}^{m-1} w_m^{kj} f_j        w_m = e^{-2\pi i/m}        k = 0, 1, \ldots, m - 1        (3.8.20)
with given data \{f_0, \ldots, f_{m-1}\}. This is called a finite Fourier transform of order
m. For formula (3.8.15), let m = 2n + 1. We can allow k to be any integer,
noting that

    d_{k+m} = d_k        -\infty < k < \infty        (3.8.21)

Thus it is sufficient to compute d_0, \ldots, d_{m-1} or any other m consecutive
coefficients d_k.
To contrast the formula (3.8.20) with the alternative presented below, we
calculate the cost of evaluating d_0, \ldots, d_{m-1} using (3.8.20). To evaluate d_k, let
z_k = w_m^k. Then

    d_k = \frac{1}{m} \sum_{j=0}^{m-1} f_j z_k^j        (3.8.22)

Using nested multiplication, this requires m - 1 multiplications and m - 1
additions. We ignore the division by m, since often other factors are used. The
evaluation of z_k requires only 1 multiplication, since z_k = w_m z_{k-1}, k \ge 2. The
total cost of evaluating d_0, \ldots, d_{m-1} is m^2 multiplications and m(m - 1)
additions.
To introduce the main idea behind the fast Fourier transform, let m = pq with
p and q positive integers greater than 1. Rewrite the definition (3.8.20) in the
equivalent form

    d_k = \frac{1}{p} \sum_{l=0}^{p-1} \frac{1}{q} \sum_{g=0}^{q-1} w_m^{k(l+pg)} f_{l+pg}        k = 0, 1, \ldots, m - 1

Use w_m^p = \exp(-2\pi i/q) = w_q. Then

    d_k = \frac{1}{p} \sum_{l=0}^{p-1} w_m^{kl} \left[ \frac{1}{q} \sum_{g=0}^{q-1} w_q^{kg} f_{l+pg} \right]        k = 0, 1, \ldots, m - 1

Write

    e_k^{(l)} = \frac{1}{q} \sum_{g=0}^{q-1} w_q^{kg} f_{l+pg}        0 \le l \le p - 1        (3.8.23)

    d_k = \frac{1}{p} \sum_{l=0}^{p-1} w_m^{kl} e_k^{(l)}        0 \le k \le m - 1        (3.8.24)
Once \{e_k^{(l)}\} is known, each value of d_k will require p multiplications, using a
nested multiplication scheme as in (3.8.22). The evaluation of (3.8.24) will require
mp multiplications, assuming all e_k^{(l)} have been computed previously. There will
be a comparable number of additions.
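The splitting (3.8.23)-(3.8.24) is easy to test numerically. The sketch below (an illustration only; a full FFT applies the same factorization recursively) computes the coefficients both from the direct formula (3.8.20) and through the intermediate quantities e_k^{(l)}, and confirms they agree.

    # Check of the two-stage evaluation with m = p*q.
    import cmath

    def direct_dft(f):
        m = len(f)
        w = cmath.exp(-2j * cmath.pi / m)
        return [sum(f[j] * w ** (k * j) for j in range(m)) / m for k in range(m)]

    def split_dft(f, p, q):
        m = p * q
        wm = cmath.exp(-2j * cmath.pi / m)
        wq = cmath.exp(-2j * cmath.pi / q)
        # e[l][k] for 0 <= l < p, 0 <= k < q, as in (3.8.23); e repeats with period q in k.
        e = [[sum(f[l + p * g] * wq ** (k * g) for g in range(q)) / q for k in range(q)]
             for l in range(p)]
        return [sum(wm ** (k * l) * e[l][k % q] for l in range(p)) / p for k in range(m)]

    f = [complex(j * j - 3 * j + 1) for j in range(12)]     # arbitrary sample data, m = 12
    d1 = direct_dft(f)
    d2 = split_dft(f, p=3, q=4)
    print(max(abs(a - b) for a, b in zip(d1, d2)))          # agreement to rounding error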
We turn our attention to the computation of the quantities e_k^{(l)}. The index k
ranges from 0 to m - 1, but not all of these need be computed. Note that

    e_{k+q}^{(l)} = \frac{1}{q} \sum_{g=0}^{q-1} w_q^{(k+q)g} f_{l+pg} = e_k^{(l)}

because w_q^q = 1. Thus e_k^{(l)} needs to be calculated for only k = 0, 1, \ldots, q - 1,
and then it repeats itself. For each l, \{e_0^{(l)}, \ldots, e_{q-1}^{(l)}\}
with the coefficients c_j given in Table 4.3 in example (4.5.22) at the end of
Section 4.5. The results are given in Table 4.10.
Table 4.8    Relative maxima of |e^x - F_1(x)|

    x          e^x - F_1(x)
    -1.0        .272
     .1614     -.286
    1.0         .272
Table 4.9    Relative maxima of |e^x - F_3(x)|

    x          e^x - F_3(x)
    -1.0        .00547
    -.6832     -.00552
     .0493      .00558
     .7324     -.00554
    1.0         .00547
Table 4.10    Expansion coefficients for C_3(x) and F_3(x) to e^x

    j    c_j           c_{n,j}
    0    2.53213176    2.53213215
    1    1.13031821    1.13032142
    2    .27149534     .27154032
    3    .04433685     .04487978
    4    .00547424     E_4 = .00547424
As with the interpolatory approximation I_n(x), care must be taken if f(x) is
odd or even on [-1, 1]. In such a case, choose n as follows:

    If f is even, choose n to be odd; if f is odd, choose n to be even.        (4.7.44)

This ensures c_{n+1} \ne 0 in (4.7.5), and thus the nodes chosen will be the correct
ones.
An analysis of the convergence of F_n(x) to f(x) is given in Shampine (1970),
resulting in a bound similar to the bound (4.7.30) for I_n(x):

        (4.7.45)

with \omega(n) empirically nearly equal to the bounding coefficient in (4.7.31). Both
I_n(x) and F_n(x) are practical near-minimax approximations.

We now give an algorithm for computing F_n(x), which can then be evaluated
using the algorithm Chebeval of Section 4.5.
Algorithm Approx (c, E, f, n)

    1.  Remark: This algorithm calculates the coefficients c_j in

            F_n(x) = {\sum_{j=0}^{n}}' c_j T_j(x)        -1 \le x \le 1

        according to the formula (4.7.38), and E is from (4.7.39). The term c_0
        should be halved before using algorithm Chebeval.

    2.  Create x_i := \cos(i\pi/(n + 1)), f_i := f(x_i), i = 0, 1, \ldots, n + 1.

    3.  Do through step 8 for j = 0, 1, \ldots, n + 1.

    4.      sum := [f_0 + (-1)^j f_{n+1}]/2

    5.      Do through step 6 for i = 1, \ldots, n.

    6.          sum := sum + f_i \cos(ij\pi/(n + 1))

    7.      End loop on i.

    8.      c_j := 2 \cdot sum/(n + 1)

    9.  End loop on j.

    10. E := c_{n+1}/2 and exit.
The cosines in step 6 can be evaluated more efficiently by using the trigono-
metric addition formulas for the sine and cosine functions, but we have opted for
pedagogical simplicity, since the computer running time will still be quite small
with our algorithm. For the same reason, we have not used the FFT techniques of
Section 3.8.
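The algorithm Approx above transcribes directly into code. The sketch below is an illustration only (the book's version is pseudocode); the function name and return convention are choices made here, and the printed values can be compared with the c_{n,j} column of Table 4.10.

    # Python transcription of the algorithm Approx.
    import math

    def approx(f, n):
        xs = [math.cos(i * math.pi / (n + 1)) for i in range(n + 2)]
        fs = [f(x) for x in xs]
        c = []
        for j in range(n + 2):
            s = (fs[0] + (-1) ** j * fs[n + 1]) / 2.0
            s += sum(fs[i] * math.cos(i * j * math.pi / (n + 1)) for i in range(1, n + 1))
            c.append(2.0 * s / (n + 1))
        return c[:n + 1], c[n + 1] / 2.0     # remember to halve c_0 before using Chebeval

    coef, E = approx(math.exp, 3)
    print(coef, E)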
Discussion of the Literature
Approximation theory is a classically important area of mathematics, and it is
also an increasingly important tool in studying a wide variety of modern
problems in applied mathematics, for example, in mathematical physics and
combinatorics. The variety of problems and approaches in approximation theory
can be seen in the books of Achieser (1956), Askey (1975b), Davis (1963),
Lorentz (1966), Meinardus (1967), Powell (1981), and Rice (1964), (1968). The
classic work on orthogonal polynomials is Szego (1967), and a survey of more
recent work is given in Askey (1975a). For Chebyshev polynomials and their
many uses throughout applied mathematics and numerical analysis, see Fox and
Parker (1968) and Rivlin (1974). The related subject of Fourier series and
approximation by trigonometric polynomials was only alluded to in the text, but
it is of central importance in a large number of applications. The classical
reference is Zygmund (1959). There are many other areas of approximation
theory that we have not even defined. For an excellent survey of these areas,
including an excellent bibliography, see Gautschi (1975). A major area omitted in
the present text is approximation by rational functions. For the general area, see
Meinardus (1967, chap. 9) and Rice (1968, chap. 9). The generalization to
rational functions of the Taylor polynomial is called the Pade approximation; for
introductions, see Baker (1975) and Brezinski (1980). For the related area of
continued fraction expansions of functions, see Wall (1948). Many of the functions
that are of practical interest are examples of what are called the special functions
of mathematical physics. These include the basic transcendental functions (sine,
log, exp, square root), and in addition, orthogonal polynomials, the Bessel
functions, gamma function, and hypergeometric function. There is an extensive
literature on special functions, and special approximations have been devised for
most of them. The most important references for special functions are in
Abramowitz and Stegun (1964), a handbook produced under the auspices of the
U.S. National Bureau of Standards, and Erdelyi et al. (1953), a three volume set
that is often referred to as the "Bateman project." For a general overview and
survey of the techniques for approximating special functions, see Gautschi
(1975). An extensive compendium of theoretical results for special functions and
of methods for their numerical evaluation is given in Luke (1969), (1975), (1977).
For a somewhat more current sampling of trends in the study of special
functions, see the symposium proceedings in Askey (1975b).
From the advent of large-scale use of computers in the 1950s, there has been a
need for high-quality polynomial or rational function approximations of the basic
mathematical functions and other special functions. As pointed out previously,
the approximation of these functions requires a knowledge of their properties.
But it also requires an intimate knowledge of the arithmetic of digital computers,
as surveyed in Chapter 1. A general survey of numerical methods for producing
polynomial approximations is given in Fraser (1965), which has influenced the
organization of this chapter. For a very complete discussion of approximation of
the elementary functions, together with detailed algorithms, see Cody and Waite
(1980); the associated programming project is discussed in Cody
(1984). For a similar presentation of approximations, but one that also includes
some of the more common special functions, see Hart et al. (1968). For an
extensive set of approximations for special functions, see Luke (1975), (1977).
For general functions, a successful and widely used program for generating
minimax approximations is given in Cody et al. (1968). General programs for
computing minimax approximations are available in the IMSL and NAG libraries.
Bibliography
Abramowitz, M., and I. Stegun (eds.) (1964). Handbook of Mathematical Func-
tions. National Bureau of Standards, U.S. Government Printing Office,
Washington, D.C. (It is now published by Dover, New York)
Achieser, N. (1956). Theory of Approximation (trans!. C. Hyman). Ungar, New
York.
Askey, R. (1975a). Orthogonal Polynomials and Special Functions. Society for
Industrial and Applied Mathematics, Philadelphia.
238 APPROXIMATION OF FUNCTIONS
Askey, R. (ed.) (1975b). Theory and Application of Special Functions. Academic
Press, New York.
Baker, G., Jr. (1975). Essentials of Pade Approximants. Academic Press, New
York.
Brezinski, C. (1980). Pade- Type Approximation and General Orthogonal Polynomi-
als_. Birkhauser, Basel.
Cody, W. (1984). FUNPACK-A package of special function routines. In Sources
and Development of Mathematical Software, W. Cowell (ed.), pp. 49-67.
Prentice-Hall, Englewood Cliffs, N.J.
Cody, W. and W. Waite (1980). Software Manual for the Elementary Functions.
Prentice-Hall, Englewood Cliffs, N.J.
Cody, W., W. Fraser, and J. Hart (1968). Rational Chebyshev approximation
using linear equations. Numer. Math., 12, 242-251.
Davis, P. (1963). Interpolation and Approximation. Ginn (Blaisdell), Boston.
Erdelyi, A., W. Magnus, F. Oberhettinger, and F. Tricomi (1953). lfigher
Transcendental Functions, Vols. I, II, and III. McGraw-Hill, New York.
Fox, L., and I. Parker (1968). Chebyshev Polynomials in Numerical Analysis.
Oxford Univ. Press, Oxford, Englimd.
Fraser, W. (1965). A survey of methods of computing minimax and near-mini-
max polynomial approximations for functions of a single independent
variable. J. ACM, 12, 295-314.
Gautschi, W. (1975). Computational methods in special functions-A survey. In
' R. Askey (ed.), Theory and Application of Special Functions, pp. 1-98.
Academic Press, New York.
Hart, J., E. Cheney, C. Lawson, H. Maehly, C. Mesztenyi, J. Rice, H. Thacher,
and C. Witzgall (1968). Computer Approximations. Wiley, New York. (Re-
printed in 1978, with corrections, by Krieger, Huntington, N.Y.)
Isaacson, E., and H. Keller (1966). Analysis of Numerical Methods. Wiley, New
York.
Lorentz, G. (1966). Approximation of Functions. Holt, Rinehart & Winston, New
York.
Luke, Y. (1969). The Special Functions and Their Applications, Vols. I and II.
Academic Press, New York.
Luke, Y. (1975). Mathematical Functions and Their Approximations. Academic
Press, New York.
Luke, Y. (1977). Algorithms for the Computation of Mathematical Functions.
Academic Press, New York.
Meinardus, G. (1967). Approximation of Functions: Theory and Numerical Meth-
ods (transl. L. Schumaker). Springer-Verlag, New York.
Powell, M. (1981). Approximation Theory and Methods. Cambridge Univ. Press,
Cambridge, England.
Rice, J. (1964). The Approximation of Functions: Linear Theory. Addison-Wesley,
Reading, Mass.
PROBLEMS 239
Rice, J. (1968). The Approximation of Functions: Advanced Topics. Addison-
Wesley, Reading, Mass.
Rivlin, T. (1974). The Chebyshev Polynomials. Wiley, New York.
Shampine, L. (1970). Efficiency of a procedure for near-minimax approximation.
J. ACM, 17, 655-660.
Szego, G. (1967). Orthogonal Polynomials, 3rd ed. Amer. Math. Soc., Providence,
R.I.
Wall, H. (1948). Analytic Theory of Continued Fractions. Van Nostrand, New
York.
Zygmund, A. (1959). Trigonometric Series, Vols. I and II. Cambridge Univ. Press,
Cambridge, England.
Problems
1. To illustrate that the Bernstein polynomials Pn(x) in Theorem4.1 are poor
approximations, calculate theJour:th-degree approximation Pn(x) for f(x)
= sin(7rx), 0 s x s 1. Compare it with the fourth-degree Taylor poly-
nomial approximation, expanded about x = t.
00
2. Let S = L ( -1 )ja j be a convergent series, and assume that all a j ;:::. 0 and
1
Prove that
3. Using Problem 2, examine the convergence of the following series. Bound
the error when.'. truncating after n terms, and note the dependence on x.
Find the value of n for which the error is less than lo-
5
lh.is problem
illustrates another common technique for bounding the error in Taylor
series.
(a)
(b)
4. Graph the errors of the Taylor series approximations Pn(.x) 10 f(x) =
sin[(7r/2)x] on -1 s x s 1, for n = 1, 3,5. Note the behaviorof the error
both near the origin and near the endpoints.
PROBLEMS 241
Obtain a formula for a polynomial q(x) approximating /(x), with the
formula for q(x) inv.olving a single integral.
(c) Assume that f(x) is infinitely differentiable on [a, b], that is, JU>(x)
exists and is continuous on [a, b ], for all j 0. [This does not imply
that f(x) has a convergent Taylor series on [a, b].] Prove there exists
a sequence of polynomials { Pn(x)jn 1} for which
Limit II /(j) - P!j) II = 0
n-co oo
for all j 0. Hint: Use the Weierstrass theorem and part (b).
9. Prove the following result. Let f E C
2
[a, b] with f"(x) > 0 for as;; xs;; b.
If qi(x) = a
0
+ a
1
x is the linear minimax approximation to f(x) on.
[a, b], then
f(b)-.J(a)
al = b- a
= /(a) +/(c) _ (a + c) [/(b) -/(a) ]
00
2 2 b- a
where c is the unique solution of
f'(c) =/(b)- f(a)
b-a
What is p?
10. (a) Produce the linear Taylor polynomials to/( x) = ln ( x) on 1 s;; x s;; 2,
expanding about x
0
= f. Graph the error.
(b) Produce the linear minimax approximation to f(x) = ln (x) on [1, 2].
Graph the error, and compare it with the Taylor approximation.
11. (a) Show that the linear minimax approximation to VI + x
2
on [0, 1] is
qi( x) = .955 + .414x
(b) Using part (a), derive the approximation
Vy
1
+ z
2
= .955z + .414y
and determine the error.
12. Find the linear least squares approximation to /(x) = ln(x) on [1,2].
Compare the error with the results of Problem 10.
13. Find the value of a that minimizes
242 APPROXIMATION OF FUNCTIONS
What is the minimum? This is a simple illustration of yet another way to
measure the error of an approximation and of the resulting best approxima-
tion.
14. Solve the following minimization problems and determine whether there is
a unique value of a that gives the minimum. In each case, a is allowed to
range over all real numbers. We are approximating the function f(x) = x
with polynomials of the form ax
2
(a) Min J
1
[x- ax
2
]
2
dx
a -1
(b) ~ i n f
1
1x - ax
2
1 dx
(c) Min Max lx - ax
2
1
a -l:Sx:Sl
15. Using (4.4.10), show that {Pn(x)} IS an orthogonal family and that
IIPnlb = /2/(2n + 1), n ~ 0.
16. Verify that
for n ~ 0 are orthogonal on the interval [0, oo) with respect to the weight
function w(x) =e-x. (Note: {'><:> e-xxm dx = m! for m = 0, 1, 2 ... )
17. (a) Find the relative maxima and minima of Tn(x) on [ -1, 1], obtaining
(4.7.21).
(b) Find the zeros of TAx), obtaining (4.7.14).
18. Derive the formulas for bn and en given in the triple recursion relation in
(4.4.21) of Theorem 4.5.
19. Modify the Gram-Schmidt procedure of Theorem 4.2, to avoid the normal-
ization step IPn = lfin/111fnll2:
lfn(x) = Xn + bn.n-tlfin-l(x) + +bn.olfio(x)
.
and find the coefficients bn, j 0 ~ j ~ n - 1.
20. Using Problem 19, find 1f
0
, 1f
1
, 1f
2
for the following weight functions w(x)
on the indicated intervals [a, b ].
(a) w(x) = ln(x), ~ x ~ l
n>O
240 APPROXIMATION OF FUNCTIONS
5. Let /(x) be three times continuously differentiable on [-a, aJ for some
a > 0, and consider approximating it by the rational function
a+ bx
R(x) = _1_+_c_x
To generalize the idea of the Taylor series, choose the constants a, b, and c
so that
j = 0,1,2
Is it always possible to find such an approximation R(x)? The function
R(x) is an example of a Pade approximation to f(x). See Baker (1975) and
Brezinski (1980).
6. Apply the results of Problem 5 to the case /(x) =ex, and give the resulting
approximation R(x). Analyze its error on [ -1, 1], and compare it with the
error for the quadratic Taylor polynomial.
7. By means of various identities, it is often possible to reduce the interval on
which a function needs to be approximated. Show how to reduce each of
the following functions from - oo < x < oo to the given interval. Usually a
few additional, but simple, operations will be needed.
(a) ex 0 ~ ~ 1
(b) cos(x) 0 ~ x s; 'IT/4
{c) tan-
1
(x) 0 s; x ~ 1
(d) ln(x) 1 ~ x s; 2. Reduce from ln(x) on 0 < x < oo.
8. (a) Let f(x) be continuously differentiable on [a, bJ. Let p(x) be a
polynomial for which
II/' - Plloo s;
and define
q(x) =/(a)+ jxp(t) dt
a
Show that q( x) is a polynomial and satisfies
11/- qlloo s; (b- a)
{b) Extend part (a) to the case where f(x) is N times continuously
differentiable on [a, b], N;;:: 2, and p(x) is a polynomial satisfying
11/{N) - Plloo ~ (
PROBLEMS 243
(b) w(x) = x
(c)
21. Let { <Jln(x)!n 1} be orthogonal on (a, b) with weight function w(x) 0.
Denote the zeros of <pn(x) by
a<zn,n<zn-1.n< ... <z1,n<b
Prove that the zeros of <Jln+
1
(x) are separated by those of <pn(x),. that is,
a< zn+I,n+I < zn,n < zn,n+I < ... < z2,n+l < zl,n < zl,n+I < b
Hint: Use induction on the degree n. Write <pn(x) = Anxn + , with
An > 0, and use the triple recursion relation ( 4.4.21) to evaluate the
polynomials at the zeros of <pn(x). Observe that the sign changes for
<Jln_
1
(x) and <Jln+I(x).
22. Extend the Christoffel-Darboux identity (4.4.27) to the case with x = y,
obtaining a formula for
[cpk(x)]2
k=O "Yk
Hint: Consider the limit in (4.4.27) as y -+ x.
23. Let f(x) = cos-
1
(x) for -1 x 1 (the principal branch 0 ?T).
24.
Find the polynomial of degree two,
which minimizes
!
I [J(x)- p(x )]
2
dx
-1 /1-x
2
1
Define Sn(x) = --
1
n 0, with T,+
1
(x) the Chebyshev poly-
n+
nomial of degree n + 1. The polynomials Sn(x) are called Chebyshev
polynomials of the second kind.
(a) Show that is an orthogonal family on [-1,1] with
respect to the weight fmiction w( x) = ./1 - x
2
(b) Show that the family { Sn(x)} satisfies the same triple recursion
relation (4.4.13) as the family {Tn(x)}.
244 APPROXIMATION OF FUNCTIONS
(c) Given f E C( -1, 1), solve the problem
Min f/1- x
2
[J(x)- P
11
(x)]
2
dx
where p
11
(x) is allowed to range over all polynomials of degree::;;; n.
25. Do an operations count for algorithm Chebeval of Section 4.5. Give the
number of multiplications and the number of additions. Compare this to
the ordinary nested multiplication algorithm.
26. Show that the framework of Sections 4.4 and 4.5 also applies to the
trigonometric polynomials of degree ::;;; n. Show that the family
{1,sin(x),cos(x), ... ,sin(nx),cos(nx)} is orthogonal on [0,277]. Derive
the least squares approximation to f( x) on [0, 2 7T] using such polynomials.
[Letting n oo, we obtain the well-known Fourier series (see Zygmund
(1959))].
27. Let /(x) be a continuous even (odd) function on [-a, a]. Show that the
minimax approximation q:(x) to f(x) will be an even (odd) function on
[-a, a], regardless of whether n is even or odd. Hint: Use Theorem 4.10,
including the uniqueness result.
28. Using (4.6.10}, bound p
11
(/) for the following functions f(x) on the given
interval, n = 1, 2, ... , 10.
(a) sin(x) 0::;;; X::;;; 7Tj2
(b) ln(x) 1:;;;x:;;;e
(c) tan-
1
(x) O:;;;x:;;;7Tj4
(d)
ex
O:;;;x:;;;1
29. For the Chebyshev expansion (4.7.2), show that if f(x) is even (odd) on
[ -1, 1], then cj = 0 if j is odd (even).
30. For f(x) = sin[(7T/2)x], -1::;;; x::;;; 1, find both the Legendre and
Chebyshev least squares approximations of degree three to f(x). Determine
the error in each approximation and graph them. Use Theorem 4.9 to
bound the minimax error p
3
(/). Hint: Use numerical integration to sim-
plify constructing the least squares coefficients. Note the comments follow-
ing (4.7.13), for the Chebyshev least squares approximation.
31. Produce the interpolatory near-minimax approximation In(x) for the fol-
lowing basic mathematical functions f on the indicated intervals, for
n = 1, 2, ... , 8. Using the standard routines of your computer, compute the
error. Graph the error, and using Theorem 4.9, give upper and lower
bounds for P
11
(/).
(b) sin(x)
(d) ln (x)
O:;;;x,:;;;7Tj2
PROBLEMS 245
32. Repeat Problem 31 with the near-minimax approximation Fn(x ).
33. Repeat Problem 31 and 32 for
1
1
x sin {t)
f(x) =- --dt
X 0 t
7T
0 <X<-
- - 2
Hint: Find a Taylor approximation of high degree for evaluating f(x), then
use the transformation (4.5.12) to obtain an approximation problem on
[ -1, 1).
34. For f(x) =ex on [ -1, 1], consider constructing ln(x). Derive the error
result
-1.:::;; X.:::;; 1
for appropriate constants an, f3n Find nonzero upper and lower bounds for
Pn(/).
35. (a) The function sin ( x) vanishes at x = 0.1n order _to better approximate
it in a relative error sense, we consider the function f(x) = sin(x)jx.
Calculate the near-minimax approximation ln(x) on 0.:::;; x.:::;; 7Tj2 for
n = 1, 2, ... , 7, and then compare sin (x) - xln(x) with the results of
Problem 31(b).
(b) Repeat part (a) for f(x) = tan-
1
(x), O.::;;x.::;;l.
36. Let f(x) = anxn + +a
1
x + a
0
, an =1=- 0. Find the minimax approxima-
tion to f(x) on [ -1, 1] by a polynomial of degree.:::;; n - 1, and also find
Pn-l(f).
37. Let a = Min [Max I x
6
- x
3
- p
5
( x) 1], where the minimum is taken over
jxj.s 1 .
all polynomials of degree .:::;; 5.
(a) Find a.
(b) Find the polynomial p
5
(x) for which the minimum a is attained.
38. Prove the result (4.6.10). Hint: Consider the near-minimax approximation
ln(x).
39. (a) For f(x) =ex on [ -1, 1], find the Taylor polynomial p
4
(x) of degree
four, expanded about x = 0.
(b) Using Problem 36, find the minimax polynomial m
4
3
(x) of degree
three that approximates p
4
(x). Graph the error ex- m
4
3
(x) and
compare it to the Taylor error ex- PJ(x) shown in Figure 4.1. The
process of reducing a Taylor polynomial to a lower degree one by this
246 APPROXIMATION OF FUNCTIONS
process is called economization or telescoping. It is usually used
several times in succession, to reduce a high-degree Taylor polynomial
to a polynomial approximation of much lower degree ..
40. Using a standard program from your computer center for computing
minimax approximations, calculate the minimax approximation q:(x) for
the following given functions f(x} on the given interval. Do this for
n = 1, 2, 3, ... , 8. Compare the results with those for problem 31.
(b) sin (x) O.:S:x.:S:'fT/2
(d) ln (X) 1 X 2
41. Produce the minimax approximations q:(x), n = 1, 3, 5, 7, 9, for
1 jx
2
f(x) = ;- e-
1
dt
Hint: First produce a Taylor approximation of high accuracy, and then use
it with the program of Problem 40.
42. Repeat Problem 41 for
1
1
x sin (t)
f(x) =- --dt
X 0 t
lxl .:S:.,
FIVE
NUMERICAL INTEGRATION
In this chapter we derive and analyze numerical methods for evaluating definite
integrals. The integrals are mainly of the form
I(/)= jbf(x) dx (5.0.1)
a
with (a, b] finite. Most such integrals cannot be evaluated explicitly, and with
many others, it is often faster to integrate them numerically rather than evaluat-
ing them exactly using a complicated antiderivative of f(x). The approximation
of /(f) is usually referred to as numerical integration or quadrature.
There are many numerical methods for evaluating (5.0.1), but most can be
made to fit within the following simple framework. For the integrand f(x), find
an approximating family Un(x)ln ~ 1} and define
In(/)= jbfn(x) dx =1(/n) (5.0.2)
a
We usually require the approximations fn(x) to satisfy
(5.0.3)
And the form of each fn(x) should be chosen such that J(fn) Cl!-n be evaluated
easily. For the error,
En(!)= I(j)- In(!)= jb[J(x)- fn(x)] dx
.a
lEn(!) ~ jb!J(x)- fn(x)jdx ~ (b- a)llf- fnlloo
a
{5.0.4)
Most numerical integration methods can be viewed within this framework,
although some of them are better studied from some other perspective. The one
class of methods that does not fit within the framework are those based on
extrapolation using asymptotic estimates of the error. These are examined in
Section 5.4.
249
i
___ j
250 NUMERICAL INTEGRATION
Most numerical integrals In(/) will have the following form when they are
evaluated:
n
In(!)= L WJ,nf(xj.J n ~ 1 (5.0.5)
J-1
The coefficients w
1
, n are called the integration weights or quadrature weights; and
the points x
1
,n are the integration nodes, usually chosen in [a, b]. The depen-
dence on n is usually suppressed, writing w
1
and x
1
, although it will be
understood implicitly. Standard methods have nodes and weights that have
simple formulas or else they are tabulated in tables that are readily available.
Thus there is usually no need to explicitly construct the functions fn(x) of (5.0.2),
although their role in defining In(/) may be useful to keep in mind.
The following example is a simple illustration of (5.0.2)-(5.0.4), but it is not of
the form (5.0.5).
Example Evaluate
1 ex- 1
I=l--dx
0 X
(5.0.6)
This integrand has a removable singularity at the origin. Use a Taylor series for
ex [see (1.1.4) of Chapter 1) to define fn(x), and then define
1
n xi-l
I= 1 "-dx
n ..... '!
0 j=1 J.
n 1
j"f1 (j!)(j)
For the error in In, use the Taylor formula (1.1.4) to obtain
xn
f(x)- fn(x) = (n +
1
)! e(,
for some 0 ~ ~ x ~ x. Then
xn
I - I = 1
1
e(, dx
n 0 (n + 1)!
1 e
----,----- ~ I - In ~ : : ~ =
(n + 1)!(n + 1) (n + 1)!(n + 1)
(5.0.7)
(5.0.8)
The .sequence in (5.0.7) is rapidly convergent, and (5.0.8) allows us to estimate the
error very accurately. For example, with n = 6
I6 = 1.31787037
THE TRAPEZOIDAL RULE AND SIMPSON'S RULE 251
and from (5.0.8)
2.83 X 10-
5
::; 1- 1
6
::; 7.70 X 10-
5
The true error is 3.18 X 10-
5
For integrals in which the integrand has some kind of bad behavior, for
example, an infinite value at some point, we often will consider the integrand in
the form
!{!) = jbw(x)f(x) dx
a
(5.0.9)
The bad behavior is assumed to be located in w( x ), called the weight function,
and the function f(x) will be assumed to be well-behaved. For example, consider
evaluating
fo\ln x)f(x) dx
for arbitrary continuous functions f(x). The framework (5.0.2)-(5.0.4) gener-
alizes easily to the treatment of (5.0.9). Methods for such integrals are considered
in Sections 5.3 and 5.6.
Most numerical integration formulas are based on defining fn(x) in (5.0.2) by
using polynomial or piecewise polynomial interpolation. Formulas using such
interpolation with evenly spaced node points are derived and discussed in
Sections 5.1 and 5.2. The Gaussian quadrature formulas, which are optimal in a
certain sense and which have very rapid convergence, are given in Section 5.3.
They are based on defining fn(x) using polynomial interpolation at carefully
selected node points that need not be evenly spaced.
Asymptotic error formulas for the methods of Sections 5.1 and 5.2 are given
and discussed in Section 5.4, and some new formulas are derived based on
extrapolation with these error formulas. Some methods that control the integra-
tion error in an automatic way, while remaining efficient, are giveiJ. in Section 5.5.
Section 5.6 surveys methods for integrals that are singular or ill-behaved in some
sense, and Section 5.7 discusses the difficult task of numerical differentiation.
5.1 The Trapezoidal Rule and Simpson's Rule
We begin our development of numerical integration by giving two well-known
numerical methods for evaluating
(5.1.1)
We analyze and illustrate these methods very completely, and they serve as an
introduction to the material of later sections. The interval [a, b] is always finite in
this section.
- ---- --- ---------- ---- ----- -. --.- _:___________: ... -- -- j
252 NUMERICAL INTEGRATION
y
a b
Figure 5.1 Illustration of trapezoidal rule.
The trapezoidal rule The simple trapezoidal rule is based on approximating
f(x) by the straight line joining (a, f(a)) and (b, f(b)). By integrating the
formula for this straight line, we obtain the approximation
(
b- a)
/1(!) = -
2
- [!(a)+ f(b)] (5.1.2)
This is of course the area of the trapezoid shown in Figure 5.1. To obtain an error
formula, we use the interpolation error formula (3.1.10):
f(x)- (b- a)f(b) = (x- a)(x- b)f[a, b, x]
We also assume for all work with the error for the trapezoidal rule in this section
that f(x) is twice continuously differentiable on [a, b ]. Then
f
h (b-a)
1
(/)= f(x)dx- [J(a)+f(b)]
a 2
= t(x- a)(x- b)f[a, b, x] dx
a
(5.1.3)
Using the integral mean value theorem [Theorem 1.3 of Chapter 1],
1
(/) = f[a, b, a)(x- b) dx some a 5: 5: b
a
some 7J E [ a, b]
____ J
THE TRAPEZOIDAL RULE AND SIMPSON'S RULE 253
using (3.2.12). Thus
TJE[a,b] (5.1.4)
If b - a is not sufficiently small, the trapezoidal rule (5.1.2) is not of much
use. For such an integral, we break it into a sum of integrals over small
subintervals, and then we apply (5.1.2) to each of these smaller integrals. Let
n ;;::: 1, h = (b- a)jn, and xj =a+ jh for j = 0, 1, ... , n. Then
b n X
I(/)= 1 J(x) dx = f:-1' J(x) dx
a j=l xj-I
with xj_
1
.::;; TJj.::;; xj. There is no reason why the subintervals [xj_
1
, x) must all
have equal length, but it is customary to first introduce the general principles
involved in this way. Although this is also the customary way in which the
method is applied, there are situations in which it is desirable to vary the spacing
of the nodes.
The first terms in the sum can be combined to give the composite trapezoidal
rule,
n;;:::1 (5.1.5)
with f(x) = h For the error in In(f),
n h3
En(/) =I(/) -In(/) = _[ -
12
/"( TJj)
j=l
h3n[1" l
= -- - 'Lf"(TJ)
12 n j=l
(5.1.6)
For the term in brackets,
1 n
Min f"(x).::;; M =- L, J"(TJ).::;; Max f"(x)
a:s;x:s;b n j=l a:s;x:s;b
Since f"(x) is continuous for a.::;; x.::;; b, it must attain all values between its
minimum and maximum at some point of [a, b); thus f"(TJ) = M for some
TJ E [a, b). Thus we can write
some TJ E [a, b] (5.1.7)
254 NUMERICAL INTEGRATION
Another error estimate can be derived using this analysis. From (5.1.6)
L
. . En(!) L. . [ h f ( )]
mut --
2
- = liTIIt - -. 1.... " 1J.
n-+oo h n-+oo 12 .
1
1
j=
1 n
= --Limit L: r< TJ
1
)h
12 n-+oo j=l
Since x
1
_
1
71
1
:s; x
1
, j = 1, ... , n, the last sum is a Riemann sum; thus
En(!) 1 fb 1
Limit -
2
- =-- f"(x) dx = - -[j'(b)- f'(a)]
n->oo h 12 a 12
{5.1.8)
(5.1.9)
The term En(/) is called an asymptotic error estimate for En(/), and is valid in
the sense of (5.1.8).
Definition Let En(/) be an exact error formula, and let En(/) be an estimate
of it. We say that En(/) is an asymptotic error estimate for En(/) if
{5.1.10)
or equivalently,
The estimate in (5.1.9) meets this criteria, based on (5.1.8).
The composite trapezoidal rule (5.1.5) could also have been obtained by
replacing f(x) by a piecewise linear interpolating function fn(x) interpolating
f(x) at the nodes x
0
, x
1
, .. , xn. From here on, we generally refer to the
composite trapezoidal rule as simply the trapezoidal rule.
Example We use the trapezoidal rule (5.1.9) to calculate
{5.1.11)
The true value is I= -(e" + 1)/2 -12.0703463164. The values of In are
given in Table 5.1, along with the true errors En and the asymptotic estimates En,
obtained from (5.1.9). Note that the errors decrease by a factor of 4 when n is
doubled (and hence h is halved). This result was predictable from the multiplying
"--- -- - --------------------------. ----------------------------- ------ -----
THE TRAPEZOIDAL RULE AND SIMPSON'S RULE 255
Table 5.1 Trapezoidal rule for evaluating (5.1.11)
n Jn En Ratio En
2 -17.389259 5.32
4.20
4.96
4 -13.336023 1.27
4.06
1.24
8 -12.382162 3.12E- 1
4.02
3.10E- 1
16 -12.148004 7.77E- 2
4.00
7.76E- 2 .
32 -12.089742 1.94E- 2
4.00
1.94E- 2
64 -12.075194 4.85E- 3
4.00
4.85E- 3
128 -12.071558 1.21E- 3
4.00
1.21E- 3.
256 -12.070649 3.03E- 4 3.03E- 4
512 -12.070422 7.57E- 5
4.00
7.57E- 5
factor of h
2
present in (5.1.7) and (5.1.9); when h is halved, h
2
decreases by a
factor of 4. This example also shows that the trapezoidal rule is relatively
inefficient when compared with other methods to be developed in this chapter.
Using the error estimate En(/), we can define an improved numerical integra-
tion rule:
CTn(!) =In(!)+ En(!)
+ +fn-1 + (5.1.12)
This is called the corrected trapezoidal rule. The accuracy of En(/) should make
CTn(f) much more accurate than the trapezoidal rule. Another derivation of
(5.1.12) is suggested in Problem 4, one showing that (5.1.12) will fit into the
approximation theoretic framework (5.0.2)-(5.0.4). The major difficulty of using
CTn{f) is that /'(a) and /'(b) are required.
Example Apply CT,{f) to the earlier example (5.1.11). The results are shown in-
Table 5.2, together with the errors for the trapezoidal rule, for comparison.
Empirically, the error in CTn{f) is proportional to h
4
, whereas it was propor-
tional to h
2
with the trapezoidal rule. A proof of this is suggested in Problem 4.
Table 5.2 The corrected trapezoidal rule for (5.1.11)
n CT,.(f) Error Ratio Trap Error
2 -12.425528367 3.55E- 1
14.4
5.32
4 -12.095090106 2.47E- 2
15.6
1.27
8 -12.071929245 1.58E- 3 3.12E- 1
16 -12.070445804 9.95E- 5
15.9
7.77E- 2
32 -12.070352543 6.23E- 6
16.0
1.94E- 2
64 -12.070346706 3.89E- 7
16.0
4.85E- 3
128 -12.070346341 2.43E- 8
16.0
1.21E- 3
256 NUMERICAL INTEGRATION
y
X
a (a+b)/2 b
Figure 5.2 Illustration of Simpson's rule.
Simpson's rule To improve upon the simple trapezoidal rule (5.1.2), we use a
quadratic interpolating polynomial p
2
(f) to approximate f(x) on [a, b]. Let
c = (a + b )/2, and define
f
h[(x- c)(x- b) (x- a)(x- b)
12
(/)= a (a-c)(a-b)f(a)+ (c-a)(c-b)f(c)
(x-a)(x-c) ]
+ (b- a)(b- c)f(b) dx
Carrying outthe integration, we obtain
b-a
h = -- (5.1.13)
2
This is called Simpson's rule. An illustration is given in Figure 5.2, with the
shaded region denoting the area under the graph of y = h(x).
For the error, we begin with the interpolation error formula (3.2.11) to obtain
= jb(x- a)(x- c){x- b )f[a, b, c, x] dx
a
(5.1.14)
We cannot apply the integral mean value theorem since the polynomial" in the
J integrand changes sign at x = c =(a+ b)/2. We will assume f(x) is four times
============-=---=---='!, continuously differentiable on [a, b] for the work of this section on Simpson's
rule. Define
w(x) = jx(t- a)(t- c){t- b) dt
a
THE TRAPEZOIDAL RULE AND SIMPSON'S RULE 257
It is not hard to show that
w(a) = w(b) = 0 w(x) > 0 for a< x < b
Integrating by parts,
2
(/) = Jbw'(x)f[a,b,c,x]dX
u
d
= [w(x)f[a,b,c,x]];:!-Jbw(x)-J[a,b,c,x]dx
u dx
= -Jbw(x)f[a, b, c, x, x] dx
a
The last equality used (3.2.17). Applying the integral mean value theorem and
(3.2.12),
Thus
2
(/) = -J[a, b, c, E. dx
a
some a ::5; ::5; b
b-a
h = -- some TJ E [a, b].
2
TJE[a,b] (5.1.15)
From this we see that
2
(/) = 0 if f(x) is a polynomial of degree ::5; 3, even
though quadratic interpolation is exact only if f(x) is a polynomial of degree at
most two. The additional fortuitous cancellation of errors is suggested in Figure
5.2. This results in Simpson's rule being much more accurate than the trapezoidal
rule.
Again we create a composite rule. For n 2 and even, define h = (b- a)jn,
x
1
= a + jh for j = 0, 1, ... , n. Then
b n/2 x2.
I(!)= 1 f(x) dx = L 1
1
f(x) dx
u j= 1 x2j-2
with x
21
_
2
::5; 11
1
::5; x
21
Simplifying the first terms in the sum, we obtain the
composite Simpson rule:
I
I
- .. ------. ------- - - ---------------- -- --- . J
'
258 NUMERICAL INTEGRATION
As before, we will simply call this Simpson's rule. It is probably the most
well-used numerical integration rule. It is simple, easy to use, and reasonably
accurate for a wide variety of integrals.
For the error, as with the trapezoidal rule,
hs(n/2) 2 ' ~
/l(j) = /(f) - /11(/) = - ~ . ;; ~ l f'
4
'(lli)
h
4
(b-a)
En(/)=- 180 /(4)(71) some 71 E [ a , b] (5.1.17)
We can also derive the asymptotic error formula
(5.1.18)
The proof is essentially the same as was used to obtain (5.1.9).
Example We use Simpson's .rule (5.1.16) to evaluate the integral (5.1.11),
used earlier as an example for the trapezoidal rule. The numerical results are
given in Table 5.3. Again, the rate of decrease in the error confirms the results
given by (5.1.17) and (5.1.18). Comparing with the earlier results in Table 5.1 for
the trapezoidal rule, it is clear that Simpson's rule is superior. Comparing with
Table 5.2, Simpson's rule is slightly inferior, but the speed of convergence is the
same. Simpson's rule has the advantage of not requiring derivative values.
Peano kernel error formulas There is another approach to deriving the error
formulas, and it does not result in the derivative being evaluated at an unknown
point 71 We first consider the trapezoidal rule. Assume/' E C[a, b] and that
f"(x) is integrable on [a, b]. Then using Taylor's theorem [Theorem 1.4 in
Table 5.3 Simpson's rule for evaluating (5.1.11)
n /n En
Ratio ,
i -11.5928395534 -4.78E- 1
5.59
-1.63
4 -11.9849440198 -8.54E- 2
14.9
-L02E- 1
8 -12.0642089572 -6.14E- 3
15.5
-6.38E- 3
16 -12.0699513233 -3.95E-4 -3.99E- 4
15.9
32 -12.0703214561 -2.49E- 5
16.0
-2.49E- 5
64 - 12.0703447599 -1.56E- 6
16.0
-1.56E- 6
128 -12.0703462191 -9.73E- 8 -9.73E- 8
THE TRAPEZOIDAL RULE AND SIMPSON'S RULE 259
Chapter 1],
p
1
(x) = f(a) + (x- a)f'(a)
R
2
(X) = fx (X - t ) f" ( t ) dt
a
Note that from (5.1.3),
(5.1.19)
for any two functions F, G E C[ a, b ]. Thus
since E
1
(p
1
) = 0 from (5.1.4). Substituting,
f
hjx ( b - a ) Jh
= (x- t)f"(t) dt- -- (b- t)f"(t) dt
a a 2 a
In general for any integrable function G(x, t),
t Jx G(x, t) dtdx = t tG(x, t) dxdt
a a a t
(5.1.20)
Thus
f
h jh (b- a)Jh
E
1
(R
2
) = a f"(t)
1
(x- t) dxdt- -
2
- a (b- t)f"(t) dt
Combining integrals and simplifying the results,
1 fb
E
1
(!) =- f"(t)(t- a)(t- b) dt
2 a
(5.1.21)
For the composite trapezoidal rule (5.1.5),
En(!)= fbK(t)f"(t) dt
a
(5.1.22)
i
-------------------- ------------- -------- -- .. -- -------- ---------------
j = 1, 2, ... , n (5.1.23)
The formulas (5.1.21) and (5.1.22) are called the Peano kernel formulation of the
i
I
-- _/
260 NUMERICAL INTEGRATION
error, and K(t) is called the Peano kernel. For a more general presentation, see
Davis (1963, chap. 3).
As a simple illustration of its use, take bounds in (5.1.22) to obtain
(5.1.24)
If f"(t) is very peaked, this may give a better bound on the error than (5.1.7),
because in (5.1.7) we generally must replace 1/"('IJ)I by 11/"lloo
For Simpson's rule, use Taylor's theorem to write
11x 3
R
4
(x) =- (x- t) [<
4
>(t) dt
6 a
As before
and we the!l calculate EiR
4
) by direct substitution and simplification:
This yields
2
(!) = tK(t)[<
4
>(t) dt
a
{
1 3
-(t- a) (3t- a- 2b)
K(t) =
72
1 3
-. (b-t)(b+2a-3t)
72
a+b
a 5. t 5. -
2
-
a+b
-2- 5,_ t 5,_ b
A graph of K(t) is given in Figure 5.3. By direct evaluation,
b-a
h=--
2
{5.1.25)
{5.1.26)
As with the composite trapezoidal method, these results extend easily to the
composite Simpson rule.
The following examples are intended to describe more fully the behavior of
Simpson's and trapezoidal rules.
----- ---- - -- - ------ -- - - --- -
THE TRAPEZOIDAL RULE AND SIMPSON'S RULE 261
y
Example 1.
a
Figure 5.3 The Peano kernel for
Simpson's rule.
2
j(x) = ;r
3
v'X. [a. b] = [0. 1]. I = 9
Table 5.4 gives the error for increasing values of n. The derivative j<
4
l(x) is
singular at x = 0, and thus the formula (5.1.17) cannot be applied to this case. As
an alternative, we use the generalization of (5.1.25) to the composite Simpson rule
to obtain
Thus the error should decrease by a factor of 16 when h is halved (i.e., n is
doubled). This also gives a fairly realistic bound on the error. Note the close
agreement of the empirical values of ratio with the theoretically predicted values
of 4 and 16, respectively.
2.
1
f(x) =
1
( )2
+ x-'IT
[a, b] = [0,5] I = 2.33976628367
Table 5.4 Trapezoidal, Simpson integration: case (1)
Trapezoidal Rule Simpson's Rule
n Error Ratio Error Ratio
2 -7.197- 2
3.96
-3.370-3
14.6
4 -1.817- 2 -2.315- 4
8 -4.553- 3
3.99
-1543-5
15.0
4.00 15.3
16 -1.139- 3 -1.008-6
32 -2.848-4
4.QO
-6.489- 8
15.5
4.00 i5.7
64 -7.121- 5
4.00
-.4.141-9
15.8
128 -1.780- 5 -2.626- 10
----------------- - - ------------------
262 NUMERICAL INTEGRATION
Table 5.5 Trapezoidal, Simpson integration: case (2)
Trapezoidal Rule Simpson's Rule
n Error Ratio Error Ratio
2 1.731 - 1
2.43
-2.853- 1
-7.69
4 7.110- 2 3.709- 2
8 7.496 .:... 3
9.48
-1.371- 2
-2.71
16 1.953- 3
3.84
1.059- 4
-130
32 4.892- 4
3.99
1.080- 6
9.81
64 1.223- 4
4.00
6.743- 8
16.0
128 3.059- 5
4.00
4.217- 9
16.0
According to theory, the infinite differentiability of f(x) implies a value for ratio
of 4.0 and 16.0 for the trapezoidal and Simpson rules, respectively. But these
need not hold for the first several values of In, as Table 5.5 shows. The integrand
is relatively peaked, especially its higher derivatives, and this affects the speed of
convergence.
3.
f(x) =IX [a,b]=[0,1]
2
I=-
3
Since f'(x) has an infinite value at x = 0, none of the theoretical results given
previously apply to this case. The numerical results are in Table 5.6; note that
there is still a regular behavior to the error. In fact, the errors of the two methods
decrease by the same ratio as n is doubled. This ratio of 21.
5
= 2.83 is explained
in Section 5.4, formula (5.4.24).
4.
J(x) = ecos(x)
[a, b] = [0,277] I= 7.95492652101284
The results are shown in Table 5.7, and they are extremely good. Both methods
Table 5.6 Trapezoidal, Simpson integration: case (3)
Trapezoidal Rule Simpson's Rule
n Error Ratio Error Ratio
2 6.311- 2
2.70.
2.860- 2
2.82
4 2.338- 2 1.012- 2
2.83 2.74
3.587- 3
8 8.536- 3
2.77 2.83
16 3.085- 3 1.268- 3
2.78 2.83
32 1.108- 3 4.485- 4
2.83 2.80
64 3.959- 4 1.586- 4
2.83
128 1.410- 4
2.81
5.606- 5
NEWTON-COTES INTEGRATION FORMULAS 263
Table 5.7 Trapezoidal, Simpson integration: case (4)
Trapezoidal Rule Simpson's Rule
n Error Ratio Error Ratio
2 -1.74
5.06E + 1
7.21E- 1
4 -3.44E- 2 5.34E- 1
1.35
8 -1.25E- 6
2.75E + 4
1.15E- 2
4.64E + 1
16 < l.OOE- 14
> 1.25E + 8
4.17E- 7
2.76E + 4
< l.OOE- 14
> 4.17E + 7
are very rapidly convergent, with the trapezoidal rule superior to Simpson's rule.
This illustrates the excellent convergence of the trapezoidal rule for periodic
integrands; this is analyzed in Section 5.4. An indication of this behavior can be
seen from the asymptotic error terms (5.1.9) and (5.1.18), since both estimates are
zero in this case of f(x).
5.2 Newton-Cotes Integration Formulas
The simple trapezoidal rule (5.1.2) and Simpson's rule (5.1.13) are the first two
cases of the Newton-Cotes integration formula. For n 1, let h = (b- a)jn,
xj = a+ jh for j = 0, 1, ... , n. Define In(/) by replacing f(x) by its interpolat-
ing polynomial Pn(x) on the nodes x
0
, x
1
; , xn:
(5.2.1)
Using the Lagrange formula (3.1.6) for Pn(x),
(5.2.2)
with
wj,n= J\.n(x)dx j=0,1, ... ,n
a
(5.2.3)
Usually we suppress the subscript n and write just wj. We have already
calculated the cases n = 1 and 2. To illustrate the calculation of the weights, we
give the case of w
0
for n = 3.
A change of variable simplifies the calculations. Let x = x
0
+ p.h, 0 p. 3.
i
I
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ _, ___ _j
I
!
I
'
264 NUMERICAL INTEGRATION
Then
1 13
= -- (JL- 1)h(p.- 2)h(p.- 3)h. hdp.
6h
3
0
h 13
= -- (p.- 1)(p. -2)(p.- 3) dp.
6 0
3h
wo=-
8
The complete formula for n = 3 is
(5.2.4)
and is called the three-eighths rule.
For the error, we give the following theorem.
Theorem 5.1 (a) For n even, assume f(x) is n + 2 times continuously differen-
tiable on [a, b]. Then
some 11 E [a, b] (5.2.5)
with
1 1n
C..= p.
2
(p.- 1) (p.- n) dp.
n (n + 2)! 0
(5.2.6)
(b) For n odd, assume f(x) is n + 1 times continuously differen-
tiable on [a, b ]. Then
some 11 E [a, b] (5.2.7)
with
1 in
C= p.(p.-1)(p.-n)dp.
n (n + 1)! o
Proof We sketch the proof for part (a), the most important case. For complete
proofs of both cases, see Isaacson and Keller (1966, pp. 308-314). From
(3.2.11),
En(/) =I(/) -In(/)
= jb(x- x
0
)(x- x
1
) (x- xn)f[x
0
, x
1
, .. , xn, x] dx
a
NEWTON-COTES INTEGRATION FORMULAS 265
Define
w(x) = jx(t- x
0
) (t- xJ dt
a
Then
w(a)=w(b)=O w(x)>O for a< x < b
The proof that w(x) > 0 can be found in Isaacson and Keller (1966,
p. 309). It is easy to show w(b) = 0, since the integrand (t- x
0
)
(t- xn) is an odd function with respect to the middle node xn
12
=
(a+ b)/2.
Using integration by parts and (3.2.17),
En(!)= jbw'(x)f[x
0
, ... ,xn,x]dx
a
En(!) = - jb w(x )/[x
0
, . , xn, x, x] dx
a
Using the integral mean value theorem and (3.2.12),
En(!)= -f[xo xn, t dx
a
/(n+2)(TJ) b x
= - ( ) j j (t- x
0
) (t- xn) dtdx (5.2.8)
n+2! a a
We change the order of integration and then use the change of variable
t = x
0
+ p.h, 0 p. ;5; n:
= -hn+J f"p.(p.- 1) (p.- n + 1)(p.- n)
2
dp.
lo .
Use the change of variable v = n - p. to give the result
jbw(x) dx -hn+J1n(n- v) (1- v)v
2
dv
a 0
Use the fact that n is even and combine the preceding with (5.2.8), to
obtain the result (5.2.5)-(5.2.6).
266 NUMERICAL INTEGRATION
Table 5.8 Commonly used Newton-Cotes formulas
h h h
3
11 = 1 J f(x) dx = -[f(a) +/(b)]- -/"(0 trapezoidal rule
u 2 12
n=2 +4/(a;b) +f(b)]-
Simpson'srule
n=3
h 3h 3h
5
J f(x) dx = -[f(a) + 3f(a +h)+ 3f(h- h)+ /(b)]- -j<
4
>aJ
u 8 w
n=4
f
h 2h [ (a+ h) ] 8h
1
f(x) dx =- 7f(a) + 32/(a +h)+ 12/ -- + 32/(h- h)+ 7f(h) - -j<
6
>W
u 45 2 945
For easy reference, the most commonly used Newton-Cotes formulas are
given in Table 5.8. For n = 4, /
4
(/) is often called Boote's rule. As previously,
let h = (b- a)jn in the table.
Definition A numerical integration formula i(f) that approximates J(f) is
said to have degree of precision m if
1. f(f) =!(f) for all polynomials f(x) of degree.:::;; m.
2. [(f) =I= !(f) for some polynomial f of degree m + 1.
Example With n = 1, 3 in Table 5.8, the degrees of precision are also m = n =
1,3, respectively. But with n = 2,4, the degrees of precision are (m = n + 1 =
3, 5, respectively. This illustrates the general result that Newton-Cotes formulas
with an even index n gain an extra degree of precision as compared with those of
an odd index [see formulas (5.2.5) and (5.2.7)].
Each Newton-Cotes formula can be used to construct a composite rule. The
most useful remaining one is probably that based on Boole's rule (see Problem 7).
We omit any further details.
Convergence discussion The next question of interest is whether In(f) con-
verges to !(f) as n oo. Given the lack of convergence of the interpolation
polynomials on evenly spaced nodes for some choices of f(x) [see (3.5.10)], we
should expect some difficulties. Table 5.9 gives the results for a well-known
example,
!
4 dx
I= --= 2 tan-
1
(4)::: 2.6516
-41 + x
2
(5.2.9)
These Newton-Cotes numerical integrals are diverging; and this illustrates the
fact that the Newton-Cotes integration formulas In(f) in (5.2.2), need not
converge to !(f).
To understand the implications of the lack of convergence of Newton-Cotes
quadrature for (5.2.9), we first give a general discussion of the convergence of
numerical integration methods.
NEWTON-COTES INTEGRATION FORMULAS 267
Table5.9 Newton-Cotes example
(5.2.9)
n Jn
2 5.4902
4 2.2776
6 3.3288
8 1.9411
10 3.5956
Definition Let .fF be a family of continuous functions on a given interval [a, b ].
We say is dense in C[a, b] if for every f E C[a, b] and every
> 0, there is a function f. in .fF for which
Max lf(x)- f.(x) I:::;;
as,xs,b
(5.2.10)
Example 1. From the Weierstrass theorem [see Theorem 4.1], the set of all
polynomials is dense in C[ a, b ].
2. Let n;;:: 1, h = (b- a)jn, xj =a+ jh, 0 ~ } ; ; n. Let f(x) be linear on
each of the subintervals [xj_
1
, xj]. Define to be the set of all such piecewise
linear fup.ctions f(x) for all n. We leave to Prbblem 11 the proof that .fF is dense
in C[a, b].
Theorem 5. 2 Let
n
In(!)= L Wj,nf(xj,n) n;;:: 1
j-0
be a sequence of numerical integration formulas that approximate
Let .fF be a family dense in C[a, b]. Then
all fEC[a,b] (5.2.11)
if and only if
1. all jE.fF (5.2.12)
and
n
2.
B = Supremum L lwj,nl < oo
n<!:l j-0
268 NUMERICAL INTEGRATION
Proof (a) Trivially, (5.2.11) implies (5.2.12). But the proof that (5.2.11) im-
plies (5.2.13) is much more difficult. It is an example of the principle of
uniform boundedness, and it can be found in almost any text on func-
tional analysis; for example, see Cryer (1982, p. 121).
(b) We now prove that (5.2.12) and (5.2.13) implies (5.2.11). Let
f E C[ a, b] be given, and let t: > 0 be arbitrary. Using the assumption
that .;F is dense in C[ a, b ], pick f. E ff such that
Then write
I(!)- In(!)= [I(!) -/(f.)] + [I(f.) -In(/.)]
+[In(!.)- I,(!)]
It is straightforward to derive, using (5.2.13) and (5.2.14), that
IJ{f}- In(!) I:=; IJ(f)- J(f,) I+ II{!.}- Jn(f<) I
+ IJn(J.}- Jn(f) I
(
::; 2 +II(!.) - !Jf.) I
(5.2.14}
Using (5.2.12), ln(f.) ~ !{f.) as n ~ oo. Thus for all sufficiently large
n, say n ~ n,,
Since t: was arbitrary, this shows ln(f) ~ /(f) as n ~ oo.
Since the Newton-Cotes numerical integrals In(f) do not converge to /{f)
for f(x) = 1/(1 + x
2
) on [ -4,4], it must follow that either condition (5.2.12) or
(5.2.13) is violated. If we choose !F as the polynomials, then (5.2.12) is satisfied,
since In( p) = /( p) for any polynomial p of degree ::; n. Thus (5.2.13) must be
false. For the Newton-Cotes formulas (5.2.2),
n
Supremum L lwj.nl = oo
n j=O
Since /{f) = ln(f) for the special case f(x) == 1, for any n, we have
n
"w. = b- a
1..- J,n
j=O
n ~
(5.2.15)
(5.2.16)
NEWTON-COTES INTEGRATION FORMULAS 269
Combining these results, the weights w
1
. n must vary in sign as n becomes
sufficiently large. For example, using n = 8,
Such formulas can cause loss-of-significance errors, although it is unlikely to be a
serious problem until n is larger. But because of this problem, people have
generally avoided using Newton-Cotes formulas for n 8, even in forming
composite formulas.
The most serious problem of the Newton-Cotes method (5.2.2) is that it may
not converge for perfectly well-behaved integrands, as in (5.2.9).
The midpoint rule There are additional Newton-Cotes formulas in which one
or both of the endpoints of integration are deleted from the interpolation (and
integration) node points. The best known of these is also the simplest, the
midpoint rule. It is based on interpolation of the integrand f(x) by the constant
f((a + b)/2); and the resulting integration formula is
f
h (a+b) (b-a)
3
f(x)dx=(b-a)f -- + /"(7J)
a 2 24
some 7J E [ a, b]
(5.2.17)
For its composite form, define
x
1
=a+(j-!)h j=1,2, ... ,n
the midpoints of the intervals [a+ (j- 1)h, a+ jh]. Then
jhf(x) dx =In(/)+ En(/)
a
(5.2.18)
some 7J E [ a, b] (5.2.19)
The proof of these results is left as Problem 10.
These integration formulas in which one or both endpoints are missing are
called open Newton-Cotes formulas, and the previous formulas are called closed
formulas. The open formulas of higher order were used classically in deriving
numerical formulas for the solution of ordinary differential equations.
i
I
i
i
I
I
i
I
_j
270 NUMERICAL INTEGRATION
5 ~ Gaussian Quadrature
The composite trapezoidal and Simpson rules are based on using a low-order
polynomial approximation of the integrand f(x) on subintervals of decreasing
size. In this section, we investigate a class of methods that use polynomial
approximations of f(x) of increasing degree. The resulting integration formulas
are extremely accurately in most cases, and they should be considered seriously
by anyone faced with many integrals to evaluate.
For greater generality, we will consider formulas
n b
I,(!)= L wj_,f(xj,,) = j w(x)f(x) dx =I(/)
j=1 a
(5.3.1)
The weight function w(x) is assumed to be nonnegative and integrable on [a, b],
and it is to also satisfy the hypotheses (4.3.8) and (4.3.9) of Section 4.3. The
nodes { xj.,} and weights { wj,,} are to be chosen so that /,(/) equals I (f)
exactly for polynomials f(x) of as large a degree as possible. It is hoped that this
will result in a formula /,(/) that is nearly exact for integrands f(x) that are
well approximable by polynomials. In Section 5.2, the Newton-Cotes formulas
have an increasing degree of precision as n increased, but nonetheless they do not
converge for many well-behaved integrands. The difficulty with the Newton-Cotes
formulas is that the nodes {xi.,} must be evenly spaced. By omitting this
restriction, we will be able to obtain new formulas /,(/) that converge for all
f E C[a, b].
To obtain some intuition for the determination of /,(/), consider the special
case
1 n
J J(x) dx = L wJ(x)
-1 j=l
(5.3.2)
where w(x) = 1 and the explicit dependence of { wi} and {xi} on n has been
dropped. The weights { wi} and nodes {xi} are to be determined to make the
error
(5.3.3)
equal zero for as high a degree polynomial f(x) as possible. To derive equations
for the nodes and weights, we first note that
Thus E,(f) = 0 for every polynomial of degree :s: m if and only if
i=O,l, ... ,m (5.3.5)
GAUSSIAN QUADRATURE 271
Case 1. n = 1. Since there are two parameters, w
1
and x
1
, we consider
requiring
This gives
J
l .
1dx- w
1
= 0
-1
This implies w
1
= 2 and x
1
= 0. Thus the formula (5.3.2) becomes
f f(x) dx ~ 2/(0)
-1
the midpoint rule.
Case 2. n = 2. There are four parameters, w
1
, w
2
, x
1
, x
2
, and thus we put
four constraints on these parameters:
i=0,1,2,3
or
These equations lead to the unique formula
1 ( {3) ({3)
f_J(x) dx ~ -3 + f 3
(5.3.6)
which has degree of precision three. Compare this with Simpson's rule (5.1.13),
which uses three nodes to attain the same degree of precision.
Case 3. For a general n there are 2n free parameters {X;} and { W; }, and we
would guess that there is a formula (5.3.2) that uses n nodes and gives a degree of
precision of 2n - 1. The equations to be solved are
i = 0,1, ... ,2n -1
or
n {0
L wjxj= _2_
j=l i + 1
i = 1,3, ... ,2n- 1
i = 0,2, ... ,2n- 2
(5.3.7)
I
I
--- --------- ------------------------ ----------- - ________\
272 NUMERICAL INTEGRATION
These are nonlinear equations, and their solvability is not at all obvious. Because
of the difficulty in working with this nonlinear system, we use another approach
to the theory for (5.3.2), one that is .somewhat circuitous.
Let {qJ,(x)ln;;:: 0} be the orthogonal polynomials on (a, b) with respect to
the weight function w(x);;:: 0. Denote the zeros of rp,(x) by
a < x
1
< < x, < b (5.3.8)
Also, recall the notation from (4.4.18)-(4.4.20):
rp,(x) = A,x" +
(5.3.9)
Theorem 5.3 For each n ;;:: 1, there is a unique numerical integration formula
(5.3.1) of degree of precision 2n - 1. Assuming f(x) is 2n times
continuously differentiable on [a, b], the formula for I,(f) and its
error is given by
for some a< 11 <b. The nodes {xJ are the zeros ot rp,(x), and
the weights { wj} are given by
j = 1, ... , n (5.3.11)
Proof The proof is divided into three parts. We first obtain a formula with
degree of precision 2n - 1, using the nodes (5.3.8). We then show that it
is unique. Finally, we sketch the derivation of the error formula and the
(a) Construction of the formula. Hermite interpolation is used as the
vehicle for the construction (see Section 3.6 to review the notation and
results). For the nodes in (5.3.8), the Hermite polynomial interpolating
f(x) and f'(x) is
II II
H,(x) = 2: fix)h;{x) + 2: f'(x)hj(x) (5.3.12)
j=l j=l
with hix) and hix) defined in (3.6.2) of Section 3.6. The interpolation
error is given by
tf,(x) =J(x) -H,.(x) = [o/,.(x)]
2
/[x
1
,x
1
, ... ,x,.,x,.,x]
(
J<
2
">(E) E (a, b] (5.3.13)
2n !
---------- --- - ------------- ----. ------ ------ -- ... ..J
GAUSSIAN QUADRATURE 273
with
o/n(x) = {x- xJ {x- xJ
Note that
{5.3.14)
since both cpn(x) and o/n(x) are of degree n and have the same zeros.
Using (5.3.12), if f(x) is continuously differentiable, then
Jhw(x)f(x) dx = Jbw(x)Hn(x) dx + jbw(x)tffn(x) dx
a a a
(5.3.15)
The degree of precision is at least 2n - 1, since tfn(x) = 0 if f(x) is a
polynomial of degree < 2n, from (5.3.13). Also from (5.3.13),
En(x
2
n) = tw(x)tfn(x) dx = jhw(x)[ o/n(x)]
2
dx > 0 (5.3.16)
a a
Thus the degree of precision of In(/) is exactly 2n - 1.
To derive a simpler formula for In(/),
n b n b
In(/)= L /(xi) j w(x)hi(x) dx + L f'(x) J w(x)h/x) dx
j=l a j=l a
{5.3.17)
we show that all of the integrals in the second sum are zero. Recall that
from (3.6.2),
h/x) = (x- x)[l/x)]
2
l/x) = 1/Jn(x )' CJln(x)
(x- x)tfln(xi) (x- x ) c p ~ x i )
The last step uses (5.3.14). Thus
(5.3.18)
Since degree(/i) = n- 1, and since CJln(x) is orthogonal to all polynomi-
als of degree< n, we have
j = 1, ... , n
(5.3.19)
274 NUMERICAL INTEGRATION
The integration formula (5.3.15) becomes
b n
1 w(x)f(x) dx = L wJ(x) +En(!)
a j=1
~ = 1bw(x)hj(x)dx j= 1, ... ,n
a
(5.3.20)
(b) Uniqueness of formula (5.3.19). Suppose that we have a numerical
integration formula
(5.3.21)
that has degree of precision ~ 2n - 1. Construct the Hermite interpola-
tion formula to f(x) at the nodes z
1
, , zn. Then for any pqlynomial
f(x) of d e g r e e ~ 2n- 1,
n n
f(x)= L f(z)h/x)+ L f'(z)hj(x) deg (!) 2n- 1 {5.3.22)
j=l j=l
where h/x) and hix) are defined using {zj} Multiply (5.3.22) by
w(x), use the assumption on the degree of precision of (5.3.21), and
integrate to get
n n b n b
L vjf(zj) = L f(z) 1 w(x)hix) dx + L f'(zj) J w(x)hix) dx
j=l j=l a j=1 a
(5.3.23)
for any polynomial f(x) of d e g r e e ~ 2n - 1.
Let f(x} = hi(x). Use the properties (3.6.3) of hi(x) to obtain from
(5.3.23) that
1
b -
0 = w(x)hi(x) dx
a
i = 1, ... , n
As before in (5.3.18), we can write
w(x) = (x- z
1
) (x- zJ
Then (5.3.24) becomes
jbw(x)wn(x)l;(x) dx = 0
a
i=1,2, ... ,n
(5.3.24)
;
- _j
(iAUSSIAN QUADRATURE 275
Since all polynomials of degree .:;;; n - 1 can be written as a combination
of /
1
(x), ... , /n(x), we have that wn(x) is orthogonal to every polynomial
of degree .:;;; n - 1. Using the uniqueness of the orthogonal polynomials
[from Theorem 4.2], wn(x) must be a constant multiple of <p
11
(x). Thus
they must have the same zeros, and
Z; =X; i=1, ... ,n
To complete the proof of the uniqueness, we must show that W; = u;,
where u; is the weight in (5.3.21) and w; in (5.3.10). Use (5.3.23) with
(5.3.24) and f(x) = h;(x). The result will follow immediately, since
h ;( x) is now constructed using { x ;}.
(c) The error formula. We begin by deducing some further useful
properties about the weights { w;} in (5.3.10). Using the definition (3.6.2)
of h;(x),
W; = jh w(x)h;(x) dx = jh w(x)[1- 2/f(x;)(x- X;)] [l;(x)]
2
dx
a a
= jhw(x)[l;(x)Y dx- 2/f(x;) jhw(x)(x- x;)[l;(x)]
2
dx
a a
The last integral is zero from (5.3.19), since h;(x) = (x- x;)[/;(x)]
2
Thus
i = 1,2, ... , n (5.3.25)
and all the weights are positive, for all n.
To construct W;, begin by substituting f(x) = l;(x) into (5.3.20), and
note that En(f) = 0, since degree(/;)= n - 1. Then using l;(x) = 8ij
we have
i = 1, ... , n (5.3.26)
To further simplify the formula, the Christoffel-Darboux identity (Theo-
rem 4.6) can be used, followed by much manipulation, to give the
formula (5.3.11). For the details, see Isaacson and Keller (1966, pp.
333-334).
For the integration error, if f(x) is 2n times continuously differentia-
ble on [a, b], then
En{!)= Jh w_(x)tfn(x) dx
a
= f[xl, X1, ... , xn, xn, E] jb w(x)[ tltn(x )]
2
dx some E E [a, b]
a
i
I
I
I
I
I
- . ------ ~ -- -- --. -- ---- - --- ... n H - I
276 NUMERICAL INTEGRATION
the last step using the integral mean value theorem. Using (5.3.14) in the
last integral, and replacing the divided difference by a derivative, we have
E (f)= J<2n){TI) Jhw(x) [cpn(x)}2 dx
n {2n)! a A ~
whlch gives the error formula in (5.3.10).
Gauss-Legendre quadrature For w(x) = 1, the Gaussian formula on [ -1, 1] is
given by
(5.3.27)
with the nodes equal to the zeros of the degree n Legendre polynomial Pn(x) on
[ -1, 1]. The weights are
and
-2
w.= ~ ~ ~ ~ ~ ~
' (n + 1)P;(x;)Pn+l(x;)
i = 1,2, ... , n
2
2n+l(n!)4 /(2n)( Tl) /(2n)( Tl)
E (!)- -e --
n - (2n + 1)[(2n)!]
2
(2n)! = n (2n)! '
Table 5.10 Gauss-Legendre nodes
and weights
n X;
w.
I
2 .5773502692 1.0
3 .7745966692 .5555555556
0.0 .8888888889
4 .8611363116 .3478546451
.3399810436 .6521451549
5 .9061798459 .2369268851
.5384693101 .4786286705
0.0 .5688888889
6 .9324695142 .1713244924
.6612093865 .3607615730
.2386191861 .4679139346
7 .9491079123 .1294849662
.7415311856 .2797053915
.4058451514 .3818300505
0.0 .4179591837
8 .9602898565 .1012285363
.7966664774 .2223810345
.5255324099 .3137066459
.1834346425 .3626837834
(5.3.28)
(5.3.29)
GAUSSIAN QUADRATURE 277
Table 5.11 Gaussian quadrature
for (5.1.11)
n I, !-!,
2 -12.33621046570 2.66E- 1
3 -12.12742045017 5.71E- 2
4 -12.07018949029 -1.57E- 4
5 -12.07032853589 -1.78E- 5
6 -12.07034633110 1.47E- 8
7 -12.07034631753 1.14E- 9
8 -12.07034631639 -4.25E -13
for some -1 < 7J < 1. For integrals on other finite intervals with weight function
w(x) = 1, use the following linear change of variables:
l
b ( b - a ) JI ( a + b + x ( b - a) )
J(t) dt = -- f dx
u 2 -1 2
(5.3.30)
reduc_ing the integral to the standard interval [ -1, 1].
For convenience, we include Table 5.10, which gives the nodes and weights for
formula (5.3.27) with small values of n. For larger values of n, see the very
complete tables in Stroud and Secrest (1966), which go up to n = 512.
Example Evaluate the integral (5.1.11),
I= ['ex cos (x) dx = -12.0703463164
0
which was used previously in Section 5.1 as an example for the trapezoidal rule
(see Table 5.1) and Simpson's rule (see Table 5.3). The results given in Table 5.11
show the marked superiority of Gaussian quadrature.
A general error result We give a useful result in trying to explain the excellent
convergence of Gaussian quadrature. In the next subsection, we consider in more
detail the error in Gauss-Legendre quadrature.
Theorem 5.4 Assume [a, bJ is finite. Then the error in Gaussian quadrature,
b n
En(!)= J w(x)J(x) dx- L wjf(x)
u j=l
satisfies
n 1 (5.3.31)
with p
2
n_
1
{f) the minimax error from (4.2.1).
i
I
_J
278 NUMERICAL INTEGRATION
Proof E,(p) = 0 for any polynomial p(x) of e g r e e ~ 2n- 1. Also. the error
function En satisfies
for all F, G E C[a, b]. Let p(x) = qin-
1
(x), the minimax approxima-
tion of e g r e e ~ 2n- 1 to f(x) on [a, b]. Then
From (5.3.25), all wj > 0. Also, since p(x) = 1 is of degree 0,
n h
[wj=jw(x)dx
j=l "
This completes the proof of (5.3.31).
From the results in Sections 4.6 and 4.7, the speed of convergence to zero of
PmU) increases with the smoothness of the integrand. From (5.3.31), the same is
true of Gaussian quadrature. In contrast, the composite trapezoidal rule will
usually not converge faster than order h
2
[in particular, if /'(b) - /'(a) =f. OJ,
regardless of the smoothness of f(x). Gaussian quadrature takes advantage of
additional smoothness in the integrand, in contrast to most composite rules.
Example Consider using Gauss-Legendre quadrature to integrate
Table 5.12 contains error bounds based on (5.3.31),
Table 5.12 Gaussian quadrature
of (5.3.32)
n E,(f) (5.3.33)
1 -3.20E- 2 1.06E-l
2 2.29E- 4 1.33E- 3
3 9.55E- 6 3.24E- 5
4 -3.35E- 7 9.24E- 7
5 6.05E- 9 1.61E- 8
(5.3.32)
(5.3.33)
..... /
GAUSSIAN QUADRATURE 279
along with the true error. The error bound is of approximately the same
magnitude as the true error.
Discussion of Gauss-Legendre quadrature We begin by trying to make the
error term (5.3.29) more understandable. First define
M =
m
Max
-lsxsl m!
m ~ O (5.3.34)
For a large class of infinitely differentiable functions f on [ -1, 1], we have
Supremumm;;,;oMm < oo. For example, this will be true if f(z) is analytic on the
region R of the complex plane defined by
R = { z: lz - xl ~ 1 for some x, -1 ~ x ~ 1}
With many functions, Mm ~ 0 as m ~ oo, for example, f(x) =ex and cos(x).
Combining (5.3.29) and (5.3.34), we obtain
n ~ (5.3.35)
and the size of en is essential in examining the speed of convergence.
The term en can be made more understandable by estimating it using Stirling's
formula,
which is true in a relative error sense as n ~ oo. Then we obtain
as n ~ o o (5.3.36)
This is quite a good estimate. For example, e
5
= .00293, and (5.3.36) gives the
estimate .00307. Combined with (5.3.35), this implies
(5.3.37)
which is a correct bound in an asymptotic sense as n ~ oo. This shows that
En(/) ~ 0 with an exponential rate of decrease as a function of n. Compare this
with the polynomial rates of 1jn
2
and 1/n
4
for the trapezoidal and Simpson
rules, respectively.
In order to consider integrands that are not infinitely differentiable, we can use
the Peano kernel form of the error, just as in Section 5.1 for Simpson's and the
trapezoidal rules. If f(x) is r times differentiable on [ -1, 1], with J<'>(x)
integrable on [ -1, 1], then
r
n >-
2
(5.3.38)
!
i
'
280 NUMERICAL INTEGRATION
Table 5.13 Error constants e,,, for (5.3.39)
n
en,2
Ratio
en,4
Ratio
2 .162 .178
3.7 27.5
4 .437- 1 .647-2
8 .118- 1
3.7
.417-3
15.5
16 .311- 2
3.8
.279-4
14.9
32 .800- 3
3.9
.183- 5
15.3
3.9
64 .203- 3
4.0
128 .511- 4
for an appropriate Peano kernel Kn.r(t). The procedure for constructing Knjt)
is exactly the same as with the Peano kernels (5.1.21) and (5:1.25) in Section 5.1.
From (5.3.38)
lEn(!) ~ en,rMr
en. r = r! F I K n. r ( l) I dt
-1
(5.3.39)
The values of en., given in Table 5.13 are taken from Stroud-Secrest (1966, pp.
152-153). The table shows that for f twice continuously differentiable, Gaussian
quadrature converges at least as rapidly as the trapezoidal rule (5.1.5). Using
(5.3.39) and the table, we can construct the empirical bound
.42 [ ]
lEn(!) ~ -
2
Max lf"(x) I
n -l!>x!>l
(5.3.40)
The corresponding formula (5.1.7) for the trapezoidal rule on [ -1, 1] gives
.67 [ ]
I Trapezoidal error I -;;z_ _ i'!:";;
1
I!" ( x) I
which is slightly larger than (5.3.40). In actual computation, Gaussian quadrature
appears to always be superior to the trapezoidal rule, except for the case of
periodic integrands with the integration interval an integer multiple of the period
of the integrand, as in Table 5.7. An analogous discussion, using Table 5.13 with
en,
4
, can be carried out for integrands j(x), which are four times differentiable
(see Problem 20).
Example We give three further examples that are not as well behaved as the
ones in Tables 5.11 and 5.12. Consider
5
dx
J(
2
) = 1 2 ~ 2.33976628367
o 1 + (x- '17)
[2'1T
J<
3
l = Jn e-x sin (50x) dx ~ .019954669278
0
(5.3.41)
l
............. \
GAUSSIAN QUADRATURE 281
Table 5.14 Gaussian quadrature examples (5.3.41)
n
2
4
8
16
32
64
128
JOl _
-7.22- 3
-1.16- 3
-1.69-4
-2.30- 5
-3.00- 6
-3.84- 7
-4.85- 8
Ratio
6.2
6.9
7.4
7.6
7.8
7.9
/(2)- J<3l -
3.50- 1 3.48- 1
-9.19- 2 -1.04-1
-4.03- 3 -1.80- 2
-6.24- 7 -3.34- 1
-2.98- 11 1.16- 1
1.53- 1
6.69- 15
The values in Table 5.14 show that Gaussian quadrature is still very effective, in
spite of the bad behavior of the integrand.
Compare the results for 1<
1
> with those in Table 5.6 for the trapezoidal and
Simpson rules. Gaussian quadrature is converging with an error proportional to
1jn
3
, whereas in Table 5.6, the errors converged with a rate proportional to
1jnl.5. Consider the integral
{5.3.42)
with a > -1 and nonintegral, and f(x) smooth with /(0) if= 0. It has been
shown by Donaldson and Elliott (1972) that the error in Gauss-Legendre
quadrature for (5.3.42) will have the asymptotic estimate
. c(!, a)
En(/) = 2(l+a)
n
(5.3.43)
This agrees with
= 1<
2
> within the limits of the
machine arithmetic. Also compare these results with those of Table 5.5, for
the trapezoidal and Simpson rules.
The approximations in Table 5.14 for J<
3
> are quite poor because the integrand
is so oscillatory. There are 101 zeros of the integrand in the interval of integra-
tion. To obtain an accurate value Ip>, the degree of the approximating poly-
nomial underlying Gaussian quadrature must be very large. With n = 128, Jp> is
a very accurate approximation of /(3>.
General comments Gaussian quadrature has a number of strengths and weak-
nesses.
1. Because of the form of the nodes and weights and the resulting need to use a
table, many people prefer a simpler formula, such as Simpson's rule. This
shouldn't be a problem when doing integration using a computer. Programs
should be written containing these weights and nodes for standard values of
n, for example, n = 2,4, 8, 16, ... , 512 [taken from Stroud and Secrest (1966)].
I
I
-... -- .J
i
I
I
I
282 NUMERICAL INTEGRATION
In addition, there are a number of very rapid programs for calculating the
nodes and weights for a variety of commonly used weight functions. Among
the better known algorithms is that in Golub and Welsch (1969).
2. It is difficult to estimate the error, and thus we usually take
(5.3.44}
for some m > n, for example, m = n + 2 with well-behaved integrands, and
m = 2n otherwise. This results in greater accuracy than necessary, but even
with the increased number of function evaluations, Gaussian quadrature is
still faster than most other methods.
3. The nodes for each formula In are distinct from those of preceding formulas
/m, and this results in some inefficiency. If In is not sufficiently accurate,
based on an error estimate like (5.3.44), then we must compute a new value
of ln. However, none of the previous values of the integrand can be reused,
resulting in wasted effort. This is discussed more extensively in the last part
of this section, resulting in some new methods without this drawback.
Nonetheless, in many situations, the resulting inefficiency in Gaussian
quadrature is usually not significant because of its rapid rate of convergence.
4. If a large class of integrals of a similar nature are to be evaluated, then
proceed as follows. Pick a few representative integrals, including some with
the worst behavior in the integrand that is likely to occur. Determine a value
of n for which In(/) will have sufficient accuracy among the representative
set. Then fix that value of n, and use In(/) as the numerical integral for all
members of the original class of integrals.
5. Gaussian quadrature can handle many near-singular integrands very effec-
tively, as is shown in (5.3.43) for (5.3.42). But all points of singular behavior
must occur as endpoints of the integration interval. Gaussian quadrature is
very poor on an integral such as
fvlx- .71 dx
which contains a singular point in the interval of integration. (Most other
numerical integration methods will also perform poorly on this integral.) The
integral should be decomposed and evaluated in the form
L
7
{7-xdx+ f-./x-.1dx
0 .7
Extensions that reuse node points Suppose we have a quadrature formula
(5.3.45}
We want to produce a new quadrature formula that uses the n nodes x
1
, , xn
and m new nodes xn+l , Xn+m:
(5.3.46)
GAUSSIAN QUADRATURE 283
These n +2m unspecified parameters, namely the nodes xn+I , xn+m and the
weights u
1
, , vn+m are to be chosen to give (5.3.46) as large a degree of
precision as is possible. We seek a formula of degree of precision n + 2m - 1.
Whether such a formula can be determined with the new nodes xn+I ... , xn+m
located in [a, b] is in general unknown.
In the case that (5.3.45) is a Gauss formula, Kronrod studied extensions
(5.3.46) with m = n + 1. Such pairs of formulaS' give a less expensive way of
producing an error estimate for a Gauss rule (as compared with using a Gauss
rule with 2n + 1 node points). And the degree of precision is high enough to
produce the kind of accuracy associated with the Gauss rules.
A variation on the preceding theme was introduced in Patterson (1968). For
w(x) = 1, he started with a Gauss-Legendre rule /n
0
{f). He then produced a
sequence of formulas by repeatedly constructing formulas (5.3.46) from the
preceding member of the sequence, with m = n + 1. A paper by Patterson (1973)
contains an algorithm based on a sequence of rules /
3
, /
7
, /
15
, /
31
, /
63
, /
127
, /
255
;
the formula /
3
is the three-point Gauss rule. Another such sequence
{ /
10
, /
21
, /
43
, /
87
} is given in Piessens et al. (1983, pp. 19, 26, 27), with /
10
the
ten-point Gauss rule. All such Patterson formulas to date have had all nodes
located inside the interval of integration and all weights positive.
The degree of precision of the Patterson rules increases with the number of
points. For the sequence /
3
, /
7
, ... , /
255
previously referred to, the respective
degrees of precision are d = 5, 11, 23, 47, 95,191,383. Since the weights are
positive, the proof of Theorem 5.4 can be repeated to show that the Patterson
rules are rapidly convergent.
A further discussion of the Patterson and Kronrod rules, including programs,
is given in Piessens et al. (1983, pp. 15-27); they also give reference to much of
the literature on this subject.
.Example Let (5.3.45) be the three-point Gauss rule on [ -1, 1]:
8 5
13(!) = 9/(0) + 9[/(-{.6) + /(!.6)]
(5.3.47)
The Kronrod rule for this is
/7{!) = aof(O) + a
1
[/( -{.6) + /(f6)]
+a2[/( -{31) + /({31)] + a3[/( -{32) + /({3
2
)] (5.3.48)
with f3'f and PI the smallest and largest roots, respectively, of
2 10 155
x - -x+- =0
9 891
The weights a
0
, a
1
, a
2
, a
3
come from integrating over [ -1, 1] the Lagrange
polynomial p
7
(x) that interpolates f(x) at the nodes {0, 6, {3
1
, {3
2
}.
Approximate values are
a
0
= .450916538658
a
2
= .401397414776
a
1
= .268488089868
a
3
= .104656226026
.. .. . .. .. J
284 NUMERICAL INTEGRATION
5.4 Asymptotic Error FormulaS and Their Applications
Recall the definition (5.1.10) of an asymptotic error formula for a numerical
integration formula: En(f) is an asymptotic error formula for En(f) =
/(f) - /n(f} if
(5.4.1)
or equivalently,
Examples are (S.l.9) and (5.1.18) from Section 5.1.
By obtaining an asymptotic error formula, we are obtaining the form or
structure of the error. With this information, we can either estimate the error in
ln(f), as in Tables 5.1 and 5.3, or we can develop a new and more accurate
formula, as with the corrected trapezoidal rule in (5.1.12). Both of these alterna-
tives are further illustrated in this section, concluding with the rapidly convergent
Romberg integration method. We begin with a further development .of asymp-
totic error formulas.
The Bernoulli polynomials For use in the next theorem, we introduce the
Bernoulli polynomials Bn(x), n ~ 0. These are defined implicitly by the generating
function
(5.4.2)
The first few polynomials are
B
0
(x) = 1
(5.4.3)
With these polynomials,
k ~ (5.4.4)
There are easily computable recursion relations for calculating these polynomials
(see Problem 23).
Also of interest are the Bernoulli numbers, defined implicitly by
t
00
tj
--= EB--
e'- 1 j=O
1
j!
(5.4.5)
ASYMPTOTIC ERROR FORMULAS AND THEIR APPLICATIONS 285
The first few numbers are
1 1 -1 1 -1
Bo = 1 Bl = -2 B2 = 6 B4 = 30 B6 = 42 Bs = 30 (5.4.6)
and for all odd integers j 3, Bj = 0. To obtain a relation to the Bernoulli
polynomials B/x), integrate (5.4.2) with respect to x on [0, 1]. Then
t CX) tj 1
1---= Z:-1 B.(x) dx
e
1
- 1
1
j! o
1
and thus
(5.4.7)
We will also need to define a periodic extension of Bj(x),
(5 .4.8)
The Euler-Macl...aurin fonnula The following theorem gives a very detailed
asymptotic error formula for the trapezoidal rule. This theorem is at the heart of
much of the asymptotic error analysis of this section. The connection with some
other integration formulas appears later in the section.
Theorem 5.5 (Euler-MacLaurin Formula) Let m:;::: 0, n:;::: 1, and define h =
(b- a)jn, xj =a+ jh for j = 0, 1, ... , n. Further assume f(x)
is 2m + 2 times continuously differentiable on [a, b] for some
m :;::: 0. Then for the error in the trapezoidal rule,
n
En(!)= jbf(x) dx- h["J(x)
a j=O
Note: The double prime notation on the summation sign means
that the first and last terms are to be halved before summing.
Proof A complete proof is given in Ralston (1965, pp. 131-133), and a more
general development is sketched in Lyness and Purl (1973, sec. 2). The
. . . ' . II
.. --- . ------ -- ------ . --- -- - ------ ---- ---- - ---- . --
286 NUMERICAL INTEGRATION
proof in Ralston is short and correct, making full use of the special
properties of the Bernoulli polynomials. We give a simpler, but less
general, version of that proof, showing it to be based on integration by
parts with a bit of clever algebraic manipulation.
The proof of (5.4.9) for general n > 1 is based on first proving the
result for n = 1. Thus we concentrate on "'
1
h h
1
(/) = j(x) dx- -[J(o) + j(h)]
0 2
11h
=- f"(x)x(x-h)dx
2 0
(5.4.10)
the latter formula coming from (5.1.21). Since we know the asymptotic
formula
we attempt to manipulate (5.4.10) to obtain this. Write
Then
h
2
h [x
2
xh h
2
]
1
{!) =- -[j'(h)- /'(0)] + 1 f"(x) ---+- dx
12 0 2 2 12
Using integration by parts,
The evaluation of the quantity in brackets at x = 0 and x = h gives
zero. Integrate by parts again; the parts outside the integral will again be
zero. The result will be
h
2
1 h
1
{!) =--[/'(h)- j'(O)] + -1 j<
4
>(x)x
2
(x- h)
2
dx (5.4.11)
12 24 0
ASYMPTOTIC ERROR FORMULAS AND THEIR APPLICATIONS 287
which is (5.4.9) with m = 1. To obtain the m = 2 case, first note that
1 1h 2 h5
- x
2
(x- h) dx = -
24 0 720
Then as before, write
Integrate by parts twice to obtain the m = 2 case of (5.4.9). This can be
continued indefinitely. The proof in Ralston uses integration by parts,
taking advantage of special relations for the Bernoulli polynomials (see
Problem 23).
For the proof for general n > 1, write
For the m = 1 case, using (5.4.11),
h
2
h
4
Jb _ ( x - a)
=-
12
[/'(b)- f'(a)] +
24
aj<
4
>(x)B
4
-h- dx (5.4.12)
The proof for m > 1 is essentially the same.
The error term in (5.4.9) can be simplified using the integral mean value
theorem. It can be shown that
O<x<1 (5.4.13)
288 NUMERICAL INTEGRATION
. and consequently the error term satisfies
Thus (5.4.9) becomes
h
2
m+
2
(b - a)B
- 2m+2
(2m+ 2)!
(5.4.14)
for some a .::5: .::5: b.
As. an important corollary of (5.4.9), we can show that the trapezoidal rule
performs especially well when applied to periodic functions.
Corollary I Suppose f(x) is infinitely differentiable for a .::5: x .::5: b, and suppose
that all of its odd ordered derivatives are periodic with b - a an
integer multiple of the period. Then the order of convergence of the
trapezoidal rule In(f) applied to
is greater than any power of h.
Proof Directly from the assumptions on f(x),
{5.4.15)
Consequently for any m 0, with h = (b - a)jn, (5.4.14) implies
h2m+2
I{!)- In{!)= -(
2
m+
2
)! (b- a)Bzm+zf<Zm+Z>(g) a .::5: b (5.4.16)
ASYMPTOTIC ERROR FORMULAS AND THEIR APPLICATIONS 289
Thus as n oo (and h 0) the rate of convergence is proportional to
h
2
m+
2
But m was arbitrary, which shows the desired result.
This result is illustrated in Table 5.7 for f(x) = exp(cos(x)). The trapezoidal
rule is often the best numerical integration rule for smooth periodic integrands of
the type specified in the preceding corollary. For a comparison of the
Gauss-Legendre formula and the trapezoidal rule for a one-parameter family of
periodic functions, of varying behavior, see Donaldson and Elliott (1972, p. 592).
They suggest that the trapezoidal rule is superior, even for very peaked in-
tegrands. This conclusion improves on an earlier analysis that seemed to indicate
that Gaussian quadrature was superior for peaked integrands.
The Euler-MacLaurin summation fonnula Although it doesn't involve numeri-
cal integration, an important application of (5.4.9) or (5.4.14) is to the summation
of series.
Corollary 2 (Euler-MacLaurin summation formula) Assume f(x) is 2m + 2
times continuously differentiable for 0 5 x < oo, for some m 0.
Then for all n 1,
n n 1
L f(J) = 1 f(x) dx + 2[/(0) + f(n)]
j=O . 0
m B.
+ L -;[!<2i-l>(n) _ t<2i-l>(o)]
i=l (2z ).
1
1
n_
- B (x)J<
2
m+
2
>(x) dx
(2m + 2)! 0 2m+2
(5.4.17)
Proof Merely substitute a = 0, b = n into (5.4.9), and then rearrange the terms
appropriately.
Example In a later example we need the sum
00 1
s =I: 3/2
1 n
(5.4.18)
If we use the obvious choice f(x) = (x + 1)-
3
1
2
in (5.4.17), the results are
disappointing. By letting n -+ oo, we obtain
1
oo dx 1 m B2
s- + I (2i-l) 0
- o (x + 1)
3
/
2
2- (2i)!/ ( )
1 oo_
- 1 B (x)J<
2
m+
2
>(x) dx
(2m + 2)! 0 2m+2
u )
290 NUMERICAL INTEGRATION
and the error term does not become small for any choice of m. But if we divide
the series S into two parts, we are able to treat it very accurately.
Let j(x) = (x + 10)-
3
1
2
Then with m = 1,
oo 1
00
.
1
oo dx 1 (1/6) (- l)
" "/( ) + z + E
'f;; n
312
=
1
= o (x + 10)
312
2(10)
312
- -2-. (10)
512
Since B
4
(x) :2: 0, j<
4
>(x) > 0, we have E < 0. Also
1 00( 1 ) 35
0 < -E < -1 - j<
4
>(x) dx = ='= 1.08 X 10-
6
24 0 16 (1024)(10)
912
Thus
9
00 1
L 3/2 = .648662205 + E
10 n
By directly summing L(ljn
3
1
2
) = 1.963713717, we obtain
00 1
L 3/2 = 2.6123759 + E 0 < -E < 1.08 X 10-
6
1 n
(5.4.19)
See Ralston (1965, pp. 134-138) for more information on summation techniques.
To appreciate the importance of the preceding summation method, it would have
been necessary to have added 3.43 X 10
12
terms in S to have obtained compara-
ble accuracy.
A generalized Euler-MacLaurin formula For integrals in which the integrand is
not differentiable at some points, it is still often possible to obtain an asymptotic
error expansion. For the trapezoidal rule and other numerical integration rules
applied to integrands with algebraic andjor logarithmic singularities, see the
article Lyness and Ninham (1967). In the following paragraphs, we specialize
their results to the integral
a>O (5.4.20)
with j E cm+
1
[0, 1], using the trapezoidal numerical integration rule.
Assuming a is not an integer,
(5.4.21)
ASYMPTOTIC ERROR FORMULAS AND THEIR APPLICATIONS 291
The term 0(1/nm+l) denotes a quantity whose size is proportional to 1/nm+t, or
possibly smaller. The constants are given by
2 r (a + j + 1) sin [ ( '1T12) (a + j) H (a + j + 1) .
= JU>(o)
cj {2'1Tr+j+lj! .
d.= 0
J
for j even
2t( "+ 1)
d.= ( -1)U-ll/
2
1
. U>(1) 1 odd
J (2'1T);+l g
with g(x) = xaf(x), f(x) the gamma function, and t{p) the zeta function,
00 1
t( P) = L: ---:;
j=l}
p>1 (5.4.22)
For 0 < a < 1 with m = 1, we obtain the asymptotic error estimate
2f( a+ 1) sin [( '1T/2)a]t( a + i)/(0) + o( n12)
En(!)= (2'1T t+lna+l
(5.4.23)
For example with I= f5J(x) dx, and using (5.4.19) for evaluating t{f),
c = t(t) .208
4'1T
(5.4.24)
This is confirmed numerically in the example given in Table 5.6 in Section 5.1.
For logarithmic endpoint singularities, the results of Lyness and Ninham
(1967) imply an asymptotic error f<;>rmula
(5.4.25)
for some p > 0 and some constant c. For numerical purposes, this is essentially
0(1jnP). To justify this, calculate the following limit using L'Hospital's rule,
with p > q:
.. log(n)jnP .. log(n)
Lmut = Lumt = 0
n--+oo 1/nq n--+oo np-q
This means that log(n)jnP decreases more rapidly than 1/nq for any q < p.
And it clearly decreases less rapidly than 1jnP, although not by much.
- .... :-.:.. -- - ~ : .. ____ ; ....... : - -
----
292 NUMERICAL INTEGRATION
For practical computation, (5.4.25) is essentially 0(1jnP). For example,
calculate the limit of successive errors:
I-1
Limit n
n-+oo I- /
2
n
. . c log(n)jnP log(n}
Lnrut = Limit2P ---
n-+oo c log(2n)j2PnP n-+oo log(2n}
1
= Limit2P. = 2P
n-+oo 1+(log2jlogn)
This is the same limiting ratio as would occur if the error were just O(lfn.P).
Aitken extrapolation Motivated by the preceding, we assume that the integra-
tion formula has an asymptotic error formula
c
I-I=-
n nP
p>O {5.4.26)
This is not always valid. For example, Gaussian quadrature does not usually
satisfy (5.4.26), and the trapezoidal rule applied to periodic integrands does not
satisfy it. Nonetheless, many numerical integration rules do satisfy it, for a wide
variety of integrands. Using this assumed form for the error, we attempt to
estimate the error. An analogue of this work is that on Aitken extrapolation in
Section 2.6.
First we estimate p. Using (5.4.26),
(I - In) - (I - I 2n)
(I- /2n) - (I- /4,)
This gives a simple way of computing p.
(5.4.27)
Example Consider the use of Simpson's rule with /Jx/X dx = 0.4. In Table
5.15, column Rn should approach 2
2
5
= 5.66, a theoretical result from Lyness
Table 5.15 Simpson integration errors for 1
1
xiX dx
0
n
In
I- I, I,- In/2 Rn
2 .402368927062 -2.369- 3
4 .400431916045 -4.319- 4 -1.937- 3
8 .400077249447 -7.725- 5 -3.547- 4 5.46
16 .400013713469 -1.371- 5 -6.354- 5 5.58
32 .400002427846 -2.428- 6 -1.129- 5 5.63
64 .400000429413 -4.294- 7 -1.998- 6 5.65
128 .400000075924 -7.592-8 -3.535- 7 5.65
256 .400000013423 -1.342- 8 -6.250- 8 5.66
512 .4000000023 73 -2.373- 9 -1.105- 8 5.66
..1
ASYMPTOTIC ERROR FORMULAS AND THEIR APPLICATIONS 293
and Ninham (1967) for the order of convergence. Clearly the numerical results
confirm the theory.
To estimate the integral I with increased accuracy, suppose that In, I
2
n, and
I
4
n have been computed. Using (5.4.26),
and thus
Solving for I, and manipulating to obtain a desirable form,
(5.4.28)
Example Using the previous example for f(x) = xVx and Table 5.15, we
obtain the difference table in Table 5.16. Then
I = i
64
= .399999999387
I- f
64
= 6.13 X 10-
10
I- I
64
= -4.29 X 10-
7
Thus [
64
is a considerable improvement on I
64
Also note that I:
4
- I
64
is an
excellent approximation to I- I
64
Summing up, given a numerical integration rule satisfying (5.4.26) and given
three values In, I
2
n, I
4
n, calculate the Aitken extrapolate ~ of (5.4.28). It is
usually a significant improvement on I
4
n as an approximation to I; and based on
this,
(5.4.29)
With Simpson's rule, or any other composite closed Newton-Cotes formula,
the expense of evaluating In, I
2
n, I
4
n is no more than that of I
4
n alone, namely
4n + 1 function evaluations. And when the algorithm is designed correctly, there
is no need for temporary storage of large numbers of function values f(x). For
Table 5.16 Difference table for Simpson integration
m /m fl[m = [2m- Jm
fl2Jm
16 .400013713469
-1.1285623E - 5
32 .400002427846
9.28719E- 6
64 .400000429413
-1.998433E - 6
294 NUMERICAL INTEGRATION
that reason, one should never use Simpson's rule with just one value of the index
n. With no extra expenditure of time, and with only a slightly more complicated
algorithm, an Aitken extrapolate and an error estimate can be produced.
Richardson extrapolation If we assume sufficient smoothness for the integrand
f(x) in our integral I(f), then we can write the trapezoidal error term (5.4.9) as
. d(O) d(O) d(O)
I-I
n n2 n4 n2m n.m
(5.4.30)
where In denotes the trapezoidal rule, and
( )
2m+2
b-a h- (x-a)
Fn,m = (
2
m+
2
)!n2m+2l B2m+2 -h- J<Zm+Z>(x) dx
(5.4.31)
Although the series dealt with are always finite and have an error term, we will
usually not directly concern ourselves with it.
.For n even,
4d(O) 16d(O) 64d(O)
I-I =-2-+ __ 4_+ __ 6_+
n/2 n1 n4 n6
Multiply (5.4.30) by 4 and subtract from it (5.4.32):
-12d(O) 6Qd(O)
4( I - I ) - (I - I ) =
4
- -
6
- -
n n/2 n4 n6
Define
1
I(l) = _ [4I(0) _ I(O) ]
n 3 n n/1
2Qd(O)
6
n even n:::::2
and
= Im. We call {
}.
The sequence
/
(1) . J(l) J(l)
2 ' 4 ' 6 '
is a new numerical integration rule. For the error,
d(l) d(l)
I - J<l> = _4_ + _6_ +
n n4 n6
d
(l) = -4d(O) d(l) = -2Qd(O)
4 4 ' 6 6 '
(5.4.32)
(5.4.33)
(5.4.34)
(5.4.35)
.i
ASYMPTOTIC ERROR FORMULAS AND THEIR APPLICATIONS 295
To see the explicit formula for
}. If we
derive the actual integration weights of Jp>, in analogy with (5.4.36), we will find
that Jp> is simply the composite Boole's rule.
296 NUMERICAL INTEGRATION
Using the preceding formulas, we can obtain useful estimates of the error.
Using (5.4.39),
16I(l) - I(l) d(2)
I - IP) = n nj2 - I(l) + _6_ + .
15 " n
6
I(l) - I(l) d(2)
n n/2 + _6_ + ...
15 n
6
Using h = (b- a)jn,
1
I_ I<I> = -(I<I> _ I<I>] + O(h6)
n 15 n n/2
(5.4.41)
and thus
1
I_ I<Il _ (I<I> _ I<I> ]
n 15 n n/2
(5.4.42)
since both terms are O(h
4
) and the remainder term is O(h
6
). This is called
Richardson's error estimate for Simpson's rule.
This extrapolation process can be continued inductively. Define
4ki(k-1)- J(k-1)
I(k) = n nj2
n 4k- 1
(5.4.43)
with n a multiple of 2k, k 1. It can be shown that the error has the form
d(k)
(
k) 2k+2
I-I =--+
n n2k+2
(5.4.44)
with Ak a constant independent off and h, and
Finally, it can be shown that for any f E C[a, b],
=I(!) (5.4.45)
n-+oo
The rules for k > 2 bear no direct relation to the composite
Newton-Cotes rules. See Bauer et al. (1963) for complete details.
Romberg integration Define
k = 0,1,2, ... (5.4.46)
ASYMPTOTIC ERROR FORMULAS AND THEIR APPLICATIONS 297
J(O)
I
f(O)
2
J(1)
2
J(O)
4
Jjl>
/(2)
4
J(O)
8
J(1)
8
/(2)
8
J(3)
8
f(O)
16
J(1)
16
/(2)
16
J(3)
16
J(4)
16
Figure 5.4 Romberg integration table.
This is the Romberg integration rule. Consider the diagram in Figure 5.4 for the
Richardson extrapolates of the trapezoidal rule, with the number of subdivisions
a power of 2. The first column denotes the trapezoidal rule, the second Simpson's
rule, etc. By (5.4.45), each column converges to /(f). Romberg integration is the
rule of taking the diagonal. Since each column converges more rapidly than the
preceding column, assuming f(x) is infinitely differentiable, it could be expected
that Jk(f) would converge more rapidly than { ~ k l } for any k. This is usually
the case, and consequently the method has been very popular since the late 1950s.
Compared with Gaussian quadrature, Romberg integration has the advantage of
using evenly spaced abscissas. For a more complete analysis of Romberg integra-
tion, see Bauer et al. (1963).
Example Using Romberg integration, evaluate
This was used previously as an example, in Tables 5.1, 5.3, and 5.11, for the
trapezoidal, Simpson, and Gauss-Legendre rules, respectively. The Romberg
results are given in Table 5.17. They show that Romberg integration is superior to
Simpson's rule, but Gaussian quadrature is still more rapidly convergent.
To compute Jk(f) for a particular k, having already computed J
1
(f),
... , Jk_
1
(f), the row
Table 5.17 Example of Romberg integration
k Nodes Jk (/)
0
1
2
3
4
5
6
2
3
5
9
17
33
6S
-34.77851866026
-11.59283955342
-12.01108431754
-12.07042041287
-12.07034720873
-12.07034631632
-12.07034631639
{5.4.47)
Error
2.27E + 1
-4.78E- 1
-5.93E- 2
7.41E- 5
8.92E- 7
-6.82E- 11
< 5.00E- 12
298 NUMERICAL INTEGRATION
should have been saved in temporary storage. Then from and
2k-
1
new function values. Using (5.4.33), (5.4.40), and (5.4.43) compute the next
row in the table, including Jk(f). Compare Jk(f) and Jk_
1
(f) to see if there is
sufficient accuracy to accept Jk(f) as an accurate approximation to /(/).
We give this procedure in a formal way in the following algorithm. It is
included for pedagogical reasons, and it should not be considered as a serious
program unless some improvements are included. For example, the error test is
primitive and much too conservative, and a safety check needs to be included for
the numerical integrals associated with small k, when not enough function values
have yet been sampled.
Algorithm Romberg (f, a, b, , int)
1. Remark: Use Romberg integration to calculate int, an estimate
of the integral
Stop when I/- inti .
2. Initialize:
k := 0, n := 1,
T
0
:= R
0
:= a
0
:= (b- a)[f(a) + /(b)]/2
3. Begin the main loop:
n := 2n k := k + 1 h = (b- a)jn
n/2
4. sum:= L f(a + (2}- 1)h)
j=1
1
5. Tk := h sum+ 2Tk_
1
6. f3j := aj j = 0, 1, ... , k - l
8. Do through step 10 for j = 1, 2, ... , k
9. m := 4m
10.
m. aj-1- f3j-1
aj := m- 1
AUTOMATIC NUMERICAL INTEGRATION 299
12. If iRk- Rk-d > t:, then go to step 3
13. Since IRk- Rk_Ji .:::;; t:, accept int = Rk-l and n;turn.
There are many variants of Romberg integration. For example, other ways of
increasing the number of nodes have been studied. For a very complete survey of
the literature on Romberg integration, see Davis and Rabinowitz (1984, pp.
434-446). They also give a Fortran program for Romberg integration.
5.5 Automatic Numerical Integration
An automatic numerical integration program calculates an approximate integral
to within an accuracy specified by the user of the program. The user does not
need to specify either the method or the number of nodes to be used. There are
some excellent automatic integration programs, and many people use them. Such
a program saves you the time of writing your own program, and for many people,
it avoids having to understand the needed numerical integration theory. Nonethe-
less, it is almost always possible to improve upon an automatic program,
although it usually ~ q u i r s a good knowledge of the numerical integration
needed for your particular problem. When doing only a small number of
numerical integrations, automatic integration is often a good way to save time.
But for problems involving many integrations, it is probably better to invest the
time to find a less expensive numerical integration procedure.
An automatic numerical integration program functions as a "black box,"
without the user being able to see the intermediate steps of the computation.
Because of this, the most important characteristic of such a program is that it be
reliable: The approximate integral that is returned by the program and that the
program says satisfies the user's error tolerance must, in fact, be that accurate. In
theory, no such algorithm exists, as we explain in the next paragraph. But for the
type of integrands that one usually considers in practice, there are programs that
have a high order of reliability. This reliability will be improved if the user reads
the program description, to see the restrictions and assumptions of the program.
To understand the theoretical impossibility of a: perfectly reliable automatic
integration program, note that the program will evaluate the integrand j(x) at
only a finite number of points, say x
1
, . , xn- Then there are an infinity of
continuous functions /(x) for which
/-(x;) = f(x;) i = 1, ... , n
and
In fact, there are an infinity of such functions /(x) that are infinitely differentia-
300 NUMERICAL INTEGRATION
ble. For practical problems, it is unlikely that a well-constructed automatic
integration program will be unreliable, but it is possible. An automatic integra-
tion program can be made more reliable by increasing the stringency of its error
tests, but this also makes the program less efficient. Generally there is a tradeoff
between reliability and efficiency. For a further discussion of the questions of
reliability and efficiency of automatic quadrature programs, see Lyness and
Kaganove (1976).
Adaptive quadrature Automatic programs can be divided into (1) those using a
global rule, such as Gaussian quadrature or the trapezoidal rule with even
spacing, and (2) those using an adaptive strategy, in which the integration rule
varies its placement of node points and even its definition to reflect the varying
local behavior of the integrand. Global strategies use the type of error estimation
that we have discussed in previous sections. We now discuss the concept and
practice of an adaptive strategy.
Many integrands vary in their smoothness or differentiability at different
points of the interval of integration [a, b]. For example, with
I= frx dx
0
the integrand has infinite slope at x = 0, but the function is well behaved at
points x near 1. Most numerical methods use a uniform grid of node points, that
is, the density of node points is about equal throughout the integration interval.
This includes composite Newton-Cotes formulas,. Gaussian quadrature, and
Romberg integration. When the integrand is badly behaved at some point a in
the interval [a, b ], many node points must be placed near a to compensate for
this. But this forces many more node points than necessary to be used at all other
parts of [a, b]. Adaptive integration attempts to place node points according to
the behavior of the integrand, with the density of node points being greater near
points of bad behavior.
We now explain the basic concept of adaptive integration using a simplified
adaptive Simpson's rule. To see more precisely why variable spacing is necessary,
consider Simpson's rule with such a spacing of the nodes:
nj2 nj2 ( )
x2j x2j- x2j-2
I(I) = J f(x) dx =In(!)=
6
(/2j-2 + 4/2j-l + /2)
;=1 Xzj-2 ;=1
with x
2
j_
1
= (x
2
j_
2
+ x
2
)/2. Using (5.1.15),
1 n/2
I(/) -In(!) = -
2880
(xzj- Xzj-2)
5
J<
4
)(
(5.5.1)
(5.5.2)
with x
2
j_
2
x
2
j. Clearly, you want to choose x
2
j- x
2
j_
2
according to the
size of
_ J<I> I :$ t:
a,p a,P
(5.5.5)
then accept as the adaptive integral approXimation to Ia,p Otherwise let
t: == t:/2, and set the adaptive integral for Ia,p equal to the sum of the adaptive
integrals for /a,y and I.r.P y =(a+ {1)j2, each to be computed with an error
tolerance of t:.
In an actual implementation as a computer program, many extra limitations
are included as safeguards; and the error estimation is usually much more
sophisticated. All function evaluations are handled carefully in order to ensure
that the integrand is never evaluated twice at the same point. This requires a
clever stacking procedure for those values of f(x) that must be temporarily
stored because they will be needed again later in the computation. There are
many small modifications that can be made to improve the performance of the
program, but generally a great deal of experience and empirical investigation is
first necessary. For that and other reasons, it is recommended that standard
well-tested adaptive procedures be used [e.g., de Boor (1971), Piessens et al.
(1983)]. This is discussed further at the end of the section.
Table 5.18 Adaptive Simpson's example (5.5.6)
[a,,B]
[<2)
I- J<2l I- JCI> 11(2)- [(ill
t:
[0.0, .0625] .010258 1.6E- 4 4.5E- 4 2.9E- 4 .0003125
[.0625, .0125] .019046 1.2E- 7 l.lE-6 l.OE- 6 .0003125
[.125, .25] .053871 4.5E- 7 3.6E- 6 4.0E- 6 .000625
[.25,.5] .152368 9.3E- 7 l.lE-5 l.OE- 5 .00125
[.5, 1.0] .430962 2.4E- 6 3.0E- 5 2.8E- 5 .0025
i
I
I
I
i
-- ------- --------- ---- ------- J
i
I
!
;
302 NUMERICAL INTEGRATION
Example Consider using the preceding simpleminded adaptive Simpson proce-
dure to evaluate
I= [IX dx
0
(5.5.6)
with t: = .005 on [0, 1). The final intervals [a, fJJ and integrals I ~ ~ ~ are given in
Table 5.18. The column labeled t: gives the error tolerance used in the test (5.5.5),
which estimates the error in I ~ ~ 1 - The error estimated for I ~ ~ ~ on [0, .0625] was
inaccurate, but it was accurate for the remaining subintervals. The value used to
estimate Ia.fJ is actually I ~ ~ 1 and it is sufficiently accurate on all subintervals.
The total integral, obtained by summing all I ~ ~ 1 . is
i = .666505
and the calculated bound is
I-f= 1.6-4
I I - il ~ 3.3 - 4
obtained by summing the column labeled jJC
2
> - JC
1
>j. Note that the error is
concentrated on the first subinterval, as could have been predicted from the
behavior of the integrand near x = 0. For an example where the test (5.5.5) is not
adequate, see Problem 32.
Some automatic integration programs One of the better known automatic
integration programs is the adaptive program CADRE (Cautious Adaptive
Romberg Extrapolation), given in de Boor (1971). It includes a means of
recognizing algebraic singularities at the endpoints of the integration interval.
The asymptotic error formulas of Lyness and Ninham (1967), given in (5.4.21) in
a special case, are used to produce a more rapidly convergent integration method,
again based on repeated Richardson extrapolation. The routine CADRE has
been found empirically to be both quite reliable and efficient.
Table 5.19 Integration examples for CADRE
Desired Error
10-2 10-5 10-s
Integral Error N Error N Error N
Jl
A= 2.49E- 6 76 A = 1.40E- 7 73 A= 4.60E- 11 225
P = 5.30E- 4 P = 4.45E- 6 P = 2.48E- 9
/2
A= 1.18E- 5 9 A= 3.96E- 7 17 P = 2.73E -10 129
P = 3.27E- 3 P = 3.56E- 6 P = 2.81E- 9
/3
A= 1.03E- 4 17 A= 3.23E- 8 33 A= 1.98E- 9 65
P = 2.98E- 3 P = 4.43E- 8 P = 2.86E- 9
/4
A= 6.57E- 5 105 A= 6.45E- 8 209 A= 4.80E- 9 281
P = 4.98E- 3 P = 9.22E- 6 P = 1.55E- 8
Is
A= 2.77E- 5 226 A= 7.41E- 8 418 A= 5.89E- 9 562
P = 3.02E- 3 P = l.OOE- 5 P = 1.11E- 8
/6
A= 8.49E- 6 955 A= 2.37E- 8 1171 A= 4.30E- 11 2577
P = 8.48E- 3 P = 1.67E- 5 P = 2.07E- 8
/7
A= 4.54E- 4 98 A= 7.72E- 7 418 A=** 1506
P = 1.30E- 3 P = 8.02E- 6 p = **
AUTOMATIC NUMERICAL INTEGRATION 303
A more recently developed package is QUAD PACK, some of whose programs
are general purpose, while others deal with special classes of integrals. The
package was a collaborative effort, and a complete description of it is given in
Piessens et al. (1983). The package is well tested and appears to be an excellent
collection of programs.
We illustrate the preceding by calculating numerical approximations to the
following integrals:
1
1 4 dx 1 .
I1 =
2
= -[tan-
1
(10) + tan-
1
(6)]
o 1 + 256(x - .375) 4
1
1 2
I
2
= xVx dx =-
0 7
1
1 2
I
3
= IX dx =-
0 3
I
4
= f log ( x) dx = 1
Is= flog lx- .71 dx = .3log (.3) + .?log (.7) - 1
!
10000 dx
I1 = -- = 206
-9 v1xT
From QUADPACK, we chose DQAGP. It too contains ways to recognize'
algebraic singularities at the endpoints and to compensate for their presence. To
improve performance, it allows the user to specify points interior to the integra-
tion interval at which the integrand is singular.
We used both CADRE and DQAGP to calculate the preceding integrals, with
error tolerances of 10-
2
, 10-
5
, and 10-
8
The results are shown in Tables 5.19
and 5.20. To more fairly compare DQAGP and CADRE, we applied CADRE to
two integrals in both Is and I
7
, to have the singularities occur at endpoints. For
example, we used CADRE for each of the integrals in
J
o dx
1
10000 dx
I- --+ -
7
- -91-x o IX
(5.5.7)
In the tables, P denotes the error bound predicted by the program and A denotes
the actual absolute error in the calculated answer. Column N gives the number of
integrand evaluations. At all points at which the integrand was undefined, it was
arbitrarily set to zero. The examples were computed in double precision on a
Prime 850, with a unit round of r
46
= 1.4 x 10-
14
In Table 5.19, CADRE failed for I
7
with the tolerance 10-
8
, even though
(5.5.7) was used. Otherwise, it performed quite well. When the decomposition
I
i
I
I
I
J
304 NUMERICAL INTEGRATION
Table 5.20 Integration examples for DQAGP
Desired Error
10
2
10
5
w-s
Integral Error N Error N Error N
II
A= 2.88E- 9 105 A= 5.40E- 13 147 A= 5.40E- 13 147
P = 2.96E- 3 P = 5.21E- 10 P = 5.21E -10
I2
A = 1.17E- 11 21 A= 1.17E- 11 21 A= 1.17E- 11 21
P = 7.46E- 9 P = 7.46E- 9 P = 7.46E- 9
IJ
A= 4.79E- 6 21 A= 4.62E -13 189 A= 4.62E- 13 189
P = 4.95E- 3 P = 4.77E- 14 P = 4.77E- 14
I4
A= 5.97E- 13 231 A= 5.97E -13 231 A= 5.97E- 13 231
P = 7.15E -14 P = 7.15E- 14 P = 7.15E- 14
Is
A= 8.67E- 13 462 A= 8.67E- 13 462 A= 8.67E- 13 462
P =USE- 13 P = 1.15E- 13 P = 1.15E- 13
I6
A= l.OOE- 3 525 A= 6.33E- 14 861 A = 5.33E- 14 1239
P = 4.36E- 3 P = 8.13E- 6 P = 7.12E- 9
I1
A= 1.67E- 10 462 A= 1.67E- 10 462 A = 1.67E- 10 462
P = 1.16E -10 P = 1.16E- 10 P = 1.16E- 10
(5.5.7) is not used and CADRE is called only once for the single interval
[- 9, 10000], it fails for all three error tolerances.
In Table 5.20, the predicted error is in some cases smaller than the actual
error. This difficulty appears to be due to working at the limits of the machine
arithmetic precision, and in all cases the final error was well within the limits
requested.
In comparing the two programs, DQAGP and CADRE are both quite reliable
and efficient. Also, both programs perform relatively poorly for the highly
oscillatory integral /
6
, showing that /
6
should be evaluated using a program
designed for oscillatory integrals (such as DQAWO in QUADPACK, for Fourier
coefficient calculations). From the tables, DQAGP is somewhat more able to deal
with difficult integrals, while remaining about equally efficient compared to
CADRE. Much more detailed examples for CADRE are given in Robinson
(1979).
Automatic quadrature programs can be easily misused in large calculations,
resulting in erroneous results and great inefficiency. For comments on the use of
such programs in large calculations and suggestions for choosing when to use
them, see Lyness (1983). The following are from his concluding remarks.
The Automatic Quadrature Rule (AQR) is an impressive and practical item
of numerical software. Its main advantage for the user is that it is conveni-
ent. He can take it from the library shelf, plug it in, and feel confident that
it will work. For this convenience, there is in general .a modest charge in
CPU time, this surcharge being .a factor of about 3. The Rule Evalua;ion
Quadrature Routine (REQR) [non-automatic quadrature rule] does not
carry this surcharge, but to code and check out an REQR might take a
SINGULAR INTEGRALS 305
couple of hours of the user's time. So unless the expected CPU time is high,
many user's willingly pay the surcharge in order to save themselves time
and trouble.
However there are certain-usually large scale-problems for which the
AQR is not designed and in which its uncritical use can lead to CPU time
surcharges by factors of 100 or more. . . . These are characterized by the
circumstances that a large number of separate quadratures are involved,
and that the results of these quadratures are subsequently used as input to
some other numerical process. In order to recognize this situation, it is
necessary to examine the subsequent numerical process to see whether it
requires a smooth input function. . . . For some of these problems, an
REQR is quite suitable while an AQR may lead to a numerical disaster.
5.6 Singular Integrals
We discuss the approximate evaluation of integrals for which methods of the type
discussed in Sections 5.1 through 5.4 do not perform well: these methods include
the composite Newton-Cotes rules (e.g., the trapezoidal rule), Gauss-Legendre
quadrature, and Romberg integration. The integrals discussed here lead to poorly
convergent numerical integrals when evaluated using the latter integration rules,
for a variety of reasons. We discuss (1) integrals whose integrands contain a
singularity in the interval of integration (a, b), and (2) integrals with an infinite
interval of integration. Adaptive integration methods can be used for these
integrals, but it is usually possible to obtain more rapidly convergent approxima-
tions by carefully examining the nature of the singular behavior and then
compensating for it.
Change of the variable of integration We illustrate the importance of this idea
with several examples. For
1
= ibf(x) dx
o IX
(5.6.1)
with f(x) a function with several continuous derivatives, let x = u
2
, 0 ::s; u ::s; lb.
Then
.
This integral has a smooth integrand and standard techniques can be applied
to it.
Similarly,
using the change of variable u = ..;r-::x. The right-hand integrand has an
infinite number of continuous derivatives on [0, 1], whereas the derivative of the
first integrand was singular at x = 1.
i
-i
I
-------- ----- ----------------------- _,._ -- - - - - - - ~ - - - - - - - - - - - - ~
'
'
!
I
____ _j
306 NUMERICAL INTEGRATION
For an infinite interval of integration, the change of variable technique is also
useful. Suppose
p>1 (5.6.2)
with Limitx ... oo f(x) existing. Also assume f(x) is smooth on [1, oo). Then use
the change of variable
Then
-a
dx = --du
ul+a
for some a> 0
(5.6.3)
Maximize the smoothness of the new integrand at u = 0 by picking a to produce
a large value for the exponent (p- 1)a- 1. For example, with
1
= joof(x) dx
1 x/X
the change of variable x = 1ju
4
leads to
If we assume a behavior at x = oo of
then
and (5.6.4) has a smooth integrand at u = 0.
An interesting idea has been given in Iri et al. (1970) to deal with endpoint
singularities in the integral
Define
1/J(t),;, e x p ~ )
1- t
b- a
1
<p(t) =a+ --j 1/J(u) du
y -1
(5.6.5)
(5.6.6)
-1:;;; t:;;; 1 (5.6.7)
SINGULAR INTEGRALS 307
where c is a positive constant and
y = r 1/J(u) du
-1
As t varies from -1 to 1, <p{t) varies from a to b. Using x = <p{t) as a change
of variable in (5.6.5), we obtain
I= t /(!J!(t))!J!'(t) dt
-1
(5.6.8)
The function qJ'(t) = ((b- a)jy)I/J(t) is infinitely differentiable on [ -1, 1], and
"it and all of derivatives are zero at t = 1. In (5.6.8), the integrand and all of
derivatives will vanish at t = 1 for virtually all functions f(x) of interest.
Using the error formula (5.4.9) for the trapezoidal rule on [ -1, 1], it can be seen
that the trapezoidal rule will converge very rapidly when applied to 5 . 6 . ~ ) . We
will call this method the IMT method.
This method has been implemented in de Doncker and Piessens (1976), and in
the general comparisons of Robinson (1979), it is rated as an extremely reliable
and quite efficient way of handling integrals (5.6.4) that have endpoint singulari-
ties. De Doncker and Piessens (1976) also treat integrals over [0, oo) by first using
the change of variable x = (1 + u)/(1 - u), -1 ~ u < 1, followed by the change
of variable u = <p(t).
Example Use the preceding method (5.6.5)-(5.6.8) with the trapezoidal rule, to
evaluate
1
1 .;;
I= ...j-Inx dx =- ~ .8862269
0 2
(5.6.9)
Note that the integrand has singular behavior at both endpoints, although it is
different in the two cases. The constant in (5.6.6) is c = 4, and the evaluation of
(5.6.7) is taken from Robinson and de Doncker (1981). The results are shown in
Table 5.21. The column labeled nodes gives the number of nodes interior to [0, 1].
Table 5.21 Example of the IMT
method
Nodes Error
2 -6.54E- 2
4 5.82E- 3
8 -1.30E- 4
16 7.42E- 6
32 1.17E- 8
64 1.18E- 12
i
1
I
_______ _\
308 NUMERICAL INTEGRATION
Gaussian quadrature In Section 5.3, we developed a general theory for Gauss-
ian quadrature formulas
n ~
that have degree of precision 2n - 1. The construction of the nodes and weights,
and the form of the error, are given in Theorem 5.3. For our work in this section
we note that (1) the interval (a, b) is allowed to be infinite, and (2) w( x) can
have singularities on (a, b), provided it is nonnegative and satisfies the assump-
tions (4.3.8) and (4.3.9) of Section 4.3. For rapid convergence, we would also
expect that f(x) would need to be a smooth function, as was illustrated with
Gauss-Legendre quadrature in Section 5.3.
The weights and nodes for a wide variety of weight functions w(x) and
intervals (a, b) are known. The tables of Stroud and Secrest (1966) include the
integrals
fIn ( )t(x) dx (5.6.10)
and others. The constant a > -1. There are additional books containing tables
for integrals other than those in (5.6.10). In addition, the paper by Golub and
Welsch (1969) describes a procedure for constructing the nodes and weights in
(5.6.10), based on solving a matrix eigenvalue problem. A program is given, and
it includes most of the more popular weighted integrals to which Gaussian
quadrature is applied. For an additional discussion of Gaussian quadrature, with
references to the literature (including tables and programs), see Davis and
Rabinowitz (1984, pp. 95-132, 222-229).
Example We illustrate the use of Gaussian quadrature for evaluating integrals
I= loco g(x) dx
We use Gauss-Laguerre quadrature, in which w(x) =e-x. Then write I as
(5.6.11)
We give results for three integrals:
oo xdx .,
2
1
<1> = r ___ = _
lo ex- 1 6
xdx 1
J(2)- (
00
=-
- lo (1 + x
2
)
5
8
oo dx .,
1
<3> = r
lo 1 + x
2
2
SINGULAR INTEGRALS 309
Table 5.22 Examples of Gauss-Laguerre quadrature
Nodes
2
4
8
16
32
64
J(l)- J ~ l
l.OlE- 4
1.28E- 5
-9.48E- 8
3.16E -11
7.11E- 14
J(2)- J!2)
-8.05E- 2
-4.20E- 2
1.27E- 2
-1.39E- 3
3.05E- 5
1.06E- 7
J<3l - 1 ~ 3
7.75E- 2
6.96E- 2
3.70E- 2
1.71E- 2
8.31E- 3
4.07E- 3
Gauss-Laguerre quadrature is best for integrands that decrease exponentially as
x--+ oo. For integrands that are 0(1fxP), p > 1, as x--+ oo, the convergence
rate becomes quite poor as p --+ 1. These comments are illustrated in Table 5.22.
For a formal discussion of the convergence of Gauss-Laguerre quadrature, see
Davis and Rabinowitz (1984, p. 227).
One especially easy case of Gaussian quadrature is for the singular integral
f
l f(x) dx
I(!)= -1/1- xz
(5.6.12)
With this weight function, the orthogonal polynomials are the Chebyshev poly-
nomials {Tn(x), n ~ 0}. Thus the integration nodes in (5.6.10) are given by
(
2j- 1 )
xj,n =cos
2
n 'IT
j = 1, ... , n
and from (5.3.11), the weights are
'IT
w. =-
J,n n
j=1, ... ,n
(5.6.13)
Using the formula (5.3.10) for the error, the Gaussian quadrature formula for
(5.6.12) is given by
(5.6.14)
for some - 1 < 11 < 1.
This formula is related to the composite midpoint rule (5.2.18). Make the
change of variable x = cos () in (5.6.14) to obtain
w 'IT n
1 !(cos 0) dO= - E !(cos oj,n) + E
o n j=l
(5.6.15)
where Oj, n = (2 j - 1)'1T j2n. Thus Gaussian quadrature for (5.6.12) is equivalent
to the composite midpoint rule applied to the integral on the left in (5.6.15). Like
310 NUMERICAL INTEGRATION
the trapezoidal rule, the midpoint rule has an error expansion very similar to that
given in (5.4.9) using the Euler-MacLaurin formula. The Corollary 1 to Theorem
5.5 also is valid, showing the composite midpoint rule to be highly accurate for
periodic functions. This is reflected in the high accuracy of (5.6.14). Thus
Gaussian quadrature for (5.6.12) results in a formula that would have been
reasonable from the asymptotic error expansion for the composite midpoint rule
applied to the integral on the left of (5.6.15).
Analytic treabnent of singularity Divide the interval of integration into two
parts, one containing the singular point, which is to be treated analytically. For
example, consider
I =.foh!(x)log(x) dx = [{ + J.h]/(x}log(x} dx = /
1
+ /
2
(5.6.16)
Assuming f(x) is smooth on [t:, b], apply a standard technique to the evaluation
of /
2
For /(x) about zero, assume it has a convergent Taylor series on [0, t:].
Then
( (( 00 )
1
1
= lt(x)log(x) dx = 1 ,Eajxj log(x) dx
0 0 0
00 (j+ 1 [ 1 ]
=:[a.-- log(t:)- --
o lj+1 j+1
For example, with
define
1
4,.
I = cos ( x) log ( x) dx
0
1
1
= L
1
cos(x)log(x) dx
0
( = .1
(5.6.17)
(3 [ 1 ] 5 [ 1 ]
= t:[ log ( t:) - 1] - - log ( t:) - - + - log ( t:) - - -
6 3 600 5
This is an alternating series, and thus it is clear that using the first three terms
will give a very accurate value for /
1
A standard method can be applied to /
2
on
[.1, 4'1T].
Similar techniques can be used with infinite intervals of integration [a, oo ),
discarding the integral over [b, oo) for some large value of b. This is not
developed here.
Product integration Let /(/) = J:w(x)f(x) dx with a near-singular or singu-
lar weight function w(x) and a smooth function f(x). The main idea is to
SINGULAR INTEGRALS 311
produce a sequence of functions fn(x) for which
1. II/- fnlloo = Max IJ(x)- fn(x)l __. 0
a:;;x:;;b
2. The integrals
In(!)= J.bw(x)fn(x) dx
a
(5.6.18)
can be fairly easily evaluated.
This generalizes the schema (5.0.2) of the introduction. For the error,
II(!)- In(!) ~ J.biw(x)IIJ(x)- fn(x)jdx
a
~ I I / !niL,., fi w(x) I dx
a
(5.6.19)
Thus In(/) __. !(f) as n ---. oo, and the rate of convergence is at least as rapid as
that of fn(x) to f(x) on [a, b].
Within the preceding framework, the product integration methods are usually
defined by using piecewise polynomial interpolation to define fn(x) from f(x).
To illustrate the main ideas, while keeping the algebra simple, we will define the
product trapezoidal method for evaluating
I(!)= fobJ(x)log(x) dx (5.6.20)
Let n ~ 1, h = bjn, xj = jh for j = 0, 1, ... , n. Define fn(x) as the piecewise
linear function interpolating to f(x) on the nodes x
0
, x
1
, ... , xn. For xj-l ~ x ~
xj, define
(5.6.21)
for j = 1, 2, ... , n. From (3.1.10), it is straightforward to show
h2
Ill- fnlloo ~ sllf"lloo
(5.6.22)
provided f(x) is twice continuously differentiable for 0 ~ x ~ b. From (5.6.19)
we obtain the error bound
(5.6.23)
This method of defining /n(/) is similar to the regular trapezoidal rule (5.1.5).
The rule (5.1.5) could also have been obtained by integrating the preceding
function fn(x), but the weight function would have been simply w(x) = 1. We
i -- -
312 NUMERICAL INTEGRATION
can easily generalize the preceding by using higher degree piecewise polynomial
interpolation. The use of piecewise quadratic interpolation to define /,(x) leads
to a formula /,(/) called the product Simpson's rule. And using the same
reasoning as led to (5.6.23), it can be shown that
(5.6.24)
Higher order formulas can be obtained by using even higher degree interpolation.
For the computation of/,(/) using (5.6.21),
1 Jx
1
w
0
=- (x
1
- x)Iog(x) dx
h x
0
1 JX
wj = h
1
(x- xj_
1
) log (x) dx
. Xj-1
1
1
xj+l
+- (xj+
1
- x) log(x) dx
h xj
(5.6.25)
j=1, ... ,n-1 (5.6.26)
The calculation of these weights can be simplified considerably. Making the
change of variable x- xj_
1
= uh, 0 .:5; u .:5; 1, we have
1 JX 11
-
1
(x- xj_
1
) log(x) dx = h u log[{J- 1 + u)h] du
h xj-l 0
h 11
= - log (h) + h u log (J - 1 + u) du
2 0
and
1 JX 11
-
1
( x j - x) log ( x) dx = h ( 1 - u) log [ (j - 1 + u) h] du
h Xj-l . 0
h 11
= - log (h) + h ( 1 - u) log (j - 1 + u) du
2 0
Define
,h(k) = f(l- u) log(u + k) du (5.6.27)
~ -------------- - -- I
SINGULAR INTEGRALS 313
Table 5.23 Weights for product trapezoidal rule
k 1/tl (k) l/t2 ( k)
0 -.250 -.750 .
. 250 .1362943611
2 .4883759281 .4211665768
3 .6485778545 .6007627239
4 .7695705457 .7324415720
5 .8668602747 .8365069785
6 .9482428376 .9225713904
7 1.018201652 .9959596385
for k = 0, 1, 2, .... Then
h h
w
0
= 2 log (h) + h th ( 0) w" = 2log(h) + hth(n- 1)
j = 1, 2, ... , n - 1 (5.6.28)
The functions th(k) and th(k) do not depend on h, b, or n. They can be
calculated and stored in a table for use with a variety of values of b and n. For
example, a table of th(k) and l[;
2
(k) fork= 0, 1, ... ,99 can be used with any
b > 0 and with any n ~ 100. Once the table of the values l[;
1
(k) and l[;
2
(k) has
been calculated, the cost of using the product trapezoidal rule is no greater than
the cost of any other integration rule.
The integrals t/;
1
(k) and th(k) in (5.6.27) can be evaluated explicitly; some
values are given in Table 5.23.
Example Compute I= Jl(lj(x + 2))log(x) dx = -.4484137. The computed
values are given in Table 5.24. The computed rate of convergence is in agreement
with the order of convergence of (5.6.23).
Many types of interpolation may be used to define fn(x), but most applica-
tions to date have used piecewise polynomial interpolation on evenly spaced node
points. Other weight functions may be used, for example,
w(x) = xa a> -1 x ~ O (5.6.29)
and again the weights can be reduced to a fairly simple formula similar to
Table 5.24 Example of product trapezoidal rule
n
I"
I- In Ratio
1 -.4583333 .00992
2 -.4516096 .00320 3.10
4 -.4493011 .000887 3.61
8 -.4486460 .000232 3.82
!
I
j
I
------ _ _j
314 NUMERICAL INTEGRATION
(5.6.28). For an irrational value of a, say
1
w(x) = ---;.r-
xv2 -1
a change of variables can no longer be used to remove the singularity in the
integraL Also, one of the major applications of product integration is to integral
equations in which the kernel function has an algebraic andjor logarithmic
singularity. For such equations, changes of variables are no longer possible, even
with square root singularities. For example, consider the equation
i\ (x) - fP(Y) dy = f(x)
fP a lx- Yl
112
with i\, a, b, and f given and fP the desired unknown function. Product
integration leads to efficient procedures for such equations, provided fP(Y) is a
smooth function [see Atkinson (1976), p. 106].
For complicated weight functions in which the wj can no longer be
calculated, it is often possible to modify the problem to one in which product
integration is still easily applicable. This will be examined using an example.
Example Consider I= /
0
/(x) log(sin x) dx. The integrand has a singularity at
both x = 0 and x = 'TT. Use
[
sin(x)]
log (sin x) = log x (., _ x) + log ( x) + log (., - x)
and this gives
I= fo"t(x) log [ x( :n_x x)] dx + {'t(x) log (x) dx
+ fo"t(x)log(.,- x) dx
(5.6.30)
Integral /
1
has an infinitely differentiable integrand, and any standard numerical
method will perform welL Integral I
2
has already been discussed, with w(x) =
log(x). For I
3
, use a change of variable to write
I
3
= 1"t(x)1og(.,- x) dx = 1"t(.,- z)log(z) dz
0 0
Combining with I
2
,
I2 + I3 = fo"log{x)[f(x) + /(.,- x)] dx
NUMERICAL DIFFERENTIATION 315
to which the preceding work applies. By such manipulations, the applicability of
the cases w(x) = log.(x) and w(x) = X
0
is much greater than might first be
imagined.
For an asymptotic error analysis of product integration; see the work of
de Hoog and Weiss (1973), in which some generalizations of the Euler-
MacLaurin expansion are derived. Using their results, it can be shown that the
error in the product Simpson rule is 0( h
4
log (h)). Thus the bound (5.6.24) based
on the interpolation error f(x) - fn(x) does not predict the correct rate of
convergence. This is similar to the result (5.1.17) for the Simpson rule error, in
which the error was smaller than the use of quadratic interpolation would lead us
to believe.
5.7 Numerical Differentiation
Numerical approximations to derivatives are used mainly in two ways. First, we
are interested in calculating derivatives of given data that are often obtained
empirically. Second, numerical differentiation formulas are used in deriving
numerical methods for solving ordinary and partial differential equations. We
begin this section by .deriving some of the most commonly used formulas for
numerical differentiation.
The problem of numerical differentiation is in some ways more difficult than
that of numerical integration. When using empirically determined function
values, the error in these values will usually lead to instability in the numerical
differentiation of the function. In contrast, numerical integration is stable when
faced with such errors (see Problem 13).
The classical fonnulas One of the main approaches to deriving a numerical
approximation to f'(x) is to use the derivative of a polynomial Pn(x) that
interpolates f(x) at a given set of node points. Let x
0
, x
1
, , xn be given, and
let Pn(x) interpolate f(x) at these nodes. Usually {X;} are evenly spaced. Then
use
f'(x) =
From (3.1.6), (3.2.4), and (3.2.11):
n
Pn(x) = L f(x)l/x)
j-0
1/x) = i'n(x)'
(x- x)i'n(x)
(5.7.1)
(x -,x
0
) (x- xj_
1
)(x- xj+l) (x- xn)
(xj- x
0
) (xj- xj_
1
)(xj- xj+l) (xj- xJ
= (x- x
0
) {x- xn)
f{x)- Pn(x) = Xn, x} (5.7.2)
316 NUMERICAL INTEGRATION
Thus
II
f'(x) = = L f(x)IJ(x) = Dhf(x)
j=O
f'(x)- Dhf(x) = ir:(x)J[x
0
, .. , x,, x]
+i',{x )/[x
0
, , x,, x, x]
with the last step using (3.2.17). Applying (3.2.12),
(5.7.3)
(5.7.4)
with
E '{ x
0
, ... , x,, x }. Higher order differentiation formulas and their
error can be obtained by further differentiation of (5.7.3) and (5.7.4).
The most common application of the preceding is to evenly spaced nodes
{ x;}. Thus let
X;= X
0
+ ih i 0
with h > 0. In this case, it is straightforward to show that
Thus
We now derive examples of each case.
ir;(x)=FO
v;(x) = 0
{5.7.6)
(5.7.7)
Let n = 1, so that p,(x) is just the linear interpolate of (x
0
, f(x
0
)) and
(x
1
, f(x
1
)). Then (5.7.3) yields
(5.7.8)
From (5.7.5),
(5.7.9)
since i'(x
0
) = 0.
To improve on this with linear interpolation, choose x = m = (x
0
+ x
1
)j2.
Then
NUMERICAL DIFFERENTIATION 317
We usually rewrite this by letting 8 = h/2, to obtain
1
/'(m) Dsf(m) =
28
[/(m + 8)- J(m- 8)] (5.7 .10)
For the error, using (5.7.5) and ir{(m) = 0,
m- 8 5:
5: m + 8 (5.7.11)
In general, to obtain the higher order case in (5.7.7), we want to choose the nodes
{X;} to have = 0. This will be true if n is odd and the nodes are placed
symmetrically about x, as in (5.7.10).
To obtain higher order formulas in which the nodes all lie on one side of x,
use higher values of n in (5.7.3). For example, with x = x
0
and n = 2,
(5.7.12)
(5.7.13)
The method of undetermined coefficients Another method to derive formulas
for numerical integration, differentiation, and interpolation is called the method
of undetermined coefficients. It is often equivalent to the formulas obtained from
a polynomial interpolation formula, but sometimes it results in a simpler deriva-
tion. We will illustrate the method by deriving a formula for f"(x).
Assume
f"(x)
Example Let f(x) = -cos(x), and compute /"(0) using the numerical ap-
proximation (5.7.17). In Table 5.25, we give the errors in (1) Dflf(O), computed
exactly, and (2)
and
h f"(O) -
x
2
f(x) dx with weight w(x) =VI- x
2
,
find explicit formulas for the nodes and weights of the Gaussian quadrature
formula. Also give the error formula. Hint: See Problem 24 of Chapter 4.
20. Using the column en.
4
of Table 5.13, produce the fourth-order error
formula analogous to the second-order formula (5.3.40) for en.
2
Compare
it with the fourth-order Simpson error formula (5.1.17).
21. The weights in the Kronrod formula (5.3.48) can be calculated as the
solution of four simultaneous linear equations. Find that system and then
solve it to verify the values given following (5.3.48). Hint: Use the approach
leading to (5.3.7).
22. Compare the seven-point Gauss-Legendre formula with the seven-point
Kronrod formula of (5.3.48). Use each of them on a variety of integrals,
and then compare their respective errors.
23. (a) Derive the relation (5.4.7) for the Bernoulli polynomials B/x) and
Bernoulli numbers Bj. Show that Bj = 0 for all odd integers j 3.
(b) Derive the identities
Bj(x) = jBj_
1
(x) j 4 and even
j 3 and odd
These can be used to give a general proof of the Euler-MacLaurin
formula (5.4.9).
PROBLEMS 327
24. Using the Euler-MacLaurin summation formula (5.4.17), obtain an esti-
mate of t( 1 ), accurate to three decimal places. The zeta function t( p) is
defined in (5.4.22).
25. Obtain the asymptotic error formula for the trapezoidal rule applied to
f ~ ':;xj(x) dx. Use the estimate from Problem 24.
26. Consider the following table of approximate integrals In produced using
Simpson's rule. Predict the order of convergence of In to I:
n In
2 .28451779686
4 .28559254576
8 .28570248748
16 .28571317731
32 .28571418363
64 .28571427643
That is, if I- In = cjnP, then what is p? Does this appear to be a valid
form for the error for these data? Predict a value of c and the error in I
64
How large should n be chosen if In is to be in error by less than w-u?
27. Assume that the error in an integration formula has the asymptotic expan-
sion
Generalize the Richardson extrapolation process of Section 5.4 to obtain
formulas for C
1
and C
2
. Assume that three values In, I
2
n, and I
4
n have
been computed, and use these to compute C
1
, C
2
, and an estimate of I,
with an error of order 1 j n 2../n.
28. For the trapezoidal rule (denoted by IY>) fofevaluating I= J:J(x) dx, we
have the asymptotic error formula
and for the midpoint formula I ~ M > we have
provided f is sufficiently differentiable on [a, b]. Using these results, obtain
a new numerical integration formula I: combining IY> and I ~ M > with a
higher order of convergence. Write out the weights to the new formula in-
328 NUMERICAL INTEGRATION
29. Obtain an asymptotic error formula for Simpson's rule, comparable to the
Euler-MacLaurin formula (5.4.9) for the trapezoidal rule. Use (5.4.9),
(5.4.33), and (5.4.36), as in (5.4.37).
30. Show that the formula (5.4.40) for I ~
The computation of Yn+l from Yn contains a truncation error that is O(h
3
)
[see (6.5.1)]. To maintain this order of accuracy, the eventual iterate
which
is chosen to represent Yn+l should satisfy IYn+l - = O(h
3
). And if we
want the iteration error to be less significant (as we do in the next section), then
should be chosen to satisfy
(6.5.8)
To analyze the error in choosing an initial guess Yn+l we must introduce the
concept of local solution. This will also be important in clarifying exactly what
solution is being obtained by most automatic computer programs for solving
ordinary differential equations. Let un(x) denote the solution of y' = j(x, y)
that passes through (xn, Yn):
= f(x, un(x)) (6.5.9)
At step xn, knowing Yn' it is un(xn+l) that we are trying to calculate, rather than
Y(xn+l).
Applying the derivation that led to (6.5.1), we have
for some Xn:::;; gn:::;; xn+l Let en+l = un(Xn+l)- Yn+l> which we call the local
error in computing Yn+l from Yn Subtract (6.5.2) from the preceding to obtain
h h
3
en+l = 2 [/(xn+l un(xn+l)) - f(xn+l Yn+l)] - l2
where we have twice applied the mean value theorem. It can be shown that for all
sufficiently small h,
THE TRAPEZOIDAL METHOD 369
This shows that the local error is essentially the truncation error.
If Euler's method is used to compute
then un(xn+
1
) can be expanded to show that
h2
( )
(0) - "(,. )
un xn+1 - Yn+1- Tun !ln
Combined with (6.5.11),
Y
- y<O> = O(h2)
n+1 n+1
(6.5.11)
(6.5.12)
(6.5.13)
(6.5.14)
To satisfy (6.5.8), the bound (6.5.5) implies two iterates will have to be computed,
and then we use
to represent Yn+
1
.
Using the midpoint method, we can obtain the more accurate initial guess
(6.5.15)
To estimate the error, begin by using the derivation that leads to (6.4.1) to obtain
- ( ) h3 (3)(
un(xn+
1
)- un(xn-1) + 2hj Xn, un(xn) +
3
Un '11n),
for some xn-1 ;5; "11n ;5; xn+1 Subtracting (6.5.15),
. h3
un(xn+1)- = un(xn-1)- Yn-1 +
The quantity un(xn_
1
) - Yn-lcan be computed in a manner similar to that used
for (6.5.11) with about the same result:
h3
un(xn-1)- Yn-1 =
12
+ O(h
4
) (6.5.16)
Then
5h
3
u (x ) - y<O> = -u<
3
>(x ) + O(h
4
)
n n+1 n+1 12 n n
And combining this with (6.5.11),
h3
Y
- y<
0
> = -u<
3
>(x ) + O(h
4
)
n+1 n+1 2 n n
(6.5.17)
i
1
I
I
I
____ _J
370 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
With the initial guess (6.5.15), one iterate from (6.5.3) will be sufficient to satisfy
(6.5.8), based on the bound in (6.5.5).
The formulas (6.5.12) and (6.5.15) are called predictor formulas, and the
trapezoidal iteration formula (6.5.3) is called a corrector formula. Together they
foi:m a predictor-corrector method, and they are the basis of a method that can
be used to control the size of the local error. This is illustrated in the next section.
Convergence and stability results The convergence of the trapezoidal method is
assured by Theorem 6.6. Assuming hk .:s:; 1,
Max IY(xn) - Yh(xn) l.:s:; eZK(b-:co>leol
:c
0
;S;:c.s;b
The derivation of an asymptotic error formula is similar to that for Euler's
method. Assuming e
0
= 5
0
h
2
+ O(h
3
), we can show
(6.5.19)
The standard type of stability result, such as that given in (6.2.28) and (6.2.29) for
Euler's method, can also be given for the trapezoidal method. We leave the proof
of this to the reader.
As with the midpoint method, we can examine the effect of applying the
trapezoidal rule to the model equation
y' = A.y y(O) = 1 (6.5.20)
w h o s ~ solution is Y(x) = e>.x_ To give further motivation for doing so, consider
the trapezol.dal method applied to the linear equation
y'=A.y+g(x) y(O) = Y
0
(6.5.21)
namely
h
Yn+l = Yn + 2 [A.yn + g(xn) + AYn+l + g{xn+l)]
n ~ 0 (6.5.22)
with Yo= Then consider the perturbed numerical method
h
Zn+l = Zn + 2 [A.zn + g(xn) + AZn+l + g{xn+l)] n ~
THE TRAPEZOIDAL METHOD 371
with z
0
= Y
0
+ t:. To analyze the effect of the perturbation in the initial value, let
wn = zn- Yn Subtracting,
n ~ ( 6.5.23)
This is simply the trapezoidal method applied to our model problem, except that
the initial value is t: rather than 1. The numerical solution in (6.5.23) is simply t:
times that obtained in the numerical solution of (6.5.20). Thus the behavior of the
numerical solution of the model problem (6.5.20) will give us the stability
behavior of the trapezoidal rule applied to (6.5.21).
The model problems in which we are interested are those for which A is real
and negative or A is complex with negative real part. The reason for this choice is
that then the differential equation problem (6.5.21) is well-conditioned, as noted
in (6.1.8), and the major interesting cases excluded are A = 0 and A strictly
imaginary.
Applying the trapezoidal rule to (6.5.20),
Then
Inductively
hA
Yn+I = Yn + T(Yn + Yn+I]
[
1 + (hA/2)]
Yn+I =
1
_ (hA/
2
) Yn
= [1 + {hA/2)]n
Yn 1- {hA/2)
Yo= 1
n ~
n ~
provided hA =F 2. For the case of real A < 0, write
1 + (hA/2) hA 2
r= =1+ =-1+----
1 - (hA/2) 1 - (hA/2) . 1 - (hA/2)
This shows - 1 < r < 1 for all values of h > 0. Thus
. Limityn = 0
n->oo
(6.5.24)
(6.5.25)
There are no limitations on h in order to have boundedness. of {Yn}, and thus
stability of the numerical method in (6.5.22) is assured for all h > 0 and all
======================' J A < 0. This is a stronger statement than is possible with most numerical methods,
where generally h must be sufficiently small to ensure stability. For certain
applications, stiff differential equations, this is an important consideration. The
property that (6.5.25) holds for all h > 0 and all complex A with Real(A) < 0 is
called A-stability. We explore it further in Section 6.8 and Problem 37.
372 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
Richardson error estimation This error estimation was introduced in Section
5.4, and it was used both to predict the error [as in (5.4.42)] and to obtain a more
rapidly convergent numerical integration method [as in (5.4.40)]. It can also be
used in both of these ways in solving differential equations, although we will use
it mainly to predict the error.
Let Yh(x) and y
2
h(x) denote the numerical solutions to y' = j(x, y) on
[x
0
, b], obtained using the trapezoidal method (6.5.2). Then using (6.5.19),
Y(xJ- Yh(xn) = D(xn)h
2
+ O(h
3
)
Y(xn)- Y2h(xn) = 4D(xn)h
2
+ O(h
3
)
Multiply the first equation by four, subtract the second, and solve for Y(xn):
(6.5.26)
The formula on the right side has a higher order of convergence than the
trapezoidal method, but note that it requires the computation of yh(xn) and
y
2
h(xn) for all nodes xn in [x
0
, b].
The formula (6.5.26) is of greater use in predicting the global error in yh(x).
Using (6.5.26),
The left side is O(h
2
), from (6.5.19), and thus the first term on the right side must
also be 0( h
2
). Thus
(6.5.27)
is an asymptotic estimate of the error. This is a practical procedure for estimating
the global error, although the way we have derived it does not allow for a variable
stepsize in the nodes ..
Example Consider the problem
y' = -y2 y(O) = 1
Table 6.8 Trapezoidal ~ e t h o and Richardson error estimation
X Y2h(x)
Y(x)- y
21
,(x) Yh(x) Y(x)- Yh(x) f[yh(x) - Y2h(x)]
1.0 .483144 .016856 .496021 .003979 .004292
2.0 .323610 .009723 .330991 .002342 .002460
3.0 .243890 .006110 .248521 .001479 .001543
4.0 .194838 .004162 .198991 .001009 .001051
5.0 .163658 .003008 .165937 .000730 .000759
A LOW-ORDER PREDICTOR-CORRECTOR ALGORITHM 373
which has the solution Y(x) = 1/(1 + x). The results in Table 6.8 are for
stepsizes h = .25 and 2h = .5. The last column is the error estimate (6.5.27), and
it is an accurate estimator of the true error Y(x)- yh(x).
6.6 A Low-Order Predictor-Corrector Algorithm
In this section, a fairly simple algorithm is described for solving the initial value
problem (6.0.1). It uses the trapezoidal method (6.5.2), and it controls the size of
the local error by varying the stepsize h. The method is not practical because of
its low order of convergence, but it demonstrates some of the ideas and
techniques involved in constructing a variable-stepsize predictor-corrector al-
gorithm. It is also simpler to understand than algorithms based on higher order
methods.
Each step from xn to xn+
1
will consist of constructing Yn+
1
from Yn and Yn-
1
,
and Yn+
1
will be an approximate solution of (6.5.2) based on using some iterate
from (6.5.3). A regular step has xn+
1
- xn = xn- xn_
1
= h, the midpoint
predictor (6.5.15) is used, and the local error is predicted using the difference of
the predictor and corrector formulas. When the stepsize is being changed, the
Euler predictor (6.5.12) is used.
The user of the algorithm will have to specify several parameters in addition to
those defining the differential equation problem (6.0.1). The stepsize h will vary,
and the user must specify values h min and h max that limit the size of h. The user
should also specify an initial value for h; and the value should be one for which
ihfy(x
0
, y
0
) is sufficiently less than 1 in magnitude, say less than 0.1. This
quantity will determine the speed of convergence of the iteration in (6.5.3) and is
discussed later in the section, following the numerical example. An error toler-
ance t: must be given, and the stepsize h is so chosen that the local error trunc
satisfies
(6.6.1)
at each step. This is called controlling the error per unit stepsize. Its significance is
discussed near the end of the section.
The notation of the preceding section is continued. The function un(x) is the
solution of y' = f(x, y) that passes through (xn, Yn). The local error to be
estimated and controlled is (6.5.11), which is the error in obtaining un(xn+
1
)
using the trapezoidal method:
h = Xn+
1
- Xn (6.6.2)
If Yn is sufficiently close to Y(xn), then this is a good approximation to the
closely related truncation error in (6.5.1):
i
---------------------------------------")
I
I
I
I
I
I
~
374 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
And (6.6.2) is the only quantity for which we have the information needed to
control it.
Choosing the initial stepsize The problem is to find an initial value of h and
node x
1
= x
0
+ h, for which IY
1
- Y(x
1
)1 satisfies the bounds in (6.6.1). With
the initial h supplied by the user, the value Jh(x
1
) is obtained by using the Euler
predictor (6.5.12) and iterating twice in (6.5.3). Using the same procedure, the
values yh
12
(x
0
+ h/2) and Yh;
2
(x
1
) are also calculated. The Richardson ex-
trapolation procedure is used to predict the error in Yh(x
1
),
(6.6.3)
If this error satisfies the bounds of (6.6.1), then the value of h is accepted, and
the regular trapezoidal step using the midpoint predictor (6.5.15) is begun. But if
(6.6.1) is not satisfied by (6.6.3), then a new value of h is chosen.
Using the values
obtain the approximation
(6.6.4)
This is an approximation using the second-order divided difference of Y' =
f(x, Y); for example, apply Lemma 2 of Section 3.4. For any small stepsize h,
the truncation error at x
0
+ h is well approximated by
The new stepsize h is chosen so that
_r
h = v TD;YI
(6.6.5)
This should place the initial truncation error in approximately the middle of the
range (6.6.1) for the error per unit step criterion. With this new value of h, the
test using (6.6.3) is again repeated, as a safety check.
By choosing h so that the truncation error will satisfy the bound (6.6.1) when
it is doubled or halved, we ensure that the stepsize will not have to be changed
A LOW-ORDER PREDICTOR-CORRECTOR ALGORITHM 375
for several steps, provided the derivative Y(3)(x) is not changing rapidly. Chang-
ing the stepsize will be more expensive than a normal step, and we want to
minimize the need for such changes.
The regular predictor-corrector step The stepsize h satisfies xn- xn-l =
xn+l- xn =h. To solve for the value Yn+l use the midpoint predictor (6.5.15)
and iterate once in (6.5.3). The local error (6.5.11) is estimated using (6.5.17):
1 h
3
- -(" - y<
0
> ) = - -u<
3
>(x ) + O(h
4
)
6 Jn+l n+l 12 n n
(6.6.6)
Thus we measure the local error using
(6.6.7)
If trunc satisfies (6.6.1), then the value of h is not changed and calculation
continues with this regular step procedure. But when (6.6.1) is not satisfied, the
values Yn+l and xn+l are discarded, and a new stepsize is chosen based on the
value of trunc.
Changing the stepsize Using (6.6.6), -
trunc
h ~
where h
0
denotes the stepsize used in obtaining trunc. For an arbitrary stepsize
h, the local error in obtaining Yn+l is estimated using
Choose h so that
[
h ]3 1
ho jtruncj = 2h,
(6.6.8)
Calculate Yn+l by using the Euler predictor and iterating twice in (6.5.3). Then
return to the regular predictor-corrector s t ~ p To avoid rapid changes in h that
can lead to significant errors, the new value of h is never allowed to be more than
twice the previous value. If the new value of h is less than h min then calculation
'
I
I
I
----------------------------------------------- j
376 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
is terminated. But if the new value of h is greater than h max we just let h = h max
and proceed with the calculation. This has possible problems, which are discussed
following the numerical example.
Algorithm Detrap(J, x
0
, y
0
, X end , h, h min hmax ier)
1. Remark: The problem being solved is Y' = f(x, Y), Y(x
0
) =
y
0
, for x
0
:S: x :S: Xend using the method described earlier in the
section. The approximate solution values are printed at each
node point. The error parameter f and the stepsize parameters
were discussed earlier in the section. The variable ier is an
error indicator, output when exiting the algorithm: ier = 0
means a normal return; ier = 1 means that h = h max at some
node points; and ier = 2 means that the integration was
terminated due to a necessary. h < h min
2. Initialize: loop == 1, ier == 0.
3. Remark: Choose an initial value of h.
4. Calculate Yh(x
0
+h), Yh;
2
(x
0
+ (h/2)), Yh;
2
(x
0
+h) using
method (6.5.2). In each case, use the Euler predictor (6.5.12)
and follow it by two iterations of (6.5.3).
5. For the error in Yh(x
0
+h), use
6. If Eh :s; !trunci :s; Eh, or if loop = 2, then x
1
== x
0
+ h,
y
1
== Yh(x
0
+ h), print x
1
, y
1
, and go to step 10.
7. Calculate D
3
y ,;, y<3>(x
0
) from (6.6.4). If D
3
y * 0, then
If D
3
y = 0, then h == h max and loop == 2.
8. If h < hmin then ier := 2 and exit. If h > hmax then h == hmax
ier == 1, loop== 2.
9. Go to step 4.
10. Remark: This portion of the algorithm contains the regular
predictor-corrector step with error control.
A LOW-ORDER PREDICTOR-CORRECTOR ALGORITHM 377
1L Let x
2
= x
1
+ h, and y4) = Yo+ 2hf(x
1
, y
1
). Iterate (6.5.3)
once to obtain y
2
13. If jtruncj > h or jtruncj < ih, then go to step 16.
14. Print x
2
, h
15. x
0
== x
1
, x
1
= x
2
, y
0
== y
1
, y
1
= Y2 If x
1
< xend then go to
step 11. Otherwise exit.
16. Remark: Change the stepsize.
17. x
0
:= x
1
, y
0
= y
1
, h
0
== h, and calculate h using (6.6.8)
18. h := Min { h, 2h
0
}.
19. If h < hmm, then ier = 2 and exit. If h > hmax then ier := 1
and h = h max
20. yf
0
> = Yo + hf(x
0
, y
0
), and iterate twice in (6.5.3) to calculate
y
1
Also, x
1
:= x
0
+h.
21. Print x
1
, Jt.
22. If x
1
< xend, then go to step 10. Otherwise, exit.
The following example uses an implementation of Detrap that also prints trunc.
A section of code was added to predict the truncation error in y
1
of step 20.
Example Consider the problem
which has the solution
1
y' = ---- 2y2
1 + x
2
X
Y(x) = 1 + x2
y(O) = 0 (6.6.9)
This is an interesting problem for testing Detrap, and it performs quite well. The
equation was solved on [0, 10] with hmin = .001, hmax = 1.0, h = .1, and =
.0005. Table 6.9 contains some of the results, including the true global error and
the true local error, labeled True le. The latter was obtained by using another
more accurate numerical method. Only selected sections of output are shown
because of space.
378 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
Table 6.9 Example of algorithm Detrap
xn
h
Yn
Y(xn)- Yn trunc True Je
.0227 .0227 .022689 5.84E- 6 5.84E- 6 5.84E- 6
.0454 .0227 .045308 1.17E- 5 5.83E- 6 5.84E- 6
.0681 .0227 .067787 1.74E- 5 5.76E- 6 5.75E- 6
.0908 .0227 .090060 2.28E- 5 5.62E- 6 5.61E- 6
.2725 .0227 .253594 5.16E- 5 2.96E- 6 2.85E- 6
.3065 .0340 .280125 5.66E- 5 6.74E- 6 6.79E- 6
.3405 .0340 .305084 6.01E- 5 6.21E- 6 5.73E- 6
.3746 .0340 .328411 6.11E- 5 4.28E- 6 3.54E- 6
.4408 .0662 .369019 5.05E- 5 -6.56E- 6 -5.20E- 6
.5070 .0662 .403297 2.44E- 5 -1.04E-5 -2.12E- 5
.5732 .0662 .431469 -2.03E- 5 -2.92E- 5 -4.21E- 5
.6138 .0406 .445879 -2.99E- 5 -1.12E- 5 -l.lOE- 5
1.9595 .135 .404982 -1.02E- 4 -1.64E- 5 -1.67E- 5
2.0942 .135 .388944 -1.03E- 4 -1.79E- 5 -2.11E- 5
2.3172 .223 .363864 -6.57E- 5 1.27E- 5 8.15E- 6
2.7632 .446 .319649 3.44E- 4 4.41E- 4 3.78E- 4
3 0 6 6 ~ .303 .294447 3.21E- 4 9.39E- 5 8.41E- 5
7.6959 .672 .127396 3.87E- 4 8.77E- 5 1.12E- 4
8.6959 1.000 .113100 3.96E- 4 1.73E- 4 1.57E- 4
9.6959 1.000 .101625 4.27E- 4 1.18E- 4 1.68E- 4
10.6959 1.000 .092273 4.11E- 4 9.45E- 5 1.21E- 4
We illustrate several points using the example. First, step 18 is necessary in
order to avoid stepsizes that are far too large. For the problem (6.6.9), we have
which is zero at x ,;, .414, 2.414. Thus the local error in solving the problem
(6.6.9) will be very small near these points, based on (6.6.2) and the close relation
of un(x) to Y(x). This leads to a prediction of a very large h, in (6.6.8), one
which will be too large for following points xn. At xn = 2.7632, step 18 was
needed to avoid a misleadingly large value of h. As can be observed, the local
error at xn = 2.7632 increases greatly, due to the larger value of h. Shortly
thereafter, the stepsize h is decreased to reduce the size of the local error.
In all of the derivations of this and the preceding section, estimates were made
that were accurate if h was sufficiently small. In most cases, the crucial quantity
is actually hfv(xn, Yn), as in (6.5.7) when analyzing the rate of convergence 0f the
iteration (6.5.3). In the case of the trapezoidal iteration of (6.5.3), this rate of
convergence is
(6.6:10)
A LOW-ORDER PREDICTOR-CORRECTOR ALGORITHM 379
and for the problem (6.6.9), the rate is
Rate= -2hyn
If this is near 1, then s_everal iterations are necessary to obtain an accurate
estimate of Yn+l From the table, the rate roughly increases in size ash increases.
At xn = 2.3172, this rate is about .162. This still seems small enough, but the
local error is more inaccurate than previously, and this may be due to a less
accurate iterate being obtained in (6.5.3). The algorithm can be made more
sophisticated in order to detect the problems of too large an h, but setting a
reasonably sized h max will also help.
The global error We begin by givi11g an error bound analogous to the bound
(6.5.18) for a fixed stepsize. Write
For the last term, we assume the error per unit step criterion (6.6.1) is satisfied:
{6.6.12)
For the other term in (6.6.11), introduce the integral equation refqrmulations
Y(x) = Y(x,) + jxf(t, Y(t)) dt
x.
un(x) = Yn + jxf(t, un(t)) dt
x.
Subtract and take bounds using the Lipschitz condition to obtain
I Y(x) - u,(x) I :5 I e, I + K(x - x,) Max I Y(t) - u,(i) I
.m::=J::;x
with en = Y(xn) - Yn Using this, we can derive
Introduce
H = Max (xn+l- xn) =Max h
x
0
sx.sb
and assume that
HK<l
Combining (6.6.11), (6.6.12), and (6.6.14), we obtain
1
len+ll
1
_ HKlenl + H
(6.6.13)
X 2::: X,
(6.6.14)
i
-------------------------------------------!
380 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
This is easily solved, much as in Theorem 6.3, obtaining
[
ec(h-x0 ) _ 1]
c. (6.6.15)
for an appropriate constant c > 0. This is the basic error result when using a
variable stepsize, and it is a partial justification of the error criterion (6.6.1).
In some situations, we can obtain a more realistic bound. For simplicity,
assume fv(x, y) 0 for all (x, y). Subtracting in (6.6.13),
Y(x)- un(x) =en+ jx[J(t, Y(t))- f(t, un(t))] dt
x.
The last step uses the mean-value theorem 1.2, and it can be shown that
Bf(t, t(t))jBy is a continuous function of t. This shows that v(x) = Y(x)-
un(x) is a solution of the linear problem
'( ) Bf(x, ( )
V X = By V X
The solution of this linear problem, along with the assumption fv(x, y) 0,.
implies -
jY(xn+l)- un(xn+l) I:::;; lenl
The condition fy(x, y) 0 is associated with well-conditioned initial value
problems, as was noted earlier in (6.1.8).
Combining with (6.6.11) and (6.6.12), we obtain
jY(xn+l)- Yn+ll Ynl + (Xn+l- xn)
Solving the inequality, we obtain the more realistic bound
(6.6.16)
This partially explains the good behavior of the example in Table 6.9; and even
better theoretical results are possible. But results (6.6.15) and (6.6.16) are suffi-
cient justification for the use of the test (6.6.1), which controls the error per unit
step. For systems of equations y' = f(x,y), the condition fy(x, y) 0 is replaced
by requiring that all eigenvalues of the Jacobian matrix fy(x, Y(x)) have real
parts that are zero or negative.
The algorithm Detrap could be improved in a number of ways. But it
illustrates the construction of a predictor-corrector algorithm with variable
stepsize. The output is printed at an inconvenient set of node points xn, but a
simple interpolation algorithm can take care of this. The predictors can be
improved upon, but that too would make the algorithm more complicated. In the
next section, we return to a discussion of currently available practical
predictor-corrector algorithms, most of which also vary the order.
DERIVATION OF HIGHER ORDER MULTISTEP METHODS 381
6.7 Derivation of Higher Order Multistep Methods
Recall from (6.3.1) of Section 6.3 the general formula for a p + 1 step method
for solving the initial value problem (6.{\.1):
p p
Yn+1 = L ajYn-j + h L bjf(xn-j Yn-j) (6.7.1)
j=O j- -1
A theory was given for these methods in Section 6.3. Some specific higher order
methods will now be derived. There are two. principal means of deriving higher
order formulas: (1) The method of undetermined coefficients, and (2) numerical
integration. The methods based on numerical integration are currently the most
popular, but the perspective of the method of undetermined coefficients is still
important in analyzing and developing numerical methods.
The implicit formulas can be solved by iteration, in complete analogy with
(6.5.2) and (6.5.3) for the trapezoidal method. If b_
1
-:!= 0 in (6.7.1), the iteration
is defined by
p
J-0
( 6.7 .25)
(6.7.26)
with yj = f(x
1
, Y) as before. Table 6.13 contains the low-order formulas for
p = 0, 3. Note that the p = 1 case is the trapezoidal method.
Formula (6.7.26) is an implicit method, and therefore a predictor is necessary
for solving it by iteration. The basic ideas involved are exactly the same as those
in Section 6.5 for the iterative solution of the trapezoidal method. If a fixed-order
predictor-corrector algorithm is desire.d, and if only one iteration is to be
calculated, then an Adams-Moulton formula of order m 2 can use a predictor
Table 6.13 Adams-Moulton fonnulas
DERIVATION OF HIGHER ORDER MULTISTEP METHODS 389
of order m or m - 1. The advantage of using an order m - 1 predictor is that
the predictor and corrector would both use derivative values at the same nodes,
namely, xn, xn_
1
, ... , xn-m+
2
. For example, the second-order Adams-Moulton
formula with the first-order Adams-Bashforth formula as predictor is just the
trapezoidal method with the Euler predictor. This was discussed in Section 6.5
and shown to be adequate; both methods use the single past value of the
derivative, f(xn, Yn)
A less trivial example is the following fourth-order method:
h
= Yn +
12
[23/{xn, Yn) - 16j(xn-1 Yn-1) + 5j(xn-2 Yn-2)]
h
= Yn +
24
[9J(xn+1 + 19/(xn, yJ- 5/(xn-1 Yn-1)
+j(xn-2 Yn-2)] (6.7.27)
Generally only one iterate is calculated, although this will alter the form of the
truncation error in (6.7.23). Let un(x) denote the solution of y' = f(x, y)
passing through (xn, Yn) Then for the truncation error in using the approxima-
tion
( )
- {1) - [ ( ) - ] + [ - {1) ]
un xn+1 Yn+!- un xn+1 Yn+1 Yn+! Yn+!
Using (6.7.23) and an expansion of the iteration error,
( 6.7 .28)
The first two terms following the equality sign are of order h
5
If either (1) more
iterates are calculated, or (2) a higher order predictor is used, then the principal
part of the truncation error will be simply
= + trunc
(6.7.31)
Thus the actual truncation error is O(hP+
3
), and combined with (6.7.30), it can
be shown that the truncation error in Yn+l satisfies an error per unit step criteria,
which is similar to that of (6.6.1) for the algorithm Detrap of Section 6.6. For a
detailed discussion, see Shampine and Gordon (1975, p. 100).
The program DE (and its successor DDEABM) is very sophisticated in its
error control, including the choosing of the order and the stepsize. It cannot be
discussed adequately in the limited space available in this text, but the best
reference is the text of Shampine and Gordon (1975), which is devoted to
variable-order Adams algorithms. The programs DE and DDEABM have been
well designed .from both the viewpoint of error control and user convenience.
Each is also written in a portable form, and generally, is a well-recommended
program for solving differential equations.
Example Consider the problem
which has the solution
y(O) = 1
20
Y(x)= -----
(1 + 19e-<xf
4
l)
(6.7.32)
DDEABM was used to solve this problem with values output at x = 2, 4, 6, ... , 20.
Three values of. ABSERR were used, and RELERR = 0 in all cases. The true
global errors are shown in Table 6.15. The column labeled NFE gives the number
of evaluations of f(x, y) necessary to obtain the value yh(x), beginning from x
0
.
i
'
--------------------------------------------------------l
'
392 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
Table 6.15 Example of the automatic program DDEABM
ABSERR = 10-
3
ABSERR = 10
6
ABSERR = 10
9
X Error NFE Error NFE Error NFE
4.0 -3.26E- 5 15 1.24E- 6 28 2.86E- 10 52
8.0 6.00E- 4 21 3.86E- 6 42 -1.98E- 9 76
12.0 1.70E- 3 25 4.93E- 6 54 -2.41E- 9 102
16.0 9.13E- 4 31 3.73E- 6 64 -l.86E- 9 124
20.0 9.16E- 4 37 1.79E- 6 74 -9.58E- 10 138
Global error The automatic computer codes that are discussed previously
control the local error or truncation error. They do not control the global error in
the solution. Usually the truncation error is kept so small by these codes that the
global error is also within acceptable limits, although that is not guaranteed. The
reasons for this small global error are much the same as those described in
Section 6.6; in particular, recall (6.6.15) and (6.6.16).
The global error can be monitored, and we give an example of this below. But
even with an estimate of the global error, we cannot control it for most equations.
This is because the global error is composed of the effects of all past truncation
errors, and decreasing the present stepsize will not change those past errors. In
general, if the global error is too large, then the equation must be solved again,
with a smaller stepsize.
There are a number of methods that have been proposed for monitoring the
global error. One of these, Richardson extrapolation, is illustrated in Section 6.10
for a Runge-Kutta method. Below, we illustrate another one for the method of
Section 6.6. For a general survey of the topic, see Skeel (1986).
For the trapezoidal method, the true solution Y(x) satisfies
with h = Xn+l - Xn and xn ~ ~ n ~ xn+l Subtracting the trapezoidal rule
h
Yn+l = Yn + 2 [J(xn, Yn) + f(xn+l Yn+I)]
we have
h
en+l =en+ 2 { [J(xn, Yn +en)- f(xn, Yn)]
h3
+ [/(xn+I Yn+l + ei>+I) - f(xn+l Yn+I)]} -
12
Y 3 ) ~ n ) (6.7.33)
with en = Y(xn) - Yn n ~ 0. This is the error equation for the trapezoidal
method, and we try to solve it approximately in order to calculate en+ I
DERIVATION OF HIGHER ORDER MULTISTEP METHODS 393
Table 6.16 Global error calculation for Detrap
xn
h en en trunc
.0227 .0227 5.84E- 6 5.83E- 6 5.84E- 6
.0454 .0227 1.17E- 5 1.16E- 5 5.83E- 6
.0681 .0227 1.74E- 5 1.73E- 5 5.76E- 6
.0908 .0227 2.28E- 5 2.28E- 5 5.62E- 6
.2725 .0227 5.16E- 5 5.19E- 5 2.96E- 6
.3065 .0340 5.66E- 5 5.66E- 5 6.74E- 6
.3405' .0340 6.01E- 5 6.05E- 5 6.21E- 6
.3746 .0340 6.11E- 5 6.21E- 5 4.28E- 6
.4408 .0662 5.05E- 5 5.04E- 5 -6.56E- 6
.5070 .0662 2.44E- 5 3.57E- 5 -1.04E- 5
.5732 .0662 -2.03E- 5 4.34E- 6 -2.92E- 5
.6138 .0406 -2.99E- 5 -6.76E- 6 -1.12E- 5
1.9595 .135 -L02E- 4 -8.76E-5 -1.64E- 5
2.0942 .135 -1.03E- 4 -8.68E- 5 -1.79E- 5
2.3172 .223 -6.57E- 5 -5.08E- 5 1.27E- 5
2.7632 .446 3.44E- 4 3.17E- 4 4.41E- 4
3.0664 .303 3.21E- 4 2.96E- 4 9.39E- 5
7.6959 .672 3.87E- 4 2:69E- 4 8.77E- 5
8.6959 1.000 3.96E- 4 3.05E- 4 1.73E- 4
9.6959 1.000 4.27E- 4 2.94E- 4 1.18E- 4
10.6959 1.000 4.11E- 4 2.77E- 4 9.45E- 5
Returning to the algorithm Detrap of Section 6.6, we replace the truncation
term in (6.7.33) with the variable trunc computed in Detrap. Then. we solve
(6.7.33) for en+l which will be an approximation of the true global error en+l
We can solve for en+l by using various rootfinding methods, but..we use simple
fixed-point iterations:
h
A{j+l)- A -{[/( +A)-!( )]
en+l - en+
2
xn, Yn en xn, Yn
+ [f(xn+I Yn+I + e!.iJr)- f(xn+l Yn+I)]} + trunc (6.7.34)
for j ;::::: 0. we use e!OJ.r = e n and since this is for just illustrative purposes, we
iterate several times in (6.7.34). This simple idea is closely related to the
difference correction methods of Skeel (1986).
Example We repeat the calculation given in Table 6.9, for Detrap applied to
Eq. (6.6.9). We use the same parameters for Detrap. The results are shown in
Table 6.16, for the same values of xn as in Table 6.9. The results show that en
and en are almost always reasonably close, c:ertainly in magnitude. The ap-
proximation n ~ en is poor around X= .5, due tO the poor estimate of the
truncation error in Detrap. Even then these poor results damp out for this
problem, and for larger values of xn, the approximation en ~ en is still useful.
394 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
6.8 Convergence and Stability Theory
for Multistep Methods
In this section, a complete theory of convergence and stability is presented for
the multistep method
p p
Yn+l = L ajYn-j + h L bjf(xn-j Yn-j) Xp+l ~ Xn+I ~ b (6.8.1)
j=O j= -1
This generalizes the work of Section 6.3, and it creates the mathematical tools
necessary for analyzing whether method (6.8.1) is only weakly stable, due to
instability of the type associated with the midpoint method.
We begin with a few definitions. The concept of stability was introduced with
Euler's method [see (6.2.28) and (6.2.20)], and it is now generalized. Let {YniO ~
n ~ N(h)} be the solution of (6.8.1) for some differential equation y' = f(x, y),
for all sufficiently small values of h, say h ~ h
0
. Recall that N(h) denotes the
largest subscript N for which xN ~ b For each h ~ h
0
, perturb the initial values
y
0
, , Yi> to new values z
0
, , z P with
Max IYn - znl ~
05,n5,p
0 < h ~ h
0
(6.8.2)
Note that these initial values are likely to depend on h. We say the family of
solutions { Yn I 0 ~ n ~ N( h)} is stable if there is a constant c, independent of .
h ~ h
0
and valid for all sufficiently small , for which
Max IYn - zni ~ C
05.n5.N(h)
Consider all differential equation problems
y' = f(x, y)
0 < h ~ h
0
(6.8.3)
( 6.8.4)
with the derivative f(x, y) continuous and satisfying the Lipschitz condition
(6.2.12), and suppose the approximating solutions { Yn} are all stable. Then we
say that (6.8.1) is a stable numerical method.
To define convergence for a given problem (6.8.4), suppose the initial values
y
0
, , Yp satisfy
(6.8.5)
Then the solution { Yn } is said to converge to Y( x) if
Max IY(xn)- Yni ~ 0
x
0
5.x.5.b
(6.8.6)
If (6.8.1) is convergent for all problems (6.8.4), then it is called a convergent
numerical method.
CONVERGENCE AND STABILITY THEORY FOR MULTISTEP METHODS 395
Recall the definition of consistency given in Section 6.3. Method (6.8.1) is
consistent if
as h --" 0
for all functions Y(x) continuously differentiable on [x
0
, b]. Or equivalently from
Theorem 6.5, the coefficients {a j} and { bj} must satisfy
p p
- I: ja j + I: bj = 1
(6.8.7)
j=O j= -1
Convergence can be shown to imply consistency; consequently, we consider only
methods satisfying (6.8.7). As an example of the proof of the necessity of (6.8.7),
tl_le assumption of convergence of (6.8.1) for the problem
y' = 0 y(O) = 1
will imply the first condition in (6.8.7). Just take y
0
= = Yp = 1, and observe
the consequences of the convergence of Yp+l to Y(x) = 1.
The convergence and stability of (6.8.1) are linked to the roots of the
polynomial
p
p(r) = rp+l- L ajrp-j
j=O
(6.8.8)
Note that p(1) = 0 from the consistency condition (6.8.7). Let r
0
, , rP denote
the roots of p(r), repeated according to their multiplicity, and let r
0
= 1. The
method (6.8.1) satisfies the root condition if
1.
2.
j = 0, 1, ... , p
l'jl = 1 = p'(lj) * 0
(6.8.9)
(6.8.10)
The first condition requires all roots of p(r) to lie in the unit circle {z: lzl 1}
in the complex plane. Condition (6.8.10) states that all roots on the boundary of
the circle are to be simple roots of p(r). -
The main results of this section are pictured in Figure 6.6, although some of
them will not be proved. The strong root condition and the concept of relative
stability are introduced later in the section.
Strong root
==::::;>
Relative
condition stability
n
Convergence Root
Stability
c.ondition
Figure 6.6 Schematic of the theory for con-
sistent multistep methods.
---------------------------------- --------------------!
396 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
Stability theory All of the numerical methods presented in the preceding
sections have been stable, but we now give an example of an unstable method.
This is to motivate the need to develop a general theory of stability.
Exampk Recall the general formula (6.7.10) for an explicit two-step second-
order method, and choose a
0
= 3. Then we obtain the method
h
Yn+l = 3yn- 2Yn-l + l[J(xn, Yn)- 3/{xn-1 Yn-JJ
n ~ 1 {6.8.11)
with the truncation error
Consider solving the problemy'= 0, y(O) = 0, which has the solution Y(x) = 0.
Using y
0
= y
1
= 0, the numerical solution is clearly Yn = 0, n ~ 0. Perturb the
initial data to z
0
= /2, z
1
= , for some '* 0. Then the corresponding numeri-
, cal solution can be shown to be
n ~ (6.8.12)
The. reasoning used in deriving this solution is given later in a more general
context. To see the effect of the perturbation on the original solution,
Max IYn- Znl = Max l12n-l = I12N(h)-l
~ ~ ~ ~ h o ~ ~ ~ h
Since N(h)--+ oo ash--+ 0, the deviation of {zn} from {Yn} becomes increas-
ingly greater as h --+ 0. The method (6.8.11) is unstable, and it should never be
used. Also note that the root condition is violated, since p(r) = r
2
- 3r + 2 has
the roots r
0
= 1, r
1
= 2.
To investigate the stability of (6.8.1), we consider only the special equation
y' =]\y y(O) = 1 (6.8.13)
with the solution Y(x) = e'hx. The results obtained will transfer to the study of
stability for a general differential equation problem. An intuitive reason for this is
easily derived. Expand Y'(x) = f(x, Y(x)) about (x
0
, Y
0
) to obtain
= A(Y{x)- Y
0
) + g(x) ( 6.8.14)
with A= fy(x
0
, Y
0
) and g(x) = f(x
0
, Y
0
) + f;x(x
0
, Y
0
)(x- x
0
). This is a valid
approximation if x- x
0
is sufficiently small. Introducing V(x) = Y(x)- Y
0
,
V'(x) = AV(x) + g(x) (6.8.15)
The inhomogeneous term g(x) will drop out of all derivations concerning
CONVERGENCE AND STABILITY THEORY FOR MULTISTEP METHODS 397
numerical stability, because we are concerned with differences of solutions of the
equation. Dropping g(x) in (6.8.15), we obtain the model equation (6.8.13). As
further motivation, refer back to the stability results (6.1.5)-(6.1.10), and to the
trapezoidal stability results in (6.5.20)-(6.5.23).
In the case that y' = f(x, y) represents a system of m differentiable equations,
as in (6.1.13), the partial derivative fy(x,y) becomes a Jacobian matrix,
at;
[ry(x,y)] ij = -a
Y
. 1
1:;;; i, j:;;; m
as in (6.2.54). Thus the model equation becomes
y' = Ay + g(x) (6.8.16)
a system of m linear differential equations with A = f y(x
0
, Y
0
). It can be shown
that in most cases, this system reduces to an equivalent system
z; = A;Z; + Y;(x) (6.8.17)
with A
1
, ... , Am the eigenvalues of A (see Problem 24). With (6.8.17), we are back
to the simple model equation (6.8.13), provided we allow A to be complex in
order to include all possible eigenvalues of A.
Applying (6.8.1) to the model equation (6.8.13), we obtain
p p
Yn+l = L ajYn-j + hA L hjYn-j
j=O j= -1
p
(1- hAb_l)Yn+l- L (aj + hAbj)Yn-j = 0
j=O
(6.8.18)
n ~ p (6.8.19)
This is a homogeneous linear difference equation of order p + 1, and the theory
for its solvability is completely analogous to that of (p + 1)st-order homoge-
neous linear differential equations. As a general reference, see Henrici (1962, pp.
210-215) or Isaacson and Keller (1966, pp. 405-417).
We attempt to find a general solution by first looking for solutions of the
special form
n ~
If we can find p + 1 linearly independent solutions, then an arbitrary linear
combination will give the general solution of (6.8.19).
Substituting Yn = rn into (6.8.19) and canceling rn-p, we obtain
p
(1- hAb_
1
)rp+l- L (aj + hAbJrp-j = 0
j=O
(6.8.20)
This is called the characteristic equation, and the left-hand side is the characteris-
i
!
I
i
j
!
----------------------------------------- .!
I
.I
398 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
iic polynomial. The roots are called characteristic roots. Define
p
a(r) = b_
1
rp+I + L b
1
rrl
j=O
and recall the definition (6.8.8) of p(r). Then (6.8.20) becomes
p(r}- hA.a(r} = 0
Denote the characteristic roots by
(6.8.21)
which can be shown to depend continuously on the value of hA.. When hA. = 0,
the equation (6.8.21) becomes simply p(r) = 0, and we have r/0) = r
1
, j =
0, 1, ... , p, for the earlier roots 1j of p(r) = 0. Since r
0
= 1 is a root of p(r), we
let r
0
(hA.) be the root of (6.8.21) for which r
0
(0) = 1. The root r
0
(hA.) is called
the principal root, for reasons that will become apparent later. If the roots r.(hA.)
are all distinct, then the general solution of (6.8.19) is
1
p
Yn = L "Yjh(hA.)] n n;;::O (6.8.22}
j=O
But if 'j(h"A) is a root of multiplicity P > 1, then the following are v linearly
independent solutions of (6.8.19):
These can be used with the solution arising from the other roots to generate a
general solution for (6.8.19), comparable to (6.8.22).
Theorem 6.7 Assume the consistency condition (6.8.7). Then the multistep
method (6.8.1) is stable if and only if the root condition (6.8.9),
(6.8.10) is satisfied.
Proof l. We begin by showing the necessity of the root condition for stabil-
ity. To do so, assume the opposite by letting
llj(O) I > 1
for some j. Consider the differential equation problem y' = 0, y(O) = 0,
with the solution Y(x) = 0. Then (6.8.1) becomes
p
Yn+l = L ajYn-j
j-0
n;<::p {6.8.23)
If we take Yo = y
1
= = Yp = 0, then the numerical solution is clearly
CONVERGENCE AND STABILITY THEORY FOR MULTISTEP METHODS 399
y, = 0, with all n 2::: 0. For the perturbed initial values, take
(6.8.24)
For these initial values,
which is a uniform bound for all small values of h, since the right side is
independent of h. As -+ 0, the bound also tends to zero.
The solution of (6.8.24) with the initial conditions (6.8.24) is simply
For the deviation from {y,},
I
'
N(h)
Max IY,- z,l = ij(O)
Xo:SXnSb .
As h-+ 0, N(h)-+ co and the bound becomes infinite. This proves the
method. is unstable when some lij(O)I > 1. If the root condition is
violated instead by assuming (6.8.10) is false, then a similar proof can be
given. This is left as Problem 29.
2. Assume the root condition is satisfied. The proof of stability will be
restricted to the model equation (6.8.13). A proof can be given for the
general equation y' = f(x, y), but it is a fairly involved modification of
the following proof. The general proof involves the solution of nonhomo-
geneous linear difference equations [see Isaacson and Keller (1966), pp.
405-417, for a complete development]. To further simplify the proof, we
assume that the roots !j(O), j = 0, 1, ... , p are all distinct. The same will
be true of !j(h'A), provided the value of h is kept sufficiently small, say
0 ~ h ~ h
0
Let { y,} and { z,} be two solutions of (6.8.19) on [x
0
, b J, and assume
Max IY,- z,l ~
Osnsp
0 < h ~ h
0
( 6.8.25)
Introduce the error e, = y, - z,. Subtracting using (6.8.19) for each
solution,
p
(1 - hAb_
1
)e,+
1
- L ( aj + h'Ab)en-j = 0
j-0
for xp+l ~ xn+l ~ b The general solution is
p
e, = L: rA!j(h'A)]" n:<::O
j-0
(6.8.26)
( 6.8.27)
'
i
I
J
400 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
The coefficients y
0
(h), ... , Yp(h) must be chosen so that
The solution (6.8.27) will then agree with the given initial perturbations
e
0
, ... , eP, and it will satisfy the difference equation (6.8.26). Using the
bound (6.8.25) and the theory of systems of linear equations, it is fairly
straightforward to show that
Max lr;l ::;; c
1
t:
O:s;i:Sp
0 < h::;; h
0
(6.8.28)
for some constant c
1
> 0. We omit the proof of this, although it can be
carried out easily by using concepts introduced in Chapters 7 and 8.
To bound the solution en on [x
0
, b], we must bound each term
[lj(hAW. To do so, consider the expansion
'j(u) = lj(O) + urj(r) (6.8.29)
for some f between 0 and u. To compute rj(u), differentiate the
characteristic equation
Then
(6.8.30)
By the assumption that lj(O) is a simple root of p ( r) = 0, 0 ::;; j ::;; p, it
follows that p'(!j(O)) =F 0, and by continuity, p'(lj(u)) =F 0 for all suffi-
ciently small values of u. The denominator in (6.8.30) is nonzero, and we
can bound rj(u)
all lui ::;; u
0
for some u
0
> 0.
Using this with (6.8.29) and the root condition (6.8.9}, we have
llj(h;\) I::;; 10(0) I+ c
2
lhAI ::;; 1 + c
2
jh;\l
lfrAh;\)] nl::;; [1 + c21hAir::;; ec2nih>-i::;; ec2(b-xo>i>..i (6.8.31)
I
_.:/
CONVERGENCE AND STABILITY THEORY FOR MULTISTEP METHODS 401
for all 0 < h ~ h
0
Combined with (6.8.27) and (6.8.28),
Max lenl ~ c3lt:lecz(b-xo)i>-i
x
0
s;x.s;b
for an appropriate constant c
3
0 < h ~ h
0
Convergence theory The following result generalizes Theorem 6.6 of Section
6.3, with necessary and sufficient conditions being given for the convergence of
multistep methods.
Theorem 6.8 Assume the consistency condition (6.8.7). Then the multistep
method (6.8.1) is convergent if and only if the root condition (6.8.9),
(6.8.10) is satisfied.
Proof 1. We begin by showing the necessity of the root condition for conver-
gence, and again we use the problem y' = 0, y(O) = 0, with the solution
Y(x) = 0. The multistep method (6.8.1) becomes
p
Yn+l = L ajYn-j
j=O
with y
0
, ... , Yp chosen to satisfy
1J(h) = Max IYnl -+ 0
Os;ns;p
~ p (6.8.32)
as h-+ 0 (6.8.33)
Suppose that the root condition is violated. We show that (6.8.32) is not
convergent to Y( x) = 0.
Assume that some l'j(O)I > 1. Then a satisfactory solution of (6.8.32)
is
(6.8.34)
Condition (6.8.33) is satisfied since
But the solution { Yn} does not converge. First,
Consider those values of h = bjN(h). Then L'Hospital's rule can be
used to show that
b
Limit -!r/O) IN = 00
N-+oo N
showing (6.8.32) does not converge to the solution Y(x) = 0.
i
i
I
I
402 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
Assume (6.8.9) of the root condition is satisfied, but that some 'j(O) is
a multiple root of p( r) and I 'j(O) I = 1. Then the preceding form of proof
is still satisfactory, but we must use the solution
0::;; n::;; N(h)
This completes the proof of the necessity of the root condition.
2. Assume that the root condition is satisfied. As with the previous
theorem, it is too difficult to give a general proof of convergence for an
arbitrary differential equation. For that, see the development in Isaacson
and Keller (1966, pp. 405-417). The present proof is restricted to the
model equation (6.8.13), and again we assume the roots 'j(O) are distinct,
in order to simplify the proof.
The multistep method (6.8.1) becomes (6.8.18) for the model equation
y' = 'Ay, y(O) = 1. We show that the term y
0
[r
0
(h'AW in its general
solution
p
Yn = L "Yj['J(h'A)] n {6.8.35)
j=O
will converge to the solution Y(x) = e>..x on [0, b]. The remaining terms
"Yj['j(h'A)r, j = 1, 2, ... , p, are parasitic solutions, and they can be
shown to converge to zero as h ~ 0 (see Problem 30).
Expand r
0
(h'A) using Taylor's theorem,
From (6.8.30),
o{1)
r ~ O ) = p'(
1
)
and using consistency condition (6.8.7), this leads to r
0
(0) = 1. Then
over every finite interval 0 ::;; xn ::;; b. Thus
Max jfro{h'A)] n- e>..xl- 0
Osx.::;.b
as h ~ 0 {6.8.37)
We must now show that the coefficient 'Yo- 1 ash ~ 0.
CONVERGENCE AND STABILITY THEORY FOR MULTISTEP METHODS 403
The coefficients y
0
( h), ... , Yp( h) satisfy the linear system
Yo + Yr + + Yp =Yo
Yo[r
0
{hA)J + +yph(hA)] Y1
Yofro(hA)] P + +yP[rP(hA)] P,;,. Yp
(6.8.38)
The initial values y
0
, , Yp depend on h and are assumed to satisfy
71(h) = Max je'h- Ynl- 0
Osn:Sp
as h- 0
But this implies
Limityn = 1
h-+0
(6.8.39)
The coefficient y
0
can be obtained by using Cramer's rule to solve
(6.8.38):
Yo
1 1
Yr
rr rP
Yp
rP rP
1 p
Yo=
1 1 1
(6.8.40)
ro rl rP
r[ r{
rP
p
The denominator converges to the Vandermonde determinant for r
0
(0)
= 1, r
1
(0), ... , rp(O), and this is nonzero since the roots are distinct.(see
Problem 1 of Chapter 3). By using (6.8.39), the numerator converges to
the same quantity as h - 0. Therefore, y
0
- 1 as h - 0. Using this,
along with (6.8.37) and Problem 30, the solution { Yn} converges to
Y(x) = eh on [0, b].
The following is a well-known result; it is a trivial consequence of Theorems
6.7 and 6.8.
Corollory Let (6.8.1) be a consistent multistep method. Then it is convergent if
and only if it is stable.
Relative stability and weak stability Consider again the model equation (6.8.13)
and its numerical solution (6.8.22). The past theorem stated that the parasitic
404 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
solutions "Yj['j(hl\W will converge to zero as h -4 0. But for a fixed h with
increasing xn, we also would like them to remain small relative to the principal
part of the solution y
0
[r
0
(hA}r. This will be true if the characteristic roots satisfy
j = 1, 2, ... ' p (6.8.41)
for all sufficiently small values of h. This leads us to the definition of relative
stability.
We say that the method (6.8.1) is relatively stable if the characteristic roots
'j(hl\) satisfy (6.8.41) for all sufficiently small nonzero values of !hAl. And the
method is said to satisfy the strong root condition if
I'J(O) I < 1
j = 1,2, ... , p (6.8.42)
This is an easy condition to check, and it implies relative stability. Just use the
of the roots 'j(hA) with respect to h;\ to have (6.8.42) imply (6.8.41).
Relative stability does not imply the strong root condition, although they are
equivalent for most methods [see Problem 36(b)]. If a multistep method is stable
but not relatively stable, then it will be called weakly stable.
Example 1. For the midpoint method,
It is weakly stable according to (6.8.41) when A < 0, which agrees with what was
shown earlier in Section 6.4.
2. The Adams-Bashforth and Adams-Moulton methods, (6.7.22) and (6.7.26),
have the same characteristic polynomial when h = 0,
p(r) = rp+I- rP
(6.8.44)
The roots are r
0
= 1, rj = 0, j = 1, 2, ... , p; thus the strong root condition is
satisfied and the Adams methods are relatively stable.
Stability regions In the preceding 9iscussions of stability, the values of h were
required to be sufficiently small in order to carry through the derivations. Little
indication was given as to just how small h should be. It is clear that if h is
required to be extremely small, then the method is impractical for most prob-
lems; thus we need to examine the permissible values of h. Since the stability
depends on the characteristic roots, and since they in turn depend on hA, we are
interested in determining the values of hA for which the multistep method (6.8.1)
is stable in some sense. To cover situations arising when solving systems of
differential equations, it is that the value of A be allowed to be
complex, as noted following (6.8.17).
To motivate the later discussion, we consider the stability of Euler's method.
Apply Euler's method to the equation
y' = Ay + g(x) y(O) = Y
0
(6.8.45)
CONVERGENCE AND STABILITY THEORY FOR MULTISTEP METHODS 405
/
obtaining
n;::::;O
Yo= Yo
(6.8.46)
Then consider the perturbed problem
n;:::: 0 z
0
= Y
0
+ c:: (6.8.47)
For the original Eq. (6.8.45), this perturbation of Y
0
leads to solutions Y(x) and
Z(x) satisfying
Y(x)- Z(x) = c::eAx x;:::: 0
In this original problem, we would ordinarily be interested in the case with
Real(;\).:::;; 0, since then I Y(x)- Z(x)l would remain bounded as x..., 0. We
further restrict our interest to the case of Real ( ;\) < 0, so that Y( x) - Z ( x) ..., 0
as x ..., oo. For such ;\, we want to find the values of h so that the numerical
solutions of (6.8.46) and (6.8.47) will retain the behavior associated with Y(x)
and Z(x).
Let e,. = z,. - y,.. Subtracting (6.8.46) from (6.8.47),
en+1 = e,. + h;\e,. = (1 + h;\)e,.
Inductively,
en= (1 + h;\r. {6.8.48)
Then e,...., 0 as x,...., oo if and only if
11 + h;\1 < 1 {6.8.49)
This yields a set of complex values h;\ that form a circle of radius 1 about the
point -1 in the complex plane. If h;\ belongs to this set, then y,. - z,. ..., 0 as
xn..., oo, but not otherwise.
To see that this discussion is also important for convergence, realize that the
original differential equation can be looked upon as a perturbation of the
approximating equation (6.8.46). From (6.2.17), applied to (6.8.45),
h2
Yn+l = Y,. + h[;\Yn + g{x..}] + (6.8.50)
Here we have a perturbation of the equation (6.8.46) at every step, not at just the
initial point x
0
= 0. Nonetheless, the preceding stability analysis can be shown to
apply to this perturbation of (6.8.46). The error formula (6.8.48) will have to be
suitably modified, but it will still depend critically on the bound (6.8.49) (see
Problem 40).
Example Apply Euler's method to the problem
y' = ;\y + {1- ;\)cos(x)- (1 + ;\)sin{x) y(O) = 1 (6.8.51)
406 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
Table 6.17 Euler's method for (6.8.51)
"A X Error: h = .5 Error: h = .1 Error: h = .01
- 1 1 -2.46E- 1 -4.32E- 2 -4.22E- 3
2 -2.55E- 1 -4.64E- 2 -4.55E- 3
3 -2.66E- 2 -6.78E- 3 -7.22E- 4
4 2.27E- 1 3.91E- 2 3.78E- 3
5 2.72E- 1 4.91E- 2 4.81E- 3
- 10 1 3.98E- 1 -6.99E- 3 -6.99E- 4
2 6.90E + 0 -2.90E- 3 -3.08E-4
3 l.llE + 2 3.86E- 3 3.64E- 4
4 1.77E + 3 7.07E- 3 7.04E- 4
5 2.83E + 4 3.78E- 3 3.97E- 4
-50 1 3.26E + 0 1.06E + 3 -1.39E- 4
2 1.88E + 3 1.11E+9 -5.16E- 5
3 LOSE+ 6 1.17E + 15 8.25E- 5
4 6.24E + 8 1.23E + 21 1.41E- 4
5 3.59E + 11 1.28E + 27 7.00E- 5
whose true solution is Y(x) = sin(x) + cos(x). We give results for several
values of i\ and h. For A= -1, -10, -50, the bound (6.8.49) implies the
respective bounds on h of
O<h<2
1
O<h<-=2
5 .
1
0 < h <- = .04
25
The use of larger values of h gives poor numerical results, as seen in Table 6.17.
The preceding derivation with Euler's method motivates our general approach
to finding the set of all hA for which the method (6.8.1) l.s stable. Since we
consider only cases with Real (A) < 0, we want the numerical solution { Yn} of
(6.8.1), when applied to the model equation y' = Ay, to tend to zero as xn oo,
for all choices of initial values y
0
, y
1
, , Yr The set of all hA for which this is
true is called the region of absolute stability of the method (6.8.1). The larger this
region, the less the restriction on h in order to have a stable numerical solution.
When (6.8.1) is applied to the model equation, we obtain the earlier equation
(6.8.18), and its solution is given by (6.8.22), namely
p
Yn = L rArihA)] n
n;;o:O
j=O
provided the characteristic roots r
0
(hA), ... , r
0
(hi\) are distinct. To have this
tend to zero as n oo, for all choices of y
0
, ... , yP, it is necessary and sufficient
to have
h(hA) I < 1 j = 0,1, ... , p (6.8.52)
CONVERGENCE AND STABILITY THEORY FOR MULTISTEP METHODS 407
The set of all hA. satisfying this set of inequalities is also called the region of
absolute stability. This region is contained in the set defined in the preceding
paragraph, and it is usually equal to that set. We work only with (6.8.52) in
finding the region of absolute stability.
Example Consider the second-order Adams-Bashforth method
h
Yn+l = Yn +
The characteristic equation is
and its roots are
Imaginary
3.00
_ Real
-2.00
-3.00
k=6
k=5
k=4
2.00
Figure 6.7 Stability regions for Adams-Bashforth
methods. The method of order k is stable inside the
region indicated left of origin. [Taken from Gear (1971),
p. 131, with permission.]
(6.8.53)
408 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
Imaginary
Real
Figure 6.8 Stability regions for Adams-Moulton methods. The
method of order k is stable inside the region indicated. [Taken from
Gear (1971), p. 131, with permission.]
The region of absolute stability is the set of hA. for which
h(hA.) I < 1
For X. real, the acceptable values of hX. are - I < hX. < 0.
The boundaries of the regions of absolute stability of theAdams-Bashforth
and the Adams-Moulton methods are given in Figures 6.7 and 6.8, respectively.
For Adams-Moulton formulas with one iteration of an Adams-Bashforth pre-
dictor, the regions of absolute stability are given in Shampine and Gordon (1975,
pp. 135-140).
From these diagrams, it is clear that the region of absolute stability becomes
smaller as the order of the method increases. And for formulas of the same order,
the Adams-Moulton formula has a significantly larger region of absolute stabil-
ity than the Adams-Bashforth formula. The size of these regions is usually quite
acceptable from a practical point of view. For example, the real values of hA. in
the region of absolute stability for the fourth-order Adams-Moulton formula are
given by - 3 < hA. < 0. This is not a serious restriction on h in most cases .
. The Adams family of formulas is very convenient for creating a variable-order
algorithm, and their stability regions are quite acceptable. They will have
difficulty with problems for which A. is negative and large in magnitude, and
these problems are best treated by other methods, which we consider in the next
section.
There are special methods for which the region of absolute stability consists of
all .complex values hX with Real(hX) < 0. These methods are called
and with them there is no restriction on h in order to have stability of the type
CONVERGENCE AND STABILITY THEORY FOR MULTISTEP METHODS 409
Table 6.18 Example of trapezoidal rule: h = .5
X Error: A= -1 Error: A = -10 Error: A = -50
2 -1.13- 2 -2.78-3 -7.91-4
4 -1.43- 2 -8.91- 5 -8.91- 5
6 2.02E- 2 2.77- 3 4.72-4
8 -2.86-3 -2.22-3 -5.11- 4
10 -1.79- 2 -9.23-4 -1.56-4
we have been considering. The trapezoidal rule is an example of such a method
[see (6.5.24)-(6.5.25)].
Example Consider the backward Euler method:
Yn+l = Yn + hf(xn+l Yn+l)
n ~
Applying this to the model equation y' = Ay and solving, we have
Then Yn- 0 as xn- oo if and only if
1
<1
11- hAl
n ~
(6.8.54)
( 6.8.55)
This will be true for all hA with Real(A) < 0, and the backward Euler method is
an A-stable method.
Example Apply the trapezoidal method to the problem (6.8.51), which was
solved earlier with Euler's method. We use a stepsize of h = .5 for A=
-1, -10, -50. The results are given in Table 6.18. They illustrate that the
trapezoidal rule does not become unstable as 1 A 1 increases, while Real (A) < 0.
It would be useful to have A-stable multistep methods of order greater than 2.
But a result of Dahlquist (1963) shows there are no such methods. We examine
some higher order methods that have most of the needed stability properties in
the following section.
6.9 Stiff Differential Equations and the Method of Lines
The numerical solution of stiff differential equations has, within the past ten to
fifteen years, become a much studied subject. Such equations (including systems
of differential equations) have appeared in an increasing number of applications,
in subjects as diverse as chemical kinetics and the numerical solution of partial
differential equations. In this section, we sketch some of the main ideas about this
'
i
I
m
410 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
subject, and we show its relation to the numerical solution of the simple heat
equation.
There are many definitions of the concept of stiff differential equation. The
most important common feature of these definitions is that when such equations
are being solved with standard numerical methods (e.g., the Adams methods of
Section 6.7), the stepsize h is forced to be extremely small in order to maintain
stability-far smaller than would appear to be necessary based on a considera-
tion of the truncation error. An indication of this can be seen with Eq. (6.8.51),
which was solved in Table 6.17 with Euler's method. In that case, the unknown
Y(x) did not change with A, and therefore the truncation error was also
independent of A. But the actual error was strongly affected by the magnitude of
A, with hA required to satisfy the stability condition 11 +hAl < 1 to obtain
convergence. As IAI increased, the size of h had to decrease accordingly. This is
typical of the behavior of standard numerical methods when applied to stiff
differential equations, with the major difference being that the actual values of IAI
are far larger in real life examples, for example, A = -10
6
We now look at the most comq1.0n class of such differential equations, basing
our examination on consideration of the linearization of the system y' = f(x, y)
as developed in (6.8.14)-(6.8.17):
y' = Ay + g(x) (6.9.1)
with A = fy(x
0
, Y
0
) the Jacobian matrix of f. We say the differential equation
y' = f(x, y) is stiff if some of eigenvalues Aj of A, or more generally of fy(x, y),
have a negative real part of very large magnitude. We study numerical methods
for stiff equations by considering their effect on the model equation
y'=Ay+g(x) {6.9.2)
with Real (A) negative and very large in magnitude. This approach has its
limitations, some of which we indicate later, but it does give us a means of
rejecting unsatisfactory methods, and it suggests some possibly satisfactory
methods.
The concept of a region of absolute stability, introduced in the last section, is
the initial tool used in studying the stability of a numerical method for solving
stiff differential equations. We seek methods whose stability region contains the
entire negative real axis and as much of the left half of the complex plane as
possible. There are a number of ways to develop such methods, but we only
discuss one of them-obtaining the backward differentiation formulas (BDFs).
Let PP(x) denote the polynomial of degree p that interpolates Y(x) at the
points Xn+l Xn, . , Xn-p+l for some p;;::: 1:
p-1
PP{x) = L Y(xn-j)lj,n(x)
j= -1
(6.9.3)
with {lj, n( x)} the Lagrange interpolation basis functions for the nodes
~ I
I
STIFF DIFFERENTIAL EQUATIONS AND THE METHOD OF LINES 411
Table 6.19 Coefficients of BDF method (6.9.6)
p {3 ao a! a2 aJ a4
1 1 1
2 4 1
2 - -
3 3 3
6 18 9 2
3
11 11 11 11
12 48 36 16 3
4
25 25 25 25 25
60 300 300 200 75 12
5
137 137 137 137 137 137
60 360 450 400 225 72
6
147 147 147 147 147 147
xn+I xn-p+l {see (3.1.5)) .. Use
Combining with (6.9.3) and solving for Y(xn+
1
),
p-1
Y(xn+l) ='= L ajY(xn-) + h/3/(xn+l Y{xn+
1
))
j=O
The p-step BDF method is given by
p-1
Yn+l = L ajYn-j + h/3/(xn+l Yn+l)
j=O
as
10
147
(6.9.4)
{6.9.5)
{6.9.6)
The coefficients for the cases of p = 1, ... , 6 are given in Table 6.19. The case
p = 1 is simply the backward Euler method, which was discussed following
(6.8.54) in the last section. The truncation error for (6.9.6) can be obtained from
the error formula for numerical differentiation, given in (5.7.5):
(6.9.7)
for some xn-p+l ,:5; ~ n ,:5; xn+l
The regions of absolute stability for the formulas of Table 6.19 are given in
Gear (1971, pp. 215-216). To create these regions, we must find all values hA for
which
llj(h:\) I< 1 j=0,1, ... ,p (6.9.8)
I
I
I
l
l
I
I
..... ---- - - -- -_____ I
412 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
where the characteristic roots rj(h'A) are the solutions of
p-1
rP = L ajrp-l-j + hA.f3rP
j-0
{6.9.9)
It can be shown that for p = 1 and p = 2, the BDFs are A-stable, and that for
3 p 6, the region of absolute stability becomes smaller as p increases,
although containing the entire negative real axis in each case. For p ;;::: 7, the
regions of absolute stability are not acceptable for the solution of stiff problems.
For more discussion of these stability regions, see Gear (1971, chap. 11) and
Lambert (1973, chap. 8).
There are still problems with the BDF methods and with other methods that
are chosen solely on the basis of their region of absolute stability. First, with the
model equation y' = A.y, if Real('A) is of large magnitude and negative, then the
solution Y(x) goes to the zero very rapidly, and as !Real('A)i increases,
the convergence to zero of Y( x) becomes more rapid. We would like the same
behavior to hold for the numerical solution of the model equation { y, }. But with
the A-stable trapezoidal rule, the solution [from (6.5.24)] is
[
h'A ]"
1+-
Y, = h
2
A Yo
1--
2
n;::.:O
If !Real(A)f is large, then the fraction inside the brackets is near to -1, andy,
decreases to 0 quite slowly. Referring to the type of argument used with the Euler
method in (6.8.45)-(6.8.50), the effect of perturbations will not decrease rapidly
to zero for larger values of A. Thus the trapezoidal method may not be a
completely satisfactory choice for stiff problems. In comparison the A-stable
backward Euler method has the desired behavior. From (6.8.55), the solution of
the model problem is
[
1 .. ] "
Yn =
1
_ hA Yo
n;::.:O
As IAI increases, the sequence {y,} goes to zero more rapidly. Thus the
backward Euler solution better reflects the behavior of the true solution of the
model equation.
A second problem with the case of stability regions is that it is based on using
constant A and linear problems. The linearization (6.9.1) is often valid, but not
always. For example, consider the second-order linear problem
y" + ay' + (1 + b cos(27rx})y = g(x) x;::.:O (6.9.10)
in which onl! coefficient is not constant. Convert it to the equivalent system
Y2 = -(1 + b cos{27rx))y
1
- ay
2
+ g(x)
(6.9.11)
STIFF DIFFERENTIAL EQUATIONS AND THE METHOD OF LINES 413
We will assume a > 0, 1 b 1 < 1. The eigenvalues of the homogeneous equation
[g(x) = 0] are
-a Ja
2
- 4[1 + b cos (2'1Tx)]
;\. = ---'----------
2
(6.9.12)
These are negative real numbers or are complex numbers with negative real parts.
On the basis of the stability theory for the constant coefficient (or constant A)
case, we would assume that the effect of all perturbations in the initial data
would die away as x --+ oo. But in fact, the homogeneous part of (6.9._10) will
have unbounded solutions. Thus there will be perturbations of the initial values
that will lead to unbounded perturbed solutions in (6.9.10). This calls into
question the validity of the use of the model equation y' = ;\.y + g(x). Its use
suggests methods that we may want to study further, but by itself, this approach
is not sufficient to encompass the vast variety of linear and nonlinear problems.
The example (6.9.10) is taken from Aiken (1985, p. 269).
Solving the finite difference method We illustrate the problem by considering
the backward Euler method:
Yn+1 = Yn + hj(xn+1 Yn+1)
If the ordinary iteration formula
(j+l) - + hif( (j) )
Yn+I - Yn xn+I Yn+I
is used, then
n:::::O
_ U+l),;, h aj(xn+l Yn+I) ( _ Ul]
Yn+l Yn+l ay Yn+l Yn+l
For convergence,-we would need to have
(6.9.13)
(6.9.14)
(6.9.15)
But with stiff equations, this would again force h to be very small, which we are
trying to avoid. Thus another rootfinding method must be used to solve for Yn+l
in (6.9.13).
The most popular methods for solving (6.9.13) are based on Newton's method.
For a single differential equation, Newton's method for finding Yn+l is
{6.9.16)
for j::::: 0. A crude initial guess is y ~ ~
= Yn although generally this can be
improved upon. With systems of differential equations Newton's method be-
comes very expensive. To decrease the expense, the matrix
some z,;, Yn (6.9.17)
;
i
I
I
I
I
. .. --- .... 1
414 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
is used for all j and for a number of successive values of n. Thus Newton's
method [see Section 2.11] for solving the system version of (6.9.13) is approxi-
mated by
{6.9.18)
YHil> = + a<n
for j 0. This amounts to solving a number of linear systems with the same
coefficient matrix. This can be done much more cheaply than when the matrix is
being modified (see the material in Section 8.1). The matrix in (6.9.17) will have
to be updated periodically, but the savings will still be very significant when
compared to an exact Newton's method. For a further discussion of this topic,
see Aiken (1985, p. 7) and Gupta et al. (1985, pp. 22-25). For a survey of
computer codes for solving stiff differential equations, see Aiken (1985, chap. 4).
The method of lines Consider the following parabolic partial differential equa-
tion problem:
U{O, t) = d
0
{t)
U(x,O) = f(x)
O<x<1
t>O
t 0
{6.9.19)
( 6.9.20)
{6.9.21)
The notation U, and Uxx refers to partial derivatives with respect to t and x,
respectively. The unknown function U(x, t) depends on the time t and a spatial
variable x. The conditions (6.9.20) are called boundary conditions, and (6.9.21) is
called an initial condition. The solution U can be interpreted as the temperature
of an insulated rod of length 1, with U(x, t) the temperature at position x and
time t; thus (6.9.19) is often called the heat equation. The functions G, d
0
, d
1
,
and f are assumed given and smooth. For a development of the theory of
(6.9.19)-(6.9.21), see Widder (1975) or any standard introduction to partial
differential equations. We give the method of lines for solving for U, a numerical
method that has become much more popular in the past ten to fifteen years. It
will also lead to the solution of a stiff system of ordinary differential equations.
Let m > 0 be an integer, and define 8 = 1/m,
j=0,1, ... ,m
We discretize Eq. (6.9.19) by approximating the spatial derivative. Recall the
formulas (5.7.17) and (5.7.18) for approximating second derivatives. Using this,
8i a
4
U(
t)
12 ax
4
for j = 1, 2, ... , m - 1. Substituting into (6.9.19),
( )
_ U(xJ+l t)- 2U(x
1
, t) + U(x
1
_
1
, t) ( )
U, x
1
, t -
82
+ G x
1
, t
8
2
12 ax
4
{6.9.22)
STIFF DIFFERENTIAL EQUATIONS AND THE METHOD OF LINES 415
Equation (6.9.I9) is to be approximated at each interior node point xj. The
unknown ~ j E fxj-I xj+d
Drop the final term in (6.9.22), the truncation error in the numerical differenti-
ation. Forcing equality in the resulting approximate equation, we obtain
(6.9.23)
for j =I, 2, ... , m- I. The functions u/t) are intended to be approximations
of U(xp t ), I ~ j ~ m - I. This is the method of lines approximation to (6.9.I9),
and it is a system of m - I ordinary differential equations. Note that u
0
(t) and
um(t), which are needed in (6.9.23) for j = 1 and j = m- 1, are given using
(6.9.20):
(6.9.24)
The initial condition for (6.9.23) is given by (6.9.21):
I ~ j ~ m- 1 (6.9.25)
The name method of lines comes from solving for U(x, t) along the lines (xj. t),
t ~ 0, 1 ~ j ~ m - I, in the (x, t)-plane.
Under suitable smoothness assumptions on the functions d
0
, d
1
, G, and f, it
can be shown that
M.ax ju(xj, t) - u/t) ~ C r ~
Os;Js;m
Os; cs; T
(6.9.26)
Thus to complete the solution process, we need only solve the system (6.9.23).
It will be convenient to write (6.9.23) in matrix form. Introduce
-2 I 0 0 0
1 -2 I 0
I 0 I -2 1
A= 82
0
(6.9.27)
I -2 1
0 0 0 1 -2
The matrix A is of order m - I. In the definitions of u and g, the superscript T
indicates matrix transpose, so that u and g are column vectors of length m - I.
!
I
I
j
416 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
U:sing these matrices, Eqs. (6.9.23)-(6.9.25) can be rewritten as
u'( t) = Au( t) + g( t) u(O) = u
0
(6.9.28)
If Euler's method is applied, we have the numerical method
(6.9.29)
with tn = nh and vn = u(tn). This is a well-known numerical method for the heat
equation, called the simple explicit method. We analyze the stability of (6.9.29)
and some other methods for solving (6.9.28).
Equation (6.9.28) is in the form of the model equation, (6.9.1), and therefore
we need the eigenvalues of A to examine the stiffness of the system. It can be
shown that these eigenvalues are all real and are given by
1 ~ j ~ m -1 (6.9.30)
We leave the proof of this to Problem 6 in Chapter 7. Directly examining this
formula,
(6.9.31)
A = -stn
2
=-
-4 . ( (m- 1)'11) . -4
m-1 82 2m 82
with the approximations valid for larger m. It can be seen that (6.9.28) is a stiff
system if 8 is small.
Applying (6.9.31) and (6.8.49) to the analysis of stability in (6.9.29), we must
have
j=1, ... ,m-1
Using (6.9.30), this leads to the equivalent statement
4h ( j'IT )
0 <
2
sin
2
- < 2
8 2m
1 ~ j ~ m 1
This will be satisfied if 4hj8
2
~ 2, or
{6.9.32)
If 8 is at all small, say 8 = .01, then the time step h must be quite small to have
stability,
;
... l
-.!
STIFF DIFFERENTIAL EQUATIONS AND THE METHOD OF LINES 417
In contrast to the restriction (6.9.32) with Euler's method, the backward Euler
method has no such restriction since it is A-stable. The method becomes
(6.9.33)
To solve this linear problem for Vn+l
(6.9.34)
This is a tridiagonal system of linear equations (see Section 8.3). It can be solved
very rapidly, with approximately 5m arithmetic operations per time step, exclud-
ing the cost of computing the right side in (6.9.34). The cost of solving the Euler
method (6.9.29) is almost as great, and thus the solution of (6.9.34) is not
especially time-consuming.
Example Solve the partial differential equation problem (6.9.19)-(6.9.21) with
the functions G, d
0
, d
1
, and f determined from the known solution
U = e -.lr sin ('IT X)
O ~ x ~ t ~ O (6.9.35)
Results for Euler's method (6.9.29) are given in Table 6.20, and results for the
backward Euler method (6.9.33) are given in Table 6.21.
For Euler's method, we take m = 4, 8, 16, and to maintain stability, we take
h = S
2
/2, from (6.9.32). Note this leads to the respective time steps of h ~ .031,
.0078, .0020. From (6.9.26) and the error formula for Euler's method, we would
expect the error to be proportional to 8
2
, since h = 8
2
/2. This implies that the
Table6.20 The method of lines: Euler's method
Error Error Error
t m=4 Ratio m= 8 Ratio m = 16
1.0 3.89E- 2 4.09 9.52E- 3 4.02 2.37E- 3
2.0 3.19E- 2 4.09 7.79E- 3 4.02 1.94E- 3
3.0 2.61E- 2 4.09 6.38E- 3 4.01 1.59E- 3
4.0 2.14E- 2 4.10 5.22E- 3 4.02 1.30E- 3
5.0 1.75E- 2 4.09 4.28E- 3 4.04 1.06E- 3
Table 6.21 The method of lines: backward Euler's method
Error Error Error
t m=4 m = 8 m = 16
1.0 4.45E- 2 l.lOE- 2 2.86E- 3
2.0 3.65E- 2 9.01E- 3 2.34E- 3
3.0 2.99E- 2 7.37E- 3 1.92E- 3
4.0 2.45E- 2 6.04E- 3 1.57E- 3
5.0 2.00E- 2 4.94E- 3 1.29E- 3
;
'
)
'
I
.J
418 NUMERlcAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
error should decrease by a factor of 4 when m is doubled, and the results in
Table 6.20 agree. In the table, the column Error denotes the maximum error at
the node points (xj, t), 0 5.j 5. n, for the given value oft.
For the solution of (6.9.28) by the backward Euler method, there need no
longer be any connection between the space step 8 and the time step h. By
observing the error formula (6.9.26) for the method of lines and the truncation
error formula (6.9.7) (use p = 1) for the backward Euler method, we see that the
error in solving the problem (6.9.19)-(6.9.21) will be proportional to h + 8
2
For
the unknown function U of (6.9.34), there is a slow variation with t. Thus for the
truncation error associated with the time integration, we should be able to use a
relatively large time step h as compared to the space step 8, in order to have the
two sources of error be relatively equal in size. In Table 6.21, we use h = .1 and
m ==:' 4, 8, 16. Note that this time step is much larger than that used in Table 6.20
for Euler's method, and thus the backward Euler method is much more efficient
for this particular example.
For more of the method of lines, see Aiken (1985, pp. 124-148).
For some method-of-lines codes to solve systems of nonlinear parabolic partial
differential equations, in one and two space variables, see Sincovec and Madsen
(1975) and Melgaard and Sincovec (1981).
6.10 Single-Step and Runge-Kutta Methods
Single-step methods for solving y' = f(x, y) require only a knowledge of the
numerical solution Yn in order to compute the next value Yn+l This has obvious
advantages over the p-step multistep methods that use several past values
{Jn, ... , Yn-p+d, since then the additional initial values {y
1
,. . , Yp-d have to
be computed by some other numerical method.
The best known one-step methods are the Runge-Kutta methods. They are
fairly simple to program, and their truncation error can be controlled in a more
straightforward manner than for the multistep methods. For the fixed-order
multistep methods that were used more commonly in the past, the Runge-Kutta
methods were the usual tool for calculating the needed initial values for the
multistep method. The major disadvantage of the Runge-Kutta methods is that
they use many more evaluations of the derivative f(x, y) to attain the same
accuracy, as compared with the multistep methods. Later we will mention some
results on comparisons of variable-order Adams codes and fixed-order
Runge-Kutta codes.
The most simple one-step method is based on using the Taylor series. Assume
Y( x) is r + 1 times continuously differentiable, where Y( x) is the solution of the
initial value problem
y' = /(x, y) (6.10.1)
Expand Y(x
1
) about x
0
using Taylor's theorem:
h' hr+l
Y(x
1
) = Y(x
0
) + hY'(x
0
) + +- y<r>(x
0
) + ( ) (6.10.2)
r! . r + 1 !
..... ..... J
SINGLE-STEP AND RUNGE-KUTIA METHODS 419
for some x
0
Xp B:)' dropping the remainder term, we have an approxima-
tion for Y(x
1
), provided we can calculate Y"(x
0
); y<r>(x
0
). Differentiate
Y'(x) = f(x, Y(x)) to obtain
Y"(x) = jx(x, Y(x)) + fv(x, Y(x))Y'(x),
Y" = fx + /y/
and proceed similarly to obtain the higher order derivatives of Y( x ).
Example Consider the problem
y(O) = 1
with the solution Y(x) = 1/(1 + x). Then Y" = -2YY' = 2Y
3
, and (6.10.2)
with r = 2 yields
We drop the remainder to obtain an approximation of Y(x
1
). This can then be
used in the same manner to obtain an approximation for Y(x
2
), and so on. The
numerical method is
(6.10.3)
Table 6.22 contains the errors in this numerical solution at a selected set of node
points. The grid sizes used are h = .125 and h = .0625, and the ratio of the
resulting errors is also given. Note that when h is halved, the ratio is almost 4.
This can be justified theoretically since the rate of convergence can be shown to
be O(h
2
), with a proof similar to that given in Theorem 6.3 or Theorem 6.9,
given later in this section.
The Taylor series method can give excellent results. But it is bothersome to use
because of the need to analytically differentiate f(x, y). The derivatives can be
very difficult to calculate and. very time-consuming to evaluate, especially for
systems of equations. These differentiations can be carried out using a symbolic
Table 6.22 Example of the Taylor series method (6.10.3)
h = .0625 h = .125
X Y(x)- Y(x)- Ratio
2.0 .333649 -3.2E- 4 -1.4E- 3 4.4
4.0 .200135 -1.4E- 4 -5.9E- 4 4.3
6.0 .142931 -7.4E-5 -3.2E- 4 4.3
8.0 .111157 -4.6E- 5 -2.0E- 4 4.3
10.0 .090941 -3.1E- 5 -1.4E- 4 4.3
420 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
manipulation language on a computer, and then it is easy to produce a Taylor
series method. However, the derivatives are still likely to be quite time-consuming
to evaluate, and it appears that methods based on evaluating just f(x, y) will
remain more efficient. To imitate the Taylor series method, while evaluating only
f(x, y), we tum to the Runge-Kutta formulas.
Runge-Kutta methods The Runge-Kutta methods are closely related to the
Taylor series expansion of Y(x) in (6.10.2), but no differentiations of f are
necessary in the use of the methods. For notational convenience, we abbreviate
Runge-Kutta to RK. All RK methods will be written in the form
Yn+l = Yn + hF(xn, Yn> h; /)
n;;::;O (6.10.4)
We begin with examples of the function F, and will later discuss hypotheses for
it. But at this point, it should be intuitive that we want
F(x, Y(x), h;!) Y'(x) = f(x,.Y(x)) (6.10.5)
for all small values of h. Define the truncation error for (6.10.4) by
n;;:: 0 (6.10.6)
and define Tn(Y) implicitly by
Rearranging (6.10.6), we obtain
n;;:: 0 (6.10.7)
In Theorem 6.9, this will be compared with (6.10.4) to prove convergence of { Yn}
toY.
Example 1. Consider the trapezoidal method, solved with one iteration using
Euler's method as the predictor:
n ;;:: 0 ( 6.10.8)
In the notation of (6.10.4),
F(x, y, h; f)= Hf(x, y) + J(x + h, Y + hf(x, y))]
As can be seen in Figure 6.9, F is an average slope of Y(x) on [x, x + h].
2. The following method is also based on obtaining an average slope for the
solution on [xn, xn+d:
(6.10.9)
For this case,
SINGLE-STEP AND RUNGE-KUTTA METHODS 421
Slope= [(x, Y(x))
Slope= f(x +h, Y(x) +h[(x, Y(x)))
Average
slope= F
X
Y(x) +hF(x, h, Y(x); fJ
x+h
Figure 6.9 Illustration of Runge-Kutta method (6.10.8).
F(x, y, h; f)= f(x + th, Y + thf(x, y))
The derivation of a formula for the truncation error is linked to the derivation
of these methods, and this will also be true when considering RI\ methods of a
higher order. The derivation of RK methods will be illustrated by deriving a
family of second-order formulas, which will include (6.10.8) and (6.10.9). We
suppose F has the general form
F(x, y, h; f) = Yd(x, y) + y
2
f(x + ah, y + f3hf(x, y )) (6.10.10)
in which the four constants y
1
, y
2
, a, f3 are to be determined.
We will use Taylor's theorem (Theorem 1.5) for functions of two variables to
expand the second term on the right side of (6.10.10), through the second
derivative terms. This gives
F(x, y, h;f) = -y
1
f(x, y) + "'(
2
{fix, y) + h[af, + 13/f,.]
I Also we will need some derivatives of Y'(x) = f(x, Y(x)), namely
~ = = ~ ~ ~ = = ~ = = = = ~ ~ = = ~ ~
{6.10.12)
I
J
422 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
For the truncation error,
h2 h3
= hY' + - Y" + - y<
3
l + O{h
4
)- hF(x Y h f)
n 2 n 6 n n n
Substituting from (6.10.11) and (6.10.12), and collecting together common powers
of h, we obtain
All derivatives are evaluated at (xn, Yn).
We wish to make the truncation error converge to zero as rapidly as possible.
The coefficient of h
3
cannot be zero in general, iff is allowed to vary arbitrarily.
The requirement that the coefficients of h and h
2
be zero leads to
and this yields
1
""6/3 =-
2
The system (6.10.14) is underdetermined, and its general solution is
1
a=/3=-
2y2
{6;10.14)
{6.10.15)
with y
2
arbitrary. Both (6.10.8) (with y
2
= !) and (6.10.9) (with y
2
= 1) are
special cases of this solution.
By substituting into (6.10.13), we can obtain the leading term in the truncation
error, dependent on only y
2
. In some cases, the value of y
2
has been chosen to
make the coefficient of h
3
as small as possible, while allowing f to vary
arbitrarily. For example, if we write (6.10.13) as
{6.10.16)
then the Cauchy-Schwartz inequality [see (7.1.8) in Chapter 7] can be used to
show
(6.10.17)
SINGLE-STEP AND RUNGE-KUTTA METHODS 423
where
( )
-{(1 I 2)2 (1 /3)2 (I 1 /32)2 1 ]1/2
C2 Yz - 6 - 2Y2a + :; - Y2a + 6 - 2Y2 + i8
with a, f3 given by (6.10.15). The minimum value of c
2
(y
2
) is attained with
y
2
= .75, and c
2
(.75) = 1/118. The resulting second-order numerical method is
Yn+i = Y, + [!(x,, y,) + 3/( x, + }h, y, + }hf(x,, y,))] n 0
{6.10.18)
It is optimal in the sense of minimizing the coefficient c
2
(y
2
) of the term c
1
(/)h
3
in the truncation error. For an extensive discussion of this means of analyzing the
truncation error in RK methods, see Shampine (1986).
Higher order formulas can be created and analyzed in an analogous manner,
although the algebra becomes very complicated. Assume a formula for
-F(x, y, h; f) of the form
p
F(x, y, h; /) = L Y}'}
vi= f(x, y)
(6.10.19)
j = 2, ... , p (6.10.20)
These coefficients can be chosen to make the leading terms in the truncation error
equal to zero, just as was done with (6.10.10) and (6.10.14). There is obviously a
connection between the number of evaluations of f(x, y), call it p, and the
maximum possible order that can be attained for the truncation error. These are
given in Table 6.23, which is due in part-to Butcher (1965).
Until about 1970, the most popular RK method has probably been the
original classical formula that is a generalization of Simpson's rule. The method
is
h
Yn+i = Yn + 6[v;. + 2V2 + 2Vj + V.S] (6.10.21)
vl = f(x,, y,)
V2 =t(x, + y, +
Vj = t( x, + y, +
n 0
=
with this formula based on differentiating Y'(x) = f(x, Y(x)).
(a) Show that this is a fourth-order method: T,(Y) = O(h
5
).
(b) Show that the region of absolute stability contains the entire negative
real axis of the complex hA.-plane.
43. Generalize the method of lines, given in (6.9.23)-(6.9.25), to the problem
llr = a(x, t)Uxx + G(x, t, U(x, t)) O<x<l t > 0
U(O, t) = d
0
(t) t 0
U(x,O) =f(x)
For it to be well-defined, we assume a(x, t) > 0, 0 x 1, t 0.
44. (a) If you have a solver of tridiagonal linear algebraic systems available to
you, then write a program to implement the method of lines for the
problem (6.9.19)-(6.9.21). The example in the text, with the unknown
(6.9.35), was solved using the backward Euler method. Now imple-
ment the method of lines using the trapezoidal rule. Compare your
results with those in Table 6.20 for the backward Euler method.
(b) Repeat with the second-order BDF method.
45. Derive a third-order Taylor series method to solve y' = - y
2
Compare the
i numerical results to those in Table 6.22 .
.. .i
46. Using the Taylor series method of Section 6.10, produce a fourth-order
method to solve y' = x- y
2
, y(O) = 0. Use fixed stepsizes, h = .5, .25,
.125 in succession, and solve for 0 x 10. Estimate the global error
using the error estimate (6.10.24) based on Richardson extrapolation.
:
i
I
~
I
j
I
I-
I
J
458 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
47. Write a program to solve y' = f(x, y), y(x
0
) = y
0
, using the classical
Runge-Kutta method (6.10.21), and let the stepsize h be fixed.
(a) Using the program, solve the equations of Problem 16.
(b) Solve y' = x - y
2
, y(O) = 0, for h = .5, .25, .125. Compare the
results with those of Problem 46.
48. Consider the three stage Runge-Kutta formula
Determine the set of equations that the coefficients { "Yj aj, f3j;} must satisfy
if the formula is to be of order 3. Find a particular solution of these
equations.
49. Prove that if the Runge-Kutta method (6.10.4) satisfies (6.10.27), then it is
stable.
50. Apply the classical Runge-Kutta method (6.10.21) to the test problem
(6.8.51), for various values of A. and h. For example, try A. = -1, -10,
-50 and h = .5, .1, .01, as in Table 6.17.
51. Calculate the real part of the region of absolute stability for the
Runge-Kutta method of (a) (6.10.8), (b) (6.10.9), (c) (6.10.21). We are
interested in the behavior of the numerical solution for the differential
equation y' = A.y with Real(:\)< 0. In particular, we are interested in
those values of hA. for which the numerical solution tends to zero as
Xn ~ 00.
52. (a) Using the Runge-Kutta method (6.10.8), solve
(b)
y' = -y + x
1
[1.1 + x] y(O) = 0
whose solution is Y(x) =xu. Solve the equation on [0, 5], p r i ~ t i n
the errors at x = 1, 2, 3, 4, 5. Use stepsizes h = .1, .05, .025, .0125,
.00625. Calculate the errors by which the errors decrease when h is
halved. How does this compare with the usual theoretical rate of
convergence of 0( h
2
)? Explain your results.
What difficuhy arises when trying to use a Taylor wethod of order
~ 2 to solve the equation of part (a)? What does it tell us about the
solution?
I
=-=-=--== ===--=--=-=-====--=-===--=-"-'-=--==-= .,
PROBLEMS 459
53. Convert the boundary value problem (6.11.1) to an equivalent boundary
value problem for a system of first-order equatiens, as in (6.11.15).
54. (a) Consider the two-point boundary value problem (6.11.25). To convert
this to an equivalent problem with zero boundary conditions, write
y(x) = z(x) + w(x), with w(x) a straight line satisfying the follow-
ing boundary conditions: w(a) = y
1
, w(b) = y
2
Derive a new
boundary value problem for z( x ).
(b) Generalize this procedure to problem (6.11.10). Obtain a new problem
with zero boundary conditions: What assumptions, if any, are needed
for the coefficients a
0
, a
1
, b
0
, b
1
?
55. Using the shooting method of Section 6.11, solve the following boundary
value problems. Study the convergence rate as h is varied.
-2 1 2
(a) y" = --;-yy', 1 < x < 2; y(1) = 2' y(2) = 3
True solution: Y(x) = xj(1 + x).
(b) y" = 2yy', 0 <X< -i; y(O) = 0, y( -i) = 1
True solution: Y(x) = tan(x).
56. Investigate the differential equation programs provided by your computer
center. Note those that automatically control the truncation error by
varying the stepsize, and possibly the order. Classify the programs as
multistep (fixed-order or variable-order), Runge-Kutta, or extrapolation.
Compare one of these with the programs DDEABM [of Section 6.7 and
Shampine and Gordon (1975)] and RKF45 [of Section 6.9 and
Shampine and Watts (1976b)] by solving the problem
y' =
4 20
y(O) = 1
with desired absolute errors of 10-3, 10-
6
, and 10-
9
Compare the results
with those given in Tables 6.15 and 6.28.
57. Consider the problem
1 1
y' = -- + c tan-
1
(y(t))- -
t+1 2
y(O) = 0
with c a given constant. Since y'(O) = }, the solution y(t) is initially
increasing as t increases, regardless of the value of c. As best you can, show
that there is a value of c, call it c*, for which (1) if c > c*, the solution y(t)
~ I
460 NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS
increases indefinitely, and (2) if c < c*, then y( t) increases initially, but
then peaks and decreases. Determine c* to within .00005, and then calcu-
late the associated solution y( t) for 0 ~ t ~ 50.
58. Consider the system
x'(t) A x ~ Bxy y'(t) = Cxy- Dy
This is known as the Lotka-Volterra predator-prey model for two popula-
tions, with x(t) being the number of prey and y(t) the number of
predators, at time t.
(a) Let A = 4, B = 2, C = 1, D = 3, and solve the model to at least
three significant digits for 0 ~ t ~ 5. The initial values are x(O) = 3,
y(O) = 5. Plot x and y as functions of t, and plot x versus y.
(b) Solve the same model with x(O) = 3 and, in succession, y(O) = 1, 1.5,
2. Plot x versus y in each case. What do you observe? Why would the
point (3, 2) be called an equilibrium point?
_j
SEVEN
J
.... J
LINEAR ALGEBRA
The solution of systems of simultaneous linear equations and the calculation of
the eigenvalues and eigenvectors of a matrix are two very important problems
that arise in a wide variety of contexts. As a preliminary to the discussion of
these problems in the following chapters, we present some results from linear
algebra. The first section contains a review of material on vector spaces, matrices,
and linear systems, wltich is taught in inost undergraduate linear algebra courses.
These results are summarized only. and no derivations are included. The remain-
ing sections discuss eigenvalues, canonical forms for matrices, vector and matrix
norms, and perturbation theorems for matrix If necessary, this chapter
can be skipped, and the results can be referred back to as they are needed in
Chapters 8 and 9. For notation, Section 7.1 and the norm notation of Section 7.3
should be skimmed.
7.1 Vector Spaces, Matrices, and Linear Systems
Roughly speaking a vector space V is a set of objects, called vectors, for which
operations of vector addition and scalar multiplication have been defined. A
vector space V has a set of scalars associated with it, and in this text, this set can
be either the real numbers R or complex numbers C. The vector operations must
satisfy certain standard associative, commutative, and distributive rules, which
we will not list. A subset W of a vector space V is called a subspace of V if W is
a vector space using the vector operations inherited from V. For a complete
development of the theory of vector spaces, see any undergraduate text on linear
algebra [for example, Anton (1984), chap. 3; Halmos (1958), chap. 1; Noble
(1969). chaps. 4 and 14; Strang (1980), chap. 2].
Example l. V = Rn, the set of all n-tuples (x
1
, ... , xn) with real entries x;, and
R is the associated set of scalars .
2. V = en, the set of all n-tuples with complex entries, and C is the set of
scalars.
3. V = the set of all polynomials of degree :;;; n, for some given n, is a vector
space. The scalars can be R or C, as desired for the application.
463
__ __]
!
i
[
I
I
!
I
I
... j
464 LINEAR ALGEBRA
4. V = C[ a, b ], the set of all continuous real valued [or complex valued]
functions on the interval [a, b ], is a vector space with scalar set equal to R [or C].
The example in (3) is a subspace of C[ a, b ].
Definition Let V be a vector space and let v
1
, v
2
, , vm E V.
1. We say that vi> ... , vm are linearly dependent if there is a set of
scalars a
1
, , am, with at least one nonzero scalar, for which
Since at least one scalar is nonzero, say a; =1= 0, we can solve for
We say that v; is a linear combination of the vectors
v
1
, , V;_
1
, vi+! ... , vm. For a set of vectors to be linearly
dependent, one of them must be a linear combination of the
remaining ones.
2. We say v
1
, , vm are linearly independent if they are not depen-
dent. Equivalently, the only choice of scalars a
1
, .. , am for
which
is the trivial choice a
1
= = am = 0. No V; can be written as
a combination of the remaining ones.
3. { u
1
, , um} is a basis for V if for every v E V, there is a
unique choice of scalars a
1
, . , am for which
Note that this implies v
1
, , um are independent. If such a
finite basis exists, we say V is finite dimensional. Otherwise, it is
called infinite dimensional.
Theorem 7.1 If V is a: vector space with a basis { v
1
, ... , vm }, then every basis
for V will contain exactly m vectors. The number m is called the
dimension of V.
Example 1. {1, x, x
2
, , xn} is a basis for the space V of polynomials of
degree :$; n. Thus dimension V = n + 1.
2. Rn and en have the basis { e
1
, ... , en), in which
e;= (0,0, ... ,0,1,0, ... ,0) (7.1.1)
VECTOR SPACES, MATRICES, AND LINEAR SYSTEMS 465
with the 1 in position i. Dimension R", C" = n. This is called the standard basis
for R" and C", and the vectors in it are called unit vectors.
3. C[ a, b] is infinite dimensional.
Matrices and linear systems Matrices are rectangular arrays of real or complex
numbers, and the general matrix of order m X n has the form
(7.1.2)
A matrix of order n is shorthand for a square matrix of order n X n. Matrices
will be denoted by capital letters, and their entries will normally be denoted by
lowercase letters, usually corresponding to the name of the matrix, as just given.
The following definitions give the rommon operations on matrices.
Definition 1. Let A and B have order m X n. The sum of A and B is the
matrix C =A + B, of order m X n, given by
2. Let A have order m X n, and let a be a scalar. Then the scalar
multiple C = aA is of order m X n and is given by
C;j = aaij
3. Let A have order m X n and B have order n X p. Then the
product C = AB is of order m X p, and it is given by
n
C;j = L O;kbkj
k=l
4. Let A have order m X n. The transpose C = AT has order n X m,
and is given by
The conjugate transpose C =A* also has order n X m, and
The notation z denotes the complex conjugate of the complex number
z, and z is real if and only if z = z. The conjugate transpose A* is also
called the adjoint of A.
i
i
i
I
i
.I
466 LINEAR ALGEBRA
The following arithmetic properties of matrices can be shown without much
difficulty, and they are left to the reader.
(a) A+ B = B +A (b) (A +B)+ C =A + (B +C)
(c) A(B + C)= AB + AC
(e) (A + B)T =AT+ BT
(d) A(BC) = (AB)C (7.1.3)
(f) (AB)T = BTAT
It is important for many applications to note that the matrices need not be
square for the preceding properties to hold.
The vector spaces Rn and en will usually be identified with the set of column
vectors of order n X 1, with real and complex entries, respectively. The linear
system
(7 .1.4)
can be written as Ax = b, with A as in (7.1.2), and
X = [ Xl' ... ' X n] T
The vector b is a given vector in Rm, and the solution x is an unknown vector in
Rn. The use of matrix multiplication reduces the linear system (7.1.4) to the
simpler and more intuitive form Ax = b.
We now introduce a few additional definitions for matrices, including some
special matrices.
Definition l. The zero matrix of order m X n has all entries equal to zero. It is
denoted by omXn' or more simply, by 0. For any matrix A of order
m X n,
A+ 0 = 0 + A.=A
2. The identity matrix of order n is defined by I= [8ij],
i=j
i#-j
(7.1.5)
for alll i, j n. For all matrices A of order m X n and B of order
n Xp,
Al=A JB = B
The notation B;j denotes the Kronecker delta function.
3. Let A be a square matrix of order n. If there is a square matrix B
of order n for which AB = BA = I, then we say A is invertible, with
inverse B. The matrix B can be shown to be unique, and we denote the
inverse of A by A -l.
VECTOR SPACES, MATRICES, AND LINEAR SYSTEMS 467
4. A matrix A is called symmetric if AT= A, and it is called
Hermitian if A* =A. The term symmetric is generally used only with
real matrices. The matrix A is skew-symmetric if AT= -A. Of
necessity, all matrices that are symmetric, Hermitian, or skew-symmet-
ric must also be square.
5. Let A be an m X n matrix. The row rank of A is the number of
linearly independent rows in A, regarded as elements of Rn or en, and
the column rank is the nuptber of linearly independent columns. It can
be shown (Problem 4) that these two numbers are aiways equal, and
this is called the rank of A.
For the definition and properties of the determinant of a square matrix A, see
any linear algebra text [for example, Anton (1984), chap. 2; Noble (1969), chap.
7; and Strang (1980), chap. 4]. We summarize many of the results on matrix
inverses and the solvability of linear systems in the following theorem.
Theorem 7.2 Let A be a square matrix with elements from R (or C), and let the
vector space be V = Rn (or en). Then the following are equivalent
statements.
1. Ax = b has a unique solution x E V for every b E V.
2. Ax = b has a solution x E V for every b E V.
3. Ax = 0 implies x = 0.
4. A -I exists.
5. Determinant {A) =t= 0.
6. Rank (A) = n.
Although no proof is given here, it is an excellent exercise to prove the
equivalence of some of these statements. Use the concepts of linear independence
and basis, along with Theorem 7 .1. Also, use the decomposition
(7.1.6)
with A*j denoting column j in A. This says that the space of all vectors of the
form Ax is spanned by the columns of A, although they may be linearly
dependent.
Inner product vector spaces One of the important reasons for reformulating
problems as equivalent linear algebra problems is to introduce some geometric
insight. Important to this process are the concepts of inner product and orthogo-
nality.
468 LINEAR ALGEBRA
Definition 1. The inner product of two vectors x, y E R" is defined by
n
(x,y) = LX;Y;=xTy=yTx
i=l
and for vectors x, y E C", define the inner product by
n
(x,y)= LX;J;=y*x
i-1
Z. The Euclidean norm of x in C" or R" is defined by
(7.1.7)
The following results are fairly straightforward to prove, and they are left to
the reader. Let V denote C" orR".
1. For all x, y, z E V,
(x, y + z) = (x, y) + (x, z ), (x + y, z) = (x, z) + (y, z)
2. For all x, y E V,
(ax, y) = a(x, y)
and for V= C", a E C,
(x, ay) = a(x, y)
3. In en, (x, y) = (y, x); and in Rn, (x, y) = (y, x).
4. For all x E V,
(x, x) 0
and (x, x) = 0 if and only if x = 0.
5. For all x, y E V,
l(x, y) 1
2
(x, x)(y, y) (7 .1.8)
This is called the Cauchy-Schwartz inequality, and it is proved in exactly the
same manner as (4.4.3) in Chapter 4. Using the Euclidean norm, we can
write it as
(7.1.9)
6. For all x, y E V,
(7.1.10)
This is the triangle inequality. For a geometric interpretation, see the earlier
comments in Section 4.1 of Chapter 4 for the norm llflloo on C[a, b]. For a
proof of (7.1.10), see the derivation of (4.4.4) in Chapter 4.
VECTOR SPACES, MATRICES, AND LINEAR SYSTEMS 469
7. For any square matrix A of order n, and for any x, y E C",
(Ax, y) = (x, A*y) (7 .1.11)
The inner product was used to introduce the Euclidean length, but it is also
used to define a sense of angle, at least in spaces in which the scalar set is R.
Definition I. For x, y in R", the angle between x andy is defined by
Note that the argument is between -1 and 1, due to the
Cauchy-Schwartz inequality (7.1.9). The preceding definition can be
written implicitly as
(7.1.12)
a familiar formula from the use of the dot product in R
2
and R
3
2. Two vectors x and y are orthogonal if and only if (x, y) = 0.
This is motivated by (7.1.12). If { x(l), ... , x<">} is a basis for C" orR",
and if (x<il, x<j)) = 0 for all i =I= j, 1 ::s; i, j ::s; n,' then we say
{ x(l>, .. .", x<n>} is an orthogonal basis. If all basis vectors have
Euclidean length 1, the basis is called orthonormal.
3. A square matrix U is called unitary if
U*U= UU* =I
If the matrix U is real, it is usually called orthogonal, rather than
unitary. The rows for colu.mns] of an order n unitary matrix form an
orthonormal basis for C", and similarly for orthogonal matrices
and R".
E:mmp/e I. The angle between the vectors
x ='= (1, 2, 3) y = (3, 2, I)
is given by
d= ccis-
1
( ~ ~ = .775 radians
2. The matrices
[
cos 8 sinO]
Ut=
-sin 8 cos 8
1]
.fi
are unitary, with the first being orthogonal.
j
!
I
!
I
470 LINEAR ALGEBRA
Yj
uW
Figure 7.1 Illustration of (7.1.15).
An orthonormal basis for a vector space V = Rn or en is desirable, since it is
then easy to decompose an arbitrary vector into its components in the direction
of the basis vectors. More precisely, let { u(l>, ... , u<n>} be an orthonormal basis
for V, and let x E V. Using the basis,
for some unique choice of coefficients a
1
, ... , a". To find aj, form the inner
product of x with u<j), and then
using the orthonormality properties of the basis. Thus
n
x = L (x, uU>)uU>
j-1
{7.1.14)
This can be given a geometric interpretation, which is shown in Figure 7.1. Using
(7.1.13)
{7 .1.15)
Thus the coefficient aj is just the length of the orthogonal projection of x onto
the axis determined by uU>. The formula (7.1.14) is a generalization of the
decomposition of a vector x u,sing the standard basis ( e <
1
>, ... , e <" > } , defined
earlier.
Example Let V = R
2
, and consider the orthonormal basis
{3)
2' 2
.u<2) = (- {3
2 '2
EIGENVALUES AND CANONICAL FORMS FOR MATRICES 471
Then for a given vector x = (x
1
, x
2
), it can be written as
x - a u<
1
> + a u<
2
>
- 1 2
For example,
1 13
(1 0) = -u<
1
>- -u<
2
>
' 2 2
7.2 Eigenvalues and Canonical Fonns for Matrices
The number A, complex or real, is an eigenvalue of the square matrix A if there is
a VeCtOr X E en, X -f= 0, SUCh that
Ax =Ax (7.2.1)
The vector x is called an eigenvector corresponding to the eigenvalue A. From
Theorem 7.2, statements (3) and (5), A is an eigenvalue of A if and only if
det(A- AI)= 0 (7.2.2)
This is called the characteristic equation for A, and to analyze it we introduce the
function
If A has order n, then fA(A) will be a polynomial of degree exactly n, called the
characteristic polynomial of A. To prove it is a polynomial, expand the determi-
nant by minors repeatedly to get
[
a
11
- A
a21
= det .
an1
= (a
11
- A)(a
22
- A) (ann- A)
+ terms of degree ri - 2
+ terms of degree n - 2 (7.2.3)
472 LINEAR ALGEBRA
Also note that the constant term is
lAO) = det (A) (7 .2.4)
From the coefficient of An-I, define
trace(A) = a
11
+ a
22
+ +ann (7 .2.5)
which is often a quantity of interest in the study of A.
Since /A( A) is of degree n, there are exactly n eigenvalues for A, if we count
multiple roots according to their multiplicity. Every matrix has at least one
eigenvalue-eigenvector pair, and the n X n matrix A has at most n distinct
eigenvalues.
Example 1. The characteristic polynomial for
is
1
3
1
The eigenvalues are A
1
= 1, 71.
2
= 2, 71.
3
= 4, and the corresponding eigenvectors
are
Note that these eigenvectors are orthogonal to each other, and therefore they are
linearly independent. Since the dimension of R
3
(and C
3
) is three, these eigenvec-
tors form an orthogonal basis for R
3
(and C
3
). This illustrates Theorem 7.4,
which is presented later in the section.
2. For the matrix
and there are three linearly independent eigenvectors for the eigenvalue A = 1,
for example,
[1,o,or [o, 1,or [o.o. 1r
All other eigenvectors are linear combinations of these three vectors.
EIGENVALUES AND CANONICAL FORMS FOR MATRICES 473
3. For the matrix
1
1
0
The matrix A has only one linearly independent eigenvector for the eigenvalue
A= 1, namely
x = [1,o,of
and multiples of it.
The algebraic multiplicity of an eigenvalue of a matrix A is its multiplicity as a
root of fA(A), and its geometric multiplicity is the maximum number of linearly
independent eigenvectors associated with the eigenvalue. The sum of the alge-
braic multiplicities of the eigenvalues of an n X n matrix A is constant with
respect to small perturbations in A, namely n. But the sum of the geometric
multiplicities can vary greatly with small perturbations, and this causes the
numerical calculation of eigenvectors to often be a very difficult problem. Also,
the algebraic and geometric multiplicities need not be equal, as the preceding
examples show.
Definition Let A and B be square matrices of the same order. Then A is similar
to B if there is a nonsingular matrix P for which
(7.2.6)
Note that this is a symmetric relation since
Q = p-1
The relation (7.2.6) can be interpreted to say that A and B are matrix representa-
tions of the same linear transformation T from V to V [V = Rn or en), but with
respect to different bases for V. The matrix P is called the change of basis matrix,
and it relates the two representations of a vector x E V with respect to the two
bases being used [see Anton (1984), sec. 5.5 or Noble (1969), sec. 14.5 for greater
detail].
We now present a few simple properties about similar matrices and their
eigenvalues.
1. If A and B are similar, then fA(A) = fB(A). To prove this, use (7.2.6) to
show
fB(A) = det (B- A/) = det (P-
1
(A - AI)P]
= det(P-
1
}det(A- AI}det(P) =fAA)
474 LINEAR ALGEBRA
since
det ( P ) det ( P-
1
) = det ( P P-
1
) = det ( I) = 1
2. The eigenvalues of similar matrices A and B are exactly the same, and there
is a one-to-one correspondence of the eigenvectors. If Ax = A.x, then using
Bz = A.z (7.2.7)
Trivially, z -:!= 0, s)nce otherwise x would be zero. Also, given any eigenvec-
tor z of B, this argument can be reversed to produce a corresponding
eigenvector x = Pz for A.
3. Since /A(A.) is invariant under similarity transformations of A, the coeffi-
cients of /A(A.) are also invariant under such similarity transformations. In
particular, for A similar to B,
trace(A) = trace(B) det (A) = det (B) (7.2.8)
Canonical fonns We now present several important canonical forms for
matrices. These forms relate the structure of a matrix to its eigenvalues and
eigenvectors, and they are used in a variety of applications in other areas of
mathematics and science.
Theorem 7.3 (Schur Normal Form) Let A have order n with elements from C.
Then there exists a unitary matrix U such that
T= U*AU (7.2.9)
is upper triangular.
Since T is triangular, and since U* = u-
1
,
(7 .2.10)
and thus the eigenvalues of A are the diagonal elements of T.
Proof The proof is by induction on the order n of A. The result is trivially true
for n = 1, using U = [1]. We assume the result is true for all matrices of
order n k - 1, and we will then prove it has to be true for all matrices
of order n = k.
Let A.
1
be an eigenvalue of A, and let u<
1
) be an associated eigenvector
with llu<
1
)lb = 1. Beginning with u<
1
), pick an orthonormal basis for Ck,
calling it {u(1), ... , u<k)}. Define the matrix P
1
by
P
= [u<l) u<2) u<k)]
1 ' , .... '
which is written in partitioned form, with columns u(l)' , u<k) that are
orthogonal. Then PtP
1
=I, and thus P
1
-
1
= P
1
*. Define
B
1
= P
1
*AP
1
EIGENVALUES AND CANONICAL FORMS FOR MATRICES 475
Claim:
with A
2
of order k - 1 and a
2
, , ak some numbers. To prove this,
multiply using partitioned matrices:
= [' u<l) v<2) v<k>]
1\.1 , ' .... ,
Since P
1
* P
1
=I, it follows that Pr u<
1
> = e<
1
> = [1, 0, ... , Of. Thus
B
= [' e<l) w<2) w<k>]
1 /\1 '. ' . '
which has the desired form.
By the induction hypothesis, there exists a unitary matrix P
2
of order
k - 1 for which
is an upper triangular matrix of order k - 1. Define
P ~ [1
0
ol
p2
Then P
2
is unitary, and
rA,
Y2
Y, J
P,*B,P, ~ r
P2*A2P2
~ rr
Y2
ylT
f
476 LINEAR ALGEBRA
an upper triangular matrix. Thus
T= U*AU
and U is easily unitary. This completes the induction and the proof.
Example For the matrix
[
.2
A= 1.6
-1.6
the matrices of the theorem and (7.2.9) are
0
3
0
~ ]
-1
[
.6
U= ~
0
0
1.0
-.8]
.6
0
This is not the usual way in which eigenvalues are calculated, but should be
considered only as an illustration of the theorem. The theorem is used generally
as a theoretical tool, rather than as a computational tool.
Using (7.2.8) and (7.2.9),
trace (A) = i\
1
+ i\2 + +i\n det(A)=i\
1
A
2
. i\n (7.2.11)
where i\
1
, . , i\n are the eigenvalues of A, which must form the diagonal
elements of T. As a much more important application, we have the following
well-known theorem.
Theorem 7.4 (Principal Axes Theorem) Let A be a Hermitian matrix of order
n, that is, A* =A. Then A has n real eigenvalues i\
1
, , i\n, not
necessarily distinct, and n corresponding eigenvectors u<
1
>, . , u<n>
that form an orthonormal basis for en. If A is real, the eigenvectors
u(l>, .. , u<n> can be taken as real, and they form an orthonormal
basis of Rn. Finally there is a unitary matrix U for which
(7.2.12)
is a diagonal matrix with diagonal elements i\
1
, ... , i\n. If A is also
real, then U can be taken as orthogonal.
Proof From Theorem 7.3, there is a unitary matrix U with
U*AU= T
with Tupper triangular. Form the conjugate transpose of both sides to
EIGENVALUES AND CANONICAL FORMS FOR MATRICES 477
obtain
T* = (U*AU)* = U*A*(U*)* = U*AU= T
Since T* is lower triangular, we must have
Also, T* = T involves complex conjugation of all elements of T, and
thus all diagonal elements of T must be real.
Write U as
U= [u(l), ... , u<n>]
Then T = U* AU implies AU = UT,
and
Au<j) = A. .uU>
1
j= 1, ... , n (7.2.13)
Since the columns of U are orthonormal, and since the dimension of en
is n, these must form an orthonormal basis for en. We omit the proof of
the results that follow from A being real. This completes the proof .
Example From an earlier example in this section, the matrix
has the eigenvalues A.
1
= 1, A.
2
= 2, A.
3
= 4 and corresponding orthonormal
eigenvectors
u<
1
> = -
1
[ -i]
{3 1
u<2) = _2_[ ~
.fi -1
These form an orthonormal basis for R
3
or e
3
There is a second canonical form that has recently become more important for
problems in numerical linear algebra, especially for solving overdetermined
systems of linear equations. These systems arise from the fitting of empirical data
478 LINEAR ALGEBRA
using the linear least squares procedures [see Golub and Van Loan (1983), chap.
6, and Lawson and Hanson (1974)].
Theorem 7.5 (Singular Value Decomposition) Let A be order 11 X m. Then
there are unitary matrices U and V, of orders m and n, respectively,
such that
V*AU = F {7 .2.14)
is a "diagonal" rectangular matrix of order n X m,
P.!
0
P.2
F=
0
{7.2.15)
0
The numbers p.
1
, .. , P.r are called the singular values of A. They are
all real and positive, and they can be arranged so that
{7.2.16)
where r is the rank of the matrix A.
Proof Consider the square matrix A*A of order m. It is a Hermitian matrix,
    and consequently Theorem 7.4 can be applied to it. The eigenvalues of
    A*A are all real; moreover, they are all nonnegative. To see this, assume
        $$A^*Ax = \lambda x \qquad x \ne 0$$
    Then
        $$(x, A^*Ax) = (x, \lambda x) = \lambda \|x\|_2^2 \qquad (x, A^*Ax) = (Ax, Ax) = \|Ax\|_2^2$$
        $$\lambda = \left(\frac{\|Ax\|_2}{\|x\|_2}\right)^2 \ge 0$$
    This result also proves that
        $$Ax = 0 \quad \text{if and only if} \quad A^*Ax = 0 \qquad x \in C^m \qquad (7.2.17)$$
    From Theorem 7.3, there is an m × m unitary matrix U such that
        $$U^*(A^*A)U = \operatorname{diag}[\lambda_1, \ldots, \lambda_r, 0, \ldots, 0] \qquad (7.2.18)$$
    where all $\lambda_i \ne 0$, $1 \le i \le r$, and all are positive. Because A*A has order
    m, the index r ≤ m. Introduce the singular values
        $$\mu_i = \sqrt{\lambda_i} \qquad i = 1, \ldots, r \qquad (7.2.19)$$
    The U can be chosen so that the ordering (7.2.16) is obtained. Using the
    diagonal matrix
        $$D = \operatorname{diag}[\mu_1, \ldots, \mu_r, 0, \ldots, 0]$$
    of order m, we can write (7.2.18) as
        $$(AU)^*(AU) = D^2 \qquad (7.2.20)$$
    Let W = AU. Then (7.2.20) says $W^*W = D^2$. Writing W as
        $$W = [w^{(1)}, \ldots, w^{(m)}] \qquad w^{(j)} \in C^n$$
    we have
        $$(w^{(j)}, w^{(j)}) = \begin{cases} \mu_j^2 & 1 \le j \le r \\ 0 & j > r \end{cases} \qquad (7.2.21)$$
    and
        $$(w^{(i)}, w^{(j)}) = 0 \qquad i \ne j \qquad (7.2.22)$$
    From (7.2.21), $w^{(j)} = 0$ if j > r. And from (7.2.22), the first r columns
    of W are orthogonal elements in $C^n$. Thus the first r columns are linearly
    independent, and this implies r ≤ n.
        Define
        $$v^{(j)} = \frac{1}{\mu_j} w^{(j)} \qquad j = 1, \ldots, r \qquad (7.2.23)$$
    This is an orthonormal set in $C^n$. If r < n, then choose $v^{(r+1)}, \ldots, v^{(n)}$
    so that $\{v^{(1)}, \ldots, v^{(n)}\}$ is an orthonormal basis for $C^n$. Define
        $$V = [v^{(1)}, \ldots, v^{(n)}] \qquad (7.2.24)$$
    Then V is an n × n unitary matrix, and it can be verified directly that
    VF = W, with F as in (7.2.15). Thus
        $$VF = AU$$
    which proves (7.2.14). The proof that r = rank(A) and the derivation of
    other properties of the singular value decomposition are left to Problem
    19. The singular value decomposition is used in Chapter 9, in the least
    squares solution of overdetermined linear systems.
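Numerically, the singular value decomposition is available directly in standard libraries. The sketch below (Python/NumPy, not from the text; the rectangular matrix is illustrative) computes the factorization and checks the rank statement of Theorem 7.5. NumPy's convention writes A = U F V*, which corresponds to (7.2.14) after renaming the unitary factors.

    import numpy as np

    A = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0]])              # order n x m with n = 2, m = 3

    U, s, Vt = np.linalg.svd(A)                  # s holds the singular values, decreasing
    print(s)
    print(np.linalg.matrix_rank(A) == np.sum(s > 1e-12))   # r = rank(A), cf. (7.2.16)

    F = np.zeros(A.shape)                        # the "diagonal" rectangular F of (7.2.15)
    F[:len(s), :len(s)] = np.diag(s)
    print(np.allclose(A, U @ F @ Vt))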
To give the most basic canonical form, introduce the following notation.
Define the n × n matrix
    $$J_n(\lambda) = \begin{bmatrix} \lambda & 1 & & 0 \\ & \lambda & \ddots & \\ & & \ddots & 1 \\ 0 & & & \lambda \end{bmatrix} \qquad (7.2.25)$$
where $J_n(\lambda)$ has the single eigenvalue λ, of algebraic multiplicity n and geometric
multiplicity 1. It is called a Jordan block.

Theorem 7.6 (Jordan Canonical Form) Let A have order n. Then there is a
    nonsingular matrix P for which
        $$P^{-1}AP = \begin{bmatrix} J_{n_1}(\lambda_1) & & 0 \\ & \ddots & \\ 0 & & J_{n_r}(\lambda_r) \end{bmatrix} \qquad (7.2.26)$$
    The eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_r$ need not be distinct. For A Hermi-
    tian, Theorem 7.4 implies we must have $n_1 = n_2 = \cdots = n_r = 1$,
    for in that case the sum of the geometric multiplicities must be n,
    the order of the matrix A.

It is often convenient to write (7.2.26) as
    $$P^{-1}AP = D + N \qquad D = \operatorname{diag}[\lambda_1, \ldots, \lambda_r] \qquad (7.2.27)$$
with each $\lambda_i$ appearing $n_i$ times on the diagonal of D. The matrix N has all zero
entries, except for possible 1s on the superdiagonal. It is a nilpotent matrix, and
more precisely, it satisfies
    $$N^n = 0 \qquad (7.2.28)$$
The Jordan form is not an easy theorem to prove, and the reader is referred to
any of the large number of linear algebra texts for a development of this rich
topic [e.g., see Franklin (1968), chap. 5; Halmos (1958), sec. 58; or Noble (1969),
chap. 11].
7.3 Vector and Matrix Norms
The Euclidean norm $\|x\|_2$ has already been introduced, and it is the way in
which most people are used to measuring the size of a vector. But there are many
situations in which it is more convenient to measure the size of a vector in other
ways. Thus we introduce a general concept of the norm of a vector.
Definition Let V be a vector space, and let N(x) be a real valued function
    defined on V. Then N(x) is a norm if:
    (N1) $N(x) \ge 0$ for all $x \in V$, and N(x) = 0 if and only if x = 0.
    (N2) $N(\alpha x) = |\alpha| N(x)$, for all $x \in V$ and all scalars α.
    (N3) $N(x + y) \le N(x) + N(y)$, for all $x, y \in V$.
The usual notation is $\|x\| = N(x)$. The notation N(x) is used to emphasize
that the norm is a function, with domain V and range the nonnegative real
numbers. Define the distance from x to y as $\|x - y\|$. Simple consequences are
the triangle inequality in its alternative form
    $$\|x - z\| \le \|x - y\| + \|y - z\|$$
and the reverse triangle inequality,
    $$\bigl|\,\|x\| - \|y\|\,\bigr| \le \|x - y\| \qquad x, y \in V \qquad (7.3.1)$$
Example 1. For $1 \le p < \infty$, define the p-norm
    $$\|x\|_p = \left[\sum_{i=1}^{n} |x_i|^p\right]^{1/p} \qquad x \in C^n \qquad (7.3.2)$$
2. The maximum norm is
    $$\|x\|_\infty = \max_{1 \le i \le n} |x_i| \qquad x \in C^n \qquad (7.3.3)$$
The use of the subscript ∞ on the norm is motivated by the result in Problem 23.
3. For the vector space V = C[a, b], the function norms $\|f\|_2$ and $\|f\|_\infty$ were
introduced in Chapters 4 and 1, respectively.
Example Consider the vector x = (1, 0, -1, 2). Then
    $$\|x\|_1 = 4 \qquad \|x\|_2 = \sqrt{6} \qquad \|x\|_\infty = 2$$
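These values are simple to confirm with library routines; the following quick check (Python/NumPy, not part of the text) evaluates the three main norms of the same vector.

    import numpy as np

    x = np.array([1.0, 0.0, -1.0, 2.0])
    print(np.linalg.norm(x, 1))        # ||x||_1 = 4
    print(np.linalg.norm(x, 2))        # ||x||_2 = sqrt(6)
    print(np.linalg.norm(x, np.inf))   # ||x||_inf = 2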
To show that $\|\cdot\|_p$ is a norm for a general p is nontrivial. The cases p = 1
and ∞ are straightforward, and $\|\cdot\|_2$ has been treated in Section 4.1. But for
$1 < p < \infty$, $p \ne 2$, it is difficult to show that $\|\cdot\|_p$ satisfies the triangle in-
equality. This is not a significant problem for us since the main cases of interest
are p = 1, 2, ∞. To give some geometrical intuition for these norms, the unit
circles
    $$S_p = \{x \in R^2 : \|x\|_p = 1\} \qquad p = 1, 2, \infty \qquad (7.3.4)$$
are sketched in Figure 7.2.
Figure 7.2 The unit sphere $S_p$ using the vector norm $\|\cdot\|_p$, p = 1, 2, ∞.
We now prove some results relating different norms. We begin with the
following result on the continuity of $N(x) = \|x\|$ as a function of x.

Lemma Let N(x) be a norm on $C^n$ (or $R^n$). Then N(x) is a continuous
    function of the components $x_1, x_2, \ldots, x_n$ of x.

Proof We want to show that
        $$y_i \to x_i \qquad i = 1, 2, \ldots, n$$
    implies
        $$N(y) \to N(x)$$
    Using the reverse triangle inequality (7.3.1),
        $$|N(x) - N(y)| \le N(x - y) \qquad x, y \in C^n$$
    Recall from (7.1.1) the definition of the standard basis $\{e^{(1)}, \ldots, e^{(n)}\}$
    for $C^n$. Then
        $$x - y = \sum_{j=1}^{n} (x_j - y_j) e^{(j)}$$
        $$N(x - y) \le \sum_{j=1}^{n} |x_j - y_j|\, N(e^{(j)}) \le \|x - y\|_\infty \sum_{j=1}^{n} N(e^{(j)})$$
        $$|N(x) - N(y)| \le c\,\|x - y\|_\infty \qquad c = \sum_{j=1}^{n} N(e^{(j)}) \qquad (7.3.5)$$
    This completes the proof.

Note that it also proves that for every vector norm N on $C^n$, there is a c > 0
with
    $$N(x) \le c\,\|x\|_\infty \qquad \text{all } x \in C^n \qquad (7.3.6)$$
Just let y = 0 in (7.3.5). The following theorem proves the converse of this result.
Theorem 7.7 (Equivalence of Norms) Let N and M be norms on $V = C^n$ or
    $R^n$. Then there are constants $c_1, c_2 > 0$ for which
        $$c_1 M(x) \le N(x) \le c_2 M(x) \qquad \text{all } x \in V \qquad (7.3.7)$$

Proof It is sufficient to consider the case in which N is arbitrary and $M(x) =
    \|x\|_\infty$. Combining two such statements then leads to the general result.
    Thus we wish to show there are constants $c_1, c_2$ for which
        $$c_1 \le \frac{N(x)}{\|x\|_\infty} \le c_2 \qquad \text{all } x \ne 0 \qquad (7.3.8)$$
    or equivalently,
        $$c_1 \le N(z) \le c_2 \qquad \text{all } z \in S \qquad (7.3.9)$$
    in which S is the set of all points z in $C^n$ for which $\|z\|_\infty = 1$. The upper
    inequality of (7.3.9) follows immediately from (7.3.6).
        Note that S is a closed and bounded set in $C^n$, and N is a continuous
    function on S. It is then a standard result of advanced calculus that N
    attains its maximum and minimum on S at points of S; that is, there are
    constants $c_1, c_2$ and points $z_1, z_2$ in S for which
        $$c_1 = N(z_1) \le N(z) \le N(z_2) = c_2 \qquad \text{all } z \in S$$
    Clearly $c_1, c_2 \ge 0$. And if $c_1 = 0$, then $N(z_1) = 0$. But then $z_1 = 0$,
    contrary to the construction of S, which requires $\|z_1\|_\infty = 1$. This proves
    (7.3.9), completing the proof of the theorem. Note: This theorem does
    not generalize to infinite dimensional spaces.
Many numerical methods for problems involving linear systems produce a
sequence of vectors $\{x^{(m)} \mid m \ge 0\}$, and we want to speak of convergence of this
sequence to a vector x.

Definition A sequence of vectors $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}, \ldots\}$ in $C^n$ or $R^n$ is said
    to converge to a vector x if and only if
        $$\|x - x^{(m)}\| \to 0 \qquad \text{as } m \to \infty$$
    Note that the choice of norm is left unspecified. For finite
    dimensional spaces, it doesn't matter which norm is used. Let M and
    N be two norms on $C^n$. Then from (7.3.7),
        $$c_1 M(x - x^{(m)}) \le N(x - x^{(m)}) \le c_2 M(x - x^{(m)}) \qquad m \ge 0$$
    and $M(x - x^{(m)})$ converges to zero if and only if $N(x - x^{(m)})$ does
    the same. Thus $x^{(m)} \to x$ with the M norm if and only if it
    converges with the N norm. This is an important result, and it is not
    true for infinite dimensional spaces.
Matrix norms The set of all n × n matrices with complex entries can be
considered as equivalent to the vector space $C^{n^2}$, with a special multiplicative
operation added onto the vector space. Thus a matrix norm should satisfy the
usual three requirements N1-N3 of a vector norm. In addition, we also require
two other conditions.

Definition A matrix norm satisfies N1-N3 and the following:
    (N4) $\|AB\| \le \|A\|\,\|B\|$.
    (N5) Usually the vector space we will be working with, $V = C^n$ or
        $R^n$, will have some vector norm, call it $\|x\|_v$, $x \in V$. We
        require that the matrix and vector norms be compatible:
            $$\|Ax\|_v \le \|A\|\,\|x\|_v \qquad \text{all } x \in V, \text{ all } A$$
Example Let A be n × n, $\|\cdot\|_v = \|\cdot\|_2$. Then for $x \in C^n$,
    $$\|Ax\|_2^2 = \sum_{i=1}^{n} \left|\sum_{j=1}^{n} a_{ij} x_j\right|^2
        \le \sum_{i=1}^{n} \left[\sum_{j=1}^{n} |a_{ij}|^2\right]\left[\sum_{j=1}^{n} |x_j|^2\right]
        = F(A)^2 \|x\|_2^2$$
by using the Cauchy-Schwarz inequality (7.1.8), where
    $$F(A) = \left[\sum_{i, j=1}^{n} |a_{ij}|^2\right]^{1/2} \qquad (7.3.10)$$
F(A) is called the Frobenius norm of A. Property N5 is shown using (7.3.10)
directly. Properties N1-N3 are satisfied since F(A) is just the Euclidean norm on
$C^{n^2}$. It remains to show N4. Using the Cauchy-Schwarz inequality,
    $$F(AB)^2 = \sum_{i, j=1}^{n} \left|\sum_{k=1}^{n} a_{ik} b_{kj}\right|^2
        \le \sum_{i, j=1}^{n} \left[\sum_{k=1}^{n} |a_{ik}|^2\right]\left[\sum_{k=1}^{n} |b_{kj}|^2\right]
        = F(A)^2 F(B)^2$$
Thus F(A) is a matrix norm, compatible with the Euclidean norm.
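The two properties just verified, N4 and N5, can be spot-checked numerically. The sketch below (Python/NumPy, not from the text) does so for illustrative random matrices; the small tolerance only guards against rounding error.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    B = rng.standard_normal((4, 4))
    x = rng.standard_normal(4)

    F = lambda M: np.linalg.norm(M, 'fro')      # the Frobenius norm F(A) of (7.3.10)

    print(F(A @ B) <= F(A) * F(B) + 1e-12)                                   # N4
    print(np.linalg.norm(A @ x, 2) <= F(A) * np.linalg.norm(x, 2) + 1e-12)   # N5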
Usually, when given a vector space with a norm $\|\cdot\|_v$, an associated matrix
norm is defined by
    $$\|A\| = \sup_{x \ne 0} \frac{\|Ax\|_v}{\|x\|_v} \qquad (7.3.11)$$
Table 7.1 Vector norms and associated operator matrix norms

    Vector Norm                                           Matrix Norm
    $\|x\|_1 = \sum_{i=1}^{n} |x_i|$                      $\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}|$
    $\|x\|_2 = \left[\sum_{i=1}^{n} |x_i|^2\right]^{1/2}$     $\|A\|_2 = \sqrt{r_\sigma(A^*A)}$
    $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$           $\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|$
It is often called the operator norm. By its definition, it satisfies N5:
    $$\|Ax\|_v \le \|A\|\,\|x\|_v \qquad x \in C^n \qquad (7.3.12)$$
For a matrix A, the operator norm induced by the vector norm $\|x\|_p$ will be
denoted by $\|A\|_p$. The most important cases are given in Table 7.1, and the
derivations are given later. We need the following definition in order to define
$\|A\|_2$.

Definition Let A be an arbitrary matrix. The spectrum of A is the set of all
    eigenvalues of A, and it is denoted by σ(A). The spectral radius is
    the maximum size of these eigenvalues, and it is denoted by
        $$r_\sigma(A) = \max_{\lambda \in \sigma(A)} |\lambda| \qquad (7.3.13)$$
To show that (7.3.11) defines a norm in general, we begin by showing it is finite. Recall
from Theorem 7.7 that there are constants $c_1, c_2 > 0$ with
    $$c_1 \|x\|_\infty \le \|x\|_v \le c_2 \|x\|_\infty \qquad x \in C^n$$
Thus, for $x \ne 0$,
    $$\frac{\|Ax\|_v}{\|x\|_v} \le \frac{c_2 \|Ax\|_\infty}{c_1 \|x\|_\infty}
        \le \frac{c_2}{c_1} \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|$$
which proves $\|A\|$ is finite.
At this point it is interesting to note the geometric significance of $\|A\|$:
    $$\|A\| = \sup_{x \ne 0} \frac{\|Ax\|_v}{\|x\|_v}
        = \sup_{x \ne 0} \left\|A\!\left(\frac{x}{\|x\|_v}\right)\right\|_v
        = \sup_{\|z\|_v = 1} \|Az\|_v$$
By noting that the supremum doesn't change if we let $\|z\|_v \le 1$,
    $$\|A\| = \sup_{\|z\|_v \le 1} \|Az\|_v \qquad (7.3.14)$$
Let
    $$B = \{z : \|z\|_v \le 1\}$$
the unit ball with respect to $\|\cdot\|_v$. Then
    $$\|A\| = \sup_{z \in B} \|Az\|_v = \sup_{w \in A(B)} \|w\|_v$$
with A(B) the image of B when A is applied to it. Thus $\|A\|$ measures the effect
of A on the unit ball, and if $\|A\| > 1$, then $\|A\|$ gives the maximum stretching
of the ball B under the action of A.
Proof Following is a proof that the operator norm $\|A\|$ is a matrix norm.

    1. Clearly $\|A\| \ge 0$, and if A = 0, then $\|A\| = 0$. Conversely, if $\|A\| = 0$,
       then $\|Ax\|_v = 0$ for all x. Thus Ax = 0 for all x, and this implies
       A = 0.
    2. Let α be any scalar. Then
        $$\|\alpha A\| = \sup_{\|x\|_v \le 1} \|\alpha Ax\|_v
            = \sup_{\|x\|_v \le 1} |\alpha|\,\|Ax\|_v
            = |\alpha| \sup_{\|x\|_v \le 1} \|Ax\|_v = |\alpha|\,\|A\|$$
    3. For any $x \in C^n$,
        $$\|(A + B)x\|_v = \|Ax + Bx\|_v \le \|Ax\|_v + \|Bx\|_v$$
       since $\|\cdot\|_v$ is a norm. Using the property (7.3.12),
        $$\|(A + B)x\|_v \le \|A\|\,\|x\|_v + \|B\|\,\|x\|_v$$
       This implies
        $$\frac{\|(A + B)x\|_v}{\|x\|_v} \le \|A\| + \|B\| \qquad x \ne 0
            \qquad \text{and thus} \qquad \|A + B\| \le \|A\| + \|B\|$$
    4. For any $x \in C^n$, use (7.3.12) to get
        $$\|(AB)x\|_v = \|A(Bx)\|_v \le \|A\|\,\|Bx\|_v \le \|A\|\,\|B\|\,\|x\|_v$$
       This implies
        $$\frac{\|ABx\|_v}{\|x\|_v} \le \|A\|\,\|B\| \qquad x \ne 0
            \qquad \text{and thus} \qquad \|AB\| \le \|A\|\,\|B\|$$
We now comment more extensively on the results given in Table 7.1.

Example 1. Use the vector norm
    $$\|x\|_1 = \sum_{j=1}^{n} |x_j| \qquad x \in C^n$$
Then
    $$\|Ax\|_1 = \sum_{i=1}^{n} \left|\sum_{j=1}^{n} a_{ij} x_j\right|
        \le \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|\,|x_j|$$
Changing the order of summation, we can separate the summands,
    $$\|Ax\|_1 \le \sum_{j=1}^{n} |x_j| \sum_{i=1}^{n} |a_{ij}|$$
Let
    $$c = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}| \qquad (7.3.15)$$
Then
    $$\|Ax\|_1 \le c \sum_{j=1}^{n} |x_j| = c\,\|x\|_1$$
and thus
    $$\|A\|_1 \le c$$
To show this is an equality, we demonstrate an x for which
    $$\frac{\|Ax\|_1}{\|x\|_1} = c$$
Let k be the column index for which the maximum in (7.3.15) is attained. Let
$x = e^{(k)}$, the kth unit vector. Then $\|x\|_1 = 1$ and
    $$\|Ax\|_1 = \|Ae^{(k)}\|_1 = \sum_{i=1}^{n} |a_{ik}| = c$$
This proves that for the vector norm $\|\cdot\|_1$, the operator norm is
    $$\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}| \qquad (7.3.16)$$
This is often called the column norm.
2. For $C^n$ with the norm $\|x\|_\infty$, the operator norm is
    $$\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| \qquad (7.3.17)$$
This is called the row norm of A. The proof of the formula is left as Problem 25,
although it is similar to that for $\|A\|_1$.
    3. Use the norm $\|x\|_2$ on $C^n$. From (7.3.10), we conclude that
    $$\|A\|_2 \le F(A) \qquad (7.3.18)$$
In general, these are not equal. For example, with A = I, the identity matrix, use
(7.3.10) and (7.3.11) to obtain
    $$F(I) = \sqrt{n} \qquad \|I\|_2 = 1$$
We prove
    $$\|A\|_2 = \sqrt{r_\sigma(A^*A)} \qquad (7.3.19)$$
as stated earlier in Table 7.1. The matrix A*A is Hermitian and all of its
eigenvalues are nonnegative, as shown in the proof of Theorem 7.5. Let it have
the eigenvalues
    $$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$$
counted according to their multiplicity, and let $u^{(1)}, \ldots, u^{(n)}$ be the correspond-
ing eigenvectors, arranged as an orthonormal basis for $C^n$.
    For a general $x \in C^n$,
    $$\|Ax\|_2^2 = (Ax, Ax) = (x, A^*Ax)$$
Write x as
    $$x = \sum_{j=1}^{n} \alpha_j u^{(j)} \qquad (7.3.20)$$
Then
    $$A^*Ax = \sum_{j=1}^{n} \alpha_j A^*A u^{(j)} = \sum_{j=1}^{n} \alpha_j \lambda_j u^{(j)}$$
and
    $$\|Ax\|_2^2 = (x, A^*Ax) = \sum_{j=1}^{n} \lambda_j |\alpha_j|^2
        \le \lambda_1 \sum_{j=1}^{n} |\alpha_j|^2 = \lambda_1 \|x\|_2^2$$
using (7.3.20) to calculate $\|x\|_2$. Thus
    $$\|A\|_2 \le \sqrt{\lambda_1}$$
Equality follows by noting that if $x = u^{(1)}$, then $\|x\|_2 = 1$ and
    $$\|Ax\|_2^2 = (x, A^*Ax) = (u^{(1)}, \lambda_1 u^{(1)}) = \lambda_1$$
This proves (7.3.19), since $\lambda_1 = r_\sigma(A^*A)$. It can be shown that AA* and A*A
have the same nonzero eigenvalues (see Problem 19); thus $r_\sigma(AA^*) = r_\sigma(A^*A)$,
an alternative formula for (7.3.19). It also proves
    $$\|A\|_2 = \|A^*\|_2 \qquad (7.3.21)$$
This is not true for the previous matrix norms.
    It can be shown fairly easily that if A is Hermitian, then
    $$\|A\|_2 = r_\sigma(A) \qquad (7.3.22)$$
This is left as Problem 27.
Example Consider the matrix
    $$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$
Then
    $$\|A\|_1 = 6 \qquad \|A\|_2 = \sqrt{15 + \sqrt{221}} \doteq 5.46 \qquad \|A\|_\infty = 7$$
As an illustration of the inequality (7.3.23) of the following theorem,
    $$r_\sigma(A) = \frac{5 + \sqrt{33}}{2} \doteq 5.37 < \|A\|_2$$
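The quoted values are easy to reproduce; the following check (Python/NumPy, not from the text) uses the matrix entries as recovered in the example above.

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

    print(np.linalg.norm(A, 1))               # column norm, 6
    print(np.linalg.norm(A, np.inf))          # row norm, 7
    print(np.linalg.norm(A, 2))               # sqrt(r_sigma(A^T A)), about 5.46
    print(max(abs(np.linalg.eigvals(A))))     # spectral radius, about 5.37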
Theorem 7.8 Let A be an arbitrary square matrix. Then for any operator matrix
    norm,
        $$r_\sigma(A) \le \|A\| \qquad (7.3.23)$$
    Moreover, if ε > 0 is given, then there is an operator matrix norm,
    denoted here by $\|\cdot\|_\epsilon$, for which
        $$\|A\|_\epsilon \le r_\sigma(A) + \epsilon \qquad (7.3.24)$$

Proof To prove (7.3.23), let $\|\cdot\|$ be any operator matrix norm with an associated
    compatible vector norm $\|\cdot\|_v$. Let λ be the eigenvalue in σ(A) for which
    $|\lambda| = r_\sigma(A)$, and let x be an associated eigenvector, $\|x\|_v = 1$. Then
        $$r_\sigma(A) = |\lambda| = |\lambda|\,\|x\|_v = \|\lambda x\|_v = \|Ax\|_v \le \|A\|\,\|x\|_v = \|A\|$$
    which proves (7.3.23).
The proof of (7.3.24) is a nontrivial construction, and a proof is given
in Isaacson and Keller (1966, p. 12).
The following corollary is an easy, but important, consequence of Theo-
rem 7.8.
Corollary For a square matrix A, $r_\sigma(A) < 1$ if and only if $\|A\| < 1$ for some
    operator matrix norm.

This result can be used to prove Theorem 7.9 in the next section, but we prefer
to use the Jordan canonical form, given in Theorem 7.6. The results (7.3.22) and
Theorem 7.8 show that $r_\sigma(A)$ is almost a matrix norm, and this result is used in
analyzing the rates of convergence for some of the iteration methods given in
Chapter 8 for solving linear systems of equations.
7.4 Convergence and Perturbation Theorems
The following results are the theoretical framework from which we later construct
error analyses for numerical methods for linear systems of equations.
Theorem 7.9 Let A be a square matrix of order n. Then $A^m$ converges to the
    zero matrix as $m \to \infty$ if and only if $r_\sigma(A) < 1$.

Proof We use Theorem 7.6 as a fundamental tool. Let J be the Jordan
    canonical form for A,
        $$P^{-1}AP = J$$
    Then
        $$A^m = P J^m P^{-1} \qquad (7.4.1)$$
    and $A^m \to 0$ if and only if $J^m \to 0$. Recall from (7.2.27) and (7.2.28) that
    J can be written as
        $$J = D + N$$
    in which
        $$D = \operatorname{diag}[\lambda_1, \ldots, \lambda_n]$$
    contains the eigenvalues of J (and A), and N is a matrix for which
        $$N^n = 0$$
    By examining the structure of D and N, we have DN = ND. Then
        $$J^m = (D + N)^m = \sum_{j=0}^{m} \binom{m}{j} D^{m-j} N^j$$
    and using $N^j = 0$ for $j \ge n$,
        $$J^m = \sum_{j=0}^{n} \binom{m}{j} D^{m-j} N^j \qquad m \ge n \qquad (7.4.2)$$
    Notice that the powers of D satisfy
        $$D^{m-j} = \operatorname{diag}\bigl[\lambda_1^{m-j}, \ldots, \lambda_n^{m-j}\bigr]
            \qquad m - j \ge m - n \to \infty \text{ as } m \to \infty \qquad (7.4.3)$$
    We need the following limits: For any positive c < 1 and any $r \ge 0$,
        $$m^r c^m \to 0 \qquad \text{as } m \to \infty \qquad (7.4.4)$$
    This can be proved using L'Hospital's rule from elementary calculus.
        In (7.4.2), there are a fixed number of terms, n + 1, regardless of the
    size of m, and we can consider the convergence of $J^m$ by considering
    each of the individual terms. Assuming $r_\sigma(A) < 1$, we know that all
    $|\lambda_i| < 1$, $i = 1, \ldots, n$. And for any matrix norm,
        $$\|J^m\| \le \sum_{j=0}^{n} \binom{m}{j} \|D^{m-j}\|\,\|N\|^j$$
    Using the row norm, we have that the preceding is bounded by
        $$\sum_{j=0}^{n} \binom{m}{j} \bigl[r_\sigma(A)\bigr]^{m-j} \|N\|_\infty^j
            \le \sum_{j=0}^{n} m^j \bigl[r_\sigma(A)\bigr]^{m-j} \|N\|_\infty^j$$
    which converges to zero as $m \to \infty$, using (7.4.3) and (7.4.4), for $0 \le j \le
    n$. This proves half of the theorem, namely that if $r_\sigma(A) < 1$, then $J^m$
    and $A^m$, from (7.4.1), converge to zero as $m \to \infty$.
        Suppose that $r_\sigma(A) \ge 1$. Then let λ be an eigenvalue of A for which
    $|\lambda| \ge 1$, and let x be an associated eigenvector, $x \ne 0$. Then
        $$\|A^m x\| = \|\lambda^m x\| = |\lambda|^m \|x\| \ge \|x\| > 0$$
    and clearly this does not converge to zero as $m \to \infty$. Thus it is not
    possible that $A^m \to 0$, as that would imply $A^m x \to 0$. This completes the
    proof.
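The behavior of the powers $A^m$ is easy to observe experimentally. The sketch below (Python/NumPy, not from the text; the matrix is illustrative) has spectral radius 0.6, and its powers decay to zero as Theorem 7.9 predicts.

    import numpy as np

    A = np.array([[0.5, 0.4],
                  [0.0, 0.6]])
    print(max(abs(np.linalg.eigvals(A))))   # r_sigma(A) = 0.6 < 1

    Am = np.eye(2)
    for m in range(60):
        Am = Am @ A                         # Am now holds A^(m+1)
    print(np.linalg.norm(Am, np.inf))       # essentially zero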
Theorem 7.10 (Geometric Series) Let A be a square matrix. If $r_\sigma(A) < 1$, then
    $(I - A)^{-1}$ exists, and it can be expressed as a convergent series,
        $$(I - A)^{-1} = I + A + A^2 + \cdots + A^m + \cdots \qquad (7.4.5)$$
    Conversely, if the series in (7.4.5) is convergent, then $r_\sigma(A) < 1$.

Proof Assume $r_\sigma(A) < 1$. We show the existence of $(I - A)^{-1}$ by proving the
    equivalent statement (3) of Theorem 7.2. Assume
        $$(I - A)x = 0$$
    Then Ax = x, and this implies that 1 is an eigenvalue of A if $x \ne 0$. But
    we assumed $r_\sigma(A) < 1$, and thus we must have x = 0, concluding the
    proof of the existence of $(I - A)^{-1}$.
        We need the following identity:
        $$(I - A)(I + A + A^2 + \cdots + A^m) = I - A^{m+1} \qquad (7.4.6)$$
    which is true for any matrix A. Multiplying by $(I - A)^{-1}$,
        $$I + A + A^2 + \cdots + A^m = (I - A)^{-1} - (I - A)^{-1}A^{m+1}$$
    The left-hand side has a limit if the right-hand side does. By Theorem
    7.9, $r_\sigma(A) < 1$ implies that $A^{m+1} \to 0$ as $m \to \infty$. Thus we have the
    result (7.4.5).
        Conversely, assume the series converges and denote it by
        $$B = I + A + A^2 + \cdots$$
    Then B - AB = B - BA = I, and thus I - A has an inverse, namely B.
    Taking limits on both sides of (7.4.6), the left-hand side has the limit
    (I - A)B = I, and thus the same must be true of the right-hand limit.
    But that implies
        $$A^{m+1} \to 0 \qquad \text{as } m \to \infty$$
    By Theorem 7.9, we must have $r_\sigma(A) < 1$.
Theorem 7.11 Let A be a square matrix. If for some operator matrix norm,
    $\|A\| < 1$, then $(I - A)^{-1}$ exists and has the geometric series
    expansion (7.4.5). Moreover,
        $$\|(I - A)^{-1}\| \le \frac{1}{1 - \|A\|} \qquad (7.4.7)$$

Proof Since $\|A\| < 1$, it follows from (7.3.23) of Theorem 7.8 that $r_\sigma(A) < 1$.
    Except for (7.4.7), the other conclusions follow from Theorem 7.10. For
    (7.4.7), let
        $$B_m = I + A + \cdots + A^m$$
    From (7.4.6),
        $$B_m = (I - A)^{-1}(I - A^{m+1})$$
        $$(I - A)^{-1} - B_m = (I - A)^{-1}\bigl[I - (I - A^{m+1})\bigr]
            = (I - A)^{-1} A^{m+1} \qquad (7.4.8)$$
    Using the reverse triangle inequality,
        $$\bigl|\,\|(I - A)^{-1}\| - \|B_m\|\,\bigr| \le \|(I - A)^{-1} - B_m\|
            \le \|(I - A)^{-1}\|\,\|A\|^{m+1}$$
    Since this converges to zero as $m \to \infty$, we have
        $$\|B_m\| \to \|(I - A)^{-1}\| \qquad \text{as } m \to \infty \qquad (7.4.9)$$
    From the definition of $B_m$ and the properties of a matrix norm,
        $$\|B_m\| \le \|I\| + \|A\| + \|A\|^2 + \cdots + \|A\|^m
            = \frac{1 - \|A\|^{m+1}}{1 - \|A\|} \le \frac{1}{1 - \|A\|}$$
    Combined with (7.4.9), this concludes the proof of (7.4.7).
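A small experiment (Python/NumPy, not from the text; the matrix is illustrative and satisfies $\|A\|_\infty < 1$) shows the partial sums of (7.4.5) converging and the bound (7.4.7) holding.

    import numpy as np

    A = np.array([[0.1, 0.3],
                  [0.2, 0.4]])
    norm = lambda M: np.linalg.norm(M, np.inf)

    inv = np.linalg.inv(np.eye(2) - A)          # (I - A)^{-1}

    B, term = np.eye(2), np.eye(2)              # partial sums B_m = I + A + ... + A^m
    for _ in range(60):
        term = term @ A
        B += term
    print(np.allclose(B, inv))
    print(norm(inv) <= 1.0 / (1.0 - norm(A)))   # the bound (7.4.7)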
Theorem 7.12 Let A and B be square matrices of the same order. Assume A is
    nonsingular and suppose that
        $$\|A - B\| < \frac{1}{\|A^{-1}\|} \qquad (7.4.10)$$
    Then B is also nonsingular,
        $$\|B^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\|\,\|A - B\|} \qquad (7.4.11)$$
    and
        $$\|A^{-1} - B^{-1}\| \le \frac{\|A^{-1}\|^2\,\|A - B\|}{1 - \|A^{-1}\|\,\|A - B\|} \qquad (7.4.12)$$

Proof Note the identity
        $$B = A - (A - B) = A\bigl[I - A^{-1}(A - B)\bigr] \qquad (7.4.13)$$
    The matrix $[I - A^{-1}(A - B)]$ is nonsingular using Theorem 7.11, based
    on the inequality (7.4.10), which implies
        $$\|A^{-1}(A - B)\| \le \|A^{-1}\|\,\|A - B\| < 1$$
    Since B is the product of nonsingular matrices, it too is nonsingular, and
        $$B^{-1} = \bigl[I - A^{-1}(A - B)\bigr]^{-1} A^{-1}$$
    The bound (7.4.11) follows by taking norms and applying Theorem 7.11.
    To prove (7.4.12), use
        $$A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$$
    Take norms again and apply (7.4.11).
This theorem is important in a number of ways. But for the moment, it says
that all sufficiently close perturbations of a nonsingular matrix are nonsingular.
Example We illustrate Theorem 7.11 by considering the invertibility of the
tridiagonal matrix
    $$A = \begin{bmatrix} 4 & 1 & & & 0 \\ 1 & 4 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 4 & 1 \\ 0 & & & 1 & 4 \end{bmatrix}$$
Rewrite A as
    $$A = 4(I + B) \qquad
      B = \begin{bmatrix} 0 & \tfrac{1}{4} & & & 0 \\ \tfrac{1}{4} & 0 & \tfrac{1}{4} & & \\ & \ddots & \ddots & \ddots & \\ & & \tfrac{1}{4} & 0 & \tfrac{1}{4} \\ 0 & & & \tfrac{1}{4} & 0 \end{bmatrix}$$
Using the row norm (7.3.17), $\|B\|_\infty = \frac{1}{2}$. Thus $(I + B)^{-1}$ exists from Theorem
7.11, and from (7.4.7),
    $$\|(I + B)^{-1}\|_\infty \le \frac{1}{1 - \|B\|_\infty} = 2$$
Thus $A^{-1}$ exists, $A^{-1} = \frac{1}{4}(I + B)^{-1}$, and
    $$\|A^{-1}\|_\infty \le \frac{1}{4} \cdot 2 = \frac{1}{2}$$
By use of the row norm and inequality (7.3.23),
    $$r_\sigma(A^{-1}) \le \|A^{-1}\|_\infty \le \frac{1}{2}$$
Since the eigenvalues of $A^{-1}$ are the reciprocals of those of A (see Problem 27),
and since all eigenvalues of A are real because A is Hermitian, we have the
bound
    $$|\lambda| \ge 2 \qquad \text{all } \lambda \in \sigma(A)$$
For better bounds in this case, see the Gerschgorin Circle Theorem of Chapter 9.
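These bounds can be confirmed numerically; the sketch below (Python/NumPy, not from the text) uses an illustrative order n = 6 for the tridiagonal matrix above.

    import numpy as np

    n = 6
    A = 4.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)

    print(np.linalg.norm(np.linalg.inv(A), np.inf) <= 0.5)   # ||A^{-1}||_inf <= 1/2
    print(min(abs(np.linalg.eigvalsh(A))) >= 2.0)            # every eigenvalue has |lambda| >= 2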
Discussion of the Literature
The subject of this chapter is linear algebra, especially selected for use in deriving
and analyzing methods of numerical linear algebra. The books by Anton (1984)
and Strang (1980) are introductory-level texts for undergraduate linear algebra.
Franklin's (1968) is a higher level introduction to matrix theory, and Halmos's
(1958) is a well-known text on abstract linear algebra. Noble's (1969) is a
wide-ranging applied linear algebra text. Introductions to the foundations are
also contained in Fadeeva (1959), Golub and Van Loan (1983), Parlett (1980),
Stewart (1973), and Wilkinson (1965), all of which are devoted entirely to
numerical linear algebra. For additional theory at a more detailed and higher
level, see the classical accounts of Gantmacher (1960) and Householder (1965).
Additional references are given in the bibliographies of Chapters 8 and 9.
Bibliography
Anton, H. (1984). Elementary Linear Algebra, 4th ed. Wiley, New York.
Fadeeva, V. (1959). Computational Methods of Linear Algebra. Dover, New York.
Franklin, J. (1968). Matrix Theory. Prentice-Hall, Englewood Cliffs, N.J.
Gantmacher, F. (1960). The Theory of Matrices, vols. I and II. Chelsea, New
York.
Golub, G., and C. Van Loan (1983). Matrix Computations. Johns Hopkins Press,
Baltimore.
Halmos, P. (1958). Finite-Dimensional Vector Spaces. Van Nostrand, Princeton,
N.J.
Householder, A. (1965). The Theory of Matrices in Numerical Analysis. Ginn
(Blaisdell), Boston.
Isaacson, E., and H. Keller (1966). Analysis of Numerical Methods. Wiley, New
York.
Lawson, C., and R. Hanson (1974). Solving Least Squares Problems. Prentice-Hall,
Englewood Cliffs, N.J.
Noble, B. (1969). Applied Linear Algebra. Prentice-Hall, Englewood Cliffs, N.J.
Parlett, B. (1980). The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood
Cliffs, N.J.
Stewart, G. (1973). Introduction to Matrix Computations. Academic Press, New
York.
Strang, G. (1980). Linear Algebra and Its Applications, 2nd ed. Academic Press,
New York.
Wilkinson, J. (1965). The Algebraic Eigenvalue Problem. Oxford Univ. Press,
Oxford, England.
Problems
1. Determine whether the following sets of vectors are dependent or indepen-
dent.
(a) (1, 2, -1, 3), (3, -1, 1, 1), (1, 9, -5, 11)
(b) (1, 1, 0), (0, 1, 1), (1, 0, 1)
2. Let A, B, and C be matrices of order m X n, n X p, and p X q, respec-
tively.
(a) Prove the associative law (AB)C = A(BC).
    (b) Prove $(AB)^T = B^T A^T$.

3.  (a) Produce square matrices A and B for which $AB \ne BA$.
    (b) Produce square matrices A and B, with no zero entries, for which
        $AB = 0$, $BA \ne 0$.
4.  Let A be a matrix of order m × n, and let r and c denote the row and
    column rank of A, respectively. Prove that r = c. Hint: For convenience,
    assume that the first r rows of A are independent, with the remaining rows
    dependent on these first r rows, and assume the same for the first c
    columns of A. Let $\hat{A}$ denote the r × n matrix obtained by deleting the last
    m - r rows of A, and let $\hat{r}$ and $\hat{c}$ denote the row and column rank of $\hat{A}$,
    respectively. Clearly $\hat{r} = r$. Also, the columns of $\hat{A}$ are elements of $C^r$,
    which has dimension r, and thus we must have $\hat{c} \le r$. Show that $\hat{c} = c$,
    thus proving that $c \le r$. The reverse inequality will follow by applying the
    same argument to $A^T$, and taken together, these two inequalities
    imply r = c.
5. Prove the equivalence of statements (1)-(4) and (6) of Theorem 7.2. Hint:
Use Theorem 7.1, the result in Problem 4, and the decomposition (7.1.6).
6.  Let
        $$f_n(x) = \det \begin{bmatrix} x & 1 & & & 0 \\ 1 & x & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & x & 1 \\ 0 & & & 1 & x \end{bmatrix}$$
    with the matrix of order n. Also define $f_0(x) = 1$.
    (a) Show
        $$f_{n+1}(x) = x f_n(x) - f_{n-1}(x) \qquad n \ge 1$$
    (b) Show
        $$f_n(x) = S_n(x/2) \qquad n \ge 0$$
        with $S_n(x)$ the Chebyshev polynomial of the second kind of degree n
        (see Problem 24 in Chapter 4).

7.  For A a real square matrix of order n, the function
        $$q(x_1, \ldots, x_n) = (Ax, x) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j$$
    is called the quadratic form determined by A. It is a quadratic polynomial in
    the n variables $x_1, \ldots, x_n$, and it occurs when considering the maximiza-
    tion or minimization of a function of n variables.
    (a) Prove that if A is skew-symmetric, then q(x) = 0.
    (b) For a general square matrix A, define $A_1 = \frac{1}{2}(A + A^T)$,
        $A_2 = \frac{1}{2}(A - A^T)$. Then $A = A_1 + A_2$. Show that $A_1$ is symmetric, and
        $(Ax, x) = (A_1 x, x)$, all $x \in R^n$. This shows that the coefficient matrix
        A for a quadratic form can always be assumed to be symmetric,
        without any loss of generality.
8.  Given the orthogonal vectors
        $$u^{(1)} = (1, 2, -1) \qquad u^{(2)} = (1, 1, 3)$$
    produce a third vector $u^{(3)}$ such that $\{u^{(1)}, u^{(2)}, u^{(3)}\}$ is an orthogonal basis
    for $R^3$. Normalize these vectors to obtain an orthonormal basis.
9.  For the column vector $w \in C^n$ with $\|w\|_2 = \sqrt{w^*w} = 1$, define the n × n
    matrix
        $$A = I - 2ww^*$$
    (a) For the special case $w = \bigl[\tfrac{1}{3}, \tfrac{2}{3}, \tfrac{2}{3}\bigr]^T$, produce the matrix A. Verify that
        it is symmetric and orthogonal.
    (b) Show that, in general, all such matrices A are Hermitian and unitary.
10. Let W be a subspace of $R^n$. For $x \in R^n$, define
        $$p(x) = \inf_{y \in W} \|x - y\|_2$$
    Let $\{u_1, \ldots, u_m\}$ be an orthonormal basis of W, where m is the dimension
    of W. Extend this to an orthonormal basis $\{u_1, \ldots, u_m, \ldots, u_n\}$ of all
    of $R^n$.
    (a) Show that
        $$p(x) = \left[\sum_{j=m+1}^{n} (u_j^T x)^2\right]^{1/2}$$
        and that it is uniquely attained at
        $$y = Px \qquad P = \sum_{j=1}^{m} u_j u_j^T$$
        This is called the orthogonal projection of x onto W.
    (b) Show $P^2 = P$. Such a matrix is called a projection matrix.
    (c) Show $P^T = P$.
    (d) Show (Px, z - Pz) = 0 for all $x, z \in R^n$.
    (e) Show $\|x\|_2^2 = \|Px\|_2^2 + \|x - Px\|_2^2$ for all $x \in R^n$. This is a version of
        the theorem of Pythagoras.
11. Calculate the eigenvalues and eigenvectors of the following matrices.
(a) [ j]
12. Let $y \ne 0$ in $R^n$, and define $A = yy^T$, an n × n matrix. Show that λ = 0 is
    an eigenvalue of multiplicity exactly n - 1. What is the single nonzero
    eigenvalue?
13. Let U be an n X n unitary matrix.
    (a) Show $\|Ux\|_2 = \|x\|_2$, all $x \in C^n$. Use this to prove that the distance
        between points x and y is the same as the distance between Ux and
        Uy, showing that unitary transformations of $C^n$ preserve distances
        between all points.
    (b) Let U be orthogonal, and show that
        $$(Ux, Uy) = (x, y) \qquad x, y \in R^n$$
        This shows that orthogonal transformations of $R^n$ also preserve angles
        between lines, as defined in (7.1.12).
(c) Show that all eigenvalues of a unitary matrix have magnitude one.
14. Let A be a Hermitian matrix of order n. It is called positive definite if and
    only if (Ax, x) > 0 for all $x \ne 0$ in $C^n$. Show that A is positive definite if
    and only if all of its eigenvalues are real and positive. Hint: Use Theorem
    7.4, and expand (Ax, x) by using an eigenvector basis to express an
    arbitrary $x \in C^n$.
15. Let A be real and symmetric, and denote its eigenvalues by $\lambda_1, \ldots, \lambda_n$,
    repeated according to their multiplicity. Using a basis of orthonormal
    eigenvectors, show that the quadratic form of Problem 7, q(x) = (Ax, x),
    $x \in R^n$, can be reduced to the simpler form
        $$q(x) = \sum_{j=1}^{n} \alpha_j^2 \lambda_j$$
    with the $\{\alpha_j\}$ determined from x. Using this, explore the possible graphs
    for
        $$(Ax, x) = \text{constant}$$
    when A is of order 3.
16. Assume A is real, symmetric, positive definite, and of order n. Define
Show that the unique minimum of f(x) is given by solving Ax= b for
x=A-
1
b.
17. Let f(x) be a real valued function of $x \in R^n$, and assume f(x) is three
    times continuously differentiable with respect to the components of x.
    Apply Taylor's theorem 1.5 of Chapter 1, generalized to n variables, to
    obtain
        $$f(x) = f(a) + (x - a)^T \nabla f(a) + \tfrac{1}{2}(x - a)^T H(a)(x - a) + O(\|x - a\|^3)$$
    Here
        $$\nabla f(x) = \left[\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right]^T$$
    is the gradient of f, and
        $$H(x) = \left[\frac{\partial^2 f(x)}{\partial x_i\,\partial x_j}\right] \qquad 1 \le i, j \le n$$
    is the Hessian matrix for f(x). The final term indicates that the remaining
    terms are smaller than some multiple of $\|x - a\|^3$ for x close to a.
        If a is to be a local maximum or minimum, then a necessary condition is
    that $\nabla f(a) = 0$. Assuming $\nabla f(a) = 0$, show that a is a strict (or unique)
    local minimum of f(x) if and only if H(a) is positive definite. [Note that
    H(x) is always symmetric.]
18. Demonstrate the relation (7.2.28).
19. Recall the notation used in Theorem 7.5 on the singular value decomposi-
tion of a matrix A.
    (a) Show that $\mu_1^2, \ldots, \mu_r^2$ are the nonzero eigenvalues of $A^*A$ and $AA^*$,
        with corresponding eigenvectors $u^{(1)}, \ldots, u^{(r)}$ and $v^{(1)}, \ldots, v^{(r)}$, re-
        spectively. The vector $u^{(j)}$ denotes column j of U, and similarly for
        $v^{(j)}$ and V.
    (c) Prove r = rank(A).
20. For any polynomial $p(x) = b_0 + b_1 x + \cdots + b_m x^m$, and for A any square
    matrix, define
        $$p(A) = b_0 I + b_1 A + \cdots + b_m A^m$$
    Let A be a matrix for which the Jordan canonical form is a diagonal
    matrix,
        $$P^{-1}AP = D = \operatorname{diag}[\lambda_1, \ldots, \lambda_n]$$
    For the characteristic polynomial $f_A(\lambda)$ of A, prove $f_A(A) = 0$. (This result
    is the Cayley-Hamilton theorem. It is true for any square matrix, not just
    those that have a diagonal Jordan canonical form.) Hint: Use the result
    $A = PDP^{-1}$ to simplify $f_A(A)$.
21. Prove the following: for $x \in C^n$,
    (a) $\|x\|_\infty \le \|x\|_1 \le n\,\|x\|_\infty$
    (b) $\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty$
    (c) $\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2$
22. Let A be a real nonsingular matrix of order n, and let $\|\cdot\|_v$ denote a
    vector norm on $R^n$. Define
        $$\|x\|_* = \|Ax\|_v$$
    Show that $\|\cdot\|_*$ is a vector norm on $R^n$.
23. Show
        $$\lim_{p \to \infty} \|x\|_p = \max_{1 \le i \le n} |x_i| \qquad x \in C^n$$
    This justifies the use of the notation $\|x\|_\infty$ for the right side.
24. For any matrix norm, show that (a) $\|I\| \ge 1$, and (b) $\|A^{-1}\| \ge 1/\|A\|$. For
    an operator norm, it is immediate from (7.3.11) that $\|I\| = 1$.

25. Derive formula (7.3.17) for the operator matrix norm $\|A\|_\infty$.
26. Define a vector norm on $R^n$ by
        $$\|x\| = \frac{1}{n} \sum_{j=1}^{n} |x_j|$$
    What is the operator matrix norm associated with this vector norm?
27. Let A be a square matrix of order n × n.
    (a) Given the eigenvalues and eigenvectors of A, determine those
        of (1) $A^m$ for $m \ge 2$, (2) $A^{-1}$, assuming A is nonsingular, and
        (3) $A + cI$, c = constant.
    (b) Prove $\|A\|_2 = r_\sigma(A)$ when A is Hermitian.
    (c) For A arbitrary and U unitary of the same order, show
        $\|AU\|_2 = \|UA\|_2 = \|A\|_2$.
28. Let A be square of order n × n.
    (a) Show that F(AU) = F(UA) = F(A), for any unitary matrix U.
    (b) If A is Hermitian, then show that
        $$F(A) = \sqrt{\lambda_1^2 + \cdots + \lambda_n^2}$$
        where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of A, repeated according to their
multiplicity. Furthermore,
29. Recalling the notation of Theorem 7.5, show
        $$F(A) = \sqrt{\mu_1^2 + \cdots + \mu_r^2}$$
30. Let A be of order n X n. Show
If A is symmetric and positive definite, show
31. Show that the infinite series
        $$I + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \cdots + \frac{A^n}{n!} + \cdots$$
    converges for any square matrix A, and denote the sum of the series by $e^A$.
    (a) If $A = P^{-1}BP$, show that $e^A = P^{-1} e^B P$.
    (b) Let $\lambda_1, \ldots, \lambda_n$ denote the eigenvalues of A, repeated according to
        their multiplicity, and show that the eigenvalues of $e^A$ are
        $e^{\lambda_1}, \ldots, e^{\lambda_n}$.
32. Consider the matrix with 6 on the diagonal and 1 in the two positions on
    each side of the diagonal in every row,
        $$A = \begin{bmatrix}
            6 & 1 & 1 & & & 0 \\
            1 & 6 & 1 & 1 & & \\
            1 & 1 & 6 & 1 & 1 & \\
            & \ddots & \ddots & \ddots & \ddots & \ddots \\
            & & 1 & 1 & 6 & 1 \\
            0 & & & 1 & 1 & 6
        \end{bmatrix}$$
    Show A is nonsingular. Find a bound for $\|A^{-1}\|_\infty$ and $\|A^{-1}\|_2$.
33. In producing cubic interpolating splines in Section 3.7 of Chapter 3, it was
    necessary to solve the linear system AM = D of (3.7.21) with
        $$A = \begin{bmatrix}
            \frac{h_1}{3} & \frac{h_1}{6} & & & 0 \\
            \frac{h_1}{6} & \frac{h_1 + h_2}{3} & \frac{h_2}{6} & & \\
            & \ddots & \ddots & \ddots & \\
            & & \frac{h_{m-1}}{6} & \frac{h_{m-1} + h_m}{3} & \frac{h_m}{6} \\
            0 & & & \frac{h_m}{6} & \frac{h_m}{3}
        \end{bmatrix}$$
    All $h_i > 0$, i = 1, ..., m. Using one or more of the results of Section 7.4,
    show that A is nonsingular. In addition, derive bounds for the eigenvalues
    $\lambda \in \sigma(A)$ of A.
34. Let A be a square matrix, with $A^m = 0$ for some $m \ge 2$. Show that I - A
    is nonsingular. Such a matrix A is called nilpotent.
EIGHT
NUMERICAL SOLUTION
OF SYSTEMS OF LINEAR EQUATIONS
Systems of linear equations arise in a large number of areas, both directly in
modeling physical situations and indirectly in the numerical solution of other
mathematical models. These applications occur in virtually all areas of the
physical, biological, and social sciences. In addition, linear systems are involved
in the following: optimization theory; solving systems of nonlinear equations; the
approximation of functions; the numerical solution of boundary value problems
for ordinary differential equations, partial differential equations, and integral
equations; statistical inference; and numerous other problems. Because of the
widespread importance of linear systems, much research has been devoted to
their numerical solution. Excellent algorithms have been developed for the most
common types of problems for linear systems, and some of these are defined,
analyzed, and illustrated in this chapter.
The most common type of problem is to solve a square linear system
Ax= b
of moderate order, with coefficients that are mostly nonzero. Such linear systems,
of any order, are called dense. For such systems, the coefficient matrix A must
generally be stored in the main memory of the computer in order to efficiently
solve the linear system, and thus memory storage limitations in most computers
will limit the order of the system. With the rapid decrease in the cost of computer
memory, quite large linear systems can be accommodated on some machines, but
it is expected that for most smaller machines, the practical upper limits on the
order will be of size 100 to 500. Most algorithms for solving such dense systems
are based on Gaussian elimination, which is defined in Section 8.1. It is a direct
method in the theoretical sense that if rounding errors are ignored, then the exact
answer is found in a finite number of steps. Modifications for improved error
behavior with Gaussian elimination, variants for special classes of matrices, and
error analyses are given in Section 8.2 through Section 8.5.
A second important type of problem is to solve Ax = b when: A is square,
sparse, and of large order. A sparse matrix is one in which most coefficients are
zero. Such systems arise in a variety of ways, but we restrict our development to
those for which there is a simple, known pattern for the nonzero coefficients.
These systems arise commonly in the numerical solution of partial differential
equations, and an example is given in Section 8.8. Because of the large order of
most sparse systems of linear equations, sometimes as large as $10^5$ or more, the
linear system cannot usually be solved by a direct method such as Gaussian
elimination. Iteration methods are the preferred method of solution, and these
are introduced in Section 8.6 through Section 8.9.
For solving dense square systems of moderate order, most computer centers
have a set of programs that can be used for a variety of problems. Students
should become acquainted with those at their university computer center and use
them to further illustrate the material of this chapter. An excellent package is
called LINPACK, and it is described in Dongarra et al. (1979). It is widely
available, and we will make further reference to it later in this chapter.
8.1 Gaussian Elimination
This is the formal name given to the method of solving systems of linear
equations by successively eliminating unknowns and reducing to systems of lower
order. It is the method most people learn in high school algebra or in an
undergraduate linear algebra course (in which it is often associated with produc-
ing the row-echelon form of a matrix). A precise definition is given of Gaussian
elimination, which is necessary when implementing it on a computer and when
analyzing the effects of rounding errors that occur when computing with it.
To solve Ax = b, we reduce it to an equivalent system Ux = g, in which U is
upper triangular. This system can be easily solved by a process of back-substitu-
tion. Denote the original linear system by $A^{(1)}x = b^{(1)}$,
    $$a_{ij}^{(1)} \equiv a_{ij} \qquad b_i^{(1)} \equiv b_i \qquad 1 \le i, j \le n$$
in which n is the order of the system. We reduce the system to the triangular
form Ux = g by adding multiples of one equation to another equation, eliminat-
ing some unknown from the second equation. Additional row operations are used
in the modifications given in succeeding sections. To keep the presentation
simple, we make some technical assumptions in defining the algorithm; they are
removed in the next section.
Gaussian elimination algorithm
STEP 1: Assume $a_{11}^{(1)} \ne 0$. Define the row multipliers by
        $$m_{i1} = \frac{a_{i1}^{(1)}}{a_{11}^{(1)}} \qquad i = 2, 3, \ldots, n$$
    These are used in eliminating the $x_1$ term from equations 2 through n.
    Define
        $$a_{ij}^{(2)} = a_{ij}^{(1)} - m_{i1} a_{1j}^{(1)} \qquad i, j = 2, \ldots, n$$
        $$b_i^{(2)} = b_i^{(1)} - m_{i1} b_1^{(1)} \qquad i = 2, \ldots, n$$
    Also, the first rows of A and b are left undisturbed, and the first
    column of $A^{(2)}$, below the diagonal, is set to zero. The system
    $A^{(2)}x = b^{(2)}$ looks like
        $$\begin{bmatrix}
            a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} \\
            0 & a_{22}^{(2)} & \cdots & a_{2n}^{(2)} \\
            \vdots & \vdots & & \vdots \\
            0 & a_{n2}^{(2)} & \cdots & a_{nn}^{(2)}
        \end{bmatrix}
        \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
        = \begin{bmatrix} b_1^{(1)} \\ b_2^{(2)} \\ \vdots \\ b_n^{(2)} \end{bmatrix}$$
    We continue to eliminate unknowns, going on to columns 2, 3, etc.,
    and this is expressed generally in the following.
STEP k: Let $1 \le k \le n - 1$. Assume that $A^{(k)}x = b^{(k)}$ has been constructed,
    with $x_1, x_2, \ldots, x_{k-1}$ eliminated at successive stages, and $A^{(k)}$ has
    the form
        $$A^{(k)} = \begin{bmatrix}
            a_{11}^{(1)} & a_{12}^{(1)} & \cdots & \cdots & \cdots & a_{1n}^{(1)} \\
            0 & a_{22}^{(2)} & \cdots & \cdots & \cdots & a_{2n}^{(2)} \\
            \vdots & & \ddots & & & \vdots \\
            0 & \cdots & 0 & a_{kk}^{(k)} & \cdots & a_{kn}^{(k)} \\
            \vdots & & \vdots & \vdots & & \vdots \\
            0 & \cdots & 0 & a_{nk}^{(k)} & \cdots & a_{nn}^{(k)}
        \end{bmatrix}$$
    Assume $a_{kk}^{(k)} \ne 0$. Define the multipliers
        $$m_{ik} = \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}} \qquad i = k + 1, \ldots, n \qquad (8.1.1)$$
    Use these to remove the unknown $x_k$ from equations k + 1 through
    n. Define
        $$a_{ij}^{(k+1)} = a_{ij}^{(k)} - m_{ik} a_{kj}^{(k)} \qquad
          b_i^{(k+1)} = b_i^{(k)} - m_{ik} b_k^{(k)} \qquad
          i, j = k + 1, \ldots, n \qquad (8.1.2)$$
    The earlier rows 1 through k are left undisturbed, and zeros are
    introduced into column k below the diagonal element.
        By continuing in this manner, after n - 1 steps we obtain
    $A^{(n)}x = b^{(n)}$:
        $$\begin{bmatrix}
            a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} \\
            & a_{22}^{(2)} & \cdots & a_{2n}^{(2)} \\
            & & \ddots & \vdots \\
            0 & & & a_{nn}^{(n)}
        \end{bmatrix}
        \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
        = \begin{bmatrix} b_1^{(1)} \\ b_2^{(2)} \\ \vdots \\ b_n^{(n)} \end{bmatrix}$$
For notational convenience, let $U = A^{(n)}$ and $g = b^{(n)}$. The system
Ux = g is upper triangular, and it is quite easy to solve. First,
    $$x_n = \frac{g_n}{u_{nn}}$$
and then
    $$x_k = \frac{g_k - \sum_{j=k+1}^{n} u_{kj} x_j}{u_{kk}} \qquad k = n - 1, n - 2, \ldots, 1$$
This completes the Gaussian elimination algorithm.
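For readers who want to experiment, the following compact Python sketch (it is not the text's program; the function name and the test system are chosen here for illustration) carries out the elimination steps (8.1.1)-(8.1.2) and the back substitution just described, with no pivoting, so a nonzero pivot is assumed at every step.

    import numpy as np

    def gauss_solve(A, b):
        A, b = A.astype(float).copy(), b.astype(float).copy()
        n = len(b)
        for k in range(n - 1):                     # elimination step k
            for i in range(k + 1, n):
                m = A[i, k] / A[k, k]              # multiplier m_ik, (8.1.1)
                A[i, k:] -= m * A[k, k:]           # update row i, (8.1.2)
                b[i] -= m * b[k]
        x = np.zeros(n)                            # back substitution on Ux = g
        for k in range(n - 1, -1, -1):
            x[k] = (b[k] - A[k, k + 1:] @ x[k + 1:]) / A[k, k]
        return x

    A = np.array([[1.0, 2.0, 1.0],
                  [2.0, 2.0, 3.0],
                  [-1.0, -3.0, 0.0]])
    b = np.array([0.0, 3.0, 2.0])
    print(gauss_solve(A, b))      # the system of the example that follows; gives (1, -1, 1)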
Example Solve the linear system
    $$\begin{aligned}
        x_1 + 2x_2 + x_3 &= 0 \\
        2x_1 + 2x_2 + 3x_3 &= 3 \\
        -x_1 - 3x_2 \phantom{{}+ x_3} &= 2
    \end{aligned} \qquad (8.1.4)$$
To simplify the notation, we note that the unknowns $x_1, x_2, x_3$ never enter into
the algorithm until the final step. Thus we represent the preceding linear system
with the augmented matrix
    $$[A \mid b] = \begin{bmatrix} 1 & 2 & 1 & 0 \\ 2 & 2 & 3 & 3 \\ -1 & -3 & 0 & 2 \end{bmatrix}$$
The row operations are performed on this augmented matrix, and the unknowns
are given in the final step. Using the multipliers $m_{21} = 2$ and $m_{31} = -1$ to
eliminate $x_1$ from rows 2 and 3 gives
    $$\begin{bmatrix} 1 & 2 & 1 & 0 \\ 0 & -2 & 1 & 3 \\ 0 & -1 & 1 & 2 \end{bmatrix}$$
and then using $m_{32} = \frac{1}{2}$ to eliminate $x_2$ from row 3,
    $$[U \mid g] = \begin{bmatrix} 1 & 2 & 1 & 0 \\ 0 & -2 & 1 & 3 \\ 0 & 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}$$
Solving Ux = g,
    $$x_3 = 1 \qquad x_2 = -1 \qquad x_1 = 1$$
Triangular factorization of a matrix It is convenient to keep the multipliers $m_{ij}$,
since we often want to solve Ax = b with the same A but a different vector b. In
the computer, the new elements $a_{ij}^{(k+1)}$ produced at each step are stored into the
space used for the old elements $a_{ij}^{(k)}$. The elements below the diagonal are being
zeroed, and this provides a convenient storage for the elements $m_{ij}$: store $m_{ij}$
into the space originally used to store $a_{ij}$, i > j.
    There is yet another reason for looking at the multipliers $m_{ij}$ as the elements
of a matrix. First, introduce the lower triangular matrix
    $$L = \begin{bmatrix}
        1 & & & 0 \\
        m_{21} & 1 & & \\
        \vdots & & \ddots & \\
        m_{n1} & m_{n2} & \cdots & 1
    \end{bmatrix}$$
Theorem 8.1 If L and U are the lower and upper triangular matrices defined
previously using Gaussian elimination, then
A =LU (8.1.5)
Proof This proof is basically an algebraic manipulation, making use of defini-
    tions (8.1.1) and (8.1.2). To visualize the matrix element $(LU)_{ij}$, use the
    vector formula
        $$(LU)_{ij} = [m_{i1}, \ldots, m_{i, i-1}, 1, 0, \ldots, 0]\,
            [u_{1j}, \ldots, u_{jj}, 0, \ldots, 0]^T$$
    For $i \le j$,
        $$(LU)_{ij} = \sum_{k=1}^{i-1} m_{ik} u_{kj} + u_{ij}
            = \sum_{k=1}^{i-1} m_{ik} a_{kj}^{(k)} + a_{ij}^{(i)}$$
    From (8.1.2), $m_{ik} a_{kj}^{(k)} = a_{ij}^{(k)} - a_{ij}^{(k+1)}$, and the sum telescopes:
        $$(LU)_{ij} = \sum_{k=1}^{i-1} \bigl[a_{ij}^{(k)} - a_{ij}^{(k+1)}\bigr] + a_{ij}^{(i)}
            = a_{ij}^{(1)} = a_{ij}$$
    For $i > j$,
        $$(LU)_{ij} = \sum_{k=1}^{j-1} m_{ik} u_{kj} + m_{ij} u_{jj}
            = \sum_{k=1}^{j-1} \bigl[a_{ij}^{(k)} - a_{ij}^{(k+1)}\bigr] + a_{ij}^{(j)}
            = a_{ij}^{(1)} = a_{ij}$$
    This completes the proof.
The decomposition (8.1.5) is an important result, and extensive use is made of
it in developing variants of Gaussian elimination for special classes of matrices.
But for the moment we give only the following corollary.

Corollary With the matrices A, L, and U as in Theorem 8.1,
        $$\det(A) = \det(U) = a_{11}^{(1)} a_{22}^{(2)} \cdots a_{nn}^{(n)}$$

Proof By the product rule for determinants,
        $$\det(A) = \det(L)\det(U)$$
    Since L and U are triangular, their determinants are the product of their
    diagonal elements. The desired result follows easily, since det(L) = 1.

Example For the system (8.1.4) of the previous example,
    $$L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & \frac{1}{2} & 1 \end{bmatrix} \qquad
      U = \begin{bmatrix} 1 & 2 & 1 \\ 0 & -2 & 1 \\ 0 & 0 & \frac{1}{2} \end{bmatrix}$$
It is easily verified that A = LU. Also det(A) = det(U) = -1.
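The factorization and the determinant formula of the corollary can be checked with a few lines of code. The sketch below (Python/NumPy, not the text's program) records the multipliers in L during elimination and verifies A = LU and det(A) = $u_{11} u_{22} \cdots u_{nn}$ for the example system.

    import numpy as np

    def lu_no_pivot(A):
        U = A.astype(float).copy()
        n = U.shape[0]
        L = np.eye(n)
        for k in range(n - 1):
            for i in range(k + 1, n):
                L[i, k] = U[i, k] / U[k, k]        # multiplier m_ik
                U[i, k:] -= L[i, k] * U[k, k:]     # zero the entry below the pivot
        return L, U

    A = np.array([[1.0, 2.0, 1.0],
                  [2.0, 2.0, 3.0],
                  [-1.0, -3.0, 0.0]])
    L, U = lu_no_pivot(A)
    print(np.allclose(L @ U, A))                   # A = LU
    print(np.prod(np.diag(U)))                     # det(A), equal to -1 here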
Operation count To analyze the number of operations necessary to solve
Ax = b using Gaussian elimination, we will consider separately the creation of L
and U from A, the modification of b to g, and finally the solution of x.
1. Calculation of L and U. At step 1, n - 1 divisions were used to calculate the
   multipliers $m_{i1}$, $2 \le i \le n$. Then $(n - 1)^2$ multiplications and $(n - 1)^2$
   additions were used to create the new elements $a_{ij}^{(2)}$. We can continue in this
Table 8.1 Operation count for the LU decomposition of a matrix

    Step k      Additions           Multiplications     Divisions
    1           (n - 1)^2           (n - 1)^2           n - 1
    2           (n - 2)^2           (n - 2)^2           n - 2
    ...         ...                 ...                 ...
    n - 1       1                   1                   1
    Total       n(n-1)(2n-1)/6      n(n-1)(2n-1)/6      n(n-1)/2
   way for each step. The results are summarized in Table 8.1. The total value
   for each column was obtained using the identities
        $$\sum_{j=1}^{p} j = \frac{p(p + 1)}{2} \qquad
          \sum_{j=1}^{p} j^2 = \frac{p(p + 1)(2p + 1)}{6} \qquad p \ge 1$$
Traditionally, it is the number of multiplications and divisions, counted
together, that is used as the operation count for Gaussian elimination. On
earlier computers, additions were much faster than multiplications and
divisions, and thus additions were ignored in calculating the cost of many
algorithms. However, on modern computers, the time of additions, multipli-
cations, and divisions are quite close in size. For a convenient notation, let
MD() and AS() denote the number of multiplications and divisions, and
the number of additions and subtractions, respectively, for the computation
of the quantity in the parentheses.
   For the LU decomposition of A, we have
        $$MD(LU) = \frac{n(n^2 - 1)}{3} \doteq \frac{n^3}{3} \qquad
          AS(LU) = \frac{n(n - 1)(2n - 1)}{6} \doteq \frac{n^3}{3} \qquad (8.1.6)$$
   The final estimates are valid for larger values of n.
2. Modification of b to $g = b^{(n)}$:
        $$MD(g) = (n - 1) + (n - 2) + \cdots + 1 = \frac{n(n - 1)}{2} \qquad
          AS(g) = \frac{n(n - 1)}{2} \qquad (8.1.7)$$
3. Solution of Ux = g:
        $$MD(x) = \frac{n(n + 1)}{2} \qquad AS(x) = \frac{n(n - 1)}{2} \qquad (8.1.8)$$
4. Solution of Ax = b. Combine (1) through (3) to get
        $$MD(LU, x) = \frac{n^3}{3} + n^2 - \frac{n}{3} \doteq \frac{n^3}{3} \qquad
          AS(LU, x) = \frac{n(n - 1)(2n + 5)}{6} \doteq \frac{n^3}{3} \qquad (8.1.9)$$
   The number of additions is always about the same as the number of
   multiplications and divisions, and thus from here on, we consider only the
   latter. The first thing to note is that solving Ax = b is comparatively cheap
   when compared to such a supposedly simple operation as multiplying two
   n × n matrices. The matrix multiplication requires $n^3$ operations, and the
   solution of Ax = b requires only about $\frac{1}{3}n^3$ operations.
       Second, the main cost of solving Ax = b is in producing the decomposi-
   tion A = LU. Once it has been found, only $n^2$ additional operations are
   necessary to solve Ax = b. After once solving Ax = b, it is comparatively
   cheap to solve additional systems with the same coefficient matrix, provided
necessary to solve Ax = b. After once solving Ax = b, it is comparatively
cheap to solve additional systems with the same coefficient matrix, provided
the LU decomposition has been saved.
Finally, Gaussian elimination is much cheaper than Cramer's rule, which
uses determinants and is often taught in linear algebra courses [for example,
see Anton (1984), sec. 2.4]. If the determinants in Cramer's rule are com-
puted using expansion by minors, then the operation count is (n + 1)!. For
n = 10, Gaussian elimination uses 430 operations, and Cramer's rule uses
39,916,800 operations. This should emphasize the point that Cramer's rule is
not a practical computational tool, and that it should be considered as just a
theoretical mathematics tool.
5. Inversion of A. The inverse $A^{-1}$ is generally not needed, but it can be
   produced by using Gaussian elimination. Finding $A^{-1}$ is equivalent to
   solving the equation AX = I, with X an n × n unknown matrix. If we write
   X and I in terms of their columns,
        $$X = [x^{(1)}, \ldots, x^{(n)}] \qquad I = [e^{(1)}, \ldots, e^{(n)}]$$
   then solving AX = I is equivalent to solving the n systems
        $$Ax^{(1)} = e^{(1)}, \quad \ldots, \quad Ax^{(n)} = e^{(n)} \qquad (8.1.10)$$
   all having the same coefficient matrix A. Using (1)-(3),
        $$MD(A^{-1}) \doteq \frac{n^3}{3} + n \cdot n^2 \doteq \frac{4}{3} n^3$$
   Calculating $A^{-1}$ is four times the expense of solving Ax = b for a single
   vector b, not n times the work as one might first imagine. By careful
   attention to the details of the inversion process, taking advantage of the
   special form of the right-hand vectors $e^{(1)}, \ldots, e^{(n)}$, it is possible to further
   reduce the operation count to exactly
        $$MD(A^{-1}) = n^3 \qquad (8.1.11)$$
   However, it is still wasteful in most situations to produce $A^{-1}$ to solve
   Ax = b. And there is no advantage in saving $A^{-1}$ rather than the LU
   decomposition to solve future systems Ax = b. In both cases, the number of
   multiplications and divisions necessary to solve Ax = b is exactly $n^2$.
8.2 Pivoting and Scaling in Gaussian Elimination
At each stage of the elimination process in the last section, we assumed the
appropriate pivot element $a_{kk}^{(k)} \ne 0$. To remove this assumption, begin each step
of the elimination process by switching rows to put a nonzero element in the
pivot position. If none such exists, then the matrix must be singular, contrary to
assumption.
It is not enough, however, to just ask that the pivot element be nonzero. Often
an element would be zero except for rounding errors that have occurred in
calculating it. Using such an element as the pivot element will result in gross
errors in the further calculations in the matrix. To guard against this, and for
other reasons involving the propagation of rounding errors, we introduce partial
pivoting and complete pivoting.
Definition 1. Partial Pivoting. For $1 \le k \le n - 1$, in the Gaussian elimination
    process at stage k, let
        $$c_k = \max_{k \le i \le n} \bigl|a_{ik}^{(k)}\bigr| \qquad (8.2.1)$$
    Let i be the smallest row index, $i \ge k$, for which the maximum $c_k$ is
    attained. If i > k, then switch rows k and i in A and b, and proceed
    with step k of the elimination process. All of the multipliers will now
    satisfy
        $$|m_{ik}| \le 1 \qquad i = k + 1, \ldots, n \qquad (8.2.2)$$
    This aids in preventing the growth of elements in $A^{(k)}$ of greatly
    varying size, and thus lessens the possibility for large loss of signifi-
    cance errors.
2. Complete Pivoting. Define
        $$c_k = \max_{k \le i, j \le n} \bigl|a_{ij}^{(k)}\bigr|$$
    Switch rows of A and b and columns of A to bring to the pivot
    position an element giving the maximum $c_k$. Note that with a column
    switch, the order of the unknowns is changed. At the completion of
    the elimination and back substitution process, this must be reversed.
Complete pivoting has been proved to cause the roundoff error in Gaussian
elimination to propagate at a reasonably slow speed, compared with what can
happen when no pivoting is used. The theoretical results on the use of partial
pivoting are not quite as good, but in virtually all practical problems, the error
behavior is essentially the same as that for complete pivoting. Comparing
operation times, complete pivoting is the more expensive strategy, and thus,
partial pivoting is used in most practical algorithms. Henceforth we always mean
partial pivoting when we use the word pivoting. The entire question of roundoff
error propagation in Gaussian elimination has been analyzed very thoroughly by
J. H. Wilkinson [e.g., see Wilkinson (1965), pp. 209-220], and some of his results
are presented in Section 8.4.
Example We illustrate the effect of using pivoting by solving the system
    $$\begin{aligned}
        .729x + .81y + .9z &= .6867 \\
        x + y + z &= .8338 \\
        1.331x + 1.21y + 1.1z &= 1.000
    \end{aligned} \qquad (8.2.3)$$
The exact solution, rounded to four significant digits, is
    $$x = .2245 \qquad y = .2814 \qquad z = .3279 \qquad (8.2.4)$$
Floating-point decimal arithmetic, with four digits in the mantissa, will be
used to solve the linear system. The reason for using this arithmetic is to show the
effect of working with only a finite number of digits, while keeping the presenta-
tion manageable in size. The augmented matrix notation will be used to represent
the system (8.2.3), just as was done with the earlier example (8.1.4).
1. Solution without pivoting.
    $$\begin{bmatrix}
        .7290 & .8100 & .9000 & .6867 \\
        1.000 & 1.000 & 1.000 & .8338 \\
        1.331 & 1.210 & 1.100 & 1.000
    \end{bmatrix} \qquad m_{21} = 1.372, \quad m_{31} = 1.826$$
    $$\begin{bmatrix}
        .7290 & .8100 & .9000 & .6867 \\
        0.0 & -.1110 & -.2350 & -.1084 \\
        0.0 & -.2690 & -.5430 & -.2540
    \end{bmatrix} \qquad m_{32} = 2.423$$
    $$\begin{bmatrix}
        .7290 & .8100 & .9000 & .6867 \\
        0.0 & -.1110 & -.2350 & -.1084 \\
        0.0 & 0.0 & .02640 & .008700
    \end{bmatrix}$$
The solution is
    $$x = .2251 \qquad y = .2790 \qquad z = .3295 \qquad (8.2.5)$$
2. Solution with pivoting. To indicate the interchange of rows i and j, we will
use the notation $r_i \leftrightarrow r_j$.

The error in (8.2.5) is from seven to sixteen times larger than it is for (8.2.6),
depending on the component of the solution being considered. The results in
(8.2.6) have one more significant digit than do those of (8.2.5). This illustrates the
positive effect that the use of pivoting can have on the error behavior for
Gaussian elimination.
Pivoting changes the factorization result (8.1.5) given in Theorem 8.1. The
result is still true, but in a modified form. If the row interchanges induced by
pivoting were carried out on A before beginning elimination, then pivoting would
be unnecessary. Row interchanges on A can be represented by premultiplication
of A by an appropriate permutation matrix P, to get PA. Then Gaussian
elimination on PA leads to
    $$LU = PA \qquad (8.2.7)$$
where U is the upper triangular matrix obtained in the elimination process with
pivoting. The lower triangular matrix L can be constructed using the multipliers
from Gaussian elimination with pivoting. We omit the details, as the actual
construction is unimportant.
Example From the preceding example with pivoting, we form
    $$L = \begin{bmatrix} 1.000 & 0.0 & 0.0 \\ .5477 & 1.000 & 0.0 \\ .7513 & .6171 & 1.000 \end{bmatrix} \qquad
      U = \begin{bmatrix} 1.331 & 1.210 & 1.100 \\ 0.0 & .1473 & .2975 \\ 0.0 & 0.0 & -.01000 \end{bmatrix}$$
When multiplied,
    $$LU = \begin{bmatrix} 1.331 & 1.210 & 1.100 \\ .7289 & .8100 & .9000 \\ 1.000 & 1.000 & 1.000 \end{bmatrix} = PA \qquad
      P = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$$
The result PA is the matrix A with first, rows 1 and 3 interchanged, and then
rows 2 and 3 interchanged. This illustrates (8.2.7).
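The same factorization with pivoting is available in standard libraries. The sketch below (Python with SciPy, not from the text) factors the matrix of example (8.2.3); SciPy returns A = P L U, so its permutation matrix is the transpose of the P in (8.2.7).

    import numpy as np
    from scipy.linalg import lu

    A = np.array([[0.729, 0.81, 0.9],
                  [1.0,   1.0,  1.0],
                  [1.331, 1.21, 1.1]])

    P, L, U = lu(A)                     # A = P @ L @ U
    print(np.allclose(P.T @ A, L @ U))  # P.T plays the role of P in (8.2.7)
    print(np.round(L, 4))               # multipliers close to the 4-digit values above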
Scaling It has been observed empirically that if the elements of the coefficient
matrix A vary greatly in size, then it is likely that large loss of significance errors
will be introduced and the propagation of rounding errors will be worse. To
avoid this problem, we usually scale the matrix A so that the elements vary less.
This is usually done by multiplying the rows and columns by suitable constants.
The subject of scaling is not well understood currently, especially how to
guarantee that the effect of rounding errors in Gaussian elimination will be made
smaller by such scaling. Computational experience suggests that often all rows
should be scaled to make them approximately equal in magnitude. In addition,
all columns can be scaled to make them of about equal magnitude. The latter is
equivalent to scaling the unknown components X; of x, and it can often be
interpreted to say that the X; should be measured in units of comparable size.
There is no known a priori strategy for picking the scaling factors so as to
always decrease the effect of rounding error propagation, based solely on a
knowledge of A and b. Stewart (1977) is somewhat critical of the general use of
scaling as described in the preceding paragraph. He suggests choosing scaling
factors so as to obtain a rescaled matrix in which the errors in the coefficients are
of about equal magnitude. When rounding is the only source of error, this leads
to the strategy of scaling to make all elements of about equal size. The
LINPACK programs do not include scaling, but they recommend a strategy
along the lines indicated by Stewart (see Dongarra et al. (1979), pp. 17-Il2 for a
more extensive discussion; for other discussions of scaling, see Forsythe and
Moler (1967), chap. 11 and Golub and Van Loan (1983), pp. 72-74].
If we let B denote the result of row and column scaling in A, then
    $$B = D_1 A D_2$$
where $D_1$ and $D_2$ are diagonal matrices, with entries the scaling constants. To
solve Ax = b, observe that
    $$B(D_2^{-1}x) = D_1 A D_2 D_2^{-1} x = D_1 b$$
Thus we solve for x by solving
    $$Bz = D_1 b \qquad x = D_2 z \qquad (8.2.8)$$
The remaining discussion is restricted to row scaling, since some form of it is
fairly widely agreed upon.
Usually we attempt to choose the coefficients so as to have
    $$\max_{1 \le j \le n} |b_{ij}| \doteq 1 \qquad i = 1, \ldots, n \qquad (8.2.9)$$
where $B = [b_{ij}]$ is the result of scaling A. The most straightforward approach is
to define
    $$s_i = \max_{1 \le j \le n} |a_{ij}| \qquad
      b_{ij} = \frac{a_{ij}}{s_i} \qquad i, j = 1, \ldots, n \qquad (8.2.10)$$
But because this introduces an additional rounding error into each element of the
coefficient matrix, two other techniques are more widely used.
1. Scaling using the computer number base. Let β denote the base used in the
computer arithmetic, for example, β = 2 on binary machines. Let $r_i$ be the
smallest integer for which $\beta^{r_i} \ge s_i$. Define the scaled matrix B by
    $$b_{ij} = \frac{a_{ij}}{\beta^{r_i}} \qquad i, j = 1, \ldots, n \qquad (8.2.11)$$
No rounding is involved in defining $b_{ij}$, only a change in the exponent in the
floating-point form for $a_{ij}$. The values of B satisfy
    $$\beta^{-1} < \max_{1 \le j \le n} |b_{ij}| \le 1$$
and thus (8.2.9) is satisfied fairly well.
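A possible implementation of this scaling for a binary machine is sketched below (Python/NumPy, not from the text; the matrix and function name are illustrative). Dividing by an exact power of two changes only the floating-point exponent, so no rounding error is introduced.

    import numpy as np

    def scale_rows_base2(A):
        s = np.max(np.abs(A), axis=1)              # s_i as in (8.2.10)
        r = np.ceil(np.log2(s)).astype(int)        # smallest r_i with 2**r_i >= s_i
        return A / (2.0 ** r)[:, None], r

    A = np.array([[3.0, 40.0, 0.5],
                  [0.1,  0.2, 0.05]])
    B, r = scale_rows_base2(A)
    print(r)                                       # row exponents r_i
    print(np.max(np.abs(B), axis=1))               # each row maximum now lies in (1/2, 1]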
2. Implicit scaling. The use of scaling will generally change the choice of pivot
elements when pivoting is used with Gaussian elimination. And it is only with
such a change of pivot elements that the results in Gaussian elimination will be
changed. This is due to a result of F. Bauer [in Forsythe and Moler (1967), p. 38],
which states that if the scaling (8.2.11) is used, and if the choice of pivot elements
is forced to remain the same as when solving Ax = b, then the solution of (8.2.8)
will yield exactly the same computed value for x. Thus the only significance of
scaling is in the choice of the pivot elements.
    For implicit scaling, we continue to use the matrix A. But we choose the pivot
element in step k of the Gaussian elimination algorithm by defining
    $$c_k = \max_{k \le i \le n} \frac{\bigl|a_{ik}^{(k)}\bigr|}{s_i^{(k)}} \qquad (8.2.12)$$
replacing the definition (8.2.1) used in defining partial pivoting. Choose the
smallest index $i \ge k$ that yields $c_k$ in (8.2.12), and if $i \ne k$, then interchange
rows i and k. Also interchange $s_i^{(k)}$ and $s_k^{(k)}$, and denote the resulting new
values by $s_j^{(k+1)}$, j = 1, ..., n, most of which have not changed. Then proceed
with the elimination algorithm of Section 8.1, as before. This form of scaling
seems to be the form most commonly used in current published algorithms, if
scaling is being used.
An algorithm for Gaussian elimination We first give an algorithm, called Factor,
for the triangular factorization of a matrix A. It uses Gaussian elimination with
partial pivoting, combined with the implicit scaling of (8.2.10) and (8.2.12). We
then give a second algorithm, called Solve, for using the results of Factor to solve
a linear system Ax = b. The reason for separating the elimination procedure into
these two steps is that we will often want to solve several systems Ax = b, with
the same A, but different values for b.
Algorithm Factor (A, n, Pivot, det, ier)

    1.  Remarks: A is an n × n matrix, to be factored using the LU
        decomposition. Gaussian elimination is used, with partial pivot-
        ing and implicit scaling in the rows. Upon completion of the
        algorithm, the upper triangular matrix U will be stored in the
        upper triangular part of A; and the multipliers of (8.1.1), which
        make up L below its diagonal, will be stored in the correspond-
        ing positions of A. The vector Pivot will contain a record of all
        row interchanges. If Pivot(k) = k, then no interchange was used
        in step k of the elimination process. But if Pivot(k) = i ≠ k,
        then rows i and k were interchanged in step k of the elimination
        process. The variable det will contain det(A) on exit.
            The variable ier is an error indicator. If ier = 0, then the
        routine was completed satisfactorily. But for ier = 1, the matrix
        A was singular, in the sense that all possible pivot elements were
        zero at some step of the elimination process. In this case, all
        computation ceased, and the routine was exited. No attempt is
        made to check on the accuracy of the computed decomposition
        of A, and it can be nearly singular without being detected.
    2.  det := 1
    3.  $s_i := \max_{1 \le j \le n} |a_{ij}|$,  i = 1, ..., n
    4.  Do through step 16 for k = 1, ..., n - 1.
    5.      $c_k := \max_{k \le i \le n} |a_{ik}|/s_i$
    6.      Let $i_0$ be the smallest index $i \ge k$ for which the maximum in
            step 5 is attained. Pivot(k) := $i_0$.
    7.      If $c_k$ = 0, then ier := 1, det := 0, and exit from the algorithm.
    8.      If $i_0$ = k, then go to step 11.
    9.      det := -det
    10.     Interchange $a_{kj}$ and $a_{i_0 j}$, j = k, ..., n. Interchange $s_k$ and $s_{i_0}$.
    11.     Do through step 14 for i = k + 1, ..., n.
    12.         $a_{ik} := a_{ik}/a_{kk}$
    13.         $a_{ij} := a_{ij} - a_{ik} a_{kj}$,  j = k + 1, ..., n
    14.     End loop on i.
    15.     det := $a_{kk}$ · det
    16. End loop on k.
    17. det := $a_{nn}$ · det; ier := 0 and exit the algorithm.
Algorithm Solve (A, n, b, Pivot)

    1.  Remarks: This algorithm will solve the linear system Ax = b.
        It is assumed that the original matrix A has been factored
        using the algorithm Factor, with the row interchanges recorded
        in Pivot. The solution will be stored in b on exit. The matrix A
        and vector Pivot are left unchanged.
    2.  Do through step 5 for k = 1, 2, ..., n - 1.
    3.      If i := Pivot(k) ≠ k, then interchange $b_i$ and $b_k$.
    4.      $b_i := b_i - a_{ik} b_k$,  i = k + 1, ..., n
    5.  End loop on k.
    6.  $b_n := b_n / a_{nn}$
    7.  Do through step 9 for i = n - 1, ..., 1.
    8.      $b_i := \left(b_i - \sum_{j=i+1}^{n} a_{ij} b_j\right) / a_{ii}$
    9.  End loop on i.
    10. Exit from algorithm.
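A compact Python rendering of the two algorithms is sketched below (it is not the text's Factor/Solve routines, and the variable names are chosen here for illustration); it factors in place with scaled partial pivoting and then solves, following the same steps.

    import numpy as np

    def factor(A):
        A = A.astype(float).copy()
        n = A.shape[0]
        s = np.max(np.abs(A), axis=1)                        # row scale factors (step 3)
        piv = np.arange(n)
        for k in range(n - 1):
            i0 = k + int(np.argmax(np.abs(A[k:, k]) / s[k:]))   # scaled pivot choice (steps 5-6)
            if i0 != k:                                      # row interchange (steps 8-10)
                A[[k, i0]] = A[[i0, k]]
                s[[k, i0]] = s[[i0, k]]
                piv[k] = i0
            for i in range(k + 1, n):                        # elimination (steps 11-14)
                A[i, k] /= A[k, k]                           # multiplier stored in place
                A[i, k + 1:] -= A[i, k] * A[k, k + 1:]
        return A, piv

    def solve(LU, piv, b):
        b = b.astype(float).copy()
        n = len(b)
        for k in range(n - 1):                               # forward pass with interchanges
            if piv[k] != k:
                b[k], b[piv[k]] = b[piv[k]], b[k]
            b[k + 1:] -= LU[k + 1:, k] * b[k]
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):                       # back substitution
            x[i] = (b[i] - LU[i, i + 1:] @ x[i + 1:]) / LU[i, i]
        return x

    A = np.array([[0.729, 0.81, 0.9], [1.0, 1.0, 1.0], [1.331, 1.21, 1.1]])
    b = np.array([0.6867, 0.8338, 1.0])
    LU, piv = factor(A)
    print(solve(LU, piv, b))     # close to the exact solution (.2245, .2814, .3279)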
The earlier example (8.2.3) will serve as an illustration. The use of implicit
scaling in this case will not require a change in the choice of pivot elements with
partial pivoting. Algorithms similar to Factor and Solve are given in a number of
references [see Forsythe and Moler (1967), chaps. 16 and 17 and Dongarra et al.
(1979), chap. 1, for improved versions of these algorithms]. The programs in
LINPACK will also compute information concerning the condition or stability of
the problem Ax = b and the accuracy of the computed solution.
    An important aspect of the LINPACK programs is the use of Basic Linear
Algebra Subroutines (BLAS). These perform simple operations on vectors, such as
forming the dot product of two vectors or adding a scalar multiple of one vector
to another vector. The programs in LINPACK use these BLAS to replace many
of the inner loops in a method. The BLAS can be optimized, if desired, for each
computer; thus, the performance of the main LINPACK programs can also be
easily improved while keeping the main source code machine-independent. For a
more complete discussion of BLAS, see Lawson et al. (1979).
8.3 Variants of Gaussian Elimination
There are many variants of Gaussian elimination. Some are modifications or
simplifications, based on the special properties of some class of matrices, for
example, symmetric, positive definite matrices. Other variants are ways to rewrite
Gaussian elimination in a more compact form, sometimes in order to use special
techniques to reduce the error. We consider only a few such variants, and later
make reference to others.
Gauss-Jordan method This procedure is much the same as regular elimination
including the possible use of pivoting and scaling. It differs in eliminating the
unknown in equations above the diagonal as well as below it. In step k of the
elimination algorithm choose the pivot element as before. Then define
    a_{kj}^{(k+1)} = a_{kj}^{(k)} / a_{kk}^{(k)},      b_k^{(k+1)} = b_k^{(k)} / a_{kk}^{(k)},      j = k, ..., n

Eliminate the unknown x_k in equations both above and below equation k.
Define

    a_{ij}^{(k+1)} = a_{ij}^{(k)} − a_{ik}^{(k)} a_{kj}^{(k+1)},      b_i^{(k+1)} = b_i^{(k)} − a_{ik}^{(k)} b_k^{(k+1)}            (8.3.1)

for j = k, ..., n, i = 1, ..., n, i ≠ k. The Gauss-Jordan method is equivalent to
the use of the reduced row-echelon form of linear algebra texts [for example, see
Anton (1984), pp. 8-9].
This procedure will convert the augmented matrix [A | b] to [I | b^{(n)}], so that at
the completion of the preceding elimination, x = b^{(n)}. To solve Ax = b by this
technique requires

    n³/2 + O(n²)                                                          (8.3.2)

multiplications and divisions. This is 50 percent more than the regular elimina-
tion method; consequently, the Gauss-Jordan method should usually not be used
for solving linear systems. However, it can be used to produce a matrix inversion
program that uses a minimum of storage. By taking special advantage of the
special structure of the right side in AX = I, the Gauss-Jordan method can
produce the solution X = A^{-1} using only n extra storage locations, rather than
the normal n² extra storage locations. Partial pivoting and implicit scaling can
still be used.
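As a sketch only (not the minimal-storage inversion program just described), the following Python function applies the Gauss-Jordan elimination (8.3.1), with partial pivoting, to the augmented matrix [A | b]; the name gauss_jordan_solve is chosen here purely for illustration.

    import numpy as np

    def gauss_jordan_solve(A, b):
        # Reduce the augmented matrix [A | b] to [I | x], eliminating the
        # unknown in the equations above the diagonal as well as below it.
        M = np.hstack([A.astype(float), b.astype(float).reshape(-1, 1)])
        n = A.shape[0]
        for k in range(n):
            p = k + int(np.argmax(np.abs(M[k:, k])))   # partial pivoting
            if M[p, k] == 0.0:
                raise ValueError("matrix is singular")
            M[[k, p]] = M[[p, k]]
            M[k, k:] /= M[k, k]                        # normalize pivot row
            for i in range(n):
                if i != k:
                    M[i, k:] -= M[i, k] * M[k, k:]     # eliminate above and below
        return M[:, -1]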
Compact methods It is possible to move directly from a matrix A to its LU
decomposition, and this can be combined with partial pivoting and scaling. If we
disregard the possibility of pivoting for the moment, then the result
    A = LU                                                                (8.3.3)

leads directly to a set of recursive formulas for the elements of L and U.
There is some nonuniqueness in the choice of L and U, if we insist only that L
and U be lower and upper triangular, respectively. If A is nonsingular, and if we
have two decompositions

    A = L_1 U_1 = L_2 U_2                                                 (8.3.4)

then

    L_2^{-1} L_1 = U_2 U_1^{-1}                                           (8.3.5)

The inverse and the products of lower triangular matrices are again lower
triangular, and similarly for upper triangular matrices. The left and right sides of
(8.3.5) are lower and upper triangular, respectively. Thus they must equal a
diagonal matrix, call it D, and

    L_1 = L_2 D,        U_2 = D U_1                                       (8.3.6)
The choice of D is tied directly to the choice of the diagonal elements of either L
or U, and once they have been chosen, D is uniquely determined.
If the diagonal elements of L are all required to equal 1, then the resulting
decomposition A = LU is that given by Gaussian elimination, as in Section 8.1.
The associated compact method gives explicit formulas for l_ij and u_ij, and it is
known as Doolittle's method. If we choose to have the diagonal elements of U all
equal 1, the associated compact method for calculating A = LU is called Crout's
method. There is only a multiplying diagonal matrix to distinguish it from
Doolittle's method. For an algorithm using Crout's method for the factorization
(8.3.3), with partial pivoting and implicit scaling, see the program unsymdet
in Wilkinson and Reinsch (1971, pp. 93-110). In some situations, Crout's method
has advantages over the usual Doolittle method.
The principal advantage of the compact formulas is that the elements l_ij and
u_ij all involve inner products, as illustrated below in formulas (8.3.14)-(8.3.15) for
the factorization of a symmetric positive definite matrix. These inner products
can be accumulated using double precision arithmetic, possibly including a
concluding division, and then be rounded to single precision. This way of
computing inner products was discussed in Chapter 1 preceding the error
formula (1.5.19). This limited use of double precision can greatly increase the
accuracy of the factors L and U, and it is not possible to do this with the regular
elimination method unless all operations and storage are done in double preci-
sion [for a complete discussion of these compact methods, see Wilkinson (1965),
pp. 221-228, and Golub and Van Loan (1983), sec. 5.1].
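A minimal Python sketch of the Doolittle compact method, assuming no pivoting is needed, is given below; each entry of L and U is formed from a single inner product, which is where extended-precision accumulation would be inserted.

    import numpy as np

    def doolittle(A):
        # Compact (Doolittle) factorization A = LU with unit diagonal in L.
        # Each entry is one inner product; no pivoting in this simple sketch.
        A = A.astype(float)
        n = A.shape[0]
        L = np.eye(n)
        U = np.zeros((n, n))
        for k in range(n):
            for j in range(k, n):                       # row k of U
                U[k, j] = A[k, j] - L[k, :k] @ U[:k, j]
            for i in range(k + 1, n):                   # column k of L
                L[i, k] = (A[i, k] - L[i, :k] @ U[:k, k]) / U[k, k]
        return L, U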
The Cholesky method Let A be a symmetric and positive definite matrix of
order n. The matrix A is positive definite if
    (Ax, x) = Σ_{i=1}^{n} Σ_{j=1}^{n} a_ij x_i x_j > 0                    (8.3.7)

for all x ∈ Rⁿ, x ≠ 0. Some of the properties of positive definite matrices are
given in Problem 14 of Chapter 7 and Problems 9, 11, and 12 of this chapter.
Symmetric positive definite matrices occur in a wide variety of applications.
For such a matrix A, there is a very convenient factorization, and it can be
carried through without any need for pivoting or scaling. This is called Cholesky's
method, and it states that we can find a lower triangular real matrix L such that
    A = LLᵀ                                                               (8.3.8)

The method requires only n(n + 1)/2 storage locations for L, rather than the
usual n² locations, and the number of operations is about n³/6, rather than the
number n³/3 required for the usual decomposition.
To prove that (8.3.8) is possible, we give a derivation of L based on induction.
Assume the result is true for all positive definite symmetric matrices of order
≤ n − 1. We show it is true for all such matrices A of order n. Write the desired
L, of order n, in the form

    L = ( L̃    0 )
        ( yᵀ   x )

with L̃ a square matrix of order n − 1, y ∈ R^{n−1}, and x a scalar. The L is to be
chosen to satisfy A = LLᵀ:

    ( L̃    0 ) ( L̃ᵀ   y )          ( Ã    c )
    ( yᵀ   x ) ( 0    x )  =  A  =  ( cᵀ   d )                            (8.3.9)

with Ã of order n − 1, c ∈ R^{n−1}, and d = a_{nn} real. Since (8.3.7) is true for A, let
x_n = 0 in it to obtain the analogous statement for Ã, showing Ã is also positive
definite and symmetric. In addition, d > 0, by letting x_1 = ··· = x_{n−1} = 0,
x_n = 1 in (8.3.7). Multiplying in (8.3.9), choose L̃, by the induction hypothesis, to
satisfy

    L̃L̃ᵀ = Ã                                                              (8.3.10)

Then choose y by solving

    L̃y = c                                                               (8.3.11)

since L̃ is nonsingular, because det(Ã) = [det(L̃)]² ≠ 0. Finally, x must satisfy

    yᵀy + x² = d                                                          (8.3.12)

To see that x² must be positive, form the determinant of both sides in (8.3.9),
obtaining

    det(A) = x² [det(L̃)]²                                                (8.3.13)

Since det(A) is the product of the eigenvalues of A, and since all eigenvalues of
positive definite symmetric matrices are positive (see Problem 14 of Chapter 7), det(A) is
positive. Also, by the induction hypothesis, L̃ is real. Thus x² is positive in
(8.3.13), and we let x be its positive square root. Since the result (8.3.8) is trivially
true for matrices of order n = 1, this completes the proof of the factorization
(8.3.8). For another approach, see Golub and Van Loan (1983, sec. 5.2).
A practical construction of L can be based on (8.3.9)-(8.3.12), but we give one
based on directly finding the elements of L. Let L = [l_ij], with l_ij = 0 for j > i.
Begin the construction of L by multiplying the first row of L times the first
column of Lᵀ to get

    l_{11}² = a_{11}

Because A is positive definite, a_{11} > 0, and l_{11} = √a_{11}. Multiply the second row
of L times the first two columns of Lᵀ to get

    l_{21} l_{11} = a_{21},        l_{21}² + l_{22}² = a_{22}

Again, we can solve for the unknowns l_{21} and l_{22}.
In general, for i = 1, 2, ..., n,

    l_ij = [ a_ij − Σ_{k=1}^{j−1} l_ik l_jk ] / l_jj,      j = 1, ..., i − 1        (8.3.14)

    l_ii = [ a_ii − Σ_{k=1}^{i−1} l_ik² ]^{1/2}                                     (8.3.15)
The argument in this square root is the term x² in the earlier derivation (8.3.12),
and l_ii is real and positive. For programs implementing Cholesky's method, see
Dongarra et al. (1979, chap. 3) and Wilkinson and Reinsch (1971, pp. 10-30).
Note the inner products in (8.3.14) and (8.3.15). These can be accumulated in
double precision, minimizing the number of rounding errors, and the elements l_ij
will be in error by much less than if they had been calculated using only single
precision arithmetic. Also note that the elements of L remain bounded relative to
A, since (8.3.15) yields a bound for the elements of row i, using

    l_{i1}² + l_{i2}² + ··· + l_{ii}² = a_{ii}                            (8.3.16)
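The formulas (8.3.14)-(8.3.15) translate directly into code. The following Python sketch (the name cholesky is illustrative, and the accumulation is done here in ordinary double precision) builds L row by row.

    import numpy as np

    def cholesky(A):
        # Cholesky factorization A = L L^T of a symmetric positive definite A,
        # following (8.3.14)-(8.3.15).
        A = A.astype(float)
        n = A.shape[0]
        L = np.zeros((n, n))
        for i in range(n):
            for j in range(i):
                L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]   # (8.3.14)
            d = A[i, i] - L[i, :i] @ L[i, :i]                          # (8.3.15)
            if d <= 0.0:
                raise ValueError("matrix is not positive definite")
            L[i, i] = np.sqrt(d)
        return L

Applied to the Hilbert matrix of order three in the example below, this sketch reproduces the factor L shown there, up to rounding error.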
The square roots in (8.3.15) of Cholesky's method can be avoided by using a
slight modification of (8.3.8). Find a diagonal matrix D and a lower triangular
matrix L, with 1s on the diagonal, such that

    A = LDLᵀ                                                              (8.3.17)

This factorization can be done with about the same number of operations as
Cholesky's method, about n³/6, with no square roots. For further discussion and a
program, see Wilkinson and Reinsch (1971, pp. 10-30).
Example Consider the Hilbert matrix of order three,

    A = ( 1     1/2   1/3 )
        ( 1/2   1/3   1/4 )                                               (8.3.18)
        ( 1/3   1/4   1/5 )

For the Cholesky decomposition,

    L = ( 1      0          0       )
        ( 1/2    1/(2√3)    0       )
        ( 1/3    1/(2√3)    1/(6√5) )

and for (8.3.17),

    L = ( 1     0   0 )          D = ( 1    0      0     )
        ( 1/2   1   0 )              ( 0    1/12   0     )
        ( 1/3   1   1 )              ( 0    0      1/180 )
For many linear systems in applications, the coefficient matrix A is banded,
which means

    a_ij = 0      if |i − j| > m                                          (8.3.19)

for some small m > 0. The preceding algorithms simplify in this case, with a
considerable savings in computation time. For such algorithms when A is
symmetric and positive definite, see the LINPACK programs in Dongarra et al.
(1979, chap. 4). We next describe an algorithm in the case m = 1 in (8.3.19).
Tridiagonal systems The matrix A = [a_ij] is tridiagonal if

    a_ij = 0      for |i − j| > 1                                         (8.3.20)

This gives the form

    A = ( a_1   c_1                          )
        ( b_2   a_2   c_2                    )
        (       b_3   a_3   c_3              )                            (8.3.21)
        (             ...   ...   ...        )
        (                   b_n   a_n        )
Tridiagonal matrices occur in a variety of applications. Recall the linear system
(3.7.22) for spline functions in Section 3.7 of Chapter 3. In addition, many
numerical methods for solving boundary value problems for ordinary and partial
differential equations involve the solution of tridiagonal systems. Virtually all of
these applications yield tridiagonal matrices for which the LU factorization can
be formed without pivoting, and for which there is no large increase in error as a
consequence. The precise assumptions on A are given below in Theorem 8.2.
By considering the factorization A = LU without pivoting, we find that most
elements of L and U will be zero. And we are led to the following general
formula for the decomposition:

    A = LU = ( α_1                     ) ( 1   γ_1                 )
             ( b_2   α_2               ) (     1     γ_2           )
             (       b_3   α_3         ) (           ...   γ_{n−1} )
             (             ...         ) (                 1       )
             (            b_n   α_n    )

We can multiply to obtain a way to recursively compute {α_i} and {γ_i}:

    a_1 = α_1,        b_i γ_{i−1} + α_i = a_i,      i = 2, ..., n         (8.3.22)
    α_i γ_i = c_i,      i = 1, 2, ..., n − 1

These can be solved to give

    α_1 = a_1,      γ_i = c_i / α_i,      α_{i+1} = a_{i+1} − b_{i+1} γ_i,      i = 1, 2, ..., n − 1       (8.3.23)

To solve LUx = f, let Ux = z and Lz = f. Then

    z_1 = f_1 / α_1,        z_i = (f_i − b_i z_{i−1}) / α_i,      i = 2, 3, ..., n
                                                                          (8.3.24)
    x_n = z_n,        x_i = z_i − γ_i x_{i+1},      i = n − 1, n − 2, ..., 1
The constants in (8.3.23) can be stored for later use in solving the linear system
Ax = f, for as many right sides f as desired.
Counting only multiplications and divisions, the number of operations to
calculate L and U is 2n − 2; to solve Ax = f takes an additional 3n − 2
operations. Thus we need only 5n − 4 operations to solve Ax = f the first time,
and for each additional right side, with the same A, we need only 3n − 2
operations. This is extremely rapid. To illustrate this, note that A^{-1} generally is
dense and has mostly nonzero entries; thus, the calculation of x = A^{-1}f will
require n² operations. In many applications n may be larger than 1000, and thus
there is a significant savings in using (8.3.23)-(8.3.24) as compared with other
methods of solution.
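A Python sketch of the tridiagonal factorization (8.3.23) and the solution process (8.3.24) follows; the array layout (a for the diagonal, b for the subdiagonal with b[0] unused, c for the superdiagonal) is an assumption made here for the illustration.

    import numpy as np

    def tridiag_factor(a, b, c):
        # Factor the tridiagonal matrix (8.3.21) as A = LU using (8.3.23).
        n = len(a)
        alpha = np.zeros(n)
        gamma = np.zeros(n - 1)
        alpha[0] = a[0]
        for i in range(1, n):
            gamma[i - 1] = c[i - 1] / alpha[i - 1]
            alpha[i] = a[i] - b[i] * gamma[i - 1]
        return alpha, gamma

    def tridiag_solve(alpha, gamma, b, f):
        # Solve LUx = f by forward and back substitution, as in (8.3.24).
        n = len(alpha)
        z = np.zeros(n)
        z[0] = f[0] / alpha[0]
        for i in range(1, n):
            z[i] = (f[i] - b[i] * z[i - 1]) / alpha[i]
        x = np.zeros(n)
        x[n - 1] = z[n - 1]
        for i in range(n - 2, -1, -1):
            x[i] = z[i] - gamma[i] * x[i + 1]
        return x

The stored arrays alpha and gamma play the role of the constants in (8.3.23) and can be reused for additional right sides f.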
To justify the preceding decomposition of A, especially to show that all the
coefficients α_i ≠ 0, we have the following theorem.
Theorem 8.2    Assume the coefficients {a_i, b_i, c_i} of (8.3.21) satisfy the following
               conditions:

               1. |a_1| > |c_1| > 0
               2. |a_i| ≥ |b_i| + |c_i|,   b_i, c_i ≠ 0,   i = 2, ..., n − 1
               3. |a_n| ≥ |b_n| > 0

               Then A is nonsingular, and

                   |γ_i| < 1,      i = 1, ..., n − 1
                   0 < |a_i| − |b_i| ≤ |α_i| ≤ |a_i| + |b_i|,      i = 2, ..., n

Proof    For the proof, see Isaacson and Keller (1966, p. 57). Note that the last
         bound, together with condition 2, shows |α_i| ≥ |c_i| > 0. Thus the coefficients of L
         and U remain bounded, and no divisors are used that are almost zero
         except for rounding error.
The condition that b_i, c_i ≠ 0 is not essential. For example, if some
b_i = 0, then the linear system can be broken into two new systems, one
of order i − 1 and the other of order n − i + 1. For example, if n = 4 and b_3 = 0,
then solve Ax = f by reducing it to the following two linear systems,

    ( a_3   c_3 ) ( x_3 )   ( f_3 )          ( a_1   c_1 ) ( x_1 )   ( f_1 )
    ( b_4   a_4 ) ( x_4 ) = ( f_4 )          ( b_2   a_2 ) ( x_2 ) = ( f_2 − c_2 x_3 )

This completes the proof.
Example Consider the coefficient matrix for spline interpolation, in (3.7.22) of
Chapter 3. Consider h_i = constant in that matrix, and then factor h/6 from
every row. Restricting our interest to the matrix of order four, the resulting
matrix is

    A = ( 2   1   0   0 )
        ( 1   4   1   0 )
        ( 0   1   4   1 )
        ( 0   0   1   2 )

Using the method (8.3.23), this has the LU factorization

    L = ( 2   0     0      0     )          U = ( 1   1/2   0     0    )
        ( 1   7/2   0      0     )              ( 0   1     2/7   0    )
        ( 0   1     26/7   0     )              ( 0   0     1     7/26 )
        ( 0   0     1      45/26 )              ( 0   0     0     1    )

This completes the example. And it should indicate that the solution of the cubic
spline interpolation problem, described in Section 3.7 of Chapter 3, is not
difficult to compute.
8.4 Error Analysis
We begin the error analysis of methods for solving Ax = b by examining the
stability of the solution x relative to small perturbations in the right side b. We
will follow the general schemata of Section 1.6 of Chapter 1, and in particular, we
will study the condition number of (1.6.6).
Let Ax = b, of order n, be uniquely solvable, and consider the solution of the
perturbed problem

    Ax̂ = b + r                                                           (8.4.1)

Let e = x̂ − x, and subtract Ax = b to get

    Ae = r                                                               (8.4.2)

To examine the stability of Ax = b as in (1.6.6), we want to bound the quantity

    ( ||e|| / ||x|| ) / ( ||r|| / ||b|| )                                (8.4.3)

as r ranges over all elements of Rⁿ that are small relative to b.
From (8.4.2), take norms to obtain

    ||r|| ≤ ||A|| ||e||,        ||e|| ≤ ||A^{-1}|| ||r||

Divide by ||A|| ||x|| in the first inequality and by ||x|| in the second one to obtain

    ||r|| / (||A|| ||x||)  ≤  ||e|| / ||x||  ≤  ||A^{-1}|| ||r|| / ||x||

The matrix norm is the operator matrix norm induced by the vector norm. Using
the bounds

    ||b|| ≤ ||A|| ||x||,        ||x|| ≤ ||A^{-1}|| ||b||

we obtain

    (1 / (||A|| ||A^{-1}||)) · ||r|| / ||b||  ≤  ||e|| / ||x||  ≤  ||A|| ||A^{-1}|| · ||r|| / ||b||          (8.4.4)
Recalling (8.4.3), this result is justification for introducing the condition number
of A:
    cond(A) = ||A|| ||A^{-1}||                                            (8.4.5)
For each given A, there are choices of b and r for which either of the inequalities
in (8.4.4) can be made an equality. This is a further reason for introducing
cond(A) when considering (8.4.3). We leave the proof to Problem 20.
The quantity cond (A) will vary with the norm being used, but it is always
bounded below by one, since

    1 = ||I|| = ||AA^{-1}|| ≤ ||A|| ||A^{-1}|| = cond(A)

If the condition number is nearly 1, then we see from (8.4.4) that small relative
perturbations in b will lead to similarly small relative perturbations in the
solution x. But if cond(A) is large, then (8.4.4) suggests that there may be small
relative perturbations of b that will lead to large relative perturbations in x.
Because (8.4.5) will vary with the choice of norm, we sometimes use another
definition of condition number, one independent of the norm. From Theorem 7.8
of Chapter 7,

    ||A|| ≥ Max_{λ∈σ(A)} |λ|

Since the eigenvalues of A^{-1} are the reciprocals of those of A, we have the result

    cond(A)  ≥  Max_{λ∈σ(A)} |λ| / Min_{λ∈σ(A)} |λ|  ≡  cond(A)*          (8.4.6)

in which σ(A) denotes the set of all eigenvalues of A.

Example Consider the linear system

    7x_1 + 10x_2 = b_1
    5x_1 +  7x_2 = b_2                                                    (8.4.7)

For the coefficient matrix,

    A = ( 7   10 )          A^{-1} = ( −7    10 )
        ( 5    7 )                   (  5    −7 )
Let the condition number in (8.4.5) be denoted by cond(A)_p when it is generated
using the matrix norm || ||_p. For this example,

    cond(A)_1 = cond(A)_∞ = (17)(17) = 289,      cond(A)_2 ≈ 223,      cond(A)* ≈ 198
These condition numbers all suggest that (8.4.7) may be sensitive to changes in
the right side b. To illustrate this possibility, consider the particular case
    7x_1 + 10x_2 = 1
    5x_1 +  7x_2 = .7

which has the solution

    x_1 = 0,        x_2 = .1

For the perturbed system, solve

    7x̂_1 + 10x̂_2 = 1.01
    5x̂_1 +  7x̂_2 = .69

It has the solution

    x̂_1 = −.17,        x̂_2 = .22
The relative changes in x are quite large when compared with the size of the
relative changes in the right side b.
A linear system whose solution x is unstable with respect to small relative
changes in the right side b is called ill-conditioned. The preceding system (8.4.7) is
somewhat ill-conditioned, especially if only three or four decimal digit floating-
point arithmetic is used in solving it. The condition numbers cond (A) and
cond (A)* are fairly good indicators of ill-conditioning. As they increase by a
factor of 10, it is likely that one less digit of accuracy will be obtained in the
solution.
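These condition numbers are easy to reproduce numerically; for instance, the following short NumPy computation (an illustration, not part of the original example) returns 289 for the 1- and ∞-norm condition numbers of (8.4.7) and roughly 223 for the 2-norm value.

    import numpy as np

    A = np.array([[7.0, 10.0],
                  [5.0,  7.0]])
    print(np.linalg.cond(A, 1), np.linalg.cond(A, np.inf))   # both 289
    print(np.linalg.cond(A, 2))                              # about 223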
In general, if cond(A)* is large, then there will be values of b for which the
system Ax = b is quite sensitive to changes r in b. Let λ_l and λ_u denote
eigenvalues of A for which

    |λ_l| = Min_{λ∈σ(A)} |λ|,        |λ_u| = Max_{λ∈σ(A)} |λ|

and thus

    cond(A)* = |λ_u| / |λ_l|                                              (8.4.8)

Let x_l and x_u be corresponding eigenvectors, with ||x_l||_∞ = ||x_u||_∞ = 1. Then

    Ax = λ_u x_u

has the solution x = x_u. And the system

    Ax̂ = λ_u x_u + λ_l x_l

has the solution

    x̂ = x_u + x_l

If cond(A)* is large, then the right-hand side has only a small relative perturba-
tion,

    ||r||_∞ / ||b||_∞ = |λ_l| / |λ_u| = 1 / cond(A)*                      (8.4.9)

But for the solution, we have the much larger relative perturbation

    ||x̂ − x||_∞ / ||x||_∞ = ||x_l||_∞ / ||x_u||_∞ = 1                    (8.4.10)
There are systems that are not ill-conditioned in actual practice, but for which
the preceding condition numbers are quite large. For example, the matrix

    A = ( 1    0       )
        ( 0    10⁻¹⁰   )

has all condition numbers cond(A)_p and cond(A)* equal to 10¹⁰. But usually
the matrix is not considered ill-conditioned. The difficulty is in using norms to
measure changes in a vector, rather than looking at each component separately. If
scaling has been carried out on the coefficient matrix and unknown vector, then
this problem does not usually arise, and then the condition numbers are usually
an accurate predictor of ill-conditioning.
As a final justification for the use of cond (A) as a condition number, we give
the following result.
Theorem 8.3 (Gastinel) Let A be any nonsingular matrix of order n, and let
II II denote an operator matrix norm. Then
    1 / cond(A)  =  Min { ||A − B|| / ||A|| : B a singular matrix }        (8.4.11)
with cond (A) defined in (8.4.5).
Proof See Kahan (1966, p. 775).
The theorem states that A can be well approximated in a relative error sense
by a singular matrix B if and only if cond(A) is quite large. And from our view,
a singular matrix B is the ultimate in ill-conditioning. There are nonzero
perturbations of the solution, by the eigenvector for the eigenvalue λ = 0, which
correspond to a zero perturbation in the right side b. More importantly, there are
values of b for which Bx = b is no longer solvable.
The Hilbert matrix The Hilbert matrix of order n is defined by

    H_n = ( 1        1/2        1/3      ...    1/n        )
          ( 1/2      1/3        1/4      ...    1/(n+1)    )
          ( ...                                            )              (8.4.12)
          ( 1/n      1/(n+1)    ...             1/(2n−1)   )

This matrix occurs naturally in solving the continuous least squares approxima-
tion problem. Its derivation is given near the end of Section 4.3 of Chapter 4,
with the resulting linear system given in (4.3.14). As was indicated in Section 4.3
and illustrated following (1.6.9) in Section 1.4 of Chapter 1, the Hilbert matrix is
Table 8.2 Condition numbers of the Hilbert matrix

    n    cond(H_n)*         n    cond(H_n)*
    3    5.24E + 2          7    4.75E + 8
    4    1.55E + 4          8    1.53E + 10
    5    4.77E + 5          9    4.93E + 11
    6    1.50E + 7         10    1.60E + 13
very ill-conditioned, and increasingly so as n increases. As such, it has been a
favorite numerical example for checking programs for solving linear systems of
equations, to determine the limits of effectiveness of the program when dealing
with ill-conditioned problems. Table 8.2 gives the condition number cond(H_n)*
for a few values of n. The inverse matrix H_n^{-1} is known explicitly:

    (H_n^{-1})_{ij} = (−1)^{i+j} (n + i − 1)! (n + j − 1)! / { (i + j − 1) [(i − 1)!(j − 1)!]² (n − i)! (n − j)! },      1 ≤ i, j ≤ n        (8.4.13)
For additional information on H_n, including an asymptotic formula for
cond(H_n)*, see Gregory and Karney (1969, pp. 33-38, 66-73).
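For readers who wish to reproduce Table 8.2 approximately, the following NumPy sketch forms H_n and computes the eigenvalue ratio cond(H_n)*; for the larger values of n the computed eigenvalues are themselves affected by rounding, so only rough agreement should be expected.

    import numpy as np

    def hilbert(n):
        i = np.arange(1, n + 1)
        return 1.0 / (i[:, None] + i[None, :] - 1.0)   # H[i, j] = 1/(i + j - 1)

    for n in (3, 6, 10):
        lam = np.linalg.eigvalsh(hilbert(n))           # eigenvalues; H_n is symmetric
        print(n, lam.max() / lam.min())                # eigenvalue condition number cond(H_n)*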
Although widely used as an example, some care must be taken as to what is
the true answer. Let Ĥ_n denote the version of H_n after it is entered into the finite
arithmetic of a computer. For a matrix inversion program, the results of the
program should be compared with Ĥ_n^{-1}, not with H_n^{-1}; these two inverse
matrices can be quite different. For example, if we use four decimal digit
floating-point arithmetic with rounding, then

    Ĥ_3 = ( 1.000    .5000    .3333 )
          ( .5000    .3333    .2500 )                                     (8.4.14)
          ( .3333    .2500    .2000 )

Rounding has occurred only in expanding 1/3 in decimal fraction form. Then
    H_3^{-1} = (  9.000    -36.00     30.00 )
               ( -36.00     192.0    -180.0 )
               (  30.00    -180.0     180.0 )

    Ĥ_3^{-1} = (  9.062    -36.32     30.30 )
               ( -36.32     193.7    -181.6 )                             (8.4.15)
               (  30.30    -181.6     181.5 )

Any program for matrix inversion, when applied to Ĥ_3, should have its resulting
solution compared with Ĥ_3^{-1}, not H_3^{-1}. We return to this example later in
Section 8.5.
Error bounds We consider the effects of rounding error on the solution x to
Ax= b, obtained using Gaussian elimination. We begin by giving a result
bounding the error when b and A are changed by small amounts. This is a useful
result by itself, and it is necessary for the error analysis of Gaussian elimination
that follows later.
Theorem 8.4    Consider the system Ax = b, with A nonsingular. Let δA and δb
               be perturbations of A and b, and assume

                   ||A^{-1}|| ||δA|| < 1                                  (8.4.16)

               Then A + δA is nonsingular. And if we define δx implicitly by

                   (A + δA)(x + δx) = b + δb                              (8.4.17)

               then

                   ||δx|| / ||x||  ≤  [ cond(A) / (1 − cond(A) ||δA||/||A||) ] · { ||δA||/||A|| + ||δb||/||b|| }        (8.4.18)

Proof    First note that δA represents any matrix satisfying (8.4.16), not a
         constant δ times the matrix A, and similarly for δb and δx. Using
         (8.4.16), the nonsingularity of A + δA follows immediately from Theo-
         rem 7.12 of Chapter 7. From (7.4.11),

             ||(A + δA)^{-1}|| ≤ ||A^{-1}|| / (1 − ||A^{-1}|| ||δA||)     (8.4.19)

         Solving for δx in (8.4.17), and using Ax = b,

             (A + δA)δx + Ax + (δA)x = b + δb
             δx = (A + δA)^{-1}[δb − (δA)x]

         Using (8.4.19) and the definition (8.4.5) of cond(A),

             ||δx|| ≤ [ cond(A) / (1 − cond(A) ||δA||/||A||) ] · { ||δb||/||A|| + ||x|| ||δA||/||A|| }

         Divide by ||x|| on both sides, and use ||b|| ≤ ||A|| ||x|| to obtain (8.4.18).
The analysis of the effect of rounding errors on Gaussian elimination is due to
J. H. Wilkinson, and it can be found in Wilkinson (1963, pp. 94-99), (1965, pp.
209-216), Forsythe and Moler (1967, chap. 21), and Golub and Van Loan (1983,
chap. 4). Let x̂ denote the computed solution of Ax = b. It is very difficult to
compute directly the effects on x of rounding at each step, as a means of
obtaining a bound on ||x − x̂||. Rather, it is easier, although nontrivial, to take x̂
and the elimination algorithm and to work backwards to show that x̂ is the exact
solution of a system

    (A + δA)x̂ = b

in which bounds can be given for δA. This approach is known as backward error
analysis. We can then use the preceding Theorem 8.4 to bound ||x − x̂||. In the
following result, the matrix norm will be ||A||_∞, the row norm (7.3.17) induced by
the vector norm ||x||_∞.
Theorem 8.5 Let A be of order n and nonsingular, and assume partial or
complete pivoting is used in the Gaussian elimination process.
Define
    ρ = (1 / ||A||_∞) · Max_{1≤i,j,k≤n} |a_ij^{(k)}|                      (8.4.20)

Let u denote the unit round on the computer being used. [See
(1.2.11)-(1.2.12) for the definition of u.]

1. The matrices L and U computed using Gaussian elimination
   satisfy

       LU = A + E,        ||E||_∞ ≤ n² ρ ||A||_∞ u                        (8.4.21)

2. The approximate solution x̂ of Ax = b, computed using
   Gaussian elimination, satisfies

       (A + δA)x̂ = b                                                     (8.4.22)

   with

       ||δA||_∞ ≤ 1.01(n³ + 3n²) ρ ||A||_∞ u                              (8.4.23)

3. Using Theorem 8.4,

       ||x − x̂||_∞ / ||x||_∞  ≤  [ cond(A)_∞ / (1 − cond(A)_∞ ||δA||_∞ / ||A||_∞) ] · [ 1.01(n³ + 3n²) ρ u ]        (8.4.24)
Proof The proofs of (1) and (2) are given in Forsythe and Moler (1967, chap.
21). Variations on these results are given in Golub and Van Loan (1983,
chap. 4).
Empirically, the bound (8.4.23) is too large, due to cancellation of rounding
errors of varying magnitude and sign. According to Wilkinson (1963, p. 108), a
better empirical bound for most cases is

    ||δA||_∞ ≤ n ρ ||A||_∞ u                                              (8.4.25)
The result (8.4.24) shows the importance of the size of cond(A).
The quantity ρ in the bounds can be computed during the elimination process,
and it can also be bounded a priori. For complete pivoting, an a priori bound is

    ρ ≤ 1.8 n^{(ln n)/4},      n ≥ 1

and it is conjectured that ρ ≤ cn for some c. For partial pivoting, an a priori
bound is 2^{n−1}, and pathological examples are known for which this is possible.
Nonetheless, in all empirical studies to date, ρ has been bounded by a relatively
small number, independent of n. Because of the differing theoretical bounds for
ρ, complete pivoting is sometimes considered superior. In actual practice, how-
ever, the error behavior with partial pivoting is as good as with complete
pivoting. Moreover, complete pivoting requires many more comparisons at each
step of the elimination process. Consequently, partial pivoting is the approach
used in all modern Gaussian elimination codes.
One of the most important consequences of the preceding analysis is to show
that Gaussian elimination is a very stable process, provided only that the matrix
A is not badly ill-conditioned. Historically, researchers in the early 1950s were
uncertain as to the stability of Gaussian elimination for larger systems, for
example, n ≥ 10, but that question has now been settled.
The size of the residual in the computed solution x̂, namely

    r = b − Ax̂                                                           (8.4.26)

is sometimes linked, mistakenly, to the size of the error x − x̂. In fact, the error
in x̂ can be large even though r is small, and this is usually the case with
ill-conditioned problems. From (8.4.26) and Ax = b,

    r = A(x − x̂)                                                         (8.4.27)

and thus x − x̂ can be much larger than r if A^{-1} has large elements.
In practice, the residual r is quite small, even for ill-conditioned problems. To
suggest why this should happen, use (8.4.22) to obtain

    r = (δA)x̂,        ||r||_∞ ≤ ||δA||_∞ ||x̂||_∞

    ||r||_∞ / (||A||_∞ ||x̂||_∞)  ≤  ||δA||_∞ / ||A||_∞                   (8.4.28)
The bounds for ||δA||_∞ / ||A||_∞ in (8.4.23) or (8.4.25) are independent of the
conditioning of the problem. Thus ||r||_∞ will generally be small relative to
||A||_∞ ||x̂||_∞. The latter is often close to ||b|| or is of the same magnitude, since
b = Ax, and then ||r|| will be small relative to ||b||. As a final note on the size of
the residual, there are some problems in which it is important only to have r be
small, without x − x̂ needing to be small. In such cases, ill-conditioning will not
have the same meaning.
The bounds (8.4.18) and (8.4.24) indicate the importance of cond(A) in
determining the error. Generally if cond(A) ≈ 10^m, for some m ≥ 0, then about m
digits of accuracy will be lost in computing x̂, relative to the number of digits in
the arithmetic being used. Thus measuring cond(A) = ||A|| ||A^{-1}|| is desirable.
The term ||A|| is easy and inexpensive to evaluate, and ||A^{-1}|| is the main
problem in computing cond(A). Calculating A^{-1} requires n³ operations, and
this is too expensive a way to compute ||A^{-1}||. A less expensive approach, using
O(n²) operations, was developed for the LINPACK package.
For any system Ay = d,

    ||y|| ≤ ||A^{-1}|| ||d||,        ||A^{-1}|| ≥ ||y|| / ||d||           (8.4.29)
We want to choose d to make this ratio as large as possible. Write A = LU, with
LU obtained in the Gaussian elimination. Then solving Ay = d is equivalent to
solving
    Lw = d,        Uy = w
While solving Lw = d, develop d to make w as large as possible, while retaining
||d||_∞ = 1. Then solve Uy = w for y. This will give a better bound in (8.4.29) than
a randomly chosen d. An algorithm for choosing d is given in Golub and
Van Loan (1983, p. 77). The algorithm in LINPACK is a more complicated
extension of the preceding. For a description see Golub and Van Loan (1983, p.
78) or Dongarra et al. (1979, pp. 1.12-1.13).
A posteriori error bounds We begin with error bounds for a computed inverse
C of a given matrix A. Define the residual matrix by

    R = I − CA

Theorem 8.6    If ||R|| < 1, then A and C are nonsingular, and

    ||R|| / (||A|| ||C||)  ≤  ||A^{-1} − C|| / ||C||  ≤  ||R|| / (1 − ||R||)        (8.4.30)
Proof    Since ||R|| < 1, I − R is nonsingular by Theorem 7.11 of Chapter 7, and

             ||(I − R)^{-1}|| ≤ 1 / (1 − ||R||)

         But

             I − R = CA                                                   (8.4.31)

             0 ≠ det(I − R) = det(CA) = det(C) det(A)

         and thus both det(C) and det(A) are nonzero. This shows that both A
         and C are nonsingular.
         For the lower bound in (8.4.30),

             R = I − CA = (A^{-1} − C)A,        ||R|| ≤ ||A^{-1} − C|| ||A||

         and dividing by ||A|| ||C|| proves the result. For the upper bound, (8.4.31)
         implies

             (I − R)^{-1} = A^{-1}C^{-1},        A^{-1} = (I − R)^{-1}C   (8.4.32)

         For the error in C,

             A^{-1} − C = (I − CA)A^{-1} = RA^{-1} = R(I − R)^{-1}C

             ||A^{-1} − C|| ≤ ||R|| ||C|| / (1 − ||R||)

         This completes the proof.
This result is generally of more theoretical than practical interest. Inverse
matrices should not be produced for solving a linear system, as was pointed out
earlier in Section 8.1. And as a consequence, there is seldom any real need for the
preceding type of error bound. The main exception is when C has been produced
as an approximation by means other than Gaussian elimination, often by some
theoretical derivation. Such approximate inverses are then used to solve Ax = b
by the residual correction procedure (8.5.3) described in the next section. In this
case, the bound (8.4.30) can furnish some useful information on C.
Corollary    Let A, C, and R be as given in Theorem 8.6. Let x̂ be an
             approximate solution to Ax = b, and define r = b − Ax̂. Then

                 ||x − x̂|| ≤ ||Cr|| / (1 − ||R||)                        (8.4.33)

Proof    From

             r = b − Ax̂ = Ax − Ax̂ = A(x − x̂)

             x − x̂ = A^{-1}r = (I − R)^{-1}Cr                            (8.4.34)

         with (8.4.32) used in the last equality. Taking norms, we obtain (8.4.33).
This bound (8.4.33) has been found to be quite accurate, especially when
compared with a number of other bounds that are commonly used. For a
complete discussion of computable error bounds, including a number of exam-
ples, see Aird and Lynch (1975).
The error bound (8.4.33) is relatively expensive to produce. If we suppose that
x̂ was obtained by Gaussian elimination, then about n³/3 operations were used
to calculate x̂ and the LU decomposition of A. To produce C ≈ A^{-1} by
elimination will take at least 2n³/3 additional operations, producing CA requires n³
multiplications, and producing Cr requires n². Thus the error bound requires at
least a fivefold increase in the number of operations. It is generally preferable to
estimate the error by solving approximately the error equation

    A(x − x̂) = r

using the LU decomposition stored earlier. This requires n² operations to
evaluate r, and an additional n² to solve the linear system. Unless the residual
matrix R = I − CA has norm nearly one, this approach will give a very reason-
able error estimate. This is pursued and illustrated in the next section.
8.5 The Residual Correction Method

We assume that Ax = b has been solved for an approximate solution x̂ = x^{(0)}.
Also the LU decomposition, along with a record of all row or column inter-
changes, should have been stored. Calculate

    r^{(0)} = b − Ax^{(0)}                                                (8.5.1)

Define e^{(0)} = x − x^{(0)}. Then as before in (8.4.34),

    Ae^{(0)} = r^{(0)}

Solve this system using the stored LU decomposition, and call the resulting
approximate solution ê^{(0)}. Define a new approximate solution to Ax = b by

    x^{(1)} = x^{(0)} + ê^{(0)}                                           (8.5.2)

The process can be repeated, calculating x^{(2)}, x^{(3)}, ..., to continually decrease the
error. To calculate r^{(0)} takes n² operations, and the calculation of ê^{(0)} takes an
additional n² operations. Thus the calculation of the improved values
x^{(1)}, x^{(2)}, ... is inexpensive compared with the calculation of the original value
x^{(0)}. This method is also known as iterative improvement or the residual correction
method.
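A Python sketch of iterative improvement is given below. The function lu_solve is a hypothetical routine that solves Ae = r with the stored LU factorization, and np.longdouble is used here only to suggest the higher precision residual computation discussed next; on some machines it offers no extra precision.

    import numpy as np

    def iterative_improvement(A, b, lu_solve, num_steps=3):
        # Residual correction (iterative improvement).  lu_solve(r) is assumed
        # to solve A e = r using the stored LU factorization of A.
        x = lu_solve(b)
        for _ in range(num_steps):
            # residual computed in (simulated) extended precision
            r = b.astype(np.longdouble) - A.astype(np.longdouble) @ x
            e = lu_solve(np.asarray(r, dtype=float))   # solve A e = r
            x = x + e
        return x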
It is extremely important to obtain accurate values for r^{(0)}. Since x^{(0)} ap-
proximately solves Ax = b, r^{(0)} will generally involve loss-of-significance errors
in its calculation, with Ax^{(0)} and b agreeing to almost the full precision of the
machine arithmetic. Thus to obtain accurate values for r^{(0)}, we must usually go to
higher precision arithmetic. If only regular arithmetic is used to calculate r^{(0)}, the
same arithmetic as used in calculating x^{(0)} and LU, then the resulting inaccuracy
in r^{(0)} will usually lead to ê^{(0)} being a poor approximation to e^{(0)}. In single
precision arithmetic, we calculate r^{(0)} in double precision. But if the calculations
are already in double precision, it is often hard to go to a higher precision
arithmetic.
Example Solve the system Ax = b, with A = Ĥ_3 from (8.4.14). The arithmetic
will be four decimal digit floating-point with rounding. For the right side, use

    b = [1, 0, 0]ᵀ

The true solution is the first column of Ĥ_3^{-1}, which from (8.4.15) is

    x = [9.062, −36.32, 30.30]ᵀ

to four significant digits.
Using elimination with partial pivoting,

    x^{(0)} = [8.968, −35.77, 29.77]ᵀ

The residual r^{(0)} is calculated with double precision arithmetic, and is then
rounded to four significant digits. The value obtained is

    r^{(0)} = [−.005341, −.004359, −.005344]ᵀ

Solving Ae^{(0)} = r^{(0)} with the stored LU decomposition,

    ê^{(0)} = [.09216, −.5442, .5239]ᵀ
    x^{(1)} = [9.060, −36.31, 30.29]ᵀ

Repeating these operations,

    r^{(1)} = [−.0006570, −.0003770, −.0001980]ᵀ
    ê^{(1)} = [.001707, −.01300, .01241]ᵀ
    x^{(2)} = [9.062, −36.32, 30.30]ᵀ

The vector x^{(2)} is accurate to four significant digits. Also, note that x^{(1)} − x^{(0)} =
ê^{(0)} is an accurate predictor of the error e^{(0)} in x^{(0)}.
Formulas can be developed to estimate how many iterates should be calcu-
lated in order to get essentially full accuracy in the solution x. For a discussion of
what is involved and for some algorithms implementing this method, see Dongarra
et al. (1979, p. 1.9), Forsythe and Moler (1967, chaps. 13, 16, 17), Golub and
Van Loan (1983, p. 74), and Wilkinson and Reinsch (1971, pp. 93-110).
Another residual correction method There are situations in which we can
calculate an approximate inverse C to the given matrix A. This is generally done
by carefully considering the structure of A, and then using a variety of approxi-
mation techniques to estimate A^{-1}. Without considering the origin of C, we show
how to use it to iteratively solve Ax = b.
Let x^{(0)} be an initial guess, and define r^{(0)} = b − Ax^{(0)}. As before,
A(x − x^{(0)}) = r^{(0)}. Define x^{(1)} implicitly by

    x^{(1)} − x^{(0)} = Cr^{(0)}

In general, define

    r^{(m)} = b − Ax^{(m)},        x^{(m+1)} = x^{(m)} + Cr^{(m)},        m = 0, 1, 2, ...        (8.5.3)

If C is a good approximation to A^{-1}, the iteration will converge rapidly, as
shown in the following analysis.
We first obtain a recursion formula for the error:

    x − x^{(m+1)} = x − x^{(m)} − Cr^{(m)} = x − x^{(m)} − C[b − Ax^{(m)}]
                  = x − x^{(m)} − C[Ax − Ax^{(m)}]

    x − x^{(m+1)} = (I − CA)(x − x^{(m)})                                 (8.5.4)

By induction,

    x − x^{(m)} = (I − CA)^m (x − x^{(0)}),        m ≥ 0                  (8.5.5)
If

    ||I − CA|| < 1                                                        (8.5.6)

for some matrix norm, then using the associated vector norm,

    ||x − x^{(m)}|| ≤ ||I − CA||^m ||x − x^{(0)}||                        (8.5.7)

And this converges to zero as m → ∞, for any choice of initial guess x^{(0)}. More
generally, (8.5.5) implies that x^{(m)} converges to x, for any choice of x^{(0)}, if and
only if

    (I − CA)^m → 0      as m → ∞
And by Theorem 7.9 of Chapter 7, this is equivalent to

    r_σ(I − CA) < 1                                                       (8.5.8)

for the spectral radius of I − CA. This may be possible to show, even when (8.5.6)
fails for the common matrix norms. Also note that

    I − AC = A(I − CA)A^{-1}

and thus I − AC and I − CA are similar matrices and have the same eigenval-
ues. If

    ||I − AC|| < 1                                                        (8.5.9)

then (8.5.8) is true, even if (8.5.6) is not true, and convergence will still occur.
Statement (8.5.4) shows that the rate of convergence of x^{(m)} to x is linear:

    ||x − x^{(m+1)}|| ≤ c ||x − x^{(m)}||,        m ≥ 0                   (8.5.10)

with c < 1 unknown. The constant c is often estimated computationally with

    c ≈ Max ||x^{(m+2)} − x^{(m+1)}|| / ||x^{(m+1)} − x^{(m)}||           (8.5.11)

with the maximum performed over some or all of the iterates that have been
computed. This is not rigorous, but is motivated by the formula

    x^{(m+2)} − x^{(m+1)} = (I − CA)(x^{(m+1)} − x^{(m)})                 (8.5.12)

To prove this, simply use (8.5.4), subtracting formulas for successive values of m.
If we assume (8.5.10) is valid for the iterates that we are calculating, and if we
have an estimate for c, then we can produce an error bound.

    ||x^{(m+1)} − x^{(m)}|| = ||[x − x^{(m)}] − [x − x^{(m+1)}]||
                            ≥ ||x − x^{(m)}|| − ||x − x^{(m+1)}||
                            ≥ ||x − x^{(m)}|| − c ||x − x^{(m)}||

    ||x − x^{(m)}|| ≤ (1 / (1 − c)) ||x^{(m+1)} − x^{(m)}||               (8.5.13)
For slowly convergent iterates [with c ≈ 1], this bound is important, since
||x^{(m+1)} − x^{(m)}|| can then be much smaller than ||x − x^{(m)}||. Also, recall the
earlier derivation in Section 2.5 of Chapter 2. A similar bound, (2.5.5), was
derived for the error in a linearly convergent method.
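The iteration (8.5.3), together with the ratio estimate (8.5.11) and the bound (8.5.13), can be sketched in Python as follows; the function name residual_correction and the stopping tolerance are illustrative choices, not part of the text.

    import numpy as np

    def residual_correction(A, b, C, x0, tol=1e-10, max_iter=100):
        # Iteration (8.5.3): x^(m+1) = x^(m) + C r^(m), with C an approximate
        # inverse of A.  The ratio of successive corrections estimates the
        # constant c of (8.5.10); the bound (8.5.13) is used as a stopping test.
        x = x0.astype(float).copy()
        old_step = None
        for _ in range(max_iter):
            r = b - A @ x
            step = C @ r
            x = x + step
            if old_step is not None:
                c = np.linalg.norm(step, np.inf) / np.linalg.norm(old_step, np.inf)
                if c < 1 and c / (1 - c) * np.linalg.norm(step, np.inf) <= tol:
                    break
            old_step = step
        return x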
Example Define A(ε) = A_0 + εB, with A_0 and B given matrices of order two.
As an approximate inverse to A(ε), use C = A_0^{-1}.
We can solve the system A(ε)x = b using the residual correction method (8.5.3).
For the convergence analysis,

    I − CA(ε) = −εCB

Convergence is assured if ||εCB|| < 1 for some matrix norm, and from (8.5.4),

    ||x − x^{(m+1)}|| ≤ ||εCB|| ||x − x^{(m)}||,        m ≥ 0
There are many situations of the kind in this example. We may have to solve
linear systems of a general form A(ε)x = b for any ε near zero. To save time, we
obtain either A(0)^{-1} or the LU decomposition of A(0). This is then used as an
approximate inverse to A(ε), and we solve A(ε)x = b using the residual correc-
tion method.
8.6 Iteration Methods
As was mentioned in the introduction to this chapter, many linear systems are
too large to be solved by direct methods based on Gaussian elimination. For
these systems, iteration methods are often the only possible method of solution,
as well as being faster than elimination in many cases. The largest area for the
application of iteration methods is to the linear systems arising in the numerical
solution of partial differential equations. Systems of orders 10³ to 10⁵ are not
unusual, although almost all of the coefficients of the system will be zero. As an
example of such problems, the numerical solution of Poisson's equation is studied
in Section 8.8. The reader may want to combine reading that section with the
present one.
Besides being large, the linear systems to be solved, Ax = b, often have
several other important properties. They are usually sparse, which means that
only a small percentage of the coefficients are nonzero. The nonzero coefficients
generally have a special pattern in the way they occur in A, and there is usually a
simple formula that can be used to generate the coefficients a_ij as they are
needed, rather than having to store them. As one consequence of these properties,
the storage space for the vectors x and b may be a more important consideration
than is storage for A. The matrices A will often have special properties, which are
discussed in this and the next two sections.
We begin by defining and analyzing two classical iteration methods; following
that, a general abstract framework is presented for studying iteration methods.
The special properties of the linear system Ax = b are very important when
setting up an iteration method for its solution. The results of this section are just
a beginning to the design of a method for any particular area of applications.
The Gauss-Jacobi method (Simultaneous displacements) Rewrite Ax = b as

    x_i = (1/a_ii) [ b_i − Σ_{j=1, j≠i}^{n} a_ij x_j ],        i = 1, 2, ..., n        (8.6.1)

assuming all a_ii ≠ 0. Define the iteration as

    x_i^{(m+1)} = (1/a_ii) [ b_i − Σ_{j=1, j≠i}^{n} a_ij x_j^{(m)} ],        i = 1, ..., n,   m ≥ 0        (8.6.2)

and assume initial guesses x_i^{(0)}, i = 1, ..., n, are given.
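A compact Python sketch of the iteration (8.6.2) is the following; it assumes A is stored as a dense NumPy array, which defeats the purpose for genuinely large sparse problems but shows the structure of the method.

    import numpy as np

    def jacobi(A, b, x0, num_iter):
        # Gauss-Jacobi iteration (8.6.2): every component of x^(m+1) is built
        # from the previous iterate x^(m) only ("simultaneous displacements").
        D = np.diag(A)                      # diagonal entries a_ii
        R = A - np.diagflat(D)              # off-diagonal part of A
        x = x0.astype(float).copy()
        for _ in range(num_iter):
            x = (b - R @ x) / D
        return x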
    B_i x_{(i)}^{(r+1)} = b_{(i)} − A_i x_{(i−1)}^{(r)} − C_i x_{(i+1)}^{(r)}
                                                                          (8.6.32)
    B_N x_{(N)}^{(r+1)} = b_{(N)} − A_N x_{(N−1)}^{(r)}
for r ≥ 0. The analysis of convergence is more complicated than for the
Gauss-Jacobi and Gauss-Seidel methods; some results are suggested in Problem
29. Similar methods are used with the linear systems arising from solving some
partial differential equations.
Another important aspect of solving linear systems Ax = b is to look at their
origin. In many cases we have a differential or integral equation, say

    𝒜x = y                                                                (8.6.33)

where x and y are functions. This is discretized to give a family of problems

    A_n x_n = y_n                                                         (8.6.34)

with A_n of order n. As n → ∞, the solutions x_n of (8.6.34) approach (in some
sense) the solution x of (8.6.33). Thus the linear systems in (8.6.34) are closely
related. For example, in some sense A_m^{-1} ≈ A_n^{-1} for m and n sufficiently large,
even though they are matrices of different orders. This can be given a more
precise meaning, leading to ways of iteratively solving large systems by using the
solutions of lower order systems. Recently, many such methods have been
developed under the name of multigrid methods, with applications particularly to
partial differential equations [see Hackbusch and Trottenberg (1982)]. For itera-
tive methods for integral equations, see the related but different development in
Atkinson (1976, part II, chap. 4). Multigrid methods are very effective and
efficient iterative methods for differential and integral equations.
8.7 Error Prediction and Acceleration

From (8.6.25), we have the error relation

    x − x^{(m+1)} = M(x − x^{(m)})                                        (8.7.1)

The manner of convergence of x^{(m)} to x can be quite complicated, depending on
the eigenvalues and eigenvectors of M. But in most practical cases, the behavior
of the errors is quite simple: The size of ||x − x^{(m)}||_∞ decreases by approxi-
mately a constant factor at each step, and

    ||x − x^{(m+1)}||_∞ ≈ c ||x − x^{(m)}||_∞                             (8.7.2)

for some c < 1, closely related to r_σ(M). To measure this constant c, note from
(8.7.1) that

    x^{(m+1)} − x^{(m)} = e^{(m)} − e^{(m+1)} = Me^{(m−1)} − Me^{(m)}

    x^{(m+1)} − x^{(m)} = M(x^{(m)} − x^{(m−1)})                          (8.7.3)

This motivates the use of

    c ≈ ||x^{(m+1)} − x^{(m)}||_∞ / ||x^{(m)} − x^{(m−1)}||_∞             (8.7.4)
Table 8.5 Example of Gauss-Seidel iteration

    m     ||u^{(m)} − u^{(m−1)}||_∞     Ratio     Est. Error     Error
    20    1.20E − 3                     .966      3.42E − 2      3.09E − 2
    21    1.16E − 3                     .966      3.24E − 2      2.98E − 2
    22    1.12E − 3                     .965      3.08E − 2      2.86E − 2
    23    1.08E − 3                     .965      2.93E − 2      2.76E − 2
    24    1.04E − 3                     .964      2.80E − 2      2.65E − 2
    60    2.60E − 4                     .962      6.58E − 3      6.58E − 3
    61    2.50E − 4                     .962      6.33E − 3      6.33E − 3
    62    2.41E − 4                     .962      6.09E − 3      6.09E − 3
or for greater safety, the maximum of several successive such ratios. In many
applications, this ratio is about constant for large values of m.
Once this constant c has been obtained, and assuming (8.7.2), we can bound
the error in x^{(m+1)} by using (8.5.13):

    ||x − x^{(m+1)}||_∞ ≤ (c / (1 − c)) ||x^{(m+1)} − x^{(m)}||_∞         (8.7.5)

This bound is important when c ≈ 1 and the convergence is slow. In that case,
the difference ||x^{(m+1)} − x^{(m)}||_∞ can be much smaller than the actual error
||x − x^{(m+1)}||_∞.
Example The linear system (8.8.5) of Section 8.8 was solved using the
Gauss-Seidel method. In keeping with (8.8.5), we denote our unknown vector by
u. In (8.8.4), the function f = x²y², and in (8.8.5), the function g = 2(x² + y²).
The region was 0 ≤ x, y ≤ 1, and the mesh size in each direction was h = 1/16.
This gave an order of 225 for the linear system (8.8.5). The initial guess u^{(0)} in the
iteration was based on a "bilinear" interpolant of f = x²y² over the region
0 ≤ x, y ≤ 1 [see (8.8.17)]. A selection of numerical results is given in Table 8.5.
The column Ratio is calculated from (8.7.4), the column Est. Error uses (8.7.5),
and the column Error is the true error ||u − u^{(m)}||_∞.
As can be seen in the table, the convergence was quite slow, justifying the
need for (8.7.5) rather than the much smaller ||u^{(m)} − u^{(m−1)}||_∞. As m → ∞, the
value of Ratio converges to .962, and the error estimate (8.7.5) is an accurate
estimator of the true iteration error.
Speed of convergence We now discuss how many iterates to calculate in order
to obtain a desired error. And when is iteration preferable to Gaussian elimina-
tion in solving Ax = b? We find the value of m for which

    ||x − x^{(m)}||_∞ ≤ ε ||x − x^{(0)}||_∞                               (8.7.6)

with ε a given factor by which the initial error is to be reduced. We base the
analysis on the assumption (8.7.2). Generally the constant c is almost equal to
r_σ(M), with M as in (8.7.1).
The relation (8.7.2) implies

    ||x − x^{(m)}||_∞ ≤ c^m ||x − x^{(0)}||_∞,        m ≥ 0

Thus we find the smallest value of m for which c^m ≤ ε. Solving this, we must have

    m ≥ (−ln ε) / R(c) ≡ m*                                               (8.7.7)

    R(c) = −ln c                                                          (8.7.8)
Doubling R(c) leads to halving the number of iterates that must be calculated.
To make this result more meaningful, we apply it to the solution of a dense
linear system by iteration. Assume that the Gauss-Jacobi or Gauss-Seidel
method is being used to solve Ax = b to single precision accuracy on an IBM
mainframe computer, that is, to about six significant digits. Assume x^{(0)} = 0, and
that we want to find m such that

    ||x − x^{(m)}||_∞ ≤ 10⁻⁶ ||x||_∞                                      (8.7.9)

Assuming A has order n, the number of operations (multiplications and divi-
sions) per iteration is n². To obtain the result (8.7.9), the necessary number of
iterates is

    m* = 6 ln 10 / R(c)

and the number of operations is

    m* n² = (6 ln 10) n² / R(c)
If Gaussian elimination is used to solve Ax = b with the same accuracy, the
number of operations is about n³/3. The iteration method will be more efficient
than the Gaussian elimination method if

    m* < n/3                                                              (8.7.10)
Example Consider a matrix A of order n = 51. Then iteration is more efficient
if m* < 17. Table 8.6 gives the values of m* for various values of c. For c ≤ .44,
the iteration method will be more efficient than Gaussian elimination. And if less
Table 8.6 Example of iteration count
c R(c) m*
.9 .105 131
.8 .223 62
.6 .511 27
.4 .916 15
.2 1.61 9
than full precision accuracy in (8.7.9) is desired, then iteration will be more
efficient with even larger values of c. In practice, we also will usually know an
initial guess x^{(0)} that is better than x^{(0)} = 0, further decreasing the number of
needed iterates.
The main use of iteration methods is for the solution of large sparse systems,
in which case Gaussian elimination is often not possible. And even when
elimination is possible, iteration may still be preferable. Some examples of such
systems are discussed in Section 8.8.
Acceleration methods Most iteration methods have a regular pattern in which
the error decreases. This can often be used to accelerate the convergence, just as
was done in earlier chapters with other numerical methods. Rather than giving a
general theory for the acceleration of iteration methods for solving Ax = b, we
just describe an acceleration of the Gauss-Seidel method. This is one of the main
cases of interest in applications.
Recall the definition (8.6.15) of the Gauss-Seidel method. Introduce an
acceleration parameter ω and consider the following modification of (8.6.15):

    z_i^{(m+1)} = (1/a_ii) [ b_i − Σ_{j=1}^{i−1} a_ij x_j^{(m+1)} − Σ_{j=i+1}^{n} a_ij x_j^{(m)} ]
                                                                          (8.7.11)
    x_i^{(m+1)} = ω z_i^{(m+1)} + (1 − ω) x_i^{(m)},        i = 1, ..., n
for m ≥ 0. The case ω = 1 is the regular Gauss-Seidel method. The acceleration
is to optimally choose some linear combination of the preceding iterate and the
regular Gauss-Seidel iterate. The method (8.7.11), with an optimal choice of ω, is
called the SOR method, which is an abbreviation for successive overrelaxation, an
historical term.
To understand how ω should be chosen, we rewrite (8.7.11) in matrix form.
Decompose A as

    A = D + L + U

with D = diag[a_11, ..., a_nn], L lower triangular, and U upper triangular, with
both L and U having zeros on the diagonal. Then (8.7.11) becomes

    z^{(m+1)} = D^{-1}[b − Lx^{(m+1)} − Ux^{(m)}]
    x^{(m+1)} = ωz^{(m+1)} + (1 − ω)x^{(m)},        m ≥ 0
Eliminating z^{(m+1)} and solving for x^{(m+1)},

    x^{(m+1)} = (D + ωL)^{-1}{ωb + [(1 − ω)D − ωU]x^{(m)}}                (8.7.12)

For the error,

    e^{(m+1)} = M(ω)e^{(m)},      m ≥ 0,      M(ω) = (D + ωL)^{-1}[(1 − ω)D − ωU]        (8.7.13)
The parameter ω is to be chosen to minimize r_σ(M(ω)), in order to make x^{(m)}
converge to x as rapidly as possible. Call the optimal value ω*.
The calculation of ω* is difficult except in the simplest of cases. And usually it
is obtained only approximately, based on trying several values of ω and observ-
ing the effect on the speed of convergence. In spite of the problem of calculating
ω*, the resulting increase in the speed of convergence of x^{(m)} to x is very
dramatic, and the calculation of ω* is well worth the effort. This is illustrated in
the next section.
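For reference, a straightforward Python sketch of (8.7.11) is given below; ω = 1 reduces it to the Gauss-Seidel method, and in practice ω would be tuned as described above rather than fixed in advance.

    import numpy as np

    def sor(A, b, x0, omega, num_iter):
        # SOR iteration (8.7.11); within a sweep, x[:i] already holds the new
        # components x^(m+1) and x[i+1:] still holds the old ones x^(m).
        n = len(b)
        x = x0.astype(float).copy()
        for _ in range(num_iter):
            for i in range(n):
                z = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
                x[i] = omega * z + (1.0 - omega) * x[i]
        return x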
Example We apply the acceleration (8.7.11) to the preceding example of the
Gauss-Seidel method, given following (8.7.5). The optimal acceleration parame-
ter is ω* ≈ 1.6735. A more extensive discussion of the SOR method for solving
the linear systems arising in solving partial differential equations is given in the
following section. The initial guess was the same as before. The results are given
in Table 8.7.
The results show a much faster rate of convergence than for the Gauss-Seidel
method. For example, with the Gauss-Seidel method, we have ||u − u^{(228)}||_∞ ≈
9.70E − 6. In comparison, the SOR method leads to ||u − u^{(32)}||_∞ = 8.71E − 6.
But we have lost the regular behavior in the convergence of the iterates, as can be
seen from the values of Ratio. The value of c used in the error test (8.7.5) needs
to be chosen more carefully than our choice of c = Ratio in the table. You may
want to use an average or the maximum of several successive preceding values of
Ratio.
Table 8.7 Example of SOR method (8.7.11)

    m     ||u^{(m)} − u^{(m−1)}||_∞     Ratio     Est. Error     Error
    21    2.06E − 4                     .693      4.65E − 4      3.64E − 4
    22    1.35E − 4                     .657      2.59E − 4      2.65E − 4
    23    8.76E − 5                     .648      1.61E − 4      1.87E − 4
    24    5.11E − 5                     .584      7.17E − 5      1.39E − 4
    25    3.48E − 5                     .680      7.40E − 5      1.06E − 4
    26    2.78E − 5                     .800      1.11E − 4      8.04E − 5
    27    2.46E − 5                     .884      1.87E − 4      6.15E − 5
    28    2.07E − 5                     .842      1.11E − 4      4.16E − 5
8.8 The Numerical Solution of Poisson's Equation
The most important application of linear iteration methods is to the large linear
systems arising from the numerical solution of partial differential equations by
finite difference methods. To illustrate this, we solve the Dirichlet problem for
Poisson's equation on the unit square in the xy-plane:
    ∂²u(x, y)/∂x² + ∂²u(x, y)/∂y² = g(x, y),        0 < x, y < 1
                                                                          (8.8.1)
    u(x, y) = f(x, y),        (x, y) a boundary point
The functions g(x, y) and f(x, y) are given, and we must find u(x, y).
For N > 1, define h = 1/N, and

    x_j = jh,    y_k = kh,        0 ≤ j, k ≤ N

These are called the grid points or mesh points (see Figure 8.1). To approximate
(8.8.1), we use approximations to the second derivatives. For a four times
continuously differentiable function G(x) on [x − h, x + h], the results
(5.7.17)-(5.7.18) of Section 5.7 give

    G''(x) = [G(x + h) − 2G(x) + G(x − h)] / h²  −  (h²/12) G^{(4)}(ξ),      x − h ≤ ξ ≤ x + h        (8.8.2)
When applied to (8.8.1) at each interior grid point, we obtain

    [u(x_{j+1}, y_k) − 2u(x_j, y_k) + u(x_{j−1}, y_k)] / h²
        + [u(x_j, y_{k+1}) − 2u(x_j, y_k) + u(x_j, y_{k−1})] / h²
        = g(x_j, y_k) + (h²/12) [ ∂⁴u(ξ_j, y_k)/∂x⁴ + ∂⁴u(x_j, η_k)/∂y⁴ ]        (8.8.3)

for some x_{j−1} ≤ ξ_j ≤ x_{j+1}, y_{k−1} ≤ η_k ≤ y_{k+1}, 1 ≤ j, k ≤ N − 1.
For the numerical approximation u_h(x, y) of (8.8.1), let

    u_h(x_j, y_k) = f(x_j, y_k),        (x_j, y_k) a boundary grid point        (8.8.4)

At all interior mesh points, drop the right-hand truncation errors in (8.8.3) and
solve for the approximating solution u_h(x_j, y_k):

    u_h(x_j, y_k) = (1/4)[ u_h(x_{j+1}, y_k) + u_h(x_j, y_{k+1}) + u_h(x_{j−1}, y_k) + u_h(x_j, y_{k−1}) ]
                    − (h²/4) g(x_j, y_k)                                        (8.8.5)
Figure 8.1 Finite difference mesh.
The number of equations in (8.8.4)-(8.8.5) is equal to the number of unknowns,
(N + 1)².

Theorem 8.8    For each N ≥ 2, the linear system (8.8.4)-(8.8.5) has a unique
               solution {u_h(x_j, y_k) | 0 ≤ j, k ≤ N}. If the solution u(x, y) of
               (8.8.1) is four times continuously differentiable, then

                   Max_{0≤j,k≤N} |u(x_j, y_k) − u_h(x_j, y_k)| ≤ c h²            (8.8.6)

                   c = (1/24) [ Max_{0≤x,y≤1} |∂⁴u(x, y)/∂x⁴| + Max_{0≤x,y≤1} |∂⁴u(x, y)/∂y⁴| ]
Proof 1. We prove the unique solvability of (8.8.4)-(8.8.5) by using Theorem
7.2 of Chapter 7. We consider the homogeneous system

    v_h(x_j, y_k) = (1/4)[ v_h(x_{j+1}, y_k) + v_h(x_j, y_{k+1}) + v_h(x_{j−1}, y_k) + v_h(x_j, y_{k−1}) ],      1 ≤ j, k ≤ N − 1        (8.8.7)

    v_h(x_j, y_k) = 0,        (x_j, y_k) a boundary point                        (8.8.8)

By showing that this system has only the trivial solution v_h(x_j, y_k) = 0,
it will follow from Theorem 7.2 that the nonhomogeneous system
(8.8.4)-(8.8.5) will have a unique solution.
Let

    α = Max_{0≤j,k≤N} v_h(x_j, y_k)

From (8.8.8), α ≥ 0. Assume α > 0. Then there must be an interior grid
point (x̄_j, ȳ_k) for which this maximum is attained. But using (8.8.7),
v_h(x̄_j, ȳ_k) is the average of the values of v_h at the four points neighbor-
ing (x̄_j, ȳ_k). The only way that this can be compatible with (x̄_j, ȳ_k)
being a maximum point is if v_h also equals α at the four neighboring grid
points. Continue the same argument to these neighboring points. Since
there are only a finite number of grid points, we eventually have
v_h(x_j, y_k) = α for a boundary point (x_j, y_k). But then α > 0 will con-
tradict (8.8.8). Thus the maximum of v_h(x_j, y_k) is zero. A similar
argument will show that the minimum of v_h(x_j, y_k) is also zero. Taken
together, these results show that the only solution of (8.8.7)-(8.8.8) is
v_h(x_j, y_k) = 0.
2. To consider the convergence of u_h(x_j, y_k) to u(x_j, y_k), define

    e_h(x_j, y_k) = u(x_j, y_k) − u_h(x_j, y_k)

Subtracting (8.8.5) from (8.8.3), we obtain

    e_h(x_j, y_k) = (1/4)[ e_h(x_{j+1}, y_k) + e_h(x_j, y_{k+1}) + e_h(x_{j−1}, y_k) + e_h(x_j, y_{k−1}) ]
                    − (h⁴/48) [ ∂⁴u(ξ_j, y_k)/∂x⁴ + ∂⁴u(x_j, η_k)/∂y⁴ ]          (8.8.9)

and from (8.8.4),

    e_h(x_j, y_k) = 0        (x_j, y_k) a boundary grid point                    (8.8.10)

This system can be treated in a manner similar to that used in part (1),
and the result (8.8.6) will follow. Because it is not central to the
discussion of the linear systems, the argument is omitted [see Isaacson
and Keller (1966), pp. 447-450].
Example Solve

    ∂²u/∂x² + ∂²u/∂y² = 0,        0 < x, y < 1

    u(0, y) = cos(πy),        u(1, y) = e^π cos(πy),        0 ≤ y ≤ 1
    u(x, 0) = e^{πx},         u(x, 1) = −e^{πx},            0 ≤ x ≤ 1          (8.8.11)

The true solution is

    u(x, y) = e^{πx} cos(πy)

Numerical results for several values of N are given in Table 8.8. The error is the
maximum over all grid points, and the column Ratio gives the factor by which
the maximum error decreases when the grid size h is halved. Theoretically from
(8.8.6), it should be 4.0. The numerical results confirm this.
Table 8.8 Numerical solution of (8.8.11)

    N     ||u − u_h||_∞     Ratio
    4     .144
    8     .0390             3.7
    16    .0102             3.8
    32    .00260            3.9
    64    .000654           4.0
Iterative Solution Because the Gauss-Seidel method is generally faster than the
Gauss-Jacobi method, we only consider the former. For k = 1, 2, ..., N − 1,
define

    u_h^{(m+1)}(x_j, y_k) = (1/4)[ u_h^{(m)}(x_{j+1}, y_k) + u_h^{(m)}(x_j, y_{k+1})
                           + u_h^{(m+1)}(x_{j−1}, y_k) + u_h^{(m+1)}(x_j, y_{k−1}) ]
                           − (h²/4) g(x_j, y_k),        j = 1, 2, ..., N − 1        (8.8.12)

For boundary points, use

    u_h^{(m+1)}(x_j, y_k) = f(x_j, y_k),        all m ≥ 0

The values of u_h^{(m+1)}(x_j, y_k) are computed row by row, from the bottom row of
grid points first to the top row of points last. And within each row, we solve from
left to right.
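One sweep of this iteration can be sketched in Python as follows, assuming the iterate is stored in an (N+1) × (N+1) array whose boundary entries already hold the boundary values f; the sweep order matches the row-by-row, left-to-right description above.

    import numpy as np

    def gauss_seidel_sweep(u, g, h):
        # One Gauss-Seidel sweep for (8.8.12) on the unit square; u[j, k]
        # approximates u_h(x_j, y_k), and u is updated in place.
        N = u.shape[0] - 1
        for k in range(1, N):          # rows of grid points, bottom to top
            for j in range(1, N):      # within a row, left to right
                u[j, k] = (0.25 * (u[j+1, k] + u[j, k+1] + u[j-1, k] + u[j, k-1])
                           - 0.25 * h * h * g(j * h, k * h))
        return u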
For the iteration (8.8.12), the convergence analysis must be based on Theorem
8.7. It can be shown easily that the matrix is symmetric, and thus all eigenvalues
are real. Moreover, the eigenvalues can all be shown to lie in the interval
0 < λ < 2. From this and Problem 14 of Chapter 7, the matrix is positive
definite. Since all diagonal coefficients of the matrix are positive, it then follows
from Theorem 8.7 that the Gauss-Seidel method will converge. To show that all
eigenvalues lie in 0 < λ < 2, see Isaacson and Keller (1966, pp. 458-459) or
Problem 3 of Chapter 9.
The calculation of the speed of convergence r_σ(M) from (8.6.28) is nontrivial.
The argument is quite sophisticated, and we only refer to the very complete
development in Isaacson and Keller (1966, pp. 463-470), including the material
on the acceleration of the Gauss-Seidel method. It can be shown that

    r_σ(M) = cos²(πh) ≈ 1 − π²h²                                          (8.8.13)

The Gauss-Seidel method converges, but the speed of convergence is quite slow
for even moderately small values of h. This is illustrated in Table 8.5 of
Section 8.7.
To accelerate the Gauss-Seidel method, use

    u_h^{(m+1)}(x_j, y_k) = ω ū_h^{(m+1)}(x_j, y_k) + (1 − ω) u_h^{(m)}(x_j, y_k),      j = 1, ..., N − 1        (8.8.14)

for k = 1, ..., N − 1, where ū_h^{(m+1)}(x_j, y_k) denotes the Gauss-Seidel value
computed from (8.8.12). The optimal acceleration parameter is

    ω* = 2 / (1 + sin(πh))                                                (8.8.15)

The corresponding rate of convergence is

    r_σ(M(ω*)) = ω* − 1 ≈ 1 − 2πh                                         (8.8.16)
This is a much better rate than that given by (8.8.13). The accelerated
Gauss-Seidel method (8.8.14) with the optimal value ω* of (8.8.15) is known as
the SOR method. The name SOR is an abbreviation for successive overrelaxation,
a name that is based on a physical interpretation of the method, first used in
deriving it.
Example Recall the previous example (8.8.11). This was solved with both the
Gauss-Seidel method and the SOR method. The initial guess for the iteration
was taken to be the "bilinear" interpolation formula for the boundary data f:

    u_h^{(0)}(x, y) = (1 − x)f(0, y) + xf(1, y) + (1 − y)f(x, 0) + yf(x, 1)
                      − [ (1 − y)(1 − x)f(0, 0) + (1 − y)x f(1, 0)
                          + y(1 − x)f(0, 1) + xy f(1, 1) ]                (8.8.17)
at all interior grid points. The error test to stop the iteration was

    ||u − u^{(m+1)}||_∞ ≤ ε

with ε > 0 given and the right-hand side of (8.7.5) used to predict the error in the
iterate. The numerical results for the necessary number of iterates are given in
Table 8.9. The SOR method requires far fewer iterates for the smaller values of h
than does the Gauss-Seidel method.
562 NUMERICAL SOLUTION OF SYSTEMS OF LINEAR EQUATIONS
Table 8.9 Number of iterates necessary
to solve (8.8.5)
N
( Gauss-Seidel SOR
8 .01 25 12
8 .001 40 16
16 .001 142 32
32 .001 495 65
!)
.0001 54 18
16 .0001 201 35
32 .0001 733 71
Recall from the previous section that the number of iterates, called m*,
necessary to reduce the iteration error by a factor of t: is proportional to 1 jln (c),
where c is the ratio by which the iteration error decreases at each step. For the
methods of this section, we take c = r
0
(M). If we write r
0
(M) = 1
1 1 1
-ln(1-8) 8
When 8 is halved, the number of iterates to be computed is doubled. For the
Gauss-Seidel method,
1 1
ln r
0
(M) = 7T
2
h
2
When h is halved, the number of iterates to be computed increases by a factor of
4. For the SOR method,
1 1
---------= ---
ln ra(M(w*))
and when h is halved, the number of iterates is doubled. These two results are
illustrated in Table 8.9 by the entries fort:= 10-
3
and t: = 10-
4
With either method, note that doubling N will increase the number of
equations to be solved by a factor of 4, and thus the work per iteration will
increase by the same amount. The use of SOR greatly reduces the resulting work,
although it still is large when N is large.
8.9 The Conjugate Gradient Method
The iteration method presented in this section was developed in the 1950s, but
has gained its main popularity in more recent years, especially in solving linear
systems associated with the numerical solution of linear partial differential
equations. The literature on the conjugate gradient method (CG method) has
THE CONJUGATE GRADIENT METHOD 563
become quite large and sophisticated, and there are numerous connections to
other topics in linear algebra. Thus, for reasons of space, we are able to give only
a brief introduction, defining the CG method and stating some of the principal
theoretical results concerning it.
The CG method differs from earlier methods of this chapter, in that it is based
on solving a nonlinear problem; in fact, the CG method is also a commonly used
method for minimizing nonlinear functions. The linear system to be solved,
Ax= b (8.9.1)
is assumed to have a coefficient matrix A that is real, symmetric, and positive
definite. The solution of this system is equivalent to the minimization of the
function
(8.9.2)
The unique solution x* of Ax= b is also the unique minimizer of f(x) as x
varies over Rn. To see this, first show
f(x) = E(x)- tbTx*
(8.9.3)
E ( x) = H x* - x) T A ( x* - x)
Using Ax* = b, the proof is straightforward. The functions E(x) and f(x) differ
by a constant, and thus they will have the same minimizers. By the positive
definiteness of A, E(x) is minimized uniquely by x = x*, and thus the same is
true of f(x).
A well-known iteration method for finding a minimum for a nonlinear
function is the method of steepest descent, which was introduced briefly in Section
2.12. For minimizing f(x) by this method, assume that an initial guess x
0
is
given. Choose a path in which to se.arch for a new minimum by looking along the
direction in which f(x) decreases most rapidly at Xo. This is given by go=
-\lj(x
0
), the negative of the gradient of f(x) at x
0
:
(8.9.4)
Then solve the one-dimensional minimization problem
Min f(x
0
+ ag
0
)
O ~ a o o
calling the solution a
1
. Using it, define the new iterate
(8.9.5)
Continue this process inductively. The method of steepest descent will converge,
but the convergence is generally quite slow. The optimal local strategy of using a
direction of fastest descent is not a good strategy for finding an optimal direction
for finding the global minimum. In comparison, the CG method will be more
564 NUMERICAL SOLUTION OF SYSTEMS OF LINEAR EQUATIONS
rapid, and it will take no more thann iterates, assuming there are no rounding
errors.
For the remainder of this section, we assume the given initial guess is x
0
= 0.
If it is not, then we can always solve the modified problem
Az = b- Ax
0
Denoting its solution by z*, we have x = x
0
+ z*. An initial guess of z* = z
0
= 0
corresponds to x = x
0
in the original problem. Henceforth, assume x
0
= 0.
Conjugate direction methods Assuming A is n X n, we say a set o( nonzero
vectors PI ... , p, in R" is A-conjugate if.
l.:s;;i,j.:s;;n i=Fj (8.9.6)
The vectors pj are often called conjugate directions. An equivalent geometric
definition can be given by introducing a new inner product and norm for R":
(8.9.7)
The condition (8.9.6) is equivalent to requiring PI ... , p, to be an orthogonal
basis for R" with respect to the inner product ( , )A Thus we also say that
{ PI ... , p,} are A-orthogonal if they satisfy (8.9.6). With the norm II II A the
function E(x) of (8.9.3) is seen to be
E(x) = tllx*- x ~ (8.9.8)
which is more clearly a measure of the error x - x*. The relationship of llxll
2
and llxiiA is further explored in Problem 36.
Given a set of conjugate directions {p
1
, , p,}, it is straightforward to solve
Ax= b. Let
Using (8.9 .6),
k = 1, ... , n (8.9.9)
We use this formula for x* to introduce the conjugate direction iteration method.
Let x
0
= 0,
(8.9.10)
Introduce
THE CONJUGATE GRADIENT METHOD 565
the residual of xk in Ax = b. Easily, r
0
= b and
k=l, ... ,n (8.9.11)
For k = n, we have xn = x* and rn = 0, and xk may equal x* with a smaller
value of k.
Lemma 1 The term rk is orthogonal to h ... , Pk that is, r{p; = 0, 1 .:::;; i.:::;; k.
We leave the proof to Problem 37.
Lemma 2 (a) The minimization problem
is solved uniquely by a = ak, yielding f(xk) as the minimum.
(b) Let !?k = Span { p
1
, . , pd, the k-dimensional subspace gen-
erated by { p
1
, , p k } Then the problem
Minf(x)
xe.Y'k
is uniquely solved by x = xk, yielding the minimum f(xk).
Proof (a) Expand cp(a) = f(xk-l + apk):
The term p[Axk-l equals 0, because xk_
1
E !7k_
1
and Pk is A-
orthogonal to sPk_
1
. Solve cp'(a) = 0, obtaining a = ak, to complete the
proof.
(b) Expand f(xk +h), for any h E sPk:
f(xk +h)= f(xk) + hTAxk + th-r.Ah- hTb
By Lemma 1 andthe assumption h E sPk, It follows that hTrk = 0. Thus
since A is positive definite. The minimum is attained uniquely in sPk by
letting h = 0, proving (b).
Lemma 2 gives an optimality property for conjugate direction methods,
defined by (8.9.11) and (8.9.9). The problem is knowing how to choose the
conjugate directions { pj }. There are many possible choices, and some of them
lead to well-known methods for directly solving Ax= b [see Stewart (1973b)].
566 NUMERICAL SOLUTION OF SYSTEMS OF LINEAR EQUATIONS
The conjugate gradient method We give a way to simultaneously generate the
directions {Pk} and the iterates {xd. For the first direction p
1
, we use the
steepest descent direction:
(8.9.12)
since x
0
= 0, An inductive construction is given for the remaining directions.
Assume xi, ... , x" have been generated, along with the conjugate directions
PI ... , P1c A new direction Pic+ I must be chosen, one that is A-conjugate to
PI ... , P1c Also, assume x" '* x*, and thus'"'* 0; otherwise, we would have the
solution x* and there would be no point to proceeding.
By Lemma 1, '" is orthogonal to !/'", and thus '" does not belong to !/'". We
use rk to generate Pk+l choosing a component of r". For reasons too com-
plicated to here, it suffices to consider
(8.9.13)
Then the condition prApk+I = 0 implies
(8.9.14)
The denominator is nonzero since A is positive definite and PJc =!= 0. It can be
shown [Luenberger (1984), p. 245] that this definition of Pk+I also satisfies
pfApk+l = 0
j=1,2, ... ,k-1 (8.9.15)
thus showing { p
1
, .. , is A-conjugate.
The conjugate gradient method consists of choosing { Pd from (8.9.12)-
(8.9.14) and rx} from (8.9.11) and (8.9.9). Ignoring rounding errors, the
method converges in n or fewer iterations. The actual speed of convergence
varies a great deal with the eigenvalues of A. The error analysis of the CG
method is based on the following optimality result.
Theorem 8.9 The iterates { x"} of the CG method satisfy
llx*- = Min llx*- q(A)bjjA
deg(q) <k
(8.9.16)
Proof For q('A) a polynomial, the notation q(A) denotes the matrix expression
with each power N replaced by Aj. For example,
1
The proof of (8.9.16) is given in Luenberger (1984, p. 246).
Using this theorem, a number of error. results can be given, varying with the
properties of the matrix A. For example, let the eigenvalues of A be denoted by
(8.9.17)
THE CONJUGATE GRADIENT METHOD 567
repeated according to their multiplicity, and let v
1
, .. , vn denote a corresponding
orthonormal basis of eigenvectors. Using this basis, write
Then
n
x* = " c.v.
'- J J
j-1
n
b=Ax*= "cAD
'- J J J
j-1
n
q(A)b = L cjAjq(Aj)vj
j-1
(8.9.18)
(8.9.19)
Any choice of a polynomial q(A) of degree < k will give a bound for llx* - xkll,
4
One of the better known bounds is
(8.9.20)
. with c = AifAn, the reciprocal of the condition number cond (A h. This is a
conservative bound, implying poor convergence for ill-conditioned problems. Its
proof is sketched in Luenberger (1984, p. 258, prob. 10). Other bounds can be
derived, based on the behavior of the eigenvalues {A;} and the coefficients c; of
(8.9.18). In many applications in which the A; vary greatly, it often happens that
the cj for the smaller Aj are quite close to zero. Then the formula in (8.9.19) can
be manipulated to give an improved bound over that in (8.9.20). In other cases,
the eigenvalues may coalesce around a small number of values, and then (8.9.19)
can be used to show convergence with a small k. For other results, see Luen-
berger (1984, p. 250), Jennings (1977), and van der Sluis and van der Vorst
(1986).
The formulas for { aj, Pj} in defining the CG method can be further modified,
to give simpler and more efficient formulas. We incorporate those into the
following.
Algorithm CG (A, b, x, n)
1. Remark: This algorithm calculates the solution of Ax = b using
the conjugate gradient method.
2. x
0
:= 0, r
0
:= b, p
0
:= 0
3. Fork= 0, ... , n- 1, do through step 7.
4. If rk = 0, then set x = xk and exit.
5. For k = 0, P
1
:= 0; and
fork> 0, Pk+l := r'[rkfr[_
1
rk-I
Pk+l := rk + pk+1Pk
568 NUMERICAL SOLUTION OF SYSTEMS OF LINEAR EQUATIONS
6. ak+l == r[rk!PLIAPk+l
xk+l == xk + ak+1Pk+1
rk+l == b- Axk+l
7. End loop on k.
8. X := Xn and exit.
This algorithm does not consider the problems of using finite preclSlon
arithmetic. For a more complete algorithm, see Wilkinson and Reinsch (1971,
pp. .
Our discussion of the CG method has followed closely that in Luenberger
(1984, chap 8). For another approach, with more geometric motivation, see
Golub and Van Loan (1983, sec. 10.2). They also have extensive references to the
literature.
Example As a simpleminded test case, we use the order five matrix
r5
4 3 2
5 4 3
4 5 4 (8.9.21)
3 4 5
2 3 4
The smallest and largest eigenvalues are A.
1
= .5484 and A.
5
= 17.1778, respec-
tively. For the error bound (8.9.20), c = .031925, and
1- {C
'- = .697
1 + vc
For the linear system, we chose
b = [7.9380, 12.9763,17.3057,19.4332, 18.4196f
which leads to the true solution
x* = [ -0.3227,0.3544, 1.1010, 1.5705, 1.6897f
Table 8.10 Example of the conjugate gradient method
k
llrklloo llx- xklloo llx- xkiiA
1 4.27 8.05E- 1 2.62
2 8.98E- 2 7.09E- 2 1.31E- 1
3 2.75E- 3 3.69E- 3 4.78E- 3
4 7.59E- 5 1.38E- 4 1.66E- 4
5
=o =o =o
Bound (8.9.20)
12.7
8.83
6.15
4.29
2.99
.... ;
DISCUSSION OF THE LITERATURE 569
The results from using CG are shown in Table 8.10, along with the error bound
in (8.9.20). As stated earlier, the bound (8.9.20) is very conservative.
The residuals decrease, as expected. But from the way the directions { p k } are
constructed, this implies that obtaining accurate directions p k for larger k will
likely be difficult because of the smaller number of digits of accuracy in the
residuals rk. For some discussion of this, see Golub and Van Loan (1983, p. 373),
which also contains additional references to the literature for this problem.
The preconditioned conjugate gradient method The bound (8.9.20) indicates or
seems to imply that the CG iterates can converge quite slowly, even for methods
with a moderate condition number such as cond(A}z = 1/c = 100. To increase
the rate of convergence, or at least to guarantee a rapid rate of convergence, the
problem Ax = b is transformed to an equivalent problem with a smaller condi-
tion number. The bound in (8.9.20) will be smaller, and one expects that the
iterates will converge more rapidly.
For a nonsingular matrix Q, transform Ax = b by
(8.9.22)
(8.9.23)
Then (8.9.22) is simply Ai =b. The matrix Q is to be chosen so that cond(A}z is
significantly smaller than cond (Ah. The actual CG method is not applied
explicitly to solving Ai = b, but rather the algorithm CG is modified slightly.
For the resulting algorithm when Q is symmetric, see Golub and Van Loan
(1983, p. 374).
Finding Q requires a careful analysis of the original problem Ax = b, under-
standing the structure of A in order to pick Q. From (8.9.23),
with A to be chosen with eigenvalues near 1 in magnitude. For example, if A is
about the identity I, then A = QQT. This decomposition could be accomplished
with a Cholesky triangular factorization. Approximate Cholesky factors are used
in defining preconditioners in some cases. For an introduction to the problem of
selecting preconditioners, see Golub and Van Loan (1983, sec. 10.3) and Axelson
r u ~
Discussion of the Literature
The references that have most influenced the presentation of Gaussian elimina-
tion and other topics in this chapter are the texts of Forsythe and Moler (1967),
570 NUMERICAL SOLUTION OF SYSTEMS OF LINEAR EQUATIONS
Golub and Van Loan (1983), Isaacson and Keller (1966), and Wilkinson (1963),
(1965), along with the paper of Kahan (1966). Other very good general treatments
are given in Conte and de Boor (1980), Noble (1969), Rice (1981), and Stewart
(1973a). More elementary introductions are given in Anton (1984) and Strang
(1980).
The best codes for the direct solution of both general and special forms of
linear systems, of small to moderate size, are based on those given in the package
UNPACK, described in Dongarra et al. (1979). These are completely portable
programs, and they are available in single and double precision, in both real and
complex arithmetic. Along with the solution of the systems, they also can
estimate the condition number of the matrix under consideration. The linear
equation programs in IMSL and NAG are variants and improvements of the
programs in UNPACK.
Another feature of the UNPACK is the use of the Basic Linear Algebra
Subroutines (BLAS). These are low-level subprograms that carry out basic vector
operations, such as the dot product of two vectors and the sum of two vectors.
These are available in Fortran, as part of UNPACK; but by giving assembly
language implementations of them, it is often possible to significantly improve
the efficiency of the main UNPACK programs. For a more general discussion of
the BLAS, see Lawson et al. (1979). The UNPACK programs are widely
available, and they have greatly influenced the development of linear equation
programs in other packages.
There is a very large literature on solving the linear systems arising from the
numerical solution of partial differential equations (PDEs). For some general
texts on the numerical solution of PDEs, see Birkhoff and Lynch (1984), Forsythe
and Wasow (1960), Gladwell and Wait (1979), Lapidus and Pinder (1982), and
Richtmyer and Morton (1967). For texts devoted to classical iterative methods
for solving the linear systems arising from the numerical solution of PDEs, see
Hageman and Young (1981) and Varga (1962). For other approaches of more
recent interest, see Swarztrauber (1984), Swarztrauber and Sweet (1979), George
and Liu (1981), and Hackbusch and Trottenberg (1982).
The numerical solution of PDEs is the source of a large percentage of the
sparse linear systems that are solved in practice. However, sparse systems of large
order also occur with other applications [e.g., see Duff (1981)]. There is a large
variety of approaches to solving large sparse systems, some of which wediscussed
in Sections 8.6-8.8. Other direct and iteration methods are available, depending
on the structure of the matrix. For a sample of the current research in this very
active area, see the survey of Duff (1977), the proceedings of Bjorck et al. (1981),
Duff (1981), Duff and Stewart (1979), and Evans (1985), and the texts of George
and Liu (1981) and Pissanetzky (1984). There are several software packages for
the solution of various types of sparse systems, some associated with the
preceding books. For a general index of many of the packages that are available,
see the compilation of Heath (1982). For iteration methods for the systems
associated with solving partial differential equations, the books of Varga (1962)
and Hageman and Young (1981) discuss many of the classical approaches.
Integral equations lead to dense linear systems, and other types of iteration
methods have been used for their solution. For some quite successful methods,
see Atkinson (1976, part II, chap. 4).
BIBLIOGRAPHY 571
The conjugate gradient method dates to Hestenes and Stiefel (1952), and its
use in solving integral and partial differential equations is still under develop-
ment. For more extensive discussions relating the conjugate direction method to
other numerical methods, see Hestenes (1980) and Stewart (1973b). For refer-
ences to the recent literature, including discussions of the preconditioned con-
jugate gradient method, see Axelsson (1985), Axelsson and Lindskog (1986), and
Golub and Van Loan (1983, sees. 10.2 and 10.3). A generalization for nonsym-
metric systems is proposed in Eisenstat et al. (1983).
One of the most important forces that will be determining the direction of
future research in numerical linear algebra is the growing use of vector and
parallel processor computers. The vector machines, such as the CRA Y-2, work
best when doing basic operations on vector quantities, such as those specified in
the BLAS used in UNPACK. In recent years, there has been a vast increase in
the availability of time on these machines, on newly developed nationwide
computer networks. This has changed the scale of many physical problems that
can be attempted, and it has led to a demand for ever more efficient computer
programs for solving a wide variety of linear systems. The use of parallel
computers is even more recent, and only in the middle to late 1980s have they
become widespread. There. is a wide variety of architectures for such machines.
Some have the multiple processors share a common memory, with a variety of
possible designs; others are based on each processor having its own memory and
being linked in various ways to other processors. Parallel computers often lead to
quite different types of numerical algorithms than those we have been studying
for sequential computers, algorithms that can take advantage of several concur-
rent processors working on a problem. There is little literature available, although
that is changing quite rapidly. As a survey of the solution of the linear systems
associated with partial differential equations, on both vector and parallel com-
puters, see Ortega and Voigt (1985). For a proposed text for the solution of linear
systems, see Ortega (1987).
Aird, T., and R. Lynch (1975). Computable accurate upper and lower error
bounds for approximate solutions of linear algebraic systems, ACM Trans.
Math. Softw. 1, 217-231. .
Anton, H. (1984). Elementary Linear Algebra, 4th ed. Wiley, New York.
Atkinson, K. (1976). A Survey of Numerical Methods for the Solution of Fredholm
Integral Equations of the Second Kind. SIAM Pub., Philadelphia.
Axelsson, 0. (1985). A survey of preconditioned iterative methods for linear
systems of algebraic equations, BIT 25, 166-187.
Axelsson, 0., and G. Lindskop (1986). On the rate of convergence of the
preconditioned conjugate gradient method, Numer. Math. 48, 499-523.
572 NUMERICAL SOLUTION OF SYSTEMS OF LINEAR EQUATIONS
Birkhoff, G., and R. Lynch (1984). Numerical Solution of Elliptic Problems. SIAM
Pub., Philadelphia.
Bjorck, A., R. Plemmons, and H. Schneider, Eds. (1981). Large Scale Matrix
Problems. North-Holland, Amsterdam.
Concus, P., G. Golub, and D. O'Leary (1984). A generalized conjugate gradient
method for the numerical solution of elliptic partial differential equations.
In Studies in Numerical Analysis, G. Golub, Ed. Mathematical Association
of America, pp. 178-198.
Conte, S., and C. de Boor (1980). Elementary Numerical Analysis, .3rd ed.
McGraw-Hill, New York.
Dongarra, J., J. Bunch, C. Moler, and G. Stewart (1979). Linpack User's Guide.
SIAM Pub., Philadelphia.
Dorr, F. (1970). The direct solution of the discrete Poisson equations on a
rectangle, SIAM Rev. 12, 248-263.
Duff, I. (1977). A survey of sparse matrix research, Proc. IEEE 65, 500-535.
Duff, 1., Ed. (1981). Sparse Matrices and Their Uses. Academic Press, New York.
Duff, 1., and G. Stewart, Eds. (1979). Sparse Matrix Proceedings 1978. SIAM
Pub., Philadelphia.
Eisenstat, S., H. Elman, and M. Schultz (1983). Variational iterative methods for
nonsymmetric systems of linear equations, SIAM J. Numer. Anal. 20,
345-357.
Evans, D., Ed. (1985). Sparsity and Its Applications. Cambridge Univ. Press,
Cambridge, E n g l ~ d
Forsythe, G., and C. Moler (1967). Computer Solution of Linear Algebraic
Systems. Prentice-Hall, Englewood Cliffs, N.J.
Forsythe, G., and W. Wasow (1960). Finite Difference Methods for Partial
Differential Equations. Wiley, New York.
George, A., and J. Liu (1981). Computer Solution of Large Sparse Positive Definite
Systems. Prentice-Hall, Englewood Cliffs, N.J.
Gladwell, L, and R. Wait, Eds. (1979). A Survey of Numerical Methods for Partial
Differential Equations. Oxford Univ. Press, Oxford, England.
Golub, G., and C. Van Loan (1983). Matrix Computations. Johns Hopkins Press,
Baltimore.
Gregory, R., and D. Karney (1969). A Collection of Matrices for Testing Computa-
tional Algorithms. Wiley, New York.
Hackbusch, W., and U. Trottenberg, Eds. (1982). Multigrid Methods, Lecture
Notes Math. 960. Springer-Verlag, New York.
Hageman, L., and D. Young (1981). Applied Iterative Methods. Academic Press,
New York.
Heath, M., Ed. (1982). Sparse Matrix Software Catalog. Oak Ridge National
Laboratory, Oak Ridge, Tenn. (Published in connection with the Sparse
Matrix Symposium 1982.)
BIBLIOGRAPHY 573
Hestenes, M. (1980). Conjugate Direction Methods in Optimization. Springer-
Verlag, New York.
Hestenes, M., and E. Stiefel (1952). Methods of conjugate gradients for solving
linear systems, J. Res. Nat. Bur. Stand. 49, 409-439.
Isaacson, E., and H. Keller (1966). Analysis of Numerical Methods. Wiley, New
York.
Jennings, A . (1977). Influence of the eigenvalue spectrum on the convergence
rate of the conjugate gradient method, J. lnst. Math. Its Appl. 20, 61-72.
Kahan, W. (1966). Numerical linear algebra, Can. Math. Bull. 9, 756-801.
Lapidus, L., and G. Pinder (1982). Numerical Solution of Partial Differential
Equations in Science and Engineering. Wiley, New York.
Lawson, C., and R. Hanson (1974). Solving Least Squares Problems. Prentice-Hall,
Englewood Cliffs, N.J.
Lawson, C., R. Hanson, D. Kincaid, and F. Krogh (1979). Basic linear algebra
subprograms for Fortran usage, ACM Trans. Math. Softw. 5, 308-323.
Luenberger, D. (1984). Linear and Nonlinear Programming, 2nd ed. Addison-
Wesley, Reading, Mass.
Noble, B. (1969). Applied Linear Algebra .. Prentice-Hall, Englewood Cliffs, N.J.
Ortega, J. (1987). Parallel and Vector Solution of Linear Systems. Preprint, Univ.
of Virginia, Charlottesville.
Ortega, J., and R. Voigt (1985). Solution of partial differential equations on
vector and parallel computers, SIAM Rev. 27, 149-240.
Pissanetzky, S. (1984). Sparse Matrix Technology. Academic Press, New York.
Rice, J. (1981). Matrix Computations and Mathematical Software. McGraw-Hill,
New York.
Richtmyer, R., and K. Morton (1967). Difference Methods for Initial Value
Problems, 2nd ed. Wiley, New York.
Stewart, G. (1973a). Introduction to Matrix Computations. Academic Press, New
York.
Stewart, G. (1973b). Conjugate direction methods for solving systems of linear
equations, Numer. Math. 21, 284-297.
Stewart, G. (1977). Research, development, and UNPACK. In Mathematical
Software Ill, John Rice (Ed.). Academic Press, New York.
Stone, H. (1968). Iterative solution of implicit approximations of multidimen-
sional partial differential equations, SIAM J. Numer. Anal. 5, 530-558.
Strang, G. (1980). Linear Algebra and Its Applications, 2nd ed. Academic Press,
New York.
Swarztrauber, P. (1984). Fast Poisson solvers. In Studies in Numerical Analysis,
G. Golub, Ed. Mathematical Association of America, pp. 319-370.
Swarztrauber, P., and R. Sweet (1979). Algorithm 541: Efficient Fortran subpro-
grams for the solution of separable elliptic partial differential equations,
ACM Trans. Math. Softw. 5, 352-364.
574 NUMERICAL SOLUTION OF SYSTEMS OF LINEAR EQUATIONS
Van der Sluis, A., and H. van der Vorst (1986). The rate of convergence of
conjugate gradients, Numer. Math. 48, 543-560.
Varga, R. (1962). Matrix Analysis. Prentice-Hall, Englewood Cliffs, N.J.
Wilkinson, J. (1963). Rounding Errors in Algebraic Processes. Prentice-Hall,
Englewood Cliffs, N.J.
Wilkinson, J. (1965). The Algebraic Eigenvalue Problem. Oxford Univ. Press,
Oxford, England.
Wilkinson, J., and C. Reinsch (1971). Linear Algebra, Handbook for Automatic
Computation, Vol. 2. Springer-Verlag, New York.
Problems
1. Solve the following systems Ax = b by Gaussian elimination without
pivoting. Check that A = LV, as in (8.1.5).
[ :
1
-1]
b-m
(a)
2 -2
-2 1 1
3 2
!]
(b)
4 3
3 4
2 3
[ _:
-1 1
l
b [
(c)
3 -3
A=
2 -4 7 -7
-3 7 -10 14
2. Consider the linear system
and verify its solution is
x
2
= -3.8 x
3
= -5.0
(a) Using four-digit floating-point decimal arithmetic with rounding, solve
the preceding system by Gaussian elimination without pivoting.
PROBLEMS 575
(b) Repeat part (a), using partial pivoting. In performing the arithmetic
operations, remember to round to four significant digits after each
operation, just as would be done on a computer.
3. (a) Implement the algorithms Factor and Soive of Section 8.2, or imple-
ment the analogous programs given in Forsythe and Moler (1967,
chaps. 16 and 17).
(b) To test the program, solve the system Ax= b of order n, with
A = [a;) defined by
a;j = Max (i, j)
Also define b = [1, 1, ... , 1]r. The true solution is x =
[0, 0, ... , 0, (1/n)f. This matrix is taken from Gregory and Karney
(1969, p. 42).
4. Consider solving the integral equation
A.x(s)- L' cos(-rrst)x(t)dt = 1 O:::;s:::;J
by discretizing the integral with the midpoint numerical integration rule
(5.2.18). More precisely, let n > 0, h = ljn, I; = (i - t)h fori = 1, ... , n.
We solve for. approximate valut?S of x(t
1
), .. , x(tn) by solving the linear
system
n
>..z;- I: h cos ( '1Tt;tj)zj = 1
~ l
i = 1, ... , n
Denote this linear system by (AI- Kn)z = b, with Kn of order n X n,
b; = 1 1 ~ i, j ~ n
For s1,dficiently large n, z; = x(t;); 1 ~ i ~ n. The value of >.. is nonzero,
and it is assumed here to not be an eigenvalue of K n
Solve (>..! - Kn)z = b for several values of n, say n = 2, 4, 8, 16, 32, 64,
and print the vector solutions z. If possible, also graph these solutions, to
gain some idea of the solution function x(s) of the original integral
equation. Use >.. = 4, 2, 1, .5.
5. (a) Consider solving Ax= b, with A and b complex and order(A) = n.
Convert this problem to that of solving a real square system of order
2n. Hint: Write A =AI + iA
2
, b = bi + ib
2
, x = xi + ix
2
, with
AI, A
2
; b
1
, b
2
, xi, x
2
all real. Detennine equations to be satisfied by
xi and x
2
(b) Determine the storage requirements and the number of operations for
the method in (a) of solving the complex system Ax= b. Compare
576 NUMERICAL SOLUTION OF SYSTEMS OF LINEAR EQUATIONS
these results with those based on directly solving Ax = b using
Gaussian elimination and complex arithmetic. Note the greater ex-
pense of complex arithmetic operations.
6. Let A, B, C be matrices of orders m X n, n X p, p X q, respectively. Do an
operations count for computing A(BC) and (AB)C. Give examples of
when one order of computation is preferable over the other.
7. (a) Show that the number of multiplications and divisions for the
Gauss-Jordan method of Section 8.3 is about tn
3
(b) Show how the Gauss-Jordan method, with partial pivoting, can be
used to invert an n X n matrix within only n(n + 1) storage loca-
tions. ~ n complete pivoting be used?
8. Use either the programs of Problem 3(a) or the Gauss-Jordan method to
invert the matrices in Problems 1 and 3(b).
9. Prove that if A = LLT with L real and nonsingular, then A is symmetric
and positive definite.
10. Using the Choleski method, calculate the decomposition A = LLr for
[
2.25
(a) -3.0
4.5
-3.0
5.0
-10.0
4.5]
-10.0
34.0
(b)
[
15
-18
15
-3
-18
24
-18
4
15
-18
18
-3
11. Let A be nonsingular. Let. A = LU = LDM, with all l;;, mu = 1, and D
diagonal. Further assume A is symmetric. Show that M = Lr, and thus
A = LDe. Show A is positive definite if and only if all d;; > 0.
12. Let A be real, symmetric, positive definite, and of order n. Consider solving
Ax = b using Gaussian elimination without pivoting. The purpose of this
problem is to justify that the pivots will be nonzero.
(a) Show that all of the diagonal elements satisfy a;; > 0. This shows that
a
11
can be used as a pivot element.
(b) After elimination of x
1
from equations 2 through n, let the resulting
matrix A<
2
> be written as
1<2).
Show that 1<
2
>.is symmetric and positive definite.
PROBLEMS 577
This procedure can be continued inductively to each stage of the
elimination process, thus justifying existence of nonzero pivots at every
step. Hint: To prove A<
2
l is positive definite, first prove the identity
for any choice of x
1
, x
2
, , xn. Then choose x
1
suitably.
13. As. another approach to developing a compact method for producing the
LV factorization of A, consider the following matrix-oriented approach.
Write
A= [A d]
CT a
C, d E Rn -
1
a E R
and A square of order n - 1. Assume A is nonsingular. As a step in an
induction process, assume A= LO is known, .with A nonsingular. Look
for A = LU in the form
A=[L o][o q]
mT 1 0 Y
m, q E Rn-
1
y E R
Show that m, q, and y can be found, and describe how to do so. (This
method is applied to an original A, factoring each principal submatrix in
the upper left corner, in increasing order.)
14. Using the algorithm (8.3.23)-(8.3.24) for solving tridiagonal systems, solve
Ax= b with
. 0
0
-1
2
1
0
0
0
-1
2
1
0
0
0
-1
2
1
-b =
-1 -2
2 1
Check that the hypotheses and conclusions of Theorem 8.2 are satisfied by
this example.
15. Define the order n tridiagonal matrix
2 -1 0 0
-1 2 -1 0
A =
0 -1 2 -1
n
0 -1 2
578 NUMERICAL SOLUTION OF SYSTEMS OF LINEAR EQUATIONS
Find a general formula for An = LU. Hint: Consider the cases n = 3, 4, 5,
and then guess the general pattern and verify it.
16. Write a subroutine to solve tridiagonal systems using (8.3.23)-(8.3.24).
Check it using the examples in Problems 14 and 15. There are also a
number of tridiagonal systems in Gregory and Karney (1969, chap. 2) for
which the true inverses are known.
17. There are families of linear systems Akx = b in which Ak changes in some
simple way into a matrix Ak+
1
, and it may then be simpler to find the LU
factorization of Ak+
1
by modifying that of Ak. As an example that arises in
the simplex method for linear programming, let A
1
= [a
1
, .. , an] and
A
2
= [a
2
, . , an+d, with all aj ERn. Suppose A
1
= L
1
U
1
is known, with
L
1
lower triangular and U
1
upper triangular. Find a simple way to obtain
the LU factorization A
2
= LP
2
from that for A
1
, assumiJ;lg pivoting is not
needed. Hint: Using L
1
u; =a;, 1 s is n, write
Show that 0 can be simply modified into an upper triangular form U
2
, and
that this corresponds to the conversion of L
1
into the desired L
2
More
precisely, U
2
= MU, L
2
= L
1
M-
1
Show that the operation cost for obtain-
ing A
2
= Lp
2
is O(n
2
).
18. (a) Calculate the condition numbers cond (A) P' p = 1, 2, oo, for
A = [100 99]
99 98
(b) Find the eigenvalues and eigenvectors of A, and use them to illustrate
the remarks following (8.4.8) in Section 8.4.
19. Prove that if A is unitary, then cond(A)* = 1.
20. Show that for every A, the upper bound in (8.4.4) can be attained for
suitable choices of b and r. Hint: From the definitions of IIAII and IIA -
1
11
in Section 7.3, there are vectors .X and f for which IIA.XII = IIAIIII.XII and
IIA -Ipll = IIA -
1
1111fll. Use this to complete the construction of equality in
the upper bound of (8.4.4).
21. The condition number cond(A)* of (8.4.6) can be quite small for matrices
A that are ill-conditioned. To see this, define the n X n matrix
1 -1 -1 -1
0 1 -1 -1
AR =
1 -1
0 0 1
PROBLEMS 579
Easily cond(A). = 1. Verify that A,;-
1
is given by the upper triangular
matrix B = [b;j], with b;; = 1,
b .. = 2j-i-1
,, i<j:Sn
Compute cond(A)
00
22. As in Section 8.4, let Hn = [lj(i + j- 1)] denote the Hilbert matrix of
order n, and let lin denote the matrix obtained when Hn is entered into
your computer in single precision arithmetic. To compare H,;-
1
and ii;\
convert lin to a double precision matrix by appending additional zeros to
the mantissa of each entry. Then use a double precision matrix inversion
computer program to calculate H,-
1
numerically. This will give an accurate
value of ii,;-
1
to single precision accuracy, for lower values of n. After
obtaining Hn-1, compare it with H;\ given in (8.4.13) or Gregory and
Karney (1969, pp. 34-37).
23. Using the programs of Problem 3 or the UNPACK programs SGECO and
SGESL, solve Jinx = b for several values of n. Use b = [1, -1, 1, -1, ... f,
and calculate the true answer by using H;
1
from Problem 22. Comment on
your results.
24. Using the residual correction method, described at the beginning of Section
8.5, calculate accurate single precision answers to the linear systems of
Problem- 23. Print the residuals and corrections, and examine the rate of
decrease in the correction terms as the order n is increased. Attempt to
explain your results.
25. Consider the linear system of Problem 4, for solving approximately an
integral equation. Occasionally we want to solve such a system for several
values of A that are close together. Write a program to first solve the system
for A
0
= 4.0, and then save the LU decomposition of A
0
J- K. To solve
(AI- K)x = b with other values of A nearby A
0
, use the residual correc-
tion method (8.5.3) with C = [LU]-
1
For example, solve the system when
A = 4.1, 4.5, 5, and 10. In each case, print the iterates and calculate the
ratio in (8.5.11). Comment on the behavior of the iterates as A increases.
26. The system Ax= b,
4 -1 0 -1 0 0 2
~
4 -1 0 -1 0 1
A=
0 -1 4 0 0 -1
b=
2
-1 0 0 4 -1 0 2
0 -1 0 -1 4 -1 1
0 0 -1 0 -1 4 2
has the solution X = [1, 1, 1, 1, 1, If. Solve the system using the
Gauss-Jacobi iteration method, and then solve it again using the
580 NUMERICAL SOLUTION OF SYSTEMS OF LINEAR EQUATIONS
Gauss-Seidel method. Use the initial guess x<
0
l = 0. Note the rate at which
the iteration error decreases. Find the answers with an accuracy of = .0001.
27. Let A and B have order n, with A nonsingular. Consider solving the linear
system
(a) Find necessary and sufficient conditions for convergence of the itera-
tion method
(b) Repeat part (a) for the iteration method
m 0
Compare the convergence rates of the two methods.
28. For the error equation (8.6.25), show that r"(M) < 1 if
1
for some matrix norm.
29. For the iteration of a block tridiagonal systems, given in (8.6.30), show
convergence under the assumptions that
I I I
IIC"I < IIAAI + IIC;II < IIB;-111' 2 :s i :s r - I; < IIB,-111
Bound the rate of convergence.
30. Recall the matrix An of Problem 15, and consider the linear system
Anx = b. This system is important as it arises in the standard finite
difference approximation (6.11.30) to the two-point boundary value prob-
lem
y"(x) = f(x, y(x)) a<x</3 y(a) = a
0
y(/3) =at
It is also important because it arises in the analysis of iterative methods for
solving discretiuitionsof Poissons equation, as in (8.8.5). In line with this,
consider using Jacobi's method to solve Anx = b iteratively. Show that
Jacobi's method converges by showing r"(M) < 1 for the appropriate
matrix M. Hint: Use the results of Problem 6 of Chapter 7.
PROBLEMS 581
31. As an example that convergent 1teration methods can behave in unusual
ways, consider
with
Assuming jAj < 1, we have(/- A)-
1
exists and x<k>-+ x* = (/- A)-
1
b
(or all initial guesses x<>. Find explicit formulas for Ak, x* - x<k>, and
x<k+l)- x<k>. By suitably adjusting c relative to A, show that it is possible
for llx* - x<k>lloo to alternately increase and decrease as it converges to
zero. Look at the corresponding values for jjx<k+l)- x<k>lloo For simplic-
ity, use x<
0
> = 0 in all calculations.
32. (a) Let C
0
be an approximate inverse ,to A. Define R
0
= I- AC
0
, and
assume IIRoll < 1 for some matrix norm. Define the iteration method
m
This is a iteration method for calculating the inverse A -l.
Show the convergence of Cm to A:.._
1
by first relating the error
A -l - Cm to the residual Rm. And then examine the behavior of the
. . 2
residual Rm by showing that Rm+l = Rm, m 0.
(b) Relate Cm to the expansion
00
A-
1
= C
0
{l- R
0
)-
1
= C
0
L Rb
j-0
Observe the relation of this method for inverting A to the iteration
method (2.0.6) of Chapter 2 for calculating 1/a for nonzero numbers
a. Also, see Problem l.of Chapter 2.
33. Implement programs for iteratively solving the discretization (8.8.5) of
Poisson's equation on the unit square. To have a situation for which you
have a true solution of the linear system, choose Poisson equations in which
there is no discretization error in going to (8.8.5). This will be true if the
truncation errors in (8.8.3) are identically zero, as, for example, with
u(x, y) = x
2
y
2
(a) Solve (8.8.5) with Jacobi's method. Observe the actual error
llx- x<>u.,., in each iterate, as well as ux<+l)- .x<>uoo Estimate the
constant c of (8.7.2), and compute the estimated error bound of
(8.7.5). Compare with the true iteration error.
582 NUMERICAL SOLUTION OF SYSTEMS OF LINEAR EQUATIONS
(b) Repeat with the Gauss-Seidel method. Also compare the iteration
rate c with that predicted by (8.8.13).
(c) Implement the SOR method, using the optimal acceleration parameter
w* from (8.8.15).
34. (a) Generalize the discretization of the Poisson equation in (8.8.1) to
the equation
0 <X, y < 1
with u = f(x, y) on the boundary as before.
(b) Assume c(x, y) 2:: 0 for 0 .::;; x, y.::;; 1. Generalize part (1) of the
proof of Theorem 8.8 to show that the linear system of part (a) will
have a unique solution.
35. Implement the conjugate gradient algorithm CG of Section 8.9. Test it with
the systems of Problems 1, 3, and 4. Whenever possible, for testing
purposes, use systems with a known true solution. Using it, compute the
true errors in each iterate and see how rapidly they decrease. For the linear
system in Problem 4, that is based on solving an integral equation, solve the
system for several values of n. Comment on the results.
36. Recall the vector norm l l x l l o ~ of (8.9.7), with A symmetric and positive
definite. Let the eigenvalues of A be denoted by
Show that
with both equalities attainable for suitable choices of x. Hint: Use an
orthonormal basis of eigenvectors of A.
37. Prove Lemma 1, following (8.9.11). Hint: Use mathematical induction on
k. Prove it for k = 1. Then assume it is true for k .::;; /, and prove it for
k =I+ 1. Break the proof into two parts: (1) p[r
1
+
1
= 0 fori.::;; I, and (2)
PT+l,/+1 = 0.
38. Let A be symmetric, pos1Uve definite, and order n X n. Let U =
{ u
1
, ... , un} be a set of nonzero vectors in Rn. Then if U is both an
orthogonal set and an A-orthogonal set, then Au;= ]\;U;, i = 1, ... , n for
suitable A;> 0. Conversely, one can always choose a set of eigenvectors
{ u
1
, . : . , u n} of A to have them be both orthogonal and A -orthogonal.
PROBLEMS 583
39. Let A be symmetric, positive definite, and of order n. Let { v
1
, ... , vn} be
an A-orthogonal set in Rn, with all V; i= 0. Define
j = 1, ... , n
Showing the following properties for Q P
1. Q
1
v; = 0 if i i= j; and Q
1
v
1
= v
1
.
2. Q] = Qj.
3. (x, Q
1
y)A = (Q
1
x, y)A, for all x, y ERn.
4. (Q
1
x,(I- Q)y)A = 0, for all x, y ERn.
5. =
m 3 (9.2.25)
A similar derivation can be applied to the eigenvector approximants, to accel-
erate each component of the sequence {z(m)}, although some care must be used.
Exampk Consider again the example (9.2.12). In Table 9.1,
A2
R
5
= -.06474 = = -.06479
1
As an example of (9.2.25), extrapolate with the values A?\ and A.\
4
) from that
table. Then
= 9.6234814 A.
1
-
= -6.03 X 10-
6
In comparison using the more accurate table value A.?). the error is A
1
-A.?)=
-3.29 X 10-
5
This again shows the value of using extrapolation whenever the
theory justifies its use.
Case 3. The Rayleigh-Ritz Quotient. Whenever A is symmetric, it is better to
use the following eigenvalue approximations:
(Az(m), z(m)) (w<m+1), z(m))
A_(m+ 1) = = _..:._ ___ _:_
1 (z<m), z(m)) (z<m), z(m))
We are using standard inner product notation:
n
(w, z) = .[w;z;
1
w, z E R"
(9.2.26)
To analyie this sequence (9.2.26), note that all eigenvalues of A are real and
that the eigenvectors x
1
, , xn can be chosen to be orthonormal. Then (9.2.2),
(9.2.4), (9.2.5), together with (9.2.26), imply
n
L
A_<;"+ 1) = -=-j--.:..!----
.E
j=l
(9.2.27)
... i"
ORTHOGONAL TRANSFORMATIONS USING HOUSEHOLDER MATRICES 609
The ratio of convergence of A.\m> to A.
1
is (A.:z!A.
1
)
2
, an improvement on the
original ratio in (9.47) of A.:z!A.
1
This is a well-known classical procedure, and it has many additional aspects
that are of use in some problems. For additional discussion, see Wilkinson (1965,
pp. 172-178).
Example In the example (9.2.12), use the approximate eigenvector z<
2
> in
(9.2.26). Then
which is as accurate as the value A.?> obtained earlier.
The power method can be used when there is not a single dominant eigen-
value, but the algorithm is more complicated. The power method can also be used
to determine eigenvalues other than the dominant one. This involves a process
called deflation of A to remove A. as an eigenvalue. For a complete discussion of
all aspects of the power method, see Golub and Van Loan (1983, 208-218) and
Wilkinson (1965, chap. 9). Although it is a useful method in some circumstances,
it should be stressed that the methods of the following sections are usually more
efficient. For a rapidly convergent variation on the power method and the
Rayleigh-Ritz quotient, see the Rayleigh quotient iteration for symmetric matrices
in Parlett (1980, p. 70).
9.3 Orthogonal Transformations Using
Householder Matrices
As one step in finding the eigenvalues of a matrix, it is often reduced to a simpler
form using similarity transformations. Orthogonal matrices will be the class of
matrices we use for these transformations. It was shown in (9.1.39) that orthogo-
nal transformations will not worsen the condition or stability of the eigenvalues
of a nonsymmetric matrix. Also, orthogonal matrices have other desirable error
propagation properties, an example of which is given later in the section. For
these reasons, we restrict our transformations to those using orthogonal matrices.
We begin the section by looking at a special class of orthogonal matrices
known as Householder matrices. Then we show how to construct a Householder
matrix that will transform a given vector to a simpler form. With this construc-
tion as a tool, we look at two transformations of a given matrix A: (1) obtain its
QR factorization, and (2) construct a similar tridiagonal matrix when A is a
symmetric matrix. These forms are used in the next two sections in the calcula-
tion of the eigenvalues of A. As a matter of notation, note that we should be
restricting the use of the term orthogonal to real matrices. But it has become
common usage in this area ~ use orthogonal rather than unitary for the general
complex case, and we will adopt the same convention. The reader should
understand unitary when ort7gonal is used for a complex matrix.
Let w E en with Uwlb = w*w = 1. Define
U= 1- 2ww*
This is the general form of a Householder matrix.
(9.3.1)
I
- - - ~ - - - - - - - - - - - - - - - - - - - -
I
I
610 THE MATRIX EIGENVALUE PROBLEM
Example For n = 3, we require
The matrix U is given by
[
1- 21wtl
2
U = -2M.\w
2
-2w
1
w
3
For the particular case
we have
-2w
1
w
2
1 - 21w
2
1
2
-2w
2
w
3
U= ~ [ - ~
9 -4
-4
1
-8
-4]
-8
1
We first prove U is Hermitian and orthogonal. To show it is Hermitian,
U* = (1- 2ww*)* =I* - 2( ww*)*
= I - 2( w*)*w* = I - 2 ww* = U
To show it is orthogonal,
U*U = U
2
= {1- 2ww*)
2
= I- 4ww* + 4{ ww*)( ww*)
=I
since using the associative law and w*w = 1 implies
{ ww*)( ww*) = w{ w*w) w* = ww*
The matrix U of the preceding example illustrates these properties. In Problem
12, we give a geometric meaning to the linear function T(x) = Ux for U a
Householder matrix.
We will usually use vectors w with leading zero components:
w = [0, ... , o. w,, ... , w,F = [0,_
1
, wry (9.3.2)
with WE cn-r+l. Then
U= [I,
0
_t 0 ]
In-r+l - 2ww*
(9.3.3)
ORTHOGONAL TRANSFORMATIONS USING HOUSEHOLDER MATRICES 611
Premultiplication of a matrix A by this U will leave the Rrst r - 1 rows of A
unchanged, and postniultiplication of A will leave its first r - 1 columns un-
changed. For the remainder of this section we assume all matrices and vectors are
real, in order to avoid having to deal with possible complex values for w.
The Householder matrices are used to transform a nonzero vector into a new
vector containing mainly zeros. Let b * 0 be given, b E Rn, and suppose we
want to produce U of form (9.3.1) such that Ub contains zeros in positions r + 1
through n, for some given r L Choose w as in (9.3.2). Then the first r - 1
elements of b and Ub are the same.
To simplify the later work, write m = n- r + 1,
with c E Rr-l, v, dE Rm. Then our restriction on the form of Ub requires the
first r - 1 components of Ub to be c, and
(I- 2vvT)d = [ a,O, ... , Of (9.3.4)
for some a. Since I- 2vvT is orthogonal, the length of d is preserved (Problem
13 of Chapter 7); and thus
a= +S = + Vd
2
+ +d
2
- - 1 m
(9.3.5)
Define
From (9.3.4),
d- 2pv= [a,O, ... ,O]T (9.3.6)
Multiplication by vT and use of llvll
2
= 1 implies
(9.3.7)
Substituting this into the first component of (9.3.6) gives
d
1
+ 2avi =a
(9.3.8)
Choose the sign of a in (9.3.5) by
sign (a) = - sign ( d
1
) (9.3.9)
This choice maximizes vi, and it avoids any possible loss of significance errors in
612 THE MATRIX EIGENVALUE PROBLEM
the calculation of v
1
The sign for v
1
is irrelevant. Having v
1
, obtain p from
(9.3.7). Return to (9.3.6), and then using components 2 through m,
dj
v.=-
J 2p
j=2,3, ... ,m (9.3.10)
The statements (9.3.5), (9.3.7)-(9.3.9) completely define v, and thus w and U.
The operation count is 2m + 2 multiplications and divisions, and two square
roots. The square root defining v
1
can be avoided in practice, because it will
disappear when the matrix ww r is formed. A sequence of such transformations
of vectors b will be used to systematically reduce matrices to simpler forms.
Example Consider the given vector
b = (2,2, If
We calculate a matrix U for which Ub will have zeros in its last two positions. To
help in following the construction, some of the intermediate calculations are
listed. Note that w = v and b = d for this case. Then
a= -3
p ~
2
u = --
2130
The matrix U is given by
2 2 1
--
3 3 3
2 11 2
U= -- -
3 15 15
1 2 14
-
3 15 15
and
ub = ( -3,o,of
The QR factorization of a matrix Given a real matrix A, we show there is an
orthogonal matrix Q and an upper triangular matrix R for which
A =QR (9.3.11)
Let
P =I- 2w<r>w<r)T
r
r = 1, ... , n -1 (9.3.12)
ORTHOGONAL TRANSFORMATIONS USING HOUSEHOLDER MATRICES 613
with w<rl as in (9.3.2) with r- I leading zeros. Writing A in terms of its columns
A.
1
, ... , A*n' we have
Pick P
1
and w<
1
> using the preceding construction (9.3.5)-(9.3.10) with b = A.
1
.
Then P
1
A contains zeros below the diagonal in its first column.
Choose P
2
similarly, so that P
2
P
1
A will contain zeros in its second column
below the diagonal. First note that because w<
2
> contains a zero in position 1, and
because P
1
A is zero in the first column below position 1, the products P
2
P
1
A and
P
1
A contain the same elements in row one and column one. Now choose P
2
and w<
2
l as before in (9.3.5)-(9.3.10), with b equal to the second column of P
1
A.
By carrying this out with each column of A, we obtain an upper triangular
matrix
(9.3.13)
If at step r of the construction, all elements below the diagonal of column r are
zero, then just choose P, = I and go onto the next step. To complete the
construction, define
which is orthogonal. Then A = QR, as desired.
Example Consider
Then
1
4
1
!]
w<
1
> = [.985599, .119573, .119573]T
-2.12132
3.62132
.621321
w<
2
l = [0, .996393, .0848572r
-2.12132
-3.67423
0
-2.12132 ]
.621321
3.62132
-2.12132]
-1.22475
3.46410
(9.3.14)
For the factorization A = QR, evaluate Q = P.1P
2
But in most applications, it
would be inefficient to explicitly produce Q. We comment further on this shortly.
Since Q orthogonal implies det(Q) = 1, we have
ldet(A)I =ldet(Q)det(R)I =ldet(R)I = 53.9999
614 THE MATRIX EIGENVALUE PROBLEM
This number is consistent with the fact that the eigenvalues of A are A. = 3, 3, 6,
and their product is det (A) = 54.
Discussion of the QR factorization It is useful to know to what extent the
factorization A = QR is unique. For A nonsingular, suppose
(9.3.15)
Then R
1
and R
2
must also be nonsingular, and
The inverse of an upper triangular matrix is upper triangular, and the product of
two upper triangular matrices is upper triangular. Thus R
2
R1
1
is upper triangu-
lar. Also, the product of two orthogonal matrices is orthogonal; thus, the product
QfQ
1
is orthogonal. This says R
2
R1
1
is orthogonal. But it is not hard to show
that the only upper triangular orthogonal matrices are the diagonal matrices. For
some diagonal matrix D,
Since R
2
R1
1
is orthogonal,
Since we are only dealing with real matrices, D has diagonal elements equal to
+ 1 or - 1. Combining these results,
{9.3.16)
This says the signs of the diagonal elements of R in A = QR can be chosen
arbitrarily, but then the rest of the decomposition is uniquely determined.
Another practical matter is deciding how to evaluate the matrix R of (9.3.13).
Let
A = PA = (J- 2w<'lw<rlT)A
r r r-1 r-1
r = 1,2, ... , n- 1 (9.3.17)
with A
0
= A, An_
1
= R. If we calculate P, and then multiply it times A,_
1
to
form A,, the number of multiplications will be
3 1
(n- r + 1) + 2(n- r + 2)(n- r + 1)
There is a much more efficient method for calculating A,. Rewrite (9.3.17) as
A =A - 2w(r) [w<r)TA ]
r r-1 r-1
(9.3.18)
First calculate w<r)TA,_
1
, and then calculate w<'l[w<r)TA,_d and A,. This re-
quires about
2(n- r)(n- r + 1) + (n- r + 1) (9.3.19)
-----------
I.
I
ORTHOGONAL TRANSFORMATIONS USING HOUSEHOLDER MATRICES 615
multiplications, which shows (9.3.18) is a preferable way to evaluate each Ar and
finally R =An-I This does not include the cost of obtaining w<r>, which was
discussed earlier, following (9.3.10).
If it is necessary to store the matrices PI, ... , Pn-I for later use, just store each
column w<r>, r = 1, ... , n - 1. Save the first nonzero element of w<r>, the one in
position r, in a special storage location, and save the remaining nonzero elements
of w<r>, those in positions r + 1 through n, in column r of the matrix Ar and R,
below the diagonal. The matrix Q of (9.3.14) could be produced explicitly. But as
the construction (9.3.18) shows, we do not need Q explicitly in order to multiply
it times some other matrix.
The main use of the QR factorization of A will be in defining the QR method
for calculating the eigenvalues of A, which is presented in Section 9.5. The
factorization can also be used to solve a linear system of equations Ax = b. The
factorization leads directly to the equivalent system Rx = QTb, and very little
error is introduced because Q is orthogonal. The system Rx = QTb is upper
triangular, and it can be solved in a stable manner using back substitution. For A
an ill-conditioned matrix, this may be a superior way to solve the linear system
Ax = b. For a discussion of the errors involved in obtaining and using the QR
factorization and for a comparison of it and Gaussian elimination for solving
Ax= b, see Wilkinson (1965, pp. 236, 244-249). We pursue this topic further in
Section 9.7, when we discuss the least squares solution of overdetermined linear
systems.
The transformation of a symmetric matrix to tridiagonal form Let A be a real
symmetric matrix. To find the eigenvalues of A, it is usually first reduced to
tridiagonal form by orthogonal similarity transformations. The eigenvalues of the
tridiagonal matrix are then calculated using the theory of Sturm sequences,
presented in Section 9.4, or the QR method, presented in Section 9.5. For the
orthogonal matrices, we use the Householder matrices of (9.3.3).
Let
r = 1, ... , n- 2 (9.3.20)
with w<r+ I) defined as in (9.3.2):
(r+ I) _ [0 0 ] T
w - , ... , 'wr+l, ... , wn
[Note the change in notation from that of the Prof (9.3.12) used in defining the
QR factorization.] The matrix
is similar to A, the element a
11
is unchanged, and A
2
will be symmetric. Produce
w<
2
> and PI to obtain the form.
for some a
2
I. The vector A*I is the first column of A. Use (9.3.5)-(9.3.10) with
--------------
~
!
616 THE MATRIX EIGENVALUE PROBLEM
m = n- 1 and
d = {a 21' a 31' a n1] T
For example, from (9.3.8),
Having obtained P
1
and P
1
A, posimultiplication by P
1
will not change the
first column of P
1
A. (Tills should be checked by the reader.) The symmetry of A
2
follows from
Since A
2
is symmetric, the construction on the first column of A will imply that
A
2
has zeros in positions 3 through n of both the first row and column.
Continue this process, letting
r = 1, 2, ... , n - 2 (9.3.21)
with A
1
= A. Pick P, to introduce zeros into positions r + 2 through n of
column r. Columns 1 through r- 1 will remain undisturbed in calculating
P,A,_
1
, due to the special form of P,. Pick the vector w<'+
1
) in analogy with the
preceding description for w<
2
).
The final matrix T =An-t is tridiagonal and symmetric.
a1 /31
0 0
/31
a2 /32
0
/32 a3
(9.3.22) T=
an-1 /3n-1
0
Pn-1
an
This will be a much more convenient form for the calculation of the eigenvalues
of A, and the eigenvectors of A can easily be obtained from those of T.
T is related to A by
Q = P1 pn-2
(9.3.23)
As before with the QR factorization, we seldom produce Q .. explicitly, preferring
to work with the individual matrices P, in analogy with (9.3.18). For an
eigenvector x of A, say Ax = . ~ x . we have
Tz = z x = Qz (9.3.24)
i
---;
ORTHOGONAL TRANSFORMATIONS USING HOUSEHOLDER MATRICES 617
If we produce an orthonormal set of eigenvectors { z;} for T, then { Qz;} will be
an orthonormal set of eigenvectors for A, since Q preserves length and angles
(see Problem 13 of Chapter 7).
Example Let
3
1
2
Then
w<
2
>= 0-- r 2 1 r
'/5'15
1 0 0
3 4
0
--
pl = 5 5
4 3
0 --
5 5
1 -5 0
73 14
T=P[AP
1
=
-5
25 25
14 23
0 --
25 25
For an error analysis of this reduction to tridiagonal form, we give some
results from Wilkinson (1965). Let the computer arithmetic be binary floating-
point with rounding, with t binary digits in the mantissa. Furthermore, assume
that all inner products
that occur in the calculation are accumulated in double precision, with rounding
to single orecision at the completion of the summation. These inner products
occur in ; variety of places in the computation of T from A. Let f denote the
actual symmetric tridiagonal matrix that is computed from A using the preceding
computer arithmetic. Let P, denote the actual matrix produced in
A,_
1
to Ar, let Pr be the theoreticiilly exact version of this matrix if no rounding
errors occurred, and let Q = P
1
P,_
1
be the exact product of these P, an
orthogonal matrix.
Theorem 9.4 Let A be a real symmetric matrix of order n. Let f be the real
symmetric tridiagonal matrix resulting from applying the House-
holder similarity transformations (9.3.20) to A, as in (9.3.21).
618 THE MATRIX EIGENVALUE PROBLEM
Assume the floating-point arithmetic used has the characteristics
described in the preceging paragraph. Let {A;} and { r;} be the
eigenvalues of A and T, respectively, arranged in increasing order.
Then
{9.3.25)
with
For small and moderate values of n, en = 25(n - 1).
Proof From Wilkinson (1965, p. 161) using the Frobenius matrix norm F,
F(f- Q'IAQ) .:5: 2x(n- 1)(1 + x)
2
n-
4
F(A) (9.3.26)
with x = (12.36)2-'. From the Wielandt-Hoffman result (9.1.19) of
Theorem 9.3, we have
(9.3.27)
since A and QTAQ have the same eigenvalues. And from Problem 28(b)
of Chapter 7,
[
n ]1/2
F(A) = l i \ ~
Combining these results yields (9.3.25).
For a further discussion of the error, including the case in which inner
products are not accumulated in double precision, see Wilkinson (1965, pp.
297 -299). The result (9.3.25) shows that the reduction to tridiagonal form is an
extremely stable operation, with little new error introduced for the eigenvalues.
Planar rotation orthogonal matrices There are other classes of orthogonal
matrices that can be used in place of the Householder matrices. The principal
class is the set of plane rotations, which can be given the geometric interpretation
of rotating a pair of coordinate axes through a given angle B in the plane of the
axes. For integers k, /, 1 .:5: k <I .:5: n, define then X n orthogonal matrix R<k.l>
by altering four elements of the identity matrix ln. For any real number 0, define
_j
THE EIGENVALUES OF A SYMMETRIC TRIDIAGONAL MATRIX 619
the elements of R<k,tl by
{
cos 0
R(k!tl= sinO
'
1
-sinO
(Ill)ij
for 1 .:5: i, j .:5: n.
Example For n = 3,
(i, j) = (k, k) or {1, I)
(i, j) = (k, I)
(i,j) = (l,k)
all other ( i, j)
R(l,J) = [ c;s 0 ~
-sinO 0
s i ~ ]
cos 0
As a particular case, take 0 = '17'/4. Then
1 1
{i
0
{i
R<I.Jl =
0 1
1 1
-{i
0
{i
(9.3.28)
The plane rotations R(k,t) can be used to accomplish the same reductions for
which the Householder matrices are used. In most situations, the Householder
matrices are more efficient, but the plane rotations are more efficient for part of
the QR method of Section 9.5. The idea of. solving the symmetric matrix
eigenvalue problem by first reducing it to tridiagonal form is due to W. Givens in
1954. He also proposed the use of the techniques of the next section for the
calculation of the eigenvalues of the tridiagonal matrix. Givens used the plane
rotations R<k.ll, and the Householder matrices were introduced in 1958 by A.
Householder. For additional discussion of rotation matrices and their properties
and uses, see Golub and Van Loan (1983, sec. 3.4), Parlett (1980, sec. 6.4),
Wilkinson (1965, p. 131), and Problems 15 and 17(b).
9.4 The Eigenvalues of a Symmetric. Tridiagonal Matrix
Let T be a real symmetric tridiagonal matrix of order n, as in (9.3.22). We
compute the characteristic polynomial of T and use it to calculate the eigenvalues
ofT.
To compute
fn (A) = det ( T - A I) (9.4.1)
I
I
i
-------------------------------1
I
I
i
i
I
I
I
I
620 THE MATRIX EIGENVALUE PROBLEM
introduce the sequence
a
1
- A /3
1
0 0
/31 a2- A /32
fk(A) = det 0 (9.4.2)
0
for 1 s k s n, and f
0
(A) = 1. By direct evaluation,
/
1
(A) = a
1
- A
/
2
(A) = ( a
2
- A) ( a
1
- A) - f3l
= (a
2
- A)/
1
(A)- /3l/
0
(A)
The formula for f
2
(A) illustrates the general triple recursion relation that the
sequence { fk( A)} satisfies:
2sksn (9.4.3)
To prove this, expand the determinant (9.4.2) in its last row using minors and the
result will follow easily. This method for evaluating fn(A) will require 2n - 3
multiplications, once the coefficients { f3 f} have been evaluated.
Example Let
2 1 0 0 0 0
1 2 1 0 0 0
T=
0 1 2 1 0 0
(9 .4.4)
0 0 1 2 1 0
0 0 0 1 2 1
0 0 0 0 1 2
Then
/
0
(A) = 1 /
1
(A) = 2- A
(2-
j=;2,3,4,5,6 (9.4.5)
Without the triple recursion relation (9.4.5), the evaluation of /
6
(A) would be
much more complicated.
At this point, we might consider the problem as solved since fn(A.) is a
polynomial and there are many polynomial rootfinding methods. Or we might
use a more general method, such as the secant method or Brent's method, both
described in Chapter 2. But the sequence {!k(A)jO s k s n} has special proper-
ties that make it a Sturm sequence, and these properties make it comparatively
THE EIGENVALUES OF A SYMMETRIC TRIDIAGONAL MATRIX 621
easy to isolate the eigenvalues of T. Once the eigenvalues have been isolated, a
method such as Brent's method [see Section 2.8] can be used to rapidly calculate
the roots. The theory of Sturm sequences is discussed in Henrici (1974, p. 444),
but we only consider the special case of {/k(A.)}.
Before stating the consequences of the Sturm theory for {/k(A)}, we consider
what happens when some P
1
= 0. Then the eigenvalue problem can be broken
apart into two smaller eigenvalue problems of orders I and n - I. As an example,
consider
a I PI
0 0 0
PI
a2
0 0 0
T= 0 0 aJ /33
0
0 0
PJ a4 /34
0 0 0
/34 s
Define TI and T
2
as the two blocks along the diagonal, of orders 2 and 3,
respectively, and then
From this,
and we can find the eigenvalues of T by finding those of T
1
and T
2
The
eigenvector problem also can be solved in the same way. For example, if
T
1
x = Ax, with x =F 0 in R
2
, define
x = [.xr,o,o,or
Then Tx = Ax. This construction can be used to calculate a complete set of
eigenvectors for T from those for TI and T
2
For the remainder of the section, we
assume that all /3
1
=F 0 in the matrix T. Under this assumption, all eigenvalues of
Twill be simple roots of /,(A.).
The Sturm sequence property of { fk(} .. )} The sequences {/k(a)} and {/k(b)}
be used to determine the number of roots of /,(A) that a,re contained in
[a, b J. To do this, introduce the following integer-valued function s( A.). Define
s( A) to be the number of agreements in sign of consecutive members of the
sequence {/k(A)}, and if the value of some member Jj(A.) = 0, let its sign be
chosen opposite to that of Jj_I(A.). It can be shown that Jj(A.) = 0 implies
fj_
1
(A) =F 0.
EX11mp/e Consider the sequence /
0
(A.), ... ,f
6
(A) given in (9.4.5) of the last
example. For A = 3,
Uo(A), ... ,f6(A)) = (1,-1,0,1,-1,0,1)
i
------------;
I
I
I
!
I
I
!
622 THE MATRIX EIGENVALUE PROBLEM
The corresponding sequence of signs is
(+,-,+,+,-,+,+)
and s(3) = 2.
We now state the basic result used in computing the roots of fn('A) and thus
the eigenvalues of T. The proof follows from the general theory given in Henrici
(1974).
Theorem 9.5 Let T be a real symmetric tridiagonal matrix of order n, as given
in (9.3.22). Let the sequence {/k(A)IO .:5: k .:5: n} be defined as in
(9.4.2), and assume all {3
1
of= 0, I = 1, ... , n - 1. Then the number
of roots of fn(A) that are greater than A= a is given by s(a),
which is defined in the preceding paragraph. For a < b, the
number of roots in the interval a < A .:5: b is given by s( a) - s( b).
Calculation of the eigenvalues Theorem 9.5 will be the basic tool in locating
and separating the roots of fn(A). To begin, calculate an interval that contains
the roots. Using the Gerschgorin circle Theorem 9.1, all eigenvalues are contained
in the interval [a, b], with
b = Max {a;+ 1/3;1 + 1/3;_;1}
l!>i!>n
where /3
0
= /3n = 0.
We use the bisection method on [a, b] to divide it into smaller subintervals.
Theorem 9.5 is used to determine how many roots are contained in a subinterval,
and we seek to obtain subintervals that will each contain one root. If some
eigenvalues are nearly equal, then we continue subdividing until the root is found
with sufficient accuracy. Once a subinterval is known to contain a single root, we
can switch to a more rapidly convergent method. .
Example Consider further the example (9.4.4). By the Gerschgorin Theorem
9.1, all eigenvalues lie in [0, 4]. And it is easily checked that neither A = 0 nor
A = 4 is an eigenvalue. A systematic bisection process was carried out on [0, 4] to
separate the six roots of f
6
(A) into six separate subintervals. The results are
shown in Table 9.2 in the order they were calculated. The roots are labeled as
follows:
The roots can be found by continuing with the bisection method, although
Theorem 9.5 is no longer needed. But it would be better to use some other
rootfinding method.
Although all roots of a tridiagonal matrix may be found by this technique, it is
generally faster in that case to use the QR algorithm of the next section. With
I
I
THE EIGENVALUES OF A SYMMETRIC TRIDIAGONAL MATRIX 623
Table 9.2 Example of use of Theorem 9.5
A. /6(A.)
s(X) Comment
0.0 7.0 6 ;\6 > 0
4.0 7.0 0 x, < 4
2.0 -1.0 3 A.4 < 2 < ;\3
1.0 1.0 4 As< 1 < A.
4
< 2
.5
-1.421875 5 0 < A.
6
< 0.5 < A.s < 1
3.0 1.0 2 2 < A.
3
< 3 < A.
2
3.5
-1.421875 1 3 < A.
2
< 3.5 < A.
1
< 4
large matrices, we usually do not want all of the roots, in which case the methods
of this section are preferable. If we want only certain specific roots, for example,
the five largest or all roots in a given interval or all roots in [1, 3], then it is easy to
locate them using Theorem 9.5.
9.5 The QR Method
At the present time, the QR method is the most efficient and widely used general
method for the calculation of all of the eigenvalues of a matrix. The method was
first published in 1961 by J. G. F. Francis and it has since been the subject of
intense investigation. The QR method is quite complex in both its theory and
application, and we are able to give only an introduction to the theory of the
method. For actual algorithms for both symmetric and nonsymmetric matrices,
refer to those in EISPACK and Wilkinson and Reinsch (1971).
Given a matrix A, there is a factorization
A= QR
with R upper triangular and Q ortl-togonal. With A real, both Q and R can be
chosen real; their construction is given in Section 9.3. We assume A is real
throughout this section. Let A
1
=A, and define a sequence of matrices Am, Qm,
and Rm by
m = 1,2, ... (9.5.1)
Since Rm = Q ~ A m we have
(9.5.2)
The matrix Am+l is orthogonally similar to Am, and thus by induction, to A
1
The sequence {Am} will converge to either a triangular matrix with the
eigenvalues of A on its diagonal or to a near-triangular matrix from which the
eigenvalues can be easily calculated. In this form the convergence is usually slow,
and a technique known as shifting is used to accelerate the convergence. The
I
i
______________________ j
I
I
I
I
624 THE MATRIX EIGENVALUE PROBLEM
technique of shifting will be introduced and illustrated later in the section.
Example Let
The eigenvalues are
1
3
1
(9.5.3)
A
1
= 3 + 13 ,;, 4.7321 ;\
3
= 3 - 13,;, 1.2679
The iterates Am do not converge rapidly, and only a few are given to indicate the
qualitative behavior of the convergence:
1.0954
[3.7059
.9558
0 ]
A2 = 1.g954 3.0000
A3 =
3.5214 .9738
-1.3416 3.0000 .9738 1.7727
[ 4.6792
.2979
0 ]
[ 4.7104
.1924
0 ]
A7 =
3.0524 .0274
A8 =
3.0216 -.0115
.0274 1.2684 -.0115 1.2680
[ 4.7233
.1229
0 ]
[ 4.7285
.0781
0 ]
A9 =
3.0087 .0048
A10 = .g781
3.0035 -.0020
.0048 1.2680 -.0020 1.2680
The elements in the (1, 2) position decrease geometrically with a ratio of about
.64 per iterate, and those in the (2, 3) position decrease with a ratio of about .42
per iterate. The value in the (3, 3) position of A
15
will be 1.2679, which is correct
to five places.
The preliminary reduction of A to simpler form The QR method can be
relatively expensive because the QR factorization is time-consuming when re-
peated many times. To decrease the expense the matrix is prepared for the QR
method by reducing it to a simpler form, one for which the QR factorization is
much less expensive.
If A is symmetric, it is reduced to a similar symmetric tridiagonal matrix
exactly as described in Section 9.3. If A is nonsymmetric, it is reduced to a
similar Hessenberg matrix. A matrix B is Hessenberg if
for all i > j + 1 (9.5.4)
It is upper triangular except for a single nonzero subdiagonal. The matrix A is
reduced to Hessenberg form using the same algorithm as was used (or reducing
symmetric matrices to tridiagonal form.
With A tridiagonal or Hessenberg, the Householder matrices of Section 9.3
take a simple form when calculating the QR factorization. But generally the
THE QR METHOD 625
plane rotations (9.3.28) are used in place of the Householder matrices because
they are more efficient to compute and apply in this situation. Having produced
A
1
= Q
1
R
1
and A
2
= R
1
Q
1
, we need to know that the form of A
2
is the same as
that of A
1
in order to continue using the less expensive form of QR factorization.
Suppose A
1
is in the Hessenberg form. From Section 9.3 the factorization
A
1
= Q
1
R
1
has the following value of Q
1
:
(9.5.5)
with each Hk a Householder matrix (9.3.12):
(9.5.6)
Because the matrix A
1
is of Hessenberg fonn, the vectors w<k> can be shown
to have the special form
for i < k and i > k + 1 (9.5.7)
This can be shown from the equations for the components of w<k>, and in
particular (9.3.10). From (9.5.7), the matrix Hk will differ from the identity in
only the four elements in positions (k, k), (k, k + 1), (k + 1, k), and (k + 1,
k + 1). And from this it is a fairly straightforward computation to show that Q
1
.
must be Hessenberg in form. Another necessary lemma is that the product of an
upper triangular matrix and a Hessenberg matrix is again Hessenberg. Just
multiply the two forms of matrices, observing the respective patterns of zeros, in
order to prove this lemma. Combining these results, observing that R
1
is upper
triangular, we have that A
1
= R
1
Q
1
must be in Hessenberg form.
If A
1
is symmetric and tridiagonal, then it is trivially Hessenberg. From the
preceding result, A
2
must also be Hessenberg. But A
2
is symmetric, since
Since any symmetric Hessenberg matrix is tridiagonal, we have shown that A
2
is
tridiagonal. Note that the iterates in the example (9.5.3) illustrate this result.
Convergence of the QR Convergence results for the QR method can be
found in Golub and Van Loan (1983, sees. 7 ..5 and 8.2), Parlett (1968), (1980,
chap. 8), and Wilkinson (1965, chap. 8). The following theorem is taken from the
latter reference.
Theorem 9.6 Let A be a real matrix of order n, and let its eigenvalues {
.
(9.5.8)
Then the iterates Rm of the QR method, in (9.5.1), Will
converge to an upper triangular matrix D, which contains the
eigenvalues { in the <Jiagonal positions. If A is symmetric, the
sequence {Am} converges to a diagonal matrix. For the speed of
i
------------------;
i
I
I
626 THE MATRIX EIGENVALUE PROBLEM
convergence,
(9.5.9)
As an example of this error bound, consider the example (9.5.3). In it, the
ratios of the successive eigenvalues are
A
A 2 = .63
1
A
; = .42
2
(9.5.10)
If any one of the off-diagonal elements of Am in the example is examined, it will
be seen to decrease by one of the factors in (9.5.10).
For matrices whose eigenvalues do not satisfy (9.5.8), the iterates Am may not
converge to a triangular matrix. For A symmetric, the sequence {Am} will
converge to a block diagonal matrix
(9.5.11)
in which all blocks B; have order 1 or 2. Thus the eigenvalues of A can be easily
computed from those of D. If A is real and nonsymmetric, the situation is more
complicated, but acceptable. For a discussion, see Wilkinson (1965, chap. 8) and
Parlett (1968).
To see that {Am} does not always converge to a diagonal matrix, consider the
simple symmetric example
A [ ~ ~ ]
Its eigenvalues are A= 1. Since A is orthogonal, we have
with Q
1
= A R
1
= I
And thus
and all iterates Am= A. The sequence {Am} does not converge to a diagonal
matrix.
The QR method with shift The QR algorithm is generally i!pplied with a shift
of origin for the eigenvalues in order to increase the speed of convergence. For a
sequence of constants {em}, define A
1
= A and
m = 1,2, ... (9.5.12)
'
J
THE QR METHOD 627
The matrices Am are sinilar to A
1
, since
m 1 (9.5.13)
The eigenvalues of Am+
1
are the same as those of Am, and thence the same as
those of A.
To be more specific on the choice of shifts {em}, we consider only a symmetric
tridiagonal matrix A. For Am, let
a<m>
1
p{m)
0 0
P1m)
a<m)
2
P4m)
A =
0
(9.5.14)
"!
p<ml
n-1
0
p<m)
n-1
a<m>
n
There are two methods by which {em} is chosen: (1) Let em= and (2) let
em be the eigenvalue of
[
a(m)
n-1
n(m)
1-'n-1
n(m) l
1-'n-1
a<ml
n
(9.5.15)
which is closest to The second strategy is preferred, but in either case the
matrices Am converge to a block diagonal matrix in which the blocks have order
1 or 2, as in (9.5.11). It can be shown that either choice of {em} ensures
P
(m)n(m) 0
n-11-'n-1
as m-+oo (9.5.16)
generally at a much more rapid rate than with the original QR method (9.5.1).
From (9.5.13),
using the operator matrix norm (7.3.19) and Problem 27(c) of Chapter 7. The
matrices {Am} are uniformly bounded, and consequently the same is true of their
elements. From (9.5.16) and the uniform boundedness of { and { we
have either -+ 0 or -+ 0 as m -+ oo. In the former case, converges
to an eigenvalue of A. And in the latter case, two eigenvalues can easily be
extracted from the limit of the submatrix (9.5.15).
Once one or two eigenvalues have been obtained due to or being
essentially zero, the matrix Am can be reduced in order by one or two rows,
I
i
I
I
.......... .I
628 THE MATRIX EIGENVALUE PROBLEM
respectively. Following this, the QR method with shift can be applied to the
reduced matrix. The choice of shifts is designed to make the convergence to zero
be more rapid for than for the remaining off-diagonal elements of the
matrix. In this way, the QR method becomes a rapid general-purpose method,
faster than any other method at the present time. For a proof of convergence of
the QR method with shift, see Wilkinson (1968). For a much more complete
discussion of the QR method, including the choice of a shift, see Parlett (1980,
chap. 8).
Example Use the previous example (9.5.3), and use the first method of choosing
the shift, em = The iterates are
1
!]
.
.4899
0 ]
3
A2 =
3.2667 .7454
1 .7454 4.3333
[1.2915
.2017
0 ]
[1.2737
.0993
0 ]
A
3
= .2g11
3.0202 .2724
A4 =
2.9943 .0072
.2724 4.6884 .0072 4.7320
[ 1.2694
.0498
A
5
= .0498 2.9986
0 0
The element 134m> converges to zero extremely rapidly, but the element
converges to zero geometrically with a ratio of only about .5.
Mention should be made of the antecedent to the QR method, motivating
much of it. In 1958, H. Rutishauser introduced an LR method based on the
Gaussian elimination decomposition of a matrix into a lower triangular matrix
times an upper triangular matrix. Define
with Lm lower triangular, Rm upper triangular. When applicable, this method
will generally be more efficient than the QR method. But the nonorthogonal
similarity transformations can cause a deterioration of the conditioning of the
eigenvah:.es of some nonsymmetric matrices. And generally it is a more com-
plicated algorithm to implement in an automatic program. A complete discussion
is given in Wilkinson (1965, chap. 8).
9.6 The Calculation of Eigenvectors and Inverse Iteration
The most powerful tool for the calculation of the eigenvectors of a matrix is
inverse iteration, a method attributed to H. Wielandt in 1944. We first define and
illustrate inverse iteration, and then comment more generally on the calculation
of eigenvectors.
J
THE CALCULATION OF EIGENVECTORS AND INVERSE ITERATION 629
To simplify the analysis, let A be a matrix whose Jordan canonical form is
diagonal,
(9.6.1)
Let the columns of P be denoted by x
1
, . , x,. Then
i = 1, ... , n (9.6.2)
Without loss of generality, it can also be assumed that llx;ll,., = 1, for all i.
Let A be an approximation to a simple eigenvalue Ak of A. Given an initial
z <
0
>, define { w (m)} and { z(m)} by
(A- A/)w<m+l) = z<"'>,
w<m+l)
z<m+l) = --..,.--
nw<m+l)ll,.,
m 0. (9.6.3)
This is essentially the power method, with (A- ur replacing A in (9.2.2)-(9.2.3);
and for simplicity in analysis and implementation, we replace Pm by nw<m+ l)ll <X)" The
matrix A- AI is ill-conditioned from the viewpoint of the material in Section 8.4 of
Chapter 8. But any resulting large perturbations in the solution will be rich in the ei-
genvector xk of the eigenvalue Ak-A for A- AI, and this is the vector we desire. For a
further discussion of this source of instability in solving the linear system, see the ma-
terial formula (8.4.8) in Section 8.4. For the method (9.6.3) to" work, we do
not want A- AI to be singular. Thus A shouldn't be exactly Ak, although it can be
quite close, as a later example demonstrates.
For a more precise analysis, let z<
0
> be expanded in terms of the eigenvector
basis of (9.6.2):
n
z<
0
> = L a;X;
i-1
{9.6.4)
And assume ak * 0. In analogy with formula (9.2.4) for the power method, we
can show
(9.6.5)
Using (9.6.4),
(9.6.6)
Let Ak - A = t:, and assume
lA;- c > 0 i = 1, ... , n (9.6.7)
i
-"- , .....
---- .... - --- - --- -- .. - --"'
....... I
630 THE MATRIX EIGENVAI.,UE PROBLEM
From (9.6.6) and (9.6.5),
(9.6.8)
with laml = 1. If lt:l < c, then
(9.6.9)
This quantity goes to zero as m ~ oo. Combined with (9.6.8), this shows z<m>
converges to a multiple of xk as m ~ oo. This convergence is linear, with a ratio
of lt:/cl decrease in the error in each iterate. In practice lt:l is quite small and
this will usually mean lt:/cl is also q ~ i t small, ensuring rapid convergence.
In implementing (9.6.3), begin by factoring A -A./ using the LU decomposi-
tion of Section 8.1 of Chapter 8. To simplify the notation, write
A- A./= LU
in which pivoting is not involved. In practice, pivoting would be used. Solve for
each iterate z<m+ll as follows:
Ly<m+ll = z<m> uw<m+l) = y<m+l)
{9.6.10)
Since A - A./ is nearly singular, the last diagonal eleinent of U will be nearly
zero. If it is exactly zero, then change .it to some small number or else change A.
very slightly and recalculate L and U.
For the initial guess z<
0
>, Wilkinson (1963, p. 147) suggests using
z<
0
> = Le e = [1, 1, ... , 1r
Thus in (9.6.10),
y<ll = e uw<m) = e (9.6.11)
This choice is intended to ensure that ak is neither nonzero nor small in (9 .. 6.4).
But even if it were small, the method would usually converge rapidly. For
example, suppose that some or all of the values a;/ak in (9.6.9) are about 10
4
And suppose lt:/cl = w-s, a realistic value for many cases. Then the bound in
(9.6.9) becomes
and this will- decrease very rapidly as m increases.
THE CALCULATION OF EIGENVECTORS AND INVERSE ITERATION 631
Example Use the earlier matrix (9.5.8).
~ u
1
3
1
!]
(9.6.12)
Let A = 1.2679 = A
3
= 3 - 13, which is accurate to five places. This leads to
0
1.0
2.7310
[
.7321
U= 0
0
1.0
.3662
0
subject to the effects of rounding errors. Using y<
1
> = [1, 1, If,
w<
1
> = [3385.2, -2477.3, 908.20r
z<l) = [1.0000, -.73180, .26828r
w<
2
> = [20345, -14894, 5451.9f
z<
2
> = [1.000, - .73207' .26797r
and the vector of z<
3
> = z<
2
>. The true answer is
x3 = [1, 1- 13,2- 13r
= [1.0000, - .73205, .26795f
0 l 1.0
.0011
(9.6.13)
(9.6.14)
and z<
2
> equals x
3
to within the limits of rounding error accumulations.
Eigenvectors for symmetric tridiagonal matrices Let A be a real symmetric
tridiagonal matrix of order n. As previously, we assume that some or all of its
eigenvalues have been computed accurately. Inverse iteration is the preferred
method of calculation for the eigenvectors, and it is quite easy to implement. For
A an approximate eigenvalue of A, the calculation of the LU decomposition is
inexpensive in time and storage, even with pivoting. For example, see the
material on tridiagonal systems in Section 8.3 of Chapter 8. The previous
numerical example also illustrates the method for tridiagonal matrices.
Some error results are given as further justification for the use of inverse
iteration. Suppose that the computer arithmetic is binary floating point with
rounding and with t digits in the mantissa. In Wilkinson (1963, pp. 143-147) it is
shown that the computed solution w of
(A- AI)w<m+l) = z<m>
is the exact solution of
(A-Al+E)w=z<mJ
(9.6.15)
with
(9.6.16)
!.
I
I
-- ... ::.J
632 THE MATRIX EIGENVALUE PROBLEM
for some constant K of order unity. This bound is of the size that would be
expected from errors of the order of the rounding error.
If the solution w of (9.6.15) is quite large, then it will be a good approximation
to an eigenvector of A. To prove this, we begin by introducing
Then
w
i=--
llwlb
z(m)
( A - AI+ E)i = -
llwllz
z<m)
1'/ === (A - Al)z = - Ei + --
llwllz
Usingllz<m>lb :5 .Jnnz<m>ll"" :5 .Jn, theresidual7]satisfies
.Jn
1111112 ::; IIEII2 + llwlb
.Jn
::; K../n r'+--
llwllz
(9.6.17)
which is small if Uw1J
2
is large. To prove that this implies z is dose to an
eigenvector of A, we let { xdi = 1, ... , n} be an orthonormal set of eigenvectors.
And assume
n
i = :E a;X;
i-1
n
l l i l l ~ = : E a ~ = 1
1
with Ak an i.solated eigenvalue of A. Also, suppose
with
With these assumptions, we can now derive a bound for the error in i.
Expanding 11 using the eigenvector ba$s:
7J = Ai- 71. i = La:i('A. i- 71. )x;
I
n
~ = _EaT{ A;- A)
2
1
~ E af(Ai- !.)
2
~ c
2
L af
i .. k i ... k
.... i
.. -. - - ... - -------
LEAST SQUARES SOLUTION OF LINEAR SYSTEMS 633
which is quite small using (9.6.17). From II.ZII = 1, this implies ak;;, 1 and
1
liz- akxklb = J L a7 .s: -1171112 (9.6.18)
io#k c
showing the desired result. For a further discussion of the error, see Wilkinson
(1963, pp. 142-146) and (1965, pp. 321-330).
Another method for calculating eigenvectors would appear to be the direct
solution of
(A- AI)x = 0
after deleting one equation and setting one of the unknown components to a
nonzero constant, for example x
1
= 1. This is often the procedure used in
undergraduate linear algebra courses. But as a general numerical method, it can
be disastrous. A complete discussion of this problem is given in Wilkinson (1965,
pp. 315-321), including an excellent example. We just use the previous example
to show that the results need not be as good as those obtained with inverse
iteration.
Example Consider the preceding example (9.6.12) with A= 1.2679. We con-
sider (A - AI)x = 0 and delete the last equation to obtain
.7321x
1
+ x
2
= 0
x
1
+ 1.7321x
2
+ x
3
= 0
Taking x
1
= 1.0, we have the approximate eigenvector
X = [1.0000, - .73210, .26807]
Compared with the true answer (9.6.14), this is a slightly poorer result than
(9.6.13) obtained by inverse iteration. In general, the results of using this
approach can be very poor, and great care must be taken when using it.
The inverse iteration method requires a great deal of care in its implementa-
tion. For dealing with a particular matrix, any difficulties can be dealt with on an
ad hoc basis. But for a general computer program we have to deal with
eigenvalues that are multiple or close together, which can cause some difficulty if
not dealt with carefully. For nonsymmetric matrices whose Jordan canonical
form is not diagonal, there are additional difficulties in selecting a correct basis of
eigenvectors. The best reference for this topic is Wilkinson (1965). Also see
Golub and Van Loan (1983, pp. 238-240) and Parlett (1980, pp. 62-69). For
several excellent programs, see Wilkinson and Reinsch (1971, pp. 418-439) and
Garbow et al. (1977).
9. 7 Least Squares Solution of Linear Systems
We now consider the solution of overdetermined systems of linear equations
n
L aijxj = b;
j-1
i = 1, ... , m (9.7.1)
634 THE MATRIX EIGENVALUE PROBLEM
with m > n. These systems arise in a variety of applications, with the best known
being the fitting of functions to a set of data {(t;, b;)l i = 1, ... , m }, about which
we say more later. It might appear that the logical place for considering such
systems would be in Chapter 8, but some of the tools used in the solution of
(9.7.1) involve the orthogonal transformations introduced in this chapter. The
numerical sqlution of (9.7.1) can be quite involved, both theoretically and
practically, and we give only some major highlights of the subject.
An overdetermined system (9.7.1) will generally not have a solution. For that
reason, we seek a vector x = (x
1
, , xn) that solves (9.7.1) approximately in
some sense. Introduce
with A m X n. Then (9.7.1) can be written as
Ax= b (9.7.2)
For simplicity, assume A and b are real. Among the possible ways of finding an
approximate solution, we can seek a vector x that minimizes
(9.7.3)
for some p, 1 p oo. In this section, only the classical case of p = 2 is
considered, although in recent years, much work has also been done for the cases
p = 1 and p = oo.
The solution x* of
MinimizeiiAx - bllz
xeR"
(9.7.4)
is called the least squares solution of the linear system Ax = b. There are a
number of reasons for this approach to solving Ax = b. First, it is easier to
develop the theory and the practical solution techniques for minimizing
!lAx - bllz, partly because it is a continuously differentiable function of
x
1
, .. , xn. Second, the curve fitting problems that lead to systems (9.7.1) often
have a statistical framework that leads to (9.7.4), in preference to minimizing
!lAx - bliP with some p =F 2.
To better understand the nature of the solution /of (9.7.4), we give the
following theoretical construction. It also can be used as a practical numerical
approach, although there are usually other more efficient constructions. Crucial
to the theory is the singular value decomposition (SVD)
P.1
0
0
p.,
vrAU= F=
0
(9.7.5)
0
0 0
' .... '" . ., ....... ' """.
__, - - .. . ... - .. -"' . . ---- -- -- - ,_j
_ LEAST SQUARES SOLUTION OF LINEAR SYSTEMS 635
The matrices U and V are orthogonal, and the singular values }L, satisfy
See Theorem 7.5 in Chapter 7 for more information; and later in this section, we
describe a way to construct the SVD of A.
Theorem 9. 7 Let A be real and m X n, m ~ n. Define z = urx, c = vrb.
Then the solution x* = Uz* of (9.7.4) is given by
C;
z* =-
1
JL;
i = 1, ... , r (9.7.6)
with z,+
1
, ... , zn arbitrary. When r = n, x* is unique. When
r < n, the solution of (9.7.4) of minimal Euclidean norm is ob-
tained by setting
zj = 0 i = r + 1, ... , n (9.7.7)
[This is also called the least squares solution of (9.7.4), even
though it is not the unique minimizer of IIAx - bllz.J The mini-
mum in (9.7.4) is given by
[
m ]1/2
IIAx* - bll
2
= I: cf
j-r+1
(9.7.8)
Proof Recall Problem 13(a) of Chapter 7. For any x ERn and any orthogonal
matrix P,
Applying this to IIAx - bll
2
and using (9.7.5),
IIAx- bib= IIVTA_,x- VTblb = IIV"IAUUTx- clb
= I!Fz- clb
Then (9.7.6) and (9.7.8) follow immediately. For (9.7.7), use
[
r n ]1/2
llx*lb = llzll2 = L (zj)
2
+ I: zj
j-1 j-r+1
(9.7.9)
with z,+-l ... , zn arbitrary according to (9.7.9). Choosing (9.7.7) leads to
a unique minimum value for l l x l l
2
~
i
i
I
I
J
636 THE MATRIX EIGENVALUE PROBLEM
Define the n X m matrix
and
-1
p.1
0
Looking at (9.7.6)-(9.7.8),
0
-1
P.r
0
0
(9.7.10)
0 0
(9.7.11)
(9.7.12)
The matrix A+ is called the generalized inverse of A, and it yields the least
squares solution of Ax= b. The formula (9.7.12) shows that x* depends linearly
on b. This representation of x* is an important tool in studying the numerical
solution of Ax = b. Some further properties of A+ are left to Problems 27 and
28.
To simplify the remaining development of methods for finding x* and
analyzing its stability, we restrict A to having full rank, r = n. This is the most
important case for applications. For the singular values of A,
(9.7.13)
The concept of matrix norm can be generalized to A, from that given for
square matrices in Section 7.3. Define
11Ax112
IIAII = Supremum--
xeR" llxlb
x0
It can be shown, using the SVD of A, that
(9.7.14)
(9.7.15)
In-analogy with the error analysis in Section 8.4, define a condition number for
Ax= b by
P.l
cond(Ah = IIAIIIIA+ll =-
P.n
(9.7.16)
Using this notation, we give a stability result from Golub and Van Loan (1983,
I
I
i
--;
LEAST SQUARES SOLUTION OF LINEAR SYSTEMS 637
p. 141). It is the analogue of Theorem 8.4, for the perturbation analysis of square
nonsingular linear systems.
Let b + ob and A + oA be perturbations of b and A, respectively. Define
x* =(A+ oA)+(b + ob)
r = b- Ax* ; = ( b + ob) - (A + oA) x* (9.7.17)
Assume
[
IIOAII lloblb] 1
=Max IIAII 'libiG < cond (A)2
(9.7.18)
and
. (O) = llrll2 <
1
sm llhlb
(9.7.19)
implicitly defining 0, 0 .:::.;; 8 < ., j2. Then
llx*- x*lb [2cond(A)2 [ ( ) ]2] ( 2)
---- < + tanO cond A
2
+ 0 (9.7.20)
llx*ll2 - cos 0
!If- rib
llblb .:::.;; [1 + 2cond (Ah] Min {1, m - n) + 0(
2
) (9.7.21)
For the case m = n with rank(A) = n, the residual r will be zero, and then
(9.7.20) will reduce to the earlier Theorem 8.4.
The preceding results say that the change in r can be quite small, while the
change in x* can be quite large. Note that the bound in (9.7.20) depends on the
square of cond (A h, as compared to the dependence on cond (A) for
the nonsingular case with m = n [see (8.4.18)]. If the columns of A are nearly
dependent, then cond(Ah can be very large, resulting in a larger bound in
(9.7.20) than in (9.7.21) [see Problem 34(a)]. Whether this is acceptable or not will
depend on the problem, on whether one wants small values of r or accurate
values of x*.
The least squares data-fitting problem The origin of most overdetermined linear
systems is that of fitting data by a function from a prescribed family of functions.
Let { ( t i b;) I i = 1, ... , m} be a given set of data, presumably representing some
function b = g(t). Let cp
1
(t), ... , cp,(t) be given functions, and let be the
family of all linear combinations of cp
1
, , cp,:
(9.7.22)
l
!
- .. - --. - --- - .......... - ... - ..
'
I
I
.... I
I
I
' i
I
i
i
638 THE MATRIX EIGENVALUE PROBLEM
We want to choose an element of fF to approximately fit the given data:
n
L XfPj(t;) = b;
j=l
This is the system (9.7.1), with a;j = <p/t;)-
i = 1, ... , m
For statistical modeling reasons, we seek to minimize
[ } m [ n ]2]1/2
E(x) = m .L b;- xj<pj(t;)
z=l J=l
(9.7 .23)
(9.7.24)
hence the description of fitting data in the sense of least squares. The quantity
E(x*), for which E(x) is a minimum, is called the root-mean-square error in the
approximation of the data by the function
n
g*(t) = L xj<pj(t) (9.7.25)
j=l
Using earlier notation,
and minimizing E(x) is equivalent to finding the least squares solution of
(9.7.23).
Forming the partial derivatives of (9.7.24) with respect to each X;, and setting
these equal to zero, we obtain the system of equations
(9.7.26)
This system is a necessary condition for any minimizer of E(x), and it can also
be shown to be sufficient. The system (9.7.26) is called the normal equation for the
least squares problem. If A has rank n, then ATA will ben X n and nonsingular,
and (9.7.26) has a unique solution.
To establish the equivalency of (9.7.26) with the earlier solution of the least
squares problem, we use the SVD,of A to convert (9.7.26) to a simpler form.
Substituting A = VFUT into (9.7.26),
UFTFUTx = UFTVTb
Multiply by U, and use the earlier notation z = urx, c = vrb. Then
This gives a complete mathematical equivalence of the normal equation to the
earlier minimization of !lAx- bib given in Theorem 9.7.
Assuming that rank(A) = n, the solution x* can be found by solving the
normal equation. Since A1A is symmetric and positive definite, the Cholesky
.. ''. ,, ...... , ..
--- -------
"'\
--- ..
LEAST SQUARES SOLUTION OF LINEAR SYSTEMS 639
decomposition can be used for the solution (8.3.8)-(8.3.17)]. The effect on x*
of rounding errors will be proportional to both the unit round of the computer
and to the condition number of A1A. From the SVD of A, this is easily seen to be
2
cond(A1A)
2
[cond(A)
2
]
2
1' ..
(9.7 .27)
Thus the sensitivity of x* to errors will be proportional to [cond(A)z]
2
, which is
consistent with the earlier perturbation error bound (9.7.19).
The result (9.7.27) used to be cited as the main reason for avoiding the use of
the normal equation for solving the least squares problem. This is still good
advice, but the reasons are more subtle. From (9.7.19), if llrll
2
is nearly zero, then
sin 8 = 0, and the bound will be proportional to cond (A )z. In contrast, the error
bound for Cholesky's method will feature [cond (A)z]
2
, which is larger when
cond (A)z is large. A second reason occurs when A has columns that are nearly
dependent. The use of finite computer arithmetic can then lead to an
mate normal equation that has lost vital information present in A. In such case,
ATA will be nearly singular, and solution of the normal equation will yield much
less accuracy in x* than will some other methods that work directly with Ax = b.
For a more extensive discussion of this, see Lawson and Hanson (1974, pp.
126-129).
E:rampk Consider the data in Table 9.3 and its plot in Figure 9.2. We use a
cubic polynomial to fit these data, and thus are led to the expression
This yields the overdetermined linear system
Table 9.3
t;
0.00
.05
.10
.15
.20
.25
.30
.35
.40
:45
.50
4
x.ri-
1
=b.
i..J ) I I
j-1
Data for a cubic least squares fit
b;
.486
.866
.944
1.144
1.103
1.202
1.166
1.191
1.124
1.095
1.122
i = 1, ... , m
t;
.55
.60
.. 65
.70
.75
.80
.85
.90
.95
1.00
(9.7.28)
1.102
1.099
1.017
1.111
1.117
1.152
1.265
1.380
1.575
1.857
640 THE MATRIX EIGENVALUE PROBLEM
y
Figure 9.2 Plot of data of Table 9.3.
and the normal equations
Writing this in the form (9.7.26),
[
21
T . 10.5
A A= 7.175
5.5125
10.5
7.175"
5.5125
4.51666
k=1,2,3,4
7.175
5.5125
4.51666
3.85416
5.5125]
4.51666
3.85416
3.38212
A
7
b = [24.1180, 13.2345, 9.468365, 7.5594405f
The solution is
x* = [.5747, 4.7259, -11.1282, 7.6687f
(9.7.29)
(9.7.30)
This solution is very sensitive to changes in b. This can be inferred from the
condition number
cond (A
7
A) = 12105 (9.7.31)
As a further indication, perturb the right-hand vector Arb by adding it to the
vector
[.01, - .01, .01,- .01f
This is consistent with the size of the errors present in the data values b;. With
this new right side, the normal equation has the perturbed solution
x = [.7408, 2.6825, -6.1538, 4.455of
which differs significantly from x*.
LEAST SQUARES SOLUTION OF LINEAR SYSTEMS 641
0.4 L...-....L----L-.l.---L---1-...J.---L-.I...-....L---L-
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Figure 9.3 The least squares fit g*( t ).
The plot of the least squares fit
is shown in Figure 9.3, together with the data. Its root mean square error is
E(x*) = .0421
The columns in the example matrix A for the preceding example has columns
that are almost linearly dependent, and ATA has a large condition number. To
improve on this, we can choose a better set of basis functions { IP;(t)} for the
family ffi", of polynomials of degree ~ 3. Examining the coefficients of ATA,
m
[ATA]jk = L cpj(t;)cpk(t;) 1 ~ j k ~ n (9.7.32)
i="l
If the points {t;} are well distributed throughout the interval [a, b ], then the
previous sum, when multiplied by (b- a)jm, is an approximation to
To obtain a matrix AlA that has a smaller condition number, choose functions
cp/t) that are orthonormal. Then AlA will approximate the identity, and A will
have approximately orthonormal columns, leading to condition numbers close to
1. In fact, all thai is really important is that the family { IP/ t)} be orthogonal,
since then the matrix ATA will be nearly diagonal, a well-conditioned matrix.
Example We repeat the preceding example, using the Legendre polynomials
that are orthonormal over [0, 1]. The first four orthonormal Legendre polynomials
on [0, 1] are
15 ff
cp
0
(t) = 1 cp
1
(t) = ffs cp
2
(t) = 2(3s
2
-1) cp
3
(t) = 2(5s
3
- 3s)
(9.7.33)
'' '"""''--'.
""----------- ..
I
!
I
.I
642 THE MATRIX EIGENVALUE PROBLEM
with s = 2t - 1, 0 t 1. For the normal equation (9.7.26),
T - 0
5.1164
[
21.0000
A A-
0
23.1000
0
5.1164
2.3479
0
25.4993
0
0 l
ATb = [24.1180, 4.0721, 3.4015, 4.8519f
x* = [1.1454, .1442, .0279, .1449f
The condition number of ATA is now
cond(ATA) = 1.58
much less than earlier in (9.7.31).
(9.7.34)
(9.7.35)
The QR method. of solution Recall the QR factorization of Section 9.3,
following (9.3.11). As there, we consider Householder matrices of order m X m
Pj =I- 2wUlwUlT j = 1, ... , n
to reduce to zero the elell\ents below the diagonal in A. The orthogonal matrices
Pj are applied in succession, to reduce to zero the elements below the diagonal in
columns 1 through n of A. The vector wUl has nonzero elements in positions j
through m. This process leads to a matrix
R = P PA = QTA
n 1
(9.7 .36)
If these are also applied to the right side in the system Ax = b, we obtain the
equivalent system
(9.7.37)
The matrix R has .the form
(9.7.38)
with R
1
an upper triangular square matrix of order n X n. The matrix R
1
must
be nonsingular, since A and R = Q 1A have the same rank, namely n. In line with
(9.7.38), write
Then
IIAx - bib = IIQTAx- QTblb = IIRx - QTblb
= [IIRlx- +
j
-- - -- - . .J
LEAST SQUARES SOLUTION OF LINEAR SYSTEMS 643
The least squares solution of Ax = b is obtained by solving the nonsingular
upper triangular system
(9.7 .39)
Then the minimum is
(9.7.40)
l11e QR method for calculating x is slightly more expensive in operations.
than the Cholesky method. The Cholesky method, including the formation of
ATA, has an operation count (multiplications and divisions) of about
1 . n
3
-mn
1
+-
2 6
and the Householder QR method has an operation count of about
n3
mn
1
+-
3
Nonetheless, the QR method is generally the recommended method for calculat-
ing the least squares solution. It works directly on the matrix A, and because of
that and the use of orthogonal transformations, the effect of rounding errors is
better than with the use of the Cholesky factorization to solve the normal
equation. For a thorough discussion, see Golub and Van Loan (1983, pp.
147-149) and Lawson and Hanson (1974, chap. 16).
EXDmple We consider the earlier example of the linear system (9.7.28) with
A .. = [t!-
1
)
IJ I
1 ~ i ~ 21, 1 :::;,j ~ 4
for the data in Table 9.3. Then cond(A) = 110.01,
-2.2913
.. 1.3874
0
0
-1.5657
1.3874
-.3744
0
gl = [-5.2630,.8472, -.1403, -.7566f
-1.2029]
1.2688
-.5617
-.0987
The solution x is the same as in (9.7.30), as is the root-mean-square error.
The singular value decomposition The SVD is a very valuable. tool for analyzing
and solving least squares problems and other problems of linear algebra. For
least squares problems of less than full rank, the QR method just described will
probably lead to a triangular matrix R
1
that is nonsingular, but has some very
small diagonal e l e m ~ t s The SVD of A can then be quite useful in making
clearer the structure of A. If some singular values ll; are nearly zero, then the
644 THE MATRIX EIGENVALUE PROBLEM
effect of setting them to zero can be determined more easily than with some other
methods for solving for x*. Thus there is ample justification for finding efficient
ways to calculate the SVD of A.
One of the best known ways to calculate the SVD of A is due tQ G. Golub, C.
Reinsch, and W. and a complete discussion of it is given in Golub and
Van Loan (1983, sec. 6.5). We instead merely show how the singular value
decomposition in (9.7.5) can be obtained from the solution of a symmetric matrix
eigenvalue problem together with a QR factorization.
From A real and m X n, m n, we have that ATA is n X n and real. In
addition, it is straightforward to show that A1A is symmetric and positive
semidefinite [xTATA.x] 0 for all x]. Using a progr.am to solve the symmetric
eigenvalue problem, find a diagonal matrix D and an orthogonal matrix U for
which
(9.7.41)
Let D = diag[A
1
, . , An] with the eigenvalues arranged in descending order. If
any A; is a small negative number, then set it to zero, since all eigenvalues of ATA
should be nonnegative except for possible perturbations due to rounding errors.
From (9.7.41), define B =AU, of order m n. Then (9.7.41) imJ>lies
BTB=D
Then the columns-of B are orthogonal. Moreover, if some A;= 0, then the
corresponding column of B must be identically zero, because its norm is zero.
Using the QR method, calculate an orthogonal matrix V for which
(9.7.42)
is zero below the diagonal in all columns. The matrix R satisfies
RTR = BTV
7
VB = B
7
B = D
Again, the columns of R must be orthogonal, and if some A; = 0, then the
corresponding column of R must be zero. Since R is upper triangular, we can use
the orthogonality to show that the columns" of R must be zero in all positions
above the diagonal. Thus R has the of the matrix F of (9. 7 .5). We will then have
R = Fwith JL; B =AU in (9.7.42), we have the desired SVD:
One of the possible disadvantages of this procedure is.ATA must be formed,
and this may lead to a loss of information due to the use of finite-length
computer arithmetic. But the method is simple to implement, if the synimetric
eigenvalue problem is solvable.
Example Consider again tlie matrix A of (9.7.28), based on the data of Table
9.3. The matrix ATA is given in (9.7.29). Using EISPACK and UNPACK
j
I
'"'"""' ' .......... '
-- .... .
... j
programs,
[
.7827
u = .4533
.3326
.2670
The singular values are
.5963
-.3596
-.4998
-.5150
DISCUSSION OF THE LITERATURE 645
-.1764
.7489
-.0989
-.6311
.0256]
-.3231
.7936
-.5150
32.0102, 3.8935, .1674, .0026
The matrix V is orthogonal and of order 21 X 21, and we omit it for obvious
reasons. In practice it would not be computed, since it is a product of four
Householder matrices, which can be stored in a simpler form.
For a much more extensive discussion of the solution of least squares
problems, see Golub and Van Loan (1983, chap. 6) and the book by Lawsbn and
Hanson (1974). There are many additional practical problems that must be
discussed, including that of determining the rank of a matrix when rounding
error causes it to falsely have full rank. For programs, see the appendix to
Lawson and Hanson (1974) and UNPACK. For the SVD, see LINPACK or
EISPACK.
l)iscussion of the Literature
The main source of information for this chapter was the well-known and
encyclopedic book of Wilkinson (1965). Other sources were Golub and Van Loan
(1983), Gourlay and Watson (1976), Householder (1964), Noble (1969, chaps.
9-12), Parlett (1980), Stewart (1973), and Wilkinson (1963). For matrices of
moderate size, the numerical solution of the eigenvalue problem is fairly well
understood. For another perspective on the QR method, see Watkins (1982), and
for an in-depth look at inverse iteration, see Peters and Wilkinson (1979).
Excellent algorithms for most eigenvalue problems are given in Wilkinson and
Reinsch (1971) and the EISPACK guides by Smith et al. (1976), and Garbow
et al. (1977). For a history of the EISPACK project, see Dongarra and Moler
(1984). An excellent general account of the problems of developing mathematical
software for eigenvalue problems and other matrix problems is given in Rice
(1981). The EISPACK package is the basis for most of the eigenvalue programs
in the IMSL and NAG libraries.
A number of problems and numerical methods have not been discussed in this
chapter, often for reasons of space. For the symmetric eigenvalue problem, the
Jacobi method has been omitted. It is an elegant and rapidly convergent method
for computing all of the eigenvalues of a symmetric matrix, and it is relatively
easy to program. For a description of the Jacobi method, see Golub and Van
Loan (1983, sec. 8.4), Parlett (1980, chap. 9), and Wilkinson (1965, pp. 266-282).
i
I
-- - ...... -- . _ -- -- _ ---- :.:.:: oc.::_::.: ": c:l
646 THE MATRIX EIGENVALUE PROBLEM
An ALGOL program is given in Wilkinson and Reinsch (1971, pp. 202-211).
The generalized eigenvalue problem, Ax = A.Bx, has also been omitted. This has
become an important problem in recent years. The most popular method for its
solution is due to Moler and Stewart (1973), and other descriptions of the
problem and its solution are given in Golub and Van Loan 0983, sees. 7.7 and
8.6) and Parlett (1980, chap. 15). EISPACK programs for the generalized
eigenvalue problem are given in Garbow et al. (1977).
The problem of finding the eigenvalues and eigenvectors of large sparse
matrices is an active area of research. When the matrices have large order (e.g.,
n 300), most of the methods of this chapter are more difficult to apply because
of storage considerations. In addition, the methods often do not take special
account of the sparseness of most large matrices that occur in practice. One
common form of problem involves a symmetric banded matrix. Programs for this
problem are given in Wilkinson and Reinsch (1971, pp. 266-283) and Garbow et
al. (1977). For more general discussions of the eigenvalue problem for sparse
matrices, see Jennings (1985) and Pissanetzky (1984, chap. 6). For a discussion of
software for the eigenvalue problem for sparse matrices, see Duff (1984, pp.
179-182) and Heath (1982). An important method for the solution of the
eigenvalue problem for sparse symmetric matrices is the Lancws method. For a
discussion of it, see Scott (1981) and the very extensive books and programs of
Cullum and Willoughby (1984, 1985).
The least squares solution of overdetermined linear systems is a very im-
portant tool, one that is very widely used in the physical, biological, and social
sciences. We have just introduced some aspects of the subject, showing the
crucial role of the singular value decomposition. A very comprehensive introduc-
tion to the least squares solution of linear systems is given in Lawson and
Hanson (1974). It gives a complete treatment of the theory, the practical
implementation of methods, and ways for handling large data sets efficiently. In
addition, the book contains a complete set of programs for solving a variety of
least squares problems. For other references to the least squares solutions of
linear systems, see Golub and Van Loan (1983, chap. 6) and Rice (1981, chap.
11 ). Programs for some least squares problems are also given in LINP ACK.
In discussing the least squares solution of overdetermined systems of linear
equations, we have avoided any discussion of the statistical aspect of the subject.
Partly this was for reasons of space, and partly it was a mistrust of using the
statistical justification, since it often depends on assumptions about the distribu-
tion of the error that are difficult to validate. We refer the reader to any of the
many statistics textbooks for a development of the statistical framework for the
least squares method for curve fitting of data.
Bibliography
Chatelin, F. (1987). Eigenvalues of Matrices. Wiley, London.
Conte, S., and C. de Boor (1980). Elementary Numerical Analysis, 3rd ed.
McGraw-Hill, New York.
' .... "' --: - ~ ~ J
BIBLIOGRAPHY 647
Cullum, J., and R. Willoughby (1984, 1985). Lanczos Algorithms for Large
Symmetric Eigenvalue Computations, VoL 1, Theory; VoL 2, Programs.
Birkhauser, BaseL
Dongarra, J., and C. Moler (1984). EISPACK-A package for solving matrix
eigenvalue problems. In Sources and Development of Mathematical Software,
W. Cowell (Ed.), pp. 68-87. Prentice-Hall, Englewood Cliffs, N.J.
Dongarra, J., J. Bunch, C. Moler, and G. Stewart (1979). LINPACK User's
Guide. SIAM Pub., Philadelphia.
Duff, L (1984). A survey of sparse matrix software, In Sources and Development
of Mathematical Software, W. Cowell (Ed.). Prentice-Hall, Englewood Cliffs,
N.J.
Garbow, B., J. Boyle, J. Dongarra, and C. Moler (1977). Matrix Eigensystems
Routines-EISPACK Guide Extension, Lecture Notes in Computer Science,
Vol. 51. Springer-Verlag, New York.
Golub, G., and C. Van Loan (1983). Matrix Computations. Johns Hopkins Press,
Baltimore.
Gourlay, A., and G. Watson (1976). Computational Methods for Matrix Eigen-
problems. Wiley, New York.
Gregory, R., and D. Karney (1969). A Collection of Matrices for Testing Computa-
tional Algorithms. Wiley, New York.
Heath, M., Ed. (1982). Sparse Matrix Software Catalog. Oak Ridge National
Laboratory, Mathematics and Statistics Dept., Tech. Rep. Oak Ridge,
Tenn.
Henrici, P. (1974). Applied and Computational Complex Analysis, VoL L Wiley,
New York.
Householder, A. (1964). The Theory of Matrices in Numerical Analysis. Ginn
(Blaisdell), Boston.
Jennings, A. (1985). Solutions of sparse eigenvalue problems. In Sparsity and Its
Applications, D. Evans (Ed.), pp. 153-184. Cambridge Univ. Press, Cam-
bridge, England.
Lawson, C., and R. Hanson (1974). Solving Least Squares Problems. Prentice-Hall,
Englewood Cliffs, N.J.
Moler, C., and G. Stewart (1973). An algorithm for generalized matrix eigenvalue
problems, SIAM J. Numer. Anal. 10, 241-256.
Noble, B. (1969). Applied Linear Algebra. Prentice-Hall, Englewood Cliffs, N.J.
Parlett, B. (1968). Global convergence of the basic QR algorithm on Hessenberg
matrices, Math. Comput. 22, 803-817.
Parlett, B. (1980). The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood
Cliffs, N.J. .
Peters, G., and J. Wilkinson (1979). Inverse iteration, ill-conditioned equations
and Newton's method, SIAM Rev. 21, 339-360.
Pissanetzky, S. (1984). Sparse Matrix Technology. Academic Press, New York.
i
I
I
_.J
648 THE MATRIX EIGENVALUE PROBLEM
Rice, J. (1981). Matrix Computations and Mathematical Software. McGraw-Hill,
New York.
Scott, D. (1981). The Lanczos algorithm. In Sparse Matrices and Their Uses, I.
Duff (Ed.), pp. 139-160. Academic Press, London.
Smith, B. T., J. Boyle, B. Garbow, Y. lkebe, V. Klema, and C. Moler (1976).
Matrix Eigensystem Routines-EISPACK Guide, 2nd ed., Lecture Notes in
Computer Science, Vol. 6. Springer-Verlag, New York.
Stewart, G. (1973). Introduction to Matrix Computations. Academic Press, New
York.
Watkins, D. (1982). Understanding the QR algorithm, SIAM Rev. 24, 427-440.
Wilkinson, J. (1963). Rounding Errors in Algebraic Processes. Prentice-Hall,
Englewood Cliffs, N.J.
Wilkinson, J. (1965). The Algebraic Eigenvalue Problem. Oxford Univ. Press,
Oxford, England.
Wilkinson, J. (1968). Global convergence of the tridiagonal QR algorithm with
origin shifts. Linear Algebra Its Appl. 1, 409-420.
Wilkinson, J., and C. Reinsch, Eds. (1971): Linear Algebra. Springer-Verlag, New
York.
Problems
1. Use the Gerschgorin theorem 9.1 to determine the approximate location of
the eigenvalues of
(a)
[
1 _- 511 ~ ]
_;
Where possible, use these results to infer whether the eigenvalues are real or
complex. To check these results, compute the eigenvalues directly by
finding the roots of the characteristic polynomial.
2. (a) Given a polynomial
p(A.) =A"+ an-lA_n-1 + ... +ao
show p(A) = det[A.I- AJ for the matrix
A=
0
0
1
0
0
1 0
0 0
o
1
PROBLEMS 649
The roots of p(A) are the eigenvalues of A. The matrix A is called the
companion matrix for the polynomial p (A).
(b) Apply the Gerschgorin theorem 9.1 to obtain the following bounds
for the roots r of p(A): !rl =:; 1, or !r + a,_d :=;;; ja
0
l + + Ja,_
2
J.
If these bounds give disjoint regions in the complex plane, what can
be said about the number of roots within each region.
(c) Use the Gerschgorin theorem on the columns of A to obtain ad-
ditional bounds for the roots of p(A).
(d) Use the results of parts (b) and (c) to bound the roots of the following
polynomial equations:
3. Recall the linear system (8.8.5) of Chapter 8, which arises when numerically
solving Poisson's equation. If the equations are ordered in the manner
described in (8.8.12) and following, then the linear system is symmetric
with positive diagonal elements. For the Gauss-Seidel iteration method in
(8.8.12) to converge, it is necessary and sufficient that A be positive definite,
according to Theorem 8.7. Use the Gerschgorin theorem 9.1 to prove A is
positive definite. It will also be necessary to quote Theorem 8.8, that A = 0
is not an eigenvalue of A.
4. The values A = - 8.02861 and
X= [1.0, 2.50146, -.75773, -2.56421)
are an approximate eigenvalue and eigenvector for the matrix
A= [i
. 3
4
1
-3
1
5
3
1
6
-2
4].
5
-2
-1
Use the result (9.1.22) to compute an error bound for A.
5. For the matrix example (9.1.17) with f = .001 and A = 2, compute the
perturbation error bound (9.1.36). The same bound was given in (9.1.38) for
the other eigenvalue A = 1.
6. Prove the eigenvector perturbation result (9.1.41). Hint: Assume Ak(f} and
uk(f) are continuously differentiable functions of f. From (9.1.32), A'k(O) =
vZBukf'sk. Write
650 THE MATRIX EIGENVALUE PROBLEM
and solve for Since { u
1
, ... , un} is a basis, write
n
uk(O) = ajuj
j=l
To find aj, first differentiate (9.1.40) with respect to E:, and then let E: = 0.
Substitute the previous representation for uk(O). Use (9.1.29) and the
biorthogonality relation
from (9.1.28).
7. For the following matrices A( i), determine the eigenvalues and eigenvec-
tors for both E: = 0 and i > 0. Observe the.behavior as E: 0.
(a) [!
(b) [ol 1 ]
1 + i
(c) n (d)
1
1
What do these examples say about the stability of eigenvector subspaces?
8. Use the power method to calculate the dominant eigenvalue and associated
eigenvector for the following matrices.
9.
(a)
4 4
6 1
1 6
4 4
(b)
1
-3
1
5
3
1
6
-2
41
5
-2
-1 .
Check the speed of convergence, calculating the ratios Rm of (9.2.14). When
the ratios Rm are fairly constant, use Aitken extrapolation to improve the
speed of convergence of both the eigenvalue and eigenvector, using the
eigenvalue ratios Rm to accelerate the eigenvectors { z<m> }.
Use the power method to find the dominant eigenvalue of
A= [
-16
13
-10
13
-16]
13
7
Use the initial guess z<
0
> = [1, 0, If. Print each iterate z<m> and >..)m>.
Comment on the results. What would happen if >..)ml were defined by
>._)m) =am?
. ' .. . . ' .... ~ .. " ": ".' :.:.. "i
PROBLEMS 651
10. For a matrix A of order n, assume its Jordan canonical form is diagonal
and denote the eigenvalues by A
1
, ... , An. Assume that A
1
= A
2
= =A,
for some r > 1, and
Show that the power method (9.2.2)-(9.2.3) will still converge to A
1
and
some associated eigenvector, for most choices of initial vector z<
0
>.
11. Let A be a symmetric matrix of order n, with the eigenvalues ordered by

        λ_1 >= λ_2 >= ··· >= λ_n

    Define

        R(x) = (Ax, x) / (x, x)

    using the standard inner product. Show

        Max R(x) = λ_1,     Min R(x) = λ_n

    as x ≠ 0 ranges over R^n. The function R(x) is called the Rayleigh quotient,
    and it can be used to characterize the remaining eigenvalues of A, in
    addition to λ_1 and λ_n. Using these maximizations and minimizations for
    R(x) forms the basis of some classical numerical methods for calculating
    the eigenvalues of A.
12. To give a geometric meaning to the n x n Householder matrix P = I - 2ww^T,
    let u^(2), ..., u^(n) be an orthonormal basis of the (n - 1)-dimensional
    subspace that is perpendicular to w. Define

        T(x) = (I - 2ww^T) x

    Use the basis {w, u^(2), ..., u^(n)} for R^n to write

        x = α_1 w + α_2 u^(2) + ··· + α_n u^(n)

    Apply T to this representation and interpret the results.
13. (a) Let A be a symmetric matrix, and let λ and x be an eigenvalue-
        eigenvector pair for A with ||x||_2 = 1. Let P be an orthogonal matrix
        for which

            Px = e_1 = [1, 0, ..., 0]^T

        Consider the similar matrix B = PAP^T, and show that its first row and
        column are zero except for the diagonal element, which equals λ.
        Hint: Calculate and use Be_1.
    (b) For the matrix

            A = [ 10   ·   ·
                   5   ·   ·
                  -8   ·  11 ]

        λ = 9 is an eigenvalue with associated eigenvector x = [·, ·, ·]^T.
        Produce a Householder matrix P for which Px = e_1, and then produce
        B = PAP^T. The matrix eigenvalue problem for B can then be reduced
        easily to a problem for a 2 x 2 matrix. Use this procedure to calculate
        the remaining eigenvalues and eigenvectors of A. The process of
        changing A to B, and of then solving a matrix eigenvalue problem of
        order one less than that of A, is known as deflation. It can be used to
        extend the applicability of the power method to eigenvalues other than
        the dominant one. For an extensive discussion, see Wilkinson
        (1965, pp. 584-598) and Parlett (1980, chap. 5).
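The deflation step itself is short to program. The sketch below (Python/NumPy) uses a small symmetric matrix invented for illustration, since the matrix printed in part (b) is not fully legible; the construction of P and of B = PAP^T is the same in any case.

    import numpy as np

    def householder_for(x):
        """P = I - 2ww^T with Px = e1, for a unit vector x."""
        n = len(x)
        e1 = np.zeros(n); e1[0] = 1.0
        v = x - e1
        if np.linalg.norm(v) < 1e-14:           # x is already e1
            return np.eye(n)
        w = v / np.linalg.norm(v)
        return np.eye(n) - 2.0 * np.outer(w, w)

    A = np.array([[4., 1., 2.],                 # illustrative symmetric matrix
                  [1., 3., 0.],
                  [2., 0., 5.]])
    eigs, vecs = np.linalg.eigh(A)
    lam, x = eigs[0], vecs[:, 0]                # one eigenpair, ||x||_2 = 1

    P = householder_for(x)
    B = P @ A @ P.T
    print(np.round(B, 10))                      # first row/column zero except B[0,0] = lam
    print("remaining eigenvalues:", np.linalg.eigvalsh(B[1:, 1:]))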
14. Use Householder matrices to produce the QR factorization of

        (a) ...          (b) ...
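A direct implementation of the Householder QR factorization, applied to an arbitrary illustrative matrix (the matrices printed above are not legible in this copy), might look as follows in Python with NumPy:

    import numpy as np

    def householder_qr(A):
        """QR factorization by successive Householder reflections (a sketch)."""
        R = A.astype(float).copy()
        m, n = R.shape
        Q = np.eye(m)
        for k in range(min(n, m - 1)):
            x = R[k:, k]
            alpha = -np.linalg.norm(x) if x[0] >= 0 else np.linalg.norm(x)
            v = x.copy(); v[0] -= alpha
            if np.linalg.norm(v) < 1e-14:
                continue
            w = v / np.linalg.norm(v)
            H = np.eye(m)
            H[k:, k:] -= 2.0 * np.outer(w, w)
            R = H @ R
            Q = Q @ H
        return Q, R

    M = np.array([[ 1.,  2.,  0.],              # illustrative 4 x 3 test matrix
                  [ 0.,  1., -1.],
                  [ 2.,  3.,  1.],
                  [-1.,  0.,  4.]])
    Q, R = householder_qr(M)
    print(np.allclose(Q @ R, M), np.allclose(Q.T @ Q, np.eye(4)))
    print(np.round(R, 10))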
15. Consider the rotation matrix of order n,

        R^(k,l) = [ 1                                   ]
                  [    ...                              ]
                  [        α   0  ...  0   β            ]   row k
                  [        0   1           0            ]
                  [        :      ...      :            ]
                  [        0           1   0            ]
                  [       -β   0  ...  0   α            ]   row l
                  [                           ...       ]
                  [                                1    ]

    with α^2 + β^2 = 1. If we compute Rb for a given b ∈ R^n, then the only
    elements that will be changed are in positions k and l. By choosing α and β
    suitably, we can force Rb to have a zero in position l. Choose α, β so that

        (Rb)_k = γ,      (Rb)_l = 0

    for some γ.
    (a) Derive formulas for α, β, and show γ = sqrt(b_k^2 + b_l^2).
    (b) Reduce b = [1, 1, 1, 1]^T to the form [c, 0, 0, 0]^T by a sequence of
        multiplications by rotation matrices.
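In code, the choice of α and β asked for in part (a) and the reduction in part (b) look like this (Python/NumPy; the formulas α = b_k/γ, β = b_l/γ are the ones the derivation should produce):

    import numpy as np

    def rotation(b, k, l):
        """R^(k,l) with alpha^2 + beta^2 = 1 chosen so that (Rb)_l = 0."""
        n = len(b)
        gamma = np.hypot(b[k], b[l])            # sqrt(b_k^2 + b_l^2)
        alpha, beta = ((1.0, 0.0) if gamma == 0.0
                       else (b[k] / gamma, b[l] / gamma))
        R = np.eye(n)
        R[k, k], R[k, l] = alpha, beta
        R[l, k], R[l, l] = -beta, alpha
        return R

    b = np.ones(4)                              # part (b): b = [1, 1, 1, 1]^T
    for l in (1, 2, 3):                         # zero out positions 2, 3, 4 in turn
        b = rotation(b, 0, l) @ b
    print(b)                                    # approximately [2, 0, 0, 0]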
16. Show how the rotation matrices R^(k,l) can be used to produce the QR
    factorization of a matrix.
17. (a) Do an operations count for producing the QR factorization of a
matrix using Householder matrices, as in Section 9.3. As usual,
combine multiplications and divisions, and keep a separate count for
the number of square roots.
    (b) Repeat part (a), but use the rotation matrices R^(k,l) for the reduction.
18. Give the explicit formulas for the calculation of the QR factorization of a
symmetric tridiagonal matrix. Do an operations count, and compare the
result with those of Problem 17.
19. Use Theorem 9.5 to separate the roots of

        (a) [ 1  1  0  0  0          (b) [ 1  2  0  0  0
              1  1  1  0  0                2  2  3  0  0
              0  1  1  1  0                0  3  3  4  0
              0  0  1  1  1                0  0  4  4  5
              0  0  0  1  1 ]              0  0  0  5  5 ]

    Then obtain accurate approximations using the bisection method or some
    other rootfinding technique.
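Theorem 9.5 leads directly to a bisection algorithm. The sketch below (Python/NumPy) counts, via the Sturm-sequence pivots, how many eigenvalues of a symmetric tridiagonal matrix lie below a given point, and then bisects; it is applied here to matrix (a).

    import numpy as np

    def count_below(d, e, x):
        """Number of eigenvalues below x for the symmetric tridiagonal matrix
        with diagonal d and off-diagonal e (Sturm-sequence pivot count)."""
        count, q = 0, 1.0
        for k in range(len(d)):
            ek2 = e[k - 1] ** 2 if k > 0 else 0.0
            q = (d[k] - x) - ek2 / (q if q != 0.0 else 1e-300)
            if q < 0.0:
                count += 1
        return count

    def kth_eigenvalue(d, e, k, lo, hi, tol=1e-12):
        """Bisection for the k-th smallest eigenvalue (k = 1, 2, ...)."""
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if count_below(d, e, mid) >= k:
                hi = mid
            else:
                lo = mid
        return 0.5 * (lo + hi)

    d, e = np.ones(5), np.ones(4)               # matrix (a)
    lo, hi = -2.0, 4.0                          # contains the Gerschgorin interval
    approx = [kth_eigenvalue(d, e, k, lo, hi) for k in range(1, 6)]
    exact = np.sort(np.linalg.eigvalsh(np.diag(d) + np.diag(e, 1) + np.diag(e, -1)))
    print(np.round(approx, 8))
    print(np.round(exact, 8))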
20. (a) Write a program to reduce a symmetric matrix to tridiagonal form
using Householder matrices for the similarity transformations. For
efficiency in the matrix multiplications, use the analogue of the form
of multiplication shown in (9.3.18).
(b) Use the program to reduce the following matrices to tridiagonal form:
        (i) ...          (ii) ...

        (iii) [  ·    6    ·   12
                 6  225    3   18
                 ·    3   25    6
                12   18    6    0 ]
(c) Calculate the eigenvalues of your reduced tridiagonal matrix as accu-
rately as possible.
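A straightforward (not storage-efficient) version of the reduction in part (a) can be written as follows in Python with NumPy; a small symmetric matrix is invented for the test, since the matrices printed in part (b) are only partly legible.

    import numpy as np

    def tridiagonalize(A):
        """Householder similarity reduction of a symmetric matrix to
        tridiagonal form (plain version, not the economical form (9.3.18))."""
        T = A.astype(float).copy()
        n = T.shape[0]
        for k in range(n - 2):
            x = T[k + 1:, k]
            alpha = -np.linalg.norm(x) if x[0] >= 0 else np.linalg.norm(x)
            v = x.copy(); v[0] -= alpha
            if np.linalg.norm(v) < 1e-14:
                continue
            w = v / np.linalg.norm(v)
            P = np.eye(n)
            P[k + 1:, k + 1:] -= 2.0 * np.outer(w, w)
            T = P @ T @ P                       # P is symmetric and orthogonal
        return T

    A = np.array([[4., 1., 2., 1.],             # illustrative symmetric matrix
                  [1., 5., 0., 2.],
                  [2., 0., 6., 1.],
                  [1., 2., 1., 3.]])
    T = tridiagonalize(A)
    print(np.round(T, 8))
    print(np.round(np.linalg.eigvalsh(A), 8))   # eigenvalues are preserved
    print(np.round(np.linalg.eigvalsh(T), 8))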
21. Let {P_n(x) | n >= 0} denote a family of orthogonal polynomials with
    respect to a weight function w(x) on an interval a < x < b. Further, assume
    that the polynomials have leading coefficient 1:

        P_n(x) = x^n + Σ_{j=0}^{n-1} a_{n,j} x^j

    Find a symmetric tridiagonal matrix R_n for which P_n(λ) is the
    characteristic polynomial. Thus, calculating the roots of an orthogonal
    polynomial (and the nodes of a Gaussian quadrature formula) is reduced to
    the solution of an eigenvalue problem for a symmetric tridiagonal matrix.
    Hint: Recall the formula for the triple recursion relation for {P_n(x)},
    and compare it to the formula (9.4.3).
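As a concrete instance of this construction (using the Legendre polynomials, whose monic recursion coefficients are standard, rather than the general family of the problem), the following Python/NumPy sketch builds the tridiagonal matrix and recovers the Gauss-Legendre nodes as its eigenvalues:

    import numpy as np

    # Monic Legendre polynomials: P_{k+1}(x) = x P_k(x) - b_k P_{k-1}(x),
    # with b_k = k^2 / (4k^2 - 1).  The associated symmetric tridiagonal matrix
    # has zero diagonal and off-diagonal entries sqrt(b_k).
    n = 6
    k = np.arange(1, n)
    off = k / np.sqrt(4.0 * k**2 - 1.0)
    J = np.diag(off, 1) + np.diag(off, -1)

    nodes = np.sort(np.linalg.eigvalsh(J))
    print(np.round(nodes, 10))
    print(np.round(np.polynomial.legendre.leggauss(n)[0], 10))   # same nodes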
22. Use the QR method (a) without shift, and (b) with shift, to calculate the
    eigenvalues of

        (a) ...          (b) ...

        (c) [ 1  1  0  0  0
              1  1  1  0  0
              0  1  1  1  0
              0  0  1  1  1
              0  0  0  1  1 ]
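An explicit (unoptimized) QR iteration of the kind intended here, with deflation and with a Wilkinson-type shift taken from the trailing 2 x 2 block when shifting is requested, can be sketched in Python/NumPy as follows; it assumes a symmetric tridiagonal input such as matrix (c).

    import numpy as np

    def qr_eigenvalues(A, shifted, tol=1e-12, max_sweeps=10000):
        """Explicit QR iteration with deflation; a teaching sketch only."""
        T = A.astype(float).copy()
        m = T.shape[0]
        for _ in range(max_sweeps):
            if m == 1:
                break
            if abs(T[m - 1, m - 2]) < tol * (abs(T[m - 1, m - 1]) + abs(T[m - 2, m - 2])):
                m -= 1                           # eigenvalue T[m-1, m-1] accepted
                continue
            if shifted:                          # Wilkinson-type shift
                a, b, c = T[m - 2, m - 2], T[m - 1, m - 2], T[m - 1, m - 1]
                delta = 0.5 * (a - c)
                sign = 1.0 if delta >= 0 else -1.0
                s = c - sign * b * b / (abs(delta) + np.hypot(delta, b))
            else:
                s = 0.0
            Q, R = np.linalg.qr(T[:m, :m] - s * np.eye(m))
            T[:m, :m] = R @ Q + s * np.eye(m)
        return np.sort(np.diag(T))

    C = np.eye(5) + np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)   # matrix (c)
    print(qr_eigenvalues(C, shifted=False))
    print(qr_eigenvalues(C, shifted=True))
    print(np.sort(np.linalg.eigvalsh(C)))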
23. Let A be a Hessenberg matrix, and consider the factorization A = QR, with Q
    orthogonal and R upper triangular.
    (a) Recalling the discussion following (9.5.5), show that (9.5.7) is true.
    (b) Show that the result (9.5.7) implies a form for H_k in (9.5.6) such
        that Q will be a Hessenberg matrix.
    (c) Show that the product of a Hessenberg matrix and an upper triangular
        matrix, in either order, is again a Hessenberg matrix.
    When combined, these results show that RQ is again a Hessenberg matrix, as
    claimed in the paragraph following (9.5.7).
24. For the matrix A of Problem 4, two additional approximate eigenvalues are
    λ = 7.9329 and λ = 5.6689. Use inverse iteration to calculate the
    associated eigenvectors.
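Inverse iteration is only a few lines; a plain version (Python/NumPy, re-solving with A - μI at every step rather than saving an LU factorization) is:

    import numpy as np

    def inverse_iteration(A, mu, steps=10):
        """Inverse iteration with a fixed approximate eigenvalue mu."""
        M = A - mu * np.eye(A.shape[0])
        x = np.ones(A.shape[0])
        for _ in range(steps):
            x = np.linalg.solve(M, x)
            x /= np.linalg.norm(x)
        return x, x @ A @ x                     # eigenvector, Rayleigh quotient

    A = np.array([[ 2.,  1.,  3.,  4.],
                  [ 1., -3.,  1.,  5.],
                  [ 3.,  1.,  6., -2.],
                  [ 4.,  5., -2., -1.]])
    for mu in (7.9329, 5.6689):
        x, lam = inverse_iteration(A, mu)
        print(np.round(lam, 6), np.round(x, 6))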
25. Investigate the programs available at your computer center for the
    calculation of the eigenvalues of a real symmetric matrix. Using such a
    program, compute the eigenvalues of the Hilbert matrices H_n for
    n = 3, 4, 5, 6, 7. To check your answers, see the very accurate values
    given in Gregory and Karney (1969, pp. 66-73).
26. Consider calculating the eigenvalues and associated eigenfunctions x(t)
    for which

        ∫_0^1  x(t) / [1 + (s - t)^2]  dt  =  λ x(s)

    One way to obtain approximate eigenvalues is to discretize the equation
    using numerical integration. Let h = 1/n for some n >= 1, and define
    t_j = (j - 1/2)h, j = 1, ..., n. Substitute t_i for s in the equation, and
    approximate the integral using the midpoint numerical integration method.
    This leads to the system

        h Σ_{j=1}^{n}  x̃(t_j) / [1 + (t_i - t_j)^2]  =  λ x̃(t_i),     i = 1, ..., n

    in which x̃(s) denotes a function that we expect approximates x(s). This
    system is the eigenvalue problem for a symmetric matrix of order n. Find
    the two largest eigenvalues of this matrix for n = 2, 4, 8, 16, 32. Examine
    the convergence of these eigenvalues as n increases, and attempt to predict
    the error in the most accurate case, n = 32, as compared with the unknown
    true eigenvalues for the integral equation.
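The discretized problem is easy to set up; the following Python/NumPy sketch forms the symmetric matrix from the midpoint rule and prints its two largest eigenvalues for the requested values of n:

    import numpy as np

    def two_largest(n):
        h = 1.0 / n
        t = (np.arange(1, n + 1) - 0.5) * h      # midpoint nodes t_j = (j - 1/2)h
        K = h / (1.0 + (t[:, None] - t[None, :]) ** 2)
        return np.sort(np.linalg.eigvalsh(K))[-2:][::-1]

    for n in (2, 4, 8, 16, 32):
        print(n, np.round(two_largest(n), 8))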
27. Show that the generalized inverse A+ of (9.7.11) satisfies the following
    Moore-Penrose conditions:
        1.  A A+ A = A
        2.  A+ A A+ = A+
        3.  (A A+)^T = A A+
        4.  (A+ A)^T = A+ A
    Also show
        5.  (A A+)^2 = A A+
        6.  (A+ A)^2 = A+ A
    Conditions (3)-(6) show that A+A and AA+ represent orthogonal projections
    on R^n and R^m, respectively.
28. For an arbitrary m x n matrix A, show that

        lim_{α→0+}  (αI + A^T A)^{-1} A^T  =  A+

    where α > 0. Hint: Use the SVD of A.
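A numerical illustration of the limit (Python/NumPy, using a deliberately rank-deficient 3 x 2 example so that A^T A is singular and the ordinary normal-equations inverse does not exist):

    import numpy as np

    A = np.array([[1., 2.],
                  [2., 4.],
                  [3., 6.]])                    # rank one
    A_plus = np.linalg.pinv(A)

    for alpha in (1e-1, 1e-3, 1e-5, 1e-7):
        R = np.linalg.solve(alpha * np.eye(2) + A.T @ A, A.T)
        print(alpha, np.linalg.norm(R - A_plus))

The printed norms should decrease roughly in proportion to alpha, until rounding error in the increasingly ill-conditioned solves takes over.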
29. Unlike the situation with nonsingular square matrices, the generalized
    inverse A+ need not vary continuously with changes in A. To support this,
    find a family of matrices {A(ε)} where A(ε) converges to A(0), but A(ε)+
    does not converge to A(0)+.
30. Calculate the linear polynomial least squares fit for the following data.
    Graph the data and the least squares fit. Also, find the root-mean-square
    error in the least squares fit.
        t_i      b_i         t_i      b_i         t_i      b_i
       -1.0     1.032       -.3      1.139        .4      -.415
        -.9     1.563       -.2       .646        .5      -.112
        -.8     1.614       -.1       .474        .6      -.817
        -.7     1.377        0.0      .418        .7      -.234
        -.6     1.179        .1       .067        .8      -.623
        -.5     1.189        .2       .371        .9      -.536
        -.4      .910        .3       .183       1.0     -1.173
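The fit itself (without the requested graph) can be computed as in the following Python/NumPy sketch, which uses the data above:

    import numpy as np

    t = np.linspace(-1.0, 1.0, 21)
    b = np.array([ 1.032,  1.563,  1.614,  1.377,  1.179,  1.189,  0.910,
                   1.139,  0.646,  0.474,  0.418,  0.067,  0.371,  0.183,
                  -0.415, -0.112, -0.817, -0.234, -0.623, -0.536, -1.173])

    X = np.column_stack([np.ones_like(t), t])   # model b ~ c0 + c1*t
    coef, *_ = np.linalg.lstsq(X, b, rcond=None)
    rms = np.sqrt(np.mean((b - X @ coef) ** 2))
    print("fit: %.5f + %.5f t,  RMS error %.5f" % (coef[0], coef[1], rms))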
31. Do a quadratic least squares fit to the following data. Use the standard
    form

        f(t) = a_1 + a_2 t + a_3 t^2

    and use the normal equation (9.7.26). What is the condition number of
    A^T A?

        t_i      b_i         t_i      b_i         t_i      b_i
       -1.0     7.904       -.3       .335        .4      -.711
        -.9     7.452       -.2      -.271        .5       .224
        -.8     5.827       -.1      -.963        .6       .689
        -.7     4.400        0.0     -.847        .7       .861
        -.6     2.908        .1     -1.278        .8      1.358
        -.5     2.144        .2     -1.335        .9      2.613
        -.4      .581        .3      -.656       1.0      4.599
32. For the matrix A arising in the least squares curve fitting of Problem 31,
calculate its QR factorization, its SVD, and its generalized inverse. Use
these to again solve the least squares problem.
33. Find the QR factorization, singular value decomposition, and generalized
    inverse of the following matrices. Also give cond(A)_2.

        (a) A = [  .9   1.1
                 -1.0  -1.0
                  1.1    .9 ]
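For matrix (a), all of the requested quantities are available through standard library routines; a Python/NumPy sketch:

    import numpy as np

    A = np.array([[ 0.9,  1.1],
                  [-1.0, -1.0],
                  [ 1.1,  0.9]])

    Q, R = np.linalg.qr(A)                      # thin QR factorization
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_plus = np.linalg.pinv(A)

    print("R =", np.round(R, 6))
    print("singular values:", np.round(s, 6))
    print("A+ =", np.round(A_plus, 6))
    print("cond(A)_2 =", s[0] / s[-1])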
34. (a) Let A be m x n, m >= n, and suppose that the columns of A are nearly
        dependent. More precisely, let A = [u_1, ..., u_n], u_i ∈ R^m, and
        suppose the vector

            a_1 u_1 + ··· + a_n u_n

        is quite small compared to ||a||_2, a = [a_1, ..., a_n]^T. Show that A
        will have a large condition number.
    (b) In contrast to part (a), suppose the columns of A are orthonormal.
        Show cond(A)_2 = 1.
INDEX
Note: (1) An asterisk (*) following a subentry name means that name is also listed separately with
additional subentries of its own. (2) A page number followed by a number in parentheses, prefixed by
P, refers to a problem on the given page. For example, 123(P30) refers to problem 30 on page 123.
Absolute stability, 406
Acceleration methods:
eigenvalues, 606
linear systems, 83
numerical integration, 255, 294
rootfinding, 83
Adams-Bashforth methods, 385
Adams methods, 385
DE/STEP, 390
stability*, 404
stability region*, 407
variable order, 390
Adams-Moulton methods, 387
Adaptive integration, 300
CADRE, 302
QUADPACK, 303
Simpson's rule, 300
Adjoint, 465
Aitken, 86
Aitken extrapolation:
eigenvalues, 607
linear iteration, 83
numerical integration, 292
rate of convergence, 123(P30)
Algorithms:
Aitken, 86
Approx, 235
Bisect, 56
Chebeval, 221
Cq, 567
Detrap, 376
Divdif, 141
Factor, 520
Interp, 141
Newton, 65
Polynew, 97
Romberg, 298
Solve, 521
Alternating series, 239(P2)
Angle between vectors, 469
Approx, 235
Approximation of functions, 197
Chebyshev series 219, 225
de la Vallee-Poussin theorem, 222
economization of power series, 245(P39)
equioscillation theorem, 224
even/odd functions, 229
interpolation, 158
Jackson's theorem, 180, 224
least squares*, 204, 206, 216
minimax*, 201, 222
near-minimax*, 225
Taylor's theorem, 4, 199
AS(), 513
A-stability, 371, 408, 412
Asymptotic error formula:
definition, 254
differential equations, 352, 363, 370
Euler-MacLaurin formula*, 285, 290
Euler's method, 352
numerical integration, 254, 284
Runge-Kutta formulas, 427
Simpson's rule, 258
trapezoidal rule, 254
Augmented matrix, 510
Automatic numerical integration, 299
adaptive integration, 300
CADRE, 302
QUADPACK, 303
Simpson's rule, 300
Back substitution, 508
Backward differences, 151
Backward differentiation formulas, 410
Backward error analysis, 536
Backward Euler method, 409
Banded matrix, 527
Basic linear algebra subroutines, 522, 570
Basis, 464
orthogonal, 469
standard, 465
Bauer-Fike theorem, 592
Bernoulli numbers, 284
Bernoulli polynomials, 284, 326(P23)
Bernstein polynomials, 198
Bessel's inequality, 218
Best approximation, see Minimax approximation
Binary number, 11
Binomial coefficients, 149
Biorthogonal family, 597
Bisect, 56
Bisection method, 56-58
convergence, 58
BLAS, 522, 570
Boole's rule, 266
Boundary value problems, 433
collocation methods, 444
existence theory, 435, 436
finite difference methods, 441
integral equation methods, 444
shooting methods, 437
Brent's method, 91
comparison with bisection method, 93
convergence criteria, 91
en, 463
en. 219
C[a, b], 199
CADRE, 302
Canonical forms, 474
Jordan, 480
Schur, 474
singular value decomposition, 478
symmetric matrices, 476
Cauchy-Schwartz inequality, 208, 468
Cayley-Hamilton theorem, 501(P20)
Change of basis matrix, 473
Characteristic equation:
differential equations, 364, 397
matrices, 471
Characteristic polynomial, 397, 471
Characteristic roots, 398
Chebeval, 221
Chebyshev equioscillation theorem, 224
Chebyshev norm, 200
Chebyshev polynomial expansion, 219, 225
Chebyshev polynomials, 211
maxima, 226
minimax property, 229
second kind, 243(P24)
triple recursion formula, 211
zeros, 228
Chebyshev zeros, interpolation at, 228
Cholesky method, 524, 639
Chopping, 13
Christoffel-Darboux identity, 216
Collocation methods, 444
Column norm, 487
Compact methods, 523
Companion matrix, 649(P2)
Complete pivoting, 515
Complex linear systems, 575(P5)
Composite Simpson's rule, 257
Composite trapezoidal rule, 253
Cond(A), 530
Cond(A)., Cond(A)P, 531
Condition number, 35, 58
calculation, 538
eigenvalues, 594, 599
Gastinel's theorem, 533
Hilbert matrix, 534
matrices, 530
Conjugate directions methods, 564
Conjugate gradient method, 113, 562, 566
acceleration, 569
convergence theorem, 566, 567
optimality, 566
projection framework, 583(P39)
Conjugate transpose, 465
Consistency condition, 358, 395
Runge-Kutta methods, 425
Convergence:
interval, 56
linear, 56
order, 56
quadratic, 56
rate of linear, 56
vector, 483
Conversion between number bases, 45(P10, P11)
Corrected trapezoidal rule, 255, 324(P4)
Corrector formula, 370
Cq, 567
Cramer's rule, 514
Crout's method, 523
Data error, 20, 29, 325(P13)
Deflation, polynomial, 97
matrix, 609, 651(P13)
Degree of precision, 266
de la Vallee-Poussin theorem, 222
Dense family, 267
Dense linear systems, 507
Dense matrix, 507
DE/STEP, 390
Detecting noise in data, 153
Determinant, 467, 472
calculation, 512
Detrap, 376
Diagonally dominant, 546
Difference equations, linear, 363, 397
Differential equations:
automatic programs,
Adam's methods, 390
boundary value codes, 444, 446
comparison, 446
control of local error, 373, 391
DE/STEP, 390
error control, 391
error per unit stepsize, 373
global error, 392
RKF45, 431
variable order, 390
boundary value problems, 433
direction field, 334
existence theory, 336
first order linear, 333
higher order, 340
ill-conditioned, 339
initial value problem, 333
integral equation equivalence, 451(P5)
linear, 333, 340
model equation, 396
numerical solution:
Adam's methods*, 385
A-stability, 371, 408, 412
backward Euler method, 409
boundary value problems*, 433
characteristic equation, 397
convergence theory, 360, 401
corrector formula, 370
Euler's method*, 341
explicit methods, 357
extrapolation methods, 445
global error estimation, 372, 392
grid size, 341
implicit methods, 357
lines, method of, 414
local solution, 368
midpoint method*, 361
model equation, 363, 370, 397
multistep methods*, 357
numerical integration, 384
node points, 341
predictor formula, 370
Runge-Kutta methods*, 420
single-step methods, 418
stability, 349, 361, 396
stability regions, 404
stiff problems*, 409
Taylor series method, 418
trapezoidal method*, 366
undetermined coefficients, 381
variable order methods, 390, 445
Picard iteration, 45l(P5)
stability, 337
stiff, 339, 409
systems, 339
Differentiation, see Numerical differentiation
Dimension, 464
Direction field, 334
Dirichlet problem, see Poisson's equation
Divdif, 141
Divided difference interpolation formula, 140
Divided differences, 9, 139
continuity, 146
diagram for calculation, 140
differentiation, 146
formulas, 139, 144
Hermite-Genocchi formula, 144
interpolation, 140
polynomials, 147
relation to derivatives, 144
recursion relation, 139
Doolittle's method, 523
Economization of Taylor series, 245(P39)
Eigenvalues, 471
Bauer-Fike theorem, 592
condition number, 594, 600
deflation,609
error bound for symmetric matrices, 595
Gerschgorin theorem, 588
ill-conditioning, 599
location, 588
matrices with nondiagonal Jordan form, 601
multiplicity, 473
numerical approximation
EISPACK, 588, 645, 662
Jacobi method, 645
power method*, 602
QR method*, 623
sparse matrices, 646
Sturm sequences, 620
numerical solution, see Eigenvalues, numerical
approximation
stability, 591
under unitary transformations, 600
symmetric, 476
tridiagonal matrices*, 619
Eigenvector(s), 471
numerical approximation:
EISPACK, 588, 645, 662
error bound, 631
inverse iteration, 628
power method*, 602
stability, 600
EISPACK, 588, 645, 662
Enclosure rootfinding methods, 58
Brent's method, 91
Error:
backward analysis, 536
chopping, 13
data, 20, 29, 153
definitions, 17
loss of significance, 24, 28
machine, 20
Error (Cominued)
noise, 21
propagated, 23, 28
rounding, 13
sources of, 18
statistical treatment, 31
summation, 29
truncation, 20
unit roundoff, 15
Error estimation:
global, 392, 433
rootfinding, 57, 64, 70, 84, 129(P16)
Error per stepsize, 429
Error per unit stepsize, 373
Euclidean norm, 208, 468
Euler-MacLaurin formula, 285
generalization, 290
summation formula, 289
Euler's method, 341
asymptotic error estimate, 352, 356
backward, 409
convergence analysis, 346
derivation, 342
error bound, 346, 348
rounding error analysis, 349
stability*, 349, 405
systems, 355
truncation error, 342
Even/odd functions, 229
Explicit multistep methods, 357
Exponent, 12
Extrapolation methods:
differential equations, 445
numerical integration, 294
rootfinding, 85
F,(x), 232
Factor, 520
Fast Fourier transform, 181
Fehlberg, Runge-Kutta methods, 429
Fibonacci sequence, 68
Finite differences, 147
interpolation formulas, 149, 151
Finite dimension, 464
Finite Fourier transform, 179
Fixed point, 77
Fixed point iteration, see One-point iteration methods
fl(x), 13
Floating-point arithmetic, 11-17, 39
Floating-point representation, 12
accuracy, 15
chopping, 13
conversion, 12
exponent, 12
mantissa, 12
overflow, 16
radix point, 12
rounding, 13
underflow, 16
unit round, 15
Forward differences, 148
detection of noise by, 153
interpolation formula, 149
linearity, 152
relation to derivatives, 151
relation to divided differences, 148
tabular form, 149
Fourier series, 179, 219
Frobenius norm, 484
Gastinel's theorem, 533
Gaussian elimination, 508
backward error analysis, 536
Cholesky method, 524
compact methods, 523
complex systems, 575(P5)
error analysis, 529
error bounds, 535, 539
error estimates, 540
Gauss-Jordan method, 522
iterative improvement, 541
LU factorization*, 511
operation count, 512
pivoting, 515
positive definite matrices, 524, 576(P12)
residual correction*, 540
scaling, 518
tridiagonal matrices, 527
variants, 522
Wilkinson theorem, 536
Gaussian quadrature, 270. See also
Gauss-Legendre quadrature
convergence, 277
degree of precision, 272
error formulas, 272, 275
formulas, 272, 275
Laguerre, 308
positivity of weights, 275
singular integrals, 308
weights, 272
Gauss-Jacobi iteration, 545
Gauss-Jordan method, 522
matrix inversion, 523
Gauss-Legendre quadrature, 276
comparison to trapezoidal rule, 280
computational remarks, 281
convergence discussion, 279
error formula, 276
Peano kernel, 279
weights and nodes, 276
Gauss-Seidel method, 548
acceleration, 555
convergence, 548, 551
rate of convergence, 548
Generalized inverse, 636, 655(P27, P28)
Geometric series, 5, 6
matrix form, 491
Gerschgorin theorem, 588
Global error, 344
estimation, 392, 433
Gram-Schmidt method, 209, 242(P19)
Grid size, 341
Heat equation, 414
Hermite-Birkhoff interpolation, 190(P28, P29),
191(P30)
Hermite-Genocchi formula, 144
Hermite interpolation, 159
error formula, 161, 190(P27)
formulas, 160, 161, 189(P26)
Gaussian quadrature, 272
general interpolation problem, 163
piecewise cubic, 166
Hermitian matrix, 467. See also Symmetric
matrix
Hessenberg matrix, 624
Hexadecimal, 11
Higher order differential equations, 340
Hilbert matrix, 37, 207, 533
condition number, 534
eigenvalues, 593
Horner's method, 97
Householder matrices, 609, 651(P12)
QR factorization, 612
reduction of symmetric matrices, 615
transformation of vector, 611
ln(x), 228
Ill-conditioned problems, 36
differential equations, 339
eigenvalues, 599
inverse problems, 40
linear systems, 532
polynomials, 99
Ill-posed problem, 34, 329(P41)
Implicit multistep methods, 357
iterative solution, 367, 381
Infinite dimension, 464
Infinite integrand, 305
Infinite interval of integration, 305
Infinite product expansion, 117(P1)
Infinity norm, 10, 200
Influence function, 383
Initial value problem, 333
Inner product, 32, 208, 468
error, 32
Instability, 39. See also Ill-conditioned problems
Integral equation, 35, 444, 451(P5), 570, 575(P4),
655(P26)
Integral mean value theorem, 4
Integration, see Numerical integration
Intermediate value theorem, 3
Interp, 141
Interpolation:
exponential, 187(P11)
multivariable, 184
piecewise polynomial*, 162, 183
polynomial, 131
approximation theory, 158
backward difference formula, 151
barycentric formula, 186(P3)
at Chebyshev zeros, 228
definition, 131
divided difference formula, 140
error behavior, 157.
error in derivatives, 316
error formula, 134, 143, 155-157
example of log10 x, 136
existence theory, 132
forward difference formula, 149
Hermite, 159, 163
Hermite-Birkhoff, 190(P29)
inverse, 142
Lagrange formula, 134
non-convergence, 159
numerical integration*, 263
rounding errors, 137
Runge's example, 158
rational, 187(P12)
spline function*, 166
trigonometric*, 176
Inverse interpolation, 142
Inverse iteration, 628
computational remarks, 630
rate of convergence, 630
Inverse matrix, 466
calculation, 514, 523
error bounds, 538
iterative evaluation, 581(P32)
operation count, 514
Iteration methods, see also Rootfinding
differential equations, 367, 381
eigenvalues, 602
eigenvectors, 602, 628
linear systems:
comparison to Gaussian elimination, 554
conjugate gradient method*, 562, 566
error prediction, 543, 553
Gauss-Jacobi method, 545
Gauss-Seidel method, 548
general schema, 549
multigrid methods, 552
Poisson's equation, 557
rate of convergence, 542, 546, 548
SOR method, 555, 561
nonlinear systems of equations, 103
one-point iteration, 76
Iteration methods (Continued)
polynomial rootfinding, 94-102
Iterative improvement, 541
Interval analysis, 40
Jackson's theorem, 180, 224
Jacobian matrix, 105, 356
Jacobi method, 645
Jordan block, 480
Jordan canonical form, 480
Kronecker delta function, 466
Kronrod formulas, 283
Lagrange interpolation formula, 134
Laguerre polynomials, 211, 215
Least squares approximation:
continuous problem, 204
convergence, 217
formula, 217
weighted, 216
discrete problem, 633
data fitting problem, 637
definition, 634
generalized inverse, 636; 655(P27, P28)
QR solution procedure, 642
singular value solution, 635
stability, 637
Least squares data fitting, see Least squares
approximation, discrete problem
Legendre polynomial expansion, 218
Legendre polynomials, 210, 215
Level curves, 335
Linear algebra, 463
Linear combination, 464
Linear convergence, 56
acceleration, 83
rate, 56
Linear dependence, 464
Linear difference equations, 363, 397
Linear differential equation, 333, 340
Linear independence, 464
Linear iteration, 76, 103. See also One-point
iteration methods
Linearly convergent methods, 56
Linear systems of equations, 466, 507
augmented matrix, 510
BLAS, 522, 570
Cholesky method, 524
compact methods, 523
condition number, 530, 531
conjugate gradient method , 562, 566
Crout's method, 523
dense, 507
Doolittle's method, 523
error analysis, 529
error bounds, 535
Gaussian elimination , 508
variants, 522
Gauss-Jacobi method, 545
Gauss-Seidel method , 548
iterative solution , 540, 544
LINPACK, 522,570,663
LU factorization . 511
numerical solution, 507
over-determined systems, see Least squares
approximation, discrete problem, data fitting
problem
Poisson's equation , 557
residual correction method , 540
scaling, 518
solution by QR factorization, 615, 642
solvability of, 467
SOR method, 555, 561
sparse, 507, 570
tridiagonal, 527
Lines, method of, 414. See also Method of lines
LINPACK, 522, 570, 663
Lipschitz condition, 336, 355, 426
Local error, 368
Local solution, 368
Loss of significance error, 24, 28
LU factorization, 511
inverse iteration, 628
storage, 511
tridiagonal matrices, 527
uniqueness, 523
Machine errors, 20
Mantissa, 12
Mathematical modelling, 18
Mathematical software, 41, 661
Matrix, 465
banded, 527
canonical forms , 474
characteristic equation, 471
condition number, 530, 531
deflation, 609, 651(Pl3)
diagonally dominant, 546
geometric series theorem, 491
Hermitian, 467
Hilbert, 37, 207, 533
Householder , 609
identity, 466
inverse , 466, 514
invertibility of, 467
Jordan canonical form, 480
LU factorization , 511
nilpotent, 480
norm, 481. See also Matrix norm; Vector norm
notation, 618
operations on, 465
order, 465
orthogonal, 469. See also Orthogonal
transformations
permutation, 517
perturbation theorems, 493
positive definite, 499(P14), 524
principal axes theorem, 476
projection, 498(Pl0), 583(P39)
rank, 467
Schur normal form, 474
similar, 473
singular value decomposition, 478
symmetric , 467
tridiagonal, 527
unitary, 469, 499(P13)
zero, 466
Matrix norm, 484
column, 487
compatible, 484
Frobenius, 484
operator, 485
relation to spectral radius, 489
row, 488
Maximum norm, 200, 481
MD(e), 513
Mean value theorem, 4
Method of lines, 414
convergence, 415
explicit method, 416
heat equation, 414
implicit methods, 417
Midpoint method, differential equations, 361
asymptotic error formula, 363
characteristic equation, 364
error bound, 361
weak stability, 365
Midpoint numerical integration, 269, 325(P10)
Milne's method, 385
Minimax approximation, 201
equioscillation theorem, 224
error, 201
speed of convergence, 224
MINPACK, 114,570,663
Moore-Penrose conditions, 655(P27)
Muller's method, 73
Multiple roots, 87
instability, 98, 101
interval of uncertainty, 88
Newton's method, 88
noise, effect of, 88
removal of multiplicity, 90
Multipliers, 509
Multistep methods, 357
Adams-Bashforth, 385
Adams methods , 385
Adams-Moulton, 387
convergence, 360, 401
consistency condition, 358, 395
derivation, 358,381
error bound, 360
explicit, 357
general form, 357
general theory, 358,394
implicit, 357, 381
iterative solution, 367, 381
midpoint method , 361
Milne's method, 385
model equation, 363, 397
numerical integration, 384
order of convergence, 359
parasitic solution, 365, 402
Peano kernel, 382
relative stability, 404
root condition, 395
stability, 361, 396
stability regions, 404
stiff differential equations, 409
strong root condition, 404
trapezoidal method , 366
truncation error, 357
undetermined coefficients, method of, 381
unstable examples, 396
variable order, 390, 445
Multivariable interpolation, 184
Multivariable quadrature, 320
Near-minimax approximation, 225
Chebyshev polynomial expansion, 219, 225
forced oscillation, 232
interpolation, 228
Nelder-Mead method, 114
Nested multiplication, 96
Newton, 65
Newton backward difference, 151. See also
Backward differences
Newton-Cotes integration, 263
closed, 269
convergence, 266
error formula, 264
open, 269
Newton divided differences, 139. See also
Divided differences
Newton forward differences, 148. See also
Forward differences
Newton-Fourier method, 62
Newton's method, 58
boundary value problems, 439, 442
comparison with secant method, 71
convergence, 60
error estimation, 64
error formula, 60
multiple roots, 88
Newton-Fourier method, 62
nonlinear systems, 109, 442
polynomials, 97
reciprocal calculations, 54
square roots, 119(P12, P13)
Nilpotent matrix, 480
Node points, 250, 341
Noise in data, detection, 153
Noise in function evaluation, 21, 88
Nonlinear equations, See Rootfinding
Nonlinear systems, 103
convergence, 105, 107
fixed points iteration, 103
MINPACK, 114, 663
Newton's method, 109
Normal equation, 638
Norms, 480
compatibility, 484
continuity, 482
equivalence, 483
Euclidean,208,468
Frobenius, 484
matrix . 484
maximum, 200, 481
operator, 485
spectral radius, 485
two, 208
uniform, 200
vector . 481
Numerical differentiation, 315
error formula, 316
ill-posedness, 329(P41)
interpolation based, 315
noise, 318
undetermined coefficients, 317
Numerical integration, 249
adaptive , 300
Aitken extrapolation, 292
automatic programs, 302, 663
Boole's rule, 266
CADRE, 302
convergence, 267
comparison of programs, 303
corrected trapezoidal rule, 255
degree of precision, 266
Euler-MacLaurin formula , 285, 290
Gaussian quadrature , 270
Gauss-Legendre quadrature , 276
general schemata, 249
Kronrod formulas, 283
midpoint rule , 269
multiple integration, 320
Newton-Cotes formulas , 263
noise, effect of, 325(P13)
open formulas, 269
Patterson's method, 283
rectangular rule, 343
Richardson extrapolation, 294
Romberg's method, 298
Simpson's method , 256
singular integrals , 305
standard form, 250
three-eights rule, 264
trapezoidal rule , 252
O(h^p), O(1/n^p), 291, 352
One-point iteration methods, 76
convergence theory, 77-83
differential equations, 367, 413
higher order convergence, 82
linear convergence, 56
Newton's method , 58
nonlinear systems , 103
Operations count, 576(P6)
Gaussian elimination, 512
Operator norm, 485
Optimization, 111, 115, 499(P17)
conjugate directions method, 564
conjugate gradient method, 113
descent methods, 113
method of steepest descent, 113
MINPACK, 114, 663
Nelder-Mead method, 114
Newton's method, 112
quasi-Newton methods, 113
Order of convergence, 56
Orthogonal, 209. See also Orthogonal
transformations
basis, 469
family, 212
matrix, 469
Orthogonal polynomials, 207
Chebyshev polynomials , 211
Christoffel-Darboux identity, 216
Gram-Schmidt method, 209
Laguerre, 211, 215
Legendre, 210, 215
triple recursion relation, 214
zeros, 213, 243(P21)
Orthogonal projection, 470, 498(P10), 583(P39)
Orthogonal transformations:
Householder matrices , 609
planar rotations, 618, 652(P15)
preservation of length, 499(Pl3)
QR factorization , 612
symmetric matrix, reduction of, 615
transformation of vector, 611
Orthonormal family, 212
Overflow error, 17,22
Pade approximation, 237, 240(P5)
Parallel computers, 571
Parasitic solution, 365, 402
Parseval's equality, 218
Partial differential equations, 414, 557
Partial pivoting, 515
Patterson's method, 283
Peano kernel, 258, 279, 325(Pl0)
differential equations, 382
Permutation matrix, 517
Picard iteration, 451(P5)
Piecewise polynomial interpolation, 163, 183
evaluation, 165
Hermite form, 166
Lagrange form, 164
spline functions , 166
Pivot element, 515
Pivoting, 515
Planar rotation matrices, 618, 652(P15)
p-Norm, 481
Poisson's equation, 557
finite difference approximation, 557
Gauss-Seidel method, 553, 560
generalization, 582(P34)
SOR method, 556, 561
Polynew, 97
Polynomial interpolation, 131. See also
Interpolation, polynomial
Polynomial perturbation theory, 98, 99
Polynomial rootfinding, 94-102
bounds on roots, 95, 127(P36, P37)
companion matrix, 649(P2)
deflation, 97, 101
ill-conditioning, 99
Newton's method, 97
stability, 98
Positive definite matrix, 499(P14), 524
Cholesky's method, 524
Power method, 602
acceleration techniques, 606
Aitken extrapolation, 607
convergence, 604
deflation, 609, 651(P13)
Rayleigh-Ritz quotient, 605
Precision, degree, 266
Predictor-corrector method, 370, 373. See also
Multistep methods
Predictor fonnula, 370
Principal axes theorem, 476
Principal root, 398
Product integration, 310
Projection matrix, 498(P10), 583(P39)
Propagated error, 23, 28
qn*(x), 201
QR factorization, 612
practical calculation, 613
solution of linear systems, 615, 642
uniqueness, 614
QR method, 623
convergence, 625
preliminary preparation, 624
rate of convergence, 626
with shift, 626
QUADPACK, 303, 663
Quadratic convergence, 56
Quadratic form, 497(P7)
Quadrature, See Numerical integration
Quasi-Newton methods, 110, 113
rn'" (x), 204, 207
R". 463
Radix, 12
Rank, 467
Rational interpolation, 187(P12)
Rayleigh quotient, 608, 651(Pll)
Relative error, 17
Relative stability, 404
Region of absolute stability, 404
Residual correction method, 540
convergence, 542
error bounds, 543
Reverse triangle inequality, 201
Richardson error estimate, 296
Richardson extrapolation, 294, 372
RK, 420
RKF, 429
RKF45,431
Romberg, 298
Romberg integration, 296
CADRE, 302
Root condition, 395
Rootfinding, 53
acceleration, 85
Aitken extrapolation , 83
bisection method , 56
Brent's method , 91
enclosure methods, 58
error estimates, 57, 64, 70
Muller's method , 73
multiple roots , 87
Newton's method, 58, 108
nonlinear systems , 103
one-point iteration methods , 76
optimization, 111
polynomials , 94-102
secant method , 66
Steffenson's method, 122(P28)
stopping criteria, 64
Root mean square error, 204, 638
Rotations, planar, 618, 652(P15)
Rounding error, 13
differential equations, 349
Gaussian elimination, 536
interpolation, 137, 187(P8)
numerical differentiation, 318
numerical integration, 325(P13)
Row norm, 488
Runge-Kutta methods, 420
asymptotic error, 427
automatic programs, 431
classical fourth order fonnula, 423
consistency, 425
convergence, 426
Runge-Kutta methods (Continued)
derivation, 421
error estimation, 427
Fehlberg methods, 429
global error, 433
implicit methods, 433
low order formulas, 420, 423
Richardson error estimate, 425
RKF45, 431
stability, 427
truncation error, 420
Scalars, 463
Scaling, 518
Schur normal form, 474
Secant method, 66
error formula, 67, 69
comparison with Newton's methods, 71
convergence, 69
Shooting methods, 437
Significant digits, 17
Similar matrices, 473
Simpson's rule, 256
adaptive, 300
Aitken extrapolation, 292
asymptotic error formula, 258
composite, 257
differential equations, 384
error formulas, 257, 258
Peano kernel, 260
product rule, 312
Richardson extrapolation, 295
Simultaneous displacements, 545. See also
Gauss-Jacobi iteration
Single step methods, 418. See also Runge-Kutta
methods
Singular integrands, 305
analytic evaluation, 310
change of variable, 305
Gaussian quadrature , 308
IMT method, 307
product integration, 310
Singular value decomposition, 478, 500(P19), 634
computation of, 643
Singular values, 478
Skew-symmetric matrix, 467
Solve, 521
SOR method, 555, 561
Sparse linear systems, 507, 570
eigenvalue problem, 646
Poisson's equation, 557
Special functions, 237
Spectral radius, 485
Spline function, 166
B-splines, 173
complete spline interpolant, 169
construction, 167
error, 169
natural spline interpolation, 192(P38)
not-a-knot condition, 171
optimality, 170, 192(P38)
Square root, calculation, 119(P12, P13)
Stability, 34
absolute, 406
differential equations, 337
numerical methods, 349, 361, 396
eigenvalues, 592, 599
Euler's method, 349
numerical methods, 38
polynomial rootfinding, 99
relative stability, 404
weak, 365
Stability regions, 404
Standard basis, 465
Steffenson's method, 122(P28)
Stiff differential equations, 409
A-stable methods, 371, 408, 412
backward differentiation formulas, 410
backward Euler method, 409
iteration methods, 413
method of lines, 414
trapezoidal method, 412
Stirling's formula, 279
Strong root condition, 404
Sturm sequence, 620
Successive displacements, 548. See also
Gauss-Seidel method
Successive over-relaxation, See SOR method
Summation errors, 29-34
chopping vs. rounding, 30
inner products, 32
loss of significance errors, 27
statistical analysis, 31
Supercomputer, 40, 571
Symbolic mathematics, 41
Symmetric matrix, 467
deflation, 651(Pl3)
eigenvalue computation, 619, 623
eigenvalue error bound, 595
eigenvalues, 476
eigenvalue stability, 593, 595
eigenvector computation, 631
Jacobi's method, 645
positive definite, 499(P14), 576(P12)
QR method , 623
Rayleigh-Ritz quotient, 608, 651(P11)
similarity to diagonal matrix, 476
Sturm sequence, 620
tridiagonal matrix, reduction to, 615
Wielandt-Hoffman theorem, 595
Systems of differential equations, 339, 355, 397,
437
Systems of linear equations, See Linear systems
of equations
Taylor series method, differential equations, 420
Taylor's theorem, 4, 199
geometric series theorem, 5, 6
important expansions, 5
two-dimensional form, 7
Telescoping of Taylor series, 245(P39)
Three-eights quadrature rule, 264
Trace, 472
Transpose, 465
Trapezoidal methods:
differential equations, 366
A-stability, 371, 456(P37)
asymptotic error, 370
convergence, 370
global error, 379
iterative solution, 367
local error, 368
Richardson extrapolation, 372
stability, 370
stability region, 409
numerical integration, 252
asymptotic error formula, 254
comparison to Gaussian quadrature, 280
composite, 253
corrected trapezoidal rule, 255
error formula, 253
Euler-MacLaurin formula , 285
Peano kernel, 259, 324(P5)
periodic integrands, 288
product rule, 311
Richardson extrapolation, 294
Triangle inequality, 10, 200, 468
Triangular decomposition, see LU factorization
Tridiagonal linear systems, solution, 527
Tridiagonal matrix eigenvalue problem, 619
error analysis, 622
Given's method, 619
Householder's method, 619
Sturm sequences, 620
Trigonometric functions, discrete orthogonality,
178, 193(P42), 233
Trigonometric interpolation, 176
convergence, 180
existence, 178, 179
fast Fourier transform, 181
Trigonometric polynomials, 176
Triple recursion relation, 39
Truncation error, 20, 342, 357
Two-norm, 208
Undetermined coefficients, method of, 317
Uniform approximation, see Approximation of
functions
Uniform norm, 200
Unitary matrix, 469, 499(P13). See also
Orthogonal transformations
Unit roundoff error, 15
Unstable problems, 34. See also Ill-conditioned
problems
Vandermonde matrix, 132, 185(P1)
Vector computer, 571
Vector norm, 10, 481
continuity of, 482
equivalence, 483
maximum, 200, 481
p-norm, 481
Vectors, 463
angle between, 469
biorthogonal, 597
convergence, 483
dependence,464
independence,464
norm , 481. See also Vector norm
orthogonal, 469
Vector space, 463
basis, 464
dimension, 464
inner product, 467
orthogonal basis, 469
Weak stability, 365
Weierstrass theorem, 198
Weight function, 206, 251
Weights, 250
Well-posed problem, 34. See also Stability
Wielandt-Hoffman theorem, 595
Zeros, see Rootfinding