First Steps in Numerical Analysis
Prologue 1
1 Historical 1
2 Numerical Analysis today 2
3 This book 3
ERRORS
NONLINEAR EQUATIONS
FINITE DIFFERENCES
Step 18 Tables 79
1 Tables of values 79
2 Finite differences 80
3 Influence of round-off errors 80
Step 20 Polynomials 88
1 Finite differences of a polynomial 88
2 Example 89
3 Approximation of a function by a polynomial 89
INTERPOLATION
CURVE FITTING
NUMERICAL DIFFERENTIATION
NUMERICAL INTEGRATION
Bibliography 216
Index 217
PREFACE TO THE SECOND EDITION
First Steps in Numerical Analysis, originally published in 1978, is now in its twelfth
impression. It has been widely used in schools, polytechnics, and universities
throughout the world. However, we decided that after a life of seventeen years in
the classroom and lecture theatre, the contents of the book should be reviewed.
Feedback from many users, both teachers and students, could be incorporated; and
the development of the subject suggested that some new topics might be included.
This Second Edition of the book is the outcome of our consideration of these
matters.
The changes we have made are not very extensive, which reflects our view that
the syllabus for a first course in Numerical Analysis must continue to include
most of the basic topics in the First Edition. However, the result of rapid changes
in computer technology is that some aspects are obviously less important than
they were, and other topics have become more important. We decided that less
should be said about finite differences, for example, but more should be said about
systems of linear equations and matrices. New material has been added on curve
fitting (for example, use of splines), and more has been given on the solution of
differential equations. The total number of Steps has increased from 31 to 35.
For the benefit of both teachers and students, additional exercises have been set
at the end of many of the Steps, and brief answers again supplied. Also, a set
of Applied Exercises has been included, to challenge students to apply numerical
methods in the context of ‘real world’ applications. To make it easier for users
to implement the given algorithms in a computer program, the flowcharts in the
Appendix of the First Edition have been replaced by pseudo-code. The method of
organizing the material into STEPS (of a length suitable for presentation in one or
two hours) has been retained, for this has been a popular feature of the book.
We hope that these changes and additions, together with the new typesetting
used, will be found acceptable, enhancing and attractive; and that the book will
continue to be widely used. Many of the ideas presented should be accessible to
students in mathematics at the level of Seventh Form in New Zealand, Year 12 in
Australia, or GCE A level in the United Kingdom. The addition of more (optional)
starred Steps in this Edition makes this book also suitable for first and second year
introductory Numerical Analysis courses in polytechnics and universities.
R. J. Hosking
S. Joe
D. C. Joyce
J. C. Turner
1995
PREFACE TO THE FIRST EDITION

PROLOGUE
1 Historical
Although some may regard Numerical Analysis as a subject of recent origin, this
in fact is not so. In the first place, it is concerned with the provision of results in the
form of numbers, which no doubt were in use by very early man. More recently,
the Babylonian and ancient Egyptian cultures were noteworthy for numerical
expertise, particularly in association with astronomy and civil engineering. There
is a Babylonian tablet dated approximately 2000 B.C. giving the squares of the
integers 1–60, and another which records the eclipses back to about 750 B.C. The
Egyptians dealt with fractions, and even invented the method of false position for
the solution of nonlinear algebraic equations (see Step 8).
It is probably unnecessary to point out that the Greeks produced a number of
outstanding mathematicians, many of whom provided important numerical results.
In about 220 B.C. Archimedes gave the result 3 10/71 < π < 3 1/7. The iterative
procedure for √a involving (1/2)(xn + a/xn), usually attributed to Newton (see
Step 10), was in fact used by Heron the elder in about 100 B.C. The
Pythagoreans considered the numerical summation of series, and Diophantus in
about 250 A.D. gave a process for the solution of quadratic equations.
Subsequently, progress in numerical work occurred in the Middle East. Apart
from the development of the modern arithmetical notation commonly referred to
as Arabic, tables of the trigonometric functions sine and tangent were constructed
by the tenth century. Further east, in India and China, there was parallel (although
not altogether separate) mathematical evolution.
In the West, the Renaissance and scientific revolution involved a rapid expansion
of mathematical knowledge, including the field of Numerical Analysis. Such
great names of mathematics as Newton, Euler, Lagrange, Gauss, and Bessel
are associated with modern methods of Numerical Analysis, and testify to the
widespread interest in the subject.
In the seventeenth century, Napier produced a table of logarithms, Oughtred
invented the slide rule, and Pascal and Leibniz pioneered the development of
calculating machines (although these were not produced in quantity until the
nineteenth century). The provision of such machines brought a revolution in
numerical work, a revolution greatly accelerated since the late 1940’s by the
development of modern computers.
2 Numerical Analysis today
The extent of this revolution certainly becomes clearer when we consider the
great advances in computing power in the past fifty years. The fastest single-
processor supercomputers currently available are hundreds of thousands of times
faster than the earliest computers. The micro-computers that students have in
their place of study are many times faster (and smaller) than the mini-computers
that were available when the first edition of this book came out. Even hand-
held scientific calculators can perform calculations that were once the domain of
big mainframe computers. New procedures have been and are being developed;
computations and data analyses which could not have been contemplated even
as a life’s work a few decades ago are now solved in quite acceptable times.
There is now quite widespread use of vector machines for large-scale scientific
computation, and increasing use is being made of parallel computers with two or
more processors (perhaps even thousands) over which a computing task can be
spread. The equipment at our disposal is the dominant new feature in the field of
Numerical Analysis.
Since many such equations are nonlinear and therefore not normally amenable to analytic solution,
their numerical solution is important.
In an introductory text, of course, it is not possible to deal in depth with other
than a few basic topics. Nevertheless, we hope by these few remarks to encourage
students not only to view their progress through this book as worthwhile, but also
to venture beyond it with enthusiasm and success.
3 This book
Each main topic treated in the book has been divided into a number of Steps. The
first five are devoted to the question of errors arising in numerical work. We believe
that a thorough understanding of errors is necessary for a proper appreciation of
the art of using numerical methods. The succeeding Steps deal with concepts
and methods used in the problem areas of nonlinear equations, systems of linear
equations, the eigenvalue problem, interpolation, curve fitting, differentiation,
integration, and ordinary differential equations.
Most of the unstarred Steps in the book will be included in any first course.
The starred Steps (‘side-steps’) include material which the authors consider to
be extra, but not necessarily extensive, to a first course. The material in each
Step is intended to be an increment of convenient size, perhaps dependent on the
understanding of earlier (but not later) unstarred Steps. Ideally, the consideration
of each Step should involve at least the Exercises, carried out under the supervision
of the teacher where necessary. We emphasize that Numerical Analysis demands
considerable practical experience, and further exercises could also be valuable.
Some additional exercises of an applied nature are given towards the end of the
book (see pages 160–162).
Within each Step, the concepts and method to be learned are presented first,
followed by illustrative examples. Students are then invited to test their immediate
understanding of the text by answering two or three Checkpoint questions. These
concentrate on salient points made in the Step, and induce the student to think
about and re-read the text just covered; they may also be useful for revision
purposes. Brief answers are provided at the end of the book for the Exercises set
in each Step.
After much consideration, the authors decided not to include computer programs
for the various algorithms introduced in the Steps. However, they have provided
pseudo-code in an Appendix. In our experience, students do benefit if they study
the pseudo-code of a method at the same time as they learn it in a Step. If they
are familiar with a programming language they should be encouraged to convert
at least some of the pseudo-code into computer programs, and apply them to the
set Exercises.
To encourage further reading, reference is made at various places in the text to
books listed in the Bibliography on page 216.
STEP 1
ERRORS 1
Sources of error
1 Example
To illustrate the ways in which the above errors arise, let us take the example of
the simple pendulum (see Figure 1). If various physical assumptions are made,
including that air resistance and friction at the pivot are negligible, we obtain the
simple (nonlinear) differential equation
mℓ (d²θ/dt²) = −mg sin θ
In introductory mechanics courses† the customary next step is to use the ap-
proximation sin θ ≈ θ (assuming that θ is small) to produce the even simpler
(linear) differential equation
d²θ/dt² = −ω²θ,  where ω² = g/ℓ
† In practice one could reduce the type (a) error by using a numerical method (see Step 35) to
solve the more realistic (nonlinear) differential equation
d²θ/dt² = −ω² sin θ
FIGURE 1. The simple pendulum (length ℓ, mass m, angular displacement θ).
The period of oscillation is then T = 2π/ω = 2π√(ℓ/g).
Up to this point we have encountered only errors of type (a); the other errors
are introduced when we try to obtain a numerical value for T in a particular case.
Thus both ℓ and g will be subject to measurement errors; π must be represented as
a finite decimal number, the square root must be computed (usually by an iterative
process) after dividing ` by g (which may involve a rounding error), and finally
the square root must be multiplied by 2π.
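As a rough illustration (ours, not the book's), the short Python sketch below carries out the calculation just described, using the values of Exercise 1 (ℓ = 75 cm, g = 981 cm/s²); every step is subject to the representation and rounding errors mentioned above.

import math

# Measured quantities from Exercise 1; each carries a measurement error in
# addition to the rounding errors introduced by the calculation itself.
length = 75.0    # cm
g = 981.0        # cm/s^2

# T = 2*pi*sqrt(l/g): the division, the square root (computed iteratively by
# the library), and the multiplication by 2*pi each introduce round-off.
T = 2.0 * math.pi * math.sqrt(length / g)
print(f"T = {T:.4f} s")   # approximately 1.737 s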
Checkpoint
EXERCISES
When carrying out the following calculations, notice all the points at which errors
of one kind or another arise.
1. Calculate the period of a simple pendulum of length 75 cm, given that g is
981 cm/s2 .
ERRORS 2
Approximation to numbers
1 Number representation
We humans normally represent a number in decimal (base 10) form, although
modern computers use binary (base 2) and also hexadecimal (base 16) forms. Nu-
merical calculations usually involve numbers that cannot be represented exactly
by a finite number of digits. For instance, the arithmetical operation of division
often gives a number which does not terminate; the decimal (base 10) representa-
tion of 2/3 is one example. Even a number such as 0.1 which terminates in decimal
form would not terminate if expressed in binary form. There are also the irrational
numbers such as the value of π, which do not terminate. In order to carry out
a numerical calculation involving such numbers, we are forced to approximate
them by a representation involving a finite number of significant digits (S ). For
practical reasons (for example, the size of the back of an envelope or the ‘storage’
available in a machine), this number is usually quite small. Typically, a ‘single
precision’ number on a computer has an accuracy of only about 6 or 7 decimal
digits (see below).
To five significant digits (5S), 2/3 is represented by 0.66667, π by 3.1416, and √2
by 1.4142. None of these is an exact representation, but all are correct to within
half a unit of the fifth significant digit. Numbers should normally be presented in
this sense, correct to the number of digits given.
If the numbers to be represented are very large or very small, it is convenient to
write them in floating point notation (for example, the speed of light 2.99792 × 10⁸
m/s, or the electronic charge 1.6022 × 10⁻¹⁹ coulomb). As indicated, we separate
the significant digits (the mantissa) from the power of ten (the exponent); the form
in which the exponent is chosen so that the magnitude of the mantissa is less than
10 but not less than 1 is referred to as scientific notation.
In 1985 the Institute of Electrical and Electronics Engineers published a stan-
dard for binary floating point arithmetic. This standard, known as the IEEE
Standard 754, had been widely adopted (it is very common on workstations used
for scientific computation). The standard specifies a format for ‘single precision’
numbers and a format for ‘double precision’ numbers. The single precision format
allows 32 binary digits (known as bits) for a floating point number with 23 of these
bits allocated for the mantissa. In the double precision format the values are 64
and 52 bits, respectively. On conversion from binary to decimal, it turns out that
any IEEE Standard 754 single precision number has an accuracy of about six or
seven decimal digits, and a double precision number an accuracy of about 15 or
16 decimal digits.
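A short sketch in Python (our illustration; it assumes the numpy library is available) shows the difference in accuracy between the two IEEE formats.

import numpy as np

one_third = 1.0 / 3.0                   # exact value 0.333... (non-terminating)
single = np.float32(one_third)          # IEEE 754 single precision (23-bit mantissa)
double = np.float64(one_third)          # IEEE 754 double precision (52-bit mantissa)

print(f"single: {single:.20f}")         # correct to about 7 decimal digits
print(f"double: {double:.20f}")         # correct to about 16 decimal digits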
2 Round-off error
The simplest way of reducing the number of significant digits in the representation
of a number is merely to ignore the unwanted digits. This procedure, known as
chopping, was used by many early computers. A more common and better
procedure is rounding, which involves adding 5 to the first unwanted digit, and
then chopping. For example, π chopped to four decimal places (4D ) is 3.1415, but
it is 3.1416 when rounded; the representation 3.1416 is correct to five significant
digits (5S ). The error involved in the reduction of the number of digits is called
round-off error. Since π is 3.14159 . . ., we could remark that chopping has
introduced much more round-off error than rounding.
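The two reductions can be imitated in a few lines of Python (an illustration of ours, valid for positive numbers).

import math

pi = math.pi                                  # 3.14159265...
chopped = math.floor(pi * 10**4) / 10**4      # chop to 4 decimal places -> 3.1415
rounded = round(pi, 4)                        # round to 4 decimal places -> 3.1416

print(chopped, rounded)                       # 3.1415 3.1416
print(abs(pi - chopped), abs(pi - rounded))   # chopping leaves the larger round-off error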
3 Truncation error
Numerical results are often obtained by truncating an infinite series or iterative
process (see Step 5). Whereas round-off error can be reduced by working to more
significant digits, truncation error can be reduced by retaining more terms in the
series or more steps in the iteration; this, of course, involves extra work (and
perhaps expense!).
4 Mistakes
In the language of Numerical Analysis, a mistake (or blunder) is not an error!
A mistake is due to fallibility (usually human, not machine). Mistakes may be
trivial, with little or no effect on the accuracy of the calculation, or they may be
so serious as to render the calculated results quite wrong. There are three things
which may help to eliminate mistakes:
(a) care;
(b) checks, avoiding repetition;
(c) knowledge of the common sources of mistakes.
Common mistakes include: transposing digits (for example, reading 6238 as
6328); misreading repeated digits (for example, reading 62238 as 62338); misread-
ing tables (for example, referring to a wrong line or a wrong column); incorrectly
positioning a decimal point; overlooking signs (especially near sign changes).
5 Examples
The following illustrate rounding to four decimal places (4D ):
4/3 → 1.3333;   π/2 → 1.5708;   1/√2 → 0.7071
Checkpoint
EXERCISES
ERRORS 3
Error propagation and generation
1 Absolute error
The absolute error is the absolute difference between the exact number x and the
approximate number x*; that is,
eabs = |x − x*|
A number correct to n decimal places has
eabs ≤ 0.5 × 10⁻ⁿ
More generally, we expect that the absolute error involved in any approximate number
is no more than five units at the first neglected digit.
2 Relative error
The relative error is the ratio of the absolute error to the absolute exact number;
that is,
erel = eabs/|x| ≤ eabs/(|x*| − eabs)
(Note that the upper bound follows from the triangle inequality; thus
|x*| = |x + x* − x| ≤ |x| + |x* − x|
so that |x| ≥ |x*| − eabs.) If eabs ≪ |x*|, then
erel ≈ eabs/|x*|
Correspondingly, a number correct to n significant digits has erel ≤ 5 × 10⁻ⁿ.
3 Error propagation
Consider two numbers x = x* + e1 and y = y* + e2.
(a) Under the operations addition or subtraction, we have
x ∓ y = x* ∓ y* + (e1 ∓ e2)
so that
e ≡ (x ∓ y) − (x* ∓ y*) = e1 ∓ e2
and hence
|e| ≤ |e1| + |e2|
that is,
max(|e|) = |e1| + |e2|
The magnitude of the propagated error is therefore not more than the sum of
the initial absolute errors; of course, it may be zero.
(b) Under the operation multiplication,
x y − x*y* = x* e2 + y* e1 + e1 e2
so that
|(x y − x*y*)/(x* y*)| ≤ |e1/x*| + |e2/y*| + |e1 e2/(x* y*)|
and so
max(erel) ≈ |e1/x*| + |e2/y*|
assuming e1 e2/(x* y*) is negligible. The maximum relative error propagated
is approximately the sum of the initial relative errors. The same result is
obtained when the operation is division.
4 Error generation
Often (for example, in a computer) an operation ⊗ is also approximated, by an
operation ⊗∗ , say. Consequently, x ⊗ y is represented by x ∗ ⊗∗ y ∗ . Indeed, one
has
|x ⊗ y − x ∗ ⊗∗ y ∗ | = |(x ⊗ y − x ∗ ⊗ y ∗ ) + (x ∗ ⊗ y ∗ − x ∗ ⊗∗ y ∗ )|
≤ |x ⊗ y − x ∗ ⊗ y ∗ | + |x ∗ ⊗ y ∗ − x ∗ ⊗∗ y ∗ |
so that the accumulated error does not exceed the sum of the propagated and
generated errors. Examples may be found in Step 4.
5 Example
Here we evaluate (as accurately as possible) the following:
(i) 3.45 + 4.87 − 5.16
(ii) 3.55 × 2.73
There are two methods which the student may consider, the first of which
is to invoke the concepts of absolute and relative error as defined in this Step.
Thus the result for (i) is 3.16 ± 0.015, since the maximum absolute error is
0.005 + 0.005 + 0.005 = 0.015. One concludes that the answer is 3 (to 1S ), for
the number certainly lies between 3.145 and 3.175. In (ii), the product 9.6915 is
subject to the maximum relative error
0.005/3.55 + 0.005/2.73 + (0.005/3.55) × (0.005/2.73) ≈ (1/3.55 + 1/2.73) × 0.005
hence the maximum (absolute) error ≈ (2.73 + 3.55) × 0.005 ≈ 0.03, so that the
answer is 9.7.
A second approach is to use ‘interval arithmetic’. Thus, the approximate number
3.45 represents a number in the interval (3.445, 3.455), etc. Consequently, the
result for (i) lies in the interval bounded below by
3.445 + 4.865 − 5.165 = 3.145
and above by
3.455 + 4.875 − 5.155 = 3.175
Similarly, in (ii) the result lies in the interval bounded below by
3.545 × 2.725 ≈ 9.66
and above by
3.555 × 2.735 ≈ 9.72
Hence one again concludes that the approximate numbers 3 and 9.7 correctly
represent the respective results to (i) and (ii).
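The same interval reasoning can be sketched in Python (our illustration, not the book's); the function name and rounding of the printed bounds are ours.

# Each 2D value x stands for some number in the interval (x - 0.005, x + 0.005).
def interval(x, half_width=0.005):
    return (x - half_width, x + half_width)

a, b, c = interval(3.45), interval(4.87), interval(5.16)
# (i) a + b - c: the lower bound uses the smallest a, b and the largest c.
low_i = a[0] + b[0] - c[1]
high_i = a[1] + b[1] - c[0]
print(round(low_i, 3), round(high_i, 3))      # 3.145 ... 3.175, so only '3' is certain

p, q = interval(3.55), interval(2.73)
# (ii) p * q: for positive intervals the bounds come from the endpoint products.
print(round(p[0] * q[0], 4), round(p[1] * q[1], 4))   # about 9.66 ... 9.72, so 9.7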
Checkpoint
EXERCISES
Evaluate the following as accurately as possible, assuming all values are correct
to the number of digits given:
1. 8.24 + 5.33.
2. 124.53 − 124.52.
3. 4.27 × 3.13.
4. 9.48 × 0.513 − 6.72.
5. 0.25 × 2.84/0.64.
6. 1.73 − 2.16 + 0.08 + 1.00 − 2.23 − 0.97 + 3.02.
STEP 4
ERRORS 4
Floating point arithmetic
2 Multiplication
The exponents are added and the mantissae are multiplied; the final result is
obtained by rounding (after shifting the mantissa right and increasing the exponent
by 1, if necessary). Thus:
3 Division
The exponents are subtracted and the mantissae are divided; the final result is
obtained by rounding (after shifting the mantissa left and reducing the exponent
by 1, if necessary). Thus:
5.43 × 10¹/(4.55 × 10²) = 1.19340… × 10⁻¹ → 1.19 × 10⁻¹
−2.75 × 10²/(9.87 × 10⁻²) = −0.278622… × 10⁴ → −2.79 × 10³
4 Expressions
The order of evaluation is determined in a standard way and the result of each
operation is a normalized floating point number. Thus:
(6.18 × 10¹ + 1.84 × 10⁻¹)/((4.27 × 10¹) × (3.68 × 10¹))
   → 6.20 × 10¹/(1.57 × 10³) = 3.94904… × 10⁻² → 3.95 × 10⁻²
5 Generated error
We note that all the above examples (except the subtraction and the first addition)
involve generated errors which are relatively large because of the small number
of digits in the mantissae. Thus the generated error in
2.77 × 10² + 7.55 × 10² = 10.32 × 10² → 1.03 × 10³
is 0.002 × 10³. Since the propagated error in this example may be as large as
0.01 × 10² (assuming the operands are correct to 3S), we can use the result
given in Section 4 of Step 3 to deduce that the accumulated error cannot exceed
0.002 × 10³ + 0.01 × 10² = 0.003 × 10³.
6 Consequences
The peculiarities of floating point arithmetic lead to some unexpected and unfor-
tunate consequences, including the following:
(a) Addition or subtraction of a small (but nonzero) number may have no effect,
for example,
5.18 × 10² + 4.37 × 10⁻¹ = 5.18 × 10² + 0.00437 × 10²
                         = 5.18437 × 10² → 5.18 × 10²
thus, the additive identity is not unique.
(b) Frequently the result of a × (1/a) is not 1, for example, if a = 3.00 × 10⁰,
then
1/a → 3.33 × 10⁻¹
and
a × (1/a) → 9.99 × 10⁻¹
thus, the multiplicative inverse may not exist.
(c) The result of (a + b) + c is not always the same as the result of a + (b + c),
for example, if
a = 6.31 × 10¹,   b = 4.24 × 10⁰,   c = 2.47 × 10⁻¹
then
(a + b) + c = (6.31 × 10¹ + 0.424 × 10¹) + 2.47 × 10⁻¹
            → 6.73 × 10¹ + 0.0247 × 10¹
            → 6.75 × 10¹
whereas
a + (b + c) = 6.31 × 10¹ + (4.24 × 10⁰ + 0.247 × 10⁰)
            → 6.31 × 10¹ + 4.49 × 10⁰
            → 6.31 × 10¹ + 0.449 × 10¹
            → 6.76 × 10¹
thus, the associative law for addition does not always hold.
Examples involving adding many numbers of varying size indicate that adding
in order of increasing magnitude is preferable to adding in the reverse order.
(d) Subtracting a number from another nearly equal number may result in loss
of significance or cancellation error. To illustrate this loss of accuracy, sup-
pose we evaluate f (x) = 1 − cos x for x = 0.05 using three-digit decimal
normalized floating point arithmetic with rounding. Then
1 − cos(0.05) = 1 − 0.99875…
   → 1.00 × 10⁰ − 0.999 × 10⁰
   → 1.00 × 10⁻³
Although the value of 1 is exact and cos(0.05) is correct to 3S when expressed
as a three-digit floating point number, their computed difference is correct
to only 1S! (The two zeros after the decimal point in 1.00 × 10⁻³ ‘pad’ the
number.)
The approximation 0.999 ≈ cos(0.05) has a relative error of about 2.5 × 10⁻⁴.
By comparison, the relative error of 1.00 × 10⁻³ ≈ 1 − cos(0.05) is about
0.2 and so much larger. Thus subtraction of two nearly equal numbers should
be avoided whenever possible.
For f (x) = 1 − cos x we can avoid this loss of significant digits by writing
1 − cos x = (1 − cos x)(1 + cos x)/(1 + cos x) = (1 − cos²x)/(1 + cos x) = sin²x/(1 + cos x)
This last formula is more suitable for calculations when x is close to 0. It can
be verified that the more accurate approximation of 1.25 × 10⁻³ is obtained
for 1 − cos(0.05) when three-digit floating point arithmetic is used.
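Python's decimal module can imitate the three-digit decimal arithmetic used in the text; the following sketch (ours) reproduces points (c) and (d). The default rounding rule differs slightly from the text's, but agrees for these values.

from decimal import Decimal, getcontext

getcontext().prec = 3                 # three-digit decimal floating point, as in the text

a, b, c = Decimal("63.1"), Decimal("4.24"), Decimal("0.247")
print((a + b) + c)                    # 67.5  (the text's 6.75 x 10^1)
print(a + (b + c))                    # 67.6  (the text's 6.76 x 10^1)

cos_x = Decimal("0.999")              # cos(0.05) correct to 3S
print(Decimal(1) - cos_x)             # 0.001: only one significant digit survives
sin_x = Decimal("0.0500")             # sin(0.05) correct to 3S
print(sin_x * sin_x / (Decimal(1) + cos_x))   # 0.00125, the more accurate value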
Checkpoint
EXERCISES
ERRORS 5
Approximation to functions
For example, with f(x) = sin x we have
f′(x) = cos x
f″(x) = −sin x
and so on, so that expansion about x = 0 gives
sin x = x − x³/3! + x⁵/5! − ⋯ + (−1)^(k−1) x^(2k−1)/(2k − 1)! + R2k−1
with
R2k−1 = [(−1)^k x^(2k+1)/(2k + 1)!] cos ξ
Note that this expansion has only odd-powered terms so, although the polynomial
approximation is of degree (2k − 1), it has only k terms. Moreover, the absence of
even-powered terms means that the same polynomial approximation is obtained
with n = 2k, and hence R2k−1 = R2k ; the remainder term R2k−1 given above is
actually the expression for R2k . Since | cos ξ | ≤ 1, then
|R2k−1| ≤ |x|^(2k+1)/(2k + 1)!;
if 5D accuracy is required, it follows that we need only take k = 2 at x = 0.1, and
k = 4 at x = 1 (since 9! = 362 880). On the other hand, the expansion for the
natural (base e) logarithm,
ln(1 + x) = x − x²/2 + x³/3 − ⋯ + (−1)^(n−1) xⁿ/n + Rn
where
Rn = (−1)ⁿ x^(n+1) / [(n + 1)(1 + ξ)^(n+1)]
is less suitable. Although only n = 4 terms are needed to give 5D accuracy at
x = 0.1, n = 13 is required for 5D accuracy at x = 0.5, and n = 19 gives just 1D
accuracy at x = 1!
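These claims about the two expansions can be checked with a few lines of Python (our sketch; the function names are ours).

import math

# Remainder bound for sin x truncated after k terms: |R| <= |x|**(2k+1)/(2k+1)!
def sin_remainder_bound(x, k):
    return abs(x) ** (2 * k + 1) / math.factorial(2 * k + 1)

print(sin_remainder_bound(0.1, 2))    # about 8.3e-8 -> 5D accuracy with k = 2
print(sin_remainder_bound(1.0, 4))    # about 2.8e-6 -> 5D accuracy with k = 4

# For ln(1 + x) the terms only decay like x**n / n, so convergence is much slower.
def ln_partial_sum(x, n):
    return sum((-1) ** (j - 1) * x ** j / j for j in range(1, n + 1))

print(abs(ln_partial_sum(1.0, 19) - math.log(2.0)))   # still about 0.025 at x = 1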
Further, we remark that the Taylor series is not only used extensively to represent
functions numerically, but also to analyse the errors involved in various algorithms
(for example, see Steps 8, 9, 10, 30, and 31).
2 Polynomial approximation
The Taylor series provides a simple method of polynomial approximation (of
chosen degree n),
f(x) ≈ a0 + a1x + a2x² + ⋯ + anxⁿ
4 Recursive procedures
While a truncated series with few terms may be a practical way to compute values
of a function, a number of arithmetic operations is involved, so a recursive procedure
which reduces the arithmetic required may be preferred if one is available.
For example, the values of the polynomial
P(x) = a0 + a1x + a2x² + ⋯ + anxⁿ
and of its derivative P′(x) at some point x̄ may be obtained from the recurrence relations
pk = pk−1 x̄ + an−k,    qk = qk−1 x̄ + pk−1,    k = 1, 2, …, n
with p0 = an and q0 = 0.
p1 = p0 x̄ + an−1 = an x̄ + an−1                     q1 = q0 x̄ + p0 = an
p2 = p1 x̄ + an−2 = an x̄² + an−1 x̄ + an−2           q2 = q1 x̄ + p1 = 2an x̄ + an−1
⋮                                                  ⋮
pn = P(x̄)                                          qn = P′(x̄)
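The recursive procedure tabulated above translates directly into code; a minimal Python sketch follows (the function name and the coefficient ordering a0, …, an are our own choices).

def horner_with_derivative(coeffs, x):
    """Evaluate P(x) and P'(x), where coeffs = [a0, a1, ..., an].

    p follows the recurrence p_k = p_{k-1}*x + a_{n-k}, starting from p0 = an;
    q follows q_k = q_{k-1}*x + p_{k-1}, starting from q0 = 0.
    """
    p, q = coeffs[-1], 0.0
    for a in reversed(coeffs[:-1]):
        q = q * x + p          # uses the previous p, as in the recurrence
        p = p * x + a
    return p, q                # P(x), P'(x)

# Example: P(x) = 1 + 2x + 3x^2, so P(2) = 17 and P'(x) = 2 + 6x gives P'(2) = 14.
print(horner_with_derivative([1.0, 2.0, 3.0], 2.0))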
Checkpoint
EXERCISES
1. Find the Taylor series expansions about x = 0 for each of the following
functions.
(a) cos x.
(b) 1/(1 − x).
(c) e^x.
For each series also determine a general remainder term.
2. For each of the functions in Exercise 1, evaluate f (0.5) using a calculator
and by using the first four terms of your Taylor expansion.
3. Use the remainder term found in Exercise 1(c) to find the value of n required
in the Taylor series for f(x) = e^x about x = 0 to give 5D accuracy for all x
between 0 and 1.
4. Truncate the Taylor series found in Exercise 1(c) to give linear, quadratic,
and cubic polynomial approximations for f(x) = e^x in the neighbourhood
of x = 0. Use the remainder term to estimate (to the nearest 0.1) the range
over which each polynomial approximation yields results correct to 2D.
NONLINEAR EQUATIONS 1
Nonlinear algebraic and transcendental equations
The first nonlinear equation encountered in algebra courses is usually the quadratic
equation
ax² + bx + c = 0
and all students will be familiar with the formula for its roots:
x = [−b ± √(b² − 4ac)] / (2a)
The formula for the roots of a general cubic is somewhat more complicated and
that for a general quartic usually takes several pages to describe! We are spared
further effort by a theorem which states that there is no such formula for general
polynomials of degree higher than four. Accordingly, except in special cases (for
example, when factorization is easy), we prefer in practice to use a numerical
method to solve polynomial equations of degree higher than two.
Another class of nonlinear equations consists of those which involve transcen-
dental functions such as e^x, ln x, sin x, and tan x. Useful analytic solutions of
such equations are rare so we are usually forced to use numerical methods.
1 A transcendental equation
We shall use a simple mathematical problem to show that transcendental equations
do arise quite naturally. Suppose we seek the height of liquid in a cylindrical tank
of radius r , lying with its axis horizontal, when the tank is a quarter full (see
Figure 2). Suppose the height of liquid is h (DB in the diagram). The condition
to be satisfied is that the area of the segment ABC should be 1/4 of the area of the
circle. This reduces to
2[(1/2)r²θ − (1/2)(r sin θ)(r cos θ)] = (1/4)πr²
((1/2)r²θ is the area of the sector OAB; r sin θ is the base and r cos θ the height of
the triangle OAD.) Hence
2θ − 2 sin θ cos θ = π/2
or
x + cos x = 0,  where x = π/2 − 2θ
(since 2 sin θ cos θ = sin 2θ = sin(π/2 − x) = cos x).
Once we have solved
f(x) ≡ x + cos x = 0
we obtain h from
h = OB − OD = r − r cos θ = r[1 − cos(π/4 − x/2)]
FIGURE 2. Cylindrical tank (cross-section).
2 Locating roots
Let us suppose that our problem is to find some or all of the roots of the nonlinear
equation f (x) = 0. Before we use a numerical method (compare Steps 7–10) we
should have some idea about the number, nature and approximate location of the
roots. The usual approach involves the construction of graphs and perhaps a table
of values of the function f to confirm the information obtained from the graph.
We now illustrate this approach by a few examples.
(i) sin x − x + 0.5 = 0
If we do not have a calculator or computer available to immediately plot the
graph of f (x) = sin x −x +0.5, we can separate f into two parts, sketch two curves
on the one set of axes, and see where they intersect. Because sin x − x + 0.5 = 0
is equivalent to sin x = x − 0.5, we sketch y = sin x and y = x − 0.5. Since
| sin x| ≤ 1 we are only interested in the interval −0.5 ≤ x ≤ 1.5 (outside which
|x − 0.5| > 1). Thus we deduce from the graph (Figure 3) that the equation has
only one real root, near x = 1.5. We can then tabulate f (x) = sin x − x + 0.5
near x = 1.5 as follows (the argument to the sine function should be in radians):
FIGURE 3. Graphs of y = sin x and y = x − 0.5.
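The tabulation itself is easily generated; a minimal Python sketch (ours, not the book's table) prints values of f near x = 1.5 and shows the sign change.

import math

def f(x):
    return math.sin(x) - x + 0.5

# Tabulate near x = 1.5 (arguments in radians); a change of sign brackets the root.
for x in [1.47, 1.48, 1.49, 1.50, 1.51]:
    print(f"{x:.2f}  {f(x):+.4f}")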
We now know that the root lies between 1.49 and 1.50, and we can use a numerical
method to obtain a more accurate answer, as discussed in the following Steps.
(ii) e^(−0.2x) = x(x − 2)(x − 3)
Again we sketch two curves:
y = e^(−0.2x)
and
y = x(x − 2)(x − 3)
In sketching the second curve we use the three obvious zeros at x = 0, 2, and 3; as
well as the knowledge that x(x − 2)(x − 3) is negative for x < 0 and 2 < x < 3,
but positive and increasing steadily for x > 3. We deduce from the graph (Figure
4) that there are three real roots, near x = 0.2, 1.8, and 3.1, and tabulate as follows
(with f(x) = e^(−0.2x) − x(x − 2)(x − 3)):
We conclude that the roots lie between 0.15 and 0.2, 1.6 and 1.8, and 3.1 and 3.2,
respectively. Note that the values in the table were calculated using an accuracy
of at least 5S, but are displayed to only 4D. For example, working to 5S accuracy
we have f (0.15) = 0.97045 − 0.79088 = 0.17957, which is then rounded to
0.1796. Thus the entry in the table for f(0.15) is 0.1796 and not 0.1795 as one
might expect from calculating 0.9704 − 0.7909.
FIGURE 4. Graphs of y = e^(−0.2x) and y = x(x − 2)(x − 3).
Checkpoint
EXERCISES
NONLINEAR EQUATIONS 2
The bisection method
The bisection method † for finding the roots of the equation f (x) = 0 is based on
the following theorem.
Theorem: If f is continuous for x between a and b and if f (a) and f (b) have
opposite signs, then there exists at least one real root of f (x) = 0 between a and
b.
1 Procedure
Suppose that a continuous function f is negative at x = a and positive at x = b, so
that there is at least one real root between a and b. (Usually a and b may be found
from a graph of f .) If we calculate f ((a + b)/2), which is the function value at
the point of bisection of the interval a < x < b, there are three possibilities:
(a) f ((a + b)/2) = 0, in which case (a + b)/2 is the root;
(b) f ((a + b)/2) < 0, in which case the root lies between (a + b)/2 and b;
(c) f ((a + b)/2) > 0, in which case the root lies between a and (a + b)/2.
Presuming there is just one root, if case (a) occurs the process is terminated. If
either case (b) or case (c) occurs, the process of bisection of the interval containing
the root can be repeated until the root is obtained to the desired accuracy. In Figure
5, the successive points of bisection are denoted by x1 , x2 , and x3 .
FIGURE 5. The bisection method: successive bisection points x1, x2, x3.
† This method is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 164.
2 Effectiveness
The bisection method is almost certain to give a root. Provided the conditions of
the above theorem hold, it can only fail if the accumulated error in the calculation
of f at a bisection point gives it a small negative value when actually it should
have a small positive value (or vice versa); the interval subsequently chosen would
therefore be wrong. This can be overcome by working to sufficient accuracy, and
this almost-assured convergence is not true of many other methods of finding a
root.
One drawback of the bisection method is that it applies only for roots of f
about which f (x) changes sign. In particular, double roots can be overlooked;
one should be careful to examine f (x) in any range where it is small, so that
repeated roots about which f (x) does not change sign are otherwise evaluated
(for example, see Steps 9 and 10). Of course, such a close examination also avoids
another nearby root being overlooked.
Finally, note that bisection is rather slow; after n iterations the interval con-
taining the root is of length (b − a)/2n . However, provided values of f can be
generated readily, as when a computer is used, the rather large number of itera-
tions which can be involved in the application of bisection is of relatively little
consequence.
3 Example
Let us solve 3xe^x = 1 to three decimal places by the bisection method.
We can consider f(x) = 3x − e^(−x), which changes sign in the interval 0.25 <
x < 0.27: one may tabulate (working to 4D ) as follows:
x        3x        e^(−x)      f(x)
0.25 0.75 0.7788 −0.0288
0.27 0.81 0.7634 0.0466
(The student should ascertain graphically that there is just one root.)
Let us denote the lower and upper endpoints of the interval bracketing the root
at the n-th iteration by an and bn respectively (with a1 = 0.25 and b1 = 0.27).
Then the approximation to the root at the n-th iteration is given by xn = (an +
bn )/2. Since the root is either in [an , xn ] or [xn , bn ] and both intervals are of
length (bn − an )/2, we see that xn will be accurate to three decimal places when
(bn − an )/2 < 5 × 10−4 . Proceeding to bisection:
n    an       bn       xn = (an + bn)/2    3xn      e^(−xn)    f(xn)
1 0.25 0.27 0.26 0.78 0.7711 0.0089
2 0.25 0.26 0.255 0.765 0.7749 −0.0099
3 0.255 0.26 0.2575 0.7725 0.7730 −0.0005
4 0.2575 0.26 0.2588 0.7763 0.7720 0.0042
5 0.2575 0.2588 0.2581 0.7744 0.7725 0.0019
6 0.2575 0.2581 0.2578
(Note that the values in the table are displayed to only 4D.) Hence the root accurate
to three decimal places is 0.258.
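A minimal Python sketch of the bisection procedure (ours, without the refinements of the pseudo-code in the Appendix), applied to the same equation, is given below.

import math

def bisect(f, a, b, tol=5e-4):
    """Bisection for a root of f in [a, b], assuming f(a) and f(b) differ in sign."""
    fa = f(a)
    while (b - a) / 2.0 >= tol:
        m = (a + b) / 2.0
        fm = f(m)
        if fm == 0.0:
            return m
        if fa * fm < 0.0:        # the root lies in [a, m]
            b = m
        else:                    # the root lies in [m, b]
            a, fa = m, fm
    return (a + b) / 2.0

root = bisect(lambda x: 3.0 * x - math.exp(-x), 0.25, 0.27)
print(round(root, 3))            # 0.258, as in the worked example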
Checkpoint
1. When may the bisection method be used to find a root of the equation
f (x) = 0?
2. What are the three possible choices after a bisection value is calcu-
lated?
3. What is the maximum error after n iterations of the bisection
method?
EXERCISES
NONLINEAR EQUATIONS 3
Method of false position
As mentioned in the Prologue, the method of false position† dates back to the
ancient Egyptians. It remains an effective alternative to the bisection method for
solving the equation f (x) = 0 for a real root between a and b, given that f is
continuous and f (a) and f (b) have opposite signs.
1 Procedure
The curve y = f (x) is not generally a straight line. However, one may join the
points
(a, f (a)) and (b, f (b))
by the straight line
(y − f(a))/(f(b) − f(a)) = (x − a)/(b − a)
The straight line cuts the x-axis at (x̄, 0), where
(0 − f(a))/(f(b) − f(a)) = (x̄ − a)/(b − a)
so that
x̄ = a − f(a)(b − a)/(f(b) − f(a)) = (a f(b) − b f(a))/(f(b) − f(a))
Let us suppose that f (a) is negative and f (b) is positive. As in the bisection
method, there are three possibilities:
(a) f (x̄) = 0, in which case x̄ is the root;
(b) f (x̄) < 0, in which case the root lies between x̄ and b;
(c) f (x̄) > 0, in which case the root lies between a and x̄.
Again, if case (a) occurs, the process is terminated; if either case (b) or case
(c) occurs, the process can be repeated until the root is obtained to the desired
accuracy. In Figure 6 the successive points where the straight lines cut the x-axis
are denoted by x1 , x2 , and x3 .
† This method is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 165.
FIGURE 6. The method of false position: successive points x1, x2, x3 where the chords cut the x-axis.
2 The secant method
The same chord construction is used in the secant method, which resembles the method of false position except that
no attempt is made to ensure that the root α is enclosed. Starting with two
approximations (x0 and x1 ) to the root α, further approximations x2 , x3 , . . . are
computed from
xn+1 = xn − f(xn)(xn − xn−1)/(f(xn) − f(xn−1))
We no longer have assured convergence, but the process is simpler (the sign
of f (xn+1 ) is not tested) and often converges faster. With respect to speed of
convergence of the secant method, we have the error at the (n + 1)-th iteration:
en+1 = α − xn+1
     = [(α − xn−1) f(xn) − (α − xn) f(xn−1)] / [f(xn) − f(xn−1)]
     = [en−1 f(α − en) − en f(α − en−1)] / [f(α − en) − f(α − en−1)]
Hence, expanding in terms of Taylor series,
en+1 = {en−1[f(α) − en f′(α) + (en²/2!) f″(α) − ⋯] − en[f(α) − en−1 f′(α) + ((en−1)²/2!) f″(α) − ⋯]}
       / {[f(α) − en f′(α) + ⋯] − [f(α) − en−1 f′(α) + ⋯]}
     ≈ −[f″(α)/(2 f′(α))] en−1 en
where we have used the fact that f(α) = 0. Thus we see that en+1 is proportional to
en en−1, which may be expressed in mathematical notation as en+1 ∼ en−1 en. We
seek k such that en ∼ (en−1)^k; then en+1 ∼ (en)^k ∼ (en−1)^(k²) and en−1 en ∼ (en−1)^(k+1),
so that we deduce k² ≈ k + 1, whence k ≈ (1 + √5)/2 ≈ 1.618. The speed of convergence
is therefore faster than linear (k = 1), but slower than quadratic (k = 2). This rate
of convergence is sometimes referred to as superlinear convergence.
3 Example
We solve 3xe^x = 1 by the method of false position, stopping when |f(xn)| <
5 × 10⁻⁶, where f(x) = 3x − e^(−x).
In the previous Step, we observed that the root lies in the interval 0.25 < x <
0.27. Consequently, with calculations displayed to 6D, the first approximation is
given by
x1 = [0.25 × 0.046621 − 0.27 × (−0.028801)] / (0.046621 + 0.028801)
   = (0.011655 + 0.007776)/0.075421 = 0.257637
Then
f (x1 ) = f (0.257637) = 3 × 0.257637 − 0.772875
= 0.772912 − 0.772875 = 0.000036
The student may verify that doing one more iteration of the method of false
position yields an estimate x2 = 0.257628 for which the function value is less
than 5 × 10−6 . Since x1 and x2 agree to 4D, we conclude that the root is 0.2576
correct to 4D.
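A minimal Python sketch of the secant iteration described above (ours; it omits any safeguard against a zero denominator), applied to the same equation, follows.

import math

def secant(f, x0, x1, tol=5e-6, max_iter=50):
    """Secant iteration x_{n+1} = x_n - f(x_n)(x_n - x_{n-1})/(f(x_n) - f(x_{n-1}))."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if abs(f1) < tol:
            return x1
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
        f0, f1 = f1, f(x1)
    return x1

root = secant(lambda x: 3.0 * x - math.exp(-x), 0.25, 0.27)
print(round(root, 4))     # 0.2576, agreeing with the false position result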
Checkpoint
1. When may the method of false position be used to find a root of the
equation f (x) = 0?
2. On what geometric construction is the method of false position
based?
EXERCISES
1. Use the method of false position to find the smallest positive root of the
equation f (x) ≡ 2 sin x + x − 2 = 0, stopping when xn satisfies | f (xn )| <
5 × 10−5 .
2. Compare the results obtained when
(a) the bisection method,
(b) the method of false position, and
(c) the secant method
are used (with starting values 0.7 and 0.9) to solve the equation
3 sin x = x + 1/x
3. Use the method of false position to find the root of the equation
f (x) ≡ x + cos x = 0
stopping when | f (xn )| < 5 × 10−6 .
4. Each equation in Exercises 2(a)–2(c) of Step 6 on page 26 has only one
root. Use the method of false position to find each root, stopping when
| f (xn )| < 5 × 10−6 .
STEP 9
NONLINEAR EQUATIONS 4
The method of simple iteration
The method of simple iteration involves writing the equation f (x) = 0 in a form
x = φ(x) suitable for the construction of a sequence of approximations to some
root, in a repetitive fashion.
1 Procedure
The iteration procedure is as follows. In some way we obtain a rough approxi-
mation x0 of the desired root, which may then be substituted into the right-hand
side to give a new approximation, x1 = φ(x0 ). The new approximation is again
substituted into the right-hand side to give a further approximation x2 = φ(x1 ),
and so on until (hopefully) a sufficiently accurate approximation to the root is ob-
tained. This repetitive process, based on xn+1 = φ(xn ), is called simple iteration;
provided that |xn+1 − xn | decreases as n increases, the process tends to α = φ(α),
where α denotes the root.
2 Example
The method of simple iteration is used to find the root of the equation 3xe^x = 1
to an accuracy of 4D.
One first writes
x = (1/3) e^(−x) ≡ φ(x)
Assuming x0 = 1 and with numbers displayed to 5D, successive iterations produce
x1 = 0.12263
x2 = 0.29486
x3 = 0.24821
x4 = 0.26007
x5 = 0.25700
x6 = 0.25779
x7 = 0.25759
x8 = 0.25764
Thus we see that after eight iterations the root is 0.2576 to 4D. A graphical
interpretation of the first three iterations is shown in Figure 8.
FIGURE 8. Iterative method.
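A minimal Python sketch of simple iteration (ours), applied to the same equation, is shown below.

import math

def simple_iteration(phi, x0, tol=5e-5, max_iter=100):
    """Iterate x_{n+1} = phi(x_n) until successive values agree within tol."""
    x = x0
    for _ in range(max_iter):
        x_new = phi(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# x = (1/3)e^(-x), starting from x0 = 1 as in the example above.
root = simple_iteration(lambda x: math.exp(-x) / 3.0, 1.0)
print(round(root, 4))     # 0.2576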
3 Convergence
Whether or not the iteration procedure converges quickly, or indeed at all, depends
on the choice of the function φ, as well as the starting value x0. For example, the
equation x² = 3 has two real roots, ±√3 (≈ ±1.732). It can be rewritten in the
form
x = 3/x ≡ φ(x)
which suggests the iteration
xn+1 = 3/xn
However, if the starting value x0 = 1 is used, successive iterations give
x1 = 3/x0 = 3
x2 = 3/x1 = 1
x3 = 3/x2 = 3   etc.
so that there is no convergence!
We can examine the convergence of the iteration process
xn+1 = φ(xn )
to
α = φ(α)
with the help of the Taylor series (see page 18)
φ(α) = φ(xk) + (α − xk)φ′(ζk),    k = 0, 1, 2, …, n
where ζk is a value between the root α and the approximation xk . We have
Checkpoint
EXERCISES
NONLINEAR EQUATIONS 5
The Newton-Raphson iterative method
1 Procedure
Let x0 denote the known approximate value of the root of f (x) = 0, and let h
denote the difference between the true value α and the approximate value; that is,
α = x0 + h
The second degree terminated Taylor expansion (see page 18) about x0 is
f(α) = f(x0 + h) = f(x0) + h f′(x0) + (h²/2!) f″(ξ)
where ξ = x0 + θ h, 0 < θ < 1, lies between α and x0 . Ignoring the remainder
term and writing f (α) = 0,
f(x0) + h f′(x0) ≈ 0
so that
h ≈ −f(x0)/f′(x0)
and consequently
x1 = x0 − f(x0)/f′(x0)
should be a better estimate of the root than x0 .
Even better approximations may be obtained by repetition (iteration) of the
process, which may then be written as
xn+1 = xn − f(xn)/f′(xn)
Note that if f is a polynomial we can use the recursive procedure of Step 5 to
compute f (xn ) and f 0 (xn ).
† This method is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 166.
The geometrical interpretation is that each iteration provides the point at which
the tangent at the original point cuts the x-axis (see Figure 9). Thus the equation
of the tangent at (xn, f(xn)) is
y − f(xn) = f′(xn)(x − xn)
which leads to (on setting y = 0 and x = xn+1)
xn+1 = xn − f(xn)/f′(xn)
FIGURE 9. The Newton-Raphson method: successive points x1, x2, x3 where the tangents cut the x-axis.
2 Example
We find the positive root of the equation sin x = x², correct to 3D, using the
Newton-Raphson method.
It is convenient to use the method of false position to obtain an initial approxi-
mation. Tabulating, one has
x        f(x) = sin x − x²
0 0
0.25 0.1849
0.5 0.2294
0.75 0.1191
1 −0.1585
With working displayed to 4D, we see that there is a root in the interval 0.75 <
x < 1 at approximately
x0 = [0.75 × (−0.1585) − 1 × 0.1191] / (−0.1585 − 0.1191)
   = −(−0.1189 − 0.1191)/0.2777
   = 0.2380/0.2777 = 0.8573
We now use the Newton-Raphson method; we have
f(x0) = f(0.8573) = sin(0.8573) − (0.8573)² = 0.0211
and
f′(x) = cos x − 2x
giving
f′(0.8573) = 0.6545 − 1.7145 = −1.0600
Consequently, a better approximation is
x1 = 0.8573 + 0.0211/1.0600 = 0.8573 + 0.0199 = 0.8772
Repeating the procedure, we obtain
f(x1) = f(0.8772) = −0.0005
and
f′(x1) = f′(0.8772) = −1.1151
so that
x2 = 0.8772 − 0.0005/1.1151 = 0.8772 − 0.0005 = 0.8767
Since f (x2 ) = 0.0000, we conclude that the root is 0.877 to 3D.
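A minimal Python sketch of the Newton-Raphson iteration (ours; it assumes f′(x) is available and nonzero), applied to the same equation, is given below.

import math

def newton_raphson(f, df, x0, tol=5e-7, max_iter=20):
    """Newton-Raphson iteration x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x
    return x

# Positive root of sin x = x^2, starting from the false position estimate 0.8573.
root = newton_raphson(lambda x: math.sin(x) - x * x,
                      lambda x: math.cos(x) - 2.0 * x,
                      0.8573)
print(round(root, 3))     # 0.877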
3 Convergence
If we write
φ(x) = x − f(x)/f′(x)
the Newton-Raphson iteration expression
xn+1 = xn − f(xn)/f′(xn)
may be written
xn+1 = φ(xn )
We observed (see page 36) that in general the iteration method converges when
|φ′(x)| < 1 near the root. In the Newton-Raphson case we have
4 Speed of convergence
The second degree terminated Taylor expansion about xn is
f(α) = f(xn + en) = f(xn) + en f′(xn) + (en²/2!) f″(ξn)
where en = α − xn is the error at the n-th iteration and ξn = xn + θen , 0 < θ < 1.
Since f (α) = 0 we have
0 = f(xn)/f′(xn) + (α − xn) + (en²/2) f″(ξn)/f′(xn)
But from the Newton-Raphson formula we have
f(xn)/f′(xn) − xn = −xn+1
and so the error at the (n + 1)-th iteration is
en+1 = α − xn+1 = −(en²/2) f″(ξn)/f′(xn) ≈ −(en²/2) f″(α)/f′(α)
when en is sufficiently small. This result states that the error at iteration (n + 1)
is proportional to the square of the error at iteration n; hence (if f″(α) ≈ 4f′(α))
an answer correct to one decimal place at one iteration should be accurate to two
places at the next iteration, four at the next, eight at the next, etc. This quadratic
(‘second-order’) convergence outstrips the rate of convergence of the methods of
bisection and false position.
In relatively little used computer programs, it may be wise to prefer the methods
of bisection or false position, since convergence is virtually assured. However, for
hand calculations or for computer routines in constant use, the Newton-Raphson
method is usually preferred.
Checkpoint
EXERCISES
1. Use the Newton-Raphson method to solve for the (positive) root of 3xe^x = 1
to four significant digits.
2. Derive the Newton-Raphson iteration formula
xn+1 = xn − (xn^k − a)/(k xn^(k−1))
for finding the k-th root of a.
3. Use the formula xn+1 = (xn + a/xn )/2 to compute the square root of 10 to
five significant digits, from the initial guess 1.
4. Use the Newton-Raphson method to find (to 4D ) the root of the equation
x + cos x = 0
5. Use the Newton-Raphson method to find (to 4D ) the root of each equation
in Exercises 2(a)–2(c) of Step 6 on page 26.
STEP 11
SYSTEMS OF LINEAR EQUATIONS 1
Solution by elimination
Consider the following system of equations:
x + y − z = 2
x + 2y + z = 6
2x − y + z = 1
This is a set of three linear equations in the three variables (or unknowns) x, y,
and z. By solution of the system we mean the determination of a set of values for
x, y, and z which satisfies each one of the equations. In other words, if values
(X, Y, Z ) satisfy all equations simultaneously, then (X, Y, Z ) constitute a solution
of the system.
Let us now consider the general system of n equations in n variables, which
may be written as follows:
a11 x1 + a12 x2 + ⋯ + a1n xn = b1
a21 x1 + a22 x2 + ⋯ + a2n xn = b2
   ⋮                                        (n equations)
an1 x1 + an2 x2 + ⋯ + ann xn = bn
The dots indicate, of course, similar terms in the variables x3 , x4 etc., and the
remaining (n − 3) equations which complete the system.
In this notation, the variables are denoted by x1 , x2 , . . . , xn ; sometimes we write
xi , i = 1, 2, . . . , n, to represent the variables. The coefficients of the variables
may be detached and written in a coefficient matrix thus:
        a11  a12  ⋯  a1n
A  =    a21  a22  ⋯  a2n
         ⋮    ⋮         ⋮
        an1  an2  ⋯  ann
The notation ai j will be used to denote the coefficient of x j in the i-th equation.
Note that it occurs in the i-th row and j-th column of the matrix.
The numbers on the right-hand side of the equations are called constants, and
may be written in a column vector, thus:
        b1
b  =    b2
         ⋮
        bn
The coefficient matrix may be combined with the constant vector to form the
augmented matrix, thus:
a11  a12  ⋯  a1n   b1
a21  a22  ⋯  a2n   b2
 ⋮    ⋮         ⋮    ⋮
an1  an2  ⋯  ann   bn
It is usual to work directly with the augmented matrix when using elimination
methods of solution.
By multiplying the equation Ax = b from the left by the inverse matrix A⁻¹ we
obtain A⁻¹Ax = A⁻¹b, so the unique solution is x = A⁻¹b (since A⁻¹A = I and
Ix = x). Thus in principle a linear system with a unique solution may be solved
by first evaluating A⁻¹ and then A⁻¹b. This approach is discussed in more detail
in the optional Step 14. The Gaussian elimination method is a more general and
efficient direct procedure for solving systems of linear equations.
† This process is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 167.
The matrix is now in the form necessary for back-substitution. The full system
of equations at this point, equivalent to the original system, is
a11 x1 + a12 x2 + a13 x3 = b1
         a′22 x2 + a′23 x3 = b′2
                  a″33 x3 = b″3
Solving these in reverse order gives
x3 = b″3/a″33
x2 = (b′2 − a′23 x3)/a′22
x1 = (b1 − a12 x2 − a13 x3)/a11
We now display the process for the general n × n system, omitting the primes
(′) for convenience. Recall that the original augmented matrix is
a11  a12  ⋯  a1n   b1
a21  a22  ⋯  a2n   b2
 ⋮    ⋮         ⋮    ⋮
an1  an2  ⋯  ann   bn
First stage: eliminate the coefficients a21 , a31 , . . . , an1 by calculating the
multipliers
m i1 = ai1 /a11 , i = 2, 3 . . . , n
and then calculating
ai j = ai j − m i1 a1 j , bi = bi − m i1 b1 , i, j = 2, 3, . . . , n
Second stage: eliminate the coefficients a32 , a42 , . . . , an2 by calculating the
multipliers
m i2 = ai2 /a22 , i = 3, 4, . . . , n
ai j = ai j − m i2 a2 j , bi = bi − m i2 b2 , i, j = 3, 4, . . . , n
This gives
a11  a12  a13  ⋯  a1n   b1
 0   a22  a23  ⋯  a2n   b2
 0    0   a33  ⋯  a3n   b3
 ⋮    ⋮    ⋮         ⋮    ⋮
 0    0   an3  ⋯  ann   bn
We continue to eliminate unknowns, going on to columns 3, 4, . . . so that by
the beginning of the k-th stage we have the augmented matrix
a11  a12  ⋯   ⋯   ⋯  a1n   b1
 0   a22  ⋯   ⋯   ⋯  a2n   b2
 ⋮    ⋮    ⋱              ⋮    ⋮
 0    0   ⋯  akk  ⋯  akn   bk
 ⋮    ⋮         ⋮         ⋮    ⋮
 0    0   ⋯  ank  ⋯  ann   bn
k-th stage: eliminate ak+1,k , ak+2,k , . . . , an,k by calculating the multipliers
m ik = aik /akk , i = k + 1, k + 2, . . . , n
ai j = ai j − m ik ak j , bi = bi − m ik bk , i, j = k + 1, k + 2, . . . , n
Thus at the end of the k-th stage we have the augmented system
a11  a12  ⋯   ⋯      ⋯      ⋯  a1n     b1
 0   a22  ⋯   ⋯      ⋯      ⋯  a2n     b2
 ⋮    ⋮    ⋱                        ⋮      ⋮
 0    0   ⋯  akk   ak,k+1   ⋯  akn     bk
 0    0   ⋯   0    ak+1,k+1 ⋯  ak+1,n  bk+1
 ⋮    ⋮         ⋮       ⋮           ⋮      ⋮
 0    0   ⋯   0    an,k+1   ⋯  ann     bn
Continuing in this way, we obtain after n − 1 stages the augmented matrix
a11  a12  ⋯  a1,n−1     a1n     b1
 0   a22  ⋯  a2,n−1     a2n     b2
 ⋮    ⋮    ⋱     ⋮         ⋮      ⋮
 0    0   ⋯  an−1,n−1   an−1,n  bn−1
 0    0   ⋯    0        ann     bn
Note that the original coefficient matrix has been transformed into upper triangular
form.
We now back-substitute. Clearly we have xn = bn /ann , and subsequently
" #
n
1 X
xi = bi − ai j x j , i = n − 1, n − 2, . . . , 2, 1
aii j=i+1
Notes
(a) The diagonal elements akk used in the k-th stage of the successive elimination
are called pivot elements.
(b) To proceed from one stage to the next, it is necessary for the pivot element
to be nonzero (notice that the pivot elements are used as divisors in the
multipliers and in the final solution). If at any stage a pivot element vanishes,
we rearrange the remaining rows of the matrix so as to obtain a nonzero pivot;
if this is not possible, then the system of linear equations has no solution.
(c) If a pivot element is small compared with the elements in its column which
have to be eliminated, the corresponding multipliers used at that stage will be
greater than one in magnitude. The use of large multipliers in the elimination
and back-substitution processes leads to magnification of round-off errors,
and this can be avoided by using partial pivoting as described in the next
Step.
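A minimal Python sketch of the elimination and back-substitution just described (ours; it assumes all pivot elements are nonzero and works in ordinary double precision rather than three-digit arithmetic) follows.

def gauss_solve(A, b):
    """Gaussian elimination without pivoting, followed by back-substitution.

    A is a list of n rows (each a list of n coefficients); b is the constant vector.
    """
    n = len(A)
    A = [row[:] for row in A]        # work on copies
    b = b[:]
    for k in range(n - 1):                       # k-th elimination stage
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]                # multiplier m_ik
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):               # back-substitution
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# The 3 x 3 system solved in Section 6 below (here in full double precision).
print(gauss_solve([[0.34, -0.58, 0.94],
                   [0.27,  0.42, 0.13],
                   [0.20, -0.51, 0.54]], [2.0, 1.5, 0.8]))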
6 Numerical example
Here we shall solve the system
0.34 x1 − 0.58 x2 + 0.94 x3 = 2.0
0.27 x1 + 0.42 x2 + 0.13 x3 = 1.5
0.20 x1 − 0.51 x2 + 0.54 x3 = 0.8
The working required to obtain the solution is set out in tabular form below. For
illustrative purposes, the calculations were done using three-digit decimal floating
point arithmetic. For example, in the first stage the multiplier 0.794 comes from
0.27/0.34 → 7.94 × 10⁻¹, and the new second right-hand side element −0.0900 from
1.50 × 10⁰ − (7.94 × 10⁻¹ × 2.00 × 10⁰) = 1.50 × 10⁰ − (15.88 × 10⁻¹)
   → 1.50 × 10⁰ − (1.59 × 10⁰)
   = −0.09 × 10⁰ → −9.00 × 10⁻²
Working with so few significant digits leads to errors in the solution, as is shown
below by an examination of the residuals.
                  m         Augmented matrix
                            0.34   −0.58    0.94     2.0
                            0.27    0.42    0.13     1.5
                            0.20   −0.51    0.54     0.8
First stage                 0.34   −0.58    0.94     2.0
                  0.794             0.881  −0.616   −0.0900
                  0.588            −0.169  −0.0130  −0.380
Second stage                0.34   −0.58    0.94     2.0
                                    0.881  −0.616   −0.0900
                 −0.192                    −0.131   −0.397
We now do back-substitution (still in three-digit arithmetic):
x3 = −0.397/(−0.131) → 3.03
x2 = (−0.0900 + 0.616 × 3.03)/0.881 → 2.02
x1 = (2.0 + 0.58 × 2.02 − 0.94 × 3.03)/0.34 → 0.941
As a check, we can sum the original three equations to obtain 0.81x1 −0.67x2 +
1.61x3 = 4.3. Inserting the solution yields 0.81 × 0.941 − 0.67 × 2.02 + 1.61 ×
3.03 = 4.28711.
In order to judge the accuracy of the solution, we may insert the solution into the
left-hand side of each of the original equations, and compare the results with the
right-hand side constants. The differences between the results and the constants
are called residuals. For the example we have:
It would seem reasonable to believe that if the residuals are small the solution
is a good one. This is usually the case. Sometimes, however, small residuals are
not indicative of a good solution. This point is taken up under ‘ill-conditioning’,
in the next Step.
Checkpoint
EXERCISES
Solve the following systems by Gaussian elimination.
1. x1 + x2 − x3 = 0
2x1 − x2 + x3 = 6
3x1 + 2x2 − 4x3 = −4
2. 5.6x + 3.8y + 1.2z = 1.4
3.1x + 7.1y − 4.7z = 5.1
1.4x − 3.4y + 8.3z = 2.4
3. 2x + 6y + 4z = 5
6x + 19y + 12z = 6
2x + 8y + 14z = 7
4. 1.3x + 4.6y + 3.1z = −1
5.6x + 5.8y + 7.9z = 2
4.2x + 3.2y + 4.5z = −3
STEP 12
For any system of linear equations, the question of how much error there may be
in a solution obtained by a numerical method is a very difficult one to answer.
A general discussion of the problems it raises is beyond the scope of this book.
However, some of the sources of error are indicated.
Consider, for example, the system
2x + y = 4 (±0.01)
−x + y = 1 (±0.01)
in which the right-hand constants are subject to the maximum errors shown.
Adding half of the first equation to the second, to eliminate x, gives
2x + y = 4 (±0.01)
(3/2)y = 1 (±0.01) + 2 (±0.005)
Therefore (3/2)y lies between 2.985 and 3.015, so y lies between 1.990 and 2.010.
From the first equation we now obtain
2x = 4 (±0.01) − 2 (±0.01)
so that x lies between 0.99 and 1.01.
3 Partial pivoting
In the Gaussian elimination method, the buildup of round-off errors may be
reduced by arranging the equations so that the use of large multipliers in the
elimination operations is avoided. The procedure to be carried out is known as
partial pivoting (or pivotal condensation). The general rule to follow is: at each
elimination stage, arrange the rows of the augmented matrix so that the new pivot
element is larger in absolute value than (or equal to) any element beneath it in its
column.
Use of this rule ensures that the multipliers used at each stage have magnitude
less than or equal to one. To show the rule in operation we treat a simple example,
using three-digit decimal floating point arithmetic. We solve
2x + 5y + 8z = 36
4x + 7y − 12z = −16
x + 8y + z = 20
The tabular solution is as follows, the pivot elements being printed in boldface
numerals. (Note that all the multipliers have magnitude less than 1.)
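The pivoting rule is a small change to the elimination loop of Step 11: before each stage, the row with the largest entry (in absolute value) in the pivot column is swapped into the pivot position, so every multiplier has magnitude at most one. An illustrative Python sketch (not the book's pseudo-code):

def gauss_solve_pp(a, b):
    """Gaussian elimination with partial pivoting, then back-substitution."""
    n = len(b)
    for k in range(n - 1):
        # Partial pivoting: pick the row p >= k with the largest |a[p][k]|.
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        if p != k:
            a[k], a[p] = a[p], a[k]
            b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]          # |m| <= 1 by choice of pivot
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x

# The example above: 2x + 5y + 8z = 36, 4x + 7y - 12z = -16, x + 8y + z = 20.
print(gauss_solve_pp([[2, 5, 8], [4, 7, -12], [1, 8, 1]], [36, -16, 20]))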
4 Ill-conditioning
Certain systems of linear equations are such that their solutions are very sensitive to
small changes (and therefore to errors) in their coefficients and constants. We give
an example below in which 1% changes in two coefficients change the solution by
a factor of 10 or more. Such systems are said to be ill-conditioned. If a system is
ill-conditioned, a solution obtained by a numerical method may be very different
from the exact solution, even though great care is taken to keep round-off and
other errors very small.
As an example, consider the following system of equations:
2x + y = 4
2x + 1.01y = 4.02
Checkpoint
1. Describe the types of error that may affect the solution of a system
of linear equations.
2. How can partial pivoting contribute to a reduction of errors?
3. Is it true to say that an ill-conditioned system has not got an exact
solution?
EXERCISES
1. Find the range of solutions for the following system, assuming maximum
errors in the constants as shown:
x − y = 1.4 (±0.01)
x + y = 3.8 (±0.05)
(b) Insert the solution of the first system into the left-hand side of the second
system. Does x = 1, y = 2 ‘look like’ a good solution to the second
system? Comment.
(c) Insert the solution of the second system into the left-hand side of the
first system. Comment.
7. The system
10x1 + 7x2 + 8x3 + 7x4 = 32
7x1 + 5x2 + 6x3 + 5x4 = 23
8x1 + 6x2 + 10x3 + 9x4 = 33
7x1 + 5x2 + 9x3 + 10x4 = 31
is an example of ill-conditioning due to T. S. Wilson. Insert the ‘solution’
(6.0, −7.2, 2.9, −0.1) into the left-hand side. Would you claim this solution
to be a good one? Now insert the solution (1.0, 1.0, 1.0, 1.0). Comment on
the dangers of making claims!
STEP 13
The methods used in the previous Steps for solving systems of linear equations
are termed direct methods. When a direct method is used, and if round-off and
other errors do not arise, an exact solution is reached after a finite number of
arithmetic operations. In general, of course, round-off errors do arise; and when
large systems are being solved by direct methods, the growing errors can become
so large as to render the results obtained quite unacceptable.
1 Iterative methods
Iterative methods provide an alternative approach. Recall that an iterative method
starts with an approximate solution, and uses it in a recurrence formula to provide
another approximate solution; by repeatedly applying the formula, a sequence
of solutions is obtained which (under suitable conditions) converges to the exact
solution. Iterative methods have the advantages of simplicity of operation and
ease of implementation on computers, and they are relatively insensitive to propa-
gation of errors; they would be used in preference to direct methods for solving
linear systems involving several hundred variables, particularly if many of the
coefficients were zero. Systems of over 100 000 variables have been successfully
solved on computers by iterative methods, whereas systems of 10 000 or more
variables are difficult or impossible to solve by direct methods.
3 Convergence
The sequence of solutions produced by the iterative process may be displayed in
a table, thus:
Iteration        Approximate solution (Gauss-Seidel)
   k          x1^(k)        x2^(k)        x3^(k)
   0          0             0             0
   1          1.3           1.04          0.936
   2          0.9984        1.00672       0.999648
   3          0.998691      1.000297      1.000232
The student may check that the exact solution for this system is (1, 1, 1). It is
seen that the Gauss-Seidel solutions are rapidly approaching this; in other words,
the method is converging.
In practice, of course, the exact solution is not known. It is customary to end
the iterative procedure as soon as the differences between the x (k+1) values and
the x (k) values are suitably small. One stopping rule is to end the iteration when
$$S_k = \sum_{i=1}^{n}\left|x_i^{(k+1)} - x_i^{(k)}\right|$$
becomes less than a prescribed small number (usually chosen according to the
accuracy of the machine on which the calculations are carried out).
The question of convergence with a given system of equations is crucial. As
in the above example, the Gauss-Seidel method may quickly lead to a solution
very close to the exact one; on the other hand, it may converge too slowly to be of
practical use, or it may produce a sequence which diverges from the exact solution.
The reader is referred to more advanced texts (such as Conte and de Boor (1980))
for treatments of this question.
To improve the chance (and rate) of convergence, before applying the iterative
method the system of equations should be arranged so that as far as possible each
leading-diagonal coefficient is the largest (in absolute value) in its row.
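A sketch of the iteration in Python (illustrative only). It applies the stopping rule based on S_k described above; the system used is Exercise 2(a) below, rearranged so that each leading-diagonal coefficient is the largest in its row.

def gauss_seidel(a, b, x0, tol=0.5e-5, max_iter=100):
    """Gauss-Seidel iteration for ax = b, starting from x0.
    Stops when S_k = sum_i |x_i^(k+1) - x_i^(k)| falls below tol,
    or after max_iter sweeps."""
    n = len(b)
    x = list(x0)
    for _ in range(max_iter):
        s_k = 0.0
        for i in range(n):
            # The newest values are used as soon as they are available.
            total = sum(a[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (b[i] - total) / a[i][i]
            s_k += abs(new_xi - x[i])
            x[i] = new_xi
        if s_k < tol:
            break
    return x

# Exercise 2(a), with the equations rearranged for diagonal dominance:
a = [[20, 3, -2], [2, 8, 4], [1, -1, 10]]
b = [51, 25, -7]
print(gauss_seidel(a, b, [0.0, 0.0, 0.0]))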
Checkpoint
EXERCISES
1. For the example treated above, compute the value of S3 , the quantity used in
the suggested stopping rule after the third iteration.
2. Use the Gauss-Seidel method to solve the following systems to 5D accuracy
(remember to rearrange the equations if appropriate). Compute the value of
Sk (to 6D ) after each iteration.
(a) x − y + 10z = −7
20x + 3y − 2z = 51
2x + 8y + 4z = 25
(b) 10x − y =1
−x + 10y − z =1
−y + 10z − w =1
−z + 10w =1
STEP 14
The general system of n linear equations in n variables (see Step 11, Section 1)
can be written in matrix form Ax = b, and we seek a vector x which satisfies this
equation. Here we make use of the inverse matrix A−1 to find this vector.
[A | I] → [Ã | Ĩ]:
$$\left[\begin{array}{cc|cc} 2 & 1 & 1 & 0 \\ 4 & 5 & 0 & 1 \end{array}\right] \;\rightarrow\; \left[\begin{array}{cc|cc} 2 & 1 & 1 & 0 \\ 0 & 3 & -2 & 1 \end{array}\right] \quad (\text{row 2} - \text{twice row 1})$$
(iii) Solve the two systems
$$\begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix}\begin{bmatrix} u_1 \\ v_1 \end{bmatrix} = \begin{bmatrix} 1 \\ -2 \end{bmatrix} \qquad\text{and}\qquad \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix}\begin{bmatrix} u_2 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$
using the back-substitution method. Note how the systems are constructed,
using Ã and columns of Ĩ. From the first system, 3v1 = −2, v1 = −2/3, and
2u1 + v1 = 1, so 2u1 = 1 + 2/3, u1 = 5/6. From the second system, 3v2 = 1,
v2 = 1/3, and 2u2 + v2 = 0, so 2u2 = −1/3, u2 = −1/6. The required inverse
matrix is
$$\mathbf{A}^{-1} = \begin{bmatrix} u_1 & u_2 \\ v_1 & v_2 \end{bmatrix} = \begin{bmatrix} 5/6 & -1/6 \\ -2/3 & 1/3 \end{bmatrix}$$
(iv) Check: AA⁻¹ should equal I. By multiplication we find
$$\begin{bmatrix} 2 & 1 \\ 4 & 5 \end{bmatrix}\begin{bmatrix} 5/6 & -1/6 \\ -2/3 & 1/3 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
so A⁻¹ is correct.
In this simple example it has been possible to work with fractions, so no round-
off errors occur and the resulting inverse matrix is exact. More generally, when
doing calculations by hand the final result should be checked by computing AA−1 ,
which should be approximately equal to the identity matrix I.
As a 3 × 3 example, we shall find the inverse matrix A−1 of
$$\mathbf{A} = \begin{bmatrix} 0.20 & 0.24 & 0.12 \\ 0.10 & 0.24 & 0.24 \\ 0.05 & 0.30 & 0.49 \end{bmatrix}$$
To show the effects of errors we shall work to 3S in the calculation of A−1 . The
results of the calculations are displayed below in tabular form.
yields w2 = −20.0, v2 = 38.3, and u 2 = −34.0, found in that order. One might
check by multiplication that AA⁻¹ is
$$\begin{bmatrix} 1.004 & -0.008 & 0 \\ 0.004 & 0.992 & 0 \\ 0.005 & -0.01 & 1 \end{bmatrix}$$
We can use the A⁻¹ calculated in the previous section in the following manner:
with b = (1, 2, 3)ᵀ,
$$\mathbf{x} = \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \mathbf{A}^{-1}\mathbf{b} = \begin{bmatrix} 19 & -34 & 12 \\ -15.4 & 38.3 & -15 \\ 7.5 & -20.0 & 10.0 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} -13 \\ 16.2 \\ -2.5 \end{bmatrix}$$
Checkpoint
1. In the method for finding the inverse of A, what is the final form of
A after the elementary row operations have been carried out?
2. Is the solution of the system Mx = d, x = dM−1 or x = M−1 d (or
neither)?
3. Give a condition for a matrix not to have an inverse.
EXERCISES
1. Find the inverses of the following matrices, using the elimination and back-
substitution method.
(a) 2 6 4
6 19 12
2 8 14
(b) 1.3 4.6 3.1
5.6 5.8 7.9
2x + 8y + 14z = 7 =3
(b) 1.3x + 4.6y + 3.1z = −1 =0
5.6x + 5.8y + 7.9z = 2 = 1
Note that all the elements above the leading diagonal in a lower triangular matrix
are zero. Examples of upper triangular matrices are
$$\mathbf{U}_1 = \begin{bmatrix} -1 & 2 & 1 \\ 0 & 8 & 6 \\ 0 & 0 & 6 \end{bmatrix} \qquad\text{and}\qquad \mathbf{U}_2 = \begin{bmatrix} -1 & 2 & 0 \\ 0 & 1 & 2 \\ 0 & 0 & -1 \end{bmatrix}$$
where all the elements below the leading diagonal are zero. The product of L1
and U1 is given by
$$\mathbf{A} = \mathbf{L}_1\mathbf{U}_1 = \begin{bmatrix} -1 & 2 & 1 \\ 0 & 8 & 6 \\ -2 & 0 & 5 \end{bmatrix}$$
1 Procedure
Suppose we have to solve a linear system Ax = b, and that we can express the
coefficient matrix A in the form A = LU. This form is called an LU decomposition
of A.
Then we may solve the linear system by the following procedure:
Stage 1: Write Ax = LUx = b.
Stage 2: Set y = Ux, and solve the lower triangular system Ly = b for y by
forward substitution; its augmented matrix is
$$\left[\begin{array}{ccccc|c}
\ell_{11} & 0 & \cdots & 0 & 0 & b_1 \\
\ell_{21} & \ell_{22} & \cdots & 0 & 0 & b_2 \\
\vdots & \vdots & & \vdots & \vdots & \vdots \\
\ell_{n-1,1} & \ell_{n-1,2} & \cdots & \ell_{n-1,n-1} & 0 & b_{n-1} \\
\ell_{n1} & \ell_{n2} & \cdots & \ell_{n,n-1} & \ell_{nn} & b_n
\end{array}\right]$$
Note that the value of yi depends on the values y1 , y2 , . . . , yi−1 already calculated.
Stage 3: Finally, use back-substitution on Ux = y to find xn , . . . , x1 in that order.
Later on we shall outline a general method for finding LU decompositions
of square matrices. There follows an example showing this method in action
involving the matrix A = L1 U1 given above. If we wish to solve Ax = b
with a number of different b’s, then this method is more efficient than applying
the Gaussian elimination technique to each separate linear system. Once we
have found an LU decomposition of A, we need only do forward and backward
substitutions to solve the system for any b.
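Assuming the factors L and U are already available, Stages 2 and 3 are just a forward substitution followed by a back-substitution. An illustrative Python sketch (not the book's pseudo-code), applied to the example worked in the next section:

def lu_solve(L, U, b):
    """Solve L U x = b: forward-substitute for y in L y = b,
    then back-substitute for x in U x = y."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                       # Stage 2: forward substitution
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):           # Stage 3: back-substitution
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x

# Example of Section 2: A = L1 U1, b = (0, 10, -11); the solution is (3, 2, -1).
L1 = [[1, 0, 0], [0, 1, 0], [2, -0.5, 1]]
U1 = [[-1, 2, 1], [0, 8, 6], [0, 0, 6]]
print(lu_solve(L1, U1, [0, 10, -11]))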
2 Example
We solve the system
−x1 + 2x2 + x3 = 0
8x2 + 6x3 = 10
−2x1 + 5x3 = −11
Stage 1: An LU decomposition of the system is
$$\mathbf{Ax} = \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 2 & -0.5 & 1 \end{bmatrix}}_{\mathbf{L}_1}\underbrace{\begin{bmatrix} -1 & 2 & 1 \\ 0 & 8 & 6 \\ 0 & 0 & 6 \end{bmatrix}}_{\mathbf{U}_1}\underbrace{\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}}_{\mathbf{x}} = \underbrace{\begin{bmatrix} 0 \\ 10 \\ -11 \end{bmatrix}}_{\mathbf{b}}$$
Stage 2: Set y = U1 x and then solve the system L1 y = b, that is,
$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 2 & -0.5 & 1 \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 10 \\ -11 \end{bmatrix}$$
y1 = 0
y2 = 10
2y1 − 0.5 × y2 + y3 = −11 ⇒ y3 = −6
Stage 3: Solve
$$\underbrace{\begin{bmatrix} -1 & 2 & 1 \\ 0 & 8 & 6 \\ 0 & 0 & 6 \end{bmatrix}}_{\mathbf{U}_1}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \underbrace{\begin{bmatrix} 0 \\ 10 \\ -6 \end{bmatrix}}_{\mathbf{y}}$$
Back-substitution yields:
6x3 = −6 ⇒ x3 = −1
8x2 + 6x3 = 10 ⇒ x2 = 2
−x1 + 2x2 + x3 = 0 ⇒ x1 = 3
Thus the solution of Ax = b is
$$\mathbf{x} = \begin{bmatrix} 3 \\ 2 \\ -1 \end{bmatrix}$$
which may be checked in the
original equations. We turn now to the problem of finding an LU decomposition
of a given square matrix A.
3 Effecting an LU decomposition
For an LU decomposition of a given matrix A of order n × n, we seek a lower
triangular matrix L and an upper triangular matrix U (both of order n × n) such
that A = LU. The matrix U may be taken to be the upper triangular matrix
resulting from the process of Gaussian elimination without partial pivoting (see
Sections 3 and 5 of Step 11), and the matrix L may be taken to be the lower
triangular matrix which has diagonal elements 1 and which, for k < i, has as the
(i, k)-th element the multiplier m ik . This multiplier is calculated at the k-th stage
of Gaussian elimination and is required to transform the current value of aik to
0. In the notation of Step 11, these multipliers were given by m ik = aik /akk ,
i = k + 1, k + 2, . . . , n.
An example will help clarify this. From Step 11, we recall that the Gaussian
elimination procedure applied to the system
x+ y−z =2
x + 2y + z = 6
2x − y + z = 1
yields the upper triangular matrix
$$\mathbf{U} = \begin{bmatrix} 1 & 1 & -1 \\ 0 & 1 & 2 \\ 0 & 0 & 9 \end{bmatrix}$$
Also, we saw that in the first stage we calculated the multipliers m 21 = a21 /a11 =
1/1 = 1 and m 31 = a31 /a11 = 2/1 = 2, while in the second stage we calculated
the multiplier m 32 = a32 /a22 = −3/1 = −3. Thus
$$\mathbf{L} = \begin{bmatrix} 1 & 0 & 0 \\ m_{21} & 1 & 0 \\ m_{31} & m_{32} & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 2 & -3 & 1 \end{bmatrix}$$
It may be readily verified that
$$\mathbf{L}\mathbf{U} = \begin{bmatrix} 1 & 1 & -1 \\ 1 & 2 & 1 \\ 2 & -1 & 1 \end{bmatrix}$$
the coefficient matrix of the original system.
Another technique that may be used to find an LU decomposition of an n × n
matrix is by a direct decomposition. To illustrate, suppose we wish to find an LU
decomposition for the 3 × 3 coefficient matrix of the system given above. Then
the required L and U are of the form
$$\mathbf{L} = \begin{bmatrix} \ell_{11} & 0 & 0 \\ \ell_{21} & \ell_{22} & 0 \\ \ell_{31} & \ell_{32} & \ell_{33} \end{bmatrix}, \qquad \mathbf{U} = \begin{bmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{bmatrix}$$
Note that the total number of unknowns in L and U is 12, whereas there are only
9 elements in the 3 × 3 coefficient matrix A. To ensure that L and U are unique,
we need to impose 12 − 9 = 3 extra conditions on the elements of these two
triangular matrices. (In the general n × n case, n extra conditions are required.)
One common choice is to require all the diagonal elements of L to have the value
1; the resulting method is known as Doolittle’s method. Another choice is to
require all the diagonal elements of U to be 1; this is called Crout’s method. Since
Doolittle’s method will result in the same LU decomposition for A as given above,
we shall use Crout’s method to illustrate this direct decomposition procedure.
We then require
$$\begin{bmatrix} \ell_{11} & 0 & 0 \\ \ell_{21} & \ell_{22} & 0 \\ \ell_{31} & \ell_{32} & \ell_{33} \end{bmatrix}\begin{bmatrix} 1 & u_{12} & u_{13} \\ 0 & 1 & u_{23} \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 & -1 \\ 1 & 2 & 1 \\ 2 & -1 & 1 \end{bmatrix}$$
By multiplying out L and U, we obtain:
`11 × 1 = 1 ⇒ `11 = 1
`11 u 12 = 1 ⇒ u 12 = 1
`11 u 13 = −1 ⇒ u 13 = −1
`21 × 1 = 1 ⇒ `21 = 1
`21 u 12 + `22 = 2 ⇒ `22 = 1
`21 u 13 + `22 u 23 = 1 ⇒ u 23 = 2
`31 × 1 = 2 ⇒ `31 = 2
`31 u 12 + `32 = −1 ⇒ `32 = −3
`31 u 13 + `32 u 23 + `33 = 1 ⇒ `33 = 9
It is clear that this construction from Crout’s method yields triangular matrices L
and U for which A = LU.
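The element-by-element equations above can be generated systematically for any order. The following Python sketch (illustrative, and assuming no pivoting is required) implements Doolittle's method, the choice diag(L) = 1; it reproduces the L and U obtained from Gaussian elimination earlier in this Step.

def doolittle(A):
    """Return (L, U) with A = L U and diag(L) = 1 (Doolittle's method).
    Assumes the pivots that arise are nonzero, so no row interchanges occur."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        L[k][k] = 1.0
        for j in range(k, n):                 # row k of U
            U[k][j] = A[k][j] - sum(L[k][s] * U[s][j] for s in range(k))
        for i in range(k + 1, n):             # column k of L, below the diagonal
            L[i][k] = (A[i][k] - sum(L[i][s] * U[s][k] for s in range(k))) / U[k][k]
    return L, U

L, U = doolittle([[1, 1, -1], [1, 2, 1], [2, -1, 1]])
print(L)   # [[1, 0, 0], [1, 1, 0], [2, -3, 1]]
print(U)   # [[1, 1, -1], [0, 1, 2], [0, 0, 9]]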
Checkpoint
EXERCISES
1 Norms
One of the most common tests for ill-conditioning of a linear system involves the
condition number of the coefficient matrix. In order to define this quantity, we
need to first consider the concept of the norm of a vector or matrix, which in some
way measures the size of their elements.
Let x and y be vectors. Then a vector norm k · k is a real number with the
following properties:
(a) kxk ≥ 0 and kxk = 0 if and only if x is a vector with all components zero;
(b) kαxk = |α| kxk for any real number α;
(c) kx + yk ≤ kxk + kyk (triangle inequality).
There are many possible ways to choose a vector norm with the above three
properties. One vector norm that is probably familiar to the student is the Euclidean
or 2-norm. Thus if x is an n × 1 vector, then the 2-norm is denoted and defined by
$$\|\mathbf{x}\|_2 \equiv \left[\sum_{i=1}^{n} x_i^2\right]^{1/2}$$
Another possible choice of norm, which is more suitable for our purposes here, is
the infinity norm defined by
$$\|\mathbf{x}\|_\infty \equiv \max_{i=1,2,\ldots,n} |x_i|$$
Thus for the vector in the previous example we have ‖x‖∞ = 6. It is easily verified
that kxk∞ has the three properties in the above definition. For kxk2 the first two
properties are easy to verify; the triangle inequality (c) is a bit more difficult and
requires use of the so-called Cauchy-Schwarz inequality (for example, see Cheney
and Kincaid (1994)).
The defining properties of a matrix norm are similar, except that there is an extra
property. Let A and M be matrices. Then a matrix norm k · k is a real number
with the following properties:
(a) kAk ≥ 0 and kAk = 0 if and only if A is a matrix with all elements zero;
(b) kαAk = |α| kAk for any real number α;
(c) kA + Mk ≤ kAk + kMk;
(d) kAMk ≤ kAk kMk.
As for vector norms, there are many ways of choosing matrix norms with the
four properties above, but here we consider only the infinity norm. If A is an n × n
matrix, then the infinity norm is defined by
$$\|\mathbf{A}\|_\infty \equiv \max_{i=1,2,\ldots,n}\;\sum_{j=1}^{n} |a_{ij}|$$
From this definition, we see that this norm is the maximum of the sums obtained
from adding the absolute values of the elements in each row, so it is commonly
referred to as the maximum row sum norm.
As an example, suppose
$$\mathbf{A} = \begin{bmatrix} -3 & 3 & 4 & 4 \\ 5 & 1 & 2 & -3 \\ -4 & 4 & -3 & -4 \\ -3 & -2 & 4 & -2 \end{bmatrix}$$
Then
$$\sum_{j=1}^{4}|a_{1j}| = 14, \quad \sum_{j=1}^{4}|a_{2j}| = 11, \quad \sum_{j=1}^{4}|a_{3j}| = 15, \quad \sum_{j=1}^{4}|a_{4j}| = 11$$
so that ‖A‖∞ = 15.
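The maximum row sum norm is straightforward to compute; a minimal Python sketch (illustrative only), applied to the matrix above:

def norm_inf(A):
    """Infinity (maximum row sum) norm of a square matrix A."""
    return max(sum(abs(a) for a in row) for row in A)

A = [[-3, 3, 4, 4], [5, 1, 2, -3], [-4, 4, -3, -4], [-3, -2, 4, -2]]
print(norm_inf(A))   # 15, the largest of the row sums 14, 11, 15, 11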
where we have used the matrix norm property (d) given in the previous section.
Large values of the condition number usually indicate ill-conditioning. As a
justification for this last statement, we state and prove the following theorem.
Theorem: Suppose x satisfies the linear system Ax = b and x̃ satisfies the linear
system Ax̃ = b̃. Then
$$\frac{\|\mathbf{x} - \tilde{\mathbf{x}}\|_\infty}{\|\mathbf{x}\|_\infty} \le \mathrm{cond}(\mathbf{A})\,\frac{\|\mathbf{b} - \tilde{\mathbf{b}}\|_\infty}{\|\mathbf{b}\|_\infty}$$
we see that
kx − x̃k∞ ≤ kA−1 k∞ kb − b̃k∞
However, since b = Ax, we have kbk∞ ≤ kAk∞ kxk∞ , or
$$\frac{1}{\|\mathbf{x}\|_\infty} \le \frac{\|\mathbf{A}\|_\infty}{\|\mathbf{b}\|_\infty}$$
Multiplying the two inequalities together gives
$$\|\mathbf{x} - \tilde{\mathbf{x}}\|_\infty \times \frac{1}{\|\mathbf{x}\|_\infty} \le \|\mathbf{A}^{-1}\|_\infty\,\|\mathbf{b} - \tilde{\mathbf{b}}\|_\infty \times \frac{\|\mathbf{A}\|_\infty}{\|\mathbf{b}\|_\infty}$$
and cond(A) = kAk∞ kA−1 k∞ = 3.01 × 200 = 602. This suggests that a
numerical solution would not be very accurate if only two decimal digits of
accuracy were used in the calculations. Indeed, if the components of A were
rounded to two decimal digits, the two rows of A would be identical. Then the
determinant of A would be zero, and it follows from the theorem in Step 11 that
this system would not have a unique solution.
We recall that as defined, the condition number requires A−1 , but it is compu-
tationally expensive to compute the inverse matrix. Moreover, even if the inverse
were calculated, this approximation might not be very accurate if the system is
ill-conditioned. It is therefore common in software packages to estimate the con-
dition number by obtaining an estimate of kA−1 k∞ without explicitly finding
A−1 .
Checkpoint
EXERCISES
Suppose A is an n×n matrix. If there exists a number λ and a nonzero vector x such
that Ax = λx, then λ is said to be an eigenvalue of A, and x the corresponding
eigenvector. The evaluation of eigenvalues and eigenvectors of matrices is a
problem that arises in a variety of contexts. Note that if we have an eigenvalue λ
and an eigenvector x, then βx (where β is any real number) is also an eigenvector
since
A(βx) = βAx = βλx = λ(βx)
This shows that the eigenvector is not unique and may be scaled if desired (for
instance, we might want the sum of the components of the eigenvector to be 1).
Writing Ax = λx as
(A − λI)x = 0
we conclude from the theorem on page 43 that this can have a nonzero solution
only if the determinant of A − λI is zero. If we expand out this determinant, then
we get an n-th degree polynomial in λ known as the characteristic polynomial
of A. Thus one way to find the eigenvalues of A is to obtain its characteristic
polynomial, and then find the n zeros (some may be complex) of this polynomial.
For example, suppose
$$\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$
Then
$$\mathbf{A} - \lambda\mathbf{I} = \begin{bmatrix} a - \lambda & b \\ c & d - \lambda \end{bmatrix}$$
This last matrix has determinant
magnitude. Methods for finding all the eigenvalues are beyond the scope of this
book. (One such method, called the QR method, is based on the QR factorization
to be discussed in Section 3 of Step 27.)
1 Power method
Suppose that the n eigenvalues of A are λ1 , λ2 , . . . , λn and that they are ordered
in such a way that
|λ1 | > |λ2 | ≥ · · · ≥ |λn−1 | ≥ |λn |
Then the power method† can be used to find λ1 . We begin with a starting vector
w(0) and calculate the vectors
w( j) = Aw( j−1)
w( j) = A j w(0)
where A j is A multiplied by itself j times. Thus w( j) is the product of w(0) and the
j-th power of A, which explains why this approach is called the power method.
It turns out that at the j-th iteration an approximation to the eigenvector x
associated with λ1 is given by w(j). Moreover, if w_k^(j) and w_k^(j−1) are the k-th
components of w(j) and w(j−1) respectively, then an approximation to λ1 is given by
$$\lambda_1^{(j)} = \frac{w_k^{(j)}}{w_k^{(j-1)}}$$
for any k ∈ {1, 2, . . . , n}. Although there are n possible choices for k, it is usual
to choose k so that w_k^(j) is the component of w(j) with the largest magnitude.
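A minimal Python sketch of the (unscaled) power method just described. The starting vector w(0) = (1, 1, 1)ᵀ is an assumption made here for illustration; in practice the scaled variant of Section 3 would be preferred.

def power_method(A, w0, iterations=20):
    """Approximate the dominant eigenvalue of A by the unscaled power method."""
    n = len(w0)
    w_prev = list(w0)
    lam = None
    for _ in range(iterations):
        w = [sum(A[i][j] * w_prev[j] for j in range(n)) for i in range(n)]
        # k is the index of the component of largest magnitude.
        k = max(range(n), key=lambda i: abs(w[i]))
        lam = w[k] / w_prev[k]
        w_prev = w
    return lam, w_prev

A = [[1, 1, -1], [1, 2, 1], [2, -1, 1]]
lam, w = power_method(A, [1, 1, 1], iterations=8)
print(lam)   # approaches the dominant eigenvalue, about 2.6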
2 Example
Let us use the power method to find the largest eigenvalue of the matrix
$$\mathbf{A} = \begin{bmatrix} 1 & 1 & -1 \\ 1 & 2 & 1 \\ 2 & -1 & 1 \end{bmatrix}$$
Since the second component of w(1) has the largest magnitude we take k = 2, so
that the first approximation to λ1 is
$$\lambda_1^{(1)} = \frac{w_2^{(1)}}{w_2^{(0)}} = \frac{4}{1} = 4$$
From these calculations we conclude that the largest eigenvalue is about 2.6.
3 Variants
In the previous example, the reader would have noticed that the components of
w( j) were growing in size as j increases. Overflow problems would arise if this
growth were to continue, so in practice it is usual to use the scaled power method
instead. This is identical to the power method except we scale the vectors w( j)
at each iteration. Thus suppose w(0) is given and set y(0) = w(0) . Then for
j = 1, 2, . . . let us carry out the following steps:
w( j) = A−1 w( j−1)
In general it is more efficient to solve the linear system Aw( j) = w( j−1) than to
find the inverse of A (see Step 14).
4 Other aspects
It may be shown that the convergence rate of the power method is linear and that
under suitable conditions
$$\left|\lambda_1^{(j)} - \lambda_1\right| \approx C\left|\frac{\lambda_2}{\lambda_1}\right|^{j}$$
where C is some positive constant. Thus the bigger the gap between λ2 and λ1 ,
the faster the rate of convergence.
Since the power method is an iterative method, one has to stop at some stage. It
is usual to carry on the process until successive estimates of the eigenvalue agree
to a certain tolerance or a maximum number of iterations is exceeded.
Difficulties with the power method usually arise when our assumptions about
the eigenvalues are not valid. For instance, if |λ1 | = |λ2 |, then the sequence of
estimates for λ1 may not converge. Even if the sequence does converge, one may
not be able to get an approximation to the eigenvector associated with λ1 . A short
discussion of such difficulties may be found in Conte and de Boor (1980).
Checkpoint
EXERCISES
STEP 18
FINITE DIFFERENCES 1
Tables
Historically, numerical analysts have been concerned with tables of numbers, and
many techniques have been developed for dealing with mathematical functions
represented in this way. For example, the value of the function at an untabulated
point may be required, so that an interpolation procedure is necessary. It is also
possible to estimate the derivative or the definite integral of a tabulated function,
using some finite processes to approximate the corresponding (infinitesimal) lim-
iting procedures of calculus. In each case, it has been traditional to use finite
differences. Another application of finite differences, which is outside the scope
of this book, is the numerical solution of partial differential equations.
1 Tables of values
Many books contain tables of mathematical functions. One of the most com-
prehensive is Handbook of Mathematical Functions, edited by Abramowitz and
Stegun (see the Bibliography for publication details), which also contains useful
information about numerical methods.
Although most tables use constant argument intervals, some functions do change
rapidly in value in particular regions of the argument, and hence may best be
tabulated using intervals varying according to the local behaviour of the function.
Tables with varying argument interval are more difficult to work with, however, and
it is common to adopt uniform argument intervals wherever possible. As a simple
example consider the 6S table of the exponential function over 0.10 (0.01) 0.18
(this notation specifies the domain 0.10 ≤ x ≤ 0.18 spanned in intervals of 0.01).
2 Finite differences
Since Newton, finite differences have been used extensively. The construction of
a table of finite differences for a tabulated function is simple: first differences are
obtained by subtracting each value from the succeeding value in a table, second
differences by repeating this operation on the first differences, and so on for higher
orders. From the above table of e x for x = 0.10 (0.01) 0.18, one has the following
table (note the customary layout, with decimal points and leading zeros omitted
from the differences).
Differences
x f (x) = e x 1st. 2nd. 3rd.
0.10 1.10517
1111
0.11 1.11628 11
1122 0
0.12 1.12750 11
1133 0
0.13 1.13883 11
1144 1
0.14 1.15027 12
1156 0
0.15 1.16183 12
1168 −1
0.16 1.17351 11
1179 2
0.17 1.18530 13
1192
0.18 1.19722
(In this case, the differences must be multiplied by 10−5 for comparison with the
function values.)
Differences
x f (x) = e x 1st. 2nd. 3rd.
0.10 1.10517
5666
0.15 1.16183 291
5957 15
0.20 1.22140 306
6263 14
0.25 1.28403 320
6583 18
0.30 1.34986 338
6921 16
0.35 1.41907 354
7275 20
0.40 1.49182 374
7649 18
0.45 1.56831 392
8041
0.50 1.64872
Although the round-off errors in f should be less than 12 in the last significant
place, they may accumulate; the greatest error that can be obtained corresponds
to:
Differences
Tabular error 1st. 2nd. 3rd. 4th. 5th. 6th.
+ 12
−1
− 12 +2
+1 −4
+ 12 −2 +8
−1 +4 −16
− 12 +2 −8 +32
+1 −4 +16
+ 12 −2 +8
−1 +4
− 12 +2
+1
+ 12
A rough working criterion for the expected fluctuations (‘noise level’) due to
round-off error is shown in the following table.
Order of difference 1 2 3 4 5 6
Expected error limits ±1 ±2 ±3 ±6 ±12 ±22
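Such difference tables are easily generated by repeated subtraction. A short illustrative Python sketch (not the book's pseudo-code), applied to the 6S values of eˣ at interval 0.05 tabulated above; the values are held in units of the fifth decimal place, so the differences appear exactly as in the table.

def difference_table(f_values, orders=3):
    """Return [f, first differences, second differences, ...]."""
    columns = [list(f_values)]
    for _ in range(orders):
        prev = columns[-1]
        columns.append([prev[i + 1] - prev[i] for i in range(len(prev) - 1)])
    return columns

# e^x for x = 0.10 (0.05) 0.50, decimal points omitted as in the table above.
f = [110517, 116183, 122140, 128403, 134986, 141907, 149182, 156831, 164872]
for order, col in enumerate(difference_table(f)):
    print(order, col)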
Checkpoint
EXERCISES
STEP 19
FINITE DIFFERENCES 2
Forward, backward, and central difference notations
There are several different notations for the single set of finite differences described
in the previous Step. Here we shall consider only the forward, backward, and
central differences. We introduce each of these three notations in terms of the
so-called shift operator, which we define first.
The shift operator E is defined by Ef_j = f_{j+1}, where f_j denotes f(x_j) = f(x_0 + jh); it follows that
E^{-1} f_j = f_{j-1}
and
E^{1/2} f_j = f_{j+1/2} = f(x_j + ½h) = f(x_0 + (j + ½)h)
Since the forward difference operator is defined by
Δ ≡ E − 1
it follows that
Δf_j = (E − 1)f_j = Ef_j − f_j = f_{j+1} − f_j
Similarly, the backward difference operator is defined by
∇ ≡ 1 − E^{-1}
and it follows that
∇f_j = (1 − E^{-1})f_j = f_j − E^{-1}f_j = f_j − f_{j-1}
The central difference operator is defined by
δ ≡ E^{1/2} − E^{-1/2}
and it follows that
δf_j = (E^{1/2} − E^{-1/2})f_j = E^{1/2}f_j − E^{-1/2}f_j = f_{j+1/2} − f_{j-1/2}
5 Difference display
The role of the forward, central, and backward differences is displayed by the
difference table:
                               Differences
  x          f(x)       1st.          2nd.          3rd.          4th.
  x0         f0
                        Δf0
  x1         f1                       Δ²f0
                        Δf1                         Δ³f0
  x2         f2                       Δ²f1                        Δ⁴f0
                        Δf2                         Δ³f1
  x3         f3                       Δ²f2
                        Δf3
  x4         f4
  ..         ..
  x_{j−2}    f_{j−2}
                        δf_{j−3/2}
  x_{j−1}    f_{j−1}                  δ²f_{j−1}
                        δf_{j−1/2}                  δ³f_{j−1/2}
  x_j        f_j                      δ²f_j                       δ⁴f_j
                        δf_{j+1/2}                  δ³f_{j+1/2}
  x_{j+1}    f_{j+1}                  δ²f_{j+1}
                        δf_{j+3/2}
  x_{j+2}    f_{j+2}
  ..         ..
  x_{n−4}    f_{n−4}
                        ∇f_{n−3}
  x_{n−3}    f_{n−3}                  ∇²f_{n−2}
                        ∇f_{n−2}                    ∇³f_{n−1}
  x_{n−2}    f_{n−2}                  ∇²f_{n−1}                   ∇⁴f_n
                        ∇f_{n−1}                    ∇³f_n
  x_{n−1}    f_{n−1}                  ∇²f_n
                        ∇f_n
  x_n        f_n
Although forward, central, and backward differences represent precisely the same
set of numbers:
(a) forward differences are especially useful near the start of a table, since they
involve tabulated function values below x j ;
(b) central differences are especially useful away from the ends of the table,
where there are available tabulated function values above and below x j ;
(c) backward differences are especially useful near the end of a table, since they
involve tabulated function values above x j .
Checkpoint
EXERCISES
f(x) = 3x³ − 2x² + x + 5
2. For the difference table on page 81 of f (x) = e x for x = 0.1 (0.05) 0.5 to
six significant digits, determine the following (taking x0 = 0.1):
(a) Δf2, Δ²f2, Δ³f2, Δ⁴f2.
(d) Δ²f1, δ²f2, ∇²f3.
(e) Δ³f3, ∇³f6, δ³f_{9/2}.
(d) δ³f_j = f_{j+3/2} − 3f_{j+1/2} + 3f_{j−1/2} − f_{j−3/2}.
STEP 20
FINITE DIFFERENCES 3
Polynomials
yields
(x_j + h)^k − x_j^k = k x_j^{k−1} h + polynomial of degree (k − 2)
Omitting the subscript on x_j, we then have
Δf(x) = f(x + h) − f(x)
     = a_n[(x + h)^n − x^n] + a_{n−1}[(x + h)^{n−1} − x^{n−1}] + · · · + a_1[(x + h) − x]
     = a_n n x^{n−1} h + polynomial of degree (n − 2)
so that Δf(x) is a polynomial of degree (n − 1). Repeating the argument, Δ²f(x)
is a polynomial of degree (n − 2), and so on, until Δⁿf(x) is a constant; hence
Δ^{n+1} f(x) = 0
In passing, the student may recall that in differential calculus the increment
1 f (x) = f (x + h) − f (x) is related to the derivative of f (x) at the point x.
2 Example
For f (x) = x 3 for x = 5.0 (0.1) 5.5 we obtain the following difference table.
x       f(x) = x³       Δ       Δ²       Δ³       Δ⁴
5.0 125.000
7651
5.1 132.651 306
7957 6
5.2 140.608 312 0
8269 6
5.3 148.877 318 0
8587 6
5.4 157.464 324
8911
5.5 166.375
x       f(x) = x³       Δ       Δ²       Δ³       Δ⁴
5.0 125.00
765
5.1 132.65 31
796 0
5.2 140.61 31 0
827 0
5.3 148.88 31 3
858 3
5.4 157.46 34
892
5.5 166.38
x       f(x) = eˣ       Δ       Δ²       Δ³       Δ⁴
0.10 1.10517
5666
0.15 1.16183 291
5957 15
0.20 1.22140 306 −1
6263 14
0.25 1.28403 320 4
6583 18
0.30 1.34986 338 −2
6921 16
0.35 1.41907 354 4
7275 20
0.40 1.49182 374 −2
7649 18
0.45 1.56831 392
8041
0.50 1.64872
Since the estimate for round-off error in the third differences is ±3 (see page 82), we say that
third differences are constant within round-off error, and deduce that a cubic
approximation is appropriate for e x over the range 0.1 < x < 0.5 at interval
0.05. In this fashion, differences can be used to decide what (if any) degree of
approximating polynomial is appropriate.
An example in which polynomial approximation is inappropriate is when
f (x) = 10x for x = 0 (1) 4, thus:
x       f(x)       Δ       Δ²       Δ³       Δ⁴
0 1
9
1 10 81
90 729
2 100 810 6561
900 7290
3 1000 8100
9000
4 10000
Although f (x) = 10x is ‘smooth’, the large tabular interval (h = 1) produces
large higher order finite differences. It should also be understood that there exist
functions that cannot usefully be tabulated at all, at least in some neighbourhood;
for example, f (x) = sin(1/x) near the origin x = 0. Nevertheless, these are
fairly exceptional cases.
Checkpoint
EXERCISES
STEP 21
INTERPOLATION 1
Linear and quadratic interpolation
Interpolation is ‘the art of reading between the lines in a table’ and may be regarded
as a special case of the general process of curve fitting (see Steps 26 and 28). More
precisely, interpolation is the process whereby untabulated values of a function
tabulated only at certain values are estimated, on the assumption that the function
behaves sufficiently smoothly between tabular points for it to be approximated by
a polynomial of fairly low degree.
Interpolation is not as important in Numerical Analysis as it was, now that
computers (and calculators with built-in functions) are available, and function
values may often be obtained readily by an algorithm (probably from a standard
subroutine). However,
(a) interpolation is still important for functions that are available only in tabular
form (perhaps from the results of an experiment); and
(b) interpolation serves to introduce the wider application of finite differences.
In Step 20, we observed that when the differences of order k are constant
(within round-off fluctuation), the tabulated function may be approximated by a
polynomial of degree k. Linear and quadratic interpolation correspond to the
cases k = 1 and k = 2, respectively.
1 Linear interpolation
When a tabulated function varies so slowly that first differences are approximately
constant, it may be approximated closely by a straight line between adjacent
tabular points. This is the basic idea of linear interpolation. In Figure 10, the two
function points (x j , f j ) and (x j+1 , f j+1 ) are connected by a straight line. Any x
between x_j and x_{j+1} may be defined by a value θ (0 ≤ θ ≤ 1) such that
x = x_j + θh
Provided f (x) is only slowly varying in the interval, a value of the function at
x is approximately given by the ordinate to the straight line at x. Elementary
geometrical considerations yield
$$\theta = \frac{x - x_j}{x_{j+1} - x_j} \approx \frac{f(x) - f_j}{f_{j+1} - f_j}$$
so that
$$f(x) \approx f_j + \theta(f_{j+1} - f_j) = f_j + \theta\,\Delta f_j$$
[Figure 10 shows the chord joining (x_j, f_j) and (x_{j+1}, f_{j+1}); the interpolated value at x = x_j + θh is the ordinate of the chord at x, with the intervals θh and h marked along the x axis.]
FIGURE 10. Linear interpolation.
The first differences are almost constant locally, so that the table is suitable for
linear interpolation. For example,
$$f(0.934) \approx 0.3946 + \tfrac{4}{10}(-0.0040) = 0.3930$$
2 Quadratic interpolation
As previously indicated, linear interpolation is appropriate only for slowly varying
functions. The next simplest process is quadratic interpolation, based on an
approximating polynomial of degree two; one might expect that this approximation
would give better accuracy for functions with larger variation.
Given three adjacent points x j , x j+1 = x j + h, and x j+2 = x j + 2h, suppose
that f (x) is approximated by
P2 (x) = a + b(x − x j ) + c(x − x j )(x − x j+1 )
where a, b, and c are chosen so that
P2 (x j+k ) = f (x j+k ) = f j+k , k = 0, 1, 2
Thus
P2 (x j ) = a = f j
P2 (x j+1 ) = a + bh = f j+1
P2 (x j+2 ) = a + 2bh + 2ch 2 = f j+2
whence
a = f_j
b = (f_{j+1} − a)/h = (f_{j+1} − f_j)/h = Δf_j/h
c = (f_{j+2} − 2bh − a)/(2h²) = (f_{j+2} − 2f_{j+1} + f_j)/(2h²) = Δ²f_j/(2h²)
Setting x = x j + θ h, we obtain the quadratic interpolation formula
$$f(x_j + \theta h) \approx f_j + \theta\,\Delta f_j + \tfrac{1}{2}\theta(\theta - 1)\,\Delta^2 f_j$$
We note immediately that the quadratic interpolation formula introduces a second
term (involving Δ²f_j) not included in the linear interpolation formula.
As an example, we determine the second-order correction to the value of
f (0.934) obtained above using linear interpolation. The extra term is
$$\tfrac{1}{2} \times \tfrac{4}{10} \times \left(-\tfrac{6}{10}\right) \times 0.0001 = -\frac{0.0024}{200}$$
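Both formulas are easily evaluated once f_j and its leading differences are known. An illustrative Python sketch, using the numbers quoted in the example above (f_j = 0.3946, Δf_j = −0.0040, Δ²f_j = 0.0001, θ = 0.4):

def quadratic_interpolate(fj, d1, d2, theta):
    """f(x_j + theta*h) from f_j and the forward differences
    d1 = delta f_j and d2 = delta^2 f_j."""
    linear = fj + theta * d1                          # linear interpolation
    return linear + 0.5 * theta * (theta - 1.0) * d2  # second-order correction

print(quadratic_interpolate(0.3946, -0.0040, 0.0001, 0.4))
# about 0.3930 - 0.000012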
Checkpoint
EXERCISES
STEP 22
INTERPOLATION 2
Newton interpolation formulae
The linear and quadratic interpolation formulae are based on first and second de-
gree polynomial approximation. Newton derived general forward and backward
difference interpolation formulae, corresponding to approximation by a poly-
nomial of degree n, for tables of constant interval h. (For tables that do not
have constant interval, we can use an interpolation procedure involving divided
differences – see Step 24.)
1 Newton’s forward difference formula
Consider the points x j , x j + h, x j + 2h, . . ., and recall that
$$f(x_j + \theta h) = E^\theta f_j = (1 + \Delta)^\theta f_j = \left[1 + \theta\Delta + \frac{\theta(\theta-1)}{2!}\Delta^2 + \frac{\theta(\theta-1)(\theta-2)}{3!}\Delta^3 + \cdots\right] f_j$$
which is Newton’s forward difference formula. The linear and quadratic (for-
ward) interpolation formulae correspond to truncation at first and second order,
respectively. If we truncate at n-th order, we obtain
$$f(x_j + \theta h) \approx \left[1 + \theta\Delta + \frac{\theta(\theta-1)}{2!}\Delta^2 + \cdots + \frac{\theta(\theta-1)\cdots(\theta-n+1)}{n!}\Delta^n\right] f_j$$
Δ^{n+k} f_j = 0,    k = 1, 2, . . .
$$f(x_j + \theta h) = E^\theta f_j = (1 - \nabla)^{-\theta} f_j = \left[1 + \theta\nabla + \frac{\theta(\theta+1)}{2!}\nabla^2 + \frac{\theta(\theta+1)(\theta+2)}{3!}\nabla^3 + \cdots\right] f_j$$
which is Newton’s backward difference formula. The linear and quadratic (back-
ward) interpolation formulae correspond to truncation at first and second order,
respectively. The approximation based on the values f j−n , f j−n+1 , . . . , f j−1 , f j
is
$$f(x_j + \theta h) \approx \left[1 + \theta\nabla + \frac{\theta(\theta+1)}{2!}\nabla^2 + \cdots + \frac{\theta(\theta+1)\cdots(\theta+n-1)}{n!}\nabla^n\right] f_j$$
x°      f(x) = sin x      Δ      Δ²      Δ³      Δ⁴      Δ⁵
0 0
1736
10 0.1736 −52
1684 −52
20 0.3420 −104 4
1580 −48 0
30 0.5000 −152 4
1428 −44
40 0.6428 −196
1232
50 0.7660
Since constant differences occur at fourth order, we conclude that a quartic ap-
proximation is appropriate. (Third-order differences are not quite constant within
expected round-off, and we anticipate that a cubic approximation is not quite good
enough.) To determine sin 5° from the table, we use Newton's forward difference
formula.
Note that we have kept a guard digit (in parentheses) to minimize accumulated
round-off error.
To determine sin 45◦ from the table, we use Newton’s backward difference
formula (to fourth order); thus taking x_j = 40, we have θ = (45 − 40)/10 = 1/2, and
$$\sin 45° \approx 0.6428 + \tfrac{1}{2}(0.1428) + \tfrac{1}{2!}\cdot\tfrac{1}{2}\cdot\tfrac{3}{2}(-0.0152) + \tfrac{1}{3!}\cdot\tfrac{1}{2}\cdot\tfrac{3}{2}\cdot\tfrac{5}{2}(-0.0048) + \tfrac{1}{4!}\cdot\tfrac{1}{2}\cdot\tfrac{3}{2}\cdot\tfrac{5}{2}\cdot\tfrac{7}{2}(0.0004)$$
= 0.6428 + 0.0714 − 0.0057 − 0.0015 + 0.0001(1)
= 0.7071 (compare with the actual value 0.7071 to 4D)
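The forward formula truncated at any order can be evaluated by accumulating the coefficients θ(θ−1)···(θ−k+1)/k! term by term. An illustrative Python sketch (not the book's pseudo-code), applied to sin 5° using the leading forward differences of the table above:

def newton_forward(diffs, theta):
    """Evaluate f(x_0 + theta*h) from diffs = [f_0, delta f_0, delta^2 f_0, ...]
    by Newton's forward difference formula truncated at len(diffs) - 1."""
    result = 0.0
    coeff = 1.0                       # theta(theta-1)...(theta-k+1)/k!
    for k, d in enumerate(diffs):
        result += coeff * d
        coeff *= (theta - k) / (k + 1)
    return result

# x_j = 0, theta = 0.5; leading forward differences from the sine table above.
print(newton_forward([0, 0.1736, -0.0052, -0.0052, 0.0004], 0.5))
# about 0.0871 (sin 5 degrees = 0.0872 to 4D)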
$$f(x) \approx P_n(x) = \left[1 + \theta\Delta + \frac{\theta(\theta-1)}{2!}\Delta^2 + \cdots + \frac{\theta(\theta-1)\cdots(\theta-n+1)}{n!}\Delta^n\right] f_0$$
and
$$f(x) \approx Q_n(x) = \left[1 + \phi\nabla + \frac{\phi(\phi+1)}{2!}\nabla^2 + \cdots + \frac{\phi(\phi+1)\cdots(\phi+n-1)}{n!}\nabla^n\right] f_n$$
where θ = (x − x_0)/h and φ = (x − x_n)/h.
Clearly Pn and Q n are both polynomials of degree n. It can be verified (see
Exercise 2 at the end of this Step) that Pn (x j ) = Q n (x j ) = f (x j ) for j =
0, 1, 2, . . . , n, which implies that Pn − Q n is a polynomial of degree n which
vanishes at (n + 1) points. This in turn implies that Pn − Q n ≡ 0, or Pn ≡ Q n . In
fact a polynomial of degree n through any given (n+1) (distinct but not necessarily
equidistant) points is unique, and is called the interpolating polynomial.
$$f(x) = f_j + (x - x_j)Df_j + \frac{(x - x_j)^2}{2!}D^2 f_j + \cdots$$
or, setting x = x_j + θh,
$$f(x_j + \theta h) = f_j + \theta h\,Df_j + \frac{\theta^2 h^2}{2!}D^2 f_j + \cdots = \left[1 + \theta hD + \frac{\theta^2 h^2}{2!}D^2 + \cdots\right] f_j = e^{\theta hD} f_j$$
Since also
$$f(x_j + \theta h) = E^\theta f_j$$
we may identify E with e^{hD}.
Checkpoint
EXERCISES
$$= \left[1 + j\Delta + \frac{j(j-1)}{2}\Delta^2 + \cdots + \Delta^j\right] f(x_0)$$
3. Derive the equation of the interpolating polynomial for the following data.
x f (x) x f (x)
0 3 3 24
1 2 4 59
2 7 5 118
STEP 23
INTERPOLATION 3
Lagrange interpolation formula
The linear and quadratic interpolation formulae of Step 21 correspond to first and
second degree polynomial approximation, respectively. In Step 22, we discussed
the Newton forward and backward interpolation formulae and noted that higher
order interpolation corresponds to higher degree polynomial approximation. In
this Step we consider an interpolation formula attributed to Lagrange, which does
not require function values at equal intervals of the argument. The Lagrange
interpolation formula has the disadvantage that the degree of the approximating
polynomial must be chosen at the outset, and in the next Step we shall discuss
another approach. Thus the Lagrange formula is mainly of theoretical interest
for us here, but in passing we mention that there are some important applications
beyond the scope of this book – for example, the construction of basis functions
to solve differential equations using a spectral (‘discrete ordinate’) method.
1 Procedure
Suppose that the function f is tabulated at (n + 1) (not necessarily equidistant)
points {x0, x1, . . . , xn} and is to be approximated by a polynomial Pn(x) of degree n such that
f_j = f(x_j) = P_n(x_j) for j = 0, 1, 2, . . . , n
Now, for k = 0, 1, 2, . . . , n,
$$L_k(x) = \frac{(x - x_0)(x - x_1)\cdots(x - x_{k-1})(x - x_{k+1})\cdots(x - x_n)}{(x_k - x_0)(x_k - x_1)\cdots(x_k - x_{k-1})(x_k - x_{k+1})\cdots(x_k - x_n)}$$
is a polynomial of degree n which satisfies
L_k(x_j) = 0,  j ≠ k,  j = 0, 1, 2, . . . , n,   and   L_k(x_k) = 1
Hence
$$P_n(x) = \sum_{k=0}^{n} L_k(x)\, f_k$$
is a polynomial of degree (at most) n such that
Pn (x j ) = f j , j = 0, 1, 2, . . . , n
that is, it is the (unique) interpolating polynomial. Note that for x = x j all terms
in the sum vanish except the j-th, which is f j ; L k (x) is called the k-th Lagrange
interpolation coefficient, and the identity
$$\sum_{k=0}^{n} L_k(x) = 1$$
2 Example
We use the Lagrange interpolation formula to find the interpolating polynomial
P3 through the points (0, 3), (1, 2), (2, 7), and (4, 59), and then approximate f (3)
by P3 (3).
The Lagrange coefficients are
$$L_0(x) = \frac{(x-1)(x-2)(x-4)}{(0-1)(0-2)(0-4)} = -\tfrac{1}{8}(x^3 - 7x^2 + 14x - 8)$$
$$L_1(x) = \frac{(x-0)(x-2)(x-4)}{(1-0)(1-2)(1-4)} = \tfrac{1}{3}(x^3 - 6x^2 + 8x)$$
$$L_2(x) = \frac{(x-0)(x-1)(x-4)}{(2-0)(2-1)(2-4)} = -\tfrac{1}{4}(x^3 - 5x^2 + 4x)$$
$$L_3(x) = \frac{(x-0)(x-1)(x-2)}{(4-0)(4-1)(4-2)} = \tfrac{1}{24}(x^3 - 3x^2 + 2x)$$
(The student can verify that L_0(x) + L_1(x) + L_2(x) + L_3(x) = 1.) Hence, the
required polynomial is
$$P_3(x) = 3L_0(x) + 2L_1(x) + 7L_2(x) + 59L_3(x)$$
$$= \tfrac{1}{24}\left(-9x^3 + 63x^2 - 126x + 72 + 16x^3 - 96x^2 + 128x - 42x^3 + 210x^2 - 168x + 59x^3 - 177x^2 + 118x\right)$$
$$= \tfrac{1}{24}(24x^3 + 0x^2 - 48x + 72) = x^3 - 2x + 3$$
to evaluate P3 (x) for some x directly from the factored forms of L k (x). Thus, to
evaluate P3 (3), one has
$$L_0(3) = \frac{(3-1)(3-2)(3-4)}{(0-1)(0-2)(0-4)} = \frac{1}{4}, \quad \text{etc.}$$
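The Lagrange formula can be evaluated directly from the factored forms of the L_k(x), as suggested above. An illustrative Python sketch:

def lagrange(points, x):
    """Evaluate the interpolating polynomial through `points`
    (a list of (x_k, f_k) pairs) at x, by the Lagrange formula."""
    total = 0.0
    for k, (xk, fk) in enumerate(points):
        lk = 1.0
        for j, (xj, _) in enumerate(points):
            if j != k:
                lk *= (x - xj) / (xk - xj)    # k-th Lagrange coefficient L_k(x)
        total += lk * fk
    return total

# The example above: P3 through (0,3), (1,2), (2,7), (4,59); P3(3) = 24.
print(lagrange([(0, 3), (1, 2), (2, 7), (4, 59)], 3))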
3 Notes of caution
In the case of the Newton interpolation formulae considered in the previous Step,
or the formulae to be discussed in the next Step, the degree of the required
approximating polynomial may be determined merely by computing terms until
they no longer appear significant. In the Lagrange procedure, the polynomial
degree must be chosen at the outset. Also, note that
(a) a change of degree involves a completely new computation of all terms; and
(b) for a polynomial of high degree the process involves a large number of
multiplications and therefore may be quite slow.
Lagrange interpolation should be used with considerable caution. For example,
suppose we use Lagrange interpolation to estimate ∛20 from the points (0, 0),
(1, 1), (8, 2), (27, 3), and (64, 4) on f(x) = ∛x. We have
$$f(x) \approx \frac{x(x-8)(x-27)(x-64)}{1(-7)(-26)(-63)}\times 1 + \frac{x(x-1)(x-27)(x-64)}{8(7)(-19)(-56)}\times 2 + \frac{x(x-1)(x-8)(x-64)}{27(26)(19)(-37)}\times 3 + \frac{x(x-1)(x-8)(x-27)}{64(63)(56)(37)}\times 4$$
so that f(20) ≈ −1.3139, which is not very close to the correct value 2.7144! A
better result (2.6316) can be obtained by linear interpolation between (8, 2) and
(27, 3). The problem is that the Lagrange method gives no indication as to how
well f(x) = ∛x is represented by a quartic. In practice, therefore, Lagrange
interpolation is used only rarely.
Checkpoint
EXERCISE
Given that f (−2) = 46, f (−1) = 4, f (1) = 4, f (3) = 156, and f (4) =
484, use the Lagrange interpolation formula to estimate f (0).
STEP 24
INTERPOLATION 4*
Divided differences*
1 Divided differences
Again, suppose the function f is tabulated at the (not necessarily equidistant)
points {x0 , x1 , . . . , xn }. We define the divided differences between points thus:
first divided difference (say, between x0 and x1 ) by
$$f(x_0, x_1) = \frac{f(x_1) - f(x_0)}{x_1 - x_0} = \frac{f_1 - f_0}{x_1 - x_0} = f(x_1, x_0)$$
second divided difference (say, between x0, x1, and x2) by
$$f(x_0, x_1, x_2) = \frac{f(x_1, x_2) - f(x_0, x_1)}{x_2 - x_0}$$
and so on to the n-th divided difference (between x0, x1, . . . , xn)
$$f(x_0, x_1, \ldots, x_n) = \frac{f(x_1, x_2, \ldots, x_n) - f(x_0, x_1, \ldots, x_{n-1})}{x_n - x_0}$$
As an example, we construct a divided difference table from the following data:
x 0 1 3 6 10
f (x) 1 −6 4 169 921
The divided difference table is as follows:
x f (x)
0 1
−7
1 −6 4
5 1
3 4 10 0
55 1
6 169 19
188
10 921
3 Example
From the tabulated function in Section 1 of this Step, we estimate f (2) by Newton’s
divided difference formula and find the corresponding interpolating polynomial.
The same is done for f (4).
Since the third divided difference is constant, we can fit a cubic through the five
points. By Newton’s divided difference formula, using x0 = 0, x1 = 1, x2 = 3,
x3 = 6, the interpolation cubic is
$$P_3(x) = 1 - 7x + 4x(x - 1) + x(x - 1)(x - 3)$$
so that
$$f(2) \approx P_3(2) = 1 - 14 + 8 - 2 = -7$$
† Thisformula is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 169.
1 − 7x + 4x 2 − 4x + x 3 − 4x 2 + 3x = x 3 − 8x + 1
and
f (4) ≈ P3 (4) = −6 + 15 + 30 − 6 = 33
As expected, the interpolating polynomial is the same cubic – namely, x 3 −8x +1.
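The divided difference coefficients and the Newton form can be computed with two short loops. The pseudo-code mentioned in the footnote above is in the Appendix; the following Python is only an illustrative equivalent, applied to the tabulated function of Section 1.

def divided_differences(xs, fs):
    """Return the leading divided differences f(x0), f(x0,x1), ..., f(x0,...,xn)."""
    coeffs = list(fs)
    n = len(xs)
    for order in range(1, n):
        # Work from the bottom up so lower-order entries remain available.
        for i in range(n - 1, order - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - order])
    return coeffs

def newton_eval(xs, coeffs, x):
    """Evaluate Newton's divided difference formula at x (nested form)."""
    result = coeffs[-1]
    for i in range(len(coeffs) - 2, -1, -1):
        result = result * (x - xs[i]) + coeffs[i]
    return result

xs = [0, 1, 3, 6, 10]
fs = [1, -6, 4, 169, 921]
c = divided_differences(xs, fs)      # [1, -7, 4, 1, 0], as in the table above
print(newton_eval(xs, c, 2))         # -7.0, agreeing with f(2) found above
print(newton_eval(xs, c, 4))         # 33.0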
R = (x − x0 )(x − x1 ) · · · (x − xn ) f (x, x0 , x1 , . . . , xn )
As it stands, this expression is not very useful because of the unknown quantity
f (x, x0 , x1 , . . . , xn ). However, it may be shown (for example, see Conte and de
Boor (1980)) that if a = min(x0 , x1 , . . . , xn ), b = max(x0 , x1 , . . . , xn ), and f is
(n + 1)-times differentiable on (a, b), then there exists a ξ ∈ (a, b) such that
$$f(x, x_0, x_1, \ldots, x_n) = \frac{f^{(n+1)}(\xi)}{(n+1)!}$$
It then follows that
$$f(x) - P_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\prod_{j=0}^{n}(x - x_j)$$
This formula may be useful when we know the function giving the data and we
wish to find lower and upper bounds on the error.
As an example, to 6D we have sin 0 = 0, sin(0.2) = 0.198669, and sin(0.4) =
0.389418 (where the arguments to the sine function are in radians). Then we can
form the following divided difference table.
x sin x
0 0
993345
0.2 0.198669 −99000
953745
0.4 0.389418
$$\frac{f'''(\xi)}{3!}(0.1 - 0)(0.1 - 0.2)(0.1 - 0.4) = \frac{f'''(\xi)}{2000}$$
Since f'''(ξ) = −cos ξ with 0 < ξ < 0.4, the error is bounded by
$$0.000461 = \frac{\cos(0.4)}{2000} \le |\sin(0.1) - 0.100325| \le \frac{\cos 0}{2000} = 0.000500$$
The actual error has magnitude 0.000492 which is within these bounds.
5 Aitken’s method
In practice, a procedure due to Aitken is often adopted, in which successively better
interpolation polynomials (corresponding to successively higher order truncation
of Newton’s divided difference formula) are determined systematically. Thus, one
has
$$f(x) \approx f_0 + (x - x_0)\frac{f_1 - f_0}{x_1 - x_0} = \frac{f_0(x_1 - x) - f_1(x_0 - x)}{x_1 - x_0} \equiv I_{0,1}(x)$$
and obviously
f 0 = I0,1 (x0 ), f 1 = I0,1 (x1 )
Next, since f (x0 , x1 , x2 ) = f (x1 , x0 , x2 ) = ( f (x0 , x2 ) − f (x1 , x0 ))/(x2 − x1 ),
one has
$$f(x) \approx f_0 + (x - x_0)f(x_0, x_1) + (x - x_0)(x - x_1)\frac{f(x_0, x_2) - f(x_1, x_0)}{x_2 - x_1} = \frac{I_{0,1}(x)(x_2 - x) - I_{0,2}(x)(x_1 - x)}{x_2 - x_1} \equiv I_{0,1,2}(x)$$
noting that
I0,2 (x) = f 0 + (x − x0 ) f (x0 , x2 )
and so on. In passing, one may note that
At first sight, the procedure may look complicated, but it is systematic, and
therefore computationally straightforward: it may be represented by the scheme
x0    f0                                                    x0 − x
x1    f1    I0,1(x)                                         x1 − x
x2    f2    I0,2(x)    I0,1,2(x)                            x2 − x
x3    f3    I0,3(x)    I0,1,3(x)    I0,1,2,3(x)             x3 − x
..    ..    ..         ..           ..                      ..
One major advantage is that the accuracy may be gauged by comparing suc-
cessive steps. (This of course corresponds to gauging the appropriate truncation
of the Newton divided difference formula.) As in the case of Newton’s div-
ided difference formula, usually the points x0 , x1 , x2 , . . . are ordered such that
x0 − x, x1 − x, x2 − x, . . . form a sequence with increasing magnitude. Finally,
we remark that although the derivation of Aitken’s method emphasizes its rela-
tionship with the Newton formula, it is notable that Aitken’s method ultimately
does not involve divided differences at all!
As an example, we estimate f (2) by Aitken’s method from the tabulated func-
tion given in Section 1 of this Step.
We have x = 2, so that we choose x0 = 1, x1 = 3, x2 = 0, x3 = 6, and
x4 = 10: thus the scheme yields
k xk fk xk − x
0 1 −6 −1
1 3 4 −1 +1
2 0 1 −13 −5 −2
3 6 169 29 −11 −7 +4
4 10 921 97 −15 −7 −7 +8
The computation proceeds from the left, row by row, with an appropriately divided
‘cross multiplication’ of the respective entries with those in the (xk − x) column
on the right: thus,
$$I_{0,1} = \frac{(-6)(+1) - (+4)(-1)}{3 - 1} = -1, \qquad I_{0,2} = \frac{(-6)(-2) - (+1)(-1)}{0 - 1} = -13,$$
$$I_{0,1,2} = \frac{(-1)(-2) - (-13)(+1)}{0 - 3} = -5, \quad \text{etc.}$$
The entry −7 (in the square) appears twice successively along the diagonal, so
one may conclude that f (2) ≈ −7.
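Aitken's scheme is equally easy to program: each new entry is formed by the 'cross multiplication' described above. An illustrative Python sketch reproducing the table for f(2):

def aitken(xs, fs, x):
    """Aitken's interpolation scheme; returns the triangular table of I values.
    The best estimate of f(x) is the last entry of the last row."""
    n = len(xs)
    table = [[f] for f in fs]          # column 0 holds the f_k values
    for k in range(1, n):
        for m in range(1, k + 1):
            num = (table[m - 1][m - 1] * (xs[k] - x)
                   - table[k][m - 1] * (xs[m - 1] - x))
            table[k].append(num / (xs[k] - xs[m - 1]))
    return table

# The example above (x = 2, points ordered by increasing |x_k - x|):
xs = [1, 3, 0, 6, 10]
fs = [-6, 4, 1, 169, 921]
for row in aitken(xs, fs, 2):
    print(row)     # the last row ends ..., -7, -7: so f(2) is about -7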
Checkpoint
EXERCISES
STEP 25
INTERPOLATION 5*
Inverse interpolation*
Rather than the value of a function f (x) for a certain x, one might seek the value of
x corresponding to a given value of f (x); this is called inverse interpolation. For
example, perhaps the reader may have contemplated the possibility of obtaining
roots of f (x) = 0 by inverse interpolation.
x = x j + θ (x j+1 − x j )
where
$$\theta \approx \frac{f(x) - f_j}{f_{j+1} - f_j}$$
in the linear approximation. (Note that if f (x) = 0, we recover the method of
false position – see Step 8).
For example, from a 4D table of f (x) = e−x one has f (0.91) = 0.4025,
f (0.92) = 0.3985 so that f (x) = 0.4 corresponds to
$$x \approx 0.91 + \frac{0.4 - 0.4025}{0.3985 - 0.4025} \times (0.92 - 0.91) = 0.91 + 0.00625 = 0.91625$$
3 Divided differences
Since divided differences are suitable for interpolation with tabular values that
are unequally spaced, they may be used for inverse interpolation. Let us again
consider the function f (x) = sin x for x = 10◦ (10◦ ) 50◦ , and determine x for
which f (x) = 0.2. Ordering with increasing distance from f (x) = 0.2, one has
the divided difference table (entries multiplied by 100):
f (x) x
0.1736 10
5938
0.3420 20 518
5848 1360
0.0000 0 962 1338
6000 1988 3486
0.5000 30 1560 3403
7003 3431
0.6428 40 4188
8117
0.7660 50
Consequently,
x = 10 + (0.2 − 0.1736)59.38
+ (0.2 − 0.1736)(0.2 − 0.3420)5.18
+ (0.2 − 0.1736)(0.2 − 0.3420)(0.2 − 0)13.60
+ (0.2 − 0.1736)(0.2 − 0.3420)(0.2 − 0)(0.2 − 0.5)13.38
+ (0.2 − 0.1736)(0.2 − 0.3420)(0.2 − 0)(0.2 − 0.5)
×(0.2 − 0.6428)34.86
= 10 + 1.5676 − 0.0194 − 0.0102 + 0.0030 − 0.0035
= 11.5375
Alternatively, the Aitken scheme could be used. With either alternative, however,
it is noticeable that any advantage in accuracy compared with iterative inverse
interpolation may not justify the additional computational demand.
Checkpoint
EXERCISES
STEP 26
CURVE FITTING 1
Least squares
Scientists and social scientists often wish to fit a smooth curve to some experi-
mental data. Given (n + 1) points an obvious approach is to use the interpolating
polynomial of degree n, but when n is large this is usually unsatisfactory. Better
results can be obtained by using piecewise polynomials – that is, fitting lower de-
gree polynomials through subsets of the data points. The use of spline functions,
which usually provide a particularly smooth fit, has become widespread (see Step
28).
A rather different but often quite suitable approach is the least squares fit in
which, rather than trying to fit the points exactly, we find a polynomial of low degree
(often first or second) which fits the points closely (after all, the points themselves
are not generally exact, being subject to experimental error).
[Figure 11 (two panels) plots the experimental points against x, together with a polygon (A) joining the points, a fitted straight line (B), a fitted exponential (C), and an ill-fitting curve (D).]
FIGURE 11. Response to a drug.
If the number of data points is less than or equal to the number of parameters
(that is, n ≤ k), it is possible to find values for {c1 , c2 , . . . , ck } which make all
the errors εᵢ zero. If n < k there is an infinite number of solutions for {cᵢ} which
make all the errors zero (and therefore an infinite number of curves with the given
form pass through all the experimental points); in this case the problem is not fully
determined – more information is needed to choose an appropriate curve.
If n > k, which in practice is usually the case, then it is not normally possible
to make all the errors zero by a choice of the {ci }. Three possible procedures are
as follows:
(a) choose a set {cᵢ} which minimizes the total absolute error; that is, minimize
    the sum $\sum_{i=1}^{n} |\varepsilon_i|$;
(b) choose a set {cᵢ} which minimizes the maximum absolute error; that is,
    minimize $\max_{i=1,2,\ldots,n} |\varepsilon_i|$;
(c) choose a set {cᵢ} which minimizes the sum of squares of errors; that is,
    minimize $S = \sum_{i=1}^{n} \varepsilon_i^2$.
Procedures (a) and (b) are generally difficult to apply. Procedure (c) leads to
a linear system of equations to solve for the set {ci }; it is called the principle of
least squares, and is the one customarily used.
$$\frac{\partial S}{\partial c_1} = 0, \quad \frac{\partial S}{\partial c_2} = 0, \quad \ldots, \quad \frac{\partial S}{\partial c_k} = 0$$
5 Example
The following points were obtained in an experiment:
x 1 2 3 4 5 6
y 1 3 4 3 4 2
We shall plot the points on a diagram, and use the method of least squares to fit
(a) a straight line, and (b) a parabola through them.
(a) The plotted points are shown in Figure 12(a). To fit a straight line, we have to
find a function y = c1 + c2 x (that is, a first degree polynomial) which minimizes
$$S = \sum_{i=1}^{6} \varepsilon_i^2 = \sum_{i=1}^{6}\left[y_i - c_1 - c_2 x_i\right]^2$$
Differentiating first with respect to c1 (keeping c2 constant) and then with respect
to c2 (keeping c1 constant), and setting the results equal to zero, gives the normal
equations:
$$\frac{\partial S}{\partial c_1} \equiv -2\sum_{i=1}^{6}(y_i - c_1 - c_2 x_i) = 0$$
$$\frac{\partial S}{\partial c_2} \equiv -2\sum_{i=1}^{6} x_i(y_i - c_1 - c_2 x_i) = 0$$
We may divide both equations by −2, take the summation operations through the
brackets, and rearrange, to obtain:
$$\sum_{i=1}^{6} y_i = 6c_1 + \left(\sum_{i=1}^{6} x_i\right)c_2$$
$$\sum_{i=1}^{6} x_i y_i = \left(\sum_{i=1}^{6} x_i\right)c_1 + \left(\sum_{i=1}^{6} x_i^2\right)c_2$$
It is seen that to proceed to a solution we have to evaluate the four sums Σxᵢ,
Σxᵢ², Σyᵢ, Σxᵢyᵢ, and insert them in the last equations. We can arrange the
work in a table thus (the last three columns are for fitting the parabola and the
required sums are in the last row):
The normal equations for fitting the straight line are therefore
17 = 6c1 + 21c2
63 = 21c1 + 91c2
Solving these gives c1 = 2.13 and c2 = 0.20 (to 2D), so the least squares line is
y = 2.13 + 0.20x
y = c1 + c2 x + c3 x 2
which minimizes
$$S = \sum_{i=1}^{6} \varepsilon_i^2 = \sum_{i=1}^{6}\left[y_i - c_1 - c_2 x_i - c_3 x_i^2\right]^2$$
Taking partial derivatives and proceeding as above we obtain the normal equations
$$\sum_{i=1}^{6} y_i = 6c_1 + \left(\sum_{i=1}^{6} x_i\right)c_2 + \left(\sum_{i=1}^{6} x_i^2\right)c_3$$
$$\sum_{i=1}^{6} x_i y_i = \left(\sum_{i=1}^{6} x_i\right)c_1 + \left(\sum_{i=1}^{6} x_i^2\right)c_2 + \left(\sum_{i=1}^{6} x_i^3\right)c_3$$
$$\sum_{i=1}^{6} x_i^2 y_i = \left(\sum_{i=1}^{6} x_i^2\right)c_1 + \left(\sum_{i=1}^{6} x_i^3\right)c_2 + \left(\sum_{i=1}^{6} x_i^4\right)c_3$$
Inserting the values for the sums (see the above table) we obtain the system of
linear equations:
17 = 6c1 + 21c2 + 91c3
63 = 21c1 + 91c2 + 441c3
269 = 91c1 + 441c2 + 2275c3
The solution to 3D is c1 = −1.200, c2 = 2.700, and c3 = −0.357. The required
parabola is therefore (retaining 2D):
y = −1.20 + 2.70x − 0.36x²
It is also plotted in Figure 12(b). It is clear that the parabola is a better fit than the
straight line!
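Setting up and solving the normal equations takes only a few lines once a linear-system solver is available. The sketch below (illustrative Python, not the book's pseudo-code) reuses the gauss_solve helper sketched in Step 11 and reproduces the fits above.

def least_squares_poly(xs, ys, degree):
    """Fit a polynomial of the given degree by least squares.
    Builds the normal equations and solves them by Gaussian elimination."""
    k = degree + 1
    # Normal equations with basis functions phi_s(x) = x**s.
    A = [[sum(x ** (r + s) for x in xs) for s in range(k)] for r in range(k)]
    b = [sum((x ** r) * y for x, y in zip(xs, ys)) for r in range(k)]
    return gauss_solve(A, b)          # helper from Step 11

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 3, 4, 3, 4, 2]
print(least_squares_poly(xs, ys, 1))  # about [2.13, 0.20]
print(least_squares_poly(xs, ys, 2))  # about [-1.20, 2.70, -0.36]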
Checkpoint
EXERCISES
1. For the example above (with data points shown in Figure 12(a)), compute
the value of S, the sum of squares of errors of points from (a) the fitted line,
and (b) the fitted parabola. Plot the points on graph paper, and fit a straight
line ‘by eye’ (that is, use a ruler to draw a line, guessing its best position).
Determine the value of S for this line and compare with the value for the least
squares line.
2. Fit a straight line by the least squares method to each of the following sets of
data:
(a) toughness x and percentage of nickel y in eight specimens of alloy steel.
toughness x 36 41 42 43 44 45 47 50
% nickel y 2.5 2.7 2.8 2.9 3.0 3.2 3.3 3.5
(b) aptitude test mark x given to six trainee salespeople, and their first-year
sales y in thousands of dollars.
aptitude test x 25 29 33 36 42 54
first-year sales y 42 45 50 48 73 90
For both sets, plot the points and draw the least squares line on a graph. Use
the lines to predict the % nickel of a specimen of steel whose toughness is
38, and the likely first-year sales of a trainee salesperson who obtains a mark
of 48 on the aptitude test.
3. Deduce the matrix form of the normal equations for fitting a fourth-degree
polynomial.
4. Fit a parabola by the least squares method to the points (0, 0), (1, 1), (2, 3),
(3, 3), and (4, 2). Find the value of S for this fit.
5. Find the normal equations that arise from fitting, by the least squares method,
an equation of the form y = c1 + c2 sin x to the set of points (0, 0), (π/6, 1),
(π/2, 3), and (5π/6, 2). Solve them for c1 and c2 .
STEP 27
CURVE FITTING 2*
Least squares and linear equations*
1 Pseudo-inverse
Recall that we have data points (x1 , y1 ), . . . , (xn , yn ) and that we wish to find the
parameters c1 , . . . , ck for the basis functions φ1 , . . . , φk such that
S = Σ_{i=1}^{n} |ε_i|²
is minimized, where ε_i = c_1 φ_1(x_i) + c_2 φ_2(x_i) + · · · + c_k φ_k(x_i) − y_i. If we introduce the n × k matrix A with entries a_{ij} = φ_j(x_i), and write c = (c_1, . . . , c_k)^T and y = (y_1, . . . , y_n)^T, the conditions ε_i = 0 form the (generally overdetermined) linear system Ac = y. As remarked in the
previous Step, it is generally not possible to find a solution to this system, but we
can find the c1 , . . . , ck such that Ac is ‘close’ to y (in the least squares sense). If
there is a solution c* for the least squares problem, then we write
c* = A⁺y
where A⁺ is known as the pseudo-inverse of A.
2 Normal equations
To minimize S, we need to minimize
Σ_{i=1}^{n} [c_1 φ_1(x_i) + c_2 φ_2(x_i) + · · · + c_k φ_k(x_i) − y_i]²
Setting the partial derivative with respect to c_j equal to zero for j = 1, 2, . . . , k gives
Σ_{i=1}^{n} [c_1 φ_1(x_i) + c_2 φ_2(x_i) + · · · + c_k φ_k(x_i) − y_i] φ_j(x_i) = 0
or equivalently
Σ_{i=1}^{n} [c_1 φ_1(x_i) + c_2 φ_2(x_i) + · · · + c_k φ_k(x_i)] φ_j(x_i) = Σ_{i=1}^{n} φ_j(x_i) y_i
If M is a matrix (or vector) with (i, j)-th element m i j , then recall from linear
algebra that its transpose, denoted by M T , is the matrix with (i, j)-th element
m ji – that is, M T is obtained from M by swapping the rows and columns. For
example, if
1 2 3
M = 4 5 6
7 8 9
then
1 4 7
MT = 2 5 8
3 6 9
It is evident that the normal equations may be written as
AT Ac = AT y
and, in the case of fitting a parabola (so that φ_1(x) = 1, φ_2(x) = x, φ_3(x) = x²),
A^T = [ 1     1     · · ·   1
        x_1   x_2   · · ·   x_n
        x_1²  x_2²  · · ·   x_n² ]
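A minimal Python sketch (an illustration only, assuming NumPy) of forming A and solving the normal equations A^T Ac = A^T y for the parabola-fitting example of the previous Step:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1, 3, 4, 3, 4, 2], dtype=float)

A = np.column_stack([np.ones_like(x), x, x**2])  # a_ij = phi_j(x_i)
c = np.linalg.solve(A.T @ A, A.T @ y)            # solve the normal equations
print(np.round(c, 3))                            # about [-1.2, 2.7, -0.357]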
3 QR factorization
If the matrix AT A has an inverse, then in principle the normal equations can be
solved to find the least squares solution. However, as we remarked in the previous
Step, the normal equations may be ill-conditioned. If so, then an alternative is to factorize A as A = QR, where Q is an orthogonal matrix and R is upper triangular. If the first k rows of R form the upper triangular matrix R_1 (the remaining rows being zero), and q_1 consists of the first k components of Q^T y, the least squares solution satisfies
R_1 c = q_1
Thus once we have a QR factorization of A, we can find the least squares approxi-
mation by solving an upper triangular system using back-substitution.
For example, suppose we wish to fit a parabola to the experimental data pre-
sented on page 117. The relevant matrix A with its QR factorization is given on
the previous page, and we see that
R_1 = [ −2.44949  −8.57321  −37.15059
         0.00000   4.18330   29.28310
         0.00000   0.00000    6.11010 ]
Calculation of Q^T y yields
q_1 = [ −6.94022, 0.83666, −2.18218 ]^T   and   q_2 = [ −0.92889, 0.70246, 0.12305 ]^T
where
(S^(ℓ))² = Σ_{j=ℓ}^{n} [a_{j,ℓ}^(ℓ−1)]²
Since S^(ℓ) can be either positive or negative, it is best to choose the sign to be opposite to a_{ℓ,ℓ}^(ℓ−1) – that is, take S^(ℓ) to be positive if a_{ℓ,ℓ}^(ℓ−1) is negative and vice-versa. This choice maximizes w_ℓ^(ℓ) and minimizes the round-off error. The remaining elements of w^(ℓ) are given by
w_j^(ℓ) = a_{j,ℓ}^(ℓ−1) / (−2 S^(ℓ) w_ℓ^(ℓ)),   j = ℓ + 1, ℓ + 2, . . . , n
Finally, the matrix QT is given by the product H(k) H(k−1) · · · H(1) , from which
Q follows by taking the transpose. However, as pointed out earlier, it is not
necessary to find Q explicitly in order to obtain the least squares solution. Instead,
we set y(0) = y, and when A(`) = H(`) A(`−1) is being calculated, we may also
calculate
y(`) = H(`) y(`−1) , ` = 1, 2, . . . , k
The end result is a transformation of the original system Ac = y into the system
A(k) c = y(k) , that is, the system Rc = QT y. Furthermore, the calculations may
be carried out without the need to explicitly form the Householder matrices. Once
we have w^(ℓ), then
A^(ℓ) = H^(ℓ) A^(ℓ−1) = [I − 2w^(ℓ)(w^(ℓ))^T] A^(ℓ−1) = A^(ℓ−1) − 2w^(ℓ)[(w^(ℓ))^T A^(ℓ−1)]
so that only matrix-vector products are required.
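For illustration, a Python sketch (assuming NumPy, whose qr routine is Householder-based) of the QR route to the same least squares fit; it should reproduce the parabola coefficients found in Step 26:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1, 3, 4, 3, 4, 2], dtype=float)
A = np.column_stack([np.ones_like(x), x, x**2])

Q1, R1 = np.linalg.qr(A, mode='reduced')   # reduced QR factorization of A
c = np.linalg.solve(R1, Q1.T @ y)          # solve the upper triangular system R1 c = Q1^T y
print(np.round(c, 3))                      # about [-1.2, 2.7, -0.357]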
Checkpoint
EXERCISES
1. Given the data points (0, 0), (π/6, 1), (π/2, 3), and (5π/6, 2), take φ1 (x) =
1 and φ2 (x) = sin x. Write down the matrix A with ai j = φ j (xi ) and obtain
the normal equations. Verify that these normal equations are the same as
those obtained in Exercise 5 of the previous Step.
2. For the A of Exercise 1, find H(1) (take A(0) = A) and hence calculate
A(1) = H(1) A. Verify that the second, third, and fourth components in the
first column of this last matrix are all zero.
STEP 28
CURVE FITTING 3*
Splines*
Suppose we want to fit a smooth curve which actually goes through n + 1 given
data points, where n is quite large. Since an interpolating polynomial of corre-
spondingly high degree n tends to be highly oscillatory and therefore gives an
unsatisfactory fit, at least in some places, the interpolation is often constructed by
linking lower degree polynomials (piecewise polynomials) at some or all of the
given data points (called knots or nodes). This interpolation is smooth if we also
insist that the piecewise polynomials have matching derivatives at these knots, and
this smoothness is enhanced by matching higher order derivatives.
Suppose the data points (x0 , f 0 ), (x1 , f 1 ), . . . , (xn , f n ), are ordered so that
x0 < x1 < · · · < xn−1 < xn
Here we seek a function S which is a polynomial of degree d on each subinterval
[x j−1 , x j ], j = 1, 2, . . . , n, and for which
S(x j ) = f j , j = 0, 1, 2, . . . , n
For maximum smoothness in this case, where the knots are all the given data
points, it turns out that we can allow S to have up to d − 1 continuous derivatives.
Such a function is known as a spline. An example of a spline for the linear
(d = 1) case is the polygon (Curve A) of Figure 11(b) on page 115. It is clear
that this spline is continuous, but it does not have a continuous first derivative.
The most popular in practice are cubic splines, constructed from polynomials of
degree three with continuous first and second derivatives at the data points, and
discussed further below. An example of a cubic spline S for n = 5 is displayed in
Figure 13. (The data points are taken from the table on page 117.) We see that S
goes through all the data points. Each function Sj on the subinterval [x j−1 , x j ] is
a cubic. As has already been indicated, the first and second derivatives of Sj and
Sj+1 match at (x j , f j ), the point where they meet.
The term ‘spline’ refers to the thin flexible rods that in the past were used by draughtsmen to draw smooth curves. The graph of a cubic spline approximates
the shape that arises when such a rod is forced to pass through the given n + 1
data points, and corresponds to minimum ‘strain energy’.
FIGURE 13. Schematic example of a cubic spline over subintervals [x0 , x1 ], [x1 , x2 ],
[x2 , x3 ], [x3 , x4 ], and [x4 , x5 ]; each function Sj on [x j−1 , x j ] is a cubic.
At each interior knot the neighbouring cubics must therefore satisfy
S_j(x_j) = S_{j+1}(x_j)
S_j′(x_j) = S_{j+1}′(x_j)
S_j″(x_j) = S_{j+1}″(x_j)
for j = 1, 2, . . . , n − 1.
Since we have a cubic with four unknowns (a j , b j , c j , d j ) on each of the n
subintervals, and so a total of 4n unknowns, we need 4n equations to specify
them. The requirement S(x j ) = f j , j = 0, 1, 2, . . . , n, yields n + 1 equations,
while 3(n − 1) equations arise from the continuity requirement on S and its first
two derivatives given above. This yields a total of n + 1 + 3(n − 1) = 4n − 2
equations, so we need to impose two more conditions to specify S completely. The
choice of these two extra conditions determines the type of cubic spline obtained.
Two common choices are the following:
(a) natural cubic spline: S 00 (x0 ) = S 00 (xn ) = 0;
(b) clamped cubic spline: S′(x_0) = β_0, S′(x_n) = β_n for some given constants β_0 and β_n. If the values of f′(x_0) and f′(x_n) are known, then β_0 and β_n can be set to these values.
We shall not go into the algebraic details here; but it turns out that if we write
h_j = x_j − x_{j−1} and m_j = S″(x_j), then the coefficients of S_j for j = 1, 2, . . . , n are given by
a_j = f_j
b_j = (f_j − f_{j−1})/h_j + h_j(2m_j + m_{j−1})/6
c_j = m_j/2
d_j = (m_j − m_{j−1})/(6h_j)
The spline is thus determined by the values of {m j }nj=0 , which depend on whether
we have a natural or a clamped cubic spline.
For a natural cubic spline we have m_0 = m_n = 0, and the equations
h_j m_{j−1} + 2(h_j + h_{j+1}) m_j + h_{j+1} m_{j+1} = 6[(f_{j+1} − f_j)/h_{j+1} − (f_j − f_{j−1})/h_j]
for j = 1, 2, . . . , n − 1. (We remark that if all the values of h_j are the same, then the right-hand side of this last equation is just 6Δ²f_{j−1}/h_j.) Setting α_j = 2(h_j + h_{j+1}), these linear equations can be written as the (n − 1) × (n − 1) system
[ α_1   h_2   0     · · ·  0        0        0       ] [ m_1     ]
[ h_2   α_2   h_3   · · ·  0        0        0       ] [ m_2     ]
[ 0     h_3   α_3   · · ·  0        0        0       ] [ m_3     ]
[  :     :     :     · ·    :        :        :      ] [  :      ]  =  b
[ 0     0     0     · · ·  α_{n−3}  h_{n−2}  0       ] [ m_{n−3} ]
[ 0     0     0     · · ·  h_{n−2}  α_{n−2}  h_{n−1} ] [ m_{n−2} ]
[ 0     0     0     · · ·  0        h_{n−1}  α_{n−1} ] [ m_{n−1} ]
where
b = 6 [ (f_2 − f_1)/h_2 − (f_1 − f_0)/h_1,
        (f_3 − f_2)/h_3 − (f_2 − f_1)/h_2,
        . . . ,
        (f_{n−1} − f_{n−2})/h_{n−1} − (f_{n−2} − f_{n−3})/h_{n−2},
        (f_n − f_{n−1})/h_n − (f_{n−1} − f_{n−2})/h_{n−1} ]^T
It is notable that the coefficient matrix has nonzero entries only on the leading
diagonal and the two subdiagonals either side of it. Such a system is called a
tridiagonal system. Because most of the entries below the leading diagonal are
already zero, it is possible to modify Gaussian elimination (see Step 11) to produce
a very efficient method for solving tridiagonal systems.
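The following Python sketch (an illustration, not the book's pseudo-code) assembles and solves this tridiagonal system for the second derivatives m_1, . . . , m_{n−1} of a natural cubic spline; applied to the data of the example in the next section it reproduces the values found there.

def natural_spline_second_derivatives(xs, fs):
    n = len(xs) - 1
    h = [xs[j] - xs[j - 1] for j in range(1, n + 1)]           # h_j = x_j - x_{j-1}
    alpha = [2.0 * (h[j - 1] + h[j]) for j in range(1, n)]     # diagonal entries
    b = [6.0 * ((fs[j + 1] - fs[j]) / h[j] - (fs[j] - fs[j - 1]) / h[j - 1])
         for j in range(1, n)]                                  # right-hand side
    sub = [h[j] for j in range(1, n - 1)]                       # off-diagonal entries h_2,...,h_{n-1}
    # forward elimination on the tridiagonal system
    for j in range(1, n - 1):
        factor = sub[j - 1] / alpha[j - 1]
        alpha[j] -= factor * sub[j - 1]
        b[j] -= factor * b[j - 1]
    # back-substitution; m_0 = m_n = 0 for a natural spline
    m = [0.0] * (n + 1)
    for j in range(n - 2, -1, -1):
        upper = sub[j] * m[j + 2] if j < n - 2 else 0.0
        m[j + 1] = (b[j] - upper) / alpha[j]
    return m

print(natural_spline_second_derivatives([1, 2, 3, 4, 5, 6], [1, 3, 4, 3, 4, 2]))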
2 Examples
We fit a natural cubic spline to the following data from Step 26 on page 117:
j 0 1 2 3 4 5
xj 1 2 3 4 5 6
fj 1 3 4 3 4 2
Since the values of the x j are equally spaced, we have h j = 1 for j = 1, 2, . . . , 5.
Also, m 0 = m 5 = 0 and the remaining values m 1 , m 2 , m 3 , and m 4 satisfy the
linear system
[ 4 1 0 0 ] [ m_1 ]   [  −6 ]
[ 1 4 1 0 ] [ m_2 ] = [ −12 ]
[ 0 1 4 1 ] [ m_3 ]   [  12 ]
[ 0 0 1 4 ] [ m_4 ]   [ −18 ]
Using Gaussian elimination to solve this system, to 5D we obtain
m_1 = −0.43062,  m_2 = −4.27751,  m_3 = 5.54067,  m_4 = −5.88517
Calculating the coefficients, we then find that the natural spline S is given by
S(x) = S_1(x), 1 ≤ x < 2
       S_2(x), 2 ≤ x < 3
       S_3(x), 3 ≤ x < 4
       S_4(x), 4 ≤ x < 5
       S_5(x), 5 ≤ x ≤ 6
where
S_1(x) = 3 + 1.85646(x − 2) − 0.21531(x − 2)² − 0.07177(x − 2)³
S_2(x) = 4 − 0.49761(x − 3) − 2.13876(x − 3)² − 0.64115(x − 3)³
S_3(x) = 3 + 0.13397(x − 4) + 2.77033(x − 4)² + 1.63636(x − 4)³
S_4(x) = 4 − 0.03828(x − 5) − 2.94258(x − 5)² − 1.90431(x − 5)³
S_5(x) = 2 − 2.98086(x − 6) + 0.98086(x − 6)³
As a second example, a natural cubic spline may be fitted to values of the function f(x) = 10/(1 + x²); three of its pieces are
S_4(x) = 5 − 5.65385(x − 1) + 3.69231(x − 1)² + 4.34615(x − 1)³
S_5(x) = 2 − 1.38462(x − 2) + 0.57692(x − 2)² − 1.03846(x − 2)³
S_6(x) = 1 − 0.80769(x − 3) − 0.19231(x − 3)³
The spline is also plotted in Figure 14 (using a dotted line). It is clear that it is a
much better approximation to f than the interpolating polynomial.
FIGURE 14. The function f (x) = 10/(1 + x 2 ) (solid line) approximated by an interpo-
lating polynomial (dashed line) and a natural cubic spline (dotted line).
Checkpoint
EXERCISE
Given the data points (0, 1), (1, 4), (2, 15), and (3, 40), find the natural cubic
spline fitting this data. Use the spline to estimate the value of y when x = 2.3.
STEP 29
NUMERICAL DIFFERENTIATION
Finite differences
1 Procedure
Formulae for numerical differentiation may easily be obtained by differentiating
the interpolation polynomial. The essential idea is that the derivatives f′, f″, . . . of a function f are represented by the derivatives P_n′, P_n″, . . . of the interpolating polynomial P_n. For example, differentiating the Newton forward difference
formula (see page 96)
f(x) = f(x_j + θh) = [1 + θΔ + (θ(θ − 1)/2!)Δ² + (θ(θ − 1)(θ − 2)/3!)Δ³ + · · ·] f_j
with respect to x gives formally (since x = x_j + θh, df/dx = (df/dθ)(dθ/dx), etc.)
f′(x) = (1/h) df/dθ = (1/h)[Δ + (θ − ½)Δ² + ((3θ² − 6θ + 2)/6)Δ³ + · · ·] f_j
f″(x) = (1/h²) d²f/dθ² = (1/h²)[Δ² + (θ − 1)Δ³ + · · ·] f_j,  etc.
In particular, if we set θ = 0 we have formulae for derivatives at the tabular points {x_j}:
f′(x_j) = (1/h)[Δ − ½Δ² + ⅓Δ³ − · · ·] f_j
f″(x_j) = (1/h²)[Δ² − Δ³ + (11/12)Δ⁴ − · · ·] f_j,  etc.
while setting θ = ½ in the formula for the first derivative gives
f′(x_j + ½h) = (1/h)[Δ − (1/24)Δ³ + · · ·] f_j
Similarly, if we set θ = 1 in the formula for the second derivative, we have the result (without third differences)
f″(x_{j+1}) = (1/h²)[Δ² − (1/12)Δ⁴ + · · ·] f_j
a formula for the second derivative at the next point.
Note that if only one term is retained, we recover the well-known formulae
f′(x_j) ≈ [f(x_j + h) − f(x_j)]/h
f″(x_j) ≈ [f(x_j + 2h) − 2f(x_j + h) + f(x_j)]/h²
f′(x_j + ½h) ≈ [f(x_j + h) − f(x_j)]/h
f″(x_{j+1}) ≈ [f(x_j + 2h) − 2f(x_j + h) + f(x_j)]/h²
It should also be noted that the formulae all involve dividing a combination
of differences (which are prone to loss of significance or cancellation errors,
especially if h is small), by a positive power of h. Consequently if we want to
keep round-off errors down, we should use a large value of h. On the other hand,
it can be shown (see Exercise 3 at the end of this Step) that the truncation error is
approximately proportional to h p , where p is a positive integer, so that h must be
sufficiently small for the truncation error to be tolerable. We are in a ‘cleft stick’
and must compromise with some optimum choice of h.
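A small numerical experiment (our illustration in Python, with function values rounded to 5D to mimic a table) shows this compromise for the forward-difference estimate of f′(0.1) with f(x) = e^x: the error first falls as h decreases, then grows again once round-off dominates.

import math

def forward_difference(f, x, h, decimals=5):
    # round the tabular values, as in a 5D table, to expose round-off error
    fx = round(f(x), decimals)
    fxh = round(f(x + h), decimals)
    return (fxh - fx) / h

true = math.exp(0.1)   # f'(0.1) = e^0.1 for f(x) = e^x
for h in [0.2, 0.1, 0.05, 0.01, 0.001, 0.0001]:
    est = forward_difference(math.exp, 0.1, h)
    print(f"h = {h:<8} estimate = {est:.5f} error = {est - true:+.5f}")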
In brief, large errors may occur in numerical differentiation based on direct
polynomial approximation, so that an error check is always advisable. There are
alternative methods based on polynomials which use more sophisticated proce-
dures such as least-squares or mini-max, and other alternatives involving other
basis functions (for example, trigonometric functions). However, the best policy
is probably to use numerical differentiation only when it cannot be avoided!
3 Example
We estimate f 0 (0.1) and f 00 (0.1) for f (x) = e x using the data in Step 20 (page
90).
If we use the formulae from page 135 (with θ = 0) we obtain (ignoring fourth and higher differences):
f′(0.1) ≈ (1/0.05)[0.05666 − ½(0.00291) + ⅓(0.00015)]
        = 20(0.05666 − 0.00145(5) + 0.00005)
        = 1.1051
f″(0.1) ≈ 400(0.00291 − 0.00015)
        = 1.104
Since f″(0.1) = f′(0.1) = f(0.1) = 1.10517, it is obvious that the second result is much less accurate (because of round-off errors).
Checkpoint
EXERCISES
1. Derive formulae involving backward differences for the first and second
derivatives of a function.
2. The function f(x) = √x is tabulated for x = 1.00 (0.05) 1.30 to five decimal places:
x f (x)
1.00 1.00000
1.05 1.02470
1.10 1.04881
1.15 1.07238
1.20 1.09545
1.25 1.11803
1.30 1.14018
(a) Estimate f 0 (1.00) and f 00 (1.00) using Newton’s forward difference
formula.
(b) Estimate f 0 (1.30) and f 00 (1.30) using Newton’s backward difference
formula.
3. Use the Taylor series to find the truncation errors in the following formulae.
(a) f′(x_j) ≈ [f(x_j + h) − f(x_j)]/h.
(b) f′(x_j + ½h) ≈ [f(x_j + h) − f(x_j)]/h.
(c) f″(x_j) ≈ [f(x_j + 2h) − 2f(x_j + h) + f(x_j)]/h².
(d) f″(x_j + h) ≈ [f(x_j + 2h) − 2f(x_j + h) + f(x_j)]/h².
STEP 30
NUMERICAL INTEGRATION 1
The trapezoidal rule
which will be a good approximation if n is chosen so that the error ( f (x) − Pn (x))
in each tabular subinterval x j+k−1 ≤ x ≤ x j+k (k = 1, 2, . . . , n) is sufficiently
small. It is notable that (for n > 1) the error is often alternately positive and
negative in successive subintervals, and considerable cancellation of error occurs;
in contrast with numerical differentiation, quadrature is inherently accurate! It is
usually sufficient to use a rather low degree polynomial approximation over any
subinterval x j ≤ x ≤ x j+n .
where x_j = a + jh, j = 0, 1, 2, . . . , N, and over each subinterval the integral is approximated by the area of a trapezium:
∫_{x_j}^{x_{j+1}} f(x) dx ≈ (h/2)(f_j + f_{j+1})
Adding these contributions gives the (composite) trapezoidal rule
∫_a^b f(x) dx ≈ h[½f_0 + f_1 + f_2 + · · · + f_{N−1} + ½f_N]
[Figure: the area under the curve y = f(x) from x_0 = a to x_N = b, divided into N strips of width h.]
2 Accuracy
The trapezoidal rule corresponds to a rather crude polynomial approximation (a straight line) between successive points x_j and x_{j+1} = x_j + h, and hence can only be expected to be accurate if h is sufficiently small.
† Thisrule is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 170.
Since
f_{j+1} = f(x_j + h) = f_j + h f′(x_j) + (h²/2!) f″(x_j) + · · ·
one has the trapezoidal form
∫_{x_j}^{x_{j+1}} f(x) dx ≈ (h/2)(f_j + f_{j+1}) = h[f_j + (h/2) f′(x_j) + (h²/4) f″(x_j) + · · ·]
On the other hand, integrating the Taylor expansion
f(x) = f_j + (x − x_j) f′(x_j) + ((x − x_j)²/2!) f″(x_j) + · · ·
we get the exact form
∫_{x_j}^{x_{j+1}} f(x) dx = h[f_j + (h/2) f′(x_j) + (h²/6) f″(x_j) + · · ·]
The truncation error per strip is therefore
h(1/6 − 1/4) h² f″(x_j) + · · · = −(h³/12) f″(x_j) + · · ·
(The concept of truncation error was introduced on page 8.) If we ignore higher-
order terms, an approximate bound on this error in using the trapezoidal rule (over
N subintervals) is therefore
(N h³/12) max_{a≤x≤b} |f″(x)| = ((b − a)h²/12) max_{a≤x≤b} |f″(x)|
3 Example
The integral
∫_{0.1}^{0.3} e^x dx
is estimated using the trapezoidal rule and the data in Step 20 (page 90).
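For illustration, a short Python sketch (not the book's pseudo-code) of this estimate, using four strips of width h = 0.05 as in the table of Step 20:

import math

def trapezoidal(f, a, b, n):
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return h * total

estimate = trapezoidal(math.exp, 0.1, 0.3, 4)   # four strips of width 0.05
exact = math.exp(0.3) - math.exp(0.1)
print(round(estimate, 5), round(exact, 5))      # about 0.24474 and 0.24469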
Checkpoint
EXERCISES
STEP 31
NUMERICAL INTEGRATION 2
Simpson’s rule
1 Simpson’s rule
Simpson’s rule corresponds to quadratic approximation; thus, for x j ≤ x ≤
x j + 2h,
∫_{x_j}^{x_j+2h} f(x) dx = h ∫_0^2 f(x_j + θh) dθ
                        ≈ h ∫_0^2 [1 + θΔ + (θ(θ − 1)/2!)Δ²] f_j dθ
                        = h [θ + (θ²/2)Δ + (θ³/6 − θ²/4)Δ²]₀² f_j
                        = h [2f_j + 2(f_{j+1} − f_j) + ⅓(f_{j+2} − 2f_{j+1} + f_j)]
                        = (h/3)(f_j + 4f_{j+1} + f_{j+2})
A parabolic arc is fitted to the curve y = f (x) at the three tabular points x j , x j + h,
and x j + 2h. Consequently, if N = (b − a)/ h is even, one obtains Simpson’s rule:
∫_a^b f(x) dx = ∫_{x_0}^{x_2} f(x) dx + ∫_{x_2}^{x_4} f(x) dx + · · · + ∫_{x_{N−2}}^{x_N} f(x) dx
             ≈ (h/3)[f_0 + 4f_1 + 2f_2 + 4f_3 + · · · + 4f_{N−1} + f_N]
where
f j = f (x j ) = f (a + j h), j = 0, 1, 2, . . . , N
Integration by Simpson’s rule involves computing a finite sum of values given
by the integrand f , as does the trapezoidal rule. Simpson’s rule is also effective
for implementation on a computer, and one direct application in hand calculation
usually gives sufficient accuracy.
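A Python sketch of the composite rule (an illustration only; it is not the book's pseudo-code):

def simpson(f, a, b, n):
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return h * total / 3

# e.g. estimate the integral of sqrt(x) from 1.0 to 1.3 with N = 6 (h = 0.05)
print(round(simpson(lambda x: x ** 0.5, 1.0, 1.3, 6), 6))   # about 0.321485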
2 Accuracy
For a known integrand f , we emphasize that it is quite appropriate to program
increased interval subdivision to provide the desired accuracy, but for hand calcu-
lation an error bound is again useful.
Suppose that in x_j ≤ x ≤ x_j + 2h the function f(x) has the Taylor expansion
f(x) = f_{j+1} + (x − x_{j+1}) f′(x_{j+1}) + ((x − x_{j+1})²/2!) f″(x_{j+1}) + · · ·
then
∫_{x_j}^{x_j+2h} f(x) dx = 2h[f_{j+1} + ⅓(h²/2!) f″(x_{j+1}) + ⅕(h⁴/4!) f⁽⁴⁾(x_{j+1}) + · · ·]
while expanding the Simpson approximation (h/3)(f_j + 4f_{j+1} + f_{j+2}) in the same way gives
2h[f_{j+1} + ⅓(h²/2!) f″(x_{j+1}) + ⅓(h⁴/4!) f⁽⁴⁾(x_{j+1}) + · · ·]
so the truncation error over the double strip is
2h(⅕ − ⅓)(h⁴/4!) f⁽⁴⁾(x_{j+1}) + · · · = −(h⁵/90) f⁽⁴⁾(x_{j+1}) + · · ·
Ignoring higher-order terms, we conclude that the approximate bound on this error
in estimating
Z b
f (x) dx
a
by Simpson’s rule (with N/2 subintervals of width 2h) is
(N/2)(h⁵/90) max_{a≤x≤b} |f⁽⁴⁾(x)| = ((b − a)h⁴/180) max_{a≤x≤b} |f⁽⁴⁾(x)|
It is notable that the error bound is proportional to h 4 , compared with h 2 for the
cruder trapezoidal rule. In passing, one may note that Simpson’s rule is exact for
a cubic.
3 Example
We estimate the integral
∫_{1.0}^{1.3} √x dx
by using Simpson’s rule and the data in Exercise 2 of Step 29 on page 138.
There will be an even number of intervals if we choose h = 0.15 or h = 0.05.
If we use S(h) to denote the approximation with strip width h, we obtain
S(0.15) = (0.15/3)[1 + 4(1.07238) + 1.14018] = 0.32148(5)
and
S(0.05) = (0.05/3)[1 + 4(1.02470 + 1.07238 + 1.11803) + 2(1.04881 + 1.09545) + 1.14018] = 0.32148(6)
Since f⁽⁴⁾(x) = −(15/16)x^{−7/2}, an approximate bound on the truncation error is
(0.30/180)(15/16)h⁴ = 0.0015625h⁴
whence 0.0000008 for h = 0.15 and 0.00000001 for h = 0.05. Note that the
truncation error is negligible; within round-off error, the estimate is 0.32148(6).
Checkpoint
EXERCISES
1. Estimate
∫_0^1 1/(1 + x) dx
to 4D, using numerical integration.
2. Use Simpson’s rule with N = 2 to obtain an approximation to
∫_0^{π/4} x cos x dx
Compute the resulting error, given that the true value of the integral is 0.26247 (5D).
STEP 32
NUMERICAL INTEGRATION 3
Gaussian integration formulae
A change of variable also renders this last form applicable to any interval; we
make the substitution
u = ½[(b − a)x + (b + a)]
† Thisformula is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 171.
If we write
φ(u) = φ(½[(b − a)x + (b + a)]) ≡ g(x)
then
∫_a^b φ(u) du = ((b − a)/2) ∫_{−1}^{1} g(x) dx
since du = ½(b − a) dx, and u = a when x = −1, u = b when x = 1.
It is important to note that the Gauss two-point formula is exact for cubic
polynomials, and hence may be compared in accuracy with Simpson’s rule. (In
fact, the error for the Gauss formula is about 2/3 that for Simpson’s rule.) Since
one fewer function value is required for the Gauss formula, it may be preferred
provided the function evaluations at the irrational abscissae values are available.
A more accurate three-point formula is
∫_{−1}^{1} f(x) dx ≈ (1/9)[5f(−√(3/5)) + 8f(0) + 5f(√(3/5))]
for which the magnitude of the truncation error is at most
(1/15 750) max_{−1≤x≤1} |f⁽⁶⁾(x)|
This and the previous two-point formula represent the lowest order in a series of formulae commonly referred to as Gauss-Legendre, because of their association
formulae commonly referred to as Gauss-Legendre, because of their association
with Legendre polynomials.
There are yet other formulae associated with other orthogonal polynomials
(Laguerre, Hermite, Chebyshev, etc.); the general form of Gaussian integration
may be represented by the formula
∫_a^b W(x) f(x) dx ≈ Σ_{i=1}^{n} w_i f(x_i)
where W (x) is the weight function in the integral, {x1 , x2 , . . . , xn } is the set of
points in the integration range a ≤ x ≤ b, and the weights wi in the summation
are again constants.
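As an illustration, a Python sketch of the Gauss two-point formula applied on a general interval via the change of variable described above:

import math

def gauss_two_point(phi, a, b):
    half = 0.5 * (b - a)
    mid = 0.5 * (b + a)
    r = 1.0 / math.sqrt(3.0)            # abscissae +-1/sqrt(3) on [-1, 1]
    return half * (phi(mid - half * r) + phi(mid + half * r))

# e.g. the exercise below: integral of 1/(1+u) from 0 to 1 (true value ln 2 = 0.693147)
print(round(gauss_two_point(lambda u: 1.0 / (1.0 + u), 0.0, 1.0), 7))  # about 0.6923077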
EXERCISE
Apply the Gauss two-point and four-point formulae to evaluate ∫_0^1 1/(1 + u) du.
STEP 33
ORDINARY DIFFERENTIAL EQUATIONS 1
Single-step methods
1 Taylor series
We already have one technique available for this problem; we can estimate y(x1 )
by a Taylor series of order p (the particular value of p will depend on the size of
h and the accuracy required):
y(x_1) ≈ y_1 = y(x_0) + h y′(x_0) + (h²/2!) y″(x_0) + · · · + (h^p/p!) y^(p)(x_0)
Retaining only the first-order term and repeating the process at each step gives the simple formula
yn+1 = yn + h f (xn , yn ), n = 0, 1, 2, . . . , N − 1
This is known as Euler’s method. However, unless the step length h is very small,
the truncation error will be large and the results inaccurate.
2 Runge-Kutta methods
A popular way of avoiding the differentiation of f (x, y) without sacrificing ac-
curacy involves estimating yn+1 from yn and a weighted average of values of
f (x, y), chosen so that the truncation error is comparable to that of a Taylor series
of order p. The details of the derivation lie beyond the scope of this book, but we
can quote two of the simpler Runge-Kutta methods† .
The first has the same order of accuracy as the Taylor series with p = 2 and is
usually written as three steps:
k1 = h f (xn , yn )
k2 = h f (xn + h, yn + k1 )
y_{n+1} = y_n + ½(k_1 + k_2)
The second has the same order of accuracy as the Taylor series with p = 4:
k_1 = h f(x_n, y_n)
k_2 = h f(x_n + h/2, y_n + k_1/2)
k_3 = h f(x_n + h/2, y_n + k_2/2)
k_4 = h f(x_n + h, y_n + k_3)
y_{n+1} = y_n + (1/6)(k_1 + 2k_2 + 2k_3 + k_4)
† These methods are suitable for implementation on a computer. Pseudo-code for study and use
in programming may be found on page 172.
Neither method involves evaluating derivatives of f (x, y); instead f (x, y) itself
is evaluated several times (twice in the second-order method, four times in the
fourth-order method).
3 Example
It is instructive to compare some of the methods given above on a very simple
problem. For example, let us estimate y(0.5) given that
y′ = x + y,  y(0) = 1
taking h = 0.1 throughout.
(i) Euler’s method: here y_{n+1} = y_n + 0.1(x_n + y_n) = 0.1x_n + 1.1y_n, so
y1 = 0.1(0) + 1.1(1) = 1.1
y2 = 0.1(0.1) + 1.1(1.1) = 1.22
y3 = 0.1(0.2) + 1.1(1.22) = 1.362
y4 = 0.1(0.3) + 1.1(1.362) = 1.5282
and y5 = 0.1(0.4) + 1.1(1.5282) = 1.72102
which is not even accurate to 1D (the error is approximately 0.08).
(ii) Taylor series (fourth order): Since y′ = x + y, we have y″ = 1 + y′ = 1 + x + y, and likewise y‴ and y⁽⁴⁾ both equal 1 + x + y, so that
y_{n+1} = y_n + 0.1(x_n + y_n) + (0.1²/2!)(1 + x_n + y_n) + (0.1³/3!)(1 + x_n + y_n) + (0.1⁴/4!)(1 + x_n + y_n)
        ≈ 0.00517 + 0.10517x_n + 1.10517y_n
Starting from y_0 = 1, this recurrence gives y_5 = 1.79743, which is accurate to 4D.
(iii) Second-order Runge-Kutta method: with k_1 = 0.1(x_n + y_n) and k_2 = 0.1(x_n + 0.1 + y_n + k_1), we obtain
n = 0 :  k_1 = 0.1(0 + 1) = 0.1
         k_2 = 0.1(0.1 + 1 + 0.1) = 0.12
         y_1 = 1 + ½(0.1 + 0.12) = 1.11
n = 1 :  k_1 = 0.1(0.1 + 1.11) = 0.121
         k_2 = 0.1(0.2 + 1.11 + 0.121) = 0.1431
         y_2 = 1.11 + ½(0.121 + 0.1431) = 1.24205
n = 2 :  k_1 = 0.1(0.2 + 1.24205) = 0.14421
         k_2 = 0.1(0.3 + 1.24205 + 0.14421) = 0.16863
         y_3 = 1.24205 + ½(0.14421 + 0.16863) = 1.39847
n = 3 :  k_1 = 0.1(0.3 + 1.39847) = 0.16985
         k_2 = 0.1(0.4 + 1.39847 + 0.16985) = 0.19683
         y_4 = 1.39847 + ½(0.16985 + 0.19683) = 1.58181
n = 4 :  k_1 = 0.1(0.4 + 1.58181) = 0.19818
         k_2 = 0.1(0.5 + 1.58181 + 0.19818) = 0.22800
         y_5 = 1.58181 + ½(0.19818 + 0.22800) = 1.79490
which is accurate to 2D (the error is approximately 0.003).
As we might expect, the fourth-order method is clearly superior, the first-order
method is clearly inferior, and the second-order method falls in between.
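The comparison is easily repeated on a computer; the following Python sketch (an illustration, not the book's pseudo-code) applies all three methods to y′ = x + y, y(0) = 1 with h = 0.1:

import math

def f(x, y):
    return x + y

def euler_step(x, y, h):
    return y + h * f(x, y)

def rk2_step(x, y, h):
    k1 = h * f(x, y)
    k2 = h * f(x + h, y + k1)
    return y + 0.5 * (k1 + k2)

def rk4_step(x, y, h):
    k1 = h * f(x, y)
    k2 = h * f(x + h / 2, y + k1 / 2)
    k3 = h * f(x + h / 2, y + k2 / 2)
    k4 = h * f(x + h, y + k3)
    return y + (k1 + 2 * k2 + 2 * k3 + k4) / 6

h, steps = 0.1, 5
for name, step in [("Euler", euler_step), ("RK2", rk2_step), ("RK4", rk4_step)]:
    x, y = 0.0, 1.0
    for _ in range(steps):
        y = step(x, y, h)
        x += h
    print(name, round(y, 5))
print("true ", round(2 * math.exp(0.5) - 1.5, 5))   # exact solution y(x) = 2e^x - x - 1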
Checkpoint
1. For each of the two types of method outlined in this Step, what is
the main disadvantage?
2. Why might we expect higher order methods to be more accurate?
EXERCISES
1. For the initial value problem y 0 = x + y with y(0) = 1 considered in the
previous section, obtain estimates of y(0.8) by doing three more steps of
(a) Euler’s method,
(b) the fourth-order Taylor series method,
(c) the second-order Runge-Kutta method,
with h = 0.1. Compare the accuracy of the three methods.
2. Use Euler’s method with step length h = 0.2 to estimate y(1) given that
y 0 = −x y 2 with y(0) = 2.
STEP 34
ORDINARY DIFFERENTIAL EQUATIONS 2
As mentioned earlier, the methods covered in the previous Step are classified as
single-step methods, because the only value of the approximate solution used in
constructing yn+1 is yn , the result of the previous step. In contrast, multistep
methods make use of earlier values like yn−1 , yn−2 , . . ., in order to reduce the
number of times that f (x, y) or its derivatives have to be evaluated.
1 Introduction
Among the multistep methods that can be derived by integrating interpolating
polynomials we have (using f n to denote f (xn , yn )):
(a) the midpoint method (second order): y_{n+1} = y_{n−1} + 2h f_n
(b) Milne’s method (fourth order): y_{n+1} = y_{n−3} + (4h/3)(2f_n − f_{n−1} + 2f_{n−2})
(c) the family of Adams-Bashforth methods: the second-order formula in the family is given by
y_{n+1} = y_n + (h/2)(3f_n − f_{n−1})
while the formula of order 4 is
y_{n+1} = y_n + (h/24)(55f_n − 59f_{n−1} + 37f_{n−2} − 9f_{n−3})
(d) the family of Adams-Moulton methods: the second-order formula in this family, given by
y_{n+1} = y_n + (h/2)(f_{n+1} + f_n)
is often referred to as the trapezoidal method. The Adams-Moulton formula of order 4 is
y_{n+1} = y_n + (h/24)(9f_{n+1} + 19f_n − 5f_{n−1} + f_{n−2})
Note that the family of Adams-Moulton methods in (d) requires evaluation of
f n+1 = f (xn+1 , yn+1 ). Because yn+1 is therefore involved on both the left and
right-hand sides of the expressions, such methods are known as implicit methods.
On the other hand, since yn+1 appears only as the term on the left-hand side in all
the families listed under (a)–(c), they are called explicit methods. Implicit methods
have the disadvantage that one usually requires some numerical technique (see
Steps 7–10) to solve for y_{n+1}. However, it is common to use an explicit method to supply a first estimate of y_{n+1}, which the implicit formula is then used to correct (a predictor-corrector approach); a sketch of the explicit second-order Adams-Bashforth formula in use is given below.
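For illustration, a Python sketch of the second-order Adams-Bashforth formula (with a single Runge-Kutta step to start it) applied to the problem y′ = x + y, y(0) = 1 of the previous Step:

def f(x, y):
    return x + y

h, n_steps = 0.1, 5
xs = [i * h for i in range(n_steps + 1)]
ys = [1.0]

# one second-order Runge-Kutta step to obtain y_1
k1 = h * f(xs[0], ys[0])
k2 = h * f(xs[0] + h, ys[0] + k1)
ys.append(ys[0] + 0.5 * (k1 + k2))

# Adams-Bashforth: y_{n+1} = y_n + (h/2)(3 f_n - f_{n-1})
for n in range(1, n_steps):
    ys.append(ys[n] + 0.5 * h * (3 * f(xs[n], ys[n]) - f(xs[n - 1], ys[n - 1])))

print([round(y, 5) for y in ys])   # y_5 approximates y(0.5) = 1.79744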
2 Stability
Numerical stability is discussed in depth in more advanced texts such as Burden
and Faires (1993). In general, a method is unstable if any errors introduced into
the computation are amplified as the computation progresses. It turns out that the
Adams-Bashforth and Adams-Moulton families of methods have good stability
properties.
As an example of a multistep method with poor stability properties, let us
apply the midpoint method given above with h = 0.1 to the differential equation
y 0 = −5y, y(0) = 1. The true solution to this problem is given by y(x) = e−5x .
We introduce error by taking y1 to be the value obtained by rounding the true
value e^{−0.5} to 5D, namely, y_1 = 0.60653. The resulting method is then given by
y_{n+1} = y_{n−1} + 2(0.1)(−5y_n) = y_{n−1} − y_n
Working to 5D and comparing the consequent estimates y_n with the true values y(x_n) (a short computation illustrating this is sketched below), we find that y_20 has the value 77.82455, an error over a million times larger than the true value!
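A Python sketch of this computation (an illustration of the instability, not the book's table):

import math

h = 0.1
ys = [1.0, 0.60653]                       # y_0 and the rounded starting value y_1
for n in range(1, 20):
    ys.append(ys[n - 1] + 2 * h * (-5.0 * ys[n]))   # y_{n+1} = y_{n-1} - y_n

for n in (5, 10, 15, 20):
    true = math.exp(-5 * n * h)
    print(n, round(ys[n], 5), round(true, 5))
# the final estimate y_20 = 77.82455, while y(2.0) = e^{-10} is about 0.00005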
Checkpoint
EXERCISES
STEP 35
HIGHER ORDER DIFFERENTIAL EQUATIONS*
In the previous two Steps we discussed numerical methods for solving the first-
order initial value problem y 0 = f (x, y), y(x0 ) = y0 . However, ordinary differ-
ential equations that arise in practice are often of higher order. For example, as
explained in the footnote on page 4, a more realistic differential equation for the
motion of a pendulum is given by
y 00 = −ω2 sin y
which may be solved subject to y(x0 ) = y0 and y 0 (x0 ) = y00 , where y0 , y00 are
given values. (For notational consistency with the previous two Steps, we have
changed the variables from θ and t to y and x, respectively.) More generally, an
n-th order differential equation may be written in the form
y^(n) = g(x, y, y′, . . . , y^(n−1))
which may be solved subject to the initial conditions
y(x_0) = y_0,  y′(x_0) = y_0′,  . . . ,  y^(n−1)(x_0) = y_0^(n−1)
where y_0, y_0′, . . . , y_0^(n−1) are given values. We shall see how this n-th order initial
value problem may be written as a system of first-order initial value problems,
which leads us to numerical procedures to solve the general initial value problem
that are extensions of the numerical methods considered in the previous two Steps.
If we set w_1 = y, w_2 = y′, . . . , w_n = y^(n−1), then
w_1′ = w_2,  w_2′ = w_3,  . . . ,  w_{n−1}′ = w_n,  w_n′ = g(x, w_1, w_2, . . . , w_n)
with initial conditions w_1(x_0) = y_0, . . . , w_n(x_0) = y_0^(n−1). If the initial conditions for the pendulum are y(0) = 0 and y′(0) = 1 for example, then the system of two first-order initial value problems is given by
w_1′ = w_2,  w_1(0) = 0,    w_2′ = −ω² sin w_1,  w_2(0) = 1
We remark that a more general system of n first-order differential equations is
given by
w_j′ = g_j(x, w_1, w_2, . . . , w_n)
for j = 1, 2, . . . , n.
3 Numerical example
If we use the Euler method to solve the pendulum problem
w_1′ = w_2,  w_1(0) = 0,    w_2′ = −ω² sin w_1,  w_2(0) = 1
the resulting equations are
w_{1,n+1} = w_{1,n} + h w_{2,n},  w_{1,0} = 0
and
w_{2,n+1} = w_{2,n} − h ω² sin w_{1,n},  w_{2,0} = 1
With ω = 1 and h = 0.2, we obtain the values given in the following table:
n xn w1,n w2,n
0 0.0 0 1
1 0.2 0.20000 1.00000
2 0.4 0.40000 0.96027
3 0.6 0.59205 0.88238
4 0.8 0.76853 0.77077
5 1.0 0.92268 0.63175
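For illustration, the table can be generated by the following Python sketch (not the book's pseudo-code):

import math

omega, h = 1.0, 0.2
w1, w2 = 0.0, 1.0
print(0, 0.0, round(w1, 5), round(w2, 5))
for n in range(1, 6):
    # Euler update; the right-hand side uses the old values of w1 and w2
    w1, w2 = w1 + h * w2, w2 - h * omega ** 2 * math.sin(w1)
    print(n, round(n * h, 1), round(w1, 5), round(w2, 5))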
If we use the Runge-Kutta method given in the previous section, we obtain the
following values:
n xn w1,n w2,n
0 0.0 0 1
1 0.2 0.20000 0.98013
2 0.4 0.39205 0.92169
3 0.6 0.56875 0.82898
4 0.8 0.72377 0.70810
5 1.0 0.85215 0.56574
Since this Runge-Kutta method is second-order and the Euler method is only
first-order, we might expect the values in this table to be more accurate than those
displayed in the previous table. By obtaining very accurate approximations using
much smaller values of h, it may be verified that this is indeed the case.
Checkpoint
EXERCISE
Apply the Euler method with step length h = 0.2 to obtain approximations
to y(1) and y 0 (1) for the second-order initial value problem
y″ + y′ + y = sin x,   y(0) = y′(0) = 0
APPLIED EXERCISES
Here we give some exercises which have a more ‘applied’ nature than most of
the exercises found previously. Some of these exercises are not suitable for hand
calculation, but require the use of a computer.
EXERCISES
1. Consider the cylindrical tank of radius r lying with its axis horizontal as
described in Section 1 of Step 6. Suppose we wish to calibrate a measuring
stick so that it has markings showing when the tank is 10%, 20%, 30%, and
40% full. Find the values of h required (see Figure 2 on page 24) for doing
this calibration.
2. If H (t) is the population of a prey species at time t and P(t) is the population
of a predator species at time t, then a simple model relating these two
populations is given by the two differential equations
dH/dt = H(1 − 0.05P)
dP/dt = P(0.01H − 0.6)
It may be shown that one solution of this problem is obtained when H (t) and
P(t) satisfy
0.6 ln H (t) + ln P(t) − 0.01H (t) − 0.05P(t) = −30
If the population of the prey at t = 0 is H (0) = 1000, find the value of P(0).
3. Carbohydrates, proteins, fats, and alcohol are the main sources of energy in
food. The number of grams of these nutrients in 100 gram servings of each
of bread, lean steak, ice cream, and red wine is given in the following table.
Carbohydrate Protein Fat Alcohol
Bread 47 8 2 0
Steak 0 27 12 0
Ice cream 25 4 7 0
Red wine 0 0 0 10
Given that 100 grams of bread, lean steak, ice cream, and red wine provide
227, 218, 170, and 68 kilocalories respectively, find the number of kilocalories
provided by 100 grams of carbohydrate, protein, fat, and alcohol.
4. A lake populated with fish is divided into three regions X , Y , and Z . Let
x (t) , y (t) , and z (t) denote the proportions of the fish in regions X , Y , and
Z respectively after t days. Then as the fish swim around the lake, the
proportions after (t + 1) days satisfy
x^(t+1) = (1/4)x^(t) + (1/4)y^(t) + (1/5)z^(t)
y^(t+1) = (2/5)x^(t) + (1/2)y^(t) + (2/5)z^(t)
z^(t+1) = (7/20)x^(t) + (1/4)y^(t) + (2/5)z^(t)
Given that after day 1 we have x(1) = (0.24, 0.43, 0.33)T , find the initial
population distribution x(0) .
5. For the linear system x(t+1) = Ax(t) given in the previous exercise, it is
interesting to see whether there is an equilibrium population distribution;
that is, is there an x for which Ax = x? Use the power method to show that
there is such an x and find its value.
6. In applied mathematics the Bessel functions Jn of order n often arise. The
values of the Bessel function J0 (x) for x = 0.0 (0.1) 0.5 are given to 4D in
the following table.
x J0 (x)
0.0 1.0000
0.1 0.9975
0.2 0.9900
0.3 0.9776
0.4 0.9604
0.5 0.9385
Find the degree of the polynomial which fits the data and obtain an approxi-
mation to J0 (0.25) based on the interpolating polynomial of this degree.
7. Corrugated iron is manufactured by using a machine that presses a flat sheet
of iron into one whose cross section has the form of a sine wave. Suppose
a corrugated sheet 69 cm wide is required, the height of each wave from the
centre line is 1 cm and each wave has a period of 10 cm. The width of flat
sheet required is then given by the arc length of the curve f (x) = sin(π x/5)
from x = 0 to x = 69. From calculus, this arc length is
L = ∫_0^69 √(1 + (f′(x))²) dx = ∫_0^69 √(1 + π² cos²(πx/5)/25) dx
Basic pseudo-code is given for some of the algorithms introduced in the book. In
our experience, students do benefit if they study the pseudo-code of a method at
the same time as they learn it in a Step. If they are familiar with a programming
language, they should attempt to convert at least some of the pseudo-code into
computer programs, and apply them to the set Exercises.
Nonlinear equations
1. Bisection method (Step 7)
2. Method of false position (Step 8)
3. Newton-Raphson iterative method (Step 10)
Interpolation
6. Newton’s divided difference formula (Step 24)
Numerical integration
7. Trapezoidal rule (Step 30)
8. Gaussian integration formula (Step 32)
Differential equations
9. Runge-Kutta method (Step 33)
7. Trapezoidal rule (Step 30)
1 read a, b, N , M, ε
2 done = f alse
3 U = 0.0
4 repeat
5 h = (b − a)/N
6 s = ( f (a) + f (b))/2
7 for i = 1 to N − 1 do:
8 x =a+i ∗h
9 s = s + f (x)
10 endfor
11 T =h∗s
12 if |T − U | < ε then do:
13 done = tr ue
14 else do:
15 N =2∗ N
16 U =T
17 endif
18 until N > M or done
19 print ‘Approximation to integral is’, T
20 if N > M then do:
21 print ‘required accuracy not reached with M =’, M
22 endif
8. Gaussian integration formula (Step 32)
1 read a, b
2 x1 = (b + a − (b − a)/√3)/2
3 x2 = (b + a + (b − a)/√3)/2
4 I = (b − a) ∗ ( f (x1 ) + f (x2 ))/2
5 print ‘Approximation to integral is’, I
ANSWERS TO THE EXERCISES
2. R ≈ 0.028∗ × 3.14∗ × 56.25∗ × √(2 × 981∗ × 650∗) ≈ 4.946∗ × 1129∗ ≈ 5.58∗ × 10³ cm³/s
1. The result 13.57, max |eabs | = 0.005 + 0.005 = 0.01, so the answer is
13.57 ± 0.01 or 13.6 correct to 3S.
2. The result 0.01, max |eabs | = 0.01, so that although operands are correct to
5S, the answer may not even be correct to 1S ! This phenomenon is known as
loss of significance or cancellation (see Step 4 for more details).
3. The result 13.3651, max |eabs | ≈ (4.27+3.13)×0.005 = 0.037, so the answer
is 13.3651 ± 0.037 or 13 correct to 2S.
4. The result −1.85676, max |eabs | ≈ 0.513 × 0.005 + 9.48 × 0.0005 + 0.005 ≈
0.012, so the answer is −1.85676 ± 0.012 or −2 correct to 1S.
E ≈ 0.002 × 103 .
(h) ≈ 0.0003 × 10⁻⁵ ; max |e_rel| ≈ 0.005/2.86 + 0.005/3.29, so
max |e_abs| ≈ (0.005/2.86 + 0.005/3.29) × 8.69 × 10⁻⁶ ≈ 0.028 × 10⁻⁶
E ≈ 0.031 × 10⁻⁶.
3. (a) b − c = 5.685 × 101 − 5.641 × 101 = 0.044 × 101 → 4.400 × 10−1 .
a(b − c) = 6.842 × 10−1 × 4.400 × 10−1 = 30.1048 × 10−2 → 3.010 ×
10−1 .
ab = 6.842 × 10−1 × 5.685 × 101 = 38.896770 × 100 → 3.890 × 101 .
ac = 6.842 × 10−1 × 5.641 × 101 = 38.595722 × 100 → 3.860 × 101 .
ab − ac = 3.890 × 101 − 3.860 × 101 = 0.030 × 101 → 3.000 × 10−1 .
The answer obtained (working to 6S ) is 3.01048 × 10−1 with propagated
error at most 0.069 × 10−1 , so we can only rely on the first digit!
(b) a + b = 9.812 × 101 + 0.04631 × 101 = 9.85831 × 101 → 9.858 × 101 .
(a+b)+c = 9.858×101 +0.08340×101 = 9.94140×101 → 9.941×101 .
b + c = 4.631 × 10−1 + 8.340 × 10−1 = 12.971 × 10−1 → 1.297 × 100 .
a +(b +c) = 9.812×101 +0.1297×101 = 9.9417×101 → 9.942×101 .
The answer obtained (working to 6S ) is 9.94171 × 101 with propagated
error at most 0.00051 × 101 .
4. Direct use of f (x) = tan x − sin x leads to the approximation to f (0.1) given
by
while using the alternative expression f (x) = 2 tan x sin2 (x/2) leads to the
approximation
cos x = 1 − x²/2! + x⁴/4! − x⁶/6! + · · · + (−1)^k x^{2k}/(2k)! + R_{2k}
where
R_{2k} = (−1)^{k+1} x^{2k+1} sin ξ / (2k + 1)!
and ξ lies between 0 and x. Alternatively, since the coefficient of the x^{2k+1} term will be zero, the same polynomial approximation is obtained with n = 2k + 1, so that R_{2k} may be replaced by R_{2k+1}.
(b) The expansion is
1/(1 − x) = 1 + x + x² + · · · + x^n + R_n
where
R_n = x^{n+1}/(1 − x)
which may be obtained from the formula for the sum of a geometric series.
(c) Since f ( j) (x) = e x , the Taylor expansion is
x2 x3 xn
ex = 1 + x + + + ··· + + Rn
2! 3! n!
where
x n+1 eξ
Rn =
(n + 1)!
and ξ lies between 0 and x.
2. (a) cos(0.5) = 0.87758 (to 5D ) while the first four terms of the Taylor
expansion yield 0.87758.
(b) 1/(1 − 0.5) = 2 while the first four terms of the Taylor expansion yield
1.875.
(c) e0.5 = 1.64872 (to 5D ) while the first four terms of the Taylor expansion
yield 1.64583.
[Figure: graphs of y = x and y = −cos x on −3 ≤ x ≤ 3.]
half-length 3.125 × 10−3 , which is less than the required 5 × 10−3 for 2D
accuracy.
2. Root is 0.615 to 3D.
3. (a) With an initial interval of (−1.1, −1) say, application of the bisection
method shows that the root to 2D is −1.03.
(b) With an initial interval of (−0.6, −0.5) say, the root to 2D is −0.57.
(c) With an initial interval of (−0.5, −0.4) say, the root to 2D is −0.44.
1. Tabulate f :
x f (x)
0 −2
0.2 −1.40266
0.6 −0.27072
0.8 +0.23471
There is a root in the interval 0.6 < x < 0.8. We have
x_1 = [0.6(0.23471) − 0.8(−0.27072)]/(0.23471 + 0.27072)
    = (0.14083 + 0.21657)/0.50543 = 0.70712
f(x_1) = f(0.70712) = 2 sin(0.70712) + 0.70712 − 2 = 1.29929 + 0.70712 − 2 = 0.00642
Since f (0.6) and f (0.70712) have opposite signs, the root is in the interval
0.6 < x < 0.70712. Repeating the process,
x_2 = [0.6(0.00642) − 0.70712(−0.27072)]/(0.00642 + 0.27072)
    = (0.00385 + 0.19143)/0.27714 = 0.70464
Since f (x2 ) = f (0.70464) = 0.00016, we know the root lies between 0.6
and 0.70464, so we compute
x_3 = [0.6(0.00016) − 0.70464(−0.27072)]/(0.00016 + 0.27072) = 0.70458
Since | f (0.70458)| is less than the requested value of 5 × 10−5 we may stop.
Because x2 and x3 agree to 4D, we conclude that the root accurate to 4D is
0.7046. Note that all the xn computed have f (xn ) positive.
2. Let us take f (x) = 3 sin x − x − 1/x. We note that f (0.7) = −0.19592 and
f (0.9) = 0.33887, that is, the root is enclosed. We shall obtain a solution
accurate to four decimal places. The following results are obtained.
(a) Bisection gives the sequence of intervals: (0.7, 0.9), (0.7, 0.8), (0.75, 0.8),
(0.75, 0.775), (0.7625, 0.775), (0.7625, 0.76875), (0.7625, 0.76563),
(0.7625, 0.76406), (0.7625, 0.76328), (0.76289, 0.76328), (0.76309,
0.76328), (0.76309, 0.76318). Thus the root to 4D is 0.7631, since it
is enclosed in the interval (0.76309, 0.76318), of half-length less than
5 × 10−5 .
(b) If [an , bn ] is the interval bracketing the root at the n-th iteration of false
position, then the first iteration with a1 = 0.7 and b1 = 0.9 yields the
approximation x1 = 0.77327. Since f (0.77327) = 0.02896, the process
is repeated, now with a2 = 0.7 and b2 = 0.77327. This yields x2 =
0.76383. Since f (0.76383) = 0.00207, we take a3 = 0.7 and b3 =
0.76383 to obtain x3 = 0.76317 and f (x3 ) = 0.00015. Then a4 = 0.7
and b4 = 0.76317 gives x4 = 0.76312. One more iteration yields the
approximation 0.76312 again, so we conclude that the root is 0.7631 to
4D. Note that all the values of f (xn ) are positive.
(c) The secant method with x0 = 0.7, x1 = 0.9 gives x2 = 0.77327, x3 =
0.76143, x4 = 0.76314, and x5 = 0.76312. Again we conclude that the
root is 0.7631 to 4D.
3. In Step 7 we found that the root lies in the interval (−0.74375, −0.7375). False
position with a1 = −0.75 and b1 = −0.73 (using f (a1 ) = −0.01831 and
f (b1 ) = 0.01517) gives x1 = −0.73906. Since f (−0.73906) = 0.00004,
the process is repeated with a2 = −0.75 and b2 = −0.73906 to give x2 =
−0.73909. Since the magnitude of f (−0.73909) is less than the specified
value of 5 × 10−6 , we terminate the process and give the root as −0.7391.
4. (a) With an initial interval of (−1.1, −1), the stopping criterion is satisfied
after three iterations of the method of false position and the root accurate
to 4D is −1.0299.
(b) With an initial interval of (−0.6, −0.5), the stopping criterion is satisfied
after three iterations and the root accurate to 4D is −0.5671.
(c) With an initial interval of (−0.5, −0.4), the stopping criterion is satisfied
after three iterations and the root accurate to 4D is −0.4441.
Since x9 and x10 agree to 4D we can give the root as −0.7391. Note that
φ 0 (x) = sin x ≈ −0.67 near the root, so convergence is slow (and ‘oscilla-
tory’).
Since
f 0 (x) = 3(x + 1)e x
then f 0 (0.25) = 4.81510 and
0.03698
x1 = 0.25 + = 0.25 + 0.00768 = 0.25768
4.81510
Then
and
0.00026
x2 = 0.25768 − = 0.25768 − 0.00005 = 0.25763
4.88203
Doing one more iteration yields x3 = 0.25763, so we conclude that the
root is 0.2576 to 4S. Note that only two or three iterations are required for
the Newton-Raphson process, whereas eight iterations were needed for the
iteration method based on x_{n+1} = (1/3)e^{−x_n}.
For the k-th root of a, the Newton-Raphson process applied to f(x) = x^k − a gives
x_{n+1} = x_n − (x_n^k − a)/(k x_n^{k−1}) = (1 − 1/k) x_n + a/(k x_n^{k−1})
Solution by back-substitution
−2x3 = −6 ⇒ x3 = 3
−3x2 + 9 = 6 ⇒ x2 = 1
x1 + 1 − 3 = 0 ⇒ x1 = 2
2.
m Augmented Matrix
5.6 3.8 1.2 1.4
3.1 7.1 −4.7 5.1
1.4 −3.4 8.3 2.4
5.6 3.8 1.2 1.4
0.55 5.00 −5.36 4.33
0.25 −4.35 8.00 2.05
5.6 3.8 1.2 1.4
5.00 −5.36 4.33
−0.87 3.33 5.82
Taking Sk < 0.000005 as the stopping criterion, these relations yield the
following table of results.
Iteration
k x (k) y (k) z (k) Sk (to 6D )
0 0 0 0 5.743750
1 2.550000 2.487500 −0.706250 0.998594
2 2.106250 2.951563 −0.615469 0.093816
3 2.045719 2.921305 −0.612441 0.008322
4 2.050560 2.918581 −0.613198 0.000632
5 2.050893 2.918876 −0.613202 0.000063
6 2.050848 2.918889 −0.613196 0.000004
7 2.050847 2.918886 −0.613196
A − λI = [ −1.2 − λ    1.1
            3.6       −0.8 − λ ]
so that det(A − λI) = (−1.2 − λ)(−0.8 − λ) − 3.96 = λ² + 2λ − 3 = (λ + 3)(λ − 1).
Thus the eigenvalues are −3 and 1. The approximations from the power
method do appear to be converging to −3, the eigenvalue with the larger
magnitude.
2. Five iterations of the normal power method with starting vector w^(0) = [1, 1, 1]^T yield:
w^(1) = [12, 37, 24]^T,  λ_1^(1) = 37/1 = 37
w^(2) = [342, 1063, 656]^T,  λ_1^(2) = 1063/37 = 28.72973
w^(3) = [9686, 30121, 18372]^T,  λ_1^(3) = 30121/1063 = 28.33584
w^(4) = [273586, 850879, 517548]^T,  λ_1^(4) = 850879/30121 = 28.24870
w^(5) = [7722638, 24018793, 14599876]^T,  λ_1^(5) = 24018793/850879 = 28.22821
Note the rapid growth in the size of the components of the vectors. For the scaled power method with the same starting vector we obtain:
w^(1) = [12, 37, 24]^T,  p = 2,  λ_1^(1) = 37/1 = 37,  y^(1) = [0.32432, 1, 0.64865]^T
w^(2) = [9.24324, 28.72973, 17.72973]^T,  p = 2,  λ_1^(2) = 28.72973/1 = 28.72973,  y^(2) = [0.32173, 1, 0.61712]^T
w^(3) = [9.11195, 28.33584, 17.28316]^T,  p = 2,  λ_1^(3) = 28.33584/1 = 28.33584,  y^(3) = [0.32157, 1, 0.60994]^T
w^(4) = [9.08290, 28.24870, 17.18230]^T,  p = 2,  λ_1^(4) = 28.24870/1 = 28.24870,  y^(4) = [0.32153, 1, 0.60825]^T
w^(5) = [9.07607, 28.22821, 17.15858]^T,  p = 2,  λ_1^(5) = 28.22821/1 = 28.22821,  y^(5) = [0.32152, 1, 0.60785]^T
1.
x f (x) = x 3 First diff. Second Third Fourth
0 0
1
1 1 6
7 6
2 8 12 0
19 6
3 27 18 0
37 6
4 64 24 0
61 6
5 125 30
91
6 216
2. (a)
x f (x) = 2x − 1 First diff. Second
0 −1
2
1 1 0
2
2 3 0
2
3 5
(b)
x f (x) = 3x 2 + 2x − 4 First diff. Second Third
0 −4
5
1 1 6
11 0
2 12 6
17 0
3 29 6
23
4 52
(c)
x f (x) = 2x 3 + 5x − 3 First diff. Second Third Fourth
0 −3
7
1 4 12
19 12
2 23 24 0
43 12
3 66 36 0
79 12
4 145 48
127
5 272
If the polynomial has degree n, then the differences of order n are all equal
so that the differences of order (n + 1) are 0.
3.
x f (x) = e x First diff. Second Third Fourth
0.10 1.105171
56663
0.15 1.161834 2906
59569 147
0.20 1.221403 3053 12
62622 159
0.25 1.284025 3212 4
65834 163
0.30 1.349859 3375 10
69209 173
0.35 1.419068 3548 9
72757 182
0.40 1.491825 3730 10
76487 192
0.45 1.568312 3922
80409
0.50 1.648721
There is just a hint of excessive ‘noise’ at the fourth differences.
(b)
x    f(x) = x⁴    Δ    Δ²    Δ³    Δ⁴
0.0 0.000
0
0.1 0.000 2
2 2
0.2 0.002 4 6
6 8
0.3 0.008 12 −1
18 7
0.4 0.026 19 4
37 11
0.5 0.063 30 2
67 13
0.6 0.130 43 4
110 17
0.7 0.240 60 −1
170 16
0.8 0.410 76 6
246 22
0.9 0.656 98
344
1.0 1.000
From the table in part (a), the true value of the fourth difference is 0.0024.
Thus the values in this last column should be 2.4. The worst round-off
error is therefore 6.0 − 2.4 = 3.6, which is within expectations.
2.
x    f(x)    Δ    Δ²    Δ³
0 3
−1
1 2 6
5 6
2 7 12
17 6
3 24 18
35 6
4 59 24
59
5 118
1. The first difference is 0.56464 − 0.47943 = 0.08521 so that the linear inter-
polating polynomial is
P_1(x) = 0.47943 + [(x − 0.5)/0.1] × 0.08521
Thus
sin(0.55) ≈ 0.47943 + 0.5 × 0.08521 = 0.52204
The true value of sin(0.55) to 5D is 0.52269.
2. Difference table:
x    f(x) = cos x    Δ    Δ²
80◦ 0.1736
−28
80◦ 100 0.1708 −1
−29
80◦ 200 0.1679 0
−29
80◦ 300 0.1650 1
−28
80◦ 400 0.1622 −1
−29
80◦ 500 0.1593
(a) We have
(b) We have
3. Difference table:
x    f(x) = tan x    Δ    Δ²    Δ³
80◦ 5.671
98
80◦ 100 5.769 4
102 −1
80◦ 200 5.871 3
105 0
80◦ 300 5.976 3
108 2
80◦ 400 6.084 5
113
80◦ 500 6.197
The second-order differences are approximately constant, so that quadratic
approximation is appropriate: setting θ = ½,
tan 80° 35′ ≈ 5.976 + ½(0.108) − ⅛(0.005) = 6.029
(a) We have
e^0.14 = f(0.14)
       ≈ f(0.1) + (4/5)(0.05666) + ½(4/5)(−1/5)(0.00291) + (1/6)(4/5)(−1/5)(−6/5)(0.00015)
       = 1.10517 + 0.04532(8) − 0.00023(3) + 0.00000(5)
       = 1.15027
(b) We have
e^0.315 = f(0.315)
        ≈ f(0.30) + (3/10)(0.06583) + ½(3/10)(13/10)(0.00320) + (1/6)(3/10)(13/10)(23/10)(0.00014)
        = 1.34986 + 0.01974(9) + 0.00062(4) + 0.00002(1)
        = 1.37025
f_{k+1} = f_k + Δf_k
        = f_0 + (k + 1)Δf_0 + [k(k − 1)/2 + k]Δ²f_0 + · · · + Δ^{k+1}f_0
        = f_0 + (k + 1)Δf_0 + [(k + 1)k/2]Δ²f_0 + · · · + Δ^{k+1}f_0
that is, the relation holds for j = k + 1. We conclude that it holds for
j = 0, 1, 2 . . .
With reference to Section 4 of Step 22, note that
f j = f (x j ) = Pn (x j )
on setting θ = j for j = 0, 1, 2, . . .
3. The relevant difference table is given in the answer to Exercise 2 of Step 20.
Since f_0 = 3, Δf_0 = −1, Δ²f_0 = 6, Δ³f_0 = 6, and Δ⁴f_0 = 0, we obtain
P_3(x) = f_0 + θΔf_0 + [θ(θ − 1)/2]Δ²f_0 + [θ(θ − 1)(θ − 2)/6]Δ³f_0
       = 3 − θ + 3θ(θ − 1) + θ(θ − 1)(θ − 2)
       = θ³ − 2θ + 3
Since x_0 = 0 and h = 1, we have θ = x and hence
P_3(x) = x³ − 2x + 3
The student may verify that any four adjacent tabular points have the same
interpolating cubic. This suggests that the tabulated function f is a cubic
in which case we have f ≡ P3 . However, it is by no means certain that
f is a cubic. The data could have come from any function of the form
f (x) = P3 (x) + g(x), where g(x) is zero at the points x = 0, 1, 2, 3, 4, 5. A
simple example is g(x) = x(x − 1)(x − 2)(x − 3)(x − 4)(x − 5).
x f (x)
27 3.00000
5263
8 2.00000 −347
14286 384
1 1.00000 −10714 −6
100000 165
0 0.00000 −1488
6250
64 4.00000
Since the terms are not decreasing we cannot have much confidence in this
result. The reader may recall that this example was quoted in Section 3 of Step
23, in a warning concerning the use of the Lagrange interpolation formula in
practice. With divided differences, we can at least see that interpolation for
f (20) is invalid!
2. Let us order the points as x0 = 0, x1 = 0.5 and x2 = 1. Then the divided
difference table is as follows:
x f (x) = e x
0 1
129744
0.5 1.64872 84168
213912
1 2.71828
[f‴(ξ)/3!](0.25 − 0.0)(0.25 − 0.5)(0.25 − 1.0) = f‴(ξ)/128
where ξ lies between 0 and 1. For f (x) = e x , f 000 (x) = e x and thus
e0 ≤ f 000 (ξ ) ≤ e1
k xk f (xk )
0 −1 4
0
1 1 4 14
−14 1
2 −2 46 18 2
22 11
3 3 156 51
328
4 4 484
Then
(b) Let us again order the points such that x0 = −1, x1 = 1, x2 = −2, x3 = 3,
x4 = 4, to get the Aitken scheme:
k x f (x) xk − x
0 −1 4 −1
1 1 4 4 1
2 −2 46 −38 −10 −2
3 3 156 42 −15 −12 3
4 4 484 100 −28 −16 0 4
1. The root of f (x) = x + cos x is in the interval −0.8 < x < −0.7; in fact,
f (−0.8) = −0.10329 and f (−0.7) = 0.06484. Since f is known explicitly,
one may readily subtabulate (by successive interval bisection, say) and use
linear inverse interpolation:
f(−0.75) = −0.01831,   θ = (0 + 0.01831)/(0.06484 + 0.01831) = 0.22021
f(−0.725) = 0.02350,   θ = (0 + 0.01831)/(0.02350 + 0.01831) = 0.43795
f(−0.7375) = 0.00265,  θ = (0 + 0.01831)/(0.00265 + 0.01831) = 0.87349
and this interval is quite small enough for linear inverse interpolation: since
f(0.27) = 1.0611 and f(0.25) = 0.9630, we have
θ = (1.0000 − 0.9630)/(1.0611 − 0.9630) = 0.3772
whence α = 0.25 + (0.3772)(0.02) = 0.2575. Checking α = 0.258, we have
f (0.258) = 1.0018, which is closer to 1 than f (0.257) = 0.9969. (While the
value to 3D is obtained immediately by linear inverse interpolation, the method
of bisection described in Step 7 may be preferred when greater accuracy is
demanded.)
3. If the explicit form of the function is unknown so that it is not possible to
subtabulate readily, one may use iterative inverse interpolation. The relevant
difference table is:
x    f(x)    Δf    Δ²f    Δ³f
2 3.0671
33417
3 6.4088 752
34169 6
4 9.8257 758
34927 6
5 13.3184 764
35691 6
6 16.8875 770
36461 6
7 20.5336 776
37237 6
8 24.2573 782
38019 6
9 28.0592 788
38807 6
10 31.9399 794
39601 6
11 35.9000 800
40401 6
12 39.9401 806
41207
13 44.0608
To find x for which f (x) = 10, one may use inverse interpolation based on
Newton’s forward formula:
θ1 = (10 − 9.8257)/3.4927 = 0.1743/3.4927 = 0.0499 ≈ 0.05
θ_2 ≈ [0.1743 − ½(0.05)(−0.95)(0.0764)]/3.4927
    = (0.1743 + 0.0018)/3.4927 = 0.0504
and further corrections are negligible so that
x = 4 + 0.0504 = 4.0504
To find x for which f (x) = 20 one again may choose inverse interpolation
based on Newton’s forward formula:
θ1 = (20 − 16.8875)/3.6461 = 3.1125/3.6461 = 0.85365
≈ 0.85
θ_2 ≈ [3.1125 − ½(0.85)(−0.15)(0.0776)]/3.6461
    = (3.1125 + 0.0049)/3.6461 = 0.8550
and further corrections are negligible so that
x = 6 + 0.8550 = 6.8550
To find x for which f (x) = 40, one may choose inverse interpolation based
on Newton’s backward formula. Thus,
θ_1 = (f(x) − f_j)/∇f_j
θ_2 ≈ [f(x) − f_j − (θ_1(θ_1 + 1)/2)∇²f_j]/∇f_j
etc. Consequently,
θ_1 = (40 − 39.9401)/4.0401 = 0.0599/4.0401 = 0.0148 ≈ 0.015
θ_2 ≈ [0.0599 − ½(0.015)(1.015)(0.0800)]/4.0401
    = (0.0599 − 0.0006)/4.0401 = 0.0147
and further corrections are negligible so that
x = 12 + 0.0147 = 12.0147
Let us now consider the check by direct interpolation. We have from Newton’s
forward formula
f(4.0504) = 9.8257 + (0.0504)(3.4927) + ½(0.0504)(−0.9496)(0.0764)
          = 9.9999
and
f(6.8550) = 16.8875 + (0.8550)(3.6461) + ½(0.8550)(−0.1450)(0.0776)
          = 20.0001
Finally, we may determine the cubic f and use it to check the answers:
f(x) = f_j + θΔf_j + [θ(θ − 1)/2!]Δ²f_j + [θ(θ − 1)(θ − 2)/3!]Δ³f_j
     = 9.8257 + (x − 4)(3.4927) + ½(x − 4)(x − 5)(0.0764) + ⅙(x − 4)(x − 5)(x − 6)(0.0006)
     = [9.8257 − 4(3.4927) + 10(0.0764) − 20(0.0006)]
       + [3.4927 − (9/2)(0.0764) + (37/3)(0.0006)]x
       + [½(0.0764) − (5/2)(0.0006)]x² + ⅙(0.0006)x³
x 1 2 3 4 5 6
y 1 3 4 3 4 2
` 2.33 2.53 2.73 2.93 3.13 3.33
y−` −1.33 0.47 1.27 0.07 0.87 −1.33
(y − `)2 1.7689 0.2209 1.6129 0.0049 0.7569 1.7689
p 1.14 2.76 3.66 3.84 3.30 2.04
y−p −0.14 0.24 0.34 −0.84 0.70 −0.04
(y − p)2 0.0196 0.0576 0.1156 0.7056 0.4900 0.0016
2. Computing n, Σx_i, Σy_i, Σx_i y_i, and Σx_i², inserting in the normal equations and solving gives:
Now
∂S/∂c_1 = −2 Σ_{i=1}^{4} (y_i − c_1 − c_2 sin x_i)
and
∂S/∂c_2 = −2 Σ_{i=1}^{4} (y_i − c_1 − c_2 sin x_i) sin x_i
so the normal equations may be written as
Σ_{i=1}^{4} y_i = 4c_1 + (Σ_{i=1}^{4} sin x_i) c_2
and
Σ_{i=1}^{4} (sin x_i) y_i = (Σ_{i=1}^{4} sin x_i) c_1 + (Σ_{i=1}^{4} sin² x_i) c_2
Tabulating:
6 = 4c1 + 2c2
4.5 = 2c1 + 1.5c2
we obtain c1 = 0, c2 = 3.
Since the last three elements of the first column of A(0) are the same, the three
remaining components of w(1) all have the same value, namely,
1 / [(−2) × (−2) × (√3/2)] = 1/√12
Thus
H^(1) = I − 2w^(1)(w^(1))^T = [ −1/2  −1/2  −1/2  −1/2
                                −1/2   5/6  −1/6  −1/6
                                −1/2  −1/6   5/6  −1/6
                                −1/2  −1/6  −1/6   5/6 ]
and
A^(1) = H^(1) A^(0) = [ −2   −1
                         0   1/6
                         0   2/3
                         0   1/6 ]
Upon solving, we find that m_1 = 36/5 and m_2 = 96/5. These two values along with m_0 = m_3 = 0 can then be used in the formulae for the spline coefficients given on page 131. So we have
a_1 = f_1 = 4
b_1 = (f_1 − f_0)/h_1 + h_1(2m_1 + m_0)/6 = 27/5
c_1 = m_1/2 = 18/5,   d_1 = (m_1 − m_0)/(6h_1) = 6/5
and hence on the interval [0, 1] the spline is given by
S_1(x) = 4 + (27/5)(x − 1) + (18/5)(x − 1)² + (6/5)(x − 1)³
Similarly, we obtain
a_2 = f_2 = 15
b_2 = (f_2 − f_1)/h_2 + h_2(2m_2 + m_1)/6 = 93/5
c_2 = m_2/2 = 48/5,   d_2 = (m_2 − m_1)/(6h_2) = 2
and hence on the interval (1, 2] the spline is given by
S_2(x) = 15 + (93/5)(x − 2) + (48/5)(x − 2)² + 2(x − 2)³
Finally, we have
a_3 = f_3 = 40
b_3 = (f_3 − f_2)/h_3 + h_3(2m_3 + m_2)/6 = 141/5
c_3 = m_3/2 = 0,   d_3 = (m_3 − m_2)/(6h_3) = −16/5
and hence on the interval (2, 3] the spline is given by
S_3(x) = 40 + (141/5)(x − 3) − (16/5)(x − 3)³
The spline is plotted in Figure 18. The required estimate at x = 2.3 is given by
40 + (141/5)(−0.7) − (16/5)(−0.7)³ = 21.3576
FIGURE 18. Natural cubic spline.
f(x) = f(x_j + θh) = [1 + θ∇ + (θ(θ + 1)/2!)∇² + (θ(θ + 1)(θ + 2)/3!)∇³ + · · ·] f_j
and hence
f′(x) = (1/h) df/dθ = (1/h)[∇ + (θ + ½)∇² + ((3θ² + 6θ + 2)/6)∇³ + · · ·] f_j
f″(x) = (1/h²) d²f/dθ² = (1/h²)[∇² + (θ + 1)∇³ + · · ·] f_j
= 400[−0.00059 − 0.00005]
= −0.256
Although the input data are correct to 5D, the results are accurate to only
3D and 1D, respectively.
(b) The approximations are:
f′(1.30) ≈ 20[∇ + ½∇² + ⅓∇³] f(1.30) = 20[0.02215 − 0.00021(5) + 0.00002] = 0.4391
f″(1.30) ≈ 400[∇² + ∇³] f(1.30) = 400[−0.00043 + 0.00006] = −0.148
To 4D the correct values are 0.4385 and −0.1687. Thus the first approxi-
mation is accurate to only 2D (the error is about 0.0006), while the second
approximation is accurate to only 1D (the error is about 0.02).
3. (a) Expanding about x = x j :
f(x_j + h) = f(x_j) + h f′(x_j) + (h²/2!) f″(x_j) + · · ·
so
[f(x_j + h) − f(x_j)]/h = f′(x_j) + (h/2) f″(x_j) + · · ·
and the truncation error ≈ ½h f″(x_j).
(b) Denoting x_j + ½h by x_{j+½}, we expand about x = x_{j+½}:
f(x_j + h) = f(x_{j+½}) + (h/2) f′(x_{j+½}) + (1/2!)(h/2)² f″(x_{j+½}) + (1/3!)(h/2)³ f‴(x_{j+½}) + · · ·
and
f(x_j) = f(x_{j+½}) − (h/2) f′(x_{j+½}) + (1/2!)(h/2)² f″(x_{j+½}) − (1/3!)(h/2)³ f‴(x_{j+½}) + · · ·
so
[f(x_j + h) − f(x_j)]/h = f′(x_{j+½}) + (h²/24) f‴(x_{j+½}) + · · ·
and the truncation error ≈ (h²/24) f‴(x_j + ½h).
$$f(x_j + 2h) = f(x_j) + 2h f'(x_j) + 2h^2 f''(x_j) + \frac{4}{3} h^3 f'''(x_j) + \cdots,$$
so
$$\frac{f(x_j + 2h) - 2 f(x_j + h) + f(x_j)}{h^2} = f''(x_j) + h f'''(x_j) + \cdots$$
and the truncation error $\approx h f'''(x_j)$.
$$f(x_j + 2h) = f(x_j + h) + h f'(x_j + h) + \frac{h^2}{2!} f''(x_j + h) + \frac{h^3}{3!} f'''(x_j + h) + \frac{h^4}{4!} f^{(4)}(x_j + h) + \cdots$$
and
$$f(x_j) = f(x_j + h) - h f'(x_j + h) + \frac{h^2}{2!} f''(x_j + h) - \frac{h^3}{3!} f'''(x_j + h) + \frac{h^4}{4!} f^{(4)}(x_j + h) + \cdots,$$
so
$$\frac{f(x_j + 2h) - 2 f(x_j + h) + f(x_j)}{h^2} = f''(x_j + h) + \frac{h^2}{12} f^{(4)}(x_j + h) + \cdots$$
and the truncation error $\approx \tfrac{1}{12} h^2 f^{(4)}(x_j + h)$.
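These truncation-error estimates are easy to confirm numerically; a small illustration (the test function $e^x$ is chosen here for convenience and is not taken from the text):

```python
import math

# With f(x) = e^x every derivative equals e^x, so each truncation-error
# estimate above can be compared with the actual error.
f = math.exp
xj = 0.0

for h in (0.2, 0.1, 0.05):
    forward = (f(xj + h) - f(xj)) / h
    second = (f(xj + 2 * h) - 2 * f(xj + h) + f(xj)) / h ** 2

    # forward difference as an approximation to f'(x_j): error ~ (h/2) f''(x_j)
    print(h, forward - f(xj), h / 2 * f(xj))
    # the same quotient as an approximation to f'(x_j + h/2): error ~ (h^2/24) f'''
    print(h, forward - f(xj + h / 2), h ** 2 / 24 * f(xj + h / 2))
    # second difference as an approximation to f''(x_j + h): error ~ (h^2/12) f''''
    print(h, second - f(xj + h), h ** 2 / 12 * f(xj + h))
```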
$$T(0.30) = \frac{0.30}{2}(1.00000 + 1.14018) = 0.32102(7)$$
$$T(0.15) = \frac{0.15}{2}(1.00000 + 1.14018) + (0.15)(1.07238) = 0.16051(4) + 0.16085(7) = 0.32137(1)$$
$$T(0.10) = \frac{0.10}{2}(1.00000 + 1.14018) + (0.10)(1.04881 + 1.09545) = 0.10700(9) + 0.21442(6) = 0.32143(5)$$
$$T(0.05) = \frac{0.05}{2}(1.00000 + 1.14018) + (0.05)(1.02470 + 1.04881 + 1.07238 + 1.09545 + 1.11803) = 0.05350(5) + 0.26796(9) = 0.32147(4)$$
To 8D, the answer is in fact 0.32148537, so that we may observe that the error
sequence 0.00045(8), 0.00011(4), 0.00005(0), 0.00001(1) decreases with h 2
(the truncation error dominates the round-off error).
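The sequence of trapezoidal approximations can be reproduced with a few lines of Python; the ordinates tabulated above are those of $\sqrt{1+x}$ on $[0, 0.3]$, so that integrand is assumed in this sketch:

```python
import math

# Composite trapezoidal rule T(h) for the integral of sqrt(1 + x) over [0, 0.3].
def trapezoid(f, a, b, n):
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

f = lambda x: math.sqrt(1.0 + x)
exact = 0.32148537          # the 8D value quoted above

for n in (1, 2, 3, 6):      # h = 0.30, 0.15, 0.10, 0.05
    T = trapezoid(f, 0.0, 0.3, n)
    print(0.3 / n, T, exact - T)   # the error decreases roughly like h^2
```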
2. We have:
$$T(1) = \frac{1}{2}\left[\frac{1}{1+0} + \frac{1}{1+1}\right] = 0.75$$
$$T(0.5) = \frac{0.5}{2}\left[\frac{1}{1+0} + \frac{1}{1+1}\right] + 0.5\left[\frac{1}{1+0.5}\right] = 0.7083 \ \text{(to 4D)}$$
$$T(0.25) = \frac{0.25}{2}\left[\frac{1}{1+0} + \frac{1}{1+1}\right] + 0.25\left[\frac{1}{1+0.25} + \frac{1}{1+0.5} + \frac{1}{1+0.75}\right] = 0.6970 \ \text{(to 4D)}$$
1. We have
$$f(x) = \frac{1}{1+x}, \qquad f''(x) = \frac{2}{(1+x)^3}, \qquad f^{(4)}(x) = \frac{24}{(1+x)^5}$$
A bound on the truncation error for the trapezoidal rule is $\frac{2}{12}h^2 = \frac{1}{6}h^2$, so that we would need to choose $h \le 0.017$ to obtain 4D accuracy. For Simpson's rule, however, the truncation error bound is $\frac{24}{180}h^4 = \frac{2}{15}h^4$, so that we may choose $h = 0.1$. Tabulating:
By Simpson's rule,
$$\int_0^1 \frac{dx}{1+x} \approx \frac{0.1}{3}\bigl[1 + 4(0.909091 + 0.769231 + 0.666667 + 0.588235 + 0.526316) + 2(0.833333 + 0.714286 + 0.625000 + 0.555556) + 0.500000\bigr] = 0.693150$$
Thus to 4D, the approximation to the integral is 0.6932. (To 6D the true value
is 0.693147.)
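A short Python check of this composite Simpson calculation:

```python
# Composite Simpson's rule with n = 10 strips (h = 0.1) for the integral of
# 1/(1 + x) over [0, 1]; compare with 0.693150 above and ln 2 = 0.693147...
def simpson(f, a, b, n):          # n must be even
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
    return h * s / 3.0

print(simpson(lambda x: 1.0 / (1.0 + x), 0.0, 1.0, 10))   # approx 0.693150
```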
2. Simpson's rule with $N = 2$ yields the approximation
$$\frac{\pi}{24}\left[0 + 4 \times \frac{\pi}{8}\cos(\pi/8) + \frac{\pi}{4}\cos(\pi/4)\right] = 0.26266 \ \text{(to 5D)}$$
To 5D the true value of the integral is 0.26247, so that the magnitude of the error is approximately $|0.26247 - 0.26266| = 0.00019$.
$$\int_0^1 \frac{du}{1+u} = \frac12\int_{-1}^{1} \frac{dx}{1 + \frac12(x + 1)} = \int_{-1}^{1} \frac{dx}{3 + x}$$
Two-point formula:
$$\int_0^1 \frac{du}{1+u} \approx \frac{1}{3 - 0.57735027} + \frac{1}{3 + 0.57735027} = 0.41277119 + 0.27953651 = 0.6923077$$
which is correct to 2D.
Four-point formula:
$$\int_0^1 \frac{du}{1+u} \approx 0.34785485\left[\frac{1}{3 - 0.86113631} + \frac{1}{3 + 0.86113631}\right] + 0.65214515\left[\frac{1}{3 - 0.33998104} + \frac{1}{3 + 0.33998104}\right]$$
$$= 0.34785485[0.46753798 + 0.25899112] + 0.65214515[0.37593717 + 0.29940290]$$
$$= (0.34785485)(0.72652909) + (0.65214515)(0.67534007)$$
$$= 0.25272667 + 0.44041975 = 0.69314642$$
This approximation is correct to 5D.
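Both Gauss-Legendre results can be reproduced with standard nodes and weights; a brief Python sketch:

```python
import numpy as np

# Gauss-Legendre quadrature for the transformed integral of 1/(3 + x) over [-1, 1].
# leggauss returns the abscissae (+-0.57735027 for 2 points; +-0.33998104 and
# +-0.86113631 for 4 points) and weights (0.65214515, 0.34785485) used above.
g = lambda x: 1.0 / (3.0 + x)

for npts in (2, 4):
    nodes, weights = np.polynomial.legendre.leggauss(npts)
    print(npts, np.dot(weights, g(nodes)))   # 0.6923077 and 0.69314642; ln 2 = 0.69314718...
```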
(b) The approximations from the fourth-order Taylor series method are y6 =
2.04422, y7 = 2.32748, and y8 = 2.65105. The estimate is accurate to
4D (the error in y8 is approximately 0.00003).
(c) For the second-order Runge-Kutta method we calculate k1 = 0.22949 and
k2 = 0.26244, and obtain y6 = 2.04086. Further calculation yields k1 =
0.26409, k2 = 0.30049, y7 = 2.32315, k1 = 0.30231, k2 = 0.34255, and
y8 = 2.64558. The estimate is accurate to 1D, but not to 2D (the error
in y8 is approximately 0.0055, which is larger than the maximum error of
0.005 allowable for 2D accuracy).
2. Euler's method is $y_{n+1} = y_n - 0.2 x_n y_n^2 = y_n(1 - 0.2 x_n y_n)$, and thus we obtain (with working displayed to 5D):
y1 = 2(1 − 0.2 × 0 × 2) = 2
y2 = 2(1 − 0.2 × 0.2 × 2) = 1.84
y3 = 1.84(1 − 0.2 × 0.4 × 1.84) = 1.56915
y4 = 1.56915(1 − 0.2 × 0.6 × 1.56915) = 1.27368
y5 = 1.27368(1 − 0.2 × 0.8 × 1.27368) = 1.01412
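The recurrence is simple to program; a minimal Python version of the working above (Euler's method with $h = 0.2$ for $y' = -x y^2$, starting from $y = 2$ at $x = 0$):

```python
# Euler's method y_{n+1} = y_n(1 - 0.2 x_n y_n) with h = 0.2, starting from
# y = 2 at x = 0.
h = 0.2
x, y = 0.0, 2.0
for n in range(5):
    y = y * (1.0 - h * x * y)
    x += h
    print(n + 1, round(x, 1), round(y, 5))
# y1 = 2.0, y2 = 1.84, y3 = 1.56915, y4 = 1.27368, y5 = 1.01412
```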
$$y_{n+1} = y_n + \frac{0.1}{2}\left[-15 y_n + 5 y_{n-1}\right] = 0.25(y_n + y_{n-1})$$
The accuracy does vary, but the estimates decrease in magnitude as they should,
and do not change sign.
2. The second-order Adams-Bashforth method is
$$y_{n+1} = y_n + \frac{h}{2}(3 f_n - f_{n-1}) = y_n + 0.05(3x_n + 3y_n - x_{n-1} - y_{n-1})$$
and thus we obtain:
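A sketch of this recurrence in Python, assuming from the expansion above that $f(x, y) = x + y$ and $h = 0.1$; the two starting values $y_0$ and $y_1$ are obtained separately in the exercise and are therefore left as parameters here:

```python
# Second-order Adams-Bashforth for y' = x + y with step h = 0.1.
# y0 and y1 (the starting values, obtained separately) are supplied by the caller.
def adams_bashforth2(x0, y0, y1, h=0.1, steps=5):
    f = lambda x, y: x + y
    xs, ys = [x0, x0 + h], [y0, y1]
    for n in range(1, steps):
        ys.append(ys[n] + h / 2 * (3 * f(xs[n], ys[n]) - f(xs[n - 1], ys[n - 1])))
        xs.append(xs[n] + h)
    return xs, ys
```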
and
$$w_{2,n+1} = w_{2,n} + h(\sin x_n - w_{1,n} - w_{2,n}), \qquad w_{2,0} = 0$$
Some computations then yield the values given in the following table.
n xn w1,n w2,n
0 0.0 0 0
1 0.2 0.00000 0.03973
2 0.4 0.00795 0.10967
3 0.6 0.02988 0.19908
4 0.8 0.06970 0.29676
5 1.0 0.12905 0.39176
θ − sin θ cos θ = cπ
where c takes the values 0.1, 0.2, 0.3, and 0.4. Application of the bisection
method shows that the corresponding values of θ are given to 4D by 0.8134,
1.0566, 1.2454, and 1.4124. The values of h are then given by r (1 − cos θ).
Hence the calibration markings should be at 0.3130r , 0.5082r , 0.6803r , and
0.8423r .
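The roots can be found with a few lines of Python; a sketch of the bisection calculation described above:

```python
import math

# Bisection for g(theta) = theta - sin(theta)cos(theta) - c*pi on [0, pi/2],
# for c = 0.1, 0.2, 0.3, 0.4; the marking height is h = r(1 - cos(theta)).
def bisect(g, a, b, tol=1e-6):
    while b - a > tol:
        m = 0.5 * (a + b)
        if g(a) * g(m) <= 0:
            b = m
        else:
            a = m
    return 0.5 * (a + b)

for c in (0.1, 0.2, 0.3, 0.4):
    g = lambda t, c=c: t - math.sin(t) * math.cos(t) - c * math.pi
    theta = round(bisect(g, 0.0, math.pi / 2), 4)
    print(c, theta, round(1.0 - math.cos(theta), 4))
# theta = 0.8134, 1.0566, 1.2454, 1.4124; markings 0.3130r, 0.5082r, 0.6803r, 0.8423r
```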
2. Application of the bisection method shows that P(0) = 611.
3. Let x1 , x2 , x3 , and x4 be the number of kilocalories provided by 100 grams
of carbohydrates, proteins, fats, and alcohol, respectively. Then the given
information shows that we need to solve the linear system
$$\begin{bmatrix} 0.47 & 0.08 & 0.02 & 0 \\ 0 & 0.27 & 0.12 & 0 \\ 0.25 & 0.04 & 0.07 & 0 \\ 0 & 0 & 0 & 0.10 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 227 \\ 218 \\ 170 \\ 68 \end{bmatrix}$$
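This system can be solved directly; a minimal Python sketch:

```python
import numpy as np

# Solve the linear system above for x1, ..., x4, the kilocalories provided by
# 100 grams of carbohydrates, proteins, fats and alcohol respectively.
A = np.array([[0.47, 0.08, 0.02, 0.00],
              [0.00, 0.27, 0.12, 0.00],
              [0.25, 0.04, 0.07, 0.00],
              [0.00, 0.00, 0.00, 0.10]])
b = np.array([227.0, 218.0, 170.0, 68.0])

print(np.linalg.solve(A, b))
```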
x      f(x)      ∆      ∆²     ∆³
0.0    1.0000
                 −25
0.1    0.9975           −50
                 −75            1
0.2    0.9900           −49
                −124            1
0.3    0.9776           −48
                −172            1
0.4    0.9604           −47
                −219
0.5    0.9385
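The difference table can be generated mechanically; a minimal Python check (the differences are displayed in units of the fourth decimal place, as in the table above):

```python
import numpy as np

# Successive finite differences of the tabulated f(x) values.
f = np.array([1.0000, 0.9975, 0.9900, 0.9776, 0.9604, 0.9385])

d1 = np.diff(f)
d2 = np.diff(d1)
d3 = np.diff(d2)
print(np.round(d1 * 1e4))   # -25, -75, -124, -172, -219
print(np.round(d2 * 1e4))   # -50, -49, -48, -47
print(np.round(d3 * 1e4))   # 1, 1, 1
```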
BIBLIOGRAPHY
The following is a short list of books which may be referred to for complementary
reading, proofs omitted in this text, or further study in Numerical Analysis.
Calculus
G. B. Thomas and R. L. Finney (1992). Calculus and Analytic Geometry (8th
edn). Addison-Wesley, Reading, Mass.
Linear Algebra
H. Anton (1993). Elementary Linear Algebra (7th edn). Wiley, New York.
Numerical Analysis
K. E. Atkinson (1993). Elementary Numerical Analysis (2nd edn). Wiley, New
York.
R. L. Burden and J. D. Faires (1993). Numerical Analysis (5th edn). PWS-Kent,
Boston.
E. W. Cheney and D. R. Kincaid (1994). Numerical Mathematics and Computing
(3rd edn). Brooks/Cole, Belmont, Calif.
S. D. Conte and C. de Boor (1980). Elementary Numerical Analysis (3rd edn).
McGraw-Hill, New York.
C. F. Gerald and P. O. Wheatley (1994). Applied Numerical Analysis (5th edn).
Addison-Wesley, Reading, Mass.
J. H. Mathews (1992). Numerical Methods for Mathematics, Science, and Engi-
neering (2nd edn). Prentice-Hall, Englewood Cliffs, N.J.
Tables
M. Abramowitz and I. A. Stegun (1965). Handbook of Mathematical Functions.
Dover, New York.
INDEX
Tables, 24
  differences, 80, 85–86, 89, 97
Taylor series expansion, 18–19, 35, 37, 40, 99, 141, 144, 149–150
Three-point integration (Gauss), 147
Transcendental, 2, 23
  equations, 23–24
  functions, 2, 23
Transformation operations, 45
Transformations
  Householder, 126–127
  orthogonal, 126
Trapezoidal rule, 139–142, 170