
Contents

Preface 5

1 Errors and Error Propagation 7


1.1 Sources of Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Floating Point Numbers and Operations . . . . . . . . . . . . . . . . . . . . . 11
1.2.1 A Binary Computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.2 Standard floating point systems . . . . . . . . . . . . . . . . . . . . . . 13
1.2.3 Machine Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.4 Floating Point Operations . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Condition of a Mathematical Problem . . . . . . . . . . . . . . . . . . . . . . 16
1.4 Stability of a Numerical Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 20

2 Root Finding 27
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Four Algorithms for Root Finding . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 Bisection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2 Fixed Point Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.3 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.4 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.5 Stopping Criteria for Iterative Functions . . . . . . . . . . . . . . . . . 32
2.3 Rate of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 Convergence Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.1 Fixed Point Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.2 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4.3 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3 Numerical Linear Algebra 45


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.1 LU Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.2 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.3 Algorithm and Computational Cost . . . . . . . . . . . . . . . . . . . 51
3.2.4 Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55


3.3 Condition and Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59


3.3.1 The Matrix Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.2 Condition of the problem A~x = ~b . . . . . . . . . . . . . . . . . . . . . 60
3.3.3 Stability of the LU Decomposition Algorithm . . . . . . . . . . . . . . 63
3.4 Iterative Methods for solving A~x = ~b . . . . . . . . . . . . . . . . . . . . . . . 65
3.4.1 Jacobi and Gauss-Seidel Methods . . . . . . . . . . . . . . . . . . . . . 66
3.4.2 Convergence of Iterative Methods . . . . . . . . . . . . . . . . . . . . . 67

4 Discrete Fourier Methods 71


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Fourier Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.1 Real form of the Fourier Series . . . . . . . . . . . . . . . . . . . . . . 72
4.2.2 Complex form of the Fourier Series . . . . . . . . . . . . . . . . . . . . 75
4.3 Fourier Series and Orthogonal Basis . . . . . . . . . . . . . . . . . . . . . . . 77
4.4 Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.1 Aliasing and the Sample Theorem . . . . . . . . . . . . . . . . . . . . 84
4.5 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.6 DFT and Orthogonal Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7 Power Spectrum and Parseval’s Theorem . . . . . . . . . . . . . . . . . . . . 93

5 Interpolation 95
5.1 Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.1.1 The Vandermonde Matrix . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.1.2 Lagrange Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.1.3 Hermite Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Piecewise Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2.1 Piecewise Linear Interpolation . . . . . . . . . . . . . . . . . . . . . . 102
5.2.2 Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2.3 Further Generalizations . . . . . . . . . . . . . . . . . . . . . . . . . . 105

6 Integration 107
6.1 Integration of an Interpolating Polynomial . . . . . . . . . . . . . . . . . . . . 107
6.1.1 Midpoint Rule: y(x) degree 0 . . . . . . . . . . . . . . . . . . . . . . . 108
6.1.2 Trapezoid Rule: y(x) degree 1 . . . . . . . . . . . . . . . . . . . . . . 108
6.1.3 Simpson Rule: y(x) degree 2 . . . . . . . . . . . . . . . . . . . . . . . 109
6.1.4 Accuracy, Truncation Error and Degree of Precision . . . . . . . . . . 110
6.2 Composite Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.2.1 Composite Trapezoid Rule . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.2.2 Composite Simpson Rule . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.3 Gaussian Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

A Sample Midterm Exam 117

B Sample Final Exam 119


Preface

The goal of computational mathematics, put simply, is to find or develop algorithms that
solve mathematical problems computationally (i.e. using computers). In particular, we desire
that any algorithm we develop fulfills three primary properties:

• Accuracy. An accurate algorithm is able to return a result that is numerically very
  close to the correct, or analytical, result.

• Efficiency. An efficient algorithm is able to quickly solve the mathematical problem
  with reasonable computational resources.

• Robustness. A robust algorithm works for a wide variety of inputs x.

These notes have been funded by...

Chapter 1

Errors and Error Propagation

In examining computational mathematics problems, we shall generally consider a problem
of the following form:

Problem Consider an arbitrary problem P with input x. We must compute the desired
output z = fP (x).

In general our only tools for solving such problems are primitive mathematical operations
(for example, addition, subtraction, multiplication and division) combined with flow con-
structs (if statements and loops). As such, even simple problems such as evaluating the
exponential function may be difficult computationally.

Example 1.1 Consider the problem P defined by the evaluation of the exponential function
z = exp(x). We wish to find the approximation ẑ for z = exp(x) computationally.

Algorithm A. Recall from calculus that the Taylor series expansion of the exponential
function is given by

\[ \exp(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{i=0}^{\infty} \frac{x^i}{i!}. \tag{1.1} \]

Since we obviously cannot compute an infinite sum computationally without also using
infinite resources, consider the truncated series constructed by only summing the first n
terms. This yields the expansion
\[ \hat{z} = \sum_{i=0}^{n} \frac{x^i}{i!} \tag{1.2} \]

As we will later see, this is actually a poor algorithm for computing the exponential -
although the reason for this may not be immediately obvious.


1.1 Sources of Error


Computational error can originate from several sources, although the most prominent forms
of error (and the ones that we will worry about in this text) come from two main sources:

1. Errors introduced in the input x. In particular, this form of error can be generally
classified to be from one of two sources:

a) Measurement error, caused by a difference between the “exact” value x and the
“measured” value x̃. The measurement error is computed as ∆x = |x − x̃|.
b) Rounding error, caused by a difference between the “exact” value x and the com-
putational or floating-point representation x̂ = f l(x). Since infinite precision
cannot be achieved with finite resources, the computational representation is a
finite precision approximation of the exact value.

Consider, for example, the decimal number x = 0.00012345876543. In order to
standardize the representation of these numbers we perform normalization (such
that the number to the left of the decimal point is 0 and the first digit right of
the decimal point is nonzero). The number x is thus normalized as follows:

\[ x = 0.\underbrace{12345876543}_{\text{mantissa}} \times 10^{-3}. \tag{1.3} \]

However, a computer can only represent finite precision, so we are not guaran-
teed to retain all digits from the initial number. Let’s consider a hypothetical
“decimal” computer with at most 5 digits in the mantissa. The floating-point
representation of x obtained on this computer, after rounding, is given by

x̂ = fl(x) = 0.12346 × 10−3 . (1.4)

The process of conversion to a floating point number gives us a rounding error
equal to ∆x = x − x̂ = −0.00000000123457.

2. Errors as a result of the calculation, approximation or algorithm. This form of error


can again be generally classified to be from one of two sources:

a) Truncation error. When truncating an infinite series to provide a finite approx-
imation, the method inherently introduces an error. In Example 1.1 we first
considered truncating the infinite expansion as follows:

\[ z = \exp(x) = \sum_{i=0}^{\infty} \frac{x^i}{i!} = \sum_{i=0}^{n} \frac{x^i}{i!} + \sum_{i=n+1}^{\infty} \frac{x^i}{i!} = \hat{z} + \sum_{i=n+1}^{\infty} \frac{x^i}{i!} \tag{1.5} \]

In this case, the truncation error is given by

\[ T = z - \hat{z} = \sum_{i=n+1}^{\infty} \frac{x^i}{i!} \tag{1.6} \]

b) Rounding errors in elementary steps of the algorithm. For example, consider


the addition of 1.234 × 107 and 5.678 × 103 in a floating point number system
where we only maintain 4 digit accuracy (in base 10). The exact result should
be 1.2345678 × 107 but due to rounding we instead calculate the result to be
1.235 × 107 .

In analyzing sources of error, it is often useful to provide a mathematical basis for the
error analysis. In particular, there are two principal ways for providing a measure of the
generated error.

Definition 1.1 Consider an exact result z and an approximate result ẑ generated in a


specified floating point number system. Then the absolute error is given by

∆z = z − ẑ, (1.7)

and the relative error (assuming z ≠ 0) is given by

\[ \delta z = \frac{z - \hat{z}}{z}. \tag{1.8} \]
Of course, given the mathematical effort often involved in analyzing our algorithms
for error, there must be some justification for bothering with error analysis. Although
the errors may at first be perceived as negligible, some numerical algorithms are “nu-
merically unstable” in the way they propagate errors. In doing error analysis, we will often
run into one of the following situations in computational mathematics:

1. The algorithm may contain one or more “avoidable” steps that each greatly amplify
errors.
2. The algorithm may propagate initially small errors such that the error is amplified
without bound in many small steps.

Consider the following example of the first kind of generated error:

Example 1.2 Consider the problem P with input x defined by the evaluation of the expo-
nential function z = exp(x) (as considered in (1.1)). However, in performing the calculation,
assume the following conditions:

• Assume a “decimal” computer with 5 digits in the mantissa.


• Assume no measurement error or initial rounding error in x.

Consider solving this problem with input x = −5.5; we find a numerical approximation
to the exact value z = exp(−5.5) ≈ 0.0040868.
In solving this problem, we first apply Algorithm A, truncating the series after the first
25 terms. This yields the formula $\hat{z} = \sum_{i=0}^{24} \frac{x^i}{i!}$. Performing this calculation with our
floating point system yields the approximation ẑA = 0.0057563, which an observant reader
will notice has no significant digits in common with the exact solution. We conclude that
precision was lost in this calculation, and in fact notice that the relative error is
$\delta z_A = \frac{1}{z}(z - \hat{z}_A) = -0.41 = -41\%$. This is a substantial error!

Table I: Algorithm A applied to x = −5.5

i ith term in series ith truncated sum

0 1.0000000000000 1.0000000000000
1 -5.5000000000000 -4.5000000000000
2 15.1250000000000 10.6250000000000
3 -27.7300000000000 -17.1050000000000
4 38.1280000000000 21.0230000000000
5 -41.9400000000000 -20.9170000000000
6 38.4460000000000 17.5290000000000
7 -30.2060000000000 -12.6770000000000
8 20.7670000000000 8.0900000000000
9 -12.6910000000000 -4.6010000000000
10 6.9803000000000 2.3793000000000
11 -3.4900000000000 -1.1107000000000
12 1.5996000000000 0.4889000000000
13 -0.6767600000000 -0.1878600000000
14 0.2658700000000 0.0780100000000
15 -0.0974840000000 -0.0194740000000
16 0.0335100000000 0.0140360000000
17 -0.0108420000000 0.0031940000000
18 0.0033127000000 0.0065067000000
19 -0.0009589000000 0.0055478000000
20 0.0002637100000 0.0058115000000
21 -0.0000690670000 0.0057424000000
22 0.0000172670000 0.0057597000000
23 -0.0000041289000 0.0057556000000
24 0.0000009462300 0.0057565000000
25 -0.0000002081700 0.0057563000000
26 0.0000000440350 0.0057563000000

So why does this instability occur?

Our answer to this question can be found in a related problem: Consider the subtraction
of two numbers that are almost equal to one another.

Let x1 = 0.100134826 with floating-point representation fl(x1) = 0.10013. The relative
error in performing this approximation is $\delta x_1 = \frac{1}{x_1}(x_1 - fl(x_1)) = 4.8 \times 10^{-5} = 0.0048\%$.

Let x2 = 0.100121111 with floating-point representation fl(x2) = 0.10012. The relative
error in performing this approximation is $\delta x_2 = \frac{1}{x_2}(x_2 - fl(x_2)) = 1.1 \times 10^{-5} = 0.0011\%$.

So, in general, the approximation of these two numbers to their floating-point equivalents
produces relatively small errors. Now consider the subtraction z = x1 − x2. The exact solution
to this is z = 0.000013715 and the computed solution using the floating-point representations
is ẑ = fl(x1) − fl(x2) = 0.00001. The relative error in this case is $\delta\hat{z} = \frac{1}{z}(z - \hat{z}) \approx 27\%$. So
what was an almost negligible error when performing rounding becomes a substantial error
after we have completed the subtraction.

Thus, if possible, we will need to avoid these kinds of subtractions when we are developing
our algorithms. Looking back at the additions we performed in calculating exp(−5.5) (see
Table I), we see that there were several subtractions performed of numbers of similar magni-
tude. Fortunately, we have an easy way around this if we simply take advantage of a property
of the exponential, namely exp(−x) = (exp(x))−1 . This provides us with Algorithm B:

Algorithm B. Applying the truncated Taylor expansion for exp(5.5), we get the following
formula for exp(−5.5):

\[ \exp(x) = \left( \sum_{i=0}^{24} \frac{(-x)^i}{i!} \right)^{-1} \tag{1.9} \]

This yields ẑB = 0.0040865, which matches 4 out of 5 digits of the exact value.
We conclude that Algorithm B is numerically stable (largely since it avoids cancellation).
Similarly, Algorithm A is unstable with respect to relative error for the input x = −5.5.
However, it is also worth noting that for any positive value of the input we are better off
using Algorithm A - Algorithm B would end up with the same cancellation issues we had
originally with Algorithm A.
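
The effect is easy to reproduce experimentally. The following Python sketch (an added illustration, not part of the original notes) simulates the 5-digit decimal mantissa by rounding every intermediate result to 5 significant figures; the helper round_sig and both routines are hypothetical, and the exact digits printed depend on how the rounding is simulated, but the qualitative contrast between the two algorithms matches the discussion above.

Python sketch:

from math import exp, floor, log10

def round_sig(x, digits=5):
    # Round x to 'digits' significant decimal figures, mimicking a
    # hypothetical base-10 machine with a 5-digit mantissa.
    if x == 0.0:
        return 0.0
    return round(x, digits - 1 - floor(log10(abs(x))))

def exp_taylor_A(x, n=25, digits=5):
    # Algorithm A: truncated Taylor series, rounding each term and partial sum.
    term, total = 1.0, 1.0
    for i in range(1, n):
        term = round_sig(term * x / i, digits)    # next term x^i / i!
        total = round_sig(total + term, digits)   # running sum
    return total

def exp_taylor_B(x, n=25, digits=5):
    # Algorithm B: for x < 0, sum the series for exp(-x) and take the reciprocal.
    return round_sig(1.0 / exp_taylor_A(-x, n, digits), digits)

x = -5.5
z = exp(x)
zA, zB = exp_taylor_A(x), exp_taylor_B(x)
print(f"exact        {z:.7f}")
print(f"Algorithm A  {zA:.7f}   relative error {(z - zA) / z:+.0%}")
print(f"Algorithm B  {zB:.7f}   relative error {(z - zB) / z:+.2%}")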

1.2 Floating Point Numbers and Operations


Floating point numbers are the standard tool for approximating real numbers on a com-
puter. Unlike real numbers, floating point numbers only provide finite precision - effectively
approximating real numbers while still attempting to provide full functionality. Consider
the following definition of a floating point number system:

Definition 1.2 A floating point number system is defined by three components:



• The base, which defines the base of the number system being used in the representation.
This is specified as a positive integer $b_f$.

• The mantissa, which contains the normalized value of the number being represented.
Its maximal size is specified as a positive integer $m_f$, which represents the number of
digits allowed in the mantissa.

• The exponent, which effectively defines the offset from normalization. Its maximal
size is specified by a positive integer $e_f$, which represents the number of digits allowed
in the exponent.

In shorthand, we write $F[b = b_f, m = m_f, e = e_f]$.

Combined, the three components allow us to approximate any real number in the fol-
lowing form:

\[ 0.\underbrace{x_1 x_2 \cdots x_m}_{\text{mantissa}} \times b^{\,y_1 y_2 \cdots y_e}, \tag{1.10} \]

where $b$ is the base and $y_1 y_2 \cdots y_e$ is the exponent.

Example 1.3 Consider a “decimal” computer with a floating point number system defined
by base 10, 5 digits in the mantissa and 3 digits in the exponent. In shorthand, we write the
system as F [b = 10, m = 5, e = 3].
Consider the representation of the number x = 0.000123458765 under this system. To
find the representation, we first normalize the value and then perform rounding so both the
mantissa and exponent have the correct number of digits:

input: x = 0.000123458765
normalize: x = 0.123458765 × 10^{−3}
round: x̂ = fl(x) = 0.12346 × 10^{−003}

Under this system, our mantissa is bounded by 99999 and the exponent is bounded by
999 (each representing the largest numbers we can display under this base in 5 and 3 digits,
respectively). The largest number we can represent under this system is 0.99999 × 10^{999} and
the smallest positive number we can represent is 0.00001 × 10^{−999}.

Note: Instead of rounding in the final step, we can also consider “chopping”. With chop-
ping, we simply drop all digits that cannot be represented under our system. Thus, in our
example we would get x̂ = fl(x) = 0.12345 × 10^{−003} since all other digits simply would be
dropped.

1.2.1 A Binary Computer


Instead of working in decimal (base 10), almost all computers work in binary (base 2). We
normally write binary numbers as (x)_b to indicate that they are represented in binary.

Example 1.4 Consider the binary number x = (1101.011)_b. Similar to decimal, this nota-
tion is equivalent to writing

\[ x = 1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 + 0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-3} = 8 + 4 + 1 + 0.25 + 0.125 = 13.375 \]

Floating-point number conversions from binary work exactly the same way as decimal
conversions, except for the new base:

Example 1.5 Consider the binary number x = (1101.011)_b under the floating-point system
F[b = 2, m = 4, e = 3].

input: x = (1101.011)_b
normalize: x = (0.1101011)_b × 2^4 = (0.1101011)_b × 2^{(100)_b}
round: x̂ = fl(x) = (0.1101)_b × 2^{(100)_b} = (0.1101)_b × (10)_b^{(100)_b}
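
For concreteness, the digit-by-digit evaluation used in Examples 1.4 and 1.5 can be written as a few lines of Python (an added illustration; the helper binary_to_decimal is hypothetical and not part of the notes):

Python sketch:

def binary_to_decimal(bits):
    # Evaluate a binary literal such as '1101.011' as a sum of powers of 2
    # (cf. Example 1.4).
    int_part, _, frac_part = bits.partition('.')
    value = int(int_part, 2) if int_part else 0
    for k, bit in enumerate(frac_part, start=1):
        value += int(bit) * 2.0 ** (-k)
    return value

print(binary_to_decimal('1101.011'))   # 13.375, as in Example 1.4
# Keeping only m = 4 mantissa digits as in Example 1.5 stores (0.1101)_b x 2^4:
print(binary_to_decimal('1101'))       # 13, the value of fl(x) in Example 1.5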

1.2.2 Standard floating point systems


Single Precision Numbers. Single precision is one of the standard floating point number
systems, where numbers are represented on a computer in a 32-bit chunk of memory (4
bytes). Each number in this system is divided as follows:

| s_m | m = 23 bits | s_e | e = 7 bits |

Here s_m is the sign bit of the mantissa and s_e is the sign bit for the exponent.

Recall that we normally write floating point numbers in the form given by (1.10). The
typical convention for sign bits is to use 0 to represent positive numbers and 1 to represent
negative numbers.

Example 1.6 Consider the decimal number x = −0.3125. We note that we may write this
in binary as

\[ x = -0.3125 = -(0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 0 \cdot 2^{-3} + 1 \cdot 2^{-4}) = -(0.0101)_b = -(0.101)_b \times 2^{-1} = -(0.101)_b \times 2^{-(1)_b} \]

In the single precision floating point system F[b = 2, m = 23, e = 7] we may write
x̂ = fl(x) = |1|10100000000000000000000|1|0000001|.

Under the single-precision floating point number system, the largest and smallest num-
bers that can be represented are given as follows (without consideration for normalization,
in the case of the smallest number).

\[ \hat{x}_{\max} = |0|11111111111111111111111|0|1111111| = (1 - 2^{-23}) \cdot 2^{127} \approx 2^{127} \approx 1.7 \times 10^{38} \]

\[ \hat{x}_{\min} = |0|00000000000000000000001|1|1111111| = 2^{-23} \cdot 2^{-127} = 2^{-150} \approx 7.0 \times 10^{-46} \]

Note that there are 512 ways to represent zero under the single precision system proposed
here (we only require that the mantissa is zero, meaning that the signed bits and exponent
can be arbitrary). Under the IEEE standard (which is used by real computers) there are
some additional optimizations that take advantage of this “lost” space:

1. use of a signed integer for the exponent


2. only allow for one way to represent zero and use the free space to represent some
smaller numbers
3. do not store the first 1 in the mantissa (because it is always 1 for a normalized number)

Double Precision Numbers. Double precision is the other standard floating point num-
ber system. Numbers are represented on a computer in a 64-bit chunk of memory (8 bytes),
divided as follows:

| s_m | m = 52 bits | s_e | e = 10 bits |

In this case, the maximum and minimum numbers that can be represented are x_max ≈
9 × 10^{307} and x_min ≈ 2.5 × 10^{−324}, respectively.

1.2.3 Machine Precision


Recall that the relative error (defined by equation (1.8)) when converting a real number x
to a floating point number is

\[ \delta x = \frac{x - fl(x)}{x}. \tag{1.11} \]
This motivates the question: Is there a bound on the absolute value of the relative error |δx|
for a given floating point number system?

An answer to this requires some delving into the way in which floating point numbers
are represented. Consider the floating point number given by

\[ \pm 0.x_1 x_2 \cdots x_m \times b^{\pm t} \tag{1.12} \]

with digits 0 ≤ x_i < b for i = 1, . . . , m and x_1 ≠ 0 (normalization). We then define the machine epsilon:

Definition 1.3 The machine epsilon ε_mach is the smallest number ε > 0 such that fl(1 +
ε) > 1.

We then present the following proposition:

Proposition 1.1 The machine epsilon is given by

a) ε_mach = b^{1−m} if chopping is used

b) ε_mach = ½ b^{1−m} if rounding is used

Proof. For simplicity we only prove part (a) of the proposition. If rounding is used a more
complex analysis will yield the result of part (b).
Consider the following subtraction:

\[
\begin{aligned}
1 + \epsilon &= 0.\,\underset{b^{-1}}{1}\ \underset{b^{-2}}{0}\ \underset{b^{-3}}{0}\ \cdots\ \underset{b^{-(m-1)}}{0}\ \underset{b^{-m}}{1}\ \times b^{1} \\
1 &= 0.\,1\ 0\ 0\ \cdots\ 0\ 0\ \times b^{1} \\
\epsilon &= 0.\,0\ 0\ 0\ \cdots\ 0\ \underset{b^{-m}}{1}\ \times b^{1} = b^{-m} \cdot b^{1} = b^{1-m}
\end{aligned}
\]

Theorem 1.1 For any floating point system F under chopping,

\[ |\delta x| = \left| \frac{x - fl(x)}{x} \right| \le \epsilon_{\text{mach}}. \tag{1.13} \]

Proof. Consider the following calculation:

\[ x = 0.d_1 d_2 \cdots d_m d_{m+1} d_{m+2} \cdots \times b^t, \qquad fl(x) = 0.d_1 d_2 \cdots d_m \times b^t \]

Thus combining these two equations yields

\[ \frac{x - fl(x)}{x} = \frac{0.00\cdots 0\, d_{m+1} d_{m+2} \cdots}{0.d_1 d_2 \cdots d_m d_{m+1} d_{m+2} \cdots} = \frac{0.d_{m+1} d_{m+2} \cdots}{0.d_1 d_2 \cdots d_m d_{m+1} d_{m+2} \cdots} \cdot b^{-m}. \]

But we know the numerator is less than or equal to 1 (1 = 1 · b^0) and the denominator is
greater than or equal to 0.1 (0.1 = 1 · b^{−1}). So we may write:

\[ (x - fl(x)) \cdot x^{-1} \le 1 \cdot (b^{-1})^{-1} \cdot b^{-m} = b^{-m+1} = \epsilon_{\text{mach}} \]

Note. Since $\delta x = \frac{1}{x}(x - fl(x))$, we may also write fl(x) = x(1 − δx) with |δx| ≤ ε_mach.
Hence we often say

\[ fl(x) = x(1 + \eta) \quad \text{with } |\eta| \le \epsilon_{\text{mach}}. \tag{1.14} \]

Single Precision. Under single precision, m = 23 and so ε = 2^{−22} ≈ 0.24 × 10^{−6}. Thus
|δx| ≤ 0.24 × 10^{−6}, and so we expect 6 to 7 decimal digits of accuracy.

Double Precision. Under double precision, m = 52 and so ε = 2^{−51} ≈ 0.44 × 10^{−15}.
Thus |δx| ≤ 0.44 × 10^{−15}, and so we expect 15 to 16 decimal digits of accuracy.
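
The machine epsilon of the system a program actually runs on can be found by direct search. The sketch below (an added illustration) halves a candidate ε until fl(1 + ε) is no longer larger than 1; on an IEEE-754 double it reports 2^{−52} ≈ 2.2 × 10^{−16}, the rounding-mode counterpart of the chopping-based estimate quoted above (the exact value depends on the mantissa and rounding conventions assumed).

Python sketch:

import sys

def machine_epsilon():
    # Smallest power of two eps for which fl(1 + eps) > 1 on this machine.
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps

print(machine_epsilon())        # 2.220446049250313e-16 on an IEEE-754 double
print(sys.float_info.epsilon)   # the same quantity as reported by the runtime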

1.2.4 Floating Point Operations


Definition 1.4 The symbol ⊕ is used to denote floating point addition, defined as

a ⊕ b = fl(fl(a) + fl(b)) (1.15)

Proposition 1.2 For any floating point number system F,

\[ a \oplus b = (fl(a) + fl(b))(1 + \eta) \tag{1.16} \]

with |η| ≤ ε_mach. This may also be written as

\[ a \oplus b = (a(1 + \eta_1) + b(1 + \eta_2))(1 + \eta) \tag{1.17} \]

with |η_1|, |η_2|, and |η| ≤ ε_mach. In general the operation of addition under F is not associative.
That is,

\[ (a \oplus b) \oplus c \ne a \oplus (b \oplus c). \tag{1.18} \]

Note: There are analogous operations for subtraction, multiplication and division, written
using the symbols ⊖, ⊗, and ⊘.
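
The non-associativity in (1.18) is easy to observe directly. In the short example below (an added illustration, using ordinary double-precision arithmetic), the 1.0 is absorbed when it is added to a number as large as 10^20, so the two groupings give different answers:

Python sketch:

a, b, c = 1.0, 1.0e20, -1.0e20

left  = (a + b) + c   # the 1.0 is absorbed into 1e20 first, then cancelled: 0.0
right = a + (b + c)   # the huge terms cancel exactly first, so the 1.0 survives: 1.0
print(left, right)    # 0.0 1.0, so (a + b) + c != a + (b + c) in floating point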

1.3 Condition of a Mathematical Problem


Consider a problem P with input x̃ which requires the computation of a desired output
z = fP (x̃). In general, some mathematical problems of this form are very sensitive to small
deviations in the input data, in which case there are a variety of problems (such as rounding
in input data) which make accurate approximation on a computer difficult.

Definition 1.5 We say that a problem P is well-conditioned with respect to the absolute
error if small changes $\Delta\vec{x}$ in $\vec{x}$ result in small changes $\Delta\vec{z}$ in $\vec{z}$. Similarly, we say P is
ill-conditioned with respect to the absolute error if small changes $\Delta\vec{x}$ in $\vec{x}$ result in large
changes $\Delta\vec{z}$ in $\vec{z}$.

Definition 1.6 The condition number of a problem P with respect to the absolute error is
given by the absolute condition number κ_A:

\[ \kappa_A = \|\Delta\vec{z}\| / \|\Delta\vec{x}\|. \tag{1.19} \]

The condition number with respect to the relative error is given by the relative condition
number κ_R:

\[ \kappa_R = \frac{\|\Delta\vec{z}\| / \|\vec{z}\|}{\|\Delta\vec{x}\| / \|\vec{x}\|} \tag{1.20} \]

If κA and κR are “small” we can generally infer that P is well-conditioned. As a guideline,


if κA and κR are between 0.1 and 10 we can consider them to be “small.” Similarly, if κA
and κR are “large” (tend to ∞ or “blow up”) then we can say that P is ill-conditioned.

Example 1. Consider the mathematical problem P defined by z = x + y. We wish


to examine how the errors in x and y propagate to z. We define our approximations by
∆x = x − x̂ or x̂ = x − ∆x, where ∆x is the error in x. Similarly, we have ∆y = y − ŷ
or ŷ = y − ∆y, where ∆y is the error in y. In solving the problem, we are given the
approximations x̂ and ŷ and compute the approximation ẑ:

ẑ = x̂ + ŷ = (x + y) − (∆x + ∆y).

and so we define our error by

∆z = z − ẑ = ∆x + ∆y. (1.25)

a) Condition with respect to the absolute error

We attempt to find an upper bound for κ_A. Using equation (1.25) and the 1-norm
(1.23), we can write

\[ \kappa_A = \frac{|\Delta z|}{\|(\Delta x, \Delta y)\|_1} = \frac{|\Delta x + \Delta y|}{|\Delta x| + |\Delta y|}. \]

But by the triangle inequality we have |∆x + ∆y| ≤ |∆x| + |∆y|, and so we obtain

\[ \kappa_A \le \frac{|\Delta x| + |\Delta y|}{|\Delta x| + |\Delta y|} = 1. \tag{1.26} \]

We conclude that P is well-conditioned with respect to the absolute error.



Intermission: Vector Norms


Vector norms are a useful tool for providing a measure of the magnitude of a vector, and
are particularly applicable to derivations for the condition number.

Definition 1.7 Suppose V is a vector space over $\mathbb{R}^n$. Then $\|\cdot\|$ is a vector norm on V if
and only if $\|\vec{v}\| \ge 0$, and

a) $\|\vec{v}\| = 0$ if and only if $\vec{v} = \vec{0}$
b) $\|\lambda\vec{v}\| = |\lambda|\,\|\vec{v}\|$ for all $\vec{v} \in V$ and all $\lambda \in \mathbb{R}$
c) $\|\vec{u} + \vec{v}\| \le \|\vec{u}\| + \|\vec{v}\|$ for all $\vec{u}, \vec{v} \in V$ (triangle inequality)

There are three standard vector norms known as the 2-norm, the ∞-norm and the 1-norm.

Definition 1.8 The 2-norm over $\mathbb{R}^n$ is defined as

\[ \|\vec{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} \tag{1.21} \]

Definition 1.9 The ∞-norm over $\mathbb{R}^n$ is defined as

\[ \|\vec{x}\|_\infty = \max_{1 \le i \le n} |x_i| \tag{1.22} \]

Definition 1.10 The 1-norm over $\mathbb{R}^n$ is defined as

\[ \|\vec{x}\|_1 = \sum_{i=1}^{n} |x_i| \tag{1.23} \]

Further, the 2-norm is induced by an inner product and satisfies the Cauchy-Schwarz
Inequality below; the same bound also holds for the 1-norm, but not for the ∞-norm.
This inequality will be of use later:

Theorem 1.2 Cauchy-Schwarz Inequality. Let $\|\cdot\|$ be a vector norm over a vector space
V induced by an inner product. Then

\[ |\vec{x} \cdot \vec{y}| \le \|\vec{x}\|\,\|\vec{y}\| \tag{1.24} \]



b) Condition with respect to the relative error


We attempt to find an upper bound for κ_R. Consider the relative error in z, given by

\[ |\delta z| = \frac{|\Delta z|}{|z|} = \frac{|\Delta x + \Delta y|}{|x + y|}. \]

We can clearly see here that if x ≈ −y then |δz| can be very large, even though |∆x|/|x|
and |∆y|/|y| may not be large. The relative condition number is thus

\[ \kappa_R = \frac{|\Delta z|/|z|}{\|\Delta\vec{x}\|_1 / \|\vec{x}\|_1} = \frac{|\Delta x + \Delta y| / |x + y|}{(|\Delta x| + |\Delta y|) / (|x| + |y|)} \le \frac{|x| + |y|}{|x + y|}. \tag{1.27} \]

Thus we see that κ_R can grow arbitrarily large when x ≈ −y.

We conclude that the problem z = x + y is ill-conditioned with respect to the relative


error only in the case x ≈ −y.
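
A quick numerical check illustrates the amplification (an added sketch; the particular values are chosen only for illustration): with x ≈ −y, a relative perturbation of about 10^{-9} in x produces a relative change of about 10^{-2} in z, and the ratio of the two is close to the bound (|x| + |y|)/|x + y| from (1.27).

Python sketch:

x, y   = 1.0000001, -1.0         # x is nearly -y, so z = x + y is tiny
dx, dy = 1.0e-9, 0.0             # a small absolute error in x only

z     = x + y
z_hat = (x - dx) + (y - dy)

rel_in  = abs(dx) / abs(x)                  # relative input error, ~1e-9
rel_out = abs(z - z_hat) / abs(z)           # relative output error, ~1e-2
print(rel_in, rel_out, rel_out / rel_in)    # the ratio is roughly 1e7
print((abs(x) + abs(y)) / abs(x + y))       # the bound from (1.27), roughly 2e7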

Example 2. Consider the mathematical problem P defined by z = x · y. We define our


approximations by ∆x = x − x̂ and ∆y = y − ŷ as in Example 1. In solving the problem,
we are given the approximations x̂ and ŷ and compute the approximation ẑ:

\[ \hat{z} = \hat{x} \cdot \hat{y} = (x - \Delta x)(y - \Delta y) = xy - y\Delta x - x\Delta y + \overbrace{\Delta x\,\Delta y}^{\text{neglect}} \approx xy - y\Delta x - x\Delta y \]

and so we define our error by

\[ \Delta z \approx y\Delta x + x\Delta y = (x, y) \cdot (\Delta y, \Delta x). \tag{1.28} \]

The relative error in z is given by

\[ \delta z \approx \frac{1}{xy}(y\Delta x + x\Delta y) = \frac{\Delta x}{x} + \frac{\Delta y}{y}. \tag{1.29} \]
a) Condition with respect to the absolute error

We attempt to find an upper bound for κ_A. Using equation (1.28) and the Cauchy-
Schwarz Inequality (1.24), we can write

\[ \kappa_A = \frac{|\Delta z|}{\|(\Delta x, \Delta y)\|_2} \le \frac{\|(x, y)\|_2\,\|(\Delta x, \Delta y)\|_2}{\|(\Delta x, \Delta y)\|_2} = \|(x, y)\|_2. \]

We conclude that P is well-conditioned with respect to the absolute error, except when
x or y are large.

b) Condition with respect to the relative error


From (1.29) we have that
\[ \delta z \approx \frac{\Delta x}{x} + \frac{\Delta y}{y} = \delta x + \delta y \]

and so can immediately conclude that P is well-conditioned with respect to the relative
error. In fact, in this particular case

\[ \kappa_R = \frac{|\Delta z|/|z|}{\|\Delta\vec{x}\| / \|\vec{x}\|} \]

does not easily yield a useful bound.

1.4 Stability of a Numerical Algorithm


Consider a problem P with input x which requires the computation of a desired output
z = f_P(x). If we assume that P is well-conditioned, then by definition we have that

\[ \|\Delta\vec{x}\| \text{ small} \;\Rightarrow\; |\Delta z| \text{ small}, \qquad \|\Delta\vec{x}\| / \|\vec{x}\| \text{ small} \;\Rightarrow\; |\Delta z / z| \text{ small}. \]


For this well-conditioned problem, some algorithms may be numerically unstable, i.e.
they may produce large errors |∆z| or |δz|, while other algorithms may be stable.

Example 1.9 Stability with respect to the relative error. Consider the problem P defined
by z = exp(x) with x = 5.5. The approximate values for x and z are denoted x̂ and ẑ
respectively. They are related by the following formula:

ẑ = exp(x̂) = exp(x − ∆x) (1.31)

A) We investigate the condition of P as follows:

a) With respect to ∆ we have

\[ \kappa_A = \frac{|\Delta z|}{|\Delta x|} = \frac{|z - \hat{z}|}{|x - \hat{x}|} = \frac{|\exp(x) - \exp(x - \Delta x)|}{|x - \hat{x}|}. \tag{1.32} \]

We apply the Taylor series expansion given by

\[ \exp(x - \Delta x) = \exp(x) - \exp(x)\Delta x + O(\Delta x^2) \tag{1.33} \]

to yield

\[ \kappa_A = |\exp(x) - (\exp(x) - \exp(x)\Delta x + O(\Delta x^2))| \,/\, |x - \hat{x}| \approx |\exp(x)\Delta x| / |\Delta x| = |\exp(x)|, \]

for small |∆x|. We conclude that the problem is well conditioned with respect to
∆, except for large x.

Intermission:
Asymptotic Behaviour of Polynomials
We wish to consider a general method of analyzing the behaviour of polynomials in two cases:
when x → 0 and when x → ∞. In particular, if we are only interested in the asymptotic
behaviour of the polynomial as opposed to the exact value of the function, we may employ
the concept of Big-Oh notation.

Definition 1.11 Suppose that f(x) is a polynomial in x without a constant term. Then the
following are equivalent:
a) f(x) = O(x^n) as x → 0.
b) ∃ c > 0, x_0 > 0 such that |f(x)| < c|x|^n for all x with |x| < |x_0|.
c) f(x) is bounded from above by |x|^n, up to a constant c, as x → 0.

This effectively means that the dominant term in f(x) is the term with x^n as x → 0, or
f(x) goes to zero with order n.

Example 1.7 Consider the polynomial g(x) = 3x^2 + 7x^3 + 10x^4 + 7x^{12}. We say

g(x) = O(x^2) as x → 0
g(x) ≠ O(x^3) as x → 0
g(x) = 3x^2 + O(x^3) as x → 0

We note that g(x) = O(x) as well, but this statement is not so useful because it is not a
sharp bound.

We may also consider the behaviour of a polynomial as x → ∞.

Definition 1.12 Suppose that f(x) is a polynomial in x. Then the following are equivalent:
a) f(x) = O(x^n) as x → ∞.
b) ∃ c > 0, x_0 > 0 such that |f(x)| < c|x|^n for all x with |x| > |x_0|.
c) f(x) is bounded from above by |x|^n, up to a constant c, as x → ∞.

As before, this effectively means that the dominant term in f(x) is the term with x^n as
x → ∞, or f(x) goes to infinity with order n.

Example 1.8 Consider the polynomial g(x) = 3x^2 + 7x^3 + 10x^4 + 7x^{12}. We say

g(x) = O(x^{12}) as x → ∞
g(x) ≠ O(x^8) as x → ∞
g(x) = 7x^{12} + O(x^4) as x → ∞

Addition and Multiplication of Terms Involving Big-Oh


We can also add and multiply terms using Big-Oh notation, making sure to neglect higher
order terms:

f(x) = x + O(x^2) as x → 0
g(x) = 2x + O(x^3) as x → 0

f(x) + g(x) = 3x + O(x^2) + O(x^3) as x → 0
            = 3x + O(x^2) as x → 0

f(x) · g(x) = 2x^2 + O(x^3) + O(x^4) + O(x^5) as x → 0
            = 2x^2 + O(x^3) as x → 0

Applications to the Taylor Series Expansion


Recall that the Taylor series expansion for a function f(x) around a point x_0 is given by

\[ f(x_0 + \Delta x) = \sum_{n=0}^{\infty} \frac{1}{n!} f^{(n)}(x_0) (\Delta x)^n \tag{1.30} \]

We may expand this formula and write it in terms of Big-Oh notation as follows:

\[ f(x_0 + \Delta x) = f(x_0) + f'(x_0)\Delta x + \tfrac{1}{2} f''(x_0)\Delta x^2 + \tfrac{1}{6} f'''(x_0)\Delta x^3 + O(\Delta x^4) \quad \text{as } \Delta x \to 0 \]

b) With respect to δ we have

\[ \kappa_R = \frac{|\Delta z| / |z|}{|\Delta x| / |x|} = \left| \frac{\Delta z}{\Delta x} \cdot \frac{x}{z} \right| \approx |\exp(x)| \cdot \frac{|x|}{|\exp(x)|} = |x|. \tag{1.34} \]

We conclude that the problem is well conditioned with respect to δ, except for
large x.
B) Now that we know that problem P is well conditioned, we can investigate the stability
of our algorithms for P.
Recall Algorithm A, given by (1.2). We determined that this algorithm is unstable
with respect to δ for x < 0. This was due to a cancellation of nearly equal values in
the subtraction, and so the algorithm was unstable due to ill-conditioned steps that
could be avoided.
Recall Algorithm B, given by (1.9). As opposed to Algorithm A, this algorithm is
stable with respect to δ for x < 0.

Example 1.10 Instability with respect to absolute error ∆. Consider the problem P defined
by $z = \int_0^1 \frac{x^n}{x + \alpha}\,dx$ where α > 0. The approximate values for α and z are denoted α̂ and ẑ
respectively. It can be shown that P is well-conditioned (for instance with respect to the
integration boundaries 0 and 1).
We derive an algorithm for solving this problem using a recurrence relation. In deriving
a recurrence, we need to consider the base case (a) and the recursive case (b) and then
ensure that they are appropriately related.
a) Consider n = 0:

\[ I_0 = \int_0^1 \frac{1}{x + \alpha}\,dx = \Big[\log(x + \alpha)\Big]_0^1 = \log(1 + \alpha) - \log(\alpha) \]

and so we get

\[ I_0 = \log\!\left(\frac{1 + \alpha}{\alpha}\right). \tag{1.35} \]
b) For general n we can derive the following recurrence:

\[ I_n = \int_0^1 \frac{x^{n-1}\,x}{x + \alpha}\,dx = \int_0^1 \frac{x^{n-1}(x + \alpha - \alpha)}{x + \alpha}\,dx = \int_0^1 x^{n-1}\,dx - \alpha \int_0^1 \frac{x^{n-1}}{x + \alpha}\,dx = \left[\frac{x^n}{n}\right]_0^1 - \alpha I_{n-1} \]

which yields the expression

\[ I_n = \frac{1}{n} - \alpha I_{n-1}. \tag{1.36} \]
Thus we may formulate an algorithm using the base case and the recurrence:

Algorithm A.
1. Calculate I0 from (1.35).
2. Calculate I1 , I2 , . . . , In using (1.36).
An implementation of this algorithm on a computer provides the following results, for
two sample values of α:

α = 0.5 −→ I_100 ≈ 6.64 × 10^{−3}
α = 2.0 −→ I_100 ≈ 2.1 × 10^{22}

For α = 2.0, we obtain a very large result! This may be a first indication that something
is wrong. From the original equation we have that

\[ I_{100} = \int_0^1 \frac{x^{100}}{x + \alpha}\,dx \le \frac{1}{1 + \alpha} \cdot 1 = 1/3. \tag{1.37} \]

Hence we conclude something is definitely amiss. We need to consider the propagation of
the error in our algorithm, since we know that the algorithm is definitely correct using exact
arithmetic.
Consider I_n to be the exact value at each step of the algorithm, with Î_n the numerical
approximation. Then the absolute error at each step of the algorithm is given by ∆I_n =
I_n − Î_n. Thus the initial error is ∆I_0 = I_0 − Î_0 = I_0 − fl(I_0).
The exact value satisfies

\[ I_n = \frac{1}{n} - \alpha I_{n-1} \tag{1.38} \]

and the numerical approximation is calculated from

\[ \hat{I}_n = \frac{1}{n} - \alpha \hat{I}_{n-1} + \eta_n, \tag{1.39} \]
where η_n is the rounding error we introduce in step n.
For a first analysis, we neglect η_n and simply investigate the propagation of the initial
error ∆I_0 only. Then

\[ \Delta I_n = I_n - \hat{I}_n = \left(\frac{1}{n} - \alpha I_{n-1}\right) - \left(\frac{1}{n} - \alpha \hat{I}_{n-1}\right) = -\alpha (I_{n-1} - \hat{I}_{n-1}) = -\alpha\,\Delta I_{n-1}. \]

Applying recursion yields an expression for the accumulated error after n steps:

\[ \Delta I_n = (-\alpha)^n\,\Delta I_0. \tag{1.40} \]

Conclusion. From this expression, we note that there are potentially two very different
outcomes:
a) If α > 1 then the initial error is propagated in such a way that blow-up occurs. Thus
Algorithm A is numerically unstable with respect to ∆.

b) If α ≤ 1 then the initial error remains constant or dies out. Thus Algorithm A is
numerically stable with respect to ∆.
We note that further analysis would lead to the conclusion that the rounding error ηn
is also propagated in the same way as the initial error ∆I0 . These results confirm the
experimental results observed in the implementation.
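
The blow-up is easy to reproduce. The sketch below (an added illustration) runs the recurrence in standard double precision; the size of the α = 2.0 result depends on the precision used, so it need not reproduce the figure quoted earlier, but it is still astronomically larger than the true value, which we know is at most 1/3:

Python sketch:

from math import log

def integral_recurrence(alpha, n):
    # Algorithm A: I_0 = log((1 + alpha)/alpha), then I_k = 1/k - alpha * I_{k-1}.
    I = log((1.0 + alpha) / alpha)
    for k in range(1, n + 1):
        I = 1.0 / k - alpha * I
    return I

for alpha in (0.5, 2.0):
    print(alpha, integral_recurrence(alpha, 100))
# alpha = 0.5: about 6.6e-3, close to the true integral (the error decays like 0.5^n)
# alpha = 2.0: an enormous value, since the initial error is amplified like 2^100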

These notes have been funded by...


Chapter 2

Root Finding

2.1 Introduction
The root finding problem is a classical example of numerical methods in practice. The
problem is stated as follows:

Problem Given any function f (x), find x∗ such that f (x∗ ) = 0. The value x∗ is called a
root of the equation f (x) = 0.

If f (x) = 0 has a root x∗ there is no truly guaranteed method for finding x∗ compu-
tationally for an arbitrary function, but there are several techniques which are useful for
many cases. A computational limitation inherent to this problem can be fairly easily seen
by an observant reader: any interval of the real line contains an infinite number of points,
but computationally we can solve this problem with only a finite number of evaluations of
the function f .
Additionally, since the value x∗ may not be defined in our floating point number system,
we will not be able to find x∗ exactly. Therefore, we consider a computational version of
the same problem:

Problem (Computational) Given any f(x) and some error tolerance ε > 0, find x∗ such
that |f(x∗)| < ε.

We will only consider functions which are continuous in our analysis.

Example 2.1

1) Consider the function f(x) = x^2 + x − 6 = (x − 2)(x + 3). The function f(x) = 0 has
two roots at x = 2, −3.

2) With f(x) = x^2 − 2x + 1 = (x − 1)^2 = 0, x = 1 is a double root (i.e. f(x∗) = 0 and
f′(x∗) = 0).


3) f(x) = 3x^5 + 5x^4 + (1/3)x^3 + 1 = 0. We have no general closed form solution for the roots
of a polynomial with degree larger than 4. As a result, we will need to use numerical
approximation by iteration.

4) f(x) = x^2 − (1/2) exp(−x) = 0. This is naturally more difficult to solve because exp(−x)
is a transcendental function. In fact, f(x) = 0 is called a transcendental equation
and can only be solved with numerical approximation.

Definition 2.1 We say that x∗ is a double root of f (x) = 0 if and only if f (x∗ ) = 0 and
f ′ (x∗ ) = 0.

We naturally examine this computational problem from an iterative standpoint. That is,
we wish to generate a sequence of iterates (xk ) such that any iterate xk+1 can be written as
some function of xk , xk−1 , . . . , x0 . We assume that some initial conditions are applied to the
problem so that xp , xp−1 , . . . x0 are either given or arbitrarily chosen. Obviously, we require
that the iterates actually converge to the solution of the problem, i.e. x∗ = limk→∞ xk .
A natural question to ask might be, “how do we know where a root of a function may
approximately be located?” A simple result from first year calculus will help in answering
this:

Theorem 2.1 (Intermediate Value Theorem) If f (x) is continuous on a closed inter-


val [a, b] and c ∈ [f (a), f (b)], then ∃ x∗ ∈ [a, b] such that f (x∗ ) = c.

Thus if we can find [a, b] such that f (a) · f (b) < 0 then by the Intermediate Value
Theorem, [a, b] will contain at least one root x∗ as long as f (x) is continuous.

2.2 Four Algorithms for Root Finding


2.2.1 Bisection Method
The bisection method is one of the most simple methods for locating roots, but is also very
powerful as it guarantees convergence as long as its initial conditions are met. To apply
this method, we require a continuous function f (x) and an initial interval [a, b] such that
f (a) · f (b) ≤ 0. This method effectively works by bisecting the interval and recursively using
the Intermediate Value Theorem to determine a new interval where the initial conditions
are met.
Theorem 2.2 If f(x) is a continuous function on the interval [a_0, b_0] such that f(a_0) ·
f(b_0) ≤ 0, then the interval [a_k, b_k], defined by

\[ a_k = \begin{cases} a_{k-1} & \text{if } f((a_{k-1} + b_{k-1})/2) \cdot f(a_{k-1}) \le 0 \\ (a_{k-1} + b_{k-1})/2 & \text{otherwise} \end{cases} \tag{2.1} \]

\[ b_k = \begin{cases} b_{k-1} & \text{if } f((a_{k-1} + b_{k-1})/2) \cdot f(a_{k-1}) > 0 \\ (a_{k-1} + b_{k-1})/2 & \text{otherwise} \end{cases} \tag{2.2} \]

fulfills the property f(a_k) · f(b_k) ≤ 0 for all k.

Algorithm: Bisection Method


in: f (x), [a, b], tolerance t
out: x, an approximation for x∗

while |b-a| > t


c = (a+b)/2
if f(a)*f(c) <= 0
keep a, b = c
else
keep b, a = c
end if
end while
x = (a+b)/2

When applying the bisection method, we only require continuity of the function f (x) and
an initial knowledge of two points a0 and b0 such that f (a0 ) · f (b0 ) ≤ 0. We are guaranteed
the existence of a root in the interval [a0 , b0 ] by the Intermediate Value Theorem (2.1) and
are further guaranteed that the bisection method will converge to a solution.
We consider the question of “speed of convergence,” namely given a, b and t, how many
steps does it take to reach t? If we suppose that x∗ = lim_{k→∞} x_k then at each iteration the
interval containing x∗ is halved. Thus, assuming it takes n steps to reduce the interval
length below t, we have that

\[ 2^{-n}|b - a| \le t \;\Rightarrow\; n \log 2 \ge \log\!\left(\frac{|b - a|}{t}\right) \;\Rightarrow\; n \ge \frac{1}{\log 2} \log\!\left(\frac{|b - a|}{t}\right). \]

Thus we conclude that for a given tolerance t and initial interval [a, b], bisection will take

\[ n \ge \frac{1}{\log 2} \log\!\left(\frac{|b - a|}{t}\right) \tag{2.3} \]

steps to converge.

Example 2.2 Given |b − a| = 1 and t = 10^{−6}, how many steps does it take to converge?
From (2.3) we have that

\[ n \ge \frac{1}{\log_{10}(2)} \log_{10}(10^6) \approx 3.32 \cdot 6 \approx 19.9 \;\Rightarrow\; n \ge 20. \]

(compare with 2^{20} ≈ 1.05 × 10^6).
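
A direct Python translation of the pseudocode above confirms this operation count (an added sketch; the test function f(x) = x^2 − (1/2)exp(−x) is the example from 2.1(4), revisited in Section 2.3):

Python sketch:

from math import exp, ceil, log2

def bisection(f, a, b, tol):
    # Bisection method following the pseudocode above; assumes f(a)*f(b) <= 0.
    steps = 0
    while abs(b - a) > tol:
        c = (a + b) / 2.0
        if f(a) * f(c) <= 0:
            b = c            # keep a
        else:
            a = c            # keep b
        steps += 1
    return (a + b) / 2.0, steps

f = lambda x: x**2 - 0.5 * exp(-x)           # root near 0.5398
root, steps = bisection(f, 0.0, 1.0, 1e-6)
print(root, steps, ceil(log2(1.0 / 1e-6)))   # 20 steps, as predicted by (2.3)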



2.2.2 Fixed Point Iteration


We note that we may rewrite the root-finding problem in an alternate way that may not be
immediately obvious. Consider the real-valued function g, defined by g(x) = x − f (x). We
note that this function inherits the continuity of f in an interval [a, b]. We can also write
f (x) = x − g(x) in order to obtain our original function f (x). The problem of root-finding
for our original function is hence equivalent to the problem of finding a solution to g(x) = x.

Definition 2.2 We say that x∗ is a fixed point of g(x) if g(x∗ ) = x∗ , i.e. if x∗ is mapped
to itself under g.

We note that if our function g has certain desirable properties (in particular, as will
be shown later, if |g′(x∗)| < 1 and x0 is “close enough” to x∗), then repeated application
of g will actually cause us to converge to this fixed point. This implies we can write our
algorithm for fixed-point iteration as follows:

Algorithm: Fixed Point Iteration


in: g(x), x0 , tolerance t
out: x, an approximation for x∗

i = 0
repeat
i = i + 1
x[i] = g(x[i-1])
until |x[i] - x[i-1]| < t
x = x[i]

We note that it is not required that we limit ourselves to the choice of g(x) = x − f (x)
in applying this scheme. In general we can write g(x) = x − H(f (x)) as long as we choose
H such that H(0) = 0. Not all choices will lead to a converging method. For convergence
we must also choose H such that |g ′ (x∗ )| < 1 (see later).
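
As a concrete sketch (added here; the choice g(x) = x − f(x) with x0 = 1 appears consistent with the Fixed Point Iteration column of Table 2.1 in Section 2.3), the algorithm above can be written as:

Python sketch:

from math import exp

def fixed_point(g, x0, tol, max_iter=200):
    # Fixed point iteration following the pseudocode above.
    x_prev, x = x0, g(x0)
    i = 1
    while abs(x - x_prev) >= tol and i < max_iter:
        x_prev, x = x, g(x)
        i += 1
    return x, i

f = lambda x: x**2 - 0.5 * exp(-x)   # Example 2.1(4)
g = lambda x: x - f(x)               # the simplest choice, H(f) = f
root, iterations = fixed_point(g, 1.0, 1e-12)
print(root, iterations)              # converges to about 0.539835 since |g'(x*)| < 1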

2.2.3 Newton’s Method


Certain additional information may be available about our function that may assist in con-
structing a root-finding method. In particular, knowledge of the first derivative of f (x)
motivates the use of Newton’s method. Consider the Taylor series expansion of f (x∗ ) about
an initial estimate x0 :

f(x∗) = f(x0) + f′(x0)(x∗ − x0) + O((x∗ − x0)^2). (2.4)


If we take this sequence to leading order, we have that

f (x∗ ) ≈ f (x0 ) + f ′ (x0 )(x∗ − x0 ). (2.5)



But we know that f (x∗ ) = 0 and so we find a new, often better approximation x1 from x0
by requiring that

f (x0 ) + f ′ (x0 )(x1 − x0 ) = 0. (2.6)


Rearranging and taking the sequence in general yields the defining equation for Newton’s
method:

xi+1 = xi − f (xi )/f ′ (xi ). (2.7)



We note that we will need to look out for the case where f′(xi) = 0, since this will lead to
a division by zero. Otherwise, this derivation allows us to provide an algorithm for applying
Newton’s method numerically:

Algorithm: Newton’s Method


in: f (x), f ′ (x), x0 , tolerance t
out: x, an approximation for x∗

i = 0
x[0] = x0
repeat
i = i + 1
if f’(x[i-1]) == 0 stop
x[i] = x[i-1] - f(x[i-1]) / f’(x[i-1])
until |x[i] - x[i-1]| < t
x = x[i]

2.2.4 Secant Method


Newton’s method provides very fast convergence, but relies on the knowledge of f ′ (x). If
this derivative is not known, not easily computable, or if f (x) is not explicitly given but
is the output of a computer program, then we will be unable to apply Newton’s method.
However, we can approximate the derivative using a numerical scheme that requires only
evaluations of the function f (x). From the definition of the derivative we know that

\[ f'(x_i) = \lim_{\eta \to x_i} \frac{f(x_i) - f(\eta)}{x_i - \eta} \approx \frac{f(x_i) - f(\eta)}{x_i - \eta}. \tag{2.8} \]

If we choose η = x_{i−1} then we approximate the derivative as

\[ f'(x_i) \approx \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}}. \tag{2.9} \]
This result can be plugged into Newton’s method to give the defining equation for the Secant
method:

" #
xi − xi−1
xi+1 = xi − f (xi ) . (2.10)
f (xi ) − f (xi−1 )

Note that this method actually requires the two previous values (x_i and x_{i−1}) in order
to compute x_{i+1}. Thus, we also need two initial values x0 and x1 in order to begin iteration.
Also, just as in Newton's method we needed to check for f′(x_i) = 0, here we need to be
wary of the case where f(x_i) ≈ f(x_{i−1}), as this will potentially give undesirable results.
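
A short implementation of the resulting iteration (an added sketch; the starting values x0 = 2 and x1 = 0 appear to match the Secant column of Table 2.1 in Section 2.3):

Python sketch:

from math import exp

def secant(f, x0, x1, tol, max_iter=100):
    # Secant method (2.10); guards against f(x_i) ~ f(x_{i-1}).
    for _ in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if f1 == f0:                 # the denominator of (2.10) would vanish
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

f = lambda x: x**2 - 0.5 * exp(-x)
print(secant(f, 2.0, 0.0, 1e-12))    # about 0.53983527690282, cf. Table 2.1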

2.2.5 Stopping Criteria for Iterative Functions


For any iterative algorithm that approximates a root x∗ we need to consider a stopping
criterion. There are several general criteria for stopping, which can be combined if necessary.

1. Maximum number of steps. Using this method for stopping, we impose some
maximum number of steps in the iteration imax and stop when i = imax . This provides
a safeguard against infinite loops, but is not very efficient - i.e. even if we are very
close to the root after 1 iteration, this method will always run for the same number of
iterations.

2. Tolerance on the size of the correction. Under this criterion, we are given
some tolerance t and stop when |xi+1 − xi | ≤ t. Under the bisection method and
fixed-point iteration (with “spiral-wise” convergence, see later) we actually are able to
guarantee that |xi+1 − x∗ | ≤ t. Unfortunately, this criterion does not guarantee that
|xi+1 − x∗ | ≈ t, in general.
This may not work very well if one desires a small function value in the approximate
root for steep functions such as f(x) = a(x − x∗) with a = 10^{11}. Even a small error in
x will mean a large value for f(x).

3. Tolerance on the size of the function value. Under this criterion, we are given
some tolerance t and stop when |f(xi)| < t. This may not work well for a flat function
such as f(x) = a(x − x∗) with a = 10^{−9}. In this case, for xi far from x∗, |f(xi)| may
be smaller than t.

In conclusion, choosing a good stopping criterion is difficult and dependent on the prob-
lem. Often trial and error is used to determine a good criterion, possibly combining several
of the aforementioned options.

2.3 Rate of Convergence


We wish to examine how quickly each of the four methods discussed here converges to
the root x∗ , assuming that convergence occurs. In section 2.4 we will discuss the criteria
necessary for convergence of each method.
In order to lay a foundation for this discussion, we must consider how rate of convergence
is measured. We first consider the error at each step of the iteration:

Definition 2.3 For a sequence {x_i}_{i=0}^∞ and point x∗, the error at iteration i is

\[ e_i = x_i - x^*. \tag{2.11} \]

We will define the rate of convergence by how quickly the error converges to zero (and
hence how quickly {x_i}_{i=0}^∞ converges to x∗). If {x_i}_{i=0}^∞ diverges from x∗, then we note that
lim_{i→∞} e_i = ±∞.

Definition 2.4 The sequence {x_i}_{i=1}^∞ converges to x∗ with order q if and only if {x_i}_{i=1}^∞
converges to x∗, lim_{i→∞} c_i = N for some N ∈ [0, ∞), and

\[ |e_{i+1}| = c_i |e_i|^q. \tag{2.12} \]

With these definitions in mind, we may consider the rate of convergence of each of our
iteration methods. Consider the example given in Table 2.1 and Table 2.2. We wish to
determine the positive root of f(x) = x^2 − (1/2) exp(−x), which has the exact value x∗ =
0.53983527690282. We measure the value of the iterate x_i in Table 2.1 and the value of the
error e_i in Table 2.2.

Bisection Method. From the derivation (2.3) we note that, on average, |e_{i+1}| ≈ (1/2)|e_i|.
But the error may increase for certain iterations, depending on the initial interval. Thus,
we cannot directly apply the definition for convergence to the bisection method, but we
nonetheless say that the method behaves like a linearly convergent method, with the follow-
ing justification:

Consider the sequence defined by {L_i}_{i=1}^∞ with L_i = |b_i − a_i| the length of the interval
at step i. We know that L_{i+1} = (1/2)L_i and so the sequence {L_i} converges to 0 linearly. We
also know that |e_i| ≤ L_i, and so we say that {e_i} converges to 0 at least linearly.

Fixed Point Iteration. The rate of convergence for this method is highly variable and
depends greatly on the actual problem being solved. In the example at hand, we find that
if we define ci by |ei+1 | = ci |ei | then, on average, it appears that limi→∞ ci = 0.37. Thus
we note that fixed point iteration converges linearly as well.

Newton’s Method. A thorough analysis of Newton’s method indicates that Newton
actually converges much faster than the other methods. In the example at hand, if we
consider |e_{i+1}| = c_i|e_i|^2 we will find that we get lim_{i→∞} c_i = 0.62. Thus Newton’s method
converges quadratically in this example.

Secant Method. With a thorough analysis of the Secant method, we find that the Secant
method converges faster than fixed point iteration, but slower than Newton’s method. If
we consider |e_{i+1}| = c_i|e_i|^q we actually find that q = (1 + √5)/2 ≈ 1.618. In the example at
hand, we actually get that lim_{i→∞} c_i ≈ 0.74.
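
The order q can also be estimated numerically from successive errors: if |e_{i+1}| ≈ c|e_i|^q with c roughly constant, then q ≈ ln(|e_{i+1}|/|e_i|) / ln(|e_i|/|e_{i-1}|). The sketch below (an added illustration) applies this to Newton’s method on the example above, using the tabulated root as reference; the printed estimates approach 2, consistent with quadratic convergence.

Python sketch:

from math import exp, log

f  = lambda x: x**2 - 0.5 * exp(-x)
df = lambda x: 2.0 * x + 0.5 * exp(-x)
x_star = 0.53983527690282            # reference root from Table 2.1

# Newton iterates starting from x0 = 2 and their errors e_i = |x_i - x_star|
x, errors = 2.0, []
for _ in range(5):
    errors.append(abs(x - x_star))
    x = x - f(x) / df(x)

# q ~ ln(|e_{i+1}| / |e_i|) / ln(|e_i| / |e_{i-1}|)
for i in range(1, len(errors) - 1):
    print(log(errors[i + 1] / errors[i]) / log(errors[i] / errors[i - 1]))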

2.4 Convergence Theory

2.4.1 Fixed Point Iteration

For fixed point iteration, we construct the iteration sequence x_{i+1} = g(x_i) and iterate to
approximate the fixed point x∗ = g(x∗). We demonstrated previously that this is equivalent
to root-finding if we use g(x) = x − H(f(x)) for any function H with H(0) = 0. Since
f(x∗) = 0 by the definition of a root, we also have that H(f(x∗)) = H(0) = 0
and so g(x∗) = x∗. We will show now that this method converges when |g′(x∗)| < 1.
The theory for the fixed point iteration method goes hand in hand with the theory behind
contractions and contraction mapping from real analysis:

Definition 2.5 Suppose that g is a real-valued function, defined and continuous on a bounded
closed interval [a, b] of the real line. Then, g is said to be a contraction on [a, b] if there
exists a constant L ∈ (0, 1) such that

|g(x) − g(y)| ≤ L|x − y| ∀x, y ∈ [a, b] (2.16)

The definition of a contraction has two very important graphical interpretations:

[Figure: plot of y = g(x) together with the line y = x on the interval [a, b], showing the images g(a) and g(b).]

For one, we notice |g(a) − g(b)| is smaller than |a − b| and so we may say that the interval
[a, b] has been contracted to a smaller interval [g(a), g(b)].

Table 2.1: Root-Finding Iteration x_i (f(x) = x^2 − (1/2) exp(−x))

Bisection Fixed Point Iteration Newton Secant

1.00000000000000 1.00000000000000 2.00000000000000 2.00000000000000


0.50000000000000 0.18393972058572 1.03327097864435 0.00000000000000
0.75000000000000 0.56609887674714 0.63686054270010 0.22561484995794
0.62500000000000 0.52949890451207 0.54511924037555 0.74269471761919
0.56250000000000 0.54357981007437 0.53985256974508 0.49760551690233
0.53125000000000 0.53843372706828 0.53983527708914 0.53494633480233
0.54687500000000 0.54035370380524 0.53983527690282 0.53996743772003
0.53906250000000 0.53964266286324 0.53983527690282 0.53983487323262
0.54296875000000 0.53990672286905 0.53983527690282 0.53983527686958
0.54101562500000 0.53980875946698 0.53983527690282 0.53983527690282
0.54003906250000 0.53984511672844 0.53983527690282 0.53983527690282
0.53955078125000 0.53983162533285 0.53983527690282 0.53983527690282
0.53979492187500 0.53983663196232 0.53983527690282 0.53983527690282
0.53991699218750 0.53983477404859 0.53983527690282 0.53983527690282
0.53985595703125 0.53983546350813 0.53983527690282 0.53983527690282
0.53982543945312 0.53983520765493 0.53983527690282 0.53983527690282
0.53984069824219 0.53983530260020 0.53983527690282 0.53983527690282
0.53983306884766 0.53983526736671 0.53983527690282 0.53983527690282
0.53983688354492 0.53983528044160 0.53983527690282 0.53983527690282
0.53983497619629 0.53983527558960 0.53983527690282 0.53983527690282
0.53983592987061 0.53983527739014 0.53983527690282 0.53983527690282
0.53983545303345 0.53983527672198 0.53983527690282 0.53983527690282
0.53983521461487 0.53983527696993 0.53983527690282 0.53983527690282
0.53983533382416 0.53983527687792 0.53983527690282 0.53983527690282
0.53983527421951 0.53983527691206 0.53983527690282 0.53983527690282
0.53983530402184 0.53983527689939 0.53983527690282 0.53983527690282
0.53983528912067 0.53983527690409 0.53983527690282 0.53983527690282
0.53983528167009 0.53983527690235 0.53983527690282 0.53983527690282
0.53983527794480 0.53983527690300 0.53983527690282 0.53983527690282
0.53983527608216 0.53983527690276 0.53983527690282 0.53983527690282
0.53983527701348 0.53983527690284 0.53983527690282 0.53983527690282
0.53983527654782 0.53983527690281 0.53983527690282 0.53983527690282
0.53983527678065 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527689707 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527695527 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527692617 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527691162 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690434 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690070 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690252 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690343 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690298 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690275 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690286 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690281 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690283 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690282 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690281 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690282 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690282 0.53983527690282 0.53983527690282 0.53983527690282
0.53983527690282 0.53983527690282 0.53983527690282 0.53983527690282

Table 2.2: Root-Finding Iteration e_i (f(x) = x^2 − (1/2) exp(−x))

Bisection Fixed Point Iteration Newton Secant

0.46016472309718 0.46016472309718 1.46016472309718 1.46016472309718


-0.03983527690282 -0.35589555631710 0.49343570174153 -0.53983527690282
0.21016472309718 0.02626359984432 0.09702526579728 -0.31422042694488
0.08516472309718 -0.01033637239075 0.00528396347273 0.20285944071637
0.02266472309718 0.00374453317155 0.00001729284226 -0.04222976000049
-0.00858527690282 -0.00140154983454 0.00000000018632 -0.00488894210049
0.00703972309718 0.00051842690242 0.00000000000000 0.00013216081721
-0.00077277690282 -0.00019261403958 0.00000000000000 -0.00000040367020
0.00313347309718 0.00007144596623 0.00000000000000 -0.00000000003324
0.00118034809718 -0.00002651743584 0.00000000000000 0.00000000000000
0.00020378559718 0.00000983982562 0.00000000000000 0.00000000000000
-0.00028449565282 -0.00000365156997 0.00000000000000 0.00000000000000
-0.00004035502782 0.00000135505950 0.00000000000000 0.00000000000000
0.00008171528468 -0.00000050285423 0.00000000000000 0.00000000000000
0.00002068012843 0.00000018660531 0.00000000000000 0.00000000000000
-0.00000983744969 -0.00000006924789 0.00000000000000 0.00000000000000
0.00000542133937 0.00000002569738 0.00000000000000 0.00000000000000
-0.00000220805516 -0.00000000953611 0.00000000000000 0.00000000000000
0.00000160664210 0.00000000353878 0.00000000000000 0.00000000000000
-0.00000030070653 -0.00000000131322 0.00000000000000 0.00000000000000
0.00000065296779 0.00000000048732 0.00000000000000 0.00000000000000
0.00000017613063 -0.00000000018084 0.00000000000000 0.00000000000000
-0.00000006228795 0.00000000006711 0.00000000000000 0.00000000000000
0.00000005692134 -0.00000000002490 0.00000000000000 0.00000000000000
-0.00000000268331 0.00000000000924 0.00000000000000 0.00000000000000
0.00000002711902 -0.00000000000343 0.00000000000000 0.00000000000000
0.00000001221785 0.00000000000127 0.00000000000000 0.00000000000000
0.00000000476727 -0.00000000000047 0.00000000000000 0.00000000000000
0.00000000104198 0.00000000000018 0.00000000000000 0.00000000000000
-0.00000000082066 -0.00000000000006 0.00000000000000 0.00000000000000
0.00000000011066 0.00000000000002 0.00000000000000 0.00000000000000
-0.00000000035500 -0.00000000000001 0.00000000000000 0.00000000000000
-0.00000000012217 0.00000000000000 0.00000000000000 0.00000000000000
-0.00000000000575 -0.00000000000000 0.00000000000000 0.00000000000000
0.00000000005245 0.00000000000000 0.00000000000000 0.00000000000000
0.00000000002335 -0.00000000000000 0.00000000000000 0.00000000000000
0.00000000000880 0.00000000000000 0.00000000000000 0.00000000000000
0.00000000000152 0.00000000000000 0.00000000000000 0.00000000000000
-0.00000000000212 0.00000000000000 0.00000000000000 0.00000000000000
-0.00000000000030 0.00000000000000 0.00000000000000 0.00000000000000
0.00000000000061 0.00000000000000 0.00000000000000 0.00000000000000
0.00000000000016 0.00000000000000 0.00000000000000 0.00000000000000
-0.00000000000007 0.00000000000000 0.00000000000000 0.00000000000000
0.00000000000004 0.00000000000000 0.00000000000000 0.00000000000000
-0.00000000000001 0.00000000000000 0.00000000000000 0.00000000000000
0.00000000000001 0.00000000000000 0.00000000000000 0.00000000000000
0.00000000000000 0.00000000000000 0.00000000000000 0.00000000000000
-0.00000000000001 0.00000000000000 0.00000000000000 0.00000000000000
-0.00000000000000 0.00000000000000 0.00000000000000 0.00000000000000
-0.00000000000000 0.00000000000000 0.00000000000000 0.00000000000000
-0.00000000000000 0.00000000000000 0.00000000000000 0.00000000000000

Intermission: The Golden Ratio


Figure 2.1: Two nested rectangles with equivalent aspect ratios

The golden ratio appears often in nature and in mathematics. It can be defined using two
nested rectangles that have equivalent aspect ratios, as depicted in figure 2.1. The aspect
ratio of the larger rectangle is 1/φ and the smaller rectangle has the aspect ratio (φ − 1)/1.
Equating these gives

    1/φ = (φ − 1)/1,    (2.13)
or equivalently

φ2 − φ − 1 = 0. (2.14)
Finally, we apply the quadratic formula and take the positive root to get

    φ = (1 + √5)/2,    (2.15)

the golden ratio.


Also, we note that for any x, y ∈ [a, b] such that x ≠ y, a simple manipulation indicates
that a contraction fulfills

    |g(x) − g(y)| / |x − y| ≤ L < 1.    (2.17)
Thus we notice that the slope of any secant line within the interval [a, b] cannot exceed L in
absolute value.
An observant reader might notice that this definition of a contraction appears very
similar to the definition for a derivative. In fact, if we have g(x) differentiable on [a, b] with
|g ′ (x)| < 1 ∀ x ∈ [a, b], then g(x) is a contraction on [a, b] with

    L = max_{x∈[a,b]} |g′(x)|.

The proof of this fact is left as an exercise for the reader.


The definition of a contraction leads to a very important theorem that governs the
behaviour of the contraction and, in fact, gives us the result we require for fixed point
iteration.

Theorem 2.3 (Contraction Mapping Theorem) Let g be a real-valued function, defined
and continuous on a bounded closed interval [a, b] of the real line, and assume that
g(x) ∈ [a, b] for all x ∈ [a, b]. Suppose, further, that g is a contraction on [a, b]. Then,
1. g has a unique fixed point x∗ in the interval [a, b].
2. The sequence {xk } defined by xk+1 = g(xk ) converges to x∗ as k → ∞ for any starting
value x0 in [a, b].

Proof: Existence of the fixed point. The existence of a fixed point x∗ for g is a
consequence of the Intermediate Value Theorem. Define u(x) = x − g(x). Then

u(a) = a − g(a) ≤ 0 and u(b) = b − g(b) ≥ 0.

Then by the Intermediate Value Theorem, there exists x∗ ∈ [a, b] such that u(x∗ ) = 0. Thus
x∗ − g(x∗ ) = 0, or equivalently x∗ = g(x∗ ) and so x∗ is a fixed point of g.


Figure 2.2: An illustration of the contraction mapping theorem.

Uniqueness of the fixed point. The uniqueness of this fixed point follows from (2.16)
by contradiction. Suppose that g has a second fixed point, x∗2 , in [a, b] such that g(x∗ ) = x∗
and g(x∗2 ) = x∗2 . Then,

|g(x∗ ) − g(x∗2 )| ≤ L|x∗ − x∗2 |, (contraction property).

Using the definition of a fixed point, we have

|x∗ − x∗2 | ≤ L|x∗ − x∗2 |,

or equivalently, L ≥ 1. However, from the contraction property we know L ∈ (0, 1). Thus
we have a contradiction and so there is no second fixed point.

Convergence property. Let x0 be any element of [a, b]. Consider the sequence {xi }
defined by xi+1 = g(xi ), where xi ∈ [a, b] implies xi+1 ∈ [a, b]. We note for any xi−1 in
the interval we have, by the contraction property

|g(xi−1 ) − g(x∗ )| ≤ L|xi−1 − x∗ |,


or equivalently
|xi − x∗ | ≤ L|xi−1 − x∗ |.
Using the fact that this applies for all i, we may use recursion to get

|xi − x∗ | ≤ Li |x0 − x∗ |.

We take the limit as i → ∞ to get

    lim_{i→∞} |xi − x∗| ≤ |x0 − x∗| · lim_{i→∞} L^i.

But since we also know that L ∈ (0, 1), we have lim_{i→∞} L^i = 0. Thus our equation reduces to

    lim_{i→∞} |xi − x∗| = 0,

or

    lim_{i→∞} xi = x∗.  □

From the contraction mapping theorem, we note that convergence to the fixed point x∗
appears to be linear; in particular, we get from the contraction property that ei ≤ L ei−1.
Intuitively, only one fixed point is allowed since we require a slope greater than 1 to get
multiple fixed points (see Figure 2.3).


Figure 2.3: In order to get multiple fixed points, we need the slope of g(x) to be greater
than 1.

We can determine when a sequence converges from the following corollary to the Con-
traction Mapping Theorem:

Corollary 2.1 Let g be a real-valued function, defined and continuous on a bounded closed
interval [a, b] of the real line, and assume that g(x) ∈ [a, b] for all x ∈ [a, b]. Let x∗ = g(x∗ )
be a fixed point of g(x) with x∗ ∈ [a, b]. Assume there exists δ such that g ′ (x) is continuous
in Iδ = [x∗ − δ, x∗ + δ]. Define the sequence {xi }∞i=0 by xi+1 = g(xi ). Then:

I. If |g ′ (x∗ )| < 1 then there exists ǫ such that {xi } converges to x∗ for |x0 − x∗ | < ǫ.
Further, convergence is linear with limi→∞ ci = |g ′ (x∗ )|.

II. If |g ′ (x∗ )| > 1 then {xi } diverges for any starting value x0 .

Using this corollary, we can come up with a method of choosing our form for g(x) in
terms of f (x) depending on the derivative at the point x∗ .

Example 2.3 Suppose that we somehow know that f ′ (x∗ ) = 3/2 where we wish to solve
for the root using f (x) = 0. Then if we add and subtract x from the equivalent equation

−f (x) = 0 we get that x − x − f (x) = 0. We define g(x) = x + f (x) so we can apply fixed
point iteration on g(x) to solve x = g(x). Using this definition of g(x) we get that
|g ′ (x∗ )| = |1 + f ′ (x∗ )| = |1 + 3/2| = 5/2 > 1
and so from the corollary we note that we will not have convergence.
If we instead add and subtract x from f (x) = 0 we get that x − x + f (x) = 0. We define
g(x) = x − f (x) so we can apply fixed point iteration on g(x) as before. However, with this
definition of g(x) we get that
|g ′ (x∗ )| = |1 − f ′ (x∗ )| = |1 − 3/2| = 1/2 < 1
and so from the corollary we can choose some x0 close enough to x∗ to get convergence.
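To make Example 2.3 concrete, the following short Python sketch applies fixed point iteration with the
choice g(x) = x − f(x). The particular function f (the one behind Table 2.2) and the starting value are
illustrative assumptions, not part of the example above; any f with |1 − f′(x∗)| < 1 behaves similarly.

import math

def f(x):
    return x**2 - 0.5 * math.exp(-x)   # sample function; its root is near 0.53984

def g(x):
    return x - f(x)                    # fixed point form: x = g(x) is equivalent to f(x) = 0

x = 1.0                                # starting guess x0
for i in range(60):
    x_new = g(x)
    if abs(x_new - x) < 1e-14:         # stop when successive iterates agree
        x = x_new
        break
    x = x_new

print(x, f(x))                         # x is approximately 0.53983527690282

Since |g′(x∗)| = |1 − f′(x∗)| is roughly 0.37 < 1 for this sample function, the corollary predicts (linear)
convergence, which is what the iteration exhibits.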

2.4.2 Newton’s Method


We need to use special care when applying Newton’s method. In particular, if x0 is too far
away from x∗ we may not get convergence. Consider the following example:

[Figure: a function f(x) that is monotone decreasing but remains positive for large x, with starting points xα and xβ marked on the x-axis.]

If x0 = xα then we will achieve convergence. However, if x0 = xβ we will diverge since


the function is monotone decreasing for x → ∞ but remains positive.
The convergence theorem for Newton’s method is not immediately obvious, but allows
us to predict cases where we will achieve convergence:
Theorem 2.4 Convergence Theorem for Newton’s Method (Atkinson p.60) If
f(x∗) = 0, f′(x∗) ≠ 0 and f, f′ and f′′ are all continuous in Iδ = [x∗ − δ, x∗ + δ] with x0
sufficiently close to x∗, then the sequence {xi}_{i=0}^∞ converges quadratically to x∗ with

    lim_{i→∞} ci = |f′′(x∗)| / |2f′(x∗)|.    (2.18)
Note that in this case, the defining equation for ci is |xi − x∗| = ci |xi−1 − x∗|². We
note that if f′(x∗) = 0 then the rate of convergence degrades to linear convergence. If the
conditions for Newton’s method are not met, we may not be able to achieve convergence for
any starting value x0.
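The quadratic convergence predicted by Theorem 2.4 is easy to observe numerically. The sketch below is
an illustration only; the function f is the same sample function used above and is not prescribed by the
theorem. It prints the Newton iterates and the size of each step, which roughly squares from one iteration
to the next once the iterates are close to x∗.

import math

def f(x):
    return x**2 - 0.5 * math.exp(-x)     # sample function with a simple root near 0.53984

def fprime(x):
    return 2.0 * x + 0.5 * math.exp(-x)  # derivative of f

x = 1.0                                  # x0 sufficiently close to x*
for i in range(8):
    step = f(x) / fprime(x)
    x = x - step                         # Newton update: x_{i+1} = x_i - f(x_i)/f'(x_i)
    print(i + 1, x, abs(step))
    if abs(step) < 1e-15:
        break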

Convergence Behaviour of the Fixed Point Method


There exist four different types of behaviour when considering convergence in fixed point
iteration.

• Spiral convergence:    −1 < g′(x∗) < 0
• Spiral divergence:      g′(x∗) < −1
• Staircase convergence:  0 < g′(x∗) < 1
• Staircase divergence:   g′(x∗) > 1

(The original figures sketch g(x) against the line y = x with the successive iterates x0, x1, x2, . . . in each case.)



2.4.3 Secant Method


The secant method has a very similar convergence theorem to that of Newton’s method.
In fact, all that changes is the order of convergence of the method. Similar to Newton’s
method, it can be shown that if f ′ (x∗ ) = 0 then the rate of convergence degrades to linear.

Theorem 2.5 Convergence Theorem for Secant Method (Atkinson p.67) If f(x∗) = 0,
f′(x∗) ≠ 0 and f, f′ and f′′ are all continuous in Iδ = [x∗ − δ, x∗ + δ] with x0 sufficiently
close to x∗, then the sequence {xi}_{i=0}^∞ converges with order q = (1 + √5)/2.
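For comparison with Newton’s method, here is a minimal secant iteration in Python (the sample function
and the two starting values are illustrative assumptions). No derivative is needed; f′(xi) is replaced by a
difference quotient built from the two most recent iterates.

import math

def f(x):
    return x**2 - 0.5 * math.exp(-x)                 # sample function, root near 0.53984

x_prev, x = 2.0, 1.0                                  # two starting values x0 and x1
f_prev, f_cur = f(x_prev), f(x)
for i in range(30):
    # secant update: x_{i+1} = x_i - f(x_i) * (x_i - x_{i-1}) / (f(x_i) - f(x_{i-1}))
    x_new = x - f_cur * (x - x_prev) / (f_cur - f_prev)
    x_prev, f_prev = x, f_cur
    x, f_cur = x_new, f(x_new)
    if abs(x - x_prev) < 1e-14:
        break

print(x)                                              # approximately 0.53983527690282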

2.4.4 Overview
Each of the root-finding methods has their own strengths and weaknesses that make them
particularly useful for some situations and inapplicable in others. The following two tables
provide a summary of the functionality of each root-finding method:

Method        Does the method converge?
Bisection     yes, guaranteed
Fixed-point   not always; depending on g(x) and x0
Newton        not always; depending on f(x) and x0
Secant        not always; depending on f(x), x0 and x1

Method        Speed of convergence              Requires knowledge of f′?
Bisection     slow (linear)                     no
Fixed-point   slow (linear)                     no
Newton        fast (quadratic)                  yes
Secant        fast (q = (1 + √5)/2 ≈ 1.6)       no

In practice, MATLAB uses a combination of the Secant method, Bisection method and
a method not discussed here called Inverse Quadratic Interpolation (the combined method
is accessible through the fzero command). It uses the method that converges fastest if
possible, defaulting to the guaranteed convergence of Bisection if necessary. Further, it
requires no knowledge of the derivative. This approach allows MATLAB’s general root-
finding function fzero to be well-suited to a variety of applications.

These notes have been funded by...


Chapter 3

Numerical Linear Algebra

3.1 Introduction
Consider the linear system

    a11 x1 + a12 x2 + · · · + a1n xn = b1
    a21 x1 + a22 x2 + · · · + a2n xn = b2
        ⋮                                        (3.1)
    an1 x1 + an2 x2 + · · · + ann xn = bn
with coefficients aij and unknowns xi . We may rewrite this as a matrix system A~x = ~b
where A ∈ Rn×n and ~x, ~b ∈ Rn as follows:

    [ a11  a12  · · ·  a1n ] [ x1 ]   [ b1 ]
    [ a21  a22  · · ·  a2n ] [ x2 ]   [ b2 ]
    [  ⋮    ⋮           ⋮  ] [  ⋮ ] = [  ⋮ ]    (3.2)
    [ an1  an2  · · ·  ann ] [ xn ]   [ bn ]
We consider the problem of solving for ~x computationally, where we desire

• accuracy (we must make sure that this is a well-conditioned problem and that we
have a stable algorithm)
• efficiency (we wish to solve potentially large systems with limited resources)

This problem has several well-known applications. For example, the Google search engine
uses linear systems in order to rank the results it retrieves for each keyword search. Here
we see that efficiency is very important: the linear systems they use may contain more than
three billion equations!

We may also ask if the problem of solving a given linear system is well posed, i.e., can
we find a single unique solution ~x that satisfies the linear system? We may appeal to a
standard result from linear algebra to answer this question:


Theorem 3.1 Existence and Uniqueness Consider A~x = ~b.

Case 1: det(A) ≠ 0 (A has linearly independent rows/columns, or A is invertible) if and
only if ~x = A−1~b is the unique solution of A~x = ~b.
Case 2: det(A) = 0 (recall that range(A) = column space of A), then
Case 2a: If ~b ∈ range(A) then A~x = ~b has infinitely many solutions.
Case 2b: If ~b 6∈ range(A) then A~x = ~b has no solutions.

3.2 Gaussian Elimination


Gaussian elimination was originally described by Carl Friedrich Gauss in 1809 in his work
on celestial mechanics entitled Theoria Motus Corporum Coelestium. However, elimination
in the form presented here was already in use well beforehand – in fact, it was even known
in China in the first century B.C.

3.2.1 LU Factorization
Before continuing, we first consider the definition of a triangular matrix. Linear systems
involving triangular matrices are very easy to solve and appear when performing Gaussian
elimination.

Definition 3.1 A matrix A ∈ Rn×n with components aij is said to be upper-triangular


if aij = 0 for all i > j. Similarly, A is said to be lower-triangular if aij = 0 for all i < j.
A is triangular if it is either upper-triangular or lower-triangular.

Gaussian elimination may be performed in two phases:


• Phase 1: Reduce the matrix A to upper triangular form.
• Phase 2: Solve the reduced system by backward substitution.
We illustrate Gaussian elimination of a linear system with an example:

Example 3.1 Consider the system A~x = ~b with


 
1 2 3
A =  4 5 6 .
7 8 1
In the first step we choose the pivot element a11^(1) = 1 and use it to compute A(2).
 
1 2 3
Step i = 1 : A(1) =  4 5 6 
 7 8 1 
1 2 3
Step i = 2 : A(2) =  0 −3 −6 
0 −6 −20

In this case A(2) is obtained from A(1) by taking linear combinations of the first row of
A(1) with each of the other rows so as to generate zeroes in the first column. This operation
may also be represented by matrix multiplication on the left with the matrix M1:

       M1            A(1)            A(2)
    [  1  0  0 ] [ 1  2  3 ]   [ 1   2    3 ]
    [ −4  1  0 ] [ 4  5  6 ] = [ 0  −3   −6 ] .
    [ −7  0  1 ] [ 7  8  1 ]   [ 0  −6  −20 ]
In general, we may write

         [  1                  0   0 ]            [ a11^(2)  a12^(2)  a13^(2) ]
    M1 = [ −a21^(1)/a11^(1)    1   0 ] ,   A(2) = [ a21^(2)  a22^(2)  a23^(2) ] .
         [ −a31^(1)/a11^(1)    0   1 ]            [ a31^(2)  a32^(2)  a33^(2) ]

We now choose a22^(2) as the pivot element. We may then compute the matrix A(3) using
M2 · A(2) = A(3):

                        [ 1        0        0 ]          [ 1   2   3 ]
    Step i = 3 :   M2 = [ 0        1        0 ] ,  A(3) = [ 0  −3  −6 ] ,
                        [ 0  −(−6)/(−3)     1 ]          [ 0   0  −8 ]

where we obtain A(3) by

       M2            A(2)               A(3)
    [ 1  0  0 ] [ 1   2    3 ]   [ 1   2   3 ]
    [ 0  1  0 ] [ 0  −3   −6 ] = [ 0  −3  −6 ] .
    [ 0 −2  1 ] [ 0  −6  −20 ]   [ 0   0  −8 ]

We note that A(3) is in upper triangular form, as desired.

Example 3.1 Recap:

         [ 1  2  3 ]        [ 1   2   3 ]
    A  = [ 4  5  6 ]   U  = [ 0  −3  −6 ]
         [ 7  8  1 ]        [ 0   0  −8 ]

         [  1  0  0 ]        [ 1   0  0 ]
    M1 = [ −4  1  0 ]   M2 = [ 0   1  0 ]
         [ −7  0  1 ]        [ 0  −2  1 ]
We may write

M2 · (M1 · A) = U
(M2 · M1 ) · A = U
A = (M2 · M1 )−1 U
A = M1−1 M2−1 U

We will define a matrix L by M1−1 M2−1 = L and so may write the matrix A as the product
A = LU . We now wish to consider the properties of the matrices Mi and L. Consider the
matrix M1, introduced above, and its inverse:

         [  1  0  0 ]            [ 1  0  0 ]
    M1 = [ −4  1  0 ] ,   M1⁻¹ = [ 4  1  0 ] .
         [ −7  0  1 ]            [ 7  0  1 ]
We define the matrix Li as the inverse of the matrix Mi (so Li Mi = I). The following
inversion property then follows from the structure of the matrix Mi :

Inversion Property: Li can be obtained from Mi by swapping the signs of the off-
diagonal elements.
Example 3.2 Consider the matrix M2, introduced above. A simple calculation allows us to
obtain the following result:

                   [ 1   0  0 ]⁻¹    [ 1  0  0 ]
    L2 = M2⁻¹  =   [ 0   1  0 ]   =  [ 0  1  0 ] ,
                   [ 0  −2  1 ]      [ 0  2  1 ]
which satisfies the inversion property.
In addition, the structure of L can be directly determined from the matrices Li using
the combination property:
Combination Property: In general, L = ∏_{i=1}^{n−1} Li = L1 · L2 · · · Ln−1. L can be
obtained from the Li by placing all of the off-diagonal elements of the matrices Li in the
corresponding position in L.
We note that the matrix L is a special type of lower-triangular matrix, defined as follows:
Definition 3.2 L is called a lower triangular matrix with unit diagonal if and only
if the matrix elements of L vanish above the diagonal and are 1 on the diagonal (i.e. ℓij =
0 ∀ j > i and ℓii = 1 ∀ i).

Using these two properties, we may write the LU decomposition of A as follows:


M2 M1 A = U
A = M1−1 M2−1 U
A = LU,
with L unit lower triangular and U upper triangular. For example 3.1, we write

        A             L             U
    [ 1 2 3 ]   [ 1  0  0 ] [ 1   2   3 ]
    [ 4 5 6 ] = [ 4  1  0 ] [ 0  −3  −6 ] .
    [ 7 8 1 ]   [ 7  2  1 ] [ 0   0  −8 ]
The technique discussed here may be generalized to square matrices of any size and is
more generally known as LU decomposition.

Procedure: LU Decomposition. For any A ∈ Rn×n we may compute the LU decom-


position in n steps as follows:

A(1) = A
A(2) = M1 · A(1)
A(3) = M2 · A(2) = M2 M1 A(1)
..
.
A(n) = Mn−1 · A(n−1)
We obtain Mj from

         [ 1                         0 ]
         [     ⋱                       ]
    Mj = [          1                  ] ,        cij = − aij^(j) / ajj^(j) ,
         [         cij     ⋱           ]
         [ 0                         1 ]

i.e. Mj is the identity matrix with the multipliers cij (for i = j+1, . . . , n) placed in column j
below the diagonal.
After computing the product of all Mi, we have

    Mn−1 · Mn−2 · · · M2 · M1 · A = U,

which may be inverted to obtain

    A = M1⁻¹ · M2⁻¹ · · · Mn−2⁻¹ · Mn−1⁻¹ · U.

Since we defined Lj = Mj⁻¹, we also have

    A = L1 · L2 · · · Ln−2 · Ln−1 · U

and so obtain the LU decomposition of A,

    A = L · U.
Since L and U are triangular matrices, we may easily compute the solution to a linear
system with either matrix L or U . The decomposition then leads to the final step in our
method of solving a general linear system:

Procedure: Solving a Linear System by LU Decomposition. Consider a linear


system given by A~x = ~b. Since we now have a procedure for computing the LU decom-
position of a matrix A, we may write an algorithm for performing Gaussian Elimination
computationally.
• Phase 1: Decompose A = LU so we may write the linear system as LU ~x = ~b.
• Phase 2: Solve L~y = ~b for ~y by forward substitution.
• Phase 3: Solve U ~x = ~y for ~x by backward substitution.

Note: Retaining L and U is advantageous when systems have to be solved with the same
A and multiple right-hand-side vectors ~b.
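As a small illustration of this note, the sketch below factors A once and then reuses the factors for two
different right-hand sides. It relies on SciPy's lu_factor/lu_solve routines (an implementation choice made
here for brevity, not something prescribed by the notes); the factorization costs O(n³) but each subsequent
solve only O(n²).

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 1.0]])          # matrix from Example 3.1

lu, piv = lu_factor(A)                   # Phase 1: compute the LU factors once

b1 = np.array([1.0, 2.0, 3.0])
b2 = np.array([0.0, 1.0, 0.0])

x1 = lu_solve((lu, piv), b1)             # Phases 2 and 3 for the first right-hand side
x2 = lu_solve((lu, piv), b2)             # ... and reused for the second one

print(np.allclose(A @ x1, b1), np.allclose(A @ x2, b2))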

3.2.2 Pivoting
We observe that the LU decomposition algorithm breaks down when at some step i the
pivot element aii is equal to zero. However, this problem does not necessarily imply that
the system is unsolvable.

Example 3.3 Consider the linear system A~x = ~b defined by

    [ 0  1 ] [ x1 ]   [ 1 ]
    [ 2  1 ] [ x2 ] = [ 3 ] .

We note that the pivot element in this example is zero in the first step, making it
impossible to proceed using the LU decomposition algorithm described previously. However,
we will have no problem proceeding if we swap the first and second rows before applying
the LU decomposition:

    [ 2  1 ] [ x1 ]   [ 3 ]
    [ 0  1 ] [ x2 ] = [ 1 ] .
In this case, swapping rows reduces the system to upper-triangular form, and so we may
solve it very easily. By inspection, this system has the solution ~x = (1 1)T .
More formally, the operation of swapping rows can be written as multiplication on the
left with a permutation matrix P. For example, in the previous example we may write

    P = [ 0  1 ] ,     A = [ 0  1 ] ,
        [ 1  0 ]           [ 2  1 ]

    ⟹   P A  =  [ 0  1 ] [ 0  1 ]  =  [ 2  1 ] .
                 [ 1  0 ] [ 2  1 ]     [ 0  1 ]

Then we have A~x = ~b ⇐⇒ P A~x = P~b.

Definition 3.3 P ∈ Rn×n is a permutation matrix if and only if P is obtained from the
unit matrix In by swapping any number of rows.

Example 3.4 An example of a 4 × 4 permutation matrix is


 
0 0 0 1
 0 0 1 0 
P =  .
1 0 0 0 
0 1 0 0

We may also (equivalently) define a permutation matrix P as an n × n matrix with


elements 1 and 0 such that every row and column has exactly one 1. In general, we find
that the addition of the permutation matrix step is all that is needed to make the LU
decomposition algorithm applicable to any square matrix:

Theorem 3.2 For all A ∈ Rn×n there exists a permutation matrix P , a unit lower trian-
gular matrix L and an upper triangular matrix U (all of type Rn×n ) such that P A = LU .

Corollary 3.1 If A is nonsingular then A~x = ~b can be solved by the LU decomposition


algorithm applied to P A.

Proof. We write our linear system as A~x = ~b and multiply both sides by the permutation
matrix P (from Theorem 3.2) to obtain P A~x = P~b. From Theorem 3.2, we also have that
P A = LU and so may write LU~x = P~b. Thus we may simply apply forward and backward
substitution and so solve this system by LU decomposition. The substitution steps will not
lead to divisions by zero as L and U do not have any vanishing diagonal elements. This
follows because det(A) = det(U) det(L)/det(P) ≠ 0, which means that ∏_{i=1}^{n} uii ≠ 0 as
det(P) = ±1 and det(L) = 1 (see Section 3.2.4). □

3.2.3 Algorithm and Computational Cost


We now consider the computational implementation of Gaussian elimination using the LU
decomposition. Recall that Gaussian elimination can be performed in two phases. We will
assume that pivoting is not needed.

Phase 1: Compute the LU decomposition, A = LU . The pseudocode for this algorithm


is as follows:

LU-Decomposition

L = diag(1)
U = A
for p = 1:n-1
for r = p+1:n
m = -u(r,p)/u(p,p)
u(r,p) = 0
for c = p+1:n
u(r,c) = u(r,c) + m * u(p,c)
end for
l(r,p) = -m
end for
end for

Here, the variable p represents the row of the pivot element, r is the current row and c
is the current column.

Aside: An efficient storage method.


We can save memory by implementing this algorithm in a clever way. Recall that the LU
decomposition of A is given by

             [ 1              0 ]  [ ×  ×  ×  × ]
             [ ×   1            ]  [    ×  ×  × ]
    A = LU = [ ×   ×   1        ]  [       ×  × ] .
             [ ×   ×   ×   1    ]  [ 0        × ]

We may store L and U together as a single matrix since we know that the diagonal com-
ponents of L are all equal to 1. Thus the diagonal and upper-triangular portion of the
new combined matrix will consist of the elements of U and the lower-triangular portion will
consist of the elements of L.
With this storage mechanism, we can also perform LU decomposition in place; i.e. per-
form all computations directly on top of the input matrix A. However, using this technique
we lose the original contents of the matrix A.
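A minimal Python/NumPy version of this in-place scheme is sketched below (no pivoting, mirroring the
pseudocode above; the 3 × 3 test matrix is the one from Example 3.1). The multipliers of L are stored
exactly where the eliminated entries of U would otherwise hold zeros.

import numpy as np

def lu_inplace(A):
    # overwrite A with U on and above the diagonal and with the multipliers
    # of L strictly below the diagonal (the unit diagonal of L is not stored)
    n = A.shape[0]
    for p in range(n - 1):                 # p: pivot row
        for r in range(p + 1, n):          # r: row being eliminated
            m = A[r, p] / A[p, p]          # multiplier l(r, p); assumes A[p, p] != 0
            A[r, p] = m                    # store the L entry in place of the new zero
            A[r, p+1:] -= m * A[p, p+1:]   # update the remaining (U) part of row r
    return A

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 1.0]])
LU = lu_inplace(A.copy())                  # work on a copy, since A is destroyed otherwise
L = np.tril(LU, -1) + np.eye(3)
U = np.triu(LU)
print(np.allclose(L @ U, A))               # True: the factors reproduce A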

Computational Cost of Phase 1


In order to compute the computational cost of this algorithm, we consider two counters:
Let M be the number of multiplications or divisions and let A be the number of additions
or subtractions. The following summation identities will come in handy in performing our
analysis:

    ∑_{p=1}^{n−1} 1 = n − 1,      ∑_{p=1}^{n−1} p = (1/2) n(n − 1),      ∑_{p=1}^{n−1} p² = (1/6) n(n − 1)(2n − 1).

• Count A: We sum over all loops in the algorithm to yield

    W = ∑_{p=1}^{n−1} ∑_{r=p+1}^{n} ∑_{c=p+1}^{n} A
      = ∑_{p=1}^{n−1} (n − p)² A
      = ∑_{p=1}^{n−1} (n² − 2np + p²) A
      = [ (1/6)·n·(n − 1)·(2n − 1) − (1/2)·2n·n·(n − 1) + n²·(n − 1) ] A
      = [ (2/6 − 1 + 1) n³ + O(n²) ] A
      = [ (1/3) n³ + O(n²) ] A.

• Count M: Similarly we determine

    W = [ (1/3) n³ + O(n²) ] M

(close inspection reveals that the additional multiplication only affects the O(n²) terms).

Thus the total number of floating point operations is

    W = (2/3) n³ + O(n²) flops.    (3.3)

(On modern CPUs, type A operations and type M operations take approximately the same
amount of time.)

Phase 2: We solve L~y = ~b by forward substitution,

    [ 1                    0 ]
    [ L(i, j)    ⋱           ]  ~y  =  ~b .
    [                      1 ]

The ith equation in this system is given by

    ∑_{k=1}^{i} L(i, k) y(k) = b(i).

We rewrite this as

    y(i) = b(i) − ∑_{k=1}^{i−1} L(i, k) y(k).

Thus we may formulate an algorithm for forward substitution by solving for each y(i) in
turn:

Forward Substitution

y = b
for r = 2:n
for c = 1:r-1
y(r) = y(r) - L(r,c) * y(c)
end for
end for

Here, r is the current row and c is the current column. We may determine the compu-
tational cost of the algorithm, as with LU decomposition:

    W = ∑_{r=2}^{n} ∑_{c=1}^{r−1} (1M + 1A)
      = ∑_{r=2}^{n} (r − 1)(1M + 1A)
      = ∑_{s=1}^{n−1} s (1M + 1A)
      = (1/2) n(n − 1)(M + A)
      = n(n − 1) flops

Thus, the total number of floating point operations required for Forward Substitution is

W = n2 + O(n) flops. (3.4)

Phase 3: We solve U~x = ~y by backward substitution,

    [ U(1, 1)              U(i, j) ]
    [           U(2, 2)            ]  ~x  =  ~y .
    [                     ⋱        ]
    [ 0                   U(n, n)  ]

The ith equation in this system is given by

    ∑_{k=i}^{n} u(i, k) x(k) = y(i).

We rewrite this as

    x(i) = [ y(i) − ∑_{k=i+1}^{n} u(i, k) x(k) ] / u(i, i).

Thus we may formulate an algorithm for backward substitution by solving for each x(i)
in turn:

Backward Substitution

x = y
for r = n:-1:1
for c = r+1:n
x(r) = x(r) - U(r,c) * x(c)
end for
x(r) = x(r) / U(r,r)
end for

Here, r is the current row and c is the current column. The computational complexity
will be the same as with forward substitution:

W = n2 + O(n) flops. (3.5)

Note: If u(i, i) = 0 for some i, the backward substitution algorithm breaks down, but this
can never happen if det(A) ≠ 0.

3.2.4 Determinants
Before continuing, we consider some of the properties of the determinant.

Definition 3.4 The determinant of a matrix A ∈ Rn×n is given by

    det(A) = ∑_{j=1}^{n} (−1)^(i+j) aij det(Aij),    for fixed i,    (3.6)
56 CHAPTER 3. NUMERICAL LINEAR ALGEBRA

with

          [ a11      a12      · · ·  a1(j−1)      a1(j+1)      · · ·  a1n     ]
          [  ⋮        ⋮                 ⋮             ⋮                  ⋮     ]
    Aij = [ a(i−1)1  a(i−1)2  · · ·  a(i−1)(j−1)  a(i−1)(j+1)  · · ·  a(i−1)n ] ,
          [ a(i+1)1  a(i+1)2  · · ·  a(i+1)(j−1)  a(i+1)(j+1)  · · ·  a(i+1)n ]
          [  ⋮        ⋮                 ⋮             ⋮                  ⋮     ]
          [ an1      an2      · · ·  an(j−1)      an(j+1)      · · ·  ann     ]

i.e. the matrix Aij is an (n − 1) × (n − 1) matrix obtained by removing row i and column j
from the original matrix A.

This is the expansion of the determinant about row i for any 1 ≤ i ≤ n. We may also
consider the expansion of the determinant about column j for any 1 ≤ j ≤ n as follows:
    det(A) = ∑_{i=1}^{n} (−1)^(i+j) aij det(Aij),    for fixed j.    (3.7)

Example 3.5 We compute the determinant of the matrix used in example 3.1 using an
expansion of the first row:

        [ 1  2  3 ]
    det [ 4  5  6 ] = 1 · det [ 5  6 ] − 2 · det [ 4  6 ] + 3 · det [ 4  5 ] = 24.
        [ 7  8  1 ]           [ 8  1 ]           [ 7  1 ]           [ 7  8 ]

The determinant satisfies several useful properties, which we may formulate in the fol-
lowing proposition:

Proposition 3.1 The following identities hold for determinants:

1. det(BC) = det(B) · det(C),  (B, C ∈ Rn×n)

2. U ∈ Rn×n upper triangular ⇒ det(U) = ∏_{i=1}^{n} uii

3. L ∈ Rn×n lower triangular ⇒ det(L) = ∏_{i=1}^{n} ℓii

4. P ∈ Rn×n permutation matrix ⇒ det(P) = +1 if an even number of row changes is needed
   to obtain P from In, and det(P) = −1 if an odd number of row changes is needed.

Proof of 2. Consider the following upper-triangular matrix U:

        [ u11  ×    ×    · · ·  ×   ]
        [  0   u22  ×    · · ·  ×   ]
    U = [       0   u33  · · ·  ×   ] .
        [                  ⋱        ]
        [  0                   unn  ]

We expand on the first column to yield

    det(U) = u11 det(U^(2)) + u21 det(· · ·) + · · ·         (the u21 term vanishes, since u21 = 0)
           = u11 u22 det(U^(3)) + u32 det(· · ·) + · · ·     (the u32 term vanishes, since u32 = 0)
           = · · ·
           = ∏_{i=1}^{n} uii.  □

The proofs of 1, 3, and 4 are similar and are left as an exercise for the reader.

Recall that we may solve the linear system A~x = ~b using Gaussian elimination as follows:

• Phase 1: P A = LU

• Phase 2: L~y = P~b

• Phase 3: U ~x = ~y

However, recall that the algorithm for the LU decomposition (phase 1) can only be
performed if there are no divisions by zero. How can we guarantee that this will not occur?

Proposition 3.2 Consider a matrix A ∈ Rn×n. Then det(A) ≠ 0 if and only if the decom-
position P A = LU has uii ≠ 0 ∀ i.

Proof.

    P A = LU
    det(P A) = det(LU)
    det(P) det(A) = det(L) det(U),

with det(P) = ±1, det(L) = 1 and det(U) = ∏_{i=1}^{n} uii, so that

    ± det(A) = ∏_{i=1}^{n} uii.

Thus we have det(A) ≠ 0 ⇔ uii ≠ 0 ∀ i. □



Intermission: Cramer’s Rule


Consider a set of linear equations in matrix form A~x = ~b as in equation (3.2). For a 3 × 3
system the determinant det(A) is given by

             | a11  a12  a13 |
    det(A) = | a21  a22  a23 | .    (3.8)
             | a31  a32  a33 |

If we multiply det(A) by x1 and apply a property of determinants, we may take the x1 inside
the determinant along one of the columns:

                    | a11  a12  a13 |   | x1 a11  a12  a13 |
    x1 det(A) = x1  | a21  a22  a23 | = | x1 a21  a22  a23 | .    (3.9)
                    | a31  a32  a33 |   | x1 a31  a32  a33 |
By another property of determinants, we may add to a column a linear combination of the
other columns without changing the determinant. We write

                 | x1 a11 + x2 a12 + x3 a13   a12  a13 |   | b1  a12  a13 |
    x1 det(A) =  | x1 a21 + x2 a22 + x3 a23   a22  a23 | = | b2  a22  a23 | .    (3.10)
                 | x1 a31 + x2 a32 + x3 a33   a32  a33 |   | b3  a32  a33 |

We define Di as the determinant of A with the ith column replaced by ~b. If det(A) ≠ 0,
it follows from (3.10) that xi = Di / det(A). So, for our simple linear system:

         | b1  a12  a13 |           | a11  b1  a13 |           | a11  a12  b1 |
         | b2  a22  a23 |           | a21  b2  a23 |           | a21  a22  b2 |
         | b3  a32  a33 |           | a31  b3  a33 |           | a31  a32  b3 |
    x1 = ----------------- ,   x2 = ----------------- ,   x3 = ----------------- .
             det(A)                     det(A)                     det(A)

This procedure is known as Cramer’s Rule and can be generalized to a set of n equations.
Using this method, we can solve our linear system by calculating the determinants of n + 1
matrices.

Consider the general case of A ∈ Rn×n . In order to apply Cramer’s Rule we must compute
n + 1 determinants, each of a matrix of size n × n. If we use the recursive method to
compute each of these determinants, for each of the n + 1 determinants, we must compute n
determinants of a matrix of size (n − 1) × (n − 1). A short calculation reveals that we need
to compute (n + 1)! determinants in the general case. This complexity is much higher than
for any polynomial-time algorithm; in fact, it is much worse than even an exponential-time
algorithm! Therefore, calculation of the determinant as in the proof of Proposition 3.2,
which requires O(n3 ) operations, is a much better idea.
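The rule xi = Di / det(A) is easy to state in code. The sketch below is only an illustration of the formula:
the determinants are computed with NumPy's LU-based det routine (O(n³) each), so it sidesteps the
catastrophic (n + 1)! cost of recursive cofactor expansion discussed above, but it is still far more expensive
than a single LU solve. The test right-hand side is an illustrative choice.

import numpy as np

def cramer_solve(A, b):
    # Cramer's rule: x_i = D_i / det(A), where D_i is det(A) with column i replaced by b
    det_A = np.linalg.det(A)
    x = np.empty(len(b))
    for i in range(len(b)):
        Ai = A.copy()
        Ai[:, i] = b                       # replace column i by the right-hand side
        x[i] = np.linalg.det(Ai) / det_A
    return x

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 1.0]])
b = np.array([6.0, 15.0, 16.0])            # chosen so that the exact solution is (1, 1, 1)
print(cramer_solve(A, b))
print(np.linalg.solve(A, b))               # reference solution for comparison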

3.3 Condition and Stability


3.3.1 The Matrix Norm
The matrix norm and the vector norm are both tools designed to allow us to measure the
size of either a matrix or a vector. We consider a particular set of matrix norms, namely
those induced by a vector norm. These are also known as the set of “natural” matrix norms
over the vector space of matrices (for matrices of the form A ∈ Rn×n , the set V = Rn×n is
a vector space over R.)
Definition 3.5 The natural matrix p-norm for p = 1, 2 or ∞ is defined by

    ‖A‖p = max_{~x ≠ 0} ‖A~x‖p / ‖~x‖p.    (3.11)
An inequality then follows from this definition by rearranging the defining equation
(3.11).
Proposition 3.3 kA~xkp ≤ kAkp k~xkp .
We note that we also have the following properties associated with the matrix norm:
Proposition 3.4 Consider a matrix A ∈ Rn×n with elements aij. Then

1. ‖A‖1 = max_{1≤j≤n} ∑_{i=1}^{n} |aij|  (maximum absolute column sum)

2. ‖A‖∞ = max_{1≤i≤n} ∑_{j=1}^{n} |aij|  (maximum absolute row sum)

3. ‖A‖2 = max_{1≤i≤n} λi^{1/2}, where λi are the eigenvalues of A^T A. We note that λi = σi²,
   with σi the singular values of A.
We also note that the natural matrix norms satisfy the triangle inequality:
Proposition 3.5 Consider matrices A, B ∈ Rn×n with p = 1, 2, ∞. Then
kA + Bkp ≤ kAkp + kBkp .

Proof. For any matrix norm k · k = k · kp we have


k(A + B)~xk = kA~x + B~xk ≤ kA~xk + kB~xk,
from the vector triangle inequality. Then, by proposition 3.3,
k(A + B)~xk ≤ kAkk~xk + kBkk~xk,
and so
k(A + B)~xk
≤ kAk + kBk.
k~xk
Thus, by the definition of the matrix norm we obtain
kA + Bk ≤ kAk + kBk. 
We can now show that the natural matrix norm satisfies the defining properties of a
norm:

Proposition 3.6 kAkp is a norm:

1. kAkp ≥ 0, kAkp = 0 ⇔ A = 0.

2. kαAkp = |α|kAkp .

3. kA + Bkp ≤ kAkp + kBkp .

Since the natural matrix p-norm satisfies the properties of a norm, it can be used as a
mechanism for measuring the “size” of a matrix. This will be useful later when considering
condition and stability.
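The three formulas of Proposition 3.4 can be checked directly against library routines; the sketch below
does so for a small test matrix (the matrix itself is an arbitrary illustrative choice).

import numpy as np

A = np.array([[1.0, 2.0],
              [0.5, 1.001]])

norm1    = np.abs(A).sum(axis=0).max()                  # maximum absolute column sum
norm_inf = np.abs(A).sum(axis=1).max()                  # maximum absolute row sum
norm2    = np.sqrt(np.linalg.eigvalsh(A.T @ A).max())   # sqrt of largest eigenvalue of A^T A

print(norm1,    np.linalg.norm(A, 1))                   # should agree
print(norm_inf, np.linalg.norm(A, np.inf))              # should agree
print(norm2,    np.linalg.norm(A, 2))                   # should agree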

3.3.2 Condition of the problem A~x = ~b


The linear system problem can be reformulated as follows:

Problem Find ~x from A~x = ~b (i.e., ~x = A−1~b).

From this statement of the problem, we may write ~x = f~(A, ~b). If we want to consider the
condition of this problem, there are two inputs which can be perturbed and which contribute
to the condition number. We want to consider the change ∆~x if we have inputs A + ∆A
and ~b + ∆~b.

Example 3.6 Consider the linear system A~x = ~b and solution ~x given by

    [ 1      2     ] [ x1 ]   [ 3   ]           [ 1 ]
    [ 0.499  1.001 ] [ x2 ] = [ 1.5 ] ,    ~x = [ 1 ] .

We perturb A by a small matrix ∆A to yield a new linear system and solution given by

    [ 1      2     ] [ x1 ]   [ 3   ]           [ 3 ]
    [ 0.500  1.001 ] [ x2 ] = [ 1.5 ] ,    ~x = [ 0 ] .

Thus a small change in A results in a large change in ~x; this seems to imply that the problem
is ill conditioned.

[Figure: the first equation and the second equation plotted as lines in the (x2, x1)-plane.]



In order to examine the condition of the initial problem P (A~x = ~b) we need to consider
a slight perturbation on the input data A and ~b,

(A + ∆A)(~x + ∆~x) = ~b + ∆~b. (3.12)

We now look for bounds on κR . We note that it is difficult to derive a bound on κR without
considering a simplification of the defining equation (3.12). As a result, we consider the two
most obvious simplified cases:

Case 1: Consider the case of no perturbation in A, i.e. ∆A = 0 with ∆~b ≠ 0. Equation
(3.12) becomes

    A(~x + ∆~x) = ~b + ∆~b   ⟹   A ∆~x = ∆~b   ⟹   ∆~x = A−1 ∆~b,    (3.13)

where we used A~x = ~b.

We take the norm of both sides of this expression to yield

    ‖∆~x‖ = ‖A−1 ∆~b‖ ≤ ‖A−1‖ ‖∆~b‖.    (3.14)

We may also take the norm of both sides of A~x = ~b to yield

    ‖~b‖ = ‖A~x‖   ⇒   ‖~b‖ ≤ ‖A‖ · ‖~x‖   ⇒   ‖~b‖ / ‖A‖ ≤ ‖~x‖.    (3.15)

From equations (3.14) and (3.15) we have that

    ‖∆~x‖ / ‖~x‖ ≤ ‖A−1‖ ‖A‖ · ‖∆~b‖ / ‖~b‖.

This may be rearranged to give the relative condition number

    κR = (‖∆~x‖/‖~x‖) / (‖∆~b‖/‖~b‖) ≤ ‖A‖ ‖A−1‖.    (3.16)

Thus we know ‖A‖ · ‖A−1‖ is an upper bound for κR.

Definition 3.6 The condition number of a matrix A is κ(A) = ‖A‖ ‖A−1‖.

Thus if κ(A) is small, problem P is well conditioned. The natural error due to rounding
in ~b produces an error of relative magnitude ‖∆~b‖/‖~b‖ ≈ εmach, so it follows that there will be
an error in ~x of relative magnitude ‖∆~x‖/‖~x‖ ≤ κ(A) · εmach.

Case 2: We now consider the case of no perturbation in ~b, i.e. ∆A ≠ 0 with ∆~b = 0.


Equation (3.12) becomes
(A + ∆A)(~x + ∆~x) = ~b. (3.17)

We expand this equation and subtract A~x = ~b. This allows us to bound ‖∆~x‖:

    A ∆~x = −∆A (~x + ∆~x)
    ∆~x = −A−1 ∆A (~x + ∆~x)
    ‖∆~x‖ ≤ ‖A−1‖ ‖∆A‖ ‖~x + ∆~x‖
    ‖∆~x‖ / ‖~x + ∆~x‖ ≤ ‖A−1‖ ‖∆A‖ ‖A‖ ‖A‖⁻¹.

We may apply the approximation ‖∆~x‖/‖~x + ∆~x‖ ≈ ‖∆~x‖/‖~x‖ for ∆~x small. This gives us
an expression for the relative condition number in this case:

    κR = (‖∆~x‖/‖~x‖) / (‖∆A‖/‖A‖) ≤ ‖A‖ ‖A−1‖.    (3.18)

Case 3: In the case of perturbations in A and ~b, i.e. ∆A ≠ 0 and ∆~b ≠ 0, it can also be
shown that κR ≤ κ(A) = kAkkA−1 k. However, the derivation is tedious and so will not be
given here.

From these three cases, it appears that the condition number of a matrix κ(A) is all
we need to determine the condition number of the problem P defined by the linear system
A~x = ~b. In particular, we note that the 2-condition number of a matrix (defined by using
the 2-norm) has a useful property that makes it unnecessary to compute the inverse A−1 :

Proposition 3.7 For a matrix A ∈ Rn×n,

    κ2(A) = ‖A‖2 ‖A−1‖2 = √( λmax(A^T A) / λmin(A^T A) ) = σmax(A) / σmin(A).    (3.19)

Proof. We know that ‖A‖2 = max_{1≤i≤n} (λi(A^T A))^{1/2}, where λi(A^T A) > 0 ∀ i and λi are
eigenvalues of A^T A. Also, for any matrix B with det(B) ≠ 0,

    B ~xi = λi ~xi   ⇒   ~xi = λi B⁻¹ ~xi   ⇒   λi⁻¹ ~xi = B⁻¹ ~xi,

and so the λi⁻¹ are eigenvalues of B⁻¹. Using this property with B = A A^T, we obtain

    ‖A⁻¹‖2 = max_{1≤i≤n} ( λi((A⁻¹)^T A⁻¹) )^{1/2}
           = max_{1≤i≤n} ( λi((A A^T)⁻¹) )^{1/2}
           = ( min_{1≤i≤n} ( λi(A A^T) )^{1/2} )⁻¹.

In addition, if λi are eigenvalues of A A^T then λi are also eigenvalues of A^T A, since

    A A^T ~x = λi ~x   ⇒   A^T A A^T ~x = λi A^T ~x   ⇒   A^T A ~y = λi ~y    (with ~y = A^T ~x).

Thus the result follows from the definition of the condition number of a matrix. □

Example 3.7 Compute the condition number of

    A = [ 1    2     ] .
        [ 0.5  1.001 ]

The eigenvalues of AT A may be calculated as λ1 ≈ 1.6 × 10−7 and λ2 ≈ 6.25. From


(3.19) we get κ2 (A) = 6.25 × 103 , which is a large value. We conclude the problem A~x = ~b
of example 3.6 is ill-conditioned (or the matrix A is ill-conditioned).
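The value quoted in Example 3.7 can be reproduced with a few lines of NumPy; a sketch using the
singular-value form of Proposition 3.7, so that A⁻¹ never has to be formed:

import numpy as np

A = np.array([[1.0, 2.0],
              [0.5, 1.001]])

sigma = np.linalg.svd(A, compute_uv=False)   # singular values of A
kappa2 = sigma.max() / sigma.min()           # kappa_2(A) = sigma_max / sigma_min

print(kappa2)                                # roughly 6.25e3, as in Example 3.7
print(np.linalg.cond(A, 2))                  # the library routine gives the same value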

3.3.3 Stability of the LU Decomposition Algorithm


We wish to consider a related problem, namely:

Problem Consider the mathematical problem defined by z(x) = a/x with a constant.

We now wish to know the absolute and relative condition number of this problem. The
absolute condition number is computed from

    κA = |∆z| / |∆x| ≈ |dz(x)/dx| = |−a/x²| = |a| / x².
Thus, if x is small then this problem is ill-conditioned with respect to the absolute error.
The relative condition number is computed from

    κR = (|∆z|/|z|) / (|∆x|/|x|) ≈ |dz(x)/dx · x/z| = |−a/x² · x²/a| = 1,
so the problem is well-conditioned with respect to the relative error. These calculations in-
dicate that dividing by a small number (or multiplying by a large number) is ill-conditioned;
this is a bad step in any algorithm because the absolute error may increase by a lot.
We conclude that the solution of a linear system may be computed inaccurately if the
algorithm performs many divisions by small numbers. Consider the following example, with δ small.

Example 3.8 Solve the linear system

    [ δ  1 ] [ x1 ]   [ 1 ]
    [ 1  1 ] [ x2 ] = [ 2 ] .

Applying Gaussian elimination yields

    [ δ  1       ] [ x1 ]   [ 1       ]
    [ 0  1 − 1/δ ] [ x2 ] = [ 2 − 1/δ ] .

We solve for x1 and x2 to obtain

    x2 = (2 − 1/δ) / (1 − 1/δ) = ((1/δ)(2δ − 1)) / ((1/δ)(δ − 1)) = (2δ − 1)/(δ − 1) ≈ 1,
    x1 = (1/δ)(1 − x2) = (1/δ)(1 − (2δ − 1)/(δ − 1)) = (1/δ)·((δ − 1 − 2δ + 1)/(δ − 1)) = (1/δ)·(−δ/(δ − 1)) = −1/(δ − 1) ≈ 1.
Thus for small δ we have that (x1 x2 ) ≈ (1 1).
Under the finite precision system F[b = 10, m = 4, e = 5] with δ = 10−5 we have by
backward substitution

    x̂2 = fl(2 − 1/δ) / fl(1 − 1/δ) = fl(2 − 10⁵) / fl(1 − 10⁵) = fl(−99998) / fl(−99999) = 1,
    x̂1 = fl( (1/δ)(1 − x̂2) ) = 0,

and so have generated a large error in x̂1.

Consider another approach to this problem, where we first interchange the two equations
(and so use pivot 1 instead of δ). We rewrite the linear system as

    [ 1  1 ] [ x1 ]   [ 2 ]
    [ δ  1 ] [ x2 ] = [ 1 ] .

Thus, after applying Gaussian elimination

    [ 1  1     ] [ x1 ]   [ 2      ]
    [ 0  1 − δ ] [ x2 ] = [ 1 − 2δ ] .

We recompute under the finite precision system:

    x̂2 = fl(1 − 2δ) / fl(1 − δ) = fl(1 − 2·10−5) / fl(1 − 10−5) = fl(0.99998) / fl(0.99999) = 1,
    x̂1 = fl(2 − x̂2) = fl(2 − 1) = 1,

and so avoid the large error.
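The same effect can be reproduced in IEEE double precision, although a much smaller δ is needed than in
the 4-digit decimal system above; the value δ = 10⁻²⁰ below is an assumption chosen purely so that the
cancellation is visible. The sketch follows the two hand computations of Example 3.8.

delta = 1.0e-20                              # far below the double-precision resolution of 1

# elimination without pivoting (pivot = delta), as in the first computation
x2 = (2.0 - 1.0 / delta) / (1.0 - 1.0 / delta)
x1 = (1.0 - x2) / delta
print("no pivoting:  ", x1, x2)              # x1 comes out as 0.0 -- completely wrong

# elimination after swapping the rows (pivot = 1), as in the second computation
x2 = (1.0 - 2.0 * delta) / (1.0 - delta)
x1 = 2.0 - x2
print("with pivoting:", x1, x2)              # both components are close to 1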

We now consider the general case. Recall

           [ a11   a12      a13      · · · ]             [ 1                           ]
    A(2) = [  0    a22^(2)  a23^(2)  · · · ] ,     M2 =  [     1                       ]
           [  0    a32^(2)  a33^(2)  · · · ]             [    −a32^(2)/a22^(2)   1     ]
           [  ⋮      ⋮        ⋮       ⋱   ]             [         ⋮               ⋱   ]
                                                         [    −an2^(2)/a22^(2)       1 ]

We note that we divide by the pivot element a22^(2) in M2. In order to minimize the error, we
should rearrange the rows in every step of the algorithm so that we get the largest possible
pivot element (in absolute value). This will give us the most stable algorithm for computing
the solution to the linear system since we avoid divisions by small pivot elements. This
approach is called LU decomposition with partial pivoting.

3.4 Iterative Methods for solving A~x = ~b


Recall that the LU decomposition requires W = O(n3 ) time to compute. If we specialize the
type of matrix we wish to solve, there are different algorithms that may be able to solve the
system more efficiently. This is especially true for sparse matrices, where iterative methods
are very useful (these techniques are similar to fixed-point iteration).
Definition 3.7 A ∈ Rn×n is a sparse matrix if and only if the number of nonzero ele-
ments in A is much smaller than n2 .
For many sparse matrix applications, the number of nonzero elements per row is restricted
to some small constant, e.g. 10. We can often store matrices of this type very efficiently (for
example, compressed sparse row (CSR) format can be used). MATLAB supports a sparse
storage format which can be invoked by B = sparse(A).
Consider the special type of matrix:
Definition 3.8 A ∈ Rn×n is strictly diagonally dominant if and only if

    |aii| > ∑_{j=1, j≠i}^{n} |aij|    (3.20)

for all rows i.


This type of matrix has the following useful property:
Proposition 3.8 A strictly diagonally dominant matrix is non-singular.

Proof. Left as an exercise for the reader (this can be easily proven with the Gershgorin
Circle Theorem).

Example 3.9 Consider the matrix A ∈ R3×3:

        [ 7  2   0 ]      row 1: 7 > 2
    A = [ 3  5  −1 ]      row 2: 5 > 3 + 1
        [ 0  5  −6 ]      row 3: 6 > 5

Clearly, A is strictly diagonally dominant. In general if A is strictly diagonally dominant,
the transpose of A does not necessarily retain the same property:

          [ 7   3   0 ]      row 1: 7 > 3
    A^T = [ 2   5   5 ]      row 2: 5 ≯ 2 + 5
          [ 0  −1  −6 ]      row 3: 6 > 1

3.4.1 Jacobi and Gauss-Seidel Methods


We wish to solve the linear system A~x = ~b using an iterative method. We write at each step

    ~x^old = ( x1^old, x2^old, x3^old )^T    (3.21)

and determine ~xnew from ~xold by either the Jacobi or Gauss-Seidel algorithm.

Jacobi: We assume A ∈ R3×3 and construct a system of equations as follows:

    a11 x1^new + a12 x2^old + a13 x3^old = b1
    a21 x1^old + a22 x2^new + a23 x3^old = b2        (3.22)
    a31 x1^old + a32 x2^old + a33 x3^new = b3.
This system may be easily rearranged to solve for ~x^new, giving us the defining equation for
the Jacobi method:

    xi^new = (1/aii) ( bi − ∑_{j=1, j≠i}^{n} aij xj^old ).    (3.23)

Gauss-Seidel: In the Jacobi method we may write ~x^new = J(~x^old) for some function J.
However, there is no reason to ignore the elements of xi^new derived earlier in the same step:
we can indeed construct a linear system as

    a11 x1^new + a12 x2^old + a13 x3^old = b1
    a21 x1^new + a22 x2^new + a23 x3^old = b2        (3.24)
    a31 x1^new + a32 x2^new + a33 x3^new = b3.

Rearranging yields the defining equation for the Gauss-Seidel method:

    xi^new = (1/aii) ( bi − ∑_{j=1}^{i−1} aij xj^new − ∑_{j=i+1}^{n} aij xj^old ).    (3.25)

For both of these methods, we must choose a starting vector ~x(0) and generate the
sequence ~x(1), ~x(2), . . . = {~x(i)}_{i=1}^∞. We may also formulate these methods in matrix form
using the decomposition

    A = AL + AD + AR,    (3.26)

where AL is the strictly lower-triangular part of A, AD its diagonal part, and AR its strictly
upper-triangular part.

Under the matrix decomposition, we may write the Jacobi method as

    ~x^new = AD⁻¹ ( ~b − (AL + AR) ~x^old )    (3.27)

and the Gauss-Seidel method as

    ~x^new = AD⁻¹ ( ~b − (AL ~x^new + AR ~x^old) ).    (3.28)

These forms are equivalent to (3.23) and (3.25).


Can we guarantee convergence of the iterative methods? For some classes of matrices the
answer is yes. It depends on the matrix A whether the sequence of iterates converges and if
so, how quickly it does. We can prove the following sufficient condition for convergence:

Theorem 3.3 Consider A~x = ~b and let ~x(0) be any starting vector. Let {~x(i) }∞
i=0 be the
sequence generated by either Jacobi or Gauss-Seidel iterative methods. Then if A is strictly
diagonally dominant the sequence converges to the unique solution of the system A~x = ~b.

Since this theorem is only a sufficient condition, we can have a matrix A that is not
strictly diagonally dominant but leads to a convergent method.
In general, we note that often Gauss-Seidel will converge faster than Jacobi, but this is
not always the case. For sparse matrices, we often obtain W = O(n2 ) for both methods.
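A direct translation of (3.23) and (3.25) into Python/NumPy is sketched below. The strictly diagonally
dominant matrix from Example 3.9 is used as test data, and the right-hand side is chosen (as an illustrative
assumption) so that the exact solution is (1, 1, 1); the iteration stops when the residual, introduced formally
in the next section, is small.

import numpy as np

def jacobi(A, b, x0, tol=1e-10, maxit=500):
    # Jacobi iteration (3.23): every component is updated from the old iterate
    D = np.diag(A)
    R = A - np.diag(D)                           # off-diagonal part AL + AR
    x = x0.copy()
    for _ in range(maxit):
        x = (b - R @ x) / D
        if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
            break
    return x

def gauss_seidel(A, b, x0, tol=1e-10, maxit=500):
    # Gauss-Seidel iteration (3.25): new components are used as soon as they exist
    n = len(b)
    x = x0.copy()
    for _ in range(maxit):
        for i in range(n):
            s_new = A[i, :i] @ x[:i]             # already-updated components
            s_old = A[i, i+1:] @ x[i+1:]         # components still from the old iterate
            x[i] = (b[i] - s_new - s_old) / A[i, i]
        if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
            break
    return x

A = np.array([[7.0, 2.0, 0.0],
              [3.0, 5.0, -1.0],
              [0.0, 5.0, -6.0]])                 # strictly diagonally dominant (Example 3.9)
b = np.array([9.0, 7.0, -1.0])                   # exact solution is (1, 1, 1)
x0 = np.zeros(3)
print(jacobi(A, b, x0))
print(gauss_seidel(A, b, x0))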

3.4.2 Convergence of Iterative Methods


As a practical consideration, we need to determine when to stop iteration for a given iterative
method. In particular, we need a measure of how close we are to a correct solution.

Definition 3.9 The residual of a linear system A~x = ~b for some vector ~u is

~r = ~b − A~u. (3.29)

We write the residual at step i as ~r(i) . As ~r(i) becomes smaller, we approach convergence
of the system. Hence, for a given relative tolerance trel = 10−6 (for example), we compute
the residual at each step and stop when k~r(i) k2 /k~r(0) k2 ≤ trel .
For an iterative approximation ~u, the error is given by ~e = ~x − ~u. This leads to the
following relation between the error and the residual:

A~e = A(~x − ~u)


= A~x − A~u
= ~b − A~u
= ~r.

There are several benefits to using the residual instead of the error:
• ~r can be calculated easily (because calculating a matrix-vector product is “cheap” in
terms of computational work compared with solving a linear system).
• ~e is generally unknown.
Sometimes ~r is small but ~e is large (this may happen when κ(A) is large). Nevertheless,
we will assume our linear system is well-conditioned and use ~r instead of ~e in the stopping
criterion.

If we knew the error ~e for any approximation ~u we could write out the solution to the
linear system directly:

    ~x = ~u + (~x − ~u) = ~u + ~e,

where ~u is the (approximate) solution and ~e is the error.

However, since the error is unknown, we can instead use the residual:

~x = ~u + A−1~r. (3.30)

Unfortunately, inverting A is an expensive operation (3 times as expensive as actually solving


A~x = ~b). Instead of using A−1 , we can choose a matrix B that is easy to invert such that
B −1 is an approximation for A−1 . Then, we can obtain an approximation for ~x by the
formula
~x ≈ ~u + B −1~r. (3.31)
Repeated application of this formula leads to a standard general form for an iterative method:

~x(i+1) = ~x(i) + B −1~r(i) ,


or with the definition of the residual

~x(i+1) = ~x(i) + B −1 (~b − A~x(i) ).

If B is chosen appropriately then ~x(i) will converge to ~x.


This approach is basically fixed point iteration: we have defined a function G such that
~x(i+1) = G(~x(i) ). If we choose ~x(i) = ~x then ~x(i+1) = ~x has to hold as well; thus ~x = G(~x)
is a fixed point of G. The convergence theorem for iterative methods is stated as follows:

Theorem 3.4 (Convergence of Iterative Methods) Consider the iterative method

~x(i+1) = ~x(i) + B −1 (~b − A~x(i) ) (3.32)

with det(A) 6= 0. If there exists a p-norm for which kI − B −1 Akp < 1 then the iterative
method will converge for any starting value ~x(0) . Convergence will then hold in any p-norm.

Proof. Consider the error ~e(i) = ~x − ~x(i). By (3.32) we may write

    ~x(i+1) − ~x = ~x(i) − ~x + B⁻¹ (A~x − A~x(i))
    −~e(i+1) = −~e(i) + B⁻¹ A ~e(i)
    ~e(i+1) = (I − B⁻¹A) ~e(i)
    ‖~e(i+1)‖p = ‖(I − B⁻¹A) ~e(i)‖p
              ≤ ‖I − B⁻¹A‖p ‖~e(i)‖p.

If we apply this result recursively, we obtain

    ‖~e(i+1)‖p ≤ ‖I − B⁻¹A‖p^(i+1) ‖~e(0)‖p.

Taking the limit as i → ∞ with ‖I − B⁻¹A‖p < 1 yields

    lim_{i→∞} ‖~e(i+1)‖p = 0,

which is equivalent to

    lim_{i→∞} ~x(i) = ~x.  □

We note that kI −B −1 Akp < 1 means that I ≈ B −1 A, or B −1 ≈ A−1 . Recall from before
that we require B −1 to be an approximation for A−1 in order to apply equation (3.31). The
convergence theorem is very similar to Theorem 2.3 (The Contraction Mapping Theorem),
which governs convergence of the fixed point method. In fact, Banach’s Contraction Theorem
(from real analysis) provides a generalization of both theorems.
Since (3.32) is a general form for iterative methods, we should be able to determine B for
the Jacobi and Gauss-Seidel methods. We rewrite the matrix form of Jacobi into standard
form:

    ~x(i+1) = AD⁻¹ ( ~b − (AL + AR) ~x(i) )
            = AD⁻¹ ( ~b − (A − AD) ~x(i) )            (from (3.26))
            = AD⁻¹ AD ~x(i) + AD⁻¹ ( ~b − A ~x(i) )
            = ~x(i) + AD⁻¹ ( ~b − A ~x(i) ).

This is the standard form of an iterative method with B⁻¹ = AD⁻¹ (so A⁻¹ is approximated
by AD⁻¹). We can rewrite the Gauss-Seidel method in a similar manner to recover B⁻¹ =
(AD + AL)⁻¹ (left as an exercise).
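The sufficient condition of Theorem 3.4 can be checked numerically for a given matrix. The sketch below
forms I − B⁻¹A for the Jacobi choice B = AD and the Gauss-Seidel choice B = AD + AL and prints its
∞-norm; the test matrix is again the strictly diagonally dominant one from Example 3.9 (an illustrative
choice).

import numpy as np

A = np.array([[7.0, 2.0, 0.0],
              [3.0, 5.0, -1.0],
              [0.0, 5.0, -6.0]])

B_jacobi = np.diag(np.diag(A))                  # B = AD
B_gs     = np.tril(A)                           # B = AD + AL

for name, B in (("Jacobi", B_jacobi), ("Gauss-Seidel", B_gs)):
    M = np.eye(len(A)) - np.linalg.solve(B, A)  # I - B^{-1} A, without forming B^{-1}
    print(name, np.linalg.norm(M, np.inf))      # a value < 1 guarantees convergence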

These notes have been funded by...


Chapter 4

Discrete Fourier Methods

4.1 Introduction
We define a complex number z as z = a + ib, with a the real part of z, b the imaginary
part, and i = √−1. As such, we may depict a complex number z = a + ib as a vector in the
complex plane, with Re(z) = a on the horizontal axis and Im(z) = b on the vertical axis.

Recall that for any complex number z = a + ib we have:

Term                   Definition
Complex conjugate      z̄ = a − ib
Real part              Re(z) = a
Imaginary part         Im(z) = b
Modulus                r = |z| = √(a² + b²)
Phase angle            θ = arctan(b/a)

We may write the complex number in terms of the modulus and phase angle
z = r exp(iθ), (4.1)
in conjunction with the Euler formulas:
exp(iθ) = cos(θ) + i sin(θ), (4.2)
exp(−iθ) = cos(θ) − i sin(θ). (4.3)


We may invert these formulas in order to write sinusoids in terms of complex exponentials:

    cos(θ) = (1/2) (exp(iθ) + exp(−iθ)),       (4.4)
    sin(θ) = (1/(2i)) (exp(iθ) − exp(−iθ)).    (4.5)
Consider an arbitrary wave signal given by y(t) = sin(2π · kt) for some constant k. The
frequency of the wave is defined by
f = k, (4.6)
and has dimensions of oscillations per second (or Hz). The period is defined as the time
required to complete one oscillation. It is related to the frequency by
1 1
T = = , (4.7)
f k
and has dimensions of seconds. The angular frequency ω is related to the frequency by
ω = 2π f (4.8)
with dimensions of radians per second. These definitions can be easily generalized to any
periodic function (not just sinusoids). This characterization is broadly applicable to sound
waves, water waves and any other periodic behaviour. Human audible sound, for example,
occurs in the frequency range from 20Hz to 20kHz.

For example:

• y(t) = sin(2πt):          f = 1 oscillation/sec = 1 Hz,    T = 1 sec,      ω = 2π radians/sec
• y(t) = sin(2π · (1/3)t):  f = 1/3 Hz,                      T = 3 sec,      ω = (2/3)π radians/sec
• y(t) = sin(2π · 3t):      f = 3 Hz,                        T = 1/3 sec,    ω = 6π radians/sec

4.2 Fourier Series


4.2.1 Real form of the Fourier Series
Consider a continuous function f (x) on [a, b]. In general, f (x) can be expanded in a Fourier
series as follows:


Definition 4.1 The Fourier series of f(x) with x ∈ [a, b] is

    g(x) = a0/2 + ∑_{k=1}^{∞} [ ak cos( 2πkx/(b−a) ) + bk sin( 2πkx/(b−a) ) ],    (4.9)

with

    ak = (2/(b−a)) ∫_a^b f(x) cos( 2πkx/(b−a) ) dx,    (4.10)
    bk = (2/(b−a)) ∫_a^b f(x) sin( 2πkx/(b−a) ) dx.    (4.11)
Example 4.1 Compute the Fourier series of the function f(x) defined by

    f(x) = { −π/4,   x ∈ [−π, 0)
           {  π/4,   x ∈ [0, π]

over the interval [a, b] = [−π, π] ⇒ b − a = 2π.

We note that f(x) is an odd function. Since cos(kx) is even, it follows that f(x) cos(kx)
is odd. Thus,

    ak = (2/2π) ∫_{−π}^{π} f(x) cos(kx) dx = 0

for all k.
for all k. We compute bk as follows:
2

bk = 2π −π
f (x) sin(kx)dx
4

= 2π 0 f (x) sin(kx)dx (because odd × odd = even)

= 21 0 sin(kx)dx
11 π
= 2 k [− cos(kx)]0
1
= 2k (− cos(πk) + 1)
1 k+1
= 2k ((−1) − 1).

Thus

    bk = { 1/k,  if k is odd,
         { 0,    if k is even.

The complete Fourier series for f(x), given by (4.9), may then be written as

    g(x) = sin(x) + (1/3) sin(3x) + (1/5) sin(5x) + · · · .
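The coefficients just derived are easy to confirm numerically. The sketch below approximates the integrals
(4.10) and (4.11) for the square wave with a simple midpoint rule (the number of quadrature points is an
arbitrary illustrative choice); the printed values show ak ≈ 0 for all k and bk ≈ 1/k for odd k, 0 for even k.

import numpy as np

def f(x):
    return np.where(x < 0.0, -np.pi / 4.0, np.pi / 4.0)    # square wave of Example 4.1

N = 20000
x = (np.arange(N) + 0.5) * (2.0 * np.pi / N) - np.pi        # midpoints of N subintervals of [-pi, pi]
dx = 2.0 * np.pi / N

for k in range(1, 6):
    ak = (1.0 / np.pi) * np.sum(f(x) * np.cos(k * x)) * dx  # midpoint-rule approximation of (4.10)
    bk = (1.0 / np.pi) * np.sum(f(x) * np.sin(k * x)) * dx  # midpoint-rule approximation of (4.11)
    print(k, round(ak, 6), round(bk, 6))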
The Fourier coefficients of odd and even functions simplify under the following proposi-
tion:
Proposition 4.1 The Fourier coefficients of a function f (t) satisfy
f (t) even =⇒ bk = 0 ∀ k,
f (t) odd =⇒ ak = 0 ∀ k.
We may wonder: what functions can be expressed in terms of a Fourier series. The
following fundamental theorem, originally proposed by Dirichlet, describes how the Fourier
series g(x) is related to the original function f (x).
Theorem 4.1 Fundamental Convergence Theorem for Fourier Series.
Let

    V = { f(x) : √( ∫_a^b f(x)² dx ) < ∞ }.

Then for all f(x) ∈ V there exist coefficients a0, ak, bk (with 1 ≤ k < ∞) such that

    gn(x) = a0/2 + ∑_{k=1}^{n} [ ak cos( 2πkx/(b−a) ) + bk sin( 2πkx/(b−a) ) ]

converges to f(x) as n → ∞, in the sense that √( ∫_a^b (f(x) − g(x))² dx ) = 0 with g(x) = lim_{n→∞} gn(x).

Note that gn (x) is sometimes also called Sn (x) (the nth partial sum of the Fourier series).
This theorem holds for any bounded interval [a, b], but for simplicity we will generally
consider the interval to be [a, b] = [−π, π].
V is called the set of square integrable functions over [a, b] and contains all polynomials,
sinusoids and other nicely-behaved bounded functions. However, V does not contain many
common functions, including f (x) = tan(x) and f (x) = (x − a)−1 .
In addition, V is a vector space of functions, i.e. if f1 (x) ∈ V and f2 (x) ∈ V , then
c1 f1 (x) + c2 f2 (x) ∈ V (∀ c1 , c2 ∈ R). We define a norm over V by
    ‖h(x)‖2 = √( ∫_a^b h(x)² dx )    (L2 norm).    (4.12)

As a norm, this measures the “size” of the function h(x). This implies a measure of the
“distance” between f (x) ∈ V and g(x) ∈ V by
    ‖f(x) − g(x)‖2 = √( ∫_a^b (f(x) − g(x))² dx ).    (4.13)

We call the set of functions V that are defined on an interval [a, b] L2([a, b]). We write

    L2([a, b]) = { f(x) : ‖f(x)‖2 < ∞ },    (4.14)

where ‖·‖2 is the L2 norm defined by (4.12).

4.2.2 Complex form of the Fourier Series


Perhaps a more natural method of representing the Fourier series is its complex form. This
form of the series has only one sequence of coefficients, but each coefficient is complex-valued.
We will later demonstrate that this form is equivalent to the real form of the series.

Definition 4.2 The complex Fourier series of a function f(t) is

    h(t) = ∑_{k=−∞}^{∞} ck exp(ikt),    (4.15)

with

    ck = (1/2π) ∫_{−π}^{π} f(t) exp(−ikt) dt.    (4.16)

Note: In some books (and in MATLAB!) the complex form of the Fourier series is defined
using a different sign convention, namely

    h(t) = ∑_k ck exp(−ikt)

with

    ck = (1/2π) ∫_{−π}^{π} f(t) exp(ikt) dt.

Relation between ck and ak, bk: We may apply Euler's formula (4.3) to the coefficient
formula (4.16) to obtain

    ck = (1/2π) ∫_{−π}^{π} f(t) (cos(−kt) + i sin(−kt)) dt
       = (1/2) [ (1/π) ∫_{−π}^{π} f(t) cos(kt) dt − i (1/π) ∫_{−π}^{π} f(t) sin(kt) dt ].

Comparing this expression with (4.10) and (4.11) reveals

    ck = (1/2)(ak − i bk).    (4.17)
This holds for 0 ≤ k, but we can extend it to −∞ < k < ∞.

Proposition 4.2 The real and complex Fourier coefficients of a real function f(t) obey

1. c̄k = c−k

2. a−k = ak, b−k = −bk

3. ak = 2 Re(ck), bk = −2 Im(ck)

4. f(t) even ⇒ Im(ck) = 0, f(t) odd ⇒ Re(ck) = 0

5. b0 = 0, c0 = (1/2) a0.

Proof.
1. From (4.16) we write

    c̄k = (1/2π) ∫_{−π}^{π} f(t) exp(ikt) dt = c−k.

It also follows that Re(ck) = Re(c−k) (the real part of ck is even in k) and Im(ck) =
−Im(c−k) (the imaginary part of ck is odd in k).
2. This result follows from (4.10) and (4.11).
3. This result follows from (4.17).
4. This result follows from Proposition 4.1 in conjunction with (4.17).
5. This result follows from (4.11) and (4.17). 
Although the complex and real forms of the Fourier series appear very different, they
describe the same Fourier series. We demonstrate this result in the following theorem:

Theorem 4.2 The complex and real forms of the Fourier series are equivalent (h(t) = g(t)).

Proof. By the Euler formulas (4.4) and (4.5) we may write

    g(t) = a0/2 + ∑_{k=1}^{∞} ( ak cos(kt) + bk sin(kt) )
         = a0/2 + ∑_{k=1}^{∞} [ (ak/2)(exp(ikt) + exp(−ikt)) + (bk/(2i))(exp(ikt) − exp(−ikt)) ].

We use i² = −1 ⟹ 1/i = −i to write

    g(t) = a0/2 + ∑_{k=1}^{∞} [ ((ak − i bk)/2) exp(ikt) + ((ak + i bk)/2) exp(−ikt) ].

Then by Proposition 4.2,

    g(t) = c0 + ∑_{k=1}^{∞} [ ((ak − i bk)/2) exp(ikt) + ((a−k − i b−k)/2) exp(−ikt) ].

We apply the identity (4.17) and so write

    g(t) = c0 + ∑_{k=1}^{∞} [ ck exp(ikt) + c−k exp(−ikt) ].

Thus,

    g(t) = ∑_{k=−∞}^{∞} ck exp(ikt) = h(t),

which completes the proof. □

Example 4.2 Compute the complex Fourier series of the function f (x), defined by
    f(x) = \begin{cases} -\frac{\pi}{4}, & x \in [-\pi, 0] \\ \frac{\pi}{4}, & x \in [0, \pi] \end{cases}

over the interval [a, b] = [−π, π] ⇒ b − a = 2π.

Recall from Example 4.1 that the real Fourier coefficients are

    a_k = 0 \ \forall k, \qquad b_k = \begin{cases} \frac{1}{k}, & k \text{ odd} \\ 0, & k \text{ even.} \end{cases}

Then by (4.17) the complex coefficients are

    c_k = \begin{cases} -\frac{i}{2k}, & k \text{ odd} \\ 0, & k \text{ even.} \end{cases}

The complex Fourier series is then

    h(t) = \cdots + \frac{i}{6} \exp(-i3t) + \frac{i}{2} \exp(-it) - \frac{i}{2} \exp(it) - \frac{i}{6} \exp(i3t) - \cdots.
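As a rough numerical check of this example, the sketch below (Python with NumPy, which we assume is available) approximates the integral in (4.16) by a simple Riemann sum and compares the result with the closed-form coefficients -i/(2k) for odd k. The function and helper names are illustrative, not part of any standard library.

import numpy as np

def complex_fourier_coefficient(f, k, num_points=10000):
    # Approximate c_k = (1/(2*pi)) * integral_{-pi}^{pi} f(t) exp(-i k t) dt
    # with a simple Riemann sum over num_points sample points.
    t = np.linspace(-np.pi, np.pi, num_points, endpoint=False)
    dt = 2 * np.pi / num_points
    return np.sum(f(t) * np.exp(-1j * k * t)) * dt / (2 * np.pi)

# The square wave of Example 4.2: -pi/4 on [-pi, 0], +pi/4 on [0, pi].
f = lambda t: np.where(t < 0, -np.pi / 4, np.pi / 4)

for k in range(1, 6):
    ck = complex_fourier_coefficient(f, k)
    expected = -1j / (2 * k) if k % 2 == 1 else 0.0
    print(k, ck, expected)   # c_k ~ -i/(2k) for odd k, ~ 0 for even k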

4.3 Fourier Series and Orthogonal Basis


Recall that for the vector space V = R2 we may define a standard basis B = {~e1 , ~e2 } with
~e1 = (1, 0) and ~e2 = (0, 1). The number of elements in the basis B must be equal to the
dimension of the vector space (in this case 2). We may then write any vector ~x ∈ R2 as a
linear combination of the basis vectors:

~x = x1~e1 + x2~e2 . (4.18)

We define the scalar product between two vectors in V as

    ~x · ~y = x_1 y_1 + x_2 y_2 = < ~x, ~y >.   (4.19)

Any scalar product also induces a norm over the vector space by
    \| ~x \|_2 = \sqrt{ < ~x, ~x > } = \sqrt{ x_1^2 + x_2^2 }.   (4.20)

[Figure: the standard basis vectors ~e_1 = (1, 0) and ~e_2 = (0, 1) in the plane.]

In particular, we wish to focus on a particular type of basis which has some useful
properties:

Definition 4.3 We say that a basis B = {~e1 , . . . , ~en } is an orthogonal basis if and only
if
< ~ei , ~ej >= cij for 1 ≤ i, j ≤ n,
where cij is nonzero if and only if i = j.

We note that the standard basis B = {~e1 , ~e2 } defined over V is orthogonal, since

< ~e1 , ~e1 >= 1, < ~e2 , ~e2 >= 1, < ~e1 , ~e2 >= 0.

Given an orthogonal basis {~e1 , ~e2 } we can easily find the components of any vector ~x in
the basis. Consider the scalar product of ~x and ~e1 :

    < ~x, ~e_1 > = < x_1 ~e_1 + x_2 ~e_2, ~e_1 >
                 = x_1 < ~e_1, ~e_1 > + x_2 < ~e_2, ~e_1 >   (the second term vanishes, since < ~e_2, ~e_1 > = 0)
                 = x_1 < ~e_1, ~e_1 >.

Thus we may write

    x_i = < ~x, ~e_i > / < ~e_i, ~e_i >.   (4.21)
We now focus on the vector space of square integrable functions. We define our vector
space by
    V = L^2([-\pi, \pi]) = \{ f(x) : \| f(x) \|_2 < \infty \},   (4.22)
with basis
B = {1, cos(kt), sin(kt)} (1 ≤ k < ∞). (4.23)
This vector space has an infinite (but countable) number of linearly independent basis vectors
and so has infinite (but countable) dimension. Since B forms a basis for V , it follows that
we may write any function f(t) ∈ V as a linear combination of the basis vectors:

    f(t) = a_0 \cdot 1 + \left[ \sum_{k=1}^{\infty} a_k \cos(kt) \right] + \left[ \sum_{k=1}^{\infty} b_k \sin(kt) \right].   (4.24)

The scalar product over this vector space is given by


    < f(t), g(t) > = \int_{-\pi}^{\pi} f(t) g(t) \, dt,   (4.25)

which induces the standard 2-norm

    \| f(t) \|_2 = \sqrt{ < f(t), f(t) > } = \sqrt{ \int_{-\pi}^{\pi} f(t)^2 \, dt }.   (4.26)

In addition, we note that B is an orthogonal basis:

Proposition 4.3 B is an orthogonal basis for V .

Proof. We note that B is a basis due to Theorem 4.1. In order to prove that B is
orthogonal, we need to consider the scalar product of all basis vectors:
    < 1, 1 >               = \int_{-\pi}^{\pi} 1^2 \, dt = 2\pi
    < \cos(kt), \cos(kt) > = \int_{-\pi}^{\pi} \cos^2(kt) \, dt = \pi   (k \ge 1)
    < \sin(kt), \sin(kt) > = \int_{-\pi}^{\pi} \sin^2(kt) \, dt = \pi   (k \ge 1).

(If our basis vectors were normalized, each of these terms would equal 1.) We must now show that the scalar products between different basis vectors all vanish:

    < 1, \cos(kt) > = \int_{-\pi}^{\pi} \cos(kt) \, dt = 0   (k \ge 1)
    < 1, \sin(kt) > = \int_{-\pi}^{\pi} \sin(kt) \, dt = 0   (k \ge 1)
    < \cos(kt), \sin(\ell t) > = \int_{-\pi}^{\pi} \underbrace{\cos(kt)}_{\text{even}} \, \underbrace{\sin(\ell t)}_{\text{odd}} \, dt = 0   (k, \ell \ge 1).

The scalar product between different cosine basis vectors (k \ne \ell) requires a more extensive derivation:

    < \cos(kt), \cos(\ell t) > = \int_{-\pi}^{\pi} \cos(kt) \cos(\ell t) \, dt
                               = \int_{-\pi}^{\pi} \frac{1}{2} [ \cos((k+\ell)t) + \cos((k-\ell)t) ] \, dt
                               = \frac{1}{2} \left[ \frac{1}{k+\ell} \sin((k+\ell)t) \Big|_{-\pi}^{\pi} + \frac{1}{k-\ell} \sin((k-\ell)t) \Big|_{-\pi}^{\pi} \right]
                               = 0.

We must also show < \sin(kt), \sin(\ell t) > = 0 for k \ne \ell \ge 1. This is left as an exercise for the reader.

Since B forms an orthogonal basis, we may use the projection formula (4.21) to determine
the coefficients ak and bk . By projection, we have
    a_k = \frac{ < f(t), \cos(kt) > }{ < \cos(kt), \cos(kt) > } = \frac{1}{\pi} \int_{-\pi}^{\pi} f(t) \cos(kt) \, dt   (4.27)

    b_k = \frac{ < f(t), \sin(kt) > }{ < \sin(kt), \sin(kt) > } = \frac{1}{\pi} \int_{-\pi}^{\pi} f(t) \sin(kt) \, dt   (4.28)

    \frac{a_0}{2} = \frac{ < f(t), 1 > }{ < 1, 1 > } = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(t) \, dt,   (4.29)

which are simply (4.10) and (4.11), the standard formulae for the Fourier coefficients.
For the complex form of the Fourier series, we can recover formula (4.16) by choosing
the alternate basis
B = {exp(ikx)} (−∞ < k < ∞) (4.30)
and using the scalar product for complex functions,

    < f(t), g(t) > = \int_{-\pi}^{\pi} f(t) \, \overline{g(t)} \, dt.   (4.31)

Then, the scalar product between two basis vectors is

    < \exp(ikt), \exp(i\ell t) > = \int_{-\pi}^{\pi} \exp(i(k-\ell)t) \, dt = 2\pi \, \delta_{k\ell},   (4.32)

where \delta_{k\ell} is the Kronecker delta. Applying the projection formula yields

    c_k = \frac{ < f(t), \exp(ikt) > }{ < \exp(ikt), \exp(ikt) > } = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(t) \exp(-ikt) \, dt,   (4.33)

which is (4.16).

4.4 Discrete Fourier Transform


The Fourier series is used to describe a continuous time signal f (t) with t ∈ [a, b], or f (t)
periodic. We wish to turn our attention now to the discrete time signal f [n] over N points
in 0 ≤ n ≤ N − 1. This type of signal may arise from sampling or digital recording of a
continuous time signal, such as in music, images, stock indexes or weather data. Applying
the Discrete Fourier Transform (DFT) to a discrete signal yields the frequencies present in
the signal (this will be explained in more detail later.)
In order to present an expression for the discrete Fourier transform, we first require some
intermediate results:
Figure 4.1: A discrete temperature profile f[n] with discrete Fourier transform F[k].

Definition 4.4 The N th roots of unity are the integer powers of


 

    W_N = \exp\left( i \frac{2\pi}{N} \right).   (4.34)

We may write the N distinct N th roots of unity as

    W_N^k = \exp\left( i \frac{2\pi k}{N} \right)   (4.35)

for 0 \le k < N.
Example 4.3 The N th roots of unity for N = 8 are

    W_8^k = \exp\left( i \frac{\pi k}{4} \right).

In the complex plane they may be depicted as 8 equally spaced points on the unit circle, with W_8^0 = W_8^8 = 1, W_8^4 = -1, and W_8^9 = W_8^1.

[Figure: the 8th roots of unity on the unit circle in the complex plane.]

Proposition 4.4 The N th roots of unity satisfy

    (W_N^k)^N = 1.   (4.36)

Proof. By definition,

    (W_N^k)^N = \exp\left( i \frac{2\pi k N}{N} \right) = \exp(i k 2\pi) = 1.

Proposition 4.5 The N th roots of unity satisfy

    W_N^{-k} = W_N^{N-k}.   (4.37)

Proof. By definition,

    W_N^{-k} = W_N^{N} \cdot W_N^{-k} = W_N^{N-k}.
We are now prepared to define the discrete Fourier transform:

Definition 4.5 The discrete Fourier transform of a discrete time signal f[n] with 0 \le n \le N-1 is

    F[k] = DFT\{f[n]\} = \frac{1}{N} \sum_{n=0}^{N-1} f[n] \, W_N^{-kn},   0 \le k \le N-1.   (4.38)

Definition 4.6 The inverse discrete Fourier transform of a discrete frequency signal F[k] with 0 \le k \le N-1 is

    f[n] = IDFT\{F[k]\} = \sum_{k=0}^{N-1} F[k] \, W_N^{kn},   0 \le n \le N-1.   (4.39)

Note that these expressions are closely related to the complex form of the Fourier series,
given in equation (4.15). Recall that when we compute the Fourier series of a function,
we enforce that it is periodic outside the interval we are examining. This requirement is
analogous in the discrete case; we implicitly assume that the time signal f [n] is periodic
when applying the discrete Fourier transform, which necessarily implies that the frequency
signal is also periodic:

Proposition 4.6 The frequency signal F [k] given by (4.38) is periodic with period N :
F [k] = F [k + sN ] with s ∈ Z. (4.40)

Proof. This result follows from (4.38) and

    W_N^{-(k+sN)n} = W_N^{-kn} \, W_N^{-snN} = W_N^{-kn}.   (4.41)
The discrete Fourier transform also leads to a set of symmetry properties for the frequency
signal F [k]:

Proposition 4.7 For any real time signal f[n], the frequency signal F[k] satisfies

1. Re(F[k]) is even in k

2. Im(F[k]) is odd in k

3. \overline{F[k]} = F[-k]

4. f[n] even in n ⇒ Im(F[k]) = 0 (the DFT is real)

5. f[n] odd in n ⇒ Re(F[k]) = 0 (the DFT is purely imaginary)

Example 4.4 Consider a cosine wave f(t) = \cos(2\pi t), t \in [0, 1]. We sample at N = 6 points:

    f[n] = \cos(2\pi t_n),   t_n = \frac{n}{N},   0 \le n \le N-1.   (4.42)

[Figure: one period of \cos(2\pi t) with the 6 sample points t_n = n/6 marked.]
Using formula (4.4) we can rewrite f[n] as

    f[n] = \frac{1}{2} \exp\left( i 2\pi \frac{n}{N} \right) + \frac{1}{2} \exp\left( -i 2\pi \frac{n}{N} \right)
         = \frac{1}{2} W_N^{n} + \frac{1}{2} W_N^{-n}.

By (4.39) we conclude that the transformed signal is F[1] = 1/2 and F[-1] = 1/2, with all other F[k] zero. Recall that by periodicity, F[N-1] = F[-1].

[Figure: the discrete frequency signal F[k], with spikes of height 1/2 at k = ±1 and at the periodic copies k = ±5, ±7, ....]

We note that cos(2πtn ) is a low frequency wave, but we still get higher frequency com-
ponents like F [5]. This effect is called aliasing.
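A direct implementation of definition (4.38) makes it easy to reproduce this example numerically. The sketch below (Python with NumPy, assumed available) computes F[k] for the sampled cosine with N = 6. Note that NumPy's built-in np.fft.fft omits the 1/N factor of our convention, so the transform here is written out explicitly; the function name dft is ours, not a library routine.

import numpy as np

def dft(f):
    # Direct discrete Fourier transform, eq. (4.38):
    # F[k] = (1/N) * sum_n f[n] * W_N^{-k n},  with W_N = exp(i*2*pi/N).
    N = len(f)
    n = np.arange(N)
    k = n.reshape((N, 1))
    W = np.exp(1j * 2 * np.pi / N)
    return (W ** (-k * n)) @ f / N

N = 6
n = np.arange(N)
f = np.cos(2 * np.pi * n / N)        # Example 4.4: cos(2*pi*t) sampled at t_n = n/N
F = dft(f)
print(np.round(F, 12))               # expect F[1] = F[5] = 1/2, all other entries ~ 0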

4.4.1 Aliasing and the Sampling Theorem


Consider a continuous time signal f(t) = \cos(2\pi \ell t) with t \in [0, 1] (the frequency is f = \ell in Hz). We sample f(t) with N = 6 points at t_n = n/N = n/6, 0 \le n \le N-1, to obtain a discrete time signal f[n] = \cos(2\pi \ell \frac{n}{N}). We consider F[k] = DFT\{f[n]\} for 0 \le \ell \le 6:

\ell = 0: The time signal is

    f[n] = \cos\left( 2\pi \cdot 0 \cdot \frac{n}{N} \right) = 1 \cdot W_N^{0},

with frequency signal

    F[0] = 1,   F[6] = 1

due to periodicity.

\ell = 1: The time signal is

    f[n] = \cos\left( 2\pi \cdot 1 \cdot \frac{n}{N} \right)
         = \frac{1}{2} \exp\left( i 2\pi \frac{n}{N} \right) + \frac{1}{2} \exp\left( -i 2\pi \frac{n}{N} \right)
         = \frac{1}{2} W_N^{n} + \frac{1}{2} W_N^{(N-1)n},

with frequency signal

    F[1] = \frac{1}{2},   F[5] = \frac{1}{2}.
\ell = 2: The time signal is

    f[n] = \cos\left( 2\pi \cdot 2 \cdot \frac{n}{N} \right)
         = \frac{1}{2} \exp\left( i 2\pi \frac{2n}{N} \right) + \frac{1}{2} \exp\left( -i 2\pi \frac{2n}{N} \right)
         = \frac{1}{2} W_N^{2n} + \frac{1}{2} W_N^{(N-2)n},

with frequency signal

    F[2] = \frac{1}{2},   F[4] = \frac{1}{2}.

\ell = 3: The time signal is

    f[n] = \cos\left( 2\pi \cdot 3 \cdot \frac{n}{N} \right)
         = \frac{1}{2} \exp\left( i 2\pi \frac{3n}{N} \right) + \frac{1}{2} \exp\left( -i 2\pi \frac{3n}{N} \right)
         = \frac{1}{2} W_N^{3n} + \frac{1}{2} W_N^{(N-3)n},

with frequency signal

    F[3] = 1.

\ell = 4: The time signal is

    f[n] = \cos\left( 2\pi \cdot 4 \cdot \frac{n}{N} \right)
         = \frac{1}{2} \exp\left( i 2\pi \frac{4n}{N} \right) + \frac{1}{2} \exp\left( -i 2\pi \frac{4n}{N} \right)
         = \frac{1}{2} W_N^{4n} + \frac{1}{2} W_N^{(N-4)n},

with frequency signal

    F[4] = \frac{1}{2},   F[2] = \frac{1}{2}.

Examining these results, we may find it a little worrisome that the \ell = 4 case and the \ell = 2 case match. The sampling rate, or sampling frequency, f_s for this example is 6 samples per second, or 6 Hz. The "critical frequency" for aliasing occurs at f = 3 Hz (where 2f = f_s). In general, if f_s \ge 2f there will be no aliasing; if f_s < 2f aliasing will occur.

Theorem 4.3 “Sampling Theorem” (loosely formulated)


In order to avoid aliasing error, the sampling frequency fs should be at least twice the largest
frequency present in the continuous time signal.
Figure 4.2: Left: continuous time signal f(t) = \cos(2\pi \ell t) with 6 discrete sampling points f[n], for 0 \le \ell \le 6. Right: discrete Fourier transforms of f[n]. Aliasing occurs for \ell = 4, 5, and 6: sampled high-frequency signals show up as low-frequency discrete signals when the sampling rate is not high enough.

For example, human audible sound falls in the range 20Hz to 20kHz. We will require
a sampling frequency fs ≥ 2 · 20000Hz = 40kHz to avoid aliasing problems. As a result of
this requirement, digital music CDs have a sampling frequency of fs = 44.1kHz.

4.5 Fast Fourier Transform

We wish to numerically compute the discrete Fourier transform F [k] of a time signal f [n].
If we assume that the factors WN−kn are precomputed and stored in a table (there will be
N factors), then we can use formula (4.38) directly to compute the N coefficients F [k]
(0 ≤ k ≤ N − 1). The amount of work we must do per coefficient is then

W = (N − 1)A + (N + 1)M = 2N (complex) flops.

The total work required to compute all coefficients is

W = N · 2N = 2N 2 (complex) flops.

This process is fairly inefficient; for a typical one-minute sound file sampled at 44.1 kHz we require about 2.65 × 10^6 samples for the time signal and about 1.4 × 10^13 (complex) flops to compute the discrete Fourier transform, which is very large, even for today's fast computers. We pose the question: "how do we compute the discrete Fourier coefficients more efficiently?"
In order to answer this, we first require some intermediate results:

Theorem 4.4 If N is even (N = 2m for some integer m), then the length-N discrete Fourier transform F[k] of a discrete time signal f[n] can be calculated by combining two length-N/2 discrete Fourier transforms. We define

    g[n] = f[n] + f[n + N/2]                   (0 \le n \le N/2 - 1)   (4.43)
    h[n] = (f[n] - f[n + N/2]) \, W_N^{-n}     (0 \le n \le N/2 - 1).  (4.44)

Then

    F[2\ell]     = \frac{1}{2} DFT\{g[n]\}   (even indices)   (4.45)
    F[2\ell + 1] = \frac{1}{2} DFT\{h[n]\}   (odd indices).   (4.46)
Proof. From the direct formula for the discrete Fourier transform,

    F[k] = \frac{1}{N} \sum_{n=0}^{N-1} f[n] W_N^{-kn}
         = \frac{1}{N} \sum_{n=0}^{N/2-1} f[n] W_N^{-kn} + \frac{1}{N} \sum_{n=N/2}^{N-1} f[n] W_N^{-kn}
         = \frac{1}{N} \sum_{n=0}^{N/2-1} f[n] W_N^{-kn} + \frac{1}{N} \sum_{\ell=0}^{N/2-1} f[\ell + N/2] W_N^{-k(\ell + N/2)}
         = \frac{1}{N} \sum_{n=0}^{N/2-1} \left( f[n] + f[n + N/2] \, W_N^{-kN/2} \right) W_N^{-kn}.

The inner root of unity may be simplified:

    W_N^{-kN/2} = \exp\left( -i 2\pi \frac{kN/2}{N} \right) = \exp(-ik\pi) = (\exp(-i\pi))^k = (-1)^k.

We now have two cases, based on whether k is even or odd:

Case 1 (k even): We have k = 2\ell (0 \le \ell \le N/2 - 1). Thus,

    F[2\ell] = \frac{1}{N} \sum_{n=0}^{N/2-1} (f[n] + f[n + N/2]) W_N^{-2\ell n}
             = \frac{1}{2} \left[ \frac{1}{N/2} \sum_{n=0}^{N/2-1} g[n] W_{N/2}^{-\ell n} \right]
             = \frac{1}{2} DFT\{g[n]\}.

Case 2 (k odd): We have k = 2\ell + 1 (0 \le \ell \le N/2 - 1). Thus,

    F[2\ell + 1] = \frac{1}{N} \sum_{n=0}^{N/2-1} (f[n] - f[n + N/2]) W_N^{-2\ell n - n}
                 = \frac{1}{2} \left[ \frac{1}{N/2} \sum_{n=0}^{N/2-1} h[n] W_{N/2}^{-\ell n} \right]
                 = \frac{1}{2} DFT\{h[n]\}.
If we split the discrete Fourier transform into two transforms (each of length N/2), the total work required will be

    W_tot = 2 \cdot 2(N/2)^2 flops = N^2 flops.

Recall that the direct method requires 2N^2 flops; in splitting up the Fourier transform, we only require half as much work. But we may further apply the splitting method recursively! In order to compute the splitting, we require N/2 additions for g[n], N/2 additions and multiplications for h[n], and N multiplications for the transform. In total, we will require (5/2)N flops at each level (where N is the length at each level).

Theorem 4.5 The fast Fourier transform (FFT) requires O(N log2 N ) flops.

Proof. Assume N = 2^m. We tabulate the results at each step, as follows:

    Level | Length per DFT | # of DFTs | Total work
    m     | 2^m = N        | 1         | (5/2) N flops
    m-1   | 2^{m-1}        | 2         | 2 · (5/2) · (N/2) = (5/2) N flops
    m-2   | 2^{m-2}        | 4         | 4 · (5/2) · (N/4) = (5/2) N flops
    ...   | ...            | ...       | ...
    1     | 2              | 2^{m-1}   | 2^{m-1} · (5/2) · 2 = (5/2) N flops
    0     | 1              | 2^m       | none (F[0] = f[0])

We sum over the total work column:

    (5/2) N flops per level · \log_2(N) levels = (5/2) N \log_2 N flops,

which is our desired result.

The pseudo-code for this algorithm is given as follows:

Fast Fourier Transform

function F = FastFT(f, N)
  m = log2(N)
  if m == 0
    F = f
  else
    build g[n], 0 <= n <= N/2 - 1     % g[n] = f[n] + f[n + N/2], eq. (4.43)
    build h[n], 0 <= n <= N/2 - 1     % h[n] = (f[n] - f[n + N/2]) * W_N^(-n), eq. (4.44)
    F[even] = (1/2) FastFT(g, N/2)
    F[odd]  = (1/2) FastFT(h, N/2)
  end

We note that this algorithm can also be applied to signals where N 6= 2m by padding the
signal with zeroes. In addition, this algorithm also works with complex f [n], but requires
that all computations be done in complex flops. Computationally, the fast Fourier transform
is almost always used, due to its efficiency.
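A minimal runnable version of this pseudo-code is sketched below (Python with NumPy, assumed available), following the g/h splitting of Theorem 4.4 and the 1/N scaling of definition (4.38); it assumes N is a power of 2. The final line checks the result against NumPy's unnormalized np.fft.fft divided by N.

import numpy as np

def fast_ft(f):
    # Recursive FFT following Theorem 4.4; assumes len(f) is a power of 2.
    # Uses the 1/N scaling convention of eq. (4.38).
    f = np.asarray(f, dtype=complex)
    N = len(f)
    if N == 1:
        return f                                  # base case: F[0] = f[0]
    W = np.exp(1j * 2 * np.pi / N)
    g = f[:N // 2] + f[N // 2:]                                  # eq. (4.43)
    h = (f[:N // 2] - f[N // 2:]) * W ** (-np.arange(N // 2))    # eq. (4.44)
    F = np.empty(N, dtype=complex)
    F[0::2] = 0.5 * fast_ft(g)                    # even coefficients, eq. (4.45)
    F[1::2] = 0.5 * fast_ft(h)                    # odd coefficients, eq. (4.46)
    return F

f = np.random.rand(8)
print(np.allclose(fast_ft(f), np.fft.fft(f) / len(f)))   # True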

4.6 DFT and Orthogonal Basis


Recall that over the vector space V = R2 we may define an orthogonal basis

B = {~e1 , ~e2 }, with < ~ei , ~ej >= δij

and an alternate orthogonal basis

B ′ = {f~1 , f~2 }, with < f~i , f~j >= δij .

Then any vector ~x can be expressed in terms of either basis (see figure).

[Figure: the standard basis {~e_1, ~e_2} and an alternate orthogonal basis {~f_1, ~f_2} in the plane.]

Consider the discrete Fourier transform with N = 4, for simplicity. The time signal f [n]
(0 ≤ n ≤ N − 1 = 3) may be defined as a time signal vector

    ~f = ( f[0], f[1], f[2], f[3] ) ∈ R^4.   (4.47)

Since R4 is a vector space, we may define an orthogonal basis

B = {f~0 , f~1 , f~2 , f~3 } (4.48)

with

f~0 = (1, 0, 0, 0),


f~1 = (0, 1, 0, 0),
f~2 = (0, 0, 1, 0),
f~3 = (0, 0, 0, 1).

Then the time signal vector may be written as

f~ = f [0]f~0 + f [1]f~1 + f [2]f~2 + f [3]f~3 . (4.49)


From (4.39) we may also write the time signal vector as

    ~f = \left( \sum_{k=0}^{N-1} F[k] W_N^{k \cdot 0}, \; \sum_{k=0}^{N-1} F[k] W_N^{k \cdot 1}, \; \sum_{k=0}^{N-1} F[k] W_N^{k \cdot 2}, \; \sum_{k=0}^{N-1} F[k] W_N^{k \cdot 3} \right),

or

    ~f = \sum_{k=0}^{N-1} F[k] \, ~F_k = F[0] ~F_0 + F[1] ~F_1 + F[2] ~F_2 + F[3] ~F_3,   (4.50)

where F~k is a vector in an alternate basis. We define

B ′ = {F~0 , F~1 , F~2 , F~3 }, (4.51)

where (row k lists the components for n = 0, 1, 2, 3)

    ~F_0 = ( W_N^{0 \cdot 0}, W_N^{0 \cdot 1}, W_N^{0 \cdot 2}, W_N^{0 \cdot 3} )
    ~F_1 = ( W_N^{1 \cdot 0}, W_N^{1 \cdot 1}, W_N^{1 \cdot 2}, W_N^{1 \cdot 3} )
    ~F_2 = ( W_N^{2 \cdot 0}, W_N^{2 \cdot 1}, W_N^{2 \cdot 2}, W_N^{2 \cdot 3} )
    ~F_3 = ( W_N^{3 \cdot 0}, W_N^{3 \cdot 1}, W_N^{3 \cdot 2}, W_N^{3 \cdot 3} ).   (4.52)
We conclude from (4.50) that the discrete Fourier coefficients F [k] of the time signal f [n]
are just the coordinates of the time signal vector f~ in the DFT basis B ′ . This is the same
as saying that the DFT is nothing more than a basis transformation (with a basis that is
useful to extract frequency information).

Is B' an orthogonal basis? Consider the vector space V = C^n (complex vectors). We define the scalar product between two vectors ~x, ~y ∈ C^n as

    < ~x, ~y > = \sum_{j=1}^{n} x_j \, \overline{y_j}.   (4.53)

Under this definition, the scalar product of the basis vectors in B' is

    < ~F_i, ~F_j > = \sum_{k=0}^{N-1} W_N^{ik} \, W_N^{-jk} = \sum_{k=0}^{N-1} W_N^{(i-j)k}.

If i = j then

    < ~F_i, ~F_i > = \sum_{k=0}^{N-1} 1 = N.   (4.54)

If i \ne j then by the geometric series formula we may write

    < ~F_i, ~F_j > = \frac{ (W_N^{i-j})^N - 1 }{ W_N^{i-j} - 1 }.
Figure 4.3: A basis for the discrete Fourier transform over N = 8 points. For each basis vector ~F_k, the function on the left represents the real component and the function on the right represents the imaginary component.

But by the properties of W_N, (W_N^{i-j})^N = (W_N^N)^{i-j} = 1, so we get

    < ~F_i, ~F_j > = 0,   i \ne j,   (4.55)

and so B' is an orthogonal basis!

Recall that since B' is an orthogonal basis, we can use the projection formula (4.21) in conjunction with expressions (4.47) and (4.52) to find F[k]:

    F[k] = \frac{ < ~f, ~F_k > }{ < ~F_k, ~F_k > } = \frac{1}{N} \sum_{n=0}^{N-1} f[n] \, \overline{W_N^{kn}} = \frac{1}{N} \sum_{n=0}^{N-1} f[n] \, W_N^{-kn}.

This is just the discrete Fourier transform formula (4.38).
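The orthogonality relations < ~F_i, ~F_j > = N δ_ij are easy to check numerically. The short sketch below (Python with NumPy, assumed available) builds the matrix whose rows are the basis vectors ~F_k for N = 4 and forms all pairwise scalar products using the conjugated inner product (4.53).

import numpy as np

N = 4
k = np.arange(N).reshape((N, 1))
n = np.arange(N)
W = np.exp(1j * 2 * np.pi / N)
F_basis = W ** (k * n)               # row k is the basis vector F_k = (W_N^{k*0}, ..., W_N^{k*3})

# Gram matrix of scalar products <F_i, F_j> = sum_n F_i[n] * conj(F_j[n]);
# orthogonality means this should be N times the identity matrix.
gram = F_basis @ F_basis.conj().T
print(np.round(gram.real, 10))       # expect N on the diagonal, 0 elsewhere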

4.7 Power Spectrum and Parseval’s Theorem


Recall the Fourier series of a continuous function f(t). We assume that f(t) is the amplitude of a sound signal. From physics, we know that \int_a^b f(t)^2 \, dt is proportional to the power in a sound signal on a time interval [a, b]. In the frequency domain, we define the concept of the power spectrum:

Definition 4.7 Let F [k] be the complex Fourier coefficients of a discrete (or continuous)
signal f [n] (or f (t)). Then the power spectrum of f [n] (or f (t)) is |F [k]|2 .

Parseval’s theorem then provides a connection between the power of a signal in the time
domain and the summed power spectrum in the frequency domain:

Theorem 4.6 (Parseval’s Theorem.)

(A) (Continuous case) Let F [k] be the complex Fourier coefficients of a real continuous
signal f (t). Then
Z b ∞
1 X
f (t)2 dt = |F [k]|2 . (4.56)
b−a a
| {z } k=−∞
| {z }
total power in time domain total power in frequency domain

(B) (Discrete case) Let F [k] be the complex Fourier coefficients of a real discrete signal
f [n]. Then
N −1 N −1
1 X 2
X 2
|f [n]| = |F [k]| . (4.57)
N n=0
k=0
| {z } | {z }
total power in time domain total power in frequency domain
94 CHAPTER 4. DISCRETE FOURIER METHODS

Proof of (A).

    \frac{1}{b-a} \int_a^b f(t)^2 \, dt = \frac{1}{b-a} \int_a^b f(t) \cdot f(t) \, dt
        = \frac{1}{b-a} \int_a^b \left( \sum_{k=-\infty}^{\infty} F[k] \exp\left( i 2\pi \frac{kt}{b-a} \right) \right) f(t) \, dt.

We assume we can interchange the order of the integration and summation:

    \frac{1}{b-a} \int_a^b f(t)^2 \, dt = \sum_{k=-\infty}^{\infty} F[k] \left( \frac{1}{b-a} \int_a^b \exp\left( i 2\pi \frac{kt}{b-a} \right) f(t) \, dt \right)
        = \sum_{k=-\infty}^{\infty} F[k] \cdot \overline{F[k]}
        = \sum_{k=-\infty}^{\infty} |F[k]|^2.

The proof of (B) is analogous.
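The discrete case (4.57) is easy to verify numerically for a random real signal. The sketch below (Python with NumPy, assumed available) uses np.fft.fft, which omits the 1/N factor of our convention (4.38), so we rescale before comparing the two sides.

import numpy as np

# Check of the discrete Parseval relation (4.57):
#   (1/N) * sum_n |f[n]|^2  ==  sum_k |F[k]|^2,
# with F[k] computed using the 1/N scaling convention of eq. (4.38).
N = 16
f = np.random.rand(N)                 # an arbitrary real discrete signal
F = np.fft.fft(f) / N                 # np.fft.fft gives N * F[k] in our convention

power_time = np.sum(np.abs(f) ** 2) / N
power_freq = np.sum(np.abs(F) ** 2)
print(power_time, power_freq)         # the two numbers agree to rounding error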

These notes have been funded by...


Chapter 5

Interpolation

We wish to now focus our attention on the problem of interpolation. If we have a set of
discrete data, we may wish to determine a continuous function that interpolates all the data
points. Consider the following statement of the problem:

Problem Given n + 1 discrete data points {(xi , fi )}ni=0 with xi 6= xj for i 6= j, determine
a continuous function y(x) that interpolates the data: y(xi ) = fi for 0 ≤ i ≤ n.

[Figure: data points at x_0, x_1, ..., x_n and a continuous interpolating function y(x) passing through them.]

The points (xi , fi ) could come from measurements, expensive calculations, discrete data
analysis, or computer graphics (2D and 3D). There are several reasons for which we require
access to a continuous function y(x). For example,

• we may need to determine interpolated values at x 6= xi .

• we may need to differentiate or integrate the interpolating function.

It may be impossible or infeasible to perform a continuous measurement of the function or


to determine the function exactly, so interpolation is useful!
There are many ways for us to choose the function y(x), some which may be very difficult
and require substantial computational resources. We will focus on polynomial interpolation,
which is one of the most general and widely-used methods.


5.1 Polynomial Interpolation


In polynomial interpolation, we wish to determine the coefficients of a polynomial that
interpolates the set of points. We define the interpolating polynomial as follows:
Definition 5.1 Given n + 1 discrete data points {(xi , fi )}ni=0 with xi 6= xj for i 6= j, the
interpolating polynomial is the degree n polynomial
yn (x) = a0 + a1 x + a2 x2 + · · · + an xn , (5.1)
such that yn (xi ) = fi for 0 ≤ i ≤ n.
In this problem we are given n + 1 unknowns a0 , a1 , · · · , an and n + 1 conditions yn (xi ) =
fi . In order to solve for the interpolating polynomial, we must solve for the unknowns under
the given conditions. There are many methods to solve for these unknowns, but we will
concentrate on the Vandermonde matrix solution and the Lagrange form of the polynomial.

5.1.1 The Vandermonde Matrix


The Vandermonde matrix method is the most straightforward algorithm for determining
an interpolating polynomial. Since all the conditions are linear, we can write them as a
(n + 1) × (n + 1) linear system:

a0 + a1 x0 + a2 x20 + ··· + an xn0 = f0


a0 + a1 x1 + a2 x21 + ··· + an xn1 = f1
(5.2)
··· ··· ··· ··· ···
a0 + a1 xn + a2 x2n + ··· + an xnn = fn ,
which we may also write in matrix form:

    \begin{pmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^n \\ 1 & x_1 & x_1^2 & \cdots & x_1^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^n \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{pmatrix} = \begin{pmatrix} f_0 \\ f_1 \\ \vdots \\ f_n \end{pmatrix}.   (5.3)
Thus we may write V ~a = f~, where V is known as a Vandermonde matrix. We can obtain
the following explicit expression for the determinant of a Vandermonde matrix:

Proposition 5.1 The determinant of V is given by


det(V ) = (x1 − x0 )
(x2 − x0 )(x2 − x1 )
(x3 − x0 )(x3 − x1 )(x3 − x2 ) (5.4)
···
(xn − x0 )(xn − x1 ) · · · (xn − xn−1 ),
or, in a more compact form,

    \det(V) = \prod_{0 \le i < j \le n} (x_j - x_i).   (5.5)
Proof. Recall that we may expand \det(V) about row i by writing

    \det(V) = \sum_{j=0}^{n} (-1)^{i+j} v_{ij} \det(V_{ij}),

where V_{ij} is the n \times n matrix obtained by removing row i and column j from the (n+1) \times (n+1) matrix V:

    V_{ij} = \begin{pmatrix} v_{00} & \cdots & v_{0(j-1)} & v_{0(j+1)} & \cdots & v_{0n} \\ \vdots & & \vdots & \vdots & & \vdots \\ v_{(i-1)0} & \cdots & v_{(i-1)(j-1)} & v_{(i-1)(j+1)} & \cdots & v_{(i-1)n} \\ v_{(i+1)0} & \cdots & v_{(i+1)(j-1)} & v_{(i+1)(j+1)} & \cdots & v_{(i+1)n} \\ \vdots & & \vdots & \vdots & & \vdots \\ v_{n0} & \cdots & v_{n(j-1)} & v_{n(j+1)} & \cdots & v_{nn} \end{pmatrix}.

We choose to expand \det(V) about row i = n:

    \det(V) = \sum_{j=0}^{n} (-1)^{n+j} v_{nj} \det(V_{nj})
            = (-1)^n \left[ 1 \cdot \det(V_{n0}) - x_n \cdot \det(V_{n1}) + x_n^2 \cdot \det(V_{n2}) - \cdots + (-1)^n x_n^n \cdot \det(V_{nn}) \right].

All of the sub-determinants \det(V_{nj}) are independent of x_n, since they do not contain any elements from the nth row. This expression is therefore a polynomial of degree n in x_n, so we may write \det(V) = p_n(x_n).

We know p_n(x_0) = 0 because \det(V) = 0 when x_n = x_0; V then has two equal rows. Similarly p_n(x_1) = 0, p_n(x_2) = 0, ..., p_n(x_{n-1}) = 0, which gives us n roots of p_n(x_n). From this information, we know that p_n(x_n) may be written as

    \det(V) = b \, (x_n - x_0)(x_n - x_1) \cdots (x_n - x_{n-1}),   (5.6)

with b = (-1)^{2n} \det(V_{nn}) = \det(V_{nn}).


We define V (i) as the (i + 1) × (i + 1) matrix formed by taking the first i + 1 rows and
i + 1 columns of V . Then V (n) = V and V (n−1) = Vnn . We may write

det(V (n) ) = det(V (n−1) )(xn − x0 )(xn − x1 ) · · · (xn − xn−1 )


det(V (n−1) ) = det(V (n−2) )(xn−1 − x0 )(xn−1 − x1 ) · · · (xn−1 − xn−2 )
···

Our result is obtained by repeating the decomposition (5.6) recursively to obtain the
desired result. 

We may wish to consider when the interpolating polynomial is well-defined, i.e. when we may solve the linear system (5.3) to find a unique solution. It turns out that as long as x_i \ne x_j for i \ne j, we can always obtain a polynomial that interpolates the given points. This is proven in the following theorem:

Theorem 5.1 The interpolating polynomial y_n(x) exists and is unique.

Proof. We consider the system V~a = ~f. If x_i \ne x_j for i \ne j, then we know \det(V) \ne 0 from (5.5). Thus, by a standard result from linear algebra, the linear system has a unique solution (or, y_n exists and is unique).

We note that we rarely solve the linear system V~a = ~f in practice, for two reasons:

1. This approach requires W = O(n^3) flops to solve the linear system. There are more efficient methods for finding the interpolating polynomial.

2. V is a very ill-conditioned matrix, since \kappa_2(V) grows faster than exponentially as a function of n (see the numerical sketch below).
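The sketch below (Python with NumPy, assumed available) illustrates both points: it solves the Vandermonde system for equally spaced points on [0, 1] and prints the 2-norm condition number, which blows up rapidly with n. The helper name is ours, not a library routine.

import numpy as np

def vandermonde_interpolation(x, f):
    # Solve V a = f for the coefficients a_0, ..., a_n of eq. (5.1).
    # np.vander with increasing=True builds the matrix of eq. (5.3).
    V = np.vander(x, increasing=True)
    return np.linalg.solve(V, f)

for n in (5, 10, 15, 20):
    x = np.linspace(0.0, 1.0, n + 1)           # n+1 equally spaced interpolation points
    f = np.sin(2 * np.pi * x)                  # sample data
    a = vandermonde_interpolation(x, f)        # polynomial coefficients a_0, ..., a_n
    kappa = np.linalg.cond(np.vander(x, increasing=True))
    print(n, "condition number ~ %.2e" % kappa)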

Instead, we consider a different approach which will allow us to write down the interpo-
lating polynomial directly.

5.1.2 Lagrange Form


To motivate the Lagrange form of the interpolating polynomial, we ask the question: "is there a simple way to write down the interpolating polynomial without needing to solve a linear system?" Consider the simplest case of a non-constant polynomial:

Linear Case (n=1): We have two points (x0 , f0 ) and (x1 , f1 ). The polynomial is of the
form
y1 (x) = a0 + a1 x,
with the conditions
y1 (x0 ) = f0 , y1 (x1 ) = f1 .
With a little intuition, we may think to write y_1(x) as

    y_1(x) = \frac{x - x_1}{x_0 - x_1} f_0 + \frac{x - x_0}{x_1 - x_0} f_1.

In this case, y1 (x) takes the form

y1 (x) = ℓ0 (x)f0 + ℓ1 (x)f1 ,

where ℓ_0(x) and ℓ_1(x) are both degree 1 polynomials. We verify that y_1(x) is an interpolating polynomial:

y1 (x0 ) = 1 · f0 + 0 · f1 = f0 , OK!
y1 (x1 ) = 0 · f0 + 1 · f1 = f1 , OK!

By Theorem 5.1 we know that the interpolating polynomial is unique, so this must be the
interpolating polynomial associated with the given points. If we collected the terms of this
polynomial, we would find that this is simply another way of writing the solution we would
get if we solved the Vandermonde system (5.3).
We may generalize this method for writing the interpolating polynomial to an arbitrary
number of points as follows:

Definition 5.2 The n + 1 Lagrange polynomials for a set of points {(x_i, f_i)}_{i=0}^n are the degree n polynomials that satisfy the property

    ℓ_i(x_j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise.} \end{cases}   (5.7)

Explicitly, we may write the ith Lagrange polynomial as

    ℓ_i(x) = \frac{ (x - x_0)(x - x_1) \cdots (x - x_{i-1})(x - x_{i+1}) \cdots (x - x_n) }{ (x_i - x_0)(x_i - x_1) \cdots (x_i - x_{i-1})(x_i - x_{i+1}) \cdots (x_i - x_n) }.   (5.8)

Using product notation, we may also write

    ℓ_i(x) = \prod_{j=0, \, j \ne i}^{n} \frac{ x - x_j }{ x_i - x_j }.   (5.9)

In general, the Lagrange form of the interpolating polynomial may be written as follows:
yn (x) = ℓ0 (x)f0 + ℓ1 (x)f1 + · · · + ℓn (x)fn , (5.10)
or, using summation notation,

    y_n(x) = \sum_{i=0}^{n} ℓ_i(x) f_i,   (5.11)

with the Lagrange polynomials ℓ_i(x) defined by (5.8). This form is an alternative way of writing the interpolating polynomial y_n(x). Using this form, interpolation can be done in O(n^2) time without solving a linear system!
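As a sketch of this idea (plain Python, no libraries), the function below evaluates (5.11) directly from the data, in O(n^2) work per evaluation point and with no linear solve. The function name is ours.

def lagrange_eval(x_nodes, f_values, x):
    # Evaluate y_n(x) = sum_i f_i * l_i(x), eq. (5.11),
    # with l_i(x) the Lagrange polynomial of eq. (5.9).
    y = 0.0
    for i, (xi, fi) in enumerate(zip(x_nodes, f_values)):
        li = 1.0
        for j, xj in enumerate(x_nodes):
            if j != i:
                li *= (x - xj) / (xi - xj)
        y += fi * li
    return y

# Example 5.1 below: points (2, 3/2), (3, 2), (5, 1).
x_nodes = [2.0, 3.0, 5.0]
f_values = [1.5, 2.0, 1.0]
print(lagrange_eval(x_nodes, f_values, 3.0))   # reproduces f = 2 at the node x = 3
print(lagrange_eval(x_nodes, f_values, 4.0))   # an interpolated value between nodes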
Example 5.1 Write the interpolating polynomial of degree 2 for the set of points
    \{ (2, \tfrac{3}{2}), (3, 2), (5, 1) \}.

[Figure: the three data points and the interpolating parabola y_2(x).]
The Lagrange form of the interpolating polynomial is

    y_2(x) = \frac{(x-3)(x-5)}{(2-3)(2-5)} \left( \frac{3}{2} \right) + \frac{(x-2)(x-5)}{(3-2)(3-5)} (2) + \frac{(x-2)(x-3)}{(5-2)(5-3)} (1).

The Lagrange Basis. Consider the set


Pn (x) = {yn (x)|yn (x) is a polynomial of degree ≤ n}. (5.12)
This is the set of all polynomials of degree less than or equal to n. We note that Pn (x) is a
vector space with the standard basis
B = {1, x, x2 , · · · , xn }.
Hence we may write any polynomial yn (x) as a linear combination of the basis vectors. In
standard form, we write
yn (x) = a0 + a1 x + · · · + an xn . (5.13)
Similarly, the Lagrange polynomials form a different basis for the vector space Pn (x):
B ′ = {ℓ0 (x), ℓ1 (x), · · · , ℓn (x)} (5.14)
(recall that in order to write a Lagrange basis, we require n + 1 points x_i (0 \le i \le n) with x_i \ne x_j for i \ne j.) Since this is also a basis, we may write any polynomial y_n(x) as a linear
combination of the Lagrange polynomials:
combination of the Lagrange polynomials:

    y_n(x) = \sum_{i=0}^{n} f_i \, ℓ_i(x).   (5.15)

5.1.3 Hermite Interpolation


Sometimes derivatives (or slopes) are given or known at interpolation points. In this case,
we can find a polynomial that interpolates both the function values and the derivatives.
[Figure: data points (x_0, f_0), (x_1, f_1), (x_2, f_2) with prescribed slopes f_0', f_1', f_2'.]

Definition 5.3 Given {(x_i, f_i, f_i')}_{i=0}^n, the Hermite interpolating polynomial is the polynomial y(x) of degree 2n + 1 which satisfies

    y(x_i)  = f_i    (n + 1 conditions)
    y'(x_i) = f_i'   (n + 1 conditions),

for a total of 2n + 2 conditions.

Since there are 2n + 2 conditions, there must be 2n + 2 polynomial coefficients in the minimal degree interpolating polynomial. We conclude that y(x) has degree 2n + 1.

Example 5.2 Consider the case of n = 1. We have two points (x0 , f0 , f0′ ) and (x1 , f1 , f1′ ).
The polynomial is of degree 2n + 1 = 2 · 1 + 1 = 3 (a cubic), so we may write

y(x) = a0 + a1 x + a2 x2 + a3 x3 .

Similar to the standard polynomial interpolation problem, we must solve for these coef-
ficients. We consider two methods:

Method 1: Undetermined Coefficients. We note that we may write the polynomial


and its first derivative as

    y(x)  = a_0 + a_1 x + a_2 x^2 + a_3 x^3,
    y'(x) = a_1 + 2 a_2 x + 3 a_3 x^2.

In matrix form, the conditions y(x_0) = f_0, y(x_1) = f_1, y'(x_0) = f_0', y'(x_1) = f_1' become the linear system

    \begin{pmatrix} 1 & x_0 & x_0^2 & x_0^3 \\ 1 & x_1 & x_1^2 & x_1^3 \\ 0 & 1 & 2x_0 & 3x_0^2 \\ 0 & 1 & 2x_1 & 3x_1^2 \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix} = \begin{pmatrix} f_0 \\ f_1 \\ f_0' \\ f_1' \end{pmatrix}.

Since this is simply a matrix system, we may use known techniques to solve it. This method suffers from the same shortfalls as the Vandermonde system, however, since we are still required to solve a potentially costly linear system.

Method 2: Determine a, b, c, d. Similar to the idea of the Lagrange form, we can write the Hermite polynomial in a form that makes solving for the polynomial coefficients much easier. We write the polynomial and its derivative as

    y(x)  = a + b(x - x_0) + c(x - x_0)^2 + d(x - x_0)^2 (x - x_1)
    y'(x) = b + 2c(x - x_0) + 2d(x - x_0)(x - x_1) + d(x - x_0)^2.

Substituting in the conditions yields the following linear system:

    y(x_0) = f_0:    a = f_0
    y(x_1) = f_1:    a + b(x_1 - x_0) + c(x_1 - x_0)^2 = f_1
    y'(x_0) = f_0':  b = f_0'
    y'(x_1) = f_1':  b + 2c(x_1 - x_0) + d(x_1 - x_0)^2 = f_1'.

We note that c and d are the only coefficients we need to solve for, since a and b are immediately determined. We may rearrange the system to obtain

    c = \frac{1}{(x_1 - x_0)^2} \left( f_1 - f_0 - f_0'(x_1 - x_0) \right)
    d = \frac{1}{(x_1 - x_0)^2} \left( f_1' - f_0' - \frac{2}{x_1 - x_0} \left( f_1 - f_0 - f_0'(x_1 - x_0) \right) \right).

So why does this work? Recall that we could write a basis for P3 (x) as either {1, x, x2 , x3 }
(standard basis) or {ℓ0 (x), ℓ1 (x), ℓ2 (x), ℓ3 (x)} (Lagrange basis). We may also choose the
following basis for P3 (x):

{1, x − x0 , (x − x0 )2 , (x − x0 )2 (x − x1 )}.

Writing y(x) under this basis yields the form we used in method 2. This particular basis is
chosen such that the calculations are somewhat simplified.
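A small sketch of Method 2 for the case n = 1 (plain Python, no libraries): it computes a, b, c, d from the formulas above and returns the resulting cubic. The function name and test data are ours.

def hermite_cubic(x0, f0, df0, x1, f1, df1):
    # Coefficients of y(x) = a + b(x-x0) + c(x-x0)^2 + d(x-x0)^2 (x-x1),
    # determined as in Method 2 above.
    h = x1 - x0
    a = f0
    b = df0
    c = (f1 - f0 - df0 * h) / h**2
    d = (df1 - df0 - 2.0 * (f1 - f0 - df0 * h) / h) / h**2
    def y(x):
        return a + b * (x - x0) + c * (x - x0)**2 + d * (x - x0)**2 * (x - x1)
    return y

# Data (x0, f0, f0') = (0, 0, 1) and (x1, f1, f1') = (1, 1, 0): the cubic should
# match the function values and slopes at both endpoints.
y = hermite_cubic(0.0, 0.0, 1.0, 1.0, 1.0, 0.0)
eps = 1e-6
print(y(0.0), y(1.0))                                           # 0 and 1
print((y(eps) - y(0.0)) / eps, (y(1.0) - y(1.0 - eps)) / eps)   # ~1 and ~0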

5.2 Piecewise Polynomial Interpolation


There are many problems with standard high degree polynomial interpolation, including
• strong oscillations
• quickly divergent extrapolation.
We can remedy these drawbacks by using piecewise interpolation or least-squares fitting. In
the former, we break the interpolation interval into regions and generate several interpolating
polynomials that are valid over each region. In the latter, we allow the data points to be
‘close to’ y(x) instead of on y(x).

5.2.1 Piecewise Linear Interpolation


In piecewise linear interpolation we split the domain of the function into a set of intervals
(each consisting of two adjacent points) and determine the interpolating polynomial of each
interval. We use a line segment to interpolate the function in each interval, since each
interval contains two points.

[Figure: the points x_0 < x_1 < ... < x_n and the intervals I_1, I_2, ..., I_n between them.]

We define a set of n polynomials yi (x), 1 ≤ i ≤ n where the domain of each yi (x) is


I_i = [x_{i-1}, x_i]. We may write

    y_i(x) = \frac{x - x_i}{x_{i-1} - x_i} f_{i-1} + \frac{x - x_{i-1}}{x_i - x_{i-1}} f_i.

Then, the interpolating piecewise polynomial y(x) is equal to y_i(x) over the interval I_i = [x_{i-1}, x_i] for all 1 \le i \le n.
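For reference, a minimal sketch (Python with NumPy, assumed available): NumPy's np.interp performs exactly this piecewise linear interpolation on sorted nodes; the node values chosen here are arbitrary sample data.

import numpy as np

x_nodes = np.array([0.0, 1.0, 2.0, 3.0])
f_values = np.sin(x_nodes)                 # sample data f_i at the nodes x_i

# Evaluate the piecewise linear interpolant y(x) on a finer grid.
x = np.linspace(0.0, 3.0, 7)
y = np.interp(x, x_nodes, f_values)        # equals y_i(x) on each interval I_i
print(y)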

This method of interpolation has the drawback of not being smooth at the interpolation
points (we end up with jagged edges on the interpolating curve). We may instead consider
piecewise quadratic (see figure) or piecewise cubic interpolation, but these will also inevitably
have jagged points at the boundaries.

[Figure: piecewise quadratic interpolation over the intervals I_1 = [x_0, x_2], I_2 = [x_2, x_4], I_3 = [x_4, x_6].]

5.2.2 Spline Interpolation


[Figure: a smooth spline y(x) through the points x_0, x_1, ..., x_n, with intervals I_1, ..., I_n.]

We consider a generalization of piecewise linear interpolation on a set of points {(xi , fi )}ni=0


that takes into account our desire for smoothness at the boundaries. Instead of only using
an interpolation condition, we impose three types of conditions:
• interpolation conditions
• smoothness conditions
• extra boundary conditions.
This leads us to a very powerful type of interpolation, known as spline interpolation.

Definition 5.4 y(x) is a degree k spline if and only if


1. y(x) is a piecewise polynomial of degree k in each interval Ii . We define yi (x) as the
restriction of y(x) to [xi−1 , xi ] (1 ≤ i ≤ n).
2. yi (xi−1 ) = fi−1 and yi (xi ) = fi in interval Ii (1 ≤ i ≤ n) (interpolation condition).
3. For each interior point x_j (1 \le j \le n-1), there are k - 1 smoothness conditions:

    y_j'(x_j)        = y_{j+1}'(x_j)
    y_j''(x_j)       = y_{j+1}''(x_j)
    ...
    y_j^{(k-1)}(x_j) = y_{j+1}^{(k-1)}(x_j).

Under this definition, there will be n intervals and k + 1 coefficients for each polynomial.
In total, this gives n(k + 1) = nk + n unknowns.
We will have 2n interpolation conditions from part 2 and (k − 1)(n − 1) smoothness
conditions (note that the smoothness conditions only apply to the n − 1 internal points of
the spline.) Thus, in total we will have 2n + kn − k − n + 1 = kn + n − k + 1 conditions.
Comparing the number of unknowns and the number of conditions, it is clear we need
to impose k − 1 extra conditions. These are supplied by extra boundary conditions at x0
and xn .

Example 5.3 Consider the case of k = 3 (a cubic spline.) We will need to impose 2 extra
boundary conditions.

[Figure: a cubic spline on adjacent intervals I_i = [x_{i-1}, x_i] and I_{i+1} = [x_i, x_{i+1}].]

There are many different types of boundary conditions we could impose. Three possible
types of boundary conditions are

• “free boundary”: y1′′ (x0 ) = 0, yn′′ (xn ) = 0. A cubic spline with this boundary condition
is known as a “natural” cubic spline.

• “clamped boundary”: We specify the first derivatives at the ends by choosing constants
f0′ and fn′ . We then impose y1′ (x0 ) = f0′ and yn′ (xn ) = fn′ .

• “periodic boundary”: If f0 = fn we may impose that the first and second derivatives
of the first and last polynomial are equal at x0 and xn . We obtain the conditions
y1′ (x0 ) = yn′ (xn ) and y1′′ (x0 ) = yn′′ (xn ).

When we impose the three types of conditions, we will produce a (nk + n) × (nk + n)
linear system that may be uniquely solved for the coefficients of the yi (x), 1 ≤ i ≤ n.
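In practice one rarely assembles this linear system by hand; library routines do it internally. A hedged sketch using SciPy (assumed available): scipy.interpolate.CubicSpline constructs a cubic (k = 3) spline, and bc_type='natural' imposes the "free boundary" conditions y''(x_0) = y''(x_n) = 0 of Example 5.3. The data values here are arbitrary sample data.

import numpy as np
from scipy.interpolate import CubicSpline

x_nodes = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
f_values = np.cos(x_nodes)                       # data points (x_i, f_i)

# Natural cubic spline: degree k = 3 with free-boundary conditions.
spline = CubicSpline(x_nodes, f_values, bc_type='natural')

x = np.linspace(0.0, 4.0, 9)
print(spline(x))          # spline values; spline(x, 1) would give first derivatives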

5.2.3 Further Generalizations


Further generalizations of these techniques are possible. For example:
• Bezier Curves
• B-Splines

These notes have been funded by...


Chapter 6

Integration

The problem of numerical integration is simply stated:

Problem Given a continuous function f(x) and an interval [a, b], find a numerical approximation for I = \int_a^b f(x) \, dx.

There are several cases where numerical integration is necessary:

1. if f(x) is given but no closed form solution can be found. For example, the integral of the function f(x) = e^{-x^2} has no closed form solution and requires numerical integration to compute.

2. if f(x) is not given, but {(x_i, f_i)}_{i=0}^n is given.

In each case, numerical integration may be the only method of determining the inte-
gral. We consider three methods for performing numerical integration: integration of an
interpolating polynomial, composite integration and Gaussian integration.

6.1 Integration of an Interpolating Polynomial


Recall the definition of an interpolating polynomial y(x) of degree n for given data points
{(xi , fi )}ni=0 . Since the interpolating polynomial provides an approximation to the func-
tion f (x), we may use it to determine an approximate solution to the integral. This is
advantageous because y(x) may be easily integrated due to its simple form.
The exact solution I to the integration problem is given by

    I = \int_a^b f(x) \, dx,   (6.1)

and the numerical approximation \hat{I}, using the interpolating polynomial, is given by

    \hat{I} = \int_a^b y(x) \, dx.   (6.2)


The truncation error T is then defined as

    T = I - \hat{I},   (6.3)

the difference between the exact solution and our approximation.

6.1.1 Midpoint Rule: y(x) degree 0

We choose y(x) constant and sample at x_0 = \frac{a+b}{2}. We obtain y(x) = f_0 = f\left(\frac{a+b}{2}\right). The numerical approximation is then

    \hat{I}_0 = \int_a^b f\left( \frac{a+b}{2} \right) dx = (b-a) \, f\left( \frac{a+b}{2} \right).   (6.4)

[Figure: f(x) and the constant approximation y(x) = f((a+b)/2) on [a, b].]

6.1.2 Trapezoid Rule: y(x) degree 1


We choose y(x) to be a linear function between the endpoints of the interval,

(x0 = a, f0 = f (x0 )) and


(x1 = b, f1 = f (x1 )).

We may immediately write the interpolating polynomial in Lagrange form:

    y(x) = \frac{x - x_1}{x_0 - x_1} f_0 + \frac{x - x_0}{x_1 - x_0} f_1.

The numerical approximation using the interpolating polynomial is then

    \hat{I}_1 = \int_{x_0}^{x_1} \left( \frac{x - x_1}{x_0 - x_1} f_0 + \frac{x - x_0}{x_1 - x_0} f_1 \right) dx
              = \frac{f(a)}{a - b} \frac{(x-b)^2}{2} \Big|_a^b + \frac{f(b)}{b - a} \frac{(x-a)^2}{2} \Big|_a^b
              = \frac{f(a)}{a - b} \left( -\frac{(a-b)^2}{2} \right) + \frac{f(b)}{b - a} \left( \frac{(b-a)^2}{2} \right).

From this we obtain the trapezoid rule

    \hat{I}_1 = \frac{1}{2} (b-a) \left[ f(a) + f(b) \right].   (6.5)

[Figure: f(x) and the linear approximation y(x) joining (a, f(a)) and (b, f(b)).]

6.1.3 Simpson Rule: y(x) degree 2


We choose y(x) to be a parabola within the interval. We interpolate the points

    (x_0 = a, \; f_0 = f(a)),
    (x_1 = \tfrac{a+b}{2}, \; f_1 = f(\tfrac{a+b}{2})), and
    (x_2 = b, \; f_2 = f(b)).

We may write the interpolating polynomial in Lagrange form:

    y(x) = \frac{(x-x_1)(x-x_2)}{(x_0-x_1)(x_0-x_2)} f_0 + \frac{(x-x_0)(x-x_2)}{(x_1-x_0)(x_1-x_2)} f_1 + \frac{(x-x_0)(x-x_1)}{(x_2-x_0)(x_2-x_1)} f_2.

The numerical approximation using the interpolating polynomial is then

    \hat{I}_2 = \int_{x_0}^{x_2} \left[ \frac{(x-x_1)(x-x_2)}{(x_0-x_1)(x_0-x_2)} f_0 + \frac{(x-x_0)(x-x_2)}{(x_1-x_0)(x_1-x_2)} f_1 + \frac{(x-x_0)(x-x_1)}{(x_2-x_0)(x_2-x_1)} f_2 \right] dx.

This may be written as a weighted linear combination of the f_i:

    \hat{I}_2 = w_0 f_0 + w_1 f_1 + w_2 f_2,

with

    w_0 = \int_{x_0}^{x_2} \frac{(x-x_1)(x-x_2)}{(x_0-x_1)(x_0-x_2)} \, dx = \frac{b-a}{6}.

Repeating a similar integration for w_1 and w_2 yields the Simpson rule

    \hat{I}_2 = \frac{b-a}{6} (f_0 + 4 f_1 + f_2).   (6.6)
[Figure: f(x) and the interpolating parabola y(x) through the points a, (a+b)/2, b.]

6.1.4 Accuracy, Truncation Error and Degree of Precision


There are two ways to study the accuracy of the integration formulas:

Truncation error: We can use Taylor's theorem to compute the truncation error of each integration rule. For example, for the midpoint rule it can be shown that T_0 = I - \hat{I}_0 = \frac{(b-a)^3}{24} f''(\xi_0) with \xi_0 \in (a, b).

Degree of Precision: The degree of precision of an integration formula is defined as


follows:

Definition 6.1 The following statements are equivalent:


• Iˆ has degree of precision m
• T = I − Iˆ = 0 for any f (x) polynomial of degree ≤ m
• Iˆ integrates any polynomial of degree ≤ m exactly.

The three integration formulas we discussed are summarized in the following table, along with their accuracy:

                             Degree of poly.   Truncation Error                        Degree of Precision
    Midpoint  - eq. (6.4)    0                 \frac{(b-a)^3}{24} f''(\xi_0)           1
    Trapezoid - eq. (6.5)    1                 -\frac{(b-a)^3}{12} f''(\xi_1)          1
    Simpson   - eq. (6.6)    2                 -\frac{(b-a)^5}{2880} f^{(4)}(\xi_2)    3

Clearly the Simpson rule is the most accurate approximation, but also requires the
most computation. Perhaps surprising is the fact that the midpoint rule seems to provide
comparable accuracy to the more computationally-intensive trapezoid rule. We consider the
application of these methods to the following example:
Example 6.1 The error function erf(x) is defined as

    erf(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2) \, dt.

We can compute erf(1) to high precision using MATLAB or an advanced calculator and obtain erf(1) ≈ 0.842701 ....

Using the three integration algorithms described above (with a = 0, b = 1), we obtain:

    Midpoint:   \hat{I}_0 = \frac{2}{\sqrt{\pi}} \exp(-(\tfrac{1}{2})^2) ≈ 0.878783    (one correct digit)

    Trapezoid:  \hat{I}_1 = \frac{2}{\sqrt{\pi}} \cdot \frac{1}{2} [\exp(-0^2) + \exp(-1^2)] ≈ 0.77174    (zero correct digits)

    Simpson:    \hat{I}_2 = \frac{2}{\sqrt{\pi}} \cdot \frac{1}{6} [1 + 4\exp(-(\tfrac{1}{2})^2) + \exp(-1)] ≈ 0.8431028    (three correct digits)

As predicted, the Simpson rule is the most accurate. We also find, in this case, that the midpoint rule is more accurate than the trapezoid rule.
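The three rules are one-liners in code. The sketch below (Python with NumPy, assumed available) reproduces the numbers of Example 6.1 for erf(1); the function names are ours.

import numpy as np

def midpoint(f, a, b):
    return (b - a) * f((a + b) / 2.0)                               # eq. (6.4)

def trapezoid(f, a, b):
    return (b - a) / 2.0 * (f(a) + f(b))                            # eq. (6.5)

def simpson(f, a, b):
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))   # eq. (6.6)

# Integrand of erf(1) = (2/sqrt(pi)) * integral_0^1 exp(-t^2) dt.
g = lambda t: 2.0 / np.sqrt(np.pi) * np.exp(-t * t)

for rule in (midpoint, trapezoid, simpson):
    print(rule.__name__, rule(g, 0.0, 1.0))   # ~0.8788, ~0.7717, ~0.8431 (exact: 0.842701...)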

6.2 Composite Integration


If we wish to attain higher accuracy in the result, we may divide the integration interval into several smaller subintervals and integrate over each of the subintervals individually. If we split up [a, b] into n subintervals of equal length, each subinterval will have length h = \frac{b-a}{n}. The exact integral is then written as

    I = \int_a^b f(x) \, dx = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} f(x) \, dx = \sum_{i=1}^{n} I_i,   (6.7)

with

    I_i = \int_{x_{i-1}}^{x_i} f(x) \, dx.   (6.8)

[Figure: the interval [a, b] split into n subintervals, with x_0 = a, x_n = b, and subinterval i = [x_{i-1}, x_i].]

6.2.1 Composite Trapezoid Rule


We may use the trapezoid rule to approximate the integral over each subinterval. We write:

    \hat{I}_i = h \, \frac{f(x_{i-1}) + f(x_i)}{2}.   (6.9)

Summing over all intervals yields the complete expression for the composite trapezoid rule:

    \hat{I} = \sum_{i=1}^{n} \hat{I}_i = \frac{h}{2} \left[ f_0 + 2f_1 + 2f_2 + \cdots + 2f_{n-1} + f_n \right].   (6.10)

Local Truncation Error: The local truncation error is the truncation error expected in each subinterval. We write

    T_{loc,i} = I_i - \hat{I}_i = -\frac{1}{12} (x_i - x_{i-1})^3 f''(\xi_i),   with \xi_i \in (x_{i-1}, x_i).

Hence, the local truncation error for the trapezoid rule is given by

    T_{loc,i} = -\frac{1}{12} f''(\xi_i) \, h^3.   (6.11)

We say that the local truncation error is of the order O(h^3).

Global Truncation Error: The global truncation error is the total truncation error over all intervals. We write

    T_{global} = I - \hat{I} = \sum_{i=1}^{n} (I_i - \hat{I}_i) = \sum_{i=1}^{n} T_{loc,i}.

In order to relate T_{global} to the interval length h, we will need the following theorem:

Theorem 6.1 The global truncation error for the composite trapezoid rule is O(h^2).

Proof. The global truncation error, in terms of the local truncation error, is given by

    |T_{global}| = \left| \sum_{i=1}^{n} T_{loc,i} \right| \le \sum_{i=1}^{n} |T_{loc,i}|   (by the triangle inequality).

From (6.11),

    |T_{global}| \le \sum_{i=1}^{n} |f''(\xi_i)| \frac{h^3}{12}.

We define M = \max_{a \le x \le b} |f''(x)| and so obtain

    |T_{global}| \le n M \frac{h^3}{12}.

Substituting n = \frac{b-a}{h} yields

    |T_{global}| \le (b-a) M \frac{h^2}{12}.

We conclude T_{global} = O(h^2).
6.2.2 Composite Simpson Rule

Recall that the Simpson rule is given by \hat{I} = \frac{b-a}{6} \left( f(a) + 4 f(\frac{a+b}{2}) + f(b) \right). We define the midpoint of the subinterval [x_{i-1}, x_i] as x_{i-1/2} = \frac{x_{i-1} + x_i}{2} (see figure).

[Figure: the subintervals of [a, b] with their midpoints x_{1/2}, ..., x_{i-1/2}, ..., x_{n-1/2}.]

Using this definition, we may write the Simpson rule over one subinterval as

    \hat{I}_i = \frac{h}{6} (f_{i-1} + 4 f_{i-1/2} + f_i).   (6.12)

Summing over all intervals yields the expression for the composite Simpson rule:

    \hat{I} = \sum_{i=1}^{n} \hat{I}_i = \frac{h}{6} \left[ f_0 + 4 f_{1/2} + 2 f_1 + \cdots + 4 f_{n-1/2} + f_n \right].   (6.13)

Finally, we can state a theorem for the global truncation error of this rule:

Theorem 6.2 The global truncation error for the composite Simpson rule is O(h^4).

Proof. The proof is similar to the proof for the composite trapezoid rule, except that it uses T_{loc,i} = O(h^5). This is left as an exercise for the reader.
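The orders O(h^2) and O(h^4) are easy to observe numerically. The sketch below (Python with NumPy, assumed available) implements the composite trapezoid rule (6.10) and composite Simpson rule (6.13) and prints the errors for the erf(1) integral of Example 6.1 as the number of subintervals is doubled; the function names are ours.

import numpy as np

def composite_trapezoid(f, a, b, n):
    # Composite trapezoid rule, eq. (6.10), with n subintervals of length h.
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    return h / 2.0 * (f(x[0]) + 2.0 * np.sum(f(x[1:-1])) + f(x[-1]))

def composite_simpson(f, a, b, n):
    # Composite Simpson rule, eq. (6.13): also uses the subinterval midpoints.
    x = np.linspace(a, b, n + 1)
    mid = (x[:-1] + x[1:]) / 2.0
    h = (b - a) / n
    return h / 6.0 * (f(x[0]) + f(x[-1]) + 2.0 * np.sum(f(x[1:-1])) + 4.0 * np.sum(f(mid)))

g = lambda t: 2.0 / np.sqrt(np.pi) * np.exp(-t * t)
exact = 0.8427007929497149                       # erf(1) to high precision

for n in (2, 4, 8, 16):
    err_t = abs(composite_trapezoid(g, 0.0, 1.0, n) - exact)
    err_s = abs(composite_simpson(g, 0.0, 1.0, n) - exact)
    print(n, err_t, err_s)   # trapezoid error shrinks like h^2, Simpson like h^4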

6.3 Gaussian Integration


Gaussian integration applies a different approach to integration than we have seen so far in
this chapter. Consider the following example:
Assume we wish to determine the integral of a function f(x) over the interval [-1, 1]. We write

    I = \int_{-1}^{1} f(x) \, dx   (6.14)

and propose the following form of the numerical approximation:

    \hat{I} = w_1 f(x_1) + w_2 f(x_2).   (6.15)



This expression has four unknowns (namely w1 , w2 , x1 and x2 .) We want to determine these
unknowns so that the degree of precision is maximal. Note that the location of function
evaluation is now also a variable that will be determined optimally, in addition to the function
weights.
Recall the degree of precision of an integration formula is the highest degree of polyno-
mials that are integrated exactly. Since there are 4 unknowns in this problem, we assume
we can require exact integration of polynomials up to degree 3. The four conditions imposed
on the problem are then:

1. f (x) = 1 is integrated exactly.

2. f (x) = x is integrated exactly.

3. f (x) = x2 is integrated exactly.

4. f (x) = x3 is integrated exactly.

(Note: these form a basis for all polynomials of degree \le 3.) Mathematically, these conditions are written as the following non-linear system in the four unknowns:

    1.  \int_{-1}^{1} 1 \, dx   = w_1 + w_2              ⇒  2 = w_1 + w_2
    2.  \int_{-1}^{1} x \, dx   = w_1 x_1 + w_2 x_2      ⇒  0 = w_1 x_1 + w_2 x_2
    3.  \int_{-1}^{1} x^2 \, dx = w_1 x_1^2 + w_2 x_2^2  ⇒  2/3 = w_1 x_1^2 + w_2 x_2^2
    4.  \int_{-1}^{1} x^3 \, dx = w_1 x_1^3 + w_2 x_2^3  ⇒  0 = w_1 x_1^3 + w_2 x_2^3.

With some manipulation, we may solve the system for

    x_1 = -\frac{1}{\sqrt{3}}, \quad x_2 = \frac{1}{\sqrt{3}}, \quad w_1 = 1, \quad w_2 = 1.

Substituting these constants back into (6.15) yields

    \hat{I} = 1 \cdot f\left( -\frac{1}{\sqrt{3}} \right) + 1 \cdot f\left( \frac{1}{\sqrt{3}} \right).   (6.16)
We conclude that this is an approximation for \int_{-1}^{1} f(x) \, dx with degree of precision m = 3.
By a change of integration variable, result (6.16) can be generalized as follows:

Proposition 6.1 (Gaussian Integration.)

    \hat{I} = \frac{b-a}{2} \left[ f\left( \frac{b-a}{2} \left( -\frac{1}{\sqrt{3}} \right) + \frac{b+a}{2} \right) + f\left( \frac{b-a}{2} \cdot \frac{1}{\sqrt{3}} + \frac{b+a}{2} \right) \right]   (6.17)

is an approximation for I = \int_a^b f(x) \, dx with degree of precision m = 3.

Proof. We can express x in terms of a new integration variable t as follows:

    x = a \, \frac{1-t}{2} + b \, \frac{1+t}{2}.

Note that x = a when t = -1 and x = b when t = 1. Also,

    dx = \frac{b-a}{2} \, dt.

Using this substitution we get

    I = \int_a^b f(x) \, dx = \int_{-1}^{1} f\left( \frac{b-a}{2} t + \frac{b+a}{2} \right) \frac{b-a}{2} \, dt = \frac{b-a}{2} \int_{-1}^{1} g(t) \, dt,

with

    g(t) = f\left( \frac{b-a}{2} t + \frac{b+a}{2} \right).

Using (6.16) we obtain

    \hat{I} = \frac{b-a}{2} \left[ g\left( -\frac{1}{\sqrt{3}} \right) + g\left( \frac{1}{\sqrt{3}} \right) \right],   (6.18)

which leads to the desired result.
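A sketch of (6.17) in code (Python with NumPy, assumed available), applied again to the erf(1) integrand of Example 6.1; with only two function evaluations it is already comparable in accuracy to the Simpson rule. The function name is ours.

import numpy as np

def gauss2(f, a, b):
    # Two-point Gaussian integration, eq. (6.17): nodes +-1/sqrt(3) mapped to [a, b].
    t = 1.0 / np.sqrt(3.0)
    mid, half = (b + a) / 2.0, (b - a) / 2.0
    return half * (f(half * (-t) + mid) + f(half * t + mid))

g = lambda x: 2.0 / np.sqrt(np.pi) * np.exp(-x * x)
print(gauss2(g, 0.0, 1.0))      # ~0.84244, compared with erf(1) = 0.842701...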

These notes have been funded by...


Appendix A

Sample Midterm Exam

Midterm Exam October 20, 2005


Instructor: Professor H. De Sterck Time: 1.5 hours
AIDS: non-graphing calculator

1. (Errors and error propagation.) [20]

(a) Give a brief explanation of the concept of ‘condition of a mathematical problem’.


Introduce the two relevant condition numbers and explain what they mean. When
is a problem well-conditioned or ill-conditioned?
(b) You are asked to calculate
    f(x) = \frac{1 - \cos x}{x}
for x = 0.01234567 in the floating point system F [b = 10, m = 4, e = 4] (use
rounding). What is the relative error in f l(x), and the relative error in the result
f l(f (x)), if you calculate f (x) using the straightforward algorithm given by the
formula above? Explain. Find a different algorithm for approximating f (x) that
is more stable than the straightforward algorithm. Compare relative errors, and
explain.

2. (Root finding.) [20]


Consider the polynomial function f (x) = x2 + b x + b2 /4 with b > 0.

(a) Show that this function has a double root x∗ , and locate it.
(b) Apply the formula for Newton’s method and find an expression for xn+1 as a
function of xn .


(c) Determine cn = |xn+1 − x∗ |/|xn − x∗ |. What does this prove about the or-
der of convergence? Does this contradict the convergence theorem for Newton’s
method? Explain.
(d) Can the Fixed Point Iteration method with g(x) = x + f (x) be applied to this
problem? If so, give the interval I in which the initial value x0 can be chosen
such that convergence will occur. (Watch out, this is not an obvious question.)
3. (Root finding.) [20]

(a) Define ‘contraction’. Illustrate with a graph and give one interpretation in terms
of geometrical properties of the graph.
(b) Formulate the contraction mapping theorem, and prove its first part (existence
and uniqueness of the fixed point).

4. (Numerical linear algebra.) [20]


Find the LU decomposition of the following matrices. If necessary, determine a per-
mutation matrix P s.t. P A = LU .
(a)
    A = \begin{pmatrix} 1 & 0 & 3 \\ -2 & 2 & -3 \\ 4 & -4 & 7 \end{pmatrix}.

(b)
    A = \begin{pmatrix} 1 & 7 & 3 \\ 2 & 14 & 5 \\ 3 & 13 & 2 \end{pmatrix}.
5. (Numerical linear algebra.) [20]

(a) Describe an algorithm for calculating the determinant of an n × n matrix that


has computational complexity W = O(n3 ) flops.
(b) Determine the number of floating point operations required for calculating the
determinant of an n × n matrix using a straightforward implementation of the
recursive definition of the determinant. (You can combine additions and multipli-
cations. Hint: consider the number of flops done on each recursive level separately,
and sum up over all the levels.) Find an approximation for the expression derived
that is valid for large n. (Hint: this approximate result should be a very simple
expression, and you will need exp x = 1 + x + x2 /2! + x3 /3! + x4 /4! + . . .)
Appendix B

Sample Final Exam

Final Examination Wednesday, December 14, 2005


Instructor: Professor H. De Sterck Time: 2.5 hours

1. (a) Define [15] machine epsilon of a floating point number system, and give an upper
bound for the relative error |δx| in the single precision floating point representa-
tion of a real number (assume chopping).
(b) Give a short derivation of the Secant method (starting from a Taylor series ex-
pansion). Provide a geometrical interpretation for the Secant method.

2. (a) Find [10] the positive root of f (x) = x2 − 6 using Newton’s method with starting
value x0 = 1. Stop the iteration when the fifth digit does not change anymore in
the result. (You can limit your calculations to numbers with 6 digits if you prefer
to do so.)
(b) If you were asked to find the positive root of f (x) using the bisection method
with initial interval [1, 3], how many iterations would be required to make sure
that the interval that contains the root has length smaller than t = 0.0001?

3. (a) Find the LU decomposition of [20]

    A = \begin{pmatrix} 3 & 3 & 3 \\ 6 & 8 & 8 \\ 6 & 10 & 11 \end{pmatrix}.

(b) Consider the general form of an iterative method for solving A~x = ~b, with A
non-singular. Show that if kI − B −1 Akp < 1 for any p-norm, then the iterative
method will converge to the solution for any starting value x~0 .


(c) Show that if kAkp < 1 for any p-norm, then I + A is non-singular. (Hint: assume
that I + A is singular, and demonstrate a contradiction.)
4. (a) Define frequency, period and angular frequency of a sine wave. [25] How are they
related?
(b) Explain how the formula for the DFT coefficients F [k] of a time signal vector
f~ = (f [0], f [1], . . . , f [N −1]) can be obtained using the general projection formula
that is valid in any orthogonal basis. Briefly justify all steps in your reasoning.
(c) Find the Fourier series of

    f(t) = \frac{1}{2} (\pi - |t|),   t \in [-\pi, \pi].

(Hint: note that f(t) is an even function, and recall that \int_a^b f g' \, dt = f g \big|_a^b - \int_a^b f' g \, dt.)
(d) Recall the general formula for calculating the length of a vector ~x = \sum_{n=0}^{N-1} x[n] ~e_n in an N-dimensional vector space with basis \{~e_0, ~e_1, ..., ~e_{N-1}\}:

    \| ~x \| = \sqrt{ < ~x, ~x > } = \sqrt{ \sum_n \sum_l < x[n] ~e_n, x[l] ~e_l > }.

Calculate the length of the time signal vector f~ = (f [0], f [1], . . . , f [N − 1]) both
using the expression for f~ in the time domain basis {f~0 , f~1 , . . . , f~N −1 }, and the ex-
pression in the frequency domain basis {F~0 , F~1 , . . . , F~N −1 }. The resulting length
should of course be the same in both cases. Is it the same? What is the physical
interpretation of this ‘length’ of time signal vector f~?
5. (a) Given [15] f0 , f0′ , f1 in points x0 and x1 , determine the coefficients a, b and c
of the interpolating polynomial y(x) = a(x − x0 )2 + b(x − x0 ) + c that satisfies
y(x0 ) = f0 , y ′ (x0 ) = f0′ and y(x1 ) = f1 .
(b) Given {(x_i, f_i)}_{i=0}^n, you are asked to determine the coefficients c_j such that h_n(x) = \sum_{j=0}^{n} c_j \exp(jx) interpolates the data. Show that there is a unique solution for the coefficients c_j. You can assume that all the x_i are different.
6. (a) Find [15] approximations for the integral

    I = \int_0^4 \exp\left( \frac{1}{1+x} \right) dx,

using the midpoint rule, the trapezoid rule, and the Simpson rule. An accurate value for the integral is I ≈ 6.1056105. Which of the three methods is the most accurate?
(b) Given the general expression for the truncation error of the Simpson rule in interval i,

    T_{loc,i} = \frac{-h^5}{2880} f^{(4)}(\xi_i),

derive an upper bound for the global error T_{global} of the composite Simpson rule for approximating I = \int_a^b f(x) \, dx using n subintervals of equal length h. What is the global order of accuracy of the composite Simpson rule?
