Num Methods
1 Introduction
1.1 Overview of typical issues in scientific computing
1.2 Structure of the course
2 Algebraic equations
2.1 Problem description and modeling of floating objects
2.2 Solving an algebraic equation by hand
2.3 Solving an algebraic equation with Matlab
2.3.1 Graphical approximation
2.3.2 Symbolical calculations
2.3.3 Numerical calculations
2.4 Digital representation of numbers
2.4.1 Floating point numbers and round-off errors
2.4.2 Binary representation of integer numbers
2.4.3 Binary representation of floating point numbers
2.5 Iterative methods for algebraic equations
2.6 Bisection method
2.6.1 Mathematical background and method
2.6.2 Algorithm and program
2.6.3 Programming issues
2.6.4 Example: equation for floating sphere
2.7 Fixed-point iterations (Picard iteration)
2.7.1 Mathematical background and method
2.7.2 Example: equation for floating sphere
2.7.3 Checking convergence
2.8 Newton's method
2.8.1 Mathematical background and method
2.8.2 Example: equation for floating sphere
2.9 Rate of convergence
2.9.1 Definition
2.9.2 Example: equation for floating sphere
3 Nonlinear systems of algebraic equations
3.1 Problem description and modeling: predator-prey models
3.2 Analytical solutions and solving with Matlab
3.2.1 Analytical solution
3.2.2 Plotting
3.2.3 Symbolical calculations
3.2.4 Numerical calculations
3.3 Newton's method for systems of equations
3.3.1 Mathematical background and method
3.3.2 Stopping criteria and vector norms
3.3.3 Example: predator-prey equations
3.3.4 Choice of initial vector
3.4 Solving linear systems for population models
3.4.1 Solving very small systems
3.4.2 Solving a little larger systems: Gaussian elimination
3.4.3 Built-in Matlab functions
4.7.4 Numerical computation of the integrals
4.7.5 Programming finite elements
4.8 Convergence of numerical methods for BVPs
4.8.1 Finite differences
4.8.2 Finite elements
4.9 Solving linear systems for BVPs: Crout's method
6.3.1 Programming one-step methods
6.4 Accuracy
6.5 Stability of one-step methods
6.5.1 Linear systems
6.5.2 Nonlinear systems
Chapter 1
Introduction
– Limitations of the theory: The mathematical model is usually only an approximation to the physical process.
Programming
Programming is an essential part of the course. Obtaining a solution using a numerical method often requires a large amount of algebraic manipulations. In order to perform such calculations, you need to be able to transform the numerical method into a numerical computer program and let a computer do all the calculations.
In this course we use Matlab to write numerical programs.
• Understand the problem and the numerical technique before you start programming.
• Write your program in such a way that it is easily changed to solve similar problems.
• Validate the results of your computer program. If you do computations with a computer, you always get an answer, but you need to make sure that the answer makes sense. A numerical technique might not work well for a specific problem, or a computer program might contain errors. (It is very easy to make mistakes when you write programs, but also easy to check.)
2. Numerical method: make the mathematical model suitable to solve with the help of a computer. Go from a continuous description (differential equation) to an approximate discrete description (difference equation). Discuss one or several basic techniques that can be used to solve the resulting equations and the relevant mathematical and computational concepts. (This will take about 3/4 of the time.)
The focus is on how to use the techniques, how they work, what the advantages and disadvantages of the various numerical techniques are, and what the numerical issues are.
4. Validation and visualization of results: convince yourself that you found the correct solution, and visualize the relevant data.
The last 5 weeks of class you need to work on a longer project (in groups or on your own; you may propose your own project or I will provide one). During these weeks some classes will be replaced by office hours, so you can work on the project and/or presentation.
This should give you a good basis for when and how to use scientific computing and
what to pay attention to. Wherever applicable, I will use a laptop with Matlab to illustrate
what we are doing.
Chapter 2
Algebraic equations
• rate of convergence
• algorithms
• stopping criteria
• initial guess
Numerical methods:
Programming
2.1 Problem description and modeling of floating objects
Consider a ball made of wood with radius R = 1 and density ρb = 1/2. How much of the ball will be submerged when it is placed into water (ρw = 1)? We don't know which portion of the ball is below the water; let's call this depth d. See Fig. 2.1.
(Fig. 2.1: sketch of the sphere with radius R = 1, submerged depth d, and water level z = 0.)
Vw = π(3d² − d³)/3.  (2.2)
Using Mb = Mw we arrive at an algebraic equation
3d² − d³ = 2.  (2.3)
2.2 Solving an algebraic equation by hand
Why is it useful to have an analytic solution?
To check whether a numerical program that you wrote is working correctly.
Solution method 1: You might know the general formula for a third order equation
or know where to find it.
Solution method 2: The polynomial has integer coefficients, so we can look for roots that are also integers. Any such root must divide the constant term. For d³ − 3d² + 2 = 0 this leaves four possibilities: 1, −1, 2, −2. We can easily verify by substitution that d = 1 is a solution and the others are not. The other two roots can be found by reducing the degree of the original polynomial: (d³ − 3d² + 2)/(d − 1) = d² − 2d − 2. This has the roots d₁ = 1 + √3 and d₂ = 1 − √3.
We have 3 possible solutions, but a body floating in the water will stay at one height only. Which is the correct solution? The problem requires that 0 ≤ d ≤ 2R = 2, so d = 1 is the solution we are looking for.
2.3 Solving an algebraic equation with Matlab
See the separate Matlab guide for an introduction on how to use Matlab and how to write
simple programs. All Matlab code and Matlab output will be in Sans Serif font.
Figure 2.2: Plot of d³ − 3d² + 2, (a) using ezplot, and (b) zoomed in around the roots.
The roots are the points where the function intersects the d-axis. From the graph we can estimate their values. For a more accurate number you can zoom in around a root in the Matlab figure window.
How do we get a more exact number, i.e. solve the equation d³ − 3d² + 2 = 0?
Solving symbolically, using syms d followed by sol = solve(d^3 - 3*d^2 + 2), gives
sol =
1
1 + 3^(1/2)
1 - 3^(1/2)
Remarks
• The roots d₂,₃ = 1 ± √3 are found exactly, not as finite precision numbers.
• Alternatively, you could have divided out a root from a polynomial symbolically:
syms d
y=(d^3-3*d^2+2) / (d-1)
y = simplify(y)
gives
y = d^2-2*d-2
• Symbolic calculations take much longer than floating point calculations. Thus, if computing time is an issue and a numerical solution is sufficient, do not use symbolic calculations.
• If no analytical solution is found (and the number of equations equals the number
of unknowns), a numerical solution is attempted. For example
syms d
sol = solve(d^5 + 3*d^2 - d^3 - 2)
gives the numerical values
[ .85052644896432252802899764958837]
[ .73793253717418508330026780848423+1.1865616439828408330458807692618*i]
[ -.77762851016613744223048982300174]
[ -1.5487630131465552523990434435551]
[ .73793253717418508330026780848423-1.1865616439828408330458807692618*i]
• Symbolic solving works only for relatively simple problems that you could (in principle) also do by hand.
To compute all roots of a polynomial numerically with Matlab's built-in function roots, first write the polynomial in the form f(d) = 0, so for this case d³ − 3d² + 2 = 0. Then make an array with the coefficients of the polynomial (the values of an array in Matlab are enclosed in [..]). The coefficients need to be ordered from the highest to the lowest power in d (and don't forget the zero coefficient of the d term):
c=[1 -3 0 2]
Then use d=roots(c) to compute all three roots
d=
2.7321
1.0000
-0.7321
For non-polynomial algebraic equations, fzero can be used to compute a single root of an arbitrary function. For example fzero('x^3-3*x^2+2', 0.5) tries to find a root of x³ − 3x² + 2 = 0 near x₀ = 0.5. To use fzero you need a reasonable estimate of the value of the root, for which you could use a plot of the function.
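The same root can also be computed with a function handle, which avoids retyping the formula; a minimal sketch (the handle name f is just illustrative):
f = @(x) x.^3 - 3*x.^2 + 2;  % function handle for the floating sphere polynomial
r = fzero(f, 0.5)            % finds the root near the initial guess 0.5, here r = 1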
Advantages and disadvantages of numerical calculations:
2.4 Digital representation of numbers
2.4.1 Floating point numbers and round-off errors
When the root problem d³ − 3d² + 2 = 0 was solved with the Matlab functions roots or fzero, the exact roots were not found. Matlab produced so-called floating point numbers, which are only approximations to the roots. A numerical computation on a computer is different from a calculation in algebra and calculus courses: it never produces an exact solution, only an approximation. Even Matlab's 1.0000 is not the same as the d = 1 we obtained analytically. It could be any number in the range 0.99995 ≤ d < 1.00005. To check this in Matlab, enter
d = 1.0000499999999999
which gives
d = 1.0000
and
d = 0.99995
which also gives
d = 1.0000
Matlab rounds a number to 16 digits rather than simply chopping it after 16 digits. Thus, if the 17th digit is 5 or larger, 1 is added to the 16th digit (round up). If the 17th digit is lower than 5, all digits from the 17th on are chopped (round down).
The notation for the floating point number of y is fl(y). Matlab uses 16 significant
digits for floating point numbers.
Advantages/disadvantages of more digits
• Computations are slower and more memory is required to store numbers.
• Roundoff error becomes smaller.
Nowadays, most computations are done in double precision (16 digits).
In the floating object problem we started from a cubic equation, divided out d − 1, and found the quadratic equation d² − 2d − 2 = 0. A quadratic equation ax² + bx + c = 0 has the solutions
x₁,₂ = (−b ± √(b² − 4ac))/(2a)
In "exact (calculus) calculations" this always gives the correct answer. When a finite number of digits is used, however, this formula can give a poor approximation of a root.
Example:
Use a = 1, b = 123.4, c = 1.2. The exact solutions are, up to 8 digits:
x1 = -9.7252397e-03 and x2 = -1.2339027e+02.
Using 4-digit arithmetic we would get:
√(b² − 4ac) = √(fl(1.522756e4) − fl(4.800000e0)) = √(1.523e4 − 4.800e0) = √(1.523e4) = 1.234e2
Now compute x₁,₂ (using 4-digit arithmetic):
x₁ = (−b + √(b² − 4ac))/(2a) = (-1.234e2 + 1.234e2)/2.000e0 = 0.000e0
x₂ = (−b − √(b² − 4ac))/(2a) = (-1.234e2 − 1.234e2)/2.000e0 = -1.234e2
The approximation of x₂ using 4-digit precision is accurate up to 4 digits, but the approximation for x₁ is not! x₁ has no accurate digits, which is rather problematic if x₁ is the physical solution. The main problem is subtracting two almost equal numbers. Using more digits for the calculations will improve the result, but it cannot completely eliminate the inaccuracy of subtracting two nearly equal numbers in finite precision arithmetic.
One can avoid the subtraction of two nearly equal numbers by rewriting the expression for x₁:
x₁ = (−b + √(b² − 4ac))/(2a) × (−b − √(b² − 4ac))/(−b − √(b² − 4ac)) = 2c/(−b − √(b² − 4ac))
Using 4-digit arithmetic this gives
2c/(−b − √(b² − 4ac)) = 2.400e0/(-1.234e2 − 1.234e2) = -9.724e-3
Whether almost equal numbers are subtracted depends on whether b is positive or negative. To take this into account properly, first evaluate
q = −[b + sign(b)√(b² − 4ac)]/2,
where sign(b) = 1 if b ≥ 0 and −1 otherwise. Then the roots are
x₁ = c/q,  x₂ = q/a.
Using 4-digit arithmetic this gives x1 = -9.724e-03 and x2 = -1.234e+02, which are both accurate up to 3 digits.
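A minimal Matlab sketch of this more robust procedure (shown in full double precision; the variable names are illustrative):
a = 1; b = 123.4; c = 1.2;
q = -(b + sign(b)*sqrt(b^2 - 4*a*c))/2;  % no subtraction of nearly equal numbers
x1 = c/q                                 % small root, approx. -9.7252e-03
x2 = q/a                                 % large root, approx. -1.2339e+02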
2.4.2 Binary representation of integer numbers
Computers do not use base-10 numbers but base-2 (binary) numbers to do calculations.
We start with binary representation of integer numbers, since this is easier to understand
than the binary representation of floating point numbers. Matlab doesn't have a separate integer number representation, only floating point numbers. Programming languages like C and Fortran, however, do.
Integer base 10 numbers: What does the number 123 mean exactly? 123 = 1 × 10² + 2 × 10¹ + 3 × 10⁰. Any positive integer can be written as a base 10 number Σᵢ₌₀ⁿ aᵢ 10ⁱ, with each aᵢ an integer from 0 to 9.
Integer base 2 numbers: Likewise, any positive integer can be written as a base 2 number Σᵢ₌₀ⁿ bᵢ 2ⁱ, with each bᵢ a 0 or a 1. To distinguish base-2 numbers from base-10 numbers we use the notation ( )₂ for base-2 numbers and ( )₁₀ for base-10 numbers. To go from base-2 to base-10 numbers and vice versa, just use Σᵢ₌₀ⁿ bᵢ 2ⁱ:
base 2                                          base 10
(1)₂ = 1 × 2⁰                                   (1)₁₀
(10)₂ = 1 × 2¹ + 0 × 2⁰                         (2)₁₀
(11)₂ = 1 × 2¹ + 1 × 2⁰                         (3)₁₀
(1011)₂ = 1 × 2³ + 0 × 2² + 1 × 2¹ + 1 × 2⁰     (11)₁₀
Try the following two examples yourself:
What is the base-10 number corresponding to the base-2 number (10101)₂? What is the base-2 number corresponding to the base-10 number (101)₁₀?
There are Matlab functions that convert from binary to decimal and vice versa: bin2dec
and dec2bin. The above 2 examples you can check using bin2dec(’10101’) and dec2bin(101).
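A small sketch of this conversion done by hand in Matlab (the array b holds the binary digits, most significant first):
b = [1 0 1 0 1];                        % digits of (10101)_2
val = sum(b .* 2.^(length(b)-1:-1:0))   % gives 21, the same as bin2dec('10101')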
How much memory is reserved for an integer?
How do we measure the amount of memory used?
A bit is a binary digit, i.e. a 0 or a 1.
A byte is a group of eight bits.
A word is the smallest addressable unit of memory for a computer (often 2 bytes or 4
bytes).
If we have one word of 2 bytes to store an integer number, the binary number (1011)₂ would be stored as 0000 0000 0000 1011 (just fill up with zeros at the front). Why is the above number not stored as a 4-bit number? Computer hardware can be kept simpler and more efficient if it only handles numbers with a predetermined number of bits.
What is the largest integer on a computer if 2 bytes are used to store integers (not considering negative integers)? We need all ones to get the largest integer, so for 2 bytes (= 16 bits): Σₖ₌₀¹⁵ 2ᵏ = (1 − 2¹⁶)/(1 − 2) = 2¹⁶ − 1 (geometric series). In this way we cannot represent negative numbers, only integers from 0 to 2¹⁶ − 1. Integers of this type are called unsigned integers. Note that when signed integers are used, only integers in the range [−2¹⁵, 2¹⁵ − 1] = [−32768, 32767] are available. (This is not symmetric since 0 is included.) If 32 bits are used for an integer, this is called a long integer.
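In current Matlab versions these ranges can be verified with the built-in functions intmax and intmin (Matlab does provide integer classes for storage, even though ordinary arithmetic uses doubles):
intmax('uint16')   % 65535 = 2^16 - 1, largest unsigned 2-byte integer
intmin('int16')    % -32768 = -2^15
intmax('int16')    % 32767 = 2^15 - 1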
2.4.3 Binary representation of floating point numbers
Single precision numbers have 32 bits (4 bytes). The single precision IEEE standard
floating point number is defined as:
Figure 2.3: Single precision floating point number (32 bits): 1 sign bit for mantissa (s), 8
bits for the exponent (c), and 23 bits for the fractional part of the mantissa (f ).
A floating point number is stored in normalized form: the digits are shifted so that the first digit of the number is non-zero. For example, for base 10 numbers: 0.9 = 9 × 10⁻¹. For binary numbers the only non-zero digit is 1.
The leftmost bit of a floating point number is for the sign of the number: s = 0 for positive and s = 1 for negative numbers.
The next 8 bits are for the exponent c − 127. The value of c can take 2⁸ = 256 different values, from 0 to 255. The first and the last value (c = 0 and c = 255 for single precision) are always reserved for special cases, including ±0 and ±∞; this holds for other precisions like double precision as well. Thus, for single precision, the range of values for the exponent c − 127 is −126 to 127.
The last 23 bits are used for the mantissa (the number multiplying the exponential function, here with base 2). Since the first bit is always 1, it doesn't need to be stored. So the mantissa actually corresponds to 24 bits, since there is one 'hidden' bit. Zero in floating point notation is represented by all zero bits (with the sign bit as a possible exception). The mantissa is restricted by
1 = (1.000...0)₂ ≤ (1.f)₂ ≤ (1.111...1)₂ = Σᵢ₌₀²³ (1/2)ⁱ = 2 − 2⁻²³
Note that all together there are 24 ones in (1.111...1)₂, meaning 1 × 2⁰ + 1 × 2⁻¹ + 1 × 2⁻² + ··· + 1 × 2⁻²³.
The largest single precision number is (2 − 2⁻²³) × 2¹²⁷ ≈ 3.4 × 10³⁸ (largest mantissa and largest exponent). The smallest (positive) single precision number is 1 × 2⁻¹²⁶ ≈ 1.2 × 10⁻³⁸ (smallest mantissa and smallest exponent). There are no (accurate) single precision numbers between 0 and approximately 1.2 × 10⁻³⁸, and there are no single precision numbers above the maximum single precision number 3.4 × 10³⁸.
Figure 2.4: Usable range of numbers in single precision using standard IEEE notation.
Similarly, the negative single precision number of largest magnitude is approximately −3.4 × 10³⁸ and the negative number of smallest magnitude is approximately −1.2 × 10⁻³⁸. See Fig. 2.4 for the usable range of numbers.
What happens when we produce a number outside the usable range of values? A too large number gives an overflow (Inf). A too small (positive) number first gives a less accurate number (denormal) and then 0. The situation for negative numbers is similar.
The machine epsilon ε is the smallest positive machine number such that fl(1 + ε) ≠ 1. In single precision (23 bits for the mantissa) this is 2⁻²³ ≈ 1.2 × 10⁻⁷. Note that this value is much larger than the smallest single precision number.
Double precision numbers have 64 bits (8 bytes). The double precision IEEE standard floating point number uses 1 bit for the sign, 11 bits for the number c in the exponent (c − 1023), and 52 bits for the mantissa. In Matlab you can find the maximum number by using realmax, the minimum positive number by realmin, and the machine epsilon by eps.
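A quick check of these double precision values in Matlab:
realmax   % approximately 1.7977e+308
realmin   % approximately 2.2251e-308
eps       % 2^(-52), approximately 2.2204e-16 (52-bit mantissa)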
2.5 Iterative methods for algebraic equations
Iterative methods do not try to compute the exact solution, but give only an approxi-
mation to the solution. The user should specify how close the approximation should be
to the solution. Iterative methods involve the following 3 steps:
• An initial guess for the solution is chosen.
• A sequence of (hopefully better and better) approximations is generated.
• The sequence is stopped when the approximation is "close enough to the solution". The sequence should also be stopped when it is clear that it will not approach the solution at all (otherwise it keeps on generating numbers forever). This might be because the method generates a sequence that does not approach the solution or because a (Matlab) program is written incorrectly.
Usually a while-loop is used for an iterative method, since you don't know in advance how many times you need to compute an approximation in the sequence. The structure of a while-loop for an iterative method is as follows:
function [x] = iterative_method(initial_guess, maxiter, tolerance)
% Initializations, for example
iter = 1;
xold = initial_guess;
Inside the loop, break is used to stop the iteration once the approximation is close enough (break terminates the while loop and continues the program after the end corresponding to the while loop).
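A minimal sketch of how the skeleton could continue (the update step g and the simple stopping test are placeholders to be adapted to a specific method):
while iter <= maxiter
    xnew = g(xold);              % placeholder: one step of the iterative method
    if abs(xnew - xold) < tolerance
        break                    % close enough to the solution
    end
    xold = xnew;
    iter = iter + 1;
end
x = xnew;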
Three iterative methods to find roots will be discussed in the next sections: bisection,
fixed-point iterations, and Newton’s method.
2.6 Bisection method
2.6.1 Mathematical background and method
The idea of the bisection method is based on the intermediate value theorem:
If f is continuous on [a, b], and f(a) and f(b) have opposite sign, then there exists a point p in (a, b) with f(p) = 0. See Fig. 2.5.
Figure 2.5: A continuous line from a to b where f (a) and f (b) have opposite sign should
cross the x-axis (y = 0).
Figure 2.6: Graphical representation of bisection method.
If f(a) and f(p₁) have opposite signs, then the root p is in (a, p₁). Take the midpoint of this interval, p₂ = (a + p₁)/2, etc.
If f(b) and f(p₁) have opposite signs, then the root p is in (p₁, b). Take the midpoint of this interval, p₂ = (p₁ + b)/2, etc.
Advantages/disadvantages of the bisection method:
• You only find 1 root, which depends on the initial points a and b.
Algorithm: Bisection
Input: 2 points a and b, a tolerance ε, and a maximum number of iterations N
Output: approximation to a root in [a, b] and the number of iterations performed
Checks
Check whether f (a) and f (b) have opposite sign
Initialization
Compute the function value f (a) (Done already in Checks)
Actual method
While the number of iterations does not exceed N , do
Compute new p = (a + b)/2
Update the right point if f (p) and f (a) have opposite sign
Otherwise update the left point
Matlab program: Bisection
(Following the algorithm directly, ignoring the programming issues in Sec. 2.6.3; using these, it can be improved significantly.)
%———————————————————————————————————
% Checks and initializations
%———————————————————————————————————
k = 1;
a(1) = a0;
b(1) = b0;
fa(1) = a0^3 - 3*a0^2 + 2;
fb(1) = b0^3 - 3*b0^2 + 2;
if (fa(1) <= 0 & fb(1) <= 0) | (fa(1) >= 0 & fb(1) >= 0)
error(’Initial guesses do not have opposite sign’)
end
%———————————————————————————————————
% Iteration loop
%———————————————————————————————————
while k <= N
%————————————————————————————————-
% Calculate new approximation to root pk
%————————————————————————————————-
p(k) = (a(k) + b(k)) / 2;
fp(k) = p(k)^3 - 3*p(k)^2 + 2;
%———————————————————————————————————
% Check whether close to a root
%———————————————————————————————————
if k > 1
if abs(p(k) - p(k-1)) < epsil
break
end
end
%———————————————————————————————————
% Prepare for next iteration
%———————————————————————————————————
if (fa(k) < 0 & fp(k) > 0) | (fa(k) > 0 & fp(k) < 0)
a(k+1) = a(k);
fa(k+1) = fa(k);
b(k+1) = p(k);
fb(k+1) = b(k+1)^3 - 3*b(k+1)^2 + 2;
else
a(k+1) = p(k);
fa(k+1) = a(k+1)^3 - 3*a(k+1)^2 + 2;
b(k+1) = b(k);
fb(k+1) = fb(k);
end
k = k + 1;
end
%———————————————————————————————————
% Check if iterative method converged or not
%———————————————————————————————————
if k > N
error(sprintf(’Bisection method did not converge in %d iterations’, N));
end
2.6.3 Programming issues
Programming with minimal memory usage
Now we use 6 arrays with all previous values of a, b, p, fa, fb, and fp. If a large number
of iterations is performed, the large arrays may take quite a lot of memory and make
computations slower. In addition, no memory is allocated (reserved) in advance for the
large arrays. Every time a new number is stored in an array, Matlab needs to create space
first to store that number. This makes the process even slower. However, storing all those
results is totally unnecessary: only the most recent values of the left point a, the right
point b, and the middle point p are used and some corresponding function values. So a
single variable instead of an array is sufficient for each.
How to check efficiently whether f(a) and f(p) have opposite sign?
Extra if statements inside for or while loops make a code (much) slower than necessary. A more efficient way to check whether two numbers have opposite sign is to check whether their product is negative, i.e. check if f(a) × f(p) < 0. Note: if both f(a) and f(p) are very large, the product f(a) × f(p) might be larger than the maximum floating point number (see Sec. 2.4). To avoid such problems the sign function can be used: sign(fa)*sign(fp). If fa is positive, sign(fa) equals 1; if it is negative, it equals −1.
Make m-files easy to modify
Use separate functions for parts that need to be modified frequently. Advantage: Once
the bisection function is working, you never need to modify it anymore. You only need
to modify the accompanying function that evaluates the function f . Additionally, there
is only one line that defines the function f , even if you evaluate f at various places in the
bisection m-file. In the bisection function you need to compute f (a), f (b), and f (p). You
just need to call the function several times with the correct value at which the function
needs to be evaluated, a, b, and p.
Disadvantage: Computations take a little longer due to extra function calls. Only
when you find that you can save a significant amount of computing time, you may want
to avoid the function calls.
Easiest way to do this in Matlab: we write a separate function that evaluates f at the end of the m-file for the bisection method. If we solve d³ − 3d² + 2 = 0, for example, we make a function funcbisec that evaluates the function:
function [f] = funcbisec(x)
% Floating sphere equation
f = x^3 - 3*x^2 + 2;
The function funcbisec is called in the m-file for the bisection method, bisection, at every
place where f needs to be evaluated, with the proper value for x. To evaluate f (p), for
example, and assign this value to the variable fp, use
fp = funcbisec(p);
Here p needs to have a proper numerical value.
If we want to compute roots of another problem, we only need to modify the expression
for f in the function funcbisec.
How to check whether the approximation is 'good enough'?
For the bisection method, we can do better than comparing pᵢ and pᵢ₋₁. We know that the root lies between the most recent values of a and b. So if we choose p in the middle, we are certain that the error is less than or equal to |b − a|/2 = |p − a|. Thus if |p − a| is smaller than a specified tolerance ε, we are certain that the actual error is less than or equal to ε.
The stopping criterion can be made more robust by using a relative stopping criterion and/or by checking how small the residual f(pᵢ) is (the residual measures how well the equation you try to solve is satisfied; at a root, f(p) = 0 exactly, so you want f(pᵢ) to be small).
When you are not sure whether the numerical solution is good enough, you can always try to decrease ε (and probably increase the maximum number of iterations as well).
Function evaluations
Evaluating functions like sin(x), exp(x), etc. is computationally much more expensive than multiplications or additions. For all but very simple functions, most of the computing time will be spent in the evaluation of f. Thus a fast program for the bisection method contains as few function evaluations as possible.
Our first function, bisection0, has several function evaluations inside the while-loop. Only the function evaluation at the point pᵢ is necessary.
Unnecessary operations inside loops
Every operation and function call takes computing time (CPU time). If part of a computation is repeated exactly every iteration, it saves CPU time if you do the computation once before the loop starts. For example, the sign of f(a) never changes during the bisection iterations. Thus it can be computed before the while loop and stored in a variable (signfa in the function bisection). The decision whether to use function calls (more flexible, easier-to-read code) or not (saves CPU time) depends on what is most important for the problem you are solving. If your computations take a long time and a significant percentage of the total CPU time can be saved by avoiding function calls, you may want to minimize the function calls.
A Matlab function with all the above modifications is shown below. You can run the bisection function using (with input parameters a = 0.1 and b = 2 for the endpoints, tolerance ε = 10⁻⁶, and maximum number of iterations N = 100):
[p, k] = bisection(0.1, 2, 100, 1e-6)
Matlab program: Bisection
(More flexible and robust bisection method, including the programming issues in
Sec. 2.6.3)
%———————————————————————————————————
% Checks and initializations
% Note: use sign to avoid overflow
% signfa will never change, no need to recompute
%———————————————————————————————————
k = 1;
fa = funcbisec(a);
fb = funcbisec(b);
signfa = sign(fa);
if signfa*sign(fb) >= 0
error(’Initial guesses do not have opposite sign’)
end
%———————————————————————————————————
% Iteration loop
%———————————————————————————————————
while k <= N
%———————————————————————————————————
% Calculate new approximation to root pk
%———————————————————————————————————
p = (a + b) / 2;
fp = funcbisec(p);
%———————————————————————————————————
% Check whether close to a root
% Note: Both f and the difference of 2 approximations should be small
%———————————————————————————————————
if abs(fp) < epsil & abs(p-a) < epsil
break
end
%———————————————————————————————————
% Prepare for next iteration
%———————————————————————————————————
if signfa*sign(fp) < 0
b = p;
else
a = p;
end
k = k + 1;
end
%———————————————————————————————————
% Check if iterative method converged or not
%———————————————————————————————————
if k > N
error(sprintf(’Bisection method did not converge in %d iterations’, N));
end
%———————————————————————————————————
% Function f(x)
%———————————————————————————————————
function [f] = funcbisec(x)
% Floating body problem
f = x^3 - 3*x^2 + 2;
2.6.4 Example: equation for floating sphere
As example we take as starting values a = 0.1 and b = 2. In addition we use N = 100 for
the maximum number of iterations and = 10−6 for the tolerance (We can always change
the values and run the problem again if these are not sufficient). Table 2.1 contains the
sequence of approximations. Such a long list of numbers is difficult to interpret, a plot is
i  pᵢ  |1 − pᵢ|
1 1.050000000000000e+00 5.000000000000004e-02
2 5.750000000000001e-01 4.249999999999999e-01
3 8.125000000000000e-01 1.875000000000000e-01
4 9.312500000000000e-01 6.874999999999998e-02
5 9.906250000000001e-01 9.374999999999911e-03
6 1.020312500000000e+00 2.031250000000018e-02
7 1.005468750000000e+00 5.468750000000133e-03
8 9.980468750000001e-01 1.953124999999889e-03
9 1.001757812500000e+00 1.757812500000178e-03
10 9.999023437500001e-01 9.765624999991118e-05
11 1.000830078125000e+00 8.300781250001332e-04
12 1.000366210937500e+00 3.662109375000000e-04
13 1.000134277343750e+00 1.342773437500444e-04
14 1.000018310546875e+00 1.831054687517764e-05
15 9.999603271484376e-01 3.967285156236677e-05
16 9.999893188476564e-01 1.068115234359457e-05
17 1.000003814697266e+00 3.814697265847045e-06
18 9.999965667724611e-01 3.433227538929273e-06
19 1.000000190734863e+00 1.907348634588857e-07
20 9.999983787536623e-01 1.621246337735194e-06
21 9.999992847442629e-01 7.152557370826429e-07
22 9.999997377395632e-01 2.622604368118786e-07
Table 2.1: Iteration number i, approximations pᵢ, and absolute difference from the exact solution |1 − pᵢ|; bisection method for d³ − 3d² + 2 = 0 using initial points a = 0.1 and b = 2, tolerance ε = 10⁻⁶, and maximum number of iterations N = 100.
Such a long list of numbers is difficult to interpret; a plot is easier. The easiest way to generate a plot of the errors is to compute the errors |1 − pᵢ| in the bisection function and store these in an array, say e. Thus, for the above example, e is an array of length 22 which contains the actual error (the difference from the exact solution), here |1 − pᵢ|.
Then create an array with the iteration numbers 1 to 22:
i=1:1:22
Then make a plot with marker x:
plot(i, e, ’x’)
This plots the errors, but it is still difficult to see what happens (does the error continue to decrease or not).
To see more clearly the behavior at small errors, we can use a logarithmic scale for the y-axis. Instead of plot, now use
semilogy(i, e, ’x’)
See Fig. 2.7. It is clear from the logarithmic plot that the error continues to decrease.
Figure 2.7: Actual error as a function of the iteration number for the bisection method (d³ − 3d² + 2 = 0, a = 0.1, b = 2) using an unscaled (a) and a logarithmic (b) y-axis.
2.7 Fixed-point iterations (Picard iteration)
2.7.1 Mathematical background and method
A number p is a fixed point for a given function g if g(p) = p.
Relation between roots and fixed points:
Finding solutions of the root problem f (p) = 0 is equivalent to finding the fixed points
of a corresponding fixed point problem. There is not one single way to transform a root
problem into a fixed point problem.
For example, g(x) = x − f(x) and g(x) = x + f(x)/3 are two fixed point problems corresponding to the root problem f(x) = 0. For both cases, if p is a root of f, then f(p) = 0 and thus g(p) = p, i.e. p is a fixed point of g.
Also, if p is a fixed point of g, then g(p) = p and thus f(p) = 0, i.e. p is a root of f.
How does a fixed-point iteration work?
Fixed points are the intersection points of the line y = x and the curve y = g(x). A fixed-point iteration consists of two steps: choose an initial guess p₀, then repeatedly compute pₙ = g(pₙ₋₁).
(Fig.: cobweb sketch of the iterates p₀, p₁, p₂, p₃ approaching the fixed point p where y = x intersects y = g(x).)
The iteration converges to a fixed point p in [a, b] when
|g′(x)| < 1
for all x in [a, b] (and the initial guess is in [a, b]).
When does a fixed point iteration converge fast?
When the absolute value of the derivative |g′(x)| is small. If you have to find such a g by trial and error, it might be much more work than the bisection method takes to solve the problem. In the next section we discuss a fixed-point iteration that converges fast: Newton's method.
Advantages/disadvantages of fixed-point iterations:
• if a fixed-point iteration converges, it may (or may not) be faster than the bisec-
tion method.
1. Adding d on the left and the right: d = d + d³ − 3d² + 2. Starting from d = 0.99 this does not converge. However, d = d + (d³ − 3d² + 2)/2 converges, and d = d + (d³ − 3d² + 2)/3 converges in just 4 iterations starting from d = 0.5.
2. Writing d² = (3d² − 2)/d = 3d − 2/d and taking the square root: d = √(3d − 2/d). Converges in 30 iterations, starting from d = 0.5. In Matlab, the result has a small imaginary part. Some other programming languages would give an error when you try to compute the square root of a negative number.
3. Writing d²(3 − d) = 2, dividing by (3 − d) and taking the square root: d = √(2/(3 − d)). Converges in 11 iterations, starting from d = 0.5. See Table 2.2.
i  pᵢ  |1 − pᵢ|
1 8.944271909999159e-01 1.055728090000841e-01
2 9.746077623781704e-01 2.539223762182963e-02
3 9.937117548732618e-01 6.288245126738201e-03
4 9.984316360970824e-01 1.568363902917591e-03
5 9.996081394766783e-01 3.918605233217409e-04
6 9.999020492625699e-01 9.795073743013027e-05
7 9.999755132150757e-01 2.448678492428247e-05
8 9.999938783599811e-01 6.121640018896812e-06
9 9.999984695935085e-01 1.530406491534464e-06
10 9.999996173985968e-01 3.826014032259906e-07
11 9.999999043496630e-01 9.565033698422098e-08
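A minimal sketch of the fixed-point loop for form 3 above (g(d) = √(2/(3 − d)), starting from d = 0.5 as in Table 2.2; the fixed iteration count is for illustration only):
d = 0.5;                   % initial guess
for k = 1:11
    d = sqrt(2/(3 - d));   % fixed-point update d = g(d)
end
d                          % approaches the root d = 1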
The stopping criterion is not always sufficient. What if the derivative g′ of g is almost equal to 1? Take as example p = g(p) with g(p) = p + 10⁻⁴ × (p³ − 3p² + 2) to determine the root of the floating object equation p³ − 3p² + 2 = 0. The derivative is g′(p) = 1 + 10⁻⁴ × (3p² − 6p), which is just below 1 in (0, 1]. Again we take p₀ = 0.5 and ε = 10⁻⁶. The actual error at each iteration is displayed in Fig. 2.9 on a semilogarithmic scale.
Figure 2.9: Actual error as a function of iteration number for the fixed-point iteration p = p + 10⁻⁴ × (p³ − 3p² + 2), p₀ = 0.5.
The difference between the root and the approximation when the stopping criterion was met, |1 − p₁₆₈₄₆| = 3.332001912442095e-03, is much larger than 10⁻⁶. Another stopping criterion that could be used is checking whether the residual r is small. The residual measures how well the original equation is satisfied, here how well f(p) = 0 is satisfied. For the floating body equation we need to check whether rᵢ = pᵢ³ − 3pᵢ² + 2 is small. At the final iteration we have r₁₆₈₄₆ = p₁₆₈₄₆³ − 3p₁₆₈₄₆² + 2 = 9.995968744652473e-03, which is quite large compared to the tolerance 10⁻⁶.
To obtain the required accuracy, it often helps to check the residual as well. If we use as additional stopping criterion |rᵢ| < 10⁻⁶, both stopping criteria are met after 47542 iterations. Then p₄₇₅₄₂ = 9.999996667462827e-01 and |1 − p₄₇₅₄₂| = 3.332537172884287e-07, which agrees well with the accuracy we wanted (10⁻⁶).
2.8 Newton’s method
Newton’s method for solving f (x) = 0 is a special choice of a fixed-point iteration. It is
also called the Newton–Raphson method.
Now assume that we are ”close” to the root, then the O(p − x0 )2 term is small compared
to the linear term:
0 ' f (x0 ) + (p − x0 )f 0 (x0 ),
or after rearranging:
f (x0 )
p ' x0 − .
f 0 (x0 )
This is used in Newton’s method to find a new approximation pn .
How does Newton's method work?
1. Choose an initial approximation p₀.
2. For n ≥ 1 take
pₙ = pₙ₋₁ − f(pₙ₋₁)/f′(pₙ₋₁)
Note that Newton's method is of the form pₙ = g(pₙ₋₁) with g(pₙ₋₁) = pₙ₋₁ − f(pₙ₋₁)/f′(pₙ₋₁), i.e. Newton's method is a fixed-point iteration.
Remarks
• In the derivation it was assumed that the remainder term, which contains a factor (x − p)², is small compared to the linear term in x − p. This is not true if the initial guess is not close enough to the root, and the method might therefore not converge if the starting point is "too far" from the root.
• Graphical interpretation: at every iteration, Newton's method finds the new approximation pᵢ as the point where the tangent line to f at (pᵢ₋₁, f(pᵢ₋₁)) crosses the x-axis. Thus for linear functions f, you arrive at the root in 1 iteration.
• Fast (quadratic) convergence when you get close enough to the root.
• The method does not always converge to a solution: zero derivative, initial guess not
sufficiently close. (If you don’t have a good enough initial approximation, you could
first do, for example, some bisection iterations and then start Newton’s method)
• You need to calculate a derivative (for complicated functions, you can use a symbolic
calculation)
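A minimal sketch of Newton's method for the floating sphere equation (f(d) = d³ − 3d² + 2, f′(d) = 3d² − 6d, with the same p₀, ε, and N as in Table 2.3 below):
p = 0.5;                                         % initial guess p0
for n = 1:100                                    % at most N = 100 iterations
    pnew = p - (p^3 - 3*p^2 + 2)/(3*p^2 - 6*p);  % Newton update
    if abs(pnew - p) < 1e-6
        p = pnew;
        break
    end
    p = pnew;
end
p                                                % converges to the root d = 1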
i  pᵢ  |1 − pᵢ|
1 1.111111111111111e+00 1.111111111111112e-01
2 9.990740740740740e-01 9.259259259259967e-04
3 1.000000000529222e+00 5.292219995567393e-10
4 1.000000000000000e+00 0.000000000000000e+00
Table 2.3: Iteration number i, approximations pᵢ, and absolute difference from the exact solution |1 − pᵢ|; Newton's method using p₀ = 1/2, tolerance ε = 10⁻⁶, and maximum number of iterations N = 100.
2.9 Rate of convergence
2.9.1 Definition
A sequence {pₙ} converges to p with order α if constants α and γ exist such that
lim_{n→∞} |pₙ₊₁ − p|/|pₙ − p|^α = γ.
Thus if we make an xy-plot of y = log eₙ₊₁ vs. x = log eₙ, α corresponds to the slope. Once α is known, γ can easily be determined from the above equation.
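A small sketch of this slope estimate in Matlab, assuming the errors eₙ are stored in a vector e (drop trailing entries that have already reached machine precision, since log(0) is not usable):
loge = log(e);                               % any log base gives the same slope
c = polyfit(loge(1:end-1), loge(2:end), 1);  % fit log e_{n+1} versus log e_n
alpha = c(1)                                 % the slope approximates the order alpha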
i  eₙ = |p − pₙ|
1 1.229976737424930e+01
2 7.670248252043633e+00
3 4.607864445396778e+00
4 2.602396109292012e+00
5 1.320034026906669e+00
6 5.473710251344159e-01
7 1.497420289157650e-01
8 1.616422346263446e-02
9 2.214556785341548e-04
10 4.245948570513747e-08
11 1.665334536937735e-15
12 1.110223024625157e-16
Figure 2.10: Rate of convergence plot for the floating sphere problem (d³ − 3d² + 2 = 0); Newton's method using p₀ = −20, tolerance ε = 10⁻¹³. The rate of convergence of a typical fixed-point iteration is shown for comparison.
i.e. α ≈ 2. The value of α can be approximated more precisely by using the numerical values of the e's, for example
α ≈ Δy/Δx = (log e₉ − log e₁₀)/(log e₈ − log e₉) = (−3.6547 + 7.3720)/(−1.7914 + 3.6547) ≈ 1.995
Thus Newton's method converges quadratically (α ≈ 2) close to the solution. Once α has been found, γ can be determined by taking the ratio of eₙ₊₁ and eₙ^α. For example,
γ ≈ e₉/e₈^1.995 ≈ 0.83
This is close to the theoretical value for Newton's method,
γ ≈ |f″(p)|/(2|f′(p)|).
For the problem we consider this gives γ ≈ √3/2 ≈ 0.866, which is indeed close to what we find numerically.
In general a fixed-point iteration converges linearly, α ≈ 1. This can also be observed from Fig. 2.10, where the slope is approximately 1, meaning α ≈ 1. As fixed-point iteration we used g(p) = p − (p³ − 3p² + 2)/10 and an initial guess p₀ = −2. For a linearly convergent fixed-point iteration we should find a constant (γ) if we take the ratio of two consecutive errors (eₙ₊₁/eₙ). For the fixed-point iteration we consider, we have γ ≈ eₙ₊₁/eₙ ≈ 0.400. This corresponds to the theoretical value γ ≈ |g′(p)| for a fixed-point iteration g(p) = p. For the problem we consider, |g′(1 − √3)| = 2/5 = 0.4.
Newton's method for the root d = 1 is a special case. See Table 2.5. Convergence is faster than quadratic: better than a doubling of the number of accurate digits per iteration. What is special here is that f″(1) = 6 × 1 − 6 = 0, so the leading error term of Newton's method vanishes and the convergence is of higher (cubic) order.
i  eₙ = |p − pₙ|
1 1.111111111111112e-01
2 9.259259259259967e-04
3 5.292219995567393e-10
4 0.000000000000000e+00
Table 2.5: Iteration number i and absolute difference from the exact solution |p − pₙ| with p = 1; Newton's method using p₀ = 1/2 and tolerance ε = 10⁻⁶.
Chapter 3
Nonlinear systems of algebraic equations
• vector norms
Numerical methods:
3.1 Problem description and modeling: predator-prey models
Consider an environment with 2 species: one of them a predator and the other its prey. We want to know how the populations of the predator and the prey evolve in time. Will one or both species die out, or will they coexist?
In the model we denote by x₁(t) the population of prey at time t and by x₂(t) the population of predators at time t. The basic model to describe changing quantities in time is:
We are interested in whether the predator and/or prey die out or coexist. Thus we are interested in possible equilibrium solutions (i.e. when the populations do not change in size anymore, or dx₁/dt = dx₂/dt = 0):
0 = ax₁ − bx₁x₂,
0 = cx₁x₂ − dx₂.
For the parameter values a = d = 2 and b = c = 1 this becomes
0 = 2x₁ − x₁x₂,
0 = x₁x₂ − 2x₂.
We first discuss some methods to solve systems of equations analytically, before we discuss some numerical techniques (so that we can check whether the numerical methods converge to a correct solution).
3.2 Analytical solutions and solving with Matlab
In this section we discuss three methods to obtain solutions of a 2 × 2 system of equations.
In the next sections we will solve the above problem numerically and see whether numerical
methods converge to one of the two equilibrium solutions found in this section.
3.2.1 Analytical solution
2x₁ − x₁x₂ = x₁(2 − x₂) = 0,
x₁x₂ − 2x₂ = x₂(x₁ − 2) = 0.
The first equation is satisfied for x₁ = 0 or x₂ = 2. For x₁ = 0 we get from the second equation x₂ = 0. For x₂ = 2 we get from the second equation x₁ = 2. Thus we have two equilibrium solutions, (0, 0) and (2, 2).
3.2.2 Plotting
The curves corresponding to the two equations 2x1 − x1 x2 = 0 and x1 x2 − 2x2 = 0 can
be plotted using Matlab’s ezplot
ezplot(’2*x1 - x1*x2=0’)
hold on
ezplot(’x1*x2 - 2*x2=0’)
grid on
setcurve2(’color’,’red’)
where setcurve2 is a small Matlab script to plot the two curves corresponding to the second
equation in red. In Fig. 3.1 the blue curves correspond to solutions of the first equation
and red lines to solutions of the second equation. The intersections of a blue and red
curve near (0, 0) and (2, 2) correspond to the two equilibrium points. You can zoom in
near these points to obtain more accurate values. Note that n × n systems of equations
need n-dimensional plots so that this method is only useful for 2 × 2 systems.
(Figure 3.1: the curves 2x₁ − x₁x₂ = 0 (blue) and x₁x₂ − 2x₂ = 0 (red) in the (x₁, x₂)-plane; the intersections near (0, 0) and (2, 2) are the equilibrium points.)
3.2.3 Symbolical calculations
Solving the two equations symbolically with solve gives
x1 =
0
2
x2 =
0
2
From the first line of solutions for x1 and x2 we get the equilibrium point (0, 0). From the second line of solutions for x1 and x2 we get the equilibrium point (2, 2).
3.2.4 Numerical calculations
You need to select an initial vector p⁽⁰⁾, say ⟨3, 3⟩. To solve the system numerically using fsolve, use
p0 = [3; 3]
p = fsolve('fun_fsolve', p0)
which gives
p=
2.000000141789899e+00
2.000000141789899e+00
This uses a default tolerance of 10⁻⁶. To increase the accuracy, you need to change the option TolFun. To use a tolerance of 10⁻¹⁰, use
options = optimset('TolFun', 1e-10)
p0 = [3; 3]
p = fsolve('fun_fsolve', p0, options)
which gives the more accurate solution
p=
2.000000000000007e+00
2.000000000000007e+00
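The m-file fun_fsolve called above returns the residual vector of the system; a sketch consistent with the predator-prey equations of Sec. 3.1 (the exact file used in class is not reproduced here):
function f = fun_fsolve(x)
% Equilibrium equations of the predator-prey model (a = d = 2, b = c = 1)
f = [2*x(1) - x(1)*x(2);
     x(1)*x(2) - 2*x(2)];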
3.3 Newton’s method for systems of equations
Newton’s method for systems can be used to solve a system of m nonlinear equations
with m unknowns
f(x) = 0,
or in component form
f₁(x₁, x₂, . . . , xₘ) = 0,
f₂(x₁, x₂, . . . , xₘ) = 0,
⋮
fₘ(x₁, x₂, . . . , xₘ) = 0.
• Like in Newton’s method for scalars, in the derivation of Newton’s method for
systems it was assumed that higher order terms are small compared to the linear
terms. This is not true if the initial guess is not close enough to the solution vector,
and the method might not converge then.
• Provided certain conditions on the partial derivatives of the functions fᵢ are satisfied, Newton's method for systems also converges quadratically when the initial vector p⁽⁰⁾ is close enough to the true solution vector p. For hard problems you might need to be very close to the true solution vector before Newton's method converges.
• It does not always converge to a solution (singular Jacobian, initial guess not suf-
ficiently close).
• You only find 1 solution vector, which depends on the initial vector p(0) .
As stopping criteria we can use the norm of the difference between two successive solution vectors, ‖p⁽ᵏ⁾ − p⁽ᵏ⁻¹⁾‖, and the norm of the residual vector, ‖r‖ = ‖f(p⁽ᵏ⁾)‖, using either the l₂ or the l∞ norm. Note that if the l₂ or l∞ norm of a vector is small, then every component of the vector must be small. As for scalar equations, relative stopping criteria can be used to increase robustness.
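A minimal sketch of Newton's method for the predator-prey system (the residual and Jacobian follow from Sec. 3.1; the tolerances and the maximum number of iterations are illustrative):
p = [3; 3];                   % initial vector p^(0)
for k = 1:100
    f = [2*p(1) - p(1)*p(2);  % residual vector f(p)
         p(1)*p(2) - 2*p(2)];
    J = [2 - p(2), -p(1);     % Jacobian matrix of f at p
         p(2),      p(1) - 2];
    dp = -J \ f;              % solve the linear system J*dp = -f
    p = p + dp;
    if norm(dp, inf) < 1e-13 && norm(f, inf) < 1e-13
        break
    end
end
p                             % converges to the equilibrium (2, 2)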
3.3.3 Example: predator-prey equations
We solve the algebraic predator-prey system of equations discussed in Sec. 3.1 using
Newton’s method. As initial vector we take p(0) = h3, 3i and as tolerance 10−13 . We
see from see Table 3.1 that Newton’s method convergences to the analytical solution
p = hp1 , p2 i = h3, 3i in only 6 iterations.
Table 3.1: Iteration number k, approximations p(k) , and l2 and l∞ error norms; Newton’s
method using p(0) = h3, 3i and tolerance = 10−13 .
We note from Table 3.1 that the l₂ error is always slightly larger than the l∞ error. It is easy to see why this must always be true. Consider a vector x with m components; then
‖x‖∞ = max_{1≤i≤m} |xᵢ| = max_{1≤i≤m} √(xᵢ²) = √( max_{1≤i≤m} xᵢ² ) ≤ √( Σᵢ₌₁ᵐ xᵢ² ) = ‖x‖₂
The l∞ norm only considers the maximum xᵢ², while for the l₂ norm some nonnegative numbers are added to this value before the square root is taken. The convergence behavior of both norms, however, is very similar. Both give approximately a doubling of the number of accurate digits each iteration. This indicates quadratic convergence, like for Newton's method for a single equation (see Sec. 2.8). Typically, the l₂ and l∞ norms give very similar results for small systems of equations. For large m × m systems you need to be a little more careful when you do finite precision calculations. If all components of the error vector have the same error, say machine accuracy e = 10⁻¹⁶, then the error in the l₂ norm is
‖e‖₂ = √( Σᵢ₌₁ᵐ eᵢ² ) = √(m e²) = √m |e|
Thus if m is sufficiently large, you need to increase the tolerance for l₂ norms accordingly, i.e. to at least √m × 10⁻¹⁵, in order to be able to satisfy the stopping criteria.
3.3.4 Choice of initial vector
Finding a good initial guess for a system of equations is a little more complicated than
for a single equation. First, we cannot generalize the bisection method to systems since
we do not have a point with opposite function value on each side of the root in two or
more dimensional space. We only discuss some simple options.
1. Plotting might give you an initial vector for 2 × 2 systems, but cannot be used for
m × m systems.
2. If the nonlinear terms are relatively small, so that the solution is dominated by the linear terms, one could first solve the linear system Ax = b and use the solution vector of the linear system as initial guess for Newton's method for systems.
3. If nonlinear terms are not small, one could first try to solve an easier problem or
a sequence of easier problems (continuation). For example, for our predator-prey
equations, we could first try to find a solution for the problem with b = c = 0.1.
Then we could use the solution corresponding to b = c = 0.1 as initial vector for the
problem with b = c = 0.5. The solution corresponding to b = c = 0.5 could then be
used as initial vector for the problem you are interested in (b = c = 1). The harder
the problem, the more intermediate solutions you might need to generate to find an
appropriate initial vector for the problem you are really interested in.
4. Often Newton’s method for systems is part of a larger calculation which provides an
initial guess in a natural way. For example, for partial differential equations (PDEs)
involving a dependence on time and space, you typically generate a solution at the
next time level n + 1 for every coordinate position from the solution at the current
time level n. Since time steps ∆t = tn+1 − tn should be taken small for reasons
of accuracy, the solution at time level n is usually a good approximation to start
Newton’s method to obtain the solution at tn+1 . We discuss this further when we
discuss PDEs.
3.4 Solving linear systems for population models
The solution of a linear system Ax = b can be found using a direct or iterative method.
Direct methods are methods that compute the solution x of Ax = b in a fixed number
of operations that can be determined a priori. Provided that the matrix is nonsingular,
a unique solution will be found. (At least when calculations can be done exactly. When
finite precision arithmetic is used, nearly singular matrices will also give problems.) Iterative methods for Ax = b are methods that give an approximation of the solution vector x by generating a sequence of vectors x⁽ᵏ⁾ that (we hope) converge to the true solution. The number of operations cannot be determined a priori and convergence is not guaranteed. As for all iterative methods, an initial guess to the solution needs to be provided.
Typical advantages and disadvantages of direct methods
• The solution is subject to round off errors only.
• The number of operations might be very large for large systems of equations, so
that it takes a long time to solve a linear system.
• For large systems, you need a lot of memory on your computer to store the matrix
A (your computer may run out of memory just for storage of a large matrix).
Typical advantages and disadvantages of iterative methods
• Relatively few operations per iteration and thus fast for a single iteration.
• Additional errors since an iterative method only gives an approximation to the
solution.
• Cheap in memory. The matrix A need not be stored, only the vectors that result from matrix-vector products like Ax⁽ᵏ⁾.
• Convergence might be slow (resulting in a large number of iterations and large
computing time) or even impossible.
Of course, one may try to improve on basic direct and iterative methods to alleviate the
typical shortcomings. This is outside the scope of Math 4414. We only discuss methods
that are most useful for the problems we consider. For population models, the size of the
system of equations is typically relatively small and solving Ax = b using a direct method
can still be done in an efficient manner. Iterative methods will be discussed in some more
detail when they are more useful: for numerical solutions of differential equations.
Accuracy: A direct method is only affected by round-off errors due to finite precision. Iterative methods have additional errors, and might not produce an accurate solution at all if they do not converge. Thus for reasons of accuracy a direct method is preferred.
An example of a direct method is computing the inverse A⁻¹ of the matrix A and then performing the matrix-vector multiplication x = A⁻¹b. For a 2 × 2 system we have a formula for the inverse available:
A = [a b; c d],   A⁻¹ = 1/(ad − bc) × [d −b; −c a]
To compute A⁻¹ exactly we need det A = ad − bc ≠ 0. With finite precision arithmetic, we also expect large errors when det A = ad − bc is very small: then we subtract two nearly equal numbers ad and bc and next divide by this inaccurate small number. Other techniques, however, will have similar difficulties when the matrix is nearly singular (ill-conditioned).
Computing time: This is not a real issue for a 2 × 2 system, only a few operations
(multiplications and additions) are involved and computing x = A−1 b will be fast.
Memory usage: For 2 × 2 systems, however, this is not a real issue. We only need
to store 4 numbers.
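A sketch of this direct method for a small example system (the same 2 × 2 system is used in the next section):
A = [2 1; 1 2]; b = [1; 1];
detA = A(1,1)*A(2,2) - A(1,2)*A(2,1);            % ad - bc
Ainv = [A(2,2) -A(1,2); -A(2,1) A(1,1)] / detA;  % 2 x 2 inverse formula
x = Ainv * b                                     % gives [1/3; 1/3]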
We illustrate Gaussian elimination with backward substitution for a simple 2 × 2 system:
[2 1; 1 2][x₁; x₂] = [1; 1]
Using Gaussian elimination on the augmented matrix,
[2 1 | 1]                                   [2 1   | 1]
[1 2 | 1]   E₂ := E₂ − (1/2) × E₁   =>      [0 3/2 | 1/2]
Solving with backward substitution gives, from the second row, x₂ = (1/2)/(3/2) = 1/3, and by substituting this into the first row,
x₁ = (1 − x₂)/2 = 1/3.
The Gaussian elimination described above fails when you want to make zeros below a diagonal entry aᵢᵢ (the pivot) and the value of aᵢᵢ is exactly zero. Then we need to interchange row i (including the corresponding entry of the vector b) with a row below row i that has a non-zero value in column i. This is called pivoting.
Implications of finite precision: pivoting
In finite precision calculations, very small pivots can also create large errors. Consider as example the system
[0.001 1; −1 4][x₁; x₂] = [1; 9]
which has the solution x₁ ≈ −4.9800797 and x₂ ≈ 1.0049801. If we try to solve this with the Gaussian elimination/backward substitution algorithm in three-digit arithmetic with rounding, we get
[1.00e-3 1.00 | 1.00]                                    [1.00e-3 1.00   | 1.00]
[-1.00   4.00 | 9.00]   E₂ := E₂ + 1.00e3 × E₁   =>      [0       1.00e3 | 1.01e3]
Solving with backward substitution gives x₂ = 1.01 and
x₁ = (1.00 − 1.00 × 1.01)/1.00e-3 = -1.00e1.
We see that x₁ is not at all close to the true solution. The reason lies in the backward substitution: to determine x₁ we subtract nearly equal numbers and divide by a small pivot. Here, small errors in the numerator are amplified by a factor 1000 because of the small pivot 1.00e-3.
If we first interchange the rows, we can obtain an accurate solution:
[-1.00   4.00 | 9.00]                                    [-1.00 4.00 | 9.00]
[1.00e-3 1.00 | 1.00]   E₂ := E₂ + 1.00e-3 × E₁   =>     [0     1.00 | 1.01]
which gives, using backward substitution, x₂ = 1.01 and x₁ = -4.96, which has 2 accurate digits.
Thus, to increase accuracy in finite precision calculations, we need to perform pivoting not only when the pivot element is zero, but also when it is "small". The safest pivoting technique is complete pivoting: both row and column interchanges are performed to find the largest pivot element. Of course, the comparisons and the interchanging of rows and columns make the Gaussian elimination more time consuming (O(m³) more operations are required).
In 16-digit arithmetic the problem is less serious, but it is not eliminated. Also keep in mind that matrices are often much larger than 2 × 2. In a large system many more operations need to be performed and errors accumulate. Even with the best pivoting technique a significant number of digits may be lost when solving Ax = b. The larger the linear system, the more computations are needed and the more inaccurate digits can be expected.
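A minimal sketch of Gaussian elimination with partial (row) pivoting and backward substitution (illustrative and unoptimized; Matlab's backslash operator below is the safe choice in practice):
function x = gausselim(A, b)
% Gaussian elimination with partial pivoting and backward substitution
m = length(b);
for i = 1:m-1
    [~, r] = max(abs(A(i:m, i)));            % largest pivot candidate in column i
    r = r + i - 1;
    A([i r], :) = A([r i], :);               % interchange rows i and r
    b([i r]) = b([r i]);
    for j = i+1:m
        factor = A(j, i)/A(i, i);
        A(j, :) = A(j, :) - factor*A(i, :);  % create a zero below the pivot
        b(j) = b(j) - factor*b(i);
    end
end
x = zeros(m, 1);
for i = m:-1:1                               % backward substitution
    x(i) = (b(i) - A(i, i+1:m)*x(i+1:m))/A(i, i);
end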
Larger systems, like 10×10, can be solved more efficiently using Gaussian elimination with
backward substitution. A safe Gaussian elimination and backward substitution technique
with pivoting is implemented in Matlab through the backslash operator \. For the above
example, you would use
A = [2 1; 1 2]
b = [1; 1]
x = A\b
which gives the numerical solution
x=
3.333333333333332e-01
3.333333333333335e-01
Chapter 4
• Accuracy
• Order of convergence
Numerical methods:
4.1 Problem description and modeling: pollution models
4.1.1 Governing equation
We consider a river which is being polluted by some factories along that river. The pollutant affects the fish population once it reaches a critical concentration level for a sufficiently long period of time. We therefore want to predict the concentration of pollutant along the river.
The basic model to describe changing quantities is a balance for a small volume element:
rate of change = transport in − transport out + production − decay.
The terms on the right-hand side need to be modeled. There are two types of transport into and out of a small volume element: convection and diffusion of the pollutant. We discuss the four terms on the right-hand side separately.
• Diffusion: Diffusion describes how the pollutant spreads over the river in the absence of flow. Diffusion is described mathematically by the divergence of the mass flux φ, i.e. −∇ · φ, where ∇ is the gradient vector. We assume that the mass flux φ of the pollutant is proportional to the concentration gradient, φ = −D∇c (Fick's law), where D is the diffusivity, which we assume constant. This gives a diffusion term ∇ · (D∇c) = D∇ · ∇c.
• Production of pollutant: We assume that the pollutant added to the river depends only on where and when the factories add pollutant to the river. This means the source term r depends on position x and time t only and is independent of the concentration of pollutant in the river.
• Decay of pollutant: We assume that the rate of decay is proportional to the
concentration, i.e. it can be modeled by kc with k a constant decay rate. Note that
the rate of change depends indirectly on position and time through the concentration
c(x, t).
The resulting model is a partial differential equation (PDE) for the concentration of the pollutant, which depends on position x in the river and time t, i.e. c = c(x, t):
\[
\frac{\partial c}{\partial t} = -v\cdot\nabla c + D\,\nabla\cdot\nabla c - kc + r. \tag{4.1}
\]
In case k = r = 0, this equation reduces to the so-called convection-diffusion equation.
Solving Eq. (4.1) for the concentration in three spatial dimensions and time is beyond
the scope of Math 4414. However, we can take advantage of the geometry of the river to
simplify the equation. We consider a narrow and shallow river so that we can consider a
concentration that depends on x (coordinate along the river) and time t only: c = c(x, t).
Flow will then occur in the x direction only, represented by a scalar velocity v. Then the
model simplifies to
\[
\frac{\partial c}{\partial t} = -v\frac{\partial c}{\partial x} + D\frac{\partial^2 c}{\partial x^2} + r - kc. \tag{4.2}
\]
Here the velocity v and pollution rate r may depend on position along the river x and
time t.
To solve Eq. (4.2) it is important to realize how the different terms on the right-hand
side affect the solution. Of course the source term r tends to increase the concentration
and the decay term −kc tends to decrease the concentration, so we only discuss convection
and diffusion in some more detail.
Convection: If ∂c/∂x is negative at some location, there is more pollution just upstream than just downstream. Therefore, the concentration will increase, which is accounted for in Eq. (4.2) by the minus sign. Convection affects the concentration level downstream while leaving the concentration upstream unchanged. Fig. 4.1(a) shows how an initial concentration profile changes in time by convection: the concentration profile at the initial time shifts in the direction of the flow, leaving the shape unaltered.
Diffusion: If there is a difference in concentration, pollutant will be transported from high to low concentration by thermal motion. This diffusion process is independent of the convection and may transport pollutant in both the upstream and downstream direction. Fig. 4.1(b) shows how an initial concentration profile changes in time by diffusion: the initial concentration profile will smooth out.
We are interested in whether the concentration reaches an equilibrium solution when the rate at which pollution is added to the river and the velocity of the water are constant in time, i.e. r = r(x) and v = v(x). Then the concentration does not change in time anymore (∂c/∂t = 0) and the model reduces to a 2nd order ordinary differential equation (ODE) in which the concentration only depends on position, c = c(x),
\[
v\frac{dc}{dx} - D\frac{d^2c}{dx^2} + kc = r.
\]
[Figure 4.1: Effect of the transport terms on c(x, t): (a) pure convection (profiles at t = 0, 0.5, 1, 3), and (b) pure diffusion (profiles at t = 0, 0.01, 0.05, 0.1, 0.5).]
4.1.3 Example
For simplicity, we assume we know the concentration at the endpoints:
\[
\frac{d^2c}{dx^2} - 8\frac{dc}{dx} - 9c = -17 - 9x, \qquad -1 < x < 1,
\]
\[
c(x = -1) = e^{-18}, \qquad c(x = 1) = 3.
\]
4.2 Analytical solutions and solving with Matlab
In this section we discuss two methods to obtain the analytical solution of a boundary
value problem. In the next sections we will solve BVPs numerically and see whether the
solution obtained by a numerical method converges to the analytical solution found in
this section.
The homogeneous equation c'' − 8c' − 9c = 0 has the characteristic equation
\[
\lambda^2 - 8\lambda - 9 = 0,
\]
which has roots λ_1 = −1 and λ_2 = 9. This gives the solutions e^{−x} and e^{9x}, and the homogeneous solution is a linear combination of these, c_h(x) = C_1 e^{−x} + C_2 e^{9x}.
The particular solution can be found using the method of undetermined coefficients. Here one tries a polynomial of the same order as −17 − 9x, i.e. c_p(x) = Ax + B, and finds A and B by substitution. This gives −8A − 9B = −17 and −9A = −9, which has the solution A = 1 and B = 1. Thus the particular solution equals c_p(x) = 1 + x.
This gives for the total concentration c(x) = C_1 e^{−x} + C_2 e^{9x} + 1 + x. The values of C_1 and C_2 are obtained from the boundary conditions c(x = 1) = 3 and c(x = −1) = e^{−18}. This gives C_1 = 0 and C_2 = e^{−9}, which gives for the solution of the pollution BVP
\[
c(x) = e^{9x-9} + 1 + x.
\]
To find the unique solution of the boundary value problem with the boundary condi-
tions c(−1) = e−18 and c(1) = 3, use
c = dsolve(’D2c - 8*Dc -9*c= -17 - 9*x’, ’c(-1)=exp(-18)’, ’c(1)=3’, ’x’)
simplify(c)
which will give
c =
(-exp(-9+9*x)+exp(11+9*x)+exp(20)-1+x*exp(20)-x)/(exp(20)-1)
which is the correct solution, written in a different form.
4.3 Solving BVPs numerically: Introduction
4.3.1 Grids
The solution to a boundary value problem is a function c(x) which is defined for every
x. If we use numerical techniques, we find an approximation to the solution at certain
discrete coordinate values xi only. The collection of xi 's is called a (computational) grid.
Before we can solve a BVP numerically, we need to specify the computational grid, i.e.
all grid points xi . A numerical technique will produce approximations to the function c
at these grid points only, i.e. we will obtain approximations for the values c(xi ) which
will be denoted by ci . See Fig. 4.2.
[Figure 4.2: A computational grid a = x_0 < x_1 < · · · < x_n = b with approximations c_i to the solution c(x) at the grid points.]
Often we choose the interval [a, b] to be divided into N equally spaced subintervals of
length h = (b − a)/N , which corresponds to the grid points
xi = a + ih
for i = 0, . . . , N . The length of a subinterval h is called the mesh size. Note that the
nodes are numbered by increasing x coordinate, thus x0 = a, x1 = a + h, . . ., xN = b.
This numbering is called natural numbering. Henceforth, we will only use natural
numbering since it simplifies the derivation of methods.
Example
We solve a BVP numerically on [0, 1] and divide the interval [0, 1] into N = 4 equally
spaced subintervals. The length of a subinterval is h = (1 − 0)/4 = 1/4. We obtain a
numerical approximation to the solution at the N + 1 = 5 discrete coordinate values only:
x0 = 0, x1 = 1/4, x2 = 1/2, x3 = 3/4, and x4 = 1.
Remarks:
1. Typically, the more mesh points, the more accurate the approximation to the solution and the more work to compute the approximation. The goal is to compute an accurate numerical solution with as few grid points as possible, to minimize computing time.
2. Almost always there are certain regions in [a, b] where the solution changes more
rapidly (i.e. where you want more mesh points). An equally spaced mesh is then
not the best choice.
3. An equally spaced mesh is easiest for introducing the numerical techniques and will therefore be used in this chapter.
The main numerical issues when solving BVPs are:
1. Accuracy.
2. Computing time and memory (typically not an important issue for BVPs, only for PDEs in 2 or 3 spatial dimensions).
The four most frequently used techniques that can easily be extended to two or three dimensions are finite differences (FD), finite elements (FE), finite volumes (FV), and spectral methods. We discuss two of these techniques, FD and FE, and apply them to solve BVPs. We focus on how these methods work, how to program them, and typical numerical issues. Finite differences is the easiest technique for understanding the mathematical and numerical concepts. Finite elements are particularly useful when dealing with complex geometries (curved boundaries) in two or three dimensions. We will further discuss this issue in Chap. 7.
4.4 Solving BVPs numerically using Matlab: bvp4c
Solving BVPs numerically with Matlab is a little more complicated than solving algebraic equations. We discuss only the use of bvp4c. To solve a boundary value problem in Matlab, you first need to write the ODE as a system of first order ODEs y' = f(y, x) (see Math 2214), using the solution vector
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}
=
\begin{pmatrix} y \\ y' \\ \vdots \\ y^{(m-1)} \end{pmatrix}
\]
where y^{(m-1)} is the (m − 1)th derivative of y with respect to x. For the pollution BVP one would introduce
\[
\begin{pmatrix} c_1 \\ c_2 \end{pmatrix}
=
\begin{pmatrix} c \\ c' \end{pmatrix}
\]
Taking the derivative of this vector and using the pollution ODE gives
\[
c' = \begin{pmatrix} c_1' \\ c_2' \end{pmatrix}
= \begin{pmatrix} c' \\ c'' \end{pmatrix}
= \begin{pmatrix} c_2 \\ 9c_1 + 8c_2 - 17 - 9x \end{pmatrix}
= f(c, x)
\]
For the pollution BVP written as a system of first order ODEs, one would use in the Matlab Command Window (in addition you need to write 3 small m-files, as discussed below)
x = -1:0.1:1;
solinit = bvpinit(x, 'init_bvp');
options = bvpset('AbsTol', 1e-8);
sol = bvp4c('func_bvp', 'bc_bvp', solinit, options);
The first line specifies the initial grid points (Matlab might add some grid points if it
thinks it is necessary to obtain a solution that is sufficiently accurate).
The second line creates a solution structure solinit which contains in solinit.x the initial mesh and in solinit.y the initial approximation specified in the user-written m-file named here init_bvp.m. If we use a linear initial guess between c(−1) = e^{−18} ≈ 0 and c(1) = 3, we get y_init(x) = (3 + 3x)/2. We need to specify the initial guess for the system of first order ODEs, so we also need to specify the derivative y'_init(x) = 3/2. The initial guess needs to be specified as a column vector.
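A minimal sketch of what this initialization m-file could look like (the file name init_bvp.m follows the call above; the exact body is an assumption):

function yinit = init_bvp(x)
% Linear initial guess between c(-1) = exp(-18) (approximately 0) and c(1) = 3,
% specified as a column vector [y; y'].
yinit = [(3 + 3*x)/2;  % y_init(x)
         3/2];         % y_init'(x)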
The third line specifies options for BVPs that are different from the defaults used to
solve a BVP. Here we specify an absolute tolerance for the residual AbsTol = 1e-8, where
the default is 1e-6. The tolerance is satisfied at every grid point in the mesh.
The fourth line solves the BVP. In addition to the initial solution structure solinit,
it needs two user-written m-files which we named here func bvp and bc bvp. The fourth
parameter options is optional and can be omitted if default options are used.
The right-hand-side vector f should be specified in func_bvp as a column vector, where the array y contains the solution vector (y(1) = y and y(2) = y') at the grid point x (a single value, not all grid points). The boundary conditions should be specified in bc_bvp, written in residual form . . . = 0, i.e. y(x = −1) − e^{−18} = 0 and y(x = 1) − 3 = 0. The residual needs to be specified as a column vector, where the array ya contains the solution vector at the left endpoint (ya(1) = y(x = a) and ya(2) = y'(x = a)) and the array yb contains the solution vector at the right endpoint (yb(1) = y(x = b) and yb(2) = y'(x = b)).
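A sketch of what these two m-files could look like for the pollution BVP (the bodies are assumptions consistent with the description above):

function f = func_bvp(x, y)
% Right-hand side of the first order system: y(1) = c, y(2) = c'
f = [y(2);
     9*y(1) + 8*y(2) - 17 - 9*x];

function res = bc_bvp(ya, yb)
% Boundary conditions in residual form
res = [ya(1) - exp(-18);  % c(-1) - e^(-18) = 0
       yb(1) - 3];        % c(1) - 3 = 0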
The function bvp4c produces as output a data structure which we named sol. The data
structure sol has two members sol.x and sol.y. The grid points at which the solution is
approximated are stored in sol.x (These are typically not the same grid points as the initial
mesh you specified, but might be refined to satisfy the default tolerances or tolerances
specified in options). The solution vector y at each grid point is stored in the two-
dimensional array sol.y. Row 1 contains y1 = y at all grid points, row 2 contains y2 = y 0
at all grid points, etc.
A specific column or row in a two-dimensional array y can be selected using colon
notation. For example, the first row can be selected by using y(1,:) (meaning row 1, all
columns) and the second column by using y(:,2) (meaning all rows, column 2).
After the calculation with bvp4c we only have some (long) arrays with numbers, from which it is difficult to obtain insight into what the model predicts exactly. For this we need to plot the solution. The approximation to the solution is in the first row of sol.y and can be selected using colon notation, sol.y(1,:). The following plots the numerical solution from bvp4c together with the exact solution.
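A sketch of plotting commands consistent with Fig. 4.3 (the styling choices are assumptions):

plot(sol.x, sol.y(1,:), 'o')           % numerical solution from bvp4c
hold on
ezplot('exp(9*x-9) + 1 + x', [-1, 1])  % exact solution c(x) = e^(9x-9) + 1 + x
xlabel('x'), ylabel('c')
legend('bvp4c', 'exact solution')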
[Figure 4.3: Numerical solution using bvp4c and exact solution using ezplot for the pollution BVP.]
4.5 Finite differences
In this section, we discuss the discretization in space of a second-order BVP using the
finite difference method. Finite differences is conceptually the easiest, but not necessarily
the best method. In section 4.7 we discuss an alternative method: finite elements.
We only consider equally spaced grid points and use natural numbering. This simplifies
the derivation.
Difference formulas are derived from the Taylor expansion of f around x_i, truncated after the (Δx)^n term. Here O((Δx)^{n+1}) means that all terms we neglect are C(Δx)^{n+1} (with C a constant) or of higher order. We consider three cases that frequently occur.
Rewriting gives the forward difference formula for the 1st derivative,
\[
f'(x_i) = \frac{1}{h}\left[f(x_i + h) - f(x_i)\right] + O(h).
\]
Alternatively, we can write the forward difference a little shorter, using the notation introduced in Fig. 4.2,
\[
f'(x_i) = \frac{1}{h}\left[f_{i+1} - f_i\right] + O(h).
\]
By taking Δx = −h, i.e. h → −h in the forward difference formula, the backward difference formula can be obtained:
\[
f'(x_i) = \frac{1}{-h}\left[f(x_i - h) - f(x_i)\right] + O(h) = \frac{1}{h}\left[f(x_i) - f(x_i - h)\right] + O(h) = \frac{1}{h}\left[f_i - f_{i-1}\right] + O(h).
\]
Both formulas are equally accurate, O(h), so if you have a choice it doesn't matter which one you use. However, at the first node (no previous node) the backward difference formula cannot be used, and at the last node (no next node) the forward difference formula cannot be used.
A more accurate formula can be derived by determining coefficients α_1, α_2, α_3 such that
\[
f'(0) = \alpha_1 f(0 - h) + \alpha_2 f(0) + \alpha_3 f(0 + h)
\]
is exact when we use for the function f the polynomials P_0(x) = 1, P_1(x) = x, and P_2(x) = x² (note that for these functions you also know f'(0)). This gives three equations in three unknowns,
\[
\alpha_1 + \alpha_2 + \alpha_3 = 0, \qquad -h\alpha_1 + h\alpha_3 = 1, \qquad h^2\alpha_1 + h^2\alpha_3 = 0.
\]
The third equation gives α_1 = −α_3. Substituting in the second gives α_3 = 1/(2h) and α_1 = −1/(2h). The first equation then gives α_2 = 0.
Thus we found the central finite difference formula
\[
f'(x_i) = \frac{1}{2h}\left[f(x_i + h) - f(x_i - h)\right] + O(h^2).
\]
Alternatively, we could have solved the linear system with Matlab:
sol = solve('a1+a2+a3 = 0', '-h*a1+h*a3=1', 'h^2*a1+h^2*a3=0', 'a1', 'a2', 'a3')
This produces a structure sol. The values of α_1, α_2, and α_3 can be found by typing sol.a1, sol.a2, and sol.a3 in the command window.
Similarly, we require that
\[
f''(0) = \alpha_1 f(0 - h) + \alpha_2 f(0) + \alpha_3 f(0 + h)
\]
is exact when we use for the function f the polynomials P_0(x) = 1, P_1(x) = x, and P_2(x) = x². This gives three equations in three unknowns,
\[
\alpha_1 + \alpha_2 + \alpha_3 = 0, \qquad -h\alpha_1 + h\alpha_3 = 0, \qquad h^2\alpha_1 + h^2\alpha_3 = 2.
\]
The second equation gives α_1 = α_3. Substituting in the third gives α_3 = 1/h² and α_1 = 1/h². The first equation then gives α_2 = −2/h². Thus we obtained a central finite difference formula for the second derivative,
\[
f''(x_i) = \frac{1}{h^2}\left[f(x_i - h) - 2f(x_i) + f(x_i + h)\right] + O(h^2).
\]
As an example, we solve the BVP
\[
-\frac{d^2c}{dx^2} + c = x
\]
with boundary conditions c(x = 0) = 0 and c(x = 1) = 3, using n = 4 subintervals of length h = 0.25.
We make a grid in the x-direction using natural numbering and denote the numerical approximation of c at x = x_j by c_j for j = 0, . . . , n = 4.
At every point x_j in the interior (0, 1), the differential equation holds. The second derivative is approximated by a difference formula; here we use the O(h²) approximation for c''. The concentration itself is approximated by the nodal value c_j at x = x_j. The value of x at node j is known: x_j. We obtain 3 finite difference equations, for j = 1, 2, 3 (for which x_j lies in (0, 1)):
\[
\begin{aligned}
j=1:&\quad -\frac{c_0 - 2c_1 + c_2}{h^2} + c_1 = x_1,\\
j=2:&\quad -\frac{c_1 - 2c_2 + c_3}{h^2} + c_2 = x_2,\\
j=3:&\quad -\frac{c_2 - 2c_3 + c_4}{h^2} + c_3 = x_3,
\end{aligned}
\]
with h = 1/4 here. At the endpoints x0 = 0 and x4 = 1 we have the boundary condition
instead of the ODE. Since c0 is the approximation of c at x = 0, we have c0 = 0. At the
last point x4 = 1 we have the boundary condition: c4 = 3.
All together we have 5 linear equations with 5 unknowns (including the boundary
conditions) which can be written in matrix form
\[
\begin{pmatrix}
1 & 0 & 0 & 0 & 0\\
-1/h^2 & 1 + 2/h^2 & -1/h^2 & 0 & 0\\
0 & -1/h^2 & 1 + 2/h^2 & -1/h^2 & 0\\
0 & 0 & -1/h^2 & 1 + 2/h^2 & -1/h^2\\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix}
=
\begin{pmatrix} 0 \\ x_1 \\ x_2 \\ x_3 \\ 3 \end{pmatrix},
\]
or, using the known values for h = 1/4 and the grid points x_j = j/4,
\[
\begin{pmatrix}
1 & 0 & 0 & 0 & 0\\
-16 & 33 & -16 & 0 & 0\\
0 & -16 & 33 & -16 & 0\\
0 & 0 & -16 & 33 & -16\\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 1/4 \\ 1/2 \\ 3/4 \\ 3 \end{pmatrix}.
\]
The matrix and right-hand side are constant, so we have reduced the problem to a linear algebra problem: solving Ax = b. Once we have set up the linear system, we can use any appropriate solution method to find the values c_j, for example as in the sketch below.
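As a quick check, a minimal Matlab sketch that sets up this 5 × 5 system and solves it with the backslash operator (the variable names are assumptions):

n = 4; h = 1/n;
x = (0:n)'*h;                      % grid points x_j = j*h
A = zeros(n+1); f = zeros(n+1,1);
for j = 2:n                        % rows for the interior nodes x_1, ..., x_{n-1}
    A(j, j-1:j+1) = [-1/h^2, 1 + 2/h^2, -1/h^2];
    f(j) = x(j);
end
A(1,1) = 1;      f(1) = 0;         % boundary condition c(0) = 0
A(n+1,n+1) = 1;  f(n+1) = 3;       % boundary condition c(1) = 3
c = A \ f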
We now apply the same procedure to the pollution BVP,
\[
-\frac{d^2c}{dx^2} + 8\frac{dc}{dx} + 9c = 17 + 9x, \qquad -1 < x < 1,
\]
\[
c(x = -1) = e^{-18}, \qquad c(x = 1) = 3.
\]
We use the O(h²) central finite difference formulas derived above to approximate the derivatives at x_j, i.e. the central formula for c' and the central formula for c''. The concentration itself at x = x_j gives c_j. For each internal node j = 1, . . . , n − 1 we then have
\[
-\frac{c_{j-1} - 2c_j + c_{j+1}}{h^2} + 8\frac{c_{j+1} - c_{j-1}}{2h} + 9c_j = 17 + 9x_j.
\]
The boundary conditions are c_0 = e^{-18} and c_n = 3.
We can write the equations in matrix form, Ac = f, where:
• In the first row we put the equation corresponding to the grid point x_0. This is the boundary condition c_0 = e^{−18}. Thus we need a value of 1 in the first column, corresponding to c_0, and zeros in the other columns, since the other c_j's are not involved in the Dirichlet boundary condition. In f_0 we need the value of the boundary condition, e^{−18}.
• The next n − 1 rows (rows 2, . . . , n) correspond to the internal grid points x_1, . . . , x_{n−1}, at which the (discretized) ODE needs to be satisfied: the second row corresponds to x_1, the third row to x_2, etc., until row n. Thus the rows are ordered in the same way as the grid points. For each j = 1, . . . , n − 1, we have
\[
\left(-\frac{1}{h^2} - \frac{4}{h}\right)c_{j-1} + \left(9 + \frac{2}{h^2}\right)c_j + \left(-\frac{1}{h^2} + \frac{4}{h}\right)c_{j+1} = 17 + 9x_j.
\]
• In the last row, n + 1, we put the equation corresponding to x_n. This is the boundary condition c_n = 3. Thus we need a value of 1 in the last column, n + 1, corresponding to c_n, and zeros in the other columns, since the other c_j's are not involved in the Dirichlet boundary condition. In f_n we need the value of the boundary condition, 3.
where the dots · · · denote a continuation of the same value along a diagonal, for example 9 + 2/h² on the main diagonal.
Once the matrix A and right-hand side vector f have been constructed, any appro-
priate method to solve Ax = b can be used. An efficient method will be discussed in
Sec. 4.9.
1. Setup of Ax = b
A lot of things can vary when you use the finite difference method to solve a second order ODE. In general form, a second order ODE reads
\[
-\frac{d^2c}{dx^2} + p(x)\frac{dc}{dx} + q(x)c = r(x) \tag{4.3}
\]
where p(x), q(x), and r(x) are functions of x that are different from one problem to the next. In addition, the type and values of the boundary conditions may vary, and you may want to change the order of the approximation of a derivative.
2. Solving Ax = b
Depending on the size of your matrix you may want to choose a different method to solve Ax = b. To economize on memory, you might want to store only the non-zero components of your matrix (in the examples above, just three diagonals instead of the full (n + 1) × (n + 1) matrix with all the zeros).
A possible program structure is a main function, say finitedif, that calls
– a function fdmatvec that sets up the finite difference part of the matrix A and vector f. This function calls
∗ a function ode_param that computes the values of p(x), q(x), and r(x).
– a function bcmatvec that sets up the boundary condition part of the matrix A and vector f. This function calls
∗ a function bc_param that specifies the type (Dirichlet, Neumann) and values (C_D, C_N) of the boundary conditions.
– a function that solves Ax = b.
In the main file it needs to be decided (input) which order of approximation for the derivatives is used, how the matrix is stored, and how Ax = b is solved.
In fdmatvec the part of the matrix A and of f corresponding to the finite difference equations is set up. Since you have the same type of expression for every internal grid point, you can use a for-loop for this. Here you may want to distinguish different orders of approximation using if-statements. These if-statements should be kept outside of for-loops if possible, for reasons of efficiency. The function ode_param is called for every grid point and evaluates the functions p(x), q(x), and r(x) at that grid point.
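A minimal sketch of ode_param for the pollution BVP (the name follows the structure above; the body is an assumption):

function [p, q, r] = ode_param(x)
% Coefficients of -c'' + p(x)*c' + q(x)*c = r(x), Eq. (4.3), for the pollution BVP
p = 8;
q = 9;
r = 17 + 9*x;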
Boundary conditions would typically be incorporated in a separate function bcmatvec, since this is independent of how you handle internal grid points. To keep this function general, you would use another function to specify the type and value (say C_a for the left and C_b for the right endpoint) of the boundary condition used at each endpoint. Different types of boundary conditions can be distinguished using if-statements. The parameters C_a and C_b are used in the matrix A and right-hand side f to keep these expressions general.
In the main function finitedif different solvers would be distinguished through if-statements. Since different solvers do not have much in common, and for reasons of efficiency, different methods to solve Ax = b would be placed in different functions. In Matlab you could use, for example, the \ operator.
Once your program is working, you only need to change the simple functions ode_param and bc_param to solve a different BVP. The more complicated functions fdmatvec and bcmatvec, and any functions to solve Ax = b, can be left unchanged.
4.6 Eliminating boundary conditions
There are two reasons to eliminate boundary conditions from a linear system Ac = f .
First, you reduce the number of unknowns, so you need to solve a smaller system
which is faster. For a BVP this is not a very important issue since the boundary consists
only of two grid points (the end points). However, when we solve PDEs on 2D regions the
boundary consists of curves and on 3D regions the boundary consists of surfaces. Both
typically have a lot of grid points.
Second, a certain pattern in the matrix might be distorted by the boundary conditions.
For example, symmetry or a tridiagonal structure of the matrix may be lost due to the
boundary conditions. These are important properties of the matrix when you solve the
linear system. Some efficient techniques to solve Ax = b require a symmetric matrix,
for example the conjugate gradient method. A tridiagonal system can be solved very
efficiently using a direct method (Crout’s method, see Sec. 4.9).
4.6.1 Example
In the FD example in Sec. 4.5.4,
\[
\begin{aligned}
j=0:&\quad c_0 = 0,\\
j=1:&\quad -\frac{c_0 - 2c_1 + c_2}{h^2} + c_1 = x_1,\\
j=2:&\quad -\frac{c_1 - 2c_2 + c_3}{h^2} + c_2 = x_2,\\
j=3:&\quad -\frac{c_2 - 2c_3 + c_4}{h^2} + c_3 = x_3,\\
j=4:&\quad c_4 = 3,
\end{aligned}
\]
c0 = 0 and c4 = 3 are known and do not need to be solved for. The unknowns c0 and
c4 can be eliminated from the above set of equations (bring the c0 and c4 terms in the
equations for the internal points to the right-hand side and substitute the values). Only
the equations for j = 1 and j = 3 contain c0 and c4 and are affected,
\[
\begin{aligned}
j=1:&\quad -\frac{-2c_1 + c_2}{h^2} + c_1 = x_1 + \frac{c_0}{h^2} = x_1,\\
j=2:&\quad -\frac{c_1 - 2c_2 + c_3}{h^2} + c_2 = x_2,\\
j=3:&\quad -\frac{c_2 - 2c_3}{h^2} + c_3 = x_3 + \frac{c_4}{h^2} = x_3 + \frac{3}{h^2}.
\end{aligned}
\]
In matrix-vector form this becomes
\[
\begin{pmatrix}
1 + 2/h^2 & -1/h^2 & 0\\
-1/h^2 & 1 + 2/h^2 & -1/h^2\\
0 & -1/h^2 & 1 + 2/h^2
\end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 + 3/h^2 \end{pmatrix},
\]
or, using h = 0.25, x_1 = 0.25, x_2 = 0.5, and x_3 = 0.75:
\[
\begin{pmatrix}
33 & -16 & 0\\
-16 & 33 & -16\\
0 & -16 & 33
\end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix}
=
\begin{pmatrix} 0.25 \\ 0.5 \\ 48.75 \end{pmatrix}.
\]
Remarks
• By eliminating the boundary conditions from the linear system, we have created a symmetric matrix. Solvers that require a symmetric matrix can now be used, and storing the matrix only requires storing two diagonals (the third diagonal is then known because of the symmetry).
Similarly, for the pollution BVP the finite difference equations, including the boundary conditions, read
\[
\begin{aligned}
&c_0 = e^{-18},\\
&-\frac{c_{j-1} - 2c_j + c_{j+1}}{h^2} + 8\frac{c_{j+1} - c_{j-1}}{2h} + 9c_j = 17 + 9x_j, \qquad j = 1, \ldots, n-1,\\
&c_n = 3.
\end{aligned}
\]
However, c_0 and c_n are known from the boundary conditions and do not need to be solved for. We can eliminate them from the equations to obtain only n − 1 equations with n − 1 unknowns. Thus we eliminate the first and last row and the first and last column by eliminating c_0 and c_n from the equations j = 1, . . . , n − 1 using the two rows corresponding to the boundary conditions. c_0 only appears in the finite difference formula for j = 1 and c_n only in the finite difference formula for j = n − 1. After substituting the values for c_0 and c_n and bringing the known terms to the right-hand side, we obtain a new equation for j = 1:
\[
-\frac{-2c_1 + c_2}{h^2} + 8\frac{c_2}{2h} + 9c_1 = 17 + 9x_1 + 8\frac{c_0}{2h} + \frac{c_0}{h^2} = 17 + 9x_1 + \frac{4e^{-18}}{h} + \frac{e^{-18}}{h^2},
\]
and j = n − 1:
\[
-\frac{c_{n-2} - 2c_{n-1}}{h^2} - 8\frac{c_{n-2}}{2h} + 9c_{n-1} = 17 + 9x_{n-1} - 8\frac{c_n}{2h} + \frac{c_n}{h^2} = 17 + 9x_{n-1} - \frac{12}{h} + \frac{3}{h^2}.
\]
The equations for j = 2, . . . , n − 2 remain the same, since they don't contain c_0 and c_n. Thus we now have an (n − 1) × (n − 1) system of equations.
To incorporate the values c_0 and c_n at the correct positions in the array corresponding to the first and last grid point, use
n = length(c) + 1;
c(2:n) = c(1:n-1)
c(1) = exp(-18)
c(n+1) = 3
Remarks
• The size of the matrix-vector equations is reduced by 2, so it can be solved a little
faster.
• The matrix is tridiagonal, which is a result of the natural numbering. Instead of
the full matrix A we could store only the non-zero elements of A (to save memory)
and use an efficient solver for tridiagonal matrices (to save computing time). See
Sec. 4.9.
• The convection term makes the matrix A non-symmetric, so solvers that require a symmetric matrix cannot be used. In the absence of convection, however, the resulting matrix would be symmetric after eliminating the boundary conditions, allowing the use of solvers for symmetric matrices.
4.7 Finite elements
In this section, we discuss the discretization in space of a second-order BVP using the finite
element method (FEM). The method is conceptually much harder than finite differences.
The advantage lies in the easy incorporation of Neumann and Robin type boundary
conditions and mesh refinement. This is particularly useful in two or three dimensions,
especially when there are curved boundaries or steep gradients in the solution. We only
consider equally spaced grids and use natural numbering. This simplifies the derivation.
[Figure: the interval [a, b] divided into N elements e_l = [x_{l−1}, x_l], l = 1, . . . , N, with nodes a = x_0 < x_1 < · · · < x_N = b.]
Basis functions
On the interval [a, b], basis functions φj are defined with the following properties:
• φ_j(x_j) = 1
• φ_j(x_i) = 0 for i ≠ j
• φ_j is continuous and a piecewise polynomial (per element). For linear elements, φ_j is a piecewise linear polynomial; for quadratic elements, φ_j is a piecewise quadratic polynomial; etc.
Fig. 4.5 shows the linear finite element basis functions φ0 , φj for 0 < j < N , and φN .
Outside the sketched region every basis function is exactly equal to zero.
[Figure 4.5: Linear finite element basis functions: φ_0 and φ_N are half hats at the endpoints; φ_j (0 < j < N) is a hat function equal to 1 at x_j and 0 at all other nodes.]
The function c(x) is approximated by a linear combination of the basis functions, c(x) ≈ Σ_j c_j φ_j(x). This results in a piecewise polynomial approximation of the function c(x). Fig. 4.6 shows a linear finite element approximation and how it is constructed using the basis functions and nodal values c_j, j = k − 2, . . . , k + 2.
On an element e_l = [x_{l−1}, x_l], only two basis functions are non-zero: φ_{l−1} and φ_l (see Fig. 4.7). The non-zero contributions to A on this element can be put in an element matrix a^{(l)} and the non-zero contributions to f in an element vector f^{(l)}.
After the element matrices and vectors have been computed, they need to be assembled into (added to) the global matrix A and global vector f.
[Figure 4.6: Finite element approximation. The solid black line is the sum of the colored dashed lines. Red dashed line: c_{k−1}φ_{k−1}; cyan dashed line: c_kφ_k; green dashed line: c_{k+1}φ_{k+1}; blue dashed line: right part of c_{k−2}φ_{k−2}; magenta dashed line: left part of c_{k+2}φ_{k+2}.]
[Figure 4.7: The two non-zero basis functions φ_{l−1} and φ_l on element e_l = [x_{l−1}, x_l].]
Reference element
To simplify the numerical calculations, elements e_l = [x_{l−1}, x_l] are mapped to the reference element ê = [0, 1] using the transformation x(ξ) = x_{l−1} + ξh, 0 ≤ ξ ≤ 1.
We define two basis functions on the reference element (see Fig. 4.8),
\[
\hat{\phi}_1(\xi) = 1 - \xi, \qquad \hat{\phi}_2(\xi) = \xi,
\]
whose derivatives are dφ̂_1/dξ = −1 and dφ̂_2/dξ = 1. The two non-zero basis functions on element e_l are mapped to the reference basis functions by φ_{l−1}(x(ξ)) = φ̂_1(ξ) and φ_l(x(ξ)) = φ̂_2(ξ).
Derivatives on the x and ξ domains are related via the chain rule. We have φ̂(ξ) = φ(x) with x = x(ξ), thus
\[
\frac{d\hat{\phi}}{d\xi} = \frac{d\phi}{dx}\frac{dx}{d\xi}.
\]
Since dx/dξ = h, we get
\[
\frac{d\phi}{dx} = \frac{1}{h}\frac{d\hat{\phi}}{d\xi}.
\]
[Figure 4.8: Reference element ê = [0, 1] with linear basis functions φ̂_1(ξ) and φ̂_2(ξ).]
The finite element method consists of the following steps:
• Find the weak form: multiply the differential equation by a test function ψ(x), integrate over the whole domain, and apply integration by parts.
• Choose suitable test functions ψ. We will choose the n + 1 basis functions φ_j; this is called the Galerkin finite element method.
• Replace c(x) by its finite element approximation c(x) = \sum_{j=0}^{n} c_j \phi_j(x).
• Evaluate the integrals over each finite element using a reference element (analytically or with some appropriate numerical technique).
• Assemble the contributions of each element into the matrix A and right-hand-side vector f.
As a first example we consider
\[
-c'' + c = 1, \qquad 0 < x < 1,
\]
with boundary conditions c(0) = 1 and c'(1) = 0, using n = 4 elements.
We note for further reference that the term φ_i(x)c'|_0^1 = φ_i(x = 1)c'(x = 1) − φ_i(x = 0)c'(x = 0) doesn't affect the equations for i = 1, . . . , 3. The only non-zero basis function at x = 1 is φ_4 and the only non-zero basis function at x = 0 is φ_0. Thus the equation for i = 0 has an extra term −φ_0(x = 0)c'(x = 0) = −c'(x = 0) and the equation for i = 4 has an extra term φ_4(x = 1)c'(x = 1) = c'(x = 1).
Element matrix and vector
The element matrix a^{(l)} and element vector f^{(l)} contain the contributions of the element integrals. The term φ_i(x)c'|_0^1 is related to the boundary conditions and will be discussed when applying the boundary conditions. On an element e_l = [x_{l−1}, x_l], only two basis functions are non-zero: φ_{l−1} and φ_l. Thus only the equations for i = l − 1 and i = l give non-zero contributions on element e_l in Eq. (4.7.2). Similarly, only the j = l − 1 and j = l terms in the finite element approximation \sum_{j=0}^{4} c_j\phi_j give a non-zero contribution to these equations. The non-zero integrals can be put into a local matrix a^{(l)}, acting on the local unknowns,
\[
\begin{bmatrix} a_{11}^{(l)} & a_{12}^{(l)} \\ a_{21}^{(l)} & a_{22}^{(l)} \end{bmatrix}
\begin{pmatrix} c_{l-1} \\ c_l \end{pmatrix}
\]
with
\[
\begin{aligned}
a_{11}^{(l)} &= \int_{e_l} \phi_{l-1}'\phi_{l-1}' + \phi_{l-1}\phi_{l-1}\, dx, &\qquad
a_{12}^{(l)} &= \int_{e_l} \phi_{l-1}'\phi_l' + \phi_{l-1}\phi_l\, dx,\\
a_{21}^{(l)} &= \int_{e_l} \phi_l'\phi_{l-1}' + \phi_l\phi_{l-1}\, dx, &\qquad
a_{22}^{(l)} &= \int_{e_l} \phi_l'\phi_l' + \phi_l\phi_l\, dx.
\end{aligned}
\]
For the element vector we get
\[
f_1^{(l)} = h\int_0^1 (1 - \xi)\, d\xi = h/2, \qquad
f_2^{(l)} = h\int_0^1 \xi\, d\xi = h/2.
\]
Note that we have the same element matrix and element vector for every element in this
case (equally spaced grid points and constant coefficient ODE).
Assembling
The element matrices and vectors need to be assembled element-by-element into the global matrix A and right-hand-side vector f. We start from the zero matrix and zero vector. We first add the contributions of the first element, i = 0 and i = 1, corresponding to rows 1 and 2:
\[
A := \begin{pmatrix}
1/h + h/3 & -1/h + h/6 & 0 & 0 & 0\\
-1/h + h/6 & 1/h + h/3 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad
f := \begin{pmatrix} h/2 \\ h/2 \\ 0 \\ 0 \\ 0 \end{pmatrix}
\]
Boundary conditions
To incorporate c(0) = 1 we replace the first row by c_0 = 1. To incorporate c'(1) = 0 we add φ_4(x = 1)c'(x = 1) = c'(1) = 0 to the right-hand side of the last row, f_4:
\[
\begin{pmatrix}
1 & 0 & 0 & 0 & 0\\
-1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0 & 0\\
0 & -1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0\\
0 & 0 & -1/h + h/6 & 2/h + 2h/3 & -1/h + h/6\\
0 & 0 & 0 & -1/h + h/6 & 1/h + h/3
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix}
=
\begin{pmatrix} 1 \\ h \\ h \\ h \\ h/2 + 0 \end{pmatrix}
\]
This is a linear system that can be solved with any appropriate numerical technique.
We now apply the finite element method to the pollution problem
\[
-\frac{d^2c}{dx^2} + 8\frac{dc}{dx} + 9c = 17 + 9x, \qquad -1 < x < 1,
\]
\[
c(x = -1) = e^{-18}, \qquad c'(x = 1) = 10.
\]
To point out the different treatment of Dirichlet and Neumann boundary conditions in finite elements, we replaced the boundary condition c(x = 1) = 3 by the equivalent Neumann boundary condition c'(x = 1) = 10 (obtained from the analytical solution).
Weak form
The weak form is obtained by multiplying the ODE by suitable test functions ψ and integrating over the whole domain, here [−1, 1]:
\[
\int_{-1}^{1} \psi\left[-c'' + 8c' + 9c\right] dx = \int_{-1}^{1} \psi\left[17 + 9x\right] dx
\]
for all suitable test functions ψ. Now we can use integration by parts on the c'' term to reduce the order of the highest derivative:
\[
\int_{-1}^{1} \psi'c' + \psi\left[8c' + 9c\right] dx - \psi c'\big|_{-1}^{1} = \int_{-1}^{1} \psi\left[17 + 9x\right] dx
\]
Choosing the test functions ψ = φ_i gives one equation for each i, 0 ≤ i ≤ n. Each i corresponds to a row in the linear system: i = 0 corresponds to row 1, i = 1 corresponds to row 2, etc., until i = n, which corresponds to row n + 1.
FEM approximation
Next, the finite element approximation c(x) = \sum_{j=0}^{n} c_j\phi_j(x) is substituted in the terms under the integral, and the integral over the whole domain is written as a sum of integrals over all elements:
\[
\sum_{l=1}^{n}\int_{e_l} \phi_i'\sum_{j=0}^{n} c_j\phi_j' + \phi_i\left[8\sum_{j=0}^{n} c_j\phi_j' + 9\sum_{j=0}^{n} c_j\phi_j\right] dx
= \phi_i c'\big|_{-1}^{1} + \sum_{l=1}^{n}\int_{e_l} \phi_i\,(17 + 9x)\, dx
\]
We note for further reference that the term φ_i(x)c'|_{−1}^{1} = φ_i(x = 1)c'(x = 1) − φ_i(x = −1)c'(x = −1) doesn't affect the equations for i = 1, . . . , n − 1. The only non-zero basis function at x = 1 is φ_n and the only non-zero basis function at x = −1 is φ_0. Thus the equation for i = 0 has an extra term −φ_0(x = −1)c'(x = −1) = −c'(x = −1) and the equation for i = n has an extra term φ_n(x = 1)c'(x = 1) = c'(x = 1).
Element matrix and vector
The element matrix a^{(l)} and element vector f^{(l)} contain the contributions of the element integrals. The term φ_i(x)c'|_{−1}^{1} is related to the boundary conditions and will be discussed when applying the boundary conditions. On an element e_l = [x_{l−1}, x_l], only two basis functions are non-zero: φ_{l−1} and φ_l. Thus only the equations for i = l − 1 and i = l give non-zero contributions on element e_l in Eq. (4.7.2). Similarly, only the j = l − 1 and j = l terms in the finite element approximation \sum_{j=0}^{n} c_j\phi_j give a non-zero contribution to these equations.
The non-zero integrals can be put into a local matrix a^{(l)}, acting on the local unknowns,
\[
\begin{bmatrix} a_{11}^{(l)} & a_{12}^{(l)} \\ a_{21}^{(l)} & a_{22}^{(l)} \end{bmatrix}
\begin{pmatrix} c_{l-1} \\ c_l \end{pmatrix}
\]
with
\[
\begin{aligned}
a_{11}^{(l)} &= \int_{e_l} \phi_{l-1}'\phi_{l-1}' + \phi_{l-1}\left[8\phi_{l-1}' + 9\phi_{l-1}\right] dx,\\
a_{12}^{(l)} &= \int_{e_l} \phi_{l-1}'\phi_l' + \phi_{l-1}\left[8\phi_l' + 9\phi_l\right] dx,\\
a_{21}^{(l)} &= \int_{e_l} \phi_l'\phi_{l-1}' + \phi_l\left[8\phi_{l-1}' + 9\phi_{l-1}\right] dx,\\
a_{22}^{(l)} &= \int_{e_l} \phi_l'\phi_l' + \phi_l\left[8\phi_l' + 9\phi_l\right] dx.
\end{aligned}
\]
Row 1 in a(l) and f (l) corresponds to i = l − 1 and row 2 to i = l. Column 1 corresponds
to j = l − 1 and column 2 to j = l.
Evaluating integrals using a reference element
By applying the transformation of variables discussed in Sec. 4.7.1, we can write all integrals in terms of ξ. For the element matrix we get
\[
\begin{aligned}
a_{11}^{(l)} &= \int_0^1 \left[\frac{1}{h^2} - \frac{8}{h}(1-\xi) + 9(1-\xi)^2\right] h\, d\xi = \frac{1}{h} - 4 + 3h,\\
a_{12}^{(l)} &= \int_0^1 \left[-\frac{1}{h^2} + \frac{8}{h}(1-\xi) + 9\xi(1-\xi)\right] h\, d\xi = -\frac{1}{h} + 4 + \frac{3h}{2},\\
a_{21}^{(l)} &= \int_0^1 \left[-\frac{1}{h^2} - \frac{8}{h}\xi + 9\xi(1-\xi)\right] h\, d\xi = -\frac{1}{h} - 4 + \frac{3h}{2},\\
a_{22}^{(l)} &= \int_0^1 \left[\frac{1}{h^2} + \frac{8}{h}\xi + 9\xi^2\right] h\, d\xi = \frac{1}{h} + 4 + 3h.
\end{aligned}
\]
For the element vector we get
\[
\begin{aligned}
f_1^{(l)} &= h\int_0^1 (1-\xi)\left[17 + 9(x_{l-1} + h\xi)\right] d\xi = \frac{h}{2}(17 + 9x_{l-1}) + \frac{3h^2}{2},\\
f_2^{(l)} &= h\int_0^1 \xi\left[17 + 9(x_{l-1} + h\xi)\right] d\xi = \frac{h}{2}(17 + 9x_{l-1}) + 3h^2.
\end{aligned}
\]
Note that, because of the dependence on x of the right-hand side, the element vector is
different for every element.
Assembling
The element matrices and vectors need to be assembled element-by-element into the global matrix A and right-hand-side vector f. We start from the zero matrix. We first add the contributions of the first element, i = 0 and i = 1, corresponding to rows 1 and 2. On element 1, we have x_{l−1} = x_0:
\[
A := \begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & 0 & \cdots & 0\\
a_{21}^{(1)} & a_{22}^{(1)} & 0 & \cdots & 0\\
0 & 0 & 0 & & 0\\
\vdots & & & \ddots & \vdots\\
0 & 0 & 0 & \cdots & 0
\end{pmatrix}, \qquad
f := \begin{pmatrix} f_1^{(1)} \\ f_2^{(1)} \\ 0 \\ \vdots \\ 0 \end{pmatrix}
\]
And so on, until the contributions of the nth element, i = n − 1 and i = n, corresponding to rows n and n + 1:
\[
A := \begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & 0 & \cdots & 0\\
a_{21}^{(1)} & a_{22}^{(1)} + a_{11}^{(2)} & a_{12}^{(2)} & \cdots & 0\\
0 & \ddots & \ddots & \ddots & \vdots\\
\vdots & & a_{21}^{(n-1)} & a_{22}^{(n-1)} + a_{11}^{(n)} & a_{12}^{(n)}\\
0 & \cdots & 0 & a_{21}^{(n)} & a_{22}^{(n)}
\end{pmatrix}, \qquad
f := \begin{pmatrix} f_1^{(1)} \\ f_2^{(1)} + f_1^{(2)} \\ \vdots \\ f_2^{(n-1)} + f_1^{(n)} \\ f_2^{(n)} \end{pmatrix}
\]
Boundary conditions
To incorporate the Dirichlet boundary condition c(−1) = e^{−18} we replace the first row by c_0 = e^{−18}. To incorporate c'(1) = 10 we add c'(1) = 10 to the right-hand side of the last row, f_n:
\[
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0\\
a_{21}^{(1)} & a_{22}^{(1)} + a_{11}^{(2)} & a_{12}^{(2)} & \cdots & 0\\
0 & \ddots & \ddots & \ddots & \vdots\\
\vdots & & a_{21}^{(n-1)} & a_{22}^{(n-1)} + a_{11}^{(n)} & a_{12}^{(n)}\\
0 & \cdots & 0 & a_{21}^{(n)} & a_{22}^{(n)}
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-1} \\ c_n \end{pmatrix}
=
\begin{pmatrix} e^{-18} \\ f_2^{(1)} + f_1^{(2)} \\ \vdots \\ f_2^{(n-1)} + f_1^{(n)} \\ f_2^{(n)} + 10 \end{pmatrix}
\]
This is a linear system that can be solved with any appropriate numerical technique.
For a Robin boundary condition at x = 1, c' = αc + β, the right-hand side of the boundary condition also contains c, so the last row of the matrix A changes as well. The way natural boundary conditions are handled in 2D and 3D, including curved boundaries, is similar (just substitution), but there are of course more nodes on the boundary. This makes such boundary conditions much easier to handle in the finite element method than in the finite difference method.
4.7.4 Numerical computation of the integrals
For an ODE c00 + p(x)c0 + q(x)c = r(x) with relatively simple functions p(x), q(x), and
r(x) the integrals in the element matrices and vectors can be computed analytically. For
more complicated functions, the integrals can only be approximated numerically.
An integral can be approximated by a weighted sum of function values (numerical quadrature) at integration points x_i,
\[
\int_a^b f(x)\, dx \approx \sum_{i=1}^{n_{int}} w_i f(x_i),
\]
where n_{int} is the number of integration points and w_i is the weight of integration point x_i.
We only discuss two closed Newton-Cotes formulas, i.e. formulas whose integration points include the endpoints of the interval, with any additional points chosen so that the integration points are equally spaced over the interval.
The trapezoidal rule only uses the two endpoints of the interval, i.e. ξ_1 = 0 and ξ_2 = 1 on the reference element 0 ≤ ξ ≤ 1, with weights w_1 = w_2 = 1/2:
\[
\int_0^1 g(\xi)\, d\xi \approx \frac{1}{2}\left[g(0) + g(1)\right] = \sum_{i=1}^{n_{int}} w_i g(\xi_i).
\]
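As an illustration, a small sketch applying this rule to the f_1 element integral of the pollution problem from the previous section (the values of x_lm1 and h are example assumptions):

x_lm1 = -1; h = 0.1;                          % left node and mesh size of the element
g = @(xi) (1 - xi).*(17 + 9*(x_lm1 + h*xi));  % integrand of f1 on the reference element
f1_trap  = h*(g(0) + g(1))/2;                 % trapezoidal rule, weights 1/2 and 1/2
f1_exact = (h/2)*(17 + 9*x_lm1) + 3*h^2/2;    % exact element integral, for comparison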
The global matrix and vector are assembled in a for-loop over all elements. Since row i + 1, corresponding to test function φ_i, has contributions from two different elements, we need to add the local matrix a^{(l)} and vector f^{(l)} to the matrix A and vector f (for FD you can just set the elements of a row). Inside the for-loop the element matrix and vector need to be computed and assembled into the global matrix A and vector f. If you distinguish, for example, different quadrature rules and element orders, it is more convenient to compute the element matrix and element vector in a separate function that we name fematvecelm, as sketched below. To keep fematvecelm general, a function ode_param is called to evaluate the functions p(x), q(x), and r(x) at a point x.
Boundary conditions would typically be incorporated in a separate function febc, since this is independent of how you handle the element contributions. To keep this function general, you would use another function to specify the type and value of the boundary condition used at each endpoint. Implementation of a Dirichlet boundary condition in a certain row requires that all existing values in that row of A and f are replaced by the boundary condition. Implementation of a Neumann/Robin boundary condition in a certain row requires that the boundary condition contributions are added to the existing row.
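A minimal sketch of the assembly loop under the structure described above (fematvecelm is assumed to return the 2 × 2 element matrix and 2 × 1 element vector of element l):

n = 20; h = 2/n;                    % n elements on [-1, 1] (example values)
A = zeros(n+1); f = zeros(n+1,1);   % global matrix and vector, natural numbering
for l = 1:n
    [aelm, felm] = fematvecelm(l, h);          % element matrix and vector
    A(l:l+1,l:l+1) = A(l:l+1,l:l+1) + aelm;    % add (not set!) the contributions
    f(l:l+1) = f(l:l+1) + felm;
end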
4.8 Convergence of numerical methods for BVPs
We expect that we get a better approximation to the solution of a BVP if we decrease the
grid size h. The order of convergence tells how fast the numerical solution approximates
the exact solution. Since we solve a system of equations, we use error norms to compute
the error (see Sec. 3.3). In this section we measure the actual error using the infinity norm e_∞ = ‖c(x_i) − c_i‖_∞, i.e. we compute at every grid point the absolute difference with the exact solution and take the maximum.
To obtain the order of convergence we need to determine how fast the error approaches
zero if h → 0. Thus we need to determine the value p in e = O(hp ) = Chp . For this we
need to do numerical computations on several grids, and record the error with the exact
solution (or if this is not available a very accurate numerical solution). To determine p
we take the logarithm of e = Ch^p,
\[
\log_{10} e = \log_{10} C + p\,\log_{10} h.
\]
Thus p is the slope in a log_{10} h vs. log_{10} e plot. From the plot you can determine suitable values to determine p: h should not be too large (the order of convergence is for h → 0)
values to determine p: h should not be too large (the order of convergence is for h → 0)
and not too small (round-off errors). Note that when h is small, you subtract nearly equal
numbers in the numerator of a FD formula, since the values of yi−1 , yi , and yi+1 only differ
slightly. In addition, you divide by a small number of O(h) for an FD formula of a first
derivative or of O(h2 ) for an FD formula of a second derivative. This limits the smallest
grid size h that you can use in your calculations. For too small values of h, the error e∞
will not be dominated by the error due to the finite difference approximation but by the
error caused by round-off errors due to finite precision calculations and e∞ will start to
increase. This is not a real problem in Matlab, since you will run out of memory before
round-off errors become important.
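In practice, p can be estimated from the errors on two successive grids; a minimal sketch (e_coarse and e_fine are assumed to be the measured errors for step sizes h and h/2):

% e = C*h^p  =>  p = log(e1/e2) / log(h1/h2); here h1/h2 = 2
p = log(e_coarse/e_fine) / log(2)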
To determine the order of convergence numerically, we consider the problem
\[
\frac{d^2c}{dx^2} - 8\frac{dc}{dx} - 9c = -17 - 9x, \qquad -1 < x < 1,
\]
\[
c(x = -1) = e^{-18}, \qquad c(x = 1) = 3.
\]
If O(h²) difference formulas are used for both derivatives, the error e_∞ we find is O(h²) + O(h²) = O(h²). Similarly, if O(h) difference formulas are used in a finite difference method, we would find a numerical error e_∞ of O(h). If O(h³) difference formulas are used for all derivatives, we would find a numerical error e_∞ of O(h³), and so on.
We start with a medium grid size of h = 0.1 to compute the numerical solution. The next meshes are obtained by dividing the grid size by a factor of two (i.e. doubling the number of intervals). Fig. 4.9 shows e_∞ for the various grid sizes considered. We indeed observe a slope p ≈ 2 up to h ≈ 10^{−4.5}. For smaller values of h, e_∞ starts to increase. This is caused by round-off errors, which are then the dominant contribution to the error e_∞.

[Figure 4.9: Error plot (log_{10} e_∞ vs. log_{10} h) for the O(h²) centered finite difference scheme.]

Table 4.1 shows the numerical values of e_∞ for the various grid sizes considered. If the step size h is halved for an O(h²) method, the error on the refined mesh is O((h/2)²) = (1/4) O(h²), i.e. 1/4 of the error using mesh size h. This is exactly what we observe in Table 4.1 up to h = 1/81920. For smaller values of h, the round-off error becomes important and the error starts to grow rapidly: an increase by a factor of almost 10 instead of a decrease by a factor of 4.
Table 4.1: Error e_∞ and ratio of successive errors for the O(h²) centered finite difference scheme.

    h          e_∞      ratio
    1/10       1.87e-2    -
    1/20       4.41e-3  0.235
    1/40       1.09e-3  0.246
    1/80       2.72e-4  0.250
    1/160      6.79e-5  0.250
    1/320      1.70e-5  0.250
    1/640      4.24e-6  0.250
    1/1280     1.06e-6  0.250
    1/2560     2.65e-7  0.250
    1/5120     6.63e-8  0.250
    1/10240    1.66e-8  0.250
    1/20480    4.15e-9  0.250
    1/40960    1.07e-9  0.251
    1/81920    1.07e-9  0.258
    1/163840   1.07e-8  9.993
    1/327680   9.89e-8  9.237
[Figure: error plot (log_{10} e_∞ vs. log_{10} h) for this case.]

The table below shows the numerical values of e_∞ for the various grid sizes considered. If the step size h is halved, the error on the refined mesh is 1/4 of the error using mesh size h, typical for quadratic convergence.
    h        e_∞      ratio
    1/10     2.47e-2    -
    1/20     5.76e-3  0.233
    1/40     1.42e-3  0.246
    1/80     3.54e-4  0.250
    1/160    8.84e-5  0.250
    1/320    2.21e-5  0.250
    1/640    5.52e-6  0.250
    1/1280   1.38e-6  0.250
    1/2560   3.45e-7  0.250
    1/5120   8.62e-8  0.250
4.9 Solving linear systems for BVPs: Crout’s method
Typically, the number of grid points needed to obtain a decent numerical approximation is relatively small for boundary value problems, and the linear system can still be solved efficiently using a direct method. To obtain more accurate numerical approximations one could use a finer grid (typically non-uniform) or higher-order finite difference or finite element schemes. For the FD scheme using the O(h²) centered finite difference approximations and for the linear finite elements we considered, a tridiagonal matrix was obtained. For this type of matrix a very efficient direct solver is available: Crout's method, which uses only O(n) operations to solve a (tridiagonal) linear system. Higher-order finite differences and higher-order finite elements lead to more than 3 non-zero diagonals, and solving the linear system is computationally more expensive: using the basic Gaussian elimination technique would take O(n³) operations. The most computationally efficient approach is therefore usually to use the O(h²) schemes in combination with grid refinement. Note that this argument only holds for BVPs, where only one spatial dimension is involved; for O(h²) methods for PDEs in two or three dimensions the resulting linear system is no longer tridiagonal. Since Crout's method only requires O(n) operations, there is also no advantage in using iterative techniques to solve the linear system: iterative techniques already use O(n) operations per iteration.
Background
A tridiagonal N × N matrix (which arises in O(h²) finite difference methods and linear finite elements for BVPs)
\[
A = \begin{pmatrix}
a_{11} & a_{12} & 0 & \cdots & 0\\
a_{21} & a_{22} & a_{23} & \ddots & \vdots\\
0 & \ddots & \ddots & \ddots & 0\\
\vdots & \ddots & \ddots & \ddots & a_{N-1,N}\\
0 & \cdots & 0 & a_{N,N-1} & a_{NN}
\end{pmatrix}
\]
has a Crout factorization A = LU of the form
\[
L = \begin{pmatrix}
l_{11} & 0 & \cdots & \cdots & 0\\
l_{21} & l_{22} & \ddots & & \vdots\\
0 & \ddots & \ddots & \ddots & \vdots\\
\vdots & \ddots & \ddots & \ddots & 0\\
0 & \cdots & 0 & l_{N,N-1} & l_{NN}
\end{pmatrix}, \qquad
U = \begin{pmatrix}
1 & u_{12} & 0 & \cdots & 0\\
0 & 1 & u_{23} & \ddots & \vdots\\
\vdots & \ddots & \ddots & \ddots & 0\\
\vdots & & \ddots & \ddots & u_{N-1,N}\\
0 & \cdots & \cdots & 0 & 1
\end{pmatrix}
\]
This is easy to check: just take the matrix product LU and compare coefficients from the top row to the bottom row. If we have the LU factorization of the matrix A, we can solve Ax = b very fast. Since A = LU we need to solve
\[
LUx = b.
\]
The matrix-vector product Ux is a vector, say y. Thus we can first solve the lower triangular system
\[
Ly = b
\]
using forward substitution (first y_1, then y_2 using the already computed value of y_1, etc.) to find the intermediate vector y, and then solve
\[
Ux = y
\]
using backward substitution (first x_N, then x_{N−1}, etc.) to find the solution x of the system Ux = y, which is the solution of LUx = b.
The non-zero entries in L and U can be calculated using the Crout factorization algorithm for tridiagonal systems, and the system LUx = b can then be solved using the Crout forward/backward substitution algorithm; a Matlab sketch is given below.
• Computing a Crout factorization and solving a system is very fast. The Crout factorization requires only O(3N) operations. The forward/backward substitution is also only O(N). This is very cheap compared to Gaussian elimination with backward substitution (O(N³) operations) and faster than any iterative technique can be (O(N) operations per iteration).
Algorithm for Crout factorization for tridiagonal systems
Input: tridiagonal matrix A
Output: Crout LU factorization of tridiagonal matrix A
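A minimal Matlab sketch of the Crout factorization combined with forward/backward substitution for a tridiagonal system, storing only the three diagonals as arrays (the function name and layout are assumptions; it assumes l_ii ≠ 0):

function x = croutsolve(sub, dia, sup, b)
% Solve the tridiagonal system A*x = b via the Crout factorization A = L*U.
% sub(i) = A(i,i-1) (i = 2..N), dia(i) = A(i,i), sup(i) = A(i,i+1) (i = 1..N-1).
N = length(dia);
l = zeros(N,1); u = zeros(N-1,1); y = zeros(N,1); x = zeros(N,1);
% Crout factorization: L keeps the subdiagonal of A, U has a unit diagonal
l(1) = dia(1);
u(1) = sup(1)/l(1);
for i = 2:N-1
    l(i) = dia(i) - sub(i)*u(i-1);
    u(i) = sup(i)/l(i);
end
l(N) = dia(N) - sub(N)*u(N-1);
% Forward substitution: solve L*y = b
y(1) = b(1)/l(1);
for i = 2:N
    y(i) = (b(i) - sub(i)*y(i-1))/l(i);
end
% Backward substitution: solve U*x = y
x(N) = y(N);
for i = N-1:-1:1
    x(i) = y(i) - u(i)*x(i+1);
end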
Remarks on Crout factorization algorithm:
• The factorization cannot be performed when lii = 0 for some i. In a computer code
you might want to add a test and error message for this to make the code more
robust. For matrices arising from 1D finite difference and finite element methods
you typically have lii > 0.
• It is not necessary to define a full matrix for A, L, and U . Storing all zeros is a
waste of memory. Only storing the unknown entries in L and U as arrays and using
the proper array elements in the forward and backward substitution is sufficient.
Chapter 5
• Accuracy
• Stability
• Order of convergence
Numerical methods:
– Euler
– Trapezoidal rule
– Runge–Kutta
• polynomial approximation
– Lagrange polynomials
– piecewise Lagrange polynomials
5.1 Problem description and modeling
Problem: A hot cup of tea (of temperature T_0) is placed in a room with a lower temperature. What is the temperature of the tea as a function of time?
Simplifying assumptions: We assume that the tea is well-stirred, so that its temperature doesn't vary in space, only in time: T = T(t). (In Chap. 7 we consider the temperature to be a function of time and space.) The room is large, so we may assume that the temperature of the room is essentially unchanged by the tea, i.e. the room temperature T_sur remains constant.
Basic model:
"Rate of change = rate of increase − rate of decrease."
Apply this to the temperature:¹
• Rate of change: dT/dt, the change of temperature in time.
• Rate of increase: 0, no heat source.
• Rate of decrease: the cup loses energy to the environment. Newton's cooling law: experimentally one observes that the rate at which the temperature of the liquid decreases is proportional to the difference in temperature between the object and its surroundings: k(T − T_sur).
Substitution results in the initial value problem
\[
\frac{dT}{dt} = -k(T - T_{sur}), \qquad T(t = 0) = T_0.
\]
Henceforth, we take k = 1, T_sur = 20, and T_0 = 80. (All quantities are expressed in an arbitrary consistent system of units.) Substituting gives
\[
\frac{dT}{dt} = -T + 20, \qquad T(t = 0) = 80,
\]
which we will solve numerically on the time interval 0 ≤ t ≤ 10.
The cooling model is a linear, nonhomogeneous first order initial value problem (IVP); these have the general form y' + p(t)y = g(t), y(t_0) = y_0. From differential equations we know that a unique solution exists if p(t) and g(t) are continuous functions on the t interval considered. For those cases, a numerical technique should give a good numerical approximation to the solution; if it does not, there is likely an error in the program. For nonlinear first order IVPs, which have the general form y' = f(t, y), y(t_0) = y_0, it is often unknown whether a unique solution exists or not. If you experience problems in solving nonlinear IVPs, there can be several causes: a unique solution might not exist, the numerical method or the grid might not be good enough, or you might have an error in your program. Then a careful inspection of the numerical results is necessary to determine the most likely cause and a possible fix.
¹Actually it should be an internal energy balance, but under the usual assumption that the internal energy only depends on temperature, the result is the same.
5.2 Solving first order linear IVPs analytically
In this section we discuss two methods to obtain solutions of a (linear) IVP. In the next sections we will solve the linear IVP numerically and see whether the numerical methods converge to the analytical solution found in this section.
A linear first order ODE
\[
y' + p(t)y = g(t)
\]
can be solved with the integrating factor μ(t) = exp(∫ p(t) dt): after multiplying the ODE by μ(t), the left-hand side becomes a total derivative,
\[
\frac{d(\mu y)}{dt} = \mu(t)g(t).
\]
Integration gives
\[
y(t) = \frac{1}{\mu(t)}\left[\int \mu(t)g(t)\, dt + C\right].
\]
The constant C can be determined from the initial condition.
In our example, we have p(t) = 1 and g(t) = 20. This gives μ(t) = exp(∫ 1 dt) = e^t. Multiplying by μ and combining the terms on the left, the ODE becomes
\[
\frac{d(e^t T)}{dt} = 20e^t.
\]
After integration and dividing by e^t, we get
\[
T(t) = 20 + Ce^{-t},
\]
where C needs to be determined from the initial condition: 80 = T(0) = 20 + C, or C = 60. The (unique) solution to the IVP is thus
\[
T(t) = 20 + 60e^{-t}.
\]
5.2.2 Symbolical calculations
ODEs and IVPs can be solved symbolically using Matlab with dsolve. In order to solve
the 1st order ODE T 0 = −T + 20, use
T = dsolve(’DT=20-T’)
which gives the general solution
T = 20+exp(-t)*C1
In order to solve the 1st order IVP T 0 = −T + 20 with T (0) = 80 just add the initial value
T = dsolve(’DT=20-T’, ’T(0)=80’)
which gives the unique solution
T = 20+60*exp(-t)
If Matlab cannot solve the ODE, for example T' = sin(T²),
dsolve('DT=sin(T^2)')
it will respond with an implicit answer such as
T = RootOf(-t+Int(1/sin(_a^2), _a = .. _Z)-_C1)
which means that it cannot find an explicit analytical solution.
5.3 Solving IVPs numerically: Introduction
5.3.1 Grids
The solution to an initial value problem is a function y(t) which is defined for every t. If
we use numerical techniques, we find an approximation to the solution at certain discrete
time levels ti only. The collection of ti ’s is called a (computational) grid. Before we can
solve an IVP numerically, we need to specify the computational grid, i.e. all times ti in
[a, b] for which we want to obtain a numerical approximation. A numerical technique will
produce approximations to the function y at these grid points only, i.e. we will obtain
approximations for the values y(ti ) which will be denoted by yi . See Fig. 5.1.
[Figure 5.1: A computational grid a = t_0 < t_1 < · · · < t_n = b with approximations y_i to y(t) at the grid points; y_0 = y(a) is the initial condition.]
Often we choose the interval [a, b] to be divided into N equally spaced subintervals of
length h = (b − a)/N , which corresponds to the grid points
ti = a + ih
for i = 0, . . . , N . The length of a subinterval h is called the step size.
Example
We solve an initial value problem numerically on [0, 1] and divide the interval [0, 1] into
N = 4 equally spaced subintervals. The length of a subinterval is h = (1 − 0)/4 = 1/4.
We obtain a numerical approximation to the solution at the N + 1 = 5 discrete time levels
only: t0 = 0 (just the initial condition), t1 = 1/4, t2 = 1/2, t3 = 3/4, and t4 = 1.
Remarks:
1. Typically, the more grid points the more accurate the approximation to the solution
and the more work to compute the approximation. The goal is to compute an accu-
rate numerical solution with as few grid points as possible, to minimize computing
time.
2. Almost always there are certain regions in [a, b] where the solution changes more
rapidly (i.e. where you want more grid points). An equally spaced grid is then not
the best choice.
3. An equally spaced mesh is easiest for introducing the numerical techniques and will therefore be used in this chapter.
The main numerical issues when solving IVPs are:
1. Accuracy.
2. Stability.
3. Computing time and memory (typically not a very important issue for IVPs, only for time-dependent PDEs in 2 or 3 spatial dimensions).
We only discuss some well-known one-step methods, for which a solution at some time
ti is computed using only quantities at the previous time level ti−1 . We discuss Euler’s
method, trapezoidal rule, and Runge-Kutta methods. We focus on how these methods
work, how to program them, and typical numerical issues.
5.4 Solving IVPs numerically using Matlab: ode45
Matlab has several built-in functions to solve initial value problems numerically. We only discuss ode45, which uses a Runge–Kutta technique with adaptive time stepping, i.e. at every step a proper value of the step size h is determined in order to obtain a solution within a specified tolerance. Matlab's ode45 solves the IVP y' = f(t, y) with y(a) = y_0 on the time interval a ≤ t ≤ b.
To solve the cooling problem with default tolerances, one would type in the Matlab Command Window
[t, T] = ode45('func_ode', [0 10], 80)
where [0 10] is the time interval [a, b] on which you want to obtain a numerical solution and 80 is the value of the initial condition y_0. The string func_ode (the quotes indicate that it is a string) specifies the name of the m-file where the right-hand-side function f(t, y) is specified. For the cooling problem, f(t, T) = 20 − T needs to be specified in the m-file func_ode.m.
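A minimal sketch of this m-file (the exact layout is an assumption):

function f = func_ode(t, T)
% Right-hand side f(t, T) = 20 - T of the cooling problem
f = 20 - T;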
The result of ode45 is two arrays with the discrete time values used (t) and the corresponding approximations to the solution (T). To satisfy the default tolerance values, 49 grid points are used.
The accuracy can be increased by using odeset. By default an absolute tolerance of 10^{−6} is used to determine the step size h at grid point t_i. To solve the IVP with an absolute tolerance of 10^{−8}, you would type in the Matlab Command Window
options = odeset('AbsTol', 1e-8)
[t,y] = ode45('func_ode', [0 10], 80, options)
which produces, in this case, the same solution (49 grid points).
Fig. 5.2 shows the numerical solution together with the analytical solution. No differences are observed on the scale of the plot.

[Figure 5.2: Numerical solution of the cooling problem using ode45 together with the exact solution T(t) = 20 + 60e^{−t}.]
5.5 One-step methods
Details about the derivation of one-step methods for y' = f(t, y) are discussed in Math 4446. Here we focus on how the methods work and on typical numerical issues. We start with the easiest method: Euler's method. It is the simplest numerical technique and demonstrates all the numerical concepts; however, it is almost never the best numerical technique for the problem you solve. We consider a constant step size h = t_{i+1} − t_i.
Starting from the initial condition y_0, Euler's method computes each next approximation from
\[
y_{i+1} = y_i + h f(t_i, y_i), \qquad i = 0, \ldots, N-1.
\]
Euler's method is an explicit method: the right-hand side only depends on known quantities (at time level t_i). This makes it easy to solve: just evaluate the right-hand side using the known values t_i and y_i.
The integral is approximated with the trapezoidal rule (the average of the function values at the endpoints t_i and t_{i+1}). Starting from the initial condition y_0, any next y_{i+1} is obtained from
\[
y_{i+1} = y_i + \frac{h}{2}\left[f(t_i, y_i) + f(t_{i+1}, y_{i+1})\right], \qquad i = 0, \ldots, N-1.
\]
This is an implicit method: the right-hand side also depends on the a priori unknown y_{i+1}. In general, this makes it more difficult to compute y_{i+1}. Only for relatively simple functions f can an explicit equation for y_{i+1} be obtained; otherwise, a nonlinear equation needs to be solved, for example with bisection or Newton's method. Generally, this requires much more work per time step. The test equation and the cooling equation are relatively simple, and solving a nonlinear equation is not necessary (the right-hand side is linear in y).
Runge–Kutta methods
A frequently used Runge–Kutta method is the Runge–Kutta method of order four (RK4). Starting from the initial condition y_0, any next y_{i+1} is obtained from

k_1 = h f(t_i, y_i),
k_2 = h f(t_i + h/2, y_i + k_1/2),
k_3 = h f(t_i + h/2, y_i + k_2/2),
k_4 = h f(t_{i+1}, y_i + k_3),
y_{i+1} = y_i + (k_1 + 2k_2 + 2k_3 + k_4)/6,
for i = 0, . . . , N − 1. The method is explicit: all substeps only involve known quantities
in the right-hand side.
The general structure of a program for a one-step method is:

Initializations
    Set initial condition
    Compute number of subintervals N
One-step method
    Do for i = 0, ..., N − 1
        Compute step size h
        Compute next approximation y_{i+1} from the known values h, y_i, t_i, and t_{i+1}
    End do (i-loop)
function [y] = euler(t, y0)
%=============================================
% Euler's method for 1st order IVPs
% Input/output parameters:
%   t    array with (N+1) grid points t_i
%   y0   value of initial condition
%   y    array with (N+1) approximations y_i
%=============================================
%---------------------------------------------
% Initializations
%---------------------------------------------
y(1) = y0;
N = length(t) - 1;
%---------------------------------------------
% Euler's method
%---------------------------------------------
for i = 1:N
    h = t(i+1) - t(i);        % step size of the current interval
    f = funcivp(y(i), t(i));
    y(i+1) = y(i) + h*f;
end
%---------------------------------------------
% Function f(y, t) with y and t scalars
%---------------------------------------------
function [f] = funcivp(y, t)
% cooling problem
f = 20 - y;
Remarks:
• Note that the function euler is general. When changing to a different f (t, y) only
funcivp needs to be changed.
• For different one-step methods, just the part inside the for-loop needs to be modified. For the explicit RK4 we just have some more explicit substeps to perform. If you use the implicit trapezoidal rule, you may need to solve a nonlinear algebraic equation every step, which can be done with Newton's method; then you need to call a function newton to find y_{i+1}. Since the time step is typically small, y_i is usually a good enough initial guess for Newton's method. A sketch of the RK4 loop is given below.
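As an illustration of the previous remark, a sketch of the modified loop body for RK4, assuming the same funcivp(y, t) as above:

for i = 1:N
    h  = t(i+1) - t(i);
    k1 = h * funcivp(y(i), t(i));
    k2 = h * funcivp(y(i) + k1/2, t(i) + h/2);
    k3 = h * funcivp(y(i) + k2/2, t(i) + h/2);
    k4 = h * funcivp(y(i) + k3, t(i+1));
    y(i+1) = y(i) + (k1 + 2*k2 + 2*k3 + k4)/6;
end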
5.5.5 Example
We solve the cooling problem Eq. (5.1) using a step size h = 1/2 and compare with the exact solution T(t) = 20 + 60e^{−t}. Fig. 5.3 shows the approximations T_i for the three different one-step methods considered: Euler, the trapezoidal rule, and RK4.
Figure 5.3: One-step methods for the cooling problem using step size h = 1/2.
Table 5.1: One-step methods for the cooling problem using step size h = 1/2.

We observe from the comparison with the exact solution in Fig. 5.3 and Table 5.1 that RK4 gives a better approximation than the trapezoidal rule, which gives a better approximation than Euler. Of course, RK4 requires more work per time step than the trapezoidal rule, which requires more work per time step than Euler. In Sec. 5.7 we discuss accuracy in more detail.
5.6 Test equation and amplifying factor
The general IVP y′ = f(t, y) is rather complicated to analyze. For this reason a test equation is introduced (just use f(t, y) = λy):

y′ = λy,

where λ is a complex number. This is a rather simple equation, but it can still demonstrate the main concepts, and the results we derive remain valid for more complicated equations.
Applying a one-step method to the test equation results in an equation of the form

y_{i+1} = k(hλ) y_i,

where k is the amplifying factor, which depends on hλ. The amplifying factor of a numerical technique contains enough information to determine its accuracy and stability properties.
Euler's method
Euler's method for the test equation is

y_{i+1} = y_i + hλ y_i = (1 + hλ) y_i,

so the amplifying factor is

k(hλ) = 1 + hλ.

Trapezoidal rule
The trapezoidal rule for the test equation is

y_{i+1} = y_i + (h/2) [λ y_i + λ y_{i+1}].

After some algebra this gives the amplifying factor

k(hλ) = (1 + hλ/2) / (1 − hλ/2).
Runge–Kutta (RK4)
Writing the RK4 method for the test equation in the form y_{i+1} = k(hλ) y_i gives, after some algebra, the amplifying factor

k(hλ) = 1 + hλ + (hλ)^2/2 + (hλ)^3/6 + (hλ)^4/24.
5.7 Accuracy
To say more about the error we can expect when using a numerical technique to solve
IVPs, we distinguish two types of errors:
• the local truncation error e_{i+1} (easiest to use, but doesn't include accumulation of errors), where we assume that the solution at the previous step was exact: y_i = y(t_i);
• the global error ε_i = y(t_i) − y_i, which does include the accumulated effect of the errors made in all previous steps.
To analyze the local truncation error we consider the test equation y′ = λy. Integration from t_i to t_{i+1} gives

y(t_{i+1}) = e^{hλ} y(t_i).

To investigate errors, we use the Taylor series

e^{hλ} = Σ_{j=0}^{n−1} (λh)^j / j! + O(h^n) = 1 + λh + (λh)^2/2 + ⋯ + (λh)^{n−1}/(n−1)! + O(h^n).

Combining with the general form for a one-step method y_{i+1} = k(hλ) y_i gives the local truncation error

e_{i+1}(h) = (e^{hλ} − k(hλ)) y_i.
Euler's method
For Euler, k(hλ) = 1 + hλ, so

e_{i+1}(h) = ((hλ)^2/2 + O(h^3)) y_i = O(h^2).

The local truncation error for the cooling problem Eq. (5.1) is given in Table 5.2.
h y1 y(t1 ) e1 ratio
0.1 74.000 7.4290245e+01 2.90e-01 -
0.05 77.000 7.7073765e+01 7.38e-02 0.254
0.025 78.500 7.8518594e+01 1.86e-02 0.252
0.0125 79.250 7.9254668e+01 4.67e-03 0.251
0.00625 79.625 7.9626169e+01 1.17e-03 0.251
Table 5.2: Local truncation error for the cooling problem Eq. (5.1) using Euler’s rule.
We observe that the error decreases by a factor of 4 = 2^2 when h is halved. This is exactly what we expect: if we decrease the step size from h to h/2, the error decreases from O(h^2) to O((h/2)^2) = (1/4) O(h^2).
Trapezoidal rule
Using Taylor expansions for the exponential and for 1/(1 − hλ/2) (using 1/(1 − x) = Σ_{i=0}^{∞} x^i) gives after some algebra

e_{i+1}(h) = (e^{hλ} − k(hλ)) y_i = O(h^3).

The local truncation error for the cooling problem Eq. (5.1) is given in Table 5.3.
h y1 y(t1 ) e1 ratio
0.1 7.42857143e1 7.42902451e1 4.53e-3 -
0.05 7.70731707e1 7.70737655e1 5.95e-4 0.131
0.025 7.85185185e1 7.85185947e1 7.62e-5 0.128
0.0125 7.92546584e1 7.92546680e1 9.64e-6 0.127
0.00625 7.96261682e1 7.96261694e1 1.21e-6 0.126
Table 5.3: Local truncation error for the cooling problem Eq. (5.1) using the trapezoidal
rule.
We observe that the error decreases by a factor of 8 = 2^3 when h is halved. This is exactly what we expect: if we decrease the step size from h to h/2, the error decreases from O(h^3) to O((h/2)^3) = (1/8) O(h^3).
Runge-Kutta (RK4)
For RK4 we obtain

e_{i+1}(h) = (e^{hλ} − k(hλ)) y_i = O(h^5).

This is two orders higher than the trapezoidal rule and three orders higher than Euler. The local truncation error for the cooling problem Eq. (5.1) is given in Table 5.4. We observe that the error decreases by a factor of 32 = 2^5 when h is halved. This is exactly what we expect: if we decrease the step size from h to h/2, the error decreases from O(h^5) to O((h/2)^5) = (1/32) O(h^5).
h y1 y(t1 ) e1 ratio
0.1 7.4290250000000e1 7.4290245082158e1 4.92e-06 -
0.05 7.7073765625000e1 7.7073765470043e1 1.55e-07 3.150e-2
0.025 7.8518594726563e1 7.8518594721700e1 4.86e-09 3.135e-2
0.0125 7.9254668029785e1 7.9254668029633e1 1.52e-10 3.128e-2
0.00625 7.9626169437408e1 7.9626169437404e1 4.76e-12 3.132e-2
Table 5.4: Local truncation error for the cooling problem Eq. (5.1) using RK4.
Consequence
The global error is one order lower than the local truncation error, thus for Euler we have ε_i = O(h), for the trapezoidal rule ε_i = O(h^2), and for RK4 ε_i = O(h^4).
Application
We can use the order of the global error to estimate the step size h required to obtain a solution that is a certain amount more accurate. Assume we have a solution with error ε_0 obtained using step size h_0. To obtain a solution with an error ε = 10^{−4} ε_0 (4 more accurate digits) with a method of order p, we would need

h^p ≈ 10^{−4} h_0^p.

For Euler, the global error is O(h), i.e. p = 1, and we would need h ≈ 10^{−4} h_0, i.e. 10^4 times as many intervals. For the trapezoidal rule, the global error is O(h^2), i.e. p = 2, and we would need h^2 ≈ 10^{−4} h_0^2, or h ≈ 10^{−2} h_0, i.e. 10^2 times as many intervals. For RK4, the global error is O(h^4), i.e. p = 4, and we would need h^4 ≈ 10^{−4} h_0^4, or h ≈ 10^{−1} h_0, i.e. only 10 times as many intervals.
Example
For the cooling problem Eq. (5.1), the exact solution at t = 1 is T(t = 1) = 20 + 60e^{−1} ≈ 4.207276647028654e+01.
Euler
The global error is one order lower than the local truncation error, thus ε_i = O(h). The global error for the cooling problem Eq. (5.1) is given in Table 5.5.

h       T_i(t = 1)   ε(t = 1)   ratio
1/10    4.0921e1     1.2e0      -
1/20    4.1509e1     5.6e-1     0.489
1/40    4.1794e1     2.8e-1     0.495
1/80    4.1934e1     1.4e-1     0.497
1/160   4.2004e1     6.9e-2     0.499
1/320   4.2038e1     3.5e-2     0.499

Table 5.5: Global error for the cooling problem Eq. (5.1) using Euler.

We observe that the error decreases by a factor of 2 = 2^1 when h is halved, exactly what we expect for a global error of O(h).
Trapezoidal rule
The global error is one order lower than the local truncation error, thus ε_i = O(h^2). The global error for the cooling problem Eq. (5.1) is given in Table 5.6.

Table 5.6: Global error for the cooling problem Eq. (5.1) using the trapezoidal rule.

We observe that the error decreases by a factor of 4 = 2^2 when h is halved. This is exactly what we expect: if we decrease the step size from h to h/2, the error decreases from O(h^2) to O((h/2)^2) = (1/4) O(h^2). Note that the global error for h = 0.1 is already better than for Euler using h = 1/320.
Runge-Kutta (RK4)
The global error is one order lower than the local truncation error, thus ε_i = O(h^4). The global error for the cooling problem Eq. (5.1) is given in Table 5.7. We observe that the error decreases by a factor of 16 = 2^4 when h is halved. This is exactly what we expect: if we decrease the step size from h to h/2, the error decreases from O(h^4) to O((h/2)^4) = (1/16) O(h^4). Note that the global error for h = 0.1 is already better than the global error for the trapezoidal rule for h = 1/160.
h        w(t = 1)           ε(t = 1)   ratio
1/10 4.207278646475e1 2.00e-05 -
1/20 4.207276766885e1 1.20e-06 5.99e-2
1/40 4.207276654365e1 7.34e-08 6.12e-2
1/80 4.207276647482e1 4.54e-09 6.19e-2
1/160 4.207276647057e1 2.82e-10 6.21e-2
Table 5.7: Global error for the cooling problem Eq. (5.1) using RK4.
Round-off errors
Besides the discretization error, round-off errors affect the global error. Consider the IVP

y′ = 1, 0 ≤ t ≤ 1, y(0) = 1,

solved with Euler's method using a precision of 8 digits:

w_0 = 1.0000000, w_{i+1} = w_i + h.

For a step size h = 1.0×10^{−8}, the sum 1 + 1.0×10^{−8} cannot be distinguished from 1 with 8 digits, so the numerical solution never changes. More digits of precision shift the problem to smaller step sizes. If we did the calculations in double precision (16 digits accuracy), there is no problem to represent 1 + 1.0×10^{−8} = 1.00000001, but 1 + 1.0×10^{−16} = 1 does give the same problem. How round-off errors affect the global error in Euler's method for the cooling problem Eq. (5.1) can be observed in Table 5.8. Up to step size h = 10^{−9}, the global error is 1/10 times the previous error when h is divided by 10, i.e. O(h). For smaller step sizes the round-off error dominates and the global error starts to grow rapidly. Thus there is an optimum h with minimum error; in Table 5.8 the optimum value of h is around 10^{−9}.
h          w(t = 0.1)             e(t = 0.1)   ratio
10^{-4}    7.429021793686086e+01  2.7e-05      -
10^{-5}    7.429024236764370e+01  2.7e-06      0.1
10^{-6}    7.429024481071004e+01  2.7e-07      0.1
10^{-7}    7.429024505501881e+01  2.7e-08      0.1
10^{-8}    7.429024507944148e+01  2.7e-09      0.1
10^{-9}    7.429024508190086e+01  2.6e-10      0.1
10^{-10}   7.872524636435026e+01  4.43e+00     -
Table 5.8: Impact of round-off error on global error in Euler’s method for the cooling
problem Eq. (5.1).
5.8 Stability
In numerical computations there are always small errors: round-off errors and discretization errors. Stability concerns whether these errors grow without bound or not. Stability is particularly important for so-called stiff equations.
5.8.1 Introduction
To develop some intuition for stability, we consider the following examples, which we solve using Euler's method.
1. y′ + y = −99e^{−100t}, y(0) = 2, with exact solution y(t) = e^{−t} + e^{−100t}, which decays from 2 to 0. Solving this IVP numerically is straightforward and the numerical solution looks as expected. This we will call stable further on.
2. y′ + 100y = 99e^{−t}, y(0) = 2, with exact solution y(t) = e^{−t} + e^{−100t}. The exact solution is exactly the same, but solving this IVP numerically is much harder: the global errors are huge up to and including h = 1/32. This we will call unstable further on.
3. y′ + 100y = 0, y(0) = 1, with exact solution y(t) = e^{−100t}. The numerical solution is again unstable: the global errors are again huge up to and including h = 1/32.
To analyze what happens exactly, we only consider the simplest equation for which we observe the unstable behavior, y′ = −100y with y(0) = 1. The analytical solution, y(t) = e^{−100t}, decays very rapidly to zero. Numerical solution using Euler gives

y_{i+1} = y_i + h(−100 y_i) = (1 − 100h) y_i.

Every step the previous value is multiplied by (1 − 100h). The numerical solution will only tend to 0 if |1 − 100h| < 1. Thus we need 0 < h < 2/100 = 0.02. Otherwise y_i, and thus the error, will grow, since the exact solution y → 0 as t → ∞.
Now consider y′ = −100e^{−100t} with y(0) = 1, which has the same analytical solution e^{−100t}. Numerical solution using Euler gives

y_{i+1} = y_i − 100h e^{−100 t_i} = y_{i−1} − 100h Σ_{j=i−1}^{i} e^{−100 t_j} = ⋯ = y_0 − 100h Σ_{j=0}^{i} e^{−100 t_j} = 1 − 100h Σ_{j=0}^{i} e^{−100 t_j},
which is conceptually very different. There is no multiplication of the previous value, just
an addition of a small term every step.
5.8.2 Stability of one-step methods (test equation)
We solve two test equations with a one-step method using slightly different initial conditions,

y′ = λy, y(t_0) = y_0,

and

w′ = λw, w(t_0) = y_0 + ε.

Thus we have the same ODE (and the same general solution) but a slightly different value (ε) for the initial condition.
Applying a one-step method to the above equations gives

y_{i+1} = k(hλ) y_i

and

w_{i+1} = k(hλ) w_i, w_0 = y_0 + ε.

Subtracting gives an equation for the difference ε_i = y_i − w_i between the two solutions,

ε_{i+1} = k(hλ) ε_i, ε_0 = ε.

Applying the one-step method i + 1 times gives

ε_{i+1} = (k(hλ))^{i+1} ε.

If |k(hλ)| < 1, then lim_{i→∞} |ε_i| = 0. Thus the scheme is stable if |k(hλ)| ≤ 1 (the error will not grow), absolutely stable if |k(hλ)| < 1 (the error will decay to zero), and unstable if |k(hλ)| > 1 (the error will grow, and y_i and w_i will be very different for large values of i).
5.8.3 Regions of absolute stability

Euler
For Euler, k(hλ) = 1 + hλ. To have absolute stability we need |k(hλ)| < 1, which gives |1 + z| < 1, where z = hλ is a complex number. For the magnitude of a complex number z = x + iy, we have

|1 + z| = √((1 + z)(1 + z̄)) = √((1 + x + iy)(1 + x − iy)) = √((1 + x)^2 + y^2),

so the condition |1 + z| < 1 becomes

(x + 1)^2 + y^2 < 1.

Since (x + 1)^2 + y^2 = 1 is a circle around (−1, 0) with radius 1, this corresponds to the region inside the circle. The circle itself is not included, which is represented in a figure by a dashed line.
The boundary of the region of absolute stability can also be plotted directly with
Matlab using
syms x y;
z = x + i*y;
k = 1 + z;
ezplot(abs(k) - 1, [-3, 1, -1.5, 1.5]);
grid on;
setcurve('Line', ':');
Here abs gives the magnitude of a complex number and [-3, 1, -1.5, 1.5] are the minimum and maximum x and y values in the figure (appropriate values were found by trial and error). The last line makes a dashed line; for this you need the m-file setcurve.m in your Current Directory. The resulting Matlab figure is Fig. 5.4.

Figure 5.4: Region of absolute stability for Euler's method: inside the circle. A dashed curve means that the curve itself is not included.

Note that we only plotted the boundary. The region of absolute stability |k(z)| < 1 is inside the closed curve. This
is easily verified by checking one point inside the circle and one point outside the circle
(z = −1 + 0i satisfies |1 + z| < 1 and z = 1 + 0i not).
Trapezoidal rule
For the region of absolute stability we need |k(hλ)| < 1. Using z = hλ this gives

|(1 + z/2) / (1 − z/2)| < 1,   or   |1 + z/2| < |1 − z/2|.

The boundary of the region of stability is displayed in Fig. 5.5.

Figure 5.5: Region of absolute stability for the trapezoidal rule: the left half plane.

The region of stability is on the left of the imaginary axis (check a point on each side). Thus the region of stability is the whole left half plane.
RK4
The boundary of the region of stability is displayed in Fig. 5.6. The region of absolute stability is inside the closed curve (check one point inside and one point outside the closed curve).
Application
From the region of absolute stability, you can obtain a maximum value of h for which the solution remains stable (a quick Matlab check is sketched after these examples). For example:
1. Assume λ is real and negative. For which values of h is Euler absolutely stable? λ is real, so we are on the real axis. The part of the real axis inside the circle corresponds to −2 < hλ < 0. Since λ < 0 this gives 0 < h < 2/(−λ) = 2/|λ|.
2. Assume λ is purely imaginary. For which values of h is Euler absolutely stable? λ is purely imaginary, so we are on the imaginary axis. No point of the imaginary axis (except the origin) lies inside the region of absolute stability; thus Euler is unstable for any h.
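As a quick check, the magnitude of Euler's amplifying factor |1 + hλ| can be evaluated in Matlab for a few step sizes (the value λ = −20 anticipates the example in Sec. 5.8.4):

% |k(h*lambda)| = |1 + h*lambda| for Euler with lambda = -20
lambda = -20;
for h = [1 0.1 0.05 0.01]
    fprintf('h = %g: |1 + h*lambda| = %g\n', h, abs(1 + h*lambda));
end

Values larger than 1 indicate instability, exactly 1 stability without decay, and smaller than 1 absolute stability.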
Figure 5.6: Region of absolute stability for RK4: inside the closed curve.
5.8.4 Example
We consider the IVP y′ = −20y (i.e. λ = −20) with y(t = 0) = 1/3, which has exact solution y(t) = e^{−20t}/3. The initial condition y_0 = 1/3 cannot be represented exactly, so we have a small round-off error in y_0. We use various step sizes h inside and outside the region of stability to see how the numerical solution behaves.
Euler
Table 5.9 shows results for Euler's method using various step sizes h.

t   y          y_i (h = 1)   y_i (h = 0.1)   y_i (h = 0.05)   y_i (h = 0.01)
1   6.87e-10   -6.33e0       3.33e-1         0                6.79e-11
2   1.41e-18   1.20e2        3.33e-1         0                1.38e-20
3   2.91e-27   -2.28e3       3.33e-1         0                2.82e-30
4   6.01e-36   4.34e4        3.33e-1         0                5.47e-40
5   1.24e-44   -8.25e5       3.33e-1         0                1.16e-49

Table 5.9: Stability for test equation using Euler's method.

We see that for h = 1 the numerical solution and the error grow without bound, so the method is unstable. For h = 0.1 the numerical solution and the error are bounded but the error does not decay; the solution is stable but not absolutely stable. For h < 0.1 the numerical solution and the error decay, so the method is absolutely stable. This is exactly what we expect from the theory: absolutely stable for h < 2/|λ| = 2/20 = 0.1.
Trapezoidal rule
Table 5.10 shows results for the trapezoidal rule using various step sizes h.

t   y          y_i (h = 1)   y_i (h = 0.1)   y_i (h = 0.02)   y_i (h = 0.01)
1   6.87e-10   -2.73e-1      0.00e0          5.23e-10         6.42e-10
2   1.41e-18   2.23e-1       0.00e0          8.20e-19         1.24e-18
3   2.91e-27   -1.83e-1      0.00e0          1.29e-27         2.39e-27
4   6.01e-36   1.49e-1       0.00e0          2.02e-36         4.60e-36
5   1.24e-44   -1.22e-1      0.00e0          3.16e-45         8.87e-45

Table 5.10: Stability for test equation using the trapezoidal rule.

We note that all numerical solutions are absolutely stable (all decay). This is exactly what we expect from theory: the left half plane implies stability for all h > 0, since λ < 0.
Here we see the advantage of an implicit method: implicit methods can have an infinite region of stability. For the trapezoidal rule, all values of h are in the region of stability when Re(λ) < 0. Explicit methods always have a finite region of stability; thus h should be chosen small enough to ensure absolute stability for explicit methods.
Note that although the trapezoidal rule is stable for h = 1, the numerical solution is not very accurate; accuracy requires a smaller value of h. Also note that h = 0.1 is a special case: the amplifying factor equals zero, giving a zero solution except for the initial condition.
RK4
Table 5.11 shows results for RK4 using various step sizes h.

t   y          y_i (h = 1)   y_i (h = 0.1)   y_i (h = 0.02)   y_i (h = 0.01)
1   6.87e-10   1.84e03       5.65e-06        6.91e-10         6.87e-10
2   1.41e-18   1.01e07       9.56e-11        1.43e-18         1.42e-18
3   2.91e-27   5.59e10       1.62e-15        2.97e-27         2.92e-27
4   6.01e-36   3.08e14       2.74e-20        6.16e-36         6.02e-36
5   1.24e-44   1.70e18       4.64e-25        1.28e-44         1.24e-44

Table 5.11: Stability for test equation using RK4.

Note that for h = 1, RK4 is unstable. For h = 0.1, however, RK4 is already absolutely stable, contrary to Euler. The reason is that the region of stability of RK4 includes a larger portion of the real axis, which includes the point hλ = −2 obtained for h = 0.1.
The unstable behavior for too large values of h is typical for explicit methods. A
very accurate explicit technique doesn’t eliminate the unstable behavior. Stability and
accuracy are two different subjects.
5.8.5 Stability for nonlinear equations
To determine the linear stability for y′ = g(y) with initial condition y(0) = y_0, we also consider the perturbed problem w′ = g(w) with w(0) = y_0 + ε_0. Introducing ε = y − w and linearizing gives

ε′ = (dg(y)/dy) ε.

Comparing with the test equation, we now have dg/dy instead of λ. Thus to determine the stability, dg/dy needs to be evaluated at a known value of y (typically the solution at the previous time step y_i).
Example: We use Euler's method to solve the nonlinear IVP

y′ = −y^2, y(0) = 1.

We have

dg/dy = −2y,

which is real. For the test equation, Euler needs h < 2/|λ| for absolute stability. Here we have −2y instead of λ. At time t_i we thus get as stability criterion

h < 2/|−2y_i| = 1/|y_i|.

A sketch of Euler's method with this step-size bound is given after the remarks below.
Remarks
• We neglected higher order terms when we did the linearization. Thus it is safer to
take the step size a little smaller than the one obtained from the linearization.
• The stability criterion depends on the solution yi which is a priori unknown. The
maximum allowable step size to guarantee stability for step i thus needs to be
determined at every step.
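A minimal sketch of Euler's method for this IVP with the step size chosen from the linearized stability bound (the safety factor 0.5 and the final time are assumptions for illustration):

% Euler for y' = -y^2, y(0) = 1, with stability-based step size
t = 0; y = 1; tend = 1;
while t < tend
    h = 0.5 * (1 / abs(y));   % safety factor times the bound 1/|y_i|
    h = min(h, tend - t);     % do not step past tend
    y = y + h * (-y^2);
    t = t + h;
end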
5.9 Discussion
Which method you would choose depends on the type of problem you are solving and on
how many times you are solving such problems. You need to consider the following:
• Implementation time. If you only solve a problem once, you want something that you can program quickly. So you would typically choose an explicit technique, since these are straightforward to program. The larger computing time is often not a problem.
• Accuracy. If you need an accurate solution, you would typically choose a higher-
order method to obtain a small global error (lower computing time for accurate
solutions, but a little harder to implement in more complex problems). If you just
need a rough estimate of what the solution looks like, an O(h) method might be
sufficient.
• Implicit/explicit. If stability is not an issue for the problem you are solving, there is no need to consider implicit techniques: the numerical program is harder to write and there is no benefit compared to explicit techniques (for implicit methods a nonlinear equation needs to be solved every time step, with bisection or Newton for example).
Chapter 6

Systems of initial value problems

Numerical issues:
• Accuracy
• Stability
• Order of convergence

Numerical methods:
• Euler
• Trapezoidal rule
• Runge–Kutta
6.1 Problem description: predator-prey models
In Sec. 3.1 we discussed a population model for a predator-prey system. The resulting
model was

ẋ1 = a x1 − b x1 x2,
ẋ2 = c x1 x2 − d x2,

where x1(t) is the population of prey at time t, x2(t) the population of predators at time t,
and a, b, c, and d some given constants. In Chap. 3, we determined equilibrium solutions
(i.e. when the populations do not change anymore in size, or dx1 /dt = dx2 /dt = 0). In
this chapter, we will solve the transient equations, i.e. we will predict the populations as
a function of time.
As an example, we consider the system of equations with a = d = 2 and b = c = 1,
ẋ1 = 2x1 − x1 x2 ,
ẋ2 = x1 x2 − 2x2 .
6.2 Checking numerical solutions for systems of IVPs
Systems of IVPs are typically difficult to solve analytically. In this section we discuss four
methods to validate a numerical program for solving a system of equations.
We consider a slightly more general system with source terms q1(t) and q2(t),

ẋ1 = 2x1 − x1 x2 + q1(t),
ẋ2 = x1 x2 − 2x2 + q2(t).

We substitute the analytical solution x1(t) and x2(t) that we want and try to find the corresponding q1(t) and q2(t). For example, we take x1(t) = e^{−t} and x2(t) = e^{−2t}. Substitution
gives q1(t) = e^{−3t} − 3e^{−t} and q2(t) = −e^{−3t}. The initial conditions at t = 0 that correspond to the analytical solution are x1(0) = 1 and x2(0) = 1.
Thus we obtained the analytical solution for a slightly more difficult system of IVPs
with ICs x1 (0) = 1 and x2 (0) = 1. This should be sufficient to test the numerical code.
Inside the m-file with the right-hand-side function that is passed to ode45, the two components of the solution vector x are extracted via

x1 = x(1);
x2 = x(2);
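A minimal sketch of such an m-file for the predator-prey system of Sec. 6.1 (the function name funcsys is an assumption; ode45 requires the right-hand side to be returned as a column vector):

function [f] = funcsys(t, x)
% Right-hand side of the predator-prey system
x1 = x(1);
x2 = x(2);
f = [2*x1 - x1*x2;
     x1*x2 - 2*x2];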
The result of ode45 is two arrays with the discrete time values used (ti) and the corresponding approximations to the solution (xi). The first column of the matrix xi contains the approximation to x1 and can be selected using xi(:,1); the second column contains the approximation to x2 and can be selected using xi(:,2).
To satisfy the default tolerance values, 101 grid points are used. The accuracy can be
increased by using odeset, similar to scalar IVPs (See Sec. 5.4).
Fig. 6.1 shows the numerical solution for x1 and x2. We note that the population does not approach the equilibrium solution (x1, x2) = (2, 2) but oscillates periodically around it. This is easily explained by examining the eigenvalues of the Jacobian at (x1, x2) = (2, 2): the Jacobian has purely imaginary eigenvalues, which correspond to a periodic solution.

Figure 6.1: Numerical solution x1(t) and x2(t) of the predator-prey model using ode45.
6.3 One-step methods
An mth order system of IVPs can be solved by applying any one-step method discussed in
Chapter 5 to a system. Conceptually there is nothing new, it is only more work. We just
apply every single step in the one-step method for all components, i.e. m times, instead
of only one time.
Below we discuss Euler, the trapezoidal rule, and RK4 for systems.
Euler
Applying Euler to the vector equation y′ = f(t, y) gives

y_{i+1} = y_i + h f(t_i, y_i).

Note that the right-hand side only contains values at level i, which are all known. The difference with Euler's method for scalar IVPs in Chapter 5 is that y and f are now vectors (arrays) with m components. This means we just need to evaluate the m component functions f_1, ..., f_m and use the values to compute the m components y_{i+1,j}, where i = 0, ..., N denotes the time level and j = 1, ..., m the component.
Trapezoidal rule
Applying the trapezoidal rule to the vector equation y′ = f(t, y) gives

y_{i+1} = y_i + (h/2) [f(t_i, y_i) + f(t_{i+1}, y_{i+1})],

for i = 0, ..., N − 1.
This is an implicit system of equations since the right-hand side also depends on
the a priori unknown y i+1 . This makes it more difficult to compute y i+1 . A nonlinear
system of equations needs to be solved every time step. This can be done using, for
example, Newton’s method for systems (See Sec. 3.3). This requires solving a linear
system Jy i+1 = b(y i , ti , ti+1 ) at every time step. Since the time step is typically small, y i
is usually a good enough initial guess for Newton’s method for systems. Implicit methods
for systems require much more work per time step compared to explicit methods but have
a larger region of stability.
RK4
Applying RK4 to the vector equation y′ = f(t, y) gives
k1 = hf (ti , y i ),
k2 = hf (ti + h/2, y i + k1 /2),
k3 = hf (ti + h/2, y i + k2 /2),
k4 = hf (ti+1 , y i + k3 ),
y i+1 = y i + (k1 + 2k2 + 2k3 + k4 )/6.
The difference with the RK4 method for scalar IVPs in Chapter 5 is that y, f, k_1, k_2, k_3, and k_4 are now vectors (arrays) with m components. All substeps, however, are still explicit: first k_1 needs to be computed for all m components; once all components of k_1 are known, the values of k_2 can be computed, etc.
The structure of the program is the same as for scalar IVPs:

Initializations
    Set initial condition
    Compute number of subintervals N
One-step method
    Do for i = 0, ..., N − 1
        Compute step size h
        Compute next approximation y_{i+1} from the known values h, y_i, t_i, and t_{i+1}
    End do (i-loop)
For Euler's method, the body of the for-loop becomes

f = funcivp(y(i,:), t(i));
y(i+1,:) = y(i,:) + h*f;

where funcivp computes all components of the right-hand-side vector f using solution y_i and time t_i. The colon notation in Matlab does the operation for all possible values at the place of the colon. Thus y(i,:) is a vector of length m with all components of y_i, and the function funcivp gets all components of the vector, as it should. The second line computes y_{i+1} for all components, due to the colon in the two-dimensional array.
Remarks
• By using a function funcivp to compute the right-hand side vector f , you keep euler
general.
• Different one-step methods have the same structure; only the part inside the for-loop needs to be modified. For RK4 just some more explicit substeps need to be performed. If you use the implicit trapezoidal rule, you need to solve a system of nonlinear algebraic equations every step, which can be done with Newton's method for systems; then you need to call a function newtonsys to find the vector y_{i+1}. Since the time step is typically small, the vector y_i is usually a good enough initial guess for Newton's method. (A complete Euler sketch for systems is given after these remarks.)
• Alternatively, for-loops could be used instead of the colon notation. For loops would
typically be used in Fortran-77 or C.
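Putting the pieces together, a sketch of Euler's method for systems (the name eulersys is illustrative; funcivp is assumed to accept and return row vectors, matching the colon notation above):

function [y] = eulersys(t, y0)
% Euler's method for systems of 1st order IVPs
% t  : array with (N+1) grid points t_i
% y0 : row vector of length m with the initial condition
% y  : (N+1) x m array with the approximations
N = length(t) - 1;
y(1,:) = y0;
for i = 1:N
    h = t(i+1) - t(i);
    f = funcivp(y(i,:), t(i));
    y(i+1,:) = y(i,:) + h*f;
end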
The numerical solutions using Euler and RK4 with h = 0.1 are displayed in Fig. 6.2.

Figure 6.2: Numerical solution of the population problem using ode45, Euler and RK4. (a) x1, and (b) x2.

The
solution obtained using RK4 agrees well with the ode45 solution, but the Euler solution
is far off at larger values of t.
6.4 Accuracy
To determine the local truncation error and global error for a system of IVPs at time t_i, we use the l∞ norm

ε_∞ = ‖y(t_i) − y_i‖_∞,

i.e. the maximum over all components j = 1, ..., m.
The results of the analysis of the local truncation error for the scalar test equation
remain valid for systems of IVPs. Furthermore, the global error is still one order lower
than the local truncation error. Thus when there are no discontinuities, we have the
following global errors:
• Euler: O(h).
• RK4: O(h^4).
Example
We solve, using one-step methods, the modified predator-prey system of Sec. 6.2,

ẋ1 = 2x1 − x1 x2 + q1(t),
ẋ2 = x1 x2 − 2x2 + q2(t),

with q1(t) = e^{−3t} − 3e^{−t} and q2(t) = −e^{−3t}, and with ICs x1(0) = 1 and x2(0) = 1. The analytical solution is x1(t) = e^{−t} and x2(t) = e^{−2t}.
The global error at t = 1 for various values of h is given in Table 6.1.

h         Euler      RK4
0.1       4.00e-02   9.69e-06
0.05      1.96e-02   5.88e-07
0.025     9.69e-03   3.62e-08
0.0125    4.82e-03   2.24e-09
0.00625   2.40e-03   1.39e-10

Table 6.1: Global error at t = 1 for the system of IVPs using Euler and RK4.

We observe that RK4 using h = 0.1 is already much more accurate than Euler using h = 0.00625.
The errors are plotted in a log10 h vs. log10 ε_∞ plot in Fig. 6.3. The error behaves as expected: for Euler the slope equals approximately 1, indicating a global error of O(h); for RK4 the slope equals approximately 4, indicating a global error of O(h^4).
Figure 6.3: Global error ε_∞ at t = 1 versus step size h (log10 scales) for Euler and RK4.
6.5 Stability of one-step methods
To determine whether small errors grow without bound or not, we follow the same structure as in Sec. 5.8 for scalar IVPs. We first consider stability for linear systems and then extend the result to nonlinear systems.
6.5.1 Linear systems
For a linear system of IVPs,

y′ = Ay, y(t_0) = y_0,

a one-step method is absolutely stable if

|k(hλ_j)| < 1, j = 1, ..., m,

where λ_j are the m eigenvalues of the matrix A. Note that all m eigenvalues λ_j need to satisfy the stability criterion. If |k(hλ_j)| > 1 for one eigenvalue, the numerical technique becomes unstable.
Using the expressions for the amplifying factors k(hλ) obtained in Sec. 5.6 gives
• Euler: |1 + hλ_j| < 1 for j = 1, ..., m. This means all m values of hλ_j should be inside the circle depicted in Fig. 5.4.
• Trapezoidal rule: |(1 + hλ_j/2)/(1 − hλ_j/2)| < 1 for j = 1, ..., m. This means all m values of hλ_j should be in the left half-plane.
6.5.2 Nonlinear systems
For a nonlinear system of IVPs,

y′ = g(y), y(t_0) = y_0,

a similar analysis as in Sec. 5.8.5 can be performed. Local linearization of the nonlinear system leads to a vector equation for the error ε = y − w,

ε′ = Jε,

where J is the Jacobian matrix. This linear system can be treated as in Sec. 6.5.1.
Example
We use Euler's method for systems to solve the nonlinear system

y1′ = −y1^2,       y1(0) = 1,
y2′ = y1 − 20y2,   y2(0) = 1.

We have the Jacobian

J = [ −2y1    0
        1   −20 ],

which has two real eigenvalues, λ1 = −2y1 and λ2 = −20. For stability we need |k(hλ_j)| < 1, which becomes for Euler h < 2/|λ_j|. This gives two conditions, h < 1/|y1| and h < 0.1, or combined

h < min(1/|y1|, 0.1).

The numerical solution y2 for h = 0.2, h = 0.1, and h = 0.05 is depicted in Fig. 6.4. Stability is in agreement with the linearized theory: absolutely stable for h < 0.1.
Figure 6.4: Unstable, stable, and absolutely stable behavior of Euler’s method for systems
at various h.
Remarks
• We neglected higher order terms when we did the linearization. Thus it is safer to
take the step size a little smaller than the one obtained from the linearization.
• The stability criterion depends on the solution y i which is a priori unknown. The
maximum allowable step size to guarantee stability for step i thus needs to be
determined at every step.
Chapter 7

Partial differential equations

Numerical issues:
• Accuracy
• Stability

Numerical methods:
• Finite differences
• Finite elements
7.1 Problem description: pollution models
7.1.1 Governing equation
In Sec. 4.1 we discussed pollution of a narrow and shallow river for which the concentration
of the pollutant depends on x (coordinate along the river) and time t: c = c(x, t). Flow
will then occur in the x direction only, represented by a scalar velocity v. The resulting
model was the partial differential equation (PDE) of Eq. (4.2)
∂c/∂t = −v ∂c/∂x + D ∂²c/∂x² + r − kc,
where v is the velocity of the river, D the diffusivity, r(x, t) a production term, and k the
rate of decay. In Chap. 4, we determined equilibrium solutions (i.e. when the concentra-
tion does not change anymore in time, or ∂c/∂t = 0) by solving a BVP. In this chapter,
we will solve the partial differential equation, i.e. we will predict the concentration as a
function of time t and space x.
As an example, we consider the system of equations with v = k = r = 0 and D = 1/π 2 ,
∂c/∂t = (1/π²) ∂²c/∂x².        (7.1)
7.1.2 Boundary conditions
Two common types of boundary conditions for Eq. (7.1) are:
• Dirichlet boundary conditions. The concentration is prescribed at the boundary:

c(x = x_b, t) = C_D(t),

where C_D(t) may depend on t.
• Neumann boundary conditions. The mass flux (or concentration gradient) is prescribed at the boundary:

∂c/∂x (x = x_b, t) = C_N(t),

where C_N(t) may depend on t.
7.1.3 Example
The example we use for PDEs throughout this chapter is

∂c/∂t = (1/π²) ∂²c/∂x²,  0 < x < 1,
c(x = 0, t) = 0,  c(x = 1, t) = 0,
c(x, t = 0) = sin(πx).        (7.2)
7.2 Validation of numerical code for PDEs
Numerical calculations to solve PDEs are much more involved than calculations for ODEs.
It is thus very important to check the numerical solution carefully. We discuss three ways.
The first way is to construct an analytical solution. Consider the PDE with an added source term f(x, t),

∂c/∂t = (1/π²) ∂²c/∂x² + f(x, t),  0 < x < 1.
From the theory of partial differential equations we know that solutions can be written
as products of exponentials in time and sine or cosine functions in space. We take

c(x, t) = e^{−t} sin(πx).

We substitute this analytical solution c(x, t) into the PDE and try to find the corresponding f(x, t). We find f(x, t) = 0, which means that c(x, t) = e^{−t} sin(πx) is a solution of
Eq. (7.2). The initial condition that corresponds to this solution is c(x, t = 0) = sin(πx).
The boundary conditions that correspond to this solution are c(x = 0, t) = 0 and c(x =
1, t) = 0. Also both initial and boundary conditions are identical to those in Eq. (7.2).
Thus c(x, t) = e−t sin(πx) is the solution of Eq. (7.2).
Fig. 7.1 shows the analytical solution for various times.
A second way is to compare with Matlab's built-in function pdepe, which solves PDEs of the form

c(x, t, u, ∂u/∂x) ∂u/∂t = x^{−m} ∂/∂x ( x^m f(x, t, u, ∂u/∂x) ) + s(x, t, u, ∂u/∂x).
Figure 7.1: Analytical solution of the pde Eq. (7.2) at various time levels.
For our example pde Eq. (7.2), we have c(x, t, u, ∂u/∂x) = 1, f(x, t, u, ∂u/∂x) = (1/π²) ∂u/∂x, and s(x, t, u, ∂u/∂x) = 0. The boundary conditions for the left and right point should have the form

p(x, t, u) + q(x, t) f(x, t, u, ∂u/∂x) = 0,

where f is identical to the f in the pde. For our example pde Eq. (7.2) we have for both the left and right boundary point p = u and q = 0.
A pde can be solved by typing in the Command Window

sol = pdepe(0, 'pdefun', 'pdeic', 'pdebc', xj, ti);

where the first input argument is the symmetry parameter m (m = 0 corresponds to a slab geometry, as in our example). The array xj should contain the grid points at which you want to obtain the numerical solution. The array ti should contain the time values at which you want to obtain the numerical approximation (note that these are not all the time levels that Matlab uses in the computation, only the time levels at which a solution is stored in the output array sol). Intermediate time levels to obtain a sufficiently accurate solution are determined in pdepe. In the solution matrix sol(i,j), row i corresponds to the selected time level ti and column j to the grid point xj.
The three strings pdefun, pdeic, and pdebc are the names of the m-files that contain the info for the pde, the initial condition, and the boundary conditions, respectively.
For our example pde Eq. (7.2), we would use the following three m-files:
function [c, f, s] = pdefun(x, t, u, DuDx)
c = 1;
f = DuDx / pi^2;
s = 0;

function [u0] = pdeic(x)
u0 = sin(pi*x);

function [pl, ql, pr, qr] = pdebc(xl, ul, xr, ur, t)
pl = ul;
ql = 0;
pr = ur;
qr = 0;
Fig. 7.2 shows the numerical approximation using pdepe together with the exact solution
as a function of x at various times. The results for pdepe were obtained using h = 1/10.
We see that qualitatively (on the scale of the figure) the numerical solution agrees well with the exact solution.
Figure 7.2: Exact solution (solid line) and numerical approximation using pdepe (symbols)
of Eq. (7.2) at times indicated in the legend.
7.3 Solving PDEs numerically: Introduction
The solution c(x, t) of Eq. (7.2) depends on spatial coordinate x and time t. Thus we now
need a grid for the spatial and time integration. For the spatial discretization we use a
grid xj for j = 0, . . . , m as we used to solve BVPs in Chap. 4. To keep the algebra as
simple as possible, we only consider equally spaced grid points xj with grid size h. For the
time discretization we use a grid tk for k = 0, . . . , n as we used to solve IVPs in Chap. 5.
To keep the algebra as simple as possible, we only consider equally spaced times tk with
step size ∆t.
In a PDE, partial derivatives need to be discretized. Discretization of partial derivatives is identical to the discretization of derivatives of functions of one variable. Thus for Eq. (7.2) we can use the finite difference formulas or the finite element method discussed for BVPs to discretize ∂²c/∂x² (See Sec. 4.5 and 4.7).
The general strategy to discretize a PDE in time and 1D space is as follows.
• First discretize the PDE in the spatial direction x. For this you can use finite differences or finite elements, similar to BVPs (See Sec. 4.5 and 4.7). This results in a system of IVPs in time for the unknowns at the grid points.
• Then solve the system of IVPs using any method for systems of IVPs, for example Euler, the trapezoidal rule, or RK4 for systems (See Chap. 6).
7.4 Finite differences
We only consider the PDE with boundary and initial conditions as described in Eq. (7.2).
The PDE is first discretized in the x direction using finite difference formulas discussed in
Sec. 4.5. The key difference with section 4.5 is that after the discretization in space the
approximate values at the nodes are still a function of time: cj = cj (t). Thus after using
the central O(h2 ) approximation for ∂ 2 c/∂x2 we get the m − 1 ODEs and 2 boundary
conditions
c_0 = 0,

dc_j/dt = −(1/(π²h²)) [−c_{j−1}(t) + 2c_j(t) − c_{j+1}(t)],  j = 1, ..., m − 1,

c_m = 0,
with initial condition cj (t = 0) = sin(πxj ). Note that the partial derivative with respect
to time has become a d/dt derivative since cj only depends on time.
Next we eliminate the boundary conditions, to get a system of IVPs (See Sec. 4.6).
Only the equation for j = 1 contains c0 and only the equation for node j = m − 1 contains
cm . Substitution of c0 = 0 and cm = 0 gives
dc_1/dt = −(1/(π²h²)) (−0 + 2c_1 − c_2) = −(1/(π²h²)) (2c_1 − c_2),

dc_j/dt = −(1/(π²h²)) [−c_{j−1}(t) + 2c_j(t) − c_{j+1}(t)],  j = 2, ..., m − 2,

dc_{m−1}/dt = −(1/(π²h²)) (−c_{m−2} + 2c_{m−1} − 0) = −(1/(π²h²)) (−c_{m−2} + 2c_{m−1}),
with initial condition cj (t = 0) = sin(πxj ).
We obtained a system of IVPs which can be written in general matrix-vector form
dc/dt = P c + r,

with

P = −(1/(π²h²)) ×
    [  2  −1   0   ⋯   0
      −1   2  −1   ⋱   ⋮
       0   ⋱   ⋱   ⋱   0
       ⋮   ⋱  −1   2  −1
       0   ⋯   0  −1   2 ],

c = (c_1, c_2, ..., c_{m−2}, c_{m−1})^T,  r = (0, 0, ..., 0, 0)^T,
with initial condition c(t = 0) = sin(πx). Note that in general r is non-zero due to
non-zero boundary conditions and/or a non-zero right-hand-side function r(x, t) in the
PDE.
The system of IVPs can be solved in time using any of the numerical techniques discussed in Sec. 6.3 for c′ = f(t, c). The right-hand-side vector is now f(t, c) = P c + r.
Remarks
• Since you have a large system of equations, and you want to be able to change the number of grid points m easily, you would use a for-loop to compute the right-hand-side vector f(t, c) = P c + r (a sketch is given after these remarks).
• For nonlinear PDEs you need to solve a nonlinear system every time step, using for
example Newton’s method for systems.
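As announced in the first remark, a sketch of a right-hand-side function for the interior equations of this section (the name funcpde and the argument list are assumptions):

function [f] = funcpde(c, t, h)
% Right-hand side f(t, c) = P*c + r for Eq. (7.2); r = 0 for the
% homogeneous Dirichlet boundary conditions used here
M = length(c);                 % M = m - 1 interior points
f = zeros(M, 1);
f(1) = -(2*c(1) - c(2)) / (pi^2 * h^2);
for j = 2:M-1
    f(j) = -(-c(j-1) + 2*c(j) - c(j+1)) / (pi^2 * h^2);
end
f(M) = -(-c(M-1) + 2*c(M)) / (pi^2 * h^2);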
Fig. 7.3 shows the finite difference approximation together with the exact solution as a function of x at various times. The results for the FD method were obtained using h = 1/10 and ∆t = 2·10^{−3}. We see that qualitatively (on the scale of the figure) the numerical solution agrees well with the exact solution.
Figure 7.3: Exact solution (solid line) and FD approximation (symbols) of Eq. (7.2) at
times indicated in the legend.
7.5 Finite elements
We only consider the PDE with boundary and initial conditions as described in Eq. (7.2).
The PDE is first discretized in the x direction using finite elements. We only consider
linear finite elements as in Sec. 4.7. The key difference with section 4.7 is that after the
discretization in space the approximate values at the nodes are still a function of time:
c(x, t) = Σ_{j=0}^{m} c_j(t) φ_j(x).
The weak form of the heat equation (PDE) is again obtained by multiplying by a test function ψ and integrating over the domain:

∫_0^1 (∂c/∂t) ψ dx = −(1/π²) ∫_0^1 (∂c/∂x)(∂ψ/∂x) dx + (1/π²) [ψ ∂c/∂x]_0^1,
with ψ = φ_i, for i = 0, ..., m. The contributions of the element integrals can be put in two element
matrices (the element vector is the zero vector). After evaluating the element integrals,
we obtain
k^(l) = (h/6) [2 1; 1 2],    p^(l) = −(1/(π²h)) [1 −1; −1 1],
where k (l) is the element matrix corresponding to the dcj /dt term and p(l) the element
matrix corresponding to the cj term.
Assembling into two global matrices K and P gives the (m + 1) × (m + 1) matrices

K = (h/6) ×
    [ 2  1  0  ⋯  0
      1  4  1  ⋱  ⋮
      0  ⋱  ⋱  ⋱  0
      ⋮  ⋱  1  4  1
      0  ⋯  0  1  2 ],

P = −(1/(π²h)) ×
    [  1  −1   0   ⋯   0
      −1   2  −1   ⋱   ⋮
       0   ⋱   ⋱   ⋱   0
       ⋮   ⋱  −1   2  −1
       0   ⋯   0  −1   1 ].
Boundary conditions are handled exactly the same way as in Sec. 4.7. For the two Dirichlet boundary conditions considered here, we replace the equation for i = 0 by c_0 = 0, and the equation for i = m by c_m = 0. Next we eliminate the Dirichlet boundary conditions to get a system of IVPs (See Sec. 4.6). Note that we not only need to eliminate c_0 and c_m from the equations, but also dc_0/dt and dc_m/dt. This does not lead to major complications since we know the Dirichlet boundary conditions as a function of time. Since the values of the boundary conditions we consider are independent of time, both derivatives are zero. Only the equation for i = 1 contains contributions from c_0 and dc_0/dt, and only the equation for i = m − 1 contains c_m and dc_m/dt. Substitution of c_0 = 0, dc_0/dt = 0, c_m = 0, and dc_m/dt = 0 gives in matrix-vector form

K dc/dt = P c + r,

where K and P are now (m − 1) × (m − 1) matrices and r an (m − 1)-vector:

K = (h/6) ×
    [ 4  1  0  ⋯  0
      1  4  1  ⋱  ⋮
      0  ⋱  ⋱  ⋱  0
      ⋮  ⋱  1  4  1
      0  ⋯  0  1  4 ],

P = −(1/(π²h)) ×
    [  2  −1   0   ⋯   0
      −1   2  −1   ⋱   ⋮
       0   ⋱   ⋱   ⋱   0
       ⋮   ⋱  −1   2  −1
       0   ⋯   0  −1   2 ],

r = (0, 0, ..., 0)^T.
Applying Euler's method with time step ∆t (here r = 0) gives

c_{i+1} = c_i + ∆t K^{−1} P c_i.

Since it is computationally expensive to compute the inverse, however, you would multiply by K and solve the system

K c_{i+1} = (K + ∆t P) c_i.

Since K is tridiagonal, this system can be solved very efficiently using Crout's method (See Sec. 4.9). Also, no full matrix needs to be stored; just the three diagonals is sufficient. Also for the right-hand side there is no need to introduce the big matrices K and P explicitly: the result of the matrix-vector product (K + ∆t P) c_i is a vector, and that is all you need.
Fig. 7.4 shows the linear finite element approximation together with the exact solution as a function of x at various times. The results for FEM were obtained using h = 1/10 and ∆t = 2·10^{−3}. We see that qualitatively (on the scale of the figure) the numerical solution agrees well with the exact solution.
Figure 7.4: Exact solution (solid line) and FEM approximation (symbols) of Eq. (7.2) at
times indicated in the legend.
7.6 Stability
In Sec. 6.5, we discussed stability for a system of IVPs. For a linear system y′ = Ay we need all eigenvalues λ_j of A to satisfy the stability criterion |k(∆t λ_j)| < 1. To determine the values of ∆t for which a computation is stable we need to find the eigenvalues of A.
Eigenvalues
To find the maximum eigenvalue of a matrix A we can calculate the eigenvalues or estimate
them:
• The eigenvalues of a matrix A can be calculated with the built-in Matlab function eig. For example,

A = [2 1; 1 2];
lambda = eig(A)

creates an array lambda with all eigenvalues:

lambda =
     1
     3
The built-in Matlab function max can compute the maximum of an array of values
lammax = max(lambda)
gives
lammax =
3
The disadvantage is that we can only use this for a matrix with numerical values, not with a variable h. In addition, it may take a lot of computing time to compute eigenvalues for large matrices.
– Gerschgorin's theorem
To find approximations for the eigenvalues τ using Gerschgorin's theorem, we need to check for every row k of a matrix A

|τ − a_kk| ≤ Σ_{j=1, j≠k}^{N} |a_kj|.

The right-hand side is just the sum of the magnitudes of the off-diagonal entries in row k (a small Matlab check is sketched after this list).
– Rayleigh's quotient
Rayleigh's quotient is useful to relate eigenvalues of the element matrix to eigenvalues of the global matrix in the finite element method. How to use Rayleigh's quotient exactly is outside the scope of Math 4414.
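As mentioned above, a small sketch of a Gerschgorin check in Matlab (the example matrix is the one used earlier):

% Gerschgorin disks: |tau - a_kk| <= sum of |a_kj|, j ~= k
A = [2 1; 1 2];
for k = 1:size(A, 1)
    radius = sum(abs(A(k,:))) - abs(A(k,k));
    fprintf('row %d: |tau - %g| <= %g\n', k, A(k,k), radius);
end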
Eigenvalues of symmetric matrices
From linear algebra, we know that symmetric matrices have real eigenvalues. This restricts the region of absolute stability to the part on the real axis. Thus for symmetric matrices we need for stability of Euler's method ∆t ≤ 2/|λ_j| and for RK4 approximately ∆t ≤ 2.75/|λ_j|. This needs to hold for all eigenvalues λ_j; thus we only need to determine the largest eigenvalue in magnitude.
For the finite difference matrix P of Sec. 7.4, whose interior rows have the entries −(1/(π²h²)) (−1  2  −1), applying Gerschgorin's theorem for each row j = 2, ..., m − 2 gives

|τ + 2/(π²h²)| ≤ 2/(π²h²),

so all eigenvalues are real and lie in the interval −4/(π²h²) ≤ τ ≤ 0.
Figure 7.5: Numerical solution of Eq. (7.2) using finite differences and Euler. (a) h = 0.04
within region of absolute stability, and (b) h = 0.06 outside region of absolute stability.
7.7 Accuracy
In the discretization of a PDE, we make a discretization error for the time derivatives
O((∆t)p ) and for the spatial derivatives O(hq ). The total error is the sum of these:
= O((∆t)p ) + O(hq ). For example, consider a spatial discretization with error O(h2 ). If
we use for the time discretization Euler’s method the total discretization error would be
= O(∆t) + O(h2 ) and if we use RK4 = O((∆t)4 ) + O(h2 ).
We simulate Eq. (7.2) numerically. Since the behavior of the error is similar for finite differences and finite elements, we only consider finite differences, with O(h²) finite differences in space and Euler and RK4 for the time discretization. We look at the global error at t = 1/2 and take the l∞ norm over all grid points, ε = ‖y(x_i, t = 1/2) − y_i(t = 1/2)‖_∞. Fig. 7.6(a) shows the global error for a discretization in space with h = 1/10 and various step sizes ∆t, and Fig. 7.6(b) the global error for a time step ∆t = 10^{−3} and various grid sizes h.
Figure 7.6: Global error at t = 1/2 using O(h2 ) finite differences for the heat equation
using Euler and RK4. (a) h = 1/10 and various ∆t. (b) ∆t = 10−3 and various h.
We observe in both figures that the solution obtained with RK4 is not more accurate than the one obtained with Euler's method. To explain this maybe unexpected result, we need to look more carefully at the discretization errors we make. For Euler the total discretization error is O(∆t) + O(h²). However, for stability we need ∆t < π²h²/2, i.e. ∆t = O(h²). Thus the total error we make is

ε = O(∆t) + O(h²) = O(h²).

For RK4 the total discretization error is O((∆t)^4) + O(h²). However, for stability we need ∆t < 2.75π²h²/4, i.e. again ∆t = O(h²). Thus the total error we make is

ε = O((∆t)^4) + O(h²) = O(h^8) + O(h²) = O(h²),

which is the same order as Euler's method.
Thus, if we take a very accurate discretization in time using Euler or RK4 and a not very accurate discretization in space, the total error is dominated by the error in the space discretization, O(h²). This situation we cannot avoid, since we need a small ∆t to satisfy the stability criterion.
Remarks
• For the PDE considered here, there is no advantage in using RK4 instead of Euler: only more computing time for the same accuracy.
• An implicit technique like the trapezoidal rule would be very useful here. There is no stability criterion for ∆t, and the error is ε = O((∆t)²) + O(h²). Both ∆t and h can be varied independently to obtain a more accurate solution.
7.8 Solving linear and non-linear systems for PDEs
For explicit finite difference methods, the values c_j at the new time level n + 1 are obtained directly. Other methods require solving a linear or non-linear system of equations. In this section we discuss some efficient solution methods for the linear and non-linear systems resulting from the discretization of a PDE in time and 1D space.

Non-linear systems
A nonlinear system in c_{i+1} results if the PDE is nonlinear in c and an implicit method like the trapezoidal rule is used for the time discretization. The nonlinear system can be written in the form f(c_{i+1}) = 0 and can be solved, for example, with Newton's method.
Often c_i, the solution vector at the previous time step, is a good enough initial guess c_{i+1}^{(0)} in the Newton iteration

J(c_{i+1}^{(k−1)}) ∆c_{i+1}^{(k)} = −f(c_{i+1}^{(k−1)}, c_i).

Since J depends on c_{i+1}^{(k−1)}, a factorization now needs to be performed every Newton step (for each time step). If J is tridiagonal, the factorization and the forward/backward substitution can be done efficiently using the O(m) Crout method. Otherwise, an O(m³) LU factorization for more general matrices is necessary (See Sec. 7.8.2).
7.8.2 LU factorization
LU factorization without row interchanges
For some types of matrices, no row interchanges are necessary in the Gaussian elimination process. Then the m × m matrix A can be factored into A = LU, with L a lower-triangular matrix and U an upper-triangular matrix:

L = [ 1      0    ⋯        0
      l_21   1    ⋱        ⋮
      ⋮      ⋱    ⋱        0
      l_m1   ⋯    l_m,m−1  1 ],

U = [ u_11   u_12  ⋯  u_1m
      0      u_22  ⋱  ⋮
      ⋮      ⋱     ⋱  u_m−1,m
      0      ⋯     0  u_mm ].

To solve Ax = b, first solve

Ly = b

using forward substitution (first y_1, then y_2, etc.), and then solve

Ux = y

using backward substitution (first x_m, then x_{m−1}, etc.) to find the solution x of the system LUx = b.
Both the upper and lower triangular systems only take O(N 2 ) operations to solve.
Thus once we have the LU factorization, it is relatively cheap to solve a system involving
the matrix A = LU and any vector b. However, the LU factorization needs to be
computed first, which takes O(2N 3 /3) operations.
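A sketch of the two triangular solves in Matlab, using the factors from the example in the Matlab paragraph below and an assumed right-hand side b:

L = [1 0 0; -0.5 1 0; 0.25 0 1];
U = [4 8 -1; 0 7 -5.5; 0 0 6.25];
b = [1; 2; 3];
m = length(b);
y = zeros(m, 1);               % forward substitution: L*y = b
for k = 1:m
    y(k) = b(k) - L(k, 1:k-1) * y(1:k-1);
end
x = zeros(m, 1);               % backward substitution: U*x = y
for k = m:-1:1
    x(k) = (y(k) - U(k, k+1:m) * x(k+1:m)) / U(k, k);
end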
PLU factorization with row interchanges
If row interchanges are necessary, then a permutation matrix P exists so that PA = LU. This just means that the rows can be interchanged (via P) so that an LU factorization exists. Thus we solve

PAx = Pb,

which is "Ax = b with a different order of the rows". For PA we can make an LU factorization, so we solve

LUx = Pb.

Starting from the matrices P, L, and U and the vector b, we can find the solution x of Ax = b in three steps:
• compute z = Pb (O(N²) operations),
• solve Ly = z using forward substitution (O(N²) operations),
• solve Ux = y using backward substitution (O(N²) operations).
Saving memory
The matrices L and U contain a lot of zeros. To store both L and U in a separate matrix is a waste of memory. Usually, L and U are stored in a single matrix

[ u_11   u_12  ⋯        u_1m
  l_21   u_22  ⋱        ⋮
  ⋮      ⋱     ⋱        u_m−1,m
  l_m1   ⋯     l_m,m−1  u_mm ].

Note that the 1's on the diagonal of L are not stored. This is not necessary, since we know exactly what the values on that diagonal are: they are always 1, so we can just use l_ii = 1 wherever the l_ii's are needed. Often the matrix A is no longer necessary after the LU factorization has been performed. By overwriting the matrix A with L and U, we don't need any additional memory at all.
• If you need to solve Ax = b several times with the same matrix A, you need to do the expensive LU factorization only once; solving the triangular systems is relatively cheap. This occurs, for example, when an implicit method is applied to a linear, time-independent problem, so that the same matrix has to be solved every time step.
Matlab
A PLU factorization of a matrix A can be obtained in Matlab using the built-in function
lu. For example:
A = [1 2 6; 4 8 -1; -2 3 -5]
[L, U, P] = lu(A)
gives
L = [  1     0  0
      −0.5   1  0
       0.25  0  1 ],

U = [ 4  8  −1
      0  7  −5.5
      0  0  6.25 ],

P = [ 0  1  0
      0  0  1
      1  0  0 ].
If we check P*A - L*U this gives indeed the zero matrix.
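Given the three factors, a system Ax = b can then be solved with one permutation and two cheap triangular solves; for an assumed right-hand side b:

b = [1; 2; 3];
x = U \ (L \ (P*b));   % forward and backward substitution via \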
An LU factorization with the L and U matrices stored in a single matrix K can be done by using lu with a single output argument. For example,

A = [4 8 -1; -2 3 -5; 1 2 6]
K = lu(A)

gives

K = [  4     8  −1
      −0.5   7  −5.5
       0.25  0  6.25 ].