
Introduction to Numerical Methods

I. Introduction
1.1 Why Numerical Method?
Example 1.
Steady state heat conduction in a square cavity ABCD
(A bottom-left, B bottom-right, C top-right, D top-left):
Governing equation: ∇²T = 0
Boundary conditions (from the sketch): ∂T/∂y = 0 on wall DC (insulated);
∂T/∂x = 0 on wall BC; T = T1 on wall AD; T = T2 > T1 on wall AB.
How do we determine the heat flow from wall AB to wall AD?
Possible solutions:
1. Experiment
2. Analytical solution
3. Numerical solution
Example 2.
Steady state heat conduction in a non-simple geometry:
the same cavity ABCD, now containing an internal block EFG.
Governing equation: ∇²T = 0
Boundary conditions (from the sketch): ∂T/∂y = 0 on wall DC (insulated);
∂T/∂x = 0 on wall BC; T = T1 on wall AD; T = T2 > T1 on wall AB;
the faces of the block are insulated (∂T/∂x = 0, ∂T/∂y = 0).
How do we determine the heat flow from wall AB to wall AD?

Example 3.
Unsteady state heat conduction in a non-simple geometry:
∂T/∂t = α ∇²T

Notes organized by Prof. Renwei Mei, University of Florida 1


Experimental approach:
• Design the experiment
• Set up a facility to satisfy the boundary conditions (insulation and
constant temperature)
• Prepare instrumentation
• Perform the experiment & collect data
(measure the heat flux on wall AB, for example)
• Analyze and present the data
• Develop a model (say, to describe the effect of the ratio L_AB/L_CD on
heat transfer)
• O.K. for all three examples; but it yields no knowledge of T(x, y) inside
the cavity, and it is relatively time consuming & tedious.
Analytical approach:
* Solve the mathematical equation (∇²T = 0 or ∂T/∂t = α ∇²T)
based on a physical model.
• Use the method of separation of variables? Green's function?
* O.K. for the simple geometry in Example 1;
unlikely for Examples 2 & 3.

Numerical approach:
• Solve equations that can be much more complicated than
∂T/∂t = α ∇²T, using a computer.
• The solution is discrete and approximate, but it can be close to exact.
(Sketch: the discrete solution values y1, ..., yn, yn+1 at times
t1, t2, ..., tn, tn+1, spaced by a time step Δt.)
• The program can handle more complicated geometries
(as in Examples 2 & 3).
• Gain insight into the temperature distribution inside the cavity.
• Cautions:
† NO numerical method is completely trouble free in all situations.
† NO numerical method is completely error free.
† NO numerical method is optimal for all situations.
† Be careful about:
ACCURACY, EFFICIENCY, & STABILITY.

* Example 4.
Solve a simple ODE: dy/dt = -10y, y(0) = 1.
First note: the exact solution is yexact(t) = e^(-10t).
Approximate the LHS dy/dt as (y_(n+1) - y_n)/Δt
and the RHS -10y as -10 y_n (n = 1, 2, 3, ...),
i.e. (y_(n+1) - y_n)/Δt = -10 y_n
=> y_(n+1) = y_n - 10Δt y_n = y_n (1 - 10Δt), with y_1 = 1.
Hence y_(n+1) = y_n (1 - 10Δt) = y_(n-1) (1 - 10Δt)² = y_(n-2) (1 - 10Δt)³
= ... = y_1 (1 - 10Δt)ⁿ.
Choose Δt = 0.05, 0.1, 0.2, and 0.5, so that 10Δt = 0.5, 1, 2, and 5.
See what happens!

n     yn(Δt=0.05)   yn(Δt=0.1)   yn(Δt=0.2)   yn(Δt=0.5)
1     1             1            1            1
2     0.5           0            -1           -4
3     0.25          0            1            16
4     0.125         0            -1           -64
5     0.0625        0            1            256
6     0.03125       0            -1           -1024
7     0.0156250     0            1            4096
8     0.0078125     0            -1           ...
      ok            inaccurate   oscillates   blows up
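The table can be regenerated directly from the recurrence y_(n+1) = y_n (1 - 10Δt). A minimal sketch in Python (the notes' own listings are in Fortran; the function name is ours):

```python
def euler_decay(dt, steps):
    """Explicit Euler for dy/dt = -10*y, y(0) = 1: y_{n+1} = y_n*(1 - 10*dt)."""
    y, ys = 1.0, [1.0]
    for _ in range(steps):
        y *= 1.0 - 10.0 * dt
        ys.append(y)
    return ys

for dt in (0.05, 0.1, 0.2, 0.5):
    # dt=0.05 decays, dt=0.1 jumps to 0, dt=0.2 oscillates, dt=0.5 blows up
    print(dt, euler_decay(dt, 5))
```

The amplification factor (1 - 10Δt) tells the whole story: |1 - 10Δt| < 1 decays, |1 - 10Δt| = 1 oscillates without decay, and |1 - 10Δt| > 1 grows without bound.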

(Graph: the exact solution yexact(t) = e^(-10t) compared with the stable
numerical solution for Δt = 0.05, over 0 ≤ t ≤ 1.5; the numerical values
track the exact decay from 1 toward 0.)

Questions:
Why does the solution blow up for Δt = 0.5?
How do we detect and prevent numerical instability (blow-up) in general?
How do we improve the accuracy (c.f. the case with Δt = 0.05)?
How do we obtain the solution efficiently when a large system is solved?

Graphs based on numerical solutions to heat transfer problems:
Steady State Temperature Contour
(Figure: contour levels 0.1-0.9 of the steady temperature in the unit
square cavity, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.)
Steady Temperature with a Block
(Figure: contour levels 0.1-0.9 of the steady temperature in the same
cavity with an internal block; the block distorts the isotherms.)

1.2 Mathematical Preliminary
1.2.1 Intermediate Value Theorem

Let f(x) be a continuous function on the finite interval a ≤ x ≤ b,
and define
m = Infimum of f(x) over a ≤ x ≤ b, M = Supremum of f(x) over a ≤ x ≤ b.
Then for any number z in the interval [m, M], there is at least
one point ξ in [a, b] for which
f(ξ) = z.
In particular, there are points xm and xM in [a, b] for which
m = f(xm) and M = f(xM).
(Sketch: the graph of f(x) on [a, b], marking the point ξ at which f
attains an intermediate value, and the points where f attains m and M.)

1.2.2 Mean Value Theorem
Let f(x) be a continuous function on the finite interval a ≤ x ≤ b,
and let it be differentiable for a < x < b. Then there is at least
one point ξ in [a, b] for which
f(b) - f(a) = f′(ξ) (b - a). (1)

The graphical interpretation of this theorem is shown below.
(Sketch: the chord through (a, f(a)) and (b, f(b)) has slope
[f(b) - f(a)]/(b - a); at x = ξ the tangent, with slope f′(ξ), is
parallel to it.)
1.2.3 Integral Mean Value Theorem (IMVT)
Let w(x) be nonnegative and integrable on [a, b] and let f(x) be
continuous on [a, b]. Then
∫_a^b w(x) f(x) dx = f(ξ) ∫_a^b w(x) dx (2)
for some ξ in [a, b].

1.2.4 Taylor series expansion
* Let f(x) have n+1 continuous derivatives on [a, b] for some n ≥ 0,
and let x, x0 ∈ [a, b]. Then
f(x) = Pn(x) + Rn+1(x) (3)
where
Pn(x) = f(x0) + (x - x0)/1! f′(x0) + (x - x0)²/2! f″(x0)
+ (x - x0)³/3! f‴(x0) + ... + (x - x0)ⁿ/n! f⁽ⁿ⁾(x0) (4)
Rn+1(x) = (1/n!) ∫_x0^x (x - t)ⁿ f⁽ⁿ⁺¹⁾(t) dt
(IMVT) = (x - x0)ⁿ⁺¹/(n+1)! f⁽ⁿ⁺¹⁾(ξ) (ξ between x0 & x) (5)
= truncation error of the expansion = T.E.

(This figure shows the approximation of f(x) using 0th, 1st, & 2nd
order Taylor series expansions.)

* Examples:
f(x+h) = f(x) + h f′(x) + h²/2! f″(x) + h³/3! f‴(x) + ... (6)
f(x-h) = f(x) - h f′(x) + h²/2! f″(x) - h³/3! f‴(x) + ... (7)

* List of Taylor series for commonly encountered functions
(with x0 = 0):
(1+x)^α = 1 + αx + α(α-1)/2! x² + α(α-1)(α-2)/3! x³ + ..., |x| < 1
1/(1-x) = 1 + x + x² + x³ + x⁴ + ..., |x| < 1
√(1+x) = 1 + x/2 - x²/8 + x³/16 - 5x⁴/128 + 7x⁵/256 - ..., |x| < 1
sin(x) = x - x³/3! + x⁵/5! - x⁷/7! + ...,
cos(x) = 1 - x²/2! + x⁴/4! - x⁶/6! + ...,
tan(x) = x + x³/3 + 2x⁵/15 + 17x⁷/315 + 62x⁹/2835 + ..., |x| < π/2
sec(x) = 1 + x²/2 + 5x⁴/24 + 61x⁶/720 + 277x⁸/8064 + ..., |x| < π/2
arcsin(x) = x + x³/6 + 3x⁵/40 + 5x⁷/112 + 35x⁹/1152 + ..., |x| < 1
arccos(x) = π/2 - x - x³/6 - 3x⁵/40 - 5x⁷/112 - ..., |x| < 1
arctan(x) = x - x³/3 + x⁵/5 - x⁷/7 + x⁹/9 - ..., |x| < 1
exp(x) = 1 + x + x²/2! + x³/3! + x⁴/4! + ...,
sinh(x) = x + x³/3! + x⁵/5! + x⁷/7! + ...,
cosh(x) = 1 + x²/2! + x⁴/4! + x⁶/6! + ...,
sech(x) = 1/cosh(x) = 1 - x²/2 + 5x⁴/24 - 61x⁶/720 + 277x⁸/8064 - ...,
tanh(x) = x - x³/3 + 2x⁵/15 - 17x⁷/315 + 62x⁹/2835 - ...,
sinh⁻¹(x) = x - x³/6 + 3x⁵/40 - 5x⁷/112 + 35x⁹/1152 - ..., |x| < 1
tanh⁻¹(x) = x + x³/3 + x⁵/5 + x⁷/7 + x⁹/9 + ..., |x| < 1
ln(1+x) = x - x²/2 + x³/3 - ..., |x| < 1
ln[(1+x)/(1-x)] = 2x + 2x³/3 + 2x⁵/5 + 2x⁷/7 + ..., |x| < 1


* Example: What is the error, or remainder, R4, in the
Taylor series expansion of sin(x)?
Soln. For f(x) = sin(x): f′(x) = cos(x), f″(x) = -sin(x),
f‴(x) = -cos(x), f⁗(x) = sin(x)
=> sin(x) = x - x³/3! + ... = P3(x) + R4(x)
with P3(x) = x - x³/3!
& R4(x) = (1/3!) ∫_0^x (x-t)³ f⁗(t) dt = (1/3!) ∫_0^x (x-t)³ sin(t) dt
= (1/3!) ∫_0^x sin(t) d[-(x-t)⁴/4] (integration by parts =>)
= (-1/(3!·4)) [sin(t)(x-t)⁴ |_0^x - ∫_0^x (x-t)⁴ cos(t) dt]
= (1/4!) ∫_0^x cos(t)(x-t)⁴ dt (IMVT =>)
= (1/4!) cos(ξ) ∫_0^x (x-t)⁴ dt = (x⁵/5!) cos(ξ),
for ξ between 0 and x.
Hence sin(x) = x - x³/3! + (x⁵/5!) cos(ξ).
Note: R4(x) may also be expressed as
(1/3!) sin(ξ) ∫_0^x (x-t)³ dt = (x⁴/4!) sin(ξ).
However, since ξ is between 0 & x with |x| « 1, sin(ξ) is itself small
and unknown, so this estimate, R4 ~ (x⁴/4!) sin(ξ), is not as useful.
Since cos(ξ) ~ 1 for small |x|, R4 ~ x⁵/5! is the more useful estimate.
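The claim that R4 ≈ x⁵/5! for small |x| is easy to confirm numerically; a quick Python check (helper name ours):

```python
import math

def p3(x):
    # cubic Taylor polynomial of sin(x) about 0
    return x - x**3 / math.factorial(3)

x = 0.1
r4 = math.sin(x) - p3(x)        # actual remainder R4(x)
est = x**5 / math.factorial(5)  # R4 ~ (x^5/5!) cos(xi), with cos(xi) ~ 1
print(r4, est)                  # the two agree to better than 0.1%
```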

* In practice, the next term beyond Pn(x) in the TS expansion is used to represent Rn+1(x).
1.2.5 Taylor series expansion in two dimensions
• Let f(x, y) be n+1 times continuously differentiable for all (x, y)
in some neighborhood of (x0, y0). Then,
f(x0+ξ, y0+η) = f(x0, y0) + Σ_(m=1)^n (1/m!) Dᵐf(x, y) |_(x0, y0)
+ (1/(n+1)!) Dⁿ⁺¹f(x, y) |_(x0+θξ, y0+θη) (8)
where D ≡ ξ ∂/∂x + η ∂/∂y and 0 ≤ θ ≤ 1.
Example: Consider f(x, y) = ln[1 + 2x + x² + xy + y³]^(1/2).
Find its Taylor series expansion near x = y = 0.
Solution: Method 1.
f(x, y) = (1/2) ln[1 + 2x + x² + xy + y³], f(0, 0) = 0
∂f/∂x = (2 + 2x + y) / (2[1 + 2x + x² + xy + y³]), ∂f/∂x (0,0) = 1;
∂f/∂y = (x + 3y²) / (2[1 + 2x + x² + xy + y³]), ∂f/∂y (0,0) = 0;
∂²f/∂x² = {2[1 + 2x + x² + xy + y³] - (2 + 2x + y)²} / (2[1 + 2x + x² + xy + y³]²),
∂²f/∂x² (0,0) = -1;
∂²f/∂y² = {6y[1 + 2x + x² + xy + y³] - (x + 3y²)²} / (2[1 + 2x + x² + xy + y³]²),
∂²f/∂y² (0,0) = 0;
∂²f/∂x∂y = {[1 + 2x + x² + xy + y³] - (x + 3y²)(2 + 2x + y)} / (2[1 + 2x + x² + xy + y³]²),
∂²f/∂x∂y (0,0) = 1/2.
Thus, f(x, y) ~ 0 + x + 0·y + (1/2)(-1) x² + (1/2) xy + 0·y² + ...
= x - (1/2) x² + (1/2) xy + ...
Method II:
Let η = 2x + x² + xy + y³.
Note: ln(1+η) = η - η²/2 + η³/3 - ...
∴ f = (1/2) ln[1 + 2x + x² + xy + y³]
= (1/2) {[2x + x² + xy + y³] - (1/2)[2x + x² + xy + y³]² + ...}
= x + (1/2)x² + (1/2)xy - (1/4)(4x²) + ... = x - (1/2)x² + (1/2)xy + ...
(This approach is much simpler!)

1.3 Sources of Errors in Computations:
1.3.1 Absolute and relative errors:
• True value (T.V.) xT = Approximate value (A.V.) xA + Error ε
Absolute error = T.V. - A.V. = ε (9)
Relative error: Rel.(xA) = (T.V. - A.V.) / T.V. = ε / xT (10)
1.3.2 Types of errors

• Modeling error -- e.g. neglecting friction in computing
a bullet trajectory
• Empirical measurements -- g (gravitational acceleration),
h (Planck constant), ...
• Blunders
• Inexact input data -- e.g. weather prediction based on collected data
• Round-off error -- e.g. π ≈ 3.1415927 instead of
3.1415926535897932384...
• Truncation error -- e.g. eˣ ≈ 1 + x + x²/2 + x³/3! + x⁴/4! for small x,
or dy/dt ≈ (y_(n+1) - y_n)/Δt for small Δt.
* Example: The surface area of the earth may be approximated as
A = 4πr²
Errors in the various approximations:
† The earth is modeled as a sphere (an idealization)
† Earth radius r ≈ 6378.14 km, from measurements
† π ≈ 3.14159265
† A calculator or computer has finite word length; the result is rounded.
* Example of Truncation Error in a Taylor series:
f(x) = f(x0) + (x - x0)/1! f′(x0) + (x - x0)²/2! f″(x0) + (x - x0)³/3! f‴(x0)
+ ... + (x - x0)ⁿ/n! f⁽ⁿ⁾(x0) + Rn+1(x) (4)
Rn+1(x) = Remainder or Truncation Error (ET)
• Rn+1(x) or ET can be estimated as
Rn+1(x) = (1/n!) ∫_x0^x (x - t)ⁿ f⁽ⁿ⁺¹⁾(t) dt = (x - x0)ⁿ⁺¹/(n+1)! f⁽ⁿ⁺¹⁾(ξ) (5)
with ξ between x and x0.

• To understand the roundoff error, we must first look into floating
point arithmetic.

1.4 Floating Point Arithmetic

• Anatomy of a floating-point number
• Three fields in a 32 bit IEEE 754 float: sign S, biased exponent E,
fraction F.
• Example: (0.15625)10 = (1.01)2 × 2⁻³ is stored in a binary 32 bit float
as S = 0, E = 124, F = 0100...0.
• Example: how to represent 1/10 in binary?
Solution: 1/10 = 1/2⁴ + 1/2⁵ + 1/2⁸ + 1/2⁹ + 1/2¹² + 1/2¹³ + ...
= (0.0001100110011001100110011001100...)2
The pattern repeats and never ends; the stored number is inexact.

Is such “inexactness” important?

Yes! Very important! You need to know your weapon well in


order to use it effectively.
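Yes. The inexactness of 1/10 is directly visible in Python, whose floats are 64 bit IEEE doubles:

```python
from decimal import Decimal
from fractions import Fraction

# The double closest to 1/10 is a nearby dyadic rational, not 1/10 itself:
print(Fraction(0.1))      # 3602879701896397/36028797018963968
print(Decimal(0.1))       # 0.1000000000000000055511151231257827021181583404541015625
print(0.1 + 0.2 == 0.3)   # False: both operands carry representation error
```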

Real-life Examples
--Disasters Caused by Computer Arithmetic Error
The Patriot Missile Failure
• On February 25, 1991, during the Gulf War, an American Patriot
Missile battery in Dharan, Saudi Arabia, failed to track and intercept
an incoming Iraqi Scud missile. The Scud struck an American Army
barracks, killing 28 soldiers and injuring around 100 other people.
• The Patriot missile had an on-board timer that incremented
every tenth of a second.
• The software accumulated a floating point time value by adding 0.1
seconds per tick.
• The problem is that 0.1 in floating point is not exactly 0.1. With the
23 bit representation used, it is really only 0.0999999046326.
• Thus, after 100 hours (3,600,000 ticks), the software timer was
off by 0.3433 seconds.
• A Scud missile travels at 1676 m/s. In 0.3433 seconds, the Scud
was 573 meters away from where the Patriot thought it was.
• This was far enough that the incoming Scud was outside the
"range gate" that the Patriot tracked.
See government investigation report:
http://www.fas.org/spp/starwars/gao/im92026.htm
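The drift arithmetic in the bullets above can be rechecked with a few lines of Python, using only the figures quoted there; the ~2 m gap from the reported 573 m is rounding in those quoted inputs:

```python
# Accumulated Patriot clock drift, using the numbers quoted above.
stored = 0.0999999046326      # what the timer actually added per 0.1 s tick
ticks = 100 * 3600 * 10       # 100 hours of tenth-of-a-second ticks = 3,600,000
drift = (0.1 - stored) * ticks
miss = drift * 1676           # Scud speed, m/s
print(drift, miss)            # ~0.343 s and ~575 m
```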

• General computer representation of a floating point number x:
x: σ · (.d1d2d3...dt) × Bᵉ (11)
(e.g. -.110101 × 2¹⁰ in binary)
σ = sign
B = number base: 2 or 10 or 16
d1d2d3...dt = mantissa, or fractional part of the significand;
d1 ≠ 0; 0 ≤ di ≤ B-1, i = 1, 2, ..., t
t = number of significant digits, e.g. t = 24;
it gives the PRECISION of x
e = exponent or characteristic, L < e < U (e.g. -126 < e < 127);
it determines the RANGE of x.
• In reality, Eq. (11) represents
x = σ (d1/B + d2/B² + d3/B³ + ... + dt/Bᵗ) Bᵉ (12)
e.g. -.110101 × 2¹¹ in base 2
= -(1/2 + 1/2² + 0/2³ + 1/2⁴ + 0/2⁵ + 1/2⁶) × 2¹¹
= -0.828125 × 2¹¹ = -1696 in base 10
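Eq. (12) and the worked example can be checked directly; a small Python sketch (function name ours):

```python
def fp_value(sign, digits, B, e):
    # x = sign * (d1/B + d2/B^2 + ... + dt/B^t) * B^e, as in Eq. (12)
    mantissa = sum(d / B**(i + 1) for i, d in enumerate(digits))
    return sign * mantissa * B**e

print(fp_value(-1, [1, 1, 0, 1, 0, 1], 2, 11))   # -1696.0
```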
• IEEE Standard for single precision (for base 2 only)
* 32 bit IEEE 754 float:
S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
(bit 0 | bits 1-8 | bits 9-31)

The value V represented by the word may be determined as follows:

• If “E”=255 and F is nonzero, then V=NaN ("Not a number")

• If “E”=255 and F is zero and S is 1, then V = -Infinity; (-1)S =-1.

• If “E”=255 and F is zero and S is 0, then V=Infinity; (-1)S =1.

• If 0<“E”<255 then V=(-1)**S * 2 ** (E -127) * (1.F)

where "1.F" is intended to represent the binary number created

by prefixing F with an implicit leading 1 (d0=1) and a binary point.

• In the above, the exponent is stored with 127 added to it,

also called "biased with 127".

• Thus, none of the 8 bits is used to store the sign of the


exponent E.

• But, the actual exponent e is equal to “E” – 127.

• Since “E”=255 is for V=NaN, the largest “E” is 254

=> U= 254-127 = 127

Notes organized by Prof. Renwei Mei, University of Florida 20


• If “E”=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F)

These are "unnormalized" values.

That is why L = -126.

• If “E”=0 and F is zero and S is 1, then V=-0

• If “E”=0 and F is zero and S is 0, then V=0

• The reason for having |L| <U is so that the reciprocal of the
smallest number, 1/2L, will not overflow. Although it is true that
the reciprocal of the largest number will underflow, underflow is
usually less serious than overflow.
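These rules are easy to probe from Python by viewing the 32 bit pattern with the struct module (helper name ours):

```python
import struct

def fields32(x):
    # Pack x as a 32 bit IEEE 754 float and split off S, biased "E", and F.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

print(fields32(0.15625))       # (0, 124, 2097152): F = 0100...0, so 1.F = 1.25
                               # V = (+1) * 2**(124-127) * 1.25 = 0.15625
print(fields32(float("inf")))  # (0, 255, 0), per the rules above
```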

• IEEE Standard for double precision

“Double precision” refers to a type of floating-point number


that has more precision (that is, more digits to the right of
the decimal point) than a single precision number. The
term double precision is something of a misnomer
because the precision is not really double.
The word double derives from the fact that a double
precision number uses twice as many bits as a regular
floating-point number. For example, if a single precision
number requires 32 bits, its double precision counterpart
will be 64 bits long (see the partition of three fields shown
above).
The extra bits increase not only the precision (t) but also the
range (e) of magnitudes that can be represented. The
exact amount by which the precision and range of
magnitudes are increased depends on what format the
program is using to represent floating-point values. Most
computers use a standard format known as the IEEE
floating-point format.

Brief summary:

            B    t (mantissa)    L        U       Total length
IEEE SP     2    23              -126     127     32
IEEE DP     2    52              -1022    1023    64

SP: 23 bits go to the PRECISION (t)
8 bits go to the RANGE (L to U)
1 bit goes to the SIGN
-- adding to 32 bits for the single precision representation

DP: 64 = 52 (t) + 11 (e) + 1 (S)

• Total number of floating-point numbers
• The floating point system CANNOT represent arbitrary
real numbers, even those of modest magnitude.
• The total number of floating point numbers that can be
produced by such a system,
x = σ · (.d1d2d3...dt) × Bᵉ,
is 2 (B-1) B^(t-1) (U-L+1) + 1 (13)
2: accounts for the sign σ
(B-1): number of possibilities in choosing d1 ≠ 0
B^(t-1): number of possibilities in choosing d2, d3, ..., dt
U-L+1: number of possibilities in choosing e
1: for representing the number x = 0
• That is why, for example, 0.1 in decimal cannot be
represented exactly in a binary number representation:
1/10 = 1/2⁴ + 1/2⁵ + 1/2⁸ + 1/2⁹ + 1/2¹² + 1/2¹³ + ...
= 0.0001100110011001100110011001100....
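Formula (13) can be confirmed by brute-force enumeration of a toy system; a Python sketch with parameters of our choosing (exact rational arithmetic avoids accidental collisions):

```python
from itertools import product
from fractions import Fraction

def count_floats(B, t, L, U):
    # Enumerate sign * (.d1d2...dt) * B^e with d1 != 0, plus the single zero.
    values = {Fraction(0)}
    for e in range(L, U + 1):
        for digits in product(range(B), repeat=t):
            if digits[0] == 0:
                continue
            m = sum(Fraction(d, B**(i + 1)) for i, d in enumerate(digits))
            x = m * Fraction(B)**e
            values.update({x, -x})
    return len(values)

B, t, L, U = 2, 3, -1, 2
print(count_floats(B, t, L, U), 2 * (B - 1) * B**(t - 1) * (U - L + 1) + 1)
```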

• Smallest and largest positive numbers
• Smallest positive number xL and underflow:
xL = (.100...0)_B × Bᴸ = B^(L-1) (14)
(= 2^(-126-1) = 5.877×10⁻³⁹ for IEEE SP)
If x < xL → underflow, i.e. the computer may treat x as 0.
• Largest positive number xU and overflow:
xU = (.γγγ...γ)_B × Bᵁ = (1 - B^(-t)) Bᵁ, with γ = B-1 (15)
(≈ 2¹²⁷ = 1.7×10³⁸ for IEEE SP)
If x > xU → overflow;
the computer treats x as "∞", Inf, or "NaN = Not a Number".

Is it important to know the conditions for overflow and underflow?
Absolutely!

Real-life Examples
--Disasters Caused by Computer Arithmetic Error

Ariane Rocket
• On June 4, 1996 an unmanned Ariane 5 rocket was launched.
• The rocket flew on course for 36 seconds, then veered off course
and crashed.
• The internal reference system was trying to convert a 64-bit
floating point number to a 16-bit integer.
• This caused an overflow, which was sent to the onboard
computer.
• The on-board computer interpreted the overflow as real flight
data, and bad things happened.
• The destroyed rocket and its cargo were valued at $500 million.
• The rocket was on its first voyage, after a decade of
development costing $7 billion.

• Overflow and underflow experiment:
• Write a computer program, using x = 2ᵏ with k = 1, 2, ..., to
determine the largest floating point number your computer can
handle before overflow occurs.
• Then use y = 2⁻ᵏ, k = 1, 2, ..., to determine the smallest floating
point number your computer can handle before underflow
occurs.
• Results of the experiment on an Alpha workstation
single precision:
k_up = 127, x_up= 1.7014118E+38
k_low = -126, y_low= 1.1754944E-38
double precision:
k_up = 1023, x_up = 8.988465674311580E+307
at k > 1023, x blows up → overflow
k_low = -1022 y = 2.225073858507201E-308
at k < -1022, y=0.0 → underflow
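The same experiment can be run on IEEE doubles from Python, where math.ldexp(1.0, k) computes 2^k; the limits agree with the double precision results above:

```python
import math

print(math.ldexp(1.0, 1023))    # 8.98846567431158e+307, largest power of two
try:
    math.ldexp(1.0, 1024)       # would need e > U = 1023
except OverflowError:
    print("overflow at 2**1024")

print(math.ldexp(1.0, -1022))   # 2.2250738585072014e-308, smallest normalized
print(math.ldexp(1.0, -1075))   # 0.0: underflow (below the last subnormal, 2**-1074)
```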

• Round-off Error and machine precision
• Rounding:
e.g. x = 2/3 = 0.666666666666...
To keep 7 decimal digits,
rounding to nearest → x = 0.6666667 (the 8th digit, 6, is > 5,
so 1 is added to the 7th digit).
* Example: πT = 3.1415926535897932...
If πA = 3.14159,
then |roundoff error| = |πA - πT| = 0.00000265358979... < 0.000005
πA        |roundoff error|
3.14      0.00159...  < 0.005
3.142     0.00040...  < 0.0005
3.1416    0.0000073   < 0.00005
Note: x = 5.2468 ~ 5.247, or x ~ 5.25, but x ≠ 5.3.
If you want to keep one decimal, then x ~ 5.2;
i.e. rounding is not transitive.
• Chopping:
e.g. x = 2/3 = 0.666666666666...
To keep 7 decimal digits, chopping → x = 0.6666666
(the 8th digit is simply chopped).
e.g. πA = 3.1415 after chopping;
|roundoff error| = 0.000092... < 0.0005.
This is larger than with rounding to nearest.

• Is it important to care about chopping vs. rounding?
• Major difference between Chopping and Rounding:
The error in chopping is always non-negative, since the chopped
number is never larger than the original number.
This can cause a systematic skew in the summation Σ_(j=1)^M x_j !!!
The error in rounding can be either positive or negative.
Thus the net round off error in computing Σ_(j=1)^M x_j will be
smaller, since some of the roundoff errors cancel out.
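The skew can be demonstrated with Python's decimal module, which lets the rounding rule be chosen; a sketch (parameters ours) that sums 500 copies each of 1/3 and 2/3, whose true sum is 500, in 5-digit arithmetic:

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN

def accumulate(rounding, n=1000):
    # Sum 500 copies each of 1/3 and 2/3 (true sum: 500) in 5-digit
    # decimal arithmetic, rounding after every operation.
    ctx = Context(prec=5, rounding=rounding)
    third = ctx.divide(Decimal(1), Decimal(3))
    two_thirds = ctx.divide(Decimal(2), Decimal(3))
    s = Decimal(0)
    for j in range(n):
        s = ctx.add(s, third if j % 2 == 0 else two_thirds)
    return s

chop = accumulate(ROUND_DOWN)
near = accumulate(ROUND_HALF_EVEN)
print(chop, near)   # the chopped sum falls noticeably short of 500
```

Here every chopped digit is discarded downward, so the chopping errors accumulate; the rounding errors of the 1/3 and 2/3 terms have opposite signs and largely cancel.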

Real-life Examples
--Disasters Caused by Computer Arithmetic Error

Vancouver Stock Exchange
• In 1982, the index was initiated with a starting value of
1000.000, carried with three digits of precision and truncation.
• After 22 months, the index was at 524.881.
• The index should have been at 1009.811.
• Successive additions and subtractions introduced truncation
errors that caused the index to be off by that much.

• Machine precision, or machine epsilon εmach
-- the accuracy or precision
chopping: εmach = B^(1-t) (= 2^(1-23) = 2.384×10⁻⁷ for B = 2, t = 23)
rounding: εmach = (1/2) B^(1-t) (= 2⁻²³ = 1.192×10⁻⁷ for B = 2, t = 23)
Note: if x < εmach, then 1 + x = 1 in machine computation.
• The unit round of a computer is the number δ that satisfies the following:
i) it is a positive floating-point number;
ii) it is the smallest such number for which
fl(1 + δ) > 1 (16)
where "fl" means the "floating-point" representation of
the number.
* Thus, for any x < δ, we have fl(1 + x) = 1.
=> a precise measure of how many digits of accuracy are
possible in representing a number.

Machine precision experiment:
• Write a computer program, using δ = 2⁻ᵏ, with k = 1 to 34 for
single precision and k = 1 to 60 for double precision, to
determine δ for the machine you are using, for both single
precision and double precision operations.

* Example: On an Alpha workstation or a Dec 5000 machine,
Single Precision:
k= 23 x= 0.000000119 still truthful
k= 24 x= 5.9604645E-08 no longer truthful
Program:
      del = 1.0
      do k = 1, 35
        del = del/2.0
        f = 1 + del
        z = f - 1
        write(6,12) k, del, f, z
      enddo
   12 format(1x,i3,2x,f15.11,2x,f13.10,2x,f13.9)
      stop
      end

• Results (Single precision):


K del f z
1 0.50000000000 1.5000000000 0.500000000
2 0.25000000000 1.2500000000 0.250000000
3 0.12500000000 1.1250000000 0.125000000
4 0.06250000000 1.0625000000 0.062500000
5 0.03125000000 1.0312500000 0.031250000
6 0.01562500000 1.0156250000 0.015625000
7 0.00781250000 1.0078125000 0.007812500
8 0.00390625000 1.0039062500 0.003906250
9 0.00195312500 1.0019531250 0.001953125
10 0.00097656250 1.0009765625 0.000976562
11 0.00048828125 1.0004882812 0.000488281
12 0.00024414062 1.0002441406 0.000244141
13 0.00012207031 1.0001220703 0.000122070
14 0.00006103516 1.0000610352 0.000061035
15 0.00003051758 1.0000305176 0.000030518
16 0.00001525879 1.0000152588 0.000015259

17 0.00000762939 1.0000076294 0.000007629
18 0.00000381470 1.0000038147 0.000003815
19 0.00000190735 1.0000019073 0.000001907
20 0.00000095367 1.0000009537 0.000000954
21 0.00000047684 1.0000004768 0.000000477
22 0.00000023842 1.0000002384 0.000000238
23 0.00000011921 1.0000001192 0.000000119
24 0.00000005960 1.0000000000 0.000000000
25 0.00000002980 1.0000000000 0.000000000
26 0.00000001490 1.0000000000 0.000000000
27 0.00000000745 1.0000000000 0.000000000
28 0.00000000373 1.0000000000 0.000000000
29 0.00000000186 1.0000000000 0.000000000
30 0.00000000093 1.0000000000 0.000000000
31 0.00000000047 1.0000000000 0.000000000
32 0.00000000023 1.0000000000 0.000000000
33 0.00000000012 1.0000000000 0.000000000
34 0.00000000006 1.0000000000 0.000000000

Double precision:
k= 53 x= 1.1102230E-016 still truthful
k= 54 x= 5.55E-017 no longer truthful
Program:
      implicit double precision (a-h,o-z)
      del = 1.0
      do k = 1, 64
        del = del/2.0
        f = 1 + del
        z = f - 1
        write(8,12) k, del, f, z
      enddo
   12 format(1x,i4,2x,e21.13,2x,f22.17,2x,e16.9)
      stop
      end

• Result (double precision):


K del f z
30 0.9313225746155E-09 1.00000000093132257 0.93132E-09
31 0.4656612873077E-09 1.00000000046566129 0.46566E-09
32 0.2328306436539E-09 1.00000000023283064 0.23283E-09
33 0.1164153218269E-09 1.00000000011641532 0.11642E-09
34 0.5820766091347E-10 1.00000000005820766 0.58208E-10
35 0.2910383045673E-10 1.00000000002910383 0.29104E-10
36 0.1455191522837E-10 1.00000000001455192 0.14552E-10
37 0.7275957614183E-11 1.00000000000727596 0.72760E-11
38 0.3637978807092E-11 1.00000000000363798 0.36380E-11
39 0.1818989403546E-11 1.00000000000181899 0.18190E-11
40 0.9094947017729E-12 1.00000000000090949 0.90949E-12
41 0.4547473508865E-12 1.00000000000045475 0.45475E-12
42 0.2273736754432E-12 1.00000000000022737 0.22737E-12
43 0.1136868377216E-12 1.00000000000011369 0.11369E-12
44 0.5684341886081E-13 1.00000000000005684 0.56843E-13
45 0.2842170943040E-13 1.00000000000002842 0.28422E-13
46 0.1421085471520E-13 1.00000000000001421 0.14211E-13

47 0.7105427357601E-14 1.00000000000000711 0.71054E-14
48 0.3552713678801E-14 1.00000000000000355 0.35527E-14
49 0.1776356839400E-14 1.00000000000000178 0.17764E-14
50 0.8881784197001E-15 1.00000000000000089 0.88818E-15
51 0.4440892098501E-15 1.00000000000000044 0.44409E-15
52 0.2220446049250E-15 1.00000000000000022 0.22204E-15
53 0.1110223024625E-15 1.00000000000000000 0.00000E+00
54 0.5551115123126E-16 1.00000000000000000 0.00000E+00
55 0.2775557561563E-16 1.00000000000000000 0.00000E+00
56 0.1387778780781E-16 1.00000000000000000 0.00000E+00
57 0.6938893903907E-17 1.00000000000000000 0.00000E+00
58 0.3469446951954E-17 1.00000000000000000 0.00000E+00
59 0.1734723475977E-17 1.00000000000000000 0.00000E+00
60 0.8673617379884E-18 1.00000000000000000 0.00000E+00
61 0.4336808689942E-18 1.00000000000000000 0.00000E+00
62 0.2168404344971E-18 1.00000000000000000 0.00000E+00
63 0.1084202172486E-18 1.00000000000000000 0.00000E+00
64 0.5421010862428E-19 1.00000000000000000 0.00000E+00

1.5 Significant Digits
Definition:
XA has m significant digits w.r.t. XT if the error |XT - XA| has
magnitude ≤ 5 in the (m+1)th digit, counting from the first
non-zero digit of XT.
Examples:
1. XT = 3.17286 (digits numbered 1-6 from the leading 3)
If XA = 3.17, then |XT - XA| = 0.00286 < 0.005 (5 in digit 4)
=> m + 1 = 4 => m = 3
If XA = 3.172, then |XT - XA| = 0.00086 < 0.005 (5 in digit 4)
=> m + 1 = 4 => m = 3
If XA = 3.173, then |XT - XA| = 0.00014 < 0.0005 (5 in digit 5)
=> m + 1 = 5 => m = 4
2. XT = 389.674 (digits numbered 1-6 from the leading 3)
If XA = 389.78, then |XT - XA| = 0.106 < 0.5 (5 in digit 4)
=> m + 1 = 4 => m = 3
If XA = 389.7, then |XT - XA| = 0.026 < 0.05 (5 in digit 5)
=> m + 1 = 5 => m = 4
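A sketch of this definition in code (function name ours; 10^d is the place value of the first non-zero digit of XT):

```python
import math

def sig_digits(x_t, x_a):
    # Largest m with |x_t - x_a| <= 5 * 10**(d - m), where 10**d is the
    # place value of the first non-zero digit of x_t (assumes x_a != x_t).
    d = math.floor(math.log10(abs(x_t)))
    err = abs(x_t - x_a)
    m = 0
    while m < 15 and err <= 5.0 * 10.0**(d - (m + 1)):
        m += 1
    return m

print(sig_digits(3.17286, 3.17), sig_digits(3.17286, 3.173))    # 3 4
print(sig_digits(389.674, 389.78), sig_digits(389.674, 389.7))  # 3 4
```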

1.6 Interaction of Roundoff Error with Truncation Error
Consider f(x) = eˣ, with EXACT derivative f′(x) = eˣ;
at x = 0, the EXACT value is f′(0) = 1.
* Finite difference method 1: forward difference
TS expansion to O(h) => f′(x) ≈ [f(x+h) - f(x)]/h + O(h) (17)
That is, numerically, f′(0) ≈ Δf/Δx = (eʰ - 1)/h (method 1)
* Error(x, h) = |f′(x) - Δf/Δx| = |eˣ - [f(x+h) - f(x)]/h|:

(Log-log plot of the error |f′(0) - 1| vs. h for f′(0) ≈ [exp(h) - 1]/h:
as h decreases, the error first decreases -- the truncation error --
then increases again because of the roundoff error.)

* Why does the error behave in such a manner?

* The roundoff error (using Excel) in computing the difference
between two O(1) numbers, [f(x+h) - f(x)], is roughly around
1.11×10⁻¹⁶.
* Thus the roundoff error (R.E.) for f′ is
(R.E.) ~ 1.11×10⁻¹⁶ / h (18)
R.E. is small for larger h, but it increases as h decreases.
* Truncation error based on the Taylor series expansion:
f(x+h) = f(x) + f′(x) h + f″(x) h²/2 + f‴(x) h³/6 + ...
=> [f(x+h) - f(x)]/h = f′(x) + f″(x) h/2 + f‴(x) h²/6 + ... (19)
Thus in approximating f′(x) by [f(x+h) - f(x)]/h, we
commit an error of f″(x) h/2 to the leading order.
Hence the T.E. in this example (x = 0, f″(0) = e⁰ = 1) is:
T.E. = h/2 (20)
=> Total error = R.E. + T.E. ~ 1.11×10⁻¹⁶/h + h/2 (21)
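Both error curves can be regenerated with a short loop (the notes' plots were produced in Excel; here is an equivalent Python sketch):

```python
import math

def fwd(f, x, h):
    # forward difference, truncation error ~ f''(x) h/2
    return (f(x + h) - f(x)) / h

def ctr(f, x, h):
    # central difference, truncation error ~ f'''(x) h^2/6
    return (f(x + h) - f(x - h)) / (2.0 * h)

for h in (1e-1, 1e-4, 1e-8, 1e-12):
    print(h, abs(fwd(math.exp, 0.0, h) - 1.0), abs(ctr(math.exp, 0.0, h) - 1.0))
```

At moderate h the central scheme is far more accurate; at very small h both are dominated by the ~10⁻¹⁶/h roundoff term.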

* Finite difference method 2:
If we use the central difference scheme to compute f′(x):
f′(x) ≈ [f(x+h) - f(x-h)] / (2h) (22)
the truncation error is smaller, as shown below.
* Truncation error (T.E.):
Taylor series expansion:
f(x-h) = f(x) - f′(x) h + f″(x) h²/2 - f‴(x) h³/6 + ...
=> [f(x+h) - f(x-h)]/(2h) = f′(x) + f‴(x) h²/6 + ... (23)
Thus in approximating f′(x) by [f(x+h) - f(x-h)]/(2h),
we commit an error of f‴(x) h²/6 to the leading order.
Hence the truncation error in this example is
T.E. = f‴(x) h²/6 + ... (24)

(Log-log plot of |f′(0) - 1| for both schemes, with
f2′(0) = [exp(h) - exp(-h)]/(2h): the central difference error decreases
like h²/6 until roundoff error, growing like 1/h, takes over; its minimum
error is far smaller than that of the forward difference.)

* Predicted round-off error and truncation error
(Plot of the predicted roundoff error 1.11×10⁻¹⁶/h together with the
truncation errors TE1 = h/2 (forward) and TE2 = h²/6 (central) vs. h.)
* Comparison: predicted (roundoff + truncation) & actual errors
(Plot showing that the predicted totals R+TE1 and R+TE2 track the
actual errors |f′ - 1| and |f2′ - 1| closely over 10⁻¹⁵ ≤ h ≤ 10.)

Error = Truncation Error + Round-off Error

1.7 Propagation of Errors
• Consider zT = xT * yT, where * stands for an algebraic operation: + - × ÷
First, the computer actually uses xA instead of xT, due to rounding,
or because the data itself contains error.
Second, after xA * yA is computed, the computer rounds the result as
zA = fl(xA * yA). (25)
Thus, the error in the operation * is
zT - zA = xT * yT - fl(xA * yA). (26)
• Let
xT = xA + ε, yT = yA + η. (27)
The error is
zT - zA = (xT * yT - xA * yA) + [xA * yA - fl(xA * yA)] (28)
The second part, in [ ... ], is simply due to machine rounding.
It can easily be estimated as
≤ (xA * yA) εmach = (1/2)(xA * yA) B^(1-p)
The first part, "xT * yT - xA * yA", is the propagated error.
• Now consider the propagated error in various operations.

1.7.1 Error in multiplication
Absolute error in multiplication:
xT yT - xA yA = xT yT - (xT - ε)(yT - η)
= xT η + yT ε - εη. (29)
Relative error:
Rel.(xA yA) = (xT yT - xA yA)/(xT yT) = η/yT + ε/xT - (ε/xT)(η/yT). (30)
Assuming |ε/xT| « 1 and |η/yT| « 1, we obtain
Rel.(xA yA) ≈ η/yT + ε/xT = Rel.(xA) + Rel.(yA). (31)

1.7.2 Error in division
Absolute error in division:
xT/yT - xA/yA = xT/yT - (xT - ε)/(yT - η). (32)
Relative error in division:
Rel.(xA/yA) = (xT/yT - xA/yA)/(xT/yT)
= 1 - (xA yT)/(xT yA) = 1 - [1 - Rel.(xA)]/[1 - Rel.(yA)]
≈ 1 - [1 - Rel.(xA) + Rel.(yA) + ...] (TS expan.)
≈ Rel.(xA) - Rel.(yA) = ε/xT - η/yT (33)

1.7.3 Error in addition:
Absolute error: xT + yT - (xA + yA) = ε + η (34)
Relative error: Rel.(xA + yA) = (ε + η)/(xT + yT) (35)

1.7.4 Error in subtraction:
Absolute error: xT - yT - (xA - yA) = ε - η (36)
Relative error: Rel.(xA - yA) = (ε - η)/(xT - yT) (37)

Note: xT ± yT may be small due to cancellation
→ large Rel.(xA ± yA),
i.e. loss of significance due to subtraction of nearly
equal quantities --- a very important practical issue!

* Example: error in subtraction:
Compute r = 13 − √168  (= x − y).
Using 5-digit decimals, y = √168 ⇒ yA = 12.961 ⇒ rA = 0.039.

Exact number: rT = 0.038518603… ⇒
      Error(rA) = 0.038518603… − 0.039 = −0.00048,
or    Rel.(rA) = −1.25×10⁻², which is not small.

Reason: x = 13 and y = √168 are quite close ⇒
rA has only 2 significant digits after the subtraction.

Improvement:
      rA = (13² − 168)/(13 + √168) = 1/(13 + √168) ≈ 1/(13 + 12.961)
         = 0.038519, with 5 significant digits
⇒    Rel.(rA) = (0.038518603… − 0.038519)/0.038518603… = −1.03×10⁻⁵

⇒ the magnitude of this error is much smaller than the
previous one (1.25×10⁻²).
Lesson: avoid subtracting two nearly equal numbers!
Whenever you can, use double precision.
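The 5-digit arithmetic above can be emulated to reproduce both results (illustrative Python sketch; `round5` is a helper introduced here, not part of the notes):

```python
import math

def round5(x):
    """Round x to 5 significant decimal digits (emulates the 5-digit arithmetic above)."""
    return float(f"{x:.4e}")

r_true = 13.0 - math.sqrt(168.0)   # ≈ 0.038518603, computed in double precision

# Naive form: subtraction of two close 5-digit numbers loses significance.
r_naive = round5(13.0 - round5(math.sqrt(168.0)))            # = 0.039

# Rationalized form: r = (13**2 - 168)/(13 + sqrt(168)) = 1/(13 + sqrt(168)).
r_better = round5(1.0 / (13.0 + round5(math.sqrt(168.0))))   # = 0.038519

print(abs(r_naive - r_true) / r_true)    # ≈ 1.25e-2, as in the notes
print(abs(r_better - r_true) / r_true)   # ≈ 1.03e-5
```

The rationalized form avoids the subtraction of two close numbers entirely, so all five carried digits stay significant.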

1.7.5 Induced error in evaluating functions
• With one variable:
If f(x) has a continuous first-order derivative in [a, b],
and xT and xA are both in [a, b], then

      f(xT) − f(xA) ≈ f′(xA)(xT − xA) + o(xT − xA).        (38)

• With two variables:

      f(xT, yT) − f(xA, yA) ≈ f′x(xA, yA)(xT − xA) + f′y(xA, yA)(yT − yA)
                              + o(xT − xA, yT − yA).       (39)

* Example: f(x, y) = x^y ⇒

      f′x = y x^(y−1),   f′y = x^y ln x

⇒ Error(fA) ≈ yA (xA)^(yA−1) Error(xA) + (xA)^(yA) ln(xA) Error(yA)

⇒ Rel.(fA) ≈ yA Rel.(xA) + yA ln(xA) Rel.(yA)
(note the factor yA in the second term, since Error(yA) = yA Rel.(yA)).
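A quick numerical check of this relative-error formula for x^y (illustrative Python; the chosen values and relative errors are arbitrary):

```python
import math

# Check Rel.(fA) ≈ yA*Rel.(xA) + yA*ln(xA)*Rel.(yA) for f(x, y) = x**y.
xT, yT = 2.0, 3.0                 # arbitrary "true" values
rel_x, rel_y = 1e-7, 2e-7         # assumed relative errors
xA, yA = xT * (1 - rel_x), yT * (1 - rel_y)

actual = (xT**yT - xA**yA) / (xT**yT)              # exact relative error of f
estimate = yA * rel_x + yA * math.log(xA) * rel_y  # first-order formula
print(actual, estimate)   # agree to second order in the relative errors
```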

1.7.6 Error in summation
Consider
      s = Σ_{j=1}^{M} x_j.                                 (40)
In a Fortran program, we write:
      S = 0
      DO J = 1, M
         S = S + X(J)
      ENDDO
Equivalently, in the above code we are doing the following:
      s2 = fl(x1 + x2) = (x1 + x2)(1 + ε2),                (41a)
where ε2 = machine error due to rounding;
      s3 = fl(s2 + x3) = (s2 + x3)(1 + ε3)                 (41b)
         = [(x1 + x2)(1 + ε2) + x3](1 + ε3)
         ≈ (x1 + x2 + x3) + ε2(x1 + x2) + ε3(x1 + x2 + x3);     (41c)
      sk+1 = (sk + xk+1)(1 + εk+1)
           ≈ (x1 + x2 + … + xk+1) + ε2(x1 + x2) + ε3(x1 + x2 + x3) + …
             + εk+1(x1 + x2 + … + xk+1).                   (41d)
Hence
      Error = sM − (x1 + x2 + … + xM)
            = ε2(x1 + x2) + ε3(x1 + x2 + x3) + … + εM(x1 + x2 + … + xM)
            = x1(ε2 + ε3 + … + εM) + x2(ε2 + ε3 + … + εM)
              + x3(ε3 + … + εM) + … + xM εM.               (42)
Since all the εi's are of the same magnitude,
⇒ the term x1 contributes the most while xM contributes the least;
⇒ we should order the terms so that we add from the smallest value (as x1)
  to the largest (as xM), to reduce the overall accumulation of machine error.
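The effect of summation order can be demonstrated by emulating single-precision (float32) arithmetic with the standard `struct` module (illustrative Python sketch; the notes' own experiment uses Fortran, and a moderate M is used here to keep the run short):

```python
import struct

def f32(x):
    """Round a double to the nearest IEEE single-precision (float32) value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def harmonic_f32(M, small_first=False):
    """Sum 1/k for k = 1..M in emulated single precision, in the given order."""
    ks = range(M, 0, -1) if small_first else range(1, M + 1)
    s = 0.0
    for k in ks:
        s = f32(s + f32(1.0 / k))
    return s

M = 131072
ref = sum(1.0 / k for k in range(M, 0, -1))   # double precision, small terms first
err_large_first = abs(harmonic_f32(M) - ref)
err_small_first = abs(harmonic_f32(M, small_first=True) - ref)
print(err_large_first, err_small_first)   # small-first is noticeably more accurate
```

Adding the small terms first keeps each increment significant relative to the running total, which is exactly the ordering rule derived from (42).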

Example: Compute S(M) = Σ_{k=1}^{M} 1/k for M < 10⁸, four ways:
  i)   summing from k = 1 to M in single precision   (single: large to small)
  ii)  summing from k = M to 1 in single precision   (single: small to large)
  iii) summing from k = 1 to M in double precision   (double: large to small)
  iv)  summing from k = M to 1 in double precision   (double: small to large)
  asymptote = ln(M) + 0.5772156649015328

        M    single: large   single: small   double: large   double: small    asymptote
             to small        to large        to small        to large
    16384    10.2813063      10.28131294     10.2813068      10.28130678     10.2812767
    32768    10.9744091      10.97444344     10.9744387      10.9744387      10.9744225
    65536    11.667428       11.66758823     11.6675783      11.66757825     11.6675701
   131072    12.3600855      12.36073208     12.3607216      12.36072161     12.3607178
   262144    13.0513039      13.05388069     13.0538669      13.05386689     13.0538654
   524288    13.7370176      13.74705601     13.7470131      13.74701311     13.7470112
  1048576    14.4036837      14.44023132     14.4401598      14.44015982     14.4401588
  2097152    15.4036827      15.13289833     15.1333068      15.13330676     15.1333065
  4194304    15.4036827      15.82960701     15.8264538      15.82645382     15.8264542
  8388608    15.4036827      16.51415253     16.5196009      16.51960094     16.5195999
 16777216    15.4036827      17.23270798     17.2127481      17.21274809     17.2127476
[Figure: Sum S(M) versus M (M up to ~1.8×10⁷, S from 10 to 18); curves shown:
single: large to small, single: small to large, dbl: large to small,
dbl: small to large, and the asymptote. All curves follow the asymptote except
the single-precision large-to-small sum, which levels off near S ≈ 15.4.]
Clearly, the result of the single-precision summation from the largest to the
smallest values is the most unsatisfactory: it falsely "converges" at
15.4036827, because each new term 1/k becomes negligible relative to the
running sum and is rounded away.
