Scientific Computing - LESSON 1: Computer Arithmetic and Error Analysis
Rounding errors
With a finite number of binary bits, a computer can only represent a finite number
of real numbers. That is, most real numbers cannot be represented exactly by a
computer. Therefore, a real number usually has to be rounded to another number
that can be represented by the computer.
The above discussion also indicates that a real number can be represented exactly
with a finite number of binary bits if and only if it can be written in a reduced
fractional form with the denominator being a power of 2.
Problem 1. Write 1/5 as a binary number with 6 bits after the binary point.
Problem 2. Devise a method for converting a fractional number into its binary representation.
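One possible method, sketched below in Python (the helper name to_binary_fraction is ours), repeatedly doubles the fractional part and reads off one bit per doubling:

    def to_binary_fraction(x, bits):
        # Doubling x shifts its binary point one place to the right, so
        # the integer part that appears is the next bit after the point.
        digits = []
        for _ in range(bits):
            x *= 2
            bit = int(x)
            digits.append(str(bit))
            x -= bit                   # keep only the fractional part
        return "0." + "".join(digits)  # truncated (not rounded) to `bits` bits

    print(to_binary_fraction(1/5, 6))  # 0.001100, cf. Problem 1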
Let us consider the task of evaluating sin(1/5). Firstly, because the sine function f(x) = sin(x) cannot be evaluated exactly by a computer, we choose to use the polynomial f̂(x) = x − x^3/6, a truncated Taylor expansion of sin(x), to approximate f(x) = sin(x). Secondly, because x = 1/5 cannot be represented exactly by a computer, it is rounded to a nearby machine number x̂, so that what is actually computed is f̂(x̂). The total error can be decomposed as

f̂(x̂) − f(x) = [f̂(x̂) − f(x̂)] + [f(x̂) − f(x)].

The first part [f̂(x̂) − f(x̂)] is called the computational error, and the second part [f(x̂) − f(x)] is called the propagated data error.
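A quick numerical check of this decomposition (a sketch: the double-precision math.sin serves as a stand-in for the exact f, so only the computational error is visible here):

    import math

    x = 1/5                       # the input, rounded to fl(1/5) in double precision
    f_exact  = math.sin(x)        # f(x), to machine accuracy
    f_approx = x - x**3 / 6       # fhat(x), the truncated Taylor expansion

    # The computational error fhat(x) - f(x) is the Taylor remainder,
    # roughly -x^5/120 ~ -2.7e-06 for x = 0.2:
    print(f_approx - f_exact)     # ~ -2.66e-06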
The propagated data error is determined by the stability of the problem under consideration. For an unstable problem, a small change of the input from x to x̂ may cause a large variation f(x̂) − f(x) in the function value, e.g., the case of f(x) = tan(x) for x near π/2.
When an algorithm for solving a problem produces a large error, one needs to find
out whether it is because the problem is inherently unstable (i.e. the propagated data
error is dominant) or because the algorithm used gives a poor approximation (i.e.,
the computational error is dominant). Only when a numerically accurate algorithm is applied to a well-conditioned problem can one guarantee an accurate result.
Condition number
The condition number measures how the relative change in the result (or output) of a numerical problem depends on the relative change in its input. It
is used to describe the sensitivity of the problem, assuming exact arithmetic; thus,
it has nothing to do with any particular algorithm used to produce the result. It is
understood that exact arithmetic introduces no error in a computational process.
Suppose that a function f (x) is evaluated, and we want to know whether a small
error in x would cause a big error in the function value f (x). Suppose that x is
perturbed into its approximation x̃, and the corresponding function value is f (x̃).
Then the relative error in the input is

|x − x̃| / |x|,

and the relative error in the output is

|f(x) − f(x̃)| / |f(x)|.

The ratio of these two relative errors gives the condition number,

cond(f(x)) = (|f(x) − f(x̃)| / |f(x)|) / (|x − x̃| / |x|) = (|x| / |f(x)|) × (|f(x) − f(x̃)| / |x − x̃|).
If the condition number is large, then a small relative change in the input will cause a big relative change in the output; i.e., the output is sensitive to the input. A problem with a large condition number is therefore called ill-conditioned, while a problem with a small condition number is called well-conditioned.
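For a differentiable f, letting x̃ → x in this ratio gives the differential form cond(f(x)) ≈ |x f′(x)| / |f(x)|, which is easy to estimate numerically. A minimal sketch (the helper name cond and the step size h are our choices):

    import math

    def cond(f, x, h=1e-6):
        # Central-difference estimate of f'(x), then |x * f'(x) / f(x)|.
        fprime = (f(x + h) - f(x - h)) / (2 * h)
        return abs(x * fprime / f(x))

    print(cond(math.sqrt, 2.0))                # 0.5: sqrt is well-conditioned
    print(cond(math.tan, math.pi/2 - 0.01))    # ~156: tan is ill-conditioned near pi/2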
Floating-point numbers
A floating-point number has the form

x = ±r × β^m = ±(0.r1 r2 … rn)_β × β^m,

where β is an integer called the base, m the exponent, and r = 0.r1 r2 … rn the mantissa, with r1 ≠ 0 and 0 ≤ ri < β. On a binary computer (β = 2), a normalized floating-point number is usually written with the leading 1 in front of the binary point,

x = ±(1.r1 r2 … rn)_2 × 2^m,

which is the convention used in the rest of this lesson.
A possible bit allocation in a computer with a 32-bit word length is the following: 1 sign bit, 8 bits for the exponent m, and 23 bits for the mantissa r.
We use m = −127 for 0 and m = 128 for ∞, so the effective exponent range is −126 ≤ m ≤ 127, and the effective range of the positive floating point numbers is roughly from 2^-126 ≈ 1.2 × 10^-38 to 2^128 ≈ 3.4 × 10^38.
The distribution of the floating point numbers along the real axis is uneven. In fact, the machine numbers between two consecutive powers of 2 are evenly spaced, and this spacing becomes finer towards zero. For example, with the above bit allocation of a 32-bit word, the next (greater) neighbor of the machine number 2^8 is 2^8 + 2^-15, whereas the next neighbor of the machine number 2^30 is 2^30 + 2^7.
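These gaps can be checked directly in single precision with NumPy (a sketch; np.spacing gives the distance from a machine number to its next greater neighbor):

    import numpy as np

    print(np.spacing(np.float32(2.0**8)))    # 2**-15 ~ 3.0517578e-05
    print(np.spacing(np.float32(2.0**30)))   # 2**7  = 128.0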
Machine unit
The machine unit of a floating point system is the smallest positive number ε such that fl(1 + ε) > 1. The machine unit is denoted by ε_M, and is also referred to as the machine precision or machine epsilon. Assuming rounding to the nearest, the machine unit of the floating point system given above is 2^-24: any real number smaller than 1 + 2^-24 and no less than 1 is rounded to 1, while any real number between 1 + 2^-24 and 1 + 2^-23 (the next neighbor of 1 in the floating point system) is rounded to 1 + 2^-23. The machine unit plays an important role in bounding the relative error in approximating any real number by its machine representation (see below).
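The machine unit can also be found experimentally by repeated halving. A sketch in double precision, whose 52-bit mantissa gives ε_M = 2^-53 (the analogue of 2^-24 above):

    eps = 1.0
    while 1.0 + eps > 1.0:   # halve until fl(1 + eps) rounds back to 1
        eps /= 2
    # With round-to-nearest (ties to even), the loop exits at eps = 2**-53,
    # the largest power of two for which fl(1 + eps) == 1:
    print(eps, eps == 2.0**-53)   # 1.1102230246251565e-16 True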
As an illustration, consider computing the partial sums of the harmonic series,

hk = 1 + 1/2 + 1/3 + … + 1/k,

by adding the successive terms, that is, using hk+1 = hk + 1/(k+1). Mathematically the sequence diverges, growing like hk ≈ ln(k). On the other hand, the sequence will stop increasing at some point when it is computed by adding its successive terms, because of the finite computer precision. Specifically, this occurs when the ratio of the next term 1/(k+1) to the partial sum hk drops below the machine precision ε_M.
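A sketch of this stagnation in single precision (in double precision the same thing happens, but only after an astronomically large number of terms; the loop below takes a few seconds):

    import numpy as np

    h = np.float32(0.0)
    k = 0
    while True:
        k += 1
        term = np.float32(1.0) / np.float32(k)
        if h + term == h:     # 1/k is below half an ulp of h: the sum stagnates
            break
        h = h + term
    print(k, h)   # roughly k ~ 2e6 (about 2**21), with h ~ 15.4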
There are only a finite number of floating point numbers for a given system with
a fixed word length. On the other hand, there are infinitely many real numbers.
Therefore, most real numbers, or even rational numbers for that matter, cannot be
represented exactly by a floating point number. For example, we have seen that 1/5
must be approximated by a binary floating-point number. Hence, errors are almost
inevitable in converting a real number into its floating point representation.
Then, how big is this relative rounding error? Given a real number x > 0 in the
form,
x = r × 2^m = (1.r1 r2 … rn rn+1 …)_2 × 2^m,

there are two close neighbors of x among all the floating point numbers; they are

x− = (1.r1 r2 … rn)_2 × 2^m ≤ x

and

x+ = [(1.r1 r2 … rn)_2 + 2^-n] × 2^m ≥ x.
Now we want to know how big the rounding error is. When x− is used to approximate x, i.e., when |x − x−| ≤ |x − x+|, since |x+ − x−| = 2^-n × 2^m, we have

|x − x−| ≤ (1/2) × 2^-n × 2^m = 2^(m-n-1).

Therefore,

|x − x−| / |x| ≤ 2^(m-n-1) / (r × 2^m) ≤ 2^-(n+1),

since r ≥ 1.
The same bound holds when x+ is the nearest neighbor. In other words, rounding any real number x to its nearest machine number fl(x) incurs a relative error of at most 2^-(n+1); for the 32-bit system above (n = 23) this is exactly the machine unit ε_M = 2^-24. This is commonly written as

fl(x) = x(1 + δ), with |δ| ≤ ε_M.
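A direct check of this bound, using np.float32 in the role of fl (a sketch; the sample inputs are arbitrary):

    import math
    import numpy as np

    eps_M = 2.0**-24                              # machine unit for n = 23
    for x in [1/5, math.pi, 1e-10, 123456.789]:
        delta = (float(np.float32(x)) - x) / x    # fl(x) = x * (1 + delta)
        assert abs(delta) <= eps_M
        print(f"x = {x:<12g} delta = {delta:.3e}")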
When two nearby machine numbers are subtracted, which is called cancellation, there is usually a big loss of precision, i.e., of significant bits. Consider the following example.
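One such example, in single precision (a sketch: mathematically (1 + x) − 1 equals x, but the addition first rounds 1 + x to the nearest machine number, and the subtraction then exposes that rounding error at full size):

    import numpy as np

    x   = np.float32(1e-4)
    one = np.float32(1.0)
    y   = (one + x) - one     # should equal x, but 1 + x was rounded first
    rel = abs(y - x) / x
    print(x, y, rel)          # relative error ~ 1.7e-04: of the ~7 significant
                              # decimal digits of float32, only 3-4 survive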
Another way of looking at the risk involved in subtracting two close numbers is as
follows. Suppose that two numbers a and b are approximated by ã and b̃, respectively.
Let δa = a − ã, and δb = b − b̃. Then, if the subtraction a − b is performed, the relative
error of using ã − b̃ to approximate a − b is
|(a − b) − (ã − b̃)| / |a − b| = |δa − δb| / |a − b|.
Clearly, if a and b are very close to each other, then |a − b| is very small relative to
|δa | and |δb |. Hence, the relative error can be very large.
This analysis shows that subtraction between two close numbers should be avoided
whenever possible.
A classical example is the computation of the roots of the quadratic equation ax^2 + bx + c = 0.
The familiar formula

x1,2 = (−b ± √(b^2 − 4ac)) / (2a)

has a pitfall if b^2 >> 4ac, since in this case −b and sgn(b)√(b^2 − 4ac) are close to each other in magnitude; thus the root computed as

x = (−b + sgn(b)√(b^2 − 4ac)) / (2a)

suffers from severe cancellation. A remedy is to compute the root of larger magnitude first,

x1 = −(b + sgn(b)√(b^2 − 4ac)) / (2a),

in which no cancellation occurs, and then recover the other root from the relation x1 x2 = c/a, i.e., x2 = c / (a x1).
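A sketch of both variants (math.copysign supplies sgn(b); the test equation x^2 − 10^8 x + 1 = 0, with roots near 10^8 and 10^-8, satisfies b^2 >> 4ac):

    import math

    def roots_naive(a, b, c):
        s = math.sqrt(b*b - 4*a*c)
        return (-b + s) / (2*a), (-b - s) / (2*a)

    def roots_stable(a, b, c):
        # Large-magnitude root first (no cancellation), then x1*x2 = c/a.
        s = math.copysign(math.sqrt(b*b - 4*a*c), b)
        x1 = -(b + s) / (2*a)
        return c / (a * x1), x1

    print(roots_naive(1.0, -1e8, 1.0))    # small root ~ 7.45e-09: wrong
    print(roots_stable(1.0, -1e8, 1.0))   # small root = 1e-08: correct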
Another idea is to use the backward error. Let x be the exact input value of f. The backward error is given by

x̂ − x,

where x̂ is defined by the equation f(x̂) = f̂(x). That is, we attribute the computational error to the propagation of a hypothetical initial error x̂ − x under exact arithmetic, which would faithfully compute f(x̂).
The next example shows the computation of forward and backward errors. Suppose that we use f̂(x) = 1 + x + x^2/2 + x^3/6 to approximately evaluate the function f(x) = e^x at x = 1. The backward error is determined by x̂ = log(f̂(x)), since f(x̂) = e^x̂ = f̂(x). At x = 1 we have f̂(1) = 2.666667, and hence

x̂ = log(2.666667) = 0.980829,
Forward error = f̂(x) − f(x) = 2.666667 − 2.718282 = −0.051615,
Backward error = x̂ − x = 0.980829 − 1 = −0.019171.
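These numbers can be reproduced directly (a sketch):

    import math

    def fhat(x):
        # truncated Taylor series of e^x
        return 1 + x + x**2/2 + x**3/6

    f = math.exp                   # the exact function
    x = 1.0
    xhat = math.log(fhat(x))       # defined by f(xhat) = fhat(x)
    print(fhat(x) - f(x))          # forward error:  ~ -0.051615
    print(xhat - x)                # backward error: ~ -0.019171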