Scientific Computing - LESSON 1: Computer Arithmetic and Error Analysis
Rounding errors
With a finite number of binary bits, a computer can only represent a finite number
of real numbers. That is, most real numbers cannot be represented exactly by a
computer. Therefore, a real number usually has to be rounded to another number
that can be represented by the computer.
The above discussion also indicates that a real number can be represented exactly
with a finite number of binary bits if and only if it can be written in a reduced
fractional form with the denominator being a power of 2.
Problem 1. Write 1/5 as a binary number with 6 bits after the binary point.
Problem 2. Devise a method for converting a fractional number into its binary representation.
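One possible method, sketched below in Python (the helper name to_binary_fraction is ours), repeatedly doubles the fractional part and reads off one bit per doubling:

    def to_binary_fraction(x, bits):
        # Doubling x shifts its binary point one place to the right, so
        # the integer part that appears is the next bit after the point.
        digits = []
        for _ in range(bits):
            x *= 2
            bit = int(x)
            digits.append(str(bit))
            x -= bit                   # keep only the fractional part
        return "0." + "".join(digits)  # truncated (not rounded) to `bits` bits

    print(to_binary_fraction(1/5, 6))  # 0.001100, cf. Problem 1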
Let us consider the task of evaluating sin(1/5). Firstly, because the sine function f(x) = sin(x) cannot be evaluated exactly by a computer, we choose to use the polynomial f̂(x) = x − x^3/6, a truncated Taylor expansion of sin(x), to approximate f(x) = sin(x). Secondly, because x = 1/5 cannot be represented exactly by a computer, it is rounded to a nearby machine number x̂, so that what is actually computed is f̂(x̂). The total error can be decomposed as

f̂(x̂) − f(x) = [f̂(x̂) − f(x̂)] + [f(x̂) − f(x)].

The first part [f̂(x̂) − f(x̂)] is called the computational error, and the second part [f(x̂) − f(x)] is called the propagated data error.
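A quick numerical check of this decomposition (a sketch: the double-precision math.sin serves as a stand-in for the exact f, so only the computational error is visible here):

    import math

    x = 1/5                       # the input, rounded to fl(1/5) in double precision
    f_exact  = math.sin(x)        # f(x), to machine accuracy
    f_approx = x - x**3 / 6       # fhat(x), the truncated Taylor expansion

    # The computational error fhat(x) - f(x) is the Taylor remainder,
    # roughly -x^5/120 ~ -2.7e-06 for x = 0.2:
    print(f_approx - f_exact)     # ~ -2.66e-06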
The propagated data error is determined by the stability of the problem under consideration. For an unstable problem, a small change of the input from x to x̂ may cause a large variation f(x̂) − f(x) in the function value, e.g., the case of f(x) = tan(x) for x near π/2.
When an algorithm for solving a problem produces a large error, one needs to find
out whether it is because the problem is inherently unstable (i.e. the propagated data
error is dominant) or because the algorithm used gives a poor approximation (i.e.,
the computational error is dominant). Only when a numerically accurate algorithm is applied to a well-conditioned problem can one guarantee an accurate result.
Condition number
The condition number measures how the relative change in the result (or output) of a numerical problem depends on the relative change in its input. It
is used to describe the sensitivity of the problem, assuming exact arithmetic; thus,
it has nothing to do with any particular algorithm used to produce the result. It is
understood that exact arithmetic introduces no error in a computational process.
Suppose that a function f (x) is evaluated, and we want to know whether a small
error in x would cause a big error in the function value f (x). Suppose that x is
perturbed into its approximation x̃, and the corresponding function value is f (x̃).
Then the relative error in the input is

|x − x̃| / |x|,

and the relative error in the output is

|f(x) − f(x̃)| / |f(x)|.

The ratio of these two relative errors gives the condition number,

cond(f(x)) = (|f(x) − f(x̃)| / |f(x)|) / (|x − x̃| / |x|) = (|x| / |f(x)|) × (|f(x) − f(x̃)| / |x − x̃|).
If the condition number is large, then a small relative change in the input will cause a big relative change in the output; i.e., the output is sensitive to the input. A problem with a large condition number is therefore called ill-conditioned, while a problem with a small condition number is called well-conditioned.
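For a differentiable f, letting x̃ → x in this ratio gives the differential form cond(f(x)) ≈ |x f′(x)| / |f(x)|, which is easy to estimate numerically. A minimal sketch (the helper name cond and the step size h are our choices):

    import math

    def cond(f, x, h=1e-6):
        # Central-difference estimate of f'(x), then |x * f'(x) / f(x)|.
        fprime = (f(x + h) - f(x - h)) / (2 * h)
        return abs(x * fprime / f(x))

    print(cond(math.sqrt, 2.0))                # 0.5: sqrt is well-conditioned
    print(cond(math.tan, math.pi/2 - 0.01))    # ~156: tan is ill-conditioned near pi/2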
Floating-point numbers
A floating-point number has the form

x = ±r × β^m = ±(0.r1 r2 … rn)_β × β^m,

where β is an integer called the base, m the exponent, and r = 0.r1 r2 … rn the mantissa, with r1 ≠ 0 and 0 ≤ ri < β. On a binary computer (β = 2), a normalized floating-point number is usually written with the leading 1 in front of the binary point,

x = ±(1.r1 r2 … rn)_2 × 2^m,

which is the convention used in the rest of this lesson.
A possible bit allocation in a computer with a 32-bit word length is the following: 1 sign bit, 8 bits for the exponent m, and 23 bits for the mantissa r.
We use m = −127 for 0 and m = 128 for ∞, so the effective exponent range is −126 ≤ m ≤ 127, and the effective range of the positive floating point numbers is roughly from 2^-126 ≈ 1.2 × 10^-38 to 2^128 ≈ 3.4 × 10^38.
The distribution of the floating point numbers along the real axis is uneven. In fact, the machine numbers between two consecutive powers of 2 are evenly spaced, and this spacing becomes finer towards zero. For example, with the above bit allocation of a 32-bit word, the next (greater) neighbor of the machine number 2^8 is 2^8 + 2^-15, whereas the next neighbor of the machine number 2^30 is 2^30 + 2^7.
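These gaps can be checked directly in single precision with NumPy (a sketch; np.spacing gives the distance from a machine number to its next greater neighbor):

    import numpy as np

    print(np.spacing(np.float32(2.0**8)))    # 2**-15 ~ 3.0517578e-05
    print(np.spacing(np.float32(2.0**30)))   # 2**7  = 128.0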
Machine unit
The machine unit of a floating point system is the smallest positive number ε such that fl(1 + ε) > 1. The machine unit is denoted by ε_M, and is also referred to as the machine precision or machine epsilon. Assuming rounding to the nearest, the machine unit of the floating point system given above is 2^-24: any real number smaller than 1 + 2^-24 and no less than 1 is rounded to 1, while any real number between 1 + 2^-24 and 1 + 2^-23 (the next neighbor of 1 in the floating point system) is rounded to 1 + 2^-23. The machine unit plays an important role in bounding the relative error in approximating any real number by its machine representation (see below).
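The machine unit can also be found experimentally by repeated halving. A sketch in double precision, whose 52-bit mantissa gives ε_M = 2^-53 (the analogue of 2^-24 above):

    eps = 1.0
    while 1.0 + eps > 1.0:   # halve until fl(1 + eps) rounds back to 1
        eps /= 2
    # With round-to-nearest (ties to even), the loop exits at eps = 2**-53,
    # the largest power of two for which fl(1 + eps) == 1:
    print(eps, eps == 2.0**-53)   # 1.1102230246251565e-16 True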
As an illustration, consider computing the partial sums of the harmonic series,

hk = 1 + 1/2 + 1/3 + … + 1/k,

by adding the successive terms, that is, using hk+1 = hk + 1/(k+1). Mathematically the sequence diverges, growing like hk ≈ ln(k). On the other hand, the sequence will stop increasing at some point when it is computed by adding its successive terms, because of the finite computer precision. Specifically, this occurs when the ratio of the next term 1/(k+1) to the partial sum hk drops below the machine precision ε_M.
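A sketch of this stagnation in single precision (in double precision the same thing happens, but only after an astronomically large number of terms; the loop below takes a few seconds):

    import numpy as np

    h = np.float32(0.0)
    k = 0
    while True:
        k += 1
        term = np.float32(1.0) / np.float32(k)
        if h + term == h:     # 1/k is below half an ulp of h: the sum stagnates
            break
        h = h + term
    print(k, h)   # roughly k ~ 2e6 (about 2**21), with h ~ 15.4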
There are only a finite number of floating point numbers for a given system with
a fixed word length. On the other hand, there are infinitely many real numbers.
Therefore, most real numbers, or even rational numbers for that matter, cannot be
represented exactly by a floating point number. For example, we have seen that 1/5
must be approximated by a binary floating-point number. Hence, errors are almost
inevitable in converting a real number into its floating point representation.
Then, how big is this relative rounding error? Given a real number x > 0 in the
form,
x = r × 2^m = (1.r1 r2 … rn rn+1 …)_2 × 2^m,

there are two close neighbors of x among all the floating point numbers; they are

x− = (1.r1 r2 … rn)_2 × 2^m ≤ x

and

x+ = [(1.r1 r2 … rn)_2 + 2^-n] × 2^m ≥ x.
Now we want to know how big the rounding error is. When x− is used to approximate x, i.e., when |x − x−| ≤ |x − x+|, since |x+ − x−| = 2^-n × 2^m, we have

|x − x−| ≤ (1/2) × 2^-n × 2^m = 2^(m-n-1).

Therefore,

|x − x−| / |x| ≤ 2^(m-n-1) / (r × 2^m) ≤ 2^-(n+1),

since r ≥ 1.
The same bound holds when x+ is the nearest neighbor. In other words, rounding any real number x to its nearest machine number fl(x) incurs a relative error of at most 2^-(n+1); for the 32-bit system above (n = 23) this is exactly the machine unit ε_M = 2^-24. This is commonly written as

fl(x) = x(1 + δ), with |δ| ≤ ε_M.
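A direct check of this bound, using np.float32 in the role of fl (a sketch; the sample inputs are arbitrary):

    import math
    import numpy as np

    eps_M = 2.0**-24                              # machine unit for n = 23
    for x in [1/5, math.pi, 1e-10, 123456.789]:
        delta = (float(np.float32(x)) - x) / x    # fl(x) = x * (1 + delta)
        assert abs(delta) <= eps_M
        print(f"x = {x:<12g} delta = {delta:.3e}")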
When two nearby machine numbers are subtracted, which is called cancellation, there is usually a big loss of precision, i.e., of significant bits. Consider the following example.
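One such example, in single precision (a sketch: mathematically (1 + x) − 1 equals x, but the addition first rounds 1 + x to the nearest machine number, and the subtraction then exposes that rounding error at full size):

    import numpy as np

    x   = np.float32(1e-4)
    one = np.float32(1.0)
    y   = (one + x) - one     # should equal x, but 1 + x was rounded first
    rel = abs(y - x) / x
    print(x, y, rel)          # relative error ~ 1.7e-04: of the ~7 significant
                              # decimal digits of float32, only 3-4 survive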
Another way of looking at the risk involved in subtracting two close numbers is as
follows. Suppose that two numbers a and b are approximated by ã and b̃, respectively.
Let δa = a − ã, and δb = b − b̃. Then, if the subtraction a − b is performed, the relative
error of using ã − b̃ to approximate a − b is
|(a − b) − (ã − b̃)| / |a − b| = |δa − δb| / |a − b|.
Clearly, if a and b are very close to each other, then |a − b| is very small relative to
|δa | and |δb |. Hence, the relative error can be very large.
This analysis shows that subtraction between two close numbers should be avoided
whenever possible.
A classical example is the computation of the roots of the quadratic equation ax^2 + bx + c = 0.
The familiar formula

x1,2 = (−b ± √(b^2 − 4ac)) / (2a)

has a pitfall if b^2 >> 4ac, since in this case −b and sgn(b)√(b^2 − 4ac) are close to each other in magnitude; thus the root computed as

x = (−b + sgn(b)√(b^2 − 4ac)) / (2a)

suffers from severe cancellation. A remedy is to compute the root of larger magnitude first,

x1 = −(b + sgn(b)√(b^2 − 4ac)) / (2a),

in which no cancellation occurs, and then recover the other root from the relation x1 x2 = c/a, i.e., x2 = c / (a x1).
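A sketch of both variants (math.copysign supplies sgn(b); the test equation x^2 − 10^8 x + 1 = 0, with roots near 10^8 and 10^-8, satisfies b^2 >> 4ac):

    import math

    def roots_naive(a, b, c):
        s = math.sqrt(b*b - 4*a*c)
        return (-b + s) / (2*a), (-b - s) / (2*a)

    def roots_stable(a, b, c):
        # Large-magnitude root first (no cancellation), then x1*x2 = c/a.
        s = math.copysign(math.sqrt(b*b - 4*a*c), b)
        x1 = -(b + s) / (2*a)
        return c / (a * x1), x1

    print(roots_naive(1.0, -1e8, 1.0))    # small root ~ 7.45e-09: wrong
    print(roots_stable(1.0, -1e8, 1.0))   # small root = 1e-08: correct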
Another idea is to use the backward error. Let x be the exact input value of f. The backward error is given by

x̂ − x,

where x̂ is defined by the equation f(x̂) = f̂(x). That is, we attribute the computational error to the propagation of a hypothetical initial error x̂ − x under exact arithmetic, which would faithfully compute f(x̂).
The next example shows the computation of forward and backward errors. Suppose that we use f̂(x) = 1 + x + x^2/2 + x^3/6 to approximately evaluate the function f(x) = e^x at x = 1. The backward error is determined by x̂ = log(f̂(x)), since f(x̂) = e^x̂ = f̂(x). At x = 1 we have f̂(1) = 2.666667, and hence

x̂ = log(2.666667) = 0.980829,
Forward error = f̂(x) − f(x) = 2.666667 − 2.718282 = −0.051615,
Backward error = x̂ − x = 0.980829 − 1 = −0.019171.
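These numbers can be reproduced directly (a sketch):

    import math

    def fhat(x):
        # truncated Taylor series of e^x
        return 1 + x + x**2/2 + x**3/6

    f = math.exp                   # the exact function
    x = 1.0
    xhat = math.log(fhat(x))       # defined by f(xhat) = fhat(x)
    print(fhat(x) - f(x))          # forward error:  ~ -0.051615
    print(xhat - x)                # backward error: ~ -0.019171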