CH 1 Number System and Numerical Error Analysis
CH 1 Number System and Numerical Error Analysis
CH 1 Number System and Numerical Error Analysis
CHAPTER ONE
Number System and Numerical Error Analysis
1.1 Introduction
This section is devoted to introducing review of general information concerned with the
quantification of errors such as the concept of significant digits is reviewed.
Significant digits:
Are important in showing the truth one has in a reported number. For example, if someone asked
me what the population of my county is, I would simply respond, “The population of Adama area
is 1 million!” Is it exactly one million? “I don‟t know! But I am quite sure that it will not be
two”.
The problem comes when someone else was going to give me a $100 for every citizen of the
county, I would have to get an exact count in that case. That count would have been 1,079,587 in
this year.
So you can see that in my statement that the population is 1 million, that there is only one
significant digit. i.e, 1, and in the statement that the population is 1,079,587, there are seven
significant digits. That means I am quite sure about the accuracy of the number up to the seventh
digit. So, how do we differentiate the number of digits correct in 1,000,000 and 1,079,587? Well
for that, one may use scientific notation. For our data we can have
Accuracy: refers to how a closely computed value or measured value agrees with the true value.
Precision: refers to how closely individual computed or measured values agree with each other.
Computational Methods 1
Adama University Department Of Electrical Engineering
Error in solving an engineering or science problem can arise due to several factors. First, the error
may be in the modeling technique. A mathematical model may be based on using assumptions
that are not acceptable. For example, one may assume that the drag force on a car is proportional
to the velocity of the car, but actually it is proportional to the square of the velocity of the car.
This itself can create huge errors in determining the performance of the car, no matter how
accurate the numerical methods you may use.
Second, errors may arise from mistakes in programs themselves or in the measurement of
physical quantities. But, in application of numerical methods itself, the two errors we need to
focus on are:
- Round off error and,
- Truncation error
A computer can only represent a number approximately because there is no mechanism to deal
with infiniteness in computers. For example, a number like may be represented as 0.333333
on a PC. Then, the round off error generated in this case is
There are other numbers that cannot be represented exactly. For example, π and √ are numbers
that need to be approximated in computer calculations.
Truncation error is defined as the error caused by truncating a mathematical procedure. For
example, the Maclaurin series for is given as
This series has an infinite number of terms but when using this series to calculate only a finite
number of terms can be used on iteration. For example, if one uses three terms in his calculation,
then the resulting truncation error becomes
[ ]
Clearly we cannot avoid truncation errors. The question should rather be how we can control
truncation errors. We can use the concept of relative approximation` in order to control this. For
the time being let us assume that this could be achieved by setting a particular acceptable error
Computational Methods 2
Adama University Department Of Electrical Engineering
magnitude and decide for the minimum number of terms involved in our calculation. With this
assumption in mind one can finally conclude that increasing the number of terms would decrease
truncation error.
The total numerical error is the summation of truncation error and round off error. We can
minimize round off errors by increasing the number of significant digits. In contrast to this,
Subtractive cancellation 1 and increase in the number of computations (i.e number of terms
involved) will intensify round off errors.
Truncation error on the other hand could be reduced with decreasing the step size which in turn
could increase subtractive cancelation or increase in computations.
All of the above statements are simply would mean that truncation error increases as round off
errors are increased which that means we are forced to trade off between the two errors in finding
the suitable step size for a particular computation.
The basic unit of memory is the binary digit, called a bit. A bit may contain a 0 or a l. It is the
simplest possible unit. A device capable of storing only zeros could hardly form the basis of a
memory system.
People often say that computers use binary arithmetic because it is "efficient." What they mean
(although they rarely realize it) is that digital information can be stored by distinguishing between
different values of some continuous physical quantity, such as voltage or current. The more
values that must be distinguished, the lesser is the separation between adjacent values, and the
less reliable the memory we would have.
The binary number system requires only two values to be distinguished. Consequently, it is the
most reliable method for encoding digital information.
In everyday life, we use a number system with a base of 10, simply because we have ten fingers!
For example, look at the number 257.56. Each digit in 257.56 has a value of 0 through 9 and has a
place value. It can be written as
In a binary system, we have a similar system where the base is made of only two digits 0 and 1.
So it is a base 2 system. A number like (1011.0011) in base-2 represents the decimal number as
1
Subtractive cancellation is the term that refers to round off induced when subtracting two nearly equal floating point
numbers. A common instance where this occurs involves finding the roots of quadratic equation for case .
Computational Methods 3
Adama University Department Of Electrical Engineering
In general in both systems we have an integer part to the left of the decimal point and we have a
fraction part to the right of the decimal point.
The fundamental unit whereby „meaningful‟ information is represented is called a word. This is
an entity that consists of a string of binary digits or bits. Numbers are typically stored in one or
more words.
The primary logic units of a digital computer are on/off electronic components. Hence numbers
on a computer are „represented with binary or base-2 systems‟.
Integer representation:
The most straight forward approach is the signed magnitude method. It employs the first bit of a
word to indicate the sign (0 is positive and 1 is negative) of the number and the remaining bits are
used to store the number. The following figure shows a signed magnitude representation of the
integer -173 on a 16-bit computer.
1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1
1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1
Sign number
Fig: The representation of the decimal integer -173 on a `16 bit computer
Fractional numbers are represented in computers using floating point form. The general format is
Computational Methods 4
Adama University Department Of Electrical Engineering
So we can say that 156.78 could be represented as 0.15678 X 103 as a floating point number in
base 10 system.
The above figure shows one way of storing a floating point number in a word. The mantissa is
usually normalized if it has a leading zero digit. For example the number 1/34 = 0.029411765….
may be stored on a floating point base ten system that would allow four decimal places to be
stored i.e
Could be normalized as
That means we are available with more space to increase our significant figure in approximating
the number to the four decimal places available. One can easily deduce that the absolute value of
the mantissa is between 1/10 = 0.1 and 1. Generally, for any base b
In addition to taking more time and memory space the floating point representation has got its
own disadvantage because of round off errors introduced in approximating the mantissa.
1.3.2. Conversion algorithm between base two and base ten systems
A point of interest regarding the two number systems is the conversion mechanism between them.
We have already converted the binary system to the decimal in the beginning of this section and
now we will proceed with the reverse.
Let us take the number 11.1875.
Computational Methods 5
Adama University Department Of Electrical Engineering
Computational Methods 6
Adama University Department Of Electrical Engineering
Computational Methods 7
Adama University Department Of Electrical Engineering
Fig . Flow charts for the integer part and fraction part conversion of base two to base ten.
In any numerical analysis, errors will arise during the calculations. To be able to deal with the
issue of errors, we need to
(A) Identify where the error is coming from (“is it round off or truncation?”), followed by
(B) Quantifying the error (“what is this?”), and lastly
(C) Minimize the error as per our needs (“how about increasing the step size?”).
This section is devoted to addressing what we have stated under (B).
True error, denoted by is the difference between the true value (also called the exact value)
and the approximate value.
The magnitude of true error does not show how bad the error is. That means it actually depends
on the mathematical equation that describes a given quantity.
Relative true error is defined as the ratio of the true error and the true value.
Sometimes it would give more sense if we use the absolute relative true error | |
So far, we discussed how to calculate true errors. Such errors are calculated only if true values are
known. In most cases we will not have the luxury of knowing true values. So when we are solving
a problem numerically, we will only have access to approximate values. We need to know how to
quantify error for such cases.
Approximate error is denoted by and is defined as the difference between the present
approximation and previous approximation.
Here again the magnitude of approximate error does not show how bad the error is.
Computational Methods 8
Adama University Department Of Electrical Engineering
Relative approximate error is denoted by and is defined as the ratio between the approximate
error and the present approximation.
Similar to relative true error we will have approximate error in percentage as well as an absolute
value of relative approximate error.
The scope of this section to study how errors in numbers can influence the results of mathematical
calculations those numbers are involved in.
̅ ̅ ̅ ̅ ̅
Let us drop the second and higher terms and rearrange to approximately get
̅ ̅ ̅
̅ | ̅ | ̅
Here we are going to generalize the foregoing approach. So we will apply the multivariable
version of the tailor series. Suppose we have two variables u and v.
[ ]
Computational Methods 9
Adama University Department Of Electrical Engineering
̅ ̅ | | ̅ | | ̅
̅ ̅ ̅ | | ̅ | | ̅ | | ̅
̅ ̅
Here we measure the sensitivity of a given mathematical equation to variations in its input
variables. We say that a computation is numerically unstable if the uncertainty of the input
values is grossly magnified by the numerical method i.e the function we are using.
̅ ̅
̅ ̅
̅
The relative true error of x is given as ̅
Taking the ratio of the two we will define a condition number as follows
̅ ̅
̅
This number is used to measure the extent of uncertainty in x is magnified by f(x) with three
independent cases:
Computational Methods 10