Data Analysis with Excel®
An Introduction for Physical Scientists
Les Kirkup
University of Technology, Sydney
CAMBRIDGE
UNIVERSITY PRESS
PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge, United Kingdom
http://www.cambridge.org
© L. Kirkup 2002
A catalogue record for this book is available from the British Library
Q180.55.S7K57 2002
001.4'22'0285-dc21 2001037408
Contents
Preface xv
3 Data distributions 85
3.1 Introduction 85
3.2 Probability 86
3.2.1 Rules of probability 87
3.3 Probability distributions 89
3.3.1 Limits in probability calculations 93
References 438
Index 441
Preface
• provide a readable text from which students can learn the basic principles of data analysis;
• ensure that problems and exercises are drawn from situations likely to be familiar and relevant to students from the physical sciences;
• remove much of the demand for manual data manipulation and presentation by incorporating the spreadsheet as a powerful and flexible utility;
• emphasise the analysis tools most often used in the physical sciences;
• focus on aspects often given less attention in other texts for scientists, such as the treatment of systematic errors;
• encourage student confidence by incorporating 'worked' examples followed by exercises;
• provide access to extra material through generally accessible Web pages.
mostly avoided in the body of the text. Instead, emphasis has been given
to the assumptions underlying the formulae and range of applicability.
Details of derivations may be found in the appendices. It is assumed that
the reader is familiar with introductory calculus, graph plotting and the
calculations of means and standard deviations. Experience of laboratory
work at first year undergraduate level is also an advantage.
I am fortunate that many people have given generously of their time to help me during the preparation of this book. Their ideas, feedback and not least their encouragement are greatly appreciated. I also acknowledge many intense Friday night discussions with students and colleagues on matters relating to data analysis and their frequent pleadings with me to 'get a life'.
I would like to express my appreciation and gratitude to the following people:
1.1 Introduction
It is possible that when Feynman wrote these words he had in mind elaborate experiments devised to reveal the 'secrets of the universe', such as those involving the creation of new particles during high energy collisions in particle accelerators. However, experimentation encompasses an enormous range of more humble (but extremely important) activities such as testing the temperature of a baby's bath water by immersing an elbow into the water, or pressing on a bicycle tyre to establish whether it has gone 'flat'. The absence of numerical measures of quantities most distinguishes these experiments from those normally performed by scientists.
Many factors directly or indirectly influence the fidelity of data gathered during an experiment, such as the quality of the experimental design, experimenter competence, instrument limitations and time available to perform the experiment. Appreciating and, where possible, accounting for such factors are key tasks that must be carried out by an experimenter. After every care has been taken to acquire the best data possible, it is time to apply techniques of data analysis to extract the most from the data.
To find out something about the world, we experiment. A child does this naturally, with no training or scientific apparatus. Through a potent combination of curiosity and trial and error, a child quickly creates a viable model of the 'way things work'. This allows the consequences of a particular action to be anticipated. Curiosity plays an equally important role in the professional life of a scientist who may wish to know:
cover something interesting and new 'by accident', it is usual for science to progress by small steps. The insights gained by researchers (both experimentalists and theorists) combine to provide answers and explanations to some questions, and in the process create new questions that need to be addressed. In fact, even if something new is found by chance, it is likely that the discovery will remain a curiosity until a serious scientific investigation is carried out to determine if the discovery or effect is real or illusory. While scientists are excited by new ideas, a healthy amount of scepticism remains until the ideas have been subjected to serious and sustained scrutiny by others.
Though it is possible to enter a laboratory with only a vague notion of
how to carry out a scientific investigation, there is much merit in planning
ahead as this promotes the efficient use of resources, as well as revealing
whether the investigation is feasible or overambitious.
(a) The aim of the experiment is to determine the change in heat transfer
to a motor vehicle when a reflective coating is applied to the windows
of that vehicle.
(b) The aim of the experiment is to test the hypothesis that a reflective
coating applied to the windows of a motor vehicle reduces the amount
of heat transferred into that vehicle.
• comprehensive,
• clearly defined,
• internationally accepted,
• easy to use.
1.3.1 Units
The most widely used system of units in science is the SI system, which has been adopted officially by most countries around the world. Despite strongly favouring SI units in this text, we will also use some 'non-SI units' such as the minute and the degree, as these are likely to remain in widespread use in science for the foreseeable future.
The origins of the SI system can be traced to pioneering work done
on units in France in the late eighteenth century. In 1960 the name ‘SI
system’ was adopted and at that time it consisted of six fundamental or
‘base’ units. Since 1960 the system has been added to and refined and
remains constantly under review. From time to time suggestions are made
regarding how the definition of a unit may be improved. If this allows for
easier or more accurate realisation of the unit as a standard (permitting, for
Quantity    Derived unit    Symbol    Unit of quantity expressed in base units

Familiar quantities with their units expressed in derived and base units are shown in table 1.3.
Example 1
The farad is the SI derived unit of electrical capacitance. With the aid of table 1.3, express the unit of capacitance in terms of the base units, given that the capacitance, C, may be written

C = Q/V                (1.1)

where Q is the charge stored and V is the potential difference.

ANSWER
From table 1.3, the unit of charge expressed in base units is s·A and the unit of potential difference is kg·m²·s⁻³·A⁻¹. It follows that the unit of capacitance can be expressed with the aid of equation (1.1) as

farad = (s·A)/(kg·m²·s⁻³·A⁻¹) = kg⁻¹·m⁻²·s⁴·A²
Exercise A
The henry is the derived unit of electrical inductance in the SI system of units. With
the aid of table 1.3, express the unit of inductance in terms of the base units, given
the relationship
1.3.2 Standards
In such situations there are two widely used methods by which the value of the quantity may be specified. The first is to choose a multiple of the unit and indicate that multiple by assigning a prefix to the unit. So, for example, we might express the value of the capacitance of a capacitor as 47 μF. The symbol μ stands for the prefix 'micro' which represents a factor of 10⁻⁶. A benefit of expressing a value in this way is the conciseness of the representation. A disadvantage is that many prefixes are required in order to span the orders of magnitude of values that may be encountered in experiments. As a result, several unfamiliar prefixes exist. For example, the size of the electrical charge carried by an electron is about 160 zC. Only dedicated students of the SI system would immediately recognise z as the symbol for the prefix 'zepto' which represents the factor 10⁻²¹. Table 1.4 includes the prefixes currently used in the SI system. The prefixes shown in bold are the most commonly used.
Another way of expressing the value of a quantity is to give the number that precedes the unit in scientific notation. To express any number in scientific notation, we separate the first non-zero digit from the second digit by a decimal point, so for example, the number 1200 becomes 1.200. So that the number remains unchanged we must multiply 1.200 by 10³, so that 1200 is written as 1.200 × 10³. Scientific notation is preferred for very large or very small numbers. For example, the size of the charge carried by the electron is written as 1.60 × 10⁻¹⁹ C. Though any value may be expressed using scientific notation, we should avoid taking this approach to extremes. For example, suppose the mass of a body is 1.2 kg. This could be written as 1.2 × 10⁰ kg, but this is possibly going too far.
Example 2
Rewrite the following values using: (a) commonly used prefixes and (b) scientific
notation:
ANSWER
Exercise B
1. Rewrite the following values using prefixes:
(i) 1.38X 10"**° J inzeptojoules; (ii) 3.6x 10"’^s in microseconds; (iii) 43258 Win
kilowatts; (iv) 7.8 X10° m/s in megametres per second.
2. Rewrite the following values using scientific notation:
(i) 0.650 nm in metres; (ii) 37 pC in coulombs; (iii) 1915 kW in watts; (iv) 125 μs in seconds.
1.2 × 10³ kg
1.200 × 10³ kg
m = (1200 ± 12) kg
Exercise C
1. How many significant figures are implied by the way each of the following values is written:
(i) 1.72 m; (ii) 0.00130 mol/cm³; (iii) 6500 kg; (iv) 1.701 × 10³ V; (v) 100 °C; (vi) 100.0 °C?
2. Express the following values using scientific notation to two, three and four significant figures:
(i) 775710 m/s³; (ii) 0.001266 s; (iii) −105.4 °C; (iv) 14000 nH in henrys; (v) 12.400 kJ in joules; (vi) 101.56 nm in metres.
1.4.1 Histograms
1265 1196 1277 1320 1248 1245 1271 1233 1231 1207
1240 1184 1247 1343 1311 1237 1255 1236 1197 1247
1301 1199 1244 1176 1223 1199 1211 1249 1257 1254
1264 1204 1199 1268 1290 1179 1168 1263 1270 1257
1265 1186 1326 1223 1231 1275 1265 1236 1241 1224
1255 1266 1223 1233 1265 1244 1237 1230 1258 1257
1252 1253 1246 1238 1207 1234 1261 1223 1234 1289
1216 1211 1362 1245 1265 1296 1260 1222 1199 1255
1227 1283 1258 1199 1296 1224 1243 1229 1187 1325
1235 1301 1272 1233 1327 1220 1255 1275 1289 1248
Interval (counts)      Frequency
1160 ≤ x < 1180         3
1180 ≤ x < 1200        10
1200 ≤ x < 1220         7
1220 ≤ x < 1240        24
1240 ≤ x < 1260        25
1260 ≤ x < 1280        16
1280 ≤ x < 1300         6
1300 ≤ x < 1320         4
1320 ≤ x < 1340         3
1340 ≤ x < 1360         1
1360 ≤ x < 1380         1
axis versus interval on the horizontal axis. In doing this we create a histogram.
Table 1.6, created using the data in table 1.5, shows the number of values which occur in consecutive intervals of 20 counts beginning with the interval 1160 to 1180 counts and extending to the interval 1360 to 1380 counts. This table is referred to as a grouped frequency distribution. The distribution of counts is shown in figure 1.1. We note that most values are clustered between 1220 and 1280 and that the distribution is almost symmetric, with the suggestion of a longer 'tail' at larger counts. Other methods by which univariate data can be displayed include stem and leaf plots and
pie charts,® though these tend to be used less often than the histogram in
the physical sciences.
There are no 'hard and fast' rules about choosing the width of intervals for a histogram, but a good histogram:

N ≈ √n                (1.3)

w = range/N           (1.4)

where N is the number of intervals, n is the number of values and w is the width of each interval.
See Blaisdell (1998) for details of alternate methods of displaying univariate data.
We should err on the side of selecting 'easy to work with' intervals, rather than holding rigidly to the value of w given by equation (1.4). If, for example, w were found using equation (1.4) to be 13.357, then a value of w of 10 or 15 should be considered, as this would make tallying up the number of values in each interval less prone to mistakes.
If there are many values then plotting a histogram 'by hand’ becomes
tedious. Happily, there are many computer based analysis packages, such
as spreadsheets (discussed in chapter 2), which reduce the effort that
would otherwise be required.
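For readers who would like to check a hand-tallied grouped frequency distribution with something other than a spreadsheet, the following minimal Python sketch groups count values into the same 20-count intervals used in table 1.6. Only the first ten values of table 1.5 are listed here to keep the sketch short; with all 100 values entered, the output reproduces table 1.6. The use of numpy.histogram is an illustrative choice, not part of the original text.

    import numpy as np

    # First ten of the 100 count values in table 1.5 (the full list would be entered here).
    counts = [1265, 1196, 1277, 1320, 1248, 1245, 1271, 1233, 1231, 1207]

    # Interval edges of width 20 counts, from 1160 to 1380, as used in table 1.6.
    edges = np.arange(1160, 1381, 20)

    frequency, _ = np.histogram(counts, bins=edges)
    for lower, f in zip(edges[:-1], frequency):
        print(f"{lower} <= x < {lower + 20}: {f}")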
Exercise D
Table 1.7 shows the values of 52 'weights' of nominal mass 50 g used in an undergraduate laboratory. Using the values in table 1.7, construct
Mass (g)
• the intensity of light emitted from a light emitting diode (LED) as the
temperature of the LED is reduced;
• the power output of a solar cell as the angle of orientation of the cell
with respect to the sun is altered;
• the change in electrical resistance of a humidity sensor as the humidity is varied;
• the variation of voltage across a conducting ceramic as the current
through it changes;
• the decrease in the acceleration caused by gravity with depth below
the earth’s surface.
Let us consider the last example in a little more detail, in which the free-fall
acceleration caused by gravity varies with depth below the earth’s surface.
Based upon considerations of the gravitational attraction between bodies,
it is possible to predict a relationship between acceleration and depth
when a body has uniform density. By gathering ‘real data’ this prediction
can be examined. Conflict between theory and experiment might suggest
modifications are required to the theory or perhaps indicate that some
‘real’ anomaly, such as the existence of large deposits of gold close to the
site of the measurements, has influenced the values of acceleration.
As the acceleration in the example above depends on depth, we refer to the acceleration as the dependent variable, and the depth as the independent variable. (The independent and dependent variables are sometimes referred to as the predictor and response variables respectively.) A convenient way to record values of the dependent and independent variables is to construct a table. Though concise, a table of data is fairly dull and cannot assist efficiently with the identification of trends or patterns in data. A revealing and very popular way to display bivariate data is to plot an x-y graph (sometimes referred to as a scatter graph). The 'x' and the 'y' are the symbols used to identify the horizontal and vertical axes respectively of a Cartesian co-ordinate system.⁹
If properly prepared, a graph is a potent summary of many aspects of an experiment.¹⁰ It can reveal:
⁹ The horizontal and vertical axes are sometimes referred to as the abscissa and ordinate respectively.
¹⁰ Cleveland (1994) discusses what makes 'good practice' in graph plotting.
Figure 1.2. Temperature versus time for a thermoelectric cooler.
[Figure 1.3: current (mA) versus voltage (V) for a LED, plotted on linear scales.]
The scales on the graph in figure 1.2 are linear. That is, each division on the x axis corresponds to a time interval of 200 s and each division on the y axis corresponds to a temperature interval of 5 °C. In some situations important information can be obscured if linear scales are employed. As an example, consider the current-voltage relationship for a LED as shown in figure 1.3. It is difficult to determine the relationship between current and voltage for the LED in figure 1.3 for values of voltage below about 2 V. As the current data span several orders of magnitude, the distribution of values can be more clearly discerned by replacing the linear y scale in figure 1.3 by a logarithmic scale. Though graph paper is available in which the scales are logarithmic, many computer based graph plotting routines, including those supplied with spreadsheet packages, allow easy conversion of the y or x or both axes from linear to logarithmic scales. Figure 1.4 shows the data from figure 1.3 replotted using a logarithmic y scale. As one of the axes remains linear, this type of graph is sometimes referred to as semi-logarithmic.
Exercise E
The variation of current through a Schottky diode is measured as the temperature of
the diode increases. Table 1.8 shows the data gathered in the experiment. Choosing
appropriate scales, plot a graph of current versus temperature for the Schottky diode.
Figure 1.4. Current versus voltage using semi-logarithmic scales on the x and y axes.
297 2.86X10“3
317 1.72X10^8
336 6.55X10-8
353 2.15X10-^
377 1.19X10-6
397 3.22X10-6
422 1.29X10-6
436 2.45X10-6
467 9.97X10 6
475 1.41X10-'*
x̄ = (x₁ + x₂ + x₃ + ⋯ + xₙ)/n = (Σxᵢ)/n                (1.6)

The limits of the summation are usually not shown explicitly, and we write x̄ = Σxᵢ/n.
Frequency (Hz) 2150 2120 2134 2270 2144 2156 2139 2122
Exercise F
Consider the data in table 1.11. Determine the mean and the median of the values
of capacitance in this table.
Capacitance (pF) 103.7 100.3 98.4 99.3 101.0 106.1 103.9 101.5 100.9 105.3
The starting point for finding a number which usefully describes the spread of values is to calculate the deviation from the mean of each value. If the ith value is written as xᵢ, then the deviation, dᵢ, is defined as

dᵢ = xᵢ − x̄                (1.7)

where x̄ is the mean of the values.
At first inspection it appears plausible to use the mean of the deviations as representative of the spread of the values. In this case,

mean deviation = (Σdᵢ)/n                (1.8)

However, the sum of the deviations about the mean is zero, so the mean deviation is of no use as a measure of spread. Instead we square each deviation before summing, which leads to the variance, σ², given by

σ² = Σ(xᵢ − x̄)²/n                (1.9)

The square root of the variance is the standard deviation, σ, so that

σ = √[Σ(xᵢ − x̄)²/n]                (1.10)
Example 3
A rare earth oxide gains oxygen when it is heated at 600 °C in an oxygen-rich atmosphere. Table 1.12 shows the mass gain from twelve samples of the oxide which were heated to a temperature of 600 °C for 10 hours. Calculate (i) the mean, (ii) the standard deviation and (iii) the variance of the values in table 1.12.
ANSWER
An alternative form of equation (1.10), often more convenient for calculation, is

σ = √[(Σxᵢ²)/n − (x̄)²]                (1.11)

For the data in table 1.12, Σxᵢ² = 422.53 (mg)² and x̄ = 5.9083 mg, so that

σ = √(422.53/12 − (5.9083)²) = 0.55 mg
Mass gain (mg) 6.4 6.3 5.6 6.8 5.5 5.0 6.2 6.1 5.5 5.0 6.2 6.3
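As a cross-check on Example 3, the short Python sketch below applies equations (1.9)–(1.11) to the mass gains of table 1.12; the script itself is an illustrative addition, not part of the original worked answer.

    # Mass gains (mg) from table 1.12.
    x = [6.4, 6.3, 5.6, 6.8, 5.5, 5.0, 6.2, 6.1, 5.5, 5.0, 6.2, 6.3]
    n = len(x)

    mean = sum(x) / n                                    # (i)  x-bar = 5.9083 mg
    variance = sum((xi - mean) ** 2 for xi in x) / n     # (iii) equation (1.9)
    sigma = variance ** 0.5                              # (ii)  equation (1.10)

    # Shortcut form, equation (1.11): sigma = sqrt(sum(x^2)/n - x-bar^2)
    sigma_alt = (sum(xi ** 2 for xi in x) / n - mean ** 2) ** 0.5

    print(round(mean, 4), round(sigma, 2), round(sigma_alt, 2), round(variance, 2))
    # 5.9083 0.55 0.55 0.3  (agreeing with the worked answer)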
Exercise G
1 Show that equation (1.10) may be rewritten in the form given by equation (1.11).
2 When a hollow glass tube of narrow bore is placed in water, the water rises up
the tube due to capillary action. The values of height reached by the water in a
small bore glass tube are shown in table 1.13. For these values determine:
Height (cm) 4.15 4.10 4.12 4.12 4.32 4.20 4.18 4.13 4.15
R̄ = (ΣRᵢ)/100 000                (1.12)

μ = lim(n→∞) (Σxᵢ)/n                (1.13)

σ = lim(n→∞) √[Σ(xᵢ − μ)²/n]                (1.14)
The term ‘true value’ is often used to express the value of a quantity that
would be obtained if no influences, such as shortcomings in an instrument
used to measure the quantity, existed to ‘interfere with’ the measurement.
In order to determine the true value of a quantity, we require that the fol¬
lowing conditions hold:
• The quantity being measured does not change over the time interval
in which the measurement is made.
• External influences that might affect the measurement, such as 50 Hz
electrical interference or changes in room temperature and humidity,
are absent.
• The instrument used to make the measurement is ‘ideal’.'®
By making many repeat measurements and taking the mean of the values obtained, we might expect that 'scatter' due to imperfections in the measurement process would cancel out, in which case the mean would be close to the true value. We might even go one step further and suggest that if we make an infinite number of measurements, the mean of the values (i.e. the population mean, μ) will coincide with the true value. Unfortunately, due to systematic errors, not all imperfections in the measurement process 'average out' by taking the mean, and so we must be very careful when referring to the population mean as the 'true value'.
There are situations in which the term 'true value' may be misleading. Returning to our example in which a population consists of the resistances of 100 000 resistors, there is no doubt that this population has a mean, but in what sense, if any, is this population mean the 'true value'? Unlike the example of determination of the charge on an electron, in which variability in values is due to inadequacies in the measurement process, the variability in resistance values is mainly due to variations between resistors introduced during manufacture. Therefore no true value, in the sense used to describe an attribute of a single entity, such as the charge on an electron, exists for the group of resistors.
x̄ = (Σxᵢ)/n                (1.15)

s = √[Σ(xᵢ − x̄)²/(n − 1)]                (1.16)
Example 4
In a fluid flow experiment, the volume of water flowing through a pipe is determined
by collecting water at 1 minute intervals in a measuring cylinder. Table 1.14 shows
the volume of water collected in ten successive 1 minute intervals.
ANSWER
With only ten values to deal with, it is quite easy to find x̄ using equation (1.15). Determining s using equation (1.16) requires more effort. Most scientific calculators allow you to enter data and will calculate x̄ and s. Perhaps an even better alternative is to use a spreadsheet (see chapter 2) as values entered can be inspected easily before proceeding to the calculations.
Volume (cm³) 256 250 259 243 245 260 253 254 244 249
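A brief Python sketch of the calculation in Example 4 is given below, using the volumes in table 1.14; Python's statistics module uses the n − 1 divisor of equation (1.16) for stdev(). The sketch is an illustrative aside, not part of the original answer.

    import statistics

    # Volumes (cm^3) collected in ten successive 1 minute intervals (table 1.14).
    volumes = [256, 250, 259, 243, 245, 260, 253, 254, 244, 249]

    x_bar = statistics.mean(volumes)     # sample mean, equation (1.15)
    s = statistics.stdev(volumes)        # sample standard deviation, equation (1.16)

    print(f"mean = {x_bar:.1f} cm^3, s = {s:.1f} cm^3")
    # mean = 251.3 cm^3, s = 6.1 cm^3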
Exercise H
In an experiment to study electrical signals generated by the human brain, the time
was measured for a particular ‘brain’ signal to double in size when a person closes
both eyes. The values obtained for 20 successive eye closures are shown in table 1.15.
Using the values in the table, determine the sample mean and estimate the standard
deviation of the population.
Time (s)
4.51 2.33 1.51 1.91 2.54 1.91 1.51 1.52 2.71 3.03
2.12 2.61 0.82 2.51 2.07 1.73 2.34 1.82 2.32 1.92
d = [(s − σ)/σ] × 100%                (1.17)

where s is calculated using the n − 1 divisor (equation (1.16)) and σ using the n divisor (equation (1.10)). Since s/σ = √[n/(n − 1)], this can be written

d = [√(n/(n − 1)) − 1] × 100%                (1.18)

Figure 1.5 shows the variation of d with n as given by equation (1.18), for n between 3 and 100. The graph indicates that as n exceeds 10, the percentage difference between standard deviations falls below 5%.
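The trend shown in figure 1.5 can be reproduced with a few lines of Python. The sketch below assumes d is the percentage by which s (the n − 1 form) exceeds σ (the n form) for the same data; the exact definition used in the original figure may differ slightly.

    # Percentage difference, d, between s (n - 1 divisor) and sigma (n divisor)
    # for the same data, as a function of the number of values n.
    for n in (3, 5, 10, 11, 20, 50, 100):
        d = ((n / (n - 1)) ** 0.5 - 1) * 100
        print(f"n = {n:3d}: d = {d:.1f}%")
    # d falls below 5% once n exceeds 10, as described in the text.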
1.6.5 Approximating s
A rapid estimate of s may be made using

s ≈ range/√n                (1.19)

where the range is the difference between the maximum value and the minimum value in any given set of data (see equation (1.5)). We can regard equation (1.19) as a 'first order approximation' to equation (1.16) for small n. Equation (1.19) is extremely useful for determining s rapidly and with the minimum of effort.
Example 5
Table 1.16 contains values of repeat measurements of the distance between a lens
and an image produced using the lens. Use equations (1.16) and (1.19) to determine
the standard deviation of the values in table 1.16.
ANSWER
Using equation (1.16), s = 2.5 cm.
Using equation (1.19),

s ≈ (42 − 35) cm/√7 = 2.6 cm

The difference between the two values of s is about 4% and would be regarded as unimportant for most applications.
Exercise I
The period, T, of oscillation of a body on the end of a spring is measured a number of times and the values obtained are given in table 1.17.
T{s) 2.53 2.65 2.67 2.56 2.56 2.60 2.67 2.64 2.63
A particular quantity, for example the time for a ball to fall through a viscous liquid, may be measured a number of times by the same observer using the same equipment where the conditions of the measurement (including the environmental conditions) remain unchanged. If the scatter of values produced is small, we say the measurement is repeatable. By contrast, if measurements are made of a particular quantity by various workers using a variety of instruments and techniques in different locations, we say that the measurements are reproducible if the values obtained by the various workers are in close agreement. The terms repeatability and reproducibility are qualitative only and in order to assess the degree of repeatability or reproducibility we must consider carefully the amount of scatter in the values as revealed by quantitative estimates, such as the standard deviation.
1.9 Review
Problems
1. What are the derived SI units of the quantities:
2. Express the units for the quantities in problem 1 in terms of the base
units of the SI system.
3. The force, F, between two point charges Q₁ and Q₂, separated by a distance, r, in a vacuum is given by

F = Q₁Q₂/(4πε₀r²)

where ε₀ is the permittivity of free space. Use this equation and the information in table 1.3 to express the unit of ε₀ in terms of the base units in the SI system.
4. Write the following values in scientific notation to two significant figures:
(i) 0.0000571 s; (ii) 13700 K; (iii) 1387.5 m/s; (iv) 101300 Pa; (v) 0.001525 a.
5. The lead content of river water was measured five times each day for
20 days. Table 1.18 shows the values obtained in parts per billion (ppb).
6. Table 1.19 contains values of the retention time (in seconds) for pseudoephedrine using high pressure liquid chromatography (HPLC).
43 58 53 49 60 48 48 49 49 45
57 35 51 67 49 51 59 55 62 59
40 52 53 70 48 51 44 51 46 47
42 41 53 56 40 42 47 54 54 56
46 40 57 57 47 54 48 61 51 56
56 57 45 42 54 56 66 48 48 52
64 56 49 54 66 45 63 49 65 40
54 54 50 56 51 49 51 46 52 41
43 46 57 51 46 68 58 69 52 49
52 53 46 53 44 36 54 60 61 67
90.5 98
250 90
945 80
3250 70
8950 60
22500 50
63500 40
82500 30
124000 20
(i) Calculate the mean, x̄, and standard deviation, s, of the nitrate concentration.
(ii) Draw up a histogram for the data in table 1.21.
9. Using accurate clocks in satellites orbiting the earth, the global positioning system (GPS) may establish the position of a GPS receiver. Table 1.22 shows 15 values of the vertical displacement, d, (relative to mean sea level) of a receiver, as determined using the GPS. Using these values:
2.1 Introduction
' Systat is a product of Systat Inc, Illinois. Statistica is a product of StatSoft Inc,
Oklahoma.
of its basic features using examples drawn from the physical sciences. Some
familiarity with using a PC is assumed, to the extent that terms such as
‘mouse’, ‘cursor’, ‘Enter key’ and ‘save’ are assumed understood in the
context of using a program such as Excel®.
Column identifiers
Row identifiers
^ We will use the term ‘sheet’, rather than ‘table’, in order to indicate that we are
dealing with a spreadsheet.
Designed originally with business users in mind, Excel® has evolved since the late 1980s into a powerful spreadsheet package capable of providing valuable support to users from science, mathematics, statistics, engineering and business disciplines. With regard to the analysis of experimental data, Excel® possesses 80 or so built in statistical functions which will evaluate such things as the standard deviation, maximum, minimum and mean of values. More advanced analysis facilities are available such as linear and multiple regression, hypothesis testing, histogram plotting and random number generation. Graphing options include pie and bar graphs and, perhaps the most widely used graph in the physical sciences, the x-y graph.
At the time of writing, the latest version of Excel® for PCs is Excel® 2002. While recent versions of Excel® offer many short cuts to 'setting up' and using a spreadsheet, they are not vital for solving data analysis problems and we will use them sparingly.
For convenience, sheets containing data referred to in this chapter can be found on the Internet at http://uk.cambridge.org/resources/0521793378. Sheets in the file have the same name as the sheets referred to in this chapter. Files available can be read by versions of Excel® for PCs from Excel® 97 onwards. Some minor differences may be evident between how a spreadsheet appears in this book and in the corresponding Excel® file. For example, column widths or the number of figures displayed in each cell may differ between book and spreadsheet file.
1. Using the left hand mouse button, click on the Start button at the bottom left hand corner of the screen.
2. Click on Programs, then click on Microsoft Excel.
3. After a few seconds the screen as shown in figure 2.1 appears.
Figure 2.1 shows toolbars at the top of the screen. The icons on the toolbars allow the spreadsheet to be saved, printed out, a typing mistake to be undone or a list sorted into alphabetical order. A brief description of the function of each icon is obtained by moving the cursor onto the icon. Within a couple of seconds a short message appears describing the function the icon represents. Close to the top of the screen is a Menu bar offering the following options:
Moving the cursor to each word on the Menu bar and clicking the left hand mouse button causes an extensive 'pull down' menu to appear giving access to a large range of options. If an option is required that has not been used before, there may be a delay of a few seconds for that option to appear after clicking on the left hand mouse button. At the bottom and to the right of the screen there are 'scroll' bars which are useful for navigating around large spreadsheets. To the left of the horizontal scroll bar are sheet tabs which allow you to move from one sheet to another. Situated to the right of the screen (in Excel® 2002, but not in earlier versions) is a task pane which permits easy access to recent files used with Excel®, as well as other options such as Microsoft Excel® Help. The task pane may be closed (thereby allowing the cells to fill more of the screen) by clicking on the × symbol in the top right corner of this pane.
When starting Excel®, a screen of empty cells appears as shown in figure 2.1. Sheet tabs, Sheet1 / Sheet2 / Sheet3, are visible near to the bottom of the screen. Switching between sheets is accomplished by clicking on each tab. Excel® refers to each of the sheets as a 'Worksheet'. These sheets (and others if they are added) constitute a 'Workbook'. A Workbook can contain many Worksheets, and incorporating several sheets into a Workbook is useful if interrelated data or calculations need to be kept together. For example, a Workbook might consist of three Worksheets, where:
Worksheets can be renamed by moving the cursor to the Worksheet tab and
double clicking on it. At this stage it is possible to overwrite (for example)
‘Sheetl’ with something more meaningful or memorable.
After starting Excel®, the majority of the screen is filled with cells. Sheet 2.2 shows cells which contain data from an experiment to establish the potassium concentration in a specimen of human blood. To begin entering text and data into the cells:
1. Move the cursor to cell Al. Click the left hand mouse button to make
the cell active. A conspicuous border appears around the active cell.
2. Type® Concentration (mmol/L). Once the Enter key has been pressed,
the active cell becomes A2.
3. Type 5.2 into cell A2 and press the Enter key. Continue this process
until all the data are entered. If a transcription mistake occurs, use the
mouse or the cursor key to activate the cell containing the mistake
then retype the text or number.
Before moving on to manipulate the data, it is wise to save the data to disc.
One way to do this is to:
6 We adopt the convention that anything to be typed into a cell appears in bold.
The raw values appearing in sheet 2.2 are likely to have been influenced by several factors: the underlying variation in the quantity being measured, the resolution capability of the instrument used to make the measurement, as well as the experience, care and determination of the experimenter. As we begin to use a spreadsheet to 'manipulate' the values, we need to be aware that the spreadsheet introduces its own influences, such as those due to rounding errors or the accuracy of algorithms used in the calculation of the statistics. As with a pocket calculator, a spreadsheet is only able to store a value to a certain number of digits. However, in contrast to a calculator, which might display up to ten digits and hold two 'in reserve' to ensure that rounding errors do not affect the least significant digit displayed, Excel® holds 15 digits internally. It is unlikely you will come across a situation in which 15 digits is insufficient.
Most pocket calculators cannot handle values with magnitudes that fall outside the range 1 × 10⁻⁹⁹ to 9.99999 × 10⁺⁹⁹. A value with magnitude less than 1 × 10⁻⁹⁹ is rounded to zero, and a value with magnitude 10 × 10⁺⁹⁹ (or greater) causes an overflow error (displayed as -E-, or something similar, depending on the type of calculator). Excel® is able to cope with much larger and smaller values than the average scientific calculator, but cannot handle values with magnitudes that fall outside the range 2.23 × 10⁻³⁰⁸ to 9.99999 × 10⁺³⁰⁷. A value with magnitude smaller than 2.23 × 10⁻³⁰⁸ is rounded to zero, and a value equal to or in excess of 1 × 10⁺³⁰⁸ is regarded by Excel® as a string of characters and not a number. The limitation regarding the size of a value is unlikely to be of concern unless calculations are performed in which a divisor is close to zero. A division by zero causes Excel® to display the error message, #DIV/0!. If the result of a calculation is a value in excess of 9.99999 × 10⁺³⁰⁷ then the error message #NUM! appears in the cell. Another way to provoke the #NUM! message to appear is to attempt to calculate the square root of a negative number.
Values appear in cells as well as in the formula bar® as you type. Values
such as 0.0023 or 1268 remain unchanged once they have been entered.
However, if a value is very small, say, 0.0000000000123, or very large, say,
165000000000, then Excel® automatically switches to scientific notation.
Thus 0.0000000000123 is displayed as 1.23E-11 and 165000000000 as
1.65E+11. The ‘E’ notation is interpreted as follows:
A B A B
1 t(s) V(volts) 1 t(s) V(volts)
2 0 3.98 2 0 3.98E+00
3 5 1.58 3 5 1.58E+00
4 10 0.61 4 10 6.10E-01
5 15 0.24 5 15 2.40E-01
6 20 0.094 6 20 9.40E-02
7 25 0.035 7 25 3.50E-02
8 30 0.016 8 30 1.60E-02
9 35 0.0063 9 35 6.30E-03
10 40 0.0031 10 40 3.10E-03
11 45 0.0017 11 45 1.70E-03
12 50 0.0011 12 50 1.10E-03
13 55 0.0007 13 55 7.00E-04
14 60 0.0006 14 60 6.00E-04
Another option in the Format Cells dialog box, useful for science applications, is the Number category. With this option the number of decimal places to which a value is displayed may be modified (but the value is not forced to appear in scientific notation). Irrespective of how values are displayed by Excel® on the screen, they are stored internally to a precision of 15 digits.
After entering data into a spreadsheet, the next step is usually to perform a
mathematical, statistical or other operation on the data. This may be
carried out by entering a formula into one or more cells. Though Excel®
provides many advanced functions, very often only simple arithmetic
operations such as multiplication or division are required. As an example,
consider the capacitor discharge data in sheet 2.3. Suppose at each point in time we require both the current flowing through the discharge resistor and the charge remaining on the capacitor. The equations for the current, I, and charge, Q, are

I = V/R                (2.1)

Q = CV                (2.2)
To save space, steps 2 and 3 are abbreviated as: Format > Cells > Number > Scientific.
1. Make cell C2 active by moving the cursor to C2 and click on the left
hand mouse button.
2. Type = B2/12E6.
3. Press the Enter key.
4. The value 3.31667E−07 is returned in cell C2, as shown in sheet 2.4(b).
    A      B          C                       A      B          C
1   t(s)   V(volts)   I(amps)             1   t(s)   V(volts)   I(amps)
2   0      3.98       =B2/12E6            2   0      3.98       3.31667E-07
3   5      1.58                           3   5      1.58
4   10     0.61                           4   10     0.61
5   15     0.24                           5   15     0.24
6   20     0.094                          6   20     0.094
7   25     0.035                          7   25     0.035
8   30     0.016                          8   30     0.016
9   35     0.0063                         9   35     0.0063
10  40     0.0031                         10  40     0.0031
11  45     0.0017                         11  45     0.0017
12  50     0.0011                         12  50     0.0011
13  55     0.0007                         13  55     0.0007
14  60     0.0006                         14  60     0.0006
1. Move the cursor to cell C2. With the left hand mouse button pressed
down, move to cell C14 and release the button. The cells from C2 to
C14 should be highlighted as shown in sheet 2.5(a).
2. Click on the Edit menu. Click on the Fill option, then click on the
Down option.
3. Values now appear in cells C3 to C14, as shown in sheet 2.5(b).
When Excel® performs a calculation, it 'returns’ the result of the calculation into
the cell in which the formula was typed.
" The number of figures can be increased or decreased using Format >- Cells >
Number >- Scientific, as described in section 2.3.4.
C
1 l(amps)
2 3.31667E-07
3 1.31667E-07
4 5.08333E-08
5 0.00000002
6 7.83333E-09
7 2.91667E-09
8 1.33333E-09
9 5.25E-10
10 2.58333E-10
11 1.41667E-10
12 9.16667E-11
13 5.83333E-11
14 5E-11
The Edit >Fill >Down command copies the formula into the highlighted
cells and automatically increments the cell reference in the formula so that
the calculation is carried out using the value in the cell in the adjacent B
column. Cell referencing is discussed in the next section.
Another common arithmetic operation is to raise a number to a power. If it is required that we find the square of the contents of, say, cell C2 in sheet 2.5, then we would type in another cell =C2^2. To illustrate this, suppose we wish to calculate the power, P, dissipated in the 12 MΩ resistor. The equation required is

P = I²R                (2.3)

where I is the current flowing through the resistance, R. The formula used to calculate the power dissipated in the resistor is shown in cell D2 of sheet 2.6.
    C              D                         C              D
1   I(amps)        P(watts)              1   I(amps)        P(watts)
2   3.31667E-07    =C2^2*12E6            2   3.31667E-07    1.32003E-06
3   1.31667E-07                          3   1.31667E-07
4   5.08333E-08                          4   5.08333E-08
The ^ symbol is found by holding down the shift key and pressing the ‘6’ key.
Exercise A
1. Column B of sheet 2.4 shows the voltage across a 0.47 μF capacitor at times t = 0 to t = 60 s. Calculate the charge remaining on the capacitor (as given by equation (2.2)) at times t = 0 to t = 60 s. Tabulate values of charge in column D of the spreadsheet.
2. Enter a formula into cell E2 of sheet 2.4 to calculate the square root of the value of current in cell C2. Use Edit > Fill > Down to calculate the square root of the other values of current in sheet 2.4.
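If you want to check your spreadsheet answers for Exercise A outside Excel®, a short Python sketch along the following lines can be used. It assumes R = 12 MΩ (the value in the formula =B2/12E6) and C = 0.47 μF, and is an illustrative aside, not part of the original text.

    # Voltages (V) across the capacitor at t = 0, 5, ..., 60 s (sheet 2.4, column B).
    voltages = [3.98, 1.58, 0.61, 0.24, 0.094, 0.035, 0.016,
                0.0063, 0.0031, 0.0017, 0.0011, 0.0007, 0.0006]

    R = 12e6      # discharge resistance in ohms (the 12E6 in =B2/12E6)
    C = 0.47e-6   # capacitance in farads (0.47 microfarad)

    for t, v in zip(range(0, 65, 5), voltages):
        current = v / R     # I = V/R, as calculated in column C of sheet 2.4
        charge = C * v      # Q = CV, equation (2.2)
        print(f"t = {t:2d} s   I = {current:.3e} A   Q = {charge:.3e} C")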
Relative referencing
Sheet 2.7. Formula incorporating relative referencing of cells.
    A    B     C
1   20   6.5   =A1*B1
2   30   7.2
3   40   8.5
How cells are referenced within other cells affects how calculations are performed. For example, consider the formula, =A1*B1, appearing in cell C1 in sheet 2.7. Excel® interprets the formula in cell C1 as 'starting from the current cell (C1) multiply the contents of the cell two to the left (the value in A1) by the contents of the cell one to the left (the value in B1)'. This is referred to as relative referencing of cells. When the Enter key is pressed, the value 130 appears in cell C1. If the Edit > Fill > Down command is used to fill cells C2 and C3 with formulae, relative referencing ensures that the correct cells in the A and B columns are used in the calculations. Specifically, =A2*B2 appears in cell C2, and =A3*B3 in cell C3. If cells are now moved around by 'cutting and pasting', relative referencing ensures that calculations return the correct values irrespective of which cells contain the raw data. Excel® keeps track of where values or formulae are moved to and automatically updates the relative references.
Exercise B
Highlight cells C1 to C3 in sheet 2.7 and choose Edit > Fill > Down to calculate the product of values in adjacent cells in the A and B columns.
To 'cut and paste', highlight the cells containing the values to be moved. Choose Edit > Cut. Move the cursor to the cell where you want the first value to appear. Choose Edit > Paste.
Absolute referencing
Sheet 2.8. Formula using absolute referencing of cells.
A B C
1 20 6.5 =$A$1*$B$1
2 30 7.2
3 40 8.5
Another way in which cells may be referenced is shown in sheet 2.8. The formula in cell C1 is interpreted as 'multiply the value in cell A1 by the value in cell B1'. This is referred to as absolute referencing, and on the face of it, it doesn't seem very different from relative referencing. Certainly, when the Enter key is pressed, the value 130 appears in cell C1 just as in the previous example. The difference becomes more obvious by highlighting cells C1 to C3 and choosing Edit > Fill > Down. The consequences of these actions are shown in sheet 2.9. Cells C1 to C3 each contain 130. This is because the formulae in cells C1 to C3 use the contents of the cells which have been absolutely referenced, in this case cells A1 and B1, and no incrementing of references occurs when Edit > Fill > Down is used. This can be very useful, for example, if we wish to multiply values in a row or column by a constant.
A B C
1 20 6.5 130
2 30 7.2 130
3 40 8.5 130
Exercise C
(i) Complete sheet 2.10 using the Edit > Fill > Down command to find the time of fall for all the heights given in the A column.
(ii) Use the spreadsheet to calculate the times of fall if the acceleration due to gravity is 1.6 m/s².
Naming cells
The use of absolute referenced cells for values that we might want to use
again and again is fine, but it is possible to incorporate constants into a
formula in a way that makes the formula easier to read. That way is to give
the cell a name. Consider sheet 2.11.
    A       B                C    D
1   h (m)   t(s)
2   2       =(2*A2/g)^0.5
3   4                        g    9.81
4   6
5   8
6   10
This sheet is similar to sheet 2.10, the difference being that the absolute reference, $D$3, in cell B2 has been replaced by the symbol, g. Before proceeding, we must allocate the name g to the contents of cell D3. To allocate the name:
We have omitted showing the units of g. If we type g(m/s²) in cell C3, then g(m/s²) becomes the name which we would need to type out in full in subsequent formulae.
Exercise D
The heat emitted each second, H, from a blackbody of surface area, A, at temperature, T, is given by

H = σAT⁴                (2.5)

where σ is the Stefan–Boltzmann constant.
Care must be taken when entering formulae, as the order in which calculations are carried out affects the final values returned by Excel®. For example, suppose Excel® is used to calculate the equivalent resistance of two resistors of values 4.7 kΩ and 6.8 kΩ connected in parallel. The formula for the equivalent resistance, R_eq, of two resistors R₁ and R₂ in parallel is

R_eq = R₁R₂/(R₁ + R₂)                (2.6)

Sheet 2.12 shows the resistor values entered into cells A1 and A2. The equation to calculate R_eq is entered in cell A3. When the Enter key is pressed, the value 13600 appears in cell A3 as indicated in sheet 2.12(b). This value is incorrect, as R_eq should be 2.779 kΩ. Excel® interprets the formula in cell A3 as (A1*A2/A1) + A2, since the division is carried out before the addition; enclosing the sum in parentheses, i.e. =A1*A2/(A1+A2), returns the correct value.
A A
1 4.70E3 1 4.70E3
2 6.80E3 2 6.80E3
3 =A1*A2/A1+A2 3 13600
4 4
One way to avoid such mistakes is to break a complicated formula into a number of smaller formulae, with each one entered into a different
cell. As an example, consider an equation relating the velocity of water
through a tube to the cross-sectional area of the tube:
v = [2gh / ((A₁/A₂)² − 1)]^(1/2)                (2.7)
    A              B                      A              B
1   g (m/s²)       9.81               1   g (m/s²)       9.81
2   h (m)          0.15               2   h (m)          0.15
3   A1 (m²)        0.062              3   A1 (m²)        0.062
4   A2 (m²)        0.018              4   A2 (m²)        0.018
5   2gh (m²/s²)    =2*B1*B2           5   2gh (m²/s²)    2.943
6   (A1/A2)^2      =(B3/B4)^2         6   (A1/A2)^2      11.8642
7   (A1/A2)^2-1    =B6-1              7   (A1/A2)^2-1    10.8642
8   v (m/s)        =(B5/B7)^0.5       8   v (m/s)        0.520471
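The intermediate values in sheet 2.13 can also be checked outside the spreadsheet. The short Python sketch below evaluates equation (2.7) directly using the same inputs; it is an illustrative aside rather than part of the original text.

    import math

    g = 9.81      # m/s^2
    h = 0.15      # m
    A1 = 0.062    # m^2
    A2 = 0.018    # m^2

    # Equation (2.7): v = sqrt(2gh / ((A1/A2)^2 - 1))
    v = math.sqrt(2 * g * h / ((A1 / A2) ** 2 - 1))
    print(f"v = {v:.6f} m/s")   # 0.520471 m/s, matching cell B8 of sheet 2.13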
Exercise E
1. The radius of curvature of a spherical glass surface may be found using the Newton's rings method. If the radius of the mth ring is r_m, and the radius of the nth ring is r_n, then the radius, R, of the spherical glass surface is given by

R = (r_n² − r_m²) / [(n − m)λ]                (2.8)

where λ is the wavelength of the light incident on the surface. Table 2.1 contains data from an experiment carried out to determine R. Using these data, create a spreadsheet to calculate R, as given by equation (2.8).
2. The velocity of sound, v_t, in a tube depends on the diameter of the tube, the frequency of the sound and the velocity of sound in free air. The equation relating the quantities when the walls of the tube are made from smooth glass is

v_t = v[1 − (3 × 10⁻³)/(d√f)]                (2.9)

where v is the velocity of sound in free air in m/s, d is the diameter of the tube in metres and f is the frequency of the sound in Hz. Taking v = 344 m/s and f = 5 Hz, use Excel® to tabulate values of v_t when d varies from 0.1 m to 1 m in steps of 0.1 m.
Quantity    Value
r_m (m)     6.35 × 10⁻³
r_n (m)     6.72 × 10⁻³
m           52
n           86
λ (m)       6.02 × 10⁻⁷
In the last section we showed the ease with which a formula may be entered
into a cell that ‘looks right’, but without the appropriate parentheses the
formula returns values inconsistent with the equation upon which it is based.
Establishing or verifying that a spreadsheet is returning the correct values
For details of the method refer to Daish and Fender (1970).
can sometimes be difficult, especially if there are many steps in the calculation. While no single approach can ensure that the spreadsheet is 'behaving' as intended, there are a number of actions that can be taken which minimise the chance of a mistake going unnoticed. If a mistake is detected, a natural response is to suspect some logic error in the way the spreadsheet has been assembled. However, it is easy to overlook the possibility that a transcription error has occurred when entering data into the spreadsheet. In this situation little is revealed by stepping through calculations performed by the spreadsheet in a 'step by step' manner. Table 2.2 offers some general advice intended to help reduce the occurrence of mistakes.
Determining how the contents of various cells are 'brought together' to calculate, for example, a mean and standard deviation is aided by using some of Excel®'s in-built tools. These are the 'Auditing' tools and can assist in identifying problems in a spreadsheet.
Suggestion Explanation/Example
Make the spreadsheet Enter ‘raw’ values into a spreadsheet in the form in which they
do the work emerge from an experiment. For example, if the diameter of a
ball bearing is measured using a micrometer, then it is unwise
to convert the diameter to a radius ‘in your head’. It is better to
add an extra column (with a clear heading) and to calculate
the radius in that column. This makes backtracking to find
mistakes much easier.
Perform an order of If we have a feel for the size and sign of numbers emerging
magnitude calculation from the spreadsheet, then we are alerted when those
numbers do not appear. For example, if we calculate the
volume of a small ball bearing to be roughly 150 X 10“® m^ but
the value determined by the spreadsheet is 143.8 m^, this
might point to an inconsistency in the units used for the
volume calculation or that the formula entered is incorrect.
Use data that has On some occasions ‘old’ data are available that have already
already been analysed been analysed ‘by hand’ or using another computer package.
The purpose of the spreadsheet might be to analyse similar
data in a similar manner. By repeating the analysis of the ‘old’
data using the spreadsheet it is possible to establish whether
the analysis is consistent with that performed previously.
Choose the appropriate Many built in functions in Excel® appear to be very similar, for
built in function example when calculating a standard deviation we could use
STDEVO, STDEVAO, STDEVP() or STDEVPA(). Knowing the
definition of each function (by consulting the help available
Suggestion Explanation/Example
Figure 2.2. x-y graph of data indicating a possible transcription error.
Be alert to error #DIV/0!, #NAME?, #REF!, #NUM! and #N/A! are some of the
messages error messages that Excel® may return into a cell due to a
variety of causes. As examples:
• A cell containing the #DIV/0! error indicates that the calculation being attempted in that cell requires Excel® to divide a value by zero. A common cause of this error is that a cell such as B1 contains no value, yet this cell is referenced in a formula such as =32.45/B1.
• The #NAME? error occurs when a cell contains reference to an invalid name. For example if we type =averag(A1:A10) into a cell, Excel® does not recognise averag() as a function (we probably meant to type =average(), which is a valid Excel® function, but spelled it incorrectly). Excel® assumes that averag(A1:A10) is a name that has not been defined and so returns the error message.
• A cell containing the #NUM! error indicates that some invalid
mathematical operation is being attempted. For example, if a
cell contains =LN(-6) then the #NUM! error is returned into
that cell as it is not possible to take the logarithm of a negative
number.
Sheet 2.14 contains nine values of temperature as well as the mean and the standard deviation of the values. Calculating the mean and standard deviation 'independently' (say, by using a pocket calculator) we find that the mean = 24.74 °C (to four significant figures), consistent with the mean calculated in cell B11 of sheet 2.14. By contrast, the standard deviation found using a pocket calculator is 2.008 °C. To trace the calculation of the standard deviation appearing in sheet 2.14, we can use Excel®'s auditing tools which give a graphical representation of how values in cells are calculated. Specifically, by using the auditing tools, we can establish which cells contribute to the calculation of a value in a cell. To access the auditing tools go to the Menu bar and choose Tools > Auditing > Show Formula Auditing Toolbar. At this point the formula auditing toolbar appears as shown in figure 2.3. To use the auditing tools we proceed as follows:
1. Click on the cell which contains the calculation we wish to trace (in
the above example that would be cell B12).
2. Click on the Trace Precedents icon on the formula auditing toolbar. The blue line and arrow that appear indicate which cells are used in the calculation of the value in cell B12. If a range of cells contributes to the calculation, then this range is outlined in blue.
Figure 2.4 shows the formula auditing tool used to trace the calculation of
the standard deviation in sheet 2.14. The formula auditing tool indicates
that cells in the range B2 to Bll have been used in the calculation of the
standard deviation in cell B12. The mistake that has been made has been
     A                      B
1    Temperature (°C)
2                           23.5
3                           24.2
4                           26.4
5                           23.1
6                           22.8
7                           22.5
8                           25.4
9                           28.3
10                          26.5
11   mean                   24.74444444
12   standard deviation     1.893328117
the inclusion of the cell B11 in the range, as B11 does not contain 'raw data', but the mean of the contents of cells B2 to B10. This mistake could have been detected by examining the range appearing in the formula in cell B12. However, as the relationships between cells become more complex, the pictorial representation of those relationships as provided by the formula auditing tool can help identify mistakes that would be otherwise difficult to find. The formula auditing toolbar contains other facilities to assist in tracking mistakes and allows for the inclusion of comments to help document calculations being carried out by Excel®. Details of other formula auditing facilities can be found either by using Excel®'s Help or by referring to a standard text.
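The effect of the mistake traced above can also be demonstrated outside Excel®. The Python sketch below, an illustrative aside, computes the standard deviation of the nine temperatures in sheet 2.14 first from the raw data alone and then with the mean wrongly included in the data range, reproducing the two values 2.008 °C and 1.893 °C.

    import statistics

    # Temperatures (degrees C) in cells B2 to B10 of sheet 2.14.
    temps = [23.5, 24.2, 26.4, 23.1, 22.8, 22.5, 25.4, 28.3, 26.5]

    correct = statistics.stdev(temps)                    # range B2:B10
    mean = statistics.mean(temps)
    mistaken = statistics.stdev(temps + [mean])          # range B2:B11, mean included

    print(f"correct  s = {correct:.3f}")   # 2.008
    print(f"mistaken s = {mistaken:.3f}")  # 1.893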
1. Type the function =LN(C2) into cell D2, as shown in sheet 2.15(a) (LN can be in either upper or lower case letters) and press the Enter key. The number −14.9191 is returned in cell D2 as indicated in sheet 2.15(b).
C D C D
1 l(amps) ln(l) 1 l(amps) ln(l)
2 3.31667E-07 =LN(C2) 2 3.31667E-07 -14.9191
3 1.31667E-07 3 1.31667E-07
4 5.08333E-08 4 5.08333E-08
5 0.00000002 5 0.00000002
6 7.83333E-09 6 7.83333E-09
7 2.91667E-09 7 2.91667E-09
8 1.33333E-09 8 1.33333E-09
9 5.25E-10 9 5.25E-10
10 2.58333E-10 10 2.58333E-10
11 1.41667E-10 11 1.41667E-10
12 9.16667E-11 12 9.16667E-11
13 5.83333E-11 13 5.83333E-11
14 5E-11 14 5E-11
(a) formula entered into cell D2; (b) value returned in cell D2.
As you begin typing =LN(C2) in Excel® 2002, a ‘tooltip’ appears which advises on
the argument(s) that appear in the function. Excel® 2002 provides tooltips for all of
its built-in functions.
Exercise F
1. Calculate the logarithms to the base 10 of the values shown in column C of
sheet 2.15.
2. An equation used to calculate atmospheric pressure at height h above sea level, when the air temperature is T, is

P = P₀ exp(−3.39 × 10⁻² h / T)                (2.10)

where P is the atmospheric pressure in pascals, P₀ is equal to 1.01 × 10⁵ Pa, h is the height in metres above sea level and T is the temperature in kelvins.
Assuming T = 273 K, use the EXP() function to calculate P for h = 1 × 10³ m, 2 × 10³ m, and so on, up to 9 × 10³ m. Express the value of P in scientific notation to three significant figures.
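As a check on the Excel® calculation asked for in part 2 of Exercise F, the following Python sketch evaluates equation (2.10) at the requested heights. It uses the constants reconstructed above (P₀ = 1.01 × 10⁵ Pa, T = 273 K) and is an illustrative aside, not part of the original text.

    import math

    P0 = 1.01e5    # pascals
    T = 273.0      # kelvins

    # Equation (2.10): P = P0*exp(-3.39e-2*h/T)
    for k in range(1, 10):
        h = k * 1e3                                # height in metres
        P = P0 * math.exp(-3.39e-2 * h / T)
        print(f"h = {h:6.0f} m   P = {P:.3e} Pa")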
The usual sine, cosine and tangent functions are available in Excel®, as are their inverses. Note Excel® expects angles to be entered in radians and not degrees. An angle in degrees can be converted to radians by multiplying the angle by π/180. In Excel®, π is entered as PI(). Sheet 2.16 shows a range of angles and the functions for calculating the sine, cosine and tangent of each angle.
    A             B              C          D          E
1   x (degrees)   x (radians)    sin(x)     cos(x)     tan(x)
2   10            =A2*PI()/180   =SIN(B2)   =COS(B2)   =TAN(B2)
3   30
4   56
5   125
In section 2.3.5 we used Edit > Fill > Down to fill a single column of Excel® with formulae. We can complete sheet 2.16 by filling all the cells from B2 to E5 with the required formulae. To do this:
    A             B             C          D          E
1   x (degrees)   x (radians)   sin(x)     cos(x)     tan(x)
2   10            0.174533      0.173648   0.984808   0.176327
3   30            0.523599      0.5        0.866025   0.57735
4   56            0.977384      0.829038   0.559193   1.482561
5   125           2.181662      0.819152   -0.57358   -1.42815
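The values returned in sheet 2.17 can be reproduced with Python's math module, which, like Excel®, works in radians; the sketch below is an illustrative aside, not part of the original text.

    import math

    for x_deg in (10, 30, 56, 125):
        x_rad = math.radians(x_deg)          # equivalent to x*PI()/180 in Excel
        print(f"{x_deg:3d}  {x_rad:.6f}  {math.sin(x_rad):.6f}  "
              f"{math.cos(x_rad):.6f}  {math.tan(x_rad):.6f}")
    # e.g. 10 degrees: 0.174533, 0.173648, 0.984808, 0.176327 (matching sheet 2.17)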
Exercise G
Use the Excel® functions ASIN(), ACOS() and ATAN() to calculate the inverse sine, inverse cosine and inverse tangent of the values of x in table 2.3. Express the inverse values in degrees.
X 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Exercise H
Consider the x and y values in sheet 2.19.
A B C D
1 X y xy x^
2 1.75 23
3 3.56 34
4 5.56 42
5 5.85 42
6 8.76 65
7 9.77 87
8
The MAX() and MIN() functions return the maximum and minimum values respectively in a list of values. Such functions are useful, for example, if we wish to scale a list of values by dividing each value by the maximum or minimum value in that list, or if we require the range (i.e. the maximum value − the minimum value) as the first step to plotting a histogram. To illustrate these functions, consider the values in sheet 2.20. The formula for finding the maximum value in cells A1 to E6 is shown in cell A8. When the Enter key is pressed, 98 is returned in cell A8.
A B C D E
1 23 23 13 57 29
2 65 22 45 87 76
3 34 86 76 79 35
4 45 55 89 34 43
5 45 61 56 43 12
6 98 21 87 56 34
7
8 =MAX(A1:E6)
Exercise I
Incorporate the MIN() function into sheet 2.20 so that the smallest value appears in cell B8. Show the range of the values in cell C8.
See section 2.8.1 for a description of how to use Excel® to plot a histogram.
Consider sheet 2.21 containing 48 integer values. The mean of the values is found by entering the AVERAGE() function into cell A9. When the Enter key is pressed, the value 23.60417 is returned in cell A9.
A B C D E F
1 27 1 49 2 39 11
2 27 29 40 8 28 5
3 18 5 25 0 4 33
4 26 30 20 14 22 10
5 23 28 33 30 28 16
6 23 5 27 9 48 4
7 39 41 46 22 25 25
8 35 49 13 8 40 43
9 =AVERAGE(A1:F8)
Exercise J
Calculate the median and mode of the values in sheet 2.21 by entering the MEDIAN() and MODE() functions into cells B9 and C9 respectively.
A full list of functions in Excel® may be obtained by making any cell active then going to the Menu bar and choosing Insert > Function. In the category box choose 'All'. The functions appear in the dialog box ordered alphabetically. Using the scroll bar you can scroll down to, say, STDEV. To obtain information on the functions, you can click on Help on this function.
Other useful statistical functions are given in table 2.4 along with
brief descriptions of each.
Exercise K
Repeat measurements made of the pH of river water are shown in sheet 2.22. Use
the built in functions in Excel® to find the mean, harmonic mean, average deviation
and estimate of the population standard deviation for these data.
A B C D E F G H 1 J
1 6.6 6.4 6.8 7.1 6.9 7.4 6.9 6.4 6.3 7.0
2 6.8 7.2 6.4 6.7 6.8 6.1 6.9 6.7 6.4 7.1
3 6.8 6.7 6.3 6.6 7.0 6.7 6.4 6.7 6.7 6.4
4 6.6 6.7 7.4 7.1 7.0 6.8 7.0 6.8 6.7 6.2
5 7.1 6.4 6.7 6.9 6.9 6.6 7.2 6.8 6.4 6.5
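As a sketch (the range A1:J5 corresponds to the layout of sheet 2.22 above), built in functions that could be used here include

=AVERAGE(A1:J5)
=HARMEAN(A1:J5)
=AVEDEV(A1:J5)
=STDEV(A1:J5)

for the mean, harmonic mean, average deviation and estimate of the population standard deviation respectively.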
66 2 EXCEL® AND DATA ANALYSIS
Function | What it does | Defining equation | Example of use | Value returned
A B C D E F
1 0.34 0.34 0.40 0.49 0.42 0.33
2 0.41 0.39 0.38 0.39 0.44 0.49
3 0.42 0.50 0.31 0.36 0.48 0.26
4 0.46 0.42 0.32 0.37 0.40 0.36
5 0.44 0.40 0.36 0.38 0.45 0.37
6 0.38 0.30 0.29 0.39 0.49 0.55
To allow some space at the top of sheet 2.23 for labels and titles to be
inserted:
By moving the contents of sheet 2.23 down by three rows, and adding
labels, the spreadsheet can be made to look like sheet 2.24. A title (in bold
font) has been added to the sheet and the quantity associated with the
values in the cells is clearly indicated.
A B C D E F
1 Coefficient of static friction for two wooden surfaces in contact
2
3 Coefficient of static friction (no units)
4 0.34 0.34 0.40 0.49 0.42 0.33
5 0.41 0.39 0.38 0.39 0.44 0.49
6 0.42 0.50 0.31 0.36 0.48 0.26
7 0.46 0.42 0.32 0.37 0.40 0.36
8 0.44 0.40 0.36 0.38 0.45 0.37
9 0.38 0.30 0.29 0.39 0.49 0.55
digits or text contained within cells may be changed, borders can be drawn
around one or more cells, fonts can be altered and background shading
added. A word of caution: too much formatting and too many bright
colours can be distracting and, instead of making the spreadsheet easy to
read, actually has the opposite effect.
A B
1 t(°C) R (ohm)
2 20 15.9
3 30 15.1
4 40 16.3
5 50 17.5
6 60 18.2
7 70 18.5
8 80 18.9
9 90 19.7
10 100 20.1
11 110 20.2
1. To highlight cells A2 to B11, move the cursor to cell A2. With the left
hand mouse button held down, drag across and down from cell A2 to
cell B11. Release the mouse button.
2. Click on the Chart Wizard icon.
3. A dialog box appears identified as Chart Wizard - Step 1 of 4 - Chart
Type. Click on the XY (Scatter) option. Click on Next.
4. Step 2 of 4. An x-y graph appears. At this point the data series can be
named and other data series added. We do not want to do this, so click
on Next.
5. Step 3 of 4. A chart title and axes labels can be added. Click on the
Chart Title box and type Resistance versus temperature for a tungsten
wire. In the Value (X) box type Temperature (°C). In the Value (Y) type
Resistance (ohms). Click on Next.
6. Step 4 of 4. You are asked where you want to place the chart. The
default is to embed the chart into the sheet containing the data used
in the plotting of the graph. Take the default. Click on Finish.
7. To raise the ‘o’, in front of the symbol C, which appears in the x axis
label, click on the x axis label and with the left hand mouse button
held down, drag across the ‘o’, then release the button. Next, choose
Format > Selected Axis Title. In the dialog box click on the Superscript
Effects box, then click OK.
70 2 EXCEL® AND DATA ANALYSIS
Figure 2.5. Screen of spreadsheet containing x-y graph of resistance versus
temperature.
The graph that appears embedded in the sheet is quite small. To enlarge it,
move the cursor to one of the corners of the outline around the graph, then
with the left hand mouse button pressed down, drag the outline until the
graph size has increased till it fills about 2/3 of the sheet as shown in
figure 2.5.
Adding a trendline
If there is evidence to suggest that there is a linear relationship between
quantities plotted on the x and y axes, as there is in this case, then a ‘line of
best fit’, or trendline, may be added to the data.¹⁹ To add a trendline to the
resistance versus temperature data shown in sheet 2.25:
3. Click on the Options tab and click in the box beside Display equation
on chart. Click on OK.
4. A line of best fit should now appear, along with the equation of the
line (y=0.0577x+14.29).
Figure 2.6. shows the line of best fit attached to the resistance versus tem¬
perature data.²⁰ If data appearing in the cells A2 to B11 are changed, then
the points on the line, the trendline and the trendline equation are imme¬
diately updated.
²⁰ The legend to the right of the screen can be removed by making the chart active,
clicking on the legend, then pressing the delete key.
72 2 EXCEL® AND DATA ANALYSIS
Exercise L
1. The graph in figure 2.6 would be improved by better filling the graph with the data.
This can be accomplished if the y axis were to begin at 14 ohms, rather than at zero.
Open the file Chapter2 which can be found at http://uk.cambridge.org/
resources/0521793378 and select the tab 2.25. This sheet contains the chart shown
in figure 2.6. Make the chart active by clicking on it, move the cursor close to the y
axis and double click. A dialog box appears, with a tab labelled Scale. Click on Scale
and change the Minimum value from 0 to 14 then click OK.
2. The intensity of light emitted from a red light emitting diode (LED) is measured as
a function of current through the LED. Table 2.5 shows data from the experiment. Use
Excel®’s x-y graphing facility to plot a graph of intensity versus current. Attach a
straight line to the data using the Trendline option, and show the equation of the line
on the graph.
2.7 CHARTS IN EXCEL®
73
There are situations in which it is helpful to show more than one set of data
on an x-y graph, as this allows comparisons to be made between data. Two
basic approaches may be adopted.
For example, consider the five sets of data shown in sheet 2.26. Columns B
to F contain y values for each set corresponding to the x values in column
A. To plot the data, highlight the contents of cells A1 through to Fll, then
follow the instructions in section 2.7.1. The graph produced is shown
embedded in the Worksheet in figure 2.8. Note that by including the head¬
ings of the columns when selecting the range of data to be plotted, Excel®
has used those headings in the legend at the right hand side of the graph.
A B C D E F
1 X y1 y2 y3 y4 y5
2 0.2 71 77 83 89 96
3 0.4 53 61 70 81 93
4 0.6 42 50 61 74 91
5 0.8 36 44 54 69 89
6 1.0 34 40 50 65 87
7 1.2 33 39 48 62 86
8 1.4 34 39 47 61 85
9 1.6 36 40 47 60 85
10 1.8 39 42 48 60 85
11 2.0 42 44 49 60 85
74 2 EXCEL® AND DATA ANALYSIS
Figure 2.8. Screen of spreadsheet containing an x-y graph of the five data sets of sheet 2.26; the column headings y1 to y5 appear in the legend.
The new data should be added to the graph and the legend updated.
Exercise M
The intensity of radiation (in units W/m²) emitted from a body can be written

I = A/(λ⁵(e^(B/(λT)) − 1))    (2.11)

where T is the temperature of the body in kelvins, λ is the wavelength of the radiation
in metres, A = 3.75 × 10⁻¹⁶ W·m² and B = 1.4435 × 10⁻² m·K.
2.8 DATA ANALYSIS TOOLS 75
(i) Use Excel® to determine I at T = 1250 K for wavelengths between 0.2 × 10⁻⁶ m
and 6 × 10⁻⁶ m in steps of 0.2 × 10⁻⁶ m.
(ii) Repeat part (i) for temperatures 1500 K, 1750 K and 2000 K.
(iii) Plot I versus λ at temperatures 1250 K, 1500 K, 1750 K and 2000 K on the same
x-y graph.
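A minimal sketch of one way to tabulate I in Excel® (the cell layout here is illustrative only, and the constants are those of equation (2.11) as reconstructed above): with the wavelengths entered in column A from cell A2 downwards and the temperature, in kelvins, in cell B1, a formula such as

=3.75E-16/(A2^5*(EXP(1.4435E-2/(A2*$B$1))-1))

can be entered in cell B2 and filled down, then repeated in further columns for the other temperatures.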
Besides many built in statistical functions, there exists an ‘add in’ within
Excel® that offers more powerful data analysis facilities. This is referred to
as the Analysis ToolPak. To establish whether the ToolPak is available,
choose Tools from the Menu bar. If the option ‘Data Analysis’ does not
appear near to the bottom of the Tools menu, then the Analysis ToolPak
must be added.
To add the Analysis ToolPak, choose Tools > Add-Ins. A dialog box
appears listing the available add-ins. Near to the top of the list is Analysis
ToolPak. Click in the box next to Analysis ToolPak, then click OK. This adds
the utility to the bottom of the Tools menu. If we go to the Menu bar again
and choose Tools > Data Analysis, the dialog box shown in figure 2.9
appears.²¹
We discuss other analysis tools in chapter 9, but for the moment we
consider just two; Histogram and Descriptive Statistics.
²¹ Most ‘dialog boxes’ in the Data Analysis tool allow for the easy entry of cell ranges
into the tool and often allow you to modify parameters.
76 2 EXCEL® AND DATA ANALYSIS
2.8.1 Histograms
1. From the Menu bar, choose Tools > Data Analysis > Histogram.
2. A dialog box appears into which we must enter references to the cells
containing the values, references to the cells containing the bin infor¬
mation and indicate where we want the histogram frequencies to
appear. Enter the cell references into the dialog box, as shown in
figure 2.10.
3. To obtain a graphical output of the histogram, tick the Chart output
box in the dialog box. Click on the OK button.
A B C D
11 Bin limits Bin Frequency
12 20 20 0
13 30 30 1
14 40 40 18
15 50 50 27
16 60 60 14
17 70 70 3
18 80 80 1
19 More 0
The values returned in columns C and D are shown in sheet 2.29. It is not
immediately clear how the bin limits in the C column of sheet 2.29 relate
to the frequencies appearing in the D column. The way to interpret the
values in cells D12 to D18 is shown in table 2.6, where x represents the lead
78 2 EXCEL® AND DATA ANALYSIS
Excel® bin label    Actual interval (ppb)    Frequency
20                  x ≤ 20                    0
30                  20 < x ≤ 30               1
40                  30 < x ≤ 40              18
50                  40 < x ≤ 50              27
60                  50 < x ≤ 60              14
70                  60 < x ≤ 70               3
80                  70 < x ≤ 80               1
More                x > 80                    0
Figure 2.11. Chart output produced by the Histogram tool (chart title ‘Histogram’, legend ‘Frequency’, x axis ‘Bin’).
concentration in ppb. Figure 2.11 shows the chart created using the bins
and frequencies appearing in sheet 2.29. A close inspection of figure 2.11
reveals that the horizontal axis is not labelled correctly. It appears that in
the interval between 20 and 30 there are no values. This conflicts with
table 2.6 which indicates that one value lies in this interval. To remedy this
problem:
The x axis labels should now appear in positions consistent with the fre¬
quencies in the histogram.
2.8 DATA ANALYSIS TOOLS 79
A B C D E F G H I J K
1 Density (g/cm³) 7.3 6.4 7.7 8.6 8.5 9.0 5.7 7.3 8.4 6.6
1. From the Menu bar choose Tools > Data Analysis > Descriptive
Statistics.
2. Type the values shown in figure 2.12 into the dialog box.
3. All the values are in row 1, so click on the option 'Grouped By Rows’.
4. Tick the Summary statistics box.
5. Click on OK.
Excel® returns values for the mean, median and the other statistics as
shown in sheet 2.31.
Sheet 2.31. Values returned when applying the Descriptive Statistics tool to
density data.
A B
3 Row1
4
5 Mean 7.55
6 Standard Error 0.343592
7 Median 7.5
8 Mode 7.3
9 Standard Deviation 1.086534
10 Sample Variance 1.180556
11 Kurtosis -1.01032
12 Skewness -0.33133
13 Range 3.3
14 Minimum 5.7
15 Maximum 9
16 Sum 75.5
17 Count 10
As with most of the tools in the Analysis ToolPak, if raw data are
changed, the values returned by the Descriptive Statistics tool are not imme¬
diately updated. To update the values, the sequence beginning at step 1
above must be repeated. If the consequences need to be considered of
changing one or more values when, for example, a standard deviation is to
be calculated, it may be better to forego use of the Descriptive Statistics tool
in favour of using Excel®’s built in functions whose outputs are updated as
soon as the contents of a cell have been changed. The trade off between ease
of use offered by a tool in the Analysis ToolPak as compared to the increased
flexibility and responsiveness of a ‘user designed’ spreadsheet needs to be
considered before analysis of data using the spreadsheet begins.
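For example (a minimal sketch, assuming the density values of sheet 2.30 occupy cells B1 to K1), the mean and the estimate of the population standard deviation could instead be obtained with

=AVERAGE(B1:K1)
=STDEV(B1:K1)

which recalculate automatically whenever any value in B1 to K1 is changed.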
2.9 Review
ulation, involving the use of advanced statistical functions. The power and
flexibility of computer based spreadsheets make them attractive alterna¬
tives for this purpose to, say, pocket calculators. Pocket calculators have
limited analysis and presentation capabilities and become increasingly
cumbersome to use when a large amount of data is to be analysed.
The use of spreadsheets for analysing and presenting experimental
data continues to become more common as access to these tools at
college, university or home increases. In this chapter we have considered
some of the features of the Excel® spreadsheet package which are particu¬
larly useful for the analysis of scientific data, including use of functions
such as AVERAGE() and STDEV() as well as the more extensive features
offered by the Analysis ToolPak.
Ease of use, flexibility and general availability make Excel® a
popular tool for data analysis. Other features of Excel® will be treated in
later chapters, in situations where they support or clarify the basic prin¬
ciples of data analysis under discussion. In the next chapter we consider
those basic principles and, in particular, the way in which experimental
data are distributed and how to summarise the main features of data dis¬
tributions.
Problems
(2.12)
g = 9.81 m/s²,
γ = 73 × 10⁻³ N/m for water at room temperature,
ρ = 10³ kg/m³ for water,
m = m₀/√(1 − v²/c²)    (2.13)
82 2 EXCEL® AND DATA ANALYSIS
where m₀ is the rest mass of the body and c is the velocity of light. Excel® is
used to calculate m for values of v close to the speed of light. Note that
c = 3.0 × 10⁸ m/s and we take m₀ = 9.1 × 10⁻³¹ kg. A formula has been
entered into cell B2 of sheet 2.32 to calculate m. Cell D2 has been given the
name mo and cell D3 has been given the name c_ (see section 2.3.6 for
restrictions on naming cells).
When the Enter key is pressed, the value returned is -9.34E-1. This is
inconsistent with equation (2.13). Find the mistake in the formula in cell
B2, correct it, and complete column B of sheet 2.32.
A B C D
1 V m
2 2.90E+08 =mo/1-A2^2/c_^2 mo 9.10E-31
3 2.91 E+08 c 3.00E+08
4 2.92E+08
5 2.93E+08
6 2.94E+08
7 2.95E+08
8 2.96E+08
9 2.97E+08
10 2.98E+08
11 2.99E+08
A B C D E F G H 1 J
1 70 62 58 76 60 55 56 60 68 59
2 69 54 61 58 62 71 68 63 73 72
3 67 68 68 63 63 64 75 65 64 67
4 68 57 72 69 61 70 73 58 63 66
5 72 63 64 65 59 65 63 63 66 58
6 73 61 72 71 69 65 58 66 77 61
7 72 66 67 65 62 70 65 67 66 72
8 64 72 61 69 70 66 64 60 61 68
9 63 70 65 62 70 75 64 79 68 62
10 63 63 72 75 70 66 72 79 78 69
(i) Use the built in functions on Excel® to find the mean, standard devia¬
tion, maximum, minimum and range of these values.
(ii) Use the Histogram tool to plot a histogram of the values.
P (Torr)    t (s)
750 1.9
660 4.2
570 6.8
480 10.7
390 16.2
300 25.2
210 49.2
120 69.0
Table 2.8. Variation of vapour pressure with temperature for three volatile
liquids.
Temperature (°C)    Pressure (Torr)
0 102 61 20
5 120 68 24
10 148 82 30
15 188 98 38
20 235 133 51
25 303 182 65
30 382 243 92
35 551 352 119
(i) Convert the pressures in torrs to pascals, given that the relationship
between torrs and pascals is 1 Torr= 133.3 Pa.
(ii) Plot a graph of pressure in pascals versus time in seconds. The graph
should include fully labelled axes.
F(N) y (cm)
2 2.4
4 4.2
8 6.9
10 7.6
12 9.0
14 10.4
16 11.6
100%
£_
(2.14)
pp H
100%
where ρp is the resistivity of the plasma surrounding the cells and f is a form
factor which depends on the shape of the cells.
(i) Use Excel® to determine ρ/ρp for H in the range 0% to 80% in steps of
5% when f = 4.
(ii) Repeat part (i) for f = 3, 2.5, 2.0 and 1.5.
(iii) Show the ρ/ρp versus H data determined in parts (i) and (ii) on the
same x-y graph.
3.1 Introduction
It is tempting to believe that the laws of‘chance’ that come into play when
we toss a coin or roll dice have little to do with experiments carried out in
a laboratory. Rolling dice and tossing coins are the stuff of games. Surely,
well planned and executed experiments provide precise and reliable data,
immune from the laws of chance. Not so. Chance, or what we might call
more formally probability, has rather a large role to play in every experi¬
ment. This is true whether an experiment involves counting the number of
beta particles detected by nuclear counting apparatus in one minute,
measuring the time a ball takes to fall a distance through a liquid or deter¬
mining the values of resistance of 100 resistors supplied by a component
manufacturer. Because it is not possible to predict with certainty what
value will emerge when a measurement is made of a quantity, say of the
time for a ball to fall through liquid, we are in a similar position to a person
throwing dice, who cannot know in advance which numbers will appear
‘face up’. If we are not to give up in frustration at our inability to discover
the ‘exact’ value of a quantity experimentally, we need to find out more
about probability and how it can assist rather than impede our experimen¬
tal studies.
In many situations a characteristic pattern or distribution emerges in
data gathered when repeat measurements are made of a quantity. A distri¬
bution of values indicates that there is a probability associated with the
occurrence of any particular value. Related to any distribution of‘real’ data
there is a probability distribution which allows us to calculate the probabil¬
ity of the occurrence of any particular value. Real probability distributions
85
86 3 DATA DISTRIBUTIONS 1
3.2 Probability
Example 1
A die is thrown once; what are the possible events?
ANSWER
The possible events on a single throw of a die (i.e. a single trial) are 1 or 2 or 3 or 4 or
5 or 6.
We will use some of the rules of probability as ideas are developed regard¬
ing the variability in data. It is useful to illustrate the rules of probability by
reference to a familiar ‘experiment’ which consists of rolling a die one or
more times. It is emphasised that, though the outcomes of rolling a die are
considered, the rules are applicable to the outcomes of many experiments.
Rule 1
All probabilities lie in the range 0 to 1
Notes:
If all the probabilities are equal, then P(1) = P(2) = P(3) etc. = 1/6.
(iv) Another way to find P(1) is to do an experiment consisting of n trials,
where n is a large number - say 1000 - and to count the number of
times a 1 appears - call this number, N. The ratio N/n (sometimes
called the relative frequency) can be taken as the probability of
88 3 DATA DISTRIBUTIONS I
obtaining a 1. Assuming we use a die that has not been tampered with,
it is reasonable to expect N/n to be about 1/6. This can be looked at
another way: if the probability of a particular event A is P(A), then the
expected number of times that event A will occur in n trials is equal to
nP(A).
Rule 2
When events are mutually exclusive, the probability of event A occurring,
P(A), or event B occurring, P(B), is

P(A or B) = P(A) + P(B)
Notes
(i) Mutually exclusive events: this means that the occurrence of one event
excludes other events occurring. For example, if you roll a die and
obtain a 6 then no other outcome is possible (you cannot obtain, for
example, a 6 and a 5 on a single roll of a die).
(ii) Example: What is the probability of obtaining a 2 or a 6 on a single roll
of a die? Using the rule, P(2 or 6) = P(2) + P(6) = 1/6 + 1/6 = 1/3.
Rule 3
When events are independent, the probability of event A, P(A), and event
B, P(B), occurring is P(A) × P(B). This is written⁴ as

P(A and B) = P(A) × P(B)
Notes
(i) Independent events: this means that the occurrence of an event has no
influence on the probability of the occurrence of succeeding events.
For example, on the first roll of a die the probability of obtaining a 6 is
1/6. The next time the die is rolled, the probability that a 6 will occur
remains 1/6.
(ii) Example: Using the rule, the probability of throwing two 6s in succession
is 1/6 × 1/6 = 1/36.
We now have sufficient probability rules for our purposes.⁵ Next we consider
how functions which describe the probability of particular outcomes
⁴ P(A and B) is often written as P(AB).
⁵ Adler and Roessler (1972) discuss the other rules of probability.
3-3 PROBABILITY DISTRIBUTIONS 89
0.632 0.328 0.696 0.166 0.665 0.157 0.010 0.391 0.454 0.396
0.322 0.454 0.087 0.540 0.603 0.138 0.021 0.203 0.272 0.763
0.055 0.095 0.410 0.422 0.109 0.713 0.834 0.029 0.577 0.984
0.575 0.932 0.772 0.043 0.464 0.112 0.234 0.062 0.657 0.839
0.600 0.894 0.421 0.186 0.213 0.676 , 0.504 0.028 0.916 0.809
0.798 0.841 0.927 0.335 0.505 0.549 0.352 0.430 0.984 0.853
0.803 0.302 0.389 0.814 0.175 0.309 0.607 0.198 0.569 0.177
0.711 0.445 0.279 0.091 0.469 0.572 0.719 0.901 0.993 0.034
0.571 0.277 0.345 0.119 0.688 0.512 0.437 0.141 0.903 0.453
0.048 0.597 0.532 0.864 0.936 0.040 0.553 0.129 0.077 0.706
⁶ Random numbers between 0 and 1 can be generated using the RAN function
found on many calculators such as those manufactured by CASIO. The RAND()
function on Excel® can also be used to generate random numbers.
⁷ Unless we know the details of the algorithm that produced the random numbers.
90 3 DATA DISTRIBUTIONS I
f(x) = 0 for x < 0
f(x) = 0 for x > 1
f(x) = A for 0 ≤ x ≤ 1    (3.1)

As f(x) is a probability density function, the total area under the curve between x = 0 and x = 1 must equal 1, i.e.

∫₀¹ A dx = 1

therefore

A[x]₀¹ = A(1 − 0) = 1, so that A = 1.
⁸ Read P(x₁ ≤ x ≤ x₂) as ‘the probability that x lies between the limits x₁ and x₂’.
92 3 DATA DISTRIBUTIONS I
Example 2
What is the probability of observing a random number, generated in the manner
described in this section, between 0.045 and 0.621?
ANSWER
Substituting x₂ = 0.621 and x₁ = 0.045 into equation (3.3) gives

P(0.045 ≤ x ≤ 0.621) = 0.621 − 0.045 = 0.576
Exercise A
1. A particular probability density function is written f(x) = Ax for the range 0 < x < 4
and f(x) = 0 outside this range.
(ii) If A = 0.3, calculate the probability that x lies between x=0 and x= 2.
(iii) Calculate the probability that x> 2.
P(x₁ ≤ x ≤ x₂) = P(x₁) + P(x₁ < x < x₂) + P(x₂)    (3.4)

P(x₁) is the area under the probability curve at x = x₁. Using equation (3.2),
this probability is written

P(x₁) = ∫_{x₁}^{x₁} f(x) dx    (3.5)

Both limits of the integral are x₁, and so the right hand side of equation (3.5)
must be equal to zero. Therefore, P(x₁) = P(x₂) = P(x) = 0, and equation (3.4)
becomes

P(x₁ ≤ x ≤ x₂) = P(x₁ < x < x₂)    (3.6)
The fact that P(x) = 0 can be disturbing. Let us return to table 3.1 and
consider the first value, x = 0.632. If we are dealing with the distribution of a
random variable, then P(0.632) = 0. However, the value 0.632 has been
observed, so how are we able to reconcile this with the fact that P(0.632) = 0?
The explanation is that x = 0.632 is a rounded value and the ‘actual’ value
could lie anywhere between x = 0.631 50 and x = 0.632 50. That is, though it is
not obvious, we are dealing with an implied range by the way the number
is written. If we now ask what is the probability that a value of x (for this
distribution) lies between x = 0.631 50 and x = 0.632 50, we have, using
equation (3.3),

P(0.631 50 ≤ x ≤ 0.632 50) = 0.632 50 − 0.631 50 = 0.001
Rounding could have been carried out by the experimenter, or the value recorded
could have been limited by the precision of the instrument used to make the
measurement.
For example, a major source in the variability of time measured in the viscosity
experiment is due to inconsistency in hand timing the duration of the fall of the ball,
whereas in the resistance experiment the dominant factor is likely to be variability in
the process used to manufacture the resistors.
3.4 DISTRIBUTIONS OF REAL DATA 95
Figure 3.3. Frequency versus time of fall through oil.
Figure 3.4. Frequency versus resistance.
Figure 3.5. Frequency versus counts in a radioactivity experiment.
96 3 DATA DISTRIBUTIONS I
Figure 3.6. Frequency versus time of fall through liquid.
The term ‘raw’ data refers to data that has been obtained directly through
experiment or observation and has not been manipulated in any way, such as
combining values to calculate a mean.
3.5 THE NORMAL DISTRIBUTION 97
The bell shaped curve appearing in figure 3.6 is generated using the probability
density function

f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))    (3.7)

where μ and σ are the population mean and the population standard deviation
respectively introduced in chapter 1 and which we used to describe the
centre and spread respectively of a data set. Equation (3.7) is referred to as
the normal probability density function. It is a ‘normalised’ equation which
is another way of saying that when it is integrated between −∞ < x < +∞, the
value obtained is 1.
Using equation (3.7) we can generate the bell shaped curve associated
with any combination of μ and σ and hence, by integration, find the
probability of obtaining a value of x within a specified interval. Figure 3.7
shows f(x) versus x for the cases in which μ = 50 and σ = 2, 5 and 10. The
population mean, μ, is coincident with the centre of the symmetric distribution
and the standard deviation, σ, is a measure of the spread of the data.
A larger value of σ results in a broader, flatter distribution, though the total
area under the curve remains equal to 1. On closer inspection of figure 3.7
we see that f(x) is quite small for x outside the interval μ − 2.5σ to μ + 2.5σ.
The normal distribution may be used to calculate the probability of
obtaining a value of x between the limits x₁ and x₂. This is given by

P(x₁ ≤ x ≤ x₂) = (1/(σ√(2π))) ∫_{x₁}^{x₂} exp(−(x − μ)²/(2σ²)) dx    (3.8)
Figure 3.7. Three normal distributions with the same mean, but differing standard
deviations.
98 3 DATA DISTRIBUTIONS I
P(x₁ < x < x₂) = P(−∞ < x < x₂) − P(−∞ < x < x₁)

The steps to determine the probability are shown symbolically and pictorially
in figure 3.8. The function given by

P(−∞ < x < x₁) = ∫_{−∞}^{x₁} f(x) dx    (3.9)

is the cumulative distribution function (cdf) of x.

z = (x − μ)/σ    (3.11)

f(z) = (1/√(2π)) exp(−z²/2)    (3.12)

P(−1 ≤ z ≤ 1) = (1/√(2π)) ∫_{−1}^{1} exp(−z²/2) dz    (3.13)
Figure 3.9 shows the variation of f(z) with z. The integral appearing in
equation (3.13) cannot be evaluated analytically and so a numerical
3.5 THE NORMAL DISTRIBUTION 99
Figure 3.9. Variation of f{z) with z. The shaded area is equal to the probability that z
lies between -1 and +1.
method for solving for the area under the curve is required. It turns out that
the shaded area indicated in figure 3.9 is about 0.68. We conclude that the
probability that a value obtained through measurement lies within ±σ of
the population mean is 0.68. Or, looked at another way, that 68% of all
values obtained through experiment are expected to lie within ±σ of the
population mean.
NORMDIST(x,mean,standard deviation,cumulative)
In order to select the cdf option, the cumulative parameter in the function
is set to TRUE. To choose the pdf option, the cumulative parameter is set to
FALSE.
Example 3
Calculate the value of the pdf and the cdf when x=46 for normally distributed data
with mean = 50 and standard deviation = 4.
ANSWER
The NORMDISTQ function for the pdf and cdf is shown entered in cells A1 and A2
respectively of sheet 3.1(a). Sheet 3.1(b) shows the values returned in cells A1 and A2
after the ENTER key has been pressed. We conclude that the value of the pdf at x = 46 is 0.06049 and the value of the cdf is 0.15866.
A 1 B 1 C A B C
1 =NORMDIST(46,50,4,FALSE) 1 0.06049
2 =NORMDIST(46,50,4,TRUE) 2 0.15866
Example 4
Calculate the area under the curve between x=46 and x = 51 for normally distributed
data with mean = 50 and standard deviation = 4.
ANSWER
P(46 < x < 51) = P(−∞ < x < 51) − P(−∞ < x < 46)
A 1 B 1 C A B C
1 =NORMDIST(51,50,4,TRUE) 1 0.59871
2 =NORMDIST(46,50,4,TRUE) 2 0.15866
3 =A1-A2 I I 3 0.44005
3.5 THE NORMAL DISTRIBUTION 101
Exercise B
The masses of small metal spheres are known to be normally distributed with
μ = 8.1 g and σ = 0.2 g. Use the NORMDIST() function to find the values of the cdf for
values of mass given in table 3.2 to four significant figures.
m (g) 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0
z₁ = (x₁ − μ)/σ    z₂ = (x₂ − μ)/σ

and find the probability that a value of z lies between the limits z₁ and z₂ by
evaluating the integral

P(z₁ ≤ z ≤ z₂) = (1/√(2π)) ∫_{z₁}^{z₂} exp(−z²/2) dz    (3.14)

Table 3.3, as well as table 1 of appendix 1, may be used to assist in the evaluation
of equation (3.14).
If we do not have access to Excel® or some other spreadsheet package
that is able to evaluate the area under a normal curve directly, we can use
‘probability’ tables such as those found in appendix 1. To explain the use of
table 3.3, first consider figure 3.10. For any value of z₁, say z₁ = 0.75, go down
102 3 DATA DISTRIBUTIONS I
Table 3.3. Normal probability table giving the area under the standard normal curve
between z = −∞ and z = z₁.
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.00 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.10 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.20 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.30 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.40 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.50 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.60 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.70 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.80 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.90 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.00 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
the first column as far as z₁ = 0.70, then across to the column headed 0.05;
this brings you to the entry in the table with value 0.7734. We conclude that
the probability of obtaining a value of z < 0.75 is

P(−∞ < z < 0.75) = 0.7734
Example 5
What is the probability that a value of z is greater than 0.75?
ANSWER
We have

P(z > 0.75) = 1 − P(−∞ < z < 0.75)

so

P(z > 0.75) = 1 − 0.7734 = 0.2266
Example 6
What is the probability that z lies between the limits z^ = -0.60 and = 0.75?
ANSWER
Solving this type of problem is assisted by drawing a picture representing the area
(and hence the probability) to be determined. Figure 3.11 indicates the area required.
The shaded area in figure 3.11 is equal to (area to the left of z₂) − (area to the left of z₁).
Table 3.3 gives P(−∞ < z < 0.75) = 0.7734.
Probability tables do not usually include negative values of z, so we must make
use of the symmetry of the distribution and recognise that¹⁸

P(−∞ < z < −0.60) = P(0.60 < z < ∞) = 1 − P(−∞ < z < 0.60)

Now, table 3.3 gives P(−∞ < z < 0.60) = 0.7257, so

P(−∞ < z < −0.60) = 1 − 0.7257 = 0.2743

and therefore

P(−0.60 < z < 0.75) = 0.7734 − 0.2743 = 0.4991
Exercise C
With reference to figure 3.11, if Zj = -0.75 and Z2 = -0.60, calculate the probability
P(-0.75< z<-0.60).
¹⁸ Table 1 in appendix 1 is slightly unusual as it gives the area under the normal
curve for negative values of z.
104 3 DATA DISTRIBUTIONS I
Figure 3.11. Shaded area under curve is equal to probability that z lies between −0.60
and 0.75.
This function calculates the area under the standard normal curve
between −∞ and, say, z = z₁. This is given as

P(−∞ < z < z₁) = (1/√(2π)) ∫_{−∞}^{z₁} exp(−z²/2) dz

The syntax of the function is

NORMSDIST(z)
Example 7
Calculate the area under the standard normal distribution between z = −∞ and
z = −1.5.
ANSWER
Sheet 3.3(a) shows the NORMSDISTO function entered into cell Al. Sheet 3.3(b)
shows the value returned in cell Al after the Enter key is pressed.
A I B I C A B C
1 =NORMSDIST(-1.5) 1 0.06681
2 I I 2
Example 8
Calculate the area under the standard normal distribution between z = −2.5 and
z=-1.5.
ANSWER
The area between z = −2.5 and z = −1.5 may be written as

P(−2.5 < z < −1.5) = P(−∞ < z < −1.5) − P(−∞ < z < −2.5)

In cell A1 of sheet 3.4(a) we calculate P(−∞ < z < −1.5) and in cell A2, P(−∞ < z < −2.5).
The two probabilities are subtracted in cell A3 to give the required result. Sheet 3.4(b)
shows the values returned in cells A1 to A3 after the Enter key is pressed. We conclude
that P(−2.5 < z < −1.5) = 0.06060.
A I B I C A B C
1 =NORMSDIST(-1.5) 1 0.06681
2 =NORMSDIST(-2.5) 2 0.00621
3 =A1-A2 1 1 3 0.06060
Exercise D
Use the NORMSDISTO function to calculate the values of the standard cdf for the
z values given in table 3.4. Give the values of the standard cdf to four significant
figures.
z 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
io6 3 DATA DISTRIBUTIONS I
Figure 3.12. Percentage difference between population parameters and sample
statistics.
Before proceeding to consider the analysis of real data using the normal
distribution, we should admit once again that, in the majority of experiments,
we can never know the population mean, μ, and the population
standard deviation, σ, as they can only be determined if all the values that
make up the population are known. If the population is infinite, as it often
is in science, this is impossible. We must ‘make do’ with the best estimates
of μ and σ, namely x̄ and s respectively. μ and x̄ are given by equations
(1.13) and (1.6) respectively; σ and s are given by equations (1.14) and (1.16)
respectively. x̄ tends to μ and s tends to σ as the number of points, n, tends
to infinity. It is fair to ask for what sample size, n, does the approximation
of x̄ for μ and s for σ become ‘reasonable’?
Figure 3.12 shows the percentage difference between x̄ and μ, and
between σ and s for samples of size n, where n is between 2 and 100. Samples
were drawn from a population consisting of simulated values known to be
normally distributed.²⁰ The percentage difference between the means is
defined as [(x̄ − μ)/μ] × 100% and the percentage difference between the
standard deviations is defined as [(s − σ)/σ] × 100%. Figure 3.12 indicates
that the magnitude of the percentage difference between μ and x̄ for these
data is small (under 5%) for n > 10. Figure 3.12 also shows that, as n
increases, s converges more slowly to σ than x̄ converges to μ. The percentage
difference between σ and s is only consistently²¹ below 20% for n > 30. As
²⁰ Normally distributed values used in the simulation were generated with known μ
and σ using the ‘Random Number Generation’ facility in Excel®.
²¹ However, note that one sample of n = 40 produced a percentage difference
between σ and s of about −30%.
3.6 CONFIDENCE INTERVALS AND CONFIDENCE LIMITS 107
Exercise E
1. Figure 3.6 shows a histogram of the time of fall of a small sphere through a liquid.
For these data, x= 3.624 s and s= 0.068 s.
(i) Calculate the probability that a value lies in the interval 3.65 s to 3.70 s.
(ii) If the data consist of 36 values, how many of these would you expect to lie in the
interval 3.65 s to 3.70 s?
2. With reference to the standard normal distribution, what is the probability that
z is
Sketch the standard normal distribution for each of parts (i) to (iv) above, indicating
clearly the probability required.
3. For normally distributed data, what is the probability that a value lies further than
2.5σ from the mean?
4. A large sample of resistors supplied by a manufacturer is tested and found to have
a mean resistance of 4.70 kΩ and a standard deviation of 0.01 kΩ. Assuming the
distribution of resistances is normal:
(i) What is the probability that a resistor chosen from the population will have a
resistance
(ii) If 10 000 resistors are measured, how many would be expected to have resistance
in excess of 4.73 kΩ?
^ " ■ V
X% confidence
interval
what limits (symmetrical about the mean) is the area under the normal
curve equal to, say, 0.5?’ Obtaining limits which define an interval is useful
if we want to answer the question:
Between what limits (symmetric about the mean) does X% of the
population lie?
X% is sometimes referred to as the confidence level and the interval within
which X% of the data lie is the confidence interval.
In order to determine the confidence interval:
Example 9
A distribution of data consisting of many repeat measurements of the flow of water
through a tube has a mean of 3.7 mL/s and a standard deviation of 0.41 mL/s. What
is the 50% confidence interval for these data?
ANSWER
We assume that the question provides good estimates of the population parameters,
μ and σ, so that μ = 3.7 mL/s and σ = 0.41 mL/s. To find the 50% confidence interval
it is necessary to consider the standard normal distribution and indicate the required
area under the curve. The area is distributed symmetrically about z= 0 and is shown
shaded in figure 3.14.
Table 1 in appendix 1 can be used to find z₂ so long as we can determine the area
to the left of z₂, i.e. the area between z = −∞ and z = z₂. As the normal distribution is
symmetrical about z = 0, half of the total area under the curve lies between z = −∞ and
z = 0. This area is equal to 0.5, as the total area under the curve is equal to 1. As the
required area is symmetrical about the mean, half the shaded area in figure 3.14 must
lie between z = 0 and z = z₂. The total area under the curve between −∞ and z₂ is
therefore 0.5 + 0.25 = 0.75.
The next step is to refer to the normal probability integral tables and look in the
table for the probability 0.75 (or the nearest value to it). Referring to table 1 in appen¬
dix 1, a probability of 0.75 corresponds to a z value of 0.67, so Z2 = 0.67. As the confi¬
dence limits are symmetrical about z= 0, it follows that Zj = -0.67.
The final step is to apply equation (3.11) which is used to transform z values to
x values. Rearranging equation (3.11) we obtain

x = μ + zσ

Substituting z₁ = −0.67 and z₂ = 0.67 gives x₁ = 3.43 mL/s and x₂ = 3.97 mL/s.
In summary, the 50% confidence interval (i.e. the interval expected to contain
50% of the values) lies between 3.43 mL/s and 3.97 mL/s. Equivalently, if a measure¬
ment is made of water flow, the probability is 0.5 that the value obtained lies between
3.43 mL/s and 3.97 mL/s.
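As a cross check (a sketch only; the NORMINV() function described later in this chapter is assumed here), the limits can be obtained directly in Excel® with

=NORMINV(0.25,3.7,0.41)
=NORMINV(0.75,3.7,0.41)

which return approximately 3.42 mL/s and 3.98 mL/s; the small differences from the values above arise because the table gives z to only two decimal places.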
As the normal distribution extends between +∞ and −∞, it follows that the
100% confidence interval lies between these limits. It is hardly very
discriminating to say that all data lie between ±∞. Instead, it is useful to quote
a confidence interval between which the majority of the data are expected
to lie. We discovered in section 3.5 that about 68% of normally distributed
110 3 DATA DISTRIBUTIONS I
Figure 3.14. Standard normal curve indicating area of 0.5 distributed symmetrically
about z=0.
Figure 3.15. Standard normal curve indicating area of 0.95 distributed symmetrically
about z=0.
data lie within ±σ of the population mean. It follows that the 68% confidence
interval for such data can be written

μ − σ ≤ x ≤ μ + σ    (3.17)

or, in terms of z,

−1 ≤ z ≤ 1    (3.18)
Another often quoted confidence interval is that which includes 95% of the
data (or equivalently, where the area under the normal curve is equal to
0.95). This is shown in figure 3.15. As stated in the previous section, the area
between z = −∞ and z = 0 is 0.5. The area between z = 0 and z = z₂ is half the
shaded area in figure 3.15. Now the half shaded area = ½ × 0.95 = 0.475, so the
area between z = −∞ and z = z₂ is 0.5 + 0.475 = 0.975.
The z value corresponding to a probability of 0.975, found using
table 1 in appendix 1 is equal to 1.96, so that Z2= 1.96. As the distribution is
symmetric about z=0, it follows that Zj = -Z2 = -1.96. So the 95% confi¬
dence interval lies between z= -1.96 and z= + 1.96.
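As a sketch (anticipating the NORMSINV() function described later in this chapter), the value 1.96 can be checked directly with

=NORMSINV(0.975)

which returns 1.959964.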
3.6 CONFIDENCE INTERVALS AND CONFIDENCE LIMITS 111
Example 10
The current gain of many BC107 transistors is measured.^^ The mean gain is found
to be 209 with the standard deviation of 67. What are the 68% and 95% confidence
intervals for the distribution of the gain of the transistors?
ANSWER
Taking μ = 209 and σ = 67, rearrange equation (3.11) to give

x = μ + zσ

Considering the 68% confidence interval first, the lower limit occurs for z = −1,
giving the lower limit of x, x₁, as

x₁ = 209 − 67 = 142

and the upper limit occurs for z = +1, giving

x₂ = 209 + 67 = 276

We conclude that the 68% confidence interval lies between 142 and 276.
For the 95% confidence interval, the lower limit occurs for z = −1.96, giving the
lower limit of x, x₁, as

x₁ = 209 + (−1.96) × 67 ≈ 78

and the upper limit, for z = +1.96, as

x₂ = 209 + 1.96 × 67 ≈ 340
We conclude that the 95% confidence interval for the current gain of the transistors
lies between 78 and 340. The z values for other confidence intervals may be calcu¬
lated in a similar manner. Table 3.5 shows a summary of confidence levels and the
corresponding z values.
Exercise F
A water cooled heat sink is used in a thermal experiment. The variation of the tem¬
perature of the heat sink is recorded at 1 minute intervals over the duration of the
experiment. Forty values of heat sink temperature (in°C) are shown in table 3.6.
Assuming the data in table 3.6 are normally distributed, use the data to estimate the
50%, 68%, 90%, 95% and 99% confidence intervals for the population of heat sink
temperatures.
If the area is known under the normal distribution curve between x = −∞
and x = x₁, what is the value of x₁? For a given probability, the NORMINV()
function in Excel® will calculate x₁ so long as the mean and standard deviation
of the normal distribution are given. The syntax of the function is

NORMINV(probability, mean, standard_dev)
Example 11
Normally distributed data have a mean of 126 and a standard deviation of 18. If the
area under the normal curve between - oo and Xj for these data is 0.75, calculate Xj.
ANSWER
A ^ B ^ C A B C
1 =NORMINV(0.75,126,18) 1 138.141
2 1 1 2
3.6 CONFIDENCE INTERVALS AND CONFIDENCE LIMITS 113
Example 12
Values of density of a saline solution are found to be normally distributed with a
mean of 1.150 g/cm³ and a standard deviation of 0.050 g/cm³. Use the NORMINV()
function to find the 80% confidence limits for the density data.
ANSWER
It is helpful to sketch a diagram of the normal distribution and to indicate the area
under the curve and the corresponding limits, x₁ and x₂. The total area to the left of
x₂ in figure 3.16 is 0.8 + 0.1 = 0.9. Now we can use the Excel® function NORMINV() to
find the value of x₂.
Sheet 3.6(a) shows the NORMINV() function entered into cell A1. Sheet 3.6(b)
shows the value returned in cell A1 after the Enter key is pressed. The upper limit of
the confidence interval is 1.214 g/cm³, which is more than the mean by an amount
1.214 g/cm³ − 1.150 g/cm³ = 0.064 g/cm³. The lower limit is less than the mean by the
same amount. It follows that the lower limit is 1.150 g/cm³ − 0.064 g/cm³ = 1.086 g/cm³.
A I B I C A B C
1 =NORMINV(0.9,1.15,0.05) 1 1.21408
2 _^_i_ 2
NORMSINV(probability)
114 3 DATA DISTRIBUTIONS I
Example 13
If the area under the standard normal curve between −∞ and z₁ is equal to 0.85, what
is the value of z₁?
ANSWER
Sheet 3.7(a) shows the NORMSINV() function entered into cell A1. Sheet 3.7(b) shows
the value returned in cell A1 after the Enter key is pressed, i.e. z₁ = 1.03643.
A I B I C A B C
1 =NORMSINV(0.85) 1 1.03643
2 I I 2
Exercise G
If the area under the standard normal curve between z₁ and ∞ is 0.2, use the
NORMSINV() function to find z₁.
σ_x̄ = σ/√n    (3.19)

z = (x̄ − μ)/σ_x̄    (3.20)

When σ is not known, the best we can do is replace σ in equation (3.19) by
s so that

σ_x̄ ≈ s/√n    (3.21)
The population standard deviation, σ, of the raw data is independent of
sample size. By contrast, the standard deviation of the distribution of
sample means, σ_x̄, decreases as n increases as given by equation (3.21). This
is consistent with the observation of the ‘narrowing’ of the distributions of
the sample means shown in figures 3.18 and 3.19 for n = 2 and n = 10
respectively. The ability to reduce σ_x̄ by increasing n is important as it
permits us to quote a confidence interval for the population mean which
can be as small as we choose so long as we are able to make sufficient
repeat measurements.
²⁵ σ_x̄ is sometimes written as SE x̄.
²⁶ See appendix 4 for a derivation of equation (3.19).
3 DATA DISTRIBUTIONS I
Finding the confidence interval for x̄ is less important than finding the
confidence interval for the population mean (or true value), μ. By rearranging
the terms within the brackets, the 68% confidence limits for the
population mean become

P(x̄ − σ/√n ≤ μ ≤ x̄ + σ/√n) = 0.68    (3.23)

Equation (3.23) is certainly more useful than equation (3.22), but there still
is a problem: the population standard deviation, σ, appears in this equation
and we do not know the value of this. The best we can do is replace σ
by the estimate of the population standard deviation, s. As discussed in
section 3.5.4, this approximation is regarded as reasonable so long as
n > 30. Equation (3.23) becomes

P(x̄ − s/√n ≤ μ ≤ x̄ + s/√n) ≈ 0.68    (3.25)

We use the properties of the normal distribution to give the limits associated
with other confidence intervals. The X% confidence interval is
written

x̄ − z(X%) s/√n ≤ μ ≤ x̄ + z(X%) s/√n    (3.26)

where z(X%) is the z value corresponding to the X% confidence level (see table 3.5).
Example 14
Using the heat sink temperature data in table 3.6:
(i) Calculate the 95% confidence interval for the population mean for these data.
(ii) How many values would be required to reduce the 95% confidence interval for
the population mean to (x̄ − 0.1) °C to (x̄ + 0.1) °C?
ANSWER
(i) The mean, x̄, and standard deviation, s, of the data in table 3.6, are

x̄ = 18.41 °C    s = 0.6283 °C

so that s/√n = (0.6283 °C)/√40 = 0.0993 °C and 1.96 s/√n = 0.1947 °C.
The 95% confidence interval for the population mean is from (18.41 − 0.1947) °C to
(18.41 + 0.1947) °C, i.e. from 18.22 °C to 18.60 °C.
(ii) Rearranging the expression for the half width of the interval, 1.96 s/√n = 0.1 °C, gives

n = (1.96 × 0.6283 °C / 0.1 °C)² ≈ 152
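As a sketch (assuming the forty temperatures of table 3.6 occupy cells A1 to A40 of a worksheet, and anticipating the CONFIDENCE() function described later in this chapter), the half width of the 95% interval in part (i) could be found with

=CONFIDENCE(0.05,STDEV(A1:A40),40)

which returns approximately 0.195 °C, consistent with the value 0.1947 °C above.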
Exercise H
Thirty repeat measurements are made of the density of a high quality multigrade
motor oil. The mean density of the oil is found to be 905 kg/m³ and the standard
deviation 25 kg/m³. Use this information to calculate the 99% confidence interval
for the population mean of the density of the motor oil.
3.8.1.1 APPROXIMATING σ_x̄
A rough estimate of the standard error of the mean may be obtained from the range of the values:

σ_x̄ ≈ range/n    (3.27)
Example 15
Table 3.7 shows the input offset voltages of six operational amplifiers. Use equations
(3.21) and (3.27) to estimate the standard error of the mean, σ_x̄, of these values to two
significant figures.
ANSWER
Using equation (1.16) we obtain s = 1.795 mV. Substituting s into equation (3.21) we
then obtain s/√n = (1.795 mV)/√6 = 0.73 mV. Using equation (3.27) we find
σ_x̄ ≈ range/n = (5 mV)/6 ≈ 0.83 mV.
Input offset voltage (mV) 4.7 5.5 7.7 3.4 2.7 5.8
Exercise I
Table 3.8 shows eight values of the energy gap of crystalline germanium at room
temperature. Use equations (3.21) and (3.27) to estimate the standard error of the
mean, σ_x̄, of the values in table 3.8 to two significant figures.
Table 3.8. Values of the energy gap of germanium, as measured at room temperature.
Energy gap (eV) 0.67 0.63 0.72 0.66 0.74 0.71 0.66 0.64
The X% confidence interval of the population mean for any set of data may
be found using Excel®’s CONFIDENCE() function. The syntax of the function
is

CONFIDENCE(alpha, standard_dev, size)

where alpha = (100% − X%)/100%.
Example 16
Fifty measurements are made of the coefficient of static friction between a wooden
block and a flat metal table. The data are normally distributed with a mean of 0.340
and a standard deviation of 0.021. Use the CONFIDENCE() function to find the 99%
confidence interval for the population mean.
ANSWER
α = (100% − 99%)/100% = 0.01
Sheet 3.8(a) shows the CONFIDENCE() function entered into cell A1. Sheet 3.8(b)
shows the value returned in cell Al after the Enter key is pressed. The confidence
interval can be written
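A minimal sketch of the formula presumably entered in cell A1 of sheet 3.8(a) (a reconstruction, since the sheet itself is not reproduced above):

=CONFIDENCE(0.01,0.021,50)

which returns approximately 0.0077, so that the 99% confidence interval is approximately 0.340 − 0.008 to 0.340 + 0.008, i.e. about 0.332 to 0.348.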
While the shape of the normal distribution describes well the variability in
the mean when sample sizes are large, it describes the variability less well
when sample sizes are small. It is important to be aware of this, as many
experiments are carried out in which the number of repeat measurements
is small (say less than ten). Essentially the difficulty stems from the
assumption made that the estimate of the population standard deviation,
s, is a good approximation to σ. The variation in values is such that, for
small data sets, s is not a good approximation to σ and the quantity

(x̄ − μ)/(s/√n)

where n is the size of the sample, does not follow the standard normal distribution,
but another closely related distribution, referred to as the t distribution.
If we write

t = (x̄ − μ)/(s/√n)    (3.29)

(3.30)

∫_{−∞}^{+∞} f(t) dt = 1    (3.31)
Figure 3.21 shows the general shape of the t probability density function.
On the face of it, figure 3.21 has a shape indistinguishable from that of the
normal distribution, with a characteristic bell shape in evidence. The difference
becomes clearer when we compare the t and the standard normal distributions
directly. Equation (3.30) predicts a different distribution for each
sample size, n, with the t distribution tending to the standard normal distribution
as n → ∞. Figure 3.22 shows a comparison of the t distribution
curve with n = 6 and that of the standard normal probability distribution.
An important difference between the family of t and the normal distribu-
See Hoel (1984) for more information on the t probability density function.
3.9 the t DISTRIBUTION 123
Figure 3.21. t distribution curve.
tions is the extra area in the tails of the t distributions. As an example, consider
the normal distribution in which there is 5% of the area confined to
the tails of the distribution. A confidence interval of 95% for the population
mean is given by

x̄ − 1.96 σ/√n  to  x̄ + 1.96 σ/√n

When s replaces σ and the sample is small, the corresponding 95% confidence
interval for the population mean is written

x̄ − t_95%,ν s/√n  to  x̄ + t_95%,ν s/√n    (3.32)
Table 3.9. t values for various confidence levels and degrees of freedom.³⁰
Number of values, n    Degrees of freedom, ν    t_90%,ν    t_95%,ν    t_99%,ν
Diameter (mm)
Example 17
The diameter of ten steel balls is shown in table 3.10. Using these data, calculate:
(i) x;
(ii) s;
(iii) the 95% and 99% confidence interval for the population mean.
ANSWER
x̄ = 4.680 mm and s = 0.07149 mm. The 95% confidence interval for the population
mean extends from

x̄ − t_95%,9 s/√10  to  x̄ + t_95%,9 s/√10

i.e. from

4.680 mm − (2.26 × 0.07149 mm)/√10  to  4.680 mm + (2.26 × 0.07149 mm)/√10

i.e. 4.629 mm to 4.731 mm. For the 99% confidence interval we have t_99%,9 = 3.25, so
the interval extends from 4.680 mm − (3.25 × 0.07149 mm)/√10 to
4.680 mm + (3.25 × 0.07149 mm)/√10, i.e. from approximately 4.607 mm to 4.753 mm.
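As a sketch (assuming the ten diameters of table 3.10 occupy cells A1 to A10, and using the TINV() function described in the next section), the 95% limits could be evaluated with

=AVERAGE(A1:A10)-TINV(0.05,9)*STDEV(A1:A10)/SQRT(10)
=AVERAGE(A1:A10)+TINV(0.05,9)*STDEV(A1:A10)/SQRT(10)

where TINV(0.05,9) returns 2.2622, a more precise value of t_95%,9 than the 2.26 used above.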
Exercise J
Calculate the 90% confidence interval for the population mean of the data in
table 3.10.
For a specified value of t, the TDIST() function gives the probability in the
tails of the t distribution, so long as the number of degrees of freedom, ν, is
specified. The syntax for the TDIST() function is

TDIST(x, degrees_freedom, tails)

If the tails parameter is set to 2, the function gives the area in both tails of
the distribution. If the tails parameter is set to 1, then the area in one tail is
given.
Example 18
(i) Calculate the area in the tails of the t distribution when t = 1.5 and ν = 10.
(ii) Calculate the area between t = −1.5 and t = 1.5.
ANSWER
(i) Sheet 3.9(a) shows the formula required to calculate the areas in both tails
entered into cell A1. Sheet 3.9(b) shows the area in the tails returned in cell A1
when the Enter key is pressed.
(ii) The area between t = −1.5 and t = 1.5 is equal to 1 − (area in tails) =
1 − 0.16451 = 0.83549.
Example 19
If the area in the tails of the t distribution is 0.6, calculate the corresponding value of
t, assuming that the number of degrees of freedom, ν, is 20.
ANSWER
Sheet 3.10(a) shows the TINV() function entered into cell A1. Sheet 3.10(b) shows the
value returned in cell A1 after the Enter key is pressed.
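A minimal sketch of what the entry in cell A1 of sheet 3.10(a) presumably looks like (a reconstruction, since the sheet is not reproduced above):

=TINV(0.6,20)

which returns approximately 0.53.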
Suppose the fraction, f, of a sample is less than or equal to the value q(f).
We term q(f) the quantile for that fraction. For a particular probability distribution,
q(f) can be determined for any value of f and any sample size.
We expect that the experimental quantile (i.e. the quantile for the data
obtained in the experiment) will be very similar to the theoretical quantile
if the data follow the theoretical distribution closely. Put another way, if we
plot the experimental data against the theoretical quantile we would
expect the points to lie along a straight line. A quantile plot may be constructed
as follows:³⁴
(3.33)
Example 20
Table 3.11 shows the diameter (in μm) of small spherical particles deposited on the
surface of a film of titanium. Construct a normal quantile plot for these data.
ANSWER
Table 3.12 shows f_i and q(f_i). The values of particle size are shown ordered from
smallest to largest (in the interests of brevity, the table shows the first 10 values in the
sample only). Figure 3.26 shows x_i versus q(f_i) for all the data in table 3.11. The
extreme non-linearity for q(f_i) > 1 indicates that the data are not adequately
described by the normal distribution.
0.075 0.110 0.037 0.065 0.147 0.106 0.158 0.163 0.149 0.131 0.136 0.106
0.068 0.206 0.037 0.097 0.123 0.968 0.110 0.147 0.081 0.062 0.421 0.188
Figure 3.26. Normal quantile plot of the particle diameter data (in μm) of table 3.11.
Exercise K
Transform the data in table 3.11 by taking the natural logarithms of the values in the
table.
(3.35)

x̄ = (f₁x₁ + f₂x₂ + f₃x₃ + ···)/n    (3.36)

where f_i is the frequency of occurrence of the value x_i. As n → ∞, x̄ → μ, and f_i/n
becomes the probability, p_i, of observing the value x_i. As n → ∞,
equation (3.36) becomes

μ = p₁x₁ + p₂x₂ + p₃x₃ + ··· = Σ_i p_i x_i    (3.38)

Equation (3.38) is most easily applied when dealing with quantities that
take on discrete values, such as occurs when throwing dice or counting
particles emitted from a radioactive source. When dealing with continuously
varying quantities such as length or mass we can write the probability,
p, of observing the value x in the interval Δx as

p = f(x) Δx    (3.39)

so that the sum in equation (3.38) becomes the integral³⁵

μ = ∫_{−∞}^{+∞} x f(x) dx    (3.40)
³⁵ Here we have shown the limits of the integral extending from +∞ to −∞, indicating
that any value of x is possible. In general, the limits of the integral may differ from
+∞ to −∞. Most importantly, if the limits are written as a and b, then ∫ₐᵇ f(x) dx = 1.
132 3 DATA DISTRIBUTIONS I
Example 21
A probability density function, f(x), is written
ANSWER
Using equation (3.40) we write (with appropriate change of limits in the integration)
Exercise L
Given that a distribution is described by the probability density function
The mean value of x where the probability distribution governing the distribution
of x is known is also referred to as the expectation value of x and
is sometimes written ⟨x⟩. For continuous quantities ⟨x⟩ is written

From the point of view of data analysis, a useful expectation value is that of
the square of the deviation of values from the mean, as this is the variance,
σ², of values. Writing g(x) = (x − μ)² and using equation (3.43) we have
3.14 Review
Problems
f(x) = A − x for 0 < x < 1
f(x) = 0 for other values of x.
134 3 DATA DISTRIBUTIONS I
fix)
then
(a) Use the NORMDIST() function on Excel® to calculate the cdf for
focal lengths in the range 13.0 cm to 17.0 cm in steps of 0.2 cm.
(b) Plot a graph of cdf versus focal length.
(ii) Using your cdf values found in part (i), find the probability that a
lens chosen at random has a focal length in the range:
0.35 0.37 0.59 0.14 0.55 0.74 1.99 1.81 0.15 1.58
0.63 0.46 0.19 0.79 0.80 0.99 2.34 1.76 1.82 0.82
0.44 0.45 0.20 0.55 0.57 1.96 0.82 2.14 2.22 4.25
16.9 10.3 11.3 9.0 11.3 4.0 11.4 6.3 13.2 15.7
23.5 8.9 8.4 13.1 10.9 11.0 26.8 12.4 11.3 30.4
8.9 23.0 12.1 11.8 27.3 14.6 15.2 10.8 17.1 11.8
8.4 7.0 13.5 7.5 32.1 11.4 5.5 10.3 19.7 10.1
8.8 6.2 14.3 11.2 16.2 8.9 10.2 14.1 10.3 13.5
3.5 11.8 13.8 35.1 5.7 5.6 3.8 21.9 39.9 10.0
7.4 16.0 16.9 8.0 18.3 16.9 11.0 5.6 20.2 17.8
10.2 8.6 18.8 14.5 16.1 34.6 14.0 7.0 15.5 6.6
16.2 11.3 14.1 14.6 13.0 10.6 6.2 35.2 27.1 20.6
5.8 7.0 15.9 12.6 12.1 13.2 4.8 10.7 7.6 7.4
(iii) Repeat part (i) for ν = 2 to 100 and plot a graph of the area in the tail
versus ν.
(iv) For what value of ν is the area in the tail between t = 2 and t = ∞ equal
to the area in the tail of the standard normal distribution between z = 2
and z = ∞ when both areas are expressed to one significant figure?
(i) Use a normal quantile plot to help decide whether the values in table
3.15 follow a normal distribution.
(ii) Transform the data in table 3.15 by taking the natural logarithms of
the values. Using a normal quantile plot establish whether the
transformed values follow a normal distribution. (That is, do the data
follow a lognormal distribution?)
Chapter 4
Data distributions II
4.1 Introduction
we speak of performing n trials. The result of a test (e.g. ‘pass’) or the result
of a coin toss (e.g. ‘head’) is referred to as an outcome.
Some experiments consist of trials, in which the outcome of each trial
can be classified as a success (S) or a failure (F). If this is the case, and if the
probability of a success does not change from trial to trial, we can use the
binomial distribution to determine the probability of a given number of suc¬
cesses occurring, for a given number of trials. ^ The binomial distribution is
useful if we want to know the probability of, say, obtaining one or more
defective light emitting diodes (LEDs) when an LED is drawn from a large
population of‘identical’ LEDs, or the probability of obtaining four ‘heads’ in
six tosses of a coin. In addition, by considering a situation in which the
number of trials is very large but the probability of success on a single trial
is small (sometimes referred to as a ‘rare’ outcome), we are able to derive
another discrete distribution important in science: the Poisson distribution.
How a ‘success’ is defined largely depends on the circumstances. For
example, if integrated circuits (ICs) are tested, a successful outcome could
be defined as a circuit that is fully functional. Let us write the probability
of obtaining such a success as p. If a circuit is not fully functional, then it is
classed as a failure. Denote the probability of a failure by q. As success or
failure are the only two possibilities, we must have
p + q = 1    (4.1)
As an example, suppose after testing many ICs, it is found that 20% are
defective. If one IC were chosen at random, the probability of a failure is 0.2
and hence the probability of a success is 0.8. If four ICs are drawn from the
population, what is the probability that all four are successes? To answer
this we apply one of the rules of probability discussed in section 3.2.1. So
long as the trials are independent (so that removing any IC from the population
does not affect the probability that the next IC drawn from the population
is a success) then

P(four successes) = 0.8 × 0.8 × 0.8 × 0.8 = 0.8⁴ = 0.4096
Going a stage further in complexity, if four ICs are removed from the pop¬
ulation, what is the probability that exactly two of them are fully func¬
tional? This is a little more tricky as, given that four ICs are removed
and tested, two successes can be obtained in several ways such as two
¹ The name binomial distribution derives from the fact that the probability of r successful
outcomes from n trials equates to a term of the binomial expansion (p + q)ⁿ,
where p is the probability of a success in a single trial, and q is the probability of a
failure.
140 4 DATA DISTRIBUTION II
The probability of each of the other five combinations, SFFS, FFSS etc.,
occurring is also 0.0256, so that the total probability of obtaining exactly
two successes from four trials is 6 × 0.0256 = 0.1536.

ⁿCᵣ = n!/((n − r)! r!)    (4.2)

n! = n × (n − 1) × (n − 2) × (n − 3) × ··· × 2 × 1    (4.3)

P(r) = ⁿCᵣ pʳ qⁿ⁻ʳ    (4.4)
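As a sketch, the probability of exactly two fully functional ICs in four drawn (the example above) can be checked in Excel® with

=COMBIN(4,2)*0.8^2*0.2^2

which returns 0.1536.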
Example 1
5% of thermocouples in a laboratory need repair. Assume the population of thermo­couples to be large enough for the binomial distribution to be valid. If ten thermo­couples are withdrawn from the population, what is the probability that:
(i) exactly two are in need of repair;
(ii) two or fewer are in need of repair;
(iii) more than two are in need of repair?
ANSWER
We designate a good thermocouple (i.e. one not in need of repair) as a success. Given
that 5% of thermocouples need repair, the probability, q, of a failure is 0.05. Therefore
the probability of a success, p, is 0.95.
(i) If two thermocouples are in need of repair (failures) then the other eight must be
good (i.e. successes). We require the probability of eight successes (r = 8) from
ten trials (n = 10). Using equation (4.2),

C = n!/[(n − r)! r!] = 10!/[(10 − 8)! 8!] = 45

so that, using equation (4.4), P(8) = 45 × (0.95)^8 × (0.05)^2 ≈ 0.0746.
(ii) If two or fewer thermocouples are in need of repair, then eight, nine or ten must
be good. Therefore the probability required is P(8) + P(9) + P(10) ≈ 0.0746 + 0.3151 + 0.5987 ≈ 0.9885.
(iii) If more than two thermocouples need repair, then zero, one, two, three, four, five,
six or seven must be good. We require the probability P(0 ≤ r ≤ 7), which is given by

P(0 ≤ r ≤ 7) = P(0) + P(1) + P(2) + P(3) + P(4) + P(5) + P(6) + P(7)

that is,

P(0 ≤ r ≤ 7) = Σ_{r=0}^{7} P(r)

Now

Σ_{r=0}^{7} P(r) + Σ_{r=8}^{10} P(r) = 1

Therefore

Σ_{r=0}^{7} P(r) = 1 − Σ_{r=8}^{10} P(r)

so that

Σ_{r=0}^{7} P(r) = 1 − 0.9885 = 0.0115
Exercise A
The assembly of a hybrid circuit requires the soldering of 58 electrical connections.
If 0.2% of electrical connections are faulty, what is the probability that an assembled
circuit will have:
P(r ≤ 10) = Σ_{r=0}^{10} P(r)    (4.5)
A B C
1 =BINOMDIST(10,20,0.45,TRUE)
2
A B C
1 =BINOMDIST(10,20,0.45,FALSE)
2
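For readers who prefer to cross-check the spreadsheet results outside Excel®, the following short Python sketch evaluates equations (4.2)–(4.4) directly. It is an illustrative sketch only; the names binomial_pmf and binomial_cdf are not library or Excel® functions, they are simply chosen here for clarity.

from math import comb, fsum

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n trials (equation (4.4))."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def binomial_cdf(r, n, p):
    """Cumulative probability of r or fewer successes."""
    return fsum(binomial_pmf(k, n, p) for k in range(r + 1))

# Cross-check against the spreadsheet formulas above:
# =BINOMDIST(10,20,0.45,FALSE) -> probability of exactly 10 successes
# =BINOMDIST(10,20,0.45,TRUE)  -> probability of 10 or fewer successes
print(binomial_pmf(10, 20, 0.45))   # about 0.159
print(binomial_cdf(10, 20, 0.45))   # about 0.75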
Exercise B
Given the number of trials, n = 60, and the probability of success on a single trial, p = 0.3, use the
BINOMDIST() function in Excel® to determine:
μ = np = 10 × 0.7 = 7
One hundred samples each consisting of ten transistors are removed from
a large population and tested. The number of good transistors in each
sample is shown in table 4.1. If we calculate the mean of the values in table
4.1 we find r̄ = 6.75, which is close to the population mean³ of 7.
Another important parameter representative of a population is its
standard deviation. For an experiment consisting of n trials, where the
probability of success is p and the probability of failure is q, the population
standard deviation, σ, is given by⁴
³ We cannot really know the population mean except in simple situations such as
the population mean for the number of heads that would occur in ten tosses of a
coin. In this situation we would make the assumption that the probability of a head
is 0.5 so that in ten tosses we would expect five heads to occur.
⁴ See Meyer (1975) for a proof of equation (4.6).
Table 4.1. Number of good transistors in each of 100 samples, where each sample contains ten transistors.

6 10 6 4 9 6 7 5 6 8
7 4 5 5 6 6 5 9 8 8
8 6 10 4 7 6 5 6 6 8
8 8 6 7 6 5 6 7 5 7
6 7 7 7 8 6 8 7 8 8
6 7 6 8 5 7 4 7 7 4
7 5 8 6 5 8 5 6 9 5
7 9 8 9 8 7 9 7 8 8
5 7 8 8 7 6 8 8 2 6
7 6 8 7 5 8 7 6 9 9
σ = √(npq)    (4.6)

For the data in table 4.1, n = 10, p = 0.7 and q = 0.3. Using equation (4.6),
σ = 1.449.
The value for σ can be compared with the estimate of the population
standard deviation, s, calculated using equation (1.16). We find that
s = 1.5
Exercise C
2% of electronic components are known to be defective. If these components are
packed in containers each containing 400 components, determine:
more easily using the normal rather than the binomial distribution when a
cumulative probability is required and the number of trials, n, is large.
As an example of the use of the normal distribution as an approxima¬
tion to the binomial distribution, suppose we require the probability of
three successful outcomes from ten trials when p = 0.4. Using the normal
distribution we find P(3) = 0, as finite probabilities can only be determined
if an interval is defined. If we consider the number ‘3’ as a rounded value
from a continuous distribution with continuous random variable, x, then
we would take P(3) as P(2.5<x<3.5). Comparing probabilities calculated
using the binomial and normal distributions we find:
The terms on the right hand side of equation (4.7) may be most easily
determined using the NORMDIST() function described in section 3.5.1.
Specifically, entering =NORMDIST(3.5,4,1.549,TRUE) into a cell in
an Excel® spreadsheet returns the number 0.3734. Similarly, entering
=NORMDIST(2.5,4,1.549,TRUE) into a cell returns the number 0.1664. It
follows that P(2.5 < x < 3.5) = 0.3734 − 0.1664 = 0.2070.
Example 2
Given the number of trials, n= 100 and the probability of a success on a single trial,
p = 0.4, determine the probability of between 38 and 52 successes (inclusive) occur­
ring using:
ANSWER
The cumulative probabilities on the right hand side of equation (4.8) are most
conveniently found using the BINOMDIST() function in Excel®. Entering the
formula =BINOMDIST(52,100,0.40,TRUE) into a cell in an Excel® spreadsheet
returns the value 0.9942.
Entering =BINOMDIST(37,100,0.40,TRUE) into a cell returns the value
0.3068. It follows that

P(38 ≤ r ≤ 52) = P(r ≤ 52) − P(r ≤ 37) = 0.9942 − 0.3068 = 0.6874
(ii) Using the normal distribution to solve this problem requires that we calculate
the area under a normal distribution curve between 37.5 and 52.5. Using x to
represent the random variable, we write
P(37.5 < x < 52.5) = P(−∞ < x < 52.5) − P(−∞ < x < 37.5)

The population mean, μ, is given by μ = np = 100 × 0.4 = 40 and the population
standard deviation, σ, is given by σ = √(npq) = √(100 × 0.4 × 0.6) = 4.899.
P(37.5 < x < 52.5) may be found using Excel®’s NORMDIST() function.
Specifically, entering =NORMDIST(52.5,40,4.899,TRUE) into a spreadsheet cell
returns the number 0.9946. Similarly, entering =NORMDIST(37.5,40,4.899,TRUE)
into a cell returns the value 0.3049. It follows that

P(−∞ < x < 52.5) = 0.9946 and P(−∞ < x < 37.5) = 0.3049
so that

P(37.5 < x < 52.5) = 0.9946 − 0.3049 = 0.6897

which is in good agreement with the probability found using the binomial distribution.
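The comparison made in example 2 can also be reproduced numerically outside Excel®. The sketch below is illustrative only (the helper names are mine, not Excel® or library functions); normal_cdf plays the role of NORMDIST(...,TRUE).

from math import comb, erf, sqrt, fsum

def binomial_cdf(r, n, p):
    """P(number of successes <= r) from the binomial distribution."""
    return fsum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r + 1))

def normal_cdf(x, mu, sigma):
    """P(-inf < X <= x) for a normal distribution."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p = 100, 0.4
mu, sigma = n * p, sqrt(n * p * (1 - p))

# exact binomial probability of between 38 and 52 successes (inclusive)
exact = binomial_cdf(52, n, p) - binomial_cdf(37, n, p)

# normal approximation with the continuity correction used in the text
approx = normal_cdf(52.5, mu, sigma) - normal_cdf(37.5, mu, sigma)

print(round(exact, 4), round(approx, 4))   # roughly 0.6874 and 0.6897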
Exercise D
If the probability of a success p=0.3 and the number of trials n = 1000, use the bino¬
mial distribution and normal approximation to the binomial distribution to deter¬
mine the probability of:
P(r) = μ^r e^(−μ)/r!    (4.9)

where μ is the mean number of successes in the interval chosen. Equation
(4.9) may be regarded as a new discrete probability distribution, called the
Poisson distribution. The distribution is valid when n is large and p is small,
and μ does not vary from one interval to the next.
Figure 4.4 shows how the probability given by equation (4.9) depends
on the value of μ. For μ = 1 the probability distribution is clearly asymmet­ric, but as μ increases we again find that the characteristic shape becomes
more and more ‘normal-like’. In such situations the normal distribution can be used as an excellent approximation to the Poisson distribution, and this facilitates computation of probabilities which would be very
difficult or impossible to determine using equation (4.9).
Example 3
Given that μ = 0.8, calculate P(r) when:
(i) r = 0;
(ii) r = 1;
(iii) r ≥ 2.
ANSWER
(i) Using equation (4.9), P(0) = 0.8^0 e^(−0.8)/0! = 0.4493 (note that 0! = 1).
(ii) Similarly, P(1) = 0.8^1 e^(−0.8)/1! = 0.3595.
(iii) P(r ≥ 2) = P(2) + P(3) + P(4) + ⋯. A convenient way to calculate this sum is to
recognise that the sum of the probabilities for all possible outcomes = 1, i.e. P(0)
+ P(1) + P(2) + P(3) + P(4) + ⋯ = 1. It follows that

P(r ≥ 2) = 1 − P(0) − P(1) = 1 − 0.4493 − 0.3595 = 0.1912
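A short Python sketch of equation (4.9), offered only as an optional cross-check of example 3; the function name poisson_pmf is illustrative rather than part of any library.

from math import exp, factorial

def poisson_pmf(r, mu):
    """P(r) = mu**r * exp(-mu) / r!  (equation (4.9))."""
    return mu**r * exp(-mu) / factorial(r)

mu = 0.8
p0 = poisson_pmf(0, mu)          # about 0.4493
p1 = poisson_pmf(1, mu)          # about 0.3595
p_two_or_more = 1 - p0 - p1      # about 0.1912
print(p0, p1, p_two_or_more)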
Exercise E
Given that μ = 0.5, use equation (4.9) to determine:
(i) P(r=0);
(ii) P(r<3);
(iii) P(2<r<4).
Strictly, radioactive decay does not satisfy the requirement of the Poisson
distribution, as the probability that an event occurs is not constant but
decreases with time. The reason for this is that the decay rate depends on
how many undecayed nuclei remain at any particular instant. As that
σ ≈ √(np) = √μ    (4.10)
Example 4
Table 4.2 shows the number of cosmic rays detected in time intervals of 10 minutes
at the surface of the earth. Use the data to estimate:
The experiment is continued so that the number of counts occurring in 1200 succes¬
sive time intervals of 1 minute are recorded.
(v) In how many intervals would you expect the number of counts to exceed 2?
ANSWER
(i) The mean of the values in table 4.2 is 6.48. This is the best estimate of the popu­lation mean.
(ii) Using equation (4.10), the population standard deviation σ ≈ √6.48 = 2.55.
(iii) Using equation (4.9), P(0) = 6.48^0 e^(−6.48)/0! = 1.534 × 10⁻³.
(iv) P(3) + P(4) + P(5) + ⋯ = 1 − (1.534 × 10⁻³ + 9.939 × 10⁻³ + 3.220 × 10⁻²) = 0.9563
(v) The expected number of 1 minute intervals in which more than two counts
occur = NP(r > 2), where N is the total number of intervals:
Table 4.2. Number of cosmic rays detected in 50 time intervals, each of 10 minutes duration.

5 3 2 7 10 4 5 7 7 7
10 7 6 10 9 5 9 7 9 6
8 5 9 9 3 6 6 8 2 11
7 6 4 9 4 5 1 4 6 6
7 4 10 3 9 6 9 5 10 7
Exercise F
Small spots of contamination appear on the surface of a ceramic conductor when it
is exposed to a humid atmosphere. The contamination degrades the quality of
electrical contacts made to the surface of the ceramic conductor. Table 4.3 shows the
number of spots identified in 50 non-overlapping regions of area 100 μm² at the
surface of a particular sample exposed to high humidity. Use the data in table 4.3 to
determine the mean number of spots per 100 μm². A silver electrode of area 100 μm²
is deposited on the surface of the conductor. Assuming the positioning of the elec­trode is random, calculate the probability that the silver electrode will:
Table 4.3. Number of contamination spots identified in each of 50 non-overlapping regions of area 100 μm².

2 1 1 0 0 1 0 2 2 2
1 2 1 0 1 0 1 0 0 1
0 1 0 0 3 1 1 2 4 0
1 2 1 0 0 0 0 2 2 1
0 1 0 3 2 1 2 2 0 2
A B C
1 =POISSON(4,8,TRUE)
2
Sheet 4.4. Using the POISSON() function to calculate the probability of r
events occurring.
A B C
1 =POISSON(4,8,FALSE)
2
Exercise G
Table 4.4 shows the number of X-rays detected in 100 time intervals where each
interval was of 1 second duration. Use the data in the table to determine the mean
number of counts per second. Assuming the Poisson distribution to be applicable,
use Excel® to determine the probability of observing in a 1 second time interval:
1 1 1 1 1 0 0 6 1 2
0 4 1 1 2 3 1 0 3 3
4 1 1 0 0 2 4 1 3 5
6 0 1 1 4 6 0 0 0 1
1 1 2 2 1 1 0 2 3 1
1 4 0 2 0 0 3 3 2 4
0 2 2 1 1 2 2 0 0 1
1 1 3 2 2 0 0 2 0 1
1 1 2 2 1 0 1 4 0 1
1 1 0 1 0 2 2 0 0 3
Figure 4.4 indicates that when the mean, μ, equals 5, the shape of the
Poisson distribution is very similar to that of the normal distribution.
When μ equals or exceeds 5, the normal distribution is preferred to the
Poisson distribution when the calculation of probabilities is required. This
is due to the fact that summing a large series is tedious, unless computa­tional aids are available. Even with aids like Excel®, some summations
cannot be determined. When μ^r and r! are large, the result of a calcula­tion using equation (4.9) can exceed the numerical range of the computer,
causing an ‘overflow’ error to occur.
Example 5
In an X-ray counting experiment, the mean number of counts in a period of 1 minute
is 200. Use the Poisson and normal distributions to calculate the probability that in
any other 1 minute period the number of counts occurring would be exactly 200.
ANSWER
To find the probability using the Poisson distribution we use equation (4.9), with
μ = r = 200, i.e.

P(r) = μ^r e^(−μ)/r!

so that

P(200) = 200^200 e^(−200)/200!

This is where we must stop, as few calculators or computer programs can cope with
numbers as large as 200! or 200^200.
Using the normal distribution we use the approximation that

σ ≈ √μ = √200 = 14.142
Now

P(199.5 < x < 200.5) = P(−∞ < x < 200.5) − P(−∞ < x < 199.5)    (4.12)

The two terms on the right hand side of equation (4.12) may be determined using the
NORMDIST() function in Excel®. Entering =NORMDIST(200.5,200,14.142,TRUE) into
a cell returns the number 0.5141 and entering =NORMDIST(199.5,200,14.142,TRUE)
into another cell returns the number 0.4859. It follows that

P(199.5 < x < 200.5) = 0.5141 − 0.4859 = 0.0282
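The overflow problem described above can also be sidestepped by working with logarithms of the large factors. The sketch below is an illustrative Python fragment (not part of the text's Excel® workflow) which uses math.lgamma to evaluate equation (4.9) for μ = r = 200 without ever forming 200! or 200^200 explicitly.

from math import exp, lgamma, log

def poisson_pmf_log(r, mu):
    """Poisson probability evaluated via logarithms, so that mu**r and r!
    never appear as intermediate results."""
    log_p = r * log(mu) - mu - lgamma(r + 1)   # lgamma(r + 1) equals ln(r!)
    return exp(log_p)

# probability of exactly 200 counts when the mean count is 200
print(poisson_pmf_log(200, 200))   # about 0.028, matching the normal approximation above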
Exercise H
Using the information in example 5, determine the probability that in any 1 minute
time period the number of counts lies between the inclusive limits of 180 and 230.
4.4 Review

Problems

1. Determine:
C_{10,5}, C_{15,2}, C_{42,24}, C_{580,290}
(i) What is the probability that at least 20 electrodes will make good
contact to the scalp?
(ii) How many electrodes must be pressed against the scalp so that the
probability that at least 20 of them will make good contact is the same
as in part (i)?
6. A thin tape of superconductor is inspected for flaws that would affect its
capacity to carry a high electrical current. Sixty strips of tape, each of length
1 m, are inspected and the number of flaws in each strip is recorded. These
are shown in table 4.6.
Number of flaws per strip    Number of strips
0    18
1    19
2    12
3    9
4    1
5    1
Chapter 5
Measurement, error and uncertainty

5.1 Introduction
Physicists, chemists and other physical scientists are proud of the quanti¬
tative nature of their disciplines. By subjecting nature to ever closer scru¬
tiny, new relationships between quantities are discovered and established
relationships are pushed to the limits of their applicability. When
‘numbers’ emerge from an experiment, they can be subjected to quantita¬
tive analysis, compared to the ‘numbers’ obtained by other experimenters
and be expressed in a clear and concise manner using tables and graphs. If
an unfamiliar experiment is planned, an experimenter usually executes a
pilot experiment. The purpose of such an experiment might be to assess
the effectiveness of the experimental methods being used, or to offer a pre¬
liminary examination of a theoretical prediction. An experimenter then
typically moves to the next stage in which a more thorough experiment is
performed and where there is increased emphasis on the quality of the
data gathered. The analysis of these data often provides crucial and defen¬
sible evidence sought by the experimenter to support (or refute) a particu¬
lar theory or idea.
The goal of an experiment might be to determine the value of a single
quantity, for example the value of the charge on an electron. Experi¬
menters are aware that influences exist, some controllable and others less
so, that conspire to adversely affect the values they obtain. Despite an
experimenter’s best efforts, some uncertainty in an experimentally deter¬
mined value remains. In the case of the charge on the electron, its value is
recognised to be of such importance in science that considerable effort has
gone into establishing an accurate value for it. Currently the ‘best’ value for
Factors that might affect the accuracy of experimental data are best illus¬
trated by example. We consider temperature measurement, as many
experiments involve the study of the variation of a quantity with tempera¬
ture, such as the pressure of a gas, electrical conductivity of an alloy, or rate
of chemical reaction. Figure 5.1 shows the block diagram of a system used
(i) Heat transfer from the copper block to the temperature sensor as the
temperature of the block rises above that of the sensor.
(ii) A change of a physical attribute of the temperature sensor (such as its
electrical resistance) in response to the temperature change of the sensor.
The sensor then supplies an output signal, often in the form of a voltage.
(iii) Modification of the output signal of the sensor. This is often referred to
as ‘signal conditioning’. For example, the signal may be amplified, or
perhaps a filter used to attenuate ac electrical noise at 50 Hz or 60 Hz.
(iv) Measurement of the output of the signal conditioner using a voltmeter
or analogue to digital converter on a data acquisition card.
(v) Recording the voltage displayed by the voltmeter ‘by hand’ in a note¬
book, or automatically on a computer disc.
(vi) Transformation of voltage to temperature^ using a calibration curve or
‘look-up’ table.^
(vii) Recording values of temperature in a notebook or a computer.
Table 5.1. Factors that affect values of temperature obtained through measurement.

Factor: Thermal resistance between temperature sensor and copper block.
Explanation: No matter how intimate the contact between temperature sensor and copper block, there is some resistance to the flow of heat between the two. This affects the response time of the sensor so that, in dynamic experiments where temperatures are changing with time, the sensor temperature lags behind the temperature of the block.

Factor: Size and constitution of temperature sensor.
Explanation: A temperature sensor with a large heat capacity responds slowly to temperature changes. If the temperature of the copper block is higher than that of the sensor then when the sensor is brought into contact with the block it will lower the block's temperature.

Factor: Sampling time and voltmeter resolution.
Explanation: The time to complete the measurement of voltage (i.e. the sampling time) using a digital voltmeter can vary from typically 1 ms to 1 s. Any variation in the voltage that occurs over the period of the measurement cannot be known as the voltmeter presents an average value. (A typical hand held digital voltmeter makes two to three measurements per second.) The resolution of a digital voltmeter puts a limit on the change in voltage that can be detected. For 3½ digit voltmeters on the 2 V range, the resolution is 1 mV.
factors that affect the value(s) of temperature recorded along with a brief
explanation.^ All of the factors described in table 5.1 introduce experimen¬
tal error into values obtained through measurement. By error we are not
referring to a mistake such as recording incorrectly a value that appears on
the display of a meter, but a deviation from the ‘true value’ of a quantity
being measured. Such a deviation might be caused, for example, by the
limit of resolution of an instrument or the finite response time of a sensor.
Experimental errors prevent us from being able to quote an ‘exact value’ for
any quantity determined through experiment. By careful consideration of
all sources of error in an experiment we can quote an uncertainty in the
measured value. A detailed consideration of the sources of uncertainties in
an experiment might suggest ways by which some or all of those uncertain¬
ties can be minimised. While identifying and quantifying uncertainties is
important, it is equally important to use any insight gained to reduce the
uncertainties, if possible, through better experimental design.
In addition, quoting uncertainties allows for values to be compared.
For example, the density of a particular alloy obtained by two separate
experimenters might be 6255 kg/m³ and 6138.2 kg/m³. If we have no
knowledge of the uncertainty in each value, we cannot know if these
values are consistent with each other. By contrast, if we are told that the
values are (6255 ± 10) kg/m³ and (6138.2 ± 2.9) kg/m³ we are in a much
better position to argue that the values are not consistent and that further
investigation is required. When quoting results we are obliged to commu¬
nicate all relevant details including the uncertainty in the value as well as
the ‘best’ value.
It is often helpful to imagine that a quantity, such as the mass of a body, has
a ‘true’ value and it is the true value that we seek through measurement. If
measurement procedures and instruments are perfect and no outside
influences conspire to affect the value, then we should be able to deter¬
mine the true value of a quantity to arbitrary precision. Recognising that
neither experimental methods nor instruments are perfect and that
although outside influences can be minimised they can never be com¬
pletely eliminated, the best we can do is obtain a ‘best’ estimate of the true
³ Table 5.1 is not exhaustive and for a fuller discussion of the measurement of
temperature and its challenges see Tompkins and Webster (1988) and Nicholas and
White (1982).
t=45.1783128212°C
The reality is that, for example, offset drift in the measuring instrument,
electrical noise and resolution limits cause measured values to vary from
the true value by such an amount as to make the figures after the decimal
point in t above quite meaningless in most situations. We refer to the
difference between the measured value and the true value as the experi¬
mental error. Representing the true value of a quantity by the symbol μ, the
error in the ith measured value, δx_i, is given by

δx_i = x_i − μ    (5.1)

where x_i is the ith value. If the errors are random then we expect both pos­itive and negative δx_i, as x_i will take on values greater and less than μ. If we
could sum the errors due to many repeat measurements then the errors
would have a ‘cancelling effect’ such that

Σ_{i=1}^{n} δx_i ≈ 0    (5.2)

that is,

Σ(x_i − μ) ≈ 0    (5.3)

It follows that

Σx_i − nμ ≈ 0

or

μ ≈ (Σx_i)/n    (5.4)
as n tends to infinity. However, the terms ‘true value’ and ‘population
mean’ are not always interchangeable. As an example, ten repeat measure¬
ments of the breakdown voltage, Vg, of a single zener diode might be made
to estimate the true value of Vg for that diode. In this case we could replace
the term ‘true value’ by ‘population mean’ to describe the value being esti¬
mated. However, if the breakdown voltage of each of ten different zener
diodes is measured then we can estimate the mean of the breakdown volt¬
ages for the population from which the zener diodes were drawn, but the
term ‘true value’ has little meaning in this context.
When communicating a value obtained through measurement, we
should quote a confidence interval for the true value (or the population
mean) being sought. If we use μ to represent the true value, then we can
say that the confidence interval is

x̄ − u ≤ μ ≤ x̄ + u    (5.5)
In general, u in equation (5.6) may be replaced by t_{X%,ν} σ_x̄, where t_{X%,ν} is the
critical t value corresponding to the X% confidence level and ν is the
number of degrees of freedom. Equation (5.6) becomes

x̄ − t_{X%,ν} σ_x̄ ≤ μ ≤ x̄ + t_{X%,ν} σ_x̄    (5.8)
See section 3.9.
ν = n − 1, where n is the number of values – see section 3.9.
Table 5.2. Critical t values, t_{95%,ν} and t_{99%,ν}, for various degrees of freedom, ν (including ν = ∞); the values themselves are not reproduced here.
Table 5.2 contains critical t values for various degrees of freedom and con­
fidence levels (a more complete table appears in appendix 1).
To avoid confusion, it is important to indicate what confidence level
has been adopted when quoting an uncertainty. In problems and exercises
in this chapter we will assume (unless stated otherwise) that the 95% con­
fidence interval applies when expressing an uncertainty.
Example 1
Ten repeat measurements of the breakdown voltage of a zener diode are shown in
table 5.3. Calculate the 95% confidence interval for the true value of the breakdown
voltage.
ANSWER
Now σ_x̄ = s/√n (see section 3.8), where s is the estimate of the population standard
deviation, given by

s = [Σ(x_i − x̄)²/(n − 1)]^(1/2)

For the values in table 5.3, x̄ = 5.624 V and s = 0.078 34 V, so that
σ_x̄ = 0.078 34 V/√10 = 0.024 77 V. With ν = 9 and t_{95%,9} = 2.262, the 95% confidence
interval for the true value of the breakdown voltage is

x̄ ± t_{95%,9} σ_x̄ = (5.624 ± 0.056) V
In this example we have quoted the uncertainty to two significant figures. We adopt
this convention throughout this text.
Table 5.3. Ten values for the breakdown voltage, Vg, of a zener diode.
Vg(V) 5.62 5.49 5.55 5.61 5.72 5.54 5.63 5.67 5.70 5.71
Exercise A
Table 5.4 contains 20 values of light intensity, /, (in lux). Use these values to deter¬
mine:
I (lx) 348 328 359 380 378 389 310 349 376 332 340 320 317 334 339 312 343 346 357 347
See Young and Freedman (1996) for values of the acceleration due to gravity.
• Values are precise when the scatter in the values about the mean is
small but this does not imply that the values are close to the true value.
• Values are accurate when they are close to the true value.
• The mean of n values is accurate when the mean tends to the true
value as n becomes large but the deviation of individual values from
the mean may be large.
Random errors cause values to lie above and below the true value and, due
to the scatter they create, they are usually easy to recognise. So long as only
random errors exist, the mean of the values tends towards the true value as
the number of repeat measurements increases.
Another type of error is that which causes the measured values to be
consistently above or consistently below the true value. This is termed a
systematic error, also sometimes referred to as a bias error. An example of
this is the zero error of a spring balance. Suppose with no mass attached to
the spring, the balance reads 0.01 N. We can ‘correct’ for the zero error by:
See Simpson (1987) for a discussion of noise due to random thermal motion.
Chapter 4 deals with the statistics of radioactive decay.
Resolution error
Every instrument has a limit of resolution and any small changes in a quan¬
tity less than that limit go undetected. The uncertainty due to the limit of
resolution is routinely taken as one half of the smallest division that
appears on the scale of the instrument. So, for example, for a metre rule
with a smallest scale division of 1 mm, the uncertainty due to the limit of
resolution is taken as ± 0.5 mm. For a micrometer with a smallest division
of 0.01 mm, the uncertainty is taken as ±0.005 mm. Assuming that the
rounding of values that occurs when making a measurement is such that
no bias is introduced, i.e. sometimes a value is rounded up to the nearest
half division and other times it is rounded down, then we can regard the
resolution error as a random error. Similarly, the least significant digit
appearing on an instrument with a digital display is the limit of resolution
of the instrument. For example, if the voltage indicated by a digital volt­meter is 1.43 V, then the voltage lies between 1.425 V and 1.435 V. It is there­
fore reasonable to quote the uncertainty due to the resolution limit of the
instrument as 0.005 V, so that the voltage can be expressed as
(1.430 ± 0.005) V. The uncertainty due to the limit of resolution of an instru¬
ment represents the smallest uncertainty that may be quoted in a value
obtained through a single measurement. Other sources of uncertainty,
such as those due to offset or calibration errors or variability in the quan¬
tity being measured, very often exceed that due to the instrument resolu¬
tion and so they also need to be quantified.
Parallax error
If you move your head when viewing the position of a pointer on a stop
watch or the top of a column of alcohol in an alcohol-in-glass thermome¬
ter, the value (as read from the scale of the instrument) changes. The posi¬
tion of the viewer with respect to the scale introduces parallax error. In the
case of the alcohol-in-glass thermometer, the parallax error is a conse¬
quence of the displacement of the scale from the top of the alcohol
column, as shown in figure 5.2.
The parallax error may be random if the eye moves with the top of the
alcohol column such that the eye is positioned sometimes above or below
the ‘best viewing’ position shown by figure 5.2(c). However, it is possible
that the experimenter either consistently views the alcohol column with
respect to the scale from a position below the column (figure 5.2(a)) or
above the column (figure 5.2 (b)). In such situations the parallax error
would be systematic, i.e. values obtained would be consistently below or
above the true value.
Figure 5.2. Parallax error in reading an alcohol-in-glass thermometer: (a) eye too low, (b) eye too high, (c) best viewing position, with the eye level with the top of the alcohol column (which is displaced from the scale).
These examples indicate that it is not always obvious how a particular error
should be categorised, since the way an instrument is used has an impor¬
tant bearing on whether an error should be regarded as random or system¬
atic.
The uncertainty in a value may be expressed in the unit in which the value
is measured. Such an uncertainty is referred to as the absolute uncer­tainty. As an example, if the mass, m, of a body is written

m = (45.32 ± 0.15) g

then the absolute uncertainty in the mass is 0.15 g. The fractional uncertainty is defined as

fractional uncertainty = u/|x̄|    (5.9)

where x̄ is the mean of values. In the case of the mass of the body dis­cussed above, we have

fractional uncertainty = 0.15 g/45.32 g = 3.3 × 10⁻³
Note that the fractional uncertainty has no units, as any units appearing in
the numerator and denominator of equation (5.9) ‘cancel’.
The uncertainty in a value can be expressed as a percentage of the
best estimate of the value by multiplying the fractional uncertainty by
100%, so that

percentage uncertainty = (u/|x̄|) × 100%
Example 2
In a report on density measurements made of sea water, the density, p, is given as
(1.05 ± 0.02) g/cm³. Express the uncertainty as both fractional and percentage uncer­
tainties.
ANSWER
Using equation (5.9), the fractional uncertainty is given by

u/|x̄| = (0.02 g/cm³)/(1.05 g/cm³) ≈ 0.019

The percentage uncertainty is therefore 0.019 × 100% = 1.9%.
Exercise B
The temperature, t, of a furnace is given as t= 1450 °C ± 5%. Express the uncertainty
as both fractional and absolute uncertainties.
The modulus of the mean is used to avoid the fractional uncertainty assuming a
negative value.
δy ≈ (dy/dx) δx    (5.12)

Δy ≈ |dy/dx| Δx    (5.13)

σ_ȳ ≈ |dy/dx| σ_x̄    (5.14)
where σ_ȳ is the standard error in the mean of y and σ_x̄ is the standard error
in the mean of x.
Example 3
The radius of a metal sphere is (2.10 ±0.15) mm. Calculate the surface area of the
sphere and the uncertainty in the surface area.
ANSWER
The surface area of a sphere, A, of radius, r, is given by
A = 4πr²    (5.15)

Substituting r = 2.1 mm into equation (5.15) gives A = 55.418 mm². In order to find the
uncertainty in A we rewrite equation (5.13) in terms of A and r, i.e.

ΔA = |dA/dr| Δr    (5.16)

If A = 4πr², then dA/dr = 8πr = 52.779 mm. Using equation (5.16) gives
ΔA = 52.779 mm × 0.15 mm ≈ 7.9 mm².
Rounding the uncertainty to two significant figures gives A = (55.4 ± 7.9) mm².
Exercise C
1. An oscilloscope is used to view an electrical signal which varies sinusoidally with
time. Given that the period of the signal, T, is (0.552 ± 0.012) ms, calculate the fre­quency, f, of the signal and the uncertainty in the frequency, where f = 1/T.
2. Using the value for r and the uncertainty in r given in example 3, calculate V and the
uncertainty in V, where V = (4/3)πr³ is the volume of the sphere.
Many equations contain two or more variables. If the values of the vari¬
ables have some uncertainty, how does the uncertainty in each combine to
give an uncertainty in the final ‘result’? To answer this, let y depend on x
and z. Write the change in y, δy, as¹³

δy ≈ (∂y/∂x)δx + (∂y/∂z)δz    (5.17)

If the uncertainties in x and z are u_x and u_z, the corresponding uncertainty in y may be taken as

u_y ≈ |∂y/∂x| u_x + |∂y/∂z| u_z    (5.18)
Example 4
The velocity, v, of a wave travelling along a stretched string is given by
v = √(T/μ)    (5.20)

where T is the tension in the string and μ is the mass per unit length of the string. If
T = (3.2 ± 0.2) N and μ = (1.24 ± 0.05) × 10⁻³ kg/m, calculate v and the uncertainty in v.
ANSWER
Using equation (5.20),

v = √[3.2 N/(1.24 × 10⁻³ kg/m)] = 50.8 m/s
¹³ For convenience we consider only two variables, x and z. However, this approach
can be extended to any number of variables.
The uncertainty in v, u_v, is given by

u_v ≈ |∂v/∂T| u_T + |∂v/∂μ| u_μ

where u_T is the uncertainty in the tension and u_μ is the uncertainty in the mass per
unit length. With reference to equation (5.20),

∂v/∂T = (1/2)[1/(μT)]^(1/2) = (1/2)[1/(1.24 × 10⁻³ kg/m × 3.2 N)]^(1/2) = 7.938 s/kg

∂v/∂μ = −(1/(2μ))(T/μ)^(1/2) = −[1/(2 × 1.24 × 10⁻³ kg/m)] × 50.8 m/s = −2.048 × 10⁴ m²/(kg·s)

so that

u_v ≈ 7.938 s/kg × 0.2 N + 2.048 × 10⁴ m²/(kg·s) × 0.05 × 10⁻³ kg/m = 2.61 m/s
Exercise D
The pressure difference, p, between two points in a flowing fluid is related to the
density of the fluid, ρ, and the speed of the fluid, v, by the equation

p = ½ρv²    (5.22)

Given that ρ = (0.986 ± 0.013) × 10³ kg/m³ and v = (1.840 ± 0.023) m/s, calculate p and
the uncertainty in p.
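For functions of two (or more) variables, the same numerical approach extends to the sum-of-absolute-terms ('maximum') combination used in example 4. The sketch below is again illustrative rather than authoritative (the helper names partial and max_uncertainty are mine), and reproduces the values quoted in that example.

from math import sqrt

def partial(f, args, i, h=1e-8):
    """Central-difference estimate of the partial derivative of f
    with respect to its i-th argument."""
    lo, hi = list(args), list(args)
    lo[i] -= h
    hi[i] += h
    return (f(*hi) - f(*lo)) / (2 * h)

def max_uncertainty(f, values, uncerts):
    """Sum of |partial derivative| x uncertainty over all variables."""
    return sum(abs(partial(f, values, i)) * u for i, u in enumerate(uncerts))

# Example 4: v = sqrt(T/mu), T = (3.2 +/- 0.2) N, mu = (1.24 +/- 0.05)e-3 kg/m
v = lambda T, mu: sqrt(T / mu)
values, uncerts = (3.2, 1.24e-3), (0.2, 0.05e-3)
print(v(*values), max_uncertainty(v, values, uncerts))   # about 50.8 m/s and 2.6 m/s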
y = x + z    (5.23)

δy = δx + δz    (5.24)

u_y = u_x + u_z    (5.25)

When the errors in x and z are independent, the most probable uncertainty in y is given by

u_y = [(∂y/∂x)² u_x² + (∂y/∂z)² u_z²]^(1/2)    (5.26)
or below (B) the mean of values in that column. Examining the rows seems
to indicate no correlation between the deviation of the values in columns
1 and 2. For example, a value of voltage above the mean in the first column
does not coincide consistently with a value of current above the mean in
the second column. Turning now to the standard errors of the values in
each column, we find^^
The value of σ_P̄ of 0.103 μW found using table 5.7 may be compared with
those determined using equations (5.19) (maximum uncertainty) and
(5.26) (most probable uncertainty).
As P = VI, it follows that

∂P/∂V = I and ∂P/∂I = V

Using equation (5.19), the maximum uncertainty in P is

σ_P̄ ≈ |∂P/∂V| σ_V̄ + |∂P/∂I| σ_Ī = 0.251 mA × 0.216 mV + 14.78 mV × 0.00659 mA = 0.152 μW

Using equation (5.26), the most probable uncertainty in P is

σ_P̄ = [(∂P/∂V)² σ_V̄² + (∂P/∂I)² σ_Ī²]^(1/2)
    = [(0.251 mA × 0.216 mV)² + (14.78 mV × 0.00659 mA)²]^(1/2)
    = 0.111 μW
Exercise E
1. Light incident at an angle, i, on a plane glass surface is refracted and enters the
glass. The refractive index of the glass, n, can be calculated using

n = sin i/sin r

where r is the angle of refraction. Assuming errors in i and r are independent, calcu­late n and the uncertainty in n, given that i = (52 ± 1)° and r = (32 ± 1)°. Note, for
y = sin x, the approximation δy/δx ≈ dy/dx is valid for x expressed in radians.
2. In an optics experiment, the distance, u, from an object to a convex lens is
(125.5±1.5) mm. The distance, v, from the lens to the image is (628.0± 1.5) mm.
Assuming that the errors in u and v are independent:
(i) Calculate the focal length, f, of the convex lens and the uncertainty in f, given
that f is related to u and v by the equation

1/f = 1/u + 1/v

(ii) Calculate the linear magnification, m, of the lens and the uncertainty in m given
that

m = v/u
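The two ways of combining the standard errors for P = VI discussed above can be checked with a few lines of Python. The numerical values below are those quoted in the text; the variable names and the comparison itself are my own illustration, not part of the book's Excel® treatment.

from math import sqrt

# mean values and standard errors quoted in the text for P = V*I
V, I = 14.78e-3, 0.251e-3          # volts, amps
sV, sI = 0.216e-3, 0.00659e-3      # standard errors in the means

# dP/dV = I and dP/dI = V
max_combination = abs(I) * sV + abs(V) * sI           # 'maximum' uncertainty
quad_combination = sqrt((I * sV)**2 + (V * sI)**2)    # 'most probable' uncertainty

print(max_combination * 1e6, quad_combination * 1e6)  # about 0.152 and 0.111 (microwatts)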
5.8.1 Outliers
Table 5.8. Mass loss from the ceramic YBa₂Cu₃O₇ heated in an oxygen free
environment.
Mass loss, X (mg) 5.6 5.4 5.7 5.5 5.6 6.6 5.8 5.7 5.5
Figure 5.3. Standard normal probability distribution. The sum of the areas in the
tails is equal to the probability that a value is more than 2.5 standard deviations from
the mean.
z = (x_OUT − x̄)/s    (5.27)

where x_OUT is the value of the outlier. Inserting values from this example
(x̄ = 5.711 mg, s = 0.3551 mg, x_OUT = 6.6 mg) gives

z = (6.6 mg − 5.711 mg)/0.3551 mg ≈ 2.5
To find the probability that a value differs from the mean by a least 2.5
standard deviations, we assume that the values are drawn from a
normal distribution. Figure 5.3 shows the standard normal distribution.
P(-oo<z<-2.5) obtained using table 1 in appendix 1 is equal to 0.00621.
As the tails of the distribution are symmetric, we double this value to find
the probability represented by the total shaded area in figure 5.3, i.e. the
probability that a value lies at least as far from the mean as the outlier is
2 × 0.00621 = 0.01242. Put another way, if we obtain, for example, 100
values, we would expect 100 × 0.01242 ≈ 1 to be at least as far from the
mean as the outlier. As only 9 measurements have been made, the number
of values we would expect to be at least as far from the mean as the outlier
is 9 × 0.01242 ≈ 0.1.
(i) Calculate the mean, x, and the standard deviation, s, of the values
being considered.
(ii) Identify a possible outlying value, x_OUT. Calculate z, where
z = (x_OUT − x̄)/s.
(iii) Use the normal probability tables to determine the probability, P, that
a value will occur at least as far from the mean as x_OUT.
(iv) Calculate the number, N, expected to be at least as far from the mean
as x_OUT: N = nP, where n is the number of values and P is the probabil­ity that a value is at least as far from the mean as x_OUT.
(v) If N < 0.5, reject the outlier.
(vi) Recalculate x and s but do not reapply the criterion to the remaining
data.
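A minimal Python sketch of the steps just listed, assuming normally distributed values. The name chauvenet_reject is illustrative, and the normal-tail probability is obtained from math.erf rather than from the printed tables.

from math import erf, sqrt
from statistics import mean, stdev

def chauvenet_reject(values):
    """Apply Chauvenet's criterion once: return the suspect value if it
    should be rejected, otherwise None."""
    xbar, s = mean(values), stdev(values)
    suspect = max(values, key=lambda x: abs(x - xbar))   # value furthest from the mean
    z = abs(suspect - xbar) / s
    p = 1 - erf(z / sqrt(2))     # probability of lying at least z standard deviations away (both tails)
    expected = len(values) * p   # N = nP
    return suspect if expected < 0.5 else None

# timing data of table 5.9
times = [1.11, 0.90, 1.13, 1.13, 1.16, 1.23, 0.94, 1.08, 1.12, 1.04]
print(chauvenet_reject(times))   # None: the value 0.90 s is retained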
Example 5
Table 5.9 shows values for the ‘hand timing’ of the free fall of an object through a ver¬
tical distance of 5 m. Use Chauvenet’s criterion to decide if any value should be
rejected.
ANSWER
The mean, x, and standard deviation, s, of the values in table 5.9 are 1.084 s and
0.09991 s respectively. The value furthest from the mean is 0.90 s, i.e. x_OUT = 0.90 s.
Calculating z using equation (5.27) gives

z = (x_OUT − x̄)/s = (0.90 s − 1.084 s)/0.099 91 s = −1.84

From the normal probability tables, the probability that a value lies at least 1.84
standard deviations from the mean is 2 × 0.0329 = 0.0658, so that N = nP = 10 × 0.0658 ≈ 0.66.
As N > 0.5, the value 0.90 s is retained.
Table 5.9. Values of time for an object to fall freely through a distance of 5 m.
Time, x(s) 1.11 0.90 1.13 1.13 1.16 1.23 0.94 1.08 1.12 1.04
Exercise F
In a fluid flow experiment, the time, t, for a fixed volume of water to flow from a
vessel is measured. Table 5.10 shows five values for t, obtained through repeat meas¬
urements.
Every value differs from the true value by an amount equal to the experi¬
mental error. It is possible for the random error to be so small that it is less
than the resolution of the instrument. In this case every measurement will
yield the same value. Take as an example the values of voltage measured
across a 1 kΩ resistor using a 3½ digit digital multimeter as shown in table 5.11.
The mean of these values is 3.2 mV and the standard deviation is zero. As
the standard deviation is zero, it follows that the standard error in the mean
is also zero, so we write (for any level of confidence)
V=(3.2±0) mV
Voltage, V (mV) 3.2 3.2 3.2 3.2 3.2 3.2 3.2 3.2
(i) Random errors, though small, still exist but are masked due to the res¬
olution limit of the instrument.
(ii) Random errors are normally distributed.
N= nP (5.28)
z = (x_MIN − x̄)/s    (5.29)

where x_MIN is the smallest value of x that will cause the value indicated by
the instrument to remain unchanged and s is the estimate of the population standard deviation.
σ_x̄ = s/√n ≤ 0.0269 mV/√8 = 0.0095 mV
Summarising the steps to determining upper limits for the standard devi¬
ation and standard error when all values are the same, we have:
Exercise G
On measuring the diameter, d, of a copper wire with a micrometer with a smallest
scale division of 0.01 mm, it is found that d = 1.22 mm. The measurement is repeated
a further four times at other positions along the wire. No change in d is observed.
Assuming that measurements are affected only by random errors:
(i) Determine an upper limit for the standard deviation of the values and the stan¬
dard error of the mean.
(ii) Calculate the 95% confidence interval for the true value of the diameter.
Systematic errors, like random errors, cause measured values to differ from
the true value. In contrast to random errors, which cause measured values
to be scattered both above and below the true value, systematic errors
introduce a bias so that measured values are consistently above or consis¬
tently below the true value. Random errors ‘betray’ themselves by appar¬
ent variability in measured values but systematic errors are characterised
by a lack of variability and this makes them very difficult to detect (or, put
another way, very easy to overlook). Experienced experimenters are always
on the ‘look out’ for sources of systematic error and devote time to iden¬
tifying and where possible compensating for such errors. It is not uncom¬
mon for systematic errors to be several times larger than random errors
and this is why we should consider carefully causes of systematic error and
where possible include their effect in an expression of the uncertainty in
the best estimate of the true value.
Example 6
A 3½ digit multimeter is used to measure a steady voltage across a resistor. The meas­
urement is made on the 2 V range at 22 °C and relative humidity of 50%. The value
appearing on the display of the meter is 1.898 V. Use the information in table 5.12 to
determine the uncertainty in the voltage.
ANSWER
Referring to table 5.12, the accuracy of the meter is ±(0.5% of reading + one digit).
0.5% of 1.898 V is

(1.898 V/100%) × 0.5% = 0.0095 V
‘One digit’ on the 2 V range of the meter corresponds to 0.001 V. So that the total
systematic uncertainty is 0.0105 V which we would normally round to 0.011 V and we
can write V= (1.898 ± 0.011)V.
is not pursued.
¹⁸ The digits referred to are the least significant digits that appear on the display.
Table 5.12. Resolution and accuracy specification for the dc voltage and
capacitance ranges of a 3‘A digit multimeter.
Exercise H
A 3½ digit meter is used to measure the capacitance of a capacitor. The room tem­
perature is 25 °C and the relative humidity is 60%. The display on the meter indicates
a capacitance of 156.4 nF. Use the information in table 5.12 to determine the uncer¬
tainty in the capacitance due to systematic error in the instrument.
Offset errors are common systematic errors and, in some cases, may be
reduced. If the error remains fixed over time, then the measured value can
be corrected by an amount so as to compensate for the offset error. As an
G = 1 + 40 000 Ω/R_G    (5.31)
Example 7
An instrumentation amplifier is used to amplify a small voltage generated by a
photodiode. The gain of the amplifier is given by equation (5.31). The relationship
between the gain setting resistor, R_G(t), and temperature of the resistor, t, is

R_G(t) = R_G(0)(1 + αt)

where R_G(0) is the resistance at 0 °C, t is the temperature in °C and α is the tempera­ture coefficient of resistance in °C⁻¹. If a wire wound resistor made from nichrome is
used as a gain setting resistor with R_G(0) = 318.3 Ω and α = 4.1 × 10⁻⁴ °C⁻¹, determine:
ANSWER
(ii) Using equation (5.31) to find the gain at 0 °C, 20 °C and 40 °C:

G(0 °C) = 1 + 40 000 Ω/318.3 Ω = 126.7

G(20 °C) = 1 + 40 000 Ω/320.9 Ω = 125.6

G(40 °C) = 1 + 40 000 Ω/323.5 Ω = 124.6
Exercise I
During an experiment, the temperature of the room in which the experiment is
carried out is nominally 23 °C. Due to the effect of an air-conditioning system, the
room temperature varies between 21 °C and 25 °C. The gain of an amplifier used in
the experiment is given by equation (5.31).
(i) Find the gain of the amplifier and estimate the uncertainty due to systematic
error in the gain of the amplifier caused by the temperature dependence of the
gain setting resistor.
(ii) What assumptions did you make in order to carry out part (i)?
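Assuming the linear resistance–temperature relation used in example 7 (a reconstruction consistent with the values quoted there), the gain of the amplifier can be tabulated against temperature with a few lines of Python. The function gain below is purely illustrative.

def gain(t, rg0=318.3, alpha=4.1e-4):
    """Amplifier gain at temperature t (deg C), with R_G(t) = R_G(0)*(1 + alpha*t)
    and G = 1 + 40000/R_G as in equation (5.31)."""
    rg = rg0 * (1 + alpha * t)     # gain-setting resistance in ohms at temperature t
    return 1 + 40000.0 / rg

for t in (0, 20, 40):
    print(t, round(gain(t), 1))    # roughly 126.7, 125.6 and 124.6, as in example 7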
Figure 5.5. Voltmeter used to measure the voltage across a 5.6 MΩ resistor in a circuit supplied by a 9 V source.
²² See Khazan (1994) for a discussion of transducers used to measure fluid flow.
Figure 5.6. Circuit with resistor and voltmeter replaced by an equivalent resistor, R_P.

Since the same current flows through both resistances,

5.19 V/5.2 MΩ = 3.81 V/R_P

It follows that R_P ≈ 3.82 MΩ.
The effective resistance, R_P, for two resistors, R and R_IN, connected in par­allel is given by

1/R_P = 1/R + 1/R_IN    (5.33)

If R_P = 3.82 MΩ and R = 5.6 MΩ, then using equation (5.33) we obtain
R_IN = 12.0 MΩ.
We can see that when R_IN is comparable in size to R the loading effect
is considerable. To reduce the loading effect, R_IN should be much greater
than R.
Exercise J
An experimenter replaces the voltmeter in figure 5.5 by another with a higher inter­nal resistance, R_IN. What must be the minimum value for R_IN if the experimenter
requires the voltage indicated by the meter to differ by no more than 1% from the
voltage that would appear across the 5.6 MΩ resistor in the absence of the meter? (In this
problem consider only the effects on the measured values due to loading.)
Quantifying the loading error can be difficult and in many cases we rely on
experience to suggest whether a loading effect can be regarded as negligible.
In situations in which calibration, offset, gain and loading errors have been
minimised or accounted for, there is still another source of error that can
cause measured values to be above or below the true value and that error
is due to the temporal response of the measuring system. Static, or at least
slowly varying, quantities are easy to ‘track’ with a measuring system.
However, in situations in which a quantity varies rapidly with time, the
response of the measuring system influences the values obtained.
Elements within a measuring system which may be time dependent
include:
Figure 5.7. Input to, x(t), and output from, y(t), a zero order measuring system.
Figure 5.8. Response of a first order measuring system to a step change at the input
of the system.
τ(dθ/dt) + θ = θ_f    (5.35)
where τ is referred to as the time constant of the system. Equation (5.35) is
an example of an equation which describes the response of a first order
system. A detailed analysis shows that if we can approximate the bulb of
the thermometer to a sphere filled with liquid, then
τ = ρrc/(3h)    (5.36)
where ρ is the density of the liquid in the thermometer, c is the specific heat
of the liquid in the thermometer, h is the heat transfer coefficient for heat
passing across the wall of the bulb of the thermometer and r is the radius
of the thermometer bulb.
Solving equation (5.35) gives
θ = θ_f(1 − e^(−t/τ))    (5.37)
See Doebelin (1995) for more detail on first and higher order instruments.
Example 8
A particular thermometer has a time constant of 5 s. If the thermometer is immersed
in hot water, how much time must elapse before the value indicated by the thermom¬
eter is within 5% of the final value?
ANSWER
θ/θ_f = (1 − e^(−t/τ))    (5.38)

If the temperature indicated by the thermometer is within 5% of the final value, then
θ/θ_f = 0.95 = 1 − e^(−t/τ). It follows that

e^(−t/τ) = 0.05

therefore

t = −τ ln(0.05) ≈ 3τ = 15 s
For a step change in temperature, the thermometer will approach within 5% of the
final value after the time elapsed from the step change is greater than or equal to
about 3τ. This can be generalised to any first order measuring system, i.e. after a step
change at the input of the system, a time interval of at least 3τ must elapse before the
output of the system is within 5% of the final value.
Exercise K
An experimenter has determined the time constant of a first order measuring
system to be 2.5 s. If there is a step change at the input of this system at t=0, how
long must the experimenter wait before the value indicated by the system is within
0.1% of the final value (assume random errors are negligible).
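The result of example 8 generalises to any fraction of the final value by simply inverting equation (5.38). The short sketch below is illustrative (settling_time is my own name, not a standard function).

from math import log

def settling_time(tau, fraction_of_final):
    """Time for a first order system to reach a given fraction of its final
    value after a step change: solve fraction = 1 - exp(-t/tau) for t."""
    return -tau * log(1 - fraction_of_final)

print(settling_time(5.0, 0.95))   # about 15 s, as in example 8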
where x̄ is the measured value. It is possible that u_s has been chosen such
that the probability of the true value lying between x̄ ± u_s is 1. This will
cause difficulties later when we wish to sum uncertainties due to random
and systematic errors so we will be slightly conservative and assume that
the probability of the true value lying between x̄ ± u_s is 0.95.
As with random errors, systematic errors may or may not be indepen¬
dent. As an example suppose a micrometer is used to measure the dimen¬
sions of a rectangular block of metal. If the micrometer has an
(uncorrected) offset error, then that error will affect the measurement of
each dimension of the metal block in the same way, i.e. errors in each value
obtained through measurement are dependent or correlated.
u_s(y) ≈ |∂y/∂x| u_s(x) + |∂y/∂z| u_s(z)    (5.40)

As an example, consider the resistivity, ρ, of a wire, given by

ρ = RA/l    (5.41)
where R is the electrical resistance measured between the ends of the wire,
l is the length of the wire and A is the cross-sectional area of the wire. As the
instruments used to measure R, l and A differ, there is unlikely to be any
dependence between the systematic errors in these quantities. It follows
that applying equation (5.40) is likely to overestimate the uncertainty in ρ,
and adding uncertainties in the manner we considered when dealing with
independent random errors in section 5.7.4 is more appropriate.
In general, if a measured quantity, y, depends on x and z, then we can
write the systematic uncertainty in y, u_s(y), when uncertainties in x and z are
independent as

u_s(y) = [(∂y/∂x)² u_s(x)² + (∂y/∂z)² u_s(z)²]^(1/2)    (5.42)

When the uncertainty due to random errors, u_r, and the uncertainty due to
systematic errors, u_s, are combined, the total uncertainty, u, is taken as

u = √(u_r² + u_s²)    (5.43)
Equation (5.43) can be applied on the condition that the confidence inter­vals associated with u_r and u_s are the same. As stated in section 5.10, u_s may
be such that, in the absence of random error, the true value definitely lies
between ±u_s of the mean. That is, u_s really defines the 100% confidence
interval. By committing the ‘sin’ of taking x̄ ± u_s to be the 95% confidence
interval, we are slightly overestimating the uncertainty due to systematic
errors. If we were to insist on using the 100% confidence interval for the
systematic errors, then we would have to do the same for the random
errors. The difficulty with this is that if we use the normal distribution to
describe the random errors, then the 100% confidence interval for the true
value of a quantity lies between ±∞. To say that the true value of a quantity
lies between ±∞ is not very discriminating.
Example 9
A 3½ digit voltmeter is used to measure the output voltage of a transistor. The mean
of the ten values of voltage is found to be 5.34 V with a standard deviation of 0.11V.
Given that the voltmeter is operating on the 20 V range and that the accuracy of the
voltmeter is 0.5% of the reading +1 figure, determine:
ANSWER
(i) The standard error in the mean, σ_x̄, is found using the equation σ_x̄ = σ/√n.
In this example, σ = 0.11 V and n = 10, so σ_x̄ = 0.11 V/√10 = 0.034 79 V.
(ii) The 95% confidence interval for the true value of the mean, in the absence of
systematic error, is given by equation (5.8) as x̄ ± t_{X%,ν} σ_x̄, where t_{X%,ν} is the t value
for ν degrees of freedom and X% is the confidence level. Here X% = 95% and ν = 9.
From table 2 in appendix 1, t_{95%,9} = 2.262. It follows that the 95% confidence
interval for the true voltage (in the absence of systematic error) is
(5.340 ± 0.079) V.
(iii) The systematic uncertainty found from the instrument specification is 0.5% of
the reading + one figure. 0.5% of 5.34 V is 0.027 V and the least significant figure
for a 3½ digit voltmeter on the 20 V range is 0.01 V (see table 5.12). The uncer­tainty due to systematic error, u_s, is therefore u_s = 0.027 V + 0.01 V = 0.037 V.
It follows that the 95% confidence interval for the true voltage (in the absence of
random error) is
(5.340 ± 0.037) V
Exercise L
Wooden metre rules can be used to measure lengths to a precision of better than
1 mm. However, prolonged use exposes the rules to a variety of atmospheric condi¬
tions (such as variations in temperature and humidity) which can cause the wood
to shrink or expand. Such effects introduce a systematic error into length measure¬
ment, so that the uncertainty in a length measured as 1 m using the wooden rule can
be as much as 1 mm.
In an optics experiment, a wooden metre rule was used to measure the position
of an image produced by a lens. Eight repeat measurements of the image position
were obtained and are shown in table 5.13. Use the information in the table to find:
Image distance (mm) 855.5 851.5 855.0 852.5 855.5 851.0 853.0 856.0
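A short sketch of how random and systematic contributions combine through equation (5.43), using the numbers of example 9 so as not to anticipate the exercise. The critical t value is entered by hand (2.262 for 9 degrees of freedom at the 95% level) and all variable names are illustrative.

from math import sqrt

# example 9 values: ten readings, s = 0.11 V, u_s = 0.037 V, t(95%, 9 dof) = 2.262
s, n, t_crit, u_s = 0.11, 10, 2.262, 0.037

u_r = t_crit * s / sqrt(n)          # uncertainty due to random errors (95% level)
u = sqrt(u_r**2 + u_s**2)           # combined uncertainty, equation (5.43)
print(round(u_r, 3), round(u, 3))   # about 0.079 V and 0.087 V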
x̄_w = [Σ(x_i/σ_i²)]/[Σ(1/σ_i²)]    (5.44)

where x_i is the ith value and σ_i is the standard deviation of the population
of which the value x_i is a member. Though equation (5.44) is useful, it is
usual to combine several means, where each mean has a different standard
error. Equation (5.44) is rewritten

x̄_w = [Σ(x̄_i/σ_i²)]/[Σ(1/σ_i²)]    (5.45)

where x̄_i is the ith mean and σ_i is the standard error of the ith mean.
Example 10
The thickness of a platinum film is measured using a profilometer (PF), a scanning
electron microscope (SEM) and an atomic force microscope (AFM). Table 5.14 shows
the mean of the values obtained by each method and standard error in the mean.
Calculate the weighted mean thickness of the film.
ANSWER
Table 5.14. Thickness of thin film of platinum determined using three techniques.
Technique    Mean thickness (nm)    Standard error in mean (nm)
PF           325.0                  7.5
SEM          330.0                  5.5
AFM          329.0                  4.0
Exercise M
The mean time of a ball to fall a fixed distance under gravity is found by three exper¬
imenters to be
σ_w = [Σ(1/σ_i²)]^(−1/2)    (5.46)

where σ_i is the standard deviation in the ith value. If means are combined,
where each mean x̄_i has standard error σ_i, then the standard error in the
weighted mean is

σ_w = [Σ(1/σ_i²)]^(−1/2)    (5.47)
Example 11
Calculate the standard error of the weighted mean for the mean thicknesses appear¬
ing in table 5.14.
ANSWER
Using equation (5.47),

σ_w = [1/(7.5 nm)² + 1/(5.5 nm)² + 1/(4.0 nm)²]^(−1/2) = [8.823 (nm)²]^(1/2) = 3.0 nm
Exercise N
Calculate the standard error of the weighted mean for the time of fall of a ball using
information given in exercise M.
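Equations (5.45) and (5.47) are easily evaluated in a few lines of Python. The sketch below, with illustrative function names of my own choosing, reproduces the weighted mean of the film thicknesses in table 5.14 and the standard error calculated in example 11.

from math import sqrt

def weighted_mean(means, std_errors):
    """Weighted mean (equation (5.45)) and its standard error (equation (5.47))."""
    weights = [1 / se**2 for se in std_errors]
    wmean = sum(w * x for w, x in zip(weights, means)) / sum(weights)
    werr = 1 / sqrt(sum(weights))
    return wmean, werr

# film thicknesses from table 5.14 (PF, SEM, AFM), in nm
thickness, se = [325.0, 330.0, 329.0], [7.5, 5.5, 4.0]
print(weighted_mean(thickness, se))   # roughly (328.7, 3.0)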
5.13 Review
Problems
2. Four repeat measurements are made of the height, h, to which a steel ball
rebounds after striking a flat surface. The values obtained are shown in
table 5.15. Use these values to determine:
4. The relationship between the critical angle, θ_c, and the refractive index,
n, for light travelling from glass into air is

n = 1/sin θ_c    (5.48)
r= (5.49)
d^ = KkVp (5.50)
c = (h_r/h_i)^(1/2)    (5.52)

where h_i is the initial height of the ball and h_r is its rebound height. Given
that
(5.53)
I (cm) 49.2 48.6 47.8 48.5 42.7 47.7 49.0 48.8 48.3 47.7
11. A 3½ digit voltmeter set on its 200 mV range is used to measure the
output of a pressure transducer. Values of voltage obtained are shown in
table 5.18.
(i) Assuming that random errors dominate, use the data in table 5.18 to
determine the 95% confidence interval for the true voltage.
(ii) If the resolution and accuracy of the meter are given by table 5.12,
determine the 95% confidence interval for the true voltage assuming
that systematic errors dominate.
(iii) Combine the confidence intervals in parts (i) and (ii) of this question
to give a 95% confidence interval which accounts for uncertainty due
to both random and systematic errors.
12. In the process of calibrating a 1 mL bulb pipette, the mass of water dis¬
pensed by the pipette was measured using an electronic balance. The process
was repeated with the same pipette until ten values were obtained. These
values are shown in table 5.19. Using the values in table 5.19 determine:
Mass (g) 0.9737 0.9752 0.9825 0.9569 0.9516 0.9617 0.9684 0.9585 0.9558 0.9718
(i) Calculate the sample mean and the estimate of the population stan¬
dard deviation.
(ii) Identify a possible outlier and apply Chauvenet’s criterion to deter¬
mine whether the outlier should be removed.
(iii) If the outlier is removed, calculate the new mean and standard devia¬
tion.
Chapter 6
Least squares I

6.1 Introduction
of engine oil between room temperature and 100 °C, we observe that the
viscosity of the oil decreases with increasing temperature, but we would
like to know more:
y_i = a + bx_i

¹ It is quite common to find the equation of a straight line written in other ways,
such as y_i = mx_i + c, y_i = mx_i + b, or y_i = a₀ + b₁x_i.
where a is the intercept and b is the slope of the line. If all the x–y data gath­ered in an experiment were to lie along a straight line, there would be no
difficulty in determining a and b and our discussion would end here. We
would simply use a rule to draw a line through the points. Where the line
intersects the y axis at x = 0 gives a. The slope, b, is found by dividing Δy by
Δx, as indicated in figure 6.1.
In situations in which ‘real’ data are considered, even if the underly¬
ing relationship between x and y is linear, it is highly unlikely that all the
points will lie on a straight line, since sources of error act to scatter the data.
So how do we find the best line through the points?
When dealing with experimental data, we commonly plot the quantity that
we are able to control on the x axis. This quantity is referred to as the inde¬
pendent (or the ‘predictor’) variable. A quantity that changes in response to
changes in the independent variable is referred to as the dependent (or the
‘response’) variable, and is plotted on the y axis.
As an example, consider an experiment in which the velocity of
sound in air is measured at different temperatures. Here temperature is the
independent variable and velocity is the dependent variable. Table 6.1
shows temperature-velocity data for sound travelling through dry air. The
data in table 6.1 are plotted in figure 6.2. Error bars are attached to each
point to indicate the uncertainty in the values of velocity. From an inspec¬
tion of figure 6.2, it appears reasonable to propose that there is a linear
Table 6.1. Velocity of sound in dry air at various temperatures.

θ (°C)    v (m/s)
−13       322
0         335
9         337
20        346
33        352
50        365
Figure 6.2. x–y graph showing velocity of sound versus temperature.
v = A + Bθ    (6.1)
Figure 6.3. Two lines fitted to velocity of sound versus temperature data. The
equation describing each line is shown.
Can either of the two lines be regarded as the best line through the points?
If the answer is no, then how do we find the best line? The guesswork asso¬
ciated with drawing a line through data by eye can be eliminated by apply¬
ing the technique of least squares.
To find the best line through x-y data, we need to decide upon a numerical
measure of the ‘goodness of fit’ of the line to the data. One approach is to
take that measure to be the ‘sum of squares of residuals’, which we will
discuss for the case where there is a linear relationship between x and y.
The least squares method discussed in this section rests on the assump¬
tions described in table 6.2.
Figure 6.4 shows a line drawn through the x-y data. The vertical dis¬
tances from each point to the line are labelled Δy₁, Δy₂, Δy₃, etc. and are
referred to as the residuals (or deviations). A residual is defined as the
difference between the observed y value and the y value on the line for the
same x value. Referring to the ith observed y value as y_i, and the ith pre­dicted value found using the equation of the line as ŷ_i, the residual, Δy_i, is
written

Δy_i = y_i − ŷ_i    (6.2)
Table 6.2. Assumptions upon which the unweighted least squares method is based.
Assumption Comment
where

SSR = Σ(y_i − ŷ_i)²    (6.5)

The intercept and slope of the least squares line are given by

a = [Σx_i² Σy_i − Σx_i Σx_iy_i]/[nΣx_i² − (Σx_i)²]    (6.6)

and

b = [nΣx_iy_i − Σx_i Σy_i]/[nΣx_i² − (Σx_i)²]    (6.7)
where n is the number of data points and each summation is carried out
between i=l and i= n.
Example 1
Table 6.3 contains x-y data which are shown plotted in figure 6.5. Using these data:
(i) Find the value for the intercept and slope of the best line through the points.
(ii) Draw the line of best fit through the points.
(iii) Calculate the sum of squares of residuals, SSR.
^ For derivations of equations (6.6) and (6.7) see section A3.2 in Appendix 3.
ANSWER
To calculate a and b we need the sums appearing in equations (6.6) and (6.7), namely Σx_i, Σy_i, Σx_iy_i and Σx_i². Many pocket calculators are able to calculate these quantities (in fact, some are able to perform unweighted least squares fitting to give a and b directly).
We offer a word of caution here: As there are many steps in the calculations of
a and b, it is advisable not to round numbers in the intermediate calculations,^ as
rounding can significantly influence the values of a and b.
(i) Using the data in table 6.3 we find that Σx_i = 30, Σy_i = 284, Σx_iy_i = 1840 and Σx_i² = 220. Substituting these values into equations (6.6) and (6.7) (and noting that the number of points, n = 5) gives
a = (220 × 284 − 30 × 1840) / (5 × 220 − (30)²) = 36.4
b = (5 × 1840 − 30 × 284) / (5 × 220 − (30)²) = 3.4
(ii) The line of best fit through the data in table 6.3 is shown in figure 6.6.
(iii) The squares of residuals and their sum, SSR, are shown in table 6.4.
X y
2 43
4 49
6 59
8 63
10 70
Exercise A
Use least squares to fit a straight line to the velocity versus temperature data in
table 6.1. Calculate the intercept, a, the slope, b, of the line and the sum of squares
of the residuals, SSR.
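The arithmetic in equations (6.6) and (6.7) is easy to script outside a spreadsheet. The following Python sketch is not part of the book's Excel-based treatment; it simply reproduces the numbers of example 1 for the table 6.3 data, and the variable names are illustrative only.

# Unweighted least squares fit of y = a + b*x using equations (6.6) and (6.7).
# Data are those of table 6.3; illustrative sketch only.
x = [2, 4, 6, 8, 10]
y = [43, 49, 59, 63, 70]

n = len(x)
Sx, Sy = sum(x), sum(y)
Sxy = sum(xi * yi for xi, yi in zip(x, y))
Sxx = sum(xi * xi for xi in x)

D = n * Sxx - Sx ** 2
a = (Sxx * Sy - Sx * Sxy) / D          # equation (6.6)
b = (n * Sxy - Sx * Sy) / D            # equation (6.7)
ssr = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # sum of squares of residuals

print(a, b, ssr)   # expect a = 36.4, b = 3.4, SSR = 6.4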
Figure 6.6. Line of best fit through data given in table 6.3.
Table 6.4.
x_i   y_i   ŷ_i = 36.4 + 3.4x_i   (Δy_i)²
2     43    43.2    0.04
4     49    50.0    1.00
6     59    56.8    4.84
8     63    63.6    0.36
10    70    70.4    0.16
SSR = 6.4
Excel® may be used to add the best straight line to an x-y graph using the
Add Trendline option. Excel® uses equations (6.6) and (6.7) to determine
the intercept and slope. Add Trendline can be found in the Chart option on
the Menu bar. (Excel®’s Trendline is described in detail in section 2.7.1.)
Advantages of the Trendline option are:
(i) It requires that data be plotted first so that we are encouraged to con¬
sider whether it is reasonable to draw a straight line through the
points.
(ii) The best line is added automatically to the graph.
(iii) The equation of the best line can be displayed if required.
(iv) Excel® is capable of drawing the best line through points for relation¬
ships between X and y other than linear, such as a logarithmic or
power relationship.
(v) If data are changed then the points on the graph are updated and dis¬
played immediately, as the graph and the line of best fit are linked
‘dynamically’ to the data.
Though excellent for viewing data and determining the best line through
points. Trendline does not:
(i) Give access to the size of the standard errors in a and b. This informa¬
tion is necessary if we wish to quote confidence intervals for intercept
and slope.
(ii) Allow ‘weighted fitting’. This is required when there is evidence to
suggest that some x-y values are more reliable than others. In this situation the best line should be 'forced' to pass close to the more reliable
points. Weighted fitting is dealt with in section 6.10.
(iii) Plot residuals. Residuals are extremely helpful for assessing whether it
is appropriate to fit a straight line to data in the first place. Residuals
are dealt with in section 6.7.
Despite the usefulness of the Trendline option in Excel®, there are often sit¬
uations in which we need to extract more from the data than a and b. In
particular, the uncertainties in a and b, expressed as the standard errors in
a and b, are very important as they indicate the number of significant
figures to which to quote intercept and slope. We consider uncertainties in
intercept and slope next.
One of the basic assumptions made when fitting a line to data using least
squares is that the dependent variable is subject to random error. It is reason¬
able to expect therefore that a and b are themselves influenced by the errors
in the dependent variable. A preferred way of expressing the uncertainties in
a and b is in terms of their respective standard errors, as this permits us to
calculate confidence limits for the intercept and slope. Standard errors in a
and b may be found using the ideas of propagation of uncertainties dis¬
cussed in chapter 5. Provided the uncertainty in each yvalue is the same,® the
standard errors in a and b are given by and cr^, where®
(2^ T
(6.8)
tl^X ?-(:
1
am
(6.9)
a is the standard deviation of the observed y values about the fitted line.
The calculation of a is similar to the calculation of the estimate of the pop¬
ulation standard deviation, s, of univariate data given by equation (1.16). cr
is given by^
a= (6.10)
n
Example 2
Table 6.5.
Concentration (ng/mL)   Absorbance
0      0.002
5      0.131
10     0.255
15     0.392
20     0.500
25     0.622
30     0.765
ANSWER
Regarding the concentration as the independent variable, x, and the absorbance as
the dependent variable, y, we write y=a+bx
Using the data in table 6.5 we find Σx_i = 105 (ng/mL), Σy_i = 2.667, Σx_iy_i = 57.585 (ng/mL) and Σx_i² = 2275 (ng/mL)². Using equations (6.6) and (6.7), a = 4.3 × 10⁻³ and b = 2.511 × 10⁻² mL/ng.
In this example units have been included explicitly® in the calculation of a and b to
emphasise that, in most situations in the physical sciences, we deal with quantities that
have units and these units must be carried through to the ‘final answers’ for a and b.
In order to use equations (6.8) and (6.9), first calculate σ as given by equation (6.10). Table 6.6 has been constructed to assist in the calculation of σ. Summing the values in the last column of the table gives
Σ(y_i − ŷ_i)² = 3.2686 × 10⁻⁴
so that
σ = [ (1/(7 − 2)) × 3.2686 × 10⁻⁴ ]^(1/2) = 0.008085
In other examples in this chapter, units do not appear (for the sake of brevity) in the intermediate calculations of a and b or σ_a and σ_b.
σ_a = 0.008085 × (2275)^(1/2) / [7 × 2275 − (105)²]^(1/2) = 5.509 × 10⁻³
σ_b = 0.008085 × (7)^(1/2) / [7 × 2275 − (105)²]^(1/2) = 3.056 × 10⁻⁴ mL/ng
Using the properties of the normal distribution, we can say that the true value for the intercept has a probability of approximately 0.7 of lying between (4.3 − 5.5) × 10⁻³ and (4.3 + 5.5) × 10⁻³, i.e. the 70% confidence interval for α is between approximately −1.2 × 10⁻³ and 9.8 × 10⁻³. Similarly, the 70% confidence interval for β is between 2.480 × 10⁻² mL/ng and 2.542 × 10⁻² mL/ng.
We must admit to a misdemeanour in applying the normal distribution here: in this example we are dealing with a small number of values (seven only), so we should use the t distribution rather than the normal distribution when calculating confidence intervals for α and β. We discuss this further in section 6.2.6.
Exercise B
Calculate σ_a and σ_b for the data in table 6.3.
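Equations (6.8) to (6.10) can also be scripted directly. The following Python sketch (an illustration only, not taken from the book) computes σ, σ_a and σ_b for the table 6.3 data of exercise B.

import math

# Standard errors in intercept and slope, equations (6.8)-(6.10); illustrative sketch.
x = [2, 4, 6, 8, 10]
y = [43, 49, 59, 63, 70]
n = len(x)

Sx, Sy = sum(x), sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))
D = n * Sxx - Sx ** 2

a = (Sxx * Sy - Sx * Sxy) / D
b = (n * Sxy - Sx * Sy) / D

sigma = math.sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))  # equation (6.10)
sigma_a = sigma * math.sqrt(Sxx / D)   # equation (6.8)
sigma_b = sigma * math.sqrt(n / D)     # equation (6.9)

print(sigma_a, sigma_b)   # roughly 1.5 and 0.23 for these data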
In this text we adopt the convention that uncertainties are rounded to two significant figures. a and b are then presented to the number of figures consistent with the magnitude of the uncertainties. Where a pocket calculator or a spreadsheet has been used to determine a and b, all intermediate calculations are held to the full internal precision of the calculator, rounding only occurring in the presentation of the final parameter estimates.
α = a ± t₉₅%,ν σ_a,   ν = n − 2   (6.11)
where n is the number of data. Similarly, the 95% confidence interval for β is written
β = b ± t₉₅%,ν σ_b   (6.12)
a and b are calculated using equations (6.6) and (6.7) respectively, and σ_a and σ_b are calculated using equations (6.8) and (6.9). t₉₅%,ν is the critical t value corresponding to the 95% confidence level.
When the X% confidence interval is required, t₉₅%,ν is replaced in equations (6.11) and (6.12) by t_X%,ν. Table 2 in appendix 1 gives the values of t for various confidence levels, X%, and degrees of freedom, ν.
Example 3
Using information given in example 2, calculate the 95% confidence interval for a
and p.
ANSWER
This question requires we apply equations (6.11) and (6.12). The relevant information contained in example 2 (retaining extra figures to avoid rounding errors in the final answers) is as follows:
a = 4.286 × 10⁻³, σ_a = 5.509 × 10⁻³, b = 2.511 × 10⁻² mL/ng, σ_b = 3.056 × 10⁻⁴ mL/ng
ν = n − 2 = 7 − 2 = 5,  t₉₅%,₅ = 2.571
i.e. α = (4.3 ± 14.2) × 10⁻³, so that α lies between approximately −9.9 × 10⁻³ and 18.5 × 10⁻³;
i.e. β = (2.511 ± 0.079) × 10⁻² mL/ng, so that β lies between approximately 2.432 × 10⁻² mL/ng and 2.590 × 10⁻² mL/ng.
Exercise C
1. Calculate the 99% confidence intervals for α and β in example 2.
2. The data in table 6.7 were obtained in an experiment to study the variation of the electrical resistance, R, with temperature, θ, of a tungsten wire. Assuming the relationship between R and θ can be written R = A + Bθ, calculate:
(i) the values of the intercept, a, and slope, b, of the best line through the resistance-temperature data;
(ii) the standard errors in a and b;
(iii) the 95% confidence intervals for A and B.
θ(°C)   1    4    10   19   23   28   34   40   47   60   66   78   82
R       10.2 10.3 10.7 11.0 11.2 11.4 11.8 12.2 12.5 12.8 13.2 13.5 13.6
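For readers who prefer to check such calculations in code rather than by hand or in a spreadsheet, the following Python sketch (not from the book) shows how a critical t value and a 95% confidence interval for the slope can be obtained; the numbers used are those quoted in example 2, and SciPy is assumed to be available.

from scipy import stats

# 95% confidence interval for the slope, equations (6.11) and (6.12).
# b and sigma_b are assumed to have been computed already (example 2 values used here).
n = 7
b, sigma_b = 2.511e-2, 3.056e-4

nu = n - 2                           # degrees of freedom
t_crit = stats.t.ppf(0.975, nu)      # two-tailed 95% critical t (about 2.571 for nu = 5)
print(b - t_crit * sigma_b, b + t_crit * sigma_b)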
Calculating a and b and their standard errors using equations (6.6) to (6.10)
is tedious, especially if there are many x-y values to consider. It is possible
to use a spreadsheet to perform the calculations and this lessens the effort
considerably. An even quicker method of calculating a and b is to use the LINEST() function in Excel®. This function estimates the parameters of the line of best fit and returns those estimates into an array of cells on the spreadsheet. The LINEST() function is versatile and we will consider it again in chapter 7. For the moment we use it to calculate a, b, σ_a and σ_b.
The syntax of the function is
=LINEST(known_y's, known_x's, const, stats)
Example 4
Consider the x-y data shown in sheet 6.1. Use the LINEST() function to find the parameters a, b, σ_a and σ_b for the best line through these data.
Answer
Data are entered into columns A and B of the Excel® spreadsheet as shown in sheet 6.1.
We require Excel® to return a, b, σ_a and σ_b. To do this:
In fact, the LINEST() function is able to return other statistics, but we will focus on the standard errors for the moment.
1. Move the cursor to cell D3. With the left hand mouse button held down, pull
down and across to cell E4. Release the mouse button. Values returned by the
LINEST() function will appear in the four cells, D3 to E4.
2. Type =LINEST(B2:B9,A2:A9,TRUE,TRUE).
3. Hold down the Ctrl and Shift keys then press the Enter key.
A B
1 X y
2 1 -2.3
3 2 -8.3
4 3 -11.8
5 4 -15.7
6 5 -20.8
7 6 -25.3
8 7 -34.2
9 8 -37.2
Figure 6.7 shows part of the screen as it appears after the Enter key has been pressed. Labels have been added to the figure to identify a, b, σ_a and σ_b. The best values for intercept and slope and their respective standard errors may be written a = 3.0 ± 1.1 and b = −4.99 ± 0.22.
Exercise D
Consider the x-y data in table 6.8.
(i) Use the LINEST() function to determine the intercept and slope of the best line through the x-y data, and the standard errors in the intercept and slope.
(ii) Plot the data on an x-y graph and show the line of best fit.
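Outside Excel, the same numbers that LINEST() returns can be recovered with any least squares routine. The sketch below uses SciPy (a recent version, so that intercept_stderr is available) on the sheet 6.1 data; it is an illustration only, not the book's method.

import numpy as np
from scipy import stats

# Equivalent of LINEST() for the sheet 6.1 data: intercept, slope and their standard errors.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([-2.3, -8.3, -11.8, -15.7, -20.8, -25.3, -34.2, -37.2])

res = stats.linregress(x, y)
print(res.intercept, res.slope)             # a and b (about 3.0 and -5.0)
print(res.intercept_stderr, res.stderr)     # sigma_a and sigma_b (about 1.1 and 0.22)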
We consider the use of the line of best fit next, but will return to the impor¬
tant matter of comparing models and data in later sections of this chapter
and again in chapter 7.
In section 6.7 we will see how the scatter of residuals can provide convincing
evidence for the appropriateness, or otherwise, of fitting a straight line to data.
where l₀ is the length of the rod at 0 °C, and α is the temperature coefficient of expansion. The right hand side of equation (6.13) can be expanded to give
l = l₀ + l₀αθ   (6.14)
Equation (6.14) is of the form y = a + bx, where l ≡ y and θ ≡ x. Comparing equation (6.14) to the equation of a straight line, we see that
a = l₀,   b = l₀α
It follows that
α = b/a   (6.15)
i.e. the ratio b/a gives the temperature coefficient of expansion of the
material being studied. It would be usual to compare the value of a
obtained through analysis of the length-temperature data with values
reported by other experimenters who have studied the same, or similar
materials.
Exercise E
The pressure, P, at the bottom of a water tank is related to the depth of water, h, in
the tank by the equation
P=pgh+P^ (6.16)
where P^ is the atmospheric pressure, gis the acceleration due to gravity and p is the
density of the water.
(i) What would you choose to plot on each axis of an x-y graph in order to obtain a
straight line?
(ii) How are the intercept and slope of that line related to p, g and P^ in equation
(6.16)?
x̄ = Σx_i / n,   ȳ = Σy_i / n   (6.17)
a = ȳ − b x̄   (6.18)
An estimated slope, b, that is slightly larger than the true slope will consistently
coincide with an estimated intercept, a, that is slightly smaller than the true
intercept, and vice versa. See Weisherg (1985) for a discussion of correlation between
a and b.
1'* See appendix 3.
σ_a² = (∂a/∂ȳ)² σ_ȳ² + (∂a/∂b)² σ_b²   (6.20)
so that
σ_a² = σ²/n + x̄² σ_b²   (6.21)
n
Exercise F
In an experiment to study thermal expansion, the length of an alumina rod is meas¬
ured at various temperatures. Assume that the relationship between length and
temperature is given by equation (6.13). Using the data in table 6.9:
(i) calculate the intercept and slope of the best straight line through the length-
temperature data and the standard errors in the intercept and slope;
(ii) determine the temperature coefficient of expansion, a, for the alumina using
equation (6.15);
(iii) calculate the standard error in a.
θ(°C)   100     200     300     400     500     600     700     800     900     1000
l(m)    1.2019  1.2018  1.2042  1.2053  1.2061  1.2064  1.2080  1.2078  1.2102  1.2122
Once the intercept and slope of the best line through the points have been
determined, it is an easy matter to find the predicted value of y, y^, at an
arbitrary X value, x^, using the relationship
σ_ȳ = σ/√n and σ_b is given by equation (6.9).
ŷ₀ = a + bx₀   (6.22)
ŷ₀ is the best estimate of the population mean of the y quantity at x = x₀. The population mean of y at x = x₀ is sometimes written μ_y|x₀. Just as the uncertainties in the measured y values contribute to the uncertainties in a and b, so the uncertainties in a and b contribute to the uncertainty in ŷ₀. As in section 6.4.1.1, we avoid the problem of correlation of errors in a and b by replacing a by ȳ − b x̄, so that equation (6.22) becomes
ŷ₀ = ȳ + b(x₀ − x̄)   (6.23)
As the errors in ȳ and b are independent, the standard error in ŷ₀, written as σ_ŷ₀, is given by
σ_ŷ₀ = σ [ 1/n + n(x₀ − x̄)² / (nΣx_i² − (Σx_i)²) ]^(1/2)   (6.24)
The X% confidence interval for the population mean of y at x = x₀ is then
ŷ₀ ± t_X%,ν σ_ŷ₀   (6.25)
Example 5
Consider the data in table 6.10.
(i) Calculate:
(a) the intercept and slope, a and b, of the best line through the points;
(b) ŷ₀ for x₀ = 12 and x₀ = 22.5;
(c) the standard error in ŷ₀ when x₀ = 12 and x₀ = 22.5.
(ii) Plot the data on an x-y graph showing the line of best fit and the 95% confidence limits for μ_y|x₀ for values of x₀ between 0 and 45.
X 5 10 15 20 25 30 35 40
y 28.1 18.6 -0.5 -7.7 -14.8 -27.7 -48.5 -62.9
ANSWER
(i) (a) a and b are found using equations (6.6) and (6.7). We find a = 42.43 and b = −2.527, so that the equation for ŷ₀ can be written
ŷ₀ = 42.43 − 2.527x₀
(b), (c) σ is calculated using equation (6.10). Substituting values into equation (6.24) gives, for x₀ = 12, σ_ŷ₀ = 2.2. When x₀ = 22.5, σ_ŷ₀ = 1.6.
(ii) When the number of degrees of freedom equals 6, the critical t value for the 95% confidence interval, given by table 2 in appendix 1, is t₉₅%,₆ = 2.447. Equation (6.25) is used to find lines which represent the 95% confidence intervals for μ_y|x₀ for values of x₀ between 0 and 45. These are indicated on the graph in figure 6.8.
Figure 6.8. Line of best fit and 95% confidence limits for data in table 6.10.
Exercise G
For the x-y data in example 5, calculate the 99% confidence interval for μ_y|x₀ when x₀ = 15.
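The quantities in equations (6.22), (6.24) and (6.25) can be evaluated directly in code. The Python sketch below (illustrative only, using the example 5 data) reproduces σ_ŷ₀ at x₀ = 12 and the corresponding 95% confidence limits.

import math
from scipy import stats

# Confidence interval for y_hat_0 at x0, equations (6.22), (6.24) and (6.25); sketch only.
x = [5, 10, 15, 20, 25, 30, 35, 40]
y = [28.1, 18.6, -0.5, -7.7, -14.8, -27.7, -48.5, -62.9]
n = len(x)

Sx, Sy = sum(x), sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))
D = n * Sxx - Sx ** 2
a = (Sxx * Sy - Sx * Sxy) / D
b = (n * Sxy - Sx * Sy) / D
sigma = math.sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
xbar = Sx / n

x0 = 12
y0 = a + b * x0                                                  # equation (6.22)
sigma_y0 = sigma * math.sqrt(1 / n + n * (x0 - xbar) ** 2 / D)   # equation (6.24), about 2.2
t_crit = stats.t.ppf(0.975, n - 2)                               # 2.447 for 6 degrees of freedom
print(y0, y0 - t_crit * sigma_y0, y0 + t_crit * sigma_y0)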
Let us assume that we have fitted the best straight line to a set of x-y data.
If we make a measurement of y at x=x^, between what limits would we
expect the measured value of y to lie? This is different from considering
confidence limits associated with the estimate of the population mean at
x = x₀, because two factors must be taken into consideration: the uncertainty in the position of the fitted line (i.e. in ŷ₀) and the scatter of individual measurements about that line. The standard error appropriate to the prediction of a single y value at x = x₀ is
σ_pred = σ [ 1 + 1/n + n(x₀ − x̄)² / (nΣx_i² − (Σx_i)²) ]^(1/2)   (6.26)
and the X% prediction interval is
ŷ₀ ± t_X%,ν σ_pred   (6.27)
ŷ₀ is the best estimate of the predicted value; it is the same as the best estimate of the population mean at x = x₀ and is given by equation (6.22). t_X%,ν is the critical t value corresponding to the X% confidence level evaluated with ν degrees of freedom.
Equation (6.26) is very similar to equation (6.24). However, the inclusion of the unity term within the brackets of equation (6.26) leads to a prediction interval for a y value at x = x₀ which is much larger than the confidence interval of the population mean at x = x₀.
Exercise H
Using the information supplied in example 5, calculate the 95% prediction interval for y if a measurement of y is to be made at x₀ = 12.
Using the best straight line through points, we can estimate a value of x for
any particular value of y. The equation of the best straight line through
points is rearranged to give
x₀ = (ȳ₀ − a) / b   (6.28)
where x₀ is the value of x when y = ȳ₀, and ȳ₀ is the mean of repeated measurements of the dependent variable. The question arises: as there is uncertainty in a, b and ȳ₀, what will be the uncertainty in x₀? As discussed in section 6.4.1.1, the uncertainties in a and b are correlated. Replacing a in equation (6.28) by ȳ − b x̄, we have
x₀ = x̄ + (ȳ₀ − ȳ) / b   (6.29)
σ_x₀² = (∂x₀/∂ȳ₀)² σ_ȳ₀² + (∂x₀/∂ȳ)² σ_ȳ² + (∂x₀/∂b)² σ_b²   (6.30)
which, when ȳ₀ is the mean of m repeat measurements, leads to
σ_x₀ = (σ/b) [ 1/m + 1/n + (ȳ₀ − ȳ)² / (b² Σ(x_i − x̄)²) ]^(1/2)   (6.31)
The X% confidence interval for the value of x corresponding to ȳ₀ can then be written
x₀ ± t_X%,ν σ_x₀   (6.32)
Example 6
A spectrophotometer is used to measure the concentration of arsenic in solution.
Table 6.11 shows calibration data of the variation of absorbance^^ with arsenic con¬
centration. Assuming that the absorbance is linearly related to the arsenic concen¬
tration:
(iii) Calculate the concentration of arsenic corresponding to this absorbance and the
standard error in the concentration.
'' Absorbance is proportional to the amount of light ahsorhed by the solution as the
light passes from source to detector within the spectrophotometer.
Table 6.11.
Concentration (ppm)   Absorbance
2.151     0.0660
9.561     0.2108
16.878    0.3917
23.476    0.5441
30.337    0.6795
ANSWER
(i) Figure 6.9 shows a plot of the variation of absorbance versus concentration data
contained in table 6.11.
(ii) a and b are determined using equations (6.6) and (6.7).
We use equation (6.31) to obtain the standard error in x₀. Using the information in the question and the data in table 6.11 we find
(σ is calculated using equation (6.10)). Substituting these values into equation (6.31) gives
σ_x₀ = 0.38 ppm
It is worth remarking that the third term in the brackets of equation (6.31) becomes
large for y values far from the mean of the y values obtained during the calibration
procedure. The third term is zero when the y value of the sample under test is equal
to the mean of the y values obtained during calibration.
Figure 6.9. Absorbance (y axis) plotted against concentration in ppm (x axis) for the data in table 6.11.
Exercise I
The data shown in table 6.12 were obtained in an experiment to determine the
amount of nitrite in solution using high performance liquid chromatography
(HPLC).
(i) Regarding the peak area as the dependent variable, determine the equation of
the best straight line through the data.
(ii) Four repeat measurements are made on a solution with unknown nitrite con¬
centration. The mean peak area is found to be 57156. Use the equation of the
best line to find the concentration corresponding to this peak area and the stan¬
dard error in the concentration.
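A calculation of this kind is easy to script once the calibration sums are in hand. The Python sketch below is illustrative only: it fits the table 6.11 calibration data and then applies equations (6.28) and (6.31); the values of ybar0 and m are made up for the illustration, since the measured absorbance used in example 6 is not reproduced above.

import math

# Concentration and its standard error from a calibration line, equations (6.28) and (6.31).
x = [2.151, 9.561, 16.878, 23.476, 30.337]      # concentration (ppm)
y = [0.0660, 0.2108, 0.3917, 0.5441, 0.6795]    # absorbance
n = len(x)

Sx, Sy = sum(x), sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))
D = n * Sxx - Sx ** 2
a = (Sxx * Sy - Sx * Sxy) / D
b = (n * Sxy - Sx * Sy) / D
sigma = math.sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))

ybar0, m = 0.350, 3                              # hypothetical mean of m repeat absorbance readings
x0 = (ybar0 - a) / b                             # equation (6.28)
xbar, ybar = Sx / n, Sy / n
Sxx_c = sum((xi - xbar) ** 2 for xi in x)
sigma_x0 = (sigma / b) * math.sqrt(1 / m + 1 / n + (ybar0 - ybar) ** 2 / (b ** 2 * Sxx_c))  # equation (6.31)
print(x0, sigma_x0)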
uncertainties in both the x and the y values.'® If the errors in the x values
are constant and errors in the y values are negligible, we can use the results
already discussed in this chapter to find the best line through the data. We
write the equation of the line through the data as
x = a* + b*y   (6.33)
where x is regarded as the dependent variable and y as the independent variable, a* is the intercept (i.e. the value of x when y = 0) and b* is the slope of the line. To find a* and b* we must minimise the sum of squares of the residuals of the observed values of x from the predicted values based on a line drawn through the points. In essence we recreate the argument begun in section 6.2.2, but with y replacing x and x replacing y. The equation for the best line through the x-y data in this case is given when the intercept, a*, is
a* = [Σx_i Σy_i² − Σy_i Σx_iy_i] / [nΣy_i² − (Σy_i)²]   (6.34)
and the slope, b*, is
b* = [nΣx_iy_i − Σx_i Σy_i] / [nΣy_i² − (Σy_i)²]   (6.35)
Compare these equations with equations (6.6) and (6.7) for a and b when
the sum of squares of residuals in the y values is minimised.
Equation (6.33) can be rewritten as
y = −a*/b* + x/b*   (6.36)
It is tempting to compare equation (6.36) with y = a + bx and reach the conclusion that
a = −a*/b*   (6.37)
and
b = 1/b*   (6.38)
However, a and b given by equations (6.37) and (6.38) are equal to a and b given by equations (6.6) and (6.7) only if both least squares fits (i.e. that which minimises the sum of the squares of the x residuals and that which minimises the sum of the squares of the y residuals) produce the same straight line through the points. The only situation in which this happens is when there are no errors in the x and y values, i.e. all the data lie exactly on a straight line!
This is beyond the scope of this text. For a good review of least squares methods when both x and y variables are affected by error, see Macdonald and Thompson (1992).
Table 6.13.
x       y
2.52    2
3.45    4
3.46    6
4.25    8
4.71    10
5.47    12
6.61    14
As an example, consider the x-y data in table 6.13. We can perform a least squares analysis assuming that:
(i) The errors are in the x values only. Using equations (6.34) and (6.35) we find a* = 1.844 and b* = 0.3136. Using equations (6.37) and (6.38) we obtain a = −5.882 and b = 3.189.
(ii) The errors are in the y values only. Using equations (6.6) and (6.7) we find a = −5.348 and b = 3.067.
Exercise J
With the current through a silicon diode held constant, the voltage across the diode, V, is measured as a function of temperature, θ. Data from the experiment are shown in table 6.14. Theory suggests that the relationship between V and θ is linear, i.e. V = k₀ + k₁θ, where k₀ and k₁ are constants.
As V is the dependent variable and θ the independent variable, it would be usual to plot V on the y axis and θ on the x axis. However, the experimenter has evidence that the measured values of diode voltage have less error than those of temperature.
(i) Use least squares to obtain best estimates for k₀ and k₁, where only values of V are assumed to have error.
(ii) Use least squares to obtain best estimates for k₀ and k₁, where only values of θ are assumed to have error.
θ(°C)   V(V)
2.0 0.6859
10.0 0.6619
19.0 0.6379
26.4 0.6139
40.9 0.5899
48.8 0.5659
59.7 0.5419
65.0 0.5179
80.0 0.4939
91.0 0.4699
101.3 0.4459
bb*= 1
Figure 6.10. Correlation coefficients for x-y data exhibiting various amounts of
scatter.
r = (bb*)^(1/2)   (6.40)
r = [nΣx_iy_i − Σx_i Σy_i] / { [nΣx_i² − (Σx_i)²]^(1/2) [nΣy_i² − (Σy_i)²]^(1/2) }   (6.41)
For perfect correlation, r is either +1 or −1. We note that r has the same sign as that of the slope (b or b*). Figure 6.10 shows graphs of x-y data along with the value of r for each. As |r| decreases from 1 to 0, the correlation between x and y becomes less and less convincing. Notions of 'goodness' relating to values of r can be misleading. A value of |r| close to unity does indicate good correlation; however, it is possible that x and y are not linearly related but still give a value for |r| in excess of 0.99. This is illustrated by the next example.
Example 7
Thermoelectric coolers (TECs) are devices widely used to cool electronic compo¬
nents, such as laser diodes. A TEC consists of a hot and a cold surface with the tem¬
perature difference between the surfaces maintained by an electric current. In an
experiment, the temperature difference between the hot and the cold surface, A T, is
measured as a function of the electrical current, /,passing through the TEC. The data
gathered are shown in table 6.15.
ANSWER
(i) We begin by drawing up table 6.16, which contains all the values needed to calculate r using equation (6.41). Summing the values in each column gives Σx_i = 4.2, Σy_i = 113.2, Σx_iy_i = 93.02, Σx_i² = 3.64, Σy_i² = 2402.22. Substituting the summations into equation (6.41) gives
r = 175.7 / (2.8 × 63.2558) = 0.992
A fit of the equation y = a + bx to the x-y data given in table 6.16 gives intercept a = 2.725 °C and slope b = 22.41 °C/A.
(ii) Figure 6.11 shows the data points and the line of best fit.
(ii) Figure 6.11 shows the data points and the line of best fit.
I(A)   ΔT(°C)
0.0 0.8
0.2 7.9
0.4 12.5
0.6 17.1
0.8 21.7
1.0 25.1
1.2 28.1
Table 6.16.
x_i (= I)   y_i (= ΔT)   x_iy_i   x_i²   y_i²
0.0   0.8    0.0     0.0    0.64
0.2   7.9    1.58    0.04   62.41
0.4   12.5   5.0     0.16   156.25
0.6   17.1   10.26   0.36   292.41
0.8   21.7   17.36   0.64   470.89
1.0   25.1   25.1    1.0    630.01
1.2   28.1   33.72   1.44   789.61
Exercise K
The temperature reached by the cold surface of a TEC cooler depends on the size of
the heat sink to which it is attached. Table 6.17 shows the temperature, T, of the cold
surface of the TEC for heat sinks of various volumes, V.
V     T
5     37.0
10    25.5
15    17.1
25    11.5
50    6.4
Consider the calculation of rfor the data shown in sheet 6.2. To calculate r:
A B
1 X y
2 42 458
3 56 420
4 56 390
5 78 380
6 69 379
7 92 360
8 102 351
9 120 300
10
(i) enter the data shown in sheet 6.2 into an Excel® spreadsheet;
(ii) type =CORREL(B2:B9,A2:A9) into cell B10;
(iii) press the Enter key;
(iv) the value -0.9503 is returned in cell B10.
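The same coefficient can be obtained outside Excel. A short Python sketch (illustrative only) applied to the sheet 6.2 data:

import numpy as np

# Linear correlation coefficient r, equation (6.41); equivalent of Excel's CORREL().
x = np.array([42, 56, 56, 78, 69, 92, 102, 120], dtype=float)
y = np.array([458, 420, 390, 380, 379, 360, 351, 300], dtype=float)

r = np.corrcoef(x, y)[0, 1]
print(r)   # about -0.95, matching the value returned by CORREL()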
Exercise L
Use the CORREL() function to determine the correlation coefficient of the x-y data shown in table 6.18.
X 2 4 6 8 10 12 14 16 18 20 22 24
y 16.5 49.1 65.2 71.6 101.5 90.1 101.4 113.7 127.7 156.5 203.6 188.4
We have seen that we need to be cautious when using rto infer the extent
to which there is a linear relationship between x-y data, as values of |r| > 0.8
may be obtained when the underlying relationship between x and y is not
linear. There is another problem: values of |r| > 0.8 may be obtained when
X and y are totally uncorrelated, especially when the number of x-y values
is small. To illustrate this, consider the values in table 6.19.
The first column of table 6.19 contains five values of x from 0.2 to 1.
The remainder of the columns contain numbers between 0 and 1 that have
been randomly generated so that there is no underlying correlation
between x andy. The bottom row shows the correlation coefficients calcu¬
lated when each column of yis correlated in turn with the column contain¬
ing X. The column of values headed y8, when correlated with the column
Table 6.19. Correlation coefficient for ten sets of randomly generated y values correlated
with the X column values.
x     y1     y2     y3     y4     y5     y6     y7     y8     y9     y10
0.2   0.020  0.953  0.508  0.324  0.233  0.872  0.446  0.673  0.912  0.602
0.4   0.965  0.995  0.231  0.501  0.265  0.186  0.790  0.911  0.491  0.186
0.6   0.294  0.159  0.636  0.186  0.227  0.944  0.291  0.153  0.780  0.832
0.8   0.561  0.096  0.905  0.548  0.187  0.002  0.331  0.051  0.862  0.255
1.0   0.680  0.936  0.783  0.034  0.860  0.363  0.745  0.083  0.239  0.363
r     0.400  -0.323 0.743  -0.392 0.655  -0.455 0.094  -0.822 -0.541 -0.242
See Bevington and Robinson (1992) for a discussion of the calculation of the
probabilities in table 6.20.
This value of probability is often used in tests to establish statistical significance,
as discussed in chapter 8.
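The point made by table 6.19 is easy to reproduce by simulation. The Python sketch below (not from the book) correlates the same five x values with ten columns of freshly generated random numbers; with so few points, values of |r| above 0.8 occur by chance reasonably often even though x and y are uncorrelated.

import numpy as np

# Correlation coefficients between a fixed x column and columns of random numbers,
# in the spirit of table 6.19. Illustrative sketch only.
rng = np.random.default_rng(1)
x = np.array([0.2, 0.4, 0.6, 0.8, 1.0])

r_values = [np.corrcoef(x, rng.random(5))[0, 1] for _ in range(10)]
print([round(r, 3) for r in r_values])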
Exercise M
Consider the x-y data in table 6.21.
X y
2.67 1.54
1.56 1.60
0.89 1.56
0.55 1.34
-0.25 1.33
6.7 Residuals
Exercise N
The period, T, of oscillation of a body on the end of a spring is measured as the mass,
m, of the body increases. Data obtained are shown in table 6.22.
(iii) Calculate the residuals and plot a graph of residuals versus mass.
(iv) Is there any ‘pattern’ discernible in the residuals? If so, suggest a possible cause
of the pattern.
m(kg)   T(s)
0.2 0.39
0.4 0.62
0.6 0.72
0.8 0.87
1.0 0.92
1.2 1.07
1.4 1.13
1.6 1.16
1.8 1.23
2.0 1.32
If each residual, Δy_i, is divided by the standard deviation in each y value, σ_i, we refer to the quantity
Δy_i / σ_i   (6.43)
as the standardised residual.
Example 8
Consider the x-y data in table 6.23.
(i) Assuming a linear relationship between x and y, determine the equation of the
best line through the data in table 6.23.
(ii) Calculate the standard deviation, σ, of the data about the fitted line.
(iii) Determine the standardised residuals and plot a graph of standardised residuals
versus x.
ANSWER
(i) Applying equations (6.6) and (6.7) to the data in table 6.23, we find a= 7.849 and
b= 2.830, so that the equation of the best line can be written
Table 6.23.
x   2     4     6     8     10    12    14    16    18    20    22
y   8.2   23.0  26.3  31.4  39.5  36.7  56.7  46.8  53.4  63.2  74.7

x     2      4      6      8      10     12     14     16     18     20     22
y     8.2    23.0   26.3   31.4   39.5   36.7   56.7   46.8   53.4   63.2   74.7
ŷ     13.509 19.169 24.829 30.489 36.149 41.809 47.469 53.129 58.789 64.449 70.109
Δy    -5.309 3.831  1.471  0.911  3.351  -5.109 9.231  -6.329 -5.389 -1.249 4.591
Δy/σ  -0.988 0.713  0.274  0.169  0.623  -0.950 1.717  -1.177 -1.002 -0.232 0.854
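For readers who prefer code to a spreadsheet, the following Python sketch (illustrative only) computes and plots the standardised residuals of equation (6.43) for the example 8 data, taking σ_i to be the same for every point and equal to the standard deviation about the fitted line.

import numpy as np
import matplotlib.pyplot as plt

# Standardised residuals for the table 6.23 data; sketch only.
x = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22], dtype=float)
y = np.array([8.2, 23.0, 26.3, 31.4, 39.5, 36.7, 56.7, 46.8, 53.4, 63.2, 74.7])

b, a = np.polyfit(x, y, 1)                        # slope and intercept of the best line
residuals = y - (a + b * x)
sigma = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))
standardised = residuals / sigma

plt.plot(x, standardised, 'o')
plt.xlabel('x')
plt.ylabel('standardised residual')
plt.show()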
Δy'_OUT = (y_OUT − ŷ) / σ   (6.45)
where y_OUT is the outlier (i.e. the point furthest from the line of best fit) and ŷ is the predicted value, found using ŷ = a + bx. The standard deviation, σ, is calculated using equation (6.10).
Once Δy'_OUT has been calculated, we determine the expected number of values, N, at least as far away from ŷ as y_OUT. To do this:
(i) Use the normal distribution to calculate the probability, P, of a value lying at least |Δy'_OUT| standard deviations from the line.
(ii) Calculate the expected number of values, N, at least as far from ŷ as y_OUT using N = nP, where n is the number of data.
If N is less than 0.5, consider rejecting the point. If a point is rejected, then a and b should be recalculated (as well as other related quantities, such as σ_a and σ_b).
Example 9
Consider x-y values in table 6.25.
(i) Plot an x-y graph and use unweighted least squares to fit a line of the form y = a + bx to the data.
(ii) Identify any suspect point{s).
(iii) Calculate the standard deviation of the y values.
(iv) Apply Chauvenet’s criterion to the suspect point - should it be rejected?
ANSWER
(i) A plot of data is shown in figure 6.18 with the line of best fit attached (a = −1.610 and b = 2.145).
(ii) A suspect point would appear to be x = 6, y = 8.5 as this point is furthest from the line of best fit.
(iii) Using the data in table 6.25 and equation (6.10), σ = 1.895.
(iv) Using equation (6.45), we have
Δy'_OUT = [8.5 − (−1.610 + 2.145 × 6)] / 1.895 = −1.46
The probability of a value lying at least 1.46 standard deviations from the line is P ≈ 0.14, so the expected number of such values is N = nP ≈ 5 × 0.14 = 0.72. As N > 0.5, the point should not be rejected.
X y
2.0 3.5
4.0 7.2
6.0 8.5
8.0 17.1
10.0 20.0
Exercise O
Consider the x-y data in table 6.26. When a straight line is fitted to these data, it is found that a = 4.460 and b = 0.6111. Assuming that the data point x = 10, y = 13 is 'suspect', apply Chauvenet's criterion to decide whether this point should be rejected.
X y
5 7
6 8
8 9
10 13
12 11
14 12
15 14
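The steps of Chauvenet's criterion are easily automated. The Python sketch below (illustrative only, assuming SciPy is available) applies the procedure to the exercise O data using the fitted values quoted in the exercise.

import math
from scipy import stats

# Chauvenet's criterion applied to a suspect point: equation (6.45) followed by N = nP.
x = [5, 6, 8, 10, 12, 14, 15]
y = [7, 8, 9, 13, 11, 12, 14]
a, b = 4.460, 0.6111                      # fitted intercept and slope quoted in exercise O
n = len(x)

sigma = math.sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
x_sus, y_sus = 10, 13                     # suspect point
z = abs(y_sus - (a + b * x_sus)) / sigma  # standardised deviation of the suspect point
P = 2 * (1 - stats.norm.cdf(z))           # probability of lying at least this far from the line
N = n * P                                 # expected number of such points
print(z, N, 'reject' if N < 0.5 else 'retain')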
6.9 Transforming data for least squares analysis
Using equations (6.6) and (6.7) we can establish best estimates for the intercept, a, and slope, b, of a line through x-y data. These equations should only be applied when we are confident that there is a linear relationship between the quantities plotted on the x and y axes. What do we do if x-y data are clearly non-linearly related, such as those in figure 6.19? It may be possible to apply a mathematical operation to the x or y data (or to
Figure 6.19. x-y data that are clearly not linearly related.
both X and y) so that the transformed data appear linearly related. The next
stage is to fit the equation y=a+bx to the transformed data. How do we
choose which mathematical operation to apply? If we have little or no idea
what the relationship between the dependent and independent variable is
likely to be, we may be forced into a ‘trial and error’ approach to transform¬
ing data. As an example, the rapid decrease in y with x indicated for x< 10
in figure 6.19 suggests that there might be an exponential relationship
between x and y, such as
y = A exp(Bx)   (6.46)
where A and B are constants. Assuming this to be the case, taking natural logs of each side of the equation gives
ln y = ln A + Bx   (6.47)
Comparing equation (6.47) with the equation of a straight line predicts that plotting ln y versus x will produce a straight line with intercept ln A and slope B. Figure 6.20 shows the effect of applying the transformation suggested by equation (6.47) to the data in figure 6.19. The transformation has not been successful in producing a linear x-y graph, indicating that equation (6.46) is not appropriate to these data and that other transformation options should be considered (for example ln y versus ln x). Happily, in many situations in the physical sciences the work of others (either experimental or theoretical) provides clues as to how the data should be treated in order to produce a linear relationship between data plotted as an x-y graph. Without such clues we must use 'intelligent guesswork'.
As an example, consider a ball falling freely under gravity. The distance, s, through which a ball falls in a time t is measured and the values
Figure 6.20. ln y plotted against x for the data shown in figure 6.19.
t(s)   s(m)
1 7
2 22
3 52
4 84
5 128
6 200
obtained are shown in table 6.27. The data are shown in graphical form in figure 6.21. The relationship between s and t is not linear. Can the data be transformed so that a linear graph is produced? The starting point is to look for an equation which might describe the motion of the ball. When the acceleration, g, is constant, we can write the relationship between s and t as
s = ut + ½gt²   (6.48)
Figure 6.21. Distance-time data for a falling ball.
Table 6.28. Transformation of data given in table 6.27.
t(s)   s/t (m/s)
1      7
2      11
3      17.3
4      21
5      25.6
6      33.3
Dividing both sides of equation (6.48) by t gives
s/t = u + ½gt
which may be compared term by term with the straight line equation y = a + bx.
Figure 6.22. Graph of s/t versus t.
When transforming data, the dependent variable should remain on the left
hand side of the equation (so that the assumption of uncertainties
restricted to the quantity plotted on the y axis is valid). Usually the inde¬
pendent variable only appears on the right hand side of the equation.
However, there are situations, such as in the linearisation of equation
(6.48), where this condition must be relaxed.
Exercise P
1. Transform the equations shown in table 6.29 into the form y= a+ bx, and indicate
how the constants in each equation are related to a and b.
2. The capacitance of a semiconductor diode decreases as the reverse bias voltage
applied to the junction increases. An important diode parameter, namely the
contact potential, can be found if the capacitance of the junction is measured as a
function of reverse bias voltage. Table 6.30 shows experimental capaci¬
tance/voltage data for a particular diode. Assume that the relationship between Cj
and Ucan be written
^=A:(U+(^.) (6.50)
where φ is the contact potential and k is a constant. Use least squares to find k and φ.
Equation   Dependent variable   Independent variable   Constant(s)   Hint
(iv) TW = TC Tw R T,k
\ m
(V) T m k
Our least squares analysis to this point has assumed that the uncertainty
in the y values is constant. Is this assumption still valid if the data are trans¬
formed? Very often the answer is no and to see this let us consider a situa¬
tion in which data transformation requires that the natural logarithms of
the y quantity he calculated.
V(V)   C_j(pF)
6.0 248
8.1 217
10.1 196
14.1 169
18.5 149
24.6 130
31.7 115
38.1 105
45.6 96.1
50.1 92.1
I = I₀ exp(−kx)   (6.51)
ln I = ln I₀ − kx   (6.52)
Assuming that equation (6.51) is valid for the experimental data, plotting ln I versus x should produce a plot in which the transformed data lie close to a straight line. A straight line fitted to the transformed data would have intercept ln I₀ and slope −k. As ln I is taken as the 'y quantity' when fitting a line to data using least squares, we must determine the uncertainty in ln I. We write
We write
y = ln I   (6.53)
If the uncertainty in I, u_I, is small, then the uncertainty in y, u_y, is given by
u_y = (dy/dI) u_I   (6.54)
so that
u_y = u_I / I   (6.55)
Equation (6.55) indicates that if u_I is constant, the uncertainty in ln I decreases as I increases. The consequence of this is that the assumption of
constant uncertainty in the y values used in least squares analysis is no
longer valid and unweighted least squares must be abandoned in favour of
an approach that takes into account changes in the uncertainties in the y
values. There are many situations in which data transformation leads to a
similar outcome, i.e. that the uncertainty in y values is not constant and so
requires a straight line to be fitted using weighted least squares. This is dealt
with in the next section.
Exercise Q
For the following equations, determine y and the uncertainty in y, u_y, given that the measured quantity has the value 56 with an uncertainty of 2 in each case. Express u_y to two significant figures.
How do you know if you should use a weighted fit? A good starting point is
to perform unweighted least squares to find the line of best fit through the
data (data should be transformed if necessary, as discussed in section 6.9).
A plot of the residuals should reveal whether a weighted fit is required.
Figure 6.23 shows a plot of residuals in which the residuals decrease with
increasing x (figure 6.23(a)) and increase with increasing x (figure 6.23(b)).
Such patterns in residuals are ‘tell tale’ signs that weighted least squares
fitting should be used.
In order to find the best line through the points when weighted fitting is required, we minimise the weighted sum of squares of residuals defined in equation (6.56).
Figure 6.23. Residuals indicating a weighted fit is required: (a) the size of the residuals decreases with increasing x; (b) the size of the residuals increases with increasing x.
χ² = Σ [ (y_i − a − bx_i) / σ_i ]²   (6.56)
where σ_i is the standard deviation in y_i. The intercept, a, and slope, b, are given by
a = [ Σ(x_i²/σ_i²) Σ(y_i/σ_i²) − Σ(x_i/σ_i²) Σ(x_iy_i/σ_i²) ] / Δ   (6.57)
b = [ Σ(1/σ_i²) Σ(x_iy_i/σ_i²) − Σ(x_i/σ_i²) Σ(y_i/σ_i²) ] / Δ   (6.58)
where
Δ = Σ(1/σ_i²) Σ(x_i²/σ_i²) − [ Σ(x_i/σ_i²) ]²   (6.59)
Equations (6.57) and (6.58) give more weight to the points that have
smaller uncertainty thereby ensuring that the fitted line will pass closer to
these points than those with large uncertainty.
With weighted fitting, the best line no longer passes through (x̄, ȳ) but through the weighted centre of gravity of the points, (x̄_w, ȳ_w). The coordinates of the weighted centre of gravity are given by
x̄_w = Σ(x_i/σ_i²) / Σ(1/σ_i²)   (6.60)
and
ȳ_w = Σ(y_i/σ_i²) / Σ(1/σ_i²)   (6.61)
Example lo
x-y data along with associated uncertainties are shown in table 6.31. Using weighted
least squares, find the intercept and slope of the best line through the points.
ANSWER
Table 6.32 contains all the quantities necessary to calculate a and b. Summing the appropriate columns gives
Σ(1/σ_i²) = 0.178403
Σ(x_i/σ_i²) = 15.00986
Σ(y_i/σ_i²) = 15.46528
Using equations (6.57) to (6.59), a = 128.6 and b = −0.4985.
Table 6.31.
x      y
18     125 ± 10
42     108 ± 8
67     91 ± 6
89     84 ± 4
108    76 ± 4
Table 6.32.
x_i   y_i   1/σ_i²   x_i/σ_i²   y_i/σ_i²   x_iy_i/σ_i²   x_i²/σ_i²
Exercise R
Repeat example 10 with every value of σ_i multiplied by 5 (so, for example, when y_i = 125, σ_i = 50). Show that the values of a and b remain unchanged. (Suggestion: use a spreadsheet!)
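A spreadsheet is one option; a few lines of code are another. The Python sketch below (illustrative only) applies equations (6.57) to (6.59) to the table 6.31 data and reproduces the example 10 result.

# Weighted least squares fit of y = a + b*x, equations (6.57)-(6.59); sketch only.
x = [18, 42, 67, 89, 108]
y = [125, 108, 91, 84, 76]
s = [10, 8, 6, 4, 4]                      # sigma_i for each y value
w = [1 / si ** 2 for si in s]             # weights

Sw   = sum(w)
Swx  = sum(wi * xi for wi, xi in zip(w, x))
Swy  = sum(wi * yi for wi, yi in zip(w, y))
Swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
Swxx = sum(wi * xi * xi for wi, xi in zip(w, x))

delta = Sw * Swxx - Swx ** 2              # equation (6.59)
a = (Swxx * Swy - Swx * Swxy) / delta     # equation (6.57)
b = (Sw * Swxy - Swx * Swy) / delta       # equation (6.58)
print(a, b)                               # about 128.6 and -0.50 for these data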
When a weighted fit is required, equations (6.62) and (6.63) can be used to calculate the uncertainties (in this case taken to be the standard errors) in a and b:
σ_a = [ Σ(x_i²/σ_i²) / Δ ]^(1/2)   (6.62)
σ_b = [ Σ(1/σ_i²) / Δ ]^(1/2)   (6.63)
where A is given by equation (6.59). Equations (6.62) and (6.63) are appli¬
cable as long as actual values of cr. are known, as relative magnitudes will
not do in this case. Although this might seem unduly restrictive, there is
one case in which cr. may be estimated fairly accurately and that is in
counting experiments (such as those involving radioactivity or X rays),
where the Poisson distribution is valid. If the number of counts recorded is
C_i, then the standard deviation in C_i, σ_i, is given by
σ_i = (C_i)^(1/2)   (6.64)
Exercise S
In a diffusion experiment, a radiotracer diffuses into a solid when the solid is heated
to a high temperature for a fixed period of time. The solid is sectioned and the
number of gamma counts is recorded by a particle counter over a period of 1 minute for each section. Table 6.33 shows the number of counts, C, as a function of depth, d, cut in the material. Assume that the equation that relates C to d is
C=Aexp(-Ad^)
(6.65)
(6.66)
and
(Ti. (6.67)
(nA)5
For completeness, we include the weighted linear correlation coefficient, which should be used whenever weighted least squares fitting is carried out; it is given by equation (6.68).
(6.68)
Example 11
The data shown in table 6.34 were obtained from a study of the relationship between current and voltage for a tunnel diode. For the range of voltage in table 6.34, the relationship between current, I, and voltage, V, for a tunnel diode can be written
/= CVexp (6.69)
Answer
(6.70)
y-in7 (6.72)
(6.73)
(6.74)
Table 6.35 shows the raw data, the transformed data and the sums of the columns necessary to calculate the weighted standard deviation, σ_w. For convenience we take the standard deviation in I, σ_I, to be equal to 1, so that, using equation (6.74), σ_y = 1/I.
(iii) The weighted standard deviation, calculated using equation (6.65) and the sums of numbers appearing in the bottom row of table 6.35, is σ_w = 0.1064.
Exercise T
Determine the intercept and slope of the transformed data in example 11. Calculate
also the standard errors in intercept and slope.
Table 6.35. Weighted fitting of data in table 6.34.
6.11 Review
Due to the ease with which modern analysis tools such as spread¬
sheets can fit an equation to data, it is easy to overlook the question:
‘ Should we really fit a straight line to data?’ To assist in answering this ques¬
tion we introduced the correlation coefficient and residual plots as quan¬
titative and qualitative indicators of ‘goodness of fit’ and have indicated
situations in which each can be misleading. We have also considered situ¬
ations in which data transformation is required before least squares is
applied. Often, after completing the data transformation we must forsake
unweighted least squares in favour of weighted least squares.
The technique of fitting an equation to data using least squares can be extended to situations in which there are more than two parameters to be estimated, for example y = α + βx + γx², or where there are more than two independent variables. This will be considered in the next chapter along with situations in which linear least squares cannot be applied to the analysis of data.
Problems
2. When a rigid tungsten sphere presses on a flat glass surface, the surface of the glass deforms. The relationship between the mean contact pressure, p_m, and the indentation strain, s, is
p_m = ks   (6.75)
Table 6.37 shows s-p_m data for indentations made into glass. k is a constant dependent upon the mechanical stiffness of the material. Taking p_m to be the dependent variable and s to be the independent variable, use least squares to determine k and the standard error in k.
u (m/s)          0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1
I (μA) ± 10 μA   270  300  300  330  300  330  350  330  330  360  340
u (m/s)          1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2
I (μA) ± 10 μA   340  360  350  370  340  370  360  360  380  380  390
u= (6.76)
where K_1 and K_2 are constants. Values for u and I obtained in a water flow experiment are shown in table 6.38. Assume I to be the dependent variable, and u the independent variable.
(i) Show that equation (6.76) can be rearranged into the form
+ (6.77)
(ii) Compare equation (6.77) with y = a + bx, and use a spreadsheet to perform a weighted least squares fit to the data to find K_1 and K_2.
(iii) Plot the standardised residuals. Can anything be concluded from the pattern of the residuals? (For example, is the weighting you have used appropriate?)
γ = kCⁿ   (6.78)
where C is the concentration of the acetic acid, and k and n are constants. γ-C data are given in table 6.39.
I- 4«)cos(20) + (6.79)
(i) Perform an unweighted fit to find the intercept and slope of a graph of I versus cos(2θ).
(ii) Use the values for the slope and intercept to find I_max and I_min.
(iii) Determine the standard error in I_max.
P_l_B
X~A (6.80)
where X is the mass of gas adsorbed per unit area of surface, P is the pressure of the gas, and A and B are constants. Table 6.41 shows data gathered in an adsorption experiment.
(i) Write equation (6.80) in the form y = a + bx. How do the constants in equation (6.80) relate to a and b?
P(N/m²)   X(kg/m²)
0.27      13.9 × 10⁻⁵
0.39      17.8 × 10⁻⁵
0.62      22.5 × 10⁻⁵
0.93      27.5 × 10⁻⁵
1.72      32.9 × 10⁻⁵
3.43      38.6 × 10⁻⁵
(ii) Using least squares, find values for a and b and standard errors in
a and b.
(iii) Estimate A and B and their respective standard errors.
P(Pa)   λ(mm)
1.5 35
5.8 20
5.8 30
6.5 25
8.0 16
13.1 9
14.5 10
18.9 7
24.7 6
27.6 6.5
(i) Fit the equation y = a + bx to equation (6.81) to find b and the standard error in b.
(ii) If the temperature of the gas is 298 K, use equation (6.81) to estimate the diameter of the gas molecules and the standard error in the diameter.
8. The intercept on the x axis, x_INT, of the best straight line through data is found by setting y = 0 in the equation y = a + bx, giving x_INT = −a/b. The standard error in x_INT is
σ_x_INT = (σ/b) [ 1/n + nȳ² / (b²(nΣx_i² − (Σx_i)²)) ]^(1/2)   (6.82)
ŷ₀ = ȳ + b(x₀ − x̄)
σ_ŷ₀ = σ [ 1/n + n(x₀ − x̄)² / (nΣx_i² − (Σx_i)²) ]^(1/2)   (6.83)
y = bx   (6.84)
Use the method outlined in appendix 3 to show that the slope, b, of the best line to pass through the points and the origin is given by
b = Σx_iy_i / Σx_i²
Use the method outlined in appendix 4 to show that the standard error in b, σ_b, is given by
σ_b = σ / (Σx_i²)^(1/2)   (6.85)
2.43 0.72
4.90 2.42
10.43 4.75
15.64 3.99
16.62 5.39
21.12 8.65
Least squares II
7.1 Introduction
y=a+bx
where a is the intercept of the line and b is its slope. Other situations that
we need to consider, as they occur regularly in science, require fitting equa¬
tions to x-y data where the equations:
Figure 7.1. Output voltage of a thermocouple between 270 K and 303 K.
best equation to fit to data, and what steps to take if two (or more) equa¬
tions fitted to the same data must be compared.
y = a + bx + cx²
y = a + bx + cx ln x
y = a + b/x + cx
The approach to finding an equation which best fits the data, where the
equation incorporates more than two parameters follows that described in
appendix 3. Though the complexity of the algebraic manipulations
increases as the number of parameters increases, this can be overcome by
using matrices to assist in parameter estimation.
χ² = Σ [ (y_i − ŷ_i) / σ_i ]²   (7.1)
where y_i is the ith value of y (obtained through measurement), ŷ_i is the corresponding predicted value of y found using an equation relating y to x, and σ_i is the standard deviation in the ith value of y.
As an example of fitting an equation with more than two parameters
to data, consider the equation
y = a + bx + cx²   (7.2)
If the uncertainty in y_i is the same for all x_i, then σ_i is replaced by σ, so that, for example,
∂χ²/∂a = −(2/σ²) Σ(y_i − a − bx_i − cx_i²) = 0   (7.5)
∂χ²/∂c = −(2/σ²) Σ x_i²(y_i − a − bx_i − cx_i²) = 0   (7.7)
These lead to the normal equations
na + bΣx_i + cΣx_i² = Σy_i   (7.8)
aΣx_i + bΣx_i² + cΣx_i³ = Σx_iy_i   (7.9)
aΣx_i² + bΣx_i³ + cΣx_i⁴ = Σx_i²y_i   (7.10)
We can rearrange equations (7.8) to (7.10) and substitute from one equation into another to solve for a, b and c. However, this approach is labour
intensive and time consuming and the likelihood of a numerical or alge¬
braic mistake is quite high. If the number of parameters to be estimated
increases to four or more, then solving for these estimates by ‘elimination
and substitution’ becomes even more formidable. An effective way to
proceed is to write the equations in matrix form, as solving linear equa¬
tions using matrices is quite efficient, especially if software is available that
can manipulate matrices.
Writing equations (7.8) to (7.10) in matrix form gives
a
b = ^XiYi
!_
M
M
M
AB = P (7.12)
where
n K a
A= ^Xi 24 B= b P= ^XiYi
24 24_ c _24yi_
To determine elements, a, b and cof B (which are the parameter estimates
appearing in equation (7.2)), equation (7.12) is manipulated to give^
B = A⁻¹P   (7.13)
where A⁻¹ is the inverse matrix of the matrix A. Matrix inversion and matrix
multiplication are tedious to perform ‘by hand’, especially if matrices are
large. The built in matrix functions in Excel® are well suited to estimating
parameters in linear least squares problems.
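Outside Excel the same matrix manipulations are available in any numerical library. The Python sketch below (illustrative only, not the book's procedure) sets up and solves the normal equations (7.8) to (7.10) for the polynomial y = a + bx + cx² using the table 7.1 data from later in this chapter; numpy.linalg.solve plays the role of MINVERSE() followed by MMULT().

import numpy as np

# Solving B = A^{-1} P (equation (7.13)) for a quadratic fit; sketch only.
x = np.array([4, 6, 8, 10, 12, 14, 16, 18, 20], dtype=float)     # table 7.1 data
y = np.array([30.4, 51.2, 101.6, 184.4, 262.6, 369.6, 479.4, 601.5, 764.9])

A = np.array([[len(x),        x.sum(),       (x**2).sum()],
              [x.sum(),       (x**2).sum(),  (x**3).sum()],
              [(x**2).sum(),  (x**3).sum(),  (x**4).sum()]])
P = np.array([y.sum(), (x*y).sum(), (x**2*y).sum()])

a, b, c = np.linalg.solve(A, P)       # equivalent to B = A^{-1} P
print(a, b, c)                        # about 1.92, -2.75, 2.03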
Exercise A
The relationship between the voltage across a semiconductor diode, V, and the tem¬
perature of the diode, T, is given by
Among the many built in functions provided by Excel® are those related to
matrix inversion and matrix multiplication. These functions can be used to
solve for the parameter estimates (and the standard errors in the esti¬
mates) appearing in an equation fitted to data.
= MINVERSE(array)
A B C
1 2.6 7.3 3.4
2 9.5 4.5 5.5
3 6.7 2.3 7.8
4
Example 1
Use the MINVERSE() function to invert the matrix shown in sheet 7.1.
ANSWER
Inverting a 3 × 3 matrix creates another 3 × 3 matrix. As the MINVERSE() function returns an array with nine elements, we highlight nine cells (cells E1:G3) into which
Excel® can return those elements. Sheet 7.2 shows the highlighted cells along with the
function =MINVERSE(A1:C3) typed into cell El. To complete the inversion of the
matrix shown in columns A to C in sheet 7.2, it is necessary to hold down the CTRL and
Shift keys then press the Enter key. Sheet 7.3 shows the elements of the inverted matrix
in columns E to G.
Sheet 7.2. Example of matrix inversion using the MINVERSE() function in Excel®.
A B C D E F G 1
1 2.6 7.3 3.4 =MINVERSE(A1:C3)
2 9.5 4.5 5.5
3 6.7 2.3 7.8
4
A B C D E F G
1 2.6 7.3 3.4 -0.09285 0.203164 -0.10278
2 9.5 4.5 5.5 0.154069 0.01034 -0.07445
3 6.7 2.3 7.8 0.034329 -0.17756 0.238445
4
Exercise B
Use Excel® to invert the following matrices
The elements of the P matrix can be found using the MMULT() function.
The syntax of the function is,
=MMULT(arrayl, array2)
where arrayl and array2 contain the elements of the matrices to be multi¬
plied together.
Example 2
Use the MMULT() function to determine the product, P, of the matrices A and B
shown in sheet 7.4.
ANSWER
Multiplying the 3X3 matrix by the 3X1 matrix in sheet 7.4 produces another matrix
of dimension 3X1. As the MMULT() function returns an array containing the ele¬
ments of the matrix P, we highlight cells (shown in column G of sheet 7.5) into which
those elements can be returned.
To determine P, type =MMULT(A2:C4,E2:E4) into cell G2. Holding down the
CTRL and Shift keys, then pressing the Enter key returns the elements of the P matrix
into cells G2 to G4 as shown in sheet 7.6.
A B C D E F G
1 A B
2 2.6 7.3 3.4 34.4
3 9.5 4.5 5.5 43.7
4 6.7 2.3 7.8 12.4
ABC D E F G H
1 A B P
2 2.6 7.3 3.4 34.4 =MMULT{A2:C4,E2:E4)
3 9.5 4.5 5.5 43.7
4 6.7 2.3 7.8 12.4
G
1 P
2 450.61
3 591.65
4 427.71
Exercise C
Use Excel® to carry out the following matrix multiplications:
12 45 67 56 32
56.8 123.5 67.8 23.1
34 54 65 43 19
87.9 12.5 54.3 34.6 (ii)
12 54 49 31 54
23.6 98.5 56.7 56.8
84 97 23 99 12
X y
4 30.4
6 51.2
8 101.6
10 184.4
12 262.6
14 369.6
16 479.4
18 601.5
20 764.9
A = [ n       Σx_i     Σx_i²  ]
    [ Σx_i    Σx_i²    Σx_i³  ]   (7.15)
    [ Σx_i²   Σx_i³    Σx_i⁴  ]
B = (a, b, c)ᵀ   (7.16)
and
P = (Σy_i, Σx_iy_i, Σx_i²y_i)ᵀ   (7.17)
Using the data in table 7.1 (and with the assistance of Excel®), we find
Σx_i = 108, Σy_i = 2845.6, Σx_i² = 1536, Σx_i³ = 24192, Σx_i⁴ = 405312, Σx_iy_i = 45206.6, Σx_i²y_i = 761100.4, n = 9.
Matrices A and P can now be written down. As B = A⁻¹P, we find (using Excel® for matrix inversion and multiplication)
a = 1.9157, b = −2.7458, c = 2.0344
Exercise D
The variation of the electrical resistance, R, with temperature, T, of a wire made from high purity platinum is shown in table 7.2. Assuming the relationship between R and T can be written
T(K)   R(Ω)
70 17.1
100 30.0
150 50.8
200 71.0
300 110.5
400 148.6
500 185.0
600 221.5
700 256.2
800 289.8
900 322.2
1000 353.4
For convenience, the elements of A⁻¹ are shown to four figures, but full precision (to 15 figures) is used 'internally' when calculations are carried out with Excel®.
1. deposition time, f,
2. the gas pressure in the vacuum chamber, P,
3. the distance, d, between deposition source and surface to be coated
with the thin film.
y=a+bx+cz (7.19)
(7.20)
(7.22)
To minimise χ², differentiate χ² with respect to a, b and c in turn and set the resulting equations equal to zero. Following the approach described in section 7.3, the matrix equation to be solved for a, b and c is
[ n      Σx_i     Σz_i    ] [ a ]   [ Σy_i    ]
[ Σx_i   Σx_i²    Σx_iz_i ] [ b ] = [ Σx_iy_i ]   (7.23)
[ Σz_i   Σx_iz_i  Σz_i²   ] [ c ]   [ Σz_iy_i ]
Table 7.3.
f     T      L
28    0.49   1.72
48    0.98   1.43
66    1.47   1.24
91    1.96   1.06
117   2.45   0.93
150   2.94   0.82
198   3.43   0.67
Example 3
In an experiment to study waves on a stretched string, the frequency, f, at which the string resonates is measured as a function of the tension, T, of the string and the length, L, of the string. The data gathered in the experiment are shown in table 7.3. Assuming that the frequency, f, can be written in the form
f = K T^B L^C
where K is a constant, use multiple least squares to determine best estimates of K, B and C.
ANSWER
Taking natural logarithms of both sides, the relationship can be written
y = a + bx + cz   (7.25)
where
y = ln f   (7.26)
x = ln T   (7.27)
and
z = ln L   (7.28)
Also, a = ln K, b = B and c = C.
Using the data in table 7.3, and the transformations given by equations (7.26) to (7.28), the matrices A and P in equation (7.23) can be evaluated. Solving B = A⁻¹P gives
a = 4.281, b = 0.4476, c = −1.157
so that K = e^4.281 ≈ 72.3, B ≈ 0.45 and C ≈ −1.16. Finally, we write
f ≈ 72.3 T^0.45 / L^1.16
Exercise E
Consider the data shown in table 7.4. Assuming that y is a function of x and z such
that y=a+bx+ cz, use linear least squares to solve for a, b and c.
y X z
20.7 1 1
23.3 2 2
28.2 3 4
35.7 4 5
48.2 5 11
56.0 6 14
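A fit with two independent variables is no harder to script than the single-variable case. The Python sketch below (illustrative only) solves exercise E by building the design matrix for y = a + bx + cz, solving the normal equations, and then estimating the standard errors from the diagonal of A⁻¹ as in equations (7.29) to (7.32) of the next section.

import numpy as np

# Fit y = a + b*x + c*z to the table 7.4 data and estimate standard errors; sketch only.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
z = np.array([1, 2, 4, 5, 11, 14], dtype=float)
y = np.array([20.7, 23.3, 28.2, 35.7, 48.2, 56.0])
n, M = len(y), 3                                        # M parameters: a, b and c

X = np.column_stack([np.ones(n), x, z])                 # design matrix
A = X.T @ X
P = X.T @ y
params = np.linalg.solve(A, P)                          # a, b, c

y_hat = X @ params
sigma = np.sqrt(np.sum((y - y_hat) ** 2) / (n - M))       # equation (7.32)
std_errs = sigma * np.sqrt(np.diag(np.linalg.inv(A)))     # equations (7.29)-(7.31)
print(params, std_errs)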
σ_a² = σ² (A⁻¹)₁₁   (7.29)
σ_b² = σ² (A⁻¹)₂₂   (7.30)
σ_c² = σ² (A⁻¹)₃₃   (7.31)
σ is determined using
σ = [ (1/(n − M)) Σ(y_i − ŷ_i)² ]^(1/2)   (7.32)
where M is the number of parameters appearing in the equation fitted to the data.
2 See Bevington and Robinson (1992) for a derivation of equations (7.29) to (7.31).
X y
2 -7.27
4 -17.44
6 -25.99
8 -23.02
10 -16.23
12 -15.29
14 1.85
16 21.26
18 46.95
20 71.87
22 105.97
Example 4
Consider the data in table 7.5. Assuming it is appropriate to fit the function y = a + bx + cx² to these data, determine, using linear least squares,
(i) a, b and c;
(ii) σ;
(iii) the standard errors, σ_a, σ_b and σ_c.
ANSWER
(i) Writing the equations to solve for a, b and c in matrix form gives (see section 7.3) B = A⁻¹P. The MINVERSE() and MMULT() functions in Excel® are used to evaluate A⁻¹, P and hence B.
Exercise F
Consider the thermocouple data in table 7.6. Assuming the relationship between V and T can be written as
V = A + BT + CT² + DT³
use linear least squares to determine best estimates for A, B, C, D and the standard errors in these estimates.
T(K)   V(μV)
75 -5.936
100 -5.414
125 -4.865
150 -4.221
175 -3.492
200 -2.687
225 -1.817
250 -0.892
275 0.079
300 1.081
ν = n − M   (7.36)
Example 5
In fitting the equation y= a + ^x+ yx^T- to 15 data points, it is found that estimates
of the parameters a, /3, y and S (written as a, b, c and d respectively) and their stan-
dard errors^ are
0.4651 <7^=0.02534
c= 0.1354 <7^=0.02263
d= 0.04502 <7^=0.01018
Use this information to determine the 95% confidence interval for each parameter.
ANSWER
The X% confidence interval for each parameter has the form
α = a ± t_X%,ν σ_a
and similarly for β, γ and δ. Here ν = n − M = 15 − 4 = 11, so t₉₅%,₁₁ = 2.201.
Standard errors are given to four significant figures to avoid rounding errors in subsequent calculations.
It follows that, for example, β = 0.4651 ± 2.201 × 0.02534 = 0.465 ± 0.056, γ = 0.135 ± 0.050 and δ = 0.045 ± 0.022.
Exercise G
A force, F, is required to displace the string of an archer's bow by an amount d. Assuming that the relationship between F and d can be written
(7.37)
where α, β and γ are parameters to be estimated using linear least squares, estimates of the parameters were determined by fitting equation (7.37) to F versus d data (not shown here) which consisted of 12 data points. Estimates of α, β and γ (written as a, b and c respectively) and their standard errors are
Use this information to determine the 90% confidence interval for α, β and γ.
χ² = Σ [ (y_i − ŷ_i) / σ_i ]²   (7.38)
To allow for uncertainty that varies from one y value to another, we retain σ_i² in the subsequent analysis. Equation (7.38) is differentiated with respect to each parameter in turn and the resulting equations are set equal to zero.
Weighted fitting is required if the uncertainty in each y value is not constant - see section 6.10.
Next we solve the equations for estimates of the parameters which minimise χ². As an example, if the function to be fitted to the data is the polynomial y = a + bx + cx², then a, b and c may be determined using matrices by writing B = A⁻¹P, where

$$B = \begin{pmatrix} a \\ b \\ c \end{pmatrix} \qquad (7.39)$$

$$A = \begin{pmatrix} \sum \frac{1}{\sigma_i^2} & \sum \frac{x_i}{\sigma_i^2} & \sum \frac{x_i^2}{\sigma_i^2} \\ \sum \frac{x_i}{\sigma_i^2} & \sum \frac{x_i^2}{\sigma_i^2} & \sum \frac{x_i^3}{\sigma_i^2} \\ \sum \frac{x_i^2}{\sigma_i^2} & \sum \frac{x_i^3}{\sigma_i^2} & \sum \frac{x_i^4}{\sigma_i^2} \end{pmatrix} \qquad (7.40)$$

and P is given by

$$P = \begin{pmatrix} \sum \frac{y_i}{\sigma_i^2} \\ \sum \frac{x_i y_i}{\sigma_i^2} \\ \sum \frac{x_i^2 y_i}{\sigma_i^2} \end{pmatrix} \qquad (7.41)$$

The diagonal elements of the A⁻¹ matrix can be used to determine the standard errors in a, b and c. We have¹⁰

$$\sigma_a^2 = (A^{-1})_{11} \qquad (7.42)$$
$$\sigma_b^2 = (A^{-1})_{22} \qquad (7.43)$$
$$\sigma_c^2 = (A^{-1})_{33} \qquad (7.44)$$
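A sketch of the weighted calculation of equations (7.39) to (7.44), assuming invented x, y and σ_i values (here σ_i is taken as 0.1y_i purely for illustration).

import numpy as np

# Invented illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
sig = 0.1 * y                       # assumed standard deviation in each y value

w = 1.0 / sig**2                    # weights 1/sigma_i^2
A = np.array([[w.sum(),        (w*x).sum(),    (w*x**2).sum()],
              [(w*x).sum(),    (w*x**2).sum(), (w*x**3).sum()],
              [(w*x**2).sum(), (w*x**3).sum(), (w*x**4).sum()]])
P = np.array([(w*y).sum(), (w*x*y).sum(), (w*x**2*y).sum()])

Ainv = np.linalg.inv(A)
a, b, c = Ainv @ P                          # weighted best estimates
se_a, se_b, se_c = np.sqrt(np.diag(Ainv))   # equations (7.42)-(7.44)
print(a, b, c, se_a, se_b, se_c)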
Exercise H
Consider the data in table 7.7. Assuming that it is appropriate to fit the equation y = a + bx + cx² to these data and that the standard deviation, σ_i, in each y value is given by σ_i = 0.1y_i, use weighted least squares to determine best estimates of a, b and c and the standard errors in these estimates.
X y
-10.5 143
-9.5 104
-8.3 79
-5.3 34
-2.1 12
1.1 19
2.3 29
$$R = \left[1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\right]^{1/2} \qquad (7.45)$$

where y_i is the ith observed value of y, ŷ_i is the predicted y value found using the equation representing the best line through the points and ȳ is the mean of the observed y values.¹¹ We argue that equation (7.45) is plausible by considering the summation Σ(y_i − ŷ_i)², which is the sum of squares of the residuals, SSR. If the line passes near to all the points then SSR is close to 0 (and is equal to 0 if all the points lie on the line) and so R tends to 1, as required by perfectly correlated data.
The square of the coefficient of multiple correlation is termed the coefficient of multiple determination, R², and is written

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \qquad (7.46)$$
R² gives the fraction of the square of the deviations of the observed y values about their mean which can be explained by the equation fitted to the data. So, for example, if R² = 0.92 this indicates that 92% of the scatter of the deviations can be explained by the equation fitted to the data.
" See Walpole, Myers and Myers (1998) for a discussion of equation (7.45).
Exercise I
Consider the data in table 7.8. Fitting the equation y = a + bx + cz to these data using least squares gives an equation for ŷ_i, the predicted value of y for given x_i and z_i. Use this information to determine the coefficient of multiple determination, R², for the data in table 7.8.
• estimates of parameters;
• standard errors in the estimates;
• the coefficient of multiple determination, R²;
• the standard deviation in the y values, σ, as given by equation (7.32).
By contrast, when there are two independent variables (such as x and z), an array with two columns of numbers must be entered into LINEST(). That array consists of one column of x values and an adjacent column of z values in the case where y = a + bx + cz is fitted to data. Where the same independent variable appears in two terms of an equation, such as in
A B c
1 y X z
2 28.8 1.2 12.8
3 42.29 3.4 9.8
4 50.69 4.5 6.7
5 66.22 7.4 4.5
6 73.12 8.4 2.2
7 81.99 9.9 1
8
9  =LINEST(A2:A7,B2:C7,TRUE,TRUE)
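The spreadsheet above enters a two-column x-z array into LINEST(). A comparable calculation can be written with a design matrix whose columns are 1, x and z; the sketch below uses the values shown in the sheet (treat the transcription as an assumption).

import numpy as np

y = np.array([28.8, 42.29, 50.69, 66.22, 73.12, 81.99])
x = np.array([1.2, 3.4, 4.5, 7.4, 8.4, 9.9])
z = np.array([12.8, 9.8, 6.7, 4.5, 2.2, 1.0])

# Design matrix with a column of ones (intercept), x and z.
X = np.column_stack([np.ones_like(x), x, z])
coeffs, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
a, b, c = coeffs
print(a, b, c)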
X y
0.02 -0.0261
0.04 -0.0202
0.06 -0.0169
0.08 -0.0143
0.10 -0.0124
0.12 -0.0105
0.14 -0.0094
0.16 -0.0080
0.18 -0.0067
Exercise J
Consider the data in table 7.9. Use LINEST() to fit the equation y = a + bx + c ln x to these data and so determine:
(i) a, b and c;
(ii) the standard errors in a, b and c;
(iii) the coefficient of multiple determination, R²;
(iv) the standard deviation, σ, in the y values.
$$R = \alpha + \beta T \qquad (7.47)$$

or

$$R = \alpha + \beta T + \gamma T^2 \qquad (7.48)$$

The adjusted coefficient of multiple determination, R²_ADJ, is given by

$$R^2_{ADJ} = \frac{(n-1)R^2 - (M-1)}{n-M} \qquad (7.49)$$

R² is given by equation (7.46), n is the number of data and M is the number of parameters. Once R²_ADJ is calculated for each equation fitted to data, the equation is preferred that has the larger value of R²_ADJ.
Another way of comparing two (or more) equations fitted to data where the equations have different numbers of parameters is to use the Akaike information criterion¹³ (AIC). This criterion takes into account the sum of squares of residuals, SSR, but also includes a term proportional to the number of parameters used. AIC may be written

$$AIC = n\ln SSR + 2M \qquad (7.50)$$
See Neter, Kutner, Nachtsheim and Wasserman (1996) for a discussion of equation
(7.49).
See Akaike (1974) for a discussion on model identification.
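Equations (7.49) and (7.50) are easy to wrap as small functions; in the sketch below the SSR values are those quoted in table 7.11, while the R² values are placeholders chosen only to show the mechanics.

import numpy as np

def r2_adj(r2, n, M):
    # Equation (7.49): adjusted coefficient of multiple determination.
    return ((n - 1)*r2 - (M - 1)) / (n - M)

def aic(ssr, n, M):
    # Equation (7.50): Akaike information criterion.
    return n*np.log(ssr) + 2*M

n = 21                       # number of data points in table 7.10
fits = [("two parameters",   5.344, 0.964, 2),    # SSR from table 7.11; R2 illustrative
        ("three parameters", 5.186, 0.965, 3)]
for label, ssr, r2, M in fits:
    print(label, r2_adj(r2, n, M), aic(ssr, n, M))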
Example 6
Table 7.10 shows the variation of the resistance of an alloy with temperature. Using (unweighted) linear least squares, fit both equation (7.47) and equation (7.48) to these data and determine for each equation the parameter estimates, R²_ADJ and the AIC.
ANSWER
R²_ADJ for equation (7.47) fitted to the data is greater than R²_ADJ for equation (7.48), indicating that equation (7.47) is the better fit. The AIC for equation (7.47) fitted to the data is lower than for equation (7.48), further supporting equation (7.47) as the more appropriate equation. Finally, an inspection of the standard errors of the parameter estimates suggests that, for the equation with three parameters, the standard error in c is so large in comparison to c that γ in equation (7.48) is 'redundant'.
Table 7.10. Resistance of an alloy at various temperatures.

R (Ω)   19.5  18.4  20.2  20.1  20.9  20.8  21.2  21.8  21.9  23.6  23.2
T (K)   150   160   170   180   190   200   210   220   230   240   250
R (Ω)   23.9  23.2  24.1  24.2  26.3  25.5  26.1  26.3  27.1  28.0
T (K)   260   270   280   290   300   310   320   330   340   350
We write the standard errors in the parameters to two significant figures in line
with the convention adopted in chapter 1.
Table 7.11. Parameter estimates found by fitting equations (7.47) and (7.48) to data in
table 7.10.^^
          equation (7.47)   equation (7.48)
R²_ADJ    0.9619            0.9610
SSR 5.344 5.186
AIC 39.20 40.57
Exercise K
Using the data in table 7.10 verify that the parameter estimates and the standard
errors in the estimates in table 7.11 are correct.
The equations that we have fitted to data so far have been linear in the
parameters and we have determined best estimates of those parameters
using linear least squares. However, there are many equations that occur in
the physical sciences that are not linear in the parameters, for example
$$I = A\exp(-Cx) + B \qquad (7.51)$$

(7.52)

(where A, B and C are parameters, I is the dependent variable and U is the independent variable).
We cannot apply the technique of linear least squares to either equation (7.51) or equation (7.52) as they do not permit the construction of a set of simultaneous equations which are linear in the unknown parameters.
In the interests of conciseness, units of measurement have been omitted from the table.
Differentiating χ² with respect to each parameter in turn and setting the resulting equations equal to 0 does not produce a set of linear equations that can be solved for 'best estimates'. The technique that is adopted to find parameter estimates is to begin with equation (7.54) and replace ŷ_i by the equation to be fitted. For example, if the equation to be fitted is y = a exp(−bx) + c, then

$$\hat{y}_i = a\exp(-bx_i) + c \qquad (7.55)$$

so that

$$\chi^2 = \sum_i \left[y_i - a\exp(-bx_i) - c\right]^2 \qquad (7.56)$$
The next stage is to 'guess' values (usually referred to as starting values) for a, b and c and then to calculate χ². Assuming that the guessed values are not those that will minimise χ², we need to begin a search for the best values of a, b and c by modifying the starting values in some systematic manner and at each stage determine whether χ² has reduced. If χ² has reduced then the modified starting values are closer to the best parameter estimates. The parameters are further adjusted until no more reduction in χ² is obtained. At this point we have the best estimates of the parameters we seek. Due to the iterative nature of fitting demanded by non-linear least squares, such fitting is done using a computer based mathematics or statistics package. We will not discuss the method of non-linear least squares in detail here but just point to two important issues to be aware of when using any package to fit an equation to data using the technique of non-linear least squares.
(i) If the starting values are quite different from the best estimates, then the non-linear least squares technique may converge very slowly to the best estimates. Worse still, convergence may never occur, possibly because, when the starting values are varied, the value obtained for χ² is beyond the range of numbers that the computer program can cope with and the program returns an error message.
(ii) It is possible, especially with noisy data, for a minimum in χ² to be found such that any further small changes in the parameter estimates cause χ² to increase. However, what has been found by the computer may, in fact, be a 'local minimum'. This means that there is another combination of parameter estimates that produces an even smaller value for χ² (often referred to as a 'global' minimum). Avoiding being trapped in a local minimum can be quite difficult and often relies on the user of the program 'knowing' that the parameter estimates obtained by the package are nonsense.
Excel® does not provide built in facilities for fitting equations to data using non-linear least squares.
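Although Excel® itself does not offer non-linear least squares, the iterative search described above is available in many packages. A minimal sketch using SciPy's curve_fit, with made-up data and starting values, fits y = a exp(−bx) + c.

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    return a*np.exp(-b*x) + c

# Made-up data roughly following an exponential decay plus an offset.
x = np.linspace(0, 10, 20)
y = 5.0*np.exp(-0.7*x) + 1.0 + np.random.normal(0, 0.05, x.size)

p0 = [4.0, 0.5, 0.5]                       # starting values ('guesses')
popt, pcov = curve_fit(model, x, y, p0=p0)
a, b, c = popt
perr = np.sqrt(np.diag(pcov))              # standard errors in the estimates
print(a, b, c, perr)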
7.12 Review
Problems
1. The equation
(iii) Assuming that equation (7.57) is appropriate to the data in table 7.12,
use matrices to solve for a, b and c.
2. The movement of a solute through a chromatography column can be described by the van Deemter equation,

$$H = A + \frac{B}{v} + Cv$$
X y
1 8.37
2 3.45
3 1.70
4 0.92
5 0.53
6 0.62
7 -0.06
8 0.63
9 0.16
10 -0.27
v (mL/minute)    H (mm)
3.5 9.52
7.5 5.46
15.7 3.88
20.5 3.48
25.8 3.34
36.7 3.31
41.3 3.13
46.7 3.78
62.7 3.55
78.4 4.24
96.7 4.08
115.7 4.75
125.5 4.89
where H is the plate height and v is the rate at which the mobile phase of the solute flows through the column; A, B and C are constants. For a particular column, H varies with v as given in table 7.13.
(i) Write the equations in matrix form that must be solved for best esti¬
mates of A, B and C, assuming unweighted fitting using least squares is
appropriate. (Hint: follow the steps given in section 7.3.)
t (s)    s (m)
0.0 135.2
0.5 159.8
1.0 163.1
1.5 183.6
2.0 181.2
2.5 181.6
3.0 189.5
3.5 175.2
4.0 162.2
4.5 136.2
5.0 113.8
5.5 83.0
6.0 49.6
3. An object is thrown off a building and its displacement above the ground
at various times after it is released is shown in table 7.14. The predicted
relationship between s and t is
4. To illustrate the way in which a real gas deviates from perfect gas behaviour, PV/RT is often plotted against 1/V, where P is the pressure, V the volume and T the temperature (in kelvin) of the gas. R is the gas constant. Values for PV/RT and V are shown in table 7.15 for argon gas at 150 K. Assuming the relationship between PV/RT and 1/V can be written

$$\frac{PV}{RT} = A + \frac{B}{V} + \frac{C}{V^2} + \frac{D}{V^3} \qquad (7.60)$$

use least squares to obtain best estimates for A, B, C and D and standard errors in the estimates. (Suggestion: let y = PV/RT and x = 1/V.)
V (cm³)    PV/RT
35 1.21
40 0.89
45 0.72
50 0.62
60 0.48
70 0.45
80 0.51
90 0.50
100 0.53
120 0.57
150 0.61
200 0.69
300 0.76
500 0.84
700 0.89
5. Consider the x-y data in table 7.16. Fit the equations y = a + bx and y = a + bx + cx² to these data and use the adjusted coefficient of multiple determination, R²_ADJ, and the Akaike information criterion to establish which equation better fits the data.
$$C_p = A + BT + CT^2 \qquad (7.61)$$

determine:
X y
5 361.5
10 182.8
15 768.6
20 822.5
25 1168.2
30 1368.6
35 1723.3
40 1688.7
45 1800.9
50 2124.5
55 2437.9
60 2641.2
T (K)    C_p (J mol⁻¹ K⁻¹)
300 29.43
350 30.04
400 30.55
450 30.86
500 31.52
550 31.71 .
600 32.10
650 32.45
700 32.45
750 32.80
800 33.11
850 33.38
900 33.49
950 33.85
1000 34.00
Model 1: The first model assumes that the contacts show semiconducting behaviour, where the relationship between R and T can be written

$$R = A\exp\left(\frac{B}{T}\right) \qquad (7.62)$$
Chapter 8
Tests of significance

8.1 Introduction
Such general goals give way to specific questions that we hope can be
answered by careful analysis of data gathered in well designed experi¬
ments. Questions that might be asked include:
^ When repeat measurements are made of a quantity, such as the diameter of a wire,
the population mean is taken to be the true value of the quantity, so long as
systematic errors are negligible (see section 5.9 for more details).
$$\bar{x} - z_{X\%}\,\sigma_{\bar{x}} \le \mu \le \bar{x} + z_{X\%}\,\sigma_{\bar{x}} \qquad (8.1)$$

which may also be written

$$\mu = \bar{x} \pm z_{X\%}\,\sigma_{\bar{x}} \qquad (8.2)$$

where x̄ is the sample mean, σ_x̄ = s/√n is the standard error of the mean and z_{X%} is the z value corresponding to the X% confidence level as given in table 8.1. The probability that the true value lies in the interval given by equation (8.2) is X%/100%.
As an example of determining a confidence interval, consider the 40
values of focal length of a convex lens in table 8.2 obtained by repeat measurements on one lens. The mean, x̄, and standard deviation, s, of the values in table 8.2 are

x̄ = 15.26 cm    s = 0.6080 cm

To determine the 95% confidence interval for the true value of the focal length, we use table 8.1, which gives z₉₅% = 1.96. Using equation (8.1) gives

$$15.26 - 1.96 \times \frac{0.6080}{\sqrt{40}} \le \mu \le 15.26 + 1.96 \times \frac{0.6080}{\sqrt{40}}$$

i.e. 15.07 cm ≤ μ ≤ 15.45 cm.
Suppose now that the manufacturer quotes the focal length of the lens as 15.00 cm. Should we be concerned that the mean of the measured values differs by 0.26 cm from the 'assured' value? The answer is yes and to justify this, consider figure 8.1,
which shows the probability distribution of sample means assuming the
population mean= 15.00 cm and the standard error of the mean is
0.6080/ V40 = 0.09613 cm. There is a probability of 0.95 that the mean of a
sample consisting of 40 values lies between (15.00— 1.96X0.09613) cm and
(15.00+1.96X0.09613) cm, i.e. between 14.81 cm and 15.19 cm. Put
another way, the probability that the mean of 40 values would lie outside
the interval 14.81 cm to 15.19 cm is 1 − 0.95 = 0.05. It appears unlikely (i.e. the
probability is less than 0.05) that a sample consisting of 40 values with a
mean of 15.26 cm has been drawn from a population which has a mean of
15.00 cm. In short, there is a significant difference between the anticipated
value of focal length of the lens (15.00 cm) and the mean of the values in
table 8.2 (15.26 cm).
Though there is a significant difference between anticipated focal
length and the sample mean, we must decide whether the difference is
important. For example, if the lens in question is expensive, or perhaps was
purchased to replace another lens of focal length 15.00 cm, we may show
the data to the manufacturer and request a replacement. On the other
hand, if the lens is to be used to demonstrate the principles of image for¬
mation, a difference of 0.26 cm between the focal length as specified by the
manufacturer and the true focal length may not be regarded as important.
This is a point easy to overlook in hypothesis testing: a difference between
two numbers may be ‘statistically significant’, but too small to be impor¬
tant in a practical sense.
Another question we might ask regarding the focal length data in
table 8.2 is:
If the population mean is 15.00 cm, what is the probability that we
would obtain by chance, a mean based on 40 repeat measurements
that differs as much as 0.26 cm from the population mean?
Figure 8.2. The probability that a sample mean lies outside the interval 14.74 cm to
15.26 cm is equal to the sum of the shaded areas.
In order to answer this question we redraw figure 8.1, and shade the areas
corresponding to the probability that a sample mean differs from 15.00 cm
by at least 0.26 cm, as shown in figure 8.2. The sum of the shaded areas is
the probability that a sample mean would differ from a population with a
mean of 15.00 cm by 0.26 cm or more when the standard error of the mean
is 0.09613 cm. The shaded areas can be determined with the aid of tables,
or a computer package such as Excel®. Notice that due to the symmetry of
the normal distribution, the shaded area in each tail of the distribution
shown in figure 8.2 is the same, and so long as we calculate the area in one
tail, doubling that area gives the total area required.
One method of finding the area in the tails is to find the z value corresponding to the cumulative probability P(−∞ < x̄ ≤ 15.26 cm), where z is given by

$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{15.26 - 15.00}{0.09613} = 2.70$$
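The tail areas can equally be obtained outside Excel®. A short sketch, assuming the numbers from the focal length example above, uses the cumulative normal distribution (the analogue of NORMSDIST()).

from scipy import stats

z = (15.26 - 15.00) / 0.09613          # z value for the sample mean
p_one_tail = 1 - stats.norm.cdf(z)     # area in one tail of the distribution
p_two_tail = 2 * p_one_tail            # both tails (symmetric), roughly 0.007
print(z, p_two_tail)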
Population parameter                           Sample statistic
Mean, μ                                        Mean, x̄
Standard deviation, σ                          Standard deviation, s
Correlation coefficient, ρ                     Correlation coefficient, r
Intercept, α (of a line through x-y data)      Intercept, a
Slope, β (of a line through x-y data)          Slope, b
² As the number of values, n, in the sample tends to infinity, a sample statistic tends to its corresponding population parameter. So, for example, as n → ∞, x̄ → μ.
It is usual to denote hypothesised population parameters by attaching the subscript '0' to the symbol used for that parameter.
$$\frac{X\%}{100\%} + \alpha = 1 \qquad (8.3)$$

Figure 8.3. Relationship between level of significance, α, and confidence level, X%.

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \qquad (8.4)$$
Example i
The masses of a sample of 40 weights measured with an electronic balance are shown
in table 8.4. The experimenter assumes that the masses have been drawn from a pop¬
ulation with a mean of 50.00 g. Is this assumption reasonable?
ANSWER
Weights (g)
50.06 50.02 50.18 50.05 50.05 50.12 50.05 50.07 50.20 49.98
50.13 50.05 49.99 49.99 49.94 50.20 50.03 49.97 50.09 49.77
50.10 49.93 49.99 50.01 50.12 50.16 50.12 50.22 50.13 50.09
50.22 50.09 50.07 50.02 50.05 50.14 50.10 50.18 50.08 50.02
Exercise A
Bandgap reference diodes are used extensively in electronic measuring equipment as
they provide highly stable voltages against which other voltages can be compared.®
The manufacturer’s specification indicates that the nominal voltage across the diodes
should be 1.260 V. A sample of 36 diodes is tested and the voltage across each diode is
shown in table 8.6. Using these data, test whether the diodes have been drawn from a
population with a mean of 1.260 V. Reject the null hypothesis at a = 0.05.
Table 8.5. Step by step description of the hypothesis test for example 1.

Step: Decide the purpose of the test.
Details/comment: To compare a sample mean with a hypothesised population mean.

Step: Choose the significance level of the test, α.
Details/comment: No indication is given in the question of the level of significance at which we should test the null hypothesis, so we choose the 'most commonly' used level, i.e. α = 0.05.

Step: Determine the critical value of the test statistic, z_crit, based on the chosen significance level.
Details/comment: When the area in one tail of the standard normal distribution is 0.025 (= α/2), the magnitude of the z value, found using table 1 in appendix 1, is 1.96. Hence z_crit = 1.96.
(i) Calculate the mean, x̄, and the standard deviation, s, and determine the standard error of the mean, using the sample data.
(ii) For a hypothesised mean, μ₀, calculate the value of the z statistic, z, using equation (8.4).
(iii) Use Excel®'s NORMSDIST() function to determine the area in the tail of the distribution⁶ between z = −∞ and z = −|z|.
(iv) Multiply the area by 2 to obtain the total area in both tails of the distribution.
(v) If a sample is drawn from a population with mean μ₀, then the probability that the sample mean would be at least as far from μ₀ as the observed mean is equal to the sum of the areas in the tails of the distribution.
Example 2
Consider the values in table 8.7. Use Excel® to determine the probability that these
values have been drawn from a population with a mean of 100.0.
ANSWER
The data in table 8.7 are shown in sheet 8.1. Column F of sheet 8.1 contains the
formulae required to determine the sample mean, standard deviation and so on.
Sheet 8.2 shows the values returned in column F. The number 0.067 983 returned in
cell F6 is the probability of obtaining a sample mean at least as far from the popula¬
tion mean of 100.0 as 101.85. As this probability is greater than 0.05 (the commonly
chosen level of significance, a) we cannot reject a null hypothesis that the sample is
drawn from a population with mean equal to 100.0.
⁶ By calculating the area between −∞ and −|z| we are always choosing the tail to the left of z = 0.
A B C D E F
1 100.5 95.6 103.2 108.4 hypothesised mean 100.0
2 96.6 98.5 93.5 92.8 sample mean =AVERAGE(A1:D9)
3 102.7 100.2 100.4 100.1 standard deviation =STDEV(A1:D9)
4 106.3 113.9 98.7 110.3 standard error of mean =F3/36^0.5
5 108.0 110.7 91.1 100.8 z-value =(F2-F1)/F4
6 97.1 98.1 91.4 99.2 probability =2*NORMSDIST(-ABS(F5))
7 108.7 101.6 101.1 99.4
8 93.9 104.7 106.5 111.6
9 107.5 100.0 111.9 101.6
E F
1 hypothesised mean 100.0
2 sample mean 101.85
3 standard deviation 6.0818
4 standard error of mean 1.013633
5 z-value 1.825118
6 probability 0.067983
7
Exercise B
Titanium metal is deposited on a glass substrate producing a thin film of nominal
thickness 55 nm. The thickness of the film is measured at 30 points, chosen at
random, across the film. The values obtained are shown in table 8.8. Determine
whether there is a significant difference, at a = 0.05, between the mean of these
values and a hypothesised population mean of 55 nm.
61 53 59 61 60 54 56 56 52 61
57 54 50 60 58 58 55 55 53 58
61 52 53 53 59 61 59 53 54 55
H₁: μ > μ₀,
where μ₀ is the expected population mean based on the amount of gold recovered using the 'old' process. By contrast, if a surface coating on a lens is designed to reduce the amount of reflected light, then we would write:
H₁: μ < μ₀.
Here μ₀ would correspond to the expected reflection coefficient for the lens in the absence of a coating.
Tests in which the alternative hypothesis is μ > μ₀ or μ < μ₀ are
referred to as one tailed tests. As the name implies, we consider areas only
in one tail of the distribution as shown in figure 8.5. If the significance level,
a = 0.05 is chosen for the test, then the shaded area in figure 8.5 would be
equal to 0.05. Any value of z, determined using experimental data, that falls in the shaded region such that z > z_crit would mean that the null hypothesis is rejected at the α = 0.05 level of significance.
In order to determine z_crit for a one tailed test, we calculate (1 − α).
This gives the area under the standard normal curve between z = −∞ and z = z_crit. The z value corresponding to the area (1 − α) is found using table 1 in appendix 1.
Example 3
Determine the critical value of the z statistic for a one tailed test when the level of significance, α, is 0.1.
ANSWER
If α = 0.1, then (1 − α) = 0.9. Referring to table 1 in appendix 1, if the area under the standard normal curve between z = −∞ and z = z_crit is 0.9, then z_crit = 1.28.
Exercise C
Determine the critical value of the z statistic for a one tailed test when the level of
significance a is equal to:
(i) 0.2,
(ii) 0.05,
(iii) 0.01,
(iv) 0.005.
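The same critical values can be read from the inverse cumulative normal distribution; a short sketch using scipy.stats covers example 3 and the cases in exercise C.

from scipy import stats

for alpha in [0.1, 0.2, 0.05, 0.01, 0.005]:
    # One tailed critical value: z such that the area to its left is 1 - alpha.
    print(alpha, stats.norm.ppf(1 - alpha))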
We can never be sure that the null hypothesis is true or false, for if we could
there would be no need for hypothesis testing! This leads to two undesir¬
able outcomes:
$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \qquad (8.5)$$
Note type I and type II errors are not the same as the experimental errors
discussed in chapter 5.
" See Devore (1991) for a discussion of type II errors.
than 30, s can no longer be regarded as a good estimate of σ, and we use the statistic, t, given by

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \qquad (8.6)$$
Example 4
Table 8.9 shows values of the acceleration due to gravity, g, obtained through experiment. Are these values significantly different from a hypothesised value of g = 9.81 m/s²? Test the hypothesis at the α = 0.05 level of significance.
ANSWER
Using the values in table 8.9, the sample mean is 9.752 m/s² and the standard error of the mean is 0.02782 m/s², so that

$$t = \frac{9.752 - 9.81}{0.02782} = -2.085$$

The critical value of the test statistic, with α = 0.05 and ν = n − 1 = 4, is found from table 2 in appendix 1 to be 2.776. As |t| < t_crit, we cannot reject the null hypothesis, i.e. there is no significant difference at α = 0.05 between the sample mean and the hypothesised population mean.
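A sketch of the same one-sample t test; the five g values below are invented stand-ins for table 8.9 (which is not reproduced here), so only the procedure, not the numbers, carries over.

import numpy as np
from scipy import stats

g = np.array([9.70, 9.77, 9.72, 9.79, 9.78])    # invented values, not table 8.9
t_stat, p_value = stats.ttest_1samp(g, popmean=9.81)

nu = len(g) - 1
t_crit = stats.t.ppf(0.975, nu)     # two-tailed critical value at alpha = 0.05
print(t_stat, p_value, t_crit)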
Exercise D
The unit cell is the building block of all crystals. An experimenter prepares several
crystals of a ceramic material and compares the size of the unit cell to that published
by another experimenter. One dimension of the unit cell is the lattice dimension, c.
Table 8.10 shows eight values of the c dimension obtained by the experimenter. At the
a = 0.05 level of significance, determine whether the sample mean of the values in
table 8.10 differs significantly from the published value of c of 1.1693 nm.
(i) the best estimate of the parameter found using least squares;
(ii) the standard error in the estimate;
(iii) to choose a significance level for the test.
As the number of x-y pairs in a least squares analysis is routinely less than
30, we assume that the distribution of parameter estimates follows a t dis¬
tribution. For the intercept of a line through data we hypothesise H₀: α = α₀ and calculate the test statistic

$$t = \frac{a - \alpha_0}{\sigma_a} \qquad (8.7)$$

Similarly, for the slope we hypothesise H₀: β = β₀ and calculate

$$t = \frac{b - \beta_0}{\sigma_b} \qquad (8.8)$$
Example 5
Consider the x-y data in table 8.11.
(i) Using the values in table 8.11, find the intercept, a, and slope, b, of the best
straight line through the points using unweighted least squares.
(ii) Calculate the standard errors in a and b.
X
y
0.1 -0.9
0.2 -2.1
0.3 -3.6
0.4 -4.7
0.5 -5.9
0.6 -6.8
0.7 -7.8
0.8 -9.1
(iii) Determine, at the 0.05 significance level, whether the intercept and slope are
significantly different from zero.
ANSWER
Solving for the intercept using unweighted least squares gives a = 0.06786 and the standard error in the intercept σ_a = 0.1453. Similarly, the slope b is −11.51 and the standard error in the slope is σ_b = 0.2878. The number of degrees of freedom, ν (= n − 2), is 8 − 2 = 6.
To test if the intercept is significantly different from zero, we write the null and alternative hypotheses as:
H₀: α = 0;
H₁: α ≠ 0.
The value of the test statistic t (= a/σ_a) is 0.06786/0.1453 = 0.4670. Table 2 in appendix 1 gives, for a two tailed test with ν = 6 and α = 0.05, t_crit = 2.447. As |t| < t_crit we cannot reject the null hypothesis, i.e. the intercept is not significantly different from zero.
To test if the slope is significantly different from zero, we write the null and alternative hypotheses as:
H₀: β = 0;
H₁: β ≠ 0.
The value of the test statistic t (= b/σ_b) is −11.51/0.2878 = −40.0. As |t| ≫ t_crit, we reject the null hypothesis, i.e. the slope is significantly different from zero.
A short hand way of writing the critical value of t which corresponds to the significance level, α, for ν degrees of freedom is t_{α,ν}.
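For completeness, here is a sketch of the whole calculation in example 5, using the x-y values transcribed from table 8.11; the standard-error formulae are the usual unweighted least squares expressions, so treat it as an illustration rather than the book's own worksheet.

import numpy as np
from scipy import stats

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
y = np.array([-0.9, -2.1, -3.6, -4.7, -5.9, -6.8, -7.8, -9.1])   # table 8.11

n = len(x)
b, a = np.polyfit(x, y, 1)                  # slope and intercept
residuals = y - (a + b*x)
sigma = np.sqrt((residuals**2).sum() / (n - 2))
sxx = ((x - x.mean())**2).sum()
se_b = sigma / np.sqrt(sxx)                 # standard error in the slope
se_a = sigma * np.sqrt((x**2).sum() / (n*sxx))   # standard error in the intercept

for name, est, se in [("intercept", a, se_a), ("slope", b, se_b)]:
    t = est / se
    p = 2*(1 - stats.t.cdf(abs(t), n - 2))
    print(name, est, se, t, p)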
Exercise E
Table 8.12 shows the variation in voltage across a germanium diode which was mea¬
sured as the temperature of the diode increased from 250 K to 360 K. Assuming there
is a linear relationship between voltage and temperature, find the intercept and
slope of the best line through the data in table 8.12. Test whether the intercept and
slope differ from zero at the a = 0.05 significance level.
T (K)    V (V)
250 0.440
260 0.550
270 0.469
280 0.486
290 0.508
300 0.494
310 0.450
320 0.451
330 0.434
340 0.385
350 0.458
360 0.451
As variability occurs in measured values, two samples will seldom have the
same mean even if they are drawn from the same population. There are sit¬
uations in which we would like to compare two means and establish
whether they are significantly different.
For example, if a specimen of water from a river is divided and sent to
two laboratories for analysis of lead content, it would be reasonable to
anticipate that the difference between the means of the values obtained for
lead content by each laboratory would not be statistically significant. If a
significant difference is found then this might be traced to shortcomings in
the analysis procedure of one of the laboratories or perhaps inconsisten¬
cies in storing or handling the specimens of river water. Other circum¬
stances in which we might wish to compare two sets of data include where:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma_{\bar{x}_1 - \bar{x}_2}} \qquad (8.9)$$

where

$$\sigma_{\bar{x}_1 - \bar{x}_2} = s\left(\frac{1}{n_1} + \frac{1}{n_2}\right)^{1/2} \qquad (8.10)$$

where n₁ and n₂ are the number of values in sample 1 and sample 2 respectively. If the sample sizes are the same, such that n₁ = n₂ = n, equation (8.10) becomes

$$\sigma_{\bar{x}_1 - \bar{x}_2} = s\left(\frac{2}{n}\right)^{1/2} \qquad (8.11)$$

The pooled estimate of the standard deviation, s, is found using

$$s^2 = \frac{s_1^2(n_1 - 1) + s_2^2(n_2 - 1)}{n_1 + n_2 - 2} \qquad (8.12)$$

which, for n₁ = n₂, reduces to

$$s^2 = \frac{s_1^2 + s_2^2}{2} \qquad (8.13)$$
Tape 1 Tape 2
65 176
128 125
87 95
145 255
210 147
85 88
Example 6
In order to compare the adhesive properties of two types of adhesive tape, each tape
was pulled from a clean glass slide by a constant force. The time for 5 cm of each tape
to peel from the glass is shown in table 8.13. Test at the a = 0.05 level of significance
whether there is any significant difference in the peel off times for tape 1 and tape 2.
ANSWER
H₀: μ₁ = μ₂ (we hypothesise that the population means for both tapes are the same);
H₁: μ₁ ≠ μ₂ (two tailed test).
Using the data in table 8.13, we obtain x̄₁ = 120.0 s, s₁ = 53.16 s, x̄₂ = 147.7 s, s₂ = 61.92 s. Using equation (8.13), we have

$$s = \left(\frac{2825.6 + 3834.3}{2}\right)^{1/2} = 57.71\ \mathrm{s}$$

so that, from equation (8.11), σ_{x̄₁−x̄₂} = 57.71 × (2/6)^{1/2} s = 33.32 s. The value of the test statistic is found using equation (8.9), i.e.

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma_{\bar{x}_1 - \bar{x}_2}} = \frac{120.0 - 147.7}{33.32} = -0.8304$$

Here the number of degrees of freedom, ν = n₁ + n₂ − 2 = 10. The critical value of the test statistic for a two tailed test at α = 0.05, found from table 2 in appendix 1, is 2.228. As |t| < t_crit we cannot reject the null hypothesis, i.e. based on the data in table 8.13 there is no reason to believe that the peel off time of tape 1 differs from that of tape 2.
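The same comparison can be made directly from the raw peel off times with scipy.stats.ttest_ind (the analogue of Excel®'s TTEST() with type 2); the values below are transcribed from table 8.13.

from scipy import stats

tape1 = [65, 128, 87, 145, 210, 85]
tape2 = [176, 125, 95, 255, 147, 88]

# Two-sample t test assuming equal population variances.
t_stat, p_value = stats.ttest_ind(tape1, tape2, equal_var=True)
print(t_stat, p_value)     # t close to -0.83, p close to 0.43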
Exercise F
As a block slides over a surface, friction acts to oppose the motion of the block. An experiment is devised to determine whether the friction depends on the area of contact between a block and a wooden surface. Table 8.14 shows the coefficient of kinetic friction, μ_k, for two blocks with different contact areas between block and surface. Test at the α = 0.05 level of significance whether μ_k is independent of contact area.
Block 1 Block 2
(contact area =174 cm^) (contact area = 47 cm^)
0.388 0.298
0.379 0.315
0.364 0.303
0.376 0.300
0.386 0.290
0.373 0.287
TTEST(array1, array2, tails, type)
where array1 and array2 point to cells which contain the values of sample 1 and sample 2 respectively. The argument 'tails' has the value 1 for a one tailed test and the value 2 for a two tailed test. 'Type' refers to which t test should be performed. For the two sample test considered in section 8.6, type is equal to 2.
The number returned by the function is the probability that the sample means would differ by at least x̄₁ − x̄₂, when μ₁ = μ₂.
Example 7
Is there any significant difference in the peel off times between tape 1 and tape 2 in table 8.13?
ANSWER
H₀: μ₁ = μ₂; H₁: μ₁ ≠ μ₂.
Sheet 8.3 shows the data entered from table 8.13 along with the TTEST() function. When the Enter key is pressed, the number 0.425681 is returned into cell A9. If the null hypothesis is correct, then the probability that the sample means, x̄₁ and x̄₂, will differ by at least as much as x̄₁ − x̄₂ is ≈ 0.43 (or 43%). This probability is so large that we infer that the values are consistent with the null hypothesis and so we cannot reject that hypothesis.
A B C
1 tape1 tape2
2 65 176
3 128 125
4 87 95
5 145 255
6 210 147
7 85 88
8
9  =TTEST(A2:A7,B2:B7,2,2)
Exercise G
Table 8.15 shows values of the lead content of water drawn from two locations on a
river. Use Excel®'s TTEST() function to test the null hypothesis that the mean lead content of both samples is the same, at α = 0.05.
Another important type of test is that in which the contents of data sets are
naturally linked or ‘paired’. For example, suppose two chemical processes
used to extract copper from ore are to be compared. Assuming that the
processes are equally efficient at extracting the copper, then the amount
obtained from any particular batch of ore should be independent of the
process used, so long as account is taken of random errors.
Other situations in which a paired t test might be considered include comparing:
By calculating the mean of the differences, d̄, of the paired values, and the standard deviation of the differences, s_d, the t test statistic can be determined using

$$t = \frac{\bar{d} - \delta_0}{s_d/\sqrt{n}} \qquad (8.14)$$

where δ₀ is the hypothesised population mean of the differences and n is the number of pairs. When the hypothesised difference is zero, equation (8.14) becomes

$$t = \frac{\bar{d}}{s_d/\sqrt{n}} \qquad (8.15)$$
Example 8
Table 8.16 shows the amount of copper extracted from ore using two processes. Test
at a = 0.05 whether there is a significant difference in the yield of copper between the
two processes.
ANSWER
Though it is possible to calculate the mean yield for process A and compare that with the mean of the yield for process B, the variability between batches (say, due to the batches being obtained from different locations) encourages us to consider a comparison between yields on a batch by batch basis.
To perform the hypothesis test we write:
H₀: δ = 0 (the null hypothesis is that the population mean of the differences between the paired values is zero);
H₁: δ ≠ 0 (two tailed test).
Using the data in table 8.16, we find d̄ = 0.8875% and s_d = 1.229%. Substituting these numbers into equation (8.15) (and noting the number of pairs, n = 8), we have

t = 2.042

The number of degrees of freedom, ν = n − 1 = 7. For a two tailed test, the critical value of the test statistic, for ν = 7 and α = 0.05, found from table 2 in appendix 1, is 2.365. As |t| < t_crit we cannot reject the null hypothesis, i.e. based on the data in table 8.16 there is no difference in the efficiency of the extraction methods at the 0.05 level of significance.
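A sketch of a paired t test with scipy.stats.ttest_rel; the two arrays are invented paired yields (table 8.16 is not reproduced here), so only the structure of the call carries over.

import numpy as np
from scipy import stats

# Invented paired yields (%) for two processes applied to the same batches.
process_a = np.array([18.2, 19.5, 17.8, 20.1, 18.9, 19.7, 18.4, 19.0])
process_b = np.array([17.5, 19.1, 17.2, 19.0, 18.3, 18.6, 18.0, 18.1])

t_stat, p_value = stats.ttest_rel(process_a, process_b)

# Equivalent 'by hand' route via equation (8.15).
d = process_a - process_b
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
print(t_stat, t_manual, p_value)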
Exercise H
The emfs of a batch of 9 V batteries were measured before and after a storage period
of 3 months. Table 8.17 shows the emfs of eight alkaline batteries before and after
the storage period. Test at the a = 0.05 level of significance whether the emfs of the
batteries have changed over the storage period.
Battery    emf before storage (V)    emf after storage (V)
1          9.49                      9.48
2 9.44 9.42
3 9.46 9.46
4 9.47 9.46
5 9.44 9.41
6 9.43 9.40
7 9.39 9.40
8 9.46 9.44
where arrayl and array2 point to cells which contain the paired values of
sample 1 and sample 2 respectively. The argument ‘tails’ is equal to 1 for a
one tailed test and equal to 2 for a two tailed test. For the paired t test, the argument 'type' is equal to 1.
Exercise I
Consider the data given in table 8.16. Use the TTEST() function to determine the probability that the mean difference in yields, d̄, would be 0.8875% or larger. Assume that the null and alternative hypotheses given in example 8 still apply.
8.8.1 The F distribution
A test that is routinely used to compare the variability of two samples is the 'F test'. This test is based on the F distribution, first introduced by (and named in honour of) R. A. Fisher, a pioneer in the development of data analysis techniques and experimental design.
If we take two random samples of sizes n₁ and n₂ from the same normally distributed population, then the ratio of the sample variances, s₁²/s₂², follows an F distribution.
The population mean, μ = 32, and the population standard deviation, σ = 3, were arbitrarily chosen for the simulation.
18 See Graham (1993) for discussion of equation 8.17.
Figure 8.8. The F distribution showing the upper and lower critical values for a two tailed test at the α significance level.
8.8.2 The F test
In order to determine whether the variances of two sets of data could have come from populations with the same population variance, we begin by stating the null hypothesis that the population variances for the two sets of data are the same, i.e. σ₁² = σ₂². The alternative hypothesis (for a two tailed test) would be σ₁² ≠ σ₂². The next stage is to determine the critical value for the F statistic. If the F value determined using the data exceeds the critical value, then we have evidence that the two sets of data do not come from populations that have the same variance.
An added consideration when employing the F test is that, owing to the fact that the distribution is not symmetrical, there are two critical F values of unequal magnitude as indicated in figure 8.8. This figure shows the upper and lower critical F values, F_U and F_L respectively, for a two tailed test at the significance level, α. If the F value determined using the data exceeds F_U, then we reject the null hypothesis (for a given significance level, α). If the F value is less than F_L, then again we reject
the null hypothesis. The difficulty of having two critical values can be overcome if, when the ratio s₁²/s₂² is calculated, the larger variance estimate is always placed in the numerator, so that s₁²/s₂² ≥ 1. In doing this we need only consider the rejection region in the right hand tail of the F distribution. Critical values for F (in which the larger of the two variances is placed in the numerator) are given in table 3 in appendix 1 for various probabilities in the right hand tail of the distribution.
Example 9
Consider the two samples of capacitance values in table 8.18. Determine at the α = 0.05 level of significance whether there is a significant difference between the variances of the two samples.
ANSWER
Using the data in table 8.18, the variance of sample 1 is s₁² = 1.575 × 10⁻² μF² and the variance of sample 2 is s₂² = 9.500 × 10⁻⁴ μF², hence F = s₁²/s₂² = 16.58. The number of degrees of freedom for each sample is one less than the number of values, i.e. ν₁ = ν₂ = 5.
The critical value for F is obtained using table 3 in appendix 1. For α = 0.05, the area in the right hand tail of the F distribution for a two tailed test is 0.025. For ν₁ = ν₂ = 5, F_crit = 7.15. As F > F_crit this indicates that there is a significant difference between the variances of sample 1 and sample 2 and so we reject the null hypothesis.
Sample 1            Sample 2
capacitance (μF)    capacitance (μF)
2.25 2.23
2.05 2.27
2.27 2.19
2.13 2.20
2.01 2.25
1.97 2.21
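The F value and the critical value can be checked with scipy.stats; the capacitance values are transcribed from table 8.18, and f.ppf plays the role of table 3 in appendix 1 (or Excel®'s FINV()).

import numpy as np
from scipy import stats

sample1 = np.array([2.25, 2.05, 2.27, 2.13, 2.01, 1.97])   # table 8.18
sample2 = np.array([2.23, 2.27, 2.19, 2.20, 2.25, 2.21])

s1, s2 = sample1.var(ddof=1), sample2.var(ddof=1)
F = max(s1, s2) / min(s1, s2)             # larger variance in the numerator
nu1 = nu2 = len(sample1) - 1

# Upper critical value for a two tailed test at alpha = 0.05 (area 0.025 in the tail).
F_crit = stats.f.ppf(1 - 0.025, nu1, nu2)
print(F, F_crit)                           # roughly 16.6 and 7.15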
Exercise J
In an experiment to measure the purity of reference quality morphine, quantitative nuclear magnetic resonance (NMR) determined the purity as 99.920% with a standard deviation, s, of 0.052% (number of repeat measurements, n = 7). Using another technique called isotope dilution gas chromatography mass spectrometry (GCMS) the purity of the same material was determined as 99.879% with a standard deviation, s, of 0.035% (again the number of repeat measurements, n = 7). Determine at the α = 0.05 level of significance whether there is a significant difference in the variance of the values obtained using each analytical technique.
Excel®'s FINV() function can be used to determine the upper critical value from the F distribution, so avoiding the need to consult tables of critical values. The syntax of the function is

FINV(probability, ν₁, ν₂)

where probability refers to the area in the right hand tail of the F distribution as shown in figure 8.8, ν₁ is the number of degrees of freedom for sample 1 and ν₂ is the number of degrees of freedom for sample 2.
Example 10
Determine the upper critical value of the F distribution when ν₁ = 5, ν₂ = 7 and a one tailed test is to be carried out at the α = 0.05 level of significance.
ANSWER
Using Excel®, we type into a cell =FINV(0.05,5,7). After pressing the Enter key, Excel® returns the number 3.971522.
Exercise K
What would be the critical F value in example 10 if a two tailed test at a = 0.05 level
of significance were to be carried out?
If the values in a sample are not normally distributed or if the sample contains outliers, then a significance test which is believed to be being carried out at, say, α = 0.05 could in fact be either above or below that significance level. That is, the F test is sensitive to the actual distribution of the values. As normality is difficult to establish when sample sizes are small, some workers are reluctant to bestow as much confidence in F tests as in t tests (t tests can be shown to be much less sensitive to violations of normality of data than the F test). A test that works well even when some of the assumptions upon which the test is founded are not valid is referred to as robust.
See Moore and McCabe (1989) for a discussion of the robustness of t and F tests.
8.9.2 The χ² test
The χ² statistic is calculated by comparing the observed frequency, O_i, in the ith category with the expected frequency, E_i, in that category:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \qquad (8.20)$$

Figure 8.11. χ² distribution showing the critical value, χ²_crit, for the significance level, α.

If there are k categories and n observations in total, then

$$\sum_{i=1}^{k} O_i = n \qquad (8.21)$$

$$\sum_{i=1}^{k} E_i = n \qquad (8.22)$$

and the number of degrees of freedom is

$$\nu = k - 1 \qquad (8.23)$$
The number of degrees of freedom is further reduced if the sample data must be used to estimate one or more parameters. As an example, in order to apply the χ² test to data that are assumed to be Poisson distributed, we usually need to estimate the mean of the distribution, μ, using the sample data. As the constraint expressed by equation (8.22) still applies, the number of degrees of freedom for a χ² test applied to a hypothesised Poisson distribution would be ν = k − 2. Table 8.19 shows the number of degrees of freedom for χ² tests applied to various hypothesised distributions.
²² If there are k categories, then the frequency in each of k − 1 categories can take on any value. Since the total number of observations is fixed, the kth category is constrained to have a frequency equal to the (total number of observations − sum of frequencies in the k − 1 categories).
Interval         Frequency
0.0 ≤ x < 0.1    12
0.1 ≤ x < 0.2    12
0.2 ≤ x < 0.3    9
0.3 ≤ x < 0.4    13
0.4 ≤ x < 0.5    7
0.5 ≤ x < 0.6    9
0.6 ≤ x < 0.7    5
0.7 ≤ x < 0.8    12
0.8 ≤ x < 0.9    9
0.9 ≤ x < 1.0    12
Example 11
Table 8.20 shows the frequency of occurrence of random numbers generated between 0 and 1 by the random number generator on a pocket calculator. The category width has been chosen as 0.1. Use a χ² test at the 0.05 level of significance to establish whether the values in table 8.20 are consistent with a number generator that produces random numbers distributed uniformly between 0 and 1.
ANSWER
H₀: the data have been drawn from a population that consists of numbers which are uniformly distributed between 0 and 1.
H₁: the data have been drawn from a population that consists of numbers which are not uniformly distributed between 0 and 1.
Table 8.20 contains the observed frequencies in each category. If the random number generator produces numbers evenly distributed between 0 and 1, then out of 100 random numbers we would expect, on average, ten numbers to lie between 0 and 0.1, ten to lie between 0.1 and 0.2 and so on. Table 8.21 shows the observed frequencies, expected frequencies and the terms in the summation given in equation (8.20). Summing the last column of table 8.21 gives

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = 6.2$$

The number of degrees of freedom, ν = k − 1 = 9. For α = 0.05 and ν = 9, χ²_crit = 16.92. As χ² < χ²_crit we cannot reject the null hypothesis, i.e. the numbers are consistent with having been drawn from a population that is uniformly distributed between 0 and 1.
Interval         O_i    E_i    (O_i − E_i)²/E_i
0.0 ≤ x < 0.1    12     10     0.4
0.1 ≤ x < 0.2    12     10     0.4
0.2 ≤ x < 0.3    9      10     0.1
0.3 ≤ x < 0.4    13     10     0.9
0.4 ≤ x < 0.5    7      10     0.9
0.5 ≤ x < 0.6    9      10     0.1
0.6 ≤ x < 0.7    5      10     2.5
0.7 ≤ x < 0.8    12     10     0.4
0.8 ≤ x < 0.9    9      10     0.1
0.9 ≤ x < 1.0    12     10     0.4
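A sketch of the same test with scipy.stats.chisquare, using the observed frequencies of table 8.20; chi2.ppf supplies the critical value in place of a table.

from scipy import stats

observed = [12, 12, 9, 13, 7, 9, 5, 12, 9, 12]     # table 8.20
expected = [10] * 10                                # uniform: 10 per category

chi2_stat, p_value = stats.chisquare(observed, f_exp=expected)
chi2_crit = stats.chi2.ppf(0.95, df=len(observed) - 1)
print(chi2_stat, p_value, chi2_crit)    # 6.2, p well above 0.05, critical value about 16.9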
Counts Frequency
0 10
1 13
2 7
3 15
4 or more 5
Exercise L
The number of cosmic rays detected in 50 consecutive one minute intervals is
shown in table 8.22.
(i) Assuming that the counts follow a Poisson distribution, determine the expected
frequencies for 0, 1, 2 etc. counts. (Note that for the data in table 8.22, the mean
number of counts per minute is 1.86.)
(ii) Use a χ² test at the 0.05 level of significance to determine whether the distribution of counts is consistent with a Poisson distribution that has a mean of 1.86 counts per minute.
CHIINV(probability, ν)
where probability refers to the area in the right hand tail of the distribution as shown by the shaded area in figure 8.11 and ν is the number of degrees of freedom.
Example 12
Calculate the critical value of the χ² statistic when the area in the right hand tail of the distribution is 0.1 and the number of degrees of freedom, ν, is 3.
ANSWER
Using Excel®, type into a cell =CHIINV(0.1,3). After pressing the Enter key, Excel® returns the number 6.251394.
Exercise M
Use Excel® to determine the critical value of the χ² statistic for:
ANOVA is a versatile and powerful technique. See McPherson (1990) for details of
the usual variations of ANOVA.
• the effect of three (or more) types of fuel additive on the efficiency of a
car engine;
• influence of storage temperature on the adhesive properties of sticky
tape;
• the concentration of iron in an ore as determined by several
independent laboratories;
• the concentration of carbon dioxide emerging from four volcanic vents.
Analysis using one-way ANOVA relies on using the data in two ways to obtain an estimate of the variance of the population. Firstly, the variance, s_j², of the values within each of the K samples is calculated; the 'within sample' estimate of the population variance is then

$$\sigma^2_{\mathrm{within\ samples}} = \frac{s_1^2 + s_2^2 + s_3^2 + \cdots + s_K^2}{K} \qquad (8.24)$$

or

$$\sigma^2_{\mathrm{within\ samples}} = \frac{1}{K}\sum_{j=1}^{K} s_j^2 \qquad (8.25)$$

Another way to estimate the population variance is to find the variance, s_X̄², of the sample means, given by

$$s_{\bar{X}}^2 = \frac{\sum_{j=1}^{K} (\bar{x}_j - \bar{X})^2}{K - 1} \qquad (8.26)$$

where x̄_j is the mean of the jth sample and X̄ is the mean of all the sample means (sometimes referred to as the 'grand mean').
s_X̄² is related to the estimate of the between sample variance, σ²_between samples, by

$$s_{\bar{X}}^2 = \frac{\sigma^2_{\mathrm{between\ samples}}}{N} \qquad (8.27)$$

where N is the number of values in each sample. The larger the difference between the sample means, the larger will be σ²_between samples. Rearranging equation (8.27) we obtain

$$\sigma^2_{\mathrm{between\ samples}} = N s_{\bar{X}}^2 \qquad (8.28)$$
If all samples have been drawn from the same population, it should not
matter which method is used to estimate the population variance as each
Note that some texts and software packages use the word ‘group’ in place of
sample’ and refer to ‘between group variance’ and 'within group variance’. In this
text we will consistently use the word ‘sample’.
For simplicity we assume that each sample consists of the same number of values.
$$F = \frac{\sigma^2_{\mathrm{between\ samples}}}{\sigma^2_{\mathrm{within\ samples}}} \qquad (8.29)$$
Table 8.23 shows the concentration of carbon dioxide for gas emerging from four volcanic vents. We will use ANOVA to test the hypothesis that the gas from each vent comes from a common reservoir. We will carry out the test at the α = 0.05 level of significance.
The null and alternative hypotheses are:
H₀: μ₁ = μ₂ = μ₃ = μ₄ (the population means of the carbon dioxide concentrations for the four vents are the same);
H₁: the population means are not all the same.
Vent 1    Vent 2    Vent 3    Vent 4
21        25        30        31
23 22 25 25
26 28 24 27
28 29 26 28
27 27 27 33
25 25 30 28
24 27 24 32
Table 8.24. Mean and variance of the carbon dioxide data in table 8.23.
Vent                1        2        3        4
Mean, x̄_j          24.86    26.14    26.57    29.14
Variance, s_j²      5.810    5.476    6.619    8.476

The variance, s_j², of the values for the jth vent is given by

$$s_j^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x}_j)^2}{N - 1} \qquad (8.30)$$

where N is the number of values for the jth vent and x̄_j is the mean of the values for the jth vent. Table 8.24 shows the sample means and the estimated variances for the data in table 8.23.
Using equation (8.25), the estimate of the within sample population variance is

$$\sigma^2_{\mathrm{within\ samples}} = \frac{5.810 + 5.476 + 6.619 + 8.476}{4} = 6.595$$

We use equation (8.28) to give the estimate of the between sample variance, i.e.

$$\sigma^2_{\mathrm{between\ samples}} = N s_{\bar{X}}^2 = 7 \times 3.230 = 22.61$$

so that, using equation (8.29), F = 22.61/6.595 = 3.428. For ν₁ = K − 1 = 3 and ν₂ = K(N − 1) = 24 degrees of freedom at α = 0.05, the critical value is F_crit = 3.01. Comparing this with F = 3.428, as determined using the data, indicates that we should reject the null hypothesis. That is, the population means of all the samples are not the same.
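The whole calculation collapses to one call with scipy.stats.f_oneway; the vent data are transcribed from table 8.23, and f.ppf gives the critical value quoted above.

from scipy import stats

vent1 = [21, 23, 26, 28, 27, 25, 24]
vent2 = [25, 22, 28, 29, 27, 25, 27]
vent3 = [30, 25, 24, 26, 27, 30, 24]
vent4 = [31, 25, 27, 28, 33, 28, 32]

F, p_value = stats.f_oneway(vent1, vent2, vent3, vent4)
F_crit = stats.f.ppf(0.95, dfn=3, dfd=24)
print(F, p_value, F_crit)      # F about 3.43, p about 0.033, F_crit about 3.01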
Exercise N
Experimental studies have linked the size of the alpha wave generated by the human
brain to the amount of light falling on the retina of the eye. In one study, the size of
the alpha signal for nine people is measured at three light levels. Table 8.25 shows
the size of the alpha signal at each light level for the nine people. Using ANOVA,
determine at the a = 0.05 level of significance whether the magnitude of the alpha
wave depends on light level.
High 32 35 40 35 33 37 39 34 37
Medium 36 39 29 33 38 36 32 39 40
Low 39 42 47 39 45 51 43 43 39
8.11 Review
Problems
2. The porosity, r, was measured for two samples of the ceramic YBa₂Cu₃O₇ prepared at different temperatures. The data are shown in table 8.26.
Determine at the a = 0.05 level of significance whether there is any
difference in the porosity of the two samples.
3. Blood was taken from eight volunteers and sent to two laboratories. The
urea concentration in the blood of each volunteer, as determined by both
laboratories, is shown in table 8.27. Determine at the a = 0.05 level of sig¬
nificance whether there is any difference in the urea concentration deter¬
mined by the laboratories.
Volunteer
1 2 3 4 5 6 7 8
Urea Laboratory A 4.1 2.3 8.4 7.4 7.5 3.4 3.9 6.0
concentration
(mmol/L) Laboratory B 4.0 2.4 7.9 7.3 7.3 3.0 3.8 5.5
0.00342 2.10
0.00784 3.51
0.0102 5.11
0.0125 5.93
0.0168 8.06
Machine A 50.0 49.2 49.4 49.8 48.3 50.0 51.3 49.7 49.5
mass (g)
Machine B 51.9 48.8 52.0 52.3 51.0 49.6 49.2 49.1 52.4
mass (g)
Current Batch A 251 321 617 425 430 512 205 325 415
gain Batch B 321 425 502 375 427 522 299 342 420
1.35 1.15 1.46 1.67 1.65 1.76 0.97 1.36 1.63 1.19
1.35 1.35 1.35 1.35 1.35 1.35 1.35 1.35 1.35 1.35
1.27 1.07 1.04 1.21 1.26 0.99 1.30 1.33 1.44 1.34
1.34 1.34 1.68 1.39 1.37 1.31 1.80 1.58 1.89 1.28
1.74 1.09 1.52 1.59 1.79 1.39 1.31 1.55 1.33 1.56
1.12 1.24 1.11 1.34 1.40 1.42 1.35 1.85 1.06 1.26
0.89 1.70 1.15 1.28 1.56 1.50 1.58 1.53 1.14 1.19
1.55 1.47 1.22 1.36 1.44 1.52 1.44 1.23 1.79 1.51
1.42 1.58 1.58 1.28 1.23 1.63 1.17 1.10 1.55 1.54
1.85 1.70 1.67 1.43 1.41 1.50 1.40 1.20 1.06 1.58
1.50 1.53 1.45 1.20 1.66 1.35 1.24 1.25 1.32 1.32
D45 MH TE
(i) distance travelled by the arrow when shot at 45° to the horizontal (D45);
(ii) maximum height attained by the arrow when shot vertically (MH);
(iii) time elapsed for arrow to hit the ground after being shot vertically (TE).
Table 8.32 shows values of the kinetic energy of the arrow based on the three methods. Use one-way ANOVA to establish, at the α = 0.05 level of significance, whether the determination of the kinetic energy depends on the method used.
9. Table 8.33 shows the concentration of calcium in a specimen of human
blood as analysed by three laboratories. Use the ANOVA utility in Excel® to
determine whether there is any significant difference in the sample means
between the three laboratories at a = 0.05 (refer to section 9.3 for details on
how to use Excel®’s ANOVA utility).
Chapter 9
Data Analysis tools in Excel® and the
Analysis ToolPak
9.1 Introduction
Excel® contains numerous useful data analysis tools designed around the
built in functions which will, as examples, fit an equation to data using
least squares or compare the means of many samples using analysis of
variance. These tools can be found via the menu bar. The dialog box that
appears when a tool is selected allows for the easy input of data. Once the
tool is run, results are displayed in a Worksheet with explanatory labels and
headings. As an added benefit, some tools offer automatic plotting of data
as graphs or charts.
In this chapter we consider several of Excel®’s advanced data analy¬
sis tools which form part of the Analysis ToolPak add-in, paying particular
9.2 Activating the Data Analysis tools
To activate the Data Analysis tools, select Tools > Data Analysis from the menu bar. If the Data Analysis option does not appear in the Tools pull down menu, then select Tools > Add-Ins. The dialog box shown in figure 9.1 should appear. Check the box next to Analysis ToolPak, as shown in figure 9.1, then click OK. Data Analysis should now be added to the Tools pull down menu. It is possible that the Analysis ToolPak utility is not
See Orvis (1996) for a useful general reference for this chapter.
resident on your computer, in which case the computer will request you to
insert the disc (or equivalent) that holds the Excel® program.
Figure 9.2 shows the dialog box that appears after choosing Tools > Data Analysis from the menu bar. The Help button in the dialog box can be used to request information on any of the tools in the Analysis ToolPak. Features of the tools in the Analysis ToolPak are:
(i) If data are changed after using a tool, most tools must be run again to
update calculations, i.e. the output is not linked dynamically to the
input data. Exceptions to this are the Moving Average and Exponential
Smoothing tools.
(ii) Excel® ‘remembers’ the cell ranges and numbers typed into each tool’s
dialog box. This is useful if you wish to run the same tool more than
once. When Excel® is exited the dialog boxes within all the tools are
cleared.
The Data Analysis tools make extensive use of dialog boxes to allow for the
easy entry of data, selection of options such as graph plotting and the input
of numbers such as the level of significance, a.
9.3 ANOVA: single factor
Figure 9.4. Output of the ANOVA: Single Factor tool applied to the carbon dioxide data (annotations added).

Anova: Single Factor

SUMMARY
Groups    Count    Sum    Average     Variance
vent 1    7        174    24.85714    5.809524
vent 2    7        183    26.14286    5.47619
vent 3    7        186    26.57143    6.619048
vent 4    7        204    29.14286    8.47619

ANOVA
Source of Variation    SS          df    MS          F           P-value     F crit
Between Groups         67.82143    3     22.60714    3.427798    0.033126    3.008786
Within Groups          158.2857    24    6.595238
Total                  226.1071    27

Annotations: 'Count' is the number of values in each sample; MS = SS/df; the P-value is the probability of the means being as different as observed if the null hypothesis is true; 'F crit' is the critical value of the F statistic at α = 0.05; the Between Groups and Within Groups entries in the SS column are the between and within groups sums of squares (BSS and WSS).
Output Range box. Choosing this cell as $A$10 and pressing OK returns the screen as shown (in part) in figure 9.4. Annotation has been added to figure 9.4 to clarify the labels and headings generated by Excel®. The value of the F statistic (3.428) is greater than the critical value of the F statistic, F_crit (3.009).
level of significance) between the means of the samples appearing in
figure 9.3. The same conclusion was reached when these data were
analysed in section 8.10.2. However, by using the ANOVA tool in the
Analysis ToolPak, the time to analyse the data is reduced considerably in
comparison to the approach adopted in that section. Similar increases in
efficiency are found when using other tools in the ToolPak, but one impor¬
tant matter should be borne in mind: if we require numbers returned by
formulae to be updated as soon as the contents of cells are modified, it is
better not to use the tools in the ToolPak, but to create the spreadsheet ‘from
scratch' using the built in functions, such as AVERAGE() and STDEV().
9.4 Correlation
The correlation coefficient between any two columns (or rows) containing data can be determined using the Correlation tool in the Analysis ToolPak. Figure 9.5 shows three samples of data entered into a worksheet. Figure 9.6 shows the output of the Correlation tool as a matrix of numbers in which the correlation of every combination of pairs of columns is given. The correlation matrix in figure 9.6 indicates that values in sample 1 and sample 2 are highly correlated, but the evidence of correlation between samples 1 and 3 and samples 2 and 3 is much less convincing.
The F test is used to determine whether two samples of data could have been drawn from populations with the same variance, as discussed in section 8.8.2. The F-Test tool in Excel® requires that the cell ranges containing the two samples be entered into the dialog box, as shown in figure 9.7. The data in figure 9.7 refer to values of capacitors supplied by two
Figure 9.6. Output of the Correlation tool: the correlation coefficient for each pair of samples (the annotations show the equivalent CORREL() formulae, e.g. CORREL(A2:A9,B2:B9)).

            sample 1    sample 2    sample 3
sample 1    1
sample 2    0.98559     1
sample 3    0.428766    0.366188    1
     A    B     C           D
1    x    y     noise       y+noise
2    1    7     -0.30023    6.699768
3    2    10    -1.27768    8.722317
4    3    13    0.244257    13.24426
5    4    16    1.276474    17.27647
6    5    19    1.19835     20.19835
7    6    22    1.733133    23.73313
8    7    25    -2.18359    22.81641
9    8    28    -0.23418    27.76582
Figure 9.10. Normally distributed ‘noise’ in the C column. The cells in the D column
show the noise added to y values appearing in column B.
9.7 Regression
As an example of using the Regression tool, consider the data in figure 9.11
which shows absorbance (y) versus concentration (x) data obtained
during an experiment in which standard silver solutions were analysed by
flame atomic absorption spectrometry. These data were described in
example 2 in section 6.2.4. We use the Regression tool to fit the equation y = a + bx to the data.
Features of the dialog box are described by the annotation on figure 9.11. Figure 9.12 shows the numbers returned by Excel® when applying the Regression tool to the data in figure 9.11. Many useful statistics are shown in figure 9.12. For example, cells B27 and B28 contain the best estimates of intercept and slope respectively. Cells C27 and C28 contain the standard errors in these estimates. Also worthy of special mention are the p values in cells E27 and E28. A p value of 0.472 for the intercept indicates that the intercept is not significantly different from zero. By contrast, the much smaller p value for the slope indicates that it is extremely unlikely that the 'true' slope is zero. Note that the Regression tool in Excel® is unable to perform weighted least squares.
Figure 9.11. Worksheet showing the dialog box for the Regression tool in Excel®. (The annotations indicate the default confidence level and the Output Range option.)
(Annotations to figure 9.12: the p value in cell E27 is the probability of obtaining a t value of 0.7779 if the population intercept is zero; the p value in cell E28 is the probability of obtaining a t value of 82.18 if the population slope is zero.)
It is possible to carry out advanced least squares using the Regression tool in Excel®. As an example, consider fitting the equation y = a + bx + cx² to data as shown in figure 9.13 (these data are also analysed in example 4 in section 7.6). Each term in the equation that contains the independent variable is allocated its own column in the spreadsheet. For example, column B of the spreadsheet in figure 9.13 contains x and column C contains x². To enter the x range (including a label in cell B1) into the dialog box we must highlight both the B and C columns (or type $B$1:$C$12 into the Input X Range box). By choosing the output to begin at $A$14 on the same Worksheet, numbers are returned as shown in figure 9.14. The three parameters, a, b and c, estimated by the least squares technique appear in cells B30, B31 and B32 respectively, so that (to four significant figures) the equation representing the best line through the points can be written
Figure 9.13. Fitting the equation y = a + bx + cx² to data using Excel®'s Regression tool.
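The same 'one column per term' idea can be expressed directly as a design matrix. A minimal sketch, using placeholder data rather than the values of figure 9.13:

    import numpy as np

    # Placeholder (x, y) data standing in for figure 9.13
    x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
    y = np.array([1.1, 1.9, 3.2, 4.8, 7.1, 9.6, 12.8, 16.3])

    # One column per term containing the independent variable, plus a column
    # of ones for the intercept: y = a + b x + c x^2
    X = np.column_stack([np.ones_like(x), x, x**2])

    # Least squares estimates of a, b and c
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b, c = coeffs
    print(a, b, c)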
Figure 9.14. Output produced by Excel®'s Regression tool when fitting y = a + bx + cx² to data. The SUMMARY OUTPUT includes the regression statistics Multiple R = 0.998780654, R Square = 0.997562795 and Adjusted R Square = 0.996953494.
9.8 t tests
See Devore (1991) for a discussion of the t test when samples have unequal
variances.
Figure 9.15. Peel off time data and t-Test dialog box (the annotation points out the output options).
       A         B
  1    Tape 1    Tape 2
  2    65        176
  3    128       125
  4    87        95
  5    145       255
  6    210       147
  7    85        88

  t-Test: Two-Sample Assuming Equal Variances
                                     Tape 1      Tape 2
  Mean                               120         147.6667
  Variance                           2825.6      3834.267
  Observations                       6           6
  Pooled Variance                    3329.933
  Hypothesized Mean Difference       0
  df                                 10
  t Stat                             -0.83042
  P(T<=t) one-tail                   0.21284
  t Critical one-tail                1.812462
  P(T<=t) two-tail                   0.425681
  t Critical two-tail                2.228139

Annotations: the Pooled Variance is the combined variance of both samples; the t statistic is found using t = (x̄₁ − x̄₂)/s_{x̄₁−x̄₂}; P(T<=t) two-tail is the probability that the sample means would differ at least as much as observed, if the population means of both samples are the same.
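A sketch of the same two-sample, equal-variance t test applied to the Tape 1 and Tape 2 peel-off times shown above:

    import numpy as np
    from scipy import stats

    tape1 = np.array([65, 128, 87, 145, 210, 85])
    tape2 = np.array([176, 125, 95, 255, 147, 88])

    # Two-sample t test assuming equal population variances (pooled variance),
    # the same test as Excel's 't-Test: Two-Sample Assuming Equal Variances'
    t_stat, p_two_tail = stats.ttest_ind(tape1, tape2, equal_var=True)
    print(t_stat, p_two_tail)   # approximately -0.830 and 0.426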
Excel® possesses several other tools within the Analysis ToolPak that are used less often by physical scientists. For example, the Moving Average tool is often used to smooth out variations in data observed over time, such as seasonal variations in commodity prices in sales and marketing. The remaining tools are outlined in the following sections, with reference to where more information may be obtained.
9.9.2 Covariance
(9.1)
Two factor ANOVA is not dealt with in this text. See Devore (1991) for an introduction to this topic.
When quantities vary with time it is sometimes useful, especially if the data are noisy, to smooth the data. One way to smooth the data is to add the value at some time point, t + 1, to a fraction of the value obtained at the prior time point, t. Excel®'s Exponential Smoothing tool does just this. The relationship Excel® uses is
The Moving Average tool smoothes data by replacing the ith value in a column of values, x_i, by x_{i,smooth}, where

$x_{i,\mathrm{smooth}} = \frac{1}{N}\sum_j x_j$

and the sum is taken over N consecutive values.
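A sketch of both kinds of smoothing. The moving average follows the formula above; for the exponential smoothing the standard single-exponential recurrence is assumed (Excel®'s dialog asks for a damping factor, which equals one minus the smoothing constant used here):

    import numpy as np

    x = np.array([12.0, 14.5, 13.2, 15.8, 16.1, 14.9, 17.3, 18.0])  # a noisy series

    # Moving average over N consecutive values
    N = 3
    moving_avg = np.convolve(x, np.ones(N) / N, mode='valid')

    # Simple exponential smoothing with smoothing constant alpha
    # (damping factor in Excel's dialog = 1 - alpha)
    alpha = 0.3
    smooth = np.empty_like(x)
    smooth[0] = x[0]
    for t in range(1, len(x)):
        smooth[t] = alpha * x[t - 1] + (1 - alpha) * smooth[t - 1]

    print(moving_avg)
    print(smooth)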
The Rank and Percentile tool ranks values in a sample from largest to
smallest. Each value is given a rank between 1 and n, where n is the number
of values in the sample. The rank is also expressed as a percentage of the
data set, such that the first ranked value has a percentile of 100% and that
ranked last has a percentile of 0%.
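A sketch of the ranking just described, assuming scipy's rankdata for the ranks and a simple fraction-of-the-data-set calculation for the percentile column:

    import numpy as np
    from scipy import stats

    values = np.array([23.0, 18.5, 30.2, 27.1, 19.9])

    # Rank 1 = largest value, rank n = smallest (rankdata ranks ascending,
    # so rank the negated values)
    rank = stats.rankdata(-values, method='ordinal').astype(int)

    # Percentile: largest value -> 100%, smallest -> 0%
    n = len(values)
    percentile = 100.0 * (n - rank) / (n - 1)

    for v, r, p in zip(values, rank, percentile):
        print(v, r, p)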
9.9.7 Sampling
The Sampling tool allows for the selection of values from a column or row
in a Worksheet. The selection may either be periodic (say every fifth value
in a column of values) or random. When selecting at random from a group
of values, a value may be absent, appear once or more than once (this is
random selection ‘with replacement’). Excel® displays the selected values
in a column.
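A sketch of both selection modes offered by the Sampling tool, periodic selection (every kth value) and random selection with replacement:

    import numpy as np

    values = np.arange(101, 131)          # a column of 30 values
    rng = np.random.default_rng()

    # Periodic sampling: every fifth value
    periodic = values[4::5]

    # Random sampling 'with replacement': a value may be absent,
    # appear once, or appear more than once
    random_sample = rng.choice(values, size=10, replace=True)

    print(periodic)
    print(random_sample)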
9.10 Review
These words of caution aside, the Analysis ToolPak is a valuable and powerful aid to the analysis of experimental data.
Appendix 1
Statistical tables
Table 1. Cumulative distribution function for the standard normal distribution.
(a) The table gives the area under the standard normal probability curve between z = −∞ and z = z₁.
Example: if z₁ = −1.24, then P(−∞ < z < −1.24) = 0.10749.
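The tabulated values are the standard normal cumulative distribution function, which can also be evaluated directly. A minimal sketch using the error function:

    import math

    def std_normal_cdf(z):
        """Area under the standard normal curve between -infinity and z."""
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    print(round(std_normal_cdf(-1.24), 5))   # 0.10749, as in the example above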
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-4.00 0.00003 0.00003 0.00003 0.00003 0.00003 0.00003 0.00002 0.00002 0.00002 0.00002
-3.90 0.00005 0.00005 0.00004 0.00004 0.00004 0.00004 0.00004 0.00004 0.00003 0.00003
-3.80 0.00007 0.00007 0.00007 0.00006 0.00006 0.00006 0.00006 0.00005 0.00005 0.00005
-3.70 0.00011 0.00010 0.00010. 0.00010 0.00009 0.00009 0.00008 0.00008 0.00008 0.00008
-3.60 0.00016 0.00015 0.00015 0.00014 0.00014 0.00013 0.00013 0.00012 0.00012 0.00011
-3.50 0.00023 0.00022 0.00022 0.00021 0.00020 0.00019 0.00019 0.00018 0.00017 0.00017
-3.40 0.00034 0.00032 0.00031 0.00030 0.00029 0.00028 0.00027 0.00026 0.00025 0.00024
-3.30 0.00048 0.00047 0.00045 0.00043 0.00042 0.00040 0.00039 0.00038 0.00036 0.00035
-3.20 0.00069 0.00066 0.00064 0.00062 0.00060 0.00058 0.00056 0.00054 0.00052 0.00050
-3.10 0.00097 0.00094 0.00090 0.00087 0.00084 0.00082 0.00079 0.00076 0.00074 0.00071
-3.00 0.00135 0.00131 0.00126 0.00122 0.00118 0.00114 0.00111 0.00107 0.00104 0.00100
-2.90 0.00187 0.00181 0.00175 0.00169 0.00164 0.00159 0.00154 0.00149 0.00144 0.00139
-2.80 0.00256 0.00248 0.00240 0.00233 0.00226 0.00219 0.00212 0.00205 0.00199 0.00193
-2.70 0.00347 0.00336 0.00326 0.00317 0.00307 0.00298 0.00289 0.00280 0.00272 0.00264
-2.60 0.00466 0.00453 0.00440 0.00427 0.00415 0.00402 0.00391 0.00379 0.00368 0.00357
-2.50 0.00621 0.00604 0.00587 0.00570 0.00554 0.00539 0.00523 0.00508 0.00494 0.00480
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
- 2.40 0.00820 0.00798 0.00776 0.00755 0.00734 0.00714 0.00695 0.00676 0.00657 0.00639
- 2.30 0.01072 0.01044 0.01017 0.00990 0.00964 0.00939 0.00914 0.00889 0.00866 0.00842
- 2.20 0.01390 0.01355 0.01321 0.01287 0.01255 0.01222 0.01191 0.01160 0.01130 0.01101
- 2.10 0.01786 0.01743 0.01700 0.01659 0.01618 0.01578 0.01539 0.01500 0.01463 0.01426
- 2.00 0.02275 0.02222 0.02169 0.02118 0.02068 0.02018 0.01970 0.01923 0.01876 0.01831
- 1.90 0.02872 0.02807 0.02743 0.02680 0.02619 0.02559 0.02500 0.02442 0.02385 0.02330
- 1.80 0.03593 0.03515 0.03438 0.03362 0.03288 0.03216 0.03144 0.03074 0.03005 0.02938
- 1.70 0.04457 0.04363 0.04272 0.04182 0.04093 0.04006 0.03920 0.03836 0.03754 0.03673
- 1.60 0.05480 0.05370 0.05262 0.05155 0.05050 0.04947 0.04846 0.04746 0.04648 0.04551
- 1.50 0.06681 0.06552 0.06426 0.06301 0.06178 0.06057 0.05938 0.05821 0.05705 0.05592
- 1.40 0.08076 0.07927 0.07780 0.07636 0.07493 0.07353 0.07215 0.07078 0.06944 0.06811
- 1.30 0.09680 0.09510 0.09342 0.09176 0.09012 0.08851 0.08692 0.08534 0.08379 0.08226
- 1.20 0.11507 0.11314 0.11123 0.10935 0.10749 0.10565 0.10383 0.10204 0.10027 0.09853
- 1.10 0.13567 0.13350 0.13136 0.12924 0.12714 0.12507 0.12302 0.12100 0.11900 0.11702
- 1.00 0.15866 0.15625 0.15386 0.15151 0.14917 0.14686 0.14457 0.14231 0.14007 0.13786
- 0.90 0.18406 0.18141 0.17879 0.17619 0.17361 0.17106 0.16853 0.16602 0.16354 0.16109
- 0.80 0.21186 0.20897 0.20611 0.20327 0.20045 0.19766 0.19489 0.19215 0.18943 0.18673
- 0.70 0.24196 0.23885 0.23576 0.23270 0.22965 0.22663 0.22363 0.22065 0.21770 0.21476
- 0.60 0.27425 0.27093 0.26763 0.26435 0.26109 0.25785 0.25463 0.25143 0.24825 0.24510
- 0.50 0.30854 0.30503 0.30153 0.29806 0.29460 0.29116 0.28774 0.28434 0.28096 0.27760
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.00 0.50000 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.52790 0.53188 0.53586
0.10 0.53983 0.54380 0.54776 0.55172 0.55567 0.55962 0.56356 0.56749 0.57142 0.57535
0.20 0.57926 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.30 0.61791 0.62172 0.62552 0.62930 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.40 0.65542 0.65910 0.66276 0.66640 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.50 0.69146 0.69497 0.69847 0.70194 0.70540 0.70884 0.71226 0.71566 0.71904 0.72240
0.60 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.75490
0.70 0.75804 0.76115 0.76424 0.76730 0.77035 0.77337 0.77637 0.77935 0.78230 0.78524
0.80 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.90 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1.00 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.10 0.86433 0.86650 0.86864 0.87076 0.87286 0.87493 0.87698 0.87900 0.88100 0.88298
1.20 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.30 0.90320 0.90490 0.90658 0.90824 0.90988 0.91149 0.91308 0.91466 0.91621 0.91774
1.40 0.91924 0.92073 0.92220 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.50 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.60 0.94520 0.94630 0.94738 0.94845 0.94950 0.95053 0.95154 0.95254 0.95352 0.95449
1.70 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.96080 0.96164 0.96246 0.96327
1.80 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.90 0.97128 0.97193 0.97257 0.97320 0.97381 0.97441 0.97500 0.97558 0.97615 0.97670
2.00 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.98030 0.98077 0.98124 0.98169
2.10 0.98214 0.98257 0.98300 0.98341 0.98382 0.98422 0.98461 0.98500 0.98537 0.98574
2.20 0.98610 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.98840 0.98870 0.98899
2.30 0.98928 0.98956 0.98983 0.99010 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.40 0.99180 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.50 0.99379 0.99396 0.99413 0.99430 0.99446 0.99461 0.99477 0.99492 0.99506 0.99520
2.60 0.99534 0.99547 0.99560 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.70 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.99720 0.99728 0.99736
2.80 0.99744 0.99752 0.99760 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.90 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
3.00 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.99900
3.10 0.99903 0.99906 0.99910 0.99913 0.99916 0.99918 0.99921 0.99924 0.99926 0.99929
3.20 0.99931 0.99934 0.99936 0.99938 0.99940 0.99942 0.99944 0.99946 0.99948 0.99950
3.30 0.99952 0.99953 0.99955 0.99957 0.99958 0.99960 0.99961 0.99962 0.99964 0.99965
3.40 0.99966 0.99968 0.99969 0.99970 0.99971 0.99972 0.99973 0.99974 0.99975 0.99976
3.50 0.99977 0.99978 0.99978 0.99979 0.99980 0.99981 0.99981 0.99982 0.99983 0.99983
3.60 0.99984 0.99985 0.99985 0.99986 0.99986 0.99987 0.99987 0.99988 0.99988 0.99989
3.70 0.99989 0.99990 0.99990 0.99990 0.99991 0.99991 0.99992 0.99992 0.99992 0.99992
3.80 0.99993 0.99993 0.99993 0.99994 0.99994 0.99994 0.99994 0.99995 0.99995 0.99995
3.90 0.99995 0.99995 0.99996 0.99996 0.99996 0.99996 0.99996 0.99996 0.99997 0.99997
4.00 0.99997 0.99997 0.99997 0.99997 0.99997 0.99997 0.99998 0.99998 0.99998 0.99998
Table 2. Critical values for the t distribution.
Example: t₉₅%,₆ = 2.447, the value of t used when calculating an X% = 95% confidence interval with 6 degrees of freedom.
Table 3. Critical values for the F distribution for various probabilities, p, in the right hand tail of the distribution, with ν₁ degrees of freedom in the numerator and ν₂ degrees of freedom in the denominator (ν₂ labels each block of rows).
p      ν₁:  1      2      3      4      5      6      7      8      10     12     15     20     50
ν₂ = 1
0.1 39.86 49.50 53.59 55.83 57.24 58.20 58.91 59.44 60.19 60.71 61.22 61.74 62.69
0.05 161.45 199.50 215.71 224.58 230.16 233.99 236.77 238.88 241.88 243.90 245.95 248.02 251.77
0.025 647.79 799.48 864.15 899.60 921.83 937.11 948.20 956.64 968.63 976.72 984.87 993.08 1008.10
ν₂ = 2
0.1 8.53 9.00 9.16 9.24 9.29 9.33 9.35 9.37 9.39 9.41 9.42 9.44 9.47
0.05 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.40 19.41 19.43 19.45 19.48
0.025 38.51 39.00 39.17 39.25 39.30 39.33 39.36 39.37 39.40 39.41 39.43 39.45 39.48
0.01 98.50 99.00 99.16 99.25 99.30 99.33 99.36 99.38 99.40 99.42 99.43 99.45 99.48
0.005 198.50 199.01 199.16 199.24 199.30 199.33 199.36 199.38 199.39 199.42 199.43 199.45 199.48
ν₂ = 3
0.1 5.54 5.46 5.39 5.34 5.31 5.28 5.27 5.25 5.23 5.22 5.20 5.18 5.15
0.05 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.79 8.74 8.70 8.66 8.58
0.025 17.44 16.04 15.44 15.10 14.88 14.73 14.62 14.54 14.42 14.34 14.25 14.17 14.01
0.01 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.23 27.05 26.87 26.69 26.35
0.005 55.55 49.80 47.47 46.20 45.39 44.84 44.43 44.13 43.68 43.39 43.08 42.78 42.21
ν₂ = 4
0.1 4.54 4.32 4.19 4.11 4.05 4.01 3.98 3.95 3.92 3.90 3.87 3.84 3.80
0.05 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 5.96 5.91 5.86 5.80 5.70
0.025 12.22 10.65 9.98 9.60 9.36 9.20 9.07 8.98 8.84 8.75 8.66 8.56 8.38
0.01 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.55 14.37 14.20 14.02 13.69
0.005 31.33 26.28 24.26 23.15 22.46 21.98 21.62 21.35 20.97 20.70 20.44 20.17 19.67
ν₂ = 5
0.1 4.06 3.78 3.62 3.52 3.45 3.40 3.37 3.34 3.30 3.27 3.24 3.21 3.15
0.05 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.74 4.68 4.62 4.56 4.44
0.025 10.01 8.43 7.76 7.39 7.15 6.98 6.85 6.76 6.62 6.52 6.43 6.33 6.14
0.01 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.05 9.89 9.72 9.55 9.24
0.005 22.78 18.31 16.53 15.56 14.94 14.51 14.20 13.96 13.62 13.38 13.15 12.90 12.45
ν₂ = 6
0.1 3.78 3.46 3.29 3.18 3.11 3.05 3.01 2.98 2.94 2.90 2.87 2.84 2.77
0.05 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.06 4.00 3.94 3.87 3.75
0.025 8.81 7.26 6.60 6.23 5.99 5.82 5.70 5.60 5.46 5.37 5.27 5.17 4.98
0.01 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.87 7.72 7.56 7.40 7.09
0.005 18.63 14.54 12.92 12.03 11.46 11.07 10.79 10.57 10.25 10.03 9.81 9.59 9.17
Table 3 (continued).
ν₂ = 12
0.05 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.75 2.69 2.62 2.54 2.40
0.025 6.55 5.10 4.47 4.12 3.89 3.73 3.61 3.51 3.37 3.28 3.18 3.07 2.87
0.01 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.30 4.16 4.01 3.86 3.57
0.005 11.75 8.51 7.23 6.52 6.07 5.76 5.52 5.35 5.09 4.91 4.72 4.53 4.17
ν₂ = 14
0.1 3.10 2.73 2.52 2.39 2.31 2.24 2.19 2.15 2.10 2.05 2.01 1.96 1.87
0.05 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.60 2.53 2.46 2.39 2.24
0.025 6.30 4.86 4.24 3.89 3.66 3.50 3.38 3.29 3.15 3.05 2.95 2.84 2.64
0.01 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 3.94 3.80 3.66 3.51 3.22
0.005 11.06 7.92 6.68 6.00 5.56 5.26 5.03 4.86 4.60 4.43 4.25 4.06 3.70
ν₂ = 16
0.1 3.05 2.67 2.46 2.33 2.24 2.18 2.13 2.09 2.03 1.99 1.94 1.89 1.79
0.05 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.49 2.42 2.35 2.28 2.12
0.025 6.12 4.69 4.08 3.73 3.50 3.34 3.22 3.12 2.99 2.89 2.79 2.68 2.47
0.01 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.69 3.55 3.41 3.26 2.97
0.005 10.58 7.51 6.30 5.64 5.21 4.91 4.69 4.52 4.27 4.10 3.92 3.73 3.37
ν₂ = 18
0.1 3.01 2.62 2.42 2.29 2.20 2.13 2.08 2.04 1.98 1.93 1.89 1.84 1.74
0.05 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.41 2.34 2.27 2.19 2.04
0.025 5.98 4.56 3.95 3.61 3.38 3.22 3.10 3.01 2.87 2.77 2.67 2.56 2.35
0.01 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.51 3.37 3.23 3.08 2.78
0.005 10.22 7.21 6.03 5.37 4.96 4.66 4.44 4.28 4.03 3.86 3.68 3.50 3.14
ν₂ = 20
0.1 2.97 2.59 2.38 2.25 2.16 2.09 2.04 2.00 1.94 1.89 1.84 1.79 1.69
0.05 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.35 2.28 2.20 2.12 1.97
0.025 5.87 4.46 3.86 3.51 3.29 3.13 3.01 2.91 2.77 2.68 2.57 2.46 2.25
0.01 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.37 3.23 3.09 2.94 2.64
0.005 9.94 6.99 5.82 5.17 4.76 4.47 4.26 4.09 3.85 3.68 3.50 3.32 2.96
Table 3 (continued).
ν₂    p      ν₁:  1      2      3      4      5      6      7      8      10     12     15     20     50
22 0.1 2.95 2.56 2.35 2.22 2.13 2.06 2.01 1.97 1.90 1.86 1.81 1.76 1.65
0.05 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.30 2.23 2.15 2.07 1.91
0.025 5.79 4.38 3.78 3.44 3.22 3.05 2.93 ' 2.84 2.70 2.60 2.50 2.39 2.17
0.01 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.26 3.12 2.98 2.83 2.53
0.005 9.73 6.81 5.65 5.02 4.61 4.32 4.11 3.94 3.70 3.54 3.36 3.18 2.82
24 0.1 2.93 2.54 2.33 2.19 2.10 2.04 1.98 1.94 1.88 1.83 1.78 1.73 1.62
0.05 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.25 2.18 2.11 2.03 1.86
0.025 5.72 4.32 3.72 3.38 3.15 2.99 2.87 2.78 2.64 2.54 2.44 2.33 2.11
0.01 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.17 3.03 2.89 2.74 2.44
0.005 9.55 6.66 5.52 4.89 4.49 4.20 3.99 3.83 3.59 3.42 3.25 3.06 2.70
26 0.1 2.91 2.52 2.31 2.17 2.08 2.01 1.96 1.92 1.86 1.81 1.76 1.71 1.59
0.05 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.22 2.15 2.07 1.99 1.82
0.025 5.66 4.27 3.67 3.33 3.10 2.94 2.82 2.73 2.59 2.49 2.39 2.28 2.05
0.01 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.09 2.96 2.81 2.66 2.36
0.005 9.41 6.54 5.41 4.79 4.38 4.10 3.89 3.73 3.49 3.33 3.15 2.97 2.61
28 0.1 2.89 2.50 2.29 2.16 2.06 2.00 1.94 1.90 1.84 1.79 1.74 1.69 1.57
0.05 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.19 2.12 2.04 1.96 1.79
0.025 5.61 4.22 3.63 3.29 3.06 2.90 2.78 2.69 2.55 2.45 2.34 2.23 2.01
0.01 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.03 2.90 2.75 2.60 2.30
0.005 9.28 6.44 5.32 4.70 4.30 4.02 3.81 3.65 3.41 3.25 3.07 2.89 2.53
30 0.1 2.88 2.49 2.28 2.14 2.05 1.98 1.93 1.88 1.82 1.77 1.72 1.67 1.55
0.05 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.16 2.09 2.01 1.93 1.76
0.025 5.57 4.18 3.59 3.25 3.03 2.87 2.75 2.65 2.51 2.41 2.31 2.20 1.97
0.01 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 2.98 2.84 2.70 2.55 2.25
0.005 9.18 6.35 5.24 4.62 4.23 3.95 3.74 3.58 3.34 3.18 3.01 2.82 2.46
35 0.1 2.85 2.46 2.25 2.11 2.02 1.95 1.90 1.85 1.79 1.74 1.69 1.63 1.51
0.05 4.12 3.27 2.87 2.64 2.49 2.37 2.29 2.22 2.11 2.04 1.96 1.88 1.70
0.025 5.48 4.11 3.52 3.18 2.96 2.80 2.68 2.58 2.44 2.34 2.23 2.12 1.89
0.01 7.42 5.27 4.40 3.91 3.59 3.37 3.20 3.07 2.88 2.74 2.60 2.44 2.14
0.005 8.98 6.19 5.09 4.48 4.09 3.81 3.61 3.45 3.21 3.05 2.88 2.69 2.33
40 0.1 2.84 2.44 2.23 2.09 2.00 1.93 1.87 1.83 1.76 1.71 1.66 1.61 1.48
0.05 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.08 2.00 1.92 1.84 1.66
0.025 5.42 4.05 3.46 3.13 2.90 2.74 2.62 2.53 2.39 2.29 2.18 2.07 1.83
0.01 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.80 2.66 2.52 2.37 2.06
0.005 8.83 6.07 4.98 4.37 3.99 3.71 3.51 3.35 3.12 2.95 2.78 2.60 2.23
50 0.1 2.81 2.41 2.20 2.06 1.97 1.90 1.84 1.80 1.73 1.68 1.63 1.57 1.44
0.05 4.03 3.18 2.79 2.56 2.40 2.29 2.20 2.13 2.03 1.95 1.87 1.78 1.60
0.025 5.34 3.97 3.39 3.05 2.83 2.67 2.55 2.46 2.32 2.22 2.11 1.99 1.75
0.01 7.17 5.06 4.20 3.72 3.41 3.19 3.02 2.89 2.70 2.56 2.42 2.27 1.95
0.005 8.63 5.90 4.83 4.23 3.85 3.58 3.38 3.22 2.99 2.82 2.65 2.47 2.10
Appendix 2
Propagation of uncertainties

$\sigma^2 = \frac{1}{n}\sum_i (y_i - \mu)^2$   (A2.1)

where μ is the population mean or true value of the y quantity and n is the number of values. If we write δy_i = (y_i − μ), then δy_i is the deviation of the ith value from the population mean. Another way to regard δy_i is that if repeat measurements are made of a quantity, then δy_i is the experimental error (as discussed in section 5.3). Replacing (y_i − μ) in equation (A2.1) by δy_i gives
$\sigma^2 = \frac{1}{n}\sum_i (\delta y_i)^2$   (A2.2)

If y depends on two quantities, x and z, then to first order

$\delta y_i = \frac{\partial y}{\partial x}\,\delta x_i + \frac{\partial y}{\partial z}\,\delta z_i$   (A2.3)

so that

$\sigma_y^2 = \frac{1}{n}\sum_i \left(\frac{\partial y}{\partial x}\,\delta x_i + \frac{\partial y}{\partial z}\,\delta z_i\right)^2$   (A2.4)

Expanding the square gives

$\sigma_y^2 = \left(\frac{\partial y}{\partial x}\right)^2 \frac{1}{n}\sum_i (\delta x_i)^2 + \left(\frac{\partial y}{\partial z}\right)^2 \frac{1}{n}\sum_i (\delta z_i)^2 + 2\,\frac{\partial y}{\partial x}\frac{\partial y}{\partial z}\,\frac{1}{n}\sum_i \delta x_i\,\delta z_i$   (A2.5)
where δx_i and δz_i are the deviations (or errors) in the ith values of x and z respectively. ∂y/∂x and ∂y/∂z are determined at x = x̄ and z = z̄.
Concentrating on the last term in equation (A2.5) for a moment, Sx. and
8z^ take on both positive and negative values. If 8x. and 8z. are random and
mutually independent, then the summation of positive and negative terms
should give a sum that tends to zero, particularly for large data sets. We there¬
fore argue that the last term in equation (A2.5) is negligible compared to the
other terms and omit it.
The first two terms in equation (A2.5) can be written

$\left(\frac{\partial y}{\partial x}\right)^2 \frac{\sum_i (\delta x_i)^2}{n} + \left(\frac{\partial y}{\partial z}\right)^2 \frac{\sum_i (\delta z_i)^2}{n}$   (A2.6)

but

$\frac{\sum_i (\delta x_i)^2}{n} = \sigma_x^2, \qquad \frac{\sum_i (\delta z_i)^2}{n} = \sigma_z^2$   (A2.7)

so that

$\sigma_y^2 = \left(\frac{\partial y}{\partial x}\right)^2 \sigma_x^2 + \left(\frac{\partial y}{\partial z}\right)^2 \sigma_z^2$   (A2.8)
Equation (A2.8) can be extended to any number of variables that possess error.
If we determine the means x̄ and z̄ and their respective standard errors, σ_x̄ and σ_z̄, we can adapt equation (A2.8) to give the standard error in y, i.e.

$\sigma_{\bar y}^2 = \left(\frac{\partial y}{\partial x}\right)^2 \sigma_{\bar x}^2 + \left(\frac{\partial y}{\partial z}\right)^2 \sigma_{\bar z}^2$   (A2.10)
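Equation (A2.8) can be checked numerically. The sketch below compares the propagated uncertainty for an illustrative function y = x·z with the spread of y obtained by simulating many (x, z) pairs:

    import numpy as np

    rng = np.random.default_rng()

    # Illustrative means and standard deviations for x and z
    x_mean, x_sigma = 10.0, 0.2
    z_mean, z_sigma = 5.0, 0.1

    # Propagated uncertainty for y = x * z using equation (A2.8):
    # dy/dx = z and dy/dz = x, evaluated at the means
    sigma_y_propagated = np.sqrt((z_mean * x_sigma)**2 + (x_mean * z_sigma)**2)

    # Monte Carlo check: simulate independent x and z values and look at
    # the standard deviation of the resulting y values
    x = rng.normal(x_mean, x_sigma, size=100_000)
    z = rng.normal(z_mean, z_sigma, size=100_000)
    sigma_y_simulated = np.std(x * z)

    print(sigma_y_propagated, sigma_y_simulated)   # the two should agree closely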
Appendix 3
Least squares and the p'rinciple of maximum
likelihood
We can use the principle of maximum likelihood to show that the best esti¬
mate of the population mean is the sample mean as given by equation (1.6).
Using this approach we can also obtain the best estimate of the mean when
the standard deviation differs from value to value. In this case we derive an
expression for what is usually termed the weighted mean. The argument is as
follows.^
A distribution of values that occurs when repeat measurements are made
is most likely to come from a population with a mean, /jl, rather than from any
other population. If the normal distribution is valid for the data, then the prob¬
ability, P^, of observing the value x. is
(A3.1)
where g, is the population mean and a. is the standard deviation of the ith value.
If we write the probability of observing n values when the population mean is μ as P(μ), then so long as probabilities are independent

$P(\mu) = \prod_i P_i$   (A3.2)
392
$P(\mu) \propto \exp\!\left[-\frac{(x_1-\mu)^2}{2\sigma_1^2}\right]\exp\!\left[-\frac{(x_2-\mu)^2}{2\sigma_2^2}\right]\cdots\exp\!\left[-\frac{(x_n-\mu)^2}{2\sigma_n^2}\right]$   (A3.3)

$P(\mu) \propto \exp\!\left[-\frac{1}{2}\sum_i\frac{(x_i-\mu)^2}{\sigma_i^2}\right]$   (A3.4)
For any estimate, X, of the population mean, we can calculate the probability, P(X), of making a particular set of n measurements as

$P(X) = \prod_i P_i$

where

$P_i \propto \exp\!\left[-\frac{(x_i-X)^2}{2\sigma_i^2}\right]$   (A3.5)
It follows that

$P(X) = c\exp\!\left[-\frac{1}{2}\sum_i\frac{(x_i-X)^2}{\sigma_i^2}\right]$   (A3.6)
We assume that the values x_i are more likely to have come from the distribution given by equation (A3.1) than any other distribution. It follows that the probability given by equation (A3.4) is the maximum attainable by equation (A3.6). If we find the value of X that maximises P(X), we will have found the best estimate of the population mean.
We introduce χ², which we may regard as a 'weighted sum of squares' (and which is the statistic discussed in chapter 8), where

$\chi^2 = \sum_i\frac{(x_i-X)^2}{\sigma_i^2}$   (A3.7)

P(X) is largest when χ² is smallest, so we differentiate χ² with respect to X and set the derivative equal to zero:

$\frac{d\chi^2}{dX} = -2\sum_i\frac{x_i-X}{\sigma_i^2} = 0$

or

$\sum_i\frac{x_i}{\sigma_i^2} = X\sum_i\frac{1}{\sigma_i^2}$   (A3.8)
In preference to using X to represent the weighted mean we will use x̄_w, so that

$\bar x_w = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}$   (A3.9)

Equation (A3.9) gives the weighted mean. If the standard deviation for every value is the same, such that σ₁ = σ₂ = ⋯ = σ, then equation (A3.9) becomes

$\bar x_w = \frac{1}{n}\sum_i x_i$   (A3.10)
Appendix 4 (section A4.1) considers the uncertainty (as expressed by the standard error) in the weighted mean.
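A sketch of equation (A3.9), computing the weighted mean of a few illustrative values x_i with differing standard deviations σ_i:

    import numpy as np

    # Illustrative values and their standard deviations
    x = np.array([1.10, 1.08, 1.12, 1.05])
    sigma = np.array([0.02, 0.05, 0.03, 0.10])

    w = 1.0 / sigma**2                      # weights 1/sigma_i^2
    x_w = np.sum(w * x) / np.sum(w)         # equation (A3.9)

    # Standard error of the weighted mean (equation (A4.6) of appendix 4)
    sigma_xw = 1.0 / np.sqrt(np.sum(w))

    print(x_w, sigma_xw)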
$y = \alpha + \beta x$   (A3.11)

We can never know the exact values of α and β, but we can find best estimates of these by using the principle of maximum likelihood.
For any given value of x = x_i, we can calculate the probability, P_i, of making a particular measurement of y = y_i. Assuming that the y observations are normally distributed and taking the true value of y at x = x_i to be y(x_i),

$P_i = \frac{1}{\sigma_i\sqrt{2\pi}}\exp\!\left[-\frac{\left(y_i-y(x_i)\right)^2}{2\sigma_i^2}\right]$   (A3.12)

Here y(x_i) = α + βx_i and σ_i is the standard deviation of the observed y values.
The probability of making n such measurements is

$P(\alpha,\beta) \propto \exp\!\left[-\frac{1}{2}\sum_i\frac{\left(y_i-y(x_i)\right)^2}{\sigma_i^2}\right]$   (A3.13)

For estimates a and b of α and β, write ŷ_i = a + bx_i; then

$P(a,b) = \prod_i P_i, \qquad P_i \propto \exp\!\left[-\frac{1}{2}\left(\frac{y_i-\hat y_i}{\sigma_i}\right)^2\right]$   (A3.14)

so that

$P(a,b) \propto \exp\!\left[-\frac{1}{2}\sum_i\left(\frac{y_i-\hat y_i}{\sigma_i}\right)^2\right]$   (A3.15)
We assume that the observed set of values is more likely to have come from the parent distribution given by equation (A3.12) than any other distribution, and therefore the probability given by equation (A3.13) is the maximum probability attainable by equation (A3.15). The best estimates of α and β are those which maximise the probability given by equation (A3.15). Maximising the exponential term means minimising the sum that appears within the exponential. Writing the summation in equation (A3.15) as χ², we have

$\chi^2 = \sum_i\left(\frac{y_i-\hat y_i}{\sigma_i}\right)^2$   (A3.16)

$\chi^2 = \sum_i\left(\frac{y_i-a-bx_i}{\sigma_i}\right)^2$   (A3.17)
To find values for a and b which will minimise χ², we partially differentiate equation (A3.17) with respect to a and b in turn, set the resulting equations to zero and solve for a and b. The same approach applies to equations more complicated than simply y = a + bx, for example y = a + bx + cx², where a, b and c are parameters. In such a situation we must differentiate with respect to each of the parameters in turn, set the equations to zero and solve for a, b and c.
Returning to equation (A3.17) and differentiating, we get

$\frac{\partial\chi^2}{\partial a} = -2\sum_i\frac{1}{\sigma_i^2}\,(y_i-a-bx_i) = 0$   (A3.18)

$\frac{\partial\chi^2}{\partial b} = -2\sum_i\frac{x_i}{\sigma_i^2}\,(y_i-a-bx_i) = 0$   (A3.19)

where σ_i is the standard deviation associated with the ith value of y, y_i. In many cases, σ_i is a constant and can be replaced by σ. For example, when using an instrument such as a voltmeter we might assess that all voltages have an uncertainty of ±10 mV. When σ_i is replaced by σ, we can write equations (A3.18) and (A3.19) as

$-\frac{2}{\sigma^2}\sum_i(y_i-a-bx_i) = 0$   (A3.20)

$-\frac{2}{\sigma^2}\sum_i x_i\,(y_i-a-bx_i) = 0$   (A3.21)
and these lead to

$\sum_i y_i = na + b\sum_i x_i$   (A3.22)

$\sum_i x_iy_i = a\sum_i x_i + b\sum_i x_i^2$   (A3.23)

Solving equations (A3.22) and (A3.23) for a and b gives

$a = \frac{\sum_i x_i^2\sum_i y_i - \sum_i x_i\sum_i x_iy_i}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}$   (A3.24)

$b = \frac{n\sum_i x_iy_i - \sum_i x_i\sum_i y_i}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}$   (A3.25)

Dividing equation (A3.22) through by n gives

$\bar y = a + b\bar x$   (A3.26)

Equation (A3.26) indicates that a 'line of best fit' found using least squares passes through the point (x̄, ȳ).
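A sketch of equations (A3.24) and (A3.25) applied to a few illustrative (x, y) pairs, together with a check that the fitted line passes through (x̄, ȳ):

    import numpy as np

    # Illustrative data
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

    n = len(x)
    Sx, Sy = x.sum(), y.sum()
    Sxx, Sxy = (x * x).sum(), (x * y).sum()
    delta = n * Sxx - Sx**2

    a = (Sxx * Sy - Sx * Sxy) / delta     # intercept, equation (A3.24)
    b = (n * Sxy - Sx * Sy) / delta       # slope, equation (A3.25)

    print(a, b)
    print(np.isclose(y.mean(), a + b * x.mean()))   # line passes through (x-bar, y-bar)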
If the σ_i are not all equal, solving equations (A3.18) and (A3.19) gives the weighted least squares equations

$a\sum_i\frac{1}{\sigma_i^2} + b\sum_i\frac{x_i}{\sigma_i^2} = \sum_i\frac{y_i}{\sigma_i^2}$   (A3.27)

$a\sum_i\frac{x_i}{\sigma_i^2} + b\sum_i\frac{x_i^2}{\sigma_i^2} = \sum_i\frac{x_iy_i}{\sigma_i^2}$   (A3.28)

so that

$a = \frac{\sum_i\dfrac{x_i^2}{\sigma_i^2}\sum_i\dfrac{y_i}{\sigma_i^2} - \sum_i\dfrac{x_i}{\sigma_i^2}\sum_i\dfrac{x_iy_i}{\sigma_i^2}}{\Delta}$   (A3.29)

$b = \frac{\sum_i\dfrac{1}{\sigma_i^2}\sum_i\dfrac{x_iy_i}{\sigma_i^2} - \sum_i\dfrac{x_i}{\sigma_i^2}\sum_i\dfrac{y_i}{\sigma_i^2}}{\Delta}$   (A3.30)

where

$\Delta = \sum_i\frac{1}{\sigma_i^2}\sum_i\frac{x_i^2}{\sigma_i^2} - \left(\sum_i\frac{x_i}{\sigma_i^2}\right)^2$   (A3.31)
Appendix 4
Standard errors

In appendix 3 we showed that the weighted mean (i.e. the mean when the standard deviation in the x values is not constant) may be written

$\bar x_w = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}$   (A4.1)

where x_i is the ith value, and σ_i is the standard deviation in the ith value. The 'unweighted' mean is written as usual as

$\bar x = \frac{1}{n}\sum_i x_i$   (A4.2)
(A4.3)
or
(A4.4)
(A4.5)
so that

$\sigma_{\bar x_w} = \left(\sum_i\frac{1}{\sigma_i^2}\right)^{-1/2}$   (A4.6)

Equation (A4.6) gives the standard error for the weighted mean. In situations in which σ₁ = σ₂ = ⋯ = σ, equation (A4.6) reduces to

$\sigma_{\bar x} = \frac{\sigma}{\sqrt{n}}$   (A4.7)
y=a+bx (A4.8)
$b = f(y_1, y_2, y_3, \ldots, y_n)$   (A4.9)
(A4.10)
In an unweighted fit we take all the σ_i's to be the same and replace them by σ, where σ is the population standard deviation in the y values and, as we cannot know this value, we adopt the usual approximation σ ≈ s, where¹

$s^2 = \frac{1}{n-2}\sum_i(\Delta y_i)^2$   (A4.11)

The intercept and slope found by unweighted least squares are

$a = \frac{\sum_i x_i^2\sum_i y_i - \sum_i x_i\sum_i x_iy_i}{\Delta}$   (A4.12)

$b = \frac{n\sum_i x_iy_i - \sum_i x_i\sum_i y_i}{\Delta}$   (A4.13)

where

$\Delta = n\sum_i x_i^2 - \left(\sum_i x_i\right)^2$   (A4.14)
Consider the jth value of y and its contribution to the uncertainty in a and b. We have

$\sigma_b^2 = \sum_{j=1}^{n}\left(\frac{\partial b}{\partial y_j}\right)^2\sigma_{y_j}^2$   (A4.15)

where

$\sigma_{y_j}^2 = \sigma^2$   (A4.16)
¹ The sum of squares of the residuals is divided by (n − 2), which is the number of degrees of freedom. The number of degrees of freedom is 2 less than the number of data points because a degree of freedom is 'lost' for every parameter that is calculated using the sample observations. Here there are two such parameters, a and b.
A similar equation can be written for σ_a² (just replace b by a in equations (A4.15) and (A4.16)).
Return to equation (A4.13) and differentiate b with respect to y_j:

$\frac{\partial b}{\partial y_j} = \frac{nx_j - \sum_i x_i}{\Delta}$   (A4.18)

Substituting this into equation (A4.15), with σ_{y_j}² = σ², gives

$\sigma_b^2 = \frac{\sigma^2}{\Delta^2}\sum_j\left(nx_j - \sum_i x_i\right)^2 = \frac{\sigma^2}{\Delta^2}\,n\!\left[n\sum_j x_j^2 - \left(\sum_i x_i\right)^2\right]$

As Δ = nΣx_i² − (Σx_i)², we have

$\sigma_b^2 = \frac{\sigma^2 n}{\Delta}$   (A4.22)

or

$\sigma_b = \sigma\left(\frac{n}{\Delta}\right)^{1/2}$   (A4.23)

Similarly, differentiating equation (A4.12),

$\frac{\partial a}{\partial y_j} = \frac{\sum_i x_i^2 - x_j\sum_i x_i}{\Delta}$   (A4.24)

so that

$\sigma_a^2 = \frac{\sigma^2\sum_i x_i^2}{\Delta}$   (A4.25)

or

$\sigma_a = \sigma\left(\frac{\sum_i x_i^2}{\Delta}\right)^{1/2}$   (A4.26)
If the standard deviations in the y values are not equal, we introduce σ_i explicitly into the equations for the standard error in a and b. The standard errors in a and b are now given by

$\sigma_a = \left(\frac{\sum_i x_i^2/\sigma_i^2}{\Delta}\right)^{1/2}$   (A4.27)

$\sigma_b = \left(\frac{\sum_i 1/\sigma_i^2}{\Delta}\right)^{1/2}$   (A4.28)

where

$\Delta = \sum_i\frac{1}{\sigma_i^2}\sum_i\frac{x_i^2}{\sigma_i^2} - \left(\sum_i\frac{x_i}{\sigma_i^2}\right)^2$   (A4.29)
Appendix 5
Introduction to matrices for least squares
analysis
Applying least squares to the problem of finding the best straight line, represented by the equation y = a + bx, through data creates two equations which must be solved for a and b (see equations (A3.22) and (A3.23)). The equations can be solved by the method of 'elimination and substitution', but this approach becomes increasingly cumbersome when equations to be fitted to data contain three or more parameters that must be estimated, such as
Fitting equation (A5.1) to data using least squares creates four equations to be solved for a, b, c and d. The preferred method for dealing with fitting of equations to data where the equations consist of several parameters and/or independent variables is to use matrices. Matrices provide an efficient means of solving linear equations as well as offering compact and elegant notation. In this appendix we consider matrices and some of their basic properties, especially those useful in parameter estimation by linear least squares.¹
A matrix consists of a rectangular array of elements, for example:
¹ Neter, Kutner, Nachtsheim and Wasserman (1996) deals with the application of matrices to least squares.
where i refers to the ith row and j refers to the jth column. Matrix B above consists of a single column of elements, where each element can be identified unambiguously by a single subscript.
In general, a matrix consists of m rows and n columns and is usually referred to as a matrix of dimension m × n (note that the number of rows is specified first, then the number of columns). If m = n, as it does for matrix A, then the matrix is said to be 'square'. By contrast, B is a 3 × 1 matrix. A matrix
consisting of a single column of elements is sometimes referred to as a column
vector, or simply as a vector.
Many operations such as addition, subtraction and multiplication can be
defined for matrices. For data analysis using least squares, it is often required
to multiply two matrices together.
Matrix multiplication
Consider matrices A and B where

$A = \begin{pmatrix} 2 & 4 \\ 7 & 5 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 5 \\ 3 & 9 \end{pmatrix}$

Writing the elements generally,

$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}, \qquad P = AB = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}$

where each element of P is formed as $p_{ij} = a_{i1}b_{1j} + a_{i2}b_{2j}$, so that

$AB = \begin{pmatrix} 14 & 46 \\ 22 & 80 \end{pmatrix}, \qquad BA = \begin{pmatrix} 37 & 29 \\ 69 & 57 \end{pmatrix}$

In this example (and generally), AB ≠ BA, so that the order in which the matrices are multiplied is important.
If matrix A consists of r₁ rows and c₁ columns and matrix B consists of r₂ rows and c₂ columns, then the product AB can only be formed if c₁ = r₂. If c₁ = r₂, then AB is a matrix with r₁ rows and c₂ columns. For example

$A = \begin{pmatrix} 4 & 1 & 6 \\ 8 & 7 & 2 \\ 3 & 1 & 5 \end{pmatrix}, \qquad B = \begin{pmatrix} 7 \\ 2 \\ 3 \end{pmatrix}, \qquad P = AB = \begin{pmatrix} 48 \\ 76 \\ 38 \end{pmatrix}$

Here A is 3 × 3 and B is 3 × 1, so c₁ = r₂ and the product P is a 3 × 1 matrix.
In contrast, the product BA cannot be formed: B has c₁ = 1 column while A has r₂ = 3 rows, so c₁ ≠ r₂.
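A quick numerical check of these multiplication rules, using the matrices above:

    import numpy as np

    A = np.array([[2, 4],
                  [7, 5]])
    B = np.array([[1, 5],
                  [3, 9]])

    print(A @ B)            # [[14 46] [22 80]]
    print(B @ A)            # [[37 29] [69 57]] -- AB and BA differ

    A3 = np.array([[4, 1, 6],
                   [8, 7, 2],
                   [3, 1, 5]])
    B3 = np.array([[7], [2], [3]])
    print(A3 @ B3)          # [[48] [76] [38]], a 3 x 1 matrix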
The identity matrix, I, has ones on its leading diagonal and zeros elsewhere; for a 3 × 3 matrix,

$I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$

Multiplying a matrix by an identity matrix of compatible dimension leaves it unchanged; for example, for

$A = \begin{pmatrix} 2 & 1 \\ 1 & 8 \\ 3 & 4 \end{pmatrix}, \qquad IA = A$

The inverse, A⁻¹, of a square matrix A, for example

$A = \begin{pmatrix} 2 & 1 & -4 \\ 1 & 8 & 2 \\ 3 & 4 & 2 \end{pmatrix}$

is defined by the matrix relationship

$AA^{-1} = A^{-1}A = I$
If the matrix is, for example,

$A = \begin{pmatrix} -4 & 1 & -4 \\ 2 & 8 & 2 \\ 2 & 4 & 2 \end{pmatrix}$

then it is not possible to determine A⁻¹ and A is said to be 'singular'. If A⁻¹ can be found then A is said to be 'non-singular'. Matrix inversion is a challenging operation to perform 'by hand' even for small matrices. A computer package with matrix manipulation routines is almost mandatory if matrices larger than 3 × 3 are to be inverted.
Using the inverse matrix to solve for parameter estimates in least squares

In chapters 6 and 7 we discovered that the application of the least squares technique leads to two or more simultaneous equations that must be solved to find best estimates for the parameters that appear in an equation that is to be fitted to data. The equations can be written in matrix form (see equation (7.11)),

AB = P   (A5.2)

where it is the elements of matrix B which are the best estimates of the parameters. To isolate these elements we multiply both sides of equation (A5.2) by A⁻¹. This gives

A⁻¹AB = A⁻¹P   (A5.3)

IB = A⁻¹P   (A5.4)

Now IB = B, so we have

B = A⁻¹P   (A5.5)
Example
Suppose that after applying the method of least squares to experimental data, we obtain the following equations which must be solved for a, b, c and d:

1.75a + 18.3b + 42.8c − 25.9d = 49.3
3.26a − 19.8b + 17.4c − 32.2d = 65.3
18.6a + 14.7b + 12.2c + 14.3d = −18.1
65.7a − 15.3b − 18.9c + 25.3d = 19.1

In matrix form, AB = P with

$A = \begin{pmatrix} 1.75 & 18.3 & 42.8 & -25.9 \\ 3.26 & -19.8 & 17.4 & -32.2 \\ 18.6 & 14.7 & 12.2 & 14.3 \\ 65.7 & -15.3 & -18.9 & 25.3 \end{pmatrix}, \qquad B = \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}, \qquad P = \begin{pmatrix} 49.3 \\ 65.3 \\ -18.1 \\ 19.1 \end{pmatrix}$

The elements of B are found from B = A⁻¹P. An efficient way to obtain A⁻¹ is to use the MINVERSE() function in Excel® as described in section 7.4.1. Using this function we find

$B = A^{-1}P = \begin{pmatrix} 1.307834 \\ 0.780933 \\ -1.00025 \\ -2.91625 \end{pmatrix}$

so that a = 1.307834, b = 0.780933, c = −1.00025 and d = −2.91625.
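The same calculation outside Excel®, either by explicitly inverting A (the MINVERSE route) or, preferably, by solving the linear system directly:

    import numpy as np

    A = np.array([[ 1.75,  18.3,  42.8, -25.9],
                  [ 3.26, -19.8,  17.4, -32.2],
                  [18.6,   14.7,  12.2,  14.3],
                  [65.7,  -15.3, -18.9,  25.3]])
    P = np.array([49.3, 65.3, -18.1, 19.1])

    B_via_inverse = np.linalg.inv(A) @ P   # B = A^-1 P, as in equation (A5.5)
    B_via_solve = np.linalg.solve(A, P)    # numerically preferable

    print(B_via_inverse)
    print(B_via_solve)                     # a, b, c, d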
Standard normal variable, z:   z = (x − μ)/σ

t variable:   t = (x̄ − μ)/(s/√n)

Mean:   x̄ = (1/n) Σ x_i

Weighted mean:   x̄_w = Σ(x_i/σ_i²) / Σ(1/σ_i²)

Standard deviation:   σ ≈ s = [Σ(x_i − x̄)²/(n − 1)]^{1/2}

Standard error of mean:   σ_x̄ = σ/√n ≈ s/√n

Fractional uncertainty:   u_x/x̄

Uncertainty in y (y is a function of x only):   u_y = |dy/dx| u_x

Uncertainty in y for uncorrelated uncertainties in x and z:   u_y = [(∂y/∂x)² u_x² + (∂y/∂z)² u_z²]^{1/2}

Residual:   Δy_i = y_i − ŷ_i

Standardised residual:   Δy_i/σ_i

Standard deviation in y values (unweighted least squares):   σ ≈ s = [Σ(y_i − ŷ_i)²/(n − 2)]^{1/2}

Slope (weighted least squares):   b = [Σ(1/σ_i²) Σ(x_iy_i/σ_i²) − Σ(x_i/σ_i²) Σ(y_i/σ_i²)] / Δ

Intercept (weighted least squares):   a = [Σ(x_i²/σ_i²) Σ(y_i/σ_i²) − Σ(x_i/σ_i²) Σ(x_iy_i/σ_i²)] / Δ

where Δ = Σ(1/σ_i²) Σ(x_i²/σ_i²) − [Σ(x_i/σ_i²)]²
Answers to exercises and problems
Chapter i
Exercise A
Exercise B
1. (i) 13.8 zJ; (ii) 0.36 μs; (iii) 43.258 kW; (iv) 780 Mm/s.
2. (i) 6.50 × 10⁻¹⁰ m; (ii) 3.7 × 10⁻¹¹ C; (iii) 1.915 × 10⁶ W; (iv) 1.25 × 10⁻⁴ s.
Exercise C
1. (i) three; (ii) three; (iii) two; (iv) four; (v) one; (vi) four.
2.
Exercise D
(i) Using the guidelines in section 1.4.1, the number of intervals, N ≈ 7. Dividing the range by 7 and rounding up gives an interval width of 0.1 g. Now we can construct a grouped frequency distribution:
49.8<x<49.9 4
49.9<x<50.0 3
50.0<x<50.1 22
50.1 <x< 50.2 18
50.2<x<50.3 4
50.3<x<50.4 0
50.4<x<50.5 1
Exercise E
A graph with semi-logarithmic scales is most appropriate.
Exercise F
x= 102.04 pF, median= 101.25 pF.
Exercise G
1. Expanding equation (1.10) gives

σ = [ (1/n)(Σx_i² − 2x̄ Σx_i + Σx̄²) ]^{1/2}
Exercise H
x = 2.187 s, standard deviation, s=0.75 s.
Exercise I
(i) s= 0.052 s (using equation (1.16));
(ii) s= 0.047 s (using equation (1.19));
(iii) percentage difference ≈ −10%.
Problems
3. kg ^•m‘^3.s4.^2
4. (i) 5.7 × 10⁻⁵ s; (ii) 1.4 × 10⁴ K; (iii) 1.4 × 10³ m/s; (iv) 1.0 × 10^ Pa; (v) 1.5 × 10⁻³ Ω.
Chapter 2
In the interests of brevity, answers to exercises and end of chapter prob¬
lems for this chapter show only relevant extracts from a Worksheet.
Exercise A
1.
t(s) V(volts) l(amps) Q(coulombs)
0 3.98 3.32E-07 1.8706E-06
5 1.58 1.32E-07 7.426E-07
10 0.61 5.08E-08 2.867E-07
15 0.24 2E-08 1.128E-07
20 0.094 7.83E-09 4.418E-08
25 0.035 2.92E-09 1.645E-08
30 0.016 1.33E-09 7.52E-09
35 0.0063 5.25E-10 2.961 E-09
40 0.0031 2.58E-10 1.457E-09
45 0.0017 1.42E-10 7.99E-10
50 0.0011 9.17E-11 5.17E-10
55 0.0007 5.83E-11 3.29E-10
60 0.0006 5E-11 2.82E-10
2.
t(s)  V(volts)  I(amps)  Q(coulombs)  I^0.5 (amps^0.5)
0 3.98 3.32E-07 1.87E-06 0.000576
5 1.58 1.32E-07 7.43E-07 0.000363
10 0.61 5.08E-08 2.87E-07 0.000225
15 0.24 2E-08 1.13E-07 0.000141
20 0.094 7.83E-09 4.42E-08 8.85E-05
25 0.035 2.92 E-09 1.65E-08 5.4E-05
30 0.016 1.33E-09 7.52E-09 3.65E-05
35 0.0063 5.25E-10 2.96E-09 2.29E-05
40 0.0031 2.58E-10 1.46E-09 1.61E-05
45 0.0017 1.42E-10 7.99E-10 1.19E-05
50 0.0011 9.17E-11 5.17E-10 9.57E-06
55 0.0007 5.83E-11 3.29E-10 7.64E-06
60 0.0006 5E-11 2.82E-10 7.07E-06
Exercise B
20 6.5 130
30 7.2 216
40 8.5 340
Exercise C
(i) When g = 9.81, column B reads: (ii) when g = 1.6, column B reads:
t(s) t(s)
0.638551 1.581139
0.903047 2.236068
1.106003 2.738613
1.277102 3.162278
1.427843 3.535534
Exercise D
T(K) Q (J/s)
1000 3515.4      σ_SB (W/(m² K⁴))  5.67E-08
2000 56246.4 A (m^) 0.062
3000 284747.4
4000 899942.4
5000 2197125
6000 4555958
Exercise E
1.
Tm (m) 6.35E-03
Tn (m) 6.72E-03
m 52
n 86
l(m) 6.02E-07
f^numerator 4.84E-06
^denominator 2.05E-05
R(m) 2.36E-01
2.
d (m) Vd (m/s)
0.1 339.3848 V (m/s) 344
0.2 341.6924 f(Hz) 5
0.3 342.4616
0.4 342.8462
0.5 343.077
0.6 343.2308
0.7 343.3407
0.8 343.4231
0.9 343.4872
1 343.5385
Exercise F
1.
I(amps)  LN(I)  LOG10(I)
3.32E-07 -14.9191 -6.4793
1.32E-07 -15.843 -6.88052
5.08E-08 -16.7947 -7.29385
2E-08 -17.7275 -7.69897
7.83E-09 -18.6649 -8.10605
2.92E-09 -19.6528 -8.53511
1.33E-09 -20.4356 -8.87506
5.25E-10 -21.3676 -9.27984
2.58E-10 -22.0768 -9.58782
1.42E-10 -22.6775 -9.84873
9.17E-11 -23.1129 -10.0378
5.83E-11 -23.5648 -10.2341
5E-11 -23.719 -10.301
2.
h (m) P (Pa)
1.00E+03 8.92E+04 T(K) 273
2.00E+03 7.88E+04 Po(Pa) 1.01E+05
3.00E+03 6.96E+04
4.00E+03 6.15E+04
5.00E+03 5.43E+04
6.00E+03 4.79E+04
7.00E+03 4.23E+04
8.00E+03 3.74E+04
9.00E+03 3.30E+04
Exercise G
X 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Exercise H
X y xy x^
1.75 23 40.25 3.0625
3.56 34 121.04 12.6736
5.56 42 233.52 30.9136
5.85 42 245.7 34.2225
8.76 65 569.4 76.7376
9.77 87 849.99 95.4529
sums 35.25 293 2059.9 253.0627
Exercise I
max min range
98 12 86
Exercise J
mean median mode
23.60417 25 27
Exercise K
Mean 6.7
Harmonic Mean 6.726957
Average Deviation 0.2376
Standard Deviation 0.29966
Exercise L
2.
Exercise M
(iii) [Graph plotted against wavelength (m), with series for 1250 K, 1500 K, 1750 K and 2000 K.]
Problems
1. Note that each term appearing within the square root in equation (2.12)
V (m/s) m (kg)
2.90E+08 1.39E-29 mo 9.10E-31
2.91 E+08 1.54E-29 c 3.00E+08
2.92E+08 1.73E-29
2.93E+08 1.97E-29
2.94E+08 2.30E-29
2.95E+08 2.75E-29
2.96E+08 3.44E-29
2.97E+08 4.57E-29
2.98E+08 6.85E-29
2.99E+08 1.37E-28
3. (i)
mean (%) 66.19
standard deviation (%) 5.482193
maximum (%) 79
minimum (%) 54
range (%) 25
4. (ii)
[Graph of the computed values (Series 1) plotted against time (s).]
5.
[Graph plotted against temperature (°C), with series for Halothane, Chloroform and Trichlorethylene.]
6. (ii)
P
Pp
H (%)    f=4    f=3    f=2.5    f=2    f=1.5
0 1 1 1 1 1
5 1.210526 1.157895 1.131579 1.105263 1.078947
10 1.444444 1.333333 1.277778 1.222222 1.166667
15 1.705882 1.529412 1.441176 1.352941 1.264706
20 2 1.75 1.625 1.5 1.375
25 2.333333 2 1.833333 1.666667 1.5
30 2.714286 2.285714 2.071429 1.857143 1.642857
35 3.153846 2.615385 2.346154 2.076923 1.807692
40 3.666667 3 2.666667 2.333333 2
45 4.272727 3.454545 3.045455 2.636364 2.227273
50 5 4 3.5 3 2.5
55 5.888889 4.666667 4.055556 3.444444 2.833333
60 7 5.5 4.75 4 3.25
65 8.428571 6.571429 5.642857 4.714286 3.785714
70 10.33333 8 6.833333 5.666667 4.5
75 13 10 8.5 7 5.5
80 17 13 11 9 7
7. [Graph: displacement versus force for an archer's bow, with force (N) on the horizontal axis.]
Chapter 3
Exercise A
Exercise B
x (g)   7.2          7.4          7.6          7.8          8.0      8.2      8.4      8.6      8.8      9.0
cdf     3.401×10⁻⁶   2.327×10⁻⁴   6.210×10⁻³   6.681×10⁻²   0.3085   0.6915   0.9332   0.9938   0.9998   1.000
Exercise C
0.0476.
Exercise D
z 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
cdf 0.5000 0.5398 0.5793 0.6179 0.6554 0.6915 0.7257 0.7580 0.7881 0.8159 0.8413
Exercise E
1. (i) 0.2192; (ii) 7.9 (round to 8).
2. (i) 0.02275; (ii) 0.9545; (iii) 0.0214; (iv) 0.02275.
3. 0.01242.
4. (i)(a) 0.1587; (b) 0.1359; (c) 0.00135; (ii) 13.5.
Exercise F
50% confidence interval: 17.99 °C to 18.83 °C;
68% confidence interval: 17.79 °C to 19.03 °C;
90% confidence interval: 17.38 °C to 19.44 °C;
95% confidence interval: 17.18 °C to 19.64 °C;
99% confidence interval: 16.79 °C to 20.03 °C.
Exercise G
0.84162
Exercise H
99% confidence interval: 893 kg/m³ to 917 kg/m³.
Exercise I
Using equation (3.21), σ_x̄ = 0.014 eV. Using equation (3.27), σ_x̄ = 0.014 eV.
Exercise J
90% confidence interval: 4.639 mm to 4.721 mm
Exercise K
(i) [Normal quantile plot: ln(x) plotted against q(f_i).]
(ii) Comparing ln(x) versus q(f_i) with the graph given in figure 3.26 indicates that the normality has been much improved by the logarithmic transformation.
Exercise L
5.
Problems
1. (ii) A = 1;
(iii) (a) 0.145; (b) 0.055.
4. Mean = 916.7 MPa, standard deviation = 20 MPa.
5. x= 0.64086 V, s= 0.0027 V. Number of diodes with voltage in excess of
0.6400 V would be 125.
6. (i) (b)
[Graph: vertical axis from 0 to 1.0, horizontal axis focal length (cm) from 13.0 to 17.0.]
7.
NORMSDIST (z)
8. 90% confidence interval for the population mean: 2080 Hz to 2132 Hz.
9. 95% confidence interval for the population mean: 131.5 μPa to 144.5 μPa.
10. (i) 0.147584; (ii) 0.02275; (iii) overleaf; (iv) when ν = 61.
10. (iii) [Graph plotted against degrees of freedom, ν, from 0 to 100; vertical axis from 0.00 to 0.16.]
11. (i) Graph of x_i versus q(f_i) is not linear, suggesting data are not normally distributed.
(ii) Graph of x_i versus q(f_i) for the transformed data is linear, indicating that the original data are lognormally distributed.
12. (iii) Normal quantile plot indicates that distribution is consistent with
a lognormal distribution.
Chapter 4
Exercise A
(i) P(r=0) = 0.8904;
(ii) P(r=l) = 0.1035;
(iii) P(r>l) = 0.0061.
Exercise B
(i) P(r= 20) = 0.09306;
(ii) P(r ≤ 20) = 0.7622;
(iii) P(r < 20) = 0.6692;
(iv) P(r > 20) = 0.2378;
(v) P(r ≥ 20) = 0.3308.
Exercise C
(i) /u. = 8;
(ii) σ = 2.8;
(iii) P(r> 2) = 0.9972.
Exercise D
Exercise E
(i) P(r=0) = 0.6065;
(ii) P(r<3) = 0.9982;
(iii) P(2<r<4) = 0.0900.
Exercise F
(i) P(r=0) = 0.3535;
(ii) P(r=l) = 0.3676;
(iii) P(r> 2) =0.0878.
Exercise G
(i) P(r = 0) = 0.2209;
(ii) P(r=l) =0.3336;
(iii) P(r= 3) = 0.1267;
(iv) P(2<r< 4) = 0.4265;
(v) P(r> 6) = 0.000962.
Exercise H
(i) (Using normal approximation to the Poisson distribution)
P(179.5<r< 230.5) = 0.9109.
Problems
l-Cio,5 = 252,q32=105, C,2,24 = 3-537Xl0ii,C33„,2g„ = 1.311X10i^l
2. (i) P(r= 1) = 0.0872; (ii) P(r> 1) = 0.004323.
Number of screens expected to have more than one faulty transistor = 22.
3. (i) Number with three functioning ammeters = 18 (rounded).
(ii) Number with two functioning ammeters = 6 (rounded).
(iii) Number with less than two functioning ammeters = 1 (rounded).
0 16
1 21
2 14
3 6
4 2
5 1
7. (i) We require P(0) = 0.2. Using equation (4.9), 0.2 = exp(−μ), so that μ = 1.609.
(ii) P(r> 4) = 0.02421.
8. 0.9817.
Chapter 5
Exercise A
(i) Mean light intensity = 345.2 lx.
(ii) Standard deviation s= 23 lx.
(iii) 99% confidence interval = (345 ± 15) lx.
Exercise B
Fractional uncertainty = 0.05.
Absolute uncertainty = 73 °C.
Exercise C
1. f = 1812 Hz, u_f = 39 Hz.
2. U=38.8 mm^, u^=8.3 mm^.
Exercise D
p = 1669 Pa, u_p = 64 Pa.
Exercise E
1. n = 1.487, u_n = 0.046.
2. (i) f = 104.6 mm, u_f = 1.0 mm; (ii) m = 5.004, u_m = 0.061.
Exercise F
(i) Mean ==221.8 s, standard deviation, s= 7.9 s.
(ii) Value furthest from mean is 235 s.
(iii) Number expected to be at least as far from the mean as the ‘suspect’
value is 0.465. Note this is very close to 0.5 and it would be sensible, if pos¬
sible, to acquire more data rather than eliminate the ‘outlier’.
(iv) New mean = 218.5 s, new standard deviation, s=3.1 s.
Exercise G
(i) Upper limit for standard deviation s= 0.0031 mm. Upper limit for
standard error of mean = 0.0014 mm.
(ii) 95% confidence interval = (1.2200 ± 0.0038) mm.
Exercise H
Uncertainty in capacitance = 3.5 nF.
Exercise I
(i) Gain = 1730, uncertainty in gain= 12.
(ii) Assumptions:
(a) There is no uncertainty in R^(0) and a.
(b) a is constant over the temperature range 21 °C to 25 °C.
(c) The gain setting resistor is at the same temperature as the
room.
Exercise J
267 MΩ (beware of premature rounding in this problem).
Exercise K
17.27 s.
Exercise L
(i) Mean = 853.75 mm.
(ii) Standard deviation s=2.0 mm.
(iii) Standard error = 0.70 mm.
(iv) 95% confidence interval = (853.8 ± 1.7) mm.
(v) 95% confidence interval = (853.8 ± 0.85) mm.
(vi) 95% confidence interval = (853.8 ± 1.9) mm.
Exercise M
Weighted mean= 1.103 s.
Exercise N
Standard error of weighted mean = 0.11 s.
Problems
1. (i) Fractional uncertainty = 0.034;
(ii) Percentage uncertainty = 3.4%.
2. (i) Mean rebound height = 186.0 mm.
(ii) Standard error in rebound height =1.2 mm.
(iii) 95% confidence interval for the rebound height = (186.0 ±3.9) mm.
3. (i) Mean film thickness = 328.33 nm.
(ii) Standard error in mean =12 nm.
(iii) 99% confidence interval = (328 ± 47) nm.
4. n= 1.466, n„ = 0.027.
5. r= (0.386±0.032).
6. θ = (5.52 ± 0.10) × 10⁻³ rad.
7. R = (125.57 ± 0.59) Ω.
8. (i) c=(0.6710±0.0084).
(ii) c= (0.6710 ±0.0073).
9. a= (-6.54± 0.99) W.
10. (i) Mean length = 47.83 cm.
(ii) Value furthest from mean is 42.7 cm.
(iii) Yes, reject outlier.
(iv) New mean = 48.4 cm.
11. (i) (166.0 ±2.0) mV.
(ii) (166.00 ±0.93) mV.
(iii) (166.0 ±2.2) mV.
12. (i) Mean mass = 0.9656 g.
(ii) Standard error of mean = 0.0032 g.
(iii) 95% confidence interval = (0.9656 ± 0.0072) g.
13. (i) Y = 33.38mL, s=0.28mL.
(ii) Possible outlier is 33.9 mL. Applying Chauvenet’s criterion indicates
that outlier should be removed.
(iii) New X = 33.28 mL, new s= 0.13 mL.
14. Weighted mean = 1.0650 g/cm³, standard error in weighted mean = 0.0099 g/cm³.
Chapter 6
Exercise A
a = 332.1 m/s, b = 0.6496 m/(s·°C), SSR = 15.41 m²/s².
Exercise B
σ_a = 1.5, σ_b = 0.23.
Exercise C
1. The 99% confidence interval for a is (4 ±22) X 10^^. The 99% confidence
interval for jS is (2.51 ±0.12) X 10^^ mL/ng.
2.
(i) a = 10.24 Ω, b = 4.324 × 10⁻² Ω/°C.
(ii) σ_a = 0.066 Ω, σ_b = 1.4 × 10⁻³ Ω/°C.
(iii) The 95% confidence intervals are A = (10.24 ± 0.14) Ω and B = (4.32 ± 0.32) × 10⁻² Ω/°C.
Exercise D
(i) a = 3.981, b = 16.48, σ_a = 1.1, σ_b = 1.9.
Exercise E
(i) Plot P versus h; (ii) a = P₀, b = ρg.
Exercise F
(i) a=l.20046 m, = 1.07818 X lO-^ m/°C, o-,, = 5.0 X 10’^ m,
o-^ = 8.0Xl0-^m/°C.
(ii) a = 8.981XlO-®°C-h
(iii) cr,^ = 6.7XlO-7°C-h
Exercise G
The 99% confidence interval for when X(,= 15 is 4.5 ± 7.1.
Exercise H
The 95% prediction interval for y at x^ = 12 is 12 ± 12.
Exercise 1
(i) y-4677.89 +14415.1 lx,..
(ii) x₀ = 3.640 ppm, σ_{x₀} = 0.17 ppm.
Exercise J
(i) 'Usual' least squares (error in the values of V): k₀ = 0.6842 V, k₁ = −2.391 × 10⁻³ V/°C.
(ii) When the errors are in the temperature values: k₀ = 0.6847 V, k₁ = −2.401 × 10⁻³ V/°C.
Exercise K
(i) r = −0.8636.
(ii) a = 31.794 °C, b = −0.5854 °C/cm³.
(iv) A plot of the data indicates that the assumption of linearity is not valid.
Exercise L
r= 0.9667.
Exercise M
(i) r= 0.7262.
(ii) Value of r is not significant.
Exercise N
(ii) a= 0.4173 s,b= 0.4779 s/kg.
(iii) [Plot of the residuals (s) against mass (kg), between 0.0 and 2.5 kg.]
(iv) Yes, probably wrong equation fitted to data (i.e. period is not linearly
related to mass).
Exercise 0
Number of data expected to be at least as far from the best line as the
outlier is 0.392. Based on Chauvenet’s criterion, the outlier should be
rejected and the intercept and slope recalculated.
Exercise P
1. (i) Plot ln R versus T. Intercept = ln A, slope = −B.
(ii) Plot r'versus t. Intercept = u, slope = g.
(ill) Plot //versus T. Intercept = - CT^, slope = C.
(iv) Plot versus R^. Intercept = T^, slope = — k.
(v) Plot Tversus Vm. Intercept = 0, slope = Ztt/ VX.
(vi) Plot 1//versus R. Intercept = r/E, slope = -l/£
(vii) Plot 1/v versus 1/u. Intercept = 1/f, slope = −1.
(viii) Plot ln N versus ln C. Intercept = ln k, slope = 1/n.
(ix) Plot 1) versus l/A^. Intercept = 1/71, slope = -B/A
(x) Plot tLE versus D. Intercept=A, slope = AB.
2. A:=2.31lXI0-®V-CpF 2, </,= 1.064V.
Exercise Q
(i) y= 7.483, u,,-0.13.
(ii) y= 3136, 220.
(iii) y-0.01786, -6.4X lO'^.
(iv) y=3.189XlO-^ = 2.3X10-5.
(v) y= 1.748, =0.016.
Exercise S
(i) To linearise the equation, take the natural logarithms of both sides of
the equation to give lnC= InTl- A</2. This is of the form y= a + bx, where
y= InC, <2= InA, b= -A and x= d^.
(ii) 71=1.998X10^ counts, A = 6.087X10 ^ mm“2, cr^= 1.3X102 counts,
cr^ = 4.7 X 10"® mm-2.
Exercise T
a = 2.311, b = −19.70 V⁻¹, σ_a = 0.063, σ_b = 0.82 V⁻¹.
Problems
1. (ii) a = 9.803 m/s², b = −2.915 × 10⁻⁶ s⁻².
(iii) SSR = 0.0063 (m/s²)², σ = 0.028 m/s².
(iv) σ_a = 0.019 m/s², σ_b = 3.1 × 10⁻⁷ s⁻².
(v) r = −0.9579.
(vi) a = 9.7927 m/s², b = −2.867 × 10⁻⁶ s⁻², SSR = 0.0004533 (m/s²)², σ = 0.0075 m/s², σ_a = 0.0051 m/s², σ_b = 8.3 × 10⁻⁸ s⁻², r = −0.9967.
2. k = 27.09 MPa, σ_k = 0.75 MPa.
This is of the form y = a + bx, where y = ln V, a = ln k, b = 1/n and x = ln C.
(ii) 2.638, n-2.346, fr^= 0.036, fr^ = 0.028.
(iii) There is an indication that as ln C increases, so do the standardised residuals. A weighted fit is probably appropriate, but more points are needed to confirm this.
(iv) When C = 0.085 mol/L, V = (0.923 ± 0.017) mol.
5. (i) fl = 0.9591, fi=0.9039.
® 1-863,/„,„ = 0.9591.
(iii) Note 4,^= a+b. Asa and b are correlated, replace a with y- bx
before proceeding to calculate a, . The calculation gives
hnax
a, =0.038.
‘max
6. (i) Plot 1/Xversus 1/i? The intercept a=-B/A, and the slope 1/A
(ii) a=2228 m^/kg, b= 1338 N/kg, o-^ = 31 m^/kg, cr^= 15 N/kg.
(iii) A=7.473XlO-4kg/N, 5=-1.665m2/N, o-^ = 8.6X lO'^kg/N,
o-g = 0.040 m2/N.
7. (i) fi=5.833m i-Pa-i, o-^=0.55m-TPa-i.
(ii) (i= 1.748 X 10-1° m, a^ = 8.3X lO'i^ m.
11. (ii) r= 0.9220.
(iii) Here we have six points. Using table 6.20, the probability of having
r>0.9 when data are uncorrelated is 0.014, therefore we have evi¬
dence that the correlation is significant.
Chapter 7
Exercise A
Begin by writing the equation of 'best fit' as ŷ_i = a + bT_i + cT_i ln T_i, where a, b and c are best estimates of α, β and γ respectively. The matrix equation to be solved for a, b and c is

$\begin{pmatrix} n & \sum T_i & \sum T_i\ln T_i \\ \sum T_i & \sum T_i^2 & \sum T_i^2\ln T_i \\ \sum T_i\ln T_i & \sum T_i^2\ln T_i & \sum (T_i\ln T_i)^2 \end{pmatrix}\begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum T_iy_i \\ \sum T_iy_i\ln T_i \end{pmatrix}$
Exercise B
Exercise C
(i) 9436.22
5547.23
7173.82
(ii) 5529
6140
4428
6961
Exercise D
Best estimate of A, a = −11.82.
Best estimate of B, b = 0.4244 Ω/K.
Best estimate of C, c = −5.928 × 10⁻⁵ Ω/K².
Exercise E
a= 15.36, ^7= 2.408, c= 1.876.
Exercise F
Best estimate of A «=-6.776 |xV.
Best estimate of B, b=4.922 X 10“^ ja,V/K.
Best estimate of C, c= 9.121 x 10"® jiV/K^.
Best estimate of D, d=-6.747 X 10^® [xV/K^.
(T^ = 0.083 |xV.
o-^= 1.6X10-3 |jlV/K.
o-^=9.0X10-VV/K2.
a-^=1.6XlO-VV/K3.
Exercise G
α = (0.010 ± 0.011) N, β = (3.9 ± 3.4) N/m, γ = (427 ± 24) N/m².
Exercise H
(i) a = 13.27, b = 3.628, c = 1.426.
(ii) σ_a = 1.1, σ_b = 0.55, σ_c = 0.095.
Exercise I
= 0.9988.
Exercise J
(i) a=5.102X 10-3 ^ 1.028X 10-2, C-8.019X IQ-^.
Problems
1.
n ^Xi ^expx; a
(ii) ^Xi 2^? ;^x,expx,- b =
_2expx,- ^x^expxi c _2y,expx,
n 27
(i)
,x,- n 2^?
(ii) Writing the best estimates of A, B and C as a, b and c, respectively, we have a = 1.740 mm, b = 26.87 mm·mL/minute, c = 0.02366 mm·minute/mL.
(iii) σ_a = 0.13 mm, σ_b = 0.89 mm·mL/minute, σ_c = 1.7 × 10⁻³ mm·minute/mL.
3. Writing the best estimates of s₀, u and g as a, b and c respectively, we have a = 134.2 m, b = 46.27 m/s, c = −10.06 m/s², σ_a = 3.4 m, σ_b = 2.6 m/s, σ_c = 0.42 m/s².
4. Writing best estimates of A, B, C and D as a, b, c and d, respectively, we
have
a = 0.9230, b = -67.57 cm^, c= 1977 cm® and £/= 2.387 X 10'^ cm®; (t^ = 0.017,
(Tj^=5.2 cm®, (7^=430 cm® and cr^==9.7X 10® cm®.
5. For two parameters, AIC= 149.3, 0.9707. For three parameters,
AIC = 151.3, = 0.9675.
Both indicators of goodness of fit support the equation y=a+bx being the
better fit to data.
6. (i) Writing the best estimates of A, B and C as a, b and c respectively, we have a = 29.90 J·mol⁻¹·K⁻¹, b = 4.304 × 10⁻³ J·mol⁻¹·K⁻², c = −1.632 × 10⁵ J·mol⁻¹·K.
(ii) σ_a = 0.22 J·mol⁻¹·K⁻¹, σ_b = 2.4 × 10⁻⁴ J·mol⁻¹·K⁻², σ_c = 1.8 × 10⁴ J·mol⁻¹·K.
(iii) A = (29.90 ± 0.47) J·mol⁻¹·K⁻¹, B = (4.30 ± 0.53) × 10⁻³ J·mol⁻¹·K⁻², C = (−1.63 ± 0.40) × 10⁵ J·mol⁻¹·K.
7. Indicators of goodness of fit including AIC and residuals should indicate
that equation (7.62) is a better fit to data than equation (7.63).
Chapter 8
Exercise A
H₀: μ = 1.260 V; H₁: μ ≠ 1.260 V. For the data, z = −3.0 and z_crit = 1.96. As |z| > z_crit, reject the null hypothesis at α = 0.05.
Exercise B
p value = 0.024, therefore at a = 0.05, there is a significant difference
between the hypothesised population mean and the sample mean.
Exercise C
(i) When α = 0.2, z_crit = 0.84.
(ii) When α = 0.05, z_crit = 1.64.
(iii) When α = 0.01, z_crit = 2.33.
(iv) When α = 0.005, z_crit = 2.58.
Exercise D
t = −2.119, t_crit = 2.365. As |t| < t_crit we cannot reject the null hypothesis, i.e. the mean of the values in table 8.10 is not significantly different from the published value of c.
Exercise E
H₀: population intercept = 0. Carry out a two tailed test at α = 0.05: t = 6.917, t_crit = 2.228, therefore reject the null hypothesis.
H₀: population slope = 0. |t| = 2.011, t_crit = 2.228, therefore cannot reject the null hypothesis.
Exercise F
t = 14.49 and t_crit = 2.228, therefore reject the null hypothesis, i.e. the means of the coefficient of kinetic friction for the two contact areas are significantly different at the α = 0.05 level of significance.
Exercise G
Using Excel®'s TTEST() function, the p value is 0.046. As this is less than α = 0.05 we reject the null hypothesis, i.e. there is a significant difference (at α = 0.05) between the lead content at the two locations.
Exercise H
Carry out a t test for paired samples.
t = 2.762. For a two tailed test at α = 0.05 and with number of degrees of freedom = 7, t_crit = 2.365. As t > t_crit, reject the null hypothesis, i.e. the emfs of the batteries have changed over the period of storage.
Exercise I
p = 0.08037. As p > 0.05, we would not reject the null hypothesis at α = 0.05.
Exercise j
F = 2.207, F_crit = 5.82. As F < F_crit we cannot reject the null hypothesis at α = 0.05.
Exercise K
F_crit = 5.285.
Exercise L
(i)
Exercise M
(i) 10.60.
(ii) 11.34.
(iii) 11.07.
(iv) 15.99.
Exercise N
F = 12.82, F_crit = 3.40. As F > F_crit, the ANOVA indicates that, at α = 0.05, the magnitude of the alpha wave does depend on light level.
Problems
1. (ii) For my random numbers, I found x̄ = 152.7297.
(iv) For my 100 columns of random numbers, I found four means to lie outside the interval μ ± 1.96σ_x̄.
(v) μ ± 1.96σ_x̄ is the 95% confidence interval for the sample mean, so expect 5% of sample means to lie outside this interval, i.e. five means.
2. Two tailed t test required (samples not paired). t = 1.831, t_crit = 2.228. As t < t_crit we cannot reject the hypothesis that both samples have the same population mean (at α = 0.05).
3. Paired sample t test required. t = 2.909, t_crit = 2.365. As t > t_crit, reject the hypothesis (at α = 0.05) that there is no difference in the urea concentration as determined by the two laboratories.
4. t = 1.111, t_crit = 3.182 (for α = 0.05 and three degrees of freedom). As t < t_crit, we cannot reject the null hypothesis, i.e. the intercept is not significantly different from zero.
5. Two tailed F test carried out at α = 0.05. F = 3.596, F_crit = 4.48. As F < F_crit, we cannot reject the null hypothesis that both populations have the same variance.
6. One tailed F test carried out at α = 0.05. F = 2.796, F_crit = 8.44. As F < F_crit, we cannot reject a null hypothesis that both populations have the same variance.
7. I chose a bin width of 0.1 s with a bin range beginning at 0.8 s and extending to 1.9 s. Where necessary, bins were combined to ensure that the frequencies were ≥ 5. A chi-squared test (carried out at α = 0.05) indicates that the distribution of data in table 8.31 is consistent with the normal distribution.
8. F = 2.811, F_crit = 3.55. As F < F_crit, we cannot reject a null hypothesis that the population means are equal.
9. F = 10.61, F_crit = 3.885. As F > F_crit, we reject the null hypothesis and conclude that the population means are not equal.
References
SI
  base units, 8
  system, 7
significance tests, 315
  comparing two means, 334
  confidence levels, 316
significant figures, 12, 226
skewness, 80
slope of a straight line, 394
  equation, 219
specification of instruments, 191
t distribution, 122
  critical values, 385
t test
  paired samples, 339
tables, 18
TDIST(), 125
TEC, 244
tests of significance, 315
  one sided test, 327
  small sample sizes, 329
  two sided tests, 327