Notes - Correlation & Regression
Student's Copy
H2 Mathematics JC 2
CORRELATION COEFFICIENT
Include:
• calculation and interpretation of the product moment correlation coefficient and of the
  equation of the least squares regression line
• interpolation and extrapolation
• use of a square, reciprocal or logarithmic transformation to achieve linearity
Exclude:
• derivation of formulae
• hypothesis tests
§ 1 Introduction
In the previous chapters, we have been studying data with one variable. In this chapter, we
shall investigate data with two variables and their relationship. For example, if we can find a
relationship between the midyear examination scores and year-end examination scores of
students, we will be able to use the information to help us make statistical inferences about
the two examination scores.
§ 2 Bivariate Data
______ ordered pairs (x, y).
Midyear score              40   50   55   60   65   80
Final examination score    50   53   58   60   70   88
A scatter diagram is ______. Each (x_i, y_i) on the scatter diagram represents a single data
point. From the scatter diagram, we can judge visually if there is any relationship between
x and y.
[Scatter diagram: final examination score against midyear score]
Example 1
The temperature, T, in degrees Celsius (°C) of the tyre of a car is measured when the car
travels at different speeds, v (km h⁻¹). Eight sets of data are obtained. Sketch the scatter
diagram for the data.
v    20   30   40   50   60   70   80   90
T    45   52   64   66   91   86   98  104
Solution
Using TI-84+
[Screenshot: STAT PLOT settings with Xlist: L1 and Ylist: L2]
1. Go to [key].
2. Create two new lists, List 1 and List 2, using the given v and T values respectively.
3. To plot the data, press [key] for Graph.
4. Press [key] for [SET] (settings).
5. Choose settings as shown on the right.
6. Choose L1 for X list (independent) and L2 for Y list (dependent).
7. For Mark Type, select the style to represent the data points.
8. To view the plot, press [key] followed by [key].
9. To read the coordinates of each point, press [key] (TRACE).
[Screenshots: list editor with List 1 and List 2; graph settings (Frequency, Mark Type); scatter plot]
Scatter diagrams can reveal general patterns and relationships between two variables. We can
comment on the
1. Direction
   ______ related.
   [Scatter diagram: Negative relationship]
2. Form
   [Scatter diagram: Quadratic relationship]
3. Strength
   [Scatter diagram: No clear relationship]
Example 2
Comment on the relationship between the two variables based on the scatter diagrams below.
[Scatter diagrams for Example 2]
Note
Scatter diagrams can also give us visual evidence of ______ or suspicious observations.
These data points are points in a sample that lie outside the overall pattern of a distribution.
§ 5 Correlation
Interpretation of the relationship between the variables of sample data solely based on the
scatter diagram is subjective. It can even ______ used. The scatter diagrams below are plotted
using the same set of data but on different scales.
[Two scatter diagrams of the same data plotted on different scales]
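Since the scale of a scatter diagram can be manipulated, it may be more helpful to use a
numerical approach to measure the strength of a relationship between two variables. The
product moment correlation coefficient, r, which measures the strength of a linear
relationship for n data points, is as follows:
$$r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}}$$
Using
$$S_{xx} = \sum (x-\bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad S_{yy} = \sum (y-\bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}, \qquad S_{xy} = \sum (x-\bar{x})(y-\bar{y}) = \sum xy - \frac{\sum x \sum y}{n},$$
we have
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}.$$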
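The formula r = S_xy / sqrt(S_xx S_yy) can be checked numerically. The following is a minimal sketch, not the GC's own routine; the function name `pmcc` is chosen here for illustration, and the data are the eight (v, T) tyre readings 20 to 90 km h⁻¹ from Example 1.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    # Sums-of-squares form: S_xx = sum(x^2) - (sum x)^2 / n, and similarly for S_yy, S_xy.
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Tyre data: speed v (km/h) and temperature T (degrees Celsius).
v = [20, 30, 40, 50, 60, 70, 80, 90]
T = [45, 52, 64, 66, 91, 86, 98, 104]
r = pmcc(v, T)
print(round(r, 3))  # 0.975
```

The sums-of-squares form is used because it needs only one pass over the data; the deviation form with the means gives the same value.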
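Example 3
In a physical education class, the number of push-ups (x) and sit-ups (y) done by a sample
of ten randomly chosen students were recorded in the table below.
Student    1    2    3    4    5    6    7    8    9   10
x         27   22   15   35   30   52   35   55   40   40
y         30   26   25   42   38   40   32   54   50   43
$\sum x = 351$, $\sum y = 380$, $\sum x^2 = 13717$, $\sum y^2 = 15298$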
Solution
Since the actual data is given, we can use the GC to calculate the product moment correlation
coefficient.
Using TI-84+:
[Screenshots: STAT, CALC, LinReg(a+bx) with Xlist: L1 and Ylist: L2, then Calculate]
GC output (to 5 significant figures):
y = a + bx
a = 14.908
b = 0.65789
r² = 0.70466
r = 0.83944
[Screenshots: list editor; regression output]
1. ______
2. Press [key] for [REG] and select [key].
If the dependent variable is not in List 2, select [SET] and change the lists in 2Var XList
(independent) and 2Var YList (dependent).
We will be looking at the significance of the other values that appear in the screenshot in the
next section.
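As a cross-check on the GC value, r for the push-up and sit-up data can be evaluated by hand from the sums. This is a sketch under one assumption: the value $\sum xy = 14257$ is not quoted in the example and is computed here from the ten pairs (27, 30), (22, 26), (15, 25), (35, 42), (30, 38), (52, 40), (35, 32), (55, 54), (40, 50), (40, 43).

```latex
\begin{align*}
S_{xy} &= \sum xy - \frac{\sum x \sum y}{n} = 14257 - \frac{(351)(380)}{10} = 919, \\
S_{xx} &= \sum x^2 - \frac{(\sum x)^2}{n} = 13717 - \frac{351^2}{10} = 1396.9, \\
S_{yy} &= \sum y^2 - \frac{(\sum y)^2}{n} = 15298 - \frac{380^2}{10} = 858, \\
r &= \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{919}{\sqrt{(1396.9)(858)}} \approx 0.839.
\end{align*}
```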
Note
1. For any sample data, −1 ≤ r ≤ 1.
2. The sign of the correlation coefficient indicates the direction of linear correlation.
   When r > 0, ______.
3. When r = +1, we have perfect positive linear correlation. All the points lie on a straight
   line with positive gradient.
   When r = −1, we have perfect negative linear correlation. All the points lie on a straight
   line with negative gradient.
4. When r = 0, there is no linear correlation between x and y.
5. See ______ with the same product moment correlation coefficient but different scatter diagrams.
6. In general,
   0.8 ≤ |r| ≤ 1 indicates strong linear correlation between the two variables.
   0.5 ≤ |r| < 0.8 indicates moderately strong linear correlation between the two variables.
   |r| < 0.5 indicates weak linear correlation between the two variables.
7. The measure ______ has no units. It ______ of the ______ is ______.
Example 4
The temperature, T, in degrees Celsius (°C) of the tyre of a car is measured when the car
travels at different speeds, v (km h⁻¹). Eight sets of data are obtained.
v    20   30   40   50   60   70   80   90
T    45   52   64   66   91   86   98  104
(i) ______
(ii) ______
Solution (TI-84+)
(i) Using TI-84+, r = 0.975.
Press [STAT] ______, then press [ENTER].
(ii) ______
[Screenshot: LinReg output]
y = a + bx
a = 81.84285714
b = 1.572857143
r² = 0.9506094411
r = 0.9749928215
Solution (CASIO GC):
(i) Using CASIO GC, r = 0.975.
1. Store data under List 1 (x) and List 2 (y).
2. Press [key] for X (linear).
[Screenshots: list editor; linear regression output]
(ii)
1. ______ and T' ______ (9 ÷ 5) List 2 + 32 (press [key] for [List]).
2. ______
[Screenshots: list editor with the converted values; linear regression output]
Note
Notice that the values of r calculated in (i) and (ii) are the same. This illustrates an important
property of r: it is independent of the scale of measurement for temperature.
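The scale-independence of r can be illustrated numerically. This is a minimal sketch that assumes the second temperature scale is degrees Fahrenheit, obtained as (9/5)T + 32; the helper name `pmcc` is illustrative only.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)

v = [20, 30, 40, 50, 60, 70, 80, 90]            # speed in km/h
t_celsius = [45, 52, 64, 66, 91, 86, 98, 104]   # tyre temperature in degrees Celsius

# Assumed conversion: add a constant and multiply by a positive constant.
t_fahrenheit = [9 / 5 * t + 32 for t in t_celsius]

r_celsius = pmcc(v, t_celsius)
r_fahrenheit = pmcc(v, t_fahrenheit)
print(round(r_celsius, 3), round(r_fahrenheit, 3))  # both values agree
```

The regression coefficients a and b do change with the units, but r does not, because the factor 9/5 cancels between numerator and denominator.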
§ 6 Linear Regression
In the last section, we used both the scatter diagram and the product moment correlation
coefficient between the two variables to indicate whether it is meaningful to model the
observed data with a straight line. If it appears that the data fits into a linear model, we then
attempt to find an equation to represent the relationship by linear regression.
In Example 4, the speed, v, is controlled and the temperature, T, is measured based on v.
Thus, v is known as the independent variable while T, whose value depends on v, is called
the dependent variable.
In Example 3, we were investigating the number of sit ups and push ups a student can do. In
this case, there is no clear dependency between the two variables.
§ 7
Consider the data given in the scatter diagram below. We randomly draw a line to fit the data
first. The line drawn below may not be the line that best represents the data, the line of best
fit. There are several ways to find a line of best fit, and we use the method most commonly used
for finding such a line, called the least squares method. The line obtained by this method is
called the least squares regression line.
To understand this method, we consider any line drawn to fit the data, for example
y = a + bx.
We consider x as the independent variable and y as the dependent variable. The circled
points in the diagram correspond to observed data points. If we were to use the line given
above to model the data, we would predict a different y value for the corresponding x-value.
The difference is the error, e, which is known as the residual and is calculated as
e = observed y value − predicted y value.
Each pair of observations (x_i, y_i) produces a residual, e_i, for i = 1, 2, ..., n.
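We use $\sum_{i=1}^{n} e_i^2$ to denote the sum of the squares of the residuals. The least squares
regression line of y on x is the line that minimises $\sum_{i=1}^{n} e_i^2$, where
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - (a + bx_i)\right)^2.$$
To minimise $\sum e_i^2$ (Appendix B), we need
$$a = \bar{y} - b\bar{x} \quad \text{and} \quad b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} \qquad \text{------ (1)}$$
Hence
$$y = (\bar{y} - b\bar{x}) + bx \;\Rightarrow\; y - \bar{y} = b(x - \bar{x}).$$
Therefore, the equation of the regression line of y on x is
$$y - \bar{y} = b(x - \bar{x}), \quad \text{where } b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}.$$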
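Note
1. $\sum (x-\bar{x})(y-\bar{y}) = \sum xy - \frac{\sum x \sum y}{n}$
2. $\sum (x-\bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$
3. Another formula for b, the gradient of the regression line of y on x, is
   $b = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$.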
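The formulas for a and b translate directly into a few lines of code. The sketch below is illustrative only (the function name `regression_y_on_x` is not from the notes) and uses the push-up (x) and sit-up (y) data for the ten students of Example 3.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the least squares regression line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = s_xy / s_xx                      # gradient
    a = sum_y / n - b * sum_x / n        # intercept: a = y_bar - b * x_bar
    return a, b

push_ups = [27, 22, 15, 35, 30, 52, 35, 55, 40, 40]
sit_ups = [30, 26, 25, 42, 38, 40, 32, 54, 50, 43]
a, b = regression_y_on_x(push_ups, sit_ups)
print(f"y = {a:.3f} + {b:.4f}x")  # y = 14.908 + 0.6579x
```

Because a = ȳ − b x̄, the fitted line always passes through the point (x̄, ȳ).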
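§ 9
When we take y as the independent variable and x as the dependent variable, the equation of the
regression line of x on y is
$$x - \bar{x} = d(y - \bar{y}), \quad \text{where } d = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2}.$$
Note
1. $\sum (y-\bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$
2. Another formula for d is $d = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}}$.
3. The equation of the regression line of x on y cannot be found by making x the subject in
   the equation of the regression line of y on x.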
§ 10
Consider the data (x, y). Given a value of one of the variables, regression lines can be used to
predict or estimate the value of the other. The choice of the regression line used depends on
the context of the situation:
(a) If there is a clear indication that x is the independent variable, we will always use the
    regression line of y on x to do estimation.
(b) For cases where there is no clear independent variable, if we want to estimate y for a
    given value of x, we use the regression line of y on x. If we want to estimate x for a
    given value of y, we use the regression line of x on y.
Note
1. When we do estimation, ______.
2. ______ of the data is close to ±1, and the scatter diagram also suggests ______.
Example 5
In a physical education class, the number of push-ups (x) and sit-ups (y) done by a sample
of ten randomly chosen students were recorded in the table below.
Student    1    2    3    4    5    6    7    8    9   10
x         27   22   15   35   30   52   35   55   40   40
y         30   26   25   42   38   40   32   54   50   43
(i) ______ on x.
(ii) ______
(iii) Predict the number of sit-ups a student can do when he can do 50 push-ups. State, with a
      reason, whether the predicted value is reliable.
(iv) Give a reason whether it is reliable to use the equation in (i) to predict the number of
     sit-ups when 60 push-ups are done.
Solution
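One way to organise the estimates asked for in parts (iii) and (iv) is to compute the prediction and, alongside it, record whether the x-value lies inside the observed data range (interpolation) or outside it (extrapolation). This is a hedged sketch, not the model answer; the helper names are illustrative.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the least squares regression line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = s_xy / s_xx
    return sum_y / n - b * sum_x / n, b

def predict(xs, ys, x_new):
    """Estimate y at x_new; also report whether x_new is within the observed x range."""
    a, b = regression_y_on_x(xs, ys)
    is_interpolation = min(xs) <= x_new <= max(xs)
    return a + b * x_new, is_interpolation

push_ups = [27, 22, 15, 35, 30, 52, 35, 55, 40, 40]
sit_ups = [30, 26, 25, 42, 38, 40, 32, 54, 50, 43]

for x_new in (50, 60):
    estimate, inside = predict(push_ups, sit_ups, x_new)
    kind = "interpolation" if inside else "extrapolation"
    print(x_new, round(estimate, 1), kind)
```

Here 50 push-ups lies within the observed range of 15 to 55, while 60 lies outside it, so the second estimate relies on the linear trend continuing beyond the data.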
Example 6
An electrical fire was switched on in a cold room and the temperature of the room was noted
at 5-minute intervals.
Time, x (in minutes) from switching on fire    0    5    10   15   20   25   30    35    40
Temperature, y (in °C)                         0.4  1.5  3.4  5.5  7.7  9.7  11.7  13.5  15.4
(a) ______ y. ______ of y on x.
(b) Explain why the regression line of y on x rather than the regression line of x on y
    should be used to predict the time that has passed after switching on the fire if the
    temperature is 93C.
(c) Predict the temperature of the room when the fire is switched on for 30 and 60
    minutes. Comment on the reliability of your answers.
(d) Starting with the equation of the regression line of y on x, deduce the equation of
    the regression line of
    (i) y on t, where y is the temperature in °C and t is time in hours,
    (ii) z on x, where z is the temperature in Kelvin (K) and x is time in minutes.
    (A temperature in °C is converted to K by adding 273.)
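A change of units is a linear change of variable, so a regression line in the new units can be deduced by substitution instead of being recomputed. The following is a sketch only, writing the regression line of y on x as y = a + bx with unspecified constants a and b (x in minutes, y in °C):

```latex
\begin{align*}
\text{time in hours:}\quad & t = \frac{x}{60} \;\Rightarrow\; x = 60t
  \;\Rightarrow\; y = a + b(60t) = a + (60b)\,t, \\
\text{temperature in K:}\quad & z = y + 273
  \;\Rightarrow\; z = (a + 273) + bx.
\end{align*}
```

Rescaling x rescales the gradient only, while adding a constant to y shifts the intercept only.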
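§ 11
Consider the regression line of y on x, i.e., y = a + bx, and the regression line of x on y,
i.e., x = c + dy.
(i) Both the regression line of y on x and the regression line of x on y pass through
    $(\bar{x}, \bar{y})$.
(ii) b and d have the same sign as r: when r > 0, b and d are positive; when r < 0, b and d
     are negative.
(iii) ______ y = a + bx ______ x = c + dy ______
(iv) $r^2 = bd$.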
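The relation r² = bd, where b and d are the gradients of the regression lines of y on x and of x on y, can be checked numerically. This is a minimal sketch using the push-up (x) and sit-up (y) data of Example 3; the helper name `sums_of_squares` is illustrative.

```python
import math

def sums_of_squares(xs, ys):
    """Return (S_xx, S_yy, S_xy) computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy

push_ups = [27, 22, 15, 35, 30, 52, 35, 55, 40, 40]
sit_ups = [30, 26, 25, 42, 38, 40, 32, 54, 50, 43]

s_xx, s_yy, s_xy = sums_of_squares(push_ups, sit_ups)
b = s_xy / s_xx                        # gradient of the regression line of y on x
d = s_xy / s_yy                        # gradient of the regression line of x on y
r = s_xy / math.sqrt(s_xx * s_yy)      # product moment correlation coefficient

print(round(b * d, 6), round(r ** 2, 6))  # the two values agree
```

Since b, d and r all share the numerator S_xy, they necessarily have the same sign.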
§ 12 Transformations
Not all relationships between x and y are linear. If the relationship between x and y is not
linear, we can sometimes use a suitable transformation to linearise the relationship. Here are
some examples:
Relationship              Transformation     Linear relationship
$y = ax^b$                Take ln            $\ln y = \ln a + b \ln x$, i.e., ln y and ln x have a linear relationship.
$y = ae^{bx}$             Take ln            $\ln y = \ln a + bx$, i.e., ln y and x have a linear relationship.
$y = \sqrt{ax + b}$       Square             $y^2 = ax + b$, i.e., y² and x have a linear relationship.
$y = \frac{1}{ax + b}$    Take reciprocal    $\frac{1}{y} = ax + b$, i.e., 1/y and x have a linear relationship.
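To see a transformation at work, the sketch below linearises a relationship of the form y = ae^(bx) by regressing ln y on x. The data are hypothetical, generated exactly from y = 3e^(0.5x) purely for illustration, so the fit should recover a = 3 and b = 0.5 up to rounding error.

```python
import math

def regression_y_on_x(xs, ys):
    """Return (intercept, gradient) of the least squares regression line of y on x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    gradient = s_xy / s_xx
    return sum_y / n - gradient * sum_x / n, gradient

# Hypothetical data lying exactly on y = 3 * e^(0.5 x); not taken from the notes.
xs = [1, 2, 3, 4, 5]
ys = [3 * math.exp(0.5 * x) for x in xs]

# ln y = ln a + b x, so the intercept estimates ln a and the gradient estimates b.
intercept, gradient = regression_y_on_x(xs, [math.log(y) for y in ys])
a_est = math.exp(intercept)
b_est = gradient
print(round(a_est, 6), round(b_est, 6))
```

The same pattern applies to the other rows of the table: transform one or both variables first, then fit a straight line to the transformed values.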
Example 7
______ of five ______
Year (x)       1      2      3      4       5
Sales (y)   1000   3000   7000   14000   21000
(i) Draw a scatter plot of the above data and find the product moment correlation
    coefficient between x and y. Comment on the suitability of the use of a linear
    model, y = ax + b, for the sales of the guidebook.
(ii) By calculating the product moment correlation coefficient between ln y and ln x,
     comment on the suitability of the use of the model $y = ax^b$ as compared to the linear
     model in (i).
(iii) Find the least squares regression line of ln y on ln x. Hence estimate the values of a
      and b.
Solution
(i)
[Scatter plot of y against x]
r between x and y is ______.
(ii)
Key in x and y data into L1 and L2 respectively.
Scroll up to the header L3 and highlight it by pressing [ENTER].
Enter L3 = ln(L1) and press [ENTER].
Similarly, enter L4 = ln(L2).
[Screenshots: lists L1 to L4; LinReg output]
y = a + bx
a = 6.815631664
b = 1.919318251
r² = 0.9932916127
r = 0.9966401621
(iii)
Note
From part (i) of the above example, we have seen that the value of r may be very close to 1,
but it does not necessarily imply that a linear model y = a + bx is the most suitable for the
data. It is always important to draw a scatter diagram to decide which model is more suitable.
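The comparison in Example 7 can be reproduced in code. This is a minimal sketch with illustrative helper names; it computes r for (x, y), r for (ln x, ln y), and the estimates of a and b in the model y = ax^b for the five years of sales data.

```python
import math

def fit(xs, ys):
    """Return (intercept, gradient, r) for the regression line of y on x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    gradient = s_xy / s_xx
    intercept = sum_y / n - gradient * sum_x / n
    return intercept, gradient, s_xy / math.sqrt(s_xx * s_yy)

year = [1, 2, 3, 4, 5]
sales = [1000, 3000, 7000, 14000, 21000]

_, _, r_linear = fit(year, sales)

ln_x = [math.log(x) for x in year]
ln_y = [math.log(y) for y in sales]
ln_a, b_est, r_log = fit(ln_x, ln_y)   # ln y = ln a + b ln x
a_est = math.exp(ln_a)

print(round(r_linear, 3), round(r_log, 3))   # 0.976 0.997
print(round(a_est), round(b_est, 2))         # 912 1.92
```

Both values of r are close to 1, which is why the scatter diagram is still needed to judge which model describes the shape of the data better.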
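Summary
The product moment correlation coefficient, r, of the observed bivariate data $(x_i, y_i)$ is
$$r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}}$$
In order to determine if there is a linear relationship, a scatter diagram, together with the
product moment correlation coefficient, should be used.
$-1 \le r \le 1$
The regression line of y on x is $y - \bar{y} = b(x - \bar{x})$, where $b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}$.
The regression line of x on y is $x - \bar{x} = d(y - \bar{y})$, where $d = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2}$.
The least squares regression line of y on x is the line with the smallest $\sum e_i^2$.
If there is a clear indication that x is the independent variable, use the regression line
of y on x to do estimation.
For cases where there is no clear independent variable, if we want to estimate y for a
given value of x, we use the regression line of y on x. If we want to estimate x for a given
value of y, we use the regression line of x on y.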
       I                II               III               IV
   x       y        x       y        x       y        x       y
 10.0    8.04     10.0    9.14     10.0    7.46      8.0    6.58
  8.0    6.95      8.0    8.14      8.0    6.77      8.0    5.76
 13.0    7.58     13.0    8.74     13.0   12.74      8.0    7.71
  9.0    8.81      9.0    8.77      9.0    7.11      8.0    8.84
 11.0    8.33     11.0    9.26     11.0    7.81      8.0    8.47
 14.0    9.96     14.0    8.10     14.0    8.84      8.0    7.04
  6.0    7.24      6.0    6.13      6.0    6.08      8.0    5.25
  4.0    4.26      4.0    3.10      4.0    5.39     19.0   12.50
 12.0   10.84     12.0    9.13     12.0    8.15      8.0    5.56
  7.0    4.82      7.0    7.26      7.0    6.42      8.0    7.91
  5.0    5.68      5.0    4.74      5.0    5.73      8.0    6.89
Note that all 4 sets of data have the same mean, variance, r value and linear regression line.
[Scatter diagrams of data sets I), II), III) and IV)]
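The claim that four visibly different data sets can share the same r value and regression line can be verified directly. The sketch below uses the four data sets I to IV, which are the values of the well-known Anscombe's quartet; helper names are illustrative, and agreement is to about two or three decimal places, not exact.

```python
import math

def summarise(xs, ys):
    """Return (a, b, r): regression line y = a + b*x and the correlation coefficient."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = s_xy / s_xx
    return sum_y / n - b * sum_x / n, b, s_xy / math.sqrt(s_xx * s_yy)

x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # x-values of sets I, II and III
datasets = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: summarise(xs, ys) for name, (xs, ys) in datasets.items()}
for name, (a, b, r) in results.items():
    print(name, round(a, 2), round(b, 2), round(r, 3))
```

Each set gives roughly y = 3.00 + 0.50x with r close to 0.816, which is the reason a scatter diagram should always accompany the value of r.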
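Appendix B - Properties of r
The measure r will not change if we add a constant to, or multiply by a positive constant, all
the values of a variable.
If s = a + bx and t = c + dy, where a, b, c, d are constants, with b and d both positive,
then $r_{st} = r_{xy}$.
$$r_{st} = \frac{\sum (s-\bar{s})(t-\bar{t})}{\sqrt{\sum (s-\bar{s})^2 \sum (t-\bar{t})^2}}
= \frac{\sum \left[(a+bx)-(a+b\bar{x})\right]\left[(c+dy)-(c+d\bar{y})\right]}{\sqrt{\sum \left[(a+bx)-(a+b\bar{x})\right]^2 \sum \left[(c+dy)-(c+d\bar{y})\right]^2}}$$
$$= \frac{\sum (bx-b\bar{x})(dy-d\bar{y})}{\sqrt{\sum (bx-b\bar{x})^2 \sum (dy-d\bar{y})^2}}
= \frac{bd\sum (x-\bar{x})(y-\bar{y})}{\sqrt{b^2 d^2 \sum (x-\bar{x})^2 \sum (y-\bar{y})^2}}
= \frac{bd\sum (x-\bar{x})(y-\bar{y})}{bd\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} = r_{xy}$$
To minimise $\sum e_i^2$ with respect to a and b:
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - bx_i)^2$$
$$\frac{\partial}{\partial a}\sum e_i^2 = -2\sum (y_i - a - bx_i) = -2\left(\sum y_i - na - b\sum x_i\right) = 0
\;\Rightarrow\; a = \frac{\sum y_i}{n} - b\,\frac{\sum x_i}{n} = \bar{y} - b\bar{x}$$
(Note: use $\frac{\partial}{\partial a}$ instead of $\frac{d}{da}$.)
$$\frac{\partial}{\partial b}\sum e_i^2 = -2\sum x_i (y_i - a - bx_i) = 0$$
Thus $\sum x_i y_i - a\sum x_i - b\sum x_i^2 = 0$. Substituting $a = \bar{y} - b\bar{x}$ gives
$$b = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}.$$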
Questions
1. ______ of the following statements:
(a) r = 0 for a set of data (x, y) implies that X and Y are unrelated.
(b) If X is the number of cigarettes smoked per day by people dying of lung cancer
    and Y is the age at death, then r = −0.9 implies that smoking more cigarettes
    per day causes a person to die younger.
(c) The value of ______ (x, y) is ______ (X, Y).
2. ______ (x, y) is denoted by r ______ of data.
Express, in terms of r,
(i) ______ of (y, x).
(ii) ______ of x are increased by 5.
(iii) ______ of (______, y).
3. With the aid of a suitable diagram, describe the difference between the regression line
of Y on X and that of X on Y.
(i)
(ii)
(iii)
______ 10 sets ______
Z y' = 22A257,2.y
264582
of
4. Explain, with the aid of a diagram, what is meant by the term "least squares" in the
context of regression lines.
Delegates who travelled by car to a Statistics Conference were asked to report d, the
distance travelled (in km), and t, ______.
d   113    14    98   130    75   120   143    55   127
t   130    25   180   148   100   120   196    48   165
(i) ______
(ii) Find the equation of the least squares regression line of t on d in the form
     t = a + bd. Interpret the coefficient b in the context of the question.
(iii) ______
(iv) ______
     (a) 100 km
     (b) 150 km
[(iv) (a) 127 (b) 189]
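The quoted answers 127 and 189 can be checked against a direct least squares fit. This sketch assumes the nine data pairs are (d, t) = (113, 130), (14, 25), (98, 180), (130, 148), (75, 100), (120, 120), (143, 196), (55, 48), (127, 165); that pairing is a reconstruction, adopted here because it reproduces the quoted answers.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the least squares regression line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = s_xy / s_xx
    return sum_y / n - b * sum_x / n, b

# Assumed pairing of the reported distances d and the second variable t.
d = [113, 14, 98, 130, 75, 120, 143, 55, 127]
t = [130, 25, 180, 148, 100, 120, 196, 48, 165]

a, b = regression_y_on_x(d, t)          # regression line of t on d: t = a + b*d
t_100 = a + b * 100
t_150 = a + b * 150
print(round(a, 2), round(b, 3))
print(round(t_100), round(t_150))       # 127 189
```

The gradient b is the estimated change in t for each additional kilometre travelled.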
5. F.M N94/4/9
The following summary data refers to concentrations of carbon dioxide in the atmosphere (y)
in parts per million for the past 8 years 1971, 1973, ..., 1985 (x).
l{*
It*-tstt)(y
= 560,
ZO
-tzs)2
887
-32s) =704
[Source: Council on Environmental Quality, 1987.]
(i) Let u = x − 1971 and v = y − 325. Calculate the equation of the least squares regression
    line of v on u. Hence find the equation of the least squares regression line of y on x.
(ii) Calculate the product moment correlation coefficient for x and y. Comment on what ______.
(iii) Estimate the concentration of carbon dioxide in the atmosphere in (a) 1974 and (b)
      1988.
6.
(i)
(ii)
x is dependent on t.
x 22.5
25.0
28.0
30.5
38.0
40.5
42.5
48.0
54.5
55.0
42.0
33.s
28.0
18.0
13.6
15.0
10.3
9.0
6.3
44.0
70.0
4.0
______ x and t.
State, with a reason, which of the following models is more appropriate to fit the data
points:
(a) $x = at^b$ where a > 0 and b < 0
(b) $x = a + bt^2$ where a > 0 and b < 0.
(iii) For the appropriate model, find the product moment correlation coefficient for
the ______
7. N2009/II/6
The table gives the world record time, in seconds above 3 minutes 30 seconds, for running ______.
Year, x    1930   1940   1950   1960   1970   1980   1990   2000
Time, t    40.4   36.4   31.3   24.5   21.1   19.0   16.3   13.1
(i) ______
(ii) ______
(iii) Explain why in this context a quadratic model would probably not be appropriate for
      long-term predictions.
(iv) Fit a model of the form ln t = a + bx to the data, and use it to predict the world record
     time as at 1st January 2010. Comment on the reliability of your prediction.
[(ii) inappropriate]
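Fitting a model of the form ln t = a + bx amounts to a linear regression of ln t on x, followed by exponentiating the prediction. The sketch below shows the mechanics for the world record data (years 1930 to 2000, times 40.4 down to 13.1 seconds above 3 minutes 30 seconds); it is an illustration, not the official marking scheme.

```python
import math

def regression_y_on_x(xs, ys):
    """Return (a, b) for the least squares regression line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = s_xy / s_xx
    return sum_y / n - b * sum_x / n, b

year = [1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]
time_above = [40.4, 36.4, 31.3, 24.5, 21.1, 19.0, 16.3, 13.1]   # seconds above 3 min 30 s

# Regress ln t on x, then undo the logarithm to predict t for 2010.
a, b = regression_y_on_x(year, [math.log(t) for t in time_above])
predicted_2010 = math.exp(a + b * 2010)
print(round(b, 4), round(predicted_2010, 1))
```

The year 2010 lies outside the data range 1930 to 2000, so the prediction is an extrapolation and should be treated with caution.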
8. FM2003/II/11 OR modified
ofy onx
ofx
and )c
71517
Y='-fr*tl0,
ony respectively.
7s=--Y+20
x
v
l0
11
l2
11
t7
t4
l9
Let Y be the value obtained by substituting a sample value of x into the equation of the
regression line of y on x. Evaluate Y for each of the eight values of x and state the value
of $\sum (y - Y)^2$.
For each of the eight sample values of x, Y' is given by Y' ______ constants. What can you say
about the value of $\sum (y - Y')^2$?
[______, r = −0.904, 8.81]
9. FM2006/II/10
For a random sample of 12 observations of pairs of values (x, y), the equation of the
regression line ______ 12 values ______
(i)
(ii)
(iii) ______ of x on y.
Find the estimated value ofy when x =,2.8 and comment on the reliability of this
estimate.
10. N2007/II/11
Research is carried out into how the concentration of a drug in the bloodstream varies with
time, measured from when the drug is given. Observations at successive times give the data ______.
Time (t minutes)                          15   30   60   90   120   150   180   240   300
Concentration (x micrograms per litre)    82   65   43   37    22    19    12   ______
It is given that the value of the product moment correlation coefficient for this data is −0.912,
correct to 3 decimal places.
______ of x when t = ______
(i) calculate the product moment correlation coefficient and comment on its value,
-l
1.7 or
-l
1.8
(ii) r
- 0.994:ln x :
4.6t
- 0 01 ?1 r' r :
551