
CORRELATION AND REGRESSION ANALYSIS

6.1 Introduction to Correlation


So far, we have considered only univariate distributions. From the averages, dispersion and
skewness of a distribution, we get a fairly complete idea of its structure. Many times, however,
we come across problems which involve two or more variables. If we carefully study the figures
for rainfall and the production of maize, for accidents and motor cars in a city, for demand and
supply of a commodity, or for sales and profit, we may find that there is some relationship
between the two variables. On the other hand, if we compare the figures for rainfall in America
and the production of cars in Japan, we may find that there is no relationship between the two
variables. If there is a relation between two variables, i.e. when one variable changes the other
also changes in the same or in the opposite direction, we say that the two variables are
correlated.
6.2 Correlation
Correlation is the study of the existence, magnitude and direction of the relationship between
two or more variables.
6.3 Types of Correlation
1. Positive and negative correlation
2. Linear and non-linear correlation
A) If two variables change in the same direction (i.e., if one increases the other also increases, or
if one decreases, the other also decreases), then this is called a positive correlation. For example:
Advertising and sales.
B) If two variables change in the opposite direction (i.e., if one increases, the other decreases and
vice versa), then the correlation is called a negative correlation. For example: T.V. registrations
and cinema attendance.
The nature of the graph tells us whether the correlation between two variables is linear. If the
graph is a straight line, the correlation is called a "linear correlation"; if it is not a straight line,
the correlation is non-linear or curvilinear.
For example, if variable X changes by a constant quantity, say 20, then Y also changes by a
constant quantity, say 4. The ratio between the two changes always remains the same (1/5 in this
case). In the case of a curvilinear correlation this ratio does not remain constant.
6.4 Degrees of Correlation
Through the coefficient of correlation, we can measure the degree or extent of the correlation
between two variables. On the basis of the coefficient of correlation we can also determine
whether the correlation is positive or negative and also its degree or extent.
1. Perfect correlation: If two variables change in the same direction and in the same
proportion, the correlation between the two is perfect positive. According to Karl Pearson the
coefficient of correlation in this case is +1. On the other hand, if the variables change in the
opposite direction and in the same proportion, the correlation is perfect negative, and its
coefficient of correlation is -1. In practice we rarely come across these types of correlation.
2. Absence of correlation: If two variables exhibit no relationship between them, i.e. a change
in one variable does not lead to a change in the other, then we can say that there is no
correlation between the two variables. In such a case the coefficient of correlation is 0.
3. Limited degrees of correlation: If two variables are neither perfectly correlated nor
completely uncorrelated, we term the correlation limited. It may be positive or negative,
but the coefficient lies between -1 and +1.
High degree, moderate degree or low degrees are the three categories of this kind of correlation.
The following table shows how the value of the coefficient of correlation is interpreted.

Degree                    Positive            Negative
Absence of correlation    0                   0
Perfect correlation       +1                  -1
High degree               +0.75 to +1         -0.75 to -1
Moderate degree           +0.25 to +0.75      -0.25 to -0.75
Low degree                0 to +0.25          0 to -0.25
6.5 Methods of Determining Correlation
We shall consider the following most commonly used methods
(1) Scatter Plot
(2) Karl Pearson’s coefficient of correlation
(3) Spearman’s Rank-correlation coefficient.
6.5.1 Scatter Plot (Scatter diagram or dot diagram)
In this method the values of the two variables are plotted on graph paper. One is taken along
the horizontal (x-axis) and the other along the vertical (y-axis). By plotting the data, we get
points (dots) on the graph which are generally scattered, hence the name 'scatter plot'. The
manner in which these points are scattered suggests the degree and the direction of the
correlation. The degree of correlation is denoted by 'r' and its direction is given by its sign
(positive or negative).
NOTES
i) If all the points lie on a rising straight line, the correlation is perfectly positive and r = +1 (see
fig. 1).
ii) If all the points lie on a falling straight line, the correlation is perfectly negative and r = -1 (see
fig. 2).
iii) If the points lie in a narrow strip rising upwards, there is a high degree of positive correlation
(see fig. 3).
iv) If the points lie in a narrow strip falling downwards, there is a high degree of negative
correlation (see fig. 4).
v) If the points are spread widely over a broad strip rising upwards, there is a low degree of
positive correlation (see fig. 5).
vi) If the points are spread widely over a broad strip falling downwards, there is a low degree of
negative correlation (see fig. 6).
vii) If the points are scattered without any specific pattern, correlation is absent, i.e. r = 0 (see
fig. 7).
Though this method is simple and gives a rough idea of the existence and degree of correlation,
it is not reliable. Since it is not a mathematical method, it cannot measure the degree of
correlation precisely.
6.5.2 Karl Pearson’s coefficient of correlation
It gives a numerical measure of correlation and is denoted by 'r'. The value of 'r' gives the
magnitude of the correlation and its sign denotes the direction. Karl Pearson's correlation
coefficient is also sometimes referred to as the product-moment correlation coefficient and is
defined as

r = s_xy / (s_x s_y),   where

s_xy = Σ(x - x̄)(y - ȳ) / n  is called the covariance of X and Y,
s_x = √[ Σ(x - x̄)² / n ]  is the standard deviation of X,
s_y = √[ Σ(y - ȳ)² / n ]  is the standard deviation of Y.

Therefore

r = Σ(x - x̄)(y - ȳ) / √[ Σ(x - x̄)² Σ(y - ȳ)² ]
Example 1
A chemical fertilizer company wishes to determine the extent of correlation between ‘quantity of
compound X used’ and ‘lawn growth’ per day. The results are tabulated below:
Lawn    Compound X (g)    Lawn growth (mm)
A       1                 3
B       2                 3
C       4                 6
D       5                 8

Find the Pearson correlation coefficient between the two variables.
Solution
r = Σ(x - x̄)(y - ȳ) / √[ Σ(x - x̄)² Σ(y - ȳ)² ]

We start by obtaining the means x̄ and ȳ:
x̄ = (1 + 2 + 4 + 5)/4 = 12/4 = 3,  ȳ = (3 + 3 + 6 + 8)/4 = 20/4 = 5.

Now

x    y    x - x̄    y - ȳ    (x - x̄)(y - ȳ)    (x - x̄)²    (y - ȳ)²
1    3    -2       -2       4                 4           4
2    3    -1       -2       2                 1           4
4    6     1        1       1                 1           1
5    8     2        3       6                 4           9
                  Totals:   13                10          18

Substituting in the formula above,
r = 13 / √(10 × 18) ≈ 0.969
A positive r means that as x (the mass of the chemical compound) increases, so does y (the
lawn growth).
A value of r close to 1 indicates a very strong positive correlation.
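To make the computation concrete, here is a minimal Python sketch of the definitional (deviation) formula, checked against the lawn data above. It is purely illustrative; the function name pearson_r is our own choice and not from any required library.

from math import sqrt

def pearson_r(x, y):
    """Pearson's r from the definitional (deviation) formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

# Example 1: compound X (g) and lawn growth (mm)
x = [1, 2, 4, 5]
y = [3, 3, 6, 8]
print(round(pearson_r(x, y), 3))   # 0.969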
Alternative Formula for Calculating r
Often it is cumbersome to calculate the means, for example when the data contain decimals or
the data set is large. A second formula, which does not require the calculation of the means, is
shown below.
We have
r = cov(X, Y) / (s_x s_y) = s_xy / (s_x s_y)
It can be shown that
s_xy = Σxy/n - x̄ȳ   (an alternative formula for the covariance of X and Y),
s_x² = Σx²/n - x̄²   (an alternative formula for the variance of X, the square of the standard deviation),
s_y² = Σy²/n - ȳ²   (an alternative formula for the variance of Y).
Therefore,
r = [ Σxy - (Σx)(Σy)/n ] / √{ [ Σx² - (Σx)²/n ] [ Σy² - (Σy)²/n ] }
Using this formula, we can do example 1 above as follows:
x y xy (x)2 y2
130
1 3 12 1 9
23649
4 6 24 16 36
5 8 40 25 64
12 20 73 46 118
Therefore
0.969
10 18
13
4
118 400
4
46 144
4
73 12 20












r
Example 2
From the following data compute the coefficient of correlation between x and y.
a) s_x = 14.7, s_y = 19.2, and s_xy = 136.8
b) Σx = 65, Σy = 141, Σxy = 1165, Σx² = 505, Σy² = 2745, n = 11
Solution
a)
r = s_xy / (s_x s_y) = 136.8 / (14.7 × 19.2) ≈ 0.485
b)
r = [ Σxy - (Σx)(Σy)/n ] / √{ [ Σx² - (Σx)²/n ] [ Σy² - (Σy)²/n ] }
  = [ 1165 - (65)(141)/11 ] / √{ [ 505 - (65)²/11 ] [ 2745 - (141)²/11 ] }
r ≈ 0.986
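The same computational formula can be applied directly to summary totals, without the raw data. A small Python sketch (the function name r_from_totals is our own, for illustration) reproduces part (b):

from math import sqrt

def r_from_totals(sum_x, sum_y, sum_xy, sum_x2, sum_y2, n):
    """Pearson's r from the totals sum(x), sum(y), sum(xy), sum(x^2), sum(y^2)."""
    num = sum_xy - sum_x * sum_y / n
    den = sqrt((sum_x2 - sum_x**2 / n) * (sum_y2 - sum_y**2 / n))
    return num / den

print(round(r_from_totals(65, 141, 1165, 505, 2745, 11), 3))
# prints 0.985, agreeing with the 0.986 above up to rounding of intermediate values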
Example 3
If the covariance between x and y is 12.3 and the variances of x and y are 16.4 and 13.8
respectively, find the coefficient of correlation between them.
Solution: Given cov(X, Y) = 12.3, s_x² = 16.4 and s_y² = 13.8.
r = cov(X, Y) / (s_x s_y)
  = 12.3 / √(16.4 × 13.8)
r ≈ 0.818
6.5.3 Spearman’s Rank Correlation Coefficient
This method is based on the ranks of the items rather than on their actual values. The advantage
of this method over the others is that it can be used even when the actual values of the items are
unknown. For example, if you want to know the correlation between the honesty and wisdom of
the boys in your class, you can use this method by giving ranks to the boys. It can also be used to
find the degree of agreement between the judgments of two examiners or two judges. The
formula is:
R = 1 - 6ΣD² / [ N(N² - 1) ]
where R = rank correlation coefficient
D = difference between the ranks of corresponding items
N = the number of observations.
Note: -1 ≤ R ≤ 1
i) When R = +1: Perfect positive correlation or complete agreement in the same direction
ii) When R = -1: Perfect negative correlation or complete agreement in the opposite direction.
iii) When R = 0: No Correlation.
Computation:
i. Give ranks to the values of the items. Generally the item with the highest value is ranked 1
and the others are given ranks 2, 3, 4, ... according to their values in decreasing order.
ii. Find the difference D = R1 - R2,
where R1 = rank of x and R2 = rank of y.
Note that ΣD = 0 (always).
iii. Calculate D² and then find ΣD².
iv. Apply the formula.
Note:
In some cases there is a tie between two or more items. If, say, two items tie for the 4th and 5th
ranks, each is given the rank (4 + 5)/2 = 4.5. If three items tie for the 4th rank, each is given the
rank (4 + 5 + 6)/3 = 5. If m is the number of items of equal rank, the factor (m³ - m)/12 is added
to ΣD². If there is more than one such tie, this factor is added as many times as there are ties,
and then the formula is applied as before.
Example: Calculate R from the following data.

Student No.:    1  2  3  4  5  6  7  8  9  10
Rank in Maths:  1  3  7  5  4  6  2  10 9  8
Rank in Stats:  3  1  4  5  6  9  7  8  10 2
Solution:

Student No.    Rank in Maths (R1)    Rank in Stats (R2)    D = R1 - R2    D²
1              1                     3                     -2             4
2              3                     1                      2             4
3              7                     4                      3             9
4              5                     5                      0             0
5              4                     6                     -2             4
6              6                     9                     -3             9
7              2                     7                     -5             25
8              10                    8                      2             4
9              9                     10                    -1             1
10             8                     2                      6             36
N = 10,  ΣD = 0,  ΣD² = 96

Calculation of R:
R = 1 - 6ΣD² / [ N(N² - 1) ] = 1 - (6 × 96)/(10 × 99) = 1 - 576/990 ≈ 0.42
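As a quick check on the arithmetic, here is a short Python sketch of the rank-correlation formula applied to the ranks above (the function name spearman_r is our own, for illustration):

def spearman_r(rank_x, rank_y):
    """Spearman's rank correlation R = 1 - 6*sum(D^2) / (N*(N^2 - 1))."""
    n = len(rank_x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n**2 - 1))

maths = [1, 3, 7, 5, 4, 6, 2, 10, 9, 8]
stats = [3, 1, 4, 5, 6, 9, 7, 8, 10, 2]
print(round(spearman_r(maths, stats), 3))   # 0.418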
Example: Calculate R for 6 students from the following data.

Marks in Stats:   40 42 45 35 36 39
Marks in English: 46 43 44 39 40 43
Solution:

Marks in Stats    R1    Marks in English    R2     D = R1 - R2    D²
40                3     46                  1       2             4
42                2     43                  3.5    -1.5           2.25
45                1     44                  2      -1             1
35                6     39                  6       0             0
36                5     40                  5       0             0
39                4     43                  3.5     0.5           0.25
N = 6,  ΣD = 0,  ΣD² = 7.50

Here m = 2, since in the series of marks in English the value 43 is repeated twice, so the factor
(m³ - m)/12 = (8 - 2)/12 = 0.5 is added to ΣD², giving 7.50 + 0.5 = 8.
R = 1 - 6(8) / [ 6(6² - 1) ] = 1 - 48/210 ≈ 0.77
(ΣD² is sometimes written as SD², read as 'sum of D squared'.)
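A minimal Python sketch of this tie adjustment, purely illustrative (the helper name tie_factor is our own):

def tie_factor(m):
    """Correction added to sum(D^2) for a group of m tied values."""
    return (m**3 - m) / 12

d_squared = 7.50            # from the table above
d_squared += tie_factor(2)  # the mark 43 appears twice in the English series
n = 6
R = 1 - 6 * d_squared / (n * (n**2 - 1))
print(round(R, 2))          # 0.77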
Example: The value of Spearman's rank correlation coefficient for a certain number of pairs of
observations was found to be 2/3. The sum of the squares of the differences between the
corresponding ranks was 55. Find the number of pairs.
Solution: We have
R = 1 - 6ΣD² / [ N(N² - 1) ]
2/3 = 1 - (6 × 55) / [ N(N² - 1) ]
330 / [ N(N² - 1) ] = 1/3
N(N² - 1) = 990
N = 10 (since 10 × 99 = 990)
Hence there are 10 pairs of observations.
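As a quick illustrative check (not part of the original solution), a brute-force search in Python confirms that N = 10 is the only small integer satisfying N(N² - 1) = 990:

# Find N with N*(N**2 - 1) == 990, i.e. the number of pairs
N = next(n for n in range(2, 100) if n * (n**2 - 1) == 990)
print(N)   # 10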
LECTURE 10
6.6 Introduction to Regression Analysis
Correlation gives us the idea of the measure of magnitude and direction between correlated
variables. Now it is natural to think of a method that helps us in estimating the value of one
variable when the other is known. Also correlation does not imply causation. The fact that the
variables x and y are correlated does not necessarily mean that x causes y or vice versa. For
example, you would find that the number of schools in a town is correlated to the number of
accidents in the town. The reason for these accidents is not the school attendance; but these two
increases what is known as population. A statistical procedure called regression is concerned
with causation in a relationship among variables. It assesses the contribution of one or more
variable called causing variable or independent variable or one which is being caused
(dependent variable). When there is only one independent variable then the relationship is
expressed by a straight line. This procedure is called simple linear regression.
Regression can be defined as a method that estimates the value of one variable when that of the
other is known, provided the variables are correlated. The dictionary meaning of
regression is "to go backward." It was used for the first time by Sir Francis Galton in his research
paper "Regression towards mediocrity in hereditary stature."
6.7 Lines of Regression
In the scatter plot, we have seen that if the variables are highly correlated, the points (dots) lie
in a narrow strip. If the strip is nearly straight, we can draw a straight line such that all the points
are close to it on both sides. Such a line, chosen so that it minimises the distances of the data
points from it, can be taken as an ideal representation of the variation and is called the line of
regression or line of best fit. Prediction is then easy: all we need to do is extend the line and read
off the value. Thus, to obtain a line of regression we need a line of best fit. Statisticians,
however, do not measure the distances by dropping perpendiculars from the points onto the line;
they measure the deviations (also called errors or residuals) either (i) vertically or
(ii) horizontally.
Thus we get two lines of regression, as shown in figures (1) and (2).
(1) Line of regression of y on x
Its form is y = a + b x
It is used to estimate y when x is given
(2) Line of regression of x on y
Its form is x = a + b y
It is used to estimate x when y is given.
6.8 Regression Equation of y on x
It can be obtained (i) graphically, from the scatter plot, or (ii) mathematically, by the method of
least squares.
The least squares formula is
y - ȳ = (s_xy / s_x²)(x - x̄)
Example 1
Use the least squares formula to fit a regression line through (1,3), (3,5) and (5,6).
Solution

x    y    xy    x²
1    3    3     1
3    5    15    9
5    6    30    25
Σ:   9    14    48    35

So Σx = 9, Σy = 14, Σxy = 48, Σx² = 35, n = 3,
x̄ = Σx/n = 9/3 = 3  and  ȳ = Σy/n = 14/3.

y - ȳ = (s_xy / s_x²)(x - x̄)

s_xy = Σxy/n - x̄ȳ = [ 48 - (9)(14)/3 ] / 3 = 6/3

s_x² = Σx²/n - x̄² = [ 35 - (9)²/3 ] / 3 = 8/3

Therefore,
y - 14/3 = [ (6/3) / (8/3) ](x - 3) = (6/8)(x - 3)
Multiplying through by 24 (the L.C.M. of the denominators),
24y - 112 = 18(x - 3)
24y - 112 = 18x - 54
24y = 18x + 58
y = (18/24)x + (58/24)
y = 0.75x + 2.42
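A short Python sketch of the least-squares slope and intercept (b = s_xy / s_x², a = ȳ - b x̄), checked against this example; the helper name fit_line is our own choice, for illustration only.

def fit_line(x, y):
    """Least-squares line y = a + b*x, with slope b = s_xy / s_x^2."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar
    s_xx = sum(xi ** 2 for xi in x) / n - x_bar ** 2
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

a, b = fit_line([1, 3, 5], [3, 5, 6])
print(round(a, 2), round(b, 2))   # 2.42 0.75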
Example 2
The table below shows the sales for Bidii Electronics, established late in 1998.
Year 1999 2000 2001 2002 2003 2004
Sales(Sh x1000) 5 9 14 18 21 27
a) Draw a scatter graph to represent this data.
b) Find r2
c) Find the equation of the line of best fit using the linear regression formula.
d) Predict the sales for the year 2006, giving your answer to the nearest Sh
Solution
a)
b) To get r²:
x y xy (x)2 y2
1 5 5 1 25
2 9 18 4 81
3 14 42 9 196
4 18 72 16 324
5 21 105 25 441
6 27 162 36 729
Σ:  21   94   404   91   1796
r = [ Σxy - (Σx)(Σy)/n ] / √{ [ Σx² - (Σx)²/n ] [ Σy² - (Σy)²/n ] }
  = [ 404 - (21)(94)/6 ] / √{ [ 91 - (21)²/6 ] [ 1796 - (94)²/6 ] }
  = (404 - 329) / √[ (91 - 73.5)(1796 - 1472.67) ]
  = 75 / 75.22
r ≈ 0.997
r² ≈ 0.994
c)
x̄ = 21/6 = 3.5,  ȳ = 94/6 = 15.67

y - ȳ = (s_xy / s_x²)(x - x̄)

s_xy = Σxy/n - x̄ȳ = [ 404 - (21)(94)/6 ] / 6 = 75/6

s_x² = Σx²/n - x̄² = [ 91 - (21)²/6 ] / 6 = 17.5/6

Therefore
y - 15.67 = (75/17.5)(x - 3.5)
y = 15.67 + 4.29(x - 3.5)
y = 4.29x + 0.67
This is the equation of the regression line of y on x.
d) For the year 2006, x = 8 (since 1999 corresponds to x = 1).
Therefore
y = (4.29)(8) + 0.67
y = 34.32 + 0.67 = 34.99
Sales = 34.99 × 1,000 = 34,990 ≈ Sh 35,000 worth of sales.
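As a cross-check on parts (c) and (d), numpy's polyfit (degree 1) fits the same least-squares line; the year coding x = 1 for 1999 is as above. This is an illustrative sketch, not part of the original solution.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])          # years 1999 to 2004
y = np.array([5, 9, 14, 18, 21, 27])      # sales (Sh x 1000)

b, a = np.polyfit(x, y, 1)                # slope, intercept
print(round(b, 2), round(a, 2))           # 4.29 0.67
print(round((a + b * 8) * 1000))          # ~34952, i.e. about Sh 35,000 of sales for 2006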
Example 3
A panel of two judges, A and B, graded dramatic performances by independently awarding
marks as follows:
a) Obtain the correlation coefficient r
b) Use the least squares method to obtain the regression equation of y on x.
c) Find the mark awarded by judge B to performance 8.
Solution:
a)
x     y     xy      x²      y²
36    35    1260    1296    1225
32    33    1056    1024    1089
34    31    1054    1156    961
31    30    930     961     900
32    34    1088    1024    1156
32    32    1024    1024    1024
35    36    1260    1225    1296
Σ:    232   231     7672    7710    7651
Now
r = [ Σxy - (Σx)(Σy)/n ] / √{ [ Σx² - (Σx)²/n ] [ Σy² - (Σy)²/n ] }
  = [ 7672 - (232)(231)/7 ] / √{ [ 7710 - (232)²/7 ] [ 7651 - (231)²/7 ] }
  = (7672 - 7656) / √[ (7710 - 7689)(7651 - 7623) ]
  = 16 / √(21 × 28)
Therefore r ≈ 0.66
b) The equation of the line of regression of y on x is
y - ȳ = (s_xy / s_x²)(x - x̄)
Here Σ(x - x̄)(y - ȳ) = 16 and Σ(x - x̄)² = 21 (the quantities found in part (a)), and x̄ ≈ ȳ ≈ 33, so
y - 33 = (16/21)(x - 33)
y = 0.76x + 7.92
c) Inserting x = 38, we get
y = 0.76(38) + 7.92 = 28.88 + 7.92
y = 36.8 ≈ 37 (approximately)
Therefore judge B would have awarded about 37 marks to the 8th performance.
Alternative Formula for Calculating Regression
It is expressed as y = a + bx where a and b are two unknown constants which determine the
position of the line completely.
If the values of a and b are completely determined the equation of the regression line of y on x is
obtained.
The two basic equations (the normal equations), which can be solved simultaneously to find a
and b, are:
Σy = na + bΣx          …………(i)
Σxy = aΣx + bΣx²       …..(ii)
Example 4
From 10 observations of the price x and supply y of a commodity, the following results were
obtained:
Σx = 130, Σy = 220, Σx² = 2288, Σxy = 3467
Compute the regression of y on x and interpret the result. Estimate the supply when the price is
16 units.
Solution: The equation of the line of regression of y on x is
y = a + b x
From the normal equations
Σy = na + bΣx and Σxy = aΣx + bΣx²
we get
220 = 10a + 130b  ………(1)
3467 = 130a + 2288b  ……..(2)
Multiplying (1) by 13:
2860 = 130a + 1690b
3467 = 130a + 2288b
On subtraction,
607 = 598b
b ≈ 1.015
Putting b = 1.015 in 220 = 10a + 130b, we get a ≈ 8.80.
Hence the equation of the line of regression of y on x is
y = 8.80 + 1.015x
The slope means that supply rises by roughly one unit for each unit increase in price.
When x = 16, we get
y = 8.80 + 1.015(16) = 8.80 + 16.24
y ≈ 25.04
so the estimated supply at a price of 16 units is about 25 units.
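The normal equations in Example 4 form a 2 × 2 linear system, which can be solved directly. A minimal numpy sketch, illustrative only:

import numpy as np

n, sum_x, sum_y, sum_x2, sum_xy = 10, 130, 220, 2288, 3467

# Normal equations:  sum_y = n*a + b*sum_x,  sum_xy = a*sum_x + b*sum_x2
A = np.array([[n, sum_x],
              [sum_x, sum_x2]], dtype=float)
rhs = np.array([sum_y, sum_xy], dtype=float)

a, b = np.linalg.solve(A, rhs)
print(round(a, 2), round(b, 3))   # a ≈ 8.8, b ≈ 1.015
print(round(a + b * 16, 1))       # ≈ 25.0, the estimated supply at a price of 16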
6.9 Uses of Regression Analysis
1. Through the methods of interpolation and extrapolation, it provides estimates of
values of the dependent variable from the values of the independent variables.
2. Regression analysis also enables us to obtain a measure of error involved in using
the regression line as a basis of estimation.
3. Regression analysis also enables us to compute the coefficient of determination, which
gives a measure of the association or correlation between two variables.
6.10 Difference between correlation and regression
i) The objective of regression analysis is to study the relationship between the
variables involved while the coefficient of correlation is the measure of the degree
of relationship between the variables.
ii) The cause-and-effect relation is clearly indicated through regression analysis, whereas in
correlation analysis we cannot say that one variable is the cause and the other the effect.
iii) Correlation analysis is only confined to the study of linear relationship between
the variables, and therefore has limited applications. Regression analysis has
much wider applications as it studies both linear and nonlinear relationship
between variables.
iv) There may be nonsense correlation between two variables which is due to mere
chance and has no practical relevance, but there is no such thing as nonsense
regression.
Chapter Review Questions
1. The length and width of 10 leaves are shown on the scatter diagram below.
[Scatter diagram titled "Relationship between leaf length and width": leaf length (mm), from 0
to 160, on the horizontal axis; leaf width (mm), from 0 to 70, on the vertical axis.]
(a) Draw a suitable line of best fit.
(b) Write a sentence describing the relationship between leaf length and leaf
width for this sample.
2. Statements I, II, III, IV and V represent descriptions of the correlation between
two variables.
I High positive linear correlation
II Low positive linear correlation
III No correlation
IV Low negative linear correlation
V High negative linear correlation
Which statement best represents the relationship between the two variables shown in
each of the scatter diagrams below?
[Four scatter diagrams, labelled (a), (b), (c) and (d), each plotting y against x on axes from
0 to 10.]
Answers:
(a) ……………………………………
(b) ……………………………………
(c) …………………………………
(d) …………………………………
3. The Type Fast secretarial training agency has a new computer software
spreadsheet package. The agency investigates the number of hours it takes people
of varying ages to reach a level of proficiency using this package. Fifteen
individuals are tested and the results are summarised in the table below.

Age (x):           32 40 21 45 24 19 17 21 27 54 33 37 23 45 18
Time in hours (y): 10 12  8 15  7  8  6  9 11 16  t 13  9 17  5
(a) (i) Given that Sy = 3.5 and Sxy = 36.7, calculate the product-moment correlation
coefficient r for this data.
(ii) What does the value of the correlation coefficient suggest about the
relationship between the two variables?
(b) Given that the mean time taken was 10.6 hours, write the equation of the
regression line for y on x in the form y = ax + b.
(c) Use your equation for the regression line to predict
(i) the time that it would take a 33 year old person to reach proficiency,
giving your answer correct to the nearest hour;
(ii) the age of a person who would take 8 hours to reach proficiency, giving
your answer correct to the nearest year.
4. Ten students were given two tests, one on Mathematics and one on English.
The table shows the results of the tests for each of the ten students.
Student:          A    B    C    D    E    F    G    H    I    J
Mathematics (x):  8.6  13.4 12.8 9.3  1.3  9.4  13.1 4.9  13.5 9.6
English (y):      33   51   30   48   12   23   46   18   36   50
(a) Given sxy (the covariance) is 35.85, calculate, correct to two decimal places,
the product moment correlation coefficient (r).
(b) Use your result from part (a) to comment on the statement:
'Those who do well in Mathematics also do well in English.'
5. The heights and weights of 10 students selected at random are shown
in the table below.
Student:        1    2    3    4    5    6    7    8    9    10
Height (x cm):  155  161  173  150  182  165  170  185  175  145
Weight (y kg):  50   75   80   46   81   79   64   92   74   108
(a) Plot this information on a scatter graph. Use a scale of 1 cm to represent 20
cm on the
x-axis and 1 cm to represent 10 kg on the y-axis.
(b) Calculate the mean height.
(c) Calculate the mean weight.
(d) It is given that Sxy = 44.31.
(i) By first calculating the standard deviation of the heights, correct to two
decimal places, show that the gradient of the line of regression of y on x is 0.276.
(ii) Calculate the equation of the line of best fit.
(iii) Draw the line of best fit on your graph.
(e) Use your line to estimate
(i) the weight of a student of height 190 cm;
(ii) the height of a student of weight 72 kg.
(f) It is decided to remove the data for student number 10 from all calculations.
Explain briefly what effect this will have on the line of best fit.