Unit 12 - Simple Correlation and Regression
12.1 Introduction
In the previous unit, we dealt with analysis of variance (ANOVA),
assumptions for F-test, and classification of ANOVA. In this unit, we will
deal with correlation, methods of correlation, measures of correlation,
probable error, Spearman’s rank correlation coefficient, partial correlation,
multiple correlations, regression, standard error of estimate, multiple
regression analysis, and application of multiple regressions.
Correlation measures the strength and direction of the relationship between
variables, while regression expresses that relationship as an equation for
estimating one variable from another. Both tools are widely used to analyse
relationships between variables in social science research.
Objectives:
After studying this unit, you should be able to:
define correlation and regression
discuss the types and measures of correlation
calculate the Karl Pearson’s correlation coefficient
calculate the coefficient for partial and multiple correlation
apply the method of estimating unknown values from known values
through regression equations
12.1.1 Relevance
The new CEO of a health care pharmaceutical company called for a
meeting of all heads of various departments to discuss the future strategy of
the company. While he expressed satisfaction over the growing sales of the
company, he also emphasised the need to give a further boost to the
sales and image of the company. The head of the R and D unit suggested
investing more funds in innovation of new products and improvement of
existing ones. He pointed out that R and D had the most significant
contribution to the sales of the company. The head of the Marketing
department emphasised the importance of marketing strategy for boosting
the sales of the company. He, therefore, wanted more funds to be made
available for the purpose. The Head of HRD department suggested the
need for more staff and also new training programmes for improving the
sales significantly. The CEO agreed with them in person but expected
some analysis of quantitative facts and figures to evaluate the claims of the
heads of department before committing funds to the new strategies. The job was
entrusted to a consultant, who analysed the data using statistical techniques.
Manipal University Jaipur Page No. 439
Statistics for Management Unit 12
12.2 Correlation
When two or more variables move in sympathy with the other, then they are
said to be correlated. If both variables move in the same direction, then they
are said to be positively correlated. If the variables move in the opposite
direction, then they are said to be negatively correlated. If they move
haphazardly, then there is no correlation between them. Correlation analysis
deals with the following:
Measuring the relationship between variables.
Testing the relationship for its significance.
Giving confidence interval for population correlation measure.
Figure: Methods of correlation. Graphic method: scatter diagram. Algebraic
methods: covariance method, rank correlation, and concurrent deviation method.
If the dots lie close to a straight line that runs from left bottom to right top,
then the variables are said to be positively correlated. Figure 12.3 depicts
the scatter diagram for positively correlated variables.
If the dots lie exactly on a straight line that runs from left top to right bottom,
then the variables are said to be perfectly or exactly negatively correlated.
Figure 12.4 depicts the scatter diagram for the perfectly negatively
correlated variables.
If the dots lie very close to a straight line that runs from left top to right
bottom, then the variables are said to be negatively correlated. Figure 12.5
depicts the scatter diagram for the negatively correlated variables.
If the dots lie all over the graph paper, then the variables have zero
correlation. Figure 12.6 depicts the scatter diagram of the variables with
zero correlation.
A scatter diagram shows only the direction in which the variables are related;
it does not give a quantitative measure that can be compared across data sets.
12.4.2 Karl Pearson’s correlation coefficient
A mathematical method for measuring the intensity or the magnitude of the
linear relationship between two variable series is the correlation coefficient.
In order to study the “degree of variation” between the variables in a
bivariate distribution, we can use the correlation coefficient.
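Key Statistic
Karl Pearson’s correlation coefficient is defined as:
\[ r = \frac{\mathrm{Cov}(X, Y)}{\mathrm{S.D.}(X) \cdot \mathrm{S.D.}(Y)} \]
i)
\[ r = \frac{\sum xy}{N \sigma_x \sigma_y} \qquad \text{(A)} \]
where, $x = X - \bar{X}$ and $y = Y - \bar{Y}$,
\[ \sigma_x^2 = \frac{\sum (X - \bar{X})^2}{N} \quad \text{and} \quad \sigma_y^2 = \frac{\sum (Y - \bar{Y})^2}{N} \]
where, ‘N’ is the number of paired observations and $\frac{\sum xy}{N}$ is called the
covariance of ‘x’ and ‘y’.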
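Key Statistic
The other forms of Karl Pearson’s correlation coefficient formula are:
ii)
\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} \qquad \text{(B)} \]
iii)
\[ r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{N \sum X^2 - (\sum X)^2}\,\sqrt{N \sum Y^2 - (\sum Y)^2}} \qquad \text{(C)} \]
iv)
\[ r = \frac{N \sum dx\,dy - \sum dx \sum dy}{\sqrt{N \sum dx^2 - (\sum dx)^2}\,\sqrt{N \sum dy^2 - (\sum dy)^2}} \qquad \text{(D)} \]
where dx and dy are the deviations of X and Y from their assumed means.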
For all practical purposes, we can conveniently use form D; whenever
summary information is given choose proper form from A to C.
12.4.4 Factors influencing the size of correlation coefficient
The size of ‘r’ is very much dependent upon the variability of measured
values in the correlation sample. The greater the variability, the higher will
be the correlation, everything else being equal. The size of ‘r’ is altered
when researchers select extreme groups of subjects in order to compare
these groups with respect to certain behaviours. Selecting extreme groups
on one variable increases the size of ‘r’ over what would be obtained with
more random sampling.
Combining two groups which differ in their mean values on one of the
variables is not likely to faithfully represent the true situation as far as the
correlation is concerned.
Inclusion of an extreme case (and similarly dropping of an extreme case)
can lead to changes in the amount of correlation.
\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} \]
Solved Problem 1
Find Karl Pearson’s correlation coefficient for the data depicted in table
12.1.
Table 12.1: Data Related to Solved Problem 1
X 20 16 12 8 4
Y 22 14 4 12 8
Solution:
Table 12.1a depicts the sums calculated for the data depicted in table 12.1.
Table 12.1a: Sums Related to Solved Problem 1
X          Y          X²          Y²          XY
20         22         400         484         440
16         14         256         196         224
12         4          144         16          48
8          12         64          144         96
4          8          16          64          32
ΣX = 60    ΣY = 60    ΣX² = 880   ΣY² = 904   ΣXY = 840
Applying the formula for ‘r’ and substituting the respective values from the
table we get r as:
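\[ r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{N \sum X^2 - (\sum X)^2}\,\sqrt{N \sum Y^2 - (\sum Y)^2}} \]
\[ r = \frac{5(840) - (60)(60)}{\sqrt{[5(880) - (60)^2][5(904) - (60)^2]}} = \frac{600}{\sqrt{800 \times 920}} \]
\[ r = 0.70 \]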
Hence, Karl Pearson’s Correlation Coefficient is 0.70.
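The arithmetic of Solved Problem 1 can be checked with a short Python sketch. It implements form (C) directly from the raw sums; the function name pearson_r is only an illustrative choice and is not part of the unit.

```python
import math


def pearson_r(xs, ys):
    """Karl Pearson's r computed from raw sums (form C in the text)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two pairs")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Data from Table 12.1
x = [20, 16, 12, 8, 4]
y = [22, 14, 4, 12, 8]
print(round(pearson_r(x, y), 2))  # prints 0.7
```

The function raises an error for series of unequal length; it does not guard against a constant series, for which the denominator is zero and r is undefined.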
Solved Problem 2
Calculate the correlation coefficient from the data depicted in table 12.2.
Table 12.2: Data Related to Solved Problem 2
X 50 60 58 47 49 33 65 43 46 68
Y 48 65 50 48 55 58 63 48 50 70
Solution:
Table 12.2a depicts the frequency table of the data related to solved
problem 2.
Table 12.2a: Frequency Table Data for Solved Problem 2
X          dx = X-50   dx²          Y          dy = Y-55   dy²         dx dy
50         0           0            48         -7          49          0
60         +10         100          65         +10         100         +100
58         +8          64           50         -5          25          -40
47         -3          9            48         -7          49          +21
49         -1          1            55         0           0           0
33         -17         289          58         +3          9           -51
65         +15         225          63         +8          64          +120
43         -7          49           48         -7          49          +49
46         -4          16           50         -5          25          +20
68         +18         324          70         +15         225         +270
ΣX = 519   Σdx = 19    Σdx² = 1077  ΣY = 535   Σdy = 5     Σdy² = 595  Σdxdy = 489
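\[ r = \frac{N \sum dx\,dy - \sum dx \sum dy}{\sqrt{N \sum dx^2 - (\sum dx)^2}\,\sqrt{N \sum dy^2 - (\sum dy)^2}} \]
Substituting the values, we get:
\[ r = \frac{10(489) - (19)(5)}{\sqrt{10(1077) - (19)^2}\,\sqrt{10(595) - (5)^2}} = 0.611 \]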
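The deviation form used in Solved Problem 2 can be sketched in Python as below. The assumed means 50 and 55 are the ones used in Table 12.2a; the function name is hypothetical. The last line illustrates a property that follows from the formula: the result does not depend on which assumed means are chosen.

```python
import math


def pearson_r_assumed_mean(xs, ys, assumed_x, assumed_y):
    """Karl Pearson's r from deviations about assumed means (form D)."""
    n = len(xs)
    dx = [x - assumed_x for x in xs]
    dy = [y - assumed_y for y in ys]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dx2 = sum(d * d for d in dx)
    sum_dy2 = sum(d * d for d in dy)
    sum_dxdy = sum(p * q for p, q in zip(dx, dy))
    numerator = n * sum_dxdy - sum_dx * sum_dy
    denominator = (math.sqrt(n * sum_dx2 - sum_dx ** 2)
                   * math.sqrt(n * sum_dy2 - sum_dy ** 2))
    return numerator / denominator


# Data from Table 12.2
x = [50, 60, 58, 47, 49, 33, 65, 43, 46, 68]
y = [48, 65, 50, 48, 55, 58, 63, 48, 50, 70]

r_table = pearson_r_assumed_mean(x, y, 50, 55)  # assumed means from Table 12.2a
r_other = pearson_r_assumed_mean(x, y, 0, 0)    # any other origin gives the same r
print(round(r_table, 3), round(r_other, 3))     # prints 0.611 0.611
```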
Solved Problem 3
In a bivariate data on ‘x’ and ‘y’, variance of ‘x’ = 49, variance of ‘y’ = 9 and
covariance Cov(x, y) = -17.5. Find coefficient of correlation between ‘x’ and
‘y’.
Solution:
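We know that:
\[ r = \frac{\sum xy}{N \sigma_x \sigma_y} \]
Given $\mathrm{Cov}(x, y) = \frac{\sum xy}{N} = -17.5$, $\sigma_x = \sqrt{49} = 7$ and $\sigma_y = \sqrt{9} = 3$:
\[ r = \frac{-17.5}{7 \times 3} = -0.833 \]
Hence, there is a high negative correlation.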
Solved Problem 4
Ten observations on weight (x) and height (y) of a particular age group gave
the data listed in the solution below. Find the correlation coefficient.
Solution:
We know that:
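\[ r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{N \sum X^2 - (\sum X)^2}\,\sqrt{N \sum Y^2 - (\sum Y)^2}} \]
Given N = 10, ΣX = 56, ΣY = 138, ΣX² = 1357, ΣY² = 2136 and ΣXY = 836:
\[ r = \frac{10(836) - (56)(138)}{\sqrt{10(1357) - (56)^2}\,\sqrt{10(2136) - (138)^2}} = 0.1286 \]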
\[ \mathrm{P.E.} = 0.6745 \times \frac{1 - r^2}{\sqrt{n}} \]
where, ‘r’ is measured from sample of size ‘n’.
Probable error is used to:
i) Interpret the value of ‘r’,
If r < P.E, then it is not at all significant
If r > 6 P.E, then ‘r’ is highly significant
If P.E < r < 6 P.E, we cannot say anything about the significance of
‘r’
ii) Construct confidence limits within which correlation in the population
is expected to lie.
\[ \mathrm{SE}(r) = \frac{1 - r^2}{\sqrt{n}}, \qquad \mathrm{PE}(r) = 0.6745 \times \mathrm{SE}(r) \]
The reason for taking the factor 0.6745 is that in a normal distribution 50%
of the distribution lie in the range μ ± 0.6745 σ
Solved Problem 5
If r = 0.6 and n = 64, then:
a) Interpret ‘r’
b) Find the limits within which ‘ρ’ is supposed to lie
Solution:
\[ \mathrm{P.E.} = 0.6745 \times \frac{1 - (0.6)^2}{\sqrt{64}} = 0.054 \]
a) $6\,\mathrm{P.E.} = 6 \times 0.054 = 0.324$
Since $r = 0.6 > 6\,\mathrm{P.E.}$, r is highly significant.
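The probable-error rule can be expressed as a small Python sketch, checked here against the figures of Solved Problem 5. The function names are illustrative, and comparing the absolute value of r with P.E. is an assumption made so that negative correlations are handled as well; the text states the rule for r itself.

```python
import math


def probable_error(r, n):
    """Probable error of r for a sample of n pairs: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


def interpret_r(r, n):
    """Apply the rule of thumb from the text (uses |r|, an assumption)."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe:
        return "highly significant"
    return "inconclusive"


pe = probable_error(0.6, 64)
print(round(pe, 3))          # prints 0.054
print(interpret_r(0.6, 64))  # prints highly significant
```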
Key Statistic
Spearman’s Rank correlation coefficient is defined as:
\[ \rho = 1 - \frac{6 \sum D^2}{N^3 - N} \]
where, D is the difference between ranks assigned to the variables.
N is the number of observations.
The value of ‘ρ’ lies between ‘-1’ and ‘+1’ and its interpretation is the same as that
of Karl Pearson’s correlation coefficient.
There are three types of problems. Table 12.3 depicts the types of problems
involved in calculating rank correlation coefficient.
Table 12.3: Types of Problems
Type i Ranks are assigned
Type ii Ranks are not assigned
Type iii When ranks are repeated
Type i: Ranks are assigned: When ranks are already assigned, take the
difference between the ranks of the variables and denote it by D. Then the
rank correlation is computed using the formula
\[ \rho = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} \]
Solved Problem 6
In a singing competition, two judges assigned the ranks for seven
candidates which is depicted in table 12.4. Find Spearman’s rank correlation
coefficient.
Table 12.4: Ranks of Seven Candidates
Competitor 1 2 3 4 5 6 7
Judge I 5 6 4 3 2 7 1
Judge II 6 4 5 1 2 7 3
Solution:
Table 12.4a depicts the data of solved problem 6.
\[ \rho = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} \]
\[ \rho = 1 - \frac{6(14)}{7(7^2 - 1)} = 1 - \frac{6 \times 14}{7 \times 48} = 0.75 \]
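For ranks that are already assigned, the calculation of Solved Problem 6 can be reproduced with the following Python sketch. It takes the two judges' ranks from Table 12.4; the function name is an illustrative choice.

```python
def spearman_from_ranks(ranks_1, ranks_2):
    """Spearman's rho from two lists of ranks without ties."""
    if len(ranks_1) != len(ranks_2):
        raise ValueError("both rank lists must have the same length")
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Ranks from Table 12.4
judge_1 = [5, 6, 4, 3, 2, 7, 1]
judge_2 = [6, 4, 5, 1, 2, 7, 3]
print(spearman_from_ranks(judge_1, judge_2))  # prints 0.75
```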
Solved Problem 7
Find the rank difference coefficient of correlation (in case of no ties) for the
data depicted in table 12.5.
Table 12.5: Scores of Students on Test I and Test II
The relationship between ‘x’ and ‘y’, that is, between the scores on Test I
and Test II, is very high and inverse.
Solved Problem 8
Table 12.6 depicts the sales statistics of six sales representatives in two
different localities. Find whether there is a relationship between the buying
habits of the people in the localities.
Table 12.6: Sales Data of Six Representatives
Representative 1 2 3 4 5 6
Locality I 70 40 65 110 60 20
Locality II 70 30 80 100 90 20
Solution:
Table 12.6a depicts the calculated values of correlation coefficient of data in
solved problem 8.
Table 12.6a: Calculating the Coefficient of Correlation
Representative   Rank of sales in   Rank of sales in    D = R1-R2   D²
                 Locality I (R1)    Locality II (R2)
1                2                  4                   -2          4
2                5                  5                   0           0
3                3                  3                   0           0
4                1                  1                   0           0
5                4                  2                   2           4
6                6                  6                   0           0
N = 6                                                               ΣD² = 8
\[ \rho = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} \]
\[ \rho = 1 - \frac{6(8)}{6(6^2 - 1)} = 1 - \frac{8}{35} = 0.7714 \]
Therefore, there is a high positive correlation between the buying habits of
people in the two localities.
Solved Problem 9
Find the rank correlation coefficient for the data depicted in table 12.7.
Table 12.7: Scores of Student in Test I and Test II
Student A B C D E F G H I J
Score on Test I 20 30 22 28 32 40 20 16 14 18
Score on Test II 32 32 48 36 44 48 28 20 24 28
Solution:
Table 12.7a depicts the required data for calculating the correlation
coefficient.
Table 12.7a: Ranks of Test I and Test II
Student   Score on   Score on   Rank on   Rank on   Difference      Difference
          Test I     Test II    Test I    Test II   between ranks   squared
          X          Y          R1        R2        D               D²
A         20         32         6.5       5.5       1.0             1.00
B         30         32         3         5.5       -2.5            6.25
C         22         48         5         1.5       3.5             12.25
D         28         36         4         4         0               0
E         32         44         2         3         -1.0            1.00
F         40         48         1         1.5       -0.5            0.25
G         20         28         6.5       7.5       -1.0            1.00
H         16         20         9         10        -1.0            1.00
I         14         24         10        9         1.0             1.00
J         18         28         8         7.5       0.5             0.25
N = 10                                                              ΣD² = 24
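\[ \rho = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3) + \frac{1}{12}(m_4^3 - m_4)\right]}{N(N^2 - 1)} \]
where each m is the number of times a value is repeated. Here the score 20 on
Test I and the scores 32, 48 and 28 on Test II each occur twice, so m = 2 for
all four groups of ties.
\[ \rho = 1 - \frac{6\left[24 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2)\right]}{10(10^2 - 1)} \]
\[ \rho = 1 - \frac{6(24 + 0.5 + 0.5 + 0.5 + 0.5)}{10 \times 99} = 1 - \frac{156}{990} = 0.8424 \]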
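When ranks are not assigned or are repeated, the ranking step and the tie correction can be sketched in Python as follows. The helper names are hypothetical. Tied values receive the average of the positions they occupy, as in Table 12.7a. Note that this sketch multiplies the whole bracket (ΣD² plus the tie corrections) by 6, which is the usual form of the tie-corrected formula.

```python
from collections import Counter


def average_ranks(values):
    """Rank in descending order (largest = 1); ties share the mean position."""
    ordered = sorted(values, reverse=True)
    positions = {}
    for position, value in enumerate(ordered, start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def spearman_with_ties(xs, ys):
    """Spearman's rho with the (m^3 - m)/12 correction for each group of ties."""
    n = len(xs)
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = 0.0
    for series in (xs, ys):
        for m in Counter(series).values():
            if m > 1:  # a value repeated m times adds (m^3 - m)/12
                correction += (m ** 3 - m) / 12
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))


# Data from Table 12.7
test_1 = [20, 30, 22, 28, 32, 40, 20, 16, 14, 18]
test_2 = [32, 32, 48, 36, 44, 48, 28, 20, 24, 28]
print(average_ranks(test_1))
print(round(spearman_with_ties(test_1, test_2), 4))
```

With no repeated values the correction term is zero, so the same function also covers problems of type ii.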
Activity:
Find the rank correlation from the following distribution
Cost    39   65   62   90   82   75   25   98   36   78
Sales   47   53   58   86   62   68   60   91   51   84
Activity Solution
Cost   Sales
X      Y      R1     R2     D      D²
39     47     8      10     -2     4
65     53     6      8      -2     4
62     58     7      7      0      0
90     86     2      2      0      0
82     62     3      5      -2     4
75     68     5      4      1      1
25     60     10     6      4      16
98     91     1      1      0      0
36     51     9      9      0      0
78     84     4      3      1      1
                                   ΣD² = 30
\[ \rho = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} \]
\[ \rho = 1 - \frac{6 \times 30}{10(10^2 - 1)} = 1 - \frac{180}{990} = 0.82 \]
Key Statistic
Partial correlation is denoted by the symbol $r_{12.3}$. Here, the correlation
between variables 1 and 2, keeping the 3rd variable constant, is:
\[ r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} \]
where,
$r_{12.3}$ = partial correlation between variables 1 and 2 keeping the 3rd constant
$r_{12}$ = correlation between variables 1 and 2
$r_{13}$ = correlation between variables 1 and 3
$r_{23}$ = correlation between variables 2 and 3
Similarly,
\[ r_{13.2} = \frac{r_{13} - r_{12}\,r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}} \quad \text{and} \quad r_{23.1} = \frac{r_{23} - r_{12}\,r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}} \]
Solved problem 10
Given r12 = 0.8, r13 = 0.5 and r23 = 0.4, calculate all partial correlations.
Solution:
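(i) The correlation between variables 1 and 2 keeping the 3rd constant is
given by:
\[ r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} = \frac{0.8 - 0.5 \times 0.4}{\sqrt{1 - 0.5^2}\,\sqrt{1 - 0.4^2}} = \frac{0.6}{0.794} = 0.756 \]
(ii) The correlation between variables 1 and 3 keeping the 2nd constant is
given by:
\[ r_{13.2} = \frac{r_{13} - r_{12}\,r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}} = \frac{0.5 - 0.8 \times 0.4}{\sqrt{1 - 0.8^2}\,\sqrt{1 - 0.4^2}} = \frac{0.18}{0.550} = 0.327 \]
(iii) The correlation between variables 2 and 3 keeping the 1st constant is
given by:
\[ r_{23.1} = \frac{r_{23} - r_{21}\,r_{13}}{\sqrt{1 - r_{21}^2}\,\sqrt{1 - r_{13}^2}} = \frac{0.4 - 0.8 \times 0.5}{\sqrt{1 - 0.8^2}\,\sqrt{1 - 0.5^2}} = 0 \]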
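All three partial correlations of Solved Problem 10 follow from one formula with the roles of the variables permuted, which the Python sketch below makes explicit. The function name and argument names are illustrative only.

```python
import math


def partial_corr(r_ab, r_ac, r_bc):
    """Correlation between a and b with c held constant."""
    return (r_ab - r_ac * r_bc) / (math.sqrt(1 - r_ac ** 2) * math.sqrt(1 - r_bc ** 2))


r12, r13, r23 = 0.8, 0.5, 0.4

r12_3 = partial_corr(r12, r13, r23)  # variables 1 and 2, holding 3 constant
r13_2 = partial_corr(r13, r12, r23)  # variables 1 and 3, holding 2 constant
r23_1 = partial_corr(r23, r12, r13)  # variables 2 and 3, holding 1 constant

print(round(r12_3, 3), round(r13_2, 3), round(r23_1, 3))  # prints 0.756 0.327 0.0
```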
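\[ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{23}^2}} \]
\[ R_{2.13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{13}^2}} \]
\[ R_{3.12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{12}^2}} \]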
Solved Problem 11
The following are the zero order correlation coefficients.
r12 = 0.98; r13 = 0.44 r23 = 0.54
Solution:
The first variable is dependent. The second and third variables are
independent. Using the formula for multiple correlation coefficients for R1.23
we get:
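\[ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{23}^2}} = \sqrt{\frac{(0.98)^2 + (0.44)^2 - 2(0.98)(0.44)(0.54)}{1 - (0.54)^2}} = 0.986 \]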
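The multiple correlation coefficient of Solved Problem 11 can be verified with a short Python sketch. The function name and its argument order (the two correlations involving the dependent variable first, then the correlation between the two independent variables) are illustrative choices.

```python
import math


def multiple_corr(r_dep_a, r_dep_b, r_ab):
    """Multiple correlation of a dependent variable on two predictors a and b."""
    numerator = r_dep_a ** 2 + r_dep_b ** 2 - 2 * r_dep_a * r_dep_b * r_ab
    return math.sqrt(numerator / (1 - r_ab ** 2))


r12, r13, r23 = 0.98, 0.44, 0.54

R1_23 = multiple_corr(r12, r13, r23)  # variable 1 dependent on 2 and 3
print(round(R1_23, 3))                # prints 0.986
```

The other two coefficients use the same function with the arguments permuted, for example multiple_corr(r12, r23, r13) for the case where variable 2 is dependent.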
12.9 Regression
According to M. M. Blair, Regression is defined as, “the measure of the
average relationship between two or more variables in terms of the original
units of the data”.
Correlation analysis attempts to study the relationship between the two
variables ‘X’ and ‘Y’. In regression, we attempt to quantify the
dependence of one variable on the other. For example, if there are two
variables ‘X’ and ‘Y’ and ‘Y’ depends on ‘X’, then the dependence is
expressed in the form of an equation.
12.9.1 Regression analysis
Regression analysis is used to estimate the values of the dependent
variables from the values of the independent variables. Regression analysis
is used to get a measure of the error involved while using the regression line
as a basis for estimation. The regression coefficient Y on X is the coefficient
of the variable ‘X’ in the line of regression Y on X. Regression coefficients
are used to calculate the correlation coefficient. The square of correlation is
the product of regression coefficients.
12.9.2 Regression lines
For a set of paired observations, there exist two straight lines. The line
drawn in such a way that the sum of vertical deviation is zero and the sum of
their squares is minimum, is called regression line of ‘Y’ on ‘X’. It is used to
estimate ‘Y’ values for given ‘X’ values. The line drawn in such a way that
the sum of horizontal deviation is zero and sum of their squares is minimum,
is called regression line of ‘X’ on ‘Y’. It is used to estimate the ‘X’ values for
the given ‘Y’ values. The smaller the angle between these lines, the higher
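is the correlation between the variables. The regression lines always
intersect at $(\bar{X}, \bar{Y})$.
i) The regression equation of ‘Y’ on ‘X’ is given by:
\[ Y - \bar{Y} = b_{yx}(X - \bar{X}) \]
ii) The regression equation of ‘X’ on ‘Y’ is given by:
\[ X - \bar{X} = b_{xy}(Y - \bar{Y}) \]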
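where,
\[ b_{xy} = \frac{N \sum dx\,dy - (\sum dx)(\sum dy)}{N \sum dy^2 - (\sum dy)^2} \quad \text{or} \quad b_{xy} = r\,\frac{\sigma_x}{\sigma_y} \]
\[ b_{yx} = \frac{N \sum dx\,dy - (\sum dx)(\sum dy)}{N \sum dx^2 - (\sum dx)^2} \quad \text{or} \quad b_{yx} = r\,\frac{\sigma_y}{\sigma_x} \]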
Solved Problem 12
Find the two regression equations from the data depicted in table 12.9. Then
calculate the correlation coefficient.
Table 12.9: Data of Ages of Husband and Wife
Age of Husband 18 19 20 21 22 23 24 25 26 27
Age of Wife 17 17 18 18 19 19 19 20 21 22
Solution:
Table 12.9a depicts the data required for calculation of correlation and
regression coefficients.
Table 12.9a: Data Required for Calculation of Correlation and Regression
Coefficients
Age of       dx = X-22   dx²         Age of      dy = Y-19   dy²         dx dy
husband X                            wife Y
18           -4          16          17          -2          4           8
19           -3          9           17          -2          4           6
20           -2          4           18          -1          1           2
21           -1          1           18          -1          1           1
22           0           0           19          0           0           0
23           1           1           19          0           0           0
24           2           4           19          0           0           0
25           3           9           20          1           1           3
26           4           16          21          2           4           8
27           5           25          22          3           9           15
ΣX = 225     Σdx = 5     Σdx² = 85   ΣY = 190    Σdy = 0     Σdy² = 24   Σdxdy = 43
\[ \bar{X} = \frac{225}{10} = 22.5, \qquad \bar{Y} = \frac{190}{10} = 19 \]
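Regression equation of Y on X is:
\[ Y - \bar{Y} = b_{yx}(X - \bar{X}) \]
\[ b_{yx} = \frac{N \sum dx\,dy - (\sum dx)(\sum dy)}{N \sum dx^2 - (\sum dx)^2} = \frac{10(43) - (5)(0)}{10(85) - (5)^2} = \frac{430}{825} = 0.521 \]
\[ Y - 19 = 0.521(X - 22.5) \]
\[ Y = 0.521X + 7.2775 \]
Regression equation of X on Y is:
\[ X - \bar{X} = b_{xy}(Y - \bar{Y}) \]
\[ b_{xy} = \frac{N \sum dx\,dy - (\sum dx)(\sum dy)}{N \sum dy^2 - (\sum dy)^2} = \frac{10(43) - (5)(0)}{10(24) - (0)^2} = \frac{430}{240} = 1.792 \]
\[ X - 22.5 = 1.792(Y - 19) \]
\[ X = 1.792Y - 11.548 \]
\[ r = \sqrt{b_{yx} \cdot b_{xy}} = \sqrt{0.521 \times 1.792} = 0.966 \]
Hence, the correlation coefficient ‘r’ is 0.966.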
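The regression coefficients of Solved Problem 12 can be recomputed with the Python sketch below, which uses deviations from the assumed means 22 and 19 as in Table 12.9a. The function name is hypothetical. Intercepts obtained from unrounded slopes differ slightly in the later decimals from those worked by hand with slopes rounded to three places.

```python
import math


def regression_coefficients(xs, ys, assumed_x=0.0, assumed_y=0.0):
    """Return (b_yx, b_xy) from deviations about assumed means."""
    n = len(xs)
    dx = [x - assumed_x for x in xs]
    dy = [y - assumed_y for y in ys]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dx2 = sum(d * d for d in dx)
    sum_dy2 = sum(d * d for d in dy)
    sum_dxdy = sum(p * q for p, q in zip(dx, dy))
    cross = n * sum_dxdy - sum_dx * sum_dy
    b_yx = cross / (n * sum_dx2 - sum_dx ** 2)  # slope of Y on X
    b_xy = cross / (n * sum_dy2 - sum_dy ** 2)  # slope of X on Y
    return b_yx, b_xy


# Data from Table 12.9
husband = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
wife = [17, 17, 18, 18, 19, 19, 19, 20, 21, 22]

b_yx, b_xy = regression_coefficients(husband, wife, 22, 19)
x_bar = sum(husband) / len(husband)
y_bar = sum(wife) / len(wife)

# Both lines pass through (x_bar, y_bar), which fixes the intercepts.
intercept_yx = y_bar - b_yx * x_bar
intercept_xy = x_bar - b_xy * y_bar

# r is the square root of the product of the slopes and carries their sign.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
print(round(b_yx, 3), round(b_xy, 3), round(r, 3))  # prints 0.521 1.792 0.966
```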
Solved Problem 13
Table 12.10 depicts the results that were worked out from scores in
statistics and mathematics in a certain examination.
Table 12.10: Scores in Statistics and Mathematics
Scores in Statistics Scores in Mathematics
X Y
Mean 40 48
Standard Deviation 10 15
Karl Pearson’s correlation coefficient between ‘X’ and ‘Y’ is = + 0.42. Find
the regression lines ‘X’ on ‘Y’ and ‘Y’ on ‘X’. Use the regression lines to find
the value of ‘Y’ when X = 50 and value of ‘X’ when Y = 30.
Solution:
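Given the following data:
\[ \bar{X} = 40; \quad \bar{Y} = 48; \quad \sigma_x = 10; \quad \sigma_y = 15; \quad r = 0.42 \]
The regression line ‘X’ on ‘Y’ is:
\[ X - \bar{X} = b_{xy}(Y - \bar{Y}) \]
\[ b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = 0.42 \times \frac{10}{15} = 0.28 \]
\[ X - 40 = 0.28(Y - 48) \]
\[ X = 0.28Y + 26.56 \]
The regression line ‘Y’ on ‘X’ is given as:
\[ Y - \bar{Y} = b_{yx}(X - \bar{X}) \]
\[ b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = 0.42 \times \frac{15}{10} = 0.63 \]
\[ Y - 48 = 0.63(X - 40) \]
\[ Y = 0.63X + 22.8 \]
Therefore,
when Y = 30, X = 0.28Y + 26.56 gives X = 34.96
when X = 50, Y = 0.63X + 22.8 gives Y = 54.3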
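When only the means, standard deviations and r are known, as in Solved Problem 13, both regression lines can be built directly from the summary figures. The following Python sketch does this; the function name and the (intercept, slope) return convention are assumptions made for illustration.

```python
def regression_lines_from_summary(mean_x, mean_y, sd_x, sd_y, r):
    """Return ((a_yx, b_yx), (a_xy, b_xy)) as (intercept, slope) pairs."""
    b_yx = r * sd_y / sd_x           # slope of Y on X
    b_xy = r * sd_x / sd_y           # slope of X on Y
    a_yx = mean_y - b_yx * mean_x    # line passes through the means
    a_xy = mean_x - b_xy * mean_y
    return (a_yx, b_yx), (a_xy, b_xy)


# Summary figures from Table 12.10
(a_yx, b_yx), (a_xy, b_xy) = regression_lines_from_summary(40, 48, 10, 15, 0.42)

y_at_50 = a_yx + b_yx * 50  # estimate Y from the line of Y on X
x_at_30 = a_xy + b_xy * 30  # estimate X from the line of X on Y
print(round(y_at_50, 2), round(x_at_30, 2))  # prints 54.3 34.96
```

Each estimate uses the line whose dependent variable is the one being estimated: Y on X for estimating Y, and X on Y for estimating X.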
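The standard error of estimate of X values from Y is:
\[ S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N}} \]
The standard error of estimate of Y values from X is:
\[ S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}} \]
where Yc and Xc are the estimated values of Y and X variables from the line
of regression of Y on X and X on Y respectively.
The following simpler formulae are used for calculating Sxy and Syx:
\[ S_{xy} = \sqrt{\frac{\sum X^2 - a \sum X - b \sum XY}{N}} \]
\[ S_{yx} = \sqrt{\frac{\sum Y^2 - a \sum Y - b \sum XY}{N}} \]
where a and b are the intercept and slope of the corresponding regression line.
To make the squared standard error an unbiased estimate of the variance of
the X or Y values about the regression line, we divide the sum of squared
deviations by (N - 2) instead of N:
\[ S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N - 2}} \]
\[ S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N - 2}} \]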
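The definition and the simpler formula for the standard error of estimate can be compared numerically. The Python sketch below computes S_yx both ways for the husband and wife ages of Table 12.9; the function names are hypothetical, and the line Y = a + bX is fitted by least squares inside the sketch.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept a and slope b of the line Y = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def std_error_residual(xs, ys, unbiased=False):
    """S_yx from the residuals Y - Yc; divides by N - 2 when unbiased=True."""
    a, b = fit_line(xs, ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    divisor = len(xs) - 2 if unbiased else len(xs)
    return math.sqrt(sse / divisor)


def std_error_shortcut(xs, ys):
    """S_yx from the simpler formula using sums of Y^2, Y and XY."""
    a, b = fit_line(xs, ys)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return math.sqrt((sum_y2 - a * sum(ys) - b * sum_xy) / len(xs))


# Data from Table 12.9
husband = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
wife = [17, 17, 18, 18, 19, 19, 19, 20, 21, 22]

s_residual = std_error_residual(husband, wife)
s_shortcut = std_error_shortcut(husband, wife)
s_unbiased = std_error_residual(husband, wife, unbiased=True)
print(round(s_residual, 4), round(s_shortcut, 4), round(s_unbiased, 4))
```

The first two values agree up to floating-point rounding, and the third is larger because the same sum of squares is divided by N - 2.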
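\[ \sum X_{1i} Y_i = b_0 \sum X_{1i} + b_1 \sum X_{1i}^2 + b_2 \sum X_{1i} X_{2i} \]
\[ \sum X_{2i} Y_i = b_0 \sum X_{2i} + b_1 \sum X_{1i} X_{2i} + b_2 \sum X_{2i}^2 \]
The values of b0, b1 and b2 are estimated with the help of the principle of least
squares.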
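A minimal Python sketch of the least-squares estimation for two independent variables is given below. It assumes the model Y = b0 + b1 X1 + b2 X2 and adds the usual first normal equation, ΣY = N b0 + b1 ΣX1 + b2 ΣX2, to the two equations involving ΣX1Y and ΣX2Y. The data are hypothetical, generated exactly from Y = 1 + 2 X1 + 3 X2 so that the expected answer is known, and the function names are illustrative.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                for k in range(col, n + 1):
                    m[row][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_two_predictors(x1, x2, y):
    """Return [b0, b1, b2] by solving the three normal equations."""
    n = len(y)
    s1, s2, sy = sum(x1), sum(x2), sum(y)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    lhs = [[n, s1, s2],
           [s1, s11, s12],
           [s2, s12, s22]]
    return solve_linear_system(lhs, [sy, s1y, s2y])


# Hypothetical data, constructed so that Y = 1 + 2*X1 + 3*X2 exactly
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]

b0, b1, b2 = fit_two_predictors(x1, x2, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Solving the normal equations directly is adequate for a textbook-sized problem; for larger or nearly collinear data a numerical library would normally be preferred.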
12.11.1 Application of Multiple Regression
Multiple regression analysis can be applied to test whether factors such as
export elasticity, import elasticity, and structural change (the contribution of
the manufacturing sector to GDP) influence employment. Here,
employment is the dependent variable.
Similarly, researchers can attempt to use multiple regressions in their
research work appropriately.
12.12 Summary
Let us recapitulate the important concepts discussed in this unit:
When two or more variables move in sympathy with the other, then they
are said to be correlated. If both variables move in the same direction,
then they are said to be positively correlated. If the variables move in the
opposite direction, then they are said to be negatively correlated. If they
move haphazardly, then there is no correlation between them.
Regression helps us to study unknown variables with the help of known
variables. It also establishes a reliability measure for estimated values.
Regression analysis helps to quantify the dependence of one variable
on the other. Some of the regression types are simple and multiple
regressions, linear and non linear regression.
Regression analysis is useful in business and economic scenarios in the
decision making process.
12.13 Glossary
Correlation: When two or more variables move in sympathy with the other,
then they are said to be correlated.
Correlation coefficient: Critical statistic which indicates the direction and
intensity of a relationship between two continuous variables. Its value
ranges from -1 through 0 to +1. Significance can be determined via
statistical testing. Both parametric and nonparametric correlation coefficients
are possible.
Coefficient of variation: A relative measure of variation, expressed as a
percentage; useful in comparing the variability of data sets with different
units of measure.
3. For the data in table 12.13, obtain the two lines of regression and
estimate the blood pressure when the age is 50 yrs.
Table 12.13: Data for Terminal Question 3
Age in yrs (X) 56 42 72 39 63 47 52 49 40 42 68 60
B P (Y) 127 112 140 118 129 116 130 125 115 120 135 133
4. Table 12.14 depicts the results that were worked out from scores in
statistics and mathematics in a certain examination.
Table 12.14: Results of Scores in Statistics and Mathematics Examination
Scores in Statistics Scores in Mathematics
(X) (Y)
Mean 39.5 47.5
Standard Deviation 10.8 17.8
Karl Pearson’s correlation coefficient between X and Y = 0.42. Find both the
regression lines. Use these lines to estimate the value of Y when X = 50 and
the value of X when Y = 30.
12.15 Answers
Terminal Questions
1. 0.903
2. 0.967
3. X = - 95 + 1.184Y
Y = 87.2 + 0.724X
4. X = 27.62 + 0.25Y
Y = 20.24 + 0.69X
By calculating the rank correlation, find out as to which of the indicators viz.
life expectancy, literacy, and GDP affects the HDI to the maximum extent.
To what extent the life expectancy in the nation depends on the percentage
of its urban population?
(Source: Srivastava, T. N. and Rejo, S. (2008) Statistics for Management, 5th
edition, TMH)
References
Agarwal, B. L. (2006) Basic Statistics, 4th Edition, New Age International
Publishers.
Bowerman, B. L. and Connel, R.T. O., (1996) Applied Statistics:
Improving Business Processes, Irwin.
Levin, R. I., Rubin, D. S. (2008), Statistics for Management, 7th Edition,
PHI Learning Private Limited.
Pisani, F. D. R., and Purves, R. (1997), Statistics, 3rd edition, W.W
Norton.
Srivastava, T. N. and Rejo, S. (2008) Statistics for Management, 5th
edition, TMH.
Tanur, J. M. (2002), Statistics: A Guide to the Unknown, 4th
edition, Brooks/Cole.
E-Reference
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-1.pdf