Correlation and Regression Analysis
Chapter-5
Correlation Analysis
Correlation analysis is the statistical tool used to describe the degree of relationship between two or more variables. To measure the degree of association between such variables, a relative measure is needed, known as the correlation coefficient. It is generally denoted by 'r'. It is a pure number and independent of the units of measurement.
"Correlation is defined as a statistical measure which is used to study the degree of relationship between two or more variables."
For example
Positive Correlation
i. Increasing ↑ X: 17 20 25 30 34
Increasing ↑ Y: 8 12 15 18 22
ii. Decreasing ↓ X: 60 51 40 35 30
Decreasing ↓ Y: 18 17 10 7 5
Negative Correlation
i. Increasing ↑ X: 17 20 25 30 34
Decreasing ↓ Y: 22 18 15 12 8
ii. Decreasing ↓ X: 60 51 40 35 30
Increasing ↑ Y: 5 7 10 17 18
Linear and Non-Linear Correlation
Linear Correlation
The correlation between two variables is said to be linear if
corresponding to a unit change in one variable, there is a constant
change in the other variable over the entire range of the values.
The following example illustrates a linear correlation between two variables X and Y.
X: 1 2 3 4 5
Y: 5 7 9 11 13
Non-Linear Correlation
The correlation between two variables is said to be non-linear or curvilinear if, corresponding to a unit change in one variable, the other variable does not change at a constant rate over the entire range of values. The following example illustrates a non-linear correlation between two variables X and Y.
X: 1 2 3 4 5
Y: 7 8 15 20 23
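The distinction above can be sketched in Python by looking at the first differences of Y, using the illustrative values from the two examples: a constant difference per unit step in X indicates a linear relationship.

```python
# Quick check of linear vs non-linear correlation using first differences.
# A constant change in Y per unit change in X indicates linearity.

def first_differences(values):
    """Return the list of successive differences of a sequence."""
    return [b - a for a, b in zip(values, values[1:])]

y_linear = [5, 7, 9, 11, 13]      # changes by a constant 2 per unit of X
y_nonlinear = [7, 8, 15, 20, 23]  # changes at a varying rate

print(first_differences(y_linear))     # [2, 2, 2, 2]
print(first_differences(y_nonlinear))  # [1, 7, 5, 3]
```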
Simple, Partial and Multiple Correlation
On the basis of the number of variables involved in the study, correlation can be classified into three types.
1. Simple correlation
2. Partial correlation
3. Multiple correlation
Simple correlation
The study of the degree of relationship between only two variables is called simple correlation, and its relative measure is known as the simple correlation coefficient. e.g. (i) a study of the yield of a crop with respect to the amount of fertilizer only, (ii) sales revenue with respect to the amount of money spent on advertisement, (iii) similarly, the relationship between income and expenditure, profit and sales, etc. The simple correlation coefficient of two variables X1 and X2 is denoted by r12 or simply 'r'. The simple correlation coefficient is also called the "zero order correlation coefficient".
Interpretation of Correlation Coefficient
Degree of correlation      Positive    Negative
Perfect correlation          +1          –1
No correlation                0           0
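A minimal Python sketch of the simple (zero-order) correlation coefficient, commonly known as Pearson's r, applied to the perfectly linear data from the earlier example (Y = 2X + 3, so r should equal +1):

```python
import math

def pearson_r(x, y):
    """Simple (zero-order) correlation coefficient between two variables."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

# Perfectly linear data from the earlier example: Y = 2X + 3
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]
print(round(pearson_r(x, y), 4))  # 1.0 (perfect positive correlation)
```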
Partial correlation
The degree of relationship (closeness/association) between two variables, keeping the effects of all other variables constant, is called partial correlation. In other words, the degree of relationship between one dependent variable and one independent variable, keeping all other independent variables constant, is called the partial correlation coefficient. Partial correlation is also called net correlation. For example, the relationship between the yield of crops and the amount of irrigation, keeping fertilizer, quality of seeds, etc. constant, is a study of partial correlation.
The correlation between two variables keeping one other variable constant is called first order partial correlation. The correlation between two variables keeping two other variables constant is called second order partial correlation, and so on. Higher order partial correlations can be calculated using statistical software, whereas first order partial correlations can be calculated manually.
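For reference, the standard first-order partial correlation coefficient between X1 and X2, eliminating the effect of X3, can be written in terms of the zero-order coefficients as:

```latex
r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{\left(1 - r_{13}^{2}\right)\left(1 - r_{23}^{2}\right)}}
```

The other first-order coefficients r13.2 and r23.1 follow by permuting the subscripts.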
Multiple Correlation
The degree of association (or relationship) between one dependent variable and the combined (or joint) effect of two or more independent variables is called multiple correlation. For example, the study of the relationship between the dependent variable yield of crops and the combined effect of the independent variables fertilizer, irrigation, seed quality, plot of land, etc. is a multiple correlation. Similarly, the demand for a commodity depends upon its price, the income of consumers, the price of related goods, advertisement, etc. Therefore, the relationship among three or more variables considered simultaneously (at the same time) is called multiple correlation.
Let us consider three variables X1, X2 and X3. If X1 is the dependent variable and X2, X3 are the independent variables, then the multiple correlation coefficient between the dependent variable and the joint effect of the independent variables is denoted by R1.23 and given by
R1.23 = √{(r12² + r13² − 2r12r13r23)/(1 − r23²)}
Similarly, if X2 is the dependent variable and X1, X3 are the independent variables, then the multiple correlation coefficient is denoted by R2.13 and given by
R2.13 = √{(r12² + r23² − 2r12r13r23)/(1 − r13²)}
If X3 is the dependent variable and X1, X2 are the independent variables, then the multiple correlation coefficient is denoted by R3.12 and given by
R3.12 = √{(r13² + r23² − 2r12r13r23)/(1 − r12²)}
Properties of multiple correlation coefficients
• The value of the multiple correlation coefficient lies between 0 and 1 (inclusive), i.e. 0 ≤ R1.23 ≤ 1.
• (i) R1.23 = R1.32 (ii) R2.13 = R2.31 (iii) R3.12 = R3.21
Coefficient of Multiple Determination
The square of the multiple correlation coefficient is called the coefficient of multiple determination. It is used to measure the proportion (or percentage) of the total variation in the dependent variable which has been explained by the variation in the independent variables.
If R1.23 = 0.9, then the coefficient of multiple determination is R²1.23 = (0.9)² = 0.81 = 81%. It indicates that 81% of the total variation in the dependent variable X1 has been explained by the two independent variables X2 and X3, and the remaining (100 − 81)% = 19% is due to the effect of other factors.
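A small Python sketch of the R1.23 formula and its square; the zero-order coefficients used here (0.6, 0.7, 0.65) are illustrative values only, not taken from the text.

```python
import math

def multiple_R1_23(r12, r13, r23):
    """Multiple correlation of X1 with the joint effect of X2 and X3,
    computed from the three zero-order correlation coefficients."""
    num = r12**2 + r13**2 - 2 * r12 * r13 * r23
    return math.sqrt(num / (1 - r23**2))

# Hypothetical zero-order coefficients (illustrative values only)
r12, r13, r23 = 0.6, 0.7, 0.65
R = multiple_R1_23(r12, r13, r23)
print(round(R, 4))     # 0.7255  (multiple correlation coefficient)
print(round(R**2, 4))  # 0.5264  (coefficient of multiple determination)
```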
Example
If the zero order correlation coefficients are given, find (i) R1.23 (ii) R²1.23 and interpret the results.
Solution:
Here,
R1.23 = √{(r12² + r13² − 2r12r13r23)/(1 − r23²)}
= 0.721
The coefficient of multiple determination is given by
R²1.23 = (0.721)² = 0.52 = 52%
It indicates that 52% of the total variation in the dependent variable X1 has been explained by the two independent variables X2 and X3, and the remaining (100 − 52)% = 48% is due to the effect of other factors.
Example
Suppose a computer has found, for a given set of values of X1, X2 and X3, R1.23 = 1.29. Explain whether these computations may be said to be free from errors or not.
Solution:
R1.23 = √{(r12² + r13² − 2r12r13r23)/(1 − r23²)} = 1.29
Since R1.23 must lie between 0 and 1 (inclusive) but here R1.23 = 1.29 > 1, these computations cannot be said to be free from errors.
Example
A sample of 10 values of three variables X1, X2 and X3 was obtained.
Find the partial correlation coefficient between X1 and X2 eliminating the effect of X3.
Find the multiple correlation coefficient of X1 with X2 and X3.
Solution: The simple correlation coefficient between X1 & X2 is given by
r12 = [nΣX1X2 − (ΣX1)(ΣX2)]/√{[nΣX1² − (ΣX1)²][nΣX2² − (ΣX2)²]}
and similarly for r13 and r23.
Now, the partial correlation coefficient between X1 and X2 eliminating the effect of X3 is
r12.3 = (r12 − r13r23)/√{(1 − r13²)(1 − r23²)}
& The multiple correlation coefficient of X1 with X2 and X3 is given by
R1.23 = √{(r12² + r13² − 2r12r13r23)/(1 − r23²)}
= 0.767
Example: The height and weight of 10 individuals of different ages are
given below:
Age (X1):    11 10  6 10  8  9 10  7 11  8
Height (X2): 60 67 53 56 64 57 71 58 67 57
Weight (X3): 57 55 49 52 57 48 59 50 62 51
Find r12.3, r13.2 and R1.23.
Solution: Let age = X1, height = X2 and weight = X3.
Calculation of r12, r13 and r23, taking deviations from the assumed means 10, 60 and 50,
where d1 = X1 − 10, d2 = X2 − 60, d3 = X3 − 50:
X1  X2  X3   d1   d2   d3   d1²  d2²  d3²  d1d2  d1d3  d2d3
11  60  57    1    0    7    1    0   49     0     7     0
10  67  55    0    7    5    0   49   25     0     0    35
 6  53  49   −4   −7   −1   16   49    1    28     4     7
10  56  52    0   −4    2    0   16    4     0     0    −8
 8  64  57   −2    4    7    4   16   49    −8   −14    28
 9  57  48   −1   −3   −2    1    9    4     3     2     6
10  71  59    0   11    9    0  121   81     0     0    99
 7  58  50   −3   −2    0    9    4    0     6     0     0
11  67  62    1    7   12    1   49  144     7    12    84
 8  57  51   −2   −3    1    4    9    1     6    −2    −3
Total: Σd1 = −10, Σd2 = 10, Σd3 = 40, Σd1² = 36, Σd2² = 322, Σd3² = 358, Σd1d2 = 42, Σd1d3 = 9, Σd2d3 = 248
r12 = [nΣd1d2 − (Σd1)(Σd2)]/√{[nΣd1² − (Σd1)²][nΣd2² − (Σd2)²]} = [10(42) − (−10)(10)]/√[(360 − 100)(3220 − 100)] = 520/√(260 × 3120) = 0.5774
r13 = [10(9) − (−10)(40)]/√[(360 − 100)(3580 − 1600)] = 490/√(260 × 1980) = 0.6829
& r23 = [10(248) − (10)(40)]/√[(3220 − 100)(3580 − 1600)] = 2080/√(3120 × 1980) = 0.8369
Now, the partial correlation coefficient between X1 and X2 keeping X3 constant is given by
r12.3 = (r12 − r13r23)/√{(1 − r13²)(1 − r23²)} = (0.5774 − 0.6829 × 0.8369)/√{(1 − 0.6829²)(1 − 0.8369²)} = 0.015
The partial correlation coefficient between X1 and X3 keeping X2 constant is given by
r13.2 = (r13 − r12r23)/√{(1 − r12²)(1 − r23²)} = (0.6829 − 0.5774 × 0.8369)/√{(1 − 0.5774²)(1 − 0.8369²)} = 0.447
& The multiple correlation coefficient of X1 with X2 and X3 is
R1.23 = √{(r12² + r13² − 2r12r13r23)/(1 − r23²)} = √{(0.5774² + 0.6829² − 2 × 0.5774 × 0.6829 × 0.8369)/(1 − 0.8369²)}
= 0.683
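The worked example above can be cross-checked in pure Python; the code below is a sketch that recomputes the zero-order coefficients from the raw data and confirms the multiple correlation value 0.683.

```python
import math

def pearson_r(x, y):
    """Zero-order correlation coefficient from raw data."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx, syy = sum(v * v for v in x), sum(v * v for v in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))

age    = [11, 10, 6, 10, 8, 9, 10, 7, 11, 8]
height = [60, 67, 53, 56, 64, 57, 71, 58, 67, 57]
weight = [57, 55, 49, 52, 57, 48, 59, 50, 62, 51]

r12 = pearson_r(age, height)     # ≈ 0.577
r13 = pearson_r(age, weight)     # ≈ 0.683
r23 = pearson_r(height, weight)  # ≈ 0.837

# Multiple correlation of age (X1) with height (X2) and weight (X3)
R123 = math.sqrt((r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2))
print(round(R123, 3))  # 0.683, matching the hand calculation
```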
Regression Analysis
The literal or dictionary meaning of the word "regression" is "stepping back" or "returning back" towards the average value. It was first developed by the British biometrician Sir Francis Galton in 1877.
Regression analysis is a mathematical measure (method) of the average relationship between two or more variables in terms of the original units of data.
Thus, regression is a statistical tool (device) which is used to estimate or predict the value of one variable (i.e. the dependent variable) from the given values of other variables (i.e. independent variables). For example, the yield of a crop depends on the amount of rainfall, quantity of fertilizer used, quality of seeds, type of soil, method of cultivation, etc.
There are two types of variables involved in regression analysis:
Dependent Variable: The variable whose value is to be predicted or estimated is called the dependent variable. The dependent variable is also called the explained variable, response variable, outcome variable or regressand.
Independent Variable: The variable which is used for prediction is called the independent variable. The independent variable is also called the explanatory variable, predictor, regressor or covariate.
Multiple Linear Regression Analysis
Multiple linear regression analysis consists of a mathematical measure of the average relationship between a dependent variable and two or more independent variables in terms of the original units of data. It is expressed in the form of an equation and is used to estimate or predict the value of the dependent variable with the help of known values of the independent variables.
Multiple Regression Equation
The multiple regression equation is the algebraic relationship between one dependent variable and two or more independent variables. It is used to estimate (or predict) the value of the dependent variable for given values of the independent variables in terms of the original units of data. For three variables X1, X2 and X3, if X1 is the dependent variable and X2, X3 are the independent variables, then:
The multiple regression equation of X1 on X2 and X3 is given by
X1 = a + b1X2 + b2X3 . . . . . . (A)
The normal equations for estimating a, b1 and b2 are
ΣX1 = na + b1ΣX2 + b2ΣX3 . . . (i)
ΣX1X2 = aΣX2 + b1ΣX2² + b2ΣX2X3 . . . (ii)
ΣX1X3 = aΣX3 + b1ΣX2X3 + b2ΣX3² . . . (iii)
Where,
X1 = Dependent variable (to be estimated)
X2, X3 = Independent variables
a = X1-intercept (constant) = value of X1 when X2 = 0 and X3 = 0
b1 = slope of X1 with variable X2 keeping variable X3 constant
= the partial regression coefficient of X1 on X2 keeping X3 constant
b2 = slope of X1 with variable X3 keeping variable X2 constant
= the partial regression coefficient of X1 on X3 keeping X2 constant
Solving equations (i), (ii) and (iii), we get the values of a, b1 and b2. Then, substituting the values of a, b1 and b2 in equation (A) gives the fitted regression equation.
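The three normal equations can be solved numerically. The sketch below builds the 3×3 system from the data sums and solves it by Cramer's rule; the data are hypothetical, generated from X1 = 2 + 3X2 + X3, so the fitted coefficients are known in advance.

```python
def det3(m):
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def fit_regression(x1, x2, x3):
    """Solve the three normal equations for a, b1, b2 by Cramer's rule."""
    n = len(x1)
    s23 = sum(p * q for p, q in zip(x2, x3))
    A = [[n,       sum(x2),                  sum(x3)],
         [sum(x2), sum(v * v for v in x2),   s23],
         [sum(x3), s23,                      sum(v * v for v in x3)]]
    y = [sum(x1),
         sum(p * q for p, q in zip(x1, x2)),
         sum(p * q for p, q in zip(x1, x3))]
    D = det3(A)
    def replaced(j):  # A with column j replaced by the right-hand side y
        return [[y[i] if k == j else A[i][k] for k in range(3)] for i in range(3)]
    return tuple(det3(replaced(j)) / D for j in range(3))

# Hypothetical data generated from X1 = 2 + 3*X2 + 1*X3 (perfect fit)
x2 = [1, 2, 3, 4]
x3 = [2, 1, 4, 3]
x1 = [7, 9, 15, 17]
a, b1, b2 = fit_regression(x1, x2, x3)
print(round(a, 6), round(b1, 6), round(b2, 6))  # 2.0 3.0 1.0
```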
346b1 − 38b2 = 194
or, 173b1 − 19b2 = 97 . . . . . . (viii)
Again, multiplying equation (v) by 21 and equation (vii) by 5 and then subtracting equation (vii)
from equation (v), we get
105a + 1533b1 + 441b2 = 777
105a + 1495b1 + 475b2 = 720
On subtracting, 38b1 − 34b2 = 57 . . . . . . (ix)
Now, multiplying equation (viii) by 34 and equation (ix) by 19 and then subtracting, we get
5160b1 = 2215
or, b1 = 0.43
Putting the value of b1 in equation (ix)
38(0.43) − 34b2 = 57
or, 16.34 − 34b2 = 57
or, −34b2 = 40.66
or, b2 = −1.2
Putting the values of b1 and b2 in equation (v)
5a + 73b1 + 21b2 = 37
or, 5a + 73(0.43) + 21(−1.2) = 37
or, 5a + 31.39 − 25.2 = 37
or, 5a + 6.19 = 37
or, 5a = 30.81
or, a = 6.16
Now, substituting the values of a, b1 and b2 in equation (A), we get
X̂1 = 6.16 + 0.43X2 − 1.2X3. This is the required estimated regression equation of X1 (i.e. job satisfaction) on X2 (experience) and
X3 (annual income).
When X2 = 18 years and X3 = Rs. 800000 = Rs. 8 (100000)
Then, the estimated job satisfaction of an employee is
X̂1 = 6.16 + 0.43(18) − 1.2(8)
= 6.16 + 7.74 − 9.6
= 4.3
The standard error of estimate of X1 on X2 and X3 is given by
Se = √[Σ(X1 − X̂1)²/(n − 1)]
The average deviation of the observed values of X1 from the fitted regression line is Se.
The coefficient of multiple determination of X1 on X2 and X3 is given by
R²1.23 = explained variation / total variation
= 0.9710 = 97.10%
It shows that 97.10% of the total variation in job satisfaction (X1) can be explained by
experience (X2) and annual income (X3), and the remaining 2.90% of the variation in job
satisfaction is due to other factors.
Multiple Regression Equation using Deviation from Actual Mean
If the arithmetic means of all three variables are whole numbers (i.e. not fractions), the "deviation from actual mean" method is more appropriate than the direct method. In this method, the deviations of items are taken from their respective means, and instead of solving three normal equations we solve only two.
Thus, the multiple regression equation of X1 on X2 and X3 is given by
x1 = b1x2 + b2x3 . . . . . . . . . (A)
Where, x1 = X1 − X̄1, x2 = X2 − X̄2, x3 = X3 − X̄3
Then, the normal equations for estimating the values of b1 and b2 are
Σx1x2 = b1Σx2² + b2Σx2x3 . . . . . . (i)
Σx1x3 = b1Σx2x3 + b2Σx3² . . . . . . (ii)
Solving these two normal equations, we get the values of b1 and b2.
Substituting the values of b1 and b2 in equation (A), we get the fitted regression equation.
Example: The following data relate to the expenditure on food (X1, in Rs. '000), annual income (X2, in Rs. '000) and family size (X3) of 5 families. Fit the multiple regression equation of X1 on X2 and X3 by the deviation from actual mean method.
X1  X2  X3   x1   x2   x3   x1²  x2²  x3²  x1x2  x1x3  x2x3
 5  25   3   −3   −9    0    9   81    0    27     0     0
 7  40   2   −1    6   −1    1   36    1    −6     1    −6
 8  30   4    0   −4    1    0   16    1     0     0    −4
 9  50   5    1   16    2    1  256    4    16     2    32
11  25   1    3   −9   −2    9   81    4   −27    −6    18
Total: ΣX1 = 40, ΣX2 = 170, ΣX3 = 15, Σx1² = 20, Σx2² = 470, Σx3² = 10, Σx1x2 = 10, Σx1x3 = −3, Σx2x3 = 40
Here, X̄1 = 40/5 = 8, X̄2 = 170/5 = 34, X̄3 = 15/5 = 3, where x1 = X1 − 8, x2 = X2 − 34, x3 = X3 − 3.
The normal equations are
Σx1x2 = b1Σx2² + b2Σx2x3, i.e. 10 = 470b1 + 40b2 . . . (iii)
Σx1x3 = b1Σx2x3 + b2Σx3², i.e. −3 = 40b1 + 10b2 . . . (iv)
Multiplying equation (iv) by 4 and subtracting it from equation (iii), we get
310b1 = 22
or, b1 = 0.071
Putting the value of b1 in equation (iv), we get
40(0.071) + 10b2 = −3
or, 2.84 + 10b2 = −3
or, 10b2 = −5.84
or, b2 = −0.584
Now, substituting the values of b1 and b2 in equation (A), we get
x1 = 0.071x2 − 0.584x3
or, X1 − 8 = 0.071(X2 − 34) − 0.584(X3 − 3)
or, X1 = 8 + 0.071X2 − 2.414 − 0.584X3 + 1.752
or, X1 = 7.338 + 0.071X2 − 0.584X3
This is the required estimated multiple regression equation of X1 on X2 and X3.
i. When annual income X2 = 50 ('000) and family size X3 = 4,
then the estimated expenditure on food is
X̂1 = 7.338 + 0.071(50) − 0.584(4)
= 7.338 + 3.55 − 2.336 = 8.55 ('000)
Hence, the estimated expenditure on food of the family is Rs. 8.55 ('000).
ii. The coefficient of multiple determination of X1 on X2 and X3 is given by
R²1.23 = (b1Σx1x2 + b2Σx1x3)/Σx1² = [0.071(10) + (−0.584)(−3)]/20 = 0.123 = 12.3%
It shows that 12.3% of the total variation in expenditure on food (X1) has been explained by
annual income (X2) and family size (X3), and the remaining variation is due to other factors.
iii. The standard error of estimate of X1 on X2 and X3 is given by
Se = √[(Σx1² − b1Σx1x2 − b2Σx1x3)/(n − 1)]
= √[(20 − 0.71 − 1.752)/4]
= √(17.538/4) = 2.0939
The average deviation of the observed values of expenditure on food from the
fitted regression line is 2.0939 ('000).
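The deviation-from-mean example above can be verified with a short Python sketch (small rounding differences from the hand calculation are expected, e.g. the prediction comes out as 8.55 rather than a hand-rounded figure).

```python
# Verification of the deviation-from-actual-mean example: expenditure on
# food (x1), annual income (x2) and family size (x3) of 5 families.
x1 = [5, 7, 8, 9, 11]
x2 = [25, 40, 30, 50, 25]
x3 = [3, 2, 4, 5, 1]

n = len(x1)
m1, m2, m3 = sum(x1) / n, sum(x2) / n, sum(x3) / n
d1 = [v - m1 for v in x1]
d2 = [v - m2 for v in x2]
d3 = [v - m3 for v in x3]

S22 = sum(v * v for v in d2)
S33 = sum(v * v for v in d3)
S23 = sum(p * q for p, q in zip(d2, d3))
S12 = sum(p * q for p, q in zip(d1, d2))
S13 = sum(p * q for p, q in zip(d1, d3))

# Solve the two normal equations:
#   S12 = b1*S22 + b2*S23  and  S13 = b1*S23 + b2*S33
det = S22 * S33 - S23 * S23
b1 = (S12 * S33 - S13 * S23) / det
b2 = (S13 * S22 - S12 * S23) / det
a = m1 - b1 * m2 - b2 * m3

print(round(b1, 3), round(b2, 3), round(a, 3))  # 0.071 -0.584 7.339
print(round(a + b1 * 50 + b2 * 4, 2))           # 8.55 (estimated expenditure)
```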
The End