Correlation and Regression Analysis

Chapter-5
Correlation Analysis
Correlation analysis is a statistical tool used to describe the degree of
relationship between two or more variables. To measure the degree of
association between such variables, a relative measure is needed, known
as the correlation coefficient. It is generally denoted by 'r'. It is a
pure number, independent of the units of measurement.
"Correlation is defined as a statistical measure which is used to study
the degree of relationship between two or more variables."
For example, a change in the volume of sales is associated with a change
in advertisement expenditure; a change in the sales of woolen garments is
related to a change in temperature; a change in the supply of a commodity
is related to a change in its price; and so on. Two or more variables are
therefore said to be correlated if a change in the value of one variable
is accompanied by a change in the value of the other variable(s).
Types of Correlation

● Positive and Negative Correlation
● Linear and Non-linear Correlation
● Simple, Partial and Multiple Correlation
Positive and Negative Correlation
Positive Correlation
Correlation is said to be positive or direct if an increase (decrease) in
the values of one variable results, on average, in a corresponding
increase (decrease) in the values of the other variable, e.g. sales and
profit, price and supply of commodities, income and expenditure of a
family, etc.
Examples illustrating the concept of positive correlation:

i. Increasing ↑ X: 17 20 25 30 34
   Increasing ↑ Y: 8 12 15 18 22

ii. Decreasing ↓ X: 60 51 40 35 30
    Decreasing ↓ Y: 18 17 10 7 5
Negative Correlation
On the other hand, correlation is said to be negative or inverse if the
variables deviate in opposite directions, i.e. if an increase (decrease)
in the values of one variable results, on average, in a corresponding
decrease (increase) in the values of the other variable. Examples of such
correlation are: day temperature and the sale of woolen dresses, price
and demand of commodities, etc.
Examples illustrating the concept of negative correlation:

i. Increasing ↑ X: 17 20 25 30 34
   Decreasing ↓ Y: 22 18 15 12 8

ii. Decreasing ↓ X: 60 51 40 35 30
    Increasing ↑ Y: 5 7 10 17 18
Linear and Non-Linear Correlation
Linear Correlation
The correlation between two variables is said to be linear if,
corresponding to a unit change in one variable, there is a constant
change in the other variable over the entire range of values. The
following example illustrates a linear correlation between two variables
X and Y.

X: 1 2 3 4 5
Y: 5 7 9 11 13
Non-Linear Correlation
The correlation between two variables is said to be non-linear or
curvilinear if, corresponding to a unit change in one variable, the other
variable does not change at a constant rate over the entire range of
values. The following example illustrates a non-linear correlation
between two variables X and Y.

X: 1 2 3 4 5
Y: 7 8 15 20 23
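For equally spaced X values, the two cases can be distinguished numerically: a linear relation produces constant first differences in Y, while a non-linear one does not. A minimal sketch in Python, using the two data sets above:

```python
def first_differences(ys):
    """Return the successive differences Y[i+1] - Y[i]."""
    return [b - a for a, b in zip(ys, ys[1:])]

# Linear example: each unit change in X gives a constant change of 2 in Y.
linear_y = [5, 7, 9, 11, 13]
# Non-linear example: the change in Y varies from step to step.
nonlinear_y = [7, 8, 15, 20, 23]

print(first_differences(linear_y))     # [2, 2, 2, 2]
print(first_differences(nonlinear_y))  # [1, 7, 5, 3]
```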
Simple, Partial and Multiple Correlation
On the basis of the number of variables involved in the study,
correlation can be classified into three types:

1. Simple correlation
2. Partial correlation
3. Multiple correlation
Simple correlation
The study of the degree of relationship between only two variables is
called simple correlation, and its relative measure is known as the
simple correlation coefficient, e.g. (i) a study of crop yield with
respect to the amount of fertilizer only, (ii) sales revenue with respect
to the amount of money spent on advertisement, (iii) the relationship
between income and expenditure, profit and sales, etc. The simple
correlation coefficient of two variables x1 and x2 is denoted by r12, or
simply 'r'. The simple correlation coefficient is also called the "zero
order correlation coefficient".
Interpretation of Correlation Coefficient

Degree of correlation                Positive           Negative
Perfect correlation                  +1                 –1
Very high degree of correlation      +0.90 to +0.99     –0.90 to –0.99
High degree of correlation           +0.75 to +0.89     –0.75 to –0.89
Moderate degree of correlation       +0.50 to +0.74     –0.50 to –0.74
Low degree of correlation            +0.25 to +0.49     –0.25 to –0.49
Very low degree of correlation       +0.01 to +0.24     –0.01 to –0.24
No correlation                        0                  0
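The table can be applied to a computed value of r. A small sketch using the product-moment (Pearson) formula on the first positive-correlation example above:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r of paired samples xs, ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [17, 20, 25, 30, 34]
y = [8, 12, 15, 18, 22]
print(round(pearson_r(x, y), 3))  # 0.991
```

The value 0.991 falls in the "very high degree of positive correlation" band of the table.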
Partial correlation
The degree of relationship (closeness/association) between two variables,
keeping the effects of all other variables constant, is called partial
correlation. In other words, the degree of relationship between one
dependent variable and one independent variable, keeping all other
independent variables constant, is called the partial correlation
coefficient. Partial correlation is also called net correlation. For
example, the relationship between crop yield and amount of irrigation,
keeping fertilizer, seed quality, etc. constant, is a study of partial
correlation.
The correlation between two variables keeping one other variable constant
is called first-order partial correlation. The correlation between two
variables keeping two other variables constant is called second-order
partial correlation, and so on. Higher-order partial correlations are
usually calculated using software, whereas first-order partial
correlations can be calculated manually.
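A first-order partial correlation can be computed manually from the three zero-order coefficients with the standard formula r12.3 = (r12 − r13·r23) / √[(1 − r13²)(1 − r23²)]. A sketch, with hypothetical zero-order values used purely for illustration:

```python
import math

def partial_r(r12, r13, r23):
    """First-order partial correlation r12.3: correlation of X1 and X2
    with the effect of X3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))

# Hypothetical zero-order correlation coefficients, for illustration only.
print(round(partial_r(0.8, 0.6, 0.7), 2))  # 0.67
```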
Multiple Correlation
The degree of association (or relationship) between one dependent
variable and the combined (or joint) effect of the independent variables
is called the multiple correlation coefficient. For example, the study of
the relationship between the dependent variable crop yield and the
combined effect of the independent variables fertilizer, irrigation, seed
quality, plot of land, etc. is multiple correlation. Similarly, the
demand for a commodity depends upon its price, consumer income, the price
of related goods, advertisement, etc. Therefore, the relationship among
three or more variables simultaneously is called multiple correlation.
Let us consider three variables X1, X2 and X3. If X1 is the dependent
variable and X2, X3 are the independent variables, then the multiple
correlation coefficient between the dependent variable and the joint
effect of the independent variables is denoted by R1.23 and given by

R1.23 = √[(r12² + r13² − 2 r12 r13 r23) / (1 − r23²)]
Similarly, if X2 is the dependent variable and X1, X3 are the independent
variables, then the multiple correlation coefficient is denoted by R2.13
and given by

R2.13 = √[(r12² + r23² − 2 r12 r13 r23) / (1 − r13²)]

If X3 is the dependent variable and X1, X2 are the independent variables,
then the multiple correlation coefficient is denoted by R3.12 and given by

R3.12 = √[(r13² + r23² − 2 r12 r13 r23) / (1 − r12²)]
Properties of the multiple correlation coefficient
• The value of the multiple correlation coefficient lies between 0 and 1
(inclusive), i.e. 0 ≤ R1.23 ≤ 1.
• (i) R1.23 = R1.32 (ii) R2.13 = R2.31 (iii) R3.12 = R3.21
Coefficient of Multiple Determination
The square of the multiple correlation coefficient is called the
coefficient of multiple determination. It measures the proportion (or
percentage) of the total variation in the dependent variable that has
been explained by the variation in the independent variables.
If R1.23 = 0.9, then the coefficient of multiple determination is
R²1.23 = 0.81 = 81%. It indicates that 81% of the total variation in the
dependent variable X1 has been explained by the two independent variables
X2 and X3, and the remaining 19% (100% − 81%) is due to the effect of
other factors.
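R1.23 and its square can be computed directly from the zero-order coefficients. A sketch with hypothetical zero-order values, for illustration only:

```python
import math

def multiple_R(r12, r13, r23):
    """Multiple correlation coefficient R1.23 of X1 on X2 and X3,
    computed from the zero-order coefficients."""
    return math.sqrt((r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2))

# Hypothetical zero-order correlation coefficients, for illustration only.
R = multiple_R(0.8, 0.6, 0.7)
print(round(R, 2), round(R**2, 2))  # 0.8 0.64
```

Here the coefficient of multiple determination R² ≈ 0.64 would mean that about 64% of the variation in X1 is explained by X2 and X3 jointly.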
Example
Given the zero-order correlation coefficients r12, r13 and r23, find
(i) R1.23 (ii) R²1.23 and interpret the results.
Solution:
Here,
R1.23 = √[(r12² + r13² − 2 r12 r13 r23) / (1 − r23²)] = 0.721
The coefficient of multiple determination is given by
R²1.23 = (0.721)² = 0.52 = 52%
It indicates that 52% of the total variation in the dependent variable X1
has been explained by the two independent variables X2 and X3, and the
remaining 48% is due to the effect of other factors.
Example
Suppose a computer has found, for a given set of values of X1, X2 and X3,
that R1.23 = 1.29. Explain whether these computations may be said to be
free from errors.
Solution:
Since R1.23 must lie between 0 and 1 (inclusive), but here R1.23 = 1.29
> 1, these computations cannot be said to be free from errors.
Example
A sample of 10 values of three variables X1, X2 and X3 gave the summary
results shown.
Find the partial correlation coefficient between X1 and X2 eliminating
the effect of X3.
Find the multiple correlation coefficient of X1 with X2 and X3.
Solution: The simple correlation coefficients r12, r13 and r23 between
the pairs of variables are computed first from the summary results.
Now, the partial correlation coefficient between X1 and X2 eliminating
the effect of X3 is
r12.3 = (r12 − r13 r23) / √[(1 − r13²)(1 − r23²)]
The multiple correlation coefficient of X1 with X2 and X3 is given by
R1.23 = √[(r12² + r13² − 2 r12 r13 r23) / (1 − r23²)] = 0.767
Example: The age, height and weight of 10 individuals are given below:

Age:    11 10  6 10  8  9 10  7 11  8
Height: 60 67 53 56 64 57 71 58 67 57
Weight: 57 55 49 52 57 48 59 50 62 51

Find r12.3, r13.2 and R1.23.
Solution: Let age = X1, height = X2 and weight = X3.
Calculation of r12, r13 and r23 using the deviations d1 = X1 − 10,
d2 = X2 − 60 and d3 = X3 − 50 from assumed means:

X1  X2  X3   d1  d2  d3   d1²  d2²  d3²  d1d2  d1d3  d2d3
11  60  57    1   0   7    1    0   49     0     7     0
10  67  55    0   7   5    0   49   25     0     0    35
 6  53  49   -4  -7  -1   16   49    1    28     4     7
10  56  52    0  -4   2    0   16    4     0     0    -8
 8  64  57   -2   4   7    4   16   49    -8   -14    28
 9  57  48   -1  -3  -2    1    9    4     3     2     6
10  71  59    0  11   9    0  121   81     0     0    99
 7  58  50   -3  -2   0    9    4    0     6     0     0
11  67  62    1   7  12    1   49  144     7    12    84
 8  57  51   -2  -3   1    4    9    1     6    -2    -3

Σd1 = −10, Σd2 = 10, Σd3 = 40, Σd1² = 36, Σd2² = 322, Σd3² = 358,
Σd1d2 = 42, Σd1d3 = 9, Σd2d3 = 248

r12 = (nΣd1d2 − Σd1 Σd2) / √[(nΣd1² − (Σd1)²)(nΣd2² − (Σd2)²)]
    = (10 × 42 − (−10)(10)) / √[(360 − 100)(3220 − 100)]
    = 520 / √811200 = 0.577
Similarly,
r13 = (10 × 9 − (−10)(40)) / √[(360 − 100)(3580 − 1600)]
    = 490 / √514800 = 0.683
and
r23 = (10 × 248 − (10)(40)) / √[(3220 − 100)(3580 − 1600)]
    = 2080 / √6177600 = 0.837
Now, the partial correlation coefficient between X1 and X2 eliminating
the effect of X3 is
r12.3 = (r12 − r13 r23) / √[(1 − r13²)(1 − r23²)]
      = (0.577 − 0.683 × 0.837) / √[(1 − 0.683²)(1 − 0.837²)] = 0.015
The partial correlation coefficient between X1 and X3 keeping X2
constant is
r13.2 = (r13 − r12 r23) / √[(1 − r12²)(1 − r23²)]
      = (0.683 − 0.577 × 0.837) / √[(1 − 0.577²)(1 − 0.837²)] = 0.447
The multiple correlation coefficient of X1 with X2 and X3 is
R1.23 = √[(r12² + r13² − 2 r12 r13 r23) / (1 − r23²)] = 0.683
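The hand computation above can be cross-checked in plain Python (no external libraries); the helper and formulas below restate the zero-order, partial and multiple correlation definitions used in the example:

```python
import math

def pearson_r(xs, ys):
    """Zero-order (simple) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

age    = [11, 10, 6, 10, 8, 9, 10, 7, 11, 8]       # X1
height = [60, 67, 53, 56, 64, 57, 71, 58, 67, 57]  # X2
weight = [57, 55, 49, 52, 57, 48, 59, 50, 62, 51]  # X3

r12 = pearson_r(age, height)
r13 = pearson_r(age, weight)
r23 = pearson_r(height, weight)

# First-order partial correlation of X1 and X2, holding X3 constant.
r12_3 = (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))
# Multiple correlation of X1 on X2 and X3.
R1_23 = math.sqrt((r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2))
print(round(r12, 3), round(r12_3, 3), round(R1_23, 3))  # 0.577 0.015 0.683
```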
Regression Analysis
The literal or dictionary meaning of the word "regression" is "stepping
back" or "returning back" towards the average value. It was first
developed by the British biometrician Sir Francis Galton in 1877.
Regression analysis is a mathematical measure (method) of the average
relationship between two or more variables in terms of the original units
of the data.
Thus, regression is a statistical tool (device) used to estimate or
predict the value of one variable (the dependent variable) from given
values of other variables (the independent variables). For example, the
yield of a crop depends on the amount of rainfall, the quantity of
fertilizer used, the quality of seeds, the type of soil, the method of
cultivation, etc.
There are two types of variables involved in regression analysis:
Dependent Variable: The variable whose value is to be predicted or
estimated is called the dependent variable. It is also called the
explained variable, response variable, outcome variable or regressand.
Independent Variable: The variable used for prediction is called the
independent variable. It is also called the explanatory variable,
predictor, regressor or covariate.
Multiple Linear Regression Analysis
Multiple linear regression analysis is a mathematical measure of the
average relationship between a dependent variable and two or more
independent variables in terms of the original units of the data. It is
expressed in the form of an equation and is used to estimate or predict
the value of the dependent variable from known values of the independent
variables.
Multiple Regression Equation
The multiple regression equation is the algebraic relationship between
one dependent variable and two or more independent variables. It is used
to estimate (or predict) the value of the dependent variable for given
values of the independent variables in terms of the original units of the
data. For three variables X1, X2 and X3, if X1 is the dependent variable
and X2, X3 are the independent variables, then
the multiple regression equation of X1 on X2 and X3 is given by
X1 = a + b1 X2 + b2 X3 . . . . . . (A)
The normal equations for estimating a, b1 and b2 are
ΣX1 = na + b1 ΣX2 + b2 ΣX3 . . . (i)
ΣX1X2 = a ΣX2 + b1 ΣX2² + b2 ΣX2X3 . . . (ii)
ΣX1X3 = a ΣX3 + b1 ΣX2X3 + b2 ΣX3² . . . (iii)
Where,
X1 = dependent variable (to be estimated or found)
X2, X3 = independent variables
a = X1-intercept (constant) = value of X1 when X2 = 0 and X3 = 0
b1 = slope of X1 with respect to X2, keeping X3 constant
   = the partial regression coefficient of X1 on X2 keeping X3 constant
b2 = slope of X1 with respect to X3, keeping X2 constant
   = the partial regression coefficient of X1 on X3 keeping X2 constant
Solving equations (i), (ii) and (iii), we get the values of a, b1 and b2.
Substituting these values in equation (A), we get

X̂1 = a + b1 X2 + b2 X3 . . . . . (iv)

which is the required estimated multiple regression equation of X1 on X2
and X3 (the line of best fit, or least-squares regression equation, of X1
on X2 and X3) and is used to estimate the value of the dependent variable
X1 from known (given) values of the independent variables X2 and X3.
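The three normal equations form a 3×3 linear system in a, b1 and b2, so they can be solved mechanically. A sketch in plain Python, with a small hypothetical data set generated exactly from X1 = 2 + 3·X2 − X3, so the solver should recover those coefficients:

```python
def solve3(A, d):
    """Solve a 3x3 linear system A x = d by Gaussian elimination
    with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, d)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

# Hypothetical data generated from X1 = 2 + 3*X2 - X3, for illustration.
X2 = [1, 2, 3, 4]
X3 = [2, 1, 3, 2]
X1 = [2 + 3 * u - v for u, v in zip(X2, X3)]
n = len(X1)
sp = lambda a, b: sum(p * q for p, q in zip(a, b))

# Normal equations (i)-(iii) written as a matrix system in (a, b1, b2).
A = [[n,       sum(X2),    sum(X3)],
     [sum(X2), sp(X2, X2), sp(X2, X3)],
     [sum(X3), sp(X2, X3), sp(X3, X3)]]
d = [sum(X1), sp(X1, X2), sp(X1, X3)]

a, b1, b2 = solve3(A, d)
print(round(a, 6), round(b1, 6), round(b2, 6))  # 2.0 3.0 -1.0
```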
The coefficient of multiple determination in terms of multiple
regression
It measures the proportion (or percentage) of the total variation in the
dependent variable that has been explained by the variation in the
independent variables. It is used to assess the goodness of fit of the
regression model: the higher the value of the coefficient of multiple
determination, the better the fitted model.
Direct method
When the actual data are used, the multiple regression equation of the
dependent variable X1 on the two independent variables X2 and X3 is
X̂1 = a + b1 X2 + b2 X3.
The coefficient of multiple determination of X1 on X2 and X3 is given by
R²1.23 = 1 − SSE/SST
Where,
SSE = ΣX1² − a ΣX1 − b1 ΣX1X2 − b2 ΣX1X3 (unexplained variation)
SST = ΣX1² − n X̄1² (total variation)
Standard error of estimate
The standard error of estimate measures the average variation (scatter)
of the observed values around the regression line. The smaller the
standard error of estimate, the better the fitted model.
Standard error of estimate in terms of regression coefficients:
For the regression model X̂1 = a + b1 X2 + b2 X3, where X1 is the
dependent variable and X2 and X3 are the independent variables,
the standard error of estimate of X1 on X2 and X3 is given by
Se = √[(ΣX1² − a ΣX1 − b1 ΣX1X2 − b2 ΣX1X3) / (n − 3)]
Example: A research firm believes that the job satisfaction of employees
in a business firm is mainly due to working experience and income. The
firm has assessed the satisfaction of five employees of the business firm
and obtained the following information regarding satisfaction score,
working experience and annual income.

Job satisfaction:             10  5 10  4  8
Experience (in years):        16 13 21 10 13
Annual income (Rs. '00000'):   3  6  4  5  3

a. Estimate the equation to predict job satisfaction from experience and
annual income.
b. Predict the job satisfaction of an employee who has spent 18 years on
the job and has an annual income of Rs. 800000.
c. Compute the standard error of estimate and interpret the result.
d. How much variation in job satisfaction can be explained by experience
and annual income?
Solution:
Let X1 = job satisfaction, X2 = experience (in years) and X3 = annual
income (Rs. '00000').
Let the multiple linear regression equation of X1 on X2 and X3 be
X̂1 = a + b1 X2 + b2 X3 . . . . . . (i)
By the principle of least squares, the normal equations for estimating a,
b1 and b2 are
ΣX1 = na + b1 ΣX2 + b2 ΣX3 . . . (ii)
ΣX1X2 = a ΣX2 + b1 ΣX2² + b2 ΣX2X3 . . . (iii)
ΣX1X3 = a ΣX3 + b1 ΣX2X3 + b2 ΣX3² . . . (iv)

X1  X2  X3  X1²  X2²  X3²  X1X2  X1X3  X2X3
10  16   3  100  256    9   160    30    48
 5  13   6   25  169   36    65    30    78
10  21   4  100  441   16   210    40    84
 4  10   5   16  100   25    40    20    50
 8  13   3   64  169    9   104    24    39

ΣX1 = 37, ΣX2 = 73, ΣX3 = 21, ΣX1² = 305, ΣX2² = 1135, ΣX3² = 95,
ΣX1X2 = 579, ΣX1X3 = 144, ΣX2X3 = 299

Here, n = 5.
Substituting the sums in the normal equations:
37 = 5a + 73 b1 + 21 b2 . . . . . . . . (v)
579 = 73a + 1135 b1 + 299 b2 . . . . . . (vi)
144 = 21a + 299 b1 + 95 b2 . . . . . . . (vii)
Multiplying equation (v) by 73 and equation (vi) by 5:
  365a + 5329 b1 + 1533 b2 = 2701
  365a + 5675 b1 + 1495 b2 = 2895
Subtracting, 346 b1 − 38 b2 = 194
or, 173 b1 − 19 b2 = 97 . . . . . . (viii)
Again, multiplying equation (v) by 21 and equation (vii) by 5:
  105a + 1533 b1 + 441 b2 = 777
  105a + 1495 b1 + 475 b2 = 720
Subtracting, 38 b1 − 34 b2 = 57 . . . . . . . (ix)
Multiplying equation (viii) by 34 and equation (ix) by 19:
  5882 b1 − 646 b2 = 3298
   722 b1 − 646 b2 = 1083
Subtracting, 5160 b1 = 2215
or, b1 = 2215/5160 = 0.43
Putting the value of b1 in equation (ix):
38(0.43) − 34 b2 = 57
or, 16.34 − 34 b2 = 57
or, 34 b2 = −40.66
or, b2 = −1.20
Putting the values of b1 and b2 in equation (v):
5a + 73(0.43) + 21(−1.20) = 37
or, 5a + 31.39 − 25.20 = 37
or, 5a + 6.19 = 37
or, 5a = 30.81
or, a = 6.16
Now, substituting the values of a, b1 and b2 in equation (i), we get
X̂1 = 6.16 + 0.43 X2 − 1.20 X3. This is the required estimated regression
equation of X1 (job satisfaction) on X2 (experience) and X3 (annual
income).
When X2 = 18 years and X3 = Rs. 800000 = 8 (Rs. '00000'),
the estimated job satisfaction of the employee is
X̂1 = 6.16 + 0.43(18) − 1.20(8)
    = 6.16 + 7.74 − 9.60
    = 4.30
The standard error of estimate of X1 on X2 and X3 is given by
Se = √[(ΣX1² − a ΣX1 − b1 ΣX1X2 − b2 ΣX1X3) / (n − 3)]
   = √[(305 − 6.16 × 37 − 0.43 × 579 + 1.20 × 144) / 2]
   ≈ 0.67
The average deviation of the observed values of X1 from the fitted
regression line is about 0.67.
The coefficient of multiple determination of X1 on X2 and X3 is given by
R²1.23 = 1 − SSE/SST
where SSE = ΣX1² − a ΣX1 − b1 ΣX1X2 − b2 ΣX1X3 ≈ 0.90 and
SST = ΣX1² − n X̄1² = 305 − 5(7.4)² = 31.2, so
R²1.23 ≈ 1 − 0.90/31.2 ≈ 0.971 = 97.10%
It shows that 97.10% of the total variation in job satisfaction (X1) can
be explained by experience and annual income, and the remaining variation
in job satisfaction is due to other factors.
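The elimination above can be cross-checked by solving the same three normal equations numerically in plain Python (the small Gaussian-elimination helper below is for illustration):

```python
import math

def solve3(A, d):
    """Solve a 3x3 linear system A x = d by Gaussian elimination
    with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, d)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

X1 = [10, 5, 10, 4, 8]     # job satisfaction
X2 = [16, 13, 21, 10, 13]  # experience (years)
X3 = [3, 6, 4, 5, 3]       # annual income (Rs. '00000')
n = len(X1)
sp = lambda a, b: sum(p * q for p, q in zip(a, b))

a, b1, b2 = solve3(
    [[n, sum(X2), sum(X3)],
     [sum(X2), sp(X2, X2), sp(X2, X3)],
     [sum(X3), sp(X2, X3), sp(X3, X3)]],
    [sum(X1), sp(X1, X2), sp(X1, X3)])

pred = a + b1 * 18 + b2 * 8  # employee: 18 years, Rs. 800000 = 8 ('00000')
sse = sp(X1, X1) - a * sum(X1) - b1 * sp(X1, X2) - b2 * sp(X1, X3)
sst = sp(X1, X1) - n * (sum(X1) / n) ** 2
se = math.sqrt(sse / (n - 3))
r2 = 1 - sse / sst
print(round(a, 2), round(b1, 2), round(b2, 2))     # 6.16 0.43 -1.2
print(round(pred, 1), round(se, 2), round(r2, 3))  # 4.3 0.67 0.971
```

The exact solution (a ≈ 6.159, b1 ≈ 0.429, b2 ≈ −1.197) matches the hand computation up to the rounding used there.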
Multiple Regression Equation using Deviations from Actual Means
If the arithmetic means of all three variables are whole numbers (i.e.
not fractions), the "deviation from actual mean" method is more
convenient than the direct method. In this method, the deviations of the
items are taken from their respective means, and instead of solving three
normal equations we solve only two.
Thus, the multiple regression equation of X1 on X2 and X3 is written as
x̂1 = b1 x2 + b2 x3 . . . . . . . . . (A)
Where, x1 = X1 − X̄1, x2 = X2 − X̄2, x3 = X3 − X̄3
Then, the normal equations for estimating the values of b1 and b2 are
Σx1x2 = b1 Σx2² + b2 Σx2x3 . . . . . . (i)
Σx1x3 = b1 Σx2x3 + b2 Σx3² . . . . . . (ii)
Solving these two normal equations, we get the values of b1 and b2.
Substituting the values of b1 and b2 in equation (A), we get

X̂1 = X̄1 + b1 (X2 − X̄2) + b2 (X3 − X̄3)

This is the required estimated regression equation of X1 on X2 and X3.

The coefficient of multiple determination in terms of multiple
regression
Actual mean method
If the arithmetic means of all three variables are whole numbers, the
deviations of the given variables are taken from their respective means.
Then, for the regression model x̂1 = b1 x2 + b2 x3, where x1 = X1 − X̄1,
x2 = X2 − X̄2 and x3 = X3 − X̄3,
the coefficient of multiple determination of X1 on X2 and X3 is given by
R²1.23 = (b1 Σx1x2 + b2 Σx1x3) / Σx1²
Standard error of estimate in terms of regression coefficients
Actual mean method
If the arithmetic means of all three variables are whole numbers, the
deviations of the given variables are taken from their respective means.
Then, for the regression model x̂1 = b1 x2 + b2 x3, where x1 = X1 − X̄1,
x2 = X2 − X̄2 and x3 = X3 − X̄3,
the standard error of estimate of X1 on X2 and X3 is given by
Se = √[(Σx1² − b1 Σx1x2 − b2 Σx1x3) / (n − 3)]
Example
A survey on income and expenditure of a few families resulted in the
following data.

Expenditure (Rs. '000'):    5  7  8  9 11
Annual income (Rs. '000'): 25 40 30 50 25
Family size (number):       3  2  4  5  1

i. Estimate the expenditure on food of a family with annual income
Rs. 50,000 and 4 family members.
ii. Compute the coefficient of multiple determination.
iii. Find the standard error of estimate.
Solution:
Let X1, X2 and X3 be the three variables representing expenditure
(Rs. '000'), annual income (Rs. '000') and family size (number),
respectively. The means of the three variables are X̄1 = 40/5 = 8,
X̄2 = 170/5 = 34 and X̄3 = 15/5 = 3.
Using the deviations x1 = X1 − 8, x2 = X2 − 34 and x3 = X3 − 3:

X1  X2  X3   x1  x2  x3   x1²  x2²  x3²  x1x2  x1x3  x2x3
 5  25   3   -3  -9   0    9   81    0    27     0     0
 7  40   2   -1   6  -1    1   36    1    -6     1    -6
 8  30   4    0  -4   1    0   16    1     0     0    -4
 9  50   5    1  16   2    1  256    4    16     2    32
11  25   1    3  -9  -2    9   81    4   -27    -6    18

ΣX1 = 40, ΣX2 = 170, ΣX3 = 15, Σx1 = 0, Σx2 = 0, Σx3 = 0, Σx1² = 20,
Σx2² = 470, Σx3² = 10, Σx1x2 = 10, Σx1x3 = −3, Σx2x3 = 40

Since all the means are whole numbers, we can find the multiple
regression equation using deviations of the variables from their
respective means (the actual mean method).
Let the multiple regression equation of X1 on X2 and X3 be
x̂1 = b1 x2 + b2 x3 . . . . . . . . . . (A)
Where, x1 = X1 − X̄1, x2 = X2 − X̄2, x3 = X3 − X̄3
Then, the normal equations for estimating b1 and b2 are
Σx1x2 = b1 Σx2² + b2 Σx2x3 . . . . . . (i)
Σx1x3 = b1 Σx2x3 + b2 Σx3² . . . . . . (ii)
Here, n = 5.
Substituting the sums in the two normal equations (i) and (ii):
10 = 470 b1 + 40 b2 . . . . . . . . (iii)
−3 = 40 b1 + 10 b2 . . . . . . (iv)
Multiplying equation (iv) by 4 and subtracting it from equation (iii):
  470 b1 + 40 b2 = 10
  160 b1 + 40 b2 = −12
Subtracting, 310 b1 = 22
or, b1 = 22/310 = 0.071
Putting the value of b1 in equation (iv):
40(0.071) + 10 b2 = −3
or, 2.84 + 10 b2 = −3
or, 10 b2 = −5.84
or, b2 = −0.584
Now, substituting the values of b1 and b2 in equation (A), we get
X̂1 − 8 = 0.071(X2 − 34) − 0.584(X3 − 3)
or, X̂1 = 8 + 0.071 X2 − 2.414 − 0.584 X3 + 1.752
or, X̂1 = 7.338 + 0.071 X2 − 0.584 X3
This is the required estimated multiple regression equation of X1 on X2
and X3.
i. When annual income X2 = 50 ('000') and family size X3 = 4,
the estimated expenditure on food is
X̂1 = 7.338 + 0.071(50) − 0.584(4) = 8.55 ('000')
Hence, the estimated expenditure on food of the family is about
Rs. 8,550.
ii. The coefficient of multiple determination of X1 on X2 and X3 is
given by
R²1.23 = (b1 Σx1x2 + b2 Σx1x3) / Σx1²
       = (0.071 × 10 + (−0.584)(−3)) / 20 = 0.123 = 12.3%
It shows that 12.3% of the total variation in expenditure on food (X1)
has been explained by annual income and family size, and the remaining
variation is due to other factors.
iii. The standard error of estimate of X1 on X2 and X3 is given by
Se = √[(Σx1² − b1 Σx1x2 − b2 Σx1x3) / (n − 3)]
   = √[(20 − 0.071 × 10 − (−0.584)(−3)) / 2]
   = √(17.538 / 2) ≈ 2.96
The average deviation of the observed values of expenditure on food from
the fitted regression line is about 2.96 ('000').
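The actual-mean-method computation above can be cross-checked in plain Python; in deviation form only a 2×2 system has to be solved, here via Cramer's rule:

```python
import math

X1 = [5, 7, 8, 9, 11]      # expenditure (Rs. '000')
X2 = [25, 40, 30, 50, 25]  # annual income (Rs. '000')
X3 = [3, 2, 4, 5, 1]       # family size
n = len(X1)
m1, m2, m3 = sum(X1) / n, sum(X2) / n, sum(X3) / n
x1 = [v - m1 for v in X1]
x2 = [v - m2 for v in X2]
x3 = [v - m3 for v in X3]
sp = lambda a, b: sum(p * q for p, q in zip(a, b))

# Normal equations in deviation form, solved by Cramer's rule:
#   sp(x1,x2) = b1*sp(x2,x2) + b2*sp(x2,x3)
#   sp(x1,x3) = b1*sp(x2,x3) + b2*sp(x3,x3)
det = sp(x2, x2) * sp(x3, x3) - sp(x2, x3) ** 2
b1 = (sp(x1, x2) * sp(x3, x3) - sp(x1, x3) * sp(x2, x3)) / det
b2 = (sp(x1, x3) * sp(x2, x2) - sp(x1, x2) * sp(x2, x3)) / det
a = m1 - b1 * m2 - b2 * m3   # recover the intercept in original units

pred = a + b1 * 50 + b2 * 4  # income 50 ('000'), 4 family members
r2 = (b1 * sp(x1, x2) + b2 * sp(x1, x3)) / sp(x1, x1)
se = math.sqrt((sp(x1, x1) - b1 * sp(x1, x2) - b2 * sp(x1, x3)) / (n - 3))
print(round(a, 3), round(b1, 3), round(b2, 3))     # 7.339 0.071 -0.584
print(round(pred, 2), round(r2, 3), round(se, 2))  # 8.55 0.123 2.96
```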
The End
