
SCHOOL OF ADVANCED SCIENCES

DEPARTMENT OF MATHEMATICS
FALL SEMESTER – 2020~2021

MAT2001 – Statistics for Engineers


(Embedded Theory Component)

COURSE MATERIAL
Module 3
Correlation and Regression
Syllabus:
Correlation and Regression – Rank Correlation – Partial and Multiple
Correlation – Multiple Regression.

Prepared By: Prof. D. Kalpana Priya (In-charge)


Prof. T. Yogalakshmi
Prof. R. Vanitha

The course in-charges thankfully acknowledge the course materials preparation
committee in-charge and members for their significant contribution in bringing
out this course material.

************************************
Dr. D. Easwaramoorthy
Dr. A. Manimaran
Course In-charges – MAT2001-SE,
Fall Semester 2020~2021,
Department of Mathematics,
SAS, VIT, Vellore.
************************************
Module-3
Correlation and Regression

In this Module, we study the relationship between variables. Sometimes the
interest lies in establishing the actual relationship between two or more
variables; this problem is dealt with by regression. On other occasions, we are
not interested in the actual relationship but only in the degree of relationship
between two or more variables; this problem is dealt with by correlation
analysis.

A linear relationship between two variables is represented by a straight line,
which is known as the regression line. In the study of the linear relationship
between two variables X and Y, if the variable Y depends on X, we call the line
the regression line of Y on X. If X depends on Y, it is called the regression
line of X on Y.

To find the regression line, the observations $(x_i, y_i)$ on the variables X
and Y are necessarily taken in pairs. For example, a chemical engineer may run a
chemical process several times in order to study the relationship between the
concentration of a certain catalyst and the yield of the process. Each time the
process is run, the concentration X and the yield Y are recorded. Generally,
such studies are based on samples of size $n$, and hence the $n$ pairs of sample
observations can be written as $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

Correlation

In a bivariate distribution, we are interested in finding out whether there is
any relationship between the two variables. Correlation is a statistical
technique which studies the relationship between two or more variables, and correlation
analysis involves various methods and techniques used for studying and
measuring the extent of relationship between the two variables. When two
variables are related in such a way that a change in the value of one is
accompanied either by a direct change or by an inverse change in the values of
the other, the two variables are said to be correlated. In the correlated variables
an increase in one variable is accompanied by an increase or decrease in the other
variable. For instance, a relationship exists between the price and demand of a
commodity: other things being equal, an increase in the price of a commodity
causes a decrease in the demand for that commodity.
Relationship might exist between the heights and weights of the students and
between amount of rainfall in a city and the sales of raincoats in that city.
Utility of Correlation
The study of correlation is very useful in practical life as revealed by these
points.
1. With the help of correlation analysis, we can measure, in a single figure,
the degree of relationship existing between variables like price, demand,
supply, income, expenditure, etc. Once we know that two variables are
correlated, we can easily estimate the value of one variable given the value of
the other.
2. Correlation analysis is of great use to economists and businessmen; it
reveals to economists the disturbing factors and suggests the stabilizing
forces. In business, it enables executives to estimate costs, sales, etc. and
plan accordingly.
3. Correlation analysis is helpful to scientists. Nature has been found to be a
multiplicity of interrelated forces.

Types of Correlation
Correlation can be categorized as one of the following:
(i) Positive and Negative,
(ii) Simple and Multiple,
(iii) Partial and Total,
(iv) Linear and Non-Linear (Curvilinear).
(i) Positive and Negative Correlation : Positive or direct Correlation
refers to the movement of variables in the same direction. The correlation is said
to be positive when the increase (decrease) in the value of one variable is
accompanied by an increase (decrease) in the value of other variable also.
Negative or inverse correlation refers to the movement of the variables in
opposite direction. Correlation is said to be negative, if an increase (decrease) in
the value of one variable is accompanied by a decrease (increase) in the value of
other.
(ii) Simple and Multiple Correlation : Under simple correlation, we
study the relationship between two variables only, e.g., between the yield of
wheat and the amount of rainfall, or between the demand and supply of a
commodity. In case of multiple correlation, the relationship is studied among
three or more variables. For example, the relationship of yield of wheat may be
studied with both chemical fertilizers and the pesticides.
(iii) Partial and Total Correlation : There are two categories of multiple
correlation analysis. Under partial correlation, the relationship of two or more
variables is studied in such a way that only one dependent variable and one
independent variable is considered and all others are kept constant. For
example, coefficient of correlation between yield of wheat and chemical
fertilizers excluding the effects of pesticides and manures is called partial
correlation. Total correlation is based upon all the variables.
(iv) Linear and Non-Linear Correlation: When the amount of change
in one variable tends to keep a constant ratio to the amount of change in the
other variable, then the correlation is said to be linear. But if the amount of
change in one variable does not bear a constant ratio to the amount of change in
the other variable then the correlation is said to be non-linear. The distinction
between linear and non-linear is based upon the consistency of the ratio of
change between the variables.
Methods of Studying Correlation

There are different methods which help us to find out whether the
variables are related or not.
1. Scatter Diagram Method.
2. Graphic Method.
3. Karl Pearson’s Coefficient of correlation.
4. Rank Method.
Karl Pearson’s Co-efficient of Correlation.
Karl Pearson’s method, popularly known as Pearsonian co-efficient of
correlation, is most widely applied in practice to measure correlation. The
Pearsonian co-efficient of correlation is represented by the symbol r. Degree of
correlation varies between + 1 and –1; the result will be + 1 in case of perfect
positive correlation and – 1 in case of perfect negative correlation. Computation
of correlation coefficient can be simplified by dividing the given data by a
common factor. In such a case, the final result is not multiplied by the common
factor because coefficient of correlation is independent of change of scale and
origin.
$$r(X,Y) = \rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y}$$
where
$$\mathrm{Cov}(X,Y) = \frac{1}{n}\sum XY - \bar{X}\bar{Y}, \qquad \sigma_X^2 = \frac{1}{n}\sum X^2 - \bar{X}^2, \qquad \sigma_Y^2 = \frac{1}{n}\sum Y^2 - \bar{Y}^2$$
and $n$ is the number of items in the given data.

Standard Error
The standard error is the approximate standard deviation of a statistical
sample population. The standard error is a statistical term that measures the
accuracy with which a sample represents a population.

In statistics, a sample mean deviates from the actual mean of the population;
this deviation is measured by the standard error. For the correlation
coefficient,
$$S.E.(r) = \frac{1 - r^2}{\sqrt{n}}$$
Probable Error: $P.E.(r) = 0.675 \times S.E.(r)$
Range:
$$r - P.E.(r) \leq \text{Population } r \leq r + P.E.(r)$$
Note: Two variables are uncorrelated when Cov(X, Y) = 0; in particular, independent variables are uncorrelated.
Problems:

1. Find the correlation coefficient between annual advertising expenditure and
annual sales revenue for the following data:

Year (i)                             1    2    3    4    5    6    7    8    9   10
Annual advertising
  expenditure (X_i)                 10   12   14   16   18   20   22   24   26   28
Annual sales (Y_i)                  20   30   37   50   56   78   89  100  120  110

Solution: Now, $\bar{X} = \frac{\sum X_i}{n} = \frac{190}{10} = 19$ and $\bar{Y} = \frac{\sum Y_i}{n} = \frac{690}{10} = 69$.

 i    X_i   Y_i   X_i - X̄   Y_i - Ȳ   (X_i - X̄)²   (Y_i - Ȳ)²   (X_i - X̄)(Y_i - Ȳ)
 1     10    20      -9       -49          81          2401            441
 2     12    30      -7       -39          49          1521            273
 3     14    37      -5       -32          25          1024            160
 4     16    50      -3       -19           9           361             57
 5     18    56      -1       -13           1           169             13
 6     20    78       1         9           1            81              9
 7     22    89       3        20           9           400             60
 8     24   100       5        31          25           961            155
 9     26   120       7        51          49          2601            357
10     28   110       9        41          81          1681            369
Sum   190   690       0         0         330         11200           1894
The correlation coefficient is
$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2}\,\sqrt{\sum (Y_i - \bar{Y})^2}} = \frac{1894}{\sqrt{330}\,\sqrt{11200}} = 0.985$$

Hence, the correlation coefficient between annual advertising expenditure and
annual sales revenue is 0.985, indicating a very strong positive correlation.
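As a quick numerical check, the same calculation can be reproduced in a few
lines of Python. This is a minimal sketch (assuming numpy is available; it is
not part of the original material):

```python
import numpy as np

# Data from Problem 1
x = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])    # advertising expenditure
y = np.array([20, 30, 37, 50, 56, 78, 89, 100, 120, 110]) # annual sales

# Pearson's r in the deviation form used above
dx, dy = x - x.mean(), y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(round(r, 3))  # expected: 0.985, matching the hand computation
```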
2. Let X, Y and Z be uncorrelated random variables with zero means and standard
deviations 5, 12 and 9 respectively. If U = X + Y and V = Y + Z, find the
correlation coefficient between U and V.

Solution:
Given that all three random variables have zero mean, E(X) = E(Y) = E(Z) = 0.
Now, $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$
$\Rightarrow E(X^2) = \mathrm{Var}(X) = 5^2 = 25$  {since E(X) = 0}
Similarly, $E(Y^2) = 12^2 = 144$ and $E(Z^2) = 9^2 = 81$.

Since X and Y are uncorrelated we have Cov(X,Y) = 0


⇒ E(XY) = E(X).E(Y) = 0

Similarly, E(YZ) = 0 and E(ZX) = 0.

To find $\rho(U, V)$:
$$\rho(U,V) = \frac{E(UV) - E(U)\,E(V)}{\sigma_U\,\sigma_V}$$

E(U) = E [X + Y] = E[X] + E[Y] = 0


E(V) = E [Y + Z] = E[Y] + E[Z] = 0

$E(U^2) = E[(X+Y)^2] = E[X^2] + E[Y^2] + 2E[XY] = 25 + 144 + 0 = 169$
Similarly, $E(V^2) = E[Y^2] + E[Z^2] + 2E[YZ] = 144 + 81 + 0 = 225$

Now, $\mathrm{Var}(U) = E(U^2) - [E(U)]^2 = 169 \Rightarrow \sigma_U = \sqrt{169} = 13$
Similarly, $\mathrm{Var}(V) = E(V^2) - [E(V)]^2 = 225 \Rightarrow \sigma_V = \sqrt{225} = 15$
$E(UV) = E[(X+Y)(Y+Z)] = E(XY) + E(Y^2) + E(XZ) + E(YZ) = 0 + 144 + 0 + 0 = 144$
Therefore,
$$\rho(U,V) = \frac{E(UV) - E(U)E(V)}{\sigma_U\,\sigma_V} = \frac{144}{13 \times 15} = \frac{48}{65} \approx 0.738$$
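The result can also be checked by simulation. Here is a small sketch (assuming
numpy; the choice of normal distributions is an illustrative assumption, since
the problem fixes only the means and standard deviations):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Independent (hence uncorrelated) zero-mean variables with sd 5, 12, 9;
# normality is assumed only for the purpose of this simulation.
X = rng.normal(0, 5, n)
Y = rng.normal(0, 12, n)
Z = rng.normal(0, 9, n)

U, V = X + Y, Y + Z
print(round(np.corrcoef(U, V)[0, 1], 3))  # expected: close to 48/65 ≈ 0.738
```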

3. If the joint pdf of (X, Y) is given by $f(x, y) = x + y$, $0 \leq x, y \leq 1$, find $\rho_{XY}$.
Solution:
We know that
$$\rho(X,Y) = \frac{E(XY) - E(X)\,E(Y)}{\sigma_X\,\sigma_Y}$$
Now,
$$E(XY) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\,f(x,y)\,dx\,dy = \int_0^1\int_0^1 xy(x+y)\,dx\,dy = \int_0^1\left(\frac{y}{3} + \frac{y^2}{2}\right)dy = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$$
The marginal pdfs of X and Y are given by
$$f(x) = \int_0^1 (x+y)\,dy = x + \frac{1}{2}, \qquad f(y) = \int_0^1 (x+y)\,dx = y + \frac{1}{2}$$
$$E(X) = \int_0^1 x\left(x + \tfrac{1}{2}\right)dx = \frac{1}{3} + \frac{1}{4} = \frac{7}{12}, \qquad E(Y) = \frac{7}{12} \text{ (by symmetry)}$$
$$E(X^2) = \int_0^1 x^2\left(x + \tfrac{1}{2}\right)dx = \frac{1}{4} + \frac{1}{6} = \frac{5}{12}, \qquad E(Y^2) = \frac{5}{12}$$
$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \frac{5}{12} - \left(\frac{7}{12}\right)^2 = \frac{11}{144} \Rightarrow \sigma_X = \frac{\sqrt{11}}{12}$$
Similarly, $\mathrm{Var}(Y) = \frac{11}{144}$ and $\sigma_Y = \frac{\sqrt{11}}{12}$.
Therefore,
$$\rho(X,Y) = \frac{E(XY) - E(X)E(Y)}{\sigma_X\,\sigma_Y} = \frac{\frac{1}{3} - \frac{49}{144}}{\frac{11}{144}} = \frac{-\frac{1}{144}}{\frac{11}{144}} = -\frac{1}{11} \approx -0.091$$
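Readers who want to verify the integrals symbolically can do so with sympy; the
following is a minimal sketch (sympy is an assumption of this illustration):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x + y  # joint pdf on the unit square

def E(g):
    """Expectation of g(X, Y) with respect to f over [0,1] x [0,1]."""
    return sp.integrate(g * f, (x, 0, 1), (y, 0, 1))

cov = E(x * y) - E(x) * E(y)
rho = cov / sp.sqrt((E(x**2) - E(x)**2) * (E(y**2) - E(y)**2))
print(sp.simplify(rho))  # expected: -1/11
```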

4. The independent random variables X and Y have pdfs given by
$$f_X(x) = \begin{cases} 4ax, & 0 \leq x \leq 1 \\ 0, & \text{otherwise} \end{cases} \qquad f_Y(y) = \begin{cases} 4by, & 0 \leq y \leq 1 \\ 0, & \text{otherwise} \end{cases}$$
Find the correlation coefficient between X and Y.

Solution:
$$E(X) = \int_0^1 x \cdot 4ax\,dx = 4a\int_0^1 x^2\,dx = \frac{4a}{3}, \qquad E(Y) = \int_0^1 y \cdot 4by\,dy = \frac{4b}{3}$$
Since X and Y are independent, the joint pdf of X and Y is given by
$$f(x,y) = f_X(x)\,f_Y(y) = (4ax)(4by) = 16abxy, \quad 0 \leq x \leq 1,\ 0 \leq y \leq 1$$
Now,
$$E(XY) = \int_0^1\int_0^1 xy\,f(x,y)\,dx\,dy = \int_0^1\int_0^1 xy(16abxy)\,dx\,dy = \frac{16ab}{9}$$
Therefore we get
$$\mathrm{Cov}(X,Y) = E(XY) - E(X)E(Y) = \frac{16ab}{9} - \frac{4a}{3}\cdot\frac{4b}{3} = 0$$
which implies that $\rho(X,Y) = 0$. That is, since X and Y are independent, they
are uncorrelated: there is no linear relationship between them.

SPEARMAN'S RANK CORRELATION COEFFICIENT

Rank correlation coefficient is useful for finding the correlation between any
two qualitative characteristics, such as beauty, honesty, intelligence, etc.,
which cannot be measured quantitatively but in which individuals can be
arranged serially in order of merit or proficiency.
Suppose we associate ranks to individuals or items in two series based on order
of merit; the Spearman's rank correlation coefficient $\rho$ is given by
$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where
$\sum d^2$ = sum of squares of differences of ranks between paired items in the two series, and
$n$ = number of paired items.
Remarks

Spearman's rank correlation coefficient can also be used to find the correlation
between two quantitative characteristics or variables. In this case, we
associate ranks to the observations based on their magnitudes for the X and Y
series separately. Let $R_X$ and $R_Y$ be the ranks of a pair of observations on
the two variables X and Y respectively. Then the Spearman's rank correlation
coefficient is given by
$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where $\sum d^2 = \sum (R_X - R_Y)^2$ = sum of squares of differences between
the ranks of variables X and Y, and $n$ = number of pairs of observations.
SPEARMAN'S RANK CORRELATION COEFFICIENT FOR DATA WITH
TIED OBSERVATIONS
In any series, if two or more observations have the same value, they are said
to be tied observations. If a tie occurs for two or more observations in a
series, then common ranks have to be given to the tied observations in that
series; these common ranks are the average of the ranks which these
observations would have assumed if they were slightly different from each
other, and the next observation gets the rank next to the ranks already
assumed.
In the case of data with tied observations, the Spearman's rank correlation
coefficient is given by
$$\rho = 1 - \frac{6\left(\mathrm{Adj}\sum d^2\right)}{n(n^2 - 1)}$$
where
$$\mathrm{Adj}\sum d^2 = \sum d^2 + \left(\frac{S_1^3 - S_1}{12}\right) + \left(\frac{S_2^3 - S_2}{12}\right) + \left(\frac{S_3^3 - S_3}{12}\right) + \ldots$$
Here,
$S_1$ is the number of times the first tied observation is repeated,
$S_2$ is the number of times the second tied observation is repeated,
$S_3$ is the number of times the third tied observation is repeated, and so on.
Problem: In a quantitative aptitude test, two judges rank the ten competitors in the
following order.
Competitor            1   2   3   4   5   6   7   8   9   10
Ranking of judge I    4   5   2   7   8   1   6   9   3   10
Ranking of judge II   8   3   9  10   6   7   2   5   1    4

Is there any concordance between the two judges?

Solution: Let $R_X$ denote the ranking by Judge I and $R_Y$ the ranking by
Judge II. The Spearman's rank correlation coefficient is given by
$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where $\sum d^2 = \sum (R_X - R_Y)^2$ and $n$ = number of competitors.


R_X   R_Y   d = R_X - R_Y    d²
  4     8        -4          16
  5     3         2           4
  2     9        -7          49
  7    10        -3           9
  8     6         2           4
  1     7        -6          36
  6     2         4          16
  9     5         4          16
  3     1         2           4
 10     4         6          36
Total                       190

 6(190) 
  1  
10(100  1) 
= 1- 1.1515

= -0.1515
We say that there is a low degree of negative rank correlation between the rankings of the two judges.
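When there are no ties, scipy's built-in routine implements exactly this
formula, so the result can be cross-checked directly (a sketch assuming scipy
is installed):

```python
from scipy.stats import spearmanr

judge1 = [4, 5, 2, 7, 8, 1, 6, 9, 3, 10]   # rankings by Judge I
judge2 = [8, 3, 9, 10, 6, 7, 2, 5, 1, 4]   # rankings by Judge II

rho, p_value = spearmanr(judge1, judge2)
print(round(rho, 4))  # expected: -0.1515
```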

Problem: Twelve recruits were subjected to a selection test to ascertain their
suitability for a certain course of training. At the end of the training they
were given a proficiency test. The marks scored by the recruits are recorded
below:

Recruit                  1   2   3   4   5   6   7   8   9  10  11  12
Selection test score    44  49  52  54  47  76  65  60  63  58  50  67
Proficiency test score  48  55  45  60  43  80  58  50  77  46  47  65

Calculate the rank correlation coefficient and comment on your result.

Solution: Let the selection test score be a variable X and the proficiency test
score be a variable Y. We associate ranks to the scores based on their
magnitudes. The Spearman's rank correlation coefficient is given by
$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where $\sum d^2 = \sum (R_X - R_Y)^2$ = sum of squares of differences between
the ranks of observations on X and Y, and $n$ = number of recruits.
Given,
Given,
 X    Y    R_X   R_Y   d = R_X - R_Y    d²
44   48    12     8         4           16
49   55    10     6         4           16
52   45     8    11        -3            9
54   60     7     4         3            9
47   43    11    12        -1            1
76   80     1     1         0            0
65   58     3     5        -2            4
60   50     5     7        -2            4
63   77     4     2         2            4
58   46     6    10        -4           16
50   47     9     9         0            0
67   65     2     3        -1            1

From the table, we have $\sum d^2 = 80$ and $n = 12$. Thus,
$$\rho = 1 - \frac{6(80)}{12(144 - 1)} = 1 - 0.2797 = 0.7203$$
We say that there is a high degree of positive rank correlation between the
scores of the selection and proficiency tests.
Example:
Following is the data on heights and weights of ten students in a class:
Heights (in cm)   140  142  140  160  150  155  160  157  140  170
Weights (in kg)    43   45   42   50   45   52   57   48   49   53

Calculate the rank correlation coefficient between the heights and weights of the students.


Solution:
Let height be a variable X and weight be a variable Y. Since the data contain
tied observations, we associate average ranks to the tied observations. The
Spearman's rank correlation coefficient is given by
$$\rho = 1 - \frac{6\left(\mathrm{Adj}\sum d^2\right)}{n(n^2 - 1)}$$
where
$$\mathrm{Adj}\sum d^2 = \sum d^2 + \left(\frac{S_1^3 - S_1}{12}\right) + \left(\frac{S_2^3 - S_2}{12}\right) + \left(\frac{S_3^3 - S_3}{12}\right) + \ldots$$

and n = number of students.

  X     Y    R_X   R_Y   d = R_X - R_Y     d²
140    43    9     9          0            0
142    45    7     7.5       -0.5          0.25
140    42    9    10         -1            1
160    50    2.5   4         -1.5          2.25
150    45    6     7.5       -1.5          2.25
155    52    5     3          2            4
160    57    2.5   1          1.5          2.25
157    48    4     6         -2            4
140    49    9     5          4           16
170    53    1     2         -1            1
Total                                     33

From the table, we have $n = 10$, $\sum d^2 = 33$, $S_1 = 3$ (the height 140
occurs three times), $S_2 = 2$ (the height 160 occurs twice) and $S_3 = 2$ (the
weight 45 occurs twice). Thus,
$$\mathrm{Adj}\sum d^2 = 33 + \left(\frac{3^3 - 3}{12}\right) + \left(\frac{2^3 - 2}{12}\right) + \left(\frac{2^3 - 2}{12}\right) = 33 + 2 + 0.5 + 0.5 = 36$$
$$\rho = 1 - \frac{6(36)}{10(100 - 1)} = 1 - 0.2182 = 0.7818$$
We say that there is a high degree of positive rank correlation between the
heights and weights of the students.
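The tie-adjustment formula can be implemented directly. The sketch below
(assuming numpy and scipy for the average ranking; the function name is
illustrative) reproduces ρ = 0.7818 for the height-weight data:

```python
from collections import Counter

import numpy as np
from scipy.stats import rankdata

def spearman_tied(x, y):
    """Spearman's rho with the (S^3 - S)/12 tie adjustment used in the text."""
    rx, ry = rankdata(x), rankdata(y)   # average ranks for tied values
    n = len(x)
    d2 = float(np.sum((rx - ry) ** 2))
    # One correction term for every value repeated s >= 2 times, in either series
    adj = sum((s**3 - s) / 12
              for series in (x, y)
              for s in Counter(series).values() if s > 1)
    return 1 - 6 * (d2 + adj) / (n * (n**2 - 1))

heights = [140, 142, 140, 160, 150, 155, 160, 157, 140, 170]
weights = [43, 45, 42, 50, 45, 52, 57, 48, 49, 53]
print(round(spearman_tied(heights, weights), 4))  # expected: 0.7818
```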

Partial and Multiple Correlation


Let us consider the example of the yield of rice on a farm. It may be affected
by the type of soil, temperature, amount of rainfall, usage of fertilizers,
etc. It is useful to determine how the yield of rice is influenced by one
factor, or how it is affected by several factors together. This is done with
the help of partial and multiple correlation analysis.
The basic distinction between multiple and partial correlation analysis is that
in the former, the degree of relationship between the variable Y and all the
other variables $X_1, X_2, \ldots, X_n$ taken together is measured, whereas in
the latter, the degree of relationship between Y and one of the variables
$X_1, X_2, \ldots, X_n$ is measured after removing the effect of all the other
variables.

Partial correlation

The partial correlation coefficient provides a measure of the relationship
between the dependent variable and another variable, with the effect of the
rest of the variables eliminated. If there are three variables $X_1$, $X_2$ and
$X_3$, there will be three coefficients of partial correlation, each studying
the relationship between two variables when the third is held constant. If we
denote by $r_{12.3}$ the coefficient of partial correlation between $X_1$ and
$X_2$ keeping $X_3$ constant, the three coefficients are calculated as
$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}, \qquad r_{13.2} = \frac{r_{13} - r_{12}\,r_{23}}{\sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}}, \qquad r_{23.1} = \frac{r_{23} - r_{12}\,r_{13}}{\sqrt{(1 - r_{12}^2)(1 - r_{13}^2)}}$$
1. In a trivariate distribution, it is found that $r_{12} = 0.7$, $r_{13} = 0.61$ and $r_{23} = 0.4$. Find the partial correlation coefficients.
Solution:
$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} = \frac{0.7 - (0.61 \times 0.4)}{\sqrt{(1 - 0.61^2)(1 - 0.4^2)}} = 0.628$$
$$r_{13.2} = \frac{r_{13} - r_{12}\,r_{23}}{\sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}} = \frac{0.61 - (0.7 \times 0.4)}{\sqrt{(1 - 0.7^2)(1 - 0.4^2)}} = 0.504$$
$$r_{23.1} = \frac{r_{23} - r_{12}\,r_{13}}{\sqrt{(1 - r_{12}^2)(1 - r_{13}^2)}} = \frac{0.4 - (0.7 \times 0.61)}{\sqrt{(1 - 0.7^2)(1 - 0.61^2)}} = -0.048$$

Multiple Correlation

In multiple correlation, we try to estimate the value of one variable based on
the values of all the others. The variable whose value we are trying to
estimate is called the dependent variable, and the other variables on which our
estimates are based are known as independent variables.
The coefficients of multiple correlation with three variables $X_1$, $X_2$ and
$X_3$ are $R_{1.23}$, $R_{2.13}$ and $R_{3.12}$. Here $R_{1.23}$ is the
coefficient of multiple correlation with $X_1$ as the dependent variable and
$X_2$, $X_3$ as the two independent variables, and it can be expressed in terms
of $r_{12}$, $r_{23}$ and $r_{13}$ as
$$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{23}\,r_{13}}{1 - r_{23}^2}},$$
$$R_{2.13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2\,r_{12}\,r_{23}\,r_{13}}{1 - r_{13}^2}},$$
$$R_{3.12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2\,r_{12}\,r_{23}\,r_{13}}{1 - r_{12}^2}}$$
PROPERTIES OF MULTIPLE CORRELATION COEFFICIENT
The following are some of the properties of multiple correlation coefficients:
1. The multiple correlation coefficient is the degree of association between
the observed value of the dependent variable and its estimate obtained by
multiple regression.
2. The multiple correlation coefficient lies between 0 and 1.
3. If the multiple correlation coefficient is 1, then the association is
perfect and the multiple regression equation may be said to be a perfect
prediction formula.
4. If the multiple correlation coefficient is 0, the dependent variable is
uncorrelated with the independent variables. From this, it can be concluded
that the multiple regression equation fails to predict the value of the
dependent variable when the values of the independent variables are known.
5. The multiple correlation coefficient is always greater than or equal to any
total correlation coefficient: if $R_{1.23}$ is the multiple correlation
coefficient, then $R_{1.23} \geq r_{12}$, $r_{13}$ and $r_{23}$.
6. The multiple correlation coefficient obtained by the method of least squares
is always greater than the multiple correlation coefficient obtained by any
other method.
Example:

1. The following zero-order correlation coefficients are given:
$r_{12} = 0.98$, $r_{13} = 0.44$ and $r_{23} = 0.54$. Calculate the multiple
correlation coefficient treating the first variable as dependent and the second
and third variables as independent.

Solution:
$$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{23}\,r_{13}}{1 - r_{23}^2}} = \sqrt{\frac{(0.98)^2 + (0.44)^2 - 2(0.98)(0.54)(0.44)}{1 - (0.54)^2}} = 0.986$$

2. From the following data, obtain $R_{1.23}$, $R_{2.13}$ and $R_{3.12}$:

X1    2    5    7   11
X2    3    6   10   12
X3    1    3    6   10

Solution:
We need $r_{12}$, $r_{13}$ and $r_{23}$, which are obtained from the following table:

S. No   X1   X2   X3   X1²   X2²   X3²   X1X2   X1X3   X2X3
  1      2    3    1     4     9     1      6      2      3
  2      5    6    3    25    36     9     30     15     18
  3      7   10    6    49   100    36     70     42     60
  4     11   12   10   121   144   100    132    110    120
Total   25   31   20   199   289   146    238    169    201

Now we compute the total correlation coefficients $r_{12}$, $r_{13}$ and $r_{23}$:
$$r_{12} = \frac{N\sum X_1 X_2 - (\sum X_1)(\sum X_2)}{\sqrt{\left[N\sum X_1^2 - (\sum X_1)^2\right]\left[N\sum X_2^2 - (\sum X_2)^2\right]}} = 0.97$$
$$r_{13} = \frac{N\sum X_1 X_3 - (\sum X_1)(\sum X_3)}{\sqrt{\left[N\sum X_1^2 - (\sum X_1)^2\right]\left[N\sum X_3^2 - (\sum X_3)^2\right]}} = 0.99$$
$$r_{23} = \frac{N\sum X_2 X_3 - (\sum X_2)(\sum X_3)}{\sqrt{\left[N\sum X_2^2 - (\sum X_2)^2\right]\left[N\sum X_3^2 - (\sum X_3)^2\right]}} = 0.97$$
Now, using $r_{12} = 0.97$, $r_{13} = 0.99$ and $r_{23} = 0.97$, we calculate
the multiple correlation coefficients:
$$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{23}\,r_{13}}{1 - r_{23}^2}} = 0.99$$
$$R_{2.13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2\,r_{12}\,r_{23}\,r_{13}}{1 - r_{13}^2}} = 0.97$$
$$R_{3.12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2\,r_{12}\,r_{23}\,r_{13}}{1 - r_{12}^2}} = 0.99$$
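All of these values can be reproduced programmatically from the raw data. A
minimal numpy sketch (the helper name is illustrative):

```python
import numpy as np

x1 = np.array([2, 5, 7, 11])
x2 = np.array([3, 6, 10, 12])
x3 = np.array([1, 3, 6, 10])

# Total (zero-order) correlation coefficients
r12 = np.corrcoef(x1, x2)[0, 1]
r13 = np.corrcoef(x1, x3)[0, 1]
r23 = np.corrcoef(x2, x3)[0, 1]
print(np.round([r12, r13, r23], 2))  # expected: [0.97 0.99 0.97]

def multiple_R(r_ab, r_ac, r_bc):
    """Multiple correlation of a on b and c, from the three total correlations."""
    return np.sqrt((r_ab**2 + r_ac**2 - 2 * r_ab * r_ac * r_bc) / (1 - r_bc**2))

print(round(multiple_R(r12, r13, r23), 2))  # R1.23, expected: 0.99
print(round(multiple_R(r12, r23, r13), 2))  # R2.13, expected: 0.97
print(round(multiple_R(r13, r23, r12), 2))  # R3.12, expected: 0.99
```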
Remarks:

1. The lines of regression of Y on X and of X on Y pass through the mean values
of X and Y. In other words, the mean values of X and Y can be obtained as the
point of intersection of the two regression lines.

2. In case of perfect correlation ($r = \pm 1$), both the lines of regression
coincide. Therefore, in general, we always have two lines of regression, except
in the particular case of perfect correlation, when both the lines coincide and
we get only one line.

3. The sign of the correlation coefficient is the same as that of the
regression coefficients, since the sign of each depends upon the covariance
term. Thus, if the regression coefficients are positive, r is positive, and if
the regression coefficients are negative, r is negative.

4. If one of the regression coefficients is greater than unity, the other must
be less than unity.

5. If the two variables are uncorrelated, the lines of regression are
perpendicular to each other.

Problems:

Solution:
When X = 30, Y = (- 0.6643) (30) + 59.2576
Y = 39.3286

Solution:
3. Estimate the regression line from the given information:

4. The two regression lines are given as x + 2y - 5 = 0 and 2x + 3y - 8 = 0.
Which one is the regression line of x on y?
5. The two lines of regression are x + 2y - 5 = 0 and 2x + 3y - 8 = 0, and the
variance of x is 12. Find the variance of y and the coefficient of correlation.

Advanced types of linear regression ( not in Syllabus)

Linear models are the oldest type of regression; they were designed so that
statisticians could do the calculations by hand. However, OLS (Ordinary Least
Squares) has several weaknesses, including sensitivity to both outliers and
multicollinearity, and it is prone to overfitting. To address these problems,
statisticians have developed several advanced variants:

 Lasso regression (least absolute shrinkage and selection operator) performs
variable selection that aims to increase prediction accuracy by identifying a
simpler model. It is similar to ridge regression, but with variable selection.

 Ridge regression allows you to analyse data even when severe multicollinearity
is present and helps prevent overfitting. This type of model reduces the large,
problematic variance that multicollinearity causes by introducing a slight bias
in the estimates. The procedure trades away much of the variance in exchange
for a little bias, which produces more useful coefficient estimates when
multicollinearity is present.

 Partial least squares (PLS) regression is useful when you have very few
observations compared to the number of independent variables, or when your
independent variables are highly correlated. PLS reduces the independent
variables to a smaller number of uncorrelated components, similar to Principal
Components Analysis, and then performs linear regression on these components
rather than on the original data. PLS emphasizes developing predictive models
and is not used for screening variables. Unlike OLS, you can include multiple
continuous dependent variables. PLS uses the correlation structure to identify
smaller effects and to model multivariate patterns in the dependent variables.
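For illustration only (these variants are outside the syllabus), here is a
minimal scikit-learn sketch contrasting OLS, ridge and lasso on the same
collinear data; the dataset and the alpha values are arbitrary choices made for
this example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(1)
n = 100

# Two nearly collinear predictors, to show how ridge and lasso react
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))

# Typical behaviour: OLS splits the effect erratically between x1 and x2,
# ridge shrinks both coefficients toward similar moderate values, and
# lasso tends to zero out one of the redundant predictors.
```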

Practice Problem:

Multiple Regression
If the number of independent variables in a regression model is more than one,
then the model is called a multiple regression model. In fact, many real-world
applications demand the use of multiple regression models.

Assumptions of multiple linear regression

Homogeneity of variance (homoscedasticity): the size of the error in our
prediction does not change significantly across the values of the independent
variables.

Independence of observations: the observations in the dataset were collected
using statistically valid methods, and there are no hidden relationships among
variables.
In multiple linear regression, it is possible that some of the independent
variables are actually correlated with one another, so it is important to check
this before developing the regression model. If two independent variables are
too highly correlated (r² > ~0.6), then only one of them should be used in the
regression model.

Normality: The data follows a normal distribution.

Linearity: the line of best fit through the data points is a straight line, rather
than a curve or some sort of grouping factor.

Multiple linear Regression formula

Y  b0  b1 X1  b2 X 2  b3 X 3  b4 X 4
 y = the predicted value of the dependent variable
 b0 = the y-intercept (value of y when all other parameters are set to 0)
 b1X1= the regression coefficient (B1) of the first independent variable
(X1) (a.k.a. the effect that increasing the value of the independent variable
has on the predicted y value)
 … = do the same for however many independent variables you are testing
 bnXn = the regression coefficient of the last independent variable

Application:
Consider a model where Y represents the economic growth rate of a country,
$X_1$ represents the time period, $X_2$ the size of the population of the
country, $X_3$ the level of employment in percentage, and $X_4$ the percentage
of literacy; $b_0$ is the intercept and $b_1, b_2, b_3, b_4$ are the slopes of
the variables $X_1, X_2, X_3, X_4$ respectively. In this regression model,
$X_1, X_2, X_3$ and $X_4$ are the independent variables and Y is the dependent
variable.

Regression model with two independent variables using normal equations:

If the regression equation with two independent variables is
$$Y = b_0 + b_1 X_1 + b_2 X_2,$$
then the normal equations are
$$\sum Y = n\,b_0 + b_1 \sum X_1 + b_2 \sum X_2$$
$$\sum X_1 Y = b_0 \sum X_1 + b_1 \sum X_1^2 + b_2 \sum X_1 X_2$$
$$\sum X_2 Y = b_0 \sum X_2 + b_1 \sum X_1 X_2 + b_2 \sum X_2^2$$

Problems:
1. The annual sales revenue (in crores of rupees) of a product as a function of
sales force (number of salesmen) and annual advertising expenditure (in lakhs
of rupees) for the past 10 years is summarized in the following table.

Let the regression model be $Y = b_0 + b_1 X_1 + b_2 X_2$. The required sums
are tabulated below.

  Y    X1   X2    X1²    X2²   X1X2   YX1    YX2
 20     8   28     64    784    224    160    560
 23    13   23    169    529    299    299    529
 25     8   38     64   1444    304    200    950
 27    18   16    324    256    288    486    432
 21    23   20    529    400    460    483    420
 29    16   28    256    784    448    464    812
 22    10   23    100    529    230    220    506
 24    12   30    144    900    360    288    720
 27    14   26    196    676    364    378    702
 35    20   32    400   1024    640    700   1120
253   142  264   2246   7326   3617   3678   6751   (Totals)

Substituting these totals into the normal equations gives
$$253 = 10\,b_0 + 142\,b_1 + 264\,b_2$$
$$3678 = 142\,b_0 + 2246\,b_1 + 3617\,b_2$$
$$6751 = 264\,b_0 + 3617\,b_1 + 7326\,b_2$$
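The three normal equations can be solved numerically. A minimal numpy sketch
(the rounded coefficients quoted in the comments are my own computation, not
taken from the original material):

```python
import numpy as np

# Coefficient matrix and right-hand side of the normal equations above
A = np.array([[ 10,  142,  264],
              [142, 2246, 3617],
              [264, 3617, 7326]], dtype=float)
rhs = np.array([253, 3678, 6751], dtype=float)

b0, b1, b2 = np.linalg.solve(A, rhs)
print(np.round([b0, b1, b2], 2))
# roughly b0 ≈ 5.15, b1 ≈ 0.62, b2 ≈ 0.43,
# i.e. the fitted model is Y ≈ 5.15 + 0.62 X1 + 0.43 X2
```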

If the means, standard deviations and correlations of a trivariate distribution
are known, then the multiple regression of $X_1$ on $X_2$ and $X_3$ can be
written in the standard form
$$X_1 - \bar{X}_1 = b_{12.3}\,(X_2 - \bar{X}_2) + b_{13.2}\,(X_3 - \bar{X}_3),$$
where
$$b_{12.3} = \frac{\sigma_1}{\sigma_2}\cdot\frac{r_{12} - r_{13}\,r_{23}}{1 - r_{23}^2}, \qquad b_{13.2} = \frac{\sigma_1}{\sigma_3}\cdot\frac{r_{13} - r_{12}\,r_{23}}{1 - r_{23}^2}$$
Example:

Solution:

Practice Problems:
1.
Solution: r = 0.9360.
