MAT2001-SE Course Materials - Module 3 PDF
DEPARTMENT OF MATHEMATICS
FALL SEMESTER – 2020~2021
COURSE MATERIAL
Module 3
Correlation and Regression
Syllabus:
Correlation and Regression – Rank Correlation – Partial and Multiple
Correlation – Multiple Regression.
************************************
Dr. D. Easwaramoorthy
Dr. A. Manimaran
Course In-charges – MAT2001-SE,
Fall Semester 2020~2021,
Department of Mathematics,
SAS, VIT, Vellore.
************************************
Module-3
Correlation and Regression
In this module, we study the relationship between variables. Sometimes the interest lies in establishing the actual relationship between two or more variables; this problem is dealt with by regression analysis. On the other hand, we are often not interested in the actual relationship but only in the degree of relationship between two or more variables; this problem is dealt with by correlation analysis.
Correlation
Types of Correlation
Correlation can be categorized as one of the following:
(i) Positive and Negative,
(ii) Simple and Multiple,
(iii) Partial and Total,
(iv) Linear and Non-Linear (Curvilinear).
(i) Positive and Negative Correlation : Positive or direct correlation refers to the movement of variables in the same direction. The correlation is said to be positive when an increase (decrease) in the value of one variable is accompanied by an increase (decrease) in the value of the other variable. Negative or inverse correlation refers to the movement of the variables in opposite directions. Correlation is said to be negative if an increase (decrease) in the value of one variable is accompanied by a decrease (increase) in the value of the other.
(ii) Simple and Multiple Correlation : Under simple correlation, we study the relationship between two variables only, e.g., between the yield of wheat and the amount of rainfall, or between demand and supply of a commodity. In the case of multiple correlation, the relationship is studied among three or more variables. For example, the yield of wheat may be studied together with both chemical fertilizers and pesticides.
(iii) Partial and Total Correlation : There are two categories of multiple correlation analysis. Under partial correlation, the relationship between one dependent variable and one independent variable is studied while all other variables are kept constant. For example, the coefficient of correlation between the yield of wheat and chemical fertilizers, excluding the effects of pesticides and manures, is a partial correlation. Total correlation is based upon all the variables.
(iv) Linear and Non-Linear Correlation: When the amount of change
in one variable tends to keep a constant ratio to the amount of change in the
other variable, then the correlation is said to be linear. But if the amount of
change in one variable does not bear a constant ratio to the amount of change in
the other variable then the correlation is said to be non-linear. The distinction
between linear and non-linear is based upon the consistency of the ratio of
change between the variables.
Methods of Studying Correlation
There are different methods which help us to find out whether the variables are related or not:
1. Scatter Diagram Method.
2. Graphic Method.
3. Karl Pearson’s Coefficient of correlation.
4. Rank Method.
Karl Pearson’s Co-efficient of Correlation.
Karl Pearson’s method, popularly known as the Pearsonian co-efficient of correlation, is the most widely applied in practice to measure correlation. The Pearsonian co-efficient of correlation is represented by the symbol r. The degree of correlation varies between +1 and −1; the result will be +1 in the case of perfect positive correlation and −1 in the case of perfect negative correlation. Computation of the correlation coefficient can be simplified by dividing the given data by a common factor. In such a case, the final result need not be multiplied by the common factor, because the coefficient of correlation is independent of change of scale and origin.
r(X, Y) = ρ(X, Y) = Cov(X, Y) / (σX · σY)
where
Cov(X, Y) = (1/n) ΣXY − X̄ Ȳ,
σX = √[(1/n) ΣX² − X̄²], σY = √[(1/n) ΣY² − Ȳ²],
n = number of items in the given data.
Standard Error
The standard error is the approximate standard deviation of a statistical
sample population. The standard error is a statistical term that measures the
accuracy with which a sample represents a population.
Problem: Calculate Karl Pearson's coefficient of correlation between annual advertising expenditure and annual sales from the following data:

Year (i):                               1    2    3    4    5    6    7    8    9    10
Annual advertising expenditure (X):     10   12   14   16   18   20   22   24   26   28
Annual sales (Y):                       20   30   37   50   56   78   89   100  120  110
Solution: Here n = 10, ΣX = 190 and ΣY = 690, so
X̄ = ΣX/n = 190/10 = 19, Ȳ = ΣY/n = 690/10 = 69.

i    X    Y    X−X̄   Y−Ȳ   (X−X̄)²   (Y−Ȳ)²   (X−X̄)(Y−Ȳ)
1    10   20   −9     −49    81        2401      441
2    12   30   −7     −39    49        1521      273
3    14   37   −5     −32    25        1024      160
4    16   50   −3     −19    9         361       57
5    18   56   −1     −13    1         169       13
6    20   78   1      9      1         81        9
7    22   89   3      20     9         400       60
8    24   100  5      31     25        961       155
9    26   120  7      51     49        2601      357
10   28   110  9      41     81        1681      369
Tot  190  690  0      0      330       11200     1894

The correlation coefficient is
r = Σ(X−X̄)(Y−Ȳ) / √[Σ(X−X̄)² · Σ(Y−Ȳ)²] = 1894 / √(330 × 11200) = 1894 / 1922.5 = 0.985
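The arithmetic above can be cross-checked with a short Python sketch (the function name `pearson_r` is our own, not from the text):

```python
# Minimal Pearson correlation in the deviation form r = Sxy / sqrt(Sxx * Syy),
# matching the worked advertising/sales example.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # 1894 here
    sxx = sum((xi - mx) ** 2 for xi in x)                     # 330 here
    syy = sum((yi - my) ** 2 for yi in y)                     # 11200 here
    return sxy / sqrt(sxx * syy)

X = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
Y = [20, 30, 37, 50, 56, 78, 89, 100, 120, 110]
print(round(pearson_r(X, Y), 3))  # 0.985
```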
Solution:
Given that all three random variables have zero mean, E(X) = E(Y) = E(Z) = 0.
Now, Var(X) = E(X²) − [E(X)]²
⇒ E(X²) = Var(X) = 5² = 25 (since E(X) = 0).
Similarly, E(Y²) = 12² = 144 and E(Z²) = 9² = 81.
Therefore,
ρ(U, V) = Cov(U, V) / (σU · σV) = [E(UV) − E(U)E(V)] / (σU · σV).
Problem: The joint pdf of (X, Y) is f(x, y) = x + y, 0 < x < 1, 0 < y < 1. Find the correlation coefficient between X and Y.
Solution:
E(XY) = ∫₀¹ ∫₀¹ xy (x + y) dx dy = ∫₀¹ (y/3 + y²/2) dy = 1/6 + 1/6 = 1/3
The marginal pdfs of X and Y are given by
f(x) = ∫₀¹ f(x, y) dy = ∫₀¹ (x + y) dy = [xy + y²/2]₀¹ = x + 1/2, 0 < x < 1,
and, by symmetry, f(y) = y + 1/2, 0 < y < 1.
E(X) = ∫₀¹ x f(x) dx = ∫₀¹ x (x + 1/2) dx = 1/3 + 1/4 = 7/12
E(Y) = ∫₀¹ y f(y) dy = ∫₀¹ y (y + 1/2) dy = 1/3 + 1/4 = 7/12
E(X²) = ∫₀¹ x² f(x) dx = ∫₀¹ x² (x + 1/2) dx = 1/4 + 1/6 = 5/12
E(Y²) = ∫₀¹ y² f(y) dy = ∫₀¹ y² (y + 1/2) dy = 1/4 + 1/6 = 5/12
Var(X) = E(X²) − [E(X)]² = 5/12 − 49/144 = 11/144 ⇒ σX = √11 / 12
Var(Y) = E(Y²) − [E(Y)]² = 5/12 − 49/144 = 11/144 ⇒ σY = √11 / 12
Therefore,
ρ(X, Y) = [E(XY) − E(X)E(Y)] / (σX · σY) = (1/3 − 49/144) / (11/144) = (−1/144) / (11/144) = −1/11 ≈ −0.0909
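For the joint pdf f(x, y) = x + y on the unit square, the value ρ(X, Y) = −1/11 can be verified numerically with a midpoint-rule double integral (a sketch in pure Python; the helper `integrate2d` is our own):

```python
# Numerical cross-check of rho(X, Y) for f(x, y) = x + y on (0,1)x(0,1),
# using a midpoint-rule approximation of each expectation.
from math import sqrt

def integrate2d(g, n=200):
    # Midpoint rule on an n x n grid over the unit square.
    h = 1.0 / n
    return sum(g((i + 0.5) * h, (j + 0.5) * h)
               for i in range(n) for j in range(n)) * h * h

f = lambda x, y: x + y
EX  = integrate2d(lambda x, y: x * f(x, y))          # exact value 7/12
EY  = integrate2d(lambda x, y: y * f(x, y))          # exact value 7/12
EXY = integrate2d(lambda x, y: x * y * f(x, y))      # exact value 1/3
EX2 = integrate2d(lambda x, y: x * x * f(x, y))      # exact value 5/12
EY2 = integrate2d(lambda x, y: y * y * f(x, y))      # exact value 5/12

rho = (EXY - EX * EY) / sqrt((EX2 - EX ** 2) * (EY2 - EY ** 2))
print(rho)  # approximately -1/11 = -0.0909...
```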
Solution:
E(X) = ∫ x f(x) dx = ∫ x (4ax) dx = 4a ∫ x² dx = 4a [x³/3]
Rank Correlation
The rank correlation coefficient is useful for finding the correlation between two qualitative characteristics, such as beauty, honesty and intelligence, which cannot be measured quantitatively but can be arranged serially in order of merit or proficiency.
Suppose we associate ranks to individuals or items in two series based on order of merit; then Spearman's rank correlation coefficient ρ is given by
ρ = 1 − [6 Σd²] / [n(n² − 1)]

Where,
Σd² = Σ(Rx − Ry)² = sum of squares of differences between the ranks of paired items in the two series,
n = number of paired items.
SPEARMAN'S RANK CORRELATION COEFFICIENT FOR DATA WITH TIED OBSERVATIONS
In any series, if two or more observations have the same value, then they are said to be tied observations. If a tie occurs for two or more observations in a series, then common ranks have to be given to the tied observations in that series; these common ranks are the average of the ranks these observations would have assumed had they been slightly different from each other, and the next observation gets the rank next to the ranks already assumed.
In the case of data with tied observations, the Spearman's rank correlation
coefficient is given by
ρ = 1 − [6 (Adj Σd²)] / [n(n² − 1)]

Where,
Adj Σd² = Σd² + (S₁³ − S₁)/12 + (S₂³ − S₂)/12 + (S₃³ − S₃)/12 + ...
Here,
S₁ is the number of times the first tied observation is repeated,
S₂ is the number of times the second tied observation is repeated,
S₃ is the number of times the third tied observation is repeated, and so on.
Problem: In a quantitative aptitude test, two judges rank the ten competitors in the following order. Find Spearman's rank correlation coefficient between the two rankings.
Competitor:           1   2   3   4   5   6   7   8   9   10
Ranking of judge I:   4   5   2   7   8   1   6   9   3   10
Ranking of judge II:  8   3   9   10  6   7   2   5   1   4
Solution: Here n = 10 and Σd² = Σ(Rx − Ry)² = 190. Hence
ρ = 1 − 6(190) / [10(10² − 1)] = 1 − 1140/990 = 1 − 1.1515 = −0.1515
We say that there is a low degree of negative rank correlation between the rankings of the two judges.
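The judges example can be checked with a short Python sketch of the untied-ranks formula (the function name `spearman_rho` is our own):

```python
# Spearman's rank correlation for untied ranks:
# rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
def spearman_rho(rank_x, rank_y):
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Rankings of the ten competitors by the two judges.
judge1 = [4, 5, 2, 7, 8, 1, 6, 9, 3, 10]
judge2 = [8, 3, 9, 10, 6, 7, 2, 5, 1, 4]
print(round(spearman_rho(judge1, judge2), 4))  # -0.1515
```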
Problem : Twelve recruits were subjected to a selection test to ascertain their suitability for a certain course of training. At the end of the training they were given a proficiency test. The marks scored by the recruits are recorded below. Find the rank correlation coefficient between the two sets of scores.
Recruit:                 1   2   3   4   5   6   7   8   9   10  11  12
Selection test score:    44  49  52  54  47  76  65  60  63  58  50  67
Proficiency test score:  48  55  45  60  43  80  58  50  77  46  47  65
Solution: Let the selection test score be the variable X and the proficiency test score be the variable Y. We associate ranks to the scores based on their magnitudes. Spearman's rank correlation coefficient is given by

ρ = 1 − [6 Σd²] / [n(n² − 1)]

Where, Σd² = Σ(Rx − Ry)² = sum of squares of differences between the ranks of observations X and Y, and n = number of recruits.
Given,
X    Y    Rx   Ry   d = Rx − Ry   d²
44   48   12   8    4             16
49   55   10   6    4             16
52   45   8    11   −3            9
54   60   7    4    3             9
47   43   11   12   −1            1
76   80   1    1    0             0
65   58   3    5    −2            4
60   50   5    7    −2            4
63   77   4    2    2             4
58   46   6    10   −4            16
50   47   9    9    0             0
67   65   2    3    −1            1
Total                             80
ρ = 1 − 6(80) / [12(12² − 1)] = 1 − 480/1716 = 1 − 0.2797 = 0.7203
We say that there is a high degree of positive rank correlation between the scores of the selection and proficiency tests.
Example:
Following is the data on heights and weights of ten students in a class. Find the rank correlation coefficient between heights and weights.

Heights (in cm):  140  142  140  160  150  155  160  157  140  170
Weights (in kg):  43   45   42   50   45   52   57   48   49   53
Solution: Since there are tied observations, Spearman's rank correlation coefficient is

ρ = 1 − [6 (Adj Σd²)] / [n(n² − 1)]

Where,
Adj Σd² = Σd² + (S₁³ − S₁)/12 + (S₂³ − S₂)/12 + (S₃³ − S₃)/12 + ...
n = number of students.
X (height)  Y (weight)  Rx    Ry    d = Rx − Ry   d²
140         43          9     9     0             0
142         45          7     7.5   −0.5          0.25
140         42          9     10    −1            1
160         50          2.5   4     −1.5          2.25
150         45          6     7.5   −1.5          2.25
155         52          5     3     2             4
160         57          2.5   1     1.5           2.25
157         48          4     6     −2            4
140         49          9     5     4             16
170         53          1     2     −1            1
Total                                             33
Adj Σd² = 33 + (3³ − 3)/12 + (2³ − 2)/12 + (2³ − 2)/12
= 33 + 2 + 0.5 + 0.5
= 36
ρ = 1 − 6(36) / [10(10² − 1)]
= 1 − 0.2182
= 0.7818
We say that there is a high degree of positive rank correlation between the heights and weights of the students.
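The tied-ranks procedure above (average ranks for ties plus the correction term Σ(S³ − S)/12) can be sketched in Python; the helper names `average_ranks` and `spearman_tied` are our own:

```python
# Spearman's rho with tied observations: assign average ranks to ties and
# add the tie correction sum((s^3 - s)/12) to sum(d^2).
from collections import Counter

def average_ranks(values):
    # Rank 1 = largest value; tied values share the average of their ranks.
    order = sorted(values, reverse=True)
    positions = {}
    for pos, v in enumerate(order, start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman_tied(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adj = sum((s ** 3 - s) / 12
              for vals in (x, y)
              for s in Counter(vals).values() if s > 1)
    return 1 - 6 * (d2 + adj) / (n * (n * n - 1))

heights = [140, 142, 140, 160, 150, 155, 160, 157, 140, 170]
weights = [43, 45, 42, 50, 45, 52, 57, 48, 49, 53]
print(round(spearman_tied(heights, weights), 4))  # 0.7818
```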
Partial correlation
Multiple Correlation
R2.13 = √[(r12² + r23² − 2 r12 r23 r13) / (1 − r13²)],
R3.12 = √[(r13² + r23² − 2 r12 r23 r13) / (1 − r12²)]
PROPERTIES OF MULTIPLE CORRELATION COEFFICIENT
The following are some of the properties of multiple correlation coefficients:
1. The multiple correlation coefficient is the degree of association between the observed value of the dependent variable and its estimate obtained by multiple regression.
2. The multiple correlation coefficient lies between 0 and 1.
3. If the multiple correlation coefficient is 1, then the association is perfect and the multiple regression equation may be said to be a perfect prediction formula.
4. If the multiple correlation coefficient is 0, the dependent variable is uncorrelated with the independent variables. From this, it can be concluded that the multiple regression equation fails to predict the value of the dependent variable when the values of the independent variables are known.
5. The multiple correlation coefficient is always greater than or equal to any total correlation coefficient: if R1.23 is the multiple correlation coefficient, then R1.23 ≥ r12, r13 and r23.
6. The multiple correlation coefficient obtained by the method of least squares is always greater than the multiple correlation coefficient obtained by any other method.
Example: From the following data, calculate the multiple correlation coefficients:

X1:  2   5   7   11
X2:  3   6   10  12
X3:  1   3   6   10
Solution:
We need r12, r13 and r23 which are obtained from the following table:
S. No   X1   X2   X3   X1²   X2²   X3²   X1X2   X1X3   X2X3
1       2    3    1    4     9     1     6      2      3
2       5    6    3    25    36    9     30     15     18
3       7    10   6    49    100   36    70     42     60
4       11   12   10   121   144   100   132    110    120
Total   25   31   20   199   289   146   238    169    201
Now we get the total correlation coefficient r12 , r13 and r23
r12 = [N ΣX1X2 − (ΣX1)(ΣX2)] / √{[N ΣX1² − (ΣX1)²] [N ΣX2² − (ΣX2)²]}
= [4(238) − (25)(31)] / √{[4(199) − (25)²] [4(289) − (31)²]}
= 177 / √(171 × 195)
r12 = 0.97
r13 = [N ΣX1X3 − (ΣX1)(ΣX3)] / √{[N ΣX1² − (ΣX1)²] [N ΣX3² − (ΣX3)²]}
= [4(169) − (25)(20)] / √{[4(199) − (25)²] [4(146) − (20)²]}
= 176 / √(171 × 184)
r13 = 0.99
r23 = [N ΣX2X3 − (ΣX2)(ΣX3)] / √{[N ΣX2² − (ΣX2)²] [N ΣX3² − (ΣX3)²]}
= [4(201) − (31)(20)] / √{[4(289) − (31)²] [4(146) − (20)²]}
= 184 / √(195 × 184)
r23 = 0.97
Now we calculate the multiple correlation coefficients.
We have r12 = 0.97, r13 = 0.99 and r23 = 0.97.

R2.13 = √[(r12² + r23² − 2 r12 r23 r13) / (1 − r13²)]
R2.13 = 0.97
R3.12 = √[(r13² + r23² − 2 r12 r23 r13) / (1 − r12²)]
R3.12 = 0.99
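The two multiple correlation coefficients in this example can be recomputed from the total correlations with a small Python sketch (the function name `multiple_R` is our own):

```python
# Multiple correlation coefficient from the three total correlations:
# R_a.bc = sqrt((r_ab^2 + r_ac^2 - 2*r_ab*r_ac*r_bc) / (1 - r_bc^2)).
from math import sqrt

def multiple_R(r_ab, r_ac, r_bc):
    return sqrt((r_ab ** 2 + r_ac ** 2 - 2 * r_ab * r_ac * r_bc)
                / (1 - r_bc ** 2))

r12, r13, r23 = 0.97, 0.99, 0.97
R2_13 = multiple_R(r12, r23, r13)  # X2 regressed on X1 and X3
R3_12 = multiple_R(r13, r23, r12)  # X3 regressed on X1 and X2
print(round(R2_13, 2), round(R3_12, 2))  # 0.97 0.99
```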
Remarks:
4. If one of the regression coefficients is greater than unity, the other must be less than unity.
Problems:
Solution:
When X = 30, Y = (−0.6643)(30) + 59.2576 = 39.3286
Solution:
3. Estimate the regression line from the given information:
4. The two regression lines are given as x+2y-5=0 and 2x+3y-8=0. Which
one is the regression line of x on y?
5. The two lines of regression are x + 2y – 5 = 0 and 2x + 3y – 8 = 0, and the variance of X is 12. Find the variance of Y and the coefficient of correlation.
Linear models are the oldest type of regression. They were designed so that statisticians could do the calculations by hand. However, OLS (ordinary least squares) has several weaknesses, including sensitivity to both outliers and multicollinearity, and it is prone to overfitting. To address these problems, statisticians have developed several advanced variants:
Partial least squares (PLS) regression is useful when you have very few observations compared to the number of independent variables, or when your independent variables are highly correlated. PLS reduces the independent variables to a smaller number of uncorrelated components, similar to principal components analysis, and then performs linear regression on these components rather than on the original data. PLS emphasizes developing predictive models and is not used for screening variables. Unlike OLS, you can include multiple continuous dependent variables. PLS uses the correlation structure to identify smaller effects and to model multivariate patterns in the dependent variables.
Practice Problem:
Multiple Regression
If the number of independent variables in a regression model is more than one, then the model is called a multiple regression model. In fact, many real-world applications demand the use of multiple regression models.
Linearity: the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.
Y = b0 + b1X1 + b2X2 + b3X3 + b4X4
y = the predicted value of the dependent variable
b0 = the y-intercept (value of y when all other parameters are set to 0)
b1X1= the regression coefficient (B1) of the first independent variable
(X1) (a.k.a. the effect that increasing the value of the independent variable
has on the predicted y value)
… = do the same for however many independent variables you are testing
bnXn = the regression coefficient of the last independent variable
Application:
where Y represents the economic growth rate of a country, X1 represents the time period, X2 represents the size of the population of the country, X3 represents the level of employment in percentage, X4 represents the percentage of literacy, b0 is the intercept, and b1, b2, b3 and b4 are the slopes of the variables X1, X2, X3 and X4 respectively. In this regression model, X1, X2, X3 and X4 are the independent variables and Y is the dependent variable.
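A multiple regression of the kind above can be fitted by least squares through the normal equations. The following pure-Python sketch handles the two-predictor case Y = b0 + b1 X1 + b2 X2; the function name and the illustrative data are our own, not from the text:

```python
# Least-squares fit of Y = b0 + b1*X1 + b2*X2 by solving the 3x3 normal
# equations with Gauss-Jordan elimination (no external libraries).
def fit_two_predictors(x1, x2, y):
    n = len(y)
    sx12 = sum(a * b for a, b in zip(x1, x2))
    A = [[n,       sum(x1),                  sum(x2)],
         [sum(x1), sum(a * a for a in x1),   sx12],
         [sum(x2), sx12,                     sum(b * b for b in x2)]]
    c = [sum(y),
         sum(a * v for a, v in zip(x1, y)),
         sum(b * v for b, v in zip(x2, y))]
    # Gauss-Jordan: normalize each pivot row, eliminate it from the others.
    for i in range(3):
        p = A[i][i]
        A[i] = [v / p for v in A[i]]
        c[i] /= p
        for j in range(3):
            if j != i:
                f = A[j][i]
                A[j] = [a - f * b for a, b in zip(A[j], A[i])]
                c[j] -= f * c[i]
    return c  # [b0, b1, b2]

# Illustrative data generated from y = 1 + 2*x1 + 3*x2, so the fit recovers it.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
print([round(b, 6) for b in fit_two_predictors(x1, x2, y)])  # [1.0, 2.0, 3.0]
```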
Problems:
1. The annual sales revenue (in crores of rupees) of a product
as a function of sales force (number of salesmen) and annual
advertising expenditure (in lakhs of rupees) for the past 10
years are summarized in the following table.
Solution:
Practice Problems :
1.
Solution: r = 0.9360.