Lecture Notes #4 Correlation
MatMod
College of Science & Information Technology
RRGuerrero
Mathematics Department
First Semester, Session 1, AY 2020-2021
A. CORRELATION
Correlation
Correlation measures the association, or the strength of the relationship, between two
variables, say 𝑥 and 𝑦.
Before the relationship between these two sets of data is found out through computation, it is
essential to first discuss the three types of correlation.
When one variable increases while the other decreases, the variables are inversely correlated or
negatively correlated.
But when one variable increases and the other increases as well, or one variable decreases and
so does the other, then the two variables are directly correlated or positively correlated.
Example 3a.1: Suppose a ten-item English test and a ten-item Mathematics test were administered to ten
students. The scores of the students are tabulated below. It must be determined whether the scores in the
Mathematics test (labelled as variable 𝑥) and the English test (labelled as variable 𝑦) are correlated or
not.
Table 3a.1. Math & Eng scores of 10 students
Student   Mathematics score (x)   English score (y)
1         4                       5
2         5                       4
3         9                       8
4         2                       3
5         8                       9
6         1                       2
7         2                       1
8         7                       6
9         6                       7
10        4                       5
The scatter graph for Table 3a.1 is given by Figure 3a.1. In the graph, note that the x-axis
represents the scores in Mathematics and the y-axis shows the scores in English.
Each point in the graph is an ordered pair (𝑥, 𝑦) corresponding to the scores obtained by a
student in the two subjects.
Figure 3a.1
In Example 3a.1, the corresponding scatter graph (Figure 3a.1) shows an increasing trend, which
indicates a direct correlation between variables 𝑥 and 𝑦.
Example 3a.2: Suppose the scores of the students in those two subjects happen to be as follows with the
corresponding scatter graph:
Table 3a.2
Figure 3a.2
This time the trend of the data is decreasing, hence, the variables are negatively correlated.
Example 3a.3: Suppose the same students have the following scores:
Table 3a.3
Student   Math score (x)   English score (y)
1         2                2
2         3                6
3         4                7
4         2                9
5         5                5
6         6                3
7         6                7
8         8                4
9         9                8
10        3                7
Figure 3a.3
It can be noticed that the corresponding scatter graph, shown in Figure 3a.3, is neither increasing
nor decreasing. This graph represents a ZERO correlation.
DEFINITIONS:
3a.1.1 Two variables are positively correlated if the values of the two variables both increase or both
decrease.
3a.1.2 Two variables are negatively correlated if the values of one variable increase while the values of
the other decrease.
3a.1.3 Two variables are not correlated or they have zero correlation if one variable neither increases nor
decreases while the other increases.
While a scatter plot may be a convenient way of inspecting correlation between two variables,
it does not offer a measure of the strength of a correlation.
Karl Pearson, an English mathematician and biostatistician, developed a formula that gives a
numerical value to the measure of a correlation.
This formula not only shows how strongly two data sets are correlated but also reveals whether
the correlation is direct or inverse, or whether the data sets are not correlated.
The formula named after him is called the Pearson product-moment correlation.
The degree of correlation between two data sets 𝑥 and 𝑦 is represented by the Pearson
product-moment correlation coefficient 𝑟𝑥𝑦, which can take values from −1 to 1, where 1 represents a
perfect positive linear relationship and −1 represents a perfect negative linear relationship.
If the coefficient 𝑟𝑥𝑦 = 0, then there is NO LINEAR RELATIONSHIP between the two variables.
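The Pearson product-moment correlation formula is given by:
$$ r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}} $$
where $\bar{x}$ and $\bar{y}$ are the means of the 𝑥 and 𝑦 values and 𝑛 is the number of paired data.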
Below are some scatter diagrams along with the type of linear correlation that exists
between the 𝑥 and 𝑦 variables.
The closer the absolute value of 𝑟𝑥𝑦 is to 1, the stronger the linear relationship between the
variables.
Figure 3a.4
Consider the data in Example 3a.1. Let us organize the data as shown in the table that follows.
You may also tabulate using Excel.
Math score (x)   English score (y)   (𝑥𝑖 − 𝑥̅)   (𝑦𝑖 − 𝑦̅)   (𝑥𝑖 − 𝑥̅)²   (𝑦𝑖 − 𝑦̅)²   (𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅)
4                5                   −0.8        0          0.64         0            0
5                4                   0.2         −1         0.04         1            −0.2
9                8                   4.2         3          17.64        9            12.6
2                3                   −2.8        −2         7.84         4            5.6
8                9                   3.2         4          10.24        16           12.8
1                2                   −3.8        −3         14.44        9            11.4
2                1                   −2.8        −4         7.84         16           11.2
7                6                   2.2         1          4.84         1            2.2
6                7                   1.2         2          1.44         4            2.4
4                5                   −0.8        0          0.64         0            0
Sums (𝑖 = 1 to 10): ∑𝑥𝑖 = 48, ∑𝑦𝑖 = 50, ∑(𝑥𝑖 − 𝑥̅)² = 65.6, ∑(𝑦𝑖 − 𝑦̅)² = 60.0, ∑(𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅) = 58
$$ r_{xy} = \frac{58}{\sqrt{(65.6)(60)}} = \frac{58}{62.73755} = 0.92 $$
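The tabulation above can also be checked with a short script. The following is only a minimal sketch in Python, not part of the original notes; the function name pearson_r and the variable names are illustrative assumptions. It follows the same steps as the table: deviations from the means, their squares and products, and then the ratio.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation from deviations about the means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    dx = [x - x_bar for x in xs]           # (x_i - x_bar) column
    dy = [y - y_bar for y in ys]           # (y_i - y_bar) column
    s_xy = sum(a * b for a, b in zip(dx, dy))
    s_xx = sum(a * a for a in dx)
    s_yy = sum(b * b for b in dy)
    # Note: undefined (division by zero) if either variable is constant.
    return s_xy / sqrt(s_xx * s_yy)

# Scores from Example 3a.1
math_scores = [4, 5, 9, 2, 8, 1, 2, 7, 6, 4]
english_scores = [5, 4, 8, 3, 9, 2, 1, 6, 7, 5]

r = pearson_r(math_scores, english_scores)
print(round(r, 2))  # 0.92
```

The same function can be applied to the data of the other examples by replacing the two lists.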
Math score (x)   English score (y)   (𝑥𝑖 − 𝑥̅)   (𝑦𝑖 − 𝑦̅)   (𝑥𝑖 − 𝑥̅)²   (𝑦𝑖 − 𝑦̅)²   (𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅)
9                3                   4.1         −2.5       16.81        6.25         −10.25
3                6                   −1.9        0.5        3.61         0.25         −0.95
4                7                   −0.9        1.5        0.81         2.25         −1.35
7                4                   2.1         −1.5       4.41         2.25         −3.15
6                2                   1.1         −3.5       1.21         12.25        −3.85
1                9                   −3.9        3.5        15.21        12.25        −13.65
2                8                   −2.9        2.5        8.41         6.25         −7.25
5                4                   0.1         −1.5       0.01         2.25         −0.15
10               2                   5.1         −3.5       26.01        12.25        −17.85
2                10                  −2.9        4.5        8.41         20.25        −13.05
Sums (𝑖 = 1 to 10): ∑𝑥𝑖 = 49, ∑𝑦𝑖 = 55, ∑(𝑥𝑖 − 𝑥̅)² = 84.9, ∑(𝑦𝑖 − 𝑦̅)² = 76.5, ∑(𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅) = −71.5
For the data in Example 3a.2,
$$ r_{xy} = \frac{-71.5}{\sqrt{(84.9)(76.5)}} = \frac{-71.5}{80.59} = -0.89 $$
The correlation coefficient is close to −1; hence, the variables have a strong negative
correlation.
Thus, the corresponding scatter graph in Example 3a.2 is decreasing from left to right.
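Math score (x)   English score (y)   (𝑥𝑖 − 𝑥̅)   (𝑦𝑖 − 𝑦̅)   (𝑥𝑖 − 𝑥̅)²   (𝑦𝑖 − 𝑦̅)²   (𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅)
2                2                   −3          −4         9            16           12
5                5                   0           −1         0            1            0
3                6                   −2          0          4            0            0
6                3                   1           −3         1            9            −3
8                4                   3           −2         9            4            −6
2                9                   −3          3          9            9            −9
9                8                   4           2          16           4            8
3                7                   −2          1          4            1            −2
6                7                   1           1          1            1            1
4                7                   −1          1          1            1            −1
Sums (𝑖 = 1 to 10): ∑𝑥𝑖 = 48, ∑𝑦𝑖 = 58, ∑(𝑥𝑖 − 𝑥̅)² = 54, ∑(𝑦𝑖 − 𝑦̅)² = 46, ∑(𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅) = 0.0
$$ r_{xy} = \frac{0}{\sqrt{(54)(46)}} = 0 $$
Note that the deviations in this table are taken from the rounded means 𝑥̅ ≈ 5 and 𝑦̅ ≈ 6. With the
exact means 𝑥̅ = 4.8 and 𝑦̅ = 5.8, the three sums become 53.6, 45.6, and −0.4, which gives
𝑟𝑥𝑦 ≈ −0.008, still practically zero.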
This conforms with the scatter graph, that is, the graph is neither increasing nor decreasing,
and therefore the two sets of data are not correlated.
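The deviations in the table for Example 3a.3 are measured from 5 and 6, while the column totals (48 and 58) give means of 4.8 and 5.8. The sketch below, which is an illustration and not part of the original notes, compares the two choices. The helper name r_about and its parameters cx and cy are assumptions introduced only for this comparison.

```python
from math import sqrt

def r_about(xs, ys, cx, cy):
    """Correlation-style ratio using deviations about chosen centers cx and cy.

    This equals the Pearson coefficient only when cx and cy are the exact means.
    """
    dx = [x - cx for x in xs]
    dy = [y - cy for y in ys]
    s_xy = sum(a * b for a, b in zip(dx, dy))
    s_xx = sum(a * a for a in dx)
    s_yy = sum(b * b for b in dy)
    return s_xy / sqrt(s_xx * s_yy)

# Scores from Example 3a.3, in the order used in the computation table
xs = [2, 5, 3, 6, 8, 2, 9, 3, 6, 4]
ys = [2, 5, 6, 3, 4, 9, 8, 7, 7, 7]

r_rounded = r_about(xs, ys, 5, 6)                            # centers used in the table
r_exact = r_about(xs, ys, sum(xs) / len(xs), sum(ys) / len(ys))  # exact means 4.8 and 5.8

print(r_rounded)          # 0.0
print(round(r_exact, 3))  # about -0.008
```

Either way the coefficient is practically zero, so the conclusion of no correlation is unchanged.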
Definition
3a.1.4 The Spearman’s rank-order correlation is the nonparametric version of the Pearson
product-moment correlation.
Spearman’s correlation coefficient, denoted as 𝜌, also written as 𝑟𝑠 , measures the
STRENGTH and DIRECTION of ASSOCIATION between two ranked variables.
To compute the Spearman’s rank correlation coefficient, we use the following formula:
$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$
Where 𝑑 = difference of paired ranks
𝑛 = number of paired data
Example 3a.4. Given the scores in Mathematics and English below, rank the scores, and use the
Spearman’s rho to compute for the correlation coefficient.
$$ \sum_{i=1}^{n} d_i^2 = 92 $$
Where $\sum_{i=1}^{n} d_i^2 = 92$ and 𝑛 = 10, so we have:
$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6(92)}{10(10^2 - 1)} = 1 - 0.557576 = 0.44 $$
The correlation coefficient is 0.44, which indicates a low positive correlation.
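The ranking step and the formula can be sketched in Python as follows. This is a minimal illustration, not part of the original notes; the function names are assumptions, and the five-item data in the usage lines are hypothetical. Tied scores are given the average of the ranks they occupy, which is a common convention that the notes do not specify, and the formula is exact only when there are no ties.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho_from_d2(sum_d2, n):
    """Spearman's rho from the sum of squared rank differences and n pairs."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

def spearman_rho(xs, ys):
    """Rank both variables, then apply the rank-difference formula."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return spearman_rho_from_d2(sum_d2, len(xs))

# Check of the arithmetic in Example 3a.4: sum of d^2 = 92, n = 10
print(round(spearman_rho_from_d2(92, 10), 2))  # 0.44

# Hypothetical five-pair data set, for illustration only
print(round(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]), 2))  # 0.8
```

Ranking from smallest to largest or from largest to smallest gives the same result, as long as both variables are ranked in the same direction.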
The (Phi) 𝜙 COEFFICIENT
For a pair of nominal dichotomous set of data, the phi coefficient is more appropriate to describe
the data set than the Pearson product-moment correlation or Spearman’s rank correlation
coefficient.
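Its formula is given by:
$$ \phi = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}} $$
where 𝑎, 𝑏, 𝑐, and 𝑑 are the cell frequencies of the 2 × 2 table, with 𝑎 and 𝑏 in the first row and
𝑐 and 𝑑 in the second row.
With 𝑎 = 6, 𝑏 = 14, 𝑐 = 10, and 𝑑 = 13:
$$ \phi = \frac{6 \cdot 13 - 14 \cdot 10}{\sqrt{(6 + 14)(10 + 13)(6 + 10)(14 + 13)}} = \frac{-62}{\sqrt{198{,}720}} = \frac{-62}{445.78} = -0.139 $$
The result shows that the association involving the opinion on capital punishment, whether for or
against, is weak and on the negative side.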
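The phi computation can be checked with a few lines of Python. This is a sketch only, not part of the original notes; the function name phi_coefficient is an assumption. The cell counts a = 6, b = 14, c = 10, d = 13 are the ones consistent with the numerator −62 and the product 198,720 in the computation above, and the square root of that product is taken as the formula requires.

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi coefficient for a 2x2 table with rows (a, b) and (c, d)."""
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("a row or column total is zero; phi is undefined")
    return (a * d - b * c) / denom

phi = phi_coefficient(6, 14, 10, 13)
print(round(phi, 3))  # -0.139
```

A value this close to zero points to a weak association between the two dichotomous variables.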
The point-biserial correlation coefficient measures the strength of association
between a continuous-level variable (ratio or interval data) and a binary variable.
Binary variables are variables of nominal scale having only two possible values. They are also called
dichotomous variables. Given two variable sets, in which 𝑥 is the continuous variable and 𝑦 the
dichotomous variable, the formula of the POINT-BISERIAL CORRELATION COEFFICIENT is:
$$ \rho_{xy} = \frac{\bar{x}_1 - \bar{x}_2}{s_x} \sqrt{\frac{n_1 n_2}{n(n - 1)}} $$
Where $\bar{x}_1$ is the mean of 𝑥 when 𝑦 = 1, or of those labelled with 1
$\bar{x}_2$ is the mean of 𝑥 when 𝑦 = 2, or of those labelled with 2
$n_1$ is the number of samples labelled 1 in 𝑦
$n_2$ is the number of samples labelled 2 in 𝑦
$s_x$ is the sample standard deviation of 𝑥
$n = n_1 + n_2$ is the total number of samples
Remark: The point-biserial correlation coefficient measures the relationship between a real
dichotomous and an interval set of data.
Solution:
In this example, the point-biserial correlation will be used because the data involve
continuous interval data (the test results) and nominal dichotomous data (gender).
Let 𝑥 represent the interval data and 𝑦 stand for the dichotomous data.
The formula to be used is the point-biserial correlation coefficient, where $s_x$ is the sample
standard deviation of 𝑥:
$$ s_x = \sqrt{\frac{\sum_{i=1}^{9}(x_i - \bar{x})^2}{n - 1}} = 4.245913 $$
$$ \rho_{xy} = \frac{11.25 - 13.6}{4.245913} \sqrt{\frac{(4)(5)}{9(9 - 1)}} = (-0.553474)(0.527046) = -0.2917 $$
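The point-biserial formula can be evaluated from the summary values shown in the solution (group means 11.25 and 13.6, standard deviation 4.245913, group sizes 4 and 5). The sketch below is an illustration and not part of the original notes. The function names are assumptions, and the four-item data set in the last lines is hypothetical. Evaluating the expression with the displayed values gives about −0.29.

```python
from math import sqrt
from statistics import mean, stdev

def point_biserial_from_summary(mean1, mean2, s_x, n1, n2):
    """Point-biserial coefficient from group means, sample SD of x, and group sizes."""
    n = n1 + n2
    return (mean1 - mean2) / s_x * sqrt(n1 * n2 / (n * (n - 1)))

def point_biserial(xs, labels):
    """Same coefficient from raw data; labels are assumed to be coded 1 or 2."""
    group1 = [x for x, g in zip(xs, labels) if g == 1]
    group2 = [x for x, g in zip(xs, labels) if g == 2]
    if not group1 or not group2:
        raise ValueError("both groups must be non-empty")
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    return point_biserial_from_summary(
        mean(group1), mean(group2), stdev(xs), len(group1), len(group2)
    )

# Summary values taken from the solution above
rho = point_biserial_from_summary(11.25, 13.6, 4.245913, 4, 5)
print(round(rho, 4))  # -0.2917

# Hypothetical raw data, for illustration only
rho_demo = point_biserial([10, 12, 14, 16], [1, 1, 2, 2])
print(round(rho_demo, 4))
```

The sign depends only on which group is labelled 1; swapping the labels flips the sign but not the magnitude.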
Different formulas can be used to compute the correlation between two data sets.
o If data 𝑥 and 𝑦 are both interval data, the Pearson product-moment correlation will
suffice.
o For two data sets that are both ordinal, the Spearman’s rank correlation coefficient is
suited.
o For a pair of nominal dichotomous set of data, the phi coefficient is more appropriate to
describe the data set than the Pearson product-moment correlation or Spearman’s rank
correlation coefficient.
o But, if one real nominal dichotomous and one interval data are involved, the point-
biserial formula should be utilized.
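The selection rules listed above can be written as a small lookup. This is only a sketch of the list, not part of the original notes; the scale names "interval", "ordinal", and "dichotomous" and the function name choose_coefficient are assumptions made for the illustration.

```python
# Mapping from the (sorted) pair of measurement scales to the suggested coefficient.
RULES = {
    ("interval", "interval"): "Pearson product-moment correlation",
    ("ordinal", "ordinal"): "Spearman's rank correlation",
    ("dichotomous", "dichotomous"): "phi coefficient",
    ("dichotomous", "interval"): "point-biserial correlation",
}

def choose_coefficient(scale_x, scale_y):
    """Return the coefficient suggested by the summary list for two data scales."""
    key = tuple(sorted((scale_x.lower(), scale_y.lower())))
    if key not in RULES:
        raise ValueError(f"no rule listed for scales {key}")
    return RULES[key]

print(choose_coefficient("interval", "dichotomous"))  # point-biserial correlation
```

Combinations not covered by the list, such as one ordinal and one interval variable, raise an error here rather than guess at a rule the notes do not state.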
Reference: Baltazar, EC et al (2018). Mathematics in the Modern World. C & E Publishing Inc, pp.67-81.