Lesson 7 - Linear Correlation and Simple Linear Regression
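$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$$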
The Sample Pearson Correlation Coefficient can be interpreted in the following manner:
1. The value of r ranges from -1 to +1. If r = +1 or r = -1, there is a perfect linear relationship and all
points lie on a straight line.
2. An r close to +1 indicates a strong positive linear relationship between the two variables X and Y,
that is, as the value of X increases, the value of Y also tends to increase.
3. An r close to -1 indicates a strong negative linear relationship between the two variables, that is, as the
value of X increases, the value of Y tends to decrease.
4. An r near 0 means that there is little or no linear relationship between the two variables.
This does not mean that they are not associated at all, because the relationship
may be nonlinear.
A scatter diagram is a graphical presentation of the independent variable (plotted on the horizontal
axis) and the dependent variable (plotted on the vertical axis). It is the easiest
way to see whether a relationship exists between the two variables.
The figures below are scatter diagrams showing the different types of linear relationships.
Note: The correlation coefficient stays close to ±1 when the points cluster closely around a
straight line (Figure 1 and Figure 2).
Figure 3. No Linear Relationship          Figure 4. No Linear Relationship
Note:
In Figure 3, the coefficient r becomes smaller as the points cluster less closely around
a line, and it becomes virtually zero when the points are scattered at random.
Figure 4 shows a neat curvilinear relationship between the variables, and it can be verified that its
linear correlation coefficient is low or near 0.
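The remark about Figure 4 can be checked numerically. The sketch below is only an illustration: the data are hypothetical (they are not the points plotted in the figures), and `pearson_r` is a helper name introduced here for the computational formula of r.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical data, not taken from the figures.
x_values = [-3, -2, -1, 0, 1, 2, 3]
y_line = [2 * v + 1 for v in x_values]   # points exactly on a straight line
y_curve = [v ** 2 for v in x_values]     # points exactly on a parabola

r_line = pearson_r(x_values, y_line)
r_curve = pearson_r(x_values, y_curve)

print(f"straight line: r = {r_line:.3f}")
print(f"parabola:      r = {r_curve:.3f}")
```

In the second data set Y is completely determined by X, yet r is essentially zero, because r measures only the linear part of a relationship.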
The Sample Coefficient of Determination, r², is the proportion of the total variation in the
values of the variable Y that can be accounted for, or explained, by the linear relationship with the values of the
variable X. It is usually expressed as a percentage.
For example, if the correlation coefficient r is 0.60, then r² = (0.60)² = 0.36 = 36%. This
means that 36% of the total variation of Y can be explained by its linear relationship with X.
Example. A person’s muscle mass is expected to decrease with age. To explore this relationship, a
researcher randomly selected 10 persons aged 40 to 79 years and measured their muscle mass (unit).
The results are as follows:
X (age) 71 64 43 67 56 73 68 56 76 65
Y (muscle mass) 82 91 100 68 87 73 78 80 65 84
Based on the given data, do the following:
a. Plot the scatter diagram of the given data.
b. Find the sample coefficient of determination, r², and interpret the result.
c. Test 𝑯𝟎 : 𝝆 = 𝟎 at the 0.01 level of significance.
Solution:
a. The scatter diagram of the given data.
[Scatter diagram: Muscle Mass (vertical axis, 60 to 110) plotted against Age of a Person (horizontal axis, 40 to 80)]
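b. To find r², first compute the sample correlation coefficient r. With n = 10, the required sums are

$$\sum_{i=1}^{10} x_i = x_1 + x_2 + \cdots + x_{10} = 71 + 64 + \cdots + 65 = 639;$$

$$\sum_{i=1}^{10} y_i = y_1 + y_2 + \cdots + y_{10} = 82 + 91 + \cdots + 84 = 808;$$

$$\sum_{i=1}^{10} x_i^2 = 41701; \qquad \sum_{i=1}^{10} y_i^2 = 66292; \qquad \sum_{i=1}^{10} x_i y_i = 50887.$$

Then

$$r = \frac{10(50887) - (639)(808)}{\sqrt{\left[10(41701) - 639^2\right]\left[10(66292) - 808^2\right]}} = -0.796,$$

indicating a negative linear relationship between X (age of the person) and Y (muscle mass). Hence

$$r^2 = (-0.796)^2 \approx 0.63 = 63\%,$$

which means that 63% of the total variation of the muscle mass is explained or accounted for
by the age of the person.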
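The arithmetic for r and r² can be reproduced with a short script. This is a minimal sketch that uses only the standard library and the ten data pairs from the example; the variable names are chosen here for illustration.

```python
import math

# Data from the example: X = age, Y = muscle mass.
age = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65]
muscle_mass = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84]

n = len(age)
sum_x = sum(age)
sum_y = sum(muscle_mass)
sum_xy = sum(x * y for x, y in zip(age, muscle_mass))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in muscle_mass)

# Computational formula for the sample Pearson correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
r_squared = r ** 2

print(f"sum x = {sum_x}, sum y = {sum_y}, sum xy = {sum_xy}")
print(f"sum x^2 = {sum_x2}, sum y^2 = {sum_y2}")
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

The printed sums should agree with the values substituted into the formula for r above.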
c. Following the steps in hypothesis testing, we have:
Steps:
1. Hypotheses:
𝐻0 : There is no linear relationship between the variables age and muscle mass, that is, 𝜌 = 0.
𝐻𝑎 : There is a linear relationship between the variables age and muscle mass, that is, 𝜌 ≠ 0.
2. Significance Level: 𝛼 = 0.01
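3. Test Statistic: The appropriate test statistic is

$$t_c = \frac{r - \rho_0}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

4. Critical Regions: Reject 𝐻0 if 𝑡𝑐 < −𝑡(𝛼/2, 𝑛−2) or 𝑡𝑐 > 𝑡(𝛼/2, 𝑛−2), where 𝑡(0.005, 8) = 3.355.
5. Computation: Using the formula, the actual value of the test statistic is:

$$t_c = \frac{-0.796 - 0}{\sqrt{\dfrac{1 - (-0.796)^2}{10 - 2}}} = -3.72$$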
6. Statistical Decision: Since 𝒕𝒄 = −𝟑.𝟕𝟐 is less than −𝟑.𝟑𝟓𝟓 (meaning, it is in the critical
region), the null hypothesis 𝑯𝟎 is rejected.
7. Conclusion: The data provide evidence that there is a linear relationship between age and
muscle mass at 𝜶 = 𝟎.𝟎𝟏.
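Steps 3 to 6 of the test can be sketched in code as follows. The function name `correlation_t_statistic` is introduced here for illustration, the sketch covers only the case 𝐻0: 𝜌 = 0, and the critical value 3.355 is copied from the decision step above rather than computed from the t distribution.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r = -0.796              # sample correlation coefficient from the example
n = 10                  # sample size
critical_value = 3.355  # t(0.005, 8), taken from the text (two-tailed, alpha = 0.01)

t_c = correlation_t_statistic(r, n)

# Two-tailed rule: reject H0 when t_c falls in either tail.
reject_h0 = t_c < -critical_value or t_c > critical_value

print(f"t_c = {t_c:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Hard-coding the critical value keeps the sketch free of third-party packages; a statistics library with a t distribution could supply it instead.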
Lesson 2. SIMPLE LINEAR REGRESSION
Regression analysis is a statistical method which makes use of the relationship between two or
more quantitative variables so that one variable, called the dependent variable or response variable, can
be predicted with the knowledge of the values of the other variable, called the independent variable or
explanatory variable.
A mathematical equation that allows us to predict values of one dependent variable from known
values of one or more independent variables is called a regression equation.
𝑌 = 𝑎 + 𝑏𝑋
Regression analysis deals with finding estimates of the constants a and b so that, once these
estimates are found, a value 𝑌̂ can be predicted from a known value of X through the
regression equation

$$\hat{Y} = \hat{a} + \hat{b}X$$

where 𝑌̂ is the predicted value of the dependent variable;
𝑋 is the independent variable;
𝑎̂ is the least squares estimate of the parameter 𝑎; and
𝑏̂ is the least squares estimate of the parameter 𝑏.
Estimation of Parameters
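Given the sample {(𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1,2,3, … , 𝑛}, the least squares estimates of the parameters in the
regression line are:

$$\hat{b} = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \qquad \text{and} \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x},$$

where 𝑏 is the regression coefficient or the slope of the regression line and 𝑎 is the constant of regression
or the y-intercept of the regression line.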
Moreover,

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad \text{and} \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

are the means of the sample values of 𝑌 and 𝑋, respectively.
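The table below provides the formulas needed to perform the test for the regression coefficient.

Null hypothesis:         𝐻0: 𝑏 = 𝑏0  (when 𝑏0 = 0: the variables X and Y are not linearly related.)
Alternative hypothesis:  𝐻𝑎: 𝑏 ≠ 𝑏0  (when 𝑏0 = 0: the variables X and Y are linearly related.)
Test statistic:

$$t_c = \frac{(\hat{b} - b_0)\, s_x \sqrt{n - 1}}{s_e}$$

Critical region:         𝑡𝑐 < −𝑡(𝛼/2, 𝑛−2) or 𝑡𝑐 > 𝑡(𝛼/2, 𝑛−2)

where

$$s_x^2 = \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}, \qquad s_y^2 = \frac{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}{n(n-1)}, \qquad s_e^2 = \frac{n-1}{n-2}\left(s_y^2 - \hat{b}^2 s_x^2\right)$$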
Example. A person’s muscle mass is expected to decrease with age. To explore this relationship, a
researcher randomly selected 10 persons aged 40 to 79 years and measured their muscle mass (unit).
The results are as follows:
X (age) 71 64 43 67 56 73 68 56 76 65
Y (muscle mass) 82 91 100 68 87 73 78 80 65 84
Solution:
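a. To solve for the estimates 𝑏̂ and 𝑎̂, we have the following given values and computations:
𝑛 = 10;
𝑥1 = 71, 𝑥2 = 64, 𝑥3 = 43, 𝑥4 = 67, 𝑥5 = 56, 𝑥6 = 73, 𝑥7 = 68, 𝑥8 = 56, 𝑥9 = 76, 𝑥10 = 65;
𝑦1 = 82, 𝑦2 = 91, 𝑦3 = 100, 𝑦4 = 68, 𝑦5 = 87, 𝑦6 = 73, 𝑦7 = 78, 𝑦8 = 80, 𝑦9 = 65, 𝑦10 = 84;

$$\sum_{i=1}^{10} x_i = x_1 + x_2 + \cdots + x_{10} = 71 + 64 + \cdots + 65 = 639;$$

$$\sum_{i=1}^{10} y_i = y_1 + y_2 + \cdots + y_{10} = 82 + 91 + \cdots + 84 = 808;$$

$$\sum_{i=1}^{10} x_i y_i = 50887; \qquad \sum_{i=1}^{10} x_i^2 = 41701;$$

$$\hat{b} = \frac{10(50887) - (639)(808)}{10(41701) - 639^2} = \frac{-7442}{8689} = -0.8565$$

$$\hat{a} = \bar{y} - \hat{b}\,\bar{x} = 80.8 - (-0.8565)(63.9) = 135.5304$$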
Therefore, the estimated regression line is

$$\hat{Y} = \hat{a} + \hat{b}X = 135.5304 + (-0.8565)X = 135.5304 - 0.8565X.$$

The negative slope indicates that as a person gets older, muscle mass decreases.
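The fitted line can be reproduced and used for prediction with a few lines of code. This is a sketch under stated assumptions: `least_squares_line` is a helper name introduced here, and the age of 60 is a hypothetical input chosen inside the observed age range (43 to 76). Because the script carries the slope at full precision, its intercept differs from 135.5304 in the third decimal place; the text rounds the slope to −0.8565 before computing the intercept.

```python
# Data from the example: X = age, Y = muscle mass.
age = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65]
muscle_mass = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84]


def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line y = a + b * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n  # a = y_bar - b * x_bar
    return intercept, slope


a_hat, b_hat = least_squares_line(age, muscle_mass)

# Hypothetical query: predicted muscle mass for a 60-year-old.
predicted_at_60 = a_hat + b_hat * 60

print(f"a_hat = {a_hat:.4f}, b_hat = {b_hat:.4f}")
print(f"predicted muscle mass at age 60: {predicted_at_60:.2f}")
```

Predictions are best restricted to ages within the sampled range, since the data say nothing about how the relationship behaves outside it.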
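$$s_e^2 = \frac{n-1}{n-2}\left(s_y^2 - \hat{b}^2 s_x^2\right) = \frac{10-1}{10-2}\left[111.73 - (-0.8565)^2(96.54)\right] = 46.02$$

After solving for 𝑠𝑥 and 𝑠𝑒 (𝑠𝑥 = 9.83 and 𝑠𝑒 = 6.78), the actual value of the test statistic is:

$$t_c = \frac{(\hat{b} - b_0)\, s_x \sqrt{n - 1}}{s_e} = \frac{(-0.8565 - 0)(9.83)\sqrt{9}}{6.78} = -3.725$$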
6. Statistical Decision: Since 𝒕𝒄 = −𝟑.𝟕𝟐𝟓 is less than −𝟐.𝟑𝟎𝟔 (meaning, it is in the critical
region), the null hypothesis 𝑯𝟎 is rejected.
7. Conclusion: The data provide evidence that there is a linear relationship between age and
muscle mass at 𝜶 = 𝟎.𝟎𝟓.
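The test on the slope can be sketched end to end as below. The sketch assumes the usual t statistic for a regression slope, written in terms of the quantities used in this lesson as t = (b̂ − b0)·s_x·√(n − 1) / s_e with b0 = 0; the critical value 2.306 is copied from the decision step above rather than computed. Since no intermediate value is rounded, the script gives about −3.72 where the text reports −3.725.

```python
import math

# Data from the example: X = age, Y = muscle mass.
age = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65]
muscle_mass = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84]

n = len(age)
sum_x, sum_y = sum(age), sum(muscle_mass)
sum_xy = sum(x * y for x, y in zip(age, muscle_mass))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in muscle_mass)

# Least squares slope.
b_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Sample variances of X and Y, and the residual variance s_e^2.
s_x2 = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
s_y2 = (n * sum_y2 - sum_y ** 2) / (n * (n - 1))
s_e2 = (n - 1) / (n - 2) * (s_y2 - b_hat ** 2 * s_x2)

# Assumed form of the test statistic for H0: b = 0.
b_0 = 0.0
t_c = (b_hat - b_0) * math.sqrt(s_x2) * math.sqrt(n - 1) / math.sqrt(s_e2)

critical_value = 2.306  # t(0.025, 8), taken from the text (two-tailed, alpha = 0.05)
reject_h0 = abs(t_c) > critical_value

print(f"s_x^2 = {s_x2:.2f}, s_y^2 = {s_y2:.2f}, s_e^2 = {s_e2:.2f}")
print(f"t_c = {t_c:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The statistic agrees, up to rounding, with the value obtained earlier when testing 𝜌 = 0, which is expected: in simple linear regression the two tests are algebraically equivalent.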
References:
Supe, A., et al. (2013). Elementary Statistics. Central Book Supply Inc.
Triola, M. F. (2010). Elementary Statistics, 11th Edition. Technology Update.
Prepared by:
JOBELLE S. SIMBLANTE