Lesson 7 - Linear Correlation and Simple Linear Regression

Uploaded by

chloe frost
Lesson 7

LINEAR CORRELATION AND SIMPLE LINEAR REGRESSION

Lesson 1: LINEAR CORRELATION


Correlation analysis attempts to measure the strength of the relationship between two random
variables by means of a single number called the correlation coefficient. It is concerned only with the strength
of the relationship; no causal effect is implied. One of the more commonly used measures of the linear
relationship between two variables is the Pearson Product Moment Correlation Coefficient (𝜌). The
estimated sample correlation coefficient between X and Y, denoted by r, is given by:
$$
r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}
$$
where n is the sample size.

The Sample Pearson Correlation Coefficient can be interpreted in the following manner:
1. The value of r ranges from -1 to +1. If r = +1 or r = -1, there is a perfect linear relationship and all
points lie on a straight line.
2. An r close to +1 indicates a high positive linear relationship between the two variables X and Y,
that is, as the value of X increases, the value of Y also increases.
3. An r close to -1 indicates a high negative linear relationship between the two variables, that is, as the
value of X increases, the value of Y decreases.
4. An r near 0 means that there is a lack of linearity between the two variables, or there is no linear
relationship between them. This does not mean they are not associated at all, because the relationship
may be nonlinear.
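
The raw-sum formula for r can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_r` and the two small data sets are hypothetical and were chosen so that the points lie exactly on a line.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient from the raw-sum formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when X or Y has no variation")
    return numerator / denominator


# Hypothetical data: points lying exactly on a straight line.
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [3, 5, 7, 9, 11])    # y = 1 + 2x, a rising line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y = 12 - 2x, a falling line
print(r_up, r_down)
```

For these two data sets the function returns +1 and -1, the perfect direct and perfect inverse cases described in item 1.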

A scatter diagram is a graphical presentation of the independent variable (plotted on the horizontal
axis) and the dependent variable (plotted on the vertical axis). This diagram is the easiest
way to determine whether a relationship exists between the two variables.
The figures below are scatter diagrams showing the different types of linear relationships.

Figure 1. Direct Linear Relationship Figure 2. Inverse Linear Relationship

Note: The correlation coefficient remains high in magnitude (𝑟 ≈ ±1) when the points cluster closely around a
straight line (Figure 1 and Figure 2).

Figure 3. No linear Relationship Figure 4. No Linear Relationship
Note:
• In Figure 3, the coefficient r becomes smaller as the points cluster less closely around
the line, and it becomes virtually zero when the distribution shows randomness.
• Figure 4 shows a neat curvilinear relationship between the variables, and it can be verified that its
linear correlation coefficient is low or near 0.

The Sample Coefficient of Determination, 𝑟², is a number that measures the proportion of the total variation in the
values of variable Y that can be accounted for or explained by the linear relationship with the values of the
variable X. It is usually expressed as a percentage.
For example, if the correlation coefficient, r, is 0.60, then 𝑟² = (0.60)² = 0.36 = 36%. This
means that 36% of the total variation of Y can be explained by its linear relationship with X.

• Test Concerning the Correlation Coefficient, 𝝆


The sample correlation coefficient, 𝒓, is a value computed from the sample and is used to
estimate the population correlation coefficient 𝝆. Thus, the value 𝒓 is used when testing the null
hypothesis that 𝝆 is equal to some value 𝝆𝟎 , that is, 𝑯𝟎 : 𝝆 = 𝝆𝟎 .
The table below provides the formulas needed to perform the test for the correlation
coefficient.

Formula in Testing the Correlation Coefficient, 𝝆

Test Statistic:
$$
t_c = \frac{r - \rho_0}{\sqrt{\dfrac{1 - r^2}{n - 2}}}
$$
𝑯𝟎: 𝜌 = 𝜌0 (There is no linear relationship between X and Y.)
𝑯𝒂: 𝜌 ≠ 𝜌0 (There is a linear relationship between X and Y.)
Rejection Region: $t_c < -t_{(\alpha/2,\, n-2)}$ or $t_c > t_{(\alpha/2,\, n-2)}$

Note: If the 𝝆𝟎 is not specified, use 𝝆𝟎 = 𝟎.

Example. A person’s muscle mass is expected to decrease with age. To explore this relationship, a
researcher randomly selected 10 persons aged 40 to 79 years and measured their muscle mass (in units).
The results are as follows:
X (age) 71 64 43 67 56 73 68 56 76 65
Y (muscle mass) 82 91 100 68 87 73 78 80 65 84
Based on the given data, do the following:
a. Plot the scatter diagram of the given data.
b. Find the sample coefficient of determination, 𝑟², and interpret the result.
c. Test 𝑯𝟎 : 𝝆 = 𝟎 at the 0.01 level of significance.
Solution:
a. The scatter diagram of the given data:

[Scatter diagram: Age of a Person on the horizontal axis (40 to 80), Muscle Mass on the vertical axis (60 to 110)]

A decreasing slope is observed indicating a negative relationship between X and Y.

b. To solve for $r^2$, we have the following given values and computations:

$n = 10$;
$x_1 = 71,\ x_2 = 64,\ x_3 = 43,\ x_4 = 67,\ x_5 = 56,\ x_6 = 73,\ x_7 = 68,\ x_8 = 56,\ x_9 = 76,\ x_{10} = 65$;
$y_1 = 82,\ y_2 = 91,\ y_3 = 100,\ y_4 = 68,\ y_5 = 87,\ y_6 = 73,\ y_7 = 78,\ y_8 = 80,\ y_9 = 65,\ y_{10} = 84$;

$$\sum_{i=1}^{10} x_i = x_1 + x_2 + \cdots + x_{10} = 71 + 64 + \cdots + 65 = 639;$$
$$\sum_{i=1}^{10} y_i = y_1 + y_2 + \cdots + y_{10} = 82 + 91 + \cdots + 84 = 808;$$
$$\sum_{i=1}^{10} x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_{10} y_{10} = 71(82) + 64(91) + \cdots + 65(84) = 50887;$$
$$\sum_{i=1}^{10} x_i^2 = x_1^2 + x_2^2 + \cdots + x_{10}^2 = 71^2 + 64^2 + \cdots + 65^2 = 41701;$$
$$\sum_{i=1}^{10} y_i^2 = y_1^2 + y_2^2 + \cdots + y_{10}^2 = 82^2 + 91^2 + \cdots + 84^2 = 66292.$$

$$
r = \frac{10(50887) - (639)(808)}{\sqrt{\left[10(41701) - 639^2\right]\left[10(66292) - 808^2\right]}} = -0.796,
$$

indicating a negative linear relationship between X (age of the person) and Y (muscle mass).

The sample coefficient of determination $r^2$ is computed as

$$
r^2 = (-0.796)^2 \times 100\% = 63.36\%
$$

which means that about 63% of the total variation in muscle mass is explained or accounted for
by its linear relationship with the age of the person.
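
The hand computation of r and r² can be cross-checked with a short script. This is a sketch using only the ten (age, muscle mass) pairs given in the example; the variable and function names are illustrative.

```python
import math

# Data from the example: X = age, Y = muscle mass.
age = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65]
muscle = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84]


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient from the raw-sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(age, muscle)
r_squared = r ** 2  # sample coefficient of determination

print(f"r = {r:.3f}, r^2 = {r_squared:.4f}")
```

Carrying r at full precision gives an r² of roughly 0.634, slightly different in the last digits from the 63.36% obtained above by squaring the rounded value -0.796.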

c. Following the steps in hypothesis testing, we have:
Steps:
1. Hypotheses:
𝐻0 : There is no linear relationship between the variables age and muscle mass, that is, 𝜌 = 0.
𝐻𝑎 : There is a linear relationship between the variables age and muscle mass, that is, 𝜌 ≠ 0.
2. Significance Level: 𝛼 = 0.01
3. Test Statistic: The appropriate test statistic is
$$
t_c = \frac{r - \rho_0}{\sqrt{\dfrac{1 - r^2}{n - 2}}}
$$
4. Critical Regions: Reject $H_0$ if $t_c < -t_{(\alpha/2,\, n-2)}$ or $t_c > t_{(\alpha/2,\, n-2)}$,
where $t_{(0.01/2,\, 10-2)} = t_{(0.005,\, 8)} = 3.355$ (refer to the t-table).

Thus, we reject $H_0$ if $t_c < -3.355$ or $t_c > 3.355$.

5. Computation: Using the formula, the actual value of the test statistic is:

$$
t_c = \frac{-0.796 - 0}{\sqrt{\dfrac{1 - (-0.796)^2}{10 - 2}}} = \frac{-0.796}{\sqrt{\dfrac{1 - 0.633616}{8}}} = -3.72
$$

6. Statistical Decision: Since $t_c = -3.72$ is less than $-3.355$ (meaning, it is in the critical
region), the null hypothesis $H_0$ is rejected.

7. Conclusion: The data provide evidence that there is a linear relationship between age and
muscle mass at $\alpha = 0.01$.
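
The seven steps above can be sketched in code as follows. The critical value 3.355 is not computed here; it is the t-table value quoted in step 4 and is typed in by hand. Function and variable names are illustrative.

```python
import math

# Data from the example: X = age, Y = muscle mass.
age = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65]
muscle = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84]


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient from the raw-sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


def correlation_t(r, n, rho0=0.0):
    """Test statistic t_c = (r - rho0) / sqrt((1 - r^2) / (n - 2))."""
    return (r - rho0) / math.sqrt((1 - r ** 2) / (n - 2))


n = len(age)
r = pearson_r(age, muscle)
t_c = correlation_t(r, n)

t_critical = 3.355  # t(0.005, 8), read from a t-table as in step 4
reject_h0 = t_c < -t_critical or t_c > t_critical

print(f"t_c = {t_c:.2f}, reject H0: {reject_h0}")
```

The script reaches the same decision as step 6: the statistic falls below -3.355, so the null hypothesis is rejected at the 0.01 level.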

Lesson 2. SIMPLE LINEAR REGRESSION
Regression analysis is a statistical method which makes use of the relationship between two or
more quantitative variables so that one variable, called the dependent variable or response variable, can
be predicted with the knowledge of the values of the other variable, called the independent variable or
explanatory variable.

Purposes of Regression Analysis


i. To measure the relationship between two or more variables; and
ii. To predict or estimate values of a dependent variable from known values of independent
variables.

A mathematical equation that allows us to predict values of one dependent variable from known
values of one or more independent variables is called a regression equation. For simple linear regression, it has the form
𝑌 = 𝑎 + 𝑏𝑋
Regression analysis deals with finding estimates of the constants a and b so that, once the
estimates are found, a value $\hat{Y}$ can be predicted from a known value of X through the
regression equation
$$
\hat{Y} = \hat{a} + \hat{b}X
$$
where $\hat{Y}$ is the predicted value of the dependent variable;
$X$ is the independent variable;
$\hat{a}$ is the least squares estimate of the parameter $a$; and
$\hat{b}$ is the least squares estimate of the parameter $b$.

Assumptions on Regression Analysis


i. The values of the independent variable X may be “fixed”, that is, selected in
advance by the researcher, in which case X is not a random variable; or they may be obtained without
the imposition of any restriction, in which case X is a random variable.
ii. The values of X are measured without error.
iii. For each value of the independent variable X, the dependent variable Y is normally
distributed.
iv. The variances of the dependent variable Y, given different values of the independent variable X, are
equal.
Note: Condition iv is known as homoscedasticity.

Estimation of Parameters
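Given the sample $\{(x_i, y_i),\ i = 1, 2, 3, \ldots, n\}$, the least squares estimates of the parameters in the
regression line are:

$$
\hat{b} = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}
\qquad\text{and}\qquad
\hat{a} = \bar{y} - \hat{b}\,\bar{x}
$$

where $b$ is the regression coefficient or the slope of the regression line and $a$ is the constant of regression
or the y-intercept of the regression line.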
Moreover,
$$
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad\text{and}\qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
$$
are the means of the sample values of $Y$ and $X$, respectively.
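
A minimal sketch of the estimation step, assuming the usual least squares formulas $\hat{b} = [n\sum x_i y_i - \sum x_i \sum y_i] / [n\sum x_i^2 - (\sum x_i)^2]$ and $\hat{a} = \bar{y} - \hat{b}\bar{x}$, which are the ones applied in the worked example later in this lesson. The function name `least_squares_line` and the small data set are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a_hat, b_hat) for the fitted line Y_hat = a_hat + b_hat * X."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope_denominator = n * sum_x2 - sum_x ** 2
    if slope_denominator == 0:
        raise ValueError("the slope is undefined when all x values are equal")
    b_hat = (n * sum_xy - sum_x * sum_y) / slope_denominator
    a_hat = sum_y / n - b_hat * (sum_x / n)  # a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


# Hypothetical data lying exactly on the line y = 2 + 3x.
x = [0, 1, 2, 3]
y = [2, 5, 8, 11]
a_hat, b_hat = least_squares_line(x, y)
print(a_hat, b_hat)
```

Since the points lie exactly on y = 2 + 3x, the estimates recover the intercept 2 and the slope 3.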

• Test Concerning the Regression Coefficient, 𝒃


The relationship between the two variables X and Y is expressed through the linear
equation, 𝑌 = 𝑎 + 𝑏𝑋. The value of 𝒃 plays an important role in the relationship and can be
summarized as follows:
1. When 𝒃 > 𝟎, X and Y are directly linearly related. As X increases, Y also increases. As X decreases, Y also
decreases.
2. When 𝒃 < 𝟎, X and Y are inversely linearly related. As X increases, Y decreases. As X decreases,
Y increases.
3. When 𝒃 = 𝟎, X and Y are not linearly related.
4. When 𝒃 ≠ 𝟎, X and Y are linearly related.

The table below provides the formulas needed to perform the test for the regression coefficient.

Formula in Testing the Regression Coefficient, 𝒃

Test Statistic:
$$
t_c = \frac{s_x \sqrt{n-1}\,(\hat{b} - b_0)}{s_e}
$$
where
$$
s_x^2 = \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}, \qquad
s_y^2 = \frac{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}{n(n-1)}, \qquad
s_e^2 = \frac{n-1}{n-2}\left(s_y^2 - \hat{b}^2 s_x^2\right)
$$
𝑯𝟎: 𝑏 = 𝑏0 (The variables X and Y are not linearly related.)
𝑯𝒂: 𝑏 ≠ 𝑏0 (The variables X and Y are linearly related.)
Rejection Region: $t_c < -t_{(\alpha/2,\, n-2)}$ or $t_c > t_{(\alpha/2,\, n-2)}$

Note: If the 𝒃𝟎 is not specified, use 𝒃𝟎 = 𝟎.

Example. A person’s muscle mass is expected to decrease with age. To explore this relationship, a
researcher randomly selected 10 persons aged 40 to 79 years and measured their muscle mass (in units).
The results are as follows:

X (age) 71 64 43 67 56 73 68 56 76 65
Y (muscle mass) 82 91 100 68 87 73 78 80 65 84

Based on the given data, do the following:


a. Obtain the regression line equation.
b. Estimate the muscle mass when age of the person is 60 years old.
c. Test 𝑯𝟎 : 𝒃 = 𝟎 at the 0.05 level of significance.

Solution:
a. To solve for the estimates $\hat{b}$ and $\hat{a}$, we have the following given values and computations:

$n = 10$;
$x_1 = 71,\ x_2 = 64,\ x_3 = 43,\ x_4 = 67,\ x_5 = 56,\ x_6 = 73,\ x_7 = 68,\ x_8 = 56,\ x_9 = 76,\ x_{10} = 65$;
$y_1 = 82,\ y_2 = 91,\ y_3 = 100,\ y_4 = 68,\ y_5 = 87,\ y_6 = 73,\ y_7 = 78,\ y_8 = 80,\ y_9 = 65,\ y_{10} = 84$;

$$\sum_{i=1}^{10} x_i = x_1 + x_2 + \cdots + x_{10} = 71 + 64 + \cdots + 65 = 639;$$
$$\sum_{i=1}^{10} y_i = y_1 + y_2 + \cdots + y_{10} = 82 + 91 + \cdots + 84 = 808;$$
$$\sum_{i=1}^{10} x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_{10} y_{10} = 71(82) + 64(91) + \cdots + 65(84) = 50887;$$
$$\sum_{i=1}^{10} x_i^2 = x_1^2 + x_2^2 + \cdots + x_{10}^2 = 71^2 + 64^2 + \cdots + 65^2 = 41701;$$
$$\sum_{i=1}^{10} y_i^2 = y_1^2 + y_2^2 + \cdots + y_{10}^2 = 82^2 + 91^2 + \cdots + 84^2 = 66292.$$

$$
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{10}(808) = 80.8 \qquad\text{and}\qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{10}(639) = 63.9.
$$

$$
\hat{b} = \frac{10(50887) - (639)(808)}{10(41701) - 639^2} = \frac{508870 - 516312}{417010 - 408321} = \frac{-7442}{8689} = -0.8565
$$

$$
\hat{a} = \bar{y} - \hat{b}\,\bar{x} = 80.8 - (-0.8565)(63.9) = 135.5304
$$

Therefore, the estimated regression line is
$$
\hat{Y} = \hat{a} + \hat{b}X = 135.5304 + (-0.8565)X = 135.5304 - 0.8565X.
$$
The negative slope indicates that as a person gets older, muscle mass decreases.
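
The fit in part a can be reproduced numerically. The sketch below uses the example data and also evaluates the fitted line at an age of 60, which is the prediction asked for in part b. The function name is illustrative.

```python
# Data from the example: X = age, Y = muscle mass.
age = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65]
muscle = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84]


def least_squares_line(xs, ys):
    """Return (a_hat, b_hat) for the fitted line Y_hat = a_hat + b_hat * X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a_hat = sum_y / n - b_hat * (sum_x / n)  # a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


a_hat, b_hat = least_squares_line(age, muscle)
predicted_at_60 = a_hat + b_hat * 60  # point prediction for a 60-year-old

print(f"Y_hat = {a_hat:.4f} + ({b_hat:.4f}) X")
print(f"Predicted muscle mass at age 60: {predicted_at_60:.2f}")
```

Because the script keeps the slope at full precision instead of rounding it to -0.8565 first, the intercept may differ from the hand-computed 135.5304 in the third decimal place; the prediction still rounds to about 84 units.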

b. The predicted muscle mass of a person who is 60 years old is

$$
\hat{Y} = 135.5304 - 0.8565(60) = 84.1404 \approx 84 \text{ units}
$$

c. Following the steps in hypothesis testing, we have:


Steps:
1. Hypotheses:
𝐻0 : There is no linear relationship between the variables age and muscle mass, that is, 𝑏 = 0.
𝐻𝑎 : There is a linear relationship between the variables age and muscle mass, that is, 𝑏 ≠ 0.
2. Significance Level: 𝛼 = 0.05
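3. Test Statistic: The appropriate test statistic is
$$
t_c = \frac{s_x \sqrt{n-1}\,(\hat{b} - b_0)}{s_e}
$$
where
$$
s_x^2 = \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}; \qquad
s_y^2 = \frac{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}{n(n-1)}; \qquad
s_e^2 = \frac{n-1}{n-2}\left(s_y^2 - \hat{b}^2 s_x^2\right)
$$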

4. Critical Regions: Reject $H_0$ if $t_c < -t_{(\alpha/2,\, n-2)}$ or $t_c > t_{(\alpha/2,\, n-2)}$,
where $t_{(0.05/2,\, 10-2)} = t_{(0.025,\, 8)} = 2.306$ (refer to the t-table).

Thus, we reject $H_0$ if $t_c < -2.306$ or $t_c > 2.306$.

5. Computation: First, we find $s_x$ and $s_e$ by computing the following:

$$
s_x^2 = \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)} = \frac{10(41701) - (639)^2}{10(10-1)} = 96.54
$$

$$
s_x = \sqrt{s_x^2} = \sqrt{96.54} = 9.83
$$

$$
s_y^2 = \frac{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}{n(n-1)} = \frac{10(66292) - (808)^2}{10(10-1)} = 111.73
$$

$$
s_e^2 = \frac{n-1}{n-2}\left(s_y^2 - \hat{b}^2 s_x^2\right) = \frac{10-1}{10-2}\left[111.73 - (-0.8565)^2(96.54)\right] = 46.02
$$

$$
s_e = \sqrt{s_e^2} = \sqrt{46.02} = 6.78
$$

After solving for $s_x$ and $s_e$, the actual value of the test statistic is:

$$
t_c = \frac{s_x \sqrt{n-1}\,(\hat{b} - b_0)}{s_e} = \frac{9.83\,\sqrt{10-1}\,(-0.8565 - 0)}{6.78} = -3.725
$$

6. Statistical Decision: Since $t_c = -3.725$ is less than $-2.306$ (meaning, it is in the critical
region), the null hypothesis $H_0$ is rejected.

7. Conclusion: The data provide evidence that there is a linear relationship between age and
muscle mass at $\alpha = 0.05$.
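
The test on the regression coefficient can be sketched the same way. The critical value 2.306 is the t-table value quoted in step 4 and is entered by hand; all names are illustrative.

```python
import math

# Data from the example: X = age, Y = muscle mass.
age = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65]
muscle = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84]


def slope_t_statistic(xs, ys, b0=0.0):
    """Test statistic t_c = s_x * sqrt(n - 1) * (b_hat - b0) / s_e."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    b_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    var_x = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))      # s_x^2
    var_y = (n * sum_y2 - sum_y ** 2) / (n * (n - 1))      # s_y^2
    var_e = (n - 1) / (n - 2) * (var_y - b_hat ** 2 * var_x)  # s_e^2

    return math.sqrt(var_x) * math.sqrt(n - 1) * (b_hat - b0) / math.sqrt(var_e)


t_c = slope_t_statistic(age, muscle)

t_critical = 2.306  # t(0.025, 8), read from a t-table as in step 4
reject_h0 = t_c < -t_critical or t_c > t_critical

print(f"t_c = {t_c:.3f}, reject H0: {reject_h0}")
```

Without intermediate rounding the statistic comes out near -3.72 instead of -3.725. That matches the value obtained in the correlation test, as expected: for $\rho_0 = 0$ and $b_0 = 0$ the two statistics are algebraically identical.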

References: Supe, A., et al. (2013). Elementary Statistics. Central Book Supply Inc.
Triola, M. F. (2010). Elementary Statistics, 11th Edition. Technology Update.

Prepared by:
JOBELLE S. SIMBLANTE
