
Topic_13_Correlation_and_Simple_Linear_Regression

The document discusses correlation and simple linear regression, focusing on how to measure the strength of association between two variables using correlation analysis and scatter plots. It explains the concept of a trend line, the calculation of the correlation coefficient, and the least squares method for fitting a linear regression model. Examples illustrate the application of these concepts in predicting outcomes based on historical data.

Uploaded by

Bageya Alexis

H.W.

Kayondo c 1

Correlation and Simple Linear Regression

1 Correlation
Correlation analysis is used to measure the strength of association (linear
relationship) between two variables. It is concerned only with the strength of
the relationship; no causal effect is implied.
The relationship can be visualized using scatter plots and quantified by
computing the correlation coefficient.

1.1 Scatter plots


A scatter plot (or scatter diagram) is used to show a relationship between two variables.
A scatter plot is an effective way to show how two variables relate to each other
by showing how closely the data points fit to a straight line.

A scatter plot is a graph that relates two groups of data. These two groups of
data are plotted as ordered pairs to create the graph.

1.1.1 Line of best fit/the trend line


Sometimes we can fit a trend line or a line of best fit to the points plotted on a
scatter diagram. A trend line shows a correlation of the points on the scatter
plot. You can study the line to see how the data behaves and may have a basis
to predict what the data might be for values not given.

Example 1 This is an example for length and wingspan of hawks.

Length (in) 21 21 18 24 16 19 17 19
Wingspan (in) 36 41 38 46 31 39 35 46

Draw a scatter plot and fit a trend line to the data points.

Solution

The scatter plot with a trend line is shown in Figure 1. We shall see in section
2.1 how to fit such a line using least squares method. The equation of the trend
line is y = 11.1253 + 1.4387x, where x and y are length and wingspan of hawks in
inches, respectively. The slope and intercept for the trend line are 1.4387 and 11.1253,
respectively.

Example 2 The following table presents information on tornado occurrences in the
United States of America (USA).
2 Engineering Mathematics IV– EMT 2201

Figure 1: Scatter plot for Hawks.

Year             1950 1955 1960 1965 1970 1975 1980 1985 1990 1995
No. of Tornadoes  201  593  616  897  654  919  866  684 1133 1234

(a) Draw a scatter plot for the data given and fit the best trend line to the points.
(b) Use the line of best fit to predict how many tornadoes may be reported in the
United States in 2015 if the trend continues.

Solution

(a) A scatter plot with a trend line is shown in Figure 2.


(b) The equation of the trend line is y = −31710.364 + 16.472x, where x and
y are years and number of tornadoes, respectively. When x = 2015, y is
approximately 1481. Therefore, using the fitted trend line, we predict 1481
tornadoes in the year 2015.
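
The prediction in part (b) can be reproduced by evaluating the quoted trend line directly. The following is a minimal sketch that assumes the rounded coefficients quoted above; the function name `predict_tornadoes` is hypothetical and used only for illustration.

```python
# Coefficients of the trend line quoted in the solution:
#   y = -31710.364 + 16.472 x
INTERCEPT = -31710.364
SLOPE = 16.472


def predict_tornadoes(year):
    """Evaluate the quoted trend line at the given year."""
    return INTERCEPT + SLOPE * year


prediction = predict_tornadoes(2015)
print(round(prediction))  # 1481
```

Because the intercept is large in magnitude, the result is sensitive to how many decimal places of the slope are kept, so carrying more digits can shift the prediction by a tornado or two. The year 2015 also lies outside the observed range (1950 to 1995), so the value is an extrapolation.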

Figure 2: Scatter plot for Tornadoes.

Remark: Scatter plots provide a convenient way to determine whether a
correlation exists between two variables. A positive correlation occurs when one
variable tends to increase as the other increases. A negative correlation occurs
when one variable tends to decrease as the other increases. If the data points
are randomly scattered, there is little or no correlation. If the data points lie
close to the line of best fit, the correlation is said to be strong. Examples of
scatter plots are shown in Figures 3, 4 and 5.

Figure 3: The first examples of Scatter plots.

Figure 4: The second examples of Scatter plots.



Figure 5: The third examples of Scatter plots.

1.2 Correlation coefficient, r


The population correlation coefficient ρ measures the strength of the association
between the variables, while the sample correlation coefficient r is an estimate
of ρ and is used to measure the strength of the linear relationship in the sample
observations.
For two quantitative variables X and Y, for which n pairs of measurements
(xi , yi ) are available, Pearson’s correlation coefficient (r) gives a measure of the
linear association between X and Y.

1.2.1 Features of ρ and r


• Unit free.
• Range between -1 and 1.
• The closer to -1, the stronger the negative linear relationship.
• The closer to 1, the stronger the positive linear relationship.
• The closer to 0, the weaker the linear relationship.

1.2.2 Calculating the correlation coefficient


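Sample correlation coefficient is given by:

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}, $$

or the algebraic equivalent:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}}, $$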

where
• r = Sample correlation coefficient,
• n = Sample size,
• x = Value of the independent variable,
• y = Value of the dependent variable.
Figure 6 illustrates examples of various r values with the fitted straight lines.

Figure 6: Illustrations for various r values .

Example 3 You are developing a new analytical method for the determination of blood
urea nitrogen (BUN). You want to determine whether your method differs significantly
from a standard one for analyzing a range of sample concentrations expected to be found
in the routine laboratory. It has been ascertained that the two methods have comparable
precisions. The data shown in table below is for two sets of the results for a number of
individual samples.

Calculate the correlation coefficient for the data taking your method as x and
the standard method as y.

Sample Your method (mg/dL), x Standard method (mg/dL), y


A 10.2 10.5
B 12.7 11.9
C 8.6 8.7
D 7.5 16.9
E 11.2 10.9
F 11.5 11.1

Solution

Start by calculating $n = 6$, $\sum x_i^2 = 653.23$, $\sum y_i^2 = 855.18$,
$\bar{x} = 10.2833$, $\bar{y} = 11.6667$ and $\sum x_i y_i = 709.53$. After
substituting in the formula, one gets the correlation coefficient as −0.3834.
Comment on the correlation coefficient.
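
The calculation in Example 3 can be checked with a short script. This is a minimal sketch of the sums-based formula for r given in section 1.2.2, applied to the data in the table; the function name `pearson_r` is an assumption chosen for illustration.

```python
from math import sqrt

# Data from the table in Example 3 (mg/dL)
x = [10.2, 12.7, 8.6, 7.5, 11.2, 11.5]   # your method
y = [10.5, 11.9, 8.7, 16.9, 10.9, 11.1]  # standard method


def pearson_r(xs, ys):
    """Sample correlation coefficient using the sums-based formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(x, y)
print(round(r, 4))  # -0.3834
```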

Exercise

Calculate and comment on the computed correlation coefficients for example
problems 1 and 2.

Remark
When the term correlation coefficient is used without further qualification,
it usually refers to the Pearson product-moment correlation coefficient. It
should be noted that other correlation coefficients exist, for example Spearman’s
rank correlation coefficient, Kendall’s rank correlation coefficient, and Goodman
and Kruskal’s gamma coefficient, among others.

2 Simple linear regression analysis


A simple linear regression model is one which:

• has only one independent variable (x) and one dependent variable (y).

• describes relationship between x and y by a linear function.

• Changes in y are assumed to be caused by changes in x.

Regression analysis is used to:

• Predict the value of a dependent variable based on the value of at least
one independent variable,

• Explain the impact of changes in an independent variable on the dependent
variable.

Remark
A dependent variable is the one we wish to explain, and an independent variable
is the one used to explain the dependent variable.

Figure 7: Types of possible regression with one independent variable.

Figure 7 shows different types of regression with one independent variable.

One of the most popular methods used to fit a linear regression model is the
least squares method.

2.1 The least squares regression model


The basic idea of the method of least squares is easy to understand. It may
seem unusual that when several people measure the same quantity, they usually
do not obtain the same results. In fact, if the same person measures the same
quantity several times, the results will vary. What then is the best estimate for
the true measurement? The method of least squares gives a way to find the best
estimate, assuming that the errors (i.e. the differences from the true value) are
random and unbiased. Let us consider a simple example.

Example 4 Suppose we measure a distance four times, and obtain the following results:
72, 69, 70 and 73 units. What is the best estimate of the correct measurement?

Let us denote the estimate of the true measurement by x, and form the deviations
(errors) from x, namely: x − 72, x − 69, x − 70, and x − 73.
Let S be the sum of the squares of these errors, i.e.

$$ S = (x - 72)^2 + (x - 69)^2 + (x - 70)^2 + (x - 73)^2. $$

We seek the value of x that minimises the value of S. We can simplify S through
the following steps:

$$
\begin{aligned}
S &= x^2 - 144x + 5184 + x^2 - 138x + 4761 + x^2 - 140x + 4900 + x^2 - 146x + 5329 \\
  &= 4x^2 - 568x + 20174 \\
  &= 4(x^2 - 142x) + 20174 \\
  &= 4(x - 71)^2 + 20174 - 4(71)^2 \\
  &= 4(x - 71)^2 + 10.
\end{aligned}
$$
We can see from this form (or we can use calculus) that the minimum value of
S is 10, when x = 71. So the best estimate of the true measurement is 71 units!
Note that 71 units is the mean or average of the original four measurements. It
is always true that for n measurements the minimum value of S occurs when x
equals the mean of the n measurements.
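
The claim that the mean minimises S can be illustrated numerically. The sketch below, which assumes only the four measurements of Example 4, evaluates S at the mean and at two nearby estimates; the helper name `sum_squared_errors` is hypothetical.

```python
measurements = [72, 69, 70, 73]


def sum_squared_errors(estimate, data):
    """S: sum of squared deviations of the data from the estimate."""
    return sum((estimate - m) ** 2 for m in data)


mean = sum(measurements) / len(measurements)
s_at_mean = sum_squared_errors(mean, measurements)
print(mean, s_at_mean)  # 71.0 10.0

# Estimates on either side of the mean give a larger S.
for candidate in (70.5, 71.5):
    print(candidate, sum_squared_errors(candidate, measurements))
```

This checks only a few points and is not a proof; the completed square form S = 4(x − 71)² + 10 is what shows that no other estimate does better.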

2.1.1 The line of best fit


Suppose we want to estimate the line of best fit for a set of ordered pairs. The
method of least squares calculates the line of best fit by minimising the sum of the
squares of the vertical distances of the points to the line. Figure 8 demonstrates
the Least Squares method.

Figure 8: Least squares method.

Let’s illustrate the least squares method with a simple example below.

Example 5 Problem: Given these measurements of the two quantities x and y, find y7:

x1 = 2   x2 = 4   x3 = 6   x4 = 8   x5 = 10   x6 = 12   x7 = 14
y1 = 2   y2 = 4   y3 = 4   y4 = 5   y5 = 5    y6 = 7    y7 = ?

Due to random errors in the measurements, the ordered pairs (xi, yi) do not lie on a
straight line. Assume the values can be approximated by the linear function ŷ = ax + b.
Let us call the deviations (errors) di = ŷi − yi = axi + b − yi for i = 1, 2, ..., 6.

Let’s solve the problem algebraically by finding the sum of the squares of the errors and
minimising it:

$$
\begin{aligned}
S &= d_1^2 + d_2^2 + d_3^2 + d_4^2 + d_5^2 + d_6^2 \\
  &= (2a + b - 2)^2 + (4a + b - 4)^2 + (6a + b - 4)^2 + (8a + b - 5)^2 \\
  &\quad + (10a + b - 5)^2 + (12a + b - 7)^2 \\
  &= 364a^2 + 84ab + 6b^2 - 436a - 54b + 135.
\end{aligned}
$$

We now find the partial derivative of S with respect to a. This means that we differentiate
S with respect to a, and treat b as if it was a constant. As with one variable, we set the
derivative equal to zero. This gives

182a + 21b = 109. (1)

We also find the partial derivative of S with respect to b. This means that we differentiate
S with respect to b, and treat a as if it was a constant and set the derivative equal to
zero. This gives

14a + 2b = 9. (2)

Solving equations (1) and (2) we have

$$ a = \frac{29}{70} \quad \text{and} \quad b = \frac{56}{35}. $$

So the line of best fit is

$$ y = \frac{29}{70}x + \frac{56}{35}. $$

We can now use the model to find the unknown information; for example,

$$ y_7 = y(14) = \frac{29}{70} \times 14 + \frac{56}{35} = 7.4. $$
We can solve the least errors problem numerically through trial-and-improvement
by systematically varying a and b until $S = \sum (y - \hat{y})^2$ is a minimum.
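
The algebraic solution of Example 5 can be reproduced exactly with rational arithmetic. The sketch below is one possible way to do it: it builds the two equations obtained by setting the partial derivatives of S to zero and solves them by Cramer's rule. The variable names are assumptions for illustration.

```python
from fractions import Fraction

# Data from Example 5 (the six known pairs)
xs = [2, 4, 6, 8, 10, 12]
ys = [2, 4, 4, 5, 5, 7]
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Setting dS/da = 0 and dS/db = 0 for y_hat = a*x + b gives
#   sum_x2 * a + sum_x * b = sum_xy   (a multiple of equation (1))
#   sum_x  * a + n     * b = sum_y    (a multiple of equation (2))
det = sum_x2 * n - sum_x * sum_x
a = Fraction(sum_xy * n - sum_x * sum_y, det)
b = Fraction(sum_x2 * sum_y - sum_x * sum_xy, det)

print(a, b)                # 29/70 8/5
print(float(a * 14 + b))   # 7.4
```

The intercept prints as 8/5, which is 56/35 in lowest terms.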

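We can find the line of best fit “graphically” by using a curve-fitting
program, e.g. Excel’s trendline. Excel in essence calculates the coefficients
using these formulae:

$$ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}, $$

and

$$ a = \bar{y} - b\bar{x}. $$

Note that in these two formulae b denotes the slope and a the intercept, i.e. the
fitted line is y = a + bx, so the roles of a and b are swapped relative to the
form ŷ = ax + b used in Example 5.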
Example 6 Consider the following table, obtained using the trendline produced in
Excel:

x     y (data)   ŷ = ax + b (model)
2     2          2.42857
4     4          3.25714
6     4          4.08571
8     5          4.91429
10    5          5.74286
12    7          6.57143
14    ??         7.4

Note that the point (x̄, ȳ) = (7, 4.5), i.e. the ordered pair formed by the mean
of the x values and the mean of the y values lies on the line of best fit. This is
always the case.

Theorem: The Least Squares Model for a set of data (x1, y1), (x2, y2), ..., (xn, yn)
passes through the point (xa, ya), where xa is the average of the xi's and ya is the
average of the yi's.
Using this theorem enables us to simplify our method of calculating the
parameters a and b of our line of best fit with formula ŷ = ax + b.

Calculating the averages of the x values and of the y-values gives the point
(7,4.5). We know that this point lies on the line of best fit and therefore the
ordered pair satisfies the equation:

4.5 = 7a + b ⇒ b = 4.5 − 7a.

So we can now express the equation of the line of best fit only in terms of the
gradient a, as:

ŷ = ax + 4.5 − 7a. (3)

The following table shows the data points (x and y), the model in terms of a,
the errors arising from the fitted model for each data point (y − ŷ) and the
squared errors ((y − ŷ)²).

x    y    ŷ = ax + b    y − ŷ       (y − ŷ)²
2    2    4.5 − 5a      5a − 2.5    25a² − 25a + 6.25
4    4    4.5 − 3a      3a − 0.5    9a² − 3a + 0.25
6    4    4.5 − a       a − 0.5     a² − a + 0.25
8    5    4.5 + a       0.5 − a     a² − a + 0.25
10   5    4.5 + 3a      0.5 − 3a    9a² − 3a + 0.25
12   7    4.5 + 5a      2.5 − 5a    25a² − 25a + 6.25
14   ??
Sum:                                70a² − 58a + 13.5

From the table, the sum of squared errors (S) in terms of a can be simplified
as:

$$
\begin{aligned}
S &= 70a^2 - 58a + 13.5 \\
  &= 70\left(a^2 - \frac{58}{70}a\right) + 13.5 \\
  &= 70\left(a - \frac{58}{140}\right)^2 + 13.5 - 70\left(\frac{58}{140}\right)^2,
\end{aligned}
$$

so S is smallest when

$$ a = \frac{58}{140} = \frac{29}{70}. $$

In other words, for a = 29/70 the sum of the squared errors is a minimum, and
therefore the line with gradient a = 29/70 fits the data the best. After
evaluating b and substituting into equation (3), we therefore obtain:

$$ \hat{y} = \frac{29}{70}x + 1.6, \quad \Rightarrow \quad \hat{y}(14) = \frac{29}{70} \times 14 + 1.6 = 7.4. $$
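
The quadratic S(a) obtained in Example 6 can be cross-checked with exact arithmetic. The sketch below assumes the line is forced through the mean point (x̄, ȳ), as the theorem states, so that each error equals (y − ȳ) − a(x − x̄). The names `s_xx`, `s_xy` and `s_yy` are introduced here for illustration only.

```python
from fractions import Fraction

# Data from Example 5/6 (the six known pairs)
xs = [2, 4, 6, 8, 10, 12]
ys = [2, 4, 4, 5, 5, 7]
n = len(xs)
x_bar = Fraction(sum(xs), n)
y_bar = Fraction(sum(ys), n)

# With b = y_bar - a*x_bar, each error is (y - y_bar) - a*(x - x_bar),
# so S(a) = s_xx*a^2 - 2*s_xy*a + s_yy.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_yy = sum((y - y_bar) ** 2 for y in ys)

print(s_xx, -2 * s_xy, s_yy)  # 70 -58 27/2

# The quadratic in a is minimised at a = s_xy / s_xx.
a_best = s_xy / s_xx
b_best = y_bar - a_best * x_bar
print(a_best, b_best)  # 29/70 8/5
```

The three printed coefficients correspond to 70a² − 58a + 13.5, and the intercept 8/5 equals 1.6.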

2.1.2 Deriving regression coefficients using the method of least squares


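Suppose that we have n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn).
Suppose also that we may express the observations as:

$$ y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \epsilon_i \quad \text{for } i = 1, 2, \ldots, n, $$

where $\epsilon_i$ represents the vertical deviation of the ith observation from
the regression line. The sum of squares of the deviations of the observations
from the regression line is:

$$ S = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2. $$

The least squares estimators of $\hat{\beta}_0$ and $\hat{\beta}_1$ must satisfy:

$$ \frac{\partial S}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \tag{4} $$

$$ \frac{\partial S}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) x_i = 0 \tag{5} $$

Simplifying Equations (4) and (5) gives:

$$ n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \tag{6} $$

$$ \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \tag{7} $$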

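Equations (6) and (7) are called least squares normal equations.
From equation (6), the estimate for $\hat{\beta}_0$ is:

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{8} $$

where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
Substituting equation (8) into equation (7) and simplifying gives the estimate
for $\hat{\beta}_1$ as:

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \tag{9} $$

The regression line can then be written as $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.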


Remark
The formula for evaluating β̂1 given in Equation (9) is equivalent to that used to
estimate b in Excel’s trendline, which we quoted previously. Try to expand
out the expression for b and verify this for yourself. Good luck!
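
As a hedged sketch of the verification suggested in this remark, using only the definitions $\bar{x} = \frac{1}{n}\sum x_i$ and $\bar{y} = \frac{1}{n}\sum y_i$, the numerator and denominator of the Excel formula for b can be expanded as follows (all sums run over i = 1, ..., n):

$$
\begin{aligned}
\sum (x_i - \bar{x})(y_i - \bar{y})
  &= \sum x_i y_i - \bar{y}\sum x_i - \bar{x}\sum y_i + n\bar{x}\bar{y} \\
  &= \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}, \\
\sum (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}.
\end{aligned}
$$

Dividing the first result by the second gives the right-hand side of Equation (9).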

Example 7 Using data given in example 1, find the least squares estimates for the
linear regression coefficients.

Solution

We computed the following quantities:

$n = 8$, $\bar{x} = 19.375$, $\bar{y} = 39$, $\sum_{i=1}^{8} x_i = 155$,
$\sum_{i=1}^{8} y_i = 312$, $\sum_{i=1}^{8} x_i y_i = 6111$ and
$\sum_{i=1}^{8} x_i^2 = 3049$.

We then substituted the computed quantities in equations (8) and (9) and we
obtained:
β̂1 = 1.438692, β̂0 = 11.12534 and the fitted linear regression equation is
ŷ = 11.12534 + 1.438692x.
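
The estimates in Example 7 can be reproduced with a direct implementation of equations (8) and (9). This is a minimal sketch rather than a library routine; the function name `least_squares` is an assumption chosen for illustration.

```python
# Hawk data from Example 1 (inches)
length = [21, 21, 18, 24, 16, 19, 17, 19]
wingspan = [36, 41, 38, 46, 31, 39, 35, 46]


def least_squares(xs, ys):
    """Return (beta0, beta1) using equations (8) and (9)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Equation (9): slope
    beta1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Equation (8): intercept
    beta0 = sum_y / n - beta1 * sum_x / n
    return beta0, beta1


beta0, beta1 = least_squares(length, wingspan)
print(round(beta0, 5), round(beta1, 6))  # 11.12534 1.438692
```

The same function can be applied to the tornado data of Example 2 by passing the years as `xs` and the tornado counts as `ys`.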

Example 8 Using data given in example 2, find the least squares estimates for the
linear regression coefficients.

Solution

We computed the following quantities:

$n = 10$, $\bar{x} = 1972.5$, $\bar{y} = 779.7$, $\sum_{i=1}^{10} x_i = 19725$,
$\sum_{i=1}^{10} y_i = 7797$, $\sum_{i=1}^{10} x_i y_i = 15413555$ and
$\sum_{i=1}^{10} x_i^2 = 38909625$.

We then substituted the computed quantities in equations (8) and (9) and we
obtained:
β̂1 = 16.47152, β̂0 = −31710.37 and the fitted linear regression equation is
ŷ = −31710.37 + 16.47152x.

Exercises

1. Solve the two equations (6) and (7) simultaneously and obtain the
expressions for β̂0 and β̂1.

2. The following is a summary of quantities for data on compressive strength
(x) and intrinsic permeability (y) of various concrete mixes and cures:
$n = 14$, $\sum y_i = 572$, $\sum y_i^2 = 23530$, $\sum x_i = 43$,
$\sum x_i^2 = 157.42$ and $\sum x_i y_i = 1697.80$.

Assume that the two variables are related according to the simple linear
regression model.

(a) Calculate the least squares estimates of the slope and intercept.
(b) Use the equation of the fitted line to predict what permeability would
be observed when the compressive strength is x = 4.3
(c) Give a point estimate of the permeability when compressive strength
is x = 3.7.
(d) Suppose that the observed value of permeability at x = 3.7 is y = 46.1.
Calculate the value of the corresponding residual.

3. A researcher wants to find out whether there is a relationship between
the heights of daughters and the heights of their fathers. He took a random
sample of 6 fathers and their 6 daughters. Their heights in inches are given
in the table below.

Father (x) 63 65 66 67 67 68
Daughter (y) 66 68 65 67 69 70

(a) Draw a scatter diagram for this data and comment on its trend.
(b) Fit the least squares regression line for the data.
(c) Predict the height of a daughter whose father’s height is 70 inches.

4. Regression methods were used to analyse the data from a study
investigating the relationship between roadway surface temperature (x) and
pavement deflection (y). Summary quantities were:
$n = 20$, $\sum y_i = 12.75$, $\sum y_i^2 = 8.86$, $\sum x_i = 1478$,
$\sum x_i^2 = 143215.8$ and $\sum x_i y_i = 1083.67$.

(a) Calculate the least squares estimates of the slope and intercept.
(b) Use the equation of the fitted line to predict what pavement deflection
would be observed when the surface temperature is 85 °F.

5. In a study of the relationship between the amount of rainfall and the
quantity of air pollution removed, the following data were collected:

x 4.3 4.5 5.9 5.6 6.1 5.2 3.8 2.1 7.5
y 126 121 116 118 114 118 132 141 108

Daily rainfall (x) is in units of 0.01 cm and particulate removed (y) is in
micrograms/m³.

(a) Find the equation of the regression line to predict the particulate
removed from the amount of daily rainfall.
(b) Estimate the amount of particulate removed when the daily rainfall
is x = 4.8 units.

6. The following data were collected from 10 university students who had
just finished their 3 year course. The aim was to determine whether
there is a relationship between points obtained at A-level (x) and the final
cumulative grade point average (CGPA) represented by y. CGPA for a
student on normal progress is defined in the interval [2.00, 5.00], while
the maximum points at A-level is 20.

x 11 20 16 15 8 19 14 12 11 9
y 3.40 4.32 4.61 3.62 2.40 3.82 3.52 3.52 3.50 2.87

(a) Find the equation of the regression line to predict the CGPA of
someone who has just finished a three year course at the university.
(b) Predict the CGPA for someone who scored 15 points at A-level.

7. The following are marks obtained by 12 students in Elements of Probability
and Statistics assessment for course work (x) and final examination (y).
The course work was marked out of 30, while the final examination was
marked out of 70.

x 23 15 21 22 24 28 29 30 20 14 27 10
y 57 46 55 24 33 60 69 69 48 41 62 40

(a) Plot a scatter diagram for the data.
(b) Estimate the linear regression line.
(c) Estimate the final mark for a student who scored 18 marks in the
course work.

8. An article in the Tappi Journal (March, 1986) presented data on green liquor
Na2 S concentration (in grams per litre) and paper machine production (in
tons per day). The data (read from a graph) are shown as follows:

y 40 42 49 46 44 48 46 43 53 52 54 57 58
x 825 830 890 895 890 910 915 960 990 1010 1012 1030 1050

(a) Draw a scatter diagram of the data.
(b) Fit a simple linear regression model with y as green liquor (Na2S)
concentration and x as production using the least squares method.
(c) Find the fitted value of y corresponding to x = 60 and the associated
residual.

2.2 Coefficient of determination, R2


The coefficient of determination is the portion of the total variation in the
dependent variable that is explained by variation in the independent variable.
The coefficient of determination is also called R-squared and is denoted as R²,
given by

$$ R^2 = \frac{SSR}{SST}, $$

where 0 ≤ R² ≤ 1, SSR is the sum of squares explained by regression and
SST is the total sum of squares.
Note that in the single independent variable case, the coefficient of determination
is

$$ R^2 = r^2, $$
where

• R2 is coefficient of determination, and

• r is the simple correlation coefficient.

Total variation is made up of two parts:

$$ SST = SSE + SSR $$

Total sum of squares:       $SST = \sum (y - \bar{y})^2$
Sum of squares error:       $SSE = \sum (y - \hat{y})^2$
Sum of squares regression:  $SSR = \sum (\hat{y} - \bar{y})^2$

where:

• ȳ= Average value of the dependent variable,


• y = Observed values of the dependent variable,
• ŷ= Estimated value of y for the given x value,

• SST = total sum of squares; Measures the variation of the yi values around
their mean ȳ,
• SSE = error sum of squares, and is the variation attributable to factors
other than the relationship between x and y, and
• SSR = regression sum of squares– the explained variation attributable to
the relationship between x and y

Example 9 Using the data given in Example 3, evaluate R² and establish whether the
correlation coefficient computed in Example 3 is indeed equal in magnitude to √R².

Solution

From the calculations, the following were obtained:

ŷ = 17.318 − 0.5496x, SSR = 5.663125, SST = 38.51333.
It follows that R² = 0.1470432.
Since we obtained r = −0.3834 in Example 3, we have r² ≈ 0.1470, which agrees
with R².
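
The decomposition SST = SSE + SSR and the identity R² = r² can be checked numerically for the data of Example 3. The following is a sketch under the assumption that the line is fitted by least squares as in section 2.1.2; the variable names are illustrative. Small differences from the quoted values in the later decimal places can arise from rounding.

```python
from math import sqrt

# Data from Example 3 (mg/dL)
x = [10.2, 12.7, 8.6, 7.5, 11.2, 11.5]
y = [10.5, 11.9, 8.7, 16.9, 10.9, 11.1]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit of y_hat = beta0 + beta1 * x
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar
y_hat = [beta0 + beta1 * xi for xi in x]

# Sums of squares
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)

r_squared = ssr / sst
r = s_xy / sqrt(s_xx * sst)

print(round(beta0, 3), round(beta1, 4))      # 17.318 -0.5496
print(round(r_squared, 3), round(r ** 2, 3))  # 0.147 0.147
```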
