Topic_13_Correlation_and_Simple_Linear_Regression
Kayondo c 1
1 Correlation
Correlation analysis measures the strength of the linear association between
two variables. It is concerned only with the strength of the relationship; no
causal effect is implied.
The relationship can be visualised with a scatter plot and quantified by
computing the correlation coefficient.
A scatter plot is a graph that relates two groups of data. These two groups of
data are plotted as ordered pairs to create the graph.
Length (in)     21  21  18  24  16  19  17  19
Wingspan (in)   36  41  38  46  31  39  35  46
Draw a scatter plot and fit a trend line to the data points.
Solution
The scatter plot with a trend line is shown in Figure 1. Section 2.1 shows how
to fit such a line using the least squares method. The equation of the trend
line is y = 11.1253 + 1.4387x, where x and y are the length and wingspan of the
hawks in inches, respectively. The slope of the trend line is 1.4387 and its
intercept is 11.1253.
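The quoted slope and intercept can be checked directly from the table. The following is a minimal sketch, assuming the trend line is the ordinary least squares line; the function name `fit_line` is ours, not part of the notes.

```python
# Hawk data from the table above (inches).
length = [21, 21, 18, 24, 16, 19, 17, 19]
wingspan = [36, 41, 38, 46, 31, 39, 35, 46]

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of products and squares of deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

intercept, slope = fit_line(length, wingspan)
print(f"y = {intercept:.4f} + {slope:.4f}x")  # y = 11.1253 + 1.4387x
```

The output agrees with the trend line stated in the solution to four decimal places.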
Year               1950  1955  1960  1965  1970  1975  1980  1985  1990  1995
No. of tornadoes    201   593   616   897   654   919   866   684  1133  1234
(a) Draw a scatter plot for the data given and fit the best trend line to the points.
(b) Use the line of best fit to predict how many tornadoes may be reported in the
United States in 2015 if the trend continues.
Solution
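One possible way to carry out part (b) is sketched below. It assumes the fifth year in the table is 1970 (the years run in steps of five) and that the line of best fit is the ordinary least squares line; the variable names are illustrative only.

```python
years = [1950, 1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995]
tornadoes = [201, 593, 616, 897, 654, 919, 866, 684, 1133, 1234]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(tornadoes) / n

# Least squares slope (tornadoes per year) and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, tornadoes))
s_xx = sum((x - x_bar) ** 2 for x in years)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Extrapolate the fitted line to 2015 (valid only if the trend continues).
predicted_2015 = intercept + slope * 2015
print(round(slope, 2), round(predicted_2015))  # roughly 16.47 per year, about 1480
```

Because 2015 lies well outside the observed range of years, the prediction is an extrapolation and should be treated with caution.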
is little or no correlation. If the data points are close to the line of best
fit, the data are said to have a strong correlation. Examples of scatter plots
are shown in Figures 3, 4 and 5.
4 Engineering Mathematics IV– EMT 2201
where
• r = Sample correlation coefficient,
• n = Sample size,
• x = Value of the independent variable,
• y = Value of the dependent variable.
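With the symbols defined above, the sample correlation coefficient can be computed as in the sketch below. It assumes the usual computational form of the Pearson coefficient, r = [nΣxy − ΣxΣy] / sqrt([nΣx² − (Σx)²][nΣy² − (Σy)²]), and applies it to the hawk data of Example 1 purely as an illustration; the function name `pearson_r` is ours.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample (Pearson) correlation coefficient from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hawk length (x) and wingspan (y) in inches, from Example 1.
length = [21, 21, 18, 24, 16, 19, 17, 19]
wingspan = [36, 41, 38, 46, 31, 39, 35, 46]

r = pearson_r(length, wingspan)
print(round(r, 4))  # 0.7032
```

A value of about 0.70 indicates a moderately strong positive linear association between length and wingspan.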
Figure 6 illustrates examples of various r values with the fitted straight lines.
Example 3 You are developing a new analytical method for the determination of blood
urea nitrogen (BUN). You want to determine whether your method differs significantly
from a standard one for analyzing a range of sample concentrations expected to be found
in the routine laboratory. It has been ascertained that the two methods have comparable
precisions. The table below shows the two sets of results for a number of
individual samples.
Calculate the correlation coefficient for the data taking your method as x and
the standard method as y.
Solution
Exercise
Remark
When the term correlation coefficient is used without further qualification,
it usually refers to the Pearson product-moment correlation coefficient. Other
correlation coefficients exist, for example Spearman's rank correlation
coefficient, Kendall's rank correlation coefficient, and Goodman and Kruskal's
gamma coefficient, among others.
• has only one independent variable (x) and one dependent variable (y).
Remark
A dependent variable is the one we wish to explain and an independent variable
Example 4 Suppose we measure a distance four times and obtain the following results:
72, 69, 70 and 73 units. What is the best estimate of the correct measurement?
Let us denote the estimate of the true measurement by x, and form the deviations
(errors) from x, namely: x − 72, x − 69, x − 70, and x − 73.
Let S be the sum of the squares of these errors, i.e.
\[ S = (x - 72)^2 + (x - 69)^2 + (x - 70)^2 + (x - 73)^2. \]
We seek the value of x that minimises the value of S. We can simplify S through
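As a numerical check on Example 4, setting dS/dx = 2[(x − 72) + (x − 69) + (x − 70) + (x − 73)] = 0 gives x equal to the arithmetic mean of the four measurements. The sketch below, with illustrative names of our own choosing, evaluates S at the mean and at a few nearby candidates to confirm that the mean gives the smallest value.

```python
measurements = [72, 69, 70, 73]

def S(x):
    """Sum of squared errors between the estimate x and the measurements."""
    return sum((x - m) ** 2 for m in measurements)

mean = sum(measurements) / len(measurements)

# Compare S at the mean with S at nearby candidate estimates (69.0 to 73.0).
candidates = [mean + 0.5 * k for k in range(-4, 5)]
best = min(candidates, key=S)

print(mean, S(mean))  # 71.0 10.0
print(best == mean)   # True
```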
Let’s illustrate the least squares method with a simple example below.
Example 5 Problem: Given these measurements of the two quantities x and y, find
$y_7$. Due to random errors in the measurements, the ordered pairs $(x_i, y_i)$ do
not lie on a straight line. Assume the values can be approximated by the linear
function $\hat{y} = ax + b$.
Let us call the deviations (errors) $d_i = \hat{y}_i - y_i = a x_i + b - y_i$ for $i = 1, 2, \ldots, 6$.
i      1    2    3    4    5    6    7
x_i    2    4    6    8   10   12   14
y_i    2    4    4    5    5    7    ?
Let’s solve the problem algebraically by finding the sum of the squares of the errors and
minimising it:
We now find the partial derivative of S with respect to a. This means that we differentiate
S with respect to a, treating b as if it were a constant. As with one variable, we set the
derivative equal to zero. This gives
We also find the partial derivative of S with respect to b. This means that we differentiate
S with respect to b, treating a as if it were a constant, and set the derivative equal to
zero. This gives
14a + 2b = 9. (2)
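The two conditions can be recomputed from the data and solved exactly. In the sketch below the first equation, 364a + 42b = 218, is our own reconstruction from setting ∂S/∂a = 0 (the form printed in the notes may differ by a common factor); the second reduces to equation (2).

```python
from fractions import Fraction

x = [2, 4, 6, 8, 10, 12]
y = [2, 4, 4, 5, 5, 7]
n = len(x)

sum_x = sum(x)                                   # 42
sum_y = sum(y)                                   # 27
sum_x2 = sum(xi * xi for xi in x)                # 364
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # 218

# dS/da = 0:  a*sum_x2 + b*sum_x = sum_xy   ->  364a + 42b = 218
# dS/db = 0:  a*sum_x  + b*n     = sum_y    ->   42a +  6b = 27, i.e. 14a + 2b = 9
# Solve the 2x2 linear system by Cramer's rule, using exact fractions.
det = sum_x2 * n - sum_x * sum_x
a = Fraction(sum_xy * n - sum_x * sum_y, det)
b = Fraction(sum_x2 * sum_y - sum_x * sum_xy, det)

y7 = a * 14 + b
print(a, b, float(y7))  # 29/70 8/5 7.4
```

Exact fractions are used here so that the result can be compared directly with the gradient 29/70 obtained later in the notes.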
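We can find the line of best fit “graphically” by using curve-fitting
software, e.g. Excel’s trendline. Excel in essence calculates a and b using
these formulae:
\[ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \quad\text{and}\quad a = \bar{y} - b\bar{x}. \]
Note that in these formulae b is the slope and a is the intercept, i.e. the
roles of a and b are swapped relative to the model ŷ = ax + b used in Example 5.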
Data          Model
x     y       ŷ = ax + b
2     2       2.42857
4     4       3.25714
6     4       4.08571
8     5       4.91429
10    5       5.74286
12    7       6.57143
14    ??      7.4
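The model column of the table can be reproduced with the deviation-form formulae, slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope·x̄. This is a sketch only: it does not call Excel, and the names `slope` and `intercept` are used instead of a and b to avoid confusion over which letter denotes which.

```python
x = [2, 4, 6, 8, 10, 12]
y = [2, 4, 4, 5, 5, 7]

n = len(x)
x_bar = sum(x) / n   # 7.0
y_bar = sum(y) / n   # 4.5

slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Fitted values at the six data points and the prediction at x = 14.
fitted = [round(intercept + slope * xi, 5) for xi in x + [14]]
for xi, fi in zip(x + [14], fitted):
    print(xi, fi)
```

The printed values match the model column above, ending with the prediction 7.4 at x = 14.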
Theorem: The Least Squares Model for a set of data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
passes through the point $(x_a, y_a)$, where $x_a$ is the average of the $x_i$'s and $y_a$ is the
average of the $y_i$'s.
Using this theorem enables us to simplify our method of calculating the
parameters a and b of our line of best fit, $\hat{y} = ax + b$.
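A brief justification of the theorem, offered as a sketch rather than a full proof: it uses the condition that the partial derivative of S with respect to b vanishes, as in Example 5, and writes $\bar{x}$ and $\bar{y}$ for the averages (our notation for the point called $(x_a, y_a)$ above).

```latex
\begin{align*}
\frac{\partial S}{\partial b}
  = 2\sum_{i=1}^{n} (a x_i + b - y_i) = 0
  &\;\Longrightarrow\; a\sum_{i=1}^{n} x_i + n b = \sum_{i=1}^{n} y_i \\
  &\;\Longrightarrow\; a\bar{x} + b = \bar{y}.
\end{align*}
```

The last line says that the point of averages satisfies the equation of the fitted line, which is the claim of the theorem.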
Calculating the averages of the x-values and of the y-values gives the point
(7, 4.5). We know that this point lies on the line of best fit, and therefore the
ordered pair satisfies the equation:
\[ 4.5 = 7a + b, \quad\text{i.e.}\quad b = 4.5 - 7a. \]
So we can now express the equation of the line of best fit in terms of the
gradient a only, as:
\[ \hat{y} = ax + 4.5 - 7a. \]
The following table shows the data points (x and y), the model in terms of a,
the error of the fitted model at each data point ($y - \hat{y}$) and the squared
errors ($(y - \hat{y})^2$).
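From the table, the sum of squared errors (S) in terms of a simplifies to:
\[ S = 70a^2 - 58a + 13.5. \]
Completing the square,
\[ S = 70\left(a^2 - \tfrac{58}{70}a\right) + 13.5
     = 70\left(a - \tfrac{58}{140}\right)^2 + 13.5 - 70\left(\tfrac{58}{140}\right)^2, \]
which is smallest when
\[ a = \tfrac{58}{140} = \tfrac{29}{70}. \]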
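Data          Model
x     y       ŷ = ax + b     y − ŷ        (y − ŷ)^2
2     2       4.5 − 5a       5a − 2.5     25a^2 − 25a + 6.25
4     4       4.5 − 3a       3a − 0.5     9a^2 − 3a + 0.25
6     4       4.5 − a        a − 0.5      a^2 − a + 0.25
8     5       4.5 + a        0.5 − a      a^2 − a + 0.25
10    5       4.5 + 3a       0.5 − 3a     9a^2 − 3a + 0.25
12    7       4.5 + 5a       2.5 − 5a     25a^2 − 25a + 6.25

\[ y_0 = \tfrac{29}{70}x + 1.6, \quad\text{and}\quad \Rightarrow\; y_0(14) = \tfrac{29}{70} \times 14 + 1.6 = 7.4. \]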
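The quadratic S = 70a² − 58a + 13.5 and its minimiser can be recomputed from the data. The sketch below assumes, as the theorem states, that the line passes through the point of averages (7, 4.5), so each error is (y − 4.5) − a(x − 7); the names A, B and C for the coefficients are ours.

```python
from fractions import Fraction

x = [2, 4, 6, 8, 10, 12]
y = [2, 4, 4, 5, 5, 7]
n = len(x)

x_bar = Fraction(sum(x), n)   # 7
y_bar = Fraction(sum(y), n)   # 9/2

dx = [xi - x_bar for xi in x]
dy = [yi - y_bar for yi in y]

# S(a) = sum((dy_i - a*dx_i)^2) = A*a^2 + B*a + C
A = sum(d * d for d in dx)                      # 70
B = -2 * sum(p * q for p, q in zip(dx, dy))     # -58
C = sum(q * q for q in dy)                      # 27/2 = 13.5

# Vertex of the parabola: the minimiser and the minimum value of S.
a_min = -B / (2 * A)
S_min = C - B * B / (4 * A)

b = y_bar - a_min * x_bar
print(A, B, C, a_min, b, a_min * 14 + b)  # 70 -58 27/2 29/70 8/5 37/5
```

The prediction 37/5 = 7.4 at x = 14 agrees with the value obtained from the two normal equations in Example 5.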
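\[ \frac{\partial S}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\,x_i = 0 \tag{5} \]
\[ \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \tag{7} \]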
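Equations (6) and (7) are called the least squares normal equations.
From equation (6), the estimate for $\hat{\beta}_0$ is:
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{8} \]
where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.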
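Substituting equation (8) into equation (7) and simplifying gives the estimate
for $\hat{\beta}_1$ as:
\[ \hat{\beta}_1 = \frac{\displaystyle\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \tag{9} \]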
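Equations (8) and (9) translate directly into a short routine working from raw sums. The sketch below is illustrative (the function name `least_squares` is ours) and applies the formulas to the hawk data of Example 1, where the result can be compared with the trend line quoted there.

```python
def least_squares(x, y):
    """Return (beta0_hat, beta1_hat) using the raw-sum formulas (8) and (9)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Equation (9): slope estimate.
    beta1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Equation (8): intercept estimate, y_bar - beta1 * x_bar.
    beta0 = sum_y / n - beta1 * sum_x / n
    return beta0, beta1

# Hawk length (x) and wingspan (y) in inches, from Example 1.
length = [21, 21, 18, 24, 16, 19, 17, 19]
wingspan = [36, 41, 38, 46, 31, 39, 35, 46]

beta0, beta1 = least_squares(length, wingspan)
print(round(beta0, 4), round(beta1, 4))  # 11.1253 1.4387
```

The raw-sum form avoids computing deviations from the means first, but for large values it can lose precision through subtraction of nearly equal numbers; the deviation form is numerically safer in that case.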
Example 7 Using the data given in Example 1, find the least squares estimates for the
linear regression coefficients.
Solution
Example 8 Using the data given in Example 2, find the least squares estimates for the
linear regression coefficients.
Solution
Exercises
1. Solve the two equations (6) and (7) simultaneously and obtain the
expressions for $\hat{\beta}_0$ and $\hat{\beta}_1$.
Assume that the two variables are related according to the simple linear
regression model.
(a) Calculate the least squares estimates of the slope and intercept.
(b) Use the equation of the fitted line to predict what permeability would
be observed when the compressive strength is x = 4.3.
(c) Give a point estimate of the permeability when compressive strength
is x = 3.7.
(d) Suppose that the observed value of permeability at x = 3.7 is y = 46.1.
Calculate the value of the corresponding residual.
Father (x) 63 65 66 67 67 68
Daughter (y) 66 68 65 67 69 70
(a) Draw a scatter diagram for this data and comment on its trend.
(b) Fit the least squares regression line for the data.
(c) Predict the height of a daughter whose father’s height is 70 inches.
4. Regression methods were used to analyse the data from a study
investigating the relationship between roadway surface temperature (x) and
pavement deflection (y). Summary quantities were:
$n = 20$, $\sum y_i = 12.75$, $\sum y_i^2 = 8.86$, $\sum x_i = 1478$,
$\sum x_i^2 = 143215.8$ and $\sum x_i y_i = 1083.67$.
(a) Calculate the least squares estimates of the slope and intercept.
(b) Use the equation of the fitted line to predict what pavement deflection
would be observed when the surface temperature is 85 °F.
(a) Find the equation of the regression line to predict the particulate
removed from the amount of daily rainfall.
(b) Estimate the amount of particulate removed when the daily rainfall
is x = 4.8 units.
6. The following data were collected from 10 university students who had
just finished their three-year course. The aim was to determine whether
there is a relationship between points obtained at A-level (x) and the final
cumulative grade point average (CGPA) represented by y. CGPA for a
student on normal progress is defined in the interval [2.00, 5.00], while
the maximum points at A-level is 20.
x 11 20 16 15 8 19 14 12 11 9
y 3.40 4.32 4.61 3.62 2.40 3.82 3.52 3.52 3.50 2.87
(a) Find the equation of the regression line to predict the CGPA of
someone who has just finished a three-year course at the university.
(b) Predict the CGPA for someone who scored 15 points at A-level.
x 23 15 21 22 24 28 29 30 20 14 27 10
y 57 46 55 24 33 60 69 69 48 41 62 40
(c) Estimate the final mark for a student who scored 18 marks in the
coursework.
8. An article in the Tappi Journal (March, 1986) presented data on green liquor
Na₂S concentration (in grams per litre) and paper machine production (in
tons per day). The data (read from a graph) are shown as follows:
y 40 42 49 46 44 48 46 43 53 52 54 57 58
x 825 830 890 895 890 910 915 960 990 1010 1012 1030 1050
where:
• SST = total sum of squares, which measures the variation of the $y_i$ values
around their mean ȳ,
• SSE = error sum of squares, the variation attributable to factors
other than the relationship between x and y, and
• SSR = regression sum of squares, the explained variation attributable to
the relationship between x and y.
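The three sums of squares can be computed as in the sketch below. It assumes the standard definitions SST = Σ(y_i − ȳ)², SSE = Σ(y_i − ŷ_i)², SSR = Σ(ŷ_i − ȳ)² and R² = SSR/SST, and uses the hawk data of Example 1 for illustration, since that data set is fully available in these notes.

```python
from math import sqrt

# Hawk length (x) and wingspan (y) in inches, from Example 1.
x = [21, 21, 18, 24, 16, 19, 17, 19]
y = [36, 41, 38, 46, 31, 39, 35, 46]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares fit and fitted values.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

sst = s_yy                                            # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation

r_squared = ssr / sst
r = s_xy / sqrt(s_xx * s_yy)   # Pearson correlation coefficient

print(round(sst, 4), round(ssr + sse, 4))        # both 192.0: SST = SSR + SSE
print(round(r_squared, 4), round(r ** 2, 4))     # both 0.4946
```

For simple linear regression the square of the Pearson correlation coefficient equals R², which is the relationship Example 9 asks the reader to verify.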
Example 9 Using the data given in problem 3, evaluate $R^2$ and establish whether the
correlation coefficient computed in Example 3 is indeed $\sqrt{R^2}$.
Solution