Bio2 Module 1 - Simple Linear Regression and Correlation
Biostatistics II - Module 1
Getu Degu
November 2008
Simple linear regression and correlation
[Figure: Relationship between heights of fathers and their oldest sons.
x-axis: Heights of fathers (inches); y-axis: Heights of oldest sons (inches)]
A) Simple linear regression
Regression is a method of estimating the numerical
relationship between variables. For example, we
may want to know the mean (expected) weight of
factory workers of a given height, and what
increase in weight is associated with a unit
increase in height.
The method of least squares
The values ‘a’ and ‘b’ in the regression equation
Y = a + bX are constants, i.e., their values are fixed.
The constant ‘a’ is the value of Y when X = 0. It is
also called the Y intercept. The value of ‘b’ is the
slope of the regression line and measures the change
in Y for a unit change in X.
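As a small illustration of these two interpretations, the sketch below evaluates a line at a few points. The constants a = 2.0 and b = 0.5 and the function name `predict` are assumptions chosen only for illustration; they are not values from the lecture.

```python
# Hypothetical constants, chosen only to illustrate the roles of a and b.
a = 2.0   # intercept: value of y when x = 0
b = 0.5   # slope: change in y for a unit change in x

def predict(x):
    """Return the value of y on the line y = a + b*x."""
    return a + b * x

intercept_value = predict(0)             # equals a
unit_change = predict(11) - predict(10)  # equals b

print(intercept_value, unit_change)
```

Whatever pair of x values one unit apart is used, the difference in y is the slope b.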
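The constants ‘a’ and ‘b’ are determined by solving
the following equations (normal equations) simultaneously:
$$\sum Y = an + b\sum X$$
$$\sum XY = a\sum X + b\sum X^2$$
Solving them gives:
$$a = \bar{Y} - b\bar{X}$$
$$b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$$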
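The solution of the normal equations can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own code: the function name `least_squares` and the toy points (which lie exactly on y = 1 + 2x) are assumptions made for illustration.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x fitted by least squares."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope obtained by eliminating a between the two normal equations.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept from the first normal equation: sum_y = a*n + b*sum_x.
    a = (sum_y - b * sum_x) / n
    return a, b

# Toy points lying exactly on y = 1 + 2x (illustrative, not lecture data).
toy_x = [0, 1, 2, 3]
toy_y = [1, 3, 5, 7]
a, b = least_squares(toy_x, toy_y)
print(a, b)
```

Because the toy points fall exactly on a line, the fitted constants reproduce that line; with real data the line is the one that minimises the sum of squared vertical deviations.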
Example: The heights of 10 fathers (X) and of their
oldest sons (Y) are given below (in inches). Find the
regression of Y on X.
X     Y     XY      X²
63    65    4095    3969
64    67    4288    4096
70    69    4830    4900
72    70    5040    5184
65    64    4160    4225
67    68    4556    4489
68    71    4828    4624
66    63    4158    4356
70    70    4900    4900
71    72    5112    5041
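$$a = \bar{Y} - b\bar{X}$$
$$b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$$
With $n = 10$, $\sum X = 676$, $\sum Y = 679$, $\sum XY = 45967$ and $\sum X^2 = 45784$:
$$b = \frac{10(45967) - (676)(679)}{10(45784) - (676)^2} = \frac{459670 - 459004}{457840 - 456976} = \frac{666}{864} \approx 0.77$$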
Estimate the height of the oldest son for a father’s
height of 70 inches.
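One way to work through this estimate is sketched below, using the ten father and son heights from the example. The variable names are assumptions, and the sketch carries the unrounded slope through the calculation, so the intercept can differ slightly from a hand calculation that uses b = 0.77.

```python
# Heights in inches, taken from the example table.
fathers = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]  # X
sons = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]     # Y

n = len(fathers)
sum_x = sum(fathers)
sum_y = sum(sons)
sum_xy = sum(x * y for x, y in zip(fathers, sons))
sum_x2 = sum(x * x for x in fathers)

# Slope and intercept of the regression of Y on X.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Estimated height of the oldest son when the father is 70 inches tall.
predicted_son = a + b * 70

print(round(b, 2), round(a, 2), round(predicted_son, 2))
```

Under these assumptions the fitted line predicts a son's height of about 69.75 inches for a father who is 70 inches tall.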
Explained, unexplained (error), total variations
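The variation of the Y values about their mean can also
be computed. The quantity $\sum (Y - \bar{Y})^2$ is called the total
variation. It splits into the explained variation
$\sum (\hat{Y} - \bar{Y})^2$ and the unexplained (error) variation
$\sum (Y - \hat{Y})^2$, where $\hat{Y}$ is the value of Y predicted by
the regression line.
$$r^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$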
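The split of the total variation can be checked numerically. The sketch below assumes the father and son heights from the earlier example and uses `fitted` for the values predicted by the regression line; these names are illustrative, not the lecture's notation.

```python
# Heights in inches, taken from the father and son example.
fathers = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]  # X
sons = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]     # Y

n = len(fathers)
mean_x = sum(fathers) / n
mean_y = sum(sons) / n

# Least-squares slope and intercept in deviation form.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fathers, sons))
sxx = sum((x - mean_x) ** 2 for x in fathers)
b = sxy / sxx
a = mean_y - b * mean_x

fitted = [a + b * x for x in fathers]  # predicted Y for each X

total = sum((y - mean_y) ** 2 for y in sons)
explained = sum((f - mean_y) ** 2 for f in fitted)
unexplained = sum((y - f) ** 2 for y, f in zip(sons, fitted))

r_squared = explained / total
print(round(total, 4), round(explained, 4), round(unexplained, 4))
print(round(r_squared, 4))
```

For a least-squares line the explained and unexplained parts add up to the total variation, so r² is the share of the variation in Y accounted for by the regression on X.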
B) Linear correlation (Karl Pearson’s coefficient of
linear correlation): measures the degree of linear
correlation between two variables (e.g. X and Y).
This correlation coefficient is a pure number,
independent of the units in which the variables are
expressed. It also tells us whether the slope of the
regression line is positive or negative.
Properties
1) -1 ≤ r ≤ 1
2) r is a pure number without any unit
3) If r is close to 1, there is a strong positive
relationship
4) If r is close to -1, there is a strong negative
relationship
5) If r = 0, there is no linear correlation
For the heights of fathers and their oldest sons:
r = 0.7776 ≈ 0.78
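The value of r quoted here can be reproduced from the father and son heights. The formula for r did not survive in this section, so the sketch below assumes the standard product-moment formula; the function name `pearson_r` is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Return Pearson's coefficient of linear correlation for paired data."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Heights in inches, taken from the father and son example.
fathers = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
sons = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]

r = pearson_r(fathers, sons)
print(round(r, 4))
```

The numerator is the same as in the formula for the slope b, which is why r and b always share the same sign.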
Rank correlation coefficient
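$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the two ranks of
the i-th pair and $n$ is the number of pairs.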
Example
$\sum d_i^2 = 10$, $n = 6$.
$$r_s = 1 - \frac{6(10)}{6(6^2 - 1)} = 1 - \frac{60}{6(35)} = 1 - \frac{60}{210} = 1 - 0.29 = 0.71$$
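The same calculation can be sketched in code. The ranks in the example are not shown in this section, so the two rank lists below are hypothetical, chosen only so that the squared rank differences sum to 10 with n = 6. The function name `spearman_from_ranks` is also an assumption, and the formula applies when there are no tied ranks.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Return the rank correlation coefficient from two lists of ranks."""
    n = len(rank_x)
    # Sum of squared differences between paired ranks.
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical ranks (not from the lecture) giving sum of d squared = 10.
rank_x = [1, 2, 3, 4, 5, 6]
rank_y = [3, 2, 1, 4, 6, 5]

rs = spearman_from_ranks(rank_x, rank_y)
print(round(rs, 2))
```

Identical rankings give a coefficient of 1 and exactly reversed rankings give -1, mirroring the range of Pearson's r.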
Spurious correlation
What do you think about the correlation coefficient (r)
of 0.9 between the amount of rainfall in Canada and
the maize production in Ethiopia from 1990 to 2000?
Assume the yearly data of the amount of rainfall and
maize production for the years 1990 to 2000 are
available.
Exercise 5
Data on FEV1 (forced expiratory volume in one
second) (Y) and height (X) of 20 male medical
students are given below: