Correlation and regression
By R. Meredith Chelsea
Assistant Professor
PG & Research Department of International Business
SRCAS
Topics Covered:
• Is there a relationship between x and y?
• What is the strength of this relationship? → Pearson's r
• Can we describe this relationship and use it to predict y from x? → Regression
• Is the relationship we have described statistically significant? → t-test
• Relevance to SPM → GLM
The relationship between x and y
• Correlation: is there a relationship between two variables?
• Regression: how well does a certain independent variable predict the dependent variable?

CORRELATION ≠ CAUSATION
In order to infer causality: manipulate the independent variable and observe the effect on the dependent variable.
Scattergrams

[Figure: six example scatterplots of y against x]
Variance
• Gives information on the variability of a single variable.

  s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

Covariance
• Gives information on the degree to which two variables vary together.
• Note how similar the covariance is to the variance: the equation simply multiplies x's error scores by y's error scores, as opposed to squaring x's error scores.

  \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
Covariance

  \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

• When X and Y both increase (or both decrease): cov(x, y) is positive.
• When X increases while Y decreases (or vice versa): cov(x, y) is negative.
• When there is no constant relationship: cov(x, y) = 0.
Example: Covariance

   x    y    x_i - x̄    y_i - ȳ    (x_i - x̄)(y_i - ȳ)
   0    3      -3          0               0
   2    2      -1         -1               1
   3    4       0          1               0
   4    0       1         -3              -3
   6    6       3          3               9

  x̄ = 3    ȳ = 3    Σ = 7

[Figure: scatterplot of the five (x, y) points]

  \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{7}{4} = 1.75

What does this number tell us?
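To make the arithmetic concrete, here is a minimal Python sketch (not part of the original slides) that reproduces the covariance of 1.75 from the table above:

```python
# Worked-example data from the table above
x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]

n = len(x)
x_bar = sum(x) / n                      # x̄ = 3
y_bar = sum(y) / n                      # ȳ = 3

# Sum of products of the paired deviation (error) scores
cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

cov_xy = cross / (n - 1)                # divide by n - 1, as in the formula
print(cov_xy)                           # 1.75
```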
Problem with Covariance:
The value obtained from covariance depends on the size of the data's standard deviations: if they are large, the value will be greater than if they are small… even if the relationship between x and y is exactly the same in the large and small standard deviation datasets.
Example of how the covariance value relies on variance

[Figure: two scatterplots showing the same x-y relationship in high variance data and in low variance data]

Solution: standardise the covariance by the standard deviations of x and y. This gives Pearson's r:

  r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x s_y}
Pearson's r continued

  \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
  \qquad\Rightarrow\qquad
  r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}

Equivalently, in terms of z-scores:

  r_{xy} = \frac{\sum_{i=1}^{n} Z_{x_i} Z_{y_i}}{n - 1}
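Both forms give the same value. A short Python sketch (illustrative, not from the slides; `statistics.correlation` needs Python 3.10+) checks this on the earlier example data:

```python
import statistics

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]
n = len(x)

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)      # sample SDs (n - 1)

# Form 1: covariance divided by the product of standard deviations
cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
r_from_cov = cov_xy / (s_x * s_y)

# Form 2: sum of cross-products of z-scores, divided by n - 1
r_from_z = sum(((a - x_bar) / s_x) * ((b - y_bar) / s_y)
               for a, b in zip(x, y)) / (n - 1)

print(r_from_cov, r_from_z)                  # both 0.35
print(statistics.correlation(x, y))          # library check, also 0.35
```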
Limitations of r

When r = 1 or r = -1:
• We can predict y from x with certainty.
• All data points lie on a straight line: y = ax + b.

r is actually r̂:
• r = the true r of the whole population
• r̂ = an estimate of r based on the sample data

r is very sensitive to extreme values:

[Figure: scatterplot showing how a single extreme value can produce a large r]
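This sensitivity is easy to demonstrate. A small Python sketch (illustrative data, not from the slides; `statistics.correlation` needs Python 3.10+) shows one extreme point pulling r toward 1:

```python
import statistics

# Weakly related data
x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 2, 3]
print(statistics.correlation(x, y))          # ≈ 0.57, a modest r

# The same data plus one extreme point
x_out = x + [20]
y_out = y + [20]
print(statistics.correlation(x_out, y_out))  # ≈ 0.99, dominated by the outlier
```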
Regression

Correlation tells you whether there is an association between x and y, but it doesn't describe the relationship or allow you to predict one variable from the other.

[Figure: scatterplot with a fitted line, showing ŷ (predicted value), y_i (true value), and ε (residual error)]
Least Squares Regression
To find the best line we must minimise the sum of
the squares of the residuals (the vertical distances
from the data points to our line)
• Model line: ŷ = ax + b   (a = slope, b = intercept)
• Residual: ε = y - ŷ
• Sum of squares of residuals: Σ(y - ŷ)²
Finding the intercept: rearranging the model line y = ax + b (evaluated at the means) gives b = ȳ - a x̄.

We can put our equation for the slope, a = r s_y / s_x, into this, giving:

  b = \bar{y} - r \frac{s_y}{s_x} \bar{x}

where r = the correlation coefficient of x and y, s_y = the standard deviation of y, and s_x = the standard deviation of x.

We can calculate the regression line for any data, but the important question is how well this line fits the data, or how good it is at predicting y from x.
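A minimal Python sketch (not from the slides; `statistics.correlation` needs Python 3.10+) fits the line with these formulas, using the example data from earlier:

```python
import statistics

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)
r = statistics.correlation(x, y)

a = r * s_y / s_x            # slope: a = r * s_y / s_x
b = y_bar - a * x_bar        # intercept: b = ȳ - a * x̄

y_hat = [a * xi + b for xi in x]                          # ŷ, predicted values
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # Σ(y - ŷ)²

print(a, b, ss_res)          # slope 0.35, intercept 1.95, minimised residual SS
```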
How good is our model?

Total variance of y:

  s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{SS_y}{df_y}

The proportion of this variance explained by the model is:

  r^2 = s_{\hat{y}}^2 / s_y^2

F-statistic (after some complicated rearranging):

  F(df_{\hat{y}}, df_{er}) = \frac{s_{\hat{y}}^2}{s_{er}^2} = \dots = \frac{r^2 (n - 2)}{1 - r^2}

And it follows (because F = t²) that:

  t(n - 2) = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}

So all we need to know are r and n.
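Since only r and n are needed, the test statistic is easy to compute. A hypothetical Python sketch (not from the slides):

```python
import math

def t_from_r(r: float, n: int) -> float:
    """t statistic with n - 2 degrees of freedom for a correlation r."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.35, 5               # values from the worked example
t = t_from_r(r, n)           # ≈ 0.65 on 3 degrees of freedom
F = t ** 2                   # F = t², the equivalent F statistic
print(t, F)
```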
General Linear Model

Linear regression is actually a form of the General Linear Model, where the parameters are a, the slope of the line, and b, the intercept:

  y = ax + b + \varepsilon

A General Linear Model is just any model that describes the data in terms of a straight line.
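The same fit can be written in the GLM's matrix form. A sketch (not from the slides) using a design matrix with one column for x and one for the constant term:

```python
import numpy as np

x = np.array([0, 2, 3, 4, 6], dtype=float)
y = np.array([3, 2, 4, 0, 6], dtype=float)

# Design matrix: one column per parameter (slope term and intercept)
X = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares solves for the parameters (a, b)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b = beta
print(a, b)                  # same 0.35 and 1.95 as before
```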
Multiple regression

Multiple regression is used to determine the effect of a number of independent variables, x1, x2, x3, etc., on a single dependent variable, y.

The different x variables are combined in a linear way, and each has its own regression coefficient:

  y = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b + \varepsilon
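A sketch of the same idea with two predictors (made-up data, not from the slides), each getting its own coefficient:

```python
import numpy as np

# Hypothetical data for illustration only
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Design matrix: one column per independent variable, plus a constant
X = np.column_stack([x1, x2, np.ones_like(x1)])

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
a1, a2, b = coefs            # each x has its own regression coefficient
print(a1, a2, b)
```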
This is what SPM does and all will be explained next week!