Topic03 Correlation Regression
Regression
Cal State Northridge
427
Ainsworth
Major Points - Correlation
Questions answered by correlation
Scatterplots
An example
[Scatterplot: Average Hours of Video Games Per Week (x-axis, 0-25) vs. Average Number of Alcoholic Drinks Per Week (y-axis, 0-20)]
Inverse Relationship
Scatterplot: Video Games and Test Score
[Scatterplot: Average Hours of Video Games Per Week (x-axis, 0-20) vs. Exam Score (y-axis, 0-100)]
An Example: Smoking and BP
[Scatterplot: SMOKING (x-axis, 0-30) vs. SYSTOLIC blood pressure (y-axis, 100-160)]
The Data

Country   X (Cig.)   Y (CHD)
1         11         26
2          9         21
3          9         24
4          9         21
5          8         19
6          8         13
7          8         19
8          6         11
9          6         23
10         5         15
11         5         13
12         5          4
13         5         18
14         5         12
15         5          3
16         4         11
17         4         15
18         4          6
19         3         13
20         3          4
21         3         14

Surprisingly, the U.S. is the first country on the list -- the country with the highest consumption and highest mortality.
Scatterplot of Heart Disease
[Scatterplot of cigarette consumption (x-axis, 2-12) vs. CHD mortality (y-axis, 0-20), highlighting the point {X = 6, Y = 11}]
The vertical distance from each point to the line gives us "residuals" or "errors of prediction"
To be discussed later
Correlation
Co-relation
The relationship between two variables
Based on covariance
Measure of degree to which large scores on
X go with large scores on Y, and small scores
on X go with small scores on Y
Think of it as variance, but with 2 variables
instead of 1 (What does that mean??)
Covariance
Remember that variance is:
Var_X = Σ(X - X̄)² / (N - 1) = Σ(X - X̄)(X - X̄) / (N - 1)
The formula for co-variance is:
Cov_XY = Σ(X - X̄)(Y - Ȳ) / (N - 1)
How this works, and why?
When would covXY be large and positive?
Large and negative?
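As a concrete check, the covariance can be computed directly from its definition. A minimal Python sketch, using the 21-country smoking/CHD data from these slides:

```python
# Covariance from the definition: Cov_XY = sum((X - Xbar)(Y - Ybar)) / (N - 1)
# X = cigarette consumption, Y = CHD mortality for the 21 countries in the table.
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd  = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

def covariance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Positive products when x and y are on the same side of their means
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

cov_xy = covariance(cigs, chd)  # about 11.13 (the slides show 11.12 after rounding the deviations)
```

Because large X values pair with large Y values here, the products are mostly positive and the covariance comes out positive.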
Example

Country   X (Cig.)   Y (CHD)   (X - X̄)   (Y - Ȳ)   (X - X̄)(Y - Ȳ)
1         11         26         5.05      11.48      57.97
2          9         21         3.05       6.48      19.76
3          9         24         3.05       9.48      28.91
4          9         21         3.05       6.48      19.76
5          8         19         2.05       4.48       9.18
6          8         13         2.05      -1.52      -3.12
7          8         19         2.05       4.48       9.18
8          6         11         0.05      -3.52      -0.18
9          6         23         0.05       8.48       0.42
10         5         15        -0.95       0.48      -0.46
11         5         13        -0.95      -1.52       1.44
12         5          4        -0.95     -10.52       9.99
13         5         18        -0.95       3.48      -3.31
14         5         12        -0.95      -2.52       2.39
15         5          3        -0.95     -11.52      10.94
16         4         11        -1.95      -3.52       6.86
17         4         15        -1.95       0.48      -0.94
18         4          6        -1.95      -8.52      16.61
19         3         13        -2.95      -1.52       4.48
20         3          4        -2.95     -10.52      31.03
21         3         14        -2.95      -0.52       1.53
Mean       5.95      14.52
SD         2.33       6.69
Sum of products:                                    222.44
Example
Cov_cig&CHD = Σ(X - X̄)(Y - Ȳ) / (N - 1) = 222.44 / (21 - 1) = 11.12
What the heck is a covariance?
I thought we were talking about
correlation?
Correlation Coefficient
r = Cov_XY / (s_X s_Y)
Correlation is a standardized
covariance
Calculation for Example
CovXY = 11.12
sX = 2.33
sY = 6.69
Correlation = .713
Sign is positive
Why?
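The same calculation in code, a minimal sketch standardizing the covariance by the two standard deviations:

```python
# r = Cov_XY / (s_X * s_Y): correlation as a standardized covariance.
import statistics

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd  = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cigs)
mx, my = sum(cigs) / n, sum(chd) / n
cov_xy = sum((x - mx) * (y - my) for x, y in zip(cigs, chd)) / (n - 1)
# statistics.stdev uses the N - 1 denominator, matching the slides
r = cov_xy / (statistics.stdev(cigs) * statistics.stdev(chd))  # about .713
```

The sign of r is the sign of the covariance: positive here because heavy-smoking countries also tend to have high CHD mortality.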
Z-score method
r = Σ(z_x z_y) / (N - 1)

Attractiveness   Date?
0                0
1                0
1                1
1                1
0                0
1                1

r = 0.71
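The z-score form can be verified directly on the six attractiveness/date cases; a minimal sketch:

```python
# Z-score form of the correlation: r = sum(z_x * z_y) / (N - 1),
# applied to the 6-case attractiveness/date data from the slide.
import statistics

attract = [0, 1, 1, 1, 0, 1]
date    = [0, 0, 1, 1, 0, 1]

# Standardize each variable (mean 0, SD 1, using the N - 1 denominator)
zx = [(v - statistics.mean(attract)) / statistics.stdev(attract) for v in attract]
zy = [(v - statistics.mean(date)) / statistics.stdev(date) for v in date]

r = sum(a * b for a, b in zip(zx, zy)) / (len(attract) - 1)  # about .71
```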
Factors Affecting r
Range restrictions
Looking at only a small portion of the total
scatter plot (looking at a smaller portion of
the scores’ variability) decreases r.
Reducing variability reduces r
Nonlinearity
The Pearson r (and its relatives) measure the
degree of linear relationship between two
variables
If a strong non-linear relationship exists, r will
provide a low, or at least inaccurate measure
of the true relationship.
Factors Affecting r
Heterogeneous subsamples
Everyday examples (e.g. height and weight
using both men and women)
Outliers
Overestimate Correlation
Underestimate Correlation
Countries With Low Consumption: Data With a Restricted Range
[Scatterplot restricted to cigarette consumption 2.5-5.5 (x-axis) vs. CHD Mortality per 10,000 (y-axis, roughly 2-18)]
Population parameter = ρ
Null hypothesis H0: ρ = 0
Test of linear independence
What would a true null mean here?
What would a false null mean here?

t = r √( (N - 2) / (1 - r²) ), where df = N - 2
Tables of Significance
In our example r was .71
N - 2 = 21 - 2 = 19

t = r √( (N - 2) / (1 - r²) ) = .71 × √( 19 / (1 - .71²) ) = .71 × √( 19 / .4959 ) = .71 × 6.19 = 4.40

This is well beyond the two-tailed .05 critical value for df = 19 (2.093), so the correlation is significant.
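A minimal sketch of the same t test, using the unrounded r = .713 from the SPSS output:

```python
# t test of H0: rho = 0 for a Pearson correlation:
# t = r * sqrt((N - 2) / (1 - r^2)), with df = N - 2.
import math

r, n = 0.713, 21
t = r * math.sqrt((n - 2) / (1 - r ** 2))
# t is about 4.43; note t^2 is about 19.6, matching the F ratio
# reported in the SPSS ANOVA table for this regression.
```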
SPSS output (Correlations):

                                 CIGARET    CHD
CIGARET   Pearson Correlation    1          .713**
          Sig. (2-tailed)        .          .000
          N                      21         21
CHD       Pearson Correlation    .713**     1
          Sig. (2-tailed)        .000       .
          N                      21         21
**. Correlation is significant at the 0.01 level (2-tailed).
Regression
What is regression?
How does one variable change as the other changes?
Influence
Linear Regression
The Data

Based on the data we have, what would we predict the rate of CHD to be in a country that smoked 10 cigarettes on average?

First, we need to establish a prediction of CHD from smoking…

[The 21-country data table shown earlier is repeated on this slide.]
Regression Line
[Scatterplot of the smoking/CHD data (y-axis 0-30) with the fitted regression line; at X = 10 the line predicts a CHD rate of about 14]
Formula
Ŷ = bX + a

Ŷ = the predicted value of Y (e.g. CHD mortality)
X = the predictor variable (e.g. average cig./adult/country)
Regression Coefficients
Slope:  b = Cov_XY / s²_X   or   b = r (s_Y / s_X)
   or   b = ( N ΣXY - ΣX ΣY ) / ( N ΣX² - (ΣX)² )
Intercept:  a = Ȳ - bX̄
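The slope and intercept formulas can be sketched directly from the raw data:

```python
# Slope and intercept from their definitions:
# b = Cov_XY / s_X^2, a = Ybar - b * Xbar.
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd  = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cigs)
mx, my = sum(cigs) / n, sum(chd) / n
cov_xy = sum((x - mx) * (y - my) for x, y in zip(cigs, chd)) / (n - 1)
var_x  = sum((x - mx) ** 2 for x in cigs) / (n - 1)

b = cov_xy / var_x   # slope, about 2.042
a = my - b * mx      # intercept, about 2.367
```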
For Our Data
Cov_XY = 11.12
s²_X = 2.334² = 5.447
b = 11.12 / 5.447 = 2.042
a = Ȳ - bX̄ = 14.52 - (2.042)(5.95) = 2.37
Answers are not exact due to rounding error and the desire to match SPSS.
SPSS Printout
Note:
Ŷ = bX + a = 2.042X + 2.367
Ŷ = 2.042(6) + 2.367 = 14.619
They actually have 23 deaths/10,000
Our error ("residual") = 23 - 14.619 = 8.38, a large error
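The prediction and residual for that country, as a two-line sketch using the rounded coefficients from the slide:

```python
# Prediction and residual for country 9 (X = 6 cigarettes, observed Y = 23).
b, a = 2.042, 2.367          # rounded regression coefficients from the slides

y_hat = b * 6 + a            # predicted CHD rate, 14.619
residual = 23 - y_hat        # observed minus predicted, about 8.38
```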
Prediction
[Scatterplot of cigarette consumption (x-axis, 2-12) vs. CHD mortality (y-axis, 0-30) with the regression line used for prediction]
Residuals
Residuals sum to zero, just as deviations from the mean do: Σ(X - X̄) = 0
So, how do we get rid of the 0's?
Square them.
Regression Line: A Mathematical Definition
The regression line is the line which, when drawn through your data set, produces the smallest value of:
Σ(Y - Ŷ)²
Residual variance
The variability of the observations around the predicted values:
s²_(Y-Ŷ) = Σ(Yᵢ - Ŷᵢ)² / (N - 2) = SS_residual / (N - 2)
Standard Error of Estimate
Example

Country   X    Y     Ŷ        (Y - Ŷ)   (Y - Ŷ)²
1         11   26    24.829    1.171     1.371
2          9   21    20.745    0.255     0.065
3          9   24    20.745    3.255    10.595
4          9   21    20.745    0.255     0.065
5          8   19    18.703    0.297     0.088
6          8   13    18.703   -5.703    32.524
7          8   19    18.703    0.297     0.088
8          6   11    14.619   -3.619    13.097
9          6   23    14.619    8.381    70.241
10         5   15    12.577    2.423     5.871
11         5   13    12.577    0.423     0.179
12         5    4    12.577   -8.577    73.565
13         5   18    12.577    5.423    29.409
14         5   12    12.577   -0.577     0.333
15         5    3    12.577   -9.577    91.719
16         4   11    10.535    0.465     0.216
17         4   15    10.535    4.465    19.936
18         4    6    10.535   -4.535    20.566
19         3   13     8.493    4.507    20.313
20         3    4     8.493   -4.493    20.187
21         3   14     8.493    5.507    30.327
Mean       5.952     14.524
SD         2.334      6.690
Sum                            0.04    440.757

s²_(Y-Ŷ) = Σ(Yᵢ - Ŷᵢ)² / (N - 2) = 440.756 / (21 - 2) = 23.198
s_(Y-Ŷ) = √( Σ(Yᵢ - Ŷᵢ)² / (N - 2) ) = √( 440.756 / 19 ) = √23.198 = 4.816
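Rather than tabulating each residual by hand, the standard error of estimate can be computed from the raw data; a minimal sketch:

```python
# Standard error of estimate: s_(Y-Yhat) = sqrt(SS_residual / (N - 2)),
# with the residuals computed from a fresh fit of the regression line.
import math

cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd  = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cigs)
mx, my = sum(cigs) / n, sum(chd) / n
b = (sum((x - mx) * (y - my) for x, y in zip(cigs, chd)) /
     sum((x - mx) ** 2 for x in cigs))
a = my - b * mx

ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(cigs, chd))  # about 440.76
see = math.sqrt(ss_residual / (n - 2))                                # about 4.816
```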
Regression and Z Scores
SS_regression = Σ(Ŷ - Ȳ)²
Degrees of freedom
Total: df_total = N - 1
Regression: df_regression = number of predictors
Residual: df_residual = df_total - df_regression
df_total = df_regression + df_residual
Partitioning Variability

SS_residual = Σ(Y - Ŷ)² = 440.757;  df_residual = 21 - 1 - 1 = 19

s²_total = Σ(Y - Ȳ)² / (N - 1) = 895.247 / 20 = 44.762
s²_regression = Σ(Ŷ - Ȳ)² / df_regression = 454.307 / 1 = 454.307
s²_residual = Σ(Y - Ŷ)² / (N - 2) = 440.757 / 19 = 23.198

Note: s²_residual = s²_(Y-Ŷ)
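The partition SS_total = SS_regression + SS_residual can be verified numerically on the 21-country data; a minimal sketch:

```python
# Partitioning variability: SS_total = SS_regression + SS_residual.
cigs = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd  = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cigs)
mx, my = sum(cigs) / n, sum(chd) / n
b = (sum((x - mx) * (y - my) for x, y in zip(cigs, chd)) /
     sum((x - mx) ** 2 for x in cigs))
a = my - b * mx
pred = [b * x + a for x in cigs]

ss_total      = sum((y - my) ** 2 for y in chd)               # about 895.24
ss_regression = sum((p - my) ** 2 for p in pred)              # about 454.48
ss_residual   = sum((y - p) ** 2 for y, p in zip(chd, pred))  # about 440.76
# ss_total equals ss_regression + ss_residual up to floating-point error
```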
Coefficient of Determination
r² = SS_regression / SS_Y
The percentage of the total variability in Y explained by X
r² for our example
r = .713
r² = .713² = .508
or r² = SS_regression / SS_Y = 454.307 / 895.247 = .507
The unexplained proportion is defined as 1 - r², or SS_residual / SS_Y

Example:
1 - .508 = .492
1 - r² = SS_residual / SS_Y = 440.757 / 895.247 = .492
r², SS and s_(Y-Ŷ)
r² × SS_total = SS_regression
(1 - r²) × SS_total = SS_residual
s_(Y-Ŷ) = s_Y √( (1 - r²)(N - 1) / (N - 2) ) = 6.690 × √( .492 × 20/19 ) = 4.816
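These shortcut identities let you recover the sums of squares and the standard error of estimate from r² alone; a minimal sketch with the rounded slide values:

```python
# Shortcut identities: r^2 * SS_total = SS_regression,
# (1 - r^2) * SS_total = SS_residual, and
# s_(Y-Yhat) = s_Y * sqrt((1 - r^2) * (N - 1) / (N - 2)).
import math

r2, ss_total, s_y, n = 0.508, 895.247, 6.690, 21   # rounded values from the slides

ss_regression = r2 * ss_total              # about 454.8 (454.307 with the unrounded r^2)
ss_residual   = (1 - r2) * ss_total        # about 440.5
see = s_y * math.sqrt((1 - r2) * (n - 1) / (n - 2))   # about 4.81
```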
Testing Overall Model

Example:
F = s²_regression / s²_residual = 454.307 / 23.198 = 19.594

SPSS ANOVA output:

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression        454.482      1       454.482   19.592   .000(a)
  Residual          440.757     19        23.198
  Total             895.238     20
a. Predictors: (Constant), CIGARETT
b. Dependent Variable: CHD
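The F ratio in the ANOVA table follows directly from the sums of squares and degrees of freedom; a minimal sketch:

```python
# Overall model test: F = MS_regression / MS_residual, with (1, N - 2) df.
ms_regression = 454.482 / 1      # SS_regression / df_regression, from the ANOVA table
ms_residual   = 440.757 / 19     # SS_residual / df_residual, about 23.198

F = ms_regression / ms_residual  # about 19.59, matching the SPSS output
```

With one predictor, this F is the square of the t statistic for testing r (and for testing the slope).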
Testing Slope and Intercept
quite well?