Theory of Regression
The Course
16 (or so) lessons
Some flexibility
Depends how we feel
What we get through
House Rules
Jeremy must remember
Not to talk too fast
My data
I'll provide you with
Simple examples, small sample sizes
Conceptually simple (even silly)
Computer Programs
SPSS
Mostly
Excel
For calculations
GPower
Stata (if you like)
R (because it's flexible and free)
Mplus (SEM, ML?)
AMOS (if you like)
7
Lesson 1: Models in
statistics
Models, parsimony, error,
mean, OLS estimators
10
What is a Model?
11
What is a model?
Representation
Of reality
Not reality
Sifting
What is important from what is not
important
Parsimony
In statistical models we seek
parsimony
Parsimony = simplicity
13
Parsimony in Science
A model should be:
1: able to explain a lot
2: use as few concepts as possible
More it explains
The more you get
Fewer concepts
The lower the price
A Simple Model
Height of five individuals
1.40m
1.55m
1.80m
1.62m
1.63m
A Little Notation
Y: the variable
Yi: the ith value of Y
e.g. Y = {4, 5, 6, 7, 8}, so Y2 = 5
b0, bj: parameters (intercept and slopes)
e: error
18
Ȳ: the mean of Y
β1 and b1 refer to the same slope parameter
I will use b1 (because it is easier to type)
20
21
22
We want a model
To represent those data
Model 1:
1.40m, 1.55m, 1.80m, 1.62m, 1.63m
Not a model
A copy
VERY unparsimonious
Data: 5 statistics
Model: 5 statistics
No improvement
23
Model 2:
The mean (arithmetic mean)
A one parameter model
Ŷi = b0 = Ȳ
Ȳ = Σ Yi / n
25
26
27
28
What is error?
Data (Y): 1.40, 1.55, 1.80, 1.62, 1.63
Model (b0 = mean): 1.60
Error (e): -0.20, -0.05, 0.20, 0.02, 0.03
29
ERROR: Σei = Σ(Yi − Ȳ) = Σ(Yi − b0)
Σei = 0 implies no ERROR?
Not the case
31
ERROR: Σ|ei| = Σ|Yi − Ȳ| = Σ|Yi − b0|
= 0.20 + 0.05 + 0.20 + 0.02 + 0.03 = 0.50
32
Y = (2, 2, 4, 4)
b0 = any value from 2 to 4
Indeterminate
There are an infinite number of solutions which would
satisfy our criteria for minimum error
33
ERROR: Σei² = Σ(Yi − Ȳ)² = Σ(Yi − b0)² = 0.08
34
Determinate
Always gives one answer
If we minimise SSE
Get the mean
Shown in graph
SSE plotted against b0
Min value of SSE occurs when
b0 = mean
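A minimal Python sketch (not from the slides) using the five heights given earlier: compute SSE for a grid of candidate values of b0 and confirm the minimum falls at the mean. The grid of candidates is just illustrative.

y = [1.40, 1.55, 1.80, 1.62, 1.63]

def sse(b0, values):
    # sum of squared errors for a one-parameter model that predicts b0 for every case
    return sum((v - b0) ** 2 for v in values)

mean = sum(y) / len(y)
candidates = [round(1.10 + 0.01 * i, 2) for i in range(81)]   # 1.10 ... 1.90
best = min(candidates, key=lambda b0: sse(b0, y))
print(round(mean, 2), round(sse(mean, y), 4))   # 1.6, ~0.08
print(best)                                      # the grid minimum lands on the mean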
35
[Figure: SSE plotted against candidate values of b0 (1.1 to 1.9); the minimum of SSE occurs at b0 = the mean, 1.60]
37
BLUE Estimators
Best
Minimum variance (of all possible
unbiased estimators)
Narrower distribution than other
estimators
e.g. median, mode
Linear
Ŷi = Ȳ
Linear predictions
For the mean
Linear (straight, flat) line
39
Unbiased
Centred around true (population)
values
Expected value = population value
Minimum is biased.
Minimum in samples > minimum in
population
Estimators
Errrmm they are estimators
Also consistent
Sample approaches infinity, get closer
to population values
Variance shrinks
40
SSE = Σ(Yi − Ȳ)²
Variance = Σ(Yi − Ȳ)² / n
Sample estimate: Σ(Yi − Ȳ)² / (n − 1)
41
42
Proof
That the mean minimises SSE
Not that difficult
As statistical proofs go
Available in
Maxwell and Delaney Designing
experiments and analysing data
Judd and McClelland Data Analysis
(out of print?)
43
What's a df?
The number of parameters free to
vary
When one is fixed
44
0 df
No variation
available
1 df
Fix 1 corner, the
shape is fixed
45
Variance has N − 1 df
Mean has been fixed
2nd moment
Can think of as amount cases vary
away from the mean
46
While we are at it
Skewness has N − 2 df
3rd moment
Kurtosis has N − 3 df
4th moment
Amount cases vary from
47
Parsimony and df
Number of df remaining
Measure of parsimony
Normal distribution
Can be described in terms of mean and SD
2 parameters
(z distribution: 0 parameters)
48
Summary of Lesson 1
Statistics is about modelling DATA
Models have parameters
Fewer parameters, more parsimony, better
50
51
52
In Lesson 1 we said
Use a model to predict and
describe data
Mean is a simple, one parameter
model
Ŷi = b0 = Ȳ
53
More Models
Slopes and Intercepts
54
More Models
The mean is OK
As far as it goes
It just doesn't go very far
Very simple prediction, uses very little
information
House Prices
In the UK, two of the largest
lenders (Halifax and Nationwide)
compile house price indices
Predict the price of a house
Examine effect of different
circumstances
Price (£000s): 77, 74, 88, 62, 90, 136, 35, 134, 138, 55
57
Ȳ = 88.9
Ŷ = b0 = Ȳ
SSE = 11806.9
How much is that house worth?
£88,900
58
Use 1 df to say that
Ŷ = b0 + b1x1
59
Alternative Expression
Estimate of Y (expected value of Y): Ŷ = b0 + b1x1
Value of Y: Yi = b0 + b1xi1 + ei
60
62
Mark errors on it
Called residuals
Sum and square these to find SSE
63
[Scatterplots: house price (£000s) against number of bedrooms, with the residuals marked]
66
First attempt:
67
68
Gradient: the line rises b1 units for every 1 unit increase in x
Height (intercept): b0 units
70
Height
If we fix slope to zero
Height becomes mean
Hence mean is b0
71
Why the
constant?
b0x0
Where x0 is 1.00
for every case
i.e. x0 is constant
Implicit in SPSS
Some packages
force you to make
it explicit
(Later on we'll need to make it explicit)
[Table: beds (x1), x0 (= 1 for every case), price (£000s): 77, 74, 88, 62, 90, 136, 35, 134, 138, 55]
72
73
Start with
b0=88.9 (mean)
b1=10 (nice round number)
SSE = 14948 – worse than it was
b0 = 86.9, b1 = 10, SSE = 13828
b0 = 66.9, b1 = 10, SSE = 7029
b0 = 56.9, b1 = 10, SSE = 6628
b0 = 46.9, b1 = 10, SSE = 8228
b0 = 51.9, b1 = 10, SSE = 7178
b0 = 51.9, b1 = 12, SSE = 6179
b0 = 46.9, b1 = 14, SSE = 5957
…
75
[Figure: actual and predicted price (£000s) plotted against number of bedrooms]
We now know
A house with no bedrooms is worth
£46,000 (??!)
Adding a bedroom adds £15,000
78
Standardised Regression
Line
One big but:
Scale dependent
Values change (currency, inflation)
Scales change (£, £000s, …)
b1 = 14.79
We increase x1 by 1, and Ŷ increases by 14.79
14.79 = (14.79 / 36.21) SDs of Y = 0.408 SDs
80
Standardised slope = b1 × SD(x1) / SD(y)
= 14.79 × 1.72 / 36.21 = 0.706
The standardised regression line
Change (in SDs) in Y associated with
a change of 1 SD in x1
Correlation coefficient is a
standardised regression slope
Relative change, in terms of SDs
83
Proportional Reduction in
Error
84
Proportional Reduction in
Error
We might be interested in the level
of improvement of the model
How much less error (as proportion)
do we have
Proportional Reduction in Error (PRE)
Mean only
Error(model 0) = 11806
Mean + slope
Error(model 1) = 5921
85
PRE = (ERROR(0) − ERROR(1)) / ERROR(0)
PRE = 1 − ERROR(1) / ERROR(0)
PRE = 1 − 5921 / 11806
PRE = 0.4984
86
√0.4984 = 0.706
This is the correlation coefficient
Correlation coefficient is the square root of the proportion of variance explained
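A short sketch of the PRE calculation, using the SSE values quoted on the slides:

sse_model0 = 11806   # mean-only model
sse_model1 = 5921    # mean + slope model

pre = 1 - sse_model1 / sse_model0
print(round(pre, 4))         # ~0.498
print(round(pre ** 0.5, 3))  # ~0.706, the correlation coefficient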
Standardised Covariance
88
Standardised Covariance
We are still iterating
Need a closed-form
Equation to solve to get the
parameter estimates
Answer is a standardised
covariance
A variable has variance
Amount of differentness
Divide the SSE by N
Gives SSE per person
(Actually N − 1: we have lost a df to the mean)
This is the variance
Same as SD²
We thought of SSE as a scattergram
Y plotted against X
[Scatterplot: Y (price) plotted against X (bedrooms), with the squared residuals drawn as areas]
Sum of areas = SSE
92
Plot of Y against Y
[Scatterplot: Y plotted against Y]
Draw Squares
[Figure: squares drawn on the Y-vs-Y plot]
For Y = 138: 138 − 88.9 = 40.1; area = 40.1 × 40.1 = 1608.1
For Y = 35: 35 − 88.9 = −53.9; area = (−53.9) × (−53.9) = 2905.21
[Figure: rectangles drawn on the Y-vs-X plot]
For Y = 55, X = 1: 55 − 88.9 = −33.9; 1 − 3 = −2; area = (−33.9) × (−2) = 67.8
For Y = 138, X = 4: 138 − 88.9 = 49.1; 4 − 3 = 1; area = 49.1 × 1 = 49.1
97
Cov(x, y) = Σ(x − x̄)(y − ȳ) / (N − 1)
Cov(x, y) = 44.2
What do points in different sectors
do to the covariance?
98
Need to standardise it
Like the slope
99
First approach
Much more computationally
expensive
Too much like hard work to do by hand
Second approach
Much easier
Standardise the final value only
Standardised covariance
r = Cov(x, y) / √(Var(x) × Var(y))
= 44.2 / √(2.9 × 1311) = 0.706
101
Correlation = covariance / √(variance(x) × variance(y))
102
Expanded
r = [Σ(x − x̄)(y − ȳ) / (N − 1)] / √[ (Σ(x − x̄)² / (N − 1)) × (Σ(y − ȳ)² / (N − 1)) ]
103
This means
We now have a closed form equation
to calculate the correlation
Which is the standardised slope
Which we can use to calculate the
unstandardised slope
104
We know that: r = b1 × SD(x1) / SD(y)
So: b1 = r × SD(y) / SD(x1)
b1 = 0.706 × 36.21 / 1.72 = 14.79
So value of b1 is the same as the
iterative approach
106
The intercept
Just while we are at it
107
Subtract mean of x
But not the whole mean of x
Need to correct it for the slope
c = ȳ − b1 × x̄1
c = 88.9 − 14.79 × 2.9
c = 46.00
Naturally, the same
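A sketch of the closed-form estimates in Python. The house prices are from the slides; the bedroom counts below are reconstructed so that they reproduce the summary statistics quoted on the slides (x̄ = 2.9, Σ(x − x̄)² = 26.9, Cov = 44.2), so treat them as an assumption rather than the original data file.

import math

def ols_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    vx = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    vy = sum((yi - my) ** 2 for yi in y) / (n - 1)
    r = cov / math.sqrt(vx * vy)             # standardised covariance
    b1 = r * math.sqrt(vy) / math.sqrt(vx)   # = r * SD(y) / SD(x)
    b0 = my - b1 * mx
    return cov, r, b1, b0

price = [77, 74, 88, 62, 90, 136, 35, 134, 138, 55]
beds = [1, 2, 1, 3, 5, 5, 2, 5, 4, 1]   # reconstructed bedroom counts (assumption)
print(ols_line(beds, price))            # ~ (44.2, 0.706, 14.79, 46.0)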
108
Accuracy of Prediction
109
Actual Price   Predicted Price
77             60.80
74             75.59
88             60.80
62             90.38
90             119.96
136            119.96
35             75.59
134            119.96
138            105.17
55             60.80
Plot actual
price against
predicted
price
From the
model
111
[Scatterplot: predicted value plotted against actual value]
r = 0.706
The correlation
113
r = Σxy / √(Σx² × Σy²)   (x and y as deviations from their means)
Point biserial:
r = (My1 − My0) × √(P × Q) / SDy
114
Phi (φ)
Used for 2 dichotomous variables
                 Vote P   Vote Q
Homeowner        A: 19    B: 54
Not homeowner    C: 60    D: 53
r = (BC − AD) / √((A+B)(C+D)(A+C)(B+D))
115
Spearman: r = 1 − 6Σd² / (n(n² − 1))
116
Summary
Mean is an OLS estimate
OLS estimates are BLUE
Regression line
Best prediction of DV from IV
OLS estimate (like mean)
117
118
119
120
Lesson 3: Why
Regression?
A little aside, where we look
at why regression has such a
curious name.
121
Regression
The or an act of regression;
reversion; return towards the
mean; return to an earlier stage of
development, as in an adult's or an
adolescent's behaving like a child
(From Latin gradi, to go)
Francis Galton
Charles Darwin's cousin
Studying heritability
123
124
Other Examples
Secrist (1933): The Triumph of
Mediocrity in Business
Second albums often tend to not be as
good as first
Sequel to a film is not as good as the
first one
Curse of Athletics Weekly
Parents think that punishing bad
behaviour works, but rewarding good
behaviour doesn't
125
126
[Figure: scatter of points with r = 1.00, and scatter of points with r = 0.00]
128
From Regression to
Correlation
Where do we predict an
individuals score on y will be,
based on their score on x?
Depends on the correlation
r=1.00
Starts here
Will end
up here
y
130
r=0.00
Starts here
Could end
anywhere here
y
131
r=0.50
Probably
end
somewher
e here
Starts
here
132
133
r=0.00
Ends here
Group starts
here
Group starts
here
y
134
r=0.50
y
135
r=1.00
y
136
r
units
1 unit
137
No
regression
r=1.00
138
Some
regression
r=0.50
139
r=0.00
Lots
(maximum)
regression
r=0.00
y
140
Formula
ẑy = rxy × zx
141
Conclusion
Regression towards mean is statistical
necessity
No regression only with a perfect correlation
Very non-intuitive
Interest in regression and correlation
From examining the extent of regression
towards mean
By Pearson, who worked with Galton
Stuck with curious name
143
144
Lesson 4: Samples to
Populations Standard
Errors and Statistical
Significance
145
The Problem
In Social Sciences
We investigate samples
Theoretically
Randomly taken from a specified
population
Every member has an equal chance
of being sampled
Sampling one member does not alter
the chances of sampling another
Population
But it's the population that we are
interested in
Not the sample
Population statistic represented with
Greek letter
Hat means estimate
x
x
147
148
Sampling Distribution
We need to know the sampling
distribution of a parameter
estimate
How much does it vary from sample
to sample
Sampling Distribution of
the Mean
Given
Normal distribution
Random sample
Continuous data
151
Concrete   Abstract   Diff (x)
12         4          8
11         7          4
4          6          -2
9          12         -3
8          6          2
12         10         2
9          8          1
8          5          3
12         10         2
8          4          4
x̄ = 2.1, SD(x) = 3.11, N = 10
152
Confidence Intervals
This means
If we know the mean in our sample
We can estimate where the mean in
the population (μ) is likely to be
Using
The standard error (se) of the mean
Represents the standard deviation of
the sampling distribution of the mean
153
1 SD
contains
68%
Almost 2
SDs contain
95%
154
se(x̄) = SD(x) / √n
155
Decreasing SD decreases SE
x̄ = 2.1, SD(x) = 3.11, N = 10
se(x̄) = 3.11 / √10 = 0.98
157
What is a CI?
(For 95% CI):
95% chance that the true
(population) value lies within the
confidence interval?
95% of samples, true mean will
land within the confidence
interval?
159
Significance Test
Probability that μ is a certain value
Almost always 0
Doesn't have to be though
t = x̄ / se(x̄)
t = 2.1 / 0.98 = 2.14
p = 0.061
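A sketch of the same one-sample t-test in Python, using scipy for the p-value:

from math import sqrt
from scipy import stats

xbar, sd, n = 2.1, 3.11, 10
se = sd / sqrt(n)                       # ~0.98
t = xbar / se                           # ~2.14
p = 2 * stats.t.sf(abs(t), df=n - 1)    # two-tailed
print(round(se, 2), round(t, 2), round(p, 3))   # 0.98 2.14 0.061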
161
Other Parameter
Estimates
Same approach
Prediction, slope, intercept, predicted
values
At this point, prediction and slope are
the same
Wont be later on
163
F = (SSreg / df1) / (SSres / df2)
df1 = k
df2 = N − k − 1
164
Slope = 14.79
Intercept = 46.0
r = 0.706
165
F = (SSreg / df1) / (SSres / df2)
F = (5885 / 1) / (5921 / (10 − 1 − 1)) = 7.95
df1 = k = 1
df2 = N − k − 1 = 8
166
F = 7.95, df = 1, 8, p = 0.02
Can reject H0
H0: Prediction is not better than chance
A significant effect
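A sketch of the F test from the sums of squares quoted above, with the p-value from scipy:

from scipy import stats

ss_reg, ss_res = 5885, 5921
n, k = 10, 1
df1, df2 = k, n - k - 1
F = (ss_reg / df1) / (ss_res / df2)
p = stats.f.sf(F, df1, df2)
print(round(F, 2), round(p, 3))   # ~7.95, p ~= 0.02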
167
Statistical Significance:
What does a p-value (really)
mean?
168
A Quiz
Six questions, each true or false
Write down your answers (if you like)
An experiment has been done. Carried
out perfectly. All assumptions perfectly
satisfied. Absolutely no problems.
P = 0.01
Which of the following can we say?
169
170
171
172
173
174
A Bit of Notation
Not because we like notation
But we have to say a lot less
Probability P
Null hypothesis is true H
Result (data) D
Given - |
178
What's a P Value
P(D|H)
Probability of the data occurring if the
null hypothesis is true
Not
P(H|D)
Probability that the null hypothesis is
true, given that we have the data =
P(H|D)
P(H|D) ≠ P(D|H)
179
P(M|B) ≠ P(B|M)
180
Police say:
P(D|H) = 1/1,000,000
182
True
34% of students
15% of professors/lecturers,
10% of professors/lecturers teaching
statistics
. False
. We have found evidence against
the null hypothesis
184
. False
. We don't know
185
20% of students
13% of professors/lecturers
10% of professors/lecturers teaching
statistics
False
186
. False
187
68% of students
67% of professors/lecturers
73% of professors/lecturers
teaching statistics
. False
. Can be worked out
P(replication)
188
. False
. Another tricky one
It can be worked out
189
190
191
Yates (1951)
"the emphasis given to formal tests of
significance ... has resulted in ... an undue
concentration of effort by mathematical
statisticians on investigations of tests of
significance applicable to problems which
are of little or no practical importance ...
and ... it has caused scientific research
workers to pay undue attention to the
results of the tests of significance ... and
too little to the estimates of the magnitude
of the effects they are investigating
193
194
s_y.x = √( Σ(Y − Ŷ)² / (N − k − 1) )
s_y.x = √( SSres / (N − k − 1) )
s_y.x = √( 5921 / 8 ) = 27.2
195
196
se(b_y.x) = s_y.x / √( Σ(x − x̄)² )
se(b_y.x) = 27.2 / √26.9 = 5.24
197
Confidence Limits
95% CI
t dist with N - k - 1 df is 2.31
CI = 5.24 × 2.31 = 12.06
t = b / se(b) = 14.7 / 5.2 = 2.81
df = N − k − 1 = 8
p = 0.02
This probability is (of course) the same as for the F test
199
Need to transform it
Fisher z transformation approximately
normal
z′ = 0.5 × [ln(1 + r) − ln(1 − r)]
SE(z′) = 1 / √(n − 3)
200
SE(z′) = 1 / √(10 − 3) = 0.38
95% CIs:
0.879 − 1.96 × 0.38 = 0.13
0.879 + 1.96 × 0.38 = 1.62
201
Transform back: r = (e^(2z′) − 1) / (e^(2z′) + 1)
Using Excel
Functions in excel
Fisher() to carry out Fisher
transformation
Fisherinv() to transform back to
correlation
203
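A sketch of the whole Fisher-z confidence interval in Python, mirroring the hand calculation above (r = 0.706, n = 10); the limits are returned on the r scale.

import math

def fisher_ci(r, n, crit=1.96):
    z = 0.5 * (math.log(1 + r) - math.log(1 - r))   # Fisher transformation
    se = 1 / math.sqrt(n - 3)
    back = lambda v: (math.exp(2 * v) - 1) / (math.exp(2 * v) + 1)  # inverse
    return back(z - crit * se), back(z + crit * se)

print(fisher_ci(0.706, 10))   # roughly (0.13, 0.92) once transformed back to r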
The Others
Same ideas for calculation of CIs
and SEs for
Predicted score
Gives expected range of values given
X
Lesson 5: Introducing
Multiple Regression
205
Residuals
We said
Y = b0 + b1x1
206
Contains information
Something is making the residual 0
But what?
207
[Figure: actual and predicted price against number of bedrooms; one house above the line has a swimming pool, one below the line has unpleasant neighbours]
[Table: beds, price (£000s): 77, 74, 88, 62, 90, 136, 35, 134, 138, 55]
210
211
Control?
In experimental research
Use experimental control
e.g. same conditions, materials, time
of day, accurate measures, random
assignment to conditions
In non-experimental research
Can't use experimental control
Use statistical control instead
212
Analysis of Residuals
What predicts differences in crime
rate
After controlling for socio-economic
deprivation
Number of police?
Crime prevention schemes?
Rural/Urban proportions?
Something else
Exam performance
Consider number of books a student
read (books)
Number of lectures (max 20) a
student attended (attend)
214
Books   Attend   Grade
0       9        45
1       15       57
0       10       45
2       16       51
4       10       65
4       20       88
1       11       44
4       20       87
3       15       89
0       15       59
First 10 cases
215
Use books as IV
R=0.492, F=12.1, df=1, 28, p=0.001
b0=52.1, b1=5.7
(Intercept makes sense)
Use attend as IV
R=0.482, F=11.5, df=1, 38, p=0.002
b0=37.0, b1=1.9
(Intercept makes less sense)
216
[Scatterplots: grade (out of 100) against books, and grade against attend]
Problem
Use R2 to give proportion of shared
variance
Books = 24%
Attend = 23%
219
[Diagram: correlations among BOOKS, ATTEND and GRADE — books–attend r = 0.44, books–grade r = 0.49, attend–grade r = 0.48]
Well. Almost.
This would give us correct values for
SS
Would not be correct for slopes, etc
Simultaneously estimate 2
parameters
b1 and b2
Y = b0 + b1x1 + b2x2
x1 and x2 are IVs
3D scatterplot
(2points only)
y
x2
x1
224
b2
b1
b0
x2
x1
225
(Really) Ridiculous
Equations
b1 = [ Σ(y − ȳ)(x1 − x̄1) × Σ(x2 − x̄2)² − Σ(y − ȳ)(x2 − x̄2) × Σ(x1 − x̄1)(x2 − x̄2) ] / [ Σ(x1 − x̄1)² × Σ(x2 − x̄2)² − (Σ(x1 − x̄1)(x2 − x̄2))² ]
b2 = [ Σ(y − ȳ)(x2 − x̄2) × Σ(x1 − x̄1)² − Σ(y − ȳ)(x1 − x̄1) × Σ(x1 − x̄1)(x2 − x̄2) ] / [ Σ(x1 − x̄1)² × Σ(x2 − x̄2)² − (Σ(x1 − x̄1)(x2 − x̄2))² ]
b0 = ȳ − b1x̄1 − b2x̄2
226
227
228
A scalar is a number
A scalar: 4
A vector is a row or column of
numbers
A row vector:
4 8 7
A column vector: 11
230
4 8 7
Is a 1 4 vector
11
Is a 2 1 vector
A number (scalar) is a 1 1 vector
231
2 6 5 7 8
4 5 7 5 3
1 5 2 7 8
Is a 3 x 5 matrix
Matrices are referred to with bold
capitals
232
233
I = [1 0 0; 0 1 0; 0 0 1]
The identity matrix: 1s on the diagonal, 0s everywhere else
234
Matrix Operations
Transposition
A matrix is transposed by putting it
on its side
Transpose of A is A′
A = [7 5 6]
A′ = [7; 5; 6]   (the row becomes a column)
235
Matrix multiplication
A matrix can be multiplied by a scalar,
a vector or a matrix
Not commutative: AB ≠ BA
To multiply AB
The number of columns in A must equal the number of rows in B
236
Matrix by vector:
[a b c; d e f; g h i] × [j; k; l] = [aj + bk + cl; dj + ek + fl; gj + hk + il]

[2 3 5; 7 11 13; 17 19 23] × [2; 3; 4] = [4 + 9 + 20; 14 + 33 + 52; 34 + 57 + 92] = [33; 99; 183]
Matrix by matrix:
[a b; c d] × [e f; g h] = [ae + bg, af + bh; ce + dg, cf + dh]

[2 3; 5 7] × [2 3; 4 5] = [4 + 12, 6 + 15; 10 + 28, 15 + 35] = [16 21; 38 50]
238
AI = A
[2 3; 5 7] × [1 0; 0 1] = [2 3; 5 7]
239
We will do a 2x2
Much more difficult for larger matrices
241
A = [a b; c d]
|A| = ad − cb
A = [1.0 0.3; 0.3 1.0]
|A| = 1 × 1 − 0.3 × 0.3 = 0.91
242
Described as:
Not positive definite
Singular (if determinant is zero)
In different error messages
243
For A = [a b; c d]:
adj A = [d −b; −c a]
A⁻¹ = adj A / |A|
244
Find A⁻¹
A = [1.0 0.3; 0.3 1.0], |A| = 0.91
A⁻¹ = (1 / 0.91) × [1.0 −0.3; −0.3 1.0] = [1.10 −0.33; −0.33 1.10]
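A quick numpy check of the determinant and inverse above:

import numpy as np

A = np.array([[1.0, 0.3],
              [0.3, 1.0]])
print(np.linalg.det(A))   # 0.91
print(np.linalg.inv(A))   # [[ 1.0989 -0.3297], [-0.3297  1.0989]]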
245
246
Determinants
Determinant of a correlation matrix
The volume of space taken up by the
(hyper) sphere that contains all of the
points
A = [1.0 0.0; 0.0 1.0], |A| = 1.0 — uncorrelated variables: the points fill the whole space
A = [1.0 1.0; 1.0 1.0], |A| = 0.0 — perfectly correlated variables: the points lie on a line and take up no space
249
Negative Determinant
Points take up less than no
space
Correlation matrix cannot exist
Non-positive definite matrix
250
Sometimes Obvious
A = [1.0 1.2; 1.2 1.0]
|A| = −0.44
A = [1 0.9 0.9; 0.9 1 −0.9; 0.9 −0.9 1]
|A| = −2.88
252
Sometimes No Idea
A = [1.00 0.76 0.40; 0.76 1 −0.30; 0.40 −0.30 1]
|A| = −0.01
A = [1.00 0.75 0.40; 0.75 1 −0.30; 0.40 −0.30 1]
|A| = 0.0075
253
R²_i.123…k = 1 − 1 / a_ii
(a_ii is the ith diagonal element of the inverse of the correlation matrix)
254
Regression Weights
Where i is DV
j is IV
b_i.j = −a_ij / a_ii
(elements of the inverse of the correlation matrix)
255
Y = XB + E
257
Where
Y = vector of DV
X = matrix of IVs
B = vector of coefficients
258
Y = XB + E written out for the books/attend data:

Y = [45, 57, 45, 51, 65, 88, 44, 87, 89, 59]′ — the DV, grade
X = [1 0 9; 1 1 5; 1 0 10; 1 2 16; 1 4 10; 1 4 20; 1 1 11; 1 4 20; 1 3 15; 1 0 15] — first column x0, then books, then attend
B = [b0; b1; b2] — the parameter estimates; we are trying to find the best values of these
E = [e1; e2; …; e10] — the errors

x0 could be any number, but it is most convenient to make it 1; it is used to capture the intercept.
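A sketch of the matrix solution B = (X′X)⁻¹X′Y with numpy, using the books/attend data laid out above:

import numpy as np

X = np.array([[1, 0,  9], [1, 1,  5], [1, 0, 10], [1, 2, 16], [1, 4, 10],
              [1, 4, 20], [1, 1, 11], [1, 4, 20], [1, 3, 15], [1, 0, 15]], dtype=float)
y = np.array([45, 57, 45, 51, 65, 88, 44, 87, 89, 59], dtype=float)

B = np.linalg.inv(X.T @ X) @ X.T @ y   # b0, b1 (books), b2 (attend)
E = y - X @ B                          # residuals
print(B)
print((E ** 2).sum())                  # SSE for the two-predictor model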
Y = XB + E
Simple way of representing as many IVs
as you like
Y = b0x0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + e
(two cases shown)
X = [x01 x11 x21 x31 x41 x51; x02 x12 x22 x32 x42 x52]
B = [b0; b1; b2; b3; b4; b5]
E = [e1; e2]
Generalises to Multivariate
Case
Y = XB + E
Y, B and E
Matrices, not vectors
267
268
269
270
Lesson 6: More on
Multiple Regression
271
Parameter Estimates
Parameter estimates (b1, b2 bk)
were standardised
Because we analysed a correlation
matrix
Standard Error of
Regression Coefficient
Standardised is easier
SE(βi) = √[ (1 − R²_Y) / (n − k − 1) × 1 / (1 − R²_i) ]
(R²_Y: variance in Y explained by the model; R²_i: variance in xi explained by the other IVs)
Multiple R
The degree of prediction
R (or Multiple R)
No longer equal to b
275
In Terms of Variance
Can also think of this in terms of
variance explained.
Each IV explains some variance in the
DV
The IVs share some of their variance
276
[Venn diagram: the total variance of Y = 1; variance in Y accounted for by x1: r²x1y = 0.36; variance in Y accounted for by x2: r²x2y = 0.36]
In this model
R² = r²yx1 + r²yx2
R² = 0.36 + 0.36 = 0.72
R = √0.72 = 0.85
But
If x1 and x2 are correlated
No longer the case
278
[Venn diagram: as above, but x1 and x2 now overlap — the variance shared between x1 and x2 (not equal to rx1x2) overlaps with the variance each shares with Y]
So
We can no longer sum the r2
Need to sum them, and subtract the
shared variance i.e. the correlation
But
It's not the correlation between them
It's the correlation between them as a
proportion of the variance of Y
Based on estimates
R² = β1·ryx1 + β2·ryx2
If rx1x2 = 0, then β1 = ryx1 (and β2 = ryx2)
Equivalent to r²yx1 + r²yx2
281
Based on correlations
R² = (r²yx1 + r²yx2 − 2·ryx1·ryx2·rx1x2) / (1 − r²x1x2)
If rx1x2 = 0
Equivalent to r²yx1 + r²yx2
282
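A sketch of the two-predictor formula above, with illustrative correlation values (the 0.6s reproduce the r² = 0.36 example; the 0.5 inter-correlation is a hypothetical):

def r_squared(r_yx1, r_yx2, r_x1x2):
    return (r_yx1**2 + r_yx2**2 - 2 * r_yx1 * r_yx2 * r_x1x2) / (1 - r_x1x2**2)

print(r_squared(0.6, 0.6, 0.0))   # 0.72: with uncorrelated IVs it is just the sum
print(r_squared(0.6, 0.6, 0.5))   # 0.48: smaller, because the IVs share variance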
283
Adjusted R2
R2 is an overestimate of population
value of R2
Any x will not correlate 0 with Y
Any variation away from 0 increases R
Variation from 0 more pronounced
with lower N
Need to correct R2
Adjusted R2
284
Calculation of Adj. R2
Adj. R² = 1 − (1 − R²) × (N − 1) / (N − k − 1)
1 − R²: proportion of unexplained variance
We multiply this by an adjustment
More variables → greater adjustment
More people → less adjustment
285
Shrunken R2
Some authors treat shrunken and
adjusted R2 as the same thing
Others don't
286
Adjustment = (N − 1) / (N − k − 1)
N = 20, k = 3: (20 − 1) / (20 − 3 − 1) = 19 / 16 = 1.1875
N = 10, k = 3: (10 − 1) / (10 − 3 − 1) = 9 / 6 = 1.5
N = 10, k = 8: (10 − 1) / (10 − 8 − 1) = 9 / 1 = 9
287
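A short sketch of the adjusted R² calculation shown above:

def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# the adjustment factor alone, for the slide's examples
for n, k in [(20, 3), (10, 3), (10, 8)]:
    print(n, k, (n - 1) / (n - k - 1))   # 1.1875, 1.5, 9.0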
Extra Bits
Some stranger things that
can
happen
Counter-intuitive
288
Suppressor variables
Can be hard to understand
Very counter-intuitive
Definition
An independent variable which
increases the size of the parameters
associated with other independent
variables above the size of their
correlations
289
Correlation matrix
          Mech   Verb   Success
Mech      1      0.5    0.3
Verb      0.5    1      0
Success   0.3    0      1
290
291
Mechanical ability
b = 0.4
Larger than r!
Verbal ability
b = -0.2
Smaller than r!!
So what is happening?
You need verbal ability to do the test
Not related to mechanical ability
Measure of mechanical ability is
contaminated by verbal ability
292
Low verbal
Negative, because we are talking about
standardised scores
Your mech score is really high: you did well on
the mechanical test, without being good
at the words
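A sketch of where those slopes come from: standardised coefficients are β = Rxx⁻¹·rxy, using the mech/verbal/success correlations above.

import numpy as np

Rxx = np.array([[1.0, 0.5],
                [0.5, 1.0]])      # correlations among the IVs (mech, verbal)
rxy = np.array([0.3, 0.0])        # correlations of each IV with success

beta = np.linalg.inv(Rxx) @ rxy
print(beta)                       # [ 0.4 -0.2]: mech above its r, verbal below zero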
Another suppressor?
     x1    x2    y
x1   1     0.5   0.3
x2   0.5   1     0.2
y    0.3   0.2   1
b1 = ?
b2 = ?
294
Another suppressor?
     x1    x2    y
x1   1     0.5   0.3
x2   0.5   1     0.2
y    0.3   0.2   1
b1 = 0.26
b2 = 0.06
295
And another?
     x1    x2    y
x1   1     0.5   0.3
x2   0.5   1     -0.2
y    0.3   -0.2  1
b1 = ?
b2 = ?
296
And another?
     x1    x2    y
x1   1     0.5   0.3
x2   0.5   1     -0.2
y    0.3   -0.2  1
b1 = 0.53
b2 = -0.47
297
One more?
     x1    x2    y
x1   1     -0.5  0.3
x2   -0.5  1     0.2
y    0.3   0.2   1
b1 = ?
b2 = ?
298
One more?
     x1    x2    y
x1   1     -0.5  0.3
x2   -0.5  1     0.2
y    0.3   0.2   1
b1 = 0.53
b2 = 0.47
299
          Mech   Verbal1   Verbal2   Scores
Mech      1      0.1       0.1       0.6
Verbal1   0.1    1         0.9       0.6
Verbal2   0.1    0.9       1         0.3
Scores    0.6    0.6       0.3       1

Standardised estimates: Mech b = 0.56, Verbal1 b = 1.71, Verbal2 b = -1.29
Mechanical
About where we expect
Verbal 1
Very high
Verbal 2
Very low
303
What is going on
Its a suppressor again
An independent variable which
increases the size of the parameters
associated with other independent
variables above the size of their
correlations
304
Variable Selection
What are the appropriate
independent variables to use in a
model?
Depends what you are trying to do
Prediction
What will happen
in the future?
Emphasis on
practical
application
Variables selected
(more) empirically
Value free
Explanation
Why did
something
happen?
Emphasis on
understanding
phenomena
Variables selected
theoretically
Not value free
306
Hierarchical
Variables entered in a predetermined
order
Stepwise
Variables entered according to
change in R2
Actually a family of techniques
308
Entrywise
All variables entered simultaneously
All treated equally
Hierarchical
Entered in a theoretically determined
order
Change in R2 is assessed, and tested
for significance
e.g. sex and age
Should not be treated equally with other
variables
Sex and age MUST be first
Stepwise
Variables entered empirically
Variable which increases R2 the most
goes first
Then the next
Example
IVs: Sex, age, extroversion,
DV: Car how long someone spends
looking after their car
310
Correlation Matrix
SEX
SEX
AGE
EXTRO
CAR
AGE
1.00
-0.05
0.40
0.66
-0.05
1.00
0.40
0.23
EXTRO CAR
0.40
0.66
0.40
0.23
1.00
0.67
0.67
1.00
311
Entrywise analysis
R² = 0.64
         b      p
SEX      0.49   <0.01
AGE      0.08   0.46
EXTRO    0.44   <0.01
312
Stepwise Analysis
Data determines the order
Model 1: Extroversion, R2 = 0.450
Model 2: Extroversion + Sex, R2 =
0.633
         b      p
EXTRO    0.48   <0.01
SEX      0.47   <0.01
313
Hierarchical analysis
Theory determines the order
Model 1: Sex + Age, R2 = 0.510
Model 2: S, A + E, R2 = 0.638
Change in R2 = 0.128, p = 0.001
         b      p
SEX      0.49   <0.01
AGE      0.08   0.46
EXTRO    0.44   <0.01
314
Hierarchical
The change in R2 gives the best estimate
of the importance of extroversion
316
N is large
40 people per predictor, Cohen, Cohen,
Aiken, West (2003)
A quick note on R2
R2 is sometimes regarded as the fit
of a regression model
Bad idea
318
Critique of Multiple
Regression
Goertzel (2002)
Myths of murder and multiple
regression
Skeptical Inquirer (Paper B1)
But:
More guns in rural Southern US
More crime in urban North (crack
cocaine epidemic at time of data)
320
Legalised Abortion
Donohue and Levitt (1999)
Legalised abortion in 1970s cut crime in
1990s
Another Critique
Berk (2003)
Regression analysis: a constructive critique
(Sage)
Is Regression Useless?
Do regression carefully
Dont go beyond data which you have
a strong theoretical understanding of
Validate models
Where possible, validate predictive
power of models in other areas,
times, groups
Particularly important with stepwise
324
Lesson 7: Categorical
Independent Variables
325
Introduction
326
Introduction
So far, just looked at continuous
independent variables
Also possible to use categorical
(nominal, qualitative) independent
variables
e.g. Sex; Job; Religion; Region; Type
(of anything)
Historical Note
But these (t-test/ANOVA) are
special cases of regression analysis
Aspects of General Linear Models
(GLMs)
It is much easier to do it by
partitioning of sums of squares
These cases
Very rare in applied research
Very common in experimental
research
Fisher worked at Rothamsted agricultural
research station
Never have problems manipulating
wheat, pigs, cabbages, etc
329
In psychology
Led to a split between experimental
psychologists and correlational
psychologists
Experimental psychologists (until
recently) would not think in terms of
continuous variables
The Approach
331
The Approach
Recode the nominal variable
Into one, or more, variables to represent
that variable
333
The Techniques
334
Effect coding
For >2 groups
Original Category → New Variable
Exp → 1
Con → 0
337
Some data
Group is x, score is y
               Control Group   Experimental Group
Experiment 1   10              10
Experiment 2   10              20
Experiment 3   10              30
338
Control Group = 0
Intercept = Score on Y when x = 0
Intercept = mean of control group
Experimental Group = 1
b = change in Y when x increases 1
unit
b = difference between experimental
group and control group
339
[Figure: group means plotted for the Control and Experimental groups in Experiments 1–3; the gradient of the slope represents the difference between the means]
340
Dummy Coding 3+
Groups
With three groups the approach is
similar
g = 3, therefore g-1 = 2 variables
needed
3 Groups
Control
Experimental Group 1
Experimental Group 2
341
Original Category   Gp1   Gp2
Con                 0     0
Gp1                 1     0
Gp2                 0     1
F and associated p
Tests H0 that
μg1 = μg2 = μg3
b1 and b2 and associated p-values
Test difference between each
experimental group and the control
group
343
344
Effect Coding
Usually used for 3+ groups
Compares each group (except the
reference group) to the mean of all
groups
Dummy coding compares each group to the
reference group.
Examples
Dummy coding and Effect Coding
Group 1 chosen as reference group
each time
Data
Group   Mean    SD
1       52.40   4.60
2       56.30   5.70
3       60.10   5.00
Total   56.27   5.88
347
Dummy coding
Group   dummy2   dummy3
1       0        0
2       1        0
3       0        1

Effect coding
Group   effect2   effect3
1       -1        -1
2       1         0
3       0         1
348
Dummy:  R = 0.543, F = 5.7, df = 2, 27, p = 0.009
        b0 = 52.4
        b1 = 3.9, p = 0.100
        b2 = 7.7, p = 0.002
        b0 = mean(g1); b1 = mean(g2) − mean(g1); b2 = mean(g3) − mean(g1)

Effect: R = 0.543, F = 5.7, df = 2, 27, p = 0.009
        b0 = 56.27
        b1 = 0.03, p = 0.980
        b2 = 3.8, p = 0.007
        b0 = grand mean (G); b1 = mean(g2) − G; b2 = mean(g3) − G
349
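A sketch contrasting the two coding schemes. The scores are hypothetical (two cases per group, balanced), purely to show how the intercepts differ: with dummy coding b0 is the reference-group mean, with effect coding b0 is the grand mean.

import numpy as np

scores = np.array([50, 54, 55, 57, 59, 61], dtype=float)   # hypothetical, 2 per group
dummy = np.array([[1, 0, 0], [1, 0, 0],
                  [1, 1, 0], [1, 1, 0],
                  [1, 0, 1], [1, 0, 1]], dtype=float)
effect = np.array([[1, -1, -1], [1, -1, -1],
                   [1,  1,  0], [1,  1,  0],
                   [1,  0,  1], [1,  0,  1]], dtype=float)

b_dummy, *_ = np.linalg.lstsq(dummy, scores, rcond=None)
b_effect, *_ = np.linalg.lstsq(effect, scores, rcond=None)
print(b_dummy)    # [52. 4. 8.]: group-1 mean, then differences from group 1
print(b_effect)   # [56. 0. 4.]: grand mean, then differences from the grand mean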
In SPSS
SPSS provides two equivalent
procedures for regression
Regression (which we have been using)
GLM (which we havent)
GLM will:
Automatically code categorical variables
Automatically calculate interaction terms
GLM wont:
Give standardised effects
Give hierarchical R2 p-values
Allow you to not understand
350
351
Test
(Which is a trick; but its designed to
make you think about it)
354
        High Stress   Low Stress
AM      20.1          22.3
PM      6.8           11.8
Diff    13.3          10.5
Using regression
Ensures that all the variance that is
subtracted is true
Reduces the error variance
Two effects
Adjusts the means
Compensates for differences between
groups
358
In SPSS
SPSS automates all of this
But you have to understand it, to
know what it is doing
359
[Screenshots of the SPSS GLM dialog: the outcome goes here, categorical predictors here, continuous predictors here; click Options and select Parameter estimates]
361
More on Change
If difference score is correlated
with either pre-test or post-test
Subtraction fails to remove the
difference between the scores
If two scores are uncorrelated
Difference will be correlated with both
Failure to control
Equal SDs, r = 0
Correlation of change and pre-score
=0.707
362
Lesson 8: Assumptions in
Regression Analysis
364
The Assumptions
1. The distribution of residuals is normal
(at each value of the dependent
variable).
2. The variance of the residuals for every
set of values for the independent
variable is equal.
violation is called
heteroscedasticity.
3. The error term is additive
no interactions.
366
367
Assumption 1: The
Distribution of Residuals is
Normal at Every Value of
the Dependent Variable
368
Look at Normal
Distributions
A normal distribution is
symmetrical and bell-shaped (so they
say)
369
Kurtosis
too flat or too peaked
kurtosed
Outliers
Individual cases which are far from the
distribution
370
Kurtosis
mean not biased
standard deviation is
and hence standard errors, and
significance tests
371
Examining Univariate
Distributions
Histograms
Boxplots
P-P plots
Calculation based methods
372
Histograms
[Histograms of distributions A and B, C and D, E and F]
376
Boxplots
377
P-P Plots
[P–P plots for distributions A and B, C and D, E and F]
Calculation Based
Skew and Kurtosis statistics
Outlier detection statistics
381
382
Skewness: -0.12, 0.271, 0.454, 0.117, 2.106, 0.171 (SE = 0.172)
Kurtosis: -0.084, 0.265, 1.885, -1.081, 5.75, -0.21 (SE = 0.342)
383
Outlier Detection
Calculate distance from mean
z-score (number of standard deviations)
deleted z-score
that case biased the mean, so remove it
Calculate influence
how much effect did that case have on the
mean?
384
Non-Normality in
Regression
385
Checks on Normality
Check residuals are normally
distributed
SPSS will draw histogram and p-p plot
of residuals
Regression Diagnostics
Residuals
standardised, unstandardised, studentised,
deleted, studentised-deleted
look for cases > |3| (?)
Influence statistics
Look for the effect a case has
If we remove that case, do we get a
different answer?
DFBeta, Standardised DFBeta
changes in b
388
Covariance ratio
Ratio of the determinants of the
covariance matrices, with and without
the case
Distances
measures of distance from the
centroid
some include IV, some dont
389
More on Residuals
Residuals are trickier than you
might have imagined
Raw residuals
OK
Standardised residuals
Residuals divided by SD
s_e = √( Σe² / (n − k − 1) )
390
Leverage
But
That SD is wrong
Variance of the residuals is not equal
Those further from the centroid on the
predictors have higher variance
Need a measure of this
hi = 1/n + (xi − x̄)² / Σ(x − x̄)²
The mean leverage is Σhi / n
The centred leverage leaves out the 1/n term:
h*i = (xi − x̄)² / Σ(x − x̄)²
393
Multiple predictors
Calculate the hat matrix (H)
Leverage values are the diagonals of
this matrix
H = X(X′X)⁻¹X′
Where X is the augmented matrix of
predictors (i.e. matrix that includes
the constant)
Hence leverage hii element ii of H
394
Example: X = [1 15; 1 20; … ; 1 65]
H = X(X′X)⁻¹X′ = [0.318 0.273 …; 0.273 0.236 …; …]
Leverage values are the diagonal elements: h11 = 0.318, h22 = 0.236, …
Standardised / Studentised
Now we can calculate the
standardised residuals
SPSS calls them studentised residuals
Also called internally studentised
residuals
e′i = ei / (s_e × √(1 − hi))
396
Deleted Studentised
Residuals
Studentised residuals do not have
a known distribution
Cannot use them for inference
Testing Significance
We can calculate the probability of
a residual
Is it sampled from the same
population
BUT
Massive type I error rate
Bonferroni correct it
Multiply p value by N
398
Bivariate Normality
We didn't just say residuals are
normally distributed
We said at every value of the
dependent variable
Two variables can be normally
distributed univariate,
but not bivariate
399
Couples' IQs
male and female
[Histograms: female and male IQ scores, 60–140]
But wait!!
[Scatterplot: male IQ plotted against female IQ]
So plot X against Y
OK for bivariate
but may be a multivariate outlier
Need to draw graph in 3+ dimensions
can't draw a graph in 3+ dimensions
[Histogram of residuals from the couples' IQ regression]
403
Multivariate Outliers
Will be explored later in the
exercises
So we move on
404
Transform data
removes skew
positive skew → log transform
negative skew → square
405
Transformation
May need to transform IV and/or DV
More often DV
time, income, symptoms (e.g. depression) all
positively skewed
Change measures
increase sensitivity at ranges
avoiding floor and ceiling effects
Outliers
Can be tricky
Why did the outlier occur?
Error? Delete them.
Weird person? Probably delete them
Normal person? Tricky.
407
Which is better?
A good model, which explains 99% of
your data?
A poor model, which explains all of it
409
410
Heteroscedasticity
This assumption is about
heteroscedasticity of the residuals
Hetero=different
Scedastic = scattered
[Scatterplot: male IQ against female IQ]
standardised residuals
deleted residuals
standardised deleted residuals
studentised residuals
[Residuals against predicted values: good — no heteroscedasticity]
[Residuals against predicted values: bad — heteroscedasticity]
415
Testing Heteroscedasticity
White's test
Regress the squared residuals on the predictors (plus their squares and cross-products)
Test statistic = N × R² (from this second regression)
Distributed as χ²
df = k (for the second regression)
Magnitude of
Heteroscedasticity
Chop data into slices
5 slices, based on X (or predicted
score)
Done in SPSS
420
Variances of the 5 groups:
0.219, 0.336, 0.757, 0.751, 3.119
We have a problem
3 / 0.2 ≈ 15
Dealing with
Heteroscedasticity
4.
5.
6.
7.
Heteroscedasticity
Implications and Meanings
Implications
What happens as a result of
heteroscedasticity?
Parameter estimates are correct
not biased
However
If there is no skew in predicted
scores
P-values a tiny bit wrong
If skewed,
P-values very wrong
Can do exercise
425
Meaning
What is heteroscedasticity trying
to tell us?
Our model is wrong: it is misspecified
Something important is happening
that we have not accounted for
b0 = 0.24, p=0.97
b1 = 0.71, p < 0.001
b2 = 0.23, p = 0.031
White's test
χ² = 18.6, df = 5, p = 0.002
Which means
the effects of the variables are not
additive
If you think that what a charity does
is important
you might give more money
how much more depends on how much
money you have
429
[Scatterplot: amount GIVEN against IMPORT (perceived importance of the charity), with separate lines for High and Low earnings]
430
431
432
Additivity
What heteroscedasticity shows you
effects of variables need to be additive
In medicine
Choose to test for salient non-additive
effects
e.g. sex, race
435
Assumption 4: At every
value of the dependent
variable the expected
(mean) value of the
residuals is zero
436
Linearity
Relationships between variables should
be linear
best represented by a straight line
437
[Scatterplot: fuel consumption against speed — a curved relationship]
438
R2 = 0.938
looks pretty good
know speed, make a good prediction
of fuel
BUT
look at the chart
if we know speed we can make a
perfect prediction of fuel used
R2 should be 1.00
439
Detecting Non-Linearity
Residual plot
just like heteroscedasticity
440
Residual plot
441
Linearity: A Case of
Additivity
Linearity = additivity along the range of
the IV
Jeremy rides his bicycle harder
Increase in speed depends on current speed
Not additive, multiplicative
MacCallum and Mar (1995). Distinguishing
between moderator and quadratic effects in
multiple regression. Psychological Bulletin.
442
443
Independence Assumption
Also: lack of autocorrelation
Tricky one
often ignored
exists for almost all tests
How is it Detected?
Can be difficult
need some clever statistics
(multilevel models)
Residual Plots
Were data collected in time order?
If so plot ID number against the
residuals
Look for any pattern
Test for linear relationship
Non-linear relationship
Heteroscedasticity
446
[Plot: residuals against participant number]
447
clusters of cases
patients treated by three doctors
children from different classes
people assessed in groups
448
An example
students do an exam (on statistics)
choose one of three questions
IV: time
DV: grade
449
[Scatterplot: grade against time]
BUT
we haven't considered which question
people answered
we might have violated the
independence assumption
DV will be autocorrelated
Look again
with questions marked
451
[Scatterplot: grade against time, with points labelled by the question answered]
452
453
454
Assumption 6: All
independent variables are
uncorrelated with the
error term.
455
It is about the DV
must have no effect (when the IVs
have been removed)
on the DV
456
Problem in economics
Demand increases supply
Supply increases wages
Higher wages increase demand
457
Assumption 7: No
independent variables are
a perfect linear function
of other independent
variables
no perfect multicollinearity
458
No Perfect Multicollinearity
IVs must not be linear functions of one
another
matrix of correlations of IVs is not positive
definite
cannot be inverted
analysis cannot proceed
460
461
Y 0 1 x1
Y ( 0 3) 1 x1 ( 3)
- note, Greek letters because we are
talking about population values
462
463
465
Lesson 9: Issues in
Regression Analysis
Things that alter the
interpretation of the
regression equation
466
Causality
Sample sizes
Collinearity
Measurement error
467
Causality
468
What is a Cause?
Debate about definition of cause
some statistics (and philosophy)
books try to avoid it completely
We are not going into depth
just going to show why it is hard
I exist because
My parents met because
My father had a job
Proximal cause
the direct and immediate cause of
something
Ultimate cause
the thing that started the process off
I fell off my bicycle because of the
bump
I fell off because I was going too fast
471
474
Association
Correlation does not mean causation
we all know
But
Causation does mean correlation
         Price   Demand   Sales
Price    1       0.6      0
Demand   0.6     1        0.6
Sales    0       0.6      1
477
Direction of Influence
Relationship between A and B
three possible processes
A causes B
B causes A
C causes A & B
478
Storm
Isolation
Isolate the dependent variable
from all other influences
as experimenters try to do
Cannot do this
can statistically isolate the effect
using multiple regression
480
Role of Theory
Strong theory is crucial to making
causal statements
Fisher said: to make causal
statements make your theories
elaborate.
dont rely purely on statistical
analysis
483
I drink a lot
of beer
16 causal
relations
120 non-causal
correlations
laugh
toilet
jokes (about
statistics)
vomit
karaoke
curtains closed
sleeping
headache
equations (beermat)
thirsty
fried breakfast
no beer
curry
chips
falling over
lose keys
484
485
1.
2.
3.
4.
5.
6.
7.
8.
488
No Causation without
Experimentation
Blatantly untrue
I don't doubt that the sun shining
makes us warm
AI and Causality
A robot needs to make judgements
about causality
Needs to have a mathematical
representation of causality
Suddenly, a problem!
Doesn't exist
Most operators are non-directional
Causality is directional
490
Sample Sizes
How many subjects does it
take to run a regression
analysis?
491
Introduction
Social scientists don't worry enough
about the sample size required
"Why didn't you get a significant result?"
"I didn't have a large enough sample"
Not a common answer
493
Rules of Thumb
Lots of simple rules of thumb exist
10 cases per IV
>100 cases
Green (1991) more sophisticated
To test significance of R2 N = 50 + 8k
To test sig of slopes, N = 104 + k
Power Analysis
Introducing Power Analysis
Hypothesis test
tells us the probability of a result of
that magnitude occurring, if the null
hypothesis is correct (i.e. there is no
effect in the population)
Doesn't tell us
the probability of that result, if the
null hypothesis is false
495
496
Type I Errors
Type I error is false rejection of H0
Probability of making a type I error
the significance value cut-off
usually 0.05 (by convention)
Type II errors
Type II error is false acceptance of
the null hypothesis
Much, much trickier
Example
I do an experiment (random
sampling, all assumptions perfectly
satisfied)
I find p = 0.05
498
Power = 1 − β
Probability of getting a significant
result
500
Research findings against reality:
                                   H0 true (no effect to be found)   H0 false (effect to be found)
We find no effect (p > 0.05)       correct                           Type II error: p = β
We find an effect (p < 0.05)       Type I error: p = α               correct: power = 1 − β
501
503
504
f² = R² / (1 − R²)
506
f² = sr²i / (1 − R²)   (using the squared semi-partial correlation of a single predictor)
508
Underpowered Studies
Research in the social sciences is
often underpowered
Why?
See Paper B11 the persistence of
underpowered studies
510
Extra Reading
Power traditionally focuses on p
values
What about CIs?
Paper B8 Obtaining regression
coefficients that are accurate, not
simply significant
511
Collinearity
512
514
Meaning of Collinearity
Literally co-linearity
lying along the same line
Perfect collinearity
when some IVs predict another
Total = S1 + S2 + S3 + S4
S1 = Total − (S2 + S3 + S4)
rare
515
516
Implications
Effects the stability of the
parameter estimates
and so the standard errors of the
parameter estimates
and so the significance
Because
shared variance, which the regression
procedure doesnt know where to put
517
Sex differences
due to genetics?
due to upbringing?
(almost) perfect collinearity
statistically impossible to tell
519
520
Detecting Collinearity
Look at the parameter estimates
large standardised parameter
estimates (>0.3?), which are not
significant
be suspicious
Tolerance = 1 − R²i (R² of that IV regressed on the other IVs)
VIF = 1 / Tolerance
522
Actions
What you can do about collinearity
no quick fix (Fox, 1991)
get a bigger N
Many measures
4. Ridge regression
Measurement Error
526
What is Measurement
Error
In social science, it is unlikely that
we measure any variable perfectly
measurement error represents this
imperfection
x = T + e
just like a regression equation
standardise the parameters
T is the reliability
the amount of variance in x which comes from
T
Simple Effects of
Measurement Error
Lowers the measured correlation
between two variables
Real correlation
true scores (x* and y*)
Measured correlation
measured scores (x and y)
529
True correlation
of x and y
rx*y*
x*
y*
Reliability of x
rxx
Reliability of y
ryy
Measured
correlation of x and y
rxy
530
Attenuation of correlation
r_x*y* = r_xy / √(r_xx × r_yy)
531
Example
r_xx = 0.7, r_yy = 0.8, r_xy = 0.3
r_x*y* = r_xy / √(r_xx × r_yy)
r_x*y* = 0.3 / √(0.7 × 0.8) = 0.40
Complex Effects of
Measurement Error
Really horribly complex
Measurement error reduces
correlations
reduces estimate of
reducing one estimate
increases others
Complications
Assume measurement error is
additive
linear
Additive
e.g. weight people may under-report /
over-report at the extremes
Linear
particularly the case when using proxy
variables
535
536
537
Introduction
Non-linear effect occurs
when the effect of one independent
variable
is not consistent across the range of
the IV
Assumption is violated
expected value of residuals = 0
no longer the case
538
Some Examples
539
[Figure: a learning curve — skill against experience]
[Figure: the Yerkes–Dodson curve — performance against arousal]
[Figure: mood (suicidal to enthusiastic) against time]
542
Learning
line changed direction once
Yerkes-Dodson
line changed direction once
Enthusiasm
line changed direction twice
543
Everything is Non-Linear
Every relationship we look at is
non-linear, for two reasons
Exam results cannot keep increasing
with reading more books
Linear in the range we examine
Non-Linear
Transformations
545
Transformations
We need to transform the data
rather than estimating a curved line
which would be very difficult
may not work with OLS
Much trickier
Statistical theory either breaks down
OR becomes harder
547
Linear transformations
multiply by a constant
add a constant
change the slope and the intercept
548
[Figure: the lines y = x, y = 2x and y = x + 3]
549
Non-linear transformation
will bend the slope
Quadratic transformation
y = x²
one change of direction
Cubic transformation
y = x² + x³
two changes of direction
551
Quadratic Transformation
552
y = 20 − 3x + 5x²
553
Cubic Transformation
y = 3 − 4x + 2x² − 0.2x³
554
Logarithmic Transformation
y = 1 + 0.1x + 10log(x)
555
Inverse Transformation
y = 20 -10x + 8(1/x)
556
557
Detecting Non-linearity
558
Draw a Scatterplot
Draw a scatterplot of y plotted
against x
see if it looks a bit non-linear
e.g. Anscombe's data
e.g. Education and beginning salary
from bank data
drawn in SPSS
with line of best fit
559
Anscombe (1973)
constructed a set of datasets
show the importance of graphs in
regression/correlation
All four datasets share: N = 11, mean of x = 9, mean of y = 7.5, regression line y = 3 + 0.5x, r = 0.82, R² = 0.67
560
561
562
563
564
A Real Example
Starting salary and years of
education
From employee data.sav
565
[Scatterplot: beginning salary against education; over some ranges of education the expected value of the error (residual) is > 0, over others it is < 0]
566
567
We want
points to lie in a nice straight sausage
568
We don't want
a nasty bent sausage
570
571
Linear Transformation
Linear transformation doesn't
change
interpretation of slope
standardised slope
se, t, or p of slope
R2
Can change
effect of a transformation
572
Non-linear Effect
Compute new variable
quadratic
educ2 = educ²
Standardised
b1 (educ) = -2.4
b2 (educ2) = 3.1
Collinearity
is what is going on
Correlation of educ and educ2
r = 0.990
Cubic Effect
While we are at it, let's look at the
cubic effect
R² (change) = 0.004, p = 0.045
ŷ = 19138 + 103e − 206e² + 12e³
Standardised:
b1(e) = 0.04
b2(e2) = -2.04
b3(e3) = 2.71
577
Fourth Power
Keep going while we are ahead
won't run
???
Interpretation
Tricky, given that parameter
estimates are a bit nonsensical
Two methods
1: Use R2 change
Save predicted values
or calculate predicted values to plot line
of best fit
[Figure: beginning salary against education (years), with linear, quadratic and cubic fitted lines]
580
581
Education   Slope
9           -962
10          -342
11          278
12          898
13          1518
14          2138
15          2758
16          3378
17          3998
18          4618
19          5238
20          5858
1 year of
education at the
higher end of the
scale, better than
1 year at the lower
end of the scale.
MBA versus GCSE
582
Differentiate Cubic
ŷ = 19138 + 103e − 206e² + 12e³
dy/de = 103 − 206 × 2 × e + 12 × 3 × e²
Can calculate slopes for quadratic
and cubic at different values
583
A Quick Note on
Differentiation
For y = xᵖ
dy/dx = p·xᵖ⁻¹
y = 4x + 5x² + 6x³
dy/dx = 4 + 5 × 2 × x + 6 × 3 × x²
Many functions are simple to
differentiate
Not all though
586
Automatic Differentiation
If you
Don't know how to differentiate
Can't be bothered to look up the
function
588
589
Introduction
Often in social sciences, we have a
dichotomous/nominal DV
we will look at dichotomous first, then a
quick look at multinomial
Dichotomous DV
e.g.
guilty/not guilty
pass/fail
won/lost
Alive/dead (used in medicine)
590
591
Exp   Pass
6     0
15    0
12    0
6     0
15    1
6     0
16    1
10    1
12    0
26    1
593
DV
pass (1 = Yes, 0 = No)
Or does it?
1st Problem: P–P plot of residuals
[P–P plot: expected cumulative probability against observed probability for the residuals]
595
596
Problems 1 and 2
strange distributions of residuals
parameter estimates may be wrong
standard errors will certainly be
wrong
597
Cannot be interpreted
need a different approach
598
A Different Approach
Logistic Regression
599
Logit Transformation
In lesson 10, transformed IVs
now transform the DV
No lower limit
you cant do worse than fail
600
Step 1: Convert to
Probability
First, stop talking about values
talk about probability
for each value of score, calculate
probability of pass
601
Score        1     2     3     4     5
Fail   N     7     5     6     4     2
       P     0.7   0.5   0.6   0.4   0.2
Pass   N     3     5     4     6     8
       P     0.3   0.5   0.4   0.6   0.8

The probability of failure given a score of 1 is 0.7; the probability of passing given a score of 5 is 0.8
602
This is better
Now a score of 0.41 has a meaning
a 0.41 probability of pass
603
604
605
Odds = p / (1 − p); for p = 0.8, odds = 0.8 / 0.2 = 4
equivalent to 4:1 (odds on)
4 times out of five
606
607
log₁₀(x)
log(10) = 1
log(100) = 2
log(1000) = 3
608
log(1) = 0
log(0.1) = -1
log(0.00001) = -5
609
Natural log, ln
Has some desirable properties, that
log10 doesnt
For us
If y = ln(x) + c
dy/dx = 1/x
Not true for any other logarithm
610
611
612
613
Score             1      2      3      4      5
Fail   N          7      5      6      4      2
       P          0.7    0.5    0.6    0.4    0.2
Pass   N          3      5      4      6      8
       P          0.3    0.5    0.4    0.6    0.8
Odds (Fail)       2.33   1.00   1.50   0.67   0.25
log(odds) Fail    …
[Figure: probability plotted against logit — an S-shaped curve; probability gets closer to zero, but never reaches it, as the logit goes down]
616
Parameter Estimation
using ML
ML tries to find estimates of model
parameters that are most likely to
give rise to the pattern of
observations in the sample data
All gets a bit complicated
OLS is a special case of ML
the mean is an ML estimator
617
Interpreting Output
Using SPSS
Overall fit for:
step (only used for stepwise)
block (for hierarchical)
model (always)
in our model, all are the same
χ² = 4.9, df = 1, p = 0.025
(plays the role the F test does in OLS regression)
619
        Chi-square   df   Sig.
Step    4.990        1    .025
Block   4.990        1    .025
Model   4.990        1    .025
620
Model summary
-2LL (=2/N)
Cox & Snell R2
Nagelkerke R2
Different versions of R2
No real R2 in logistic regression
should be considered pseudo R2
621
Model Summary
Step   -2 Log likelihood   Cox & Snell R²   Nagelkerke R²
1      64.245              .095             .127
622
Classification Table
predictions of model
based on cut-off of 0.5 (by default)
predicted values x actual values
623
Classification Table (a)
                     Predicted 0   Predicted 1   Percentage Correct
Observed PASS = 0    18            8             69.2
Observed PASS = 1    12            12            50.0
Overall Percentage                               60.0
a. The cut value is .500
624
Model parameters
B
Change in the logged odds associated
with a change of 1 unit in IV
just like OLS regression
difficult to interpret
SE (B)
Standard error
Multiply by 1.96 to get 95% CIs
625
           B       S.E.   Wald
SCORE      -.467   .219   4.566
Constant   1.314   .714   3.390
Constant
i.e. score = 0
B = 1.314
Exp(B) = e^B = e^1.314 = 3.720
OR = 3.720; p = 1 − (1 / (OR + 1))
= 1 − (1 / (3.720 + 1))
p = 0.788
627
Score 1
Constant b = 1.314
Score B = -0.467
Exp(1.314 − 0.467) = Exp(0.847) = 2.332
OR = 2.332
p = 1 − (1 / (2.332 + 1))
= 0.699
628
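A sketch of the same arithmetic in Python: turn the fitted coefficients above back into the probability the model implies at each score.

import math

b0, b1 = 1.314, -0.467           # constant and slope for score, from the output above

def p_outcome(score):
    logit = b0 + b1 * score      # log odds
    odds = math.exp(logit)
    return odds / (odds + 1)     # same as 1 - 1/(odds + 1)

for s in (0, 1, 2):
    print(s, round(p_outcome(s), 3))   # 0.788, 0.699, ...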
Symmetrical in B
Non-symmetrical (sometimes very) in
exp(B)
629
           B       S.E.   Exp(B)   95% CI Lower   95% CI Upper
SCORE      -.467   .219   .627     .408           .962
Constant   1.314   .714   3.720
631
633
Probit Regression
Very similar to logistic
much more complex initial
transformation (to normal
distribution)
Very similar results to logistic
(multiplied by 1.7)
In SPSS:
A bit weird
Probit regression available through
menus
634
However
Ordinal logistic regression is
equivalent to binary logistic
If outcome is binary
635
Results
                     Variable   Estimate   SE      p
Logistic (binary)    Score      0.288      0.301   0.339
                     Exp        0.147      0.073   0.043
Logistic (ordinal)   Score      0.288      0.301   0.339
                     Exp        0.147      0.073   0.043
Probit               Score      0.191      0.178   0.282
                     Exp        0.090      0.042   0.033
636
Differentiating Between
Probit and Logistic
Depends on shape of the error term
Normal or logistic
Graphs are very similar to each other
Could distinguish quality of fit
Given enormous sample size
Probit advantage
Understand the distribution
Logistic advantage
Much simpler to get back to the probability
637
638
[Figure: the normal (probit) and logistic curves overlaid — very similar shapes]
Infinite Parameters
Non-convergence can happen
because of infinite parameters
Insoluble model
Three kinds:
Complete separation
The groups are completely distinct
Pass group all score more than 10
Fail group all score less than 10
639
Quasi-complete separation
Separation with some overlap
Pass group all score 10 or more
Fail group all score 10 or less
Both cases:
No convergence
Close to this
Curious estimates
Curious standard errors
640
Categorical Predictors
Can cause separation
Esp. if correlated
Need people in every cell
[Crosstab: Male/Female × White/Non-White × Below/Above Poverty Line — every cell needs people]
641
Calculate c statistic
Measure of discriminative power
Percentage of all possible pairs of cases in which the model
gives a higher probability to the case with the outcome
than to the case without it
642
Save probabilities
Use Graphs, ROC Curve
Test variable: predicted probability
State variable: outcome
643
Specificity
Probability of saying someone has a
negative result
If they do: p(neg)|neg
644
Sensitivity (value)
P(m)
645
Salary   P(minority)
10       .39
20       .31
30       .23
40       .17
50       .12
60       .09
70       .06
80       .04
90       .03
646
647
[ROC curve: sensitivity against 1 − specificity; diagonal segments are produced by ties]
The area under the curve is the c-statistic
648
More Advanced
Techniques
Multinomial Logistic Regression
more than two categories in DV
same procedure
one category chosen as reference
group
odds of being in category other than
reference
Final Thoughts
Logistic Regression can be
extended
dummy variables
non-linear effects
interactions (even though we don't
cover them until the next lesson)
651
652
653
Introduction
Moderator
Level of one variable influences effect of
another variable
Mediator
One variable influences another via a third
variable
education
beginning
salary
Why?
What is the process?
Are we making assumptions about the
process?
Should we test those assumptions?
655
[Path diagram: education → job skills, expectations, negotiating skills, kudos for bank → beginning salary]
[Path diagram: having fun in pub in evening → not reading books on regression → less knowledge. Anything here (a direct path)?]
[Path diagram: the same, with fatigue added as another mediator. Is the direct path still needed?]
659
Mediators needed
to cope with more sophisticated
theory in social sciences
make explicit assumptions made
about processes
examine direct and indirect influences
660
Detecting Mediation
661
4 Steps
From Baron and Kenny (1986)
To establish that the effect of X on Y
is mediated by M
1. Show that X predicts Y
2. Show that X predicts M
3. Show that M predicts Y, controlling
for X
4. If effect of X controlling for M is zero, M is a complete mediator of the relationship
[Path diagram: Buy Books → Read Books]
663
Three Variables
Enjoy
How much an individual enjoys books
Buy
How many books an individual buys
(in a year)
Read
How many books an individual reads
(in a year)
664
        ENJOY   BUY    READ
ENJOY   1.00    0.64   0.73
BUY     0.64    1.00   0.75
READ    0.73    0.75   1.00
665
The Theory
enjoy
buy
read
666
Step 1
1. Show that X (enjoy) predicts Y
(read)
b1 = 0.487, p < 0.001
standardised b1 = 0.732
OK
667
668
669
b2 = 0.287, p = 0.001
standardised b2 = 0.431
Hmmmm
670
[Path diagram: enjoy → read direct effect = 0.287 (step 4); enjoy → buy = 0.974 (from step 2); buy → read = 0.206 (from step 3)]
SE of Mediator
[Path diagram: enjoy → buy, path a (from step 2); buy → read, path b (from step 3)]
sa = se(a)
sb = se(b)
673
Sobel test
Standard error of mediation
coefficient can be calculated
se(ab) = √( b²·sa² + a²·sb² − sa²·sb² )
a = 0.974, sa = 0.189
b = 0.206, sb = 0.054
674
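A sketch of the Sobel test using the a and b paths and standard errors above (the formula with the subtracted term, as on the slide):

import math
from scipy import stats

a, sa = 0.974, 0.189
b, sb = 0.206, 0.054

ab = a * b                                             # the indirect (mediated) effect
se_ab = math.sqrt(b**2 * sa**2 + a**2 * sb**2 - sa**2 * sb**2)
z = ab / se_ab
p = 2 * stats.norm.sf(abs(z))
print(round(ab, 3), round(se_ab, 3), round(z, 2), round(p, 4))   # ~0.201, 0.065, 3.1, 0.002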
675
A Note on Power
Recently
Move in methodological literature away
from this conventional approach
Problems of power:
Several tests, all of which must be
significant
Type I error rate = 0.05 * 0.05 = 0.0025
Must affect power
676
677
678
679
Introduction
Moderator relationships have many
different names
interactions (from ANOVA)
multiplicative
non-linear (just confusing)
non-additive
681
Hang on
That seems very like a nonlinear
relationship
Moderator
Effect of one variable depends on level of another
Non-linear
Effect of one variable depends on level of itself
682
684
685
686
Presence of heteroscedasticity
Clue there may be a moderated
relationship missing
688
689
2 IVs
Data
The four cells (words, test): (1,1), (1,2), (2,1), (2,2)
5 per group
lesson12.1.sav
690
Recog
Recall
Total
691
Graph of means
[Graph of cell means by WORDS and TEST]
692
ANOVA Results
Standard way to analyse these
data would be to use ANOVA
Words: F=6.1, df=1, 16, p=0.025
Test: F=5.1, df=1, 16, p=0.039
Words x Test: F=5.6, df=1, 16,
p=0.031
693
word   test   w×t
-1     -1     1
1      -1     -1
-1     1      -1
1      1      1
695
696
b0=13.2
b1 (words) = -2.3, p=0.025
b2 (test) = -2.1, p=0.039
b3 (words x test) = -2.2, p=0.031
697
b0 = 13.2
the grand mean
b1 = -2.3
distance from the grand mean to the means for the two word types
13.2 - (-2.3) = 15.5
13.2 + (-2.3) = 10.9
b2 = -2.1
distance from the grand mean to the recog and recall means
b3 = -2.2
to understand b3 we need to look at predictions from the equation without this term
699
700
b1 = -2.3, b2 = -2.1
Expected values from the equation without the interaction term, against the actual cell means:
Word  Test  Expected  Actual
 -1    -1     17.6     15.4
 -1     1     13.4     15.6
  1    -1     13.0     15.2
  1     1      8.8     11.0
703
[Figure: means plotted against test type, Recog (-1) vs Recall (1).]
Gradient for both word groups combined: (11.1 - 15.3) / 2 = -2.1
Gradient for abstract words: (6.6 - 15.2) / 2 = -4.3
Gradient for concrete words: (15.6 - 15.4) / 2 = 0.1
705
706
as we shall see
708
Categorical x Continuous
709
Note on Dichotomisation
Very common to see people
dichotomise a variable
Makes the analysis easier
Very bad idea
Paper B6
710
Data
A chain of 60 supermarkets
examining the relationship
between profitability, shop size,
and local competition
2 IVs
shop size
comp (local competition, 0=no,
1=yes)
DV
profit
711
First 10 cases:
Comp  Profit
  1     23
  1     25
  0     19
  0      9
  1     18
  1     33
  0     17
  1     20
  0     21
  0      8
712
1st Analysis
Two IVs
R2=0.367, df=2, 57, p < 0.001
Unstandardised estimates
b1 (shopsize) = 0.083 (p=0.001)
b2 (comp) = 5.883 (p<0.001)
Standardised estimates
b1 (shopsize) = 0.356
b2 (comp) = 0.448
713
Suspicions
Presence of competition is likely to
have an effect
Residual plot shows a little
heteroscedasticity
[Residual plot: standardised residuals (about -3 to 3) against standardised predicted values (-2.0 to 2.0).]
714
Hierarchical regression
715
Result
Unstandardised estimates
b1 (shopsize) = 0.071 (p=0.006)
b2 (comp) = -1.67 (p = 0.506)
b3 (sxc) = -0.050 (p=0.050)
Standardised estimates
b1 (shopsize) = 0.306
b2 (comp) = -0.127
b3 (sxc) = -0.389
716
717
Interpretation
Draw graph with lines of best fit
drawn automatically by SPSS
718
40
30
20
Profit
10
Competition
No competition
All Shops
0
20
40
60
80
100
Shopsize
719
Effects of size
in the presence and absence of competition
(we can ignore the constant)
Y = 0.071 x1 + (-1.67) x2 + (-0.050) x1 x2
Competition present (x2 = 1):
Y = 0.071 x1 + (-1.67) + (-0.050) x1
Y = 0.021 x1 - 1.67
720
721
722
Data
Bank Employees
only using clerical staff
363 cases
predicting starting salary
previous experience
age
age x experience
723
Correlation matrix
only one significant
724
The Procedure
Very similar to before:
create a multiplicative interaction term
BUT differences in means and SDs
cause one variable to dominate the interaction term
By standardising both variables first, we avoid this
726
To standardise x,
subtract mean, and divide by SD
re-expresses x in terms of distance
from the mean, in SDs
ie z-scores
Hierarchical regression
two linear effects first
moderator effect in second
hint: it is often easier to interpret if
standardised versions of all variables
are used
728
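A hedged R sketch of that procedure; it assumes the bank data are in a data frame called dat with columns salbegin, prevexp and agestart (names not checked against the file):

dat$z_exp <- as.numeric(scale(dat$prevexp))    # z-score: (x - mean) / SD
dat$z_age <- as.numeric(scale(dat$agestart))
block1 <- lm(salbegin ~ z_exp + z_age, data = dat)   # linear effects only
block2 <- update(block1, . ~ . + z_exp:z_age)        # add the moderator term
anova(block1, block2)                                # tests the change in R^2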
Change in R2
0.085, p<0.001
Estimates (standardised)
b1 (exp) = 0.104
b2 (agestart) = -0.54
b3 (age x exp) = -0.54
729
Interpretation 1: Pick-a-Point
A graph is tricky
can't put two continuous variables on one scatterplot
Choose specific points (pick-a-point)
Graph the line of best fit of one variable at chosen values of the other
We know:
Y = 0.10 e + (-0.54) a + (-0.54) a e
where a = agestart and e = experience
731
e = -1: b0 = (-1 x 0.10) = -0.10; slope of a = (-0.54 + (-1)(-0.54)) = 0
e = 0: b0 = (0 x 0.10) = 0; slope of a = (-0.54 + 0(-0.54)) = -0.54, i.e. -0.54a
e = 1: b0 = (1 x 0.10) = 0.10; slope of a = (-0.54 + 1(-0.54)) = -1.08, i.e. -1.08a
732
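The same arithmetic in a few lines of R, using the standardised estimates from the slides:

b_exp <- 0.10; b_age <- -0.54; b_int <- -0.54
simple_slope <- function(e) b_age + b_int * e   # slope of agestart at experience = e
sapply(c(-1, 0, 1), simple_slope)               # 0.00, -0.54, -1.08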
733
Calculate p-value
At any point
735
Get results
Calculations in Bauer and Curran
(in press: Multivariate Behavioral
Research)
Paper B13
736
[Figure: critical values of the moderator, CVz1(1), CVz1(2), CVz1(3), plotted over the range -1.0 to 1.0; y-axis roughly 4.0 to 4.5.]
737
Areas of Significance
[Figure: the simple slope (about -0.6 to 0.4) plotted against experience, with confidence bands marking the areas of significance.]
738
2 complications
1: The constant differed
2: The DV was logged, hence non-linear: the effect of a 1-unit change depends on where that unit is
739
Finally
740
Unlimited Moderators
Moderator effects are not limited
to
2 variables
linear effects
741
Block 2
Age x Sex, Age x Exp, Sex x Exp
Block 3
Age x Sex x Exp
742
Results
All two way interactions significant
Three way not significant
Effect of Age depends on sex
Effect of experience depends on sex
Size of the age x experience
interaction does not depend on sex
(phew!)
743
Moderated Non-Linear
Relationships
Enter the non-linear effect
Enter the non-linear effect x moderator
if significant, this indicates that the degree of non-linearity differs by the moderator
744
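As a sketch (hypothetical variable names in a data frame dat, not course data), the model in R would look like:

fit <- lm(y ~ x + I(x^2) + m + x:m + I(x^2):m, data = dat)
summary(fit)   # a significant I(x^2):m term says the curvature of x differs with m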
745
746
[Figure: distribution of the count outcome, with frequencies of roughly 109, 65, 22, 3, 1, 0.]
747
Common approach
Log transform and treat as normal
Problems
Censored at 0
Integers only allowed
Heteroscedasticity
748
749
$p(y \mid x) = \frac{e^{-\mu}\mu^{y}}{y!}$
Where:
y is the count
$\mu$ is the mean of the Poisson distribution
In a Poisson distribution the mean = the variance (hence the heteroscedasticity issue)
751
Poisson Regression in
SPSS
Not directly available
SPSS can be tweaked to do it in three ways:
General loglinear model (genlog)
Non-linear regression (CNLR), bootstrapped p-values only
Generalized linear models (SPSS 15 onwards)
752
Weight cases by
bites
Analyse,
Loglinear, General
Colour is factor
753
Results
Correspondence Between Parameters and Terms of the Design:
Parameter 1: Constant
Parameter 2: [COLOUR = 1]
Parameter 3: x [COLOUR = 2]
Note: 'x' indicates an aliased (or redundant) parameter. These parameters are set to zero.
754
Asymptotic parameter estimates:
Param   Est.     SE      Z-value   95% CI Lower   95% CI Upper
1       4.1190   .1275   32.30     3.87           4.37
2       -.5495   .2108   -2.61     -.96           -.14
3       .0000    .       .         .              .
Note: the intercept (param 1) is curious
Param 2 is the difference in the means
755
SPSS: Continuous
Predictors
Bleedin' nightmare
http://www.spss.com/tech/answer/details.cfm?tech_tan_id=100006204
756
Poisson Regression in
Stata
SPSS will save a Stata file
Open it in Stata
Statistics, Count outcomes, Poisson
regression
757
Poisson Regression in R
R is a freeware program
Similar to SPlus
www.r-project.org
Commands in R
Stage 1: enter the data
colour <- c(1, 0, 1, 0, 1, 0, 1)
bites <- c(3, 1, 0, 0, ...)
Run the analysis
p1 <- glm(bites ~ colour, family = poisson)
Get the results
summary.glm(p1)
759
R Results
Coefficients:
            Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -0.3567      0.1686   -2.115   0.03441 *
colour        0.5555      0.2116    2.625   0.00866 **
Predicted Values
Need to get exponential of
parameter estimates
Like logistic regression
exp(0.5555) = 1.74
You are likely to be bitten by a shark
1.74 times more often with a red
surfboard
761
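In R the same back-transformation can be done directly on the model fitted earlier (p1):

exp(coef(p1))              # rate ratios, e.g. exp(0.5555) = 1.74 for colour
exp(confint.default(p1))   # Wald confidence intervals on the same scale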
Checking Assumptions
Was it really Poisson distributed?
For Poisson:
$p(y \mid x) = \frac{e^{-\mu}\mu^{y}}{y!}$
Strictly:
$p(y_i \mid x_i) = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}$
763
764
Overdispersion
A problem in Poisson regression
Too many zeroes
Causes:
χ² inflation
Standard error deflation
Hence p-values too low
Solution:
Negative binomial regression
765
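A sketch of that solution in R, reusing the shark-bite variables from the earlier example and glm.nb() from the MASS package:

library(MASS)
nb1 <- glm.nb(bites ~ colour)   # negative binomial allows variance > mean
summary(nb1)
AIC(p1, nb1)                    # compare with the Poisson model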
Using R
R can read an SPSS file
But you have to ask it nicely
More on R
R uses objects
To place something into an object use <-
X <- Y
Puts Y into X
The function is read.spss(), from the foreign package
mydata <- read.spss("spssfilename.sav", to.data.frame = TRUE)
GLM in R
Command
glm(outcome ~ pred1 + pred2 + ... + predk [, family = familyname])
If no family is given, the default is gaussian (i.e. OLS)
Use binomial for logistic, poisson for Poisson
769
770
Introducing Structural
Equation Modelling
Lesson 15
771
Introduction
Related to regression analysis
All (OLS) regression can be
considered as a special case of SEM
Regression as SEM
Grades example
Grade = constant + books + attend +
error
Looks like a regression equation
Also
Books correlated with attend
Explicit modelling of error
773
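A rough equivalent in R's lavaan package (an aside on my part; the course itself uses AMOS and Mplus), assuming the grades example is in a data frame called grades with columns grade, books and attend:

library(lavaan)
model <- '
  grade ~ books + attend   # the regression part
  books ~~ attend          # correlation between predictors modelled explicitly
'
fit <- sem(model, data = grades)
summary(fit, standardized = TRUE)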
Path Diagram
A system of equations is usefully represented in a path diagram
[Legend: symbols for a measured variable, an unmeasured variable, a regression path, and a correlation.]
[Diagram: Books and Attend -> Grade, with an error term on Grade; the Books-Attend correlation must be explicitly modelled.]
775
Results
Unstandardised estimates (figure): BOOKS -> GRADE = 4.04, ATTEND -> GRADE = 1.28; other values shown in the figure: 2.00, 1.00, 2.65, 17.84, 13.52.
Standardised estimates (figure): BOOKS -> GRADE = .35, ATTEND -> GRADE = .33, error -> GRADE = .82, BOOKS-ATTEND correlation = .44.
777
Table
SEM estimates:
                    Estimate   S.E.   C.R.   P     St. Est.
GRADE <-- BOOKS        4.04    1.71   2.36   .02     0.35
GRADE <-- ATTEND       1.28    0.57   2.25   .03     0.33
GRADE <-- e           13.52    1.53   8.83   .00     0.82
GRADE (constant)      37.38    7.54   4.96   .00

SPSS regression coefficients, for comparison:
              B       Std. Error   Beta   Sig.
(Constant)   37.38      7.74              .00
BOOKS         4.04      1.75       .35    .03
ATTEND        1.28       .59       .33    .04
778
Restrict parameters
To zero
To the value of other parameters
To 1
779
Restrictions
Questions
Is a parameter really necessary?
Are a set of parameters necessary?
Are parameters equal?
780
The χ² Test
Can the model proposed have generated the data?
A test of the significance of the difference between model and data
A statistically significant result is bad
Theoretically driven
Start with the model
Don't start with the data
781
Regression Again
[Path diagram: BOOKS and ATTEND -> GRADE, error term labelled '0, 1'.]
782
Two restrictions
2 df for the χ² test
χ² = 15.9, p = 0.0003
783
Multivariate Regression
[A series of path diagrams: two predictors (x1, x2) and three outcomes (y1, y2, y3), shown with different sets of paths.]
E.g. the mediator model
[Path diagram: ENJOY -> BUY -> READ, with error terms e_buy and e_read.]
1 restriction
No path from enjoy -> read
789
Result
χ² = 10.9, 1 df, p = 0.001
Not a complete mediator
Additional path is required
790
Multiple Groups
Same model
Different people
Correlations (SEX = f, N = 110):
           AGE     SEVE    SEVNONE   GHQ_A   GHQ_D
AGE        1.00    -.270   -.248      .017    .035
SEVE       -.270   1.00     .665      .045    .075
SEVNONE    -.248    .665   1.00       .109    .096
GHQ_A       .017    .045    .109     1.00     .782
GHQ_D       .035    .075    .096      .782   1.00
(2-tailed p-values: AGE-SEVE .004, AGE-SEVNONE .009, SEVE-SEVNONE .000, GHQ_A-GHQ_D .000; the remaining correlations are non-significant.)
793
Correlations (SEX = m, N = 79):
           AGE     SEVE    SEVNONE   GHQ_A   GHQ_D
AGE        1.00    -.243   -.116     -.195   -.190
SEVE       -.243   1.00     .671      .456    .453
SEVNONE    -.116    .671   1.00       .210    .232
GHQ_A      -.195    .456    .210     1.00     .800
GHQ_D      -.190    .453    .232      .800   1.00
(2-tailed p-values: AGE-SEVE .031, SEVE-SEVNONE .000, SEVE-GHQ_A .000, SEVE-GHQ_D .000, SEVNONE-GHQ_D .040, GHQ_A-GHQ_D .000; the remaining correlations are non-significant.)
794
Model
[Path diagram: AGE -> SEVE and SEVNONE (errors e_s, e_sn); SEVE and SEVNONE -> Dep and Anx (errors e_d, e_a).]
795
Females
[Path diagram, standardised estimates: AGE -> SEVE = -.27, AGE -> SEVNONE = -.25; Dep-Anx association .78; the remaining path estimates and residual variances are as shown in the original figure.]
796
Males
[Path diagram, standardised estimates: AGE -> SEVE = -.24, AGE -> SEVNONE = -.12; the remaining path estimates and residual variances are as shown in the original figure.]
797
Constraint
sevnone -> dep
Constrained to be equal for males and females
1 restriction, 1 df
χ² = 1.3, not significant
4 restrictions
The 2 severity variables -> anx & dep
798
4 restrictions, 4 df
χ² = 1.3, p = 0.014
799
Power: A Smaller
Advantage
Power for regression gets tricky
with large models
With SEM power is (relatively) easy
It's all based on the chi-square
Paper B14
801
802
The Independence
Assumption
In Lesson 8 we talked about
independence
The residual of any one case should not tell
you about the residual of any other case
803
Clusters of Cases
Problem with cluster (group)
randomised studies
Or group effects
Complex Samples
As with Huber-White for heteroscedasticity
Add a variable that identifies the clusters
Put it into the clusters box
Run the GLM as before
Warning:
You need about 20 clusters for the solutions to be stable
805
Example
People were randomised by week to one of two forms of triage
Compare the total cost of treating each
Ignoring clustering:
Difference is 2.40 per person, 95% CI 0.58 to 4.22, p = 0.010
Including clustering:
Difference is still 2.40, but the 95% CI is -0.85 to 5.65, and p = 0.141
Longitudinal Research
For comparing
repeated
measures
Clusters are
people
Can model the
repeated
measures over
time
ID
V1 V2 V3 V4
807
Converting Data
Change data to
tall and thin
Use Data,
Restructure in
SPSS
Clusters are ID
808
(Simple) Example
Use employee data.sav
Compare beginning salary and salary
Would normally use a paired-samples t-test
809
Difference = $17,430, 95% CI = 16427.407 to 18739.555
[Table: the restructured data with columns ID, Time, and Cash, e.g. $18,750, $21,450, $12,000, $21,900, $13,200, $45,000.]
810
Interesting
That wasn't very interesting
What is more interesting is when we
have multiple measurements of the
same people
811
[Figures: the repeated measurements plotted against time.]
Complex Trajectories
An event occurs
Can have two effects:
A jump in the value
A change in the slope
[Figure: slope 1 before the event occurs, a jump when the event occurs, then slope 2 afterwards.]
815
Parameterising
Time   Event   Time2   Outcome
  1      0       0       12
  2      0       0       13
  3      0       0       14
  4      0       0       15
  5      0       0       16
  6      1       0       10
  7      1       1        9
  8      1       2        8
  9      1       3        7
816
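Entering that table into R and regressing the outcome on all three terms recovers the pieces of the trajectory:

d <- data.frame(time  = 1:9,
                event = c(0, 0, 0, 0, 0, 1, 1, 1, 1),
                time2 = c(0, 0, 0, 0, 0, 0, 1, 2, 3),
                outcome = c(12, 13, 14, 15, 16, 10, 9, 8, 7))
coef(lm(outcome ~ time + event + time2, data = d))
# time = slope before the event, event = the jump, time2 = the change in slope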
817
Moderator effects
Slope differences
818
Multilevel Models
Fixed versus random effects
Fixed effects are fixed across
individuals (or clusters)
Random effects have variance
Levels
Level 1 individual measurement
occasions
Level 2 higher order clusters
819
More on Levels
NHS direct study
Level 1 units: .
Level 2 units:
820
More Flexibility
Three levels:
Level 1: measurements
Level 2: people
Level 3: schools
821
More Effects
Variances and covariances of
effects
Level 1 and level 2 residuals
Makes R2 difficult to talk about
Outcome variable
Yij
The score of the ith person in the jth
group
822
Y     i   j
2.3   1   1
3.2   2   1
4.5   3   1
4.8   1   2
7.2   2   2
3.1   3   2
1.6   4   2
823
Notation
Notation gets a bit horrid
Varies a lot between books and
programs
824
Standard Errors
Intercept has standard errors
Slopes have standard errors
Random effects have variances
Those variances have standard errors
Is there statistically significant variation
between higher level units (people)?
OR
Is everyone the same?
825
Programs
Since version 12 you can do this in SPSS
Can't do anything really clever
The menus are completely unusable
You have to use syntax
826
SPSS Syntax
MIXED
relfd with time
/fixed = time
/random = intercept time |
subject (id) covtype(un)
/print = solution.
827
SPSS Syntax
MIXED
relfd with time
Outcome
Continuous
predictor
828
SPSS Syntax
MIXED
relfd with time
/fixed = time
Must specify effect as
fixed first
829
SPSS Syntax
MIXED
 relfd with time
 /fixed = time
 /random = intercept time |
  subject (id) covtype(un)
Specify the random effects: the intercept and time are random
SPSS Syntax
MIXED
 relfd with time
 /fixed = time
 /random = intercept time |
  subject (id) covtype(un)
The covariance matrix of the random effects is unstructured.
(Alternatives are id, identity, or vc, variance components.)
831
SPSS Syntax
MIXED
 relfd with time
 /fixed = time
 /random = intercept time |
  subject (id) covtype(un)
 /print = solution.
Print the answer
832
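For comparison (an aside, not part of the course), the equivalent model in R's lme4 package, assuming the restructured data are in a data frame called dat with columns relfd, time and id:

library(lme4)
m1 <- lmer(relfd ~ time + (1 + time | id), data = dat)   # random intercept and slope for time,
summary(m1)                                               # with an unstructured covariance between them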
The Output
Information criteria (we'll come back to these):
-2 Restricted Log Likelihood              64899.758
Akaike's Information Criterion (AIC)      64907.758
Hurvich and Tsai's Criterion (AICC)       64907.763
Bozdogan's Criterion (CAIC)               64940.134
Schwarz's Bayesian Criterion (BIC)        64936.134
Fixed Effects
Not useful here; useful for interactions
Type III Tests of Fixed Effects:
Source      Numerator df   Denominator df   F          Sig.
Intercept   1              741              3251.877   .000
time        1              741.000          2.550      .111
834
Estimates of Fixed Effects:
             Estimate   Std. Error   t         Sig.   95% CI Lower   95% CI Upper
Intercept    21.90      .38          57.025    .000   21.15          22.66
time         -.06       .04          -1.597    .111   -.14           .01
835
Covariance Parameters
Estimates of Covariance Parameters:
Parameter                          Estimate     Std. Error
Residual                           64.11577     1.0526353
Intercept + time [subject = id]
  UN (1,1)                         85.16791     5.7003732
  UN (2,1)                         -4.53179      .5067146
  UN (2,2)                          .7678319     .0636116
Change Covtype to VC
We know that this is wrong:
the covariance of the effects was statistically significant
We can also see that it is wrong by comparing information criteria:
                                          UN Model     VC Model
-2 Restricted Log Likelihood              64899.758    65041.891
Akaike's Information Criterion (AIC)      64907.758    65047.891
Hurvich and Tsai's Criterion (AICC)       64907.763    65047.894
Bozdogan's Criterion (CAIC)               64940.134    65072.173
Schwarz's Bayesian Criterion (BIC)        64936.134    65069.173
The information criteria are displayed in smaller-is-better forms (dependent variable: relfd). Lower is better.
838
Adding Bits
So far, all a bit dull
We want some more predictors, to make it more exciting
E.g. female
Add:
relfd with time female
/fixed = time female time * female
Extending Models
Models can be extended
Any kind of regression can be used
Logistic, multinomial, Poisson, etc
More levels
Children within classes within schools
Measures within people within classes within
prisons
840
841
Books
Singer, JD and Willett, JB (2003). Applied
Longitudinal Data Analysis: Modeling
Change and Event Occurrence. Oxford,
Oxford University Press.
Examples at:
http://www.ats.ucla.edu/stat/SPSS/examples/alda/default.htm
842
The End
843