SPSS Independent Samples T Test
The SPSS independent samples t-test is a procedure for testing whether the means of two populations
on one metric variable are equal. The two populations are identified in the sample by a
dichotomous variable. These two groups of cases are considered “independent samples” because
none of the cases belong to both groups simultaneously; that is, the samples don't overlap.
A marketeer wants to know whether women spend the same amount of money on clothes as
men. She asks 30 male and 30 female respondents how many euros they spend on clothing
each month, resulting in clothing_expenses.sav. Do these data contradict the null hypothesis that
men and women spend equal amounts of money on clothing?
Respondents were able to fill in any number of euros. Before moving on to the actual t-test, we
first need to get a basic idea of what the data look like. We'll take a quick look at the histogram
for the amounts spent. We'll obtain this by running FREQUENCIES as shown in the syntax
below.
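A minimal version of such a FREQUENCIES command, using the variable name amount_spent that appears in the T-TEST syntax further down, could look like this (a sketch; the exact subcommands may have differed):
* Histogram and basic statistics for monthly clothing expenses.
FREQUENCIES VARIABLES=amount_spent
 /FORMAT=NOTABLE
 /STATISTICS=MINIMUM MAXIMUM
 /HISTOGRAM.
The /FORMAT=NOTABLE keyword suppresses the (long) frequency table so that only the statistics and the histogram are shown.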
Result
These values look plausible. The maximum monthly amount spent on clothing (around € 525,-)
is not unlikely for one or two respondents, the vast majority of whom spend under €100,-. Also,
note that N = 60, which tells us that there are no missing values.
If we just run our test at this point, SPSS will immediately provide us with the relevant test statistics
and a p-value. However, such results can only be taken seriously insofar as the independent t-test
assumptions have been met. These are:
1. independent observations: each respondent is in only one of the two groups;
2. normality: the outcome variable is normally distributed in each population;
3. homogeneity of variances: the outcome variable has equal variances in both populations.
Assumption 1 is mostly theoretical. Violation of assumption 2 hardly affects test results for
reasonable sample sizes (say n > 30). If this doesn't hold, consider a Mann-Whitney test
instead of, or on top of, the t-test.
If assumption 3 is violated, the test results need to be corrected. For the independent samples t-test,
the SPSS output contains the uncorrected as well as the corrected results by default.
We now run the t-test from Analyze, Compare Means, Independent-Samples T Test: we move
amount_spent into the Test Variable(s) box, move gender into the Grouping Variable box and, under
Define Groups, enter the two values of gender that identify the female and male respondents.
Finally, we Paste and run the syntax.
T-TEST GROUPS=gender(0 1)
/MISSING=ANALYSIS
/VARIABLES=amount_spent
/CRITERIA=CI(.95).
From the first table, showing some basic descriptives, we see that 30 female and 30 male
respondents are included in the test.
Female respondents spent an average of €136,- on clothing each month. For male
respondents this is only €88,-. The difference is roughly €48,-.
As shown in the screenshot, the t-test results are reported twice. The first line ("equal
variances assumed") assumes that the aforementioned assumption of equal variances has been
met.
If this assumption doesn't hold, the t-test results need to be corrected. These corrected results
are presented in the second line ("equal variances not assumed").
Whether the assumption of equal variances holds is evaluated using Levene's test for the
equality of variances. As a rule of thumb, if Sig. > .05, use the first line of t-test results.
Conversely, if its p-value (“Sig.”) < .05, we reject the null hypothesis of equal variances and thus
use the second line of t-test results.
The difference between the amount spent by men and women is around €48,- as we'd already
seen previously.
The chance of finding this or a larger absolute difference between the two means, if men and women
really spend equal amounts in the population, is about 14%. Since this is a fair chance, we do not
reject the null hypothesis that men and women spend equal amounts of money on clothing.
Note that the p-value is two-tailed. This means that the 14% chance consists of a 7% chance of
finding a mean difference smaller than € -48,- and another 7% chance for a difference larger than
€ 48,-.
When reporting the results of an independent samples t-test, we usually present a table with the
sample sizes, means and standard deviations. Regarding the significance test, we'll state that “on
average, men did not spend a different amount than women; t(58) = 1.5, p = .14.”
Introduction
The independent t-test, also called the two-sample t-test, independent-samples t-test or Student's
t-test, is an inferential statistical test that determines whether there is a statistically significant
difference between the means in two unrelated groups.
The null hypothesis for the independent t-test is that the population means from the two
unrelated groups are equal:
H0: μ1 = μ2
In most cases, we are looking to see if we can show that we can reject the null hypothesis and
accept the alternative hypothesis, which is that the population means are not equal:
HA: μ1 ≠ μ2
To do this, we need to set a significance level (also called alpha) against which the p-value is
compared in order to decide whether to reject the null hypothesis. Most commonly, this value is set at 0.05.
What do you need to run an independent t-test?
Unrelated groups
Unrelated groups, also called unpaired groups or independent groups, are groups in which the
cases (e.g., participants) in each group are different. Often we are investigating differences in
individuals, which means that when comparing two groups, an individual in one group cannot
also be a member of the other group and vice versa. An example would be gender - an individual
would have to be classified as either male or female – not both.
Normality
The independent t-test requires that the dependent variable is approximately normally distributed
within each group.
Note: Technically, it is the residuals that need to be normally distributed, but for an independent
t-test, both will give you the same result.
You can test for this using a number of different tests, but the Shapiro-Wilk test of normality or
a graphical method, such as a Q-Q plot, are very common. You can run these tests using SPSS
Statistics; the procedure for doing so can be found in our Testing for Normality guide. However,
the t-test is described as a robust test with respect to the assumption of normality. This means
that some deviation away from normality does not have a large influence on Type I error rates.
The exception to this is if the ratio of the largest to the smallest group size is greater than 1.5.
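As a sketch of how such a check could be requested in SPSS syntax (assuming a dependent variable named score and a grouping variable named group, both hypothetical names):
* score and group are hypothetical variable names.
EXAMINE VARIABLES=score BY group
 /PLOT NPPLOT
 /STATISTICS NONE
 /NOTOTAL.
The /PLOT NPPLOT keyword produces normal Q-Q plots together with the Tests of Normality table, which includes the Shapiro-Wilk statistic for each group.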
If you find that the data in one or both of your groups are not approximately normally distributed
and the group sizes differ greatly, you have two options: (1) transform your data so that the data
becomes normally distributed (to do this in SPSS Statistics see our guide on Transforming Data),
or (2) run the Mann-Whitney U test which is a non-parametric test that does not require the
assumption of normality (to run this test in SPSS Statistics see our guide on the Mann-Whitney
U Test).
Homogeneity of variance
The independent t-test assumes the variances of the two groups you are measuring are equal in
the population. If your variances are unequal, this can affect the Type I error rate. The
assumption of homogeneity of variance can be tested using Levene's Test of Equality of
Variances, which is produced in SPSS Statistics when running the independent t-test procedure.
If you have run Levene's Test of Equality of Variances in SPSS Statistics, you will get a result
similar to that below:
This test for homogeneity of variance provides an F-statistic and a significance value (p-value).
We are primarily concerned with the significance value – if it is greater than 0.05 (i.e., p > .05),
our group variances can be treated as equal. However, if p < 0.05, we have unequal variances
and we have violated the assumption of homogeneity of variances.
If the Levene's Test for Equality of Variances is statistically significant, which indicates that the
group variances are unequal in the population, you can correct for this violation by not using the
pooled estimate for the error term for the t-statistic, but instead using an adjustment to the
degrees of freedom using the Welch-Satterthwaite method. In practice, you may never have heard of
these adjustments by name because SPSS Statistics hides this information and simply labels
the two options as "Equal variances assumed" and "Equal variances not assumed" without
explicitly stating the underlying corrections used. However, you can see the evidence of these
corrections in the output below:
From the result of Levene's Test for Equality of Variances, we can reject the null hypothesis that
there is no difference in the variances between the groups and accept the alternative hypothesis
that there is a statistically significant difference in the variances between groups. The effect of
not being able to assume equal variances is evident in the final column of the above figure where
we see a reduction in the value of the t-statistic and a large reduction in the degrees of freedom
(df). This has the effect of increasing the p-value above the critical significance level of 0.05. In
this case, we therefore fail to reject the null hypothesis and conclude that there is no
statistically significant difference between the means. This would not have been our conclusion had
we not tested for homogeneity of variances.
When reporting the result of an independent t-test, you need to include the t-statistic value, the
degrees of freedom (df) and the significance value of the test (p-value). The format of the test
result is: t(df) = t-statistic, p = significance value. Therefore, for the example above, you could
report the result as t(7.001) = 2.233, p = 0.061.
In order to provide enough information for readers to fully understand the results when you have
run an independent t-test, you should include the result of normality tests, Levene's Equality of
Variances test, the two group means and standard deviations, the actual t-test result and the
direction of the difference (if any). In addition, you might also wish to include the difference
between the groups along with a 95% confidence interval. For example:
General
Inspection of Q-Q Plots revealed that cholesterol concentration was normally distributed for both
groups and that there was homogeneity of variance as assessed by Levene's Test for Equality of
Variances. Therefore, an independent t-test was run on the data with a 95% confidence interval
(CI) for the mean difference. It was found that after the two interventions, cholesterol
concentrations in the dietary group (6.15 ± 0.52 mmol/L) were significantly higher than the
exercise group (5.80 ± 0.38 mmol/L) (t(38) = 2.470, p = 0.018) with a difference of 0.35 (95%
CI, 0.06 to 0.64) mmol/L.
Introduction
The independent-samples t-test (or independent t-test, for short) compares the means between
two unrelated groups on the same continuous, dependent variable. For example, you could use an
independent t-test to understand whether first year graduate salaries differed based on gender
(i.e., your dependent variable would be "first year graduate salaries" and your independent
variable would be "gender", which has two groups: "male" and "female"). Alternately, you could
use an independent t-test to understand whether there is a difference in test anxiety based on
educational level (i.e., your dependent variable would be "test anxiety" and your independent
variable would be "educational level", which has two groups: "undergraduates" and
"postgraduates").
This "quick start" guide shows you how to carry out an independent t-test using SPSS Statistics,
as well as interpret and report the results from this test. However, before we introduce you to this
procedure, you need to understand the different assumptions that your data must meet in order
for an independent t-test to give you a valid result. We discuss these assumptions next.
Assumptions
When you choose to analyse your data using an independent t-test, part of the process involves
checking to make sure that the data you want to analyse can actually be analysed using an
independent t-test. You need to do this because it is only appropriate to use an independent t-test
if your data "passes" six assumptions that are required for an independent t-test to give you a
valid result. In practice, checking for these six assumptions just adds a little bit more time to your
analysis, requiring you to click a few more buttons in SPSS Statistics when performing your
analysis, as well as think a little bit more about your data, but it is not a difficult task.
Before we introduce you to these six assumptions, do not be surprised if, when analysing your
own data using SPSS Statistics, one or more of these assumptions is violated (i.e., is not met).
This is not uncommon when working with real-world data rather than textbook examples, which
often only show you how to carry out an independent t-test when everything goes well!
However, don't worry. Even when your data fails certain assumptions, there is often a solution to
overcome this. First, let's take a look at these six assumptions:
Assumption #5: Your dependent variable should be approximately normally distributed for each
group of the independent variable. The independent t-test only requires approximately normal data
because it is quite robust to violations of normality. In our
enhanced independent t-test guide, we also explain what you can do if your data fails this
assumption (i.e., if it fails it more than a little bit).
Assumption #6: There needs to be homogeneity of variances. You can test this
assumption in SPSS Statistics using Levene’s test for homogeneity of variances. In our
enhanced independent t-test guide, we (a) show you how to perform Levene’s test for
homogeneity of variances in SPSS Statistics, (b) explain some of the things you will need
to consider when interpreting your data, and (c) present possible ways to continue with
your analysis if your data fails to meet this assumption.
You can check assumptions #4, #5 and #6 using SPSS Statistics. Before doing this, you should
make sure that your data meets assumptions #1, #2 and #3, although you don't need SPSS
Statistics to do this. When moving on to assumptions #4, #5 and #6, we suggest testing them in
this order because it represents an order where, if a violation to the assumption is not correctable,
you will no longer be able to use an independent t-test (although you may be able to run another
statistical test on your data instead). Just remember that if you do not run the statistical tests on
these assumptions correctly, the results you get when running an independent t-test might not be
valid. This is why we dedicate a number of sections of our enhanced independent t-test guide to
help you get this right. You can find out about our enhanced independent t-test guide here, or
more generally, our enhanced content as a whole here.
In the section, Test Procedure in SPSS Statistics, we illustrate the SPSS Statistics procedure
required to perform an independent t-test assuming that no assumptions have been violated. First,
we set out the example we use to explain the independent t-test procedure in SPSS Statistics.
Introduction
Multiple regression is used when we want to predict the value of a dependent variable from the values
of two or more independent variables. For example, you could use multiple regression to understand whether exam performance can be
predicted based on revision time, test anxiety, lecture attendance and gender. Alternately, you
could use multiple regression to understand whether daily cigarette consumption can be
predicted based on smoking duration, age when started smoking, smoker type, income and
gender.
Multiple regression also allows you to determine the overall fit (variance explained) of the model
and the relative contribution of each of the predictors to the total variance explained. For
example, you might want to know how much of the variation in exam performance can be
explained by revision time, test anxiety, lecture attendance and gender "as a whole", but also the
"relative contribution" of each independent variable in explaining the variance.
This "quick start" guide shows you how to carry out multiple regression using SPSS Statistics, as
well as interpret and report the results from this test. However, before we introduce you to this
procedure, you need to understand the different assumptions that your data must meet in order
for multiple regression to give you a valid result. We discuss these assumptions next.
Assumptions
When you choose to analyse your data using multiple regression, part of the process involves
checking to make sure that the data you want to analyse can actually be analysed using multiple
regression. You need to do this because it is only appropriate to use multiple regression if your
data "passes" eight assumptions that are required for multiple regression to give you a valid
result. In practice, checking for these eight assumptions just adds a little bit more time to your
analysis, requiring you to click a few more buttons in SPSS Statistics when performing your
analysis, as well as think a little bit more about your data, but it is not a difficult task.
Before we introduce you to these eight assumptions, do not be surprised if, when analysing your
own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This
is not uncommon when working with real-world data rather than textbook examples, which often
only show you how to carry out multiple regression when everything goes well! However, don’t
worry. Even when your data fails certain assumptions, there is often a solution to overcome this.
First, let's take a look at these eight assumptions:
Assumption #3: You should have independence of observations (i.e., independence of
residuals), which you can check using the Durbin-Watson statistic. We explain how to interpret the
Durbin-Watson statistic, as well as showing you the SPSS Statistics procedure required,
in our enhanced multiple regression guide.
Assumption #4: There needs to be a linear relationship between (a) the dependent
variable and each of your independent variables, and (b) the dependent variable and the
independent variables collectively. Whilst there are a number of ways to check for these
linear relationships, we suggest creating scatterplots and partial regression plots using
SPSS Statistics, and then visually inspecting these scatterplots and partial regression plots
to check for linearity. If the relationships displayed in your scatterplots and partial
regression plots are not linear, you will have to either run a non-linear regression analysis
or "transform" your data, which you can do using SPSS Statistics. In our enhanced
multiple regression guide, we show you how to: (a) create scatterplots and partial
regression plots to check for linearity when carrying out multiple regression using SPSS
Statistics; (b) interpret different scatterplot and partial regression plot results; and (c)
transform your data using SPSS Statistics if you do not have linear relationships between
your variables.
Assumption #5: Your data needs to show homoscedasticity, which is where the
variances along the line of best fit remain similar as you move along the line. We explain
more about what this means and how to assess the homoscedasticity of your data in our
enhanced multiple regression guide. When you analyse your own data, you will need to
plot the studentized residuals against the unstandardized predicted values. In our
enhanced multiple regression guide, we explain: (a) how to test for homoscedasticity
using SPSS Statistics; (b) some of the things you will need to consider when interpreting
your data; and (c) possible ways to continue with your analysis if your data fails to meet
this assumption.
Assumption #6: Your data must not show multicollinearity, which occurs when you
have two or more independent variables that are highly correlated with each other. This
leads to problems with understanding which independent variable contributes to the
variance explained in the dependent variable, as well as technical issues in calculating a
multiple regression model. Therefore, in our enhanced multiple regression guide, we
show you: (a) how to use SPSS Statistics to detect multicollinearity through an
inspection of correlation coefficients and Tolerance/VIF values; and (b) how to interpret
these correlation coefficients and Tolerance/VIF values so that you can determine
whether your data meets or violates this assumption.
Assumption #7: There should be no significant outliers, high leverage points or highly
influential points. Outliers, leverage and influential points are different terms used to
represent observations in your data set that are in some way unusual when you wish to
perform a multiple regression analysis. These different classifications of unusual points
reflect the different impact they have on the regression line. An observation can be
classified as more than one type of unusual point. However, all these points can have a
very negative effect on the regression equation that is used to predict the value of the
dependent variable based on the independent variables. This can change the output that
SPSS Statistics produces and reduce the predictive accuracy of your results as well as the
statistical significance. Fortunately, when using SPSS Statistics to run multiple regression
on your data, you can detect possible outliers, high leverage points and highly influential
points. In our enhanced multiple regression guide, we: (a) show you how to detect
outliers using "casewise diagnostics" and "studentized deleted residuals", which you can
do using SPSS Statistics, and discuss some of the options you have in order to deal with
outliers; (b) check for leverage points using SPSS Statistics and discuss what you should
do if you have any; and (c) check for influential points in SPSS Statistics using a measure
of influence known as Cook's Distance, before presenting some practical approaches in
SPSS Statistics to deal with any influential points you might have.
Assumption #8: Finally, you need to check that the residuals (errors) are
approximately normally distributed (we explain these terms in our enhanced multiple
regression guide). Two common methods to check this assumption include using: (a) a
histogram (with a superimposed normal curve) and a Normal P-P Plot; or (b) a Normal
Q-Q Plot of the studentized residuals. Again, in our enhanced multiple regression guide,
we: (a) show you how to check this assumption using SPSS Statistics, whether you use a
histogram (with superimposed normal curve) and Normal P-P Plot, or Normal Q-Q Plot;
(b) explain how to interpret these diagrams; and (c) provide a possible solution if your
data fails to meet this assumption.
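For reference, here is a sketch of how these checks can be bundled into a single REGRESSION run in SPSS syntax (the variable names are hypothetical, following the exam performance example):
* Hypothetical variable names; one possible combination of diagnostic subcommands.
regression
 /statistics=defaults tol collin
 /dependent exam_performance
 /method=enter revision_time test_anxiety lecture_attendance gender
 /residuals durbin histogram(zresid) normprob(zresid)
 /casewise plot(sdresid) outliers(3)
 /scatterplot(*sresid, *pred)
 /partialplot
 /save cook lever sdresid.
Here /residuals durbin relates to assumption #3, /partialplot to assumption #4, the *sresid by *pred scatterplot to assumption #5, the tol and collin statistics to assumption #6, the casewise diagnostics together with the saved Cook's distance and leverage values to assumption #7, and histogram(zresid) and normprob(zresid) to assumption #8.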
You can check assumptions #3, #4, #5, #6, #7 and #8 using SPSS Statistics. Assumptions #1 and
#2 should be checked first, before moving onto assumptions #3, #4, #5, #6, #7 and #8. Just
remember that if you do not run the statistical tests on these assumptions correctly, the results
you get when running multiple regression might not be valid. This is why we dedicate a number
of sections of our enhanced multiple regression guide to help you get this right.
In the section, Procedure, we illustrate the SPSS Statistics procedure to perform a multiple
regression assuming that no assumptions have been violated. First, we introduce the example that
is used in this guide.
Introduction
Linear regression is the next step up after correlation. It is used when we want to predict the
value of a variable based on the value of another variable. The variable we want to predict is
called the dependent variable (or sometimes, the outcome variable). The variable we are using to
predict the other variable's value is called the independent variable (or sometimes, the predictor
variable). For example, you could use linear regression to understand whether exam performance
can be predicted based on revision time; whether cigarette consumption can be predicted based
on smoking duration; and so forth. If you have two or more independent variables, rather than
just one, you need to use multiple regression.
This "quick start" guide shows you how to carry out linear regression using SPSS Statistics, as
well as interpret and report the results from this test. However, before we introduce you to this
procedure, you need to understand the different assumptions that your data must meet in order
for linear regression to give you a valid result. We discuss these assumptions next.
Assumptions
When you choose to analyse your data using linear regression, part of the process involves
checking to make sure that the data you want to analyse can actually be analysed using linear
regression. You need to do this because it is only appropriate to use linear regression if your data
"passes" six assumptions that are required for linear regression to give you a valid result. In
practice, checking for these six assumptions just adds a little bit more time to your analysis,
requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as
well as think a little bit more about your data, but it is not a difficult task.
Before we introduce you to these six assumptions, do not be surprised if, when analysing your
own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This
is not uncommon when working with real-world data rather than textbook examples, which often
only show you how to carry out linear regression when everything goes well! However, don’t
worry. Even when your data fails certain assumptions, there is often a solution to overcome this.
First, let’s take a look at these six assumptions:
Assumption #1: Your two variables should be measured at the continuous level (i.e.,
they are either interval or ratio variables). Examples of continuous variables include
revision time (measured in hours), intelligence (measured using IQ score), exam
performance (measured from 0 to 100), weight (measured in kg), and so forth. You can
learn more about interval and ratio variables in our article: Types of Variable.
Assumption #2: There needs to be a linear relationship between the two variables.
Whilst there are a number of ways to check whether a linear relationship exists between
your two variables, we suggest creating a scatterplot using SPSS Statistics where you can
plot the dependent variable against your independent variable and then visually inspect
the scatterplot to check for linearity. Your scatterplot may look something like one of the
following:
If the relationship displayed in your scatterplot is not linear, you will have to either run a
non-linear regression analysis, perform a polynomial regression or "transform" your data,
which you can do using SPSS Statistics. In our enhanced guides, we show you how to:
(a) create a scatterplot to check for linearity when carrying out linear regression using
SPSS Statistics; (b) interpret different scatterplot results; and (c) transform your data
using SPSS Statistics if there is not a linear relationship between your two variables.
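As an illustration, such a scatterplot could be requested with the GRAPH command (the variable names below are hypothetical, reusing the revision time and exam performance example):
* Hypothetical variable names: predictor on the x-axis, outcome on the y-axis.
graph
 /scatterplot(bivar)=revision_time with exam_performance.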
Assumption #3: There should be no significant outliers. An outlier is an observed data
point that has a dependent variable value that is very different to the value predicted by
the regression equation. As such, an outlier will be a point on a scatterplot that is
(vertically) far away from the regression line indicating that it has a large residual, as
highlighted below:
The problem with outliers is that they can have a negative effect on the regression
analysis (e.g., reduce the fit of the regression equation) that is used to predict the value of
the dependent (outcome) variable based on the independent (predictor) variable. This will
change the output that SPSS Statistics produces and reduce the predictive accuracy of
your results. Fortunately, when using SPSS Statistics to run a linear regression on your
data, you can easily include criteria to help you detect possible outliers. In our enhanced
linear regression guide, we: (a) show you how to detect outliers using "casewise
diagnostics", which is a simple process when using SPSS Statistics; and (b) discuss some
of the options you have in order to deal with outliers.
Assumption #4: You should have independence of observations, which you can easily
check using the Durbin-Watson statistic, which is a simple test to run using SPSS
Statistics. We explain how to interpret the result of the Durbin-Watson statistic in our
enhanced linear regression guide.
Assumption #5: Your data needs to show homoscedasticity, which is where the
variances along the line of best fit remain similar as you move along the line. Whilst we
explain more about what this means and how to assess the homoscedasticity of your data
in our enhanced linear regression guide, take a look at the three scatterplots below, which
provide three simple examples: two of data that fail the assumption (called
heteroscedasticity) and one of data that meets this assumption (called homoscedasticity):
Whilst these help to illustrate the differences in data that meets or violates the assumption
of homoscedasticity, real-world data can be a lot messier and can show different
patterns of heteroscedasticity. Therefore, in our enhanced linear regression guide, we
explain: (a) some of the things you will need to consider when interpreting your data; and
(b) possible ways to continue with your analysis if your data fails to meet this
assumption.
Assumption #6: Finally, you need to check that the residuals (errors) of the regression
line are approximately normally distributed (we explain these terms in our enhanced
linear regression guide). Two common methods to check this assumption include using
either a histogram (with a superimposed normal curve) or a Normal P-P Plot. Again, in
our enhanced linear regression guide, we: (a) show you how to check this assumption
using SPSS Statistics, whether you use a histogram (with superimposed normal curve) or
Normal P-P Plot; (b) explain how to interpret these diagrams; and (c) provide a possible
solution if your data fails to meet this assumption.
You can check assumptions #2, #3, #4, #5 and #6 using SPSS Statistics. Assumption #2 should
be checked first, before moving onto assumptions #3, #4, #5 and #6. We suggest testing the
assumptions in this order because assumptions #3, #4, #5 and #6 require you to run the linear
regression procedure in SPSS Statistics first, so it is easier to deal with these after checking
assumption #2. Just remember that if you do not run the statistical tests on these assumptions
correctly, the results you get when running a linear regression might not be valid. This is why we
dedicate a number of sections of our enhanced linear regression guide to help you get this right.
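A sketch of a linear regression run that requests these checks in a single pass might look like the following (hypothetical variable names again; the casewise diagnostics relate to assumption #3, the Durbin-Watson statistic to assumption #4, the standardized-residual-by-predicted-value plot to assumption #5, and the residual histogram and normal probability plot to assumption #6):
* Hypothetical variable names; diagnostics requested alongside the regression itself.
regression
 /dependent exam_performance
 /method=enter revision_time
 /residuals durbin histogram(zresid) normprob(zresid)
 /casewise plot(zresid) outliers(3)
 /scatterplot(*zresid, *pred).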
SPSS Web Books
Regression with SPSS
Chapter 2 - Regression Diagnostics
Chapter Outline
2.0 Regression Diagnostics
2.1 Unusual and Influential data
2.2 Tests on Normality of Residuals
2.3 Tests on Nonconstant Error of Variance
2.4 Tests on Multicollinearity
2.5 Tests on Nonlinearity
2.6 Model Specification
2.7 Issues of Independence
2.8 Summary
2.9 For more information
In our last chapter, we learned how to do ordinary linear regression with SPSS, concluding with
methods for examining the distribution of variables to check for non-normally distributed
variables as a first look at checking assumptions in regression. Without verifying that your data
have met the regression assumptions, your results may be misleading. This chapter will explore
how you can use SPSS to test whether your data meet the assumptions of linear regression. In
particular, we will consider the following assumptions.
Linearity - the relationships between the predictors and the outcome variable should be
linear
Normality - the errors should be normally distributed - technically normality is necessary
only for the t-tests to be valid, estimation of the coefficients only requires that the errors
be identically and independently distributed
Homogeneity of variance (homoscedasticity) - the error variance should be constant
Independence - the errors associated with one observation are not correlated with the
errors of any other observation
Model specification - the model should be properly specified (including all relevant
variables, and excluding irrelevant variables)
Additionally, there are issues that can arise during the analysis that, while strictly speaking are
not assumptions of regression, are nonetheless of great concern to regression analysts.
Many graphical methods and numerical tests have been developed over the years for regression
diagnostics and SPSS makes many of these methods easy to access and use. In this chapter, we
will explore these methods and show how to verify regression assumptions and detect potential
problems using SPSS.
A single observation that is substantially different from all other observations can make a large
difference in the results of your regression analysis. If a single observation (or small group of
observations) substantially changes your results, you would want to know about this and
investigate further. There are three ways that an observation can be unusual.
Outliers: In linear regression, an outlier is an observation with large residual. In other words, it
is an observation whose dependent-variable value is unusual given its values on the predictor
variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other
problem.
Leverage: An observation with an extreme value on a predictor variable is called a point with
high leverage. Leverage is a measure of how far an observation deviates from the mean of that
variable. These leverage points can have an unusually large effect on the estimate of regression
coefficients.
Influence: An observation is said to be influential if removing it substantially changes the
estimate of the regression coefficients. Influence can be thought of as the product of leverage
and outlierness.
How can we identify these three types of observations? Let's look at an example dataset called
crime. This dataset appears in Statistical Methods for Social Sciences, Third Edition by Alan
Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name
(state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder), the
percent of the population living in metropolitan areas (pctmetro), the percent of the population
that is white (pctwhite), percent of population with a high school education or above (pcths),
percent of population living under poverty line (poverty), and percent of population that are
single parents (single). Below we read in the file and do some descriptive statistics on these
variables. You can click crime.sav to access this file, or see the Regression with SPSS page to
download all of the data files used in this book.
descriptives
/var=crime murder pctmetro pctwhite pcths poverty single.
Descriptive Statistics
                      N     Minimum   Maximum   Mean      Std. Deviation
CRIME                 51    82        2922      612.84    441.100
Valid N (listwise)    51
Let's say that we want to predict crime by pctmetro, poverty, and single. That is to say, we
want to build a linear regression model between the response variable crime and the independent
variables pctmetro, poverty and single. We will first look at the scatter plots of crime against
each of the predictor variables before the regression analysis so we will have some ideas about
potential problems. We can create a scatterplot matrix of these variables as shown below.
graph
/scatterplot(matrix)=crime murder pctmetro pctwhite pcths poverty single .
The graphs of crime with other variables show some potential problems. In every plot, we see a
data point that is far away from the rest of the data points. Let's make individual graphs of crime
with pctmetro and poverty and single so we can get a better view of these scatterplots. We will
use BY state(name) to plot the state name instead of a point.
GRAPH /SCATTERPLOT(BIVAR)=pctmetro WITH crime BY state(name) .
GRAPH /SCATTERPLOT(BIVAR)=poverty WITH crime BY state(name) .
GRAPH /SCATTERPLOT(BIVAR)=single WITH crime BY state(name) .
All the scatter plots suggest that the observation for state = "dc" is a point that requires extra
attention since it stands out away from all of the other points. We will keep it in mind when we
do our regression analysis.
Now let's try the regression command predicting crime from pctmetro poverty and single. We
will go step-by-step to identify all the potentially unusual or influential points afterwards.
regression
/dependent crime
/method=enter pctmetro poverty single.
Variables Entered/Removed(b)
Model Summary(b)
ANOVA(b)
Total 9728474.745 50
Coefficients(a)
              B           Std. Error   Beta    t         Sig.
(Constant)    -1666.436   147.852               -11.271   .000
PCTMETRO      7.829       1.255        .390     6.240     .000
POVERTY       17.680      6.941        .184     2.547     .014
Let's examine the standardized residuals as a first means for identifying outliers. Below we use
the /residuals=histogram subcommand to request a histogram for the standardized
residuals. As you see, we get the standard output that we got above, as well as a table with
information about the smallest and largest residuals, and a histogram of the standardized
residuals. The histogram indicates a couple of extreme residuals worthy of investigation.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram.
Variables Entered/Removed(b)
Model Summary(b)
ANOVA(b)
Model Sum of Squares df Mean Square F Sig.
Total 9728474.745 50
Coefficients(a)
Residuals Statistics(a)
Residual   Minimum -523.01   Maximum 426.11   Mean .00   Std. Deviation 176.522   N 51
Let's now request the same kind of information, except for the studentized deleted residual. The
studentized deleted residual is the residual that would be obtained if the regression was re-run
omitting that observation from the analysis. This is useful because some points are so influential
that when they are included in the analysis they can pull the regression line close to that
observation making it appear as though it is not an outlier -- however when the observation is
deleted it then becomes more obvious how outlying it is. To save space, below we show just the
output related to the residual analysis.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid).
Residuals Statistics(a)
The histogram shows some possible outliers. We can use the outliers(sdresid) and id(state)
options to request the 10 most extreme values for the studentized deleted residual to be displayed
labeled by the state from which the observation originated. Below we show the output generated
by this option, omitting all of the rest of the output to save space. You can see that "dc" has the
largest value (3.766) followed by "ms" (-3.571) and "fl" (2.620).
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid) id(state) outliers(sdresid).
Outlier Statistics(a)
Stud. Deleted Residual
1 51 dc 3.766
2 25 ms -3.571
3 9 fl 2.620
4 18 la -1.839
5 39 ri -1.686
6 12 ia 1.590
7 47 wa -1.304
8 13 id 1.293
9 14 il 1.152
10 35 oh -1.148
We can use the /casewise subcommand below to request a display of all observations where the
sdresid exceeds 2. To save space, we show just the new output generated by the /casewise
subcommand. This shows us that Florida, Mississippi and Washington DC have sdresid values
exceeding 2.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid) id(state) outliers(sdresid)
/casewise=plot(sdresid) outliers(2) .
Casewise Diagnostics(a)
Case Number STATE Stud. Deleted Residual CRIME Predicted Value Residual
Now let's look at the leverage values to identify observations that will have potentially great
influence on regression coefficient estimates. We can include lever with the histogram( ) and
the outliers( ) options to get more information about observations with high leverage. We show
just the new output generated by these additional subcommands below. Generally, a point with
leverage greater than (2k+2)/n should be carefully examined. Here k is the number of predictors
and n is the number of observations, so a value exceeding (2*3+2)/51 = .1568 would be worthy
of further investigation. As you see, there are 4 observations that have leverage values higher
than .1568.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid lever)
/casewise=plot(sdresid) outliers(2).
Outlier Statistics(a)
Stud. Deleted Residual
1 51 dc 3.766
2 25 ms -3.571
3 9 fl 2.620
4 18 la -1.839
5 39 ri -1.686
6 12 ia 1.590
7 47 wa -1.304
8 13 id 1.293
9 14 il 1.152
10 35 oh -1.148
Centered Leverage Value
1 51 dc .517
2 1 ak .241
3 25 ms .171
4 49 wv .161
5 18 la .146
6 46 vt .117
7 9 fl .083
8 26 mt .080
9 31 nj .075
10 17 ky .072
As we have seen, DC is an observation that both has a large residual and large leverage. Such
points are potentially the most influential. We can make a plot that shows the leverage by the
residual and look for observations that are high in leverage and have a high residual. We can do
this using the /scatterplot subcommand as shown below. This is a quick way of checking
potential influential observations and outliers at the same time. Both types of points are of great
concern for us. As we see, "dc" is both a high residual and high leverage point, and "ms" has an
extremely negative residual but does not have such a high leverage.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever)
/casewise=plot(sdresid) outliers(2)
/scatterplot(*lever, *sdresid).
Now let's move on to overall measures of influence, specifically let's look at Cook's D, which
combines information on the residual and leverage. The lowest value that Cook's D can assume
is zero, and the higher the Cook's D is, the more influential the point is. The conventional cut-off
point is 4/n, or in this case 4/51 or .078. Below we add the cook keyword to the outliers( )
option and also on the /casewise subcommand and below we see that for the 3 outliers flagged in
the "Casewise Diagnostics" table, the value of Cook's D exceeds this cutoff. And, in the "Outlier
Statistics" table, we see that "dc", "ms", "fl" and "la" are the 4 states that exceed this cutoff, all
others falling below this threshold.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
/casewise=plot(sdresid) outliers(2) cook dffit
/scatterplot(*lever, *sdresid).
Casewise Diagnostics(a)
Case Number STATE Stud. Deleted Residual CRIME Cook's Distance DFFIT
51 dc 3.766 2922 3.203 477.319
Outlier Statistics(a)
Stud. Deleted Residual
1 51 dc 3.766
2 25 ms -3.571
3 9 fl 2.620
4 18 la -1.839
5 39 ri -1.686
6 12 ia 1.590
7 47 wa -1.304
8 13 id 1.293
9 14 il 1.152
10 35 oh -1.148
Cook's Distance
1 51 dc 3.203 .021
2 25 ms .602 .663
3 9 fl .174 .951
4 18 la .159 .958
5 39 ri .041 .997
6 12 ia .041 .997
7 13 id .037 .997
8 20 md .020 .999
9 6 co .018 .999
10 49 wv .016 .999
Centered Leverage Value
1 51 dc .517
2 1 ak .241
3 25 ms .171
4 49 wv .161
5 18 la .146
6 46 vt .117
7 9 fl .083
8 26 mt .080
9 31 nj .075
10 17 ky .072
Cook's D can be thought of as a general measure of influence. You can also consider more
specific measures of influence that assess how each coefficient is changed by including the
observation. Imagine that you compute the regression coefficients for the regression model with
a particular case excluded, then recompute the model with the case included, and you observe the
change in the regression coefficients due to including that case in the model. This measure is
called DFBETA and a DFBETA value can be computed for each observation for each predictor.
As shown below, we use the /save sdbeta(sdfb) subcommand to save the DFBETA values for
each of the predictors. This saves 4 variables into the current data file, sdfb1, sdfb2, sdfb3 and
sdfb4, corresponding to the DFBETA for the Intercept and for pctmetro, poverty and for
single, respectively. We could replace sdfb with anything we like, and the variables created
would start with the prefix that we provide.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
/casewise=plot(sdresid) outliers(2) cook dffit
/scatterplot(*lever, *sdresid)
/save sdbeta(sdfb).
The /save sdbeta(sdfb) subcommand does not produce any new output, but we can see the
variables it created for the first 10 cases using the list command below. For example, by
including the case for "ak" in the regression analysis (as compared to excluding this case), the
coefficient for pctmetro would decrease by -.106 standard errors. Likewise, by including the
case for "ak" the coefficient for poverty decreases by -.131 standard errors, and the coefficient
for single increases by .145 standard errors (as compared to a model excluding "ak"). Since the
inclusion of an observation could either contribute to an increase or decrease in a regression
coefficient, DFBETAs can be either positive or negative. A DFBETA value in excess
of 2/sqrt(n) merits further investigation. In this example, we would be concerned about absolute
values in excess of 2/sqrt(51) or .28.
list
/variables state sdfb1 sdfb2 sdfb3
/cases from 1 to 10.
STATE SDFB1 SDFB2 SDFB3
We can plot all three DFBETA values for the 3 coefficients against the state id in one graph
shown below to help us see potentially troublesome observations. We changed the variable labels
for sdfb1, sdfb2 and sdfb3 so they would be shorter and more clearly labeled in the
graph. We can see that the DFBETA for single for "dc" is about 3, indicating that by including
"dc" in the regression model, the coefficient for single is 3 standard errors larger than it would
have been if "dc" had been omitted. This is yet another bit of evidence that the observation for
"dc" is very problematic.
variable labels sdfb1 "Sdfbeta pctmetro"
/sdfb2 "Sdfbeta poverty"
/sdfb3 "Sdfbeta single" .
GRAPH
/SCATTERPLOT(OVERLAY)=sid sid sid WITH sdfb1 sdfb2 sdfb3 (PAIR) BY
state(name)
/MISSING=LISTWISE .
The following table summarizes the general rules of thumb we use for the measures we have
discussed for identifying observations worthy of further investigation (where k is the number of
predictors and n is the number of observations).
Measure        Value
leverage       > (2k+2)/n
abs(rstu)      > 2
Cook's D       > 4/n
abs(DFBETA)    > 2/sqrt(n)
We have shown a few examples of the variables that you can refer to in the /residuals ,
/casewise, /scatterplot and /save sdbeta( ) subcommands. Here is a list of all of the variables
that can be used on these subcommands; however, not all variables can be used on each
subcommand.
In addition to the numerical measures we have shown above, there are also several graphs that
can be used to search for unusual and influential observations. The partial-regression plot is
very useful in identifying influential points. For example below we add the /partialplot
subcommand to produce partial-regression plots for all of the predictors. For example, in the 3rd
plot below you can see the partial-regression plot showing crime by single after both crime and
single have been adjusted for all other predictors in the model. The line plotted has the same
slope as the coefficient for single. This plot shows how the observation for DC influences the
coefficient. You can see how the regression line is tugged upwards trying to fit through the
extreme value of DC. Alaska and West Virginia may also exert substantial leverage on the
coefficient of single as well. These plots are useful for seeing how a single point may be
influencing the regression line, while taking other variables in the model into account.
Note that the regression line is not automatically produced in the graph. We double clicked on
the graph, and then chose "Chart", then "Options" and then "Fit Line Total" to add a
regression line to each of the graphs below.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
/casewise=plot(sdresid) outliers(2) cook dffit
/scatterplot(*lever, *sdresid)
/partialplot.
DC has appeared as an outlier as well as an influential point in every analysis. Since DC is
really not a state, we can use this to justify omitting it from the analysis saying that we really
wish to just analyze states. First, let's repeat our analysis including DC below.
regression
/dependent crime
/method=enter pctmetro poverty single.
Coefficients(a)
              B           Std. Error   Beta    t         Sig.
(Constant)    -1666.436   147.852               -11.271   .000
Now, let's run the analysis omitting DC by using the filter command to omit "dc" from the
analysis. As we expect, deleting DC made a large change in the coefficient for single. The
coefficient for single dropped from 132.4 to 89.4. After having deleted DC, we would repeat the
process we have illustrated in this section to search for any other outlying and influential
observations.
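A sketch of one way to set up such a filter (the name of the filter variable, not_dc, is arbitrary):
* not_dc is an arbitrary name; it equals 1 for every state except dc.
compute not_dc = (state ne "dc").
filter by not_dc.
execute.
regression
 /dependent crime
 /method=enter pctmetro poverty single.
filter off.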
Coefficients(a)
Summary
In this section, we explored a number of methods of identifying outliers and influential points. In
a typical analysis, you would probably use only some of these methods. Generally speaking,
there are two types of methods for assessing outliers: statistics such as residuals, leverage, and
Cook's D, that assess the overall impact of an observation on the regression results, and statistics
such as DFBETA that assess the specific impact of an observation on the regression
coefficients. In our example, we found out that DC was a point of major concern. We performed
a regression with it and without it and the regression equations were very different. We can
justify removing it from our analysis by reasoning that our model is to predict crime rate for
states not for metropolitan areas.
One of the assumptions of linear regression analysis is that the residuals are normally distributed.
It is important to meet this assumption for the p-values for the t-tests to be valid. Let's use the
elemapi2 data file we saw in Chapter 1 for these analyses. Let's predict academic performance
(api00) from percent receiving free meals (meals), percent of English language learners (ell),
and percent of teachers with emergency credentials (emer). We then use the /save command to
generate residuals.
get file="c:\spssreg\elemapi2.sav".
regression
/dependent api00
/method=enter meals ell emer
/save resid(apires).
Variables Entered/Removed(b)
Model Summary(b)
ANOVA(b)
Coefficients(a)
Casewise Diagnostics(a)
Case Number 93   Std. Residual 3.087   API00 604
Residuals Statistics(a)
Std. Residual   Minimum -3.208   Maximum 3.087   Mean .000   Std. Deviation .996   N 400
We now use the examine command to look at the normality of these residuals. All of the results
from the examine command suggest that the residuals are normally distributed -- the skewness
and kurtosis are near 0, the "tests of normality" are not significant, the histogram looks normal,
and the Q-Q plot looks normal. Based on these results, the residuals from this regression appear
to conform to the assumption of being normally distributed.
examine
variables=apires
/plot boxplot stemleaf histogram npplot.
Case Processing Summary
Cases
Descriptives (APIRES)
95% Confidence Interval for Mean: Lower Bound -5.6620909, Upper Bound 5.6620909
Median -3.6572906
Variance 3318.018
Minimum -185.47331
Maximum 178.48224
Range 363.95555
Tests of Normality
Kolmogorov-Smirnov(a) Shapiro-Wilk
Unstandardized Residual Stem-and-Leaf Plot
Stem width: 100.0000
Each leaf: 2 case(s)
2.3 Heteroscedasticity
Another assumption of ordinary least squares regression is that the variance of the residuals is
homogeneous across levels of the predicted values, also known as homoscedasticity. If the model
is well-fitted, there should be no pattern to the residuals plotted against the fitted values. If the
variance of the residuals is non-constant then the residual variance is said to be "heteroscedastic."
Below we illustrate graphical methods for detecting heteroscedasticity. A commonly used
graphical method is to use the residual versus fitted plot to show the residuals versus fitted
(predicted) values. Below we use the /scatterplot subcommand to plot *zresid (standardized
residuals) by *pred (the predicted values). We see that the pattern of the data points is getting a
little narrower towards the right end, an indication of mild heteroscedasticity.
regression
/dependent api00
/method=enter meals ell emer
/scatterplot(*zresid *pred).
Let's run a model where we include just enroll as a predictor and show the residual vs. predicted
plot. As you can see, this plot shows serious heteroscedasticity. The variability of the residuals
when the predicted value is around 700 is much larger than when the predicted value is 600 or
when the predicted value is 500.
regression
/dependent api00
/method=enter enroll
/scatterplot(*zresid *pred).
As we saw in Chapter 1, the variable enroll was skewed considerably to the right, and we found
that by taking a log transformation, the transformed variable was more normally distributed.
Below we transform enroll, run the regression and show the residual versus fitted plot. The
distribution of the residuals is much improved. Certainly, this is not a perfect distribution of
residuals, but it is much better than the distribution with the untransformed variable.
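A sketch of the transformation and the refitted model (the name lenroll matches the LENROLL shown in the output below):
* Log-transform enroll and refit the model.
compute lenroll = ln(enroll).
execute.
regression
 /dependent api00
 /method=enter lenroll
 /scatterplot(*zresid *pred).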
1 LENROLL(a) . Enter
Model Summary(b)
ANOVA(b)
Coefficients(a)
a Dependent Variable: API00
Residuals Statistics(a)
Finally, let's revisit the model we used at the start of this section, predicting api00 from meals,
ell and emer. Using this model, the distribution of the residuals looked very nice and even
across the fitted values. What if we add enroll to this model? Will this automatically ruin the
distribution of the residuals? Let's add it and see.
regression
/dependent api00
/method=enter meals ell emer enroll
/scatterplot(*zresid *pred).
Variables Entered/Removed(b)
Model Summary(b)
ANOVA(b)
b Dependent Variable: API00
Coefficients(a)
Casewise Diagnostics(a)
Case Number 93   Std. Residual 3.004   API00 604
Residuals Statistics(a)
Predicted Value   Minimum 430.82   Maximum 888.08   Mean 647.62   Std. Deviation 130.214   N 400
As you can see, the distribution of the residuals looks fine, even after we added the variable
enroll. When we had just the variable enroll in the model, we did a log transformation to
improve the distribution of the residuals, but when enroll was part of a model with other
variables, the residuals looked good so no transformation was needed. This illustrates how the
distribution of the residuals, not the distribution of the predictor, was the guiding factor in
determining whether a transformation was needed.
2.4 Collinearity
When there is a perfect linear relationship among the predictors, the estimates for a regression
model cannot be uniquely computed. The term collinearity implies that two variables are near
perfect linear combinations of one another. When more than two variables are involved it is often
called multicollinearity, although the two terms are often used interchangeably.
The primary concern is that as the degree of multicollinearity increases, the regression model
estimates of the coefficients become unstable and the standard errors for the coefficients can get
wildly inflated. In this section, we will explore some SPSS commands that help to detect
multicollinearity.
We can use the /statistics=defaults tol subcommand to request the display of "tolerance" and "VIF" values for
each predictor as a check for multicollinearity. The "tolerance" is an indication of the percent of
variance in the predictor that cannot be accounted for by the other predictors, hence very small
values indicate that a predictor is redundant, and values that are less than .10 may merit further
investigation. The VIF, which stands for variance inflation factor, is (1 / tolerance) and as a rule
of thumb, a variable whose VIF value is greater than 10 may merit further investigation. Let's
first look at the regression we did from the last section, the regression model predicting api00
from meals, ell and emer using the /statistics=defaults tol subcommand. As you can see, the
"tolerance" and "VIF" values are all quite acceptable.
regression
/statistics=defaults tol
/dependent api00
/method=enter meals ell emer .
<some output deleted to save space>
Coefficients(a)
Now let's consider another example where the "tolerance" and "VIF" values are more worrisome.
In the regression analysis below, we use acs_k3 avg_ed grad_sch col_grad and some_col as
predictors of api00. As you see, the "tolerance" values for avg_ed grad_sch and col_grad are
below .10, and avg_ed is about 0.02, indicating that only about 2% of the variance in avg_ed is
not predictable given the other predictors in the model. All of these variables measure education
of the parents and the very low "tolerance" values indicate that these variables contain redundant
information. For example, after you know grad_sch and col_grad, you probably can predict
avg_ed very well. In this example, multicollinearity arises because we have put in too many
variables that measure the same thing, parent education.
We also include the collin option which produces the "Collinearity Diagnostics" table
below. The very low eigenvalue for the 5th dimension (since there are 5 predictors) is another
indication of problems with multicollinearity. Likewise, the very high "Condition Index" for
dimension 5 similarly indicates problems with multicollinearity with these predictors.
regression
/statistics=defaults tol collin
/dependent api00
/method=enter acs_k3 avg_ed grad_sch col_grad some_col.
<some output deleted to save space>
Coefficients(a)
Collinearity Diagnostics(a) (Model 1; variance proportions by dimension)
Dimension   Eigenvalue   Condition Index   (Constant)   ACS_K3   AVG_ED   GRAD_SCH   COL_GRAD   SOME_COL
1           5.013        1.000             .00          .00      .00      .00        .00        .00
5           .0028        42.036            .22          .86      .14      .10        .15        .09
6           .0115        65.887            .77          .13      .86      .81        .77        .66
Let's omit one of the parent education variables, avg_ed. Note that the VIF values in the
analysis below appear much better. Also, note how the standard errors are reduced for the parent
education variables, grad_sch and col_grad. This is because the high degree of collinearity
caused the standard errors to be inflated. With the multicollinearity eliminated, the coefficient
for grad_sch, which had been non-significant, is now significant.
regression
/statistics=defaults tol collin
/dependent api00
/method=enter acs_k3 grad_sch col_grad some_col.
<some output omitted to save space>
Coefficients(a)
Collinearity Diagnostics(a)

Model  Dimension  Eigenvalue  Condition   Variance Proportions
                              Index       (Constant)  ACS_K3  GRAD_SCH  COL_GRAD  SOME_COL
1      5             .0249     39.925        .99        .99      .01       .01       .00
<other dimensions omitted to save space>
2.5 Tests on Nonlinearity
When we do linear regression, we assume that the relationship between the response variable and
the predictors is linear. If this assumption is violated, the linear regression will try to fit a straight
line to data that do not follow a straight line. Checking the linearity assumption in the case of
simple regression is straightforward, since we only have one predictor: all we have to do is make a
scatterplot of the response variable against the predictor and look for signs of nonlinearity, such
as a curved band or a big wave-shaped pattern. For example, let us use a data file called
nations.sav that has data about a number of nations around the world, and look at the
relationship between GNP per capita (gnpcap) and births (birth). In the scatterplot between
gnpcap and birth below, we can see that the relationship between these two
variables is quite non-linear. We added a regression line to the chart by double clicking on it and
choosing "Chart", then "Options", then "Fit Line Total", and you can see how poorly the line
fits the data. Also, if we look at the plot of residuals by predicted values, we see that the residuals are not
homoscedastic, due to the non-linearity in the relationship between gnpcap and birth.
regression
/dependent birth
/method=enter gnpcap
/scatterplot(*zresid *pred)
/scat(birth gnpcap) .
Variables Entered/Removed(b)

Model   Variables Entered   Variables Removed   Method
1       GNPCAP(a)           .                   Enter

Model Summary(b)
ANOVA(b)
Coefficients(a)

Residuals Statistics(a)

            Minimum   Maximum   Mean   Std. Deviation    N
Residual    -23.18     28.10     .00       10.629       109
<remaining output tables and the requested scatterplots not shown>
We modified the above scatterplot, changing the fit line from linear regression to
"lowess", by choosing "Chart", then "Options", then "Fit Options", and choosing
"Lowess" with the default smoothing parameters. As you can see, the "lowess" smoothed curve
fits substantially better than the linear regression line, further suggesting that the relationship between
gnpcap and birth is not linear.
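For readers who prefer syntax over the legacy chart-editor menus, a similar chart can be sketched
with the GGRAPH command and GPL, which in newer SPSS versions supports a loess fit line via
smooth.loess. This is only an illustrative alternative to the menu steps described above, not the
chapter's original approach, and the exact GPL accepted can vary by SPSS version.
* Sketch: scatterplot of birth against gnpcap with a loess fit line.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=gnpcap birth
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s = userSource(id("graphdataset"))
  DATA: gnpcap = col(source(s), name("gnpcap"))
  DATA: birth = col(source(s), name("birth"))
  GUIDE: axis(dim(1), label("gnpcap"))
  GUIDE: axis(dim(2), label("birth"))
  ELEMENT: point(position(gnpcap*birth))
  ELEMENT: line(position(smooth.loess(gnpcap*birth)))
END GPL.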
We can see that the gnpcap scores are quite skewed, with most values near 0 and a
handful of values of 10,000 and higher. This suggests that some transformation of the
variable may be necessary. One commonly used transformation is a log transformation, so let's
try that. As you see, the scatterplot between lgnpcap and birth looks much better, with the
regression line going through the heart of the data. Also, the plot of the residuals by predicted
values looks much more reasonable.
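The syntax for this transformation is not shown in the original output, but the tables below refer to
a variable named LGNPCAP. A minimal sketch of the syntax that would create it and refit the model
is given here; it assumes a natural-log transformation (a base-10 log via the LG10 function would
work just as well), and the frequencies step is simply an optional check on the skewness of gnpcap.
* Optional check on the skewness of gnpcap.
frequencies variables=gnpcap
/format=notable
/histogram
/statistics=skewness.
* Create the log-transformed variable and refit the model.
compute lgnpcap = ln(gnpcap).
execute.
regression
/dependent birth
/method=enter lgnpcap
/scatterplot(*zresid *pred)
/scat(birth lgnpcap) .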
Variables Entered/Removed(b)

Model   Variables Entered   Variables Removed   Method
1       LGNPCAP(a)          .                   Enter

Model Summary(b)
ANOVA(b)
Coefficients(a)
Residuals Statistics(a)
<remaining output tables and the requested scatterplots not shown>
This section has shown how you can use scatterplots to diagnose problems of non-linearity, both
by looking at the scatterplot of the predictor against the outcome variable and by examining the
plot of residuals by predicted values. These examples have focused on simple regression, but
similar techniques are useful in multiple regression; there, however, it is more informative to
examine partial regression plots rather than simple scatterplots between each predictor and the
outcome variable.
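SPSS can produce these partial regression plots directly through the /partialplot subcommand of
the regression command. The block below is only an illustrative sketch, reusing the api00 model
from the collinearity section; it assumes the elemapi2 data are the active dataset and is not part of
the chapter's original analyses.
* Sketch: request partial regression plots for each predictor.
regression
/dependent api00
/method=enter meals ell emer
/partialplot=all .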
2.6 Model Specification
A model specification error can occur when one or more relevant variables are omitted from the
model or when one or more irrelevant variables are included in the model. If relevant variables are
omitted, the common variance they share with the included variables may be
wrongly attributed to those included variables, and the error term is inflated. If
irrelevant variables are included, the common variance they share with the other variables in the
model may be wrongly attributed to the irrelevant variables. Model specification errors can
substantially affect the estimates of the regression coefficients.
Consider the model below. This regression suggests that as class size increases, academic
performance increases, with p=0.053. Before we publish results saying that increased class size
is associated with higher academic performance, let's check the model specification.
regression
/dependent api00
/method=enter acs_k3 full
/save pred(apipred).
Coefficients(a)
SPSS does not have any tools that directly support finding specification errors; however,
you can check for omitted variables using the procedure below. As you noticed above, when
we ran the regression we saved the predicted value, calling it apipred. If we use the predicted
value and the predicted value squared as predictors of the dependent variable, apipred should be
significant, since it is the predicted value, but apipred squared should not be a significant predictor:
if our model is specified correctly, the squared predictions should not have much
explanatory power above and beyond the predicted value itself. That is, we wouldn't expect apipred
squared to be a significant predictor if our model is specified correctly. Below we compute
apipred2 as the square of apipred and then include apipred and apipred2 as predictors
in our regression model, hoping to find that apipred2 is not significant.
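The syntax for this step is not shown in the original output; a minimal sketch, mirroring the
preda/preda2 syntax used later in this section, would be:
compute apipred2 = apipred**2.
regression
/dependent api00
/method=enter apipred apipred2.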
Coefficients(a)
The above results show that apipred2 is significant, suggesting that we may have omitted
important variables from our regression. We should therefore consider whether to add other
variables to our model. Let's try adding the variable meals to the model. We see that
meals is a significant predictor, and we save the predicted value, calling it preda, so that we can
test whether any additional important variables have been omitted.
regression
/dependent api00
/method=enter acs_k3 full meals
/save pred(preda).
<some output omitted to save space>
Coefficients(a)
We now create preda2, which is the square of preda, and include both as predictors in
our model.
compute preda2 = preda**2.
regression
/dependent api00
/method=enter preda preda2.
Coefficients(a)

               B        Std. Error      t      Sig.
(Constant)  -136.510      95.059     -1.436    .152
<remaining rows not shown>
We now see that preda2 is not significant, so this test does not suggest that there are any other
important omitted variables. Note that after including meals and full, the coefficient for class
size is no longer significant. acs_k3 does have a positive relationship with api00 when
only full is included in the model, but when we also include (and hence control for) meals,
acs_k3 is no longer significantly related to api00, and its relationship with api00 is no longer
positive.
2.7 Issues of Independence
The statement of this assumption is that the errors associated with one observation are not
correlated with the errors of any other observation. Violation of this assumption can occur in a
variety of situations. Consider the case of collecting data from students in eight different
elementary schools. It is likely that the students within each school will tend to be more like one
another than students from different schools; that is, their errors are not independent.
Another way in which the assumption of independence can be broken is when data are collected
on the same variables over time. Let's say that we collect truancy data every semester for 12
years. In this situation it is likely that the errors for observations in adjacent semesters will
be more highly correlated than errors for observations more separated in time; this is known as
autocorrelation. When you have data that can be considered time-series, you can use the
Durbin-Watson statistic to test for correlated residuals.
We don't have any time-series data, so we will use the elemapi2 dataset and pretend that snum
indicates the time at which the data were collected. We will sort the data on snum to order them
according to our fake time variable, and then run the regression analysis requesting the
Durbin-Watson test (the durbin keyword on the /residuals subcommand), as sketched below. The
Durbin-Watson statistic has a range from 0 to 4, with a midpoint of 2. The observed value in our
example is less than 2, which is not surprising since our data are not truly time-series.
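The syntax for this step is not shown in the original output. The sketch below assumes the same
simple model of api00 on enroll used earlier in the chapter, which matches the R Square of .101
reported in the table that follows.
* Sketch: sort on the fake time variable, then request the Durbin-Watson statistic.
sort cases by snum.
regression
/dependent api00
/method=enter enroll
/residuals=durbin.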
Model Summary

Model    R     R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .318     .101           .099                   135.026                 1.351
2.8 Summary
This chapter has covered a variety of topics in assessing the assumptions of regression using
SPSS, and the consequences of violating these assumptions. As we have seen, it is not sufficient
simply to run a regression analysis; it is also important to verify that the assumptions have been
met. If this verification stage is omitted and your data do not meet the assumptions of linear
regression, your results could be misleading and your interpretation of them could be in
doubt. Without thoroughly checking your data for problems, it is possible that another
researcher could analyze your data, uncover such problems, and present an improved analysis
that contradicts your results and undermines your conclusions.