Lectures on SPSS 2010
SPSS stands for the "Statistical Package for the Social Sciences." It is composed of two
inter-related facets: the statistical package itself and the SPSS language, a system of syntax used
to execute commands and procedures. When you use SPSS, you work in one of several
windows: the Data View, the Variable View, the Output Viewer, the Draft Output window, and
the Script window.
A survey was conducted among 100 singers who have sold CDs during the year 2007. Data
was collected for the following 4 variables:
1. Advertising Budget (thousands of rupees) for the CD
2. No. of CDs Sold (thousands)
3. No. of times Songs are played on Radio 1 during the week prior to its release
4. Attractiveness of the Singer on a scale of 1 to 10
When you first open SPSS for Windows, the first thing you will see is the Data Editor.
The Data Editor consists of two windows. By default the Data View is shown (Figure 1); it has a
spreadsheet-like interface, much like Excel, which allows the data to be entered and viewed.
Data values can be entered in the Data View spreadsheet (this is where you can start
inputting your data).
The other window is the Variable View, which allows the types of variables to be specified
and viewed (Figure 2). This is where you type the questions from your questionnaire into
SPSS and define the codes or labels (e.g. 0, 1, 2, …) for categorical variables (e.g. males and
females: Male = 1, Female = 0).
The user can toggle between the windows by clicking on the appropriate tabs on the bottom
left of the screen (Figure 3).
By default SPSS aligns numerical data entries to the right-hand side of the cells and text
(string) entries to the left-hand side. By default SPSS uses a period/full stop to indicate
missing numerical values. You may also use numbers to represent missing data. For example,
data was missing for two students for Numeracy. The missing data have been represented by
98 (98 represents data that was missing because the student was absent) and 99 (99 represents
data that was missing because the student had an exemption).
When labels have been assigned to the category codes of a categorical variable, these can be
displayed by checking Value Labels in the View menu (or by clicking the Value Labels button
on the toolbar). For example, in the column uni you can either display the codes 0 and 1 or
display University of Mauritius and University of Technology.
The appearance of the Data View spreadsheet is controlled by the View drop-down menu.
This can be used to change the font in the cells, remove lines, and make value labels visible.
3. Variable View
In Variable View each variable definition occupies a row of this spreadsheet: the first
column in Data View is defined in the first row of Variable View, the second column in Data
View in the second row of Variable View, and so on. As soon as data are entered under a
column in the Data View, the default name of that column occupies a row in the Variable
View.
That is why you should NOT type the names of the columns directly into the Data View shown
in Figure 4. These names appear automatically once you define the variables in Variable View
as shown in Figure 3.
Figure 4
There are 10 characteristics to be specified under the columns of the Variable View:
Name — the chosen variable name. This can be up to eight alphanumeric characters but
must begin with a letter. While the underscore (_) is allowed, hyphens (-), ampersands (&),
and spaces cannot be used. Variable names are not case sensitive. The name is used for internal
processing by SPSS; when a variable label has been assigned, it is the label rather than the
name that appears in the output generated by SPSS.
Label — a label attached to the variable name. In contrast to the name, this is not confined
to eight characters and spaces can be used. It is generally a good idea to assign variable
labels. (Here you can paste the sentences from your questionnaire which you have typed in
MS Word). They are helpful for reminding users of the meaning of variables (placing the
cursor over the variable name in the Data View will make the variable label appear) and are
displayed in the output from statistical analyses.
Type — the type of data. SPSS provides a default variable type once variable values have
been entered in a column of the Data View. The type can be changed by highlighting the
respective entry in the second column of the Variable View and clicking the three-period
symbol (…) appearing on the right-hand side of the cell. This results in the
Variable Type box being opened, which offers a number of types of data including various
formats for numerical data, dates, or currencies. (Note that a common mistake made by first-
time users is to enter categorical variables as type “string” by typing text into the Data View.
To enable later analyses, categories should be given artificial number codes and defined to be
of type “numeric.”)
Decimals — the number of digits to the right of the decimal place to be displayed for data
entries. This is not relevant for string data and for such variables the entry under the fourth
column is given as a greyed-out zero. The value can be altered in the same way
as the value of Width. For example, the value 879.45 has 2 decimal places. Note that Decimals
must be adjusted before adjusting the Width.
Width — the width of the actual data entries. The default width of numerical variable
entries is eight. The width can be increased or decreased by highlighting the respective cell in
the third column and employing the upward or downward arrows appearing on the right-
hand side of the cell or by simply typing a new number in the cell. For example, the value
879.45 has a width of 6.
Values — labels attached to category codes. For categorical variables, an integer code
should be assigned to each category and the variable defined to be of type “numeric.” When
this has been done, clicking on the respective cell under the sixth column of the Variable
View makes the three-period symbol appear, and clicking this opens the Value Labels
dialogue box, which in turn allows assignment of labels to category codes. For example, our
data set includes a categorical variable uni indicating the university of the subject. Clicking the
three-period symbol (…) opens the dialogue box shown in Figure 5, where numerical
code "0" was declared to represent University of Mauritius and code "1" University of
Technology.
Figure 5: Value Labels.
Missing — missing value codes. SPSS recognizes the period symbol as indicating a missing
value. If other codes have been used (e.g. 98, 99) these have to be declared to represent
missing values by highlighting the respective cell in the seventh column, clicking the
three-period symbol, and filling in the resulting Missing Values dialogue box accordingly.
Figure 6: Defining Missing Values
Columns — width of the variable column in the Data View. The default cell width for
numerical variables is eight. Note that when the Width value is larger than the Columns
value, only part of the data entry might be seen in the Data View. The cell width can be
changed in the same way as the width of the data entries or simply by dragging the relevant
column boundary. (Place cursor on right-hand boundary of the title of the column to be
resized. When the cursor changes into a vertical line with a right and left arrow, drag the
cursor to the right or left to increase or decrease the column width.)
Align — alignment of variable entries. The SPSS default is to align numerical variables to
the right-hand side of a cell and string variables to the left. It is generally helpful to adhere to
this default; but if necessary, alignment can be changed by highlighting the
relevant cell in the ninth column and choosing an option from the drop-down list.
Measure — measurement scale of the variable. The default chosen by SPSS depends on the
data type. For example, for variables of type “numeric,” the default measurement scale is a
continuous or interval scale (referred to by SPSS as “scale”). For variables of type “string,”
the default is a nominal scale. The third option, “ordinal,” is for categorical variables with
ordered categories but is not used by default. It is good practice to assign each variable the
highest appropriate measurement scale (“scale” > “ordinal” > “nominal”) since this has
implications for the statistical methods that are applicable. The default setting can be
changed by highlighting the respective cell in the tenth column and choosing an appropriate
option from the drop-down list.
You have to input the data in Table 2.1 as follows.
STEP 1
Figure 7: Entering Data in Data View
STEP 2
Then go to Variable View to define the Variables and define the Name and Label as shown
in Figure 8.
Figure 8. Defining Variables:
STEP 3
Click on the three-period symbol (…) in the cell where the row diet and the column Values
meet, as shown in Figure 9. This opens the dialogue box shown in Figure 10.
Figure 9.
Step 4
Type 1 in Value and type Restricted diet in Value Label.
Click on Add.
Repeat with 2 in Value and Ad libitum diet in Value Label, and click on Add.
Click on OK.
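The same labels can also be attached with SPSS syntax. A minimal sketch (the label for code 2 is an assumption here, based on the two diets used later in these notes):

VALUE LABELS diet 1 'Restricted diet' 2 'Ad libitum diet'.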
Note: The Statistics Menus
The drop-down menus available after selecting Data, Transform, Analyze, or Graphs from
the menu bar provide procedures concerned with different aspects of a statistical analysis.
They allow manipulation of the format of the data spreadsheet to be used for analysis (Data),
generation of new variables (Transform), running of statistical procedures (Analyze), and
construction of graphical displays (Graphs).
Most statistics menu selections open dialogue boxes. The dialogue boxes are used to select
variables and options for analysis. A main dialogue for a statistical procedure has several
components: A source variables list is a list of variables from the Data View spreadsheet
that can be used in the requested analysis. Only variable types that are allowed by the
procedure are displayed in the source list. Variables of type “string” are often not allowed.
Transfer Life Spans of Rats to Dependent List. Dependent List declares the continuous
variables. We would like to have output generated for the diets separately. So diet is
transferred to Factor List. Labeling the observations by the rat’s ID number will enable
possible outlying observations to be identified.
Figure 3.2
For graphical displays of the data we again need the Explore dialogue box; in fact, by
checking Both (under Display) in this box, we can get our descriptive statistics and the plots we require. Here
we select Boxplots and Histogram to display the distributions of the lifespans of the rats,
and probability plots to assess more directly the assumption of normality within each dietary
group.
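The Explore output can also be produced with the EXAMINE command. A minimal sketch, assuming the lifespan, diet, and ID variables are named lifespan, diet, and ratid in the Data View (the ID name is an assumption):

EXAMINE VARIABLES=lifespan BY diet
  /ID=ratid
  /PLOT BOXPLOT HISTOGRAM NPPLOT
  /STATISTICS DESCRIPTIVES.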
We can now move on to examine the graphical displays we have selected. The box plots are
shown in Figure 3.3.
Figure 3.3
This type of plot (also known as a box-and-whisker plot) provides a "picture" of a five-point
summary of the sample observations in each group. The lower end of the box represents the
lower quartile and the upper end the upper quartile; thus the length of the box is the IQR and
covers the middle 50% of the data. The horizontal line within the box is placed at the
median of the sample. The bottom "whisker" extends to the minimum data point in the
sample, except if this point is deemed an outlier by SPSS. (SPSS calls a point an "outlier" if
the point is more than 1.5 × IQR away from the box and considers it an "extreme value"
when it is more than 3 × IQR away.)
In the latter case, the whisker extends to the second lowest value, except if this is also found
to be an outlier, and so on. The top whisker extends to the maximum value in the sample, again
provided this value is not an outlier. The box plots in Figure 3.3 lead to the same
conclusions as the descriptive summaries: lifespans in the restricted diet group appear to be
longer "on average" but also more variable.
How to report?
A number of rats have been indicated as possible outliers, and for the ad libitum diet some
are even marked as extreme observations. Since we have employed case labels, we can
identify the rats with very short lifespans. Here the rat with the shortest lifespan (89 days) is
rat number 107.
(Lifespans that are short relative to the bulk of the data can arise as a result of negative
skewness of the distributions — observations labeled “outliers” by SPSS do not necessarily
have to be removed before further analyses, although they do merit careful consideration.
Here we shall not remove any of the suspect observations before further analyses.) The
evidence from both the summary statistics for the observations in each dietary group and the
box plot is that the distributions of the lifespans in the underlying population are non-
symmetric and that the variances of the lifespans vary between the diet groups.
Such information is important in deciding which statistical tests are most appropriate for
testing hypotheses of interest about the data, as we shall see later.
For a data set containing outliers, it is NOT recommended to use the mean, standard
deviation, or variance. Use the median instead of the mean, and use the inter-quartile
range instead of the standard deviation/variance.
Figure 3.4 shows the descriptive statistics supplied by default (further statistics can be
requested from Explore via the Statistics sub-dialogue box).
Figure 3.4
How to report?
Write one statement on the measures of central tendency: mean or median or mode.
The median lifespan is shorter for rats on the ad libitum diet (710 days compared with
1035.5 days for rats on the restricted diet). A similar conclusion is reached when either the
mean or the 5% trimmed mean is used as the measure of location.
Write one statement on the measures of dispersion: interquartile range or standard deviation or variance.
The “spread” of the lifespans as measured by the interquartile range (IQR) appears to vary
with diet, with lifespans in the restricted diet group being more variable (IQR in the
restricted diet group is 311.5 days, but only 121 days in the Ad libitum diet group). Other
measures of spread, such as the standard deviation and the range of the sample, confirm the
increased variability in the restricted diet group.
Note that another way of stating this is to say that the lifespans in the ad libitum diet group
are more consistent.
Write one statement on the measures of symmetry: index of skewness.
Write one statement on the measures of peakedness: index of kurtosis
SPSS provides measures of two aspects of the “shape” of the lifespan distributions in each
dietary group, namely, skewness and kurtosis. The index of skewness takes the value zero for
a symmetrical distribution. A negative value indicates a negatively skewed distribution, a
positive value a positively skewed distribution; Figure 3.5 shows an example of each type.
Both data sets show some degree of negative skewness: there is a concentration of larger
values, with a longer tail of smaller values.
The kurtosis index measures the extent to which the peak of a unimodal frequency
distribution departs from the shape of a normal distribution. A value of zero corresponds to a
normal distribution (A); positive values indicate a distribution that is more pointed than a
normal distribution (C) and negative values a flatter distribution (B); Figure 3.6 shows
examples of each type. For both data sets, the distributions are more pointed than a normal
distribution. Such findings have possible implications for later analyses that may be carried
out on the data.
Figure 3.5
Figure 3.6
7. Histogram
An alternative to the box plot for displaying sample distributions is the histogram. Figure 3.7
shows the histograms for the lifespans under each diet. Each histogram displays the
frequencies with which certain ranges (or “bins”) of lifespans occur within the sample. SPSS
chooses the bin width automatically, but here we chose both our own bin width (100 days)
and the range of the x-axis (100 days to 1500 days) so that the histograms in the two groups
were comparable. To change the default settings to reflect our choices, we go through the
following steps:
Figure 3.8
Figure 3.7
As we might expect the histograms indicate negatively skewed frequency distributions with
the left-hand tail being more pronounced in the restricted diet group.
Figure 3.8b
You have to generate the following output and compare the performances of the students of
UTM and UOM. Generate the Q-Q plot. Carry out the normality tests.
Descriptives
9.1 Normality tests
Note that the Kolmogorov-Smirnov and Shapiro-Wilk tests are used to test whether the
variables are normally distributed. If the Sig. value of the test is less than 5% or 0.05, we
conclude that the variable does not follow a Normal distribution. In case of conflict between
the two tests, we report the Shapiro-Wilk test.
To obtain the Normality tests, click on Analyze – Descriptive Statistics – Explore. Click on
Plots and tick Normality plots with tests as shown in Figure 3.8b.
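The same dialogue choices correspond to syntax along the following lines; this is a sketch only, with spssjan and uni as illustrative names for the January marks and university variables:

EXAMINE VARIABLES=spssjan BY uni
  /PLOT NPPLOT
  /STATISTICS DESCRIPTIVES.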
Tests of Normality

Percentage Marks obtained on SPSS exam in January, by University (UOM or UTM?)

                              Kolmogorov-Smirnov(a)           Shapiro-Wilk
                              Statistic   df    Sig.          Statistic   df    Sig.
University of Mauritius       .109        50    .192          .962        50    .111
University of Technology      .140        50    .016          .935        50    .008

a  Lilliefors Significance Correction
How to report?
(a)
H0: Percentage Marks obtained in the SPSS exam in January by UOM students follows a Normal
distribution
H1: Percentage Marks obtained in the SPSS exam in January by UOM students does not follow a
Normal distribution
Test: Kolmogorov-Smirnov
Statistic = .109 p-value = 0.192
Conclusion: The Percentage Marks obtained on SPSS exam in January for UOM students
follows a Normal distribution as its Sig. = 0.192 > 0.05. Accept H0.
(b)
H0: Percentage Marks obtained in the SPSS exam in January by UTM students follows a Normal
distribution
H1: Percentage Marks obtained in the SPSS exam in January by UTM students does not follow a
Normal distribution
Test: Kolmogorov-Smirnov
Statistic = .140 p-value = 0.016
Conclusion: The Percentage Marks obtained on SPSS exam in January for UTM students do not
follow a Normal distribution as its Sig. = 0.016 < 0.05. Reject H0.
In the Q-Q plots, the observed quantiles at the lower end are smaller than would be expected,
with this being most pronounced for the lowest three quantiles in the ad libitum group. Such a
picture is characteristic of distributions with a heavy left tail; thus again we detect some degree
of negative skewness.
Figure 3.9
Figure 10.1
Figure 10.2
The following output will be generated. There were 4 students who obtained Numeracy
level 1.00. There were 15 students who obtained level 2.00.
Note that Percent counts the missing cases in its denominator when calculating the percentage,
while Valid Percent ignores missing data. There are two missing cases for the variable
Numeracy level.
For example, for Numeracy level 1.00:
Percent: 4/100 * 100 = 4.0% and Valid Percent: 4/98 * 100 = 4.1%
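The table below can be produced with the FREQUENCIES command. A minimal sketch, assuming the Numeracy level variable is named numeracy (an illustrative name):

FREQUENCIES VARIABLES=numeracy.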
                                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid    Numeracy Level   1.00              4         4.0          4.1                 4.1
                          2.00             15        15.0         15.3                19.4
                          3.00             15        15.0         15.3                34.7
                          4.00             17        17.0         17.3                52.0
                          5.00             13        13.0         13.3                65.3
                          6.00              8         8.0          8.2                73.5
                          7.00              9         9.0          9.2                82.7
                          8.00              9         9.0          9.2                91.8
                          9.00              2         2.0          2.0                93.9
                         10.00              3         3.0          3.1                96.9
                         12.00              1         1.0          1.0                98.0
                         13.00              1         1.0          1.0                99.0
                         14.00              1         1.0          1.0               100.0
         Total                             98        98.0        100.0
Missing  Missing Data because
         student was absent                 1         1.0
         Missing Data because
         student was exempted               1         1.0
         Total                              2         2.0
Total                                     100       100.0
(2)
11.2 Comparing two dependent/related
groups
(1)
12. Practice session 4 – Comparing two
independent groups: Mann-Whitney U-test
and Independent Samples t-test
The two groups of rats are independent (as two rats from the same family are not in the two
different groups). Who lives longer: those who were on the restricted diet or those who were on
the ad libitum diet? We wish to compare the lifespan (the only variable) of the two groups
(restricted diet and ad libitum diet).
Step 1: Using Normality tests, verify that the variable lifespan does NOT follow a Normal
distribution.
Tests of Normality

                     Kolmogorov-Smirnov(a)          Shapiro-Wilk
                     Statistic   df    Sig.         Statistic   df    Sig.
Lifespan of rats     .086        195   .001         .974        195   .001

a  Lilliefors Significance Correction
Since Sig. = 0.001 < 5%=0.05 we conclude that at 5% significance level, lifespan does not
follow a Normal distribution.
This means that we can’t use independent samples t-test to test H0. So we will use the
Mann Whitney U-test.
Step 2: Click
Analyze – Non-parametric Tests – 2 Independent-Samples…
Step 3:
Transfer lifespan to Test Variable. Transfer diet to Grouping Variable.
The Test Variable(s) list contains the variables that are to be compared between two levels of
the Grouping Variable.
Step 4:
Click on Define Groups.
Step 5:
Type in 1 and 2 which are the codes we have used for the two diets.
Here the variable lifespan is to be compared between level “1” and “2” of the grouping
variable diet. The Define Groups… sub-dialogue box is used to define the levels of interest.
In this example, the grouping variable has only two levels, but pair-wise group comparisons
are also possible for variables with more than two group levels.
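In syntax form the test is short. A sketch, assuming the variables are named lifespan and diet, with diet coded 1 and 2:

NPAR TESTS
  /M-W= lifespan BY diet(1 2).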
In the output table, the group with the higher mean rank is the group with the greater number
of high scores within it. Therefore, we can conclude that the restricted diet group has
significantly longer lifespans.
Test Statistics(a)

                           Lifespan of rats
Mann-Whitney U             1462.500
Wilcoxon W                 5467.500
Z                          -8.291
Asymp. Sig. (2-tailed)     .000

a  Grouping Variable: Diet
How to report?
H0: Population median of lifespan in the restricted diet group = population median of lifespan in the ad
libitum group
H1: Population median of lifespan in the restricted diet group ≠ population median of lifespan in the ad
libitum group
Conclusion: Since Asymp. Sig. = 0.000 < 0.05, we reject H0 at the 5% level of significance. Since
the restricted diet group has the higher mean rank, it has significantly longer lifespans.
H0: Population mean of lifespan in the restricted diet group = population mean of lifespan in the ad
libitum group
H1: Population mean of lifespan in the restricted diet group ≠ population mean of lifespan in the ad
libitum group
Step 1: Using Normality tests, verify that the variable lifespan follows a Normal
distribution. (Let us ignore for the moment that lifespan does not in fact follow a normal distribution.)
Step 2: Click
Analyze – Compare Means – Independent-Samples T Test
Figure 12.1
Step 3:
Transfer lifespan to Test Variable. Transfer diet to Grouping Variable.
The Test Variable(s) list contains the variables that are to be compared between two levels of
the Grouping Variable.
Step 4:
Click on Define Groups.
Step 5:
Type in 1 and 2 which are the codes we have used for the two diets.
Here the variable lifespan is to be compared between level “1” and “2” of the grouping
variable diet. The Define Groups… sub-dialogue box is used to define the levels of interest.
In this example, the grouping variable has only two levels, but pair-wise group comparisons
are also possible for variables with more than two group levels.
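The equivalent syntax is shown below as a sketch, again assuming the variables lifespan and diet (coded 1 and 2):

T-TEST GROUPS=diet(1 2)
  /VARIABLES=lifespan.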
Figure 12.2
Step 6:
The output begins with a number of descriptive statistics for each group. (Note that the standard
errors of the means are given, i.e., the standard deviation of lifespan divided by the square
root of the group sample size.) The next part of the display gives the results of applying two
versions of the independent samples t-test. The first is the usual form, based on assuming
equal variances in the two groups (i.e., homogeneity of variance); the standard error of the
estimated mean difference (32.9 days) is used to construct a 95% CI for the mean difference
(from 219.9 to 349.6 days). The mean lifespan in the restricted diet group is between about
220 and 350 days longer than the corresponding value in the ad libitum diet group.
The "Independent Samples Test" table also includes a statistical significance test proposed by
Levene (1960) for testing the null hypothesis that the variances in the two groups are equal.
In this instance, the test suggests that there is a significant difference in the size of the within
diet variances (p < 0.001).
How to report?
H0: The variances of lifespan in the two groups are homogeneous.
H1: The variances of lifespan in the two groups are not homogeneous.
Conclusion: Since the Sig. value of Levene's test is less than 0.05 (p < 0.001), we reject H0 at the
5% level of significance; the variances are not homogeneous.
Consequently, it may be more appropriate here to use the alternative version of the t-test
given in the second row of the table.
This version of the t-test uses separate variances instead of a pooled variance to construct
the standard error and reduces the degrees of freedom to account for the extra variance.
t = 9.161 p = 0.000 < 5%
How to report?
H0: Population mean of lifespan in the restricted diet group = population mean of lifespan in the ad
libitum group
H1: Population mean of lifespan in the restricted diet group ≠ population mean of lifespan in the ad
libitum group
Conclusion: Since p = 0.000 < 0.05, we reject H0 at the 5% level of significance; the mean lifespan
in the restricted diet group is significantly longer than in the ad libitum group.
Since earlier analyses showed there was some evidence of non-normality in the lifespan data, it
may be useful to look at the results of the appropriate nonparametric Mann-Whitney U-test
(instead of the t-test), which does not rely on this assumption.
H0: Mean*/Median** Percentage Marks obtained on SPSS exam in January by UOM students =
Mean*/Median** Percentage Marks obtained on SPSS exam in January by UTM students
H1: Mean*/Median** Percentage Marks obtained on SPSS exam in January by UOM students ≠
Mean*/Median** Percentage Marks obtained on SPSS exam in January by UTM students
H0: Mean*/Median** Percentage Marks obtained on SPSS exam in April by UOM students =
Mean*/Median** Percentage Marks obtained on SPSS exam in April by UTM students
H1: Mean*/Median** Percentage Marks obtained on SPSS exam in April by UOM students ≠
Mean*/Median** Percentage Marks obtained on SPSS exam in April by UTM students
14. Practice session 6 – Comparing two
dependent groups: Wilcoxon Signed Ranks
test ** and Paired Samples t-test *
The SPSS file to be used: File to be used for SPSS Lectures.sav
A survey was conducted among 100 students (that is why we have used 100 rows in the
Data View: one row for each student). All the 100 students had to take a test in January.
They were then given further lectures in SPSS. The SAME 100 students had to take a
second test in April. Each student has a pair of marks (January and April). That is we have
assessed the students twice. That’s why we have two columns of data: one for January and
one for April. These two sets of data are called dependent or paired. We would like to
know if there has been a change in performance from January to April. We may write it as
H0: Mean*/Median** Percentage Marks obtained on SPSS exam in January by ALL 100 students =
Mean*/Median** Percentage Marks obtained on SPSS exam in April by ALL 100 students
H1: Mean*/Median** Percentage Marks obtained on SPSS exam in January by ALL 100 students ≠
Mean*/Median** Percentage Marks obtained on SPSS exam in April by ALL 100 students
Step 1: For each student, calculate the difference in marks obtained in January and April and
store it in a column called diffmark.
Click on Transform and Compute.
Click on Type & Label….
Type Difference in Jan and Apr Marks in the Label box.
Click on Continue
Click on OK.
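The same new variable can also be created with syntax. A minimal sketch, assuming the January and April marks are stored in variables named spssjan and spssapr (illustrative names; the sign of the difference does not matter for the normality check):

COMPUTE diffmark = spssapr - spssjan.
VARIABLE LABELS diffmark 'Difference in Jan and Apr Marks'.
EXECUTE.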
Step 2: Using Normality tests verify that the variable Difference in Jan and Apr Marks
follows a Normal distribution.
Tests of Normality

                                  Kolmogorov-Smirnov(a)          Shapiro-Wilk
                                  Statistic   df    Sig.         Statistic   df    Sig.
Difference in Jan and Apr Marks   .106        100   .007         .962        100   .006
How to report?
H0: Difference in Jan and Apr Marks follows a Normal distribution
H1: Difference in Jan and Apr Marks does not follow a Normal distribution
Test: Kolmogorov-Smirnov
Statistic = .106 p-value = 0.007
Conclusion: The Difference in Jan and Apr Marks does not follow a Normal distribution as its Sig. =
0.007 < 0.05. Reject H0.
As the condition of Normality is not satisfied we can’t use the Paired Samples t-test. We
should use the Wilcoxon Signed Ranks test to test:
H0: Median Percentage Marks obtained on SPSS exam in January by ALL 100 students = Median
Percentage Marks obtained on SPSS exam in April by ALL 100 students
H1: Median Percentage Marks obtained on SPSS exam in January by ALL 100 students ≠ Median
Percentage Marks obtained on SPSS exam in April by ALL 100 students
14.1 Wilcoxon Signed Ranks test
Step 1: Click
Analyze – Non-parametric Tests – 2 Related-Samples…
Step 2: Click on Percentage Marks obtained on SPSS exam in January and, immediately after,
click on Percentage Marks obtained on SPSS exam in April.
Step 3: Then click on the arrow button to transfer the two variables simultaneously to the
Test Pair(s) List.
Click on OK.
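A syntax sketch of the same test, using the illustrative names spssjan and spssapr:

NPAR TESTS
  /WILCOXON=spssjan WITH spssapr (PAIRED).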
Step 4.
Test Statistics(b)
How to report?
There were 24 students whose Percentage Marks obtained on SPSS exam in April <
Percentage Marks obtained on SPSS exam in January
There were 64 students whose Percentage Marks obtained on SPSS exam in April >
Percentage Marks obtained on SPSS exam in January
There were 12 students whose Percentage Marks obtained on SPSS exam in April =
Percentage Marks obtained on SPSS exam in January
H0: Median Percentage Marks obtained on SPSS exam in January by ALL 100 students = Median
Percentage Marks obtained on SPSS exam in April by ALL 100 students
H1: Median Percentage Marks obtained on SPSS exam in January by ALL 100 students ≠ Median
Percentage Marks obtained on SPSS exam in April by ALL 100 students
Conclusion: Since p = 0.000 < 5%, at 5% level of significance, we reject H0. As the result is
based on negative ranks, we conclude that there has been a significant increase in the marks
from January to April.
H0: Mean Percentage Marks obtained on SPSS exam in January by ALL 100 students = Mean
Percentage Marks obtained on SPSS exam in April by ALL 100 students
H1: Mean Percentage Marks obtained on SPSS exam in January by ALL 100 students ≠ Mean
Percentage Marks obtained on SPSS exam in April by ALL 100 students
Step 1: Click on
Analyze – Compare Means – Paired-Samples T Test…
Step 2: Click on Percentage Marks obtained on SPSS exam in January and
immediately after, click Percentage Marks obtained on SPSS exam in April
Step 3: Then click on to transfer the two variables simultaneously to Test Pair(s)
List.
Click on OK.
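The corresponding syntax, again with the illustrative variable names, is:

T-TEST PAIRS=spssjan WITH spssapr (PAIRED).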
Step 4.
Paired Samples Statistics

                                              Mean     N     Std. Deviation   Std. Error Mean
Pair 1   Percentage Marks obtained on
         SPSS exam in January                 58.100   100   21.3156          2.1316
Paired Samples Test

Pair 1: Percentage Marks obtained on SPSS exam in January - Percentage Marks obtained on
SPSS exam in April

Paired Differences
  Mean                                        -2.510
  Std. Deviation                               6.3477
  Std. Error Mean                              .6348
  95% Confidence Interval of the Difference
    Lower                                     -3.770
    Upper                                     -1.250
t                                             -3.954
df                                            99
Sig. (2-tailed)                               .000
How to report?
The mean Percentage Marks obtained on SPSS exam in January was 58.100.
The mean Percentage Marks obtained on SPSS exam in April was 60.610.
H0: Mean Percentage Marks obtained on SPSS exam in January by ALL 100 students = Mean
Percentage Marks obtained on SPSS exam in April by ALL 100 students
H1: Mean Percentage Marks obtained on SPSS exam in January by ALL 100 students ≠ Mean
Percentage Marks obtained on SPSS exam in April by ALL 100 students
Since Sig. (2-tailed) = 0.000 < 0.05, we reject H0 at the 5% level of significance. As the mean
Percentage Marks obtained on SPSS exam in April was higher than that obtained in January, we
conclude that there has been a significant increase in the marks from January to April.
H0: Mean Percentage Marks obtained on SPSS exam in January by ALL 100 students ≥ Mean
Percentage Marks obtained on SPSS exam in April by ALL 100 students
H1: Mean Percentage Marks obtained on SPSS exam in January by ALL 100 students < Mean
Percentage Marks obtained on SPSS exam in April by ALL 100 students
Statistic: t = -3.954
For a one-tailed t-test, p = Sig. (1-tailed) = Sig. (2-tailed) ÷ 2 = 0.000 ÷ 2 = 0.000 < 0.05.
At the 5% level of significance, we reject H0.
15. Practice session 7 – Comparing two
dependent groups: Wilcoxon Signed Ranks test
and Paired Samples t-test
Data file to be used:
Enter data for the ages at marriage for the sample of 100 couples that applied for marriage
licences in Cumberland County, PA, in 1993. (Data set is given on page 6).
(a) Compute the difference between the age of every husband and wife and store the difference
in the column diffage (label it as difference in age of husband and wife).
(b) Using Normality tests verify if the variable difference in age of husband and wife
follows a Normal distribution.
(c) Decide whether we should use the Wilcoxon Signed Ranks test or the Paired Samples t-test.
16. Practice session 8 – Correlation Pearson’s
Correlation Coefficient and Spearman’s rho
The SPSS file to be used: File to be used for SPSS Lectures.sav
Pearson’s Correlation coefficient is used when two variables
have a linear relationship, and
are measured at ratio or interval level.
Pearson's correlation coefficient assumes that each pair of variables is bivariate
normal; this matters especially for small samples (size less than 30). However, it is
considered to be a robust statistic.
If the variables are not normally distributed but take values that can be ranked then we
should use non-parametric Spearman’s rho.
Kendall's Tau is another non-parametric correlation, and it should be used rather than
Spearman's rho when you have a small data set with a large number of tied ranks.
When interpreting your results, be careful not to draw any cause-and-effect conclusions
due to a significant correlation. Correlation coefficients say nothing about which variable
causes the other to change.
For example:
r = + 0.397 between X and Y indicates a positive correlation between X and Y, that is as
X increases, Y increases.
r = -0.441 between X and Y indicates a negative correlation between X and Y, that is, as X
increases, Y decreases.
Although we cannot make direct conclusions about causality, we can take the correlation
coefficient one step further by squaring it. The squared correlation coefficient (R2), known as
the coefficient of determination, is a measure of the amount of variability in one variable that
is explained by the other. For example, if r = -0.441, then R2 = 0.194 = 19.4%. This means X
accounts for 19.4% of the variability in Y, and 80.6% of the variability in Y is still to be
accounted for by other variables. Note: even if X can account for 19.4% of the variability in Y,
X does not necessarily cause this variation.
(a) Percentage Marks obtained on SPSS exam in January and Percentage Marks
obtained on SPSS exam in April
(b) Percentage Marks obtained on SPSS exam in January and Percentage Marks
obtained in Computer Studies
Step 6. Click on Titles and type: Scatter Plot of January and April SPSS Exam marks.
How to report?
[Scatter plot: Percentage Marks obtained on SPSS exam in April against Percentage Marks
obtained on SPSS exam in January, with points marked by University (University of Mauritius,
University of Technology).]
The scatter diagram shows that there is a positive linear relationship between Percentage
Marks obtained on SPSS exam in January and Percentage Marks obtained on SPSS
exam in April.
YOU should repeat the procedure for Percentage Marks obtained on SPSS exam in
January and Percentage Marks obtained in Computer Studies to obtain the following
scatter diagram.
[Scatter plot titled "Scatter Plot of January and April SPSS Exam marks": Percentage Marks
obtained in Computer Studies against Percentage Marks obtained on SPSS exam in January,
with points marked by University (University of Mauritius, University of Technology).]
The scatter diagram shows that there is no linear relationship between Percentage Marks
obtained on SPSS exam in January and Percentage Marks obtained in Computer
Studies.
Correlational Analyses
Suppose we would like to investigate the correlation
(a) between Percentage Marks obtained on SPSS exam in January and Percentage Marks
obtained on SPSS exam in April; and
(b) between Percentage Marks obtained on SPSS exam in January and Percentage Marks
obtained in Computer Studies.
Step 3. Click on Options and select Means and Std deviations only if you are using
Pearson’s correlation.
Exclude cases pairwise. Cases with missing values for one or both of a pair of
variables for a correlation coefficient are excluded from the analysis. Since each
coefficient is based on all cases that have valid codes on that particular pair of
variables, the maximum information available is used in every calculation. This
can result in a set of coefficients based on a varying number of cases.
Exclude cases listwise. Cases with missing values for any variable are excluded
from all correlations.
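These options correspond to syntax along the following lines. A sketch only; spssjan, spssapr, and compstud are illustrative names for the three variables:

CORRELATIONS
  /VARIABLES=spssjan spssapr compstud
  /PRINT=TWOTAIL NOSIG
  /STATISTICS DESCRIPTIVES
  /MISSING=PAIRWISE.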
How to report?
SPSS presents the correlations in a symmetric matrix form as shown in Table 16.1.
The elements along the diagonal are always 1.00, as there is a perfect correlation between a
variable and itself.
There are three values in each cell. The first value at the top is the correlation coefficient, the
second value is the significance level, and the last value is the sample size.
The correlation between Percentage Marks obtained on SPSS exam in January and
Percentage Marks obtained on SPSS exam in April is +0.955 which indicates a
significant (This indicated by **. Correlation is significant at the 0.01 level (2-tailed);
significant means significantly different from zero!) strong positive correlation between the
two variables. This indicates that those who obtained high scores in January also obtained
high scores in April.
The correlation between Percentage Marks obtained on SPSS exam in January and
Percentage Marks obtained in Computer Studies is +0.064. However, this correlation is
not significantly different from zero. This indicates that there is no relation between
Percentage Marks obtained on SPSS exam in January and Percentage Marks
obtained in Computer Studies. That is, those scoring high marks on the SPSS exam in
January did not necessarily score high marks on the Computer Studies exam.
Similarly, the correlation between Percentage Marks obtained on SPSS exam in
April and Percentage Marks obtained in Computer Studies is not significantly different
from zero.
Note: Flag significant correlations. Correlation coefficients significant at the 0.05 level
are identified with a single asterisk, and those significant at the 0.01 level are identified
with two asterisks.
17. Practice session 9 – Correlation Pearson’s
Correlation Coefficient and Spearman’s rho
17.1 Using Pearson's Correlation coefficient, investigate the relationship between each pair
of variables:
17.2 Using Spearman’s rho, investigate the relationship between Numeracy level on a scale
of 1 to 20 and Overall Grade in previous semester.
                                                               Numeracy level on    Overall Grade in
                                                               a scale of 1 to 20   previous semester
Spearman's rho   Numeracy level on a    Correlation Coefficient     1.000                .324(**)
                 scale of 1 to 20       Sig. (2-tailed)             .                    .001
                                        N                           98                   98
                 Overall Grade in       Correlation Coefficient     .324(**)             1.000
                 previous semester      Sig. (2-tailed)             .001                 .
                                        N                           98                   100

** Correlation is significant at the 0.01 level (2-tailed).
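The Spearman correlation above can also be requested from syntax. A minimal sketch, assuming the two variables are named numeracy and grade (illustrative names):

NONPAR CORR
  /VARIABLES=numeracy grade
  /PRINT=SPEARMAN TWOTAIL.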
18. Practice session 10 – Multiple Linear
Regression
In this chapter, we shall deal with two sets of data where interest lies in either examining
how one variable relates to a number of others or in predicting one variable from others. The
first data set is shown in Table 4.1 and includes four variables, sex, age, extroversion, and car,
the latter being the average number of minutes per week a person spends looking after his or
her car. According to a particular theory, people who score higher on a measure of
extroversion are expected to spend more time looking after their cars since a person may
project their self-image through themselves or through objects of their own. At the same
time, car-cleaning behavior might be related to demographic variables such as age and sex.
Therefore, one question here is how the variables sex, age, and extroversion affect the time
that a person spends cleaning his or her car.
Multiple Linear Regression
Multiple linear regression is a method of analysis for assessing the strength of the
relationship between each of a set of explanatory variables (sometimes known as independent
variables, although this is not recommended since the variables are often correlated), and a
single response (or dependent) variable. When only a single explanatory variable is involved, we
have what is generally referred to as simple linear regression. Applying multiple regression
analysis to a set of data results in what are known as regression coefficients, one for each
explanatory variable. These coefficients give the estimated change in the response variable
associated with a unit change in the corresponding explanatory variable, conditional on the
other explanatory variables remaining constant. The fit of a multiple regression model can be
judged in various ways, for example, calculation of the multiple correlation coefficient or by the
examination of residuals, each of which will be illustrated later. (Further details of multiple
regression are given below)
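As a sketch in general notation, a multiple regression model with response y and k explanatory variables x1, …, xk can be written as

y = b0 + b1 x1 + b2 x2 + … + bk xk + error,

where b1, …, bk are the regression coefficients described above, b0 is the intercept, and the error term is the part of y not accounted for by the explanatory variables.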
In the car cleaning data set in Table 4.1, each of the variables — extroversion (extrover in
the Data View spreadsheet), sex (sex), and age (age) — might be correlated with the
response variable, amount of time spent car cleaning (car). In addition, the explanatory
variables might be correlated among themselves. All these correlations can be found from
the correlation matrix of the variables, obtained by using the commands
Analyze – Correlate – Bivariate…
and including all four variables under the Variables list in the resulting dialogue box. This
generates the output shown in Display 4.1.
The output table provides Pearson correlations between each pair of variables and associated
significance tests. We find that car cleaning is positively correlated with extroversion (r =
0.67, p < 0.001) and being male (r = 0.661, p < 0.001). The positive relationship with age (r =
0.234) does not reach statistical significance ( p = 0.15). The correlations between the
explanatory variables imply that both older people and men are more extroverted (r = 0.397,
r = 0.403).
Since all the variables are correlated to some extent, it is difficult to give a clear answer to
whether, for example, extroversion is really related to car cleaning time, or whether the
observed correlation between the two variables arises from the relationship of extroversion
to both age and sex, combined with the relationships of each of the latter two variables
to car cleaning time. (A technical term for such an effect would be confounding.) Similarly the
observed relationship between car cleaning time and gender could be partly attributable to
extroversion.
In trying to disentangle the relationships involved in a set of variables, it is often helpful to
calculate partial correlation coefficients. Such coefficients measure the strength of the linear
relationship between two continuous variables that cannot be attributed to one or more
confounding variables (for more details, see Rawlings, Pantula, and Dickey, 1998). For
example, the partial correlation between car cleaning time and extroversion rating “partialling
out” or “controlling for” the effects of age and gender measures the strength of relationship
between car cleaning times and extroversion that cannot be attributed to relationships with
the other explanatory variables. We can generate this correlation coefficient in SPSS by
choosing
Analyze – Correlate – Partial…
from the menu and filling in the resulting dialogue box as shown in Display 4.2. The
resulting output shows the partial correlation coefficient together with a significance test
(Display 4.3). The estimated partial correlation between car cleaning and extroversion, 0.51,
is smaller than the previous unadjusted correlation coefficient, 0.67, due to part of the
relationship being attributed to gender and/or age. We leave it as an exercise to the reader to
generate the reduced partial correlation, 0.584, between car cleaning time and gender after
controlling for extroversion and age.
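The partial correlation dialogue corresponds to the PARTIAL CORR command. A sketch using the variable names in the Data View:

PARTIAL CORR
  /VARIABLES=car extrover BY age sex
  /SIGNIFICANCE=TWOTAIL.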
Thus far, we have quantified the strength of relationships between our response variable, car
cleaning time, and each explanatory variable after adjusting for the effects of the other
explanatory variables. We now proceed to use the multiple linear regression approach with
dependent variable, car, and explanatory variables, extrover, sex, and age, to quantify
the nature of relationships between the response and explanatory variables after adjusting for
the effects of other variables. (This is a convenient point to note that categorical explanatory
variables, such as gender, can be used in multiple linear regression modeling as long as they are
represented by dummy variables. To “dummy-code” a categorical variable with k categories, k-
1 binary dummy variables are created. Each of the dummy variables relates to a single
category of the original variable and takes the value “1” when the subject falls into the
category and “0” otherwise. The category that is ignored in the dummy-coding represents
the reference category. Here sex is the dummy variable for category “male,” hence category
“female” represents the reference category.) A multiple regression model can be set up in
SPSS by using the commands
Analyze – Regression – Linear…
This results in the Linear Regression dialogue box shown in Display 4.4:
We specify the dependent variable and the set of explanatory variables under the
headings Dependent and Independent(s), respectively.
The regression output is controlled via the Statistics… button. By default, SPSS only
prints estimates of regression coefficients and some model fit tables. Here we also
ask for confidence intervals to be included in the output (Display 4.4).
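A sketch of the corresponding syntax is given below; the /STATISTICS keywords request the coefficients, their confidence intervals, R-square, and the ANOVA table (exact keywords can vary slightly between SPSS versions):

REGRESSION
  /STATISTICS COEFF CI R ANOVA
  /DEPENDENT car
  /METHOD=ENTER extrover sex age.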
The resulting SPSS output tables are shown in Display 4.5 and Display 4.6. The model fit
output consists of a “Model Summary” table and an “ANOVA” table (Display 4.5). The
former includes the multiple correlation coefficient, R, its square, R2, and an adjusted
version of this coefficient as summary measures of model fit (see Box 4.1). The multiple
correlation coefficient R = 0.799 indicates that there is a strong correlation between the
observed car cleaning times and those predicted by the regression model. In terms of
variability in observed car cleaning times accounted for by our fitted model, this amounts to
a proportion of R2 = 0.634, or 63.4%. Since by definition R2 will increase when further terms
are added to the model even if these do not explain variability in the population, the adjusted
R2 is an attempt at improved estimation of R2 in the population. The index is adjusted down
to compensate for chance increases in R2, with bigger adjustments for larger sets of
explanatory variables (see Der and Everitt, 2001). Use of this adjusted measure leads to a
revised estimate that 60.8% of the variability in car cleaning times in the population can be
explained by the three explanatory variables.
The error terms in multiple regression measure the difference between an individual’s car
cleaning time and the mean car cleaning time of subjects of the same age, sex, and
extroversion rating in the underlying population. According to the regression model, the
mean deviation is zero (positive and negative deviations cancel each other out). But the more
variable the error, the larger the absolute differences between observed cleaning times and
those expected. The “Model Summary” table provides an estimate of the standard deviation
of the error term (under “Std. Error of the Estimate”).
Here the standard deviation of the error term is estimated as 13.02 min, which is small considering
that the observed car cleaning times range from 7 to 97 min per week.
Time Spent (in minutes per week) = 11.306 + 0.464 (extroversion) + 0.156 (age) +
20.071 (gender)
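As a purely hypothetical illustration of how this fitted equation is used, consider a 30-year-old male (gender = 1) with an extroversion score of 50:

Predicted time = 11.306 + 0.464 × 50 + 0.156 × 30 + 20.071 × 1 ≈ 59.3 minutes per week.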
Finally, the “ANOVA” table provides an F-test for the null hypothesis that none of the
explanatory variables are related to car cleaning time, or in other words, that R2 is zero (see
Box 4.1). Here we can clearly reject this null hypothesis (F (3,36) = 21.1, p < 0.001), and so
conclude that at least one of age, sex, and extroversion is related to car cleaning time.
The output shown in Display 4.6 provides estimates of the regression coefficients, standard
errors of the estimates, t-tests that a coefficient takes the value zero, and confidence intervals
(see Box 4.1). The estimated regression coefficients are given under the heading
“Unstandardized Coefficients B”; these give, for each of the explanatory variables, the
predicted change in the dependent variable when the explanatory variable is increased by one
unit conditional on all the other variables in the model remaining constant. For example,
here we estimate that the weekly car cleaning time is increased by 0.464 min for every
additional score on the extroversion scale (or by 4.64 min per week for an increase of 10
units on the extroversion scale) provided that the individuals are of the same age and sex.
Similarly, the estimated effect for a ten-year increase in age is 1.56 min per week.
Similarly, the coefficient for gender (20.071) indicates that men are estimated to spend about
20 min per week longer on car cleaning than women, after adjusting for age and extroversion
rating. The regression coefficient estimate of extroversion has a
standard error (heading “Std. Error”) of 0.13 min per week and a 95% confidence interval
for the coefficient is given by [0.200, 0.728], or in other words, the increase in car cleaning
time per increase of ten in extroversion rating is estimated to be in the range 2.00 to 7.28
min per week. (Those interested in p-values can use the associated t-test to test the null
hypothesis that extroversion has no effect on car cleaning times.)
Finally, the Coefficients table provides standardized regression coefficients under the heading
“Standardized Coefficients Beta”. These coefficients are standardized so that they measure
the change in the dependent variable in units of its standard deviation when the explanatory
variable increases by one standard deviation. The standardization enables the comparison of
effects across explanatory variables (more details can be found in Everitt, 2001b). For
example, here increasing extroversion by one standard deviation (SD = 19.7) is estimated to
increase car cleaning time by 0.439 standard deviations (SD = 20.8 min per week). The set of
beta-coefficients suggests that, after adjusting for the effects of other explanatory variables,
gender has the strongest effect on car cleaning behavior. (Note that checking Descriptives
and Part and partial correlations in the Statistics sub-dialogue box in Display 4.4 provides
summary statistics of the variables involved in the multiple regression model, including the
Pearson correlation and partial correlation coefficients shown in Displays 4.1 and 4.3.)
For the car cleaning data, where there are only three explanatory variables, using the ratio of
an estimated regression coefficient to its standard error in order to identify those variables
that are predictive of the response and those that are not, is a reasonable approach to
developing a possible simpler model for the data (that is, a model that contains fewer
explanatory variables). But, in general, where a larger number of explanatory variables are
involved, this approach will not be satisfactory. The reason is that the regression coefficients
and their associated standard errors are estimated conditional on the other explanatory
variables in the current model. Consequently, if a variable is removed from the model, the
regression coefficients of the remaining variables (and their standard errors) will change
when estimated from the data excluding this variable. As a result of this complication, other
procedures have been developed for selecting a subset of explanatory variables, most
associated with the response. The most commonly used of these methods are:
Forward selection. This method starts with a model containing none of the explanatory
variables. In the first step, the procedure considers variables one by one for inclusion and
selects the variable that results in the largest increase in R2. In the second step, the
procedure considers variables for inclusion in a model that only contains the variable
selected in the first step. In each step, the variable with the largest increase in R2 is selected
until, according to an F-test, further additions are judged to not improve the model.
Backward selection. This method starts with a model containing all the variables and
eliminates variables one by one, at each step choosing the variable for exclusion as that
leading to the smallest decrease in R2. Again, the procedure is repeated until, according
to an F-test, further exclusions would represent a deterioration of the model.
Stepwise selection. This method is, essentially, a combination of the previous two
approaches. Starting with no variables in the model, variables are added as with the forward
selection method. In addition, after each inclusion step, a backward elimination process is
carried out to remove variables that are no longer judged to improve the model.
Automatic variable selection procedures are exploratory tools and the results from a multiple
regression model selected by a stepwise procedure should be interpreted with caution.
Different automatic variable selection procedures can lead to different variable subsets since
the importance of variables is evaluated relative to the variables included in the model in the
previous step of the procedure. A further criticism relates to the fact that a number of tests
are employed during the course of the automatic procedure, increasing the chance of false
positive findings in the final model. Certainly none of the automatic procedures for selecting
subsets of variables are foolproof; they must be used with care and warnings such as the
following given in Agresti (1996) should be noted:
Computerized variable selection procedures should be used with caution. When one
considers a large number of terms for potential inclusion in a model, one or two of
them that are not really important may look impressive simply due to chance.
For instance, when all the true effects are weak, the largest sample effect may
substantially overestimate its true effect. In addition, it often makes sense to include
certain variables of special interest in a model and report their estimated effects even
if they are not statistically significant at some level.
In addition, the comments given in McKay and Campbell (1982a, b) concerning the validity
of the F-tests used to judge whether variables should be included in or eliminated from a
model need to be considered. Here, primarily for illustrative purposes, we carry out an
automatic forward selection procedure to identify the most important predictors of car
cleaning times out of age, sex, and extroversion, although previous results give, in this case, a
very good idea of what we will find. An automatic forward variable selection procedure is
requested from SPSS by setting the Method option in the Linear Regression dialogue box to
Forward (see Display 4.4).
In the first step extroversion is selected, and the variance it explains is highly significant
(F (1,38) = 31, p < 0.001). Adding gender to the model increases the percentage of variance
explained by 18.3% (F (1,37) = 18.4, p < 0.001).
YOU have to check that backward and stepwise variable selection lead to the same subset
of variables for this data example. But remember, this may not always be the case.
For stepwise procedures, the “Coefficients” table shows the regression coefficients
estimated for the model at each step. Here we note that the unadjusted effect of extroversion
on car cleaning time was estimated to be an increase in car cleaning time of 7.08 min per
week per 10 point increase on the extroversion scale. When adjusting for gender (model 2),
this effect reduces to 5.09 min per week per 10 points (95% CI from 2.76 to 7.43 min per
week per 10 points). SPSS also provides information about the variables not included in the
regression model at each step. The “Excluded Variables” table provides standardized
regression coefficients (under “Beta in”) and t-tests for significance. For example, under
Model 1, we see that gender, which had not been included in the model at this stage, might
be an important variable since its standardized effect after adjusting for extroversion is of
moderate size (0.467); there also remains a moderate-sized partial correlation between gender
and car cleaning after controlling for extroversion (0.576).
Multicollinearity
Approximate linear relationships between the explanatory variables, called
multicollinearity, can cause a number of problems in multiple regression,
including:
It severely limits the size of the multiple correlation coefficient because the
explanatory variables are primarily attempting to explain much of the same variability
in the response variable (see Dizney and Gromen, 1967, for an example).
It makes determining the importance of a given explanatory variable difficult because
the effects of explanatory variables are confounded due to their intercorrelations.
It increases the variances of the regression coefficients, making use of the estimated
model for prediction less stable. The parameter estimates become unreliable (for
more details, see Belsley, Kuh, and Welsh, 1980).
Spotting multicollinearity among a set of explanatory variables might not be easy. The
obvious course of action is to simply examine the correlations between these variables, but
while this is a good initial step that is often helpful, more subtle forms of multicollinearity
involving more than two variables might exist. A useful approach is the examination of
the variance inflation factors (VIFs) or the tolerances of the explanatory variables. The tolerance of
an explanatory variable is defined as the proportion of variance of the variable in question
not explained by a regression on the remaining explanatory variables, with smaller values
indicating stronger relationships. The VIF of an explanatory variable measures the inflation
of the variance of the variable’s regression coefficient relative to a regression in which all the
explanatory variables are independent.
The VIFs are inversely related to the tolerances, with larger values indicating involvement in
more severe relationships (as a rule of thumb, VIFs above 10 or tolerances below
0.1 are seen as a cause for concern).
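In symbols, if R_j² is the squared multiple correlation obtained when the j-th explanatory variable is regressed on all the other explanatory variables, then
  Tolerance_j = 1 − R_j²   and   VIF_j = 1 / Tolerance_j = 1 / (1 − R_j²),
so a tolerance of 0.1 corresponds exactly to a VIF of 10.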
Since we asked for Collinearity diagnostics in the Statistics sub-dialogue box, the
“Coefficients” table and the “Excluded Variables” table in Display 4.8 include columns
labeled “Collinearity Statistics.” In the “Coefficients” table, the multicollinearities involving
the explanatory variables of the respective model are assessed. For example, the model
selected in the second step of the procedure included extroversion and gender as explanatory
variables. So a multicollinearity involving these two variables (or more simply, their
correlation) has been assessed. In the “Excluded Variables” table, multicollinearities
involving the excluded variable and those included in the model are assessed. For example,
under “Model 2,” multicollinearities involving age (which was excluded) and extroversion
and gender (which were included) are measured. Here none of the VIFs give reason for
concern. (SPSS provides several other collinearity diagnostics, but we shall
not discuss these because they are less useful in practice than the VIFs.)
We usually also calculate the average VIF; this should not be substantially greater
than 1. An average VIF greater than 10 is definitely a cause for concern.
It might be helpful to visualize our regression of car cleaning times on gender and
extroversion rating by constructing a suitable graphical display of the fitted model. Here,
with only one continuous explanatory variable and one categorical explanatory variable, this
is relatively simple since a scatterplot of the predicted values against extroversion rating can
be used. First, the predicted (or fitted) values for the subjects in our sample need to be saved
via the Save… button on the Linear Regression dialogue box (see Display 4.4). This opens
the Save sub-dialogue box shown in Display 4.9 where Unstandardized Predicted Values can
be requested. Executing the command adds a new variable, pre_1, at the right-hand side
of the Data View spreadsheet.
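If you prefer syntax, the predicted values can be saved with the /SAVE subcommand. This is a sketch; clean is again an assumed name for the car cleaning time variable.

  REGRESSION
    /DEPENDENT clean
    * clean is an assumed variable name for this illustration.
    /METHOD=ENTER extrover sex
    /SAVE PRED.

SPSS stores the unstandardized predicted values in a new variable named pre_1 (then pre_2, pre_3, … on subsequent runs).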
This variable can then be plotted against the extroversion variable using the following
instructions (an equivalent syntax sketch is given after the list):
_ The predicted value variable, pre_1, is declared as the Y Axis and the extroversion variable,
extrover as the X Axis in the Simple Scatterplot dialogue box.
_ The gender variable, sex, is included under the Set Markers by list to enable later
identification of gender groups.
_ The resulting graph is then opened in the Chart Editor and the commands Format –
Interpolation… – Straight are used to connect the points.
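The same scatterplot can be requested directly with syntax, using the variable names given above:

  GRAPH
    /SCATTERPLOT(BIVAR)=extrover WITH pre_1 BY sex.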
The final graph shown in Display 4.10 immediately conveys that the amount of time spent
car cleaning is predicted to increase with extroversion rating, with the strength of the effect
given by the common slope of the two parallel lines (5.1 min per week per 10 points on the
extroversion scale). It also shows that males are estimated to spend more time cleaning their
cars, with the increase in time given by the vertical distance between the two parallel lines
(19.18 min per week).
Autocorrelation
Autocorrelation exists when adjacent residuals are correlated, i.e. the residuals are not
independent. Autocorrelation affects the validity of the model. It is assessed using the
Durbin-Watson statistic: values between 1 and 3 indicate that autocorrelation is not a serious
problem for the model, and the closer the value is to 2, the better.
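The Durbin-Watson statistic can be requested from the Statistics sub-dialogue box, or with the /RESIDUALS subcommand. A sketch, again assuming the dependent variable is named clean:

  REGRESSION
    /DEPENDENT clean
    * clean is an assumed variable name for this illustration.
    /METHOD=ENTER extrover sex
    /RESIDUALS DURBIN.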
Homoscedasticity
At each level of predictor variables, the variance of the residual terms should be constant.
This just means that the residual at each level of predictor(s) should have the same variance
(homoscedasticity); when the variances are very unequal there is said to be
heteroscedasticity. In order to verify this assumption, we plot *ZRESID against
*ZPRED. If the assumption of homoscedasticity is met then this plot should be a random
array of dots evenly dispersed around zero.
If this graph funnels out, the chances are that there is heteroscedasticity in the data.
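The plot of *ZRESID against *ZPRED is requested from the Plots sub-dialogue box; the corresponding syntax subcommand is sketched below (same assumed variable names as before):

  REGRESSION
    /DEPENDENT clean
    * clean is an assumed variable name for this illustration.
    /METHOD=ENTER extrover sex
    /SCATTERPLOT=(*ZRESID ,*ZPRED).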
How to report?
(a) Write the regression equation
Interpretation of standardized coefficients:
A survey was conducted among 100 singers who have sold CDs during the year 2007. Data
was collected for the following 4 variables:
1. Advertising Budget (thousands of rupees) for the CD
2. No of CDs Sold (thousands)
3. No. of times Songs are played on Radio 1 during the week prior to its release
4. Attractiveness of the Singer on a scale 1 to 10
20. Practice session 12 – Cross Tab and Chi-
Square test for Independence
Very often, we are not interested in test scores, or continuous measures, but in categorical
variables. Categorical variables are also called grouping variables. Categorical variables lead to
categorical data, which are:
_ generally non-numerical data;
_ data placed in exclusive categories;
_ cases counted rather than measured.
E.g. People can be classified in categories according to their occupation
E.g. Cars can be classified in categories according to their make or colour
For example, suppose we want to group the lifespan variable (used in the diet study below) into three categories. We will recode:
_ Lifespan of up to 500 into a category that we will call short (we will give this category a code 0);
_ Lifespan of 501 to 999 into a category that we will call medium (we will give this category a code 1);
_ Lifespan of 1000 and above into a category that we will call long (we will give this category a code 2).
Step 3.: Click on Old and New Values.
Step 4.: Click on Range and type in 500 as shown, and 0 in Value.
Step 5.: Click on Add.
Step 6.: Click on Range and type in 501 through 999 as shown, and 1 in Value.
Step 7.: Click on Add
Step 8.: Click on Range and type in 1000 through highest as shown and 2 in Value
Step 13.: Go to Data View to note the column lifgroup that has been added.
The lifespan categories have now been created. This procedure can be used to create
categories for variables like age, salary, etc.
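The recoding done through the menus above can also be written as syntax. The sketch below assumes the original variable is named lifespan (an assumed name); lifgroup is the new variable created above.

  RECODE lifespan (LOWEST THRU 500=0) (501 THRU 999=1) (1000 THRU HIGHEST=2) INTO lifgroup.
  * lifespan is an assumed name for the original variable.
  VALUE LABELS lifgroup 0 'Short' 1 'Medium' 2 'Long'.
  EXECUTE.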
Contingency tables and chi-squared test of independence
Contingency tables are one of the most common ways to summarize observations on two
categorical variables. For all such tables, interest generally lies in assessing whether or not
there is any relationship or association between the row variable and the column variable that
make up the table. Most commonly, a chi-squared test of independence is used to answer this
question, although alternatives such as Fisher’s exact test or McNemar’s test may be needed
when the sample size is small (Fisher’s test) or the data consist of matched samples
(McNemar’s test). In addition, in 2 X 2 tables, it may be required to calculate a confidence
interval for the ratio of population proportions. For a series of 2 X 2 tables, the Mantel-
Haenszel test may be appropriate (see later). (Brief accounts of each of the tests mentioned are
given below.)
Note. For the Chi-Square test to be meaningful:
1. The categories must be mutually exclusive, so that each case or person contributes to
one cell/category only.
2. The expected frequencies (or expected counts) should be greater than 5 (see the formula below).
In larger contingency tables it is acceptable to have up to 20% of expected frequencies
below 5, but even then no expected frequency should be less than 1.
If this condition is not satisfied, we usually merge adjacent cells; this, of course,
decreases the number of categories.
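The expected count for a cell is calculated as
  Expected count = (row total x column total) ÷ grand total.
For example, for the Restricted diet – Short cell in Table 20.1 below, (106 x 18) ÷ 195 = 9.8. The chi-squared statistic is then
  χ² = Σ (Observed − Expected)² ÷ Expected,
which is compared with a chi-squared distribution on (rows − 1) x (columns − 1) degrees of freedom.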
Step 2.: Transfer diet into rows and lifgroup into columns.
Step 3.: Click Statistics. Select Chi-square. Select Phi and Cramer’s V.
Step 5.: Click Cells. Select Observed and Expected. Select Row, Column and Total.
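The equivalent syntax for the crosstabulation and chi-square test is sketched below (diet and lifgroup are the variable names used above):

  CROSSTABS
    /TABLES=diet BY lifgroup
    /STATISTICS=CHISQ PHI
    /CELLS=COUNT EXPECTED ROW COLUMN TOTAL.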
Table 20.1 Diet * Categories of lifespan Crosstabulation
                                               Categories of lifespan
                                             Short    Medium      Long     Total
Diet   Restricted    Count                      11        34        61       106
       diet          Expected Count            9.8      63.1      33.2     106.0
                     % within Diet           10.4%     32.1%     57.5%    100.0%
                     % within Categories
                     of lifespan             61.1%     29.3%    100.0%     54.4%
                     % of Total               5.6%     17.4%     31.3%     54.4%
       Ad libitum    Count                       7        82         0        89
       diet          Expected Count            8.2      52.9      27.8      89.0
                     % within Diet            7.9%     92.1%       .0%    100.0%
                     % within Categories
                     of lifespan             38.9%     70.7%       .0%     45.6%
                     % of Total               3.6%     42.1%       .0%     45.6%
Total                Count                      18       116        61       195
                     Expected Count           18.0     116.0      61.0     195.0
                     % within Diet            9.2%     59.5%     31.3%    100.0%
                     % within Categories
                     of lifespan            100.0%    100.0%    100.0%    100.0%
                     % of Total               9.2%     59.5%     31.3%    100.0%
Chi-Square Tests
                                   Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square              80.884(a)      2         .000
Likelihood Ratio               104.448         2         .000
Linear-by-Linear Association    40.893         1         .000
N of Valid Cases                   195
a  0 cells (.0%) have expected count less than 5. The minimum expected count is 8.22.
Table 20.3 Symmetric Measures
How to report?
(a) Check that Table 20.1 satisfies the criterion of having no more than 20% of expected
frequencies below 5. Here all the cells have an expected frequency (Expected
Count) greater than 5.
(b) Write one statement for each cell in Table 20.1. Note that we have 3 X 2 = 6 cells in all.
There are four possible statements that we can formulate for each cell. Although all the
statements are mathematically correct, not all of them are logically sensible, so you are
advised to choose the statement that best describes the cell.
We will consider the cell Restricted diet – Short:
                                               Categories of lifespan
                                                     Short     Total
Diet   Restricted    Count                              11       106
       diet          Expected Count                    9.8
                     % within Diet                   10.4%
                     % within Categories
                     of lifespan                     61.1%
                     % of Total                       5.6%
Total                Count                              18       195
(i) Count: 11 is the number of rats that were on a restricted diet and had a short life.
(ii) % within Diet: (This is also called row %.) It is calculated by dividing Count
by the row total and then multiplying by 100: (11÷106) x 100 = 10.4%. This is
interpreted as: 10.4% of those on a restricted diet (as this row is the row of the
restricted diet) had a short life.
(iii) % within Categories of lifespan: (This is also called column %.) It is calculated by
dividing Count by the column total and then multiplying by 100: (11÷18) x 100 = 61.1%.
This is interpreted as: 61.1% of the rats with a short life were on a restricted diet.
(iv) % of Total: (This is also called Total %.) It is calculated by dividing Count by the
grand total and then multiplying by 100: (11÷195) x 100 = 5.6%. This is interpreted
as: 5.6% of the total number of rats included in the study were on a restricted diet
and had a short life.
In our case (from Table 20.3) Cramer’s V = 0.644 out of a possible maximum value of 1.
This represents quite a strong association between diet and lifespan.
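For reference, Cramer's V is calculated as V = √(χ² ÷ (N x (k − 1))), where k is the smaller of the number of rows and columns. Here V = √(80.884 ÷ (195 x 1)) = 0.644, in agreement with the value reported by SPSS.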
H0: There is no association between Socio Economic Status of Student and Overall
Grade in previous semester.
H1: There is an association between Socio Economic Status of Student and Overall
Grade in previous semester.
22. Practice session 14 – One-Way ANOVA
and Kruskal Wallis test
If two groups of participants perform a task under different conditions, an independent
samples t test can be used to test the null hypothesis (H0) of equality of the two population
means:
H0:μ1 = μ2
If the test shows significance, we can reject H0 and conclude that there is a difference
between the two population means. The t test is applied when we have only two means to
compare.
The same null hypothesis, however, can also be tested by using the analysis of variance
(ANOVA for short). Like the t test, the ANOVA was designed for the purpose of
comparing means but is more versatile than the t test.
Table 22.1 Performance scores of persons who were given drugs A, B, C, and D, and a Placebo.
It is apparent from Table 22.1 that there are considerable differences among the five samples.
The question is: could the null hypothesis actually be true, with the differences we see in the
table having come about merely through sampling error?
A between subjects factor, in which different participants are tested at each level, differs
from a within subjects factor, in which the same participant is tested at all levels (i.e. under
all the conditions making up the factor). In ANOVA designs, an experiment with a within
subjects factor is also said to have repeated measures on that factor: the measure or DV is
taken at all levels. Our drug experiment is a one-factor between subjects experiment, so the
one-way ANOVA is applicable here.
Running ANOVA.
Step 1: Entering the data
As with the independent samples t test, you will need to define two variables:
1. a grouping variable with a simple name such as Group, which identifies the
condition under which a score was achieved. (The grouping variable should also be
given a more meaningful variable label such as Drug Condition, which will appear in
the output.)
2. a variable with a name such as Score, which contains all the scores in the data set.
This is the dependent variable.
The grouping variable will consist of five values (one for the placebo condition and one for
each of the four drugs). We shall arbitrarily assign numerical values thus: 0 = Placebo; 1 =
Drug A; 2 = Drug B; 3 = Drug C; 4 = Drug D.
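The coding scheme and labels can be attached to the grouping variable with a couple of lines of syntax (a sketch using the variable names suggested above):

  VARIABLE LABELS Group 'Drug Condition'.
  VALUE LABELS Group 0 'Placebo' 1 'Drug A' 2 'Drug B' 3 'Drug C' 4 'Drug D'.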
Step 2. Verifying Conditions
1. For each subpopulation (for each drug and placebo), the dependent variable
must follow a Normal distribution.
Carry out a normality test to verify that this condition is satisfied.
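The Tests of Normality table shown below can be obtained through Analyze – Descriptive Statistics – Explore (with Normality plots with tests selected under Plots), or with the following syntax sketch (variable names as above):

  EXAMINE VARIABLES=Score BY Group
    /PLOT NPPLOT.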
Tests of Normality
Kolmogorov-Smirnov(a) Shapiro-Wilk
Drug Statistic df Sig. Statistic df Sig.
Score Placebo .142 9 .200(*) .978 9 .951
Drug A .236 9 .159 .932 9 .502
Drug B .220 9 .200(*) .904 9 .277
Drug C .305 9 .015 .852 9 .078
Drug D .185 9 .200(*) .958 9 .782
* This is a lower bound of the true significance.
a Lilliefors Significance Correction
As all the Shapiro-Wilk Sig. values are > 0.05, the condition of normality is satisfied. (The
Kolmogorov-Smirnov Sig. for Drug C is 0.015 < 0.05, but for samples as small as these the
Shapiro-Wilk test is the more appropriate one.)
2. The second condition: the subpopulations must have the same variance. Levene's
test will be used; it will be obtained together with the ANOVA test.
3. The subpopulations must be independent of each other.
Step 4.
Note: Contrasts are used to investigate whether a particular mean, a set of means, or a linear
combination of means differs from other means. An example is sketched below.
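For example, a contrast comparing the Drug A mean with the Placebo mean can be requested with the sketch below. The contrast coefficients are applied to the groups in ascending order of their codes (here 0 = Placebo, 1 = Drug A, 2 = Drug B, 3 = Drug C, 4 = Drug D).

  ONEWAY Score BY Group
    /CONTRAST=-1 1 0 0 0
    /STATISTICS HOMOGENEITY.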
Therefore we conclude that the effect of Drug A on performance is not significantly different
from the effect of the Placebo.
Planned and unplanned comparisons
Before running a drug experiment in the current example, the experimenter may have some
very specific questions in mind. It might be expected, for example (perhaps on theoretical
grounds), that the mean score of every group who have ingested one of the drugs will be
greater than the mean score of the Placebo group. This expectation would be tested by
comparing each drug group with the Placebo group. Perhaps, on the other hand, the
experimenter has theoretical reasons to suspect that Drugs A and B should enhance
performance, but Drugs C and D should not. That hypothesis would be tested by comparing
the Placebo mean with the average score for groups A and B combined and with the average
score for groups C and D combined. These are examples of planned comparisons.
Often, however, the experimenter, perhaps because the field has been little explored, has
only a sketchy idea of how the results will turn out. There may be good reason to expect that
some of the drugs will enhance performance; but it may not be possible, a priori, to be more
specific. Unplanned, a posteriori or post hoc, comparisons are part of the ‘data-snooping’ that
inevitably follows the gathering of a data set.
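Post hoc comparisons can be requested from the Post Hoc button of the One-Way ANOVA dialogue box, or with syntax such as the following sketch (Tukey is used here purely as an illustration):

  ONEWAY Score BY Group
    /POSTHOC=TUKEY ALPHA(0.05).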
Step 7. Click on Options
Test of Homogeneity of Variances
Score
Levene Statistic     df1     df2     Sig.
           2.464       4      40     .061
To obtain the following table, run the ANOVA analysis without selecting any contrasts.
ANOVA
Score
                   Sum of Squares     df    Mean Square        F     Sig.
Between Groups            337.422      4         84.356    7.888     .000
Within Groups             427.778     40         10.694
Total                     765.200     44
H0: μ1 = μ2 = μ3 = μ4 = μ5
Test: ANOVA
F = 7.888, p = 0.000 < 0.05, so we reject H0.
How to report?
(a)
H0: μ1 = μ2 = μ3 = μ4 = μ5
H1: H0 is false.
Test: ANOVA
F = 7.888, p = 0.000 < 0.05, so we reject H0.
(b) Tests of Normality
Kolmogorov-Smirnov(a) Shapiro-Wilk
Drug Statistic df Sig. Statistic df Sig.
Score Placebo .142 9 .200(*) .978 9 .951
Drug A .236 9 .159 .932 9 .502
Drug B .220 9 .200(*) .904 9 .277
Drug C .305 9 .015 .852 9 .078
Drug D .185 9 .200(*) .958 9 .782
* This is a lower bound of the true significance.
a Lilliefors Significance Correction
(c)
F = 2.464, p = 0.061 > 0.05, so we accept that the variances are homogeneous.
The non-significance of the Levene F statistic for the test of equality of error variances
(homogeneity of variances) indicates that the assumption of homogeneity of variance is
tenable; however, considerable differences among the variances are apparent from
inspection.
(e) contrasts