Spss
_______________________________________________________________________________
SPSS – Lecture and Activities
Exercise 1
We will start using SPSS. The activities are becoming less labor-intensive, as you will get the mathematics for
free. Still, always keep in mind that software can help you only if you understand what it does and
in which cases you can use a particular function.
SPSS stands for Statistical Package for the Social Sciences, and is the most frequently used software among
psychologists, sociologists and linguists (and probably in many other fields) to perform statistical computations.
You will write a report, but you need to hand it in only if you are taking the course for credit. We will
discuss how to write up statistical results, and it may be useful to exchange reports with other students in
order to get some feedback.
In general, reports should be as short as possible; that is, copy and paste only the SPSS results that are
necessary. Explain the results in one sentence, especially if you needed to do more than just copy and paste (e.g.,
find the lowest value or calculate the difference of two values). Do not add any further information. Reporting
irrelevant information can cost you points, as filtering the relevant information is one of your tasks.
Tasks that you simply have to do (before you get to the questions) appear below with a > starting the
line in bold letters. Concerning these, you need not report anything, simply perform these tasks. The questions
to be answered in the report are given below with a * starting the line, and in bold letters.
Answer the questions briefly but precisely, starting with the number of the question. For instance:
…
3. 20 measurements.
4. Word length 3.
…
Try to finish all exercises during the hands-on. Should this fail, you can go on working on the exercises in your
own time.
Aims of Hands-on 1
In case SPSS has not been installed on your machine yet, you get a window saying that you have to restart your
computer. Do that, otherwise SPSS may have problems running.
> Once SPSS is running, you are offered a menu with choices. Click on “cancel”.
Now you are in the Data Editor, the window of SPSS in which you can enter data and work on them. It resembles
the spreadsheets you may know from other applications. At the top you find the name of the data file you
are working with, which at this moment is still Untitled1 [DataSet0].
In the Data Editor, each (vertical) column of numbers represents a variable. Each variable is given a name,
which appears on the top of the column. Use meaningful names, such as LENGTH, and not something like
X24A06.
Each (horizontal) row represents a case. A case is a series of observations belonging together, such as the
answers of a respondent to the questions in a questionnaire, or different values measured on the same subject of
the experiment. For instance, if you have 32 respondents, then you need 32 rows for the 32 cases. If the
questionnaire contained 40 questions, then you most probably need 40 columns, and so you have 40
variables. (Next week, we learn how to calculate new, derivative variables from existing ones.)
The Data Editor is composed of two parts: the Data View and the Variable View. You can switch between them by
clicking on the tabs at the bottom left of the window.
The Variable View offers an overview of your variables, and you can also define some features of these
variables. The most important features are:
On the top of the window you find the menu of SPSS: FILE, EDIT, VIEW, DATA, etc. All statistical
calculations are found under ANALYZE, and all diagrams and charts under GRAPHS. To calculate new
variables based on the existing ones, use the commands under TRANSFORM. The HELP menu provides
further assistance, although you may find it rather terse in the beginning.
> Have a look at the different menus to get a general overview of them.
The MLU (Mean Length of Utterance) measures the length of an utterance (a well-formed sentence or a
sentence-like series of words) by counting the number of words it contains. It is an important measure of
linguistic capabilities of children acquiring a language, of patients with impaired language, but it is also useful
in identifying authors of texts.
3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9.
> Enter these values by hand and give the variable the name MLU.
> In the Variable View, set the number of decimals to 0 (as utterance length always has an integer value).
When you work with SPSS (as with any other application), it is good practice to regularly save your data files.
Output files are often simpler to create again, but data files are certainly not. Moreover, SPSS 14 is not always
stable, and the program may terminate unexpectedly. Finally, we may want to use some of the data files
during several labs.
> Therefore, save your data file to your own network drive (X:\) in a separate folder that you create
specifically for this hands-on.
A frequency table is a table that shows how often each value of a variable appears among your data.
During data entry, one quite often makes errors. Hence, it is imperative always to check the data you
have just entered. Besides rereading the numbers in Data View, you should also look for outliers “created” by
erroneous data entry: for instance, typing too many zeros or entering two values in a single cell will create
values much greater than the other values. In the present case, check that the frequency table contains only values you
remember having entered (and that make sense). Also compare your frequency table to that of your
neighbour.
> Check the frequency of each value in your frequency table together with your neighbour.
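As a cross-check outside SPSS, the same kind of frequency table can be sketched in a few lines of Python. This is only an illustration and not part of the SPSS procedure; the function name frequency_table is our own choice, and the values are the 20 utterance lengths listed above.

```python
from collections import Counter

# The 20 utterance lengths listed above (the variable MLU).
mlu = [3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9]


def frequency_table(values):
    """Return (value, count, percent) rows, sorted by value."""
    counts = Counter(values)
    n = len(values)
    return [(v, c, 100.0 * c / n) for v, c in sorted(counts.items())]


print("value count percent")
for value, count, percent in frequency_table(mlu):
    print(f"{value:>5} {count:>5} {percent:>7.1f}")
```

A value that you do not remember having entered (for instance 80 instead of 8) would show up as an extra row at the end of such a table.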
C. Creating a histogram
A histogram (or frequency diagram) is a graph displaying how frequently each of the possible values of a
variable occurs (or how frequently values falling within a certain range occur) among the data that have been
entered.
D. Creating a boxplot
A boxplot is another visualization of a distribution and it proves useful for other purposes later on.
We often would like to summarize a variable as a single number that tells us roughly where the values of that
variable are located. Generally the mean (average) is used for that purpose. Another option is
the mode (sometimes called the modus), that is, the value that appears most frequently. One can also use the median, the middle value when the
observations are sorted from lowest to highest.
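The three measures can also be computed outside SPSS, which is a convenient way to check your understanding of the definitions. The sketch below uses only Python's standard statistics module and the 20 MLU values entered earlier; it is an illustration, not a replacement for the SPSS table.

```python
import statistics

# The 20 utterance lengths entered earlier (the variable MLU).
mlu = [3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9]

mean_mlu = statistics.mean(mlu)      # sum of the values divided by their number
median_mlu = statistics.median(mlu)  # middle value of the sorted observations
mode_mlu = statistics.mode(mlu)      # most frequent value

print("mean:", mean_mlu)
print("median:", median_mlu)
print("mode:", mode_mlu)
```

With an even number of observations, as here, the median is taken as the average of the two middle values.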
When a histogram is created, the mean is automatically calculated. The mode, the median and the mean can
also be obtained by choosing “Analyze”, “Descriptive Statistics”, and then “Frequencies” in the menu. If you
wish, uncheck the mark next to ‘Display Frequency Table’, and ignore the warning. Then choose the mean, the
mode and the median via the “Statistics” button.
> Have SPSS calculate the mean, the mode and the median, and report them to you in a single table.
* 16. Copy this table to your report.
* 17. Suppose you make an error during data entry: you type 80 instead of 8. Which of these values will
change, and which will not? (Why? What does M&M call this feature of a statistical measure?)
* 18. The median of MLU is lower than its mean. This is because the histogram is skewed to the … (left or
right?), and it has a longer tail to the … (left or right?).
In many cases we are not only interested in where more or less the values of the variable are located, but also in
the “width” of the frequency distribution. There are different measures of describing the “width” of the
histogram. The best known one is the standard deviation (SD), but the range and the interquartile range are also used.
The drawback of the range (the difference of the maximum and minimum values) is that it is fully dependent on
the two most extreme values being measured.
> Have SPSS calculate for you the SD, the range and the quartiles.
Hint: “Analyze”, “Descriptive Statistics”, “Frequencies”.
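The measures of spread can be sketched in Python as well. Note the assumptions: the SD below is the sample standard deviation (dividing by n - 1), and the quartiles use the default method of statistics.quantiles; other programs, SPSS included, may use slightly different quartile conventions, so small differences are possible.

```python
import statistics

# The 20 utterance lengths entered earlier (the variable MLU).
mlu = [3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9]

sd = statistics.stdev(mlu)            # sample SD (denominator n - 1)
data_range = max(mlu) - min(mlu)      # maximum minus minimum
q1, q2, q3 = statistics.quantiles(mlu, n=4)  # the three quartiles
iqr = q3 - q1                         # interquartile range

print(f"SD = {sd:.3f}, range = {data_range}")
print(f"quartiles = {q1}, {q2}, {q3}; IQR = {iqr}")
```

Unlike the range, the interquartile range does not change when only the most extreme value is altered.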
Exercise 2
Our goal is to learn more about basic functionalities of SPSS, as well as to practice z-inferences and t-inferences
(confidence intervals and one-sample t-test).
Aims of Hands-on 2
Hands-on 2
> Load (open) the data file used last week, which contained information on the variable MLU.
A. COMPUTING NEW VARIABLES USING "COMPUTE"
Remember that a variable is the output of one measurement (or experiment) on different subjects (called cases).
So "height" or "weight" or "gender" or "score obtained on some test" or "native tongue" or "reaction time" are
all variables. It is, however, often necessary to derive new variables from the existing ones, such as the sum
of the scores obtained on two different tests by each subject, the ratio of correct sentences to all
sentences for each subject, or the grade corresponding to a score. Recoding, to be introduced in the next section
of this lab, is also a kind of variable transformation.
Now we take an example that should help us also better understand the concept of standard deviation (SD). SD
is sometimes compared to the mean of the (absolute value of the) deviations. The latter can also be calculated
with SPSS. Yet, since it is not a standard measure, we have to go through the steps of the calculation ourselves.
First, we shall introduce a new variable based on MLU, which corresponds to the distance of each data point
from the mean (called the deviation of each data point). Then, the mean of this second variable can be simply
calculated using SPSS.
Check whether the sum (and hence the mean) of the deviations is really 0, as mentioned earlier in the course. To do
that, you need to change the variable being worked with in the "Analyze" - "Descriptive statistics" -
"Frequencies" window.
Afterward, compute another variable, called ABSDEV, which contains the absolute values of the
deviations (that is, without the negative signs).
> Use "COMPUTE" again to obtain the absolute deviations from the mean.
Hint: First, enter the name of the new column. Then choose the group "Arithmetic" within the "Function
group". Find "Abs" within the window "Functions And Special Variables". Finally, put the variable DEV
between the parentheses of 'Abs()'.
> Now, have SPSS calculate the mean of the new variable ABSDEV (similarly to the way done in the
previous hands-on).
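The two COMPUTE steps of this section can be mirrored in Python to see what SPSS does row by row. The names dev and absdev correspond to the variables DEV and ABSDEV; everything else is our own naming, and the sketch is only meant as a check on the reasoning.

```python
import statistics

# The 20 utterance lengths (the variable MLU).
mlu = [3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9]

mean_mlu = statistics.mean(mlu)

# DEV: distance of each data point from the mean (positive or negative).
dev = [x - mean_mlu for x in mlu]

# ABSDEV: the same deviations without their signs.
absdev = [abs(d) for d in dev]

sum_dev = sum(dev)                     # zero, up to floating-point rounding
mean_absdev = statistics.mean(absdev)  # mean absolute deviation
sd = statistics.stdev(mlu)             # sample SD, for comparison

print(f"sum of deviations: {sum_dev:.10f}")
print(f"mean absolute deviation: {mean_absdev:.3f}")
print(f"standard deviation: {sd:.3f}")
```

Because the SD squares the deviations before averaging, large deviations weigh more heavily in it than in the mean absolute deviation.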
B. RECODING A VARIABLE
A special type of variable transformation is called recoding, and it is used if the raw data have been collected
using a different value set from what we need for statistical purposes. One might wish to change the units of
measurement from inch to centimeter, or from fractions of seconds to milliseconds.
Another example is the recoding of nominal values to numbers: Even though it is good practice to use
meaningful coding systems (strings such as "m" and "f" for gender, or "eur", "ame", "afr", "asi" and "aus" for
continents of origin), some statistical packages (including SPSS) allow fewer manipulations and analyses for
data encoded thus. Therefore, we may prefer to recode "m" as "1" and "f" as "2", etc. – keeping always in mind
that the numerical values should not be seen as real numbers (no order between them, and no arithmetical
manipulations).
We are now interested in knowing how many long utterances there are in the text. We define an utterance as "long" if
it contains more than six words. In the present case, a sample of 20 utterances, you probably would not use
SPSS, but in the case of 1000 utterances the story becomes quite different... Therefore, we are going to
introduce a new variable LONG_MLU derived from MLU: LONG_MLU is 0 if the MLU is 6 or less, and 1
otherwise. The process of changing the values of a variable in this manner is called recoding, which is
especially useful in the case of questionnaires.
> Create a new variable LONG_MLU from the variable MLU that is 1 for original ("old") values greater
than 6, and 0 otherwise.
Hint: "Transform", "Recode". Always choose "Into Different Variables", otherwise you lose your original data,
and you won't be able to check your computations. Copy MLU to the window, and enter the name
LONG_MLU as Output Variable. Click on "Change" to have this name in the window. Afterward, use "Old and
New Values" to provide the original and the corresponding recoded values: enter an old and a corresponding
new value, click on "Add", and repeat this procedure for all values. If the formula is okay in the window, click
on "Continue", then on "OK".
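The recoding rule itself is a simple two-way decision, which the following Python sketch spells out. The function name recode_long and the threshold parameter are our own; the rule (1 if more than six words, 0 otherwise) is the one defined above. Like "Into Different Variables", the sketch leaves the original values untouched.

```python
# The 20 utterance lengths (the variable MLU).
mlu = [3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9]


def recode_long(value, threshold=6):
    """Return 1 for values above the threshold, 0 otherwise."""
    return 1 if value > threshold else 0


# New variable; the list mlu itself is not modified.
long_mlu = [recode_long(x) for x in mlu]

# Because the codes are 0 and 1, their sum counts the long utterances.
n_long = sum(long_mlu)
print("number of long utterances:", n_long)
```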
> For the next task, open a new data file, and close the old data file.
The subjects of an experiment read sentences on the screen of a computer, word by word. Each time the subject
has read the word he or she presses a key. The previous word disappears and the next one becomes visible. The
time elapsed between pressing the keys is the time needed by the subject to read the word.
The following values are the time in milliseconds needed to read 24 words (Source: Edith Kaan and Laurie
Stowe, Developing an Experiment, 1995. Techniques and Design, Klapper vakgroep Taalwetenschappen,
Rijksuniversiteit Groningen):
450 390 467 654 30 542 334 432 421 357 497 493 550 549 467 575 578 342 446 547 534 495 979 479.
> Place your mouse above the link and click on the right button. Choose 'Save Link As... '.
> Save this file in your own SPSS-hands-on folder (directory).
> Have a look at the structure of this file: What does it contain? How is it organized? For instance, are
values delimited by some special character, such as a space, or is each value on a new line? Does the
file contain information describing the content of the file (name of the variable(s), description, source of
the data, etc.)?
> Import this file to SPSS using "File", "Read Text Data". Find the text file you have just saved and open it.
You are now offered the Text Import Wizard of SPSS, which is going to help you open the file.
> Answer the questions of Text Import Wizard.
Hints: This text file does not have a Predefined Format. That is, the variables are not found in a specific column,
but the values are simply delimited by a space. The file does not contain any variable name. Each case consists
of a single observation (a single value). Therefore, you have to choose 'A specific number of variables
represents a case' and set it to 1.
If you wish, you can also define the name of the variable, but you can do that also later.
> Use the name RDT for the variable. Then, go to "Variable view" and use the field "label" to explain
what the abbreviation RDT stands for: "reading time per word". Observe that you will be shown the label
and not the variable name in different reports returned by SPSS.
If the data import is successful, you have a variable (column) with 24 numbers.
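What the Text Import Wizard does with such a file can be imitated in Python. The sketch assumes that the text file contains exactly the 24 values listed above, separated by spaces and without a line of variable names; the file itself is not reproduced here, so the text is given as a string, and all names are our own.

```python
# Assumed content of the text file: 24 space-delimited values, no header.
raw_text = (
    "450 390 467 654 30 542 334 432 421 357 497 493 "
    "550 549 467 575 578 342 446 547 534 495 979 479"
)

# Split on whitespace; every token is one case of the single variable RDT.
rdt = [int(token) for token in raw_text.split()]

print("number of cases:", len(rdt))
print("smallest and largest value:", min(rdt), max(rdt))
```

Checking the number of cases and the extreme values right after import is a quick way to catch delimiter problems.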
> Set in the Variable View the number of decimals for this variable to 0 (as the reading time has been
measured with the precision of 1 msec, so the values are always integer).
> Save these data as a usual data file, that is, in the native SPSS format .sav.
* 4. Create a histogram including a Normal curve, as well as a boxplot of RDT. Copy it to your report.
* 5. You can find two outliers among your data. Which are they, and what kind of explanation(s) could
you provide to explain them?
* 6. In case you decide to remove these cases from your data set, do you expect the mean or the standard
deviation to change more? Why?
> Remove these cases from your data file by selecting the corresponding rows (click on the gray case
number on the left), and then press the DELETE key.
> Calculate the mean and the SD again by creating a new histogram.
From now on we shall work with these data with the outliers removed.
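The effect of removing the two cases can also be inspected outside SPSS. The sketch below assumes that the two outliers are the smallest and the largest value (30 and 979); it then computes the mean, the sample SD, and the standard error as SD divided by the square root of n. The variable names are ours.

```python
import math
import statistics

# The 24 reading times (msec) listed above.
rdt = [450, 390, 467, 654, 30, 542, 334, 432, 421, 357, 497, 493,
       550, 549, 467, 575, 578, 342, 446, 547, 534, 495, 979, 479]

# Assumption: the two outliers are the smallest and the largest value.
outliers = {30, 979}
cleaned = [x for x in rdt if x not in outliers]

for label, data in (("all cases", rdt), ("outliers removed", cleaned)):
    print(f"{label}: n = {len(data)}, "
          f"mean = {statistics.mean(data):.2f}, "
          f"SD = {statistics.stdev(data):.2f}")

# Standard error of the mean for the cleaned data: SD / sqrt(n).
n = len(cleaned)
se = statistics.stdev(cleaned) / math.sqrt(n)
print(f"standard error = {se:.3f}")
```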
* 8. You know the size of the sample, and you know its standard deviation. What is the standard error,
then? Calculate it both by hand (give details of your calculation in the report) and let SPSS calculate it
for you. Are the two values the same?
Hint: "Analyze", "Descriptive Statistics", "Frequencies". Choose "Statistics" and SE. Do not forget to turn off
"Display Frequency Table".
Now we turn to Table D of Moore and McCabe. Having calculated the standard error, let us find the confidence
interval for the mean of the variable RDT. Let us set the confidence level to C = 95%.
> Determine the degrees of freedom (df) of the sample.
> Use Table D to determine the z* and the t* corresponding to the level of confidence C.
* 9. Determine the confidence interval for the population mean using the Student t-statistic. Provide
details of your calculations in your report.
* 10. What is the meaning of this confidence interval?
* 11. Why have we used the t-statistic and not the z-statistic?
* 12. Suppose we know that the population standard deviation happens to be the same as the standard
deviation of the sample. Determine the confidence interval using the z-statistic for this case.
* 13. Now have the confidence interval calculated for you by SPSS. Copy the values returned by SPSS to
your report. Is it different from your calculations?
Hint: "Analyze", "Compare Means", "One Sample T-test".
The last two columns of the table present the lower and upper bounds of the confidence interval as a
difference from the test value. If you set the test value to 0, then the last two columns simply give you the
bounds of the confidence interval. If, however, you set the test value to the sample mean, then the last two
columns show how much you have to subtract from, and add to, the sample mean to find the confidence
interval, in which the population mean lies with the given confidence level.
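The by-hand calculation follows the pattern "sample mean plus or minus critical value times standard error". The Python sketch below applies this pattern to the 22 reading times that remain after removing the values 30 and 979 (an assumption about which cases were removed). Python's standard library has no t distribution, so the value t* = 2.080 is taken from a t table for df = 21 and C = 95%; z* is computed from the standard Normal distribution. The function name is our own.

```python
import math
import statistics


def confidence_interval(sample, critical_value):
    """Return (lower, upper) = mean -/+ critical_value * SE."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    margin = critical_value * se
    return mean - margin, mean + margin


# Reading times with the two extreme values (30 and 979) removed.
cleaned = [450, 390, 467, 654, 542, 334, 432, 421, 357, 497, 493,
           550, 549, 467, 575, 578, 342, 446, 547, 534, 495, 479]

T_STAR_95_DF21 = 2.080                               # from a t table
z_star = statistics.NormalDist().inv_cdf(0.975)      # about 1.960

ci_t = confidence_interval(cleaned, T_STAR_95_DF21)
ci_z = confidence_interval(cleaned, z_star)
print(f"95% CI with t*: {ci_t[0]:.2f} to {ci_t[1]:.2f}")
print(f"95% CI with z*: {ci_z[0]:.2f} to {ci_z[1]:.2f}")
```

Since t* is larger than z* for any finite df, the interval based on t* is always somewhat wider.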
> Repeat the procedure of having SPSS compute the confidence interval with a t-test, but this time with a
confidence level of 99%.
* 14. Add again to your report the lower and upper values between which the population mean is estimated to lie.
Why and how is this confidence interval different from the previously calculated one?
Suppose there are two competing theories about reading. They associate reading with two different neural
mechanisms and therefore they have two different predictions about reading speed of the particular words
employed in this experiment. Theory FRT ("fast reading theory") predicts that the average time needed to read
these words is at most 440 msec, whereas theory SRT ("slow reading theory") predicts a reading time of at least
505 msec (always including the time needed to press the button).
Are your data able to refute or corroborate any of these theories at a significance level of alpha = 5%? A hint:
use the above theories as null hypotheses; so you ask whether you can refute them, or whether your data are
consistent with them (hence, they corroborate them). Please provide your calculations both by hand, and using
SPSS.
* 15. In each of the two cases, what is the null hypothesis exactly, and what is the alternative hypothesis?
(In words/one full sentence, please.)
* 16. Are you using a one-sided or a two-sided test?
* 17. Perform the test for both cases by hand, and describe the steps of your calculation.
* 18. Let SPSS calculate the test for you and copy the results.
* 19. What is the meaning of the P-value in each of the two cases? (Hint: the probability of what, exactly,
is it?) Please write one full sentence for each case in your report.
* 20. For each case, provide the key sentence summarizing the results of the statistical analysis, as it is
done in scientific papers. That is, either "Based on our data, we can reject the null-hypothesis at a
significance level alpha = 0.05, that is, we can conclude that [the alternative hypothesis in words] is true (t =
..., df = ..., P = ...)" or "our data do not provide sufficient evidence to reject the null hypothesis, that is, to
conclude that...".
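For the calculation by hand, the one-sample t statistic is the distance between the sample mean and the hypothesized value, expressed in standard errors. The following sketch computes it for the two hypothesized values, 440 and 505 msec, again assuming the 22 reading times that remain after removing 30 and 979. The function name is our own, and the critical value 1.721 (one-sided, alpha = 0.05, df = 21) is taken from a t table; deciding on the direction of the test and drawing the conclusion is left to you.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t, df) for testing a population mean of mu0."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    t = (statistics.mean(sample) - mu0) / se
    return t, n - 1


# Reading times with the two extreme values (30 and 979) removed.
cleaned = [450, 390, 467, 654, 542, 334, 432, 421, 357, 497, 493,
           550, 549, 467, 575, 578, 342, 446, 547, 534, 495, 479]

T_CRIT_ONE_SIDED_05_DF21 = 1.721  # from a t table, upper tail 0.05

t_frt, df = one_sample_t(cleaned, 440)  # hypothesized value of theory FRT
t_srt, _ = one_sample_t(cleaned, 505)   # hypothesized value of theory SRT

print(f"mu0 = 440: t = {t_frt:.3f}, df = {df}")
print(f"mu0 = 505: t = {t_srt:.3f}, df = {df}")
print(f"one-sided critical value: {T_CRIT_ONE_SIDED_05_DF21}")
```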
Exercise 3
Aims of Hands-on 3
Hands-on 3
This week we focus on inferences about the population means of different populations using a t-test, described
in sections 7.1 and 7.2 of M&M. We shall also briefly mention the F-test for comparing standard deviations,
described in detail in the optional section 7.3 of M&M (reading the first few paragraphs of that section will
prove useful). Finally, remember that M&M introduces three different versions of the t-test, and note that SPSS
employs a slightly different terminology:
one-sample t-test, a.k.a. single-sample t-test (section 7.1; see also previous lab);
matched pairs t-test, a.k.a. paired-samples t-test (end of section 7.1);
two-sample t-test, a.k.a. independent-samples t-test (section 7.2).
During this lab, we employ data from Joseph A. Wipf (Department of Foreign Languages, Purdue University).
The data describe two groups of ten social workers who followed an intensive summer course in Spanish. One
group came from urban areas where Spanish is frequently spoken, and the other group came from suburban
cities and towns. Each of the twenty participants took a listening exam, both before and after the course.
----------------------------------
Group after before
1 29 30
1 30 28
1 32 31
1 30 26
1 16 20
1 25 30
1 31 34
1 18 15
1 33 28
1 25 20
2 32 30
2 28 29
2 34 31
2 32 29
2 32 34
2 27 20
2 28 26
2 29 25
2 32 31
2 32 29
----------------------------------
> Give the following names to the variables: GROUP, AFTER and BEFORE. Notice that BEFORE is
found in the last column!
> The values of all variables are always integer numbers. Thus, set the number of decimal digits for each
of the variables to 0.
> Then, save the file as a standard SPSS data file, that is, in a .sav format.
We shall soon run a t-test on the variable BEFORE, and therefore it is useful to know whether the variable
follows (approximately) a Normal distribution. (This is good practice, even though M&M p. 456 writes that
two-sample t procedures are quite robust against violation of Normality, especially if each sample has a size of 5 or
more and if the sample sizes are equal. Both criteria are met in our case.) Were the sample really large, we
could simply check if the histogram reasonably matches the Normal curve fitted by SPSS. However, in the case
of a smaller sample (such as ours) random variation can cause the histogram to diverge considerably from a
Normal curve.
Therefore, we need a different technique to assess the Normality of our data set. The simplest one (introduced in
M&M 1.3, p. 68) is drawing a Normal quantile plot; data fitting a Normal distribution will lie along a (diagonal)
straight line, unlike data following a different distribution.
To create a Normal quantile plot in SPSS, you can use the function 'Q-Q Plot' under GRAPHS. By default,
"test distribution" in Q-Q plots is set to Normal distribution; make sure you do not use a different distribution.
> Create a Normal Q-Q plot for the variable BEFORE.
> Remove the second, unsolicited diagram provided by SPSS ('detrended'), by selecting and then deleting
it.
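To see what a Normal quantile plot actually plots, the points can be computed by hand. The sketch below pairs each sorted observation with the standard Normal quantile of its plotting position. It assumes Blom's plotting positions, (rank - 3/8) / (n + 1/4), which is one common choice, and it does not average the ranks of tied values; the positions and tie handling used by SPSS may differ, so treat the numbers as an approximation of the SPSS plot. The function name is our own.

```python
import statistics

# The 20 BEFORE scores from the table above (both groups).
before = [30, 28, 31, 26, 20, 30, 34, 15, 28, 20,
          30, 29, 31, 29, 34, 20, 26, 25, 31, 29]


def normal_qq_points(values):
    """Return (theoretical Normal quantile, observed value) pairs."""
    n = len(values)
    standard_normal = statistics.NormalDist()
    points = []
    for rank, observed in enumerate(sorted(values), start=1):
        # Assumption: Blom's plotting position for this rank.
        proportion = (rank - 0.375) / (n + 0.25)
        points.append((standard_normal.inv_cdf(proportion), observed))
    return points


for z, x in normal_qq_points(before):
    print(f"{z:>7.3f} {x:>4}")
```

If the data are roughly Normal, plotting the second column against the first gives points close to a straight line.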
* 1. Copy the Normal quantile plot to your report. Is this variable distributed Normally? Why?
As the two groups may differ in the mean score BEFORE, it is useful to create the Q-Q plot per group. So we
need to separate the cases that belong to Group 1 from those that belong to Group 2. SPSS has a function to
perform this separation automatically after you have defined the filter – a useful tool if you have a huge amount
of data, or if you would like to apply a complicated filtering condition.
Choose 'Select Cases' in the 'Data' menu. Click on 'If condition is satisfied', and enter "GROUP=1" in the
condition window. Click on 'Continue', then on 'OK'. From now on, cases belonging to Group 2 will be
crossed out and will not be taken into consideration in graphs and calculations. The column filter_$ can be
ignored, as it is created for SPSS's own purposes.
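The logic of 'Select Cases' is that of a filter: a condition is evaluated for every case, and only the cases for which it holds take part in what follows. A minimal Python sketch of the condition GROUP=1, using the data table above (the tuple layout and the names are our own):

```python
# Each case as (GROUP, AFTER, BEFORE), in the order of the data table.
cases = [
    (1, 29, 30), (1, 30, 28), (1, 32, 31), (1, 30, 26), (1, 16, 20),
    (1, 25, 30), (1, 31, 34), (1, 18, 15), (1, 33, 28), (1, 25, 20),
    (2, 32, 30), (2, 28, 29), (2, 34, 31), (2, 32, 29), (2, 32, 34),
    (2, 27, 20), (2, 28, 26), (2, 29, 25), (2, 32, 31), (2, 32, 29),
]

# Keep only the cases that satisfy the condition GROUP = 1.
selected = [case for case in cases if case[0] == 1]

# The BEFORE scores of the selected cases; the full data set is unchanged.
before_group1 = [before for _, _, before in selected]
print(len(selected), "cases selected:", before_group1)
```

As in SPSS, the unselected cases are not deleted; they are merely left out until the selection is turned off.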
* 2. Copy this Q-Q plot to your report. What is your conclusion for this group?
Do not forget to turn off the selection.
Our next objective is to test whether there is a difference (on average) between the two groups of participants at
the beginning of the course. This is certainly a relevant question before we turn to whether the course resulted in
some improvement in the participants' skills.
* 3. In the present case, the populations have not been clearly defined. Nevertheless, try to formulate a
research question so that you have clearly a population and you have clearly a sample. Describe what the
story is about then, and what the goal is of the statistical procedures being employed in this lab.
* 4. What is the null hypothesis to be tested? (Formulate one full sentence. Do not forget: does the null
hypothesis concern the groups/samples, or the populations?)
* 5. What is the alternative hypothesis? Is the testing one-sided or two-sided?
* 6. What requirements must be met in order to be able to use a t-test on two independent samples?
(Think of the sampling procedure, of the distribution of the population, etc.) Are these assumptions met?
When you perform a t-test for two independent samples, you have to decide whether the procedure should
suppose the two populations have the same standard deviation, or no such supposition should be made. This
decision influences the results of the test. As the formula of the t-test is slightly different in the two cases, SPSS
reports the result of both approaches, and leaves the choice to the user.
If the populations have the same SD, we say that the variances (variance = SD^2) are homogeneous. Supposing
homogeneity renders the computations simpler (a factor that was especially important in the past), and if the
variances of the samples are only slightly different, such a supposition does not have serious consequences.
If, however, one sample has an SD of 2 and the other sample has an SD of 20, then supposing homogeneity of the
populations is not very plausible; you should then employ the procedure that does not postulate homogeneity. (M&M
7.3, p. 474 suggests always employing the latter procedure.)
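The difference between the two procedures can be made concrete with a small Python sketch. It computes the t statistic once with a pooled variance (homogeneity supposed) and once with separate variances. For the second case it uses the Welch-Satterthwaite approximation for the degrees of freedom, which is what statistical software commonly reports; that choice, like the function name, is an assumption of this sketch.

```python
import math
import statistics


def two_sample_t(x, y):
    """Return t and df for the pooled and the unpooled two-sample t-test."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    diff = statistics.mean(x) - statistics.mean(y)

    # Equal variances supposed: one pooled variance estimate.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t_pooled = diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    df_pooled = n1 + n2 - 2

    # Equal variances not supposed: separate variance estimates.
    a, b = v1 / n1, v2 / n2
    t_unpooled = diff / math.sqrt(a + b)
    df_welch = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

    return {"t_pooled": t_pooled, "df_pooled": df_pooled,
            "t_unpooled": t_unpooled, "df_welch": df_welch}


# BEFORE scores per group, from the data table above.
before_1 = [30, 28, 31, 26, 20, 30, 34, 15, 28, 20]
before_2 = [30, 29, 31, 29, 34, 20, 26, 25, 31, 29]

result = two_sample_t(before_1, before_2)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

When the two samples have the same size, as here, the two t statistics coincide and only the degrees of freedom differ.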
> Perform the t-test on the variable BEFORE to test the difference in the means of the two independent
groups.
Hint: The two-sample t-test is called "Independent samples t-test" in SPSS (under "compare means"). The t-test
asks for a variable to separate the groups. So first turn off Select Cases ("all cases"). Then, in the window for
the t-test, you have to select the variable that you would like to test, as well as another variable that serves as the
criterion for defining the groups. Use GROUP as this second variable. You also have to determine which values
of GROUP will define sample 1 and sample 2.
* 7. What is the standard deviation of the two groups? Are they reasonably the same, or quite different?
SPSS first performs Levene's test, which reports an F statistic (cf. the F-test in M&M 7.3), to check the homogeneity
(equality) of the standard deviations/variances. Refer to the first two columns of the last table. The null hypothesis of this test is that the
two samples originate from two populations that have equal standard deviations. The p-value of the test
is the probability, if the null hypothesis is true, of drawing samples that are at least as far away from the null hypothesis as our samples.
* 8. What probability has SPSS calculated? Is there reason to reject the hypothesis at significance level
alpha = 0.05 that the standard deviations are homogeneous?
Let us now turn to the outcomes of the t-test. First, we suppose that the standard deviations of the two populations
are equal (homogeneous).
Now, let us assume that the standard deviations (variances) of the two populations are not necessarily the same.
> Let SPSS draw two boxplots in one figure to visualize the differences in the values of variable BEFORE
across the two groups.
Hints: Choose 'Simple' and 'Summaries for groups or cases'. Use GROUP as category variable.
* 13. Copy this figure to your report. Add a good (precise, detailed and informative) caption to this figure
of one or a few sentences, as is usual practice in scientific publications and scholarly books.
* 14. How many different (independent) cases do we have actually in our sample? How many
observations do we have per case in the sample?
The best way to determine whether a participant has improved his or her skills is to compare the course-final
score to the course-initial score, that is, by calculating the difference AFTER - BEFORE. This is exactly what
the t-test for related samples does (see also in M&M, end of section 7.1). Yet, by performing the test yourself,
you can better see what exactly happens and you can also draw figures of the variable of difference
IMPROVEMENT = AFTER - BEFORE.
> Use 'Compute' to calculate the new variable of difference. Call it IMPROVEMENT.
* 15. Does IMPROVEMENT follow a Normal distribution? How did you get to this conclusion?
Hint: You can both fit a Normal curve to the histogram and create a Normal quantile plot.
* 16. Give a 90% confidence interval for the population mean IMPROVEMENT. (Refer, if necessary, to
the SPSS functions already employed in the previous lab.)
As described on the last pages of section 7.1 of M&M, you can perform a one-sample t-test on this difference
variable, and this is the procedure called matched pairs t-test. Let SPSS calculate this single-sample t-test for
you.
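That the matched pairs t-test is nothing more than a one-sample t-test on the differences can be verified with a short Python sketch. It builds IMPROVEMENT = AFTER - BEFORE from the data table above and computes the t statistic for a hypothesized mean of 0, together with a 90% confidence interval. The critical value 1.729 (df = 19, C = 90%) is taken from a t table; the variable names are ours.

```python
import math
import statistics

# AFTER and BEFORE scores of the 20 participants, in table order.
after = [29, 30, 32, 30, 16, 25, 31, 18, 33, 25,
         32, 28, 34, 32, 32, 27, 28, 29, 32, 32]
before = [30, 28, 31, 26, 20, 30, 34, 15, 28, 20,
          30, 29, 31, 29, 34, 20, 26, 25, 31, 29]

# The difference variable, one value per participant.
improvement = [a - b for a, b in zip(after, before)]

n = len(improvement)
mean_d = statistics.mean(improvement)
se_d = statistics.stdev(improvement) / math.sqrt(n)

# One-sample t statistic for a hypothesized mean difference of 0.
t = (mean_d - 0) / se_d
df = n - 1

T_STAR_90_DF19 = 1.729  # from a t table
ci_90 = (mean_d - T_STAR_90_DF19 * se_d, mean_d + T_STAR_90_DF19 * se_d)

print(f"mean improvement = {mean_d:.2f}, SE = {se_d:.3f}")
print(f"t = {t:.3f}, df = {df}")
print(f"90% CI: {ci_90[0]:.2f} to {ci_90[1]:.2f}")
```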
* 17. Test the hypothesis that the mean of IMPROVEMENT is 0 using a t-test. Formulate your
conclusions by reporting the value of the t-statistic, df, p-value (one-tailed or two-tailed? why?), and
whether you can reject the null hypothesis (which is what?) at alpha-level 0.05. (Refer, if necessary, to the
previous lab.)
* 18. What is your conclusion: is there improvement in the scores obtained on the listening test? Illustrate
your claim with convincing figures, too. It is up to you to choose what type(s) of figures you use, but
always add captions to figures.
We have used all twenty social workers to find out if the scores on the listening test have improved. However, it
is also possible that one of the groups displays significant improvement, whereas the members of the other
group of ten have not, or have almost not, improved their listening skills. This question is especially interesting
if the two groups followed a course with a different methodology, and so we would like to argue for the
advantages of one of them.
You now have to combine what you have learned in D with what you have learned in F.
* 19. Describe the statistical procedure you perform: what type of test(s), on which variable(s)/group(s),
what is the null-hypothesis, do you use a one-tailed or a two-tailed alternative hypothesis, are the criteria
for performing the test reasonably met, etc.?
* 20. What is your conclusion: is there a difference between the two populations? Report your conclusion,
including the results of the statistical procedure, as usual. Illustrate your conclusion with figure(s),
including a caption.
Exercise 4
Aims of Hands-on 4
Hands-on 4
Is there a connection between gender and (academic) accomplishment? A large university carried out a study
of PhD students who had started their PhD research six years earlier. The following two-way table presents
how these students are distributed according to two variables: status of their research and gender.
                     Gender
Status               Man    Woman
Quit                 238     98
Still in progress    134     33
Thesis defended      423     98
> Enter these values to SPSS and specify the correct names of the variables.
Suggestion: A useful trick to do this in SPSS is to define three different variables: GENDER, STATUS and
COUNT. Each cell of the above table then becomes one case (one row), producing six rows in total. In truth,
each and every student represents a separate case described with two variables (gender and status), so we should
have entered 238 cases of "man/quit", 98 cases of "woman/quit", etc. Instead of doing so hundreds of times
(which is not only time consuming but also a potential source of errors), we rather enter "man/quit" only once
but also add a third variable COUNT, which we shall soon use as a "weight".
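The idea behind the weight variable can be illustrated in Python: six rows with a COUNT stand for the same data as 1024 individual rows. The sketch below expands the six weighted rows into one row per student and confirms that the totals agree. The tuple layout and the names are our own.

```python
from collections import Counter

# One row per cell of the two-way table: (GENDER, STATUS, COUNT).
cells = [
    ("man", "quit", 238), ("woman", "quit", 98),
    ("man", "still in progress", 134), ("woman", "still in progress", 33),
    ("man", "thesis defended", 423), ("woman", "thesis defended", 98),
]

# Weighted view: simply add up the counts.
total_weighted = sum(count for _, _, count in cells)

# Unweighted view: one (gender, status) row for every single student.
students = [(gender, status)
            for gender, status, count in cells
            for _ in range(count)]

# Counting the expanded rows gives back the original cell counts.
recounted = Counter(students)

print("students (weighted):", total_weighted)
print("students (expanded):", len(students))
print("man/quit recounted:", recounted[("man", "quit")])
```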
Another useful practice in SPSS is to use numeric values even for categorical variables, because you will have
access to more functionalities of SPSS then. For instance, you can encode 'man' as '1' and 'woman' as '2'; 'quit'
as '0', 'still in progress' as '1' and 'thesis defended' as '2'. Note that any other numbers could also be used, and
that in many cases categorical variables are not ordered the way numbers are. (So, the numbers associated with
'man' and 'woman' could be reversed. Yet, it might make sense to use a number for 'still in progress' that is
between the numbers used for 'quit' and for 'thesis defended'.)
> In VARIABLE VIEW (column 'values') set the "meaning" of each possible value for both categorical
variables: '1' is 'man', '2' is 'woman', etc.
> Check whether you have entered the counts correctly by creating a two-way table of GENDER and
STATUS with the count of occurrences in the cells (that is, not with percentages, etc.).
Hint: Analyze, Descriptive Statistics, Crosstabs... Click on "Cells" to define what you want to see.
* 1. Copy the table to your report. What does SPSS display on the margins of this table? Explain how
these values are obtained.
A disadvantage of the two-way table above is that the connection between variables GENDER and STATUS
(which is what interests us) is far from evident looking at it.
> Create a table of GENDER and STATUS showing only the row conditional distributions.
> Create a table of GENDER and STATUS showing only the column conditional distributions.
> Create a table of GENDER and STATUS showing only the joint distribution.
Hint: Descriptive Statistics, Crosstabs. Click on 'Cells', and choose what you would like to have. In order to
keep the table clear and intelligible, make sure you always have only one distribution displayed at a time.
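As a cross-check on the SPSS output, the three kinds of percentages can be computed by hand. The sketch below is an illustration in Python, assuming STATUS is in the rows and GENDER in the columns as in the table above; all variable names are chosen for this sketch only.

```python
# Observed counts, STATUS in rows and GENDER in columns.
counts = {
    "quit": {"man": 238, "woman": 98},
    "still in progress": {"man": 134, "woman": 33},
    "thesis defended": {"man": 423, "woman": 98},
}
genders = ["man", "woman"]

# Marginal totals: sums over rows, over columns, and over the whole table.
row_totals = {s: sum(cells.values()) for s, cells in counts.items()}
col_totals = {g: sum(cells[g] for cells in counts.values()) for g in genders}
grand_total = sum(row_totals.values())

# Row percentages: each cell divided by the total of its row (status).
row_pct = {s: {g: 100 * counts[s][g] / row_totals[s] for g in genders} for s in counts}
# Column percentages: each cell divided by the total of its column (gender).
col_pct = {s: {g: 100 * counts[s][g] / col_totals[g] for g in genders} for s in counts}
# Total percentages (joint distribution): each cell divided by the grand total.
joint_pct = {s: {g: 100 * counts[s][g] / grand_total for g in genders} for s in counts}

for s in counts:
    print(s, [round(col_pct[s][g], 1) for g in genders])
```

Row percentages add up to 100 within each row, column percentages within each column, and total percentages over the whole table.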
* 2. Explain how the values in each of these three tables are obtained. Where can you find the conditional
distributions and where can you find the marginal distributions?
* 3. Choose the table that you consider the most useful to show the difference between men and women.
Write a short paragraph describing and explaining your observations with the table appended, as if it
were a section in a scientific paper. Do not forget to add a descriptive caption to the table.
> Create a diagram, for instance stacked diagrams, which shows the difference between men and women
as much as possible. Experiment with different types of diagrams offered by SPSS: try out what different
options look like.
Hint: Graphs, Bar...
* 4. What is more useful: showing counts or percentages? Copy the diagram that you consider the most
helpful, and add a caption to it.
E. Proportions
So far, we have been busy with descriptive statistics and visualization. Now, we turn to inferential statistics, that
is, we seek to draw conclusions from the sample about the entire population.
In the present lab, we shall employ two different approaches: statistical procedures to estimate proportions in
the population, and the chi-square test to examine the independence of the variables. The two approaches take
two different views of the same data.
In the first approach, a population is described by three parameters: the proportion of students having quit
within six years, the proportion of students still in progress after six years, and the proportion of students having
defended their thesis within six years. A population can be all students at a certain university, or all female
students at a certain university, etc. For instance, we may estimate the quitting rate among men, or compare it to
the quitting rate among women.
In what follows, we are going to employ the statistical procedures described in chapter 8 of M&M. Although we
have three different proportions (quitting rate, in-progress rate and defense rate) summing up always to 1, these
procedures always focus on one rate at a time. So, we shall focus on the defense rate only.
> For each of the following questions, specify what the population(s) and the sample(s) drawn from the
population(s) are, and what counts as "success" (the outcome whose proportion these procedures deal with). Explain from
which tables calculated in part C you take your values. Finally, explain which statistical procedure (for
example, test) you use, and check that the criteria for applying the statistical procedures (as described by
M&M) are always met.
You can perform the computations either by hand (calculate the z-statistics as described in chapter 8, and use
tables A or D of M&M), or using software such as http://www.quantitativeskills.com/sisa/statistics/t-test.htm.
Unfortunately, SPSS is unable to help you in this task.
If you have X cases of success in a sample of n data, then enter X/n as mean and n as nr. of cases.
The site rounds values off, so make sure it uses the correct rounding.
Make intensive use of the "clear" button.
Ignore std def and DEFF.
At question 5: leave the Mean 2 and N of cases 2 empty (zero: so the "difference between means" will
be your only mean); set the confidence interval C.I.
At question 8, you want to compare two proportions (two "means" in this software), so you simply enter
the proportion and total number of women (mean 1 and Number of cases 1), as well as the proportion
and total number of men (mean 2 and Number of cases 2). You are returned a t-value and a very large
df, so you can use this t-value as if it were a z-value, and estimate the p-values based on Table A. NB:
the software gives you some probabilities, but make sure you do not misunderstand what they refer to;
it is worth checking them in a Table.
At questions 6 and 7, you want to compare two proportions (two "means" in this software) again, but
the second one is not a measured one, rather a test value. In other words, you want to run a one-sample
test, and not a two-sample test. Yet, this software does not seem to offer such an opportunity. Still, you
can use a trick (similar to the one used in section M&M 9.3): you present your test value (mean 2) as if
it were a mean value measured on an extremely large sample (Number of cases = 100000 at least). In
fact, if you check the formulae of one-sample and two-sample procedures, you will see that if n2 is
much larger than n1, then the procedure is the same as comparing the first sample to mu2 in a one-
sample procedure.
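If you prefer to check your hand computations, the large-sample z procedures for a proportion can be sketched in a few lines of Python. This is only an illustration under assumptions: the formulas are the usual large-sample ones (verify them against chapter 8 of M&M), the function names are invented here, and the numbers are a hypothetical sample, not the PhD data.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal distribution (mean 0, sd 1)

def proportion_ci(x, n, level):
    """Large-sample confidence interval for one proportion."""
    p_hat = x / n
    z_star = STD_NORMAL.inv_cdf(0.5 + level / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def one_sample_z(x, n, p0):
    """z statistic for H0: p = p0; the standard error uses p0."""
    p_hat = x / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

def unpooled_se(p1, n1, p2, n2):
    """Standard error of the difference p1 - p2 (unpooled form)."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Hypothetical sample, NOT the PhD data: 60 successes in 150 cases.
x, n, p0 = 60, 150, 0.5
low, high = proportion_ci(x, n, level=0.90)
z = one_sample_z(x, n, p0)
p_two_sided = 2 * STD_NORMAL.cdf(-abs(z))

# The "extremely large second sample" trick: as n2 grows, the second term
# under the square root vanishes, leaving the one-sample standard error.
se_one = sqrt((x / n) * (1 - x / n) / n)
se_trick = unpooled_se(x / n, n, p0, 100000)
print(round(low, 3), round(high, 3), round(z, 3), round(p_two_sided, 4))
print(se_one, se_trick)
```

Note that the trick leaves a standard error based on the sample proportion, whereas the one-sample test above uses the test value p0, so the two z-values can differ slightly.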
* 5. Provide a 90% confidence interval for the proportion of PhD students defending their thesis within
six years.
* 6. Based on these data, can we safely (that is, with a significance level of 5%) say that the percentage of
students defending their thesis within six years at this university is exactly 50%? (one-sided or two-sided?
p-value?)
* 7. A national survey revealed that the percentage of students defending their thesis within six years is
47%. Can we conclude at a significance level of 0.05 that the percentage at this university is larger than
the national average? (one-sided or two-sided? p-value?)
* 8. Can we conclude that there is a significant difference between the probability of a man finishing
within six years and the probability of a woman finishing within six years?
The second inferential approach considers these data as describing a single sample, originating in a single
population. Yet, two variables are measured for each case: GENDER and STATUS. In other words, you do not
compare 795 men to 229 women for STATUS, but you compare GENDER to STATUS in 1024 cases.
The chi-square test to be employed tests whether there is an association between the two variables (chapter 9 of
M&M): does knowing the value of one variable help us predict the value of the other variable? A situation
of a very strong association would be for instance if all men have quit and all women have defended their
theses; that is, by knowing the value of GENDER for a certain case, we could predict the value of STATUS
with full certainty. A situation of a somewhat weaker association is if 70% of men have quit and 70% of women
have defended their theses. In this case, if we were told that a certain student is male, we would bet that he has
quit, even though we are not absolutely sure about it. Finally, in a situation with absolutely no association, the
quitting rate among men is equal to the quitting rate among women: being told the gender of a student does not
change our knowledge of the probability that that student has quit.
Having entered the data in the two-way table earlier today, we can let SPSS do the job. Yet, we need to be able
to interpret the output, and know how to formulate our conclusion.
* 9. Formulate the null hypothesis of the chi-square test, and the alternative hypothesis, in one full
sentence each. Check whether the criteria for applying a chi-square test (as described by M&M) apply in
our case.
The chi-square test compares the observed counts to the expected counts in each cell. The latter are calculated
using the totals on the margins. Can you let SPSS display the two-way table with the expected counts? Compare
the observed and the expected counts in each cell: are they "very" different? What the chi-square test does is
answer this question in a precise way.
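The computation of the expected counts and of the chi-square statistic can be sketched as follows. This Python fragment is an illustration for checking the SPSS output, not the way SPSS itself is implemented; the variable names are chosen for this sketch, and the p-value is left to SPSS or to a table.

```python
# Observed counts: rows are STATUS, columns are GENDER (man, woman).
observed = [
    [238, 98],   # quit
    [134, 33],   # still in progress
    [423, 98],   # thesis defended
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count of a cell = row total * column total / table total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

for exp_row in expected:
    print([round(e, 1) for e in exp_row])
print(round(chi_square, 2), df)
```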
* 10. Summarize your conclusion concerning rejection or non-rejection of the null hypothesis, and what it
means: is there a statistically significant association between gender and status? As usual, provide the
details of the statistical procedure in parentheses: in this case, the chi-square value, the degrees of
freedom and the p-value (probability, significance). Additionally, explain why df has this value.
Note: a "statistically significant" association means that it can be observed using statistical techniques and based
on our data. It is, however, not necessarily a "significant" association, that is, a strong association. The
"strength" of such an association can be measured in different ways. For instance, as we assigned '0', '1' and '2' to
the different possible values of status, we can compute the mean of the status for men, as well as for women,
and we can compare these two means using a two-sample t-test.
Both these techniques will also provide us with information on the direction of the association: whether men or
women tend to have a higher score on variable STATUS. Chi-square does not tell us this direction, as it is
designed to be employed on categorical data (such as ours), in which case direction is theoretically meaningless.
The usual way to formulate the conclusion of the statistical procedure is as follows:
Optional task: perform the above computations yourself, describing the mathematical details of the procedures.
Exercise 5
Aims of Hands-on 5
A ANOVA
B Non-parametric test: Wilcoxon Rank Sum Test
Hands-on 5
Examining the reading skills of children in the U.S., three methods of education were compared. Several
variables were measured before the lessons started. One of the goals of the pretest was to see whether the three
groups of children had similar cognitive capacities. One of its variables gave an indication of the "ability of
reading garbled sentences", which measures a certain kind of text comprehension. The data for the 22 subjects
in each group are given below. The three types of education are called (B)asal, (D)irected Reading as Thinking
Activity and (S)trategies. (Source: research done by Jim Baumann and Leah Jones from the School of Education of Purdue
University; slightly altered!)
Group
B D S
4 7 11
14 7 7
9 12 4
12 10 7
16 16 7
15 15 6
14 9 11
12 8 14
16 13 13
8 12 9
13 7 12
9 6 13
12 8 4
12 9 13
12 9 6
10 8 12
8 9 6
12 13 11
11 10 14
8 8 8
17 8 5
9 10 8
You can read the data above (save it to your disk, before importing it to SPSS, as usual). This time, this is a csv-
file ("comma separated value"): the delimiter character between cells in a row is not space but a comma.
Most probably, you will import this file in a way so that you get three columns, as the table above. Yet, this is
not what you need for further processing. Indeed, each value represents a separate case, so there are 66 cases in
total. That is, you want 66 rows, as a row represents a case in SPSS. You will, therefore, cut-and-paste the
columns under each other, to get a single column for variable ARGS ("ability of reading garbled sentences"). In
case you use an abbreviation for the name of the variable, do not forget to add an explanation ("label" in
"Variable View"). Moreover, after cut-and-paste, remove the variables (columns) that have just been emptied
("Data View", right-click on the top of the column, and then "clear").
Yet, you need a second variable to distinguish between cases belonging to the three methods. So introduce a
new variable, called METHOD, which has three values: use numerical values, and add labels to the values
(1=basal,2=directed, 3=strategies). You will use this second variable to define groups of cases, as your goal is
exactly to compare those groups. Probably the simplest way is to enter the values 1, 2 and 3 of METHOD by
hand (depending on the way you have cut-and-pasted the columns earlier, but probably cases 1-22 represent
method B, 23-44 represent D and 45-66 represent S).
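The restructuring from three columns to one case per row can be pictured with a short sketch. It is written in Python purely as an illustration (in SPSS you do this by cut-and-paste as described above), it uses only the first three rows of the table, and the names wide, method_codes and long_rows are invented for the example.

```python
# First three rows of the table above, stored per column B, D, S
# (illustration only; the real file has 22 rows).
wide = {
    "B": [4, 14, 9],
    "D": [7, 7, 12],
    "S": [11, 7, 4],
}
method_codes = {"B": 1, "D": 2, "S": 3}  # 1=basal, 2=directed, 3=strategies

# Stack the columns under each other: one (METHOD, ARGS) pair per case.
long_rows = [
    (method_codes[group], score)
    for group in ("B", "D", "S")
    for score in wide[group]
]
for method, args in long_rows:
    print(method, args)
```

With the full table, the same stacking produces 66 rows: 22 with METHOD 1, then 22 with METHOD 2, then 22 with METHOD 3.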
An ANOVA test (M&M ch. 12) compares means, whereas boxplots show medians. If the distributions are nearly
symmetric, both measures of center will have nearly the same values. However, if the number of observations
is low, and the variable values are not very diverse, boxplots are not very accurate.
*1. Draw for each group a boxplot (simple boxplot, summaries for groups of cases).
*2. Give for each group (B, D, and S) the mean and the standard deviation. What is the ratio between the
biggest and the smallest standard deviation? Can we employ an ANOVA test to get reliable information?
Hint: You can use "Select Cases" (in the "Data" menu) in order to obtain the mean and the standard deviation of
22 cases only at a time.
*3. Formulate H_0 and H_a. Run ANOVA ("Analyze", "Compare Means"), and copy the one-way
ANOVA table to your report. What is your conclusion? Formulate the "magic sentence" with the
statistical details in parentheses.
Hint: the variable whose mean you want to compare is called "dependent list" and the variable used to form the
groups is called "factor". Namely, the question is whether the quantitative variable depends on factors such as
the method. Let SPSS also plot a "means plot" and give you data on descriptive statistics (within 'Options' of the
ANOVA window).
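To see where the numbers in the one-way ANOVA table come from, the sums of squares can be computed by hand. The following Python sketch is an illustration using the data from the table above; the variable names are made up here, the formulas are the standard one-way ANOVA ones, and the p-value is left to SPSS.

```python
from statistics import mean, stdev

# Scores per group, copied column by column from the table above.
groups = {
    "B": [4, 14, 9, 12, 16, 15, 14, 12, 16, 8, 13, 9, 12, 12, 12, 10, 8, 12, 11, 8, 17, 9],
    "D": [7, 7, 12, 10, 16, 15, 9, 8, 13, 12, 7, 6, 8, 9, 9, 8, 9, 13, 10, 8, 8, 10],
    "S": [11, 7, 4, 7, 7, 6, 11, 14, 13, 9, 12, 13, 4, 13, 6, 12, 6, 11, 14, 8, 5, 8],
}

means = {g: mean(v) for g, v in groups.items()}
sds = {g: stdev(v) for g, v in groups.items()}   # sample SD (divides by n - 1)
sd_ratio = max(sds.values()) / min(sds.values())  # largest SD / smallest SD

all_scores = [x for v in groups.values() for x in v]
grand_mean = mean(all_scores)

# Between-groups and within-groups sums of squares.
ss_between = sum(len(v) * (means[g] - grand_mean) ** 2 for g, v in groups.items())
ss_within = sum((x - means[g]) ** 2 for g, v in groups.items() for x in v)
df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

# F = mean square between / mean square within.
f_value = (ss_between / df_between) / (ss_within / df_within)
print(means, sds, round(sd_ratio, 2))
print(df_between, df_within, round(f_value, 3))
```

The printed degrees of freedom and F-value should match the corresponding cells of the SPSS table.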
After having performed an ANOVA, we proceed by searching for why the null hypothesis has been rejected:
which of the three populations differ from the others? We can either employ contrasts, or run post hoc pairwise
comparisons of the samples. In SPSS, you find two buttons in the ANOVA window that bring you to these
further procedures.
*4. Analyze the contrasts '(D and S) vs. B' (contrast1) and 'D vs. S' (contrast2). Check with contrast1
whether (D and S) have higher average scores than B. Check with contrast2 whether the average scores
of D are not equal to the average scores of S. Give for both contrasts the null hypothesis and the
alternative hypothesis, the t-value, the p-value, and the conclusion.
Help: open the "contrast" window from the ANOVA window. Add the three coefficients for the three groups
one under the other (enter first value, click on "add", etc.; the order corresponds to the values in variable
METHOD). Make sure the sum of the coefficients is 0, which can be also checked in the window. Then, click
on "next" to enter the coefficients of a second contrast.
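The arithmetic behind a contrast can be sketched as follows. This Python fragment is an illustration under assumptions: it uses the standard formula with the pooled standard deviation (check it against M&M ch. 12), the function name contrast_t is invented here, and the summary statistics are hypothetical, not computed from the reading data.

```python
from math import sqrt

def contrast_t(coefficients, means, sizes, pooled_sd):
    """Return (estimate, standard error, t) for one contrast."""
    assert abs(sum(coefficients)) < 1e-12, "coefficients must sum to 0"
    estimate = sum(a * m for a, m in zip(coefficients, means))
    se = pooled_sd * sqrt(sum(a * a / n for a, n in zip(coefficients, sizes)))
    return estimate, se, estimate / se

# The order of the groups follows the values of METHOD: 1 = B, 2 = D, 3 = S.
contrast1 = [-1.0, 0.5, 0.5]   # (D and S) vs. B
contrast2 = [0.0, 1.0, -1.0]   # D vs. S

# Hypothetical summary statistics, NOT computed from the reading data.
means = [10.0, 12.0, 13.0]
sizes = [22, 22, 22]
pooled_sd = 3.0

est1, se1, t1 = contrast_t(contrast1, means, sizes, pooled_sd)
est2, se2, t2 = contrast_t(contrast2, means, sizes, pooled_sd)
print(round(est1, 3), round(t1, 3))
print(round(est2, 3), round(t2, 3))
```

Multiplying all coefficients of a contrast by the same constant (for instance entering -2, 1, 1 instead of -1, 0.5, 0.5) changes the estimate but not the t-value.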
*5. Perform a Bonferroni test with alpha = 0.05. In which pairwise comparison(s) are the two groups
significantly different? (Look at the stars...)
Optional question: a friend of yours argues that ANOVA is worthless, because you can get a situation in
which a=b and b=c, but a and c are different, and such a situation is impossible. What is your answer?
Has the number of female Nobel laureates increased as a result of women's emancipation in recent decades? As
of 2008, there have been 35 female laureates
(http://en.wikipedia.org/wiki/List_of_female_Nobel_laureates, http://nobelprize.org/nobel_prizes/lists/women.html),
whereas the Nobel prize has been awarded to a man 759 times. (The four people who received the Nobel
prize twice have been counted twice; by the way, one of them was a woman.)
It is clear that many more men than women have received the Nobel prize. However, women have been awarded the
Nobel prize since its earliest years, and at least one woman received the Nobel prize in each decade, with the
exception of the 1950s. The number of women has increased since the sixties, but the same is true of men, due to
several factors: the establishment of the Nobel prize in economics, and the practice of sharing the prize between three
people becoming more frequent (whereas in earlier years the Nobel prize was quite often not awarded at all).
I have compiled two files: a list for men and a list for women. These lists contain only the years, so that we
can see the distribution of Nobel prizes per year. I have omitted the 22 cases when an organization was awarded
the Nobel prize for peace.
> Import the data from both files to SPSS. Use a variable called YEAR. Introduce a second variable called
GENDER, with some numerical encoding as you did last week (e.g., 1 = man, 2 = woman). Do not forget
to set the values and the decimals in the "Variable View".
Hints: it is easiest to first read the two files to two separate spreadsheets. Using "Compute Variable" (within
"Transform"), create the second variable GENDER (target variable: gender = numeric expression: 1, for one
file, and gender = 2 for the other file). Then, copy-paste the two columns from the shorter spreadsheet to the end
of the longer one. Finally, do not forget to save what you get in a .sav format.
> Draw a boxplot for each group (simple boxplot, summaries for groups of cases; variable: year, category
axis: gender): what can you see (spread, median)?
> Create a histogram showing the distribution of men, and another one showing the distribution of
women.
You can do it before copy-pasting. Alternatively, use the "Select Cases" function (condition gender=2, refer to lab
3). Do not forget to switch the selection off afterward.
*6. Copy the two histograms to your report. Add captions. Compare the two distributions: Are they
similar? What shape(s) do these distributions follow?
(NB: Please, do not argue that they are close to a Normal distribution! Do you expect the number of
Nobel prizes to decrease in the future?)
Here are a few ways to pose the same question (make sure you understand each of them, and why they are
related):
1. Has the proportion of women being awarded a Nobel prize increased in the second half of the history of
the prize, with respect to the same proportion in the first half of its history? (Cf. tests on proportions.)
2. Is there a correlation between the variables YEAR and GENDER? (Cf. scatterplot and Pearson's
correlation coefficient r.)
3. Are variables YEAR and GENDER independent from each other? (Cf. chi-square test.)
*7. Choose any of these methods (preferably one that you haven't done yet). Report your results, and
explain them.
1. Has the number of female laureates increased more quickly recently than the number of male laureates?
2. If we look at percentages, and not absolute numbers, are the two distributions the same, or different?
3. Is the cumulative proportion of the two distributions different?
4. If we list all laureates, are the few women distributed equally among the men?
5. Is the median of the two populations different?
These last four questions bring us to nonparametric tests, which you can find in SPSS under "Analyze,
Nonparametric Tests". We obviously have two independent samples, and we focus on Mann-Whitney U, a
variant of the Wilcoxon Rank Sum Test (cf. M&M, p. 15-8).
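How the rank sum W and the Mann-Whitney U are related can be sketched on a toy example. The Python lines below are an illustration only: the function names are invented for this sketch, tied values receive the average of their ranks, and the years are hypothetical values, not taken from the two course files.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions i..j are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_and_u(sample1, sample2):
    """Wilcoxon rank sum W of sample1 and the corresponding Mann-Whitney U."""
    ranks = average_ranks(sample1 + sample2)
    n1 = len(sample1)
    w = sum(ranks[:n1])
    u = w - n1 * (n1 + 1) / 2
    return w, u

# Hypothetical award years, NOT taken from the course files.
women_years = [1911, 1947, 1977, 2004]
men_years = [1901, 1921, 1947, 1962, 1985]
w, u = rank_sum_and_u(women_years, men_years)
print(w, u)
```

U counts the pairs (one woman, one man) in which the woman's year is the later one, with ties counted as one half, so W and U carry the same information.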
*8. Perform the test, report the results and draw a conclusion.
*9. M&M (15.1) proposes two alternative interpretations of what is tested by the Wilcoxon rank sum test:
either the identity of the two distributions (no parameter involved at all; hence the name "nonparametric
test"), or the equality of medians (an unusual parameter involved). Explain which interpretation makes
sense in the present case.
*10. Why do we need to "fall back" to a nonparametric test in the present case? Give at least two
reasons.
Hints: Did you get anything useful at question 7? Could you employ a chi-square test (cf. criteria of its use and
footnote by SPSS below the table)? Does the shape of the distribution suggest using a traditional ("Normal")
test? Is the variable being discussed nominal, ordinal or really numerical?