SPSS Session
Objectives
Selecting Appropriate Statistical Technique
Research Objectives: Group Comparison vs. Relationship Exploration
Type/Nature of IV & DV: Categorical vs. Continuous
[Conceptual model diagram: Pricing (independent variable) acting through a mediator on the outcome/dependent variable, VBJ Consumption, with hypothesised positive (H+) and negative (H-) paths]
Research Objectives
[Illustrations: Group Comparison (size choice by male vs. female participants) and Relationship Exploration (large size → consumption, H+)]
QUESTIONNAIRE DESIGN & DEVELOPMENT/SELECTION
Development of Research Instrument/Questionnaire
Important tips for developing/adapting a scale
1. It is better to use already developed scales that have demonstrated high reliability and validity.
2. Use a focus group study to adapt existing scales to your context if your research context is new.
3. Don't mention the name of the construct that you intend to measure through a selected scale.
4. Don't number your questions, especially when you have a large number of questions.
5. Use appropriate response options for the selected questions and, where possible, include a Not Applicable/Don't Know category.
(Response formats: nominal, ordinal/categorical variable, or scale/continuous variable.)
6. Attach a cover letter highlighting the purpose of the research and ensuring participants' anonymity.
7. Try to avoid:
long, complex questions
double negatives
jargon or abbreviations
culture-specific terms
words with double meanings
Quantitative Data Analysis with SPSS
Adapted from Figure 1.4 The logic of research process (Vaus, 2009)
Preparing a Code Book & Setting Up an SPSS File
The following steps must be taken in the order presented below:
“For additional tips on developing a coding scheme, read Part 1 of the SPSS Survival Manual, 4th edition.”
[Side-by-side example: the original survey and the coded survey]
2. Create your new SPSS file and define the attributes of your variables.
Make sure to follow the ground rules you have established in your coding scheme. That is, designate:
A. Variable Names
B. Variable Labels
C. Value Labels (e.g., for the variable gender, 0 = Male, 1 = Female; this is especially important for categorical variables)
D. User-Missing Values
6. Examine the data and summary statistics (e.g., frequencies, min. &
max. values) to spot and correct coding errors.
7. Calculate reliability for each of your summated multi-item scales only, and (a) identify items detracting from reliability, and (b) refine the scale accordingly.
8. Create a composite/summated variable for each summated multi-item scale
(Menu Bar: Transform, Compute; Target Variable = a mathematical expression). A rough code sketch of steps 7-9 appears after step 11 below.
9. Compute Descriptive Statistics for different subgroups
separately
(Menu Bar: Data, Split File, Organize Output by Groups--specify the grouping variable) OR
(Menu Bar: Analyze, Compare Means, Means, Dep. List, Indep. List, Select Options, OK)
10. Save and print your SPSS data file (Menu Bar: File, Save As…)
11. Analyze the data (Menu Bar: Analyze--select analysis options)
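As a rough illustration of steps 7-9 outside SPSS, the Python sketch below computes Cronbach's alpha for a hypothetical three-item scale, builds the composite mean score, and summarises it by group. The variable names (sat1-sat3, gender) and data are invented for the example; this is not the lecture's data file.

```python
import pandas as pd

# Hypothetical survey responses; variable names are invented for illustration only.
df = pd.DataFrame({
    "gender": ["male", "female", "male", "female", "female", "male", "female", "male"],
    "sat1":   [4, 5, 3, 4, 2, 5, 4, 3],
    "sat2":   [4, 4, 3, 5, 2, 5, 4, 2],
    "sat3":   [5, 5, 2, 4, 3, 4, 4, 3],
})

# Step 7: Cronbach's alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
items = df[["sat1", "sat2", "sat3"]]
k = items.shape[1]
alpha = (k / (k - 1)) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))
print(f"Cronbach's alpha = {alpha:.3f}")

# Step 8: composite/summated variable (like Transform > Compute with MEAN(sat1, sat2, sat3))
df["satisfaction"] = items.mean(axis=1)

# Step 9: descriptive statistics for each subgroup (like Analyze > Compare Means > Means)
print(df.groupby("gender")["satisfaction"].agg(["count", "mean", "std", "min", "max"]))
```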
“For additional help, read Parts 2 & 3 of the SPSS Survival Manual, 4th edition.”
SCREENING AND CLEANING THE DATA FILE
DATA SCREENING
Data screening (jokingly known to us as “data screaming”) ensures your data are “clean” and ready to go before you conduct your planned statistical analyses.
Data must always be screened to ensure they are reliable and valid for testing the type of causal theory you have planned for.
Screening and cooking are not the same thing: screening is like preparing the best ingredients for your gourmet meal!
STATISTICAL PROBLEMS WITH MISSING DATA
If you are missing much of your data, this can cause several problems; e.g., you may not be able to estimate the model.
EFA, CFA, and path models require a certain minimum number of data points in order to compute estimates; each missing data point reduces your valid n by 1.
Greater model complexity (number of items, number of paths) and
improved power require larger samples.
LOGICAL PROBLEM WITH MISSING DATA
Missing data can indicate systematic bias: respondents may not have answered particular questions in your survey because of a common cause (poor formulation, sensitivity, etc.).
For example, if you ask about gender, and females are less likely to report their gender than males, then you will have “male-biased” data.
Perhaps only 50% of the females reported their gender, but 95% of the males reported their gender.
If you use gender as a moderator in your causal models, your results will be heavily biased toward males, because you will not end up using the unreported responses from females.
You may also end up with a biased sample of female respondents.
CASE SCREENING
Cases are the rows in your data set. Screen them for:
Missing data in rows
Unengaged responses
Outliers on continuous variables
Go to Transform > Replace Missing Values to replace the missing values:
use the median for ordinal variables and the mean for continuous/interval-scale variables (e.g., experience);
specifically, replace with the median of nearby points for the ordinal variables and with the series mean for experience.
(A rough code sketch of this kind of imputation follows below.)
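A minimal sketch of this kind of replacement outside SPSS. SPSS's "median of nearby points" and "series mean" options are simplified here to a plain median and mean, and the column names are invented.

```python
import pandas as pd

# Toy data with missing values; column names are invented for illustration.
df = pd.DataFrame({
    "experience":   [2.0, 5.0, None, 10.0, 4.0],   # continuous -> replace with the (series) mean
    "satisfaction": [4.0, None, 3.0, 5.0, 4.0],    # ordinal -> replace with the median
})

df["experience"] = df["experience"].fillna(df["experience"].mean())
df["satisfaction"] = df["satisfaction"].fillna(df["satisfaction"].median())
print(df)
```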
WHAT DO WE REPORT?
Go to Analyze > Descriptive Statistics > Frequencies.
Select all variables except ID.
Request frequencies, skewness and kurtosis.
We don't really need to worry much about skewness for a 5-point Likert scale.
If your skewness value is greater than +1 you are positively (right) skewed; if it is less than -1 you are negatively (left) skewed; if it is in between, you are fine.
Some published thresholds are a bit more liberal and allow for up to +/-2.2 instead of +/-1.
For kurtosis, values over 3 are problematic.
If the absolute value of the skewness/kurtosis is less than three times its standard error, then you are fine; otherwise, you are skewed.
You can delete a problematic item if you have many items for the construct, but if you have only a few, you had better watch it during factor analysis.
(A short code sketch of these checks follows below.)
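A small sketch of the same screening check outside SPSS, using invented item responses; the thresholds in the comments are the rules of thumb quoted above.

```python
import pandas as pd
from scipy import stats

# Invented 5-point Likert responses for one item.
x = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5])

skew = stats.skew(x, bias=False)       # flag if beyond +/-1 (or the more liberal +/-2.2)
kurt = stats.kurtosis(x, bias=False)   # excess kurtosis; values over 3 are problematic
print(f"skewness = {skew:.2f}, kurtosis = {kurt:.2f}")
```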
CORRELATION AND REGRESSION WITH SPSS
WHAT IS A CORRELATION?
[Figure: example scatterplots of a second variable plotted against Age]
MEASURING RELATIONSHIPS
We need to see whether as one variable increases, the other increases, decreases or
stays the same.
This can be done by calculating the Covariance.
We look at how much each score deviates from the mean.
If both variables deviate from the mean by the same amount, they are likely to be related.
Run this data in your SPSS file to check for correlation and regression.
The variance tells us by how much scores deviate from the mean for a
single variable.
It is closely linked to the sum of squares.
Covariance is similar: it tells us by how much scores on two variables differ from their respective means.
COVARIANCE
Calculate the error between the mean and each subject’s score for the first variable
(x).
Calculate the error between the mean and their score for the second variable (y).
Multiply these error values.
Add these values and you get the cross product deviations.
The covariance is the average of the cross-product deviations:

$$\mathrm{Cov}(x,y)=\frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{N-1}$$
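A quick numerical check of the formula, using made-up paired scores:

```python
import numpy as np

# Made-up paired scores (x, y) to illustrate the cross-product-deviation formula.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)  # sum of cross-products / (N - 1)
cov_numpy = np.cov(x, y)[0, 1]                                       # numpy's sample covariance (same value)
print(cov_manual, cov_numpy)
```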
PROBLEMS WITH COVARIANCE
Covariance depends on the units in which the variables are measured, so we standardise it by dividing by the product of the standard deviations. The result is the correlation coefficient:

$$r=\frac{\mathrm{Cov}_{xy}}{s_x s_y}=\frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{(N-1)\,s_x s_y}$$
THINGS TO KNOW ABOUT THE CORRELATION
It is an effect size
±.1 = small effect
±.3 = medium effect
±.5 = large effect
Coefficient of determination, r²
By squaring the value of r you get the proportion of variance in the DV shared with the variance in the IV.
It is also an effect size.
Coefficient of alienation, 1 - r²
It is the proportion of variance in the dependent variable (Y) unexplained by variance in the independent variable (X).
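A short sketch, on made-up data, showing r, the coefficient of determination and the coefficient of alienation side by side:

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # made-up predictor scores
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])    # made-up outcome scores

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f} (p = {p:.3f})")
print(f"coefficient of determination r^2 = {r**2:.3f}")       # shared variance
print(f"coefficient of alienation 1 - r^2 = {1 - r**2:.3f}")  # unexplained variance
```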
CORRELATION AND CAUSALITY
SCATTERPLOT
TYPES OF CORRELATION
ASSUMPTIONS OF PEARSON’S R
CORRELATION: EXAMPLE
QRM file
INTERPRETATION
USING R² FOR INTERPRETATION
Although we cannot make direct conclusions about causality from a correlation, we can take the correlation
coefficient a step further by squaring it.
The correlation coefficient squared (known as the coefficient of determination, R2) is a measure of the
amount of variability in one variable that is shared by the other.
(.288)² = .083. This value tells us how much of the variability in BL is shared by PQ.
Although PQ was correlated with BL, it can account for only 8.3% of the variability in BL.
To put this value into perspective, this leaves 91.7% of the variability still to be accounted for by other
variables.
Note that although R2 is an extremely useful measure of the substantive
importance of an effect, it cannot be used to infer causal relationships.
Although we usually talk in terms of ‘the variance in y accounted for by x’, or even the variation in one variable explained by the other,
this still says nothing about which way causality runs. So, although PQ can
account for 8.3% of the variation in BL, it does not necessarily cause this
variation.
REPORTING THE RESULTS: EXAMPLE
(POINT-)BISERIAL CORRELATION
R² = (0.378)² = .143. Hence, we can conclude that gender accounts for 14.3% of the variability in time spent away from home.
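An illustrative sketch with invented data (gender coded 0/1 and a continuous "time away from home" score); the slide's actual values come from the lecture data set, not from this code.

```python
import numpy as np
from scipy import stats

gender = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])                                    # invented binary variable
time_away = np.array([12.0, 10.0, 15.0, 22.0, 18.0, 25.0, 20.0, 11.0, 24.0, 14.0])   # invented continuous variable

r_pb, p = stats.pointbiserialr(gender, time_away)
print(f"point-biserial r = {r_pb:.3f}, r^2 = {r_pb**2:.3f}, p = {p:.3f}")
```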
NONPARAMETRIC CORRELATION
Spearman’s Rho
Pearson’s correlation on the ranked data
Kendall’s Tau
Better than Spearman’s for small samples
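A minimal sketch of both coefficients on made-up, ordinal-style data:

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]   # made-up rankings
y = [2, 1, 4, 3, 6, 5, 8, 7]

rho, p_rho = stats.spearmanr(x, y)    # Pearson's correlation applied to the ranks
tau, p_tau = stats.kendalltau(x, y)   # often preferred over rho for small samples
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f}); Kendall tau = {tau:.3f} (p = {p_tau:.3f})")
```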
• We need to find a “model” that has the least error variance and best fits the data.
• If the value of SSM is small, then using the regression model is little better than using the mean as the model.
• How are they related? Mathematically, SST = SSM + SSR: the total sum of squares equals the model (regression) sum of squares plus the residual (error) sum of squares.
• Advertising expenditure can account for 33.5% of the variation in
record sales.
• 66.5% of the variation in record sales cannot be explained by advertising alone.
• Therefore, there must be other variables that have an influence also.
• F is 99.59, which is significant at p < .001
• This result tells us that there is less than a 0.1% chance that an F-ratio this large would
happen if the null hypothesis were true.
• Therefore, we can conclude that our regression model results in significantly better
prediction of record sales than if we used the mean value of record sales.
• In short, the regression model overall predicts record sales significantly well
• The ANOVA tells us whether the model, overall, results in a significantly good degree of
prediction of the outcome variable.
• However, the ANOVA doesn’t tell us about the individual contribution of variables in the
model
• Intercept is b0 = 134.14, and this can be interpreted as meaning that when no money is spent on advertising (when X
= 0), the model predicts that 134,140 records will be sold
• b1 is 0.096. Although this value is the slope of the regression line, it is more useful to think of it as
• the change in the outcome associated with a unit change in the predictor.
• If our predictor variable is increased by one unit (if the advertising budget is increased by 1), then our model predicts that 0.096
extra records will be sold.
• For an increase in advertising of £1000 the model predicts 96 (0.096 × 1000 = 96) extra record sales.
• Is this good for the company?
• As you might imagine, this investment is pretty bad for the record company: it invests £1000 and gets only
96 extra sales.
• Fortunately, as we already know, advertising accounts for only one-third of record sales
• R square = .335, which tells us that advertising expenditure can account for 33.5% of the
variation in record sales.
• Beta = the change in the outcome associated with a unit change in the predictor = if our
independent variable is increased by 1 unit, our model predicts that 0.096 extra records will be sold.
• t-test = tests the null hypothesis that the value of beta is 0: therefore, if it is significant,
• we reject the null hypothesis and conclude that the beta value is significantly different from zero, and
• the predictor variable contributes significantly to our ability to estimate values of the outcome.
• Like F, the t-statistic is also based on the ratio of explained variance against
unexplained variance or error.
Using the model
$$\text{record sales}_i = b_0 + b_1\,\text{advertising budget}_i = 134.14 + (0.096 \times \text{advertising budget}_i)$$

The general multiple regression equation is:

$$y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \varepsilon_i$$
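A sketch of where these numbers come from and how the fitted model is used. The data below are simulated to resemble the advertising/record-sales example (they are not the original data file), so the estimates will only roughly match the slide values.

```python
import numpy as np
import statsmodels.api as sm

# Simulated advertising budgets (thousands of pounds) and record sales (thousands of units).
rng = np.random.default_rng(1)
adverts = rng.uniform(0, 2000, 200)
sales = 134 + 0.1 * adverts + rng.normal(0, 60, 200)

X = sm.add_constant(adverts)          # adds the intercept column (b0)
model = sm.OLS(sales, X).fit()
print(model.params)                   # b0 (intercept) and b1 (slope)
print(model.rsquared)                 # proportion of variance in sales explained
print(model.fvalue, model.f_pvalue)   # the ANOVA F-test for the overall model

# Using the model: predicted sales for an advertising budget of 100 (i.e., £100,000)
b0, b1 = model.params
print(b0 + b1 * 100)
```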
b0
• b0 is the intercept.
• The intercept is the value of the Y variable when
all Xs = 0.
Beta Values
Doing Multiple Regression
Methods of Regression
• Hierarchical (Blockwise Entry): based on past research findings, the experimenter decides the order in which variables are entered into the model.
• Forced Entry:
• All predictors are entered simultaneously.
• Stepwise:
• Predictors are selected using their semi-partial correlation with the outcome.
• An exploratory approach.
Regression Statistics
Regression Diagnostics
Output: Model Summary
Output: ANOVA
Analysis of Variance: ANOVA
• The F-test
• looks at whether the variance explained by the model (SSM)
is significantly greater than the error within the model
(SSR).
• It tells us whether using the regression model is
significantly better at predicting values of the outcome
than using the mean.
• Regression is used to predict a continuous outcome on the
basis of one or more continuous predictor variables.
• Whereas ANOVA is used to predict a continuous outcome on
the basis of one or more categorical predictor variables.
Output: betas
How to Interpret Beta Values
• Beta values:
• the change in the outcome associated with a unit
change in the predictor.
• Standardised beta values:
• tell us the same but expressed as standard
deviations.
Constructing a Model
$$y = b_0 + b_1 X_1 + b_2 X_2$$

$$\text{Sales} = 41124 + 0.087\,\text{Adverts} + 3589\,\text{Plays}$$
Standardised Beta Values
• They tell us the number of standard deviations that the outcome will
change as a result of one standard deviation change in the predictor.
• The standardized beta values are all measured in standard deviation
units and so are directly comparable: therefore, they provide a better
insight into the ‘importance’ of a predictor in the model
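A small sketch, on simulated two-predictor data (not the original sales file), showing that a standardised beta is just the unstandardised b rescaled by sd(predictor)/sd(outcome):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
adverts = rng.normal(500, 200, 300)                                       # simulated predictor 1
plays = rng.normal(30, 10, 300)                                           # simulated predictor 2
sales = 40000 + 0.09 * adverts + 3500 * plays + rng.normal(0, 5000, 300)  # simulated outcome

fit = sm.OLS(sales, sm.add_constant(np.column_stack([adverts, plays]))).fit()
b_adverts, b_plays = fit.params[1], fit.params[2]

# Standardised beta = b * sd(X) / sd(Y): expressed in standard-deviation units, so directly comparable
beta_adverts = b_adverts * adverts.std(ddof=1) / sales.std(ddof=1)
beta_plays = b_plays * plays.std(ddof=1) / sales.std(ddof=1)
print(beta_adverts, beta_plays)
```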
Reporting the Model
Logistic Regression
SIMPLE REGRESSION VS. LOGISTIC REGRESSION
• We can use regression to predict future outcomes based on past data, when
the outcome is a continuous variable
• Logistic regression is an extension of regression that allows us to predict categorical outcomes based on predictor variables.
• We can predict which of two categories a person is likely to belong to given certain
other information.
When and Why
• To predict an outcome variable that is categorical
from one or more categorical or continuous
predictor variables.
• Used because having a categorical outcome
variable violates the assumption of linearity in
normal regression.
Assessing the Model: the log-likelihood statistic
• As in linear regression, we want to know not only how well the model overall fits the
data, but also the individual contribution of predictors.
• In linear regression, we used the estimated regression coefficients (b) and their standard
errors to compute a t-statistic.
• In logistic regression there is an analogous statistic known as the Wald statistic, which
has a special distribution known as the chi-square distribution.
• Wald statistic tells us whether the b coefficient for that predictor is significantly different
from zero. If the coefficient is significantly different from zero then we can assume that
the predictor is making a significant contribution to the prediction of the outcome (Y):
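A rough sketch with simulated data (a 0/1 price condition predicting a binary size choice; these are not the study's data) showing the b coefficient, its Wald statistic and the odds ratio:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
price = rng.integers(0, 2, 300)                        # 0 = linear pricing, 1 = supersized pricing
p_large = 1 / (1 + np.exp(-(-1.0 + 1.2 * price)))      # true model used only to simulate choices
choice = rng.binomial(1, p_large)                      # 1 = chose the large size

fit = sm.Logit(choice, sm.add_constant(price)).fit(disp=False)
print(fit.params)              # b coefficients (log-odds)
print(fit.tvalues ** 2)        # Wald statistics (z squared, chi-square distributed with 1 df)
print(np.exp(fit.params))      # odds ratios, exp(b)
```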
EXAMPLE
We do need to decide whether to use the first category or the last category as our baseline.
• Choose Forward: LR because the study is the first in the field, so we have no past research to tell us which variables to expect to be reliable predictors, and this method follows a stepwise procedure.
• By default, SPSS uses Indicator coding, which is standard dummy coding (0 and 1).
• 236.992 represents the fit of the most basic
model to the data. When including
only the constant, the computer bases the model
on assigning every participant to a single
category of the outcome variable.
The price condition was significant (b = 1.178, Wald = 6.694, p < .05) with an OR = 3.248, which suggests that participants in the supersized pricing condition were more than three times as likely to choose the larger bottle, supporting H1. More than 56% of participants chose the large bottle size in the supersized pricing condition, compared with 26.7% in the linear pricing condition, demonstrating the influence of supersized pricing on consumers' size-choice decisions.
The nutritional label condition was not significant (b = -.110, Wald = .052, p > .82) but, more importantly, the two-way interaction between the price and nutritional label conditions was significant (b = -1.318, Wald = 3.851, p = .05). This means that the nutritional label alone does not predict whether a person will choose the large size or the small size, but its interaction with price does.
Interestingly, in the control condition (no nutritional label manipulation), the effect of the price condition was significant (OR = 3.536, p < .01). In contrast, in the nutritional label condition, there was no effect of the price condition (OR = .856, ns) on bottle size selection. Therefore, we can conclude that the presence of a nutritional label is responsible for the enhanced health salience that caused the influence of supersized pricing on size choice to decrease: the choice of the larger bottle size was significantly reduced from more than 56% to 21.7% when participants in the supersized pricing condition were exposed to the nutritional label.
Moreover, liking for the product (b = .296, ns), brand preference (b = .339, ns), attitude towards the product (b = -.565, ns) and level of thirst (b = -.014, ns) were not significant. This suggests that supersized pricing is such a substantial influence on consumers' package-size decisions that it overrides many other factors often considered important in food-choice decisions, such as whether the consumer likes the product, has a special preference for the brand, is thirsty, or holds a particular attitude towards the brand.
Split-file check for individual results (nutritional label):
In the control condition (no nutritional label manipulation), the effect of the price condition was significant (OR = 3.536, p < .01). In contrast, in the nutritional label condition, there was no effect of the price condition (OR = .856, ns) on bottle size selection.
The choice of the larger bottle size was significantly reduced from 56.3% to 21.7% when participants in the supersized pricing condition were exposed to the nutritional label.
Chi-square test
(Table note: bold values signify conditions with significant differences (p < .05) in size choice based on pricing method.)
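A rough sketch of the chi-square test behind such a comparison, using an invented 2×2 crosstab of pricing condition by size choice (the counts are hypothetical, not the study's):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: pricing condition (linear, supersized); columns: size chosen (large, regular). Counts are invented.
table = np.array([[16, 44],
                  [34, 26]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```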
STATISTICAL TECHNIQUES TO COMPARE GROUPS
COMPARING MEANS: Independent-samples t-test
• Tests the difference between two means
• Independent means: each person has been measured only once
• The DV has been measured on a continuous scale
• Levene's test assesses the homogeneity of variance assumption
• The t-test assumes that the variance (or SD) in both groups/samples is the same
• They don't have to be exactly the same, but they have to be similar enough that they are not significantly different from each other
• The t-test is relatively robust to violations of this assumption, but you should still test and report it in the results
• If F is significant, the variances of the two groups are not equal
• If the F value is not significant, the homogeneity of variance assumption holds
(A short code sketch of this check follows below.)
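A short sketch of the same two-step check (Levene's test, then the independent-samples t-test) on invented scores for two groups; center='mean' mimics SPSS's mean-based Levene statistic.

```python
import numpy as np
from scipy import stats

group1 = np.array([3.2, 3.8, 4.1, 3.5, 3.9, 4.4, 3.7])   # invented scores, group 1
group2 = np.array([2.9, 3.1, 3.6, 2.8, 3.3, 3.0, 3.4])   # invented scores, group 2

lev_F, lev_p = stats.levene(group1, group2, center='mean')   # homogeneity of variance
equal_var = lev_p > .05                                      # not significant -> assume equal variances
t_stat, t_p = stats.ttest_ind(group1, group2, equal_var=equal_var)
print(f"Levene F = {lev_F:.2f} (p = {lev_p:.3f}); t = {t_stat:.2f} (p = {t_p:.3f})")
```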
[Output note: there is a significant difference between the two scores (FS at t1 and t2).]
ONE-WAY Analysis of Variance (ANOVA)
• An asterisk (*) means that the two groups being compared are significantly different from one another
• Here, g1 and g3 are statistically significantly different from each other (a code sketch of the omnibus test and post-hoc comparisons follows below)
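A rough sketch of the omnibus F-test and Tukey post-hoc comparisons on invented data for three groups (the starred pairs in SPSS correspond to "reject = True" rows in the Tukey table):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

g1 = np.array([3.1, 3.5, 3.8, 3.2, 3.6])   # invented group scores
g2 = np.array([3.4, 3.7, 3.9, 3.5, 3.8])
g3 = np.array([4.2, 4.6, 4.4, 4.8, 4.5])

F, p = stats.f_oneway(g1, g2, g3)           # omnibus one-way ANOVA
print(f"F = {F:.2f}, p = {p:.4f}")

scores = np.concatenate([g1, g2, g3])
groups = ["g1"] * 5 + ["g2"] * 5 + ["g3"] * 5
print(pairwise_tukeyhsd(scores, groups))    # post-hoc pairwise comparisons
```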
One-way Repeated Measures ANOVA
• Each subject is exposed to two or more different conditions,
• or measured on the same continuous scale on three or more occasions
• Can also be used to compare respondents' responses to two or more different questions
• Questions should be measured using the same scale, e.g., 1 = Strongly Disagree to 5 = Strongly Agree
(A short code sketch follows below.)
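A minimal sketch of a one-way repeated measures ANOVA on invented long-format data (one confidence score per participant at each of three time points):

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

data = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],          # participant identifier
    "time":  ["t1", "t2", "t3"] * 4,                         # within-subjects factor
    "score": [2.0, 3.0, 4.0, 2.5, 3.5, 4.5, 3.0, 3.5, 4.0, 2.0, 2.5, 3.5],
})

print(AnovaRM(data, depvar="score", subject="id", within=["time"]).fit())
```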
Multivariate tests
• All of these tests yield the same conclusion, but the most commonly reported is Wilks' Lambda, which here shows a statistically significant effect of time
• There was a change in confidence scores across the three time periods
• Options > Pairwise comparisons