Psy 234 Investigating Relationships Week 11

Investigating
relationships
Week 11
Two categorical
variables
Are boys more likely to prefer maths and science than girls?
Variables:
• Favourite subject (Nominal)
• Gender (Binary/ Nominal)
Summarise using %’s/ stacked or multiple bar charts

Test: Chi-squared
Tests for a relationship between two categorical variables
Scatterplot
Relationship between two scale variables:
ØExplores the way the two co-

vary: (correlate)
₋ Positive / negative
₋ Linear / non-linear
Outlier
₋ Strong / weak
ØPresence of outliers
ØStatistic used:
r = correlation coefficient
Linear
Correlation Coefficient r
} Measures strength of a relationship between two
continuous variables -1 ≤ r ≤ 1
r = 0.9
r = 0.01
r = -0.9
Correlation Interpretation
An interpretation of the size of the coefficient has been
described by Cohen (1992) as:
Correlation coefficient value Relationship
-0.3 to +0.3 Weak

-0.5 to -0.3 or 0.3 to 0.5 Moderate
-0.9 to -0.5 or 0.5 to 0.9 Strong
-1.0 to -0.9 or 0.9 to 1.0 Very strong
Cohen, L. (1992). Power Primer. Psychological Bulletin,

112(1) 155-159
Does chocolate make you clever or crazy?
} A paper in the New England Journal of Medicine claimed a relationship
between chocolate and Nobel Prize winners
r = 0.791
http://www.nejm.org/doi/full/10.1056/NEJMon1211064
Chocolate and serial killers
} What else is related to chocolate consumption?
r = 0.52
http://www.replicatedtypo.com/chocolate-consumption-traffic-
accidents-and-serial-killers/5718.html
www.statstutor.ac.uk
Hypothesis tests for r
Tests the null hypothesis that the population
correlation r = 0 NOT that there is a strong
relationship!
It is highly influenced by the number of

observations e.g. sample size of 150 will classify a
correlation of 0.16 as significant!
Better to use Cohen’s interpretation

Exercise
• Interpret the following correlation coefficients using Cohen’s and
explain what it means
Relationship Correlation
Average IQ and chocolate consumption 0.27
Road fatalities and Nobel winners 0.55
Gross Domestic Product and Nobel winners 0.7
Mean temperature and Nobel winners -0.6
Exercise - solution
Relationship Correlation Interpretation
Average IQ and chocolate 0.27 Weak positive relationship. More

consumption chocolate per capita = higher average
IQ
Road fatalities and Nobel 0.55 Strong positive. More accidents = more
winners prizes!
Gross Domestic Product and 0.7 Strong positive. Wealthy countries =

Nobel winners more prizes
Mean temperature and -0.6 Strong negative. Colder countries =

Nobel winners more prizes.
Confounding
Is there something else affecting both chocolate
consumption and Nobel prize winners?
Chocolate Number of
consumption Nobel winners
GDP (wealth)
Temperature
Dataset for today
• Factors affecting birth weight of babies
Mother smokes
=1
Standard gestation = 40 weeks

Exercise: Gestational age and birth weight
a) Describe the relationship between the gestational age of a
baby and their weight at birth.
r = 0.706
b) Draw a line of best fit

through the data (with
roughly half the points
above and half below)
Exercise - Solution
Describe the relationship between the gestational age
of a baby and their weight at birth.
There is a strong positive

relationship which is
linear
• Regression is useful when we want to
• look for significant relationships between two
variables
• predict a value of one variable for a given value of the
other
• It involves estimating the line of best fit through the data

which minimises the sum of the squared residuals
• What are the residuals?

Residuals
• Residuals are the differences between the
observed and predicted weights
Residuals
Baby heavier Baby lighter
than predicted than expected
Regression
line Baby the same
y=a+bx as predicted
X Predictor / explanatory variable (independent variable)

Regression
Simple linear regression looks at the relationship
between two Scale variables by producing an equation for
a straight line of the form
Independent
y = a + bx
Dependent variable variable
Intercept Slope
Which uses the independent variable to predict the

dependent variable
Hypothesis testing
• We are often interested in how likely we are to obtain our
estimated value of if there is actually no relationship
between x and y in the population
One way to do this is to do a

test of significance for the
slope
H0 : b = 0
Output from SPSS
• Key regression table:
Y = -6.66 + 0.36x P – value < 0.001
• As p < 0.05, gestational age is a significant predictor of

birth weight. Weight increases by 0.36 lbs for each
week of gestation.
How reliable are predictions? – R 2
How much of the variation in birth weight is explained by the
model including Gestational age?
Proportion of the variation in birth weight explained by

the model R2 = 0.499 = 50%
Predictions using the model are fairly reliable.
Which variables may help improve the fit of the model?

Compare models using Adjusted R2
Assumptions for regression
Assumption Plot to check
The relationship between the independent Original scatter plot of the

and dependent variables is linear. independent and
dependent variables
Homoscedasticity: The variance of the Scatterplot of
residuals about predicted responses standardised predicted
should be the same for all predicted values and residuals
responses.
The residuals are independently normally Plot the residuals in a
distributed histogram
Look for patterns.
Checking normality
Histogram of the residuals
looks approximately
normally distributed
When writing up, just say

‘normality checks were
carried out on the residuals
and the assumption of
normality was met’
Outliers are outside

Predicted values against residuals
Are there any patterns as the predicted values increases?
There is a problem with Homoscedasticity if the scatter is

not random. A “funnelling” shape such as this suggests
problems.
What if assumptions are not met?
} If the residuals are heavily skewed or the residuals show different
variances as predicted values increase, the data needs to be
transformed
} Try taking the natural log (ln) of the dependent variable. Then
repeat the analysis and check the assumptions
Exercise
• Investigate whether mothers pre-pregnancy weight and birth
weight are associated using a scatterplot, correlation and simple
regression.
Exercise - scatterplot
• Describe the relationship using the scatterplot and
correlation coefficient
r = 0.39
Regression question
• Pre-pregnancy weight p-value:

• Regression equation:
• Interpretation:
R2 = 0.152
Does the model result in reliable predictions?
Check the assumptions
Correlation
• Pearson’s correlation = 0.39
• Describe the relationship using the scatterplot and correlation

coefficient
• There is a moderate positive linear relationship between mothers’

pre-pregnancy weight and birth weight (r = 0.39). Generally, birth
weight increases as mothers weight increases
Regression
Pre-pregnancy weight p-value: p = 0.011
• Regression equation: y = 3.16 + 0.03x
• Interpretation:
• There is a significant relationship between a mothers’ pre-
pregnancy weight and the weight of her baby (p = 0.011). Pre-
pregnancy weight has a positive affect on a baby’s weight with
an increase of 0.03 lbs for each extra pound a mother weighs.
• Does the model result in reliable predictions?

• Not really. Only 15.2% of the variation in birth weight is
accounted for using this model.
Checking assumptions
• Linear relationship
• Histogram roughly peaks in the middle
• No patterns in residuals
Multiple regression
} Multiple regression has several binary or Scale
independent variables
y = a + b1 x1 + b 2 x2 + b 3 x3
} Categorical variables need to be recoded as

binary dummy variables
} Effect of other variables is removed (controlled
for) when assessing relationships
Multiple regression
What affects the number of Nobel prize winners?
Dependent: Number of Nobel prize winners
Possible independents: Chocolate consumption, GDP and mean

temperature
} Chocolate consumption is significantly related to Nobel prize

winners in simple linear regression
} Once the effect of a country’s GDP and temperature were
taken into account, there was no relationship
Multiple regression
• In addition to the standard linear regression checks, relationships
BETWEEN independent variables should be assessed
• Multicollinearity is a problem where continuous independent
variables are too correlated (r > 0.8)
• Relationships can be assessed using scatterplots and correlation
for scale variables
• SPSS can also report collinearity statistics on request. The VIF
should be close to 1 but under 5 is fine whereas 10 + needs
checking
Exercise
• Which variables are most strongly related?
Exercise - Solution
• Which variables are most strongly related?
• Gestation and birth weight (0.709)
• Mothers height and weight (0.671)

Mothers height and weight are strongly related. They don’t exceed
the problem correlation of 0.8 but try the model with and without
height in case it’s a problem.
• When both were included in regression, neither were significant
but alone they were
Logistic regression
} Logistic regression has a binary dependent variable
} The model can be used to estimate probabilities
} Example: insurance quotes are based on the likelihood of you

having an accident
} Dependent = Have an accident/ do not have accident
} Independents: Age (preferably Scale), gender, occupation,
marital status, annual mileage
} Ordinal regression is for ordinal dependent variables

Psy 234 Investigating Relationships Week 11

Uploaded by

Copyright:

Available Formats

Psy 234 Investigating Relationships Week 11

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Psy 234 Investigating Relationships Week 11

Uploaded by

Copyright:

Available Formats

Investigating

Summarise using %’s/ stacked or multiple bar charts

ØExplores the way the two co-

Correlation coefficient value Relationship

-0.3 to +0.3 Weak

Cohen, L. (1992). Power Primer. Psychological Bulletin,

It is highly influenced by the number of

Better to use Cohen’s interpretation

Average IQ and chocolate 0.27 Weak positive relationship. More

Gross Domestic Product and 0.7 Strong positive. Wealthy countries =

Mean temperature and -0.6 Strong negative. Colder countries =

Standard gestation = 40 weeks

b) Draw a line of best fit

There is a strong positive

• It involves estimating the line of best fit through the data

• What are the residuals?

X Predictor / explanatory variable (independent variable)

Which uses the independent variable to predict the

One way to do this is to do a

Y = -6.66 + 0.36x P – value < 0.001

• As p < 0.05, gestational age is a significant predictor of

Proportion of the variation in birth weight explained by

Which variables may help improve the fit of the model?

The relationship between the independent Original scatter plot of the

When writing up, just say

Outliers are outside

There is a problem with Homoscedasticity if the scatter is

• Pre-pregnancy weight p-value:

• Describe the relationship using the scatterplot and correlation

• There is a moderate positive linear relationship between mothers’

• Does the model result in reliable predictions?

} Categorical variables need to be recoded as

Dependent: Number of Nobel prize winners

Possible independents: Chocolate consumption, GDP and mean

} Chocolate consumption is significantly related to Nobel prize

• Mothers height and weight (0.671)

} Example: insurance quotes are based on the likelihood of you

} Ordinal regression is for ordinal dependent variables

You might also like