An Introduction To Logistic Regression in R
words the dependent variable does not vary continuously. Using the child-is-male outcome from
above, for 1000 records on this TRUE/FALSE (binary) response what we have are proportions and
not means or variances. This is not to say that we could not have used a continuous measure to
indicate a child's masculinity; although such an approach would provide more power to detect
relationships among covariates, nature is not so obliging. As a proportion the dependent variable is
bounded by 0 and 1. A log transformation of the dependent variable frees up this boundary
constraint, and this becomes the defining line between logistic regression and OLS. By log
transforming the data we are essentially taking the natural log of the odds of the dependent variable occurring
or not. Thus, logistic regression estimates the odds of a certain event occurring (in this case the
probability of the child being male or female).
n <- 200
z <- c(rep(0,132), rep(1,268))
x <- c(rnorm(n), 1 + rnorm(n))
y <- c(rep(0,n), rep(1,n))
par(mfrow=c(1,2))
op <- par(lwd=2, cex.main=1.3, cex.lab=1.3, cex.axis=1.3)
plot(y~z, main="Figure 1: Plot of Dichotomous DV & Binary IV", xlab="Cry Baby", ylab="Child is a male: 0=No,1=Yes")
abline(lm(y~z), col='red')
plot(jitter(y)~x, main="Figure 2: Plot of Dichotomous DV & Continuous IV", xlab="# of poops per day", ylab="Child is a male: 0=No,1=Yes")
abline(lm(jitter(y)~jitter(x)), col='blue')
[Figures 1 and 2: plots of the dichotomous DV (Child is a male: 0=No, 1=Yes) against the binary IV (Cry Baby) and the continuous IV (# of poops per day), each with a fitted regression line.]
This equation is the typical setup of a linear model. In logistic regression we model the probability (p)
of an outcome in our dependent variable.

Probability (Child is Male) = b0 + b1X1 + ... + bnXn ------------ equation 1.

As stated above, because the left hand side of the above equation is bounded and the right hand
side is not, we transform it by taking the natural log of the odds.

Logit (p) = Log (odds) = log (p / (1-p)) ---------------------- equation 2.

We can alternatively represent equation 2 as this:

Logit (probability of child is male) = log (p(child is male) / (1 - p(child is male))) ------ equation 3.

The quotient of the two probabilities is typically referred to as the odds of the event occurring.
[Figure 3: plot of the logit(x) function.]
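The logit transformation itself is easy to visualize. The sketch below is not part of the original script; the function and variable names are our own.

```r
# Sketch: plot the logit transformation over the open unit interval
logit <- function(p) log(p / (1 - p))
p <- seq(0.01, 0.99, by = 0.01)
plot(p, logit(p), type = "l", main = "The logit transformation",
     xlab = "p", ylab = "logit(p)")
```

Note how the transformation stretches probabilities near 0 and 1 onto the whole real line, which is what frees the model from the 0-1 boundary.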
responses. Logistic regression, using maximum likelihood estimation, models these data points with
more precision.
Risky Sex     Frequency
Unsafe (1)          110
Safe (0)             90
Total               200
Given these results there are several deductions that could be made.
The probability that a person had unsafe sex in this population: 110/200 = 0.55
The probability of safe sex = 1 - probability of unsafe sex: 1-0.55 = 0.45 (i.e., 90/200)
The odds that a person had unsafe sex as opposed to safe sex in this population = 110/90 =
1.22:1. Likewise the odds of safe sex is the inverse = 90/110 = 0.82:1
          Safe (0)   Unsafe (1)   Total
Female          80           20     100
Male            10           90     100
Total           90          110     200
Again, we can ask the same questions as above, but now we can also test if the risk is significantly
different between males and females.
The odds of males having unsafe sex in the past month is 90/10 = 9:1
The odds of females having unsafe sex in the past month is 20/80 = 0.25:1
The odds-ratio of males to females for having had unsafe sex in the past month is 9/0.25 = 36.
Males are 36 times more likely to have had unsafe sex in the past month than females.
Using the formula (p/(1-p))/(q/(1-q)), where p is the probability of the event
occurring in the first group, and q is the probability of the event occurring in the second group:
((90/100) / (10/100)) / ((20/100) / (80/100)) = (0.9*0.8) / (0.1*0.2) = 0.72/0.02 = 36.
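The same arithmetic can be checked in R. This is only a sketch of the hand calculation above; p and q here stand for the unsafe-sex probabilities in males and females.

```r
# Odds-ratio of unsafe sex, males versus females, from the 2x2 table
p <- 90/100                                 # probability of unsafe sex in males
q <- 20/100                                 # probability of unsafe sex in females
odds.males   <- p / (1 - p)                 # 9
odds.females <- q / (1 - q)                 # 0.25
odds.ratio   <- odds.males / odds.females   # 36
```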
Part 2: Generating contingency tables of your data and simple Chi-Square testing
Let's start with how to create contingency tables of our data in R. If you are familiar with SAS, STATA,
or SPSS you typically just call up a cross-tabulated frequency table command. In R these tables are
called "FLAT TABLES" and the concept behind their creation is the same.
We will start by creating a dataframe for each example that contains the information in a person-level format. The resultant risky sex variable in the dataframes will be coded as a binary variable
where a 0 indicates the absence of risky sexual practices by that individual and a 1 indicates the
admission of unsafe sex by that person in the past month.
For example 1:
ex1.unsafe <- rep(1,times=110)   #Prevalence of unsafe sex
ex1.safe <- rep(0,times=90)      #Prevalence of safe sex
ex1 <- data.frame(risky_sex=c(ex1.safe,ex1.unsafe))
ex1[,1] <- as.ordered(ex1[,1])
NOTE: You must set up your dependent variable as an ordered factor. This is important
because (1) you cannot order Yes or No categories and (2) it will facilitate the
identification of your predicted outcome level (i.e. Yes versus No) when fitting logit models.
For example 2:
Ladies <- "Female"
Gentlemen <- "Male"
gender <- c(Ladies,Gentlemen)    #Later we will have to convert the gender variable into a contrast code.
fem.safe <- rep(0,times=80)      #Prevalence of safe sex in females
fem.unsafe <- rep(1,times=20)    #Prevalence of unsafe sex in females
male.safe <- rep(0,times=10)     #Prevalence of safe sex in males
male.unsafe <- rep(1,times=90)   #Prevalence of unsafe sex in males
ex2 <- data.frame(risky_sex=c(fem.safe,fem.unsafe,male.safe,male.unsafe), gender=rep(gender,each=100))
ex2[,1] <- as.ordered(ex2[,1])
Now that we have the data we can proceed to create a frequency table of each of the ordered
categories. In each example we are interested in the prevalence of Risky Sex Behavior. Frequency
tables can be generated using the table() and ftable() commands in R. Both functions produce the
same results, but the ftable() command offers greater control over the structure of your table by
allowing you to choose the independent variables you are interested in, and how they will create
levels in your dependent variable. It is also important to note that the ftable() command is sensitive to
(1) the number of dimensions of your table (at least two variables are required), and (2) the class of
the object which you are attempting to represent as a table. The tables below demonstrate the use of
the table() and ftable() commands using dataframes. You can also save your tables by assigning the
result of a table command to an available object name.
?table
table(ex1)

ex1
  0   1
 90 110

table(ex2)

         gender
risky_sex Female Male
        0     80   10
        1     20   90

ftable(ex2)

          gender Female Male
risky_sex
0                    80   10
1                    20   90
#Chi-Square Testing
?chisq.test
chisq.test(table(ex1))
Chi-squared test for given probabilities
data: table(ex1)
X-squared = 2, df = 1, p-value = 0.1573
table.2 <- table(ex2)
chisq.test(table.2)

Pearson's Chi-squared test with Yates' continuity correction
data: table.2
X-squared = 96.1818, df = 1, p-value < 2.2e-16
As designs become more complex with the use of continuous, ordinal, dichotomous, or combinations of
independent variables, the logistic regression method becomes more applicable. We will get to those
complexities in the next example, but for now you should note that the logistic regression model
generates the same results as the Chi-square tests of the first and second examples. The script will
also serve as the basis for executing logistic models using the glm() function.
#For example 1
ex1.test <- glm (risky_sex ~ 1,family=binomial(logit), data=ex1)
#For example 2
gender.code <- (ex2$gender=="Male")*1 - 0.5   #Create a contrast code for dichotomous IVs: Male=0.5, Female=-0.5
ex2.test <- glm(risky_sex ~ gender.code, family=binomial(logit), data=ex2)
summary(ex1.test);summary (ex2.test)
Note that the specification of the model is unchanged, with the exception of the family function being
set to binomial, and the method of binomial analysis being set to logit. Another important aspect of
the analysis is the coding of your dichotomized or leveled factors. This is very important when it
comes to interpreting the results of the models.
First we must recode gender in a manner that makes its parameter estimate interpretable. Using -0.5
for females and 0.5 for males, the interpretation for the gender parameter estimate would be, the
amount by which the log of the odds of having risky sex in the past month is greater/lesser in males
than females.
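The recoding described above can be sketched in a few lines; the gender vector here is a stand-in for ex2$gender, so names are illustrative only.

```r
# Map the character gender variable onto a -0.5/0.5 contrast code
gender <- rep(c("Female", "Male"), each = 100)
gender.code <- (gender == "Male") * 1 - 0.5   # Female -> -0.5, Male -> 0.5
table(gender, gender.code)                    # verify the mapping
```

With this coding the intercept is the average logit across the two groups, and the slope is the male-female difference in logits.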
The next set of scripts uses the summary() command to view the results of the fitted model.
summary(ex1.test)

Call:
glm(formula = risky_sex ~ 1, family = binomial(logit), data = ex1)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.264  -1.264   1.093   1.093   1.093

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.2007     0.1421   1.412    0.158

summary(ex2.test)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.4055     0.2083   1.946   0.0516 .
gender.code   3.5835     0.4166   8.601   <2e-16 ***
anova(ex2.test)

Analysis of Deviance Table

            Df Deviance Resid. Df Resid. Dev
NULL                          199     275.26
gender.code  1   110.16       198     165.10
This comparison is always present in the summary of the model; however, if you would like to
compare other models that you have tested and saved as objects you can do so with the anova()
command.
How to interpret the coefficient of the intercept only model: The log of the odds of having had
risky sex in the past month is 0.20. You can also choose to convert the logit (log of the odds) back to
odds by taking the exponent of the intercept (exp(0.20) = 1.22:1). You can also convert the
odds back to probabilities:

Odds = p/(1-p)
1.22 = p/(1-p)
1.22(1-p) = p
1.22 - 1.22p = p
1.22 = 2.22p
p = 1.22/2.22 = 0.55

Coupled with the Wald χ2 and p-value we know that the logit of 0.20 is not significantly different from 0. This
can also be interpreted as: the probabilities 0.55 and 0.45 are not significantly different. In terms of
odds, the odds are 1:1. Notice that the odds and probability statistics are the same as when done by
hand.
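The back-transformation chain can also be done directly in R; a quick sketch using the intercept reported above:

```r
# logit -> odds -> probability for the intercept-only model
b0 <- 0.2007               # intercept, in logits
odds <- exp(b0)            # about 1.22
prob <- odds / (1 + odds)  # about 0.55; equivalently plogis(b0)
```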
For the second example we sought to determine the extent to which the likelihood of reporting unsafe
sex was different in males and females. Thus we are testing a model in which the coefficient for the
gender parameter (i.e. the difference between males and females) is zero.
Model 1: Log (Odds that Risky Sex = 1) = b0 + e
Model 2: Log (Odds that Risky Sex = 1) = b0 + b1(Gender) + e
H0: b1 = 0
Results: Model 2: Log (Odds that Risky Sex = 1) = 0.41 + 3.58(Gender) + e
How to interpret the intercept in an augmented model: When Gender is equal to zero, the log of
the odds of having had risky sex in the past month is 0.41. You can also choose to convert the logit
(log of the odds) back to odds by taking the exponent of the intercept (exp(0.41) = 1.5). You can also
convert the odds back to probabilities:

1.5 = p/(1-p)
1.5(1-p) = p
1.5 - 1.5p = p
1.5 = 2.5p
p = 1.5/2.5 = 0.6
More importantly what does Gender = 0 mean if gender is coded as -0.5 and 0.5? The correct way of
describing the intercept would be: On average across males and females the logit of having had
risky sex in the past month is 0.41. This is why it is important to code your categorical independent
variable to target the question being asked.
How to interpret the Slope in an augmented model: For each unit increase in Gender, the log of
the odds of having had risky sex in the past month increases by 3.58. In terms of odds, you would
say: the odds of having had risky sex increases by a factor of 36 (i.e., exp(3.58) = 36). In this
example there is only a one unit increase in Gender that represents the difference between males
and females, thus we are getting an odds-ratio of 36:1. Notice that the odds-ratio is the same as
when we did it by hand in Part 1.
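The slope-to-odds conversion is one line of R; a sketch using the gender.code estimate reported above:

```r
# Exponentiate the gender slope to recover the odds-ratio
b1 <- 3.5835            # gender.code slope, in logits
odds.ratio <- exp(b1)   # about 36, matching the hand calculation in Part 1
```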
The script and output below demonstrate how to answer these questions.
#Multivariate Logistic Regression
gender <- c(-0.5,0.5)
sexes <- rep(gender,each=100)
fem.safe <- rep(0,times=80)
fage.safe <- rep(c(18,19,20,21,22,23),c(25,15,12,10,10,8))
fem.unsafe <- rep(1,times=20)
fage.unsafe <- rep(c(18,19,20,21,22,23),c(0,1,3,3,4,9))    #As females get older the incidence of risky sex increases
male.safe <- rep(0,times=10)
mage.safe <- rep(c(18,19,20,21,22,23),c(0,0,1,2,3,4))
male.unsafe <- rep(1,times=90)
mage.unsafe <- rep(c(18,19,20,21,22,23),c(25,22,18,9,8,8)) #As males get older the incidence of risky sex decreases

set.seed(1234)
lover <- rnorm(200,1,1)
lover[runif(200,0,1)<.40] <- 0
sex.partner <- (lover==0)*1
#About 40% of lovers are females and they are randomly distributed throughout the data.
#So we have same sex and opposite sex couples.

#Creating the College Student Annual Income (per thousand) Variable
set.seed(1234)
income <- rnorm(n=200,mean=15,sd=5)   #No effect - randomly distributed throughout the data
#We must also mean deviate this variable before we use it in our analyses to make interpretation easy.
cincome <- round(income-mean(income),digits=2)

#Creating the college year variable
#By randomizing this variable we expect little or no relationship between risky sex and academic year.
#Academic year ranges from 1 (Freshmen) to 5 (super seniors).
#We must mean center academic year, so that the intercept is interpretable.
set.seed(1234)
year <- sample(1:5,size=200,replace=TRUE)
cyear <- round(year-(mean(year)),digits=2)
cyear

#Remember that we need to contrast code sex to make it interpretable
sex.code <- (sexes==0.5)*1 - 0.5   #Manually recode gender to be 0.5 if male and -0.5 if female
sex.code                           #Males are 0.5; females -0.5

#We also have to contrast code partner sex
partner.code <- (sex.partner==1)*1 - 0.5   #So the interpretation is how much the risk increases for male partners compared to female partners
partner.code                               #Males are 0.5; females -0.5

#Lastly, we must place all this data into a dataframe called "example3".
example3 <- data.frame(rsex=c(fem.safe,fem.unsafe,male.safe,male.unsafe), gender=sex.code, age=c(fage.safe,fage.unsafe,mage.safe,mage.unsafe), year=cyear, income=cincome, partner=partner.code)
example3$rsex <- as.ordered(example3$rsex)
example3 <- example3[order(example3$gender,example3$age,example3$year,example3$partner),]
example3
We can now explore the relationship between the variables in the dataset called example3 (income is excluded
because it is a continuous variable with a very wide range of values).
Looking at partner gender and risky sex we find that there is no relationship between the two.
#Risky sex by gender of sexual partner
table.2 <- ftable(example3[c("partner","rsex")])
table.2
        rsex  0  1
partner
-0.5         62 72
0.5          28 38
chisq.test(table.2)
Pearson's Chi-squared test with Yates' continuity correction
data: table.2
X-squared = 0.1316, df = 1, p-value = 0.7168
Looking at gender and sexual partner we find no evidence of a specific preference in either gender.
Age does not appear to be related to the prevalence of past month risky sex practices.
table.4 <- ftable(example3[c("age","rsex")])
table.4
    rsex  0  1
age
18       25 25
19       15 23
20       13 21
21       12 12
22       13 12
23       12 17
chisq.test(table.4)
Pearson's Chi-squared test
data: table.4
X-squared = 2.4936, df = 5, p-value = 0.7775
Looking at gender, age, and risky sex together reveals that an age-gender interaction is present.
table.5 <- ftable(example3[c("gender","age","rsex")])
table.5
           rsex  0  1
gender age
-0.5   18       25  0
       19       15  1
       20       12  3
       21       10  3
       22       10  4
       23        8  9
0.5    18        0 25
       19        0 22
       20        1 18
       21        2  9
       22        3  8
       23        4  8
chisq.test(table.5)
Pearson's Chi-squared test
data: table.5
X-squared = 118.5057, df = 11, p-value < 2.2e-16
Warning message:
In chisq.test(table.5) : Chi-squared approximation may be incorrect
       Df Deviance Resid. Df Resid. Dev
NULL                     199    275.256
income  1    5.127       198    270.129
Take the intercept (which is in logits, the log of the odds) and convert it back to odds by taking the
exponent of it. Try ?exp; exp(Q1$coefficients[1]). This gives the intercept in terms of odds: 1.23.
So the first hypothesis is TRUE because 0.21 is not significantly different from zero (Wald χ2 = (1.44)^2
= 2.07, df = 1, p-value = 0.15).
Q2. Males are at a greater risk for risky sex behaviors regardless of the effects of income.
Q2 <- glm(rsex ~ gender+income,family=binomial(logit),data=example3)
summary(Q2)
glm(formula = rsex ~ gender + income, family = binomial(logit), data = example3)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.41297    0.21015   1.965   0.0494 *
gender       3.60176    0.42389   8.497   <2e-16 ***
income       0.07021    0.03908   1.796   0.0724 .

AIC: 167.87
> anova(Q2)
Analysis of Deviance Table

       Df Deviance Resid. Df Resid. Dev
NULL                     199    275.256
gender  1  110.158       198    165.097
income  1    3.228       197    161.869
The hypothesis is correct. Assuming everyone has the same income, males are at a greater risk than
females.
Q3. Testing if the likelihood of risky sex increases as age increases, controlling for the effects
of income.
Q3 <- glm(rsex ~ age+income,family=binomial(logit),data=example3)
summary(Q3)
glm(formula = rsex ~ age + income, family = binomial(logit), data = example3)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.45401    1.66253   0.273   0.7848
age         -0.01230    0.08234  -0.149   0.8813
income       0.06524    0.02943   2.217   0.0266 *

AIC: 276.11
anova(Q3)
Analysis of Deviance Table

       Df Deviance Resid. Df Resid. Dev
NULL                     199    275.256
age     1    0.012       198    275.244
income  1    5.137       197    270.106
The hypothesis was false. Controlling for income, age has no relationship with the risk for unsafe
sexual practices. Income is significant, probably by chance association. If the sample size of example
3 were larger this effect would most likely be absent.
Q4. Testing if the odds of risky sex increase as academic year in college increases regardless
of income.
Q4 <- glm(rsex ~ year+income,family=binomial(logit),data=example3)
summary(Q4)
glm(formula = rsex ~ year + income, family = binomial(logit), data = example3)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.21534    0.14668   1.468   0.1421
year         0.26751    0.10548   2.536   0.0112 *
income       0.06823    0.02966   2.301   0.0214 *

AIC: 269.49
anova(Q4)
Analysis of Deviance Table

       Df Deviance Resid. Df Resid. Dev
NULL                     199    275.256
year    1    6.242       198    269.014
income  1    5.527       197    263.486
The hypothesis was correct: as academic year increases, the odds in favor of a risky-sex
response increase. But is this association TRUE? (Recall the creation of example3.) Like
income, year was independently generated and randomly distributed throughout example3. Its effect
in this model is also likely due to a chance association resulting from the small sample size.
Q5. Persons who are heterosexual are at the same level of risk for risky sex behaviors as
homosexuals regardless of income.
Q5 <- glm(rsex ~ income+gender*partner,family=binomial(logit),data=example3)
glm(formula = rsex ~ income + gender * partner, family = binomial(logit), data = example3)
Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     0.49285    0.24504   2.011   0.0443 *
income          0.06682    0.03906   1.711   0.0871 .
gender          3.74231    0.49210   7.605 2.85e-14 ***
partner         0.34578    0.49298   0.701   0.4830
gender:partner  0.59164    0.98069   0.603   0.5463

AIC: 171.20
> anova(Q5)
Analysis of Deviance Table

               Df Deviance Resid. Df Resid. Dev
NULL                             199    275.256
income          1    5.127       198    270.129
gender          1  108.260       197    161.869
partner         1    0.290       196    161.579
gender:partner  1    0.382       195    161.197
The hypothesis was correct. Over and above the income effects, the difference in risk between
males and females did not change if a person's partner was of the same sex or not.
NOTE THE CODING OF THE GENDER AND PARTNER VARIABLES AND THE RESPECTIVE
CODING OF THEIR INTERACTION. What does the parameter estimate of 0.59 represent? It is the
change in the logit if a person was in a homosexual as opposed to a heterosexual relationship.
Q6. The difference in likelihood of risky sex between males and females is not altered by (1)
age, and (2) academic year in college, regardless of income.
Q6 <- glm(rsex ~ income+gender*age+gender*year,family=binomial(logit),data=example3)
summary(Q6)
glm(formula = rsex ~ income + gender * age + gender * year, family = binomial(logit), data = example3)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   5.31604    4.51564   1.177  0.23910
income        0.03681    0.04840   0.760  0.44698
gender       42.10440    9.05790   4.648 3.35e-06 ***
age          -0.22194    0.20821  -1.066  0.28645
year          0.57369    0.21866   2.624  0.00870 **
gender:age   -1.81472    0.41793  -4.342 1.41e-05 ***
gender:year   1.16582    0.43494   2.680  0.00735 **

AIC: 127.45
anova(Q6)
Analysis of Deviance Table

            Df Deviance Resid. Df Resid. Dev
NULL                          199    275.256
income       1    5.127       198    270.129
gender       1  108.260       197    161.869
age          1    1.044       196    160.825
year         1    2.688       195    158.138
gender:age   1   35.433       194    122.704
gender:year  1    9.250       193    113.454
The hypothesis is false. Controlling for the effects of income, the gender logit is affected by age and
academic year (a chance association), such that as age increases the logit difference between males
and females decreases. This implies that females are more likely to have a risky sex response at
younger ages, while males have a higher likelihood at older ages (just as the contingency table
depicted). The chance significance between gender and academic year indicates that the logit
difference between males and females increases by 1.17 for each additional year a person
progresses through college.
Part 5: Assumption Testing for Logistic Regression
The points fall closely to the line representing a normal distribution; hence the data are consistent with a normal
distribution.
A very critical assumption is that there is actually no correlation among the observations of the dependent
variable. In other words, knowing something about one observation should not provide any
information about another observation. Unfortunately this is a very difficult assumption to test, so
we briefly provide mechanisms to avoid dependence among observations in your data. The best
protection against violating the independence assumption is to (1) design the experiment in a manner
where the likelihood of non-independence in the data is reduced and (2) if you are aware of the cause
of the dependence in the data, include it in the model.
The following R script tests the first four assumptions using the model for question 6 in the
multivariate example above.
op <- par(mfrow=c(2,2))
op <- par(lwd=2,cex.main=1.3,cex.lab=1.3,cex.axis=1.3)
plot(Q6)
par(op)
Figure 5: Plot of the Assumption tests for the logistic model in question 6
[Figure 5 panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance); observations 81, 83, 101, 102, and 193 are flagged as extreme.]

Exam Question

Question 1
A. Use ftable() to create 4 univariate contingency tables of the age (Age), church attendance
(Att.church), behavior disinhibition (BD), and early alcohol exposure (Early.alc) variables with
lifetime alcohol SUD (L_drink). Describe the relationship between each variable and lifetime
alcohol SUD. Provide the necessary statistics to support your claims (i.e. odds-ratios or
probabilities, Chi-square, degrees of freedom, and p-values).
B. Use ftable() to look for an interaction between BD and early alcohol exposure on the
prevalence of lifetime alcohol use reports. Does it appear that persons who are exposed under
supervision at home endorse having had a use disorder at some point in their lifetime more
than persons who are not? Also look at the interaction between BD and church attendance.
Does church attendance protect against alcohol use disorders?
C. Scour the data and determine for each variable how many rows contain missing data. Try
to do this in one line using the apply function. Your output should list each of the
variable names and the number of rows with missing data below it.
Question 2
SIDE INFO: Data were created using mean centered forms of Age, SES, and BD so you do not
need to center the variables. However, keep in mind that this will affect your interpretation of
the model coefficients.
A. Test the hypothesis that persons from low SES families are at greater risk for a lifetime SUD
disorder diagnosis in adolescence than people from high SES families.
B. Use separate models to test the hypotheses that church attendance and early alcohol
exposure are a protective factor for alcohol use, while controlling for SES, and Age. Interpret
the intercept and church slope of the church model in terms of odds. For the early alcohol
exposure interpret the parameter of the intercept and the alcohol exposure slope in terms of
logits.
C. The literature suggests that BD is the end all. Once you are high on this scale you are very
likely to be diagnosed with an alcohol disorder in your lifetime. Test this hypothesis, and based
on the model what would the odds in favor of a lifetime alcohol disorder report be if a person,
scored 12 on the BD scale, is 16 years old, goes to church, was never exposed to alcohol
positively while growing up and whose family earns $60,000 annually (Note: The dependent
variable was created using the mean deviated forms of Age, BD, and SES). Control for all the
other variables in the model. THIS IS ALSO KNOWN AS THE KITCHEN SINK MODEL. Have
any of the earlier relationships changed once we take BD into account? What is the direction of
their effects, does it match up to our expectations?
D. Test if there is a significant interaction between church attendance and BD, while controlling
for Age, SES, and church attendance.
Question 3
A. Compare the fit of the kitchen sink model to the model with the interaction between BD and
church attendance. Which model best fits the data? Justify your answer with the appropriate
statistics.
B. Drop the non-significant parameters from your best fitting model and compare the linearity
assumption in your best fitting model and this new model. Which model best fits the data
linearly? Are there any other differences in model assumptions that vary between the two
models? Use appropriate titles to distinguish the assumptions between the two models.