Computer Lab 1 MM

MSc Epidemiology Specialization Course Mixed Models

Exercises Day 1: Introduction to Multilevel Modeling

During this computer lab we will:


• Try to reproduce the results from the lectures (Ex 1-3)
• Analyze data from a multi-center trial on hypertension (Ex 4)
• Analyze data from a cross-over trial (also, coincidentally, on hypertension) (Ex 5)
• Look at the secondary question (about single- vs mixed-gender schools) in the London School data (Ex 6)
• Look at some descriptive statistics for a longitudinal study (Ex 7)
Note 1: you may choose to use R, SPSS or both to answer the questions below. If you get stuck, R code is
available in the answers below, and SPSS syntax (MM analysis day 1.sps) is available on Moodle. In addition,
in the third section of this document there are step-by-step instructions for exercises 1-3 in R, with code and
explanations of the code and the output.
Note 2: before running the R code or SPSS syntax, make sure to change the file paths!!
Note 3: R users should consult “Fitting linear mixed effects models in R.docx” on Moodle for some
explanation about the two packages used in this course to fit mixed models in R.

Exercises
Exercise 1
In R, SPSS, or both: try to reproduce the analysis of the schools dataset (school.dat or school.sav) so far. R
users: you may want to take a look at the step-by-step instructions below.
a. Make a scatterplot of all 4059 data points, looking at the relation between normexam and standlrt.
b. Make a (simple) linear regression model of normexam on standlrt, and ask for a summary of that model.
c. Get individual scatterplots of the relation between normexam and standlrt for the 65 schools separately.
(SPSS users: split file)
d. Run a linear regression of normexam on standlrt per school, and get the mean and SD’s of the regression
coefficients. (SPSS users: keep split file on, within Regression menu, click on “Save” and have SPSS
save the coefficients to a separate dataset.)
e. Save your SPSS syntax and/or R script for further use.

Exercise 2
Continue with reproducing the analysis of the schools dataset (school.dat or school.sav) so far. (Again, if you
get stuck in R, you may want to take a look at the step-by-step instructions below.)
a. Fit a linear mixed model with random intercept to predict exam scores using the LRT scores.
b. Add a random slope to the model in (a). Interpret this model.

Exercise 3
Finish the analysis of the schools dataset (school.dat or school.sav). (Again, if you get stuck in R, you
may want to take a look at the step-by-step instructions below.)

a. Add child- and school-level explanatory variables. Interpret the model.
b. For the model in (a), we will write a brief description of the statistical model used. Fill in the blanks:
“A linear mixed effects model was estimated, using fixed effects for ______________. A random
__________ and a random effect of _________ per __________ were added to correct
for ______________.”

Exercise 4
Part c of this question will be used in the quiz this afternoon. Please save or print the output
and have it on hand (together with this exercise) when you complete the quiz.
A multi-center, randomized, double-blind clinical trial was done to compare two treatments for hypertension.
One treatment was a new drug (1 = Carvedilol) and the other was a standard drug for controlling hypertension
(2 = Nifedipine). Twenty-nine centers participated in the trial and patients were randomized in order of
entry. One pre-randomization and four post-treatment visits were made. Here, we will concentrate on the
last recorded measurement of diastolic blood pressure (primary endpoint: dbp). The data can be found in
the SPSS data file dbplast.sav. Read the data into R or SPSS (in R, please use the read.spss() function
in the foreign package to read in these data).
The research question is which of the two medicines (treat) is more effective in reducing DBP. Since baseline
(pre-randomization) DBP (dbp1) will likely be associated with post-treatment DBP and will reduce the
variation in the outcome (thereby increasing our power to detect a treatment effect), we wish to include it
here as a covariate.
a. Make some plots to describe the patterns of the data.
b. Fit a model to answer the research question, using maximum likelihood estimation, taking into account
that patients within centers may have correlated data. Interpret the coefficients of the model.
c. Make a new baseline dbp variable, centered around its mean. Re-fit the model in (b) using the centered
baseline blood pressure variable, using maximum likelihood estimation, and interpret the parameters of
this new model.

Exercise 5
In a small crossover study two drugs, A and B, are compared for their effect on the diastolic blood pressure
(DBP). Each patient in the study receives the two treatments in a random order and separated in time
(“wash-out” period) so that one treatment does not influence the blood pressure measurement obtained after
administering the other treatment (i.e. to rule out a carry-over effect). The data are given in the data files
crossover.sav and crossover.dat.
Note that subject 4 has only the measurement for drug A and that subject 16 has only the measurement for
drug B.
a. Use descriptive statistics to get a feel for the data. Which drug seems to be better at reducing DBP?
b. Fit a model to the data, looking at drug and period effect and correcting for the fact that (most)
patients have more than one DBP measurement. Which variable(s) do you choose as random?
c. Interpret the results of the model. Is there a difference between the two treatments? Is there a period
effect?
d. What other hypothesis might we want to test here?

Exercise 6
A secondary question regarding the school exam data (exercises 1 & 2) was proposed in the lecture. Use
SPSS or R (or both) to address the question: is the difference between boys and girls the same for single-sex
and mixed-gender schools? (Note: you’ll need to make a new variable for single-gender (schgend = 2 or 3)
vs mixed-gender (schgend = 1) schools before proceeding with the analysis.)

Exercise 7 (Challenge)
Tomorrow we will spend the morning session examining different ways of analyzing the Reisby dataset. This
is a longitudinal dataset on 66 patients with endogenous or exogenous depression. Patients are measured
every week starting at baseline; from week 1 on, they were all treated with imipramine. The outcome is
the score on the Hamilton Depression Rating Scale (HDRS), a score based on a questionnaire administered
by a health care professional. The score ranges - theoretically - from 0 (no depressive symptoms) to 52,
where scores higher than 20 indicate moderate to very severe depression. The questions of interest are
how the HDRS score changes over time for the patients, and whether the patterns of HDRS over time
differ for patients with endogenous and exogenous depression. The data are available in both a “wide” and a
“long” format: reisby_wide.sav and reisby_long.sav (in R, please use the read.spss() function in the
foreign package to read in these data).
a. We heard this morning that longitudinal data is also multi-level data. How many levels do we have
here? What does each level represent?
b. Use descriptive statistics (means, SDs, graphs) to get a feel for the data, concentrating on the patterns
(individual and/or group) of HDRS over time. Note that there are two versions of the dataset given,
one “wide” and one “long”. For some graphs and descriptive statistics, one version may be easier to use
than the other.
c. What do you notice about the mean HDRS score over time? And the variation?
d. Time was measured at 6 discrete moments. How would you want to incorporate time in the fixed part
of the model: as discrete or continuous? Explain your answer.
e. If you were to include a random intercept in the model, for which level would you include an intercept?
f. Do you think it is necessary to include time in the random part of the model? Why or why not?

Answers to exercises
Exercises 1&2
Most answers are in the presentation and in the step-by-step instructions below.

Exercise 3
a.
b. “A linear mixed effects model was estimated, using fixed effects for the standardized London Reading
Test, student gender, school gender, and school performance. A random intercept and a random effect
of the standardized London Reading Test per school were added to correct for correlation of children
within schools.”

Exercise 4
a. There are no immediately discernible differences between the two treatments in the final DBP level of
the patients. There also does not appear to be a strong relation between pre-randomization DBP and
final DBP.
trial <- read.spss(file.path(mypath,"dbplast.sav"), to.data.frame = TRUE, use.missings = TRUE)
p1 <- ggplot(data = trial, aes(x = treat, y = dbp, group=center))
p1 + geom_point() + facet_wrap(~center)

[Figure: dbp (y-axis) by treat (x-axis), one panel per center]

p2 <- ggplot(data = trial, aes(x = dbp1, y = dbp, group=center))


p2 + geom_point() + facet_wrap(~center)

[Figure: dbp (y-axis) against dbp1 (x-axis), one panel per center]

b. We want to examine the difference between treatment groups while controlling for baseline DBP. The
required analysis (if in a single center) would be an ANCOVA: a linear regression with one categorical
and one continuous variable. Since this is a multi-center trial, we need to also add (at least) a random
intercept per center. Depending on further theory or on the results of the graphs in (a), we might also
want to add a random effect (slope) per center for baseline DBP and/or a random effect for treatment.
The former allows each center to have its own DBP-DBP1 slope (in this data there does not appear to
be much variation in slopes, see the second plot in (a)); the latter would allow the treatment effect to
differ per center (not something one generally hopes for, and also not very likely given the first plot
in (a)). Since there is no (theoretical or practical) reason to add random effects of baseline DBP or
treatment, we use a model with only a random intercept per center. This model assumes centers have
different mixes of patients with (on average) higher or lower blood pressure, but that the trend between
DBP and baseline and the difference between treatments is the same in every center. Note that the
code below assumes the variable treat has been coded as a factor/categorical variable; if it is numeric
in your dataset, please use factor(treat) in your model.
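Written out, the random-intercept model described above could look as follows. This is a sketch in notation that is assumed here rather than taken from the course material: i indexes centers, j indexes patients within a center, and treat_ij is an indicator that equals 1 for treatment 2 (Nifedipine) and 0 for treatment 1 (Carvedilol), which is R's default coding when the first factor level is the reference.

```latex
\begin{aligned}
\mathrm{dbp}_{ij} &= \beta_0 + \beta_1\,\mathrm{dbp1}_{ij} + \beta_2\,\mathrm{treat}_{ij} + b_{0i} + \varepsilon_{ij},\\
b_{0i} &\sim N(0, \sigma_b^2), \qquad \varepsilon_{ij} \sim N(0, \sigma^2)
\end{aligned}
```

Here b_0i shifts the whole regression line up or down for center i, while the baseline slope (beta_1) and the treatment difference (beta_2) are shared by all centers.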
lme.1 <- lme(fixed=dbp ~ dbp1 + factor(treat), random=~1|center, data=trial, method="ML")
summary(lme.1)

## Linear mixed-effects model fit by maximum likelihood


## Data: trial
## AIC BIC logLik
## 1393.739 1410.052 -691.8694
##
## Random effects:
## Formula: ~1 | center
## (Intercept) Residual
## StdDev: 2.824933 8.406353
##
## Fixed effects: dbp ~ dbp1 + factor(treat)
## Value Std.Error DF t-value p-value
## (Intercept) 74.10138 13.630882 164 5.436287 0.0000
## dbp1 0.17470 0.131612 164 1.327382 0.1862
## factor(treat)2 -1.11787 1.230104 164 -0.908764 0.3648
## Correlation:
## (Intr) dbp1
## dbp1 -0.997
## factor(treat)2 -0.122 0.080
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.16658330 -0.72441696 -0.07447943 0.55358319 5.04172608
##
## Number of Observations: 193
## Number of Groups: 27
c. In R:
trial$cdbp1 <- trial$dbp1 - mean(trial$dbp1)
In SPSS:
COMPUTE cdbp1=dbp1-102.70.
EXECUTE.
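To see what centering a covariate does to a random-intercept model in general, the sketch below uses simulated data (not the dbplast trial; all variable names and parameter values are made up for illustration). It fits the same model twice with lme(), once with the raw covariate and once with the centered one, so the two sets of fixed effects can be compared.

```r
library(nlme)  # provides lme(), the function used throughout this lab

set.seed(1)    # simulated data only, NOT the dbplast trial
n_center <- 10
n_per    <- 20
sim <- data.frame(
  center = factor(rep(seq_len(n_center), each = n_per)),
  x      = rnorm(n_center * n_per, mean = 100, sd = 5),  # hypothetical baseline covariate
  group  = factor(rep(c(1, 2), times = n_center * n_per / 2))
)
b0 <- rnorm(n_center, sd = 3)  # hypothetical random intercept per center
sim$y <- 70 + 0.2 * sim$x - 1 * (sim$group == 2) +
  b0[as.integer(sim$center)] + rnorm(nrow(sim), sd = 8)

# Center the covariate around its mean
sim$cx <- sim$x - mean(sim$x)

fit_raw <- lme(y ~ x + group,  random = ~1 | center, data = sim, method = "ML")
fit_cen <- lme(y ~ cx + group, random = ~1 | center, data = sim, method = "ML")

fixef(fit_raw)
fixef(fit_cen)

# The centered intercept should equal the raw intercept plus slope * mean(x)
shift <- fixef(fit_raw)[["(Intercept)"]] + fixef(fit_raw)[["x"]] * mean(sim$x)
shift
```

In this sketch the slope and the group difference should agree between the two fits (up to numerical tolerance), and only the intercept changes: after centering it refers to a subject with an average value of the covariate rather than a value of zero.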

Exercise 5
Answers in R:
First, read in the data:
cross <- read.table(file.path(mypath, "crossover.dat"),header=TRUE)
## summary(cross)

a. In R, spaghetti plots are prettiest using ggplot2:


dspag <- ggplot(data=cross, aes(x=DRUG, y=Y)) + geom_line() +
guides(colour="none") + xlab("Drug") + ylab("DBP") +
ggtitle("Spaghetti plot Drug A vs B")
dspag + aes(colour = factor(PATIENT))

[Figure: Spaghetti plot Drug A vs B; DBP (y-axis) by Drug (x-axis), one line per patient]

It looks like most patients have lower blood pressure when using drug A.
We can also look at the two periods:
pspag <- ggplot(data=cross, aes(x=PERIOD, y=Y)) + geom_line() +
guides(colour="none") + xlab("Period") + ylab("DBP") +
ggtitle("Spaghetti plot Period 1 vs 2")
pspag + aes(colour = factor(PATIENT))

[Figure: Spaghetti plot Period 1 vs 2; DBP (y-axis) by Period (x-axis), one line per patient]

There does not seem to be a discernible period effect.


Descriptive statistics:
tapply(cross$Y,cross$DRUG,mean)

## 1 2
## 104.5000 113.6111
tapply(cross$Y,cross$DRUG,sd)

## 1 2
## 12.94899 11.01944
Mean DBP for drug A is 104.5 (SD 12.9); for drug B 113.6 (11.0).
tapply(cross$Y,cross$PERIOD,mean)

## 1 2
## 109.5294 108.6316
tapply(cross$Y,cross$PERIOD,sd)

## 1 2
## 12.73341 13.03930
Mean DBP for period 1 is 109.5 (SD 12.7); for period 2 108.6 (13.0).
b. Because we have two measurements per patient, we need (at least?) a random intercept. Do we need
more random effects? A random effect for drug would assume that the treatment effect is different for
each patient. While this could well be the case, in order to estimate such an effect, we would need
more than one measurement per patient per drug! The same holds true for a random effect for period
(additionally, a different treatment effect per period violates one of the important assumptions of a
cross-over trial). In this case, a model with a random effect per patient is both the least and the most
we can do. Fit a mixed model with random intercept per patient:
cross.lme1 <- lme(Y~factor(DRUG) + factor(PERIOD),
random= ~1|PATIENT, method="ML", data=cross)
summary(cross.lme1)

## Linear mixed-effects model fit by maximum likelihood


## Data: cross
## AIC BIC logLik
## 280.6751 288.5927 -135.3375
##
## Random effects:
## Formula: ~1 | PATIENT
## (Intercept) Residual
## StdDev: 8.980622 7.276865
##
## Fixed effects: Y ~ factor(DRUG) + factor(PERIOD)
## Value Std.Error DF t-value p-value
## (Intercept) 104.95515 3.115860 18 33.68416 0.0000
## factor(DRUG)2 9.36033 2.581338 15 3.62615 0.0025
## factor(PERIOD)2 -1.25006 2.583876 15 -0.48379 0.6355
## Correlation:
## (Intr) f(DRUG
## factor(DRUG)2 -0.388
## factor(PERIOD)2 -0.427 -0.058
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.28987540 -0.42035217 -0.02943319 0.44467310 1.49482872
##
## Number of Observations: 36
## Number of Groups: 19
intervals(cross.lme1)

## Approximate 95% confidence intervals


##
## Fixed effects:
## lower est. upper
## (Intercept) 98.687664 104.955154 111.222644
## factor(DRUG)2 4.092571 9.360326 14.628082
## factor(PERIOD)2 -6.522999 -1.250063 4.022873
##
## Random Effects:
## Level: PATIENT
## lower est. upper
## sd((Intercept)) 5.806065 8.980622 13.89092
##
## Within-group standard error:
## lower est. upper
## 5.231807 7.276865 10.121316
c. On drug B, the average DBP is 9.4 (95% CI 4.1-14.6) mmHg higher than on A (the reference category),
Wald p=0.0025. So drug A works better in decreasing DBP. The average DBP in period 2 is slightly
(and not statistically significantly) lower than in period 1: -1.25 (95% CI -6.5 – 4.0) mmHg, so no significant
period effect (Wald p=0.6355).

d. It might be interesting to examine a (fixed) period*drug interaction. A significant interaction would
indicate that (despite the wash-out period) the order in which the drugs were taken has an influence on
one’s blood pressure. This could be medically plausible; however, with the current design (each patient
is given only A then B or B then A and the DBP measured only once on each drug) this hypothesis
cannot be tested!

Answers in SPSS:
a. In SPSS, make a graph via Graphs > Legacy Dialogs > Line > Multiple, and click the “Define” button.
[Figure: SPSS dialog and resulting line graph of DBP per patient by drug]
It looks like most patients have lower blood pressure when using drug A.
We can also look at the two periods:

There does not seem to be a discernible period effect.
Mean DBP for drug A is 104.5; for drug B 113.6.
b. Because we have two measurements per patient, we need (at least?) a random intercept. Do we need
more random effects? A random effect for drug would assume that the treatment effect is different for
each patient. While this could well be the case, in order to estimate such an effect, we would need
more than one measurement per patient per drug! The same holds true for a random effect for period
(additionally, a different treatment effect per period violates one of the important assumptions of a
cross-over trial). In this case, a model with a random effect per patient is both the least and the most
we can do. Fit a mixed model with random intercept per patient using the menu or the following syntax
(note the command EMMEANS, which gives the estimated means for the two drugs):
MIXED Y BY PERIOD DRUG
/CRITERIA=CIN(95) MXITER(100) MXSTEP(5) SCORING(1)
SINGULAR(0.000000000001)
HCONVERGE(0, ABSOLUTE) LCONVERGE(0, ABSOLUTE)
PCONVERGE(0.000001, ABSOLUTE)
/FIXED=PERIOD DRUG | SSTYPE(3)
/METHOD=ML
/PRINT=SOLUTION
/RANDOM=INTERCEPT | SUBJECT(PATIENT) COVTYPE(VC)
/EMMEANS=TABLES(DRUG) .

c. On the basis of the Wald tests of the fixed effects, it would appear that drug type has a significant effect
on DBP; patients on drug A have, on average, a 9.4 mmHg lower diastolic blood pressure than on drug
B (95% CI 4.2 – 14.5 mmHg lower). In period 1, patients have, on average, 1.3 mmHg higher DBP than
in period 2, though this difference is not statistically significant according to the Wald test. (We will
learn how to test these hypotheses properly (using likelihood ratio tests) on Day 3.) Note that the
coefficients in SPSS are reversed from those in R; this is because SPSS takes the second group (drug B and
period 2) to be the reference groups, while R takes the first (drug A and period 1). The conclusions
from both programs are, of course, the same.

d. It might be interesting to examine a (fixed) period*drug interaction. A significant interaction would
indicate that (despite the wash-out period) the order in which the drugs were taken has an influence on
one’s blood pressure. This could be medically plausible; however, with the current design (each patient
is given only A then B or B then A and the DBP measured only once on each drug) this hypothesis
cannot be tested!

Exercise 6
This question is about effect modification (or statistical interaction) by school gender of the relation between
individual gender and exam score. We add and interpret an interaction between school and individual gender.
Results below are from R. Note that SPSS uses girls at mixed-gender schools as the reference category.
The estimated differences between girls and boys and single vs mixed gender schools are the same, but the
parameter estimates are different due to the difference in reference categories.
# Re-read in the data, if necessary
london <- data.frame(read.table(file.path(mypath,"school.dat"),header=TRUE))

# make new variable mixed gender (schgend = 1 vs 2 or 3)
london$mixed <- as.numeric(london$schgend==1)
table(london$mixed,london$schgend) # check new variable

##
## 1 2 3
## 0 0 513 1377
## 1 2169 0 0

# mixed model with random intercept & random slope, plus gender, school gender & school avg
sch.lme.sub <- lme(normexam~standlrt + factor(gender)+
factor(mixed) + factor(schav) + factor(gender)*
factor(mixed), random=~standlrt | school, data=london,
method="ML")
summary(sch.lme.sub)

## Linear mixed-effects model fit by maximum likelihood


## Data: london
## AIC BIC logLik
## 9300.414 9369.809 -4639.207
##
## Random effects:
## Formula: ~standlrt | school
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.2660302 (Intr)
## standlrt 0.1212546 0.499
## Residual 0.7417279
##
## Fixed effects: normexam ~ standlrt + factor(gender) + factor(mixed) + factor(schav) + factor(ge
## Value Std.Error DF t-value p-value
## (Intercept) -0.0777973 0.10389282 3991 -0.748823 0.4540
## standlrt 0.5515520 0.02006954 3991 27.482049 0.0000
## factor(gender)1 0.1371785 0.10481254 3991 1.308799 0.1907
## factor(mixed)1 -0.1869683 0.09777574 61 -1.912216 0.0605
## factor(schav)2 0.0668879 0.08534913 61 0.783698 0.4362
## factor(schav)3 0.1742650 0.09876083 61 1.764516 0.0827
## factor(gender)1:factor(mixed)1 0.0299527 0.11006608 3991 0.272134 0.7855
## Correlation:
## (Intr) stndlr fctr(g)1 fctr(m)1 fct()2 fct()3
## standlrt 0.162
## factor(gender)1 -0.614 0.006
## factor(mixed)1 -0.674 -0.002 0.711
## factor(schav)2 -0.540 -0.035 -0.061 -0.061
## factor(schav)3 -0.451 -0.080 -0.124 -0.061 0.622
## factor(gender)1:factor(mixed)1 0.586 -0.017 -0.952 -0.726 0.054 0.113
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.83388964 -0.63434049 0.02308039 0.67675190 3.41363881
##
## Number of Observations: 4059
## Number of Groups: 65

The interaction between student gender and school gender is not statistically significant (p = 0.7855),
but for the interpretation:
• boys (ref) at single-gender schools (ref): reference category
• boys (ref) at mixed-gender schools: 0.19 SD lower than ref (boys, single-sex)
• girls at single-gender schools (ref): 0.14 SD higher than ref (boys, single-sex)
• girls at mixed-gender schools: 0.14 - 0.19 + 0.03 = -0.02 so 0.02 SD lower than boys at single-gender
schools.
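The arithmetic behind the four bullets can be reproduced from the fixed-effect estimates printed above. The snippet below is only a sketch: the numbers are copied by hand from the summary output, and the variable names are made up for illustration.

```r
# Fixed-effect estimates copied from the summary(sch.lme.sub) output above
b_girl        <- 0.1371785   # factor(gender)1
b_mixed       <- -0.1869683  # factor(mixed)1
b_interaction <- 0.0299527   # factor(gender)1:factor(mixed)1

# Difference from the reference group (boys at single-gender schools)
diffs <- c(
  boys_single  = 0,
  boys_mixed   = b_mixed,
  girls_single = b_girl,
  girls_mixed  = b_girl + b_mixed + b_interaction
)
round(diffs, 2)

# Girl-boy difference within each school type
gap_single <- b_girl
gap_mixed  <- b_girl + b_interaction
round(c(single = gap_single, mixed = gap_mixed), 2)
```

The estimated girl-boy difference is about 0.14 SD at single-gender schools and about 0.17 SD at mixed-gender schools; the interaction coefficient (0.03) is exactly the difference between these two gaps.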

Exercise 7
a. 2 levels: measurements (level 1) within patients (level 2). Other answers tomorrow!

Step-by-step answers to exercises 1-3
You can copy the code below into the R editor (or better yet: into the RStudio editor), and run one or more
lines at a time by highlighting them and pressing Ctrl+Enter (Windows) or Command+Enter (Mac).
Note: before running any of the R code below, be sure to change the path name to the directory in which you
have stored your data! This can be done using the menu in RStudio (Session > Set Working Directory...), by
using the setwd() function or, as below, by defining the path (here I call it mypath) and using the file.path()
function.

Exercise 1
Try to reproduce the analysis of the schools dataset (school.dat) so far. Before we get started, we’ll first load
a few packages we will need for our analysis.
library(foreign)
library(nlme)
library(psych)
library(ggplot2)

If you get an error message that the package is not available, you will first have to install it. For instance the
foreign library: install.packages("foreign"), or use the menu in RStudio (Tools - Install Packages).
We also set a path for the directory in which the data is stored. We will use this path, called mypath together
with the file.path() function to tell R where to find our datasets. Make sure you change this to the path
in which your data is stored (hint: go to the directory in File Explorer, and copy the path, then paste it in R.
Remember to change each single backslash to a double backslash or a single forward slash.)
# CHANGE THIS TO YOUR OWN PATH!!
mypath <- "O:\\Biostatistiek\\Onderwijs\\MedicalStatistics\\Mixed Models\\datasets\\"
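As a small illustration of the two ways to write a Windows path in R, the sketch below uses a made-up directory (C:/Users/me/data is hypothetical; replace it with your own). It only manipulates strings, so it runs without the directory existing.

```r
# Hypothetical directory, for illustration only; replace with your own
path_backslash <- "C:\\Users\\me\\data"   # each backslash doubled
path_forward   <- "C:/Users/me/data"      # forward slashes, also accepted on Windows

# file.path() glues the directory and the file name together with "/"
file.path(path_forward, "school.dat")

# A doubled backslash in the code is a single character in the stored string
nchar("\\")

# Converting the backslash version to the forward-slash version
gsub("\\\\", "/", path_backslash)
```

Either form can be passed to read.table() or setwd(); a path with single, undoubled backslashes cannot, because R reads the backslash as the start of an escape sequence.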

Now we read the data in as a data frame, store it in the object “london”, and examine the first few lines of
data in the data frame (please change the name of the path below to the directory in which you have stored
your data!). Use the summary() function to get quick descriptive statistics of all the variables.
london <- data.frame(read.table(file.path(mypath,"school.dat"), header=TRUE))

Or you can use setwd() and your own directory, and leave out file.path (note: remove hashtags):
# setwd("D:\\MSc\\Mixed\\data\\") # change to your path
# london <- data.frame(read.table("school.dat", header=TRUE))

head(london)

## school student normexam standlrt gender schgend avslrt schav vrband


## 1 1 1 0.26132 0.61906 1 1 0.16617 2 1
## 2 1 2 0.13407 0.20580 1 1 0.16617 2 2
## 3 1 3 -1.72390 -1.36460 0 1 0.16617 2 3
## 4 1 4 0.96759 0.20580 1 1 0.16617 2 2
## 5 1 5 0.54434 0.37110 1 1 0.16617 2 2
## 6 1 6 1.73490 2.18940 0 1 0.16617 2 1
summary(london)

## school student normexam standlrt


## Min. : 1.00 Min. : 1.0 Min. :-3.666100 Min. :-2.93500
## 1st Qu.:14.00 1st Qu.: 16.0 1st Qu.:-0.699510 1st Qu.:-0.62071
## Median :29.00 Median : 33.0 Median : 0.004322 Median : 0.04050
## Mean :31.01 Mean : 38.7 Mean :-0.000118 Mean : 0.00181
## 3rd Qu.:47.00 3rd Qu.: 54.0 3rd Qu.: 0.678760 3rd Qu.: 0.61906
## Max. :65.00 Max. :198.0 Max. : 3.666100 Max. : 3.01600
## gender schgend avslrt schav
## Min. :0.0000 Min. :1.000 Min. :-0.755960 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:-0.149340 1st Qu.:2.000
## Median :1.0000 Median :1.000 Median :-0.020198 Median :2.000
## Mean :0.6001 Mean :1.805 Mean : 0.001811 Mean :2.127
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.: 0.210520 3rd Qu.:3.000
## Max. :1.0000 Max. :3.000 Max. : 0.637660 Max. :3.000
## vrband
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :1.843
## 3rd Qu.:2.000
## Max. :3.000
Now let’s make a scatterplot of all 4059 data points, looking at the relation between normexam and standlrt.
The resulting plot will be displayed in the lower right-hand corner of RStudio (and a little bit later in this
document).
plot(normexam~standlrt, data=london,
xlab="standardized London Reading Test score",
ylab="normalized exam score", main="All schools together", cex.main=1.15, pch=20)

Fit a (simple) linear regression model of normexam on standlrt, and ask for a summary of that model. We’ll
also add the linear model to the plot made above:
simple <- lm(normexam~standlrt, data=london)
summary(simple)

##
## Call:
## lm(formula = normexam ~ standlrt, data = london)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.65617 -0.51847 0.01265 0.54397 2.97399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.001195 0.012642 -0.095 0.925
## standlrt 0.595055 0.012730 46.744 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8054 on 4057 degrees of freedom
## Multiple R-squared: 0.35, Adjusted R-squared: 0.3499
## F-statistic: 2185 on 1 and 4057 DF, p-value: < 2.2e-16
abline(simple, col="red")

[Figure: “All schools together”; normalized exam score (y-axis) against standardized London Reading Test score (x-axis), with the fitted regression line]

Get individual scatterplots of the relation between normexam and standlrt for the 65 schools separately. For
visualizing longitudinal data, the ggplot() function in the ggplot2 package is very useful. The logic of the
functions is not immediately obvious (i.e. I totally stole this code), but if you stick with it, it should become
clearer. (For complete documentation of ggplot2, see http://ggplot2.tidyverse.org/reference/.)
We start by defining a base for the graphs and storing it in an object called p. The data argument gives the
dataset; aes stands for “aesthetic” and defines the X & Y variables and any stratification (group) variables.
Note that when you run the first line, nothing happens. Only when you ask for the object p (and add options to
it) do you actually get a plot. Here, we do a number of steps in one: we add points to p, plus a regression line
(geom_smooth(), but with method = lm, i.e. a linear model) and a “facet wrap”, asking for stratification on
school. If you want, you can add one option at a time, to see what each option does (so, for instance, start with
p + geom_point() to get the same scatterplot as above, and p + geom_point() + geom_smooth(method=lm, se = FALSE)
to get the scatterplot plus the regression line).
p <- ggplot(data = london, aes(x = standlrt, y = normexam))
p + geom_point() + geom_smooth(method=lm, se = FALSE) + facet_wrap(~school)

[Figure: normexam (y-axis) against standlrt (x-axis) with fitted regression line, one panel per school (schools 1-65)]

Now we’ll run a linear regression of normexam on standlrt per school, save it as the object persch and get
the means and standard deviations of the regression coefficients.
persch <- lmList(normexam~standlrt| school, data=london)
mean(coef(persch)[,1])

## [1] -0.06812356
mean(coef(persch)[,2])

## [1] 0.4245775
sd(coef(persch)[,1])

## [1] 0.5191847
sd(coef(persch)[,2])

## [1] 0.9394058
When we do a linear regression per school, the mean of the intercepts is -0.068 and the SD is 0.519. The
mean and SD of the LRT slopes are 0.425 and 0.939.
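To make clear what lmList() is doing, the sketch below repeats the idea on simulated data (not the London schools data; the variables g, x and y are made up for illustration). It fits one regression per group with lmList() and then does the same by hand with split() and lm().

```r
library(nlme)  # lmList() is part of nlme

set.seed(2)    # simulated data only, NOT the London schools data
sim <- data.frame(
  g = factor(rep(1:8, each = 25)),  # hypothetical grouping variable (8 "schools")
  x = rnorm(200)
)
sim$y <- 0.5 * sim$x + rnorm(8, sd = 0.3)[as.integer(sim$g)] + rnorm(200, sd = 0.8)

# One regression per group via lmList()
fit_list <- lmList(y ~ x | g, data = sim)
est <- coef(fit_list)  # one row per group: intercept and slope

# The same thing by hand: split the data by group and fit lm() to each piece
by_hand <- t(sapply(split(sim, sim$g), function(d) coef(lm(y ~ x, data = d))))

# Mean and SD of the group-specific intercepts and slopes
c(mean_int = mean(est[, 1]), sd_int = sd(est[, 1]),
  mean_slope = mean(est[, 2]), sd_slope = sd(est[, 2]))
```

Each group's line is estimated from that group's data alone, which is why small groups can produce extreme intercepts and slopes and inflate the SDs.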
Don’t forget to save your R script for further use later!

Exercise 2
Continue with reproducing the analysis of the London schools dataset so far. If you stopped your R/RStudio
session, you may need to read the data in again.
Fit a linear mixed model to predict exam scores using the LRT scores, with a random intercept per school.

sch.lme.1 <- lme(fixed=normexam~standlrt, random=~1 | school, data=london, method="ML")
summary(sch.lme.1)

## Linear mixed-effects model fit by maximum likelihood


## Data: london
## AIC BIC logLik
## 9365.213 9390.447 -4678.606
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 0.3035269 0.7521481
##
## Fixed effects: normexam ~ standlrt
## Value Std.Error DF t-value p-value
## (Intercept) 0.0023871 0.04003241 3993 0.05963 0.9525
## standlrt 0.5633697 0.01246844 3993 45.18366 0.0000
## Correlation:
## (Intr)
## standlrt 0.008
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.7161719 -0.6304245 0.0286690 0.6844298 3.2680306
##
## Number of Observations: 4059
## Number of Groups: 65
You can store and examine the random effects using the function ranef(). These are the b̂0i, the
estimated differences between the intercept of each school and the average (fixed) intercept of
all schools. Take a look at the first six random effects from the model with only a random intercept:
head(ranef(sch.lme.1))

## (Intercept)
## 1 0.37375960
## 2 0.50204297
## 3 0.50388873
## 4 0.01813312
## 5 0.24043097
## 6 0.54139222
The first school, for example, is estimated to be 0.37 points higher than the overall (fixed) intercept.
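For reference, the random-intercept model fitted as sch.lme.1 could be written as below. The notation is an assumption made here for illustration (i indexes schools, j indexes students within a school), chosen to match the b̂0i used in the text.

```latex
\begin{aligned}
\mathrm{normexam}_{ij} &= \beta_0 + \beta_1\,\mathrm{standlrt}_{ij} + b_{0i} + \varepsilon_{ij},\\
b_{0i} &\sim N(0, \sigma_0^2), \qquad \varepsilon_{ij} \sim N(0, \sigma^2)\\[4pt]
\hat\beta_0 + \hat b_{0,1} &= 0.0024 + 0.3738 = 0.3761 \quad \text{(estimated intercept of school 1)}
\end{aligned}
```

In the output above, the estimate of sigma_0 is 0.3035 (the StdDev of the intercept) and the estimate of sigma is 0.7521 (the residual StdDev).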
Now let’s add a random slope for LRT per school. Note that we don’t need to say +1 for the random intercept,
but that R automatically adds a random intercept along with a random LRT slope per school.
sch.lme.2 <- lme(fixed=normexam~standlrt, random=~standlrt | school, data=london, method="ML")
summary(sch.lme.2)

## Linear mixed-effects model fit by maximum likelihood


## Data: london
## AIC BIC logLik
## 9328.84 9366.693 -4658.42
##
## Random effects:
## Formula: ~standlrt | school
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.3007308 (Intr)
## standlrt 0.1205745 0.497
## Residual 0.7440777
##
## Fixed effects: normexam ~ standlrt
## Value Std.Error DF t-value p-value
## (Intercept) -0.0115074 0.03979168 3993 -0.289191 0.7725
## standlrt 0.5567280 0.01994280 3993 27.916244 0.0000
## Correlation:
## (Intr)
## standlrt 0.365
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.83123175 -0.63247485 0.03404143 0.68320630 3.45617381
##
## Number of Observations: 4059
## Number of Groups: 65
Can you interpret this model?
In the output above, the standard deviations of the random effects and residuals are displayed. To extract
the variances of the random effects, use the VarCorr() function:
VarCorr(sch.lme.2)

## school = pdLogChol(standlrt)
## Variance StdDev Corr
## (Intercept) 0.09043900 0.3007308 (Intr)
## standlrt 0.01453822 0.1205745 0.497
## Residual 0.55365165 0.7440777
Of course we could have squared the standard deviations to get the variances of the random effects and the
residuals, but the VarCorr() function in combination with an lme object will display both the variances and
the standard deviations of the random effects, and of the residuals.
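As a check on the arithmetic, the variances can be recomputed by hand from the standard deviations printed by summary(sch.lme.2). The sketch below types those numbers in directly (so it runs without the fitted model) and also computes the covariance between the random intercept and the random slope, which the output reports only as a correlation. The variable names are made up for this illustration.

```r
# Standard deviations and correlation copied from summary(sch.lme.2)
sd.int   <- 0.3007308   # SD of the random intercepts
sd.slope <- 0.1205745   # SD of the random standlrt slopes
sd.res   <- 0.7440777   # residual SD
corr     <- 0.497       # correlation between random intercept and slope

# Variances are the squared SDs (compare with the Variance column of VarCorr)
var.int   <- sd.int^2
var.slope <- sd.slope^2
var.res   <- sd.res^2
round(c(intercept = var.int, slope = var.slope, residual = var.res), 5)

# Covariance between random intercept and slope = correlation * SD * SD
cov.int.slope <- corr * sd.int * sd.slope
round(cov.int.slope, 4)
```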
Take a look at the first few random effects of this new model:
head(ranef(sch.lme.2))

## (Intercept) standlrt
## 1 0.37492828 0.12497491
## 2 0.47020355 0.16472764
## 3 0.47977967 0.08084160
## 4 0.03501158 0.12720722
## 5 0.24627621 0.07205202
## 6 0.51840173 0.05859393
These are the estimated random effects for the intercept b̂0i and LRT slope b̂1i for the first few schools. These
first 6 intercepts and slopes are coincidentally all positive, meaning that these 6 schools all have a higher
than average intercept and higher than average LRT slope. To examine all 65 intercepts and slopes, you can
run the command ranef(sch.lme.2).
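To turn these random effects into school-specific regression lines, add them to the fixed effects: the intercept of school i is the fixed intercept plus b̂0i, and its LRT slope is the fixed slope plus b̂1i. The sketch below does this for the first six schools, using numbers copied from the output above so that it runs without the fitted model. The data frame re and its column names are made up for this illustration.

```r
# Fixed effects copied from summary(sch.lme.2)
fixed.int   <- -0.0115074
fixed.slope <-  0.5567280

# Random effects for the first six schools, copied from head(ranef(sch.lme.2))
re <- data.frame(
  school = 1:6,
  b0 = c(0.37492828, 0.47020355, 0.47977967, 0.03501158, 0.24627621, 0.51840173),
  b1 = c(0.12497491, 0.16472764, 0.08084160, 0.12720722, 0.07205202, 0.05859393)
)

# School-specific line: (fixed intercept + b0) + (fixed slope + b1) * standlrt
re$intercept <- fixed.int + re$b0
re$slope     <- fixed.slope + re$b1
re[, c("school", "intercept", "slope")]
```

With the fitted model available, coef(sch.lme.2) should return these sums for all 65 schools at once.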
Again, don’t forget to save your R script for further use later!

Exercise 3
We’ll finish reproducing the analysis of the schools dataset. (Again, if you stopped your R/RStudio session,
you may need to read the data in again.)
Let’s add gender (a child-level variable) to the mixed model with random intercept & random slope:
sch.lme.3 <- lme(normexam~standlrt + factor(gender), random=~standlrt | school, data=london, method="ML")
summary(sch.lme.3)

## Linear mixed-effects model fit by maximum likelihood


## Data: london
## AIC BIC logLik
## 9301.358 9345.518 -4643.679
##
## Random effects:
## Formula: ~standlrt | school
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.2936242 (Intr)
## standlrt 0.1212575 0.533
## Residual 0.7416710
##
## Fixed effects: normexam ~ standlrt + factor(gender)
## Value Std.Error DF t-value p-value
## (Intercept) -0.1117670 0.04305229 3992 -2.596075 0.0095
## standlrt 0.5529634 0.01998634 3992 27.667060 0.0000
## factor(gender)1 0.1757988 0.03225659 3992 5.450011 0.0000
## Correlation:
## (Intr) stndlr
## standlrt 0.370
## factor(gender)1 -0.426 -0.036
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.83299563 -0.63770664 0.02428286 0.68232962 3.45010389
##
## Number of Observations: 4059
## Number of Groups: 65
Now we’ll add two school-level variables (school gender & school average):
sch.lme.4 <- lme(normexam~standlrt + factor(gender)+ factor(schgend) + factor(schav),
random=~standlrt | school, data=london, method="ML")
summary(sch.lme.4)

## Linear mixed-effects model fit by maximum likelihood


## Data: london
## AIC BIC logLik
## 9300.414 9369.809 -4639.207
##
## Random effects:
## Formula: ~standlrt | school
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.2660309 (Intr)
## standlrt 0.1212542 0.499

## Residual 0.7417279
##
## Fixed effects: normexam ~ standlrt + factor(gender) + factor(schgend) + factor(schav)
## Value Std.Error DF t-value p-value
## (Intercept) -0.2647657 0.08159434 3992 -3.244902 0.0012
## standlrt 0.5515520 0.02006950 3992 27.482097 0.0000
## factor(gender)1 0.1671313 0.03385088 3992 4.937282 0.0000
## factor(schgend)2 0.1869684 0.09777600 60 1.912211 0.0606
## factor(schgend)3 0.1570156 0.07780641 60 2.018029 0.0481
## factor(schav)2 0.0668879 0.08534936 60 0.783696 0.4363
## factor(schav)3 0.1742650 0.09876108 60 1.764511 0.0827
## Correlation:
## (Intr) stndlr fct()1 fctr(schg)2 fctr(schg)3 fctr(schv)2
## standlrt 0.205
## factor(gender)1 -0.182 -0.037
## factor(schgend)2 -0.340 0.002 0.157
## factor(schgend)3 -0.253 0.025 -0.235 0.230
## factor(schav)2 -0.761 -0.035 -0.014 0.061 0.000
## factor(schav)3 -0.648 -0.080 -0.018 0.061 -0.083 0.622
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.83388977 -0.63434044 0.02308039 0.67675187 3.41363835
##
## Number of Observations: 4059
## Number of Groups: 65
To get approximate (Wald) confidence intervals, use the intervals() function in the nlme package:
intervals(sch.lme.4)

## Approximate 95% confidence intervals


##
## Fixed effects:
## lower est. upper
## (Intercept) -0.424598124 -0.26476566 -0.1049332
## standlrt 0.512238511 0.55155200 0.5908655
## factor(gender)1 0.100821935 0.16713130 0.2334407
## factor(schgend)2 -0.008444022 0.18696837 0.3823808
## factor(schgend)3 0.001513856 0.15701559 0.3125173
## factor(schav)2 -0.103688945 0.06688792 0.2374648
## factor(schav)3 -0.023116161 0.17426500 0.3716462
##
## Random Effects:
## Level: school
## lower est. upper
## sd((Intercept)) 0.21573177 0.2660309 0.3280576
## sd(standlrt) 0.08918693 0.1212542 0.1648514
## cor((Intercept),standlrt) 0.11905278 0.4992424 0.7517526
##
## Within-group standard error:
## lower est. upper
## 0.7255056 0.7417279 0.7583129
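The fixed-effects intervals above can be approximated by hand as estimate ± t quantile × standard error, using the DF column from summary(sch.lme.4). The sketch below does this for one child-level coefficient (standlrt, 3992 DF) and one school-level coefficient (factor(schgend)2, 60 DF), with the numbers typed in from the summary output. The results should agree with intervals() to about three decimal places; the small remaining gap presumably comes from how nlme scales the standard errors internally for ML fits, which is an assumption and is not reproduced here.

```r
# Estimates, standard errors and degrees of freedom copied from summary(sch.lme.4)
est <- c(standlrt = 0.5515520, schgend2 = 0.1869684)
se  <- c(standlrt = 0.02006950, schgend2 = 0.09777600)
df  <- c(standlrt = 3992, schgend2 = 60)

# t quantiles: close to 1.96 for 3992 DF, close to 2.00 for 60 DF
crit <- qt(0.975, df)
crit

# Wald interval: estimate +/- t quantile * standard error
wald <- cbind(lower = est - crit * se, est = est, upper = est + crit * se)
round(wald, 4)
```

Note that the school-level coefficient uses a larger multiplier because it has only 60 degrees of freedom, which is part of why its interval is comparatively wide.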
Now let’s use the lmer function from the lme4 package to fit the same final model. For more on the comparison
between lme and lmer, see the document “Fitting linear mixed effects models in R.docx” on Moodle. First, we’ll
detach nlme so it doesn’t get in the way (there is some overlap in functions between the two packages):
detach("package:nlme")
library(lme4)

Here is the linear mixed model with random intercept & random slope, plus child gender, school gender &
school avg, fitted with the lmer function:
sch.lme.4a <- lmer(formula=normexam~standlrt + factor(gender)+ factor(schgend) + factor(schav) + (standlrt | school),
data=london, REML=FALSE)
summary(sch.lme.4a)

## Linear mixed model fit by maximum likelihood ['lmerMod']


## Formula:
## normexam ~ standlrt + factor(gender) + factor(schgend) + factor(schav) +
## (standlrt | school)
## Data: london
##
## AIC BIC logLik deviance df.resid
## 9300.4 9369.8 -4639.2 9278.4 4048
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.8339 -0.6343 0.0231 0.6768 3.4136
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## school (Intercept) 0.07077 0.2660
## standlrt 0.01470 0.1213 0.50
## Residual 0.55016 0.7417
## Number of obs: 4059, groups: school, 65
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.26476 0.08152 -3.248
## standlrt 0.55155 0.02005 27.506
## factor(gender)1 0.16713 0.03382 4.942
## factor(schgend)2 0.18697 0.09769 1.914
## factor(schgend)3 0.15702 0.07774 2.020
## factor(schav)2 0.06689 0.08528 0.784
## factor(schav)3 0.17426 0.09868 1.766
##
## Correlation of Fixed Effects:
## (Intr) stndlr fct()1 fctr(schg)2 fctr(schg)3 fctr(schv)2
## standlrt 0.205
## fctr(gndr)1 -0.182 -0.037
## fctr(schg)2 -0.340 0.002 0.157
## fctr(schg)3 -0.253 0.025 -0.235 0.230
## fctr(schv)2 -0.761 -0.035 -0.014 0.061 0.000
## fctr(schv)3 -0.648 -0.080 -0.018 0.061 -0.083 0.622
Take a minute to compare the output from the two functions, and see if you can interpret the results of the
model.
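With the lme4 package, we can use the built-in confint() function to get better (profile likelihood) confidence
intervals. In its output, .sig01 and .sig03 are the standard deviations of the random intercept and the random
standlrt slope, .sig02 is their correlation, and .sigma is the residual standard deviation. (Warning: this may
take a minute!)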

confint(sch.lme.4a)

## 2.5 % 97.5 %
## .sig01 0.216944467 0.33052884
## .sig02 0.130231607 0.76802946
## .sig03 0.086637119 0.16256042
## .sigma 0.725620061 0.75844018
## (Intercept) -0.432503559 -0.09853868
## standlrt 0.510973904 0.59117563
## factor(gender)1 0.100793252 0.23346717
## factor(schgend)2 -0.007326814 0.38149357
## factor(schgend)3 0.000777421 0.31257546
## factor(schav)2 -0.107245098 0.24419862
## factor(schav)3 -0.040521916 0.39009440
Before starting the next exercise, you might want to clean up your workspace (using the rm function). Note
that the command given below (rm(list=ls())) is a “nuclear option” since it will delete everything in your
workspace. I find it useful, but have seen complaints about the habit. Since we’ll go back to the nlme package,
this would also be a good time to detach lme4 and re-attach nlme:
rm(list=ls())
detach("package:lme4")
library(nlme)
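If the “nuclear option” feels too drastic, a gentler alternative is to remove only selected objects by name. The sketch below removes everything whose name starts with sch.lme and keeps the data. The objects created here are placeholders standing in for the fitted models and the school data, so that the example runs on its own.

```r
# Placeholder objects (assumption) standing in for the models fitted in this exercise
sch.lme.3 <- "model 3"
sch.lme.4 <- "model 4"
london    <- data.frame(x = 1:3)  # stand-in for the school data

# Remove only objects whose names start with "sch.lme", keeping the data
rm(list = ls(pattern = "^sch\\.lme"))

# List what is left in the workspace
ls()
```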
