R Handout Statistics and Data Analysis Using R
Valentin Wimmer
contact: Valentin.Wimmer@wzw.tum.de
Lehrstuhl für Pflanzenzüchtung, TUM
December 2010
Contents
1 Introduction
  1.1 Installing R
  1.2 Working with R
  1.3 First steps
  1.4 R help
2 Data management
  2.1 Data structure
  2.2 Read in data
  2.3 Data manipulation
  2.4 Data export
3 Descriptive Statistics
4 Graphics
  4.1 Classic graphics
  4.2 Controlling the appearance of a plot
5 Probability Distributions
  5.1 R as a set of statistical tables
R output is displayed as
[1] 10
Margin notes mark important notes, useful hints and descriptions of datasets.
1 Introduction
What is R?
R is an environment for statistical computing and graphics
The root of R is the S-Language
Today R is the most important statistical software
Open source, everybody can contribute to R
Library of add-on R packages
R package: collection of functions, documentations, data and examples
The main distribution is maintained by the R Development Core Team
Advantages of R
Rapid development
Open Source and thus no Black Box. All algorithms and functions are
visible in the source code.
No fees
Many systems: Macintosh, Windows, Linux, ...
Quite fast
Many new statistical methods available as R-packages
Disadvantages of R
Open Source: compatibility with older versions not so important
No real quality check of the contents of R-packages
No graphical user interface
Error messages are often not useful
There exist faster programs for special problems
1.1 Installing R
The base distribution is available under www.r-project.org
The packages can be installed directly within R or from www.CRAN.r-project.org
or www.bioconductor.org
There exist different versions for each system
*.zip for Windows and *.tar.gz for Unix
For windows, use the Precompiled binary distributions of the base system
1.2 Working with R
Use a text-editor, e.g. Tinn-R to save your scripts
Make comments to your code (comment lines start with #):
Set work directories
R> getwd()
R> setwd("")
Specifying a path in R
The path has to be given in the form "C:/..." or "C:\\..."
Working with R
In R commands are evaluated directly and the result is printed on screen
Your input may span more than one line
Different commands in the same line are separated with ;
Available packages, functions or objects are available in the R workspace
To list the available objects, type
R> ls()
1.3 First steps
Use R as a calculator
R> print(sqrt(5) * 2)
[1] 4.472136
First steps
Operators: +, -, *, /, &, &&, |, ||
Note that R is case sensitive. Names of objects have to start with a letter.
Other numeric functions

abs()                          absolute value
sqrt()                         square root
round(), floor(), ceiling()    round, round down/up
sum(), prod()                  sum and product
log(), log10(), log2()         logarithm
exp()                          exponential function
sin(), cos(), tan()            trigonometric functions
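A quick sketch of these functions at the console (toy values, not from the handout):

```r
x <- c(-2.5, 4, 9.1)
abs(x)              # absolute values: 2.5 4.0 9.1
sqrt(16)            # square root: 4
round(2.567, 1)     # 2.6
floor(2.9)          # round down: 2
ceiling(2.1)        # round up: 3
sum(1:4)            # 10
prod(1:4)           # 24
log(exp(1))         # natural logarithm of e: 1
```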
1.4 R help
The help-function
R> help(mean)
R> ?mean
R manuals
An Introduction to R
R Data Import/Export
R Installation and Administration
Writing R Extensions
www.rseek.org
Exercises
Calculate the following with R
(sqrt(5) - 2)^2 * 6 + 5 * exp(2)
Look at the help file of floor() and ceiling().
What is the result?
R> a <- (4 <= floor(4.1))
R> b <- (5 > ceiling(4.1))
R> a != b
(5 < 3) | (3 < 5)
(5 < 3) || (3 < 5)
(5 < 3) || c((3 < 5), c(5 < 3))
(5 < 3) | c((3 < 5), c(5 < 3))
(5 < 3) & c((3 < 5), c(5 < 3))
(5 < 3) && c((3 < 5), c(5 < 7))
(1 < 3) && c((3 < 5), c(8 < 7))
2 Data management
iris dataset
Throughout this course we will use several datasets coming with R. The
most famous one is Fisher's iris data. To load it, type
R> data(iris)
R> head(iris)
[Output: the first six rows of the iris data]

2.1 Data structure
R is an object-oriented programming language, so every object is instance of
a class. The name of the class can be determined by
R> class(iris)
[1] "data.frame"
Data structure
Multiple numbers can be combined in a vector
R> c(4, 2, 5)
[1] 4 2 5
A matrix is constructed with matrix():
R> (m <- matrix(c(2, 5, 6, 3, 4, 3), nrow = 2))
     [,1] [,2] [,3]
[1,]    2    6    4
[2,]    5    3    3
matrix product
R> t(m) %*% m
     [,1] [,2] [,3]
[1,]   29   27   23
[2,]   27   45   33
[3,]   23   33   25
inverse of a matrix m^(-1)
R> (m <- matrix(c(1, 0, 0, 0, 1, 0, 0, 1, 1), ncol = 3))
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    1
[3,]    0    0    1
R> solve(m)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1   -1
[3,]    0    0    1
function     result                                     useful for
str()        general structure of the object            list, data.frame
head()       show the first 6 lines                     data.frame, matrix
dim()        dimension of the object (rows x columns)   data.frame, matrix
length()     length of the object                       list, numeric
nchar()      number of characters in one string         character
summary()    important statistical parameters           list, data.frame
names()      variable names                             matrix, data.frame
rownames()   row names                                  matrix, data.frame
colnames()   column names                               matrix, data.frame
Class of an S3 object
To check if an object has a special class, use
R> obj <- 3
R> is.numeric(obj)
[1] TRUE
R> is.character(obj)
[1] FALSE
The class of an object can simply be changed (no formal check of attributes)
R> obj2 <- obj
R> (obj2 <- as.character(obj2))
[1] "3"
R> is.character(obj2)
[1] TRUE
R> is.numeric(obj2)
[1] FALSE
Handling factor-objects
A factor is used for objects with repeated levels
A factor has two arguments: levels and labels
R> levels(iris$Species)
[1] "setosa"     "versicolor" "virginica"
R> head(labels(iris$Species))
[1] "1" "2" "3" "4" "5" "6"
Exercises
Basic
What is the difference between
R> v <- c("An", "Introduction", "to", "R")
R> nchar(v)
R> length(v)
Given the matrix
      4 12
X =   5  3
      9 11
What are the dimensions of X?
20 7 10 0.8 1 8
Exercises
Advanced
What are the last three values of Sepal.Length?
Look at the help file of matrix. Try to find out the use of the byrow
argument.
The variable Sepal.Length should be categorised. We want to have two
categories, one up to 6 cm and one above 6 cm. Create a factor and give
labels for the categories.
How many observations are in each of the three groups?
2.2 Read in data
To read in tabulated data, use
R> read.table(file, header = TRUE, ...)
where file is the name (+ path) of the data file or a URL. Use header=TRUE
if the first row of the data contains the variable names. This function is the
default reading function, is suitable for txt, dat, ... files and returns an object
of class data.frame. For csv files exported from Excel use:
R> read.csv2(file, header = TRUE, sep = ";", dec = ",", ...)
Look at the data before reading it with R (header, decimal symbol, missing
values, ...)
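A minimal round trip with read.table(); the temporary file stands in for a real data file:

```r
# Write a small data.frame to a temporary file, then read it back.
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(x = 1:3, y = c("a", "b", "c")),
            file = tmp, row.names = FALSE)
mydata <- read.table(tmp, header = TRUE)
class(mydata)   # "data.frame"
dim(mydata)     # 3 2
```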
R> 0:-4
[1]  0 -1 -2 -3 -4
Vectors with a regular structure are constructed with the functions rep
and seq
R> rep(c(1, 5), each = 2, times = 3)
[1] 1 1 5 5 1 1 5 5 1 1 5 5
Entries of a vector
R> (o1 <- seq(4, 10, by = 2))
[1]  4  6  8 10
R> o1[2]
[1] 6
Entries of a matrix
R> (o2 <- matrix(data = 1:12, nrow = 3, ncol = 4))
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
R> o2[2, 3]
[1] 8
R> o2[2, ]
[1]  2  5  8 11
R> o2[, 3]
[1] 7 8 9
Entries of a data.frame
R> iris[2, 5]
[1] setosa
Levels: setosa versicolor virginica
R> iris[3:5, "Species"]
[1] setosa setosa setosa
Levels: setosa versicolor virginica
R> head(iris$Petal.Width)
[1] 0.2 0.2 0.2 0.2 0.2 0.4
Entries of a list
R> mod <- lm(Petal.Width ~ Species, data = iris)
R> mod[["coefficients"]]
      (Intercept) Speciesversicolor  Speciesvirginica
            0.246             1.080             1.780
R> mod[[1]]
      (Intercept) Speciesversicolor  Speciesvirginica
            0.246             1.080             1.780
Attaching a data.frame
The attach() command can be used for datasets. Afterwards the variables of
the dataset are attached to the R search path.
You only need to type
R> Petal.Width
instead of
R> iris$Petal.Width
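For example, with the iris data (detach() undoes the attach afterwards):

```r
data(iris)
attach(iris)
m1 <- mean(Petal.Width)        # uses the attached column
detach(iris)
m2 <- mean(iris$Petal.Width)   # the usual way
m1 == m2                       # TRUE
```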
Reading an Excel file via the RODBC package:
library(RODBC)
trac <- odbcConnectExcel("Table.xls")
Tnames <- sqlTables(trac)$TABLE_NAME
df1 <- sqlFetch(trac, Tnames[1], as.is = TRUE)
odbcClose(trac)
Exercises
Load the file data2read from the website. Have a look at the file.
Read in the data using the read.table function and save the data as an object
with name mydata. What is the class of object mydata?
Assess the data structure of mydata with the methods str, head, dim,
colnames and rownames.
What are the values of the third row of mydata?
Attach mydata and give the rows of mydata where x > 8.
2.3
Data manipulation
2.4 Data export
Data of R can be written into a file with the write.table() command
The R workspace can be saved with the save() command and can be reloaded
with load()
Function xtable() creates LaTeX table output
R> library(xtable)
R> xtable(summary(iris), caption = "Summary statistic of iris data")
 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width         Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
Exercises
Construct the following sequences using the functions rep or seq:
[1] 3 4 5 6 7 3 4 5 6 7 3 4 5 6 7
[1] -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0
[1] 2 2 2 2 3 3 3 3 4 4 4 4 2 2 2 2 3 3 3 3 4 4 4 4
[1] 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0
3 Descriptive Statistics
mean(x)                   compute the (trimmed) mean
median(x)                 compute the median (50% quantile)
min(x)                    compute the minimum
max(x)                    compute the maximum
range(x)                  compute the minimum and maximum
quantile(x, probs=c())    compute the probs*100% quantiles
sd(x), var(x)             compute the standard deviation/variance
cov(x,y), cor(x,y)        compute the covariance/correlation
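Applied to the education rate of the swiss data (which ships with R):

```r
data(swiss)
x <- swiss$Education
mean(x)
median(x)
range(x)                            # minimum and maximum
quantile(x, probs = c(0.25, 0.75))
sd(x)
var(x)                              # equals sd(x)^2
cor(swiss$Education, swiss$Fertility)
```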
Cross Tables
The table command gives the frequencies of a categorical variable; typically it is used with one or more factor variables
R> table(CO2$Type)

     Quebec Mississippi
         42          42

R> table(CO2$Type, CO2$Treatment)

              nonchilled chilled
  Quebec              21      21
  Mississippi         21      21
Missing values are removed with the na.rm argument in many functions such as those in Table 3
R> mean(c(1:5, NA, 8), na.rm = TRUE)
[1] 3.833333
Exercises
Basic
Compute the 5%, the 10% and the 75% quantile of the education rate in
the swiss data.
Is the distribution of the education rate in the swiss data symmetric, left
skewed or right skewed?
Compute the covariance and the correlation between the ambient carbon
dioxide concentrations and the carbon dioxide uptake rates in the CO2
data. How do you interpret the result?
Compute the correlation coefficient for each Treatment group separately.
Advanced
Compute the correlation between the following two variables: v1=[11,5,6,3,*,35,4,4],
v2=[4.6,2,1.6,*,*,7,2,3], where * indicates a missing value.
Give ranks to uptake and conc and compute Spearman's correlation coefficient.
4 Graphics
Graphics with R
Many graphical possibilities and settings in R
Division into high-level and low-level graphics
Standard graphics for each object type (try plot for a data.frame)
Additional graphics system in the library lattice
Standard graphical function: plot to plot y against x
You can either use plot(x, y) or plot(y ~ x)
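The two call styles side by side, sketched on the swiss data (the null device keeps the example from writing a file):

```r
data(swiss)
pdf(file = NULL)                          # plot to a null device
plot(swiss$Agriculture, swiss$Education)  # plot(x, y)
plot(Education ~ Agriculture, data = swiss)  # the formula interface
dev.off()
```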
4.1 Classic graphics

[Figure: scatterplot of Education against Agriculture in the swiss data]

[Figure: boxplot of Petal.Width for the species setosa, versicolor and virginica in the iris data]

[Figure: Histogram of swiss$Education]

[Figure: barplots and a mosaic plot of the HairEyeColor data (hair colours Black, Brown, Red, Blond; eye colours Brown, Blue, Hazel, Green) with standardized residuals]
4.2 Controlling the appearance of a plot
The type argument of plot:
p for points
l for lines
b for both
h for histogram-like vertical lines
s/S for stair steps
n for no plotting
Title and axis annotation:
main: a string specifying the plot title
sub: a string specifying the subtitle
xlab: a string specifying the title for the x axis
ylab: a string specifying the title for the y axis
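Combining these arguments in one call (a toy example, not from the handout):

```r
x <- 1:10
pdf(file = NULL)                 # plot to a null device
plot(x, x^2, type = "b",
     main = "Squares", sub = "a toy example",
     xlab = "x", ylab = "x squared")
dev.off()
```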
22
23
mar[SOUTH<1]
oma[EAST<4]
mar[EAST<4]
0 2 4 6 8
mar[WEST<2]
Plot Area
0
10
Figure
oma[SOUTH<1]
Outer
Margin Area
10
6
X
10
10
Figure
2
0
Figure
X
0 2 4 6 8
0 2 4 6 8
0 2 4 6 8
0 2 4 6 8
Title Line
oma[WEST<2]
mar[NORTH<3]
4
0
Figure
10
X
Figure
Outer Margin Area
24
[Figure: scatterplot of Sepal.Width against Sepal.Length]

[Figure: the same scatterplot with a legend for the species setosa, versicolor and virginica]
Saving graphics
There are two possibilities to save graphics in R
Right click on the graphic and choose "save as"
Better alternative: the functions pdf, png, postscript
Usage:
R> pdf("myplot.pdf")
R> plot(x, y)
R> dev.off()
Exercises
Basic
Make a histogram of each of Sepal.Length, Petal.Length, Sepal.Width
and Petal.Width in one plot for the iris data. Give proper axis labels,
a title for each subplot and a title for the whole plot.
Now we use the swiss data. Plot the Education rate against the Agriculture
rate. Use different symbols for the provinces with more Catholic and those
with more Protestant inhabitants. Choose axis limits from 0 to 100. Add a
legend to your plot.
Save this plot as a pdf file.
Advanced
Add two vertical lines and two horizontal lines to the plot which represent
the means of Agriculture and Education for both subgroups
(Catholic vs Protestant).
Look at the layout function. This is an alternative for par(mfrow).
5 Probability Distributions
5.1 R as a set of statistical tables
Several theoretical distributions are included in R
The probability density function is defined as P(X = x)
The cumulative distribution function (CDF) is defined as P(X <= x)
The quantile function gives the smallest x for a given q where P(X <= x) >= q
The function names consist of a prefix and the probability name. The prefix
d is for the density, p for the CDF, q for the quantile function and
r for simulation (random samples).
Probability Distributions
The name of the distribution is usually an abbreviation
R> pnorm(0.5)
[1] 0.6914625
gives the value of the CDF of a standard normal distribution N(0,1) at point
0.5.
Additional arguments can be passed to the distribution functions. For example,
R> pnorm(0.5, mean = 2, sd = 3)
[1] 0.3085375
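The four prefixes for the normal distribution side by side:

```r
dnorm(0)                    # density of N(0,1) at 0: 1/sqrt(2*pi), about 0.3989
pnorm(0)                    # CDF at 0: 0.5
qnorm(0.5)                  # quantile function, the inverse of the CDF: 0
set.seed(1)
rnorm(3, mean = 2, sd = 3)  # three random draws from N(2, 3^2)
```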
Distribution     R name   additional arguments
normal           norm     mean, sd
binomial         binom    size, prob
hypergeometric   hyper    m, n, k
Poisson          pois     lambda
geometric        geom     prob
Student's t      t        df, ncp
chi-squared      chisq    df, ncp
F                f        df1, df2, ncp
gamma            gamma    shape, scale
beta             beta     shape1, shape2, ncp
[Figure: density of the N(4, 0.5) distribution]
Computing quantiles
To compute the 95% quantile of the N(0,2)-distribution, use qnorm()
R> qnorm(0.95, mean = 0, sd = 2)
[1] 3.289707
5.2

[Figure: quantile function qnorm(x, mean = 0, sd = 2) of the N(0,2) distribution, with the 0.95 quantile 3.28 marked]
to display the numbers by stem (a stem-and-leaf plot) use stem
to make a density estimation use density
to plot the empirical cumulative distribution function, use ecdf
to compare the quantiles of two univariate data sets x and y, use qqplot(x,y)
to compare the empirical quantiles with a normal distribution, use qqnorm
Density estimation
The faithful dataset
Waiting time between eruptions and the duration of the eruption for the Old
Faithful geyser in Yellowstone National Park, Wyoming, USA.
A data frame with 272 observations on 2 variables:
eruptions numeric Eruption time in mins
waiting numeric Waiting time to next eruption (in mins)
The function density computes kernel density estimates
An important argument for density estimation is the bandwidth (argument
bw)
The rug function can be used to add a rug representation to the plot
Density estimation
29
R> data(faithful)
R> plot(density(faithful$eruptions))
R> rug(faithful$eruptions)
[Figure: density.default(x = faithful$eruptions), the kernel density estimate with a rug of the observations]

[Output: stem-and-leaf display of faithful$eruptions]
[Figure: empirical cumulative distribution function ecdf(faithful$eruptions)]
R> qqnorm(faithful$eruptions)
R> qqline(faithful$eruptions)
[Figure: Normal Q-Q plot of faithful$eruptions with qqline]
Would you say that the eruptions of Old Faithful are normally distributed?
Plot a histogram together with a density estimation for the eruptions
data. Try different values for the bandwidth.
Exercises
Advanced
Continuous distributions can be visualized with curve. How can categorical
distributions be visualized?
The distribution of the eruption data can be seen as a mixture of two
normal distributions. Try to find the mean and variance of these normals
manually and plot a histogram together with the distribution of the
mixture of normals.
HairEyeColor dataset
The HairEyeColor data
Distribution of hair and eye color and sex in 592 statistics students at the
University of Delaware. A 3-dimensional array resulting from cross-tabulating
592 observations on 3 variables. The variables and their levels are as follows:
1. Hair levels: Black, Brown, Red, Blond
2. Eye levels: Brown, Blue, Hazel, Green
3. Sex levels: Male, Female
6 Statistical Tests
Inference = process of drawing conclusions about a population based on
a sample from the whole population
Example: comparison of two samples of 20 randomly selected plants from
two fields. The population are all plants on each field
A test is used to assess the hypotheses

                            Truth
Test               H0               H1
Non-significant    True Negative    Type-2 Error
Significant        Type-1 Error     True Positive

Statistical Tests
The type of suitable test depends on the population distribution and the question setting:
6.1 Student's t-Test
Used to test the null hypothesis that the means of two independent
populations of sizes n1 and n2 are the same,
H0: mu1 = mu2.
The test statistic is computed as
t = (ybar1 - ybar2) / (s * sqrt(1/n1 + 1/n2))
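In R the test is carried out with t.test(); a sketch on the CO2 data used below (var.equal = TRUE gives the classical two-sample t-test):

```r
data(CO2)
res <- t.test(uptake ~ Treatment, data = CO2, var.equal = TRUE)
res$statistic   # the t value
res$p.value     # small values speak against H0: mu1 = mu2
```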
[Figure: boxplots of uptake for the nonchilled and chilled plants in the CO2 data]

R> qqnorm(CO2[CO2$Treatment == "chilled", ]$uptake)
R> qqline(CO2[CO2$Treatment == "chilled", ]$uptake)
R> qqnorm(CO2[CO2$Treatment == "nonchilled", ]$uptake)
R> qqline(CO2[CO2$Treatment == "nonchilled", ]$uptake)
[Figure: Normal Q-Q plots of uptake for the chilled and the nonchilled group]
6.2 Wilcoxon test
This test is more conservative than a t-test on the same data
6.3 Chi-squared-test
A sample of n observations in two nominal (categorical) variables is arranged
in a contingency table. Under H0 the row variable and the column variable
are independent.

        y
x       1     ...   c
1       n11   ...   n1c   n1.
2       n21   ...   n2c   n2.
...     ...         ...   ...
r       nr1   ...   nrc   nr.
        n.1   ...   n.c   n
Chi-squared-test
The test statistic for the chi-squared-test is
X^2 = sum_{j=1}^{r} sum_{k=1}^{c} (njk - Ejk)^2 / Ejk
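A small numeric check of this formula against chisq.test() (toy table; correct = FALSE switches off the continuity correction):

```r
tab <- matrix(c(20, 10, 10, 20), nrow = 2)
res <- chisq.test(tab, correct = FALSE)
res$expected                                  # every cell expects 15 here
res$statistic                                 # X^2 computed by R
sum((tab - res$expected)^2 / res$expected)    # the same value by hand
```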
[Figure: densities dchisq(x) of the chi-squared distribution for several degrees of freedom]

[Figure: quantile functions qchisq(x) of the chi-squared distribution for several degrees of freedom]
Chi-squared-test with R
Contingency table of hair and eye color of male statistics students
R> HairEyeColor[, , "Male"]
       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8
Chi-squared-test with R
The expected values Ejk, j = 1, ..., r, k = 1, ..., c are obtained as follows:
R> chisq.test(HairEyeColor[, , "Male"])$expected
       Eye
Hair       Brown     Blue     Hazel     Green
  Black 19.67025 20.27240  9.433692  6.623656
  Brown 50.22939 51.76703 24.089606 16.913978
  Red   11.94265 12.30824  5.727599  4.021505
  Blond 16.15771 16.65233  7.749104  5.440860
6.4 Kolmogorov-Smirnov-Test
This test is used either to compare two variables x and y or to compare one
variable with a theoretical distribution, e.g. the normal distribution. In R, this
test is implemented in the function ks.test(). The arguments are
x a numeric vector of data values.
y either a numeric vector of data values, or a character string naming
a cumulative distribution function or an actual cumulative distribution
function such as pnorm.
alternative indicates the alternative hypothesis and must be one of
two.sided (default), less, or greater.
... further arguments passed to the distribution function
Kolmogorov-Smirnov-Test with R
The Fertility rate in the swiss data should be compared with a normal
distribution. Simply use ks.test() on the vector of the 47 measurements
R> ks.test(swiss$Fertility, "pnorm")
One-sample Kolmogorov-Smirnov test
data: swiss$Fertility
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
The null hypothesis "the distribution is normal" is rejected at alpha = 0.05 because the data are not scaled
The KS-Test with scaled data results in
R> ks.test(scale(swiss$Fertility), "pnorm")
One-sample Kolmogorov-Smirnov test
data: scale(swiss$Fertility)
D = 0.1015, p-value = 0.7179
alternative hypothesis: two-sided
Exercises
Basic
The means of Sepal.Length should be compared for the species setosa
and virginica in the iris data. Think about the assumptions (normal
distributions, equal variances) and choose the right test. Is there a
significant difference in mean?
Compute the p-value with help of the t-statistic on slide 35 on your own.
What value would we have with a one-sided hypothesis?
Why do we have df=9 in the output of the chi-squared test on slide 38?
Advanced
It was necessary to scale the data before using ks.test. How can we
get the right result without using the scale method? Have a look at the
help for the function and think about the ...-argument.
Have a look at slide 38 again and compute the value of the chi-squared
statistic on your own.
Exercises
Advanced
You have given the following data. Are the two factors m1 [levels A
and B] and m2 [levels A, B and C] independent?

obs   1  2  3  4  5  6  7  8   9 10 11 12 13 14 15 16
m1    A  A  A  A  B  B  A  B   B  B  B  B  A  B  A  B
m2    B  A  A  B  B  C  C  B   C  C  A  B  A  C  B  B
7.1 ANOVA
ANOVA is used in so-called factorial designs. The following distinction can
be made:
balanced designs: same number of observations in each cell
unbalanced designs: different number of observations in each cell

             A low           A high        Total
B low    (3.5, 5, 4.5)    (5.5, 6, 6)      30.5
B high   (2, 4, 1.5)      (8, 11, 9)       35.5
Total        20.5            45.5            66

Table 7: balanced design with two treatments A and B with two levels 1 (low)
and 2 (high) with 3 measurements for each cell
ANOVA
The model for ANOVA with two factors and interaction is
y_ijk = μ + α_i + β_j + (αβ)_ij + ε_ijk,      (1)
with
y_ijk the kth measurement made in cell (i, j)
μ the overall mean
α_i the main effect of the first factor
β_j the main effect of the second factor
(αβ)_ij the interaction effect of the two factors
ε_ijk the residual error term with ε_ijk ~ N(0, σ²)
ANOVA formula in R
The ANOVA model is specified with a model formula. The two-way layout
with interactions of equation (1) is written in R as
y ~ A + B + A:B,
where A is the first and B the second factor. The interaction term is denoted by
A:B. An equivalent model formula is
y ~ A * B.
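The design of Table 7 can be entered and fitted as follows; the factor coding below is an assumption chosen so that it reproduces the coefficients shown later in this section:

```r
# Responses cell by cell: (A low, B low), (A high, B low),
# (A low, B high), (A high, B high)
y <- c(3.5, 5, 4.5,  5.5, 6, 6,  2, 4, 1.5,  8, 11, 9)
A <- factor(rep(rep(c("low", "high"), each = 3), times = 2),
            levels = c("low", "high"))
B <- factor(rep(c("low", "high"), each = 6), levels = c("low", "high"))
fit <- aov(y ~ A + B + A:B)
summary(fit)   # the ANOVA table
coef(fit)      # (Intercept), Ahigh, Bhigh, Ahigh:Bhigh
```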
ANOVA
In an ANOVA, the total variance of the observations is partitioned into parts
due to the main effects and interaction plus an error term. The hypothesis that
there is no significant effect is tested with an F-test. The assumptions for the
F-test are:
The observations are independent of each other.
The observations in each cell arise from a population having a normal
distribution.
The observations in each cell are from populations having the same variance.
[Figure: densities df(x) of the F-distribution for several combinations of df1 and df2]
ANOVA with R
Now we want to perform an ANOVA with two main effects and interaction
for the example from above
Let us first have a look at the data
ANOVA with R
The mean plot suggests that there are differences in the means, especially
for factor A
To apply ANOVA to the data we use the function aov
The result is shown in a nice way with the summary method
[Figure: mean plot of y for the levels of A and B]

            Pr(>F)
A        0.0001654 ***
B        0.2219123
A:B      0.0028433 **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA with R
The resulting ANOVA table shows that there is a highly significant effect
of factor A
The interaction effect is significant too
The main effect of B is not significant on the 5% level
The estimated effects are obtained with the coef method
R> coef(aov(y ~ A + B + A:B))
(Intercept)       Ahigh       Bhigh Ahigh:Bhigh
   4.333333    1.500000   -1.833333    5.333333
Note that the level "low" is treated as reference category due to
identification restrictions
R> interaction.plot(A, B, y)

[Figure: interaction plot of the mean of y against the levels of A for both levels of B]
Advanced
In a second step, we want to look whether the ambient carbon dioxide
concentration conc has an effect on uptake too. Check this with an ANOVA.
The variable should be dichotomized at the median before using it as a factor.
7.2 Linear Regression
Linear Regression model:
y_i = α + β x_i + ε_i,   i = 1, ..., n      (2)
with
y_i the response measure for individual i
x_i the covariate measure for individual i
α the regression coefficient for the intercept
β the regression coefficient for the slope
ε_i the error term with ε_i ~ N(0, σ²)
The coefficients α and β are estimated by α̂ and β̂, the least squares solution:

β̂ = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²      (3)
  = Cov(x, y) / Var(x)                                            (4)
  = r_xy · s_y / s_x                                              (5)
45
y = + , = 1, ..., n
(6)
The variability of the data set is measured through dierent sums of squares:
n
SStot
=
=1
n
SSreg
=
=1
n
SSerr
=
=1
(7)
(8)
(9)
It holds that
SSerr + SSreg = SStot and R2 =
SSreg
SStot
r = y y , = 1, ..., n
(10)
The sum of squares of the residuals is a measure for the t of the modell.
Residuals are the vertical distances of each point from the regression line.
The regression is unbiased, so E() = 0.
There should be no structure in the distribution of the residuals look
at the residuals.
46
Temp
2.429
The Coecients give the least squares estimates for (Intercept) and
A detailed model summary is obtained with the summary method
47
[Figure: scatterplot of Ozone against Temp in the airquality data]
R> summary(myMod)

Call:
lm(formula = Ozone ~ Temp, data = airquality)

Residuals:
  Median      3Q      Max
 -0.5869 11.3062 118.2705

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -146.9955    18.2872  -8.038 9.37e-13 ***
Temp           2.4287     0.2331  10.418  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.71 on 114 degrees of freedom
  (37 observations deleted due to missingness)
Multiple R-squared: 0.4877,  Adjusted R-squared: 0.4832
F-statistic: 108.5 on 1 and 114 DF,  p-value: < 2.2e-16
[Figure: scatterplot of Ozone against Temp with the fitted regression line]
myMod is of class lm
R> class(myMod)
[1] "lm"
Methods for lm objects:
coef()        returns a vector of length 2 containing the regression coefficients
fitted()      returns a vector of length n containing the predicted values
residuals()   returns a vector of length n containing the residuals
predict()     a method for predicting values with the model
summary()     a model summary
plot()        a plot for the model diagnostics
Model diagnostics

[Figure: the four diagnostic plots from plot(myMod): Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage with Cook's distance; observations 30, 62 and 117 are flagged]
[Figure: scatterplot of Ozone against Temp]

R> cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
[1] 0.6983603
R> myMod2 <- lm(Ozone ~ qTemp, data = airquality)
R> myMod2

Call:
lm(formula = Ozone ~ qTemp, data = airquality)

Coefficients:
(Intercept)        qTemp
  -57.07404      0.01612

R> summary(myMod2)

Call:
lm(formula = Ozone ~ qTemp, data = airquality)

Residuals:
     Min       1Q   Median       3Q      Max
-39.7067 -16.7409  -0.9534  10.5436 119.2933

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.074043   9.386333  -6.081 1.63e-08 ***
qTemp         0.016123   0.001485  10.859  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.23 on 114 degrees of freedom
  (37 observations deleted due to missingness)
Multiple R-squared: 0.5085,  Adjusted R-squared: 0.5042
F-statistic: 117.9 on 1 and 114 DF,  p-value: < 2.2e-16
[Figure: scatterplot of Ozone against Temp with the fitted curve from myMod2]
Exercises
Basic
Which of the models myMod or myMod2 fits the data better? Give reasons
for your choice.
What could be a problem of the second model?
In a second analysis we want to analyse the effect of the temperature on
the wind. Use the airquality data, make descriptive analyses and fit a
model. Give interpretations for the results.
Look at the model diagnostics. Are there outliers in the data? Highlight
the outliers.
Advanced
Look at the airquality data. As you can see, there are 37 missing values
for Ozone. Predict these values with the predict method and our
model myMod and plot them into the scatterplot.
We want to estimate the effect of the agriculture rate on the education
rate in the swiss data. What could be a problem here?
7.3 Multiple Linear Regression
y_i = β₀ + β₁ x_i1 + ... + β_q x_iq + ε_i,   i = 1, ..., n      (11)
Note that the "linear" in multiple linear regression applies to the regression
parameters, not the response or independent variables. So also quadratic
functions of independent variables fit in this model class.
Multiple Linear Regression
The multiple linear regression model can be written as a common model for
all observations as
y = Xβ + ε,      (12)
with
y = (y₁, ..., y_n)ᵀ as the vector of response variables
β = (β₀, β₁, ..., β_q)ᵀ as the vector of regression coefficients
ε = (ε₁, ..., ε_n)ᵀ as the vector of error terms
and the design matrix

        1  x11  x12  ...  x1q
        1  x21  x22  ...  x2q
X =     .   .    .   ..    .
        1  xn1  xn2  ...  xnq

The least squares estimate is
β̂ = (XᵀX)⁻¹ Xᵀ y.
The estimate is unbiased,
E(β̂) = β,
and the variance is
Var(β̂) = σ² (XᵀX)⁻¹.
The fitted values are
ŷ = β̂₀ + β̂₁ x₁ + ... + β̂_q x_q.      (13)
The residuals are
ε̂_i = y_i − ŷ_i,   i = 1, ..., n.
With the hat matrix
H = X(XᵀX)⁻¹Xᵀ
the standardized residuals are
r_i = ε̂_i / (σ̂ √(1 − h_ii)),
where h_ii is the ith diagonal element of H.

Source of variation   Sum of squares               Degrees of freedom
Regression            Σ_{i=1}^{n} (ŷ_i − ȳ)²       q
Residual              Σ_{i=1}^{n} (y_i − ŷ_i)²     n − q − 1
Total                 Σ_{i=1}^{n} (y_i − ȳ)²       n − 1

Table 8: Analysis of variance table for the multiple linear regression model.

The mean square ratio
F = [Σ_{i=1}^{n} (ŷ_i − ȳ)² / q] / [Σ_{i=1}^{n} (y_i − ŷ_i)² / (n − q − 1)]
tests the hypothesis
H0: β₁ = ... = β_q = 0.
Under H0, the statistic F has an F-distribution with q and n − q − 1 degrees
of freedom.
55
R =
(y
=1
n
(y
=1
y )2
y )2
=1
2
=1
n
(y
=1
y )2
2
2
In linear Regression R2 = ry and in multiple linear Regression R2 = ryy
holds.
y = 0 + 1 1 + 2 2 +
y = 0 + 1 1 +
(M1)
(14)
(M2).
(15)
Now, R2 R2
M1
M2
Dierent models can only be compared with R2 , if the have the same
response, the same number of parameters and an intercept.
R2 = 1 (1 R2 )
dj
n1
nq1
56
Temp
-9.55060
qTemp
0.07807
R> summary(mod)
Call:
lm(formula = Ozone ~ Temp + qTemp, data = airquality)
Residuals:
Min
1Q
-37.619 -12.513
Median
-2.736
3Q
Max
9.676 123.909
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 305.48577 122.12182
2.501 0.013800 *
Temp
-9.55060
3.20805 -2.977 0.003561 **
qTemp
0.07807
0.02086
3.743 0.000288 ***
--Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
1
Residual standard error: 22.47 on 113 degrees of freedom
(37 observations deleted due to missingness)
Multiple R-squared: 0.5442,
Adjusted R-squared: 0.5362
F-statistic: 67.46 on 2 and 113 DF, p-value: < 2.2e-16
The F-statistic in the model output compares the fitted model with the intercept-only model. This is a test for a general effect of the covariates.
Different nested models can also be compared. The smaller model has to be a submodel of the bigger model.
If we want to know whether the additional quadratic effect of Temp is meaningful, we compare the models
R> mod1 <- lm(Ozone ~ Temp, data = airquality)
R> mod2 <- lm(Ozone ~ Temp + qTemp, data = airquality)
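The nested models can then be compared with an F-test via anova. A sketch, assuming qTemp was created earlier as the squared temperature:

```r
airquality$qTemp <- airquality$Temp^2        # assumption: defined like this earlier
mod1 <- lm(Ozone ~ Temp, data = airquality)
mod2 <- lm(Ozone ~ Temp + qTemp, data = airquality)
anova(mod1, mod2)   # F-test for the additional quadratic effect
```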
Model Selection
More covariates always lead to an equal or better model fit and thus a higher SSReg.
Adding random numbers as covariates would also improve the model fit.
The task is to find a compromise between model fit and model complexity.
This is called model selection.
Akaike's information criterion is

AIC = 2k + n ln(SSRes / n),

with k = number of explanatory variables + intercept in the model and n = number of observations.
The idea is to penalize the number of explanatory variables to favour sparse models.
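In R the AIC of fitted models can be compared directly with the AIC function; the model with the lower value is preferred. A sketch with the two ozone models from above:

```r
airquality$qTemp <- airquality$Temp^2        # assumption: defined like this earlier
mod1 <- lm(Ozone ~ Temp, data = airquality)
mod2 <- lm(Ozone ~ Temp + qTemp, data = airquality)
AIC(mod1, mod2)   # smaller AIC = better compromise of fit and complexity
```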
Exercises
Basic
We want to find the best model to explain the amount of Ozone in the airquality data. Make some descriptive statistics, e.g. scatterplots for each possible covariate and the response, and compute the correlation matrix.
Try different models and compare them with the anova method. Which model would you choose? Take α = 0.05. Give an interpretation for the effects.
Check the model assumptions and look for outliers in the data.
Advanced
Look at the help of the function step and try to find the best model with this function. Compare the results (Note: look especially at the direction argument).
7.4 Categorical Covariates

Dummy coding
Dummy coding uses only ones and zeros to convey all of the necessary information on group membership.
A dummy variable is defined as

d =  1   group member
     0   no group member
Dummy coding
Consider the following example with observations of a factor variable grp with 4 groups (each replicated 2 times) and a metric response measure y.
We have to create 4 − 1 = 3 dummy variables d1, d2 and d3.
The resulting coding is the following:

y   grp   d1   d2   d3
5    1     0    0    0
4    2     1    0    0
7    3     0    1    0
4    4     0    0    1
6    1     0    0    0
3    2     1    0    0
9    3     0    1    0
5    4     0    0    1

Table 9: Dummy coding for the example data.
Dummy coding
Means of y in each group:

  1   2   3   4 
5.5 3.5 8.0 4.5 

We now fit a linear model with R. Note that dummy coding is the default for factor covariates and the first level is used as reference category.
R> y <- c(5, 4, 7, 4, 6, 3, 9, 5)
R> grp <- factor(rep(1:4, times = 2))
R> str(grp)
Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4
R> mod <- lm(y ~ grp)
R> coef(mod)
(Intercept)        grp2        grp3        grp4 
        5.5        -2.0         2.5        -1.0 
Dummy coding
We could also define the dummy variables on our own:
R> d1 <- c(0, 1, 0, 0, 0, 1, 0, 0)
R> d2 <- c(0, 0, 1, 0, 0, 0, 1, 0)
R> d3 <- c(0, 0, 0, 1, 0, 0, 0, 1)
R> coef(lm(y ~ d1 + d2 + d3))

(Intercept)          d1          d2          d3 
        5.5        -2.0         2.5        -1.0 
Effect coding
An alternative coding scheme is effect coding.
Effect coding uses only ones, zeros and minus ones to convey all of the necessary information on group membership.
An effect variable for k groups is defined as

e_j =   1   member of group j
       -1   member of group k
        0   else

for j = 1, ..., k − 1
Effect coding
In our example, the effect coding scheme is the following:

y   grp   d1   d2   d3
5    1     1    0    0
4    2     0    1    0
7    3     0    0    1
4    4    -1   -1   -1
6    1     1    0    0
3    2     0    1    0
9    3     0    0    1
5    4    -1   -1   -1

Table 10: Effect coding for the example data.
Effect coding
We now fit a linear model with R and use the effect coding scheme. This can be specified in the contrasts argument of lm.
R> lm(y ~ grp, contrasts = list(grp = contr.sum(4)))

Call:
lm(formula = y ~ grp, contrasts = list(grp = contr.sum(4)))

Coefficients:
(Intercept)         grp1         grp2         grp3  
      5.375        0.125       -1.875        2.625  
With effect coding the intercept is equal to the grand mean of all of the observations.
The coefficient of each of the effect variables is equal to the difference between the mean of the group coded 1 and the grand mean.
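The coding matrices behind the two schemes can be inspected directly; a sketch for a factor with 4 levels:

```r
contr.treatment(4)  # dummy coding: level 1 is the reference category
contr.sum(4)        # effect coding: the last level gets -1 in every column
```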
Interaction
As in ANOVA, interactions can be used in regression models.
An interaction term can be specified in the model formula.
The following notations are possible in a formula:

+   add a variable
-   remove a variable (-1 to remove the intercept)
:   interaction between two variables
*   add variables and their interaction
.   add all variables
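As a sketch of these shorthands with the airquality data (Wind here is just an illustrative second covariate):

```r
m1 <- lm(Ozone ~ Temp * Wind, data = airquality)              # main effects + interaction
m2 <- lm(Ozone ~ Temp + Wind + Temp:Wind, data = airquality)  # the same model, written out
all.equal(coef(m1), coef(m2))
m3 <- lm(Ozone ~ ., data = airquality)    # all remaining columns as covariates
```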
Advanced
Now fit the lm and use dummy coding with summer as reference category.
Fit the lm and use effect coding.
Exercises
Advanced
There is the theory that the ozone level does not depend on the Month only, but additionally also on the day of the week. One assumes that there is a difference between weekdays and weekend.
Create a new factor variable which contains the information whether an observation was made on a weekday or on the weekend. Perhaps the functions as.Date and weekdays are useful.
Fit a lm with the effects of season and weekday plus their interaction and give an interpretation for the results.
Dust dataset
The dust data is available in the dataset archive of the department of statistics at the LMU: http://www.stat.uni-muenchen.de/service/datenarchiv/dust/dust.asc.
The dust data
The data was collected by the Deutsche Forschungsgemeinschaft in the years 1960 to 1977 in a Munich factory with 1246 workers. The data.frame with one observation for each worker contains the following variables:
cbr [factor] chronic reaction of the bronchia [0: no, 1: yes]
dust [numeric] dust exposure at the workplace (mg/m3)
smoking [factor] Is the worker a smoker? [0: no, 1: yes]
expo [numeric] time of exposure (years)
8.1 Logistic Regression

Logistic Regression
In the dust data our response variable is cbr.
It has only two possible categories, yes and no.
The multiple linear regression model can be written as

y ~ N(μ, σ²),   μ = β0 + β1 x1 + ... + βq xq.

This model is only suitable for continuous response variables with, conditional on the values of the explanatory variables, a normal distribution with constant variance.
So the model is not suitable for the response variable cbr, which is binary.
Other situations where linear regression fails are count data responses, metric but non-normal responses, or categorical responses.
Logistic Regression
What we need is a model which takes the special type of the response variable into account.
For binary data {0, 1} we want to model the response probability of taking the value 1 as a function of the explanatory variables.
For example, we want to model the probability that somebody has a chronic reaction of the bronchia depending on the explanatory variables.
Logistic Regression
A suitable transformation is the logistic or logit function, whose inverse ensures that the modelled probabilities lie in [0, 1].
The logistic regression model for the response probability π is

logit(π) = log( π / (1 − π) ) = β0 + β1 x1 + ... + βq xq   (16)

The logit of a probability is simply the log of the odds of the response taking the value one.
Equation (16) can also be written as

π(x1, x2, ..., xq) = exp(β0 + β1 x1 + ... + βq xq) / (1 + exp(β0 + β1 x1 + ... + βq xq))

The logit function can take any real value, but the associated probability always lies in the required [0, 1] interval.
Logistic Regression
The regression coefficients in a logistic regression model have a different interpretation than in linear regression.
In a logistic regression model, the parameter βj associated with the explanatory variable xj is such that exp(βj) is the factor by which the odds of the response taking the value one change when xj increases by one, conditional on the other explanatory variables remaining constant.
The regression coefficients are estimated by maximum likelihood (ML).
      dust          smoking        expo      
 Min.   : 0.2000   no :325   Min.   : 3.00  
 1st Qu.: 0.4925   yes:921   1st Qu.:16.00  
 Median : 1.4050             Median :25.00  
 Mean   : 2.8154             Mean   :25.06  
 3rd Qu.: 5.2475             3rd Qu.:33.00  
 Max.   :24.0000             Max.   :66.00  
The argument family = binomial() is used to specify that logistic regression should be used.
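The fitting call for the first model is not shown above; it presumably looks like the following sketch (assuming the data.frame is called dust, as in the summary above):

```r
# logistic regression of cbr on the years of exposure
dust_glm_1 <- glm(cbr ~ expo, data = dust, family = binomial())
```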
[Descriptive plots of the dust data: cbr against dust, expo and smoking]
Deviance Residuals:
     3Q      Max  
-0.4909   2.0688  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.258885   0.181545 -12.443  < 2e-16 ***
expo         0.040673   0.006066   6.705 2.01e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1356.8  on 1245  degrees of freedom
Residual deviance: 1309.9  on 1244  degrees of freedom
AIC: 1313.9
From the results we see that expo has a significant positive effect on the probability of cbr == "yes" at the 5% level.
The value of the regression coefficient is
R> coef(dust_glm_1)["expo"]

      expo 
0.04067309 
The fitted probabilities for dust_glm_1 are added as a line to the conditional density plot using the following code, where a and b are the estimated intercept and slope:
R> a <- coef(dust_glm_1)[1]
R> b <- coef(dust_glm_1)[2]
R> cdplot(cbr ~ expo, data = dust, ylevels = c("yes", "no"))
R> x <- seq(from = min(dust$expo), to = max(dust$expo), length = 200)
R> lines(x, exp(a + b * x)/(1 + exp(a + b * x)), lwd = 2, col = 2)
The fitted probabilities are computed using the inverse of the logistic function as transformation.
[Conditional density plot of cbr against expo with the fitted probability curve]
In the next step, we include the next covariate from the dust data. This is the amount of dust exposure at the workplace.
We can now fit a logistic regression model that includes both explanatory variables using the code
R> dust_glm_2 <- glm(cbr ~ expo + dust, data = dust, family = binomial())
R> summary(dust_glm_2)
Call:
glm(formula = cbr ~ expo + dust, family = binomial(), data = dust)

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-1.3550  -0.7685  -0.6007  -0.4592   2.1620  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.502380   0.196077 -12.762  < 2e-16 ***
expo         0.039467   0.006138   6.429 1.28e-10 ***
dust         0.090799   0.023076   3.935 8.33e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1356.8  on 1245  degrees of freedom
Residual deviance: 1294.7  on 1243  degrees of freedom
AIC: 1300.7
The predicted values of the second model can be plotted against both explanatory variables in a bubble plot using the following code:
R> prob <- predict(dust_glm_2, type = "response")
R> plot(dust ~ expo, data = dust, pch = 21, ylim = c(0, 30),
+      xlim = c(0, 80), cex = prob * 8)
Figure 12: Bubble plot of fitted values for a logistic regression model fitted to the dust data
R> anova(dust_glm_1, dust_glm_2, test = "Chisq")
Analysis of Deviance Table

Model 1: cbr ~ expo
Model 2: cbr ~ expo + dust
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)    
1      1244     1309.9                          
2      1243     1294.7  1   15.267 9.333e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1
Exercises
Basic
After how many years of exposure is the probability of having a chronic reaction of the bronchia greater than 0.5, 0.9 and 0.95? Use the estimates of model dust_glm_1.
By which factor does the odds of getting a cbr increase when the exposure increases by 3 years?
We extend model dust_glm_1 by including the factor smoking as explanatory variable. What is the effect of smoking and is it useful to include this variable in the model?
Advanced
Plot the fitted probability values against the years of exposure. Fit two lines, one for smokers and one for nonsmokers. Add the observed values using the function rug.
Include an interaction of smoking and exposure in the model. Give an interpretation for the results. Is it useful to model this interaction?
Exercises
Advanced
Plot the fitted probability values against the years of exposure. Fit again two lines, one for smokers and one for nonsmokers.
If you compare the three explanatory variables dust, smoking and expo, which ones and which interactions would you include in a model (model selection)?
Using the predict method on a glm object gives us the probabilities of cbr == "yes". We want to predict with our logistic regression model whether a person is likely to have cbr or not. For that purpose, every person with a fitted probability > 0.5 is predicted to have cbr and vice versa. These predicted values can be compared with the real values, e.g. by making a 2x2 cross table.
8.2 Generalized Linear Models

Generalized Linear Models
In a generalized linear model the linear predictor η = β0 + β1 x1 + ... + βq xq is related to the mean μ of the response via a link function, η = g(μ).
Each member of the exponential family of distributions has its own link function and variance function.
The link function ensures the right range of predicted values for every distribution, e.g. μ̂ > 0 for y ~ Pois(λ) or π̂ ∈ [0, 1] for y ~ Bin(n, π).
Family              Link g                   Variance Function
Normal              η = μ                    1
Poisson             η = log(μ)               μ
Binomial            η = log(π / (1 − π))     π(1 − π)
Gamma               η = 1/μ                  μ²
Inverse Gaussian    η = 1/μ²                 μ³
GLM - Residuals
The Pearson residual is comparable to the standardized residuals used for linear models and is defined as

rP_i = (y_i − μ̂_i) / √(V(μ̂_i)).

The deviance residual is defined as

rD_i = sign(y_i − μ̂_i) √(d_i),

with Σ_i (rD_i)² = Deviance ≥ 0.
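For a fitted glm object both residual types are returned by the residuals method; a sketch for a model such as dust_glm_2 from the previous section:

```r
rp <- residuals(dust_glm_2, type = "pearson")    # Pearson residuals
rd <- residuals(dust_glm_2, type = "deviance")   # deviance residuals
sum(rd^2)   # equals the residual deviance of the model
```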
ships data
Formatting the data
R> library(MASS)
R> data(ships)
R> ships$year <- factor(ships$year)
R> ships$period <- factor(ships$period)
  year    period      service         incidents   
 60:10   60:20    Min.   :    0.0   Min.   : 0.0  
 65:10   75:20    1st Qu.:  175.8   1st Qu.: 0.0  
 70:10            Median :  782.0   Median : 2.0  
 75:10            Mean   : 4089.3   Mean   : 8.9  
                  3rd Qu.: 2078.5   3rd Qu.:11.0  
                  Max.   :44882.0   Max.   :58.0  
Poisson regression
The response variable incidents is count data.
The response can only take positive values.
Such a variable is unlikely to have a normal distribution.
The Poisson distribution is given by

Pois(y) = e^(−λ) λ^y / y!
[Scatterplot of incidents against service]
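The fitting call for ships_glm_1 is not shown above; it presumably looks like this sketch:

```r
# Poisson regression of the number of incidents on the service time
ships_glm_1 <- glm(incidents ~ service, data = ships, family = poisson())
```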
The predict method for an object of class glm returns by default the predicted values on the scale of the linear predictor (type = "link").
The predicted values on the response scale are obtained with the argument type = "response":
R> pred <- predict(ships_glm_1, type = "response")
R> summary(ships_glm_1)

Call:
glm(formula = incidents ~ service, family = poisson(), data = ships)

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-6.0040  -3.1674  -2.0055   0.9155   7.2372  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 1.613e+00  7.150e-02   22.55   <2e-16 ***
service     6.417e-05  2.870e-06   22.36   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 730.25  on 39  degrees of freedom
Residual deviance: 374.55  on 38  degrees of freedom
AIC: 476.41
Exercises
Basic
What is the interpretation of the regression coefficients of model ships_glm_1? Think of the interpretation in a logistic regression model here.
How many incidents would you expect after 10, 100, 1000, 10000 and 100000 months of service?
Make a residuals vs. fitted values scatterplot for the deviance residuals.
Look at a fitted values vs. original values scatterplot for the ships_glm_1 model. What could be a problem?
Add other explanatory variables to the model. Try to find the best model by comparing the models with the anova method.
[Scatterplot of the deviance residuals (res) against the fitted values (pred)]
      variety        yield          latitude       longitude    
 ARAPAHOE : 4    Min.   : 1.05   Min.   : 4.30   Min.   : 1.20  
 BRULE    : 4    1st Qu.:23.52   1st Qu.:17.20   1st Qu.: 7.20  
 BUCKSKIN : 4    Median :26.85   Median :25.80   Median :14.40  
 CENTURA  : 4    Mean   :25.53   Mean   :27.22   Mean   :14.08  
 CENTURK78: 4    3rd Qu.:30.39   3rd Qu.:38.70   3rd Qu.:20.40  
 CHEYENNE : 4    Max.   :42.00   Max.   :47.30   Max.   :26.40  
 (Other)  :200                                                  
The trial consists of 4 blocks; each variety appears exactly once in each block:
R> all(table(Wheat2$Block, Wheat2$variety) == 1)
[1] TRUE
Data situation
We have repeated measures, i.e. clustered data, as e.g. in a longitudinal study.
The data has the following form:
Figure 14: Wheat Yield Trial. The size of the symbols corresponds to the yield.
Result
Correlated data
Observations of the same cluster tend to be more similar than observations of different clusters.
There are different sources of variability in the data.

Linear Mixed Model
For cluster i, the model in matrix notation is

y_i = X_i β + U_i b_i + ε_i,   i = 1, ..., m   (17)

The random effects are assumed to be iid normally distributed,

b_i ~ N(0, D),   (18)
Experimental Design
In an experimental design, blocks are often defined.
We are not interested in the block effects specifically, but must account for their effect.
Therefore, blocks are treated as random effects.
This can be done in the framework of linear mixed models.
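A random-intercept model for the blocks can be fitted with lme from the nlme package; a sketch of such a fit for the Wheat2 data:

```r
library(nlme)
# yield with a fixed intercept and a random intercept per Block
lmmWheat2_1 <- lme(yield ~ 1, random = ~ 1 | Block, data = Wheat2)
summary(lmmWheat2_1)
ranef(lmmWheat2_1)   # predicted random effects of the blocks
```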
[Plot of yield by Block]
Random effects:
 Formula: ~1 | Block
        (Intercept) Residual
StdDev:     3.14809 6.931017

The predicted random effects of the blocks are

  (Intercept)
4  -3.8623327
2   2.8511949
3  -0.8737036
1   1.8848413

If we compare this with the block means minus the overall mean

R> tapply(Wheat2$yield, Wheat2$Block, mean) - mean(Wheat2$yield)

         4          2          3          1 
-4.1966518  3.0979911 -0.9493304  2.0479911 
In the next step we include the position of the crops on the field as fixed effect in our model.
This is done by including a main effect for latitude and longitude and an interaction term.
Note the blocks are of equal size and arranged in columns, with the plants sown in 3-4 columns in each block.
We need an equal range of latitude and longitude in each block, otherwise latitude would be a surrogate for the block effect.
R> Wheat2$latitude2 <- Wheat2$latitude - rep(c(4.3, 17.2, 25.8, 38.7), each = 56)
R> unique(Wheat2$latitude2)
[1]  0.0  4.3  8.6 12.9  4.3  8.6 12.9  4.3  8.6
R> summary(lmmWheat2_2)
Linear mixed-effects model fit by REML
 Data: Wheat2 
      AIC     BIC    logLik
 1432.378 1452.74 -710.1892

Random effects:
 Formula: ~1 | Block
        (Intercept) Residual
StdDev:    3.451906 5.524812

Fixed effects: yield ~ latitude2 * longitude 
                        Value Std.Error  DF   t-value p-value
(Intercept)         23.974644 2.4981401 217  9.596997  0.0000
latitude2           -0.827432 0.2320629 217 -3.565551  0.0004
longitude            0.225626 0.0974046 217  2.316379  0.0215
latitude2:longitude  0.044225 0.0138447 217  3.194342  0.0016
 Correlation: 
                    (Intr) lattd2 longtd
latitude2           -0.651              
longitude           -0.661  0.831       
latitude2:longitude  0.534 -0.865 -0.845

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-3.28913321 -0.47504226  0.03741677  0.72172891  2.11526795 
Figure 15: Fitted values for the wheat yield trial with model lmmWheat2_2.
R> filled.contour(x = unique(WB1$latitude2), y = sort(unique(WB1$longitude)),
+
z = t(fitB1))
R> persp(x = unique(WB1$latitude2), y = sort(unique(WB1$longitude)),
+
z = t(fitB1), theta = 25, phi = 25)
package lme4
In the lme4 library, linear mixed models can be estimated with the lmer function.
R> library(lme4)
R> `?`(lmer)
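The fitting call for Wheat2_lme4 is not shown; following the formula reported in its summary, it presumably looks like:

```r
library(lme4)
# random intercept per Block, fixed effects for the position and their interaction
Wheat2_lme4 <- lmer(yield ~ latitude2 * longitude + (1 | Block), data = Wheat2)
```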
R> summary(Wheat2_lme4)
Linear mixed model fit by REML
Formula: yield ~ latitude2 * longitude + (1 | Block)
   Data: Wheat2
  AIC  BIC logLik deviance REMLdev
 1432 1453 -710.2     1410    1420
Random effects:
 Groups   Name        Variance Std.Dev.
 Block    (Intercept) 11.916   3.4519  
 Residual             30.524   5.5248  
Number of obs: 224, groups: Block, 4

Fixed effects:
        (Intercept)           latitude2           longitude latitude2:longitude 
        23.97464066         -0.82743079          0.22562615          0.04422476 
$Block
(Intercept)
4 -4.73226084
2 2.61986444
3 -0.09551437
1 2.20791077
$Block
(Intercept)
(Intercept)
11.91566
attr(,"stddev")
(Intercept)
3.451907
attr(,"correlation")
(Intercept)
(Intercept)
1
attr(,"sc")
sigmaREML
5.524812
Exercises
Make some descriptive analyses for the Soybean data.
We want to compare the growth of soybeans for the different Plots. What is your response variable, what are fixed and what are random effects? Set up a model and give an interpretation for the results.
We want to fit two random effects for each plot: a random intercept and a random slope for the time. The model should also include the fixed effects Variety and Year. Fit this model using the lme4 package.
Look at the covariance matrix of the random effects and give an interpretation.
10 Repetitive execution

Functions for repetitive execution: for loops, repeat and while.
There is also a for loop construction which has the form
R> for (name in expr_1) expr_2
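The code of the original loop example is lost; only its output (the line of numbers below) survives. A sketch that would reproduce it by printing the factorials 4! to 6!:

```r
# one possible reconstruction of the lost example
for (n in 4:6) cat(factorial(n), "")
```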
24 120 720