0% found this document useful (0 votes)

1K views

R Handout Statistics and Data Analysis Using R

This document provides an introduction to using R for statistics and data analysis. It discusses installing and working with R, basic functions and syntax, data management including reading in data and working with different data structures, descriptive statistics, graphics, probability distributions, hypothesis testing, linear models and writing custom functions. The document uses the iris data set throughout as an example and includes exercises for readers to practice the concepts covered.

Uploaded by

tele6

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

1K views

R Handout Statistics and Data Analysis Using R

Uploaded by

tele6

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 91

Statistics and Data analysis using R

Valentin Wimmer
contact: Valentin.Wimmer@wzw.tum.de
Lehrstuhl fr Panzenzchtung, TUM
u
u
December 2010

Contents

Contents
1 Introduction
1.1 Installing R . . .
1.2 Working with R
1.3 First steps . . .
1.4 R help . . . . . .

5.2

.
.
.
.

2
3
3
4
5

2 Data management
2.1 Data structure . . .
2.2 Read in data . . . .
2.3 Data manipulation
2.4 Data export . . . .

.
.
.
.

6
7
11
14
16

.
.
.
.

3 Descriptive Statistics

4 Graphics
19
4.1 Classic graphics . . . . . . . . . . 19
4.2 Controlling the appearance of
a plot . . . . . . . . . . . . . . . . 22

Examining the distribution of

a set of data . . . . . . . . . . . .

6 Hypothesis Testing (Inference)

6.1 Two-sample t-test for equality
of means . . . . . . . . . . . . .
6.2 Wilcoxon test . . . . . . . . . .
6.3 Chi-squared-test . . . . . . . . .
6.4 Kolmogorov-Smirnov-Test . .

32
.
.
.
.

34
36
36
38

7 ANOVA and linear Regression

7.1 ANOVA . . . . . . . . . . . . . .
7.2 Linear Regression . . . . . . .
7.3 Multiple Linear Regression . .
7.4 Categorial Covariates . . . . .

.
.
.
.

40
40
44
53
59

8 Generalized Linear Regression

64
8.1 Logistic Regression . . . . . . . . 64
8.2 The Generalized Linear Model . 74
9 Linear Mixed Models

5 Probability Distributions
27
5.1 R as a set of statistical tables . . 27

10 Writing own functions

Aims of this course

Basic knowledge about R
Overview over important and useful R functions
Getting a feeling for the possibilities of R for data analysis and statistics
Learn to analyse own data with R
Writing own R functions

About this course

R commands are displayed as
R> 5 + 5

R output is displayed as
[1] 10

important notes
useful hints
description of datasets

Introduction

What is R?
R is an environment for statistical computing and graphics
The root of R is the S-Language
Today R is the most important statistical software
Open source, everybody can contribute to R
Library of add-on R packages
R package: collection of functions, documentations, data and examples
The main distribution is maintained by the R Developement Core Team

Advantages of R
Rapid developement
Open Source and thus no Black Box. All algorithms and functions are
visible in the source code.
No fees
Many systems: Macintosh, Windows, Linux, ...
Quite fast
Many new statistical methods available as R-packages

Disadvantages of R
Open Source: compatibility with older versions not so important
No real quality check of the contents of R-packages
No graphical user interface
Error message often not useful
There exist faster programms for special problems

1.1

Installing R

Installing R
The base distribution is available under www.r-project.org
The packages can be installed direct with R or www.CRAN.r-project.org
or www.bioconductor.org
There exist dierent versions for each system
*.zip for Windows and *.tar.gz for Unix
For windows, use the Precompiled binary distributions of the base system

1.2

Working with R

Working with R
Use a text-editor, e.g. Tinn-R to save your scripts
Make comments to your code (comment lines start with #):
Set work directories
R> getwd()
R> setwd("")

Install an add-on packages

R> install.packages("doBy")

Specifying a path in R
The path has to be given in the form "C:\..." or "C://..."
Working with R
In R commands are evaluated directly and result printed on screen
Your input may have more than one line
Dierent input in the same line is separated with ;
Available packages, functions or objects are available in the R workspace
To list the available objects, type
R> ls()

Workspace cn be saved every time using save or when R session is nished

1.3

First steps

First steps
Use R as a calculator
R> print(sqrt(5) * 2)
[1] 4.472136

Assign a value to an object, return value of an object

R> a <- 2
R> a + 3
[1] 5
R> b <- a
R> b
[1] 2

Load an add-on R package

R> library(MASS)

First steps
Operators: +, , , / , &, &&, |, ||

Comparision: ==, ! =, >, >=, <, <=

Allocation
R> a <- 2
R> a <- 2
R> a = 2

Note that R is case sensitiv. Names of objects have to start with a letter.
Other numeric functions

abs()
sqrt()
round(), floor(), ceiling()
sum(), prod()
log(), log10(), log2()
exp()
sin(), cos(), tan()

1.4

absolute value
square root
round up/down
sum and product
logarithm
exponential function
angle functions

R help

R help
The help-function
R> help(mean)
R> `?`(mean)

Overview over an add-on package

R> help(package = "MASS")

R manuals

An Introduction to R
R Data Import/Export
R Installation and Administration
Writing R Extensions

www.rseek.org

better use www.rseek.org than www.google.de

Exercises
Set your work directory
Dene an object course and assign the value "Using R"
Type ls() to see the content of your workspace
Close your R session and save the workspace
Open the saved workspace with R
Type ls() to see the content of your workspace
Install add-on package lme4 from CRAN
Load the package in your workspace
Look at the help of the package

Exercises
Calculate the following with R
3

( 5 2) 2 6 + 5 e2
Look at the help le of floor() and ceiling().
What is the result?
R> a <- (4 <= floor(4.1))
R> b <- (5 > ceiling(4.1))
R> a != b

What is the result (give a solution)?

R>
R>
R>
R>
R>
R>
R>

(5
(5
(5
(5
(5
(5
(1

<
<
<
<
<
<
<

3)
3)
3)
3)
3)
3)
3)

| (3 < 5)
|| (3 < 5)
|| c((3 < 5), c(5 < 3))
| c((3 < 5), c(5 < 3))
& c((3 < 5), c(5 < 3))
&& c((3 < 5), c(5 < 7))
&& c((3 < 5), c(8 < 7))

Data management

iris dataset
Throughout this course we will use several datasets comming with R. The
most famous one is Fishers iris data. To load it, type
R> data(iris)

R> head(iris)

1
2
3
4
5
6

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

5.1
3.5
1.4
0.2 setosa
4.9
3.0
1.4
0.2 setosa
4.7
3.2
1.3
0.2 setosa
4.6
3.1
1.5
0.2 setosa
5.0
3.6
1.4
0.2 setosa
5.4
3.9
1.7
0.4 setosa

For more information, type

R> help(iris)

The iris data

iris data
This data set contains the measurements in centimeters of the variables sepal
length and width and petal length and width, respectively, for 50 owers from
each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Sepal.Length [metric]
Sepal.Width [metric]
Petal.Length [metric]
Petal.Width [metric]
Species

2.1

[nominal]

Data structure

Data structure
R is an object-oriented programming language, so every object is instance of
a class. The name of the class can be determined by
R> class(iris)

[1] "data.frame"

The most important classes are

character
numeric
logical
data.frame
matrix
list
factor

Data structure
Multiple numbers can be combined in a vector
R> c(4, 2, 5)
[1] 4 2 5

Another important object is a matrix

A matrix is a rectangular pattern with n rows and p columns
A matrix is created as
R> (m <- matrix(data = c(2, 5, 6, 3, 4, 3), nrow = 2, ncol = 3))

[1,]
[2,]

[,1] [,2] [,3]

2
6
4
5
3
3

Computations with matrices

transpose the matrix: mT
R> t(m)

[1,]
[2,]
[3,]

[,1] [,2]
2
5
6
3
4
3

product with a constant

R> m * 4

[1,]
[2,]

[,1] [,2] [,3]

8
24
16
20
12
12

matrix product
R> t(m) %*% m

[1,]
[2,]
[3,]

[,1] [,2] [,3]

29
27
23
27
45
33
23
33
25

Alternative crossprod() method

inverse of a matrix m1
R> (m <- matrix(c(1, 0, 0, 0, 1, 0, 0, 1, 1), ncol = 3))

[1,]
[2,]
[3,]

[,1] [,2] [,3]

1
0
0
0
1
1
0
0
1

R> solve(m)

[1,]
[2,]
[3,]

[,1] [,2] [,3]

1
0
0
0
1
-1
0
0
1

diagonal elements of a matrix

R> diag(crossprod(m))
[1] 1 1 2

Assess the data sturcture

command
str
head
dim
length
nchar
summary
names
rownames
colnames

result
general structure of the object
show the rst 6 lines
dimension of the object (rows columns)
length of the object
number of characters in one string
important statistical parameters
variable names
row names
column names

useful for
list, data.frame
data.frame, matrix
data.frame, matrix
list, numeric
character
list, data.frame
matrix, data.frame
matrix, data.frame
matrix, data.frame

Table 1: useful commands to asses the structure of an R object

Class of a S3 object
To check, if an obect has an special class, use
R> obj <- 3
R> is.numeric(obj)
[1] TRUE

R> is.character(obj)
[1] FALSE

The class of an object can simple be changed (no formal check of attributes)
R> obj2 <- obj
R> (obj2 <- as.character(obj2))
[1] "3"
R> is.character(obj2)
[1] TRUE
R> is.numeric(obj2)
[1] FALSE

Handling factor-objects
A factor is used for objects with repeated levels
A factor has two arguments: levels and labels
R> levels(iris$Species)
[1] "setosa"

"versicolor" "virginica"

R> head(labels(iris$Species))
[1] "1" "2" "3" "4" "5" "6"

A numeric or a character variable can be converted to a factor with the

factor() command and the arguments

labels
levels

Exercises
Basic
What is the dierence between
R> v <- c("An", "Introduction", "to", "R")
R> nchar(v)
R> length(v)

What are the rst three values of Sepal.Length?

Dene the following matrix in R:

4 12

X= 5 3
9 11
What are the dimensions of X?

20
7
10

0.8

1
8

What is the sum of all elements of X?

Exercises
Advanced
What are the last three values of Sepal.Length?
Look at the help le of matrix. Try to nd out the use of the byrow
argument.
The variable Sepal.Length should be categorisised. We want to have two
categories, one up 6 cm and one above 6cm. Create a factor and give
labels for the categories.
How many observations are in each of the three groups?

2.2

Read in data

Read in data
To read in tabulated data, use
R> read.table(file, header = TRUE, ...)

were file is the name (+path) of the data or use the url. Use header=TRUE
if the rst row of the data contains the variable names. This function is the
default reading function and is suitable for txt, dat,... and returns an object of
class data.frame. Use the csv-format for Excel les and use function:
R> read.csv2(file, header = TRUE, sep = ";", dec = ".", ...)

look at the data before reading it with R (header, decimal symbol, missing
values,...)
11

Creating and indexing complex objects

Objects can be shared in a vector
A vector is created with the c() command
R> (myvec <- c(1, "Hallo", 3, 7))
[1] "1"

"Hallo" "3"

"7"

A vector can contain objects with dierent classes

Consequtive numbers are obtained by
R> 3:9
[1] 3 4 5 6 7 8 9
R> 6:-4
[1]

0 -1 -2 -3 -4

Vectors with a regular structure are constructed with the functions rep
and seq
R> rep(c(1, 5), each = 2, times = 3)
[1] 1 1 5 5 1 1 5 5 1 1 5 5

Entries of vector objects

R> (o1 <- c(3, 6:8, 10))
[1]

8 10

R> o1[2]
[1] 6

Entries of a matrix
R> (o2 <- matrix(data = 1:12, nrow = 3, ncol = 4))

[1,]
[2,]
[3,]

[,1] [,2] [,3] [,4]

1
4
7
10
2
5
8
11
3
6
9
12

R> o2[2, 3]
[1] 8
R> o2[2, ]
[1]

8 11

R> o2[, 3]
[1] 7 8 9

Entries of a data.frame
R> iris[2, 5]
[1] setosa
Levels: setosa versicolor virginica
R> iris[3:5, "Species"]
[1] setosa setosa setosa
Levels: setosa versicolor virginica
R> head(iris$Petal.Width)
[1] 0.2 0.2 0.2 0.2 0.2 0.4

Entries of a list
R> mod <- lm(Petal.Width ~ Species, data = iris)
R> mod[["coefficients"]]
(Intercept) Speciesversicolor
0.246
1.080

Speciesvirginica
1.780

R> mod[[1]]
(Intercept) Speciesversicolor
0.246
1.080

Speciesvirginica
1.780

attaching a data.frame
The attach() command can be used for datasets. Afterwards the variable of
the dataset are attached to the R search path.
You only need to type
R> Petal.Width

instead of
R> iris$Petal.Width

to refer to the variable Petal.Width in the dataset iris then.

Note that objects can be masked by attaching.
Reading xls-les
To read in Excel les, you need the RODBC-libary and the following commands:
R>
R>
R>
R>
R>

library(RODBC)
trac <- odbcConnectExcel("Table.xls")
Tnames <- sqlTables(trac)$TABLE_NAME
df1 <- sqlFetch(trac, Tnames[1], as.is = TRUE)
odbcClose(trac)

Exercises
Load the le data2read from the website. Have a look at the le.
Read in data using the read.table function and save the data as an object
with name mydata. What is the class of object mydata?
Asses the data structure of mydata with the methods str, head, dim,
colnames and rownames.
What are the values of the third row of mydata?
Attach mydata and give the rows of mydata where x8.

2.3

Data manipulation

The swiss data

swiss dataset
Standardized fertility measure and socio-economic indicators for each of 47
French-speaking provinces of Switzerland at about 1888. A data frame with
47 observations on 6 variables, each of which is in percent, i.e., in [0,100].

Fertility: common standardized fertility measure

Agriculture % of males involved in agriculture as occupation
Examination % draftees receiving highest mark on army examination
Education % education beyond primary school for draftees.
Catholic % catholic (as opposed to protestant).
Infant.Mortality live births who live less than 1 year.

All variables but Fertility give proportions of the population.

Data manipulation
Replacement in objects
R> (v <- 1:9)
[1] 1 2 3 4 5 6 7 8 9
R> (v[3] <- 5)
[1] 5

A data.frame can also be manipulated with the fix() command

Subset of a data frame. With the following command only the individuals
of species setosa are selected from the iris dataset
R> iris.Setosa <- subset(iris, iris$Species == "Setosa")

Order the elements of a numeric vector

R> order.edu <- order(swiss$Education, decreasing = TRUE)

Give the the 3 districts with the highest Education

R> swiss[order.edu[1:3], 1:5]

V. De Geneve
Neuchatel
Rive Droite

Fertility Agriculture Examination Education Catholic

35.0
1.2
37
53
42.34
64.4
17.6
35
32
16.92
44.7
46.6
16
29
50.43

2.4

Data export

Data export
Data of R can be writen into a le with the write.table() command
The R workspace can be saved with save() command and can be reloaded
with load()
A
Function xtable() to create L TEX-table output

R> library(xtable)
R> xtable(summary(iris), caption = "Summary statistic of iris data")

1
2
3
4
5
6

Sepal.Length
Min. :4.300
1st Qu.:5.100
Median :5.800
Mean :5.843
3rd Qu.:6.400
Max. :7.900

Sepal.Width
Min. :2.000
1st Qu.:2.800
Median :3.000
Mean :3.057
3rd Qu.:3.300
Max. :4.400

Petal.Length
Min. :1.000
1st Qu.:1.600
Median :4.350
Mean :3.758
3rd Qu.:5.100
Max. :6.900

Petal.Width
Min. :0.100
1st Qu.:0.300
Median :1.300
Mean :1.199
3rd Qu.:1.800
Max. :2.500

Species
setosa :50
versicolor:50
virginica :50

Table 2: Summary statistic of iris data

Excercises
Construct the following sequences using the functions rep or seq:
[1] 3 4 5 6 7 3 4 5 6 7 3 4 5 6 7
[1] -2.5 -2.0 -1.5 -1.0 -0.5

0.0

0.5

1.0

1.5

2.0

[1] 2 2 2 2 3 3 3 3 4 4 4 4 2 2 2 2 3 3 3 3 4 4 4 4

[1] 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0

Descriptive Statistics

The C02 data

The Co2 dataset
The CO2 data frame has 84 rows and 5 columns of data from an experiment on
the cold tolerance of the grass species Echinochloa crus-galli.

mean(x)
median(x)
min(x)
max(x)
range(x)
quantile(x,probs=c())
sd(x),var(x)
cov(x,y),cor(x,y))

compute
compute
compute
compute
compute
compute
compute
compute

(trimmed) mean
median (50% quantile)
minimum
maximum
minimum and maximum
the probs100% quantile
the standard deviation/variation
the covariance/correlation

Table 3: Standard descriptive functions

Plant: an ordered factor with levels Qn1 < Qn2 < Qn3 < ... < Mc1
giving a unique identier for each plant.
Type a factor with levels Quebec Mississippi giving the origin of the plant
Treatment a factor with levels nonchilled and chilled
conc a numeric vector of ambient carbon dioxide concentrations (mL/L).
uptake a numeric vector of carbon dioxide uptake rates (umol/m2 sec).

Cross Tables
The table command gives the frequencies of a categorial variable, typically it is used with one or more factor-variables
R> table(CO2$Type)
Quebec Mississippi
42
42

A cross-table of two factors is obtained by

R> table(CO2$Type, CO2$Treatment)

Quebec
Mississippi

nonchilled chilled
21
21
21
21

We have a balanced experimental design

More than 2 factors are possible too

Descriptive Statistics

Missing values (NA)

Missing values are allways coded with with symbol NA
The function is.na is used to check, if a value is missing
The function complete.cases takes a data.frame and returns a logiacl
vector being TRUE when the corresponding observation contains no missing
values
R> table(complete.cases(airquality))
FALSE
42

TRUE
111

Missing values often should not be considered in the calculation

R> mean(c(1:5, NA, 8))
[1] NA

The missing values are removed with the na.rm argument in many functions as those in Table 3
R> mean(c(1:5, NA, 8), na.rm = TRUE)
[1] 3.833333

Excercises
Basic
Compute the 5%, the 10% and the 75% quantile of the education rate in
the swiss data.
Is the distribution of the education rate in the swiss data symmetric, left
skewed or right skewed?
Compute the covariance and the correlation between the ambient carbon
dioxide concentrations and the carbon dioxide uptake rates in the CO2
data. How to you interpret the result?
Compute the correlation coecient for each Treatment group separatly.

Advanced
Compute the correlation between the following two variables: v1=[11,5,6,3,*,35,4,4],
v2=[4.6,2,1.6,*,*,7,2,3], where * indicates a missing value.
Give ranks to uptake and conc and compute spearmans correlation coefcient.

Graphics

Graphics with R
Many graphical possibilities and settings in R
Division in high-level and low-level graphics
Standard graphics for each object type (try plot for a data.frame)
Additional graphics system in the library lattice
Standard graphical function: plot to plot y against x
You can either use plot(x,y) or plot(y x)

4.1

Classic graphics

Graphics with R: plot for metric and metric variables

q
q
q

Education

R> plot(Education ~ Agriculture, data = swiss)

q
q

q
q q
qq
q
q
q
q q
qq
qq q q
qq
q q
q
q qq
q
q
q
q
q
q
q
q

Agriculture

Boxplot for metric (+ factor)

R> boxplot(Petal.Width ~ Species, data = iris)

q
q q
q

2.5
2.0
1.5
1.0
0.5

q
q

setosa

versicolor

virginica

Histogram for metric variates

R> hist(swiss$Education)

10 15 20 25 30
0

Frequency

Histogram of swiss$Education

swiss$Education

Bar plot for factors

R> barplot(rowSums(HairEyeColor), col = c("black", "brown", "red",
+
"gold"))

200
50 100
0

Black

Brown

Red

Blond

Pie charts for factors

R> pie(rowSums(HairEyeColor), col = c("black", "brown", "red", "gold"))

Black

Brown

Blond
Red

Mosaicplot for cross-tables

R> mosaicplot(HairEyeColor[, , "Male"], shade = TRUE, main = "")

Brown

Red

Blond

Green

Hazel

Standardized
Residuals:

Blue

4:2

Eye

2:0

0:2

Brown

2:4

Black

Hair

4.2

Controlling the appearance of a plot

Main graphic arguments

The following arguments t for all types of graphics:
type a string speciying the type of plot. Possible is

p for points
l for lines
b for both
h for histogram like vertical lines,
s/S for stair steps
n for no plotting
main: a string specifying the plot title
sub: a string specifying the subs title
xlab: a string specifying the title for the x axis
ylab: a string specifying the title for the y axis

Additional graphic arguments

The following arguments specify the appearance of a plot:
xlim: a vector of 2 sepecifying minimum and maximum of x axis
ylim: a vector of 2 sepecifying minimum and maximum of y axis
lty: a numeric giving the line type (1: solid line, 2: dashed line, ...)
lwd: a numeric giving the line width
pch: a string or a numeric giving the plotting character (pch=21 for a
lled circle, pch=19 for a solid circle,...)
cex: a numeric specifying the character (or symbol) expansion
col: a vector of strings or numbers specifying the color ot the plotting
symbols (to use red color you can use col="red" or col=2)

Set graphical parameters

To set or query graphical parameters use par
R> par(mfrow = c(1, 2))

By this commands, the graphics window is diveded into 2 parts (1 row,

two columns)
Following arguments are useful (additional to the ones on the previous
page)

mfrow=c(r,c) divide the graphics window into r rows and c columns

mar=c(b,l,u,r) set the gure margin at the bottom (b), left (l),
upper(u) and right (r).
oma=c(b,l,u,r) set the outer gure margin
bg=col set col as a color for the gure background
las: set the axis style (0: always parallel to the axis [default], 1:
always horizontal, 2: always perpendicular to the axis, 3: always
vertical)
Set graphical parameters
Set graphical parameters

oma = c(2,2,2,2) oma[NORTH<3]

mar = c(5.1,4.1,4.1,2.1)

mar[SOUTH<1]

oma[EAST<4]

mar[EAST<4]

0 2 4 6 8

mar[WEST<2]

Plot Area
0

Figure

oma[SOUTH<1]
Outer

Margin Area

Figure 1: The R Plot Area with Outer Margin Area

6
X

Figure

2
0

Figure

X
0 2 4 6 8

0 2 4 6 8

Title Line

oma[WEST<2]

mar[NORTH<3]

4
0

Figure

X
Figure
Outer Margin Area

Figure 2: Multiple Figures by Row (mfrow) with Outer Margin Area

q
q
q
q
q qq
q
q
q
q
qq
q
q q qqq qq
qq
q qq
q
qqq
q qq
q
q
qq qqqq q
qq q qqq
q
qq
q qq qqqqqqqqqqq qqq
qq
qq
qqqqq
qqq
q
qqqq
q
q
qqqq qq qq
q
q
q
q
qqq
qqq
q
q
q qq
q

qq
q
q

Sepal.Width

R> plot(Sepal.Width ~ Sepal.Length, data = iris, subset = iris$Species ==

+
"setosa", xlim = c(4, 8), ylim = c(1, 6), col = 14)
R> points(Sepal.Width ~ Sepal.Length, data = iris, subset = iris$Species ==
+
"versicolor", col = "lightgreen")
R> points(Sepal.Width ~ Sepal.Length, data = iris, subset = iris$Species ==
+
"virginica", col = "blue")

Sepal.Length

Adding low-level graphics to a plot

Low-level output can be added to an existing plot
Other useful low-level funtions are:

points adding points [args: x,y,pch,...]

lines adding lines [args: x,y,lty,...]
text adding text [args: x,y,labels]
title adding a title
abline adding a line [args: a,b,h,v]
axis adding an axis [args: side,at,labels,las,...]
Adding low-level graphics to a plot
Adding a legend
R> legend("topright", legend = c("setosa", "versicolor", "virginica"),
+
pch = c(21, 21, 21), col = c(14, "lightgreen", "blue"))

q
qq
q

setosa
versicolor
virginica

q
q
q
q
q
qqq qq
q
qqq
q q qqq qq
q qq
q
qqq qqqq q
qqqqqq
q
q qqq
q qq qqqqqqqqqqq qqq
q
qqq qqqqqqq q
qq
q qq
q
q
qqqq qq qq
q q
qqqq
q
qq
q
q
q
q qq
q

qq
q
q

Sepal.Width

Sepal.Length

Saving graphics
There are two possibilities to save graphics in R
Right click in graphic an save as
better alternative: functions pdf, png, postscript

Usage:
R>
R>
R>
R>

pdf("C://mypath", width = 5, height = 4)

plot(Sepal.Width ~ Sepal.Length, data = iris)
title("Iris data")
dev.off()

With pdf() a graphic device is opened

and closed with dev.off()

Exercises
Basic
Make a histogram of each of Sepal.Length, Petal.Length, Sepal.Width
and Petal.Width in one plot for the iris data. Give proper axis labels,
a title for each subbplot and a title for the whole plot.
Now we use the swiss data. Plot the Education rate against the Agriculture rate. Use dierent symbols for the provinces with more catholics
and those with more protestant inhabitants. Choose axis limits from 0 to
100. Add a legend to your plot.
Save this plot as a pdf-le.

Advanced
Add two vertiacal lines and two horizontal lines to the plot which represent
the means of the Agriculture and the Education for both subgroups
(Catholic vs Protestant).
Look at the layout function. This is an alternative for par(mfrow).

Probability Distributions

5.1

R as a set of statistical tables

Probability Distributions
Several theoretical distributions are included in R
The probability density function is dened as P(X = )
The cumulative distribution function (CDF) is dened as P(X )

The quantile function is the smallest for a given q where P ((X ) > q)

The functions names consists of a prex and the probability name. The prex
name d is for the density, p for the CDF, q for the quantile function and
r for simulation (random samples).
Probability Distributions
The name of the distribution is usually an abbreviation
R> pnorm(0.5)
[1] 0.6914625

gives the value of the CDF of a standard normal distribution N(0,1) at point
0.5.
Additional arguments can passed to the distribution functions. For example,
R> pnorm(0.5, mean = 2, sd = 3)
[1] 0.3085375

gives the value of the CDF of N(2,3) at point 0.5.

Probability Distributions
Following (and others) distributions are available in R
Visualising probability functions
For visualisations, the curve function is useful

Distribution
normal
binomial
hypergeometric
Poisson
geometric
Students t
chi-squared
F
gamma
beta

R name
norm
binom
hyper
pois
geom
t
chisq
f
gamma
beta

additional arguments
mean, sd
size, prob
m, n, k
lambda
prob
df, ncp
df, ncp
df1, df2, ncp
shape, scale
shape1, shape2, ncp

Table 4: Probability distributions in R

R> curve(dnorm(x, mean = 4, sd = 0.5), from = 2, to = 8, main = "N(4,0.5) distribution")

0.0

0.2

0.4

0.6

dnorm(x, mean = 4, sd = 0.5)

0.8

N(4,0.5) distribution

Computing quantiles
To compute the 95% quantile of the N(0,2)-distribution, use qnorm()
R> qnorm(0.95, mean = 0, sd = 2)
[1] 3.289707

5.2

Examining the distribution of a set of data

Given a (univariate) set of data we can examine its distribution in a large
number of ways:
to get a basic summary use summary or fivenum

4
2

3.28

qnorm(x,mean=0,sd=2)

Quantiles of N(0,2)distribution

0.95
0.0

0.2

0.4

0.6

0.8

1.0

to display of the numbers by stem (a stem and leaf plot) use stem
to make a density estimation use density
to plot the empirical cumulative distribution function, use cdf
to compare the quantiles of two univariate data sets x and y, use qqplot(x,y)
to compare the empirical quantiles with a normal distribution, use qqnorm

Density estimation
The faithful dataset
Waiting time between eruptions and the duration of the eruption for the Old
Faithful geyser in Yellowstone National Park, Wyoming, USA.
A data frame with 272 observations on 2 variables:
eruptions numeric Eruption time in mins
waiting numeric Waiting time to next eruption (in mins)
The function density computes kernel density estimates
An important argument for density estimation is the bandwith (argument
bw)
The rug function can be used to add a rug representation to the plot

Density estimation

R> data(faithful)
R> plot(density(faithful$eruptions))
R> rug(faithful$eruptions)

0.3
0.2
0.0

0.1

Density

0.4

0.5

density.default(x = faithful$eruptions)

N = 272 Bandwidth = 0.3348

Stem and Leaf plot

R> stem(faithful$eruptions)
The decimal point is 1 digit(s) to the left of the |
16
18
20
22
24
26
28
30
32
34
36
38
40
42
44
46
48
50

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

070355555588
000022233333335577777777888822335777888
00002223378800035778
0002335578023578
00228
23
080
7
2337
250077
0000823577
2333335582225577
0000003357788888002233555577778
03335555778800233333555577778
02222335557780000000023333357778888
0000233357700000023578
00000022335800333
0370

Empirical cumulative distribution

R> plot(ecdf(faithful$eruptions), do.points = FALSE, verticals = TRUE)

0.6
0.4
0.0

0.2

Fn(x)

0.8

1.0

ecdf(faithful$eruptions)

R> qqnorm(faithful$eruptions)
R> qqline(faithful$eruptions)

4.5
3.5
2.5
1.5

Sample Quantiles

Normal QQ Plot

Theoretical Quantiles

Normal Q-Q plot

Exercises
Basic
You randomly select a sample of 10 persons in a class. There are 20 women
and 30 men in that class. What is the probabilty to have exactly 3 woman
in that sample? Suppose you are sampling with replacement.
What is the probability if you are sampling without replacement?
What is P(X > 3) if X N(0,2)?

Would you say, that the eruptions of Old Faithful are normal distributed?
Plot a histogramm together with a density estimation for the eruptions
data. Try dierent values for the bandwidth.

Exercises
Advanced
Continous distributions can be visualized with curve. How can categorial
distributions be visualized?
The distribution of the eruption data can be seen as a mixture of two
normal distributions. Try to nd the mean and variance of these normals
manually and plot a histogramm together with the distribution of the
mixture of normals.

Hypothesis Testing (Inference)

HairEyeColor dataset
The HairEyeColor data
Distribution of hair and eye color and sex in 592 statistics students at the
University of Delaware. A 3-dimensional array resulting from cross-tabulating
592 observations on 3 variables. The variables and their levels are as follows:
1. Hair levels: Black, Brown, Red, Blond
2. Eye levels: Brown, Blue, Hazel, Green
3. Sex levels: Male, Female

Statistical Tests
Inference = process of drawing conclusions about a population based on
a sample from the whole population
Example: Comparison of two samples of randomly selected 20 plants from
two elds. The population are all plants on each eld
A test is used to asses the hypotheses

HO : null hypothesis (can only be rejected, not valided)

H1 : alternative hypothesis
Note that HO H1
/

Decission is done after computing a test statistic and comparing it with

tabulated theoretical values of the distribution under H0

Errors in hypothesis testing

Truth

H0
H1

Test
Non-signicant
Signicant
True Negative Type-1 Error
Type-2 Error
True Positive

Table 5: Type-1 and Type-2 errors in hypothesis testing

Statistical testing relies on control of the type-1 error rate
the level of test, usually denoted by is equal to 0.05
Def. p-value: Probability of observing a more extreme than the observed
test statistic given H0 is true

Statistical Tests
The type of suitable test depends on the population distribution the question setting:

comparing a measure in two normal Populations

Students t-Test

comparing a measure in two non-normal Populations

Wilcoxontest

testing independence in Contingency Tables

Chi-squared-test

comparing one sample with a theoretical distribution

Kolmogorv-Smirno test

6.1

Two-sample t-test for equality of means

Students t-Test
Used to test the null hypothesis that the means of two independent populations of size n1 respectively n2 are the same
H0 : 1 = 2 .
The test statistic is computed as
t=

y1 y2

s 1/ n1 + 1/ n2

where s is the pooled standard deviation.

Under the null hypothesis, the t-statistic has a Students t-distribution with
n1 + n2 - 2 degrees of freedom, thus H0 is rejected if |t| t/ 2,n1 +n2 2 (two
sided test) or |t| t,n1 +n2 2 (one sided test)
Students t-Test with R

30
10

uptake

R> boxplot(uptake ~ Treatment, data = CO2, ylab = "uptake")

nonchilled

chilled

Figure 3: Boxplot of estimates of amount of uptake in both treatments.

Students t-Test with R

Students t-Test with R
Students t-Test is done with the function t.test. It has the following
arguments
34

R>
R>
R>
R>
R>
R>

par(mfrow = c(1, 2))

qqnorm(CO2[CO2$Treatment
qqline(CO2[CO2$Treatment
qqnorm(CO2[CO2$Treatment
qqline(CO2[CO2$Treatment
par(mfrow = c(1, 1))

==
==
==
==

"chilled", ]$uptake)
"chilled", ]$uptake)
"nonchilled", ]$uptake)
"nonchilled", ]$uptake)

Normal QQ Plot
q

q
q
q
q
q
q
q
q
qq
qq
q

q
q
qq
qq
qq
qq
q
qq
qq
q
q

30
10

q
q
qq
q
q
qq
q
qq

qq
q
q
qq
q
q
q
q
q
q
q
q
q

q
q
q

Sample Quantiles

q
qq
qq q
q
q
q
q
q
q
q
q

Sample Quantiles

Normal QQ Plot

q
q
q
qq
q

Theoretical Quantiles

Figure 4: Normal Probability plots of estimates of amount of uptake in treament

levels chilled and nonchilled.
x a (non-empty) numeric vector of data values
y an optional (non-empty) numeric vector of data values.
alternative a character string specifying the alternative hypothesis, must
be one of two.sided (default), greater or less.
paired a logical indicating whether you want a paired t-test (for dependent
samples).
conf.level condence level of the interval (default = 0.95).
formula a formula of the form lhs rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the
corresponding groups.
data an optional matrix or data frame

Students t-Test with R

R> t.test(uptake ~ Treatment, data = CO2, var.equal = FALSE)

Welch Two Sample t-test

data: uptake by Treatment
t = 3.0485, df = 80.945, p-value = 0.003107
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.382366 11.336682
sample estimates:
mean in group nonchilled
mean in group chilled
30.64286
23.78333

6.2

Wilcoxon test

Wilcoxon test with R

In opposite to Students t-Test the Wilcoxon test uses only the ranks of
the observations
No distribution assumptions for the variables
R> wilcox.test(uptake ~ Treatment, data = CO2)
Wilcoxon rank sum test with continuity correction
data: uptake by Treatment
W = 1187.5, p-value = 0.006358
alternative hypothesis: true location shift is not equal to 0

This test is more conservative then a t-Test with the same data

6.3

Chi-squared-test

Chi-squared-test
A sample of n observations in two nominal (categorial) variables is arranged
in a Contingency Table Under H0 the row variable and the column variable

1
2
.
.
.
r

1
n11
n21
.
.
.
nr1
n.1

y
...

c
n1c
n2c
.
.
.
nrc
n.c

n1.
n2.
.
.
.
nr.
n

Table 6: The general r c Contingency Table.

y are independent, thus the expected values Ejk for cell (j, k) can be computed
as Ejk = nj. n.k / n.
36

Chi-squared-test
The test statistic for the chi-squared-test is
r

(njk Ejk )2

Ejk

j=1 k=1

This value is compared with a 2 -distribution with (r 1)(c 1) degrees of

freedom.

0.5

2 -distribution

0.2

0.3

2
3
4
5
6

0.0

0.1

dchisq(x)

0.4

df =
df =
df =
df =
df =

2 -distribution

2
3
4
5
6

qchisq(x)

df =
df =
df =
df =
df =

0.0

0.2

0.4

0.6
x

0.8

1.0

Chi-squared-test with R
Contingeny Tabel of Hair and Eye Color of male statistic students
R> HairEyeColor[, , "Male"]
Eye
Hair
Brown Blue Hazel Green
Black
32
11
10
3
Brown
53
50
25
15
Red
10
10
7
7
Blond
3
30
5
8

The chi-squared-test is performed with the function chisq.test. The argument

is simply a contingeny table (obtained with the function table)
R> chisq.test(HairEyeColor[, , "Male"])
Pearson's Chi-squared test
data: HairEyeColor[, , "Male"]
X-squared = 41.2803, df = 9, p-value = 4.447e-06

Chi-squared-test with R
The expected values Ejk , j = 1, ..., r, k = 1, ..., c are obtained as follows:
R> chisq.test(HairEyeColor[, , "Male"])$expected
Eye
Hair
Brown
Black 19.67025
Brown 50.22939
Red
11.94265
Blond 16.15771

6.4

Blue
Hazel
Green
20.27240 9.433692 6.623656
51.76703 24.089606 16.913978
12.30824 5.727599 4.021505
16.65233 7.749104 5.440860

Kolmogorov-Smirnov-Test

Kolmogorov-Smirnov-Test
This test is either used to compare to variables and y or to compare one
variable with a theorectical distribution, e.g. the normal distribution. In R, this
test is implemented in the function ks.test(). The arguments are
x a numeric vector of data values.
y either a numeric vector of data values, or a character string naming
a cumulative distribution function or an actual cumulative distribution
function such as pnorm.
alternative indicates the alternative hypothesis and must be one of
two.sided (default), less, or greater.
... further arguments passed to the distribution function

Kolmogorov-Smirnov-Test with R
The Fertility-rate in the swiss data should be compared with a normal distribution. Simply use the ks.test() for vector of the 47 measures
R> ks.test(swiss$Fertility, "pnorm")
One-sample Kolmogorov-Smirnov test
data: swiss$Fertility
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided

The null hypothesis distribution is normal is rejected with =0.05 because data is not scaled
The KS-Test with scaled data results in
R> ks.test(scale(swiss$Fertility), "pnorm")
One-sample Kolmogorov-Smirnov test
data: scale(swiss$Fertility)
D = 0.1015, p-value = 0.7179
alternative hypothesis: two-sided

Now H0 is not rejected

Exercises
Basic
The means of Sepal.Length should be compared for the species setosa
and virginica in the iris data. Think about the assumptions (normal
distributions, equal variances) and choose the right test. Is there a significant dierence in mean?
Compute the p-value with help of the t-statistic on slide 35 on your own.
What value would we have with a one-sided hypotesis?
Why do we have df=9 in the output of the 2 -test on slide 38?

Advanced
It was necessary to scale the data before using the ks.test. How can we
get the right result without using the scale method? Have a look at the
help for the function and think about the ...-argument.
Have a look at slide 38 again and compute the value of the 2 -statistic on
our own.

Exercises
Advanced
You have given the following data. Are the two factors m1 [levels A
and B] and m2 [levels A, B and C] independent?

1
2
3
4
5
6
7
8

m1
A
A
A
A
B
B
A
B

m2
B
A
A
B
B
C
C
B

9
10
11
12
13
14
15
16

m1
B
B
B
B
A
B
A
B

m2
C
C
A
B
A
C
B
B

ANOVA and linear Regression

ANOVA & linear Regression

Both ANOVA (Analysis of Variance) and linear Regression are used to
analyse the eect of one idependent explanatory covariate on a dependent
response variable
If there is more than one covariate, the expression multiple linear regression
is used
ANOVA and linear Regression have in common, that the response is assumed to be normal distributed
The dierence is, that the covariates in ANOVA are factors but in linear
Regression no assumption is made for the covarites

7.1

ANOVA

ANOVA
ANOVA is used in so called factorial designs. Following distinction can be
done
balanced designs: same number of observations in each cell
unbalanced designs: dierent number of observations in each cell

low
high

A
low
high
(3.5,5,4.5) (5.5,6,6)
(2,4,1.5)
(8,11,9)
20.5
45.5

Total
30.5
35.5
66

Table 7: balanced design with two treatments A and B with two levels 1 (low)
and 2 (high) with 3 measurements for each cell
ANOVA
The model for ANOVA with two factors and interaction is
yjk = + + j + ()j + jk ,

(1)

with
yjk the kth measurement made in cell (, j)
the overall mean
main eect of the rst factor
j the main eect of the second factor
()j the interaction eect of the two factors
jk the residual error term with jk N(0, 2 )

ANOVA formula in R
The ANOVA model is specied with a model formula. The two-way layout
with interactions of equation (1) is writen in R as
y A + B + A:B,
where A is the rst and B the second factor. The interaction term is denoted by
A:B. An equivalent model formula is
y A*B.

In both cases, a mean is implicitly included in the model. In case = 0 use

y A + B + A:B -1.

ANOVA
In a ANOVA, the total variance of the observations is partioned into parts
due to the main eects and interaction plus an error term. The hypothesis that
there is no signicant eect is tested with an F-test. The assupmtions for the
F-test are
The observations are independent of each other.
The observations in each cell arise from a population having a normal
distribution.
The observations in each cell are from populations having the same variance.

1.4

F-distribution

0.8
0.6
0.0

0.2

0.4

df(x)

1.0

1.2

df1 = 1, df2 = 1
df1 = 1, df2 = 2
df1 = 10, df2 = 10
df1 = 10, df2 = 100
df1 = 2, df2 = 1000

ANOVA with R
Now we want to perform ANOVA with two main eects and interaction
for the example from above
Let us rst have a look at the data

ANOVA with R
The mean plot suggested that there are dierences in the means, especially
for factor A
To apply ANOVA to the data we use the function aov
The results is shown in a nice way with the summary method

R> plot.design(y ~ A + B + A:B)

high

low

mean of y

high

low
A

Factors

Figure 5: Plot of the means for each level of the factors.

R> summary(aov(y ~ A + B + A:B))
Df Sum Sq Mean Sq F value
A
1 52.083 52.083 43.8596
B
1 2.083
2.083 1.7544
A:B
1 21.333 21.333 17.9649
Residuals
8 9.500
1.187
--Signif. codes: 0 *** 0.001 ** 0.01 *

Pr(>F)
0.0001654 ***
0.2219123
0.0028433 **

0.05 . 0.1

ANOVA with R
The resulting ANOVA-table shows that there is a highly signicant eect
of factor A
The interaction eect is signikant too
The main eect of B is not signikant on 5% level
The estimated eects are obtained with the coef method
R> coef(aov(y ~ A + B + A:B))
(Intercept)
4.333333

Ahigh
1.500000

Bhigh Ahigh:Bhigh
-1.833333
5.333333

Note that the level "low" is treated as reference category due to identication restrictions

R> interaction.plot(A, B, y)

mean of y

high
low

low

high
A

Figure 6: Interaction plot of A B

ANOVA with R
To understand the interaction eect, the function interaction.plot() is
very useful
Exercises
Basis
Compute an ANOVA for the CO2 dataset. The eect of Treatment and
Type on the carbon dioxide uptake rates should be examined. Give an
interpretation for the result. Would you use a model with or without
interaction?
Think about the model assumptions. A useful method is to use the plot
method with the result of an ANOVA.

Advanced
In a second step, we want to look if the ambient carbon dioxide concentr
conc has an eect on uptake too. Check this with an ANOVA. The
variable should be dichotomized at the median before using it as a factor.

7.2

Linear Regression

Linear Regression
Linear Regression model:
y = + + , = 1, ..., n
44

(2)

with
y response measure for individual
covariate measure for individual
regression coecient for the intercept
regression coecient for the slope
error term with N(0, 2 )

The coecients and are estimated by and which is the least squares
solution:
n

( )(y y )
=1
n

( )2
=1

One can see that

Cov(, y)
Var()

= ry

sy
s

(3)
(4)

(5)

where ry is the empiric correlation coecient between and y and s is the

standard deviation for and sy is the standard deviation for y. The coecient of determination R2 is the proportion of variance that is explained by the
statistical model. It gives information about the goodness of t. It holds that
2
R2 = ry [0, 1]

Note, that R2 does not tell whether

the independent variables are a true cause of the changes in the dependent
variable
the correct regression was used
the most appropriate independent variable has been chosen
the model might be improved by using transforming the independent variable

The estimated values are obtained as

y = + , = 1, ..., n

(6)

The variability of the data set is measured through dierent sums of squares:
n

SStot

=
=1
n

SSreg

=
=1
n

SSerr

=
=1

(y y )2 , the total sum of squares

(7)

(y y )2 , the regression sum of squares

(8)

(y y )2 , the sum of squares of residuals

(9)

It holds that
SSerr + SSreg = SStot and R2 =

SSreg
SStot

Linear Regression - Residuals

The standard residuals are dened as

r = y y , = 1, ..., n

(10)

The sum of squares of the residuals is a measure for the t of the modell.
Residuals are the vertical distances of each point from the regression line.
The regression is unbiased, so E() = 0.
There should be no structure in the distribution of the residuals look
at the residuals.

New York Air Quality Measurements

airquality dataset
Daily air quality measurements in New York, May to September 1973. A data
frame with 154 observations on 6 variables:
Ozone [numeric] Mean ozone in parts per billion from 13:00 to 15:00 oclock
at Roosevelt Island
Solar.R [numeric] Solar radiation in Langleys from 08:00 to 12:00 oclock
at Central Park
Wind [numeric] Average wind speed in miles per hour at 07:00 and 10:00
oclock at LaGuardia Airport

Temp [numeric] Maximum daily temperature in degrees Fahrenheit at La

Guardia Airport
Month [numeric] Month (1 - 12)
Day [numeric] Day of month (1 - 31)

Linear Regression with R

In our study we want to measure the inuence of the temperature on the
ozone level
We use the airquality dataset
We have to think about the causal relationship
Here Temp is the independent and Ozone is the dependent variable
We should make some descriptive analyses before
An important meausure is the correlation coecient r. It can be calculated
as
R> cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
[1] 0.6983603

A linear Regression can be done with the lm function

R> lm(Ozone ~ Temp, data = airquality)
Call:
lm(formula = Ozone ~ Temp, data = airquality)
Coefficients:
(Intercept)
-146.995

Temp
2.429

The Coecients give the least squares estimates for (Intercept) and
A detailed model summary is obtained with the summary method

R> plot(Ozone ~ Temp, data = airquality)

100
0

Ozone

150

q
q

q
qq
q q
qqq
q q
q
q
q
q
q
q
qqq q
q q
q
qqq
qq q
q
q
q
q
qqq q q
q
q
q q q q
q
qq q
q q
qq
q qq
q q qqqq qqq
q
q
qq
qqqq qqq
qqq
q
q
q q
q
qq
q qqqq
qq
q
q q

Temp

Figure 7: Scatterplot of Temp and Ozone

R> summary(lm(Ozone ~ Temp, data = airquality))

Call:
lm(formula = Ozone ~ Temp, data = airquality)
Residuals:
Min
1Q
-40.7295 -17.4086

Median
-0.5869

3Q
Max
11.3062 118.2705

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -146.9955
18.2872 -8.038 9.37e-13 ***
Temp
2.4287
0.2331 10.418 < 2e-16 ***
--Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
1
Residual standard error: 23.71 on 114 degrees of freedom
(37 observations deleted due to missingness)
Multiple R-squared: 0.4877,
Adjusted R-squared: 0.4832
F-statistic: 108.5 on 1 and 114 DF, p-value: < 2.2e-16

R> plot(Ozone ~ Temp, data = airquality)

R> abline(lm(Ozone ~ Temp, data = airquality))

100
50
0

Ozone

150

q
q

qq
q
qq
q q
qqq
q q
q
q
q
q
q
q
qqq q
q q
q
qqq
q
qq q
q
q
q
qqq q q
q
q
q q q q
q
qq q
q q
qq
qq
q qq
q
q q q q qqqqqq q q
qq
qqq
q
q
q
q qq qqqqqq
qq
q
qq
q
q q

Temp

Figure 8: Scatterplot of Temp and Ozone with regressioncurve

The Regression curve can be displayed in the scatterplot

It is useful to save the model as an object

R> myMod <- lm(Ozone ~ Temp, data = airquality)

myMod is of class lm
R> class(myMod)
[1] "lm"

The structure of myMod can be seen with the str method

Now several method can be applied directly to myMod, e.g.
R> coef(myMod)
(Intercept)
-146.995491

Temp
2.428703

This gives the regression coecients

The following methods can be applied to an object of class lm

method
coef
fitted.values
residuals
predict
summary
plot

use
returns a vector of length 2 containing the regressions coecients
returns a vector of length n containing the predicted values
returns a vector of length n containing the residuals
a method for predicting values with the model
a model summary
a plot for the model diagnostics

Model diagnostics

62
q 30 q
q q
q q
q
q q
qqqq
q
qqqq
qqqqqq qq
q qq
q
qq
q qq
qq qqq q
qq q qq qqqq qqqq
qq q
q q
q q
qqqqq qqqq
qqq q qq
q qq
q
qqqqqq
qq
qq q
q
qq
q

2.0

ScaleLocation

1.0

50
50

Residuals

q 117

0.0

Residuals vs Fitted

Standardized residuals

R> layout(mat = matrix(1:4, 2, 2))

R> plot(myMod)

q 117

62
q 30 q
q q
q
qq
q q
q
q
qq qqq q q
q
q
q q
q
q
q
qq
qqqq q
q
q
q
q
q qq qqqqq q q
q q q q qqqqq qqq q q
q
q qqqq
qq
q
q
q q qqq
q
q q
qq q qq q
q q
q q
qq
q
q
qq q
q
q
q q
qq
q q

Residuals vs Leverage
q 117

0.5

62
q q
q 99 q
q
qq
qq q q
q q q
q
q q q qqq q q q q q q
q q
q
q q q qq q qq q
q
q q
q q q qq q q
q qq q
qq
qq
q
qq qq
q
q
qq q
qqqqqq
qq q q
qq q
q
q
q
q
q
q

Cook's distance

30
qq
q
q
q
qq
q
qq
qq
qq
qq
qq
qq
q qq
qqqq
qqqq
qqq
qqq
qqqq
qqq
qq
qqq
qq
qq
qq
qqq
qqqq
qqqq
qqqqq
qqqqq
qqqqq
qqqq
qq
q qq

Standardized residuals

117 q

Fitted values

Normal QQ

2 0

Standardized residuals

Fitted values

0.00

Theoretical Quantiles

0.02

0.04

Leverage

Three outliers can be identied: observations 30, 62 and 117

Look at the data
R> airquality[c(30, 62, 117), ]

30
62
117

Ozone Solar.R Wind Temp Month Day qTemp

115
223 5.7
79
5 30 6241
135
269 4.1
84
7
1 7056
168
238 3.4
81
8 25 6561

Compare with the other values

R> fivenum(airquality$Ozone, na.rm = TRUE)
[1]

1.0

18.0

31.5

63.5 168.0

R> fivenum(airquality$Temp, na.rm = TRUE)

[1] 56 72 79 85 97

Mark the observations in the scatterplot

200

R> plot(Ozone ~ Temp, data = airquality[-c(30, 62, 117), ], ylim = c(-10,

+
210), xlim = c(40, 120))
R> abline(myMod)
R> points(airquality[c(30, 62, 117), ]$Ozone ~ airquality[c(30,
+
62, 117), ]$Temp, col = 2)

q
q
q
q
q
q
q q q
q
q
q
qq q
qqq q q
q q q
qqqq
q
qq
qq
q qq q
q
q q
q
q
qqqq qq q q q
q
q
q
q
qqq q
q q
q
q
q qqq
q qq qqq qqqqq
q
qq qq
q qq qqq
q
q
qq q
qqq
q

100

Ozone

150

100

120

Temp

Observations with a high Temp are underestimated by our model

Maybe the true relationship between is not linear but, e.g. the true functional relationsship is quadratic
We t a second model where the temperature is considered quadratic
R> airquality$qTemp <- airquality$Temp^2
R> cor(airquality$qTemp, airquality$Ozone, use = "complete.obs")

[1] 0.7130697
R> cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
[1] 0.6983603
R> myMod2 <- lm(Ozone ~ qTemp, data = airquality)
R> myMod2
Call:
lm(formula = Ozone ~ qTemp, data = airquality)
Coefficients:
(Intercept)
-57.07404

qTemp
0.01612

R> summary(myMod2)
Call:
lm(formula = Ozone ~ qTemp, data = airquality)
Residuals:
Min
1Q
-39.7067 -16.7409

Median
-0.9534

3Q
Max
10.5436 119.2933

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.074043
9.386333 -6.081 1.63e-08 ***
qTemp
0.016123
0.001485 10.859 < 2e-16 ***
--Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
1
Residual standard error: 23.23 on 114 degrees of freedom
(37 observations deleted due to missingness)
Multiple R-squared: 0.5085,
Adjusted R-squared: 0.5042
F-statistic: 117.9 on 1 and 114 DF, p-value: < 2.2e-16

The new curve displayed in the scatterplot

R> plot(Ozone ~ Temp, data = airquality, ylim = c(-10, 210), xlim = c(40,
+
120))
R> x <- 1:120
R> lines(coef(myMod2)[1] + coef(myMod2)[2] * x^2)

200

q
q
q
q
q
q
q q q
q
q
qq qq q
q
q q q
q q q
qqqq
q
q q qq
qqq q
q
q
q
q q
q
q
q
q
q q q
q
qqq q q q q
qq
q q
q
q
q qqq
q qq qqq qqqqq
q
q qq
q qq qqq
qqq q
q
q
q
qqq
q

100

Ozone

150

100

120

Temp

Excercises
Basic
Which of the models myMod or myMod2 ts better to the data? Give reasons
for your choice.
What could be a problem of the second model?
In second analysis we want to analyse the eect of the temperature on
the wind. Use the airquality data, make descriptive analyses and t a
model. Give interpretations for the results.
Look at the model diagnostics. Are there outliers in the data? Highlight
the outliers.

Advanced
Look at the airquality data. As you can see, there are 37 missing values
for Ozone. Predict this values with the predict method and with our
model myMod1 and plot them into the scatterplot.
We want to estimate the eect of the agriculture rate on the education
rate in the swiss data. What could be a problem here?

7.3

Multiple Linear Regression

If more than one dependent variable should be considered in one model we
need Multiple Linear Regression. Assume that y represents the value of the
response variable of the th observations and that 1 , 2 , ..., q represent the
values of the q explanatory variables. The model is given by
y = 0 + 1 1 + ... + q q + , = 1, ..., n
53

(11)

with N(0, 2 ). It holds that

E(y|1 , ..., q ) = 0 + 1 1 + ... + q q .

Note, that the linear in multiple linear regresion applies to the regression parameters, not the response or independent variables. So also quadratic functions
of independent variables t in this model class.
Multiple Linear Regression
The multiple linear regression model can be writen as a common model for
all observations as
y = X + ,
(12)
with
y = (y1 , ..., yn )T as the vector of response variables
= (1 , ..., q )T as vector of regression coecients
= (1 , ..., n )T as vector of error terms

The design or model matrix X has the following form:

1
1
.
.
.

11
21
.
.
.

12
22
.
.
.

..
.

1q
2q
.
.
.

with n rows and q + 1 columns.

Multiple Linear Regression

The least squares estimator for is calculated as

= (X T X)1 X T y.

The expectation for is given as

E() =
and the variance is

Var() = 2 (X T X)1 .

The predicted values are

y = 0 + 1 1 + ... + q q .

(13)

Multiple Linear Regression

In a multiple linear regression model, the parameter j associated with
the explanatory variable j is such that j is the increase in the response
variable y when j increases by 1, conditional on the other explanatory
variables remaining constant.
If j > 0, the response variable increases if j increases, if j < 0 the
response variable decreases if j inceases.

Multiple Linear Regression - Residuals

The standard residuals are dened as

= y y , = 1, ..., n.

The symetric n n prediction-matrix or hat-matrix is dened as

H = X(X T X)1 X

The standardized residuals are dened as (h is the th diagonal element

of H)

.
r =

1 h

The studentized residuals are dened as

r = r

Source of variation
Regression
Residual
Total

1/ 2

nq1

n q r2

Sum of squares
n

(y y )2
=1
n

(y y )2
=1

Degrees of freedom
q
n-q-1
n-1

Table 8: Analysis of variance table for the multiple linear regression model.
The mean square ratio
F=

(y
=1

y )2 / q

y )2 / (n q 1)

provides an F-test of the general hypothesis

H0 : 1 = ... = q = 0.
Under H0 , the statistic F has an F-distribution with q and n q 1 degrees
of freedom.
55

As in Linear Regression R2 is a measure for the t of the model. It is

computed as
2

R =

(y
=1
n
(y
=1

y )2

2
=1

n
(y
=1

y )2

2
2
In linear Regression R2 = ry and in multiple linear Regression R2 = ryy

holds.

Hierachical ordered models can be compared with R2 . Suppose we have

two models

y = 0 + 1 1 + 2 2 +
y = 0 + 1 1 +

(M1)

(14)

(M2).

(15)

Now, R2 R2
M1
M2

Dierent models can only be compared with R2 , if the have the same
response, the same number of parameters and an intercept.

Multiple Linear Regression

Adjusted R2 is a modication of R2 that adjusts for the number of explanatory terms in a model.
Unlike R2 , the adjusted R2 increases only if the new term improves the
model more than would be expected by chance.
The adjusted R2 can be negative, and will always be less than or equal to
R2 .
Adjusted R2 is computed as

R2 = 1 (1 R2 )
dj

nq1

Adjusted R2 is useful for model selection.

In the R summary outout, the R2 measure is called Multiple R-squared
and the adjusted R2 Adjusted R-squared.

Multiple Linear Regression with R

We want to extend the simple Model Ozone Temp by an additional
quadratic eect of Temp.
The model call in R is
R> airquality$qTemp <- airquality$Temp^2
R> (mod <- lm(Ozone ~ Temp + qTemp, data = airquality))
Call:
lm(formula = Ozone ~ Temp + qTemp, data = airquality)
Coefficients:
(Intercept)
305.48577

Temp
-9.55060

qTemp
0.07807

Signicant eect of both Temp and qTemp

R> summary(mod)

Call:
lm(formula = Ozone ~ Temp + qTemp, data = airquality)
Residuals:
Min
1Q
-37.619 -12.513

Median
-2.736

3Q
Max
9.676 123.909

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 305.48577 122.12182
2.501 0.013800 *
Temp
-9.55060
3.20805 -2.977 0.003561 **
qTemp
0.07807
0.02086
3.743 0.000288 ***
--Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
1
Residual standard error: 22.47 on 113 degrees of freedom
(37 observations deleted due to missingness)
Multiple R-squared: 0.5442,
Adjusted R-squared: 0.5362
F-statistic: 67.46 on 2 and 113 DF, p-value: < 2.2e-16

The F-statistic in the model output compares the model with an intercept
model. This is an test for a general eect of the model.

Dierent, nested model can be compared also. The smaller model has to
be part of the bigger model
If we want to know, if the additional quadratic eect of Temp is meaningful,
we compare the models
R> mod1 <- lm(Ozone ~ Temp, data = airquality)
R> mod2 <- lm(Ozone ~ Temp + qTemp, data = airquality)

In R, this is done with the anova command

R> anova(mod1, mod2)
Analysis of Variance Table
Model 1: Ozone ~ Temp
Model 2: Ozone ~ Temp + qTemp
Res.Df
RSS Df Sum of Sq
F
Pr(>F)
1
114 64110
2
113 57038 1
7071.8 14.010 0.0002876 ***
--Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

As we can see, the quadratic eect is meaningful

Model Selection
More covariates lead always to a equal or better model adaption and thus
a higher SSreg
Adding random numbers as covariates would also improve the model adaption
The task is to nd a compromise between model adaption and model
complexity
This is called model selection

Model Selection - Deviance & AIC

Deviance : is a quality of t statistic for a model
AIC (Akaikes information criterion): is a measure of the goodness of t
of an estimated statistical model.

AC = 2k + n(n(SSReg / n))
with k = number of explanatory variables + intercept in the model and
n = number of observations.
The idea is to penalize the number of explanatory variables to favour
sparse models.

Exercises
Basic
We want to nd the best model to explain the amount of Ozone in the
airquality data. Make some descriptive statistic, e.g. scatterplots for
each possible covariate and the response and compute the correlation matrix.
Try dierent models and compare the with the anova method. Which
model would you choose. Take = 0.05. Give an interpretation for the
eects.
Check the model assumptions and look for outliers in the data.

Advanced
Look at the help of the function step and try to nd the best model
with this function. Compare the results (Note: look especially at the
direction argument).

7.4

Categorial Covariates

Including categorial covariates

Up to now we have considered only metric covariates in the models.
As in ANOVA, also categorial covarites are allowed as covariates.
Additional, interactions between metric and categorial and two or more
categorial covariates are possible.
Covariates are treated as categorial if they are declared as a factor.
factor covarites are coded to use them in models
Two main types of coding: dummy coding and eect coding
A Linear Regression Model with inly categorial covarites is the same as
an ANOVA.

Dummy coding
Dummy coding uses only ones and zeros to convey all of the necessary
information on group membership.
A dummy variable is dened as

0
1

no group member
group member

For a factor with k groups (levels) k1 dummy variables d1 , d2 ,...,dk1

have to be introduced.
For d1 , every observation in group 1 will be coded as 1 and 0 for all other
groups it will be coded as zero.
We then code d2 with 1 if the observation is in group 2 and zero otherwise.
dk is not needed because d1 ,...,dk1 has all of the information needed to
determine which observation is in which group.

Dummy coding
Consider the following example with observations of a factor variable grp
with 4 groups (each replicated 2 times) and a metric response measure y.
We have to create 4-1=3 dummy variables d1, d2 and d3.
The resulting coding is the following:
y
5
4
7
4
6
3
9
5

Table 9:

grp
1
2
3
4
1
2
3
4

d1
0
1
0
0
0
1
0
0

d2
0
0
1
0
0
0
1
0

d3
0
0
0
1
0
0
0
1

Example of dummy coding

d1 is called the refernce category.

Dummy coding
Means of y in each group
1
2
3
4
5.5 3.5 8.0 4.5

We t now a linear model with R. Note that dummy coding is the default
for factor covariates and the rst level is used as reference category.
R> y <- c(5, 4, 7, 4, 6, 3, 9, 5)
R> grp <- factor(rep(1:4, times = 2))
R> str(grp)
Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4
R> mod <- lm(y ~ grp)
R> coef(mod)

(Intercept)
5.5

grp2
-2.0

grp3
2.5

grp4
-1.0

The intercept is the mean of group 1.

The estimated regression coecients are just the dierence of the intercept.

Dummy coding
We could also dene the dummy variables on our own:
R> d1 <- c(0, 1, 0, 0, 0, 1, 0, 0)
R> d2 <- c(0, 0, 1, 0, 0, 0, 1, 0)
R> d3 <- c(0, 0, 0, 1, 0, 0, 0, 1)

We set up a model and estimate the eects:

R> lm(y ~ d1 + d2 + d3)
Call:
lm(formula = y ~ d1 + d2 + d3)
Coefficients:
(Intercept)
5.5

d1
-2.0

d2
2.5

d3
-1.0

The results are the same

Dummy coding is implemented in the function contr.treatment

Eect coding
An alternative coding scheme is eect coding.
Eect coding uses only ones, zeros and minus ones to convert all of the
necessary information on group membership.
A eect variable for k

1
1
ej =

groups is dened as
member of group j
member of group k
else

for j = 1, ..., k 1

For a factor with k groups (levels) k 1 eect variables e1 , e2 ,...,ek1

have to be introduced.
For e1 , every observation in group 1 will be coded as 1, as -1 for group k
and 0 for all other groups it will be coded as zero.
ek is not needed because e1 ,...,ek1 has all of the information needed to
determine which observation is in which group.

Eect coding
In our example, the eect coding scheme is the following:
y
5
4
7
4
6
3
9
5

Table 10:

grp
1
2
3
4
1
2
3
4

d1
1
0
0
-1
1
0
0
-1

d2
0
1
0
-1
0
1
0
-1

d3
0
0
1
-1
0
0
1
-1

Example of effect coding

Eect coding is implemented in the function contr.sum

Eect coding
We t now a linear model with R and use the eect coding scheme. This
can be spezied in the contrasts argument of lm.
R> lm(y ~ grp, contrasts = list(grp = contr.sum(4)))
Call:
lm(formula = y ~ grp, contrasts = list(grp = contr.sum(4)))
Coefficients:
(Intercept)
5.375

grp1
0.125

grp2
-1.875

grp3
2.625

With eect coding the intercept is equal to the grand mean of all of the
observations.
The coecients of each of the eect variables is equal to the dierence
between the mean of the group coded 1 and the grand mean.

Interaction
As in ANOVA interactions can be used in regression models.
An interaction term can be spezied in the model fomula.
Following notations are possible in a formula:

+
:
*
.

add a variable
remove a variable (-1 to remove intercept)
interaction between two variables
add variables and their interaction
add all variables height

Table 11: formula notation in R

Exercises
Basic
We look again to the airquality data. We want to analyse the eect of
the date on the Ozone level. Make rst some descriptive plots for Ozone
and Month and Day.
Create a new factor season for the season. It schould have two levels:
spring: from 01-05 to 15-06 and summer from 16-06 to 31-09. Add this
new variable to the airquality dataset. How many observations are in
spring and how many are in summer?
What is the mean value of Ozone in spring, what is the mean value in
summer?
Fit a linear model with Ozone as response and season as a categorial
covariate. Use dummy coding and spring as the refernce category. Give
an interpretation for the result.

Advanced
Now t the lm and use dummy coding with summer as reference category.
Fit the lm and use eect coding.

Exercises
Advanced
There is the theory, that the ozone level does not depend on the Month
only but additional also on the day of the week. One assumes that there
is a dierence between weekdays and weekend.
Create a new factor variable which contains the information if an observation is made on a weekday or nor not (weekend). Perhabs the functions
as.Date and weekdays are useful.
Fit a lm with the eect of seasaon and weekday + interaction and give an
interpretation for the results.

Generalized Linear Regression

Dust dataset
The dust data is available under the archieve of datasets of the department of
statistics at the LMU :http://www.stat.uni-muenchen.de/service/datenarchiv/dust/dust.asc.
The dust data
The data was collected by the Deutsche Forschungsgemeinschaft in the years
1960 to 1977 in a munich factory with 1246 workers. The data.frame with one
observation for each woker contains the following variables
cbr [factor] chronical reaction of bronchia [0: no, 1:yes]
dust [numeric] dust exposure at workplace (mg/m3 )
smoking [factor] Is the worker a smoker? [0: no, 1:yes]
expo [numeric] time of exposure (years)

8.1

Logistic Regression

Logistic Regression
In the dust data our response variable is cbr.
It has only two possible categories yes and no.
Multiple linear regression model can be writen as

N(, 2 )
0 + 1 1 + ... + q q

This model is only suitable for continous response variabels with, conditional on the values of the explanatory variables, a normal distribution
with constant variance.
So he model is not suitable for the response variable cbr which is binary.
Other situations where linear regression fails are count data responses,
metric but not normal responses or categorial responses

Logistic Regression
What we need is a model which takes the special type of the response
variable into account
For binary data {0,1} we want to model the response probability of taking
the value 1 directly as a linear function of the explanatory variables.

For example we want the probability that somebody has a chronical reaction of bronchia depending on

the amount of dust exposure

the time of exposure
the smoking status
We have to ensure that our model does not lead to tted values for the
response probability outside [0,1].

Logistic Regression
A suitable transformation is the logistic or logit function which ensures a
range of transformed values in [0,1].
The logistic regression model for the resonse probability is

logit() = log

= 0 + 1 1 + ... + q q

(16)

The logit of a probability is simply the log of the odds of the response
taking the value one.
Equation (16) can be also write as

(1 , 2 , ..., q ) =

exp(0 + 1 1 + ... + q q )
1 + exp(0 + 1 1 + ... + q q )

The logit function can tak any real value, but the associated probability
always lies in the required [0,1] interval.

Logistic Regression
The regression coecients in a logist regression model have a dierent
interpretation as in linear regression.
Equation (16) can be also write as

= exp(0 ) exp(1 1 ) ... exp(q q )

In a logistic regression model, the parameter j associated with the explanatory variable j is such that exp(j ) is the odds that the response
variable takes one when j increases by one conditional on the other explanatory variables remaining constant.
The regression coecients are estimated by maximum likelihood (ML).

Logistic Regression with R

First, we load the data
R> dust <- read.table("http://www.stat.uni-muenchen.de/service/datenarchiv/dust/dust.asc",
+
header = TRUE)

We declare cbr and smoking as factor variables

R> dust$cbr <- factor(dust$cbr, levels = c(0, 1), labels = c("no",
+
"yes"))
R> dust$smoking <- factor(dust$smoking, levels = c(0, 1), labels = c("no",
+
"yes"))

A short summary of the data

R> summary(dust)
cbr
no :954
yes:292

dust
Min.
: 0.2000
1st Qu.: 0.4925
Median : 1.4050
Mean
: 2.8154
3rd Qu.: 5.2475
Max.
:24.0000

smoking
no :325
yes:921

expo
Min.
: 3.00
1st Qu.:16.00
Median :25.00
Mean
:25.06
3rd Qu.:33.00
Max.
:66.00

We now t a logistic model with cbr as response variable and expo as

explanatory variable using the glm (generalized linear model) function.
The code to t the model is
R> dust_glm_1 <- glm(cbr ~ expo, data = dust, family = binomial())
R> class(dust_glm_1)
[1] "glm" "lm"

The class glm inherits of lm.

This model implies a global mean automatically 1

The argument family=binomial() is used to specify, that logistic regression should be used.

R> layout(matrix(1:2, ncol = 2))

R> boxplot(dust ~ cbr, data = dust, xlab = "cbr", ylab = "dust")
R> boxplot(expo ~ cbr, data = dust, xlab = "cbr", ylab = "expo")

expo

dust

q
q

yes

cbr

yes
cbr

Figure 9: Descriptive boxplots for the dust data.

1.0
0.8
0.6
0.0

0.2

0.4

cbr
0.0

0.2

0.4

cbr

0.6

yes

0.8

yes

1.0

R> layout(matrix(1:2, ncol = 2))

R> plot(cbr ~ dust, data = dust)
R> plot(cbr ~ expo, data = dust)

dust

25 30

expo

Figure 10: Descriptive plots for the dust data.

1.0
0.8
0.6
0.0

0.2

0.4

cbr
0.0

0.2

0.4

cbr

0.6

yes

0.8

yes

1.0

R> layout(matrix(1:2, ncol = 2))

R> cdplot(cbr ~ dust, data = dust)
R> cdplot(cbr ~ expo, data = dust)

dust

expo

Figure 11: Conditional plots for the dust data.

R> mosaicplot(cbr ~ smoking, data = dust, shade = TRUE)

dust
yes

Standardized
Residuals:

yes

smoking

<4 4:2 2:0

0:2

2:4

cbr

A description of the model is obtained with the summary method

R> summary(dust_glm_1)
Call:
glm(formula = cbr ~ expo, family = binomial(), data = dust)
Deviance Residuals:
Min
1Q
Median
-1.1681 -0.7751 -0.6270

3Q
-0.4909

Max
2.0688

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.258885
0.181545 -12.443 < 2e-16 ***
expo
0.040673
0.006066
6.705 2.01e-11 ***
--Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1356.8
Residual deviance: 1309.9
AIC: 1313.9

on 1245
on 1244

degrees of freedom
degrees of freedom

Number of Fisher Scoring iterations: 4

From the results we see that expo has a signicant positve eect on the
probability of cbr=="yes" at the 5% level.
The value of the regression coecient is
R> coef(dust_glm_1)["expo"]
expo
0.04067309

This value is more helpful if we convert to the corresponding values for

the odds themselves by exponentiating the estimate
R> exp(coef(dust_glm_1)["expo"])
expo
1.041512

This means, if the years of exposure is increased by 1 the odds that

cbr=="yes" increases by 0.38.

The probability of getting a chronical reaction of bronchia with a dust

exposure of 10 years is computed as
R> a <- coef(dust_glm_1)[1]
R> b <- coef(dust_glm_1)["expo"]
R> exp(a + b * 10)/(1 + exp(a + b * 10))
(Intercept)
0.1356202

A 95% condence interval for the not transformed eect is

R> confint(dust_glm_1, parm = "expo")
2.5 %
97.5 %
0.02887967 0.05267799

Using the exponential transoformation, we get for the 95% condence

interval
R> exp(confint(dust_glm_1, parm = "expo"))
2.5 %
97.5 %
1.029301 1.054090

The tted probabilities for the dust_glm_1 are tted as a line in the
conditional densityplot using the following code
R> cdplot(cbr ~ expo, data = dust, ylevels = c("yes", "no"))
R> x <- seq(from = min(dust$expo), to = max(dust$expo), length = 200)
R> lines(x, exp(a + b * x)/(1 + exp(a + b * x)), lwd = 2, col = 2)

The tted probabilities are computed using the inverse of the logistic function as transformation.

1.0
0.0

0.2

yes

0.4

cbr

0.6

0.8

fitted

expo

In the next step, we include the next covariate from the dust data. This
is the amount of dust exposure at workplace.
We can now t a logistic regression model that includes both explanatory
variables using the code
R> dust_glm_2 <- glm(cbr ~ expo + dust, data = dust, family = binomial())

The transormed 95% condenceintervalls for the regression coecients are:

R> exp(confint(dust_glm_2, parm = c("expo", "dust")))
2.5 %
97.5 %
expo 1.027912 1.052967
dust 1.046549 1.145827

Both explanatory variables have a positive eect on the probability of

getting a chronical reaction of bronchia

R> summary(dust_glm_2)

Call:
glm(formula = cbr ~ expo + dust, family = binomial(), data = dust)
Deviance Residuals:
Min
1Q
Median
-1.3550 -0.7685 -0.6007

3Q
-0.4592

Max
2.1620

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.502380
0.196077 -12.762 < 2e-16 ***
expo
0.039467
0.006138
6.429 1.28e-10 ***
dust
0.090799
0.023076
3.935 8.33e-05 ***
--Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1356.8
Residual deviance: 1294.7
AIC: 1300.7

on 1245
on 1243

degrees of freedom
degrees of freedom

Number of Fisher Scoring iterations: 4

The predicted values of the second model can be ploted agains both explanatory variables in a bubble plot using the following code
R> prob <- predict(dust_glm_2, type = "response")
R> plot(dust ~ expo, data = dust, pch = 21, ylim = c(0, 30), xlim = c(0,
+
80), cex = prob * 8)

The size of the points corresponds to the probability of getting a chronical

reaction of bronchia.

To compare dust_glm_1 and dust_glm_2 we use again the anova method.

This is possible, because we want to compare two nested models.
For two glm-objects the argument test="Chisq" is necassary to compare
to models as two lm objects
R> anova(dust_glm_1, dust_glm_2, test = "Chisq")

30
25
20
15

dust

q
qqq q q qqq qq
qq q q qqqqq
q q qqq
q qq q q
qq
qq
q q qq
q
q qq q q
q
q qq q q qqq qq
q qqqqq q qqq qq
qqq q qqqq qq
qqqq qqqq q
qqq q q
q qqq qq
q qqq qq
qq q q qqq q
q q q q q q qq qq
q
q qqqq q
q qq
q qqq
q
qqq qqq qq
q
qq qq
q
qq q q q
q
q qq
qq
qq
qq q q
q
q q
qq qqq q q
q q q qq
qqqq q q q q q qqq q q q
q
qq q qq q
qq
q q q q q qqqq q qqq q
qqqqq q q q q q qqq q q q q qq
q
q qq q q q q q qqq q q qq q q
q q q q qqq q qq q qq
q
qqq qqq q q
qq q
q
q q qq q
q

qq
qq

qq
qq
qq
qq
qq
qqq
qq
qq
qq q
qq
qqq
qq
q
q
q
q
40

expo

Figure 12: Bubble plot of tted values for a logistic regression model tted to
the dust data
Analysis of Deviance Table
Model 1: cbr ~ expo
Model 2: cbr ~ expo + dust
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1
1244
1309.9
2
1243
1294.7 1
15.267 9.333e-05 ***
--Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

Including dust as explanatory variable leads to a signicant increase in

model adaption.

Exercises
Basic
After how many years of exposure is the probability of having a chronical
reaction of bronchia greater 0.5, 0.9 and 0.95? Use the estimations of
model dust_glm_1.
By which factor does the odds of getting a cbr increase, when the exposure
inreases by 3 years?

We extend model dust_glm_1 by including the factor smoking as explanatory variable. What is the eect of smoking and is it useful to include
this variable in the model?

Advanced
Plot the tted probablity values against the years of exposure. Fit two
lines, one for smokers and one for nonsmokers. Fit the observed values
using the function rug.
Include an interaction of smoking and exposure in the model. Give an
interpreation for the results. Is it useful to model this interaction?

Exercises
Advanced
Plot the tted probablity values against the years of exposure. Fit again
two lines, one for smokers and one for nonsmokers.
If you compare the three explanatory variables and dust, smoking and
expo, which ones and which interactions would you include in a model
(model selection)?
Using the predict method on a glm object gives us the probabilities of
cbr="yes". We want to predict with our logistic regression model if an
person is likely to have cbr or not. For that purpose, every person with a
tted probability of > 0.5 is predicted to have cbr and vice versa. This
predicted values can be compared with the real values, e.g. making a 22
cross table:

real value of cbr

no
yes

predicted value of cbr

no
yes
True Negative False Positive
False Negative True Positive

Compare the dierent models by comparing the counts of True Negatives,

False Positives, False Negatives and True Positive.

8.2

The Generalized Linear Model

Generalized Linear Model

The generalized linear model (GLM) is a exible generalization of ordinary
least squares regression.
The GLM generalizes linear regression by allowing the linear model to be
related to the response variable via a link function and by allowing the
magnitude of the variance of each measurement to be a function of its
predicted value

The GLM consists of three elements.

1. A probability distribution from the exponential family (normal, binomial, Poisson,...).

2. A linear predictor = X = 0 + 1 1 + ...q q .
3. A link function g such that E(y) = = g1 ().
The link function g describes how the mean response E(y) = is kinked
to the explanatory variables through the linear predictior

= g()
Generalized Linear Models
Each member of a the exponential family distribution has an owen link
function and variance function.
The link function ensures the right range of predicted values for every
distribution, i.e. y > 0 for y N() or y [0, 1] for y Bin(n, ).
Family
Normal
Poisson
Binomial
Gamma
Inverse Gaussian

Link g
=
= log
= log(/ (1 ))
= 1
= 2

Variance Function
1

(1 )
2
3

GLM - Residuals
The Pearson residual is comparable to the standardized residuals used for
linear models and is dened as

rP =

V()

The deviance residuals are computed as

rD = sign(y ) d ,
with

2
rD = Deviance 0

Ships Damage Data

The ships data
Data frame giving the number of damage incidents and aggregate months of
service by ship type, year of construction, and period of operation for 40 ships.
This data set is part of the library MASS
type [factor] levels A to E
year [factor] year of construction: levels 1960 - 64, 65 - 69, 70 - 74, 75 79 (coded as 60, 65, 70, 75).
period [factor] period of operation : levels 1960 - 74, 75 - 79 (coded as
60, 75).
service [numeric] aggregate months of service.
incidents number of damage incidents (response).

ships data
Formatting the data
R>
R>
R>
R>

library(MASS)
data(ships)
ships$year <- factor(ships$year)
ships$period <- factor(ships$period)

Summary of the data

R> summary(ships)
type
A:8
B:8
C:8
D:8
E:8

year
60:10
65:10
70:10
75:10

period
60:20
75:20

service
Min.
:
0.0
1st Qu.: 175.8
Median : 782.0
Mean
: 4089.3
3rd Qu.: 2078.5
Max.
:44882.0

incidents
Min.
: 0.0
1st Qu.: 0.0
Median : 2.0
Mean
: 8.9
3rd Qu.:11.0
Max.
:58.0

Poisson regression
The response variable incidents is count data.
The response can only take positive values.
Such a variable is unlikely to have a normal distribution.

We model a Poisson distribution:

Pois(y) =

e y
y!

This type of GLM is known as Poisson regression.

The default link function is the log function
We want to measure the eect of aggregated months of service on the
number of incidents.

Generalized Linear Models with R

30
0

incidents

R> plot(incidents ~ service, data = ships, type = "h")

10000

20000

30000

40000

service

Generalized Linear Models with R

We can apply the model using
R> ships_glm_1 <- glm(incidents ~ service, data = ships, family = poisson())

The residuals can be obtained by

R> res <- residuals(ships_glm_1, type = "pearson")

The predict method for an object of class glm2 returns by default the
predicted values on scale of the linear predictor (type="link").
The predicted values on response scale are obtained with the argument
type="response"
R> pred <- predict(ships_glm_1, type = "response")

R> summary(ships_glm_1)

Call:
glm(formula = incidents ~ service, family = poisson(), data = ships)
Deviance Residuals:
Min
1Q
Median
-6.0040 -3.1674 -2.0055

3Q
0.9155

Max
7.2372

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.613e+00 7.150e-02
22.55
<2e-16 ***
service
6.417e-05 2.870e-06
22.36
<2e-16 ***
--Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 730.25
Residual deviance: 374.55
AIC: 476.41

on 39
on 38

degrees of freedom
degrees of freedom

Number of Fisher Scoring iterations: 6

Exercises
Basic
What is the interpretation of the regression coecients of model ships_glm_1?
Think of the interpretation in a logistic regression model here.
How many incidents would you expect after 10, 100, 1000, 10000 and
100000 month of service?
Make a residuals vs. tted values scatterplot for the deviance residuals.
Look at a tted values vs. orginal values scatterplot for the ships_glm_1
model. What could be a problem?
Add other explanatory variables to the model. Try to nd the best model
by comparing the models with the anova method?

R> plot(res ~ pred)

R> abline(h = 0, lty = 2)

q
q

res

q
q
q

q
q
q
q
q
q
q
q
q
q
q
q
q
q

pred

Figure 13: Residuals vs. tted values.

Exercises
Advanced
Both the Poisson and binomial distribution have variance functions that
are completely determinded by the mean. There is no free paramter for
the variance (see in the summary output: Dispersion parameter for
poisson family taken to be 1).
Often the empirical varinace is bigger than the theoretical variance. This
is called overdispersion.
To deal this problem, we can use a quasi-likelihood approach. This is
implemented in R in the function quasipoisson() which can be passed as
a family argument for glm.
Ret model ships_glm_1 with a quasi-likelihood approach and give an
interpretation of the results.

Linear Mixed Models

Wheat Yield Trial data

The Wheat2 data
The Wheat2 data frame has 224 rows and 5 columns. It summarizes the results
of a eld trial. This dataset is part of the package nlme.
79

Block [factor] an ordered factor with levels 4 < 2 < 3 < 1

variety a factor with 56 levels [ARAPAHOE, BRULE, BUCKSKIN,
CENTURA,...]
yield [numeric] the yield of the crop
latitude [numeric] species the lattidude of the crop on the eld
longitude [numeric] species the longitude of the crop on the eld

Load the data

R> library(nlme)
R> data(Wheat2)

Summary of the data

R> summary(Wheat2)
Block
4:56
2:56
3:56
1:56

variety
ARAPAHOE : 4
BRULE
: 4
BUCKSKIN : 4
CENTURA : 4
CENTURK78: 4
CHEYENNE : 4
(Other) :200

yield
Min.
: 1.05
1st Qu.:23.52
Median :26.85
Mean
:25.53
3rd Qu.:30.39
Max.
:42.00

latitude
Min.
: 4.30
1st Qu.:17.20
Median :25.80
Mean
:27.22
3rd Qu.:38.70
Max.
:47.30

longitude
Min.
: 1.20
1st Qu.: 7.20
Median :14.40
Mean
:14.08
3rd Qu.:20.40
Max.
:26.40

The trial consits of 4 Blocks, each variety is repeated once on each block
R> all(table(Wheat2$Block, Wheat2$variety) == 1)
[1] TRUE

Data situation
We have repeated measures, i.e. clustered data or a longitudinal study
The data has the following form

(y1 , ..., yj , ..., yn , 1 , ..., j , ..., n ), = 1, ..., m

Here we have m clusters and n observations for each cluster.
yj is the values from observation j in cluster .

latitude

22
latitude

25
20
15

longitude

latitude

20
15

longitude

25
20

longitude

latitude

Figure 14: Wheat Yield Trial. The size of the symbols corresponds to the yield.
Result
Correlated data
Observations of the same cluster tend to be more equal than observations
of dierent clusters
Dierent sources of variety in the data

Within clusters based on the repeated measures. This means the

departures of the measures from the mean of the cluster.
Between clusters. This means the departures of the means of the
clusters from the overall mean.
Goals
Estimation of
Cluster-eects
Population-eects
Correlation structure

Linear Mixed Models

The linear mixed model for repeated measures is given by
yj = T + T b + j = 1, ..., m, j = 1, ..., n
j
j

(17)

or in matrix notation
y = X + U b + = 1, ..., m
The random eects are assumed to be iid distributed normal
b N(0, D),
81

(18)

where D is the covariance matrix of random eects and

N(0, ).

Experimental Design
In an experimental design often blocks are dened.
We are not interested in the block eects specically but must account for
their eect.
Therefore, blocks are treated as random eects.
This can be done in the framework of linear mixed models.

Linear Mixed Models with R

There are two packages to compute linear mixed models: nlme and lme4
Both have several advantages and disadvantages
We rst have a look at the nlme-package
Here, a special object class named groupedData exists which is special
for cluster data with repeated measures
We make an groupedData object for the eld trial
R> gD <- groupedData(yield ~ 1 | Block, data = Wheat2)
R> plot(gD)

We now t an linear mixed model using the function lme

R> lmmWheat2 <- lme(gD)
R> summary(lmmWheat2)
Linear mixed-effects model fit by REML
Data: gD
AIC
BIC
logLik
1515.307 1525.528 -754.6535
Random effects:
Formula: ~1 | Block

q q

q
q

q q q
q
q

q qqq q q qq q q q q q q q
q qq q q q q q q q
q qq q q
q
q q q
q
q

q qq qq q q q q q q qq q qqq q q
q
q
q
q q
q
q q q qq
q

Block

qq q q

q q q q qq
q

q q q
q q

q
q

q qq
qq

qq
q
q

q qq

qq qqq qq qqqqq q qqq q q q q

qq q q q q qq q qq q q
q q qq
q
q q q
q

qq
q

qq qq qq q qqq q qqq q
q q q qq q qqq q q q
q
q

yield

StdDev:

(Intercept) Residual
3.14809 6.931017

Fixed effects: yield ~ 1

Value Std.Error DF t-value p-value
(Intercept) 25.52701 1.640755 220 15.55809
0
Standardized Within-Group Residuals:
Min
Q1
Med
Q3
-3.3405353 -0.4134947 0.1171165 0.6241342

Max
2.3440565

Number of Observations: 224

Number of Groups: 4

The random eects are estimated as

R> random.effects(lmmWheat2)

4
2
3
1

(Intercept)
-3.8623327
2.8511949
-0.8737036
1.8848413

If we compare this with the block means minus the overall mean
R> tapply(Wheat2$yield, Wheat2$Block, mean) - mean(Wheat2$yield)

4
-4.1966518

2
3
3.0979911 -0.9493304

1
2.0479911

Eects are shrunken to zero

In the next step we include the position of the crops on the eld as xed
eect to your model
This is done by including an main eect for latitude and longitude and an
interaction term
Note the blocks are of eqaul size and arranged in coloumns with the plants
sowed in 3-4 coloumns in each block
We need an equal range of latitude and logitude in each block, otherwise
latitude would be a surrogate for the block eect
R> Wheat2$latitude2 <- Wheat2$latitude - rep(c(4.3, 17.2, 25.8,
+
38.7), each = 56)
R> unique(Wheat2$latitude2)
[1]

0.0

4.3

8.6 12.9

4.3

8.6 12.9

4.3

8.6

The model is now specied as

R> lmmWheat2_2 <- lme(yield ~ latitude2 * longitude, random = ~1 |
+
Block, data = Wheat2)

We do not use a groupedData-object now but instead we specify a model

fromula for the xed eect and a formula for the random eect.
1|Block means, that a random intercept is tted for each level of the grouping
variable (here Block)
Note, that the grouping variable must be a factor
The estimed random eects are now
R> random.effects(lmmWheat2_2)
(Intercept)
4 -4.73121459
2 2.61928461
3 -0.09549262
1 2.20742260

R> summary(lmmWheat2_2)
Linear mixed-effects model fit by REML
Data: Wheat2
AIC
BIC
logLik
1432.378 1452.74 -710.1892
Random effects:
Formula: ~1 | Block
(Intercept) Residual
StdDev:
3.451906 5.524812
Fixed effects: yield ~ latitude2 * longitude
Value Std.Error DF
t-value p-value
(Intercept)
23.974644 2.4981401 217 9.596997 0.0000
latitude2
-0.827432 0.2320629 217 -3.565551 0.0004
longitude
0.225626 0.0974046 217 2.316379 0.0215
latitude2:longitude 0.044225 0.0138447 217 3.194342 0.0016
Correlation:
(Intr) lattd2 longtd
latitude2
-0.651
longitude
-0.661 0.831
latitude2:longitude 0.534 -0.865 -0.845
Standardized Within-Group Residuals:
Min
Q1
Med
Q3
-3.28913321 -0.47504226 0.03741677 0.72172891

Max
2.11526795

Number of Observations: 224

Number of Groups: 4

Graphical visualisation of the results

We want to visualize our results
We plot the tted values for the rst block
R> WB1 <- Wheat2[Wheat2$Block == 1, ]
R> fitB1 <- matrix(data = 0, ncol = length(unique(WB1$latitude2)),
+
nrow = length(unique(WB1$longitude)), byrow = FALSE)
R> fitB1[16:71] <- fitted(lmmWheat2_2)[1:56]

Graphical visualisation is done with the functions filled.contour and

persp using the following commands:

25
20
15

q
qq
q
q
qq
q
q
q
qq
q
q
q

q
q
qq
q
qq
q
qq
q
qq
qq
q
q
q
qq
q
q
q
q
q
q
q q
q
q
q
q
q
qq

qq
q
q
qq
qq
q
q
q
q
q
q
qq
q
q

q
qq
q
qq
qq
q
qq
q
qq
q
q
qq
qq
q
q
q
q
q
q
qq

Block 1
Block 2
Block 3
Block 4

q
q
qq

q
q
q
q
q
q
q

q
q
q

qq
q
q

q
q
q
q

q
qq
qq
q
qq
q
q
qq
q
qq
q
q
qq
q
q q
q

qq
q
q
q
qq
q
q
qq
qq
q
q
qq
q
q
qq

longitude

q
q
q
q
q
qq
qq
q
q
q
q
qq
q
qq
qq
q

qq
qq
q
q
qq
q
qq
q
qq
q
q

q
q

q q
q
q

q
q

latitude

Figure 15: Fitted values for Wheat yield trial with model lmmWheat2_2.
R> filled.contour(x = unique(WB1$latitude2), y = sort(unique(WB1$longitude)),
+
z = t(fitB1))
R> persp(x = unique(WB1$latitude2), y = sort(unique(WB1$longitude)),
+
z = t(fitB1), theta = 25, phi = 25)

package lme4
In the lme4 library linear mixed models can bes estimated with the lmer
function
R> library(lme4)
R> `?`(lmer)

This function uses dierent arguments

Fixed and random eects are specied in the same formula
Fixed eects are given as usual, random eect are given in brackets and
the grouping variable is given by a vertical line
The model from above can be tted with lmer typing

30
20
25
20

15
10
10
5

5
0
0

Figure 16: Contour plot for the tted values of block 1.

R> Wheat2_lme4 <- lmer(yield ~ latitude2 * longitude + (1 | Block),
+
data = Wheat2)
R> class(Wheat2_lme4)
[1] "mer"
attr(,"package")
[1] "lme4"

No groupeData-objects can bes used

R> summary(Wheat2_lme4)
Linear mixed model fit by REML
Formula: yield ~ latitude2 * longitude + (1 | Block)
Data: Wheat2
AIC BIC logLik deviance REMLdev
1432 1453 -710.2
1410
1420
Random effects:
Groups
Name
Variance Std.Dev.
Block
(Intercept) 11.916
3.4519
Residual
30.524
5.5248
Number of obs: 224, groups: Block, 4
Fixed effects:

long
it

ude

alues
fitted v
latit

ude

Figure 17: Perspective plot for the tted values of block 1.

Estimate Std. Error t value
(Intercept)
23.97464
2.49785
9.598
latitude2
-0.82743
0.23206 -3.566
longitude
0.22563
0.09740
2.316
latitude2:longitude 0.04422
0.01384
3.194
Correlation of Fixed Effects:
(Intr) lattd2 longtd
latitude2
-0.651
longitude
-0.661 0.831
lttd2:lngtd 0.534 -0.865 -0.845

To extract the estimated xed eects use

R> fixef(Wheat2_lme4)

(Intercept)
23.97464066

latitude2
-0.82743079

To extract the estimated random eects use

R> ranef(Wheat2_lme4)

longitude latitude2:longitude
0.22562615
0.04422476

$Block
(Intercept)
4 -4.73226084
2 2.61986444
3 -0.09551437
1 2.20791077

To extract the covarinace matrix of random eects use

R> VarCorr(Wheat2_lme4)

$Block
(Intercept)
(Intercept)
11.91566
attr(,"stddev")
(Intercept)
3.451907
attr(,"correlation")
(Intercept)
(Intercept)
1
attr(,"sc")
sigmaREML
5.524812

Growth of soybean plants

The Soybean dataset
Data from an experiment to compare growth patterns of two genotypes of soybeans: Plant Introduction (P), an experimental strain, and Forrest (F), a commercial variety. The Soybean data frame has 412 rows and 5 columns. The data
set is part of the library nlme.
Plot a factor giving a unique identier for each plot.
Variety a factor indicating the variety; Forrest (F) or Plant Introduction
(P).
Year a factor indicating the year the plot was planted.
Time a numeric vector giving the time the sample was taken (days after
planting).
weight a numeric vector giving the average leaf weight per plant (g).

Exercises
Make some descriptive analyses for the Soybean data.
We want to compare the growth of soybeans for the dierent Plots. What
is your response variable, what are xed and what are random eects? Set
up a model and give an interpretation for the results.
We want to t two random eects for each plot: a random intercept and a
random slope for the time. The model should also include the xed eects
Variety and Year. Fit this model using the lme4 package.
Look at the coavariance matrix of the random eects and give an interpretation.

Writing own functions

Repetitive execution
Functions for repetitive execution: for loops, repeat and while
There is also a for loop construction which has the form
R> for (name in expr_1) expr_2

where name is the loop variable. expr 1 is a vector expression, (often

a sequence like 1:20) and expr 2 is repeatetly evaluated as name ranges
through the values in the vector result of expr 1.
Example
R> x <- rep(NA, times = 6)
R> for (i in 1:6) x[i] <- factorial(i)
R> x
[1]

24 120 720

Writing own functions

The R language allows the user to create objects of mode function. These
are true R functions that are stored in a special internal form and may be
used in further expressions
A function is dened by an assignment of the form
R> name <- function(arg_1, arg_2, ...) expression

The expression is an R expression, (usually a grouped expression), that uses

the arguments, arg i, to calculate a value. The value of the expression is
the value returned for the function
A call to the function then usually takes the form
R> name(expr_1, expr_2, ...)

Data Wrangling With R
91% (11)
Data Wrangling With R
237 pages
Kleinbaum Applied Regression Analysis and Other Multivariable Methods 3 Ed PDF
0% (6)
Kleinbaum Applied Regression Analysis and Other Multivariable Methods 3 Ed PDF
9 pages
An Introduction To Data Analysis in R - 9783030489977 PDF
100% (3)
An Introduction To Data Analysis in R - 9783030489977 PDF
289 pages
Research Capsule
100% (1)
Research Capsule
3 pages
R Programming Notes
100% (1)
R Programming Notes
32 pages
Learn R Programming in A Day
100% (7)
Learn R Programming in A Day
229 pages
Book - Roger D Peng-Exploratory Data Analysis With R-Leanpub (2015) PDF
0% (1)
Book - Roger D Peng-Exploratory Data Analysis With R-Leanpub (2015) PDF
125 pages
Data Analysis with STATA: Explore the big data field and learn how to perform data analytics and predictive modelling in STATA
From Everand
Data Analysis with STATA: Explore the big data field and learn how to perform data analytics and predictive modelling in STATA
Prasad Kothari
4.5/5 (7)
Spss 1
No ratings yet
Spss 1
2 pages
2b03 Test2 Version1 2014
No ratings yet
2b03 Test2 Version1 2014
10 pages
E5 - Statistical Analysis Using R
100% (1)
E5 - Statistical Analysis Using R
45 pages
Unit3 160420200647 PDF
No ratings yet
Unit3 160420200647 PDF
146 pages
R Programming For NGS Data Analysis
No ratings yet
R Programming For NGS Data Analysis
5 pages
RSTUDIO
No ratings yet
RSTUDIO
44 pages
Dplyr Tutorial
100% (1)
Dplyr Tutorial
22 pages
Survival Plots SURVMINER Package Tutorial
No ratings yet
Survival Plots SURVMINER Package Tutorial
5 pages
Predictive Modeling Project Report
100% (2)
Predictive Modeling Project Report
31 pages
Statistical Computing by Using R
100% (1)
Statistical Computing by Using R
11 pages
Computational Statistics With R
100% (1)
Computational Statistics With R
125 pages
Classical Crypto
No ratings yet
Classical Crypto
16 pages
R-Cheat Sheet
100% (1)
R-Cheat Sheet
4 pages
R For Statistics PDF
86% (7)
R For Statistics PDF
312 pages
R Shiny Cheatsheet
No ratings yet
R Shiny Cheatsheet
2 pages
Descriptive Analysis in R Programming - GeeksforGeeks-1-12
No ratings yet
Descriptive Analysis in R Programming - GeeksforGeeks-1-12
12 pages
STATS LAB Basics of R PDF
No ratings yet
STATS LAB Basics of R PDF
77 pages
Exploratory Analysis
100% (1)
Exploratory Analysis
3 pages
Frequency Distribution For Categorical Data
No ratings yet
Frequency Distribution For Categorical Data
6 pages
Lesson 5 Data Wrangling in Data Science.
100% (1)
Lesson 5 Data Wrangling in Data Science.
11 pages
Class 7
No ratings yet
Class 7
42 pages
R Short Tutorial
No ratings yet
R Short Tutorial
5 pages
Data Analysis Using R
100% (1)
Data Analysis Using R
78 pages
Ggplot2 Elegant Graphics For Data Analysis (2016, Springer) PDF
No ratings yet
Ggplot2 Elegant Graphics For Data Analysis (2016, Springer) PDF
281 pages
(Chapman & Hall_CRC The R Series) Chester Ismay, Albert Y. Kim - Statistical Inference via Data Science_ A ModernDive into R and the Tidyverse (Chapman & Hall_CRC The R Series)-Chapman and Hall_CRC (2
No ratings yet
(Chapman & Hall_CRC The R Series) Chester Ismay, Albert Y. Kim - Statistical Inference via Data Science_ A ModernDive into R and the Tidyverse (Chapman & Hall_CRC The R Series)-Chapman and Hall_CRC (2
461 pages
Statistics For Data Science - 1
100% (2)
Statistics For Data Science - 1
38 pages
100 Data Science in R Interview Questions and Answers For 2016
100% (2)
100 Data Science in R Interview Questions and Answers For 2016
56 pages
Random Forest
No ratings yet
Random Forest
5 pages
Creating A Live World Weather Map Using Shiny - by M. Makkawi - The Startup - Medium
No ratings yet
Creating A Live World Weather Map Using Shiny - by M. Makkawi - The Startup - Medium
40 pages
R Programming Presentation
100% (1)
R Programming Presentation
23 pages
Flexible Imputation of Missing Data
100% (2)
Flexible Imputation of Missing Data
444 pages
Advanced R Data Analysis Training PDF
No ratings yet
Advanced R Data Analysis Training PDF
72 pages
The Next Level of Data Visualization in Python
100% (1)
The Next Level of Data Visualization in Python
17 pages
Least Squares Problems: How To State and Solve Them, Then Evaluate Their Solutions
100% (1)
Least Squares Problems: How To State and Solve Them, Then Evaluate Their Solutions
63 pages
Linear Regression
83% (6)
Linear Regression
499 pages
Data Visualization in R - With Cheat Sheets PDF
100% (1)
Data Visualization in R - With Cheat Sheets PDF
62 pages
R Data Analyst DAR
No ratings yet
R Data Analyst DAR
6 pages
Time Series Analysis - An Introduction
No ratings yet
Time Series Analysis - An Introduction
38 pages
Shiny Introduction
100% (1)
Shiny Introduction
70 pages
Ggplot2 Cheatsheet
No ratings yet
Ggplot2 Cheatsheet
2 pages
Datascience and R PDF
No ratings yet
Datascience and R PDF
488 pages
(Use R!) Keon-Woong Moon - Learn Ggplot2 Using Shiny App (2017, Springer) PDF
100% (3)
(Use R!) Keon-Woong Moon - Learn Ggplot2 Using Shiny App (2017, Springer) PDF
356 pages
Data Wrangling
No ratings yet
Data Wrangling
30 pages
Statistics With R Fall 20180912 PDF
No ratings yet
Statistics With R Fall 20180912 PDF
101 pages
Applied Categorical and Count Data Analysis (PDFDrive)
50% (2)
Applied Categorical and Count Data Analysis (PDFDrive)
380 pages
A Learning Guide To R PDF
0% (1)
A Learning Guide To R PDF
255 pages
Data Science Course Content
No ratings yet
Data Science Course Content
4 pages
Overview Of Bayesian Approach To Statistical Methods: Software
From Everand
Overview Of Bayesian Approach To Statistical Methods: Software
Vinaitheerthan Renganathan
No ratings yet
ggplot2 Essentials
From Everand
ggplot2 Essentials
Donato Teutonico
No ratings yet
Data Preparation and Exploration: Applied to Healthcare Data
From Everand
Data Preparation and Exploration: Applied to Healthcare Data
Robert Hoyt
No ratings yet
Survival Analysis
From Everand
Survival Analysis
Rupert G. Miller, Jr.
No ratings yet
Biostatistics by Example Using SAS Studio
From Everand
Biostatistics by Example Using SAS Studio
Ron Cody
No ratings yet
Errors of Regression Models: Bite-Size Machine Learning, #1
From Everand
Errors of Regression Models: Bite-Size Machine Learning, #1
Lee Baker
No ratings yet
Mastering Machine Learning with R - Second Edition
From Everand
Mastering Machine Learning with R - Second Edition
Cory Lesmeister
No ratings yet
Learning Shiny
From Everand
Learning Shiny
Resnizky Hernán G.
No ratings yet
Excel 2013/2016: Get Your Hands Dirty
From Everand
Excel 2013/2016: Get Your Hands Dirty
Sam Akrasi
No ratings yet
Oberheim Matrix 1000 Owners Manual PDF
No ratings yet
Oberheim Matrix 1000 Owners Manual PDF
33 pages
Making Burr Puzzles PDF
No ratings yet
Making Burr Puzzles PDF
8 pages
GSMNP Map JUNE14 Complete4 2
No ratings yet
GSMNP Map JUNE14 Complete4 2
60 pages
Beck M., Geoghegan R. The Art of Proof.. Basic Training For Deeper Mathematics (Springer, 2010)
No ratings yet
Beck M., Geoghegan R. The Art of Proof.. Basic Training For Deeper Mathematics (Springer, 2010)
20 pages
Contopoulos G. - Adventures in Order and Chaos. A Scientific Autobiography (2005)
No ratings yet
Contopoulos G. - Adventures in Order and Chaos. A Scientific Autobiography (2005)
208 pages
Free Word Combinations
No ratings yet
Free Word Combinations
2 pages
7 Lexical Semantics
No ratings yet
7 Lexical Semantics
55 pages
QS62 QS82 Manual
No ratings yet
QS62 QS82 Manual
105 pages
Essential Oil Extraction Techniques
100% (1)
Essential Oil Extraction Techniques
10 pages
Burning Mouth Syndrome An Update 2010
No ratings yet
Burning Mouth Syndrome An Update 2010
7 pages
Weissenborn - Look So Good
No ratings yet
Weissenborn - Look So Good
2 pages
Walkin'the Blues
No ratings yet
Walkin'the Blues
4 pages
Linear Approximations and The Cox Model
No ratings yet
Linear Approximations and The Cox Model
40 pages
Leadership Styles and Employees' Motivation - Perspective From An Emerging Economy - Document - Gale Academic OneFile
No ratings yet
Leadership Styles and Employees' Motivation - Perspective From An Emerging Economy - Document - Gale Academic OneFile
14 pages
CHI SQUARE TEST and ANOVA
No ratings yet
CHI SQUARE TEST and ANOVA
20 pages
HW 4
No ratings yet
HW 4
7 pages
ProcessMA16 Manual
No ratings yet
ProcessMA16 Manual
34 pages
Emotional Content Bias in Urban Legends
0% (1)
Emotional Content Bias in Urban Legends
28 pages
Final Defensepptx
No ratings yet
Final Defensepptx
15 pages
Python
No ratings yet
Python
76 pages
Syllabus Mphil Education
No ratings yet
Syllabus Mphil Education
15 pages
Statistics 502 Lecture Notes: Peter D. Hoff
No ratings yet
Statistics 502 Lecture Notes: Peter D. Hoff
186 pages
Anreg
No ratings yet
Anreg
587 pages
MBA Syllabus 2022
No ratings yet
MBA Syllabus 2022
209 pages
CBCS Syllabus - Ma Applied Psychology - 20.07.2019
No ratings yet
CBCS Syllabus - Ma Applied Psychology - 20.07.2019
57 pages
Influence of Polishing Systems On Surface Roughness of Composite Resins: Polishability of Composite Resins
No ratings yet
Influence of Polishing Systems On Surface Roughness of Composite Resins: Polishability of Composite Resins
11 pages
Institute of Actuaries of India: Subject CT3 - Probability and Mathematical Statistics
100% (2)
Institute of Actuaries of India: Subject CT3 - Probability and Mathematical Statistics
6 pages
FinalExam Fall2020 Updated GB213
No ratings yet
FinalExam Fall2020 Updated GB213
11 pages
Factors That Affect Nursing Perception As A Career Choice Among Scientific High School Students in Hebron City - 0
No ratings yet
Factors That Affect Nursing Perception As A Career Choice Among Scientific High School Students in Hebron City - 0
46 pages
Immediate download Basic Business Statistics 13th Edition (eBook PDF) ebooks 2024
100% (2)
Immediate download Basic Business Statistics 13th Edition (eBook PDF) ebooks 2024
41 pages
Res 431
No ratings yet
Res 431
3 pages
Thirishali Report
No ratings yet
Thirishali Report
119 pages
Grammatical Competence Level of Freshmen in The College of Education at Taguig City University: Basis For An Intervention Program
100% (4)
Grammatical Competence Level of Freshmen in The College of Education at Taguig City University: Basis For An Intervention Program
127 pages
677cc494589faa9aee6cfb8a_FORMULA_SHEET_MAS202
No ratings yet
677cc494589faa9aee6cfb8a_FORMULA_SHEET_MAS202
2 pages
Beck & Rankin (1997)
No ratings yet
Beck & Rankin (1997)
12 pages
A Comparative Study of Standard American English and Non-Standard PDF
No ratings yet
A Comparative Study of Standard American English and Non-Standard PDF
42 pages
The Draw-A-Person Test: An Indicator of Children's Cognitive and Socioemotional Adaptation?
100% (1)
The Draw-A-Person Test: An Indicator of Children's Cognitive and Socioemotional Adaptation?
18 pages
The PSPP Guide:: An Introduction To Statistical Analysis
No ratings yet
The PSPP Guide:: An Introduction To Statistical Analysis
95 pages