
Introduction to R

Jena University Hospital


Institute of Medical Statistics, Computer and Data Sciences
Julia Palm (julia.palm@med.uni-jena.de)
Contents

Preface

1 Introduction
  1.1 What is R?
  1.2 Installing R
  1.3 Getting to know RStudio
  1.4 R as a calculator
  1.5 Assignments
  1.6 Basic data structures
  1.7 Troubleshooting

2 Data exploration
  2.1 Functions
  2.2 Packages
  2.3 Loading data into R
  2.4 Descriptive analysis
  2.5 Troubleshooting

3 Data manipulation & Inference I
  3.1 Data manipulation
  3.2 Diagnostic Tests
  3.3 Comparing two samples
  3.4 Troubleshooting

4 Inference II
  4.1 Risk and Odds
  4.2 Regression
  4.3 Time-to-event analysis

5 Advanced Use
  5.1 Programming basics
  5.2 Graphics with ggplot
  5.3 Further reading

Preface

This instruction manual belongs to the course Introduction to R, which is taught at the Institute for Medical Statistics, Computer and Data Sciences at Jena University Hospital. Each chapter belongs to one of the five course dates. It is written in a way that should allow you to reproduce the entire course by yourself on your personal computer.
There are a lot of code examples in this instruction manual. You can generally
recognize a piece of R code in this document by the grey highlighting. If the
code returns a result, the result is displayed directly below the code:
1+1

[1] 2
We strongly encourage you to try out all the example code yourself while working
through the chapters!
If you run into problems, please check the Troubleshooting paragraphs at the end of the first chapters, which list the most common problems people encounter when learning to use R, along with the corresponding solutions.

Chapter 1

Introduction

1.1 What is R?
R is a software for data analysis, data manipulation and visualization, and a well-developed and powerful programming language. It is a highly extensible, open-source and free software which compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. The R project was started by Robert Gentleman and Ross Ihaka at the University of Auckland in 1995 and is maintained by the R Core Team (2021), an international group of volunteer developers. The website of the R project is:

http://www.r-project.org

Although R works fine by itself, we will use it in combination with RStudio, a so-called Integrated Development Environment (IDE), which provides a comfortable graphical user interface and some additional functionality.

1.2 Installing R
To begin, you should install R and RStudio on your computer. If you are working on a computer that has these programs already installed, you can skip this part. To install R, go to the Comprehensive R Archive Network (CRAN), for example here: https://ftp.fau.de/cran/, and install the version of R that is suitable for your operating system. After you have installed R, visit https://rstudio.com/products/rstudio/download/ and download and install RStudio Desktop. Check whether you can open RStudio by clicking on the RStudio icon on your desktop or by searching for it in your taskbar.


1.3 Getting to know RStudio


RStudio is generally divided into four subwindows. Note: If you see only three subwindows upon opening RStudio, click File > New File > R Script. The upper left window is the R script where you write your code. The lower left window is the console. Here the code that you've written in the script gets executed and the results are displayed.

To try this out, type 1+1 into your script. Then mark this piece of code and
either click on the Run symbol in the upper right corner or press Ctrl + Enter
on your keyboard.

As you can see, the code gets reprinted in the console behind the > , which is called the prompt, and directly below it the result [1] 2 is displayed. The process of sending code from the script to the console is called running or executing your code. It is possible to write code directly into the console next to the prompt and to execute it by hitting Enter . However, we strongly advise typing all of your code out in the script before executing it, since this makes rerunning and changing your code way easier. To make your code more human-readable, you can comment it in your script. Any line of code that begins with a # will not be evaluated when sent to the console but will be merely printed out:
#R doesn't calculate 1+1 if it is written like this:

#1+1

Anything you write in R that is not a comment is case sensitive, which means
for example A and a are not the same thing to R.

The upper and lower right windows in RStudio will be explained when they
become relevant in the following chapters.

1.4 R as a calculator
As you have seen in the example above, R can be used as an ordinary calculator. You can use + and - for addition and subtraction, * and / for multiplication and division, and ^ for exponentiation.
Try out different calculations like the following by typing them into your script
and running them. You can either run them line by line or mark several lines
at once for execution.
5+3

[1] 8
7*3/2

[1] 10.5
2^3

[1] 8
(2-5)*8^2

[1] -192
The [1] in front of the output will appear in front of every vector in the console. It is an index telling you the position of the first element in the row, which is useful when the vector is so long that it produces line breaks in your console. You will learn what a vector is in just a moment.

1.5 Assignments
One of the most important concepts in R is the assignment of names to objects.
So far the objects we have encountered are simple numbers. To assign a name
to a number, you use the assignment operator <- (no space between < and
- ) like this:
x <- 3
some.complicated.name <- 7

We call x and some.complicated.name variables. Notice how they appear in the top right window of RStudio under the tab Environment once you have run those two lines. In the environment you will see all R objects that you have defined so far. R will list their names for you if you use ls() :
ls()

[1] "some.complicated.name" "x"


The collection of named R objects showing up in the environment window is called the workspace and can be saved and reloaded as you will learn later on. For a variable name, you can use any string that doesn't have blank spaces or special characters in it and that doesn't begin with a number. You can look up the value that is stored in a variable by simply typing out its name and running it:
x

[1] 3
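As a quick aside, here is a small illustration of the naming rules (both names are made up for this example; the second one would throw an error if you removed the #):
my.value_2 <- 10 #fine: dots, underscores and digits are allowed
#2nd.value <- 10 #not allowed: a name must not begin with a number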
Now you can use these variables for computation:
x + some.complicated.name

[1] 10
You can overwrite the value stored in your variables at any point by simply
rerunning the original assignment with different values.
For example you can assign the values 2 and 80 to x and some.complicated.name
and compute their product.
x <- 2
some.complicated.name <- 80
x*some.complicated.name

[1] 160
If you want to remove a variable, you can use the rm() command like this:
rm(x)

If you want to remove all variables in your workspace, you can use a combination
of rm() and ls() :
rm(list=ls())

As you can see, the variables now disappear from the environment window in
the top right corner. The advantage of using variables instead of numbers in
your code is that your code becomes reusable. Imagine having typed out a long
computation that you want to perform repeatedly with different numbers. If
your computation uses variable names, you only write it down once and are
able to rerun it with as many different values as you like by just assigning those
values to the variable one by one.
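As a small illustration of this idea (the variable name and values are made up for this example):
radius <- 2
radius^2 * pi #area of a circle with radius 2

[1] 12.56637
radius <- 10
radius^2 * pi #the same line of code reused for radius 10

[1] 314.1593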

1.6 Basic data structures


R has a small number of basic data structures from which all other kinds of
objects can be built. We’ll go through the most important ones one by one.

1.6.1 Vectors
So far you have worked with single numbers. These are actually a special case of the most important data structure in R, the so-called vector. A vector in R is a sequence of elements of the same data type. For example 1 2 3 4 is a numeric vector (i.e. with elements that are numbers) with four elements, namely the numbers from 1 to 4. You build a vector with the c() function like this:
c(8, 2, 4, 6, 2, 1)

[1] 8 2 4 6 2 1
This is a numeric vector of length 6 (i.e. it has 6 elements). An example of a vector long enough to produce a line break is the following:
c(1, 23, 4, 5, 6, 7, 7, 8, 4, 2, 4, 6, 8, 98, 45, 23,
45, 8, 97, 23, 4, 23, 1, 3, 5, 6, 2, 45, 3, 45, 4, 1, 3)

[1] 1 23 4 5 6 7 7 8 4 2 4 6 8 98 45 23 45 8 97 23 4 23 1 3 5
[26] 6 2 45 3 45 4 1 3
You can store vectors in variables as well:
my_vector <- c(8, 2, 4, 6, 2, 1)

Notice how the new variable my_vector now appears in the environment window on the upper right side. You can retrieve the vector stored in my_vector by typing it out and executing the code:
my_vector

[1] 8 2 4 6 2 1

Numeric
The kind of vector you’ve just seen is the numeric vector (or just numeric), which
is a vector containing numbers. A numeric containing only whole numbers (like
my_vector ) can be called an integer, which is a subtype of numeric.
If the numbers in the numeric have decimal places, it can be called a double.
v <- c(1.5, 3.234, 7, 0.12356)
v

[1] 1.50000 3.23400 7.00000 0.12356


You can use numeric vectors in calculations just like single numbers:
a <- c(1, 2, 3, 4)
a*3

[1] 3 6 9 12
b <- c(2, 4, 6, 7)
a+b

[1] 3 6 9 11

R executes the operation element-wise. This means the computation should involve either two vectors of the same length (like a+b in our example) or one vector and a single number (like a*3 in our example). If the lengths of the vectors in your calculation don't fit, R will recycle the shorter one to make it fit the longer vector.
long <- c(1, 2, 3, 4)
short <- c(1, 2)
long+short

[1] 2 4 4 6
Here, the shorter vector was repeated, i.e. the calculation was long + c(short, short) . If the length of the longer vector is not a multiple of the length of the shorter one, R will additionally print a warning.

Character
R can not only deal with numbers, it can also deal with text. A piece of text
is called a string and is written in a pair of double or single quotes. A vector
containing strings as elements is called a character vector:
v2 <- c("male", "female", "female", "male")
v2

[1] "male" "female" "female" "male"


v3 <- c('blue', 'brown', 'yellow')
v3

[1] "blue" "brown" "yellow"

Logical
Another important type of vector is the logical vector, the elements of which are the so-called booleans TRUE and FALSE , which can be shortened to T and F (case matters, you have to use upper-case letters in both versions).
c(TRUE, FALSE, TRUE)

[1] TRUE FALSE TRUE


c(F, T, T, T)

[1] FALSE TRUE TRUE TRUE


Boolean values are the result of logical operations, that is, of statements that
can be either true or false:
3 < 4

[1] TRUE
The most common logical operators we will use are the following:

• AND &
• OR |
• NOT !
• greater than >
• greater or equal >=
• less than <
• less or equal <=
• equal to == (yes, you need two equal signs)
• not equal to !=
The comparison operators can be used with numbers like this:
3 < 1

[1] FALSE
5 > 2

[1] TRUE
5 == 5

[1] TRUE
5 != 5

[1] FALSE
The first three operators (AND, OR, NOT) can be used to link boolean values:
TRUE & FALSE

[1] FALSE
TRUE | FALSE

[1] TRUE
!FALSE

[1] TRUE
You can also create more complex expressions, using () to group statements:
((1+2)==(5-2)) & (7<9)

[1] TRUE

1.6.2 Subsetting vectors


Every element of a vector can be accessed individually by referencing its position
(i.e. its index) in the vector. You can for example retrieve the fourth element of
my_vector like this:

my_vector[4]

[1] 6
It is also possible to select more than one element of the vector by using an
integer vector of the desired indices (e.g. c(1,4,5) if you want to retrieve the
first, fourth and fifth element of a vector) within the square brackets:
my_vector[c(1, 4, 5)]

[1] 8 6 2
We call this subsetting your vector. For subsetting vectors we often need longer sequences of integers. To generate a sequence of consecutive integers, R has the <start>:<end> operator, which is read as from <start> to <end>:
3:10 #generates sequence from 3 to 10

[1] 3 4 5 6 7 8 9 10
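You can combine this with subsetting; for example (a small illustration using my_vector from above):
my_vector[2:5] #elements two to five

[1] 2 4 6 2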

1.6.3 Data frames


If you want to do statistics, the most likely format your data will come in is
some kind of table. In R, the basic form of a table is called a data.frame and
looks like this:

name  height  gender  age
John  185.2   male    25
Max   175.8   male    32
Susi  155.1   female  27
Anna  162.7   female  24

Usually every row is an observation (e.g. an individual or a measurement point) and each column is a variable on which the observation is measured (e.g. age, gender etc.). For learning purposes, R has some built-in data frames, one of which is the data.frame iris (Fisher, 1936). You can have a look at a data.frame like this (careful, that's a capital V in View() !), though for really large data sets, not all of the rows might be displayed:
View(iris)

Table 1.2: The first 10 rows of the iris data set.

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2  setosa
         4.9          3.0           1.4          0.2  setosa
         4.7          3.2           1.3          0.2  setosa
         4.6          3.1           1.5          0.2  setosa
         5.0          3.6           1.4          0.2  setosa
         5.4          3.9           1.7          0.4  setosa
         4.6          3.4           1.4          0.3  setosa
         5.0          3.4           1.5          0.2  setosa
         4.4          2.9           1.4          0.2  setosa
         4.9          3.1           1.5          0.1  setosa

The data set iris gives measurements of sepal and petal lengths and widths
of 150 flowers from three different species of iris. You can extract each of the
columns with a $ :
iris$Sepal.Length

[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
As you can see, iris$Sepal.Length is just a numeric vector! Consequently
you can do calculations on these vectors, e.g. compute the mean sepal length of
the flowers:
mean(iris$Sepal.Length)

[1] 5.843333
Basically, a data.frame in R is a number of vectors of the same length that have been stuck together columnwise to build a table. Within each column all values must have the same format, but different formats can be assigned to different columns. In this example, columns 1 to 4 contain numbers and column 5 contains strings.
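For completeness, a data.frame like the example table above could be built from vectors using the data.frame() function. This is just a sketch (the name patients is made up for this illustration):
patients <- data.frame(name = c("John", "Max", "Susi", "Anna"),
                       height = c(185.2, 175.8, 155.1, 162.7),
                       gender = c("male", "male", "female", "female"),
                       age = c(25, 32, 27, 24))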

1.6.4 Lists
While data.frames are useful to bundle together vectors of the same length, lists
are used to combine more heterogeneous data. The following block of code
creates a list:
#create list
my.list <- list(my_vector, long, iris[1:10,])
#print list
my.list

[[1]]
[1] 8 2 4 6 2 1

[[2]]
[1] 1 2 3 4

[[3]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
A list is a collection of R objects that are called the elements of the list. Lists
are similar to data.frames, but while data.frames can only have vectors of the
same length as their elements (i.e. the variables), lists can have all kinds of data
types as elements. An element of a list can be a vector of arbitrary length,
a data.frame, another list or even a function. The list we have just created
contains two vectors of different lengths and a data.frame containing the first
ten rows of the iris data set. You can access a single list element by referencing
its position in the list using double square brackets [[]] :
my.list[[1]] #result is a vector

[1] 8 2 4 6 2 1
my.list[[3]] #result is a data.frame

Sepal.Length Sepal.Width Petal.Length Petal.Width Species


1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
If you want to subset the list (i.e. keep only certain parts), use single square
brackets [] :

my.list[2:3] #result is a list

[[1]]
[1] 1 2 3 4

[[2]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa

Note that if you use single square brackets [] , the result will always be a
list, whereas using double square brackets [[]] will return whatever type the
object is that you are referencing with [[]] .

my.list is an unnamed list, but it is also possible to create a named list:


#create list
my.named.list <- list(a=my_vector, b=long, c=iris[1:10,])
#print list
my.named.list

$a
[1] 8 2 4 6 2 1

$b
[1] 1 2 3 4

$c
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa

10 4.9 3.1 1.5 0.1 setosa


The advantage of a named list is that you can extract the list elements by their
names, similar to extracting variables from a data.frame:
my.named.list$a

[1] 8 2 4 6 2 1
my.named.list$b

[1] 1 2 3 4
The square brackets [] and [[]] do, however, also work on named lists. Because lists can bundle a lot of heterogeneous data in one R object, they are quite often used to return the results of functions for statistical analyses, as you will see later on.

1.6.5 Determining the class of an object


You can find out what type of data structure an object is with the
class() function:
class(my.list)

[1] "list"
class(iris)

[1] "data.frame"
class(iris$Sepal.Length)

[1] "numeric"
class(my_vector)

[1] "numeric"
class(c(TRUE,FALSE, FALSE))

[1] "logical"

1.6.6 Investigating the structure of an object


When you work with more complex objects, it can sometimes be useful to get an overview of their structure with str() :
str(my.list)

List of 3
$ : num [1:6] 8 2 4 6 2 1
$ : num [1:4] 1 2 3 4
$ :'data.frame': 10 obs. of 5 variables:
..$ Sepal.Length: num [1:10] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9
..$ Sepal.Width : num [1:10] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1
..$ Petal.Length: num [1:10] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5
..$ Petal.Width : num [1:10] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1
..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1
This tells you that my.list is a list with 3 elements. The first two are numeric
vectors, the third is a data.frame. As you can see the columns of the data.frame
are also shown in part, but truncated so that they don’t clutter the console too
much.

1.7 Troubleshooting
1. The code I’m sending to the console appears but doesn’t seem to be executed: Check whether the last line of the console shows the prompt > . This means R is ready to receive new commands. If there is no > but a + instead, you probably forgot to close a bracket some lines before and R is waiting for the closing bracket. Just hit Esc to interrupt the current command and try again. Make sure the number of opening brackets matches the number of closing brackets. If you are in RStudio and there is no + , but a little red stop sign in the upper right corner of the console, R is still working on the computation. If it doesn’t go away after a few moments, but you know your computation shouldn’t take this long, click the stop sign to terminate the current computation and try to find out why the computation you started won’t finish.
2. Error: object ‘x’ not found: You probably forgot to define x ( x being a stand-in for the variable name in your error message) and it doesn’t show up in the environment on the upper right. Run the assignment for x and try again.
Chapter 2

Data exploration

2.1 Functions
Besides the data structures you have learned about in the last chapter, there is
another important concept you need to learn about when using R: the function.
In principle, you can imagine a function as a little machine that takes some input
(usually some kind of data), processes that input in a certain way and gives
back the result as output. There are functions for almost every computation
and statistical test you might want to do, there are functions to read and write
data, to shape and manipulate it, to produce plots and even to write books
(this document is written completely in R)! The function mean() for example
takes a numeric vector as input and computes the mean of the numbers in the
numeric vector:
mean(c(2, 4, 6))

[1] 4

2.1.1 Structure of a function


The information that goes into the function is called an argument, the output is
called the result. A function can have an arbitrary number of arguments, which
are named to tell them apart. The function log() for example takes two arguments: a numeric vector x with the numbers you want to take the logarithm of, and a single number base with respect to which the logarithm should be taken:
log(x = c(10, 20, 30), base = 10)

[1] 1.000000 1.301030 1.477121


2.1.2 How to use a function


To find out how a function is used (i.e. what arguments it takes and what kind of result it returns) you can use R's help. Just put ? in front of the function name (without brackets after the function name). If you run this code, the help page appears in the lower right window in RStudio.
?log

As you can see the help page gives information about a couple of functions,
one of which is log() . Besides the description of the arguments you should
have a look at the information under Usage. Here you can see that the default
value for base is exp(1) (which is approximately 2.72, i.e. Euler's number),
whereas there is no default value for x . All arguments that appear solely with
their name but without a default value (like x in this case) under Usage are
mandatory when you call the function. Not providing these arguments will
throw an error. All arguments that have a default value given (like base in
this case) can be omitted, in which case R assumes the default value for that
argument:
log(x = c(20,30,40)) #argument base can be omitted

[1] 2.995732 3.401197 3.688879


log(base=3) #argument x can not be omitted

Error in eval(expr, envir, enclos): Argument "x" fehlt (ohne Standardwert)

(Translates to Error: Argument "x" is missing (without a default value) )


If you omit the names of the arguments in the function call, R will match the objects to the arguments by their position:
log(c(10, 20, 30), 10)

[1] 1.000000 1.301030 1.477121

2.2 Packages
A basic set of functions is already included in base R, i.e. the software you downloaded when installing R. But since there is a huge community worldwide constantly developing new functions and features for R, and since the entirety of all existing R functions is way too big to install at once, most of the functions are bundled into so-called packages. A package is a bundle of functions you can download and install from the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/). If you visit the site, you can also get an overview of all available packages. You can install a package by using the function install.packages() , which takes the package name as a string (i.e. in quotes) as its argument:
install.packages("lubridate")

If you run this line of code, R goes online and downloads the package lubridate (Grolemund and Wickham, 2011), which contains a number of useful functions for dealing with dates. This operation has to be done only once, so it is one of the rare cases where it makes sense to copy the code directly into the console. If you write it in your script window, it is advisable to comment out the code with a # after you've run it once to avoid unnecessarily running it again if you rerun the rest of your script.
Once you have installed the package, its functions are downloaded to your computer but are not accessible yet, because the package has to be activated first. If you try to call a function from a package that is not activated yet (e.g. the function today() from lubridate ), you get this error:
today()

Error in today(): konnte Funktion "today" nicht finden


(Translates to Error in today(): Could not find function "today" ).

To activate the package, you use the function library() . This function activates the package for your current R session, so you have to do this once per session (a session starts when you open R/RStudio and ends when you close the window).
library(lubridate)
today()

[1] "2021-04-21"
As you can see the function today() is an example of a function that doesn’t
need any argument. Nevertheless you have to call it with brackets () to
indicate that you’re calling a function rather than a variable called today .
Most packages print some sort of information into the console when they are
loaded with library() . Don’t be alarmed by the red color - all of R’s messages,
warnings and errors are printed in red. The only messages you should be worried
about for this course are the ones starting with Error in: , the rest can be
safely ignored for now. However, warning messages can be informative if they
appear.

2.3 Loading data into R


Most of the time you will want to use R on your own data so the first thing
you’ll usually do when starting an analysis is to load your data into R. There
are generally two ways of loading data into R: either your data is available in
an R-format such as an .RData file, or your data comes in some non-R format.
The former mostly happens when you saved your R workspace for later, the
latter will be needed more frequently. For almost every data format there is
a dedicated importing function in R. We’ll start by showing you how to read
in non-R formats and then show you how to save and load your R workspace.
For demonstration purposes, we have provided the NINDS data set (Marler
et al., 1995) you have encountered in the lecture in two formats: NINDS.csv
and NINDS.xlsx.

2.3.1 The working directory


Before we can show you how to import data, you have to get to know another
important concept: the working directory. The working directory is basically
the folder on your computer where R looks for files to import and where R
will create files if you save something. To find out what the current working
directory is, use:
getwd() #no arguments needed

R should now print the path to your current working directory to the console.
To change it you can use RStudio. Click Session > Set Working Directory >
Choose Directory… in the toolbar in the upper left of your window. You can
then navigate to the folder of your choice and click Open.
Now you will see that R prints setwd("<Path to the chosen directory>")
to the console. This shows you how you can set your working directory without
clicking: you use the function setwd() and put the correct path in it. Note
that R uses / to separate folders, unlike Windows, which uses \.
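Such a call could look like this (the path is a made-up example for a Windows machine; note the forward slashes):
setwd("C:/Users/yourname/Documents/R-course") #made-up path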

Check if it worked by rerunning getwd() . You should now put the data files
NINDS.csv and NINDS.xlsx in the folder you have chosen as your working di-
rectory.

2.3.2 Reading data


The comma-separated values (.csv) format is probably the format most widely
used in the open-source community. Csv-files can be read into R using the
read.csv() function, which expects a csv-file using , as a separator and .
as a decimal point. If you happen to have a German file that uses ; and ,
instead, you have to use read.csv2() . Here, we will however use the standard
csv format:
read.csv("NINDS.csv")

We haven’t printed the result in this document because it is too long, but if you
execute the code yourself you can see that the read.csv() function prints the
entire data set (possibly truncated) into the console. If you want to work with
the data, it makes sense to store it in a variable:
NINDS_csv <- read.csv("NINDS.csv")

You can now see the data.frame in the environment window. To show you at least one other importing function, we have provided the exact same data set as an Excel file. To read this file, you need to install a package with functions for Excel files first, for example the package openxlsx (Schauberger and Walker, 2020):
install.packages("openxlsx") #only do this once

library(openxlsx)
NINDS_xlsx <- read.xlsx("NINDS.xlsx")

If you have another kind of file, just google read R and your file type and you will most likely find an R package for just that.
You should now be able to see NINDS_xlsx and NINDS_csv , two identical data.frames, in your environment. Since they are identical we will work with NINDS_csv from here on.

2.3.3 Saving and loading R data


Sometimes you have worked on some data and want to be able to use your R objects in a later session. In this case, you can save your workspace (the objects listed under Environment) using save() or save.image() . save() takes the names of the objects you want to save and a name for the file they are saved in. save.image() just saves all of the R objects in your workspace, so you only have to provide the file name:
save(NINDS_csv, file="NINDS.RData") #saves only NINDS_csv
save.image(file="my_workspace.RData") #saves entire workspace

When you now open a new R session and want to pick up where you left, you
can load the data with load() :
load("my_workspace.RData")

If you want to save a data.frame in some non-R format, almost every read function has a corresponding write function. The most versatile is write.table() , which will write a text-based format, like a tab-separated file or a csv, depending on what you supply in the sep argument.
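As a sketch (the output file name is made up for this example), exporting NINDS_csv as a semicolon-separated text file could look like this:
write.table(NINDS_csv, file="NINDS_export.txt", sep=";", row.names=FALSE)
Setting row.names=FALSE prevents R from writing the row numbers as an extra column.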

2.4 Descriptive analysis


Now that we've loaded data into R, let's start with some actual statistics. To get an overview of your data.frame, you can first have a look at it using the View() function:
View(NINDS_csv)

The data set contains the following variables:


• record A registration number
• AGE Age in years
• BGENDER Patient’s gender
• TDX Presumptive Diagnosis at time of treatment
• BDIAB History of diabetes at baseline
• BNINETY Barthel index at 90 days
• BHYPER History of hypertension at baseline
• BRACE Patient’s race
• DCENSOR Indicates if patient died during trial follow-up
• GOS6M Glasgow at 6 months
• HOUR2 NIH stroke scale at 2 hours
• HOUR24 NIH stroke scale at 24 hours
• SEVENTEN NIH stroke scale at 7-10 days
• NINETY NIH stroke scale at 90 days
• NIHSSB NIH stroke scale at baseline
• PART Designation of Trial (Part 1 or 2)
• SURDAYS Days from randomization to death/censored
• TREATCD Treatment code
• TWEIGHT Estimated weight at randomization
• WEIGHT Actual weight measured after randomization
• STATUS24 Dichotomised NIH stroke scale at 24 hours

2.4.1 Data types in data.frames


In the lecture you have learned about different data types that variables can
have. Here is an overview over the data types R uses to represent these variable
types:
Variable type   R data type                  Example variable in NINDS_csv
Metric          Numeric/Integer              WEIGHT/SURDAYS
Ordinal         Integer/Ordered factor       GOS6M
Nominal         Unordered factor/Character   TREATCD
You are already familiar with the numeric and the character. The integer is a
special case of a numeric that only contains whole numbers. The factor on the
other hand is a data type that is used to represent categorical variables. It is
similar to a character but has only a limited number of values, the so-called
factor levels. The patient’s name would for example be a character because
there is a potentially infinite number of values this variable could take. The
variable TREATCD on the other hand can be used as a factor, because the only
two values it can take in our data set are Placebo and t-PA .

TREATCD would be an unordered factor, that is, a nominal variable. To represent


ordinal variables, you can use the ordered factor that implies an ordering of the
factor levels.
You can get an overview of the variable types in the NINDS_csv data.frame by clicking the little blue icon with the triangle next to the NINDS_csv object in the environment window in the upper right corner of RStudio.
When you read the data into R without specifying the data type of every column,
R will try to guess them, usually ending up with numeric or integer for all
variables containing only numbers and character for variables containing letters
or other symbols.
If you want to specify the classes for the read-in process, you can usually pass the argument colClasses with a character vector containing the types for all of your variables to the reading function. Because that can be quite a long vector when you have a lot of variables, it is often easier to just let R guess the types and correct them later if necessary.
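As a sketch, assuming a hypothetical file with three columns, such a call could look like this:
mydata <- read.csv("small.csv", colClasses=c("character", "numeric", "factor"))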

TREATCD for example has been read in as a character as you can see using the
class() function:
class(NINDS_csv$TREATCD)

[1] "character"

You can turn it into a factor like this:


NINDS_csv$TREATCD <- factor(NINDS_csv$TREATCD)

The factor levels (i.e. the values your newly built factor variable can take) can be extracted with levels() :
levels(NINDS_csv$TREATCD)

[1] "Placebo" "t-PA"

Similarly, one could argue that GOS6M should be an ordered factor with Good
< Mod. Dis < Sev. Dis < Veget < Dead , but currently it is represented
as character.

We can fix that like this:


NINDS_csv$GOS6M <- factor(NINDS_csv$GOS6M, ordered = TRUE,
levels = c("Good", "Mod. Dis",
"Sev. Dis", "Veget", "Dead"))

factor() creates a factor variable from a character vector or an existing factor, ordered=TRUE tells the function to make the factor ordered, and the levels= argument specifies the correct order of the levels. With the assignment operator <- we overwrite the old versions of GOS6M and TREATCD with the new versions. The line breaks are just there to make the code more readable. They don't change the functionality; just make sure to mark all lines when executing the code.

If you know beforehand that most of the character variables in your data.frame
should actually be factors, you can specify this when reading the data in using
the argument stringsAsFactors = TRUE :
NINDS_csv <- read.csv("NINDS.csv", stringsAsFactors = TRUE)

Have a look at how the description of the data.frame in the environment window
changes after running this line of code!

2.4.2 Missing values


When scrolling through your data, you might have noticed that some cells contain NA as a value. NA stands for Not Available and is the value R uses to represent missing values. If you have read in your data from other formats, you might have to check how missing values were coded there and give that information to the read-in function to make sure they are turned into NA .
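For read.csv() this is done with the na.strings argument. A sketch (the file name and the missing-value codes are made up for this illustration):
mydata <- read.csv("myfile.csv", na.strings=c("", "-99")) #treat empty cells and -99 as NA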

In most computations you’ll have to tell R explicitly how to deal with these
values (e.g. remove them before computation), else you’ll get NA as a result.
For example, if you want to compute the mean of the variable TWEIGHT , which
contains missing values, you have to set the argument na.rm=TRUE , where
na.rm stands for NA remove:
#default for na.rm is FALSE so NA's are not removed
mean(NINDS_csv$TWEIGHT)

[1] NA
#this way, NA's are removed before computation
mean(NINDS_csv$TWEIGHT, na.rm=TRUE)

[1] 78.37432

2.4.3 Numerical description


There are functions for many descriptive measures you might want to compute. Most of the time the function name gives a good clue about what the function does. We'll go through the most common ones in the following paragraphs.

Measures of central tendency


Measures of central tendency tell us where the majority of values in the distribution are located. Let's compute the mean, median and all quartiles of the AGE variable:
mean(NINDS_csv$AGE) #mean

[1] 66.94177
median(NINDS_csv$AGE) #median

[1] 68.69141
quantile(NINDS_csv$AGE) #gives all quartiles, min/max

      0%      25%      50%      75%     100%
26.48927 60.12106 68.69141 75.37464 89.00000

Measures of dispersion
Measures of dispersion describe the spread of the values around a central value.
Here you can see how to compute the variance, the standard deviation and
the range of a variable. To get the interquartile range just pick the 25- and
75-percentile from the quantile() function above!
var(NINDS_csv$AGE) #variance

[1] 135.7818
sd(NINDS_csv$AGE) #standard deviation

[1] 11.65254
range(NINDS_csv$AGE) #range

[1] 26.48927 89.00000


min(NINDS_csv$AGE) #minimum

[1] 26.48927
max(NINDS_csv$AGE) #maximum

[1] 89
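Following the hint above, the interquartile range of AGE can be computed from the output of quantile() (a small illustration; indices 2 and 4 pick the 25% and 75% quantiles):
q <- quantile(NINDS_csv$AGE)
q[4] - q[2] #75%-quantile minus 25%-quantile

     75%
15.25358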

Measures of association
Measures of association describe the relationship between two or more variables.
In this case there is more than one way to deal with missing values, so instead of
the na.rm argument, here we have the argument use= to specify which values
to use in case of missing values. For the simple case of looking at correlations
between two variables, you can set this argument to use = "complete.obs"
which means that only cases without missing values go into the computation. If
an observation (i.e. a patient) has NA on at least one of the two variables, this
observation is excluded from the computation.
We can compute the Pearson and the Spearman correlation of the actual and
the estimated weight of the NINDS patients using the cor() function:
cor(NINDS_csv$WEIGHT, NINDS_csv$TWEIGHT, use = "complete.obs",
method = "pearson")

[1] 0.9313269
cor(NINDS_csv$WEIGHT, NINDS_csv$TWEIGHT, use = "complete.obs",
method = "spearman")

[1] 0.9362007
For the association of categorical variables, you’ll mostly want to look at the
frequency tables of the categories. A frequency table for a single variable is
produced like this:


table(NINDS_csv$TREATCD) #gives absolute frequencies

Placebo t-PA
312 312
But you can also use table() to generate cross tables for two variables:
table(NINDS_csv$BRACE, NINDS_csv$BGENDER)

female male
Asian 5 3
Black 75 94
Hispanic 16 21
Other 1 6
White, non-Hispanic 165 238
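As a small addition (prop.table() is a base R function not otherwise used in this script), relative frequencies can be obtained by wrapping such a table:
prop.table(table(NINDS_csv$TREATCD)) #gives relative frequencies

Placebo    t-PA
    0.5     0.5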

General overview
If you want to get an overview of your entire data.frame, the summary() function is convenient. This function can be used for a lot of different kinds of R objects and gives a summary appropriate for whatever the input is. If you give it a data.frame, summary() will give the minimal and maximal value, the 1st and 3rd quartile, and the mean and median for every quantitative (i.e. numeric/integer) variable, and a frequency table for every factor, as well as the number of missing values:
summary(NINDS_csv)

record AGE BGENDER TDX
Min. : 33656 Min. :26.49 female:262 Cardioembolic :273
1st Qu.:2689566 1st Qu.:60.12 male :362 Large vessel occlusive:252
Median :4968180 Median :68.69 Other : 18
Mean :4980392 Mean :66.94 Small vessel occlusive: 81
3rd Qu.:7276841 3rd Qu.:75.37
Max. :9987047 Max. :89.00

BDIAB BNINETY BHYPER BRACE DCENSOR
No :489 Min. : 0.00 No :209 Asian : 8 No :456
Yes :131 1st Qu.: 10.00 Yes :408 Black :169 Yes:168
NA's: 4 Median : 85.00 NA's: 7 Hispanic : 37
Mean : 62.59 Other : 7
3rd Qu.:100.00 White, non-Hispanic:403
Max. :100.00

GOS6M HOUR2 HOUR24 SEVENTEN NINETY
Dead :153 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
Good :231 1st Qu.: 6.00 1st Qu.: 4.00 1st Qu.: 2.00 1st Qu.: 1.00
Mod. Dis:133 Median :12.00 Median :11.00 Median : 8.00 Median : 6.00
Sev. Dis:105 Mean :12.76 Mean :12.33 Mean :12.08 Mean :12.85
Veget : 2 3rd Qu.:18.00 3rd Qu.:19.00 3rd Qu.:18.00 3rd Qu.:18.00
Max. :38.00 Max. :42.00 Max. :42.00 Max. :42.00
NA's :1 NA's :1 NA's :9

NIHSSB PART SURDAYS TREATCD
Min. : 1.00 Min. :1.000 Min. : 0.0 Placebo:312
1st Qu.: 9.00 1st Qu.:1.000 1st Qu.: 242.8 t-PA :312
Median :14.00 Median :2.000 Median : 366.0
Mean :14.79 Mean :1.534 Mean : 359.3
3rd Qu.:20.00 3rd Qu.:2.000 3rd Qu.: 378.0
Max. :37.00 Max. :2.000 Max. :1970.0

TWEIGHT WEIGHT STATUS24
Min. : 39.00 Min. : 41.18 high:300
1st Qu.: 68.15 1st Qu.: 65.88 low :283
Median : 78.00 Median : 77.36 NA's: 41
Mean : 78.37 Mean : 78.07
3rd Qu.: 86.40 3rd Qu.: 88.00
Max. :179.50 Max. :179.59
NA's :1

2.4.4 Graphical description


For exploration it is also useful to plot the data. R has a number of options for
plotting, ranging from simple plots which are quickly made to more elaborate
functions and packages that allow you to produce more complex, publication
ready plots with just a little more effort.

For this chapter we'll start with the quick and easy ones and go from the most broadly applicable plots that can be used for all types of data to the more specialized ones that can be used only for data types with a certain scaling.

We’ll start by introducing the basic plot functions without any customization
of labels or axes to give you an overview. When you create plots you want to
share, you should of course improve them as, e.g., shown in the last paragraph
of the chapter.

Barplot
A barplot can technically be used on every variable with a finite set of values.
The barplot() function takes a frequency table and produces a barplot from
it.

barplot(table(NINDS_csv$BRACE))
[Barplot of the absolute frequencies of BRACE: Asian, Black, Hispanic, Other, White non-Hispanic]

If you give it a crosstable, you get stacked barplots:


barplot(table(NINDS_csv$BGENDER, NINDS_csv$BRACE))
[Stacked barplot of BRACE frequencies divided by BGENDER]

Histogram

If you have a metric variable, you can also use the histogram:
hist(NINDS_csv$AGE)
[Histogram of NINDS_csv$AGE: Frequency over age from 30 to 90 years]

Boxplot

If you have data that is at least ordinal you can use the boxplot() function:
boxplot(NINDS_csv$SURDAYS)
[Boxplot of NINDS_csv$SURDAYS]

You can also split the boxplot by another (categorical) variable using the ~
sign:
boxplot(NINDS_csv$SURDAYS ~ NINDS_csv$BGENDER)
[Boxplots of NINDS_csv$SURDAYS split by NINDS_csv$BGENDER (female/male)]

Scatterplot
And we can use scatterplots to get an idea about the relationship of two metric
variables:
plot(NINDS_csv$TWEIGHT, NINDS_csv$WEIGHT)
[Scatterplot of NINDS_csv$TWEIGHT against NINDS_csv$WEIGHT]

Customisation
Even these basic plots come with a whole lot of customisation options. We'll show you a couple of them using the histogram as an example. You can find out about all possible options by going to the help page of the respective function (e.g. ?hist ).

#change the number of breaks
hist(NINDS_csv$AGE, breaks = 5)

[Histogram of NINDS_csv$AGE with 5 breaks]

#add a customized x axis label and a title
hist(NINDS_csv$AGE, xlab = "Age of subjects in years", main = "My Title")
[Histogram of NINDS_csv$AGE titled "My Title" with x axis label "Age of subjects in years"]

#change the color
hist(NINDS_csv$AGE, col = "blue")
[Histogram of NINDS_csv$AGE with blue bars]
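These options can of course be combined in a single call (a simple illustration; the title text is made up):
hist(NINDS_csv$AGE, breaks = 10, col = "blue",
     xlab = "Age of subjects in years", main = "Age of the NINDS patients")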

All the plotting functions we have just shown you are useful because they are easy to use. In a later chapter we will introduce the package ggplot2 (Wickham et al., 2020), which allows you to produce more complex displays like this one:
[ggplot2 example: NIH stroke scale at 2 hours (y-axis) plotted against NIH stroke scale at 24 hours (x-axis), in separate panels for Placebo and t-PA, colored by Glasgow at six months (Dead, Good, Mod. Dis, Sev. Dis, Veget)]

2.5 Troubleshooting
1. Error in file(file, “rt”) : cannot open the connection […] No such file or
directory : The file you are trying to open probably doesn’t exist. Check
if you spelled the file name correctly. Also check if the working directory
actually contains the file you are trying to read.
2. Error in library(“xy”) : there is no package called ‘xy’ : You either mis-
spelled the package name or you haven’t installed the package yet.
3. Error in install.packages: object ‘xy’ not found : Have you forgotten to
put quotation marks around the package name?
4. Error in install.packages: package ‘xy’ is not available (for R version
x.x.x): Either you misspelled the package name or the package does not
exist, or it does not exist for your R version.
5. Error in plot.new() : figure margins too large : The plot window in the lower right corner of RStudio is too small to display the plot. Make it bigger by dragging the left margin further to the left and rerun the plotting function.
Chapter 3

Data manipulation & Inference I

In this chapter we will learn some useful tools for data manipulation and then
go on to our first inferential statistics.

3.1 Data manipulation


In the previous chapter we have already done some exploratory analysis on our
data but have skipped an important step that often comes before: In many cases
we don’t have one tidy data set containing only the variables we are interested
in. Often our data is scattered over multiple data sets and contains not only
the variables or cases we are interested in, but a lot of unnecessary information.
The tidyverse (Wickham et al., 2019) is an R package that contains a lot
of useful functions to deal with these problems, so we’ll start by installing and
loading this package:
install.packages("tidyverse") #only do this once

library(tidyverse)

You can ignore the messages that are printed into the console upon loading the
package.

3.1.1 Tibbles
We have provided you with two data sets to try out the data manipulation functions, data1.csv and data2.csv. Make sure the files are stored in your working directory folder and then read them in with:


d1 <- read_csv("data1.csv")

-- Column specification --------------------------------------------------------
cols(
  weight = col_double(),
  age = col_double(),
  id = col_double()
)
d2 <- read_csv("data2.csv")

-- Column specification --------------------------------------------------------
cols(
  height = col_double(),
  eyecolor = col_character(),
  id = col_double()
)
Please make sure to use read_csv() and not read.csv() . While both are used to read csv-files into R, the former is a special tidyverse function that gives your data.frame a couple of nice extra features. The first thing you'll notice is the information given about the column specification while reading in the data. Here you can already check whether the data type R chose for each column makes sense to you. double is a form of numeric vector, so in principle everything looks fine apart from the variable id , which, even though it contains numbers, shouldn't be a numeric/double because you cannot really do computations with ID numbers. You can fix this issue with the following lines:
d1$id <- as.character(d1$id)
d2$id <- as.character(d2$id)

#check type
class(d1$id)

[1] "character"
class (d2$id)

[1] "character"
What you did there was convert the variable d1$id from a numeric to a character and then store this new version in place of the old version of d1$id .

If you look at the type of d1 using class() , you can see that it is more than
just a data.frame, it is for example also a tbl which is short for tibble.
class(d1)

[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"


Basically a tibble can be used for everything a data.frame can be used for, but it has some nice additional properties; for example, tibbles look nicer when printed to the console.
d1

# A tibble: 5 x 3
weight age id
<dbl> <dbl> <chr>
1 85 34 1
2 56 72 2
3 73 33 3
4 76 45 4
5 60 23 5

3.1.2 Select and Filter


The first two functions for data manipulation are select() , which allows you
to keep only certain variables (i.e. columns) and filter() , which allows you
to keep only certain rows.
As its first argument, select() takes a data.frame or tibble and as its second
argument the names of the variables you want to keep. Although it is technically
not needed, we recommend bundling together the variable names in a vector:
select(d1, c(id, age))

# A tibble: 5 x 2
id age
<chr> <dbl>
1 1 34
2 2 72
3 3 33
4 4 45
5 5 23
It is also possible to specify the variables you want to throw out instead, by
putting a - before their names:
select(d1, -c(weight, id))

# A tibble: 5 x 1
age
<dbl>
1 34
2 72
3 33
4 45
5 23

When you use filter() to choose only certain rows, you'll mostly have some kind of rule for which cases to keep. These rules are expressed as logical statements (see Chapter 1). For example the statement age > 40 would select all cases older than 40. You can also connect multiple conditions: (age > 30) & (weight > 60) & (weight < 75) for example selects all cases that are older than 30 and weigh between 60kg and 75kg. filter() takes a tibble as its first and a logical expression as its second argument:
filter(d1, (age > 30) & (weight > 60) & (weight < 75))

# A tibble: 1 x 3
weight age id
<dbl> <dbl> <chr>
1 73 33 3

It is also possible to select parts of your tibble or data.frame by simply listing the indices of the rows and columns you want to keep. This works similarly to subsetting a vector, using [<rows to keep>, <columns to keep>] :
d1[c(1:3), c(1,3)] #keep rows 1 to 3 and columns 1 and 3

# A tibble: 3 x 2
weight id
<dbl> <chr>
1 85 1
2 56 2
3 73 3

If you want to keep all columns or all rows, you leave the corresponding element
in the [,] empty. You nevertheless have to keep the comma!
d1[1,] #only keep first row

# A tibble: 1 x 3
weight age id
<dbl> <dbl> <chr>
1 85 34 1
d1[,3] #only keep third column

# A tibble: 5 x 1
id
<chr>
1 1
2 2
3 3
4 4
5 5

3.1.3 Join
Another useful operation is the join, which allows you to combine two data sets by a common key variable. d1 contains the weight and age of the subjects, while d2 contains their height and eye color. Let's try to compute their body mass index. To do this, we need to join the two data sets because we need the weight and height information in one place. If you look at the IDs of the subjects you'll notice that we cannot simply paste together those two tibbles, because firstly the rows don't have the right order and secondly each tibble contains one person that the other doesn't (id 5 and 6):
d1

# A tibble: 5 x 3
weight age id
<dbl> <dbl> <chr>
1 85 34 1
2 56 72 2
3 73 33 3
4 76 45 4
5 60 23 5
d2

# A tibble: 5 x 3
height eyecolor id
<dbl> <chr> <chr>
1 156 brown 2
2 164 blue 1
3 189 brown 4
4 178 green 3
5 169 blue 6
There are several ways to deal with this.

Inner join
The inner join only keeps cases (i.e. rows), that appear in both data sets. The
function inner_join() takes two data.frames or tibbles and a string giving
the name of the key variable that defines which rows belong together:
inner_join(d1, d2, by="id")

# A tibble: 4 x 5
weight age id height eyecolor
<dbl> <dbl> <chr> <dbl> <chr>
1 85 34 1 164 blue
2 56 72 2 156 brown
3 73 33 3 178 green

4 76 45 4 189 brown
As you can see the cases 5 and 6 that only appeared in one of the data
sets are left out of the result. If the key variable has different names in the
two data.frames, e.g. id and sno , you specify the by argument like this:
by = c("id" = "sno") .

Full Join
The opposite of the inner join is the full join. full_join() takes the same
arguments as inner_join() but returns all cases. If a case doesn’t appear in
the other data set, the missing information is indicated with NA :
full_join(d1,d2,by="id")

# A tibble: 6 x 5
weight age id height eyecolor
<dbl> <dbl> <chr> <dbl> <chr>
1 85 34 1 164 blue
2 56 72 2 156 brown
3 73 33 3 178 green
4 76 45 4 189 brown
5 60 23 5 NA <NA>
6 NA NA 6 169 blue

Right and left join


The right and left join take one of the data sets fully and join only the rows
from the other data set that fit this data set. That is, the right join takes the
full right data set from the function call and attaches all fitting rows from the
left data set, whereas the left join takes the full left data set from the function
call and attaches all fitting rows from the right data set:
left_join(d1,d2, by="id") #all cases from d1 are kept

# A tibble: 5 x 5
weight age id height eyecolor
<dbl> <dbl> <chr> <dbl> <chr>
1 85 34 1 164 blue
2 56 72 2 156 brown
3 73 33 3 178 green
4 76 45 4 189 brown
5 60 23 5 NA <NA>
right_join(d1,d2, by="id") #all cases from d2 are kept

# A tibble: 5 x 5
weight age id height eyecolor
<dbl> <dbl> <chr> <dbl> <chr>
1 85 34 1 164 blue
2 56 72 2 156 brown
3 73 33 3 178 green
4 76 45 4 189 brown
5 NA NA 6 169 blue

3.1.4 Creating new variables


Now that we have learned to join two data sets, let's do a right join of d1 and d2 and use the resulting data set to compute the subjects' BMI. First we save the joined data in a new variable d :
d <- right_join(d1,d2,by="id")

Then we compute the BMI according to the formula BMI = weight[kg] / height[m]^2. Since our data gives the height in cm instead of meters, we have to divide this number by 100:
d$weight/(d$height/100)^2

[1] 31.60321 23.01118 23.04002 21.27600 NA

As you can see there's an NA for the last case, because we have no weight information on this person. If you want to save the BMI for further analysis, it makes sense to save it as a new variable in your tibble/data.frame. To create a new variable in an existing data.frame, you write its name behind a $ and use the assignment operator like this:
d$BMI <- d$weight/(d$height/100)^2

You can now see that your data set d has BMI as a variable:
d

# A tibble: 5 x 6
weight age id height eyecolor BMI
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 85 34 1 164 blue 31.6
2 56 72 2 156 brown 23.0
3 73 33 3 178 green 23.0
4 76 45 4 189 brown 21.3
5 NA NA 6 169 blue NA

You can also create variables using logical operators. Suppose you want to create an indicator variable blueEyes that takes the value 1 when a person has blue eyes and the value 0 otherwise. First, you create a variable that takes the value 0 for everyone:

d$blueEyes <- 0

# A tibble: 5 x 7
weight age id height eyecolor BMI blueEyes
<dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 85 34 1 164 blue 31.6 0
2 56 72 2 156 brown 23.0 0
3 73 33 3 178 green 23.0 0
4 76 45 4 189 brown 21.3 0
5 NA NA 6 169 blue NA 0
Now you have to set the variable 1 for every blue eyed person. You can create
a vector telling you the blue eyed persons like this:
d$eyecolor == "blue"

[1] TRUE FALSE FALSE FALSE TRUE


If you use this expression for subsetting the rows of the data.frame you get only
the blue eyed persons:
d[d$eyecolor == "blue",]

# A tibble: 2 x 7
weight age id height eyecolor BMI blueEyes
<dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 85 34 1 164 blue 31.6 0
2 NA NA 6 169 blue NA 0
You can now use this selection in combination with $blueEyes to set exactly
those variable values to 1:
d[d$eyecolor == "blue",]$blueEyes

[1] 0 0
d[d$eyecolor == "blue",]$blueEyes <- 1

# A tibble: 5 x 7
weight age id height eyecolor BMI blueEyes
<dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 85 34 1 164 blue 31.6 1
2 56 72 2 156 brown 23.0 0
3 73 33 3 178 green 23.0 0
4 76 45 4 189 brown 21.3 0
5 NA NA 6 169 blue NA 1
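By the way, the same indicator can be created in a single step, because as.numeric() turns TRUE into 1 and FALSE into 0:
d$blueEyes <- as.numeric(d$eyecolor == "blue")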

3.1.5 Split-Apply-Combine paradigm


One thing you’ll want to do quite often in statistics is to compute a certain
statistic not for your whole sample but individually for certain subgroups. The
steps needed for this are the following:
1. split your data in subsets according to some factor, e.g. eye color
2. apply the statistic to each subset
3. combine the results in a suitable way
R has a huge number of solutions for this. Some of the most handy ones are
part of the plyr package (Wickham, 2011). Before we load and install plyr ,
we have to unload the dplyr package which we have implicitly loaded as part
of the tidyverse before. dplyr is something like a big brother of plyr and
in order to avoid compatibility problems you should always load plyr first and
then dplyr if you need functions from both packages.

To unload a package, you can use detach("package:<package name>") :


detach("package:dplyr")

Now we can install plyr and then load the two packages in the correct order:
install.packages("plyr") #only do this once

library(plyr)# load plyr


library(dplyr) #load dplyr again after plyr

The function we will introduce here is ddply , one of a family of func-


tions that you can find in the documentation of the plyr functions.
In its easiest form, ddply() takes a data.frame or tibble, the name
of the variable to split by and an information on whether you want
to summarize the results ( summarise ) into a smaller data.frame or to
attach the computed statistics to the original data.frame ( mutate ). Fi-
nally you can list all the statistics you want to compute in the form
nameForStatistic = functionToComputeStatistic(variableToApplyTo)
Let’s look at an example where we compute the mean BMI and the median
height for each eye color:
#create a summary with the summarise option
ddply(d,"eyecolor", summarise, meanBMI=mean(BMI, na.rm=T),
medianHeight=median(height, na.rm=T))

eyecolor meanBMI medianHeight


1 blue 31.60321 166.5
2 brown 22.14359 172.5
3 green 23.04002 178.0

#alternatively attach results to original data with the mutate option


ddply(d,"eyecolor", mutate, meanBMI=mean(BMI, na.rm=T),
medianHeight=median(height, na.rm=T))

weight age id height eyecolor BMI blueEyes meanBMI medianHeight


1 85 34 1 164 blue 31.60321 1 31.60321 166.5
2 NA NA 6 169 blue NA 1 31.60321 166.5
3 56 72 2 156 brown 23.01118 0 22.14359 172.5
4 76 45 4 189 brown 21.27600 0 22.14359 172.5
5 73 33 3 178 green 23.04002 0 23.04002 178.0
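For reference, the dplyr package offers an equivalent of the summarise variant through its group_by() and summarise() functions. The following is just a sketch of the corresponding call, written with explicit dplyr:: prefixes to avoid the name clash with plyr :
d %>%
  dplyr::group_by(eyecolor) %>%
  dplyr::summarise(meanBMI=mean(BMI, na.rm=T),
                   medianHeight=median(height, na.rm=T))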

This is just one example of the split-apply-combine paradigm. Besides ddply ,


there are for example a couple of functions to deal with other input and output
formats than data.frames. For a more comprehensive introduction to the topic
and the package we refer to (Wickham, 2011).

3.2 Diagnostic Tests


As you’ve learned in the lecture we can use ROC curves to assess the usefulness
of one variable to predict the category of another variable. The NINDS data
set for example contains the variable STATUS24 which classifies subjects into
two groups: those with higher NHS stroke scores after 24 hours and those with
lower NHS stroke scores after 24 hours. First, let’s read the data and look at
the distribution of STATUS24 :
NINDS <- read_csv("NINDS.csv")
table(NINDS$STATUS24)

high low
300 283

We can now try to predict the status at 24 hours with the NHS stroke scale
value at 2 hours, HOUR2 . Plotting a ROC curve will give us an idea about the
diagnostic usefulness of HOUR2 . To import the necessary functions, we’ll use
the package ROCit (Khan and Brandenburger, 2020):
install.packages("ROCit") # only do this once

library(ROCit)

The function rocit() takes the arguments score (a numeric variable that
is used for diagnosis) and class (the factor that contains the condition to be
diagnosed). Since STATUS24 is currently stored as a character instead of a
factor in our data set, we have to convert it to a factor with this line of code:
NINDS$STATUS24 <- factor(NINDS$STATUS24, levels=c("low", "high"))

Now we can run the function rocit() . It returns an R object containing the
diagnostic information which we will store in the variable rocObject :
rocObject <- rocit(NINDS$HOUR2, NINDS$STATUS24)

Warning in rocit(NINDS$HOUR2, NINDS$STATUS24): NA(s) in score and/or class,


removed from the data.
We can plot the ROC curve by calling plot() on the object returned by
rocit() :
plot(rocObject)
[Figure: empirical ROC curve with the chance line and the optimal (Youden Index) point; x-axis: 1−Specificity (FPR), y-axis: Sensitivity (TPR)]

We can also extract a number of diagnostic properties for every cutoff with the
function measureit() . This function takes the object returned by rocit()
and the argument measure , a character string specifying which properties to
compute. A list of the possible measures can be found on the help page for the
function. Here we will use the sensitivity (SENS), the specificity (SPEC), the
positive and negative predictive value (PPV and NPV) and the positive and
negative diagnostic likelihood ratio (pDLR and nDLR). Because the result is
quite large, we’ll save it under the name properties instead of just printing
it to the console.
properties<-measureit(rocObject,
measure = c("SENS", "SPEC", "PPV", "NPV", "pDLR", "nDLR" ))

The object properties is a list. To get an overview over its elements, click
on the little triangle in the blue circle next to properties in the environment
window of R Studio. You can have a closer look at its elements using the $ :
properties$Cutoff[1:10]# the first 10 cutoff values

[1] Inf 38 36 34 33 32 31 30 29 28
head(properties$SENS, 10)#Sensitivity for the first 10 cutoffs

[1] 0.000000000 0.006666667 0.010000000 0.016666667 0.026666667 0.033333333


[7] 0.046666667 0.070000000 0.086666667 0.113333333
The elements SENS and Cutoff are quite long numeric vectors, so the above
code shows you two ways of printing only the first 10 elements of these vectors,
[1:10] and head(, 10) .
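As a small illustration of what you can do with these vectors (a sketch, going slightly beyond the original analysis): Youden's index is defined as sensitivity + specificity − 1, so the cutoff that maximizes it can be found like this:
youden <- properties$SENS + properties$SPEC - 1 #Youden's index for every cutoff
properties$Cutoff[which.max(youden)] #cutoff with the maximal Youden's index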

3.3 Comparing two samples


Comparing two samples is a way of testing whether a variable of interest (e.g. the
NIH stroke scale value) differs between two groups (e.g. between men and women
or between two measurement points). The two samples are the observed values
of the variable in the one group and the observed values in the other group. This
is synonymous to testing the association of a dichotomous variable (e.g. gender
or time point) with another variable (e.g. NIHSS). If men and women differ
in their NIHSS values, we say that there is an association between gender and
NIHSS.
Disclaimer: As you’ve learned in the lecture, most statistical tests can only be
applied to data that meet certain conditions. The t-test for example should
only be used on normally distributed variables. But because this course is an
introduction to R more than an introduction to statistics, we decided not to be
too strict about test assumptions here, since our example data set doesn't have
all the variable types we’d need to show you all the tests you should know about.
This means that in the following we will apply tests to data that may violate
the assumptions of the respective test in order to show you how all the tests are
technically executed without having to introduce a new data set for every test.

3.3.1 Unpaired samples


Two samples are unpaired if the values from the two samples are statistically
independent, that is, if the observations in the one sample are independent of the
observations in the other sample. Observing a variable in two sets of different
patients, for example, results in unpaired samples.

t-test
The t-test can be used to test if the means of normally distributed metric variables differ significantly. You can either test the sample mean against a prespecified mean or you can test whether the means of two samples differ significantly.
For a one sample t-test you use the function t.test() which takes a numeric
vector and mu , the mean you want to test your sample mean against:
#test if average age differs from 50
t.test(x=NINDS$AGE, mu = 50)

One Sample t-test

data: NINDS$AGE
t = 36.319, df = 623, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
66.02571 67.85782
sample estimates:
mean of x
66.94177

In the output, you find the value of the test statistic (i.e. the t value), its degrees
of freedom and the corresponding p-value in the first row. If the p value is
very small it is displayed in scientific notation, e.g. p-value < 2.2e-16 which
means 𝑝 < 2.2 × 10−16 . In the row below you get the alternative hypothesis
in words and below that the confidence interval for the sample mean. You can
judge the significance of the result by checking whether your p-value undercuts
the significance level (e.g. 0.05) or by checking whether the confidence interval
includes mu , i.e. 50.

To report the results of a t-test, you give the t-value with the corresponding
degrees of freedom and the p-value. When the p-value is very small, it is often
convention to just report it to be below a certain margin, e.g. 0.001. In this
case the mean age of our sample significantly differs from 50 with t(623)=36.319,
p<0.001. It is often useful to also include the estimated mean and its confidence
interval (in this case 66.94 [66.02 ; 67.85]) when you report t-test results, so the
reader can also judge the clinical relevance, not just the statistical significance
of the result.

For the two sample t-test, there are two ways to pass your samples to the
function. We will show you one version here and the other in the example
for paired t-tests. In the first version, you pass to the function the variable
containing the measured values and a grouping variable indicating which values
belong to which group, separated by a ~ .

To compare the mean of the body weight between men and women for example,
you write NINDS$WEIGHT ~ NINDS$BGENDER to indicate you want to compare
the weight by gender. This formulation stresses the view of testing the relation-
ship between WEIGHT and the dichotomous variable BGENDER . Since we want
the unpaired version of the t-test, we set the argument paired=FALSE .

#test whether mean weight differs between men and women


t.test(NINDS$WEIGHT ~ NINDS$BGENDER, paired = FALSE)

Welch Two Sample t-test

data: NINDS$WEIGHT by NINDS$BGENDER


t = -9.4688, df = 543.11, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-15.41512 -10.11814
sample estimates:
mean in group female mean in group male
70.65966 83.42630

As you can see, for the unpaired t-test R defaults to doing the Welch test,
which is robust to different variances in your two samples. If you are confident
the variances are equal, you can set the argument var.equal=TRUE to get the
standard t-test.
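The call would then look like this (a minimal sketch, output omitted):
#classical two sample t-test assuming equal variances
t.test(NINDS$WEIGHT ~ NINDS$BGENDER, paired = FALSE, var.equal = TRUE)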

The alternative hypothesis and the confidence interval now refer to the mean
difference between men and women. Thus, a significant result is indicated by a
confidence interval that doesn’t include 0 and a p-value < 0.05. So the results
tell us that in our sample, the weight differs significantly between men and
women with t(543.11) = -9.47, p<0.001.

Wilcoxon-Mann-Whitney test

The Wilcoxon-Mann-Whitney test serves as a non-parametric alternative to


the t-test when your two samples are metric or ordinal with many possible
values. Because in this case you don’t make assumptions about the distri-
bution, the stated alternative hypothesis isn’t about the mean difference but
more generally about a location shift between the two samples. The arguments
of the wilcox.test() function work in the same way as the ones for the
t.test() function:
wilcox.test(NINDS$HOUR24 ~ NINDS$BGENDER, paired=FALSE)

Wilcoxon rank sum test with continuity correction

data: NINDS$HOUR24 by NINDS$BGENDER


W = 52543, p-value = 0.01781
alternative hypothesis: true location shift is not equal to 0

The results tell us that the NIHSS at 24 hours differs significantly between men
and women with p=0.01781.

Chi-square-test
The Chi square test can be used to test whether there is an association between
two dichotomous variables, or differently put, whether the probability distribu-
tion of one of the two variables differs between the groups defined by the other
variable. We can for example test whether the NIH stroke scale status at 24
hours (with values high and low ) differs between men and women by passing
the two dichotomous variables to the function chisq.test() :
chisq.test(NINDS$BGENDER, NINDS$STATUS24)

Pearson's Chi-squared test with Yates' continuity correction

data: NINDS$BGENDER and NINDS$STATUS24


X-squared = 2.002, df = 1, p-value = 0.1571
As you can see, there is no significant difference in the probability for the NIHSS
status between men and women with 𝜒2 (1) = 2.002, p=0.157.

3.3.2 Paired samples


Paired samples are samples that are statistically dependent. This happens for
example when we measure the same set of patients twice, like measuring the
NIH stroke scale after 2 hours and after 24 hours. The result are two samples
( HOUR2 and HOUR24 ) that are dependent because they come from the same
patients.

T-Test
For paired samples of metric, symmetric and normally distributed variables we
can use the paired t-test. It works the same way as the unpaired t-test, we just
have to set the argument paired=TRUE . As announced above, we’ll show you
another way of specifying the data here. If your observations are stored in two
different vectors (as opposed to the example for the unpaired t-tests where all
observations of the weight where stored in one vector that could be divided by
the factor gender), you can pass those two vectors separated by a comma to the
t-test function:
t.test(NINDS$WEIGHT, NINDS$TWEIGHT, paired=TRUE)

Paired t-test

data: NINDS$WEIGHT and NINDS$TWEIGHT


t = -1.2673, df = 622, p-value = 0.2055
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.8313524 0.1791902
sample estimates:

mean of the differences


-0.3260811
The result indicates that there is no significant difference between the estimated
and the actual weight of the NINDS patients with t(622) = -1.2673, p=0.206. In
other words, we could not observe evidence for differences between the estimated
and the actual weight of NINDS patients.

Wilcoxon test
If your samples are metric and symmetric but not normally distributed, you
can use the Wilcoxon-test. You have already encountered the wilcox.test()
function before, now we just have to set the argument paired=T :
wilcox.test(NINDS$HOUR2, NINDS$HOUR24, paired=TRUE)

Wilcoxon signed rank test with continuity correction

data: NINDS$HOUR2 and NINDS$HOUR24


V = 86275, p-value = 0.002123
alternative hypothesis: true location shift is not equal to 0
The output tells us that there is a significant difference between the NIHSS at
2 and 24 hours with p<0.01.

Sign-Test
The Sign-test can be used when the samples are ordinal with many possible
values. It tests whether the median of the pairwise differences of the two samples is
equal to 0 (for two-sided tests) or less/greater than 0 (for one-sided tests). If the
median of the differences is greater than 0, this means that for the majority of value
pairs the first variable is greater; if the median is less than 0, it means that for
the majority of value pairs the second variable is greater. To do this test in R,
we can use the SIGN.test() function from the BSDA package (Arnholt and
Evans, 2017):
install.packages("BSDA") #only do this once

library(BSDA)
SIGN.test(NINDS$HOUR2, NINDS$HOUR24, alternative = "greater")

Dependent-samples Sign-Test

data: NINDS$HOUR2 and NINDS$HOUR24


S = 311, p-value = 0.0007664
alternative hypothesis: true median difference is greater than 0
95 percent confidence interval:
0 Inf
sample estimates:

median of x-y
0

Achieved and Interpolated Confidence Intervals:

Conf.Level L.E.pt U.E.pt


Lower Achieved CI 0.9455 0 Inf
Interpolated CI 0.9500 0 Inf
Upper Achieved CI 0.9538 0 Inf
The p-value below 0.05 indicates that a significant majority of patients has a
higher stroke scale value at 2 hours than at 24 hours.

McNemar test
The McNemar test checks whether two dichotomous variables occur with different
frequencies (i.e. different probabilities for the "yes-event"). It can be used on
paired samples, e.g. where both variables have been observed in the same set of
patients, like history of diabetes ( BDIAB ) and history of hypertension ( BHYPER ):
mcnemar.test(NINDS$BDIAB, NINDS$BHYPER)

McNemar's Chi-squared test with continuity correction

data: NINDS$BDIAB and NINDS$BHYPER


McNemar's chi-squared = 227.39, df = 1, p-value < 2.2e-16
As we can see there is a significant difference between the frequency of diabetes
and hypertension in the NINDS sample with 𝜒2 (1) = 227.39, p<0.001

3.4 Troubleshooting
1. Error in xy: could not find function “xy” where xy stands for the function
in your error message. You probably forgot to load the package containing
the function or you misspelled the function. If you’re unsure which package
it belongs to, consider googling the function.
2. Error in library(xy) : there is no package called ‘xy’ where xy stands
for the package name in your error message. You probably forgot to
install the package before loading it. Try install.packages("xy") . If
the installation fails, check if you are connected to the internet and have
sufficient rights on you computer to install software.
Chapter 4

Inference II

In this chapter you will learn how to compute risks and odds, do regression
analysis and time-to-event analysis (aka survival analysis).

4.1 Risk and Odds


For the computation of risks and odds, we will use the epiR package. You can
install it with install.packages("epiR") when you are using it for the first
time. We will use the NINDS data set again for an example.
NINDS <- read.csv("NINDS.csv", stringsAsFactors = T)

Let’s compute the odds and risks for a high NIH stroke scale value at 24 hours
( STATUS24 ) in the treatment and placebo groups ( TREATCD ). First of all, we’ll
have to compute a contingency table of the two variables:
tab <- table(NINDS$TREATCD, NINDS$STATUS24)
tab

high low
Placebo 173 132
t-PA 127 151
Then we can supply this table to the epi.2by2 function from the epiR pack-
age:
library(epiR)

epi.2by2(tab)

Outcome + Outcome - Total Inc risk * Odds


Exposed + 173 132 305 56.7 1.311


Exposed - 127 151 278 45.7 0.841
Total 300 283 583 51.5 1.060

Point estimates and 95% CIs:


-------------------------------------------------------------------
Inc risk ratio 1.24 (1.06, 1.46)
Odds ratio 1.56 (1.12, 2.16)
Attrib risk * 11.04 (2.96, 19.11)
Attrib risk in population * 5.77 (-1.35, 12.90)
Attrib fraction in exposed (%) 19.46 (5.36, 31.46)
Attrib fraction in population (%) 11.22 (2.52, 19.14)
-------------------------------------------------------------------
Test that OR = 1: chi2(1) = 7.094 Pr>chi2 = 0.01
Wald confidence limits
CI: confidence interval
* Outcomes per 100 population units

In the upper part you can see the contingency table we provided. For inter-
pretation you need to compare the rows with the original contingency table.
Then you can see that Exposed + is the Placebo group, Exposed - is the
t-PA group, Outcome + is the high status group and Outcome - is the low
status group. On the two rightmost columns you can see the risk for high status
under Inc risk * and the odds for high status under Odds . Note that the
risk is specified in percent.

You can see that in the Placebo group 56.7% of the patients have a high
status; the odds of having a high status vs. having a low status are 1.311
in this group, meaning that on average, there are 1.311 patients with a high
status per patient with a low status. In the t-PA group on the other hand, the
risk is lower with 45.7% and the odds of having a high status in this group are
only 0.841.

When we want to compare the two groups, we can look at the table under
Point estimates and 95% CIs . Here you can see the Inc risk ratio of
1.24, which means that the risk for a high status is increased by a factor of 1.24
in the Placebo group vs. the t-PA group. The odds ratio tells us, that the odds
in the placebo group are increased by a factor of 1.56. Finally, the risk difference
can be found under Attrib risk * , telling us that the risk is 11.04% higher
in the Placebo group than in the t-PA group.

You can check the respective confidence intervals to see if there is a significant
difference between the groups. For ratios, the 95%-confidence intervals should
not include 1, for the difference, the confidence interval should not include 0 to
indicate a significant result at a significance level of 0.05. This is true for all 3
estimates in our case.

4.2 Regression
To get to know regression, let’s go back to the iris data set we have used
before:
data(iris)#load data set
View(iris)

Table 4.1: The first 10 rows of the iris data set.

Sepal.Length Sepal.Width Petal.Length Petal.Width Species


5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa

This data.frame contains information of sepal and petal lengths of 150 plants
from three species of flowers: setosa, versicolor and virginica:
table(iris$Species)

setosa versicolor virginica


50 50 50
A question we could try to answer is whether there is a significant difference
between the mean sepal lengths of those three species. Lets first look at the
means descriptively:
library(plyr)#for the ddply function
ddply(iris, "Species",summarise, meanLength=mean(Sepal.Length))

Species meanLength
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
As you can see there are differences between the three means, but are those dif-
ferences systematic or due to chance? This question can be answered using linear
regression analysis, which is based on a linear model specifying the relationship
between the dependent variable and one or several independent variables. In
our case, we take the sepal length as dependent variable y and the species as
independent variable x.

4.2.1 The model behind linear regression


In the model one species is chosen as the reference for the sepal length and then
we look at the differences of the other species sepal lengths to this reference.
For the iris example the model would be:
$$Y_i = b_{setosa} + b_{versicolor} \cdot I_{versicolor} + b_{virginica} \cdot I_{virginica} + \epsilon_i$$
$I_{versicolor}$ and $I_{virginica}$ are so-called indicator or dummy variables. $I_{versicolor}$ is
1 when a plant belongs to the species versicolor and 0 otherwise, $I_{virginica}$ is
1 when a plant belongs to the species virginica and 0 otherwise. Because each
plant only belongs to one species, at most one of the indicator variables can
take the value 1 at the same time.
So here is how the formula actually reduces for the three species:
setosa: $Y_i = b_{setosa} + \epsilon_i$
versicolor: $Y_i = b_{setosa} + b_{versicolor} + \epsilon_i$
virginica: $Y_i = b_{setosa} + b_{virginica} + \epsilon_i$
From this it is relatively easy to derive the meaning of the parameters:
• $Y_i$ is the sepal length of plant $i$
• $b_{setosa}$ is the mean for the reference group setosa (this parameter is called
the intercept)
• $b_{versicolor}$ is the difference between the mean of versicolor and the mean
of setosa
• $b_{virginica}$ is the difference between the mean of virginica and the mean of
setosa
• $\epsilon_i$ is the residual, i.e. the difference between plant $i$ and the mean of its
species

4.2.2 Computing the linear regression


To compute this regression, we first specify a linear model using the
lm() function in R and then use the summary() function on this
model. For lm() , we use the same formula notation as for the t.test:
<dependent variable> ~ <independent variable> . We can either specify
the variables directly with a $ or just give the variable names alone and specify
the data.frame they come from in the argument data=<your data.frame>
#two equivalent ways of specifying the linear model
linMod <- lm(iris$Sepal.Length ~ iris$Species)
linMod <- lm(Sepal.Length ~ Species, data=iris)

To get the actual regression analysis results, we use the summary() function
on the linear model:

summary(linMod)

Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5148 on 147 degrees of freedom


Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
The output can be interpreted as follows:
The table under Residuals gives you an idea about the central tendency and
spread of the residuals 𝜖𝑖 .
Below you see the Coefficients table where you can find the values for $b_{setosa}$
(5.006), $b_{versicolor}$ (0.93) and $b_{virginica}$ (1.582) under Estimate . Each of these
values has its own p-value (under Pr(>|t|) ) which tests the hypothesis that
the respective coefficient is 0. Since $b_{versicolor}$ and $b_{virginica}$ are the mean differences
of versicolor and virginica to setosa, we can conclude that both species
have significantly larger sepal lengths than the reference species setosa. If these
p-values were >0.05, this would mean there was no evidence of differences between
the respective species and the setosa sepal length.
The R-Squared ($R^2$) gives you information on the fraction of the variability of
the sepal length that can be explained by the species ($R^2 = \frac{\text{explained variance}}{\text{total variance}}$).
The $R^2$ takes values between 0 (the sepal length cannot be explained by the
species at all) and 1 (the sepal length depends solely (=deterministically) on
the species).
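If you need the $R^2$ programmatically, you can extract it directly from the model summary (this returns the Multiple R-squared value printed in the output above):
summary(linMod)$r.squared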
In the last line of the output we get an F-statistic and p-value indicating a
significant overall association between sepal length and species. These values
are exactly the same we would get if we computed an ANOVA instead of a
regression.
It is often advisable to also look at the confidence intervals of your regression
coefficients. The summary() doesn’t show these intervals, but you can compute
them easily using the confint() function:


confint(linMod, level = 0.95)

2.5 % 97.5 %
(Intercept) 4.8621258 5.149874
Speciesversicolor 0.7265312 1.133469
Speciesvirginica 1.3785312 1.785469

4.2.3 Multiple linear regression


Regression analysis can go far beyond the simple linear regression we have just
computed. Regression can not only accommodate categorical, but also metric
independent variables and you can also use regression analysis for more than
one independent variable (=multiple regression or multivariable regression).
Let’s go back to the NINDS data set for an example:
NINDS <- read.csv("NINDS.csv", stringsAsFactors = T)

Here is a regression for checking for an association between weight and the
history of hypertension:
mod1 <- lm(WEIGHT ~ BHYPER, data=NINDS)
summary(mod1)

Call:
lm(formula = WEIGHT ~ BHYPER, data = NINDS)

Residuals:
Min 1Q Median 3Q Max
-36.461 -11.737 -0.226 10.157 100.455

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 76.143 1.221 62.375 <2e-16 ***
BHYPERYes 2.995 1.501 1.995 0.0465 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.65 on 615 degrees of freedom


(7 observations deleted due to missingness)
Multiple R-squared: 0.006429, Adjusted R-squared: 0.004813
F-statistic: 3.979 on 1 and 615 DF, p-value: 0.04651
As you can see there is a significant association between hypertension and weight,
with a p-value of 0.0465 and coefficients indicating that the average weight of
patients without hypertension is around 76 kg and that patients with hyperten-
sion weigh on average 3 kg more. Of course this could have been investigated with a t-test.

Regression, however, allows us to introduce more independent variables to the


model. One could for example hypothesize that the weight is also associated
with diabetes. If we add this variable to the model, something interesting happens:
mod2 <- lm(WEIGHT ~ BHYPER + BDIAB, data=NINDS)
summary(mod2)

Call:
lm(formula = WEIGHT ~ BHYPER + BDIAB, data = NINDS)

Residuals:
Min 1Q Median 3Q Max
-37.402 -11.872 -0.473 9.958 101.940

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 75.303 1.234 61.047 < 2e-16 ***
BHYPERYes 2.349 1.500 1.566 0.117962
BDIABYes 6.050 1.746 3.466 0.000566 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.49 on 612 degrees of freedom


(9 observations deleted due to missingness)
Multiple R-squared: 0.02559, Adjusted R-squared: 0.0224
F-statistic: 8.036 on 2 and 612 DF, p-value: 0.0003591

Let’s first go through the coefficients one by one. The intercept 75.303 is the
mean weight for a person without hypertension and without diabetes. For pa-
tients with hypertension, the mean weight increases by 2.349 kg to 77.652. For
patients with diabetes, the mean weight increases by 6.050 kg, resulting in a
mean of 81.353 kg for patients with diabetes only. Patients with hypertension
and diabetes on the other hand end up with a mean of 75.303 + 2.349 + 6.050 =
83.702 kg.
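If you don't want to add the coefficients up by hand, you can let R compute these group means with predict() . This is a small sketch; it assumes that BHYPER and BDIAB are factors with the levels "No" and "Yes" , as the coefficient names above suggest:
#predicted mean weight for the four combinations of BHYPER and BDIAB
newpat <- data.frame(BHYPER = c("No", "Yes", "No", "Yes"),
                     BDIAB = c("No", "No", "Yes", "Yes"))
predict(mod2, newdata = newpat)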

Looking at the p-values we can however see that only the effect of diabetes
on weight is significant (p<0.001), while there is no significant effect of the
hypertension anymore (p=0.12).

In this scenario we say that controlling for diabetes, there is no effect of hy-
pertension on the weight of patients. This means that BHYPER doesn’t contain
information about the weight of patients that is not already represented in the
information in BDIAB .

4.2.4 Logistic regression


Regression analysis is not restricted to metric dependent variables. When we
want to analyze the association of several variables with a binary variable, for
example, we can use logistic regression which belongs to the broader group of
generalised linear models (glm).
Let’s check whether the presence of hypertension BHYPER is associated with
the weight and age of the NINDS subjects. The logistic regression is computed
the same way as the ordinary linear regression but using glm() instead of
lm() to build the model. Because glm() can be used for a number of other
models besides the one for logistic regression, we have to specify the argument
family=binomial to indicate that our dependent variable comes from a bino-
mial distribution, i.e. is a binary variable:
mod3 <- glm(BHYPER ~ WEIGHT + AGE, family= binomial, data=NINDS)
summary(mod3)

Call:
glm(formula = BHYPER ~ WEIGHT + AGE, family = binomial, data = NINDS)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.9008 -1.2700 0.7703 0.9067 1.3740

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.328955 0.757860 -4.393 1.12e-05 ***
WEIGHT 0.017634 0.005311 3.320 0.000899 ***
AGE 0.039643 0.007853 5.048 4.46e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 790.0 on 616 degrees of freedom


Residual deviance: 759.3 on 614 degrees of freedom
(7 observations deleted due to missingness)
AIC: 765.3

Number of Fisher Scoring iterations: 4


The intercept ($b_0 \approx -3.33$) is the log odds of having hypertension for a person
of age 0 weighing 0 kg. This theoretical value does of course not make sense as
is often the case with intercepts in a regression where 0 isn’t really a possible
value for the independent variable(s).
The other two coefficients, however, can be interpreted as the log odds ratios
associated with an increase of one unit in the respective independent variable. The log odds
for hypertension increase by 0.018 for every kg a person gains. They increase
by 0.04 for every year older a person is. That means for example that a person
that is ten years older than another person has 0.04 ⋅ 10 = 0.4 higher log odds
for hypertension.
Because log odds are hard to interpret, the coefficients are often exponentiated,
resulting in more interpretable odds (for the intercept) and odds ratios (for the
other coefficients). To do this in R, we can directly extract the coefficients from
the mod3 object:
coef(mod3) # the original coefficients

(Intercept) WEIGHT AGE


-3.32895502 0.01763448 0.03964310
exp(coef(mod3)) # odds/odds ratios

(Intercept) WEIGHT AGE


0.03583053 1.01779089 1.04043937

This output tells us for example that the odds of having hypertension increase
by a factor of 1.04 for every additional year of life. For two years, the odds
accordingly increase by $1.04 \cdot 1.04 = 1.04^2$, for 10 years they increase by $1.04^{10} =
1.48$. It is important to keep in mind that the additive nature of the coefficients
on the original (i.e. log odds) scale transforms to a multiplicative nature when
we transform the coefficients with an exponential transformation (i.e. to the
odds scale).

You can of course again look at confidence intervals on both scales too:
#original coefficients/logit
confint(mod3)

2.5 % 97.5 %
(Intercept) -4.839137893 -1.86282271
WEIGHT 0.007388658 0.02824346
AGE 0.024427301 0.05526389
#odds/odds ratios
exp(confint(mod3))

2.5 % 97.5 %
(Intercept) 0.007913874 0.1552338
WEIGHT 1.007416022 1.0286461
AGE 1.024728092 1.0568195
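A compact way to combine the odds ratios and their confidence intervals in a single table is the following (a small sketch):
round(exp(cbind(OR = coef(mod3), confint(mod3))), 3)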

4.3 Time-to-event analysis

Time-to-event data, traditionally often called survival data, comes from studies
where patients were followed over time until a particular event (e.g. death or
relapse) occurs. We usually analyse this data using the Kaplan-Meier-estimator.
Two good packages with functions to compute the Kaplan-Meier estimator as
well as a couple of other useful statistics are the packages survival (Therneau,
2020) and survminer (Kassambara et al., 2020) so we will install and load these
packages:
install.packages("survival")
install.packages("survminer")

library(survival)
library(survminer)

First we compute a so-called survival object for the survival of the NINDS pa-
tients with the function Surv() . The result of this function is an R-object
we can use for the actual survival analyses following. Surv() expects two
arguments: The survival time ( SURDAYS in our case) and a numeric variable
indicating whether the subject died or not. Because the variable DCENSOR con-
taining this information is a factor, we have to wrap it in the as.numeric()
function to turn it into a numeric:
s <- Surv(time=NINDS$SURDAYS, event=as.numeric(NINDS$DCENSOR))

To compute the Kaplan Meier estimates, we use the function survfit() on


s . If we want the overall survival, we write:
sf_overall <- survfit(s ~ 1, data=NINDS)

And plot the results with:


ggsurvplot(sf_overall)
[Figure: Kaplan-Meier plot of the overall survival probability over time]
However, it is more interesting to compare the survival of different groups.
Lets compare the survival of the treatment group t-PA vs. the control group
Placebo from the variable TREATCD :
sf_treat<-survfit(s ~ TREATCD, data=NINDS)
ggsurvplot(sf_treat)
[Figure: Kaplan-Meier curves comparing the survival of the Placebo and t-PA groups]
The ggsurvplot() function also has a lot of nice additional options.
risk.table=TRUE adds a table for the number at risk under the plot,
pval=TRUE adds the p-value of the log-rank test comparing the survival of
the two groups and pval.method=TRUE prints the name of the test above the
p-value:
ggsurvplot(sf_treat, risk.table = TRUE, pval = TRUE, pval.method = TRUE)
[Figure: Kaplan-Meier curves for the Placebo and t-PA groups with a number-at-risk table below the plot; the log-rank test gives p = 0.26]
It is also possible to look at the Kaplan Meier estimates as numbers directly by
calling summary(sf_treat) , but because the output is rather long, we won’t
print it here.
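If you are only interested in the estimates at a few specific time points, you can pass those to summary() via its times argument (a small sketch with arbitrarily chosen time points):
summary(sf_treat, times = c(90, 365, 730))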
Chapter 5

Advanced Use

In the final chapter we’ll have a look at some of the functionalities of R that
make it superior to conventional statistics software. We'll have a look at some
basic programming you need to write your own functions and show you how to
make publication ready plots with ggplot2 .

5.1 Programming basics


So far we have used R mainly as a software for statistical analysis, but it is in
fact a fully-fledged programming language. Learning the basic structures you
need for programming your own functions is actually not very hard, so we’ll
show the basic building blocks here.

5.1.1 Defining a function


Apart from using already existing functions in R, you can write your own func-
tion if you don’t find one that is doing exactly what you need. For demonstra-
tion purposes, let’s define a function mySum() that takes two single numbers
firstNumber and secondNumber as input and computes the sum of these
numbers:
mySum <- function(firstNumber, secondNumber){
result <- firstNumber + secondNumber
result
}

In this block of code a function is defined and given the name mySum using the
assignment operator <- .

The definition of a function always comes in the form function(<arguments>){<body>} .


<arguments> is a comma-separated list of the input data you need for your
computation and <body> describes the operations that need to be done for
the computation. For better readability, we usually enter the <body> over
several lines enclosed by {} .

So mySum() expects two input objects firstNumber and secondNumber . In


the body, these two are added and the result is assigned the name result . In
the next line result is called, to make sure the result gets actually printed
when calling the function, then the body closes with } .

After defining the function we can use it:


mySum(3,4)

[1] 7

When you execute this line of code, the following happens:

1. R looks up the function that is saved under mySum .


2. The value 3 is assigned to the internal variable firstNumber and the
value 4 is assigned to the internal variable secondNumber
3. firstNumber + secondNumber is executed, the result 7 is assigned to
the internal variable result
4. result is called at the end of the body to make sure its value is returned
to the “outside”.
5. Everything that isn’t explicitly called in the last line of the body stays
inside the function. This means neither result nor firstNumber or
secondNumber can be called outside of the function as the following line
shows:
firstNumber

Error in eval(expr, envir, enclos): Objekt 'firstNumber' nicht gefunden

(Translates to Error in eval(expr, envir, enclos): object "firstNumber" not found. )


As you know from the functions you’ve used already, it is also possible to assign
default values to some of the arguments. The following function has a default
of 10 for secondNumber :
mySum2 <- function(firstNumber, secondNumber=10){
result <- firstNumber + secondNumber
result
}

This means if you omit secondNumber in the function call, it is assumed to be 10 :

mySum2(5)

[1] 15
But you can overwrite the default:
mySum2(5,2)

[1] 7
You can also call other functions inside your function. For example you can
write a function that computes the mean difference of two vectors:
meandiff <- function(x,y){
result <- mean(x) - mean(y)
result
}

v1<-c(1,2,3)
v2<-c(10,20,30)

meandiff(v1,v2)

[1] -18

5.1.2 Conditional statements


Sometimes you want your code to do one thing in one case and another thing
in the other case. For example you could write some code that tests whether a
person has fever:
bodytemp <- 38

if(bodytemp>=38){
"fever"
}

[1] "fever"
You can change the value of bodytemp to different values to see how the con-
ditional statement works. In the condition part if(<logical statement>)
you test a logical condition of the kind you’ve learned about in the first chapter.
Then follows the body {<what to do>} that specifies the code you want to
execute if the condition evaluates to TRUE .
In the above code nothing happens if the condition is not met. If you want
your code to return a "no fever" for cases where bodytemp < 38 , you can
extend the statement by an else part:

bodytemp <- 37

if(bodytemp>=38){
"fever"
}else{
"no fever"
}

[1] "no fever"


Now, if the condition evaluates to TRUE the block in the first {} is executed,
if the condition evaluates to FALSE , the block in the second {} is executed.
Of course you can wrap this in a function to make it easier to use repeatedly:
hasFever <- function(bodytemp){

if(bodytemp>=38){
status<-"fever"
}else{
status<-"no fever"
}

status
}

And try it out with different values:


hasFever(36.2)

[1] "no fever"


hasFever(40)

[1] "fever"
You can also check different conditions in a row using else if in between.
The line breaks are just for readability but make sure you keep track of all the
opening and closing brackets!
tempChecker <- function(bodytemp){
  if(bodytemp<36){
    status <- "too cold"
  }else if(bodytemp>=38){
    status <- "too hot"
  }else{
    status <- "normal"
  }
  status
}

Try it out with different numbers:


tempChecker(35)

[1] "too cold"


tempChecker(39)

[1] "too hot"


tempChecker(37)

[1] "normal"
In this code, the conditions are checked in the order they appear in. If the
first condition applies, the first block of code is executed, and the rest of the
if else statement is ignored. If the first condition is not met, the second
condition is evaluated. If it is TRUE the following code block is executed, the
rest of the statement is ignored. When all of the conditions have been tested
and evaluated to FALSE , the last code block from the else part is executed.

5.1.3 Loops
The final structure is the loop: A loop allows you to assign repetitive tasks to
your computer instead of doing them yourself. The first kind of loop you’ll learn
about is the for loop. In this loop you specify the number of repetitions for a
task explicitly. The following loop prints the numbers from 1 to 5:
for (i in 1:5) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
In the () part you define the counting variable, which is often called i (but
can have any other name too) and we define the values this counting variable

should take (the values 1 to 5 in our case). In the {} part we then define the
task for every iteration. print(i) simply tells R to print the value of i into
the console. So the above loop has 5 iterations in each of which the current
value of i is printed to the console.
Of course we can also have proper computations. For example we can add up
all the numbers from 1 to 1000 with this code:
result <- 0

for(i in 1:1000){
  result <- result + i
}

result

[1] 500500
In the above code the value of result is 0 to begin with. Then the loop
enters its first round and the value of result is updated to the current value
of result plus the current value of i , so 0 + 1 = 1 . Then the second
iteration starts and the same happens again: The current value of result is
updated by adding the current value of i to it, so result is now 1 + 2 = 3
etc.
Sometimes a repetitive task has to be done until a certain condition is met, but
we cannot tell beforehand how many iterations it is going to take. In these cases,
we can use the while loop. For example you can count how often you have to
add 0.6 until you get to a number that is greater than 1000:
x <- 0
counter <- 0
while(x <= 1000){
x <- x + 0.6
counter <- counter + 1
}
counter

[1] 1667
Before the loop starts, both x and counter have the value 0. Then in every
iteration, x grows by 0.6 and counter by 1 to count the number of iterations.
As soon as the condition in () is not met any longer (i.e. when x is greater
than 1000), the loop stops. As you can see, it takes 1667 iterations to make x
greater than 1000. The previous examples are of course just toy examples to demonstrate the basic functionality of loops. In reality we can use a loop for more practical tasks, for example to create the same kind of graphic for a large number of variables, as sketched below. This brings us to the final chapter of this course: how to produce plots using ggplot2 .
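Here is a minimal sketch of such a loop; it assumes the NINDS data has been read into a data.frame d (as we will do in the next section) and simply draws a base R histogram for each of the listed variables:
for (v in c("AGE", "WEIGHT", "HOUR2", "HOUR24")) {
  hist(d[[v]], main = v, xlab = v)
}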

5.2 Graphics with ggplot


The package ggplot2 (Wickham, 2016) is the most widely used graphics pack-
age in R because it gives you control over even the finest details of your graphics
while simultaneously doing a lot of the work automatically for you. The syntax
takes a bit of getting used to but the results are worth it! This chapter only
touches upon the most commonly used functions of ggplot2. For a comprehen-
sive overview and more useful resources go to https://www.rdocumentation.org/packages/ggplot2/versions/3.3.0.

5.2.1 Structure
In ggplot you build your graphics layer by layer as if you were painting a picture.
You start by laying out a blank sheet (with the basic function ggplot() ) upon
which you add graphical elements (called geoms ) and structural elements (like
axis labels, colour schemes etc.).

To start, lets install the package and load the NINDS data set again.
install.packages("ggplot2") #only do this once

library(ggplot2)
d <- read.csv("NINDS.csv")

Starting with a simple graphic, we want to draw a scatterplot of the NIHSS at


2 hours and 24 hours. In ggplot, we build a graphics object and save it as a
variable that is only actually drawn if we call it in the console. We start by
laying out a white sheet and call our graphic my_plot :
my_plot <- ggplot(data = d)

In this function, we tell the graphic that our data comes from the data set d .
But since we haven’t told ggplot what to draw yet, my_plot only produces a
blank graph:
my_plot

This changes if we add the geom for a scatterplot called geom_point() .


my_plot + geom_point(aes(x=HOUR2,y=HOUR24))

[Figure: scatterplot of HOUR2 (x-axis) against HOUR24 (y-axis)]

The aes() function is part of every geom, it is short for aesthetic and used
to specify every feature of the geom that depends on variables from the data
frame, like the definition of the x- and y-axis.

Within aes() we can for example set the color of geom_point() to depend
on the gender:
my_plot + geom_point(aes(x=HOUR2,y=HOUR24, color=BGENDER))

[Figure: scatterplot of HOUR2 against HOUR24 with points colored by BGENDER (female/male)]

If you want to set a feature of the geom that doesn’t depend on any of the
variables (e.g. setting one color for all the points), this is done outside of the
aesthetics argument:
my_plot + geom_point(aes(x=HOUR2,y=HOUR24), color="blue")
[Figure: scatterplot of HOUR2 against HOUR24 with all points drawn in blue]

You can also add more than one layer to the plot. For example, we could superim-
pose a (non-linear) regression line by simply adding the geom geom_smooth() :
my_plot + geom_point(aes(x=HOUR2,y=HOUR24, color=BGENDER)) +
geom_smooth(aes(x=HOUR2, y=HOUR24))

[Figure: scatterplot colored by BGENDER with one overall smoothed regression line]

With more than one layer it is easier to format the code with line breaks. These
breaks don’t affect the functionality in any way aside from readability, just make
sure you mark all the lines when executing the code. Note, that each line but
the last has to end with a + for R to know that those lines belong together.

geom_smooth() , too, can be divided with color (e.g. by BGENDER ) if we specify


it in the color argument of the aesthetics:
my_plot + geom_point(aes(x=HOUR2,y=HOUR24, color=BGENDER)) +
geom_smooth(aes(x=HOUR2, y=HOUR24, color=BGENDER))

[Figure: scatterplot with a separate smoothed regression line for each BGENDER group]

When several layers share the same aesthetics it can be useful to define these
aesthetics in the basic plot produced by ggplot() :
my_plot2 <- ggplot(data=d, aes(x=HOUR2, y=HOUR24, color=BGENDER))

my_plot2 + geom_point() + geom_smooth()


[Figure: the resulting scatterplot, identical to the previous one]

Instead of defining the aesthetics for each geom separately, geom_point() and
geom_smooth() inherit the aesthetics of my_plot2 and the graphic looks
exactly the same.

5.2.2 Labels

You can set the labels of your plot using labs() .


my_plot2 +
geom_point() +
geom_smooth() +
labs(title = "Scatterplot",
x= "NIHSS at 2 Hours",
y="NIHSS at 24 Hours")
[Figure: the scatterplot with the title "Scatterplot" and the axis labels "NIHSS at 2 Hours" and "NIHSS at 24 Hours"]

5.2.3 Facets

So far we have divided our graph using different colors. It is however also
possible to split the graph according to one or more variables in the data frame
using facet_wrap() . To split the graph by presumptive diagnosis ( TDX ) we
write:
my_plot2 +
geom_point() +
geom_smooth() +
labs(title = "Scatterplot",
x= "NIHSS at 2 Hours",
y="NIHSS at 24 Hours") +
facet_wrap(~ TDX)
[Figure: faceted scatterplots, one panel per presumptive diagnosis (Cardioembolic, Large vessel occlusive, Other, Small vessel occlusive)]

And if we want to split the graph by gender, too, we simply add BGENDER .
With ncol=2 we can also tell ggplot to display the plots in two columns:
my_plot2 +
geom_point() +
geom_smooth() +
labs(title = "Scatterplot",
x= "NIHSS at 2 Hours",
y="NIHSS at 24 Hours") +
facet_wrap(~ TDX + BGENDER, ncol=2)
[Figure: scatterplots faceted by TDX and BGENDER, arranged in two columns]

5.2.4 Theme

The theme of a ggplot can be used to change the default appearance of the entire
plot or to change specific components of your plot. To find a list of complete
themes, go to https://ggplot2.tidyverse.org/reference/ggtheme.html or install
the package ggthemes which contains even more complete themes. The default
theme of ggplot is theme_grey , but we can change it like this:
my_plot2 +
geom_point() +
geom_smooth() +
labs(title = "Scatterplot",
x= "NIHSS at 2 Hours",
y="NIHSS at 24 Hours") +
theme_bw()
[Figure: the scatterplot rendered with theme_bw()]

If on the other hand you want to change only certain elements, for exam-
ple the font size or type of your axis labels, you use theme() , which allows
you to customize every element of your plot. To change text elements, you
give an element_text() to the appropriate argument of theme() . Within
element_text you can set the font size, font type, font color, font face and
many more aspects. The arguments that take element_text() objects are for
example axis.text for the numbers on the axes, axis.title for the axis
labels and plot.title for the plot title :
my_plot2 +
geom_point() +
geom_smooth() +
labs(title = "Scatterplot",
x= "NIHSS at 2 Hours",
y="NIHSS at 24 Hours") +
theme(axis.text.x = element_text(size=15),
axis.text.y= element_text(size=10),
axis.title = element_text(size=16, face="italic"),
plot.title = element_text(size=18, face="bold"))
[Figure: the scatterplot with enlarged axis text, italic axis titles and a bold plot title]

5.3 Further reading


This course aimed at introducing you to the basic ideas of how R works and
how it can be of use to you. We have tried to find a balance between keeping it
as easy as possible for you to start analysing right away without omitting too
much of the underlying concepts. However, we had to leave out a great deal of
concepts to fit into these five sessions.
If you are interested in getting to know R better, there is a mountain of useful
material. We’ll list a few in the following:

5.3.1 Webpages
• R for Data Science (Wickham and Grolemund, 2017) available at https:
//r4ds.had.co.nz/, an online book with very clear and detailed introduc-
tions that focuses more on R and how to use it for data analysis and less
on traditional statistics.
• Modern Dive (Ismay and Kim, 2019) available at https://moderndive.com/, an online book giving an introduction to R but with a strong focus on statistical inference.
• STHDA (http://www.sthda.com/english/wiki/r-basics-quick-and-easy),
a website with short, hands-on tutorials explaining how to do a number
of statistical analysis including help with output interpretation.
• rdocumentation (https://www.rdocumentation.org/), a collection of
the help pages to all the R packages and functions that is a bit more nicely formatted than the help pages within R.

5.3.2 Books
• R for Data Science and Modern Dive are also available as physical
books
• Discovering Statistics Using R (Field et al., 2012), an extensive but
very accessible and entertaining introduction to statistics from the very
basic to advanced statistical analyses with examples in R
Bibliography

Arnholt, A. T. and Evans, B. (2017). BSDA: Basic Statistics and Data Analysis. R package version 1.2.0.

Field, A., Miles, J., and Field, Z. (2012). Discovering Statistics Using R. Sage Publications Ltd.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188.

Grolemund, G. and Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3):1–25.

Ismay, C. and Kim, A. Y. (2019). Statistical Inference via Data Science: A ModernDive into R and the Tidyverse. Chapman and Hall/CRC.

Kassambara, A., Kosinski, M., and Biecek, P. (2020). survminer: Drawing Survival Curves using ggplot2. R package version 0.4.8.

Khan, M. R. A. and Brandenburger, T. (2020). ROCit: Performance Assessment of Binary Classifier with Visualization. R package version 2.1.1.

Marler, J. R., Brott, T., Broderick, J., Kothari, R., Odonoghue, M., Barsan, W., et al. (1995). Tissue plasminogen activator for acute ischemic stroke. New England Journal of Medicine, 333(24):1581–1588. PMID: 7477192.

R Core Team (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Schauberger, P. and Walker, A. (2020). openxlsx: Read, Write and Edit xlsx Files. R package version 4.2.3.

Therneau, T. M. (2020). survival: Survival Analysis. R package version 3.2-7.

Wickham, H. (2011). The split-apply-combine strategy for data analysis. Journal of Statistical Software, 40(1):1–29.

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., and Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43):1686.

Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., Woo, K., Yutani, H., and Dunnington, D. (2020). ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 3.3.2.

Wickham, H. and Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly Media, 1 edition.
