R Intro Script
Preface

1 Introduction
1.1 What is R?
1.2 Installing R
1.3 Getting to know RStudio
1.4 R as a calculator
1.5 Assignments
1.6 Basic data structures
1.7 Troubleshooting

2 Data exploration
2.1 Functions
2.2 Packages
2.3 Loading data into R
2.4 Descriptive analysis
2.5 Troubleshooting

3 Data Manipulation & Inference I

4 Inference II
4.1 Risk and Odds
4.2 Regression
4.3 Time-to-event analysis

5 Advanced Use
5.1 Programming basics
5.2 Graphics with ggplot
5.3 Further reading
Preface
We strongly encourage you to try out all the example code yourself while working through the chapters!
If you run into problems, please check the Troubleshooting section at the end of each of the first two chapters, which lists the most common problems people encounter when learning to use R, together with their solutions.
Chapter 1
Introduction
1.1 What is R?
R is free, open-source software for data analysis, data manipulation and visualization, and at the same time a well-developed and powerful programming language. It is highly extensible and compiles and runs on a wide variety of UNIX platforms, on Windows and on macOS. The R project was started by Robert Gentleman and Ross Ihaka at the University of Auckland in 1995 and is maintained by the R Core Team (2021), an international group of volunteer developers. The website of the R project is:
http://www.r-project.org
1.2 Installing R
To begin, you should install R and RStudio on your computer. If you are working on a computer that already has these programs installed, you can skip this part. To install R, go to the Comprehensive R Archive Network (CRAN), for example here: https://ftp.fau.de/cran/ and install the version of R that is suitable for your operating system. After you have installed R, visit https://rstudio.com/products/rstudio/download/ and download and install RStudio Desktop. Check whether you can open RStudio by clicking on the RStudio icon on your desktop or by searching for it in your taskbar.
1.3 Getting to know RStudio
In RStudio you write your code in the script window in the upper left and see the results in the console below it.
To try this out, type 1+1 into your script. Then mark this piece of code and
either click on the Run symbol in the upper right corner or press Ctrl + Enter
on your keyboard.
As you can see, the code gets reprinted in the console behind the > , which is called the prompt, and directly below it the result [1] 2 is displayed. The process of sending code from the script to the console is called running or executing your code. It is possible to write code directly into the console next to the prompt and execute it by hitting Enter . However, we strongly advise typing all of your code out in the script before executing it, since this makes rerunning and changing your code much easier. To make your code more human-readable, you can comment it in your script. Any line of code that begins with a # will not be evaluated when sent to the console but will merely be printed out:
#R doesn't calculate 1+1 if it is written like this:
#1+1
Anything you write in R that is not a comment is case sensitive: for example, A and a are not the same thing to R.
The upper and lower right windows in RStudio will be explained when they
become relevant in the following chapters.
1.4 R as a calculator
As you have seen in the example above, R can be used as an ordinary calculator.
You can use + and - for addition and subtraction, * and / for multiplication and division, and ^ for exponentiation.
Try out different calculations like the following by typing them into your script
and running them. You can either run them line by line or mark several lines
at once for execution.
5+3
[1] 8
7*3/2
[1] 10.5
2^3
[1] 8
(2-5)*8^2
[1] -192
The [1] in front of the output will appear in front of every vector in the console. It is an index telling you the position of the first element in the row, which is useful when a vector is so long that it produces line breaks in your console. You will learn what a vector is in just a moment.
1.5 Assignments
One of the most important concepts in R is the assignment of names to objects.
So far the objects we have encountered are simple numbers. To assign a name
to a number, you use the assignment operator <- (no space between < and
- ) like this:
x <- 3
some.complicated.name <- 7
As a variable name, you can use any string that doesn't contain blank spaces or special characters and that doesn't begin with a number. You can look up the
value that is stored in a variable by simply typing out its name and running it:
x
[1] 3
Now you can use these variables for computation:
x + some.complicated.name
[1] 10
You can overwrite the value stored in your variables at any point by simply
rerunning the original assignment with different values.
For example you can assign the values 2 and 80 to x and some.complicated.name
and compute their product.
x <- 2
some.complicated.name <- 80
x*some.complicated.name
[1] 160
If you want to remove a variable, you can use the rm() command like this:
rm(x)
If you want to remove all variables in your workspace, you can use a combination
of rm() and ls() :
rm(list=ls())
As you can see, the variables now disappear from the environment window in
the top right corner. The advantage of using variables instead of numbers in
your code is that your code becomes reusable. Imagine having typed out a long
computation that you want to perform repeatedly with different numbers. If
your computation uses variable names, you only write it down once and are
able to rerun it with as many different values as you like by just assigning those
values to the variable one by one.
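As a small illustration of this idea, here is a computation, the body mass index, written once with variables and then rerun with new values (the weights and heights are made up for this example):

```r
# body mass index: weight in kg divided by squared height in metres
weight <- 85
height <- 1.80
weight / height^2

# rerun the same computation with different values
weight <- 60
height <- 1.65
weight / height^2
```

Only the two assignments change between runs; the computation itself stays the same.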
1.6 Basic data structures

1.6.1 Vectors
So far you have worked with single numbers. These are actually a special case of the most important data structure in R, the so-called vector. A vector in R is an ordered collection of elements that all have the same data type. You can create a vector with the function c() (c for combine):
c(8, 2, 4, 6, 2, 1)
[1] 8 2 4 6 2 1
This is a numeric vector of length 6 (i.e. it has 6 elements). An example of a vector producing a line break is the following:
c(1, 23, 4, 5, 6, 7, 7, 8, 4, 2, 4, 6, 8, 98, 45, 23,
45, 8, 97, 23, 4, 23, 1, 3, 5, 6, 2, 45, 3, 45, 4, 1, 3)
[1] 1 23 4 5 6 7 7 8 4 2 4 6 8 98 45 23 45 8 97 23 4 23 1 3 5
[26] 6 2 45 3 45 4 1 3
You can store vectors in variables as well:
my_vector <- c(8, 2, 4, 6, 2, 1)
Notice how the new variable my_vector now appears in the environment window on the upper right side. You can retrieve the vector stored in my_vector
by typing it out and executing the code:
my_vector
[1] 8 2 4 6 2 1
Numeric
The kind of vector you’ve just seen is the numeric vector (or just numeric), which
is a vector containing numbers. A numeric containing only whole numbers (like
my_vector ) can be called an integer, which is a subtype of numeric.
If the numbers in the numeric have decimal places, it can be called a double.
v <- c(1.5, 3.234, 7, 0.12356)
v
[1] 1.50000 3.23400 7.00000 0.12356
You can calculate with whole vectors just as with single numbers; arithmetic operations are applied element-wise. For example, you can add two vectors of the same length:
a <- c(1, 2, 3, 4)
b <- c(2, 4, 6, 7)
a+b
[1] 3 6 9 11
If you combine two vectors of different lengths, R recycles the shorter one:
long <- c(1, 2, 3, 4)
short <- c(1, 2)
long + short
[1] 2 4 4 6
Here, the shorter vector was repeated, i.e. the calculation was long + c(short, short) .
Character
R can deal not only with numbers but also with text. A piece of text is called a string and is written in a pair of double or single quotes. A vector containing strings as its elements is called a character vector:
v2 <- c("male", "female", "female", "male")
v2
[1] "male"   "female" "female" "male"
Logical
Another important type of vector is the logical vector, the elements of which are the so-called booleans TRUE and FALSE , which can be abbreviated as T and F (case matters, you have to use upper-case letters in both versions):
c(TRUE, FALSE, TRUE)
[1]  TRUE FALSE  TRUE
The most common logical operators we will use are the following:
• AND &
• OR |
• NOT !
• greater than >
• greater or equal >=
• less than <
• less or equal <=
• equal to == (yes, you need two equal signs)
• not equal to !=
The comparison operators can be used with numbers like this:
3 < 1
[1] FALSE
5 > 2
[1] TRUE
5 == 5
[1] TRUE
5 != 5
[1] FALSE
The logical operators AND, OR and NOT can be used to link boolean values:
TRUE & FALSE
[1] FALSE
TRUE | FALSE
[1] TRUE
!FALSE
[1] TRUE
You can also create more complex expressions, using () to group statements:
((1+2)==(5-2)) & (7<9)
[1] TRUE
You can access a single element of a vector by writing its position (its so-called index) in square brackets [] behind the vector name:
my_vector[4]
[1] 6
It is also possible to select more than one element of the vector by using an
integer vector of the desired indices (e.g. c(1,4,5) if you want to retrieve the
first, fourth and fifth element of a vector) within the square brackets:
my_vector[c(1, 4, 5)]
[1] 8 6 2
We call this subsetting your vector. For subsetting vectors we often need longer
sequences of integers. To generate a sequence of consecutive integer numbers
R has the <start> : <end> operator, which is read as from <start> to
<end> :
3:10 #generates sequence from 3 to 10
[1] 3 4 5 6 7 8 9 10
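Such a sequence can be used directly inside the square brackets. A small sketch, assuming my_vector is defined as above:

```r
my_vector <- c(8, 2, 4, 6, 2, 1)
my_vector[2:4] #second to fourth element
# [1] 2 4 6
```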
1.6.3 Data.frames
The data.frame is R's data structure for tabular data. R comes with some built-in example data sets, one of which is iris . The data set iris gives measurements of sepal and petal lengths and widths of 150 flowers from three different species of iris. You can extract each of its columns with a $ :
iris$Sepal.Length
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
As you can see, iris$Sepal.Length is just a numeric vector! Consequently
you can do calculations on these vectors, e.g. compute the mean sepal length of
the flowers:
mean(iris$Sepal.Length)
[1] 5.843333
Basically, a data.frame in R is a number of vectors of the same length that have been stuck together column-wise to build a table. Within a column all values must have the same type, but different columns can have different types. In this example, columns 1 to 4 are numeric and column 5 contains strings.
1.6.4 Lists
While data.frames are useful to bundle together vectors of the same length, lists
are used to combine more heterogeneous data. The following block of code
creates a list:
#create list
my.list <- list(my_vector, long, iris[1:10,])
#print list
my.list
[[1]]
[1] 8 2 4 6 2 1
[[2]]
[1] 1 2 3 4
[[3]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
A list is a collection of R objects that are called the elements of the list. Lists
are similar to data.frames, but while data.frames can only have vectors of the
same length as their elements (i.e. the variables), lists can have all kinds of data
types as elements. An element of a list can be a vector of arbitrary length,
a data.frame, another list or even a function. The list we have just created contains two vectors of different lengths and a data.frame containing the first ten rows of the iris data set. You can access a single list element by referencing its position in the list using double square brackets [[]] :
my.list[[1]] #result is a vector
[1] 8 2 4 6 2 1
my.list[[3]] #result is a data.frame
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Note that if you use single square brackets [] , the result will always be a
list, whereas using double square brackets [[]] will return whatever type the
object is that you are referencing with [[]] .
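A quick way to see this difference is to compare the class of the two results; a minimal sketch with a small list:

```r
small.list <- list(c(8, 2, 4), 1:4)
class(small.list[1])   #single brackets: still a list
# [1] "list"
class(small.list[[1]]) #double brackets: the vector itself
# [1] "numeric"
```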
You can also assign names to the elements of a list when creating it and then access them by name with the $ operator, just like the columns of a data.frame:
#create a named list
my.named.list <- list(a = my_vector, b = long, c = iris[1:10,])
my.named.list
$a
[1] 8 2 4 6 2 1
$b
[1] 1 2 3 4
$c
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10          4.9         3.1          1.5         0.1  setosa
my.named.list$a
[1] 8 2 4 6 2 1
my.named.list$b
[1] 1 2 3 4
The square brackets [] and [[]] do, however, also work on named lists. Because lists can bundle a lot of heterogeneous data in one R object, they are quite often used to return the results of functions for statistical analyses, as you will see later on.
You can find out what type of data structure an object is with the class() function:
class(my.list)
[1] "list"
class(iris)
[1] "data.frame"
class(iris$Sepal.Length)
[1] "numeric"
class(my_vector)
[1] "numeric"
class(c(TRUE,FALSE, FALSE))
[1] "logical"
Another useful function is str() , which compactly displays the structure of an R object, for example our list:
str(my.list)
List of 3
 $ : num [1:6] 8 2 4 6 2 1
 $ : num [1:4] 1 2 3 4
 $ :'data.frame': 10 obs. of  5 variables:
1.7 Troubleshooting
1. The code I'm sending to the console appears but doesn't seem to be executed: Check whether the last line of the console shows the prompt > . This means R is ready to receive new commands. If there is no > but a + instead, you probably forgot to close a bracket some lines before and R is waiting for the closing bracket. Just hit Esc to interrupt the current command and try again. Make sure the number of opening brackets matches the number of closing brackets. If there is no + , but a little red stop sign in the upper right corner of the console, R is still working on the computation. If it doesn't go away after a few moments, but you know your computation shouldn't take this long, click the stop sign to terminate the current computation and try to find out why the computation you started won't finish.
2. Error: object 'x' not found: You probably forgot to define x ( x being a stand-in for the variable in your error message) and it doesn't show up in the environment window on the upper right. Run the assignment for x and try again.
Chapter 2
Data exploration
2.1 Functions
Besides the data structures you have learned about in the last chapter, there is
another important concept you need to learn about when using R: the function.
In principle, you can imagine a function as a little machine that takes some input
(usually some kind of data), processes that input in a certain way and gives
back the result as output. There are functions for almost every computation
and statistical test you might want to do; there are functions to read and write data, to shape and manipulate it, to produce plots and even to write books (this document is written completely in R)! The function mean() for example
takes a numeric vector as input and computes the mean of the numbers in the
numeric vector:
mean(c(2, 4, 6))
[1] 4
You can open the help page of a function by typing a question mark followed by the function name, e.g. ?log . As you can see, the help page gives information about a couple of functions, one of which is log() . Besides the description of the arguments, you should have a look at the information under Usage. Here you can see that the default value for base is exp(1) (which is approximately 2.72, i.e. Euler's number), whereas there is no default value for x . All arguments that appear under Usage solely with their name and without a default value (like x in this case) are
mandatory when you call the function. Not providing these arguments will
throw an error. All arguments that have a default value given (like base in
this case) can be omitted, in which case R assumes the default value for that
argument:
log(x = c(20,30,40)) #argument base can be omitted
[1] 2.995732 3.401197 3.688879
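If you do want a different base, you supply it explicitly; for example the base-10 logarithm:

```r
log(x = 100, base = 10) #explicit base instead of the default exp(1)
# [1] 2
```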
2.2 Packages
A basic set of functions is already included in base R, i.e. the software you downloaded when installing R. But since there is a huge community worldwide constantly developing new functions and features for R, and since the entirety of all existing R functions is way too big to install at once, most functions are bundled into so-called packages. A package is a bundle of functions you
can download and install from the Comprehensive R Archive Network (CRAN)
(https://cran.r-project.org/). If you visit the site, you can also get an overview
over all available packages. You can install a package by using the function
install.packages() which takes the package name as a string (i.e. in quotes)
as its argument:
install.packages("lubridate")
If you run this line of code, R goes online and downloads the package lubridate (Grolemund and Wickham, 2011), which contains a number of useful functions for dealing with dates. This operation has to be done only
once, so it is one of the rare cases where it makes sense to copy the code
directly into the console. If you write it in your script window it is advisable to
comment out the code with a # after you’ve run it once to avoid unnecessarily
running it again if you rerun the rest of your script.
Once you have installed the package, its functions are downloaded to your computer but are not accessible yet, because the package has to be activated first.
If you try to call a function from a package that is not activated yet (e.g. the
function today() from lubridate ), you get this error:
today()
Error in today() : could not find function "today"
To activate the package, you use the function library() . This function activates the package for your current R session, so you have to do this once per session (a session starts when you open R/RStudio and ends when you close the window).
library(lubridate)
today()
[1] "2021-04-21"
As you can see the function today() is an example of a function that doesn’t
need any argument. Nevertheless you have to call it with brackets () to
indicate that you’re calling a function rather than a variable called today .
Most packages print some sort of information into the console when they are
loaded with library() . Don’t be alarmed by the red color - all of R’s messages,
warnings and errors are printed in red. The only messages you should be worried
about for this course are the ones starting with Error in: , the rest can be
safely ignored for now. However, warning messages can be informative if they
appear.
2.3 Loading data into R
Every R session has a working directory: the folder on your computer where R looks for files to import and where R
will create files if you save something. To find out what the current working
directory is, use:
getwd() #no arguments needed
R should now print the path to your current working directory to the console. To change it, you can use RStudio. Click Session > Set Working Directory >
Choose Directory… in the toolbar in the upper left of your window. You can
then navigate to the folder of your choice and click Open.
Now you will see that R prints setwd("<Path to the chosen directory>")
to the console. This shows you how you can set your working directory without
clicking: You use the function setwd() and put the correct path in it. Note that R uses / to separate folders, unlike Windows, which uses \ .
Check if it worked by rerunning getwd() . You should now put the data files NINDS.csv and NINDS.xlsx in the folder you have chosen as your working directory.
A csv file can be read with the function read.csv() , which takes the file name as its argument:
read.csv("NINDS.csv")
We haven't printed the result in this document because it is too long, but if you
execute the code yourself you can see that the read.csv() function prints the
entire data set (possibly truncated) into the console. If you want to work with
the data, it makes sense to store it in a variable:
NINDS_csv <- read.csv("NINDS.csv")
You can now see the data.frame in the environment window. To show you at
least one other importing function, we have provided the exact same data set as
an excel file. To read this file, you need to install a package with functions for
excel files first, for example the package openxlsx (Schauberger and Walker,
2020):
install.packages("openxlsx") #only do this once
library(openxlsx)
NINDS_xlsx <- read.xlsx("NINDS.xlsx")
If you have another kind of file, just google "read R" and your file type and you will most likely find an R package for just that.
You should now be able to see NINDS_xlsx and NINDS_csv , two identical data.frames, in your environment window. Since they are identical, we will work with NINDS_csv from here on.
You can save all the objects in your environment (your workspace) to a file with save.image("my_workspace.RData") . When you then open a new R session and want to pick up where you left off, you can load them again with load() :
load("my_workspace.RData")
If you want to save a data.frame in some non-R format, almost every read function has a corresponding write function. The most versatile is write.table() , which will write a text-based format, like a tab-separated file or a csv, depending on what you supply in the sep argument.
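For example, to write a data.frame to a semicolon-separated text file (sketched here with the built-in iris data so the code is self-contained; the file name is arbitrary):

```r
# write iris as a semicolon-separated text file without row numbers
write.table(iris, file = "iris_export.csv", sep = ";", row.names = FALSE)
```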
It is possible to explicitly pass the types of your variables to the reading function. Because that can be quite a long vector when you have a lot of variables, it is often easier to just let R guess the types and correct them later if necessary.
TREATCD for example has been read in as a character as you can see using the
class() function:
class(NINDS_csv$TREATCD)
[1] "character"
To convert it into a factor, you can use the factor() function:
NINDS_csv$TREATCD <- factor(NINDS_csv$TREATCD)
The factor levels (i.e. the values your newly built factor variable can take) can be extracted with levels() :
levels(NINDS_csv$TREATCD)
[1] "Placebo" "t-PA"
Similarly, one could argue that GOS6M should be an ordered factor with Good
< Mod. Dis < Sev. Dis < Veget < Dead , but currently it is represented
as character.
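A conversion along these lines might look as follows. This is a sketch with a small stand-in vector using the same outcome labels; with the real data you would pass NINDS_csv$GOS6M instead:

```r
gos <- c("Good", "Dead", "Mod. Dis", "Good", "Sev. Dis")
gos.ordered <- factor(gos,
                      levels = c("Good", "Mod. Dis", "Sev. Dis", "Veget", "Dead"),
                      ordered = TRUE)
gos.ordered #prints the values and the ordered levels
```

The levels argument fixes the order of the categories, and ordered = TRUE tells R that this order is meaningful.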
If you know beforehand that most of the character variables in your data.frame
should actually be factors, you can specify this when reading the data in using
the argument stringsAsFactors = TRUE :
NINDS_csv <- read.csv("NINDS.csv", stringsAsFactors = TRUE)
Have a look at how the description of the data.frame in the environment window
changes after running this line of code!
Missing values are represented in R by NA . In most computations you'll have to tell R explicitly how to deal with these values (e.g. remove them before the computation), otherwise you'll get NA as a result.
For example, if you want to compute the mean of the variable TWEIGHT , which
contains missing values, you have to set the argument na.rm=TRUE , where
na.rm stands for NA remove:
#default for na.rm is FALSE so NA's are not removed
mean(NINDS_csv$TWEIGHT)
[1] NA
#this way, NA's are removed before computation
mean(NINDS_csv$TWEIGHT, na.rm=TRUE)
[1] 78.37432
Measures of central tendency
mean(NINDS_csv$AGE) #mean
[1] 66.94177
median(NINDS_csv$AGE) #median
[1] 68.69141
quantile(NINDS_csv$AGE) #gives all quartiles, min/max
Measures of dispersion
Measures of dispersion describe the spread of the values around a central value.
Here you can see how to compute the variance, the standard deviation and
the range of a variable. To get the interquartile range just pick the 25- and
75-percentile from the quantile() function above!
var(NINDS_csv$AGE) #variance
[1] 135.7818
sd(NINDS_csv$AGE) #standard deviation
[1] 11.65254
min(NINDS_csv$AGE) #minimum
[1] 26.48927
max(NINDS_csv$AGE) #maximum
[1] 89
range(NINDS_csv$AGE) #minimum and maximum at once
[1] 26.48927 89.00000
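The interquartile range mentioned above can be computed in two ways, sketched here with a small self-contained example vector:

```r
x <- c(26, 55, 61, 68, 74, 80, 89)
quantile(x, probs = c(0.25, 0.75)) #just the 25% and 75% quantiles
# 25% = 58, 75% = 77
IQR(x)                             #interquartile range directly
# [1] 19
```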
Measures of association
Measures of association describe the relationship between two or more variables.
In this case there is more than one way to deal with missing values, so instead of
the na.rm argument, here we have the argument use= to specify which values
to use in case of missing values. For the simple case of looking at correlations
between two variables, you can set this argument to use = "complete.obs"
which means that only cases without missing values go into the computation. If
an observation (i.e. a patient) has NA on at least one of the two variables, this
observation is excluded from the computation.
We can compute the Pearson and the Spearman correlation of the actual and
the estimated weight of the NINDS patients using the cor() function:
cor(NINDS_csv$WEIGHT, NINDS_csv$TWEIGHT, use = "complete.obs",
method = "pearson")
[1] 0.9313269
cor(NINDS_csv$WEIGHT, NINDS_csv$TWEIGHT, use = "complete.obs",
method = "spearman")
[1] 0.9362007
For the association of categorical variables, you'll mostly want to look at frequency tables of the categories. A frequency table for a single variable is created with the table() function:
table(NINDS_csv$TREATCD)

Placebo    t-PA
    312     312
But you can also use table() to generate cross tables for two variables:
table(NINDS_csv$BRACE, NINDS_csv$BGENDER)
female male
Asian 5 3
Black 75 94
Hispanic 16 21
Other 1 6
White, non-Hispanic 165 238
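If you would rather see relative frequencies than counts, you can wrap the table in prop.table() ; a minimal sketch with a small stand-in vector:

```r
x <- c("male", "female", "female", "male", "female")
prop.table(table(x)) #proportions instead of counts
# female = 0.6, male = 0.4
```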
General overview
If you want to get an overview over your entire data.frame, the summary()
function is convenient. This function can be used for a lot of different kinds of
R objects and gives a summary appropriate for whatever the input is. If you
give it a data.frame, summary() will give the minimal and maximal value, the 1st and 3rd quartile and the mean and median for every quantitative (i.e. numeric/integer) variable and a frequency table for every factor, as well as the number of missing values:
summary(NINDS_csv)
Dead :153 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
Good :231 1st Qu.: 6.00 1st Qu.: 4.00 1st Qu.: 2.00 1st Qu.: 1.00
Mod. Dis:133 Median :12.00 Median :11.00 Median : 8.00 Median : 6.00
Sev. Dis:105 Mean :12.76 Mean :12.33 Mean :12.08 Mean :12.85
Veget : 2 3rd Qu.:18.00 3rd Qu.:19.00 3rd Qu.:18.00 3rd Qu.:18.00
Max. :38.00 Max. :42.00 Max. :42.00 Max. :42.00
NA's :1 NA's :1 NA's :9
NIHSSB PART SURDAYS TREATCD
Min. : 1.00 Min. :1.000 Min. : 0.0 Placebo:312
1st Qu.: 9.00 1st Qu.:1.000 1st Qu.: 242.8 t-PA :312
Median :14.00 Median :2.000 Median : 366.0
Mean :14.79 Mean :1.534 Mean : 359.3
3rd Qu.:20.00 3rd Qu.:2.000 3rd Qu.: 378.0
Max. :37.00 Max. :2.000 Max. :1970.0
For this chapter we’ll start with the quick and easy ones and go from the most
broadly applicable plots that can be used for all types of data to the more
exclusive ones that can be used only for data types with a certain scaling.
We’ll start by introducing the basic plot functions without any customization
of labels or axes to give you an overview. When you create plots you want to
share, you should of course improve them as, e.g., shown in the last paragraph
of the chapter.
Barplot
A barplot can technically be used on every variable with a finite set of values.
The barplot() function takes a frequency table and produces a barplot from
it.
barplot(table(NINDS_csv$BRACE))
[Figure: barplot of the frequencies of NINDS_csv$BRACE]
Histogram
If you have a metric variable, you can also use the histogram:
hist(NINDS_csv$AGE)
[Figure: histogram of NINDS_csv$AGE, with ages from about 30 to 90 on the x-axis and Frequency on the y-axis]
Boxplot
If you have data that is at least ordinal, you can use the boxplot() function:
boxplot(NINDS_csv$SURDAYS)
[Figure: boxplot of NINDS_csv$SURDAYS]
You can also split the boxplot by another (categorical) variable using the ~
sign:
boxplot(NINDS_csv$SURDAYS ~ NINDS_csv$BGENDER)
[Figure: boxplots of NINDS_csv$SURDAYS, one for each level of NINDS_csv$BGENDER (female, male)]
Scatterplot
And we can use scatterplots to get an idea about the relationship of two metric
variables:
plot(NINDS_csv$TWEIGHT, NINDS_csv$WEIGHT)
[Figure: scatterplot of NINDS_csv$WEIGHT (y-axis) against NINDS_csv$TWEIGHT (x-axis)]
Customisation
Even these basic plots come with a whole lot of customisation options. We'll show you a couple of them, using the histogram as an example. You can find out about all possible options by going to the help page of the respective function (e.g. ?hist ).
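A sketch of the kinds of options available. The ages here are simulated so the code runs without the NINDS data; with the real data you would pass NINDS_csv$AGE instead:

```r
# simulated stand-in for NINDS_csv$AGE
set.seed(1)
age <- rnorm(500, mean = 67, sd = 12)

hist(age,
     breaks = 20,             #suggested number of bins
     main   = "My Title",     #plot title
     xlab   = "Age in years", #x-axis label
     col    = "grey")         #bar colour
```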
[Figures: three customised histograms of NINDS_csv$AGE, among them one with different breaks and one with the title "My Title"]
All the plotting functions we have just shown you are useful because they are easy to use. In a later chapter we will introduce the package ggplot2 (Wickham et al., 2020), which allows you to make plots for more complex displays like this one:
[Figure: ggplot2 scatterplot of the NIH stroke scale at 24 hours (x-axis) against the NIH stroke scale at 2 hours (y-axis), with separate panels for Placebo and t-PA and points coloured by the Glasgow outcome at six months (Dead, Good, Mod. Dis, Sev. Dis, Veget)]
2.5 Troubleshooting
1. Error in file(file, “rt”) : cannot open the connection […] No such file or
directory : The file you are trying to open probably doesn’t exist. Check
if you spelled the file name correctly. Also check if the working directory
actually contains the file you are trying to read.
2. Error in library(“xy”) : there is no package called ‘xy’ : You either mis-
spelled the package name or you haven’t installed the package yet.
3. Error in install.packages: object ‘xy’ not found : Have you forgotten to
put quotation marks around the package name?
4. Error in install.packages: package ‘xy’ is not available (for R version
x.x.x): Either you misspelled the package name or the package does not
exist, or it does not exist for your R version.
5. Error in plot.new() : figure margins too large : The plot window in the lower right corner of RStudio is too small to display the plot. Make it bigger by dragging its left margin further to the left and rerun the plotting function.
Chapter 3
Data Manipulation & Inference I
In this chapter we will learn some useful tools for data manipulation and then
go on to our first inferential statistics.
3.1 Data manipulation
For the data manipulation part we will use the package collection tidyverse , which you can install once with install.packages("tidyverse") and then load with:
library(tidyverse)
You can ignore the messages that are printed into the console upon loading the
package.
3.1.1 Tibbles
We have provided you with two data sets to try out the data manipulation functions, data1.csv and data2.csv. Make sure the files are stored in your working directory and then read them in with:
d1 <- read_csv("data1.csv")
d2 <- read_csv("data2.csv")
read_csv() guesses the type of each column and has read the variable id as a number. Since the id is a label rather than a quantity you want to calculate with, we convert it to a character with as.character() :
d1$id <- as.character(d1$id)
d2$id <- as.character(d2$id)
#check type
class(d1$id)
[1] "character"
class(d2$id)
[1] "character"
What you did there is converting the variable d1$id from a numeric to a
character and then storing this new version in place of the old version of d1$id .
If you look at the type of d1 using class() , you can see that it is more than
just a data.frame, it is for example also a tbl which is short for tibble.
class(d1)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
Basically, a tibble can be used for everything a data.frame can be used for, but it has some nice additional properties: for example, tibbles look nicer when printed to the console.
d1
# A tibble: 5 x 3
weight age id
<dbl> <dbl> <chr>
1 85 34 1
2 56 72 2
3 73 33 3
4 76 45 4
5 60 23 5
To keep only certain columns, the tidyverse provides the function select() . It takes the tibble as its first argument, followed by the names of the variables you want to keep:
select(d1, id, age)
# A tibble: 5 x 2
id age
<chr> <dbl>
1 1 34
2 2 72
3 3 33
4 4 45
5 5 23
It is also possible to specify the variables you want to throw out instead, by
putting a - before their names:
select(d1, -c(weight, id))
# A tibble: 5 x 1
age
<dbl>
1 34
2 72
3 33
4 45
5 23
When you use filter() to choose only certain rows, you'll mostly have some kind of rule for which cases to keep. These rules are expressed as logical statements (see Chapter 1). For example, the statement age > 40 would select all cases older than 40. You can also connect multiple conditions: (age > 30) & (weight > 60) & (weight < 75) , for example, selects all cases that are older than 30 and weigh between 60kg and 75kg. filter() takes a tibble as its first and a logical expression as its second argument:
filter(d1,(age > 30) & (weight > 60) & (weight < 75) )
# A tibble: 1 x 3
weight age id
<dbl> <dbl> <chr>
1 73 33 3
You can also subset a tibble with square brackets, using the pattern [<rows>, <columns>] . For example, to keep the first three rows of the columns weight and id :
d1[1:3, c("weight", "id")]
# A tibble: 3 x 2
weight id
<dbl> <chr>
1 85 1
2 56 2
3 73 3
If you want to keep all columns or all rows, you leave the corresponding element
in the [,] empty. You nevertheless have to keep the comma!
d1[1,] #only keep first row
# A tibble: 1 x 3
weight age id
<dbl> <dbl> <chr>
1 85 34 1
d1[,3] #only keep third column
# A tibble: 5 x 1
id
<chr>
1 1
2 2
3 3
4 4
5 5
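Both kinds of selection can be combined in a single bracket call. A small sketch, rebuilding the example data from above as a plain data.frame:

```r
# the d1 data set as shown above
d1 <- data.frame(weight = c(85, 56, 73, 76, 60),
                 age    = c(34, 72, 33, 45, 23),
                 id     = as.character(1:5))

# rows where age is above 40, and only the weight and age columns
sel <- d1[d1$age > 40, c("weight", "age")]
sel
```

This returns the two cases aged 72 and 45 together with their weights 56 and 76.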
3.1.3 Join
Another useful operation is the join, which allows you to combine two data sets
by a common key variable. d1 contains the weight and age of subjects, while
d2 contains their height and eye color. Let's try to compute their body mass
index. To do this, we need to join the two data sets because we need the weight
and height information in one place. If you look at the IDs of the subjects, you'll
notice that we cannot simply paste the two tibbles together because, firstly, the
rows are not in the same order and, secondly, each tibble contains one person
that the other doesn't (id 5 and 6):
d1
# A tibble: 5 x 3
weight age id
<dbl> <dbl> <chr>
1 85 34 1
2 56 72 2
3 73 33 3
4 76 45 4
5 60 23 5
d2
# A tibble: 5 x 3
height eyecolor id
<dbl> <chr> <chr>
1 156 brown 2
2 164 blue 1
3 189 brown 4
4 178 green 3
5 169 blue 6
There are several ways to deal with this.
Inner join
The inner join only keeps cases (i.e. rows) that appear in both data sets. The
function inner_join() takes two data.frames or tibbles and a string giving
the name of the key variable that defines which rows belong together:
inner_join(d1, d2, by="id")
# A tibble: 4 x 5
weight age id height eyecolor
<dbl> <dbl> <chr> <dbl> <chr>
1 85 34 1 164 blue
2 56 72 2 156 brown
3 73 33 3 178 green
4 76 45 4 189 brown
As you can see, cases 5 and 6, which only appeared in one of the data
sets, are left out of the result. If the key variable has different names in the
two data.frames, e.g. id and sno , you specify the by argument like this:
by = c("id" = "sno") .
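As a small sketch of this (the data set d3 and its key column sno are hypothetical):

```r
library(dplyr)

d1 <- data.frame(id  = c("1", "2", "3"), weight = c(85, 56, 73))
# hypothetical second data set whose key column is called 'sno'
d3 <- data.frame(sno = c("2", "3", "4"), height = c(156, 178, 189))

# match d1$id against d3$sno; the key is kept under the name 'id'
joined <- inner_join(d1, d3, by = c("id" = "sno"))
joined
```

Only the ids 2 and 3 appear in both data sets, so the result has two rows, and the key column keeps the name from the first data set.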
Full Join
The opposite of the inner join is the full join. full_join() takes the same
arguments as inner_join() but returns all cases. If a case doesn’t appear in
the other data set, the missing information is indicated with NA :
full_join(d1,d2,by="id")
# A tibble: 6 x 5
weight age id height eyecolor
<dbl> <dbl> <chr> <dbl> <chr>
1 85 34 1 164 blue
2 56 72 2 156 brown
3 73 33 3 178 green
4 76 45 4 189 brown
5 60 23 5 NA <NA>
6 NA NA 6 169 blue
Left and right join
Between these two extremes sit the left and the right join, which keep all cases
from one of the two data sets. left_join() keeps all cases from its first
argument:
left_join(d1,d2, by="id") #all cases from d1 are kept
# A tibble: 5 x 5
weight age id height eyecolor
<dbl> <dbl> <chr> <dbl> <chr>
1 85 34 1 164 blue
2 56 72 2 156 brown
3 73 33 3 178 green
4 76 45 4 189 brown
5 60 23 5 NA <NA>
right_join(d1,d2, by="id") #all cases from d2 are kept
# A tibble: 5 x 5
weight age id height eyecolor
First, we store the joined data in a new object:
d <- right_join(d1, d2, by="id")
Then we compute the BMI according to the formula BMI = weight[kg] / (height[m])^2 .
Since our data gives the height in cm instead of meters, we have to divide this
number by 100:
d$weight/(d$height/100)^2
As you can see, there's an NA for the last case, because we have no weight
information on this person. If you want to use the BMI in further analyses, it
makes sense to save it as a new variable in your tibble/data.frame. To create
a new variable in an existing data.frame, you write its name behind a $ and
use the assignment operator like this:
d$BMI <- d$weight/(d$height/100)^2
You can now see that your data set d has BMI as a variable:
d
# A tibble: 5 x 6
weight age id height eyecolor BMI
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 85 34 1 164 blue 31.6
2 56 72 2 156 brown 23.0
3 73 33 3 178 green 23.0
4 76 45 4 189 brown 21.3
5 NA NA 6 169 blue NA
You can also create variables using logical operators. Suppose you want to
create an indicator variable blueEyes that takes the value 1 when a person
has blue eyes and the value 0 otherwise. First, you create a variable that takes
the value 0 for everyone:
d$blueEyes <- 0
# A tibble: 5 x 7
weight age id height eyecolor BMI blueEyes
<dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 85 34 1 164 blue 31.6 0
2 56 72 2 156 brown 23.0 0
3 73 33 3 178 green 23.0 0
4 76 45 4 189 brown 21.3 0
5 NA NA 6 169 blue NA 0
Now you have to set the variable to 1 for every blue-eyed person. You can create
a logical vector indicating the blue-eyed persons like this:
d$eyecolor == "blue"
[1] TRUE FALSE FALSE FALSE TRUE
Using this vector to index the rows of d returns exactly the blue-eyed cases:
d[d$eyecolor == "blue",]
# A tibble: 2 x 7
weight age id height eyecolor BMI blueEyes
<dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 85 34 1 164 blue 31.6 0
2 NA NA 6 169 blue NA 0
You can now use this selection in combination with $blueEyes to set exactly
those values to 1:
d[d$eyecolor == "blue",]$blueEyes
[1] 0 0
d[d$eyecolor == "blue",]$blueEyes <- 1
# A tibble: 5 x 7
weight age id height eyecolor BMI blueEyes
<dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 85 34 1 164 blue 31.6 1
2 56 72 2 156 brown 23.0 0
3 73 33 3 178 green 23.0 0
4 76 45 4 189 brown 21.3 0
5 NA NA 6 169 blue NA 1
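The two steps above (initialise with 0, then overwrite the blue-eyed cases) can also be collapsed into a single line with ifelse() , which returns its second argument wherever the condition is TRUE and its third argument wherever it is FALSE . A sketch on a plain vector:

```r
# the eye colors as shown in the data set above
eyecolor <- c("blue", "brown", "green", "brown", "blue")

# 1 where the condition is TRUE, 0 where it is FALSE
blueEyes <- ifelse(eyecolor == "blue", 1, 0)
blueEyes
# [1] 1 0 0 0 1
```

In the data set above this would read d$blueEyes <- ifelse(d$eyecolor == "blue", 1, 0) .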
Now we can install plyr and then load the two packages in the correct order
( plyr has to be loaded before dplyr ):
install.packages("plyr") #only do this once
library(plyr)
library(dplyr)
table(NINDS$STATUS24)
high low
300 283
We can now try to predict the status at 24 hours from the NIH stroke scale
value at 2 hours, HOUR2 . Plotting a ROC curve will give us an idea about the
diagnostic usefulness of HOUR2 . To import the necessary functions, we'll use
the package ROCit (Khan and Brandenburger, 2020):
install.packages("ROCit") # only do this once
library(ROCit)
The function rocit() takes the arguments score (a numeric variable that
is used for diagnosis) and class (the factor that contains the condition to be
diagnosed). Since STATUS24 is currently stored as a character instead of a
factor in our data set, we have to convert it to a factor with this line of code:
NINDS$STATUS24 <- factor(NINDS$STATUS24, levels=c("low", "high"))
Now we can run the function rocit() . It returns an R object containing the
diagnostic information which we will store in the variable rocObject :
rocObject <- rocit(NINDS$HOUR2, NINDS$STATUS24)
plot(rocObject)
[ROC curve: sensitivity (TPR) plotted against 1−Specificity (FPR)]
We can also extract a number of diagnostic properties for every cutoff with the
function measureit() . This function takes the object returned by rocit()
and the argument measure , a character string specifying which properties to
compute. A list of the possible measures can be found on the help page for the
function. Here we will use the sensitivity (SENS), the specificity (SPEC), the
positive and negative predictive value (PPV and NPV) and the positive and
negative diagnostic likelihood ratio (pDRL and nDLR). Because the result is
quite large, we’ll save it under the name properties instead of just printing
it to the console.
properties<-measureit(rocObject,
measure = c("SENS", "SPEC", "PPV", "NPV", "pDLR", "nDLR" ))
The object properties is a list. To get an overview of its elements, click
on the little triangle in the blue circle next to properties in the environment
window of RStudio. You can have a closer look at its elements using the $ :
properties$Cutoff[1:10]# the first 10 cutoff values
[1] Inf 38 36 34 33 32 31 30 29 28
head(properties$SENS, 10)#Sensitivity for the first 10 cutoffs
3.3 Comparing two samples
t-test
The t-test can be used to test whether the means of normally distributed metric
variables differ significantly. You can either test the sample mean against a
prespecified mean or you can test whether the means of two samples differ
significantly. For a one-sample t-test you use the function t.test() , which
takes a numeric vector and mu , the mean you want to test your sample mean
against:
#test if average age differs from 50
t.test(x=NINDS$AGE, mu = 50)
data: NINDS$AGE
t = 36.319, df = 623, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
66.02571 67.85782
sample estimates:
mean of x
66.94177
In the output, you find the value of the test statistic (i.e. the t value), its degrees
of freedom and the corresponding p-value in the first row. If the p-value is
very small, it is displayed in scientific notation, e.g. p-value < 2.2e-16 , which
means p < 2.2 × 10^-16. In the row below you get the alternative hypothesis
in words and below that the confidence interval for the sample mean. You can
judge the significance of the result by checking whether your p-value falls below
the significance level (e.g. 0.05) or by checking whether the confidence interval
includes mu , i.e. 50.
To report the results of a t-test, you give the t-value with the corresponding
degrees of freedom and the p-value. When the p-value is very small, it is
conventional to just report it to be below a certain margin, e.g. 0.001. In this
case the mean age of our sample differs significantly from 50 with t(623)=36.319,
p<0.001. It is often useful to also include the estimated mean and its confidence
interval (in this case 66.94 [66.03; 67.86]) when you report t-test results, so the
reader can also judge the clinical relevance, not just the statistical significance,
of the result.
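When you report these numbers, you don't have to copy them from the console by hand: the object returned by t.test() is a list whose elements can be accessed with $ . A sketch with made-up data:

```r
x <- c(52, 61, 58, 49, 66, 57, 60, 55)  # made-up sample

tt <- t.test(x, mu = 50)

tt$statistic  # the t value
tt$parameter  # the degrees of freedom
tt$p.value    # the p-value
tt$conf.int   # the confidence interval for the mean
```

This way you can round, format and paste the values into a report programmatically instead of retyping them.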
For the two sample t-test, there are two ways to pass your samples to the
function. We will show you one version here and the other in the example
for paired t-tests. In the first version, you pass to the function the variable
containing the measured values and a grouping variable indicating which values
belong to which group, separated by a ~ .
To compare the mean of the body weight between men and women for example,
you write NINDS$WEIGHT ~ NINDS$BGENDER to indicate you want to compare
the weight by gender. This formulation stresses the view of testing the relation-
ship between WEIGHT and the dichotomous variable BGENDER . Since we want
the unpaired version of the t-test, we set the argument paired=FALSE :
t.test(NINDS$WEIGHT ~ NINDS$BGENDER, paired=FALSE)
As you can see, for the unpaired t-test R defaults to the Welch test,
which is robust to different variances in your two samples. If you are confident
the variances are equal, you can set the argument var.equal=TRUE to get the
standard t-test.
The alternative hypothesis and the confidence interval now refer to the mean
difference between men and women. Thus, a significant result is indicated by a
confidence interval that doesn’t include 0 and a p-value < 0.05. So the results
tell us that in our sample, the weight differs significantly between men and
women with t(543.11) = -9.47, p<0.001.
Wilcoxon-Mann-Whitney test
If your variables are not normally distributed, you can compare two samples
with the Wilcoxon-Mann-Whitney test instead. The function wilcox.test()
accepts the same formula notation as t.test() :
wilcox.test(NINDS$HOUR24 ~ NINDS$BGENDER)
The results tell us that the NIHSS at 24 hours differs significantly between men
and women with p=0.01781.
Chi-square-test
The Chi-square test can be used to test whether there is an association between
two dichotomous variables or, put differently, whether the probability distribution
of one of the two variables differs between the groups defined by the other
variable. We can for example test whether the NIH stroke scale status at 24
hours (with values high and low ) differs between men and women by passing
the two dichotomous variables to the function chisq.test() :
chisq.test(NINDS$BGENDER, NINDS$STATUS24)
T-Test
For paired samples of metric, symmetric and normally distributed variables we
can use the paired t-test. It works the same way as the unpaired t-test, we just
have to set the argument paired=TRUE . As announced above, we'll show you
another way of specifying the data here. If your observations are stored in two
different vectors (as opposed to the example for the unpaired t-test, where all
observations of the weight were stored in one vector that could be divided by
the factor gender), you pass those two vectors, separated by a comma, to the
t-test function:
t.test(NINDS$WEIGHT, NINDS$TWEIGHT, paired=TRUE)
Paired t-test
Wilcoxon test
If your samples are metric and symmetric but not normally distributed, you
can use the Wilcoxon test. You have already encountered the wilcox.test()
function before; now we just have to set the argument paired=TRUE :
wilcox.test(NINDS$HOUR2, NINDS$HOUR24, paired=TRUE)
Sign-Test
The Sign-test can be used when the samples are ordinal with many possible
values. It tests whether the median of the pairwise differences of the two samples
is equal to 0 (for two-sided tests) or less/greater than 0 (for one-sided tests). If
the median of the differences is greater than 0, this means that for the majority
of value pairs the first variable is greater; if the median is less than 0, it means
that for the majority of value pairs the second variable is greater. To do this
test in R, we can use the SIGN.test() function from the BSDA package
(Arnholt and Evans, 2017):
we can use the SIGN.test() function from the BSDA package (Arnholt and
Evans, 2017):
install.packages("BSDA") #only do this once
library(BSDA)
SIGN.test(NINDS$HOUR2, NINDS$HOUR24, alternative = "greater")
Dependent-samples Sign-Test
median of x-y
0
McNemar test
The McNemar test tests whether two dichotomous variables occur with different
frequencies (i.e. different probabilities for the "yes" event). It can be used on
paired samples, e.g. where both variables have been observed in the same set of
patients, like history of diabetes ( BDIAB ) and history of hypertension ( BHYPER ):
mcnemar.test(NINDS$BDIAB, NINDS$BHYPER)
3.4 Troubleshooting
1. Error in xy: could not find function "xy" , where xy stands for the function
in your error message. You probably forgot to load the package containing
the function, or you misspelled the function name. If you're unsure which
package it belongs to, consider googling the function.
2. Error in library(xy): there is no package called 'xy' , where xy stands
for the package name in your error message. You probably forgot to
install the package before loading it. Try install.packages("xy") . If
the installation fails, check whether you are connected to the internet and
have sufficient rights on your computer to install software.
Chapter 4
Inference II
In this chapter you will learn how to compute risks and odds, do regression
analysis and time-to-event analysis (also known as survival analysis).
4.1 Risk and Odds
Let's compute the odds and risks for a high NIH stroke scale value at 24 hours
( STATUS24 ) in the treatment and placebo groups ( TREATCD ). First of all, we'll
have to compute a contingency table of the two variables:
tab <- table(NINDS$TREATCD, NINDS$STATUS24)
tab
high low
Placebo 173 132
t-PA 127 151
Then we can supply this table to the epi.2by2 function from the epiR pack-
age:
library(epiR)
epi.2by2(tab)
In the upper part you can see the contingency table we provided. For
interpretation you need to compare the rows with the original contingency table.
Then you can see that Exposed + is the Placebo group, Exposed - is the
t-PA group, Outcome + is the high status group and Outcome - is the low
status group. In the two rightmost columns you can see the risk for a high status
under Inc risk * and the odds for a high status under Odds . Note that the
risk is given in percent.
You can see that in the Placebo group 56.7% of the patients have a high status;
the odds of having a high status vs. having a low status are 1.311 in this group,
meaning that on average there are 1.311 patients with a high status per patient
with a low status. In the t-PA group, on the other hand, the risk is lower at
45.7% and the odds of having a high status are only 0.841.
When we want to compare the two groups, we can look at the table under
Point estimates and 95% CIs . Here you can see the Inc risk ratio of
1.24, which means that the risk for a high status is increased by a factor of 1.24
in the Placebo group vs. the t-PA group. The odds ratio tells us, that the odds
in the placebo group are increased by a factor of 1.56. Finally, the risk difference
can be found under Attrib risk * , telling us that the risk is 11.04% higher
in the Placebo group than in the t-PA group.
You can check the respective confidence intervals to see if there is a significant
difference between the groups. For ratios, the 95%-confidence intervals should
not include 1, for the difference, the confidence interval should not include 0 to
indicate a significant result at a significance level of 0.05. This is true for all 3
estimates in our case.
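You can reproduce all of these numbers by hand from the contingency table:

```r
# counts from the contingency table above
placebo_high <- 173; placebo_low <- 132
tpa_high     <- 127; tpa_low     <- 151

risk_placebo <- placebo_high / (placebo_high + placebo_low)  # about 0.567
risk_tpa     <- tpa_high     / (tpa_high + tpa_low)          # about 0.457

risk_placebo / risk_tpa                               # risk ratio, about 1.24
risk_placebo - risk_tpa                               # risk difference, about 0.11
(placebo_high / placebo_low) / (tpa_high / tpa_low)   # odds ratio, about 1.56
```

Computing them once by hand makes the epi.2by2 output much easier to read.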
4.2 Regression
To get to know regression, let’s go back to the iris data set we have used
before:
data(iris)#load data set
View(iris)
This data.frame contains information on the sepal and petal lengths of 150
plants from three species of flowers: setosa, versicolor and virginica:
table(iris$Species)
We can also compute the mean sepal length for each species:
aggregate(list(meanLength = iris$Sepal.Length), list(Species = iris$Species), mean)
Species meanLength
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
As you can see, there are differences between the three means, but are those
differences systematic or due to chance? This question can be answered using
linear regression analysis, which is based on a linear model specifying the
relationship between the dependent variable and one or several independent
variables. In our case, we take the sepal length as the dependent variable y and
the species as the independent variable x.
We first fit the linear model with the lm() function, which takes a formula of
the form dependent ~ independent and the data set, and store the result under
the name linMod :
linMod <- lm(Sepal.Length ~ Species, data=iris)
To get the actual regression analysis results, we use the summary() function
on the linear model:
summary(linMod)
Call:
lm(formula = Sepal.Length ~ Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can additionally get confidence intervals for the model coefficients with
confint() :
confint(linMod)
2.5 % 97.5 %
(Intercept) 4.8621258 5.149874
Speciesversicolor 0.7265312 1.133469
Speciesvirginica 1.3785312 1.785469
As a second example, here is a regression checking for an association between
weight and the history of hypertension:
mod1 <- lm(WEIGHT ~ BHYPER, data=NINDS)
summary(mod1)
Call:
lm(formula = WEIGHT ~ BHYPER, data = NINDS)
Residuals:
Min 1Q Median 3Q Max
-36.461 -11.737 -0.226 10.157 100.455
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 76.143 1.221 62.375 <2e-16 ***
BHYPERYes 2.995 1.501 1.995 0.0465 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With a single dichotomous predictor like BHYPER , this model is equivalent to
comparing the two group means with a t-test. We can also include more than
one independent variable, for example hypertension and diabetes:
mod2 <- lm(WEIGHT ~ BHYPER + BDIAB, data=NINDS)
summary(mod2)
Call:
lm(formula = WEIGHT ~ BHYPER + BDIAB, data = NINDS)
Residuals:
Min 1Q Median 3Q Max
-37.402 -11.872 -0.473 9.958 101.940
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 75.303 1.234 61.047 < 2e-16 ***
BHYPERYes 2.349 1.500 1.566 0.117962
BDIABYes 6.050 1.746 3.466 0.000566 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s first go through the coefficients one by one. The intercept 75.303 is the
mean weight for a person without hypertension and without diabetes. For pa-
tients with hypertension, the mean weight increases by 2.349 kg to 77.652. For
patients with diabetes, the mean weight increases by 6.050 kg, resulting in a
mean of 81.353 kg for patients with diabetes only. Patients with hypertension
and diabetes on the other hand end up with a mean of 75.303 + 2.349 + 6.050 =
83.702 kg.
Looking at the p-values we can however see that only the effect of diabetes
on weight is significant (p<0.001), while there is no significant effect of the
hypertension anymore (p=0.12).
In this scenario we say that, controlling for diabetes, there is no effect of
hypertension on the weight of patients. This means that BHYPER doesn't contain
information about the weight of patients that is not already represented in the
information in BDIAB .
If the dependent variable is dichotomous instead of metric, we can use logistic
regression, which models the log odds of the outcome. In R this is done with
the glm() function and the argument family=binomial . Let's model the
presence of hypertension by weight and age and store the result as mod3 :
mod3 <- glm(BHYPER ~ WEIGHT + AGE, family=binomial, data=NINDS)
summary(mod3)
Call:
glm(formula = BHYPER ~ WEIGHT + AGE, family = binomial, data = NINDS)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9008 -1.2700 0.7703 0.9067 1.3740
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.328955 0.757860 -4.393 1.12e-05 ***
WEIGHT 0.017634 0.005311 3.320 0.000899 ***
AGE 0.039643 0.007853 5.048 4.46e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In a logistic regression, each coefficient tells you by how much the log odds of
the outcome change when the corresponding independent variable increases by
one unit. The log odds for hypertension increase by 0.018 for every kg a person
gains. They increase by 0.04 for every year older a person is. That means, for
example, that a person that is ten years older than another person has 0.04 ⋅ 10
= 0.4 higher log odds for hypertension.
Because log odds are hard to interpret, the coefficients are often exponentiated,
resulting in more interpretable odds (for the intercept) and odds ratios (for the
other coefficients). To do this in R, we can directly extract the coefficients from
the mod3 object:
coef(mod3) # the original coefficients
exp(coef(mod3)) # the exponentiated coefficients
This output tells us, for example, that the odds of having hypertension increase
by a factor of 1.04 for every additional year of life. For two years, the odds
accordingly increase by 1.04 ⋅ 1.04 = 1.04^2 ; for 10 years they increase by
1.04^10 ≈ 1.48. It is important to keep in mind that the additive nature of the
coefficients on the original (i.e. log odds) scale turns into a multiplicative nature
when we transform the coefficients with an exponential transformation (i.e. to
the odds scale).
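You can verify this multiplicative behaviour directly in R: adding the coefficient k times on the log odds scale corresponds to multiplying by the exponentiated coefficient k times on the odds scale:

```r
b_age <- 0.039643  # the AGE coefficient from the model above

exp(10 * b_age)  # odds ratio for a 10-year age difference
exp(b_age)^10    # exactly the same number
```

Both lines print the same value, because exp(10 * b) is mathematically identical to exp(b)^10.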
You can of course again look at confidence intervals on both scales, too:
#original coefficients/logit
confint(mod3)
2.5 % 97.5 %
(Intercept) -4.839137893 -1.86282271
WEIGHT 0.007388658 0.02824346
AGE 0.024427301 0.05526389
#odds/odds ratios
exp(confint(mod3))
2.5 % 97.5 %
(Intercept) 0.007913874 0.1552338
WEIGHT 1.007416022 1.0286461
AGE 1.024728092 1.0568195
4.3 Time-to-event analysis
Time-to-event data, traditionally often called survival data, comes from studies
where patients were followed over time until a particular event (e.g. death or
relapse) occurs. We usually analyse this data using the Kaplan-Meier-estimator.
Two good packages with functions to compute the Kaplan-Meier estimator as
well as a couple of other useful statistics are the packages survival (Therneau,
2020) and survminer (Kassambara et al., 2020) so we will install and load these
packages:
install.packages("survival")
install.packages("survminer")
library(survival)
library(survminer)
First we compute a so-called survival object for the survival of the NINDS
patients with the function Surv() . The result of this function is an R object
we can use for the actual survival analyses that follow. Surv() expects two
arguments: the survival time ( SURDAYS in our case) and a numeric variable
indicating whether the subject died or not. Because the variable DCENSOR
containing this information is a factor, we have to wrap it in the as.numeric()
function to turn it into a numeric:
s <- Surv(time=NINDS$SURDAYS, event=as.numeric(NINDS$DCENSOR))
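The as.numeric() step is worth a closer look: applied to a factor, it returns the internal level codes (1, 2, …) rather than the original labels. Surv() accepts a 1/2 coding as censored/event, so the order of the factor levels matters. A sketch with made-up labels:

```r
f <- factor(c("censored", "dead", "censored"))

levels(f)      # "censored" "dead" -- alphabetical by default
as.numeric(f)  # 1 2 1 -- so 1 = censored and 2 = event for Surv()
```

If the level order of your status variable were reversed, the events and the censored cases would be swapped, so it pays to check levels() before building the survival object.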
To compute the Kaplan-Meier estimator for the whole sample, we pass the
survival object to survfit() and plot the result with ggsurvplot() :
sf <- survfit(s ~ 1)
ggsurvplot(sf)
[Kaplan-Meier curve for all patients ( Strata: All ): time from 0 to 2000 days on
the x-axis, survival probability on the y-axis; the + signs mark censored
observations]
However, it is more interesting to compare the survival of different groups.
Lets compare the survival of the treatment group t-PA vs. the control group
Placebo from the variable TREATCD :
sf_treat<-survfit(s ~ TREATCD, data=NINDS)
ggsurvplot(sf_treat)
[Kaplan-Meier curves for the two treatment groups: time from 0 to 2000 days on
the x-axis, survival probability on the y-axis]
The ggsurvplot() function also has a lot of nice additional options:
risk.table=TRUE adds a table with the numbers at risk under the plot,
pval=TRUE adds the p-value of the log-rank test comparing the survival of
the two groups and pval.method=TRUE prints the name of the test above the
p-value:
ggsurvplot(sf_treat, risk.table = TRUE, pval = TRUE, pval.method = TRUE)
[Kaplan-Meier curves for the two treatment groups with a number-at-risk table;
the log-rank test gives p = 0.26. Number at risk at days 0/500/1000/1500/2000:
TREATCD=Placebo 312/52/9/2/0, TREATCD=t-PA 312/56/12/2/0]
It is also possible to look at the Kaplan Meier estimates as numbers directly by
calling summary(sf_treat) , but because the output is rather long, we won’t
print it here.
Chapter 5
Advanced Use
In the final chapter we'll have a look at some of the functionalities of R that
make it superior to conventional statistics software. We'll cover the basic
programming you need to write your own functions and show you how to make
publication-ready plots with ggplot2 .
5.1 Programming basics
A new function is defined with the function() keyword:
mySum <- function(<arguments>){
<body>
}
In this block of code a function is defined and given the name mySum using the
assignment operator <- .
<arguments> is a comma-separated list of the input data you need for your
computation and <body> describes the operations that need to be done for
the computation. For better readability, we usually spread the <body> over
several lines enclosed by {} .
For example, a function that adds two numbers:
mySum <- function(x, y){
x + y
}
mySum(5, 2)
[1] 7
You can also give an argument a default value that is used whenever the
argument is not specified in the call:
mySum2 <- function(x, y=10){
x + y
}
mySum2(5)
[1] 15
But you can overwrite the default:
mySum2(5,2)
[1] 7
You can also call other functions inside your function. For example you can
write a function that computes the mean difference of two vectors:
meandiff <- function(x,y){
result <- mean(x) - mean(y)
result
}
v1<-c(1,2,3)
v2<-c(10,20,30)
meandiff(v1,v2)
[1] -18
5.1.2 Conditional statements
Sometimes code should only be executed if a certain condition is met. This is
what the if statement is for:
bodytemp <- 39
if(bodytemp>=38){
"fever"
}
[1] "fever"
You can change the value of bodytemp to different values to see how the con-
ditional statement works. In the condition part if(<logical statement>)
you test a logical condition of the kind you’ve learned about in the first chapter.
Then follows the body {<what to do>} that specifies the code you want to
execute if the condition evaluates to TRUE .
In the above code nothing happens if the condition is not met. If you want
your code to return a "no fever" for cases where bodytemp < 38 , you can
extend the statement by an else part:
bodytemp <- 37
if(bodytemp>=38){
"fever"
}else{
"no fever"
}
[1] "no fever"
You can also store the result of the conditional statement in a variable instead
of just printing it:
bodytemp <- 39
if(bodytemp>=38){
status<-"fever"
}else{
status<-"no fever"
}
status
[1] "fever"
You can also check different conditions in a row using else if in between.
The line breaks are just for readability but make sure you keep track of all the
opening and closing brackets!
tempChecker <- function(bodytemp){
if(bodytemp<36){
status<-"too low"
}else if(bodytemp>=38){
status<-"fever"
}else{
status<-"normal"
}
status
}
tempChecker(37)
[1] "normal"
In this code, the conditions are checked in the order they appear in. If the
first condition applies, the first block of code is executed and the rest of the
if else statement is ignored. If the first condition is not met, the second
condition is evaluated. If it is TRUE , the following code block is executed and
the rest of the statement is ignored. When all of the conditions have been tested
and evaluated to FALSE , the last code block from the else part is executed.
5.1.3 Loops
The final structure is the loop: A loop allows you to assign repetitive tasks to
your computer instead of doing them yourself. The first kind of loop you’ll learn
about is the for loop. In this loop you specify the number of repetitions for a
task explicitly. The following loop prints the numbers from 1 to 5:
for (i in 1:5) {
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
In the () part you define the counting variable, which is often called i (but
can have any other name too), and the values this counting variable should take
(the values 1 to 5 in our case). In the {} part we then define the
task for every iteration. print(i) simply tells R to print the value of i into
the console. So the above loop has 5 iterations in each of which the current
value of i is printed to the console.
Of course we can also have proper computations. For example we can add up
all the numbers from 1 to 1000 with this code:
result <- 0
for(i in 1:1000){
result <- result + i
}
result
[1] 500500
In the above code the value of result is 0 to begin with. Then the loop
enters its first round and the value of result is updated to the current value
of result plus the current value of i , so 0 + 1 = 1 . Then the second
iteration starts and the same happens again: The current value of result is
updated by adding the current value of i to it, so result is now 1 + 2 = 3
etc.
Sometimes a repetitive task has to be done until a certain condition is met, but
we cannot tell beforehand how many iterations it is going to take. In these cases,
we can use the while loop. For example you can count how often you have to
add 0.6 until you get to a number that is greater than 1000:
x <- 0
counter <- 0
while(x <= 1000){
x <- x + 0.6
counter <- counter + 1
}
counter
[1] 1667
Before the loop starts, both x and counter have the value 0. Then in every
iteration, x grows by 0.6 and counter by 1 to count the number of iterations.
As soon as the condition in () is no longer met (i.e. when x is greater
than 1000), the loop stops. As you can see, it takes 1667 iterations to make x
greater than 1000.
The previous examples are of course just toy examples to demonstrate the basic
functionality of loops. In practice we can use a loop for more useful tasks, for
example to create the same kind of graphic for a large
number of variables. This brings us to the final topic of this course: how to
produce plots using ggplot2 .
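As a sketch of such a task (the data set and variable names here are made up), a for loop can draw one histogram per variable:

```r
# toy data set with two numeric variables
d <- data.frame(HOUR2  = rnorm(100, mean = 10, sd = 4),
                HOUR24 = rnorm(100, mean = 9,  sd = 5))

# draw one histogram per variable; hist() invisibly returns the binning info
res <- list()
for (v in c("HOUR2", "HOUR24")) {
  res[[v]] <- hist(d[[v]], main = v, xlab = v)
}
```

The same pattern works for dozens of variables: only the vector of names in the for() part grows, not the code.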
5.2.1 Structure
In ggplot you build your graphics layer by layer as if you were painting a picture.
You start by laying out a blank sheet (with the basic function ggplot() ) upon
which you add graphical elements (called geoms ) and structural elements (like
axis labels, colour schemes etc.).
To start, let's install the package and load the NINDS data set again:
install.packages("ggplot2") #only do this once
library(ggplot2)
d <- read.csv("NINDS.csv")
We create the basic plot object with ggplot() and store it under the name
my_plot :
my_plot <- ggplot(data=d)
In this call, we tell the graphic that our data comes from the data set d .
But since we haven't told ggplot what to draw yet, my_plot only produces a
blank graph:
my_plot
To draw a scatterplot of the NIHSS value at 2 hours against the value at 24
hours, we add the geom geom_point() :
my_plot + geom_point(aes(x=HOUR2, y=HOUR24))
[Scatterplot of HOUR24 (y-axis) against HOUR2 (x-axis)]
The aes() function is part of every geom. It is short for aesthetic and is used
to specify every feature of the geom that depends on variables from the data
frame, like the definition of the x- and y-axis.
Within aes() we can for example set the color of geom_point() to depend
on the gender:
my_plot + geom_point(aes(x=HOUR2,y=HOUR24, color=BGENDER))
[Scatterplot with the points colored by BGENDER (female/male)]
If you want to set a feature of the geom that doesn’t depend on any of the
variables (e.g. setting one color for all the points), this is done outside of the
aesthetics argument:
my_plot + geom_point(aes(x=HOUR2,y=HOUR24), color="blue")
[Scatterplot with all points drawn in blue]
You can also add more than one layer to the plot. For example, we could superim-
pose a (non-linear) regression line by simply adding the geom geom_smooth() :
my_plot + geom_point(aes(x=HOUR2,y=HOUR24, color=BGENDER)) +
geom_smooth(aes(x=HOUR2, y=HOUR24))
[Scatterplot colored by gender with a superimposed smoothed regression line]
With more than one layer it is easier to format the code with line breaks. These
breaks don't affect the functionality in any way; just make sure you mark all
the lines when executing the code. Note that each line but the last has to end
with a + for R to know that those lines belong together.
When several layers share the same aesthetics it can be useful to define these
aesthetics in the basic plot produced by ggplot() :
my_plot2 <- ggplot(data=d, aes(x=HOUR2, y=HOUR24, color=BGENDER))
my_plot2 + geom_point() + geom_smooth()
[The same scatterplot colored by gender with a smoothed line as before]
Instead of defining the aesthetics for each geom separately, geom_point() and
geom_smooth() inherit the aesthetics of my_plot2 and the graphic looks
exactly the same.
5.2.2 Labels
With the labs() function you can give the plot a title and proper axis labels:
my_plot2 +
geom_point() +
geom_smooth() +
labs(title = "Scatterplot",
x= "NIHSS at 2 Hours",
y="NIHSS at 24 Hours")
[The scatterplot with the title "Scatterplot" and the axes labeled "NIHSS at 2
Hours" and "NIHSS at 24 Hours"]
5.2.3 Facets
So far we have divided our graph using different colors. It is however also
possible to split the graph according to one or more variables in the data frame
using facet_wrap() . To split the graph by presumptive diagnosis ( TDX ) we
write:
my_plot2 +
geom_point() +
geom_smooth() +
labs(title = "Scatterplot",
x= "NIHSS at 2 Hours",
y="NIHSS at 24 Hours") +
facet_wrap(~ TDX)
[Scatterplot split into one facet per presumptive diagnosis, e.g. Cardioembolic
and Large vessel occlusive]
And if we want to split the graph by gender, too, we simply add BGENDER .
With ncol=2 we can also tell ggplot to display the plots in two columns:
my_plot2 +
geom_point() +
geom_smooth() +
labs(title = "Scatterplot",
x= "NIHSS at 2 Hours",
y="NIHSS at 24 Hours") +
facet_wrap(~ TDX + BGENDER, ncol=2)
[Figure: the scatterplot faceted by TDX and BGENDER, arranged in two columns]
5.2.4 Theme
The theme of a ggplot can be used to change the default appearance of the entire
plot or to change specific components of your plot. To find a list of complete
themes, go to https://ggplot2.tidyverse.org/reference/ggtheme.html or install
the package ggthemes which contains even more complete themes. The default
theme of ggplot is theme_grey , but we can change it like this:
my_plot2 +
geom_point() +
geom_smooth() +
labs(title = "Scatterplot",
x= "NIHSS at 2 Hours",
y="NIHSS at 24 Hours") +
theme_bw()
[Figure: the scatterplot rendered with theme_bw()]
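Any other complete theme can be swapped in the same way. As a sketch, assuming the ggthemes package mentioned above is installed, and using the built-in mtcars data rather than the stroke data so the example runs on its own:

```r
library(ggplot2)
library(ggthemes)  # assumed installed: install.packages("ggthemes")

# The same plot skeleton as above, with a complete theme from ggthemes
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_economist()  # one of the extra complete themes shipped with ggthemes
```

Exactly as with theme_bw(), the complete theme is simply added as the last layer of the plot.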
If on the other hand you want to change only certain elements, for example the font size or type of your axis labels, you use theme() , which allows you to customize every element of your plot. To change text elements, you give an element_text() to the appropriate argument of theme() . Within element_text() you can set the font size, font type, font color, font face and many more aspects. The arguments that take element_text() objects are, for example, axis.text for the numbers on the axes, axis.title for the axis labels and plot.title for the plot title:
my_plot2 +
geom_point() +
geom_smooth() +
labs(title = "Scatterplot",
x= "NIHSS at 2 Hours",
y="NIHSS at 24 Hours") +
theme(axis.text.x = element_text(size=15),
axis.text.y= element_text(size=10),
axis.title = element_text(size=16, face="italic"),
plot.title = element_text(size=18, face="bold"))
[Figure: the scatterplot with enlarged axis numbers, italic axis labels and a bold title]
5.3 Further reading
5.3.1 Webpages
• R for Data Science (Wickham and Grolemund, 2017), available at https://r4ds.had.co.nz/, an online book with very clear and detailed introductions that focuses more on R and how to use it for data analysis and less on traditional statistics.
• Modern Dive (Ismay and Kim, 2019), available at https://moderndive.com/, an online book giving an introduction to R but with a strong focus on statistical inference.
• STHDA (http://www.sthda.com/english/wiki/r-basics-quick-and-easy), a website with short, hands-on tutorials explaining how to do a number of statistical analyses, including help with output interpretation.
• rdocumentation (https://www.rdocumentation.org/), a collection of the help pages to all the R packages and functions that is a bit more nicely formatted.
5.3.2 Books
• R for Data Science and Modern Dive are also available as physical
books
• Discovering Statistics Using R (Field et al., 2012), an extensive but
very accessible and entertaining introduction to statistics from the very
basic to advanced statistical analyses with examples in R
Bibliography
Arnholt, A. T. and Evans, B. (2017). BSDA: Basic Statistics and Data Analysis.
R package version 1.2.0.
Field, A., Miles, J., and Field, Z. (2012). Discovering Statistics Using R. Sage
Publications Ltd.
Grolemund, G. and Wickham, H. (2011). Dates and times made easy with
lubridate. Journal of Statistical Software, 40(3):1–25.
Marler, J. R., Brott, T., Broderick, J., Kothari, R., Odonoghue, M., Barsan,
W., and et al. (1995). Tissue plasminogen activator for acute ischemic stroke.
New England Journal of Medicine, 333(24):1581–1588. PMID: 7477192.
Schauberger, P. and Walker, A. (2020). openxlsx: Read, Write and Edit xlsx
Files. R package version 4.2.3.
85
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R.,
Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L.,
Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P.,
Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., and Yutani,
H. (2019). Welcome to the tidyverse. Journal of Open Source Software,
4(43):1686.
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke,
C., Woo, K., Yutani, H., and Dunnington, D. (2020). ggplot2: Create Elegant
Data Visualisations Using the Grammar of Graphics. R package version 3.3.2.
Wickham, H. and Grolemund, G. (2017). R for Data Science: Import, Tidy,
Transform, Visualize, and Model Data. O’Reilly Media, 1 edition.