R For Health Data Science
Ewen Harrison
Riinu Pius
First edition published 2021
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot as-
sume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including pho-
tocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for iden-
tification and explanation without intent to infringe.
Preface xvii
2 R basics 13
2.1 Reading data into R . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 Import Dataset interface . . . . . . . . . . . . . . . . 15
2.1.2 Reading in the Global Burden of Disease example
dataset . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Variable types and why we care . . . . . . . . . . . . . . . . 17
2.2.1 Numeric variables (continuous) . . . . . . . . . . . . 20
2.2.2 Character variables . . . . . . . . . . . . . . . . . . . 22
2.2.3 Factor variables (categorical) . . . . . . . . . . . . . 23
2.2.4 Date/time variables . . . . . . . . . . . . . . . . . . 24
2.3 Objects and functions . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 data frame/tibble . . . . . . . . . . . . . . . . . . . . . 28
2.3.2 Naming objects . . . . . . . . . . . . . . . . . . . . . 28
2.3.3 Function and its arguments . . . . . . . . . . . . . . 29
2.3.4 Working with objects . . . . . . . . . . . . . . . . . 31
2.3.5 <- and = . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.6 Recap: object, function, input, argument . . . . . . . 33
2.4 Pipe - %>% . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.1 Using . to direct the pipe . . . . . . . . . . . . . . . 34
2.5 Operators for filtering data . . . . . . . . . . . . . . . . . . 35
3 Summarising data 53
3.1 Get the data . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2 Plot the data . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Aggregating: group_by(), summarise() . . . . . . . . . . . . . . . 56
3.4 Add new columns: mutate() . . . . . . . . . . . . . . . . . . . 57
3.4.1 Percentages formatting: percent() . . . . . . . . . . . 58
3.5 summarise() vs mutate() . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Common arithmetic functions - sum(), mean(), median(), etc. . . 61
3.7 select() columns . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.8 Reshaping data - long vs wide format . . . . . . . . . . . . . 65
3.8.1 Pivot values from rows into columns (wider) . . . . . 66
3.8.2 Pivot values from columns to rows (longer) . . . . . 67
3.8.3 separate() a column into multiple columns . . . . . . 68
3.9 arrange() rows . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.9.1 Factor levels . . . . . . . . . . . . . . . . . . . . . . 70
3.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.10.1 Exercise - pivot_wider() . . . . . . . . . . . . . . . . . 71
3.10.2 Exercise - group_by(), summarise() . . . . . . . . . . . . 71
3.10.3 Exercise - full_join(), percent() . . . . . . . . . . . . . 73
3.10.4 Exercise - mutate(), summarise() . . . . . . . . . . . . . 74
3.10.5 Exercise - filter(), summarise(), pivot_wider() . . . . . . 75
15 Encryption 329
15.1 Safe practice . . . . . . . . . . . . . . . . . . . . . . . . . . 329
15.2 encryptr package . . . . . . . . . . . . . . . . . . . . . . . 330
15.3 Get the package . . . . . . . . . . . . . . . . . . . . . . . . 330
15.4 Get the data . . . . . . . . . . . . . . . . . . . . . . . . . . 331
15.5 Generate private/public keys . . . . . . . . . . . . . . . . . 331
15.6 Encrypt columns of data . . . . . . . . . . . . . . . . . . . . 332
15.7 Decrypt specific information only . . . . . . . . . . . . . . . 332
15.8 Using a lookup table . . . . . . . . . . . . . . . . . . . . . . 333
15.9 Encrypting a file . . . . . . . . . . . . . . . . . . . . . . . . 334
15.10 Decrypting a file . . . . . . . . . . . . . . . . . . . . . . . . 334
15.11 Ciphertexts are not matchable . . . . . . . . . . . . . . . . . 335
15.12 Providing a public key . . . . . . . . . . . . . . . . . . . . . 335
15.13 Use cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
15.13.1 Blinding in trials . . . . . . . . . . . . . . . . . . . . 335
15.13.2 Re-contacting participants . . . . . . . . . . . . . . . 336
15.13.3 Long-term follow-up of participants . . . . . . . . . . 336
15.14 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
Appendix 337
Bibliography 339
Index 341
Preface
We are grateful to the many individuals and students who have helped refine this book, and welcome
suggestions and bug reports via https://github.com/SurgicalInformatics.
Ewen Harrison and Riinu Pius
Usher Institute
University of Edinburgh
Contributors
We are indebted to the following people who have generously contributed time
and material to this book: Katie Connor, Tom Drake, Cameron Fairfield, Peter
Hall, Stephen Knight, Kenneth McLean, Lisa Norman, Einar Pius, Michael
Ramage, Katie Shaw, and Olivia Swann.
About the Authors
Ewen Harrison is a surgeon and Riinu Pius is a physicist. And they’re both
data scientists, too. They dabble in a few programming languages and are gen-
erally all over technology. They are most enthusiastic about the R statistical
programming language and have a combined experience of 25 years using it.
They work at the University of Edinburgh and have taught R to hundreds of
healthcare professionals and researchers.
They believe a first introduction to R and statistical programming should
be relatively jargon-free and outcome-oriented (get those pretty plots out).
The understanding of complicated concepts will come over time with practice
and experience, not through a re-telling of the history of computing bit-by-
byte, or with the inclusion of the underlying equations for each statistical test
(although Ewen has sneaked a few equations in).
Overall, they hope to make the text fun and accessible. Just like them.
Part I
Thank you for choosing this book on using R for health data analysis. Even
if you’re already familiar with the R language, we hope you will find some
new approaches here as we make the most of the latest R tools including some
we've developed ourselves. Those already familiar with R are encouraged to still skim through
the first few chapters to familiarise themselves with the style of R we recommend.
R can be used for all the health data science applications we can think of. From
bioinformatics and computational biology, to administrative data analysis and
natural language processing, through internet-of-things and wearable data,
to machine learning and artificial intelligence, and even public health and
epidemiology. R has it all.
Here are the main reasons we love R:
• R is versatile and powerful - use it for
– graphics;
– all the statistical tests you can dream of;
– machine learning and deep learning;
– automated reports;
– websites;
– and even books (yes, this book was written entirely in R).
• R scripts can be reused - gives you efficiency and reproducibility.
• It is free to use by anyone, anywhere.
Lines that do not start with # are R code. This is where the number crunching
really happens. We will cover the details of this R code in the next few chapters.
The purpose of this chapter is to describe some of the terminology as well as
the interface and tools we use.
For the impatient:
• We interface R using RStudio
• We use the tidyverse packages that are a substantial extension to base R
functionality (we repeat: extension, not replacement)
Even though R is a language, don’t think that after reading this book you
should be able to open a blank file and just start typing in R code like an
evil computer genius from a movie. This is not what real-world programming
looks like.
Firstly, you should be copy-pasting and adapting existing R code examples
- whether from this book, the internet, or later from your existing work. Re-
writing everything from scratch is not efficient. Yes, you will understand and
eventually remember a lot of it, but to spend time memorising specific func-
tions that can easily be looked up and copied is simply not necessary.
Secondly, R is an interactive language. Meaning that we “run” R code line
by line and get immediate feedback. We do not write a whole script without
trying each part out as we go along.
Thirdly, do not worry about making mistakes. Celebrate them! The whole
point of R and reproducibility is that manipulations are not applied directly
on a dataset, but a copy of it. Everything is in a script, so you can’t do anything
wrong. If you make a mistake like accidentally overwriting your data, we can
just reload it, rerun the steps that worked well and continue figuring out what
went wrong at the end. And since all of these steps are written down in a
script, R will redo everything with a single push of a button. You do not have
to repeat a set of mouse clicks from dropdown menus as in other statistical
packages, which quickly becomes a blessing.
Keyboard Shortcuts!
Run line: Control+Enter
Run all lines (Source): Control+Shift+Enter
(On a Mac, both Control and Command work)
The Console is where R speaks to us. When we’re lucky, we get results in there
- in this example the results of a t-test (last line of the script). When we’re
less lucky, this is also where Errors or Warnings appear.
R Errors are a lot less scary than they seem! Yes, if you’re using a regular
computer program where all you do is click on some buttons, then getting a
proper red error that stops everything is quite unusual. But in programming,
Errors are just a way for R to communicate with us.
We see Errors in our own work every single day, they are very normal and do
not mean that everything is wrong or that you should give up. Try to re-frame
the word Error to mean “feedback”, as in “Hello, this is R. I can’t continue,
this is the feedback I am giving you.” The most common Errors you’ll see
are along the lines of “Error: something not found”. This almost always means
there’s a typo or you’ve misspelled something. Furthermore, R is case sensitive
so capitalisation matters (variable name lifeExp is not the same as lifeexp).
The Console can only print text, so any plots you create in your script appear
in the Plots pane (bottom-right).
Similarly, datasets that you’ve loaded or created appear in the Environment
tab. When you click on a dataset, it pops up in a nice viewer that is fast even
when there is a lot of data. This means you can have a look and scroll through
your rows and columns, the same way you would with a spreadsheet.
library(tidyverse)
We can see that it has loaded 8 packages (ggplot2, tibble, tidyr, readr, purrr, dplyr,
stringr, forcats); the number behind a package name is its version.
The “Conflicts” message is expected and can safely be ignored.1
There are a few other R packages that we use and are not part of the tidyverse,
but we will introduce them as we go along. If you’re incredibly curious, head
to the Resources section of the HealthyR website which is the best place to
find up-to-date links and installation instructions. Our R and package versions
are also listed in the Appendix.
The files on your computer are organised into folders. RStudio Projects live
in your computer’s normal folders - they placemark the working directory of
each analysis project. These project folders can be viewed or moved around
the same way you normally work with files and folders on your computer.
The top-right corner of your RStudio should never say “Project: (None)”.
If it does, click on it and create a New Project. After clicking on New Project,
you can decide whether to let RStudio create a New Directory (folder) on
your computer. Alternatively, if your data files are already organised into an
“Existing folder”, use the latter option.
Every set of analysis you are working on must have its own folder and RStudio
project. This enables you to switch between different projects without getting
the data, scripts, or output files all mixed up. Everything gets read in or saved
to the correct place. No more exporting a plot and then going through the
various Documents, etc., folders on your computer trying to figure out where
your plot might have been saved to. It got saved to the project folder.
Have you tried turning it off and on again? It is vital to restart R regularly.
Restarting R helps to avoid accidentally using the wrong data or functions
stored in the environment. Restarting R only takes a second and we do it
several times per day! Once you get used to saving everything in a script,
you’ll always be happy to restart R. This will help you develop robust and
reproducible data analysis skills.
FIGURE 1.3: Configuring your RStudio Tools -> Global Options: Untick
“Restore .RData into Workspace on Exit” and Set “Save .RData on exit” to
Never.
This does not mean you can’t or shouldn’t save your work in .RData/.rda files.
But it is best to do it consciously and load exactly what you want to load.
Letting R silently save and load everything for you may also include broken
data or objects.
When mentioned in the text, the names of R packages are in bold font, e.g., ggplot2,
whereas functions, objects, and variables are printed with mono-spaced font,
e.g., filter(), mean(), lifeExp. Functions are always followed by brackets: (),
whereas data objects or variables are not.
Otherwise, R code lives in the grey areas known as ‘code chunks’. Lines of R
output start with a double ## - this will be the numbers or text that R gives
us after executing code. R also adds a counter at the beginning of every new
line; look at the numbers in the square brackets [] below:
## [1] 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015
## [16] 1016 1017
Remember, lines of R code that start with # are called comments. We already
introduced comments as notes about the R code earlier in this chapter (Section
1.1 “Help, what’s a script?”), however, there is a second use case for comments.
When you make R code a comment, by adding a # in front of it, it gets
‘commented out’. For example, let’s say your R script does two things, prints
numbers from 1 to 4, and then numbers from 1001 to 1004:
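A minimal script of that shape might look like this (a sketch; the original chunk is not reproduced in this excerpt):

1:4
1001:1004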
## [1] 1 2 3 4
If you decide to ‘comment out’ the printing of big numbers, the code will look
like this:
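Continuing the sketch above:

1:4
# 1001:1004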
## [1] 1 2 3 4
You may even want to add another real comment to explain why the latter
was commented out:
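For instance:

1:4
# not needed for the current report:
# 1001:1004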
You could of course delete the line altogether, but commenting out is useful
as you might want to include the lines later by removing the # from the
beginning of the line.
Throughout this book, we are conscious of the balance between theory and
practice. Some learners may prefer to see all definitions laid out before being
shown an example of a new concept. Others would rather see practical exam-
ples and explanations build up to a full understanding over time. We strike a
balance between these two approaches that works well for most people in the
audience.
Sometimes we will show you an example that may use words that have not
been formally introduced yet. For example, we start this chapter with data
import - R is nothing without data.
In so doing, we have to use the word “argument”, which is only defined two
sections later (in 2.3 “Objects and functions”). A few similar instances arise
around statistical concepts in the Data Analysis part of the book. You will
come across sentences along the lines of “this concept will become clearer in
the next section”. Trust us and just go with it.
The aim of this chapter is to familiarise you with how R works. We will read
in data and start basic manipulations. You may want to skip parts of this
chapter if you already:
• have found the Import Dataset interface;
• know what numbers, characters, factors, and dates look like in R;
• are familiar with the terminology around objects, functions, arguments;
• have used the pipe: %>%;
• know how to filter data with operators such as ==, >, <, &, |;
• know how to handle missing data (NAs), and why they can behave weirdly
in a filter;
• have used mutate(), c(), paste(), if_else(), and the joins.
Clicking on a data file gives us two options: “View File” or “Import Dataset”.
We will show you how to use the Import Dataset interface in a bit, but for
standard CSV files, we don’t usually bother with the Import interface and just
type in (or copy from a previous script):
library(tidyverse)
example_data <- read_csv("example_data.csv")
View(example_data)
There are a couple of things to say about the first R code chunk of this book.
First and foremost: do not panic. Yes, if you’re used to interacting with data
by double-clicking on a spreadsheet that just opens up, then the above R code
does seem a bit involved.
However, running the example above also has an immediate visual effect. As
soon as you click Run (or press Ctrl+Enter/Command+Enter), the dataset
immediately shows up in your Environment and opens in a Viewer. You can
have a look and scroll through the same way you would in Excel or similar.
So what’s actually going on in the R code above:
• We load the tidyverse packages (as covered in the first chapter of this
book).
• We have a CSV file called “example_data.csv” and are using read_csv() to
read it into R.
• We are using the assignment arrow <- to save it into our Environment using
the same name: example_data.
• The View(example_data) line makes it pop up for us to view it. Alternatively,
click on example_data in the Environment to achieve the exact same thing.
More about the assignment arrow (<-) and naming things in R are covered
later in this chapter. Do not worry if everything is not crystal clear just now.
In the read_csv() example above, we read in a file that was in a specific (but
common) format.
However, if your file uses semicolons instead of commas, or commas instead of
dots, or a special number for missing values (e.g., 99), or anything else weird
or complicated, then we need a different approach.
RStudio’s Import Dataset interface (Figure 2.1) can handle all of these and
more.
FIGURE 2.2: Import: Some of the special settings your data file might have.
FIGURE 2.3: After using the Import Dataset window, copy-paste the re-
sulting code into your script.
After selecting the specific options to import a particular file, a friendly preview
window will show whether R properly understands the format of your data.
DO NOT BE tempted to press the Import button.
Yes, this will read in your dataset once, but means you have to reselect the
options every time you come back to RStudio. Instead, copy-paste the code
(e.g., Figure 2.3) into your R script. This way you can use it over and over
again.
Ensuring that all steps of an analysis are recorded in scripts makes your work-
flow reproducible by your future self, colleagues, supervisors, and extraterres-
trials.
The Import Dataset button can also help you to read in Excel, SPSS, Stata,
or SAS files (instead of read_csv(), it will give you read_excel(), read_sav(),
read_stata(), or read_sas()).
If you’ve used R before or are using older scripts passed by colleagues, you
might see read.csv() rather than read_csv(). Note the dot rather than the un-
derscore.
In short, read_csv() is faster and more predictable, and we recommend it for all new scripts.

In existing scripts that work and are tested, we do not recommend that you start replacing
read.csv() with read_csv(). For instance, read_csv() handles categorical variables differently1.
An R script written using read.csv() might not work as expected any more if read.csv() is
simply replaced with read_csv().

1 It does not silently convert strings to factors, i.e., it defaults to stringsAsFactors = FALSE.
For those not familiar with the terminology here - don't worry, we will cover this in just a
few sections.

In the next few chapters of this book, we will be using the Global Burden of Disease datasets.
The Global Burden of Disease Study (GBD) is the most comprehensive worldwide observational
epidemiological study to date. It describes mortality and morbidity from major diseases,
injuries and risk factors to health at global, national and regional levels.
GBD data are publicly available from the website. Table 2.1 and Figure 2.4
show a high level version of the project data with just 3 variables: cause, year,
deaths_millions (number of people who die of each cause every year). Later, we
will be using a longer dataset with different subgroups and we will show you
how to summarise comprehensive datasets yourself.
library(tidyverse)
gbd_short <- read_csv("data/global_burden_disease_cause-year.csv")
TABLE 2.1: Deaths per year from three broad disease categories (short
version of the Global Burden of Disease example dataset).
FIGURE 2.4: Line and bar charts: Cause of death by year (GBD). Data in
(B) are the same as (A) but stacked to show the total of all causes.
library(tidyverse)
typesdata <- read_csv("data/typesdata.csv")
typesdata
## # A tibble: 3 x 4
## id group measurement date
## <chr> <chr> <dbl> <dttm>
## 1 ID1 Control 1.8 2017-01-02 12:00:00
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00
This means that a lot of the time you do not have to worry about those little
<chr> vs <dbl> vs <S3: POSIXct> labels. But in cases of irregular or faulty input
data, or when doing a lot of calculations and modifications to your data, we
need to be aware of these different types to be able to find and fix mistakes.
For example, consider a similar file as above but with some data entry issues
introduced:
typesdata_faulty
## # A tibble: 3 x 4
## id group measurement date
## <chr> <chr> <chr> <chr>
## 1 ID1 Control 1.8 02-Jan-17 12:00
## 2 ID2 Treatment 4.5 03-Feb-18 13:00
## 3 ID3 Treatment 3.7 or 3.8 04-Mar-19 14:00
Notice that R parsed both the measurement and date variables as characters.
Measurement has been parsed as a character because of a data entry issue:
the person taking the measurement couldn’t decide which value to note down
(maybe the scale was shifting between the two values) so they included both
values and text “or” in the cell.
A numeric variable will also get parsed as a character variable if it contains
certain typos, e.g., if entered as "3..7" instead of "3.7".
The reason R didn’t automatically make sense of the date column is that it
couldn’t tell which is the date and which is the year: 02-Jan-17 could stand for
02-Jan-2017 as well as 2002-Jan-17.
Therefore, while a lot of the time you do not have to worry about variable
types and can just get on with your analysis, it is important to understand
what the different types are to be ready to deal with them when issues arise.
So here we go.
Numbers are straightforward to handle and don’t usually cause trouble. R usu-
ally refers to numbers as numeric (or num), but sometimes it really gets its nerd
on and calls numbers integer or double. Integers are numbers without decimal
places (e.g., 1, 2, 3), whereas double stands for “Double-precision floating-point”
format (e.g., 1.234, 5.67890).
It doesn't usually matter whether R is classifying your continuous data
numeric/num/double/int, but it is good to be aware of these different terms as you
will see them in R messages.
Something to note about numbers is that R doesn't usually print more than 6
decimal places, but that doesn't mean they don't exist. For example, from the
typesdata tibble, we're taking the measurement column and sending it to the mean()
function. R then calculates the mean and tells us what it is with 6 decimal
places:
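A chunk consistent with that description would be (we save the result as
measurement_mean, since it is compared against a fixed value below):

measurement_mean <- typesdata$measurement %>% mean()
measurement_mean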
## [1] 3.333333
But when using the double equals operator to check if this is equivalent to a
fixed value (you might do this when comparing to a threshold, or even another
mean value), R returns FALSE:
measurement_mean == 3.333333
## [1] FALSE
Now this doesn’t seem right, does it - R clearly told us just above that the mean
of this variable is 3.333333 (reminder: the actual values in the measurement
column are 1.8, 4.5, 3.7). The reason the above statement is FALSE is because
measurement_mean is quietly holding more than 6 decimal places.
And it gets worse. In this example, you may recognise that repeating decimals
(0.333333…) usually mean there’s more of them somewhere. And you may think
that rounding them down with the round() function would make your == behave
as expected. Except, it’s not about rounding, it’s about how computers store
numbers with decimals. Computers have issues with decimal numbers, and
this simple example illustrates one:
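The comparison in question (spelled out in the next paragraph) is:

0.10 + 0.05 == 0.15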
## [1] FALSE
This returns FALSE, meaning R does not seem to think that 0.10 + 0.05 is
equal to 0.15. This issue isn’t specific to R, but to programming languages in
general. For example, python also thinks that the sum of 0.10 and 0.05 does
not equal 0.15.
This is where the near() function comes in handy:
library(tidyverse)
near(0.10+0.05, 0.15)
## [1] TRUE
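The precision can also be given explicitly as a third argument; a call along these
lines (the tolerance value here is illustrative, not from the original chunk) returns
the same result:

near(0.10 + 0.05, 0.15, tol = 0.001)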
## [1] TRUE
The first two arguments for near() are the numbers you are comparing; the
third argument is the precision you are interested in. So if the numbers are
equal within that precision, it returns TRUE. You can omit the third argument
- the precision (in this case also known as the tolerance). If you do, near() will
use a reasonable default tolerance value.
library(tidyverse)
typesdata %>%
count(group)
## # A tibble: 2 x 2
## group n
## <chr> <int>
## 1 Control 1
## 2 Treatment 2
count() can accept multiple variables and will count up the number of observations
in each subgroup, e.g., mydata %>% count(var1, var2).
Another helpful option to count is sort = TRUE, which will order the result
putting the highest count (n) to the top.
typesdata %>%
count(group, sort = TRUE)
## # A tibble: 2 x 2
## group n
## <chr> <int>
## 1 Treatment 2
## 2 Control 1
count() with the sort = TRUE option is also useful for identifying duplicate IDs or
misspellings in your data. With this example tibble (typesdata) that only has
three rows, it is easy to see that the id column is a unique identifier whereas
the group column is a categorical variable.
You can check everything by just eyeballing the tibble using the built in Viewer
tab (click on the dataset in the Environment tab).
But for larger datasets, you need to know how to check and then clean data
programmatically - you can’t go through thousands of values checking they
are all as intended without unexpected duplicates or typos.
For most variables (categorical or numeric), we recommend always plotting
your data before starting analysis. But to check for duplicates in a unique
identifier, use count() with sort = TRUE:
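For typesdata, a call of this form produces the first printout below:

typesdata %>%
  count(id, sort = TRUE)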
## # A tibble: 3 x 2
## id n
## <chr> <int>
## 1 ID1 1
## 2 ID2 1
## 3 ID3 1
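The second printout comes from data that do contain a duplicate. For illustration only
(this object is our own, not from the original), something similar can be produced like this:

typesdata_duplicated <- typesdata %>%
  bind_rows(typesdata %>% filter(id == "ID3"))

typesdata_duplicated %>%
  count(id, sort = TRUE)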
## # A tibble: 3 x 2
## id n
## <chr> <int>
## 1 ID3 2
## 2 ID1 1
## 3 ID2 1
Factors are fussy characters. Factors are fussy because they include something
called levels. Levels are all the unique values a factor variable could take, e.g.,
like when we looked at typesdata %>% count(group). Using factors rather than just
characters can be useful because:
• The values factor levels can take are fixed. For example, once you tell R that
typesdata$group is a factor with two levels: Control and Treatment, combining
it with other datasets with different spellings or abbreviations for the same
variable will generate a warning. This can be helpful but can also be a
nuisance when you really do want to add in another level to a factor variable.
• Levels have an order. When running statistical tests on grouped data (e.g.,
Control vs Treatment, Adult vs Child) and the variable is just a character,
not a factor, R will use the alphabetically first as the reference (comparison)
level. Converting a character column into a factor column enables us to
define and change the order of its levels. Level order affects many things
including regression results and plots: by default, categorical variables are
ordered alphabetically. If we want a different order in say a bar plot, we need
to convert to a factor and reorder before we plot it. The plot will then order
the groups correctly.
So overall, since health data is often categorical and has a reference (com-
parison) level, then factors are an essential way to work with these data in
R. Nevertheless, the fussiness of factors can sometimes be unhelpful or even
frustrating. A lot more about factor handling will be covered later, in Chapter 8.
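As a tiny illustration of levels (our own example, using the typesdata tibble from above):

group_factor <- factor(typesdata$group)
levels(group_factor)

## [1] "Control" "Treatment"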
R is good for working with dates. For example, it can calculate the number of
days/weeks/months between two dates, or it can be used to find what future
date is (e.g., “what’s the date exactly 60 days from now?”). It also knows
about time zones and is happy to parse dates in pretty much any format - as
long as you tell R how your date is formatted (e.g., day before month, month
name abbreviated, year in 2 or 4 digits, etc.). Since R displays dates and times
between quotes (“ ”), they look similar to characters. However, it is important
to know whether R has understood which of your columns contain date/time
information, and which are just normal characters.
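The objects discussed next are not created in the excerpt above; a minimal set-up
consistent with the text (the example date-time value is our own) would be:

library(lubridate)
current_datetime <- Sys.time()
my_datetime <- "2020-12-01 12:00"              # looks like a date-time, but is a character
my_datetime_converted <- ymd_hm(my_datetime)   # the properly parsed version used below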
When printed, the two objects - current_datetime and my_datetime - seem to have
a similar format. But if we try to calculate the difference between these two
dates, we get an error:

my_datetime - current_datetime

This is because my_datetime is in fact still a character (its class is "character"),
so R cannot do date arithmetic with it. Once it has been converted into a proper
date-time object, my_datetime_converted, the subtraction works and returns a time
difference - an object of class "difftime":

my_datetime_converted - current_datetime
This is useful if we want to apply this time difference to another date. But what if
we want to use the number of days in a normal calculation? Say a measurement increased
by 560 arbitrary units during this time period; we might want to calculate the increase
per day like this:

560/my_datesdiff

This, however, returns an error: my_datesdiff (a difftime object holding the time
difference - here roughly 77 days) is not a plain number, and R refuses to divide by
it. Wrapping it in as.numeric() solves the problem:

560/as.numeric(my_datesdiff)
## [1] 7.281465
The lubridate package comes with several convenient functions for parsing
dates, e.g., ymd(), mdy(), ymd_hm(), etc. - for a full list see lubridate.tidyverse.org.
However, if your date/time variable comes in an extra special format, then use
the parse_date_time() function where the second argument specifies the format
using the specifiers given in Table 2.2.
TABLE 2.2: Date/time format specifiers.
Notation Meaning Example
%d day as number 01-31
%m month as number 01-12
%B month name January-December
%b abbreviated month Jan-Dec
%Y 4-digit year 2019
%y 2-digit year 19
%H hours 12
%M minutes 01
%S seconds 59
%A weekday Monday-Sunday
%a abbreviated weekday Mon-Sun
For example:
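A call of this shape (the date string and format below are illustrative):

library(lubridate)
parse_date_time("15:30 07-Jan-2020", orders = "%H:%M %d-%b-%Y")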
Furthermore, the same date/time specifiers can be used to rearrange your date
and time for printing:
Sys.time()
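For instance (the format string is our own choice):

Sys.time() %>% format("%d %B %Y")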
You can even add plain text into the format() function, R will know to put the
right date/time values where the % are:
Sys.time() %>% format("Happy days, the current time is %H:%M %B-%d (%Y)!")

TABLE 2.3: Example of data in columns and rows, including missing values
denoted NA (Not applicable/Not available). Once this dataset has been read
into R it gets called a data frame/tibble.

id sex var1 var2 var3
1 Male 4 NA 2
2 Female 1 4 1
3 Female 2 5 NA
4 Male 3 NA NA
There are two fundamental concepts in statistical programming that are im-
portant to get straight - objects and functions. The most common object you
will be working with is a dataset. This is usually something with rows and
columns much like the example in Table 2.3.
To get the small and made-up "dataset" into your Environment, copy and run this code4:
library(tidyverse)
mydata <- tibble(
id = 1:4,
sex = c("Male", "Female", "Female", "Male"),
var1 = c(4, 1, 2, 3),
var2 = c(NA, 4, 5, NA),
var3 = c(2, 1, NA, NA)
)
4 ...when RStudio is not available, but R is. This can be the case if you are working
on a supercomputer that can only serve the R Console and not RStudio.
So, regularly shaped data in rows and columns is called a table when it lives
outside R, but once you read/import it into R it gets called a tibble. If you’ve
used R before, or get given a piece of code that uses read.csv() instead of
read_csv(), you'll have come across the term data frame.5
When you read data into R, you want it to show up in the Environment tab.
Everything in your Environment needs to have a name. You will likely have
many objects such as tibbles going on at the same time. Note that tibble is
what the thing is, rather than its name. This is the ‘class’ of an object.
To keep our code examples easy to follow, we call our example tibble mydata. In a
real analysis, you should give your tibbles meaningful names, e.g., patient_data,
lab_results, annual_totals, etc. Object names can't have spaces in them, which is
why we use the underscore (_) to separate words. Object names can include
numbers, but they can’t start with a number: so labdata2019 works, 2019labdata
does not.
So, the tibble named mydata is an example of an object that can be in the
Environment of your R Session:
mydata
## # A tibble: 4 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 4 NA 2
## 2 2 Female 1 4 1
## 3 3 Female 2 5 NA
## 4 4 Male 3 NA NA
5 read.csv() comes with base R, whereas read_csv() comes from the readr package within
the tidyverse. We recommend using read_csv().
2.3.3 Function and its arguments
R functions always have round brackets after their name. This is for two
reasons. First, it easily differentiates them as functions - you will get used to
reading them like this.
Second, and more importantly, we can put arguments in these brackets.
Arguments can also be thought of as input. In data analysis, the most common
input for a function is data. For instance, we need to give mean() some data to
average over. It does not make sense (nor will it work) to feed mean() the whole
tibble with multiple columns, including patient IDs and a categorical variable
(sex).
To quickly extract a single column, we use the $ symbol like this:
mydata$var1
## [1] 4 1 2 3
You can ignore the ## [1] at the beginning of the extracted values - this is
something that becomes more useful when printing multiple lines of data as
the number in the square brackets keeps count on how many values we are
seeing.
We can then use mydata$var1 as the first argument of mean() by putting it inside
its brackets:
mean(mydata$var1)
## [1] 2.5
which tells us that the mean of var1 (4, 1, 2, 3) is 2.5. In this example, mydata$var1
is the first and only argument to mean().
But what happens if we try to calculate the average value of var2 (NA, 4, 5,
NA) (remember, NA stands for Not Applicable/Available and is used to denote
missing data):
mean(mydata$var2)
## [1] NA
So why does mean(mydata$var2) return NA (“not available”) rather than the mean
of the values included in this column? That is because the column includes
missing values (NAs), and R does not want to average over NAs implicitly. It is
being cautious - what if you didn’t know there were missing values for some
patients? If you wanted to compare the means of var1 and var2 without any
further filtering, you would be comparing samples of different sizes.
We might expect to see an NA if we tried to, for example, calculate the average
of sex. And this is indeed the case:
mean(mydata$sex)
## [1] NA
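The mean of var2 can, however, be calculated once R is told to remove the missing
values; the call producing the output below takes this form:

mean(mydata$var2, na.rm = TRUE)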
## [1] 4.5
Adding na.rm = TRUE tells R that you are happy for it to calculate the mean of any
existing values (but to remove - rm - the NA values). This ‘removal’ excludes the
NAs from the calculation, it does not affect the actual tibble (mydata) holding
the dataset.
R is case sensitive, so na.rm, not NA.rm etc. There is, however, no need to mem-
orize how the arguments of functions are exactly spelled - this is what the
Help tab is for (press F1 when the cursor is on the name of the function). Help
pages are built into R, so an internet connection is not required for this.
Make sure to separate multiple arguments with commas or R will give you an
error of Error: unexpected symbol.
Finally, some functions do not need any arguments to work. A good example
is the Sys.time() which returns the current time and date. This is useful when
using R to generate and update reports automatically. Including this means
you can always be clear on when the results were last updated.
Sys.time()
a <- 103
This reads: the object a is assigned value 103. <- is called “the arrow assignment
operator”, or “assignment arrow” for short.
You know that the assignment worked when it shows up in the Environment
tab. If we now run a just on its own, it gets printed back to us:
## [1] 103
seq(15, 30)
## [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
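Saving the same sequence into an object (rather than just printing it) would look like this:

example_sequence <- seq(15, 30)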
Doing this creates example_sequence in our Environment, but it does not print
it. To get it printed, run it on a separate line like this:
example_sequence
## [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
If you save the results of an R function in an object, it does not get printed.
If you run a function without the assignment (<-), its results get printed, but
not saved as an object.
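For example, overwriting the sequence with half of each value, then printing it on a new line:

example_sequence <- example_sequence/2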
example_sequence
## [1] 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5
## [16] 15.0
Notice how we then include the variable on a new line to get it printed as well
as overwritten.
Note that many people use = instead of <-. Both <- and = can save what is on the
right into an object with the name on the left. Although <- and = are interchangeable
when saving an object into your Environment, they are not interchangeable when used
as a function argument. For example, remember how
we used the na.rm argument in the mean() function, and the result got printed
immediately? If we want to save the result into an object, we’ll do this, where
mean_result could be any name you choose:
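A call of that shape, reconstructed from the description that follows:

mean_result <- mean(mydata$var2, na.rm = TRUE)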
Note how the example above uses both operators: the assignment arrow for
saving the result to the Environment, the = equals operator for setting an
argument in the mean() function (na.rm = TRUE).
2.3.6 Recap: object, function, input, argument
• To summarise, objects and functions work hand in hand. Objects are both
an input as well as the output of a function (what the function returns).
• When passing data to a function, it is usually the first argument, with further
arguments used to specify behaviour.
• When we say “the function returns”, we are referring to its output (or an
Error if it’s one of those days).
• The returned object can be different to its input object. In our mean() exam-
ples above, the input object was a column (mydata$var1: 4, 1, 2, 3), whereas
the output was a single value: 2.5.
• If you’ve written a line of code that doesn’t include the assignment arrow
(<-), its results would get printed. If you use the assignment arrow, an object
holding the results will get saved into the Environment.
The pipe - denoted %>% - is probably the oddest looking thing you’ll see in this
book. But please bear with us; it is not as scary as it looks! Furthermore, it
is super useful. We use the pipe to send objects into functions.
In the above examples, we calculated the mean of column var1 from mydata by
mean(mydata$var1). With the pipe, we can rewrite this as:
library(tidyverse)
mydata$var1 %>% mean()
## [1] 2.5
Which reads: “Working with mydata, we select a single column called var1 (with
the $) and then calculate the mean().” The pipe becomes especially useful once
the analysis includes multiple steps applied one after another. A good way to
read and think of the pipe is “and then”.
This piping business is not standard R functionality and before using it in a
script, you need to tell R this is what you will be doing. The pipe comes from
the magrittr package (Figure 2.5), but loading the tidyverse will also load the
pipe. So library(tidyverse) initialises everything you need.
With or without the pipe, the general rule “if the result gets printed it doesn’t
get saved” still applies. To save the result of the function into a new object
(so it shows up in the Environment), you need to add the name of the new
object with the assignment arrow (<-):
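For example (the object name here is our own choice):

var1_mean <- mydata$var1 %>% mean()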
FIGURE 2.5: This is not a pipe. René Magritte inspired artwork, by Stefan
Milton Bache.
By default, the pipe sends data to the beginning of the function brackets (as
most of the functions we use expect data as the first argument). So mydata %>%
lm(dependent~explanatory) is equivalent to lm(mydata, dependent~explanatory). lm() -
linear model - will be introduced in detail in Chapter 7.
However, the lm() function does not expect data as its first argument. lm()
wants us to specify the variables first (dependent~explanatory), and then wants
the tibble these columns are in. So we have to use the . to tell the pipe to
send the data to the second argument of lm(), not the first, e.g.,
mydata %>%
lm(var1~var2, data = .)
Operators are symbols that tell R how to handle different pieces of data or
objects. We have already introduced three: $ (selects a column), <- (assigns
values or results to a variable), and the pipe - %>% (sends data into a function).
Other common operators are the ones we use for filtering data - these are arith-
metic comparison and logical operators. This may be for creating subgroups,
or for excluding outliers or incomplete cases.
The comparison operators that work with numeric data are relatively straight-
forward: >, <, >=, <=. The first two check whether your values are greater or
less than another value, the last two check for “greater than or equal to” and
“less than or equal to”. These operators are most commonly spotted inside the
filter() function:
gbd_short %>%
filter(year < 1995)
## # A tibble: 3 x 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
## 2 1990 Injuries 4.25
## 3 1990 Non-communicable diseases 26.7
Here we send the data (gbd_short) to the filter() and ask it to retain all years
that are less than 1995. The resulting tibble only includes the year 1990. Now,
if we use the <= (less than or equal to) operator, both 1990 and 1995 pass the
filter:
gbd_short %>%
filter(year <= 1995)
## # A tibble: 6 x 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
## 2 1990 Injuries 4.25
## 3 1990 Non-communicable diseases 26.7
## 4 1995 Communicable diseases 15.1
## 5 1995 Injuries 4.53
## 6 1995 Non-communicable diseases 29.3
Furthermore, the values either side of the operator could both be variables,
e.g., mydata %>% filter(var2 > var1).
To filter for values that are equal to something, we use the == operator.
gbd_short %>%
filter(year == 1995)
## # A tibble: 3 x 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1995 Communicable diseases 15.1
## 2 1995 Injuries 4.53
## 3 1995 Non-communicable diseases 29.3
This reads, take the GBD dataset, send it to the filter and keep rows where
year is equal to 1995.
Accidentally using the single equals (=) when the double equals (==) is necessary is a
common mistake and still happens to the best of us. It happens so often that
the error the filter() function gives when using the wrong one also reminds
us what the correct one was:
gbd_short %>%
filter(year = 1995)
"Do you need ==?" is what the resulting error message asks, and the answer is almost
always, "Yes R, I do, thank you". But that's just because filter() is a clever cookie
and is used to this common mistake. There are other useful functions we use these
operators in, but they don't always know to tell us that we've just confused = for ==.
So if you get an error when checking for an equality between variables, always check
your == operators first.
R also has two operators for combining multiple comparisons: & and |, which
stand for AND and OR, respectively. For example, we can filter to keep only the
rows from two specific years:
gbd_short %>%
filter(year == 1995 | year == 2017)
## # A tibble: 6 x 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1995 Communicable diseases 15.1
## 2 1995 Injuries 4.53
## 3 1995 Non-communicable diseases 29.3
## 4 2017 Communicable diseases 10.4
## 5 2017 Injuries 4.47
## 6 2017 Non-communicable diseases 40.9
TABLE 2.4: Filtering operators.
Operators Meaning
== Equal to
!= Not equal to
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
& AND
| OR
This reads: take the GBD dataset, send it to the filter and keep rows where
year is equal to 1995 OR year is equal to 2017.
Using specific values like we’ve done here (1995/2017) is called “hard-coding”,
which is fine if we know for sure that we will not want to use the same script
on an updated dataset. But if what we really want are the earliest and latest years
in the dataset, a cleverer and more robust way is to use the min() and max() functions:
gbd_short %>%
filter(year == max(year) | year == min(year))
## # A tibble: 6 x 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
## 2 1990 Injuries 4.25
## 3 1990 Non-communicable diseases 26.7
## 4 2017 Communicable diseases 10.4
## 5 2017 Injuries 4.47
## 6 2017 Non-communicable diseases 40.9
Filter the dataset to only include the year 2000. Save this in a new variable
using the assignment operator.
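One way to do this (the object name is our own):

gbd_2000 <- gbd_short %>%
  filter(year == 2000)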
From gbd_short, select the lines where year is either 1990 or 2017 and cause is
“Communicable diseases”:
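A sketch of one way to write this:

gbd_short %>%
  filter(year == 1990 | year == 2017) %>%
  filter(cause == "Communicable diseases")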
The hash symbol (#) is used to add free text comments to R code. R will not
try to run these lines, they will be ignored. Comments are an essential part of
any programming code and these are “Dear Diary” notes to your future self.
The combine function, c(), as its name implies, is used to combine several values.
It is especially useful when used with the %in% operator to filter for multiple
values. Remember how the gbd_short cause column had three different causes
in it:
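A quick way to check (not necessarily the chunk used in the original) is:

gbd_short %>%
  distinct(cause)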
Say we wanted to filter for communicable and non-communicable diseases.
We could use the OR operator | like this:
gbd_short %>%
# also filtering for a single year to keep the result concise
filter(year == 1990) %>%
filter(cause == "Communicable diseases" | cause == "Non-communicable diseases")
## # A tibble: 2 x 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
## 2 1990 Non-communicable diseases 26.7
6 In this example, it would just be easier to use the "not equal" operator,
filter(cause != "Injuries"), but imagine your column had more than just three
different values in it.
But that means we have to type in cause twice (and more if we had other
values we wanted to include). This is where the %in% operator together with
the c() function come in handy:
gbd_short %>%
filter(year == 1990) %>%
filter(cause %in% c("Communicable diseases", "Non-communicable diseases"))
## # A tibble: 2 x 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
## 2 1990 Non-communicable diseases 26.7
Filtering for missing values (NAs) needs special attention and care. Remember
the small example tibble from Table 2.3 - it has some NAs in columns var2
and var3:
mydata
## # A tibble: 4 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 4 NA 2
## 2 2 Female 1 4 1
## 3 3 Female 2 5 NA
## 4 4 Male 3 NA NA
If we now want to filter for rows where var2 is missing, filter(var2 == NA) is not
the way to do it - it will not work.
Since R is a programming language, it can be a bit stubborn with things like
these. When you ask R to do a comparison using == (or <, >, etc.) it expects a
value on each side, but NA is not a value, it is the lack thereof. The way to
filter for missing values is using the is.na() function:
mydata %>%
filter(is.na(var2))
## # A tibble: 2 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 4 NA 2
## 2 4 Male 3 NA NA
We send mydata to the filter and keep rows where var2 is NA. Note the double
brackets at the end: that’s because the inner one belongs to is.na(), and the
outer one to filter(). Missing out a closing bracket is also a common source
of errors, and still happens to the best of us.
If filtering for rows where var2 is not missing, we do this:7
mydata %>%
filter(!is.na(var2))
## # A tibble: 2 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2 Female 1 4 1
## 2 3 Female 2 5 NA
We can also filter for rows where var2 does not equal a specific value, using the
"not equal" operator:8

mydata %>%
filter(var2 != 5)
## # A tibble: 1 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2 Female 1 4 1
However, you’ll see that by doing this, R drops the rows where var2 is NA as
well, as it can’t be sure these missing values were not equal to 5.
If you want to keep the missing values, you need to make use of the OR (|)
operator and the is.na() function:
mydata %>%
filter(var2 != 5 | is.na(var2))
## # A tibble: 3 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 4 NA 2
## 2 2 Female 1 4 1
## 3 4 Male 3 NA NA
7 In this simple example, mydata %>% filter(! is.na(var2)) could be replaced by a
shorthand: mydata %>% drop_na(var2), but it is important to understand how the ! and
is.na() work as there will be more complex situations where using these is necessary.
8 filter(var2 != 5) is equivalent to filter(! var2 == 5)
Being caught out by missing values, either in filters or other functions is com-
mon (remember mydata$var2 %>% mean() returns NA unless you add na.rm = TRUE).
This is also why we insist that you always plot your data first - outliers will
reveal themselves and NA values usually become obvious too.
Another thing we do to stay safe around filters and missing values is saving
the results and making sure the number of rows still add up:
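The chunk creating these subsets is not shown above; judging by the rows printed
below, it would be along these lines:

subset1 <- mydata %>% filter(var2 == 5)
subset2 <- mydata %>% filter(var2 != 5)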
subset1
## # A tibble: 1 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 3 Female 2 5 NA
subset2
## # A tibble: 1 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2 Female 1 4 1
If the numbers are small, you can now quickly look at RStudio’s Environment
tab and figure out whether the number of observations (rows) in subset1 and
subset2 add up to the whole dataset (mydata). Or use the nrow() function to check
the number of rows in each dataset:
Rows in mydata:
nrow(mydata)
## [1] 4
Rows in subset1:
nrow(subset1)
## [1] 1
Rows in subset2:
nrow(subset2)
## [1] 1
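We can then check whether the two subsets together account for every row of mydata;
a check of this shape produces the output below:

nrow(subset1) + nrow(subset2) == nrow(mydata)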
## [1] FALSE
As expected, this returns FALSE - because we didn’t add special handling for
missing values. Let's create a third subset only including rows where var2 is NA:
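A sketch of the remaining steps, following the pattern above:

subset3 <- mydata %>% filter(is.na(var2))
nrow(subset1) + nrow(subset2) + nrow(subset3) == nrow(mydata)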
## [1] TRUE
The function for adding new columns (or making changes to existing ones) to
a tibble is called mutate(). As a reminder, this is what typesdata looked like:
typesdata
## # A tibble: 3 x 4
## id group measurement date
## <chr> <chr> <dbl> <dttm>
## 1 ID1 Control 1.8 2017-01-02 12:00:00
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00
Let’s say we decide to divide the column measurement by 2. A quick way to see
these values would be to pull them out using the $ operator and then divide
by 2:
typesdata$measurement
typesdata$measurement/2
A better approach is to do the division inside mutate():

typesdata %>%
mutate(measurement/2)
## # A tibble: 3 x 5
## id group measurement date `measurement/2`
## <chr> <chr> <dbl> <dttm> <dbl>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 0.9
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 2.25
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 1.85
Notice how the mutate() above returns the whole tibble with a new column
called measurement/2. This is quite nice of mutate(), but it would be best to give
columns names that don’t include characters other than underscores (_) or
dots (.). So let’s assign a more standard name for this new column:
typesdata %>%
mutate(measurement_half = measurement/2)
## # A tibble: 3 x 5
## id group measurement date measurement_half
## <chr> <chr> <dbl> <dttm> <dbl>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 0.9
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 2.25
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 1.85
Better. You can see that R likes the name we gave it a bit better as it’s now
removed the back-ticks from around it. Overall, back-ticks can be used to call
out non-standard column names, so if you are forced to read in data with, e.g.,
spaces in column names, then the back-ticks enable calling column names that
would otherwise error9 :
# or
mydata %>%
select(`Nasty column name`)
9 If this happens to you a lot, then check out library(janitor) and its function
clean_names() for automatically tidying non-standard column names.

But as usual, if it gets printed, it doesn't get saved. We have two options - we
can either overwrite the typesdata tibble (by changing the first line to
typesdata = typesdata %>%), or we can create a new one (that appears in your Environment):
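A sketch of the 'create a new one' option (the original chunk is not shown, but it
follows the pattern above):

typesdata_modified <- typesdata %>%
  mutate(measurement_half = measurement/2)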
typesdata_modified
## # A tibble: 3 x 5
## id group measurement date measurement_half
## <chr> <chr> <dbl> <dttm> <dbl>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 0.9
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 2.25
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 1.85
The mutate() function can also be used to create a new column with a single
constant value; which in return can be used to calculate a difference for each
of the existing dates:
library(lubridate)
typesdata %>%
mutate(reference_date = ymd_hm("2020-01-01 12:00"),
dates_difference = reference_date - date) %>%
select(date, reference_date, dates_difference)
## # A tibble: 3 x 3
## date reference_date dates_difference
## <dttm> <dttm> <drtn>
## 1 2017-01-02 12:00:00 2020-01-01 12:00:00 1094.0000 days
## 2 2018-02-03 13:00:00 2020-01-01 12:00:00 696.9583 days
## 3 2019-03-04 14:00:00 2020-01-01 12:00:00 302.9167 days
(We are then using the select() function to only choose the three relevant
columns.)
Finally, the mutate function can be used to create a new column with a sum-
marised value in it, e.g., the mean of another column:
typesdata %>%
mutate(mean_measurement = mean(measurement))
## # A tibble: 3 x 5
## id group measurement date mean_measurement
## <chr> <chr> <dbl> <dttm> <dbl>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 3.33
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 3.33
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 3.33
typesdata %>%
mutate(mean_measurement = mean(measurement)) %>%
mutate(measurement_relative = measurement/mean_measurement) %>%
select(matches("measurement"))
## # A tibble: 3 x 3
## measurement mean_measurement measurement_relative
## <dbl> <dbl> <dbl>
## 1 1.8 3.33 0.54
## 2 4.5 3.33 1.35
## 3 3.7 3.33 1.11
Exercise: round the difference to 0 decimal places using the round() function inside a
mutate(). Then add a clever matches("date") inside the select() function to choose
all matching columns.
Solution:
typesdata %>%
mutate(reference_date = ymd_hm("2020-01-01 12:00"),
dates_difference = reference_date - date) %>%
mutate(dates_difference = round(dates_difference)) %>%
select(matches("date"))
## # A tibble: 3 x 3
## date reference_date dates_difference
## <dttm> <dttm> <drtn>
## 1 2017-01-02 12:00:00 2020-01-01 12:00:00 1094 days
## 2 2018-02-03 13:00:00 2020-01-01 12:00:00 697 days
## 3 2019-03-04 14:00:00 2020-01-01 12:00:00 303 days
You can shorten this by adding the round() function directly around the sub-
traction, so the third line becomes dates_difference = round(reference_date - date))
%>%. But sometimes writing calculations out longer than the absolute minimum
can make them easier to understand when you return to an old script months
later.
Furthermore, we didn’t have to save the reference_date as a new column,
the calculation could have used the value directly: mutate(dates_difference =
ymd_hm("2020-01-01 12:00") - date) %>%. But again, defining it makes it clearer for
your future self to see what was done. And it makes reference_date available for
reuse in more complicated calculations within the tibble.
And finally, we combine the filtering operators (==, >, <, etc) with the if_else()
function to create new columns based on a condition.
typesdata %>%
mutate(above_threshold = if_else(measurement > 3,
"Above three",
"Below three"))
## # A tibble: 3 x 5
## id group measurement date above_threshold
## <chr> <chr> <dbl> <dttm> <chr>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 Below three
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 Above three
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 Above three
We are sending typesdata into a mutate() function and creating a new column called above_threshold based on whether measurement is greater or less than 3. The
first argument to if_else() is a condition (in this case that measurement is
greater than 3), the second argument is the value if the condition is TRUE,
and the third argument is the value if the condition is FALSE.
It reads, “if this condition is met, return this, else return that”.
Look at each line in the tibble above and convince yourself that the threshold
variable worked as expected. Then look at the two closing brackets - )) - at
the end and convince yourself that they are both needed.
if_else() and missing values tip: for rows with missing values (NAs), the condi-
tion returns neither TRUE nor FALSE, it returns NA. And that might be fine,
but if you want to assign a specific group/label for missing values in the new
variable, you can add a fourth argument to if_else(), e.g., if_else(measurement >
3, "Above three", "Below three", "Value missing").
2.10 Create labels - paste()
The paste() function is used to add characters together. It also works with
numbers and dates which will automatically be converted to characters before
being pasted together into a single label. See this example where we use all
variables from typesdata to create a new column called plot_label (we select()
for printing space):
typesdata %>%
mutate(plot_label = paste(id,
"was last measured at", date,
", and the value was", measurement)) %>%
select(plot_label)
## # A tibble: 3 x 1
## plot_label
## <chr>
## 1 ID1 was last measured at 2017-01-02 12:00:00 , and the value was 1.8
## 2 ID2 was last measured at 2018-02-03 13:00:00 , and the value was 4.5
## 3 ID3 was last measured at 2019-03-04 14:00:00 , and the value was 3.7
paste() is also useful when pieces of information are stored in different
columns. For example, consider this made-up tibble:
pastedata
## # A tibble: 3 x 3
## year month day
## <dbl> <chr> <dbl>
## 1 2007 Jan 1
## 2 2008 Feb 2
## 3 2009 March 3
pastedata %>%
mutate(date = paste(day, month, year, sep = "-"))
## # A tibble: 3 x 4
## year month day date
## <dbl> <chr> <dbl> <chr>
## 1 2007 Jan 1 1-Jan-2007
## 2 2008 Feb 2 2-Feb-2008
## 3 2009 March 3 3-March-2009
By default, paste() adds a space between each value, but we can use the sep =
argument to specify a different separator. Sometimes it is useful to use paste0()
which does not add anything between the values (no space, no dash, etc.).
We can now tell R that the date column should be parsed as such:
library(lubridate)
pastedata %>%
mutate(date = paste(day, month, year, sep = "-")) %>%
mutate(date = dmy(date))
## # A tibble: 3 x 4
## year month day date
## <dbl> <chr> <dbl> <date>
## 1 2007 Jan 1 2007-01-01
## 2 2008 Feb 2 2008-02-02
## 3 2009 March 3 2009-03-03
2.11 Joining multiple datasets
library(tidyverse)
patientdata <- read_csv("data/patient_data.csv")
patientdata
## # A tibble: 6 x 3
## id sex age
## <dbl> <chr> <dbl>
## 1 1 Female 24
## 2 2 Male 59
## 3 3 Female 32
## 4 4 Female 84
## 5 5 Male 48
## 6 6 Female 65
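The labsdata tibble printed below is used in the joins that follow, but the code that reads it in is not included in this excerpt. A minimal sketch that reproduces it from the printed values (in practice it would more likely be read in with read_csv()):
labsdata <- tribble(
  ~id, ~measurement,
    5,         3.47,
    6,         7.31,
    8,         9.91,
    7,         6.11
)
labsdata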
## # A tibble: 4 x 2
## id measurement
## <dbl> <dbl>
## 1 5 3.47
## 2 6 7.31
## 3 8 9.91
## 4 7 6.11
Notice how these datasets are not only different sizes (6 rows in patientdata, 4
rows in labsdata), but include information on different patients: patientdata has
ids 1, 2, 3, 4, 5, 6, while labsdata has ids 5, 6, 8, 7.
A comprehensive way to join these is to use full_join() retaining all information
from both tibbles (and matching up rows by shared columns, in this case id):
full_join(patientdata, labsdata)
## Joining, by = "id"
## # A tibble: 8 x 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 1 Female 24 NA
## 2 2 Male 59 NA
## 3 3 Female 32 NA
## 4 4 Female 84 NA
## 5 5 Male 48 3.47
## 6 6 Female 65 7.31
## 7 8 <NA> NA 9.91
## 8 7 <NA> NA 6.11
In contrast, inner_join() only keeps rows that appear in both tibbles (here, ids 5 and 6):
inner_join(patientdata, labsdata)
## Joining, by = "id"
## # A tibble: 2 x 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 5 Male 48 3.47
## 2 6 Female 65 7.31
And finally, if we want to retain all information from one tibble, we use either
the left_join() or the right_join():
left_join(patientdata, labsdata)
## Joining, by = "id"
## # A tibble: 6 x 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 1 Female 24 NA
## 2 2 Male 59 NA
## 3 3 Female 32 NA
## 4 4 Female 84 NA
## 5 5 Male 48 3.47
## 6 6 Female 65 7.31
right_join(patientdata, labsdata)
## Joining, by = "id"
## # A tibble: 4 x 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 5 Male 48 3.47
## 2 6 Female 65 7.31
## 3 8 <NA> NA 9.91
## 4 7 <NA> NA 6.11
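Joins match up rows using shared columns. If we simply want to append rows - say two brand new patients arrive - we use bind_rows() instead. The tibble printed below is patientdata_new; a hedged sketch of how it might be defined, with the values taken from the printout:
patientdata_new <- tribble(
  ~id, ~sex,     ~age,
    7, "Female",   38,
    8, "Male",     29
)
patientdata_new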
## # A tibble: 2 x 3
## id sex age
## <dbl> <chr> <dbl>
## 1 7 Female 38
## 2 8 Male 29
bind_rows(patientdata, patientdata_new)
## # A tibble: 8 x 3
## id sex age
## <dbl> <chr> <dbl>
## 1 1 Female 24
## 2 2 Male 59
## 3 3 Female 32
## 4 4 Female 84
## 5 5 Male 48
## 6 6 Female 65
## 7 7 Female 38
## 8 8 Male 29
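A word of caution about duplicates: suppose the labs data gets updated with a repeat measurement for patient 5. The tibble printed below is labsdata_updated; a sketch of how it could be built from labsdata and the extra row shown in the printout:
labsdata_updated <- labsdata %>%
  bind_rows(tibble(id = 5, measurement = 2.49))
labsdata_updated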
## # A tibble: 5 x 2
## id measurement
## <dbl> <dbl>
## 1 5 3.47
## 2 6 7.31
## 3 8 9.91
## 4 7 6.11
## 5 5 2.49
left_join(patientdata, labsdata_updated)
## Joining, by = "id"
## # A tibble: 7 x 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 1 Female 24 NA
## 2 2 Male 59 NA
## 3 3 Female 32 NA
## 4 4 Female 84 NA
## 5 5 Male 48 3.47
## 6 5 Male 48 2.49
## 7 6 Female 65 7.31
We get 7 rows, instead of 6 - as patient id 5 now appears twice with the two
different measurements. So it is important to either know your datasets well
or keep an eye on the number of rows to make sure any increases/decreases in
the tibble sizes are as you expect them to be.
3
Summarising data
“The Answer to the Great Question … Of Life, the Universe and Everything
… Is … Forty-two,” said Deep Thought, with infinite majesty and calm.
Douglas Adams, The Hitchhiker’s Guide to the Galaxy
library(tidyverse)
gbd_full <- read_csv("data/global_burden_disease_cause-year-sex-income.csv")
TABLE 3.1: Deaths per year (2017) from three broad disease categories, sex,
and World Bank country-level income groups.
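The plots and summaries below work on gbd2017, the year-2017 subset of gbd_full shown in Table 3.1. The code that creates it is not included in this excerpt; a minimal sketch:
gbd2017 <- gbd_full %>%
  filter(year == 2017)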
The best way to investigate a dataset is of course to plot it. We have added a
couple of notes as comments (the lines starting with a #) for those who can’t
wait to get to the next chapter where the code for plotting will be introduced
and explained in detail. Overall, you shouldn’t waste time trying to understand
this code, but do look at the different groups within this new dataset.
gbd2017 %>%
# without the mutate(... = fct_relevel())
# the panels get ordered alphabetically
mutate(income = fct_relevel(income,
"Low",
"Lower-Middle",
"Upper-Middle",
"High")) %>%
# defining the variables using ggplot(aes(...)):
ggplot(aes(x = sex, y = deaths_millions, fill = cause)) +
# type of geom to be used: column (that's a type of barplot):
geom_col(position = "dodge") +
# facets for the income groups:
facet_wrap(~income, ncol = 4) +
# move the legend to the top of the plot (default is "right"):
theme(legend.position = "top")
FIGURE 3.1: Global Burden of Disease data with subgroups: cause, sex,
World Bank income group.
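One way to get the total number of deaths in 2017 is to pull out the deaths_millions column and add it up; the code behind the output below isn't shown in this excerpt, but it is something like:
sum(gbd2017$deaths_millions)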
## [1] 55.74
But a much cleverer way of summarising data is using the summarise() function:
gbd2017 %>%
summarise(sum(deaths_millions))
## # A tibble: 1 x 1
## `sum(deaths_millions)`
## <dbl>
## 1 55.74
This is indeed equal to the number of deaths per year we saw in the previous
chapter using the shorter version of this data (deaths from the three causes
were 10.38, 4.47, 40.89 which adds to 55.74).
sum() is a function that adds numbers together, whereas summarise() is an ef-
ficient way of creating summarised tibbles. The main strength of summarise()
is how it works with the group_by() function. group_by() and summarise() are like
cheese and wine, a perfect complement for each other, seldom seen apart.
We use group_by() to tell summarise() which subgroups to apply the calculations
on. In the above example, without group_by(), summarise just works on the
whole dataset, yielding the same result as just sending a single column into
the sum() function.
We can subset on the cause variable using group_by():
gbd2017 %>%
group_by(cause) %>%
summarise(sum(deaths_millions))
## # A tibble: 3 x 2
## cause `sum(deaths_millions)`
## <chr> <dbl>
## 1 Communicable diseases 10.38
## 2 Injuries 4.47
## 3 Non-communicable diseases 40.89
gbd2017 %>%
group_by(cause, sex) %>%
summarise(sum(deaths_millions))
## # A tibble: 6 x 3
## # Groups: cause [3]
## cause sex `sum(deaths_millions)`
## <chr> <chr> <dbl>
## 1 Communicable diseases Female 4.91
## 2 Communicable diseases Male 5.47
## 3 Injuries Female 1.42
## 4 Injuries Male 3.05
## 5 Non-communicable diseases Female 19.15
## 6 Non-communicable diseases Male 21.74
3.4 Add new columns: mutate()
We met mutate() in the last chapter. Let's first give the summarised column a
better name, e.g., deaths_per_group. We can remove groupings by using ungroup().
This is important to remember if you want to manipulate the dataset in its
original format. We can combine ungroup() with mutate() to add a total deaths
column, which will be used below to calculate a percentage:
gbd2017 %>%
group_by(cause, sex) %>%
summarise(deaths_per_group = sum(deaths_millions)) %>%
ungroup() %>%
mutate(deaths_total = sum(deaths_per_group))
## # A tibble: 6 x 4
## cause sex deaths_per_group deaths_total
## <chr> <chr> <dbl> <dbl>
## 1 Communicable diseases Female 4.91 55.74
## 2 Communicable diseases Male 5.47 55.74
## 3 Injuries Female 1.42 55.74
## 4 Injuries Male 3.05 55.74
## 5 Non-communicable diseases Female 19.15 55.74
## 6 Non-communicable diseases Male 21.74 55.74
So summarise() condenses a tibble, whereas mutate() retains its current size and
adds columns. We can also add further lines to mutate() to calculate the percentage of each group:
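(The exact code isn't included in this excerpt; a sketch using percent() from library(scales), which is introduced just below:)
library(scales)
gbd2017 %>%
  group_by(cause, sex) %>%
  summarise(deaths_per_group = sum(deaths_millions)) %>%
  ungroup() %>%
  mutate(deaths_total    = sum(deaths_per_group),
         deaths_relative = percent(deaths_per_group/deaths_total))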
## # A tibble: 6 x 5
## cause sex deaths_per_group deaths_total deaths_relative
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Communicable diseases Female 4.91 55.74 8.8%
## 2 Communicable diseases Male 5.47 55.74 9.8%
## 3 Injuries Female 1.42 55.74 2.5%
## 4 Injuries Male 3.05 55.74 5.5%
## 5 Non-communicable diseases Female 19.15 55.74 34.4%
## 6 Non-communicable diseases Male 21.74 55.74 39.0%
The percent() function comes from library(scales) and is a handy way of formatting percentages. Keep in mind that it changes the column from a number (denoted <dbl>) to a character (<chr>). The percent() function is equivalent to:
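(a sketch for the first row, 4.91 out of 55.74 million deaths:)
paste0(round(100 * 4.91/55.74, 1), "%")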
## [1] "8.8%"
This is convenient for the final presentation of numbers, but if you intend to do further calculations, plotting, or sorting with the percentages, calculate them as fractions instead:
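Here gbd2017_summarised is assumed to be the result of the pipeline above saved into its own tibble, e.g.:
gbd2017_summarised <- gbd2017 %>%
  group_by(cause, sex) %>%
  summarise(deaths_per_group = sum(deaths_millions)) %>%
  ungroup() %>%
  mutate(deaths_total = sum(deaths_per_group))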
gbd2017_summarised %>%
mutate(deaths_relative = deaths_per_group/deaths_total)
## # A tibble: 6 x 5
## cause sex deaths_per_group deaths_total deaths_relative
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Communicable diseases Female 4.91 55.74 0.08809
## 2 Communicable diseases Male 5.47 55.74 0.09813
## 3 Injuries Female 1.42 55.74 0.02548
## 4 Injuries Male 3.05 55.74 0.05472
## 5 Non-communicable diseases Female 19.15 55.74 0.3436
## 6 Non-communicable diseases Male 21.74 55.74 0.3900
So far we’ve shown you examples of using summarise() on grouped data (follow-
ing group_by()) and mutate() on the whole dataset (without using group_by()).
But here’s the thing: mutate() is also happy to work on grouped data.
Let’s save the aggregated example from above in a new tibble. We will then sort
the rows using arrange() based on sex, just for easier viewing (it was previously
sorted by cause).
The arrange() function sorts the rows within a tibble:
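The code that creates gbd_summarised isn't shown in this excerpt; a sketch consistent with the description above and the printout below:
gbd_summarised <- gbd2017 %>%
  group_by(cause, sex) %>%
  summarise(deaths_per_group = sum(deaths_millions)) %>%
  arrange(sex)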
gbd_summarised
## # A tibble: 6 x 3
## # Groups:   cause [3]
##   cause                     sex    deaths_per_group
##   <chr>                     <chr>             <dbl>
## 1 Communicable diseases     Female             4.91
## 2 Injuries                  Female             1.42
## 3 Non-communicable diseases Female            19.15
## 4 Communicable diseases     Male               5.47
## 5 Injuries                  Male               3.05
## 6 Non-communicable diseases Male              21.74
You should also notice that summarise() drops all variables that are not listed in
group_by() or created inside it. So year, income, and deaths_millions exist in gbd2017,
but they do not exist in gbd_summarised.
We now want to calculate the percentage of deaths from each cause for each
gender. We could use summarise() to calculate the totals:
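A sketch of how gbd_summarised_sex could be computed (the exact code isn't included in this excerpt):
gbd_summarised_sex <- gbd_summarised %>%
  group_by(sex) %>%
  summarise(deaths_per_sex = sum(deaths_per_group))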
gbd_summarised_sex
## # A tibble: 2 x 2
## sex deaths_per_sex
## <chr> <dbl>
## 1 Female 25.48
## 2 Male 30.26
But that drops the cause and deaths_per_group columns. One way would be to
now use a join on gbd_summarised and gbd_summarised_sex:
full_join(gbd_summarised, gbd_summarised_sex)
## Joining, by = "sex"
## # A tibble: 6 x 4
## # Groups: cause [3]
## cause sex deaths_per_group deaths_per_sex
## <chr> <chr> <dbl> <dbl>
## 1 Communicable diseases Female 4.91 25.48
## 2 Injuries Female 1.42 25.48
## 3 Non-communicable diseases Female 19.15 25.48
## 4 Communicable diseases Male 5.47 30.26
## 5 Injuries Male 3.05 30.26
## 6 Non-communicable diseases Male 21.74 30.26
But a more convenient alternative is to use mutate() on data grouped by sex, which adds the per-sex totals without a separate join:
gbd_summarised %>%
group_by(sex) %>%
mutate(deaths_per_sex = sum(deaths_per_group))
## # A tibble: 6 x 4
## # Groups: sex [2]
## cause sex deaths_per_group deaths_per_sex
## <chr> <chr> <dbl> <dbl>
## 1 Communicable diseases Female 4.91 25.48
## 2 Injuries Female 1.42 25.48
## 3 Non-communicable diseases Female 19.15 25.48
## 4 Communicable diseases Male 5.47 30.26
## 5 Injuries Male 3.05 30.26
## 6 Non-communicable diseases Male 21.74 30.26
So mutate() calculates the sums within each grouping variable (in this example
just group_by(sex)) and puts the results in a new column without condensing
the tibble down or removing any of the existing columns.
Let’s combine all of this together into a single pipeline and calculate the per-
centages per cause for each gender:
gbd2017 %>%
group_by(cause, sex) %>%
summarise(deaths_per_group = sum(deaths_millions)) %>%
group_by(sex) %>%
mutate(deaths_per_sex = sum(deaths_per_group),
sex_cause_perc = percent(deaths_per_group/deaths_per_sex)) %>%
arrange(sex, deaths_per_group)
## # A tibble: 6 x 5
## # Groups: sex [2]
## cause sex deaths_per_group deaths_per_sex sex_cause_perc
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Injuries Fema~ 1.42 25.48 6%
## 2 Communicable diseases Fema~ 4.91 25.48 19%
## 3 Non-communicable diseases Fema~ 19.15 25.48 75%
## 4 Injuries Male 3.05 30.26 10.1%
## 5 Communicable diseases Male 5.47 30.26 18.1%
## 6 Non-communicable diseases Male 21.74 30.26 71.8%
Common arithmetic functions that work well inside summarise() and mutate() include:
• sum()
• mean()
• median()
• min(), max()
• sd() - standard deviation
• IQR() - interquartile range
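All of these return NA if the data contain missing values, unless you set na.rm = TRUE. The code behind the two outputs printed below (NA, then 3) isn't included in this excerpt; one possibility consistent with them:
some_values <- c(1, 2, NA)
sum(some_values)               # NA, because of the missing value
sum(some_values, na.rm = TRUE) # the NA is dropped first, giving 3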
## [1] NA
## [1] 3
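For the select() examples that follow we only need a couple of rows. The code that creates gbd_2rows isn't shown in this excerpt; a sketch that matches the printout below (the first two rows of gbd_full):
gbd_2rows <- gbd_full %>%
  slice(1:2)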
gbd_2rows
## # A tibble: 2 x 5
## cause year sex income deaths_millions
## <chr> <dbl> <chr> <chr> <dbl>
## 1 Communicable diseases 1990 Female High 0.21
## 2 Communicable diseases 1990 Female Upper-Middle 1.150
gbd_2rows %>%
select(cause, deaths_millions)
## # A tibble: 2 x 2
## cause deaths_millions
## <chr> <dbl>
## 1 Communicable diseases 0.21
## 2 Communicable diseases 1.150
gbd_2rows %>%
select(cause, deaths = deaths_millions)
## # A tibble: 2 x 2
## cause deaths
## <chr> <dbl>
## 1 Communicable diseases 0.21
## 2 Communicable diseases 1.150
The function rename() is similar to select(), but it keeps all variables, whereas select() only kept the ones we mentioned:
gbd_2rows %>%
rename(deaths = deaths_millions)
## # A tibble: 2 x 5
## cause year sex income deaths
## <chr> <dbl> <chr> <chr> <dbl>
## 1 Communicable diseases 1990 Female High 0.21
## 2 Communicable diseases 1990 Female Upper-Middle 1.150
select() can also be used to reorder the columns in your tibble. Moving columns around does not change the analysis (the functions we showed you above, as well as plotting, only look at the column names, not their
positions in the tibble), but it is useful for organising your tibble for easier
viewing.
So if we use select like this:
gbd_2rows %>%
select(year, sex, income, cause, deaths_millions)
## # A tibble: 2 x 5
## year sex income cause deaths_millions
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1990 Female High Communicable diseases 0.21
## 2 1990 Female Upper-Middle Communicable diseases 1.150
If we just want to move a column or two to the front without typing out all the rest, we can use the select helper everything():
gbd_2rows %>%
select(year, sex, everything())
## # A tibble: 2 x 5
## year sex cause income deaths_millions
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1990 Female Communicable diseases High 0.21
## 2 1990 Female Communicable diseases Upper-Middle 1.150
And this is where the true power of select() starts to come out. In addition
to listing the columns explicitly (e.g., mydata %>% select(year, cause...)) there
are several special functions that can be used inside select(). These special
functions are called select helpers, and the first select helper we used is everything().
Other select helpers match columns by name patterns, for example starts_with():
gbd_2rows %>%
select(starts_with("deaths"))
## # A tibble: 2 x 1
## deaths_millions
## <dbl>
## 1 0.21
## 2 1.150
3.8 Reshaping data - long vs wide format
So far, all of the examples we've shown you have been using 'tidy' data. Data
is ‘tidy’ when it is in long format: each variable is in its own column, and
each observation is in its own row. This long format is efficient to use in data
analysis and visualisation and can also be considered “computer readable”.
But sometimes when presenting data in tables for humans to read, or when
collecting data directly into a spreadsheet, it can be convenient to have data
in a wide format. Data is ‘wide’ when some or all of the columns are levels of
a factor. An example makes this easier to see.
Tables 3.3 and 3.2 contain the exact same information, but in long (tidy) and
wide formats, respectively.
If we want to take the long data from Table 3.3 and put some of the numbers next to each other for easier visualisation, then pivot_wider() from the tidyr package
is the function to do it. It means we want to send a variable into columns, and
it needs just two arguments: the variable we want to become the new columns,
and the variable where the values currently are.
gbd_long %>%
pivot_wider(names_from = year, values_from = deaths_millions)
## # A tibble: 6 x 4
## cause sex `1990` `2017`
## <chr> <chr> <dbl> <dbl>
## 1 Communicable diseases Female 7.3 4.91
## 2 Communicable diseases Male 8.06 5.47
## 3 Injuries Female 1.41 1.42
## 4 Injuries Male 2.84 3.05
## 5 Non-communicable diseases Female 12.8 19.15
## 6 Non-communicable diseases Male 13.91 21.74
This means we can quickly eyeball how the number of deaths has changed
from 1990 to 2017 for each cause category and sex. Whereas if we wanted to quickly look at the difference in the number of deaths for females and males, we can change the names_from = argument from year to sex. Furthermore, we can also add a mutate() to calculate the difference:
gbd_long %>%
pivot_wider(names_from = sex, values_from = deaths_millions) %>%
mutate(Male - Female)
## # A tibble: 6 x 5
## cause year Female Male `Male - Female`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Communicable diseases 1990 7.3 8.06 0.76
## 2 Communicable diseases 2017 4.91 5.47 0.5600
## 3 Injuries 1990 1.41 2.84 1.430
## 4 Injuries 2017 1.42 3.05 1.63
## 5 Non-communicable diseases 1990 12.8 13.91 1.110
## 6 Non-communicable diseases 2017 19.15 21.74 2.59
All of these differences are positive, which means that in each year and cause category more men die than women. This makes sense, as more boys are born than girls.
And what if we want to look at both year and sex at the same time, so to create
Table 3.2 from Table 3.3? No problem, pivot_wider() can deal with multiple
variables at the same time, names_from = c(sex, year):
gbd_long %>%
pivot_wider(names_from = c(sex, year), values_from = deaths_millions)
## # A tibble: 3 x 5
## cause Female_1990 Female_2017 Male_1990 Male_2017
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Communicable diseases 7.3 4.91 8.06 5.47
## 2 Injuries 1.41 1.42 2.84 3.05
## 3 Non-communicable diseases 12.8 19.15 13.91 21.74
pivot_wider() has a few optional arguments that may be useful for you. For
example, pivot_wider(..., values_fill = 0) can be used to fill empty cases (if you
have any) with a value you specified. Or pivot_wider(..., names_sep = ": ") can
be used to change the separator that gets put between the values (e.g., you
may want “Female: 1990” instead of the default “Female_1990”). Remember
that pressing F1 when your cursor is on a function opens it up in the Help
tab where these extra options are listed.
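pivot_longer() does the opposite of pivot_wider(): it gathers values that are spread across several columns back into a single column. For the next examples, gbd_wide is assumed to be the wide table created above, e.g.:
gbd_wide <- gbd_long %>%
  pivot_wider(names_from = c(sex, year), values_from = deaths_millions)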
gbd_wide %>%
pivot_longer(matches("Female|Male"),
names_to = "sex_year",
values_to = "deaths_millions") %>%
slice(1:6)
## # A tibble: 6 x 3
## cause sex_year deaths_millions
## <chr> <chr> <dbl>
## 1 Communicable diseases Female_1990 7.3
## 2 Communicable diseases Female_2017 4.91
## 3 Communicable diseases Male_1990 8.06
## 4 Communicable diseases Male_2017 5.47
## 5 Injuries Female_1990 1.41
## 6 Injuries Female_2017 1.42
You’re probably looking at the example above and thinking that’s all nice
and simple on this miniature example dataset, but how on earth will I fig-
ure this out on a real-world example. And you’re right, we won’t deny that
pivot_longer() is one of the most technically complicated functions in this book,
and it can take a lot of trial and error to get it to work. How to get started
with your own pivot_longer() transformation is to first play with the select()
function to make sure you are telling R exactly which columns to pivot into
the longer format. For example, before working out the pivot_longer() code for
the above example, we would figure this out first:
gbd_wide %>%
select(matches("Female|Male"))
## # A tibble: 3 x 4
## Female_1990 Female_2017 Male_1990 Male_2017
## <dbl> <dbl> <dbl> <dbl>
## 1 7.3 4.91 8.06 5.47
## 2 1.41 1.42 2.84 3.05
## 3 12.8 19.15 13.91 21.74
While pivot_longer() did a great job fetching the different observations that
were spread across multiple columns into a single one, it’s still a combination
of two variables - sex and year. We can use the separate() function to deal with
that.
gbd_wide %>%
# same pivot_longer as before
pivot_longer(matches("Female|Male"),
names_to = "sex_year",
values_to = "deaths_millions") %>%
separate(sex_year, into = c("sex", "year"), sep = "_", convert = TRUE)
## # A tibble: 12 x 4
## cause sex year deaths_millions
## <chr> <chr> <int> <dbl>
## 1 Communicable diseases Female 1990 7.3
## 2 Communicable diseases Female 2017 4.91
## 3 Communicable diseases Male 1990 8.06
## 4 Communicable diseases Male 2017 5.47
## 5 Injuries Female 1990 1.41
## 6 Injuries Female 2017 1.42
## 7 Injuries Male 1990 2.84
## 8 Injuries Male 2017 3.05
## 9 Non-communicable diseases Female 1990 12.8
## 10 Non-communicable diseases Female 2017 19.15
## 11 Non-communicable diseases Male 1990 13.91
## 12 Non-communicable diseases Male 2017 21.74
We’ve also added convert = TRUE to separate() so year would get converted into
a numeric variable. The combination of, e.g., “Female-1990” is a character
variable, so after separating them both sex and year would still be classified
as characters. But the convert = TRUE recognises that year is a number and will
appropriately convert it into an integer.
3.9 arrange() rows
The arrange() function sorts rows based on the column(s) you want. By default,
it arranges the tibble in ascending order:
gbd_long %>%
arrange(deaths_millions) %>%
# first 3 rows just for printing:
slice(1:3)
## # A tibble: 3 x 4
## cause year sex deaths_millions
## <chr> <dbl> <chr> <dbl>
## 1 Injuries 1990 Female 1.41
## 2 Injuries 2017 Female 1.42
## 3 Injuries 1990 Male 2.84
gbd_long %>%
arrange(-deaths_millions) %>%
slice(1:3)
## # A tibble: 3 x 4
## cause year sex deaths_millions
## <chr> <dbl> <chr> <dbl>
## 1 Non-communicable diseases 2017 Male 21.74
## 2 Non-communicable diseases 2017 Female 19.15
## 3 Non-communicable diseases 1990 Male 13.91
The - doesn’t work for categorical variables; they need to be put in desc() for
arranging in descending order:
gbd_long %>%
arrange(desc(sex)) %>%
# printing rows 1, 2, 11, and 12
slice(1,2, 11, 12)
## # A tibble: 4 x 4
## cause year sex deaths_millions
## <chr> <dbl> <chr> <dbl>
## 1 Communicable diseases 1990 Male 8.06
## 2 Communicable diseases 2017 Male 5.47
## 3 Non-communicable diseases 1990 Female 12.8
## 4 Non-communicable diseases 2017 Female 19.15
But we can now use fct_relevel() inside mutate() to change the order of these
levels:
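The original example for this isn't included in this excerpt; a minimal sketch of the idea using the sex column (fct_relevel() converts the character column to a factor and moves the stated level first):
gbd_long %>%
  mutate(sex = fct_relevel(sex, "Male")) %>%
  arrange(sex) %>%
  slice(1:3)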
3.10 Exercises
3.10.1 Exercise - pivot_wider()
Using the GBD dataset with variables cause, year (1990 and 2017 only), sex (as
shown in Table 3.3):
Use pivot_wider() to put the cause variable into columns using the deaths_millions
as values:
TABLE 3.4: Exercise: putting the cause variable into the wide format.
Solution
gbd_long = read_csv("data/global_burden_disease_cause-year-sex.csv")
gbd_long %>%
pivot_wider(names_from = cause, values_from = deaths_millions)
Read in the full GBD dataset with variables cause, year, sex, income,
deaths_millions.
gbd_full = read_csv("data/global_burden_disease_cause-year-sex-income.csv")
glimpse(gbd_full)
## Rows: 168
## Columns: 5
## $ cause <chr> "Communicable diseases", "Communicable diseases", "...
## $ year <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 199...
## $ sex <chr> "Female", "Female", "Female", "Female", "Male", "Ma...
## $ income <chr> "High", "Upper-Middle", "Lower-Middle", "Low", "Hig...
## $ deaths_millions <dbl> 0.21, 1.15, 4.43, 1.51, 0.26, 1.35, 4.73, 1.72, 0.2...
Year 2017 of this dataset was shown in Table 3.1, the full dataset has seven
times as many observations as Table 3.1 since it includes information about
multiple years: 1990, 1995, 2000, 2005, 2010, 2015, 2017.
Investigate these code examples:
summary_data1 <-
gbd_full %>%
group_by(year) %>%
summarise(total_per_year = sum(deaths_millions))
summary_data1
## # A tibble: 7 x 2
## year total_per_year
## <dbl> <dbl>
## 1 1990 46.32
## 2 1995 48.91
## 3 2000 50.38
## 4 2005 51.25
## 5 2010 52.63
## 6 2015 54.62
## 7 2017 55.74
summary_data2 <-
gbd_full %>%
group_by(year, cause) %>%
summarise(total_per_cause = sum(deaths_millions))
summary_data2
## # A tibble: 21 x 3
## # Groups: year [7]
## year cause total_per_cause
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.36
## 2 1990 Injuries 4.25
## 3 1990 Non-communicable diseases 26.71
## 4 1995 Communicable diseases 15.11
## 5 1995 Injuries 4.53
## 6 1995 Non-communicable diseases 29.27
## 7 2000 Communicable diseases 14.81
## 8 2000 Injuries 4.56
## 9 2000 Non-communicable diseases 31.01
## 10 2005 Communicable diseases 13.89
## # ... with 11 more rows
For each cause, calculate its percentage of total deaths in each year.
Hint: Use full_join() on summary_data1 and summary_data2, and then use mutate() to
add a new column called percentage.
## # A tibble: 3 x 5
## year total_per_year cause total_per_cause percentage
## <dbl> <dbl> <chr> <dbl> <chr>
## 1 1990 46.32 Communicable diseases 15.36 33.161%
## 2 1990 46.32 Injuries 4.25 9.175%
## 3 1990 46.32 Non-communicable diseases 26.71 57.664%
Solution
library(scales)
full_join(summary_data1, summary_data2) %>%
mutate(percentage = percent(total_per_cause/total_per_year))
Instead of creating the two summarised tibbles and using a full_join(), achieve the same result as in the previous exercise with a single pipeline using summarise() and then mutate().
Hint: you have to do it the other way around, so group_by(year, cause) %>%
summarise(...) first, then group_by(year) %>% mutate().
Bonus: select() columns year, cause, percentage, then pivot_wider() the cause vari-
able using percentage as values.
Solution
gbd_full %>%
# aggregate to deaths per cause per year using summarise()
group_by(year, cause) %>%
summarise(total_per_cause = sum(deaths_millions)) %>%
# then add a column of yearly totals using mutate()
group_by(year) %>%
mutate(total_per_year = sum(total_per_cause)) %>%
# add the percentage column
mutate(percentage = percent(total_per_cause/total_per_year)) %>%
# select the final variables for better viewing
select(year, cause, percentage) %>%
pivot_wider(names_from = cause, values_from = percentage)
## # A tibble: 7 x 4
## # Groups: year [7]
## year `Communicable diseases` Injuries `Non-communicable diseases`
## <dbl> <chr> <chr> <chr>
## 1 1990 33% 9% 58%
## 2 1995 31% 9% 60%
## 3 2000 29% 9% 62%
## 4 2005 27% 9% 64%
## 5 2010 24% 9% 67%
## 6 2015 20% 8% 72%
## 7 2017 19% 8% 73%
Note that your pipelines shouldn't be much longer than this, and we often save interim results into separate tibbles for checking (like we did with summary_data1 and summary_data2), making sure the number of rows is what we expect and spot checking that the calculations worked as expected.
R doesn’t do what you want it to do, it does what you ask it to do. Testing
and spot checking is essential as you will make mistakes. We sure do.
Do not feel like you should be able to just bash out these clever pipelines
without a lot of trial and error first.
• Calculate the total number of deaths in 1990 within each income group for males and females. Hint: this is as easy as adding , sex to group_by(income).
• pivot_wider() the income column.
Solution
gbd_full %>%
filter(year == 1990) %>%
group_by(income, sex) %>%
summarise(total_deaths = sum(deaths_millions)) %>%
pivot_wider(names_from = income, values_from = total_deaths)
## # A tibble: 2 x 5
## sex High Low `Lower-Middle` `Upper-Middle`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female 4.140 2.22 8.47 6.68
## 2 Male 4.46 2.57 9.83 7.950
4
Different types of plots
There are a few different plotting packages in R, but the most elegant and
versatile one is ggplot2.1 gg stands for grammar of graphics, which means
that we can make a plot by describing it one component at a time. In other
words, we build a plot by adding layers to it.
This does not have to be many layers; the simplest ggplot() consists of just two
components:
• the variables to be plotted;
• a geometrical object (e.g., point, line, bar, box, etc.).
ggplot() calls geometrical objects geoms.
Figure 4.1 shows some example steps for building a scatter plot, including
changing its appearance (‘theme’) and faceting - an efficient way of creating
separate plots for subgroups.
1 The name of the package is ggplot2, but the function is called ggplot(). For everything you've ever wanted to know about the grammar of graphics in R, see Wickham (2016).
FIGURE 4.1: Example steps for building and modifying a ggplot. (1) Ini-
tialising the canvas and defining variables, (2) adding points, (3) colouring
points by continent, (4) changing point type, (5) faceting, (6) changing the
plot theme and the scale of the x variable.
4.1 Get the data
library(tidyverse)
library(gapminder)
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
library(tidyverse)
library(gapminder)
gapminder$year %>% unique()
gapminder$country %>% n_distinct()
gapminder$continent %>% unique()
Let’s create a new shorter tibble called gapdata2007 that only includes data for
the year 2007.
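The code that creates it isn't shown in this excerpt; a minimal sketch:
gapdata2007 <- gapminder %>%
  filter(year == 2007)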
gapdata2007
## # A tibble: 142 x 6
The new tibble - gapdata2007 - now shows up in your Environment tab, whereas
gapminder does not. Running library(gapminder) makes it available to use (so the
funny line below is not necessary for any of the code in this chapter to work),
but to have it appear in your normal Environment tab you’ll need to run this
funny looking line:
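A sketch of the 'funny looking line' implied by the text - it simply copies the gapminder dataset into a new object called gapdata:
gapdata <- gapminder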
Both gapdata and gapdata2007 now show up in the Environment tab and can be
clicked on/quickly viewed as usual.
4.2 Anatomy of ggplot explained
We will now explain the six steps shown in Figure 4.1. Note that you only need
the first two to make a plot, the rest are just to show you further functionality
and optional customisations.
(1) Start by defining the variables, e.g., ggplot(aes(x = var1, y = var2)):
gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp))
We tend to put the data first and then use the pipe (%>%) to send it to the ggplot() function. This becomes useful when we add further data wrangling functions between the data and the ggplot(). For example, our plotting
pipelines often look like this:
data %>%
filter(...) %>%
mutate(...) %>%
ggplot(aes(...)) +
...
The lines that come before the ggplot() function are piped, whereas from gg-
plot() onwards you have to use +. This is because we are now adding different
layers and customisations to the same plot.
aes() stands for aesthetics - things we can see. Variables are always inside the
aes() function, which in return is inside a ggplot(). Take a moment to appreciate
the double closing brackets )) - the first one belongs to aes(), the second one
to ggplot().
(2) Choose and add a geometrical object
Let’s ask ggplot() to draw a point for each observation by adding geom_point():
gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point()
We have now created the second plot in Figure 4.1, a scatter plot.
If we copy the above code and change just one thing - the x variable from
gdpPercap to continent (which is a categorical variable) - we get what’s called
a strip plot. This means we are now plotting a continuous variable (lifeExp)
against a categorical one (continent). But the thing to note is that the rest of
the code stays exactly the same, all we did was change the x =.
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_point()
Going back to the scatter plot (lifeExp vs gdpPercap), let’s use continent to give
the points some colour. We can do this by adding colour = continent inside the
aes():
FIGURE 4.2: A strip plot using geom_point().
gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
geom_point()
This creates the third plot in Figure 4.1. It uses the default colour scheme and
will automatically include a legend. Still with just two lines of code (ggplot(...)
+ geom_point()).
(4) Specifying aesthetics outside aes()
When we want a setting to apply to all points at once (rather than mapping it to a variable), it goes inside the geom but outside aes(); here shape = 1 draws hollow circles.
[Figure: a selection of R point shape codes, e.g., 0, 1, 2, 4, 8, 15, 16, 17, 21, 22, 23.]
gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
geom_point(shape = 1)
gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
geom_point(shape = 1) +
facet_wrap(~continent)
This creates the fifth plot in Figure 4.1. Note that we have to use the tilde (~)
in facet_wrap(). There is a similar function called facet_grid() that will create
a grid of plots based on two grouping variables, e.g., facet_grid(var1~var2). Fur-
thermore, facets are happy to quickly separate data based on a condition (so
something you would usually use in a filter).
gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
geom_point(shape = 1) +
facet_wrap(~pop > 50000000)
On this plot, the facet FALSE includes countries with a population less than 50
million people, and the facet TRUE includes countries with a population greater
than 50 million people.
The tilde (~) in R denotes dependency. It is mostly used by statistical models
to define dependent and explanatory variables and you will see it a lot in the
second part of this book.
(6) Grey to white background - changing the theme
Overall, we can customise every single thing on a ggplot: font type, colour, size or thickness of any lines or numbers, background, you name it. But a very quick way to change the overall look of your plot is to apply one of the built-in themes (Figure 4.5).
FIGURE 4.5: Some of the built-in ggplot themes (1) default (2) theme_bw(),
(3) theme_dark(), (4) theme_classic().
As a final step, we are adding theme_bw() (“background white”) to give the plot
a different look. We have also divided the gdpPercap by 1000 (making the units "thousands of dollars per capita"). Note that you can apply calculations directly on ggplot variables (which is how we've done x = gdpPercap/1000 here).
gapdata2007 %>%
ggplot(aes(x = gdpPercap/1000, y = lifeExp, colour = continent)) +
geom_point(shape = 1) +
facet_wrap(~continent) +
theme_bw()
If you find yourself always adding the same theme to your plot (i.e., we really
like the + theme_bw()), you can use theme_set() so your chosen theme is applied
to every plot you draw:
theme_set(theme_bw())
In fact, we usually have these two lines at the top of every script:
library(tidyverse)
theme_set(theme_bw())
4.4 Scatter plots/bubble plots
The ggplot anatomy (Section 4.2) covered both scatter and strip plots (both
created with geom_point()). Another cool thing about this geom is that adding
a size aesthetic makes it into a bubble plot. For example, let’s size the points
by population.
As you would expect from a “grammar of graphics plot”, this is as simple as
adding size = pop as an aesthetic:
gapdata2007 %>%
ggplot(aes(x = gdpPercap/1000, y = lifeExp, size = pop)) +
geom_point()
With increased bubble sizes, there is some overplotting, so let’s make the
points hollow (shape = 1) and slightly transparent (alpha = 0.5):
gapdata2007 %>%
ggplot(aes(x = gdpPercap/1000, y = lifeExp, size = pop)) +
geom_point(shape = 1, alpha = 0.5)
FIGURE 4.6: Turn the scatter plot from Figure 4.1:(2) to a bubble plot
by (1) adding size = pop inside the aes(), (2) make the points hollow and
transparent.
Alpha is an aesthetic to make geoms transparent, its values can range from 0
(invisible) to 1 (solid).
4.5 Line plots/time series plots
Let's plot the life expectancy in the United Kingdom over time (Figure 4.7):
gapdata %>%
filter(country == "United Kingdom") %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line()
FIGURE 4.7: geom_line()- Life expectancy in the United Kingdom over time.
gapdata %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line()
The reason you see this weird zigzag in Figure 4.8 (1) is that, using the above
code, ggplot() does not know which points to connect with which. Yes, you
know you want a line for each country, but you haven’t told it that. So for
drawing multiple lines, we need to add a group aesthetic, in this case group =
country:
gapdata %>%
ggplot(aes(x = year, y = lifeExp, group = country)) +
geom_line()
This code works as expected (Figure 4.8 (2)) - yes there is a lot of overplotting
but that’s just because we’ve included 142 lines on a single plot.
FIGURE 4.8: The ‘zig-zag plot’ is a common mistake: Using geom_line() (1)
without a group specified, (2) after adding group = country.
4.5.1 Exercise
[Exercise figure: life expectancy over time for each country, coloured by continent and faceted by continent - try to recreate this plot (a solution is given in Section 4.11).]
4.6 Bar plots
There are two geoms for making bar plots - geom_col() and geom_bar() - and the examples below will illustrate when to use which one. In short: if your data is already summarised or includes values for y (height of the bars), use geom_col().
If, however, you want ggplot() to count up the number of rows in your dataset,
use geom_bar(). For example, with patient-level data (each row is a patient)
you’ll probably want to use geom_bar(), with data that is already somewhat
aggregated, you’ll use geom_col(). There is no harm in trying one, and if it
doesn’t work, trying the other.
gapdata2007 %>%
filter(country %in% c("United Kingdom", "France", "Germany")) %>%
ggplot(aes(x = country, y = lifeExp)) +
geom_col()
This gives us Figure 4.10:1. We have also created another cheeky one using the
same code but changing the scale of the y axis to be more dramatic (Figure
4.10:2).
FIGURE 4.10: Bar plots using geom_col(): (1) using the code example, (2)
same plot but with + coord_cartesian(ylim=c(79, 81)) to manipulate the scale into
something a lot more dramatic.
• geom_bar(), in contrast, counts up the number of observations (rows) for the given variable and plots the counts as bars.
Our gapdata2007 tibble has a row for each country (see end of Section 4.1 to
remind yourself). Therefore, if we use the count() function on the continent
variable, we are counting up the number of countries on each continent (in
this dataset):2
gapdata2007 %>%
count(continent)
## # A tibble: 5 x 2
## continent n
## <fct> <int>
## 1 Africa 52
## 2 Americas 25
## 3 Asia 33
## 4 Europe 30
## 5 Oceania 2
So geom_bar() basically runs the count() function and plots it (see how the bars
on Figure 4.11 are the same height as the values from count(continent)).
FIGURE 4.11: geom_bar() counts up the number of observations for each
group. (1) gapdata2007 %>% ggplot(aes(x = continent)) + geom_bar(), (2) same + a
little bit of magic to reveal the underlying data.
gapdata2007 %>%
ggplot(aes(x = continent)) +
geom_bar()
2 The number of countries in this dataset is 142, whereas the United Nations have 193 member states.
Whereas on the second one, we’ve asked geom_bar() to reveal the components
(countries) in a colourful way:
gapdata2007 %>%
ggplot(aes(x = continent, colour = country)) +
geom_bar(fill = NA) +
theme(legend.position = "none")
Figure 4.11 also reveals the difference between a colour and a fill. Colour is the
border around a geom, whereas fill is inside it. Both can either be set based
on a variable in your dataset (this means colour = or fill = needs to be inside
the aes() function), or they could be set to a fixed colour.
R has an amazing knowledge of colour. In addition to knowing what is “white”,
“yellow”, “red”, “green” etc. (meaning we can simply do geom_bar(fill = "green")),
it also knows what “aquamarine”, “blanchedalmond”, “coral”, “deeppink”,
“lavender”, “deepskyblue” look like (amongst many many others; search the
internet for “R colours” for a full list).
We can also use Hex colour codes, for example, geom_bar(fill = "#FF0099") is a
very pretty pink. Every single colour in the world can be represented with
a Hex code, and the codes are universally known by most plotting or image
making programmes. Therefore, you can find Hex colour codes from a lot of
places on the internet, or https://www.color-hex.com just to name one.
4.6.4 Proportions
gapdata2007 %>%
ggplot(aes(x = "Global", fill = continent)) +
geom_bar()
FIGURE 4.12: Number of countries in the gapminder dataset with proportions using the fill = continent aesthetic.
4.6.5 Exercise
FIGURE 4.13: Barplot exercise. Life expectancies in European countries in
year 2007 from the gapminder dataset.
4.7 Histograms
gapdata2007 %>%
ggplot(aes(x = lifeExp)) +
geom_histogram(binwidth = 10)
We can see that most countries in the world have a life expectancy of ~70-80 years (in 2007), and that the distribution of life expectancies globally is not normally distributed.
FIGURE 4.14: geom_histogram() - The distribution of life expectancies in dif-
ferent countries around the world in year 2007.
Box plots are our go-to method for quickly visualising summary statistics of a
continuous outcome variable (such as life expectancy in the gapminder dataset,
Figure 4.15).
Box plots include:
• the median (middle line in the box)
• inter-quartile range (IQR, top and bottom parts of the boxes - this is where
50% of your data is)
• whiskers (the black lines extending to the lowest and highest values that are
still within 1.5*IQR)
• outliers (any observations outwith the whiskers)
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()
FIGURE 4.15: geom_boxplot() - Boxplots of life expectancies within each con-
tinent in year 2007.
4.9 Multiple geoms, multiple aes()
One of the coolest things about ggplot() is that we can plot multiple geoms on
top of each other!
Let’s add individual data points on top of the box plots:
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot() +
geom_point()
# (3)
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp, colour = continent)) +
geom_boxplot() +
geom_jitter()
# (4)
gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  geom_jitter(aes(colour = continent))
[Figure 4.16: box plots with individual data points added on top, panels (1)-(4); in (4) only the jittered points are coloured by continent, so the box plots stay black.]
This is new: aes() inside a geom, not just at the top! In the code for (4) you
can see aes() in two places - at the top and inside the geom_jitter(). And colour
= continent was only included in the second aes(). This means that the jittered
points get a colour, but the box plots will be drawn without (so just black).
This is exactly* what we see in Figure 4.16.
*Nerd alert: the variation added by geom_jitter() is random, which means that
when you recreate the same plots the points will appear in slightly different
locations to ours. To make identical ones, add position = position_jitter(seed =
1) inside geom_jitter().
4.9.1 Worked example - three geoms together
Let’s combine three geoms by including text labels on top of the box plot +
points from above.
We are creating a new tibble called label_data by filtering for the country with the maximum life expectancy in each continent (group_by(continent)):
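The code that creates label_data isn't included in this excerpt; a sketch consistent with the description above and the printout below:
label_data <- gapdata2007 %>%
  group_by(continent) %>%
  filter(lifeExp == max(lifeExp)) %>%
  select(country, continent, lifeExp)
label_data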
## # A tibble: 5 x 3
## # Groups: continent [5]
## country continent lifeExp
## <fct> <fct> <dbl>
## 1 Australia Oceania 81.2
## 2 Canada Americas 80.7
## 3 Iceland Europe 81.8
## 4 Japan Asia 82.6
## 5 Reunion Africa 76.4
The first two geoms are from the previous example (geom_boxplot() and
geom_jitter()). Note that ggplot() plots them in the order they are in the code
- so box plots at the bottom, jittered points on the top. We are then adding
geom_label() with its own data option (data = label_data) as well as a new aes-
thetic (aes(label = country), Figure 4.17):
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
# First geom - boxplot
geom_boxplot() +
# Second geom - jitter with its own aes(colour = )
geom_jitter(aes(colour = continent)) +
# Third geom - label, with its own dataset (label_data) and aes(label = )
geom_label(data = label_data, aes(label = country))
A few suggested experiments to try with the 3-geom plot code above:
• remove data = label_data, from geom_label() and you’ll get all 142 labels (so it
will plot a label for the whole gapdata2007 dataset);
• change from geom_label() to geom_text() - it works similarly but doesn’t have
the border and background behind the country name;
• change label = country to label = lifeExp, this plots the maximum value, rather
than the country name.
[Figure 4.17: box plots with jittered, coloured points, plus a label for the country with the highest life expectancy in each continent.]
In this chapter we have introduced some of the most common geoms, as well
as explained how ggplot() works. In fact, ggplot has 56 different geoms for you
to use; see its documentation for a full list: https://ggplot2.tidyverse.org.
With the ability of combining multiple geoms together on the same plot, the
possibilities really are endless. Furthermore, the plotly Graphic Library (https:
//plot.ly/ggplot2/) can make some of your ggplots interactive, meaning you can
use your mouse to hover over the point or zoom and subset interactively.
The two most important things to understand about ggplot() are:
• Variables (columns in your dataset) need to be inside aes();
• aes() can be both at the top - data %>% ggplot(aes()) - as well as inside a geom
(e.g., geom_point(aes())). This distinction is useful when combining multiple
geoms. All your geoms will “know about” the top-level aes() variables, but
including aes() variables inside a specific geom means it only applies to that
one.
4.11 Solutions
library(tidyverse)
library(gapminder)
gapminder %>%
ggplot(aes(x = year,
y = lifeExp,
group = country,
colour = continent)) +
geom_line() +
facet_wrap(~continent) +
theme_bw() +
scale_colour_brewer(palette = "Paired")
library(tidyverse)
library(gapminder)
gapminder %>%
filter(year == 2007) %>%
filter(continent == "Europe") %>%
ggplot(aes(x = fct_reorder(country, lifeExp), y = lifeExp)) +
geom_col(colour = "deepskyblue", fill = NA) +
coord_flip() +
theme_classic()
4.12 Extra: Advanced examples
Here are two examples of how just a few lines of ggplot() code and the basic geoms introduced in this chapter can be used to make very different things. Let your imagination fly free when using ggplot()!
Figure 4.18 shows how the life expectancies in European countries have in-
creased by plotting a square (geom_point(shape = 15)) for each observation (year)
in the dataset.
gapdata %>%
filter(continent == "Europe") %>%
ggplot(aes(y = fct_reorder(country, lifeExp, .fun=max),
x = lifeExp,
colour = year)) +
geom_point(shape = 15, size = 2) +
scale_colour_distiller(palette = "Greens", direction = 1) +
theme_bw()
[Figure 4.18: life expectancy in European countries over time, drawn as one square per year and coloured by year, with countries ordered by their maximum life expectancy.]
gapdata2007 %>%
group_by(continent) %>%
mutate(country_number = seq_along(country)) %>%
ggplot(aes(x = continent)) +
geom_bar(aes(colour = continent), fill = NA, show.legend = FALSE) +
geom_text(aes(y = country_number, label = country), vjust = 1)+
geom_label(aes(label = continent), y = -1) +
theme_void()
[Figure: the second advanced example - a bar plot of the number of countries per continent, with the individual country names printed inside each bar and the continent name as a label underneath.]
5
Fine tuning plots
We can save a ggplot() object into a variable (we usually call it p but it can be any name). This then appears in the Environment tab. To plot it, it needs to be recalled on a separate line to get drawn (Figure 5.1). Saving a plot into a variable allows us to modify it later (e.g., p + theme_bw()).
library(gapminder)
library(tidyverse)
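The definition of p0 isn't included in this excerpt. A hedged guess, consistent with the later figures (a gdpPercap vs lifeExp scatter plot coloured by continent; the regression lines mentioned in the zoom exercise suggest a geom_smooth(method = "lm") layer - the exact options are assumptions):
p0 <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point(shape = 1) +
  geom_smooth(method = "lm")  # assumed from the 'linear regression lines' mentioned later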
p0
5.2 Scales
5.2.1 Logarithmic
p1 <- p0 + scale_x_log10()
Other transformation options for continuous scales exist, namely "reverse", "log2", or "sqrt". Check the Help tab for scale_continuous() or look up its online documentation for a full list.
A quick way to expand the limits of your plot is to specify the value you want
to be included:
p2 <- p0 + expand_limits(y = 0)
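The p3 definition isn't shown in this excerpt; from the Figure 5.2 caption it expands the y axis to include both 0 and 100:
p3 <- p0 + expand_limits(y = c(0, 100))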
By default, ggplot() adds some padding around the included area (see how
the scale doesn’t start from 0, but slightly before). This ensures points on
the edges don't get overlapped by the axes, but in some cases - especially if you've already expanded the scale - you might want to remove this extra padding. You can remove it with the expand argument:
p4 <- p0 +
expand_limits(y = c(0, 100)) +
coord_cartesian(expand = FALSE)
We are now using a new library - patchwork - to print all 4 plots together (Figure 5.2). Its syntax is very simple - it allows us to add ggplot objects
together. (Trying to do p1 + p2 without loading the patchwork package will
not work, R will say “Error: Don’t know how to add p2 to a plot”.)
library(patchwork)
p1 + p2 + p3 + p4 + plot_annotation(tag_levels = "1", tag_prefix = "p")
FIGURE 5.2: p1: Using a logarithmic scale for the x axis. p2: Expanding
the limits of the y axis to include 0. p3: Expanding the limits of the y axis to
include 0 and 100. p4: Removing extra padding around the limits.
5.2.3 Zoom in
p5 <- p0 +
coord_cartesian(ylim = c(70, 85), xlim = c(20000, 40000))
5.2.4 Exercise
How does the plot below, which uses scale limits instead of coord_cartesian(), differ from p5 above?
p6 <- p0 +
scale_y_continuous(limits = c(70, 85)) +
scale_x_continuous(limits = c(20000, 40000))
Answer: the first one zooms in, still retaining information about the excluded
points when calculating the linear regression lines. The second one removes
the data (as the warnings say), calculating the linear regression lines only for
the visible points.
[Figure 5.3: p5 (zoomed in with coord_cartesian()) and p6 (axis limits set via the scales) side by side.]
ggplot() does a good job of deciding how many and which values to include on the axis (e.g., 70/75/80/85 for the y axes in Figure 5.3). But sometimes you'll
want to specify these, for example, to indicate threshold values or a maximum
(Figure 5.4). We can do so by using the breaks argument:
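The code for p7 and p8 is not shown on this page; a sketch consistent with Figure 5.4 below (break positions and label text are taken from the figure, the rest is an assumption):
p7 <- p0 +
  scale_y_continuous(breaks = c(18, 50, 82.6))

# labels do not have to be numbers:
p8 <- p0 +
  scale_y_continuous(breaks = c(18, 50, 82.6),
                     labels = c("Adults", "50", "MAX"))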
FIGURE 5.4: p7: Specifying y axis breaks. p8: Adding custom labels for our breaks.
5.3 Colours
5.3.1 Using the Brewer palettes
The easiest way to change the colour palette of your ggplot() is to specify a Brewer palette (Harrower and Brewer, 2003):
p9 <- p0 +
scale_color_brewer(palette = "Paired")
Note that http://colorbrewer2.org/ also has options for Colourblind safe and
Print friendly.
The legend title can be changed at the same time - it is given as the first argument to the scale function (the \n below inserts a line break):
p10 <- p0 +
scale_color_brewer("Continent - \n one of 5", palette = "Paired")
FIGURE 5.5: p9: Choosing a Brewer palette for your colours. p10: Changing
the legend title.
R also knows the names of many colours, so we can use words to specify
colours:
p11 <- p0 +
scale_color_manual(values = c("red", "green", "blue", "purple", "pink"))
The same function can also take HEX codes for specifying colours:
p12 <- p0 +
scale_color_manual(values = c("#8dd3c7", "#ffffb3", "#bebada",
"#fb8072", "#80b1d3"))
FIGURE 5.6: Colours can also be specified using words ("red", "green", etc.),
or HEX codes ("#8dd3c7", "#ffffb3", etc.).
5.4 Titles and labels
We've been using the labs(tag = ) function to add tags to plots. But the labs() function can also be used to modify axis labels, or to add a title, subtitle, or a caption to your plot (Figure 5.7):
p13 <- p0 +
labs(x = "Gross domestic product per capita",
y = "Life expectancy",
title = "Health and economics",
subtitle = "Gapminder dataset, 2007",
caption = Sys.Date(),
tag = "p13")
p13
5.4.1 Annotation
p14 <- p0 +
annotate("text",
x = 25000,
y = 50,
label = "No points here!")
p15 <- p0 +
annotate("label",
x = 25000,
y = 50,
label = "No points here!")
p16 <- p0 +
annotate("label",
x = 25000,
y = 50,
label = "No points here!",
hjust = 0)
hjust stands for horizontal justification. Its default value is 0.5 (see how the label was centred at 25,000 - our chosen x location); 0 means the label extends to the right from 25,000, and 1 would make it end at 25,000.
FIGURE 5.8: p14: annotate("text", ...) to quickly add text to your plot. p15: annotate("label") is similar but draws a box around your text (making it a label). p16: Using hjust to control the horizontal justification of the annotation.
p17 <- p0 +
annotate("text",
x = 25000,
y = 50,
label = plot_rsquared, parse = TRUE,
hjust = 0)
p17
FIGURE 5.9: p17: Using a superscript in your plot annotation.
5.5 Overall look - theme()
And finally, everything else on a plot - from the font to the background to the space between facets - can be changed using the theme() function. As you saw in the previous chapter, in addition to its default grey background, ggplot2 also comes with a few built-in themes, namely theme_bw() or theme_classic(). These produce good-looking plots that may already be publication ready. But if we do decide to tweak them, then the main theme() arguments we use are axis.text, axis.title, and legend.position.1 Note that all of these go inside theme(), and that the axis.text and axis.title arguments are usually followed by = element_text() as shown in the examples below.
The way the axis.text and axis.title arguments of theme() work is that if you specify .x or .y, the change is applied to that axis alone; not specifying these applies the change to both axes. Both the angle and vjust (vertical justification) options can be useful if your axis text doesn't fit well and overlaps. It doesn't usually make sense to change the colour of the font to anything other than "black"; we are using green, red, and blue here to indicate which parts of the plot get changed with each line (Figure 5.10).
1 To see a full list of possible arguments to theme(), navigate to it in the Help tab or find its online documentation at https://ggplot2.tidyverse.org/.
p18 <- p0 +
theme(axis.text.y = element_text(colour = "green", size = 14),
axis.text.x = element_text(colour = "red", angle = 45, vjust = 0.5),
axis.title = element_text(colour = "blue", size = 16)
)
p18
FIGURE 5.10: p18: Using axis.text and axis.title within theme() to tweak
the appearance of your plot, including font size and angle. Coloured font is
used to indicate which part of the code was used to change each element.
The position of the legend can be changed using the legend.position argument
within theme(). It can be positioned using the following words: "right", "left",
"top", "bottom". Or to remove the legend completely, use "none":
p19 <- p0 +
theme(legend.position = "none")
Alternatively, we can use relative coordinates (0–1) to give the legend a relative
x-y location (Figure 5.11):
p20 <- p0 +
theme(legend.position = c(1,0), #bottom-right corner
legend.justification = c(1,0))
FIGURE 5.11: p19: Removing the legend with legend.position = "none". p20: Placing the legend inside the plot area using relative coordinates.
For example, this is how to change the number of columns within the legend
(Figure 5.12):
p21 <- p0 +
guides(colour = guide_legend(ncol = 2)) +
theme(legend.position = "top") # moving to the top optional
p21
FIGURE 5.12: p21: Changing the number of columns within a legend.
5.6 Saving your plot
In Chapters 12 and 13 we'll show you how to export descriptive text, figures, and tables directly from R to Word/PDF/HTML using the power of R Markdown. The ggsave() function, however, can be used to save a single plot into a variety of formats, namely "pdf" or "png" (an example is shown below).
If you omit the first argument - the plot object - and call, e.g., ggsave(file = "plot.png"), it will just save the last plot that got printed.
Text size tip: playing around with the width and height options (they’re in
inches) can be a convenient way to increase or decrease the relative size of
the text on the plot. Look at the relative font sizes of the two versions of the
ggsave() call, one 5x4, the other one 10x8 (Figure 5.13):
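The ggsave() calls themselves are not shown on this page; a minimal sketch of the two versions described (file names are placeholders):
ggsave(p0, file = "my_plot_5x4.png", width = 5, height = 4)
ggsave(p0, file = "my_plot_10x8.png", width = 10, height = 8)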
FIGURE 5.13: Experimenting with the width and height options within
ggsave() can be used to quickly change how big or small some of the text on
your plot looks.
Part II
Data analysis
In the second part of this book, we focus specifically on the business of data
analysis, that is, formulating clear questions and seeking to answer them using
available datasets.
Again, we emphasise the importance of understanding the underlying data
through visualisation, rather than relying on statistical tests or, heaven forbid,
the p-value alone.
There are five chapters. Testing for continuous outcome variables (6) leads
naturally into linear regression (7). We would expect the majority of actual
analysis done by readers to be using the methods in chapter 7 rather than
6. Similarly, testing for categorical outcome variables (8) leads naturally to
logistic regression (9), where we would expect the majority of work to focus.
Chapters 6 and 8, however, do provide helpful reminders of how to prepare data for these analyses and shouldn't be skipped. Time-to-event data (10) introduces survival analysis and includes sections on the manipulation of dates.
6 Working with continuous outcome variables
The examples in this chapter all use the data introduced previously from the
amazing Gapminder project1 . We will start by looking at the life expectancy
of populations over time and in different geographical regions.
1 https://www.gapminder.org/
# Load packages
library(tidyverse)
library(finalfit)
library(gapminder)
It is vital that datasets be carefully inspected when first read (for help reading
data into R see 2.1). The three functions below provide a clear summary, al-
lowing errors or miscoding to be quickly identified. It is particularly important
to ensure that any missing data is identified (see Chapter 11). If you don’t do
this you will regret it! There are many times when an analysis has got to a
relatively advanced stage before the researcher was hit by the realisation that
the dataset was far from complete.
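The code producing the summaries below is not shown on this page; based on the output and the later ff_glimpse() call, the three functions are most likely glimpse(), missing_glimpse(), and ff_glimpse() - a sketch:
gapdata <- gapminder          # copy the dataset with a shorter name

glimpse(gapdata)              # each variable as a line: type and first values
missing_glimpse(gapdata)      # missing data for each variable
ff_glimpse(gapdata)           # summary statistics for each variable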
## Rows: 1,704
## Columns: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
As can be seen, there are 6 variables, 4 are continuous and 2 are categorical.
The categorical variables are already identified as factors. There are no missing
data. Note that by default, the maximum number of factor levels shown is five,
TABLE 6.1: Gapminder dataset, ff_glimpse: continuous.
label var_type n missing_n mean sd median
year <int> 1704 0 1979.5 17.3 1979.5
lifeExp <dbl> 1704 0 59.5 12.9 60.7
pop <int> 1704 0 29601212.3 106157896.7 7023595.5
gdpPercap <dbl> 1704 0 7215.3 9857.5 3531.8
which is why 142 country names are not printed. This can be adjusted using
ff_glimpse(gapdata, levels_cut = 142)
6.4 Plot the data
6.4.1 Histogram
gapdata %>%
filter(year %in% c(2002, 2007)) %>%
ggplot(aes(x = lifeExp)) + # remember aes()
geom_histogram(bins = 20) + # histogram with 20 bars
facet_grid(year ~ continent) # optional: add scales="free"
FIGURE 6.1: Histogram: Country life expectancy by continent and year (2002 and 2007).
What can we see? That life expectancy in Africa is lower than in other re-
gions. That we have little data for Oceania given there are only two countries
included, Australia and New Zealand. That Africa and Asia have greater vari-
ability in life expectancy by country than in the Americas or Europe. That the
data follow a reasonably normal shape, with Africa 2002 a little right skewed.
6.4.2 Q-Q plot
gapdata %>%
filter(year %in% c(2002, 2007)) %>%
ggplot(aes(sample = lifeExp)) + # Q-Q plot requires 'sample'
geom_qq() + # defaults to normal distribution
geom_qq_line(colour = "blue") + # add the theoretical line
facet_grid(year ~ continent)
FIGURE 6.2: Q-Q plot: Country life expectancy by continent and year.
What can we see? We are looking to see if the data (dots) follow the straight line which we included in the plot. These do reasonably well, except for Africa, which curves upwards at each end. This is the right skew we could see on the histograms too. If your data do not follow a normal distribution, then you should avoid using a t-test or ANOVA when comparing groups. Non-parametric tests are one alternative and are described in Section 6.9.
We are frequently asked about the pros and cons of checking for normality
using a statistical test, such as the Shapiro-Wilk normality test. We don’t rec-
ommend it. The test is often non-significant when the number of observations
is small but the data look skewed, and often significant when the number
of observations is high but the data look reasonably normal on inspection of
plots. It is therefore not useful in practice - common sense should prevail.
6.4.3 Boxplot
Boxplots are our preferred method for comparing a continuous variable such
as life expectancy across a categorical explanatory variable. For continuous
data, box plots are a lot more appropriate than bar plots with error bars (also
known as dynamite plots). We intentionally do not even show you how to
make dynamite plots.
The box represents the median (bold horizontal line in the middle) and in-
terquartile range (where 50% of the data sits). The lines (whiskers) extend to
the lowest and highest values that are still within 1.5 times the interquartile
range. Outliers (anything outwith the whiskers) are represented as points.
The beautiful boxplot thus contains information not only on central tendency (the median), but also on the variation and distribution of the data; for instance, a skew should be obvious.
gapdata %>%
filter(year %in% c(2002, 2007)) %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot() +
facet_wrap(~ year)
FIGURE 6.3: Boxplot: Country life expectancy by continent and year.
What can we see? The median life expectancy is lower in Africa than in any
other continent. The variation in life expectancy is greatest in Africa and
smallest in Oceania. The data in Africa looks skewed, particularly in 2002 -
the lines/whiskers are unequal lengths.
We can add further arguments to adjust the plot to our liking. We particularly
encourage the inclusion of the actual data points, here using geom_jitter().
gapdata %>%
filter(year %in% c(2002, 2007)) %>%
ggplot(aes(x = factor(year), y = lifeExp)) +
geom_boxplot(aes(fill = continent)) + # add colour to boxplots
geom_jitter(alpha = 0.4) + # alpha = transparency
facet_wrap(~ continent, ncol = 5) + # spread by continent
theme(legend.position = "none") + # remove legend
xlab("Year") + # label x-axis
ylab("Life expectancy (years)") + # label y-axis
ggtitle(
"Life expectancy by continent in 2002 v 2007") # add title
FIGURE 6.4: Boxplot with jitter points: Country life expectancy by conti-
nent and year.
6.5 Compare the means of two groups
Referring to Figure 6.3, let's compare life expectancy between Asia and Europe for 2007. What is imperative is that you decide what sort of difference exists by looking at the boxplot, rather than relying on the t-test output. The median for Europe is clearly higher than in Asia. The distributions overlap, but it looks likely that Europe has a higher life expectancy than Asia.
By running the two-sample t-test here, we make the assumption that life
expectancy in each country represents an independent measurement of life
expectancy in the continent as a whole. This isn’t quite right if you think
about it carefully.
Imagine a room full of enthusiastic geniuses learning R. They arrived today
from various parts of the country. For reasons known only to you, you want to
know whether the average (mean) height of those wearing glasses is different
to those with perfect vision.
You measure the height of each person in the room, check them for glasses,
and run a two-sample t-test.
In statistical terms, your room represents a sample from an underlying pop-
ulation. Your ability to answer the question accurately relies on a number of
factors. For instance, how many people are in the room? The more there are,
the more certain you can be about the mean measurement in your groups
being close to the mean measurement in the overall population.
What is also crucial is that your room is a representative sample of the pop-
ulation. Are the observations independent, i.e., is each observation unrelated
to the others?
If you have inadvertently included a family of bespectacled nerdy giants, not
typical of those in the country as a whole, your estimate will be wrong and
your conclusion incorrect.
So in our example of countries and continents, you have to assume that the
mean life expectancy of each country does not depend on the life expectancies
of other countries in the group. In other words, that each measurement is
independent.
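The code running this t-test is not shown on this page; based on the output below and the ttest_result object discussed afterwards, a likely sketch (Asia and Europe, 2007):
ttest_data <- gapdata %>%
  filter(year == 2007) %>%
  filter(continent %in% c("Asia", "Europe"))

ttest_result <- ttest_data %>%
  t.test(lifeExp ~ continent, data = .)
ttest_result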
##
## Welch Two Sample t-test
##
## data: lifeExp by continent
## t = -4.6468, df = 41.529, p-value = 3.389e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.926525 -3.913705
## sample estimates:
## mean in group Asia mean in group Europe
## 70.72848 77.64860
The Welch two-sample t-test is the most flexible and copes with differences
in variance (variability) between groups, as in this example. The difference
in means is provided at the bottom of the output. The t-value, degrees of
freedom (df) and p-value are all provided. The p-value is 0.00003.
We used the assignment arrow to save the results of the t-test into a new
object called ttest_result. If you look at the Environment tab, you should
see ttest_result there. If you click on it - to view it - you’ll realise that it’s
not structured like a table, but a list of different pieces of information. The
structure of the t-test object is shown in Figure 6.5.
FIGURE 6.5: A list object that is the result of a t-test in R. We will show
you ways to access these numbers and how to wrangle them into more familiar
tables/tibbles.
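The single p-value printed below was presumably extracted directly from that list object; a sketch:
ttest_result$p.value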
## [1] 3.38922e-05
The confidence interval of the difference in mean life expectancy between the
two continents:
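A sketch of the extraction (the values are the same as in the full output above):
ttest_result$conf.int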
The broom package provides useful methods for 'tidying' common model outputs into a tibble:
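The broom example itself is not shown on this page; a sketch consistent with the explanation of data = . that follows (piping the filtered data into t.test() and then into tidy()):
library(broom)
gapdata %>%
  filter(year == 2007) %>%
  filter(continent %in% c("Asia", "Europe")) %>%
  t.test(lifeExp ~ continent, data = .) %>%
  tidy()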
In the code above, the data = . bit is necessary because the pipe usually sends
data to the beginning of function brackets. So gapdata %>% t.test(lifeExp ~ conti-
nent) would be equivalent to t.test(gapdata, lifeExp ~ continent). However, this is
not an order that t.test() will accept. t.test() wants us to specify the formula
first, and then wants the data these variables are present in. So we have to
use the . to tell the pipe to send the data to the second argument of t.test(),
not the first.
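The object paired_data used below is not defined on this page; based on the 33-country table further down (Asian countries, years 2002 and 2007), it was likely created like this:
paired_data <- gapdata %>%
  filter(year %in% c(2002, 2007)) %>%
  filter(continent == "Asia")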
paired_data %>%
ggplot(aes(x = year, y = lifeExp,
group = country)) + # for individual country lines
geom_line()
What is the difference in life expectancy for each individual country? We don’t
usually have to produce this directly, but here is one method.
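Only the tail of the paired_table code survived on this page; a sketch that reproduces the 33 x 4 table shown below (country, the two years as columns, and their difference):
paired_table <- paired_data %>%
  select(country, year, lifeExp) %>%   # keep the variables of interest
  pivot_wider(names_from = year,       # one column per year
              values_from = lifeExp) %>%
  mutate(dlifeExp = `2007` - `2002`)   # difference between the two years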
paired_table
## # A tibble: 33 x 4
## country `2002` `2007` dlifeExp
## <fct> <dbl> <dbl> <dbl>
## 1 Afghanistan 42.1 43.8 1.70
## 2 Bahrain 74.8 75.6 0.84
## 3 Bangladesh 62.0 64.1 2.05
## 4 Cambodia 56.8 59.7 2.97
## 5 China 72.0 73.0 0.933
## 6 Hong Kong, China 81.5 82.2 0.713
## 7 India 62.9 64.7 1.82
## 8 Indonesia 68.6 70.6 2.06
## 9 Iran 69.5 71.0 1.51
## 10 Iraq 57.0 59.5 2.50
## # ... with 23 more rows
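The mean difference printed below was presumably computed like this:
paired_table %>%
  summarise(mean(dlifeExp))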
## # A tibble: 1 x 1
## `mean(dlifeExp)`
## <dbl>
## 1 1.49
paired_data %>%
t.test(lifeExp ~ year, data = ., paired = TRUE)
##
## Paired t-test
##
## data: lifeExp by year
## t = -14.338, df = 32, p-value = 1.758e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.706941 -1.282271
## sample estimates:
## mean of the differences
## -1.494606
gapdata %>%
filter(year == 2007) %>% # 2007 only
group_by(continent) %>% # split by continent
do( # dplyr function
t.test(.$lifeExp, mu = 77) %>% # compare mean to 77 years
tidy() # tidy into tibble
)
## # A tibble: 5 x 9
## # Groups: continent [5]
## continent estimate statistic p.value parameter conf.low conf.high method
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Africa 54.8 -16.6 3.15e-22 51 52.1 57.5 One S~
## 2 Americas 73.6 -3.82 8.32e- 4 24 71.8 75.4 One S~
## 3 Asia 70.7 -4.52 7.88e- 5 32 67.9 73.6 One S~
## 4 Europe 77.6 1.19 2.43e- 1 29 76.5 78.8 One S~
## 5 Oceania 80.7 7.22 8.77e- 2 1 74.2 87.3 One S~
## # ... with 1 more variable: alternative <chr>
The mean life expectancies for Europe and Oceania do not significantly differ from 77, while the others do. In particular, look at the confidence intervals of the results above (conf.low and conf.high columns) and whether they include or exclude 77. For instance, Oceania's confidence interval is especially wide as the dataset only includes two countries. Therefore, we can't conclude that its mean is no different from 77; rather, we don't have enough observations and the estimate is uncertain. It doesn't make sense to report the results of a statistical test - whether the p-value is significant or not - without assessing the confidence intervals.
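The output below is a one-sample t-test of the paired differences (dlifeExp) against zero - note that it reproduces the paired t-test result from earlier. The code is not shown on this page, but it is presumably:
t.test(paired_table$dlifeExp, mu = 0)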
##
## One Sample t-test
##
## data: paired_table$dlifeExp
## t = 14.338, df = 32, p-value = 1.758e-15
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1.282271 1.706941
## sample estimates:
## mean of x
## 1.494606
It may be that our question is set around a hypothesis involving more than
two groups. For example, we may be interested in comparing life expectancy
across 3 continents such as the Americas, Europe and Asia.
gapdata %>%
filter(year == 2007) %>%
filter(continent %in%
c("Americas", "Europe", "Asia")) %>%
ggplot(aes(x = continent, y=lifeExp)) +
geom_boxplot()
6.7.2 ANOVA
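The code fitting the ANOVA model is not shown on this page; based on the aov_data and fit objects used later (autoplot(fit), pairwise.t.test(aov_data$lifeExp, ...)), a likely sketch:
aov_data <- gapdata %>%
  filter(year == 2007) %>%
  filter(continent %in% c("Americas", "Europe", "Asia"))

fit <- aov(lifeExp ~ continent, data = aov_data)
summary(fit)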
We can conclude from the significant F-test that the mean life expectancy
across the three continents is not the same. This does not mean that all in-
cluded groups are significantly different from each other. As above, the output
can be neatened up using the tidy function.
library(broom)
gapdata %>%
filter(year == 2007) %>%
filter(continent %in% c("Americas", "Europe", "Asia")) %>%
aov(lifeExp~continent, data = .) %>%
tidy()
## # A tibble: 2 x 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 continent 2 756. 378. 11.6 0.0000342
## 2 Residuals 85 2760. 32.5 NA NA
6.7.3 Assumptions
As with the normality assumption of the t-test (for example, Sections 6.4.1
and 6.4.2), there are assumptions of the ANOVA model. These assumptions
are shared with linear regression and are covered in the next chapter, as linear
regression lends itself to illustrate and explain these concepts well. Suffice
to say that diagnostic plots can be produced to check that the assumptions
are fulfilled. library(ggfortify) includes a function called autoplot() that can be
used to quickly create diagnostic plots, including the Q-Q plot that we showed
before:
library(ggfortify)
autoplot(fit)
FIGURE 6.8: Diagnostic plots: ANOVA model of life expectancy by conti-
nent for 2007.
6.8 Multiple testing
When the F-test is significant, we will often want to determine where the
differences lie. This should of course be obvious from the boxplot you have
made. However, some are fixated on the p-value!
pairwise.t.test(aov_data$lifeExp, aov_data$continent,
p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: aov_data$lifeExp and aov_data$continent
##
## Americas Asia
## Asia 0.180 -
## Europe 0.031 1.9e-05
##
## P value adjustment method: bonferroni
A matrix of pairwise p-values can be produced using the code above. Here we
can see that there is good evidence of a difference in means between Europe
and Asia.
We have to keep in mind that the p-value’s significance level of 0.05 means we
have a 5% chance of finding a difference in our samples which doesn’t exist in
the overall population.
Therefore, the more statistical tests performed, the greater the chances of a
false positive result. This is also known as type I error - finding a difference
when no difference exists.
There are three approaches to dealing with situations where multiple statistical
tests are performed. The first is not to perform any correction at all. Some
advocate that the best approach is simply to present the results of all the
tests that were performed, and let sceptical readers make adjustments for
themselves. This is attractive, but presupposes a sophisticated readership who
will take the time to consider the results in their entirety.
The second and classical approach is to control for the so-called family-wise
error rate. The “Bonferroni” correction is the most famous and most conser-
vative, where the threshold for significance is lowered in proportion to the
number of comparisons made. For example, if three comparisons are made,
the threshold for significance should be lowered to 0.017. Equivalently, all p-
values should be multiplied by the number of tests performed (in this case 3).
The third approach is to control for the false discovery rate (fdr). This is less conservative than the family-wise approach and is often used when a large number of comparisons are made:
pairwise.t.test(aov_data$lifeExp, aov_data$continent,
p.adjust.method = "fdr")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: aov_data$lifeExp and aov_data$continent
##
## Americas Asia
## Asia 0.060 -
## Europe 0.016 1.9e-05
##
## P value adjustment method: fdr
Try not to get too hung up on this. Be sensible. Plot the data and look for differences. Focus on effect size. For instance, consider the actual difference in life expectancy in years, rather than the p-value of a comparison test. Choose a method which fits with your overall aims. If you are generating hypotheses which you will proceed to test with other methods, the fdr approach may be preferable. If you are trying to capture only robust effects and want to minimise type I errors (false positives), use a family-wise approach.
If your head is spinning at this point, don’t worry. The rest of the book will con-
tinuously revisit these and other similar concepts, e.g., “know your data”, “be
sensible, look at the effect size”, using several different examples and datasets.
So do not feel like you should be able to understand everything immediately.
Furthermore, these things are easier to conceptualise when using your own
dataset - especially if that’s something you’ve put your blood, sweat and tears
into collecting.
6.9 Non-parametric tests
Remember, the Welch t-test is reasonably robust to divergence from the nor-
mality assumption, so small deviations can be safely ignored.
Otherwise, the data can be transformed to another scale to deal with a skew.
A natural log scale is common.
TABLE 6.4: Transformations that can be applied to skewed data. For left
skewed data, subtract all values from a constant greater than the maximum
value.
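The code creating africa2002 is not shown on this page; based on the output below (African countries in 2002, with a natural-log column), a likely sketch:
africa2002 <- gapdata %>%
  filter(year == 2002) %>%
  filter(continent == "Africa") %>%
  select(country, lifeExp) %>%
  mutate(lifeExp_log = log(lifeExp))  # natural log

head(africa2002)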
## # A tibble: 6 x 3
## country lifeExp lifeExp_log
## <fct> <dbl> <dbl>
## 1 Algeria 71.0 4.26
## 2 Angola 41.0 3.71
## 3 Benin 54.4 4.00
## 4 Botswana 46.6 3.84
## 5 Burkina Faso 50.6 3.92
## 6 Burundi 47.4 3.86
africa2002 %>%
# pivot lifeExp and lifeExp_log values to same column (for easy plotting):
pivot_longer(contains("lifeExp")) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 15) + # make histogram
facet_wrap(~name, scales = "free") # facet with axes free to vary
FIGURE 6.9: Histogram: Log transformation of life expectancy for countries
in Africa 2002.
This has worked well here. The right skew on the Africa data has been dealt
with by the transformation. A parametric test such as a t-test can now be
performed.
The Mann-Whitney U test is also called the Wilcoxon rank-sum test and uses a rank-based method to compare two groups (note the Wilcoxon signed-rank test is for paired data). Rank-based just means ordering your grouped continuous data from smallest to largest value and assigning a rank (1, 2, 3 ...) to each measurement.
We can use it to test for a difference in life expectancies for African countries
between 1982 and 2007. Let’s do a histogram, Q-Q plot and boxplot first.
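The object africa_data used in the wilcox.test() call below is not defined on this page; based on the text and Figure 6.10, it was likely created like this (the panel-plot code itself is omitted):
africa_data <- gapdata %>%
  filter(year %in% c(1982, 2007)) %>%
  filter(continent == "Africa")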
The data is a little skewed based on the histograms and Q-Q plots. The dif-
ference between 1982 and 2007 is not particularly striking on the boxplot.
africa_data %>%
wilcox.test(lifeExp ~ year, data = .)
##
## Wilcoxon rank sum test with continuity correction
##
## data: lifeExp by year
## W = 1130, p-value = 0.1499
## alternative hypothesis: true location shift is not equal to 0
FIGURE 6.10: Panel plots: Histogram, Q-Q plot, and boxplot for life expectancy in Africa, 1982 v 2007.
library(broom)
gapdata %>%
filter(year == 2007) %>%
filter(continent %in% c("Americas", "Europe", "Asia")) %>%
kruskal.test(lifeExp~continent, data = .) %>%
tidy()
## # A tibble: 1 x 4
## statistic p.value parameter method
## <dbl> <dbl> <int> <chr>
## 1 21.6 0.0000202 2 Kruskal-Wallis rank sum test
6.10 Finalfit approach
The finalfit package provides an easy to use interface for performing non-parametric hypothesis tests. Any number of explanatory variables can be tested against a so-called dependent variable. In this case, this is equivalent to a typical Table 1 in a healthcare study.
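The finalfit code and table are not shown on this page; a hedged sketch of the kind of call described, using summary_factorlist() (the choice of data and variables here is an assumption):
dependent <- "year"
explanatory <- c("lifeExp", "pop", "gdpPercap")

africa_data %>%
  mutate(year = factor(year)) %>%          # treat year as the grouping variable
  summary_factorlist(dependent, explanatory,
                     cont = "median",      # summarise continuous variables with medians
                     p = TRUE)             # add hypothesis tests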
Note that the p-values above have not been corrected for multiple testing. There are many other options available for this function which are covered throughout this book. For instance, if you wish to consider only some variables as non-parametric and summarise them with a median, then this can be specified using an additional argument:
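A hedged sketch of what that might look like - the cont_nonpara argument of summary_factorlist() takes the positions of the continuous explanatory variables to treat as non-parametric (the exact call used in the book is an assumption):
africa_data %>%
  mutate(year = factor(year)) %>%
  summary_factorlist(dependent, explanatory,
                     cont_nonpara = c(1, 3),  # e.g., 1st and 3rd continuous variables
                     p = TRUE)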
6.11 Conclusions
6.12 Exercises
6.12.1 Exercise
Make a histogram, Q-Q plot, and a box-plot for the life expectancy for a conti-
nent of your choice, but for all years. Do the data appear normally distributed?
6.12.2 Exercise
6.12.3 Exercise
In 2007, in which continents did mean life expectancy differ from 70?
6.12.4 Exercise
6.13 Solutions
## Make a histogram, Q-Q plot, and a box-plot for the life expectancy
## for a continent of your choice, but for all years.
## Do the data appear normally distributed?
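The code building p1, p2 and p3 for this solution is not shown on this page; a possible version (the choice of continent - Asia - and the exact plot options are assumptions):
asia_data <- gapdata %>%
  filter(continent == "Asia")

p1 <- asia_data %>%                      # histogram
  ggplot(aes(x = lifeExp)) +
  geom_histogram(bins = 20)

p2 <- asia_data %>%                      # Q-Q plot
  ggplot(aes(sample = lifeExp)) +
  geom_qq() +
  geom_qq_line(colour = "blue")

p3 <- asia_data %>%                      # boxplots by year
  ggplot(aes(x = year, y = lifeExp, group = year)) +
  geom_boxplot()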
library(patchwork)
p1 / p2 | p3
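The object asia_2years used in the t-test below (and from which p1, p2 and p3 are rebuilt for the next panel plot) is not defined on this page; based on the output (Asian countries, 1952 and 1972), it was likely created like this:
asia_2years <- gapdata %>%
  filter(continent == "Asia") %>%
  filter(year %in% c(1952, 1972))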
library(patchwork)
p1 / p2 | p3
asia_2years %>%
t.test(lifeExp ~ year, data = .)
##
## Welch Two Sample t-test
##
## data: lifeExp by year
## t = -4.7007, df = 63.869, p-value = 1.428e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -15.681981 -6.327769
## sample estimates:
## mean in group 1952 mean in group 1972
## 46.31439 57.31927
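The code for the solution to Exercise 6.12.3 is not shown; it presumably mirrors the earlier mu = 77 example, comparing each continent's 2007 mean life expectancy to 70:
gapdata %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  do(
    t.test(.$lifeExp, mu = 70) %>%
      tidy()
  )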
## # A tibble: 5 x 9
## # Groups: continent [5]
## continent estimate statistic p.value parameter conf.low conf.high method
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Africa 54.8 -11.4 1.33e-15 51 52.1 57.5 One S~
## 2 Americas 73.6 4.06 4.50e- 4 24 71.8 75.4 One S~
## 3 Asia 70.7 0.525 6.03e- 1 32 67.9 73.6 One S~
## 4 Europe 77.6 14.1 1.76e-14 29 76.5 78.8 One S~
## 5 Oceania 80.7 20.8 3.06e- 2 1 74.2 87.3 One S~
## # ... with 1 more variable: alternative <chr>
gapdata %>%
filter(year >= 1990) %>%
ggplot(aes(x = factor(year), y = pop)) +
geom_boxplot() +
facet_wrap(~continent)
gapdata %>%
filter(year >= 1990) %>%
group_by(continent) %>%
do(
kruskal.test(pop ~ year, data = .) %>%
tidy()
)
## # A tibble: 5 x 5
## # Groups: continent [5]
## continent statistic p.value parameter method
## <fct> <dbl> <dbl> <int> <chr>
## 1 Africa 2.10 0.553 3 Kruskal-Wallis rank sum test
## 2 Americas 0.847 0.838 3 Kruskal-Wallis rank sum test
## 3 Asia 1.57 0.665 3 Kruskal-Wallis rank sum test
## 4 Europe 0.207 0.977 3 Kruskal-Wallis rank sum test
## 5 Oceania 1.67 0.644 3 Kruskal-Wallis rank sum test
7 Linear regression
7.1 Regression
Simple linear regression uses the ordinary least squares method for fitting. The
details are beyond the scope of this book, but if you want to get out the linear
algebra/matrix maths you did in high school, an enjoyable afternoon can be
spent proving to yourself how it actually works.
Figure 7.2 aims to make this easy to understand. The maths defines a line
which best fits the data provided. For the line to fit best, the distances between
it and the observed data should be as small as possible. The distance from
each observed point to the line is called a residual - one of those statistical
terms that bring on the sweats. It refers to the residual difference left over
after the line is fitted.
You can use the simple regression Shiny app2 to explore the concept. We want
the residuals to be as small as possible. We can square each residual (to get
rid of minuses and make the algebra more convenient) and add them up. If
this number is as small as possible, the line is fitting as best it can. Or in more
formal language, we want to minimise the sum of squared residuals.
The regression apps and example figures in this chapter have been adapted, with permission, from the sources credited in the figure captions.
1 These data are created on the fly by the Shiny apps that are linked and explained in this chapter. This enables you to explore the different concepts using the same variables. For example, if you tell the multivariable app that coffee and smoking should be confounded, it will change the underlying dataset to conform. You can then investigate the output of the regression model to see how that corresponds to the "truth" (that in this case, you control).
2 https://argoshare.is.ed.ac.uk/simple_regression
FIGURE 7.1: A - Univariable: the dependent variable plotted against the explanatory variable of interest. B - Multivariable: a further explanatory variable (non-smoker/smoker) included.
You can use the simple regression diagnostics Shiny app3 to get a handle on these.
Figure 7.3 shows diagnostic plots from the app, which we will run ourselves below (Figure 7.13).
Linear relationship
A simple scatter plot should show a linear relationship between the explana-
tory and the dependent variable, as in Figure 7.3A. If the data describe a
non-linear pattern (Figure 7.3B), then a straight line is not going to fit it well.
In this situation, an alternative model should be considered, such as including
a quadratic (squared, x²) term.
Independence of residuals
The observations and therefore the residuals should be independent. This is
more commonly a problem in time series data, where observations may be
correlated across time with each other (autocorrelation).
Normal distribution of residuals
The observations should be normally distributed around the fitted line. This
means that the residuals should show a normal distribution with a mean of
zero (Figure 7.3A). If the observations are not equally distributed around the
line, the histogram of residuals will be skewed and a normal Q-Q plot will show
residuals diverging from the straight line (Figure 7.3B) (see Section 6.4.2).
Equal variance of residuals
The distance of the observations from the fitted line should be the same on the left side as on the right side. Look at the fan-shaped data on the simple regression diagnostics Shiny app.
3 https://argoshare.is.ed.ac.uk/simple_regression_diagnostics
FIGURE 7.2: How a regression line is fitted. A - residuals are the green lines:
the distance between each data point and the fitted line. B - the green circle
indicates the minimum for these data; its absolute value is not meaningful
or comparable with other datasets. Follow the “simple regression Shiny app”
link to interact with the fitted line. A new sum of squares of residuals (the
black cross) is calculated every time you move the line. C - Distribution of the
residuals. App and plots adapted from https://github.com/mwaskom/ShinyApps with
permission.
This fan shape can be seen on the residuals vs fitted values plot.
Everything we talk about in this chapter is really about making sure that
the line you draw through your data points is valid. It is about ensuring that
the regression line is appropriate across the range of the explanatory variable
and dependent variable. It is about understanding the underlying data, rather
than relying on a fancy statistical test that gives you a p-value.
FIGURE 7.3: Regression diagnostics. A - this is what a linear fit should look like. B - this is not appropriate; a non-linear model should be used instead. App and plots adapted from https://github.com/ShinyEd/intro-stats with permission.
FIGURE 7.4: Linking the fitted line, regression equation and R output.
7.1.5 Effect modification
Effect modification occurs when the size of the effect of the explanatory vari-
able of interest (exposure) on the outcome (dependent variable) differs depend-
ing on the level of a third variable. Said another way, this is a situation in
which an explanatory variable differentially (positively or negatively) modifies
the observed effect of another explanatory variable on the outcome.
Figure 7.5 shows three potential causal pathways using examples from the
multivariable regression Shiny app5 .
In the first, smoking is not associated with the outcome (blood pressure) or
our explanatory variable of interest (coffee consumption).
In the second, smoking is associated with elevated blood pressure, but not
with coffee consumption. This is an example of effect modification.
In the third, smoking is associated with elevated blood pressure and with
coffee consumption. This is an example of confounding.
FIGURE 7.5: Causal pathways. Effect modification: smoking is associated with elevated blood pressure, but there is no association between coffee drinking and smoking. Confounding: smoking is associated with both coffee drinking and elevated blood pressure. Exposure: coffee; outcome: elevated blood pressure.
Figure 7.6 includes a further metric from the R output: Adjusted R-squared.
FIGURE 7.6: A - Simple: ŷ = β0 + β1x1. B - Additive: ŷ = β0 + β1x1 + β2x2. C - Multiplicative (interaction): ŷ = β0 + β1x1 + β2x2 + β3x1x2.
R-squared is another measure of how close the data are to the fitted line. It is also known as the coefficient of determination and represents the proportion of the dependent variable which is explained by the explanatory variable(s). So 0.0 indicates that none of the variability in the dependent is explained by
the explanatory (no relationship between data points and fitted line) and 1.0
indicates that the model explains all of the variability in the dependent (fitted
line follows data points exactly).
R provides the R-squared and the Adjusted R-squared. The adjusted R-squared
includes a penalty the more explanatory variables are included in the model.
So if the model includes variables which do not contribute to the description
of the dependent variable, the adjusted R-squared will be lower.
Looking again at Figure 7.6, in A, a simple model of coffee alone does not
describe the data well (adjusted R-squared 0.38). Adding smoking to the model
improves the fit as can be seen by the fitted lines (0.87). But a true interaction
exists in the actual data. By including this interaction in the model, the fit is
very good indeed (0.93).
7.1.7 Confounding
FIGURE 7.7: Confounding (true effect of coffee = 0, true effect of smoking = 2). A - Simple model with confounding. B - Additive model adjusted for confounding.
7.1.8 Summary
We have intentionally spent some time going through the principles and ap-
plications of linear regression because it is so important. A firm grasp of these
concepts leads to an understanding of other regression procedures, such as
logistic regression and Cox Proportional Hazards regression.
We will now perform all this ourselves in R using the gapminder dataset which
you are familiar with from preceding chapters.
7.2 Fitting simple models
We are interested in modelling the change in life expectancy for different countries over the past 60 years.
library(tidyverse)
library(gapminder) # dataset
library(finalfit)
library(broom)
theme_set(theme_bw())
gapdata <- gapminder
Let’s plot the life expectancies in European countries over the past 60 years,
focussing on the UK and Turkey. We can add in simple best fit lines using
geom_smooth().
gapdata %>%
filter(continent == "Europe") %>% # Europe only
ggplot(aes(x = year, y = lifeExp)) + # lifeExp~year
geom_point() + # plot points
facet_wrap(~ country) + # facet by country
scale_x_continuous(
breaks = c(1960, 2000)) + # adjust x-axis
geom_smooth(method = "lm") # add regression lines
FIGURE 7.8: Scatter plots with linear regression lines: Life expectancy by
year in European countries.
As you can see, ggplot() is very happy to run and plot linear regression models
for us. While this is convenient for a quick look, we usually want to build, run,
and explore these models ourselves. We can then investigate the intercepts
and the slope coefficients (linear increase per year):
First let’s plot two countries to compare, Turkey and United Kingdom:
gapdata %>%
filter(country %in% c("Turkey", "United Kingdom")) %>%
ggplot(aes(x = year, y = lifeExp, colour = country)) +
geom_point()
The two non-parallel lines may make you think of what has been discussed
above (Figure 7.6).
First, let’s model the two countries separately.
# United Kingdom
fit_uk <- gapdata %>%
filter(country == "United Kingdom") %>%
lm(lifeExp~year, data = .)
fit_uk %>%
summary()
##
## Call:
## lm(formula = lifeExp ~ year, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69767 -0.31962 0.06642 0.36601 0.68165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.942e+02 1.464e+01 -20.10 2.05e-09 ***
## year 1.860e-01 7.394e-03 25.15 2.26e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4421 on 10 degrees of freedom
## Multiple R-squared: 0.9844, Adjusted R-squared: 0.9829
## F-statistic: 632.5 on 1 and 10 DF, p-value: 2.262e-10
# Turkey
fit_turkey <- gapdata %>%
filter(country == "Turkey") %>%
lm(lifeExp~year, data = .)
fit_turkey %>%
summary()
##
## Call:
## lm(formula = lifeExp ~ year, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4373 -0.3457 0.1653 0.9008 1.1033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -924.58989 37.97715 -24.35 3.12e-10 ***
## year 0.49724 0.01918 25.92 1.68e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.147 on 10 degrees of freedom
## Multiple R-squared: 0.9853, Adjusted R-squared: 0.9839
## F-statistic: 671.8 on 1 and 10 DF, p-value: 1.681e-10
fit_uk$coefficients
## (Intercept) year
## -294.1965876 0.1859657
fit_turkey$coefficients
## (Intercept) year
## -924.5898865 0.4972399
The slopes make sense - the results of the linear regression say that in the UK,
life expectancy increases by 0.186 every year, whereas in Turkey the change is
0.497 per year. The reason the intercepts are negative, however, may be less
obvious.
In this example, the intercept is telling us that life expectancy at year 0 in
the United Kingdom (some 2000 years ago) was -294 years. While this is
mathematically correct (based on the data we have), it obviously makes no
sense in practice. It is important to think about the range over which you can
extend your model predictions, and where they just become unrealistic.
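The re-fitting code is not shown on this page; based on the year_from1952 variable in the output below, the year was presumably centred at 1952 and both models re-fitted - a sketch:
gapdata <- gapdata %>%
  mutate(year_from1952 = year - 1952)

fit_uk <- gapdata %>%
  filter(country == "United Kingdom") %>%
  lm(lifeExp ~ year_from1952, data = .)

fit_turkey <- gapdata %>%
  filter(country == "Turkey") %>%
  lm(lifeExp ~ year_from1952, data = .)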
fit_uk$coefficients
## (Intercept) year_from1952
## 68.8085256 0.1859657
fit_turkey$coefficients
## (Intercept) year_from1952
## 46.0223205 0.4972399
Now, the updated results tell us that in year 1952, the life expectancy in the
United Kingdom was 69 years. Note that the slopes do not change. There
was nothing wrong with the original model and the results were correct, the
intercept was just not meaningful.
Accessing all model information: tidy() and glance()
In the fit_uk and fit_turkey examples above, we were using fit_uk %>% summary()
to get R to print out a summary of the model. This summary is not, however,
in a rectangular shape so we can’t easily access the values or put them in a
table/use as information on plot labels.
We use the tidy() function from library(broom) to get the variable(s) and specific
values in a nice tibble:
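The call producing the tibble below is presumably:
fit_uk %>% tidy()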
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 68.8 0.240 287. 6.58e-21
## 2 year_from1952 0.186 0.00739 25.1 2.26e-10
In the tidy() output, the column estimate includes both the intercepts and
slopes.
And we use the glance() function to get overall model statistics (mostly the
r.squared).
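The call producing the model statistics below is presumably:
fit_uk %>% glance()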
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.984 0.983 0.442 633. 2.26e-10 1 -6.14 18.3 19.7
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
In R, the formula lifeExp ~ year * country includes both main effects and their interaction. This is equivalent to: myfit = lm(lifeExp ~ year + country + year:country, data = gapdata)
These examples of multivariable regression include two variables: year and
country, but we could include more by adding them with +, it does not just
have to be two.
We will now create three different linear regression models to further illustrate
the difference between a simple model, additive model, and multiplicative
model.
Model 1: year only
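The objects gapdata_UK_T and fit_both1 are not defined on this page; based on the Call in the output below, a likely sketch:
gapdata_UK_T <- gapdata %>%
  filter(country %in% c("Turkey", "United Kingdom"))

fit_both1 <- gapdata_UK_T %>%
  lm(lifeExp ~ year_from1952, data = .)
fit_both1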
##
## Call:
## lm(formula = lifeExp ~ year_from1952, data = .)
##
## Coefficients:
## (Intercept) year_from1952
## 57.4154 0.3416
gapdata_UK_T %>%
mutate(pred_lifeExp = predict(fit_both1)) %>%
ggplot() +
geom_point(aes(x = year, y = lifeExp, colour = country)) +
geom_line(aes(x = year, y = pred_lifeExp))
By fitting to year only (lifeExp ~ year_from1952), the model ignores country. This
gives us a fitted line which is the average of life expectancy in the UK and
Turkey. This may be desirable, depending on the question. But here we want
to best describe the data.
How we made the plot, and what does predict() do? Previously, we were using geom_smooth(method = "lm") to conveniently add linear regression lines to a scatter plot. When a scatter plot includes a categorical variable (e.g., the points are coloured by a variable), the regression lines geom_smooth() draws are multiplicative. That is great, and almost always exactly what we want. Here, however, to illustrate the difference between the different models, we will have to use the predict() function and geom_line() to have full control over the plotted regression lines.
gapdata_UK_T %>%
mutate(pred_lifeExp = predict(fit_both1)) %>%
select(country, year, lifeExp, pred_lifeExp) %>%
group_by(country) %>%
slice(1, 6, 12)
## # A tibble: 6 x 4
## # Groups: country [2]
## country year lifeExp pred_lifeExp
## <fct> <int> <dbl> <dbl>
## 1 Turkey 1952 43.6 57.4
## 2 Turkey 1977 59.5 66.0
## 3 Turkey 2007 71.8 76.2
## 4 United Kingdom 1952 69.2 57.4
## 5 United Kingdom 1977 72.8 66.0
## 6 United Kingdom 2007 79.4 76.2
Note how the slice() function recognises group_by() and in this case shows
us the 1st, 6th, and 12th observation within each group.
Model 2: year + country
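Based on the Call in the output below, fit_both2 was presumably created like this:
fit_both2 <- gapdata_UK_T %>%
  lm(lifeExp ~ year_from1952 + country, data = .)
fit_both2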
##
## Call:
## lm(formula = lifeExp ~ year_from1952 + country, data = .)
##
## Coefficients:
## (Intercept) year_from1952 countryUnited Kingdom
## 50.3023 0.3416 14.2262
gapdata_UK_T %>%
mutate(pred_lifeExp = predict(fit_both2)) %>%
ggplot() +
geom_point(aes(x = year, y = lifeExp, colour = country)) +
geom_line(aes(x = year, y = pred_lifeExp, colour = country))
This is better: by including country in the model, we now have fitted lines more closely representing the data. However, the lines are constrained to be parallel. This is the additive model that was discussed above. We need to include an interaction term to allow the effect of year on life expectancy to vary by country in a non-additive manner.
Model 3: year * country
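Based on the Call in the output below, fit_both3 was presumably created like this:
fit_both3 <- gapdata_UK_T %>%
  lm(lifeExp ~ year_from1952 * country, data = .)
fit_both3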
##
## Call:
## lm(formula = lifeExp ~ year_from1952 * country, data = .)
##
## Coefficients:
## (Intercept) year_from1952
## 46.0223 0.4972
## countryUnited Kingdom year_from1952:countryUnited Kingdom
## 22.7862 -0.3113
gapdata_UK_T %>%
mutate(pred_lifeExp = predict(fit_both3)) %>%
ggplot() +
geom_point(aes(x = year, y = lifeExp, colour = country)) +
geom_line(aes(x = year, y = pred_lifeExp, colour = country))
This fits the data much better than the previous two models. You can check
the R-squared using summary(fit_both3).
Advanced tip: we can apply a function on multiple objects at once by putting them in a list() and using a map_() function from the purrr package. library(purrr) gets installed and loaded with library(tidyverse), but it is outside the scope of this book. Do look it up once you get a little more comfortable with using R, and realise that you are starting to do similar things over and over again. For example, this code:
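The code referred to is not shown on this page; based on the explanation of map_df() and glance() that follows, and the three-row output below, it is presumably:
list(fit_both1, fit_both2, fit_both3) %>%
  map_df(glance)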
## # A tibble: 3 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.373 0.344 7.98 13.1 1.53e- 3 1 -82.9 172. 175.
## 2 0.916 0.908 2.99 114. 5.18e-12 2 -58.8 126. 130.
## 3 0.993 0.992 0.869 980. 7.30e-22 3 -28.5 67.0 72.9
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
What happens here is that map_df() applies a function on each object in the list it gets passed, and returns a df (data frame). In this case, the function is glance() (note that once inside map_df(), glance does not have its own brackets).
library(ggfortify)
autoplot(fit_both3)
FIGURE 7.13: Diagnostic plots. Life expectancy in Turkey and the UK -
multivariable multiplicative model.
7.3 Fitting more complex models
Finally in this section, we are going to fit a more complex linear regression
model. Here, we will discuss variable selection and introduce the Akaike Infor-
mation Criterion (AIC).
We will introduce a new dataset: The Western Collaborative Group Study.
This classic dataset includes observations of 3154 healthy young men aged 39-59 from the San Francisco area who were assessed for their personality type. It
aimed to capture the occurrence of coronary heart disease over the following
8.5 years.
We will use it, however, to explore the relationship between systolic blood
pressure (sbp) and personality type (personality_2L), accounting for potential
confounders such as weight (weight). Now this is just for fun - don’t write in!
The study was designed to look at cardiovascular events as the outcome, not
blood pressure. But it is convenient to use blood pressure as a continuous
outcome from this dataset, even if that was not the intention of the study.
The included personality types are A: aggressive and B: passive.
We suggest building statistical models on the basis of six pragmatic principles; these are expanded on in the paragraphs below. One approach we do not recommend is to rely on automated methods of variable selection. These are often "forward selection" or "backward elimination" methods for including or excluding particular variables on the basis of a statistical property.
In certain settings, these approaches may be found to work. However, they
create an artificial distance between you and the problem you are working on.
They give you a false sense of certainty that the model you have created is in
some sense valid. And quite frequently, they will just get it wrong.
Alternatively, you can follow the six principles above.
A variable may have previously been shown to strongly predict an outcome
(think smoking and risk of cancer). This should give you good reason to con-
sider it in your model. But perhaps you think that previous studies were
incorrect, or that the variable is confounded by another. All this is fair, but it
will be expected that this new knowledge is clearly demonstrated by you, so
do not omit these variables before you start.
There are some variables that are so commonly associated with particular
outcomes in healthcare that they should almost always be included at the start.
Age, sex, social class, and co-morbidity for instance are commonly associated
with survival. These need to be assessed before you start looking at your
explanatory variable of interest.
Furthermore, patients are often clustered by a particular grouping variable,
such as treating hospital. There will be commonalities between these patients
that may not be fully explained by your observed variables. To estimate the
coefficients of your variables of interest most accurately, clustering should be
accounted for in the analysis.
As demonstrated above, the purpose of the model is to provide a best fit
approximation of the underlying data. Effect modification and interactions
commonly exist in health datasets, and should be incorporated if present.
Finally, we want to assess how well our models fit the data with ‘model check-
ing’. The effect of adding or removing one variable to the coefficients of the
other variables in the model is very important, and will be discussed later.
Measures of goodness-of-fit such as the AIC, can also be of great use when
deciding which model choice is most valid.
7.3.3 AIC
The Akaike Information Criterion (AIC) is a measure of the relative quality of a model: it rewards goodness-of-fit but penalises the number of parameters included, so when comparing models fitted to the same data, a lower AIC indicates a better, more parsimonious fit.
As always, when you receive a new dataset, carefully check that it does not
contain errors.
TABLE 7.1: WCGS data, ff_glimpse: continuous.
wcgsdata %>%
ggplot(aes(y = sbp, x = weight,
colour = personality_2L)) + # Personality type
geom_point(alpha = 0.2) + # Add transparency
geom_smooth(method = "lm", se = FALSE)
FIGURE 7.14: Scatter plot: systolic blood pressure (sbp) by weight and personality type (personality_2L).
From Figure 7.14, we can see that there is a weak relationship between weight
and blood pressure.
In addition, there is really no meaningful effect of personality type on blood
pressure. This is really important because, as you will see below, we are about
to “find” some highly statistically significant effects in a model.
finalfit is our own package and provides a convenient set of functions for
fitting regression models with results presented in final tables.
There are a host of features with example code at the finalfit website6 .
6 https://finalfit.org
Here we will use the all-in-one finalfit() function, which takes a dependent variable and one or more explanatory variables. The appropriate regression for
the dependent variable is performed, from a choice of linear, logistic, and Cox
Proportional Hazards regression models. Summary statistics, together with
a univariable and a multivariable regression analysis are produced in a final
results table.
Let’s look first at our explanatory variable of interest, personality type. When
a factor is entered into a regression model, the default is to compare each
level of the factor with a “reference level”. What you choose as the reference
level can be easily changed (see Section 8.9. Alternative methods are available
(sometimes called contrasts), but the default method is likely to be what you
want almost all the time. Note this is sometimes referred to as creating a
“dummy variable”.
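The call that produces this table is not shown here; a minimal sketch, assuming the blood pressure and personality columns are named sbp and personality_2L as in the plot code above:

library(finalfit)
dependent <- "sbp"
explanatory <- "personality_2L"
wcgsdata %>%
  finalfit(dependent, explanatory, metrics = TRUE)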
It can be seen that the mean blood pressure for type A is higher than for type
B. As there is only one variable, the univariable and multivariable analyses
are the same (the multivariable column can be removed if desired by including
select(-5) #5th column in the piped function).
Let’s now include subject weight, which we have hypothesised may influence
blood pressure.
The output shows us the range for weight (78 to 320 pounds) and the mean
(standard deviation) systolic blood pressure for the whole cohort.
The coefficient with 95% confidence interval is provided by default. This is
interpreted as: for each pound increase in weight, there is on average a corre-
sponding increase of 0.18 mmHg in systolic blood pressure.
Note the difference in the interpretation of continuous and categorical variables
in the regression model output (Table 7.5).
The adjusted R-squared is now higher - the personality and weight together
explain 6.8% of the variation in blood pressure.
The AIC is also slightly lower, meaning this new model better fits the data.
There is little change in the size of the coefficients for each variable in the
multivariable analysis, meaning that they are reasonably independent. As an
exercise, check the distribution of weight by personality type using a boxplot.
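One way to do this (a quick sketch, not from the original text):

wcgsdata %>%
  ggplot(aes(x = personality_2L, y = weight)) +
  geom_boxplot()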
Let’s now add in other variables that may influence systolic blood pressure.
Number in dataframe = 3154, Number in model = 3142, Missing = 12, Log-likelihood = -12772.04,
AIC = 25560.1, R-squared = 0.12, Adjusted R-squared = 0.12
Age, height, serum cholesterol, and smoking status have been added. Some
of the variation explained by personality type has been taken up by these
new variables - personality is now associated with an average change of blood
pressure of 1.4 mmHg.
The adjusted R-squared now tells us that 12% of the variation in blood pres-
sure is explained by the model, which is an improvement.
Look out for variables that show large changes in effect size or a change in
the direction of effect when going from a univariable to multivariable model.
This means that the other variables in the model are having a large effect on
this variable and the cause of this should be explored. For instance, in this
example the effect of height changes size and direction. This is because of the
close association between weight and height. It may therefore be more sensible to work with body mass index (weight/height^2) rather than the two separate variables.
In general, variables that are highly correlated with each other should be
treated carefully in regression analysis. This is called collinearity and can lead
to unstable estimates of coefficients. For more on this, see Section 9.4.2.
Let’s create a new variable called bmi, note the conversion from pounds and
inches to kg and m:
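The original code block is not shown here; a minimal sketch, assuming the columns are weight in pounds and height in inches:

wcgsdata <- wcgsdata %>%
  mutate(
    bmi = ((weight * 0.4536) /        # pounds to kilograms
             (height * 0.0254)^2) %>% # inches to metres, then squared
      ff_label("BMI")
  )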
Weight and height can now be replaced in the model with BMI.
Number in dataframe = 3154, Number in model = 3142, Missing = 12, Log-likelihood = -12775.03,
AIC = 25564.1, R-squared = 0.12, Adjusted R-squared = 0.12
On the principle of parsimony, we may want to remove variables which are not
contributing much to the model. For instance, let’s compare models with and
without the inclusion of smoking. This can be easily done using the finalfit
explanatory_multi option.
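A sketch of such a comparison follows; the exact column names for age, cholesterol, and smoking are assumptions, as the original code is not shown here:

dependent <- "sbp"
explanatory <- c("personality_2L", "bmi", "age", "chol", "smoking")
explanatory_multi <- c("personality_2L", "bmi", "age", "chol") # drop smoking
wcgsdata %>%
  finalfit(dependent, explanatory, explanatory_multi,
           keep_models = TRUE, metrics = TRUE)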
This results in little change in the other coefficients and very little change in
the AIC. We will consider the reduced model the final model.
We can also visualise models using plotting. This is useful for communicating
a model in a restricted space, e.g., in a presentation.
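One generic way to sketch such a coefficient plot with broom and ggplot2 (not the book's own plotting code; variable names as assumed above):

library(broom)
lm(sbp ~ personality_2L + bmi + age + chol, data = wcgsdata) %>%
  tidy(conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high)) +
  geom_point() +
  geom_errorbarh(height = 0.2) +                     # 95% confidence intervals
  geom_vline(xintercept = 0, linetype = "dashed")    # line of no effect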
TABLE 7.11: Multivariable linear regression: Systolic blood pressure by
available explanatory variables and reduced model.
Dependent: Systolic BP (mmHg) | unit | value | Coefficient (univariable) | Coefficient (multivariable) | Coefficient (multivariable reduced)
Personality type | B | Mean (sd) 127.5 (14.4) | - | - | -
Personality type | A | Mean (sd) 129.8 (15.7) | 2.32 (1.26 to 3.37, p<0.001) | 1.51 (0.51 to 2.50, p=0.003) | 1.56 (0.57 to 2.56, p=0.002)
BMI | [11.2,39.0] | Mean (sd) 128.6 (15.1) | 1.69 (1.50 to 1.89, p<0.001) | 1.65 (1.46 to 1.85, p<0.001) | 1.62 (1.43 to 1.82, p<0.001)
Age (years) | [39.0,59.0] | Mean (sd) 128.6 (15.1) | 0.45 (0.36 to 0.55, p<0.001) | 0.41 (0.32 to 0.50, p<0.001) | 0.41 (0.32 to 0.50, p<0.001)
Cholesterol (mg/100 ml) | [103.0,645.0] | Mean (sd) 128.6 (15.1) | 0.04 (0.03 to 0.05, p<0.001) | 0.03 (0.02 to 0.04, p<0.001) | 0.03 (0.02 to 0.04, p<0.001)
Smoking | Non-smoker | Mean (sd) 128.6 (15.6) | - | - | -
Smoking | Smoker | Mean (sd) 128.7 (14.6) | 0.08 (-0.98 to 1.14, p=0.883) | 0.98 (-0.03 to 1.98, p=0.057) | -
Number in dataframe = 3154, Number in model = 3142, Missing = 12, Log-likelihood = -12775.03,
AIC = 25564.1, R-squared = 0.12, Adjusted R-squared = 0.12
Number in dataframe = 3154, Number in model = 3142, Missing = 12, Log-likelihood = -12776.83,
AIC = 25565.7, R-squared = 0.12, Adjusted R-squared = 0.12
FIGURE 7.15: Diagnostic plots: Linear regression model of systolic blood
pressure.
7.3.8 Summary
Time spent truly understanding linear regression is well spent. Not because you
will spend a lot of time making linear regression models in health data science
(we rarely do), but because it is the essential foundation for understanding more advanced statistical models.
It can even be argued that all common statistical tests are linear models. This great post (https://lindeloev.github.io/tests-as-linear) demonstrates beautifully how the statistical tests we are most familiar with (such as t-test, Mann-Whitney U test, ANOVA, chi-squared test) can simply be considered as special cases of linear models, or close approximations.
Regression is fitting lines, preferably straight, through data points. Make ŷ = β0 + β1x1 a close friend.
An excellent book for further reading on regression is Harrell (2015).
7.4 Exercises
7.4.1 Exercise
Using the multivariable regression Shiny app (https://argoshare.is.ed.ac.uk/multi_regression/), hack some p-values to prove to yourself the principle of multiple testing.
From the default position, select “additive model” then set “Error standard
deviation” to 2. Leave all true effects at 0. How many clicks of “New Sample”
did you need before you got a statistically significant result?
7.4.2 Exercise
Plot the GDP per capita by year for countries in Europe. Add a best fit straight
line to the plot. In which countries is the relationship not linear? Advanced:
make the line curved by adding a quadratic/squared term, e.g., y ~ x^2 + x. Hint:
check geom_smooth() help page under formula.
7.4.3 Exercise
Compare the relationship between GDP per capita and year for two countries
of your choice. If you can’t choose, make it Albania and Austria.
Fit and plot a regression model that simply averages the values across the two
countries.
Fit and plot a best fit regression model.
Use your model to determine the difference in GDP per capita for your coun-
tries in 1980.
7.4.4 Exercise
7.5 Solutions
gapdata %>%
filter(continent == "Europe") %>%
ggplot(aes(x = year, y = gdpPercap)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(country ~ .)
# Plot first
gapdata %>%
filter(country %in% c("Albania", "Austria")) %>%
ggplot() +
geom_point(aes(x = year, y = gdpPercap, colour= country))
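fit_both1 and fit_both3 are fitted in solution code not shown here. A sketch of what they plausibly look like, given the exercise wording (the object name albania_austria is illustrative):

albania_austria <- gapdata %>%
  filter(country %in% c("Albania", "Austria"))

# A single line ignoring country: averages across the two countries
fit_both1 <- lm(gdpPercap ~ year, data = albania_austria)

# Separate intercepts and slopes for each country: best fit
fit_both3 <- lm(gdpPercap ~ year * country, data = albania_austria)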
gapdata %>%
filter(country %in% c("Albania", "Austria")) %>%
ggplot() +
geom_point(aes(x = year, y = gdpPercap, colour = country)) +
geom_line(aes(x = year, y = predict(fit_both1)))
gapdata %>%
filter(country %in% c("Albania", "Austria")) %>%
ggplot() +
geom_point(aes(x = year, y = gdpPercap, colour = country)) +
geom_line(aes(x = year, y = predict(fit_both3), group = country))
# You can use the regression equation by hand to work out the difference
summary(fit_both3)
8
Working with categorical outcome variables
Suddenly Christopher Robin began to tell Pooh about some of the things:
People called Kings and Queens and something called Factors … and Pooh
said “Oh!” and thought how wonderful it would be to have a Real Brain
which could tell you things.
A.A. Milne, The House at Pooh Corner (1928)
8.1 Factors
We said earlier that continuous data can be measured and categorical data
can be counted, which is useful to remember. Categorical data can be a:
• Factor
– a fixed set of names/strings or numbers
– these may have an inherent order (1st, 2nd, 3rd) - ordinal factor
– or may not (female, male)
• Character
– sequences of letters, numbers, or symbols
• Logical
– containing only TRUE or FALSE
Health data is awash with factors, whether outcomes like death, recurrence, or readmission, or predictors like cancer stage, deprivation quintile, or smoking status. It is essential therefore to be comfortable manipulating factors and dealing with outcomes which are categorical.
We will use the classic “Survival from Malignant Melanoma” dataset which
is included in the boot package. The data consist of measurements made
on patients with malignant melanoma, a type of skin cancer. Each patient
had their tumour removed by surgery at the Department of Plastic Surgery,
University Hospital of Odense, Denmark, between 1962 and 1977.
For the purposes of this discussion, we are interested in the association between
tumour ulceration and death from melanoma.
The Help page (F1 on boot::melanoma) gives us its data dictionary including the
definition of each variable and the coding used.
As always, check any new dataset carefully before you start analysis.
library(tidyverse)
library(finalfit)
meldata <- boot::melanoma # load the melanoma data from the boot package
theme_set(theme_bw())
meldata %>% glimpse()
## Rows: 205
## Columns: 7
## $ time <dbl> 10, 30, 35, 99, 185, 204, 210, 232, 232, 279, 295, 355, 3...
## $ status <dbl> 3, 3, 2, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, ...
## $ sex <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, ...
## $ age <dbl> 76, 56, 41, 71, 52, 28, 77, 60, 49, 68, 53, 64, 68, 63, 1...
## $ year <dbl> 1972, 1968, 1977, 1968, 1965, 1971, 1972, 1974, 1968, 197...
## $ thickness <dbl> 6.76, 0.65, 1.34, 2.90, 12.08, 4.84, 5.16, 3.22, 12.88, 7...
## $ ulcer <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $Continuous
## label var_type n missing_n missing_percent mean sd min
## time time <dbl> 205 0 0.0 2152.8 1122.1 10.0
## status status <dbl> 205 0 0.0 1.8 0.6 1.0
## sex sex <dbl> 205 0 0.0 0.4 0.5 0.0
## age age <dbl> 205 0 0.0 52.5 16.7 4.0
## year year <dbl> 205 0 0.0 1969.9 2.6 1962.0
## thickness thickness <dbl> 205 0 0.0 2.9 3.0 0.1
## ulcer ulcer <dbl> 205 0 0.0 0.4 0.5 0.0
## quartile_25 median quartile_75 max
## time 1525.0 2005.0 3042.0 5565.0
## status 1.0 2.0 2.0 3.0
## sex 0.0 0.0 1.0 1.0
## age 42.0 54.0 65.0 95.0
## year 1968.0 1970.0 1972.0 1977.0
## thickness 1.0 1.9 3.6 17.4
## ulcer 0.0 0.0 1.0 1.0
##
## $Categorical
## data frame with 0 columns and 205 rows
It is really important that variables are correctly coded for all plotting and analysis functions. Using the data dictionary, we will convert the categorical variables to factors.
In the section below, we convert the numeric codes to factors (e.g., sex %>% factor()), then use the forcats package to recode the factor levels.
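A sketch of the recoding, using the data dictionary codings (the exact labels and age bands are inferred from the plots and tables later in the chapter; the original code is not shown here):

library(forcats)
meldata <- meldata %>%
  mutate(
    sex.factor = factor(sex) %>%
      fct_recode("Female" = "0",
                 "Male"   = "1"),
    ulcer.factor = factor(ulcer) %>%
      fct_recode("Absent"  = "0",
                 "Present" = "1"),
    status.factor = factor(status) %>%
      fct_recode("Died melanoma"       = "1",
                 "Alive"               = "2",
                 "Died - other causes" = "3"),
    age.factor = age %>%
      cut(breaks = c(4, 20, 40, 60, 95), include.lowest = TRUE) %>%
      fct_recode("≤20"      = "[4,20]",
                 "21 to 40" = "(20,40]",
                 "41 to 60" = "(40,60]",
                 ">60"      = "(60,95]")
  )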
Modern databases (such as REDCap) can give you an R script to recode your
specific dataset. This means you don’t always have to recode your factors from
numbers to names manually. But you will always be recoding variables during
the exploration and analysis stages too, so it is important to follow what is
happening here.
8.6 Should I convert a continuous variable to a categorical variable?
This is a common question and something which is frequently done. Take for instance the variable age. Is it better to leave it as a continuous variable, or to chop it into categories, e.g., 30 to 39 etc.?
The clear disadvantage in doing this is that information is being thrown away, which feels like a bad thing to do. This is particularly important if the categories being created are large.
For instance, if age was dichotomised to “young” and “old” at say 42 years
(the current median age in Europe), then it is likely that relevant information
to a number of analyses has been discarded.
Secondly, it is unforgivable practice to repeatedly try different cuts of a contin-
uous variable to obtain a statistically significant result. This is most commonly
done in tests of diagnostic accuracy, where a threshold for considering a contin-
uous test result positive is chosen post hoc to maximise sensitivity/specificity,
but not then validated in an independent cohort.
But there are also advantages to converting a continuous variable to categori-
cal. Say the relationship between age and an outcome is not linear, but rather
u-shaped, then fitting a regression line is more difficult. If age is cut into 10-year bands and entered into a regression as a factor, then this non-linearity is already accounted for.
Secondly, when communicating the results of an analysis to a lay audience,
it may be easier to use a categorical representation. For instance, an odds of
death 1.8 times greater in 70-year-olds compared with 40-year-olds may be
easier to grasp than a 1.02 times increase per year.
So what is the answer? Do not do it unless you have to. Plot and understand
the continuous variable first. If you do it, try not to throw away too much
information. Repeat your analyses both with the continuous data and cate-
gorical data to ensure there is no difference in the conclusion (often called a
sensitivity analysis).
# Summary of age
meldata$age %>%
summary()
meldata %>%
ggplot(aes(x = age)) +
geom_histogram()
Figure 8.1 illustrates different options for this. We suggest not using the label
option of the cut() function to avoid errors, should the underlying data change
or when the code is copied and reused. A better practice is to recode the levels
using fct_recode as above.
The intervals in the output are standard mathematical notation. A square
bracket indicates the value is included in the interval and a round bracket
that the value is excluded.
Note the requirement for include.lowest = TRUE when you specify breaks yourself
and the lowest cut-point is also the lowest data value. This should be clear in
Figure 8.1.
Be clear in your head whether you wish to cut the data so the intervals are of
equal length. Or whether you wish to cut the data so there are equal propor-
tions of cases (patients) in each level.
Equal intervals:
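For instance, cutting age into four equal-width bands (the column name age.4bands is illustrative, not from the original text):

meldata %>%
  mutate(age.4bands = cut(age, breaks = 4)) %>%
  count(age.4bands) # group sizes will be unequal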
Quantiles:
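And cutting at the quartiles so that each group contains roughly a quarter of the patients (again with an illustrative column name):

meldata %>%
  mutate(age.quartiles = cut(age,
                             breaks = quantile(age, probs = 0:4 / 4),
                             include.lowest = TRUE)) %>%
  count(age.quartiles) # roughly equal group sizes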
8.7 Plot the data
library(patchwork)
# p1 and p2 are bar plots of outcome by ulceration, built in code not shown here
p1 + p2
FIGURE 8.2: Bar chart: Outcome after surgery for patients with ulcerated
melanoma.
It should be obvious that more died from melanoma in the ulcerated tumour
group compared with the non-ulcerated tumour group. The stacking is ordered from top to bottom by default. This can be easily adjusted by changing the order of the levels within the factor (see re-levelling below). This default order works well for binary variables - the “yes” or “1” is lowest and can be easily compared. The ordering of this particular variable is unusual - it would be
more common to have for instance alive = 0, died = 1. One quick option is to
just reverse the order of the levels in the plot.
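A sketch of this using forcats::fct_rev() to reverse the level order within the plot (p1 and p2 here are assumed to be the count and proportion bar charts):

p1 <- meldata %>%
  ggplot(aes(x = ulcer.factor, fill = fct_rev(status.factor))) +
  geom_bar() +
  labs(fill = "status.factor")
p2 <- meldata %>%
  ggplot(aes(x = ulcer.factor, fill = fct_rev(status.factor))) +
  geom_bar(position = "fill") +
  ylab("proportion") +
  labs(fill = "status.factor")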
library(patchwork)
p1 + p2
FIGURE 8.3: Bar chart: Outcome after surgery for patients with ulcerated
melanoma, reversed levels.
Just from the plot then, death from melanoma in the ulcerated tumour group
is around 40% and in the non-ulcerated group around 13%. The number of patients included in the study is not huge; however, this still looks like a real difference given its effect size.
We may also be interested in exploring potential effect modification, interac-
tions and confounders. Again, we urge you to first visualise these, rather than
going straight to a model.
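A sketch of facetted versions of the same bar charts, used in the plot below (the original code is not shown here):

p1 <- meldata %>%
  ggplot(aes(x = ulcer.factor, fill = status.factor)) +
  geom_bar() +
  facet_grid(sex.factor ~ age.factor)
p2 <- meldata %>%
  ggplot(aes(x = ulcer.factor, fill = status.factor)) +
  geom_bar(position = "fill") +
  ylab("proportion") +
  facet_grid(sex.factor ~ age.factor)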
p1 / p2
FIGURE 8.4: Facetted bar plot: Outcome after surgery for patients with
ulcerated melanoma aggregated by sex and age.
Our question relates to the association between tumour ulceration and death
from melanoma. The outcome measure has three levels as can be seen. For our purposes here, we will generate a disease-specific mortality variable (status_dss) by combining “Died - other causes” and “Alive”.
The default order for levels with factor() is alphabetical. We often want to
reorder the levels in a factor when plotting, or when performing a regression
analysis and we want to specify the reference level.
The order can be checked using levels().
The reason “Alive” is second, rather than alphabetical, is that it was recoded from “2” and that order was retained. If, however, we want to make comparisons
relative to “Alive”, we need to move it to the front by using fct_relevel().
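A sketch of one way to do both steps with forcats (assuming status.factor was recoded earlier as above):

meldata <- meldata %>%
  mutate(
    # Combine "Died - other causes" with "Alive": disease-specific mortality
    status_dss = fct_collapse(status.factor,
                              "Alive" = c("Alive", "Died - other causes")) %>%
      # Make "Alive" the first (reference) level
      fct_relevel("Alive")
  )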
library(finalfit)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory = "ulcer.factor")
TABLE 8.1: Two-by-two table with finalfit: Died with melanoma by tumour
ulceration status.
library(finalfit)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness")
)
8.11.1 Base R
Base R has reliable functions for all common statistical tests, but they are
sometimes a little inconvenient to extract results from.
A table of counts can be constructed, either using the $ to identify columns,
or using the with() function.
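For example (the original code is not shown here):

table(meldata$ulcer.factor, meldata$status_dss)
# or, equivalently
with(meldata, table(ulcer.factor, status_dss))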
##
## Alive Died melanoma
## Absent 99 16
## Present 49 41
library(magrittr)
meldata %$% # note $ sign here
table(ulcer.factor, status_dss)
## status_dss
## ulcer.factor Alive Died melanoma
## Absent 99 16
## Present 49 41
meldata %$%
table(ulcer.factor, status_dss) %>%
prop.table(margin = 1) # 1: row, 2: column etc.
## status_dss
## ulcer.factor Alive Died melanoma
## Absent 0.8608696 0.1391304
## Present 0.5444444 0.4555556
Similarly, the counts table can be passed to chisq.test() to perform the chi-
squared test.
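The code producing the output below is not shown here; it will have been along these lines:

meldata %$%
  table(ulcer.factor, status_dss) %>%
  chisq.test()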
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 23.631, df = 1, p-value = 1.167e-06
The result can be extracted into a tibble using the tidy() function from the
broom package.
library(broom)
meldata %$% # note $ sign here
table(ulcer.factor, status_dss) %>%
chisq.test() %>%
tidy()
## # A tibble: 1 x 4
## statistic p.value parameter method
## <dbl> <dbl> <int> <chr>
## 1 23.6 0.00000117 1 Pearson's Chi-squared test with Yates' continu~
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 2.0198, df = 3, p-value = 0.5683
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.5437
## alternative hypothesis: two.sided
library(finalfit)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory = "ulcer.factor",
p = TRUE)
TABLE 8.3: Two-by-two table with chi-squared test using finalfit: Outcome after surgery for melanoma by tumour ulceration status.
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness"),
p = TRUE)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness"),
p = TRUE,
p_cat = "fisher")
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness"),
p = TRUE,
p_cat = "fisher",
digits =
c(1, 1, 4, 2), #1: mean/median, 2: SD/IQR
# 3: p-value, 4: count percentage
na_include = TRUE, # include missing in results/test
add_dependent_label = TRUE
)
8.14 Exercises
8.14.1 Exercise
8.14.2 Exercise
By changing one and only one line in the following block create firstly a new ta-
ble showing the breakdown of status.factor by age and secondly the breakdown
of status.factor by sex:
meldata %>%
count(ulcer.factor, status.factor) %>%
group_by(status.factor) %>%
mutate(total = sum(n)) %>%
mutate(percentage = round(100*n/total, 1)) %>%
mutate(count_perc = paste0(n, " (", percentage, ")")) %>%
select(-total, -n, -percentage) %>%
spread(status.factor, count_perc)
8.14.3 Exercise
Now produce these tables using the summary_factorlist() function from the fi-
nalfit package.
9
Logistic regression
Do not start here! The material covered in this chapter is best understood
after having read linear regression (Chapter 7) and working with categorical
outcome variables (Chapter 8).
Generalised linear modelling is an extension to the linear modelling we are
now familiar with. It allows the principles of linear regression to be applied
when outcomes are not continuous numeric variables.
Logistic regression comes in several flavours: ‘binary’, where the outcome variable has two levels; ‘ordinal’, where the outcome variable has >2 levels with an inherent order; and ‘multinomial’, where the outcome variable has >2 levels with no inherent order.
We will only deal with binary logistic regression. When we use the term ‘logistic
regression’, that is what we are referring to.
We have good reason. In healthcare we are often interested in an event (like
death) occurring or not occurring. Binary logistic regression can tell us the
probability of this outcome occurring in a patient with a particular set of
characteristics.
Although in binary logistic regression the outcome must have two levels, re-
member that the predictors (explanatory variables) can be either continuous
or categorical.
As a reminder: if 3 out of 12 balls are blue, the probability of blue is 3/12 = 1/4, while the odds of blue are 3/9 = 1/3 (3 blue to 9 not blue).
Another important term to remind ourselves of is the ‘odds ratio’. Why? Be-
cause in a logistic regression the slopes of fitted lines (coefficients) can be
interpreted as odds ratios. This is very useful when interpreting the associa-
tion of a particular predictor with an outcome.
For a given categorical predictor such as smoking, the difference in chance of
the outcome occurring for smokers vs non-smokers can be expressed as a ratio
of odds or odds ratio (Figure 9.2). For example, if the odds of a smoker having
a CV event are 1.5 and the odds of a non-smoker are 1.0, then the odds of a
smoker having an event are 1.5-times greater than a non-smoker, odds ratio
= 1.5.
An alternative is a ratio of probabilities which is called a risk ratio or relative
risk. We will continue to work with odds ratios given they are an important
expression of effect size in logistic regression analysis.
214 9 Logistic regression
= a/c
Odds CV event | Smoker
CV event
Yes
a b = b/d
= b/d =
Oddsa/c
CV event ad
| Non-smoker
CV event
= a/c
bc
No c d
Smoking Smoking
= b/d
Odds CV event
smoker vs non-smoker
No Yes a/c
= =
a/c ad
ad
b/d
b/d bc
bc Odds ratio
Let’s return to the task at hand. The difficulty in moving from a continuous
to a binary outcome variable quickly becomes obvious. If our 𝑦-axis only has
two values, say 0 and 1, then how can we fit a line through our data points?
An assumption of linear regression is that the dependent variable is continuous,
unbounded, and measured on an interval or ratio scale. Unfortunately, binary
dependent variables fulfil none of these requirements.
The answer is what makes logistic regression so useful. Rather than estimating
𝑦 = 0 or 𝑦 = 1 from the 𝑥-axis, we estimate the probability of 𝑦 = 1.
There is one more difficulty in this though. Probabilities can only exist for
values of 0 to 1. The probability scale is therefore not linear - straight lines do
not make sense on it.
As we saw above, the odds scale runs from 0 to +∞. But here, probabilities
from 0 to 0.5 are squashed into odds of 0 to 1, and probabilities from 0.5 to 1
have the expansive comfort of 1 to +∞.
This is why we fit binary data on a log-odds scale.
A log-odds scale sounds incredibly off-putting to non-mathematicians, but it
is the perfect solution.
• Log-odds run from −∞ to +∞;
• odds of 1 become log-odds of 0;
• a doubling and a halving of odds represent the same distance on the scale.
log(1)
## [1] 0
log(2)
## [1] 0.6931472
log(0.5)
## [1] -0.6931472
I’m sure some are shouting ‘obviously’ at the page. That is good!
This is wrapped up in a transformation (a bit like the transformations shown
in Section 6.9.1) using the so-called logit function. This can be skipped with
no loss of understanding, but for those who just-gots-to-see, the logit function
is,
log_e(p / (1 - p)), where p is the probability of the outcome occurring.
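This is not in the original text, but base R's qlogis() and plogis() implement exactly this transformation and its inverse, which can help build intuition:

qlogis(0.5)  # log-odds of a probability of 0.5 is 0
qlogis(0.75) # log(0.75 / 0.25) = log(3) = 1.0986...
plogis(0)    # back-transform: log-odds of 0 is a probability of 0.5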
Figure 9.3 demonstrates the fitted lines from a logistic regression model of
cardiovascular event by coffee consumption, stratified by smoking on the log-
odds scale (A) and the probability scale (B). We could conclude, for instance,
that on average, non-smokers who drink 2 cups of coffee per day have a 50%
chance of a cardiovascular event.
Figure 9.4 links the logistic regression equation, the appearance of the fitted
lines on the probability scale, and the output from a standard base R analy-
sis. The dots at the top and bottom of the plot represent whether individual
patients have had an event or not. The fitted line, therefore, represents the
point-to-point probability of a patient with a particular set of characteristics
having the event or not. Compare this to Figure 7.4 to be clear on the dif-
ference. The slope of the line is linear on the log-odds scale and these are
presented in the output on the log-odds scale.
Thankfully, it is straightforward to convert these to odds ratios, a measure
we can use to communicate effect size and direction effectively. Said in more
technical language, the exponential of the coefficient on the log-odds scale can
be interpreted as an odds ratio.
For a continuous variable such as cups of coffee consumed, the odds ratio is the change in odds of a CV event associated with a 1 cup increase in coffee consumption.
[Figure 9.3, not reproduced here: fitted lines from a logistic regression of cardiovascular event on coffee consumption, stratified by smoking, shown on the log-odds scale (A) and the probability scale (B).]
As with all multivariable regression models, logistic regression allows the in-
corporation of multiple variables which all may have direct effects on outcome
or may confound the effect of another variable. This was explored in detail in
Section 7.1.7; all of the same principles apply.
Adjusting for effect modification and confounding allows us to isolate the direct
effect of an explanatory variable of interest upon an outcome. In our example,
we are interested in the direct effect of coffee drinking on the occurrence of cardio-
vascular disease, independent of any association between coffee drinking and
smoking.
Figure 9.5 demonstrates simple, additive and multiplicative models. Think
back to Figure 7.6 and the discussion around it as these terms are easier to
think about when looking at the linear regression example, but essentially they
work the same way in logistic regression.
Presented on the probability scale, the effect of the interaction is difficult to
see. It is obvious on the log-odds scale that the fitted lines are no longer
constrained to be parallel.
The interpretation of the interaction term is important. The exponential of the
interaction coefficient term represents a ‘ratio-of-odds ratios’. This is easiest
to see through a worked example. In Figure 9.6 the effect of coffee on the
odds of a cardiovascular event can be compared in smokers and non-smokers.
The effect is now different given the inclusion of a significant interaction term.
Please check back to the linear regression chapter if this is not making sense.
218 9 Logistic regression
Probability
loge (odds) = β0 + β1 x1 + β2 x2
Intercept:
log-odds of event
when xs all zero
exp(β2 )
odds ratio associated
Event occurred in
with smoking
these people but not in
these people exp(β1 )
odds ratio associated
with 1 cup increase in
coffee consumption Non-smoker
Smoker
Distribution of
residuals
Results
} Coffee:
OR = exp(1.23)
= 3.42
Smoking
OR = exp(3.07)
= 21.50
loge (odds[event]) = β0 + βcof f ee xcof f ee + βsmoking xsmoking
FIGURE 9.4: Linking the logistic regression fitted line and equation (A)
with the R output (B).
FIGURE 9.5: Multivariable logistic regression (A) with additive (B) and
multiplicative (C) effect modification.
[Figure 9.6, summarised: with the interaction included, the coffee coefficient gives OR = exp(1.0226) = 2.78, the smoking coefficient gives OR = exp(1.5254) = 4.60, and the interaction term gives a ratio-of-odds ratios (ROR) = exp(1.4413) = 4.23. Therefore:
• effect of each additional coffee in non-smokers: OR = 2.78;
• effect of each additional coffee in smokers: 2.78 × 4.23 = 11.76;
• effect of smoking in non-coffee drinkers: OR = 4.60;
• effect of smoking for each additional coffee: 4.60 × 4.23 = 19.46.]
9.3 Data preparation and exploratory analysis
The Help page (F1 on boot::melanoma) gives us its data dictionary including the definition of each variable and the coding used.
As before, always carefully check and clean a new dataset before you start the analysis.
library(tidyverse)
library(finalfit)
melanoma <- boot::melanoma # data from the boot package
melanoma %>% glimpse()
melanoma %>% ff_glimpse()
We have seen some of this already (Section 8.5: Recode data), but for this
particular analysis we will recode some further variables.
library(tidyverse)
library(finalfit)
melanoma <- melanoma %>%
mutate(sex.factor = factor(sex) %>%
fct_recode("Female" = "0",
"Male" = "1") %>%
ff_label("Sex"),
t_stage.factor =
thickness %>%
cut(breaks = c(0, 1.0, 2.0, 4.0,
max(thickness, na.rm=TRUE)),
include.lowest = TRUE)
)
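For the analyses that follow, the original text also recodes several further variables whose code is not shown here. A minimal sketch of what these recodes may look like (labels and age bands are inferred from the output and tables later in the chapter):

melanoma <- melanoma %>%
  mutate(
    ulcer.factor = factor(ulcer) %>%
      fct_recode("Absent"  = "0",
                 "Present" = "1") %>%
      ff_label("Ulcerated tumour"),
    status.factor = factor(status) %>%
      fct_recode("Died melanoma"       = "1",
                 "Alive"               = "2",
                 "Died - other causes" = "3"),
    age.factor = age %>%
      cut(breaks = c(0, 25, 50, 75, 100)) %>%
      ff_label("Age (years)"),
    # Relabel the thickness cut-points made above as T-stages (T1 to T4)
    t_stage.factor = factor(t_stage.factor,
                            labels = c("T1", "T2", "T3", "T4")) %>%
      ff_label("T-stage")
  )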
We will now consider our outcome variable. With a binary outcome and health
data, we often have to make a decision as to when to determine if that variable
has occurred or not. In the next chapter we will look at survival analysis where
this requirement is not needed.
Our outcome of interest is death from melanoma, but we need to decide when
to define this.
A quick histogram of time stratified by status.factor helps. We can see that most people who died from melanoma did so before 5 years (Figure 9.7). We can also see that the status of most of those who did not die is known beyond 5 years.
library(ggplot2)
melanoma %>%
ggplot(aes(x = time/365)) +
geom_histogram() +
facet_grid(. ~ status.factor)
Let’s decide then to look at 5-year mortality from melanoma. The definition of this will be: at 5 years after surgery, who had died from melanoma and who had not.
FIGURE 9.7: Time to outcome/follow-up times for patients in the melanoma
dataset.
# 5-year mortality
melanoma <- melanoma %>%
mutate(
mort_5yr =
if_else((time/365) < 5 &
(status == 1),
"Yes", # then
"No") %>% # else
fct_relevel("No") %>%
ff_label("5-year survival")
)
library(patchwork)
p1 + p2
FIGURE 9.8: Exploration ulceration and outcome (5-year mortality).
As we might have anticipated from our work in the previous chapter, 5-year
mortality is higher in patients with ulcerated tumours compared with those
with non-ulcerated tumours.
We are also interested in other variables that may be associated with tumour
ulceration. If they are also associated with our outcome, then they will con-
found the estimate of the direct effect of tumour ulceration.
We can plot out these relationships, or tabulate them instead.
We will use the convenient summary_factorlist() function from the finalfit pack-
age to look for differences across other variables by tumour ulceration.
library(finalfit)
dependent <- "ulcer.factor"
explanatory <- c("age", "sex.factor", "year", "t_stage.factor")
melanoma %>%
summary_factorlist(dependent, explanatory, p = TRUE,
add_dependent_label = TRUE)
It appears that patients with ulcerated tumours were older, more likely to be male, and more likely to have thicker, higher T-stage tumours (Table 9.1).
TABLE 9.1: Multiple variables by explanatory variable of interest: Malignant
melanoma ulceration by patient and disease variables.
9.4 Model assumptions
9.4.1 Linearity
A graphical check of linearity can be performed using a best fit “loess” line. This is on the probability scale, so it is not going to be straight. But it should be monotonic - it should only ever go up or down.
melanoma %>%
mutate(
mort_5yr.num = as.numeric(mort_5yr) - 1
) %>%
select(mort_5yr.num, age, year) %>%
pivot_longer(all_of(c("age", "year")), names_to = "predictors") %>%
ggplot(aes(x = value, y = mort_5yr.num)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
facet_wrap(~predictors, scales = "free_x")
Figure 9.9 shows that age is interesting as the relationship is u-shaped. The
chance of death is higher in the young and the old compared with the middle-
aged. This will need to be accounted for in any model including age as a
predictor.
9.4.2 Multicollinearity
library(GGally)
explanatory <- c("ulcer.factor", "age", "sex.factor",
"year", "t_stage.factor")
melanoma %>%
remove_labels() %>% # ggpairs doesn't work well with labels
ggpairs(columns = explanatory)
If you have many variables you want to check, you can split them up.
Continuous to continuous
Here we're using the same library(GGally) code as above, but shortlisting the two continuous variables: age and year (Figure 9.11):
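Following the pattern above (the original code is not shown here):

melanoma %>%
  remove_labels() %>%
  ggpairs(columns = c("age", "year"))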
Continuous to categorical
Let’s use a clever pivot_longer() and facet_wrap() combination to efficiently plot
multiple variables against each other without using ggpairs(). We want to com-
pare everything against, for example, age so we need to include -age in the
pivot_longer() call so it doesn’t get lumped up with everything else (Figure
9.12):
# select_explanatory is assumed to have been defined earlier (not shown here), e.g.
# select_explanatory <- c("age", "ulcer.factor", "sex.factor", "t_stage.factor")
melanoma %>%
  select(all_of(select_explanatory)) %>%
  pivot_longer(-age) %>% # pivots all but age into two columns: name and value
  ggplot(aes(value, age)) +
  geom_boxplot() +
  facet_wrap(~name, scale = "free", ncol = 3) + # continuation assumed, mirroring the categorical example below
  coord_flip()
[Figure 9.11, not reproduced here: GGally plot of age and year; the correlation between them is weak (0.19).]
Categorical to categorical
# select_explanatory is again assumed, e.g.
# select_explanatory <- c("sex.factor", "ulcer.factor", "t_stage.factor")
melanoma %>%
  select(one_of(select_explanatory)) %>%
  pivot_longer(-sex.factor) %>%
  ggplot(aes(value, fill = sex.factor)) +
  geom_bar(position = "fill") +
  ylab("proportion") +
  facet_wrap(~name, scale = "free", ncol = 2) +
  coord_flip()
FIGURE 9.12: Exploring associations between continuous and categorical
explanatory variables.
None of the explanatory variables are highly correlated with one another.
Variance inflation factor
Finally, as a check for the presence of higher-order correlations, the variance inflation factor can be calculated for each of the terms in a final model.
In simple language, this is a measure of how much the variance of a particular
regression coefficient is increased due to the presence of multicollinearity in
the model.
Here is an example. GVIF stands for generalised variance inflation factor. A common rule of thumb is that if this is greater than 5-10 for any variable, then multicollinearity may exist. The model should be further explored and the terms removed or reduced.
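The output below can be produced with car::vif() on a model containing the candidate explanatory variables (a sketch; the original code is not shown here):

library(car)
glm(mort_5yr ~ ulcer.factor + age + sex.factor + year + t_stage.factor,
    data = melanoma, family = binomial) %>%
  vif()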
## GVIF Df GVIF^(1/(2*Df))
## ulcer.factor 1.313355 1 1.146017
## age 1.102313 1 1.049911
## sex.factor 1.124990 1 1.060655
## year 1.102490 1 1.049995
## t_stage.factor 1.475550 3 1.066987
We are not trying to over-egg this, but multicollinearity can be important. The
message as always is the same. Understand the underlying data using plotting
and tables, and you are unlikely to come unstuck.
9.5 Fitting logistic regression models in base R
The glm() function stands for generalised linear model and is the standard base R approach to logistic regression.
The glm() function has several options and many different types of model can
be run. For instance, ‘Poisson regression’ for count data.
To run binary logistic regression use family = binomial. This defaults to family =
binomial(link = 'logit'). Other link functions exist, such as the probit function,
but this makes little difference to final conclusions.
Let’s start with a simple univariable model using the classical R approach.
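The code is not reproduced above, but from the Call line in the output below it will have been equivalent to:

fit1 <- glm(mort_5yr ~ ulcer.factor, family = binomial, data = melanoma)
summary(fit1)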
##
## Call:
## glm(formula = mort_5yr ~ ulcer.factor, family = binomial, data = melanoma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9925 -0.9925 -0.4265 -0.4265 2.2101
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3514 0.3309 -7.105 1.20e-12 ***
## ulcer.factorPresent 1.8994 0.3953 4.805 1.55e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 215.78 on 204 degrees of freedom
## Residual deviance: 188.24 on 203 degrees of freedom
## AIC: 192.24
##
## Number of Fisher Scoring iterations: 5
This is the standard R output which you should become familiar with. It is
included in the previous figures. The estimates of the coefficients (slopes) in
this output are on the log-odds scale and always will be.
Easier approaches for doing this in practice are shown below, but for com-
pleteness here we will show how to extract the results. str() shows all the
information included in the model object, which is useful for experts but a bit
off-putting if you are starting out.
The coefficients and their 95% confidence intervals can be extracted and ex-
ponentiated like this.
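One way to do this (a sketch; the original code is not shown here):

coef(fit1) %>% exp()    # odds ratios
confint(fit1) %>% exp() # 95% confidence intervals on the odds ratio scale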
## (Intercept) ulcer.factorPresent
## 0.0952381 6.6818182
## 2.5 % 97.5 %
## (Intercept) 0.04662675 0.1730265
## ulcer.factorPresent 3.18089978 15.1827225
Note that the 95% confidence interval is between the 2.5% and 97.5% quantiles of the distribution, which is why the results appear in this way.
A good alternative is the tidy() function from the broom package.
library(broom)
fit1 %>%
tidy(conf.int = TRUE, exp = TRUE)
## # A tibble: 2 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0952 0.331 -7.11 1.20e-12 0.0466 0.173
## 2 ulcer.factorPresent 6.68 0.395 4.80 1.55e- 6 3.18 15.2
We can see from these results that there is a strong association between tumour
ulceration and 5-year mortality (OR 6.68, 95%CI 3.18, 15.18).
Model metrics can be extracted using the glance() function.
fit1 %>%
glance()
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 216. 204 -94.1 192. 199. 188. 203 205
A statistical model is a tool to understand the world. The better your model
describes your data, the more useful it will be. Fitting a successful statisti-
cal model requires decisions around which variables to include in the model.
Our advice regarding variable selection follows the same lines as in the linear
regression chapter.
Our preference in model fitting is now to use our own finalfit package. It gets
us to our results quicker and more easily, and produces our final model tables
which go directly into manuscripts for publication (we hope).
The approach is the same as in linear regression. If the outcome variable
is correctly specified as a factor, the finalfit() function will run a logistic
regression model directly.
library(finalfit)
dependent <- "mort_5yr"
explanatory <- "ulcer.factor"
melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
Number in dataframe = 205, Number in model = 205, Missing = 0, AIC = 192.2, C-statistic =
0.717, H&L = Chi-sq(8) 0.00 (p=1.000)
Passing metrics = TRUE to finalfit() gives us a useful list of model fitting param-
eters.
We recommend looking at three metrics:
• Akaike information criterion (AIC), which should be minimised,
• C-statistic (area under the receiver operator curve), which should be max-
imised;
• Hosmer–Lemeshow test, which should be non-significant.
AIC
The AIC has been previously described (Section 7.3.3). It provides a measure
of model goodness-of-fit - or how well the model fits the available data. It is
penalised for each additional variable, so should be somewhat robust against
over-fitting (when the model starts to describe noise).
C-statistic
The c-statistic or area under the receiver operator curve (ROC) provides a
measure of model ‘discrimination’. It runs from 0.5 to 1.0, with 0.5 being no
better than chance, and 1.0 being perfect fit. What the number actually rep-
resents can be thought of like this. Take our example of death from melanoma.
If you take a random patient who died and a random patient who did not die,
then the c-statistic is the probability that the model predicts that patient 1 is
more likely to die than patient 2. In our example above, the model should get
that correct 72% of the time.
Hosmer-Lemeshow test
If you are interested in using your model for prediction, it is important that it
is calibrated correctly. Using our example, calibration means that the model
accurately predicts death from melanoma when the risk to the patient is low
and also accurately predicts death when the risk is high. The model should
work well across the range of probabilities of death. The Hosmer-Lemeshow
test assesses this. By default, it assesses the predictive accuracy for death in deciles of risk. If the model predicts equally well (or badly) at low probabilities compared with high probabilities, the null hypothesis that observed and predicted risks agree is not rejected (meaning you get a non-significant p-value).
Engage with the data and the results when model fitting. Do not use auto-
mated processes - you have to keep thinking.
Three things are important to keep looking at:
• what is the association between a particular variable and the outcome (OR
and 95%CI);
• how much information is a variable bringing to the model (change in AIC
and c-statistic);
• how much influence does adding a variable have on the effect size of another
variable, and in particular my variable of interest (a rule of thumb: a greater than 10% change in the OR of the variable of interest when a new variable is added to the model suggests the new variable is important).
We’re going to start by including the variables from above which we think are
relevant.
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age", "sex.factor", "t_stage.factor")
fit2 = melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
TABLE 9.5: Model metrics: 5-year survival from malignant melanoma (fit
2).
Number in dataframe = 205, Number in model = 205, Missing = 0, AIC = 188.1, C-statistic =
0.798, H&L = Chi-sq(8) 3.92 (p=0.864)
The model metrics have improved with the AIC decreasing from 192 to 188
and the c-statistic increasing from 0.717 to 0.798.
Let’s consider age. We may expect age to be associated with the outcome
because it so commonly is. But there is weak evidence of an association in the
univariable analysis. We have shown above that the relationship of age to the
outcome is not linear, therefore we need to act on this.
We can either convert age to a categorical variable or include it with a
quadratic term (x^2 + x, remember parabolas from school?).
melanoma %>%
finalfit(dependent, c("ulcer.factor", "age.factor"), metrics = TRUE)
Number in dataframe = 205, Number in model = 205, Missing = 0, AIC = 196.6, C-statistic =
0.742, H&L = Chi-sq(8) 0.20 (p=1.000)
##
## Call:
## glm(formula = mort_5yr ~ ulcer.factor + I(age^2) + age, family = binomial,
## data = melanoma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3253 -0.8973 -0.4082 -0.3889 2.2872
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2636638 1.2058471 -1.048 0.295
## ulcer.factorPresent 1.8423431 0.3991559 4.616 3.92e-06 ***
## I(age^2) 0.0006277 0.0004613 1.361 0.174
## age -0.0567465 0.0476011 -1.192 0.233
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "I(age^2)", "age")
melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
Number in dataframe = 205, Number in model = 205, Missing = 0, AIC = 194, C-statistic = 0.748,
H&L = Chi-sq(8) 5.24 (p=0.732)
The AIC is worse when adding age either as a factor or with a quadratic term
to the base model.
One final method to visualise the contribution of a particular variable is to remove it from the full model. This is convenient with finalfit.
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age.factor", "sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "sex.factor", "t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
The AIC improves when age is removed (186 from 190) at only a small loss in
discrimination (0.794 from 0.802). Looking at the model table and comparing
TABLE 9.10: Multivariable logistic regression model: comparing a reduced
model in one table (fit 5).
Dependent: 5-year survival | No | Yes | OR (univariable) | OR (multivariable) | OR (multivariable reduced)
Ulcerated tumour Absent 105 (91.3) 10 (8.7) - - -
Present 55 (61.1) 35 (38.9) 6.68 (3.18-15.18, p<0.001) 3.06 (1.25-7.93, p=0.017) 3.21 (1.32-8.28, p=0.012)
Age (years) (0,25] 10 (71.4) 4 (28.6) - - -
(25,50] 62 (84.9) 11 (15.1) 0.44 (0.12-1.84, p=0.229) 0.37 (0.08-1.80, p=0.197) -
(50,75] 79 (76.0) 25 (24.0) 0.79 (0.24-3.08, p=0.712) 0.60 (0.15-2.65, p=0.469) -
(75,100] 9 (64.3) 5 (35.7) 1.39 (0.28-7.23, p=0.686) 0.61 (0.09-4.04, p=0.599) -
Sex Female 105 (83.3) 21 (16.7) - - -
Male 55 (69.6) 24 (30.4) 2.18 (1.12-4.30, p=0.023) 1.21 (0.54-2.68, p=0.633) 1.26 (0.57-2.76, p=0.559)
T-stage T1 52 (92.9) 4 (7.1) - - -
T2 49 (92.5) 4 (7.5) 1.06 (0.24-4.71, p=0.936) 0.74 (0.15-3.50, p=0.697) 0.77 (0.16-3.58, p=0.733)
T3 36 (70.6) 15 (29.4) 5.42 (1.80-20.22, p=0.005) 2.91 (0.84-11.82, p=0.106) 2.99 (0.86-12.11, p=0.097)
T4 23 (51.1) 22 (48.9) 12.43 (4.21-46.26, p<0.001) 5.38 (1.43-23.52, p=0.016) 5.01 (1.37-21.52, p=0.020)
TABLE 9.11: Model metrics: comparing a reduced model in one table (fit
5).
Number in dataframe = 205, Number in model = 205, Missing = 0, AIC = 190, C-statistic = 0.802,
H&L = Chi-sq(8) 13.87 (p=0.085)
Number in dataframe = 205, Number in model = 205, Missing = 0, AIC = 186.1, C-statistic =
0.794, H&L = Chi-sq(8) 1.07 (p=0.998)
the full multivariable with the reduced multivariable, there has been a small
change in the OR for ulceration, with some of the variation accounted for
by age now being taken up by ulceration. This is to be expected, given the
association (albeit weak) that we saw earlier between age and ulceration. Given
all this, we will decide not to include age in the model.
Now what about the variable sex. It has a significant association with the
outcome in the univariable analysis, but much of this is explained by other
variables in multivariable analysis. Is it contributing much to the model?
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
By removing sex we have improved the AIC a little (184.4 from 186.1) with a
small change in the c-statistic (0.791 from 0.794).
Looking at the model table, the variation has been taken up mostly by stage
4 disease and a little by ulceration. But there has been little change overall.
We will exclude sex from our final model as well.
As a final step, we can check for a first-order interaction between ulceration and T-stage. Just to remind us what this means, a significant interaction would
TABLE 9.13: Model metrics: further reducing the model (fit 6).
Number in dataframe = 205, Number in model = 205, Missing = 0, AIC = 186.1, C-statistic =
0.794, H&L = Chi-sq(8) 1.07 (p=0.998)
Number in dataframe = 205, Number in model = 205, Missing = 0, AIC = 184.4, C-statistic =
0.791, H&L = Chi-sq(8) 0.43 (p=1.000)
mean the effect of, say, ulceration on 5-year mortality would differ by T-stage.
For instance, perhaps the presence of ulceration confers a much greater risk of
death in advanced deep tumours compared with earlier superficial tumours.
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor*t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age.factor",
"sex.factor", "t_stage.factor")
Number in dataframe = 205, Number in model = 205, Missing = 0, AIC = 184.4, C-statistic =
0.791, H&L = Chi-sq(8) 0.43 (p=1.000)
FIGURE 9.13: Odds ratio plot.
9.9 Correlated groups of observations
Our melanoma dataset doesn’t include any higher level structure, so we will simulate this for demonstration purposes. We have just randomly allocated 1 of 4 identifiers to each patient below.
We will speak in terms of ‘hospitals’ now, but the grouping variable(s) could
clearly be anything.
The simplest random effects approach is a ‘random intercept model’. This
allows the intercept of fitted lines to vary by hospital. The random intercept
model constrains lines to be parallel, in a similar way to the additive models
discussed above and in Chapter 7.
It is harder to demonstrate with binomial data, but we can stratify the 5-year
mortality by T-stage (considered as a continuous variable for this purpose).
Note there were no deaths in ‘hospital 4’ (Figure 9.14). We can model this
accounting for inter-hospital variation below.
melanoma %>%
mutate(
mort_5yr.num = as.numeric(mort_5yr) - 1 # Convert factor to 0 and 1
) %>%
ggplot(aes(x = as.numeric(t_stage.factor), y = mort_5yr.num)) +
geom_jitter(width = 0.1, height = 0.1) +
geom_smooth(method = 'loess', se = FALSE) +
facet_wrap(~hospital_id) +
labs(x= "T-stage", y = "Mortality (5 y)")
[Figure 9.14, not reproduced here: 5-year mortality by T-stage for each simulated hospital; there were no deaths in hospital 4.]
library(lme4)
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
##     expand, pack, unpack
melanoma %>%
glmer(mort_5yr ~ t_stage.factor + (1 | hospital_id),
data = ., family = "binomial") %>%
summary()
We can incorporate our (made-up) hospital identifier into our final model from
above. Using keep_models = TRUE, we can compare univariable, multivariable and
mixed effects models.
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age.factor",
"sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "t_stage.factor")
random_effect <- "hospital_id"
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi, random_effect,
keep_models = TRUE,
metrics = TRUE)
Number in dataframe = 205, Number in model = 205, Missing = 0, AIC = 190, C-statistic = 0.802,
H&L = Chi-sq(8) 13.87 (p=0.085)
Number in dataframe = 205, Number in model = 205, Missing = 0, AIC = 184.4, C-statistic =
0.791, H&L = Chi-sq(8) 0.43 (p=1.000)
Number in model = 205, Number of groups = 4, AIC = 173.2, C-statistic = 0.866
As can be seen, incorporating the (made-up) hospital identifier has altered our
coefficients. It has also improved the model discrimination with a c-statistic
of 0.830 from 0.802. Note that the AIC should not be used to directly compare mixed effects models estimated in this way with glm() models, as the two are fitted using different estimation procedures and their likelihoods are not directly comparable.
Random slope models are an extension of the random intercept model. Here
the gradient of the response to a particular variable is allowed to vary by
hospital. For example, this can be included using random_effect = "(thickness | hospital_id)", where the gradient of the continuous variable tumour thickness is allowed to vary by hospital.
As models get more complex, care has to be taken to ensure the underlying
data is understood and assumptions are checked.
Mixed effects modelling is a book in itself and the purpose here is to introduce
the concept and provide some approaches for its incorporation. Clearly much
is written elsewhere for those who are enthusiastic to learn more.
9.10 Exercises
9.10.1 Exercise
Investigate the association between sex and 5-year mortality for patients who
have undergone surgery for melanoma.
First recode the variables as shown in the text, then plot the counts and
proportions for 5-year disease-specific mortality in women and men. Is there
an association between sex and mortality?
9.10.2 Exercise
Make a table showing the relationship between sex and the variables age, T-
stage and ulceration. Hint: summary_factorlist(). Express age in terms of median
and interquartile range. Include a statistical comparison.
What associations do you see?
9.10.3 Exercise
9.10.4 Exercise
9.11 Solutions
## Recode
melanoma <- melanoma %>%
mutate(sex.factor = factor(sex) %>%
fct_recode("Female" = "0",
"Male" = "1") %>%
ff_label("Sex"),
t_stage.factor =
thickness %>%
cut(breaks = c(0, 1.0, 2.0, 4.0,
max(thickness, na.rm=TRUE)),
include.lowest = TRUE)
)
# Plot
p1 <- melanoma %>%
  ggplot(aes(x = sex.factor, fill = mort_5yr)) +
  geom_bar() +
  theme(legend.position = "none")
# p2 (proportions) is not shown here; a sketch:
p2 <- melanoma %>%
  ggplot(aes(x = sex.factor, fill = mort_5yr)) +
  geom_bar(position = "fill") +
  ylab("proportion")
p1 + p2
dependent = "sex.factor"
explanatory = c("age", "t_stage.factor", "ulcer.factor")
melanoma %>%
summary_factorlist(dependent, explanatory, p = TRUE, na_include = TRUE,
cont = "median")
# Men have more T4 tumours and they are more likely to be ulcerated.
dependent = "mort_5yr"
explanatory = c("sex.factor", "age", "t_stage.factor", "ulcer.factor")
melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
# c-statistic = 0.798
# In multivariable model, male vs female OR 1.26 (0.57-2.76, p=0.558).
# No relationship after accounting for T-stage and tumour ulceration.
# Sex is confounded by these two variables.
dependent = "mort_5yr"
explanatory = c("sex.factor", "age", "t_stage.factor", "ulcer.factor")
melanoma %>%
or_plot(dependent, explanatory)
10
Time-to-event data and survival
We will again use the classic “Survival from Malignant Melanoma” dataset
included in the boot package which we have used previously. The data consist
of measurements made on patients with malignant melanoma. Each patient
had their tumour removed by surgery at the Department of Plastic Surgery,
University Hospital of Odense, Denmark, during the period 1962 to 1977.
We are interested in the association between tumour ulceration and survival
after surgery.
library(tidyverse)
library(finalfit)
melanoma <- boot::melanoma #F1 here for help page with data dictionary
glimpse(melanoma)
missing_glimpse(melanoma)
ff_glimpse(melanoma)
As was seen before, all variables are coded as numeric and some need recoding
to factors. This is done below for those we are interested in.
time is the number of days from surgery until either the occurrence of the event (death) or the last time the patient was known to be alive. For instance, if a patient had surgery and was seen to be well in a clinic 30 days later, but there had been no contact since, then the patient’s status would be considered alive
at 30 days. This patient is censored from the analysis at day 30, an important
feature of time-to-event analyses.
library(dplyr)
library(forcats)
melanoma <- melanoma %>%
mutate(
# Overall survival
status_os = if_else(status == 2, 0, # "still alive"
1), # "died of melanoma" or "died of other causes"
    # Disease-specific survival
    status_dss = if_else(status == 2, 0, # "still alive"
                 if_else(status == 1, 1, # "died of melanoma"
                         0)) # "died of other causes is censored"
  )
We will use the excellent survival package to produce the Kaplan Meier (KM)
survival estimator (Terry M. Therneau and Patricia M. Grambsch (2000),
Therneau (2020)). This is a non-parametric statistic used to estimate the
survival function from time-to-event data.
library(survival)
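# (The code creating survival_object is not shown in this excerpt; a minimal
# sketch of the standard construction with Surv(), assuming time in days and
# the status_os indicator defined above.)
survival_object <- Surv(melanoma$time, melanoma$status_os)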
# Explore:
head(survival_object) # + marks censoring, in this case "Alive"
10.6.2 Model
The survival object is the first step to performing univariable and multivariable
survival analyses.
If you want to plot survival stratified by a single grouping variable, you can
substitute “survival_object ~ 1” by “survival_object ~ factor”
A life table is the tabular form of a KM plot, which you may be familiar with.
It shows survival as a proportion, together with confidence limits. The whole
table is shown with summary(my_survfit).
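As an illustration (a minimal sketch assuming the survival object above; the chapter builds my_survfit earlier), the estimator and a life table at selected times can be produced with:

my_survfit <- survfit(survival_object ~ 1, data = melanoma)
summary(my_survfit, times = c(365, 5 * 365)) # life table at 1 and 5 years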
We can plot survival curves using the finalfit wrapper for the package
survminer. There are numerous options available on the help page. You
should always include a number-at-risk table under these plots as it is es-
sential for interpretation.
As can be seen, the probability of dying is much greater if the tumour was
ulcerated, compared to those that were not ulcerated.
melanoma %>%
surv_plot(dependent_os, explanatory, pval = TRUE)
(Figure: Kaplan Meier survival curves by tumour ulceration; the y-axis shows the probability of survival, with a number-at-risk table beneath.)
10.8.1 coxph()
CPH using the coxph() function produces a similar output to lm() and glm(), so
it should be familiar to you now. It can be passed to summary() as below, and
also to broom::tidy() if you want to get the results into a tibble.
library(survival)
coxph(Surv(time, status_os) ~ age + sex + thickness + ulcer, data = melanoma) %>%
summary()
## Call:
## coxph(formula = Surv(time, status_os) ~ age + sex + thickness +
## ulcer, data = melanoma)
##
## n= 205, number of events= 71
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.021831 1.022071 0.007752 2.816 0.004857 **
## sexMale 0.413460 1.512040 0.240132 1.722 0.085105 .
## thickness 0.099467 1.104582 0.034455 2.887 0.003891 **
## ulcerYes 0.952083 2.591100 0.267966 3.553 0.000381 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.022 0.9784 1.0067 1.038
## sexMale 1.512 0.6614 0.9444 2.421
## thickness 1.105 0.9053 1.0325 1.182
## ulcerYes 2.591 0.3859 1.5325 4.381
##
## Concordance= 0.739 (se = 0.03 )
## Likelihood ratio test= 47.89 on 4 df, p=1e-09
## Wald test = 46.72 on 4 df, p=2e-09
## Score (logrank) test = 52.77 on 4 df, p=1e-10
The output shows the number of patients and the number of events. The coefficient can be exponentiated and interpreted as a hazard ratio, exp(coef). For example, exp(coef) for ulcerYes is 2.59: at any given time, patients with ulcerated tumours are dying at around 2.6 times the rate of those with non-ulcerated tumours, adjusted for the other variables in the model. Helpfully, 95% confidence intervals are also provided.
A hazard is the term given to the rate at which events happen. The probability
that an event will happen over a period of time is the hazard multiplied by
the time interval. An assumption of CPH is that hazards are constant over
time (see below).
For a given predictor then, the hazard in one group (say males) would be
expected to be a constant proportion of the hazard in another group (say
females). The ratio of these hazards is, unsurprisingly, the hazard ratio.
The hazard ratio differs from the relative risk and odds ratio. The hazard ratio
represents the difference in the risk of an event at any given time, whereas the
relative risk or odds ratio usually represents the cumulative risk over a period
of time.
10.8.2 finalfit()
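(The dependent and explanatory objects used below are defined earlier in the chapter; a minimal sketch of the usual specification, assumed to match the coxph() model above:)

dependent_os <- "Surv(time, status_os)"
explanatory <- c("age", "sex", "thickness", "ulcer")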
melanoma %>%
finalfit(dependent_os, explanatory)
melanoma %>%
finalfit(dependent_os, explanatory, add_dependent_label = FALSE) %>%
rename("Overall survival" = label) %>%
rename(" " = levels) %>%
rename(" " = all)
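The next step is checking the proportional hazards assumption. The zph_result object printed below comes from cox.zph() in the survival package; a minimal sketch of how such an object is produced (the model terms are assumed from the output, which also includes year):

zph_result <- coxph(Surv(time, status_os) ~ age + sex + thickness + ulcer + year,
                    data = melanoma) %>%
  cox.zph()
plot(zph_result) # scaled Schoenfeld residuals against time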
(Figure: scaled Schoenfeld residuals plotted against time, used to check the proportional hazards assumption.)
zph_result
## chisq df p
## age 2.067 1 0.151
## sex 0.505 1 0.477
## thickness 2.837 1 0.092
## ulcer 4.325 1 0.038
## year 0.451 1 0.502
## GLOBAL 7.891 5 0.162
As a general rule, you should always try to account for any higher structure
in your data within the model. For instance, patients may be clustered within
particular hospitals.
There are two broad approaches to dealing with correlated groups of observa-
tions.
Adding a cluster() term is similar to a generalised estimating equations (GEE)
approach (something we’re not covering in this book). Here, a standard CPH
model is fitted but the standard errors of the estimated hazard ratios are
adjusted to account for correlations.
A frailty() term implies a mixed effects model, where specific random effects
term(s) are directly incorporated into the model.
Both approaches achieve the same goal in different ways. Volumes have been
written on GEE vs mixed effects models and we won’t rehearse them in this
introductory book. We favour the latter approach because of its flexibility
and our preference for mixed effects modelling in generalised linear modelling.
Note cluster() and frailty() terms cannot be combined in the same model.
# Cluster model
explanatory <- c("age", "sex", "thickness", "ulcer",
"cluster(hospital_id)")
melanoma %>%
finalfit(dependent_os, explanatory)
# Frailty model
explanatory <- c("age", "sex", "thickness", "ulcer",
"frailty(hospital_id)")
melanoma %>%
finalfit(dependent_os, explanatory)
The frailty() method here is being superseded by the coxme package, and we
look forward to incorporating this in the future.
A plot of any of the above models can be produced using the hr_plot() function.
melanoma %>%
hr_plot(dependent_os, explanatory)
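(The combined table below also uses disease-specific and competing-risks survival specifications defined earlier in the chapter; a minimal sketch, where status_crr is an assumed name for the competing-risks status variable:)

dependent_dss <- "Surv(time, status_dss)"
dependent_crr <- "Surv(time, status_crr)" # status_crr: assumed name for the competing-risks coding
explanatory <- c("age", "sex", "thickness", "ulcer")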
melanoma %>%
# Summary table
summary_factorlist(dependent_dss, explanatory,
column = TRUE, fit_id = TRUE) %>%
# CPH univariable
ff_merge(
melanoma %>%
coxphuni(dependent_dss, explanatory) %>%
fit2df(estimate_suffix = " (DSS CPH univariable)")
) %>%
# CPH multivariable
ff_merge(
melanoma %>%
coxphmulti(dependent_dss, explanatory) %>%
fit2df(estimate_suffix = " (DSS CPH multivariable)")
) %>%
# Fine and Gray competing risks regression
ff_merge(
melanoma %>%
crrmulti(dependent_crr, explanatory) %>%
fit2df(estimate_suffix = " (competing risks multivariable)")
  ) # combined into Table 10.6 below
TABLE 10.6: Cox Proportional Hazards and competing risks regression com-
bined.
Dependent: Survival all HR (DSS CPH univariable) HR (DSS CPH multivariable) HR (competing risks multivariable)
Age (years) Mean (SD) 52.5 (16.7) 1.01 (1.00-1.03, p=0.141) 1.01 (1.00-1.03, p=0.141) 1.01 (0.99-1.02, p=0.520)
Sex Female 126 (61.5) - - -
Male 79 (38.5) 1.54 (0.91-2.60, p=0.106) 1.54 (0.91-2.60, p=0.106) 1.50 (0.87-2.57, p=0.140)
Tumour thickness (mm) Mean (SD) 2.9 (3.0) 1.12 (1.04-1.20, p=0.004) 1.12 (1.04-1.20, p=0.004) 1.09 (1.01-1.18, p=0.019)
Ulcerated tumour No 115 (56.1) - - -
Yes 90 (43.9) 3.20 (1.75-5.88, p<0.001) 3.20 (1.75-5.88, p<0.001) 3.09 (1.71-5.60, p<0.001)
10.10 Summary
10.11 Dates in R
10.11.1 Converting dates to survival time
library(lubridate)
first_date <- ymd("1966-01-01") # create made-up dates for operations
last_date <- first_date +
days(nrow(melanoma)-1) # every day from 1-Jan 1966
operation_date <-
seq(from = first_date,
to = last_date, by = "1 day") # create dates
Now we will create a ‘censoring’ date by adding time from the melanoma
dataset to our made up operation date.
Remember the censoring date is either when an event occurred (e.g., death)
or the last known alive status of the patient.
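The tidyverse version of this step is not shown in this excerpt; a minimal sketch, assuming the operation_date vector created above is first added to the data frame:

melanoma <- melanoma %>%
  mutate(operation_date = operation_date, # add the made-up operation dates
         censoring_date = operation_date + days(time))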
# (Same as doing:)
melanoma$censoring_date <- melanoma$operation_date + days(melanoma$time)
Now consider if we only had the operation date and censoring date. We want to
create the time variable.
The Surv() function expects a number (numeric variable), rather than a date
object, so we’ll convert it:
10.12 Exercises
10.12.1 Exercise
Using the above scripts, perform a univariable Kaplan Meier analysis to de-
termine if ulcer influences overall survival. Hint: survival_object ~ ulcer.
Try modifying the plot produced (see Help for ggsurvplot). For example:
• Add in a median survival line: surv.median.line="hv"
• Alter the plot legend: legend.title = "Ulcer Present", legend.labs = c("No", "Yes")
• Change the y-axis to a percentage: ylab = "Probability of survival (%)",
surv.scale = "percent"
• Display follow-up up to 10 years, and change the scale to 1 year: xlim =
c(0,10), break.time.by = 1
10.12.2 Exercise
Create a new CPH model, but now also include the variable thickness.
• How would you interpret the output?
• Is it an independent predictor of overall survival in this model?
• Are CPH assumptions maintained?
10.13 Solutions
# Fit model
my_hazard = coxph(survival_object ~ sex + ulcer + age + thickness, data=melanoma)
summary(my_hazard)
# Check assumptions
ph = cox.zph(my_hazard)
ph
# GLOBAL shows no overall violation of assumptions.
Throughout this book we have tried to provide the most efficient approaches
to data analysis using R. In this section, we will provide workflows, or ways-
of-working, which maximise efficiency, incorporate reporting of results within
analyses, make exporting of tables and plots easy, and keep data safe, secure and backed up.
We also include a section on dealing with missing data in R, something that we both feel strongly about and which is often poorly described and dealt with in academic publishing.
11 The problem of missing data
We will work through a number of functions that will help with each of these steps.
But first, here are some terms that are easy to mix up. These are important
as they describe the mechanism of missingness and this determines how you
can handle the missing data.
For each of the following examples we will imagine that we are collecting
data on the relationship between gender, smoking and the outcome of cancer
treatment. The ground truth in this imagined scenario is that both gender
and smoking influence the outcome from cancer treatment.
In data missing completely at random (MCAR), values are randomly missing from your dataset. Missing data values
do not relate to any other data in the dataset and there is no pattern to the
actual values of the missing data themselves.
In our example, smoking status is missing from a random subset of male and
female patients.
This may have the effect of making our population smaller, but the complete
case population has the same characteristics as the missing data population.
This is easy to handle, but unfortunately, data are almost never missing com-
pletely at random.
While it sounds obvious, this step is often ignored in the rush to get results.
The first step in any analysis is robust data cleaning and coding. Lots of
packages have a glimpse-type function and our own finalfit is no different.
This function has three specific goals:
1. Ensure all variables are of the type you expect them to be. That is the
commonest reason to get an error with a finalfit function. Numbers
should be numeric, categorical variables should be characters or fac-
tors, and dates should be dates (for a reminder on these, see Section
2.2).
2. Ensure you know which variables have missing data. This presumes
missing values are correctly assigned NA.
3. Ensure factor levels and variable labels are assigned correctly.
Using the colon_s colon cancer dataset, we are interested in exploring the asso-
ciation between a cancer obstructing the bowel and 5-year survival, accounting
for other patient and disease characteristics.
For demonstration purposes, we will make up MCAR and MAR smoking vari-
ables (smoking_mcar and smoking_mar). Do not worry about understanding the long
cascading mutate and sample() functions below, this is merely for creating the
example variables. You would not be ‘creating’ your data, we hope.
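(The dependent and explanatory vectors are defined earlier in the chapter; reconstructing them here from the ff_glimpse() output below:)

explanatory <- c("age", "sex.factor", "nodes", "obstruct.factor",
                 "smoking_mcar", "smoking_mar")
dependent <- "mort_5yr"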
colon_s %>%
ff_glimpse(dependent, explanatory)
## $Continuous
## label var_type n missing_n missing_percent mean sd min
## age Age (years) <dbl> 929 0 0.0 59.8 11.9 18.0
## nodes nodes <dbl> 911 18 1.9 3.7 3.6 0.0
## quartile_25 median quartile_75 max
## age 53.0 61.0 69.0 85.0
## nodes 1.0 2.0 5.0 33.0
##
## $Categorical
## label var_type n missing_n missing_percent
## mort_5yr Mortality 5 year <fct> 915 14 1.5
## sex.factor Sex <fct> 929 0 0.0
## obstruct.factor Obstruction <fct> 908 21 2.3
## smoking_mcar Smoking (MCAR) <fct> 828 101 10.9
## smoking_mar Smoking (MAR) <fct> 726 203 21.9
## levels_n levels levels_count
## mort_5yr 2 "Alive", "Died", "(Missing)" 511, 404, 14
## sex.factor 2 "Female", "Male" 445, 484
## obstruct.factor 2 "No", "Yes", "(Missing)" 732, 176, 21
## smoking_mcar 2 "Non-smoker", "Smoker", "(Missing)" 645, 183, 101
## smoking_mar 2 "Non-smoker", "Smoker", "(Missing)" 585, 141, 203
## levels_percent
## mort_5yr 55.0, 43.5, 1.5
## sex.factor 48, 52
## obstruct.factor 78.8, 18.9, 2.3
## smoking_mcar 69, 20, 11
## smoking_mar 63, 15, 22
You don’t need to specify the variables, and if you don’t, ff_glimpse() will
summarise all variables:
colon_s %>%
ff_glimpse()
Use this to check that the variables are all assigned and behaving as expected.
The proportion of missing data can be seen, e.g., smoking_mar has 22% missing
data.
colon_s %>%
missing_plot(dependent, explanatory)
(Figure: missing_plot() output showing missing values for Mortality 5 year, Age (years), Sex, nodes, Obstruction, Smoking (MCAR) and Smoking (MAR).)
colon_s %>%
missing_pattern(dependent, explanatory)
(Figure: missing data pattern grid produced by missing_pattern(); the same information is printed as the matrix below.)
## age sex.factor mort_5yr obstruct.factor smoking_mcar smoking_mar
## 631 1 1 1 1 1 1 0
## 167 1 1 1 1 1 0 1
## 69 1 1 1 1 0 1 1
## 27 1 1 1 1 0 0 2
## 14 1 1 1 0 1 1 1
## 4 1 1 1 0 1 0 2
## 3 1 1 1 0 0 1 2
## 8 1 1 0 1 1 1 1
## 4 1 1 0 1 1 0 2
## 1 1 1 0 1 0 1 2
## 1 1 1 0 1 0 0 3
## 0 0 14 21 101 203 339
This allows us to look for patterns of missingness between variables. There are
11 patterns in these data. The number and pattern of missingness help us to
determine the likelihood of it being random rather than systematic.
Checking for associations between missing and observed data is important (we would say absolutely required) for a primary outcome measure / dependent variable.
Take for example “death”. When that outcome is missing it is often for a par-
ticular reason. For example, perhaps patients undergoing emergency surgery
were less likely to have complete records compared with those undergoing
planned surgery. And of course, death is more likely after emergency surgery.
missing_pairs() uses functions from the GGally package. It produces pairs
plots to show relationships between missing values and observed values in all
variables.
For continuous variables (age and nodes), the distributions of observed and
missing data can immediately be visually compared. For example, look at Row
1 Column 2. The age of patients who’s mortality data is known is the blue
box plot, and the age of patients with missing mortality data is the grey box
plot.
For categorical data, the comparisons are presented as counts (remember
geom_bar() from Chapter 4). To be able to compare proportions, we can add
the position = "fill" argument:
colon_s %>%
missing_pairs(dependent, explanatory, position = "fill")
11.6 Check for associations between missing and observed data 281
(Figure: missing_pairs() 'Missing data matrix' comparing Mortality 5 year, Age (years), Sex, nodes, Obstruction, Smoking (MCAR) and Smoking (MAR).)
Find the two sets of bar plots that show the proportion of missing smoking
data for sex (bottom of Column 3). Missingness in Smoking (MCAR) does not
relate to sex - females and males have the same proportion of missing data.
Missingness in Smoking (MAR), however, does differ by sex as females have
more missing data than males here. This is how we designed the example at
the top of this chapter, so it all makes sense.
We can also confirm this by using missing_compare():
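(The code and output are not shown in this excerpt; a minimal sketch of the usual missing_compare() call, where the dependent is the variable whose missingness is being examined:)

explanatory <- c("age", "sex.factor", "nodes", "mort_5yr")
dependent <- "smoking_mcar" # then repeat with "smoking_mar"
colon_s %>%
  missing_compare(dependent, explanatory)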
If you work predominantly with continuous rather than categorical data, you
may find these tests from the MissMech package useful. It provides two tests
which can be used to determine whether data are MCAR; the package and its
output are well documented.
library(MissMech)
explanatory <- c("age", "nodes")
dependent <- "mort_5yr"
colon_s %>%
select(all_of(explanatory)) %>%
MissMech::TestMCARNormality()
## Call:
## MissMech::TestMCARNormality(data = .)
##
## Number of Patterns: 2
##
## Total number of cases used in the analysis: 929
##
## Pattern(s) used:
## age nodes Number of cases
2 By default, missing_compare() uses an F-test for continuous variables and chi-squared for categorical variables; you can change these the same way you change tests in summary_factorlist(). Check the Help tab or online documentation for a reminder.
## group.1 1 1 911
## group.2 1 NA 18
##
##
## Test of normality and Homoscedasticity:
## -------------------------------------------
##
## Hawkins Test:
##
## P-value for the Hawkins test of normality and homoscedasticity: 7.607252e-14
##
## Either the test of multivariate normality or homoscedasticity (or both) is rejected.
## Provided that normality can be assumed, the hypothesis of MCAR is
## rejected at 0.05 significance level.
##
## Non-Parametric Test:
##
## P-value for the non-parametric test of homoscedasticity: 0.6171955
##
## Reject Normality at 0.05 significance level.
## There is not sufficient evidence to reject MCAR at 0.05 significance level.
Depending on the number of data points that are missing, we may have suffi-
cient power with complete cases to examine the relationships of interest.
We therefore elect to omit the patients in whom smoking is missing. This
is known as list-wise deletion and will be performed by default and usually
silently by any standard regression function.
11.8 Handling missing data: MAR
• Sensitivity analysis
• Omit the variable
• Imputation
• Model the missing data
If the variable in question is thought to be particularly important, you may
wish to perform a sensitivity analysis. A sensitivity analysis in this context
aims to capture the effect of uncertainty on the conclusions drawn from
the model. Thus, you may choose to re-label all missing smoking values as
“smoker”, and see if that changes the conclusions of your analysis. The same
procedure can be performed labelling with “non-smoker”.
If smoking is not associated with the explanatory variable of interest or the
outcome, it may be considered not to be a confounder and so could be omitted.
That deals with the missing data issue, but of course may not always be
appropriate.
Imputation and modelling are considered below.
mice is our go-to package for multiple imputation. That's the process of filling
in missing data using a best-estimate from all the other data that exists. When
first encountered, this may not sound like a good idea.
However, taking our simple example, if missingness in smoking is predicted
strongly by sex (and other observed variables), and the values of the missing
data are random, then we can impute (best-guess) the missing smoking values
using sex and other variables in the dataset.
Imputation is not usually appropriate for the explanatory variable of interest
or the outcome variable, although these can be used to impute other variables.
In both cases, the hypothesis is that there is a meaningful association with
other variables in the dataset, therefore it doesn’t make sense to use these
variables to impute them.
The process of multiple imputation involves:
• Impute missing data m times, which results in m complete datasets
• Diagnose the quality of the imputed values
• Analyse each completed dataset
• Pool the results of the repeated analyses
We will present a mice() example here. The package is well documented, and
there are a number of checks and considerations that should be made to inform
the imputation process. Read the documentation carefully prior to doing this
yourself.
Note also missing_predictorMatrix() from finalfit. This provides a straightfor-
ward way to include or exclude variables to be imputed or to be used for
imputation.
Impute
library(mice)
explanatory <- c("age", "sex.factor",
"nodes", "obstruct.factor", "smoking_mar")
dependent <- "mort_5yr"
Choose which variables to impute missing values for and which variables to use
for the imputation process.
colon_s %>%
select(dependent, explanatory) %>%
missing_predictorMatrix(
drop_from_imputed = c("obstruct.factor", "mort_5yr")
) -> predM
Make 10 imputed datasets and run our logistic regression analysis on each set.
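(The code itself is not shown in this excerpt; a minimal sketch of the standard mice workflow, assuming the predictor matrix predM created above:)

fits <- colon_s %>%
  select(all_of(c(dependent, explanatory))) %>%
  mice(m = 10, predictorMatrix = predM) %>%
  # Fit the logistic regression to each imputed dataset
  with(glm(mort_5yr ~ age + sex.factor + nodes + obstruct.factor + smoking_mar,
           family = binomial))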
##
## iter imp variable
## 1 1 mort_5yr nodes obstruct.factor smoking_mar
## 1 2 mort_5yr nodes obstruct.factor smoking_mar
## 1 3 mort_5yr nodes obstruct.factor smoking_mar
## 1 4 mort_5yr nodes obstruct.factor smoking_mar
## 2 1 mort_5yr nodes obstruct.factor smoking_mar
## 2 2 mort_5yr nodes obstruct.factor smoking_mar
## 2 3 mort_5yr nodes obstruct.factor smoking_mar
## 2 4 mort_5yr nodes obstruct.factor smoking_mar
## 3 1 mort_5yr nodes obstruct.factor smoking_mar
## 3 2 mort_5yr nodes obstruct.factor smoking_mar
## 3 3 mort_5yr nodes obstruct.factor smoking_mar
## 3 4 mort_5yr nodes obstruct.factor smoking_mar
## 4 1 mort_5yr nodes obstruct.factor smoking_mar
## 4 2 mort_5yr nodes obstruct.factor smoking_mar
## 4 3 mort_5yr nodes obstruct.factor smoking_mar
## 4 4 mort_5yr nodes obstruct.factor smoking_mar
## 5 1 mort_5yr nodes obstruct.factor smoking_mar
## 5 2 mort_5yr nodes obstruct.factor smoking_mar
## 5 3 mort_5yr nodes obstruct.factor smoking_mar
## 5 4 mort_5yr nodes obstruct.factor smoking_mar
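(The code producing the value below is not shown in this excerpt; it appears to be the mean AIC across the imputed-data models, a sketch:)

# Mean AIC across the fitted models
fits %>%
  getfit() %>%
  purrr::map_dbl(AIC) %>%
  mean()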
## [1] 1193.679
# C-statistic
fits %>%
getfit() %>%
purrr::map(~ pROC::roc(.x$y, .x$fitted)$auc) %>%
unlist() %>%
mean()
## [1] 0.6789003
# Pool results
fits_pool <- fits %>%
pool()
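(A sketch of how the pooled estimates can then be inspected; the chapter goes on to present them in a formatted table:)

summary(fits_pool, conf.int = TRUE, exponentiate = TRUE)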
(Table: pooled multiple-imputation estimates presented alongside the complete case results; reference levels such as Sex Female and Obstruction No are shown as '-'.)
By examining the coefficients, the effect of the imputation compared with the
complete case analysis can be seen.
Other considerations
• Omit the variable
• Model the missing data
As above, if the variable does not appear to be important, it may be omitted
from the analysis. A sensitivity analysis in this context is another form of
imputation. But rather than using all other available information to best-
guess the missing data, we simply assign the value as above. Imputation is
therefore likely to be more appropriate.
There is an alternative method to model the missing data for a categorical variable in this setting - just consider the missing data as a factor level. This has the
advantage of simplicity, with the disadvantage of increasing the number of
terms in the model.
library(dplyr)
explanatory = c("age", "sex.factor",
"nodes", "obstruct.factor", "smoking_mar")
fit_explicit_na = colon_s %>%
mutate(
smoking_mar = forcats::fct_explicit_na(smoking_mar)
) %>%
finalfit(dependent, explanatory)
11.10 Summary
The more data analysis you do, the more you realise just how important
missing data is. It is imperative that you understand where missing values
exist in your own data. By following the simple steps in this chapter, you
will be able to determine whether the cases (commonly patients) with missing
values are a different population to those with complete data. This is the basis
for understanding the impact of missing data on your analyses.
Whether you remove cases, remove variables, impute data, or model missing
values, always check how each approach alters the conclusions of your analy-
sis. Be transparent when you report your results and include the alternative
approaches in appendices of published work.
12 Notebooks and Markdown
You ask me if I keep a notebook to record my great ideas. I’ve only ever had
one.
Albert Einstein
• code and output are adjacent to each other, so you are not constantly switch-
ing between “panes”;
• easier to work on smaller screen, e.g., laptop;
• documentation and reporting can be done beside the code, text elements
can be fully formatted;
• the code itself can be outputted or hidden;
• the code is not limited to R - you can use Python, SQL etc.;
• facilitate collaboration by easily sharing human-readable analysis docu-
ments;
• can be outputted in a number of formats including HTML (web page), PDF,
and Microsoft Word;
• output can be extended to other formats such as presentations;
• training/learning may be easier as course materials, examples, and student
notes are all in the same document.
Most people use the terms R Notebook and R Markdown interchangeably and
that is fine. Technically, R Markdown is a file, whereas R Notebook is a way to
work with R Markdown files. R Notebooks do not have their own file format,
they all use .Rmd. All R Notebooks can be ‘knitted’ to R Markdown outputs,
and all R Markdown documents can be interfaced as a Notebook.
1 https://www.rstudio.com/resources/cheatsheets
When you create a file, a helpful template is provided to get you started. Figure
12.3 shows the essential elements of a Notebook file and how these translate
to the HTML preview.
```{r}
# This is a basic chunk.
# It always starts with ```{r}
# And ends with ```
# Code goes here
sum(fruit$oranges)
```
This may look off-putting, but just go with it for now. You can type the three
back-ticks in manually, or use the Insert button and choose R. You will also
notice that chunks are not limited to R code. It is particularly helpful that
Python can also be run in this way.
When doing an analysis in a Notebook you will almost always want to see the
code and the output. When you are creating a final document you may wish
to hide code. Chunk behaviour can be controlled via the Chunk Cog on the right
of the chunk (Figure 12.3).
Table 12.1 shows the various permutations of code and output options that
are available. The code is placed in the chunk header but the options fly-out
now does this automatically, e.g.,
```{r, echo=FALSE}
```
(Figure 12.3: the anatomy of a Notebook/R Markdown file - headings (h2), in-line code, code chunks (beginning ```{r} and ending ```), chunk options, buttons to run a chunk or all chunks up to this point, the chunk output, and the R Markdown document type - shown alongside the rendered output on the right.)
TABLE 12.1: Chunk output options when knitting an R Markdown file.
When using the Chunk Cog, RStudio will add these options appropriately;
there is no need to memorise them.
Option Code
Show output only echo=FALSE
Show code and output echo=TRUE
Show code (don’t run code) eval=FALSE
Show nothing (run code) include=FALSE
Show nothing (don’t run code) include=FALSE, eval=FALSE
We can set default options for all our chunks by adding and editing knitr::opts_chunk$set(echo = TRUE) at the top of the document.
```{r}
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE)
```
It is possible to set different default sizes for different output types by including
these in the YAML header (or using the document cog):
---
title: "R Notebook"
output:
pdf_document:
fig_height: 3
fig_width: 4
html_document:
fig_height: 6
fig_width: 9
---
The YAML header is very sensitive to the spaces/tabs, so make sure these are
correct.
Markdown text can be included as you wish around your chunks. Figure 12.3
shows an example of how this can be done. This is a great way of getting into
the habit of explicitly documenting your analysis. When you come back to
a file in 6 months’ time, all of your thinking is there in front of you, rather
than having to work out what on Earth you were on about from a collection
of random code!
Figure 12.4 shows the various controls for running chunks and producing an
output document. Code can be run line-by-line using Ctrl+Enter as you are used
to. There are options for running all the chunks above the current chunk you
are working on. This is useful as a chunk you are working on will often rely
on objects created in preceding chunks.
It is good practice to use the Restart R and Run All Chunks option in the Run
menu every so often. This ensures that all the code in your document is self-
contained and is not relying on an object in the environment which you have
created elsewhere. If this was the case, it will fail when rendering a Markdown
document.
Probably the most important engine behind the RStudio Notebooks function-
ality is the knitr package by Yihui Xie.
Not knitting like your granny does, but rendering a Markdown document into
an output file, such as HTML, PDF or Word. There are many options which
can be applied in order to achieve the desired output. Some of these have been
specifically coded into RStudio (Figure 12.4).
PDF document creation requires a LaTeX distribution to be installed on your
computer. Depending on what system you are using, this may be setup already.
An easy way to do this is using the tinytex package.
```{r}
install.packages("tinytex")
# Restart R, then run
tinytex::install_tinytex()
```
FIGURE 12.4: Chunk and document options in Notebook/Markdown files. Different options exist for one-button publication, e.g., to RStudio Connect. Whether a Notebook or R Markdown, all documents can be rendered as HTML (web), PDF or Word, and output options for the different output formats are easily set here. It is good practice to 'Restart R and Run All Chunks' every so often to ensure that your document is self-contained (does not have any external dependencies).
In the next chapter we will focus on the details of producing a polished final
document.
As projects get bigger, it is important that they are well organised. This will
avoid errors and make collaboration easier.
What is absolutely compulsory is that your analysis must reside within an
RStudio Project and have a meaningful name (not MyProject! or Analysis1).
Creating a New Project on RStudio will automatically create a new folder for
itself (unless you choose “Existing Folder”). Never work within a generic Home
or Documents directory. Furthermore, do not change the working directory
using setwd() - there is no reason to do this, and it usually makes your analysis
less reproducible. Once you’re starting to get the hang of R, you should initiate
all Projects with a Git repository for version control (see Chapter 13).
For smaller projects with 1-2 data files, a couple of scripts and an R Markdown
document, it is fine to keep them all in the Project folder (but we repeat, each
Project must have its own folder). Once the number of files grows beyond that,
you should add separate folders for different types of files.
Here is our suggested approach. Based on the nature of your analyses, the
number of folders may be smaller or greater than this, and they may be called
something different.
proj/
- scripts/
- data_raw/
- data_processed/
- figures/
- 00_analysis.Rmd
scripts/ contains all the .R script files used for data cleaning/preparation. If
you only have a few scripts, it’s fine to not have this one and just keep the
.R files in the project folder (where 00_analysis.Rmd is in the above example).
data_raw/ contains all raw data, such as .csv files; data_processed/ contains data you've taken from raw, cleaned, modified, joined or otherwise changed using R scripts; figures/ may contain plots (e.g., .png, .jpg, .pdf). 00_analysis.Rmd or
00_analysis.R is the actual main working file, and we keep this in the main
project directory.
Your R scripts should be numbered using double digits, and they should have
meaningful names, for example:
scripts/00_source_all.R
scripts/01_read_data.R
scripts/02_make_factors.R
scripts/03_duplicate_records.R
For example, 01_read_data.R might contain:
# Melanoma project
## Data pull
# Get data
library(readr)
melanoma <- read_csv(
here::here("data_raw", "melanoma.csv")
)
# Save
save(melanoma, file =
here::here("data_processed", "melanoma_working.rda")
)
# Melanoma project
## Create factors
library(tidyverse)
load(
here::here("data_processed", "melanoma_working.rda")
)
## Recode variables
melanoma <- melanoma %>%
mutate(
sex = factor(sex) %>%
fct_recode("Male" = "1",
"Female" = "0")
)
# Save
save(melanoma, file =
here::here("data", "melanoma_working.rda")
)
All these files can then be brought together in a single file to source(). This
function is used to run code from a file.
00_source_all.R might look like this:
# Melanoma project
## Source all
# Run each preparation script in order
source(here::here("scripts", "01_read_data.R"))
source(here::here("scripts", "02_make_factors.R"))
source(here::here("scripts", "03_duplicate_records.R"))
# Save
save(melanoma, file =
       here::here("data_processed", "melanoma_final.rda")
)
You can now bring your robustly prepared data into your analysis file, which
can be .R or .Rmd if you are working in a Notebook. We call this 00_analysis.Rmd
and it always sits in the project root directory. You have two options for bringing
in the data.
Remember: For .R files use source(), for .rda files use load().
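A minimal sketch of the two options, using the file names from the example structure above:

# Option 1: re-run the whole preparation pipeline
source(here::here("scripts", "00_source_all.R"))
# Option 2: load the prepared data directly
load(here::here("data_processed", "melanoma_final.rda"))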
---
title: "Melanoma analysis"
output: html_notebook
---
This approach comes from many years of finding errors due to badly organised projects. It
is not needed for a small quick project, but is essential for any major work.
At the very start of an analysis (as in the first day), we will start working in
a single file. We will quickly move chunks of data cleaning / preparation code
into separate files as we go.
Compartmentalising the data cleaning helps with finding and dealing with
errors (‘debugging’). Sourced files can be ‘commented out’ (adding a # to a
line in the 00_source_all.R file) if you wish to exclude the manipulations in that
particular file.
Most importantly, it helps with collaboration. When multiple people are working
on a project, it is essential that communication is good and everybody is
working to the same overall plan.
13 Exporting and reporting
The results of any data analysis are meaningless if they are not effectively
communicated.
This may be as a journal article or presentation, or perhaps a regular report
or webpage. In this chapter we emphasise another of the major strengths of R
- the ease with which HTML (a web page), PDF, or Word documents can be
generated.
The purpose of this chapter is to focus on the details of how to get your
exported tables, plots and documents looking exactly the way you want them.
There are many customisations that can be used, and we will only touch on a
few of these.
We will generate a report using data already familiar to you from this book.
It will contain two tables - a demographics table and a regression table - and
a plot. We will use the colon_s data from the finalfit package. What follows is
for demonstration purposes and is not meant to illustrate model building. For
the purposes of the demonstration, we will ask, does a particular characteristic
of a colon cancer (e.g., cancer differentiation) predict 5-year survival?
The three common formats for exporting reports have different pros and cons:
• HTML is the least fussy to work with and can resize itself and its content au-
tomatically. For rapid exploration and prototyping, we recommend knitting
to HTML. HTML documents can be attached to emails and viewed using
307
308 13 Exporting and reporting
We will demonstrate how you might put together a report in two ways.
First, we will show what you might do if you were working in standard R
script file, then exporting certain objects only.
Second, we will talk about the approach if you were primarily working in a
Notebook, which makes things easier.
We presume that the data have been cleaned carefully and the ‘Get the data’,
‘Check the data’, ‘Data exploration’ and ‘Model building’ steps have already
been completed.
library(tidyverse)
library(finalfit)
colon_s %>%
summary_factorlist("differ.factor", explanatory,
p=TRUE, na_include=TRUE)
Note that we include missing data in this table (see Chapter 11).
Also note that nodes has not been labelled properly.
In addition, there are small numbers in some variables generating chisq.test() warnings (expected count of fewer than 5 in a cell).
Now generate a final table.1
1 The finalfit functions used here - summary_factorlist() and finalfit() - were introduced in Part II - Data Analysis. We will therefore not describe the different arguments here; we use them to demonstrate R's powers of exporting to fully formatted output documents.
colon_s %>%
or_plot(dependent, explanatory,
breaks = c(0.5, 1, 5, 10, 20, 30),
table_text_size = 3.5)
In RStudio, select:
File > New File > R Markdown
A useful template file is produced by default. Try hitting knit to Word on the
Knit button at the top of the .Rmd script window. If you have difficulties at
this stage, refer to Chapter 12.
Now paste this into the file (we’ll call it Example 1):
---
title: "Example knitr/R Markdown document"
author: "Your name"
date: "22/5/2020"
output:
word_document: default
---
## Table 1 - Demographics
```{r table1, echo = FALSE}
kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```
Knitting this into a Word document results in Figure 13.2A), which looks pretty decent, but some of the columns need formatting and the plot needs resizing. Do not be tempted to do this by hand directly in the Word document.
document.
Yes, before Markdown, we would have to move and format each table and fig-
ure directly in Word, and we would repeat this every time something changed.
Turns out some patient records were duplicated and you have to remove them before repeating the analysis. Or your colleague forgot to attach an extra file with 10 more patients.
No problem, you update the dataset, re-run the script that created the tables
and hit Knit in the R Markdown document. No more mindless re-doing for
you. We think this is pretty amazing.
If your plots are looking a bit grainy in Word, include this in your setup chunk
for high quality:
knitr::opts_chunk$set(dpi = 300)
The setup chunk is the one that starts with ```{r setup, include = FALSE} and
is generated automatically when you create a new R Markdown document in
RStudio.
To make sure tables always export with a suitable font size, you may edit your
Word file but only to create a new template. You will then use this template
to Knit the R Markdown document again.
In the Word document the first example outputted, click on a table. The style
should be compact: Right-click > Modify… > font size = 9
Alter heading and text styles in the same way as desired. Save this as colonTemplate.docx (avoid underscores in the name of this file). Move the file to your project folder and reference it in your .Rmd YAML header, as shown below.
Make sure you get the spacing correct: unlike R code, the YAML header is sensitive to formatting and the number of spaces at the beginning of the argument lines.
Finally, to get the figure printed in a size where the labels don’t overlap each
other, you will have to specify a width for it. The Chunk cog introduced in
the previous chapter is a convenient way to change the figure size (it is in the
top-right corner of each grey code chunk in an R Markdown document). It
usually takes some experimentation to find the best size for each plot/output
document; in this case we are going with fig.width = 10.
Knitting Example 2 here gives us Figure 13.2B). For something that is gener-
ated automatically, it looks awesome.
---
title: "Example knitr/R Markdown document"
author: "Your name"
date: "22/5/2020"
output:
word_document:
reference_docx: colonTemplate.docx
---
## Table 1 - Demographics
```{r table1, echo = FALSE}
kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```
(Figure 13.2: the knitted Word document. A: the default output - the demographics table (Table 1), the regression table (Table 2 - Association between tumour factors and 5 year mortality) and the odds ratio plot (Figure 1) need formatting and resizing. B: the same document knitted using the colonTemplate.docx reference document and a wider figure.)
---
title: "Example knitr/R Markdown document"
author: "Your name"
date: "22/5/2020"
output:
pdf_document: default
geometry: margin=0.75in
---
## Table 1 - Demographics
```{r table1, echo = FALSE}
kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"),
booktabs = TRUE)
```
(Figure: the knitted PDF report. It contains an Introduction - 'Colorectal cancer is the third most common cancer worldwide. In this study, we have re-analysed a classic dataset to determine the influence of tumour differentiation on 5-year survival prior to the introduction of modern chemotherapeutic regimes.' - a Methods section describing the randomised trial of adjuvant chemotherapy for colon cancer, a Results section containing Table 1 (Demographics of patients entered into a randomised controlled trial of adjuvant chemotherapy after surgery for colon cancer) and the line 'Table 2 shows a univariable and multivariable regression analysis of 5-year mortality by patient and disease characteristics', and a Discussion heading.)
13.11 Summary
14 Version control
Always be a first rate version of yourself and not a second rate version of
someone else.
Judy Garland
Version control is essential for keeping track of data analysis projects, as well
as collaborating. It allows backup of scripts and collaboration on complex
projects. RStudio works really well with Git, an open source distributed ver-
sion control system, and GitHub, a web-based Git repository hosting service.
Git is a piece of software which runs locally. It may need to be installed first.
It is important to highlight the difference between Git (local version control
software) and GitHub (a web-based repository store).
In RStudio, go to Tools -> Global Options and select the Git/SVN tab.
Ensure the path to the Git executable is correct. This is particularly impor-
tant in Windows where it may not default correctly (e.g., C:/Program Files
(x86)/Git/bin/git.exe).
14.2 Create an SSH RSA key and add to your GitHub account
In the Git/SVN tab, hit Create RSA Key (Figure 14.1A). In the window
that appears, hit the Create button (Figure 14.1B). Close this window.
Click, View public key (Figure 14.1C), and copy the displayed public key (Fig-
ure 14.1D).
If you haven’t already, create a GitHub account. On the GitHub website, open
the account settings tab and click the SSH keys tab (Figure 14.2A). Click Add
SSH key and paste in the public key you have copied from RStudio Figure
14.2B).
FIGURE 14.2: Adding your RStudio SSH key to your GitHub account.
Next, return to RStudio and configure Git via the Terminal (Figure 14.3A).
Remember Git is a piece of software running on your own computer. This is
distinct to GitHub, which is the repository website.
We will now create a new project which we want to backup to GitHub.
In RStudio, click New project as normal (Figure 14.3B). Click New Directory.
Name the project and check Create a git repository. Now in RStudio, create a
new script which you will add to your repository.
After saving your new script (e.g., test.R), it should appear in the Git tab
beside Environment.
Tick the file you wish to add, and the status should turn to a green 'A' (Figure 14.3C). Now click Commit and enter an identifying message in Commit
message (Figure 14.3D). It makes sense to do this prior to linking the project
and the GitHub repository, otherwise you’ll have nothing to push to GitHub.
You have now committed the current version of this file to a Git repository
on your computer/server.
Now you may want to push the contents of this commit to GitHub, so it is
also backed-up off site and available to collaborators. As always, you must
exercise caution when working with sensitive data. Take steps to stop yourself
from accidentally pushing whole datasets to GitHub.1 You only want to push
R code to GitHub, not the (sensitive) data.
When you see a dataset appear in the Git tab of your RStudio, select it, then
click on More, and then Ignore. This means the file does not get included in
your Git repository, and it does not get pushed to GitHub. GitHub is not for
backing up sensitive datasets, it’s for backing up R code. And make sure your
R code does not include passwords or access tokens.
In GitHub, create a New repository, called here myproject (Figure 14.4A). You
will now see the Quick setup page on GitHub. Copy the code under 'push an existing repository from the command line' (Figure 14.4B).
Back in RStudio, paste the code into the Terminal. Add your GitHub username and password (important!) (Figure 14.5A). You have now pushed your
commit to GitHub, and should be able to see your files in your GitHub ac-
count.
The Pull and Push buttons in RStudio will now also work (Figure 14.5B).
To avoid always having to enter your password, copy the SSH address from
GitHub and enter the code shown in Figure 14.5C and D.
Check that the Pull and Push buttons work as expected (Figure 14.5E).
Remember, after each Commit, you have to Push to GitHub, this doesn’t
happen automatically.
1 It's fine to push some data to GitHub, especially if you want to make it publicly available, but you should do so consciously, not accidentally.
(Figures 14.3-14.5: annotated screenshots of the RStudio Git tab, the Commit window, the new GitHub repository setup page, and the RStudio Terminal where the setup commands, username and password are entered, as described in the steps above.)
14.6 Summary
If your project is worth doing, then it is worth backing up! This means you
should use version control for every single project you are involved in. You
will quickly be surprised at how many times you wish to go back and rescue
some deleted code, or to restore an accidentally deleted file.
It becomes even more important when collaborating, with many individuals editing the same file. As with the previous chapter, this is an area which
data scientists have discovered much later than computer scientists. Get it up
and running in your own workflow and you will reap the rewards in the future.
(Figure: connecting an RStudio project to an existing GitHub repository - copy the HTTPS address from GitHub, paste it into the Repository URL field, commit a file, and, to avoid repeated password requests, copy the SSH address from GitHub and enter it in the RStudio Terminal.)
15 Encryption
Health data is precious and often sensitive. Datasets may contain patient iden-
tifiable information. Information may be clearly disclosive, such as a patient’s
date of birth, post/zip code, or social security number.
Other datasets may have been processed to remove the most obviously confi-
dential information. These still require great care, as the data is usually only
‘pseudoanonymised’. This may mean that the data of an individual patient is
disclosive when considered as a whole - perhaps the patient had a particularly
rare diagnosis. Or it may mean that the data can be combined with other
datasets and in combination, individual patients can be identified.
The governance around safe data handling is one of the greatest challenges
facing health data scientists today. It needs to be taken very seriously and
robust practices must be developed to ensure public confidence.
Storing sensitive information as raw values leaves the data vulnerable to con-
fidentiality breaches. This is true even when you are working in a ‘safe’ envi-
ronment, such as a secure server.
It is best to simply remove as much confidential information from records
whenever possible. If the data is not present, then it cannot be compromised.
This might not be a good idea if the data might need to be linked back to an
individual at some unspecified point in the future. This may be a problem if,
for example, auditors of a clinical trial need to re-identify an individual from
the trial data. A study ID can be used, but that still requires the confidential
data to be stored and available in a lookup table in another file.
This chapter is not a replacement for an information governance course. These
are essential and the reader should follow their institution’s guidelines on this.
The chapter does introduce a useful R package and encryption functions that
you may need to incorporate into your data analysis workflow.
The encryptr package is our own and allows users to store confidential data in
a pseudoanonymised form, which is far less likely to result in re-identification.
Either columns in data frames/tibbles or whole files can be directly encrypted
from R using strong RSA encryption.
The basis of RSA encryption is a public/private key pair and is the method
used of many modern encryption applications. The public key can be shared
and is used to encrypt the information.
The private key is sensitive and should not be shared. The private key requires
a password to be set, which should follow modern rules on password complexity.
You know what you should do! If the password is lost, it cannot be recovered.
The encryptr package can be installed in the standard manner or the devel-
opment version can be obtained from GitHub.
Full documentation is maintained separately at encrypt-r.org.1
install.packages("encryptr")
1 https://encrypt-r.org
15.5 Get the data
library(encryptr)
gp
#> A tibble: 1,212 x 12
#> organisation_code name address1 address2 address3 city postcode
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 S10002 MUIRHE… LIFF RO… MUIRHEAD NA DUND… DD2 5NH
#> 2 S10017 THE BL… CRIEFF … KING ST… NA CRIE… PH7 3SA
The genkeys() function generates a public and private key pair. A password is
required to be set in the dialogue box for the private key. Two files are written
to the active directory.
The default name for the private key is:
• id_rsa
If the private key file is lost, nothing encrypted with the public key can be
recovered. Keep this safe and secure. Do not share it without a lot of thought
on the implications.
genkeys()
#> Private key written with name 'id_rsa'
#> Public key written with name 'id_rsa.pub'
Once the keys are created, it is possible to encrypt one or more columns of data
in a data frame/tibble using the public key. Every time RSA encryption is used
it will generate a unique output. Even if the same information is encrypted
more than once, the output will always be different. It is therefore not possible
to match two encrypted values.
These outputs are also secure from decryption without the private key. This
may allow sharing of data within or between research teams without sharing
confidential data.
Encrypting columns to a ciphertext is straightforward. However, as stated
above, an important principle is dropping sensitive data which is never going
to be required. Do not hoard more data than you need to answer your question.
library(dplyr)
gp_encrypt = gp %>%
select(-c(name, address1, address2, address3)) %>%
encrypt(postcode)
gp_encrypt
Decryption requires the private key generated using genkeys() and the password
set at the time. The password and file are not replaceable so need to be kept
safe and secure. It is important to only decrypt the specific pieces of informa-
tion that are required. The beauty of this system is that when decrypting a
specific cell, the rest of the data remain secure.
gp_encrypt %>%
slice(1:2) %>% # Only decrypt the rows and columns necessary
decrypt(postcode)
Rather than storing the ciphertext in the working data frame, a lookup table
can be used as an alternative. Using lookup = TRUE has the following effects:
• returns the data frame / tibble with encrypted columns removed and a key
column included;
• returns the lookup table as an object in the R environment;
• creates a lookup table .csv file in the active directory.
gp_encrypt = gp %>%
select(-c(name, address1, address2, address3)) %>%
encrypt(postcode, telephone, lookup = TRUE)
gp_encrypt
The file creation can be turned off with write_lookup = FALSE and the name of
the lookup can be changed with lookup_name = "anyNameHere". The created lookup
file should be itself encrypted using the method below.
Decryption is performed by passing the lookup object or file to the decrypt()
function.
gp_encrypt %>%
decrypt(postcode, telephone, lookup_object = lookup)
# Or
gp_encrypt %>%
decrypt(postcode, telephone, lookup_path = "lookup.csv")
Encrypting the object within R has little point if a file with the disclosive
information is still present on the system. Files can be encrypted and decrypted
using the same set of keys.
To demonstrate, the included dataset is written as a .csv file.
write_csv(gp, "gp.csv")
encrypt_file("gp.csv")
#> Encrypted file written with name 'gp.csv.encryptr.bin'
Check that the file can be decrypted prior to removing the original file from
your system.
Warning: it is strongly suggested that the original unencrypted data file is backed up in a secure system in case decryption is not possible, e.g., if the private key file or password is lost.
The decrypt_file function will not allow the original file to be overwritten,
therefore use the option to specify a new name for the unencrypted file.
The ciphertext produced for a given input will change with each encryption.
This is a feature of the RSA algorithm. Ciphertexts should not therefore be
attempted to be matched between datasets encrypted using the same public
key. This is a conscious decision given the risks associated with sharing the
necessary details.
In collaborative projects where data may be pooled, a public key can be made
available by you via a link to enable collaborators to encrypt sensitive data.
This provides a robust method for sharing potentially disclosive data points.
gp_encrypt = gp %>%
select(-c(name, address1, address2, address3)) %>%
encrypt(postcode, telephone, public_key_path =
"https://argonaut.is.ed.ac.uk/public/id_rsa.pub")
Where those collecting the data are blinded to the allocation, this unique ciphertext further limits the impact of selection bias.
Researchers with approved projects may one day receive approval to carry out
additional follow-up through tracking of outcomes through electronic health-
care records or re-contact of patients. Should a follow-up study be approved,
patient identifiers stored as ciphertexts could then be decrypted to allow
matching of the participant to their own health records.
15.14 Summary
All confidential information must be treated with the utmost care. Data should
never be carried on removable devices or portable computers. Data should
never be sent by open email. Encrypting data provides some protection against
disclosure. But particularly in healthcare, data often remains potentially dis-
closive (or only pseudoanonymised) even after encryption of identifiable vari-
ables. Treat it with great care and respect.