R Programming For Data Science
Contents
List of Tables
List of Figures
1 Data
1.1 Baby Crawling Data
1.2 World Bank Data
1.3 Email Data
1.4 Handwritten Digit Recognition
1.5 Looking Forward
1.6 How to learn (The most important section in this book!)
4 Data Structures
4.1 Vectors
4.1.1 Types, Conversion, Coercion
4.1.2 Accessing Specific Elements of Vectors
4.1.3 Practice Problem
4.2 Factors
4.3 Names of Objects in R
4.4 Missing Data, Infinity, etc.
4.4.1 Practice Problem
4.4.2 Infinity and NaN
4.5 Data Frames
4.5.1 Accessing Specific Elements of Data Frames
4.6 Lists
4.6.1 Accessing Specific Elements of Lists
4.7 Subsetting with Logical Vectors
4.7.1 Modifying or Creating Objects via Subsetting
4.7.2 Logical Subsetting and Data Frames
4.8 Patterned Data
4.8.1 Practice Problem
4.9 Exercises
10 Classification
10.1 Logistic Regression
10.1.1 Adding Predictors
10.1.2 More than Two Classes
10.2 Nearest Neighbor Methods
10.2.1 kNN and the Diabetes Data
10.2.2 Practice Problem
10.2.3 kNN and the iris Data
10.3 Exercises
12 Rcpp
12.1 Getting Started with Rcpp
12.1.1 Installation
12.1.2 The Simplest C++ Example
12.2 Using Rcpp
12.2.1 Exporting C++ Functions
12.2.2 Inline C++ Code
12.3 The Rcpp Interface
12.4 No input and scalar output
12.4.1 Scalar input and scalar output
12.4.2 Vector input and scalar output
12.4.3 Vector input and vector output
12.4.4 Matrix input and vector output
12.4.5 Matrix input and matrix output
12.5 Exercises
List of Tables
8.1 An abbreviated list of sp and raster data objects and associated classes for the fundamental spatial data types
List of Figures
Course Description
1 Data
u <- "https://www.finley-lab.com/files/data/BabyCrawling.tsv"
BabyCrawling <- read.table(u, header=T)
This data set has many simple properties: it is relatively small, there are no
missing observations, the variables are easily understood, etc.
1 More correctly, were volunteered by their parents.
2 These data were retrieved from http://lib.stat.cmu.edu/DASL/Datafiles/Crawling.html.
It is estimated that in 2015, 90% of the total 205 billion emails sent were spam.3
Spam filters use large amounts of data from emails to learn what distinguishes
spam messages from non-spam (sometimes called “ham”) messages. Below we
include one spam message followed by a ham message.4
From safety33o@l11.newnamedns.com Fri Aug 23 11:03:37 2002
Return-Path: <safety33o@l11.newnamedns.com>
Delivered-To: zzzz@localhost.example.com
Received: from localhost (localhost [127.0.0.1])
by phobos.labs.example.com (Postfix) with ESMTP id 5AC994415F
for <zzzz@localhost>; Fri, 23 Aug 2002 06:02:59 -0400 (EDT)
Received: from mail.webnote.net [193.120.211.219]
by localhost with POP3 (fetchmail-5.9.0)
for zzzz@localhost (single-drop); Fri, 23 Aug 2002 11:02:59 +0100 (IST)
Received: from l11.newnamedns.com ([64.25.38.81])
by webnote.net (8.9.3/8.9.3) with ESMTP id KAA09379
3 Radicati Group, http://www.radicati.com
4 These messages both come from the large collection of spam and ham messages at
http://spamassassin.apache.org.
URL: http://boingboing.net/#85506723
Date: Not supplied
Disney has named a new president of Walt Disney Parks, replacing Paul Pressler,
the exec who did his damnedest to ruin Disneyland, slashing spending (at the
[1] http://reuters.com/news_article.jhtml?type=search&StoryID=1510778
[2] http://www.quicktopic.com/boing/H/rw7cDXT3W44C
To implement a spam filter we would have to get the data from these email
messages (and thousands of others) into a software package, extract and
separate potentially important features such as the To: line, the Subject:
line, the message body, etc., and then compare spam and non-spam messages
to find a method to classify new emails correctly. These steps are not simple
in this example. In particular, we would need to become skilled at working
with text data.
-1.000 -1.000 -1.000 -1.000 0.100 1.000 0.922 -0.439 -1.000 -1.000
-1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -0.257
0.950 1.000 -0.162 -1.000 -1.000 -1.000 -0.987 -0.714 -0.832 -1.000
-1.000 -1.000 -1.000 -1.000 -0.797 0.909 1.000 0.300 -0.961 -1.000
-1.000 -0.550 0.485 0.996 0.867 0.092 -1.000 -1.000 -1.000 -1.000
0.278 1.000 0.877 -0.824 -1.000 -0.905 0.145 0.977 1.000 1.000
1.000 0.990 -0.745 -1.000 -1.000 -0.950 0.847 1.000 0.327 -1.000
-1.000 0.355 1.000 0.655 -0.109 -0.185 1.000 0.988 -0.723 -1.000
-1.000 -0.630 1.000 1.000 0.068 -0.925 0.113 0.960 0.308 -0.884
-1.000 -0.075 1.000 0.641 -0.995 -1.000 -1.000 -0.677 1.000 1.000
0.753 0.341 1.000 0.707 -0.942 -1.000 -1.000 0.545 1.000 0.027
-1.000 -1.000 -1.000 -0.903 0.792 1.000 1.000 1.000 1.000 0.536
0.184 0.812 0.837 0.978 0.864 -0.630 -1.000 -1.000 -1.000 -1.000
-0.452 0.828 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
0.135 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -0.483 0.813 1.000
1.000 1.000 1.000 1.000 1.000 0.219 -0.943 -1.000 -1.000 -1.000
-1.000 -1.000 -1.000 -1.000 -0.974 -0.429 0.304 0.823 1.000 0.482
-0.474 -0.991 -1.000 -1.000 -1.000 -1.000
and Figure 1.3 shows the digitized images of the first 25 numeral sevens in the
data set. These give some idea of the variability in how digits are written.5
challenges in working with real data sets, within the context of the R statistical
system. We will focus on important topics such as
something else with your time. That may or may not be a good life strategy,
depending on what else you do with your time, but you won’t learn much from
the book!
Another way to engage is to read through the book “passively”, reading all
that’s written but not reading the book while at your computer, where you
could enter the R commands from the book. With this strategy you’ll probably
learn more than if you leave the book closed on a shelf, but there are better
options.
A third way to engage is to read the book while you’re at a computer, enter
the R commands from the book as you read about them, and work on the
practice problems within many of the chapters. You’ll likely learn more this
way.
A fourth strategy is even better. In addition to reading, entering the commands
given in the book, and working through the practice exercises, think about
what you're doing, and ask yourself questions (which you then go on to answer).
For example after working through some R code computing the logarithm of
positive numbers you might ask yourself, “What would R do if I asked it to
calculate the logarithm of a negative number? What would R do if I asked it
to calculate the logarithm of a really large number such as one trillion?” You
could explore these questions easily by just trying things out in the R Console
window.
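For instance, trying these questions directly in the console takes only a moment (a quick illustration, not an example from the book):
log(-1)     # R returns NaN and issues a warning ("NaNs produced")
log(1e12)   # the natural logarithm of one trillion, about 27.63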
If your goal is to maximize the time you have to binge-watch Stranger Things
Season 2 on Netflix, the first strategy may be optimal. But if your goal is to
learn a lot about computational tools for data science, the fourth strategy is
probably going to be best.
2 Introduction to R and RStudio
1. R is free
2. It is one of, if not the, most widely used software environments in
data science
3. R is under constant and open development by a diverse and expert
core group
4. It has an incredible variety of contributed packages
5. A new user can (relatively) quickly gain enough skills to obtain,
manage, and analyze data in R
Several enhanced interfaces for R have been developed. Generally such interfaces
are referred to as integrated development environments (IDE). These interfaces
are used to facilitate software development. At minimum, an IDE typically
consists of a source code editor and build automation tools. We will use the
RStudio IDE, which according to its developers “is a powerful productive
user interface for R.”1 RStudio is widely used and increasingly popular in the R
community, and it makes learning to use R a bit simpler. Although we will
use RStudio, most of what is presented in this book can be accomplished in R
(without an added interface) with few or no changes.
1. Go to http://www.r-project.org/
2. Click on the CRAN link on the left side of the page
3. Choose one of the mirrors.4
4. Click on Download R for Windows
5. Click on base
6. Click on Download R 4.0.2 for Windows
7. Install R as you would install any other Windows program
1. Go to http://www.rstudio.com
2. Click on the link RStudio under the Products tab
3. Click on the RStudio Desktop box
4. Choose the DOWNLOAD RSTUDIO DESKTOP link in the Open Source
Edition column
5. On the ensuing page, click on the Installer version for your oper-
ating system, and once downloaded, install as you would any other
program
Start RStudio as you would any other program in your operating system. For
example, under Microsoft Windows use the Start Menu or double click on the
shortcut on the desktop (if a shortcut was created in the installation process).
A (rather small) view of RStudio is displayed in Figure 2.1.
Initially the RStudio window contains three smaller windows. For now our
main focus will be the large window on the left, the Console window, in which
3 New versions of R are released regularly, so the version number in Step 6 might be different.
R statements are typed. The next few sections give simple examples of the use
of R. In these sections we will focus on small and non-complex data sets, but
of course later in the book we will work with much larger and more complex
sets of data. Read these sections at your computer with R running, and enter
the R commands there to get comfortable using the R console window and
RStudio.
Figure 2.1 shows the default RStudio theme and layout. RStudio provides great
flexibility in changing themes and code highlighting to customize the RStudio
interface. To switch between themes, click on the Tools bar at the top of the
window, click on the Global Options tab, and then click on the Appearance
tab where you will see the option to switch between three RStudio themes:
Modern, Classic, and Sky.
You can also switch the Editor font as well as specify the Editor theme from
a suite of options. Some of the editor themes are dark palettes that will
automatically activate a Dark theme. For example, an RStudio window using
the Modern theme and the Tomorrow Night Bright editor theme is shown
in Figure 2.2. Using a dark palette in RStudio can be especially nice if you
spend numerous hours at a time looking at your screen. If you’re interested in
customizing your RStudio window even further, you can even create custom
editor themes (see this RStudio blog5 for a useful tutorial).
FIGURE 2.2: The RStudio IDE using the Modern theme and the Tomorrow
Night Bright editor theme
2.3.2 R as a Calculator
[1] 234
[1] 7.389
5 https://support.rstudio.com/hc/en-us/articles/115011846747-Using-RStudio-Themes
[1] 2
[1] 4.605
10^log10(55)
[1] 55
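The console inputs that produced most of the outputs just shown are not reproduced above. The expressions below illustrate the same kind of calculator use (a sketch, not necessarily the book's exact examples; the computed values are as noted in the comments):
2 + 232        # addition: 234
exp(2)         # the exponential function: about 7.389
log10(100)     # base 10 logarithm: 2
log(100)       # natural logarithm: about 4.605
10^log10(55)   # raising 10 to a base 10 logarithm recovers the original value: 55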
Most functions in R can be applied to vector arguments rather than operating
on a single argument at a time. A vector is a data structure that contains
elements of the same data type (e.g., all integers).
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[18] 18 19 20 21 22 23 24 25
[1] 1 4 9 16 25 6 14 24 36 50 11 24
[13] 39 56 75 16 34 54 76 100 21 44 69 96
[25] 125
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
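The calls that produced the three vectors above are not shown; a sketch of vectorized operations consistent with that output is:
x <- 1:25            # the integers 1 through 25
x                    # printing x gives the first output above
x * (1:5)            # elementwise multiplication; the shorter vector is recycled
seq(0, 1, by = 0.1)  # a regular sequence from 0 to 1 in steps of 0.1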
car.hp <- c(110, 110, 93, 110, 175, 105, 245, 62, 95, 123,
123, 180, 180, 180, 205)
car.mpg <- c(21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4,
22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4)
car.name <- c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
              "Hornet 4 Drive", "Hornet Sportabout", "Valiant",
              "Duster 360", "Merc 240D", "Merc 230", "Merc 280",
              "Merc 280C", "Merc 450SE", "Merc 450SL", "Merc 450SLC",
              "Cadillac Fleetwood")
6 These are from a relatively old data set, with 1974 model cars.
[1] 110 110 93 110 175 105 245 62 95 123 123 180
[13] 180 180 205
car.mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4
car.name
mean(car.hp)
[1] 139.7
sd(car.hp)
[1] 50.78
summary(car.hp)
mean(car.mpg)
[1] 18.72
sd(car.mpg)
[1] 3.714
summary(car.mpg)
plot(car.hp, car.mpg)
[Figure: scatter plot of car.mpg versus car.hp]
When you created the car.hp and other vectors in the previous section, you
might have noticed the vector name and a short description of its attributes
appear in the top right Global Environment window. Similarly, when you
called plot(car.hp,car.mpg) the corresponding plot appeared in the lower
right Plots window.
A comprehensive, but slightly overwhelming, cheatsheet for RStudio is available
here: https://www.rstudio.com/wp-content/uploads/2016/01/rstudio-IDE-cheatsheet.pdf. As we progress in learning R and RStudio, this
cheatsheet will become more useful. For now you might use the cheatsheet to
locate the various windows and functions identified in the coming chapters.
File names should be meaningful and end in .R. If we write a script that
analyzes a certain species distribution:
• GOOD: african_rhino_distribution.R
• GOOD: africanRhinoDistribution.R
• BAD: speciesDist.R (too ambiguous)
• BAD: species.dist.R (too ambiguous and two periods can confuse operating systems' file type auto-detect)
• BAD: speciesdist.R (too ambiguous and confusing)
• GOOD: rhino.count
• GOOD: rhinoCount
• GOOD: rhino_count (We don’t mind the underscore and use it quite often,
although Google’s style guide says it’s a no-no for some reason)
• BAD: rhinocount (confusing)
2.6.3 Syntax
3 Scripts and R Markdown
Doing work in data science, whether for homework, a project for a business, or
a research project, typically involves several iterations. For example creating
an effective graphical representation of data can involve trying out several
different graphical representations, and then tens if not hundreds of iterations
when fine-tuning the chosen representation. Furthermore, each of these repre-
sentations may require several R commands to create. Although this all could
be accomplished by typing and re-typing commands at the R Console, it is
easier and more effective to write the commands in a script file, which then
can be submitted to the R console either a line at a time or all together.1
In addition to making the workflow more efficient, R scripts provide another
large benefit. Often we work on one part of a homework assignment or project
for a few hours, then move on to something else, and then return to the original
part a few days, months, or sometimes even years later. In such cases we may
have forgotten how we created the graphical display that we were so proud of,
and will need to again spend a few hours to recreate it. If we save a script file,
we have the ingredients immediately available when we return to a portion of
a project.2
Next consider the larger scientific endeavor. Ideally a scientific study will be
reproducible, meaning that an independent group of researchers (or the original
researchers) will be able to duplicate the study. Thinking about data science,
this means that all the steps taken when working with the data from a study
should be reproducible, from the selection of variables to formal data analysis.
In principle, this can be facilitated by explaining, in words, each step of the
work with data. In practice, it is typically difficult or impossible to reproduce
a full data analysis based on a written explanation. Much more effective is
to include the actual computer code which accomplished the data work in
the report, whether the report is a homework assignment or a research paper.
Tools in R such as R Markdown facilitate this process.
1 Unsurprisingly it is also possible to submit several selected lines of code at once.
2 In principle the R history mechanism provides a similar record. But with history we
have to search through a lot of other code to find what we’re looking for, and scripts are a
much cleaner mechanism to record our work.
3.1 Scripts in R
As noted above, scripts help to make work with data more efficient and provide
a record of how data were managed and analyzed. Below we describe an
example. This example uses features of R that we have not yet discussed, so
don’t worry about the details, but rather about how it motivates the use of a
script file.
First we read in a data set containing data on (among other things) fertility
rate and life expectancy for countries throughout the world, for the years 1960
through 2014.
u <- "https://www.finley-lab.com/files/data/WorldBank.csv"
WorldBank <- read.csv(u, header=TRUE, stringsAsFactors=FALSE)
Next we print the names of the variables in the data set. Don’t be concerned
about the specific details. Later we will learn much more about reading in
data and working with data sets in R.
names(WorldBank)
[1] "iso2c"
[2] "country"
[3] "year"
[4] "fertility.rate"
[5] "life.expectancy"
[6] "population"
[7] "GDP.per.capita.Current.USD"
[8] "X15.to.25.yr.female.literacy"
[9] "iso3c"
[10] "region"
[11] "capital"
[12] "longitude"
[13] "latitude"
[14] "income"
[15] "lending"
We will try to create a scatter plot of fertility rate versus life expectancy of
countries for the year 1960. To do this we’ll first create variables containing
the values of fertility rate and life expectancy for 1960,3 and print out the first
ten values of each variable.
lifeexp[1:10]
plot(lifeexp, fertility)
[Figure: scatter plot of fertility versus lifeexp]
The scatter plot shows that as life expectancy increases, fertility rate tends to
decrease in what appears to be a nonlinear relationship. Now that we have a
basic scatter plot, it is tempting to make it more informative. We will do this
by adding two features. One is to make the points’ size proportional to the
country’s population, and the second is to make the points’ color represent the
region of the world the country resides in. We’ll first extract the population
and region variables for 1960.
region[1:10]
[Figures: scatter plots of fertility versus lifeexp, refined by making point size proportional to population and coloring points by region]
Of course we should have a key which tells the viewer which region each color
represents, and a way to determine which country each point represents, and
a lot of other refinements. For now we will resist such temptations.
Some of the process leading to the completed plot is shown above, such as
reading in the data, creating variables representing the 1960 fertility rate and
life expectancy, an intermediate plot that was rejected, and so on. A lot of
the process isn’t shown, simply to save space. There would likely be mistakes
(either minor typing mistakes or more complex errors). Focusing only on the
symbols() function that was used to add the colorful symbols to the scatter
plot, there would likely have been a substantial number of attempts with
different values of the circles, inches, and bg arguments before settling on
the actual form used to create the plot. This is the typical process you will
soon discover when producing useful data visualizations.
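The code that produced the final plot is not reproduced here. A minimal sketch of a symbols() call of the kind described, assuming vectors fertility, lifeexp, population, and region for 1960 have already been created (the specific colors and argument values are illustrative, not the book's):
# color each point by the region of the world the country is in
region.col <- c("red", "blue", "green", "orange", "purple", "brown", "cyan")[factor(region)]
# circle area proportional to population; inches scales the largest circle
symbols(lifeexp, fertility, circles = sqrt(population), inches = 0.35,
        bg = region.col, xlab = "lifeexp", ylab = "fertility")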
Now imagine trying to recreate the plot a few days later. Possibly someone saw
the plot and commented that it would be interesting to see some similar plots,
but for years in the 1970s when there were major famines in different countries
of the world. If all the work, including all the false starts and refinements, were
done at the console, it would be hard to sort things out and would take longer
than necessary to create the new plots. This would be especially true if a few
months had passed rather than just a few days.
But with a script file, especially a script file with a few well-chosen comments,
creating the new scatter plots would be much easier. Fortunately it is quite
easy to create and work with script files in RStudio.4 Just choose File > New
File > R script and a script window will open up in the upper left of the
full RStudio window.
An example of a script window (with some R code already typed in) is shown
in Figure 3.1. From the script window the user can, among other things, save
the script (either using the File menu or the icon near the top left of the
window) and run one or more lines of code from the window (using the run
icon in the window, or by copying and pasting into the console window). In
addition, there is a Source on Save checkbox. If this is checked, the R code
in the script window is automatically read into R and executed when the script
file is saved.
4 It is also easy in R without RStudio. Just use File > New script to create a script file,
3.2 R Markdown
People typically work on data with a larger purpose in mind. Possibly the
purpose is to understand a biological system more clearly. Possibly the purpose
is to refine a system that recommends movies to users in an online streaming
movie service. Possibly the purpose is to complete a homework assignment
and demonstrate to the instructor an understanding of an aspect of data
analysis. Whatever the purpose, a key aspect is communicating with the
desired audience.
One possibility, which is somewhat effective, is to write a document using
software such as Microsoft Word 5 and to include R output such as computations
and graphics by cutting and pasting into the main document. One drawback
to this approach is similar to what makes script files so useful: If the document
must be revised it may be hard to unearth the R code that created graphics
or analyses.6 A more subtle but possibly more important drawback is that
the reader of the document will not know precisely how analyses were done,
or how graphics were created. Over time even the author(s) of the paper will
forget the details. A verbal description in a “methods” section of a paper can
help here, but typically these do not provide all the details of the analysis,
5 Or possibly LaTeX if the document is more technical.
6 Organizing the R code using script files and keeping all the work organized in a well-thought-out directory structure can help here, but this requires a level of forethought and organization that most people do not possess . . . including myself.
but rather might state something like, “All analyses were carried out using R
version 3.4.0.”
RStudio’s website provides an excellent overview of R Markdown capabilities
for reproducible research. At minimum, follow the Get Started link at http://rmarkdown.rstudio.com/ and watch the introduction video.
Among other things, R Markdown provides a way to include R code that reads
in data, creates graphics, or performs analyses, all in a single document that is
processed to create a research paper, homework assignment, or other written
product. The R Markdown file is a plain text file containing text the author
wants to show in the final document, simple commands to indicate how the
text should be formatted (for example boldface, italic, or a bulleted list), and R
code that creates output (including graphics) on the fly. Perhaps the simplest
way to get started is to see an R Markdown file and the resulting document
that is produced after the R Markdown document is processed. In Figure 3.2
we show the input and output of an example R Markdown document. In this
case the output created is an HTML file, but there are other possible output
formats, such as Microsoft Word or PDF.
At the top of the input R Markdown file are some lines with --- at the top and
the bottom. These lines are not needed, but give a convenient way to specify
the title, author, and date of the article that are then typeset prominently at
the top of the output document. For now, don’t be concerned with the lines
following output:. These can be omitted (or included as shown).
Next are a few lines showing some of the ways that font effects such as italics,
boldface, and strikethrough can be achieved. For example, an asterisk before
and after text sets the text in italics, and two asterisks before and after text
sets the text in boldface.
More important for our purposes is the ability to include R code in the R
Markdown file, which will be executed with the output appearing in the
output document. Bits of R code included this way are called code chunks. The
beginning of a code chunk is indicated with three backticks and an “r” in curly
braces: ```{r}. The end of a code chunk is indicated with three backticks ```.
For example, the R Markdown file in Figure 3.2 has one code chunk:
```{r}
x = 1:10
y = 10:1
mean(x)
sd(y)
```
In this code chunk two vectors x and y are created, and the mean of x and the
standard deviation of y are computed. In the output in Figure 3.2 the R code
is reproduced, and the output of the two lines of code asking for the mean and
standard deviation is shown.
To produce the HTML output, click the Knit HTML button at the top of the R Markdown window.7
You’ll be prompted to choose a filename for the R Markdown file. Make sure
that you use .Rmd as the extension for this file. Once you’ve successfully saved
the file, RStudio will process the file, create the HTML output, and open this
output in a new window. The HTML output file will also be saved to your
working directory. This file can be shared with others, who can open it using a
web browser such as Chrome or Firefox.
There are many options which allow customization of R Markdown documents.
Some of these affect formatting of text in the document, while others affect
how R code is evaluated and displayed. The RStudio web site contains a
useful summary of many R Markdown options at https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf. A different,
but mind-numbingly busy, cheatsheet is at https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf. Some of the
more commonly used R Markdown options are described next.
Unordered (sometimes called bulleted) lists and ordered lists are easy in R
Markdown. Figure 3.3 illustrates the creation of unordered and ordered lists.
7 If you hover your mouse over the Knit button for a couple of seconds, it displays a keyboard shortcut you can use instead of clicking the button.
• For an unordered list, either an asterisk, a plus sign, or a minus sign may
precede list items. Use a space after these symbols before including the
list text. To have second-level items (sub-lists) indent four spaces before
indicating the list item. This can also be done for third-level items.
• For an ordered list use a numeral followed by a period and a space (1. or
2. or 3. or . . . ) to indicate a numbered list, and use a letter followed by a
period and a space (a. or b. or c. or . . . ) to indicate a lettered list. The same
four space convention used in unordered lists is used to designate ordered
sub lists.
• For an ordered list, the first list item will be labeled with the number or letter
that you specify, but subsequent list items will be numbered sequentially.
The example in Figure 3.3 will make this more clear. In those examples
notice that for the ordered list, although the first-level numbers given in the
R Markdown file are 1, 2, and 17, the numbers printed in the output are 1,
2, and 3. Similarly the letters given in the R Markdown file are c and q, but
the output file prints c and d.
R Markdown does not give substantial control over font size. Different “header”
levels are available that provide different font sizes. Put one or more hash
marks in front of text to specify different header levels. Other font choices such
as subscripts and superscripts are possible, by surrounding the text either by
tildes or carets. More sophisticated mathematical displays are also possible,
and are surrounded by dollar signs. The actual mathematical expressions are
specified using a language called LaTeX. See Figures 3.4 and 3.5 for examples.
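As a brief illustration (not one of the book's figures), the following line in an R Markdown file
The sample mean is $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
renders the LaTeX expression between the dollar signs as the usual formula for a sample mean; surrounding an expression with two dollar signs on each side displays it on its own line.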
To produce PDF output you will also need a LaTeX installation. For a Windows system, MiKTeX is convenient and is available from https://miktex.org. For a Mac system, MacTeX is available from https://www.tug.org/mactex/.
3.3 Exercises
4 Data Structures
A data structure is a format for organizing and storing data. The structure
is designed so that data can be accessed and worked with in specific ways.
Statistical software and programming languages have methods (or functions)
designed to operate on different kinds of data structures.
This chapter’s focus is on data structures. To help initial understanding, the
data in this chapter will be relatively modest in size and complexity. The ideas
and methods, however, generalize to larger and more complex data sets.
The base data structures in R are vectors, matrices, arrays, data frames, and
lists. The first three, vectors, matrices, and arrays, require all elements to
be of the same type or homogeneous, e.g., all numeric or all character. Data
frames and lists allow elements to be of different types or heterogeneous, e.g.,
some elements of a data frame may be numeric while other elements may be
character. These base structures can also be organized by their dimensionality,
i.e., 1-dimensional, 2-dimensional, or N-dimensional, as shown in Table 4.1.
R has no scalar types, i.e., 0-dimensional. Individual numbers or strings are
actually vectors of length one.
An efficient way to understand what comprises a given object is to use the
str() function. str() is short for structure and prints a compact, human-
readable description of any R data structure. For example, in the code below,
we prove to ourselves that what we might think of as a scalar value is actually
a vector of length one.
a <- 1
str(a)
num 1
is.vector(a)
[1] TRUE
length(a)
[1] 1
Here we assigned a the scalar value one. The str(a) prints num 1, which says a
is numeric of length one. Then just to be sure we used the function is.vector()
to test if a is in fact a vector. Then, just for fun, we asked the length of a,
which again returns one. There is a set of similar logical tests for the other
base data structures, e.g., is.matrix(), is.array(), is.data.frame(), and
is.list(). These will all come in handy as we encounter different R objects.
4.1 Vectors
Think of a vector1 as a structure to represent one variable in a data set. For
example a vector might hold the weights, in pounds, of 7 people in a data set.
Or another vector might hold the genders of those 7 people. The c() function
in R is useful for creating (small) vectors and for modifying existing vectors.
Think of c as standing for “combine”.
1 Technically these are atomic vectors (all of whose elements are of the same type), since lists, to be described below, also are actually vectors. This will not be an important issue, and the shorter term vector will be used for atomic vectors below.
Notice that elements of a vector are separated by commas when using the c()
function to create a vector. Also notice that character values are placed inside
quotation marks.
The c() function also can be used to add to an existing vector. For example,
if an eighth male person was included in the data set, and his weight was 194
pounds, the existing vectors could be modified as follows.
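The code that originally built and then extended these vectors is not reproduced here; a sketch consistent with the values displayed later in the chapter (bp records whether each person takes blood pressure medication) is:
weight <- c(123, 157, 202, 199, 223, 140, 105)                               # pounds
gender <- c("female", "female", "male", "female", "male", "male", "female")
bp     <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
# add the eighth person, a 194 pound male
weight <- c(weight, 194)
gender <- c(gender, "male")
bp     <- c(bp, FALSE)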
gender
typeof(weight)
[1] "double"
typeof(gender)
[1] "character"
typeof(bp)
[1] "logical"
It may be surprising to see that the variable weight is of double type, even
though its values all are integers. By default R creates a double type vector
when numeric values are given via the c() function.
When it makes sense, it is possible to convert vectors to a different type.
Consider the following examples.
typeof(weight.int)
[1] "integer"
[1] 0 1 0 0 1 0 1 0
gender.oops
[1] NA NA NA NA NA NA NA NA
sum(bp)
[1] 3
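The conversion calls themselves are not all shown above; a sketch consistent with the output (object names other than weight.int and gender.oops are assumptions) is:
weight.int  <- as.integer(weight)    # double to integer
weight.char <- as.character(weight)  # double to character
bp.double   <- as.numeric(bp)        # logical to double: FALSE becomes 0, TRUE becomes 1
gender.oops <- as.numeric(gender)    # "female"/"male" have no numeric meaning, so each becomes NA
sum(bp)                              # bp is coerced to 0/1 and summed, giving 3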
The integer version of weight doesn’t look any different, but it is stored
differently, which can be important both for computational efficiency and for
interfacing with other languages such as C++. As noted above, however, we will
not worry about the distinction between integer and double types. Converting
weight to character goes as expected: The character representations of the
numbers replace the numbers themselves. Converting the logical vector bp to
double is pretty straightforward too: FALSE is converted to zero, and TRUE is
converted to one. Now think about converting the character vector gender to
a numeric double vector. It’s not at all clear how to represent “female” and
“male” as numbers. In fact in this case what R does is to create a numeric
vector, but with each element set to NA, which is the representation of missing
data.2 Finally consider the code sum(bp). Now bp is a logical vector, but when
R sees that we are asking to sum this logical vector, it automatically converts
it to a numerical vector and then adds the zeros and ones representing FALSE
and TRUE.
R also has functions to test whether a vector is of a particular type.
is.double(weight)
[1] TRUE
is.character(weight)
[1] FALSE
is.integer(weight.int)
[1] TRUE
is.logical(bp)
[1] TRUE
4.1.1.1 Coercion
[1] 1 2 3 1
weight+bp
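A sketch of the kind of coercion being illustrated (consistent with the output shown, though the book's exact code may differ):
c(1, 2, 3, TRUE)   # logical coerced to numeric: 1 2 3 1
c(1, 2, "three")   # numbers coerced to character: "1" "2" "three"
weight + bp        # TRUE/FALSE coerced to 1/0 and added to each weight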
To access and possibly change specific elements of vectors, refer to the position
of the element in square brackets. For example, weight[4] refers to the fourth
element of the vector weight. Note that R starts the numbering of elements
at 1, i.e., the first element of a vector x is x[1].
weight
weight[5]
[1] 223
weight[1:3]
length(weight)
[1] 8
weight[length(weight)]
[1] 194
weight[]
weight[-3]
weight[-length(weight)]
weight[0]
numeric(0)
weight[c(0,2,1)]
weight[c(-1, 2)]
Error in weight[c(-1, 2)]: only 0's may be mixed with negative subscripts
Note that mixing zero and other nonzero subscripts is allowed, but mixing
negative and positive subscripts is not allowed.
What about the (usual) case where we don’t know the positions of the elements
we want? For example possibly we want the weights of all females in the data.
Later we will learn how to subset using logical indices, which is a very powerful
way to access desired elements of a vector.
Suppose we are interested in the second to last value of the data set. One
way to do this is to first determine the length of vector using the length()
function, then taking that value and subtracting 1.
[1] 10
tree.sp[10 - 1]
[1] 9
This is an example of hardcoding. But what if we attempt to use the same code
on a second vector of tree species data that has a different number of sites?
[1] 6
tree.sp[10 - 1]
[1] NA
That’s clearly not what we want. Fix this code so we can always extract the
second to last value in the vector, regardless of the length of the vector.
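One way to remove the hardcoding, using length() as suggested above (a possible solution):
tree.sp[length(tree.sp) - 1]   # always the second to last element, whatever the length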
4.2 Factors
Categorical variables such as gender can be represented as character vectors.
In many cases this simple representation is sufficient. Consider, however, two
other categorical variables, one representing age via categories youth, young
adult, middle age, senior, and another representing income via categories
lower, middle, and upper. Suppose that for the small health data set, all the
people are either middle aged or senior citizens. If we just represented the
variable via a character vector, there would be no way to know that there
are two other categories, representing youth and young adults, which happen
not to be present in the data set. And for the income variable, the character
vector representation does not explicitly indicate that there is an ordering of
the levels.
Factors in R provide a more sophisticated way to represent categorical variables.
Factors explicitly contain all possible levels, and allow ordering of levels.
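A sketch of how such factors can be created with the factor() function (the values are illustrative, not the book's data):
income <- factor(c("lower", "middle", "upper", "middle"),
                 levels = c("lower", "middle", "upper"), ordered = TRUE)
age <- factor(c("middle age", "senior", "senior", "middle age"),
              levels = c("youth", "young adult", "middle age", "senior"))
levels(age)   # all four levels are retained, even those absent from the data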
income
weight
Weight
Because R requires that a vector contain only elements of one type, there are different types of NA
values. Usually R determines the appropriate type of NA value automatically.
It is worth noting that the default type for NA is logical, and that NA is NOT
the same as the character string "NA".
is.na(missingCharacter)
typeof(NA)
[1] "logical"
How should missing data be treated in computations, such as finding the mean
or standard deviation of a variable? One possibility is to return NA. Another is
to remove the missing value(s) and then perform the computation.
> mean(c(1,2,3,NA,5))
[1] NA
> mean(c(1,2,3,NA,5), na.rm = TRUE)
[1] 2.75
As this example shows, the default behavior for the mean() function is to
return NA. If removal of the missing values and then computing the mean is
desired, the argument na.rm is set to TRUE. Different R functions have different
default behaviors, and there are other possible actions. Consulting the help for
a function provides the details.
Collecting data is often a messy process resulting in multiple errors in the data.
Consider the following small vector representing the weights of 10 adults in
pounds.
my.weights <- c(150, 138, 289, 239, 12, 103, 310, 200, 218, 178)
As far as I know, it’s not possible for an adult to weigh 12 pounds, so that
is most likely an error. Change this value to NA, and then find the standard
deviation of the weights after removing the NA value.
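One possible solution (a sketch):
my.weights[5] <- NA              # the fifth value is the erroneous 12
sd(my.weights, na.rm = TRUE)     # standard deviation after removing the NA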
[1] 0 1 2 3 4
> 1/x
> x/x
[1] NaN 1 1 1 1
Inf and -Inf represent infinity and negative infinity (and numbers which
are too large in magnitude to be represented as floating point numbers). NaN
represents the result of a calculation where the result is undefined, such as
dividing zero by zero. All of these are common to a variety of programming
languages, including R.
names(healthData)
colnames(healthData)
Wt Gdr bp
1 123 female FALSE
2 157 female TRUE
3 202 male FALSE
4 199 female FALSE
5 223 male TRUE
6 140 male FALSE
7 105 female TRUE
8 194 male FALSE
rownames(healthData)
The data.frame function can be used to create a data frame (although it’s
more common to read a data frame into R from an external file, something that
will be introduced later). The names of the variables in the data frame are given
as arguments, as are the vectors of data that make up the variable’s values. The
argument stringsAsFactors=FALSE asks R not to convert character vectors
into factors. As of version R 4.0.0, R does not automatically convert character
vectors into factors. However, up until this recent version, R would automatically
convert strings to factors (i.e., stringsAsFactors = TRUE), and so to avoid
confusion we will typically display stringsAsFactors=FALSE throughout most
of the book. Names of the columns (variables) can be extracted and set via
either names or colnames. In the example, the variable names are changed
to Wt, Gdr, bp and then changed back to the original Weight, Gender,
bp.meds in this way. Rows can be named also. In this case since specific row
names were not provided, the default row names of "1", "2" etc. are used.
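The call that created healthData is not reproduced above; a sketch consistent with the description, using the weight, gender, and bp vectors from earlier in the chapter, is:
healthData <- data.frame(Weight = weight, Gender = gender, bp.meds = bp,
                         stringsAsFactors = FALSE)
names(healthData) <- c("Wt", "Gdr", "bp")               # rename the columns
names(healthData) <- c("Weight", "Gender", "bp.meds")   # and restore the originals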
In the next example a built-in dataset called mtcars is made available by the
data function, and then the first and last six rows are displayed using head
and tail.
data(mtcars)
head(mtcars)
tail(mtcars)
mtcars[1,4]
[1] 110
mtcars[1:3, 3]
mtcars[1:3, 2:3]
cyl disp
Mazda RX4 6 160
Mazda RX4 Wag 6 160
Datsun 710 4 108
mtcars[,1]
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
Note that mtcars[,1] returns ALL elements in the first column. This agrees
with the behavior for vectors, where leaving a subscript out of the square
brackets tells R to return all values. In this case we are telling R to return all
rows, and the first column.
For a data frame there is another way to access specific columns, using the $
notation.
> mtcars$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
> mtcars$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8
[26] 4 4 4 8 6 8 4
> mpg
> cyl
> weight
4.6 Lists
The third main data structure we will work with is a list. Technically a list is
a vector, but one in which elements can be of different types. For example a
list may have one element that is a vector, one element that is a data frame,
and another element that is a function. Consider designing a function that fits
a simple linear regression model to two quantitative variables. We might want
that function to compute and return several things such as
• The fitted slope and intercept (a numeric vector with two components)
• The residuals (a numeric vector with n components, where n is the number
of data points)
• Fitted values for the data (a numeric vector with n components, where n is
the number of data points)
• The names of the dependent and independent variables (a character vector
with two components)
In fact R has a function, lm, which does this (and much more).
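The model fit itself is not shown above; a sketch consistent with the output that follows is:
mpgHpLinMod <- lm(mpg ~ hp, data = mtcars)   # simple linear regression of mpg on hp
mode(mpgHpLinMod)                            # the fitted model object is stored as a list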
[1] "list"
names(mpgHpLinMod)
mpgHpLinMod$coefficients
(Intercept) hp
30.09886 -0.06823
mpgHpLinMod$residuals
the object mpgHpLinMod).3 One component of the list is the length 2 vector of
coefficients, while another component is the length 32 vector of residuals. The
code also illustrates that named components of a list can be accessed using
the dollar sign notation, as with data frames.
The list function is used to create lists.
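The construction of the list discussed below is not shown; a sketch consistent with the components displayed is:
temporaryList <- list(first = weight,
                      second = healthData,
                      pickle = list(a = 1:10, b = healthData))
temporaryList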
$first
[1] 123 157 202 199 223 140 105 194
$second
Weight Gender bp.meds
1 123 female FALSE
2 157 female TRUE
3 202 male FALSE
4 199 female FALSE
5 223 male TRUE
6 140 male FALSE
7 105 female TRUE
8 194 male FALSE
$pickle
$pickle$a
[1] 1 2 3 4 5 6 7 8 9 10
$pickle$b
Weight Gender bp.meds
1 123 female FALSE
2 157 female TRUE
3 202 male FALSE
4 199 female FALSE
5 223 male TRUE
6 140 male FALSE
7 105 female TRUE
8 194 male FALSE
Here, for illustration, I assembled a list to hold some of the R data structures
we have been working with in this chapter. The first list element, named
first, holds the weight vector we created in Section 4.1, the second list
element, named second, holds the healthData data frame, and the third list
3 The mode function returns the type or storage mode of an object.
element, named pickle, holds a list with elements named a and b that hold a
vector of values 1 through 10 and another copy of the healthData data frame,
respectively. As this example shows, a list can contain another list.
We already have seen the dollar sign notation works for lists. In addition, the
square bracket subsetting notation can be used. There is an added, somewhat
subtle wrinkle—using either single or double square brackets.
temporaryList$first
mode(temporaryList$first)
[1] "numeric"
temporaryList[[1]]
mode(temporaryList[[1]])
[1] "numeric"
temporaryList[1]
$first
[1] 123 157 202 199 223 140 105 194
mode(temporaryList[1])
[1] "list"
Note the dollar sign and double bracket notation return a numeric vector,
while the single bracket notation returns a list. Notice also the difference in
results below.
temporaryList[c(1,2)]
$first
[1] 123 157 202 199 223 140 105 194
$second
Weight Gender bp.meds
1 123 female FALSE
2 157 female TRUE
3 202 male FALSE
4 199 female FALSE
5 223 male TRUE
6 140 male FALSE
7 105 female TRUE
8 194 male FALSE
temporaryList[[c(1,2)]]
[1] 157
The single bracket form returns the first and second elements of the list, while
the double bracket form returns the second element in the first element of the
list. Generally, do not put a vector of indices or names in a double bracket,
as you will likely get unexpected results. See, for example, the results below.4
temporaryList[[c(1,2,3)]]
weight
gender == "female"
weight[gender == "female"]
The results of subsetting can be assigned to a new (or existing) R object, and
subsetting on the left side of an assignment is a common way to modify an
existing R object.
weight
x <- 1:10
x
[1] 1 2 3 4 5 6 7 8 9 10
[1] 0 0 0 0 5 6 7 8 9 10
y <- -3:9
y
[1] -3 -2 -1 0 1 2 3 4 5 6 7 8 9
[1] NA NA NA 0 1 2 3 4 5 6 7 8 9
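The assignment statements that modified x and y above are not shown; a sketch consistent with the printed results, using subsetting on the left side of an assignment, is:
x <- 1:10
x[x < 5] <- 0    # set every element smaller than 5 to zero
y <- -3:9
y[y < 0] <- NA   # replace the negative values with NA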
rm(x)
rm(y)
healthData
healthData$Weight[healthData$Gender == "male"]
healthData[healthData$Gender == "female", ]
Gender bp.meds
3 male FALSE
4 female FALSE
5 male TRUE
8 male FALSE
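The output just above (rows 3, 4, 5, and 8, with only the Gender and bp.meds columns) comes from a subsetting call whose code was not shown; a sketch consistent with it is:
healthData[healthData$Weight > 190, 2:3]   # rows with weight over 190; columns 2 and 3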
The first example is really just subsetting a vector, since the $ notation creates
vectors. The second two examples return subsets of the whole data frame. Note
that the logical vector subsets the rows of the data frame, choosing those rows
where the gender is female or the weight is more than 190. Note also that the
specification for the columns (after the comma) is left blank in the first case,
telling R to return all the columns. In the second case the second and third
columns are requested explicitly.
Next consider the much larger and more complex WorldBank data frame.
Recall, the str function displays the “structure” of an R object. Here is a look
at the structure of several R objects.
str(mtcars)
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
str(temporaryList)
List of 3
$ first : num [1:8] 123 157 202 199 223 140 105 194
$ second:'data.frame': 8 obs. of 3 variables:
..$ Weight : num [1:8] 123 157 202 199 223 140 105 194
..$ Gender : chr [1:8] "female" "female" "male" "female" ...
..$ bp.meds: logi [1:8] FALSE TRUE FALSE FALSE TRUE FALSE ...
$ pickle:List of 2
..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ b:'data.frame': 8 obs. of 3 variables:
.. ..$ Weight : num [1:8] 123 157 202 199 223 140 105 194
.. ..$ Gender : chr [1:8] "female" "female" "male" "female" ...
.. ..$ bp.meds: logi [1:8] FALSE TRUE FALSE FALSE TRUE FALSE ...
str(WorldBank)
[1] 216 15
The dim function returns the dimensions of a data frame, i.e., the number of rows and the number of columns. From dim we see that there are 216 cases from 1971.
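The code that created the WorldBank1971 data frame is not shown; a sketch consistent with the dimensions above is:
WorldBank1971 <- WorldBank[WorldBank$year == 1971, ]
dim(WorldBank1971)   # [1] 216  15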
Next, how can we create a data frame which only contains data from 1971,
and also only contains cases for which there are no missing values in the
fertility rate variable? R has a built in function is.na which returns TRUE if
the observation is missing and returns FALSE otherwise. And !is.na returns
the negation, i.e., it returns FALSE if the observation is missing and TRUE if
the observation is not missing.
WorldBank1971$fertility.rate[1:25]
!is.na(WorldBank1971$fertility.rate[1:25])
[1] 193 15
From dim we see that there are 193 cases from 1971 with non-missing fertility
rate data.
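A sketch of the additional subsetting step that keeps only the 1971 cases with non-missing fertility rate (the object name is an assumption):
WorldBank1971.fert <- WorldBank1971[!is.na(WorldBank1971$fertility.rate), ]
dim(WorldBank1971.fert)   # [1] 193  15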
Return attention now to the original WorldBank data frame with data not
only from 1971. How can we extract only those cases (rows) which have NO
missing data? Consider the following simple example:
V1 V2 V3
1 1 NA 1
2 2 1 2
3 3 4 3
4 4 5 5
5 NA NA 7
is.na(temporaryDataFrame)
V1 V2 V3
[1,] FALSE TRUE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
[5,] TRUE TRUE FALSE
rowSums(is.na(temporaryDataFrame))
[1] 1 0 0 0 2
First notice that is.na will test each element of a data frame for missingness.
Also recall that if R is asked to sum a logical vector, it will first convert the
logical vector to numeric and then compute the sum, which effectively counts the
number of elements in the logical vector which are TRUE. The rowSums function
computes the sum of each row. So rowSums(is.na(temporaryDataFrame))
returns a vector with as many elements as there are rows in the data frame. If
an element is zero, the corresponding row has no missing values. If an element
is greater than zero, the value is the number of variables which are missing in
that row. This gives a simple method to return all the cases which have no
missing data.
dim(WorldBank)
[1] 11880 15
[1] 564 15
Out of the 11880 rows in the original data frame, only 564 have no missing
observations!
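A sketch of the complete-cases subsetting being described (the object name is an assumption; R's complete.cases() function offers a more direct route to the same result):
cleanWorldBank <- WorldBank[rowSums(is.na(WorldBank)) == 0, ]
dim(cleanWorldBank)   # [1] 564  15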
1:10
[1] 1 2 3 4 5 6 7 8 9 10
-5:3
[1] -5 -4 -3 -2 -1 0 1 2 3
10:4
[1] 10 9 8 7 6 5 4
pi:7
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(from = 1, to = 5, by = 1/3)
rep(c(1,2,4), length = 9)
[1] 1 2 4 1 2 4 1 2 4
rep(c(1,2,4), times = 3)
[1] 1 2 4 1 2 4 1 2 4
[1] "a" "a" "a" "b" "b" "c" "c" "c" "c" "c" "c" "c"
Often when using R you will want to simulate data from a specific probability
distribution (e.g., normal/Gaussian, binomial, Poisson). R has a vast suite
of functions for working with statistical distributions. To generate values
from a statistical distribution, the function has a name beginning with an “r”
followed by some abbreviation of the probability distribution. For example
to simulate from the three distributions mentioned above, we can use the
functions rnorm(), rbinom(), and rpois().
Use the rnorm() function to generate 10,000 values from the standard normal
distribution (the normal distribution with mean = 0 and variance = 1). Consult
the help page for rnorm() if you need to. Save this vector of values to a
vector named sim.vals. Then use the hist() function to draw a histogram
of the simulated data. Does the data look like it follows a normal distribution?
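One possible solution (a sketch):
sim.vals <- rnorm(10000, mean = 0, sd = 1)   # 10,000 draws from the standard normal
hist(sim.vals)                               # the histogram should look roughly bell shaped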
4.9 Exercises
5 Graphics in R Part 1: ggplot2
and syntax, and has many examples on solving practical graphical problems. In
addition to the free on-line version available through MSU, the book’s source
code is available at https://github.com/hadley/ggplot2-book.
Scatter plots are a workhorse of data visualization and provide a good entry
point to the ggplot2 system. Begin by considering a simple and classic data
set sometimes called Fisher’s Iris Data. These data are available in R.
data(iris)
str(iris)
install.packages("ggplot2")
Once this is done the package is installed on the local hard drive, and we can
use the library function to make the package available during the current R
session.
Next a basic scatter plot is drawn. We’ll keep the focus on sepal length and
width, but of course similar plots could be drawn using petal length and width.
The prompt is not displayed below, since the continuation prompt + can cause
confusion.
library(ggplot2)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
[Figure: scatter plot of Sepal.Width versus Sepal.Length]
In this case the first argument to the ggplot function is the name of the data
frame. Second, the aes (short for aesthetics) function specifies the mapping to
the x and y axes. By itself the ggplot function as written doesn’t tell R what
sort of graphical display is desired. That is done by adding a geom (short for
geometry) specification, in this case geom_point.
Looking at the scatter plot and thinking about the focus of finding a method
to classify the species, two thoughts come to mind. First, the plot might be
improved by increasing the size of the points. And second, using different colors
for the points corresponding to the three species would help.
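The call that produced the next plot is not reproduced here; a sketch of the kind of modification described (the specific point size is illustrative) is:
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species), size = 4)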
[Figure: scatter plot of Sepal.Width versus Sepal.Length with larger points colored by Species; a Species legend is included]
Notice that a legend showing what the colors represent is automatically gener-
ated and included in the graphic. Next, the size of the points seems a bit big
now, and it might be helpful to use different shapes for the different species.
[Figure: scatter plot with smaller points, using both color and shape to distinguish the three Species]
Here we see that the legend automatically changes to include species specific
color and shape. The size of the points seems more appropriate.
The examples above start with the function ggplot(), which takes as argu-
ments the data frame containing the data to be plotted as well as a mapping
from the data to the axes, enclosed by the aes() function. Next a geom func-
tion, in the above case geom_point(), is added. It might just specify the
geometry, but also might specify aspects such as size, color, or shape.
Typically many graphics are created and discarded in the search for an infor-
mative graphic, and often the initial specification of data and basic aesthetics
from ggplot() stays the same in all the attempts. In such a case it can be
helpful to assign that portion of the graphic to an R object, both to minimize
the amount of typing and to keep certain aspects of all the graphics constant.
Here’s how that could be done for the graphics above.
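A sketch of that approach (the object name is an assumption):
iris.base <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
iris.base + geom_point()                                                   # plain scatter plot
iris.base + geom_point(aes(color = Species, shape = Species), size = 2)   # colored and shaped by species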
[Figures: the two scatter plots recreated from the saved ggplot object, one plain and one with color and shape mapped to Species]
To add a fitted least squares line to a scatter plot, use stat_smooth, which
adds a smoother (possibly a least squares line, possibly a smooth curve fit
to the data, etc.). The argument method = lm specifies a line fitted by least
squares, and the argument se = FALSE suppresses the default display of a
confidence band around the line or curve which was fit to the data.
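A sketch of adding a single fitted line for all the data (the book's exact call may differ):
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species)) +
  stat_smooth(method = "lm", se = FALSE)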
[Figures: scatter plots of Sepal.Width versus Sepal.Length with a single least squares line fit to all the data]
For the iris data, it probably makes more sense to fit separate lines by species.
This can be specified using the aes() function inside stat_smooth().
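A sketch of the per-species fits (the book's exact call may differ):
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species)) +
  stat_smooth(aes(color = Species), method = "lm", se = FALSE)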
[Figure: scatter plot with a separate least squares line fit for each species]
In this case we specified the same color aesthetic for the points and the lines.
If we know we want this color aesthetic (colors corresponding to species) for
all aspects of the graphic, we can specify it in the main ggplot() function:
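A sketch of such a call (the book's exact code is not shown here):
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)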
[Figures: scatter plots with color mapped to Species in the main ggplot() call, showing points and per-species least squares lines]
5.2 Labels, Axes, Text, etc.
[Figure: plot of motor_vehicle_theft]
5.2.1 Labels
By default axis and legend labels are the names of the relevant columns in the
data frame. While convenient, we often want to customize these labels. Here
we use labs() to change the x and y axis labels and other descriptive text.2
2 Axis and legend labels can also be set in the individual scales, see the subsequent
sections.
3 ggplot2 makes the axes extend slightly beyond the given range, since typically this is desirable.
Next we make point size proportional to population, change the color, and
add a state label. Note, in the ggplot() call I scaled population by 100,000
to help with the interpretability of the legend. Accordingly, I also changed
the “population” label on the legend to “Population\n(100,000)” using the
labs() function.4 We use the geom_label() function to add the label, which
provides an outline around the label text and allows you to control the box
characteristics, e.g., I make the boxes slightly transparent using the alpha
argument.5
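The plotting call itself is not reproduced here; a sketch of the technique being described, with hypothetical data frame and column names (crime, mv.theft, population, state), is:
ggplot(data = crime, aes(x = population/1e5, y = mv.theft, size = population/1e5)) +
  geom_point(color = "blue") +
  geom_label(aes(label = state), alpha = 0.5) +   # boxed, slightly transparent labels
  labs(y = "Motor vehicle theft per 100,000 population",
       size = "Population\n(100,000)")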
[Figure: scatter plot with point size proportional to population (legend: Population (100,000)), y-axis labeled "Motor vehicle theft per 100,000 population", and state name labels added with geom_label(); the labels overlap heavily]
The labels are helpful but just too cluttered. There are some additional
arguments that can go into geom_label() that allow for label offset; however,
this won’t help us much here. Instead, we can try the ggrepel package by
Kamil Slowikowski. This useful package will automatically adjust labels so that
they don’t overlap. First we need to download and install the package, using either
RStudio’s install package button or install.packages("ggrepel"). Next, to make
all of ggrepel’s functions available, we can call library(ggrepel); or, if we know
which function we want, we can access just that function with the :: operator.
I use :: below to make clear which function is coming from ggrepel and which
is coming from ggplot2.
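A sketch of the idea, using geom_label_repel() in place of geom_label(); the data frame name crime.data and its columns (state, population, motor_vehicle_theft) are hypothetical stand-ins for the data used above:

ggplot(data = crime.data,
       aes(x = population/100000, y = motor_vehicle_theft)) +
  geom_point(aes(size = population/100000), color = "blue") +
  ggrepel::geom_label_repel(aes(label = state), alpha = 0.7) +
  labs(size = "Population\n(100,000)",
       x = "Population (100,000)",
       y = "Motor vehicle theft per 100,000 population")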
[Figure: the same plot of motor vehicle theft per 100,000 population, with the state labels repositioned by ggrepel so they no longer overlap.]
This looks a bit better. We’ll resist making additional improvements to the
figure for now.
# of Drinks Weight
1.0 150
3.0 350
2.0 200
0.5 140
2.0 200
1.0 160
0.0 175
0.0 140
0.0 167
1.0 200
4.0 300
5.0 321
2.0 250
0.5 187
1.0 190
[Figure: scatter plot of Weight versus Number of Drinks.]
5.4.1 Histograms
Time
1 28
2 26
3 33
4 24
5 34
6 -44
[Figure: histogram of Time using the default binwidth.]
ggplot2 uses an algorithm to calculate bin widths for the histogram. Sometimes
the algorithm makes choices that aren’t suitable (hence the R message above),
and these can be changed by specifying a binwidth. In addition, the appearance
of the bars can also be changed.
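A sketch, assuming the times shown above live in a data frame called timings (a hypothetical name); the binwidth and colors are illustrative choices:

ggplot(data = timings, aes(x = Time)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "black")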
[Figure: histogram of Time with a specified binwidth and modified bar appearance.]
5.4.2 Boxplots
Next we consider some data from the gapminder data set to construct some
box plots. These data are available in the gapminder package, which might
need to be installed via install.packages("gapminder").
library(gapminder)
ggplot(data = subset(gapminder, year == 2002),
aes(x = continent, y = gdpPercap)) +
geom_boxplot(color = "black", fill = "lightblue")
[Figure: boxplots of gdpPercap by continent for 2002.]
Here’s the same set of boxplots, but with different colors, different axis labels,
and the boxes plotted horizontally rather than vertically.
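A sketch of that variant (the colors and labels are illustrative choices); coord_flip() turns the boxes on their side:

ggplot(data = subset(gapminder, year == 2002),
       aes(x = continent, y = gdpPercap)) +
  geom_boxplot(color = "darkblue", fill = "lightgreen") +
  labs(x = "Continent", y = "GDP per capita") +
  coord_flip()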
[Figure: the same boxplots drawn horizontally, with Continent on the vertical axis and different colors.]
As part of a study, elementary school students were asked which was more
important to them: good grades, popularity, or athletic ability. Here is a brief
look at the data.
First, a simple bar graph of the most important goal chosen is drawn, followed
by a stacked bar graph which also includes the student’s gender. We then add
a side by side bar graph that includes the student’s gender.
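A sketch of the three graphs, assuming the data are in a data frame called goals (a hypothetical name) with columns Goal and Gender:

ggplot(data = goals, aes(x = Goal)) + geom_bar()
ggplot(data = goals, aes(x = Goal, fill = Gender)) + geom_bar()
ggplot(data = goals, aes(x = Goal, fill = Gender)) +
  geom_bar(position = "dodge")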
[Figure: bar graph of the counts for each goal, a stacked bar graph by Gender, and a side-by-side bar graph by Gender.]
In this example R counted the number of students who had each goal and
used these counts as the height of the bars. Sometimes the data contain the
bar heights as a variable. For example, we create a bar graph of India’s per
capita GDP with separate bars for each year in the data6 .
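A sketch of such a graph: since gdpPercap is already in the data, geom_col() (equivalently, geom_bar(stat = "identity")) uses it directly as the bar heights; the fill color is an arbitrary choice.

ggplot(data = subset(gapminder, country == "India"),
       aes(x = year, y = gdpPercap)) +
  geom_col(fill = "steelblue")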
[Figure: bar graph of India's per capita GDP with a separate bar for each year.]
6R offers a large color palette, run colors() on the console to see a list of color names.
[Figure: plot of sin(x) for x between roughly -3 and 3, drawn with the default theme.]
5.5 Themes
The theme defines non-data aspects of the plot’s characteristics such as
background color, axes, and grid lines. Default themes include: theme_bw(),
theme_classic(), theme_dark(), theme_gray(), theme_light(),
theme_linedraw(), theme_minimal(), and theme_void(). Changing
the theme is as easy as adding it to your initial ggplot() call. Here I replace
the default implicit theme_bw() theme with the classic theme.
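A sketch of the sin(x) plot drawn with the classic theme (the exact data setup for the curve is an assumption):

curve_data <- data.frame(x = seq(-pi, pi, length.out = 200))
ggplot(data = curve_data, aes(x = x, y = sin(x))) +
  geom_line() +
  theme_classic()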
[Figure: the same sin(x) plot drawn with theme_classic().]
of pixels. Vector images are often preferred for publication quality graphics because they
can be edited, scale well, and provide crisper detail.
5.8 Exercises
Exercise 6a Learning objectives: practice using ggplot2 functions; summarize
variables using graphics; introduce ggplot2 facets.
6
Working with Data
Bringing data into R, exporting data from R in a form that is readable by other
software, cleaning and reshaping data, and other data manipulation tasks are an
important and often overlooked component of data science. The book Spector
(2008), while a few years old, is still an excellent reference for data-related
issues, and is available for free at http://catalog.lib.msu.edu/record=b
7254984~S39a. And the R Data Import/Export manual, available online at
https://cran.r-project.org/doc/manuals/R-data.pdf, is an up-to-date
(and free) reference on importing a wide variety of datasets into R and on
exporting data in various forms.
data from R in plain text format. RStudio provides a handy data import cheat
sheet8 for many of the read functions detailed in this section.
The foreign R package provides functions to directly read data saved in some
of the proprietary formats into R, which is sometimes unavoidable, but if
possible it is good to save data from another package as plain text and then
read this plain text file into R. In Chapter 11 methods for reading web-based
data sets into R will be discussed.
The function read.table() and its offshoots such as read.csv() are
used to read in rectangular data from a text file. For example, the file
BrainAndBody.csv contains data9 on the brain weight, body weight, and
name of some terrestrial animals. Here are the first few lines of that file:
body,brain,name
1.35,8.1,Mountain beaver
465,423,Cow
36.33,119.5,Grey wolf
27.66,115,Goat
1.04,5.5,Guinea pig
As is evident, the first line of the file contains the names of the three variables,
separated (delimited) by commas. Each subsequent line contains the body
weight, brain weight, and name of a specific terrestrial animal.
This file is accessible at the url https://www.finley-lab.com/files/data
/BrainAndBody.csv. The read.table() function is used to read these data
into an R data frame.
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 5 elements
R assumed the first line of the file contained the variable names, since header
= TRUE was specified, and counted four including This, file, contains, and
data. So in the first line of actual data, R expected four columns containing
data plus possibly a fifth column containing row names for the data set,
and complained that “line 1 did not have 5 elements.” The error message
is somewhat mysterious, since it starts with “Error in scan.” This happens
because read.table() actually uses a more basic R function called scan()
to do the work.
Here’s how to read in the file correctly.
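A minimal sketch, assuming the file is laid out exactly as printed above (variable names on the first line, fields separated by commas); if the file began with extra descriptive lines, a skip argument would also be needed:

u.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.csv"
BrainAndBody <- read.table(u.bb, header = TRUE, sep = ",")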
[1] "EST"
[2] "Max.TemperatureF"
[3] "Mean.TemperatureF"
[4] "Min.TemperatureF"
[5] "Max.Dew.PointF"
[6] "MeanDew.PointF"
[7] "Min.DewpointF"
[8] "Max.Humidity"
[9] "Mean.Humidity"
[10] "Min.Humidity"
[11] "Max.Sea.Level.PressureIn"
[12] "Mean.Sea.Level.PressureIn"
[13] "Min.Sea.Level.PressureIn"
[14] "Max.VisibilityMiles"
[15] "Mean.VisibilityMiles"
[16] "Min.VisibilityMiles"
[17] "Max.Wind.SpeedMPH"
[18] "Mean.Wind.SpeedMPH"
[19] "Max.Gust.SpeedMPH"
[20] "PrecipitationIn"
[21] "CloudCover"
[22] "Events"
[23] "WindDirDegrees"
How can we compute the mean for each variable? One possibility is to do this
a variable at a time:
mean(WeatherKLAN2014Full$Mean.TemperatureF)
[1] 45.78
mean(WeatherKLAN2014Full$Min.TemperatureF)
[1] 36.25
mean(WeatherKLAN2014Full$Max.TemperatureF)
[1] 54.84
##Et Cetera
str(WeatherKLAN2014Full)
WeatherKLAN2014Full$PrecipitationIn[1:50]
Max.TemperatureF Mean.TemperatureF
54.838 45.781
Min.TemperatureF Max.Dew.PointF
36.255 41.800
MeanDew.PointF Min.DewpointF
36.395 30.156
Max.Humidity Mean.Humidity
88.082 70.392
Min.Humidity Max.Sea.Level.PressureIn
52.200 30.130
Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn
30.015 29.904
Max.VisibilityMiles Mean.VisibilityMiles
9.896 8.249
Min.VisibilityMiles Max.Wind.SpeedMPH
4.825 19.101
Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
8.679 NA
CloudCover WindDirDegrees
4.367 205.000
Max.TemperatureF Mean.TemperatureF
22.2130 20.9729
Min.TemperatureF Max.Dew.PointF
20.2597 19.5167
MeanDew.PointF Min.DewpointF
20.0311 20.8511
Max.Humidity Mean.Humidity
8.1910 9.3660
Min.Humidity Max.Sea.Level.PressureIn
13.9462 0.2032
Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn
0.2159 0.2360
Max.VisibilityMiles Mean.VisibilityMiles
0.5790 2.1059
Min.VisibilityMiles Max.Wind.SpeedMPH
3.8168 6.4831
Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
3.8863 NA
CloudCover WindDirDegrees
2.7798 90.0673
As with any R function, the arguments don’t need to be named as long as they
are specified in the correct order.
Consider calculating the mean of the maximum temperature values for those
days where the cloud cover is less than 4 and when the maximum humidity is
over 85. We can do this using subsetting.
mean(WeatherKLAN2014Full$Max.TemperatureF[
WeatherKLAN2014Full$CloudCover < 4 &
WeatherKLAN2014Full$Max.Humidity > 85])
[1] 69.39
While this works, it requires a lot of typing, since each time we refer to a
variable in the data set we need to preface its name by WeatherKLAN2014Full$.
The with() function tells R that we are working with a particular data frame,
and we don’t need to keep typing the name of the data frame.
with(WeatherKLAN2014Full,
mean(Max.TemperatureF[CloudCover < 4 & Max.Humidity > 85]))
[1] 69.39
library(gapminder)
str(gapminder)
The data frame contains per capita GDP and population, and it might be
interesting to create a variable that gives the total GDP by multiplying these
two variables. (If we were interested in an accurate value for the total GDP we
would probably be better off getting this information directly, since it is likely
that the per capita GDP values in the data frame are rounded substantially.)
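Creating the new variable is a one-line operation; a sketch (the variable name totalGDP is an assumption):

gapminder$totalGDP <- gapminder$pop * gapminder$gdpPercap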
Note, below I first remove the altered gapminder data frame using rm() and
then bring a clean copy back in by reloading the gapminder package.
rm(gapminder)
library(gapminder)
str(gapminder)
After reflection we may realize the new variables we added to the gapminder
data frame are not useful, and should be removed.
str(gapminder)
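The discussion of removing variables below uses a small example data frame a, presumably created along these lines:

a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"), z = c(1, 1.5, 2))
a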
x y z
1 1 dog 1.0
2 2 cat 1.5
3 3 pig 2.0
a <- a[1]
a
x
1 1
2 2
3 3
x y z
1 1 dog 1.0
2 2 cat 1.5
3 3 pig 2.0
a <- a[,1]
a
[1] 1 2 3
One can also use a negative sign in front of the variable number(s). For example,
a[-(2:3)] would drop the last two columns of a. Some care is needed when
removing variables using the negative sign.
An alternative approach is to set the variables you’d like to remove to NULL. For
example, a[c("y","z")] <- NULL and a[,2:3] <- NULL produce the same
result as above.
What happens if you write a[-2:3] instead of a[-(2:3)]? Why are the
parentheses important here?
Consider the gapminder data again. Possibly we don’t want to add a new
variable that gives life expectancy in months, but rather want to modify the
existing variable to measure life expectancy in months. Here are two ways to
accomplish this.
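Two possible approaches are sketched below; they are alternatives (not meant to be run one after the other), and the book's exact code may differ.

## Alternative 1: direct assignment with $
gapminder$lifeExp <- gapminder$lifeExp * 12

## Alternative 2: transform() builds the modified data frame in one call
gapminder <- transform(gapminder, lifeExp = lifeExp * 12)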
rm(gapminder)
library(gapminder)
gapminder$lifeExp[1:5]
rm(gapminder)
library(gapminder)
gapminder$lifeExp[1:5]
[1] "EST"
[2] "Max.TemperatureF"
[3] "Mean.TemperatureF"
[4] "Min.TemperatureF"
[5] "Max.Dew.PointF"
[6] "MeanDew.PointF"
[7] "Min.DewpointF"
[8] "Max.Humidity"
[9] "Mean.Humidity"
[10] "Min.Humidity"
[11] "Max.Sea.Level.PressureIn"
[12] "Mean.Sea.Level.PressureIn"
[13] "Min.Sea.Level.PressureIn"
[14] "Max.VisibilityMiles"
[15] "Mean.VisibilityMiles"
[16] "Min.VisibilityMiles"
[17] "Max.Wind.SpeedMPH"
[18] "Mean.Wind.SpeedMPH"
[19] "Max.Gust.SpeedMPH"
[20] "PrecipitationIn"
[21] "CloudCover"
[22] "Events"
[23] "WindDirDegrees"
If we want the wind speed variables to come right after the date, we can again
use subsetting.
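Based on the variable listing above, the wind speed variables are columns 17 through 19, so one subsetting call that produces the ordering shown below is:

WeatherKLAN2014Full <- WeatherKLAN2014Full[, c(1, 17:19, 2:16, 20:23)]
names(WeatherKLAN2014Full)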
[1] "EST"
[2] "Max.Wind.SpeedMPH"
[3] "Mean.Wind.SpeedMPH"
[4] "Max.Gust.SpeedMPH"
[5] "Max.TemperatureF"
[6] "Mean.TemperatureF"
[7] "Min.TemperatureF"
[8] "Max.Dew.PointF"
[9] "MeanDew.PointF"
[10] "Min.DewpointF"
[11] "Max.Humidity"
[12] "Mean.Humidity"
[13] "Min.Humidity"
[14] "Max.Sea.Level.PressureIn"
[15] "Mean.Sea.Level.PressureIn"
[16] "Min.Sea.Level.PressureIn"
[17] "Max.VisibilityMiles"
[18] "Mean.VisibilityMiles"
[19] "Min.VisibilityMiles"
[20] "PrecipitationIn"
[21] "CloudCover"
[22] "Events"
[23] "WindDirDegrees"
yearlyIncomeWide
yearlyIncomeLong
For such reshaping tasks, there are several ways to accomplish this in R. We will
focus on a library called tidyr, written by Hadley Wickham, that performs these
transformations and more.
6.6.1 tidyr
The R library tidyr has functions for converting data between formats. To
illustrate its use, we examine a simple data set that explores the relationship
between religion and income in the United States. The data come from a Pew
survey, and are used in the tidyr documentation to illustrate transforming
data from wide to long format.
The pivot_longer() function can transform data from wide to long format.
library(tidyr)
religionLong <- pivot_longer(data = religion, cols = 2:11,
names_to = 'IncomeLevel', values_to = 'Frequency')
head(religionLong)
# A tibble: 6 x 3
religion IncomeLevel Frequency
<chr> <chr> <int>
1 Agnostic under10k 27
2 Agnostic btw10and20k 34
3 Agnostic btw20and30k 60
4 Agnostic btw30and40k 81
5 Agnostic btw40and50k 76
6 Agnostic btw50and75k 137
tail(religionLong)
# A tibble: 6 x 3
religion IncomeLevel Frequency
<chr> <chr> <int>
1 Unaffiliated btw40and50k 341
2 Unaffiliated btw50and75k 528
3 Unaffiliated btw75and100k 407
4 Unaffiliated btw100and150k 321
5 Unaffiliated over150k 258
6 Unaffiliated DoNotKnowOrRefused 597
To use pivot_longer() we specified the data frame (data = religion),
the columns we want to pivot into longer format (cols = 2:11), the name
we want to give the column created from the income levels (names_to =
'IncomeLevel'), and the name we want to give to the column containing the
frequency values (values_to = 'Frequency').
Columns to be pivoted into longer format can be specified by name also, and
we can also specify which columns should be omitted using a negative sign in
front of the name(s). So the following creates an equivalent data frame:
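A sketch of such an equivalent call, leaving the religion column out of the pivot with a minus sign (the object name religionLong2 is an assumption):

religionLong2 <- pivot_longer(data = religion, cols = -religion,
                              names_to = 'IncomeLevel',
                              values_to = 'Frequency')
head(religionLong2)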
# A tibble: 6 x 3
# A tibble: 6 x 11
religion under10k btw10and20k btw20and30k btw30and40k
<chr> <int> <int> <int> <int>
1 Agnostic 27 34 60 81
2 Atheist 12 27 37 52
3 Buddhist 27 21 30 34
4 Catholic 418 617 732 670
5 DoNotKn~ 15 14 15 11
6 Evangel~ 575 869 1064 982
# ... with 6 more variables: btw40and50k <int>,
# btw50and75k <int>, btw75and100k <int>,
# btw100and150k <int>, over150k <int>,
# DoNotKnowOrRefused <int>
Here we specify the data frame (data = religionLong), the column
(names_from = IncomeLevel) to get the names of the output columns, and
the column of values (values_from = Frequency) to get the cell values from.
As can be seen, this particular call to pivot_wider() yields the original data
frame.
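A sketch of that pivot_wider() call (the object name religionWide is an assumption):

religionWide <- pivot_wider(data = religionLong, names_from = IncomeLevel,
                            values_from = Frequency)
head(religionWide)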
tidyr provides two other useful functions to separate and unite variables based
on some delimiter. Consider again the yearlyIncomeWide table. Say we
want to split the name variable into first and last name. This can be done using
the separate() function.
The separate() function from the tidyr library is especially useful when
working with real data as multiple pieces of information can be combined into
one column in a data set. For example, consider the following data frame
data.birds
birds recordingInfo
1 10 LA-2017-01-01
2 38 DF-2011-03-02
3 29 OG-2078-05-11
4 88 YA-2000-11-18
5 42 LA-2019-03-17
6 177 OG-2016-10-10
7 200 YA-2001-03-22
The recordingInfo column contains the site where data were collected, as well as
the year, month, and day the data were recorded. The data are coded as follows:
site-year-month-day. Write a line of code that will extract the desired data
from the data.birds data frame into separate columns named site, year,
month, and day.
library(dplyr)
head(mtcars)
mtcars[,1]
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
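The tibble printed below was presumably created along these lines:

mtcarsTbl <- as_tibble(mtcars)
head(mtcarsTbl)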
# A tibble: 6 x 11
mpg cyl disp hp drat wt qsec vs am
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1
2 21 6 160 110 3.9 2.88 17.0 0 1
3 22.8 4 108 93 3.85 2.32 18.6 1 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0
5 18.7 8 360 175 3.15 3.44 17.0 0 0
6 18.1 6 225 105 2.76 3.46 20.2 1 0
# ... with 2 more variables: gear <dbl>, carb <dbl>
mtcarsTbl[,1]
# A tibble: 32 x 1
mpg
<dbl>
1 21
2 21
3 22.8
4 21.4
5 18.7
6 18.1
7 14.3
8 24.4
9 22.8
10 19.2
# ... with 22 more rows
As seen above, note that once the data frame is reduced to one dimension by
subsetting to one column, it is no longer a data frame and has been simplified to
a vector. This might not seem like a big deal; however, it can be very frustrating
and potentially break your code when you expect an object to behave like a
data frame and it doesn’t because it’s now a vector. Alternatively, once we
convert religionWide to a tibble via the as_tibble() function the object
remains a data frame even when subsetting down to one dimension (there is
no automatic simplification). Converting data frames using as_tibble() is
not required for using dplyr but is convenient. Also, it is important to note
that tibble is simply a wrapper around a data frame that provides some
additional behaviors. The newly formed tibble object will still behave like
a data frame (because it technically still is a data frame) but will have some
added niceties (some of which are illustrated below).
Recall the gapminder data. These data are available in tab-separated format
in gapminder.tsv, and can be read in using read.delim() (or the related
read functions described previously). The read.delim() function defaults to
header = TRUE so this doesn’t need to be specified explicitly. In this section
we will be working with the gapminder data often, so we will use a short name
for the data frame to save typing.
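A sketch of reading the file and storing it under the short name gm; the URL is an assumption based on where the book's other data files live, and the result is converted to a tibble so it prints compactly:

u.gm <- "https://www.finley-lab.com/files/data/gapminder.tsv"
gm <- as_tibble(read.delim(u.gm))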
head(gm)
# A tibble: 6 x 6
country year pop continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 Afghanistan 1952 8.43e6 Asia 28.8 779.
2 Afghanistan 1957 9.24e6 Asia 30.3 821.
3 Afghanistan 1962 1.03e7 Asia 32.0 853.
4 Afghanistan 1967 1.15e7 Asia 34.0 836.
5 Afghanistan 1972 1.31e7 Asia 36.1 740.
6 Afghanistan 1977 1.49e7 Asia 38.4 786.
Filtering helps us to examine subsets of the data such as data from a particular
country, from several specified countries, from certain years, from countries
with certain populations, etc. Some examples:
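For instance, a hypothetical filter() call consistent with the first set of output below, keeping a single country:

filter(gm, country == "Brazil")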
# A tibble: 12 x 6
country year pop continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 Brazil 1952 56602560 Americas 50.9 2109.
# A tibble: 24 x 6
country year pop continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 Brazil 1952 56602560 Americas 50.9 2109.
2 Brazil 1957 65551171 Americas 53.3 2487.
3 Brazil 1962 76039390 Americas 55.7 3337.
4 Brazil 1967 88049823 Americas 57.6 3430.
5 Brazil 1972 100840058 Americas 59.5 4986.
6 Brazil 1977 114313951 Americas 61.5 6660.
7 Brazil 1982 128962939 Americas 63.3 7031.
8 Brazil 1987 142938076 Americas 65.2 7807.
9 Brazil 1992 155975974 Americas 67.1 6950.
10 Brazil 1997 168546719 Americas 69.4 7958.
# ... with 14 more rows
filter(gm, country %in% c("Brazil", "Mexico") & year %in% c(1952, 1972))
# A tibble: 4 x 6
country year pop continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 Brazil 1952 56602560 Americas 50.9 2109.
2 Brazil 1972 100840058 Americas 59.5 4986.
3 Mexico 1952 30144317 Americas 50.8 3478.
4 Mexico 1972 55984294 Americas 62.4 6809.
# A tibble: 25 x 6
country year pop continent lifeExp gdpPercap
# A tibble: 3 x 6
country year pop continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 China 2007 1.32e9 Asia 73.0 4959.
2 India 2007 1.11e9 Asia 64.7 2452.
3 United St~ 2007 3.01e8 Americas 78.2 42952.
Notice the full results are not printed. For example, when we asked for the
data for Brazil and Mexico, only the first ten rows were printed. This is an
effect of using the as_tibble() function. Of course if we wanted to analyze
the results (as we will below) the full set of data would be available.
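The select() function keeps only the named columns; a sketch of a call matching the output below:

select(gm, country, year, lifeExp)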
# A tibble: 1,704 x 3
country year lifeExp
<chr> <int> <dbl>
1 Afghanistan 1952 28.8
2 Afghanistan 1957 30.3
3 Afghanistan 1962 32.0
4 Afghanistan 1967 34.0
5 Afghanistan 1972 36.1
select(gm, 2:4)
# A tibble: 1,704 x 3
year pop continent
<int> <dbl> <chr>
1 1952 8425333 Asia
2 1957 9240934 Asia
3 1962 10267083 Asia
4 1967 11537966 Asia
5 1972 13079460 Asia
6 1977 14880372 Asia
7 1982 12881816 Asia
8 1987 13867957 Asia
9 1992 16317921 Asia
10 1997 22227415 Asia
# ... with 1,694 more rows
select(gm, -c(2,3,4))
# A tibble: 1,704 x 3
country lifeExp gdpPercap
<chr> <dbl> <dbl>
1 Afghanistan 28.8 779.
2 Afghanistan 30.3 821.
3 Afghanistan 32.0 853.
4 Afghanistan 34.0 836.
5 Afghanistan 36.1 740.
6 Afghanistan 38.4 786.
7 Afghanistan 39.9 978.
8 Afghanistan 40.8 852.
9 Afghanistan 41.7 649.
10 Afghanistan 41.8 635.
# ... with 1,694 more rows
select(gm, starts_with("c"))
# A tibble: 1,704 x 2
country continent
<chr> <chr>
1 Afghanistan Asia
2 Afghanistan Asia
3 Afghanistan Asia
4 Afghanistan Asia
5 Afghanistan Asia
6 Afghanistan Asia
7 Afghanistan Asia
8 Afghanistan Asia
9 Afghanistan Asia
10 Afghanistan Asia
# ... with 1,694 more rows
Notice a few things. Variables can be selected by name or column number. As
usual, a negative sign tells R to leave something out. And there are special
functions such as starts_with that provide ways to match part of a variable’s
name.
Use the contains() function to select only the columns that contain a c in
the gapminder data set.
6.7.5 Pipes
Consider selecting the country, year, and population for countries in Asia
or Europe. One possibility is to nest a filter() function inside a select()
function.
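A sketch of that nested version, with filter() inside select():

select(filter(gm, continent %in% c("Asia", "Europe")), country, year, pop)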
# A tibble: 756 x 3
country year pop
<chr> <int> <dbl>
1 Afghanistan 1952 8425333
2 Afghanistan 1957 9240934
3 Afghanistan 1962 10267083
4 Afghanistan 1967 11537966
5 Afghanistan 1972 13079460
6 Afghanistan 1977 14880372
7 Afghanistan 1982 12881816
8 Afghanistan 1987 13867957
gm %>%
filter(continent %in% c("Asia", "Europe")) %>%
select(country, year, pop)
# A tibble: 756 x 3
country year pop
<chr> <int> <dbl>
1 Afghanistan 1952 8425333
2 Afghanistan 1957 9240934
3 Afghanistan 1962 10267083
4 Afghanistan 1967 11537966
5 Afghanistan 1972 13079460
6 Afghanistan 1977 14880372
7 Afghanistan 1982 12881816
8 Afghanistan 1987 13867957
9 Afghanistan 1992 16317921
10 Afghanistan 1997 22227415
# ... with 746 more rows
It can help to think of %>% as representing the word “then”. The above can
be read as, “Start with the data frame gm then filter it to select data from
the continents Asia and Europe then select the variables country, year, and
population from these data”.
The pipe operator %>% is not restricted to functions in dplyr. In fact the
pipe operator itself was introduced in another package called magrittr, but is
included in dplyr as a convenience.
14 Notice the indentation used in the code. This is not necessary, as the code could be all
on one line, but I often find it easier to read in this more organized format.
By default the gapminder data are arranged by country and then by year.
head(gm, 15)
# A tibble: 15 x 6
country year pop continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 Afghanist~ 1952 8.43e6 Asia 28.8 779.
2 Afghanist~ 1957 9.24e6 Asia 30.3 821.
3 Afghanist~ 1962 1.03e7 Asia 32.0 853.
4 Afghanist~ 1967 1.15e7 Asia 34.0 836.
5 Afghanist~ 1972 1.31e7 Asia 36.1 740.
6 Afghanist~ 1977 1.49e7 Asia 38.4 786.
7 Afghanist~ 1982 1.29e7 Asia 39.9 978.
8 Afghanist~ 1987 1.39e7 Asia 40.8 852.
9 Afghanist~ 1992 1.63e7 Asia 41.7 649.
10 Afghanist~ 1997 2.22e7 Asia 41.8 635.
11 Afghanist~ 2002 2.53e7 Asia 42.1 727.
12 Afghanist~ 2007 3.19e7 Asia 43.8 975.
13 Albania 1952 1.28e6 Europe 55.2 1601.
14 Albania 1957 1.48e6 Europe 59.3 1942.
15 Albania 1962 1.73e6 Europe 64.8 2313.
Possibly arranging the data by year and then country would be desired. The
arrange() function makes this easy. We will again use pipes.
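A sketch of the piped arrange() call that produces the ordering shown below:

gm %>%
  arrange(year, country)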
# A tibble: 1,704 x 6
country year pop continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 Afghanist~ 1952 8.43e6 Asia 28.8 779.
2 Albania 1952 1.28e6 Europe 55.2 1601.
3 Algeria 1952 9.28e6 Africa 43.1 2449.
4 Angola 1952 4.23e6 Africa 30.0 3521.
5 Argentina 1952 1.79e7 Americas 62.5 5911.
6 Australia 1952 8.69e6 Oceania 69.1 10040.
7 Austria 1952 6.93e6 Europe 66.8 6137.
8 Bahrain 1952 1.20e5 Asia 50.9 9867.
9 Bangladesh 1952 4.69e7 Asia 37.5 684.
10 Belgium 1952 8.73e6 Europe 68 8343.
# ... with 1,694 more rows
How about the data for Rwanda, arranged in order of life expectancy.
gm %>%
filter(country == "Rwanda") %>%
arrange(lifeExp)
# A tibble: 12 x 6
country year pop continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 Rwanda 1992 7290203 Africa 23.6 737.
2 Rwanda 1997 7212583 Africa 36.1 590.
3 Rwanda 1952 2534927 Africa 40 493.
4 Rwanda 1957 2822082 Africa 41.5 540.
5 Rwanda 1962 3051242 Africa 43 597.
6 Rwanda 2002 7852401 Africa 43.4 786.
7 Rwanda 1987 6349365 Africa 44.0 848.
8 Rwanda 1967 3451079 Africa 44.1 511.
9 Rwanda 1972 3992121 Africa 44.6 591.
10 Rwanda 1977 4657072 Africa 45 670.
11 Rwanda 1982 5507565 Africa 46.2 882.
12 Rwanda 2007 8860588 Africa 46.2 863.
Possibly we want these data to be in decreasing (descending) order. Here,
desc() is one of many dplyr helper functions.
gm %>%
filter(country == "Rwanda") %>%
arrange(desc(lifeExp))
# A tibble: 12 x 6
country year pop continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 Rwanda 2007 8860588 Africa 46.2 863.
2 Rwanda 1982 5507565 Africa 46.2 882.
3 Rwanda 1977 4657072 Africa 45 670.
4 Rwanda 1972 3992121 Africa 44.6 591.
5 Rwanda 1967 3451079 Africa 44.1 511.
6 Rwanda 1987 6349365 Africa 44.0 848.
7 Rwanda 2002 7852401 Africa 43.4 786.
8 Rwanda 1962 3051242 Africa 43 597.
9 Rwanda 1957 2822082 Africa 41.5 540.
10 Rwanda 1952 2534927 Africa 40 493.
11 Rwanda 1997 7212583 Africa 36.1 590.
12 Rwanda 1992 7290203 Africa 23.6 737.
Possibly we want to include only the year and life expectancy, to make the
message more stark.
gm %>%
filter(country == "Rwanda") %>%
select(year, lifeExp) %>%
arrange(desc(lifeExp))
# A tibble: 12 x 2
year lifeExp
<int> <dbl>
1 2007 46.2
2 1982 46.2
3 1977 45
4 1972 44.6
5 1967 44.1
6 1987 44.0
7 2002 43.4
8 1962 43
9 1957 41.5
10 1952 40
11 1997 36.1
12 1992 23.6
For analyzing data in R, the order shouldn’t matter. But for presentation to
human eyes, the order is important.
It is worth your while to get comfortable with using pipes. Here is some
hard-to-read code. Convert it into easier-to-read code by using pipes.
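A plausible example of such hard-to-read nested code, consistent with the output below (Afghanistan's life expectancy by year, in descending order):

arrange(select(filter(gm, country == "Afghanistan"), year, lifeExp),
        desc(lifeExp))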
# A tibble: 12 x 2
year lifeExp
<int> <dbl>
1 2007 43.8
2 2002 42.1
3 1997 41.8
4 1992 41.7
5 1987 40.8
6 1982 39.9
7 1977 38.4
8 1972 36.1
9 1967 34.0
10 1962 32.0
11 1957 30.3
12 1952 28.8
The dplyr package has a rename function that makes renaming variables in a
data frame quite easy.
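A sketch of renaming pop to population; the renamed data frame is used again below, so the result is assigned back to gm:

gm <- gm %>%
  rename(population = pop)
head(gm)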
# A tibble: 6 x 6
country year population continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 Afghani~ 1952 8425333 Asia 28.8 779.
2 Afghani~ 1957 9240934 Asia 30.3 821.
3 Afghani~ 1962 10267083 Asia 32.0 853.
4 Afghani~ 1967 11537966 Asia 34.0 836.
5 Afghani~ 1972 13079460 Asia 36.1 740.
6 Afghani~ 1977 14880372 Asia 38.4 786.
# A tibble: 1 x 2
meanpop medpop
<dbl> <dbl>
1 29601212. 7023596.
##or
gm %>%
summarize(meanpop = mean(population), medpop = median(population))
# A tibble: 1 x 2
meanpop medpop
<dbl> <dbl>
1 29601212. 7023596.
Often we want summaries for specific components of the data. For example,
we might want the median life expectancy for each continent separately. One
option is subsetting:
median(gm$lifeExp[gm$continent == "Africa"])
[1] 47.79
median(gm$lifeExp[gm$continent == "Asia"])
[1] 61.79
median(gm$lifeExp[gm$continent == "Europe"])
[1] 72.24
median(gm$lifeExp[gm$continent == "Americas"])
[1] 67.05
median(gm$lifeExp[gm$continent == "Oceania"])
[1] 73.66
The group_by() function makes this easier, and makes the output more useful.
gm %>%
group_by(continent) %>%
summarize(medLifeExp = median(lifeExp))
# A tibble: 5 x 2
  continent medLifeExp
  <chr>          <dbl>
1 Africa          47.8
2 Americas        67.0
3 Asia            61.8
4 Europe          72.2
5 Oceania         73.7
Or if we want the results ordered by the median life expectancy:
gm %>%
group_by(continent) %>%
summarize(medLifeExp = median(lifeExp)) %>%
arrange(medLifeExp)
gm %>%
group_by(continent) %>%
summarize(numObs = n())
gm %>%
group_by(continent) %>%
summarize(n_obs = n(), n_countries = n_distinct(country))
# A tibble: 5 x 3
continent n_obs n_countries
<chr> <int> <int>
1 Africa 624 52
2 Americas 300 25
3 Asia 396 33
4 Europe 360 30
5 Oceania 24 2
Here is a bit more involved example that calculates the minimum and maximum
life expectancies for countries in Africa by year.
gm %>%
filter(continent == "Africa") %>%
group_by(year) %>%
summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp))
gm %>%
select(country, continent, year, lifeExp) %>%
group_by(year) %>%
arrange(year) %>%
filter(rank(lifeExp) == 1)
# A tibble: 12 x 4
# Groups: year [12]
country continent year lifeExp
<chr> <chr> <int> <dbl>
1 Afghanistan Asia 1952 28.8
2 Afghanistan Asia 1957 30.3
3 Afghanistan Asia 1962 32.0
4 Afghanistan Asia 1967 34.0
5 Sierra Leone Africa 1972 35.4
6 Cambodia Asia 1977 31.2
7 Sierra Leone Africa 1982 38.4
8 Angola Africa 1987 39.9
9 Rwanda Africa 1992 23.6
10 Rwanda Africa 1997 36.1
11 Zambia Africa 2002 39.2
12 Swaziland Africa 2007 39.6
Next we add the maximum life expectancy. Here we need to better understand
the desc() function, which transforms a vector into a numeric vector that, when
sorted in ascending order, arranges the original values in descending order. Here
are some examples.
desc(1:5)
[1] -1 -2 -3 -4 -5
desc(c(2,3,1,5,6,-4))
[1] -2 -3 -1 -5 -6 4
[1] -1 -3 -2 -5 -4
We now use this to extract the maximum life expectancy. Recall that |
represents “or”. Also by default only the first few rows of a tibble object will
be printed. To see all the rows we pipe the output to print(n = 24) to ask
for all 24 rows to be printed.
gm %>%
select(country, continent, year, lifeExp) %>%
group_by(year) %>%
arrange(year) %>%
filter(rank(lifeExp) == 1 | rank(desc(lifeExp)) == 1) %>%
print(n=24)
# A tibble: 24 x 4
# Groups: year [12]
country continent year lifeExp
<chr> <chr> <int> <dbl>
1 Afghanistan Asia 1952 28.8
2 Norway Europe 1952 72.7
3 Afghanistan Asia 1957 30.3
4 Iceland Europe 1957 73.5
5 Afghanistan Asia 1962 32.0
6 Iceland Europe 1962 73.7
7 Afghanistan Asia 1967 34.0
8 Sweden Europe 1967 74.2
9 Sierra Leone Africa 1972 35.4
10 Sweden Europe 1972 74.7
11 Cambodia Asia 1977 31.2
12 Iceland Europe 1977 76.1
13 Japan Asia 1982 77.1
14 Sierra Leone Africa 1982 38.4
15 Angola Africa 1987 39.9
16 Japan Asia 1987 78.7
17 Japan Asia 1992 79.4
18 Rwanda Africa 1992 23.6
19 Japan Asia 1997 80.7
20 Rwanda Africa 1997 36.1
21 Japan Asia 2002 82
22 Zambia Africa 2002 39.2
23 Japan Asia 2007 82.6
24 Swaziland Africa 2007 39.6
The $ notation provides a simple way to create new variables in a data frame.
The mutate() function provides another, sometimes cleaner way to do this.
We will use mutate() along with the lag() function to investigate changes in
life expectancy over five years for the gapminder data. We’ll do this in a few
steps. First, we create a variable that measures the change in life expectancy
and remove the population and GDP variables that are not of interest. We
have to be careful to first group by country, since we want to calculate the
change in life expectancy by country.
gm %>%
  group_by(country) %>%
  mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
  select(-c(population, gdpPercap))
# A tibble: 1,704 x 5
# Groups: country [142]
country year continent lifeExp changeLifeExp
<chr> <int> <chr> <dbl> <dbl>
1 Afghanistan 1952 Asia 28.8 NA
2 Afghanistan 1957 Asia 30.3 1.53
3 Afghanistan 1962 Asia 32.0 1.66
4 Afghanistan 1967 Asia 34.0 2.02
5 Afghanistan 1972 Asia 36.1 2.07
6 Afghanistan 1977 Asia 38.4 2.35
7 Afghanistan 1982 Asia 39.9 1.42
8 Afghanistan 1987 Asia 40.8 0.968
9 Afghanistan 1992 Asia 41.7 0.852
10 Afghanistan 1997 Asia 41.8 0.0890
# ... with 1,694 more rows
Next, summarize by computing the largest drop in life expectancy.
gm %>%
group_by(country) %>%
mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
select(-c(population, gdpPercap)) %>%
summarize(largestDropLifeExp = min(changeLifeExp))
Oops. We forgot that since we don’t have data from before 1952, the first drop
will be NA. Let’s try again.
gm %>%
group_by(country) %>%
mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
select(-c(population, gdpPercap)) %>%
summarize(largestDropLifeExp = min(changeLifeExp, na.rm = TRUE))
gm %>%
group_by(country) %>%
mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
select(-c(population, gdpPercap)) %>%
arrange(changeLifeExp)
# A tibble: 1,704 x 5
# Groups: country [142]
country year continent lifeExp changeLifeExp
<chr> <int> <chr> <dbl> <dbl>
1 Rwanda 1992 Africa 23.6 -20.4
2 Zimbabwe 1997 Africa 46.8 -13.6
3 Lesotho 2002 Africa 44.6 -11.0
4 Swaziland 2002 Africa 43.9 -10.4
As you progress in your data science related career, we are sure you will find
dplyr to be one of the most useful packages for your initial data exploration.
Here is one more practice problem to get more comfortable with the syntax.
Recall the iris data set we have worked with multiple times. We want to look
at the ratio of Sepal.Length to Petal.Length in the three different species.
Write a series of dplyr statements that groups the data by species, creates a new
column called s.p.ratio that is the Sepal.Length divided by Petal.Length,
then computes the mean of this column for each species in a column called
mean.ratio. Display the data in descending order of mean.ratio.
6.8 Exercises
7
Functions and Programming

We have been working with a wide variety of R functions, from simple functions
such as mean() and sd() to more complex functions such as ggplot() and
apply(). Gaining a better understanding of existing functions and the ability
to write your own functions dramatically increases what we can do with R.
Learning about R’s programming capabilities is an important step in gaining
facility with functions.
7.1 R Functions
Data on the yield (pounds per acre) of two types of corn seeds (regular and
kiln dried) were collected. Each of the 11 plots of land was split into two
subplots, and one of the subplots was planted in regular corn while the other
was planted in kiln dried corn. These data were analyzed in a famous paper
authored by William Gosset. Here are the data.
u.corn = "https://www.finley-lab.com/files/data/corn.csv"
corn = read.csv(u.corn, header=TRUE)
corn
regular kiln_dried
1 1903 2009
2 1935 1915
3 1910 2011
4 2496 2463
5 2108 2180
6 1961 1925
7 2060 2122
8 1444 1482
9 1612 1542
10 1316 1443
11 1511 1535
A paired t test, or a confidence interval for the mean difference, may be used
to assess the difference in yield between the two varieties. Of course R has
a function t.test that performs a paired t test and computes a confidence
interval, but we will perform the test without using that function. We will
focus for now on testing the hypotheses $H_0: \mu_d = 0$ versus $H_a: \mu_d \neq 0$ and
on a two-sided confidence interval for $\mu_d$. Here $\mu_d$ represents the population
mean difference.
The paired t statistic is defined by
$$ t = \frac{\bar{d}}{S_d/\sqrt{n}} \tag{7.1} $$
where $\bar{d}$ is the mean and $S_d$ the standard deviation of the $n$ differences.
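A sketch of the computations that produce the values below (the book's exact code may differ slightly):

d <- corn$kiln_dried - corn$regular
n <- length(d)
tstat <- mean(d)/(sd(d)/sqrt(n))
pval <- 2*pt(-abs(tstat), df = n - 1)
lcl <- mean(d) - qt(0.975, df = n - 1)*sd(d)/sqrt(n)
ucl <- mean(d) + qt(0.975, df = n - 1)*sd(d)/sqrt(n)
tstat
pval
lcl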
[1] 1.69
[1] 0.1218
[1] -10.73
ucl
[1] 78.18
With a few lines of R code we have calculated the t statistic, the p-value, and
the confidence interval. Since paired t tests are pretty common, however, it
would be helpful to automate this procedure. One obvious reason is to save
time and effort, but another important reason is to avoid mistakes. It would
be easy to make a mistake (e.g., using n instead of n − 1 as the degrees of
freedom) when repeating the above computations.
Here is a first basic function which automates the computation.
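A sketch of such a function, consistent with the discussion and output that follow (the body mirrors the computations above):

paired_t <- function(x1, x2){
  d <- x1 - x2
  n <- length(d)
  tstat <- mean(d)/(sd(d)/sqrt(n))
  pval <- 2*pt(-abs(tstat), df = n - 1)
  lcl <- mean(d) - qt(0.975, df = n - 1)*sd(d)/sqrt(n)
  ucl <- mean(d) + qt(0.975, df = n - 1)*sd(d)/sqrt(n)
  return(list(tstat = tstat, pval = pval, lcl = lcl, ucl = ucl))
}
paired_t(corn$kiln_dried, corn$regular)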
$tstat
[1] 1.69
$pval
[1] 0.1218
$lcl
[1] -10.73
$ucl
[1] 78.18
An explanation and comments on the function are in order.
• paired_t <- function(x1, x2) assigns a function of two variables, x1 and
x2, to an R object called paired_t.
• The compound expression, i.e., the code that makes up the body of the
function, is enclosed in curly braces {}.
• return(list(tstat = tstat, pval = pval, lcl=lcl, ucl=ucl)) indi-
cates the object(s) returned by the function. In this case the function returns
a list with four components.
paired_t(corn$kiln_dried, corn$regular)
$tstat
[1] 1.69
$pval
[1] 0.1218
$lcl
[1] -10.73
$ucl
[1] 78.18
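The results below come from a call with a 99% confidence level; a sketch of the revised function with a cl argument (default 0.95), and that call:

paired_t <- function(x1, x2, cl = 0.95){
  d <- x1 - x2
  n <- length(d)
  tstat <- mean(d)/(sd(d)/sqrt(n))
  pval <- 2*pt(-abs(tstat), df = n - 1)
  tcrit <- qt(1 - (1 - cl)/2, df = n - 1)
  lcl <- mean(d) - tcrit*sd(d)/sqrt(n)
  ucl <- mean(d) + tcrit*sd(d)/sqrt(n)
  return(list(tstat = tstat, pval = pval, lcl = lcl, ucl = ucl))
}
paired_t(corn$kiln_dried, corn$regular, cl = 0.99)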
$tstat
[1] 1.69
$pval
[1] 0.1218
$lcl
[1] -29.5
$ucl
[1] 96.96
Two things to note. First, arguments do not have to be named. So
paired_t(corn$kiln_dried, corn$regular)
and
paired_t(x1 = corn$kiln_dried, x2 = corn$regular)
are equivalent. But we need to be careful if we do not name arguments because
then we have to know the ordering of the arguments in the function declaration.
Second, in the declaration of the function, the third argument cl was given a
default value of 0.95. If a user does not specify a value for cl it will silently
be set to 0.95. But of course a user can override this, as we did in
paired_t(corn$kiln_dried, corn$regular, cl = 0.99)
Like all things in R, getting the hang of writing functions just requires practice.
Create a simple function called FtoK that is given a temperature in Farenheit
and converts it to Kelvin using the following formula
$$ K = (F - 32) \cdot \frac{5}{9} + 273.15 $$
You should get the following if your function is correct
FtoK(80)
[1] 299.8
submit the whole function. Or a function can be written in any text editor,
saved as a plain text file (possibly with a .r extension), and then read into R
using the source() function.
paired_t(1:5, 1:4)
$pval
[1] 0.3739
$lcl
[1] -1.421
$ucl
[1] 3.021
The user specified data had different numbers of observations in x1 and x2,
which of course can't be analyzed with a paired t test. Rather than stopping
and letting the user know that this is a problem, the function continued and
produced meaningless output.
Also, the function as written only allows testing against a two-sided alternative
hypothesis, and it would be good to allow one-sided alternatives.
First we will address some checks on arguments specified by the user. For this
we will use an if() function and a stop() function.
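A sketch of such checks added at the top of paired_t(); the messages match the errors shown below:

paired_t <- function(x1, x2, cl = 0.95){
  if(length(x1) != length(x2)){
    stop("The input vectors must have the same length")
  }
  if(cl <= 0 || cl >= 1){
    stop("The confidence level must be between 0 and 1")
  }
  ## ... the rest of the function is unchanged ...
}
## paired_t(1:5, 1:4) now triggers the first error shown below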
Error in paired_t(1:5, 1:4): The input vectors must have the same length
The argument to the if() function is evaluated. If the argument returns TRUE
the ensuing code is executed. Otherwise, the ensuing code is skipped and the
rest of the function is evaluated. If a stop() function is executed, the function
is exited and the argument of stop() is printed.
To better understand and use if() statements, we need to understand com-
parison operators and logical operators.
The “double” operators && and || just examine the first element of the two
vectors, whereas the “single” operators & and | compare element by element.
[1] TRUE
[1] FALSE
Error in paired_t(1:5, 2:6, cl = 15): The confidence level must be between 0 and 1
Sign(-3)
Sign(0)
$tstat
[1] 1.69
$pval
[1] 0.1218
$lcl
[1] -10.73
$ucl
[1] 78.18
$tstat
[1] 1.69
$pval
[1] 0.9391
$lcl
[1] -Inf
$ucl
[1] 69.89
$tstat
[1] 1.69
$pval
[1] 0.06091
$lcl
[1] -2.434
$ucl
[1] Inf
Like most software, R does not perform exact arithmetic. Rather, R follows
the IEEE 754 floating point standards. This can have profound effects on
how computational algorithms are implemented, but is also important when
considering things like comparisons.
Note first that computer arithmetic does not follow some of the rules of ordinary
arithmetic. For example, it is not associative.
2^-30
[1] 0.0000000009313
2^-30 + (2^30 - 2^30)
[1] 0.0000000009313
(2^-30 + 2^30) - 2^30
[1] 0
Computer arithmetic is not exact.
1.5 - 1.4
[1] 0.1
1.5 - 1.4 == 0.1
[1] FALSE
(1.5 - 1.4) - 0.1
[1] 0.00000000000000008327
So for example an if statement that uses an equality test may not give the
expected answer. One way to avoid this problem is to test “near equality” using
all.equal(). The function takes as arguments two objects to be compared,
and a tolerance. If the objects are within the tolerance of each other, the
function returns TRUE. The tolerance has a default value of about 1.5 × 10−8 ,
which works well in many cases.
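For example, while the exact equality test above returns FALSE, the near-equality test returns TRUE (presumably this is what produced the result below):

all.equal(1.5 - 1.4, 0.1)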
[1] TRUE
7.4 Loops
1 == 1 + 10^-4
[1] FALSE
1 == 1 + 10^-50
[1] TRUE
Clearly the machine epsilon is somewhere between 10^-4 and 10^-50. How can we
find its value exactly? Since floating point numbers use a binary representation,
we know that the machine epsilon will be equal to 1/2^k for some value of k.
So to find the machine epsilon, we can keep testing whether 1 and 1 + 1/2^k
are equal, until we find a value k where the two are equal. Then the machine
epsilon will be 1/2^(k-1), since it is the smallest value for which the two are NOT
equal.
1 == 1 + 1/2
[1] FALSE
1 == 1 + 1/2^2
[1] FALSE
1 == 1 + 1/2^3
[1] FALSE
Testing by hand like this gets tedious quickly. A loop can automate the process.
We will do this with two R loop types, repeat and while.
A repeat loop just repeats a given expression over and over again until a
break statement is encountered.
k <- 1
repeat{
  if(1 == 1 + 1/2^k){
    break
  } else{
    k <- k + 1
  }
}
k
[1] 53
1/2^(k-1)
[1] 0.000000000000000222
This code initializes k at 1. The body of the loop first checks whether 1
and 1 + 1/2^k are equal. If they are equal, the break statement is executed and
control is transferred outside the loop. If they are not equal, k is increased by
1, and we return to the top of the body of the loop. A while loop accomplishes
the same thing more compactly, repeating its body as long as the condition
1 != 1 + 1/2^k remains TRUE.
k <- 1
while(1 != 1 + 1/2^k){
  k <- k + 1
}
k
[1] 53
1/2^(k-1)
[1] 0.000000000000000222
A for loop iterates over the elements of a vector. For example, here are two
ways to sum the entries of a vector.

x <- 1:10
S <- 0
for(i in 1:length(x)){
  S <- S + x[i]
}
S
[1] 55
S = 0
for(value in x){
  S = S + value
}
S
[1] 55
In the first case we loop over the positions of the vector elements, while in the
second case we loop over the elements themselves.
Often when students initially learn about if() statements and for() loops,
they use them more often than is necessary, leading to overcomplicated and
often slower, less elegant code. Many simple if() statements can be
accomplished using logical subsetting and vectorization. Using the ideas you
used above, and what you learned in Chapters 4 and 6, we can use logical
subsetting to replace the simplest of if() statements. Rewrite the following
if() statement and for() loop using logical subsetting.
a <- rnorm(1000)
for (i in 1:length(a)) {
  if (a[i] > 1) {
    a[i] <- 0
  }
}
f1 <- function(n){
  x <- numeric(0)
  for(i in 1:n){
    x <- c(x, i)
  }
  x
}

f2 <- function(n){
  x <- numeric(n)
  for(i in 1:n){
    x[i] <- i
  }
  x
}
n <- 100000
system.time(f1(n))
system.time(f2(n))
system.time(1:n)
n <- 1000000
system.time(1:n)
7.5.2 Vectorization
Next consider calculating the sum of the squared entries in each column of a
matrix. For example, with the matrix
$$ M = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}, $$
the desired result is the vector of column sums of squares,
$(1^2 + 4^2,\; 2^2 + 5^2,\; 3^2 + 6^2) = (17, 29, 45)$.
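A sketch of a first, double-loop implementation (the function name ss1 is an assumption; test_matrix is the name used in the calls below):

test_matrix <- matrix(1:6, nrow = 2, byrow = TRUE)
ss1 <- function(M){
  out <- numeric(ncol(M))
  for(j in 1:ncol(M)){
    for(i in 1:nrow(M)){
      out[j] <- out[j] + M[i, j]^2
    }
  }
  return(out)
}
ss1(test_matrix)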
[1] 17 29 45
Another possibility eliminates the inner loop, using the sum() function to
compute the sum of the squared entries in the column directly.
ss2 <- function(M){
  out <- numeric(ncol(M))
  for(j in 1:ncol(M)){
    out[j] <- sum(M[, j]^2)
  }
  return(out)
}
ss2(test_matrix)
[1] 17 29 45
A third possibility uses the colSums() function.
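A sketch of ss3(), which squares the entries and lets colSums() do the summing:

ss3 <- function(M){
  colSums(M^2)
}
ss3(test_matrix)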
[1] 17 29 45
Here is a speed comparison, using a 1000 × 10000 matrix.
system.time(ss2(mm))
system.time(ss3(mm))
rm(mm)
[1] 34 15 11 99
Positional matching of arguments is convenient, but should be used carefully,
and probably limited to the first few, and most commonly used, arguments
in a function. Partial matching also has pitfalls. A partially specified argument
must unambiguously match exactly one argument, a requirement that's not met
below.
sum(1:5)
[1] 15
sum(1:5, c(3,4,90))
[1] 112
sum(1,2,3,c(3,4,90), 1:5)
[1] 118
Think about writing such a function. There is no way to predict in advance
the number of arguments a user might specify. So the function is defined with
... as the first argument:
sum
suppose that a collaborator always supplies comma delimited files that have
five lines of description, followed by a line containing variable names, followed
by the data. You are tired of having to specify skip = 5, header = TRUE, and
sep = "," to read.table() and want to create a function my.read() which
uses these as defaults.
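A sketch of such a my.read() function; the ... collects any other arguments and passes them along to read.table():

my.read <- function(file, skip = 5, header = TRUE, sep = ",", ...){
  read.table(file = file, skip = skip, header = header, sep = sep, ...)
}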
The ... in the definition of my.read() allows the user to specify other argu-
ments, for example, stringsAsFactors = FALSE. These will be passed on to
the read.table() function. In fact, that is how read.csv() is defined.
read.csv
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x564843e4e570>
<environment: namespace:utils>
Arguments to R functions are not evaluated until they are needed, sometimes
called lazy evaluation.
f <- function(a, b){
  print(a^2)
  print(a^3)
  print(a * b)
}
f(a=3, b=2)
[1] 9
[1] 27
[1] 6
f(a=3)
[1] 9
[1] 27
[1] 16
f(2,10)
[1] 20
In the first call, since b was not specified, it was computed as a^3. In the
second call, b was specified, and the specified value was used.
7.8 Exercises
Exercise 10 Learning objectives: translate statistical notation into coded
functions; learn about tools for checking validity of function arguments; practice
writing functions that return multiple objects.
8
Spatial Data Visualization and Analysis
8.1 Overview
Recall, a data structure is a format for organizing and storing data. The struc-
ture is designed so that data can be accessed and worked with in specific ways.
Statistical software and programming languages have methods (or functions)
designed to operate on different kinds of data structures.
This chapter focuses on spatial data structures and some of the R functions that
work with these data. Spatial data comprise values associated with locations,
such as temperature data at a given latitude, longitude, and perhaps elevation.
Spatial data are typically organized into vector or raster data types. (See
Figure 8.1).
• Vector data represent features such as discrete points, lines, and polygons.
• Raster data represent surfaces as a rectangular matrix of square cells or
pixels.
Whether or not you use vector or raster data depends on the type of problem,
the type of maps you need, and the data source. Each data structure has
strengths and weaknesses in terms of functionality and representation. As you
gain more experience working with spatial data, you will be able to determine
which structure to use for a particular application.
There is a large set of R packages available for working with spatial (and space-
time) data. These packages are described in the Cran Task View: Analysis of
Spatial Data1 . The CRAN task view attempts to organize the various packages
into categories, such as Handling spatial data, Reading and writing spatial
data, Visualization, and Disease mapping and areal data analysis, so users can
quickly identify package options given their project needs.
Exploring the extent of the excellent spatial data tools available in R is beyond
the scope of this book. Rather, we would point you to subject texts like Applied
Spatial Data Analysis with R by Bivand et al. (2013) (available for free via the
MSU library system), and numerous online tutorials on pretty much any aspect
1 https://CRAN.R-project.org/view=Spatial
This chapter will focus on a few R packages for manipulating and visualizing
spatial data. Specifically we will touch on the following packages
• sp: spatial data structures and methods
• rgdal: interface to the C/C++ spatial data Geospatial Data Abstraction
Library
• ggmap: extends ggplot2 language to handle spatial data
• leaflet: generates dynamic online maps
2 https://en.wikipedia.org/wiki/Geographic_information_system
3 http://www.esri.com/arcgis/about-arcgis
5 https://en.wikipedia.org/wiki/Shapefile
6 https://en.wikipedia.org/wiki/GeoJSON
7 https://en.wikipedia.org/wiki/GeoTIFF
8 https://en.wikipedia.org/wiki/NetCDF
9 A longer list of spatial data file formats is available at https://en.wikipedia.org/wik
i/GIS_file_formats.
10 A more complete list of the sp package’s spatial data classes and methods is detailed
TABLE 8.1: An abbreviated list of ‘sp‘ and ‘raster‘ data objects and associ-
ated classes for the fundamental spatial data types
library(sp)
library(dplyr)
x <- c(3,2,5,6)
y <- c(2,5,6,7)
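Sketches of how sp.pts and sp.pts.df were presumably built from these coordinates (the attached data frame here is a made-up example):

sp.pts <- SpatialPoints(cbind(x, y))
sp.pts.df <- SpatialPointsDataFrame(sp.pts,
                 data = data.frame(var.1 = c("a", "b", "c", "d")))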
class(sp.pts)
[1] "SpatialPoints"
attr(,"package")
[1] "sp"
class(sp.pts.df)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"
If, for example, you already have a data frame that includes the spatial
coordinate columns and other variables, then you can promote it to a
SpatialPointsDataFrame by indicating which columns contain point coordi-
nates. You can extract or access the data frame associated with the spatial
object using @data. You can also access individual variables directly from the
spatial object using $ or by name or column number to the right of the comma
in [,] (analogous to accessing variables in a data frame).
df <- data.frame(x = c(3,2,5,6), y=c(2,5,6,7), var.1 = c("a", "b", "c", "d"), var.2 = 1:4)
class(df)
[1] "data.frame"
#promote to a SpatialPointsDataFrame
coordinates(df) <- ~x+y
class(df)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"
var.1 var.2
1 a 1
2 b 2
3 c 3
4 d 4
class(df@data)
[1] "data.frame"
df[,c("var.1","var.2")]
df[,2]
coordinates var.2
1 (3, 2) 1
2 (2, 5) 2
3 (5, 6) 3
4 (6, 7) 4
min max
x 2 6
y 2 7
Here, the data frame df is promoted to a SpatialPointsDataFrame by in-
dicating the column names that hold the longitude and latitude (i.e., x and
y respectively) using the coordinates function. Here too, the @data is used
to retrieve the data frame associated with the points. We also illustrate how
variables can be accessed directly from the spatial object. The bbox function
is used to return the bounding box that is defined by the spatial extent of the
point coordinates. The other spatial objects noted in Table 8.1 can be created,
and their data accessed, in a similar way11 .
More often than not we find ourselves reading existing spatial data files into R.
The code below uses the downloader package to download all of the PEF
data we’ll use in this chapter. The data are compressed in a single zip file,
which is then extracted into the working directory using the unzip function.
A look into the PEF directory using list.files shows nine files12 . Those
named MU-bounds.* comprise the shapefile that holds the PEF’s management
unit boundaries in the form of polygons. Like other spatial data file formats,
11 This cheatsheet http://www.maths.lancs.ac.uk/~rowlings/Teaching/UseR2012/chea
directory.
8.3 Reading Spatial Data into R 177
shapefiles are made up of several different files (with different file extensions)
that are linked together to form a spatial data object. The plots.csv file
holds the spatial coordinates and other information about the PEF’s forest
inventory plots. The roads.* shapefile holds roads and trails in and around
the PEF.
library(downloader)
download("https://www.finley-lab.com/files/data/PEF.zip",
destfile="./PEF.zip", mode="wb")
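## The zip file presumably needs to be extracted before listing the PEF
## directory; a sketch of that step (the exact arguments may differ):
unzip("./PEF.zip", exdir = ".")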
list.files("PEF")
library(rgdal)
of GDAL and other software specifics be printed when the library is loaded. Don’t let it
distract you.
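A sketch of the readOGR() call that produced the output below; its arguments are the data source (the PEF directory) and the layer name:

mu <- readOGR(dsn = "PEF", layer = "MU-bounds")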
with 40 features
It has 1 fields
When called, the readOGR function provides a bit of information about the
object being read in. Here, we see that it read the MU-bounds shapefile from
the PEF directory and that the shapefile had 40 features (i.e., polygons) and 1 field
(i.e., field is synonymous with column or variable in the data frame).
You can think of the resulting mu object as a data frame where each row corre-
sponds to a polygon and each column holds information about the polygon14 .
More specifically, the mu object is a SpatialPolygonsDataFrame.
class(mu)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
As illustrated using the made-up point data in the example above, you can
access the data frame associated with the polygons using @data.
class(mu@data)
[1] "data.frame"
dim(mu@data)
[1] 40 1
head(mu@data)
mu_id
0 C15
1 C17
2 C16
3 C27
4 U18
5 U31
Above, a call to class() confirms we have accessed the data frame, dim()
shows there are 40 rows (one row for each polygon) and one column, and
head() shows the first six values of the column named mu_id. The mu_id
14 Much of the actual spatial information is hidden from you in other parts of the data
structure, but is available if you ask nicely for it (see subsequent sections).
values are unique identifiers for each management unit polygon across the
PEF.
One of the more challenging aspects of working with spatial data is getting
used to the idea of a coordinate reference system. A coordinate reference system
(CRS) is a system that uses one or more numbers, or coordinates, to uniquely
determine the position of a point or other geometric element (e.g., line, polygon,
raster). For spatial data, there are two common coordinate systems:
There are numerous map projections15 . One of the more frustrating parts of
working with spatial data is that it seems like each data source you find offers
its data in a different map projection and hence you spend a great deal of time
reprojecting (i.e., transforming from one CRS to another) data into a common
CRS such that they overlay correctly. Reprojecting is accomplished using the
sp package’s spTransform function as demonstrated in Section 8.5.
In R, a spatial object’s CRS is accessed via the sp package proj4string
function. The code below shows the current projection of mu.
proj4string(mu)
Let’s begin by making a map of PEF management unit boundaries over top
of a satellite image using the ggmap package. Given an address, location, or
bounding box, the ggmap package’s get_map function will query Google Maps,
OpenStreetMap, Stamen Maps, or Naver Map servers for a user-specified map
type. The get_map function requires the location or bounding box coordinates
be in a geographic coordinate system (i.e., latitude-longitude). This means
we need to reproject mu from UTM zone 19 to latitude-longitude geographic
coordinates, which is defined by the '+proj=longlat +datum=WGS84' proj.4
string. As seen below, the first argument in spTransform function is the spatial
object to reproject and the second argument is a CRS object created by passing
a proj.4 string into the CRS function.
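A sketch of that reprojection:

mu <- spTransform(mu, CRS("+proj=longlat +datum=WGS84"))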
library(ggmap)
mu.f <- fortify(mu, region="mu_id")
head(mu.f)
18 If you start dealing with a lot of spatial data and reprojecting, http://spatialreference.org
is an excellent resource for finding and specifying coordinate reference systems.
19 ggmap depends on ggplot2 so ggplot2 will be automatically loaded when you call
library(ggmap).
register_google(key = "AIzaSyBPAwSY5x8vQqlnG-QwiCAWQW12U3CTLZY")
mu.bbox <- bbox(mu)
mu.centroid <- apply(mu.bbox, 1, mean)
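A sketch of the get_map() call; the zoom level is an assumption, and maptype = "satellite" matches the satellite image described above:

basemap <- get_map(location = mu.centroid, zoom = 14, maptype = "satellite")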
ggmap(basemap) +
geom_polygon(data=mu.f, aes(x = long, y = lat, group=group),
fill=NA, size=0.2, color="orange")
20 https://developers.google.com/maps/documentation/geocoding/get-api-key
[Map: satellite basemap of the PEF with the management unit boundaries drawn in orange.]
While the call to get_map can use a bounding box (mu.bbox in code above)
to define the basemap extent, I’ve found that it’s easier to pass it a center
point location (mu.centroid; note the use of the apply function to find the
mean of the easting and northing extents) and a zoom value to extract the
desired extent. The resulting map looks pretty good! Take a look at the
get_map function manual page and try different options for maptype (e.g.,
maptype="terrain").
Next we’ll add the forest inventory plots to the map. Begin by reading in
the PEF forest inventory plot data held in “plots.csv”. Recall, foresters have
measured forest variables at a set of locations (i.e., inventory plots) within
each management unit. The following statement reads these data and displays
the resulting data frame structure.
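A sketch of that statement (the file path assumes the unzipped PEF directory from earlier):

plots <- read.csv("PEF/plots.csv")
str(plots)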
$ basal_area_m2_ha: num 22 23.2 23 16.1 29.2 19.1 14.1 27.4 21.6 15 ...
In plots each row is a forest inventory plot and columns are:
• mu_id identifies the management unit within which the plot is located
• plot unique plot number within the management unit
• easting longitude coordinate in UTM zone 19
• northing latitude coordinate in UTM zone 19
• biomass_mg_ha tree biomass in metric tons (per hectare basis)
• stocking_stems_ha number of trees (per hectare basis)
• diameter_cm average tree diameter measured 130 cm up the tree trunk
• basal_area_m2_ha total cross-sectional area at 130 cm up the tree trunk
(per hectare basis)
There is nothing inherently spatial about this data structure—it is simply a
data frame. We make plots into a spatial object by identifying which columns
hold the coordinates. This is done below using the coordinates function,
which promotes the plots data frame to a SpatialPointsDataFrame.
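A minimal sketch of this step, using the coordinate column names described above:
coordinates(plots) <- ~easting+northing
class(plots)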
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"
Although plots is now a SpatialPointsDataFrame, it does not know to which
CRS the coordinates belong; hence, the NA when proj4string(plots) is called
below. As noted in the plots file description above, easting and northing
are in UTM zone 19. This CRS is set using the second call to proj4string
below.
proj4string(plots)
[1] NA
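The CRS assignment and reprojection are not shown in this copy; a sketch (the UTM zone 19 proj.4 string, in particular its datum, is an assumption):
proj4string(plots) <- CRS("+proj=utm +zone=19 +datum=NAD83 +units=m")
plots <- spTransform(plots, CRS("+proj=longlat +datum=WGS84"))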
We could have replaced the second argument in the spTransform call above with
proj4string(mu) and saved some typing.
We’re now ready to add the forest inventory plots to the existing basemap
with management units. Specifically, let’s map the biomass_mg_ha variable to
show changes in biomass across the forest. No need to fortify plots, ggplot
is happy to take geom_point’s data argument as a data frame (although we
do need to convert plots from a SpatialPointsDataFrame to a data frame
using the as.data.frame function). Check out the scale_color_gradient
function in your favorite ggplot2 reference to understand how the color scale
is set.
ggmap(basemap) +
geom_polygon(data=mu.f, aes(x = long, y = lat, group=group),
fill=NA, size=0.2, color="orange") +
geom_point(data=as.data.frame(plots),
aes(x = easting, y = northing, color=biomass_mg_ha)) +
scale_color_gradient(low="white", high="darkblue") +
labs(color = "Biomass (mg/ha)")
There is something subtle and easy to miss in the code above. Notice the aes
function arguments in geom_point take geographic longitude and latitude,
x and y respectively, from the plots data frame (but recall easting and
northing were in UTM zone 19). This works because we applied spTransform
to reproject the points SpatialPointsDataFrame to geographic coordinates.
sp then replaces the values in easting and northing columns with the re-
projected coordinate values when converting a SpatialPointsDataFrame to
a data frame via as.data.frame().
Foresters use the inventory plot measurements to estimate forest variables
within management units, e.g., the average or total management unit biomass.
Next we’ll make a plot with management unit polygons colored by average
biomass_mg_ha.
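The (not shown) one-liner referred to below might look like this sketch (it assumes dplyr is loaded):
mu.bio <- as.data.frame(plots) %>% group_by(mu_id) %>%
  summarize(biomass_mu = mean(biomass_mg_ha))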
print(mu.bio)
# A tibble: 33 x 2
mu_id biomass_mu
<chr> <dbl>
1 C12 124.
2 C15 49.9
3 C16 128.
4 C17 112.
5 C20 121.
6 C21 134.
7 C22 65.2
8 C23A 108.
9 C23B 153.
10 C24 126.
# ... with 23 more rows
Recall from Section 6.7.5 this one-liner can be read as “get the data frame
from plots’s SpatialPointsDataFrame then group by management unit then
make a new variable called biomass_mu that is the average of biomass_mg_ha
and assign it to the mu.bio tibble.”
The management unit specific biomass_mu can now be joined to the mu polygons
using the common mu_id value. Remember when we created the fortified version
of mu called mu.f? The fortify function region argument was mu_id which is
the id variable in the resulting mu.f. This id variable in mu.f can be linked to
the mu_id variable in mu.bio using dplyr’s left_join function as illustrated
below.
head(mu.f, n=2)
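The join itself is a one-liner; a sketch consistent with the description above:
mu.f <- left_join(mu.f, mu.bio, by = c("id" = "mu_id"))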
head(mu.f, n=2)
ggmap(basemap) +
geom_polygon(data=mu.f, aes(x = long, y = lat,
group=group, fill=biomass_mu),
size=0.2, color="orange") +
scale_fill_gradient(low="white", high="darkblue",
na.value="transparent") +
labs(fill="Biomass (mg/ha)")
Let's add the roads and some more descriptive labels as a finishing touch. The
roads data include a variable called type that identifies the road type. To
color roads by type in the map, we need to join the roads data frame with
the fortified roads roads.f using the common variable id as a road-segment-specific
identifier (a sketch of this join is given below). Then geom_path's color argument gets this type variable
as a factor to create road-specific colors. The default coloring of the roads blends
in too much with the polygon colors, so we manually set the road colors using
the scale_color_brewer function. The palette argument in this function
accepts a set of key words, e.g., "Dark2", that specify sets of diverging colors
chosen to make map objects optimally distinct (see the manual page
for scale_color_brewer and http://colorbrewer2.org).22
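A sketch of that join; it assumes roads carries (or is given) an id matching the one fortify() creates:
roads$id <- row.names(roads)
roads.f <- left_join(fortify(roads), roads@data, by = "id")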
ggmap(basemap) +
geom_polygon(data=mu.f, aes(x = long, y = lat, group=group,
fill=biomass_mu),
size=0.2, color="orange") +
geom_path(data=roads.f, aes(x = long, y = lat,
group=group, color=factor(type))) +
scale_fill_gradient(low="white", high="darkblue",
na.value="transparent") +
scale_color_brewer(palette="Dark2") +
labs(fill="Biomass (mg/ha)", color="Road type", xlab="Longitude",
ylab="Latitude", title="PEF forest biomass")
22 Install the RColorBrewer package and run library(RColorBrewer);
display.brewer.all() to get a graphical list of available palettes.
(Map of management units filled by average biomass, with roads colored by type: Paved/Gravel, Trail, and Winter.)
The second, and more cryptic, of the two warnings from this code occurs
because some of the roads extend beyond the range of the map axes and are
removed (nothing to worry about).
library(leaflet)
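The leaflet example itself is missing from this copy; a minimal sketch with the features discussed below (the marker coordinates and popup text are illustrative only):
leaflet() %>%
  addTiles() %>%
  addMarkers(lng = -68.6, lat = 44.85,
             popup = "<b>Penobscot Experimental Forest</b><br/>Forest inventory plots")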
There are a couple things to note in the code. First, we use the pipe operator
%>% just like in dplyr functions. Second, the popup argument in addMarkers()
takes standard HTML and clicking on the marker makes the text popup.
Third, the html version of this text provides the full interactive, dynamic
map, so we encourage you to read and interact with the html version of this
textbook for this section. The PDF document will simply display a static
version of this map and will not do justice to how awesome leaflet truly is!
As seen in the leaflet() call above, the various add... functions can take
longitude (i.e., lng) and latitude (i.e., lat). Alternatively, these functions can
extract the necessary spatial information from sp objects, e.g., Table 8.1, when
passed to the data argument (which greatly simplifies life compared with map
making using ggmap).
You can imagine that we might want to subset spatial objects to map specific
points, lines, or polygons that meet some criteria, or perhaps extract values
from polygons or raster surfaces at a set of points or geographic extent. These,
and similar types of operations, are easy in R (as long as all spatial objects
are in a common CRS). Recall from Chapter 4 how handy it is to subset data
structures, e.g., vectors and data frames, using the [] operator and logical
vectors? Well it’s just as easy to subset spatial objects, thanks to the authors
of sp, raster, and other spatial data packages.
library(raster)
25 http://www.gadm.org/ (Global Administrative Boundaries)
26 https://www2.jpl.nasa.gov/srtm/ (Shuttle Radar Topography Mission)
27 http://www.worldclim.org/
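The download step is not shown in this copy; a sketch using raster's getData, with pef.centroid computed from the plot coordinates as described below:
pef.centroid <- colMeans(coordinates(plots))
srtm <- getData("SRTM", lon = pef.centroid[1], lat = pef.centroid[2])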
class(srtm)
[1] "RasterLayer"
attr(,"package")
[1] "raster"
proj4string(srtm)
image(srtm)
plot(plots, add = TRUE)
A few things to notice in the code above. First the getData function needs the
longitude lon and latitude lat to identify which SRTM raster tile to return
(SRTM data come in large raster tiles that cover the globe). As usual, look
at the getData function documentation for a description of the arguments.
To estimate the PEF’s centroid coordinate, we averaged the forest inventory
plots’ latitude and longitude then assigned the result to pef.centroid (for fun,
I used the apply function to do the same task when creating mu.centroid
earlier in the chapter). Second, srtm is in a longitude latitude geographic
CRS (same as our other PEF data). Finally, the image shows SRTM elevation
along the coast of Maine, the PEF plots are those tiny specks of black in the
northwest quadrant, and the white region of the image is the Atlantic Ocean.
Okay, this is a start, but it would be good to crop the SRTM image to the
PEF’s extent. This is done using raster’s crop function. This function can
use many different kinds of spatial objects in the second argument to calculate
the extent at which to crop the object in the first argument. Here, I set mu as
the second argument and save the resulting SRTM subset over the larger tile
(the srtm object).
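A sketch of that call:
srtm <- crop(srtm, mu)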
image(srtm)
plot(mu, add = TRUE)
The crop is in effect doing a spatial subsetting of the raster data. We’ll return
to the srtm data and explore different kinds of subsetting in the subsequent
sections.
As promised, we can subset spatial objects using the [] operator and a logical,
index, or name vector. The key is that sp objects behave like data frames, see
Section 4.5. A logical or index vector to the left of the comma in [,] selects spatial features (points, lines, or polygons), just as it selects rows of a data frame.
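For example, the high-density subset used below could be created as follows (a sketch; the column name stems_ha matches the calls that follow):
plots.10k <- plots[plots$stems_ha > 10000, ]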
min(plots$stems_ha)
[1] 119
min(plots.10k$stems_ha)
[1] 10008
You can also add new variables to the spatial objects.
head(plots)
A spatial overlay retrieves the indexes or variables from object A using the
location of object B. With some spatial objects this operation can be done using
the [] operator. For example, say we want to select and map all management
units in mu, i.e., A, that contain plots with more than 10,000 stems per ha, i.e.,
B.
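A sketch of this selection with the [] operator (the fortify step mirrors the earlier one for mu):
mu.10k <- mu[plots.10k, ]
mu.10k.f <- fortify(mu.10k, region = "mu_id")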
ggmap(basemap) +
  geom_polygon(data=mu.10k.f, aes(x = long, y = lat, group=group),
               fill="transparent", size=0.2, color="orange") +
  geom_point(data=as.data.frame(plots.10k),
             aes(x = easting, y = northing), color="white")
More generally, however, the over function offers consistent overlay operations
for sp objects and can return either indexes or variables from object A given
locations from object B, i.e., over(B, A) or, equivalently, B%over%A. The code
below duplicates the result from the preceding example using over.
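A sketch of that one-liner, written to match the five steps unpacked below:
mu.10k <- mu[mu$mu_id %in% unique(over(plots.10k, mu)$mu_id), ]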
Yes, this requires more code but over provides a more flexible and general
purpose function for overlays on the variety of sp objects. Let’s unpack this
one-liner into its five steps.
i. The over function returns variables for mu’s polygons that coincide
with the nrow(plots.10k@data) points in plots.10k. No points
fall outside the polygons and the polygons do not overlap, so i should
be a data frame with nrow(plots.10k@data) rows. If polygons did
overlap and a point fell within the overlap region, then variables for
the coinciding polygons are returned.
ii. Select the unique mu identifier mu_id (this step is actually not necessary here because mu only has one variable).
iii. Because some management units contain multiple plots there will
be repeat values of mu_id in ii, so apply the unique function to get
rid of duplicates.
iv. Use the %in% operator to create a logical vector that identifies which
polygons should be in the final map.
v. Subset mu using the logical vector created in iv.
Now let’s do something similar using the srtm elevation raster. Say we want
to map elevation along trails, winter roads, and gravel roads across the PEF.
We could subset srtm using the roads SpatialLinesDataFrame; however,
mapping the resulting pixel values along the road segments using ggmap
requires a bit more massaging. So, to simplify things for this example, roads is
first coerced into a SpatialPointsDataFrame called roads.pts that is used
to extract spatially coinciding srtm pixel values which themselves are coerced
from raster’s RasterLayer to sp’s SpatialPixelsDataFrame called srtm.sp
so that we can use the over function. We also choose a different basemap just
for fun.
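A sketch of the (not shown) preparation steps; the object names follow the surrounding text, but the details, e.g., the coercions used, are assumptions:
hikes.pts <- as(hikes, "SpatialPointsDataFrame")
srtm.sp <- as(srtm, "SpatialPixelsDataFrame")
color.vals <- over(hikes.pts, srtm.sp)[, 1]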
ggmap(basemap) +
geom_point(data=as.data.frame(hikes.pts),
aes(x = coords.x1, y = coords.x2, color = color.vals)) +
scale_color_gradient(low="green", high="red") +
labs(color = "Hiking trail elevation\n(m above sea level)",
xlab="Longitude", ylab="Latitude")
In the call to geom_point above, coords.x1 and coords.x2 are the default names
given to longitude and latitude, respectively, when sp coerces hikes to
hikes.pts. These points represent the vertices along the line segments. I create
the vector color.vals to hold the srtm values that are mapped to the color aesthetic.
Overlay operations involving lines and polygons over polygons require the
rgeos package which provides an interface to the Geometry Engine - Open
Source28 (GEOS) C++ library for topology operations on geometries. We’ll
leave it to you to explore these capabilities.
28 https://trac.osgeo.org/geos/
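The aggregate() call that produces mu.ag is not shown in this copy; a compatible sketch, given the plots and mu objects above, is:
mu.ag <- aggregate(plots[c("mu_id", "biomass_mg_ha")], by = mu, FUN = mean)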
head(mu.ag@data, n=2)
mu_id biomass_mg_ha
0 <NA> 49.86
1 <NA> 112.17
With mu_id rendered useless, we do not have a variable that uniquely identifies
each polygon for use in fortify's region argument; hence there is no way to
subsequently join the unfortified and fortified versions of mu.ag. Here's the
work-around. If the region is not specified, fortify() uses an internal unique
polygon ID that is part of the sp data object and accessed via row.names().29
So, the trick is to add this unique polygon ID to the aggregate() output prior
to calling fortify(), as demonstrated below.
29 With other data, there is a chance the row names differ from the unique polygon IDs.
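A sketch of the work-around (the left_join produces the Joining, by = "id" message shown below):
mu.ag$id <- row.names(mu.ag)
mu.ag.f <- fortify(mu.ag)
mu.ag.f <- left_join(mu.ag.f, mu.ag@data)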
Joining, by = "id"
ggmap(basemap) +
geom_polygon(data=mu.ag.f, aes(x = long, y = lat,
group=group, fill=biomass_mg_ha),
size=0.2, color="orange") +
scale_fill_gradient(low="white", high="darkblue",
na.value="transparent") +
labs(fill="Biomass (mg/ha)")
The aggregate() function will work with all sp objects. For example, let's
map the standard deviation of the pixel values in srtm.sp by management unit. Notice that
aggregate() is happy to take a user-specified function for FUN.
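A sketch of this aggregation; the layer name srtm_23_04 is taken from the plotting code below, and FUN = sd matches the map's legend:
mu.srtm <- aggregate(srtm.sp["srtm_23_04"], by = mu, FUN = sd)
mu.srtm$id <- row.names(mu.srtm)
mu.srtm.f <- left_join(fortify(mu.srtm), mu.srtm@data)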
Joining, by = "id"
ggmap(basemap) +
geom_polygon(data=mu.srtm.f, aes(x = long, y = lat, group=group,
fill=srtm_23_04),
size=0.2, color="orange") +
scale_fill_gradient(low="green", high="red") +
labs(fill = "Elevation standard deviation\n(m above sea level)",
xlab="Longitude", ylab="Latitude")
8.9 Exercises
30 https://cran.r-project.org/web/packages/sp/vignettes/over.pdf
9
Shiny: Interactive Web Apps in R
Shiny1 is a framework that turns R code and figures into interactive web
applications. Let’s start out by looking at a built-in example Shiny app.
library(shiny)
runExample("01_hello")
In the bottom panel of the resulting Shiny app (Figure 9.1), we can see the
R script that is essential to running any Shiny app: app.R. Take a minute to
explore how the app works and how the script code is structured. The first part
of the script (ui <-) defines the app’s user interface (UI) using directives that
partition the resulting web page and placement of input and output elements.
The second part of the script defines a function called server with arguments
input and output. This function provides the logic required to run the app
(in this case, to draw the histogram of the Old Faithful Geyser Data). Third,
ui and server are passed to a function called shinyApp which creates the
application.2
All of the example code and data for the remainder of this chapter can be
accessed using the following code.
library(downloader)
download("https://www.finley-lab.com/files/data/shiny_chapter_code.zip",
destfile="./shiny_chapter_code.zip", mode="wb")
1 https://shiny.rstudio.com/
2 Since 2014, Shiny has supported single-file applications (one file called app.R that
contains UI and server components), but in other resources, you may see two separate source
files, server.R and ui.R, that correspond to those two components. We will use the updated
one-file system here, but keep in mind that older resources you find on the internet using
the Shiny package may employ the two-file approach. Ultimately, the code inside these files
is almost identical to that within the single app.R file. See https://shiny.rstudio.com/articles/app-formats.html for more information.
# app.R version 1
library(shiny)
# Define UI
ui <- fluidPage(
sidebarLayout(
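    # (The rest of the version 1 listing is missing from this copy; the lines below
    # are a sketch consistent with the description that follows. A read.csv() call
    # creating names.df would also appear near the top of the script.)
    sidebarPanel("our inputs will go here"),
    mainPanel("our output will appear here")
  )
)

# Define server logic (empty for now)
server <- function(input, output) {}

shinyApp(ui = ui, server = server)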
If everything compiles correctly, you should get a window that looks like
Figure 9.3. There are a few things to notice. First, we read in the CSV file
at the beginning of the app.R script so that these data can be accessed in
subsequent development steps (although we don't use them yet in this initial
application). Second, our Shiny app is not interactive yet because there is no
way for a user to input any data—the page is view-only. Third, the function
function(input,output) {} in app.R does not include any code, reflecting
that we are not yet using any user inputs or creating any outputs.
# app.R version 2
library(shiny)
# Define UI
ui <- fluidPage(
titlePanel("Random Names Age Analysis"),
sidebarLayout(
sidebarPanel(
# Dropdown selection for Male/Female
selectInput(inputId = "sexInput", label = "Sex:",
choices = c("Female" = "F",
"Male" = "M",
"Both" = "B"))
),
3 https://shiny.rstudio.com/gallery/widget-gallery.html
Now that we’ve included the user input dialog, let’s make the application
truly interactive by changing the output depending on user input. This is
accomplished by modifying the server logic portion of the script. Our goal is
to plot an age distribution histogram in the main panel given the sex selected
by the user.
Server logic is defined by two arguments: input and output. These objects
are both list-like, so they can contain multiple other objects. We already know
from the user input part of the app.R script that we have an input component
called sexInput, which can be accessed in the reactive portion of the server
logic by calling input$sexInput (notice the use of the $ to access the input
value associated with sexInput). In “version 3” of the application, we use
the information held in input$sexInput to subset names.df then create a
histogram of names.df$Age. This histogram graphic is included as an element
in the output object and ultimately made available in the UI portion of the
script by referencing its name histogram.
Reactive portions of our app.R script’s server logic are inside the server func-
tion. We create reactive outputs using Shiny’s functions like renderPlot.4
Obviously, renderPlot renders the contents of the function into a plot ob-
ject; this is then stored in the output list wherever we assign it (in this case,
output$histogram). Notice the contents of the renderPlot function are con-
4 Every reactive output function’s name in Shiny is of the form render*.
tained not only by regular parentheses, but also by curly brackets (just one
more piece of syntax to keep track of).
# app.R version 3
library(shiny)
# Define UI
ui <- fluidPage(
titlePanel("Random Names Age Analysis"),
sidebarLayout(
sidebarPanel(
# Dropdown selection for Male/Female
selectInput(inputId = "sexInput", label = "Sex:",
choices = c("Female" = "F",
"Male" = "M",
"Both" = "B"))
),
mainPanel("our output will appear here")
)
)
if(input$sexInput != "B"){
subset.names.df <- subset(names.df, Sex == input$sexInput)
} else {
subset.names.df <- names.df
}
}
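The listing above shows only the subsetting logic. For reference, here is a sketch of the complete version 3 server function consistent with the description above; the hist() arguments are assumptions:
server <- function(input, output) {
  output$histogram <- renderPlot({
    if (input$sexInput != "B") {
      subset.names.df <- subset(names.df, Sex == input$sexInput)
    } else {
      subset.names.df <- names.df
    }
    hist(subset.names.df$Age, xlab = "Age", main = "Age distribution")
  })
}

shinyApp(ui = ui, server = server)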
Update your app.R server logic function to match the code above. Rerun the
application. Note the appearance of the app doesn’t change because we have
not updated the UI portion with the resulting histogram.
Now we update the UI part of app.R to make the app interactive. In the
“version 4” code below, the plotOutput("histogram") function in ui accesses
the histogram component of the output list and plots it in the main panel.
Copy the code below and rerun the application. You have now created your
first Shiny app!
# app.R version 4
library(shiny)
# Define UI
ui <- fluidPage(
titlePanel("Random Names Age Analysis"),
sidebarLayout(
sidebarPanel(
# Dropdown selection for Male/Female
selectInput(inputId = "sexInput", label = "Sex:",
choices = c("Female" = "F",
"Male" = "M",
"Both" = "B"))
),
mainPanel(plotOutput("histogram"))
)
)
9.4 More Advanced Shiny App: Michigan Campgrounds
u <- "https://www.finley-lab.com/files/data/Michigan_State_Park_Campgrounds.csv"
sites <- read.csv(u, stringsAsFactors = F)
str(sites)
First, let’s look at the structure of the page. Similar to our first application,
we again use a fluidPage layout with title panel and sidebar. The sidebar
contains a sidebar panel and a main panel. Our sidebar panel has three user
input widgets:
• sliderInput: Allows user to specify a range of campsites desired in their
campground. Since the maximum number of campsites in any Michigan state
park campground is 411, 420 was chosen as the maximum.
• selectInput: Allows user to select what type of campsites they want. To
get the entire list of camp types, we used the data frame, sites, loaded at
the beginning of the script.
• checkboxInput: Allows the user to see only campgrounds with ADA sites
available.
Table 9.1 provides a list of the various input and output elements. Take your
time and track how the app defines then uses each input and output.
In creating our server variable, we have two functions that fill in our output
elements.
9.7 Exercises
Exercise Shiny Learning objectives: practice updating ggplot2 plot aesthet-
ics; modify Shiny HTML output; add an interactive Shiny element.
9 https://www.shinyapps.io/
10 http://docs.rstudio.com/shinyapps.io/index.html
11 https://shiny.rstudio.com/tutorial/
12 https://shiny.rstudio.com/articles/css.html
10
Classification
Classification problems are quite common. For example, a spam filter is asked
to classify incoming messages into spam or non-spam, based on factors such as
the sender's address, the subject of the message, the contents of the message,
and so on. As another example, a doctor diagnoses a patient into one of four
possible conditions based on symptoms, blood tests, and medical history. Or a
bank may want to determine (prior to giving a loan) whether an applicant will
default on the loan, again based on several variables such as income, financial
history, etc.
Classification methods are both extremely useful and an active area of research
in statistics. In this chapter we will learn about two common, and somewhat
different, classification methods, logistic regression and k nearest neighbors. A
very good introduction to classification and many other “Statistical Learning”
methods is James et al. (2014). The abbreviated treatment in this chapter
draws from James et al. (2014).
$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}.$$
1 For those not familiar with probability notation, the right side of this equation reads as "the probability that Y equals 1, given the value of X."
library(ggplot2)
logistic <- function(x){exp(x)/(1 + exp(x))}
ggplot(data.frame(x = c(-6, 6)), aes(x)) +
stat_function(fun = logistic)
library(MASS)
head(Pima.tr)
2 This practice of randomly separating data into training data and testing data is a common way to assess how well a fitted model predicts new observations.
Coefficients:
(Intercept) glu
-5.5036 0.0378
summary(diabetes.lr1)
Call:
glm(formula = diabetes ~ glu, family = binomial, data = Pima.tr)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.971 -0.779 -0.529 0.849 2.263
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.50364 0.83608 -6.58 0.000000000046
glu 0.03778 0.00628 6.02 0.000000001756
(Intercept) ***
glu ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Intercept)
-5.504
beta1.lr.1
glu
0.03778
The coefficients β0 and β1 are approximately -5.504 and 0.038, respectively. So
for example we can estimate the probability that a woman in this population
whose glucose level is 150 would be diabetic as
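# (The computation itself is not shown in this copy; one way to obtain the fitted
#  probability at glu = 150 is via predict(). This is a sketch, not the book's code.)
predict(diabetes.lr1, newdata = data.frame(glu = 150), type = "response")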
[1] 0.5488
We can plot the fitted probabilities along with the data “by hand”.
The ggplot2 package also provides a way to do this more directly, using
stat_smooth.
stat_smooth(method = "glm",
method.args = list(family = "binomial"), se = FALSE)
From these graphics we can see that although glucose level and diabetes are
related, there are many women with high glucose levels who are not diabetic,
and many with low glucose levels who are diabetic, so likely adding other
predictors to the model will help.
Next let’s see how the model does in predicting diabetes status in the data we
did not use for fitting the model. We will predict diabetes for anyone whose
glucose level leads to a model-based probability greater than 1/2. First we
use the predict function to compute the probabilities, and then use these to
make predictions.
head(Pima.te)
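The prediction step itself is not shown; a sketch (the object names are assumptions) that produces probabilities like those below, along with the class predictions used later:
diabetes.probs.1 <- predict(diabetes.lr1, newdata = Pima.te, type = "response")
head(diabetes.probs.1)
diabetes.predict.1 <- ifelse(diabetes.probs.1 > 0.5, "Yes", "No")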
1 2 3 4 5 6
0.52207 0.09179 0.10519 0.07199 0.87433 0.68319
The predict function (with type = "response" specified) provides p(x) =
P (Y = 1|X = x) for all the x values in a data frame. In this case we specified
the data frame Pima.te since we want to know how the model does in predicting
diabetes in a new population, i.e., in a population that wasn’t used to “train”
the model.
table(diabetes.predict.1, Pima.te$type)
diabetes.predict.1 No Yes
No 206 58
Yes 17 51
length(diabetes.predict.1[diabetes.predict.1 == Pima.te$type])/
dim(Pima.te)[1]
[1] 0.7741
The table (sometimes called a confusion matrix) has the predictions of the
model in the rows, so for example we see that the model predicts that 206+58 =
264 of the women will not be diabetic, and that 17 + 51 = 68 of the women will
be diabetic. More interesting of course are the cells themselves. For example,
of the 206 + 17 = 223 women who are not diabetic in Pima.te, the model
correctly classifies 206, and misclassifies 17. A classifier that predicted perfectly
for the test data would have zeros off the diagonal.
Although there is a lot more notation to keep track of, the basic ideas are the
same as they were for the one predictor model. We will next see how adding
bmi, the body mass index, affects predictions of diabetes.
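A sketch of the (not shown) fit; the formula matches the summary output below:
diabetes.lr2 <- glm(diabetes ~ glu + bmi, family = binomial, data = Pima.tr)
diabetes.lr2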
Coefficients:
(Intercept) glu bmi
-8.2161 0.0357 0.0900
summary(diabetes.lr2)
Call:
glm(formula = diabetes ~ glu + bmi, family = binomial, data = Pima.tr)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.058 -0.757 -0.431 0.801 2.249
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.21611 1.34697 -6.10 0.0000000011
glu 0.03572 0.00631 5.66 0.0000000152
bmi 0.09002 0.03127 2.88 0.004
(Intercept) ***
glu ***
bmi **
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1 2 3 4 5 6
0.52359 0.05810 0.07530 0.06662 0.82713 0.50879
table(diabetes.predict.2, Pima.te$type)
diabetes.predict.2 No Yes
No 204 54
Yes 19 55
length(diabetes.predict.2[diabetes.predict.2 == Pima.te$type])/
dim(Pima.te)[1]
[1] 0.7801
Adding bmi as a predictor did not improve the predictions by very much!
Let x1 and x2 represent glucose and bmi levels. We classify a subject as
“diabetic” if the fitted p(X) is greater than 0.5, i.e., if the fitted probability of
diabetes is greater than the fitted probability of not diabetes. The boundary
for our decision is where these two fitted probabilities are equal, i.e., where
$$\frac{P(Y = 1 \mid (x_1, x_2))}{P(Y = 0 \mid (x_1, x_2))} = \frac{P(Y = 1 \mid (x_1, x_2))}{1 - P(Y = 1 \mid (x_1, x_2))} = 1.$$
Writing these out in terms of the logistic regression model, taking logarithms,
and performing some algebra leads to the following (linear!) decision boundary:
$$x_2 = -\frac{\beta_0}{\beta_2} - \frac{\beta_1}{\beta_2} x_1.$$
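Equivalently, the ratio above equals 1 exactly when the fitted log odds are zero, i.e., when $\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$; solving for $x_2$ gives the boundary displayed above.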
The diabetes training data along with the decision boundary are plotted below.
data(iris)
head(iris)
str(iris)
(Scatter plots of the iris data: Sepal.Width versus Sepal.Length, and Petal.Width versus Petal.Length, with points colored by Species.)
Here the potential predictors are sepal width, sepal length, petal width, and
petal length. The goal is to find a classifier that will yield the correct species.
From the scatter plots it should be pretty clear that a model with petal length
and petal width as predictors would classify the data well. Although in a sense
this is too easy, these data do a good job of illustrating logistic regression with
more than two classes.
Before doing that we randomly choose 75 of the 150 rows of the data frame to
be the training sample, with the other 75 being the test sample.
set.seed(321)
selected <- sample(1:150, replace = FALSE, size = 75)
iris.train <- iris[selected,]
iris.test <- iris[-selected,]
There are several packages which implement logistic regression for data with
more than two classes. We will use the VGAM package. The function vglm within
VGAM implements logistic regression (and much more).
library(VGAM)
iris.lr <- vglm(Species ~ Petal.Width + Petal.Length,
data = iris.train, family = multinomial)
summary(iris.lr)
Call:
vglm(formula = Species ~ Petal.Width + Petal.Length, family = multinomial,
data = iris.train)
Pearson residuals:
Min 1Q
log(mu[,1]/mu[,3]) -0.0000397 0.0000000175
log(mu[,2]/mu[,3]) -1.5279112 -0.0020907887
Median 3Q Max
log(mu[,1]/mu[,3]) 0.0000000257 0.0000000395 0.0000796
log(mu[,2]/mu[,3]) 0.0000000972 0.0101234167 1.6385058
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 104.97 8735.20 NA NA
(Intercept):2 52.71 27.05 NA NA
Petal.Width:1 -41.55 21914.63 NA NA
Petal.Width:2 -9.18 4.49 -2.04 0.041 *
Petal.Length:1 -18.93 8993.53 NA NA
Petal.Length:2 -7.62 4.90 -1.55 0.120
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that the family is specified as multinomial rather than binomial since
we have more than two classes. When run with these data, the vglm function
returns several (about 20) warnings. These occur mainly because the classes
are so easily separated, and are suppressed above.
Next we compute the probabilities for the test data.
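A sketch of the (not shown) prediction step:
iris.probs <- predict(iris.lr, newdata = iris.test, type = "response")
head(iris.probs)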
setosa versicolor
2 1 0.00000000000009874
5 1 0.00000000000009874
6 1 0.00000000190561922
7 1 0.00000000000251434
10 1 0.00000000000001202
11 1 0.00000000000030599
virginica
2 0.00000000000000000000000000000034016
5 0.00000000000000000000000000000034016
6 0.00000000000000000000000040468850399
7 0.00000000000000000000000000002168557
10 0.00000000000000000000000000000003543
11 0.00000000000000000000000000000225863
At least for the first six cases, one probability is close to one and the other
two are close to zero, reflecting the fact that this is an easy classification
problem. Next we extract the actual predictions. For these, we want to choose
the highest probability in each row. The which.max function makes this easy.
Before applying this to the fitted probabilities, we illustrate its use. Take notice
that which.max() returns the position of the highest value, not the value itself.
which.max(c(2,3,1,5,8,3))
[1] 5
which.max(c(2,20,4,5,9,1,0))
[1] 2
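A sketch of how the class predictions below can be obtained from the fitted probabilities:
max.col <- apply(iris.probs, 1, which.max)
head(max.col)
class.predictions <- colnames(iris.probs)[max.col]
names(class.predictions) <- names(max.col)
head(class.predictions)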
2 5 6 7 10 11
1 1 1 1 1 1
2 5 6 7 10 11
"setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
Next we create the confusion matrix.
table(class.predictions, iris.test$Species)
Although in principle kNN is simple, some issues arise. First, how should k be
chosen? There is not an easy answer, but it can help to think of the extreme
values for k.
The largest possible k is the number of observations in the training set.
For example suppose that the training set has 10 observations, with classes
1, 1, 1, 2, 2, 2, 3, 3, 3, 3. Then for any point in the test set, the k = 10 nearest
neighbors will include ALL of the points in the training set, and hence every
point in the test set will be classified in class 3. This classifier has low (zero)
variance, but probably has high bias.
The smallest possible k is 1. In this case, each point in the test set is put in the
same class as its nearest neighbor in the training set. This may lead to a very
non-smooth and high variance classifier, but the bias will tend to be small.
A second issue that is relatively easy to deal with concerns the scales on which
the x values are measured. If for example one x variable has a range from 2 to
4, while another has a range from 2000 to 4000, the distance between the test
and training data will be dominated by the second variable. The solution that
is typically used is to standardize all the variables (rescale them so that their
mean is 0 and their standard deviation is 1).
These and other issues are discussed in the literature on kNN, but we won’t
pursue them further.
There are at least three R packages which implement kNN, including class,
kknn, and RWeka. We will use class below.
An example from Hastie et al. (2009) will be used to give a better sense of the
role of k in the kNN algorithm. The example uses simulated data and shows
the decision boundaries for kNN with k = 15 and k = 1.3 . Although the R
code used to draw the displays is given below, focus mainly on the graphics
produced, and what they tell us about kNN.
First the data are read into R and graphed. The predictors x1 and x2, while not
standardized, have very similar standard deviations, so we will not standardize
these data before applying kNN.
3 See https://stackoverflow.com/questions/31234621/variation-on-how-to-plot-decision-boundary-of-a-k-nearest-neighbor-classifier-f
Next a large set of test data is created using the expand.grid function, which
creates a data frame with all possible combinations of the arguments. First
a simple example to illustrate the function, then the actual creation of the
test set. The test set covers the range of the x1 and x2 values in the
training set.
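The simple example's code is not shown in this copy; a call consistent with the output below would be:
expand.grid(x = c(1, 2), y = c(5, 3.4, 2))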
x y
1 1 5.0
2 2 5.0
3 1 3.4
4 2 3.4
5 1 2.0
6 2 2.0
min(knnExample$x1)
[1] -2.521
max(knnExample$x1)
[1] 4.171
min(knnExample$x2)
[1] -2
max(knnExample$x2)
[1] 2.856
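A sketch of the (not shown) creation of the test grid; the grid resolution is an assumption:
x.test <- expand.grid(x1 = seq(min(knnExample$x1), max(knnExample$x1), length.out = 100),
                      x2 = seq(min(knnExample$x2), max(knnExample$x2), length.out = 100))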
Next the kNN with k = 15 is fit. Notice that the first argument gives the x
values in the training set, the next argument gives the x values in the test set,
the third argument gives the y values (labels) from the training set. The fourth
argument gives k, and the fifth argument asks for the function to return the
probabilities of membership (that is, the proportion of the nearest neighbors
which were in the majority class) as well as the class assignments.
library(class)
Example_knn <- knn(knnExample[,c(1,2)], x.test, knnExample[,3], k = 15, prob = TRUE)
prob <- attr(Example_knn, "prob")
head(prob)
library(dplyr)
df1 <- mutate(x.test, prob = prob, class = 0,
              prob_cls = ifelse(Example_knn == class, 1, 0))
str(df1)
names(knnExample)
ggplot(bigdf) +
  geom_point(aes(x = x1, y = x2, col = class),
             data = mutate(x.test, class = Example_knn), size = 0.5) +
  geom_point(aes(x = x1, y = x2, col = as.factor(class)),
             size = 4, shape = 1, data = knnExample) +
  geom_contour(aes(x = x1, y = x2, z = prob_cls,
                   group = as.factor(class), color = as.factor(class)),
               bins = 2)
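The refit with k = 1 is not shown in this copy; a sketch consistent with the output below:
Example_knn1 <- knn(knnExample[, c(1, 2)], x.test, knnExample[, 3], k = 1, prob = TRUE)
head(attr(Example_knn1, "prob"))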
[1] 1 1 1 1 1 1
Next kNN is applied to the diabetes data. We will use the same predictors,
glu and bmi that were used in the logistic regression example. Since the scales
of the predictor variables are substantially different, they are standardized
first. The value k = 15 is chosen for kNN.
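A sketch of the (not shown) standardization and kNN fit; rescaling the test set with the training means and standard deviations is an assumption about the details:
train.X <- scale(Pima.tr[, c("glu", "bmi")])
test.X <- scale(Pima.te[, c("glu", "bmi")],
                center = attr(train.X, "scaled:center"),
                scale = attr(train.X, "scaled:scale"))
knn_Pima <- knn(train.X, test.X, Pima.tr$type, k = 15)
table(knn_Pima, Pima.te$type)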
knn_Pima No Yes
No 206 55
Yes 17 54
At least in terms of the confusion matrix, kNN with k = 15 performed about
as well as logistic regression for these data.
Produce a figure that displays how the number of false positives produced
from the kNN classifier for the diabetes data set changes for all integer values
of k from 1 to 40. Use this graph to justify whether or not k = 15 was a valid
choice for the number of neighbors.
Now kNN is used to classify the iris data. As before we use petal length
and width as predictors. The scales of the two predictors are not particularly
different, so we won’t standardize the predictors. Unsurprisingly kNN does
well in classifying the test set for a wide variety of k values.
sd(iris.train$Petal.Width)
[1] 0.7316
sd(iris.train$Petal.Length)
[1] 1.705
head(iris.train)
10.3 Exercises
Exercise Classification Learning objectives: explore the logistic regression
classification method; apply the kNN classification method; create confusion
matrices to compare classification methods; plot classified data.
11
Text Data
Many applications require the ability to manipulate and process text data.
For example, an email spam filter takes as its input various features of email
such as the sender, words in the subject, words in the body, the number and
types of attachments, and so on. The filter then tries to build a classifier which
can correctly classify a message as spam or not spam (aka ham). As another
example, some works of literature, such as some of Shakespeare’s plays or
some of the Federalist papers, have disputed authorship. By analyzing word
use across many documents, researchers try to determine the author of the
disputed work.
Working with text data requires functions that will, for example, concatenate
and split text strings, modify strings (e.g., converting to lower-case or removing
vowels), count the number of characters in a string, and so on. In addition to
being useful in such contexts, string manipulation is helpful more generally in
R—for example, to effectively construct titles for graphics.
As with most tasks, there are a variety of ways to accomplish these text
processing tasks in R. The base R package has functions which work with and
modify text strings. Another useful package which approaches these tasks in a
slightly different way is stringr. As with graphics, we will focus mainly on
one package to avoid confusion. In this case we will focus on the base R string
processing functions, but will emphasize that stringr is also worth knowing.
The application to analyzing Moby Dick below comes from the book Text
Analysis with R for Students of Literature by Matthew L. Jockers.
Often text data will not be in a rectangular format that is suitable for reading
into a data frame. For example, an email used to help train a spam filter, or
literary texts used to help determine authorship of a novel are certainly not
of this form. Often when working with text data we want to read the whole
text object into a single R vector. In this case either the scan function or
the readLines function are useful. The readLines function is typically more
efficient, but scan is much more flexible.
As an example, consider the following email message and a plain text version of
the novel Moby Dick by Herman Melville, the beginning of which is displayed
subsequently.
From safety33o@l11.newnamedns.com Fri Aug 23 11:03:37 2002
Return-Path: <safety33o@l11.newnamedns.com>
Delivered-To: zzzz@localhost.example.com
Received: from localhost (localhost [127.0.0.1])
by phobos.labs.example.com (Postfix) with ESMTP id 5AC994415F
for <zzzz@localhost>; Fri, 23 Aug 2002 06:02:59 -0400 (EDT)
Received: from mail.webnote.net [193.120.211.219]
by localhost with POP3 (fetchmail-5.9.0)
for zzzz@localhost (single-drop); Fri, 23 Aug 2002 11:02:59 +0100 (IST)
Received: from l11.newnamedns.com ([64.25.38.81])
by webnote.net (8.9.3/8.9.3) with ESMTP id KAA09379
for <zzzz@example.com>; Fri, 23 Aug 2002 10:18:03 +0100
From: safety33o@l11.newnamedns.com
Date: Fri, 23 Aug 2002 02:16:25 -0400
Message-Id: <200208230616.g7N6GOR28438@l11.newnamedns.com>
To: kxzzzzgxlrah@l11.newnamedns.com
Reply-To: safety33o@l11.newnamedns.com
Subject: ADV: Lowest life insurance rates available!
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
Language: English
*** START OF THIS PROJECT GUTENBERG EBOOK MOBY DICK; OR THE WHALE ***
By Herman Melville
The email message is available at https://www.finley-lab.com/files/data/email1.txt while the novel is available at https://www.finley-lab.com/files/data/mobydick.txt. We will read these into R using scan.
First, we read in the email message. The scan function has several possible
arguments. For now the important arguments are the file to be read (the
argument is named file), the type of data in the file (the argument is named
what), and how the fields in the file are separated (the argument is named
sep). To illustrate the sep argument, the file will be read into R once with
sep = "" indicating that the separator is whitespace, and once with sep =
"\n" indicating that the separator is the newline character, i.e., each field in
the file is a line.
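The first scan() call is not shown in this copy; a sketch, assuming email1.txt is in the working directory:
email1 <- scan("email1.txt", what = "character", sep = "")
length(email1)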
[1] 133
email1[1:10]
[1] "From"
[2] "safety33o@l11.newnamedns.com"
[3] "Fri"
[4] "Aug"
[5] "23"
[6] "11:03:37"
[7] "2002"
[8] "Return-Path:"
[9] "<safety33o@l11.newnamedns.com>"
[10] "Delivered-To:"
[1] 26
email1[1:10]
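The read of the novel is likewise not shown; a sketch (the file path is an assumption):
moby_dick <- scan("mobydick.txt", what = "character", sep = "\n")
moby_dick[1:25]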
[1] "The Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville"
[2] "This eBook is for the use of anyone anywhere at no cost and with"
[3] "almost no restrictions whatsoever. You may copy it, give it away or"
[4] "re-use it under the terms of the Project Gutenberg License included"
[5] "with this eBook or online at www.gutenberg.org"
[6] "Title: Moby Dick; or The Whale"
[7] "Author: Herman Melville"
[8] "Last Updated: January 3, 2009"
[9] "Posting Date: December 25, 2008 [EBook #2701]"
[10] "Release Date: June, 2001"
[11] "Language: English"
[12] "*** START OF THIS PROJECT GUTENBERG EBOOK MOBY DICK; OR THE WHALE ***"
[13] "Produced by Daniel Lazarus and Jonesey"
[14] "MOBY DICK; OR THE WHALE"
[15] "By Herman Melville"
[16] "Original Transcriber's Notes:"
[17] "This text is a combination of etexts, one from the now-defunct ERIS"
[18] "project at Virginia Tech and one from Project Gutenberg's archives. The"
[19] "proofreaders of this version are indebted to The University of Adelaide"
[20] "Library for preserving the Virginia Tech version. The resulting etext"
[21] "was compared with a public domain hard copy version of the text."
[22] "In chapters 24, 89, and 90, we substituted a capital L for the symbol"
[23] "for the British pound, a unit of currency."
[24] "ETYMOLOGY."
[25] "(Supplied by a Late Consumptive Usher to a Grammar School)"
You will notice that the scan function ignored blank lines in the file. If it is
important to preserve blank lines, the argument blank.lines.skip = FALSE
can be supplied to scan.
The file containing the novel contains some introductory and closing text that
is not part of the original novel. If we are interested in Melville’s writing, we
should remove this text. By inspection we can discover that the novel’s text
begins at position 408 and ends at position 18576.
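A sketch of the (not shown) subsetting step:
moby_dick <- moby_dick[408:18576]
length(moby_dick)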
[1] 18169
moby_dick[1:4]
moby_dick[18165:18169]
[1] "they glided by as if with padlocks on their mouths; the savage sea-hawks"
[2] "sailed with sheathed beaks. On the second day, a sail drew near, nearer,"
[3] "and picked me up at last. It was the devious-cruising Rachel, that in"
[4] "her retracing search after her missing children, only found another"
[5] "orphan."
n <- 10
paste("The value of n is", n)
paste(c("pig", "dog"), 3)
[1] "mail.google.com"
[1] "and/or"
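The example being described below is not shown in this copy; calls consistent with the output and the discussion would be:
first <- c("one", "two", "three", "four", "five")
second <- c("six", "seven", "eight", "nine", "ten")
paste(first, second)
paste(first, second, collapse = " ")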
[1] "one six two seven three eight four nine five ten"
In the example above by default paste created a vector with five elements,
each containing one input string from the first input vector and one from
the second vector, pasted together. When a non NULL argument was specified
for collapse, the vector created had one element, with the pasted strings
separated by that argument.1
1 There is a somewhat subtle difference among the examples. If all the arguments are
length one vectors, then paste by default returns a length one vector. But if one or more of
the arguments have length greater than one, the default behavior of paste is to return a
vector of length greater than one. The collapse argument changes this behavior.
Also don’t forget that R “recycles” values from vectors if two or more different
length vectors are provided as input.
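A call consistent with the recycling output below (the book's code is not shown):
paste(c("a", "b"), 1:9, sep = "")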
[1] "a1" "b2" "a3" "b4" "a5" "b6" "a7" "b8" "a9"
Next, consider writing a function which simulates repeatedly tossing a coin n
times, counting the number of HEADS out of the n tosses. For the first five
repetitions of n tosses, the function will print out the number of HEADS (for
example if there are 7 HEADS in the n = 10 tosses the function prints “The
number of HEADS out of 10 tosses is 7.” The function returns a histogram of
the number of HEADS, with a title stating “Number of HEADS in ?? tosses”
where ?? is replaced by the number of tosses. The paste function will help
greatly.
Let’s now return to the object moby_dick that contains the text of the novel.
If we want to analyze word choice, word frequency, etc., it would be helpful to
form a vector in which each element is a word from the novel. One way to do
this is to first paste the current version of the moby_dick variable into a new
version which is one long vector with the lines pasted together. To illustrate,
we will first do this with a much smaller object that shares the structure of
moby_dick.
[1] 1
small_novel
[1] 1
At this point moby_dick contains a single very long character string. Next we
will separate this string into separate words and clean up the resulting vector
a bit.
R contains functions tolower and toupper which very simply change the case
of all characters in a string.
x <- "aBCdefG12#"
y <- x
tolower(x)
[1] "abcdefg12#"
toupper(y)
[1] "ABCDEFG12#"
If we are interested in frequencies of words in Moby Dick, converting all the text
to the same case makes sense, so for example the word “the” at the beginning
of a sentence is not counted differently than the same word in the middle of a
sentence.
nchar("dog")
[1] 3
[1] 3 3 5 8
[1] 3 3 5 8 NA 4
[1] 3 3 5 8 2 4
nchar(moby_dick)
[1] 1190309
By default nchar returns NA for a missing value. If you want nchar to return
2 for an NA value, you can set keepNA = FALSE.2
The function strsplit splits the elements of a character vector. The function
returns a list, and often the unlist function is useful to convert the list into
an atomic vector.
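A call consistent with the list output below (the original code is not shown):
strsplit(c("mail.msu.edu", "mail.google.com", "www.amazon.com"),
         split = ".", fixed = TRUE)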
[[1]]
[1] "mail" "msu" "edu"
[[2]]
[1] "mail" "google" "com"
2 It may be reasonable if the purpose of counting characters is to find out how much space
to allocate for printing a vector of strings where the NA string will be printed.
[[3]]
[1] "www" "amazon" "com"
[1] "d" "g" "c" "t" "p" "g" "h" "rs" "r" "bb"
[11] "t"
The regular expression [aeiou] represents any of the letters a, e, i, o, u. In
general a string of characters enclosed in square brackets indicates any one
character in the string.
moby_dick[1:50]
unlist(strsplit(c("the rain", "in Spain stays mainly", "in", "the plain"), split = "[^0-9A-Za-z]"))
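The corresponding (not shown) split of the novel might look like the following sketch; the lengths printed below correspond to the word vector before and after the blanks are removed:
moby_dick <- unlist(strsplit(moby_dick, split = "[^0-9A-Za-z]"))
moby_dick <- moby_dick[moby_dick != ""]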
length(moby_dick)
[1] 253993
[1] 214889
moby_dick[1:50]
(If the split pattern had matched runs of non-letter characters rather than single
characters, the second step of selecting the non-blank words would not have been
necessary. But regular expressions will be essential going forward, so it was
worthwhile using regular expressions even if they do not provide the most
efficient method.)
Use strsplit() and regular expressions to split the following strings into
their respective words (i.e. write a regular expression that will match the -
and . character). Your output should be a vector (not a list).
Now that the vector moby_dick contains each word in the novel as a separate
element, it is relatively easy to do some basic analyses. For example the nchar
function can give us a count of the number of characters in each element of
the vector, i.e., can give us the number of letters in each word in the novel.
[1] 7 1 8 4 2 7 4 5 3 5 4 3 4 9 6 6 2
[18] 2 5 2 2 5 3 7 10 2 8 2 2 5 1 7 1 5
[35] 4 5 1 6 3 3 3 6 4 2 3 5 2 2 1 3
max(moby_dick_nchar)
[1] 20
(Histogram of the number of characters per word in Moby Dick.)
(Word cloud of word frequencies in Moby Dick; the most frequent words include "the", "of", "and", "a", "to", "in", "that", "it", and "his".)
The substr function can be used to extract or replace substrings. The first
argument is the string to be manipulated, and the second and third arguments
specify the first and last elements of the string to be extracted or to be replaced.
x <- "Michigan"
substr(x, 3, 4)
[1] "ch"
[1] "MiCHigan"
strtrim("Michigan", 1)
[1] "M"
strtrim("Michigan", 4)
[1] "Mich"
strtrim("Michigan", 100)
[1] "Michigan"
The grep function searches for a specified pattern and returns either the
locations where this pattern is found or the selected elements. The grepl
function returns a logical vector rather than locations of elements.
Here are some examples. All use fixed = TRUE since at this point we are using
fixed character strings rather than regular expressions.
grep("a", c("the rain", "in Spain stays mainly", "in", "the plain"),
fixed = TRUE)
[1] 1 2 4
grep("a", c("the rain", "in Spain stays mainly", "in", "the plain"),
fixed = TRUE, value = TRUE)
grepl("a", c("the rain", "in Spain stays mainly", "in", "the plain"),
fixed = TRUE)
11.4 Exercises
Exercise Text Data Learning objectives: read and write text data; concate-
nate text with the paste function, analyze text with nchar; practice with
functions; manipulate strings with substr and strtrim.
12
Rcpp
12.1.1 Installation
Before we install the Rcpp package, we need a working C++ compiler. To get
the compiler:
• On Windows, install Rtools.4
• On macOS, install Xcode from the App Store.
• On Linux, install a compiler from your distribution's package manager, e.g., sudo apt-get install r-base-dev on Debian/Ubuntu.
1 You do not need prior C++ experience to complete this chapter and corresponding
exercises, but we are hoping you learn a little C++ syntax along the way!
2 http://www.rcpp.org/
3 http://adv-r.had.co.nz/Rcpp.html
#include <iostream>
int main()
{
std::cout << "Hello World!" << std::endl;
}
Think about the C++ code above. How would you accomplish the same task
in R? How are the programs similar in syntax and structure? How are they
different? In the following section, we will rewrite this program in a format that
can interface with R through Rcpp. Note that C++ is a compiled language,
which means we cannot run the program line-by-line like we can with an R
script.
Let’s start out with that simple, obligatory “Hello World!” program in C++
that we will export to R. C++ files use the extension *.cpp. Through the
RStudio menu, create a new C++ File (File > New File > C++ File). Note
that the default C++ file contains the line #include <Rcpp.h>. In C++,
these include files are header files that provide us with objects and functions
4 https://cran.r-project.org/bin/windows/Rtools/
5 For a collection of “Hello World” programs in 400+ programming languages, see here6 .
that we can use within our C++ file. Although the program building process is
different, there are some parallels between these header files and the packages
we use in our R code. Note that unlike R, every statement in a C++ program
must end with a semicolon.
To verify that Rcpp is working correctly, let’s run the following code. Save the
document as hello.cpp, and enter the following code:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
void hello(){
Rprintf("Hello World!\n");
}
Now, we can go to the RStudio Console and test our code with the following
three lines and the expected output.
library(Rcpp)
sourceCpp("hello.cpp")
hello()
[1] "Hello"
The sourceCpp function parsed our C++ file (“hello.cpp”) and looked for
functions marked with the Rcpp::export attribute. A shared library was then
built and the exported function (hello()) was made available as an R function
in the current environment. The function simply printed “Hello World!” to our
RStudio Console. Great! You have now written a C++ function using built-in
C++ types and Rcpp wrapper types and then sourced them like an R script.
Maintaining C++ code in its own source file provides several benefits including
the ability to use C++ aware text-editing tools and straightforward mapping
of compilation errors to lines in the source file. However, it’s also possible to
do inline declaration and execution of C++ code.
There are several ways to accomplish this, including passing a code string to
sourceCpp() or using the shorter-form cppFunction() or evalCpp() func-
tions. Run the following code:
library(Rcpp)
cppFunction("
int add(int x, int y, int z) {
int sum = x + y + z;
return sum;
}"
)
add(1, 2, 4)
[1] 7
a <- add(3, 2, 1)
a
[1] 6
When you run this code, Rcpp will compile the C++ code and construct an R
function that connects to the compiled C++ function. We’re going to use this
simple interface to learn how to write C++. C++ is a large language, and
there’s no way to cover it all in the limited time we have. Instead, you’ll get
the basics so that you can start writing useful functions to address bottlenecks
in your R code.
Let’s start with a very simple function. It has no arguments and always returns
the integer 1:
int one() {
return 1;
}
cppFunction("
int one() {
return 1;
}"
)
• You must declare the type of output the function returns. This function
returns an int (a scalar integer). The classes for the most common types of R
vectors are: NumericVector, IntegerVector, CharacterVector, and LogicalVector.
• Scalars and vectors are different. The scalar equivalents of numeric, integer,
character, and logical vectors are: double, int, String, and bool.
• You must use an explicit return statement to return a value from a function.
• Every statement is terminated by a semicolon.
The next example function implements a scalar version of the sign() function
which returns 1 if the input is positive, and -1 if it’s negative:
cppFunction("
int signC(int x) {
if (x > 0) {
return 1;
} else if (x == 0) {
return 0;
} else {
return -1;
}
}"
)
C++ also has a while loop that works the same way as R's, e.g., as in R you can use break to exit the
loop.
One big difference between R and C++ is that the cost of loops is much lower
in C++. For example, we could implement the sum() function in R using a
loop. If you’ve been programming in R a while, you’ll probably have a visceral
reaction to this function!
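The loop-based R version referred to above is not shown in this copy; a sketch (it is the sumR used in the benchmark below):
sumR <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  total
}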
In C++, loops have very little overhead, so it’s fine to use them.
cppFunction("
double sumC(NumericVector x) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; i++) {
total += x[i];
}
return total;
}"
)
library(microbenchmark)
x <- runif(1000)
microbenchmark(sum(x), sumC(x), sumR(x))
Unit: nanoseconds
    expr   min    lq  mean median    uq     max neval cld
  sum(x)   977  1003  1127   1034  1066    8803   100   a
 sumC(x)  2314  2402 12831   2459  2532 1014755   100   a
 sumR(x) 27797 27906 56479  27992 28153 2770329   100   a
The microbenchmark() function runs each function it’s passed 100 times and
provides summary statistics for the multiple execution times. This is a very
handy tool for testing various function implementations (as illustrated above).
Next we’ll create a function that computes the Euclidean distance between a
value and a vector of values:
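The plain R version referred to in the next sentence is not shown in the excerpt; it presumably looked something like this:
# R version: Euclidean distance between a single value x and each element of ys.
pdistR <- function(x, ys) {
  sqrt((x - ys)^2)
}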
It’s not obvious that we want x to be a scalar from the function definition.
We’d need to make that clear in the documentation. That’s not a problem in
the C++ version because we have to be explicit about types:
cppFunction("
NumericVector pdistC(double x, NumericVector ys) {
12.4 No input and scalar output 269
int n = ys.size();
NumericVector out(n);
cppFunction("
NumericVector rowSumsC(NumericMatrix x) {
int nrow = x.nrow(), ncol = x.ncol();
NumericVector out(nrow);
}
out[i] = total;
}
return out;
}"
)
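The matrix x compared below is not created in this excerpt; it is presumably a 10-row numeric matrix built with something like the following (the seed and values are assumptions, so the printed sums below need not correspond to this exact matrix):
set.seed(1014)
x <- matrix(sample(100), nrow = 10)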
rowSums(x)
[1] 603 548 514 472 333 564 520 461 566 469
rowSumsC(x)
[1] 603 548 514 472 333 564 520 461 566 469
Let’s look at the main differences:
• In C++, you subset a matrix with (), not [].
• In C++, use .nrow() and .ncol() methods to get the dimensions of a
matrix.
12.5 Exercises
Exercise Rcpp Learning objectives: practice using Rcpp to run a C++ func-
tion through R; use microbenchmark() to compare function performance.
13
Databases and R
For example, we might want to join together the two tables to obtain email
information and address information for any given student at the same time.
Data analysts and data scientists need a way to perform tasks like this directly
on databases. Fortunately, almost every relational database can be queried and
manipulated using SQL, which makes such queries possible.
13.1.1 SQL
Let’s first explore a little bit of SQL syntax. A basic SQL query consists of
three different parts:
Using these three statements you are able to perform queries on the database
to extract data from different tables. For example, consider the database
mentioned previously consisting of student records. To obtain all the email
addresses from the email table, our query would look like this
1 The pronunciation of SQL is highly debated. I prefer to pronounce it like the word
"sequel", but many others prefer to say it as "ess-que-ell". See
https://softwareengineering.stackexchange.com/questions/8588/whats-the-history-of-the-non-official-pronunciation-of-sql
for a good debate.
2 https://www.w3schools.com/sql/
3 https://www.tutorialspoint.com/sql/index.htm
4 http://www.sqltutorial.org/
5 https://lagunita.stanford.edu/courses/Engineering/db/2014_1/about
SELECT email_address
FROM email
If we only desired the email addresses of the student with id = 001, then we
would add a condition using the WHERE clause:
SELECT email_address
FROM email
WHERE student_id = '001'
Queries of this basic form are used by database programmers numerous times
a day in order to obtain needed information from large databases. Many more
features can be added to this basic query form, such as sorting results, grouping
and aggregating rows, and joining multiple tables.
Next we will go through an example using SQLite to detail exactly how queries
on a sample database are performed.
library(dplyr)
library(dbplyr)
library(RSQLite)
We next use the DBI package to connect directly to the database. DBI is
a backend package that provides a common interface for R to work with
many different database management systems using the same code. This
package does much of the communication from R to the database that occurs
behind the scenes, and is an essential part of using dplyr to work with
databases. If you have not yet installed the DBI package, you can do so using
install.packages("DBI").
library(DBI)
chinook <- dbConnect(SQLite(), "chinook.db")
src_dbi(chinook)
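The queries below operate on a reference to a table in the database rather than on a data frame pulled into memory. That reference is not created in the excerpt above; it would be made with something along these lines (a sketch, assuming the table is named "employees"):
# Create a lazy reference to the employees table in the chinook database.
employees <- tbl(chinook, "employees")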
employees %>%
  select(LastName, FirstName, Phone, Email) %>%
  arrange(LastName)
employees %>%
  filter(Title == "Sales Support Agent") %>%
  select(LastName, FirstName, Address) %>%
  arrange(LastName)
# A tibble: 3 x 3
  LastName FirstName Address
  <chr>    <chr>     <chr>
1 Johnson  Steve     7727B 41 Ave
2 Park     Margaret  683 10 Street SW
3 Peacock  Jane      1111 6 Ave SW
employees %>%
  group_by(ReportsTo) %>%
  summarize(numberAtLocation = n())
As we’ve seen, for simple tasks, and even many complex tasks, dplyr syntax
can be used to query external databases.
13.3.2 dbplot
If we can use dplyr to analyze the data in a database, you may be wondering
whether or not we can use ggplot2 to graph the data in the database. Of course
we can! In fact, the package dbplot is designed to process the calculations of
a plot inside a database and output a ggplot2 object. If not already installed
on your system, make sure to install the dbplot package before continuing.
We can use the same chinook database from SQLite we were using above.
Suppose we desire to see how many types of each employee there are in the
database. We can produce a barplot to show this.
library(dbplot)
employees %>%
dbplot_bar(Title)
[Bar plot of the number of employees (n()) holding each Title.]
We first load the dbplot package. Next we produce the bar plot. Notice that we
can continue to use the convenient %>% operator while producing graphs using
dbplot, making the code easy to read. Since dbplot outputs a ggplot2 object,
we can further customize the graph using familiar functions from ggplot2 (so
long as the package is loaded).
library(ggplot2)
employees %>%
dbplot_bar(Title) +
labs(title = "Employee Types") +
ylab("Count") +
theme_classic()
[Bar plot of employee counts by title, now titled "Employee Types", with Count on the y-axis.]
For a second example, we will utilize a database from Google BigQuery, Google’s
fully managed low cost analytics data warehouse. The package bigrquery
provides an R interface to Google BigQuery. If not already installed, install
the package with our usual method.
We will use data from a great set of sample tables that BigQuery provides to
its users. First we will look at the shakespeare table that contains a word
index of the different works of Shakespeare. Thus, given the data in the table,
let’s use bigrquery, DBI, and dplyr to determine the ten words that appear
most often in Shakespeare’s works.
library(bigrquery)
library(DBI)
library(dplyr)
billing <- "for875-databases"
con <- dbConnect(
bigquery(),
project = "publicdata",
dataset = "samples",
billing = billing
)
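The code that creates the shakespeare reference and builds the query is not reproduced in this excerpt; a sketch of what it likely looks like (the table name "shakespeare" and the columns word and word_count are assumptions based on BigQuery's public samples dataset) is:
# Reference the shakespeare table, then build a lazy dplyr query that is only
# executed against BigQuery when collect() is called.
shakespeare <- tbl(con, "shakespeare")

shakespeare %>%
  group_by(word) %>%
  summarize(count = sum(word_count, na.rm = TRUE)) %>%
  arrange(desc(count)) %>%
  head(10) %>%
  collect()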
We create a reference to the shakespeare table with the tbl() function as shown
above. We can then use dplyr like we did previously to
group the data by words, determine the count of each word, and then order it
in decreasing order to obtain the top ten words in Shakespeare’s works. Notice
that when you run this code you may obtain a message in red describing the
total amount of bytes billed. This is how Google manages how much its users
are querying and using its data. We’re just tinkering around here, so we won’t
surpass the monthly limit on free queries.
This code again displays the benefits of the “lazy evaluation” that R employs.
The con variable and shakespeare variable do not store the database or the
specific table in R itself; instead, they serve as references to the database
where the specific data they are referencing is stored. Only when the dplyr
query is written and called using the collect() function is any data from the
database actually brought into R. R waits till the last possible second (i.e. is
lazy) to perform any computations.
A second table in the publicdata project on Google BigQuery contains weather
data from NOAA ranging from the years 1929 to 2010. Let’s first make a
connection to this table and explore its structure.
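The connection code itself is not included here; assuming the NOAA table in the samples dataset is named gsod, it would look something like this:
# Create a lazy reference to the NOAA weather table and list its columns.
weather <- tbl(con, "gsod")
colnames(weather)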
[1] "station_number"
[2] "wban_number"
[3] "year"
[4] "month"
[5] "day"
[6] "mean_temp"
[7] "num_mean_temp_samples"
[8] "mean_dew_point"
[9] "num_mean_dew_point_samples"
[10] "mean_sealevel_pressure"
[11] "num_mean_sealevel_pressure_samples"
[12] "mean_station_pressure"
[13] "num_mean_station_pressure_samples"
[14] "mean_visibility"
[15] "num_mean_visibility_samples"
[16] "mean_wind_speed"
[17] "num_mean_wind_speed_samples"
[18] "max_sustained_wind_speed"
[19] "max_gust_wind_speed"
[20] "max_temperature"
[21] "max_temperature_explicit"
[22] "min_temperature"
[23] "min_temperature_explicit"
[24] "total_precipitation"
[25] "snow_depth"
[26] "fog"
[27] "rain"
[28] "snow"
[29] "hail"
[30] "thunder"
[31] "tornado"
weather %>%
select(thunder, mean_wind_speed) %>%
head(10)
weather %>%
filter(mean_temp > 60 & mean_wind_speed > 10) %>%
summarize(count = n())
1 6581146
Further, suppose we are interested in determining how the average wind speed
has changed from 1929 to 2010. We can use dbplot to plot the data.
weather %>%
dbplot_line(year, average_wind_speed = mean(mean_wind_speed, na.rm = TRUE))
[Line plot of average_wind_speed by year, ranging roughly between 10 and 12.]
Upon first glance at this plot, a naive student might think "Wow! Wind speeds
have decreased dramatically since the 1930s!" and just accept this as true
since that is what the graph shows. But since we are all data analysts at heart
(otherwise you wouldn't be taking this course!), we want to explore this further.
We might ask why the wind was so high in the late 1920s and early 1930s.
An average wind speed of above 12 miles per hour seems pretty high. Let’s
explore the entries for all years in the 1930s and 1920s by looking at how many
records there are with a mean wind speed for each year before 1940.
weather %>%
filter(year < 1940 & !is.na(mean_wind_speed)) %>%
group_by(year) %>%
summarize(count = n())
   year count
  <int> <int>
1 1931 9726
2 1934 20334
3 1935 26829
4 1939 65623
5 1932 10751
6 1936 50514
7 1937 83310
8 1929 2037
9 1930 7101
10 1933 17708
# ... with more rows
Interesting! We see that there are only 2037 records from 1929, while there are
65623 records from 1939. This suggests we could be experiencing a phenomenon
known as sampling bias in which the 2037 records from 1929 are not a valid
random representation of all the wind speeds occurring in 1929. Or potentially
the wind was just much higher during that time due to other factors we aren’t
exploring here. Determining the true cause of this pattern requires further
investigation, which we leave to you if you so desire.
13.7 Exercises
Exercise Databases Learning objectives: connect to an external database;
perform simple queries using dplyr; use data from a database in a Shiny app;
learn how to perform changes (update/delete/insert) on a database
8 https://db.rstudio.com/rstudio/connections/
9 http://db.rstudio.com/
14
Digital Signal Processing
In today’s technology age, signals are pervasive throughout most of the world.
Humans and animals use audio signals to convey important information to
conspecifics. Airplanes use signals in the air to obtain important information
to ensure your safety on a flight. Cell phones are pretty much just one small
digital signal processing device. They process our speech when we talk into the
phone by removing background noise and echoes that would distort the clarity
of our sound. They pick up Wi-Fi signals that allow us to search the web. They
send text messages using signals. They use digital image processing to take
really good pictures. They take a video of your dog when he’s doing something
completely hilarious that you want to show your friend. The applications
of digital signal processing (DSP) span numerous different fields, such as
ecology, physics, communication, medicine, computer science, mathematics,
and electrical engineering. Thus, having a basic understanding of DSP and its
concepts can be very helpful, regardless of the field you decide to pursue.
In this chapter we will give a brief introduction to some central concepts in DSP
and how we can explore these concepts using R. We will look at applications
of DSP using some R packages designed for bioacoustic analysis. When doing
digital signal processing, R is not typically the first computer language that
comes to mind. Matlab, Octave, and C/C++ are often considered to be some
of the best languages for DSP. But R provides numerous different packages
(especially for working with audio data) that make it competitive with these
languages for digital signal processing. We will use the signal package to do
some simple digital signal processing, and then will use the tuneR, warbleR,
and seewave packages to work with audio data to show you more specific
examples of how R can be used in different fields of digital signal processing.
[Figure 14.1: a continuous sine curve, sin(x) plotted against x from 0 to 2π.]
Discrete Time Signals are not continuous. They are defined only at discrete
times, and thus they are represented as a sequence of numbers (so by your
experiences in R working with sequences of numbers you should already have
an idea of how discrete signals are stored, how they can be manipulated,
etc.). Figure 14.2 is an example of the exact same sine curve as in Figure 14.1,
but in a discrete form. Notice how there are only values at specific points,
and anywhere besides those points the signal is not defined. In a real-world
application, these points would correspond to measurements of the signal.
FIGURE 14.2: Example of a discrete signal: a sine curve with some missing
values
Notice the differences between the two graphs. The second graph is not defined
at every point and is only defined at the points where a data point exists.
This type of sampling will become important later in the chapter and in the
exercise when we begin to work with sound data (which by the way takes the
form of these simple sine curves we are working with here). There are many
other shapes and forms of signals, but since we will later be focusing on audio
data, and the sine curve is fundamental to any sort of sound, we for now will
focus on working with a sine curve.
Signals have a wide variety of properties and characteristics that give each
signal distinct behavior. The sine curve has two important characteristics.
First, a sine curve is said to be odd since it satisfies the condition sin(−t) =
−sin(t). This is easy to recognize by looking at the graph of a sine curve, which
is symmetric about the origin. In addition, a sine
curve is a periodic signal. A periodic signal is a signal that repeats itself after
a certain interval of time (called the Period). This is true of all real sine curves.
Understanding the simple properties of signals like we have done with the sine
curve is a useful tool in digital signal processing as it allows you to recognize
simple patterns that may occur with the signal in which you are interested.
Now that we have a general understanding of what a signal is and how we
can use it’s properties to learn about it’s behavior, let’s now focus on some
of the most important concepts in digital signal processing, and how we can
implement them in R. Three of the most central topics in DSP are noise,
filters, and Fourier transforms.
We will first briefly discuss working with noise and filters in R, and then we
will work with Fourier Transforms in the context of audio data.
Translations of the standard sine wave sin(θ) take the general form
a sin(b(θ − c)) + d. Here is a quick summary of the terms in this equation and
their definitions: a is the amplitude (the height of the wave), b controls the
frequency (and hence the period), c is the phase shift (a horizontal translation),
and d is the vertical shift.
Next we draw a few plots to show graphically how all these parameters influence
the signal. We will graph these as digital signals.
library(dplyr)
library(ggplot2)
x <- seq(0, 2 * pi, .1)
standard <- sin(x)
altered <- 2*sin(3 * (x - 1)) + 4
graphData <- tibble(x, standard, altered)
ggplot(data = graphData, mapping = aes(x = x)) +
geom_point(mapping = aes(y = standard, color = "sin(x)")) +
geom_point(mapping = aes(y = altered, color = "2*sin(3 * (x - 1)) + 4")) +
scale_color_manual("", breaks = c("sin(x)", "2*sin(3 * (x - 1)) + 4"),
values = c("blue", "red")) +
labs(x = "x", y ="")
[Plot of the two digital signals for x from 0 to 2π: the standard sin(x) and the altered 2*sin(3 * (x - 1)) + 4.]
Now that we understand the fundamentals of sinusoids, let’s see how we can
work with noise and filters on sinusoids using R.
library(signal)
t <- seq(0, 1, len = 100)
sig <- sin(2 * pi * t)
ggplot(mapping = aes(t, sig)) +
geom_line()
[Line plot of the clean signal sig against t.]
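The code that adds noise to sig is not included in this excerpt; one way to create such a noisy signal (the noise level here is an assumption for illustration) is:
# Add Gaussian noise to the clean sinusoid to create the noisy signal used below.
set.seed(42)
noisySig <- sig + rnorm(length(sig), sd = 0.25)

ggplot(mapping = aes(t, noisySig)) +
  geom_line()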
[Line plot of the noisy signal noisySig against t, ranging roughly from -1.5 to 1.5.]
library(dplyr)
butterFilter <- butter(3, 0.1)
recoveredSig <- signal::filter(butterFilter, noisySig)
allSignals <- data.frame(t, sig, noisySig, recoveredSig)
ggplot(allSignals, aes(t)) +
  geom_line(aes(y = noisySig, color = "Noisy")) +
  geom_line(aes(y = sig, color = "Original")) +
  geom_line(aes(y = recoveredSig, color = "Recovered")) +
  labs(x = "Time", y = "Signal")
[Line plot overlaying the Noisy, Original, and Recovered signals against Time.]
You can see that the recovered signal is not perfect, as there is still some noise
in the signal, and the timing of the peaks in the signal is not exactly matched
up with the original. But it is clear that we have nonetheless removed a large
portion of the noise from the noisy signal. Mess around with the argument
values in the butter() function in the above code. See what changing the
parameters does to the recovered graph. You can also explore the wide variety
of filter functions available in the signal package by exploring the package
documentation here1 .
For a second example, suppose we are interested in not only the signal, but
also the noise. We will extract the low and high frequencies from a sample
signal. Let’s start with a noisy signal
t <- 1:500
cleanSignal <- 50 * sin(t * 4 * pi/length(t))
noise <- 50 * 1/12 * sin(t * 4 * pi/length(t) * 12)
originalSignal <- cleanSignal + noise
1 https://cran.r-project.org/web/packages/signal/signal.pdf
ggplot() +
geom_line(aes(t, originalSignal))
[Line plot of originalSignal against t, oscillating between roughly -30 and 30.]
Again, you can easily recognize the pattern is sinusoidal, but there is noise
in the signal that you want to extract. Perhaps you are interested in where
the noise is coming from and want to analyze the structure of the noise to
figure this out. Thus, unlike the previous example, we want to extract the
noise from the signal and not simply eliminate it. To do this we can again use
the Butterworth filter.
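The filtering code itself does not appear in this excerpt; a sketch of how the two components might be separated (the filter order and cutoff frequency here are illustrative assumptions) is:
# Low-pass keeps the slowly varying signal; high-pass keeps the faster noise.
lowButter <- butter(2, 0.02, type = "low")
low <- as.numeric(signal::filter(lowButter, originalSignal))

highButter <- butter(2, 0.02, type = "high")
high <- as.numeric(signal::filter(highButter, originalSignal))

signals <- data.frame(t, originalSignal, low, high)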
ggplot(signals, aes(t)) +
geom_line(aes(y = originalSignal, color = "Original")) +
geom_line(aes(y = low, color = "Signal")) +
geom_line(aes(y = high, color = "Noise")) +
labs(x = "Time", y = "Signal")
[Line plot showing the Original signal together with its low-frequency (Signal) and high-frequency (Noise) components against Time.]
The frequency resolution of the transform is controlled by the window
length. Increasing the window size will increase
frequency resolution, but this will also make the transform less accurate in
terms of time as more signal will be selected for the transform. Thus there
is a constant level of uncertainty between the time of the signal and the
frequency of the signal2 . Different windows can be used in a DFT depending
on the application. There are numerous windows available within the seewave
package (see Figure 14.3).
[Figure 14.3: the shapes of the window functions available in seewave: hamming, bartlett, blackman, flattop, hanning, and rectangle.]
In a spectrogram, successive DFTs are plotted against time, with the relative
amplitude of each sine function
of each DFT represented by some color scale. In other words, the spectrogram
has an x-axis of time, a y-axis of frequency, and the amplitude represented by
the color scheme used inside the plot (which is usually included in a scale).
Let’s use a sample audio file provided by the seewave package and produce a
spectrogram of it using the spectro function. R provides multiple functions
like spectro that make the creation of spectrograms and Fourier Transforms
extremely easy and logical, which is one of the many reasons to perform digital
signal processing within the R environment.
library(seewave)
data(tico)
spectro(tico)
[Spectrogram of the tico recording: Time (s) on the x-axis from 0 to about 1.5, Frequency (kHz) on the y-axis from 0 to 10, and relative Amplitude (dB, 0 down to -30) shown on a color scale.]
Let’s explore more of the parameters for this function to get a better un-
derstanding of how it is working. Type in ?spectro in the R console. The
spectro() function contains a lot of different arguments that allow you to
control how the spectrogram is created. The f argument allows you to specify
the sampling rate of your sound object. The wl option allows you to control
the window length that is used when the successive DFTs are produced (aka
the STFT). You can choose which window you want to use with the wn option.
You can also specify the amount of ovlp between two successive windows with
the ovlp argument. Let’s use some of these arguments and compare it to the
standard spectrogram we previously produced.
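The call that produced the second spectrogram is not shown; one possibility, with argument values chosen purely for illustration, is:
# A longer window (wl) improves frequency resolution, a large overlap (ovlp)
# smooths the time axis, and wn selects the window shape.
spectro(tico, f = 22050, wl = 1024, wn = "hamming", ovlp = 85)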
[Spectrogram of the tico recording produced with the modified arguments; same axes and color scale as above.]
You can immediately see that there are some slight differences between the
spectrograms. The first spectrogram appears to be more precise than the second
spectrogram. Explore these parameters yourself to get a better understanding
of how these different properties impact the short-time fourier transform and
the spectrogram.
One cool thing about the spectro() function is that it produces a graph,
but it also stores the values along the three scales. If you save the call to a
spectro() function, that variable will be a list containing the values along
the time axis, values along the frequency axis, and the amplitude values of the
FFT decompositions. These values can be accessed in familiar ways ($time,
$freq, $amp).
Now that you have a visualization of the sound, we will use the tuneR package to
play the sound so you can see exactly how it matches up with the spectrogram.
First install the tuneR package, which is another package great for working
with sound and signals in R. Then use the play function to play back the
tico sound. Note that the second argument in the play function is the player
argument, which may or may not need to be defined depending on your
operating system and what audio players are installed on your system. In my
case, I am playing the sound through the aplay player on a machine running
Linux.
library(tuneR)
play(tico, "aplay")
Now that we’ve explored how we visualize audio signals with Fourier Transforms
and spectrograms, let’s now explore some of the packages in R used explicitly
for bioacoustic analysis of audio signals. The warbleR package was written to
streamline bioacoustic research inside of R, and it provides numerous different
functions for obtaining, manipulating, and visualizing bioacoustic data (and
especially bioacoustic data of birds). We are going to go through an example
to display some of the many useful functions in the warbleR package. Our
goal is to analyze the recordings of Cedar Waxwings from the xeno-canto
database to determine whether or not there is variation in these calls across
large geographical distances. Along the way we will showcase many useful
functions in the warbleR package, and will compute numerous measures that
give us useful information about the signal. Let’s get started!
First off, create a new directory where you want to store the data and change
the working directory to that folder. We will produce a lot of files in the
following exercise, so it is best to use a new directory to maintain organization.
Here is the directory we use; you should change yours accordingly:
setwd("/home/jeffdoser/Dropbox/teaching/for875-2020/textbook/")
We will extract sounds from a large and very common database in the field
of bioacoustics, xeno-canto. warbleR allows you to easily query the database
and work with the sounds maintained in their library. We are interested in the
sounds of Cedar Waxwings, which just so happen to be my favorite species of
bird. They are small, beautiful, majestic birds that are fairly common in rural
and suburban areas (Figure 14.4). They also have a very large geographical
region, spanning across all three countries in North America.
The warbleR package allows us to first query the database for the recordings
of Cedar Waxwings (Bombycilla cedrorum) without downloading them.
library(warbleR)
cedarWax <- quer_xc(qword = "Bombycilla cedrorum", download = FALSE)
FIGURE 14.4: Two cedar waxwings, possibly pondering whether or not Ross
and Rachel will ever get back together (oh wait no that’s me as I write this
chapter between binge watching Friends)
names(cedarWax)
[37] "Other_species8"
We use the quer_xc function to query the database, and use the restriction
of the scientific name for Cedar Waxwings so we only obtain recordings of
our desired species. We see that this returns a data frame with a lot of useful
information, such as the latitude and longitude of the recording, the date, the
quality of the recording, and the type of vocalization (since birds have multiple
different types of vocalizations).
Now we produce a map showing the locations of all of the recordings. The
xc_maps function allows you to either save a map as an image to your hard
drive, or you can simply load the image in the R environment. We will load
the image in the R environment by setting the img argument to FALSE.
The recordings are mostly in the United States, but you can see there are
recordings in Mexico and Canada as well. Now that we have an idea of where
these recordings were taken, let’s look at what types of recordings we have.
We will do this using the table() function.
table(cedarWax$Vocalization_type)
4 Sound files are pretty large so working with these files using the warbleR package can
be time consuming.
Downloading files...
double-checking downloaded files
warbleR and seewave are designed to work with wav files while the xeno-canto
database stores its recordings as mp3 files. Luckily there is a very simple
function mp32wav that converts all the mp3 files in the working directory to
wav files. We then remove all the mp3 files from the current directory using
the system function.
mp32wav()
system("rm *.mp3")
warbleR gives us many easy ways to produce spectrograms for all of our desired
sound files at once. To do this, we use the lspec function, which produces
image files with spectrograms of whole sound files split into multiple rows.
Now look in your current directory and you should see .tiff files for all of
the sound files. These files could be used to visually inspect sound files and
eliminate undesired files as a result of noise, length, or other variables you are
not interested in. But for this example we will use all six recordings.
Remember that our ultimate goal is to determine whether or not Cedar
Waxwing calls show some sort of variation across geographical distance. We
need to produce some acoustic measures to compare the different recordings.
First, we want to perform signal detection. In other words, we want to find the
exact parts in the recordings where a cedar waxwing is performing its call. We
only want to determine the acoustic parameters during these times, as this will
eliminate noise from having any impact on the parameters. We can perform
signal detection using the autodetec function. This function automatically
detects the start and end of vocalizations based on amplitude, duration, and
frequency range attributes. In addition, autodetec has two types of output:
autodetec has a TON of parameters that you can use to help specify exactly
what type of signal you are looking for in your data. To help us figure out
what parameters we want to use, let’s first look at an example spectrogram of
one of the recordings (Figure 14.5).
The calls appear to be between 4-10 kHz in frequency. In this sample they are
quite short, with none lasting longer than a second and most being less than
half a second. Let’s try these two parameters along with a few others and see
what we get. Let’s first do this on just one of the recordings so we don’t waste
computing power if we have to change our parameters.
knitr::include_graphics("figures/sampleAutodetec.jpg")
Well it looks like we’ve done a pretty good job! There are a couple signals we
are not detecting, but we are not getting any false positives (detecting signals
that aren’t actually signals) so we will continue on with these parameters (feel
free to refine them as much as you want). Let’s run the autodetec function
for all the recordings and save it as a variable. Notice that we switched the
img argument to FALSE since we don’t need to produce an image for every
recording. In addition, we remove all null values as this would correspond to
the rare situation in which the autodetec function does not detect a signal in
a sound file.
waxwingSignals <- autodetec(flist = wavs, bp = c(4, 10), threshold = 10,
                            mindur = 0.05, maxdur = 0.5, ssmooth = 800,
                            ls = TRUE, res = 100, flim = c(1, 12), set = TRUE,
                            sxrow = 6, rows = 15, redo = TRUE, it = "tiff",
                            img = FALSE, smadj = "end")
waxwingSignals[is.na(waxwingSignals)] <- 0
Now that we have the locations of all the signals in the recordings, we can
begin to compute acoustic parameters. The warbleR package provides the
fantastic specan function that allows us to calculate 22 acoustic parameters
at once through a batch process. This is much more efficient than computing
all these parameters separately. specan uses the time coordinates from the
autodetec function that we saved in our waxwingSignals variable. It then
computes these acoustic parameters for the recordings, but only during the
times when a signal is being performed.
Joining, by = "Recording_ID"
To compare the recordings, we will summarize the acoustic parameters using
Principal Components Analysis (PCA), a statistical dimension reduction
technique that can be used to find latent variables that explain large amounts
of variation in a data set.
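The PCA code is not reproduced in this excerpt; a minimal sketch of the general approach (the structure of finalData is assumed) is:
# Run PCA on the numeric acoustic measurements, dropping the ID column and
# scaling each variable, then keep the scores on the first two components.
acousticMeasures <- finalData %>%
  select(where(is.numeric)) %>%
  select(-any_of("Recording_ID"))
wax.pca <- prcomp(acousticMeasures, center = TRUE, scale. = TRUE)
pcaScores <- as.data.frame(wax.pca$x[, 1:2])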
[Scatter plot of the first two principal components (PC2 against PC1), with points identified by finalData$Recording_ID (313682, 313683, 313684, 321881, 329907, 361006) and colored by finalData$Country (Canada, Mexico, United States).]
From the graph, we see that two birds from the US form a group with a bird
from Mexico, and a bird from the US forms a group with two birds from
Canada. Depending on where exactly the birds from the US are located, this
could potentially suggest some variation across geographical distance. We leave
this to you to explore further if you so desire. But for now, we see that the
warbleR package has some fantastic tools for manipulating and working with
audio signals within the R environment.
14.5 Exercises
Exercise DSP Learning objectives: download, manipulate, and play sound
files using R, produce spectrograms of audio files, use filters, use acoustic
indices to analyze western New York soundscape data.
15
Graphics in R Part 2: graphics
Begin by considering a simple and classic data set sometimes called Fisher’s
Iris Data. These data are available in R.
data(iris)
str(iris)
[Two scatter plots of iris$Sepal.Width against iris$Sepal.Length; in the second, the points are filled (pch = 19) and colored by species.]
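The plotting calls themselves are not included above; based on the description that follows, they were presumably along these lines:
# A basic scatter plot, then the same plot with filled points colored by species.
plot(iris$Sepal.Length, iris$Sepal.Width)
plot(iris$Sepal.Length, iris$Sepal.Width, pch = 19, col = iris$Species)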
The pch = argument is used to control the type of symbol used to represent the
data points. pch = 19 specifies a filled in circle for each data point. You can
mess around with different numbers and you’ll see a wide variety of options. The
col = iris$Species argument tells the plot() function to assign a different
color to each unique value in the iris$Species vector.1 Notice however
that there is no legend produced by default, so we don’t know which color
represents each species. We can add a legend using the legend() function.
1 To see the order in which R assigns colors to different values, run the palette() function
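A sketch of such a legend call (the placement is illustrative):
plot(iris$Sepal.Length, iris$Sepal.Width, pch = 19, col = iris$Species)
# Colors 1:3 follow the default palette order used for the three species levels.
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)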
[Scatter plot of iris$Sepal.Width against iris$Sepal.Length, colored by species, with a legend identifying setosa, versicolor, and virginica.]
It may seem like there are a lot of arguments to specify to produce a legend
and a simple scatter plot, but this also means we have a lot of control over the
graph, which is very useful when you are trying to develop a publication-quality
figure for your own work.
Now perhaps we want to use different shapes as well for the different species.
To do this we assign different values for the pch argument to each species, and
subsequently change the pch argument in the legend as well.
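A sketch of the idea (the specific symbol numbers are chosen for illustration):
# Give each species its own plotting symbol and mirror the mapping in the legend.
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = iris$Species, pch = c(19, 17, 15)[as.numeric(iris$Species)])
legend("topright", legend = levels(iris$Species),
       col = 1:3, pch = c(19, 17, 15))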
[The same scatter plot with a different plotting symbol as well as a different color for each species, and a matching legend.]
Next let’s add a fitted least squares line to the scatter plot. This can be done
pretty easily using the abline() function to add a straight ine to the curve,
and the lm() function to actually compute the least squares line
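For example, a sketch of such a call:
plot(iris$Sepal.Length, iris$Sepal.Width, pch = 19, col = iris$Species)
# abline() accepts a fitted lm object and draws the corresponding line.
abline(lm(Sepal.Width ~ Sepal.Length, data = iris))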
[The scatter plot with a single fitted least squares line added.]
For the iris data, it probably makes more sense to fit separate lines by species.
Below we compute the linear model separately for each species, then use the
abline() function to add each line to the graph. Note our use of logical
subsetting to obtain separate data frames for each species.
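A sketch of this approach:
plot(iris$Sepal.Length, iris$Sepal.Width, pch = 19, col = iris$Species)
# Logical subsetting gives one data frame per species; fit and draw each line.
setosa     <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]
virginica  <- iris[iris$Species == "virginica", ]
abline(lm(Sepal.Width ~ Sepal.Length, data = setosa), col = 1)
abline(lm(Sepal.Width ~ Sepal.Length, data = versicolor), col = 2)
abline(lm(Sepal.Width ~ Sepal.Length, data = virginica), col = 3)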
[The scatter plot with a separate fitted least squares line for each species.]
Note that we could also do this using a for loop, which you will learn about
soon in Chapter 7.
[The same scatter plot, with the species-specific lines added via a for loop.]
2 The “b” stands for “both”. By default, the type argument is set to “p” for “points”. You
can also specify that you only want lines using type = 'l'.
[Scatter plot in which all points are connected by lines, regardless of species.]
Now we attempt to only connect the points within a given species. Doing this
requires the use of the lines() function to draw three separate lines, one for
each species.3 First we create a plot for the first species, and then subsequently
use the lines() function to add each of the remaining two species to the
plot. Notice that there are some additional arguments in the original plot()
function that we have yet to cover. Don’t worry, we’ll talk about all this fun
stuff in the next section :)
3 Note that we could again perhaps simplify this by using a for loop, but we'll restrain ourselves here.
[Scatter plot with axis labels Sepal Length and Sepal Width, where points are connected by lines only within each species.]
[Scatter plot of the crime data, with crime$burglary on the y-axis.]
15.2.1 Labels
By default, axis labels are the values of the x and y values you supply to the
plot() function. For making a publication-quality figure, we often want to
further customize these labels, as well as potentially add a title to the graph.
We can do this using the xlab, ylab, and main arguments in the plot()
function.
[The same scatter plot with customized axis labels and a title.]
Perhaps it may be easier to read the values on the y-axis if they were oriented
horizontally instead of vertically. We can do this using the las = 1 argument.
[The plot redrawn with the y-axis values oriented horizontally.]
Now we eliminate the axes and box automatically drawn by the plot()
function, and subsequently add our own axes using the axis() function. We
use most of the defaults provided by the axis() function, however, looking at
the help page for axis() reveals a large amount of possibilities for customizing
the axes in whatever way you desire.
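As a small self-contained illustration (using made-up vectors rather than the book's crime data):
x <- seq(0, 10, by = 0.5)
y <- x^2
plot(x, y, pch = 19, axes = FALSE, xlab = "x", ylab = "y")
axis(1)           # draw the x-axis with default tick marks
axis(2, las = 1)  # draw the y-axis with horizontal tick labels
box()             # redraw the border that axes = FALSE suppressed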
[The plot redrawn with custom axes added via the axis() function.]
Next we make point size proportional to population, change the color, and add
a state label. First we make point size proportional to population size using
the symbols() function. We can control the color of the points using the bg
argument in the symbols() function. We then use the text() function to add
labels to each point.
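A sketch of this approach; the crime data column names used here (population, motor_theft, state) are hypothetical stand-ins:
# Circle radii proportional to sqrt(population) so that circle area tracks
# population size; bg fills the circles, and text() adds the state labels.
symbols(crime$motor_theft, crime$burglary,
        circles = sqrt(crime$population), inches = 0.35, bg = "lightblue",
        xlab = "Motor vehicle theft", ylab = "Burglary")
text(crime$motor_theft, crime$burglary, labels = crime$state, cex = 0.7)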
[Bubble plot of the crime data: point size is proportional to state population and each point is labeled with the state name (Nevada, Arizona, Washington, ..., Maine).]
This graph is mind-numbingly busy, and is missing some notable things (e.g., a
legend). For now, we leave it up to you to add in a legend, make the graph less
cluttered, fix the cutoff labels, etc. if you so desire.
You run an experiment to see if the number of alcoholic beverages a person
has on average every day is related to weight. You collect 15 data points. Enter
these data in R by creating vectors, and then reproduce the following plot.
# of Drinks Weight
1.0 150
3.0 350
2.0 200
0.5 140
2.0 200
1.0 160
0.0 175
0.0 140
0.0 167
1.0 200
4.0 300
5.0 321
2.0 250
0.5 187
1.0 190
[Scatter plot of Weight against Number of Drinks for the 15 data points.]
15.4.1 Histograms
Time
1 28
2 26
3 33
4 24
5 34
6 -44
To produce a simple histogram of these data in the graphics package, we can
use the simple hist() function.
hist(Newcomb$Time)
[Histogram titled "Histogram of Newcomb$Time", with Newcomb$Time on the x-axis (roughly -40 to 40) and Frequency on the y-axis.]
The function automatically chooses the bin widths for the data; however, we
can easily change this ourselves to something more suitable using the breaks
argument. We also change the axis labels, y-axis label orientation, title, and
color of the bins.
hist(Newcomb$Time, breaks = 30, main = "Histogram", xlab = "Time", las = 1, col = 'blue')
[Histogram titled "Histogram" with 30 breaks, Time on the x-axis, blue bins, and horizontal y-axis labels.]
15.4.2 Boxplots
Next we consider some data from the gapminder data set to construct some
box plots. These data are available in the gapminder package, which might
need to be installed via install.packages("gapminder").
library(gapminder)
gapminder
boxplot(gdpPercap ~ continent, data = subset(gapminder, year == 2002),
        pch = 19, col = 'lightblue',  # fill color assumed; the original line was truncated
        xlab = "Continent", ylab = "GDP Per-capita")
[Boxplots of GDP Per-capita by Continent for 2002.]
Here’s the same set of boxplots, but with different colors and the boxes plotted
horizontally rather than vertically
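A sketch of such a call (the colors chosen here are illustrative):
boxplot(gdpPercap ~ continent, data = subset(gapminder, year == 2002),
        pch = 19, col = "lightgreen", border = "darkgreen",
        horizontal = TRUE, las = 1, xlab = "GDP Per-Capita", ylab = "Continent")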
[Horizontal boxplots of GDP Per-Capita by Continent.]
However, notice in the last two graphs the y-axis labels were running off
the page. Using the graphics package, we have the ability to adjust all the
dimensions and parameters of a plot by using the par function. Below we
utilize the par function to extend the amount of space in the plot on the left
side of the graph.
par(mar=c(6,4,4,2))
boxplot(gdpPercap ~ continent, data = subset(gapminder, year == 2002),
        pch = 19, border = 'blue', col = 'lightblue',  # colors chosen for illustration
        horizontal = TRUE, las = 1, xlab = "GDP Per-Capita", ylab = "")
[The horizontal boxplots redrawn with a wider margin so the continent labels fit.]
As part of a study, elementary school students were asked which was more
important to them: good grades, popularity, or athletic ability. Here is a brief
look at the data.
counts <- table(StudentGoals$Goals)
barplot(counts)
[Bar graph of the number of students choosing each goal.]
Notice that to create a basic bar graph using the graphics package, we first
take our variable of interest StudentGoals$Goals and put it in table format
15.4 Other Types of Graphics 335
using the table() function. Next we produce a stacked bar graph that also
includes the student’s gender. Then we subsequently add a side by side bar
graph that includes the student’s gender.
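A sketch of those two graphs (the Gender column name is assumed from the boy/girl legend in the figures):
# Two-way table of gender by goal: stacked bars, then side-by-side bars.
counts2 <- table(StudentGoals$Gender, StudentGoals$Goals)
barplot(counts2, legend = rownames(counts2), xlab = "Goals", ylab = "Count")
barplot(counts2, beside = TRUE, legend = rownames(counts2),
        xlab = "Goals", ylab = "Count")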
[Stacked bar graph of goals (Grades, Popular, Sports) by gender (boy, girl), followed by the side-by-side version; Count is on the y-axis.]
In this example R counted the number of students who had each goal and
used these counts as the height of the bars. Sometimes the data contain the
bar heights as a variable. For example, we create a bar graph of India’s per
capita GDP with separate bars for each year in the data4 .
4 R offers a large color palette; run colors() in the console to see a list of color names.
[Bar graph of India's per capita GDP by Year.]
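The next example plots a simple sine curve and then saves it to a file. The x and sin.x vectors it uses are not defined in this excerpt; presumably they were created along these lines:
x <- seq(-pi, pi, length.out = 100)
sin.x <- sin(x)
plot(x, sin.x, type = 'l')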
[Line plot of sin.x against x, for x from about -3 to 3.]
png("sinCurve.png")
plot(x, sin.x, type = 'l')
dev.off()
Inside the png() function, we specify the name the image is given when it is
saved. There are also arguments to specify the width and height of the saved
image. This function opens a png file where the image will be saved. There are
analogous functions for other types of files (e.g., pdf(), jpeg()). The second
line plot(x, sin.x, type = 'l') simply creates the plot we desire to save.
The dev.off() function closes the png file.
15.6 A Summary of Useful graphics Functions and Arguments
1. plot(x, y): the basic plotting function. Many of the graphs you
produce will start with the plot() function. Here are some useful
arguments (aside from x and y arguments to specify the data):
• pch: specify the type of point in a scatter plot. We often use pch = 19.
• xlab and ylab: specify the x and y axis labels.
• xlim and ylim: specify the range of the x and y axes.
• type: specify the type of the plot. By default it is 'p' (points); you can
create a line plot with type = 'l' or a plot with both points and connecting
lines with type = 'b'.
• main: add a title to the plot.
• cex, cex.lab, cex.main: change the sizes of the points, the size of the axis
label text, and the size of the title text, respectively.
• col: specify the color of the points/lines.
• axes: if TRUE, axes will be automatically drawn; if FALSE, no axes will be drawn.
2. legend(x, y, legend): add a legend to the graph specified by the
location (x, y) with the text specified by the legend argument.
3. points(x, y): add points to an already existing plot. Arguments
same as those of plot().
4. lines(x, y): add lines to an already existing plot. Arguments same
as those of plot().
5. axis(): specify the axes
6. text(x, y): add text to a plot at the location (x, y).
7. par(): function to change a wide variety of graphical parameters
for finer control over the plot’s appearance.
The functions listed above certainly do not cover the realm of plots you can
produce using the graphics package, but having a good understanding of how
to use these functions will give you a firm grip on making publication-quality
plots in R.
For additional help, we find that the help pages for most graphics functions
are quite helpful. The online documentation for the package available here5
perhaps provides a more user friendly version of the help pages within R. There
are also numerous books and online tutorials available for working with these
functions, so a simple google search is often the best way to find additional
help :)
Go to the help page for the par() function (use help(par) or ?par()). Look
at the plethora of different arguments that you can use to specialize your plot
just the way you like it. List 5 arguments that we didn’t use in this chapter,
describe what they do, then use them in a graph using the iris data set (you
can use any of the variables in the data set and produce any type of graph
you like).
5 https://rdrr.io/r/graphics/graphics-package.html#heading-0
16
Solutions to Practice Problems
2.3.5
4.1.3
tree.sp[length(tree.sp) - 1]
4.4.1
4.8.1
5.3.2
6.3.3 (a)
6.3.3 (b)
6.6.2
6.7.4
select(gm, contains('c'))
6.7.7
gm %>%
filter(country == 'Afghanistan') %>%
select(c("year", "lifeExp")) %>%
arrange(desc(lifeExp))
6.7.11
iris %>%
mutate(s.p.ratio = Sepal.Length / Petal.Length) %>%
group_by(Species) %>%
summarize(mean.ratio = mean(s.p.ratio)) %>%
arrange(desc(mean.ratio))
7.1.1
7.4.4
10.2.2
11.3.3
11.3.5
library(wordcloud)
md.data.frame <- as.data.frame(moby_dick_word_table)
wordcloud(md.data.frame$moby_dick, md.data.frame$Freq, max.words = 500,
          colors = rainbow(20))  # palette size assumed; the original line was truncated