ggplot2 Essentials - Sample Chapter
Donato Teutonico
Explore the full range of ggplot2 plotting capabilities to create
meaningful and spectacular graphs
Preface
As a very powerful open source language, R is rapidly becoming a standard in the
scientific community, particularly for data analysis and data visualization. This is
related mainly to the vast availability of library packages, which empower the user
to apply this software in virtually any field of scientific research. In relation to data
visualization, R provides a rich palette of tools, and among the packages available,
ggplot2 is fast becoming one of the more sophisticated and advanced packages. Its
use is constantly growing in the community of R users. This increasing interest is
particularly related to the ggplot2 capability of creating high-quality plots with a
very appealing layout and coding that is sufficiently easy to use.
As a scripting language, R may be difficult to master, but in this book, you will
find a large number of examples and tips as well as detailed explanations, which
will provide you with all the necessary tools to understand the concepts behind
ggplot2 in depth and concretely apply them to solve everyday problems related
to data visualization. You will see step-by-step descriptions, ranging from the
basic use of ggplot2 for simple plots up to more advanced and sophisticated
plots. We will also dig into controlling plot details, which will enable you to
fully customize the plot you intend to create. Finally, we will also see more
specialized applications of ggplot2, for instance, how to include map data in
plots and how to create heat maps and scatterplot matrices using additional
packages based on ggplot2.
By the end of this book, you will not only have learned how to use the full potential
of ggplot2, but you will also be able to generate publication-quality plots. Moreover,
you will also be able to use this book and its examples as a reference for daily
questions concerning the use of ggplot2 for data representation.
Chapter 6, Plot Output, shows you how to modify and organize multiple plots
after they have been created. We will see how to arrange multiple plots next to
each other and how to save plots in different file formats, both from the R
console and using scripting commands.
Chapter 7, Special Applications of ggplot2, shows you examples of special
applications of ggplot2 and of other packages based on ggplot2. We will see how
to include maps in plots and add data to such maps, and how to draw scatterplot
matrices to represent the relationships between different variables. Finally, we
will see how to create heat maps.
Graphics in R
The objective of this chapter is to provide you with a general overview of the
plotting environments in R and of the most efficient way of coding your graphs in
it. We will go through the most important Integrated Development Environments
(IDEs) available for R as well as the most important packages available for plotting
data; this will help you to get an overview of what is available in R and how those
packages compare with ggplot2. Finally, we will dig deeper into the grammar
of graphics, which represents the basic concepts on which ggplot2 was designed.
But first, let's make sure that you have a working version of R on your computer.
Packages in R
In the next few pages of this chapter, we will quickly go through the most important
visualization packages available in R, so in order to try the code, you will also
need to have additional packages as well as ggplot2 up and running in your R
installation. In the basic R installation, you will already have the graphics package
available and loaded in the session; the lattice package is already available among
the standard packages delivered with the basic installation, but it is not loaded by
default. ggplot2, on the other hand, will need to be installed. You can install and
load a package with the following code:
> install.packages("ggplot2")
> library(ggplot2)
Keep in mind that every time R is started, you will need to load the package
you need with the library(name_of_the_package) command to be able to use
the functions contained in the package. In order to get a list of all the packages
installed on your computer, you can use the call to the library() function without
arguments. If, on the other hand, you would like to have a list of the packages
currently loaded in the workspace, you can use the search() command. One more
function that can turn out to be useful when managing your library of packages is
.libPaths(), which provides you with the location of your R libraries. This function
is very useful to trace back the package libraries you are currently using, if any, in
addition to the standard library of packages, which on Windows is located by default
in a path of the kind C:/Program Files/R/R-3.1.2/library.
The following list is a short recap of the functions just discussed:
.libPaths()   # get library location
library()     # see all the packages installed
search()      # see the packages currently loaded
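As a quick check of the points above, the following sketch uses only base R functions to verify that lattice is installed but not attached until library() is called. The variable name installed is purely illustrative, and the output of .libPaths() will depend on your own installation:

```r
# Names of all the packages installed in the current libraries
installed <- rownames(installed.packages())
"lattice" %in% installed            # lattice is delivered with a standard R installation

# graphics is attached by default; lattice is not until library() is called
"package:graphics" %in% search()
"package:lattice" %in% search()

library(lattice)
"package:lattice" %in% search()     # lattice is now attached

.libPaths()                         # where these packages are stored
```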
RStudio
RStudio (http://www.rstudio.com/) is a very nice and advanced programming
environment developed specifically for R, and this would be my recommended
choice of IDE as the R programming environment in most cases. It is available for all
the major platforms (Windows, Linux, and Mac OS X), and it can be run on a local
machine, such as your computer, or even over the Web, using RStudio Server. With
RStudio Server, you can connect a browser-based interface (the RStudio IDE) to a
version of R running on a remote Linux server.
RStudio allows you to integrate several useful functionalities, in particular if you use
R for a more complex project. The way the software interface is organized allows
you to keep an eye on the different activities you very often deal with in R, such as
working on different scripts, overviewing the installed packages, as well as having
easy access to the help pages and the plots generated. This last feature is particularly
interesting for ggplot2 since in RStudio, you will be able to easily access the history
of the plots created instead of visualizing only the last created plot, as is the case in
the default R GUI. One other very useful feature of RStudio is code completion. You
can, in fact, start typing a command, and upon pressing the Tab key, the interface will
provide you with functions matching what you have written. This feature will turn
out to be very useful in ggplot2, since you will not need to remember all the
functions, and you will also have guidance for the arguments of the functions.
In Figure 1.1, you can see a screenshot from the current version of RStudio (v 0.98.1091):
Scripting area: In this area you can open, create, and write the scripts.
Console area: This area is the actual R console in which the commands are
executed. It is possible to type commands directly here in the console or write
them in a script and then run them on the console (I would recommend the
latter option).
Visualization area: Here, you can easily load packages, open R help files,
and, even more importantly, visualize plots.
The RStudio website provides a lot of material on how to use the program, such
as manuals, tutorials, and videos, so if you are interested, refer to the website for
more details.
The grid package, on the other hand, provides an alternative set of graphics tools.
This package does not directly provide functions that generate complete plots, so it
is not frequently used directly to generate graphics, but it is used in the development
of advanced data visualization packages. Among the grid-based packages, the most
widely used are lattice and ggplot2, although they are built by implementing
different visualization approaches: Trellis plots in the case of lattice and the
grammar of graphics in the case of ggplot2. We will describe these principles
in more detail in the coming sections. A diagram representing the connections
between the tools just mentioned is shown in Figure 1.2. Just keep in mind that this
is not a complete overview of the packages available but simply a small snapshot
of the packages we will discuss. Many other packages are built on top of the tools
just mentioned, but in the following sections, we will focus on the most relevant
packages used in data visualization, namely graphics, lattice, and, of course,
ggplot2. If you would like to get a more complete overview of the graphics tools
available in R, you can have a look at the web page of the R project summarizing
such tools, http://cran.r-project.org/web/views/Graphics.html.
[Diagram: grDevices, graphics, and grid, with lattice (Trellis Graphics principle) and ggplot2 (Grammar of Graphics principles) built on grid]
Figure 1.2: This is an overview of the most widely used R packages for graphics
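To see how the packages in this overview are delivered with R, you can query their Priority field. This is only a small sketch using the base function packageDescription(); the names pkgs and priority are illustrative. Packages marked as base are part of R itself, while recommended packages, such as lattice, come with a standard installation but are not attached by default:

```r
# Packages from the overview that are delivered with a standard R installation
pkgs <- c("grDevices", "graphics", "grid", "lattice")

# "base" packages are part of R itself; "recommended" ones are shipped alongside it
priority <- sapply(pkgs, function(p) packageDescription(p)$Priority)
priority
```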
To see how plots are built in graphics, lattice, and ggplot2, we will
go through a few examples of different plots over the following pages. The objective
of these examples is not to provide an exhaustive comparison of the three
packages but simply to show how the code and the default plot layouts
differ among these plotting tools.
For these examples, we will use the Orange dataset available in R; to load it in the
workspace, simply write the following code:
> data(Orange)
This dataset contains records of the growth of orange trees. You can have a look at
the data by recalling its first lines with the following code:
> head(Orange)
You will see that the dataset contains three columns. The first one, Tree, is an ID
number indicating the tree on which the measurement was taken, while age and
circumference refer to the age in days and the size of the tree in millimeters,
respectively. If you want to have more information about this data, you can have
a look at the help page of the dataset by typing the following code:
?Orange
Here, you will find the reference of the data as well as a more detailed description of
the variables included.
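Besides head() and the help page, a few base R functions give a compact summary of the dataset. The following is a minimal sketch; the variable names n_obs and per_tree are illustrative only:

```r
data(Orange)

str(Orange)                      # column types: Tree is a factor, age and circumference are numeric

n_obs <- nrow(Orange)            # total number of measurements
per_tree <- table(Orange$Tree)   # number of measurements taken on each tree

n_obs
per_tree
```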
Let's assume that we would like to have a look at how age is related to the
circumference of the trees in our dataset Orange; we could simply plot the data on a
scatter plot using the high-level function plot() as shown in the following code:
plot(age~circumference, data=Orange)
This code creates the graph in Figure 1.3. As you may have noticed, we obtained
the graph directly with a call to a function that takes the variables to plot in the
form of y~x, and the dataset in which to find them. As an alternative, instead of using a
formula expression, you can use a direct reference to x and y, using code in the form
of plot(x,y). In this case, you will have to use a direct reference to the data instead
of using the data argument of the function. Type in the following code:
plot(Orange$circumference, Orange$age)
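If you prefer the plot(x,y) form but do not want to repeat the Orange$ prefix, one possible alternative (not part of the original examples) is the base function with(), which evaluates the call inside the data frame. The axis labels set here are illustrative choices:

```r
data(Orange)

# Formula interface: y ~ x, with the variables looked up in 'data'
plot(age ~ circumference, data = Orange)

# x, y interface: with() makes the columns of Orange directly visible
with(Orange, plot(circumference, age,
                  xlab = "circumference (mm)", ylab = "age (days)"))
```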
For the time being, we are not interested in the plot's details, such as the title or the
axes, but we will simply focus on how to add elements to the plot we just created. For
instance, if we want to include a regression line as well as a smooth line to have an
idea of the relation between the data, we should use low-level functions to add these
additional lines to the plot; this is done with the abline() and lines() functions:
plot(age~circumference, data=Orange)
abline(lm(Orange$age~Orange$circumference), col="blue")
lines(loess.smooth(Orange$circumference,Orange$age), col="red")
The graph generated as the output of this code is shown in Figure 1.4:
Figure 1.4: This is a scatterplot of the Orange data with a regression line
(in blue) and a smooth line (in red) realized with graphics
As illustrated, with this package, we have built a graph by first calling one function,
which draws the main plot frame, and then adding further elements with
other functions. With graphics, elements can only be added to the
graph; the overall plot frame defined by the plot() function cannot be changed afterward. This
ability to combine several graphical elements to create a complex plot is one of
the fundamental elements of R, and you will notice how all the different graphical
packages rely on this principle. If you are interested in getting other code examples
of plots in graphics, there is also some demo code available in R for this package, and
it can be visualized with demo(graphics).
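The step-by-step construction described above can also be sketched by storing the fitted objects first, so that they can be inspected as well as drawn. The names fit and smooth_xy and the legend are illustrative additions, not part of the original example:

```r
data(Orange)

# Compute the elements to add before drawing anything
fit <- lm(age ~ circumference, data = Orange)                 # linear regression
smooth_xy <- loess.smooth(Orange$circumference, Orange$age)   # list with x and y coordinates

# High-level call: draws the plot frame and the points
plot(age ~ circumference, data = Orange)

# Low-level calls: add elements to the existing frame
abline(fit, col = "blue")
lines(smooth_xy, col = "red")
legend("topleft", legend = c("linear regression", "loess smooth"),
       col = c("blue", "red"), lty = 1)

coef(fit)   # intercept and slope of the blue line
```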
In the coming sections, you will find a quick reference to how you can generate a
similar plot using graphics and ggplot2. As will be described in more detail later
on, in ggplot2, there are two main functions to realize plots, ggplot() and qplot().
The function qplot() is a wrapper function that is designed to easily create basic
plots with ggplot2, and its syntax is similar to that of the plot() function of graphics.
Due to its simplicity, this function is the easiest way to start working with ggplot2,
so we will use this function in the examples in the following sections. The code in
these sections also uses our example dataset Orange; in this way, you can run the
code directly on your console and see the resulting output.
}
xyplot(age~circumference, data=Orange, panel=myPanel)
Figure 1.5: This is a scatter plot of the Orange data with the regression line (in blue) and
the smooth line (in red) realized with lattice
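A plot of this kind can be sketched with a custom panel function that draws the observations and then adds the two lines. This is only a minimal sketch under the assumption that the standard lattice helpers panel.lmline() and panel.loess() are used for the regression and the smooth line; the object name p is illustrative:

```r
library(lattice)
data(Orange)

# Panel function: called by xyplot() with the x and y values of the panel
myPanel <- function(x, y) {
  panel.xyplot(x, y)                 # the observations
  panel.lmline(x, y, col = "blue")   # linear regression line
  panel.loess(x, y, col = "red")     # loess smooth line
}

p <- xyplot(age ~ circumference, data = Orange, panel = myPanel)
print(p)   # lattice plots are drawn when the object is printed
```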
As you may have noticed, setting aside the code differences, the plot generated
does not look very different from the one obtained with graphics. This is because
we are not using any special visualization feature of lattice. As mentioned earlier,
with this package, we have the option of multipanel conditioning, so let's take a look
at this. Let's assume that we want to have the same plot but for the different trees in
the dataset. Of course, in this case, you would not need the regression or the smooth
line, since there will only be one tree in each plot window, but it could be nice to
have the different observations connected. This is shown in the following code:
myPanel <- function(x,y){
  panel.xyplot(x,y, type="b") # the observations
}
xyplot(age~circumference | Tree, data=Orange, panel=myPanel)
Figure 1.6: This is a scatterplot of the Orange data realized with lattice, with one subpanel representing the
individual data of each tree. The ID number of the tree in each panel is reported in the upper part of the plot area
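When the panel only needs points connected by lines, a custom panel function is not strictly necessary, since xyplot() forwards the type argument to its default panel function. The following sketch should produce an equivalent conditioned plot; the object name p is illustrative:

```r
library(lattice)
data(Orange)

# One panel per tree; type = "b" draws both the points and the connecting lines
p <- xyplot(age ~ circumference | Tree, data = Orange, type = "b")
print(p)
```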
As illustrated, using the vertical bar |, we are able to obtain the plot conditional on
the value of the variable Tree. In the upper part of the panels, you will notice the
reference to the value of the conditioning variable, which, in this case, is the column
Tree. As mentioned before, ggplot2 offers this option too; we will see one example
of that in the next section.
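As a brief preview, conditioning on Tree in ggplot2 can be sketched with faceting. This is a minimal sketch assuming that facet_wrap() is used to split the panels; the object name p is illustrative:

```r
library(ggplot2)
data(Orange)

# One panel per tree, with the observations drawn as points connected by lines
p <- ggplot(data = Orange, aes(x = circumference, y = age)) +
  geom_point() +
  geom_line() +
  facet_wrap(~ Tree)
print(p)
```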
In the next section, you will find a quick reference to how to convert a typical plot
type from lattice to ggplot2. In this case, the examples are adapted to the typical
plotting style of the lattice plots.
In ggplot2, there are two main high-level functions capable of directly creating
a plot, qplot() and ggplot(); qplot() stands for quick plot, and it is a simple
function that serves a purpose similar to that served by the plot() function in
graphics. The ggplot() function, on the other hand, is a much more advanced
function that allows the user to have more control of the plot layout and details. In
our journey into the world of ggplot2, we will see some examples of qplot(), in
particular when we go through the different kinds of graphs, but we will dig a lot
deeper into ggplot() since this function is better suited to advanced examples.
If you have a look at the different forums on R programming, there is quite a
bit of discussion as to which of these two functions is more convenient to use.
My general recommendation is that it depends on the type of graph you
draw most frequently. For simple and standard plots, where only the data needs to
be represented and only minor modifications of the standard layouts are required,
the qplot() function will do the job. On the other hand, if you need to apply
particular transformations to the data or if you would just like to keep the freedom of
controlling and defining the different details of the plot layout, I would recommend
that you focus on ggplot(). As you will see, the code between these functions is not
completely different since they are both based on the same underlying philosophy,
but the way in which the options are set is quite different, so if you want to adapt a
plot from one function to the other, you will essentially need to rewrite your code. If
you just want to focus on learning only one of them, I would definitely recommend
that you learn ggplot().
In the following code, you will see an example of a plot realized with ggplot2,
where you can identify some of the components of the grammar of graphics.
The example is realized with the ggplot() function, which allows a more direct
comparison with the grammar of graphics, but just after it,
you can also find the corresponding qplot() code. Both pieces of code generate the
graph depicted in Figure 1.7:
require(ggplot2)                                   ## Load ggplot2
data(Orange)
ggplot(data=Orange,                                ## Data used
       aes(x=circumference, y=age, color=Tree)) +  ## Aesthetic
  geom_point() +                                   ## Geometry
  stat_smooth(method="lm", se=FALSE)               ## Statistics

qplot(circumference, age, data=Orange,             ## Data used
      color=Tree,                                  ## Aesthetic mapping
      geom=c("point","smooth"), method="lm", se=FALSE)
This simple example can give you an idea of the role of each portion of code in a
ggplot2 graph; you have seen how the main function body creates the connection
between the data and the aesthetics we are interested in representing and how, on top
of this, you add the components of the plot; in this case, we added the geometry
element of points and the statistical element of regression. You can also notice how
the components that need to be added to the main function call are included using
the + sign. One more thing worth mentioning at this point is that if you run just the
main ggplot() call without adding any layer, you will get an error message (more
recent ggplot2 releases draw an empty plot frame instead). This is
because this call alone is not able to generate an actual plot. The step during which the plot
is actually created is when you include the geometric attribute, which, in this case,
is geom_point(). This is perfectly in line with the grammar of graphics since, as we
have seen, the geometry represents the actual connection between the data and what
is represented on the plot. This is the stage where we specify that the data should be
represented as points; before that, nothing was specified about which plot we were
interested in drawing.
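The role of each component can be made more visible by building the plot one step at a time and storing the intermediate objects. This is only a sketch; the object names base and p are illustrative, and inspecting the layers element is just one convenient way of checking what has been added so far:

```r
library(ggplot2)
data(Orange)

# Main function body: data and aesthetic mapping only, no geometry yet
base <- ggplot(data = Orange, aes(x = circumference, y = age, color = Tree))
length(base$layers)    # no layers have been added at this stage

# Adding the geometry specifies how the data should be drawn
p <- base + geom_point()
length(p$layers)       # one layer: the points

# Adding the statistics places a regression line for each tree on top
p <- p + stat_smooth(method = "lm", se = FALSE)
print(p)
```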
Figure 1.7: This is an example of plotting the Orange dataset with ggplot2
Further reading
Summary
In this chapter, we set up your installation of R and made sure that you are ready to
start creating the ggplot2 plots. You saw the different packages available to realize
plots in R and their history and relations. The graphics package is the first package
that was developed in R; it represents a simple and effective tool to realize plots.
Subsequently, the grid package was introduced with more advanced control of the
plot elements as well as more advanced graphics functionalities. Several packages
were then built on top of grid, in particular lattice and ggplot2, providing high-level functions for advanced data representation. In the next chapter, we will explore
some important plot types that can be realized with ggplot2. You will also be
introduced to faceting.
www.PacktPub.com